Publications

MS Theses

Template-based Document Labeling for Dense Retrieval

Sukmin Cho
MS Thesis, KAIST, 2022.

Data Augmentation for Abusive Language Detection via Back-translation and Domain Knowledge

Jisu Shin
MS Thesis, KAIST, 2022.

Information Retrieval by Augmenting Document Representation

Soyeong Jeong
MS Thesis, KAIST, 2021.
One of the challenges in information retrieval (IR) is the vocabulary mismatch problem, which refers to the failure to retrieve a query-relevant document when the terms in the query and the document are lexically different but semantically similar. While recent work has tried to tackle the problem by expanding sparse representations with additional relevant terms or by embedding the representations in a learnable dense space, both the expansion and dense models generally require a large volume of labeled query-document pairs for training, and human-annotated pairs are often difficult to acquire. This thesis focuses on augmenting the document representations, either at the document text level or at the training dataset level, without requiring additional labeled query-document pairs, for both sparse and dense retrieval models. For the sparse retrieval model, we propose Unsupervised Document Expansion with Generation (UDEG), which generates diverse supplementary sentences for the original document without using labels on query-document pairs for training. When generating sentences, we further stochastically perturb their embeddings to produce more diverse sentences for document expansion. We validate UDEG on two standard IR benchmark datasets. The results show that UDEG significantly outperforms relevant expansion baselines. For the dense retrieval model, we propose Document Augmentation for dense Retrieval (DAR), which augments the document representations with interpolation and perturbation. We validate the performance of DAR on retrieval tasks with two benchmark datasets, showing that DAR significantly outperforms relevant baselines on dense retrieval of both seen and unseen documents. We believe that UDEG and DAR contribute to sparse and dense retrievers by augmenting document representations without annotating additional query-document pairs.

Detecting quoted claims and claim speakers in news articles using transformer-based language models

Eugene Jang
MS Thesis, KAIST, 2021.
Quotations in news articles have been suggested as a source of subjective or biased information. This work deals with automatically detecting the quoted claims and speakers of the claims in news articles. We suggest that quoted claims in news articles appear in predictable, but non-trivial ways. We annotate 33 articles for their quoted claims, speakers, and relations between claims and speakers. A dataset created with the presented annotation scheme is used to experiment with i) claim identification, ii) speaker identification, iii) claim-speaker relation identification, iv) claim group identification, and v) speaker group identification. The annotation and experimental results suggest that the contextual information must be taken into account in order to annotate and predict quoted claims and their speakers.

Construction of a dialogue generation model based on persuasion strategies

Junseop Ji
MS Thesis, KAIST, 2020.

Generating diverse sentential arguments on controversial topics with a memory-augmented generation model

ChaeHun Park
MS Thesis, KAIST, 2019.

Generating humorous statements for public speech with pre-trained language model and tension analyzer

Seungwon Yoon
MS Thesis, KAIST, 2019.

Interpretable Depression Detection from Social Media using Hierarchical Attention Network with Depression Indicators

Hoyun Song
MS Thesis, KAIST, 2018.
To effectively diagnose depression, which is one of the most harmful mental disorders, many researchers have used social media, analyzing differences in language use. However, detecting depression from social media poses problems such as the small proportion of posts with depression indicators and the difficulty of distinguishing depressive symptoms from temporarily depressed feelings. To address these problems, we propose hierarchical attention with depressive indicators, inspired by the process by which a person with domain knowledge diagnoses depression. Our model provides not only interpretations but also their visualizations, using weights learned through the attention mechanism. With this model, we can investigate different aspects of posts with depressive indicators based on psychological theories, which will help researchers find useful evidence of depressive characteristics.

Mitigating Stereotypes in Word Embedding through Sentiment Modulation

Huije Lee
MS Thesis, KAIST, 2018.
Word embedding is an influential framework for quantifying the meaning of a word and is widely used in machine learning as a pre-processing step for natural language processing (NLP). However, word embeddings trained on a large number of contexts encode not only the general syntactic and semantic meaning of a word but also the stereotypes and biases that people may have. This thesis proposes a method to indirectly mitigate the stereotypes in a trained word embedding by modulating the dimension of sentimental attributes in a human entity, without imposing equal probability on the compatible social groups. To prevent the word embedding from producing problematic predictions such as a stereotype threat, we modulate the strength of the association between a human entity and a sentimental attribute and thereby indirectly reduce the gender bias of the embedding model. We show that the proposed method preserves the overall embedding performance. We also confirm experimentally that increasing the strength of the association between human entities and sentimental attributes amplifies the model bias.

Using syntactic structure to extract prominent gene regulatory network from the literature

Wonsuk Yang
MS Thesis, KAIST, 2017.

Computational Identification of Sequence Variation and Environmental Condition in Clinical Depression from Biomedical Literature

Jinseon You
MS Thesis, KAIST, 2016.
Clinical depression is a complex disease known to be influenced by various factors. As genetic and environmental factors are frequently cited as the most influential causes of depression, many studies have tried to identify genes or proteins and environmental conditions associated with depression. While a number of text-mining (TM) systems that identify information about genetic factors in the biomedical literature have consequently been developed, there is currently no TM system specifically targeted at extracting environmental conditions. As a result, these TM systems provide biologists with only incomplete information about depression, which does not help them discover the etiology and treatment of depression. In this thesis, we propose a TM system that considers the interaction between genetic and environmental factors associated with depression. The system identifies not only relations between a sequence variation and depression but also changes in those relations according to environmental conditions. To develop the system, we split it into two TM subsystems. The first subsystem is applied to an existing system for extracting the relations between a sequence variation and depression from the biomedical literature, and classifies whether the relations are positive or negative at the document level. Based on a dictionary of candidate terms for environmental conditions, the second subsystem identifies the conditions in the biomedical literature containing the binary relations. Using the dependency structure of the sentence, it excludes terms wrongly classified as conditions. The system is the first TM system to consider a ternary relation among sequence variation, disease, and condition. Through the system, we are able to provide more comprehensive information about depression than other systems. We expect that, as the system is applied to other diseases, biologists will be able to easily identify diverse information associated with changes in the symptoms of diseases, including depression.

Synchronization of Non-Manual Signals in Sign Language with Sequence Prediction

Jung-Ho Kim
MS Thesis, KAIST, 2016.
There are various types of non-manual signals in sign language, which carry important linguistic information such as feeling, semantic difference, and nuance. Upon investigating the nature of non-manual signals in the Bible and literature corpus, we find that several types of non-manual signals appear on a single word. This implies the possible existence of context in signed utterances. This thesis experimentally examines the nature of non-manual signals and proposes a prediction model for the non-manual signal sequence, along with an advanced approach. The correlation between non-manual signals is measured using their co-occurrence rate. The result shows close correlations among 'Trunk', 'Head', 'Brow to Eye-gaze' and 'Mouth'. To verify the existence of the context, we propose a prediction model using conditional random fields trained on a sequence of 'gloss'-'non-manual signal' pairs, which shows superior results in comparison with a 'gloss'-'non-manual signal' dictionary-based approach. This result suggests that synchronized non-manual signals can be predicted by the proposed model when it is trained with other non-manual signals. It also means that the accuracy is expected to increase as we fine-tune such signals. As a result, all experiments show better performance when a sequence of 'Brow to Eye-gaze' is used as training data.

Mention-Level Gene Normalization on Multi-Species and Multiple Identifiers

Joon-Yeob Kim
MS Thesis, KAIST, 2014.

Generating Chatting Messages in a Consistent Style with Authorship Attribution Methods

Sang-Chae Kim
MS Thesis, KAIST, 2013.

Fairy Tale Summarization through Sentence Selection

SeungJoo An
MS Thesis, KAIST, 2012.

Identifying Sentence Types in Korean with Morpho-Syntactic Analysis

Jin-Woo Chung
MS thesis, KAIST, 2011.

Automatic Sign Language Generation Reflecting the Relationship between Entities

SangYoon Jung
MS thesis, KAIST, 2010.

Extracting Melodies from Piano Music Based on Characteristics of Music

Yoonjae Choi
MS thesis, KAIST, 2009.

Function-focused Gene Clustering by Utilizing Granularities of Gene Functions

Tak-eun Kim
MS thesis, KAIST, 2009.

Automatic Identification of the Relation between Dependency Relations and Definitions of GO Concepts

Seung-Cheol Baek
MS thesis, KAIST, 2009.

Computational Processing of Verb Agreement for Automatic Generation of Sign Language Animation

Sangha Kim
MS thesis, KAIST, 2008.

Document Similarity Assessment with Natural Language Processing: Applications to Background Music Recommendation for Blog Articles

Doojin Park
MS thesis, KAIST, 2007.

Generation of Coherent Gene Summary

Chan-Goo Kang
MS thesis, KAIST, 2006.

Identification of Emotional Flow from Natural Language Documents

Hye-Jin Min
MS thesis, KAIST, 2005.

Automated Digital Cinematography with Natural Language Processing

Semin Jang
MS thesis, KAIST, 2004.

Automatic Translation of Korean into Korean Sign Language with Combinatory Categorial Grammar

Jiwon Choi
MS thesis, KAIST, 2004.

Applications to Molecular Interactions: Customized Visualization for Knowledge Discovery with Information Extraction

Changsu Lee
MS thesis, KAIST, 2004.
(Outstanding M.S. Thesis Award, February 2004)

Anaphora Resolution for Contextually Appropriate Text Animation

Kyung Wha Hong
MS thesis, KAIST, 2004.

Integrated Morphological Analysis for Korean in a Combinatory Categorial Grammar Framework

Ho-Joon Lee
MS thesis, KAIST, 2003.

Diphone-based Intonation Generation for Korean with Combinatory Categorial Grammar

Lee Hwa Jin
MS thesis, KAIST, 2002.

Automatic Synthesis of Multimedia Tales with Combinatory Categorial Grammar

Hyun Sook Kim
MS thesis, KAIST, 2002.

Computational Processing of Honorifics in Korean with Combinatory Categorial Grammar

O Shik Kwon
MS thesis, KAIST, 2002.

Computational Processing of Floating Quantifiers in Korean with Combinatory Categorial Grammar

Jin-Bok Lee
MS thesis, KAIST, 2001.

Coordinate Constructions in Korean and Parsing Issues in Combinatory Categorial Grammar

Hyung-joon Cho
MS thesis, KAIST, 2000.