Publications

Capturing Ambiguity in Natural Language Understanding Tasks with Information from Internal Layers

Hancheol Park
PhD Dissertation, KAIST, 2023.

In natural language understanding (NLU) tasks, there are a large number of ambiguous samples for which the veracity of the labels is debatable among annotators. Recently, researchers have found that even when additional annotators evaluate such ambiguous samples, they tend not to converge to single gold labels. It has also been revealed that, even when the samples are assessed by different groups of annotators, the degree of ambiguity is similarly reproduced. Therefore, it is desirable for a model used in an NLU task not only to predict a label that is likely to be considered correct by multiple annotators for a given sample but also to provide information about the ambiguity, indicating whether other labels could also be correct. This becomes particularly crucial in situations where the outcomes of decision-making can lead to serious problems, as information about ambiguity can guide users to make more cautious decisions and avoid risks. In this dissertation, we discuss methods for capturing ambiguous samples in NLU tasks. Due to the inherent ambiguity in NLU tasks, numerous samples with different labels can exist among those that share similar features. Therefore, it is highly likely that the model has learned information within its internal layers about which features are associated with various labels, and consequently, whether or not they exhibit ambiguity. Based on this assumption, our investigation of the representations for samples at each internal layer has revealed that information about the ambiguity of samples is more accurately represented in lower layers. Specifically, in lower layers, ambiguous samples are represented close to samples with relevant labels in the embedding space. However, this tendency is no longer observed in the higher layers.
Based on these observations, we propose methods for capturing ambiguous samples using the distribution or representation information from lower layers of encoder-based pre-trained language models (PLMs) or decoder-based large language models (LLMs). Recently, these two types of models have been predominantly used for NLU tasks. More specifically, we introduce various approaches, including using layer pruning that removes upper layers close to the output layer to utilize information from lower layers, knowledge distillation that distills distribution knowledge from lower layers, and methods utilizing internal representations from lower layers. Through experiments with NLU datasets from various domains and tasks, we demonstrate that information from internal layers, particularly from lower layers, is valuable for capturing the ambiguity of samples. We also show that our proposed methods, which use the information from lower layers, significantly outperform existing methods.
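The idea of reading ambiguity off a label distribution can be illustrated with a minimal sketch. It assumes, purely for illustration, that a classifier head attached to a lower layer yields one logit per label and that a high-entropy distribution signals an ambiguous sample. The function names and the 0.5 threshold are hypothetical and are not taken from the dissertation.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(probs):
    """Shannon entropy scaled to [0, 1] by the maximum entropy log(K)."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))


def is_ambiguous(logits, threshold=0.5):
    """Flag a sample as ambiguous when its label distribution is spread out.

    `threshold` is an assumed, illustrative cut-off, not a value from the source.
    """
    return normalized_entropy(softmax(logits)) >= threshold


if __name__ == "__main__":
    confident = [5.0, 0.0, 0.0]   # one label clearly dominates
    spread_out = [1.0, 1.0, 0.9]  # several labels are nearly equally likely
    print(is_ambiguous(confident), is_ambiguous(spread_out))
```

Entropy is only one possible ambiguity score; the abstract also mentions layer pruning, distillation, and representation-based methods, none of which this sketch implements.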

Korean to Korean Sign Language Translation via Graph Generation

Jung-Ho Kim
PhD Dissertation, KAIST, 2022.

Sign language is a spatial and multi-channel language, but existing sign language translation (SLT) models have taken into account only the sequential information of sign language words. As a result, the translated sign language sequence loses its spatial and non-manual information and cannot fully convey the meaning of the sequence. The thesis of this dissertation is that the translation model must understand spatial and non-manual information centered around manual information to generate a complete sign language expression from a spoken sentence. To understand and generate this information, we represent a Korean Sign Language (KSL) expression as a graph and formulate SLT as a sequence-to-graph (seq2graph) learning problem. Through experiments, we analyze the strengths and weaknesses of the sequence-to-sequence (seq2seq) SLT methods and compare the performance of the seq2graph SLT method to that of seq2seq SLT methods. To compare performance under the same criteria, we propose a new metric, Sign Language Evaluation Understudy (SLEU), which measures not only sequential information accuracy but also spatial and non-manual information accuracy. In the experiments, the seq2graph SLT model performs 31.9% better than the best-performing seq2seq SLT model. In the future, we anticipate that the results of this study will be used in areas where there is a high demand for sign language interpretation by the Deaf, such as daily life conversations, broadcasting, and the Internet.

Extracting Spatial Information about Events from Text

Jin-Woo Chung
PhD Dissertation, KAIST, Feb. 2018.

Automatic extraction of spatial information about events from text plays an important role not only in the semantic interpretation of events but also in many location-based applications such as infectious disease surveillance and natural disaster monitoring. However, the fundamental limitation of previous work is its limited scope of extraction, which targets only information that is explicitly stated through predicate-argument structures. As a result, much implicit information that is inferable from context in a document is missed, amounting to nearly 40% of all location information. To overcome this limitation, we present in this dissertation an approach to recognizing the document-level relationship between events and their locations, aiming specifically at identifying an expression in text that best indicates where a given event occurs. We present a two-step approach to this problem: First, we design an annotation framework to construct a corpus annotated with the associations between event mentions and location expressions in news articles. Based on the corpus annotation and analysis, we hypothesize that coherent narratives such as news articles usually mention a series of events that occur together in a similar location. Second, we present an inference system that recognizes the associations from a given document based on this hypothesis. The system employs a multi-pass architecture where locally captured, more precise information is propagated to neighboring events through particular context. We exploit distributional similarities as key contextual information in this architecture to assess how similar two events are. The results of experiments on the annotated corpus demonstrate that the multi-pass architecture with distributional similarities is reasonably capable of capturing the document-level associations between events and locations, especially when compared with several baseline systems.
The results also show that considering multiple types of event components together in modeling event similarities leads to a better understanding of the spatial relatedness of two events than considering just a single type of component. Our system achieves F1-scores of around 0.50 across different settings, which is good performance for this challenging task considering that general state-of-the-art systems for extracting spatiotemporal relations and document-level event relations show a similar level of performance. We believe that the proposed corpus and system have good potential not only to benefit many downstream NLP tasks that involve a spatial analysis of events, but also to improve the quality of location-based applications that exploit textual documents.

Relation Information Extraction using a Comprehensive Representation Scheme: Applications to Oncology

Hee-Jin Lee
PhD Dissertation, KAIST, 2014.

Information extraction (IE) is a task of identifying relevant information from input text and producing structured data as output. While explicit expressions describing the target information are the basis for the development of IE systems, in-depth analysis of the input text becomes necessary when the information is conveyed implicitly in the text. In this dissertation, we address a specialized IE method for gene-cancer relations conveyed implicitly in biomedical text. Automatic identification of gene-cancer relations from a large volume of biomedical text is an important task for cancer research, since changes in genes are known to be the main cause of oncogenesis. In particular, it is essential to understand how a gene affects a cancer and to classify genes into oncogenes (genes that cause cancers), tumor suppressor genes (genes that protect cells from cancers) and biomarkers (genes that indicate normal or cancerous states), since such classification facilitates the process of treatment and diagnosis method development. However, despite the high volume of information on such gene classes that is conveyed implicitly with detailed descriptions about gene and cancer properties, there is not yet an IE system that is targeted at such implicit information. In this dissertation, we claim that in order to classify genes into candidates of oncogenes, tumor suppressor genes and biomarkers, gene-cancer relations described in biomedical text must be characterized with 1) how a gene changes; 2) how a cancer changes; and 3) the causality between the gene and the cancer. We propose a comprehensive representation scheme that identifies gene-cancer relations upon the three aspects above and use it for developing an advanced text mining system for oncogenes, tumor suppressor genes and biomarkers. 
The proposed representation scheme is shown to be adequate to describe the set of information that can be identified objectively from biomedical text, giving rise to an annotated corpus, CoMAGC. The mapping between the proposed representations and the gene classes is encoded into a set of inference rules, which are validated through manual annotation and comparison with other biology databases. We present an implemented IE system, OncoSearch, that automatically extracts the information as defined by the proposed scheme. Together, we anticipate that CoMAGC and OncoSearch will enable more focused research into oncology, in the face of the rapidly accumulating amount of work in the field.

Identifying Mentions about Long-term Experiences and Sentiment Change on a Specific Target based on Linguistic Analysis: Application to a Product Review Domain

Hye-Jin Min
PhD Dissertation, KAIST, 2012.

People post and share their experiences through social media on the web these days. The resulting user-generated web documents have become a useful source of advice for making a decision or resolving difficulties because people can learn from others’ past successes or failures. Recently, in response to the rapid growth of such documents and the great potential of experience-based information, research has been conducted on analyzing experiences in user-generated web documents. Earlier work has addressed the issue of distinguishing “experience sentences” from others and has proposed a discrimination method based on the linguistic properties of the mentioned events in such sentences. However, such work has focused mostly on a single event at the sentence level in large-scale data, so a meaningful series of a specific person’s experiences with a particular target has not yet been fully analyzed. This dissertation presents a method to analyze mentions about target-oriented experiences. More specifically, we propose a novel method to identify mentions about a customer’s experiences with a particular product in two aspects: long-term experiences and sentiment change in such experiences. As for long-term experiences, the hypothesis is that two types of linguistic expressions, time expressions and product names, fully capture the customer’s long-term experiences mentioned in a review. As for sentiment change, the hypothesis is that sentiment change can be determined by detecting the state in such a review in which the overall sentiment towards a product instance purchased at a certain time in the past may not be the same as the overall sentiment towards another instance purchased most recently. In this dissertation, we address three major research questions. The first question is about identifying product names. Unlike previous research on identification at the product entity level, instance-level identification for distinguishing instances must be accounted for.
Our research question is to determine the types of linguistic features that are useful for such distinction. Based on experimental results, we argue that linguistic features including time expressions, term-based features and event features should be combined differently with respect to the linguistic characteristics of the product names referring to each type of instance. More specifically, we argue that the best combination for the distinction between recent purchases and past purchases is time expressions and term-based features, and the best combination for the distinction between recent purchases and recent & past purchases is time expressions and event features. The second question is about sentiment classification regarding product names. The inherent polarity of the adjectival modifier should be blocked when it is used to refer to the property or the identity of the product. Regarding the question of determining the context in which the polarity of the adjectival modifier should be blocked, we argue that the refined blocking rules with the semantic types of nouns, verbs, and clauses based on compositionality-based syntactic rules enhance the sentiment classification performance, especially for neutral sentences. As for product name-sentiment association, we argue that comparative expressions are crucial to associating the compared target with the sentiment opposite to the one in the given grammatical structure and also argue that the product names referring to generic objects are crucial to discarding the sentiment in the given grammatical structure. The last question is about how we utilize the results from our method. As a practical application, we demonstrate a system that identifies helpful reviews by utilizing the proposed measure. The user study shows that this measure is not only as helpful as the best existing ones, such as ‘helpful vote’ or ‘reviewer rank’, but is also free from undesirable biases.
We also illustrate another application that rates product reviews with respect to sentiment change. The user study shows that the review rating system based on sentiment change is more credible than the system based on the clause-level sentiment classification.

Personal Prosody Model based Korean Emotional Speech Synthesis

Ho-Joon Lee
PhD Dissertation, KAIST, 2010.

Speech is the most basic and widely used communication method for expressing thoughts during human-human interaction and has been studied for user-friendly interfaces between humans and machines. Recent progress in speech synthesis has produced artificial vocal results with very high intelligibility, but the quality of sound and the naturalness of inflection remain major issues. Today, in addition to the need for improvement in sound quality and naturalness, there is a growing need for a method for the generation of speech with emotions to provide the required information in a natural and effective way. For this purpose, various types of emotional expression are usually transcribed first into corresponding datasets, which are then used for the modeling of each type of emotional speech. This kind of massive dataset analysis technique has improved the performance of information providing services both quantitatively and qualitatively. In this dissertation, however, I argue that this approach does not work well with interactions that are based on personal experience such as emotional speech synthesis. We know empirically that individual speakers have their own ways of expressing emotions based on their personal experience, and that massive dataset management may easily overlook these personalized and relative differences. Therefore, this dissertation examines the emotional prosody structures of four basic emotions, namely anger, fear, happiness, and sadness, by considering their personalized and relative differences. As a result, this dissertation addresses the tendency for the emotional prosody structures of pitch and speech rate to depend more on individual speakers (i.e. personal information) than intensity and pause length do. This personal information enables the modeling of relative differences of each emotional prosody structure (i.e. personal prosody model), the possibilities of which were dismissed earlier during the application of massive dataset analysis techniques.
Based on the personal prosody model, we develop a Korean emotional speech synthesis system that can add emotional information to spoken expressions. To convert an input sentence into speech, we use a commercial Korean TTS system with a female voice. The evaluation results show that we can successfully incorporate this personal information into an emotional prosody synthesis system, which improves on recent progress in the recognition rate for happiness and other emotions. We achieved a recognition rate of 48.5% for happiness among four emotions, which used to be close to the chance level. In a series of repeated perception tests supported by sufficient prior training experience, the average recognition rate improved to 95.5% for all emotions. We also show the applicability of the proposed Korean emotional speech synthesis system by implementing a speech interface for assistive robots designed for the elderly that can modify its prosodic structure according to sentence types and emotional states.

Interpretation of Natural Language Queries for Effective Data Exploration over Heterogeneous Databases: Applications to Biomedical Domain

Hodong Lee
PhD Dissertation, KAIST, 2008.

Data exploration is an essential process for discovering novel knowledge in scientific research. However, it is difficult for field experts to find the target data by exploration alone, especially when the data are scattered over multiple heterogeneous databases. Since such data are usually associated with one another, there may be appropriate sequences of searches that the field experts can use for queries to reach the target data. To support such data exploration, conventional database interfaces provide useful tools for querying with keywords or structured forms. However, we argue that they are inadequate for expressing queries for sequences of searches in multiple databases which embody diverse relations among their data. In order to describe such queries in a convenient and expressive manner, we propose to use natural language queries (NLQs) to interact with the databases. Such a database interface automatically interprets NLQs into formal language queries (FLQs) that are in turn composed of small FLQs for different databases. This task requires us to address the problem of database heterogeneity due to the differences in formal query languages, database structures, and data contents. The dissertation addresses this problem by considering NLQs as terms and syntactic relations, which respectively correspond to data objects and their operations. We utilize SQL-like expressions to coordinate such terms and syntactic relations, resulting in FLQs via a straightforward mapping process. In this work, we present a method that derives the SQL-like expressions from NLQs in a Combinatory Categorial Grammar (CCG) framework, and then translates the expressions into the locations of data objects accessible from our target databases. The method then constructs FLQs for such locations in possible sequences, taking data associations into account. Our method thus provides a fully automated way to locate and retrieve available data from databases.
We also show that our method works as a useful interface serving data exploration and integration, which helps experts discover knowledge from heterogeneous databases. As practical examples, we illustrate biomedical applications: protein-seeking for data exploration, a ubiquitin-protein ligase (E3) database for data integration, and an E3 data mining tool for further data integration.

Bidirectional Incremental Approach to Efficient Information Extraction: Applications to Biomedicine

Jung-jae Kim
PhD Dissertation, KAIST, 2006.
(Outstanding Ph.D. Dissertation Award, Aug. 2006.)

Information extraction refers to the task of extracting relevant information from texts. This dissertation aims to extract information on relations between biomedical concepts that are explicitly represented with known linguistic structures in biomedical texts. Such structures of a target relation involve a keyword and its semantic arguments, where the keyword indicates the semantic type of the target relation and the arguments indicate the related concepts. The information on relations thus has two types of locality: the information is expressed in the local context of the keyword, called spatial locality, and the keyword has well-known syntactic relations with its arguments, called structural locality. These two types of locality have in the past been handled by pattern matching and partial parsing approaches, respectively, but not at the same time. In this dissertation, we address this problem with a novel approach that searches for the arguments both bidirectionally and incrementally from the keywords. The extraction process is divided into two steps. First, it uses a non-structured pattern that describes a context between a keyword and its arguments to match an input sentence bidirectionally from the keyword. It then performs syntactic analysis incrementally on candidate arguments and, if necessary, on their sentential contexts as well, with a parser of a combinatory categorial grammar for rigorous syntactic verification of the candidates. The approach addresses the aforementioned spatial locality by utilizing non-structured patterns and the structural locality by employing a lazy evaluation parser that is customized for information extraction. The approach is highly efficient, as evidenced by experimental results, because it can stop the extraction process at any step when the syntactic analysis gives negative evidence for extracting relevant information.
We also show the applicability of the approach with two different tasks in biomedicine: biological interactions, which are useful for building up biological pathways, and protein-protein contrastive relations, which are useful for refining protein pathways.
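The two-step process described above (bidirectional matching from a keyword, then verification with early stopping) can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the dissertation's implementation: the token-window search stands in for the non-structured patterns, the `verify` callback stands in for the CCG parser, and all function and parameter names are hypothetical.

```python
def find_arguments(tokens, keyword_index, is_entity, max_distance=5):
    """Search outward from the keyword in both directions for the nearest
    candidate argument on each side (a stand-in for pattern matching)."""
    left = right = None
    for offset in range(1, max_distance + 1):
        li, ri = keyword_index - offset, keyword_index + offset
        if left is None and li >= 0 and is_entity(tokens[li]):
            left = li
        if right is None and ri < len(tokens) and is_entity(tokens[ri]):
            right = ri
        if left is not None and right is not None:
            break
    return left, right


def extract_relation(tokens, keywords, is_entity, verify):
    """Return (arg1, keyword, arg2) for the first verified relation, else None.

    `verify` is a hypothetical placeholder for syntactic verification; it is
    only called when both candidates exist, so cheap failures stop early.
    """
    for i, token in enumerate(tokens):
        if token not in keywords:
            continue
        left, right = find_arguments(tokens, i, is_entity)
        if left is None or right is None:
            continue  # negative evidence: skip the costlier verification step
        if verify(tokens, left, i, right):
            return tokens[left], token, tokens[right]
    return None


if __name__ == "__main__":
    sentence = "ProtA strongly activates the ProtB gene".split()
    result = extract_relation(
        sentence,
        keywords={"activates"},
        is_entity=lambda t: t.startswith("Prot"),  # toy entity recognizer
        verify=lambda toks, l, k, r: True,          # toy verifier, accepts all
    )
    print(result)
```

Running the cheap bidirectional search first and the expensive verification only on surviving candidates mirrors the efficiency argument in the abstract.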