Publications | KAIST NLP*CL Lab.

Does Rationale Quality Matter? Enhancing Mental Disorder Detection via Selective Reasoning Distillation

Hoyun Song, Huije Lee, Jisu Shin, Sukmin Cho, Changgeon Ko, and Jong C. Park
Findings of The 63rd Annual Meeting of the Association for Computational Linguistics: ACL 2025, July 27-Aug 1, 2025
(accepted)

EXIT: Context-Aware Extractive Compression for Enhancing Retrieval-Augmented Generation

Taeho Hwang, Sukmin Cho, Soyeong Jeong, Hoyun Song, SeungYoon Han, and Jong C. Park
Findings of The 63rd Annual Meeting of the Association for Computational Linguistics: ACL 2025, July 27-Aug 1, 2025
(accepted)

An Efficient Sign Language Translation Using Spatial Configuration and Motion Dynamics with LLMs

Eui Jun Hwang, Sukmin Cho, Junmyeong Lee, and Jong C. Park
Proceedings of the 2025 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Apr 29-May 4, 2025

Lossless Acceleration of Large Language Models with Hierarchical Drafting based on Temporal Locality in Speculative Decoding

Sukmin Cho, SJ Choi, Taeho Hwang, Jeongyeon Seo, Soyeong Jeong, Huije Lee, Hoyun Song, Jong C. Park, and YJ Kwon
Findings of the Association for Computational Linguistics: NAACL 2025, Apr 29-May 4, 2025

Retrieval-Augmented Generation through Zero-shot Sentence-Level Passage Refinement using LLMs

Taeho Hwang, Soyeong Jeong, Sukmin Cho, and Jong C. Park
Journal of KIISE, 54(3), March, 2025. (To appear)

A Spatio-Temporal Representation Learning as an Alternative to Traditional Glosses in Sign Language Translation and Production

Eui Jun Hwang, Sukmin Cho, Huije Lee, Youngwoo Yoon, and Jong C. Park
2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2025), Feb 28-Mar 4, 2025

Political Bias in Large Language Models and its Implications on Downstream Tasks

Jeong yeon Seo, Sukmin Cho, and Jong C. Park
Journal of KIISE, 52(1), 18-28, January, 2025.

The Impact of Retrieved Document Bias on Generated Responses in RAG: An Analysis through Political Bias Experiments

Seungho Cho, Changgeon Ko, Taeho Hwang, Jeong yeon Seo, and Jong C. Park
Korea Software Congress (KSC2024), Dec 18-20, 2024.
(selected as the best paper)

Improving Keypoint-based Korean Sign Language Translation Performance Through Frame-wise Contrastive Learning

Hyeyeon Kim, Junmyeong Lee, Eui Jun Hwang, and Jong C. Park
Korea Software Congress (KSC2024), Dec 18-20, 2024.
(selected as the best paper)

Different Bias Under Different Criteria: Assessing Bias in LLMs with a Fact-Based Approach

Changgeon Ko, Jisu Shin, Hoyun Song, Jeong yeon Seo, and Jong C. Park
The workshop on Socially Responsible Language Modelling Research at NeurIPS 2024 (SoLaR@NeurIPS 2024), Dec 14, 2024.

PiLaMIM: Toward Richer Visual Representations by Integrating Pixel and Latent Masked Image Modeling

Junmyeong Lee, Eui Jun Hwang, Sukmin Cho, and Jong C. Park
The workshop on Self-Supervised Learning - Theory and Practice at NeurIPS 2024 (SSL@NeurIPS 2024), Dec 14, 2024.

Typos that Broke the RAG's Back: Genetic Attack on RAG Pipeline by Simulating Documents in the Wild via Low-level Perturbations

Sukmin Cho, Soyeong Jeong, Jeong yeon Seo, Taeho Hwang, and Jong C. Park
Findings of the Association for Computational Linguistics: EMNLP 2024 (Findings of EMNLP), Nov 12-14, 2024.

Towards Effective Counter-Responses: Aligning Human Preferences with Strategies to Combat Online Trolling

Huije Lee, Hoyun Song, Jisu Shin, Sukmin Cho, SeungYoon Han, and Jong C. Park
Findings of the Association for Computational Linguistics: EMNLP 2024 (Findings of EMNLP), Nov 12-14, 2024.

DSLR: Document Refinement with Sentence-Level Re-ranking and Reconstruction to Enhance Retrieval-Augmented Generation

Taeho Hwang, Soyeong Jeong, Sukmin Cho, SeungYoon Han, and Jong C. Park
The Third workshop on knowledge-augmented methods for NLP at ACL 2024 (KnowledgeNLP@ACL 2024), Aug 16, 2024.

Ask LLMs Directly, "What shapes your bias?": Measuring Social Bias in Large Language Models

Jisu Shin, Hoyun Song, Huije Lee, Soyeong Jeong, and Jong C. Park
Findings of the Association for Computational Linguistics: ACL 2024 (Findings of ACL), Aug 11-16, 2024.

Retrieval-Augmented Generation through Zero-shot Sentence-Level Passage Refinement using LLMs

Taeho Hwang, Soyeong Jeong, Sukmin Cho, and Jong C. Park
Proceedings of the Korea Computer Congress (KCC 2024), June 26-28, 2024.
(selected as the outstanding paper)

Enhancing Sign Language Recognition with Pose-Based Data Augmentation: Focusing on Hand Keypoints

SeungYoon Han, Aujin Kim, KyungGeun Roh, and Jong C. Park
Proceedings of the Korea Computer Congress (KCC 2024), June 26-28, 2024.

Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity

Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Ju Hwang, and Jong C. Park
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Jun 16-21, 2024.

Effective Pre-processing on Hand Keypoints for Sign Language Recognition

KyungGeun Roh
MS Thesis, KAIST, 2024

A Gloss-free Sign Language Production with Discrete Representation

Eui Jun Hwang, Huije Lee, and Jong C. Park
The 18th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2024), May 27-31, 2024.

Preprocessing Mediapipe Keypoints with Keypoint Reconstruction and Anchors for Isolated Sign Language Recognition

KyungGeun Roh, Huije Lee, Eui Jun Hwang, Sukmin Cho, and Jong C. Park
Proceedings of the LREC-COLING 2024 11th Workshop on the Representation and Processing of Sign Languages: Evaluation of Sign Language Resources (SignLang@LREC-COLING 2024), May 20-25, 2024.

Augmentation of Sign Language Poses by Including the Understanding of the Sign Language Domain by Body Part

Aujin Kim
MS Thesis, KAIST, 2023.

Capturing Ambiguity in Natural Language Understanding Tasks with Information from Internal Layers

Hancheol Park
PhD Dissertation, KAIST, 2023.

In natural language understanding (NLU) tasks, there are a large number of ambiguous samples where the veracity of their labels is debatable among annotators. Recently, researchers have found that even when additional annotators evaluate such ambiguous samples, they tend not to converge to single gold labels. It has also been revealed that, even when they are assessed by different groups of annotators, the degree of ambiguity is similarly reproduced. Therefore, it is desirable for a model used in an NLU task not only to predict a label that is likely to be considered correct by multiple annotators for a given sample but also to provide information about the ambiguity, indicating whether other labels could also be correct. This becomes particularly crucial in situations where the outcomes of decision-making can lead to serious problems, as information about ambiguity can guide users to make more cautious decisions and avoid risks. In this dissertation, we discuss methods for capturing ambiguous samples in NLU tasks. Due to the inherent ambiguity in NLU tasks, numerous samples with different labels can exist among those that share similar features. Therefore, it is highly likely that the model has learned information within its internal layers about which features are associated with various labels, and consequently, whether or not they exhibit ambiguity. Based on this assumption, our investigation of the representations for samples at each internal layer has revealed that information about the ambiguity of samples is more accurately represented in lower layers. Specifically, in lower layers, ambiguous samples are represented closely to samples with relevant labels in their embedding space. However, this tendency is no longer observed in the higher layers.
Based on these observations, we propose methods for capturing ambiguous samples using the distribution or representation information from lower layers of encoder-based pre-trained language models (PLMs) or decoder-based large language models (LLMs). Recently, these two types of models have been predominantly used for NLU tasks. More specifically, we introduce various approaches, including using layer pruning that removes upper layers close to the output layer to utilize information from lower layers, knowledge distillation that distills distribution knowledge from lower layers, and methods utilizing internal representations from lower layers. Through experiments with NLU datasets from various domains and tasks, we demonstrate that information from internal layers, particularly from lower layers, is valuable for capturing the ambiguity of samples. We also show that our proposed methods, which use the information from lower layers, significantly outperform existing methods.

Improving Zero-shot Reader by Reducing Distractions from Irrelevant Documents in Open-Domain Question Answering

Sukmin Cho, Jeong yeon Seo, Soyeong Jeong, and Jong C. Park
Findings of the Association for Computational Linguistics: EMNLP 2023 (Findings of EMNLP), Dec 6-10, 2023.

Test-Time Self-Adaptive Small Language Models for Question Answering

Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Ju Hwang, and Jong C. Park
Findings of the Association for Computational Linguistics: EMNLP 2023 (Findings of EMNLP), Dec 6-10, 2023.

Knowledge-Augmented Language Model Verification

Jinheon Baek, Soyeong Jeong, Minki Kang, Jong C. Park, and Sung Ju Hwang
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), Dec 6-10, 2023.

Generation of Korean Offensive Language by Leveraging Large Language Models via Prompt Design

Jisu Shin, Hoyun Song, Huije Lee, Fitsum Gaim, and Jong C. Park
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (IJCNLP-AACL), Nov 1-4, 2023.

Iterative Feedback-based Personality Persona Generation for Diversifying Linguistic Patterns in Large Language Models

Taeho Hwang, Hoyun Song, Jisu Shin, Sukmin Cho, and Jong C. Park
Proceedings of the 35th Annual Conference on Human & Cognitive Language Technology (HCLT), Oct 12-13, 2023.

Political Bias in Large Language Models and Implications on Downstream Tasks

Jeong yeon Seo, Sukmin Cho, and Jong C. Park
Proceedings of the 35th Annual Conference on Human & Cognitive Language Technology (HCLT), Oct 12-13, 2023.
(selected as best paper)

Deep Model Compression Also Helps Models Capture Ambiguity

Hancheol Park and Jong C. Park
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), July 9-14, 2023

Discrete Prompt Optimization via Constrained Generation for Zero-shot Re-ranker

Sukmin Cho, Soyeong Jeong, Jeong yeon Seo, and Jong C. Park
Findings of the Association for Computational Linguistics: ACL 2023 (Findings of ACL), July 9-14, 2023

Phrase Retrieval for Open Domain Conversational Question Answering with Conversational Dependency Modeling via Contrastive Learning

Soyeong Jeong, Jinheon Baek, Sung Ju Hwang, and Jong C. Park
Findings of the Association for Computational Linguistics: ACL 2023 (Findings of ACL), July 9-14, 2023

A Simple and Flexible Modeling for Mental Disorder Detection by Learning from Clinical Questionnaires

Hoyun Song, Jisu Shin, Huije Lee, and Jong C. Park
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), July 9-14, 2023

Question-Answering in a Low-resourced Language: Benchmark dataset and Models for Tigrinya

Fitsum Gaim, Wonsuk Yang, Hancheol Park, and Jong C. Park
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), July 9-14, 2023
(selected as outstanding paper)

Korean Sign Language Recognition on Keypoints with a Transformer Model

KyungGeun Roh, Eui Jun Hwang, Huije Lee, and Jong C. Park
Proceedings of the Korea Computer Congress (KCC 2023), June 17-20, 2023.

Controllable prompt tuning with relation dependent tokens

Jinseok Kim, Sukmin Cho, Soyeong Jeong, and Jong C. Park
Proceedings of the Korea Computer Congress (KCC 2023), June 17-20, 2023.

Sign Language Gloss Translation using Copy Mechanism

Jaewoo Kim, Huije Lee, Sukmin Cho, and Jong C. Park
Proceedings of the Korea Computer Congress (KCC 2023), June 17-20, 2023.

Knowledge Transfer for Enhanced Sentiment-Based Abusive Language Detection: Insights from Sarcasm Detection

Dongho Choi
MS Thesis, KAIST, 2023.

Leveraging Large Language Models with Vocabulary Sharing for Sign Language Translation

Huije Lee, Jung-Ho Kim, Eui Jun Hwang, Jaewoo Kim, and Jong C. Park
SLTAT 2023 Workshop at ICASSP 2023, June 4-10, 2023

Realistic Conversational Question Answering with Answer Selection based on Calibrated Confidence and Uncertainty Measurement

Soyeong Jeong, Jinheon Baek, Sung Ju Hwang, and Jong C. Park
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL), May 2-4, 2023.

Korean to Korean Sign Language Translation via Graph Generation

Jung-Ho Kim
PhD Dissertation, KAIST, 2022.

Sign language is a spatial and multi-channel language, but existing sign language translation (SLT) models have taken into account only sequential information of sign language words. As a result, the translated sign language sequence loses its spatial and non-manual information and cannot fully convey the meaning of the sequence. The thesis claimed herein is that the translation model must understand spatial and non-manual information centered around manual information to generate a complete sign language expression from a spoken sentence. To understand and generate this, we represent a KSL expression as a graph form and formulate SLT as a sequence-to-graph (seq2graph) learning problem. Through experiments, we analyze the strengths and weaknesses of the sequence-to-sequence (seq2seq) SLT methods and compare the performance of the seq2graph SLT method to that of seq2seq SLT methods. To compare the performance with the same criteria, we propose a new metric, Sign Language Evaluation Understudy (SLEU), to measure not only sequential information accuracy but also spatial and non-manual information accuracy. As a result of the experiment, the seq2graph SLT model is shown to perform 31.9% better than the best-performing seq2seq SLT model. In the future, we anticipate that the results of this study will be used in areas where there is a high demand for sign language interpretation by the Deaf, such as daily life conversations, broadcasting, and the Internet.

Detecting Implicitly Abusive Language by Applying Out-of-Distribution Problem

Jisu Shin, Hoyun Song, and Jong C. Park
Journal of KIISE, 49(11), 948-957, November, 2022.

Flexible acceptance condition of generics from a probabilistic viewpoint: Towards formalization of the semantics of generics.

Soo Hyun Ryu, Wonsuk Yang, and Jong C. Park
Journal of Psycholinguistic Research, 2022

Assessing automatic summarization model as a reading assistant

Aujin Kim, Jisu Shin, Soyeong Jeong, Sukmin Cho, and Jong C. Park
Proceedings of the Korea Computer Congress (KCC 2022), June 29-July 1, 2022

Constructing Korean Abusive Language Dataset using Machine Translation

Jisu Shin, Hoyun Song, Huije Lee, and Jong C. Park
Proceedings of the Korea Computer Congress (KCC 2022), June 29-July 1, 2022

Stopwords Mask Pooling for Dense Retrieval in Medical Domain

Dongho Choi, Hoyun Song, Soyeong Jeong, Sukmin Cho, and Jong C. Park
Proceedings of the Korea Computer Congress (KCC 2022), June 29-July 1, 2022

Sign Language Production With Avatar Layering: A Critical Use Case over Rare Words

Jung-Ho Kim, Eui Jun Hwang, Sukmin Cho, Du Hui Lee, and Jong C. Park
Proceedings of the 13th Language Resources and Evaluation Conference (LREC 2022), June 21-23, 2022

GeezSwitch: Language Identification in Typologically Related Low-resourced East African Languages

Fitsum Gaim, Wonsuk Yang, and Jong C. Park
Proceedings of the 13th Language Resources and Evaluation Conference (LREC 2022), June 21-23, 2022
(selected as best paper)

ELF22: A Context-based Counter-Trolling Dataset to Combat Internet Trolls

Huije Lee, Young Ju NA, Hoyun Song, Jisu Shin, and Jong C. Park
Proceedings of the 13th Language Resources and Evaluation Conference (LREC 2022), June 21-23, 2022

Online trolls increase social costs and cause psychological damage to individuals. With the proliferation of automated accounts making use of bots for trolling, it is difficult for targeted individual users to handle the situation both quantitatively and qualitatively. To address this issue, we focus on automating the method to counter trolls, as counter responses to combat trolls encourage community users to maintain ongoing discussion without compromising freedom of expression. For this purpose, we propose a novel dataset for automatic counter response generation. In particular, we constructed a pair-wise dataset that includes troll comments and counter responses with labeled response strategies, which enables models fine-tuned on our dataset to generate responses by varying counter responses according to the specified strategy. We conducted three tasks to assess the effectiveness of our dataset and evaluated the results through both automatic and human evaluation. In human evaluation, we demonstrate that the model fine-tuned on our dataset shows a significantly improved performance in strategy-controlled sentence generation.

Template-based Document Labeling for Dense Retrieval

Sukmin Cho
MS Thesis, KAIST, 2022.

Data Augmentation for Abusive Language Detection via Back-translation and Domain Knowledge

Jisu Shin
MS Thesis, KAIST, 2022.

Query Generation with External Knowledge for Dense Retrieval

Sukmin Cho, Soyeong Jeong, Wonsuk Yang, and Jong C. Park
Proceedings of Deep Learning Inside Out (DeeLIO): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures

Augmenting Document Representations for Dense Retrieval with Interpolation and Perturbation

Soyeong Jeong, JinHeon Baek, Sukmin Cho, Sung Ju Hwang, and Jong C. Park
60th Annual Meeting of the Association for Computational Linguistics (ACL 2022)

Detecting Implicitly Abusive Language by Applying Out-of-Distribution Problem

Jisu Shin, Hoyun Song, and Jong C. Park
Proceedings of the Korea Software Congress (KSC 2021), December 20-22, 2021
(selected as best paper)

Information Retrieval by Augmenting Document Representation

Soyeong Jeong
MS Thesis, KAIST, 2021

One of the challenges in information retrieval (IR) is the vocabulary mismatch problem, which refers to the failure of retrieving the query-relevant document when the terms between the query and the document are lexically different but semantically similar. While recent work has tried to tackle the problem by expanding sparse representations with additional relevant terms or by embedding the representations to learnable dense space, both the expansion and dense models generally require a large volume of labeled query-document pairs to train, whereas it is often challenging to acquire the labeled pairs annotated by humans. The thesis focuses on augmenting the document representations, either on the document text level or on the training dataset level, without requiring additional labeled query-document pairs for both sparse and dense retrieval models. For the sparse retrieval model, we propose Unsupervised Document Expansion with Generation (UDEG), which generates diverse supplementary sentences for the original document without using labels on query-document pairs for training. For generating sentences, we further stochastically perturb their embeddings to generate more diverse sentences for document expansion. We validate our UDEG on two standard IR benchmark datasets. The results show that our UDEG significantly outperforms relevant expansion baselines. For the dense retrieval model, we propose Document Augmentation for dense Retrieval (DAR), which augments the document representations with interpolation and perturbation. We validate the performance of DAR on retrieval tasks with two benchmark datasets, showing that the proposed DAR significantly outperforms relevant baselines on the dense retrieval of both the seen and unseen documents. We believe that our UDEG and DAR make a good contribution to sparse and dense retrievers by augmenting document representations without annotating additional query-document pairs.
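The abstract above names interpolation and perturbation of document representations but does not give formulas. As a rough illustration only, the sketch below assumes a mixup-style linear interpolation between two document embeddings and additive Gaussian noise as the perturbation; the function names, the mixing weight `lam`, and the noise scale `sigma` are all hypothetical and are not taken from the thesis.

```python
import random


def interpolate(doc_a, doc_b, lam):
    """Linearly mix two document embeddings (assumed mixup-style scheme)."""
    if len(doc_a) != len(doc_b):
        raise ValueError("embeddings must have the same dimension")
    return [lam * a + (1.0 - lam) * b for a, b in zip(doc_a, doc_b)]


def perturb(doc, sigma, rng):
    """Add zero-mean Gaussian noise to each dimension (assumed noise model)."""
    return [x + rng.gauss(0.0, sigma) for x in doc]


# Toy 2-dimensional embeddings, purely for illustration.
rng = random.Random(0)
mixed = interpolate([1.0, 0.0], [0.0, 1.0], lam=0.5)
noisy = perturb(mixed, sigma=0.01, rng=rng)
```

Both functions produce new representations from existing ones without any extra labeled query-document pairs, which is the property the abstract emphasizes.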

Optimizing Domain Specificity of Transformer-based Language Models for Extractive Summarization of Financial News Articles in Korean

Huije Lee, Wonsuk Yang, ChaeHun Park, Hoyun Song, Eugene Jang, and Jong C. Park
35th Pacific Asia Conference on Language, Information and Computation (PACLIC 35), November 5-7, 2021

Frequent usage of complex expressions with numbers and of terms that require domain knowledge makes financial news articles more difficult to comprehend and summarize than other daily news articles. We present a transformer-based model for the automatic summarization of financial news articles in Korean and address related issues, and in particular analyze the interplay between the domain of the dataset used for pre-training and that for fine-tuning. We find that the summarization model performs much better when the two coincide, even when they are different from that of the target task, which is the financial domain in our work.

Non-Autoregressive Sign Language Production with Gaussian Space

Eui Jun Hwang, Jung-Ho Kim, and Jong C. Park
The 32nd British Machine Vision Conference (BMVC 2021), November 22-25, 2021

A Large-scale Comprehensive Abusiveness Detection Dataset with Multifaceted Labels from Reddit

Hoyun Song, Soo Hyun Ryu, Huije Lee, and Jong C. Park
Proceedings of the 25th Conference on Computational Natural Language Learning (CoNLL 2021), November 10-11, 2021

As users in online communities suffer from severe side effects of abusive language, many researchers have attempted to detect abusive texts from social media, presenting several datasets for such detection. However, none of them contain both comprehensive labels and contextual information, which are essential for thoroughly detecting all kinds of abusiveness from texts, since datasets with such fine-grained features demand a significant amount of annotations, leading to much increased complexity. In this paper, we propose a Comprehensive Abusiveness Detection Dataset (CADD), collected from English Reddit posts, with multifaceted labels and contexts. Our dataset is annotated hierarchically for an efficient annotation through crowdsourcing on a large scale. We also empirically explore the characteristics of our dataset and provide a detailed analysis for novel insights. The results of our experiments with strong pre-trained natural language understanding models on our dataset show that our dataset gives rise to meaningful performance, assuring its practicality for abusive language detection.

Monolingual Pre-Trained Language Models for Tigrinya

Fitsum Gaim, Wonsuk Yang, and Jong C. Park
WiNLP 2021 Workshop at EMNLP 2021, November 7-11, 2021

Pre-trained language models (PLMs) are driving much of the recent progress in natural language processing. However, due to the resource-intensive nature of the models, underrepresented languages without sizable curated data have not seen significant progress. Multilingual PLMs have been introduced with the potential to generalize across many languages, but their performance trails compared to their monolingual counterparts and depends on the characteristics of the target language. In the case of the Tigrinya language, recent studies report a sub-optimal performance when applying the current multilingual models. This may be due to its orthography and unique linguistic characteristics, especially when compared to the Indo-European and other typologically distant languages that were used to train the models. In this work, we pre-train three monolingual PLMs for Tigrinya on a newly compiled corpus, and we compare the models with their multilingual counterparts on two downstream tasks, part-of-speech tagging and sentiment analysis, achieving significantly better results and establishing the state-of-the-art. We make the data and trained models publicly available.

BERT-based Personality Disorder Detection Model with Abusive Language Marking from Social Media

Jisu Shin, Hoyun Song, and Jong C. Park
Proceedings of the Korea Computer Congress (KCC 2021), June 22-25, 2021

The Relationship between the Quality of Automatically Generated Questions and the Quantity of the Context given for the Generation

Sukmin Cho, Wonsuk Yang, ChaeHun Park, and Jong C. Park
Proceedings of the Korea Computer Congress (KCC 2021), June 22-25, 2021

Unsupervised Document Expansion for Information Retrieval with Stochastic Text Generation

Soyeong Jeong, Jinheon Baek, ChaeHun Park, and Jong C. Park
Second Workshop on Scholarly Document Processing (SDP 2021), June 6-11, 2021

One of the challenges in information retrieval (IR) is the vocabulary mismatch problem, which happens when the terms between queries and documents are lexically different but semantically similar. While recent work has proposed to expand the queries or documents by enriching their representations with additional relevant terms to address this challenge, they usually require a large volume of query-document pairs to train an expansion model. In this paper, we propose an Unsupervised Document Expansion with Generation (UDEG) framework with a pretrained language model, which generates diverse supplementary sentences for the original document without using labels on query-document pairs for training. For generating sentences, we further stochastically perturb their embeddings to generate more diverse sentences for document expansion. We validate our framework on two standard IR benchmark datasets. The results show that our framework significantly outperforms relevant expansion baselines for IR.

Generating Negative Samples by Manipulating Golden Responses for Unsupervised Learning of a Response Evaluation Model

ChaeHun Park, Eugene Jang, Wonsuk Yang, and Jong C. Park
2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2021), June 6-11, 2021

Evaluating the quality of responses generated by open-domain conversation systems is a challenging task. This is partly because there can be multiple appropriate responses to a given dialogue history. Reference-based metrics that rely on comparisons to a set of known correct responses often fail to account for this variety, and consequently correlate poorly with human judgment. To address this problem, researchers have investigated the possibility of assessing response quality without using a set of known correct responses. Tao et al. (2018) demonstrated that an automatic response evaluation model could be made using unsupervised learning for the next-utterance prediction (NUP) task. For unsupervised learning of such a model, we propose a method of manipulating a golden response to create a new negative response that is designed to be inappropriate within the context while maintaining high similarity with the original golden response. We find, from our experiments on English datasets, that using the negative samples generated by our method alongside random negative samples can increase the model's correlation with human evaluations. The process of generating such negative samples is automated and does not rely on human annotation.

Calibration of Pre-trained Language Model for Korean

Soyeong Jeong, Wonsuk Yang, ChaeHun Park, and Jong C. Park
Journal of KIISE, Vol. 48, No. 4, pp. 434-443, April, 2021.

The development of deep learning models is showing performance beyond humans in various tasks such as computer vision and natural language understanding tasks. In particular, pre-trained Transformer models have recently shown remarkable performance in natural language understanding problems such as question answering (QA) tasks and dialogue tasks. However, compared to the rapid development of deep learning models such as Transformer-based models, the mechanisms by which they work remain relatively unknown. As a method of analyzing deep learning models, calibration of models measures how much the predicted value of the model (confidence) matches the actual value (accuracy). Our study aims at interpreting pre-trained Korean language models with calibration. In particular, we analyzed whether pre-trained Korean language models are able to capture ambiguities in sentences and applied smoothing methods to quantitatively measure such ambiguities with confidence. In addition, in terms of calibration, we evaluated the capability of pre-trained Korean language models in identifying grammatical characteristics in Korean, which affect semantic changes in Korean sentences.
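The abstract describes calibration as the match between a model's confidence and its accuracy. One common way to quantify that match is the expected calibration error (ECE); the abstract does not say which metric the paper uses, so the sketch below is only an illustration, and the function name and the equal-width binning with `n_bins` are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Average |accuracy - confidence| over confidence bins, weighted by bin size."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    # Assign each prediction to an equal-width bin over [0, 1].
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes to the last bin
        bins[idx].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, h in members if h) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Toy example: four predictions at 0.9 confidence, three of them correct.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
```

A value near zero means confidence tracks accuracy closely; in the toy example the model is overconfident, since it claims 0.9 confidence but is right only 75% of the time.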

Target-Agnostic Detection of Stances Toward Entities in News Articles

Eugene Jang, Wonsuk Yang, and Jong C. Park
Human-Computer Interaction Korea (HCI), Korea, January 27-29, 2021.

Detecting quoted claims and claim speakers in news articles using transformer-based language models

Eugene Jang
MS Thesis, KAIST, 2021.

Quotations in news articles have been suggested as a source of subjective or biased information. This work deals with automatically detecting the quoted claims and speakers of the claims in news articles. We suggest that quoted claims in news articles appear in predictable, but non-trivial ways. We annotate 33 articles for their quoted claims, speakers, and relations between claims and speakers. A dataset created with the presented annotation scheme is used to experiment with i) claim identification, ii) speaker identification, iii) claim-speaker relation identification, iv) claim group identification, and v) speaker group identification. The annotation and experimental results suggest that the contextual information must be taken into account in order to annotate and predict quoted claims and their speakers.

Automatic Facial Expression Generation for Sign Language with Neural Machine Translation

Eui Jun Hwang, Jung-Ho Kim, and Jong C. Park
Korea Software Congress (KSC), Korea, December 21-23, 2020.

In a sign language, facial expressions play an important role for effective communication. In particular, they are well known for conveying emotional and grammatical information that affects the meaning of a sign. In this paper, we only consider the grammatical functions of the facial expressions. We propose a transformer-based facial expression generation model that translates an expression in spoken language into continuous facial landmark sequences for sign language. In order to train the model efficiently, we employ Principal Component Analysis embedding and Custom Mean Squared Error loss. We report the quantitative and qualitative results of the generated facial landmarks.

A Study on Practical Machine Translation from Korean to Korean Sign Language

Jung-Ho Kim, Eui Jun Hwang, and Jong C. Park
Journal of Korean Sign Language Studies (수어학연구), Vol. 4, No. 1, 2020.

In this study, we propose a practical method for machine translation from Korean to Korean Sign Language (KSL). For practical use, we select the most appropriate Korean corpus and then annotate the corpus into KSL sentences to construct a Korean-KSL parallel corpus. Through experiments, we train four neural machine translation models on our Korean-KSL parallel corpus and find the best machine translation model by calculating BLEU scores. As a quantitative result, our best model achieves a BLEU-4 score of 20.18 on a test set of our Korean-KSL parallel corpus. We also report qualitative results with an error analysis for better understanding. We finally demonstrate that our model can translate Korean sentences not included in our Korean dataset. Therefore, we believe that our Korean-KSL translation system can lessen the gap between supply and demand for sign language interpretations.

Calibration of Pre-trained Language Model for Korean

Soyeong Jeong, Wonsuk Yang, ChaeHun Park, and Jong C. Park
32nd Annual Conference on Human & Cognitive Language Technology, October 15-16, 2020.
(selected as best paper)

TEA: The Effect of the Textual Entailment on the Acceptability Changes

Junseop Ji, Wonsuk Yang, and Jong C. Park
Human-Computer Interaction Korea (HCI), August 19-21, 2020.

Unsupervised Inference of Implicit Biomedical Events using Context Triggers

Jin-Woo Chung, Wonsuk Yang, and Jong C. Park
BMC Bioinformatics, 2020.

Construction of a dialogue generation model based on persuasion strategies

Junseop Ji
MS Thesis, KAIST, 2020

Computer Assisted Annotation of Tension Development in TED Talks through Crowdsourcing

Seungwon Yoon, Wonsuk Yang, and Jong C. Park
1st Workshop on Aggregating and analysing crowdsourced annotations for NLP (AnnoNLP), Hong Kong SAR, November 2, 2019.

Generating Sentential Arguments from Diverse Perspectives on Controversial Topic

ChaeHun Park, Wonsuk Yang, and Jong C. Park
2nd Workshop on NLP for Internet Freedom (NLP4IF): Censorship, Disinformation, and Propaganda, Hong Kong SAR, November 3, 2019.

Nonsense!: Quality Control via Two-Step Reason Selection for Annotating Local Acceptability and Related Attributes in News Editorials

Wonsuk Yang, Seungwon Yoon, Ada Carpenter, and Jong C. Park
2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP 2019), Hong Kong SAR, November 3-7, 2019.

Assessing the multi-level knowledge prominence perceived by the authors as revealed on their writings

Wonsuk Yang, Jin-Woo Chung, and Jong C. Park
Language and Information, Vol. 23, No. 2, 2019.

A Corpus of Sentence-level Annotations of Local Acceptability with Reasons

Wonsuk Yang, Jung-Ho Kim, Seungwon Yoon, ChaeHun Park, and Jong C. Park
33rd Pacific Asia Conference on Language, Information and Computation (PACLIC 33), Hakodate, Japan, September 13-15, 2019.

Automatic Scoring of Semantic Fluency

Najoung Kim, Jung-Ho Kim, Maria K. Wolters, Sarah E. MacPherson, and Jong C. Park
Frontiers in Psychology, Vol. 10, pp. 1020, 2019. (SSCI IF 2.089)

In neuropsychological assessment, semantic fluency is a widely accepted measure of executive function and access to semantic memory. While fluency scores are typically reported as the number of unique words produced, several alternative manual scoring methods have been proposed that provide additional insights into performance, such as clusters of semantically related items. Many automatic scoring methods yield metrics that are difficult to relate to the theories behind manual scoring methods, and most require manually-curated linguistic ontologies or large corpus infrastructure. In this paper, we propose a novel automatic scoring method based on Wikipedia, Backlink-VSM, which is easily adaptable to any of the 61 languages with more than 100k Wikipedia entries, can account for cultural differences in semantic relatedness, and covers a wide range of item categories. Our Backlink-VSM method combines relational knowledge as represented by links between Wikipedia entries (Backlink model) with a semantic proximity metric derived from distributional representations (vector space model; VSM). Backlink-VSM yields measures that approximate manual clustering and switching analyses, providing a straightforward link to the substantial literature that uses these metrics. We illustrate our approach with examples from two languages (English and Korean), and two commonly used categories of items (animals and fruits). For both Korean and English, we show that the measures generated by our automatic scoring procedure correlate well with manual annotations. We also successfully replicate findings that older adults produce significantly fewer switches compared to younger adults. Furthermore, our automatic scoring procedure outperforms the manual scoring method and a WordNet-based model in separating younger and older participants measured by binary classification accuracy for both English and Korean datasets. Our method also generalizes to a different category (fruit), demonstrating its adaptability.

A Corpus of Sentential Annotations on News Editorials with Multi-dimensional Credibility Metrics

Wonsuk Yang, Jung-Ho Kim, Jin-Woo Chung, and Jong C. Park
Human-Computer Interaction Korea (HCI), Jeju ICC, Korea, February 13-15, 2019.

Generating diverse sentential arguments on controversial topics with a memory-augmented generation model

ChaeHun Park
MS Thesis, KAIST, 2019

Generating humorous statements for public speech with pre-trained language model and tension analyzer

Seungwon Yoon
MS Thesis, KAIST, 2019

Mitigating Stereotypes in Word Embedding through Sentiment Modulation

Huije Lee, Jin-Woo Chung, and Jong C. Park
Korea Software Congress (KSC), Pyeongchang, Korea, December 19-21, 2018.

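Word embedding is a model that effectively captures the semantic information of words in low-dimensional vectors, and pre-trained word2vec is used in many natural language processing tasks that rely on word semantics. However, word embedding models trained on large amounts of text also learn, as semantic information, the stereotypes that people may hold about gender, race, and similar attributes. In this paper, noting that implicit sentiment toward words referring to persons or groups can bias a model, we present a method for detecting words that reveal affective stereotypes within an embedding model and, to mitigate stereotypes, propose an embedding model in which the sentiment dimension for person entities is modulated. Experimental results show that model bias intensifies as the sentiment-dimension embedding for person entities is amplified, and that the proposed model reduces bias by 16% compared with the existing model while keeping the change in performance within 1%.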

Neural Grammatical Error Correction by Simulating the Human Learner and the Human Proofreader

Fitsum Gaim, Jin-Woo Chung, and Jong C. Park
Korea Software Congress (KSC), Pyeongchang, Korea, December 19-21, 2018.

We present a learning framework for grammatical error correction (GEC) that leverages the duality of translation to effectively synthesize training signals from a monolingual corpus through a game of two contrasting agents that are initially trained with a small amount of parallel data. The first agent learns to produce ostensibly natural errors, whereas the second learns to proofread the erroneous output into grammatically correct text. This approach not only alleviates the need for large parallel corpora but also exposes the GEC model to a wider range of error types. Our final model is competitive against the best systems, outperforming some of the strongest models on standard benchmarks.

Interpretable Depression Detection from Social Media using Hierarchical Attention Network with Depression Indicators

Hoyun Song
MS Thesis, KAIST, 2018.

To diagnose depression effectively, which is one of the most harmful mental disorders, many researchers have turned to social media and analyzed differences in language use. However, detecting depression from social media faces problems such as the small proportion of posts with depression indicators and the difficulty of distinguishing depressive symptoms from temporarily depressed feelings. To address these problems, we propose hierarchical attention with depressive indicators, inspired by the process by which a person with domain knowledge diagnoses depression. Our model provides not only interpretations but also their visualizations, using the weights learned through the attention mechanism. With this model, we can investigate different aspects of posts with depressive indicators based on psychological theories, which will help researchers find useful evidence for depressive characteristics.

Mitigating Stereotypes in Word Embedding through Sentiment Modulation

Huije Lee
MS Thesis, KAIST, 2018.

Word embedding is an influential framework to quantify the meaning of a word, which is widely used in machine learning at a pre-processing level for natural language processing (NLP). However, word embedding trained with a large number of contexts encodes not only general syntactic and semantic meaning of a word, but also the stereotypes and biases that people may have. This thesis proposes a method to indirectly mitigate the stereotypes in the trained word embedding by modulating the dimension of sentimental attributes in a human entity without imposing equal probability on the compatible social groups. To prevent the word embedding from creating problematic predictions such as a stereotype threat, we modulate the strength of the association between a human entity and sentimental attribute and indirectly reduce the gender bias of the embedding model. We show that the proposed method preserves the overall embedding performance. We also confirm that increasing the strength of the association between human entities and sentimental attributes amplifies the model bias through experiment.

Feature Attention Network: Interpretable Depression Detection from Social Media

Hoyun Song, Jinseon You, Jin-Woo Chung, and Jong C. Park
32nd Pacific Asia Conference on Language, Information and Computation (PACLIC 32), The Hong Kong Polytechnic University, Hong Kong SAR, December 1-3, 2018.

Although depression is one of the most common mental disorders, depressed individuals may not be aware of their symptoms at all, so they sometimes miss the appropriate time for treatment. To prevent this problem, many researchers have looked into social media to identify depressed individuals by analyzing differences in language use. While they have recently achieved reasonable performance in detecting depression, especially with deep learning methods, such methods still do not provide a clear way to explain why certain individuals have been detected as depressed. To address this issue, we propose Feature Attention Network (FAN), inspired by the process of diagnosing depression by an expert who has background knowledge about depression. We evaluate the performance of our model on a large-scale general forum dataset (Reddit Self-reported Depression Diagnosis). Experimental results demonstrate that FAN shows good performance with high interpretability despite a smaller number of posts in the training data. We investigate different aspects of posts by depressed users through four feature networks built upon psychological studies, which will help researchers investigate social media posts to find useful evidence for depressive symptoms.

Extracting Supporting Evidence with High Precision via Bi-LSTM Network

ChaeHun Park, Wonsuk Yang, and Jong C. Park
30th Annual Conference on Human & Cognitive Language Technology, Korea University, Seoul, Korea, October 12-13, 2018.

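An argument needs sufficient supporting evidence to be highly persuasive. Automating the extraction of evidence that can logically support the claims in an argument is essential for developing and commercializing applications such as automated debate systems and decision-support aids for policy voting. However, a system that extracts supporting evidence from web documents must first solve two problems, which make a high-performance system difficult to build: 1) a search scope wide enough to obtain information that has low direct relevance to the topic of the argument but can still serve as supporting evidence, and 2) the cognitive ability to identify, within the collected information, evidence that clearly supports the claims of the argument. To extract supporting evidence with high precision and scalability, this study proposes the following staged extraction system: 1) selection of relevant documents based on TF-IDF similarity, 2) first-pass extraction of supporting evidence through semantic similarity, and 3) second-pass extraction of supporting evidence through a neural network classifier. To validate the proposed system, we conducted an experiment extracting supporting evidence for the claims in 4,008 editorials from 845,675 news articles on the web. Evaluated against annotated claims and supporting evidence, the proposed staged system achieved precisions of 0.41 and 0.70 in the first- and second-pass extraction, respectively. An analysis of the extracted evidence further confirmed that supporting evidence can be extracted on the basis of a proper understanding of the argument.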
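
The first stage named in this abstract, selecting relevant documents by TF-IDF similarity, can be illustrated with a minimal sketch. This is not the authors' implementation: the whitespace tokenizer, the smoothed IDF formula, the `threshold` value, the function names, and the sample texts are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per tokenized document."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (an assumed variant) keeps every weight positive.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_relevant(claim, articles, threshold=0.1):
    """Indices of articles whose similarity to the claim is at least threshold."""
    tokenized = [text.lower().split() for text in [claim] + articles]
    vectors = tfidf_vectors(tokenized)
    claim_vec, article_vecs = vectors[0], vectors[1:]
    return [i for i, vec in enumerate(article_vecs) if cosine(claim_vec, vec) >= threshold]


# Hypothetical claim and news snippets, invented for this example.
claim = "raising the minimum wage reduces poverty"
articles = [
    "a new study links a higher minimum wage to lower poverty rates",
    "local football team won its championship match last night",
]
selected = select_relevant(claim, articles)
print(selected)
```

In a staged design like the one described, a cheap lexical filter of this kind would only narrow the candidate pool; the later semantic-similarity and classifier stages would do the finer selection.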

Automatic Tension Recognition from Lecture Show Transcripts

Seungwon Yoon, Wonsuk Yang, and Jong C. Park
30th Annual Conference on Human & Cognitive Language Technology, Korea University, Seoul, Korea, October 12-13, 2018.

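Tension constantly affects people when they communicate or read. The concept of tension has been used in a broad sense in natural language processing; among these notions, this paper focuses on the tension that an audience feels toward a speaker's utterances in one-way communication such as lectures, and proposes a method to quantify it. We expect that quantifying tension in one-way communication will make it easier to apply the concept to general documents written by a single author. We first construct a new corpus annotated with the audience's tension toward the speaker's utterances. We then present a model that predicts tension with context taken into account, and show through experimental results on tension classification that automatic tension classification is computationally feasible.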

Extracting Spatial Information about Events from Text

Jin-Woo Chung
PhD Dissertation, KAIST, Feb. 2018

Automatic extraction of spatial information about events from text plays an important role not only in the semantic interpretation of events but also in many location-based applications such as infectious disease surveillance and natural disaster monitoring. However, the fundamental limitation of previous work is the limited scope of extraction, which targets only information that is explicitly stated through predicate-argument structures. This leads to missing a lot of implicit information inferable from context in a document, which amounts to nearly 40% of the entire location information. To overcome this limitation, we present in this dissertation an approach to recognizing the document-level relationship between events and their locations, aiming specifically at identifying an expression in text that best indicates where a given event occurs. We present a two-step approach to this problem: First, we design an annotation framework to construct a corpus annotated with the associations between event mentions and location expressions in news articles. Based on the corpus annotation and analysis, we hypothesize that coherent narratives such as news articles usually mention a series of events that occur together in a similar location. Second, we present an inference system that recognizes the associations from a given document based on this hypothesis. The system employs a multi-pass architecture where locally captured, more precise information is propagated to neighboring events through particular context. We exploit distributional similarities as key contextual information in this architecture to assess how similar two events are. The results of experiments on the annotated corpus demonstrate that the multi-pass architecture with distributional similarities is reasonably capable of capturing the document-level associations between events and locations, especially when compared with several baseline systems. The results also show that considering multiple types of event components together in modeling event similarities leads to a better understanding of the spatial relatedness of two events than considering a single type of component. Our system achieves good performance for this challenging task, with F1-scores of around 0.50 across different settings, considering that general state-of-the-art systems for extracting spatiotemporal relations and document-level event relations show a similar level of performance. We believe that the proposed corpus and system have good potential not only to benefit many downstream NLP tasks that involve a spatial analysis of events, but also to improve the quality of location-based applications that exploit textual documents.

Detection of Non-Standard Meaning Usage with Word Embedding

Huije Lee, Hancheol Park, Wonsuk Yang, and Jong C. Park
Human-Computer Interaction Korea (HCI), Jeongseon, Korea, January 31-February 2, 2018.

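In this study, we propose a model that uses distributed representations to detect words in text that are not used in their dictionary senses (hereafter, non-standard meaning words). Determining when a word with the same surface form is used with a non-standard meaning is important for automated text analysis and for resolving mistranslation. Using context vectors and target word vectors generated by distributed representation methods, we verify whether a target word fits the given context and determine whether it is used with a non-standard meaning. To address problems in how earlier work generated context vectors, we propose a method for representing integrated context information and a method for weighting the words in the context. In experiments on Twitter data, the proposed method outperformed previously proposed models.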

Predicting Symptoms of Depression for Social Media Users via Linguistic Patterns

Hoyun Song, Hancheol Park, Wonsuk Yang, and Jong C. Park
Korea Software Congress (KSC), Busan, Korea, December 20-22, 2017.

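Early diagnosis of depression is important because depression can impair an individual's daily functioning and cause various social problems. As an attempt at such early diagnosis, this study proposes a model that predicts whether users have depression from their social media text. Existing lexicon-based models have two limitations on unstructured social media text: the lexical matching problem and the use of depression-related vocabulary by users who are not depressed. To address these, we present a model that uses deeper linguistic patterns. Our experiments confirm that applying linguistic patterns together with lexical features is more effective than a simple lexicon-based model in predicting whether a user has depression.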

Extraction of Gene-Environment Interaction from the Biomedical Literature

Jinseon You, Jin-Woo Chung, Wonsuk Yang, and Jong C. Park
Proceedings of the 8th International Joint Conference on Natural Language Processing (IJCNLP 2017), pp. 865–874, Taipei, Taiwan, November 27–December 1, 2017.

Genetic information in the literature has been extensively looked into for the purpose of discovering the etiology of a disease. As the gene-disease relation is sensitive to external factors, their identification is important to study a disease. Environmental influences, which are usually called Gene-Environment interaction (GxE), have been considered as important factors and have extensively been researched in biology. Nevertheless, there is still a lack of systems for automatic GxE extraction from the biomedical literature due to new challenges: (1) there are no preprocessing tools and corpora for GxE, (2) expressions of GxE are often quite implicit, and (3) document-level comprehension is usually required. We propose to overcome these challenges with neural network models and show that a modified sequence-to-sequence model with a static RNN decoder produces a good performance in GxE recognition.

Inferring Implicit Event Locations from Context with Distributional Similarities

Jin-Woo Chung, Wonsuk Yang, Jinseon You, and Jong C. Park
Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI-17), pp. 979-985, Melbourne, Australia, August 19-25, 2017.

Automatic event location extraction from text plays a crucial role in many applications such as infectious disease surveillance and natural disaster monitoring. The fundamental limitation of previous work such as SpaceEval is the limited scope of extraction, targeting only locations that are explicitly stated in a syntactic structure. This leads to missing a lot of implicit information inferable from context in a document, which amounts to nearly 40% of the entire location information. To overcome this limitation for the first time, we present a system that infers the implicit event locations from a given document. Our system exploits distributional semantics, based on the hypothesis that if two events are described by similar expressions, it is likely that they occur in the same location. For example, if “A bomb exploded causing 30 victims” and “many people died from terrorist attack in Boston” are reported in the same document, it is highly likely that the bomb exploded in Boston. Our system achieves a good F1-score of 0.58, whereas state-of-the-art classifiers for intra-sentential spatiotemporal relations achieve F1-scores of around 0.60.
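
The hypothesis stated in this abstract, that similarly described events tend to share a location, can be illustrated with a minimal sketch. It is not the system from the paper: the three-dimensional event vectors are toy values, and `infer_locations`, the `threshold` of 0.8, and the "Washington" event are hypothetical additions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def infer_locations(events, threshold=0.8):
    """Give each unlocated event the location of its most similar located event."""
    located = [e for e in events if e["location"] is not None]
    inferred = {}
    for event in events:
        if event["location"] is not None:
            continue
        best = max(located, key=lambda e: cosine(event["vector"], e["vector"]), default=None)
        # Propagate only when the two event descriptions are similar enough.
        if best is not None and cosine(event["vector"], best["vector"]) >= threshold:
            inferred[event["id"]] = best["location"]
    return inferred


# Toy vectors standing in for distributional representations of event mentions.
events = [
    {"id": "exploded", "vector": [0.9, 0.1, 0.0], "location": None},
    {"id": "died", "vector": [0.8, 0.2, 0.1], "location": "Boston"},
    {"id": "announced", "vector": [0.0, 0.1, 0.9], "location": "Washington"},
]
inferred = infer_locations(events)
print(inferred)
```

With these toy values, the "exploded" event is closest to the "died" event and inherits its location, mirroring the Boston example in the abstract.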

Using syntactic structure to extract prominent gene regulatory network from the literature

Wonsuk Yang
MS Thesis, KAIST, 2017.

Neural Theorem Prover with Word Embedding for Efficient Automatic Annotation

Wonsuk Yang, Hancheol Park, and Jong C. Park
Journal of KIISE, Vol. 44, No. 4, pp. 399-410, April, 2017.

Addressing low-resource problems in statistical machine translation of manual signals in sign language

Hancheol Park, Jung-Ho Kim, and Jong C. Park
Journal of KIISE, Vol. 44, No. 2, pp. 163-170, February, 2017.

Neural Theorem Prover with Word Embedding for Efficient Automatic Annotation

Wonsuk Yang, Hancheol Park, and Jong C. Park
Proceedings of the 28th Annual Conference on Human and Cognitive Language Technology (HCLT) pp. 79-84, Busan, Korea, October 07-08, 2016.
(selected as best paper)

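This study aims to design a system that automatically annotates the vast number of unverified documents on the web with verified documents produced by expert institutions, thereby improving reliability and automatically adding in-depth information. The neural theorem prover, a system applicable to this goal, has the fundamental problem that it does not scale to large corpora; to solve this, we rebuilt it by replacing its internal recurrent module with a word embedding module. To demonstrate the dramatic reduction in training time, we compared the original and rebuilt systems side by side on a case in which 28,844 propositions, extracted from verified documents on cancer prevention and practice from the National Cancer Information Center, were used to annotate 7,844 propositions extracted from cancer-related Wikipedia articles. In the same environment, the training time of the original system was estimated at 553.8 days, whereas the rebuilt system finished training within 93.1 minutes. The contribution of this study is that it resolves the training-time problem that had made the neural theorem prover inapplicable to real-world cases, even though, as a modularizable nonlinear system, it can be combined in parallel with other linear logic and natural language processing modules.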

Enhanced sign language transcription system via hand tracking and pose estimation

Jung-Ho Kim, Najoung Kim, Hancheol Park, and Jong C. Park
Journal of Computing Science and Engineering, Vol. 10, No. 3, pp. 95-101, September, 2016.

In this study, we propose a new system for constructing parallel corpora for sign languages, which are generally underresourced in comparison to spoken languages. In order to achieve scalability and accessibility regarding data collection and corpus construction, our system utilizes deep learning-based techniques and predicts depth information to perform pose estimation on hand information obtainable from video recordings by a single RGB camera. These estimated poses are then transcribed into expressions in SignWriting. We evaluate the accuracy of hand tracking and hand pose estimation modules of our system quantitatively, using the American Sign Language Image Dataset and the American Sign Language Lexicon Video Dataset. The evaluation results show that our transcription system has a high potential to be successfully employed in constructing a sizable sign language corpus using various types of video resources.

Making adjustments to event annotations for improved biological event extraction

Seung-Cheol Baek and Jong C. Park
Journal of Biomedical Semantics, 7:55, doi: 10.1186/s13326-016-0094-9, 16 September 2016. (SCIE IF 1.62)

Background
Current state-of-the-art approaches to biological event extraction train statistical models in a supervised manner on corpora annotated with event triggers and event-argument relations. Inspecting such corpora, we observe that there is ambiguity in the span of event triggers (e.g., “transcriptional activity” vs. “transcriptional”), leading to inconsistencies across event trigger annotations. Such inconsistencies make it quite likely that similar phrases are annotated with different spans of event triggers, suggesting the possibility that a statistical learning algorithm misses an opportunity for generalizing from such event triggers.

Methods
We anticipate that adjustments to the span of event triggers to reduce these inconsistencies would meaningfully improve the present performance of event extraction systems. In this study, we look into this possibility with the corpora provided by the 2009 BioNLP shared task as a proof of concept. We propose an Informed Expectation-Maximization (EM) algorithm, which trains models using the EM algorithm with a posterior regularization technique, which consults the gold-standard event trigger annotations in a form of constraints. We further propose four constraints on the possible event trigger annotations to be explored by the EM algorithm.

Results
The algorithm is shown to outperform the state-of-the-art algorithm on the development corpus in a statistically significant manner and on the test corpus by a narrow margin.

Conclusions
The analysis of the annotations generated by the algorithm shows that there are various types of ambiguity in event annotations, even though they could be small in number.

Prosodic and Linguistic Analysis of Semantic Fluency Data: A Window into Speech Production and Cognition

Maria Wolters, Najoung Kim, Jung-Ho Kim, Sarah E. MacPherson, and Jong C. Park
Interspeech 2016, pp. 2085-2089, San Francisco, California, September 8-12, 2016.

Semantic fluency is a commonly used task in psychology that provides data about executive function and semantic memory. Performance on the task is affected by conditions ranging from depression to dementia. The task involves participants naming as many members of a given category (e.g. animals) as possible in sixty seconds. Most of the analyses reported in the literature only rely on word counts and transcribed data, and do not take into account the evidence of utterance planning present in the speech signal. Using data from Korean, we show how prosodic analyses can be combined with computational linguistic analyses of the words produced to provide further insights into the processes involved in producing fluency data. We compare our analyses to an established analysis method for semantic fluency data, manual determination of lexically coherent clusters of words.

Computational Identification of Sequence Variation and Environmental Condition in Clinical Depression from Biomedical Literature

Jinseon You
MS Thesis, KAIST, 2016.

Clinical depression is a complex disease, which is known to be influenced by various factors. As genetic and environmental factors are frequently referred to as the most influential in causing depression, there have been many studies that try to identify genes or proteins and environmental conditions associated with depression. While a number of text-mining (TM) systems identifying information about the genetic factors in the biomedical literature have consequently been developed, there is currently no TM system specifically targeted at extracting environmental conditions. As a result, these TM systems provide biologists with only incomplete information about depression and cannot help them discover its etiology and treatment. In this thesis, we propose a TM system that considers an interaction between genetic and environmental factors associated with depression. The system identifies not only relations between a sequence variation and depression but also changes in the relations according to environmental conditions. In order to develop the system, we split the system into two TM subsystems. The first system is applied to an existing system for extracting the relations between a sequence variation and depression from the biomedical literature. The system classifies whether the relations are positive or negative on a document level. Based on the dictionary with candidate terms for environmental conditions, the second system identifies the conditions in the biomedical literature containing the binary relations. Using the dependency structure of sentences, the system excludes terms wrongly classified as the conditions. The system is the first TM system to consider a ternary relation among sequence variation, disease and condition. Through the system, we are able to provide more comprehensive information about depression than other systems. We expect that, as the system is applied to other diseases, biologists can easily identify diverse information associated with changes in symptoms of diseases including depression.

Classification of Relations between Biological Entities using Word Vectors

Jimin Park, Jin-Woo Chung, and Jong C. Park
Proceedings of Korea Computer Congress (KCC), pp. 771-773, Jeju, Korea, June 29 - July 1, 2016. (poster presentation)

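There have been many studies on identifying relations between components of biological systems from the text of research papers, and on classifying relations between general words using distributional semantic models, but attempts to combine the two have rarely been reported. In this study, we examine how a distributional model contributes to predicting the relation between two components within a biological system. Experimental results confirm that distributional models can serve as useful features for identifying relations between biological components.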

Addressing Low-Resource Problems in Statistical Machine Translation of Sign Language

Hancheol Park, Jung-Ho Kim, and Jong C. Park
Proceedings of Korea Computer Congress (KCC), pp. 714-716, Jeju, Korea, June 29 - July 1, 2016.
(selected as best paper)

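Although research on sign language translation using statistical machine translation has recently become active, the scarcity of parallel corpus resources remains unresolved. This study presents three preprocessing methods that address problems arising from resource scarcity when a spoken language is translated into a sign language composed of manual signals using statistical machine translation. We then verify through experiments which methods actually improve translation performance in sign language translation under resource scarcity. The proposed preprocessing methods are: expanding the corpus by paraphrasing spoken-language sentences, raising the frequency of individual lexical items by lemmatizing spoken-language words, and aligning sentence constituents between the spoken and sign languages by removing words whose parts of speech are not expressed in manual signals. Experiments on an English-American Sign Language parallel corpus show that, among the three preprocessing methods, translation quality improves only when paraphrase generation and lemmatization are applied. In particular, performance was highest when the two methods were applied together.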

Language processing through the collaboration with field experts

Jong C. Park
Invited talk, 2016 Spring conference of the Language Research Institute of Hankuk University of Foreign Studies, May 27, 2016.

Synchronization of Non-Manual Signals in Sign Language with Sequence Prediction

Jung-Ho Kim
MS Thesis, KAIST, 2016.

There are various types of non-manual signals in sign language, which carry important linguistic information such as feeling, semantic difference and nuance. Upon investigation into the nature of non-manual signals in the bible and literature corpus, we find that several types of non-manual signals appear on a single word. This implies that context may play a role in signed utterances. This thesis experimentally unravels the nature of non-manual signals and proposes a prediction model for the non-manual signal sequence and its advanced approach. The correlation between non-manual signals is measured by utilizing their co-occurrence rate. The result shows close correlations among 'Trunk', 'Head', 'Brow to Eye-gaze' and 'Mouth'. To verify the existence of the context, a prediction model using conditional random fields trained on a sequence of 'gloss'-'non-manual signal' pairs is proposed, which shows superior results in comparison with a 'gloss'-'non-manual signal' dictionary-based approach. This result suggests that synchronized non-manual signals can be predicted by the proposed model when the training is done with other non-manual signals. It also means that the accuracy is expected to increase as we fine-tune such signals. As a result, all experiments show better performance when a sequence of 'Brow to Eye-gaze' is used as training data.

A Morphological Approach to the Longitudinal Detection of Dementia

Najoung Kim and Jong C. Park
HCI Conference Korea, High1 Resort, Gangwon, January 27-29, 2016.

The impact of cognitive impairment on linguistic abilities has been a topic of continuous interest in dementia studies. However, there is a lack of systematic agreement on the longitudinal association between dementia progression and the patients' morphological capacity, and the role of morphological phenomena other than inflection has been relatively underreported. We present a longitudinal study of writings by Iris Murdoch (diagnosed with Alzheimer's disease after her death) and Arthur Conan Doyle (no known record of dementia diagnosis), using two novel measures to account for the usage of complex morphology and lexical innovation. The results imply an association between lexical innovation and cognitive decline caused by dementia, as observed in Murdoch's works beginning from her mid-fifties, in contrast to a milder tendency in Doyle's works. Our findings point to the potential of automated language processing approaches to facilitate early diagnosis of dementia.

Biomedical Event Extraction and Management in Big-scale Biomedical Literature

Rize Jin, Jinseon You, and Jong C. Park
42nd KIISE Winter Conference, Phoenix Park, December 17-19, 2015. (poster presentation)

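As large volumes of biological literature have accumulated, tools such as literature management systems and search engines have emerged to support biology researchers. These tools greatly help biological research, but they still fall short in handling complex operations. Search engines in particular can easily process word-level queries, but their handling of complex queries that express relations between words remains inadequate. To process such complex queries, text mining research such as gene identification and biological event identification has been actively pursued in biomedical language processing and has reached considerable accuracy. However, because these text mining systems perform more complex operations than before, they were not designed for large-scale processing, which has become a serious problem as large-scale processing is increasingly needed in biomedical language processing. In this study, we present a way to upgrade an event identification system, one such text mining system, using the distributed system Hadoop so that it can process large-scale data effectively.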

A New Measure of Clustering and Switching Based on Bigrams

Maria Wolters, Sarah MacPherson, Jinseon You, Rize Jin, Seung-Cheol Baek, and Jong C. Park
Psychonomic Society's 56th Annual Meeting, Chicago, USA, November 19-22, 2015. (poster presentation)

The category fluency task (CFT) provides important information about executive abilities such as initiation, set-shifting, and inhibition. CFT sequences are generated by retrieving groups of related words (“clusters”) from semantic memory. Manual annotation schemes have been developed for inferring these clusters from transcribed CFT sequences (Troyer 2008), but these are time-consuming and require training. We propose an automatic analysis technique that is based on a simple statistical model of CFT sequences. This model can be easily adapted to different languages and domains, given sufficient training data. CFT sequences (domain “animals”) were generated by 104 younger adults aged 18-34 years and 100 older adults aged 50-84 years who were native speakers of UK English. The sequences were categorised both manually and using our automated method, with key measures such as the number of switches significantly correlating (rho=0.4, 95% CI [0.28-0.51]). Both methods also resulted in the significant age differences that are consistently reported in the cognitive aging literature.
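
One way a bigram-based switch count of the kind named in the title could work is sketched below. The abstract does not spell out the statistical model, so everything here is an assumption: a transition counts as a switch when its ordered word pair was seen fewer than `min_count` times in training sequences, and the function names and toy data are hypothetical.

```python
from collections import Counter


def train_bigram_counts(sequences):
    """Count how often each ordered pair of adjacent words occurs across sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return counts


def count_switches(sequence, bigram_counts, min_count=2):
    """Count transitions whose bigram was seen fewer than min_count times in training."""
    return sum(1 for pair in zip(sequence, sequence[1:]) if bigram_counts[pair] < min_count)


# Toy "animals" sequences, invented for this example.
training = [
    ["cat", "dog", "lion", "tiger"],
    ["cat", "dog", "cow", "sheep"],
    ["lion", "tiger", "cow", "sheep"],
]
counts = train_bigram_counts(training)
switches = count_switches(["cat", "dog", "lion", "tiger", "cow", "sheep"], counts)
print(switches)
```

In this toy data the frequent pairs (cat-dog, lion-tiger, cow-sheep) behave like clusters, and the rarer transitions between them are counted as switches.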

Corpus Annotation with a Linguistic Analysis of the Associations between Event Mentions and Spatial Expressions

Jin-Woo Chung, Jinseon You, and Jong C. Park
Proceedings of the 29th Pacific Asia Conference on Language, Information, and Computation (PACLIC 29), pp. 539-547, Shanghai, China, October 30-November 1, 2015.

Recognizing spatial information associated with events expressed in natural language text is essential for the proper interpretation of such events. However, the associations between events and spatial information found throughout the text have been much less studied than other types of spatial association as looked into in SpatialML and ISO-Space. In this paper, we present an annotation framework for the linguistic analysis of the associations between event mentions and spatial expressions in broadcast news articles. Based on the corpus annotation and analysis, we discuss which information should be included in the guidelines and what makes it difficult to achieve a high inter-annotator agreement. We also discuss possible improvements on the current corpus and annotation framework for insights into developing an automated system.

A System for Constructing a Korean-to-KSL Parallel Corpus

Jung-Ho Kim, Umang Sehgal, and Jong C. Park
17th Annual Conference on Korean Sign Language, Kongju University, Gongju, Korea, August 15, 2015. (poster presentation)

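A Korean-Korean Sign Language (KSL) parallel corpus is essential because it can be used in related dictionaries and automatic translation systems. Unlike the construction of ordinary parallel corpora, however, it is not easy to build because of the spatial-linguistic characteristics of sign language. In this study, we propose a system that can construct a Korean-KSL parallel corpus efficiently.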

CoMAGD: Annotation of Gene-Depression Relations

Rize Jin, Jinseon You, Jin-Woo Chung, Hee-Jin Lee, Maria Wolters, and Jong C. Park
Proceedings of the 2015 ACL Workshop on Biomedical Natural Language Processing (BioNLP 2015), pp. 104-113, Beijing, China, July 30, 2015.

Clinical depression is a mental disorder involving genetic and environmental factors. Although much work has studied its genetic causes and numerous candidate genes have consequently been looked into and reported in the biomedical literature, no gene expression changes or mutations regarding depression have yet been adequately collected and analyzed for its full pathophysiology. In this paper, we present a depression-specific annotated corpus for text mining systems that aim to provide a concise review of depression-gene relations, as well as to capture complex biological events such as gene expression changes. We describe the annotation scheme and the conducted annotation procedure in detail. We discuss issues regarding proper recognition of depression terms and entity interactions for future approaches to the task. The corpus is available at http://www.biopathway.org/CoMAGD.

Identification of Depression-Gene Associations from Biomedical Literature

Jinseon You, Rize Jin, Hee-Jin Lee, and Jong C. Park
Korea Computer Congress (KCC), Jeju, Korea, June 24-26, 2015.

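Depression is a representative mental disorder of modern people; its symptoms vary with the amount of related hormone secretion, which in turn varies with changes in the expression of related genes. Identifying depression-related genes and uncovering the relations among them would greatly help the development of antidepressants. Research on this is actively under way, but it is difficult to identify all related genes at once. In this paper, we adopt a methodology for finding relations between cancer and genes to build a system that automatically identifies relations between depression and genes. We expect this to be of great help in building the corpora needed to uncover deeper relations between depression and genes.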

Corpus Annotation for the Linguistic Analysis of Reference Relations between Event and Spatial Expressions in Text

Jin-Woo Chung, Hee-Jin Lee, and Jong C. Park
Language and Information, Vol. 18, No. 2. pp. 141-168, 2014.

Recognizing spatial information associated with events expressed in natural language text is essential not only for the interpretation of such events but also for the understanding of the relations among them. However, spatial information is rarely mentioned as compared to events, and the association between event and spatial expressions is also highly implicit in a text. This makes it difficult to automate the extraction of spatial information associated with events from the text. In this paper, we give a linguistic analysis of how spatial expressions are associated with event expressions in a text. We first present issues in annotating narrative texts with reference relations between event and spatial expressions, and then discuss surface-level linguistic characteristics of such relations based on the annotated corpus to give helpful insight into developing an automated recognition method.

Construction of a Korean-to-KSL Parallel Corpus by Effective Motion Capture of Hand Shapes

Jung-Ho Kim and Jong C. Park
41st KIISE Winter Conference, Phoenix Park, December 18-20, 2014. (poster presentation)

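In this study, we present an efficient method for collecting hand shapes in order to build a parallel corpus between Korean and Korean Sign Language (KSL), using Leap Motion to recognize and collect sign language motions within the range of hand movements. To verify the performance of a parallel corpus built with the presented method, we collected 46 sign motions and additionally selected 54 sign motions that had not been collected in advance, and observed recognition with an average accuracy of 42.15% and a recall of 58.72% over the total of 100 signs. The proposed method relies on widely available equipment and thus shows potential for collecting data on a large scale and concurrently.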

An Effective Construction of a Korean-to-KSL Parallel Corpus

Jung-Ho Kim and Jong C. Park
Proceedings of the 26th Annual Conference on Human and Cognitive Language Technology (HCLT), pp. 13-17, ChunCheon, Korea, October 10-11, 2014.
(selected as best paper)

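This study addresses the construction of a parallel corpus between Korean and Korean Sign Language together with the problems it entails. To build the parallel corpus efficiently, we used Kinect and Leap Motion, and to verify their performance we compared the glove-based motion recognition and collection method presented in previous work with the collection method presented in this study; the comparison confirmed that our results do not differ significantly from those collected with gloves. This suggests that our motion collection method is competitive with the relatively costly glove-based method. Because it also relies on commonly available means of data collection, data can be collected concurrently, so we expect that sizable parallel corpora can be built more efficiently.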

Relation Information Extraction using a Comprehensive Representation Scheme: Applications to Oncology

Hee-Jin Lee
PhD Dissertation, KAIST, 2014.

Information extraction (IE) is a task of identifying relevant information from input text and producing structured data as output. While explicit expressions describing the target information are the basis for the development of IE systems, in-depth analysis of the input text becomes necessary when the information is conveyed implicitly in the text. In this dissertation, we address a specialized IE method for gene-cancer relations conveyed implicitly in biomedical text. Automatic identification of gene-cancer relations from a large volume of biomedical text is an important task for cancer research, since changes in genes are known to be the main cause of oncogenesis. In particular, it is essential to understand how a gene affects a cancer and to classify genes into oncogenes (genes that cause cancers), tumor suppressor genes (genes that protect cells from cancers) and biomarkers (genes that indicate normal or cancerous states), since such classification facilitates the process of treatment and diagnosis method development. However, despite the high volume of information on such gene classes that is conveyed implicitly with detailed descriptions about gene and cancer properties, there is not yet an IE system that is targeted at such implicit information. In this dissertation, we claim that in order to classify genes into candidates of oncogenes, tumor suppressor genes and biomarkers, gene-cancer relations described in biomedical text must be characterized with 1) how a gene changes; 2) how a cancer changes; and 3) the causality between the gene and the cancer. We propose a comprehensive representation scheme that identifies gene-cancer relations upon the three aspects above and use it for developing an advanced text mining system for oncogenes, tumor suppressor genes and biomarkers. 
The proposed representation scheme is shown to be adequate enough to describe the set of information that can be identified objectively from biomedical text, giving rise to an annotated corpus, or CoMAGC. The mapping between the proposed representations and the gene classes is encoded into a set of inference rules, which are validated through manual annotation and comparison with other biology databases. We present an implemented IE system that automatically extracts the information as defined by the proposed scheme, or OncoSearch. Together, we anticipate that CoMAGC and OncoSearch will enable more focused research into oncology, in the face of the rapidly accumulating amount of work in the field.

OncoSearch: Cancer Gene Search Engine with Literature Evidence

Hee-Jin Lee, Tien Cuong Dang, Hyunju Lee, and Jong C. Park
Nucleic Acids Research, (1 July 2014) 42 (W1):W416-W421. (SCI IF 8.278)

In order to identify genes that are involved in oncogenesis and to understand how such genes affect cancers, abnormal gene expressions in cancers are actively studied. For an efficient access to the results of such studies that are reported in biomedical literature, the relevant information is accumulated via text-mining tools and made available through the Web. However, current Web tools are not yet tailored enough to allow queries that specify how a cancer changes along with the change in gene expression level, which is an important piece of information to understand an involved gene's role in cancer progression or regression. OncoSearch is a Web-based engine that searches Medline abstracts for sentences that mention gene expression changes in cancers, with queries that specify (i) whether a gene expression level is up-regulated or down-regulated, (ii) whether a certain type of cancer progresses or regresses along with such gene expression change and (iii) the expected role of the gene in the cancer. OncoSearch is available through http://oncosearch.biopathway.org.

Mention-Level Gene Normalization on Multi-Species and Multiple Identifiers

Joon-Yeob Kim
MS Thesis, KAIST, 2014.

Identification of Speakers in Fairytales with Linguistic Clues

Hye-Jin Min, Jin-Woo Chung, and Jong C. Park
Language and Information, Vol. 17, No. 2. pp. 93-121, 2013.

Identifying the speakers of individual utterances mentioned in textual stories is an important step towards developing applications that involve the use of unique characteristics of speakers in stories, such as robot storytelling and story-to-scene generation. Despite the usefulness, it is a challenging task because not only human entities but also animals and even inanimate objects can become speakers especially in fairytales so that the number of candidates is much more than that in other types of text. In addition, since the action of speaking is not always mentioned explicitly, it is necessary to infer the speaker from the implicitly mentioned speaking behaviors such as appearances or emotional expressions. In this paper, we investigate a method to exploit linguistic clues to identify the speakers of utterances from textual fairytale stories in Korean, especially in order to handle such challenging issues. Compared with the previous work, the present work takes into account additional linguistic features such as vocative roles and pairs of conversation participants, and proposes the use of discourse-level turn-taking behaviors between speakers to further reduce the number of possible candidate speakers. We describe a simple rule-based method to choose a speaker from candidates based on such linguistic features and turn-taking behaviors.

Augmenting Biological Text Mining with Symbolic Inference

Jong C. Park and Hee-Jin Lee
'Biological Knowledge Discovery Handbook', editors: Mourad Elloumi and Albert Y. Zomaya, Wiley, December 27, 2013.

In this chapter, the authors review recent work on such “next-level” text-mining tools. In particular, they focus on the work that uses symbolic inference to augment text-mining, apart from distributional analysis that is based on the co-occurrence of biological terms and statistical methods. By symbolic inference, they refer to the methods of deriving new information from known facts that are represented with nonnumeric symbols to which inference rules are applied deterministically rather than probabilistically. The studies reviewed in this chapter target one of two abstract tasks. The first task is to recognize information not explicitly stated but implied in a document, where the targeted information is often scattered across multiple sentences. The second is to propose newly predicted biological knowledge using information gathered from the literature. They briefly review text-mining work with distributional analysis to contrast the use of symbolic inference with the use of distributional analysis.

On Mention-Level Gene Normalization

Joon-Yeob Kim, Seung-Cheol Baek, Hee-Jin Lee, and Jong C. Park
5th International Symposium on Languages in Biology and Medicine (LBM 2013), Tokyo, Japan, 12th and 13th December, 2013.

Document-level gene normalization (DGN), which produces a list of gene identifiers relevant to an input document, helps database curators to search for articles of interest by indexing articles with gene identifiers. Recent advances in automatic extraction of information from the biology literature call for mention-level gene normalization (MGN) systems. However, there have been no annotated corpora for MGN, probably because of a somewhat unfounded assumption (convertibility assumption) that it might be straightforward to map gene mentions into gene identifiers given a list of gene identifiers for the document. In the present work, we constructed gold standard annotations for the MGN task and assessed the validity of the convertibility assumption with GeneTUKit (Huang et al., 2011), a state-of-the-art DGN system.

Sign Language Animation Generation

Jong C. Park
Invited Presentation, Fall Colloquium, Department of Humanities and Social Sciences, KAIST, November 19, 2013.

CoMAGC: a Corpus with Multi-faceted Annotations of Gene-Cancer Relations

Hee-Jin Lee, Sang-Hyung Shim, Mi-Ryoung Song, Hyunju Lee, and Jong C. Park
BMC Bioinformatics, 14:323, doi:10.1186/1471-2105-14-323, 14 November 2013. (SCI IF 3.02)

Background
In order to access the large amount of information in biomedical literature about genes implicated in various cancers both efficiently and accurately, the aid of text mining (TM) systems is invaluable. Current TM systems do target either gene-cancer relations or biological processes involving genes and cancers, but the former type produces information not comprehensive enough to explain how a gene affects a cancer, and the latter does not provide a concise summary of gene-cancer relations.

Results
In this paper, we present a corpus for the development of TM systems that are specifically targeting gene-cancer relations but are still able to capture complex information in biomedical sentences. We describe CoMAGC, a corpus with multi-faceted annotations of gene-cancer relations. In CoMAGC, a piece of annotation is composed of four semantically orthogonal concepts that together express 1) how a gene changes, 2) how a cancer changes and 3) the causality between the gene and the cancer. The multi-faceted annotations are shown to have high inter-annotator agreement. In addition, we show that the annotations in CoMAGC allow us to infer the prospective roles of genes in cancers and to classify the genes into three classes according to the inferred roles. We encode the mapping between multi-faceted annotations and gene classes into 10 inference rules. The inference rules produce results with high accuracy as measured against human annotations. CoMAGC consists of 821 sentences on prostate, breast and ovarian cancers. Currently, we deal with changes in gene expression levels among other types of gene changes. The corpus is available at http://biopathway.org/CoMAGC under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0).

Conclusions
The corpus will be an important resource for the development of advanced TM systems on gene-cancer relations.

Parsing Dependency Paths to Identify Event-Argument Relation

Seung-Cheol Baek and Jong C. Park
Proceedings of the 6th International Joint Conference on Natural Language Processing (IJCNLP), Nagoya, Japan, October 15-17, 2013, pp. 699-705.

Mentions of event-argument relations, in particular dependency paths between event-referring words and argument-referring words, can be decomposed into meaningful components arranged in a regular way, such as those indicating the type of relations and the others allowing relations with distant arguments (e.g., coordinate conjunction). We argue that the knowledge about arrangements of such components may provide an opportunity for making event extraction systems more robust to training sets, since unseen patterns would be derived by combining seen components. However, current state-of-the-art machine learning based approaches to event extraction tasks take the notion of components at a shallow level by using n-grams of paths. In this paper, we propose two methods called pseudo-count and Bayesian methods to semi-automatically learn PCFGs by analyzing paths into components from the BioNLP shared task training corpus. Each lexical item in the learned PCFGs appears in 2.6 distinct paths on average between event-referring words and argument-referring words, suggesting that they contain recurring components. We also propose a grounded way of encoding multiple parse trees for a single dependency path into feature vectors in linear classification models. We show that our approach can improve the performance of identifying event-argument relations in a statistically significant manner.

Speaker-TTS Voice Mapping towards Natural and Characteristic Robot Storytelling

Hye-Jin Min, Sang-Chae Kim, Joon-Yeob Kim, Jin-Woo Chung, and Jong C. Park
Proceedings of the 22nd IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN 2013), pp. 793-800, Gyeongju, Korea, August 26-29, 2013.

Robot storytelling has the potential for its practical use in various domains such as entertainment, education, and rehabilitation. However, relying on human-recorded voices for natural storytelling is costly, and automation with text-to-speech systems is not readily applicable due to the difficulty of reflecting the full nature of stories in TTS systems. In this paper, we address the problem of automating robot storytelling with a particular focus on two issues: speaker identification and speaker-TTS voice mapping. We first conduct text analysis with rich linguistic clues to identify speakers from a given textual story. We then consider the task of speaker-TTS voice mapping as the graph coloring problem and propose effective algorithms for assigning voices to speakers given a limited number of TTS voices. Finally, we perform a user experiment on validating the usefulness of our method. The results demonstrate that our system significantly outperforms baseline systems and is also more acceptable to users.
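
The abstract above frames speaker-TTS voice mapping as a graph coloring problem. A minimal sketch of that framing follows, under stated assumptions: the function name `assign_voices`, the idea that an edge joins two speakers who should sound different, the greedy highest-degree-first ordering, and the fallback for too few voices are all illustrative choices and are not taken from the paper, whose actual algorithms are not described here.

```python
from collections import defaultdict


def assign_voices(conflicts, voices):
    """Greedy graph coloring: give each speaker a voice unused by its neighbors.

    conflicts: iterable of (speaker_a, speaker_b) pairs that should sound
               different (an assumed notion of conflict, e.g. they converse).
    voices:    list of available TTS voice labels (hypothetical names).
    """
    neighbors = defaultdict(set)
    for a, b in conflicts:
        neighbors[a].add(b)
        neighbors[b].add(a)

    assignment = {}
    # Handle the most constrained speakers first; break ties by name.
    for speaker in sorted(neighbors, key=lambda s: (-len(neighbors[s]), s)):
        used = [assignment[n] for n in neighbors[speaker] if n in assignment]
        free = [v for v in voices if v not in used]
        if free:
            assignment[speaker] = free[0]
        else:
            # Too few voices: reuse the one least common among neighbors.
            assignment[speaker] = min(voices, key=used.count)
    return assignment


conflicts = [("wolf", "pig"), ("pig", "narrator"),
             ("wolf", "narrator"), ("pig", "mother")]
result = assign_voices(conflicts, ["voice_a", "voice_b", "voice_c"])
```

With three voices, every pair of conflicting speakers in this toy story receives distinct voices; with fewer voices the fallback branch accepts some clashes rather than failing.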

Enhancing Readability of Web Documents by Text Augmentation for Deaf People

Jin-Woo Chung, Hye-Jin Min, JoonYeob Kim, and Jong C. Park
International Conference on Web Intelligence, Semantics, and Mining (WIMS), Madrid, Spain, June 12-14, 2013.

Deaf people have particular difficulty in understanding text-based web documents because their mother language, or sign language, is essentially visually oriented. To enhance the readability of text-based web documents for deaf people, we propose a news display system that converts complex sentences in news articles into simple sentences and presents the relations among them with a graphical representation. In particular, we focus on the tasks of 1) identifying subordinate and embedded clauses in complex sentences, 2) relocating them for better readability and 3) displaying the relations among the clauses with the graphical representation. The results of our evaluation show that the proposed system does simplify complex sentences in news articles effectively while maintaining their intended meaning, suggesting that our system can be used in practice to help deaf people to access textual information.

DigSee: Disease Gene Search Engine with Evidence Sentences (version cancer)

Jeongkyun Kim, Seongeun So, Hee-Jin Lee, Jong C. Park, Jung-jae Kim, and Hyunju Lee
Nucleic Acids Research, Vol. 41, No. W1, pp. 501-517, 12 June 2013 (SCI IF 8.026).

Biological events such as gene expression, regulation, phosphorylation, localization and protein catabolism play important roles in the development of diseases. Understanding the association between diseases and genes can be enhanced with the identification of involved biological events in this association. Although biological knowledge has been accumulated in several databases and can be accessed through the Web, there is no specialized Web tool yet allowing for a query into the relationship among diseases, genes and biological events. For this task, we developed DigSee to search MEDLINE abstracts for evidence sentences describing that ‘genes’ are involved in the development of ‘cancer’ through ‘biological events’. DigSee is available through http://gcancer.org/digsee.

Generating Chatting Messages in a Consistent Style with Authorship Attribution Methods

Sang-Chae Kim
MS Thesis, KAIST, 2013.

Blog Corpus-based Clustering Scheme for Category Fluency Test (CFT) Data Clustering

Yong-Jae Lee, Maria Wolters, Hee-Jin Lee, and Jong C. Park
HCI Conference Korea, High1 Resort, Gangwon, Jan. 30-Feb. 1, 2013.

Category Fluency Test (CFT) is one of the most popular methods to screen dementia and is used in particular to evaluate the organization of the semantic memory and verbal fluency of a patient with dementia. The CFT performance is assessed according to the number of items each patient produces during the test. Recently, however, researchers have also proposed to evaluate the performance by considering the pattern of clusters and switches of the CFT data, with efforts to figure out the clusters and switches on the CFT data computationally. In this work, we propose a novel blog corpus-based clustering scheme to analyze the clusters and switches of the CFT data in a computational manner. In addition, we will argue for the need of the blog corpus-based clustering scheme by comparing it with the previous work on automatic CFT data clustering.

Analyzing and Mapping Expressions of Tense for Korean-Korean Sign Language Translation

JoonYeob Kim, Jin-Woo Chung, and Jong C. Park
Proceedings of the KIISE Fall Conference, Vol. 39 No. 2-B, pp. 121-123, Chungnam National University, November 23-24, 2012.

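Sign language is a visual language used mainly in the Deaf community and differs considerably from Korean, a spoken language, in its modes of expression. In particular, whereas Korean marks tense explicitly by attaching specific grammatical morphemes to the predicate, sign language has neither morphemes that attach to the predicate nor separate function words for tense, which makes it difficult to preserve the tense of the predicate. In this paper, based on an analysis of the tense expressions appearing in each sentence of Korean-sign language parallel data, we discuss the ways of expressing tense that are needed to convert a given Korean sentence into an appropriate sign language sentence.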

Product Name Classification for Product Instance Distinction

Hye-Jin Min and Jong C. Park
The 26th Pacific Asia Conference on Language, Information, and Computation (PACLIC 26), Bali, Indonesia, November 7-10, 2012.

Product names with a temporal cue in a product review often refer to several product instances purchased at different times. Previous approaches to product entity recognition and temporal information analysis do not take into account such temporal cues and thus fail to distinguish different product instances. We propose to formulate the resolution of such product names as a classification problem by utilizing time expressions, event features and other temporal cues for a classifier in two stages, detecting the existence of such temporal cues and identifying the purchase time. The empirical results show that term-based features and existing event-based features together enhance the performance of product instance distinction.

Automatic Speaker Identification in Fairytales towards Robot Storytelling

Hye-Jin Min, Sang-Chae Kim, and Jong C. Park
Proceedings of the 24th Annual Conference on Human and Cognitive Language Technology (HCLT), pp. 77-84, Busan, Korea, October 12-13, 2012.

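In this study, with the goal of automatic storytelling by robots, we address the problem of identifying the speaker of an utterance, which can be used for recognizing the emotion in utterances and for selecting varied TTS voices for each character. In addition to features widely used in existing rule-based methods, namely the position of the candidate, whether the speaker candidate is in the nominative or accusative case, and the presence of a speech verb, we use as additional features the semantic classification of characters that frequently appear in fairytales and verbs related to the entrance and exit of characters. On a fairytale corpus in which humans, animals, plants, and inanimate objects can all be speakers, training and validating a decision tree with the proposed features improved accuracy by up to 49% over a rule-based baseline, and we confirmed that the proposed method is also robust to changes in the data.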

Use of Clue Word Annotations as the Silver-standard in Training Models for Biological Event Extraction

Seung-Cheol Baek and Jong C. Park
Proceedings of the 5th International Symposium on Semantic Mining in Biomedicine (SMBM 2012), pp. 34-41, University of Zurich, Switzerland, September 3-4, 2012.

Current state-of-the-art approaches to biological event extraction train models by reconstructing relevant graphs from training sentences, where labeled nodes correspond to tokens that indicate the presence of events and the relations between nodes correspond to the relations between these events and their participants. Since multi-word expressions may also indicate events, these approaches use heuristic rules to define target graphs to reconstruct by mapping various clue words into single tokens. Since training instances define actual problems to solve, the method of deriving graphs must affect the system performance, but there has not been any related study on this aspect, to the best of our knowledge. In this study, we propose an incorporation of an EM algorithm into supervised learning to look for training graphs that are more favorable for model construction. We evaluate our algorithm on the development dataset in the 2009 BioNLP shared task and show that this algorithm makes a statistically meaningful improvement on the performance of trained models over a supervised learning algorithm on a fixed set of training graphs. The models and graphs are available at http://biopathway.org/EventExtraction/.

Identifying Mentions about Long-term Experiences and Sentiment Change on a Specific Target based on Linguistic Analysis: Application to a Product Review Domain

Hye-Jin Min
PhD Dissertation, KAIST, 2012.

People post and share their experiences through social media on the web these days. The resulting user-generated web documents have become a useful source of advice for making a decision or resolving difficulties because people can learn from others’ past successes or failures. Recently, in response to the rapid growth of such documents and great potential of experience-based information, research has been conducted on analyzing experiences in user-generated web documents. Earlier work has addressed the issue of distinguishing “experience sentences” from others and has proposed a discrimination method based on the linguistic properties of the mentioned events in such sentences. However, such work has focused mostly on a single event at a sentence level in large-scale data, so that a meaningful series of a specific person’s experiences on a particular target has not been analyzed fully yet. This dissertation presents a method to analyze mentions about target-oriented experiences. More specifically, we propose a novel method to identify mentions about a customer’s experiences on a particular product in two aspects: long-term experiences and sentiment change in such experiences. As for long-term experiences, the hypothesis is that two types of linguistic expressions, time expressions and product names, fully capture the customer’s long-term experiences mentioned in a review. As for sentiment change, the hypothesis is that sentiment change can be determined by detecting the state in such a review in which the overall sentiment towards a product instance purchased at a certain time in the past may not be the same as the overall sentiment towards another instance purchased at the latest time. In this dissertation, we address three major research questions. The first question is about identifying product names. Unlike previous research on identification at the product entity level, instance-level identification for instance distinction should be accounted for.
Our research question is to determine the types of linguistic features that are useful for such distinction. Based on experimental results, we argue that linguistic features including time expressions, term-based features and event features should be combined differently with respect to the linguistic characteristics of the product names referring to each type of instance. More specifically, we argue that the best combination for the distinction between recent purchases and past purchases is time expressions and term-based features, and the best combination between recent purchases and recent & past purchases is time expressions and event features. The second question is about sentiment classification regarding product names. The inherent polarity of the adjectival modifier should be blocked when it is used to refer to the property or the identity of the product. Regarding the question of determining the context in which the polarity of the adjectival modifier should be blocked, we argue that the refined blocking rules with the semantic types of nouns, verbs, and clauses based on compositionality-based syntactic rules enhance the sentiment classification performance, especially for neutral sentences. As for product name-sentiment association, we argue that comparative expressions are crucial to associating the compared target with the sentiment opposite to the one in the given grammatical structure and also argue that the product names referring to generic objects are crucial to discarding the sentiment in the given grammatical structure. The last question is about how we utilize the results from our method. As practical applications, we demonstrate a system that identifies helpful reviews by utilizing the proposed measure. The user study shows that this measure is not only as helpful as the best existing ones, such as ‘helpful vote’ or ‘reviewer rank’, but is also free from undesirable biases.
We also illustrate another application that rates product reviews with respect to sentiment change. The user study shows that the review rating system based on sentiment change is more credible than the system based on the clause-level sentiment classification.

Towards Automatic Evaluation of Category Fluency Test Performance: Distinguishing Groups using Word Clustering

Yong-Jae Lee, Maria Wolters, Hee-Jin Lee, and Jong C. Park
Korea Computer Congress (KCC), Jeju, Korea, June 27-29, 2012.

The Category Fluency Test (CFT) is a widely used verbal fluency test. The standard measure for scoring the test is the number of distinct words that a subject generates during the test. Recently, other measures have also been proposed to evaluate performance, such as clustering and switching. In this study, we examine whether clusters and switches can be assessed using word similarity measures. Based on these measures, we can distinguish between subject groups.
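
To make the notion of switching concrete, the sketch below counts a switch whenever the similarity between two consecutive words falls below a threshold. This is an illustrative reading only: the function names, the toy similarity table, and the threshold value are assumptions and do not reproduce the similarity measures or settings used in the study.

```python
# Hypothetical toy grouping used only to build an example similarity function.
TOY_GROUPS = {
    "cat": "pets", "dog": "pets",
    "lion": "wild", "tiger": "wild",
    "cow": "farm",
}


def toy_similarity(word_a, word_b):
    """Return 1.0 if both words share a toy group, else 0.0 (assumed measure)."""
    return 1.0 if TOY_GROUPS.get(word_a) == TOY_GROUPS.get(word_b) else 0.0


def count_switches(sequence, similarity, threshold=0.5):
    """Count transitions whose similarity falls below the threshold."""
    switches = 0
    for prev, curr in zip(sequence, sequence[1:]):
        if similarity(prev, curr) < threshold:
            switches += 1
    return switches


sequence = ["cat", "dog", "lion", "tiger", "cow"]
n_switches = count_switches(sequence, toy_similarity)
# For a non-empty sequence, the number of clusters is one more than the switches.
n_clusters = n_switches + 1
```

Any real word similarity measure with the same two-argument signature could be passed in place of `toy_similarity`; the threshold would then need tuning against manually annotated sequences.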

Age and Gender Prediction from Korean Tweets with Stylometric Analysis

Sang-Chae Kim and Jong C. Park
Korea Computer Congress (KCC), Jeju, Korea, June 27-29, 2012.

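People develop their own distinctive writing styles while being influenced by their surroundings. People of the same age group and gender therefore tend to show similar writing styles. Based on this assumption, we analyzed the style of tweets written by people of various age groups and genders and conducted experiments to predict the age group and gender of the author of an arbitrary tweet.
By combining features built from expressions frequently seen in Korean web language with n-gram features, which depend less on the data, we obtained predictions with accuracy about 25% higher than the maximum-likelihood baseline. We also gained a better understanding of how effectively each feature set contributes to prediction.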

Quality Analysis of User-generated Content on the Web

Jong C. Park and Hye-Jin Min
'Knowledge Service Engineering Handbook', editors: Jussi Kantola and Waldemar Karwowski, CRC Press, Taylor & Francis Group, pp. 197–220, May, 2012.

E3Net: A System for Exploring E3-mediated Regulatory Networks of Cellular Functions

Youngwoong Han, Hodong Lee, Jong C. Park, and Gwan-Su Yi
Molecular and Cellular Proteomics, March Issue, 2012, doi:10.1074/mcp.O111.014076, December 22, 2011. (SCI IF 8.35)

Ubiquitin-protein ligase (E3) is a key enzyme targeting specific substrates in diverse cellular processes for ubiquitination and degradation. The existing findings of substrate specificity of E3 are, however, scattered over a number of resources, making it difficult to study them together with an integrative view. Here we present E3Net, a web-based system that provides a comprehensive collection of available E3-substrate specificities and a systematic framework for the analysis of E3-mediated regulatory networks of diverse cellular functions. Currently, E3Net contains 2201 E3s and 4896 substrates in 427 organisms and 1671 E3-substrate specific relations between 493 E3s and 1277 substrates in 42 organisms, extracted mainly from MEDLINE abstracts and UniProt comments with an automatic text mining method and additional manual inspection and partly from high throughput experiment data and public ubiquitination databases. The significant functions and pathways of the extracted E3-specific substrate groups were identified from a functional enrichment analysis with 12 functional category resources for molecular functions, protein families, protein complexes, pathways, cellular processes, cellular localization, and diseases. E3Net includes interactive analysis and navigation tools that make it possible to build an integrative view of E3-substrate networks and their correlated functions with graphical illustrations and summarized descriptions. As a result, E3Net provides a comprehensive resource of E3s, substrates, and their functional implications summarized from the regulatory network structures of E3-specific substrate groups and their correlated functions. This resource will facilitate further in-depth investigation of ubiquitination-dependent regulatory mechanisms. E3Net is freely available online at http://pnet.kaist.ac.kr/e3net. Molecular & Cellular Proteomics 11: 10.1074/mcp.O111.014076, 1–14, 2012.

Fairy Tale Summarization through Sentence Selection

SeungJoo An
MS Thesis, KAIST, 2012.

Probabilistic Filtering for a Biological Knowledge Discovery System with Text Mining and Automatic Inference

Hee-Jin Lee and Jong C. Park
Journal of the Korean Society Of Computer and Information, Vol. 17, No. 2, pp. 139-148, February 2012.

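This paper deals with a system that integrates text mining and inference: it automatically extracts molecular-level event information from the biological literature through text mining and automatically infers new biological knowledge from these events. Compared with systems that take as input only information that has been extracted in advance and registered in databases, a knowledge discovery system with such an integrated architecture can use the latest information sooner and can use diverse information beyond predefined formats. On the other hand, because it uses the text mining output as it is, the utility of the whole system may degrade considerably depending on the performance of the text mining module. In this paper, we propose a probability-based filtering method that effectively removes false positives from the text mining results, with the aim of improving the accuracy and utility of the overall knowledge discovery system. The proposed probability-based filtering method outperformed the count-based filtering method used as the baseline.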

Identifying Helpful Reviews Based on Customer’s Mentions about Experiences

Hye-Jin Min and Jong C. Park
Expert Systems With Applications, doi:10.1016/j.eswa.2012.01.116, January 25, 2012. (SCIE IF 1.924)

As numerous on-line product reviews that vary in quality are published every day, much attention is being paid to quality assessment of such reviews. The current metric of using the number of votes by other customers such as ‘helpful vote’, despite its dominance, does not yield a fully effective outcome. In this article, we propose a novel metric to rank product reviews by ‘mentions about experiences’, accounting for customer’s personal experiences, as a way of identifying high quality reviews. The proposed metric has two parameters that capture time expressions related to the use of products and product entities over different purchasing time periods by linguistic clues. The empirical results show that this metric is not only as helpful as the best existing metrics, ‘helpful vote’ or ‘reviewer rank’, but is also free from undesirable biases that either penalize recency or are driven solely by popularity. Our usability study also shows that ordering reviews by our metric is considered helpful on the accounts of both usefulness and satisfaction.

Analyzing the Patterns of Switching and Clustering on CFT Data Using Hidden Markov Model

Yong-Jae Lee, Hee-Jin Lee, Maria Wolters, and Jong C. Park
HCI Conference Korea, Alpensia resort, January 11-13, 2012.

Early detection of dementia allows people to have more time to prepare themselves for its symptoms. As one of the methods to screen for dementia, the Category Fluency Test (CFT) is used to evaluate the organization of semantic memory and to assess the verbal fluency performance of patients with dementia. Recently, various measures to evaluate their CFT performance have been studied and, in particular, clusters and switches of the CFT data are considered important factors. In this work, we analyze the clusters and switches of the CFT data using a Hidden Markov Model (HMM) to verify the hypothesis that a comprehensive pattern analysis of their switches and clusters can reveal important characteristics of verbal fluency performance.

Age Prediction from Korean Tweets with Style-Based Feature Analysis

Sang-Chae Kim and Jong C. Park
HCI Conference Korea, Alpensia resort, January 11-13, 2012.

Authorship attribution is the task of predicting the author of a text by analyzing his/her writing. The increasing popularity of the Internet has made it easy for authorship attribution researchers to access large corpora with annotated authorship. Such large corpora have enabled researchers to predict the authors’ demographic characteristics such as age. In this paper, we analyze tweets in Korean with a small number of style-based features such as emoticons and propose a way of using these features to predict the age group. Our prediction resulted in a relatively high accuracy of 0.75.

Analyzing Disagreements among ICD-9-CM Coders

Seung-Cheol Baek and Jong C. Park
4th International Symposium on Languages in Biology and Medicine (LBM 2011), Nanyang Technological University, Singapore, December 14-15, 2011.

NLP researchers find it difficult to acquire and interpret clinical free text directly, most likely because of their unfamiliarity with medical practices. This is why publicly available annotated corpora would be of much help, but there are still very few in the clinical domain due to patient confidentiality. In this regard, it is encouraging to see that Computational Medicine Center’s 2007 Challenge provides a publicly available corpus consisting of radiology reports with ICD-9-CM codes as independently assigned by three different coders. However, the corpus shows many disagreements among the coders, making it imperative to set the standard correctly for their proper interpretation. A proposal for such a standard as implicitly advanced by its developers is to take the majority annotation. In this paper, we propose an alternative method to address such disagreements. We believe our work not only makes a meaningful improvement on the utility of this corpus but also has good implications for similar tasks, such as ICD-10-CM coding.

Identifying Gene Expression Changes in Prostate Cancer Cells from the Literature

Hee-Jin Lee, Hyunju Lee, and Jong C. Park
4th International Symposium on Languages in Biology and Medicine (LBM 2011), Nanyang Technological University, Singapore, December 14-15, 2011.

We propose to identify information about gene expression changes in diseased cells from the literature, utilizing event extraction techniques. Gene expression changes in a diseased cell or tissue happen when its expression level is either higher or lower than the level in normal states. Such information can be critically used in the next stage of understanding the molecular mechanisms of the disease, leading naturally to its pathway. In this work, we focus on prostate cancer (PC), one of the most troubling cancers.

Detecting and Blocking False Sentiment Propagation

Hye-Jin Min and Jong C. Park
Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP), pp. 354–362, Chiang Mai, Thailand, November 8-13, 2011.

Sentiment detection of a given expression involves interaction with its component constituents through rules such as polarity propagation, reversal or neutralization. Such compositionality-based sentiment detection usually performs better than a vote-based bag-of-words approach. However, in some contexts, the polarity of the adjectival modifier may not always be correctly determined by such rules, especially when the adjectival modifier characterizes the noun so that its denotation becomes a particular concept or an object in customer reviews. In this paper, we examine adjectival modifiers in customer review sentences whose polarity should either be propagated (SHIFT) or not (UNSHIFT). We refine polarity propagation rules in the literature by considering both syntactic and semantic clues of the modified nouns and the verbs that take such nouns as arguments. The resulting rules are shown to work particularly well in detecting cases of ‘UNSHIFT’ above, improving the performance of overall sentiment detection at the clause level, especially in ‘neutral’ sentences. We also show that even such polarity that is not propagated is still necessary for identifying implicit sentiment of the adjacent clauses.

Automatic Conversion of Korean into Korean Sign Language Based on Combinatory Categorial Grammar

Jong C. Park
Keynote Speech, Joint Conference of the Modern Linguistic Society of Korea and the Korean Society for Language and Information, Gongju National University of Education, Korea, November 5, 2011.

Text Parsing for Sign Language Generation with Combinatory Categorial Grammar

Jin-Woo Chung and Jong C. Park
2nd International Workshop on Sign Language Translation and Avatar Technology (SLTAT), 13th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS), University of Dundee, UK, October 23, 2011.

In this paper, we propose a method to convert a written sentence in spoken language into a suitable representation in sign language within the framework of Combinatory Categorial Grammar (CCG). The representation reflects the multi-channel nature of sign language performance, including manual and non-manual linguistic signals of multiple channels and information about their coordination. We show that most information needed to address linguistic phenomena in sign language such as word order, spatial references, classifier construction, and verb inflection can be encoded in the CCG sign lexicon. During the CCG derivation process, a semantic representation for sign language expressions is created so that the resulting output can be directly interpreted as a sequence of signs, each containing manual and non-manual components and representing their coordination and spatial relationship. The derivation process with the constructed lexicon is presented with several examples for Korean Sign Language. We discuss implications of our proposal and future directions.

Revisiting Concatenative Video Synthesis with Relaxed Constraints

Sangyong Gil and Jong C. Park
2nd International Workshop on Sign Language Translation and Avatar Technology (SLTAT), 13th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS), University of Dundee, UK, October 23, 2011.

Reproducing Fairy Tales for Plot Identification

SeungJoo An and Jong C. Park
Proceedings of the 23rd Annual Conference on Human and Cognitive Language Technology (HCLT), pp. 3-8, Seoul, Korea, October 6-7, 2011.

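To understand the story of a text automatically, previous studies have identified the events described in the text and combined them to determine how the story is structured. However, besides requiring a deep semantic understanding of the story, this approach is limited in settings where language resources are scarce, because situations and events vary from text to text. This problem can be addressed without harming the naturalness of story understanding if events can be abstracted and expressed simply. In this paper, as a preliminary study toward event abstraction, we extract the events that characters in a text perform or undergo, identify the flow of events with PMI, and supplement it by adding events that may be missed during story understanding, with reference to linguistic clues. Through this approach, we present a method for reconstructing and simplifying the events that characters can perform.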
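
For readers unfamiliar with PMI (pointwise mutual information), the following minimal sketch shows one way it could score how strongly one event tends to follow another. This is an illustrative assumption, not the procedure of the paper: the function name `event_pair_pmi`, the choice of adjacent ordered pairs, and the toy event sequences are all hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple

def event_pair_pmi(
    sequences: Sequence[Sequence[str]],
) -> Dict[Tuple[str, str], float]:
    """PMI of each adjacent ordered event pair (a, b).

    PMI(a, b) = log2( P(a, b) / (P(a) * P(b)) ), with all probabilities
    estimated over adjacent pairs: P(a) is how often a appears as the first
    element of a pair, P(b) how often b appears as the second.
    """
    pair_counts: Counter = Counter()
    first_counts: Counter = Counter()
    second_counts: Counter = Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            pair_counts[(a, b)] += 1
            first_counts[a] += 1
            second_counts[b] += 1
    total = sum(pair_counts.values())
    return {
        (a, b): math.log2((n * total) / (first_counts[a] * second_counts[b]))
        for (a, b), n in pair_counts.items()
    }

# Hypothetical event sequences, one per story character.
toy_sequences: List[List[str]] = [
    ["meet", "fight", "win"],
    ["meet", "fight", "lose"],
    ["meet", "help", "win"],
]
pmi = event_pair_pmi(toy_sequences)
```

A higher score means the second event follows the first more often than their individual frequencies would suggest, which is one possible signal for ordering events into a flow.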

Reading Desk for Preschool Children and Older People with Emotional Speech Synthesis

Ho-Joon Lee, Yong-Jae Lee, and Jong C. Park
International Conference on Convergence and Hybrid Information Technology (ICHIT), LNCS 6935, pp. 740-747, Daejeon, Korea, September 23-25, 2011.

In this paper, we introduce a reading desk designed to read books to older people and children. For this purpose, we propose a reading desk together with an emotional speech synthesis system for Korean. The reading desk system provides a wireless audio output unit, and the reading desk is directly connected to a laptop computer in order to identify the current user and target reading material. The emotional speech synthesis system for Korean is a prosody re-synthesis system that can render four emotions: anger, fear, happiness, and sadness. The system is also able to modify the speech rate and intensity of the speech as much as users want. We analyzed 240 pieces of emotional speech in order to extract distinct prosody structures for each emotion in Korean. The evaluation results show that we achieved a recognition rate of 48.5% for happiness among the four emotions and that, with enough training experience, the average recognition rate improved up to 95.5% for all emotions.

Linguistic Analysis of Picture Description for Language Impairment Diagnosis

Yong-Jae Lee, Hye-Jin Min, and Jong C. Park
Korea Computer Congress (KCC), Gyeongju, Korea, June 30-July 2, 2011.

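People acquire their own characteristic patterns of language use depending on their upbringing and learning. These patterns provide an indicator of an individual's language fluency, and analyzing them makes it possible to respond actively to changes caused by impairment. However, research on identifying the language use characteristics of a particular individual is still lacking. In this study, to identify individual language use characteristics, we first collected picture description texts from the general population and, based on their analysis, aim to identify characteristics that can be applied to language impairment diagnosis. As a result, we were able to identify some individual characteristics of language use at the morpheme level, at the word level, and in the manner of conveying content. Such characteristics are expected to provide important clues for tracking changes in language use caused by cognitive impairments such as dementia.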

Research on Automatic Sign Language Generation: State of the Art and Future Directions

Jin-Woo Chung, Ho-Joon Lee, and Jong C. Park
Invited Presentation, the 13th Annual Conference on Korean Sign Language, Korea National College of Rehabilitation and Welfare, Pyeongtaek, Korea, June 11, 2011.

Improving Accessibility to Web Documents for the Aurally Challenged with Sign Language Animation

Jin-Woo Chung, Ho-Joon Lee, and Jong C. Park
International Conference on Web Intelligence, Mining and Semantics (WIMS'11), Sogndal, Norway, May 25-27, 2011.

In this paper, we describe how to improve accessibility for the aurally challenged in a web environment, focusing on utilizing a signing avatar for web pages. Many systems were previously proposed to make a web environment more accessible for deaf people by providing signed expressions, i.e. translating written text into sign language animations and presenting them in a proper way, based on the observation that deaf users normally have much difficulty understanding text-based information as well as audio contents. We analyze the strengths and weaknesses of these systems with respect to the design criteria discussed, and propose a system that presents a signing avatar for web page documents via a mobile device, which is expected to overcome the shortcomings of the previous systems and to improve the accessibility of deaf users to textual contents in a web environment. The proposed system has three main parts based on a client-server architecture: 1) a client that executes a web browser and transmits selected text to the server, 2) a server that takes text as input and translates it into signed expressions through a sign language generation module, and 3) a mobile device that displays signing animation transmitted from the server by streaming. We also present some linguistic issues raised by the difference between Korean and Korean Sign Language. To the best of our knowledge, this is the first approach to the use of a mobile device for web document access by aurally challenged people. We discuss implications of our study and future directions.

Natural Language Processing for the Aurally Challenged and for the Elderly

Jong C. Park
Research Seminar, University of Dundee, UK, March 16, 2011

Physical Push with a Socially Intelligent Robot: Make your wishes to 'Genie in the Lamp'

Hye-Jin Min and Jong C. Park
Proceedings of the 6th IEEE/ACM International Conference on Human-Robot Interaction, Late Breaking News, pp. 203-204, March 6-9, 2011, Lausanne, Switzerland. ACM

This paper proposes a robotic agent named ‘Genie’ that understands a user’s wish and gives its possible answers on a social network platform. Once a potential wish is detected upon monitoring the text updates in the micro-blog of the user, the agent initiates a task to help the user with both NLP and metadata analysis. As an interaction scenario, we set the type of a robot as an agent that identifies wishful products by searching for and analyzing product information on the web. After an analysis of the vast amount of data, the agent provides possible answers to the user as a way of granting the wish that might require additional time and effort to achieve. In order to draw the user's attention, the agent makes a physical movement as a push notification with more user-friendliness.

Annotation of Protein State Information in Biomedical Text

Hee-Jin Lee and Jong C. Park
9th Asia Pacific Bioinformatics Conference (APBC), Poster Presentation, Incheon, Korea, January 11-14, 2011.

Korean Speech Synthesis for Automatic Fairy Tale Narration with Automatic Identification of Character Roles

SeungJoo An, Ho-Joon Lee, and Jong C. Park
HCI Conference Korea, Alpensia resort, January 26-28, 2011.

As a growing number of parents leave their children alone while they work, systems that provide necessary services to children are needed. Among these services, an automatic fairy tale narration system can help the language and emotional development of young children. However, if the roles of the characters in the story cannot be determined correctly by an automatic fairy tale narration system, the meaning of fairy tales can be conveyed differently, if not distorted. In this paper, we propose an automatic role identification system based on linguistic clues to classify such roles and, using the classified roles, a speech synthesis system for more natural and clearer automatic fairy tale narration.

Identifying Sentence Types in Korean with Morpho-Syntactic Analysis

Jin-Woo Chung
MS thesis, KAIST, 2011.

Evaluation of Emotion Categories based on the Analysis of Emotion-Rich Fairy Tales

Ho-Joon Lee and Jong C. Park
HCI Conference Korea, Alpensia resort, January 26-28, 2011.

In this paper, we analyze the characteristics of emotion categories derived from the utterances of traditional narrated fairy tales. For this purpose, we extract sentences in which the emotional state of an utterance is explicitly expressed and calculate the distribution of emotion categories over the extracted states. As a result, we find that anger and astonishment are well-defined emotion categories, expressed in a more uniform form than other emotions, whereas the others need more refinement. This finding can be used to improve emotional speech synthesis and recognition systems.

Multi-modal Assessment and Treatment to Retain and Enhance Human Performance in Ageing

Jong C. Park, Jinah Park, KiWoong Kim, and Joon-Kyung Seong
Discerning Diversity in Ageing - SBF/SBMN workshop, University of Edinburgh, UK, November 10, 2010.

Quality of Life Technology for the Aurally Challenged and for the Aged

Jong C. Park
Annual Seminar Series, University of Manchester, UK, November 24, 2010.

Automatic Identification of Character Roles for Natural Fairy Tale Narration

SeungJoo An and Jong C. Park
KIISE Fall Conference, Danguk University, November 5-6, 2010.

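When narrating a fairy tale, the narrator speaks with emotion based on the roles of the characters in the tale. This draws the interest of the young children in the audience and immerses them in the story, thereby improving their comprehension. A proper understanding of the roles of the characters is thus one of the important elements of automatic fairy tale narration. This paper examines the linguistic elements that must be handled in order to classify the roles of characters in fairy tales. On this basis, we also present a system that automatically classifies and processes these roles.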

A Ubiquitous Smart Parenting and Customized Education Service Robot

Ho-Joon Lee and Jong C. Park
The 2010 IEEE Workshop on Advanced Robotics and its Social Impacts (ARSO), 2010.

In this paper, we introduce a u-SPACE service robot, designed to help children who may be left alone while their caregivers are away from home. In order to protect children from indoor dangers, this service robot provides customized guiding messages taking into account the location information and behavioral patterns of a child, after the detection of dangerous objects and situations. And these guiding messages are vocalized by our emotional speech generation system. This emotional speech generation system is also being put to use in reading fairy tales to a child, as a part of a home education service. The outward appearance of the u-SPACE service robot is modeled on a teddy bear, in order to provide a safe and comforting environment for children. Two touch sensors designed for basic interactions between a child and the robot are installed on each hand of the robot, and an RFID tag is placed inside the body. A PDA with a Wi-Fi communication module, a touch screen, and a speaker is used as a main operating device of this u-SPACE service robot.

Detecting and Resolving Syntactic Ambiguity for Automatic Korean-Korean Sign Language Translation

Jin-Woo Chung and Jong C. Park
Proceedings of the 22nd Annual Conference on Human and Cognitive Language Technology, pp. 55-62, 2010.

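Sign language is a visual language used mainly in the Deaf community and differs considerably in syntax from Korean, a spoken language. In particular, because postpositions and endings are rarely used in sign language, the syntactic relations among sentence constituents can become ambiguous if a signed sentence is generated in the conventional way, that is, by removing these elements from a Korean sentence and simply listing the base forms of the constituents without considering word order. In this paper, we regard syntactic ambiguity as arising from particular syntactic structures that additionally emerge while converting a Korean sentence into a signed sentence. We classify these structures into basic argument structure, restrictive modification structure, coordinate structure, and predicative structure, identify each of them, and present a method for resolving the syntactic ambiguity accordingly.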

Personal Prosody Model based Korean Emotional Speech Synthesis

Ho-Joon Lee
PhD dissertation, KAIST, 2010.

Speech is the most basic and widely used communication method for expressing thoughts during human-human interaction and has been studied for user-friendly interfaces between humans and machines. Recent progress in speech synthesis has produced artificial vocal results with very high intelligibility, but the quality of sound and the naturalness of inflection remain major issues. Today, in addition to the need for improvement in sound quality and naturalness, there is a growing need for a method for the generation of speech with emotions to provide the required information in a natural and effective way. For this purpose, various types of emotional expression are usually transcribed first into corresponding datasets, which are then used for the modeling of each type of emotional speech. This kind of massive dataset analysis technique has improved the performance of information providing services both quantitatively and qualitatively. In this dissertation, however, I argue that this approach does not work well with interactions that are based on personal experience such as emotional speech synthesis. We know empirically that individual speakers have their own ways of expressing emotions based on their personal experience, and that massive dataset management may easily overlook these personalized and relative differences. Therefore, this dissertation examines the emotional prosody structures of four basic emotions such as anger, fear, happiness, and sadness, by considering their personalized and relative differences. As a result, this dissertation addresses the tendency for the emotional prosody structures of pitch and speech rate to depend more on individual speakers (i.e. personal information) than intensity and pause length do. This personal information enables the modeling of relative differences of each emotional prosody structure (i.e. personal prosody model), the possibilities of which were dismissed earlier during the application of massive dataset analysis technique. Based on the personal prosody model, we develop a Korean emotional speech synthesis system that can add emotional information to spoken expressions. In order to convert input sentence into speech, we used a commercial Korean TTS system with a female voice. The evaluation results show that we can successfully incorporate this personal information into an emotional prosody synthesis system, which enhances the recent progress in the recognition rate for happiness and other emotions. We have achieved 48.5% of the recognition rate for happiness among four emotions, which used to be close to the chance level. And from a series of repeated perception tests supported by enough prior training experience, the average recognition rate has improved up to 95.5% for all emotions. We also show the applicability of the proposed Korean emotional speech synthesis system with the implementation of a speech interface of assistive robots designed for the elderly that can modify its prosodic structure according to sentence types and emotional states.

Sentence Type Identification in Korean: Applications to Korean-Sign Language Translation and Korean Speech Synthesis

Jin-Woo Chung, Ho-Joon Lee, and Jong C. Park
Journal of the HCI Society of Korea, Vol. 5, No. 1, pp. 25-35, 2010.
(selected as best paper)

This paper proposes a method of automatically identifying sentence types in Korean and improving naturalness in sign language generation and speech synthesis using the identified sentence type information. In Korean, sentences are usually categorized into five types: declarative, imperative, propositive, interrogative, and exclamatory. However, it is also known that these types are quite ambiguous to identify in dialogues. In this paper, we present additional morphological and syntactic clues for the sentence type and propose a rule-based procedure for identifying the sentence type using these clues. The experimental results show that our method gives a reasonable performance. We also describe how the sentence type is used to generate non-manual signals in Korean-Korean sign language translation and appropriate intonation in Korean speech synthesis. Since the use of sentence type information in speech synthesis and sign language generation has not been much studied previously, it is anticipated that our method will contribute to research on generating more natural speech and sign language expressions.

Wrestling with Biomedical Research Results: Language Resources and Literature Analysis

D. Rebholz-Schuhmann, Nigel Collier, Jong C. Park, and Limsoon Wong
Journal of Bioinformatics and Computational Biology (JBCB), Vol. 8, No. 1, pp. 129-130, Imperial College Press, February 2010.

Intonation Generation for Korean Speech Synthesis with Automated Sentence Type Classification

Jin-Woo Chung, Ho-Joon Lee, and Jong C. Park
21st HCI Conference Korea, Phoenix Park, January 27-29, 2010.

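Speech is the most basic means of conveying information in human-human interaction and has recently come to be widely used as an effective means of natural interaction between humans and machines, including robots. Speech is a written linguistic expression converted into sound, and it carries intonation information. If this intonation is not expressed properly, even the information carried by the text is difficult to convey in full, so expressing intonation that fits the situation is very important. In Korean speech, the overall intonation of a sentence varies with the type of the sentence, so the sentence type must be identified well for natural speech synthesis. In this paper, we therefore propose a sentence type classification system that automatically classifies the types of Korean sentences, and a speech synthesis system that generates intonation suited to the classified sentence type in order to produce natural speech.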

Extracting Melodies from Piano Solo Music Based on Characteristics of Music

Yoonjae Choi and Jong C. Park
Journal of the Korean Institute of Information Scientists and Engineers (KIISE): Computing Practices and Letters, Vol. 15, No. 12, pp. 923-927, 2009.

The recent growth of a digital music market induces increasing demands for music searching and recommendation services. In order to improve the performance of music-based application services, the process of extracting melodies from polyphonic music is essential. In this paper, we propose a method to extract melodies from piano solo music which is highly polyphonic and has a wide pitch range. We categorize piano music into three classes taking into account the characteristics of music, and extract melodies according to each class. The performance evaluation for the implemented system showed that our method works successfully on a variety of piano solo music.

Automated Classification of Sentential Types in Korean with Morphological Analysis

Jin-Woo Chung and Jong C. Park
Language and Information, Vol. 13, No. 2, pp. 59-97, 2009.

The type of a given sentence indicates the speaker's attitude towards the listener and is usually determined by its final endings and punctuation marks. However, some final endings are used in several types of sentences, which means that we cannot identify the sentential type by considering only the final endings and punctuation marks. In this paper, we propose methods of finding some other linguistic clues for identifying the sentential type with a morphological analysis. We also propose to use these methods to implement a system that automatically classifies sentences in Korean according to their sentential types.

Automatic Extraction of the Usage Information from the Component Words in Gene Ontology Terms to Enhance Consistency and Predictability

Seung-Cheol Baek and Jong C. Park
3rd International Symposium on Languages in Biology and Medicine (LBM 2009), long paper, Seogwipo, Korea, November 8-10, 2009.

The Gene Ontology (GO) is a controlled vocabulary that has gone through constant changes, motivated primarily by the need to reflect the dynamic nature of knowledge it addresses and the need for usability improvement. A good policy on such changes would be to maintain consistency across terms and structures so as to highlight the missing parts that are likely to be added afterwards, or the unchanged parts to which a policy on usability improvement might not have yet applied. In particular, we argue that the component words inside terms must be used consistently across terms, in order to enhance the predictability of such terms, thus their usability as well. For this purpose, we propose a representation for word usage and a method for extracting it from GO and show its utility in identifying the direction of future changes readily as well as in enhancing the consistency of terms.

Extracting Melodies from Piano Music Based on Characteristics of Music

Yoonjae Choi
MS thesis, KAIST, 2009.

Toward finer-grained sentiment identification in product reviews through linguistic and ontological analyses

Hye-Jin Min and Jong C. Park
Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pp. 169-172, Singapore, August 2-7, 2009.

We propose categories of finer-grained polarity for a more effective aspect-based sentiment summary, and describe linguistic and ontological clues that may affect such fine-grained polarity. We argue that relevance for satisfaction, contrastive weight clues, and certain adverbials work to affect the polarity, as evidenced by the statistical analysis.

Text and Sign Language Animation with Combinatory Categorial Grammar

Jong C. Park
Invited talk, Institute of Communicating and Collaborative Systems (ICCS), University of Edinburgh, UK, July 3, 2009.

A Text Mining Tool for Ubiquitin-Protein Ligases

Jong C. Park
Invited talk, Centre for Systems Biology, University of Edinburgh, UK, July 8, 2009.

Interpretation of User Evaluation for Emotional Speech Synthesis System

Ho-Joon Lee and Jong C. Park
13th International Conference on Human-Computer Interaction (HCII 2009), San Diego, USA, July 19-24, 2009.

Whether it is for human-robot interaction or for human-computer interaction, there is a growing need for an emotional speech synthesis system that can provide the required information in a more natural and effective manner. In order to identify and understand the characteristics of basic emotions and their effects, we propose a series of user evaluation experiments on an emotional prosody modification system that can express either perceivable or slightly exaggerated emotions classified into anger, joy, and sadness as an independent module for a general purpose speech synthesis system. In this paper, we propose two experiments to evaluate the emotional prosody modification module according to different types of the initial input speech. We also provide a supplementary experiment to understand the apparently prosody-independent emotion, namely joy, by replacing the resynthesized joy speech information with an original human voice recorded in the emotional state of joy.

Extracting Melodies from Piano Solo Music Based on Characteristics of Music

Yoonjae Choi and Jong C. Park
Proceedings of the Korea Computer Congress (KCC 2009), Vol. 36, No. 1(A), pp. 124-125, Jeju, July 1-3, 2009.
(selected as best paper)

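With the development of the Internet, research on methods for searching and using multimedia materials is being actively pursued. In particular, the rapid growth of the digital music market has steadily increased the demand for music search and recommendation, and improving the performance of the music-based application systems that provide such services requires extracting melodies from polyphonic music, the general form of music. In this paper, we propose a method for extracting melodies from piano solo music, which can produce music with high polyphonic complexity and a wide pitch range.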

Extracting Melodies from Polyphonic Piano Solo Music Based on Patterns of Music Structure

Yoonjae Choi, Ho-Joon Lee, Hodong Lee, and Jong C. Park
Proceedings of the 20th Human Computer Interaction (HCI 2009), pp. 725-732, Phoenix Park, Feb 9-11, 2009.

Thanks to the development of the Internet, people can easily access a vast amount of music. This brings attention to application systems such as a melody-based music search service or music recommendation service. Extracting melodies from music is a crucial process to provide such services. This paper introduces a novel algorithm that can extract melodies from piano music. Since piano can produce polyphonic music, we expect that by studying melody extraction from piano music, we can help extract melodies from general polyphonic music.

Function-focused Gene Clustering by Utilizing Granularities of Gene Functions

Tak-eun Kim
MS thesis, KAIST, 2009.

Automatic Identification of the Relation between Dependency Relations and Definitions of GO Concepts

Seung-Cheol Baek
MS thesis, KAIST, 2009.

Analysis and Use of Intonation Features for Emotional States

Ho-Joon Lee and Jong C. Park
Proceedings of the 20th Annual Conference on Human and Cognitive Language Technology, pp. 144-149, October 11-12, 2008.

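In this paper, we use as an emotional speech corpus a total of 240 sentences, consisting of 8 sentences uttered by 6 speakers in 5 emotional states, to analyze the intonation patterns characteristic of each emotional state, and we discuss how to apply these intonation patterns to a speech synthesis system. To this end, we analyze the characteristic intonation patterns of each emotional state with a focus on the length of intonational phrases, the phrase-final boundary tones of intonational phrases, and declination, and we show the process of applying to a synthesis system the intonation features that can distinguish the emotions of joy, sadness, anger, and fear. Through this study, we were able to confirm the rising intonation that appears in the emotion of anger and to find characteristic intonation patterns for each emotion.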

Towards Knowledge Discovery through Automatic Inference with Text Mining in Biology and Medicine

Hee-Jin Lee and Jong C. Park
3rd International Symposium on Semantic Mining in Biomedicine (SMBM), Turku, Finland, September 1-3, 2008.

Field experts in biology and medicine search the literature for state-of-the-art results and occasionally discover knowledge through manual inference on published causal relations. However, the results of such inference cannot be sufficiently accurate and/or complete, as the domain of published relations is rather huge. In this paper, we introduce an automatic inference system, BioDetective, which works on literature-mined qualitative causal information in biology and medicine. BioDetective provides proofs for such qualitative causal information, and predicts the existence of new causal information, if there is any. The system is tested with a case study, where literature-mined information about protein regulation is utilized to come up with new knowledge.

Computational Processing of Verb Agreement for Automatic Generation of Sign Language Animation

Sangha Kim
MS thesis, KAIST, 2008.

An effective way to learn biological knowledge with linguistic resources

Jin-Bok Lee, Tak-eun Kim, and Jong C. Park
18th International Congress of Linguists (CIL 18), Seoul, Korea, July 21-26, 2008.

The most general and effective way for people to acquire desired knowledge is to learn from tutors with face-to-face contact. The tutors can pick out important pieces of information and deliver them systematically to the learners considering their specialties, interests, rates of progress, and so on. However, since not all learners can be taught by tutors at a convenient time, the field of e-learning or distance learning has emerged.
To maintain the benefits of face-to-face learning in an automatic way, the challenge remains in equipping computers with the expertise, skills and modes of action of the human tutor, overcoming spatial, temporal, socio-economic and environmental restrictions. In order to overcome these challenges, we focus on two issues: (1) information investigation: how to pick out essential pieces of information that do not include overlapping or obsolete pieces, and (2) information delivery: how to deliver the selected ones to learners effectively in terms of understanding and memorization.
In this paper, we propose a web-based smart tutoring system for helping biology-major students to learn about genes. To incorporate the two issues described above into our tutoring system, we extensively use linguistic resources in the biology domain, such as Gene Ontology or UMLS, for selecting and classifying information from a huge amount of data. We believe that our tutoring system can autonomously carry out almost all the functionalities of a human tutor, including investigation, delivery, and adaptation to learners' feedback.

Syntactic Construction of Coordination in Sign Language Generation

Hodong Lee, Sangha Kim, and Jong C. Park
18th International Congress of Linguists (CIL 18), Seoul, Korea, July 21-26, 2008.

Coordination in sign languages is an essential construction to describe more than one kind of information, as used in natural languages. Although it may appear to follow general rules of coordination, its realization with multi-channel motions is often quite different from that in natural languages, due to the differences at levels of syntax and semantics. A multi-channel motion is simultaneously composed of shape, position, orientation and movement of the hands, arms, body, or face. In this paper, we address the problems in converting coordination-bearing sentences into their matching motions in sign languages. In particular, we focus on the issues between the Korean language and the Korean sign language (KSL).

E3Miner: a text mining tool for ubiquitin-protein ligases

Hodong Lee, Gwansu Yi, and Jong C. Park
Nucleic Acids Research, Vol. 36, Web Server issue, doi:10.1093/nar/gkn286, published 15 May 2008 (SCI IF 8.026).

Ubiquitination is a regulatory process critically involved in the degradation of >80% of cellular proteins, where such proteins are specifically recognized by a key enzyme, or a ubiquitin-protein ligase (E3). Because of this important role of E3s, a rapidly growing body of the published literature in biology and biomedical fields reports novel findings about various E3s and their molecular mechanisms. However, such findings are neither adequately retrieved by general text-mining tools nor systematically made available by such protein databases as UniProt alone. E3Miner is a web-based text mining tool that extracts and organizes comprehensive knowledge about E3s from the abstracts of journal articles and the relevant databases, supporting users to have a good grasp of E3s and their related information easily from the available text. The tool analyzes text sentences to identify protein names for E3s, to narrow down target substrates and other ubiquitin-transferring proteins in E3-specific ubiquitination pathways and to extract molecular features of E3s during ubiquitination. E3Miner also retrieves E3 data about protein functions, other E3-interacting partners and E3-related human diseases from the protein databases, in order to help facilitate further investigation. E3Miner is freely available through http://e3miner.biopathway.org.

Monitoring the evolutionary aspect of the Gene Ontology to enhance predictability and usability

Jong C. Park, Tak-eun Kim, and Jinah Park
BMC Bioinformatics 2008, 9(Suppl 3):7 doi:10.1186/1471-2105-9-S3-S7, 11 April 2008.

Background: Much effort is currently made to develop the Gene Ontology (GO). Due to the dynamic nature of information it addresses, GO undergoes constant updates whose results are released at regular intervals as separate versions. Although there are a large number of computational tools to aid the development of GO, they are operating on a particular version of GO, making it difficult for GO curators to anticipate the full impact of particular changes along the time axis on a larger scale. We present a method for tapping into such an evolutionary aspect of GO, by making it possible to keep track of important temporal changes to any of the terms and relations of GO and by consequently making it possible to recognize associated trends.
Results: We have developed visualization methods for viewing the changes between two different versions of GO by constructing a colour-coded layered graph. The graph shows both versions of GO with highlights to those GO terms that are added, removed and modified between the two versions. Focusing on a specific GO term or terms of interest over a period, we demonstrate the utility of our system that can be used to make useful hypotheses about the cause of the evolution and to provide new insights into more complex changes.
Conclusions: GO undergoes fast evolutionary changes. A snapshot of GO, as presented by each version of GO alone, overlooks such evolutionary aspects, and consequently limits the utilities of GO. The method that highlights the differences of consecutive versions or two different versions of an evolving ontology with colour-coding enhances the utility of GO for users as well as for developers. To the best of our knowledge, this is the first proposal to visualize the evolutionary aspect of GO.

Text Mining and Management Tools for Resource Construction and Validation in the Life Sciences

Jong C. Park
Dagstuhl Seminar on Text Mining and Ontologies for Life Sciences, Schloss Dagstuhl, Wadern, Germany, March, 2008.

Sign Language Generation with Animation by Adverbial Phrase Analysis

Sangha Kim and Jong C. Park
17th Human Computer Interaction (HCI 2008), Phoenix Park, Feb 13-15, 2008.
(selected as best paper)

Sign languages, commonly used in aurally challenged communities, are a kind of visual language expressing sign words with motion. Spatiality and motility of a sign language are conveyed mainly via sign words as predicates. A predicate is modified by an adverbial phrase with an accompanying change in its semantics so that the adverbial phrase can also affect the overall spatiality and motility of expressions of a sign language. In this paper, we analyze the semantic features of adverbial phrases which may affect the motion-related semantics of a predicate in converting expressions in Korean into those in a sign language and propose a system that generates corresponding animation by utilizing these features.

On the Automatic Generation of Illustrations for Events in Storybooks: Representation of Illustrative Events

Seung-Cheol Baek, Hee-Jin Lee, and Jong C. Park
17th Human Computer Interaction (HCI 2008), Phoenix Park, Feb 13-15, 2008.

Storybooks, especially those for children, may contain illustrations. An automated system for generating illustrations would help the production process of storybook publishing. In this paper, we propose a method for automatically generating layouts of objects when generating illustrations. The generated layouts should avoid unnecessary overlap between objects while corresponding to the spatial information in storybooks. We first define a representation scheme for spatial information in natural language sentences using tree structures and predicate-argument structures. Unification of tree structures and Region Connection Calculus are then used to manipulate the information and generate corresponding illustrations.

Visualizing the Temporal Distribution of Terminologies for Biological Ontology Development

Tak-eun Kim, Hodong Lee, Jinah Park, and Jong C. Park
International Conference on Visualization and Data Analysis (VDA), San Jose, USA, 26-31 January, 2008.

Communities in biology have developed a number of ontologies that provide standard terminologies for the characteristics of various concepts and their relationships. However, it is difficult to construct and maintain such ontologies in biology, since it is a non-trivial task to identify commonly used potential member terms in a particular ontology, in the presence of constant changes of such terms over time as the research in the field advances. In this paper, we propose a visualization system, called BioTermViz, which presents the temporal distribution of ontological terms from the text of published journal abstracts. BioTermViz shows such a temporal distribution of terms for journal abstracts in the order of published time, occurrences of the annotated Gene Ontology concepts per abstract, and the ontological hierarchy of the terms. With a combination of these three types of information, we can capture the global tendency in the use of terms, and identify a particular term or terms to be created, modified, segmented, or removed, effectively developing biological ontologies in an interactive manner. In order to demonstrate the practical utility of BioTermViz, we describe several scenarios for the development of an ontology for a specific sub-class of proteins, or ubiquitin-protein ligases.

Interpretation of Natural Language Queries for Effective Data Exploration over Heterogeneous Databases: Applications to Biomedical Domain

Hodong Lee
PhD dissertation, KAIST, 2008.

Data exploration is an essential process for discovering novel knowledge in scientific research. However, it is difficult for field experts to find the target data by exploration alone, especially when the data are scattered over multiple and heterogeneous databases. Since such data are usually associated with one another, there may be appropriate sequences of searches that the field experts can use for queries to reach the target data. In order to help such data exploration, conventional database interfaces provide useful tools for querying in keywords or structured forms. However, we argue that they are inadequate to express the queries for sequences of searches in multiple databases which embody diverse relations among their data. In order to describe such queries in a convenient and expressive manner, we propose to use natural language queries (NLQs) to interact with the databases. Such a database interface shall automatically interpret NLQs into formal language queries (FLQs) that are in turn composed of small FLQs for different databases. This task requires us to address the problem of database heterogeneity due to the differences in formal query languages, database structures, and data contents. The dissertation addresses this problem by considering NLQs as terms and syntactic relations, which respectively correspond to data objects and their operations. We utilize SQL-like expressions to coordinate such terms and syntactic relations, resulting in FLQs via a straightforward mapping process. In this work, we present a method that derives the SQL-like expressions from NLQs in a Combinatory Categorial Grammar (CCG) framework, and then translates the expressions into the locations of data objects accessible from our target databases. The method then constructs FLQs for such locations in possible sequences, taking data associations into account. Our method thus provides a fully automated way to locate and retrieve available data from databases.
We also show that our method works as a useful interface serving data exploration and integration, which help the experts to discover knowledge from heterogeneous databases. As practical examples, we illustrate biomedical applications: protein-seeking for data exploration, a ubiquitin-protein ligase (E3) database for data integration, and an E3 data mining tool for further data integration.

Analysis of Indirect Uses of Interrogative Sentences Carrying Anger

Hye-Jin Min and Jong C. Park
PACLIC 21, Seoul National University, November 1-3, 2007.

Interrogative sentences are generally used to perform speech acts of directly asking a question or making a request, but they are also used to convey such speech acts indirectly. In utterances, such indirect uses of interrogative sentences usually carry the speaker's emotion with a negative attitude, which is close to an expression of anger. The identification of such negative emotion is known as a difficult problem that requires relevant information in syntax, semantics, discourse, pragmatics, and speech signals. In this paper, we argue that the interrogatives used for indirect speech acts could serve as a dominant marker for identifying emotional attitudes, such as anger, as compared to other emotion-related markers, such as discourse markers, adverbial words, and syntactic markers. To support this argument, we analyze dialogues collected from Korean soap operas, and examine the individual or cooperative influences of the emotion-related markers on emotional realization. The user study shows that the interrogatives could be utilized as a promising device for emotion identification.

On the Automatic Generation of Illustrations for Events in Storybooks

Seung-Cheol Baek, Eunyoung Chang, and Jong C. Park
KIISE 2007 Fall Conference, Pusan National University, October 26-27, 2007.

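The boundary between literary authors and the general public is blurring with the rise of Internet novels and the like. People who write works for child readers want to publish them with illustrations. This paper discusses a method for automatically generating an illustration when a user wishes to create one on the theme of a particular event in a children's story. In particular, we propose a method for drawing as an illustration a single event that is expressed by a combination of sentences. We use Combinatory Categorial Grammar as the means of interpreting natural language to extract events.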

Translating a Complex Sentence in Korean into a Sign Language Script for an Automatic Sign Language Generation

Sangha Kim, Eunyoung Chang, and Jong C. Park
The 19th Annual Conference on Human and Cognitive Language Technology (KLIP 2007), Kyungpook National University, October 12-13, 2007.

Characteristics of Spoken Discourse Markers and their Application to Speech Synthesis Systems

Ho-Joon Lee and Jong C. Park
The 19th Annual Conference on Human and Cognitive Language Technology (KLIP 2007), Kyungpook National University, October 12-13, 2007.

Customized Message Generation and Speech Synthesis in Response to the Characteristic Behavioral Patterns of Children

Ho-Joon Lee and Jong C. Park
HCI International, Beijing, P. R. China, July 22-27, 2007.

There is a growing need for a user-friendly human-computer interaction system that can respond to various characteristics of a user in terms of behavioral patterns, mental state, and personality. In this paper, we present a system that generates appropriate natural language spoken messages with customization for user characteristics, taking into account the fact that human behavioral patterns usually reveal one's mental state or personality subconsciously. The system is targeted at handling various situations for five-year-old kindergarteners by giving them caring words during their everyday lives. With the analysis of each case study, we provide a setting for a computational method to identify user behavioral patterns. We believe that the proposed link between the behavioral patterns and the mental state of a human user can be applied to improve not only user interactivity but also the believability of the system.

Accessing and Managing Massive Information Resources using Natural Language Processing

Jong C. Park
Invited talk, KISTI, Daejeon, Korea, May, 2007.

Natural Language Processing and Combinatory Categorial Grammar

Jong C. Park
Invited talk, Korea Institute of Science and Technology (KIST), Seoul, Korea, April, 2007.

Combinatory Categorial Grammar: Fundamental Issues in the State-of-the-Art

Jong C. Park
Korean Society for Language and Information (KSLI), a tutorial presentation at a monthly meeting, Seoul, Korea, April, 2007.

Creating Biomedical Resources with NLP-based Information Extraction

Jong C. Park
Invited talk, Tokyo Forum on Advanced NLP and TM (T-FaNT 07), Tokyo, Japan, March 11-13, 2007.

Representing Emotions with Linguistic Acuity

Hye-Jin Min and Jong C. Park
Conference on Intelligent Text Processing and Computational Linguistics (CICLing), Mexico City, Mexico, February 18-24, 2007.

For a robot to make effective and friendly interaction with human users, it is important to keep track of emotional changes in utterance properly. Emotions have traditionally been characterized by intuitive but atomic categories or as points in evaluation-activity dimensions. However, this characterization falls short of capturing subtle emotional changes either in narration or in text, where the vast majority of information is presented with a host of linguistic constructions that convey emotional information. We propose a novel representation scheme for emotions, so that such important features as duration, target and intensity can also be treated as first-class citizens and systematically accounted for. We argue that it is with this new mode of representation that the subtlety of the emotional flow in utterance can be properly addressed. We use this representation to encode the emotional states and intentions of characters in the drama scripts for soap opera and describe how it is utilized in conjunction with parsing for lexicalized grammars.

Identifying Emotional Cues in Dialogue Sentences According to Targets

Hye-Jin Min and Jong C. Park
HCI Conference Korea, Phoenix Park, February, 2007.

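In everyday conversation or computer-mediated conversation, self-disclosure is a process of sharing personal information with one another in order to maintain an intimate relationship. Personal information in self-disclosure includes thoughts, experiences, and emotions; emotions in particular serve as an effective means of communication for setting the mood of a conversation and keeping it flowing smoothly. When emotions are disclosed in conversation, the actual intensity of expression and the degree of disclosure vary according to the conversation partner (the disclosure target) and the object of the emotional expression (the expression target). In this study, we identify the emotional states of participants in conversations held over an instant messenger, through which people can exchange messages or transfer files over the Internet, taking the disclosure target and the expression target into account. As a preliminary study, we analyze the emotional expression patterns of characters in drama scripts. Using these patterns, we identify the participants' emotional states according to the expression target through syntactic and semantic analysis of dialogue sentences with different disclosure targets, and we provide an interface through which participants can observe their own emotions.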

Searching Animation Models with a Lexical Ontology for Text Animation

Eunyoung Chang, Hee-Jin Lee, and Jong C. Park
HCI Conference Korea, Phoenix Park, February, 2007.

Customized and Selective Interpretation

Jong C. Park
6th Singapore-Korea Joint Workshop on Bioinformatics and Natural Language Processing, Singapore, February 12, 2007.

Automatic Data Integration of Ubiquitin-protein Ligases

Hodong Lee and Jong C. Park
6th Singapore-Korea Joint Workshop on Bioinformatics and Natural Language Processing, Singapore, February 12, 2007.

Extracting Relational Information for Protein Pairs

Hee-Jin Lee and Jong C. Park
6th Singapore-Korea Joint Workshop on Bioinformatics and Natural Language Processing, Singapore, February 12, 2007.

An Ontology-based Approach to Generation of Gene Summaries

Tak-eun Kim and Jong C. Park
6th Singapore-Korea Joint Workshop on Bioinformatics and Natural Language Processing, Singapore, February 12, 2007.

Audioization over Visualization for Effective Knowledge Discovery

Ho-Joon Lee and Jong C. Park
6th Singapore-Korea Joint Workshop on Bioinformatics and Natural Language Processing, Singapore, February 12, 2007.

Resource-bound Information Animation with a Lexical Ontology

Eunyoung Chang and Jong C. Park
6th Singapore-Korea Joint Workshop on Bioinformatics and Natural Language Processing, Singapore, February 12, 2007.

Text Analysis for Facial Animation of Non-Manual Information in Sign Language Generation

Sangha Kim and Jong C. Park
6th Singapore-Korea Joint Workshop on Bioinformatics and Natural Language Processing, Singapore, February 12, 2007.

Extracting Relational Information for Protein Pairs

Jin-Bok Lee, Tak-eun Kim, Chan-Goo Kang, and Jong C. Park
6th Singapore-Korea Workshop on Bioinformatics and Natural Language Processing, Institute for Infocomm Research, Singapore, February 12, 2007.

Document Similarity Assessment with Natural Language Processing: Applications to Background Music Recommendation for Blog Articles

Doojin Park
MS thesis, KAIST, 2007.

Reducing manual curation in Gene Ontology extension

Jin-Bok Lee and Jong C. Park
5th Korea-Singapore Joint Workshop on Bioinformatics and Natural Language Processing, Daejeon, Korea, November 17, 2006.

Exploring a cascade of databases for cancer-related target identification

Hodong Lee and Jong C. Park
5th Korea-Singapore Joint Workshop on Bioinformatics and Natural Language Processing, Daejeon, Korea, November 17, 2006.

Construction of Emotion-related Pathway for Emotional Disorder Diagnosis - Case Study: Serotonin-related Protein Pathway -

Hye-Jin Min and Jong C. Park
5th Korea-Singapore Joint Workshop on Bioinformatics and Natural Language Processing, Daejeon, Korea, November 17, 2006.

Bidirectional Incremental Approach to Efficient Information Extraction: Applications to Biomedicine

Jung-jae Kim
PhD dissertation, KAIST, 2006.
(Outstanding Ph.D. Dissertation Award, 2006. 8.)

Information extraction refers to the task of extracting relevant information from texts. This dissertation aims at extracting information about relations between biomedical concepts, which are explicitly represented with known linguistic structures in biomedical texts. Such structures of a target relation involve a keyword and its semantic arguments, where the keyword indicates the semantic type of the target relation and the arguments indicate the related concepts. The information about relations thus has two types of locality: the information is expressed in the local context of the keyword, called spatial locality, and the keyword has well-known syntactic relations with its arguments, called structural locality. These two types of locality have in the past been handled by pattern matching and partial parsing approaches, respectively, but not at the same time. In this dissertation, we address this problem with a novel approach that searches for the arguments both bidirectionally and incrementally from the keywords. The extraction process is divided into two steps. First, it uses a non-structured pattern that describes a context between a keyword and its arguments, to match an input sentence bidirectionally from the keyword. It then performs syntactic analysis incrementally on candidate arguments and, if necessary, on their sentential contexts as well, with a parser of a combinatory categorial grammar for rigorous syntactic verification of the candidates. The approach addresses the aforementioned spatial locality by utilizing non-structured patterns and the structural locality by employing a lazy evaluation parser that is customized for information extraction. The approach is highly efficient, as evidenced by experimental results, because it can stop the extraction process at any step when the syntactic analysis gives a negative piece of evidence for extracting relevant information.
We also show the applicability of the approach with two different tasks in biomedicine: biological interactions, which are useful for building up biological pathways, and protein-protein contrastive relations, which are useful for refining protein pathways.

Natural Language Processing and Bioinformatics

Jong C. Park
Invited lecture, Norwegian University of Science and Technology (NTNU), Trondheim, Norway, June, 2006.

Customized Emotion Representation for Automatic Generation of Emotionally Appropriate Dialogs

Hye-Jin Min and Jong C. Park
The Korean Society for Emotion & Sensibility, KIST, May, 2006.

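In this study, we analyze a corpus of dialogues between users and a system that provides movie information and recommends movies, and we discuss methods for identifying and describing the universal or individual emotional information that appears in the dialogues. Linguistic information expressing emotion is automatically extracted from the dialogues using natural language processing techniques and is used to generate emotion-bearing dialogue responses. We evaluate the adequacy and usefulness of the proposed description method by analyzing the dialogue corpus with natural language processing techniques, and we present the results.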

Personalized Background Music Recommendation System for User Generated Contents using Collective Intelligence

Doojin Park and Jong C. Park
The Korean Society for Emotion & Sensibility, KIST, May, 2006.

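Recently, on blog services such as Cyworld, many users post background music to accompany their posts. Users choose music they like or music they judge to fit the mood of the post, but selecting appropriate music is not easy. Meanwhile, existing music recommendation systems use affective information entered by experts who analyze a given piece according to music theory, or affective information obtained by analyzing the waveform of the music; however, by the nature of music, the emotions that listeners feel differ according to individual disposition. In this study, we propose a system that uses natural language processing techniques to analyze the text a user posts to a blog, extracts contextual information including the affective information the text contains, and automatically recommends background music matching this information while taking user information into account.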

Text Mining and Management in Biomedicine

Jong C. Park, Gary Geunbae Lee, and Limsoon Wong
Guest Editors' Introduction to the Special Issue, ACM Transactions on Asian Language Information Processing (TALIP), March, 2006.

BioContrasts: Extracting and Exploiting Protein-Protein Contrastive Relations from Biomedical Literature

Jung-jae Kim, Zhuo Zhang, Jong C. Park, and See-Kiong Ng
Bioinformatics, Vol. 22, No. 5, pp. 597-605, March, 2006.

Motivation: Contrasts are useful conceptual vehicles for learning processes and exploratory research of the unknown. For example, contrastive information between proteins can reveal what similarities, divergences and relations there are between the two proteins, leading to invaluable insights for better understanding of the proteins. Such contrastive information is found to be reported in the biomedical literature. However, there have been no reported attempts in current biomedical text mining work that systematically extract and present such useful contrastive information from the literature for exploitation.

Results: Our BioContrasts system extracts protein–protein contrastive information from MEDLINE abstracts and presents the information to biologists in a web-application for exploitation. Contrastive information is identified in the text abstracts with contrastive negation patterns such as ‘A but not B’. A total of 799 169 pairs of contrastive expressions were successfully extracted from 2.5 million MEDLINE abstracts. Using grounding of contrastive protein names to Swiss-Prot entries, we were able to produce 41 471 pieces of contrasts between Swiss-Prot protein entries. These contrastive pieces of information are then presented via a user-friendly interactive web portal that can be exploited for applications such as the refinement of biological pathways.

Availability: BioContrasts can be accessed at http://biocontrasts.i2r.a-star.edu.sg. It is also mirrored at http://biocontrasts.biopathway.org

Supplementary information: Supplementary materials are available at Bioinformatics online.

Contact: skng@i2r.a-star.edu.sg; park@cs.kaist.ac.kr
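
As a rough illustration of the kind of contrastive negation pattern mentioned in the BioContrasts abstract (‘A but not B’), the following minimal sketch matches two single-token names joined by "but not". The regular expression, the function name `extract_contrasts`, and the placeholder names `ProtA` and `ProtB` are assumptions made for illustration; they are not taken from the BioContrasts system, which also grounds names to Swiss-Prot entries.

```python
import re

# Hypothetical pattern: two single-token names joined by "but not".
# The real system's patterns and name grounding are not shown here.
BUT_NOT = re.compile(r"\b([A-Za-z0-9-]+) but not ([A-Za-z0-9-]+)\b")


def extract_contrasts(text: str) -> list[tuple[str, str]]:
    """Return (affirmed, negated) name pairs matched by the 'A but not B' pattern."""
    return BUT_NOT.findall(text)


# Placeholder sentence with made-up names, for illustration only.
sentence = "The kinase phosphorylates ProtA but not ProtB in vitro."
print(extract_contrasts(sentence))
```

A surface pattern like this would miss multi-word names and coordinated lists, which is presumably one reason a full system needs more linguistic analysis.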

Biomedical Text Mining for Knowledge Discovery and Automatic Ontology Extension

Jong C. Park
Invited presentation, Workshop on Text Mining, Ontology, and NLP in Biomedical Fields, Manchester, England, March 20-21, 2006.

u-SPACE: Ubiquitous Smart Parenting and Customized Education

Hye-Jin Min, Doojin Park, Eunyoung Chang, Ho-Joon Lee, and Jong C. Park
HCI Conference Korea, Phoenix Park, February, 2006.

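As parents spend more time on activities outside the home, children also spend more time at home alone. Help is therefore needed that protects children from the indoor hazards they are easily exposed to, without greatly restricting their independence, and that gives appropriate guidance according to the child's psychological and emotional state. In this study, we protect children from physical hazards using RFID technology, and use natural language processing techniques to provide multimedia content, namely music and animation, suited to the child's psychological and emotional state. We also provide information such as schedule management, which requires continuous attention, and guidance on using home appliances that help in daily life, so that children can take care of their own tasks. We design a virtual home and present the results of simulating these services around feasible scenarios.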

Customized Speech Synthesis for Children with Characteristic Behavioral Patterns

Ho-Joon Lee and Jong C. Park
HCI Conference Korea, Phoenix Park, February, 2006.

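Exchanging information between users through speech requires no additional training or equipment and has almost no spatial constraints, so it can be used regardless of age, including by the elderly and the infirm. Because speech can also create synergy through interaction with other channels such as sight and touch, using it as an interface between people and machines can improve information delivery while providing user-friendly services. However, when the same type of spoken information is provided repeatedly in the same situation, the monotony of expression can sharply reduce its effectiveness. Information delivery through speech should therefore be able to hold the user's attention, even in the same situation, by choosing differentiated sentence structures and vocabulary according to the user's behavioral patterns, psychological state, and surroundings. In this paper, we propose a system that provides individualized speech synthesis results for children around five years of age, based on an analysis of their behavioral patterns. To this end, we analyze the main behavioral patterns of children in the physical space of a kindergarten and, by asking practicing kindergarten teachers to convey the same information, identify the sentence structures and lexical characteristics of utterances according to the child's behavioral pattern, location, age, and personality. Finally, to produce individualized speech synthesis results, we simulate the kindergarten space and use RFID to identify the children's behavioral patterns and locations. We then reconstruct the sentence structure and vocabulary of the sentence to be synthesized, reflecting the sentence structures and lexical characteristics analyzed for each situation, to generate individualized speech synthesis results. Based on these results, we examine how children's behavioral patterns affect the sentence structure and vocabulary of utterances, and evaluate the reconstructed utterances.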

Visualization for Digesting a High Volume of the Biomedical Literature

Changsu Lee, Jinah Park, and Jong C. Park
Bioinformatics and Biosystems, Vol. 1, No. 1, pp. 51-60, Feb. 2006.

The paradigm in biology is currently changing from that of conducting hypothesis-driven individual experiments to that of utilizing the results of a massive data analysis with appropriate computational tools. We present LayMap, an implemented visualization system that helps the user to deal with a high volume of the biomedical literature such as MEDLINE, through the layered maps that are constructed on the results of an information extraction system. LayMap also utilizes filtering and granularity for an enhanced view of the results. Since a biomedical information extraction system gives rise to a focused and effective way of slicing up the data space, the combined use of LayMap with such an information extraction system can help the user to navigate the data space in a speedy and guided manner. As a case study, we have applied the system to datasets of journal abstracts on ‘MAPK pathway’ and ‘bufalin’ from MEDLINE. With the proposed visualization, we have successfully rediscovered pathway maps of a reasonable quality for ERK, p38 and JNK. Furthermore, with respect to bufalin, we were able to identify the potentially interesting relation between the Chinese medicine Chan su and apoptosis with a high level of detail.

Effective text visualization for biomedical information

Tak-eun Kim and Jong C. Park
HCI Conference Korea, Phoenix Park, February, 2007.

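The amount of information in the biomedical field is growing very rapidly. Many studies have used text mining techniques to extract useful information from this vast body of information. However, even the extracted information is vast and, being textual, is hard to understand intuitively. An information visualization system is therefore essential for understanding such information more intuitively. Information visualization has been studied extensively in recent years, but because even the visualized information is too vast, a method is needed to filter out the information the user needs. A visualization system should also provide a means for knowledge discovery. In this paper, focusing on text visualization of biomedical information, we propose an effective way of presenting biomedical information and an intuitive interface for knowledge discovery.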

Automatic Extension of Gene Ontology with Induced Prediction and Flexible Validation of Candidate Terms

Jin-Bok Lee and Jong C. Park
5th Singapore-Korea Workshop on Bioinformatics and Natural Language Processing, National University of Singapore, Singapore, February 22, 2006.

Augmenting Visualization with Audioization for Enhanced Knowledge Discovery

Ho-Joon Lee and Jong C. Park
5th Singapore-Korea Workshop on Bioinformatics and Natural Language Processing, National University of Singapore, Singapore, February 22, 2006.

Explorative search with relational description of biological entities into multiple heterogeneous databases

Hodong Lee and Jong C. Park
5th Singapore-Korea Workshop on Bioinformatics and Natural Language Processing, National University of Singapore, Singapore, February 22, 2006.

Diagnoses of Emotional Disorders for Amygdala-related Pathway with BioIE

Hye-Jin Min and Jong C. Park
5th Singapore-Korea Workshop on Bioinformatics and Natural Language Processing, National University of Singapore, Singapore, February 22, 2006.

Towards an efficient CCG parser for RNA secondary structure prediction

Hee-Jin Lee and Jong C. Park
5th Singapore-Korea Workshop on Bioinformatics and Natural Language Processing, National University of Singapore, Singapore, February 22, 2006.

Term Characterization for Information Extraction with Syntactic Pattern Matching

Jung-jae Kim and Jong C. Park
5th Singapore-Korea Workshop on Bioinformatics and Natural Language Processing, National University of Singapore, Singapore, February 22, 2006.

Named Entity Recognition

Jong C. Park and Jung-jae Kim
Chapter six of the book 'Text Mining for Biology', editors: Ben Stapley and Sophia Ananiadou, Artech House Books, January, 2006.

Semantic Representation for Temporal Adverbs and Temporal Morphemes

Eunyoung Chang and Jong C. Park
Proceedings of Annual Conference of the KSLI (Korea Society for Language and Information), pp. 193-207, Kangwon, Korea, 2006.

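A situation is mainly described in a sentence by a predicate, and its temporal meaning is expressed separately by temporal expressions. Among these, temporal adverbs and tense-aspect morphemes (prefinal endings) are known to contribute decisively to tense and aspect, but because several components appear together in a sentence, there is not yet a consensus on the meaning and function of each component. In this paper, we classify the temporal properties of situations and analyze the effect that temporal adverbs and tense-aspect morphemes have on each property, and on that basis propose a lexical-level semantic representation. Temporal adverbs modify the location of the situation time or the temporal properties of the situation, while tense-aspect morphemes express the relation between the utterance time and the situation time, or the speaker's attitude toward the situation. On this basis, we present appropriate lexical categories and show, as a derivation in Combinatory Categorial Grammar, how the final meaning is obtained from their combination.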

Linguistic Characterization of Sign Language Expressions for an Automatic Mapping from Natural Language Sentences

Jiwon Choi, Eunyoung Chang, Hee-Jin Lee, and Jong C. Park
Language and Information, Vol. 10, No. 1, pp. 71-91, 2006.

Generation of Coherent Gene Summary

Chan-Goo Kang
MS thesis, KAIST, 2006.

Automatic extension of Gene Ontology with flexible identification of candidate terms

Jin-Bok Lee, Jung-jae Kim, and Jong C. Park
Bioinformatics, Vol. 22 No. 6, pp. 665-670, 2006.

Motivation: Gene Ontology (GO) has been manually developed to provide a controlled vocabulary for gene product attributes. It continues to evolve with new concepts that are compiled mostly from existing concepts in a compositional way. If we consider the relatively slow growth rate of GO in the face of the fast accumulation of the biological data, it is much desirable to provide an automatic means for predicting new concepts from the existing ones.

Results: We present a novel method that predicts more detailed concepts by utilizing syntactic relations among the existing concepts. We propose a validation measure for the automatically predicted concepts by matching the concepts to biomedical articles. We also suggest how to find a suitable direction for the extension of a constantly growing ontology such as GO.

Availability: http://autogo.biopathway.org

Contact: park@nlp.kaist.ac.kr

Supplementary information: Supplementary materials are available at Bioinformatics online.

CCG-based RNA Secondary Structure Prediction for Structural Homology Analysis

Hee-Jin Lee and Jong C. Park
16th International Conference on Genome Informatics (GIW), Yokohama, Japan, December, 2005.

Various systems have been proposed to predict secondary structures of RNAs using their sequence information. Among them, Uemura et al. [2] described a system that recognizes some typical RNA secondary structures such as hairpin loops and pseudoknots with Tree Adjoining Grammar. However, their work captures only known sub-structures, and not those unknown sub-structures that might also exist. The ternary pseudoknot, composed of three pairs of cross-serially arranged reverse-complementary sequences, may be one such example. Figure 1 illustrates an example ternary pseudoknot. We describe a version of Combinatory Categorial Grammars (CCGs) for an RNA secondary structure prediction system to discover such unknown sub-structures. The parser for the proposed CCG takes an RNA sequence and produces a semantics string that contains structural information about the sequence.

From Text to Sign Language: Exploiting the Spatial and Motioning Dimension

Jiwon Choi, Hee-Jin Lee, and Jong C. Park
Proceedings of the 19th Pacific Asia Conference on Language, Information and Computation (PACLIC 19), pp. 61-69, Taipei, Taiwan, December, 2005.

In this paper, we address the problem of automatically converting information in the Korean language to one in a sign language as used in Korea. First, we discuss the differences between sign language and natural language, and in particular between the sign language in Korea and the Korean language. Then, we focus on issues that are relevant to the process of converting expressions in Korean into their counterparts in the sign language, including: 1) making explicit elided subjects of expressions in Korean, 2) omitting some expressions in Korean, and 3) reordering some expressions. We argue that it is important to utilize the spatial and motioning dimensionality of a sign language in order to minimize information loss and distortion. We also argue that the right decision to omit or to merge some expressions in Korean plays a key role in exploiting this dimensionality. Finally, we present a system that converts sentences in Korean into corresponding animations in the sign language as evidence for our claim.

Dynamic Informative Link Annotation for Biological Text over Heterogeneous Databases

Hodong Lee and Jong C. Park
16th International Conference on Genome Informatics (GIW), Yokohama, Japan, December, 2005.

Linking from textual objects to biological databases is actively performed for efficient data access and information enrichment [2]. This task targets links for particular types of terms, such as names, keywords and symbols, that correspond to individual data entries. However, such one-to-one matching links are still insufficient to make full use of biological data in numerous databases. Previous research has reported further problems: (1) a conceptual term referring to multiple data objects cannot be represented as a one-to-one link [1]; (2) a complex term often corresponds to data objects from multiple databases [6]; (3) a link must remain consistent with data objects that can be changed or removed from a database [4]; and (4) a term is ambiguous due to semantic and syntactic heterogeneity, which requires not only structural and operational pieces of database information but also biological knowledge about the term semantics [4, 5]. We address all the problems above with a dynamic link annotation system that annotates links by formulating the database statement in a formal query language. We are currently developing the system for 13 molecular biology databases mediated by SRS and Entrez: GO, GOA, UniProt, InterPro, EMBL, and Enzyme in SRS; Gene, Protein, Nucleotide, PubMed, OMIM, HomoloGene, and Taxonomy in Entrez.

Vowel Sound Disambiguation for Intelligible Korean Speech Synthesis

Ho-Joon Lee and Jong C. Park
Proceedings of the 19th Pacific Asia Conference on Language, Information and Computation (PACLIC 19), pp. 131-142, Taipei, Taiwan, December, 2005.

For speech synthesis systems that transform text materials into voice data, correctness and naturalness are the crucial measures of performance, the latter gaining more emphasis recently. In order to make synthesized voices natural, we must take into account pronunciation-related linguistic phenomena such as homographs, among others. The syntax certainly provides an important clue to disambiguating such homographs, but the relatively free word order in the Korean language makes it hard to utilize such information. In this paper, we describe a computational generation of contextually appropriate vowel lengths for the words in Korean by utilizing a higher level of linguistic information in a Combinatory Categorial Grammar framework. We consider part-of-speech information, the possibility of conjunction with a suffix, case information, unconjugated adjectives, numerals, numerical adjectives with related nouns, and the relationship between a noun and its predicate as syntactic and semantic clues for vowel sound disambiguation. The results are expressed in Speech Synthesis Markup Language (SSML) for a target system neutral application. The proposed system with correctly predicted vowel sound can be used not only as an educational tool, but also as a plug-in for enhancing the intelligibility of a general purpose Text-to-Speech (TTS) system.

Text Animation with Music

Doojin Park and Jong C. Park
Proceedings of the 32nd Korea Information Science Society (KISS), Vol. 32, No. 2, pp. 526-528, Seoul, November, 2005.

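Music plays an important role in storytelling by conveying the mood and flow of a story. Much research has recently been conducted on automatically inserting suitable music into computer animation, but most of it has addressed synchronization with video material rather than animation that tells a story. Text animation is research on automatically analyzing fairy tales to create animations. In this paper, we show a process for automatically extracting musical features that fit the mood of each scene based on the narrative structure of a fairy tale, and discuss how music can be inserted into text animation using these features.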

Prediction of RNA Secondary Structures in a Combinatory Categorial Grammar Framework

Hee-Jin Lee and Jong C. Park
Proceedings of the First International Symposium on Languages in Biology and Medicine (LBM), pp. 59-62, KAIST, Daejeon, Korea, November, 2005.

In this paper, we define a Combinatory Categorial Grammar (CCG) to model and predict RNA secondary structures. The proposed CCG can be used to capture various RNA secondary structures, including stem-loop and pseudoknot structures. We also argue that the CCG can be used to predict possibly unknown RNA secondary structures, for example an undiscovered structure 'ternary-pseudoknots'.

Automated Linking of Conceptual and Complex Terms into Data Objects in Biological Databases

Hodong Lee and Jong C. Park
Proceedings of the First International Symposium on Languages in Biology and Medicine (LBM), pp. 51-54, Creative Learning Building, KAIST, Daejeon, Korea, November, 2005.

The purpose of a textual link is to provide a one-to-one connection between a term and a related data object. However, this link is insufficient to deal with the conceptual and complex terms that are often used to refer to multiple data objects from heterogeneous databases. In this paper, we present a method that can dynamically create a link to a biological term by automatically constructing a database query for a search into the corresponding data object(s). This method can help the user to quickly build a hypothesis based on data drawn from text, as well as to understand the text by providing an access to relevant information for its biological terms.

Generation of Coherent Gene Summary with Concept-Linking Sentences

Chan-Goo Kang and Jong C. Park
Proceedings of the First International Symposium on Languages in Biology and Medicine (LBM), pp. 41-45, Creative Learning Building, KAIST, Daejeon, Korea, November, 2005.

Typical approaches to automatic summarization make efforts to generate a coherent document by arranging the order of sentences according to certain criteria such as the publication date of the text in which the expression appears. However, when describing a gene, there is no obvious order whatsoever among the facts to be presented. In this work, while generating a summary about a gene, we actually create the order from the unordered set of facts, by introducing new sentences that make associations among the main concepts of those facts.

CCG-based RNA Secondary Structure Prediction

Hee-Jin Lee and Jong C. Park
The First International Symposium on Languages in Biology and Medicine (LBM), Daejeon, Korea, November, 2005.

In this paper, we define a Combinatory Categorial Grammar (CCG) to model and predict RNA secondary structures. The proposed CCG can be used to capture various RNA secondary structures, including stem-loop and pseudoknot structures. We also argue that the CCG can be used to predict possibly unknown RNA secondary structures, for example an undiscovered structure 'ternary-pseudoknots'.

Dynamic and Informative Linking from Biological Text into Heterogeneous Databases

Hodong Lee and Jong C. Park
The First International Symposium on Languages in Biology and Medicine (LBM), Daejeon, Korea, November, 2005.

Linking from textual objects to biological databases is actively performed for efficient data access and information enrichment [2]. This task targets links for particular types of terms, such as names, keywords and symbols, that correspond to individual data entries. However, such one-to-one matching links are still insufficient to make full use of biological data in numerous databases. Previous research has reported further problems: (1) a conceptual term referring to multiple data objects cannot be represented as a one-to-one link [1]; (2) a complex term often corresponds to data objects from multiple databases [6]; (3) a link must remain consistent with data objects that can be changed or removed from a database [4]; and (4) a term is ambiguous due to semantic and syntactic heterogeneity, which requires not only structural and operational pieces of database information but also biological knowledge about the term semantics [4, 5]. We address all the problems above with a dynamic link annotation system that annotates links by formulating the database statement in a formal query language. We are currently developing the system for 13 molecular biology databases mediated by SRS and Entrez: GO, GOA, UniProt, InterPro, EMBL, and Enzyme in SRS; Gene, Protein, Nucleotide, PubMed, OMIM, HomoloGene, and Taxonomy in Entrez.

Intonation Synthesis using Emotional Information from Spoken Fairy Tale

Ho-Joon Lee and Jong C. Park
Proceedings of the 17th Korean Association of Speech Science (KASS), pp. 88-97, Seoul, November 26, 2005.

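With advances in information technology, user-centered interfaces have come to the fore, and the use of speech synthesis technology is steadily increasing. For natural speech synthesis, it is important to generate intonation information suited to the utterance situation; in particular, for natural speech synthesis that follows changes in emotion, pitch must be adjusted appropriately among the components of intonation. Applying emotional information to speech synthesis first requires analysis of speech data in which emotional information is well expressed. Recordings of fairy tale narration are relevant material, since they are rich in emotional expression in order to convey the content to children more vividly. In this study, we analyze a traditional puppet play recorded by professional storytellers to examine the phonological characteristics of utterances according to emotional state, and discuss how to supply emotional information to a speech synthesis system and render it appropriately, based on the relationship between such emotional information and linguistic information such as the syntactic and semantic structure of sentences.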

Modeling Causality in Biological Pathways for Logical Identification of Drug Targets

Il Park and Jong C. Park
Proceedings of the 2005 International Joint Conference of InCoB, AASBi and KSBI (Bioinfo 2005), pp. 373-378, Busan, Korea, September, 2005.

Lexical Disambiguation for Intonation Synthesis: A CCG Approach

Ho-Joon Lee and Jong C. Park
Korean Society for Language and Information (KSLI), June 17-18, 2005.

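With the rapid development of IT, new forms of information delivery continue to emerge, and awareness of the correct pronunciation of Korean is steadily weakening. The situation is especially serious for vowel length, which even speech professionals fail to distinguish accurately. In this paper, we examine the vowel lengthening and shortening phenomena that appear in Korean nouns on the basis of their relations with surrounding words, and discuss how to distinguish long and short vowels for homonymous nouns that are pronounced differently, focusing on the relations between a noun and its modifiers and between a noun and its predicate. The results of the analysis are expressed in Combinatory Categorial Grammar, and the speech synthesis process with lexical ambiguity resolved is described in the standardized SSML (Speech Synthesis Markup Language).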

Induced Extension of Gene Ontology from Biomedical Resources with Flexible Identification of Candidate Terms

Jin-Bok Lee, Jung-jae Kim, and Jong C. Park
The First International Symposium on Semantic Mining in Biomedicine (SMBM), page 13, Cambridge, UK, April, 2005.

Motivation: We present a novel method to predict more detailed terms than those in the present Gene Ontology (GO). We apply this method to semantic tagging for natural language expressions that denote potential GO terms even when there is no direct mapping of such expressions into GO terms. The terms that are newly identified in this process can be used to extend GO by utilizing semantic relations such as hyponyms or synonyms. Finally, we suggest how to find a suitable direction for the possible extension of an ever-growing ontology such as GO.
Results: We provide an automatically extended GO, and tools for its manipulation and validation.
Availability: http://www.biopathway.org
Contact: park@nlp.kaist.ac.kr

Deciding When to Stop: Enhancing the Performance of Information Extraction with Deeper Linguistic Analysis

Jung-jae Kim and Jong C. Park
Proceedings of the 3rd Korea-Singapore Joint Workshop on Bioinformatics and Natural Language Processing, pp. 41-45, Muju Resort, Jeonbuk, South Korea, February, 2005.

Information Visualization with Text Data Mining for Knowledge Discovery Tools in Bioinformatics

Jinah Park, Changsu Lee, and Jong C. Park
Key Engineering Materials, Vols. 277-279, pp. 259-265, 2005. (SCI IF 0.224)

An abundant amount of information is produced in the digital domain, and an effective information extraction (IE) system is required to surf through this sea of information. In this paper, we show that an interactive visualization system works effectively to complement an IE system. In particular, three-dimensional (3D) visualization can turn a data-centric system into a user-centric one by facilitating the human visual system as a powerful pattern recognizer to become a part of the IE cycle. Because information as data is multidimensional in nature, 2D visualization has been the preferred mode. However, we argue that the extra dimension available for us in a 3D mode provides a valuable space where we can pack an orthogonal aspect of the available information. As for candidates of this orthogonal information, we have considered the following two aspects: 1) abstraction of the unstructured source data, and 2) the history line of the discovery process. We have applied our proposal to text data mining in bioinformatics. Through case studies of data mining for molecular interaction in the yeast and mitogen-activated protein kinase pathways, we demonstrate the possibility of interpreting the extracted results with a 3D visualization system.

A Graphic Tool for Curating Molecular Interaction Networks from the Literature

Changsu Lee, Jinah Park, and Jong C. Park
International Journal of Computers in Biology and Medicine, Vol. 35, pp. 555-564, 2005.

We propose a graphic tool for curating molecular interaction networks constructed from the literature by information extraction (IE). In order to turn preliminary results from IE into useful biomedical resources, we propose to use a controlled environment in which visualization and IE work synergistically. The usability of the proposed graphic tool is shown with respect to the identification of incorrectly extracted results that are due to the troublesome coordination phenomena in natural language texts. Through an experiment on molecular interactions in Saccharomyces cerevisiae, we have seen a meaningful increase (from 91.5% to 97.5%) in the number of correctly extracted interaction information.

Automatic Generation of Multimedia Animation from Play Scripts

Doojin Park and Jong C. Park
HCI Conference Korea, 2005.

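Text animation is research aimed at automatically generating animation from natural language sentences. Realizing text animation as the author intended requires not only the characters' actions but also additional multimedia effects. Information indicating such effects is not sufficiently provided in ordinary text, but scripts for stage plays present a variety of additional information in a fairly regular form. In this paper, we show a process that automatically analyzes the dialogue, stage directions, and narration of a play script to generate multimedia animation integrating character actions and sound. Sound is a basic device for dramatic effect; to integrate it effectively with character actions, the sound effects expressed in the script must be extracted directly, or suitable sounds inferred using commonsense information, and rendered spatially and in step with the flow of time. To this end, we analyze the natural language expressions of the play script with Combinatory Categorial Grammar to extract the interactions between character actions and sound effects, and generate the resulting character actions and sound effects as multimedia animation using a 3D model database and a sound database.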

Identification of Emotional Flow from Natural Language Documents

Hye-Jin Min
MS thesis, KAIST, 2005.

Extracting contrastive information from negation patterns in biomedical literature

Jung-jae Kim and Jong C. Park
ACM Transactions on Asian Language Information Processing (TALIP), Special Issue on Text Mining and Management in Biomedicine, 2006.

Expressions of negation in the biomedical literature often encode information of contrast as a means for explaining significant differences between the objects that are so contrasted. We show that such information gives additional insights into the nature of the structures and/or biological functions of these objects, leading to valuable knowledge for subcategorization of protein families by the properties that the involved proteins do not have in common. Based on the observation that the expressions of negation employ mostly predictable syntactic structures that can be characterized by subclausal coordination and by clause-level parallelism, we present a system that extracts such contrastive information by identifying those syntactic structures with natural language processing techniques and with additional linguistic resources for semantics. The implemented system shows the performance of 85.7% precision and 61.5% recall, including 7.7% partial recall, or an F score of 76.6. We apply the system to the biological interactions as extracted by our biomedical information extraction system in order to enrich proteome databases with contrastive information.
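
The figures in the abstract above can be checked against the usual balanced F score, F = 2PR / (P + R). The short sketch below is only a reading of the numbers, not something the abstract states: it assumes the standard harmonic-mean definition, and the helper name `f_score` is ours. Under that assumption, the reported 76.6 seems to correspond to adding the 7.7% partial recall to the 61.5% recall, rather than to 61.5% alone.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F score (harmonic mean of precision and recall), in percent."""
    return 2 * precision * recall / (precision + recall)


precision = 85.7          # reported precision (%)
recall = 61.5             # reported recall (%)
partial_recall = 7.7      # reported partial recall (%)

# Assumption: partial recall is added on top of the 61.5% figure.
print(round(f_score(precision, recall + partial_recall), 1))
# For comparison: the F score using the 61.5% recall alone.
print(round(f_score(precision, recall), 1))
```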

Introduction to the Thematic Session on Text Mining in Biomedicine

Sophia Ananiadou and Jong C. Park
Lecture Notes in Artificial Intelligence (LNAI), Vol. 3248 (revised selected papers from IJCNLP 2004), editors: K-Y Su, J. Tsujii, J.-H. Lee, O. Y. Kwong, p. 776, 2005.

This thematic session follows a series of workshops and conferences recently dedicated to bio text mining in Biology. This interest is due to the overwhelming amount of biomedical literature, Medline alone contains over 14M abstracts, and the urgent need to discover and organise knowledge extracted from texts. Text mining techniques such as information extraction, named entity recognition etc. have been successfully applied to biomedical texts with varying results. A variety of approaches such as machine learning, SVMs, shallow, deep linguistic analyses have been applied to biomedical texts to extract, manage and organize information. There are over 300 databases containing crucial information on biological data. One of the main challenges is the integration of such heterogeneous information from factual databases to texts. One of the major knowledge bottlenecks in biomedicine is terminology. In such a dynamic domain, new terms are constantly created. In addition there is not always a mapping among terms found in databases, controlled vocabularies, ontologies and “actual” terms which are found in texts. Term variation and term ambiguity have been addressed in the past but more solutions are needed. The confusion of what is a descriptor, a term, an index term accentuates the problem. Solving the terminological problem is paramount to biotext mining, as relationships linking new genes, drugs, proteins (i.e. terms) are important for effective information extraction. Mining for relationships between terms and their automatic extraction is important for the semi-automatic updating and populating of ontologies and other resources needed in biomedicine. Text mining applications such as question-answering, automatic summarization, intelligent information retrieval are based on the existence of shared resources, such as annotated corpora (GENIA) and terminological resources. The field needs more concentrated and integrated efforts to build these shared resources. 
In addition, evaluation efforts such as BioCreaTive, Genomic Trec are important for biotext mining techniques and applications. The aim of text mining in biology is to provide solutions to biologists, to aid curators in their task. We hope this thematic session addressed techniques and applications which aid the biologists in their research.

Constructing SSML Documents with Automatically Generated Intonation Information in a Combinatory Categorial Grammar Framework

Lee Hwa Jin, Ho-Joon Lee, and Jong C. Park
International Journal of Computer Processing of Oriental Languages (IJCPOL), Vol. 17, No. 4, pp. 223-238, December, 2004.

As of now, Text-to-Speech (TTS) systems are widely used throughout the full spectrum of our activities, and various natural language processing techniques have been utilized to enhance the performance of such TTS systems. As TTS systems begin to play an important role for communication between human and machine, naturalness is considered the most crucial measure of performance for TTS systems, in addition to correctness. General statistical approaches, though widely adopted, are not appropriate for the phenomena as they assign the same intonation to the same sentence. We analyze various kinds of corpora to extract informative features for intonation generation in a Combinatory Categorial Grammar framework, and express intonation-annotated documents using Speech Synthesis Markup Language for target system neutral application.

Emotion Prediction from Natural Language Documents with Emotion Network

Hye-Jin Min and Jong C. Park
Proceedings of HLT, pp. 191-199, Ulsan, October, 2004.

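In this paper, we propose a model that recognizes emotional states expressed in text, and describe a system that uses this model to predict the emotion expressed in the current sentence and the emotional states that will appear afterwards. To implement a computer system that recognizes the user's emotions and interacts with people through natural messages and actions in response, an emotion model is required that can predict the emotions the user may feel later, using not only the current emotional state but also information about the individual user and about the situation in which the user is interacting with the system. We propose the Emotion Network, a state-transition emotion model that reflects previously identified emotional states, the relation between actual and expressed emotions, the characteristics of surrounding objects that influenced the emotion, and the goals and actions of the person experiencing the emotion. The Emotion Network consists of states representing each emotion, transitions between connected states, and conditions under which the transitions occur. We show that, by applying the Emotion Network to counseling examples in text form, it is possible to predict emotions that are not directly expressed by the emotion words in the document.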
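
The abstract above describes the Emotion Network as states, transitions between connected states, and conditions on those transitions. A minimal sketch of such a structure might look as follows. The class layout, the method names, and the emotion and condition labels (`anxiety`, `goal_blocked`, and so on) are hypothetical choices for illustration and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class EmotionNetwork:
    # transitions[state] holds (condition, next_state) pairs leaving that state.
    transitions: dict[str, list[tuple[str, str]]] = field(default_factory=dict)

    def add_transition(self, source: str, condition: str, target: str) -> None:
        """Register a transition from `source` to `target` guarded by `condition`."""
        self.transitions.setdefault(source, []).append((condition, target))

    def predict(self, state: str, observed: set[str]) -> list[str]:
        """Return the states reachable from `state` whose condition was observed."""
        return [t for c, t in self.transitions.get(state, []) if c in observed]


# Hypothetical emotions and conditions, for illustration only.
net = EmotionNetwork()
net.add_transition("anxiety", "goal_achieved", "relief")
net.add_transition("anxiety", "goal_blocked", "anger")
print(net.predict("anxiety", {"goal_blocked"}))
```

In this sketch a condition is just a label; the model in the abstract conditions transitions on richer information such as the experiencer's goals and actions.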

Identification and Recovery of Elided Information for Text Animation

Eunyoung Chang and Jong C. Park
Proceedings of HLT, pp. 94-102, Ulsan, October, 2004.

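A typical problem in applying speech recognition technology to real life is malfunction caused by the recognizer's low recognition rate. In this study, we present an HTK (Hidden Markov Model Toolkit) continuous speech recognition system for the telebanking domain, and a method based on the maximum entropy technique for measuring recognition confidence for the key words (mostly proper nouns) in user utterances. Recognition confidence was computed by considering both acoustic and linguistic features, and we confirmed that misrecognition of recognized words can be detected with about 86% accuracy. We plan to use this recognition confidence in developing a clarification dialog model for speech recognition.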

Constructing VoiceXML documents with Contextually Appropriate Intonation from Natural Language Dialogues in a Combinatory Categorial Grammar framework

Lee Hwa Jin, Ho-Joon Lee, and Jong C. Park
Proceedings of the 5th China-Korea Joint Symposium on Oriental Language Processing and Pattern Recognition, pp. 2-9, Qingdao, P.R.China, February 25-27, 2004.

Various natural language processing techniques have been utilized to enhance the performance of the Text-to-Speech (TTS) systems to date. Correctness and naturalness are among the working measures for the performance of these systems, where the usual proposals to satisfy the second measure have employed statistical prediction methods to find appropriate intonation for a given sequence of words in a sentence. However, these proposals tend to assign the same intonation to the same word sequence in a sentence, whereas people may associate quite different kinds of intonation with the same word sequence in a sentence depending upon the context in which the sentence is expressed. In this paper, we use a combinatory categorial grammar approach to synthesizing contextually appropriate intonation for dialogues in Korean, taking into account the distinguishing characteristics as identified from the speech corpus. The intonation-annotated dialogues are then translated into corresponding VoiceXML documents, which work as direct inputs to a TTS system for the generation of actual speech data.

Anaphora Resolution in Text Animation

Kyung Wha Hong and Jong C. Park
Proceedings of the IASTED International Conference on Artificial Intelligence and Applications, pp. 347-352, Innsbruck, Austria, February, 2004.

For effective text animation from natural language stories, the source sentences in natural language should be processed not only individually but also as a coherent story as a whole. In particular, it is important that anaphoric expressions are interpreted adequately, since they provide crucial clues for the overall behaviors of story line characters. In text understanding, the task of anaphora resolution has focused primarily on nominal expressions. In text animation, however, there are many other important candidates for anaphoric expressions, including those for actions and events, in addition to objects. In this paper, we provide an analysis of sample fairy tales, and present a classification for the types of anaphoric expressions for text animation. We also describe an implemented text animation system with anaphora resolution.

BioIE: Retargetable Information Extraction and Ontological Annotation of Biological Pathways from the Literature

Jung-jae Kim and Jong C. Park
Journal of Bioinformatics and Computational Biology (JBCB), Vol. 2, No. 3, pp. 551-568, 2004. (SCI IF 1.393)

The need for extracting general biological interactions of arbitrary types from the rapidly growing volume of the biomedical literature is drawing increased attention, while the need for this much diversity also requires both a robust treatment of complex linguistic phenomena and a method to consistently characterize the results. We present a biomedical information extraction system, BioIE, to address both of these needs by utilizing a full-fledged English grammar formalism, or a combinatory categorial grammar, and by annotating the results with the terms of Gene Ontology, which provides a common and controlled vocabulary. BioIE deals with complex linguistic phenomena such as coordination, relative structures, acronyms, appositive structures, and anaphoric expressions. In order to deal with real-world syntactic variations of ontological terms, BioIE utilizes the syntactic dependencies between words in sentences as well, based on the observation that the component words in an ontological term usually appear in a sentence with known patterns of syntactic dependencies.

Case Study: Visualization and Analysis of Mitogen-Activated Protein Kinase Pathways in the Literature

Changsu Lee, Jinah Park, and Jong C. Park
Conference on Visualization and Data Analysis (VDA), pp. 275-285, San Jose, USA, January, 2004.

Data sets of up to 3000 journal abstracts from MEDLINE literature on the keyword combination 'MAPK pathway' and 'human' are visualized and analyzed for mitogen-activated protein kinase (MAPK) pathways. We have tightly coupled exploratory visualization with information extraction for interactive navigation through scattered information sources, in search of useful facts on MAPK by frequency-based filtering and amplification. Unlike direct database visualization that operates on curated data sets, literature visualization has the advantages of manipulating data sets of a massive scale with a lot less manpower and effectively responding to the fast cycles of the developments in the field.

BioAR: Anaphora Resolution for Relating Protein Names to Proteome Database Entries

Jung-jae Kim and Jong C. Park
ACL Workshop on Reference Resolution and its Applications, pp. 79-86, Barcelona, Spain, 2004.

The need for associating, or grounding, protein names in the literature with the entries of proteome databases such as Swiss-Prot is well-recognized. The protein names in the biomedical literature show a high degree of morphological and syntactic variations, and various anaphoric expressions including null anaphors. We present a biomedical anaphora resolution system, BioAR, in order to address the variations of protein names and to further associate them with Swiss-Prot entries as the actual entities in the world. The system shows the performance of 59.5% to 75.0% precision and 40.7% to 56.3% recall, depending on the specific types of anaphoric expressions. We apply BioAR to the protein names in the biological interactions as extracted by our biomedical information extraction system, or BioIE, in order to construct protein pathways automatically.

Annotation of Gene Products in the Literature with Gene Ontology Terms using Syntactic Dependencies

Jung-jae Kim and Jong C. Park
Proceedings of the 1st International Joint Conference on Natural Language Processing (IJCNLP), pp. 528-534, Hainan, P.R.China, 2004.

We present a method for automatically annotating gene products in the literature with the terms of Gene Ontology (GO), which provides a dynamic but controlled vocabulary. Although GO is well-organized with such lexical relations as synonymy, ‘is-a’, and ‘part-of’ relations among its terms, GO terms show quite a high degree of morphological and syntactic variations in the literature. As opposed to the previous approaches that considered only restricted kinds of term variations, our method uncovers the syntactic dependencies between gene product names and ontological terms as well in order to deal with real-world syntactic variations, based on the observation that the component words in an ontological term usually appear in a sentence with established patterns of syntactic dependencies.

Research Trends in Bio Text Mining

Jung-jae Kim and Jong C. Park
Korea Information Science Society SIGBIT News Letter, Vol. 2, No. 1, pp. 14-31, 2004.

An Analysis of Syntactic and Semantic Relations between Negative Polarity Items and Negatives in Korean

Jung-jae Kim and Jong C. Park
Journal of Language and Information, Vol. 8, No. 1, pp. 53-76, 2004.

Negative polarity items (NPIs), which function as quantifiers, are licensed in a syntactically strict way by negatives, which function as qualifiers, resulting in universal negating interpretations as pairs. We present a proposal to explain the related phenomena, in which the syntax and the semantics are closely related to each other, with Combinatory Categorial Grammar. For this purpose, we first adopt the usual approach to scrambling, but control its overgeneration with the use of markers, taking into account the complex syntactic phenomena involving NPIs and scrambling in Korean. We also propose to utilize polarity intensity as a novel feature, in order to account for the universal negating interpretations when NPIs are combined with negatives. Our proposal also explains the difference in readings when other quantifiers or qualifiers intervene between the NPIs and the related negatives. (Korea Advanced Institute of Science and Technology)

Automatic Camera Control for Automated Digital Cinematography from Text

Semin Jang and Jong C. Park
Proceedings of the 31st KISS Spring Conference, Vol. 31, No. 1(B), pp. 904-906, KAIST, Korea, 2004.

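Scripts, which are indispensable in film production, specify cinematic techniques wherever they are needed, so the situations the original author intended can be reproduced fairly accurately when the actual scenes are realized. In contrast, when digital video is to be produced automatically from sources such as traffic accident reports or fairy tales, no such techniques are specified. Producing digital video automatically from material written in natural language therefore requires a way to identify the author's intent and derive appropriate cinematic techniques. Our earlier work showed that automatic animation generation from fairy tales requires component techniques such as time management, reference resolution, position setting, detailed command determination, and control of multiple characters, and in particular presented a method for automatically identifying, as part of time management, the cases where a scene transition is needed. In this paper, we use combinatory categorial grammar to analyze the author's intent as expressed in fairy-tale sentences, present a method for producing digital video that automatically identifies and applies the camera techniques matching that intent, and show an implemented system.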

Automatic Generation of Multimedia Therapeutic Contents with Combinatory Categorial Grammar

Hye-Jin Min and Jong C. Park
HCI/CG/VR/UI/DESIGN, Phoenix Park, 2004.

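With the growth of the Internet, counseling, music therapy, and art therapy, which can be regarded as alternative methods of psychotherapy, are being actively offered on websites that provide counseling for personal concerns. In this paper, we show a process that automatically analyzes a client's written description of a concern, identifies the client's emotional state and the cause of the concern, and generates multimedia therapeutic information that integrates text, pictures, and music. Multimedia therapeutic information refers to information in which text, images, and music files that can help relieve a given emotion are structured together with search terms for the purpose of psychological therapy. To generate the search terms needed to build such information automatically, the way the client expresses emotions related to the concern, the semantic relations involved, and the elapsed time of the emotion must be properly analyzed from the sentences. Additional linguistic characteristics must therefore be considered in greater depth than in previous work, which extracts emotion information by mapping keywords to emotions or by commonsense inference. To this end, we automatically identify the relation between the subject of a subordinate clause, such as an embedded or conjoined clause, and the subject of the main clause, determine the experiencer of the emotion and the target of its expression according to the kind of sentence constituents each verb semantically requires, and track changes in emotional state through temporal adverbs. By implementing these natural language processing steps with combinatory categorial grammar, we show how therapeutic information corresponding to the psychological state expressed in the sentences is retrieved from a structured database to generate multimedia therapeutic information.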

Data-oriented Customized Visual Navigation

Changsu Lee, Jinah Park, and Jong C. Park
HCI/CG/VR/UI/DESIGN, Phoenix Park, 2004.

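The vast and rapidly growing amount of available information, driven by advances in storage media and information technology, makes it difficult for users to understand that information. In the chain of information use that runs from information sources through information filtering to information presentation, previous research on user customization has generally addressed only the filtering stage. However, if customization becomes possible in the presentation stage, which interacts closely with the user, users can control more finely the process of obtaining information that serves their purposes. In this study, we propose a visualization system that plays an active role and supports user customization. The customization function is implemented with a classification method based on the characteristics of the data. Taking biology as the application domain, we propose a method for classifying data according to the characteristics of molecular interaction data, and confirm through experiments that molecular interaction maps customized for each user can be obtained effectively.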

Natural Language Response Generation from Relational Database Query Result

Ji-yong Jung and Jong C. Park
HCI/CG/VR/UI/DESIGN, Phoenix Park, 2004.

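Natural language question-answering interfaces let users access a system easily without special knowledge, making the provision of information easy and natural. However, most previous research has focused on converting natural language into formal languages for database access such as SQL, and has not yet produced satisfactory results in response generation, which must express the results of a query appropriately. Natural language response generation must jointly consider the information the user already knows, the information stored in the database, and the information the user seeks by issuing the query. In addition, to generate responses in the form users expect, the response forms users want must be modeled in advance and the most preferred form must be used. In this study, we address the process of generating responses, customized to the intent of the query, for relational database search results obtained from a user's query. To generate appropriate responses, we present an analysis of a corpus of user questions and answers about travel products in terms of information content and quantity, and accordingly propose a sentence generation method with three stages: content planning, sentence form construction, and lexical expression.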

Contextual Disambiguation of Adverbial Scopes in Korean for Text Animation

Eunyoung Chang, Kyung Wha Hong, and Jong C. Park
HCI/CG/VR/UI/DESIGN, Phoenix Park, 2004.

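To generate animation automatically from text composed of natural language sentences, a sequence of animation commands must be derived from the syntactic, semantic, and discourse information of the sentences. In such sentences, adverbs determine the degree to which the attributes of the corresponding animation command change, and correct interpretation of the various targets an adverb modifies and of its meaning plays an important role in reflecting the intent of the text effectively. However, because the range of targets an adverb can modify is very wide and its meanings are diverse, it is not easy to identify the function of an adverb accurately, not only in complex sentences containing embedded clauses or coordinate structures but even in simple sentences. In this paper, we propose a method of analyzing adverbs for accurate text animation and show its results. Existing research on Korean adverbs mainly resolves structural ambiguity through statistical learning that exploits the co-occurrence between adverbs and the words they modify, and uses only a very small part of the positional constraints on adverbs as constraints on this co-occurrence relation. We consider contextual information together with such information in order to resolve structural ambiguity and derive more accurate meanings. We use combinatory categorial grammar to propose a syntactic and semantic analysis of adverbs, and extend it to provide a grammatical treatment of complex adverbial constructions such as derived adverbs, adverbial phrases, and adverbial clauses. We then verify the animation results through a text animation system that implements the proposed approach.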

Automated Digital Cinematography with Natural Language Processing

Semin Jang
MS thesis, KAIST, 2004.

Automatic Translation of Korean into Korean Sign Language with Combinatory Categorial Grammar

Jiwon Choi
MS thesis, KAIST, 2004.

Applications to Molecular Interactions: Customized Visualization for Knowledge Discovery with Information Extraction

Changsu Lee
MS thesis, KAIST, 2004.
(Outstanding M.S. Thesis Award, 2004. 2.)

Anaphora Resolution for Contextually Appropriate Text Animation

Kyung Wha Hong
MS thesis, KAIST, 2004.

Annotation of Gene Products in the Literature with Gene Ontology Terms using Syntactic Dependencies

Jung-jae Kim and Jong C. Park
Lecture Notes on Artificial Intelligence, Post-Conference Book of IJCNLP-04, 2004.

We present a method for automatically annotating gene products in the literature with the terms of Gene Ontology (GO), which provides a dynamic but controlled vocabulary. Although GO is well-organized with such lexical relations as synonymy, ‘is-a’, and ‘part-of’ relations among its terms, GO terms show quite a high degree of morphological and syntactic variations in the literature. As opposed to the previous approaches that considered only restricted kinds of term variations, our method uncovers the syntactic dependencies between gene product names and ontological terms as well in order to deal with real-world syntactic variations, based on the observation that the component words in an ontological term usually appear in a sentence with established patterns of syntactic dependencies.

Information Visualization in 3-Dimensional Space for Text Data Mining

Jinah Park, Changsu Lee, and Jong C. Park
International Women's Conference on BIEN-Technology, Daejeon, Korea, November, 2003.

Analysis and Computational Processing of Sentences in Korean for Automatic Sign Language Generation

Jiwon Choi and Jong C. Park
Proceedings of the National Conference on Korean Language Processing, pp. 219-226, October, 2003.

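Korean Sign Language has a basic similarity to Korean, but unlike Korean, which is an agglutinative language of the auditory-vocal modality, it also shows the characteristics of an isolating language of the visual-motor modality. Therefore, to generate sign language automatically from Korean sentences in text form, rather than forcing sign language expressions onto a grammar predefined for Korean, it is necessary to analyze and make use of the system by which sign language itself conveys meaning. In this paper, we analyze the linguistic characteristics of sign language expressions by dividing them into four types, namely reproduction, omission, transformation, and movement, and discuss how to process and implement these phenomena using combinatory categorial grammar.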

Towards Automatic Sign Language Generation with Combinatory Categorial Grammar

Jiwon Choi and Jong C. Park
HCI Conference, pp. 481-486, Phoenix Park, Korea, February, 2003.

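Because sign language is a visual language for communication among deaf people, it has a distinctive grammatical structure that is hard to find in other languages, which presuppose the use of speech. However, previous research on the automatic processing of sign language has forced sign language expressions onto a grammar predefined for Korean, which causes many problems in identifying and using the system by which sign language itself conveys meaning. In particular, sign language uses not only cheremes such as hand movement and handshape but also mechanisms of simultaneous expression, so that it can express without ambiguity the subject-object relation in inverted sentences and the agent-patient relation in causative and passive sentences. By using previously designated spatial information as a kind of antecedent, it can also avoid redundant expressions and convey information efficiently. In this paper, we present research results on the elements that are essential for automatically translating natural language expressions, such as those of Korean, into animated sign language expressions through analysis with combinatory categorial grammar. We also present, together with an implementation, a method for generating more natural sign language expressions by making full use of the distinctive linguistic devices found in sign language.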

Anaphora Resolution and Multi-Character Control for Automatic Generation of Multimedia Fairy Tales

Kyung Wha Hong and Jong C. Park
HCI Conference, pp. 487-492, Phoenix Park, Korea, February, 2003.

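To generate a multimedia fairy tale automatically, including animation that properly reflects the story, from a fairy tale given as a document consisting of a sequence of sentences in a natural language such as Korean, the various anaphoric phenomena in the document must be interpreted accurately. Anaphora interpretation for such animation shows more varied types than the anaphora interpretation ordinarily studied in natural language processing to support document understanding. In this paper, we present an implemented system that, in the course of automatically generating a multimedia fairy tale, generates commands for controlling a three-dimensional virtual space by properly taking into account both the anaphoric phenomena in the sentences and the movements of multiple characters. Anaphora interpretation for animation aims to identify the appropriate antecedent of an anaphoric expression, and requires consideration of diverse scene information such as character names, actions, properties, events, and time. Controlling multiple characters in a way that fits the context requires, along with appropriate anaphora resolution, techniques that draw on various kinds of knowledge to give the characters natural movements. We show a system that analyzes fairy tales using combinatory categorial grammar and then automatically generates the corresponding control scripts for the Genesis 3D game engine.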

Mediatory Visualization for Structured Data and Textual Information

Changsu Lee, Jinah Park, and Jong C. Park
The 3rd IASTED International Conference on Visualization, Imaging, and Image Processing (VIIP 2003), pp. 926-932, Benalmadena, Spain, 2003.

When we visualize structured data for knowledge discovery, it is important that the users have easy access to the source textual information, especially when the mapping from the textual information to structured data is not perfect. In this paper, we present a new method for mediatory visualization for structured data and corresponding textual information to address this problem. The two dimensional space for visualizing structured data, such as the protein-protein interaction information collected from biomedical literature by information extraction, is linked perpendicularly to, but conceptually separated from, the pairwise one dimensional space for visualizing corresponding source textual data. The users can concentrate on the information in one space but explore the information in the other space as easily as one may manipulate objects in a three dimensional space. We show that the one dimensional color-banded rods give visual clues and insights to the nature of the underlying English sentence structures, which in turn give rise to useful feedback to the interaction information in the other two dimensional space, and vice versa.

Analysis and Computational Processing of Coordination Ambiguity in Korean

Hodong Lee and Jong C. Park
Journal of Language and Information, Vol. 7, No. 2, pp. 59-79, 2003.

Analysis and Processing of Korean with Quantifier Floating

Jin-Bok Lee and Jong C. Park
Journal of Language and Information, Vol. 7, No. 1, pp. 1-22, 2003.

Logical Representation of Ontological Terminologies in Biomedical Domain

Jung-jae Kim, Jin-Bok Lee, Hye-Jin Min, Ji-yong Jung, and Jong C. Park
Proceedings of the 2nd Annual Conference of The Korean Society for Bioinformatics (KSBI 2003), pp. 79-85, Daejeon, Korea, 2003.

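This paper proposes a method that automatically recognizes protein names in a large volume of biomedical documents, automatically identifies the properties of each protein from the documents, and links them to an existing ontology. Because ontology terms appear in documents in various forms, we automatically converted them into logical representations, and extracted and analyzed the sentences that describe protein properties in the documents in order to compare them with the logical representations of the ontology terms. For recognizing protein properties in documents, we proposed the use of natural language processing techniques such as acronym handling and anaphora resolution.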

Morphological Analysis of Irregular Conjugation in Korean with Micro Combinatory Categorial Grammar

Ho-Joon Lee and Jong C. Park
Proceedings of the KISS Spring Conference, pp. 531-533, 2003.

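In this paper, we discuss a method that uses morpheme-level combinatory categorial grammar to handle several stages of natural language processing, including morphological analysis, in a single derivation, and that reduces the ambiguity and complexity that grow at the morphological analysis stage by using information from higher levels of analysis. We handle the irregular conjugation of predicates, one of the complex linguistic phenomena of Korean, using not only probabilistic information but also higher-level information such as syntactic information, including phonological information, and semantic information, and examine the potential of the approach to develop into a general morphological analyzer.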

Integrated Morphological Analysis for Korean in a Combinatory Categorial Grammar Framework

Ho-Joon Lee
MS thesis, KAIST, 2003.

Word Segmentation for Korean with Syllable-Level Combinatory Categorial Grammar

Ho-Joon Lee and Jong C. Park
Proceedings of the 14th National Conference on Korean Language Processing, pp. 47-54, October, 2002.

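Word spacing in Korean has developed in a distinctive way, unlike English, which has regular word-by-word spacing, or Chinese and Japanese, in which spacing has not developed. Earlier studies mostly aimed at correcting partial spacing errors, but research is now actively under way, in combination with work on character recognition and speech recognition, on automatically restoring the spacing of sentences in which spacing is entirely absent. In this paper, we review the word spacing phenomenon in Korean and previous work on spacing restoration, and explain with syllable-level combinatory categorial grammar the forms that were difficult to handle with existing methods.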

Diphone-based Intonation and VoiceXML Document Generation using Multi-Dimensional Linguistic Information

Lee Hwa Jin and Jong C. Park
Proceedings of the 24th National Conference on Korean Language Processing, pp. 69-76, Cheongju, Korea, October, 2002.

Anaphora Resolution for Contextually Appropriate Animation of Multimedia Fairy Tales

Kyung Wha Hong and Jong C. Park
Proceedings of the 24th National Conference on Korean Language Processing, pp. 317-324, Cheongju, Korea, October, 2002.

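Anaphora is the re-expression of information that has already been mentioned or is assumed to be already known. It is actively studied not only in natural language processing but also in cognitive science, psychology, and philosophy, and the performance of its resolution depends on how the antecedent of an anaphoric expression is selected. In the animation control script commands used to generate multimedia fairy tales from natural language sentences, anaphora resolution is essential for generating natural animation scenes based on proper reference to preceding information. In this paper, we examine the anaphoric phenomena that appear in the natural language sentences of such fairy tales, and discuss a method for resolving them using combinatory categorial grammar and its implementation.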

Analysis and Reconstruction of Temporal Relations in Multimedia Fairy Tales for Digital Cinematography

Semin Jang and Jong C. Park
Proceedings of the 24th National Conference on Korean Language Processing, pp. 309-316, Cheongju, Korea, October, 2002.

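Fairy tales tell their stories by following the flow of events. However, to hold the attention of their child readers, they often arrange events in an order different from the actual one for dramatic effect. In generating animation from fairy tales, it is important to identify correctly the author's intent behind this arrangement of events. In this paper, we examine the linguistic elements that must be handled to identify and make use of the flow of events, and analyze the temporal relations that appear in fairy tales using combinatory categorial grammar. We also propose cinematic techniques for enhancing the animation effect according to each temporal relation, and describe a system that uses them to reproduce the temporal relations.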

Automatic Gene Ontology Extension and Terminology Analysis

Jin-Bok Lee and Jong C. Park
Proceedings of the KISS Conference, pp. 229-231, Suwon, Korea, October, 2002.

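Bioinformatics has become a major research field for dealing efficiently with the vast body of knowledge in biology. In particular, research on automatically extracting information from the biological literature is actively under way, and by using the results of such extraction to extend useful knowledge bases such as Gene Ontology automatically, the explosively growing research results in biology can be integrated into the knowledge base. An automatically extended ontology goes through a verification process to ensure its reliability and is then used as a knowledge base for improving the performance of information extraction systems. In this study, we propose a system that extracts the conditions appearing in protein-protein interactions and a system that analyzes the extracted biological terms using Gene Ontology, and discuss a system for the automatic extension and verification of Gene Ontology.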

Accomplishments and Challenges in Literature Data Mining for Biology

Lynette Hirschman, Jong C. Park, Junichi Tsujii, Limsoon Wong, and Cathy Wu
Bioinformatics, Vol. 18, No. 12, pp. 1553-1561, 2002.

We review recent results in literature data mining for biology and discuss the need and the steps for a challenge evaluation for this field. Literature data mining has progressed from simple recognition of terms to extraction of interaction relationships from complex sentences, and has broadened from recognition of protein interactions to a range of problems such as improving homology search, identifying cellular location, and so on. To encourage participation and accelerate progress in this expanding field, we propose creating challenge evaluations, and we describe two specific applications in this context.

Natural Language Query Interpretation System for Biomedical Database Access

Hodong Lee and Jong C. Park
Proceedings of the KISS Spring Conference, pp. 487-489, Han Yang University, April 26-27, 2002.

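This paper describes a natural language query system that enables conceptual access to biomedical information held in heterogeneous databases. To this end, the system translates queries into the formal database languages SQL, OQL, and CPL, and we show the query analysis and translation steps this requires. Because the proposed method translates directly into the various formal languages using the information derived by syntactic analysis, the structure of the system becomes simple and modular, which can improve overall performance and portability.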

Combinatory Categorial Grammar: from Natural Language Understanding to Biomedical Applications

Jong C. Park
Workshop on Natural Language Processing and Ontology Building for Biology, Tokyo, Japan, February, 2002.

Recent Issues in Biopathway Extraction from Literature

Jong C. Park
Institute for Mathematical Sciences (IMS), National University of Singapore, Singapore, February, 2002.

Challenges in Biopathway Extraction from Literature and Ontology Building for Biology

Jong C. Park
Korea Society for Bioinformatics Workshop, February, 2002.

Semi-Automatic Extension of Gene Ontology

Jin-Bok Lee, Jung-jae Kim, and Jong C. Park
Human Computer Interaction (HCI) Workshop, Phoenix Park, Korea, January, 2002.

Interpretation of Natural Language Queries for Relational Database Access with Combinatory Categorial Grammar

Hodong Lee and Jong C. Park
International Journal of Computer Processing of Oriental Languages (IJCPOL), Vol. 15, No. 3, pp. 281-304, 2002.

In this paper, we describe a proposal to derive formal language queries from natural language queries with a combinatory categorial grammar (CCG). CCGs are well known to provide a means of deriving all the levels of information for natural language, i.e., syntax, semantics and discourse, at the same time. In our proposal, we utilize an extra level of representation for formal language queries for the aforementioned derivation. The syntactic coverage is shown with various natural language queries, including compound nouns, modification markers, various types of ellipses, numerical expressions, and subordinate and coordinate constructions. The general purpose CCG lexicon is semi-automatically augmented with the database fields and entries. We also discuss the performance of an implemented natural language query processing system.

BiopathwayBuilder: Nested 3D visualization system for complex molecular interactions

Changsu Lee, Jinah Park, and Jong C. Park
Proceedings of International Conference on Genome Informatics (GIW), pp. 447-448, Tokyo, Japan, 2002.

In order to gain a full understanding of a biological process, we must be able to augment the known molecular interactions with discovered knowledge. We believe that a visualization system works as a means for accomplishing this task, as it provides an intuitive base for necessary information, among others. However, reported implementations have further problems: (1) The size of the information is not only enormous, but also grows very fast, which makes scalability and elision essential properties [5]; (2) the available information is not only incomplete, but also unreliable; and (3) the usual information in the field, such as protein modification [2], is inherently complex, which makes it very difficult to make the resulting visualization intuitive enough for end users as well as field experts. We address all the problems above with a 3D visualization system.

3D Visualization System for Complex Protein-Protein Interactions from Text Data Mining

Changsu Lee, Jinah Park, and Jong C. Park
IEEE Workshop on Visualization in Bioinformatics and Cheminformatics, Boston, USA, 2002.

Natural Language Interpretations for Heterogeneous Database Access

Hodong Lee and Jong C. Park
Proceedings of the International Conference on Computational Linguistics (COLING), pp. 523-529, Taiwan, 2002.

Text Data Mining for Automatic Gene Ontology Extension

Jin-Bok Lee and Jong C. Park
Intelligent Systems for Molecular Biology (ISMB), Proceedings of the second meeting of the special interest group on Text Data Mining, pp. 22-25, Edmonton, Alberta, Canada, 2002.

Literature Data Mining for Biology

Lynette Hirschman, Jong C. Park, Junichi Tsujii, Cathy Wu, and Limsoon Wong
Proceedings of the Pacific Symposium on Biocomputing (PSB) session, pp. 323-325, Hawaii, USA, 2002.

Natural Language Processing for Biomedical Information Extraction and Automatic Ontology Management

Jong C. Park
Proceedings of the 2nd Bioinformatics Forum, pp. 145-158, Seoul, Korea, 2002.

Diphone-based Intonation Generation for Korean with Combinatory Categorial Grammar

Lee Hwa Jin
MS thesis, KAIST, 2002.

Automatic Synthesis of Multimedia Tales with Combinatory Categorial Grammar

Hyun Sook Kim
MS thesis, KAIST, 2002.

Computational Processing of Honorifics in Korean with Combinatory Categorial Grammar

O Shik Kwon
MS thesis, KAIST, 2002.

Using Combinatory Categorial Grammar to Extract Biomedical Information

Jong C. Park
IEEE Intelligent Systems, Special Issue on Intelligent Systems in Biology, Vol. 16, No. 6, pp. 62-67, November-December, 2001.

Extracting information from biology databases manually can be an overwhelming task. GenBank, the US National Institutes of Health database containing all publicly available DNA sequences, has more than 14 billion bases in 13 million genetic-sequence records. Medline, a literature database available through PubMed, has over 11 million journal citations. In a May 2001 search request for “cytokine” (regulatory proteins in the immune system), PubMed returned 296,556 articles. Given the quantity and complexity of biomedical literature, demands for computational tools to extract specific information are increasing. In this article, I review biomedical information extraction methods and present research done by KAIST’s natural language processing group on a system that shows encouraging performance using combinatory categorial grammar (explained in detail below) as a natural language grammar formalism.

Biomedical Informatics and Natural Language Processing

Jong C. Park
Annual Meeting of the Korean Society for Medical Informatics, Jeon-ju, Korea, November, 2001.

Bioinformatics and Natural Language Processing

Jong C. Park
Special Issue in Korean Information Processing, Communications of the Korea Information Science Society (KISS), Vol. 19, No. 10, pp. 46-51, October, 2001.

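Bioinformatics is a general term for the research field that applies the information processing techniques used in computer science, mathematics, statistics, and related fields to produce, manage, and use biological information efficiently as the amount of information handled in biology grows rapidly. Natural language processing (NLP) is a general term for the research field concerned with using computers to process sentences and documents expressed in natural languages such as Korean or English. It includes research that supports human-computer interaction (HCI) as well as research on managing and using vast amounts of natural language information efficiently. By introducing research in the field of biomedical information extraction, this article discusses how these two distinct research fields come to be related. Reflecting the recent rise in general interest in bioinformatics, the August 2000 issue of the Communications of the Korea Information Science Society [1] featured a special issue on bioinformatics, and this article is expected to complement that issue, from a vantage point one year later, with applications of natural language processing.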

Automatic Augmentation of Translation Dictionary with Database Terminologies in Multilingual Query Interpretation

Hodong Lee and Jong C. Park
Annual Meeting of the Association for Computational Linguistics (ACL), Workshop on Human Language Technologies and Knowledge Management, pp. 113-120, Toulouse, France, July, 2001.

In interpreting multilingual queries to databases whose domain information is described in a particular language, we must address the problem of word sense disambiguation. Since full-fledged semantic classification information is difficult to construct either automatically or manually for this purpose, we propose to disambiguate the senses of the source lexical items by automatically augmenting a simple translation dictionary with database terminologies and describe an implemented multilingual query interpretation system in a combinatory categorial grammar framework.

Translating Natural Language Queries into Formal Language Queries with Combinatory Categorial Grammar

Hodong Lee and Jong C. Park
Proceedings of the International Conference on Computer Processing of Oriental Languages (ICCPOL), pp. 41-46, Seoul, Korea, May, 2001.

Computational Generation of Context-based Intonation for Korean with Combinatory Categorial Grammar

Lee Hwa Jin and Jong C. Park
Proceedings of International Conference on Computer Processing of Oriental Languages (ICCPOL), pp. 415-420, Seoul, Korea, May, 2001.

Design and Implementation of E-Mail Response Management System for Call Center

Jung-jae Kim, O Shik Kwon, Hodong Lee, and Jong C. Park
Proceedings of the KISS Spring Conference, pp. 445-447, April, 2001.

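This paper describes the server side of an automatic e-mail response and management system designed and implemented for call centers. In this study, we developed a domain-specific representation format and implemented an automatic responder with a more efficient three-stage matching method, a learning-based, domain-independent automatic classifier, and an agent dispatcher whose methods of application can be reordered.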

Bidirectional Incremental Parsing for Automatic Pathway Identification with Combinatory Categorial Grammar

Jong C. Park, Hyun Sook Kim, and Jung-jae Kim
Pacific Symposium on Biocomputing (PSB), pp. 396-407, Big Island, Hawaii, USA, January, 2001.

As the importance of automatically extracting and analyzing various natural language assertions about protein-protein interactions in biomedical publications is recognized, many uses of natural language processing techniques are proposed in the literature. However, most proposals to date make rather simplifying assumptions about the syntactic aspects of natural language due to various reasons including efficiency. In this paper, we describe an implemented system that utilizes combinatory categorial grammar known to be competent in modeling natural language, with a controlled mechanism for the parser to operate bidirectionally and incrementally. We discuss the performance of the system on a large set of abstracts in Medline with quite encouraging results.

Real Time Synthesis of Multimedia Tales in Korean with Combinatory Categorial Grammar

Hyun Sook Kim and Jong C. Park
Proceedings of the National Conference on Korean Information Processing, pp. 509-512, 2001.

Computational Processing of Honorifics in Korean with Combinatory Categorial Grammar

O Shik Kwon and Jong C. Park
Proceedings of the National Conference on Korean Information Processing, pp. 365-372, 2001.

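Korean and Japanese have highly developed honorific systems compared with Western languages such as English. In practice, however, these honorific systems are difficult to use accurately, not only for non-native speakers but even for many native speakers. Nevertheless, the ability to use honorifics correctly is regarded, along with the ability to choose appropriate words, as an important linguistic skill for natural communication. In particular, when building machine translation systems or grammar checkers, implementing a system that understands the honorific system accurately is essential for producing more natural expressions. In this paper, we survey the honorific system of Korean, introduce a system that verifies it using combinatory categorial grammar, and evaluate the system's performance on historical drama scripts.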

Generation of Contextually Appropriate Responses in E-Commerce with Combinatory Categorial Grammar

Jin-Bok Lee and Jong C. Park
Proceedings of the Human Computer Interaction (HCI) Symposium, pp. 314-319, Phoenix Park Convention Center, Korea, 2001.

We analyze various constructions in Korean including coordination, relative clauses, and embedded clauses by focusing on the phenomenon of quantifier floating where quantifying expressions may appear in places other than their original prenominal one. Based on these analyses, we process Korean sentences in a combinatory categorial grammar (CCG) framework that makes use of all the levels of syntax, semantics, and discourse. Finally, we describe an implemented query system that generates responses with contextually appropriate ellipsis in the domain of e-commerce.

Computational Processing of Floating Quantifiers in Korean with Combinatory Categorial Grammar

Jin-Bok Lee
MS thesis, KAIST, 2001.

Processing Floating Quantifiers with Combinatory Categorial Grammar

Jin-Bok Lee and Jong C. Park
the KISS Regional Conference, November, 2000.

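In this paper, we consider quantifier floating in Korean in relation to complex linguistic phenomena such as coordinate, relative, and embedded constructions, from syntactic, semantic, and discourse perspectives, and show that Korean sentences can be analyzed using combinatory categorial grammar. Based on this, we suggest the feasibility of building an interface capable of natural dialogue in domains such as e-commerce.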

Predicting Contextually Appropriate Intonation from Utterances in Korean with Combinatory Categorial Grammar

Lee Hwa Jin and Jong C. Park
Proceedings of the National Conference on Korean Language Processing, pp. 68-75, October, 2000.

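To convey one's intention accurately to a listener, an utterance must be spoken with intonation appropriate to the flow of the dialogue. In this paper, we analyze sentences using combinatory categorial grammar and examine how intonational information such as pitch accent, pause, and emphasis should be realized according to intra-sentential and inter-sentential information, that is, the context. We then present a method for adding this information to the information structure of the sentence.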

Customizable Natural Language Interfaces to Data Bases

Jong C. Park
Invited presentation, Pacific Symposium on Biocomputing (PSB), Honolulu, Hawaii, USA, January, 2000.

Combinatory Categorial Grammar and Natural Language Interface to Database

Hodong Lee and Jong C. Park
Proceedings of the Human-Computer Interaction (HCI) Triangle Workshop, pp. 900-905, Phoenix Park Convention Center, Korea, January, 2000.

In this paper, we discuss issues related to the construction of a natural language interface to databases, including the characteristics of natural language queries. We propose to implement the system using Combinatory Categorial Grammar (CCG), so that various linguistic phenomena can be handled incrementally and in a modular manner for diverse expressions.

Informed Parsing for Coordination with Combinatory Categorial Grammar

Jong C. Park and Hyung-joon Cho
Proceedings of the International Conference on Computational Linguistics (COLING), pp. 593-599, Saarbrücken, Germany, 2000.

Coordination in natural language hampers efficient parsing, especially due to the multiple and mostly unintended candidate conjuncts/disjuncts in a given sentence that shows structural ambiguity. The problem gets more serious in a combinatory categorial grammar framework, which is well known for its competent treatment of coordination, as the flexibility of syntactic analysis often strikes back as spurious ambiguity. We propose to address these ambiguities with predicate argument structures and semantic co-occurrence similarity information, and present encouraging results.

Combinatory Categorial Grammar for the Syntactic, Semantic, and Discourse Analyses of Coordinate Constructions in Korean

Hyung-joon Cho and Jong C. Park
Journal of the Korea Information Science Society (KISS), Vol. 27, No. 4, pp. 448-462, 2000.

Coordinate constructions in natural language pose a number of difficulties to natural language processing units, due to the increased complexity of syntactic analysis, the syntactic ambiguity of the involved lexical items, and the apparent deletion of predicates in various places. In this paper, we address the syntactic characteristics of the coordinate constructions in Korean from the viewpoint of constructing a competence grammar, and present a version of combinatory categorial grammar for the analysis of coordinate constructions in Korean. We also show how to utilize a unified lexicon in the proposed grammar formalism in deriving the sentential semantics and associated information structures as well, in order to capture the discourse functions of coordinate constructions in Korean. The presented analysis conforms to the common wisdom that coordinate constructions are utilized in language not simply to reduce multiple sentences to a single sentence, but also to convey the information of contrast. Finally, we provide an analysis of sample corpora for the frequency of coordinate constructions in Korean and discuss some problematic cases.

Combinatory Categorial Grammar for Natural Language Interface

Hodong Lee and Jong C. Park
Proceedings of the KISS Fall Conference, pp. 173-175, 2000.

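In this work, we implement a natural language query interface to an e-commerce database using combinatory categorial grammar. To this end, we analyze query sentences and discuss how to represent them. We also present lexical representations and derivations for translating queries into the SQL formal language. The proposed method derives SQL queries directly during parsing and bypasses the translation step into an intermediate logical language proposed in earlier work, which simplifies the process and can improve system performance. The system is implemented with a web-based client/server architecture.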

Coordinate Constructions in Korean and Parsing Issues in Combinatory Categorial Grammar

Hyung-joon Cho
MS thesis, KAIST, 2000.

Combinatory Categorial Grammar and Parsing

Hyung-joon Cho and Jong C. Park
Proceedings of the National Conference on Korean Language Processing, pp. 223-230, Mokpo, Korea, October 1999.

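In this paper, we discuss spurious ambiguity and structurally ambiguous noun phrase coordination, both of which increase parsing complexity when Korean is processed with combinatory categorial grammar. Exploiting the property of combinatory categorial grammar that syntactic and semantic processing are performed simultaneously, we present a way to reduce the complexity caused by spurious ambiguity. We also present a way to reduce the complexity and misanalyses that arise in conjunct selection by using the co-occurrence similarity between the head nouns of the conjuncts, and then discuss how these methods can be improved.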

An Analysis of the Semantic and Discourse Functions of the Korean Special Marker '-to'

June K. Park and Jong C. Park
the National Conference on Korean Language Processing, Mokpo, Korea, October 1999.

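This paper addresses the semantic and discourse functions of special markers in Korean, in particular '-to' (도). The marker '-to' plays an important role in connecting discourse naturally. A sentence containing '-to' always carries a certain presupposition, which is directly related not only to the meaning of the sentence but also to the preceding context. In this paper, we explain five functions of '-to', namely 'sameness', 'similarity', 'extremity', 'addition', and its use in coordinate sentences, and present a method for implementing them in combinatory categorial grammar (CCG) using alternative semantics.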

A CCG for Coordination in Korean

Hyung-joon Cho and Jong C. Park
Proceedings of the KISS Conference, pp. 327-329, Jeonju, Korea, April, 1999.

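In natural language processing, coordinate sentences pose difficulties arising from the complexity of analysis, lexical ambiguity, gaps, and so on. In this paper, we discuss the limitations of previously proposed categorial grammars for Korean and use combinatory categorial grammar to handle Korean coordinate sentences that those grammars could not. In processing them, we represent coordinate sentences in Korean, a non-configurational language, as predicate-argument structures and show that these can be used in a machine translation system.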

Multiset-CCG for Quantifier Floating in Korean

Jin-Bok Lee and Jong C. Park
Proceedings of the KISS Conference, pp. 330-332, Jeonju, Korea, April, 1999.

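In this paper, we examine the patterns in which quantifiers appear in Korean and discuss the phenomenon of quantifier floating (QF) among them. We show that QF is possible in the nominative, accusative, and dative cases, and explain the constraints on QF in embedded clauses. We show that these phenomena can be accounted for in the framework of a multiset combinatory categorial grammar for Korean.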

Lexical Selection with a Target Language Monolingual Corpus and an MRD

Hyun Ah Lee, Jong C. Park, and Gil Chang Kim
Proceedings of the Theoretical and Methodological Issues in Machine Translation (TMI), pp. 150-160, Chester, England, 1999.

In this paper, we propose a lexical selection method with three steps: sense disambiguation of source words, sense-to-word mapping, and selection of the most appropriate target language lexical item. The knowledge for each step is extracted from a machine readable dictionary and a target language monolingual corpus. By splitting the process of lexical selection into three steps and extracting the essential knowledge for each step from existing resources, our system can select appropriate words for translation with high extensibility and robustness.

Checking Grammatical Mistakes for English-as-a-Second-Language (ESL) Students

Jong C. Park, Martha Palmer, and Gay Washburn
Proceedings of the KSEA-NERC, New Brunswick, New Jersey, USA, April, 1997.

An English Grammar Checker as a Writing Aid for Students of English as a Second Language

Jong C. Park, Martha Palmer, and Gay Washburn
Conference on Applied Natural Language Processing (ANLP), Descriptions of System Demonstrations and Videos, Washington, D.C., USA, March, 1997.

We present a prototype grammar checker for English as a Second Language (ESL) students, utilizing Combinatory Categorial Grammar (CCG) written in SICStus Prolog. Instead of attempting to handle all possible grammatical errors, the grammar checker identifies certain specific types of grammatical mistakes that appear more regularly than others in the present domain of application.

Quantifier Scope and Constituency

Jong C. Park
The 33rd Annual Meeting of the Association for Computational Linguistics (ACL), Cambridge, Massachusetts, USA, June, 1995.

Traditional approaches to quantifier scope typically need stipulation to exclude readings that are unavailable to human understanders. This paper shows that quantifier scope phenomena can be precisely characterized by a semantic representation constrained by surface constituency, if the distinction between referential and quantificational NPs is properly observed. A CCG implementation is described and compared to other approaches.

Semantic Significance of Quantification in Natural Language Processing

Jong C. Park
Proceedings of the KSEA-NERC, pp. 432-436, New Brunswick, New Jersey, USA, March, 1995.

A Unification-based Semantic Interpretation for Coordinate Constructs

Jong C. Park
The 30th Annual Meeting of the Association for Computational Linguistics (ACL), Delaware, USA, June, 1992.

This paper shows that a first-order unification-based semantic interpretation for various coordinate constructs is possible without an explicit use of lambda expressions if we slightly modify the standard Montagovian semantics of coordination. This modification, along with partial execution, completely eliminates the lambda reduction steps during semantic interpretation.