Publications | KAIST NLP*CL Lab.

Does Rationale Quality Matter? Enhancing Mental Disorder Detection via Selective Reasoning Distillation

Hoyun Song, Huije Lee, Jisu Shin, Sukmin Cho, Changgeon Ko, and Jong C. Park
Findings of The 63rd Annual Meeting of the Association for Computational Linguistics: ACL 2025, July 27-Aug 1, 2025
(accepted)

EXIT: Context-Aware Extractive Compression for Enhancing Retrieval-Augmented Generation

Taeho Hwang, Sukmin Cho, Soyeong Jeong, Hoyun Song, SeungYoon Han, and Jong C. Park
Findings of The 63rd Annual Meeting of the Association for Computational Linguistics: ACL 2025, July 27-Aug 1, 2025
(accepted)

An Efficient Sign Language Translation Using Spatial Configuration and Motion Dynamics with LLMs

Eui Jun Hwang, Sukmin Cho, Junmyeong Lee, and Jong C. Park
Proceedings of the 2025 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Apr 29-May 4, 2025

Lossless Acceleration of Large Language Models with Hierarchical Drafting based on Temporal Locality in Speculative Decoding

Sukmin Cho, SJ Choi, Taeho Hwang, Jeongyeon Seo, Soyeong Jeong, Huije Lee, Hoyun Song, Jong C. Park, and YJ Kwon
Findings of the Association for Computational Linguistics: NAACL 2025, Apr 29-May 4, 2025

A Spatio-Temporal Representation Learning as an Alternative to Traditional Glosses in Sign Language Translation and Production

Eui Jun Hwang, Sukmin Cho, Huije Lee, Youngwoo Yoon, and Jong C. Park
2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2025), Feb 28-Mar 4, 2025

The Impact of Retrieved Document Bias on Generated Responses in RAG: An Analysis through Political Bias Experiments

Seungho Cho, Changgeon Ko, Taeho Hwang, Jeong yeon Seo, and Jong C. Park
Korea Software Congress (KSC2024), Dec 18-20, 2024.
(selected as the best paper)

Improving Keypoint-based Korean Sign Language Translation Performance Through Frame-wise Contrastive Learning

Hyeyeon Kim, Junmyeong Lee, Eui Jun Hwang, and Jong C. Park
Korea Software Congress (KSC2024), Dec 18-20, 2024.
(selected as the best paper)

Different Bias Under Different Criteria: Assessing Bias in LLMs with a Fact-Based Approach

Changgeon Ko, Jisu Shin, Hoyun Song, Jeong yeon Seo, and Jong C. Park
The workshop on Socially Responsible Language Modelling Research at NeurIPS 2024 (SoLaR@NeurIPS 2024), Dec 14, 2024.

PiLaMIM: Toward Richer Visual Representations by Integrating Pixel and Latent Masked Image Modeling

Junmyeong Lee, Eui Jun Hwang, Sukmin Cho, and Jong C. Park
The workshop on Self-Supervised Learning - Theory and Practice at NeurIPS 2024 (SSL@NeurIPS 2024), Dec 14, 2024.

Typos that Broke the RAG's Back: Genetic Attack on RAG Pipeline by Simulating Documents in the Wild via Low-level Perturbations

Sukmin Cho, Soyeong Jeong, Jeong yeon Seo, Taeho Hwang, and Jong C. Park
Findings of the Association for Computational Linguistics: EMNLP 2024 (Findings of EMNLP), Nov 12-14, 2024.

Towards Effective Counter-Responses: Aligning Human Preferences with Strategies to Combat Online Trolling

Huije Lee, Hoyun Song, Jisu Shin, Sukmin Cho, SeungYoon Han, and Jong C. Park
Findings of the Association for Computational Linguistics: EMNLP 2024 (Findings of EMNLP), Nov 12-14, 2024.

DSLR: Document Refinement with Sentence-Level Re-ranking and Reconstruction to Enhance Retrieval-Augmented Generation

Taeho Hwang, Soyeong Jeong, Sukmin Cho, SeungYoon Han, and Jong C. Park
The Third workshop on knowledge-augmented methods for NLP at ACL 2024 (KnowledgeNLP@ACL 2024), Aug 16, 2024.

Ask LLMs Directly, "What shapes your bias?": Measuring Social Bias in Large Language Models

Jisu Shin, Hoyun Song, Huije Lee, Soyeong Jeong, and Jong C. Park
Findings of the Association for Computational Linguistics: ACL 2024 (Findings of ACL), Aug 11-16, 2024.

Retrieval-Augmented Generation through Zero-shot Sentence-Level Passage Refinement using LLMs

Taeho Hwang, Soyeong Jeong, Sukmin Cho, and Jong C. Park
Proceedings of the Korea Computer Congress (KCC 2024), June 26-28, 2024.
(selected as the outstanding paper)

Enhancing Sign Language Recognition with Pose-Based Data Augmentation: Focusing on Hand Keypoints

SeungYoon Han, Aujin Kim, KyungGeun Roh, and Jong C. Park
Proceedings of the Korea Computer Congress (KCC 2024), June 26-28, 2024.

Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity

Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Ju Hwang, and Jong C. Park
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Jun 16-21, 2024.

A Gloss-free Sign Language Production with Discrete Representation

Eui Jun Hwang, Huije Lee, and Jong C. Park
The 18th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2024), May 27-31, 2024.

Preprocessing Mediapipe Keypoints with Keypoint Reconstruction and Anchors for Isolated Sign Language Recognition

KyungGeun Roh, Huije Lee, Eui Jun Hwang, Sukmin Cho, and Jong C. Park
Proceedings of the LREC-COLING 2024 11th Workshop on the Representation and Processing of Sign Languages: Evaluation of Sign Language Resources (SignLang@LREC-COLING 2024), May 20-25, 2024.

Improving Zero-shot Reader by Reducing Distractions from Irrelevant Documents in Open-Domain Question Answering

Sukmin Cho, Jeong yeon Seo, Soyeong Jeong, and Jong C. Park
Findings of the Association for Computational Linguistics: EMNLP 2023 (Findings of EMNLP), Dec 6-10, 2023.

Test-Time Self-Adaptive Small Language Models for Question Answering

Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Ju Hwang, and Jong C. Park
Findings of the Association for Computational Linguistics: EMNLP 2023 (Findings of EMNLP), Dec 6-10, 2023.

Knowledge-Augmented Language Model Verification

Jinheon Baek, Soyeong Jeong, Minki Kang, Jong C. Park, and Sung Ju Hwang
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), Dec 6-10, 2023.

Generation of Korean Offensive Language by Leveraging Large Language Models via Prompt Design

Jisu Shin, Hoyun Song, Huije Lee, Fitsum Gaim, and Jong C. Park
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (IJCNLP-AACL), Nov 1-4, 2023.

Iterative Feedback-based Personality Persona Generation for Diversifying Linguistic Patterns in Large Language Models

Taeho Hwang, Hoyun Song, Jisu Shin, Sukmin Cho, and Jong C. Park
Proceedings of the 35th Annual Conference on Human & Cognitive Language Technology (HCLT), Oct 12-13, 2023.

Political Bias in Large Language Models and Implications on Downstream Tasks

Jeong yeon Seo, Sukmin Cho, and Jong C. Park
Proceedings of the 35th Annual Conference on Human & Cognitive Language Technology (HCLT), Oct 12-13, 2023.
(selected as best paper)

Deep Model Compression Also Helps Models Capture Ambiguity

Hancheol Park and Jong C. Park
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), July 9-14, 2023

Discrete Prompt Optimization via Constrained Generation for Zero-shot Re-ranker

Sukmin Cho, Soyeong Jeong, Jeong yeon Seo, and Jong C. Park
Findings of the Association for Computational Linguistics: ACL 2023 (Findings of ACL), July 9-14, 2023

Phrase Retrieval for Open Domain Conversational Question Answering with Conversational Dependency Modeling via Contrastive Learning

Soyeong Jeong, Jinheon Baek, Sung Ju Hwang, and Jong C. Park
Findings of the Association for Computational Linguistics: ACL 2023 (Findings of ACL), July 9-14, 2023

A Simple and Flexible Modeling for Mental Disorder Detection by Learning from Clinical Questionnaires

Hoyun Song, Jisu Shin, Huije Lee, and Jong C. Park
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), July 9-14, 2023

Question-Answering in a Low-resourced Language: Benchmark dataset and Models for Tigrinya

Fitsum Gaim, Wonsuk Yang, Hancheol Park, and Jong C. Park
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), July 9-14, 2023
(selected as outstanding paper)

Korean Sign Language Recognition on Keypoints with a Transformer Model

KyungGeun Roh, Eui Jun Hwang, Huije Lee, and Jong C. Park
Proceedings of the Korea Computer Congress (KCC 2023), June 17-20, 2023.

Controllable prompt tuning with relation dependent tokens

Jinseok Kim, Sukmin Cho, Soyeong Jeong, and Jong C. Park
Proceedings of the Korea Computer Congress (KCC 2023), June 17-20, 2023.

Sign Language Gloss Translation using Copy Mechanism

Jaewoo Kim, Huije Lee, Sukmin Cho, and Jong C. Park
Proceedings of the Korea Computer Congress (KCC 2023), June 17-20, 2023.

Leveraging Large Language Models with Vocabulary Sharing for Sign Language Translation

Huije Lee, Jung-Ho Kim, Eui Jun Hwang, Jaewoo Kim, and Jong C. Park
SLTAT 2023 Workshop at ICASSP 2023, June 4-10, 2023

Realistic Conversational Question Answering with Answer Selection based on Calibrated Confidence and Uncertainty Measurement

Soyeong Jeong, Jinheon Baek, Sung Ju Hwang, and Jong C. Park
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL), May 2-4, 2023.

Assessing automatic summarization model as a reading assistant

Aujin Kim, Jisu Shin, Soyeong Jeong, Sukmin Cho, and Jong C. Park
Proceedings of the Korea Computer Congress (KCC 2022), June 29-July 1, 2022

Constructing Korean Abusive Language Dataset using Machine Translation

Jisu Shin, Hoyun Song, Huije Lee, and Jong C. Park
Proceedings of the Korea Computer Congress (KCC 2022), June 29-July 1, 2022

Stopwords Mask Pooling for Dense Retrieval in Medical Domain

Dongho Choi, Hoyun Song, Soyeong Jeong, Sukmin Cho, and Jong C. Park
Proceedings of the Korea Computer Congress (KCC 2022), June 29-July 1, 2022

Sign Language Production With Avatar Layering: A Critical Use Case over Rare Words

Jung-Ho Kim, Eui Jun Hwang, Sukmin Cho, Du Hui Lee, and Jong C. Park
Proceedings of the 13th Language Resources and Evaluation Conference (LREC 2022), June 21-23, 2022

GeezSwitch: Language Identification in Typologically Related Low-resourced East African Languages

Fitsum Gaim, Wonsuk Yang, and Jong C. Park
Proceedings of the 13th Language Resources and Evaluation Conference (LREC 2022), June 21-23, 2022
(selected as best paper)

ELF22: A Context-based Counter-Trolling Dataset to Combat Internet Trolls

Huije Lee, Young Ju Na, Hoyun Song, Jisu Shin, and Jong C. Park
Proceedings of the 13th Language Resources and Evaluation Conference (LREC 2022), June 21-23, 2022

Online trolls increase social costs and cause psychological damage to individuals. With the proliferation of automated accounts making use of bots for trolling, it is difficult for targeted individual users to handle the situation both quantitatively and qualitatively. To address this issue, we focus on automating the method to counter trolls, as counter responses to combat trolls encourage community users to maintain ongoing discussion without compromising freedom of expression. For this purpose, we propose a novel dataset for automatic counter response generation. In particular, we constructed a pair-wise dataset that includes troll comments and counter responses with labeled response strategies, which enables models fine-tuned on our dataset to generate responses by varying counter responses according to the specified strategy. We conducted three tasks to assess the effectiveness of our dataset and evaluated the results through both automatic and human evaluation. In human evaluation, we demonstrate that the model fine-tuned on our dataset shows a significantly improved performance in strategy-controlled sentence generation.

Query Generation with External Knowledge for Dense Retrieval

Sukmin Cho, Soyeong Jeong, Wonsuk Yang, and Jong C. Park
Proceedings of Deep Learning Inside Out (DeeLIO): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures

Augmenting Document Representations for Dense Retrieval with Interpolation and Perturbation

Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Ju Hwang, and Jong C. Park
60th Annual Meeting of the Association for Computational Linguistics (ACL 2022)

Detecting Implicitly Abusive Language by Applying Out-of-Distribution Problem

Jisu Shin, Hoyun Song, and Jong C. Park
Proceedings of the Korea Software Congress (KSC 2021), December 20-22, 2021
(selected as best paper)

Optimizing Domain Specificity of Transformer-based Language Models for Extractive Summarization of Financial News Articles in Korean

Huije Lee, Wonsuk Yang, ChaeHun Park, Hoyun Song, Eugene Jang, and Jong C. Park
35th Pacific Asia Conference on Language, Information and Computation (PACLIC 35), November 5-7, 2021

Frequent usage of complex expressions with numbers and of the terms that require domain knowledge makes it more difficult to comprehend and summarize financial news articles than that of other daily news articles. We present a transformer-based model for the automatic summarization of the financial news articles in Korean and address related issues, and in particular analyze the interplay between the domain of the dataset used for pre-training and that for fine-tuning. We find that the summarization model performs much better when the two coincide, even when they are different from that of the target task, which is the financial domain in our work.

Non-Autoregressive Sign Language Production with Gaussian Space

Eui Jun Hwang, Jung-Ho Kim, and Jong C. Park
The 32nd British Machine Vision Conference (BMVC 2021), November 22-25, 2021

A Large-scale Comprehensive Abusiveness Detection Dataset with Multifaceted Labels from Reddit

Hoyun Song, Soo Hyun Ryu, Huije Lee, and Jong C. Park
Proceedings of the 25th Conference on Computational Natural Language Learning (CoNLL 2021), November 10-11, 2021

As users in online communities suffer from severe side effects of abusive language, many researchers have attempted to detect abusive texts from social media, presenting several datasets for such detection. However, none of them contain both comprehensive labels and contextual information, which are essential for thoroughly detecting all kinds of abusiveness from texts, since datasets with such fine-grained features demand a significant amount of annotations, leading to much increased complexity. In this paper, we propose a Comprehensive Abusiveness Detection Dataset (CADD), collected from English Reddit posts, with multifaceted labels and contexts. Our dataset is annotated hierarchically for efficient annotation through crowdsourcing on a large scale. We also empirically explore the characteristics of our dataset and provide a detailed analysis for novel insights. The results of our experiments with strong pre-trained natural language understanding models on our dataset show that our dataset gives rise to meaningful performance, assuring its practicality for abusive language detection.

Monolingual Pre-Trained Language Models for Tigrinya

Fitsum Gaim, Wonsuk Yang, and Jong C. Park
WiNLP 2021 Workshop at EMNLP 2021, November 7-11, 2021

Pre-trained language models (PLMs) are driving much of the recent progress in natural language processing. However, due to the resource-intensive nature of the models, underrepresented languages without sizable curated data have not seen significant progress. Multilingual PLMs have been introduced with the potential to generalize across many languages, but their performance trails compared to their monolingual counterparts and depends on the characteristics of the target language. In the case of the Tigrinya language, recent studies report a sub-optimal performance when applying the current multilingual models. This may be due to its orthography and unique linguistic characteristics, especially when compared to the Indo-European and other typologically distant languages that were used to train the models. In this work, we pre-train three monolingual PLMs for Tigrinya on a newly compiled corpus, and we compare the models with their multilingual counterparts on two downstream tasks, part-of-speech tagging and sentiment analysis, achieving significantly better results and establishing the state-of-the-art. We make the data and trained models publicly available.

BERT-based Personality Disorder Detection Model with Abusive Language Marking from Social Media

Jisu Shin, Hoyun Song, and Jong C. Park
Proceedings of the Korea Computer Congress (KCC 2021), June 22-25, 2021

The Relationship between the Quality of Automatically Generated Questions and the Quantity of the Context given for the Generation

Sukmin Cho, Wonsuk Yang, ChaeHun Park, and Jong C. Park
Proceedings of the Korea Computer Congress (KCC 2021), June 22-25, 2021

Unsupervised Document Expansion for Information Retrieval with Stochastic Text Generation

Soyeong Jeong, Jinheon Baek, ChaeHun Park, and Jong C. Park
Second Workshop on Scholarly Document Processing (SDP 2021), June 6-11, 2021

One of the challenges in information retrieval (IR) is the vocabulary mismatch problem, which happens when the terms between queries and documents are lexically different but semantically similar. While recent work has proposed to expand the queries or documents by enriching their representations with additional relevant terms to address this challenge, they usually require a large volume of query-document pairs to train an expansion model. In this paper, we propose an Unsupervised Document Expansion with Generation (UDEG) framework with a pretrained language model, which generates diverse supplementary sentences for the original document without using labels on query-document pairs for training. For generating sentences, we further stochastically perturb their embeddings to generate more diverse sentences for document expansion. We validate our framework on two standard IR benchmark datasets. The results show that our framework significantly outperforms relevant expansion baselines for IR.

Generating Negative Samples by Manipulating Golden Responses for Unsupervised Learning of a Response Evaluation Model

ChaeHun Park, Eugene Jang, Wonsuk Yang, and Jong C. Park
2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2021), June 6-11, 2021

Evaluating the quality of responses generated by open-domain conversation systems is a challenging task. This is partly because there can be multiple appropriate responses to a given dialogue history. Reference-based metrics that rely on comparisons to a set of known correct responses often fail to account for this variety, and consequently correlate poorly with human judgment. To address this problem, researchers have investigated the possibility of assessing response quality without using a set of known correct responses. Tao et al. (2018) demonstrated that an automatic response evaluation model could be made using unsupervised learning for the next-utterance prediction (NUP) task. For unsupervised learning of such a model, we propose a method of manipulating a golden response to create a new negative response that is designed to be inappropriate within the context while maintaining high similarity with the original golden response. We find, from our experiments on English datasets, that using the negative samples generated by our method alongside random negative samples can increase the model's correlation with human evaluations. The process of generating such negative samples is automated and does not rely on human annotation.

Target-Agnostic Detection of Stances Toward Entities in News Articles

Eugene Jang, Wonsuk Yang, and Jong C. Park
Human-Computer Interaction Korea (HCI), Korea, January 27-29, 2021.

Automatic Facial Expression Generation for Sign Language with Neural Machine Translation

Eui Jun Hwang, Jung-Ho Kim, and Jong C. Park
Korea Software Congress (KSC), Korea, December 21-23, 2020.

In a sign language, facial expressions play an important role for effective communication. In particular, they are well known for conveying emotional and grammatical information that affects the meaning of a sign. In this paper, we only consider the grammatical functions of the facial expressions. We propose a transformer-based facial expression generation model that translates an expression in spoken language into continuous facial landmark sequences for sign language. In order to train the model efficiently, we employ Principal Component Analysis embedding and Custom Mean Squared Error loss. We report the quantitative and qualitative results of the generated facial landmarks.

Calibration of Pre-trained Language Model for Korean

Soyeong Jeong, Wonsuk Yang, ChaeHun Park, and Jong C. Park
32nd Annual Conference on Human & Cognitive Language Technology, October 15-16, 2020.
(selected as best paper)

TEA: The Effect of the Textual Entailment on the Acceptability Changes

Junseop Ji, Wonsuk Yang, and Jong C. Park
Human-Computer Interaction Korea (HCI), August 19-21, 2020.

Computer Assisted Annotation of Tension Development in TED Talks through Crowdsourcing

Seungwon Yoon, Wonsuk Yang, and Jong C. Park
1st Workshop on Aggregating and analysing crowdsourced annotations for NLP (AnnoNLP), Hong Kong SAR, November 2, 2019.

Generating Sentential Arguments from Diverse Perspectives on Controversial Topic

ChaeHun Park, Wonsuk Yang, and Jong C. Park
2nd Workshop on NLP for Internet Freedom (NLP4IF): Censorship, Disinformation, and Propaganda, Hong Kong SAR, November 3, 2019.

Nonsense!: Quality Control via Two-Step Reason Selection for Annotating Local Acceptability and Related Attributes in News Editorials

Wonsuk Yang, Seungwon Yoon, Ada Carpenter, and Jong C. Park
2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP 2019), Hong Kong SAR, November 3-7, 2019.

A Corpus of Sentence-level Annotations of Local Acceptability with Reasons

Wonsuk Yang, Jung-Ho Kim, Seungwon Yoon, ChaeHun Park, and Jong C. Park
33rd Pacific Asia Conference on Language, Information and Computation (PACLIC 33), Hakodate, Japan, September 13-15, 2019.

A Corpus of Sentential Annotations on News Editorials with Multi-dimensional Credibility Metrics

Wonsuk Yang, Jung-Ho Kim, Jin-Woo Chung, and Jong C. Park
Human-Computer Interaction Korea (HCI), Jeju ICC, Korea, February 13-15, 2019.

Mitigating Stereotypes in Word Embedding through Sentiment Modulation

Huije Lee, Jin-Woo Chung, and Jong C. Park
Korea Software Congress (KSC), Pyeongchang, Korea, December 19-21, 2018.

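Word embeddings are models that effectively capture the semantic information of words in low-dimensional vectors, and pre-trained word2vec is used in many natural language processing areas that rely on word semantics. However, word embedding models trained on large amounts of text also learn, as semantic information, the stereotypes that people may hold about gender, race, and so on. In this paper, noting that implicit sentiment toward words referring to persons or groups can bias a model, we present a method for detecting words that reveal affective stereotypes within an embedding model, and we propose an embedding model in which the sentiment dimension for person entities is adjusted to mitigate stereotypes. Experimental results show that the model's bias deepens as the sentiment-dimension embedding for person entities is amplified, and that the proposed model reduces bias by 16% compared with the existing model while keeping the change in performance within 1%.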

Neural Grammatical Error Correction by Simulating the Human Learner and the Human Proofreader

Fitsum Gaim, Jin-Woo Chung, and Jong C. Park
Korea Software Congress (KSC), Pyeongchang, Korea, December 19-21, 2018.

We present a learning framework for grammatical error correction (GEC) that leverages the duality of translation to effectively synthesize training signals from a monolingual corpus through a game of two contrasting agents that are initially trained with a small amount of parallel data. The first agent learns to produce ostensibly natural errors, whereas the second learns to proofread the erroneous output into grammatically correct text. This approach not only alleviates the need for large parallel corpora but also exposes the GEC model to a wider range of error types. Our final model is competitive against the best systems, outperforming some of the strongest models on standard benchmarks.

Feature Attention Network: Interpretable Depression Detection from Social Media

Hoyun Song, Jinseon You, Jin-Woo Chung, and Jong C. Park
32nd Pacific Asia Conference on Language, Information and Computation (PACLIC 32), The Hong Kong Polytechnic University, Hong Kong SAR, December 1-3, 2018.

Although depression is one of the most common mental disorders, the depressed individuals may not be aware of their symptoms at all so that they sometimes miss the appropriate time for treatment. In order to prevent this problem, many researchers looked into social media to figure out depressed individuals by analyzing the differences in language use. While they have recently achieved reasonable performance in detecting depression, especially using deep learning methods, such methods still do not provide a clear way to explain why certain individuals have been detected as depressed. To address this issue, we propose Feature Attention Network (FAN), inspired by the process of diagnosing depression by an expert who has background knowledge about depression. We evaluate the performance of our model on a large scale general forum (Reddit Self-reported Depression Diagnosis) dataset. Experimental results demonstrate that FAN shows good performance with high interpretability despite a smaller number of posts in training data. We investigate different aspects of posts by depressed users through four feature networks built upon psychological studies, which will help researchers to investigate social media posts to find useful evidence for depressive symptoms.

Extracting Supporting Evidence with High Precision via Bi-LSTM Network

ChaeHun Park, Wonsuk Yang, and Jong C. Park
30th Annual Conference on Human & Cognitive Language Technology, Korea University, Seoul, Korea, October 12-13, 2018.

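For an argument to be highly persuasive, it needs sufficient supporting evidence. Automating the extraction of evidence that can logically support the claims in an argument is essential for developing and commercializing applications such as automated debate systems and decision-making aids for policy votes. However, a system that extracts supporting evidence from web documents requires that two problems be addressed first, which makes a high-performance system difficult to build: 1) a wide search scope to secure information that has low direct relevance to the topic of the argument but can still serve as supporting evidence, and 2) the cognitive ability to identify, within the collected information, evidence that clearly supports the claim of the argument. To extract supporting evidence with high precision and scalability, this study proposes the following staged extraction system: 1) selection of relevant documents based on TF-IDF similarity, 2) first-stage extraction of supporting evidence through semantic similarity, and 3) second-stage extraction of supporting evidence through a neural network classifier. To validate the proposed system, we conducted an experiment that extracts supporting evidence for the claims in 4,008 editorials from 845,675 news articles on the web. When evaluated against annotated claims and supporting evidence, the proposed staged system achieved precisions of 0.41 and 0.70 in the first and second extraction stages, respectively. We then analyzed the supporting evidence extracted by the system and confirmed that supporting evidence can be extracted on the basis of an appropriate understanding of the argument.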

Automatic Tension Recognition from Lecture Show Transcripts

Seungwon Yoon, Wonsuk Yang, and Jong C. Park
30th Annual Conference on Human & Cognitive Language Technology, Korea University, Seoul, Korea, October 12-13, 2018.

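Tension constantly affects people when they communicate or read. The concept of tension has been used in a broad sense in natural language processing. Among these senses, this paper focuses on the tension that an audience feels toward a speaker's words in one-way dialogue such as lectures, and proposes a method to quantify it. Regarding the application of the tension concept to documents written by a single author, we expect that this work, which quantifies tension in one-way dialogue, will be readily usable when the concept of tension is applied to general documents. We first constructed a new corpus annotated with the audience's tension toward the speaker's words. We then show, through a model that predicts tension with context taken into account and experimental results on its tension classification performance, that automatic tension classification is computationally feasible.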

Detection of Non-Standard Meaning Usage with Word Embedding

Huije Lee, Hancheol Park, Wonsuk Yang, and Jong C. Park
Human-Computer Interaction Korea (HCI), Jeongseon, Korea, January 31-February 2, 2018.

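In this study, we propose a model that uses distributed representations to detect words in text that are not used in their dictionary senses (hereafter, words with non-standard meanings). Determining when a word keeps the same form but is used with a non-standard meaning is important for automated text analysis and for resolving mistranslation. Using context vectors and target word vectors generated with distributed representations, we verify whether a target word fits its given context and judge whether it is used with a non-standard meaning. To address the problems with how previous work generates context vectors, we propose a method for representing integrated context information and a method for weighting the words in the context. In experiments on Twitter data, the proposed method outperformed previously proposed models.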

Predicting Symptoms of Depression for Social Media Users via Linguistic Patterns

Hoyun Song, Hancheol Park, Wonsuk Yang, and Jong C. Park
Korea Software Congress (KSC), Busan, Korea, December 20-22, 2017.

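Early diagnosis of depression is important because depression can impair an individual's daily functioning and cause various social problems. As an attempt at such early diagnosis, this study proposes a model that predicts whether users have depression from their social media text. To address the limitations of existing lexicon-based models on unstructured social media text, namely the lexical matching problem and the use of depression-related vocabulary by users who are not depressed, we present a model that uses deeper linguistic patterns. Our experiments confirm that, in predicting whether a user has depression, additionally applying linguistic patterns is more effective than a simple lexicon-based model.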

Extraction of Gene-Environment Interaction from the Biomedical Literature

Jinseon You, Jin-Woo Chung, Wonsuk Yang, and Jong C. Park
Proceedings of the 8th International Joint Conference on Natural Language Processing (IJCNLP 2017), pp. 865–874, Taipei, Taiwan, November 27–December 1, 2017.

Genetic information in the literature has been extensively looked into for the purpose of discovering the etiology of a disease. As the gene-disease relation is sensitive to external factors, their identification is important to study a disease. Environmental influences, which are usually called Gene-Environment interaction (GxE), have been considered as important factors and have extensively been researched in biology. Nevertheless, there is still a lack of systems for automatic GxE extraction from the biomedical literature due to new challenges: (1) there are no preprocessing tools and corpora for GxE, (2) expressions of GxE are often quite implicit, and (3) document-level comprehension is usually required. We propose to overcome these challenges with neural network models and show that a modified sequence-to-sequence model with a static RNN decoder produces a good performance in GxE recognition.

Inferring Implicit Event Locations from Context with Distributional Similarities

Jin-Woo Chung, Wonsuk Yang, Jinseon You, and Jong C. Park
Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI-17), pp. 979-985, Melbourne, Australia, August 19-25, 2017.

Automatic event location extraction from text plays a crucial role in many applications such as infectious disease surveillance and natural disaster monitoring. The fundamental limitation of previous work such as SpaceEval is the limited scope of extraction, targeting only locations that are explicitly stated in a syntactic structure. This leads to missing a lot of implicit information inferable from context in a document, which amounts to nearly 40% of the entire location information. To overcome this limitation for the first time, we present a system that infers the implicit event locations from a given document. Our system exploits distributional semantics, based on the hypothesis that if two events are described by similar expressions, it is likely that they occur in the same location. For example, if “A bomb exploded causing 30 victims” and “many people died from terrorist attack in Boston” are reported in the same document, it is highly likely that the bomb exploded in Boston. Our system shows good performance of a 0.58 F1-score, where state-of-the-art classifiers for intra-sentential spatiotemporal relations achieve around 0.60 F1-scores.

Neural Theorem Prover with Word Embedding for Efficient Automatic Annotation

Wonsuk Yang, Hancheol Park, and Jong C. Park
Proceedings of the 28th Annual Conference on Human and Cognitive Language Technology (HCLT), pp. 79-84, Busan, Korea, October 7-8, 2016.
(selected as best paper)

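This study aims to design a system that automatically annotates the many unverified documents on the web with verified documents produced by expert institutions, thereby improving reliability and automatically adding in-depth information. To solve the fundamental problem that the neural theorem prover, a system usable for this purpose, cannot be applied to large-scale corpora, we rebuilt it by replacing its internal recurrent module with a word embedding module. To demonstrate the drastic reduction in training time, we compared the original and rebuilt systems side by side on a case in which 28,844 propositions extracted from verified documents on cancer prevention and practice from the National Cancer Information Center are annotated onto 7,844 propositions extracted from cancer-related Wikipedia articles. In the same environment, the training time of the original system was estimated at 553.8 days, whereas the rebuilt system completed training within 93.1 minutes. The strength of this study is that it resolves the training-time problem that had made the neural theorem prover inapplicable to real-world cases, even though, as a modular nonlinear system, it can be combined in parallel with other linear logic and natural language processing modules.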

Prosodic and Linguistic Analysis of Semantic Fluency Data: A Window into Speech Production and Cognition

Maria Wolters, Najoung Kim, Jung-Ho Kim, Sarah E. MacPherson, and Jong C. Park
Interspeech 2016, pp. 2085-2089, San Francisco, California, September 8-12, 2016.

Semantic fluency is a commonly used task in psychology that provides data about executive function and semantic memory. Performance on the task is affected by conditions ranging from depression to dementia. The task involves participants naming as many members of a given category (e.g. animals) as possible in sixty seconds. Most of the analyses reported in the literature only rely on word counts and transcribed data, and do not take into account the evidence of utterance planning present in the speech signal. Using data from Korean, we show how prosodic analyses can be combined with computational linguistic analyses of the words produced to provide further insights into the processes involved in producing fluency data. We compare our analyses to an established analysis method for semantic fluency data, manual determination of lexically coherent clusters of words.

Classification of Relations between Biological Entities using Word Vectors

Jimin Park, Jin-Woo Chung, and Jong C. Park
Proceedings of Korea Computer Congress (KCC), pp. 771-773, Jeju, Korea, June 29 - July 1, 2016. (poster presentation)

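There has been much research both on identifying relations between components of biological systems from the text of research papers and on classifying relations between general words using distributional semantic models, but attempts to combine the two have rarely been reported. In this study, we examined how distributional models contribute to predicting the relation between two components within a biological system. The experimental results confirm that distributional models can serve as useful features for identifying relations between biological components.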

Addressing Low-Resource Problems in Statistical Machine Translation of Sign Language

Hancheol Park, Jung-Ho Kim, and Jong C. Park
Proceedings of Korea Computer Congress (KCC), pp. 714-716, Jeju, Korea, June 29 - July 1, 2016.
(selected as best paper)

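Although research on sign language translation using statistical machine translation has recently become more active, the scarcity of parallel corpus resources remains unsolved. This study presents three preprocessing methods that address problems arising from resource scarcity when statistical machine translation is used to translate a spoken language into a sign language composed of manual expressions. We then confirm through experiments which methods actually improve translation performance in resource-scarce sign language translation. The proposed preprocessing methods are corpus expansion through paraphrasing of spoken-language sentences, raising the frequency of individual vocabulary items through lemmatization of spoken-language words, and aligning sentence constituents between the spoken and sign languages by removing words whose parts of speech are not expressed manually. Experiments on an English-American Sign Language parallel corpus show that, of the three preprocessing methods, only paraphrase generation and lemmatization improve translation quality. In particular, performance was highest when the two methods were applied together.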

A Morphological Approach to the Longitudinal Detection of Dementia

Najoung Kim and Jong C. Park
HCI Conference Korea, High1 Resort, Gangwon, January 27-29, 2016.

The impact of cognitive impairment on linguistic abilities has been a topic of continuous interest in dementia studies. However, there is a lack of systematic agreement on the longitudinal association between dementia progression and the patients' morphological capacity, and the role of morphological phenomena other than inflection has been relatively underreported. We present a longitudinal study of writings by Iris Murdoch (diagnosed with Alzheimer's disease after her death) and Arthur Conan Doyle (no known record of dementia diagnosis), using two novel measures to account for the usage of complex morphology and lexical innovation. The results imply an association between lexical innovation and cognitive decline caused by dementia, as observed in Murdoch's works beginning from her mid-fifties, in contrast to a milder tendency in Doyle's works. Our findings contribute to the potential for facilitating early diagnosis of dementia through automated language processing approaches.

Biomedical Event Extraction and Management in Big-scale Biomedical Literature

Rize Jin, Jinseon You, and Jong C. Park
42nd KIISE Winter Conference, Phoenix Park, December 17-19, 2015. (poster presentation)

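As large volumes of biological literature have accumulated, tools such as literature management systems and search engines have emerged to support biology researchers effectively. These tools have greatly aided biological research, but they still fall short in handling complex operations. Search engines in particular can easily handle word-level queries but remain weak on complex queries that express relations between words. To handle such queries, the field of biomedical language processing has actively pursued text mining research such as gene identification and biological event identification, achieving considerable accuracy. However, because these text mining systems perform more complex computation than earlier tools, they were not designed for large-scale processing, which has become a serious problem as the need for large-scale processing in biomedical language processing grows. In this study, we propose a way to upgrade an event identification system, one such text mining system, using the distributed system Hadoop so that it can process large-scale data effectively.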

A New Measure of Clustering and Switching Based on Bigrams

Maria Wolters, Sarah MacPherson, Jinseon You, Rize Jin, Seung-Cheol Baek, and Jong C. Park
Psychonomic Society's 56th Annual Meeting, Chicago, USA, November 19-22, 2015. (poster presentation)

The category fluency task (CFT) provides important information about executive abilities such as initiation, set-shifting, and inhibition. CFT sequences are generated by retrieving groups of related words (“clusters”) from semantic memory. Manual annotation schemes have been developed for inferring these clusters from transcribed CFT sequences (Troyer 2008), but these are time-consuming and require training. We propose an automatic analysis technique that is based on a simple statistical model of CFT sequences. This model can be easily adapted to different languages and domains, given sufficient training data. CFT sequences (domain “animals”) were generated by 104 younger adults aged 18-34 years and 100 older adults aged 50-84 years who were native speakers of UK English. The sequences were categorised both manually and using our automated method, with key measures such as the number of switches significantly correlating (rho=0.4, 95% CI [0.28-0.51]). Both methods also resulted in the significant age differences that are consistently reported in the cognitive aging literature.

Corpus Annotation with a Linguistic Analysis of the Associations between Event Mentions and Spatial Expressions

Jin-Woo Chung, Jinseon You, and Jong C. Park
Proceedings of the 29th Pacific Asia Conference on Language, Information, and Computation (PACLIC 29), pp. 539-547, Shanghai, China, October 30-November 1, 2015.

Recognizing spatial information associated with events expressed in natural language text is essential for the proper interpretation of such events. However, the associations between events and spatial information found throughout the text have been much less studied than other types of spatial association as looked into in SpatialML and ISO-Space. In this paper, we present an annotation framework for the linguistic analysis of the associations between event mentions and spatial expressions in broadcast news articles. Based on the corpus annotation and analysis, we discuss which information should be included in the guidelines and what makes it difficult to achieve a high inter-annotator agreement. We also discuss possible improvements on the current corpus and annotation framework for insights into developing an automated system.

A System for Constructing a Korean-to-KSL Parallel Corpus

Jung-Ho Kim, Umang Sehgal, and Jong C. Park
17th Annual Conference on Korean Sign Language, Kongju University, Gongju, Korea, August 15, 2015. (poster presentation)

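A Korean-Korean Sign Language (KSL) parallel corpus is vital because it can be used in related dictionaries and automatic translation systems. However, unlike general parallel corpora, it is not easy to build because of the spatial linguistic characteristics of sign language. In this study, we propose a system for efficiently building a Korean-KSL parallel corpus.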

CoMAGD: Annotation of Gene-Depression Relations

Rize Jin, Jinseon You, Jin-Woo Chung, Hee-Jin Lee, Maria Wolters, and Jong C. Park
Proceedings of the 2015 ACL Workshop on Biomedical Natural Language Processing (BioNLP 2015), pp. 104-113, Beijing, China, July 30, 2015.

Clinical depression is a mental disorder involving genetic and environmental factors. Although much work has studied its genetic causes, and numerous candidate genes have consequently been examined and reported in the biomedical literature, no gene expression changes or mutations regarding depression have yet been adequately collected and analyzed for its full pathophysiology. In this paper, we present a depression-specific annotated corpus for text mining systems that aim to provide a concise review of depression-gene relations, as well as to capture complex biological events such as gene expression changes. We describe the annotation scheme and the annotation procedure in detail. We discuss issues regarding proper recognition of depression terms and entity interactions for future approaches to the task. The corpus is available at http://www.biopathway.org/CoMAGD.

Identification of Depression-Gene Associations from Biomedical Literature

Jinseon You, Rize Jin, Hee-Jin Lee, and Jong C. Park
Korea Computer Congress (KCC), Jeju, Korea, June 24-26, 2015.

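Depression is a representative mental disorder of modern people; its symptoms vary with the secretion levels of related hormones, which in turn vary with changes in the expression of related genes. Identifying depression-related genes and clarifying the relations among them would greatly help the development of antidepressants. Research on this topic is actively under way, but it is difficult to identify all the related genes at once. In this paper, we adopt a methodology for finding relations between cancer and genes to build a system that automatically identifies relations between depression and genes. We expect this to be of great help in building the corpus needed to uncover deeper relations between depression and genes.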

Construction of a Korean-to-KSL Parallel Corpus by Effective Motion Capture of Hand Shapes

Jung-Ho Kim and Jong C. Park
41st KIISE Winter Conference, Phoenix Park, December 18-20, 2014. (poster presentation)

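In this study, we present an efficient method for collecting hand shapes in order to build a parallel corpus between Korean and Korean Sign Language, using Leap Motion to recognize and collect sign motions limited to the range of hand movements. To validate the performance of the parallel corpus built with the proposed method, we collected 46 sign motions and additionally selected 54 sign motions that had not been collected in advance, confirming a recognition level with an average accuracy of 42.15% and a recall of 58.72% over a total of 100 signs. The proposed method is highly general, showing the potential for large-scale and simultaneous data collection.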

An Effective Construction of a Korean-to-KSL Parallel Corpus

Jung-Ho Kim and Jong C. Park
Proceedings of the 26th Annual Conference on Human and Cognitive Language Technology (HCLT), pp. 13-17, ChunCheon, Korea, October 10-11, 2014.
(selected as best paper)

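This study addresses the construction of a parallel corpus between Korean and Korean Sign Language, along with the problems it entails. To build the parallel corpus efficiently, we used Kinect and Leap Motion. To validate its performance, we compared the glove-based motion recognition and collection method proposed in previous work with the collection method proposed here, and found no significant difference from the results collected with gloves. This suggests that our motion collection method is competitive with the relatively costly glove-based method. Moreover, because it relies on widely available means of data collection, data can be collected simultaneously, so we expect it to make the construction of sizable parallel corpora more efficient.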

On Mention-Level Gene Normalization

Joon-Yeob Kim, Seung-Cheol Baek, Hee-Jin Lee, and Jong C. Park
5th International Symposium on Languages in Biology and Medicine (LBM 2013), Tokyo, Japan, 12th and 13th December, 2013.

Document-level gene normalization (DGN), which produces a list of gene identifiers relevant to an input document, helps database curators to search for articles of interest by indexing articles with gene identifiers. Recent advances in automatic extraction of information from the biology literature call for mention-level gene normalization (MGN) systems. However, there have been no annotated corpora for MGN, probably because of a somewhat unfounded assumption (convertibility assumption) that it might be straightforward to map gene mentions into gene identifiers given a list of gene identifiers for the document. In the present work, we constructed gold standard annotations for the MGN task and assessed the validity of the convertibility assumption with GeneTUKit (Huang et al., 2011), a state-of-the-art DGN system.

Parsing Dependency Paths to Identify Event-Argument Relation

Seung-Cheol Baek and Jong C. Park
Proceedings of the 6th International Joint Conference on Natural Language Processing (IJCNLP), Nagoya, Japan, October 15-17, 2013, pp. 699-705.

Mentions of event-argument relations, in particular dependency paths between event-referring words and argument-referring words, can be decomposed into meaningful components arranged in a regular way, such as those indicating the type of relations and the others allowing relations with distant arguments (e.g., coordinate conjunction). We argue that the knowledge about arrangements of such components may provide an opportunity for making event extraction systems more robust to training sets, since unseen patterns would be derived by combining seen components. However, current state-of-the-art machine learning based approaches to event extraction tasks take the notion of components at a shallow level by using n-grams of paths. In this paper, we propose two methods called pseudo-count and Bayesian methods to semi-automatically learn PCFGs by analyzing paths into components from the BioNLP shared task training corpus. Each lexical item in the learned PCFGs appears in 2.6 distinct paths on average between event-referring words and argument-referring words, suggesting that they contain recurring components. We also propose a grounded way of encoding multiple parse trees for a single dependency path into feature vectors in linear classification models. We show that our approach can improve the performance of identifying event-argument relations in a statistically significant manner.

Speaker-TTS Voice Mapping towards Natural and Characteristic Robot Storytelling

Hye-Jin Min, Sang-Chae Kim, Joon-Yeob Kim, Jin-Woo Chung, and Jong C. Park
Proceedings of the 22nd IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN 2013), pp. 793-800, Gyeongju, Korea, August 26-29, 2013.

Robot storytelling has the potential for its practical use in various domains such as entertainment, education, and rehabilitation. However, relying on human-recorded voices for natural storytelling is costly, and automation with text-to-speech systems is not readily applicable due to the difficulty of reflecting the full nature of stories in TTS systems. In this paper, we address the problem of automating robot storytelling with a particular focus on two issues: speaker identification and speaker-TTS voice mapping. We first conduct text analysis with rich linguistic clues to identify speakers from a given textual story. We then consider the task of speaker-TTS voice mapping as the graph coloring problem and propose effective algorithms for assigning voices to speakers given a limited number of TTS voices. Finally, we perform a user experiment on validating the usefulness of our method. The results demonstrate that our system significantly outperforms baseline systems and is also more acceptable to users.
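
The framing of speaker-TTS voice mapping as graph coloring can be pictured with a small sketch. It is only an illustration under stated assumptions, not the paper's algorithm: it assumes that two speakers "conflict" when they should sound different (for example, because they talk in the same scene) and uses a plain greedy heuristic. The function name `assign_voices`, the voice labels, and the sample story data are all hypothetical.

```python
from collections import defaultdict


def assign_voices(conflicts, speakers, voices):
    """Greedily give each speaker a voice unused by conflicting speakers.

    conflicts: iterable of (speaker_a, speaker_b) pairs that should sound different.
    Returns a dict speaker -> voice. When every voice is already taken by a
    neighbor (too few TTS voices), it falls back to the least-used voice.
    """
    neighbors = defaultdict(set)
    for a, b in conflicts:
        neighbors[a].add(b)
        neighbors[b].add(a)

    assignment = {}
    usage = {v: 0 for v in voices}
    # Handle the most constrained speakers (highest degree) first.
    for speaker in sorted(speakers, key=lambda s: (-len(neighbors[s]), s)):
        taken = {assignment[n] for n in neighbors[speaker] if n in assignment}
        free = [v for v in voices if v not in taken]
        # Prefer a conflict-free voice; otherwise reuse the least-used one.
        choice = free[0] if free else min(voices, key=lambda v: usage[v])
        assignment[speaker] = choice
        usage[choice] += 1
    return assignment


# Hypothetical story: pairs of speakers that appear in the same scene.
speakers = ["narrator", "wolf", "pig", "fox"]
conflicts = [("narrator", "wolf"), ("narrator", "pig"), ("wolf", "pig"), ("pig", "fox")]
mapping = assign_voices(conflicts, speakers, ["voice_a", "voice_b", "voice_c"])
```

With three voices available, this toy graph can be colored without any two conflicting speakers sharing a voice; with fewer voices the fallback branch would reuse the least-used voice.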

Enhancing Readability of Web Documents by Text Augmentation for Deaf People

Jin-Woo Chung, Hye-Jin Min, JoonYeob Kim, and Jong C. Park
International Conference on Web Intelligence, Semantics, and Mining (WIMS), Madrid, Spain, June 12-14, 2013.

Deaf people have particular difficulty in understanding text-based web documents because their mother language, or sign language, is essentially visually oriented. To enhance the readability of text-based web documents for deaf people, we propose a news display system that converts complex sentences in news articles into simple sentences and presents the relations among them with a graphical representation. In particular, we focus on the tasks of 1) identifying subordinate and embedded clauses in complex sentences, 2) relocating them for better readability and 3) displaying the relations among the clauses with the graphical representation. The results of our evaluation show that the proposed system does simplify complex sentences in news articles effectively while maintaining their intended meaning, suggesting that our system can be used in practice to help deaf people to access textual information.

Blog Corpus-based Clustering Scheme for Category Fluency Test (CFT) Data Clustering

Yong-Jae Lee, Maria Wolters, Hee-Jin Lee, and Jong C. Park
HCI Conference Korea, High1 Resort, Gangwon, Jan. 30-Feb. 1, 2013.

Category Fluency Test (CFT) is one of the most popular methods to screen dementia and is used in particular to evaluate the organization of the semantic memory and verbal fluency of a patient with dementia. The CFT performance is assessed according to the number of items each patient produces during the test. Recently, however, researchers have also proposed to evaluate the performance by considering the pattern of clusters and switches of the CFT data, with efforts to figure out the clusters and switches on the CFT data computationally. In this work, we propose a novel blog corpus-based clustering scheme to analyze the clusters and switches of the CFT data in a computational manner. In addition, we will argue for the need of the blog corpus-based clustering scheme by comparing it with the previous work on automatic CFT data clustering.

Analyzing and Mapping Expressions of Tense for Korean-Korean Sign Language Translation

JoonYeob Kim, Jin-Woo Chung, and Jong C. Park
Proceedings of the KIISE Fall Conference, Vol. 39 No. 2-B, pp. 121-123, Chungnam National University, November 23-24, 2012.

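Sign language is a visual language used mainly in the Deaf community and differs greatly from Korean, a spoken language, in its modes of expression. In particular, whereas Korean marks tense explicitly by combining specific grammatical morphemes with the predicate, sign language has neither morphemes that combine with the predicate nor separate function words for tense, which makes it difficult to preserve the tense of the predicate. In this paper, based on an analysis of the tense expressions appearing in each sentence of Korean-sign language parallel data, we discuss the ways of expressing tense needed to convert a given Korean sentence into an appropriate sign language sentence.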

Product Name Classification for Product Instance Distinction

Hye-Jin Min and Jong C. Park
The 26th Pacific Asia Conference on Language, Information, and Computation (PACLIC 26), Bali, Indonesia, November 7-10, 2012.

Product names with a temporal cue in a product review often refer to several product instances purchased at different times. Previous approaches to product entity recognition and temporal information analysis do not take into account such temporal cues and thus fail to distinguish different product instances. We propose to formulate the resolution of such product names as a classification problem by utilizing time expressions, event features and other temporal cues for a classifier in two stages, detecting the existence of such temporal cues and identifying the purchase time. The empirical results show that term-based features and existing event-based features together enhance the performance of product instance distinction.

Automatic Speaker Identification in Fairytales towards Robot Storytelling

Hye-Jin Min, Sang-Chae Kim, and Jong C. Park
Proceedings of the 24th Annual Conference on Human and Cognitive Language Technology (HCLT), pp. 77-84, Busan, Korea, October 12-13, 2012.

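With the goal of automatic storytelling by robots, this study addresses the problem of identifying the speaker of each utterance, which can be used for detecting the emotion of utterances and for selecting varied TTS voices for each character. In addition to features widely used in existing rule-based approaches, namely the position of the candidate, whether the speaker candidate is in the nominative or accusative case, and the presence of a speech verb, we use as additional features the semantic classification of characters that frequently appear in fairy tales and verbs related to the entrance and exit of characters. On a fairy tale corpus in which people, animals, plants, and inanimate objects can all be speakers, training and validating a decision tree with the proposed features improved accuracy by up to 49% over a rule-based baseline, and we confirmed that the proposed method is also robust to changes in the data.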

Use of Clue Word Annotations as the Silver-standard in Training Models for Biological Event Extraction

Seung-Cheol Baek and Jong C. Park
Proceedings of the 5th International Symposium on Semantic Mining in Biomedicine (SMBM 2012), pp. 34-41, University of Zurich, Switzerland, September 3-4, 2012.

Current state-of-the-art approaches to biological event extraction train models by reconstructing relevant graphs from training sentences, where labeled nodes correspond to tokens that indicate the presence of events and the relations between nodes correspond to the relations between these events and their participants. Since multi-word expressions may also indicate events, these approaches use heuristic rules to define target graphs to reconstruct by mapping various clue words into single tokens. Since training instances define actual problems to solve, the method of deriving graphs must affect the system performance, but there has not been any related study on this aspect, to the best of our knowledge. In this study, we propose an incorporation of an EM algorithm into supervised learning to look for training graphs that are more favorable for model construction. We evaluate our algorithm on the development dataset in the 2009 BioNLP shared task and show that this algorithm makes a statistically meaningful improvement on the performance of trained models over a supervised learning algorithm on a fixed set of training graphs. The models and graphs are available at http://biopathway.org/EventExtraction/.

Towards Automatic Evaluation of Category Fluency Test Performance: Distinguishing Groups using Word Clustering

Yong-Jae Lee, Maria Wolters, Hee-Jin Lee, and Jong C. Park
Korea Computer Congress (KCC), Jeju, Korea, June 27-29, 2012.

The Category Fluency Test (CFT) is a widely used verbal fluency test. The standard measure for scoring the test is the number of distinct words that a subject generates during the test. Recently, other measures have also been proposed to evaluate performance, such as clustering and switching. In this study, we examine whether clusters and switches can be assessed using word similarity measures. Based on these measures, we can distinguish between subject groups.
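
One way to picture assessing clusters and switches with a word similarity measure is sketched below. It is a minimal illustration under assumptions that are not from the abstract: cosine similarity over made-up two-dimensional word vectors, and a fixed threshold of 0.5 that decides whether consecutive words belong to the same cluster. The function names and the toy data are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def clusters_and_switches(words, vectors, threshold=0.5):
    """Split a fluency sequence into clusters; a switch is a boundary between clusters."""
    if not words:
        return [], 0
    clusters = [[words[0]]]
    for prev, cur in zip(words, words[1:]):
        if cosine(vectors[prev], vectors[cur]) >= threshold:
            clusters[-1].append(cur)  # similar enough: stay in the same cluster
        else:
            clusters.append([cur])    # dissimilar: a switch starts a new cluster
    return clusters, len(clusters) - 1


# Toy 2-d vectors, made up for illustration: axis 0 ~ "pet", axis 1 ~ "wild".
vectors = {
    "dog": (1.0, 0.1),
    "cat": (0.9, 0.2),
    "lion": (0.1, 1.0),
    "tiger": (0.2, 0.9),
}
clusters, switches = clusters_and_switches(["dog", "cat", "lion", "tiger"], vectors)
```

In this toy sequence the pets and the wild animals form two clusters separated by one switch; counts like these could then be compared across subject groups.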

Age and Gender Prediction from Korean Tweets with Stylometric Analysis

Sang-Chae Kim and Jong C. Park
Korea Computer Congress (KCC), Jeju, Korea, June 27-29, 2012.

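People develop their own distinctive writing styles under the influence of those around them. People of the same age group and gender therefore tend to show similar writing styles. Based on this assumption, we analyzed the style of tweets written by people of various age groups and genders and conducted experiments to predict the age group and gender of the author of an arbitrary tweet.
By combining features built from expressions frequently seen in Korean web language with n-gram features, which are less dependent on the data, we obtained predictions with an accuracy about 25% higher than the maximum-likelihood baseline. We also gained a better understanding of how effectively each feature set contributes to the prediction.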

Analyzing the Patterns of Switching and Clustering on CFT Data Using Hidden Markov Model

Yong-Jae Lee, Hee-Jin Lee, Maria Wolters, and Jong C. Park
HCI Conference Korea, Alpensia resort, January 11-13, 2012.

Early detection of dementia allows people to have more time to prepare themselves for the symptoms. As one of the methods to screen for dementia, the Category Fluency Test (CFT) is used to evaluate the organization of semantic memory and to assess the verbal fluency performance of patients with dementia. Recently, various measures to evaluate their CFT performance have been studied and, in particular, clusters and switches of the CFT data are considered important factors. In this work, we analyze the clusters and switches of the CFT data using a Hidden Markov Model (HMM) to verify the hypothesis that a comprehensive pattern analysis of their switches and clusters can reveal important characteristics of verbal fluency performance.

Age Prediction from Korean Tweets with Style-Based Feature Analysis

Sang-Chae Kim and Jong C. Park
HCI Conference Korea, Alpensia resort, January 11-13, 2012.

Authorship attribution is the task of predicting the author of a text by analyzing his/her writing. The increasing popularity of the Internet has made it easy for authorship attribution researchers to access large corpora with annotated authorship. Such large corpora have enabled researchers to predict authors’ demographic characteristics such as age. In this paper, we analyze tweets in Korean with a small number of style-based features such as emoticons and propose a way of using these features to predict the age group. Our prediction resulted in a relatively high accuracy of 0.75.

Analyzing Disagreements among ICD-9-CM Coders

Seung-Cheol Baek and Jong C. Park
4th International Symposium on Languages in Biology and Medicine (LBM 2011), Nanyang Technological University, Singapore, December 14-15, 2011.

NLP researchers find it difficult to acquire and interpret clinical free text directly, most likely because of their unfamiliarity with medical practices. This is why publicly available annotated corpora would be of much help, but there are still very few in the clinical domain due to patient confidentiality. In this regard, it is encouraging to see that the Computational Medicine Center’s 2007 Challenge provides a publicly available corpus consisting of radiology reports with ICD-9-CM codes as independently assigned by three different coders. However, the corpus shows many disagreements among the coders, making it imperative to set the standard correctly for their proper interpretation. A proposal for such a standard as implicitly advanced by its developers is to take the majority annotation. In this paper, we propose an alternative method to address such disagreements. We believe our work not only makes a meaningful improvement on the utility of this corpus but also has good implications for similar tasks, such as ICD-10-CM coding.

Identifying Gene Expression Changes in Prostate Cancer Cells from the Literature

Hee-Jin Lee, Hyunju Lee, and Jong C. Park
4th International Symposium on Languages in Biology and Medicine (LBM 2011), Nanyang Technological University, Singapore, December 14-15, 2011.

We propose to identify information about gene expression changes in diseased cells from the literature, utilizing event extraction techniques. Gene expression changes in a diseased cell or tissue happen when its expression level is either higher or lower than the level in normal states. Such information can be critically used in the next stage of understanding the molecular mechanisms of the disease, leading naturally to its pathway. In this work, we focus on prostate cancer (PC), one of the most troubling cancers.

Detecting and Blocking False Sentiment Propagation

Hye-Jin Min and Jong C. Park
Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP), pp. 354–362, Chiang Mai, Thailand, November 8-13, 2011.

Sentiment detection of a given expression involves interaction with its component constituents through rules such as polarity propagation, reversal or neutralization. Such compositionality-based sentiment detection usually performs better than a vote-based bag-of-words approach. However, in some contexts, the polarity of the adjectival modifier may not always be correctly determined by such rules, especially when the adjectival modifier characterizes the noun so that its denotation becomes a particular concept or an object in customer reviews. In this paper, we examine adjectival modifiers in customer review sentences whose polarity should either be propagated (SHIFT) or not (UNSHIFT). We refine polarity propagation rules in the literature by considering both syntactic and semantic clues of the modified nouns and the verbs that take such nouns as arguments. The resulting rules are shown to work particularly well in detecting cases of ‘UNSHIFT’ above, improving the performance of overall sentiment detection at the clause level, especially in ‘neutral’ sentences. We also show that even such polarity that is not propagated is still necessary for identifying implicit sentiment of the adjacent clauses.

Text Parsing for Sign Language Generation with Combinatory Categorial Grammar

Jin-Woo Chung and Jong C. Park
2nd International Workshop on Sign Language Translation and Avatar Technology (SLTAT), 13th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS), University of Dundee, UK, October 23, 2011.

In this paper, we propose a method to convert a written sentence in spoken language into a suitable representation in sign language within the framework of Combinatory Categorial Grammar (CCG). The representation reflects the multi-channel nature of sign language performance, including manual and non-manual linguistic signals of multiple channels and information about their coordination. We show that most information needed to address linguistic phenomena in sign language such as word order, spatial references, classifier construction, and verb inflection can be encoded in the CCG sign lexicon. During the CCG derivation process, a semantic representation for sign language expressions is created so that the resulting output can be directly interpreted as a sequence of signs, each containing manual and non-manual components and representing their coordination and spatial relationship. The derivation process with the constructed lexicon is presented with several examples for Korean Sign Language. We discuss implications of our proposal and future directions.

Revisiting Concatenative Video Synthesis with Relaxed Constraints

Sangyong Gil and Jong C. Park
2nd International Workshop on Sign Language Translation and Avatar Technology (SLTAT), 13th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS), University of Dundee, UK, October 23, 2011.

Reproducing Fairy Tales for Plot Identification

SeungJoo An and Jong C. Park
Proceedings of the 23rd Annual Conference on Human and Cognitive Language Technology (HCLT), pp. 3-8, Seoul, Korea, October 6-7, 2011.

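To understand the story of a text automatically, previous research has sought to identify the events described in the text and to combine them to determine how the story is structured. However, beyond requiring a deep semantic understanding of the story, this approach has limits in settings with scarce language resources, because situations and events vary from text to text. This problem can be resolved without harming the naturalness of story understanding if events can be abstracted and expressed simply. In this paper, as a preliminary study toward the abstraction of events, we extract the events that characters in a text perform or undergo, identify the flow of events using the PMI technique, and supplement it with events that may be omitted in the story understanding process by referring to linguistic clues. Through this approach, we present a method for reconstructing and simplifying the events that characters can perform.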
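
The PMI step mentioned in this abstract can be illustrated with a small sketch. It shows only the standard pointwise mutual information formula, PMI(a, b) = log(P(a, b) / (P(a) P(b))), under assumptions that are not from the paper: events are grouped into one list per character, and probabilities are estimated as the fraction of lists containing an event or a pair of events. The function name `event_pmi` and the sample events are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def event_pmi(chains):
    """Compute PMI for every unordered pair of events that co-occur in a chain.

    chains: list of event lists, one per character (hypothetical input format).
    Probabilities are the fraction of chains containing the event(s).
    """
    n = len(chains)
    single = Counter()
    pair = Counter()
    for chain in chains:
        events = sorted(set(chain))  # count each event once per chain
        single.update(events)
        pair.update(combinations(events, 2))
    return {
        (a, b): math.log((count / n) / ((single[a] / n) * (single[b] / n)))
        for (a, b), count in pair.items()
    }


# Made-up event chains for illustration only.
chains = [
    ["meet", "fight", "escape"],
    ["meet", "fight"],
    ["meet", "marry"],
    ["sleep", "wake"],
]
scores = event_pmi(chains)
```

Pairs that appear together more often than their individual frequencies predict receive a higher score, which is what makes PMI usable as a signal for linking events into a flow.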

Reading Desk for Preschool Children and Older People with Emotional Speech Synthesis

Ho-Joon Lee, Yong-Jae Lee, and Jong C. Park
International Conference on Convergence and Hybrid Information Technology (ICHIT), LNCS 6935, pp. 740-747, Daejeon, Korea, September 23-25, 2011.

In this paper, we introduce a reading desk designed to read books to older people and children. For this purpose, we propose a reading desk together with an emotional speech synthesis system for Korean. The reading desk system provides a wireless audio output unit, and the reading desk is directly connected to a laptop computer in order to identify the current user and target reading material. The emotional speech synthesis system for Korean is a prosody re-synthesis system that has the option of providing four different emotions: anger, fear, happiness, and sadness. This system is also able to modify the speech rate and intensity of speech as much as users want. We analyzed 240 pieces of emotional speech in order to extract distinct prosody structures for each emotion in Korean. The evaluation results show that we achieved a recognition rate of 48.5% for happiness among the four emotions and that, with enough training experience, the average recognition rate improved up to 95.5% for all emotions.

Linguistic Analysis of Picture Description for Language Impairment Diagnosis

Yong-Jae Lee, Hye-Jin Min, and Jong C. Park
Korea Computer Congress (KCC), Gyeongju, Korea, June 30-July 2, 2011.

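People acquire their own characteristics of language use depending on their upbringing and learning. These characteristics provide indicators of an individual's language fluency, and analyzing them makes it possible to respond actively to changes caused by impairment. However, research on identifying the language use characteristics of a particular individual is still lacking. In this study, to identify individual language use characteristics, we first collected picture description texts from the general population and, based on their analysis, seek to identify language use characteristics applicable to the diagnosis of language impairment. As a result, we were able to identify some characteristics of individual language use at the morpheme level, at the word level, and in the manner of conveying content, and we expect these characteristics to provide important clues for tracking changes in language use caused by cognitive impairments such as dementia.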

Improving Accessibility to Web Documents for the Aurally Challenged with Sign Language Animation

Jin-Woo Chung, Ho-Joon Lee, and Jong C. Park
International Conference on Web Intelligence, Mining and Semantics (WIMS'11), Sogndal, Norway, May 25-27, 2011.

In this paper, we describe how to improve accessibility for the aurally challenged in a web environment, focusing on utilizing a signing avatar for web pages. Many systems have previously been proposed to make a web environment more accessible for deaf people by providing signed expressions, i.e., translating written text into sign language animations and presenting them in a proper way, based on the observation that deaf users normally have much difficulty understanding text-based information as well as audio content. We analyze the strengths and weaknesses of these systems with respect to the design criteria we discuss, and propose a system that presents a signing avatar for web page documents via a mobile device, which is expected to overcome the shortcomings of the previous systems and to improve the accessibility of textual content for deaf users in a web environment. The proposed system has three main parts based on a client-server architecture: 1) a client that executes a web browser and transmits selected text to the server, 2) a server that takes text as input and translates it into signed expressions through a sign language generation module, and 3) a mobile device that displays signing animation transmitted from the server by streaming. We also present some linguistic issues raised by the differences between Korean and Korean Sign Language. To the best of our knowledge, this is the first approach to the use of a mobile device for web document access by aurally challenged people. We discuss implications of our study and future directions.

Physical Push with a Socially Intelligent Robot: Make your wishes to 'Genie in the Lamp'

Hye-Jin Min and Jong C. Park
Proceedings of the 6th IEEE/ACM International Conference on Human-Robot Interaction, Late Breaking News, pp. 203-204, March 6-9, 2011, Lausanne, Switzerland. ACM

This paper proposes a robotic agent named ‘Genie’ that understands a user’s wish and gives its possible answers on a social network platform. Once a potential wish is detected upon monitoring the text updates in the micro-blog of the user, the agent initiates a task to help the user with both NLP and metadata analysis. As an interaction scenario, we set the type of a robot as an agent that identifies wishful products by searching for and analyzing product information on the web. After an analysis of the vast amount of data, the agent provides possible answers to the user as a way of granting the wish that might require additional time and effort to achieve. In order to draw the user's attention, the agent makes a physical movement as a push notification with more user-friendliness.

Annotation of Protein State Information in Biomedical Text

Hee-Jin Lee and Jong C. Park
9th Asia Pacific Bioinformatics Conference (APBC), Poster Presentation, Incheon, Korea, January 11-14, 2011.

Korean Speech Synthesis for Automatic Fairy Tale Narration with Automatic Identification of Character Roles

SeungJoo An, Ho-Joon Lee, and Jong C. Park
HCI Conference Korea, Alpensia resort, January 26-28, 2011.

As there is a growing tendency for parents to leave their children alone while they work, a system that provides necessary services to children is needed. Among these services, an automatic fairy tale narration system can help the language and emotional development of young children. However, if the roles of the characters in the story cannot be determined correctly by an automatic fairy tale narration system, the meaning of fairy tales can be conveyed differently, if not distorted. In this paper, we propose an automatic role identification system based on linguistic clues to classify such roles and, building on the classified roles, a speech synthesis system for more natural and clear automatic fairy tale narration.

Evaluation of Emotion Categories based on the Analysis of Emotion-Rich Fairy Tales

Ho-Joon Lee and Jong C. Park
HCI Conference Korea, Alpensia resort, January 26-28, 2011.

In this paper, we analyze the characteristics of emotion categories derived from the utterances of fairy tales. For this purpose, we extract the explicit emotional state of each utterance and calculate the distribution of these states. As a result, we find that the emotional states of anger and astonishment are well-defined emotion categories, whereas others need more refinement. This finding can be used for the improvement of emotional speech synthesis and recognition systems.

Automatic Identification of Character Roles for Natural Fairy Tale Narration

SeungJoo An and Jong C. Park
KIISE Fall Conference, Danguk University, November 5-6, 2010.

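When narrating a fairy tale, a narrator speaks with emotion based on the roles of the characters in the story. This draws the interest of young children, the audience, and immerses them in the story, thereby improving their comprehension. A proper understanding of character roles is thus one of the important factors in automatic fairy tale narration. In this paper, we examine the linguistic elements that must be handled in order to classify the roles of characters in fairy tales. Based on these, we also present a system that automatically classifies and processes such roles.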

A Ubiquitous Smart Parenting and Customized Education Service Robot

Ho-Joon Lee and Jong C. Park
The 2010 IEEE Workshop on Advanced Robotics and its Social Impacts (ARSO), 2010.

In this paper, we introduce a u-SPACE service robot, designed to help children who may be left alone while their caregivers are away from home. In order to protect children from indoor dangers, this service robot provides customized guiding messages taking into account the location information and behavioral patterns of a child, after the detection of dangerous objects and situations. And these guiding messages are vocalized by our emotional speech generation system. This emotional speech generation system is also being put to use in reading fairy tales to a child, as a part of a home education service. The outward appearance of the u-SPACE service robot is modeled on a teddy bear, in order to provide a safe and comforting environment for children. Two touch sensors designed for basic interactions between a child and the robot are installed on each hand of the robot, and an RFID tag is placed inside the body. A PDA with a Wi-Fi communication module, a touch screen, and a speaker is used as a main operating device of this u-SPACE service robot.

Detecting and Resolving Syntactic Ambiguity for Automatic Korean-Korean Sign Language Translation

Jin-Woo Chung and Jong C. Park
Proceedings of the 22nd Annual Conference on Human and Cognitive Language Technology, pp. 55-62, 2010.

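Sign language is a visual language used mainly in the Deaf community and differs considerably in syntax from Korean, a spoken language. In particular, because particles and verbal endings are rarely used in sign language, the syntactic relations between sentence constituents can become ambiguous when a signed sentence is generated in the conventional way, by removing these elements from a Korean sentence and then simply listing the base forms of the constituents without regard to word order. In this paper, we take the view that syntactic ambiguity arises from specific syntactic structures that additionally emerge in the process of converting a Korean sentence into a signed sentence. We classify these structures into basic argument structure, restrictive modification structure, coordinate structure, and predicative structure, and present a method for identifying each of them and resolving the syntactic ambiguity accordingly.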

Intonation Generation for Korean Speech Synthesis with Automated Sentence Type Classification

Jin-Woo Chung, Ho-Joon Lee, and Jong C. Park
21st HCI Conference Korea, Phoenix Park, January 27-29, 2010.

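Speech is the most basic means of conveying information in human-human interaction, and it has recently come to be widely used as an effective means of natural interaction between humans and machines, including robots. Speech is a linguistic expression in written form converted into sound, and it carries intonation information. If this intonation is not expressed properly, even the information carried by the text is difficult to convey fully, so expressing intonation that fits the situation is very important. In Korean speech, the overall intonation of a sentence varies with its type, so the sentence type must be identified well for natural speech synthesis. In this paper, we propose a sentence type classification system that automatically classifies the types of Korean sentences, and a speech synthesis system that generates intonation appropriate to the classified sentence type to produce natural-sounding speech.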

Automatic Extraction of the Usage Information from the Component Words in Gene Ontology Terms to Enhance Consistency and Predictability

Seung-Cheol Baek and Jong C. Park
3rd International Symposium on Languages in Biology and Medicine (LBM 2009), long paper, Seogwipo, Korea, November 8-10, 2009.

The Gene Ontology (GO) is a controlled vocabulary that has gone through constant changes, motivated primarily by the need to reflect the dynamic nature of knowledge it addresses and the need for usability improvement. A good policy on such changes would be to maintain consistency across terms and structures so as to highlight the missing parts that are likely to be added afterwards, or the unchanged parts to which a policy on usability improvement might not have yet applied. In particular, we argue that the component words inside terms must be used consistently across terms, in order to enhance the predictability of such terms, thus their usability as well. For this purpose, we propose a representation for word usage and a method for extracting it from GO and show its utility in identifying the direction of future changes readily as well as in enhancing the consistency of terms.

Toward finer-grained sentiment identification in product reviews through linguistic and ontological analyses

Hye-Jin Min and Jong C. Park
Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pp. 169-172, Singapore, August 2-7, 2009.

We propose categories of finer-grained polarity for a more effective aspect-based sentiment summary, and describe linguistic and ontological clues that may affect such fine-grained polarity. We argue that relevance for satisfaction, contrastive weight clues, and certain adverbials work to affect the polarity, as evidenced by the statistical analysis.

Interpretation of User Evaluation for Emotional Speech Synthesis System

Ho-Joon Lee and Jong C. Park
13th International Conference on Human-Computer Interaction (HCII 2009), San Diego, USA, July 19-24, 2009.

Whether it is for human-robot interaction or for human-computer interaction, there is a growing need for an emotional speech synthesis system that can provide the required information in a more natural and effective manner. In order to identify and understand the characteristics of basic emotions and their effects, we propose a series of user evaluation experiments on an emotional prosody modification system that can express either perceivable or slightly exaggerated emotions classified into anger, joy, and sadness as an independent module for a general purpose speech synthesis system. In this paper, we propose two experiments to evaluate the emotional prosody modification module according to different types of the initial input speech. And we also provide a supplementary experiment to understand the apparently prosody-independent emotion, or joy, by replacing the resynthesized joy speech information with original human voice recorded in the emotional state of joy.

Extracting Melodies from Piano Solo Music Based on Characteristics of Music

Yoonjae Choi and Jong C. Park
Proceedings of the Korea Computer Congress (KCC 2009), Vol. 36, No. 1(A), pp. 124-125, Jeju, July 1-3, 2009.
(selected as best paper)

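With the development of the Internet, research on methods for searching and using multimedia material has been actively pursued. In particular, the rapid growth of the digital music market has steadily increased the demand for music search and recommendation, and to improve the performance of the music-based application systems that provide such services, it is essential to extract melodies from polyphonic music, the common form of music. In this paper, we propose a method for extracting melodies from piano solo music, which can have high polyphonic complexity and a wide pitch range.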

Extracting Melodies from Polyphonic Piano Solo Music Based on Patterns of Music Structure

Yoonjae Choi, Ho-Joon Lee, Hodong Lee, and Jong C. Park
Proceedings of the 20th Human Computer Interaction (HCI 2009), pp. 725-732, Phoenix Park, Feb 9-11, 2009.

Thanks to the development of the Internet, people can easily access a vast amount of music. This brings attention to application systems such as a melody-based music search service or music recommendation service. Extracting melodies from music is a crucial process to provide such services. This paper introduces a novel algorithm that can extract melodies from piano music. Since piano can produce polyphonic music, we expect that by studying melody extraction from piano music, we can help extract melodies from general polyphonic music.

Analysis and Use of Intonation Features for Emotional States

Ho-Joon Lee and Jong C. Park
Proceedings of the 20th Annual Conference on Human and Cognitive Language Technology, pp. 144-149, October 11-12, 2008.

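In this paper, we use an emotional speech corpus of 240 utterances in total, in which six speakers uttered eight sentences in five emotional states, to analyze the intonation patterns characteristic of each emotional state and discuss how to apply these patterns to a speech synthesis system. To this end, we analyze the characteristic intonation patterns of each emotional state with a focus on the length of intonational phrases, the phrase-final boundary tones of intonational phrases, and declination, and show the process of applying to a synthesis system the intonation features that can distinguish the emotions of joy, sadness, anger, and fear. Through this study, we confirmed a rise in intonation in the emotion of anger and found characteristic intonation patterns for each emotion.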

Towards Knowledge Discovery through Automatic Inference with Text Mining in Biology and Medicine

Hee-Jin Lee and Jong C. Park
3rd International Symposium on Semantic Mining in Biomedicine (SMBM), Turku, Finland, September 1-3, 2008.

Field experts in biology and medicine search the literature for state-of-the-art results and occasionally discover knowledge through manual inference on published causal relations. However, the results of such inference cannot be sufficiently accurate and/or complete, as the domain of published relations is rather huge. In this paper, we introduce an automatic inference system, BioDetective, which works on literature-mined qualitative causal information in biology and medicine. BioDetective provides proofs for such qualitative causal information, and predicts the existence of new causal information, if there is any. The system is tested with a case study, where literature-mined information about protein regulation is utilized to come up with new knowledge.

An effective way to learn biological knowledge with linguistic resources

Jin-Bok Lee, Tak-eun Kim, and Jong C. Park
18th International Congress of Linguists (CIL 18), Seoul, Korea, July 21-26, 2008.

The most general and effective way for people to acquire desired knowledge is to learn from tutors through face-to-face contact. Tutors can pick out important pieces of information and deliver them systematically to learners, considering their specialties, interests, rates of progress, and so on. However, since not all learners can be taught by tutors at a convenient time, the field of e-learning, or distance learning, has emerged.
To maintain the benefits of face-to-face learning in an automatic way, the challenge remains of equipping computers with the expertise, skills, and modes of action of the human tutor, overcoming spatial, temporal, socio-economic, and environmental restrictions. In order to meet this challenge, we focus on two issues: (1) information investigation: how to pick out essential pieces of information that do not include overlapping or obsolete pieces, and (2) information delivery: how to deliver the selected pieces to learners effectively in terms of understanding and memorization.
In this paper, we propose a web-based smart tutoring system that helps biology majors learn about genes. To incorporate the two issues described above into our tutoring system, we make extensive use of linguistic resources in the biology domain, such as Gene Ontology and UMLS, for selecting and classifying information from a huge amount of data. We believe that our tutoring system can autonomously carry out almost all the functions of a human tutor, including investigation, delivery, and adaptation to learners' feedback.

Syntactic Construction of Coordination in Sign Language Generation

Hodong Lee, Sangha Kim, and Jong C. Park
18th International Congress of Linguists (CIL 18), Seoul, Korea, July 21-26, 2008.

Coordination in sign languages is an essential construction to describe more than one kind of information, as used in natural languages. Although it may appear to follow general rules of coordination, its realization with multi-channel motions is often quite different from that in natural languages, due to the differences at levels of syntax and semantics. A multi-channel motion is simultaneously composed of shape, position, orientation and movement of the hands, arms, body, or face. In this paper, we address the problems in converting coordination-bearing sentences into their matching motions in sign languages. In particular, we focus on the issues between the Korean language and the Korean sign language (KSL).

Sign Language Generation with Animation by Adverbial Phrase Analysis

Sangha Kim and Jong C. Park
17th Human Computer Interaction (HCI 2008), Phoenix Park, Feb 13-15, 2008.
(selected as best paper)

Sign languages, commonly used in aurally challenged communities, are a kind of visual language expressing sign words with motion. The spatiality and motility of a sign language are conveyed mainly via sign words as predicates. A predicate is modified by an adverbial phrase with an accompanying change in its semantics so that the adverbial phrase can also affect the overall spatiality and motility of expressions of a sign language. In this paper, we analyze the semantic features of adverbial phrases which may affect the motion-related semantics of a predicate in converting expressions in Korean into those in a sign language and propose a system that generates corresponding animation by utilizing these features.

On the Automatic Generation of Illustrations for Events in Storybooks: Representation of Illustrative Events

Seung-Cheol Baek, Hee-Jin Lee, and Jong C. Park
17th Human Computer Interaction (HCI 2008), Phoenix Park, Feb 13-15, 2008.

Storybooks, especially those for children, may contain illustrations. An automated system for generating illustrations would help the production process of storybook publishing. In this paper, we propose a method for automatically generating layouts of objects when generating illustrations. In the generated layouts, it is preferred to avoid unnecessary overlap between objects, in accordance with the spatial information in storybooks. We first define a representation scheme for spatial information in natural language sentences using tree structures and predicate-argument structures. Unification of tree structures and Region Connection Calculus are then used to manipulate the information and generate corresponding illustrations.

Visualizing the Temporal Distribution of Terminologies for Biological Ontology Development

Tak-eun Kim, Hodong Lee, Jinah Park, and Jong C. Park
International Conference on Visualization and Data Analysis (VDA), San Jose, USA, 26-31 January, 2008.

Communities in biology have developed a number of ontologies that provide standard terminologies for the characteristics of various concepts and their relationships. However, it is difficult to construct and maintain such ontologies in biology, since it is a non-trivial task to identify commonly used potential member terms in a particular ontology, in the presence of constant changes of such terms over time as the research in the field advances. In this paper, we propose a visualization system, called BioTermViz, which presents the temporal distribution of ontological terms from the text of published journal abstracts. BioTermViz shows such a temporal distribution of terms for journal abstracts in the order of published time, occurrences of the annotated Gene Ontology concepts per abstract, and the ontological hierarchy of the terms. With a combination of these three types of information, we can capture the global tendency in the use of terms, and identify a particular term or terms to be created, modified, segmented, or removed, effectively developing biological ontologies in an interactive manner. In order to demonstrate the practical utility of BioTermViz, we describe several scenarios for the development of an ontology for a specific sub-class of proteins, or ubiquitin-protein ligases.

Analysis of Indirect Uses of Interrogative Sentences Carrying Anger

Hye-Jin Min and Jong C. Park
PACLIC 21, Seoul National University, November 1-3, 2007.

Interrogative sentences are generally used to perform speech acts of directly asking a question or making a request, but they are also used to convey such speech acts indirectly. In the utterances, such indirect uses of interrogative sentences usually carry speaker’s emotion with a negative attitude, which is close to an expression of anger. The identification of such negative emotion is known as a difficult problem that requires relevant information in syntax, semantics, discourse, pragmatics, and speech signals. In this paper, we argue that the interrogatives used for indirect speech acts could serve as a dominant marker for identifying the emotional attitudes, such as anger, as compared to other emotion-related markers, such as discourse markers, adverbial words, and syntactic markers. To support such an argument, we analyze the dialogues collected from the Korean soap operas, and examine individual or cooperative influences of the emotion-related markers on emotional realization. The user study shows that the interrogatives could be utilized as a promising device for emotion identification.

On the Automatic Generation of Illustrations for Events in Storybooks

Seung-Cheol Baek, Eunyoung Chang, and Jong C. Park
KIISE 2007 Fall Conference, Pusan National University, October 26-27, 2007.

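The boundary between professional writers and the general public is blurring with the rise of Internet novels and the like. People who write for children want to publish their works with illustrations. This paper discusses a method for automatically generating an illustration when a user wants one on the subject of a particular event in a fairy tale. In particular, we propose a method for drawing as an illustration a single event that is expressed by a combination of sentences. We use Combinatory Categorial Grammar to interpret natural language and extract events.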

Translating a Complex Sentence in Korean into a Sign Language Script for an Automatic Sign Language Generation

Sangha Kim, Eunyoung Chang, and Jong C. Park
The 19th Annual Conference on Human and Cognitive Language Technology (KLIP 2007), Kyungpook National University, October 12-13, 2007.

Characteristics of Spoken Discourse Markers and their Application to Speech Synthesis Systems

Ho-Joon Lee and Jong C. Park
The 19th Annual Conference on Human and Cognitive Language Technology (KLIP 2007), Kyungpook National University, October 12-13, 2007.

Customized Message Generation and Speech Synthesis in Response to the Characteristic Behavioral Patterns of Children

Ho-Joon Lee and Jong C. Park
HCI International, Beijing, P. R. China, July 22-27, 2007.

There is a growing need for a user-friendly human-computer interaction system that can respond to various characteristics of a user in terms of behavioral patterns, mental state, and personalities. In this paper, we present a system that generates appropriate natural language spoken messages with customization for user characteristics, taking into account the fact that human behavioral patterns usually reveal one’s mental state or personality subconsciously. The system is targeted at handling various situations for five-year-old kindergarteners by giving them caring words during their everyday lives. With the analysis of each case study, we provide a setting for a computational method to identify user behavioral patterns. We believe that the proposed link between the behavioral patterns and the mental state of a human user can be applied to improve not only user interactivity but also believability of the system.

Representing Emotions with Linguistic Acuity

Hye-Jin Min and Jong C. Park
Conference on Intelligent Text Processing and Computational Linguistics (CICLing), Mexico City, Mexico, February 18-24, 2007.

For a robot to make effective and friendly interaction with human users, it is important to keep track of emotional changes in utterance properly. Emotions have traditionally been characterized by intuitive but atomic categories or as points in evaluation-activity dimensions. However, this characterization falls short of capturing subtle emotional changes either in narration or in text, where the vast majority of information is presented with a host of linguistic constructions that convey emotional information. We propose a novel representation scheme for emotions, so that such important features as duration, target and intensity can also be treated as first-class citizens and systematically accounted for. We argue that it is with this new mode of representation that the subtlety of the emotional flow in utterance can be properly addressed. We use this representation to encode the emotional states and intentions of characters in the drama scripts for soap opera and describe how it is utilized in conjunction with parsing for lexicalized grammars.

Identifying Emotional Cues in Dialogue Sentences According to Targets

Hye-Jin Min and Jong C. Park
HCI Conference Korea, Phoenix Park, February, 2007.

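In everyday conversation or computer-mediated conversation, self-disclosure is a process of sharing personal information with each other in order to maintain a close relationship. Personal information in self-disclosure includes thoughts and experiences as well as emotions, and emotions in particular serve as an effective means of communication for setting the mood of a conversation and keeping it going smoothly. In emotional disclosure during conversation, the actual intensity of the expression and the degree of disclosure vary with the conversation partner (the disclosure target) and the object of the emotional expression (the expression target). In this study, we identify the emotional states of participants in conversations held over an instant messenger, which allows users to exchange messages and transfer files over the Internet, taking into account the disclosure target and the expression target. As a preliminary investigation, we analyze the emotional expression patterns of characters in drama scripts. Using these patterns, we identify the emotional states of participants with respect to the expression target through syntactic and semantic analysis of dialogue sentences with different disclosure targets, and we provide an interface through which participants can observe their own emotions.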

Searching Animation Models with a Lexical Ontology for Text Animation

Eunyoung Chang, Hee-Jin Lee, and Jong C. Park
HCI Conference Korea, Phoenix Park, February, 2007.

Customized Emotion Representation for Automatic Generation of Emotionally Appropriate Dialogs

Hye-Jin Min and Jong C. Park
The Korean Society for Emotion & Sensibility, KIST, May, 2006.

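In this study, we analyze a corpus of dialogues between users and a system that provides movie information and recommends movies, identify the universal and individual emotional information that appears in the dialogues, and discuss a method for describing it. Linguistic information expressing emotion is automatically extracted from the dialogues using natural language processing techniques and is used to generate emotionally appropriate dialogue responses. We evaluate the appropriateness and usefulness of the proposed description method through an analysis of the dialogue corpus with natural language processing techniques and present the results.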

Personalized Background Music Recommendation System for User Generated Contents using Collective Intelligence

Doojin Park and Jong C. Park
The Korean Society for Emotion & Sensibility, KIST, May, 2006.

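Recently, on blog services such as Cyworld, many users post matching background music together with their writing. Users choose music they like or music they judge to fit the mood of the post, but selecting appropriate music is not easy. Existing music recommendation systems, meanwhile, use emotional information that experts have entered for a given piece after analyzing it according to music theory, or emotional information obtained by analyzing the waveform of the music; yet, by the nature of music, the emotions people feel from it differ according to personal disposition. In this study, we propose a system that analyzes the text a user posts on a blog with natural language processing techniques, extracts contextual information including the emotional information the text contains, and automatically recommends background music matching this information while taking user information into account.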

u-SPACE: Ubiquitous Smart Parenting and Customized Education

Hye-Jin Min, Doojin Park, Eunyoung Chang, Ho-Joon Lee, and Jong C. Park
HCI Conference Korea, Phoenix Park, February, 2006.

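As parents spend more time working outside the home, children are spending more time at home alone. Help is therefore needed that protects children from the indoor dangers they are easily exposed to, without greatly limiting their independence, and that gives appropriate guidance according to the child's psychological and emotional state. In this study, we protect children from physical dangers based on RFID technology and use natural language processing techniques to provide multimedia content, such as music and animation, suited to the child's psychological and emotional state. We also provide information such as schedule management, which requires continuous attention, and instructions for using home electronics in daily life, helping children do things on their own. We design a virtual home and present the results of simulating these services around feasible scenarios.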

Customized Speech Synthesis for Children with Characteristic Behavioral Patterns

Ho-Joon Lee and Jong C. Park
HCI Conference Korea, Phoenix Park, February, 2006.

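Exchanging information through speech requires no additional training or equipment and has almost no spatial constraints, so it can be used regardless of the user's age, including by older and frail users. In addition, because speech can produce a synergistic effect through interaction with other channels such as vision or touch, it can improve information delivery and offer a user-friendly service when used as an interface between people and machines. However, if the same type of spoken information is given to a user repeatedly in the same situation, the monotony of expression can sharply reduce how well the information is conveyed. When conveying information through speech, therefore, a system should be able to hold the user's attention, even in the same situation, by choosing sentence structures and vocabulary differentiated according to the user's behavioral patterns, mental state, and surroundings. In this paper, we propose a system that provides customized speech synthesis output for children around five years of age based on an analysis of their behavioral patterns. To this end, we analyze the main behavioral patterns of children in the physical space of a kindergarten and, by asking working kindergarten teachers to convey the same information under different conditions, identify the sentence structures and lexical characteristics of their utterances according to the child's behavioral pattern, location, age, and personality. Finally, to produce customized speech synthesis output, we simulate a kindergarten space and use RFID to identify the children's behavioral patterns and locations. We then restructure the sentence structure and vocabulary of the sentences to be synthesized, reflecting the sentence structures and lexical characteristics analyzed for each situation, to generate speech synthesis output customized to the user. Through these results, we examine how children's behavioral patterns affect the sentence structure and vocabulary of utterances and evaluate the restructured utterances.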

Effective text visualization for biomedical information

Tak-eun Kim and Jong C. Park
HCI Conference Korea, Phoenix Park, February, 2007.

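The amount of information in the biomedical field is growing very rapidly. Much research has used text mining techniques to extract useful information from this vast body of information. However, even the extracted information is vast and, being in text form, is hard to grasp intuitively. An information visualization system is therefore essential for understanding it more intuitively. Although much research on such visualization has been carried out recently, even the visualized information is so extensive that a method for filtering the information the user needs is required. A visualization system should also provide a means for knowledge discovery. In this paper, focusing on the text visualization of biomedical information, we propose an effective way of representing biomedical information and an intuitive interface for knowledge discovery.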

Semantic Representation for Temporal Adverbs and Temporal Morphemes

Eunyoung Chang and Jong C. Park
Proceedings of Annual Conference of the KSLI (Korea Society for Language and Information), pp. 193-207, Kangwon, Korea, 2006.

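In a sentence, a situation is described mainly by a predicate, and its temporal meaning is expressed separately by temporal expressions. Among these, temporal adverbs and tense-aspect morphemes (pre-final endings) are known to contribute decisively to tense and aspect, but because several elements appear together in a sentence, there is as yet no consensus on the meaning and function of each. In this paper, we classify the temporal properties of situations, analyze how temporal adverbs and tense-aspect morphemes affect each property, and propose a semantic representation at the lexical level. Temporal adverbs modify the location of the situation time or the temporal properties of the situation, while tense-aspect morphemes express the relation between the utterance time and the situation time, or the speaker's attitude toward the situation. On this basis, we present appropriate lexical categories and show, as a derivation in Combinatory Categorial Grammar, how the final meaning is obtained from their combination.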

CCG-based RNA Secondary Structure Prediction for Structural Homology Analysis

Hee-Jin Lee and Jong C. Park
16th International Conference on Genome Informatics (GIW), Yokohama, Japan, December, 2005.

Various systems have been proposed to predict secondary structures of RNAs using their sequence information. Among them, Uemura et al. [2] described a system that recognizes some typical RNA secondary structures such as hairpin loops and pseudoknots with Tree Adjoining Grammar. However, their work captures only known sub-structures, and not those unknown sub-structures that might also exist. A ternary pseudoknot, composed of three pairs of cross-serially arranged reverse-complementary sequences, may be one such example. Figure 1 illustrates an example ternary pseudoknot. We describe a version of Combinatory Categorial Grammars (CCGs) for an RNA secondary structure prediction system to discover such unknown sub-structures. The parser for the proposed CCG takes an RNA sequence and produces the semantics string that contains structural information of the sequence.
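
The notion of reverse-complementary sequences used in the abstract above can be illustrated with a small Python sketch. It only checks Watson-Crick pairing (A-U, G-C) between two segments and whether a list of segments follows the cross-serial order a b c a' b' c'; it is not the CCG parser the paper describes, and the function names and example segments are hypothetical.

```python
# Watson-Crick complements for RNA bases (wobble pairs such as G-U are ignored here).
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Return the reverse complement of an RNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def can_pair(left, right):
    """True if `right` is the reverse complement of `left`, so the two can form a stem."""
    return right == reverse_complement(left)


def is_cross_serial(segments):
    """Check the order a b c a' b' c': segment i pairs with segment i + n."""
    if len(segments) % 2 != 0:
        return False
    half = len(segments) // 2
    return all(can_pair(segments[i], segments[i + half]) for i in range(half))


# Hypothetical segments: three stems arranged cross-serially.
segments = ["GGAC", "AUCG", "CCGA", "GUCC", "CGAU", "UCGG"]
```

A nested arrangement such as a b b' a' fails this check, which is the distinction between ordinary hairpin-like nesting and the crossing pattern the abstract attributes to pseudoknots.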

From Text to Sign Language: Exploiting the Spatial and Motioning Dimension

Jiwon Choi, Hee-Jin Lee, and Jong C. Park
Proceedings of the 19th Pacific Asia Conference on Language, Information and Computation (PACLIC 19), pp. 61-69, Taipei, Taiwan, December, 2005.

In this paper, we address the problem of automatically converting information in the Korean language to one in a sign language as used in Korea. First, we discuss the differences between sign language and natural language, and in particular between the sign language in Korea and the Korean language. Then, we focus on issues that are relevant to the process of converting expressions in Korean into their counterparts in the sign language, including: 1) making explicit elided subjects of expressions in Korean, 2) omitting some expressions in Korean, and 3) reordering some expressions. We argue that it is important to utilize the spatial and motioning dimensionality of a sign language in order to minimize information loss and distortion. We also argue that the right decision to omit, or to merge some expressions in Korean plays a key role in exploiting this dimensionality. Finally, we present a system that converts sentences in Korean into corresponding animations in the sign language as proof of evidence for our claim.

Dynamic Informative Link Annotation for Biological Text over Heterogeneous Databases

Hodong Lee and Jong C. Park
16th International Conference on Genome Informatics (GIW), Yokohama, Japan, December, 2005.

Linking from textual objects to biological databases is actively performed for efficient data access and information enrichment [2]. This task targets links for particular types of terms, such as names, keywords and symbols, that correspond to individual data entries. However, such one-to-one matching links are still insufficient to make full use of the biological data in numerous databases. Previous research has reported further problems: (1) a conceptual term referring to multiple data objects cannot be represented as a one-to-one link [1]; (2) a complex term often corresponds to data objects from multiple databases [6]; (3) a link must remain consistent with data objects that can be changed or removed from a database [4]; and (4) a term can be ambiguous due to semantic and syntactic heterogeneity, which requires not only structural and operational database information but also biological knowledge about the term semantics [4, 5]. We address all the problems above with a dynamic link annotation system that annotates links by formulating the database statement in a formal query language. We are currently developing the system for 13 molecular biology databases mediated by SRS and Entrez: GO, GOA, UniProt, InterPro, EMBL, and Enzyme in SRS; Gene, Protein, Nucleotide, PubMed, OMIM, HomoloGene, and Taxonomy in Entrez.

Vowel Sound Disambiguation for Intelligible Korean Speech Synthesis

Ho-Joon Lee and Jong C. Park
Proceedings of the 19th Pacific Asia Conference on Language, Information and Computation (PACLIC 19), pp. 131-142, Taipei, Taiwan, December, 2005.

For speech synthesis systems that transform text materials into voice data, correctness and naturalness are the crucial measures of performance, the latter gaining more emphasis recently. In order to make synthesized voices natural, we must take into account pronunciation-related linguistic phenomena such as homographs, among others. Syntax certainly provides an important clue to disambiguating such homographs, but the relatively free word order in the Korean language makes it hard to utilize such information. In this paper, we describe a computational generation of contextually appropriate vowel lengths for words in Korean by utilizing a higher level of linguistic information in a Combinatory Categorial Grammar framework. We consider part-of-speech information, the possibility of conjunction with a suffix, case information, unconjugated adjectives, numerals, numerical adjectives with related nouns, and the relationship between a noun and its predicate as syntactic and semantic clues for vowel sound disambiguation. The results are expressed in Speech Synthesis Markup Language (SSML) for a target-system-neutral application. The proposed system with correctly predicted vowel sounds can be used not only as an educational tool, but also as a plug-in for enhancing the intelligibility of a general-purpose Text-to-Speech (TTS) system.

Text Animation with Music

Doojin Park and Jong C. Park
Proceedings of the 32nd Korea Information Science Society (KISS), Vol. 32, No. 2, pp. 526-528, Seoul, November, 2005.

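Music plays an important role in storytelling by conveying the mood and flow of a story. Much recent research has aimed at automatically inserting suitable music into computer animation, but most of it has addressed synchronization with video rather than animation that tells a story. Text animation is research on automatically analyzing fairy tales to produce animation. In this paper, we show a process that automatically extracts musical features matching the mood of each scene based on the narrative structure of a fairy tale, and we discuss how these features can be used to insert music into text animation.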

Prediction of RNA Secondary Structures in a Combinatory Categorial Grammar Framework

Hee-Jin Lee and Jong C. Park
Proceedings of the First International Symposium on Languages in Biology and Medicine (LBM), pp. 59-62, KAIST, Daejeon, Korea, November, 2005.

In this paper, we define a Combinatory Categorial Grammar (CCG) to model and predict RNA secondary structures. The proposed CCG can be used to capture various RNA secondary structures, including stem-loop and pseudoknot structures. We also argue that the CCG can be used to predict possibly unknown RNA secondary structures, for example an undiscovered structure 'ternary-pseudoknots'.

Automated Linking of Conceptual and Complex Terms into Data Objects in Biological Databases

Hodong Lee and Jong C. Park
Proceedings of the First International Symposium on Languages in Biology and Medicine (LBM), pp. 51-54, Creative Learning Building, KAIST, Daejeon, Korea, November, 2005.

The purpose of a textual link is to provide a one-to-one connection between a term and a related data object. However, such a link is insufficient to deal with the conceptual and complex terms that are often used to refer to multiple data objects from heterogeneous databases. In this paper, we present a method that can dynamically create a link to a biological term by automatically constructing a database query for a search into the corresponding data object(s). This method can help the user to quickly build a hypothesis based on data drawn from text, as well as to understand the text by providing access to relevant information for its biological terms.

Generation of Coherent Gene Summary with Concept-Linking Sentences

Chan-Goo Kang and Jong C. Park
Proceedings of the First International Symposium on Languages in Biology and Medicine (LBM), pp. 41-45, Creative Learning Building, KAIST, Daejeon, Korea, November, 2005.

Typical approaches to automatic summarization make efforts to generate a coherent document by arranging the order of sentences according to certain criteria such as the publication date of the text in which the expression appears. However, when describing a gene, there is no obvious order whatsoever among the facts to be presented. In this work, while generating a summary about a gene, we actually create the order from the unordered set of facts, by introducing new sentences that make associations among the main concepts of those facts.

CCG-based RNA Secondary Structure Prediction

Hee-Jin Lee and Jong C. Park
The First International Symposium on Languages in Biology and Medicine (LBM), Daejeon, Korea, November, 2005.

In this paper, we define a Combinatory Categorial Grammar (CCG) to model and predict RNA secondary structures. The proposed CCG can be used to capture various RNA secondary structures, including stem-loop and pseudoknot structures. We also argue that the CCG can be used to predict possibly unknown RNA secondary structures, for example an undiscovered structure 'ternary-pseudoknots'.

Dynamic and Informative Linking from Biological Text into Heterogeneous Databases

Hodong Lee and Jong C. Park
The First International Symposium on Languages in Biology and Medicine (LBM), Daejeon, Korea, November, 2005.

Linking from textual objects to biological databases is actively performed for efficient data access and information enrichment [2]. This task targets links for particular types of terms, such as names, keywords and symbols, that correspond to individual data entries. However, such one-to-one matching links are still insufficient to make full use of the biological data in numerous databases. Previous research has reported further problems: (1) a conceptual term referring to multiple data objects cannot be represented as a one-to-one link [1]; (2) a complex term often corresponds to data objects from multiple databases [6]; (3) a link must remain consistent with data objects that can be changed or removed from a database [4]; and (4) a term can be ambiguous due to semantic and syntactic heterogeneity, which requires not only structural and operational database information but also biological knowledge about the term semantics [4, 5]. We address all the problems above with a dynamic link annotation system that annotates links by formulating the database statement in a formal query language. We are currently developing the system for 13 molecular biology databases mediated by SRS and Entrez: GO, GOA, UniProt, InterPro, EMBL, and Enzyme in SRS; Gene, Protein, Nucleotide, PubMed, OMIM, HomoloGene, and Taxonomy in Entrez.

Intonation Synthesis using Emotional Information from Spoken Fairy Tale

Ho-Joon Lee and Jong C. Park
Proceedings of the 17th Korean Association of Speech Science (KASS), pp. 88-97, Seoul, November 26, 2005.

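As advances in information technology bring user-centered interfaces to the fore, the use of speech synthesis technology is steadily increasing. For natural speech synthesis, it is important to generate intonation information appropriate to the utterance situation; in particular, natural synthesis that follows changes in emotion requires appropriate control of pitch among the components of intonation. To apply emotional information to speech synthesis, speech data in which emotion is well expressed must first be analyzed, and spoken fairy tale data is a suitable resource because it expresses emotion richly in order to convey content more vividly to children. In this study, we analyze a traditional puppet play recorded by professional storytellers to examine the phonological characteristics of utterances according to emotional state. Based on the relationship between this emotional information and linguistic information such as the syntactic and semantic structure of sentences, we discuss how to provide emotional information to a speech synthesis system so that it is rendered appropriately.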

Modeling Causality in Biological Pathways for Logical Identification of Drug Targets

Il Park and Jong C. Park
Proceedings of the 2005 International Joint Conference of InCoB, AASBi and KSBI (Bioinfo 2005), pp. 373-378, Busan, Korea, September, 2005.

Lexical Disambiguation for Intonation Synthesis: A CCG Approach

Ho-Joon Lee and Jong C. Park
Korean Society for Language and Information (KSLI), June 17-18, 2005.

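With the rapid development of IT and the continual emergence of new ways of delivering information, awareness of the correct pronunciation of Korean is steadily weakening. In particular, the situation is serious enough that even speech professionals cannot accurately distinguish long and short vowels. In this paper, we examine vowel lengthening and shortening in Korean nouns in terms of their relationship with neighboring words, and we discuss how to distinguish vowel length for homographic nouns that are pronounced differently, focusing on the relationships between a noun and its modifier and between a noun and its predicate. The results of the analysis are expressed in Combinatory Categorial Grammar, and the speech synthesis process with lexical ambiguity resolved is described in the standardized Speech Synthesis Markup Language (SSML).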

Induced Extension of Gene Ontology from Biomedical Resources with Flexible Identification of Candidate Terms

Jin-Bok Lee, Jung-jae Kim, and Jong C. Park
The First International Symposium on Semantic Mining in Biomedicine (SMBM), page 13, Cambridge, UK, April, 2005.

Motivation: We present a novel method to predict more detailed terms than those in the present Gene Ontology (GO). We apply this method to semantic tagging for natural language expressions that denote potential GO terms even when there is no direct mapping of such expressions into GO terms. The terms that are newly identified in this process can be used to extend GO by utilizing semantic relations such as hyponyms or synonyms. Finally, we suggest how to find a suitable direction for the possible extension of an ever-growing ontology such as GO.
Results: We provide an automatically extended GO, and tools for its manipulation and validation.
Availability: http://www.biopathway.org
Contact: park@nlp.kaist.ac.kr

Automatic Generation of Multimedia Animation from Play Scripts

Doojin Park and Jong C. Park
HCI Conference Korea, 2005.

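Text animation is research on automatically generating animation from natural language sentences. To realize text animation as the author intends, additional multimedia effects are essential, not just character actions. Information indicating such effects is not sufficiently provided in ordinary text, but scripts for theatrical performance present various kinds of additional information in a fairly regular form. In this paper, we show a process that automatically analyzes the dialogue, stage directions, and narration of play scripts to generate multimedia animation in which character actions and sound are integrated. Sound is a basic device for dramatic effect; to integrate it effectively with character actions, the sound effects expressed in the script must be extracted directly, or suitable sounds must be inferred using commonsense information, and then rendered spatially and in step with the flow of time. To this end, we analyze the natural language expressions of play scripts with Combinatory Categorial Grammar to extract the interactions between character actions and sound effects, and we generate the corresponding actions and sound effects as multimedia animation using a 3D model database and a sound database.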

Emotion Prediction from Natural Language Documents with Emotion Network

Hye-Jin Min and Jong C. Park
Proceedings of HLT, pp. 191-199, Ulsan, October, 2004.

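In this paper, we propose a model that recognizes emotional states expressed in text, and we describe a system that uses this model to predict the emotion expressed in the current sentence as well as emotional states that will appear later. To implement a computer system that recognizes the user's emotions and interacts with humans through natural messages and actions, we need an emotion model that can predict the emotions the user may feel next, drawing not only on the current emotional state but also on information about the individual user and about the situation in which the user is interacting with the system. We propose the Emotion Network, a state-transition emotion model that reflects previously identified emotional states, the relationship between actual and expressed emotions, the characteristics of surrounding objects that influenced the emotion, and the goals and actions of the person experiencing the emotion. The Emotion Network consists of states, each representing an emotion, transitions between connected states, and conditions under which transitions occur. By applying the Emotion Network to counseling examples in text form, we show that it can predict emotions that are not directly expressed by the emotion words in a document.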

Identification and Recovery of Elided Information for Text Animation

Eunyoung Chang and Jong C. Park
Proceedings of HLT, pp. 94-102, Ulsan, October, 2004.

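A typical problem when applying speech recognition technology in real life is malfunction caused by the low recognition rate of the recognizer. In this study, we present an HTK (Hidden Markov Model Toolkit) continuous speech recognition system for the telebanking domain, together with a maximum-entropy-based method for measuring recognition confidence for the key words (mainly proper nouns) in user utterances. The recognition confidence is computed by considering both acoustic and linguistic features, and we confirmed that misrecognized words can be identified with about 86% accuracy. We plan to use this recognition confidence in developing a clarification dialog model for speech recognition.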

Constructing VoiceXML documents with Contextually Appropriate Intonation from Natural Language Dialogues in a Combinatory Categorial Grammar framework

Lee Hwa Jin, Ho-Joon Lee, and Jong C. Park
Proceedings of the 5th China-Korea Joint Symposium on Oriental Language Processing and Pattern Recognition, pp. 2-9, Qingdao, P.R.China, February 25-27, 2004.

Various natural language processing techniques have been utilized to enhance the performance of Text-to-Speech (TTS) systems to date. Correctness and naturalness are among the working measures for the performance of these systems, where the usual proposals to satisfy the second measure have employed statistical prediction methods to find appropriate intonation for a given sequence of words in a sentence. However, these proposals tend to assign the same intonation to the same word sequence in a sentence, whereas people may associate quite different kinds of intonation with the same word sequence depending upon the context in which the sentence is expressed. In this paper, we use a combinatory categorial grammar approach to synthesizing contextually appropriate intonation for dialogues in Korean, taking into account the distinguishing characteristics as identified from the speech corpus. The intonation-annotated dialogues are then translated into corresponding VoiceXML documents, which work as direct inputs to a TTS system for the generation of actual speech data.

Anaphora Resolution in Text Animation

Kyung Wha Hong and Jong C. Park
Proceedings of the IASTED International Conference on Artificial Intelligence and Applications, pp. 347-352, Innsbruck, Austria, February, 2004.

For effective text animation from natural language stories, the source sentences in natural language should be processed not only individually but also as a coherent story as a whole. In particular, it is important that anaphoric expressions are interpreted adequately, since they provide crucial clues for the overall behaviors of story line characters. In text understanding, the task of anaphora resolution has focused primarily on nominal expressions. In text animation, however, there are many other important candidates for anaphoric expressions, including those for actions and events, in addition to objects. In this paper, we provide an analysis of sample fairy tales, and present a classification for the types of anaphoric expressions for text animation. We also describe an implemented text animation system with anaphora resolution.

Case Study: Visualization and Analysis of Mitogen-Activated Protein Kinase Pathways in the Literature

Changsu Lee, Jinah Park, and Jong C. Park
Conference on Visualization and Data Analysis (VDA), pp. 275-285, San Jose, USA, January, 2004.

Data sets of up to 3000 journal abstracts from MEDLINE literature on the keyword combination 'MAPK pathway' and 'human' are visualized and analyzed for mitogen-activated protein kinase (MAPK) pathways. We have tightly coupled exploratory visualization with information extraction for interactive navigation through scattered information sources, in search of useful facts on MAPK by frequency-based filtering and amplification. Unlike direct database visualization that operates on curated data sets, literature visualization has the advantages of manipulating data sets of a massive scale with far less manpower and of responding effectively to the fast cycles of development in the field.

BioAR: Anaphora Resolution for Relating Protein Names to Proteome Database Entries

Jung-jae Kim and Jong C. Park
ACL Workshop on Reference Resolution and its Applications, pp. 79-86, Barcelona, Spain, 2004.

The need for associating, or grounding, protein names in the literature with the entries of proteome databases such as Swiss-Prot is well-recognized. The protein names in the biomedical literature show a high degree of morphological and syntactic variations, and various anaphoric expressions including null anaphors. We present a biomedical anaphora resolution system, BioAR, in order to address the variations of protein names and to further associate them with Swiss-Prot entries as the actual entities in the world. The system shows the performance of 59.5% to 75.0% precision and 40.7% to 56.3% recall, depending on the specific types of anaphoric expressions. We apply BioAR to the protein names in the biological interactions as extracted by our biomedical information extraction system, or BioIE, in order to construct protein pathways automatically.

Annotation of Gene Products in the Literature with Gene Ontology Terms using Syntactic Dependencies

Jung-jae Kim and Jong C. Park
Proceedings of the 1st International Joint Conference on Natural Language Processing (IJCNLP), pp. 528-534, Hainan, P.R.China, 2004.

We present a method for automatically annotating gene products in the literature with the terms of Gene Ontology (GO), which provides a dynamic but controlled vocabulary. Although GO is well-organized with such lexical relations as synonymy, ‘is-a’, and ‘part-of’ relations among its terms, GO terms show quite a high degree of morphological and syntactic variations in the literature. As opposed to the previous approaches that considered only restricted kinds of term variations, our method uncovers the syntactic dependencies between gene product names and ontological terms as well in order to deal with real-world syntactic variations, based on the observation that the component words in an ontological term usually appear in a sentence with established patterns of syntactic dependencies.

Automatic Camera Control for Automated Digital Cinematography from Text

Semin Jang and Jong C. Park
Proceedings of the 31st KISS Spring Conference, Vol. 31, No. 1(B), pp. 904-906, KAIST, Korea, 2004.

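The scripts that are essential to the filmmaking process specify cinematographic techniques wherever they are needed, which makes it possible to reproduce the situation intended by the original author fairly accurately when realizing actual scenes. In contrast, when digital video is to be produced automatically from sources such as traffic accident reports or fairy tales, no such techniques are specified. Therefore, to produce digital video automatically from material written in natural language, there must be a way to identify the author's intent and derive appropriate cinematographic techniques. In our previous work, we showed that automatic animation generation for fairy tales requires component technologies such as time management, reference resolution, position setting, detailed command determination, and multi-character control, and we presented a method for automatically identifying, within time management, the cases that require an appropriate scene transition. In this paper, we use Combinatory Categorial Grammar to analyze the author's intent expressed in fairy tale sentences, present a method for producing digital video that automatically identifies and applies a variety of matching camera techniques, and show an implemented system.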

Automatic Generation of Multimedia Therapeutic Contents with Combinatory Categorial Grammar

Hye-Jin Min and Jong C. Park
HCI/CG/VR/UI/DESIGN, Phoenix Park, 2004.

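With the development of the Internet, counseling, music therapy, and art therapy, which can be regarded as alternative forms of psychotherapy, are being actively offered on websites that provide advice on personal worries. In this paper, we show a process that automatically analyzes text describing a client's worries, identifies the client's emotional state and the cause of the worry, and generates multimedia therapeutic content integrating text, images, and music. Multimedia therapeutic content refers to information in which text, image, and music files that can help relieve a given emotion are structured together with search terms for the purpose of psychological therapy. Automatically generating the search terms needed to construct such content requires appropriate analysis of how the client expresses worry-related emotions in a sentence, the semantic relations involved, and the elapsed time of the emotion. Additional linguistic properties must therefore be considered in greater depth than in previous work, which extracted emotional information by mapping keywords to emotions or by commonsense inference. To this end, we automatically identify the relationship between the subject of a subordinate clause, such as an embedded or conjoined clause, and the subject of the main clause; determine the experiencer and the target of an emotion according to the sentence constituents that each verb semantically requires; and identify changes in emotional state from temporal adverbs. By implementing this natural language processing in Combinatory Categorial Grammar, we show a process that retrieves therapeutic information corresponding to the psychological state expressed in these sentences from a structured database and generates multimedia therapeutic content.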

Data-oriented Customized Visual Navigation

Changsu Lee, Jinah Park, and Jong C. Park
HCI/CG/VR/UI/DESIGN, Phoenix Park, 2004.

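The vast and rapidly growing amount of available information, driven by advances in storage media and information technology, makes it difficult for users to understand information. In the chain of information use that runs from information sources through information filtering to information presentation, previous work on user customization has generally addressed only filtering. However, if customization becomes possible in the presentation stage, where interaction with the user is closest, users can control more finely the process of obtaining information that suits their purposes. In this study, we propose a visualization system that plays an active role and supports user customization. The customization function is implemented using a classification method based on the characteristics of the data. Taking biology as the application domain, we propose a method for classifying data according to the characteristics of molecular interaction data, and we confirm through experiments that customized molecular interaction maps can be obtained effectively for each user.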

Natural Language Response Generation from Relational Database Query Result

Ji-yong Jung and Jong C. Park
HCI/CG/VR/UI/DESIGN, Phoenix Park, 2004.

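A natural language question-answering interface lets users access a system easily without special knowledge, making the provision of information easy and natural. However, most previous work has focused on converting natural language into formal languages for database access such as SQL, and has not yet produced satisfactory results in response generation, that is, in expressing the results obtained from a query appropriately. Natural language response generation must jointly consider the information the user already knows, the information stored in the database, and the information the user seeks by asking the query. In addition, to generate responses in the form the user expects, the desired response forms must be modeled in advance and the most preferred form must be used. In this study, we address the process of generating responses, customized to the intent of the query, from relational database search results obtained for a user's query. To generate appropriate responses, we present an analysis of a corpus of user questions and answers about travel product information in terms of the content and amount of information, and on this basis we propose a sentence generation method with three stages: content planning, sentence form construction, and lexical expression.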

Contextual Disambiguation of Adverbial Scopes in Korean for Text Animation

Eunyoung Chang, Kyung Wha Hong, and Jong C. Park
HCI/CG/VR/UI/DESIGN, Phoenix Park, 2004.

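To automatically generate animation from text composed of natural language sentences, a series of animation commands must be derived from the syntactic, semantic, and discourse information of the sentences. Adverbs in these sentences determine the degree to which attributes of the corresponding animation commands change, and accurate interpretation of what an adverb modifies and what it means plays an important role in reflecting the intent of the text effectively. However, the range of possible modification targets of adverbs is very wide and their meanings are diverse, so it is not easy to identify the function of an adverb accurately even in simple sentences, let alone in complex sentences containing embedded clauses or coordinate structures. In this paper, we propose a method for analyzing adverbs for accurate text animation and show the processing results. Existing research on Korean adverbs mainly resolves structural ambiguity through statistical learning of the co-occurrence affinity between adverbs and the words they modify, and it uses only a very small part of the positional constraints on adverbs as constraints on such co-occurrence relations. In this paper, we consider contextual information together with this information to resolve structural ambiguity and derive more accurate meaning. We use Combinatory Categorial Grammar to propose a syntactic and semantic analysis of adverbs, and extend it to handle complex adverbial constructions such as derived adverbs, adverbial phrases, and adverbial clauses grammatically. We then verify the animation results with a text animation system that implements the proposed method.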

Information Visualization in 3-Dimensional Space for Text Data Mining

Jinah Park, Changsu Lee, and Jong C. Park
International Women's Conference on BIEN-Technology, Daejeon, Korea, November, 2003.

Analysis and Computational Processing of Sentences in Korean for Automatic Sign Language Generation

Jiwon Choi and Jong C. Park
Proceedings of the National Conference on Korean Language Processing, pp. 219-226, October, 2003.

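Korean Sign Language has a basic similarity to Korean, but unlike Korean, which is an agglutinative language of the auditory-vocal modality, it also exhibits the characteristics of an isolating language of the visual-motor modality. Therefore, to automatically generate sign language from Korean sentences in text form, rather than forcibly tying sign language expressions to a grammar predefined for Korean, we need to analyze and exploit the meaning delivery system unique to sign language. In this paper, we classify the linguistic characteristics of sign language expressions into four types, namely reproduction, omission, transformation, and movement, analyze them, and discuss how to process and implement these phenomena using Combinatory Categorial Grammar.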

Towards Automatic Sign Language Generation with Combinatory Categorial Grammar

Jiwon Choi and Jong C. Park
HCI Conference, pp. 481-486, Phoenix Park, Korea, February, 2003.

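Because sign language is a visual language for communication among deaf people, it has a distinctive grammatical structure that is hard to find in other languages, which presuppose accompanying spoken language. However, previous work on automatic processing of sign language has forced sign language expressions onto grammars predefined for Korean, which causes many problems in identifying and exploiting the meaning delivery system unique to sign language. In particular, by using not only cheremes such as hand movement and handshape but also mechanisms of simultaneous expression, sign language can express without ambiguity the subject-object relation in inverted sentences and the agent-patient relation in causative and passive sentences, and it can deliver information efficiently by using previously designated spatial information like an antecedent to avoid redundant expressions. In this paper, through the process of analyzing natural language expressions such as Korean with Combinatory Categorial Grammar, we present findings on the elements essential to automatically translating these expressions into corresponding sign language expressions accompanied by animation. We also present, together with an implementation, a method for generating more natural sign language expressions that fully exploits the distinctive expressive devices of sign language.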

Anaphora Resolution and Multi-Character Control for Automatic Generation of Multimedia Fairy Tales

Kyung Wha Hong and Jong C. Park
HCI Conference, pp. 487-492, Phoenix Park, Korea, February, 2003.

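To take as input a fairy tale in the form of a document consisting of a sequence of sentences written in a natural language such as Korean, and automatically generate a multimedia fairy tale with animation that properly reflects its content, accurate interpretation of the various reference phenomena in the document is essential. Reference interpretation for such animation shows typologically more diverse characteristics than the reference interpretation usually studied in natural language processing to aid document understanding. In this paper, we present the implementation of a system that, in automatically generating multimedia fairy tales, generates commands for controlling a three-dimensional virtual space by properly considering the movements of multiple characters together with the reference phenomena in sentences. Reference interpretation for animation aims to identify the appropriate antecedent of a referring expression, and requires consideration of diverse scene information such as character names, actions, properties, events, and time. To control multiple characters in keeping with the context, techniques are needed that provide natural character movements by using diverse knowledge along with appropriate reference resolution. We show a system that analyzes fairy tales using Combinatory Categorial Grammar and then automatically generates the corresponding control scripts for the Genesis 3D game engine.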

Mediatory Visualization for Structured Data and Textual Information

Changsu Lee, Jinah Park, and Jong C. Park
The 3rd IASTED International Conference on Visualization, Imaging, and Image Processing (VIIP 2003), pp. 926-932, Benalmadena, Spain, 2003.

When we visualize structured data for knowledge discovery, it is important that users have easy access to the source textual information, especially when the mapping from the textual information to structured data is not perfect. In this paper, we present a new method for mediatory visualization for structured data and corresponding textual information to address this problem. The two dimensional space for visualizing structured data, such as the protein-protein interaction information collected from biomedical literature by information extraction, is linked perpendicularly to, but conceptually separated from, the pairwise one dimensional space for visualizing corresponding source textual data. The users can concentrate on the information in one space but explore the information in the other space as easily as one may manipulate objects in a three dimensional space. We show that the one dimensional color-banded rods give visual clues and insights to the nature of the underlying English sentence structures, which in turn give rise to useful feedback to the interaction information in the other two dimensional space, and vice versa.

Logical Representation of Ontological Terminologies in Biomedical Domain

Jung-jae Kim, Jin-Bok Lee, Hye-Jin Min, Ji-yong Jung, and Jong C. Park
Proceedings of the 2nd Annual Conference of The Korean Society for Bioinformatics (KSBI 2003), pp. 79-85, Daejeon, Korea, 2003.

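This paper proposes a method for automatically recognizing protein names in large volumes of biomedical documents, automatically identifying the properties of each protein from the documents, and linking them to an existing ontology. Because ontology terms appear in documents in various forms, we automatically converted them into logical representations, and we extracted and analyzed sentences describing protein properties in the documents and compared them with the logical representations of the ontology terms. For recognizing protein properties in documents, we proposed the use of natural language processing techniques such as abbreviation handling and anaphora resolution.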

Morphological Analysis of Irregular Conjugation in Korean with Micro Combinatory Categorial Grammar

Ho-Joon Lee and Jong C. Park
Proceedings of the KISS Spring Conference, pp. 531-533, 2003.

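In this paper, we use a morpheme-level Combinatory Categorial Grammar to handle multiple stages of natural language processing, including morphological analysis, in a single derivation, and we discuss how to reduce the ambiguity and complexity that grow at the morphological analysis stage by using information from higher levels of analysis. We handle irregular conjugation of predicates, one of the complex linguistic phenomena in Korean, using not only probabilistic information but also higher-level information such as syntactic information, including phonological information, and semantic information, and we examine the potential of the approach as a general-purpose morphological analyzer.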

Word Segmentation for Korean with Syllable-Level Combinatory Categorial Grammar

Ho-Joon Lee and Jong C. Park
Proceedings of the 14th National Conference on Korean Language Processing, pp. 47-54, October, 2002.

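Word spacing in Korean has developed in a distinctive way, unlike English, where spacing is applied regularly between words, and Chinese and Japanese, where word spacing did not develop. Much earlier work corrected partial spacing errors, but research is now actively under way, in combination with work on character recognition and speech recognition, on automatically restoring spacing in sentences where spacing is entirely absent. In this paper, we review word spacing in Korean and previous work on spacing restoration, and we use a syllable-level Combinatory Categorial Grammar to account for forms that were difficult to handle with previous methods.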

Diphone-based Intonation and VoiceXML Document Generation using Multi-Dimensional Linguistic Information

Lee Hwa Jin and Jong C. Park
Proceedings of the 24th National Conference on Korean Language Processing, pp. 69-76, Cheongju, Korea, October, 2002.

Anaphora Resolution for Contextually Appropriate Animation of Multimedia Fairy Tales

Kyung Wha Hong and Jong C. Park
Proceedings of the 24th National Conference on Korean Language Processing, pp. 317-324, Cheongju, Korea, October, 2002.

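A reference phenomenon is the re-expression of information that has already been mentioned or is assumed to be already known. It is actively studied not only in natural language processing but also in cognitive science, psychology, and philosophy, and performance depends on how the antecedent of an anaphor, the referring expression, is selected. Reference resolution in the animation control script commands used to generate multimedia fairy tales from natural language sentences is essential for generating natural animation scenes based on appropriate reference to preceding information. In this paper, we examine the reference phenomena that appear in the natural language sentences of such fairy tales, and discuss a method for resolving them using Combinatory Categorial Grammar and its implementation.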

Analysis and Reconstruction of Temporal Relations in Multimedia Fairy Tales for Digital Cinematography

Semin Jang and Jong C. Park
Proceedings of the 24th National Conference on Korean Language Processing, pp. 309-316, Cheongju, Korea, October, 2002.

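Fairy tales develop their story along the flow of events. However, to hold the sustained interest of children, their readers, events are often arranged in an order different from the actual one for dramatic effect. In generating animation from fairy tales, correctly identifying the author's intent behind such arrangement of events is an important problem. In this paper, we examine the linguistic elements that must be handled to identify and use the flow of events, and we analyze the temporal relations that appear in fairy tales using Combinatory Categorial Grammar. We also propose cinematographic techniques that enhance the animation effect for each temporal relation, and describe a system that uses them to reproduce temporal relations.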

Automatic Gene Ontology Extension and Terminology Analysis

Jin-Bok Lee and Jong C. Park
Proceedings of the KISS Conference, pp. 229-231, Suwon, Korea, October, 2002.

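Bioinformatics has become a major research field for handling the vast body of knowledge in biology efficiently. In particular, research on automatically extracting information from the biological literature is active, and by using the results of such information extraction to automatically extend useful knowledge bases such as the Gene Ontology, the explosively growing body of research results in biology can be integrated into knowledge bases. An automatically extended ontology goes through a verification process to ensure reliability and is then used as a knowledge base for improving the performance of information extraction systems. In this study, we propose a system that extracts the conditions that appear in protein-protein interactions and a system that analyzes extracted biological terms using the Gene Ontology, and we discuss a system for automatic extension and verification of the Gene Ontology.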

Natural Language Query Interpretation System for Biomedical Database Access

Hodong Lee and Jong C. Park
Proceedings of the KISS Spring Conference, pp. 487-489, Han Yang University, April 26-27, 2002.

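This paper describes a natural language query system that enables conceptual access to biomedical information stored in heterogeneous databases. To this end, the system converts queries into the formal database languages SQL, OQL, and CPL, and we show the query analysis and conversion processes this requires. Because the proposed method converts directly into various formal languages using information derived from syntactic analysis, the system structure becomes simpler and more modular, which can improve overall performance and portability.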

Challenges in Biopathway Extraction from Literature and Ontology Building for Biology

Jong C. Park
Korea Society for Bioinformatics Workshop, February, 2002.

Semi-Automatic Extension of Gene Ontology

Jin-Bok Lee, Jung-jae Kim, and Jong C. Park
Human Computer Interaction (HCI) Workshop, Phoenix Park, Korea, January, 2002.

BiopathwayBuilder: Nested 3D visualization system for complex molecular interactions

Changsu Lee, Jinah Park, and Jong C. Park
Proceedings of International Conference on Genome Informatics (GIW), pp. 447-448, Tokyo, Japan, 2002.

In order to gain a full understanding of a biological process, we must be able to augment the known molecular interactions with discovered knowledge. We believe that a visualization system works as a means for accomplishing this task, as it provides an intuitive base for necessary information, among others. However, reported implementations have further problems: (1) The size of the information is not only enormous, but also grows very fast, which makes scalability and elision essential properties [5]; (2) the available information is not only incomplete, but also unreliable; and (3) the usual information in the field, such as protein modification [2], is inherently complex, which makes it very difficult to make the resulting visualization intuitive enough for end users as well as field experts. We address all the problems above with a 3D visualization system.

3D Visualization System for Complex Protein-Protein Interactions from Text Data Mining

Changsu Lee, Jinah Park, and Jong C. Park
IEEE Workshop on Visualization in Bioinformatics and Cheminformatics, Boston, USA, 2002.

Natural Language Interpretations for Heterogeneous Database Access

Hodong Lee and Jong C. Park
Proceedings of the International Conference on Computational Linguistics (COLING), pp. 523-529, Taiwan, 2002.

Text Data Mining for Automatic Gene Ontology Extension

Jin-Bok Lee and Jong C. Park
Intelligent Systems for Molecular Biology (ISMB), Proceedings of the second meeting of the special interest group on Text Data Mining, pp. 22-25, Edmonton, Alberta, Canada, 2002.

Literature Data Mining for Biology

Lynette Hirschman, Jong C. Park, Junichi Tsujii, Cathy Wu, and Limsoon Wong
Proceedings of the Pacific Symposium on Biocomputing (PSB) session, pp. 323-325, Hawaii, USA, 2002.

Natural Language Processing for Biomedical Information Extraction and Automatic Ontology Management

Jong C. Park
Proceedings of the 2nd Bioinformatics Forum, pp. 145-158, Seoul, Korea, 2002.

Biomedical Informatics and Natural Language Processing

Jong C. Park
Annual Meeting of the Korean Society for Medical Informatics, Jeon-ju, Korea, November, 2001.

Automatic Augmentation of Translation Dictionary with Database Terminologies in Multilingual Query Interpretation

Hodong Lee and Jong C. Park
Annual Meeting of the Association for Computational Linguistics (ACL), Workshop on Human Language Technologies and Knowledge Management, pp. 113-120, Toulouse, France, July, 2001.

In interpreting multilingual queries to databases whose domain information is described in a particular language, we must address the problem of word sense disambiguation. Since full-fledged semantic classification information is difficult to construct either automatically or manually for this purpose, we propose to disambiguate the senses of the source lexical items by automatically augmenting a simple translation dictionary with database terminologies and describe an implemented multilingual query interpretation system in a combinatory categorial grammar framework.

Translating Natural Language Queries into Formal Language Queries with Combinatory Categorial Grammar

Hodong Lee and Jong C. Park
Proceedings of the International Conference on Computer Processing of Oriental Languages (ICCPOL), pp. 41-46, Seoul, Korea, May, 2001.

Computational Generation of Context-based Intonation for Korean with Combinatory Categorial Grammar

Lee Hwa Jin and Jong C. Park
Proceedings of International Conference on Computer Processing of Oriental Languages (ICCPOL), pp. 415-420, Seoul, Korea, May, 2001.

Design and Implementation of E-Mail Response Management System for Call Center

Jung-jae Kim, O Shik Kwon, Hodong Lee, and Jong C. Park
Proceedings of the KISS Spring Conference, pp. 445-447, April, 2001.

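This paper describes the server component of an e-mail automatic response and management system designed and implemented for call centers. We developed a domain-specific representation format and implemented an automatic responder with a more efficient three-step matching method, a learning-based, domain-independent automatic classifier, and an agent dispatcher whose application methods can be reordered.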

Bidirectional Incremental Parsing for Automatic Pathway Identification with Combinatory Categorial Grammar

Jong C. Park, Hyun Sook Kim, and Jung-jae Kim
Pacific Symposium on Biocomputing (PSB), pp. 396-407, Big Island, Hawaii, USA, January, 2001.

As the importance of automatically extracting and analyzing various natural language assertions about protein-protein interactions in biomedical publications is recognized, many uses of natural language processing techniques are proposed in the literature. However, most proposals to date make rather simplifying assumptions about the syntactic aspects of natural language for various reasons, including efficiency. In this paper, we describe an implemented system that utilizes combinatory categorial grammar, known to be competent in modeling natural language, with a controlled mechanism for the parser to operate bidirectionally and incrementally. We discuss the performance of the system on a large set of abstracts in Medline with quite encouraging results.

Real Time Synthesis of Multimedia Tales in Korean with Combinatory Categorial Grammar

Hyun Sook Kim and Jong C. Park
Proceedings of the National Conference on Korean Information Processing, pp. 509-512, 2001.

Computational Processing of Honorifics in Korean with Combinatory Categorial Grammar

O Shik Kwon and Jong C. Park
Proceedings of the National Conference on Korean Information Processing, pp. 365-372, 2001.

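Korean and Japanese have highly developed honorific systems compared with Western languages such as English. In practice, however, these honorific systems are difficult to use accurately, not only for non-native speakers but also for many native speakers. Nevertheless, the ability to use honorifics accurately, together with appropriate word choice, is regarded as an important linguistic skill for natural communication. In particular, when building a machine translation system or a grammar checker, a system that accurately understands the honorific system is essential for producing more natural expressions. In this paper, we survey the Korean honorific system, introduce a system that verifies it using combinatory categorial grammar, and evaluate the system's performance on historical drama scripts.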

Generation of Contextually Appropriate Responses in E-Commerce with Combinatory Categorial Grammar

Jin-Bok Lee and Jong C. Park
Proceedings of the Human Computer Interaction (HCI) Symposium, pp. 314-319, Phoenix Park Convention Center, Korea, 2001.

We analyze various constructions in Korean including coordination, relative clauses, and embedded clauses by focusing on the phenomenon of quantifier floating where quantifying expressions may appear in places other than their original prenominal one. Based on these analyses, we process Korean sentences in a combinatory categorial grammar (CCG) framework that makes use of all the levels of syntax, semantics, and discourse. Finally, we describe an implemented query system that generates responses with contextually appropriate ellipsis in the domain of e-commerce.

Processing Floating Quantifiers with Combinatory Categorial Grammar

Jin-Bok Lee and Jong C. Park
The KISS Regional Conference, November, 2000.

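In this paper, we consider quantifier floating in Korean from syntactic, semantic, and discourse perspectives in relation to complex linguistic phenomena such as coordinate, relative, and embedded constructions, and show that Korean sentences can be analyzed with combinatory categorial grammar. On this basis, we suggest the possibility of building an interface capable of natural dialogue in domains such as e-commerce.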

Predicting Contextually Appropriate Intonation from Utterances in Korean with Combinatory Categorial Grammar

Lee Hwa Jin and Jong C. Park
Proceedings of the National Conference on Korean Language Processing, pp. 68-75, October, 2000.

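To convey one's intention accurately to a listener, one must speak with intonation appropriate to the flow of the conversation. In this paper, we analyze sentences using combinatory categorial grammar and examine how intonation information such as pitch accent, pause, and emphasis should be represented according to intra-sentential and inter-sentential information, that is, the context. We then present a method for adding this information to the information structure of the sentence.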

Combinatory Categorial Grammar and Natural Language Interface to Database

Hodong Lee and Jong C. Park
Proceedings of the Human-Computer Interaction (HCI) Triangle Workshop, pp. 900-905, Phoenix Park Convention Center, Korea, January, 2000.

In this paper, we discuss issues related to the construction of a natural language interface to databases, including the characteristics of natural language queries. We propose to implement the system using Combinatory Categorial Grammar (CCG), so that various linguistic phenomena can be handled incrementally and in a modular manner for diverse expressions.

Informed Parsing for Coordination with Combinatory Categorial Grammar

Jong C. Park and Hyung-joon Cho
Proceedings of the International Conference on Computational Linguistics (COLING), pp. 593-599, Saarbrücken, Germany, 2000.

Coordination in natural language hampers efficient parsing, especially due to the multiple and mostly unintended candidate conjuncts/disjuncts in a given sentence that shows structural ambiguity. The problem gets more serious in a combinatory categorial grammar framework, which is well known for its competent treatment of coordination, as the flexibility of syntactic analysis often strikes back as spurious ambiguity. We propose to address these ambiguities with predicate argument structures and semantic co-occurrence similarity information, and present encouraging results.

Combinatory Categorial Grammar for Natural Language Interface

Hodong Lee and Jong C. Park
Proceedings of the KISS Fall Conference, pp. 173-175, 2000.

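In this work, we implement a natural language query interface based on combinatory categorial grammar for an e-commerce database. To this end, we analyze queries and discuss how to represent them. We also show the lexical representations and the derivation method used for translation into the SQL formal language. The proposed method derives SQL queries directly during syntactic analysis, without the intermediate logical language translation step proposed in previous work, so the process becomes simpler and system performance can improve. The system is implemented with a web-based client/server architecture.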

Combinatory Categorial Grammar and Parsing

Hyung-joon Cho and Jong C. Park
Proceedings of the National Conference on Korean Language Processing, pp. 223-230, Mokpo, Korea, October 1999.

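In this paper, we discuss spurious ambiguity and structurally ambiguous noun phrase coordination, both of which increase parsing complexity when Korean is processed with combinatory categorial grammar. Exploiting the property of combinatory categorial grammar that syntactic and semantic processing are performed simultaneously, we present a way to reduce the complexity caused by spurious ambiguity. We also present a way to reduce the complexity and misanalyses that arise in conjunct selection by using the co-occurrence similarity between the head nouns of the conjuncts, and then discuss how these methods can be improved.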

An Analysis of the Semantic and Discourse Functions of the Korean Special Marker `-to'

June K. Park and Jong C. Park
The National Conference on Korean Language Processing, Mokpo, Korea, October 1999.

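This paper deals with the semantic and contextual functions of Korean special particles, in particular '도' ('-to'). '도' plays an important role in connecting context naturally. A sentence containing '도' always rests on a certain presupposition, which is directly related not only to the meaning of the sentence but also to the preceding context. In this paper, we explain five functions of '도', namely 'sameness', 'similarity', 'extreme', 'addition', and its use in coordinate sentences, and present a method for implementing them in combinatory categorial grammar (CCG) using alternative semantics.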

A CCG for Coordination in Korean

Hyung-joon Cho and Jong C. Park
Proceedings of the KISS Conference, pp. 327-329, Jeonju, Korea, April, 1999.

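In natural language processing, coordinate sentences pose difficulties arising from analytical complexity, lexical ambiguity, gaps, and so on. In this paper, we discuss the limitations of categorial grammars previously proposed for processing Korean, and use combinatory categorial grammar to handle Korean coordinate sentences that earlier categorial grammars could not. In the course of processing them, we represent coordinate sentences in Korean, a non-configurational language, as predicate-argument structures and show that these can be used in a machine translation system.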

Multiset-CCG for Quantifier Floating in Korean

Jin-Bok Lee and Jong C. Park
Proceedings of the KISS Conference, pp. 330-332, Jeonju, Korea, April, 1999.

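In this paper, we examine the types of constructions in which quantifiers occur in Korean and discuss the quantifier floating (QF) phenomenon among them. We show that QF is possible in the nominative, accusative, and dative cases, and explain the constraints on QF in embedded clauses. We show that these can be accounted for in the framework of a multiset combinatory categorial grammar for Korean.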

Lexical Selection with a Target Language Monolingual Corpus and an MRD

Hyun Ah Lee, Jong C. Park, and Gil Chang Kim
Proceedings of the Theoretical and Methodological Issues in Machine Translation (TMI), pp. 150-160, Chester, England, 1999.

In this paper, we propose a lexical selection method with three steps: sense disambiguation of source words, sense-to-word mapping, and selection of the most appropriate target language lexical item. The knowledge for each step is extracted from a machine readable dictionary and a target language monolingual corpus. By splitting the process of lexical selection into three steps and extracting the essential knowledge for each step from existing resources, our system can select appropriate words for translation with high extensibility and robustness.

Checking Grammatical Mistakes for English-as-a-Second-Language (ESL) Students

Jong C. Park, Martha Palmer, and Gay Washburn
Proceedings of the KSEA-NERC, New Brunswick, New Jersey, USA, April, 1997.

An English Grammar Checker as a Writing Aid for Students of English as a Second Language

Jong C. Park, Martha Palmer, and Gay Washburn
Conference on Applied Natural Language Processing (ANLP), Descriptions of System Demonstrations and Videos, Washington, D.C., USA, March, 1997.

We present a prototype grammar checker for English as a Second Language (ESL) students, utilizing Combinatory Categorial Grammar (CCG) written in SICStus Prolog. Instead of attempting to handle all possible grammatical errors, the grammar checker identifies certain specific types of grammatical mistakes that appear more regularly than others in the present domain of application.

Quantifier Scope and Constituency

Jong C. Park
The 33rd Annual Meeting of the Association for Computational Linguistics (ACL), Cambridge, Massachusetts, USA, June, 1995.

Traditional approaches to quantifier scope typically need stipulation to exclude readings that are unavailable to human understanders. This paper shows that quantifier scope phenomena can be precisely characterized by a semantic representation constrained by surface constituency, if the distinction between referential and quantificational NPs is properly observed. A CCG implementation is described and compared to other approaches.

Semantic Significance of Quantification in Natural Language Processing

Jong C. Park
Proceedings of the KSEA-NERC, pp. 432-436, New Brunswick, New Jersey, USA, March, 1995.

A Unification-based Semantic Interpretation for Coordinate Constructs

Jong C. Park
The 30th Annual Meeting of the Association for Computational Linguistics (ACL), Delaware, USA, June, 1992.

This paper shows that a first-order unification-based semantic interpretation for various coordinate constructs is possible without an explicit use of lambda expressions if we slightly modify the standard Montagovian semantics of coordination. This modification, along with partial execution, completely eliminates the lambda reduction steps during semantic interpretation.