Publications

Journal Papers and Book Chapters

Retrieval-Augmented Generation through Zero-shot Sentence-Level Passage Refinement using LLMs

Taeho Hwang, Soyeong Jeong, Sukmin Cho, and Jong C. Park
Journal of KIISE, 52(3), March, 2025. (To appear)

Political Bias in Large Language Models and its Implications on Downstream Tasks

Jeong yeon Seo, Sukmin Cho, and Jong C. Park
Journal of KIISE, 52(1), 18-28, January, 2025.

Detecting Implicitly Abusive Language by Applying Out-of-Distribution Problem

Jisu Shin, Hoyun Song, and Jong C. Park
Journal of KIISE, 49(11), 948-957, November, 2022.

Flexible acceptance condition of generics from a probabilistic viewpoint: Towards formalization of the semantics of generics

Soo Hyun Ryu, Wonsuk Yang, and Jong C. Park
Journal of Psycholinguistic Research, 2022.

Calibration of Pre-trained Language Model for Korean

Soyeong Jeong, Wonsuk Yang, ChaeHun Park, and Jong C. Park
Journal of KIISE, Vol. 48, No. 4, pp. 434-443, April, 2021.

Deep learning models now show performance beyond that of humans in various tasks such as computer vision and natural language understanding. In particular, pre-trained Transformer models have recently shown remarkable performance in natural language understanding problems such as question answering (QA) and dialogue tasks. However, compared to the rapid development of deep learning models such as Transformer-based models, the mechanisms by which they work remain relatively unknown. As a method of analyzing deep learning models, calibration measures how closely the predicted value of the model (confidence) matches the actual value (accuracy). Our study aims at interpreting pre-trained Korean language models with calibration. In particular, we analyzed whether pre-trained Korean language models are able to capture ambiguities in sentences and applied smoothing methods to quantitatively measure such ambiguities with confidence. In addition, in terms of calibration, we evaluated the capability of pre-trained Korean language models to identify grammatical characteristics in Korean that affect semantic changes in Korean sentences.
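
As a rough illustration of the calibration notion in the abstract above (how closely confidence matches accuracy), the sketch below computes expected calibration error (ECE), one commonly used calibration measure. The abstract does not state which metric the paper uses, so the choice of ECE, the bin count, and the function name are assumptions made only for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin.

    confidences: predicted probabilities in [0, 1] for the chosen labels.
    correct: booleans, True where the prediction was right.
    n_bins: number of equal-width confidence bins (an assumed default).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    # Group predictions into equal-width confidence bins.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes to the last bin
        bins[index].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Each bin contributes its confidence/accuracy gap, weighted by its size.
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece
```

A perfectly calibrated model would score 0; larger values mean that confidence and accuracy diverge.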

A Study on Practical Machine Translation from Korean to Korean Sign Language

Jung-Ho Kim, Eui Jun Hwang, and Jong C. Park
Journal of Korean Sign Language Studies (수어학연구), Vol. 4, No. 1, 2020.

In this study, we propose a practical method for machine translation from Korean to Korean Sign Language (KSL). For practical use, we select the most appropriate Korean corpus and then annotate the corpus with KSL sentences to construct a Korean-KSL parallel corpus. In our experiments, we train four neural machine translation models on our Korean-KSL parallel corpus and identify the best model by calculating BLEU scores. As a quantitative result, our best model achieves a BLEU-4 score of 20.18 on a test set of our Korean-KSL parallel corpus. We also report qualitative results with an error analysis for better understanding. We finally demonstrate that our model can translate Korean sentences not included in our Korean dataset. Therefore, we believe that our Korean-KSL translation system can lessen the gap between supply and demand for sign language interpretations.

Unsupervised Inference of Implicit Biomedical Events using Context Triggers

Jin-Woo Chung, Wonsuk Yang, and Jong C. Park
BMC Bioinformatics, 2020.

Assessing the multi-level knowledge prominence perceived by the authors as revealed on their writings

Wonsuk Yang, Jin-Woo Chung, and Jong C. Park
Language and Information, Vol. 23, No. 2, 2019.

Automatic Scoring of Semantic Fluency

Najoung Kim, Jung-Ho Kim, Maria K. Wolters, Sarah E. MacPherson, and Jong C. Park
Frontiers in Psychology, Vol. 10, pp. 1020, 2019. (SSCI IF 2.089)

In neuropsychological assessment, semantic fluency is a widely accepted measure of executive function and access to semantic memory. While fluency scores are typically reported as the number of unique words produced, several alternative manual scoring methods have been proposed that provide additional insights into performance, such as clusters of semantically related items. Many automatic scoring methods yield metrics that are difficult to relate to the theories behind manual scoring methods, and most require manually-curated linguistic ontologies or large corpus infrastructure. In this paper, we propose a novel automatic scoring method based on Wikipedia, Backlink-VSM, which is easily adaptable to any of the 61 languages with more than 100k Wikipedia entries, can account for cultural differences in semantic relatedness, and covers a wide range of item categories. Our Backlink-VSM method combines relational knowledge as represented by links between Wikipedia entries (Backlink model) with a semantic proximity metric derived from distributional representations (vector space model; VSM). Backlink-VSM yields measures that approximate manual clustering and switching analyses, providing a straightforward link to the substantial literature that uses these metrics. We illustrate our approach with examples from two languages (English and Korean), and two commonly used categories of items (animals and fruits). For both Korean and English, we show that the measures generated by our automatic scoring procedure correlate well with manual annotations. We also successfully replicate findings that older adults produce significantly fewer switches compared to younger adults. Furthermore, our automatic scoring procedure outperforms the manual scoring method and a WordNet-based model in separating younger and older participants measured by binary classification accuracy for both English and Korean datasets. 
Our method also generalizes to a different category (fruit), demonstrating its adaptability.

Neural Theorem Prover with Word Embedding for Efficient Automatic Annotation

Wonsuk Yang, Hancheol Park, and Jong C. Park
Journal of KIISE, Vol. 44, No. 4, pp. 399-410, April, 2017.

Addressing low-resource problems in statistical machine translation of manual signals in sign language

Hancheol Park, Jung-Ho Kim, and Jong C. Park
Journal of KIISE, Vol. 44, No. 2, pp. 163-170, February, 2017.

Enhanced sign language transcription system via hand tracking and pose estimation

Jung-Ho Kim, Najoung Kim, Hancheol Park, and Jong C. Park
Journal of Computing Science and Engineering, Vol. 10, No. 3, pp. 95-101, September, 2016.

In this study, we propose a new system for constructing parallel corpora for sign languages, which are generally underresourced in comparison to spoken languages. In order to achieve scalability and accessibility regarding data collection and corpus construction, our system utilizes deep learning-based techniques and predicts depth information to perform pose estimation on hand information obtainable from video recordings by a single RGB camera. These estimated poses are then transcribed into expressions in SignWriting. We evaluate the accuracy of hand tracking and hand pose estimation modules of our system quantitatively, using the American Sign Language Image Dataset and the American Sign Language Lexicon Video Dataset. The evaluation results show that our transcription system has a high potential to be successfully employed in constructing a sizable sign language corpus using various types of video resources.

Making adjustments to event annotations for improved biological event extraction

Seung-Cheol Baek and Jong C. Park
Journal of Biomedical Semantics, 7:55, doi: 10.1186/s13326-016-0094-9, 16 September 2016. (SCIE IF 1.62)

Background
Current state-of-the-art approaches to biological event extraction train statistical models in a supervised manner on corpora annotated with event triggers and event-argument relations. Inspecting such corpora, we observe that there is ambiguity in the span of event triggers (e.g., “transcriptional activity” vs. “transcriptional”), leading to inconsistencies across event trigger annotations. Such inconsistencies make it quite likely that similar phrases are annotated with different spans of event triggers, suggesting the possibility that a statistical learning algorithm misses an opportunity for generalizing from such event triggers.

Methods
We anticipate that adjustments to the span of event triggers to reduce these inconsistencies would meaningfully improve the present performance of event extraction systems. In this study, we look into this possibility with the corpora provided by the 2009 BioNLP shared task as a proof of concept. We propose an Informed Expectation-Maximization (EM) algorithm, which trains models using the EM algorithm with a posterior regularization technique that consults the gold-standard event trigger annotations in the form of constraints. We further propose four constraints on the possible event trigger annotations to be explored by the EM algorithm.

Results
The algorithm is shown to outperform the state-of-the-art algorithm on the development corpus in a statistically significant manner and on the test corpus by a narrow margin.

Conclusions
The analysis of the annotations generated by the algorithm shows that there are various types of ambiguity in event annotations, even though they could be small in number.

Corpus Annotation for the Linguistic Analysis of Reference Relations between Event and Spatial Expressions in Text

Jin-Woo Chung, Hee-Jin Lee, and Jong C. Park
Language and Information, Vol. 18, No. 2, pp. 141-168, 2014.

Recognizing spatial information associated with events expressed in natural language text is essential not only for the interpretation of such events but also for the understanding of the relations among them. However, spatial information is rarely mentioned as compared to events, and the association between event and spatial expressions is also highly implicit in a text. This makes it difficult to automate the extraction of spatial information associated with events from the text. In this paper, we give a linguistic analysis of how spatial expressions are associated with event expressions in a text. We first present issues in annotating narrative texts with reference relations between event and spatial expressions, and then discuss surface-level linguistic characteristics of such relations based on the annotated corpus to give a helpful insight into developing an automated recognition method.

OncoSearch: Cancer Gene Search Engine with Literature Evidence

Hee-Jin Lee, Tien Cuong Dang, Hyunju Lee, and Jong C. Park
Nucleic Acids Research, (1 July 2014) 42 (W1):W416-W421. (SCI IF 8.278)

In order to identify genes that are involved in oncogenesis and to understand how such genes affect cancers, abnormal gene expressions in cancers are actively studied. For an efficient access to the results of such studies that are reported in biomedical literature, the relevant information is accumulated via text-mining tools and made available through the Web. However, current Web tools are not yet tailored enough to allow queries that specify how a cancer changes along with the change in gene expression level, which is an important piece of information to understand an involved gene's role in cancer progression or regression. OncoSearch is a Web-based engine that searches Medline abstracts for sentences that mention gene expression changes in cancers, with queries that specify (i) whether a gene expression level is up-regulated or down-regulated, (ii) whether a certain type of cancer progresses or regresses along with such gene expression change and (iii) the expected role of the gene in the cancer. OncoSearch is available through http://oncosearch.biopathway.org.

Identification of Speakers in Fairytales with Linguistic Clues

Hye-Jin Min, Jin-Woo Chung, and Jong C. Park
Language and Information, Vol. 17, No. 2, pp. 93-121, 2013.

Identifying the speakers of individual utterances mentioned in textual stories is an important step towards developing applications that involve the use of unique characteristics of speakers in stories, such as robot storytelling and story-to-scene generation. Despite its usefulness, it is a challenging task because not only human entities but also animals and even inanimate objects can become speakers, especially in fairytales, so that the number of candidates is much larger than in other types of text. In addition, since the action of speaking is not always mentioned explicitly, it is necessary to infer the speaker from implicitly mentioned speaking behaviors such as appearances or emotional expressions. In this paper, we investigate a method that exploits linguistic clues to identify the speakers of utterances in textual fairytale stories in Korean, especially in order to handle such challenging issues. Compared with previous work, the present work takes into account additional linguistic features such as vocative roles and pairs of conversation participants, and proposes the use of discourse-level turn-taking behaviors between speakers to further reduce the number of possible candidate speakers. We describe a simple rule-based method to choose a speaker from candidates based on such linguistic features and turn-taking behaviors.

Augmenting Biological Text Mining with Symbolic Inference

Jong C. Park and Hee-Jin Lee
'Biological Knowledge Discovery Handbook', editors: Mourad Elloumi and Albert Y. Zomaya, Wiley, December 27, 2013.

In this chapter, the authors review recent work on such “next-level” text-mining tools. In particular, they focus on work that uses symbolic inference to augment text mining, apart from distributional analysis that is based on the co-occurrence of biological terms and statistical methods. By symbolic inference, they refer to methods of deriving new information from known facts that are represented with nonnumeric symbols to which inference rules are applied deterministically rather than probabilistically. The research reviewed in this chapter targets one of two abstract tasks. The first task is to recognize information not explicitly stated but implied in a document, where the targeted information is often scattered across multiple sentences. The second is to propose newly predicted biological knowledge using information gathered from the literature. They briefly review text-mining work with distributional analysis to contrast the use of symbolic inference with the use of distributional analysis.

CoMAGC: a Corpus with Multi-faceted Annotations of Gene-Cancer Relations

Hee-Jin Lee, Sang-Hyung Shim, Mi-Ryoung Song, Hyunju Lee, and Jong C. Park
BMC Bioinformatics, 14:323, doi:10.1186/1471-2105-14-323, 14 November 2013. (SCI IF 3.02)

Background
In order to access the large amount of information in biomedical literature about genes implicated in various cancers both efficiently and accurately, the aid of text mining (TM) systems is invaluable. Current TM systems do target either gene-cancer relations or biological processes involving genes and cancers, but the former type produces information not comprehensive enough to explain how a gene affects a cancer, and the latter does not provide a concise summary of gene-cancer relations.

Results
In this paper, we present a corpus for the development of TM systems that are specifically targeting gene-cancer relations but are still able to capture complex information in biomedical sentences. We describe CoMAGC, a corpus with multi-faceted annotations of gene-cancer relations. In CoMAGC, a piece of annotation is composed of four semantically orthogonal concepts that together express 1) how a gene changes, 2) how a cancer changes and 3) the causality between the gene and the cancer. The multi-faceted annotations are shown to have high inter-annotator agreement. In addition, we show that the annotations in CoMAGC allow us to infer the prospective roles of genes in cancers and to classify the genes into three classes according to the inferred roles. We encode the mapping between multi-faceted annotations and gene classes into 10 inference rules. The inference rules produce results with high accuracy as measured against human annotations. CoMAGC consists of 821 sentences on prostate, breast and ovarian cancers. Currently, we deal with changes in gene expression levels among other types of gene changes. The corpus is available at http://biopathway.org/CoMAGC under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0).

Conclusions
The corpus will be an important resource for the development of advanced TM systems on gene-cancer relations.

DigSee: Disease Gene Search Engine with Evidence Sentences (version cancer)

Jeongkyun Kim, Seongeun So, Hee-Jin Lee, Jong C. Park, Jung-jae Kim, and Hyunju Lee
Nucleic Acids Research, Vol. 41, No. W1, pp. 501-517, 12 June 2013 (SCI IF 8.026).

Biological events such as gene expression, regulation, phosphorylation, localization and protein catabolism play important roles in the development of diseases. Understanding the association between diseases and genes can be enhanced with the identification of involved biological events in this association. Although biological knowledge has been accumulated in several databases and can be accessed through the Web, there is no specialized Web tool yet allowing for a query into the relationship among diseases, genes and biological events. For this task, we developed DigSee to search MEDLINE abstracts for evidence sentences describing that ‘genes’ are involved in the development of ‘cancer’ through ‘biological events’. DigSee is available through http://gcancer.org/digsee.

Quality Analysis of User-generated Content on the Web

Jong C. Park and Hye-Jin Min
'Knowledge Service Engineering Handbook', editors: Jussi Kantola and Waldemar Karwowski, CRC Press, Taylor & Francis Group, pp. 197–220, May, 2012.

E3Net: A System for Exploring E3-mediated Regulatory Networks of Cellular Functions

Youngwoong Han, Hodong Lee, Jong C. Park, and Gwan-Su Yi
Molecular and Cellular Proteomics, March Issue, 2012, doi:10.1074/mcp.O111.014076, December 22, 2011. (SCI IF 8.35)

Ubiquitin-protein ligase (E3) is a key enzyme targeting specific substrates in diverse cellular processes for ubiquitination and degradation. The existing findings of substrate specificity of E3 are, however, scattered over a number of resources, making it difficult to study them together with an integrative view. Here we present E3Net, a web-based system that provides a comprehensive collection of available E3-substrate specificities and a systematic framework for the analysis of E3-mediated regulatory networks of diverse cellular functions. Currently, E3Net contains 2201 E3s and 4896 substrates in 427 organisms and 1671 E3-substrate specific relations between 493 E3s and 1277 substrates in 42 organisms, extracted mainly from MEDLINE abstracts and UniProt comments with an automatic text mining method and additional manual inspection and partly from high throughput experiment data and public ubiquitination databases. The significant functions and pathways of the extracted E3-specific substrate groups were identified from a functional enrichment analysis with 12 functional category resources for molecular functions, protein families, protein complexes, pathways, cellular processes, cellular localization, and diseases. E3Net includes interactive analysis and navigation tools that make it possible to build an integrative view of E3-substrate networks and their correlated functions with graphical illustrations and summarized descriptions. As a result, E3Net provides a comprehensive resource of E3s, substrates, and their functional implications summarized from the regulatory network structures of E3-specific substrate groups and their correlated functions. This resource will facilitate further in-depth investigation of ubiquitination-dependent regulatory mechanisms. E3Net is freely available online at http://pnet.kaist.ac.kr/e3net. Molecular & Cellular Proteomics 11: 10.1074/mcp.O111.014076, 1–14, 2012.

Probabilistic Filtering for a Biological Knowledge Discovery System with Text Mining and Automatic Inference

Hee-Jin Lee and Jong C. Park
Journal of the Korean Society of Computer and Information, Vol. 17, No. 2, pp. 139-148, February 2012.

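This paper addresses a system with an integrated text mining and inference architecture that automatically extracts molecular-level event information from the biological literature through text mining and automatically infers new biological knowledge based on this event information. Compared with systems that take as input only information that has been extracted in advance and registered in databases, a knowledge discovery system with such an integrated architecture has the advantages that it can use the latest information sooner and can use diverse information beyond predefined formats. On the other hand, because it uses the results of text mining information extraction as they are, the utility of the whole system can degrade considerably depending on the performance of the text mining module. In this paper, we propose a probability-based filtering method that effectively removes false positives from the text mining results, aiming to improve the accuracy and utility of the overall knowledge discovery system. The proposed probability-based filtering method outperformed the count-based filtering method used as the baseline.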

Identifying Helpful Reviews Based on Customer’s Mentions about Experiences

Hye-Jin Min and Jong C. Park
Expert Systems With Applications, doi:10.1016/j.eswa.2012.01.116, January 25, 2012. (SCIE IF 1.924)

As numerous on-line product reviews that vary in quality are published every day, much attention is being paid to quality assessment of such reviews. The current metric of using the number of votes by other customers such as ‘helpful vote’, despite its dominance, does not yield a fully effective outcome. In this article, we propose a novel metric to rank product reviews by ‘mentions about experiences’, accounting for customer’s personal experiences, as a way of identifying high quality reviews. The proposed metric has two parameters that capture time expressions related to the use of products and product entities over different purchasing time periods by linguistic clues. The empirical results show that this metric is not only as helpful as the best existing metrics, ‘helpful vote’ or ‘reviewer rank’, but is also free from undesirable biases that either penalize recency or are driven solely by popularity. Our usability study also shows that ordering reviews by our metric is considered helpful on the accounts of both usefulness and satisfaction.

Sentence Type Identification in Korean: Applications to Korean-Sign Language Translation and Korean Speech Synthesis

Jin-Woo Chung, Ho-Joon Lee, and Jong C. Park
Journal of the HCI Society of Korea, Vol. 5, No. 1, pp. 25-35, 2010.
(selected as best paper)

This paper proposes a method of automatically identifying sentence types in Korean and improving naturalness in sign language generation and speech synthesis using the identified sentence type information. In Korean, sentences are usually categorized into five types: declarative, imperative, propositive, interrogative, and exclamatory. However, it is also known that these types are quite ambiguous to identify in dialogues. In this paper, we present additional morphological and syntactic clues for the sentence type and propose a rule-based procedure for identifying the sentence type using these clues. The experimental results show that our method gives a reasonable performance. We also describe how the sentence type is used to generate non-manual signals in Korean-Korean sign language translation and appropriate intonation in Korean speech synthesis. Since the use of sentence type information in speech synthesis and sign language generation has not been studied much previously, it is anticipated that our method will contribute to research on generating more natural speech and sign language expressions.

Wrestling with Biomedical Research Results: Language Resources and Literature Analysis

D. Rebholz-Schuhmann, Nigel Collier, Jong C. Park, and Limsoon Wong
Journal of Bioinformatics and Computational Biology (JBCB), Vol. 8, No. 1, pp. 129-130, Imperial College Press, February 2010.

Extracting Melodies from Piano Solo Music Based on Characteristics of Music

Yoonjae Choi and Jong C. Park
Journal of the Korean Institute of Information Scientists and Engineers (KIISE): Computing Practices and Letters, Vol. 15, No. 12, pp. 923-927, 2009.

The recent growth of the digital music market has increased demand for music search and recommendation services. In order to improve the performance of music-based application services, the process of extracting melodies from polyphonic music is essential. In this paper, we propose a method to extract melodies from piano solo music, which is highly polyphonic and has a wide pitch range. We categorize piano music into three classes taking into account the characteristics of music, and extract melodies according to each class. The performance evaluation for the implemented system showed that our method works successfully on a variety of piano solo music.

Automated Classification of Sentential Types in Korean with Morphological Analysis

Jin-Woo Chung and Jong C. Park
Language and Information, Vol. 13, No. 2, pp. 59-97, 2009.

The type of a given sentence indicates the speaker's attitude towards the listener and is usually determined by its final endings and punctuation marks. However, some final endings are used in several types of sentences, which means that we cannot identify the sentential type by considering only the final endings and punctuation marks. In this paper, we propose methods of finding some other linguistic clues for identifying the sentential type with a morphological analysis. We also propose to use these methods to implement a system that automatically classifies sentences in Korean according to their sentential types.

E3Miner: a text mining tool for ubiquitin-protein ligases

Hodong Lee, Gwansu Yi, and Jong C. Park
Nucleic Acids Research, Vol. 36, Web Server issue, doi:10.1093/nar/gkn286, 15 May 2008 (SCI IF 8.026).

Ubiquitination is a regulatory process critically involved in the degradation of >80% of cellular proteins, where such proteins are specifically recognized by a key enzyme, or a ubiquitin-protein ligase (E3). Because of this important role of E3s, a rapidly growing body of the published literature in biology and biomedical fields reports novel findings about various E3s and their molecular mechanisms. However, such findings are neither adequately retrieved by general text-mining tools nor systematically made available by such protein databases as UniProt alone. E3Miner is a web-based text mining tool that extracts and organizes comprehensive knowledge about E3s from the abstracts of journal articles and the relevant databases, supporting users to have a good grasp of E3s and their related information easily from the available text. The tool analyzes text sentences to identify protein names for E3s, to narrow down target substrates and other ubiquitin-transferring proteins in E3-specific ubiquitination pathways and to extract molecular features of E3s during ubiquitination. E3Miner also retrieves E3 data about protein functions, other E3-interacting partners and E3-related human diseases from the protein databases, in order to help facilitate further investigation. E3Miner is freely available through http://e3miner.biopathway.org.

Monitoring the evolutionary aspect of the Gene Ontology to enhance predictability and usability

Jong C. Park, Tak-eun Kim, and Jinah Park
BMC Bioinformatics 2008, 9(Suppl 3):7 doi:10.1186/1471-2105-9-S3-S7, 11 April 2008.

Background: Much effort is currently made to develop the Gene Ontology (GO). Due to the dynamic nature of information it addresses, GO undergoes constant updates whose results are released at regular intervals as separate versions. Although there are a large number of computational tools to aid the development of GO, they are operating on a particular version of GO, making it difficult for GO curators to anticipate the full impact of particular changes along the time axis on a larger scale. We present a method for tapping into such an evolutionary aspect of GO, by making it possible to keep track of important temporal changes to any of the terms and relations of GO and by consequently making it possible to recognize associated trends.
Results: We have developed visualization methods for viewing the changes between two different versions of GO by constructing a colour-coded layered graph. The graph shows both versions of GO with highlights to those GO terms that are added, removed and modified between the two versions. Focusing on a specific GO term or terms of interest over a period, we demonstrate the utility of our system that can be used to make useful hypotheses about the cause of the evolution and to provide new insights into more complex changes.
Conclusions: GO undergoes fast evolutionary changes. A snapshot of GO, as presented by each version of GO alone, overlooks such evolutionary aspects, and consequently limits the utilities of GO. The method that highlights the differences of consecutive versions or two different versions of an evolving ontology with colour-coding enhances the utility of GO for users as well as for developers. To the best of our knowledge, this is the first proposal to visualize the evolutionary aspect of GO.

Text Mining and Management in Biomedicine

Jong C. Park, Gary Geunbae Lee, and Limsoon Wong
Guest Editors' Introduction to the Special Issue, ACM Transactions on Asian Language Information Processing (TALIP), March, 2006.

BioContrasts: Extracting and Exploiting Protein-Protein Contrastive Relations from Biomedical Literature

Jung-jae Kim, Zhuo Zhang, Jong C. Park, and See-Kiong Ng
Bioinformatics, Vol. 22, No. 5, pp. 597-605, March, 2006.

Motivation: Contrasts are useful conceptual vehicles for learning processes and exploratory research of the unknown. For example, contrastive information between proteins can reveal what similarities, divergences and relations there are between the two proteins, leading to invaluable insights for better understanding of the proteins. Such contrastive information is found to be reported in the biomedical literature. However, there have been no reported attempts in current biomedical text mining work that systematically extract and present such useful contrastive information from the literature for exploitation.

Results: Our BioContrasts system extracts protein–protein contrastive information from MEDLINE abstracts and presents the information to biologists in a web application for exploitation. Contrastive information is identified in the text abstracts with contrastive negation patterns such as ‘A but not B’. A total of 799 169 pairs of contrastive expressions were successfully extracted from 2.5 million MEDLINE abstracts. Using grounding of contrastive protein names to Swiss-Prot entries, we were able to produce 41 471 pieces of contrasts between Swiss-Prot protein entries. These contrastive pieces of information are then presented via a user-friendly interactive web portal that can be exploited for applications such as the refinement of biological pathways.

Availability: BioContrasts can be accessed at http://biocontrasts.i2r.a-star.edu.sg. It is also mirrored at http://biocontrasts.biopathway.org

Supplementary information: Supplementary materials are available at Bioinformatics online.

Contact: skng@i2r.a-star.edu.sg; park@cs.kaist.ac.kr
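
The contrastive negation pattern ‘A but not B’ mentioned in the BioContrasts abstract above can be illustrated with a toy regular expression. This is only a sketch: the actual patterns and the protein-name grounding used by BioContrasts are not described on this page, so the regex, the function name `find_contrasts`, and the restriction to single-token names are all hypothetical simplifications.

```python
import re

# Hypothetical simplification: A and B are single tokens made of
# letters, digits, and hyphens (real protein names can span several words).
CONTRAST_PATTERN = re.compile(r"\b([A-Za-z0-9-]+) but not ([A-Za-z0-9-]+)\b")


def find_contrasts(sentence):
    """Return (A, B) pairs for every 'A but not B' occurrence in the sentence."""
    return CONTRAST_PATTERN.findall(sentence)
```

For instance, calling `find_contrasts` on a sentence such as "ERK but not JNK was activated." yields the pair `("ERK", "JNK")`; a real system would additionally need to check that both tokens are protein names.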

Visualization for Digesting a High Volume of the Biomedical Literature

Changsu Lee, Jinah Park, and Jong C. Park
Bioinformatics and Biosystems, Vol. 1, No. 1, pp. 51-60, Feb. 2006.

The paradigm in biology is currently changing from that of conducting hypothesis-driven individual experiments to that of utilizing the results of a massive data analysis with appropriate computational tools. We present LayMap, an implemented visualization system that helps the user to deal with a high volume of the biomedical literature such as MEDLINE, through the layered maps that are constructed on the results of an information extraction system. LayMap also utilizes filtering and granularity for an enhanced view of the results. Since a biomedical information extraction system gives rise to a focused and effective way of slicing up the data space, the combined use of LayMap with such an information extraction system can help the user to navigate the data space in a speedy and guided manner. As a case study, we have applied the system to datasets of journal abstracts on ‘MAPK pathway’ and ‘bufalin’ from MEDLINE. With the proposed visualization, we have successfully rediscovered pathway maps of a reasonable quality for ERK, p38 and JNK. Furthermore, with respect to bufalin, we were able to identify the potentially interesting relation between the Chinese medicine Chan su and apoptosis with a high level of detail.

Named Entity Recognition

Jong C. Park and Jung-jae Kim
Chapter six of the book 'Text Mining for Biology', editors: Ben Stapley and Sophia Ananiadou, Artech House Books, January, 2006.

Linguistic Characterization of Sign Language Expressions for an Automatic Mapping from Natural Language Sentences

Jiwon Choi, Eunyoung Chang, Hee-Jin Lee, and Jong C. Park
Language and Information, Vol. 10, No. 1, pp. 71-91, 2006.

Automatic extension of Gene Ontology with flexible identification of candidate terms

Jin-Bok Lee, Jung-jae Kim, and Jong C. Park
Bioinformatics, Vol. 22 No. 6, pp. 665-670, 2006.

Motivation: Gene Ontology (GO) has been manually developed to provide a controlled vocabulary for gene product attributes. It continues to evolve with new concepts that are compiled mostly from existing concepts in a compositional way. Given the relatively slow growth rate of GO in the face of the fast accumulation of biological data, it is highly desirable to provide an automatic means for predicting new concepts from the existing ones.

Results: We present a novel method that predicts more detailed concepts by utilizing syntactic relations among the existing concepts. We propose a validation measure for the automatically predicted concepts by matching the concepts to biomedical articles. We also suggest how to find a suitable direction for the extension of a constantly growing ontology such as GO.

Availability: http://autogo.biopathway.org

Contact: park@nlp.kaist.ac.kr

Supplementary information: Supplementary materials are available at Bioinformatics online.

Information Visualization with Text Data Mining for Knowledge Discovery Tools in Bioinformatics

Jinah Park, Changsu Lee, and Jong C. Park
Key Engineering Materials, Vols. 277-279, pp. 259-265, 2005. (SCI IF 0.224)

An abundant amount of information is produced in the digital domain, and an effective information extraction (IE) system is required to surf through this sea of information. In this paper, we show that an interactive visualization system works effectively to complement an IE system. In particular, three-dimensional (3D) visualization can turn a data-centric system into a user-centric one by facilitating the human visual system as a powerful pattern recognizer to become a part of the IE cycle. Because information as data is multidimensional in nature, 2D visualization has been the preferred mode. However, we argue that the extra dimension available for us in a 3D mode provides a valuable space where we can pack an orthogonal aspect of the available information. As for candidates of this orthogonal information, we have considered the following two aspects: 1) abstraction of the unstructured source data, and 2) the history line of the discovery process. We have applied our proposal to text data mining in bioinformatics. Through case studies of data mining for molecular interaction in the yeast and mitogen-activated protein kinase pathways, we demonstrate the possibility of interpreting the extracted results with a 3D visualization system.

A Graphic Tool for Curating Molecular Interaction Networks from the Literature

Changsu Lee, Jinah Park, and Jong C. Park
International Journal of Computers in Biology and Medicine, Vol. 35, pp. 555-564, 2005.

We propose a graphic tool for curating molecular interaction networks constructed from the literature by information extraction (IE). In order to turn preliminary results from IE into useful biomedical resources, we propose to use a controlled environment in which visualization and IE work synergistically. The usability of the proposed graphic tool is shown with respect to the identification of incorrectly extracted results that are due to the troublesome coordination phenomena in natural language texts. Through an experiment on molecular interactions in Saccharomyces cerevisiae, we have seen a meaningful increase (from 91.5% to 97.5%) in correctly extracted interaction information.

Extracting contrastive information from negation patterns in biomedical literature

Jung-jae Kim and Jong C. Park
ACM Transactions on Asian Language Information Processing (TALIP), Special Issue on Text Mining and Management in Biomedicine, 2006.

Expressions of negation in the biomedical literature often encode information of contrast as a means for explaining significant differences between the objects that are so contrasted. We show that such information gives additional insights into the nature of the structures and/or biological functions of these objects, leading to valuable knowledge for subcategorization of protein families by the properties that the involved proteins do not have in common. Based on the observation that the expressions of negation employ mostly predictable syntactic structures that can be characterized by subclausal coordination and by clause-level parallelism, we present a system that extracts such contrastive information by identifying those syntactic structures with natural language processing techniques and with additional linguistic resources for semantics. The implemented system shows the performance of 85.7% precision and 61.5% recall, including 7.7% partial recall, or an F score of 76.6. We apply the system to the biological interactions as extracted by our biomedical information extraction system in order to enrich proteome databases with contrastive information.

Introduction to the Thematic Session on Text Mining in Biomedicine

Sophia Ananiadou and Jong C. Park
Lecture Notes in Artificial Intelligence (LNAI), Vol. 3248 (revised selected papers from IJCNLP 2004), editors: K-Y Su, J. Tsujii, J.-H. Lee, O. Y. Kwong, p. 776, 2005.

This thematic session follows a series of workshops and conferences recently dedicated to text mining in biology. This interest is due to the overwhelming amount of biomedical literature (Medline alone contains over 14M abstracts) and the urgent need to discover and organise knowledge extracted from texts. Text mining techniques such as information extraction and named entity recognition have been successfully applied to biomedical texts with varying results. A variety of approaches such as machine learning, SVMs, and shallow and deep linguistic analyses have been applied to biomedical texts to extract, manage and organize information. There are over 300 databases containing crucial information on biological data. One of the main challenges is the integration of such heterogeneous information from factual databases to texts. One of the major knowledge bottlenecks in biomedicine is terminology. In such a dynamic domain, new terms are constantly created. In addition, there is not always a mapping among terms found in databases, controlled vocabularies, ontologies and “actual” terms which are found in texts. Term variation and term ambiguity have been addressed in the past but more solutions are needed. The confusion over what is a descriptor, a term, or an index term accentuates the problem. Solving the terminological problem is paramount to biotext mining, as relationships linking new genes, drugs, proteins (i.e. terms) are important for effective information extraction. Mining for relationships between terms and their automatic extraction is important for the semi-automatic updating and populating of ontologies and other resources needed in biomedicine. Text mining applications such as question-answering, automatic summarization, and intelligent information retrieval are based on the existence of shared resources, such as annotated corpora (GENIA) and terminological resources. The field needs more concentrated and integrated efforts to build these shared resources.
In addition, evaluation efforts such as BioCreaTive and Genomic Trec are important for biotext mining techniques and applications. The aim of text mining in biology is to provide solutions to biologists and to aid curators in their task. We hope this thematic session addressed techniques and applications that aid biologists in their research.

Constructing SSML Documents with Automatically Generated Intonation Information in a Combinatory Categorial Grammar Framework

Lee Hwa Jin, Ho-Joon Lee, and Jong C. Park
International Journal of Computer Processing of Oriental Languages (IJCPOL), Vol. 17, No. 4, pp. 223-238, December, 2004.

As of now, Text-to-Speech (TTS) systems are widely used throughout the full spectrum of our activities, and various natural language processing techniques have been utilized to enhance the performance of such TTS systems. As TTS systems begin to play an important role for communication between human and machine, naturalness is considered the most crucial measure of performance for TTS systems, in addition to correctness. General statistical approaches, though widely adopted, are not appropriate for the phenomena as they assign the same intonation to the same sentence. We analyze various kinds of corpora to extract informative features for intonation generation in a Combinatory Categorial Grammar framework, and express intonation-annotated documents using Speech Synthesis Markup Language for target-system-neutral applications.

BioIE: Retargetable Information Extraction and Ontological Annotation of Biological Pathways from the Literature

Jung-jae Kim and Jong C. Park
Journal of Bioinformatics and Computational Biology (JBCB), Vol. 2, No. 3, pp. 551-568, 2004. (SCI IF 1.393)

The need for extracting general biological interactions of arbitrary types from the rapidly growing volume of the biomedical literature is drawing increased attention, while the need for this much diversity also requires both a robust treatment of complex linguistic phenomena and a method to consistently characterize the results. We present a biomedical information extraction system, BioIE, to address both of these needs by utilizing a full-fledged English grammar formalism, or a combinatory categorial grammar, and by annotating the results with the terms of Gene Ontology, which provides a common and controlled vocabulary. BioIE deals with complex linguistic phenomena such as coordination, relative structures, acronyms, appositive structures, and anaphoric expressions. In order to deal with real-world syntactic variations of ontological terms, BioIE utilizes the syntactic dependencies between words in sentences as well, based on the observation that the component words in an ontological term usually appear in a sentence with known patterns of syntactic dependencies.

Research Trends in Bio Text Mining

Jung-jae Kim and Jong C. Park
Korea Information Science Society SIGBIT News Letter, Vol. 2, No. 1, pp. 14-31, 2004.

An Analysis of Syntactic and Semantic Relations between Negative Polarity Items and Negatives in Korean

Jung-jae Kim and Jong C. Park
Journal of Language and Information, Vol. 8, No. 1, pp. 53-76, 2004.

Negative polarity items (NPIs), which function as quantifiers, are licensed in a syntactically strict way by negatives, which function as qualifiers, resulting in universal negating interpretations as pairs. We present a proposal to explain the related phenomena, in which the syntax and the semantics are closely related to each other, with Combinatory Categorial Grammar. For this purpose, we first adopt the usual approach to scrambling, but control its overgeneration with the use of markers, taking into account the complex syntactic phenomena involving NPIs and scrambling in Korean. We also propose to utilize polarity intensity as a novel feature, in order to account for the universal negating interpretations when NPIs are combined with negatives. Our proposal also explains the difference in readings when other quantifiers or qualifiers intervene between the NPIs and the related negatives. (Korea Advanced Institute of Science and Technology)

Annotation of Gene Products in the Literature with Gene Ontology Terms using Syntactic Dependencies

Jung-jae Kim and Jong C. Park
Lecture Notes in Artificial Intelligence, Post-Conference Book of IJCNLP-04, 2004.

We present a method for automatically annotating gene products in the literature with the terms of Gene Ontology (GO), which provides a dynamic but controlled vocabulary. Although GO is well-organized with such lexical relations as synonymy, ‘is-a’, and ‘part-of’ relations among its terms, GO terms show quite a high degree of morphological and syntactic variations in the literature. As opposed to the previous approaches that considered only restricted kinds of term variations, our method uncovers the syntactic dependencies between gene product names and ontological terms as well in order to deal with real-world syntactic variations, based on the observation that the component words in an ontological term usually appear in a sentence with established patterns of syntactic dependencies.

Analysis and Computational Processing of Coordination Ambiguity in Korean

Hodong Lee and Jong C. Park
Journal of Language and Information, Vol. 7, No. 2, pp. 59-79, 2003.

Analysis and Processing of Korean with Quantifier Floating

Jin-Bok Lee and Jong C. Park
Journal of Language and Information, Vol. 7, No. 1, pp. 1-22, 2003.

Accomplishments and Challenges in Literature Data Mining for Biology

Lynette Hirschman, Jong C. Park, Junichi Tsujii, Limsoon Wong, and Cathy Wu
Bioinformatics, Vol. 18, No. 12, pp. 1553-1561, 2002.

We review recent results in literature data mining for biology and discuss the need and the steps for a challenge evaluation for this field. Literature data mining has progressed from simple recognition of terms to extraction of interaction relationships from complex sentences, and has broadened from recognition of protein interactions to a range of problems such as improving homology search, identifying cellular location, and so on. To encourage participation and accelerate progress in this expanding field, we propose creating challenge evaluations, and we describe two specific applications in this context.

Interpretation of Natural Language Queries for Relational Database Access with Combinatory Categorial Grammar

Hodong Lee and Jong C. Park
International Journal of Computer Processing of Oriental Languages (IJCPOL), Vol. 15, No. 3, pp. 281-304, 2002.

In this paper, we describe a proposal to derive formal language queries from natural language queries with a combinatory categorial grammar (CCG). CCGs are well known to provide a means of deriving all the levels of information for natural language, i.e., syntax, semantics and discourse, at the same time. In our proposal, we utilize an extra level of representation for formal language queries for the aforementioned derivation. The syntactic coverage is shown with various natural language queries, including compound nouns, modification markers, various types of ellipses, numerical expressions, and subordinate and coordinate constructions. The general purpose CCG lexicon is semi-automatically augmented with the database fields and entries. We also discuss the performance of an implemented natural language query processing system.

Using Combinatory Categorial Grammar to Extract Biomedical Information

Jong C. Park
IEEE Intelligent Systems, Special Issue on Intelligent Systems in Biology, Vol. 16, No. 6, pp. 62-67, November-December, 2001.

Extracting information from biology databases manually can be an overwhelming task. GenBank, the US National Institutes of Health database containing all publicly available DNA sequences, has more than 14 billion bases in 13 million genetic-sequence records. Medline, a literature database available through PubMed, has over 11 million journal citations. In a May 2001 search request for “cytokine” (regulatory proteins in the immune system), PubMed returned 296,556 articles. Given the quantity and complexity of biomedical literature, demands for computational tools to extract specific information are increasing. In this article, I review biomedical information extraction methods and present research done by KAIST’s natural language processing group on a system that shows encouraging performance using combinatory categorial grammar (explained in detail below) as a natural language grammar formalism.

Bioinformatics and Natural Language Processing

Jong C. Park
Special Issue in Korean Information Processing, Communications of the Korea Information Science Society (KISS), Vol. 19, No. 10, pp. 46-51, October, 2001.

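Bioinformatics is a collective term for the research field that, as the amount of information handled in biology grows rapidly, applies information processing techniques used in fields such as computer science, mathematics, and statistics in order to produce, manage, and utilize this information efficiently. Natural language processing (NLP) is a collective term for the research field concerned with using computers to process sentences and documents expressed in natural languages such as Korean or English; one side of this research supports human-computer interaction (HCI), and another side concerns the efficient management and utilization of vast amounts of natural language information. In this article, by introducing research in the field of biomedical information extraction, we discuss how these two distinct research fields come to be related. Reflecting the recently heightened public interest in bioinformatics, the August 2000 issue of the Communications of the Korea Information Science Society [1] carried a special feature on bioinformatics; we expect that this article can be seen as complementing that feature, one year later, with applications of natural language processing.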

Combinatory Categorial Grammar for the Syntactic, Semantic, and Discourse Analyses of Coordinate Constructions in Korean

Hyung-joon Cho and Jong C. Park
Journal of the Korea Information Science Society (KISS), Vol. 27, No. 4, pp. 448-462, 2000.

Coordinate constructions in natural language pose a number of difficulties to natural language processing units, due to the increased complexity of syntactic analysis, the syntactic ambiguity of the involved lexical items, and the apparent deletion of predicates in various places. In this paper, we address the syntactic characteristics of the coordinate constructions in Korean from the viewpoint of constructing a competence grammar, and present a version of combinatory categorial grammar for the analysis of coordinate constructions in Korean. We also show how to utilize a unified lexicon in the proposed grammar formalism in deriving the sentential semantics and associated information structures as well, in order to capture the discourse functions of coordinate constructions in Korean. The presented analysis conforms to the common wisdom that coordinate constructions are utilized in language not simply to reduce multiple sentences to a single sentence, but also to convey the information of contrast. Finally, we provide an analysis of sample corpora for the frequency of coordinate constructions in Korean and discuss some problematic cases.