Publications

Journal Papers and Book Chapters

Neural Theorem Prover with Word Embedding for Efficient Automatic Annotation

Wonsuk Yang, Hancheol Park, and Jong C. Park
Journal of KIISE, Vol. 44, No. 4, pp. 399-410, April, 2017.

Addressing low-resource problems in statistical machine translation of manual signals in sign language

Hancheol Park, Jung-Ho Kim, and Jong C. Park
Journal of KIISE, Vol. 44, No. 2, pp. 163-170, February, 2017.

Enhanced sign language transcription system via hand tracking and pose estimation

Jung-Ho Kim, Najoung Kim, Hancheol Park, and Jong C. Park
Journal of Computing Science and Engineering, Vol. 10, No. 3, pp. 95-101, September, 2016.
In this study, we propose a new system for constructing parallel corpora for sign languages, which are generally underresourced in comparison to spoken languages. In order to achieve scalability and accessibility regarding data collection and corpus construction, our system utilizes deep learning-based techniques and predicts depth information to perform pose estimation on hand information obtainable from video recordings by a single RGB camera. These estimated poses are then transcribed into expressions in SignWriting. We evaluate the accuracy of hand tracking and hand pose estimation modules of our system quantitatively, using the American Sign Language Image Dataset and the American Sign Language Lexicon Video Dataset. The evaluation results show that our transcription system has a high potential to be successfully employed in constructing a sizable sign language corpus using various types of video resources.

Making adjustments to event annotations for improved biological event extraction

Seung-Cheol Baek and Jong C. Park
Journal of Biomedical Semantics, 7:55, doi: 10.1186/s13326-016-0094-9, 16 September 2016. (SCIE IF 1.62)
Background
Current state-of-the-art approaches to biological event extraction train statistical models in a supervised manner on corpora annotated with event triggers and event-argument relations. Inspecting such corpora, we observe that there is ambiguity in the span of event triggers (e.g., "transcriptional activity" vs. "transcriptional"), leading to inconsistencies across event trigger annotations. Such inconsistencies make it quite likely that similar phrases are annotated with different spans of event triggers, suggesting the possibility that a statistical learning algorithm misses an opportunity for generalizing from such event triggers.

Methods
We anticipate that adjustments to the span of event triggers to reduce these inconsistencies would meaningfully improve the present performance of event extraction systems. In this study, we look into this possibility with the corpora provided by the 2009 BioNLP shared task as a proof of concept. We propose an Informed Expectation-Maximization (EM) algorithm, which trains models using the EM algorithm with a posterior regularization technique, which consults the gold-standard event trigger annotations in a form of constraints. We further propose four constraints on the possible event trigger annotations to be explored by the EM algorithm.

Results
The algorithm is shown to outperform the state-of-the-art algorithm on the development corpus in a statistically significant manner and on the test corpus by a narrow margin.

Conclusions
The analysis of the annotations generated by the algorithm shows that there are various types of ambiguity in event annotations, even though they could be small in number.

Corpus Annotation for the Linguistic Analysis of Reference Relations between Event and Spatial Expressions in Text

Jin-Woo Chung, Hee-Jin Lee, and Jong C. Park
Language and Information, Vol. 18, No. 2, pp. 141-168, 2014.
Recognizing spatial information associated with events expressed in natural language text is essential not only for the interpretation of such events but also for the understanding of the relations among them. However, spatial information is rarely mentioned as compared to events and the association between event and spatial expressions is also highly implicit in a text. This would make it difficult to automate the extraction of spatial information associated with events from the text. In this paper, we give a linguistic analysis of how spatial expressions are associated with event expressions in a text. We first present issues in annotating narrative texts with reference relations between event and spatial expressions, and then discuss surface-level linguistic characteristics of such relations based on the annotated corpus to give a helpful insight into developing an automated recognition method.

OncoSearch: Cancer Gene Search Engine with Literature Evidence

Hee-Jin Lee, Tien Cuong Dang, Hyunju Lee, and Jong C. Park
Nucleic Acids Research, (1 July 2014) 42 (W1):W416-W421. (SCI IF 8.278)
In order to identify genes that are involved in oncogenesis and to understand how such genes affect cancers, abnormal gene expressions in cancers are actively studied. For an efficient access to the results of such studies that are reported in biomedical literature, the relevant information is accumulated via text-mining tools and made available through the Web. However, current Web tools are not yet tailored enough to allow queries that specify how a cancer changes along with the change in gene expression level, which is an important piece of information to understand an involved gene's role in cancer progression or regression. OncoSearch is a Web-based engine that searches Medline abstracts for sentences that mention gene expression changes in cancers, with queries that specify (i) whether a gene expression level is up-regulated or down-regulated, (ii) whether a certain type of cancer progresses or regresses along with such gene expression change and (iii) the expected role of the gene in the cancer. OncoSearch is available through http://oncosearch.biopathway.org.

Identification of Speakers in Fairytales with Linguistic Clues

Hye-Jin Min, Jin-Woo Chung, and Jong C. Park
Language and Information, Vol. 17, No. 2, pp. 93-121, 2013.
Identifying the speakers of individual utterances mentioned in textual stories is an important step towards developing applications that involve the use of unique characteristics of speakers in stories, such as robot storytelling and story-to-scene generation. Despite the usefulness, it is a challenging task because not only human entities but also animals and even inanimate objects can become speakers especially in fairytales so that the number of candidates is much more than that in other types of text. In addition, since the action of speaking is not always mentioned explicitly, it is necessary to infer the speaker from the implicitly mentioned speaking behaviors such as appearances or emotional expressions. In this paper, we investigate a method to exploit linguistic clues to identify the speakers of utterances from textual fairytale stories in Korean, especially in order to handle such challenging issues. Compared with the previous work, the present work takes into account additional linguistic features such as vocative roles and pairs of conversation participants, and proposes the use of discourse-level turn-taking behaviors between speakers to further reduce the number of possible candidate speakers. We describe a simple rule-based method to choose a speaker from candidates based on such linguistic features and turn-taking behaviors.

Augmenting Biological Text Mining with Symbolic Inference

Jong C. Park and Hee-Jin Lee
'Biological Knowledge Discovery Handbook', editors: Mourad Elloumi and Albert Y. Zomaya, Wiley, December 27, 2013.
In this chapter, the authors review recent work on such "next-level" text-mining tools. In particular, they focus on the work that uses symbolic inference to augment text-mining, apart from distributional analysis that is based on the co-occurrence of biological terms and statistical methods. By symbolic inference, they refer to the methods of deriving new information from known facts that are represented with nonnumeric symbols to which inference rules are applied deterministically rather than probabilistically. Researches reviewed in this chapter target one of the two abstract tasks. The first task is to recognize information not explicitly stated but implied in a document, where the targeted information is often scattered across multiple sentences. The second is to propose newly predicted biological knowledge using information gathered from the literature. They briefly review text-mining work with distributional analysis to contrast the use of symbolic inference with the use of distributional analysis.

CoMAGC: a Corpus with Multi-faceted Annotations of Gene-Cancer Relations

Hee-Jin Lee, Sang-Hyung Shim, Mi-Ryoung Song, Hyunju Lee, and Jong C. Park
BMC Bioinformatics, 14:323, doi:10.1186/1471-2105-14-323, 14 November 2013. (SCI IF 3.02)
Background
In order to access the large amount of information in biomedical literature about genes implicated in various cancers both efficiently and accurately, the aid of text mining (TM) systems is invaluable. Current TM systems do target either gene-cancer relations or biological processes involving genes and cancers, but the former type produces information not comprehensive enough to explain how a gene affects a cancer, and the latter does not provide a concise summary of gene-cancer relations.

Results
In this paper, we present a corpus for the development of TM systems that are specifically targeting gene-cancer relations but are still able to capture complex information in biomedical sentences. We describe CoMAGC, a corpus with multi-faceted annotations of gene-cancer relations. In CoMAGC, a piece of annotation is composed of four semantically orthogonal concepts that together express 1) how a gene changes, 2) how a cancer changes and 3) the causality between the gene and the cancer. The multi-faceted annotations are shown to have high inter-annotator agreement. In addition, we show that the annotations in CoMAGC allow us to infer the prospective roles of genes in cancers and to classify the genes into three classes according to the inferred roles. We encode the mapping between multi-faceted annotations and gene classes into 10 inference rules. The inference rules produce results with high accuracy as measured against human annotations. CoMAGC consists of 821 sentences on prostate, breast and ovarian cancers. Currently, we deal with changes in gene expression levels among other types of gene changes. The corpus is available at http://biopathway.org/CoMAGC under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0).

Conclusions
The corpus will be an important resource for the development of advanced TM systems on gene-cancer relations.

DigSee: Disease Gene Search Engine with Evidence Sentences (version cancer)

Jeongkyun Kim, Seongeun So, Hee-Jin Lee, Jong C. Park, Jung-jae Kim, and Hyunju Lee
Nucleic Acids Research, Vol. 41, No. W1, pp. 501-517, 12 June 2013 (SCI IF 8.026).
Biological events such as gene expression, regulation, phosphorylation, localization and protein catabolism play important roles in the development of diseases. Understanding the association between diseases and genes can be enhanced with the identification of involved biological events in this association. Although biological knowledge has been accumulated in several databases and can be accessed through the Web, there is no specialized Web tool yet allowing for a query into the relationship among diseases, genes and biological events. For this task, we developed DigSee to search MEDLINE abstracts for evidence sentences describing that 'genes' are involved in the development of 'cancer' through 'biological events'. DigSee is available through http://gcancer.org/digsee.

Quality Analysis of User-generated Content on the Web

Jong C. Park and Hye-Jin Min
'Knowledge Service Engineering Handbook', editors: Jussi Kantola and Waldemar Karwowski, CRC Press, Taylor & Francis Group, pp. 197-220, May, 2012.

E3Net: A System for Exploring E3-mediated Regulatory Networks of Cellular Functions

Youngwoong Han, Hodong Lee, Jong C. Park, and Gwan-Su Yi
Molecular and Cellular Proteomics, March Issue, 2012, doi:10.1074/mcp.O111.014076, December 22, 2011. (SCI IF 8.35)
Ubiquitin-protein ligase (E3) is a key enzyme targeting specific substrates in diverse cellular processes for ubiquitination and degradation. The existing findings of substrate specificity of E3 are, however, scattered over a number of resources, making it difficult to study them together with an integrative view. Here we present E3Net, a web-based system that provides a comprehensive collection of available E3-substrate specificities and a systematic framework for the analysis of E3-mediated regulatory networks of diverse cellular functions. Currently, E3Net contains 2201 E3s and 4896 substrates in 427 organisms and 1671 E3-substrate specific relations between 493 E3s and 1277 substrates in 42 organisms, extracted mainly from MEDLINE abstracts and UniProt comments with an automatic text mining method and additional manual inspection and partly from high throughput experiment data and public ubiquitination databases. The significant functions and pathways of the extracted E3-specific substrate groups were identified from a functional enrichment analysis with 12 functional category resources for molecular functions, protein families, protein complexes, pathways, cellular processes, cellular localization, and diseases. E3Net includes interactive analysis and navigation tools that make it possible to build an integrative view of E3-substrate networks and their correlated functions with graphical illustrations and summarized descriptions. As a result, E3Net provides a comprehensive resource of E3s, substrates, and their functional implications summarized from the regulatory network structures of E3-specific substrate groups and their correlated functions. This resource will facilitate further in-depth investigation of ubiquitination-dependent regulatory mechanisms. E3Net is freely available online at http://pnet.kaist.ac.kr/e3net. Molecular & Cellular Proteomics 11: 10.1074/mcp.O111.014076, 1-14, 2012.

Probabilistic Filtering for a Biological Knowledge Discovery System with Text Mining and Automatic Inference

Hee-Jin Lee and Jong C. Park
Journal of the Korean Society Of Computer and Information, Vol. 17, No. 2, pp. 139-148, February 2012.
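This paper deals with a system that integrates text mining and inference: it automatically extracts molecular-level event information from the biological literature through text mining, and automatically infers new biological knowledge from the extracted event information. Compared with systems that take as input only information already extracted and registered in databases, a knowledge discovery system with such an integrated architecture has the advantage that it can use the latest information sooner and can use diverse information beyond predefined formats. On the other hand, because it uses the information extraction results of text mining as they are, the usefulness of the whole system may degrade considerably depending on the performance of the text mining module. In this paper, we propose a probability-based filtering method that effectively removes false positives from the text mining results, with the aim of raising the accuracy and usefulness of the overall knowledge discovery system. The proposed probability-based filtering method showed higher performance than the count-based filtering method used as the baseline.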

Identifying Helpful Reviews Based on Customer's Mentions about Experiences

Hye-Jin Min and Jong C. Park
Expert Systems With Applications, doi:10.1016/j.eswa.2012.01.116, January 25, 2012. (SCIE IF 1.924)
As numerous on-line product reviews that vary in quality are published every day, much attention is being paid to quality assessment of such reviews. The current metric of using the number of votes by other customers such as 'helpful vote', despite its dominance, does not yield a fully effective outcome. In this article, we propose a novel metric to rank product reviews by 'mentions about experiences', accounting for customer's personal experiences, as a way of identifying high quality reviews. The proposed metric has two parameters that capture time expressions related to the use of products and product entities over different purchasing time periods by linguistic clues. The empirical results show that this metric is not only as helpful as the best existing metrics, 'helpful vote' or 'reviewer rank', but is also free from undesirable biases that either penalize recency or are driven solely by popularity. Our usability study also shows that ordering reviews by our metric is considered helpful on the accounts of both usefulness and satisfaction.

Sentence Type Identification in Korean: Applications to Korean-Sign Language Translation and Korean Speech Synthesis

Jin-Woo Chung, Ho-Joon Lee, and Jong C. Park
Journal of the HCI Society of Korea, Vol. 5, No. 1, pp. 25-35, 2010.
(selected as best paper)
This paper proposes a method of automatically identifying sentence types in Korean and improving naturalness in sign language generation and speech synthesis using the identified sentence type information. In Korean, sentences are usually categorized into five types: declarative, imperative, propositive, interrogative, and exclamatory. However, it is also known that these types are quite ambiguous to identify in dialogues. In this paper, we present additional morphological and syntactic clues for the sentence type and propose a rule-based procedure for identifying the sentence type using these clues. The experimental results show that our method gives a reasonable performance. We also describe how the sentence type is used to generate non-manual signals in Korean-Korean sign language translation and appropriate intonation in Korean speech synthesis. Since the method of using sentence type information in speech synthesis and sign language generation has not been much studied previously, it is anticipated that our method will contribute to research on generating more natural speech and sign language expressions.

Wrestling with Biomedical Research Results: Language Resources and Literature Analysis

D. Rebholz-Schuhmann, Nigel Collier, Jong C. Park, and Limsoon Wong
Journal of Bioinformatics and Computational Biology (JBCB), Vol. 8, No. 1, pp. 129-130, Imperial College Press, February 2010.

Extracting Melodies from Piano Solo Music Based on Characteristics of Music

Yoonjae Choi and Jong C. Park
Journal of the Korean Institute of Information Scientists and Engineers (KIISE): Computing Practices and Letters, Vol. 15, No. 12, pp. 923-927, 2009.
The recent growth of a digital music market induces increasing demands for music searching and recommendation services. In order to improve the performance of music-based application services, the process of extracting melodies from polyphonic music is essential. In this paper, we propose a method to extract melodies from piano solo music which is highly polyphonic and has a wide pitch range. We categorize piano music into three classes taking into account the characteristics of music, and extract melodies according to each class. The performance evaluation for the implemented system showed that our method works successfully on a variety of piano solo music.

Automated Classification of Sentential Types in Korean with Morphological Analysis

Jin-Woo Chung and Jong C. Park
Language and Information, Vol. 13, No. 2, pp. 59-97, 2009.
The type of a given sentence indicates the speaker's attitude towards the listener and is usually determined by its final endings and punctuation marks. However, some final endings are used in several types of sentences, which means that we cannot identify the sentential type by considering only the final endings and punctuation marks. In this paper, we propose methods of finding some other linguistic clues for identifying the sentential type with a morphological analysis. We also propose to use these methods to implement a system that automatically classifies sentences in Korean according to their sentential types.

E3Miner: a text mining tool for ubiquitin-protein ligases

Hodong Lee, Gwansu Yi, and Jong C. Park
Nucleic Acids Research, Vol. 36, Web Server issue, doi:10.1093/nar/gkn286, published 15 May 2008 (SCI IF 8.026).
Ubiquitination is a regulatory process critically involved in the degradation of >80% of cellular proteins, where such proteins are specifically recognized by a key enzyme, or a ubiquitin-protein ligase (E3). Because of this important role of E3s, a rapidly growing body of the published literature in biology and biomedical fields reports novel findings about various E3s and their molecular mechanisms. However, such findings are neither adequately retrieved by general text-mining tools nor systematically made available by such protein databases as UniProt alone. E3Miner is a web-based text mining tool that extracts and organizes comprehensive knowledge about E3s from the abstracts of journal articles and the relevant databases, supporting users to have a good grasp of E3s and their related information easily from the available text. The tool analyzes text sentences to identify protein names for E3s, to narrow down target substrates and other ubiquitin-transferring proteins in E3-specific ubiquitination pathways and to extract molecular features of E3s during ubiquitination. E3Miner also retrieves E3 data about protein functions, other E3-interacting partners and E3-related human diseases from the protein databases, in order to help facilitate further investigation. E3Miner is freely available through http://e3miner.biopathway.org.

Monitoring the evolutionary aspect of the Gene Ontology to enhance predictability and usability

Jong C. Park, Tak-eun Kim, and Jinah Park
BMC Bioinformatics 2008, 9(Suppl 3):7 doi:10.1186/1471-2105-9-S3-S7, 11 April 2008.
Background: Much effort is currently made to develop the Gene Ontology (GO). Due to the dynamic nature of information it addresses, GO undergoes constant updates whose results are released at regular intervals as separate versions. Although there are a large number of computational tools to aid the development of GO, they are operating on a particular version of GO, making it difficult for GO curators to anticipate the full impact of particular changes along the time axis on a larger scale. We present a method for tapping into such an evolutionary aspect of GO, by making it possible to keep track of important temporal changes to any of the terms and relations of GO and by consequently making it possible to recognize associated trends.
Results: We have developed visualization methods for viewing the changes between two different versions of GO by constructing a colour-coded layered graph. The graph shows both versions of GO with highlights to those GO terms that are added, removed and modified between the two versions. Focusing on a specific GO term or terms of interest over a period, we demonstrate the utility of our system that can be used to make useful hypotheses about the cause of the evolution and to provide new insights into more complex changes.
Conclusions: GO undergoes fast evolutionary changes. A snapshot of GO, as presented by each version of GO alone, overlooks such evolutionary aspects, and consequently limits the utilities of GO. The method that highlights the differences of consecutive versions or two different versions of an evolving ontology with colour-coding enhances the utility of GO for users as well as for developers. To the best of our knowledge, this is the first proposal to visualize the evolutionary aspect of GO.
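
As an illustration only, the comparison of two ontology versions into added, removed and modified terms, as described in the abstract above, can be sketched in Python. The dictionary layout, the function name `diff_versions`, and the term identifiers are assumptions made for this sketch; they are not taken from the paper or from the Gene Ontology itself, and the colour-coded layered graph is not covered.

```python
def diff_versions(old, new):
    """Classify term IDs as added, removed, or modified between two versions.

    `old` and `new` map a term ID to a (name, parents) tuple; this layout is
    an assumption made for the sketch.
    """
    old_ids, new_ids = set(old), set(new)
    added = sorted(new_ids - old_ids)
    removed = sorted(old_ids - new_ids)
    # A term present in both versions counts as modified if its record differs.
    modified = sorted(t for t in old_ids & new_ids if old[t] != new[t])
    return {"added": added, "removed": removed, "modified": modified}


# Toy data with made-up identifiers, not real GO terms.
old_version = {
    "T:1": ("root", frozenset()),
    "T:2": ("binding", frozenset({"T:1"})),
    "T:3": ("transport", frozenset({"T:1"})),
}
new_version = {
    "T:1": ("root", frozenset()),
    "T:2": ("molecular binding", frozenset({"T:1"})),
    "T:4": ("signaling", frozenset({"T:1"})),
}

changes = diff_versions(old_version, new_version)
print(changes)
```

Each of the three lists could then be mapped to a display colour; how that is done in the published system is not stated here.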

Text Mining and Management in Biomedicine

Jong C. Park, Gary Geunbae Lee, and Limsoon Wong
Guest Editors' Introduction to the Special Issue, ACM Transactions on Asian Language Information Processing (TALIP), March, 2006.

BioContrasts: Extracting and Exploiting Protein-Protein Contrastive Relations from Biomedical Literature

Jung-jae Kim, Zhuo Zhang, Jong C. Park, and See-Kiong Ng
Bioinformatics, Vol. 22, No. 5, pp. 597-605, March, 2006.
Motivation: Contrasts are useful conceptual vehicles for learning processes and exploratory research of the unknown. For example, contrastive information between proteins can reveal what similarities, divergences and relations there are of the two proteins, leading to invaluable insights for better understanding about the proteins. Such contrastive information are found to be reported in the biomedical literature. However, there have been no reported attempts in current biomedical text mining work that systematically extract and present such useful contrastive information from the literature for exploitation.

Results: Our BioContrasts system extracts protein-protein contrastive information from MEDLINE abstracts and presents the information to biologists in a web-application for exploitation. Contrastive information are identified in the text abstracts with contrastive negation patterns such as 'A but not B'. A total of 799169 pairs of contrastive expressions were successfully extracted from 2.5 million MEDLINE abstracts. Using grounding of contrastive protein names to Swiss-Prot entries, we were able to produce 41471 pieces of contrasts between Swiss-Prot protein entries. These contrastive pieces of information are then presented via a user-friendly interactive web portal that can be exploited for applications such as the refinement of biological pathways.

Availability: BioContrasts can be accessed at http://biocontrasts.i2r.a-star.edu.sg. It is also mirrored at http://biocontrasts.biopathway.org

Supplementary information: Supplementary materials are available at Bioinformatics online.

Contact: skng@i2r.a-star.edu.sg; park@cs.kaist.ac.kr
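
As a rough illustration of the kind of surface pattern mentioned in the abstract above ('A but not B'), a minimal Python sketch follows. The regular expression, the function name `find_contrasts`, and the protein names are hypothetical; the sketch matches single-token names only and does not attempt the grounding to Swiss-Prot entries described in the abstract.

```python
import re

# Hypothetical pattern: one name-like token on each side of "but not".
BUT_NOT = re.compile(r"\b([A-Za-z0-9-]+) but not ([A-Za-z0-9-]+)\b")


def find_contrasts(sentence):
    """Return (A, B) pairs for every 'A but not B' match in the sentence."""
    return BUT_NOT.findall(sentence)


# Made-up sentence and protein names, for illustration only.
example = "The kinase binds ProtA but not ProtB in this assay."
pairs = find_contrasts(example)
print(pairs)
```

A practical extractor would also need to recognize multi-word names and to check that both sides are in fact protein mentions.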

Visualization for Digesting a High Volume of the Biomedical Literature

Changsu Lee, Jinah Park, and Jong C. Park
Bioinformatics and Biosystems, Vol. 1, No. 1, pp. 51-60, Feb. 2006.
The paradigm in biology is currently changing from that of conducting hypothesis-driven individual experiments to that of utilizing the results of a massive data analysis with appropriate computational tools. We present LayMap, an implemented visualization system that helps the user to deal with a high volume of the biomedical literature such as MEDLINE, through the layered maps that are constructed on the results of an information extraction system. LayMap also utilizes filtering and granularity for an enhanced view of the results. Since a biomedical information extraction system gives rise to a focused and effective way of slicing up the data space, the combined use of LayMap with such an information extraction system can help the user to navigate the data space in a speedy and guided manner. As a case study, we have applied the system to datasets of journal abstracts on 'MAPK pathway' and 'bufalin' from MEDLINE. With the proposed visualization, we have successfully rediscovered pathway maps of a reasonable quality for ERK, p38 and JNK. Furthermore, with respect to bufalin, we were able to identify the potentially interesting relation between the Chinese medicine Chan su and apoptosis with a high level of detail.

Named Entity Recognition

Jong C. Park and Jung-jae Kim
Chapter six of the book 'Text Mining for Biology', editors: Ben Stapley and Sophia Ananiadou, Artech House Books, January, 2006.

Linguistic Characterization of Sign Language Expressions for an Automatic Mapping from Natural Language Sentences

Jiwon Choi, Eunyoung Chang, Hee-Jin Lee, and Jong C. Park
Language and Information, Vol. 10, No. 1, pp. 71-91, 2006.

Automatic extension of Gene Ontology with flexible identification of candidate terms

Jin-Bok Lee, Jung-jae Kim, and Jong C. Park
Bioinformatics, Vol. 22, No. 6, pp. 665-670, 2006.
Motivation: Gene Ontology (GO) has been manually developed to provide a controlled vocabulary for gene product attributes. It continues to evolve with new concepts that are compiled mostly from existing concepts in a compositional way. If we consider the relatively slow growth rate of GO in the face of the fast accumulation of the biological data, it is much desirable to provide an automatic means for predicting new concepts from the existing ones.

Results: We present a novel method that predicts more detailed concepts by utilizing syntactic relations among the existing concepts. We propose a validation measure for the automatically predicted concepts by matching the concepts to biomedical articles. We also suggest how to find a suitable direction for the extension of a constantly growing ontology such as GO.

Availability: http://autogo.biopathway.org

Contact: park@nlp.kaist.ac.kr

Supplementary information: Supplementary materials are available at Bioinformatics online.

Information Visualization with Text Data Mining for Knowledge Discovery Tools in Bioinformatics

Jinah Park, Changsu Lee, and Jong C. Park
Key Engineering Materials, Vols. 277-279, pp. 259-265, 2005. (SCI IF 0.224)
An abundant amount of information is produced in the digital domain, and an effective information extraction (IE) system is required to surf through this sea of information. In this paper, we show that an interactive visualization system works effectively to complement an IE system. In particular, three-dimensional (3D) visualization can turn a data-centric system into a user-centric one by facilitating the human visual system as a powerful pattern recognizer to become a part of the IE cycle. Because information as data is multidimensional in nature, 2D visualization has been the preferred mode. However, we argue that the extra dimension available for us in a 3D mode provides a valuable space where we can pack an orthogonal aspect of the available information. As for candidates of this orthogonal information, we have considered the following two aspects: 1) abstraction of the unstructured source data, and 2) the history line of the discovery process. We have applied our proposal to text data mining in bioinformatics. Through case studies of data mining for molecular interaction in the yeast and mitogen-activated protein kinase pathways, we demonstrate the possibility of interpreting the extracted results with a 3D visualization system.

A Graphic Tool for Curating Molecular Interaction Networks from the Literature

Changsu Lee, Jinah Park, and Jong C. Park
International Journal of Computers in Biology and Medicine, Vol. 35, pp. 555-564, 2005.
We propose a graphic tool for curating molecular interaction networks constructed from the literature by information extraction (IE). In order to turn preliminary results from IE into useful biomedical resources, we propose to use a controlled environment in which visualization and IE work synergistically. The usability of the proposed graphic tool is shown with respect to the identification of incorrectly extracted results that are due to the much troubling coordination phenomena in natural language texts. Through the experiment on molecular interactions in Saccharomyces cerevisiae, we have seen a meaningful increase (from 91.5% to 97.5%) in the number of correctly extracted interaction information.

Extracting contrastive information from negation patterns in biomedical literature

Jung-jae Kim and Jong C. Park
ACM Transactions on Asian Language Information Processing (TALIP), Special Issue on Text Mining and Management in Biomedicine, 2006.
Expressions of negation in the biomedical literature often encode information of contrast as a means for explaining significant differences between the objects that are so contrasted. We show that such information gives additional insights into the nature of the structures and/or biological functions of these objects, leading to valuable knowledge for subcategorization of protein families by the properties that the involved proteins do not have in common. Based on the observation that the expressions of negation employ mostly predictable syntactic structures that can be characterized by subclausal coordination and by clause-level parallelism, we present a system that extracts such contrastive information by identifying those syntactic structures with natural language processing techniques and with additional linguistic resources for semantics. The implemented system shows the performance of 85.7% precision and 61.5% recall, including 7.7% partial recall, or an F score of 76.6. We apply the system to the biological interactions as extracted by our biomedical information extraction system in order to enrich proteome databases with contrastive information.

Introduction to the Thematic Session on Text Mining in Biomedicine

Sophia Ananiadou and Jong C. Park
Lecture Notes in Artificial Intelligence (LNAI), Vol. 3248 (revised selected papers from IJCNLP 2004), editors: K-Y Su, J. Tsujii, J.-H. Lee, O. Y. Kwong, p. 776, 2005.
This thematic session follows a series of workshops and conferences recently dedicated to text mining in biology. This interest is due to the overwhelming amount of biomedical literature (Medline alone contains over 14M abstracts) and the urgent need to discover and organise knowledge extracted from texts. Text mining techniques such as information extraction and named entity recognition have been applied to biomedical texts with varying results. A variety of approaches, such as machine learning, SVMs, and shallow and deep linguistic analyses, have been applied to biomedical texts to extract, manage and organize information. There are over 300 databases containing crucial information on biological data. One of the main challenges is the integration of such heterogeneous information, from factual databases to texts. One of the major knowledge bottlenecks in biomedicine is terminology. In such a dynamic domain, new terms are constantly created. In addition, there is not always a mapping among terms found in databases, controlled vocabularies, and ontologies and the "actual terms" found in texts. Term variation and term ambiguity have been addressed in the past, but more solutions are needed. The confusion over what is a descriptor, a term, or an index term accentuates the problem. Solving the terminological problem is paramount to biotext mining, as relationships linking new genes, drugs, and proteins (i.e. terms) are important for effective information extraction. Mining for relationships between terms and their automatic extraction is important for the semi-automatic updating and populating of ontologies and other resources needed in biomedicine. Text mining applications such as question answering, automatic summarization, and intelligent information retrieval are based on the existence of shared resources, such as annotated corpora (GENIA) and terminological resources. The field needs more concentrated and integrated efforts to build these shared resources.
In addition, evaluation efforts such as BioCreAtIvE and the TREC Genomics track are important for biotext mining techniques and applications. The aim of text mining in biology is to provide solutions to biologists and to aid curators in their task. We hope this thematic session addressed techniques and applications which aid biologists in their research.

Constructing SSML Documents with Automatically Generated Intonation Information in a Combinatory Categorial Grammar Framework

Lee Hwa Jin, Ho-Joon Lee, and Jong C. Park
International Journal of Computer Processing of Oriental Languages (IJCPOL), Vol. 17, No. 4, pp. 223-238, December, 2004.
Text-to-Speech (TTS) systems are now widely used throughout the full spectrum of our activities, and various natural language processing techniques have been utilized to enhance the performance of such TTS systems. As TTS systems begin to play an important role in communication between human and machine, naturalness is considered the most crucial measure of performance for TTS systems, in addition to correctness. General statistical approaches, though widely adopted, are not appropriate for this purpose, as they always assign the same intonation to the same sentence. We analyze various kinds of corpora to extract informative features for intonation generation in a Combinatory Categorial Grammar framework, and express intonation-annotated documents using Speech Synthesis Markup Language (SSML) for target-system-neutral application.

BioIE: Retargetable Information Extraction and Ontological Annotation of Biological Pathways from the Literature

Jung-jae Kim and Jong C. Park
Journal of Bioinformatics and Computational Biology (JBCB), Vol. 2, No. 3, pp. 551-568, 2004. (SCI IF 1.393)
The need for extracting general biological interactions of arbitrary types from the rapidly growing volume of the biomedical literature is drawing increased attention, while this diversity also requires both a robust treatment of complex linguistic phenomena and a method to consistently characterize the results. We present a biomedical information extraction system, BioIE, to address both of these needs by utilizing a full-fledged English grammar formalism, namely combinatory categorial grammar, and by annotating the results with the terms of Gene Ontology, which provides a common and controlled vocabulary. BioIE deals with complex linguistic phenomena such as coordination, relative structures, acronyms, appositive structures, and anaphoric expressions. In order to deal with real-world syntactic variations of ontological terms, BioIE also utilizes the syntactic dependencies between words in sentences, based on the observation that the component words in an ontological term usually appear in a sentence with known patterns of syntactic dependencies.

Research Trends in Bio Text Mining

Jung-jae Kim and Jong C. Park
Korea Information Science Society SIGBIT News Letter, Vol. 2, No. 1, pp. 14-31, 2004.

An Analysis of Syntactic and Semantic Relations between Negative Polarity Items and Negatives in Korean

Jung-jae Kim and Jong C. Park
Journal of Language and Information, Vol. 8, No. 1, pp. 53-76, 2004.
Negative polarity items (NPIs), which function as quantifiers, are licensed in a syntactically strict way by negatives, which function as qualifiers, resulting in universal negating interpretations as pairs. We present a proposal to explain the related phenomena, in which the syntax and the semantics are closely related to each other, with Combinatory Categorial Grammar. For this purpose, we first adopt the usual approach to scrambling, but control its overgeneration with the use of markers, taking into account the complex syntactic phenomena involving NPIs and scrambling in Korean. We also propose to utilize polarity intensity as a novel feature, in order to account for the universal negating interpretations when NPIs are combined with negatives. Our proposal also explains the difference in readings when other quantifiers or qualifiers intervene between the NPIs and the related negatives. (Korea Advanced Institute of Science and Technology)

Annotation of Gene Products in the Literature with Gene Ontology Terms using Syntactic Dependencies

Jung-jae Kim and Jong C. Park
Lecture Notes in Artificial Intelligence, Post-Conference Book of IJCNLP-04, 2004.
We present a method for automatically annotating gene products in the literature with the terms of Gene Ontology (GO), which provides a dynamic but controlled vocabulary. Although GO is well-organized with such lexical relations as synonymy, 'is-a', and 'part-of' relations among its terms, GO terms show quite a high degree of morphological and syntactic variations in the literature. As opposed to the previous approaches that considered only restricted kinds of term variations, our method uncovers the syntactic dependencies between gene product names and ontological terms as well in order to deal with real-world syntactic variations, based on the observation that the component words in an ontological term usually appear in a sentence with established patterns of syntactic dependencies.

Analysis and Computational Processing of Coordination Ambiguity in Korean

Hodong Lee and Jong C. Park
Journal of Language and Information, Vol. 7, No. 2, pp. 59-79, 2003.

Analysis and Processing of Korean with Quantifier Floating

Jin-Bok Lee and Jong C. Park
Journal of Language and Information, Vol. 7, No. 1, pp. 1-22, 2003.

Accomplishments and Challenges in Literature Data Mining for Biology

Lynette Hirschman, Jong C. Park, Junichi Tsujii, Limsoon Wong, and Cathy Wu
Bioinformatics, Vol. 18, No. 12, pp. 1553-1561, 2002.
We review recent results in literature data mining for biology and discuss the need and the steps for a challenge evaluation for this field. Literature data mining has progressed from simple recognition of terms to extraction of interaction relationships from complex sentences, and has broadened from recognition of protein interactions to a range of problems such as improving homology search, identifying cellular location, and so on. To encourage participation and accelerate progress in this expanding field, we propose creating challenge evaluations, and we describe two specific applications in this context.

Interpretation of Natural Language Queries for Relational Database Access with Combinatory Categorial Grammar

Hodong Lee and Jong C. Park
International Journal of Computer Processing of Oriental Languages (IJCPOL), Vol. 15, No. 3, pp. 281-304, 2002.
In this paper, we describe a proposal to derive formal language queries from natural language queries with a combinatory categorial grammar (CCG). CCGs are well known to provide a means of deriving all the levels of information for natural language, i.e., syntax, semantics and discourse, at the same time. In our proposal, we utilize an extra level of representation for formal language queries for the aforementioned derivation. The syntactic coverage is shown with various natural language queries, including compound nouns, modification markers, various types of ellipses, numerical expressions, and subordinate and coordinate constructions. The general purpose CCG lexicon is semi-automatically augmented with the database fields and entries. We also discuss the performance of an implemented natural language query processing system.

Using Combinatory Categorial Grammar to Extract Biomedical Information

Jong C. Park
IEEE Intelligent Systems, Special Issue on Intelligent Systems in Biology, Vol. 16, No. 6, pp. 62-67, November-December, 2001.
Extracting information from biology databases manually can be an overwhelming task. GenBank, the US National Institutes of Health database containing all publicly available DNA sequences, has more than 14 billion bases in 13 million genetic-sequence records. Medline, a literature database available through PubMed, has over 11 million journal citations. In a May 2001 search request for "cytokine" (regulatory proteins in the immune system), PubMed returned 296,556 articles. Given the quantity and complexity of biomedical literature, demands for computational tools to extract specific information are increasing. In this article, I review biomedical information extraction methods and present research done by KAIST's natural language processing group on a system that shows encouraging performance using combinatory categorial grammar (explained in detail below) as a natural language grammar formalism.

Bioinformatics and Natural Language Processing

Jong C. Park
Special Issue in Korean Information Processing, Communications of the Korea Information Science Society (KISS), Vol. 19, No. 10, pp. 46-51, October, 2001.
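Bioinformatics is a collective term for the research field that, as the amount of information handled in biology grows rapidly, applies information processing techniques used in computer science, mathematics, statistics, and related fields in order to produce, manage, and utilize that information efficiently. Natural language processing (NLP) is a collective term for the research field concerned with using computers to process sentences and documents expressed in natural languages such as Korean or English; it includes research aimed at supporting human-computer interaction (HCI) as well as research aimed at efficiently managing and utilizing vast amounts of natural language information. In this article, we introduce research in the field of biomedical information extraction and thereby discuss how these two distinct research fields come to be related. Reflecting the recent rise in general interest in bioinformatics, the August 2000 issue of the Communications of the Korea Information Science Society [1] carried a special feature on bioinformatics; the present article is expected to complement that feature, one year later, with the application area of natural language processing.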

Combinatory Categorial Grammar for the Syntactic, Semantic, and Discourse Analyses of Coordinate Constructions in Korean

Hyung-joon Cho and Jong C. Park
Journal of the Korea Information Science Society (KISS), Vol. 27, No. 4, pp. 448-462, 2000.
Coordinate constructions in natural language pose a number of difficulties to natural language processing units, due to the increased complexity of syntactic analysis, the syntactic ambiguity of the involved lexical items, and the apparent deletion of predicates in various places. In this paper, we address the syntactic characteristics of the coordinate constructions in Korean from the viewpoint of constructing a competence grammar, and present a version of combinatory categorial grammar for the analysis of coordinate constructions in Korean. We also show how to utilize a unified lexicon in the proposed grammar formalism in deriving the sentential semantics and associated information structures as well, in order to capture the discourse functions of coordinate constructions in Korean. The presented analysis conforms to the common wisdom that coordinate constructions are utilized in language not simply to reduce multiple sentences to a single sentence, but also to convey the information of contrast. Finally, we provide an analysis of sample corpora for the frequency of coordinate constructions in Korean and discuss some problematic cases.