Publications: Journal Papers and Book Chapters

Neural Theorem Prover with Word Embedding for Efficient Automatic Annotation

Wonsuk Yang, Hancheol Park, and Jong C. Park
Journal of KIISE, Vol. 44, No. 4, pp. 399-410, April, 2017.

Addressing low-resource problems in statistical machine translation of manual signals in sign language

Hancheol Park, Jung-Ho Kim, and Jong C. Park
Journal of KIISE, Vol. 44, No. 2, pp. 163-170, February, 2017.

Enhanced sign language transcription system via hand tracking and pose estimation

Jung-Ho Kim, Najoung Kim, Hancheol Park, and Jong C. Park
Journal of Computing Science and Engineering, Vol. 10, No. 3, pp. 95-101, September, 2016.
In this study, we propose a new system for constructing parallel corpora for sign languages, which are generally underresourced in comparison to spoken languages. In order to achieve scalability and accessibility regarding data collection and corpus construction, our system utilizes deep learning-based techniques and predicts depth information to perform pose estimation on hand information obtainable from video recordings by a single RGB camera. These estimated poses are then transcribed into expressions in SignWriting. We evaluate the accuracy of hand tracking and hand pose estimation modules of our system quantitatively, using the American Sign Language Image Dataset and the American Sign Language Lexicon Video Dataset. The evaluation results show that our transcription system has a high potential to be successfully employed in constructing a sizable sign language corpus using various types of video resources.

Making adjustments to event annotations for improved biological event extraction

Seung-Cheol Baek and Jong C. Park
Journal of Biomedical Semantics, 7:55, doi: 10.1186/s13326-016-0094-9, 16 September 2016. (SCIE IF 1.62)
Current state-of-the-art approaches to biological event extraction train statistical models in a supervised manner on corpora annotated with event triggers and event-argument relations. Inspecting such corpora, we observe that there is ambiguity in the span of event triggers (e.g., "transcriptional activity" vs. "transcriptional"), leading to inconsistencies across event trigger annotations. Such inconsistencies make it quite likely that similar phrases are annotated with different spans of event triggers, suggesting the possibility that a statistical learning algorithm misses an opportunity for generalizing from such event triggers.

We anticipate that adjustments to the span of event triggers to reduce these inconsistencies would meaningfully improve the present performance of event extraction systems. In this study, we look into this possibility with the corpora provided by the 2009 BioNLP shared task as a proof of concept. We propose an Informed Expectation-Maximization (EM) algorithm, which trains models using the EM algorithm with a posterior regularization technique that consults the gold-standard event trigger annotations in the form of constraints. We further propose four constraints on the possible event trigger annotations to be explored by the EM algorithm.

The algorithm is shown to outperform the state-of-the-art algorithm on the development corpus in a statistically significant manner and on the test corpus by a narrow margin.

The analysis of the annotations generated by the algorithm shows that there are various types of ambiguity in event annotations, even though they could be small in number.

Corpus Annotation for the Linguistic Analysis of Reference Relations between Event and Spatial Expressions in Text

Jin-Woo Chung, Hee-Jin Lee, and Jong C. Park
Language and Information, Vol. 18, No. 2. pp. 141-168, 2014.
Recognizing spatial information associated with events expressed in natural language text is essential not only for the interpretation of such events but also for the understanding of the relations among them. However, spatial information is rarely mentioned as compared to events, and the association between event and spatial expressions is also highly implicit in a text. This makes it difficult to automate the extraction of spatial information associated with events from the text. In this paper, we give a linguistic analysis of how spatial expressions are associated with event expressions in a text. We first present issues in annotating narrative texts with reference relations between event and spatial expressions, and then discuss surface-level linguistic characteristics of such relations based on the annotated corpus to give a helpful insight into developing an automated recognition method.

OncoSearch: Cancer Gene Search Engine with Literature Evidence

Hee-Jin Lee, Tien Cuong Dang, Hyunju Lee, and Jong C. Park
Nucleic Acids Research, (1 July 2014) 42 (W1):W416-W421. (SCI IF 8.278)
In order to identify genes that are involved in oncogenesis and to understand how such genes affect cancers, abnormal gene expressions in cancers are actively studied. For efficient access to the results of such studies that are reported in biomedical literature, the relevant information is accumulated via text-mining tools and made available through the Web. However, current Web tools are not yet tailored enough to allow queries that specify how a cancer changes along with the change in gene expression level, which is an important piece of information for understanding an involved gene's role in cancer progression or regression. OncoSearch is a Web-based engine that searches Medline abstracts for sentences that mention gene expression changes in cancers, with queries that specify (i) whether a gene expression level is up-regulated or down-regulated, (ii) whether a certain type of cancer progresses or regresses along with such gene expression change and (iii) the expected role of the gene in the cancer. OncoSearch is available through

Identification of Speakers in Fairytales with Linguistic Clues

Hye-Jin Min, Jin-Woo Chung, and Jong C. Park
Language and Information, Vol. 17, No. 2. pp. 93-121, 2013.
Identifying the speakers of individual utterances mentioned in textual stories is an important step towards developing applications that involve the use of unique characteristics of speakers in stories, such as robot storytelling and story-to-scene generation. Despite its usefulness, it is a challenging task because not only human entities but also animals and even inanimate objects can become speakers, especially in fairytales, so that the number of candidates is much larger than that in other types of text. In addition, since the action of speaking is not always mentioned explicitly, it is necessary to infer the speaker from the implicitly mentioned speaking behaviors such as appearances or emotional expressions. In this paper, we investigate a method to exploit linguistic clues to identify the speakers of utterances from textual fairytale stories in Korean, especially in order to handle such challenging issues. Compared with the previous work, the present work takes into account additional linguistic features such as vocative roles and pairs of conversation participants, and proposes the use of discourse-level turn-taking behaviors between speakers to further reduce the number of possible candidate speakers. We describe a simple rule-based method to choose a speaker from candidates based on such linguistic features and turn-taking behaviors.

Augmenting Biological Text Mining with Symbolic Inference

Jong C. Park and Hee-Jin Lee
'Biological Knowledge Discovery Handbook', editors: Mourad Elloumi and Albert Y. Zomaya, Wiley, December 27, 2013.
In this chapter, the authors review recent work on such "next-level" text-mining tools. In particular, they focus on the work that uses symbolic inference to augment text-mining, apart from distributional analysis that is based on the co-occurrence of biological terms and statistical methods. By symbolic inference, they refer to the methods of deriving new information from known facts that are represented with nonnumeric symbols to which inference rules are applied deterministically rather than probabilistically. The research reviewed in this chapter targets one of two abstract tasks. The first task is to recognize information not explicitly stated but implied in a document, where the targeted information is often scattered across multiple sentences. The second is to propose newly predicted biological knowledge using information gathered from the literature. They briefly review text-mining work with distributional analysis to contrast the use of symbolic inference with the use of distributional analysis.

CoMAGC: a Corpus with Multi-faceted Annotations of Gene-Cancer Relations

Hee-Jin Lee, Sang-Hyung Shim, Mi-Ryoung Song, Hyunju Lee, and Jong C. Park
BMC Bioinformatics, 14:323, doi:10.1186/1471-2105-14-323, 14 November 2013. (SCI IF 3.02)
In order to access the large amount of information in biomedical literature about genes implicated in various cancers both efficiently and accurately, the aid of text mining (TM) systems is invaluable. Current TM systems do target either gene-cancer relations or biological processes involving genes and cancers, but the former type produces information not comprehensive enough to explain how a gene affects a cancer, and the latter does not provide a concise summary of gene-cancer relations.

In this paper, we present a corpus for the development of TM systems that are specifically targeting gene-cancer relations but are still able to capture complex information in biomedical sentences. We describe CoMAGC, a corpus with multi-faceted annotations of gene-cancer relations. In CoMAGC, a piece of annotation is composed of four semantically orthogonal concepts that together express 1) how a gene changes, 2) how a cancer changes and 3) the causality between the gene and the cancer. The multi-faceted annotations are shown to have high inter-annotator agreement. In addition, we show that the annotations in CoMAGC allow us to infer the prospective roles of genes in cancers and to classify the genes into three classes according to the inferred roles. We encode the mapping between multi-faceted annotations and gene classes into 10 inference rules. The inference rules produce results with high accuracy as measured against human annotations. CoMAGC consists of 821 sentences on prostate, breast and ovarian cancers. Currently, we deal with changes in gene expression levels among other types of gene changes. The corpus is available at under the terms of the Creative Commons Attribution License (

The corpus will be an important resource for the development of advanced TM systems on gene-cancer relations.

DigSee: Disease Gene Search Engine with Evidence Sentences (version cancer)

Jeongkyun Kim, Seongeun So, Hee-Jin Lee, Jong C. Park, Jung-jae Kim, and Hyunju Lee
Nucleic Acids Research, Vol. 41, No. W1, pp. 501-517, 12 June 2013 (SCI IF 8.026).
Biological events such as gene expression, regulation, phosphorylation, localization and protein catabolism play important roles in the development of diseases. Understanding the association between diseases and genes can be enhanced with the identification of involved biological events in this association. Although biological knowledge has been accumulated in several databases and can be accessed through the Web, there is no specialized Web tool yet allowing for a query into the relationship among diseases, genes and biological events. For this task, we developed DigSee to search MEDLINE abstracts for evidence sentences describing that 'genes' are involved in the development of 'cancer' through 'biological events'. DigSee is available through

Quality Analysis of User-generated Content on the Web

Jong C. Park and Hye-Jin Min
'Knowledge Service Engineering Handbook', editors: Jussi Kantola and Waldemar Karwowski, CRC Press, Taylor & Francis Group, pp. 197-220, May, 2012.

E3Net: A System for Exploring E3-mediated Regulatory Networks of Cellular Functions

Youngwoong Han, Hodong Lee, Jong C. Park, and Gwan-Su Yi
Molecular and Cellular Proteomics, March Issue, 2012, doi:10.1074/mcp.O111.014076, December 22, 2011. (SCI IF 8.35)
Ubiquitin-protein ligase (E3) is a key enzyme targeting specific substrates in diverse cellular processes for ubiquitination and degradation. The existing findings of substrate specificity of E3 are, however, scattered over a number of resources, making it difficult to study them together with an integrative view. Here we present E3Net, a web-based system that provides a comprehensive collection of available E3-substrate specificities and a systematic framework for the analysis of E3-mediated regulatory networks of diverse cellular functions. Currently, E3Net contains 2201 E3s and 4896 substrates in 427 organisms and 1671 E3-substrate specific relations between 493 E3s and 1277 substrates in 42 organisms, extracted mainly from MEDLINE abstracts and UniProt comments with an automatic text mining method and additional manual inspection and partly from high throughput experiment data and public ubiquitination databases. The significant functions and pathways of the extracted E3-specific substrate groups were identified from a functional enrichment analysis with 12 functional category resources for molecular functions, protein families, protein complexes, pathways, cellular processes, cellular localization, and diseases. E3Net includes interactive analysis and navigation tools that make it possible to build an integrative view of E3-substrate networks and their correlated functions with graphical illustrations and summarized descriptions. As a result, E3Net provides a comprehensive resource of E3s, substrates, and their functional implications summarized from the regulatory network structures of E3-specific substrate groups and their correlated functions. This resource will facilitate further in-depth investigation of ubiquitination-dependent regulatory mechanisms. E3Net is freely available online at
Molecular & Cellular Proteomics 11: 10.1074/mcp.O111.014076, 1-14, 2012.

Probabilistic Filtering for a Biological Knowledge Discovery System with Text Mining and Automatic Inference

Hee-Jin Lee and Jong C. Park
Journal of the Korean Society Of Computer and Information, Vol. 17, No. 2, pp. 139-148, February 2012.

Identifying Helpful Reviews Based on Customer's Mentions about Experiences

Hye-Jin Min and Jong C. Park
Expert Systems With Applications, doi:10.1016/j.eswa.2012.01.116, January 25, 2012. (SCIE IF 1.924)
As numerous on-line product reviews that vary in quality are published every day, much attention is being paid to quality assessment of such reviews. The current metric of using the number of votes by other customers such as 'helpful vote', despite its dominance, does not yield a fully effective outcome. In this article, we propose a novel metric to rank product reviews by 'mentions about experiences', accounting for customer's personal experiences, as a way of identifying high quality reviews. The proposed metric has two parameters that capture time expressions related to the use of products and product entities over different purchasing time periods by linguistic clues. The empirical results show that this metric is not only as helpful as the best existing metrics, 'helpful vote' or 'reviewer rank', but is also free from undesirable biases that either penalize recency or are driven solely by popularity. Our usability study also shows that ordering reviews by our metric is considered helpful on the accounts of both usefulness and satisfaction.

Sentence Type Identification in Korean: Applications to Korean-Sign Language Translation and Korean Speech Synthesis

Jin-Woo Chung, Ho-Joon Lee, and Jong C. Park
Journal of the HCI Society of Korea, Vol. 5, No. 1, pp. 25-35, 2010.
(selected as best paper)

Wrestling with Biomedical Research Results: Language Resources and Literature Analysis

D. Rebholz-Schuhmann, Nigel Collier, Jong C. Park, and Limsoon Wong
Journal of Bioinformatics and Computational Biology (JBCB), Vol. 8, No. 1, pp. 129-130, Imperial College Press, February 2010.

Extracting Melodies from Piano Solo Music Based on Characteristics of Music

Yoonjae Choi and Jong C. Park
Journal of the Korean Institute of Information Scientists and Engineers (KIISE): Computing Practices and Letters, Vol. 15, No. 12, pp. 923-927, 2009.

Automated Classification of Sentential Types in Korean with Morphological Analysis

Jin-Woo Chung and Jong C. Park
Language and Information, Vol. 13, No. 2, pp. 59-97, 2009.

E3Miner: a text mining tool for ubiquitin-protein ligases

Hodong Lee, Gwansu Yi, and Jong C. Park
Nucleic Acids Research, Vol. 36, Web Server issue, doi:10.1093/nar/gkn286, 15 May 2008 (SCI IF 8.026).

Monitoring the evolutionary aspect of the Gene Ontology to enhance predictability and usability

Jong C. Park, Tak-eun Kim, and Jinah Park
BMC Bioinformatics 2008, 9(Suppl 3):7 doi:10.1186/1471-2105-9-S3-S7, 11 April 2008.
Results: We have developed visualization methods for viewing the changes between two different versions of GO by constructing a colour-coded layered graph. The graph shows both versions of GO with highlights to those GO terms that are added, removed and modified between the two versions. Focusing on a specific GO term or terms of interest over a period, we demonstrate the utility of our system that can be used to make useful hypotheses about the cause of the evolution and to provide new insights into more complex changes.
Conclusions: GO undergoes fast evolutionary changes. A snapshot of GO, as presented by each version of GO alone, overlooks such evolutionary aspects, and consequently limits the utilities of GO. The method that highlights the differences of consecutive versions or two different versions of an evolving ontology with colour-coding enhances the utility of GO for users as well as for developers. To the best of our knowledge, this is the first proposal to visualize the evolutionary aspect of GO.
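
The comparison of two ontology versions described in the abstract above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' system: the term identifiers, the `(name, parents)` representation, the function names, and the colour choices are all hypothetical.

```python
# Hypothetical representation: term id -> (term name, set of parent term ids).
OLD = {
    "T:1": ("root process", frozenset()),
    "T:2": ("binding", frozenset({"T:1"})),
    "T:3": ("obsolete term", frozenset({"T:1"})),
}
NEW = {
    "T:1": ("root process", frozenset()),
    "T:2": ("molecular binding", frozenset({"T:1"})),  # renamed
    "T:4": ("new term", frozenset({"T:2"})),           # newly added
}

# Assumed colour scheme; the abstract only says the graph is colour-coded.
COLOURS = {"added": "green", "removed": "red", "modified": "orange"}


def diff_versions(old, new):
    """Classify terms as added, removed, or modified between two versions."""
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    # A shared term counts as modified if its name or parents changed.
    modified = sorted(t for t in old.keys() & new.keys() if old[t] != new[t])
    return {"added": added, "removed": removed, "modified": modified}


def colour_map(diff):
    """Map each changed term id to the colour of its change type."""
    return {term: COLOURS[kind] for kind, terms in diff.items() for term in terms}
```

Unchanged terms are left out of the colour map, so a renderer could draw them in a neutral colour and highlight only what differs between the two versions.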

Text Mining and Management in Biomedicine

Jong C. Park, Gary Geunbae Lee, and Limsoon Wong
Guest Editors' Introduction to the Special Issue, ACM Transactions on Asian Language Information Processing (TALIP), March, 2006.

BioContrasts: Extracting and Exploiting Protein-Protein Contrastive Relations from Biomedical Literature

Jung-jae Kim, Zhuo Zhang, Jong C. Park, and See-Kiong Ng
Bioinformatics, Vol. 22, No. 5, pp. 597-605, March, 2006.
Results: Our BioContrasts system extracts protein-protein contrastive information from MEDLINE abstracts and presents the information to biologists in a web-application for exploitation. Contrastive information is identified in the text abstracts with contrastive negation patterns such as "A but not B". A total of 799169 pairs of contrastive expressions were successfully extracted from 2.5 million MEDLINE abstracts. Using grounding of contrastive protein names to Swiss-Prot entries, we were able to produce 41471 pieces of contrasts between Swiss-Prot protein entries. These contrastive pieces of information are then presented via a user-friendly interactive web portal that can be exploited for applications such as the refinement of biological pathways.

Availability: BioContrasts can be accessed at It is also mirrored at

Supplementary information: Supplementary materials are available at Bioinformatics online.
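
As a rough illustration of the "A but not B" contrastive negation pattern mentioned in the BioContrasts abstract above, the sketch below matches that surface pattern with a regular expression. It is an assumption-laden toy, not the BioContrasts method: it handles only single-token names, performs no protein name recognition or grounding to Swiss-Prot, and the pattern and function name are hypothetical.

```python
import re

# Deliberately simple: two single-token names joined by the literal "but not".
CONTRAST = re.compile(r"\b([A-Za-z0-9-]+) but not ([A-Za-z0-9-]+)\b")


def extract_contrasts(text):
    """Return (A, B) pairs for every 'A but not B' occurrence in the text."""
    return [(m.group(1), m.group(2)) for m in CONTRAST.finditer(text)]
```

Calling `extract_contrasts` on a sentence with made-up names such as "ProtA but not ProtB binds the receptor." would yield the pair of the two names. A realistic system would additionally need to decide whether each side is actually a protein name.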

Visualization for Digesting a High Volume of the Biomedical Literature

Changsu Lee, Jinah Park, and Jong C. Park
Bioinformatics and Biosystems, Vol. 1, No. 1, pp. 51-60, Feb. 2006.

Named Entity Recognition

Jong C. Park and Jung-jae Kim
Chapter six of the book 'Text Mining for Biology', editors: Ben Stapley and Sophia Ananiadou, Artech House Books, January, 2006.

Linguistic Characterization of Sign Language Expressions for an Automatic Mapping from Natural Language Sentences

Jiwon Choi, Eunyoung Chang, Hee-Jin Lee, and Jong C. Park
Language and Information, Vol. 10, No. 1, pp. 71-91, 2006.

Automatic extension of Gene Ontology with flexible identification of candidate terms

Jin-Bok Lee, Jung-jae Kim, and Jong C. Park
Bioinformatics, Vol. 22 No. 6, pp. 665-670, 2006.
Results: We present a novel method that predicts more detailed concepts by utilizing syntactic relations among the existing concepts. We propose a validation measure for the automatically predicted concepts by matching the concepts to biomedical articles. We also suggest how to find a suitable direction for the extension of a constantly growing ontology such as GO.


Supplementary information: Supplementary materials are available at Bioinformatics online.

Information Visualization with Text Data Mining for Knowledge Discovery Tools in Bioinformatics

Jinah Park, Changsu Lee, and Jong C. Park
Key Engineering Materials, Vols. 277-279, pp. 259-265, 2005. (SCI IF 0.224)

A Graphic Tool for Curating Molecular Interaction Networks from the Literature

Changsu Lee, Jinah Park, and Jong C. Park
International Journal of Computers in Biology and Medicine, Vol. 35, pp. 555-564, 2005.

Extracting contrastive information from negation patterns in biomedical literature

Jung-jae Kim and Jong C. Park
ACM Transactions on Asian Language Information Processing (TALIP), Special Issue on Text Mining and Management in Biomedicine, 2006.

Introduction to the Thematic Session on Text Mining in Biomedicine

Sophia Ananiadou and Jong C. Park
Lecture Notes in Artificial Intelligence (LNAI), Vol. 3248 (revised selected papers from IJCNLP 2004), editors: K-Y Su, J. Tsujii, J.-H. Lee, O. Y. Kwong, p. 776, 2005.

Constructing SSML Documents with Automatically Generated Intonation Information in a Combinatory Categorial Grammar Framework

Lee Hwa Jin, Ho-Joon Lee, and Jong C. Park
International Journal of Computer Processing of Oriental Languages (IJCPOL), Vol. 17, No. 4, pp. 223-238, December, 2004.

BioIE: Retargetable Information Extraction and Ontological Annotation of Biological Pathways from the Literature

Jung-jae Kim and Jong C. Park
Journal of Bioinformatics and Computational Biology (JBCB), Vol. 2, No. 3, pp. 551-568, 2004. (SCI IF 1.393)

Research Trends in Bio Text Mining

Jung-jae Kim and Jong C. Park
Korea Information Science Society SIGBIT News Letter, Vol. 2, No. 1, pp. 14-31, 2004.

An Analysis of Syntactic and Semantic Relations between Negative Polarity Items and Negatives in Korean

Jung-jae Kim and Jong C. Park
Journal of Language and Information, Vol. 8, No. 1, pp. 53-76, 2004.

Annotation of Gene Products in the Literature with Gene Ontology Terms using Syntactic Dependencies

Jung-jae Kim and Jong C. Park
Lecture Notes on Artificial Intelligence, Post-Conference Book of IJCNLP-04, 2004.

Analysis and Computational Processing of Coordination Ambiguity in Korean

Hodong Lee and Jong C. Park
Journal of Language and Information, Vol. 7, No. 2, pp. 59-79, 2003.

Analysis and Processing of Korean with Quantifier Floating

Jin-Bok Lee and Jong C. Park
Journal of Language and Information, Vol. 7, No. 1, pp. 1-22, 2003.

Accomplishments and Challenges in Literature Data Mining for Biology

Lynette Hirschman, Jong C. Park, Junichi Tsujii, Limsoon Wong, and Cathy Wu
Bioinformatics, Vol. 18, No. 12, pp. 1553-1561, 2002.

Interpretation of Natural Language Queries for Relational Database Access with Combinatory Categorial Grammar

Hodong Lee and Jong C. Park
International Journal of Computer Processing of Oriental Languages (IJCPOL), Vol. 15, No. 3, pp. 281-304, 2002.

Using Combinatory Categorial Grammar to Extract Biomedical Information

Jong C. Park
IEEE Intelligent Systems, Special Issue on Intelligent Systems in Biology, Vol. 16, No. 6, pp. 62-67, November-December, 2001.

Bioinformatics and Natural Language Processing

Jong C. Park
Special Issue in Korean Information Processing, Communications of the Korea Information Science Society (KISS), Vol. 19, No. 10, pp. 46-51, October, 2001.

Combinatory Categorial Grammar for the Syntactic, Semantic, and Discourse Analyses of Coordinate Constructions in Korean

Hyung-joon Cho and Jong C. Park
Journal of the Korea Information Science Society (KISS), Vol. 27, No. 4, pp. 448-462, 2000.
