Recent Publications

The latest 10 papers published or under review

A Large-scale Comprehensive Abusiveness Detection Dataset with Multifaceted Labels from Reddit

Hoyun Song, Soo Hyun Ryu, Huije Lee, and Jong C. Park
Proceedings of the 25th Conference on Computational Natural Language Learning (CoNLL 2021), November 10-11, 2021 (Accepted)
Because users in online communities suffer from the severe side effects of abusive language, many researchers have attempted to detect abusive texts on social media, presenting several datasets for such detection. However, none of these datasets contains both comprehensive labels and contextual information, which are essential for thoroughly detecting all kinds of abusiveness in text, since datasets with such fine-grained features demand a significant amount of annotation, greatly increasing complexity. In this paper, we propose the Comprehensive Abusiveness Detection Dataset (CADD), collected from English Reddit posts, with multifaceted labels and contexts. Our dataset is annotated hierarchically for efficient annotation through large-scale crowdsourcing. We also empirically explore the characteristics of our dataset and provide a detailed analysis for novel insights. The results of our experiments with strong pre-trained natural language understanding models show that our dataset gives rise to meaningful performance, assuring its practicality for abusive language detection.
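A minimal sketch of what a hierarchically annotated example might look like; the field names below are hypothetical illustrations of "multifaceted labels with context", not the released CADD schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CADDExample:
    """Hypothetical record illustrating hierarchical, multifaceted labels.

    Level-2 fields are annotated only when the level-1 decision marks the
    text as abusive, which is what keeps large-scale crowdsourcing efficient.
    """
    text: str                         # the Reddit comment itself
    context: str                      # surrounding post/thread, kept for contextual detection
    is_abusive: bool                  # level 1: coarse binary decision
    abuse_type: Optional[str] = None  # level 2: finer-grained category (illustrative)
    target: Optional[str] = None      # level 2: who or what is targeted (illustrative)
```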

BERT-based Personality Disorder Detection Model with Abusive Language Marking from Social Media

Jisu Shin, Hoyun Song, and Jong C. Park
Proceedings of the Korea Computer Congress (KCC 2021), June 22-25, 2021

The Relationship between the Quality of Automatically Generated Questions and the Quantity of the Context given for the Generation

Sukmin Cho, Wonsuk Yang, ChaeHun Park, and Jong C. Park
Proceedings of the Korea Computer Congress (KCC 2021), June 22-25, 2021

Unsupervised Document Expansion for Information Retrieval with Stochastic Text Generation

Soyeong Jeong, Jinheon Baek, ChaeHun Park, and Jong C. Park
Second Workshop on Scholarly Document Processing (SDP 2021), June 6-11, 2021
One of the challenges in information retrieval (IR) is the vocabulary mismatch problem, which happens when the terms in queries and documents are lexically different but semantically similar. While recent work has proposed to expand queries or documents by enriching their representations with additional relevant terms to address this challenge, such methods usually require a large volume of query-document pairs to train an expansion model. In this paper, we propose an Unsupervised Document Expansion with Generation (UDEG) framework with a pre-trained language model, which generates diverse supplementary sentences for the original document without using labels on query-document pairs for training. When generating sentences, we further stochastically perturb their embeddings to produce more diverse sentences for document expansion. We validate our framework on two standard IR benchmark datasets. The results show that our framework significantly outperforms relevant expansion baselines for IR.
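A rough sketch of the core idea with Hugging Face GPT-2: embed the document, add Gaussian noise to the input embeddings, and sample continuation sentences to append as expansions. The hyperparameters (noise scale, expansion count, generation length) are illustrative guesses, not the paper's settings:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def expand_document(doc: str, n_expansions: int = 3,
                    noise_std: float = 0.1, max_new_tokens: int = 30):
    ids = tok(doc, return_tensors="pt").input_ids
    base = model.get_input_embeddings()(ids)          # (1, seq_len, hidden)
    expansions = []
    for _ in range(n_expansions):
        # Stochastic perturbation: each pass conditions on a differently noised document.
        inputs = base + noise_std * torch.randn_like(base)
        past, generated = None, []
        for _ in range(max_new_tokens):
            out = model(inputs_embeds=inputs, past_key_values=past, use_cache=True)
            past = out.past_key_values
            probs = torch.softmax(out.logits[:, -1, :], dim=-1)
            next_id = torch.multinomial(probs, num_samples=1)  # sampling adds diversity
            generated.append(next_id.item())
            inputs = model.get_input_embeddings()(next_id)
        expansions.append(tok.decode(generated, skip_special_tokens=True))
    return expansions

# The expanded document is the original text plus the generated sentences.
doc = "some original document text"
expanded = " ".join([doc] + expand_document(doc))
```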

Generating Negative Samples by Manipulating Golden Responses for Unsupervised Learning of a Response Evaluation Model

ChaeHun Park, Eugene Jang, Wonsuk Yang, and Jong C. Park
2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2021), June 6-11, 2021
Evaluating the quality of responses generated by open-domain conversation systems is a challenging task. This is partly because there can be multiple appropriate responses to a given dialogue history. Reference-based metrics that rely on comparisons to a set of known correct responses often fail to account for this variety, and consequently correlate poorly with human judgment. To address this problem, researchers have investigated the possibility of assessing response quality without using a set of known correct responses. Tao et al. (2018) demonstrated that an automatic response evaluation model could be built using unsupervised learning for the next-utterance prediction (NUP) task. For unsupervised learning of such a model, we propose a method of manipulating a golden response to create a new negative response that is designed to be inappropriate within the context while maintaining high similarity with the original golden response. We find, from our experiments on English datasets, that using the negative samples generated by our method alongside random negative samples can increase the model's correlation with human evaluations. The process of generating such negative samples is automated and does not rely on human annotation.
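The paper's exact manipulations are not reproduced here; the following is a minimal sketch of the general recipe under one simple assumed manipulation, permuting a fraction of the golden response's tokens, so lexical overlap stays high while contextual appropriateness breaks:

```python
import random

def manipulate_golden(golden: str, ratio: float = 0.3, seed: int = 0) -> str:
    """Build a negative sample that stays lexically close to the golden response."""
    rng = random.Random(seed)
    tokens = golden.split()
    if len(tokens) < 2:
        return golden                           # nothing sensible to manipulate
    k = max(2, int(len(tokens) * ratio))        # how many positions to disturb
    positions = rng.sample(range(len(tokens)), min(k, len(tokens)))
    permuted = positions[:]
    rng.shuffle(permuted)
    out = tokens[:]
    for src, dst in zip(positions, permuted):
        out[dst] = tokens[src]                  # permute the selected tokens
    return " ".join(out)

# During NUP training, such manipulated responses complement random negatives.
print(manipulate_golden("sure , i would love to grab coffee tomorrow afternoon"))
```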

Calibration of Pre-trained Language Model for Korean

Soyeong Jeong, Wonsuk Yang, ChaeHun Park, and Jong C. Park
Journal of KIISE, Vol. 48, No. 4, pp. 434-443, April, 2021.
Deep learning models now show performance beyond that of humans in various tasks such as computer vision and natural language understanding. In particular, pre-trained Transformer models have recently shown remarkable performance on natural language understanding problems such as question answering (QA) and dialogue tasks. However, compared to the rapid development of such models, the mechanisms by which they work remain relatively unknown. As one method of analyzing deep learning models, calibration measures how well a model's predicted confidence matches its actual accuracy. Our study aims to interpret pre-trained Korean language models through calibration. In particular, we analyzed whether pre-trained Korean language models are able to capture ambiguities in sentences, and applied smoothing methods to quantitatively measure such ambiguities with confidence scores. In addition, in terms of calibration, we evaluated the capability of pre-trained Korean language models to identify grammatical characteristics of Korean that affect the meaning of Korean sentences.
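Calibration is typically quantified with the expected calibration error (ECE): predictions are bucketed by confidence, and the per-bin gap between average confidence and accuracy is averaged, weighted by bin size. A generic sketch of this standard metric (not necessarily the paper's exact protocol):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Weighted average of |accuracy - confidence| over equal-width confidence bins."""
    conf = np.asarray(confidences, dtype=float)
    corr = np.asarray(correct, dtype=float)      # 1.0 if the prediction was right
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            gap = abs(corr[in_bin].mean() - conf[in_bin].mean())
            ece += in_bin.mean() * gap           # weight by the fraction of samples in the bin
    return ece

# A perfectly calibrated model would score 0.
print(expected_calibration_error([0.9, 0.8, 0.6, 0.95], [1, 1, 0, 1]))
```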

Target-Agnostic Detection of Stances Toward Entities in News Articles

Eugene Jang, Wonsuk Yang, and Jong C. Park
Human-Computer Interaction Korea (HCI), Korea, January 27-29, 2021.

Automatic Facial Expression Generation for Sign Language with Neural Machine Translation

Eui Jun Hwang, Jung-Ho Kim, and Jong C. Park
Korea Software Congress (KSC), Korea, December 21-23, 2020.
In sign language, facial expressions play an important role in effective communication. In particular, they are well known for conveying emotional and grammatical information that affects the meaning of a sign. In this paper, we consider only the grammatical functions of facial expressions. We propose a transformer-based facial expression generation model that translates an expression in spoken language into continuous facial landmark sequences for sign language. To train the model efficiently, we employ a Principal Component Analysis (PCA) embedding and a custom Mean Squared Error (MSE) loss. We report quantitative and qualitative results on the generated facial landmarks.
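A minimal sketch of the two training tricks named in the abstract, under assumed shapes (68 two-dimensional landmarks per frame, 16 PCA components) and with a plain MSE as a stand-in for the custom loss, whose details are not given here:

```python
import numpy as np
import torch
from sklearn.decomposition import PCA

# Assume each frame holds 68 two-dimensional landmarks, flattened to 136 values.
train_frames = np.random.randn(1000, 136).astype(np.float32)  # placeholder for real data
pca = PCA(n_components=16).fit(train_frames)  # low-dimensional PCA embedding of landmarks

def to_pca_space(frames: np.ndarray) -> torch.Tensor:
    return torch.from_numpy(pca.transform(frames).astype(np.float32))

# The transformer would predict landmark sequences in PCA space; here a dummy batch.
predicted = torch.randn(8, 16, requires_grad=True)
target = to_pca_space(train_frames[:8])
loss = torch.nn.functional.mse_loss(predicted, target)  # stand-in for the custom MSE
loss.backward()
```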

A Study on Practical Machine Translation from Korean to Korean Sign Language

Jung-Ho Kim, Eui Jun Hwang, and Jong C. Park
Journal of Korean Sign Language Studies, Vol. 4, No. 1, 2020.
In this study, we propose a practical method for machine translation from Korean to Korean Sign Language (KSL). For practical use, we select the most appropriate Korean corpus and then annotate it with KSL sentences to construct a Korean-KSL parallel corpus. Through experiments, we train four neural machine translation models on our Korean-KSL parallel corpus and find the best model by calculating BLEU scores. As a quantitative result, our best model achieves a BLEU-4 score of 20.18 on the test set of our Korean-KSL parallel corpus. We also report qualitative results with an error analysis for better understanding. We finally demonstrate that our model can translate Korean sentences not included in our Korean dataset. We therefore believe that our Korean-KSL translation system can lessen the gap between supply and demand for sign language interpretation.
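The reported BLEU-4 is the standard 4-gram corpus BLEU. A sketch of how such a score is computed over a test set, e.g. with sacrebleu; the gloss sequences below are invented placeholders, not examples from the corpus:

```python
import sacrebleu

# Hypothetical model outputs and aligned reference KSL gloss sequences for a test set.
hypotheses = ["HELLO YOU MEET GLAD", "TOMORROW WEATHER GOOD"]
references = [["HELLO YOU MEET GLAD", "TOMORROW WEATHER FINE"]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)  # 4-gram BLEU by default
print(f"BLEU-4 = {bleu.score:.2f}")
```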

Calibration of Pre-trained Language Model for Korean

Soyeong Jeong, Wonsuk Yang, ChaeHun Park, and Jong C. Park
32nd Annual Conference on Human & Cognitive Language Technology, October 15-16, 2020.
(selected as best paper)