Recent Publications

The latest 10 papers published or under review

Sign Language Production With Avatar Layering: A Critical Use Case over Rare Words

Jung-Ho Kim, Eui Jun Hwang, Sukmin Cho, Du Hui Lee, and Jong C. Park
Proceedings of the 13th Language Resources and Evaluation Conference (LREC 2022)

GeezSwitch: Language Identification in Typologically Related Low-resourced East African Languages

Fitsum Gaim, Wonsuk Yang, and Jong C. Park
Proceedings of the 13th Language Resources and Evaluation Conference (LREC 2022)

ELF22: A Context-based Counter-Trolling Dataset to Combat Internet Trolls

Huije Lee, Young Ju Na, Hoyun Song, Jisu Shin, and Jong C. Park
Proceedings of the 13th Language Resources and Evaluation Conference (LREC 2022)

Query Generation with External Knowledge for Dense Retrieval

Sukmin Cho, Soyeong Jeong, Wonsuk Yang, and Jong C. Park
Proceedings of Deep Learning Inside Out (DeeLIO): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures

Augmenting Document Representations for Dense Retrieval with Interpolation and Perturbation

Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Ju Hwang, and Jong C. Park
60th Annual Meeting of the Association for Computational Linguistics (ACL 2022)

Flexible Acceptance Condition of Generics from a Probabilistic Viewpoint: Towards Formalization of the Semantics of Generics

Soo Hyun Ryu, Wonsuk Yang, and Jong C. Park
Journal of Psycholinguistic Research, 2022

Detecting Implicitly Abusive Language by Applying Out-of-Distribution Problem

Jisu Shin, Hoyun Song, and Jong C. Park
Proceedings of the Korea Software Congress (KSC 2021), December 20-22, 2021

Optimizing Domain Specificity of Transformer-based Language Models for Extractive Summarization of Financial News Articles in Korean

Huije Lee, Wonsuk Yang, ChaeHun Park, Hoyun Song, Eugene Jang, and Jong C. Park
35th Pacific Asia Conference on Language, Information and Computation (PACLIC 35), November 5-7, 2021
Abstract:
Frequent usage of complex numerical expressions and of terms that require domain knowledge makes financial news articles more difficult to comprehend and summarize than other daily news articles. We present a transformer-based model for the automatic summarization of financial news articles in Korean and address related issues, in particular analyzing the interplay between the domain of the dataset used for pre-training and that used for fine-tuning. We find that the summarization model performs much better when the two coincide, even when they differ from the domain of the target task, which is the financial domain in our work.

Non-Autoregressive Sign Language Production with Gaussian Space

Eui Jun Hwang, Jung-Ho Kim, and Jong C. Park
The 32nd British Machine Vision Conference (BMVC 2021), November 22-25, 2021

A Large-scale Comprehensive Abusiveness Detection Dataset with Multifaceted Labels from Reddit

Hoyun Song, Soo Hyun Ryu, Huije Lee, and Jong C. Park
Proceedings of the 25th Conference on Computational Natural Language Learning (CoNLL 2021), November 10-11, 2021
Abstract:
As users in online communities suffer from severe side effects of abusive language, many researchers have attempted to detect abusive texts on social media, presenting several datasets for such detection. However, none of these datasets contain both comprehensive labels and contextual information, which are essential for thoroughly detecting all kinds of abusiveness in text, since datasets with such fine-grained features demand a significant amount of annotation, leading to much increased complexity. In this paper, we propose a Comprehensive Abusiveness Detection Dataset (CADD), collected from English Reddit posts, with multifaceted labels and contexts. Our dataset is annotated hierarchically for efficient annotation through large-scale crowdsourcing. We also empirically explore the characteristics of our dataset and provide a detailed analysis for novel insights. The results of our experiments with strong pre-trained natural language understanding models on our dataset show that our dataset gives rise to meaningful performance, assuring its practicality for abusive language detection.