From Universal Language Model to Downstream Task: Improving RoBERTa-Based Vietnamese Hate Speech Detection

Pham, Q.H. and Anh Nguyen, V. and Doan, L.B. and Tran, N.N. and Thanh, T.M. (2020) From Universal Language Model to Downstream Task: Improving RoBERTa-Based Vietnamese Hate Speech Detection. In: 12th International Conference on Knowledge and Systems Engineering, KSE 2020, 12 November 2020 through 14 November 2020.

Official URL: https://www.scopus.com/inward/record.uri?eid=2-s2....

Abstract

Natural language processing (NLP) is a fast-growing field of artificial intelligence. Since Google introduced the Transformer [32] in 2017, a large number of language models such as BERT, GPT, and ELMo have been inspired by this architecture. These models are trained on huge datasets and have achieved state-of-the-art results on natural language understanding. However, fine-tuning a pre-trained language model on a much smaller dataset for a downstream task requires a carefully designed pipeline to mitigate dataset problems such as a lack of training data and imbalanced data. In this paper, we propose a pipeline to adapt the general-purpose RoBERTa language model to a specific text classification task: Vietnamese Hate Speech Detection. We first tune PhoBERT [9] on our dataset by re-training the model on the Masked Language Model (MLM) task; then, we employ its encoder for text classification. To preserve pre-trained weights while learning new feature representations, we further apply several training techniques: layer freezing, block-wise learning rates, and label smoothing. Our experiments show that the proposed pipeline boosts performance significantly, achieving a new state-of-the-art result on the Vietnamese Hate Speech Detection (HSD) campaign with an F1 score of 0.7221. © 2020 IEEE.
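
As a minimal sketch of two of the training techniques named in the abstract, the following pure-Python code illustrates label smoothing and a block-wise learning-rate schedule with frozen lower blocks. It is not the authors' implementation: the function names, the smoothing value, the decay factor, and the learning rate are illustrative assumptions, and the abstract does not state which values or exact formulation the paper used.

```python
import math


def smooth_labels(true_class, num_classes, epsilon=0.1):
    """Smoothed target: (1 - epsilon) on the true class plus epsilon spread uniformly.

    epsilon=0.1 is an assumed, commonly used value, not one taken from the paper.
    """
    uniform = epsilon / num_classes
    return [
        (1.0 - epsilon) + uniform if k == true_class else uniform
        for k in range(num_classes)
    ]


def cross_entropy(logits, target):
    """Cross-entropy between a target distribution and softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -sum(t * (x - log_z) for x, t in zip(logits, target))


def blockwise_learning_rates(num_blocks, top_lr=2e-5, decay=0.9, frozen_blocks=0):
    """One learning rate per encoder block; block 0 is closest to the input.

    The top block trains at top_lr, each block below it at `decay` times the
    rate of the block above, and the lowest `frozen_blocks` blocks get 0.0
    (frozen). All defaults are hypothetical.
    """
    rates = []
    for block in range(num_blocks):
        if block < frozen_blocks:
            rates.append(0.0)
        else:
            rates.append(top_lr * decay ** (num_blocks - 1 - block))
    return rates


if __name__ == "__main__":
    target = smooth_labels(true_class=0, num_classes=3)
    print(target)
    print(cross_entropy([2.0, 0.5, -1.0], target))
    print(blockwise_learning_rates(num_blocks=12, frozen_blocks=4))
```

The intent of such a schedule is that lower blocks, which hold the most general pre-trained features, change slowly or not at all, while upper blocks adapt more freely to the classification task. Smoothed targets keep the classifier from becoming overconfident on a small dataset.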

Item Type:	Conference or Workshop Item (Paper)
Divisions:	Faculties > Faculty of Information Technology
Identification Number:	10.1109/KSE50997.2020.9287406
Uncontrolled Keywords:	Artificial intelligence; Computational linguistics; Natural language processing systems; Pipelines; Speech recognition; Systems engineering; Text processing; Feature representation; Imbalanced data; Natural language processing; Natural language understanding; Speech detection; State of the art; Text classification; Training techniques; Classification (of information)
Additional Information:	Conference code: 165870. Language of original document: English. All Open Access, Green.
URI:	http://eprints.lqdtu.edu.vn/id/eprint/8869
