Offensive language identification using Hindi-English code-mixed tweets, and code-mixed data augmentation |
|
Author: | Jahan, Md Saroar1; Oussalah, Mourad1; Mim, Jhuma Kabir2; |
Organizations: |
1University of Oulu, Faculty of Information Tech., CMVS, PO Box 4500, Oulu 90014, FINLAND; 2LUT University, Dept of Computational Engineering, 53850 Lappeenranta, FINLAND; 3Daffodil International University, Dhaka 1207, BANGLADESH |
Format: | article |
Version: | published version |
Access: | open |
Online Access: | PDF Full Text (PDF, 0.7 MB) |
Persistent link: | http://urn.fi/urn:nbn:fi-fe2022070551216 |
Language: | English |
Published: |
RWTH Aachen University,
2021
|
Publish Date: | 2022-07-05 |
Description: |
Abstract
Code-mixed text classification is challenging because labeled code-mixed datasets are scarce and pre-trained models for such text do not exist. This paper presents our HASOC-2021 offensive language identification results and main findings on the code-mixed (Hindi-English) Subtask 2. We propose a new method of code-mixed data augmentation that combines WordNet-based synonym replacement of Hindi and English words with phonetic conversion of Hinglish (Hindi-English) words. We used a 5.7k pre-annotated HASOC-2021 code-mixed dataset for training and data augmentation. The proposal's feasibility was tested with Logistic Regression (LR) as a baseline, a Convolutional Neural Network (CNN), and BERT, with and without data augmentation. The outcomes were promising, yielding an almost 3% increase in classifier accuracy and F1 score compared to the baseline. Our official submission achieved a 66.56% F1 score and ranked 8th in the competition.
|
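The synonym-replacement step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the table `TOY_SYNONYMS`, the function `augment`, and its parameters `p` and `seed` are hypothetical names introduced here, and a small hand-written dictionary stands in for the WordNet lookups the paper describes.

```python
import random

# Hypothetical toy synonym table standing in for WordNet lookups.
# The entries (English and romanized Hindi) are illustrative only.
TOY_SYNONYMS = {
    "movie": ["film"],
    "bad": ["awful", "terrible"],
    "accha": ["badhiya"],
}

def augment(tokens, synonyms, p=0.5, seed=0):
    """Return a copy of tokens in which each token that has an entry in
    `synonyms` is swapped for a randomly chosen synonym with probability p."""
    rng = random.Random(seed)  # local generator keeps the result reproducible
    out = []
    for tok in tokens:
        candidates = synonyms.get(tok.lower(), [])
        if candidates and rng.random() < p:
            out.append(rng.choice(candidates))
        else:
            out.append(tok)  # no synonym known, or token not selected
    return out

if __name__ == "__main__":
    print(augment(["yeh", "movie", "accha", "hai"], TOY_SYNONYMS, p=1.0))
```

Each augmented sentence would keep the label of the sentence it was derived from, which is the usual assumption behind synonym-based augmentation. The phonetic conversion of Hinglish words is a separate step and is not covered by this sketch.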
Series: |
CEUR workshop proceedings |
ISSN: | 1613-0073 |
ISSN-E: | 1613-0073 |
ISSN-L: | 1613-0073 |
Volume: | 3159 |
Pages: | 226 - 238 |
Host publication: |
Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation, Gandhinagar, India, December 13-17, 2021 |
Host publication editor: |
Mehta, Parth; Mandl, Thomas; Majumder, Prasenjit; Mitra, Mandar |
Conference: |
Forum for Information Retrieval Evaluation |
Type of Publication: |
A4 Article in conference proceedings |
Field of Science: |
113 Computer and information sciences |
Subjects: | |
Funding: |
This project was partially funded by EU Project WaterLine (Downscaling Remotely Sensed Products to Improve Hydrological Modelling Performance), which is gratefully acknowledged. |
Copyright information: |
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). |
https://creativecommons.org/licenses/by/4.0/ |