University of Oulu

Jahan, M. S., Oussalah, M., Mim, J. K., & Islam, M. (2021). Offensive Language Identification Using Hindi-English Code-Mixed Tweets, and Code-Mixed Data Augmentation. In P. Mehta, T. Mandl, P. Majumder, & M. Mitra (Eds.), Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation, Gandhinagar, India, December 13-17, 2021 (pp. 226-238). RWTH Aachen University. http://ceur-ws.org/Vol-3159/T1-23.pdf

Offensive language identification using Hindi-English code-mixed tweets, and code-mixed data augmentation

Author: Jahan, Md Saroar1; Oussalah, Mourad1; Mim, Jhuma Kabir2;
Organizations: 1University of Oulu, Faculty of Information Tech., CMVS, PO Box 4500, Oulu 90014, FINLAND
2LUT University, Dept of Computational Engineering, 53850 Lappeenranta, FINLAND
3Daffodil International University, Dhaka 1207, BANGLADESH
Format: article
Version: published version
Access: open
Online Access: PDF Full Text (PDF, 0.7 MB)
Persistent link: http://urn.fi/urn:nbn:fi-fe2022070551216
Language: English
Published: RWTH Aachen University, 2021
Publish Date: 2022-07-05
Description:

Abstract

Code-mixed text classification is challenging due to the lack of labeled code-mixed datasets and the absence of pre-trained models. This paper presents the HASOC-2021 offensive language identification results and main findings on the code-mixed (Hindi-English) Subtask 2. We propose a new method of code-mixed data augmentation that combines WordNet-based synonym replacement of Hindi and English words with phonetic conversion of Hinglish (Hindi-English) words. We used a 5.7k pre-annotated HASOC-2021 code-mixed dataset for training and data augmentation. The proposal’s feasibility was tested with Logistic Regression (LR) as a baseline, a Convolutional Neural Network (CNN), and BERT, with and without data augmentation. The outcomes were promising, yielding an almost 3% increase in classifier accuracy and F1 score compared to the baseline. Our official submission achieved a 66.56% F1 score and ranked 8th in the competition.
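
The synonym-replacement idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `augment`, the replacement probability `p`, and the small `SYNONYMS` table (a hand-written stand-in for WordNet lookups, with illustrative romanized Hindi entries) are all assumptions made for this example. The phonetic conversion step is not covered.

```python
import random

# Hypothetical toy synonym table standing in for WordNet lookups.
# The entries are illustrative only and are not taken from the paper.
SYNONYMS = {
    "bad": ["awful", "terrible"],
    "stupid": ["foolish", "dumb"],
    "bura": ["kharab"],  # romanized Hindi, assumed for illustration
}


def augment(text, synonyms=SYNONYMS, p=0.5, rng=None):
    """Return a copy of `text` in which each token that has an entry in
    `synonyms` is replaced by a randomly chosen synonym with probability `p`."""
    if rng is None:
        rng = random.Random(0)  # fixed seed so repeated runs are reproducible
    out = []
    for token in text.split():
        candidates = synonyms.get(token.lower())
        if candidates and rng.random() < p:
            out.append(rng.choice(candidates))
        else:
            out.append(token)  # keep tokens without a known synonym
    return " ".join(out)


if __name__ == "__main__":
    # Each call yields one augmented variant of a code-mixed sentence.
    print(augment("tum bahut bad ho", p=1.0))
```

In such a scheme, every augmented variant keeps the label of the tweet it was derived from, so the training set grows without additional annotation.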

Series: CEUR workshop proceedings
ISSN: 1613-0073
ISSN-E: 1613-0073
ISSN-L: 1613-0073
Volume: 3159
Pages: 226 - 238
Host publication: Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation, Gandhinagar, India, December 13-17, 2021
Host publication editor: Mehta, Parth
Mandl, Thomas
Majumder, Prasenjit
Mitra, Mandar
Conference: FIRE 2021 - Forum for Information Retrieval Evaluation, Gandhinagar, India, December 13-17, 2021
Type of Publication: A4 Article in conference proceedings
Field of Science: 113 Computer and information sciences
Funding: This project was partially funded by EU Project WaterLine (Downscaling Remotely Sensed Products to Improve Hydrological Modelling Performance), which is gratefully acknowledged.
Copyright information: © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
  https://creativecommons.org/licenses/by/4.0/