University of Oulu

Jahan, M. S., Beddiar, D. R., Oussalah, M., & Mohamed, M. (2022). Data expansion using WordNet-based semantic expansion and word disambiguation for cyberbullying detection. In N. Calzolari et al. (Eds.), Language Resources and Evaluation Conference, LREC 2022, 20-25 June 2022, Palais du Pharo, Marseille, France : conference proceedings (pp. 1761-1770). European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.187.pdf

Data expansion using WordNet-based semantic expansion and word disambiguation for cyberbullying detection

Author: Jahan, Md Saroar (1); Beddiar, Djamila Romaissa (1); Oussalah, Mourad (1); Mohamed, M. (2)
Organizations: (1) University of Oulu, CMVS, BP 4500, 90014, Finland
(2) Operations and Information Management, Aston University, B4 7ET, UK
Format: article
Version: published version
Access: open
Online Access: PDF Full Text (PDF, 0.5 MB)
Persistent link: http://urn.fi/urn:nbn:fi-fe2022070551051
Language: English
Published: European Language Resources Association, 2022
Publish Date: 2022-07-05
Description:

Abstract

Automatic identification of cyberbullying from textual content is known to be a challenging task. The challenges arise from the inherent structure of cyberbullying and from the lack of a large-scale labeled corpus that would enable efficient machine-learning-based tools, including neural networks. This paper advocates a data-augmentation-based approach that could enhance the automatic detection of cyberbullying in social media texts. We use both word sense disambiguation and the synonymy relation in the WordNet lexical database to generate coherent, equivalent utterances of the cyberbullying input data. The disambiguation and semantic expansion are intended to overcome the inherent limitations of social media posts, such as an abundance of unstructured constructs and limited semantic content. In addition, to test feasibility, a novel protocol was employed to collect cyberbullying traces from the AskFm forum, where a dataset of about 10K samples was manually labeled. Next, cyberbullying identification is treated as a binary classification problem, addressed with an elaborate data augmentation strategy and an appropriate classifier. For the latter, a Convolutional Neural Network (CNN) architecture with FastText and BERT was put forward, and its results were compared against the commonly employed Naïve Bayes (NB) and Logistic Regression (LR) classifiers, with and without data augmentation. The research outcomes were promising, yielding a classifier accuracy of almost 98.4%, an improvement of more than 4% over baseline results.
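
The augmentation idea summarized in the abstract (disambiguate a word in context, then substitute synonyms of the chosen sense) can be illustrated with a minimal sketch. This is not the authors' implementation: the sense inventory `TOY_SENSES` is a small hypothetical stand-in for WordNet, the gloss-overlap scoring is a simplified Lesk-style heuristic chosen for illustration, and the function names `tokenize`, `disambiguate`, and `augment` are assumptions.

```python
# Hypothetical toy sense inventory standing in for WordNet (not real WordNet data).
# Each word maps to a list of senses; each sense has a gloss and synonyms.
TOY_SENSES = {
    "ugly": [
        {"gloss": "unpleasant to look at appearance face",
         "synonyms": ["hideous", "unsightly"]},
        {"gloss": "threatening dangerous weather situation mood",
         "synonyms": ["nasty", "menacing"]},
    ],
    "dumb": [
        {"gloss": "slow to learn or understand person stupid",
         "synonyms": ["stupid", "dense"]},
        {"gloss": "unable to speak voice silent",
         "synonyms": ["mute", "speechless"]},
    ],
}


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def disambiguate(word, context_tokens):
    """Simplified Lesk-style choice: pick the sense whose gloss shares the
    most words with the context. Ties go to the first listed sense.
    Returns None when the word is not in the toy inventory."""
    senses = TOY_SENSES.get(word)
    if not senses:
        return None
    context = set(context_tokens) - {word}
    return max(senses, key=lambda s: len(context & set(s["gloss"].split())))


def augment(text):
    """Generate variants of `text` by replacing each known word with the
    synonyms of its disambiguated sense, one substitution per variant."""
    tokens = tokenize(text)
    variants = []
    for i, tok in enumerate(tokens):
        sense = disambiguate(tok, tokens)
        if sense is None:
            continue
        for syn in sense["synonyms"]:
            variants.append(" ".join(tokens[:i] + [syn] + tokens[i + 1:]))
    return variants


if __name__ == "__main__":
    for variant in augment("your face is ugly"):
        print(variant)
```

In this sketch each variant would inherit the label of the original post, which is how the expanded examples could enlarge a labeled training set. A real pipeline would presumably query WordNet itself rather than a hand-written dictionary.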


ISBN: 979-10-95546-72-6
Pages: 1761 - 1770
Host publication: Language Resources and Evaluation Conference, LREC 2022, 20-25 June 2022, Palais du Pharo, Marseille, France : conference proceedings
Host publication editor: Calzolari, Nicoletta
Béchet, Frédéric
Blache, Philippe
Choukri, Khalid
Cieri, Christopher
Declerck, Thierry
Goggi, Sara
Isahara, Hitoshi
Maegaard, Bente
Mariani, Joseph
Mazo, Hélène
Odijk, Jan
Piperidis, Stelios
Conference: Language Resources and Evaluation Conference
Type of Publication: B3 Article in conference proceedings
Field of Science: 113 Computer and information sciences
Subjects:
Funding: This work was partially supported by EU Project YoungRes on youth polarization & radicalization (ID: 823701) and COST Action NexusLinguarum – “European network for Web-centered linguistic data science” (CA18209), which are gratefully acknowledged.
Copyright information: © European Language Resources Association (ELRA), licensed under CC-BY-NC-4.0
  https://creativecommons.org/licenses/by-nc/4.0/