University of Oulu

Jahan, M. S., & Oussalah, M. (2020). Team Oulu at SemEval-2020 Task 12: Multilingual identification of offensive language, type and target of Twitter post using translated datasets. Proceedings of the Fourteenth Workshop on Semantic Evaluation, 1628–1637. https://doi.org/10.18653/v1/2020.semeval-1.212

Team Oulu at SemEval-2020 Task 12: Multilingual identification of offensive language, type and target of Twitter post using translated datasets

Author: Jahan, Md. Saroar¹; Oussalah, Mourad¹
Organizations: ¹University of Oulu, Faculty of Information Tech., CMVS PO Box 4500, Oulu 90014 Finland
Format: article
Version: published version
Access: open
Online Access: PDF Full Text (PDF, 0.5 MB)
Persistent link: http://urn.fi/urn:nbn:fi-fe2022021118639
Language: English
Published: Association for computational linguistics, 2020
Publish Date: 2022-02-11
Description:

Abstract

With the proliferation of social media platforms and anonymous discussions, together with easy online access, reports of offensive content have caused serious concern to both authorities and research communities. Although there is extensive research on identifying offensive language in online text, the dynamic discourse of social media content, as well as the emergence of new forms of offensive language, especially in a multilingual setting, calls for further research on the issue. In this work, we tackled Tasks A, B, and C of the Offensive Language Challenge at SemEval-2020. We handled offensive language in five languages: English, Greek, Danish, Arabic, and Turkish. Specifically, we pre-processed all provided datasets and developed an appropriate strategy to handle Tasks A, B, and C, that is, identifying the presence or absence, the type, and the target of offensive language in social media. For this purpose, we used the OLID2019 and OLID2020 datasets and generated new datasets, which we made publicly available. We used the provided unsupervised machine learning implementation for the automatically annotated datasets, and the online Google translation tools to create new datasets as well. We discussed the limitations and the successes of our machine learning-based approach for all five languages. Our results for identifying offensive posts (Task A) yielded a satisfactory accuracy of 0.92 for English, 0.81 for Danish, 0.84 for Turkish, 0.85 for Greek, and 0.89 for Arabic. For type detection (Task B), the results are significantly higher (0.87 accuracy) than for target detection (Task C), which yields 0.81 accuracy. Moreover, after using automated Google translation, the overall efficiency improved by 2% for Greek, Turkish, and Danish.

ISBN: 978-1-952148-31-6
Pages: 1628–1637
DOI: 10.18653/v1/2020.semeval-1.212
OADOI: https://oadoi.org/10.18653/v1/2020.semeval-1.212
Host publication: Proceedings of the fourteenth workshop on semantic evaluation
Conference: Workshop on Semantic Evaluation
Type of Publication: A4 Article in conference proceedings
Field of Science: 113 Computer and information sciences
Funding: This work is partly supported by European project YoungRes (#823701), which is gratefully acknowledged.
Copyright information: © 2020 The Authors.