Jahan, M. S., & Oussalah, M. (2020). Team Oulu at SemEval-2020 Task 12: Multilingual identification of offensive language, type and target of Twitter post using translated datasets. Proceedings of the Fourteenth Workshop on Semantic Evaluation, 1628–1637. https://doi.org/10.18653/v1/2020.semeval-1.212
Team Oulu at SemEval-2020 Task 12: multilingual identification of offensive language, type and target of Twitter post using translated datasets
|Author:||Jahan, Md. Saroar¹; Oussalah, Mourad¹|
¹University of Oulu, Faculty of Information Tech., CMVS, PO Box 4500, Oulu 90014, Finland
|Persistent link:|| http://urn.fi/urn:nbn:fi-fe2022021118639
Association for Computational Linguistics
|Publish Date:|| 2022-02-11
With the proliferation of social media platforms, anonymous discussion, and easy online access, reports of offensive content have caused serious concern among both authorities and research communities. Although there is extensive research on identifying offensive language in online text, the dynamic discourse of social media and the emergence of new forms of offensive language, especially in multilingual settings, call for further research on the issue. In this work, we tackled Tasks A, B, and C of the Offensive Language Challenge at SemEval-2020. We handled offensive language in five languages: English, Greek, Danish, Arabic, and Turkish. Specifically, we pre-processed all provided datasets and developed a strategy for Tasks A, B, and C, which identify the presence or absence, the type, and the target of offensive language in social media, respectively. For this purpose, we used the OLID2019 and OLID2020 datasets and generated new datasets, which we made publicly available. To create the new datasets, we used the provided unsupervised machine learning implementation for automatically annotated datasets together with the online Google translation tools. We discuss the limitations and the successes of our machine learning-based approach for all five languages. Our results for identifying offensive posts (Task A) yielded accuracies of 0.92 for English, 0.81 for Danish, 0.84 for Turkish, 0.85 for Greek, and 0.89 for Arabic. For type detection (Task B), accuracy was higher (0.87) than for target detection (Task C), which yielded 0.81. Moreover, after using automated Google translation, the overall efficiency improved by 2% for Greek, Turkish, and Danish.
|Pages:||1628 - 1637|
Proceedings of the Fourteenth Workshop on Semantic Evaluation
Workshop on Semantic Evaluation
|Type of Publication:||
A4 Article in conference proceedings
|Field of Science:||
113 Computer and information sciences
This work is partly supported by the European project YoungRes (#823701), which is gratefully acknowledged.
© 2020 The Authors.