University of Oulu

Cyber bullying identification and tackling using natural language processing techniques

Saved in:
Author: Jahan, Md.1
Organizations: 1University of Oulu, Faculty of Information Technology and Electrical Engineering, Computer Science
Format: ebook
Version: published version
Access: open
Online Access: PDF Full Text (PDF, 3.2 MB)
Pages: 75
Persistent link: http://urn.fi/URN:NBN:fi:oulu-202008222869
Language: English
Published: Oulu : M. Jahan, 2020
Publish Date: 2020-08-24
Thesis type: Master's thesis
Tutor: Partala, Juha
Reviewer: Oussalah, Mourad
Description:

Abstract

As offensive content has a detrimental influence on the internet and especially in social media, there has been much research identifying cyberbullying posts from social media datasets. Previous works on this topic have overlooked the problems for cyberbullying categories detection, impact of feature choice, negation handling, and dataset construction. Indeed, many natural language processing (NLP) tasks, including cyberbullying detection in texts, lack comprehensive manually labeled datasets limiting the application of powerful supervised machine learning algorithms, including neural networks. Equally, it is challenging to collect large scale data for a particular NLP project due to the inherent subjectivity of labeling task and man-made effort.

For this purpose, this thesis attempts to contribute to these challenges by the following. We first collected and annotated a multi-category cyberbullying (10K) dataset from the social network platform (ask.fm). Besides, we have used another publicly available cyberbullying labeled dataset, ’Formspring’, for comparison purpose and ground truth establishment. We have devised a machine learning-based methodology that uses five distinct feature engineering and six different classifiers. The results showed that CNN classifier with Word-embedding features yielded a maximum performance amidst all state-of-art classifiers, with a detection accuracy of 93\% for AskFm and 92\% for FormSpring dataset. We have performed cyberbullying category detection, and CNN architecture still provide the best performance with 81\% accuracy and 78\% F1-score on average.

Our second purpose was to handle the problem of lack of relevant cyberbullying instances in the training dataset through data augmentation. For this end, we developed an approach that makes use of wordsense disambiguation with WordNet-aided semantic expansion. The disambiguation and semantic expansion were intended to overcome several limitations of the social media (SM) posts/comments, such as unstructured content, limited semantic content, among others, while capturing equivalent instances induced by the wordsense disambiguation-based approach. We run several experiments and disambiguation/semantic expansion to estimate the impact of the classification performance using both original and the augmented datasets. Finally, we have compared the accuracy score for cyberbullying detection with some widely used classifiers before and after the development of datasets. The outcome supports the advantage of the data-augmentation strategy, which yielded 99\% of classifier accuracy, a 5\% improvement from the base score of 93\%.

Our third goal related to negation handling was motivated by the intuitive impact of negation on cyberbullying statements and detection. Our proposed approach advocates a classification like technique by using NegEx and POS tagging that makes the use of a particular data design procedure for negation detection. Performances using the negation-handling approach and without negation handling are compared and discussed. The result showed a 95\% of accuracy for the negated handed dataset, which corresponds to an overall accuracy improvement of 2\% from the base score of 93\%.

Our final goal was to develop a software tool using our machine learning models that will help to test our experiments and provide a real-life example of use case for both end-users and research communities. To achieve this objective, a python based web-application was developed and successfully tested.

see all

Subjects:
Copyright information: © Md. Jahan, 2020. This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.