University of Oulu

Using multiclass classification algorithms to improve text categorization tool : NLoN

Saved in:
Author: Xu, Jianwen1
Organizations: 1University of Oulu, Faculty of Information Technology and Electrical Engineering, Department of Information Processing Science, Information Processing Science
Format: ebook
Version: published version
Access: open
Online Access: PDF Full Text (PDF, 1.9 MB)
Pages: 43
Persistent link:
Language: English
Published: Oulu : J. Xu, 2021
Publish Date: 2021-09-09
Thesis type: Master's thesis
Tutor: Mäntylä, Mika
Reviewer: Mäntylä, Mika
Claes, Maëlick


Natural language processing (NLP) and machine learning techniques have been widely utilized in the mining software repositories (MSR) field in recent years. Separating natural language from source code is a pre-processing step that is needed in both NLP and the MSR domain for better data quality. This paper presents the design and implementation of a multi-class classification approach that is based on the existing open-source R package Natural Language or Not (NLoN).

This article also reviews the existing literature on MSR and NLP. The review classified the information sources and approaches of MSR in detail, and also focused on the text representation and classification tasks of NLP. In addition, the design and implementation methods of the original paper are briefly introduced.

Regarding the research methodology, since the research goal is technology-oriented, i.e., to improve the design and implementation of existing technologies, this article adopts the design science research methodology and also describes how the methodology was adopted.

This research implements an open-source Python library, namely NLoN-PY. This is an open-source library hosted on GitHub, and users can also directly use the tools published to the PyPI library.

Since NLoN has achieved comparable performance on two-class classification tasks with the Lasso regression model, this study evaluated other multi-class classification algorithms, i.e., Naive Bayes, k-Nearest Neighbours, and Support Vector Machine. Using 10-fold cross-validation, the expanded classifier achieved AUC performance of 0.901 for the 5-class classification task and the AUC performance of 0.92 for the 2-class task.

Although the design of this study did not show a significant performance improvement compared to the original design, the impact of unbalanced data distribution on performance was detected and the category of the classification problem was also refined in the process. These findings on the multi-class classification design can provide a research foundation or direction for future research.

see all

Copyright information: © Jianwen Xu, 2021. This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.