Using multiclass classification algorithms to improve text categorization tool : NLoN
1University of Oulu, Faculty of Information Technology and Electrical Engineering, Department of Information Processing Science, Information Processing Science
|Online Access:||PDF Full Text (PDF, 1.9 MB)|
|Persistent link:|| http://urn.fi/URN:NBN:fi:oulu-202109098960
Oulu : J. Xu,
|Publish Date:|| 2021-09-09
|Thesis type:||Master's thesis
Natural language processing (NLP) and machine learning techniques have been widely utilized in the mining software repositories (MSR) field in recent years. Separating natural language from source code is a pre-processing step that is needed in both NLP and the MSR domain for better data quality. This paper presents the design and implementation of a multi-class classification approach that is based on the existing open-source R package Natural Language or Not (NLoN).
This article also reviews the existing literature on MSR and NLP. The review classified the information sources and approaches of MSR in detail, and also focused on the text representation and classification tasks of NLP. In addition, the design and implementation methods of the original paper are briefly introduced.
Regarding the research methodology, since the research goal is technology-oriented, i.e., to improve the design and implementation of existing technologies, this article adopts the design science research methodology and also describes how the methodology was adopted.
This research implements an open-source Python library, namely NLoN-PY. This is an open-source library hosted on GitHub, and users can also directly use the tools published to the PyPI library.
Since NLoN has achieved comparable performance on two-class classification tasks with the Lasso regression model, this study evaluated other multi-class classification algorithms, i.e., Naive Bayes, k-Nearest Neighbours, and Support Vector Machine. Using 10-fold cross-validation, the expanded classifier achieved AUC performance of 0.901 for the 5-class classification task and the AUC performance of 0.92 for the 2-class task.
Although the design of this study did not show a significant performance improvement compared to the original design, the impact of unbalanced data distribution on performance was detected and the category of the classification problem was also refined in the process. These findings on the multi-class classification design can provide a research foundation or direction for future research.
© Jianwen Xu, 2021. This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.