Rantala, L., Mäntylä, M. Predicting technical debt from commit contents: reproduction and extension with automated feature selection. Software Qual J 28, 1551–1579 (2020). https://doi.org/10.1007/s11219-020-09520-3
Predicting technical debt from commit contents : reproduction and extension with automated feature selection
|Author:||Rantala, Leevi¹; Mäntylä, Mika¹|
¹ M3S / ITEE / University of Oulu, P.O.B. 4500, 90014 University of Oulu, Oulu, Finland
|Online Access:||PDF Full Text (PDF, 0.9 MB)|
|Persistent link:|| http://urn.fi/urn:nbn:fi-fe2020080347902
|Publish Date:|| 2020-08-03
Self-admitted technical debt (SATD) refers to sub-optimal development solutions that are expressed in code comments or commits. We reproduce and improve on prior work by Yan et al. (2018) on detecting commits that introduce self-admitted technical debt. We use multiple natural language processing (NLP) methods: Bag-of-Words, topic modeling, and word embedding vectors. We study five open-source projects. Our NLP approach uses logistic Lasso regression from Glmnet to automatically select the best predictor words. A manually labeled dataset from prior work that identified self-admitted technical debt from code-level commits serves as ground truth. Our approach achieves an area under the ROC curve 0.15 higher than the prior work when comparing only commit message features, and 0.03 higher overall when manually selected features are replaced with automatically selected words. In both cases, the improvement was statistically significant (p < 0.0001). Our work has four main contributions: a comparison of different NLP techniques for SATD detection, improved results over previous work, a demonstration of how to generate generalizable predictor words when using multiple repositories, and a list of words correlating with SATD. As a concrete result, we release a list of the predictor words that correlate positively with SATD, as well as the datasets and scripts we used, to enable replication studies and to aid in the creation of future classifiers.
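The word-selection idea in the abstract (Bag-of-Words features fed to an L1-penalized logistic regression, so that most word weights shrink to exactly zero) can be illustrated with a minimal sketch. It is not the authors' pipeline: it does not use Glmnet, it fits the model with a plain proximal-gradient loop from the Python standard library, and the commit messages, labels, function names, and parameter values (`lam`, `lr`, `steps`) are all invented for illustration.

```python
import math
import re

# Invented toy commit messages; 1 = introduces SATD, 0 = does not (assumption).
COMMITS = [
    ("temporary hack around parser bug", 1),
    ("ugly hack until upstream lands", 1),
    ("todo replace hack with proper cache", 1),
    ("todo refactor duplicated validation logic", 1),
    ("update documentation examples", 0),
    ("add unit tests for exporter", 0),
    ("bump version number", 0),
    ("rename config option", 0),
]


def tokenize(text):
    """Binary Bag-of-Words: the set of lowercase words in a message."""
    return set(re.findall(r"[a-z]+", text.lower()))


def fit_lasso_logistic(docs, labels, lam=0.1, lr=0.1, steps=3000):
    """L1-penalized logistic regression fitted by proximal gradient descent.

    Hypothetical helper: the intercept is left unpenalized, and each word
    weight is soft-thresholded after every gradient step.
    """
    vocab = sorted(set().union(*docs))
    weights = {w: 0.0 for w in vocab}
    bias = 0.0
    n = len(docs)
    for _ in range(steps):
        grad = {w: 0.0 for w in vocab}
        grad_bias = 0.0
        for words, y in zip(docs, labels):
            z = bias + sum(weights[w] for w in words)
            err = 1.0 / (1.0 + math.exp(-z)) - y  # predicted probability minus label
            grad_bias += err / n
            for w in words:
                grad[w] += err / n
        bias -= lr * grad_bias
        for w in vocab:
            stepped = weights[w] - lr * grad[w]
            # Soft-thresholding: shrink toward zero, clip small weights to zero.
            shrunk = max(abs(stepped) - lr * lam, 0.0)
            weights[w] = math.copysign(shrunk, stepped)
    return weights, bias


docs = [tokenize(message) for message, _ in COMMITS]
labels = [label for _, label in COMMITS]
weights, bias = fit_lasso_logistic(docs, labels)

# Words whose weight survived the L1 penalty act as the selected predictors.
selected = {w: v for w, v in weights.items() if v != 0.0}
for word, value in sorted(selected.items(), key=lambda kv: -kv[1]):
    print(f"{word}: {value:+.3f}")
```

Words with a positive surviving weight play the role of the positively correlating predictor words mentioned in the abstract. On real data the penalty strength would normally be chosen by cross-validation rather than fixed by hand as it is here.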
Software Quality Journal
|Pages:||1551 - 1579|
|Type of Publication:|| A1 Journal article (refereed)
|Field of Science:|| 113 Computer and information sciences
The authors have been supported by Infotech Oulu and the Academy of Finland (grants 298020 and 328058).
|Academy of Finland Grant Number:|| 298020 (Academy of Finland Funding decision)
328058 (Academy of Finland Funding decision)
© The Authors 2020. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.