University of Oulu

Rantala, L., Mäntylä, M. Predicting technical debt from commit contents: reproduction and extension with automated feature selection. Software Qual J 28, 1551–1579 (2020).

Predicting technical debt from commit contents: reproduction and extension with automated feature selection

Author: Rantala, Leevi¹; Mäntylä, Mika¹
Organizations: ¹M3S / ITEE / University of Oulu, P.O.B. 4500, 90014 University of Oulu, Oulu, Finland
Format: article
Version: published version
Access: open
Online Access: PDF Full Text (PDF, 0.9 MB)
Persistent link:
Language: English
Published: Springer Nature, 2020
Publish Date: 2020-08-03


Self-admitted technical debt (SATD) refers to sub-optimal development solutions that are expressed in written code comments or commits. We reproduce and improve on prior work by Yan et al. (2018) on detecting commits that introduce self-admitted technical debt. We use multiple natural language processing (NLP) methods: Bag-of-Words, topic modeling, and word embedding vectors. We study five open-source projects. Our NLP approach uses logistic Lasso regression from Glmnet to automatically select the best predictor words. A manually labeled dataset from prior work that identified self-admitted technical debt from code-level commits serves as ground truth. Our approach improves the area under the ROC curve by 0.15 over the prior work when comparing only commit message features, and by 0.03 overall when manually selected features are replaced with automatically selected words. In both cases, the improvement was statistically significant (p < 0.0001). Our work has four main contributions: comparing different NLP techniques for SATD detection, improving on the results of previous work, showing how to generate generalizable predictor words when using multiple repositories, and producing a list of words that correlate with SATD. As a concrete result, we release a list of the predictor words that correlate positively with SATD, as well as the datasets and scripts we used, to enable replication studies and to aid in the creation of future classifiers.
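The word-selection idea in the abstract (a Bag-of-Words representation fed to an L1-penalized logistic regression, keeping the words with positive coefficients) can be illustrated with a minimal sketch. This is not the authors' pipeline: the paper uses Glmnet, whereas the sketch below swaps in a plain proximal-gradient solver written in standard-library Python. The commit messages, their labels, the penalty `lam`, the step size, and all function names are hypothetical and chosen only for illustration.

```python
import math
import re

# Toy commit messages with made-up labels (1 = introduces SATD, 0 = does not).
# These are illustrative assumptions, not data from the paper.
COMMITS = [
    ("temporary hack around parser bug", 1),
    ("quick hack, fix this later", 1),
    ("ugly hack for now, needs cleanup", 1),
    ("hack until upstream release is fixed", 1),
    ("add unit tests for parser", 0),
    ("update documentation for release", 0),
    ("add logging to parser module", 0),
    ("bump version number", 0),
]


def tokenize(text):
    """Lowercase a message and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(messages):
    """Return (vocabulary, binary document-term matrix)."""
    vocab = sorted({tok for msg in messages for tok in tokenize(msg)})
    rows = []
    for msg in messages:
        present = set(tokenize(msg))
        rows.append([1.0 if word in present else 0.0 for word in vocab])
    return vocab, rows


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def soft_threshold(value, threshold):
    """Shrink value toward zero; this is the proximal operator of the L1 penalty."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_lasso_logistic(rows, labels, lam=0.05, step=0.2, iters=2000):
    """L1-penalized logistic regression via proximal gradient descent.

    The intercept is left unpenalized. Glmnet itself uses coordinate descent
    and usually picks lam by cross-validation; both are omitted here.
    """
    n, p = len(rows), len(rows[0])
    weights, bias = [0.0] * p, 0.0
    for _ in range(iters):
        grad_w, grad_b = [0.0] * p, 0.0
        for row, label in zip(rows, labels):
            score = bias + sum(w * x for w, x in zip(weights, row))
            error = sigmoid(score) - label
            grad_b += error / n
            for j, x in enumerate(row):
                grad_w[j] += error * x / n
        bias -= step * grad_b
        # Gradient step on the logistic loss, then soft-thresholding for the L1 term.
        weights = [soft_threshold(w - step * g, step * lam)
                   for w, g in zip(weights, grad_w)]
    return weights, bias


def positive_predictor_words(vocab, weights):
    """Words whose coefficient is positive, strongest first."""
    pairs = [(word, w) for word, w in zip(vocab, weights) if w > 0.0]
    return sorted(pairs, key=lambda pair: -pair[1])


messages = [msg for msg, _ in COMMITS]
labels = [label for _, label in COMMITS]
vocab, rows = bag_of_words(messages)
weights, bias = fit_lasso_logistic(rows, labels)
selected = positive_predictor_words(vocab, weights)
print(selected)
```

The L1 penalty drives the coefficients of uninformative words to exactly zero, so the surviving positive coefficients act as an automatically selected word list. On real data the penalty strength would need to be tuned, for example by cross-validation, rather than fixed as it is here.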


Series: Software quality journal
ISSN: 0963-9314
ISSN-E: 1573-1367
ISSN-L: 0963-9314
Volume: 28
Pages: 1551 - 1579
DOI: 10.1007/s11219-020-09520-3
Type of Publication: A1 Journal article – refereed
Field of Science: 113 Computer and information sciences
Funding: The authors have been supported by Infotech Oulu and Academy of Finland (grants 298020 and 328058).
Academy of Finland Grant Number: 298020, 328058
Detailed Information: 298020 (Academy of Finland Funding decision); 328058 (Academy of Finland Funding decision)
Copyright information: © The Authors 2020. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit