University of Oulu

M. Sridharan, M. Mantyla, L. Rantala and M. Claes, "Data Balancing Improves Self-Admitted Technical Debt Detection," 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR), 2021, pp. 358-368, doi: 10.1109/MSR52588.2021.00048

Data balancing improves self-admitted technical debt detection

Author: Sridharan, Murali¹; Mäntylä, Mika¹; Rantala, Leevi¹;
Organizations: ¹M3S, ITEE, University of Oulu, Oulu, Finland
Format: article
Version: accepted version
Access: open
Online Access: PDF Full Text (PDF, 0.3 MB)
Language: English
Published: Institute of Electrical and Electronics Engineers, 2021
Publish Date: 2021-10-21


Source code comments that admit technical debt are far outnumbered by comments that do not. This imbalance affects Self-Admitted Technical Debt (SATD) detection performance, and existing literature lacks empirical evidence on the choice of balancing technique. In this work, we evaluate the impact of multiple balancing techniques, including Data level, Classifier level, and Hybrid, for SATD detection in Within-Project and Cross-Project setups. Our results show that the Data level balancing technique SMOTE or the Classifier level Ensemble approaches (Random Forest or XGBoost) are reasonable choices, depending on whether the goal is to maximize Precision, Recall, F1, or AUC-ROC. We compared our best-performing model with the previous SATD detection benchmark (a cost-sensitive Convolutional Neural Network). Interestingly, the top-performing XGBoost with SMOTE sampling improved the Within-Project F1 score by 10% but fell short in the Cross-Project setup by 9%. This supports the higher generalization capability of deep learning in Cross-Project SATD detection; within individual projects, however, classical machine learning algorithms can deliver better performance. We also evaluate and quantify the impact of duplicate source code comments on SATD detection performance. Finally, we employ SHAP and discuss the interpreted SATD features. We have included the replication package and shared a web-based SATD prediction tool with the balancing techniques in this study.
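To illustrate the kind of Data level balancing the abstract refers to, the following is a minimal sketch of the general SMOTE idea: new minority-class points are created by interpolating between a minority sample and one of its nearest minority neighbours. It is not taken from the authors' replication package. The function names `smote` and `balance`, the parameters `k` and `seed`, and the toy two-dimensional feature vectors are all assumptions made for illustration; how the study turns comments into features is not described in this record.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the base point, excluding itself
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def balance(X, y, minority_label=1, k=5, seed=0):
    """Oversample the minority class until both classes have equal size."""
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_majority = len(y) - len(minority)
    new_points = smote(minority, n_majority - len(minority), k=k, seed=seed)
    return X + new_points, y + [minority_label] * len(new_points)


# Hypothetical toy data: label 1 stands for SATD comments, label 0 for the rest.
X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3],
     [0.1, 0.0], [0.2, 0.3], [1.0, 1.0], [1.2, 0.9]]
y = [0, 0, 0, 0, 0, 0, 1, 1]

X_bal, y_bal = balance(X, y)
print(y_bal.count(0), y_bal.count(1))
```

In practice, oversampling of this kind is normally applied to the training split only, so that synthetic points do not leak into the evaluation data.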


Series: IEEE International Working Conference on Mining Software Repositories
ISSN: 2160-1852
ISSN-E: 2160-1860
ISSN-L: 2160-1852
ISBN: 978-1-7281-8710-5
ISBN Print: 978-1-6654-2985-6
Pages: 358 - 368
Article number: 9463080
DOI: 10.1109/MSR52588.2021.00048
Host publication: 18th IEEE/ACM International Conference on Mining Software Repositories, MSR 2021
Conference: IEEE/ACM International Conference on Mining Software Repositories
Type of Publication: A4 Article in conference proceedings
Field of Science: 113 Computer and information sciences
Funding: The authors acknowledge the financial support by the Academy of Finland (grant ID 328058) and computational infrastructure by CSC Finland.
Academy of Finland Grant Number: 328058
Detailed Information: 328058 (Academy of Finland Funding decision)
Copyright information: © 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.