Falessi, D., Huang, J., Narayana, L. et al. On the need of preserving order of data when validating within-project defect classifiers. Empir Software Eng 25, 4805–4830 (2020). https://doi.org/10.1007/s10664-020-09868-x
On the need of preserving order of data when validating within-project defect classifiers
|Author:||Falessi, Davide1; Huang, Jacky2; Narayana, Likhita2;|
1University of Rome “Tor Vergata”, Rome, Italy
2California Polytechnic State University, San Luis Obispo, CA, USA
3Monash University, Melbourne, Australia
4University of Oulu, Oulu, Finland
|Persistent link:|| http://urn.fi/urn:nbn:fi-fe202102195393
|Publish Date:|| 2021-02-19
We take the perspective of a practitioner who uses data from previous project releases to predict which classes of the current release are defect-prone. In this scenario, the practitioner would like to use the most accurate classifier among the many available. A validation technique, hereinafter “technique”, defines how to measure the prediction accuracy of a classifier. Several previous studies have compared validation techniques. However, no previous study compared validation techniques in the within-project across-release class-level context or considered techniques that preserve the order of data. In this paper, we investigate which technique recommends the most accurate classifier. We use the last release of a project as the ground truth to evaluate the classifier’s accuracy and hence the ability of a technique to recommend an accurate classifier. We consider nine classifiers, two industry and 13 open-source projects, and three validation techniques: 10-fold cross-validation (i.e., the most used technique), bootstrap (i.e., the recommended technique), and walk-forward (i.e., a technique preserving the order of data). Our results show that: 1) classifiers differ in accuracy in all datasets regardless of their entity per value, 2) walk-forward statistically outperforms both 10-fold cross-validation and bootstrap in all three accuracy metrics: AUC of the selected classifier, bias, and absolute bias, 3) surprisingly, all techniques proved more prone to overestimating than to underestimating the performance of classifiers, and 4) the defect rate changed between the first and second half in both industry projects and 83% of open-source datasets.
This study recommends techniques that preserve the order of data, such as walk-forward, over 10-fold cross-validation and bootstrap in the within-project across-release class-level context, given the above empirical results and given that walk-forward is by nature simpler, less expensive, and more stable than the other two techniques.
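To make the idea of an order-preserving technique concrete, the following is a minimal sketch of how walk-forward splits could be generated over a project's releases. It is an illustration under stated assumptions, not the authors' implementation: the function name `walk_forward_splits` and the release labels are hypothetical, and the sketch assumes each run trains on all releases older than the one being tested.

```python
from typing import Iterator, List, Sequence, Tuple


def walk_forward_splits(
    releases: Sequence[str],
) -> Iterator[Tuple[List[str], str]]:
    """Yield (training releases, test release) pairs in chronological order.

    Assumes `releases` is sorted from oldest to newest. Each release after
    the first is used once as the test set, and only older releases are
    used for training, so no future data reaches the classifier.
    """
    for i in range(1, len(releases)):
        yield list(releases[:i]), releases[i]


# Hypothetical release labels, oldest first.
releases = ["r1", "r2", "r3", "r4"]
splits = list(walk_forward_splits(releases))
for train, test in splits:
    print(f"train on {train}, test on {test}")
```

Unlike 10-fold cross-validation or bootstrap, which sample training and test data without regard to time, every split here tests on data that is strictly newer than the data used for training.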
Empirical software engineering
|Pages:||4805 - 4830|
|Type of Publication:|| A1 Journal article – refereed
|Field of Science:|| 113 Computer and information sciences
Open access funding provided by Università degli Studi di Roma Tor Vergata within the CRUI-CARE Agreement.
© The Author(s) 2020, corrected publication 2020. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.