University of Oulu

Herbold, S., Trautsch, A., Ledel, B. et al. A fine-grained data set and analysis of tangling in bug fixing commits. Empir Software Eng 27, 125 (2022). https://doi.org/10.1007/s10664-021-10083-5

A fine-grained data set and analysis of tangling in bug fixing commits

Author: Herbold, Steffen1; Trautsch, Alexander2; Ledel, Benjamin1;
Organizations: 1Institute for Software and Systems Engineering, TU Clausthal, Clausthal-Zellerfeld, Germany
2Institute of Computer Science, University of Goettingen, Goettingen, Germany
3Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
4School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, Canada
5Department of Computer Science, Guru Nanak Dev University, Amritsar, India
6Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany
7Brunel University London, Uxbridge, UK
8University of Tehran, Tehran, Iran
9Ericsson Hungary ltd., Budapest, Hungary
10Simula Research Laboratory, Oslo, Norway
11Technical University of Košice, Košice, Slovakia
12LUT University, Lappeenranta, Finland
13National University of Defense Technology, Changsha, China
14University of British Columbia, Kelowna, Canada
15Østfold University College, Halden, Norway
16Vrije Universiteit Amsterdam, Amsterdam, Netherlands
17University of Auckland, Auckland, New Zealand
18University of Saskatchewan, Saskatoon, Canada
19IBM, Boulder, CO, USA
20University of Missouri-Kansas City, Kansas City, MO, USA
21Università della Svizzera italiana, Lugano, Switzerland
22Trent University, Peterborough, Canada
23Imam Abdulrahman Bin Faisal University, Dammam, Saudi Arabia
24The Hebrew University/Acumen, Jerusalem, Israel
25University of Oulu, Oulu, Finland
26Monash University, Melbourne, Australia
27University of Würzburg, Würzburg, Germany
28Technische Universität Darmstadt, Darmstadt, Germany
29University of Tennessee, Knoxville, TN, USA
30Department of Applied Informatics, University of Macedonia, Thessaloniki, Greece
31Department of Computer Engineering, Ankara, Turkey
32University of Melbourne, Melbourne, Australia
33Department of Business Informatics and Operations Management, Ghent University, Ghent, Belgium
34TomTom B.V., Amsterdam, Netherlands
35University of Stuttgart, Stuttgart, Germany
36Purdue University, West Lafayette, IN, USA
37Eindhoven University of Technology, Eindhoven, Netherlands
38Softtech Inc., Research and Development Center, 34947, Istanbul, Turkey
39Radboud University, Nijmegen, Netherlands
Format: article
Version: published version
Access: open
Online Access: PDF Full Text (PDF, 2.2 MB)
Persistent link: http://urn.fi/urn:nbn:fi-fe2022111065139
Language: English
Published: Springer Nature, 2022
Publish Date: 2022-11-10
Description:

Abstract

Context: Tangled commits are changes to software that address multiple concerns at once. For researchers interested in bugs, tangled commits mean that they actually study not only bugs, but also other concerns irrelevant for the study of bugs.

Objectives: We want to improve our understanding of the prevalence of tangling and the types of changes that are tangled within bug fixing commits.

Methods: We use a crowd sourcing approach for manual labeling to validate which changes contribute to bug fixes for each line in bug fixing commits. Each line is labeled by four participants. If at least three participants agree on the same label, we have consensus.
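The consensus rule described under Methods can be sketched in a few lines. This is a minimal illustration, not the authors' tooling: the function name `consensus_label`, the `min_agreement` parameter, and the label strings are hypothetical; only the rule itself (four labels per line, consensus when at least three agree) comes from the abstract.

```python
from collections import Counter
from typing import Optional, Sequence


def consensus_label(labels: Sequence[str], min_agreement: int = 3) -> Optional[str]:
    """Return the label that at least `min_agreement` participants chose, else None.

    Assumption: one label per participant for a single changed line.
    With four participants and a threshold of three, at most one label
    can reach the threshold, so the result is unambiguous.
    """
    if not labels:
        return None
    # most_common(1) yields the single most frequent label and its count.
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= min_agreement else None


# Hypothetical label names, for illustration only.
agreed = consensus_label(["bugfix", "bugfix", "bugfix", "test"])
disputed = consensus_label(["bugfix", "bugfix", "refactoring", "test"])
print(agreed)    # three of four participants agree
print(disputed)  # no label reaches three votes
```

A line for which the function returns `None` has no consensus; the abstract does not state how such lines are handled beyond their contribution to the reported uncertainty.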

Results: We estimate that between 17% and 32% of all changes in bug fixing commits modify the source code to fix the underlying problem. However, when we only consider changes to the production code files, this ratio increases to 66% to 87%. We find that about 11% of lines are hard to label, leading to active disagreements between participants. Due to confirmed tangling and the uncertainty in our data, we estimate that 3% to 47% of data is noisy without manual untangling, depending on the use case.

Conclusions: Tangled commits have a high prevalence in bug fixes and can lead to a large amount of noise in the data. Prior research indicates that this noise may alter results. As researchers, we should be skeptics and assume that unvalidated data is likely very noisy, until proven otherwise.

Series: Empirical software engineering
ISSN: 1382-3256
ISSN-E: 1573-7616
ISSN-L: 1382-3256
Volume: 27
Issue: 6
Article number: 125
DOI: 10.1007/s10664-021-10083-5
OADOI: https://oadoi.org/10.1007/s10664-021-10083-5
Type of Publication: A1 Journal article – refereed
Field of Science: 213 Electronic, automation and communications engineering, electronics
Funding: Alexander Trautsch and Benjamin Ledel and the development of the infrastructure required for this research project were funded by DFG Grant 402774445. Ivan Pashchenko was partially funded by the H2020 AssureMOSS project (Grant No. 952647). Open Access funding enabled and organized by Projekt DEAL.
Copyright information: © The Author(s) 2022. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.