A fine-grained data set and analysis of tangling in bug fixing commits

Herbold, Steffen; Trautsch, Alexander; Ledel, Benjamin; Aghamohammadi, Alireza; Ghaleb, Taher A.; Chahal, Kuljit Kaur; Bossenmaier, Tim; Nagaria, Bhaveet; Makedonski, Philip; Ahmadabadi, Matin Nili; Szabados, Kristof; Spieker, Helge; Madeja, Matej; Hoy, Nathaniel; Lenarduzzi, Valentina; Wang, Shangwen; Rodríguez-Pérez, Gema; Colomo-Palacios, Ricardo; Verdecchia, Roberto; Singh, Paramvir; Qin, Yihao; Chakroborti, Debasish; Davis, Willard; Walunj, Vijay; Wu, Hongjun; Marcilio, Diego; Alam, Omar; Aldaeej, Abdullah; Amit, Idan; Turhan, Burak; Eismann, Simon; Wickert, Anna-Katharina; Malavolta, Ivano; Sulír, Matúš; Fard, Fatemeh; Henley, Austin Z.; Kourtzanidis, Stratos; Tuzun, Eray; Treude, Christoph; Shamasbi, Simin Maleki; Pashchenko, Ivan; Wyrich, Marvin; Davis, James; Serebrenik, Alexander; Albrecht, Ella; Aktas, Ethem Utku; Strüber, Daniel; Erbel, Johannes

URL:

https://doi.org/10.1007/s10664-021-10083-5

Springer Nature

Herbold, S., Trautsch, A., Ledel, B. et al. A fine-grained data set and analysis of tangling in bug fixing commits. Empir Software Eng 27, 125 (2022). https://doi.org/10.1007/s10664-021-10083-5

https://creativecommons.org/licenses/by/4.0/
© The Author(s) 2022. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

The permanent address of this publication is
https://urn.fi/URN:NBN:fi-fe2022111065139

Abstract

Context: Tangled commits are changes to software that address multiple concerns at once. For researchers interested in bugs, tangled commits mean that they actually study not only bugs, but also other concerns irrelevant for the study of bugs.

Objectives: We want to improve our understanding of the prevalence of tangling and the types of changes that are tangled within bug fixing commits.

Methods: We use a crowdsourcing approach to manual labeling in order to validate, for each line in bug fixing commits, which changes contribute to the bug fix. Each line is labeled by four participants. If at least three participants agree on the same label, we have consensus.
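The consensus rule described in the Methods can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `consensus_label`, the parameter `min_agreement`, and the label strings are hypothetical and chosen only for illustration. The sketch assumes each line carries exactly the labels assigned by its participants and that a line without sufficient agreement simply has no consensus.

```python
from collections import Counter
from typing import Optional, Sequence


def consensus_label(labels: Sequence[str], min_agreement: int = 3) -> Optional[str]:
    """Return the label chosen by at least `min_agreement` participants.

    Returns None when no label reaches the threshold (no consensus).
    """
    if not labels:
        return None
    # most_common(1) yields the single most frequent label and its count.
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= min_agreement else None


# Hypothetical label names; four participants per line, as in the abstract.
print(consensus_label(["bugfix", "bugfix", "bugfix", "test"]))       # bugfix
print(consensus_label(["bugfix", "bugfix", "test", "refactoring"]))  # None
```

With four labels per line and a threshold of three, at most one label can reach the threshold, so the result is unambiguous whenever consensus exists.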

Results: We estimate that between 17% and 32% of all changes in bug fixing commits modify the source code to fix the underlying problem. However, when we consider only changes to production code files, this ratio increases to between 66% and 87%. We find that about 11% of lines are hard to label, leading to active disagreements between participants. Due to confirmed tangling and the uncertainty in our data, we estimate that 3% to 47% of the data is noisy without manual untangling, depending on the use case.

Conclusions: Tangled commits have a high prevalence in bug fixes and can lead to a large amount of noise in the data. Prior research indicates that this noise may alter results. As researchers, we should be skeptics and assume that unvalidated data is likely very noisy, until proven otherwise.

Unless otherwise noted, this material is licensed under https://creativecommons.org/licenses/by/4.0/