This web page presents the supplementary material for the poster "The Buggy Side of Code Refactoring: Understanding the Relationship between Refactorings and Bugs".

Code refactoring is widely practiced by software developers. There is an explicit assumption that code refactoring improves the structural quality of a software project, thereby also reducing its bug proneness. However, refactoring is applied for different purposes in practice. Depending on the complexity of certain refactorings, developers might unconsciously make the source code more susceptible to bugs. In this paper, we present a longitudinal study of 5 Java open source projects, in which 20,689 refactorings and 1,033 bug reports were analyzed. We found that many bugs are introduced in refactored code as soon as the first immediate change is made to it. Furthermore, code elements affected by refactorings performed in conjunction with other changes are more prone to bugs than those affected by pure refactorings.

The figure below illustrates the study phases designed to investigate our research question. We describe each phase as follows.

Figure 1. Study Phases

Phase 1: Select Software Projects

This phase consists of selecting a set of software projects for analysis. We relied on open source projects hosted on GitHub and selected 5 projects that satisfy three criteria. First, they are highly popular on GitHub and come from different domains. Second, their users actively employ issue tracking systems, such as Bugzilla and the GitHub issue tracker, for bug reporting and improvement suggestions. Third, at least 90% of each code repository is written in Java, a very popular language. Table 1 provides general data about the analyzed projects: the name of the software project, the number of lines of code (LOC), the number of classes, the analyzed period, the number of commits, and the number of bug reports.

Table 1. General data of the analyzed software projects
Software Project | LOC | #Classes | Analyzed Period | #Commits | #Bug Reports
Ant | 137,314 | 1,784 | 2000-01 to 2016-07 | 13,331 | 70
Derby | 1,760,766 | 3,741 | 2004-08 to 2016-12 | 8,135 | 173
OkHttp | 49,739 | 642 | 2011-05 to 2016-08 | 2,645 | 270
Presto | 350,976 | 4,146 | 2012-08 to 2016-08 | 8,056 | 296
Tomcat | 668,720 | 2,275 | 2006-03 to 2016-12 | 296 | 282

Phase 2: Identify Refactorings

We chose to study the 11 refactoring types most commonly investigated in the literature (Murphy-Hill, Parnin, and Black, 2012). These refactoring types are defined in Fowler's catalog (Fowler, 1999). We used the Refactoring Miner tool (Tsantalis et al., 2013) to identify refactoring operations in the selected projects. Tsantalis et al. report that Refactoring Miner has a precision of 96.4% and low false positive rates for all refactoring types, which we confirmed in our own validation, as discussed in Phase 3. Refactoring Miner detects all 11 refactoring types investigated in our study. In total, we identified 20,689 refactoring operations. Table 2 presents the refactoring types analyzed in our study: the first column names the refactoring type, the second describes the problem each type is intended to address, and the third describes the intended solution. A sketch of the mining step follows the table.

Table 2. Refactoring types analyzed in this study (extracted from Fowler, 1999)
Refactoring Type | Problem | Solution
Extract Method | Parts of the code should be gathered in a single method | Create a new method with the extracted code
Extract Interface | Several clients use the same subset of a class's interface, or two classes have part of their interfaces in common | Extract the subset into an interface
Extract Superclass | There are two classes with similar features | Create a superclass and move the common features to the superclass
Inline Method | A method body is more obvious than the method itself | Replace calls to the method with the method's content and delete the method itself
Move Field | A field is, or will be, used by another class more than the class in which it is defined | Create a new field in the target class and change all its users
Move Method | A method is, or will be, using or used by more features of another class than the class in which it is defined | Create a new method with a similar body in the class it uses most; either turn the old method into a simple delegation, or remove it altogether
Rename Method | The name of a method does not reveal its purpose | Change the name of the method
Pull Up Field | Two subclasses have the same field | Move the field to the superclass
Pull Up Method | There are methods with identical results on subclasses | Move them to the superclass
Push Down Field | A field is used only by some subclasses | Move the field to those subclasses
Push Down Method | The behavior on a superclass is relevant only for some of its subclasses | Move it to those subclasses
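For illustration, the sketch below shows how the mining step can be scripted against RefactoringMiner's public Java API. This is a minimal sketch based on the tool's currently documented interface; the API of the exact version used in the study may differ, and the repository path, URL, and branch are placeholders.

```java
import java.util.List;

import org.eclipse.jgit.lib.Repository;
import org.refactoringminer.api.GitHistoryRefactoringMiner;
import org.refactoringminer.api.GitService;
import org.refactoringminer.api.Refactoring;
import org.refactoringminer.api.RefactoringHandler;
import org.refactoringminer.rm1.GitHistoryRefactoringMinerImpl;
import org.refactoringminer.util.GitServiceImpl;

public class MineRefactorings {
    public static void main(String[] args) throws Exception {
        GitService gitService = new GitServiceImpl();
        // Placeholder repository: any Git project cloned locally works here.
        Repository repo = gitService.cloneIfNotExists(
                "tmp/ant", "https://github.com/apache/ant.git");

        GitHistoryRefactoringMiner miner = new GitHistoryRefactoringMinerImpl();
        // Walk the full history of the given branch and report every
        // refactoring operation detected in each commit.
        miner.detectAll(repo, "master", new RefactoringHandler() {
            @Override
            public void handle(String commitId, List<Refactoring> refactorings) {
                for (Refactoring r : refactorings) {
                    System.out.println(commitId + " -> " + r.toString());
                }
            }
        });
    }
}
```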


In this work, we consider as refactored elements all code elements directly affected by a refactoring. If a refactoring is applied only to a method body, only that method is considered a refactored element. For instance, consider the Move Method refactoring, in which a method m is moved from class A to class B. The refactored elements in this case are {m, A, B}. All callers of m are affected by this refactoring, but we do not consider them refactored elements. As another example, consider the Rename Method refactoring, in which a new name is given to a method m; here, the refactored element set is just {m}. Each refactoring type thus has its own refactored element set. Table 3 presents the refactored elements considered for each refactoring type; a short code illustration follows the table.

Table 3. Refactored Elements
Refactoring | Refactored Elements
Extract Interface | Classes implementing the new interface.
Extract Method | (i) the method created; (ii) the method from which the new method was extracted; and (iii) the class containing both methods.
Extract Superclass | (i) classes extending the new class; and (ii) the new class created.
Inline Method | (i) the method that received the inlined code; and (ii) the class containing the method.
Move Field | The two classes affected by the change: the class in which the field used to reside and the class that received the field.
Move Method | The two classes affected by the change: the class in which the method used to reside and the class that received the method.
Pull Up Field | The two classes affected by the change: the class in which the field used to reside and the class that received the field.
Pull Up Method | The two classes affected by the change: the class in which the method used to reside and the class that received the method.
Push Down Field | The two classes affected by the change: the class in which the field used to reside and the class that received the field.
Push Down Method | The two classes affected by the change: the class in which the method used to reside and the class that received the method.
Rename Method | The renamed method and the class that contains it.
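To make the notion of refactored elements concrete, the hypothetical Java snippet below sketches a Move Method refactoring: method m moves from class A to class B, so the refactored element set is {m, A, B}, while callers of m are affected but not counted as refactored elements.

```java
// Before the refactoring: method m resides in class A; class B is the target.
class A {
    int m(int x) { return x * 2; }
}
class B { }

// After the refactoring (classes renamed only to keep this snippet compilable):
// m now resides in B, and A keeps a simple delegation.
// Refactored elements: {m, A, B}.
class RefactoredA {
    private final RefactoredB b = new RefactoredB();
    int m(int x) { return b.m(x); } // callers of m are affected, not refactored
}
class RefactoredB {
    int m(int x) { return x * 2; }
}
```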

Phase 3: Manually Validate Refactorings and Classify by Tactic

We conducted a manual validation of the refactorings identified by the Refactoring Miner tool to ensure the reliability of our data. The validation covered a random sample of refactoring operations from the different refactoring types, since the precision of Refactoring Miner may vary with the detection rules implemented for each type. We recruited ten undergraduate students to analyze the samples. The samples were divided into ten disjoint sets, and each student validated a different one. Applying a statistical test with a confidence level of 95%, we observed a high precision of the tool for each refactoring type, with a median of 88.36%. By applying the Grubbs outlier test (Grubbs, 1969) with alpha = 0.05, we found no outliers, indicating that no single refactoring type strongly influences the median precision. These results support the reliability of the findings reported in this study.
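To illustrate the outlier check, the sketch below computes the Grubbs test statistic and its standard t-distribution critical value (via Apache Commons Math) over per-type precision values. The numbers are hypothetical, not the study's measurements.

```java
import org.apache.commons.math3.distribution.TDistribution;
import org.apache.commons.math3.stat.descriptive.DescriptiveStatistics;

public class GrubbsCheck {
    // Two-sided Grubbs test: returns true if the most extreme value
    // is an outlier at significance level alpha.
    static boolean hasOutlier(double[] values, double alpha) {
        DescriptiveStatistics stats = new DescriptiveStatistics(values);
        double mean = stats.getMean();
        double sd = stats.getStandardDeviation(); // sample standard deviation
        int n = values.length;

        // Test statistic: largest absolute deviation from the mean, in sd units.
        double g = 0.0;
        for (double v : values) {
            g = Math.max(g, Math.abs(v - mean) / sd);
        }

        // Critical value from the t-distribution with n-2 degrees of freedom.
        TDistribution t = new TDistribution(n - 2);
        double tCrit = t.inverseCumulativeProbability(1.0 - alpha / (2.0 * n));
        double gCrit = ((n - 1) / Math.sqrt(n))
                * Math.sqrt(tCrit * tCrit / (n - 2 + tCrit * tCrit));
        return g > gCrit;
    }

    public static void main(String[] args) {
        // Hypothetical per-refactoring-type precision values (one per type).
        double[] precisions = {0.92, 0.88, 0.85, 0.90, 0.86, 0.89,
                               0.84, 0.91, 0.87, 0.88, 0.93};
        System.out.println("Outlier present: " + hasOutlier(precisions, 0.05));
    }
}
```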

We also classified each refactoring by tactic, distinguishing root-canal from floss refactoring, based on a manual inspection of a randomly selected sample of 2,119 refactorings. We manually analyzed whether the changes performed alongside each refactoring modify behavior. We classified a change as floss refactoring when it contains behavioral changes, such as added methods or modifications to method bodies that are unrelated to the refactoring transformations. When we did not identify behavioral changes, the refactoring was classified as root-canal. This inspection was performed by three researchers, two of whom are very experienced refactoring researchers; the most experienced one resolved conflicts. As a result, we found that developers apply root-canal refactoring in 31.5% of the cases (95% confidence level, 5% confidence interval).

Phase 4: Collect Bug Reports

We selected bug reports with status resolved fixed, verified fixed, closed, or closed fixed for analysis. Furthermore, we analyzed only issues labeled as bug in the issue tracking system. Table 1 presents the number of bug reports for each software project (column #Bug Reports).

Phase 5: Identify the Bug-fix Commit, Bug-fix Elements, and Bug-inducing Commit

A common practice among developers is to include the bug report number in the commit message whenever they fix the associated bug (Śliwerski, Zimmermann, and Zeller, 2005). Thus, to map a bug report to its fix commit, we automatically searched log messages for references to bug reports, such as "bug 23442" or "fix for bug 23442", as proposed by Dallmeier and Zimmermann (2007). We ignored bug reports for which we could not find the fix commit because, without it, we cannot identify the fixed files (Ye, Bunescu, and Liu, 2014). We consider as buggy elements all code elements modified in the fix commit.
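The sketch below illustrates this mapping step under simplifying assumptions: it scans commit log messages with a regular expression for references such as "bug 23442". The message data and the pattern are illustrative; the heuristics of Dallmeier and Zimmermann (2007) are more elaborate.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BugFixCommitMapper {
    // Matches references such as "bug 23442" or "fix for bug 23442".
    private static final Pattern BUG_REF =
            Pattern.compile("(?i)\\b(?:fix(?:ed|es)?\\s+(?:for\\s+)?)?bug\\s+#?(\\d+)");

    // Maps each referenced bug report id to the first commit mentioning it.
    static Map<String, String> mapBugsToCommits(Map<String, String> commitMessages) {
        Map<String, String> bugToCommit = new LinkedHashMap<>();
        commitMessages.forEach((sha, message) -> {
            Matcher m = BUG_REF.matcher(message);
            while (m.find()) {
                bugToCommit.putIfAbsent(m.group(1), sha);
            }
        });
        return bugToCommit;
    }

    public static void main(String[] args) {
        Map<String, String> log = new LinkedHashMap<>();
        log.put("a1b2c3", "fix for bug 23442: NPE in parser");
        log.put("d4e5f6", "refactor build scripts");
        System.out.println(mapBugsToCommits(log)); // {23442=a1b2c3}
    }
}
```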

Given the bug-fix commit and the bug-fix elements, we used the bug-introducing change identification algorithm proposed by Śliwerski, Zimmermann, and Zeller (the SZZ algorithm) to identify when the bug was introduced in the project. SZZ is currently the most used algorithm for automatically identifying fix-inducing commits (da Costa et al., 2017). SZZ identifies the lines modified in a bug-fixing commit and then traces each of these lines back to the last change that touched it before the fix. As the original version of SZZ may produce false positives and false negatives, we used a combination of the heuristics proposed by Kim et al. (2006) and Williams and Spacco (2008). Kim et al. mention two limitations of the original SZZ: (i) not all changes are fixes, i.e., even if a file change is flagged as a bug fix by developers, not all hunks in the change are bug fixes; and (ii) bug tracking systems do not always contain enough information, so an incorrect bug-inducing commit may be chosen. Their approach removes 38-51% of false positives and 14% of false negatives compared with the original implementation of SZZ. SZZ outputs a list of commits related to the introduction of the bug in the software system. The results provided by SZZ are used to compute the distance between the refactored commit and the commit where the bug was introduced (see Phase 7). For analysis purposes, we considered only the newest commit reported by SZZ.
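The sketch below illustrates the core SZZ step under simplifying assumptions: given a file and the line numbers (relative to the fix commit's parent) touched by the bug-fixing commit, it shells out to git blame to recover the last commit that modified each line, i.e., a candidate bug-inducing commit. Diff extraction and the refinements of Kim et al. and Williams and Spacco are omitted.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class SzzSketch {
    // For each modified line, ask git blame (at the fix commit's parent)
    // which commit last touched that line: a candidate bug-inducing commit.
    static Set<String> blameLines(String repoDir, String parentSha,
                                  String file, List<Integer> lines) throws Exception {
        Set<String> inducing = new LinkedHashSet<>();
        for (int line : lines) {
            Process p = new ProcessBuilder(
                    "git", "blame", "-l", "-L", line + "," + line,
                    parentSha, "--", file)
                    .directory(new java.io.File(repoDir))
                    .start();
            try (BufferedReader r = new BufferedReader(
                    new InputStreamReader(p.getInputStream()))) {
                String out = r.readLine();
                if (out != null) {
                    inducing.add(out.split("\\s+")[0]); // first token is the SHA
                }
            }
            p.waitFor();
        }
        return inducing;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical inputs: repository path, parent of the fix commit,
        // a fixed file, and the line numbers changed by the fix.
        System.out.println(blameLines("/path/to/repo", "abc123",
                "src/Foo.java", List.of(10, 42)));
    }
}
```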

Phase 6: Manually Validate Bugs

Previous research (Herzig, Just, and Zeller, 2013) shows that bug report classifications are unreliable. Thus, we manually classified the bug reports of Apache Tomcat, Apache Derby, and Apache Ant to identify which ones actually represent bugs. This classification was performed in pairs by 14 researchers. Each member of a pair independently classified the same bug report as "bug" or "not bug"; when their opinions diverged, the pair discussed the case and agreed on a final classification. In the final analysis, we considered only bug reports that represent actual bugs. We manually validated 1,477 bug reports, of which 516 (35%) were classified as "bug" and 961 (65%) as "not bug".

Phase 7: Compute the Distance in Number of Changes

To answer our RQ, we computed the distance, in number of changes, between the refactored commit and the bug-inducing commit. To do so, we took into account only commits in which the buggy element was touched by any change.
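A minimal sketch of this computation follows, assuming we already have, for a given code element, the chronologically ordered list of commits that touched it. The distance is the number of such commits strictly between the refactoring commit and the bug-inducing commit, mirroring the example discussed in Phase 8.

```java
import java.util.List;

public class ChangeDistance {
    // Distance(r, b): number of commits touching the element strictly
    // between the refactoring commit r and the bug-inducing commit b.
    static int distance(List<String> touchingCommits,
                        String refactoring, String bugInducing) {
        int r = touchingCommits.indexOf(refactoring);
        int b = touchingCommits.indexOf(bugInducing);
        if (r < 0 || b < 0 || b <= r) {
            return -1; // element history lacks the expected commit pair
        }
        return b - r - 1;
    }

    public static void main(String[] args) {
        // Commits 1, 3, 5, and 10 touched method X: the bug in commit 10
        // is at distance 2 from the refactoring in commit 1 (cf. Figure 2).
        List<String> history = List.of("c1", "c3", "c5", "c10");
        System.out.println(distance(history, "c1", "c10")); // prints 2
    }
}
```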

Phase 8: Compute Quartiles

To measure the bug proneness of refactored code elements, we computed quartiles of the distance values and observed how close to a refactoring operation a bug appears according to this distance classification. Figure 2 presents an example. In the figure, method X was refactored in commit 1 and presented a bug in commit 10. From commit 1 to commit 10, method X was changed twice (in commits 3 and 5). Thus, the distance between the refactored commit and the bug-inducing commit, Distance(r, b), is equal to 2; in this case, the bug is close to the refactored commit. In our RQ, we also analyze the bug proneness of each refactoring tactic, namely root-canal and floss refactoring. In the end, we compare whether root-canal refactoring is more bug-prone than floss refactoring.

Figure 2. Example of bug proneness
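For completeness, the sketch below computes quartiles over a set of hypothetical distance values using linear interpolation between closest ranks; the study's exact quantile convention is an assumption here.

```java
import java.util.Arrays;

public class DistanceQuartiles {
    // Percentile with linear interpolation between closest ranks.
    static double percentile(double[] sorted, double p) {
        double rank = p * (sorted.length - 1);
        int lo = (int) Math.floor(rank);
        int hi = (int) Math.ceil(rank);
        return sorted[lo] + (rank - lo) * (sorted[hi] - sorted[lo]);
    }

    public static void main(String[] args) {
        // Hypothetical distances between refactored and bug-inducing commits.
        double[] distances = {0, 0, 1, 2, 2, 3, 5, 8, 13};
        Arrays.sort(distances);
        System.out.printf("Q1=%.1f Q2=%.1f Q3=%.1f%n",
                percentile(distances, 0.25),
                percentile(distances, 0.50),
                percentile(distances, 0.75));
    }
}
```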
# | Artifact | Description
1 | Distances by Project | The complete list of all relationships between refactorings and bugs analyzed in this study; there is one file per software project.
2 | Submitted Paper | Complete text submitted to ICSE 2018

For any questions or suggestions, please contact the authors of this work.

# | Name | E-mail
1 | Isabella Ferreira | iferreira@inf.puc-rio.br
2 | Eduardo Fernandes | emfernandes@inf.puc-rio.br
3 | Diego Cedrim | dcgrego@inf.puc-rio.br
4 | Anderson Uchôa | auchoa@inf.puc-rio.br
5 | Ana Carla Bibiano | abibiano@inf.puc-rio.br
6 | Alessandro Garcia | afgarcia@inf.puc-rio.br
7 | João Lucas Correia | jlmc@ic.ufal.br
8 | Filipe Santos | filipebatista@ic.ufal.br
9 | Gabriel Nunes | gabrielnunes@ic.ufal.br
10 | Caio Barbosa | cbvs@ic.ufal.br
11 | Baldoino Fonseca | baldoino@ic.ufal.br
12 | Rafael de Mello | rmaiani@inf.puc-rio.br
  1. Murphy-Hill, Emerson, Chris Parnin, and Andrew P. Black. "How we refactor, and how we know it." IEEE Transactions on Software Engineering 38.1 (2012): 5-18.
  2. Fowler, Martin, and Kent Beck. Refactoring: improving the design of existing code. Addison-Wesley Professional, 1999.
  3. Tsantalis, Nikolaos, et al. "A multidimensional empirical study on refactoring activity." Proceedings of the 2013 Conference of the Center for Advanced Studies on Collaborative Research. IBM Corp., 2013.
  4. Grubbs, Frank E. "Procedures for detecting outlying observations in samples." Technometrics 11.1 (1969): 1-21.
  5. Śliwerski, Jacek, Thomas Zimmermann, and Andreas Zeller. "When do changes induce fixes?" ACM SIGSOFT Software Engineering Notes. Vol. 30. No. 4. ACM, 2005.
  6. Dallmeier, Valentin, and Thomas Zimmermann. "Extraction of bug localization benchmarks from history." Proceedings of the twenty-second IEEE/ACM international conference on Automated software engineering. ACM, 2007.
  7. Ye, Xin, Razvan Bunescu, and Chang Liu. "Learning to rank relevant files for bug reports using domain knowledge." Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. ACM, 2014.
  8. da Costa, Daniel Alencar, et al. "A framework for evaluating the results of the SZZ approach for identifying bug-introducing changes." IEEE Transactions on Software Engineering 43.7 (2017): 641-657.
  9. Kim, Sunghun, et al. "Automatic identification of bug-introducing changes." Proceedings of the 21st IEEE/ACM International Conference on Automated Software Engineering (ASE 2006). IEEE, 2006.
  10. Williams, Chadd, and Jaime Spacco. "SZZ revisited: verifying when changes induce fixes." Proceedings of the 2008 Workshop on Defects in Large Software Systems. ACM, 2008.
  11. Herzig, Kim, Sascha Just, and Andreas Zeller. "It's not a bug, it's a feature: how misclassification impacts bug prediction." Proceedings of the 2013 international conference on software engineering. IEEE Press, 2013.