This web page presents the supplementary material for the poster "The Buggy Side of Code Refactoring: Understanding the Relationship between Refactorings and Bugs".

Code refactoring is widely practiced by software developers. There is an explicit assumption that code refactoring improves the structural quality of a software project, thereby also reducing its bug proneness. However, refactoring is applied for different purposes in practice. Depending on the complexity of certain refactorings, developers might unconsciously make the source code more susceptible to bugs. In this paper, we present a longitudinal study of 5 Java open source projects, in which 20,689 refactorings and 1,033 bug reports were analyzed. We found that many bugs are introduced in refactored code as soon as the first immediate change is made to it. Furthermore, code elements affected by refactorings performed in conjunction with other changes are more prone to bugs than those affected by pure refactorings.

The figure below illustrates the study phases designed to investigate our research question. We describe each phase as follows.

Figure 1. Study Phases

Phase 1: Select Software Projects

This phase consists of selecting a set of software projects for analysis. We relied on open source projects hosted on GitHub and selected 5 projects that satisfy three criteria. First, they are highly popular on GitHub and come from different domains. Second, their users actively employ issue tracking systems, such as Bugzilla and the GitHub issue tracker, for bug reporting and improvement suggestions. Third, at least 90% of each code repository is written in Java, a very popular language. Table 1 provides general data about the analyzed projects: the name of the software project, the number of lines of code (LOC), the number of classes, the analyzed period, the number of commits, and the number of bug reports.

Table 1. General data of the analyzed software projects
Software Project | LOC | #Classes | Analyzed Period | #Commits | #Bug Reports
Ant | 137,314 | 1,784 | 2000-01 to 2016-07 | 13,331 | 70
Derby | 1,760,766 | 3,741 | 2004-08 to 2016-12 | 8,135 | 173
OkHttp | 49,739 | 642 | 2011-05 to 2016-08 | 2,645 | 270
Presto | 350,976 | 4,146 | 2012-08 to 2016-08 | 8,056 | 296
Tomcat | 668,720 | 2,275 | 2006-03 to 2016-12 | 296 | 282

Phase 2: Identify Refactorings

We chose to study the 11 refactoring types most commonly investigated in the literature (Murphy-Hill, Parnin, and Black, 2012). These refactoring types are defined in Fowler's catalog (Fowler, 1999). We used the Refactoring Miner tool (Tsantalis et al., 2013) to identify refactoring operations in the selected projects. Tsantalis et al. report that Refactoring Miner has a precision of 96.4% and low false positive rates for all refactoring types, which we confirmed in our own validation, as discussed in Phase 3. Refactoring Miner detects all 11 refactoring types investigated in our study. In total, we identified 20,689 refactoring operations. Table 2 presents the refactoring types analyzed in our study: the first column names the refactoring type, the second describes the problem each type is intended to address, and the third describes the intended solution. A sketch of the mining step follows the table.

Table 2. Refactoring types analyzed in this study (extracted from Fowler, 1999)
Refactoring Type | Problem | Solution
Extract Method | Parts of the code should be gathered in a single method | Create a new method with the extracted code
Extract Interface | Several clients use the same subset of a class's interface, or two classes have part of their interfaces in common | Extract the subset into an interface
Extract Superclass | There are two classes with similar features | Create a superclass and move the common features to the superclass
Inline Method | A method body is more obvious than the method itself | Replace calls to the method with the method's content and delete the method itself
Move Field | A field is, or will be, used by another class more than the class in which it is defined | Create a new field in the target class and change all its users
Move Method | A method is, or will be, using or used by more features of another class than the class in which it is defined | Create a new method with a similar body in the class it uses most; either turn the old method into a simple delegation, or remove it altogether
Rename Method | The name of a method does not reveal its purpose | Change the name of the method
Pull Up Field | Two subclasses have the same field | Move the field to the superclass
Pull Up Method | There are methods with identical results on subclasses | Move them to the superclass
Push Down Field | A field is used only by some subclasses | Move the field to those subclasses
Push Down Method | The behavior on a superclass is relevant only for some of its subclasses | Move it to those subclasses
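For illustration, the sketch below shows how the mining step can be scripted against RefactoringMiner's public Java API. This is a minimal sketch based on the tool's currently documented interface; the API of the exact version used in the study may differ, and the repository path, URL, and branch are placeholders.

```java
import java.util.List;

import org.eclipse.jgit.lib.Repository;
import org.refactoringminer.api.GitHistoryRefactoringMiner;
import org.refactoringminer.api.GitService;
import org.refactoringminer.api.Refactoring;
import org.refactoringminer.api.RefactoringHandler;
import org.refactoringminer.rm1.GitHistoryRefactoringMinerImpl;
import org.refactoringminer.util.GitServiceImpl;

public class MineRefactorings {
    public static void main(String[] args) throws Exception {
        GitService gitService = new GitServiceImpl();
        // Placeholder repository: any Git project cloned locally works here.
        Repository repo = gitService.cloneIfNotExists(
                "tmp/ant", "https://github.com/apache/ant.git");

        GitHistoryRefactoringMiner miner = new GitHistoryRefactoringMinerImpl();
        // Walk the full history of the given branch and report every
        // refactoring operation detected in each commit.
        miner.detectAll(repo, "master", new RefactoringHandler() {
            @Override
            public void handle(String commitId, List<Refactoring> refactorings) {
                for (Refactoring r : refactorings) {
                    System.out.println(commitId + " -> " + r.toString());
                }
            }
        });
    }
}
```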


In this work, we consider as refactored elements all code elements directly affected by a refactoring. If a refactoring is applied only to a method body, only that method is considered a refactored element. For instance, consider the Move Method refactoring, in which a method m is moved from class A to class B. The refactored elements in this case are {m, A, B}. All callers of m are affected by this refactoring, but we do not consider them refactored elements. As another example, consider the Rename Method refactoring, in which a new name is given to a method m; here, the refactored element set is just {m}. Each refactoring type thus has its own refactored element set. Table 3 presents the refactored elements considered for each refactoring type; a short code illustration follows the table.

Table 3. Refactored Elements
Refactoring | Refactored Elements
Extract Interface | Classes implementing the new interface.
Extract Method | (i) the method created; (ii) the method from which the new method was extracted; and (iii) the class containing both methods.
Extract Superclass | (i) classes extending the new class; and (ii) the new class created.
Inline Method | (i) the method that received the inlined code; and (ii) the class containing the method.
Move Field | The two classes affected by the change: the class in which the field used to reside and the class that received the field.
Move Method | The two classes affected by the change: the class in which the method used to reside and the class that received the method.
Pull Up Field | The two classes affected by the change: the class in which the field used to reside and the class that received the field.
Pull Up Method | The two classes affected by the change: the class in which the method used to reside and the class that received the method.
Push Down Field | The two classes affected by the change: the class in which the field used to reside and the class that received the field.
Push Down Method | The two classes affected by the change: the class in which the method used to reside and the class that received the method.
Rename Method | The renamed method and the class that contains it.
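To make the notion of refactored elements concrete, the hypothetical Java snippet below sketches a Move Method refactoring: method m moves from class A to class B, so the refactored element set is {m, A, B}, while callers of m are affected but not counted as refactored elements.

```java
// Before the refactoring: method m resides in class A; class B is the target.
class A {
    int m(int x) { return x * 2; }
}
class B { }

// After the refactoring (classes renamed only to keep this snippet compilable):
// m now resides in B, and A keeps a simple delegation.
// Refactored elements: {m, A, B}.
class RefactoredA {
    private final RefactoredB b = new RefactoredB();
    int m(int x) { return b.m(x); } // callers of m are affected, not refactored
}
class RefactoredB {
    int m(int x) { return x * 2; }
}
```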

Phase 3: Manually Validate Refactorings and Classify by Tactic

We conducted a manual validation of the refactorings identified by the Refactoring Miner tool to ensure the reliability of our data. The validation covered a random sample of refactoring operations from the different refactoring types, since the precision of Refactoring Miner may vary with the detection rules implemented for each type. We recruited ten undergraduate students to analyze the samples. The samples were divided into ten disjoint sets, and each student validated a different one. Applying a statistical test with a confidence level of 95%, we observed a high precision of the tool for each refactoring type, with a median of 88.36%. By applying the Grubbs outlier test (Grubbs, 1969) with alpha = 0.05, we found no outliers, indicating that no single refactoring type strongly influences the median precision. These results support the reliability of the findings reported in this study.
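To illustrate the outlier check, the sketch below computes the Grubbs test statistic and its standard t-distribution critical value (via Apache Commons Math) over per-type precision values. The numbers are hypothetical, not the study's measurements.

```java
import org.apache.commons.math3.distribution.TDistribution;
import org.apache.commons.math3.stat.descriptive.DescriptiveStatistics;

public class GrubbsCheck {
    // Two-sided Grubbs test: returns true if the most extreme value
    // is an outlier at significance level alpha.
    static boolean hasOutlier(double[] values, double alpha) {
        DescriptiveStatistics stats = new DescriptiveStatistics(values);
        double mean = stats.getMean();
        double sd = stats.getStandardDeviation(); // sample standard deviation
        int n = values.length;

        // Test statistic: largest absolute deviation from the mean, in sd units.
        double g = 0.0;
        for (double v : values) {
            g = Math.max(g, Math.abs(v - mean) / sd);
        }

        // Critical value from the t-distribution with n-2 degrees of freedom.
        TDistribution t = new TDistribution(n - 2);
        double tCrit = t.inverseCumulativeProbability(1.0 - alpha / (2.0 * n));
        double gCrit = ((n - 1) / Math.sqrt(n))
                * Math.sqrt(tCrit * tCrit / (n - 2 + tCrit * tCrit));
        return g > gCrit;
    }

    public static void main(String[] args) {
        // Hypothetical per-refactoring-type precision values (one per type).
        double[] precisions = {0.92, 0.88, 0.85, 0.90, 0.86, 0.89,
                               0.84, 0.91, 0.87, 0.88, 0.93};
        System.out.println("Outlier present: " + hasOutlier(precisions, 0.05));
    }
}
```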

We also classified each refactoring by tactic, distinguishing root-canal from floss refactoring, based on a manual inspection of a randomly selected sample of 2,119 refactorings. We manually analyzed whether the changes performed alongside each refactoring modify behavior. We classified a change as floss refactoring when it contains behavioral changes, such as added methods or modifications to method bodies that are unrelated to the refactoring transformations. When we did not identify behavioral changes, the refactoring was classified as root-canal. This inspection was performed by three researchers, two of whom are very experienced refactoring researchers; the most experienced one resolved conflicts. As a result, we found that developers apply root-canal refactoring in 31.5% of the cases (95% confidence level, 5% confidence interval).

Phase 4: Collect Bug Reports

We selected bug reports with status resolved fixed, verified fixed, closed, or closed fixed for analysis. Furthermore, we analyzed only issues labeled as bug in the issue tracking system. Table 1 presents the number of bug reports for each software project (column #Bug Reports).

Phase 5: Identify the Bug-fix Commit, Bug-fix Elements, and Bug-inducing Commit

A common practice among developers is to include the bug report number in the commit message whenever they fix the associated bug (Śliwerski, Zimmermann, and Zeller, 2005). Thus, to map a bug report to its fix commit, we automatically searched log messages for references to bug reports, such as "bug 23442" or "fix for bug 23442", as proposed by Dallmeier and Zimmermann (2007). We ignored bug reports for which we could not find the fix commit because, without it, we cannot identify the fixed files (Ye, Bunescu, and Liu, 2014). We consider as buggy elements all code elements modified in the fix commit.
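The sketch below illustrates this mapping step under simplifying assumptions: it scans commit log messages with a regular expression for references such as "bug 23442". The message data and the pattern are illustrative; the heuristics of Dallmeier and Zimmermann (2007) are more elaborate.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BugFixCommitMapper {
    // Matches references such as "bug 23442" or "fix for bug 23442".
    private static final Pattern BUG_REF =
            Pattern.compile("(?i)\\b(?:fix(?:ed|es)?\\s+(?:for\\s+)?)?bug\\s+#?(\\d+)");

    // Maps each referenced bug report id to the first commit mentioning it.
    static Map<String, String> mapBugsToCommits(Map<String, String> commitMessages) {
        Map<String, String> bugToCommit = new LinkedHashMap<>();
        commitMessages.forEach((sha, message) -> {
            Matcher m = BUG_REF.matcher(message);
            while (m.find()) {
                bugToCommit.putIfAbsent(m.group(1), sha);
            }
        });
        return bugToCommit;
    }

    public static void main(String[] args) {
        Map<String, String> log = new LinkedHashMap<>();
        log.put("a1b2c3", "fix for bug 23442: NPE in parser");
        log.put("d4e5f6", "refactor build scripts");
        System.out.println(mapBugsToCommits(log)); // {23442=a1b2c3}
    }
}
```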

Given the bug-fix commit and the bug-fix elements, we used the bug-introducing change identification algorithm proposed by Śliwerski, Zimmermann, and Zeller (the SZZ algorithm) to identify when the bug was introduced in the project. SZZ is currently the most used algorithm for automatically identifying fix-inducing commits (da Costa et al., 2017). SZZ identifies the lines modified in a bug-fixing commit and then traces each of these lines back to the last change that touched it before the fix. As the original version of SZZ may produce false positives and false negatives, we used a combination of the heuristics proposed by Kim et al. (2006) and Williams and Spacco (2008). Kim et al. mention two limitations of the original SZZ: (i) not all changes are fixes, i.e., even if a file change is flagged as a bug fix by developers, not all hunks in the change are bug fixes; and (ii) bug tracking systems do not always contain enough information, so an incorrect bug-inducing commit may be chosen. Their approach removes 38-51% of false positives and 14% of false negatives compared with the original implementation of SZZ. SZZ outputs a list of commits related to the introduction of the bug in the software system. The results provided by SZZ are used to compute the distance between the refactored commit and the commit where the bug was introduced (see Phase 7). For analysis purposes, we considered only the newest commit reported by SZZ.
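The sketch below illustrates the core SZZ step under simplifying assumptions: given a file and the line numbers (relative to the fix commit's parent) touched by the bug-fixing commit, it shells out to git blame to recover the last commit that modified each line, i.e., a candidate bug-inducing commit. Diff extraction and the refinements of Kim et al. and Williams and Spacco are omitted.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class SzzSketch {
    // For each modified line, ask git blame (at the fix commit's parent)
    // which commit last touched that line: a candidate bug-inducing commit.
    static Set<String> blameLines(String repoDir, String parentSha,
                                  String file, List<Integer> lines) throws Exception {
        Set<String> inducing = new LinkedHashSet<>();
        for (int line : lines) {
            Process p = new ProcessBuilder(
                    "git", "blame", "-l", "-L", line + "," + line,
                    parentSha, "--", file)
                    .directory(new java.io.File(repoDir))
                    .start();
            try (BufferedReader r = new BufferedReader(
                    new InputStreamReader(p.getInputStream()))) {
                String out = r.readLine();
                if (out != null) {
                    inducing.add(out.split("\\s+")[0]); // first token is the SHA
                }
            }
            p.waitFor();
        }
        return inducing;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical inputs: repository path, parent of the fix commit,
        // a fixed file, and the line numbers changed by the fix.
        System.out.println(blameLines("/path/to/repo", "abc123",
                "src/Foo.java", List.of(10, 42)));
    }
}
```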

Phase 6: Manually Validate Bugs

Previous research (Herzig, Just, and Zeller, 2013) shows that bug report classifications are unreliable. Thus, we manually classified the bug reports of Apache Tomcat, Apache Derby, and Apache Ant to identify which ones actually represent bugs. This classification was performed in pairs by 14 researchers. Each member of a pair independently classified the same bug report as "bug" or "not bug"; when their opinions diverged, the pair discussed the case and agreed on a final classification. In the final analysis, we considered only bug reports that represent actual bugs. We manually validated 1,477 bug reports, of which 516 (35%) were classified as "bug" and 961 (65%) as "not bug".

Phase 7: Compute the Distance in Number of Changes

To answer our RQ, we computed the distance, in number of changes, between the refactored commit and the bug-inducing commit. To do so, we took into account only commits in which the buggy element was touched by any change.
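A minimal sketch of this computation follows, assuming we already have, for a given code element, the chronologically ordered list of commits that touched it. The distance is the number of such commits strictly between the refactoring commit and the bug-inducing commit, mirroring the example discussed in Phase 8.

```java
import java.util.List;

public class ChangeDistance {
    // Distance(r, b): number of commits touching the element strictly
    // between the refactoring commit r and the bug-inducing commit b.
    static int distance(List<String> touchingCommits,
                        String refactoring, String bugInducing) {
        int r = touchingCommits.indexOf(refactoring);
        int b = touchingCommits.indexOf(bugInducing);
        if (r < 0 || b < 0 || b <= r) {
            return -1; // element history lacks the expected commit pair
        }
        return b - r - 1;
    }

    public static void main(String[] args) {
        // Commits 1, 3, 5, and 10 touched method X: the bug in commit 10
        // is at distance 2 from the refactoring in commit 1 (cf. Figure 2).
        List<String> history = List.of("c1", "c3", "c5", "c10");
        System.out.println(distance(history, "c1", "c10")); // prints 2
    }
}
```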

Phase 8: Compute Quartiles

To measure the bug proneness of refactored code elements, we computed quartiles of the distance values and observed how close to a refactoring operation a bug appears according to this distance classification. Figure 2 presents an example. In the figure, method X was refactored in commit 1 and presented a bug in commit 10. From commit 1 to commit 10, method X was changed twice (in commits 3 and 5). Thus, the distance between the refactored commit and the bug-inducing commit, Distance(r, b), is equal to 2; in this case, the bug is close to the refactored commit. In our RQ, we also analyze the bug proneness of each refactoring tactic, namely root-canal and floss refactoring. In the end, we compare whether root-canal refactoring is more bug-prone than floss refactoring.

Figure 2. Example of bug proneness
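For completeness, the sketch below computes quartiles over a set of hypothetical distance values using linear interpolation between closest ranks; the study's exact quantile convention is an assumption here.

```java
import java.util.Arrays;

public class DistanceQuartiles {
    // Percentile with linear interpolation between closest ranks.
    static double percentile(double[] sorted, double p) {
        double rank = p * (sorted.length - 1);
        int lo = (int) Math.floor(rank);
        int hi = (int) Math.ceil(rank);
        return sorted[lo] + (rank - lo) * (sorted[hi] - sorted[lo]);
    }

    public static void main(String[] args) {
        // Hypothetical distances between refactored and bug-inducing commits.
        double[] distances = {0, 0, 1, 2, 2, 3, 5, 8, 13};
        Arrays.sort(distances);
        System.out.printf("Q1=%.1f Q2=%.1f Q3=%.1f%n",
                percentile(distances, 0.25),
                percentile(distances, 0.50),
                percentile(distances, 0.75));
    }
}
```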
# | Artifact | Description
1 | Distances by Project | The complete list of all relationships between refactorings and bugs analyzed in this study; there is one file per software project.
2 | Submitted Paper | Complete text submitted to ICSE 2018

For any questions or suggestions, please contact the authors of this work.

# | Name | E-mail
1 | Isabella Ferreira | iferreira@inf.puc-rio.br
2 | Eduardo Fernandes | emfernandes@inf.puc-rio.br
3 | Diego Cedrim | dcgrego@inf.puc-rio.br
4 | Anderson Uchôa | auchoa@inf.puc-rio.br
5 | Ana Carla Bibiano | abibiano@inf.puc-rio.br
6 | Alessandro Garcia | afgarcia@inf.puc-rio.br
7 | João Lucas Correia | jlmc@ic.ufal.br
8 | Filipe Santos | filipebatista@ic.ufal.br
9 | Gabriel Nunes | gabrielnunes@ic.ufal.br
10 | Caio Barbosa | cbvs@ic.ufal.br
11 | Baldoino Fonseca | baldoino@ic.ufal.br
12 | Rafael de Mello | rmaiani@inf.puc-rio.br
  1. Murphy-Hill, Emerson, Chris Parnin, and Andrew P. Black. "How we refactor, and how we know it." IEEE Transactions on Software Engineering 38.1 (2012): 5-18.
  2. Fowler, Martin, and Kent Beck. Refactoring: improving the design of existing code. Addison-Wesley Professional, 1999.
  3. Tsantalis, Nikolaos, et al. "A multidimensional empirical study on refactoring activity." Proceedings of the 2013 Conference of the Center for Advanced Studies on Collaborative Research. IBM Corp., 2013.
  4. Grubbs, Frank E. "Procedures for detecting outlying observations in samples." Technometrics 11.1 (1969): 1-21.
  5. Śliwerski, Jacek, Thomas Zimmermann, and Andreas Zeller. "When do changes induce fixes?" ACM SIGSOFT Software Engineering Notes. Vol. 30. No. 4. ACM, 2005.
  6. Dallmeier, Valentin, and Thomas Zimmermann. "Extraction of bug localization benchmarks from history." Proceedings of the twenty-second IEEE/ACM international conference on Automated software engineering. ACM, 2007.
  7. Ye, Xin, Razvan Bunescu, and Chang Liu. "Learning to rank relevant files for bug reports using domain knowledge." Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. ACM, 2014.
  8. da Costa, Daniel Alencar, et al. "A framework for evaluating the results of the SZZ approach for identifying bug-introducing changes." IEEE Transactions on Software Engineering 43.7 (2017): 641-657.
  9. Kim, Sunghun, et al. "Automatic identification of bug-introducing changes." Proceedings of the 21st IEEE/ACM International Conference on Automated Software Engineering (ASE 2006). IEEE, 2006.
  10. Williams, Chadd, and Jaime Spacco. "SZZ revisited: verifying when changes induce fixes." Proceedings of the 2008 Workshop on Defects in Large Software Systems. ACM, 2008.
  11. Herzig, Kim, Sascha Just, and Andreas Zeller. "It's not a bug, it's a feature: how misclassification impacts bug prediction." Proceedings of the 2013 international conference on software engineering. IEEE Press, 2013.