IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 49, NO. 12, DECEMBER 2023 5103

An Empirical Study of Refactoring Rhythms and Tactics in the Software Development Process
Shayan Noei, Heng Li, Stefanos Georgiou, and Ying Zou, Senior Member, IEEE

Abstract—It is critical for developers to develop high-quality software to reduce maintenance cost. Developers often apply refactoring practices to make source code readable and maintainable without impacting the software functionality. Existing studies identify development rhythms (i.e., weekly development patterns) and their relationship with various metrics, such as productivity. However, existing studies focus entirely on development rhythms. There is no study on refactoring rhythms and their relationship with code quality. Moreover, the existing studies categorize the refactoring tactics (i.e., long-term refactoring patterns) into two general concepts of consistent and inconsistent refactoring. Nevertheless, the existence of other tactics and their relationship with code quality is not explored. In this paper, we conduct an empirical study on the refactoring practices of 196 Apache projects in the early, middle, and late stages of development. We aim to identify (1) existing refactoring rhythms, (2) further refactoring tactics, and (3) the relationship between the identified tactics and rhythms with code quality. The recognition of existing refactoring strategies and their relationship with code quality can assist practitioners in recognizing and applying the appropriate and high-quality refactoring rhythms or tactics to deliver a higher quality of software. We find two frequently used refactoring rhythms: work-day refactoring and all-day refactoring. We also identify two variations each of the floss and root canal refactoring tactics: intermittent root canal, intermittent spiked floss, frequent spiked floss, and frequent root canal. We find that root canal-based tactics are correlated with less increase in the code smells (i.e., higher quality code) compared to floss-based tactics. Moreover, we find that refactoring rhythms are not significantly correlated with the quality of the code. Furthermore, we provide detailed information on the relationship of each refactoring tactic to each code smell type.

Index Terms—Refactoring, code quality, code smells, refactoring rhythms, refactoring tactics.

I. INTRODUCTION

REFACTORING is a systematic process of improving the internal quality of software without affecting its functionalities [1]. Many studies show that refactoring is a widely engaged part of the software maintenance process [2], [3], [4], [5]. Refactoring facilitates the extensibility and maintainability of a software system [6], [7], [8]. Various reasons drive developers' refactoring activities, such as improving software design [9], making software systems easier to understand [10], enhancing reusability [11], removing dependencies among attributes, methods, classes, interfaces, and packages [12], as well as eliminating code smells [12], [13], [14], [15], [16], [17].

A code smell is a design flaw that violates the principles of design standards and impairs the maintainability of software [18], [19]. As a result, a code smell can affect the internal quality of software. For instance, a broken hierarchy code smell happens when a subtype and its supertype do not share an "IS-A" relationship [20]. The presence of code smells is a good indicator for code quality checks [18], and multiple refactoring operations are typically needed to eliminate code smells [17].

Several factors contribute to the quantity of refactoring operations performed to improve code quality, such as developer perceptions, team experience, development schedule, software characteristics, and so forth [21]. Different teams may apply different refactoring strategies in short-term or long-term periods [22], [23]. Identifying patterns in refactoring practices and their relationship with code quality can help software developers adopt the most suitable patterns in their projects. More specifically, refactoring patterns have two perspectives: (1) refactoring rhythms and (2) refactoring tactics. Refactoring rhythms describe how refactoring operations are split across the days of the week and usually focus on existing tasks. Refactoring tactics refer to long-term refactoring that is more focused on future development [24].

In the context of refactoring rhythms, existing studies focus on development rhythms and categorize development rhythms as work-day (i.e., Monday to Friday) development and all-day development. Moreover, they correlate development rhythms with measures such as task performance and productivity [25], [26], [27]. However, to the best of our knowledge, no study has investigated the identification of refactoring rhythms and their relationship with code quality.

In terms of refactoring tactics, existing studies divide refactoring tactics into floss and root canal [1], [28], [29]. Floss refactoring is distinguished by frequent refactoring along with the development process. Root canal refactoring is identified by occasional refactoring aside from the development process. While the terms floss and root canal are widely used as refactoring tactics, the existence of other possible tactics and their relationship with code quality is not explored in the existing work.

Manuscript received 20 April 2023; revised 31 August 2023; accepted 10 October 2023. Date of publication 10 November 2023; date of current version 12 December 2023. Recommended for acceptance by J. I. Maletic. (Corresponding author: Shayan Noei.)
Shayan Noei, Stefanos Georgiou, and Ying Zou are with Queen's University, Kingston, Ontario K7L 3N6, Canada.
Heng Li is with Polytechnique Montréal, Montréal, Quebec H3T 1J4, Canada.
Digital Object Identifier 10.1109/TSE.2023.3326775

0098-5589 © 2023 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: Institute of Software. Downloaded on June 06,2024 at 09:17:21 UTC from IEEE Xplore. Restrictions apply.

Fig. 1. An overview of our study.

In this work, we study developers' refactoring activities (rhythms and tactics) and their impact on code quality in terms of code smells. To identify refactoring rhythms and other possible refactoring tactics, we study 196 Apache projects and introduce two metrics: daily refactoring density (DRD) and weekly refactoring density (WRD). Using the introduced metrics, we divide each project into daily and weekly time frames to identify frequent refactoring tactics and rhythms during the lifetime of a project. Then, we investigate the relationship between different rhythms and tactics with code quality. Such information can guide developers in selecting the most suitable and high-quality refactoring rhythms or tactics for their projects. To this end, we aim to answer the following research questions:

RQ1: What are the rhythms of refactoring?—To identify existing refactoring rhythms, we analyze if the refactoring rhythms fit into the software development rhythms introduced by previous studies (work-day and all-day). By utilizing the DRD metric and performing the Kruskal-Wallis test [30], we observe that the majority of projects (95%) apply two primary rhythms: (1) work-day refactoring (11%) and (2) all-day refactoring (84%).

RQ2: What are the most frequent refactoring tactics used in projects?—To identify refactoring tactics, we utilize the WRD metric and form a time series of refactoring activities for every project. Using dynamic time warping (DTW), we cluster refactoring time series and we observe four variations of floss and root canal refactoring tactics:
• Intermittent spiked floss: Regular and consistent refactoring with fewer sudden increases (spikes) compared to frequent spiked floss.
• Frequent spiked floss: Consistent refactoring but with more spikes in refactoring density compared to intermittent spiked floss.
• Intermittent root canal: Occasional refactoring at high densities, with most weeks having no refactoring activities.
• Frequent root canal: More frequent refactoring, with more spikes in refactoring density than intermittent root canal, but still with most weeks having no refactoring activities.

RQ3: What is the relationship of different refactoring rhythms and tactics with code quality?—In the first two research questions, we identify frequently used refactoring rhythms and tactics. Furthermore, we are interested in understanding how different rhythms and tactics are associated with code quality improvement (i.e., reducing code smells). To examine the relationship of refactoring rhythms and tactics with code quality metrics, we use a Scott-Knott-ESD [31], [32] test to rank and cluster code smell changes after adopting each refactoring rhythm and tactic. We observe that root canal-based tactics are more targeted refactoring operations and are therefore correlated with more reduction of code smells compared to floss-based tactics, delivering higher quality code. Furthermore, we observe that refactoring rhythms are not significantly correlated with software quality. Consequently, refactoring rhythms are chosen based on the project assets and the development team's comforts. Finally, we provide some guidelines on positive and negative relationships between different refactoring tactics and different types of code smells.

In conclusion, we make the following contributions:
1) We identify refactoring rhythms and tactics that are used in the software development process, which can provide insights for practitioners to understand existing refactoring practices and develop tooling to support such practices.
2) We understand the relationship between the used tactics and rhythms with software quality, which helps developers to adopt the most suitable approach.
The replication package can be accessed online [33].

Paper organization. The remainder of our study is organized as follows. Section II describes the setup of this study. Section III presents our approaches and results for answering our research questions. Section IV discusses the threats to the validity of our findings and Section V provides the implications of our study. Section VI surveys related studies and compares them to our work. Finally, we conclude our paper and present future research directions in Section VII, and acknowledge contributions in Section VIII.

II. EXPERIMENT SETUP

This section presents the setup of our study, including our data collection and data analysis approaches.

A. Overview of Our Approach

Fig. 1 gives an overview of our study. We conduct our research using the projects with a reasonable amount of development activities from the 20-MAD Apache dataset [34]. We extract the refactoring history of these projects and calculate refactoring density metrics along with their lifespan. By


utilizing refactoring density metrics, we compare refactoring distributions at different time periods to answer the first and the second research questions that aim to identify refactoring rhythms and tactics. Furthermore, we use the characteristics of the projects and developers to provide insights into the rationale of using such rhythms and tactics. Finally, by measuring quality changes after adopting each rhythm and tactic, we identify the relationship of the identified rhythms and tactics with the quality of code.

B. Data Collection

To perform our experiments, we perform several steps to collect our dataset.

1) Project Selection and Pre-Processing: The 20-MAD Apache dataset [34] contains information about commits and issues related to 765 Mozilla and Apache projects with a timespan of 20 years. In particular, the dataset contains 3.4M commits, 2.3M issues, and 17.3M issue comments. Considering that Java is one of the most popular programming languages [35], [36] and it is best supported by refactoring extraction tools (e.g., Rminer [14]), we limit our study to Java projects. To select projects with enough data that help identify the different refactoring tactics and rhythms, we exclude the projects that:
• have less than 80% of Java source code;
• have fewer commits than the first quartile of commit counts (i.e., < 1,021 commits); and
• have a short lifespan (i.e., < one year of commit history).
As a result, we obtain 196 Java projects that have sufficient commit history and lifespan for our analysis.

The studied projects in our dataset have varying lifespans and therefore possess different development histories. Furthermore, refactoring habits or requirements may change over time. For example, as a project ages, design issues may be fixed less frequently [37]. Hence, we cannot compare refactoring practices unless we have projects with a similar lifetime. To be able to analyze the refactoring activities in the projects, we partition longer projects into multiple stages. We utilized the first, second, and third quartiles of project ages to establish thresholds and divided all projects into four age groups, with each group containing an equal number of projects. The age groups are categorized as follows: (1) younger than 4.5 years (i.e., the first quartile), (2) between 4.5 (first quartile) and 7 years (second quartile), (3) between 7 (second quartile) and 8.5 years (third quartile), and (4) older than 8.5 years (third quartile) of activities. We excluded projects with less than 4.5 years of activities from our study because such projects have a wide variety of ages, making it difficult to compare similar activities among them, unless they have a similar lifespan. By using the identified age groups, we define different stages of software development: (1) early stage: start of a project until the 4.5th year, (2) middle stage: 4.5th year of a project until the 7th year, and (3) late stage: 7th year of a project until the 8.5th year, which is the longest age considered in all studied projects. As a result, we observe that 50 projects have ages between 4.5 and 7 years, 49 projects have ages between 7 and 8.5 years, and 50 projects have ages of more than 8.5 years. To compare the old projects (e.g., 10 years old) with young ones (e.g., 5 years old), we divide the older projects into two or three stages using the thresholds of the age groups. For instance, if a project has 6 years of activities, it has only the early stage of development (e.g., the first 4.5 years of activities), while a project with 10 years of activities has the early, middle, and late stages of development. Doing so allows us to identify similar rhythms and tactics that might appear among the projects at different stages. We use these three development stages throughout the analyses performed in this study.

2) Refactoring Extraction: To extract the refactoring operations, we use the Rminer 2.0.3 tool [14], which is an AST-based algorithm that finds up to 59 Java refactoring types from the commit history without the need for user-defined thresholds [14], [38]. Furthermore, Rminer outperforms alternative refactoring detection tools and identifies refactoring operations with an overall precision of 99.7% and a recall of 94.2% [14]. To measure the accuracy of the tool on our dataset, with a confidence level of 95% and a margin of error of 5% on the total number of commits, we select 385 commits to perform our manual validation. In each commit, we reviewed all types of refactorings that the tool could identify and determined if and how many of them were present in the results of the tool. We then compared our results with the tool's results. For example, if we identified a pull-up method but the tool did not, we marked it as a tool failure and vice versa. The manual validation is done by one of the authors and one undergraduate computer science student. The results of the manual validation show an overall precision of 97%, recall of 96%, and F1 score of 95%. Furthermore, we calculate Cohen's kappa coefficient [39] from the participants' manual validation results and achieve a score of 0.91, which suggests a strong agreement. Therefore, we run Rminer on every selected project for each commit to extract refactoring activities in the history of development.

3) Code Smells Extraction: To identify the relationship between the refactoring rhythms or tactics and code quality, we need to analyze how these refactoring patterns reduce or increase the frequency of code smells. As we use code smells as code quality indicators, we analyze how refactoring rhythms and tactics are associated with the reduction or increase in the frequency of code smells.

We use the Designite tool [40] to extract code smells. Designite can identify the most types of code smells compared to its alternatives and detects numerous code smells in large codebases [40], [41]. For instance, Arcan [42] and Hotspot detector [43] can detect only 4 types of code smells. Similarly, Jdeodorant [44] detects 5 and Arcade [45] detects 11 types of code smells. Designite can identify 35 code smell types including 7 architecture smells, 18 design smells, and 10 implementation smells as listed in Table I. The different categories of code smells have specific causes and impacts on the software system. Architecture code smells arise from poor software designs within the system architecture and can negatively impact system quality, performance, and lifespan [46]. Design code smells result from inadequate system design and have a negative impact on code quality [19]. Implementation code smells, on the other hand, stem from poor implementation decisions made by contributors and can negatively affect the quality of the code [1]. To address all three types of code smells, refactoring has been

shown to be an effective solution [1], [19], [46], [47]. Being able to identify the various types of code smells allows us to study the impact of different refactoring strategies and techniques in a more comprehensive manner. Moreover, Designite is open-source and can be used on cloned projects at the code level. Therefore, we use Designite to measure code smells at the start and end of each development stage. This helps us evaluate the frequency and types of code smells before and after implementing each refactoring tactic or rhythm. Given that development activities are minimal in the early days of a project, we use the first quartile of the early stage as the starting point and consider it the initial quality point for extracting code smells.

TABLE I
THE LIST OF CODE SMELL METRICS USED IN THE STUDY

| Category | Metrics |
| --- | --- |
| Architecture Smells | Cyclic Dependency, Unstable Dependency, Ambiguous Interface, God Component, Feature Concentration, Scattered Functionality, and Dense Structure |
| Design Smells | Imperative Abstraction, Wide Hierarchy, Broken Modularization, Cyclic Hierarchy, Hub-like Modularization, Multipath Hierarchy, Unnecessary Abstraction, Missing Hierarchy, Multifaceted Abstraction, Feature Envy, Unutilized Abstraction, Rebellious Hierarchy, Deficient Encapsulation, Broken Hierarchy, Unexploited Encapsulation, Insufficient Modularization, Cyclically Dependent Modularization, and Deep Hierarchy |
| Implementation Smells | Long Method, Complex Method, Long Parameter List, Long Identifier, Long Statement, Complex Conditional, Abstract Function Call from Constructor, Empty Catch Clause, Magic Number, and Missing Default |

C. Author and Project Profiles Identification

Different teams may have different author formations with distinct skills. Furthermore, the characteristics of the projects, such as their size, may lead to the adoption of different refactoring strategies. Therefore, it is essential to identify the types of authors and projects in different stages of development to gain insights into the adopted refactoring tactics and rhythms. To this end, we pick a set of metrics to explain the characteristics of the authors and projects. Furthermore, we cluster authors and projects based on the collected metrics and form different profiles for projects and authors. The selected metrics are listed in Tables II and III.

TABLE II
THE LIST OF AUTHOR METRICS USED IN THE STUDY AND THEIR DESCRIPTIONS

| Author Metrics | Description |
| --- | --- |
| Contribution | Defines the code churn of the developer. |
| Timezone | Describes the primary timezone of a developer. |
| Experience | Describes the time that a developer contributes to a project. |
| Commits | Defines the number of commits submitted by a developer. |
| Work Time | Explains the primary time of commit submission (e.g., 14:00). |
| Refactoring Density | Describes the density of contribution toward refactoring. |

TABLE III
THE LIST OF PROJECT METRICS USED IN THE STUDY AND THEIR DESCRIPTIONS

| Project Metrics | Description |
| --- | --- |
| Files | Describes the total number of files. |
| Comments | Defines the lines of comments added to the codebase. |
| Lines of Code | Describes all lines of code written in Java. |
| Contributors | Describes the total number of known contributors. |
| Timezones | Defines the number of different timezones of the developers contributing to the project. |
| Commits | Describes the total number of commits. |
| Age | Defines the length between the first commit and the last commit in days. |
| Stars | Describes the popularity of a project in terms of stars gained. |
| Refactoring Density | Describes the density of refactoring in a repository. |

We collect contribution, timezone, experience, commits, work time, contributors count, and the age of the projects by writing a Python script and traversing the commit logs. To calculate the refactoring density, we used Rminer to obtain the number of refactoring lines and a bash script to calculate the total code churn (i.e., all lines of code added or deleted in a commit). We then divided the number of refactoring lines by the total code churn. Moreover, we obtain information about files, comments, and lines of code from the Cloc tool [48] and we use the GitHub API [49] to fetch stars.

Highly correlated metrics are linearly related and can be expressed by each other. Furthermore, redundant metrics can be derived from other metrics. Having highly correlated or redundant metrics makes it difficult to analyze the impact of the metrics [50]. Thus, we perform correlation analysis and redundancy analysis to remove correlated and redundant metrics.
• Correlation analysis: We find that the author and project profile metrics do not follow a normal distribution, thus we use Spearman's rank correlation coefficient to find the correlation between the computed metrics. A coefficient of > 0.7 represents a high correlation [51]. For each pair of highly correlated metrics, we remove one metric and keep the other in our model. Fig. 2 shows a dendrogram representing the correlation between the project and author metrics.
• Redundancy analysis: R-squared is a measure that shows how much of the variance of a dependent variable can be explained by independent variables [52]. We use an R-squared cut-off of 0.9 to identify redundant metrics that can be estimated from other metrics.

We measure highly correlated and redundant metrics in the project and developer profile metrics and exclude them from the studied metrics. The correlation analysis (Fig. 2-A) reveals that author's commits and experience are highly correlated. Likewise, in project metrics, files, comments, and lines of code are highly correlated (Fig. 2-B). Hence, we remove project's comments, project's lines of code, and author's experience. After removing highly correlated metrics from the clustering set, we performed redundancy analysis, but no redundant metrics were identified.
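The correlation step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' script: the helper names (`average_ranks`, `spearman`, `drop_correlated`), the toy metric values, the use of the absolute coefficient, and the rule of dropping the later metric of each correlated pair are all hypothetical choices. Only the Spearman coefficient and the 0.7 threshold come from the text.

```python
from itertools import combinations
from typing import Dict, List


def average_ranks(values: List[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x: List[float], y: List[float]) -> float:
    """Spearman's rho is the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def drop_correlated(metrics: Dict[str, List[float]], threshold: float = 0.7) -> List[str]:
    """Keep the first metric of every highly correlated pair (assumed tie-break rule)."""
    kept = list(metrics)
    for a, b in combinations(list(metrics), 2):
        if a in kept and b in kept and abs(spearman(metrics[a], metrics[b])) > threshold:
            kept.remove(b)
    return kept


if __name__ == "__main__":
    # Hypothetical project metrics for five projects.
    toy = {
        "files": [10, 20, 30, 40, 50],
        "comments": [5, 9, 20, 22, 31],
        "lines_of_code": [100, 250, 300, 500, 900],
        "stars": [3, 50, 1, 7, 2],
    }
    print(drop_correlated(toy))
```

On this toy input, `comments` and `lines_of_code` rise monotonically with `files`, so they are dropped and `files` and `stars` remain, mirroring the kind of outcome reported for the project metrics.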
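The redundancy analysis can be illustrated in the same spirit: regress one metric on the remaining ones and flag it when R-squared reaches the 0.9 cut-off. The sketch below is a plain ordinary-least-squares fit written from scratch. The function names (`solve`, `r_squared`, `is_redundant`) and the sample data are hypothetical, and the paper does not say which tool or regression form was actually used, so treat the linear model as an assumption.

```python
from typing import List


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("predictors are perfectly collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def r_squared(target: List[float], predictors: List[List[float]]) -> float:
    """R-squared of a linear fit of target on predictors (with intercept).

    Assumes target is not constant and predictors are not collinear.
    """
    rows = [[1.0] + [p[i] for p in predictors] for i in range(len(target))]
    k = len(rows[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, target)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * xj for bj, xj in zip(beta, r)) for r in rows]
    mean_y = sum(target) / len(target)
    ss_res = sum((y - f) ** 2 for y, f in zip(target, fitted))
    ss_tot = sum((y - mean_y) ** 2 for y in target)
    return 1.0 - ss_res / ss_tot


def is_redundant(target: List[float], predictors: List[List[float]],
                 cutoff: float = 0.9) -> bool:
    """Flag a metric as redundant when the others explain at least cutoff of its variance."""
    return r_squared(target, predictors) >= cutoff


if __name__ == "__main__":
    # Hypothetical metrics: c is an exact linear combination of a and b.
    a = [1, 2, 3, 4, 5, 6, 7, 8]
    b = [2, 7, 1, 8, 2, 8, 1, 8]
    c = [3 * x + 2 * y for x, y in zip(a, b)]
    print(is_redundant(c, [a, b]), is_redundant(b, [a]))
```

In the toy data, `c` is fully determined by `a` and `b` and is flagged, while `b` is barely explained by `a` and is kept.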

Fig. 2. The results of the correlation analysis of author and project profile metrics.

After removing the highly correlated and redundant metrics, based on the distribution of each metric in different quartiles, we divide them into four groups and label them as Least, Less, More, and Most [53]. We use k-modes clustering, an extension of the k-means [54] clustering algorithm that is suitable for clustering categorical data, to cluster the labeled metrics into different profiles. We use the elbow method [55] to find the optimal number of clusters (k) and manually validate and check if the clustering results provide distinct centroids. The elbow method involves plotting a graph that displays the number of clusters versus the sum of squared errors (SSE) for each cluster. The optimal number of clusters is identified by the point on the graph where the SSE begins to level off and form an elbow shape [56].

For Author profiles, we identify 3 as the optimal number of clusters with the optimal cost value of 62,872. Therefore, we apply k-modes with 3 clusters (k = 3) and identify three major profiles. Based on the selected metrics (i.e., timezone, contribution, refactoring density, commits, and work time), we label the three identified clusters as main authors, casual contributors, and core authors. The core authors make the most contributions while the casual contributors make the least contributions. The main and core authors are primarily located in America and Western Europe with commits between 12:33 and 17:06 at their local time (14:57-17:06 for main authors and 12:33-14:57 for core authors). Casual contributors are primarily located in North America and commit from 17:06 to 24:00 (midnight).

For the Project profiles, we find that four clusters provide distinct and meaningful groups with the optimal cost value of 919,000. Hence, we apply k-modes clustering with k = 4 and identify four major profiles, namely, vibrant, maintaining, obsolete, and growing projects. Vibrant projects have the most commits, contributors, and stars while obsolete projects experience the least refactoring density along with fewer commits, most age, and more stars. Furthermore, growing projects experience the least refactoring with the least stars while having more commits and the least contributors. Moreover, maintaining projects share the least commits and contributors with the most refactoring density. Tables IV and V summarize the results of author and project clustering.

TABLE IV
THE CLUSTERS IDENTIFIED BY K-MODES CLUSTERING TO IDENTIFY AUTHOR PROFILES

| Label | Timezone | Contribution | Refactoring Density | Commits | Work Time |
| --- | --- | --- | --- | --- | --- |
| Main Authors | Less (-5.00, 0.00] | More (577.00, 9,848.0] | Most (0.19, 1.00] | More (7.0, 63.0] | More (14:57, 17:06) |
| Casual Contributors | Least [-12.00, -5.00] | Least [0.00, 23.00] | Least [0.00, 0.00] | Least [0, 2] | Most (17:06, 24:00] |
| Core Authors | Less (-5.00, 0.00] | Most (9,848.0, ∞) | More (0.06, 0.19] | Most (63.0, ∞) | Less (12:33, 14:57] |

TABLE V
THE CLUSTERS IDENTIFIED BY K-MODES CLUSTERING TO IDENTIFY PROJECT PROFILES

| Label | Files | Contributors | Timezones | Commits | Age | Stars | Refactoring Density |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Vibrant | Most (2157.75, ∞) | Most (44.00, ∞) | Most (10.75, ∞) | Most (3,450.75, ∞) | Less (902.77, 1,684.16] | Most (237.25, ∞) | More (0.15, 0.20] |
| Maintaining | Least [0.00, 530.75] | Least [0.00, 15.00] | Least [0.00, 1.00] | Least [0.00, 951.00] | Less (902.77, 1,684.16] | Less (1.00, 38.00] | Most (0.20, 1.00] |
| Obsolete | Less (530.75, 1,242.00] | More [27.00, 44.00] | More [5.00, 10.75] | Less (951.00, 1,702.00] | Most (2,096.51, ∞) | More (38.00, 237.25] | Least [0.00, 0.10] |
| Growing | More [1242.00, 2157.75] | Least (15.00, 27.00] | Least [0.00, 1.00] | More (1,702.00, 3,450.75] | More (1,684.16, 2,096.51] | Least [0.00, 1.00] | Least (0.00, 0.10] |

D. Research Methods

This section presents the research methods applied to answer the research questions.

1) Refactoring Rhythms Identification: To identify the refactoring rhythms of the projects, we require a measure to describe the amount of refactoring activities (i.e., refactoring churn) applied on each day of the week. As the refactoring activities could be reflected by the amount of code changes caused by refactoring, we use the refactoring churn to quantify the amount of refactoring activities and normalize it by the actual code churn. Therefore, we introduce daily refactoring density (DRD), which indicates the amount of refactoring activities deviated


from the overall development (e.g., the total code churn) of each
day of development. The DRD is calculated as below:
Refactoring churn of the day (i)
DRD(i) = (1)
Total code churn of the day (i)
We measure the daily refactoring densities (i.e., DRD) and compare them throughout the week. We form seven groups, each of which represents one day of the week and contains all refactoring activities that occurred on that particular day. Doing so allows us to find the similarities and differences of refactoring activities from one day to another. As our data does not follow a normal distribution, to measure the significance of the differences or similarities of the measured DRDs among different days, we use the Kruskal-Wallis test, an extension of the Mann–Whitney U test [57] that evaluates whether two or more samples come from the same distribution [30]. The Kruskal-Wallis test does not assume that the data is normally distributed. We use a p-value ≤ 0.05 to decide whether the null hypothesis is rejected, i.e., whether a difference is significant [58]. The null hypothesis states that no significant difference exists between the compared sets of data [59].
2) Refactoring Tactics Identification: Previous studies show that the majority of projects utilize agile methodologies, which require small tasks to be finished within one week of development [58], [60]. To understand the refactoring activities applied by developers in the long run, we need a measure of the amount of refactoring churn relative to the overall code churn in each week of development. Hence, we propose a weekly refactoring density (WRD) metric, which reflects the amount of refactoring activities per week of development. The WRD metric is computed using the following formula:
$$\mathrm{WRD}(i) = \frac{\text{Refactoring churn of the week } (i)}{\text{Code churn of the week } (i)} \tag{2}$$
For each project, we create a time series of WRDs within every stage of development. Each data point of the time series includes a week of development and the corresponding WRD metric for that week. Accordingly, the refactoring time series data depicts refactoring behaviors over time. Therefore, similar refactoring time series between different stages of the projects represent a similar refactoring habit. The created refactoring time series have different lengths and they vary in speed. For instance, the development period of Project A is 10 weeks, while that of Project B is 20 weeks, indicating a variation in their length. Additionally, Project A experiences a refactoring spike in the second week of development, while Project B experiences the same spike in the fifth week of development, indicating a variation in the speed of refactoring activities. Therefore, comparing the refactoring time series with point-to-point measures, such as Euclidean distance [61], could not overcome the limitations of speed and length variation and could not identify similarities optimally. Therefore, we use the dynamic time warping (DTW) algorithm to measure the similarity between the refactoring time series of project stages and identify refactoring tactics. DTW overcomes the limitation of point-to-point comparisons with a step pattern that defines the allowed transitions, and their weights, between pairs of matched points [62]. Moreover, DTW is an algorithm for calculating the similarity between two time series that vary in speed [63].
3) Quality Changes Measurement: For identifying the relationship between the refactoring rhythms or tactics and code quality, we need to analyze how these refactoring rhythms or tactics reduce or increase the frequency of each type of code smell. To do this, we use the boundary points before and after each stage of development. As shown in Fig. 3, we measure the frequency of each code smell type before and after each stage (i.e., between two consecutive stage boundaries) and measure the difference (reduction or increase) in each type of code smell.
Fig. 3. How changes in code smells are calculated after each stage of development in each project.
Since a larger codebase could contain more code smells, we normalize the frequency of code smells by the lines of code (LOC) in the codebase. Additionally, since more code changes (i.e., larger code churn) may lead to larger differences in code smells, we normalize the difference in code smells by the code churn within each stage. Finally, we calculate the differences in each type of code smell and label it according to the identified rhythm and tactic adopted in that stage. We utilize the frequency of code smells at the end of each stage (ECS), the lines of code at the end of each stage (ELC), the initial frequency of code smells in each stage (ICS), the lines of code at the beginning of each stage (ILC), and the total code churn of each stage (CC) to measure the normalized differences of the total frequency of code smells of each stage (CSD). To measure CSD at each stage (i), we use the following equation:
$$\mathrm{CSD}(i) = \frac{\mathrm{ECS}(i)/\mathrm{ELC}(i) - \mathrm{ICS}(i)/\mathrm{ILC}(i)}{\mathrm{CC}(i)/\mathrm{ELC}(i)} \tag{3}$$
An increase in CSD indicates an increase in the number of code smells (i.e., decreased code quality) and a decrease in CSD indicates a decrease in the number of code smells (i.e., increased code quality). To measure the smell difference after utilizing each refactoring rhythm or tactic, we require a multiple comparison method to cluster the identified rhythms or tactics into statistically significant groups and rank them based on the CSD metric (i.e., changes in code quality). Therefore, we use the Scott-Knott-ESD test [31], [32], a multiple comparison method that divides and ranks a set of input distributions into statistically distinct groups [32]. Scott-Knott-ESD is an extension of the Scott-Knott test [64] with the addition of the effect size difference. The effect size examines the strength of the difference between different groups of data [65]. Therefore, we cluster and rank the refactoring rhythms and tactics based on the CSDs and identify the rhythms or tactics leading to a codebase with an increased or decreased amount of code smells.
III. RESULTS
In this section, we provide the motivation, approach, and findings for each of our research questions.
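To make the DTW comparison from Section II.D.2 concrete, the following is a minimal sketch of the textbook dynamic-programming formulation. It assumes an absolute-difference local cost and the basic symmetric step pattern (match, insertion, deletion). The step pattern and weights actually used in the study [62] may differ, and the function name is chosen for this example only.

```python
def dtw_distance(a, b):
    """Classic DTW distance between two numeric sequences.

    Assumptions for this sketch: local cost |x - y| and the basic
    step pattern that allows diagonal, vertical, and horizontal moves.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] is the best alignment cost of a[:i] against b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b waits
                cost[i][j - 1],      # b advances, a waits
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# Two WRD-like series with the same spike at different weeks and
# with different lengths, mirroring the Project A / Project B example.
project_a = [0, 0, 1, 0, 0]
project_b = [0, 0, 0, 0, 1, 0, 0, 0]
distance = dtw_distance(project_a, project_b)
```

Because the warping path can stretch the flat segments, the two example series align without cost even though a point-to-point measure could not compare them directly.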


A. RQ1: What Are the Rhythms of Refactoring?
1) Motivation: In software development, developers can have various working rhythms. For example, some developers prefer to work only on workdays; however, others do not mind working even on weekends. Existing studies focus on development rhythms and categorize development rhythms as work-day (i.e., Monday to Friday) development and all-day (i.e., Monday to Sunday) development [25], [26], [27]. Prior research reports that recovering during the weekend from working on the workdays is correlated with an increase in weekly task performance and personal initiative [25]. Moreover, working overtime is associated with a decrease in productivity [66]. Furthermore, previous studies have suggested that deviating from regular development to perform refactoring may help address unhealthy code and potentially improve code quality [28]. Inspired by prior work, our intuition is that providing dedicated time for refactoring outside of regular development cycles enables developers to focus more on addressing unhealthy code through refactoring, resulting in improved code quality. Thus, we study the refactoring rhythms based on their deviations from the development rhythms. Understanding the refactoring deviations from the regular development rhythms and their relationship with the code quality improvement can assist software teams and developers to (1) understand the existing refactoring rhythms and (2) adopt/apply the most effective refactoring rhythms. In this research question, we investigate and characterize different refactoring rhythms to help developers understand the existing refactoring rhythms and identify which one suits their needs.
2) Approach: As described in Section II.D.1, to identify the refactoring rhythms of the studied projects, we form seven groups and measure DRDs on every day of development. Moreover, we compare DRDs throughout the week to discover refactoring rhythms using the Kruskal-Wallis test [30].
Prior studies [25], [26], [27] divide the software development process into two groups: work-day development and all-day development (including workdays and weekends). To identify the refactoring rhythms adopted in different stages and different projects, we need to compare different days of refactoring. Hence, we group DRDs into seven groups based on the days of the week, where each group represents one day and depicts the overall DRD distribution on the corresponding day of the week. Using the Kruskal-Wallis test, we first identify whether we can fit the majority of the rhythms adopted from the selected project stages into two groups, namely work-day refactoring and all-day refactoring. To do this, we perform two individual tests using the following hypotheses:
• H0-1: Refactoring densities are similar among all days of the week.
• H0-2: Refactoring densities are similar among all workdays of the week.
To this end, after running the first test, we exclude the stages of projects that have a similar distribution of refactoring on all days of the week and perform the second test on the remaining stages. We accept a hypothesis if the p-value is higher than 0.05 and reject it otherwise.
Clustering project and developer profiles in terms of refactoring activities. To understand the distribution and significance of the refactoring rhythms across different project and author profiles, we rank the combinations of the different refactoring rhythms and project or author profiles. Using the Scott-Knott-ESD test [31], [32], we group combinations of refactoring rhythms and project or author profiles into statistically significant clusters. Specifically, we perform two separate Scott-Knott-ESD clustering analyses: (1) for combinations of project profiles and refactoring rhythms, and (2) for combinations of author profiles and refactoring rhythms.
In the clustering method, each clustered item (i.e., a node) represents the distribution of projects or authors that are associated with a specific refactoring rhythm. Moreover, each clustered item is represented by a vector of the same length as the number of projects or authors, with each project or author being assigned a value of 1 if it is associated with the specific project or developer profile and the specific refactoring rhythm, and a value of 0 otherwise. For example, All-day-Vibrant refers to the distribution of the all-day refactoring rhythm across all projects that fall under the vibrant project profile. Each cluster corresponds to a statistically significant distribution of the various combinations of refactoring rhythms and project or author profiles across the dataset. The results of the Scott-Knott-ESD test provide insights into how refactoring rhythms are distributed across different profiles.
Identifying refactoring rhythms characteristics. To study the differences between refactoring operations (e.g., pull up method) performed on weekends and those performed on workdays, we use the Mann–Whitney U test [57] to compare the distribution of each refactoring operation on the weekend and workdays. We utilize Cliff's Delta to measure the effect size of the differences. We consider the operations that obtain a p-value < 0.05 and an effect size > 0.33 [67], indicating a medium or large magnitude of difference, as the operations that are performed significantly differently between the weekends and the workdays.
3) Findings: The majority (95%) of project stages follow one of the work-day or all-day refactoring rhythms. We accept the null hypothesis H0-1 for 84% of the project stages and, among the remaining project stages, the null hypothesis H0-2 for a further 11% of all project stages. For the remaining 5% of the project stages, both null hypotheses H0-1 and H0-2 are rejected. Therefore, our analysis shows that only a few project stages (i.e., 5%) do not follow any of the initial refactoring rhythms. Specifically, we find that 84% of the project stages perform all-day refactoring, whereas 11% of the project stages perform the work-day refactoring rhythm.
In the work-day refactoring rhythm, we observe a significant difference in refactoring densities between workdays and weekends, with the median refactoring densities being higher on workdays compared to weekends, as illustrated in Fig. 4. Additionally, certain types of refactoring are applied differently between weekends and workdays, as shown in Fig. 5-A. These types of refactoring include move class, pull up method, pull up attribute, add attribute annotation, extract interface, add parameter annotation, modify parameter annotation, split
attribute, move and rename attribute, and split parameter. These refactoring actions mainly relate to the class/method level and play a crucial role in shaping the overall system design. Therefore, developers may prefer to perform complex design-level refactorings on workdays, leaving weekends for less risky modifications. We observe that the median lines of code affected by the workday refactoring types (i.e., the refactoring types that are applied more on workdays) are higher compared to weekday refactoring types (i.e., the refactoring types that are applied similarly during workdays and weekends) (Fig. 5-B). Therefore, developers apply more heavy-weight refactoring operations that involve more code changes (e.g., move class) during the workdays compared to the weekends. In contrast, in the all-day refactoring rhythm, the Kruskal-Wallis test results demonstrate that there is no statistically significant difference in the density of refactoring among the different days of the week, and the median refactoring densities are consistent across all days of the week, as depicted in Fig. 4. Additionally, we do not observe a significant difference in the refactoring operations performed on weekends compared to workdays. Therefore, it appears that developers exert an equal amount of effort towards refactoring throughout the week in the all-day rhythm.
Fig. 4. Comparison of refactoring density between work-day and all-day refactoring rhythms.
Fig. 5. The different refactoring operations and the lines of refactored code applied on weekends compared to weekdays in the all-day refactoring rhythm.
Table VI shows the results of the Scott-Knott-ESD test, providing further insights into the relationship between the project and author profiles with their corresponding rhythms:

TABLE VI
SCOTT-KNOTT-ESD TEST RESULTS ON THE REFACTORING RHYTHMS AND ASSOCIATED PROJECT AND AUTHOR PROFILES

Project Profiles
Rhythm     Profile       Cluster Rank   Mean (%)
All-day    Maintaining   1              0.89
All-day    Growing       1              0.83
All-day    Vibrant       1              0.83
All-day    Obsolete      1              0.82
Work-day   Vibrant       2              0.14
Work-day   Obsolete      2              0.12
Work-day   Growing       2              0.11
Work-day   Maintaining   3              0.06

Author Profiles
Rhythm     Profile       Cluster Rank   Mean (%)
All-day    Core          1              0.79
All-day    Main          1              0.80
All-day    Casual        1              0.76
Work-day   Casual        2              0.19
Work-day   Main          2              0.16
Work-day   Core          2              0.15

Among project profiles: In the maintaining, obsolete, growing, and vibrant project stages, the all-day refactoring rhythm (Cluster 1) is used more often, with over 82% utilization, than the work-day refactoring rhythm (Clusters 2 and 3). In vibrant, obsolete, and growing project stages, the work-day refactoring rhythm (Cluster 2) is more frequently used compared to the maintaining project stages. Maintaining project stages experience the most refactoring density and the least number of commits (Table III). Moreover, in maintaining project stages, the usage of the all-day refactoring rhythm is highest, at 89% (Cluster 1), compared to the work-day refactoring rhythm, which is at 6% (Cluster 3). Therefore, in maintaining project stages where there is less development and more refactoring, developers are likely to perform the work-day rhythm less frequently and focus on refactoring whenever they have time. The observation that the distribution of the work-day refactoring rhythm is similar in vibrant, obsolete, and growing projects (Cluster 2), and the distribution of the all-day rhythm is also similar in these project stages (Cluster 1), indicates that the project profile does not have a significant impact on the choice of refactoring rhythm in vibrant, obsolete, and growing project stages.
Among author profiles: Core, main, and casual authors often utilize the all-day refactoring rhythm with a similar distribution (Cluster 1). Moreover, the distribution of the work-day rhythm is similar in all author profiles (i.e., core, main, and casual) (Cluster 2). Since the distributions of the different author profiles are similar within the all-day rhythm and within the work-day rhythm, the choice of a specific refactoring rhythm is not influenced by the type of developer. Therefore, it is likely that authors choose different rhythms based on their preferences.
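The two-step hypothesis test from the RQ1 approach (H0-1 across all seven days, then H0-2 across the five workdays for the remaining stages) can be sketched as follows. This is a minimal standard-library illustration under stated assumptions, not the study's implementation. It uses the usual chi-square approximation for the Kruskal-Wallis p-value with tie correction, assumes every group is non-empty, and follows the text in accepting a hypothesis when the p-value is higher than 0.05. The function names and input layout are hypothetical.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Closed form for even df: exp(-x/2) * sum_{r<df/2} (x/2)^r / r!
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= (x / 2) / r
            total += term
        return math.exp(-x / 2) * total
    # Odd df: normal tail plus a finite series in sqrt(x).
    chi = math.sqrt(x)
    term, total = chi, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    pdf = math.exp(-x / 2) / math.sqrt(2 * math.pi)
    return math.erfc(chi / math.sqrt(2)) + 2 * pdf * total


def kruskal_wallis(groups):
    """Return (H, p) with tie correction; every group must be non-empty."""
    pooled = sorted((v, g) for g, values in enumerate(groups) for v in values)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            rank_sums[pooled[k][1]] += avg_rank
        tie_term += (j - i) ** 3 - (j - i)
        i = j
    h = 12 / (n * (n + 1)) * sum(
        r * r / len(g) for r, g in zip(rank_sums, groups)
    ) - 3 * (n + 1)
    correction = 1 - tie_term / (n ** 3 - n)
    if correction == 0:  # all observations identical
        return 0.0, 1.0
    h /= correction
    return h, chi2_sf(h, len(groups) - 1)


def classify_rhythm(drd_by_weekday, alpha=0.05):
    """Classify one project stage from seven DRD lists (Monday first)."""
    _, p_all_days = kruskal_wallis(drd_by_weekday)
    if p_all_days > alpha:  # H0-1 accepted
        return "all-day"
    _, p_workdays = kruskal_wallis(drd_by_weekday[:5])
    if p_workdays > alpha:  # H0-2 accepted
        return "work-day"
    return "other"
```

A stage whose DRDs are drawn from the same values on every day is labeled all-day, while a stage with uniformly higher densities from Monday to Friday than on the weekend is labeled work-day.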


Software projects follow two major refactoring rhythms: work-day and all-day refactoring. The work-day refactoring rhythm tends to have higher densities of refactoring to the code base from Monday to Friday. In the all-day refactoring rhythm, there is no significant difference in refactoring activities on different days of the week. In maintaining project stages, the all-day refactoring rhythm is more prevalent compared to other project stages. The choice of refactoring rhythms (all-day or work-day) is not influenced by the type of authors.

B. RQ2: What Are the Most Frequent Refactoring Tactics Used in Projects?
1) Motivation: Previous studies have only classified refactoring tactics as either floss or root canal [1], [28], [29]. Floss refactoring is distinguished by frequent refactoring along with the development process. On the other hand, root canal refactoring is identified by occasional refactoring aside from the development process. While the terms floss and root canal tactics have been useful in understanding the general patterns of refactoring, there may be other potential refactoring tactics that have not yet been identified. Moreover, understanding the distinctive features of each refactoring tactic can offer valuable insights into developers' decision-making processes when choosing a particular tactic. Additionally, recognizing various refactoring tactics can establish a common vocabulary for describing them, facilitating communication among practitioners. This, in turn, helps developers comprehend the refactoring tactics they use and choose or switch to the most appropriate tactic for their project. In this research question, by considering different stages of development in different projects, we investigate whether there are more refactoring tactics other than floss and root canal.
2) Approach: As described in Section II.D.2, to understand the refactoring tactics in the studied projects, we first cluster refactoring time series of the project stages (i.e., in terms of WRD). Using the DTW algorithm, we measure the similarity between the WRD time series of each pair of project stages as part of the clustering process.
Clustering common refactoring practices. To identify refactoring tactics, we utilize WRD and form a time series that represents the refactoring history of each project. Subsequently, using DTW we cluster the projects based on the similarities of their refactoring activities represented by the time series. As the selected projects do not share a similar life cycle and they may experience different refactoring practices in different stages of development, we measure the similarities of refactoring activities between projects in different stages of development (i.e., early, middle, late). Therefore, if a project has multiple development stages, we break its time series into multiple smaller time series, each of which represents one stage of the project.
We use Dynamic Time Warping (DTW), a similarity measure for temporal sequences, to cluster refactoring time series into refactoring tactics. We identify the optimal number of clusters using the elbow method [55]. The elbow method measures the sum of squared errors (SSE) for each candidate number of clusters k and selects, as the optimal number of clusters, the smallest value of k at which the SSE begins to level off and form an elbow shape on the graph [56]. Moreover, we manually validate the optimal number of clusters (k) identified by the elbow method and check if our clustering results provide distinct centroids. Based on the elbow curve analysis depicted in Fig. 6, four is identified as the optimal number of clusters. Furthermore, we utilize the silhouette score with the existing criteria [68], [69], [70], [71], [72] of the silhouette method to verify the optimal number of clusters. This identification is based on two conditions: (1) an average silhouette score greater than 0.5 and (2) no cluster whose silhouette scores all fall below the average. The silhouette score analysis points towards the optimal cluster numbers being 3 and 4, yielding average silhouette scores of 0.56 and 0.52, respectively. All clusters in both cases exhibit scores above the average threshold. This observation supports the idea that both 3 and 4 clusters could be considered as optimal solutions. However, as k = 4 leads to a more even distribution of the sizes (i.e., thicknesses) of the clusters [68], we opt for k = 4. In summary, both the elbow curve and silhouette score analyses suggest that 4 clusters are the preferred number of clusters.
Fig. 6. The results of the elbow curve, showing the optimal number of clusters using DTW.
We utilize DTW to identify the similarities of refactoring time series by analyzing all stages of development in all projects together. Analyzing all stages together at the same time allows us to compare and identify the unified common behaviors in all stages of development despite their different life spans. By considering the optimal number of clusters as four and performing DTW, we identify four common behaviors based on the cluster centroids to represent the common tactics as variations of the root canal and floss refactoring tactics.
Identifying refactoring spikes. To provide more insights on the identified tactics, we measure the number of spikes that happen in each refactoring tactic centroid. Our intuition is that a higher number of spikes within a refactoring tactic time series indicates more deviation of refactoring densities from regular refactoring densities. To determine a spike we apply the Median Absolute Deviation (MAD) method. Compared to the standard deviation, MAD is a robust estimator of scale. MAD can also be used as a scaling quantity instead of the standard deviation, which is vulnerable to the influence of extreme values
[73], [74]. MAD can be calculated using the formula below, where n is each data point and ñ is the median of all data points in a window:
$$\mathrm{MAD} = \mathrm{median}\left(\lvert n - \tilde{n} \rvert\right) \tag{4}$$
Therefore, to detect refactoring spikes, we iterate through the data points in our centroid time series within a window of four weeks (one month) before and after a given index, which represents a week of development. For each window, we calculate the median absolute deviation (MAD). Then, we check if the absolute deviation of the data point at the current index from the median of the window is greater than three times the MAD [75]. If the condition is true, we consider it a refactoring spike.
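The spike detection described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the study's code: the window is truncated at the edges of the series, and a window whose MAD is zero flags any point that differs from the window median. The function and parameter names are hypothetical.

```python
from statistics import median


def refactoring_spikes(series, half_window=4, threshold=3.0):
    """Return the indices (weeks) flagged as refactoring spikes.

    A week is a spike when its absolute deviation from the window
    median exceeds `threshold` times the window's MAD (Eq. (4)).
    """
    spikes = []
    for i, value in enumerate(series):
        # Four weeks before and after week i, truncated at the edges.
        window = series[max(0, i - half_window): i + half_window + 1]
        center = median(window)
        mad = median(abs(x - center) for x in window)
        # Note: if mad == 0, any deviation from the median is flagged.
        if abs(value - center) > threshold * mad:
            spikes.append(i)
    return spikes


# A mostly regular weekly density series with one burst in week 5.
weekly_density = [1, 2, 1, 2, 1, 20, 1, 2, 1, 2, 1]
detected = refactoring_spikes(weekly_density)
```

Counting the returned indices for each cluster centroid gives the kind of per-tactic spike count that the approach describes.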
Clustering refactoring tactics in terms of project and developer profiles. Using a similar approach to Section III.A.2, we use the Scott-Knott-ESD test [31], [32] to cluster the distribution of the project and author profiles associated with the refactoring tactics into statistically significant groups. We perform two separate clusterings: (1) for combinations of project profiles and refactoring tactics, and (2) for combinations of author profiles and refactoring tactics. Each cluster represents the distribution of project or author profiles in project stages corresponding to the identified tactic. For example, RC-Vibrant indicates the distribution of the root canal tactic across vibrant project stages. Hence, each group represents a statistically significant distribution of the project or author profiles associated with the refactoring tactics. The results of the Scott-Knott-ESD test reveal more insights into the identified refactoring tactics.
3) Findings: Our clustering approach uncovers four primary refactoring tactics. We define the intermittent spiked floss and the frequent spiked floss as two variations of the floss refactoring tactic, and the intermittent root canal and the frequent root canal as two variations of the root canal refactoring tactic. The behavior (i.e., changes in refactoring density over time) of refactoring tactics is illustrated in Fig. 8. The main difference between the floss-based and root canal-based tactics is that floss-based tactics mix refactoring with regular development activities (i.e., the refactoring densities are consistently higher than zero, as indicated in Fig. 8 (B and C)); in comparison, root canal-based tactics involve refactoring activities once in a while (i.e., the refactoring densities are at or near zero for most of the time periods, as indicated in Fig. 8 (A and D)). Additionally, Fig. 7 shows the refactoring density percentiles of root canal-based and floss-based tactics. As shown, root canal-based tactics have zero refactoring densities for more than half of the development cycle. Specifically, intermittent root canal has zero refactoring density until the 74th percentile of development weeks (i.e., more than 74% of the weeks have zero refactoring densities), and frequent root canal has zero refactoring density until the 52nd percentile (i.e., more than 52% of the weeks have zero refactoring densities). In contrast, floss-based tactics have a non-zero refactoring density at all percentile points. Moreover, frequent root canal and frequent spiked floss tend to have more frequent high-density periods than intermittent spiked floss and intermittent root canal. Furthermore, as shown in Table VII, by comparing the refactoring spike counts in each tactic, we find that intermittent spiked floss has fewer spikes (35) compared to frequent spiked floss (59). Similarly, frequent root canal has more spikes (66) compared to intermittent root canal, which has 35 spikes. Overall, the frequent root canal and frequent spiked floss tactics exhibit more frequent high-density periods (i.e., refactoring spikes) than the intermittent root canal and intermittent spiked floss tactics. We define each refactoring tactic as follows:
• Intermittent spiked floss: with refactoring consistently in all development weeks, developers perform refactoring on a regular basis along with fewer refactoring spikes compared to frequent spiked floss.
• Frequent spiked floss: with refactoring consistently in all development weeks, developers perform refactoring with more drops and increases (i.e., spikes) in refactoring density compared to intermittent spiked floss.
• Intermittent root canal: with the majority of the weeks having zero refactoring densities, developers tend to perform refactoring irregularly but in high densities when they do perform it.
• Frequent root canal: with the majority of the weeks having zero refactoring densities, developers tend to perform more frequent refactorings with more spikes in refactoring density compared to intermittent root canal.
Fig. 7. The relationship between development weeks percentiles and refactoring density.

TABLE VII
SUMMARY OF THE NUMBER OF REFACTORING SPIKES FOR EACH REFACTORING TACTIC CENTROID

                Floss                              Root Canal
                Intermittent Spiked   Frequent Spiked   Intermittent   Frequent
Spikes Count    35                    59                35             66

Software projects undergo different refactoring tactics during their lifetime. Table VIII shows the distribution of the identified tactics in different stages of development. In particular, in the early stage of development, the majority of refactoring tactics are floss-based (55%). This observation is aligned with the previous studies showing that the majority of refactoring tactics are floss-based refactoring [76], [77]. However, in the middle stage, the utilization of floss-based tactics drops to
Fig. 8. Clustering centroids that represent refactoring tactics identified in this study, which are labeled as: intermittent root canal, intermittent spiked floss,
frequent spiked floss, and frequent root canal. The red dots show refactoring spikes in each tactic.

TABLE VIII TABLE IX


THE DISTRIBUTION OF REFACTORING TACTICS IN DIFFERENT SCOTT-KNOTT-ESD TEST RESULTS ON THE REFACTORING TACTICS AND
STAGES OF DEVELOPMENT ASSOCIATED PROJECT AND AUTHOR PROFILES

Floss Root Canal Project Profiles Author Profiles


Intermittent Frequent Tactic Cluster Mean Tactic Cluster Mean
Intermittent Frequent Profile Rank (%) Profile Rank (%)
Spiked Spiked
IR-Maintaining 1 0.60 FF-Casual 1 0.47
Early 3% 52% 19% 26%
IR-Obsolet 2 0.42 FF-Main 1 0.42
Middle 35% 12% 41% 12% FF-Growing 2 0.42 FF-Core 2 0.35
Late 21% 0% 78% 1% FF-Vibrant 2 0.41 IR-Core 2 0.31
IR-Growing 2 0.39 IF-Casual 2 0.30
IF-Vibrant 2 0.38 IF-Main 3 0.25
47%. Finally, in the late stage of development, the majority of FF-Obsolete 3 0.30 IR-Main 3 0.24
FR-Maintaining 4 0.25 IR-Casual 4 0.19
refactorings are observed to be root canal-based tactics (79%). FR-Obsolete 4 0.19 IF-Core 4 0.18
Therefore, the amount of floss-based refactoring tactics IR-Vibrant 4 0.18 FR-Core 4 0.16
FR-Growin 5 0.11 FR-Main 5 0.09
reduce while the projects enter their later stages of devel- FF-Maintaining 5 0.09 FR-Casual 6 0.04
opment: the developers aim for more targeted refactoring IF-Obsolete 5 0.09
IF-Growing 5 0.08
operations as the projects grow over time. Table IX shows IF-Maintaining 6 0.06
the results of the Scott-Knott-ESD test, which reveals more FR-Vibrant 6 0.03
IR: intermittent root canal, IF: intermittent spiked floss, FF: frequent spiked floss,
information on the distribution of refactoring tactics associated FR: frequent root canal
with different project and author profiles:
Among project profiles: In maintaining project stages,
Apart from floss and root canal refactoring tactics, software
intermittent root canal (cluster 1) is frequently used (60%).
developers use more diverse refactoring tactics, such as inter-
While, in vibrant project stages, floss-based tactics (frequent
mittent root canal, intermittent spiked floss, frequent spiked
spiked floss and intermittent spiked floss) (cluster 2) are more
floss, and frequent root canal. Among project stages cate-
frequently used (79%). Moreover, the obsolete project stages
mainly use intermittent root canal (cluster 2) and growing project stages (cluster 2) utilize frequent spiked floss and intermittent root canal as the main refactoring tactics. As vibrant project stages have the most contributors and commits, they have more active development activities; thus they are more likely to experience frequent refactoring during the development process (i.e., floss-based tactics). However, maintaining and obsolete project stages, which have the fewest contributors, try to maintain the code and keep it working by doing targeted refactoring from time to time.

Among author profiles: casual developers mainly use floss-based refactoring tactics (clusters 1 and 2) (i.e., frequent spiked floss and intermittent spiked floss), while main authors often utilize frequent spiked floss (cluster 1). Moreover, core authors utilize both frequent spiked floss and intermittent root canal (cluster 2). As core authors have the most contributions to the repository, they are more likely to contribute to critical refactoring activities; therefore, they are more likely to do root canal-based refactoring, while letting casual contributors perform floss-based refactoring during the development process.

gorized as obsolete or maintaining, root canal-based tactics are prevalent, whereas floss-based tactics are more commonly employed in vibrant stages. Additionally, core authors tend to use more root canal-based tactics, whereas casual contributors are inclined towards floss-based tactics.

C. RQ3: What Is the Relationship of Different Refactoring Rhythms and Tactics With Code Quality?

1) Motivation: In the first and second research questions, we identify different refactoring tactics and rhythms applied by different projects. Apart from finding different tactics and rhythms, identifying the relationship between refactoring tactics and code quality is crucial as it helps developers prioritize their efforts, improve development processes, and deliver high-quality software. Therefore, we utilize code smells as quality metrics for refactoring [13], [16] to compare the changes in quality after adopting each tactic or rhythm. Understanding the relationship of different rhythms and tactics with code quality can help practitioners and project teams to (1) discover the positive and negative aspects of the different refactoring rhythms or

Authorized licensed use limited to: Institute of Software. Downloaded on June 06,2024 at 09:17:21 UTC from IEEE Xplore. Restrictions apply.

tactics, and (2) adopt or switch to the most suitable refactoring rhythm or tactic.

2) Approach: To understand the relationship of the identified refactoring rhythms and tactics with code quality, we utilize the code smells listed in Table I as code quality metrics. Using the Scott-Knott-ESD test [31], [32], we cluster the magnitude of code smell changes (i.e., increase/decrease) after adopting each tactic or rhythm. The Scott-Knott-ESD test complements the Scott-Knott test [64] by taking the effect size difference into account when identifying different clusters. We first identify the relationship between the identified (1) refactoring tactics and (2) refactoring rhythms with the overall increase or decrease in code smells as quality measures. Moreover, we identify the relationship of each refactoring tactic and rhythm with different types of code smells. We describe our detailed approach below.

Measuring code smell changes. As discussed in Section II, to measure the relationship of the identified refactoring rhythms and tactics with code quality, we use code smells. We set three stages (early, middle, and late) for the lifetime of projects and then collect the code smell metrics, which are normalized by the project size, at the beginning and the end of each stage, respectively.

The relationship of refactoring rhythms and tactics with the overall code quality. In Section III-A, we classify refactoring rhythms as all-day and work-day. Besides, in Section III-B, we identify four major refactoring tactics: intermittent root canal, intermittent spiked floss, frequent spiked floss, and frequent root canal. To identify the relationship between the above refactoring rhythms and tactics with code quality, we utilize the normalized frequency of changes in the code smells listed in Table I after adopting each rhythm and tactic. A higher frequency of changes therefore indicates an increase in code smells and a decrease in software quality. To analyze the overall relationship of refactoring tactics and rhythms with the frequency of code smell changes, we use the normalized sum of all code smell changes as the overall changes in code smells and label them with the corresponding rhythm and tactic. Using the Scott-Knott-ESD test, we cluster and rank the refactoring rhythms and tactics based on the code smell changes to identify the rhythms and tactics leading to more smelly code. We use p-values < 0.05 to identify statistical significance and use means to rank the identified clusters.

The relationship of refactoring rhythms and tactics with each code smell type. To provide more insights and details on the identified rhythms and tactics, we conduct separate analyses to assess the relationship between the frequency of different types of code smells and each refactoring rhythm and tactic. We use 35 code smells (listed in Table I) and their changes after adopting each rhythm and tactic. To this end, we utilize the Scott-Knott-ESD test to cluster (p-values < 0.05 as the significance threshold) and rank (using means) the rhythms and tactics based on each type of code smell separately. We therefore perform 35 individual tests for rhythms and 35 separate tests for tactics, with each Scott-Knott-ESD test responsible for one type of code smell. Doing so allows us to identify how the refactoring rhythms and tactics relate to each type of code smell.

TABLE X
SCOTT-KNOTT-ESD TEST RESULTS ON THE OVERALL CHANGES IN THE FREQUENCY OF CODE SMELLS ASSOCIATED WITH THE REFACTORING TACTICS

              Floss                                   Root Canal
              Intermittent Spiked   Frequent Spiked   Intermittent   Frequent
Cluster Rank  1                     2                 3              3
Mean          0.198276              0.012329          -0.025959      -0.029701

TABLE XI
SCOTT-KNOTT-ESD TEST RESULTS ON THE OVERALL CHANGES IN THE FREQUENCY OF CODE SMELLS ASSOCIATED WITH THE REFACTORING RHYTHMS

              All Day    Work Day
Cluster Rank  1          1
Mean          0.01352    0.01991

3) Findings: In this section, we provide the findings on the relationship of both refactoring rhythms and tactics with code quality.

Overall relationship: We use the sum of all types of code smell changes (i.e., the total number of code smell changes regardless of the code smell type) to measure the overall code smell changes after adopting each refactoring rhythm and tactic. As the results suggest, the identified rhythms belong to the same cluster and do not significantly affect overall changes in the number of code smells (Table XI). The all-day refactoring rhythm has the lower mean in overall code smell changes (0.01352 versus 0.01991), which would indicate a higher code quality, but the difference is not statistically significant. Overall, refactoring rhythms are not statistically associated with the overall changes in the code smells. For refactoring tactics, on the other hand, intermittent spiked floss and frequent spiked floss are in the first and second ranked clusters; hence, they are associated with a larger increase in the overall changes of code smells compared to the frequent root canal and the intermittent root canal tactics. In fact, on average, floss-based tactics are associated with an increase in the frequency of code smells (positive mean as shown in Table X), while root canal-based tactics are associated with a decrease in the frequency of code smells (negative mean as shown in Table X). Therefore, root canal-based tactics (i.e., frequent root canal and intermittent root canal) are associated with a higher code quality compared to floss-based tactics. A possible explanation may be that floss-based refactoring is typically integrated with addressing daily maintenance tasks, such as bug fixes and the implementation of new features, while root canal-based refactoring focuses on improving the overall quality of the design.

Relationship with specific code smells: The results from our analysis of the overall changes in the number of code smells show a significant difference in the code smell changes between the floss-based and the root canal-based tactics. However, rhythms do not show a significant difference in the changes in the frequency of code smells. Therefore, we cluster the frequency of code smell changes in each code smell type after adopting each refactoring tactic separately. Fig. 9 shows the results of the individual Scott-Knott-ESD tests applied for each


Fig. 9. Results from the Scott-Knott-ESD tests that cluster and rank the refactoring tactics for overall and different types of code smell changes. A higher
rank indicates a larger increase (or smaller decrease) in code smells.

type of code smell. We observe that, for 6% (2 out of 35) of the code smell types, namely empty catch clause and multifaceted abstraction, the different refactoring tactics are not associated with a statistically significant difference in the frequency of the corresponding code smell. This was determined through the Scott-Knott-ESD tests resulting in a single cluster. However, root canal-based tactics result in statistically smaller increases in the frequency of code smells, indicating higher quality, for 80% (28 out of 35) of code smell types. This includes 90% of the implementation smell types (9 out of 10 types), 83% of the design smell types (15 out of 18 types), and 57% of the architecture smell types (4 out of 7 types). The five remaining code smell types (i.e., ambiguous interface, scattered functionality, dense structure, imperative abstraction, and unnecessary abstraction) show slightly different clustering results (Fig. 9) from the majority of the code smells (74%). Therefore, adopting root canal-based tactics is associated with improvements in the majority of code smells across all three categories (i.e., implementation, design, and architecture smells). Overall, our results suggest that more dedicated refactoring efforts (i.e., using root canal-based tactics) can better help remove or fix most types of code smells.

Root canal-based tactics are associated with a greater decrease (or smaller increase) in the number of code smells, and thus higher code quality, compared to floss-based tactics, which suggests a benefit from more dedicated refactoring operations. However, refactoring rhythms are not associated with the changes in the number of code smells, suggesting that the choice of rhythm may be driven more by project-specific factors and team preferences than by its impact on code quality.

IV. THREATS TO VALIDITY

In this section, we discuss the possible threats to the validity of our study.

Internal validity. Concerning our project selection and selected approaches, in the second research question, for clustering refactoring time series and finding refactoring tactics, we analyze the refactoring densities in three stages of development. We choose the mentioned time frames so that we could compare the refactoring behaviors of the project stages with similar lengths of development history. We admit that having more projects with different lengths of time frames could reveal more refactoring tactics. Moreover, due to the varying lengths of the life cycles of projects in stages after the late refactoring stage, time series clustering could not be applied, and we had to exclude them from our study. Thus, it is possible that some patterns may emerge in later stages that we were unable to capture. In the second research question, we have categorized the data into four distinct clusters, namely intermittent root canal, intermittent spiked floss, frequent spiked floss, and frequent root canal. The number of clusters chosen may impact the quality and comprehensibility of the clustering outcomes, as well as the insights and conclusions derived from them. If there are too many clusters, it may result in overfitting, whereas if there are too few, important information may be lost. To avoid bias, we use the elbow method [55], silhouette score [72], and manual inspection to identify the optimum number of clusters. However, different numbers of clusters could reveal fewer or more refactoring tactics. In the third research question, we use code smells as a code quality measure to study the relationship between refactoring rhythms or tactics and code quality. Nevertheless, we agree that code quality can be characterized by other measures, such as the number of bugs or maintenance costs. Furthermore, we admit that other socio-technical metrics, such as the way refactoring is applied (e.g., manually or automatically) and regulations of the development team, could affect our code quality measurement. Future work that explores the relationship between refactoring activities and other characteristics of code quality could complement our results.

External validity. Concerning the generalization of our findings, our experiments and results are based solely on the analysis of the 196 Apache projects we studied, and therefore, our conclusions may not necessarily apply to other projects, such as those in different domains. Additionally, since our analysis was limited to projects written in Java, the findings may not be applicable to projects written in other programming languages.

Construct validity. Concerning our measurement accuracy, in the third research question, to study the relationship between refactoring rhythms and code quality (i.e., in terms of the frequency of code smell changes), we measure the code smells in different stages of the projects, because calculating code smells every week takes approximately 25 days for each project, and computing them for all projects requires a significant amount of time. Nevertheless, extracting quality changes every week could provide more accurate results.
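The stage-level measurement discussed in this section (code smell counts normalized by project size, collected at the beginning and the end of each stage) can be illustrated with a minimal Python sketch. The snapshot fields, the use of lines of code as the size measure, the density-difference formula, and all numbers are assumptions made for illustration; none of them is taken from the study's data or replication package. The sketch only averages and sorts the changes per tactic. It does not implement the Scott-Knott-ESD clustering.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class StageSnapshot:
    """Smell count and project size at the start and end of one project stage."""
    tactic: str        # refactoring tactic label assigned to the stage
    smells_start: int  # total code smells at the beginning of the stage
    size_start: int    # project size at the beginning (assumed: lines of code)
    smells_end: int    # total code smells at the end of the stage
    size_end: int      # project size at the end (assumed: lines of code)


def normalized_change(s: StageSnapshot) -> float:
    """Change in smell density over the stage; positive means more smells per line."""
    return s.smells_end / s.size_end - s.smells_start / s.size_start


def mean_change_by_tactic(stages):
    """Group stages by tactic and average their normalized smell changes."""
    groups = defaultdict(list)
    for s in stages:
        groups[s.tactic].append(normalized_change(s))
    return {tactic: mean(values) for tactic, values in groups.items()}


# Hypothetical data, not taken from the studied projects.
stages = [
    StageSnapshot("frequent spiked floss", 100, 10_000, 180, 15_000),
    StageSnapshot("frequent spiked floss", 50, 5_000, 66, 6_000),
    StageSnapshot("intermittent root canal", 200, 10_000, 180, 10_000),
    StageSnapshot("intermittent root canal", 90, 9_000, 80, 10_000),
]

means = mean_change_by_tactic(stages)
# Sort tactics from the largest increase to the largest decrease in smell density.
ranking = sorted(means, key=means.get, reverse=True)
print(ranking)
```

With these made-up inputs, the floss-labeled stages show a positive mean change and the root canal-labeled stages a negative one, which follows the sign convention of Table X, where a positive mean indicates an increase in code smells.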

V. IMPLICATIONS

In this section, based on the results of our study, we provide implications for practitioners, developers, and tool builders to improve their understanding of different refactoring rhythms and tactics.

Our findings help practitioners understand the patterns of refactoring activities and their impact on code quality, which can help practitioners make more informed decisions in their refactoring adoption. In this study, we identify the deviations of code refactoring from the regular development rhythms and measure how different refactoring rhythms and tactics are associated with the increase or decrease in code quality. Our findings can help practitioners understand practical refactoring rhythms and tactics in different real-world projects and observe their relationship with code quality. Hence, practitioners can leverage our findings to (1) understand the refactoring rhythms or tactics that they use and the impact on code quality, and (2) adopt or apply the most effective refactoring rhythms or tactics for their projects.

Refactoring rhythms do not have a significant impact on code quality. In this study, we observe two dominant refactoring rhythms: all-day and work-day. By measuring the code smell changes after adopting each rhythm, we do not observe significant differences in the code quality after applying either work-day or all-day rhythms. Therefore, we recommend that practitioners choose a refactoring rhythm that aligns with their project objectives and milestones, whether work-day or all-day. Considerations such as the size of the contribution and the developers' comfort could be taken into account when making this decision.

Different refactoring tactics have different impacts on code quality: root canal-based refactoring is more likely than floss-based refactoring to be associated with better code quality. As shown in our results, the two root canal-based tactics are ranked in the first place in terms of the overall frequency of code smell reduction, and more specifically for 28 out of 35 code smell types. Root canal-based tactics lead to an average decrease in the frequency of code smells, while floss-based tactics lead to an average increase in the frequency of code smells. Hence, root canal-based tactics outperform floss-based tactics by reducing the total amount of code smells. Therefore, we encourage practitioners to apply higher-level refactorings once in a while to keep the code maintainable with fewer code smells.

Establishing a common vocabulary for describing various refactoring rhythms and tactics can help improve communication among practitioners, developers, project managers, and other stakeholders. This work formulated several common patterns of refactoring rhythms and tactics. Such common patterns (or common vocabulary) facilitate communication among practitioners, enabling them to identify and implement effective refactoring patterns and techniques. By gaining a better understanding of the existing refactoring patterns, team members can communicate more effectively with each other and with stakeholders. This is especially useful in large and complex projects where there may be multiple teams and stakeholders involved.

Future research that performs empirical studies on refactoring should be aware of different refactoring tactics adopted in their studied projects. Our study reveals a distinction in the adoption of floss or root canal-based refactoring over the long term and their different impact on code quality. We strongly recommend researchers consider different types of refactoring tactics for the project selection when they investigate refactoring evolution, practices, and their impact on software quality. Opting for projects exclusively aligned with floss-based or root canal-based approaches could substantially influence or bias the outcomes of their studies. In addition, we recommend researchers distinguish the projects that adopt different refactoring tactics when they study the characteristics of refactoring in these projects.

Our findings promote the adoption of root canal-based refactorings in the development process instead of solely relying on floss-based refactoring. In fact, a large portion (42%) of our studied projects only perform floss-based refactoring. Our research underscores the importance of incorporating root canal-based refactoring as a means to minimize code smells. Consequently, we advocate for developers to move beyond solely relying on ad hoc refactoring during development (i.e., floss-based). Instead, we propose the inclusion of dedicated refactoring tasks within the development timeline. While this approach may necessitate additional time investment, it proves beneficial for developers to conduct long-term maintenance tasks on code quality.

VI. RELATED WORK

In this section, we review the literature related to refactoring rhythms, tactics, and studies related to refactoring detection.

Working rhythms. Zhang et al. [26] conduct a survey study of developers on working overtime. The authors find that working overtime is a common behavior among software practitioners. Developers who work more often on weekends believe that working overtime could increase their productivity. Similarly, Claes et al. [27] study the frequency of the commit messages on 86 open-source projects and find that one-third of developers work overtime either at night or during the weekends. In terms of researchers, Wang et al. [78] study the download information of scientific papers and find that many researchers work on weekends. However, the amount of overtime work differs among countries. Binnewies et al. [25] conduct a survey study of 133 employees and show that psychological detachment, relaxation, and mastery experiences during the weekend are associated with being recovered for the upcoming week. Being recovered affects the weekly task performance, personal initiative, organizational citizenship behavior, and low perceived effort. To summarize, most of the prior work has been conducted to find different working rhythms during the week and correlate them with productivity. However, there is no study related to the refactoring rhythms and their relationship with code quality. Our goal is to identify various refactoring rhythms employed by developers and determine which rhythms are positively associated with higher code quality.


Refactoring tactics. Floss and root canal are two refactoring tactics identified in previous studies [1], [28]. Floss refactoring is distinguished by frequent refactoring, blended with the software development process, while root canal refactoring is identified by occasional periods of refactoring that are not interleaved with the software development process. Liu et al. [76] investigate refactoring histories on data collected from 753,367 engineers and suggest that between floss and root canal, the most frequently adopted refactoring tactic by engineers is floss. Sousa et al. [29] classify refactoring as floss and root canal and conduct a study on software projects to examine refactoring opportunities indicated by code smells. In this study, we identify new variations of refactoring tactics in addition to the previously mentioned tactics. Moreover, we study their relationship with code quality in terms of code smells.

Refactoring detection. Several tools and approaches are introduced to automatically identify refactoring operations [38], [79]. The main idea behind these approaches is to compare different versions of the code fragments stored in a version control system and point out refactoring operations. These tools can help us study refactoring activities on a large scale in the software maintenance process. Kim et al. [79] introduce Ref-Finder, which takes two versions of a program as input from workspace snapshots or subversion of a repository and extracts logical facts about the syntactic structure of a program. Nevertheless, Soares et al. [80] conduct a study and show that Ref-Finder has low precision and recall, which leads to false-positive results and means it is inaccurate in detecting refactorings. However, Tsantalis et al. [38] design a tool, Rminer, that overcomes the above constraints. Similarly to Ref-Finder [79], Rminer [14], [38] takes two revisions of source code from the commit history in the version control system of a Java project and returns a list of refactoring operations applied between the two versions. Using a similar approach, Alizadeh et al. [81] introduce a bot integrated into a version control system that monitors software repositories and identifies refactoring opportunities by analyzing recently changed files through pull requests. It then finds the best series of refactorings to fix the quality issues. In this work, we employ the refactoring detection approach Rminer, which was developed in previous research [14], to extract refactoring operations from our dataset. We also validate its effectiveness in our context.

Refactoring and code quality. Prior work has studied the relationship between refactoring and code quality. Almogahed et al. [82] examine the studies that identify the impacts of code refactoring on software quality. They find that researchers agree that refactoring has a positive impact on both internal and external quality attributes. Moreover, Lacerda et al. [83] conduct a literature review on refactoring tools and common code smells to measure the relationship between refactoring operations and code smells. By analyzing the initial and final code smells after refactoring, the study finds that a significant proportion of code smells get eliminated after performing refactoring, which in turn preserves or enhances software quality during the maintenance process. Moreover, it notices that code smells and refactoring are linked by quality attributes, and the quality attributes that affect code smells are the same ones that affect refactoring. Bibiano et al. [17] study the effect of batch refactoring on code smells. They find that more than one refactoring operation is usually required to eliminate a code smell. Ó Cinnéide et al. [9] conduct a survey study on the benefits of refactoring and argue that, although refactoring is commonly believed to aim at removing code smells, developers are not strongly motivated by the desire to eliminate them. Murphy-Hill and Black [28] define two refactoring tactics, floss and root canal, using a dental metaphor. Floss involves frequent refactoring with other program changes, while root canal involves infrequent, longer periods of refactoring with few other program changes. The authors propose five principles and evaluate tools for alignment with floss tactics. They find that the tools are not aligned with floss tactics and are therefore not suitable for floss refactoring. The study suggests that floss refactoring is likely to result in higher quality and lower costs in the long run. However, it does not propose a quantitative approach to measure this claim. Previous studies [12], [13], [14], [15], [16], [17] link refactoring with code smells and code quality, which makes code smells a good quality indicator of code after performing refactoring operations. Therefore, we use code smells to measure the relationship between the identified rhythms and tactics and code quality. Different from the existing studies, our work is the first to quantitatively study the relationship between refactoring rhythms/tactics and code quality.

VII. CONCLUSION

In this study, we investigate the refactoring activities on a dataset consisting of 196 Apache projects to identify refactoring tactics and rhythms that developers and projects adopt in the software development process. We also examine their relationship with code quality in terms of code smells. Comparing both refactoring and development activities, we first determine that in more than 95% of project stages developers use a systematic refactoring rhythm on weekdays. Two major rhythms are identified as (1) work-day refactoring and (2) all-day refactoring. By considering the relationship between refactoring rhythms and the quality metrics (i.e., code smells), we observe that different refactoring rhythms do not make a statistically significant difference to the code quality. Moreover, by clustering the life cycle of refactoring activities, we find four variations of existing refactoring tactics: (1) frequent spiked floss, (2) intermittent spiked floss, (3) frequent root canal, and (4) intermittent root canal refactoring tactics. We observe that root canal-based tactics (frequent root canal and intermittent root canal) are associated with a larger reduction in the frequency of code smells compared to floss-based tactics (frequent spiked floss and intermittent spiked floss). Our findings can help researchers and practitioners understand practical refactoring activities in real-world projects and their relationship with code quality. Practitioners can leverage our findings to choose the appropriate refactoring patterns for their projects based on their resources and code quality requirements. For future work, we plan to conduct experiments for other programming languages and focus more on automatic vs. manual refactoring operations.


ACKNOWLEDGMENT

The authors would like to express their sincere gratitude to Yucan Li for participating in the manual validation of their results. His assistance was invaluable in ensuring the accuracy and reliability of their findings.

REFERENCES

[1] M. Fowler, K. Beck, J. Brant, W. Opdyke, and D. Roberts, Refactoring: Improving the Design of Existing Code. Berkeley, CA, USA: Addison-Wesley, 1999.
[2] E. Murphy-Hill, C. Parnin, and A. P. Black, “How we refactor, and how we know it,” IEEE Trans. Softw. Eng., vol. 38, no. 1, pp. 5–18, Jan./Feb. 2012.
[3] M. Kaya, S. Conley, Z. S. Othman, and A. Varol, “Effective software refactoring process,” in Proc. 6th Int. Symp. Digit. Forensic Secur. (ISDFS), Piscataway, NJ, USA: IEEE Press, 2018, pp. 1–6.
[4] A. Arif and Z. A. Rana, “Refactoring of code to remove technical debt and reduce maintenance effort,” in Proc. 14th Int. Conf. Open Source Syst. Technologies (ICOSST), Piscataway, NJ, USA: IEEE Press, 2020, pp. 1–7.
[5] J. Al Dallal and A. Abdin, “Empirical evaluation of the impact of object-oriented code refactoring on quality attributes: A systematic literature review,” IEEE Trans. Softw. Eng., vol. 44, no. 1, pp. 44–69, Jan. 2018.
[6] J. Kerievsky, Refactoring to Patterns (Addison-Wesley Signature Series). Boston, MA, USA: Addison-Wesley, 2005.
[7] T. Sharma, “Quantifying quality of software design to measure the impact of refactoring,” in Proc. IEEE 36th Annu. Comput. Softw. Appl. Conf. Workshops, Piscataway, NJ, USA: IEEE Press, 2012, pp. 266–271.
[8] R. C. Martin, Clean Code: A Handbook of Agile Software Craftsmanship. Upper Saddle River, NJ, USA: Pearson Education, 2009.
[9] M. Ó Cinnéide, A. Yamashita, and S. Counsell, “Measuring refactoring benefits: A survey of the evidence,” in Proc. 1st Int. Workshop Softw. Refactoring, 2016, pp. 9–12.
[10] B. Du Bois, S. Demeyer, and J. Verelst, “Does the “refactor to understand” reverse engineering pattern improve program comprehension?” in Proc. 9th Eur. Conf. Softw. Maintenance Reengineering, Piscataway, NJ, USA: IEEE Press, 2005, pp. 334–343.
[11] R. Moser, A. Sillitti, P. Abrahamsson, and G. Succi, “Does refactoring improve reusability?” in Proc. Int. Conf. Softw. Reuse, Turin, Italy: Springer-Verlag, 2006, pp. 287–297.
[12] D. Silva, N. Tsantalis, and M. T. Valente, “Why we refactor? Confessions of GitHub contributors,” in Proc. 24th ACM SIGSOFT Int. Symp. Found. Softw. Eng., 2016, pp. 858–870.
[13] D. Cedrim et al., “Understanding the impact of refactoring on smells: A longitudinal study of 23 software projects,” in Proc. 11th Joint Meeting Found. Softw. Eng., 2017, pp. 465–475.
[14] N. Tsantalis, A. Ketkar, and D. Dig, “RefactoringMiner 2.0,” IEEE Trans. Softw. Eng., vol. 48, no. 3, pp. 930–950, Mar. 2022.
[15] G. Szőke, C. Nagy, L. J. Fülöp, R. Ferenc, and T. Gyimóthy, “FaultBuster: An automatic code smell refactoring toolset,” in Proc. IEEE 15th Int. Work. Conf. Source Code Anal. Manipulation (SCAM), Piscataway, NJ, USA: IEEE Press, 2015, pp. 253–258.
[16] N. Yoshida, T. Saika, E. Choi, A. Ouni, and K. Inoue, “Revisiting the relationship between code smells and refactoring,” in Proc. IEEE 24th Int. Conf. Program Comprehension (ICPC), Piscataway, NJ, USA: IEEE Press, 2016, pp. 1–4.
[17] A. C. Bibiano et al., “A quantitative study on characteristics and effect of batch refactoring on code smells,” in Proc. ACM/IEEE Int. Symp. Empirical Softw. Eng. Meas. (ESEM), Piscataway, NJ, USA: IEEE Press, 2019, pp. 1–11.
[21] L. Chen and M. A. Babar, “Towards an evidence-based understanding of emergence of architecture through continuous refactoring in agile software development,” in Proc. IEEE/IFIP Conf. Softw. Archit., Piscataway, NJ, USA: IEEE Press, 2014, pp. 195–204.
[22] A. Martini and J. Bosch, “An empirically developed method to aid decisions on architectural technical debt refactoring: AnaConDebt,” in Proc. IEEE/ACM 38th Int. Conf. Softw. Eng. Companion (ICSE-C), Piscataway, NJ, USA: IEEE Press, 2016, pp. 31–40.
[23] M. Kim, T. Zimmermann, and N. Nagappan, “A field study of refactoring challenges and benefits,” in Proc. ACM SIGSOFT 20th Int. Symp. Found. Softw. Eng., 2012, pp. 1–11.
[24] V. Khorikov. “Short-term vs long-term perspective in software development.” Enterprise Craftsmanship. Accessed: Apr. 3, 2023. [Online]. Available: https://enterprisecraftsmanship.com/posts/short-term-vs-long-term-perspective/
[25] C. Binnewies, S. Sonnentag, and E. J. Mojza, “Recovery during the weekend and fluctuations in weekly job performance: A week-level study examining intra-individual relationships,” J. Occupational Organizational Psychol., vol. 83, no. 2, pp. 419–441, Jun. 2010.
[26] J. Zhang et al., “Understanding the working time of developers in IT companies in China and the United States,” IEEE Softw., vol. 38, no. 2, pp. 96–106, Mar./Apr. 2021.
[27] M. Claes, M. V. Mäntylä, M. Kuutila, and B. Adams, “Do programmers work at night or during the weekend?” in Proc. 40th Int. Conf. Softw. Eng., 2018, pp. 705–715.
[28] E. Murphy-Hill and A. P. Black, “Refactoring tools: Fitness for purpose,” IEEE Softw., vol. 25, no. 5, pp. 38–44, Sep./Oct. 2008.
[29] L. Sousa, W. Oizumi, A. Garcia, A. Oliveira, D. Cedrim, and C. Lucena, “When are smells indicators of architectural refactoring opportunities: A study of 50 software projects,” in Proc. 28th Int. Conf. Program Comprehension, 2020, pp. 354–365.
[30] W. Kruskal, “Kruskall–Wallis one way analysis of variance,” J. Amer. Statistical Assoc., vol. 47, no. 260, pp. 583–621, 1952.
[31] C. Tantithamthavorn, S. McIntosh, A. E. Hassan, and K. Matsumoto, “An empirical comparison of model validation techniques for defect prediction models,” IEEE Trans. Softw. Eng., vol. 43, no. 1, pp. 1–18, Jan. 2017.
[32] C. Tantithamthavorn, S. McIntosh, A. E. Hassan, and K. Matsumoto, “The impact of automated parameter optimization on defect prediction models,” IEEE Trans. Softw. Eng., vol. 45, no. 7, pp. 683–711, Jul. 2019.
[33] S. Noei, H. Li, S. Georgiou, and Y. Zou. “Replication package.” GitHub. Accessed: Aug. 30, 2023. [Online]. Available: https://github.com/seal-tse-2023/replication-package
[34] M. Claes and M. V. Mäntylä, “20-MAD: 20 years of issues and commits of Mozilla and Apache development,” in Proc. 17th Int. Conf. Mining Softw. Repositories, 2020, pp. 503–507.
[35] “Most in-demand programming languages in 2021.” Berkeley Boot Camps. Accessed: Apr. 3, 2023. [Online]. Available: https://bootcamp.berkeley.edu/blog/most-in-demand-programming-languages/
[36] “Index | TIOBE - The software quality company.” TIOBE. Accessed: Apr. 3, 2023. [Online]. Available: https://www.tiobe.com/tiobe-index/
[37] I. Ahmed, U. A. Mannan, R. Gopinath, and C. Jensen, “An empirical study of design degradation: How software projects get worse over time,” in Proc. ACM/IEEE Int. Symp. Empirical Softw. Eng. Meas. (ESEM), Piscataway, NJ, USA: IEEE Press, 2015, pp. 1–10.
[38] N. Tsantalis, M. Mansouri, L. Eshkevari, D. Mazinanian, and D. Dig, “Accurate and efficient refactoring detection in commit history,” in Proc. IEEE/ACM 40th Int. Conf. Softw. Eng. (ICSE), Piscataway, NJ, USA: IEEE Press, 2018, pp. 483–494.
[39] J. Cohen, Statistical Power Analysis for the Behavioral Sciences, 2nd ed. New York, NY, USA: Routledge, Jul. 1988.
[40] T. Sharma, P. Mishra, and R. Tiwari, “Designite: A software design quality assessment tool,” in Proc. 1st Int. Workshop Bringing Architectural Des. Thinking Developers’ Daily Activities, 2016, pp. 1–4.
[41] T. Sharma and D. Spinellis, “A survey on software smells,” J. Syst.
[18] A. Yamashita and L. Moonen, “Do code smells reflect important main- Softw., vol. 138, pp. 158–173, Apr. 2018.
tainability aspects?” in Proc. 28th IEEE Int. Conf. Softw. Maintenance [42] F. A. Fontana, I. Pigazzini, R. Roveda, D. Tamburri, M. Zanoni, and
(ICSM), Piscataway, NJ, USA: IEEE Press, 2012, pp. 306–315. E. Di Nitto, “Arcan: A tool for architectural smells detection,” in Proc.
[19] G. Suryanarayana, G. Samarthyam, and T. Sharma, Refactoring for IEEE Int. Conf. Softw. Archit. Workshops (ICSAW), Piscataway, NJ,
Software Design Smells: Managing Technical Debt. San Mateo, CA, USA: IEEE Press, 2017, pp. 282–285.
USA: Morgan Kaufmann, 2014. [43] R. Mo, Y. Cai, R. Kazman, and L. Xiao, “Hotspot patterns: The formal
[20] L. Aversano, U. Carpenito, and M. Iammarino, “An empirical study on definition and automatic detection of architecture smells,” in Proc. 12th
the evolution of design smells,” Information, vol. 11, no. 7, 2020, Art. Work. IEEE/IFIP Conf. Softw. Archit., Piscataway, NJ, USA: IEEE Press,
no. 348. 2015, pp. 51–60.

[44] N. Tsantalis, T. Chaikalis, and A. Chatzigeorgiou, “JDeodorant: Identification and removal of type-checking bad smells,” in Proc. 12th Eur. Conf. Softw. Maintenance Reengineering, Piscataway, NJ, USA: IEEE Press, 2008, pp. 329–331.
[45] U. Azadi, F. A. Fontana, and D. Taibi, “Architectural smells detected by tools: A catalogue proposal,” in Proc. IEEE/ACM Int. Conf. Tech. Debt (TechDebt), Piscataway, NJ, USA: IEEE Press, 2019, pp. 88–97.
[46] J. Garcia, D. Popescu, G. Edwards, and N. Medvidovic, “Identifying architectural bad smells,” in Proc. 13th Eur. Conf. Softw. Maintenance Reengineering, Piscataway, NJ, USA: IEEE Press, 2009, pp. 255–258.
[47] F. Palomba, A. Panichella, A. Zaidman, R. Oliveto, and A. De Lucia, “The scent of a smell: An extensive comparison between textual and structural smells,” in Proc. 40th Int. Conf. Softw. Eng., 2018, pp. 740–740.
[48] “AlDanial/cloc.” GitHub. Accessed: Feb. 10, 2023. [Online]. Available: https://github.com/AlDanial/cloc
[49] “GitHub API.” GitHub. Accessed: Apr. 3, 2023. [Online]. Available: https://docs.github.com/en/rest
[50] F. E. Harrell et al., Regression Modeling Strategies. With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis, vol. 3. Berlin, Germany: Springer-Verlag, 2015.
[51] E. Noei, F. Zhang, and Y. Zou, “Too many user-reviews! What should app developers look at first?” IEEE Trans. Softw. Eng., vol. 47, no. 2, pp. 367–378, Feb. 2021.
[52] J. Miles, “R-squared, adjusted r-squared,” in Encyclopedia of Statistics in Behavioral Science. Hoboken, NJ, USA: Wiley, 2005.
[53] F. Zhang, A. Mockus, I. Keivanloo, and Y. Zou, “Towards building a universal defect prediction model,” in Proc. 11th Work. Conf. Mining Softw. Repositories, 2014, pp. 182–191.
[54] J. A. Hartigan and M. A. Wong, “Algorithm AS 136: A k-means clustering algorithm,” J. Roy. Statistical Soc. C (Appl. Statist.), vol. 28, no. 1, pp. 100–108, 1979.
[55] R. L. Thorndike, “Who belongs in the family?” Psychometrika, vol. 18, no. 4, pp. 267–276, 1953.
[56] M. Syakur, B. Khotimah, E. Rochman, and B. D. Satoto, “Integration k-means clustering method and elbow method for identification of the best customer profile cluster,” IOP Conf. Ser. Mater. Sci. Eng., vol. 336, no. 1, 2018, Art. no. 012017.
[57] P. E. McKnight and J. Najab, “Mann–Whitney U test,” in The Corsini Encyclopedia of Psychology. Hoboken, NJ, USA: Wiley, 2010, p. 1.
[58] T. Dahiru, “P-value, a true test of statistical significance? A cautionary note,” Ann. Ibadan Postgraduate Med., vol. 6, no. 1, pp. 21–26, 2008.
[59] S. K. Haldar, “Statistical and geostatistical applications in geology,” in Mineral Exploration. Amsterdam, The Netherlands: Wiley, 2018, pp. 167–194.
[60] L. Rising and N. S. Janoff, “The scrum software development process for small teams,” IEEE Softw., vol. 17, no. 4, pp. 26–32, Jul./Aug. 2000.
[61] J. C. Gower, “Properties of euclidean and non-euclidean distance matrices,” Linear Algebra Its Appl., vol. 67, pp. 81–97, Jun. 1985.
[62] M. Kljun and M. Tersěk, “A review and comparison of time series similarity measures,” in Proc. 29th Int. Electrotechnical Comput. Sci. Conf. (ERK), Portorož, Slovenia, 2020, pp. 21–22.
[63] S. K. Gaikwad, B. W. Gawali, and P. Yannawar, “A review on speech recognition technique,” Int. J. Comput. Appl., vol. 10, no. 3, pp. 16–24, 2010.
[64] E. G. Jelihovschi, J. C. Faria, and I. B. Allaman, “ScottKnott: A package for performing the Scott-Knott clustering algorithm in R,” TEMA (São Carlos), vol. 15, no. 1, pp. 3–17, 2014.
[65] C. J. Ferguson, “An effect size primer: A guide for clinicians and researchers,” Prof. Psychol. Res. Pract., vol. 40, no. 5, pp. 532–538, 2009.
[66] I. Spieler, S. Scheibe, C. Stamov-Roßnagel, and A. Kappas, “Help or hindrance? Day-level relationships between flextime use, work–nonwork boundaries, and affective well-being,” J. Appl. Psychol., vol. 102, no. 1, pp. 67–87, 2017.
[67] J. Romano, J. D. Kromrey, J. Coraggio, and J. Skowronek, “Should we really be using t-test and Cohen’s d for evaluating group differences on the NSSE and other surveys,” in Proc. Annu. Meeting Florida Assoc. Institutional Res., 2006, pp. 1–3.
[68] “Selecting the number of clusters with silhouette analysis on KMeans clustering.” Scikit-learn. Accessed: Aug. 23, 2023. [Online]. Available: https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html
[69] E. S. Dalmaijer, C. L. Nord, and D. E. Astle, “Statistical power for cluster analysis,” BMC Bioinf., vol. 23, no. 1, pp. 1–28, 2022.
[70] U. Rani and S. Sahu, “Comparison of clustering techniques for measuring similarity in articles,” in Proc. 3rd Int. Conf. Comput. Intell. Commun. Technol. (CICT), Piscataway, NJ, USA: IEEE Press, 2017, pp. 1–7.
[71] T. Yatsunenko et al., “Human gut microbiome viewed across age and geography,” Nature, vol. 486, no. 7402, pp. 222–227, 2012.
[72] P. J. Rousseeuw, “Silhouettes: A graphical aid to the interpretation and validation of cluster analysis,” J. Comput. Appl. Math., vol. 20, pp. 53–65, Nov. 1987.
[73] T. L. Wahl, “Discussion of “Despiking acoustic doppler velocimeter data” by Derek G. Goring and Vladimir I. Nikora,” J. Hydraul. Eng., vol. 129, no. 6, pp. 484–487, 2003.
[74] M. Parsheh, F. Sotiropoulos, and F. Porté-Agel, “Estimation of power spectra of acoustic-doppler velocimetry data contaminated with intermittent spikes,” J. Hydraul. Eng., vol. 136, no. 6, pp. 368–378, 2010.
[75] R. Q. Quiroga, Z. Nadasdy, and Y. Ben-Shaul, “Unsupervised spike detection and sorting with wavelets and superparamagnetic clustering,” Neural Comput., vol. 16, no. 8, pp. 1661–1687, 2004.
[76] H. Liu, Y. Gao, and Z. Niu, “An initial study on refactoring tactics,” in Proc. IEEE 36th Annu. Comput. Softw. Appl. Conf., Piscataway, NJ, USA: IEEE Press, 2012, pp. 213–218.
[77] E. Fernandes et al., “Refactoring effect on internal quality attributes: What haven’t they told you yet?” Inf. Softw. Technol., vol. 126, Oct. 2020, Art. no. 106347.
[78] X. Wang et al., “Exploring scientists’ working timetable: Do scientists often work overtime?” J. Informetrics, vol. 6, no. 4, pp. 655–660, 2012.
[79] M. Kim, M. Gee, A. Loh, and N. Rachatasumrit, “Ref-Finder: A refactoring reconstruction tool based on logic query templates,” in Proc. 18th ACM SIGSOFT Int. Symp. Found. Softw. Eng., 2010, pp. 371–372.
[80] G. Soares, R. Gheyi, E. Murphy-Hill, and B. Johnson, “Comparing approaches to analyze refactoring activity on software repositories,” J. Syst. Softw., vol. 86, no. 4, pp. 1006–1022, 2013.
[81] V. Alizadeh, M. A. Ouali, M. Kessentini, and M. Chater, “RefBot: Intelligent software refactoring bot,” in Proc. 34th IEEE/ACM Int. Conf. Automated Softw. Eng. (ASE), Piscataway, NJ, USA: IEEE Press, 2019, pp. 823–834.
[82] A. Almogahed, M. Omar, and N. H. Zakaria, “Impact of software refactoring on software quality in the industrial environment: A review of empirical studies,” in Proc. Knowl. Manage. Int. Conf. (KMICe), Miri Sarawak, Malaysia, Jul. 25–27, 2018, pp. 229–234.
[83] G. Lacerda, F. Petrillo, M. Pimenta, and Y. G. Guéhéneuc, “Code smells and refactoring: A tertiary systematic review of challenges and observations,” J. Syst. Softw., vol. 167, Sep. 2020, Art. no. 110610.
