2023 - An Empirical Study of Refactoring Rhythms and Tactics in The Software Development Process
Abstract: It is critical for developers to develop high-quality software to reduce maintenance cost. Developers often apply refactoring practices to make source code readable and maintainable without impacting the software functionality. Existing studies identify development rhythms (i.e., weekly development patterns) and their relationship with various metrics, such as productivity. However, existing studies focus entirely on development rhythms. There is no study on refactoring rhythms and their relationship with code quality. Moreover, the existing studies categorize the refactoring tactics (i.e., long-term refactoring patterns) into two general concepts of consistent and inconsistent refactoring. Nevertheless, the existence of other tactics and their relationship with code quality is not explored. In this paper, we conduct an empirical study on the refactoring practices of 196 Apache projects in the early, middle, and late stages of development. We aim to identify (1) existing refactoring rhythms, (2) further refactoring tactics, and (3) the relationship between the identified tactics and rhythms with code quality. The recognition of existing refactoring strategies and their relationship with code quality can assist practitioners in recognizing and applying the appropriate and high-quality refactoring rhythms or tactics to deliver a higher quality of software. We find two frequently used refactoring rhythms: work-day refactoring and all-day refactoring. We also identify two deviations each of the floss and root canal refactoring tactics: intermittent root canal, intermittent spiked floss, frequent spiked floss, and frequent root canal. We find that root canal-based tactics are correlated with less increase in the code smells (i.e., higher quality code) compared to floss-based tactics. Moreover, we find that refactoring rhythms are not significantly correlated with the quality of the code. Furthermore, we provide detailed information on the relationship of each refactoring tactic to each code smell type.

Index Terms: Refactoring, code quality, code smells, refactoring rhythms, refactoring tactics.

Manuscript received 20 April 2023; revised 31 August 2023; accepted 10 October 2023. Date of publication 10 November 2023; date of current version 12 December 2023. Recommended for acceptance by J. I. Maletic. (Corresponding author: Shayan Noei.)
Shayan Noei, Stefanos Georgiou, Ying Zou are with Queen’s University, Kingston, Ontario K7L 3N6, Canada (e-mail: [email protected]; [email protected]; [email protected]).
Heng Li is with Polytechnique Montréal, Montréal, Quebec H3T 1J4, Canada (e-mail: [email protected]).
Digital Object Identifier 10.1109/TSE.2023.3326775

I. INTRODUCTION

REFACTORING is a systematic process of improving the internal quality of software without affecting its functionalities [1]. Many studies show that refactoring is a widely engaged part of the software maintenance process [2], [3], [4], [5]. Refactoring facilitates the extensibility and maintainability of a software system [6], [7], [8]. Various reasons drive developers’ refactoring activities, such as improving software design [9], making software systems easier to understand [10], enhancing reusability [11], removing dependencies among attributes, methods, classes, interfaces, and packages [12], as well as eliminating code smells [12], [13], [14], [15], [16], [17]. A code smell is a design flaw that violates the principles of design standards and impairs the maintainability of software [18], [19]. As a result, a code smell can affect the internal quality of software. For instance, a broken hierarchy code smell happens when a subtype and its supertype do not share an “IS-A” relationship [20]. The presence of code smells is a good indicator for code quality checks [18], and multiple refactoring operations are typically needed to eliminate code smells [17].

Several factors contribute to the quantity of refactoring operations performed to improve code quality, such as developer perceptions, team experience, development schedule, software characteristics, and so forth [21]. Different teams may apply different refactoring strategies in short-term or long-term periods [22], [23]. Identifying patterns in refactoring practices and their relationship with code quality can help software developers adopt the most suitable patterns in their projects. More specifically, refactoring patterns have two perspectives: (1) refactoring rhythms and (2) refactoring tactics. Refactoring rhythms describe how refactoring operations are split across the weekdays and usually focus on existing tasks. Refactoring tactics refer to long-term refactoring that is more focused on future development [24].

In the context of refactoring rhythms, existing studies focus on development rhythms and categorize development rhythms as work-day (i.e., Monday to Friday) development and all-day development. Moreover, they correlate development rhythms with measures such as task performance and productivity [25], [26], [27]. However, to the best of our knowledge, no study has investigated the identification of refactoring rhythms and their relationship with code quality.

In terms of refactoring tactics, existing studies divide refactoring tactics into floss and root canal [1], [28], [29]. Floss refactoring is distinguished by frequent refactoring along with the development process. Root canal refactoring is identified by occasional refactoring aside from the development process. While the terms floss and root canal are widely used as refactoring tactics, the existence of other possible tactics and their relationship with code quality is not explored in the existing work.
0098-5589 © 2023 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: Institute of Software. Downloaded on June 06,2024 at 09:17:21 UTC from IEEE Xplore. Restrictions apply.
5104 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 49, NO. 12, DECEMBER 2023
In this work, we study developers’ refactoring activities (rhythms and tactics) and their impact on code quality in terms of code smells. To identify refactoring rhythms and other possible refactoring tactics, we study 196 Apache projects and introduce two metrics: daily refactoring density (DRD) and weekly refactoring density (WRD). Using the introduced metrics, we divide each project into daily and weekly time frames to identify frequent refactoring tactics and rhythms during the lifetime of a project. Then, we investigate the relationship between different rhythms and tactics with code quality. Such information can guide developers in selecting the most suitable and high-quality refactoring rhythms or tactics for their projects. To this end, we aim to answer the following research questions:

RQ1: What are the rhythms of refactoring? To identify existing refactoring rhythms, we analyze if the refactoring rhythms fit into the software development rhythms introduced by previous studies (work-day and all-day). By utilizing the DRD metric and performing the Kruskal-Wallis test [30], we observe that the majority of projects (95%) apply two primary rhythms: (1) work-day refactoring (11%) and (2) all-day refactoring (84%).

RQ2: What are the most frequent refactoring tactics used in projects? To identify refactoring tactics, we utilize the WRD metric and form a time series of refactoring activities for every project. Using dynamic time warping (DTW), we cluster refactoring time series and we observe four variations of floss and root canal refactoring tactics:
• Intermittent spiked floss: Regular and consistent refactoring with fewer sudden increases (spikes) compared to frequent spiked floss.
• Frequent spiked floss: Consistent refactoring but with more spikes in refactoring density compared to intermittent spiked floss.
• Intermittent root canal: Occasional refactoring in high densities, with most weeks having no refactoring activities.
• Frequent root canal: More frequent refactoring with more spikes in refactoring density compared to intermittent root canal, but still with most weeks having no refactoring activities.

RQ3: What is the relationship of different refactoring rhythms and tactics with code quality? In the first two research questions, we identify frequently used refactoring rhythms and tactics. Furthermore, we are interested in understanding how different rhythms and tactics are associated with code quality improvement (i.e., reducing code smells). To examine the relationship between refactoring rhythms and code quality metrics, we use a Scott-Knott-ESD [31], [32] test to rank and cluster code smell changes after adopting each refactoring rhythm and tactic. We observe that root canal-based tactics are more targeted refactoring operations and are therefore correlated with more reduction of code smells compared to floss-based tactics and deliver higher quality code. Furthermore, we observe that refactoring rhythms are not significantly correlated with software quality. Consequently, refactoring rhythms are chosen based on the project assets and the development team’s comforts. Finally, we provide some guidelines on positive and negative relationships between different refactoring tactics and different types of code smells.

In conclusion, we make the following contributions:
1) We identify refactoring rhythms and tactics that are used in the software development process, which can provide insights for practitioners to understand existing refactoring practices and develop tooling to support such practices.
2) We understand the relationship between the used tactics and rhythms with software quality, which helps developers to adopt the most suitable approach.

The replication package can be accessed online [33].

Paper organization. The remainder of our study is organized as follows. Section II describes the setup of this study. Section III presents our approaches and results for answering our research questions. Section IV discusses the threats to the validity of our findings and Section V provides the implications of our study. Section VI surveys related studies and compares them to our work. Finally, we conclude our paper and present future research directions in Section VII, and acknowledge contributions in Section VIII.

II. EXPERIMENT SETUP

This section presents the setup of our study, including our data collection and data analysis approaches.

A. Overview of Our Approach

Fig. 1 gives an overview of our study. We conduct our research using the projects with a reasonable amount of development activities from the 20-MAD Apache dataset [34]. We extract the refactoring history of these projects and calculate refactoring density metrics along with their lifespan. By
utilizing refactoring density metrics, we compare refactoring distributions at different time periods to answer the first and the second research questions that aim to identify refactoring rhythms and tactics. Furthermore, we use the characteristics of the projects and developers to provide insights into the rationale of using such rhythms and tactics. Finally, by measuring quality changes after adopting each rhythm and tactic, we identify the relationship of the identified rhythms and tactics with the quality of code.

B. Data Collection

To perform our experiments, we perform several steps to collect our dataset.

1) Project Selection and Pre-Processing: The 20-MAD Apache dataset [34] contains information about commits and issues related to 765 Mozilla and Apache projects with a timespan of 20 years. In particular, the dataset contains 3.4M commits, 2.3M issues, and 17.3M issue comments. Considering that Java is one of the most popular programming languages [35], [36] and it is best supported by refactoring extraction tools (e.g., Rminer [14]), we limit our study to Java projects. To select projects with enough data that help identify the different refactoring tactics and rhythms, we exclude the projects that:
• have less than 80% of Java source code;
• have less than the 1st quantile of commit counts (i.e., < 1,021 commits); and
• have a short lifespan (i.e., < one year of commit history).
As a result, we obtain 196 Java projects that have sufficient commit history and lifespan for our analysis.

The studied projects in our dataset have varying lifespans and therefore possess different development histories. Furthermore, refactoring habits or requirements may change over time. For example, as a project ages, design issues may be fixed less frequently [37]. Hence, we cannot compare refactoring practices unless we have projects with a similar lifetime. To be able to analyze the refactoring activities in the projects, we partition longer projects into multiple stages. We utilized the first, second, and third quartiles of project ages to establish thresholds and divided all projects into four age groups, with each group containing an equal number of projects. The age groups are categorized as follows: (1) younger than 4.5 years (i.e., the first quartile), (2) between 4.5 (first quartile) and 7 years (second quartile), (3) between 7 (second quartile) and 8.5 years (third quartile), and (4) older than 8.5 years (third quartile) of activities. We excluded projects with less than 4.5 years of activities from our study because such projects have a wide variety of ages, making it difficult to compare similar activities among them, unless they have a similar lifespan. By using the identified age groups, we define different stages of software development: (1) early stage: start of a project until the 4.5th year, (2) middle stage: 4.5th year of a project until the 7th year, and (3) late stage: 7th year of a project until the 8.5th year, which is the longest age considered in all studied projects. As a result, we observe that 50 projects have ages between 4.5 and 7 years, 49 projects have ages between 7 and 8.5 years, and 50 projects have ages of more than 8.5 years. To compare the old projects (e.g., 10 years old) with young ones (e.g., 5 years old), we divide the older projects into two or three stages using the thresholds of the age groups. For instance, if a project has 6 years of activities, it has only the early stage of development (e.g., the first 4.5 years of activities), while a project with 10 years of activities has the early, middle, and late stages of development. Doing so allows us to identify similar rhythms and tactics that might appear among the projects at different stages. We use these three development stages throughout the analyses performed in this study.

2) Refactoring Extraction: To extract the refactoring operations, we use the Rminer 2.0.3 tool [14], which is an AST-based algorithm that finds up to 59 Java refactoring types from the commit history without the need for user-defined thresholds [14], [38]. Furthermore, Rminer is a superior refactoring detection tool compared to its competitors and identifies refactoring operations with an overall precision of 99.7% and a recall of 94.2% [14]. To measure the accuracy of the tool on our dataset, with a confidence level of 95% and a margin of error of 5% on the total number of commits, we select 385 commits to perform our manual validation. In each commit, we reviewed all types of refactorings that the tool could identify and determined if and how many of them were present in the results of the tool. We then compared our results with the tool’s results. For example, if we identified a pull-up method but the tool did not, we marked it as a tool failure and vice versa. The manual validation is done by one of the authors and one undergraduate computer science student. The results of the manual validation show an overall precision of 97%, recall of 96%, and F1 score of 95%. Furthermore, we calculate Cohen’s kappa coefficient [39] from the participants’ manual validation results and achieve a score of 0.91, which suggests a strong agreement. Therefore, we run Rminer on every selected project for each commit to extract refactoring activities in the history of development.

3) Code Smells Extraction: To identify the relationship between the refactoring rhythms or tactics and code quality, we need to analyze how these refactoring patterns reduce or increase the frequency of code smells. As we use code smells as code quality indicators, we analyze how refactoring rhythms and tactics are associated with the reduction or increase in the frequency of code smells.

We use the Designite tool [40] to extract code smells. Designite can identify the most types of code smells compared to its alternatives and detects numerous code smells in large codebases [40], [41]. For instance, Arcan [42] and Hotspot detector [43] can detect only 4 types of code smells. Similarly, Jdeodorant [44] detects 5 and Arcade [45] detects 11 types of code smells. Designite can identify 35 code smell types including 7 architecture smells, 18 design smells, and 10 implementation smells as listed in Table I. The different categories of code smells have specific causes and impacts on the software system. Architecture code smells arise from poor software designs within the system architecture and can negatively impact system quality, performance, and lifespan [46]. Design code smells result from inadequate system design and have a negative impact on code quality [19]. Implementation code smells, on the other hand, stem from poor implementation decisions made by contributors and can negatively affect the quality of the code [1]. To address all three types of code smells, refactoring has been
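The stage partitioning described in the project selection step can be illustrated with a minimal sketch. It assumes that a stage is assigned to a project only when the project's age reaches the upper threshold of that stage, which is our reading of the 6-year and 10-year examples in the text; the names `STAGES` and `stages_for` are hypothetical and are not taken from the replication package.

```python
# Stage boundaries in years, taken from the quartile thresholds in the text.
STAGES = [("early", 0.0, 4.5), ("middle", 4.5, 7.0), ("late", 7.0, 8.5)]


def stages_for(project_age_years):
    """Return the development stages a project fully covers.

    Assumption of this sketch: a stage counts only if the project is at
    least as old as the stage's upper threshold.
    """
    return [name for name, _start, end in STAGES if project_age_years >= end]


if __name__ == "__main__":
    for age in (4.0, 6.0, 8.0, 10.0):
        print(age, stages_for(age))
```

Under this assumption, a project younger than 4.5 years yields no stage, which matches the exclusion of such projects from the study.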
Fig. 2. The results of the correlation analysis of author and project profile metrics.
TABLE IV
THE CLUSTERS IDENTIFIED BY K-MODES CLUSTERING TO IDENTIFY AUTHOR PROFILES

TABLE V
THE CLUSTERS IDENTIFIED BY K-MODES CLUSTERING TO IDENTIFY PROJECT PROFILES

Label | Files | Contributors | Timezones | Commits | Age | Stars | Refactoring Density
------------+--------------------------+----------------------+---------------------+--------------------------+--------------------------+------------------------+--------------------
Vibrant | Most (2,157.75, ∞) | Most (44.00, ∞) | Most (10.75, ∞) | Most (3,450.75, ∞) | Less (902.77, 1,684.16] | Most (237.25, ∞) | More (0.15, 0.20]
Maintaining | Least [0.00, 530.75] | Least [0.00, 15.00] | Least [0.00, 1.00] | Least [0.00, 951.00] | Less (902.77, 1,684.16] | Less (1.00, 38.00] | Most (0.20, 1.00]
Obsolete | Less (530.75, 1,242.00] | More [27.00, 44.00] | More [5.00, 10.75] | Less (951.00, 1,702.00] | Most (2,096.51, ∞) | More (38.00, 237.25] | Least [0.00, 0.10]
Growing | More [1,242.00, 2,157.75] | Least (15.00, 27.00] | Least [0.00, 1.00] | More (1,702.00, 3,450.75] | More (1,684.16, 2,096.51] | Least [0.00, 1.00] | Least (0.00, 0.10]
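As an illustration of how a raw metric value maps onto the four ordinal labels used in Table V, the sketch below bins a value with the cut points of the Files column (530.75, 1,242.00, and 2,157.75). The function name `quartile_label` and the right-closed handling of boundary values are assumptions of this sketch, not the authors' implementation.

```python
from bisect import bisect_left

LABELS = ["Least", "Less", "More", "Most"]

# Cut points of the Files column as they appear in Table V.
FILES_CUTS = [530.75, 1242.00, 2157.75]


def quartile_label(value, cut_points):
    """Map a metric value to Least/Less/More/Most.

    cut_points must be ascending (Q1, Q2, Q3). Intervals are treated as
    right-closed, so a value equal to a cut point falls in the lower group.
    """
    return LABELS[bisect_left(cut_points, value)]


if __name__ == "__main__":
    for files in (100, 530.75, 1000, 2000, 5000):
        print(files, quartile_label(files, FILES_CUTS))
```

The resulting categorical labels, rather than the raw values, are what a categorical clustering algorithm would consume.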
After removing the highly correlated and redundant metrics, based on the distribution of each metric in different quartiles, we divide them into four groups and label them as Least, Less, More, and Most [53]. We use k-modes clustering, an extension of the k-means [54] clustering algorithm that is suitable for clustering categorical data, to group the labeled metrics into different profiles. We use the elbow method [55] to find the optimal number of clusters (k) and manually validate and check if the clustering results provide distinct centroids. The elbow method involves plotting a graph that displays the number of clusters versus the sum of squared errors (SSE) for each cluster. The optimal number of clusters is identified by the point on the graph where the SSE begins to level off and form an elbow shape [56].

For Author profiles, we identify 3 as the optimal number of clusters with the optimal cost value of 62,872. Therefore, we apply k-modes with 3 clusters (k = 3) and identify three major profiles. Based on the selected metrics (i.e., timezone, contribution, refactoring density, commits, and work time), we label the three identified clusters as main authors, casual contributors, and core authors. The core authors make the most contributions while the casual contributors make the least contributions. The main and core authors are primarily located in America and Western Europe with commits between 12:33 and 17:06 at their local time (14:57-17:06 for main authors and 12:33-14:57 for core authors). Casual contributors are primarily located in North America and commit from 17:06 to 24:00 (midnight).

For the Project profiles, we find four clusters that provide different meaningful clusters with the optimal cost value of 919,000. Hence, we apply k-modes clustering with k = 4 and identify four major profiles, namely, vibrant, maintaining, obsolete, and growing projects. Vibrant projects have the most commits, contributors, and stars, while obsolete projects experience the least refactoring density along with fewer commits, the most age, and more stars. Furthermore, growing projects experience the least refactorings with the least stars while having more commits and the least contributors. Moreover, maintaining projects share the least commits and contributors with the most refactoring density. Tables IV and V summarize the results of author and project clustering.

D. Research Methods

This section presents the research methods applied to answer the research questions.

1) Refactoring Rhythms Identification: To identify the refactoring rhythms of the projects, we require a measure to describe the amount of refactoring activities (i.e., refactoring churn) applied on each day of the week. As the refactoring activities could be reflected by the amount of code changes caused by refactoring, we use the refactoring churn to quantify the amount of refactoring activities and normalize it by the actual code churn. Therefore, we introduce daily refactoring density (DRD), which indicates the amount of refactoring activities deviated
from the overall development (e.g., the total code churn) of each day of development. The DRD is calculated as below:

$$\mathrm{DRD}(i) = \frac{\text{Refactoring churn of the day } (i)}{\text{Total code churn of the day } (i)} \tag{1}$$

We measure the daily refactoring densities (i.e., DRD) and compare them throughout the week. We form seven groups, each of which represents one day of the week and contains all refactoring activities that occurred on that particular day. Doing so allows us to find the similarities and differences of refactoring activities from one day to another. As our data does not follow a normal distribution, to measure the significance of the differences or similarities of the measured DRDs among different days, we use the Kruskal-Wallis test, an extension of the Mann–Whitney U test [57] that evaluates if two or more samples come from the same distribution [30]. The Kruskal-Wallis test does not assume that the data is normally distributed. We use a significance level of 0.05: a p-value < 0.05 rejects the null hypothesis, whereas a p-value > 0.05 means that the null hypothesis cannot be rejected [58]. The null hypothesis states that there is no significant difference between the compared sets of data [59].

2) Refactoring Tactics Identification: Previous studies show that the majority of projects utilize agile methodologies, which require small tasks to be finished within one week of development [58], [60]. To understand the long-term refactoring activities applied by developers, we need a measure of the amount of refactoring churn relative to the development in each week. Hence, we propose a weekly refactoring density (WRD) metric, which reflects the amount of refactoring activities per week of development. The WRD metric is computed using the following formula:

$$\mathrm{WRD}(i) = \frac{\text{Refactoring churn of the week } (i)}{\text{Code churn of the week } (i)} \tag{2}$$

For each project, we create a time series of WRDs within every stage of development. Each data point of the time series includes a week of development and the corresponding WRD metric for that week. Accordingly, the refactoring time series data depicts refactoring behaviors over time. Therefore, similar refactoring time series between different stages of the projects represent a similar refactoring habit. The created refactoring time series have different lengths and they vary in speed. For instance, the development period of Project A is 10 weeks, while that of Project B is 20 weeks, indicating a variation in their length. Additionally, Project A experiences a refactoring spike in the second week of development, while Project B experiences the same spike in the fifth week of development, indicating a variation in the speed of refactoring activities. Therefore, comparing the refactoring time series with point-to-point measures, such as Euclidean distance [61], could not overcome the limitations of speed and length variation and could not identify similarities optimally. Therefore, we use the dynamic time warping (DTW) algorithm to measure the similarity between the refactoring time series of project stages as refactoring tactics. DTW overcomes the limitation of point-to-point comparisons with the step pattern that allows transitions and weights between two pairs [62]. Moreover, DTW is an algorithm for calculating the similarity between two time series that vary in speed [63].

Fig. 3. How changes in code smells are calculated after each stage of development in each project.

3) Quality Changes Measurement: For identifying the relationship between the refactoring rhythms or tactics and code quality, we need to analyze how these refactoring rhythms or tactics reduce or increase the frequency of each type of code smell. To do this, we use the boundary points before and after each stage of development. As shown in Fig. 3, we measure the frequency of each code smell type before and after each stage (i.e., between two consecutive stage boundaries) and measure the difference (reduction or increase) in each type of code smell.

Since a larger codebase could contain more code smells, we normalize the frequency of code smells by the lines of code (LOC) in the codebase. Additionally, since more code changes (i.e., larger code churn) may lead to larger differences in code smells, we normalize the difference in code smells by the code churn within each stage. Finally, we calculate the differences in each type of code smell and label it according to the identified rhythm and tactic adopted in that stage. We utilize the frequency of code smells at the end of each stage (ECS), the lines of code at the end of each stage (ELC), the initial frequency of code smells in each stage (ICS), the lines of code at the beginning of each stage (ILC), and the total code churn of each stage (CC) to measure the normalized differences of the total frequency of code smells of each stage (CSD). To measure CSD at each stage (i), we use the following equation:

$$\mathrm{CSD}(i) = \frac{\mathrm{ECS}(i)/\mathrm{ELC}(i) - \mathrm{ICS}(i)/\mathrm{ILC}(i)}{\mathrm{CC}(i)/\mathrm{ELC}(i)} \tag{3}$$

An increase in CSD indicates an increase in the number of code smells (i.e., decreased code quality) and a decrease in CSD indicates a decrease in the number of code smells (i.e., increased code quality). To measure the smell difference after utilizing each refactoring rhythm or tactic, we require a multiple comparison method to cluster the identified rhythms or tactics into statistically significant groups and rank them based on the CSD metric (i.e., changes in code quality). Therefore, we use the Scott-Knott-ESD [31], [32] test, a multiple comparison method that divides and ranks a set of input distributions into statistically distinct groups [32]. Scott-Knott-ESD is an extension of the Scott-Knott [64] test with the addition of effect size difference. The effect size examines the strength of the difference between different groups of data [65]. Therefore, we cluster and rank the refactoring rhythms and tactics based on the CSDs and identify the rhythms or tactics leading to a codebase with an increased or decreased amount of code smells.

III. RESULTS

In this section, we provide the motivation, approach, and findings for each of our research questions.
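To make the measures of Section II-D concrete, the following is a minimal sketch of the density of Equations (1) and (2), the CSD of Equation (3), and a textbook DTW distance. It is not the authors' implementation: the function names, the convention of returning 0.0 for a period without code churn, and the plain absolute-difference DTW without the weighted step patterns mentioned in the text are all assumptions of this sketch.

```python
import math


def density(refactoring_churn, code_churn):
    """Eq. (1)/(2): refactoring churn over total code churn for a day or week.

    Returning 0.0 when there is no code churn is an assumption of this sketch.
    """
    return refactoring_churn / code_churn if code_churn else 0.0


def csd(ecs, elc, ics, ilc, cc):
    """Eq. (3): change in smells per LOC over a stage, divided by churn per LOC."""
    return (ecs / elc - ics / ilc) / (cc / elc)


def dtw_distance(a, b):
    """Textbook dynamic time warping with absolute-difference local cost."""
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor alignments.
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]


if __name__ == "__main__":
    # Project A: 10 weeks, spike in week 2. Project B: 20 weeks, spike in week 5.
    project_a = [0.0, 0.9] + [0.0] * 8
    project_b = [0.0] * 4 + [0.9] + [0.0] * 15
    print(dtw_distance(project_a, project_b))
    print(csd(ecs=120, elc=10_000, ics=100, ilc=8_000, cc=5_000))
```

In this toy setting the two WRD series mirror the Project A and Project B example: they differ in length and in the timing of the spike, yet DTW can align them at zero cost, whereas a point-to-point measure is not even defined for series of different lengths. The CSD call uses made-up numbers and yields a negative value, that is, a lower smell density at the end of the stage.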
A. RQ1. What Are the Rhythms of Refactoring?

1) Motivation: In software development, developers can have various working rhythms. For example, some developers prefer to work only on workdays; however, others do not mind working even on weekends. Existing studies focus on development rhythms and categorize development rhythms as work-day (i.e., Monday to Friday) development and all-day (i.e., Monday to Sunday) development [25], [26], [27]. Prior research reports that the state of getting recovered during the weekend from working on the workdays is correlated with an increase in weekly task performance and personal initiative [25]. Moreover, the state of working overtime is associated with a decrease in productivity [66]. Furthermore, previous studies have suggested that deviating from regular development to perform refactoring may help address unhealthy code and potentially improve code quality [28]. Inspired by prior work, our intuition is that providing dedicated time for refactoring outside of regular development cycles enables developers to focus more on addressing unhealthy code through refactoring, resulting in improved code quality. Thus, we study the refactoring rhythms based on their deviations from the development rhythms. Understanding the refactoring deviations from the regular development rhythms and their relationship with the code quality improvement can assist software teams and developers to (1) understand the existing refactoring rhythms and (2) adopt/apply the most effective refactoring rhythms. In this research question, we investigate and characterize different refactoring rhythms to help developers understand the existing refactoring rhythms and identify which one suits their needs.

2) Approach: As described in Section II.D.1, to identify the refactoring rhythms of the studied projects, we form seven groups and measure DRDs on every day of development. Moreover, we compare DRDs throughout the week to discover refactoring rhythms using the Kruskal-Wallis test [30].

Prior studies [25], [26], [27] divide the software development process into two groups: work-day development and all-day development (including workdays and weekends). To identify the refactoring rhythms adopted in different stages and different projects, we need to compare different days of refactoring. Hence, we group DRDs into seven groups based on the weekdays, where each group represents a weekday and depicts the

Clustering project and developer profiles in terms of refactoring activities. To understand the distribution and significance of the refactoring rhythms across different project and author profiles, we rank the combinations of the different refactoring rhythms and project or author profiles. Using the Scott-Knott-ESD [31], [32], we group combinations of refactoring rhythms and project or author profiles into statistically significant clusters. Specifically, we perform two separate Scott-Knott-ESD clustering analyses: (1) for combinations of project profiles and refactoring rhythms, and (2) for combinations of author profiles and refactoring rhythms.

In the clustering method, each clustered item (i.e., a node) represents the distribution of projects or authors that are associated with a specific refactoring rhythm. Moreover, each clustered item is represented by a vector of the same length as the number of projects or authors, with each project or author being assigned a value of 1 if it is associated with the specific project or developer profile and the specific refactoring rhythm, and a value of 0 otherwise. For example, All-day-Vibrant refers to the distribution of the all-day refactoring rhythm across all projects that fall under the vibrant project profile. Each cluster corresponds to a statistically significant distribution of the various combinations of refactoring rhythms and project or author profiles across the dataset. The results of the Scott-Knott-ESD test provide insights into how refactoring rhythms are distributed across different profiles.

Identifying refactoring rhythms characteristics. To study the differences between refactoring operations (e.g., pull up method) performed on weekends and those performed on workdays, we use the Mann–Whitney U test [57] to compare the distribution of each refactoring operation on the weekend and workdays. We utilize Cliff’s Delta to measure the effect size of the differences. We consider the operations that obtain a p-value < 0.05 and an effect size > 0.33 [67], indicating a medium or large magnitude of difference, as the operations that are performed significantly differently between the weekends and the workdays.

3) Findings: The majority (95%) of project stages follow one of the work-day or all-day refactoring rhythms. We accept the null hypothesis H0−1 for 84% of the project stages and the null hypothesis H0−2 for 11% of the remaining project stages. For the 5% of the project stages, both null hypotheses H0−1 and H0−2 are rejected. Therefore, our anal-
overall DRD distribution on the corresponding day of the week. ysis shows that only a few project stages (i.e., 5%) do not
Using the Kruskal-Wallis test, we first identify whether we can follow any of the initial refactoring rhythms. Specifically, we
fit the majority of the rhythms adopted from the selected project find that 11% of the project stages perform all-day refactor-
stages into two groups, namely work-day refactoring and all- ing, whereas 84% of the project stages perform the work-day
day refactoring. To do this, we perform two individual tests refactoring rhythm.
using the following hypothesis: In the work-day refactoring rhythm, we observe a signifi-
• H0-1 : Refactoring densities are similar among all days of cant difference in refactoring densities between workdays and
the week. weekends, with the median refactoring densities being higher
• H0-2 : Refactoring densities are similar among all work- in workdays compared to weekends, as illustrated in Fig. 4.
days of the week. Additionally, certain types of refactoring are applied differ-
To this end, after running the first test, we exclude the stages ently between weekends and workdays, as shown in Fig. 5-A.
of projects that have a similar distribution of refactoring on all These types of refactoring include move class, pull up method,
days of the week and perform the second test. We accept a hy- pull up attribute, add attribute annotation, extract interface,
pothesis if the p-value is higher than 0.05 and reject otherwise. add parameter annotation, modify parameter annotation, split
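As a rough illustration of the effect-size filter described under "Identifying refactoring rhythms characteristics", the sketch below computes Cliff's Delta from its textbook definition and applies the two thresholds quoted in the text. The function names, the sample counts, and the use of the absolute value of delta are assumptions made for this example. The p-value is assumed to come from a separate Mann–Whitney U test and is not computed here.

```python
from typing import Sequence


def cliffs_delta(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Cliff's Delta: P(x > y) - P(x < y) over all cross-sample pairs."""
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


def differs_notably(p_value: float, delta: float,
                    alpha: float = 0.05, min_effect: float = 0.33) -> bool:
    """Apply the thresholds from the text: p < 0.05 and effect size > 0.33.

    Using abs(delta) is an assumption, so that the direction of the
    difference (workdays higher or weekends higher) does not matter.
    """
    return p_value < alpha and abs(delta) > min_effect


# Hypothetical weekly counts of one refactoring operation (illustrative only).
workday_counts = [5, 7, 6, 9, 8]
weekend_counts = [1, 2, 0, 3, 2]
delta = cliffs_delta(workday_counts, weekend_counts)
```

A delta close to 1 or -1 means the two samples barely overlap, while a delta near 0 means neither sample tends to dominate the other.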
TABLE VI
SCOTT-KNOTT-ESD TEST RESULTS ON THE REFACTORING RHYTHMS
AND ASSOCIATED PROJECT AND AUTHOR PROFILES
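The clustered items behind this table are described in the text as 0/1 vectors with one entry per project or author. The sketch below shows one possible way to build such vectors. The record layout, the project names, and the "Steady" profile are hypothetical, and each project is simplified to a single profile and a single rhythm, which the study does not necessarily assume.

```python
from typing import Dict, List, Tuple

# Hypothetical records: (project, profile, rhythm). Only "Vibrant" and the
# rhythm names come from the text; everything else is illustrative.
records: List[Tuple[str, str, str]] = [
    ("proj-a", "Vibrant", "All-day"),
    ("proj-b", "Vibrant", "Work-day"),
    ("proj-c", "Steady", "All-day"),
]


def membership_vectors(rows: List[Tuple[str, str, str]]) -> Dict[str, List[int]]:
    """Map each 'rhythm-profile' item to a 0/1 vector over all projects."""
    size = len(rows)
    vectors: Dict[str, List[int]] = {}
    for index, (_, profile, rhythm) in enumerate(rows):
        key = f"{rhythm}-{profile}"
        # Create an all-zero vector on first sight, then mark this project.
        vectors.setdefault(key, [0] * size)[index] = 1
    return vectors


vectors = membership_vectors(records)
```

Each vector has the same length as the number of projects, so items such as All-day-Vibrant can be compared with one another by a clustering test.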
B. RQ2: What Are the Most Frequent Refactoring Tactics Used in Projects?
Fig. 6. The results of the elbow curve, showing the optimal number of clusters using DTW.
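For readers unfamiliar with DTW (dynamic time warping), the following is a minimal sketch of the textbook dynamic-programming formulation with an absolute-difference cost. It is not taken from the study: the function name, the cost function, the absence of a warping window, and the sample series are all assumptions for illustration.

```python
from typing import Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic O(len(a) * len(b)) dynamic time warping with |x - y| cost."""
    if not a or not b:
        raise ValueError("both series must be non-empty")
    inf = float("inf")
    # cost[i][j] holds the best alignment cost of a[:i] against b[:j].
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # only a advances
                                    cost[i][j - 1],      # only b advances
                                    cost[i - 1][j - 1])  # both advance
    return cost[len(a)][len(b)]


# Hypothetical refactoring-density series: the same spike, one step apart.
early_spike = [0.0, 1.0, 0.0, 0.0]
late_spike = [0.0, 0.0, 1.0, 0.0]
shifted_cost = dtw_distance(early_spike, late_spike)
```

Because DTW may stretch either series along the time axis, two series with the same shape but a small time shift end up close to each other, which a point-by-point distance would not capture.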
Fig. 8. Clustering centroids that represent refactoring tactics identified in this study, which are labeled as: intermittent root canal, intermittent spiked floss,
frequent spiked floss, and frequent root canal. The red dots show refactoring spikes in each tactic.
tactics, and (2) adopt or switch to the most suitable refactoring rhythm or tactic.
2) Approach: To understand the relationship of the identified refactoring rhythms and tactics with code quality, we utilize code smells listed in Table I as code quality metrics. Using the Scott-Knott-ESD test [31], [32], we cluster the magnitude of code smell changes (i.e., increase/decrease) after adopting each tactic or rhythm. The Scott-Knott-ESD test complements the Scott-Knott test [64] by taking the effect size difference into account when identifying different clusters. We first identify the relationship between the identified (1) refactoring tactics and (2) refactoring rhythms with overall increase or decrease in code smells as quality measures. Moreover, we identify the relationship of each refactoring tactic and rhythm with different types of code smells. We describe our detailed approach below.
Measuring code smell changes. As discussed in Section II, to measure the relationship of the identified refactoring rhythms and tactics with code quality, we use code smells. We set three stages (early, middle, and late) for the lifetime of projects and then collect the code smell metrics, which are normalized by the project size, at the beginning and the end of each stage respectively.
The relationship of refactoring rhythms and tactics with the overall code quality. In Section III-A, we classify refactoring rhythms as all-day and work-day. Besides, in Section III-B, we identify four major refactoring tactics: intermittent root canal, intermittent spiked floss, frequent spiked floss, and frequent root canal. To identify the relationship between the above refactoring rhythms and tactics with code quality, we utilize the normalized frequency of changes in the code smells listed in Table I after adopting each rhythm and tactic. Therefore, a higher frequency of changes indicates an increase in code smells and a decrease in software quality. To analyze the overall relationship of refactoring tactics and rhythms with the frequency of code smell changes, we use the normalized sum of all code smell changes as the overall changes in code smells and label them with the corresponding rhythm and tactic. Using the Scott-Knott-ESD test, we cluster and rank the refactoring rhythms and tactics based on the code smell changes to identify the rhythms and tactics leading to more smelly code. We use p-values < 0.05 to identify the statistical significance and use means to rank the identified clusters.
The relationship of refactoring rhythms and tactics with each code smell type. To provide more insights and details on the identified rhythms and tactics, we conduct separate analyses to assess the relationship between the frequency of different types of code smells and each refactoring rhythm and tactic. We use 35 code smells (listed in Table I) and the changes after adopting each rhythm and tactic. To this end, we utilize the Scott-Knott-ESD test to cluster (p-values < 0.05 as the significance threshold) and rank (using means) the rhythms and tactics based on each type of code smell separately. Therefore, we perform 35 individual tests for rhythms and 35 separate tests for tactics, so that each Scott-Knott-ESD test is responsible for one type of code smell. Doing so allows us to identify how the refactoring rhythms and tactics impact each type of code smell.
TABLE X
SCOTT-KNOTT-ESD TEST RESULTS ON THE OVERALL CHANGES IN THE FREQUENCY OF CODE SMELLS ASSOCIATED WITH THE REFACTORING TACTICS
              Floss                                   Root Canal
              Intermittent Spiked   Frequent Spiked   Intermittent   Frequent
Cluster Rank  1                     2                 3              3
Mean          0.198276              0.012329          -0.025959      -0.029701
TABLE XI
SCOTT-KNOTT-ESD TEST RESULTS ON THE OVERALL CHANGES IN THE FREQUENCY OF CODE SMELLS ASSOCIATED WITH THE REFACTORING RHYTHMS
              All Day   Work Day
Cluster Rank  1         1
Mean          0.01352   0.01991
3) Findings: In this section, we provide the findings on the relationship of both refactoring rhythms and tactics with code quality.
Overall relationship: We use the sum of all types of code smell changes (i.e., the total number of code smell changes regardless of the code smell type) to measure the overall code smell changes after adopting each refactoring rhythm and tactic. As the results suggest, the identified rhythms belong to the same cluster and do not significantly affect overall changes in the number of code smells (Table XI). However, the all-day refactoring rhythm is associated with the lower mean in overall code smell changes, which indicates a higher code quality. Overall, refactoring rhythms are not statistically associated with the overall changes in the code smells. For refactoring tactics, on the other hand, intermittent spiked floss and frequent spiked floss are in the first and second ranked clusters and are therefore associated with a larger overall increase in code smells than the frequent root canal and the intermittent root canal tactics. In fact, on average, floss-based tactics are associated with an increase in the frequency of code smells (positive mean as shown in Table X), while root canal-based tactics are associated with a decrease in the frequency of code smells (negative mean as shown in Table X). Therefore, root canal-based tactics (i.e., frequent root canal and intermittent root canal) are associated with a higher code quality compared to floss-based tactics. A possible explanation may be that floss-based refactoring is typically integrated with addressing daily maintenance tasks, such as bug fixes and the implementation of new features, while root canal-based refactoring focuses on improving the overall quality of the design.
Relationship with specific code smells: The results from our analysis of the overall changes in the number of code smells show a significant difference in the code smell changes between the floss-based and the root canal-based tactics. However, rhythms do not show a significant difference in the changes in the frequency of code smells. Therefore, we cluster the frequency of code smell changes in each code smell type after adopting each refactoring tactic separately. Fig. 9 shows the results of the individual Scott-Knott tests applied for each
Fig. 9. Results from the Scott-Knott-ESD tests that cluster and rank the refactoring tactics for overall and different types of code smell changes. A higher
rank indicates a larger increase (or smaller decrease) in code smells.
type of code smell. We observe that, for 6% (2 out of 35 code smell types), namely empty catch clause and multifaceted abstraction, the different refactoring tactics are not associated with a statistically significant difference in the frequency of the corresponding code smell. This was determined through the Scott-Knott-ESD tests resulting in a single cluster. However, root canal-based tactics result in statistically smaller increases in the frequency of code smells, indicating higher quality, for 80% (28 out of 35) of code smell types. This includes 90% of the implementation smell types (9 out of 10 types), 83% of the design smell types (15 out of 17 types), and 57% of the architecture smell types (4 out of 7 types). The five remaining code smell types (i.e., ambiguous interface, scattered functionality, dense structure, imperative abstraction, and unnecessary abstraction) show slightly different clustering results (Fig. 9) from the majority of the code smells (74%). Therefore, adopting root canal-based tactics results in the majority of improvements in code smells across all three categories (i.e., implementation, design, and architecture smells). Overall, our results suggest that more dedicated refactoring efforts (i.e., using root canal-based tactics) can better help remove or fix most types of code smells.
Root canal-based tactics are associated with a greater decrease (or smaller increase) in the number of code smells, and thus higher code quality, compared to floss-based tactics, which suggests more dedicated refactoring operations. However, refactoring rhythms are not associated with the changes in the number of code smells, suggesting that the choice of rhythm may be driven more by project-specific factors and team preferences rather than their impact on code quality.
IV. THREATS TO VALIDITY
In this section, we discuss the possible threats to the validity of our study.
Internal validity. Concerning our project selection and selected approaches, in the second research question, for clustering refactoring time series and finding refactoring tactics, we analyze the refactoring densities in three stages of development. We choose the mentioned time frames so that we could compare the refactoring behaviors of the project stages with similar length of development history. We admit that having more projects with different lengths of time frames could reveal more refactoring tactics. Moreover, due to the varying lengths of the life cycles of projects in stages after the late refactoring stage, time series clustering could not be applied, and we had to exclude them from our study. Thus, it is possible that some patterns may emerge in later stages that we were unable to capture. In the second research question, we have categorized the data into four distinct clusters, namely intermittent root canal, intermittent spiked floss, frequent spiked floss, and frequent root canal. The number of clusters chosen may impact the quality and comprehensibility of the clustering outcomes, as well as the insights and conclusions derived from them. If there are too many clusters, it may result in overfitting, whereas if there are too few, important information may be lost. To avoid bias, we use the elbow method [55], silhouette score [72], and manual inspection to identify the optimum number of clusters. However, different numbers of clusters could reveal fewer or more refactoring tactics. In the third research question, we use code smells as a code quality measure to study the relationship between refactoring rhythms or tactics with code quality. Nevertheless, we agree that code quality can be characterized by other measures, such as the number of bugs or maintenance costs. Furthermore, we admit that other socio-technical metrics, such as the way refactoring is applied (e.g., manually or automatically) and regulations of the development team could affect our code quality measurement. Future work that explores the relationship between refactoring activities and other characteristics of code quality could complement our results.
External validity. Concerning the generalization of our findings, our experiments and results are based solely on the analysis of the 196 Apache projects we studied, and therefore, our conclusions may not necessarily apply to other projects, such as those in different domains. Additionally, since our analysis was limited to projects written in Java, the findings may not be applicable to projects written in other programming languages.
Construct validity. Concerning our measurement accuracy, in the third research question, to study the relationship between refactoring rhythms and code quality (i.e., in terms of the frequency of code smell changes), we measure the code smells in different stages of the projects, because calculating code smells every week takes approximately 25 days for each project, and computing them for all projects requires a significant amount of time. Nevertheless, extracting quality changes every week could provide more accurate results.
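To make the stage-level measurement discussed above more concrete, the sketch below compares a size-normalized smell count at the start and end of one stage. It is only an illustration: the class and function names are hypothetical, lines of code is assumed as the size measure, and taking a simple difference of the two densities is an assumption about how a "change" could be computed.

```python
from dataclasses import dataclass


@dataclass
class StageSnapshot:
    smells: int          # number of detected code smells
    lines_of_code: int   # project size at the same point (assumed measure)


def smell_density(snapshot: StageSnapshot) -> float:
    """Smell count normalized by project size."""
    if snapshot.lines_of_code <= 0:
        raise ValueError("project size must be positive")
    return snapshot.smells / snapshot.lines_of_code


def stage_change(start: StageSnapshot, end: StageSnapshot) -> float:
    """Positive: smells grew faster than the code. Negative: they shrank."""
    return smell_density(end) - smell_density(start)


# Hypothetical stage: the code base doubles while smells grow by only 20%.
change = stage_change(StageSnapshot(100, 10_000), StageSnapshot(120, 20_000))
```

In this hypothetical stage the raw smell count rises, yet the normalized change is negative, which is why normalizing by size matters when stages of growing projects are compared.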
Refactoring tactics. Floss and root canal are two refactoring tactics identified in previous studies [1], [28]. Floss refactoring is distinguished by frequent refactoring, blended with the software development process, while root canal refactoring is identified by occasional periods of refactoring that are not consistent with the software development process. Liu et al. [76] investigate refactoring histories on data collected from 753,367 engineers and suggest that between floss and root canal, the most frequently adopted refactoring tactic by engineers is floss. Sousa et al. [29] classify refactoring as floss and root canal and conduct a study on software projects to examine refactoring opportunities indicated by code smells. In this study, we identify new variations of refactoring tactics in addition to the previously mentioned tactics. Moreover, we study their relationship with code quality in terms of code smells.
Refactoring detection. Several tools and approaches are introduced to automatically identify refactoring operations [38], [79]. The main idea behind these approaches is to compare different versions of the code fragments stored in a version control system and point out refactoring operations. These tools can help us study refactoring activities on a large scale in the software maintenance process. Kim et al. [79] introduce Ref-Finder, which takes two versions of a program as input from workspace snapshots or subversion of a repository and extracts logical facts about the syntactic structure of a program. Nevertheless, Soares et al. [80] conduct a study and show that Ref-Finder has low precision and recall, which leads to false-positive results and means it is inaccurate in detecting refactorings. However, Tsantalis et al. [38] design a tool, Rminer, that overcomes the above constraints. Similarly to Ref-Finder [79], Rminer [14], [38] takes two revisions of source code from the commit history in the version control system of a Java project and returns a list of refactoring operations applied between the two versions. Using a similar approach, Alizadeh et al. [81] introduce a bot integrated into a version control system that monitors software repositories and identifies refactoring opportunities by analyzing recently changed files through pull requests. It then finds the best series of refactorings to fix the quality issues. In this work, we employ the refactoring detection approach Rminer, which was developed in previous research [14], to extract refactoring operations from our dataset. We also validate its effectiveness in our context.
Refactoring and code quality. Prior work has performed studies regarding the relationship between refactoring and code quality. Almogahed et al. [82] examine the studies that identify the impacts of code refactoring on software quality. They find that researchers agree that refactoring has a positive impact on both internal and external quality attributes. Moreover, Lacerda et al. [83] conduct a literature review on refactoring tools and common code smells to measure the relationship between refactoring operations and code smells. By analyzing the initial and final code smells after refactoring, the study finds that a significant proportion of code smells get eliminated after performing refactoring, which in turn preserves or enhances software quality during the maintenance process. Moreover, it notices that code smells and refactoring are linked by quality attributes, and that the quality attributes that affect code smells are the same ones that affect refactoring. Bibiano et al. [17] correlate and study the effect of batch refactoring on code smells. They find that more than one refactoring operation is usually required to eliminate the code smells. Cinnéide et al. [9] conduct a survey study on the benefits of refactoring and argue that, although refactoring is commonly believed to aim at removing code smells, developers are not strongly motivated by the desire to eliminate them. Murphy et al. [28] define two refactoring tactics, floss and root canal, using a dental metaphor. Floss involves frequent refactoring with other program changes, while root canal involves infrequent, longer periods of refactoring with few other program changes. Murphy et al. propose five principles and evaluate tools for alignment with floss tactics. They find that the tools are not aligned with floss tactics and are therefore not suitable for floss refactoring. They suggest that floss refactoring is likely to result in higher quality and lower costs in the long run. However, they do not propose a quantitative approach to evaluate this claim. Previous studies [12], [13], [14], [15], [16], [17] link refactoring with code smells and code quality, which makes code smells a good quality indicator of code after performing refactoring operations. Therefore, we use code smells to measure the relationship between the identified rhythms and tactics with code quality. Different from the existing studies, our work is the first to quantitatively study the relationship between refactoring rhythms/tactics and code quality.
VII. CONCLUSION
In this study, we investigate the refactoring activities on a dataset consisting of 196 Apache projects to identify refactoring tactics and rhythms that developers and projects adopt in the software development process. We also examine their relationship with code quality in terms of code smells. Comparing both refactoring and development activities, we first determine that in more than 95% of project stages developers use a systematic refactoring rhythm on weekdays. Two major rhythms are identified as (1) work-day refactoring and (2) all-day refactoring. By considering the relationship between refactoring rhythms and the quality metrics (i.e., code smells), we observe that different refactoring rhythms do not make a statistically significant difference to the code quality. Moreover, by clustering the life cycle of refactoring activities, we find four variations of existing refactoring tactics: (1) frequent spiked floss, (2) intermittent spiked floss, (3) frequent root canal, and (4) intermittent root canal refactoring tactics. We observe that root canal-based tactics (frequent root canal and intermittent root canal) are associated with a larger reduction in the frequency of code smells compared to floss-based tactics (frequent spiked floss and intermittent spiked floss). Our findings can help researchers and practitioners understand practical refactoring activities in real-world projects and their relationship with code quality. Practitioners can leverage our findings to choose the appropriate refactoring patterns for their projects based on their resources and code quality requirements. For future work, we plan to conduct experiments for other programming languages and focus more on automatic vs. manual refactoring operations.
[44] N. Tsantalis, T. Chaikalis, and A. Chatzigeorgiou, “JDeodorant: Identification and removal of type-checking bad smells,” in Proc. 12th Eur. Conf. Softw. Maintenance Reengineering, Piscataway, NJ, USA: IEEE Press, 2008, pp. 329–331.
[45] U. Azadi, F. A. Fontana, and D. Taibi, “Architectural smells detected by tools: A catalogue proposal,” in Proc. IEEE/ACM Int. Conf. Tech. Debt (TechDebt), Piscataway, NJ, USA: IEEE Press, 2019, pp. 88–97.
[46] J. Garcia, D. Popescu, G. Edwards, and N. Medvidovic, “Identifying architectural bad smells,” in Proc. 13th Eur. Conf. Softw. Maintenance Reengineering, Piscataway, NJ, USA: IEEE Press, 2009, pp. 255–258.
[47] F. Palomba, A. Panichella, A. Zaidman, R. Oliveto, and A. De Lucia, “The scent of a smell: An extensive comparison between textual and structural smells,” in Proc. 40th Int. Conf. Softw. Eng., 2018, pp. 740–740.
[48] “AlDanial/cloc.” GitHub. [Online]. Accessed: Feb. 10, 2023. Available: https://github.com/AlDanial/cloc
[49] “GitHub API.” GitHub. Accessed: Apr. 3, 2023. [Online]. Available: https://docs.github.com/en/rest
[50] F. E. Harrell et al., Regression Modeling Strategies. With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis, vol. 3. Berlin, Germany: Springer-Verlag, 2015.
[51] E. Noei, F. Zhang, and Y. Zou, “Too many user-reviews! What should app developers look at first?” IEEE Trans. Softw. Eng., vol. 47, no. 2, pp. 367–378, Feb. 2021.
[52] J. Miles, “R-squared, adjusted r-squared,” in Encyclopedia of Statistics in Behavioral Science. Hoboken, NJ, USA: Wiley, 2005.
[53] F. Zhang, A. Mockus, I. Keivanloo, and Y. Zou, “Towards building a universal defect prediction model,” in Proc. 11th Work. Conf. Mining Softw. Repositories, 2014, pp. 182–191.
[54] J. A. Hartigan and M. A. Wong, “Algorithm AS 136: A k-means clustering algorithm,” J. Roy. Statistical Soc. C (Appl. Statist.), vol. 28, no. 1, pp. 100–108, 1979.
[55] R. L. Thorndike, “Who belongs in the family?” Psychometrika, vol. 18, no. 4, pp. 267–276, 1953.
[56] M. Syakur, B. Khotimah, E. Rochman, and B. D. Satoto, “Integration k-means clustering method and elbow method for identification of the best customer profile cluster,” IOP Conf. Ser. Mater. Sci. Eng., vol. 336, no. 1, 2018, Art. no. 012017.
[57] P. E. McKnight and J. Najab, “Mann–Whitney U test,” in The Corsini Encyclopedia of Psychology. Hoboken, NJ, USA: Wiley, 2010, p. 1.
[58] T. Dahiru, “P-value, a true test of statistical significance? A cautionary note,” Ann. Ibadan Postgraduate Med., vol. 6, no. 1, pp. 21–26, 2008.
[59] S. K. Haldar, “Statistical and geostatistical applications in geology,” in Mineral Exploration. Amsterdam, The Netherlands: Wiley, 2018, pp. 167–194.
[60] L. Rising and N. S. Janoff, “The scrum software development process for small teams,” IEEE Softw., vol. 17, no. 4, pp. 26–32, Jul./Aug. 2000.
[61] J. C. Gower, “Properties of euclidean and non-euclidean distance matrices,” Linear Algebra Its Appl., vol. 67, pp. 81–97, Jun. 1985.
[62] M. Kljun and M. Tersěk, “A review and comparison of time series similarity measures,” in Proc. 29th Int. Electrotechnical Comput. Sci. Conf. (ERK), Portorož, Slovenia, 2020, pp. 21–22.
[63] S. K. Gaikwad, B. W. Gawali, and P. Yannawar, “A review on speech recognition technique,” Int. J. Comput. Appl., vol. 10, no. 3, pp. 16–24, 2010.
[64] E. G. Jelihovschi, J. C. Faria, and I. B. Allaman, “ScottKnott: A package for performing the Scott-Knott clustering algorithm in R,” TEMA (São Carlos), vol. 15, no. 1, pp. 3–17, 2014.
[65] C. J. Ferguson, “An effect size primer: A guide for clinicians and researchers.” Prof. Psychol. Res. Pract., vol. 40, no. 5, pp. 532–538, 2009.
[66] I. Spieler, S. Scheibe, C. Stamov-Roßnagel, and A. Kappas, “Help or hindrance? Day-level relationships between flextime use, work–nonwork boundaries, and affective well-being.” J. Appl. Psychol., vol. 102, no. 1, pp. 67–87, 2017.
[67] J. Romano, J. D. Kromrey, J. Coraggio, and J. Skowronek, “Should we really be using t-test and Cohen’s d for evaluating group differences on the NSSE and other surveys,” in Proc. Annu. Meeting Florida Assoc. Institutional Res., 2006, pp. 1–3.
[68] “Selecting the number of clusters with silhouette analysis on KMeans clustering.” Scikit-learn. Accessed: Aug. 23, 2023. [Online]. Available: https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html
[69] E. S. Dalmaijer, C. L. Nord, and D. E. Astle, “Statistical power for cluster analysis,” BMC Bioinf., vol. 23, no. 1, pp. 1–28, 2022.
[70] U. Rani and S. Sahu, “Comparison of clustering techniques for measuring similarity in articles,” in Proc. 3rd Int. Conf. Comput. Intell. Commun. Technol. (CICT), Piscataway, NJ, USA: IEEE Press, 2017, pp. 1–7.
[71] T. Yatsunenko et al., “Human gut microbiome viewed across age and geography,” Nature, vol. 486, no. 7402, pp. 222–227, 2012.
[72] P. J. Rousseeuw, “Silhouettes: A graphical aid to the interpretation and validation of cluster analysis,” J. Comput. Appl. Math., vol. 20, pp. 53–65, Nov. 1987.
[73] T. L. Wahl, “Discussion of “Despiking acoustic doppler velocimeter data” by Derek G. Goring and Vladimir I. Nikora,” J. Hydraul. Eng., vol. 129, no. 6, pp. 484–487, 2003.
[74] M. Parsheh, F. Sotiropoulos, and F. Porté-Agel, “Estimation of power spectra of acoustic-doppler velocimetry data contaminated with intermittent spikes,” J. Hydraul. Eng., vol. 136, no. 6, pp. 368–378, 2010.
[75] R. Q. Quiroga, Z. Nadasdy, and Y. Ben-Shaul, “Unsupervised spike detection and sorting with wavelets and superparamagnetic clustering,” Neural Comput., vol. 16, no. 8, pp. 1661–1687, 2004.
[76] H. Liu, Y. Gao, and Z. Niu, “An initial study on refactoring tactics,” in Proc. IEEE 36th Annu. Comput. Softw. Appl. Conf., Piscataway, NJ, USA: IEEE Press, 2012, pp. 213–218.
[77] E. Fernandes et al., “Refactoring effect on internal quality attributes: What haven’t they told you yet?” Inf. Softw. Technol., vol. 126, Oct. 2020, Art. no. 106347.
[78] X. Wang et al., “Exploring scientists’ working timetable: Do scientists often work overtime?” J. Informetrics, vol. 6, no. 4, pp. 655–660, 2012.
[79] M. Kim, M. Gee, A. Loh, and N. Rachatasumrit, “Ref-Finder: A refactoring reconstruction tool based on logic query templates,” in Proc. 18th ACM SIGSOFT Int. Symp. Found. Softw. Eng., 2010, pp. 371–372.
[80] G. Soares, R. Gheyi, E. Murphy-Hill, and B. Johnson, “Comparing approaches to analyze refactoring activity on software repositories,” J. Syst. Softw., vol. 86, no. 4, pp. 1006–1022, 2013.
[81] V. Alizadeh, M. A. Ouali, M. Kessentini, and M. Chater, “RefBot: Intelligent software refactoring bot,” in Proc. 34th IEEE/ACM Int. Conf. Automated Softw. Eng. (ASE), Piscataway, NJ, USA: IEEE Press, 2019, pp. 823–834.
[82] A. Almogahed, M. Omar, and N. H. Zakaria, “Impact of software refactoring on software quality in the industrial environment: A review of empirical studies,” in Proc. Knowl. Manage. Int. Conf. (KMICe), Miri Sarawak, Malaysia, Jul. 25–27, 2018, pp. 229–234.
[83] G. Lacerda, F. Petrillo, M. Pimenta, and Y. G. Guéhéneuc, “Code smells and refactoring: A tertiary systematic review of challenges and observations,” J. Syst. Softw., vol. 167, Sep. 2020, Art. no. 110610.