On The Effectiveness of Developer Features in Code Smell Prioritization - A Replication Study
1. Introduction
Code quality is a major concern for software developers since it is closely related to overall software quality. However,
to deliver software before deadlines, trade-offs are usually made between code quality and speed of delivery (Pecorelli
et al., 2020). Code smells (i.e., sub-optimal code implementation and design choices (Fowler et al., 1999)) are
consequences of such trade-offs which can hinder software maintainability and reliability in the long run (Palomba
et al., 2018b). Code smell detectors have been actively developed in the last 2 decades (Sobrinho et al., 2021), and they
have achieved promising results in detecting major code smells at various granularities (e.g., statement (Saboury et al.,
2017), method, class, and package (Moha et al., 2010)) and types (e.g., structural and textual (Palomba et al., 2018c)).
Practitioners need to focus on removing the worst code smells in advance (Pecorelli et al., 2020; Sae-Lim et al.,
2017) since Software Quality Assurance (SQA) resources are limited. However, automatic code smell detection tools may produce an excessive number of results; manually inspecting every result would be time-consuming, yet only a few results are of high priority. Consequently, both rule-based and machine-learning-based code smell detectors
are perceived as unhelpful by practitioners (Pecorelli et al., 2020; Ferreira et al., 2021; Sae-Lim et al., 2018a). To
improve the developers’ acceptance of code smell detectors, prior studies focused on aligning their results with the
developers’ perceptions. They built machine learners capturing structural (Fontana and Zanoni, 2017; Pecorelli et al.,
2020; Sae-Lim et al., 2018a) (e.g., coupling, cohesion, complexity) and contextual (e.g., error-proneness (Sae-Lim
et al., 2018a), change history, developer (Pecorelli et al., 2020)) information to rank the results.
The replicated MSR’20 paper (Pecorelli et al., 2020) presented major progress in code smell prioritization which
proposed a developer-driven and machine learning based approach to rank 4 code smells including Blob, Complex
Class, Spaghetti Code, and Shotgun Surgery. In terms of novelty, it introduced the concept of “developer-driven" which
referred to (1) the proposed features being related to developers, i.e., covering development process and developer
experience aspects, and (2) the dataset being collected from the original developers with their comments included. In terms of
paper influence, it was published in a premier venue for data science and machine learning in software engineering1, has been cited 40 times according to Google Scholar, and was authored by some of the most influential scholars in code smell research (see nodes 9#, 14#, and 16# in Fig. 12 of Sobrinho et al. (2021)).
The MSR paper used the model constructed using the feature generation method of the KBS paper (Fontana and
Zanoni, 2017) as the baseline method. The KBS paper exploited multiple automatic static analysis tools to calculate
software metrics such as cohesion, coupling, complexity, and size. The major conclusion of the MSR paper is that prioritization models should be built on mixed features (i.e., developer, process, and code metrics) instead of the pure code metrics used by the KBS baseline.
However, after manually investigating the developers’ comments available online 2 , we find that some developer-related factors are rarely mentioned in the comments. Unexpectedly, the MSR paper claimed that involving features related to such factors significantly improved model performance. Explaining this inconsistency is the motivation of our research.
Thus, we replicate the MSR paper to discover the cause of the inconsistency between developer comments and
performance improvement. We suspect that inappropriate feature selection harmed the performance of the baselines and thereby made the new features appear to yield a significant improvement. To verify our assumption, first,
we apply the state-of-the-art feature selection methods mentioned in prior work (Zhao et al., 2021; Jiarpakdee et al.,
2018, 2020) on the MSR dataset and its KBS baseline to assess their ability to mitigate multicollinearity. Then, we
use the selected features to build prediction models and to evaluate their impact on model performance based on
additional performance metrics suitable for human rating assessment. Finally, we exploit XAI techniques to explain
the best-performing model, and we manually assess the agreement between the model’s behavior and the developers’
comments.
The major findings in this replication study include:
(1) In terms of data preprocessing, inappropriate feature selection may lead to biased results. Correlation-based
Feature Selection (CFS) should not be used as a default method, since it could negatively impact performance by up
to 34% and 72% in AUC-ROC and Krippendorff’s Alpha, respectively.
(2) In terms of model construction, pure code metrics could be better for prioritizing 3 of the 4 smells (all except Shotgun Surgery), which indicates that we may not need additional features to correctly prioritize change-insensitive code
smells.
(3) In terms of performance evaluation, classification metrics may be biased for ordinal rating tasks, i.e., such
metrics ignore the distance between the predicted value and the actual label. Metrics such as Krippendorff’s Alpha
should be reported as well (Tian et al., 2016).
(4) In terms of model explainability, our manual verification shows that the agreement between developers and the model built with the pure code metric dataset is higher.
Replication Package. We provide an executable code capsule verified by CodeOcean 3 including the datasets and
model training code with all parameters included for performance replication. The details are available in subsection A
of the appendix section. We also provide an executable example 4 for demonstrating XAI for code smell prioritization.
The rest of this paper is organized as follows. In Section 2 we summarize related work. Section 3 presents the
background and datasets, while Section 4 outlines the research questions and methodologies. In Section 5 we discuss
the results, while Section 6 overviews the threats to the validity. Finally, Section 7 concludes the paper and describes
future research.
2. Related Work
This section describes studies related to code smell detection and prioritization, as well as XAI empirical studies
in Software Quality Assurance (SQA).
1 https://conf.researchr.org/home/msr-2020
2 https://figshare.com/s/94c699da52bb7b897074
3 https://doi.org/10.24433/CO.9879914.v2
4 https://github.com/escapar/JSS23_demo
validated the results using a strategy for evaluating recommender systems (e.g., measuring nDCG as task relevance)
and manual verification.
Developer and Process. Our replicated study (Pecorelli et al., 2020) used structural code metrics and process metrics of software systems to measure priority, i.e., it focused on the present and historical information of the code rather than its relationships with other software artifacts. More details are introduced in Section 3. Vidal et al.
prioritized code smells (Vidal et al., 2016; Guimarães et al., 2018) and their agglomerations (Vidal et al., 2019)
according to developers’ preference (e.g., prefer to improve coupling or cohesion), their impact on software architecture,
agglomeration size, and change histories.
Inter-Smell Relationship. Liu et al. (2012) arranged resolution sequences of code smells to save effort for
refactoring. Palomba et al. (2017, 2018a) mined association rules among code smells to reveal their co-occurrence,
and they suggested focusing on method smells that are likely to disappear along with the class-level ones. Inspired by
their studies, we proposed a refactoring route generation approach (Wang et al., 2021) based on the association rules
for eliminating Python code smells.
Table 1
Statistics of Projects in The Datasets
Complex Class. Classes with high complexity, e.g., too many loops and conditional control statements. Complex
Classes could be determined by complexity code metrics such as CYCLO (Brown et al., 1998) and code readability
(Buse and Weimer, 2009).
Spaghetti Code. Classes that do not follow Object-Oriented Programming (OOP) principles, e.g., containers of long methods that do not interact with each other. Spaghetti Code could be detected by size metrics such as LOC (Lines of
Code) as well as the absence of inheritance (Moha et al., 2010).
Shotgun Surgery. Classes that frequently trigger small co-changes in other classes. Shotgun Surgery indicates high coupling; however, it is better detected from historical code change information (Pecorelli et al., 2020; Palomba et al., 2015, 2013) than from code metrics. Shotgun Surgery is a change-sensitive smell because variations in code change metrics greatly increase its severity, whereas the severity of the other 3 smells investigated is not directly influenced by frequent code changes.
The authors tracked the commits of 9 established projects of the Apache and Eclipse open-source foundations over 6 months, and the information on these projects is listed in Table 1. The authors used rule-based detectors to identify
code smells daily, and they manually discarded false positives. Afterward, they sent emails to the original developers to collect, as soon as possible, their perceptions of the criticality of the smells. The perceived criticalities in the MSR paper originally ranged from 1 to 5. Since the boundaries between criticality levels {1, 2} and {4, 5} were not clear, the MSR paper (Pecorelli et al., 2020) merged them into 3 criticality levels, i.e., {NON-SEVERE, MEDIUM, SEVERE}. Thus, the prediction was performed over the merged criticalities. Finally, they collected 1,332 instances almost equally distributed among the 4 smells.
The authors also provided an online appendix 7 containing the original developers’ comments. The online appendix
contains short comments of original developers describing their attitudes toward the criticality of every code smell
instance. However, we find that 5 comments are missing from the corresponding folders named after the code smells, and thus these instances are discarded from the model explanation. Consequently, we use 1,327 samples in our replication study.
Table 2
Features, Examples of Developer Comments, and Their Reflected Aspects
Table 3
Aspects of Contents Presented in the Comments of Developers
Code Smell PROC DEV INNER CROSS Number of Instances
Blob 66 175 206 2 341
Complex Class 37 164 304 3 349
Spaghetti Code 0 5 307 2 311
Shotgun Surgery 4 3 125 225 326
Total 107 347 942 232 1327
Proportion (%) 8.06 26.15 70.99 17.26 100
in developers’ comments, which is not identical to the definition of “complex" measured by complexity metrics. Although “complex" from the developers’ point of view means that they can hardly understand the code, they may refer either to complex control flow or to complex interactions with other code components. The category of such comments is assigned according to their context (e.g., whether coupling or cohesion is mentioned) and the smells they refer to (e.g., “complex" in Shotgun Surgery is related to the CROSS aspect).
The manual identification is performed independently by the 1st author (Ph.D. student, Java developer with 5
years of experience) and the 5th author (Ph.D. student studying code summarization). Later, we involve the 6th author
(Ph.D. student studying code generation) to validate the results together with the 2 authors. The initial agreement rate
is 75.05%. Apart from confirmed misunderstandings and mislabeling, most of the inconsistencies concern whether developers expressed their own opinions on the criticality, and whether comments referring to “legacy code" express PROC factors. Finally, we reach an agreement that legacy code can be captured by process metrics.
To clarify, we do not use a grounded theory approach to define categories entirely from the developers’ comments, because the aim of this study is replication rather than a fine-grained analysis of developers’ attitudes toward code smell criticality; moreover, the present category partitioning is closer to the design and the claimed improvement of the replicated paper.
[Figure 1: Overview of the experimental process (developers, OSS repositories, smell detection tool, detection results, criticality labels, KBS dataset, feature selection, RF models, and discussion of the better approach).]
The experimental process is depicted in Figure 1. The part with grey background represents the work of the MSR
paper, while the white parts are experiments conducted in our study.
The goal of our study is to validate whether pure code metrics (KBS) are still inferior to mixed features (MSR) when better experimental settings (e.g., feature selection) are applied, and thereby to assess whether some conventional processes for building code-smell-related machine learning models are reliable. Our perspective is that of both researchers and practitioners: our replication could help the former adopt more reasonable features to build models, while for the latter we provide explanations for every predicted instance. To these ends, we propose the following 5 research questions.
RQ1: Can we replicate the model performance using MSR and KBS datasets in the context of the MSR paper?
Motivation. This RQ aims to replicate the MSR paper to present the basic performance of the models built with
the MSR and the KBS datasets.
Approach. We use the experimental settings of the MSR paper and exploit the implementation of SCIKIT-LEARN
(Pedregosa et al., 2011) for building classifiers, i.e., we exploit cross-project prediction, apply feature selection, and
perform grid search over parameters of multiple evaluated classifiers. Specifically, we perform CFS independently
for every code smell. Moreover, we report the performance before and after feature selection. We also involve 2 more
common performance metrics related to regression and rating, as recommended in a related empirical study (Tian et al.,
2016), i.e., Krippendorff’s Alpha (Krippendorff, 1970), and Cohen’s Kappa (Cohen, 1960), and they will be described
in Section 4.2 in detail. We exclude Precision and Recall to avoid reporting an excessive number of metrics, since other classification metrics already summarize them. We also exploit the Wilcoxon Rank-Sum Test and Cliff’s Delta to compare the prediction results across the 10 folds obtained with CFS and without any feature selection, to verify whether a significant difference with a non-negligible effect size exists.
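For illustration, the sketch below shows how such a grid search could be wired with scikit-learn; the file name, column name, and parameter grid are hypothetical placeholders, and the actual classifiers, parameter grids, and the cross-project setup of the replication are provided in the code capsule.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Hypothetical dataset: one row per smell instance, "criticality" = merged level (0/1/2).
df = pd.read_csv("blob_msr_features.csv")            # placeholder file name
X, y = df.drop(columns=["criticality"]), df["criticality"]

# Grid search over Random Forest hyper-parameters with 10-fold cross-validation.
# The replication itself uses a cross-project split; stratified folds are used here for brevity.
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10, 20]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="roc_auc_ovr",                            # multi-class AUC-ROC (one-vs-rest)
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=42),
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```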
RQ2: Is there a better choice as a default option for feature selection that guarantees stable performance?
Motivation. The MSR paper used CFS as the default feature selection method since CFS is known for its stable performance in removing correlated features. However, a valid default option should guarantee good or best performance in
most cases. We think the inconsistency between our observation on comments and the original conclusions (e.g., C1
and C3) may be caused by feature selection. Since CFS was reported to be suboptimal in a recent empirical study on defect prediction (Jiarpakdee et al., 2020), we intend to investigate whether CFS is the most stable method in terms of both
multicollinearity mitigation and the impact on model performance for code smell prioritization.
Approach. We exploit the feature selection methods applied in Zhao et al. (2021) and AutoSpearman (Jiarpakdee
et al., 2020, 2018). These methods will be described in Section 4.3. Then, we measure the proportion of feature pairs exhibiting collinearity (Spearman’s 𝜌 > 0.7 (Jiarpakdee et al., 2020)) and the proportion of features exhibiting multicollinearity (VIF > 5 (Jiarpakdee et al., 2020)). Finally, the results of multicollinearity mitigation and
the impact on performance are tested by SK-ESD (Scott-Knott Effect Size Difference) (Tantithamthavorn et al., 2017;
Tantithamthavorn et al., 2019), which groups and ranks the values of performance metrics. The above-mentioned
statistical approaches will be described in Section 4.4.
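As a minimal sketch of these two diagnostics (assuming the selected features are held in a pandas DataFrame and using the VIF routine from statsmodels), the proportions could be computed roughly as follows.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor


def collinear_pair_ratio(features: pd.DataFrame, threshold: float = 0.7) -> float:
    """Proportion of feature pairs with |Spearman rho| above the threshold."""
    corr = features.corr(method="spearman").abs().to_numpy()
    upper = corr[np.triu_indices_from(corr, k=1)]     # each pair counted once
    return float((upper > threshold).mean())


def multicollinear_feature_ratio(features: pd.DataFrame, threshold: float = 5.0) -> float:
    """Proportion of features whose variance inflation factor exceeds the threshold."""
    X = sm.add_constant(features).to_numpy(dtype=float)   # intercept for the auxiliary regressions
    vifs = [variance_inflation_factor(X, i) for i in range(1, X.shape[1])]  # skip the constant
    return float(np.mean(np.array(vifs) > threshold))


# Usage with a hypothetical selected-feature table:
# df = pd.read_csv("kbs_selected_features.csv")
# print(collinear_pair_ratio(df), multicollinear_feature_ratio(df))
```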
RQ3: Can we achieve better performance through better feature selection which mitigates collinearity and
multicollinearity?
Motivation. This RQ is designed to validate C1 and C2 in our context. Since RQ2 reveals the overall performance
of feature selection methods, we evaluate their actual impact on model performance with respect to RQ1.
Approach. We pick the best-performing feature selection algorithm for every code smell and present the resulting model performance and its variation. We then exploit the Wilcoxon Rank-Sum Test and Cliff’s Delta to compare the prediction results across the 10 folds obtained with the best-performing feature selection and without any feature selection.
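A minimal sketch of this comparison is shown below; the per-fold AUC-ROC values are purely hypothetical placeholders, and Cliff’s Delta is implemented directly from its definition rather than taken from a library.

```python
from scipy.stats import ranksums


def cliffs_delta(xs, ys):
    """Cliff's delta effect size: P(x > y) - P(x < y) over all pairs."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


# Hypothetical per-fold AUC-ROC values for the two settings (10 folds each).
auc_best_selection = [0.91, 0.89, 0.93, 0.90, 0.92, 0.88, 0.94, 0.91, 0.90, 0.92]
auc_no_selection = [0.90, 0.88, 0.92, 0.89, 0.91, 0.87, 0.93, 0.90, 0.89, 0.91]

stat, p_value = ranksums(auc_best_selection, auc_no_selection)
delta = cliffs_delta(auc_best_selection, auc_no_selection)
print(f"p = {p_value:.3f}, Cliff's delta = {delta:.2f}")
```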
RQ4: Based on features selected in RQ3, can we generate global feature importance similar to the MSR paper?
Motivation. This RQ is designed to validate the dependability of C3 in terms of global explanation. Since
multicollinearity is mitigated, we can exploit XAI techniques to validate the consistency of our conclusions with C3.
Approach. First, we generate global feature importance, i.e., the mean of the absolute SHAP feature importance values (SHAP will be introduced in Section 4.5). Afterward, we compare our conclusion with C3.
RQ5: Based on features selected in RQ3, which dataset could better reflect the opinions of the original
developers toward code smell criticality?
Motivation. This RQ is designed to validate the dependability of C3 in terms of local explanation. Since SHAP
could generate local explanations for every predicted instance, we intend to check the agreement of the model behavior
with the developers’ perceptions to reveal which dataset is more suitable. We use the opinions of developers to judge whether applying certain features is reasonable because a recent study (Aleithan, 2021) found that developers are unlikely to trust a model’s explanation if it is completely unexpected or misses key preferences they expect, which is in line with other XAI studies reporting that people prefer explanations consistent with their prior knowledge (Maltbie et al., 2021). We also believe that following the decisions of the original developers helps make the prioritization more reliable and closer to the ground truth.
Approach. We classify the features into the 4 categories presented in Table 2. Then, we check the consistency
of the categories of the top-ranked features and the categories reflected in the developers’ comments. In terms of consistency, we consider (1) perfect match, i.e., the top-ranked features derived by SHAP fall into exactly the same categories as those covered by the developers’ comments, (2) partial match, i.e., the proportion of matched categories, without requiring all categories to be identical for each predicted Java class, and (3) Kappa and Alpha agreement, treating the model and the original developer as two individual raters and using as input indicators of whether their explanations fall into each category. For each category, the rating is 1 if an explanation falls into that category, and 0 otherwise. Note that we only check whether a categorized factor is present, rather than identifying its positive or negative impact on criticality, because some comments are ambiguous. The features’ categories and a detailed example of the agreement calculation are available in the appendix.
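The sketch below illustrates one possible reading of this agreement computation on hypothetical category indicators; the exact procedure and a worked example are given in the appendix of the paper.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

CATEGORIES = ["PROC", "DEV", "INNER", "CROSS"]


def category_vector(categories_present):
    """Binary indicator over the 4 categories (1 = category appears in the explanation)."""
    return np.array([1 if c in categories_present else 0 for c in CATEGORIES])


# Hypothetical per-instance category sets: SHAP top-ranked features vs. developer comments.
model_matrix = np.array([category_vector(s) for s in [{"INNER"}, {"PROC", "DEV"}, {"CROSS"}]])
dev_matrix = np.array([category_vector(s) for s in [{"INNER"}, {"DEV"}, {"CROSS"}]])

perfect = np.all(model_matrix == dev_matrix, axis=1)       # per-instance perfect match
partial = np.mean(model_matrix == dev_matrix, axis=1)      # per-instance matched proportion

# Kappa treating model and developer as two "raters" over all binary category indicators;
# in the study this is computed over all instances of a smell.
kappa = cohen_kappa_score(model_matrix.ravel(), dev_matrix.ravel())
print(perfect, partial, kappa)
```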
\[
TPR = \frac{TP}{TP + FN}, \tag{1}
\]
\[
FPR = \frac{FP}{FP + TN}, \tag{2}
\]
\[
MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}, \tag{3}
\]
\[
F\text{-}Measure = \frac{2 \times TP}{2 \times TP + FP + FN} \tag{4}
\]
where TP is for true positive (positive sample predicted as positive), FN is for false negative (positive sample falsely
predicted as negative), TN is for true negative, and FP is for false positive.
Krippendorff’s Alpha measures inter-rater agreement (higher is better); it differs from classification metrics in that it takes the distance between the predicted and actual criticalities into account (Tian et al., 2016). We include it as suggested by a related study on bug report prioritization (Tian et al., 2016). Alpha > 0.66 indicates a reasonable agreement. The equation for calculating Alpha is listed in (5).
\[
\alpha = 1 - \frac{D_o}{D_e} \tag{5}
\]
$D_o$ is the observed disagreement between the assigned code smell criticalities, and $D_e$ is the disagreement expected when the ratings can be attributed to chance rather than to the inherent properties of the code smells themselves; their calculation is listed in (6) and (7).
\[
D_o = \frac{1}{n} \sum_{c} \sum_{k} o_{ck} \; {}_{\mathrm{metric}}\delta^2_{ck} \tag{6}
\]
\[
D_e = \frac{1}{n(n-1)} \sum_{c} \sum_{k} n_c \, n_k \; {}_{\mathrm{metric}}\delta^2_{ck} \tag{7}
\]
where $o_{ck}$, $n_c$, $n_k$, and $n$ refer to the frequencies of values in the coincidence matrices, and ${}_{\mathrm{metric}}\delta^2_{ck}$ refers to any difference function appropriate to the metric (i.e., the level of measurement) of the given data. We use the ordinal difference function ${}_{\mathrm{ordinal}}\delta^2_{ck}$ to calculate inter-annotator agreement, since the criticalities can be mapped to ordinal values (e.g., {0, 1, 2} for {NON-SEVERE, MEDIUM, SEVERE}). $\alpha$ ranges between 0 and 1: $\alpha = 1$ indicates perfect agreement between developer and model, while $\alpha = 0$ indicates agreement no better than chance.
The calculation of Kappa is listed in equation (8),
\[
\kappa = \frac{p_o - p_e}{1 - p_e} \tag{8}
\]
where $p_o$ is the observed agreement between the two raters, calculated as the proportion of cases where both raters agree on the same severity, and $p_e$ is the agreement expected between the two raters by chance, calculated by summing, over the severity levels, the products of the proportions of cases in which each rater assigned that severity. Kappa values greater than 0.60 indicate reliable ratings (Amidei et al., 2019).
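For illustration, both agreement metrics can be computed on hypothetical predicted and actual criticalities as follows; the sketch assumes the third-party krippendorff package and scikit-learn are available, and the rating vectors are made-up placeholders.

```python
import numpy as np
import krippendorff                                   # pip install krippendorff
from sklearn.metrics import cohen_kappa_score

# Hypothetical ordinal criticalities: 0 = NON-SEVERE, 1 = MEDIUM, 2 = SEVERE.
developer_ratings = np.array([2, 1, 0, 2, 1, 1, 0, 2])
model_predictions = np.array([2, 1, 1, 2, 0, 1, 0, 2])

# Krippendorff's Alpha with the ordinal difference function, treating the
# developer and the model as two raters (rows of the reliability matrix).
alpha = krippendorff.alpha(
    reliability_data=np.vstack([developer_ratings, model_predictions]),
    level_of_measurement="ordinal",
)

# Cohen's Kappa only rewards exact agreement and ignores the ordinal distance.
kappa = cohen_kappa_score(developer_ratings, model_predictions)
print(f"Alpha = {alpha:.2f}, Kappa = {kappa:.2f}")
```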
Table 4
Feature Selection Methods Applied
Family | Type | Name or Strategy | Abbreviation | Thresholds
Filter-based ranking | Statistics-based | Chi-Square | ChiSq | Select {15, 45, 75}%
Filter-based ranking | Statistics-based | Correlation | Corr | Select {15, 45, 75}%
Filter-based ranking | Statistics-based | Probabilistic Significance | Sig | Select {15, 45, 75}%
Filter-based ranking | Probability-based | Information Gain | IG | Select {15, 45, 75}%
Filter-based ranking | Probability-based | Gain Ratio | Gain | Select {15, 45, 75}%
Filter-based ranking | Probability-based | Symmetrical Uncertainty | Symm | Select {15, 45, 75}%
Filter-based ranking | Instance-based | ReliefF | Relief | Select {15, 45, 75}%
Filter-based ranking | Classifier-based | One-Rule | OneR | Select {15, 45, 75}%
Filter-based ranking | Classifier-based | SVM | SVM | Select {15, 45, 75}%
Filter-based subset | Correlation-based | Best First (BF), GreedyStepwise (GS) | CFS | -
Filter-based subset | Consistency-based | BF, GS | Consist | -
Wrapper-based subset (Wrap) | Nearest Neighbor | BF, GS | KNN | K=1
Wrapper-based subset (Wrap) | Logistic Regression | BF, GS | Log | -
Wrapper-based subset (Wrap) | Naive Bayes | BF, GS | NB | -
Wrapper-based subset (Wrap) | Repeated Incremental Pruning | BF, GS | JRip | -
Hybrid | VIF- and Correlation-based | AutoSpearman | AutoSpearman | VIF=5, 𝜌=0.7
None | - | - | None | -
third quartile, and 45% is the average of the other 2 thresholds, which is also close to the second quartile. We do not perform a grid search over the threshold since it would result in an excessively large search space; thus, we use these 3 representative values.
For AutoSpearman, we use the open-source Python implementation provided by the original authors10 . The
algorithm consists of two stages. In the first stage, Spearman rank correlation coefficients are calculated to reduce
collinearity between feature pairs; in the second stage, the variance inflation factor (VIF) is calculated to reduce
multicollinearity. While mitigating collinearity, AutoSpearman retains as many features as possible, thereby reducing its impact on model performance.
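The sketch below conveys the two-stage idea with simplified selection rules; it is not the authors’ implementation (which is the one actually used in our experiments), and the tie-breaking in the first stage is deliberately naive.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor


def two_stage_selection(features: pd.DataFrame, rho: float = 0.7, vif_cutoff: float = 5.0) -> pd.DataFrame:
    """Simplified sketch of the two-stage idea behind AutoSpearman (not the original code)."""
    # Stage 1: drop one feature of every pair with |Spearman rho| above the threshold.
    corr = features.corr(method="spearman").abs()
    keep = list(features.columns)
    for i, a in enumerate(features.columns):
        for b in features.columns[i + 1:]:
            if a in keep and b in keep and corr.loc[a, b] > rho:
                keep.remove(b)                        # the original uses a smarter tie-breaking rule
    selected = features[keep]

    # Stage 2: iteratively remove the feature with the largest VIF until all VIFs <= cutoff.
    while True:
        X = sm.add_constant(selected).to_numpy(dtype=float)
        vifs = pd.Series(
            [variance_inflation_factor(X, i) for i in range(1, X.shape[1])],
            index=selected.columns,
        )
        if vifs.max() <= vif_cutoff or selected.shape[1] == 1:
            return selected
        selected = selected.drop(columns=[vifs.idxmax()])
```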
data with normal distribution, thereby ranking performance metrics and other data. Within a ranking, there may
be multiple groups of performance data without significant statistical differences. The SK-ESD algorithm adds a
correction function for non-normally distributed data and removes statistically insignificant groups based on effect
size, and thus expanding the applicability of the original algorithm. In this paper, we use the R implementation of
SK-ESD provided by the original authors.
\[
g(\boldsymbol{z}) = \phi_0 + \sum_{i=1}^{M} \phi_i z_i, \tag{9}
\]
where 𝒛 ∈ {0, 1}𝑀 is the coalition vector (also known as simplified features (Lundberg and Lee, 2017)), and 𝑀 is
the maximum size of the coalition vector (i.e., the number of simplified features). Specifically, 𝑧𝑖 is the 𝑖-th binary
value in 𝒛, where 𝑧𝑖 = 1 means the corresponding feature is included in the coalition, and 𝑧𝑖 = 0 indicates the feature
is absent from the coalition. 𝜙0 is the average prediction value of the model, and 𝜙𝑖 is the Shapley value of the 𝑖-th
feature. A larger positive 𝜙𝑖 indicates a greater impact of the 𝑖-th feature on the positive prediction of the model. Note that ∣𝜙𝑖∣ is the SHAP feature importance score, which is theoretically guaranteed to be locally, consistently, and additively accurate for each data point (Rajbahadur et al., 2021). We use the Python implementation of SHAP (Lundberg et al.,
2020) in our study.
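A minimal sketch of how global (RQ4) and local (RQ5) explanations could be derived with the SHAP Python package is given below; the file and column names are placeholders, and the shape of the multi-class output may vary across SHAP versions, which the sketch tries to accommodate.

```python
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature table with merged criticalities (0/1/2) in a "criticality" column.
df = pd.read_csv("complex_class_selected_features.csv")   # placeholder file name
X, y = df.drop(columns=["criticality"]), df["criticality"]

model = RandomForestClassifier(random_state=42).fit(X, y)
explainer = shap.TreeExplainer(model)                      # efficient for tree ensembles
sv = explainer.shap_values(X)

# Depending on the SHAP version, `sv` is a list of per-class arrays or a single 3-D array.
per_class = sv if isinstance(sv, list) else np.moveaxis(sv, -1, 0)

# Global importance (RQ4): mean of absolute Shapley values per feature, averaged over classes.
global_importance = np.mean([np.abs(a).mean(axis=0) for a in per_class], axis=0)
ranking = pd.Series(global_importance, index=X.columns).sort_values(ascending=False)
print(ranking.head(10))

# Local explanation (RQ5): per-feature Shapley values of one instance for one class, e.g.
# local = per_class[predicted_class_index][instance_index]
```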
Table 5
Prioritization performance using all features and features selected by CFS
Blob
Metric | MSR-CFS | KBS-CFS | MSR (No Selection) | KBS (No Selection)
Alpha | 0.75 | 0.19 | 0.80 | 0.91
Kappa | 0.73 | 0.28 | 0.81 | 0.96
AUC-ROC | 0.87 | 0.64 | 0.91 | 0.98
F-Measure | 0.85 | 0.58 | 0.89 | 0.98
MCC | 0.74 | 0.29 | 0.81 | 0.96
Winner | MSR-CFS (with CFS); KBS (no selection)

Complex Class
Metric | MSR-CFS | KBS-CFS | MSR (No Selection) | KBS (No Selection)
Alpha | 0.44 | 0.83 | 0.46 | 0.83
Kappa | 0.63 | 0.82 | 0.65 | 0.83
AUC-ROC | 0.83 | 0.91 | 0.84 | 0.92
F-Measure | 0.77 | 0.89 | 0.78 | 0.90
MCC | 0.64 | 0.83 | 0.66 | 0.84
Winner | KBS & KBS-CFS

Spaghetti Code
Metric | MSR-CFS | KBS-CFS | MSR (No Selection) | KBS (No Selection)
Alpha | 0.49 | 0.30 | 0.61 | 0.80
Kappa | 0.41 | 0.49 | 0.52 | 0.70
AUC-ROC | 0.70 | 0.74 | 0.76 | 0.83
F-Measure | 0.65 | 0.67 | 0.71 | 0.81
MCC | 0.41 | 0.50 | 0.53 | 0.72
Winner | Tie (with CFS); KBS (no selection)

Shotgun Surgery
Metric | MSR-CFS | KBS-CFS | MSR (No Selection) | KBS (No Selection)
Alpha | 0.52 | 0.10 | 0.55 | 0.01
Kappa | 0.52 | -0.01 | 0.58 | -0.04
AUC-ROC | 0.75 | 0.49 | 0.79 | 0.48
F-Measure | 0.69 | 0.35 | 0.73 | 0.34
MCC | 0.52 | -0.01 | 0.59 | -0.04
Winner | MSR & MSR-CFS
Table 6
Multicollinearity Presence in the CFS Processed Dataset
Collinearity (𝜌)% Multicollinearity (VIF)%
Code Smell KBS MSR KBS MSR
Blob 0.00 0.00 0.00 0.00
Complex Class 4.58 3.03 10.52 23.08
Spaghetti Code 0.00 0.00 0.00 10.00
Shotgun Surgery 0.00 4.76 0.00 0.00
desirable results of prioritization models, and (iii) multicollinearity is not mitigated as suggested, which may hinder
the reliability of XAI conclusions.
Finding 1. C1 and C2 are mostly replicable in the original context of the MSR paper. However, using CFS for feature selection, relying only on classification metrics, and the presence of multicollinearity may lead to biased conclusions. Thus, we revise these experimental settings in the following RQs for a fair replication.