Shahbaz 2025 Leveraging Explainable Ai For Early
This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/JBHI.2025.3531294
Abstract—Leukemia, a prevalent childhood cancer affecting the blood and bone marrow, necessitates a proactive approach to risk mitigation through lifestyle adaptations. Despite previous investigations into similar aspects across different cancer types, a comprehensive inquiry focusing on all leukemia subtypes within Pakistan remains a significant research gap. Acknowledging the influence of regional variations on individuals’ lifestyles, this study aims to identify lifestyle and demographic factors associated with leukemia development and to predict specific leukemia subtypes using clinical data. Our data collection included 364 leukemia cases and 896 control subjects, gathered from different cancer or tertiary care hospitals in Islamabad and Peshawar, Pakistan. The data were categorized into laboratory results, demographic characteristics, and lifestyle parameters. For demographic and lifestyle analysis, we employed Machine Learning, statistical, and graph-based methods to assess leukemia development risk. This study highlights factors associated with a higher risk of leukemia, including passive smoking, rural residence, and poor nutrition. These insights emphasize the promotion of healthier lifestyle choices to potentially reduce leukemia incidence. Additionally, we transformed the clinical dataset into graph data, which was used for leukemia classification and subtype prediction. We conducted classification on both graph and structured (tabular) data, with the structured data achieving 96% accuracy on oversampled data. To enhance the interpretability of our classification outcomes, we employed the SHapley Additive exPlanations (SHAP) algorithm to explain the classification, offering insights into the rationale behind leukemia classification.

Index Terms—Leukemia, Risk Factors, Classification, Graph data, Lifestyle, Demography

M. Shahbaz, A. Basharat, and S. Hussain are with the National University of Computer and Emerging Sciences, Islamabad, Pakistan
R. Abbasi and N. Ahmad are with the Institute of Biomedical and Genetic Engineering, Islamabad, Pakistan
R. Yasmin is with Quaid e Azam University, Islamabad
S. Hussain is with The School of Engineering, Computing and Mathematics, Oxford Brookes University, Oxford, UK

I. INTRODUCTION

CANCER, a perilous disease, includes leukemia, a cancer of the blood and bone marrow marked by uncontrolled blood cell growth. Leukemia originates in the bone marrow and often swiftly enters the bloodstream. There are four main leukemia types: Acute Lymphoblastic Leukemia (ALL), Chronic Lymphocytic Leukemia (CLL), Acute Myeloid Leukemia (AML), and Chronic Myeloid Leukemia (CML). Acute leukemia involves immature cell formation, whereas chronic leukemia entails mature cell production.

According to statistics from Shaukat Khanum Memorial Cancer Hospital and Research Center (SKMCH and RC) and Karachi Diagnostic Centre (KDC), a total of 127,929 neoplasms, of which 119,486 were malignant and 8,443 benign, were reported during the past 28 years [1]. Leukemia is a highly lethal medical condition that claims thousands of lives annually. It is the most prevalent cancer type among children. In 2019, a comprehensive data collection effort was conducted across 204 countries and territories. The recorded statistics indicated that there were 0.64 million new cases, 11.6 million disability-adjusted life years, and 0.33 million deaths attributed to leukemia. Notably, 28.11% of leukemia-related fatalities were linked to AML, while 8.95% were associated with CML. Additionally, 14.24% and 13.33% of leukemia deaths resulted from ALL and CLL, respectively [2]. It is crucial to recognize that cancer is a preventable disease that demands significant lifestyle modifications [3].

Research on leukemia classification has been conducted in various countries [4]–[9], but there remains a gap in investigations within Pakistan, particularly regarding the identification of risk factors. Recognizing regional lifestyle and demographic variations that influence leukemia is essential. Prevailing leukemia classification methods rely heavily on laboratory data, while potential insights from lifestyle and demographic data are neglected. Integrating such information could enhance our understanding of disease prevalence and mitigation strategies. Additionally, the use of advanced methods like Machine Learning (ML) and Knowledge Graphs (KG) for leukemia risk analysis remains limited, as most such studies focus on other cancer types [10].

To address this research gap, we compiled a dataset encompassing laboratory, lifestyle, and demographic features, sourced from tertiary care hospitals in Peshawar and Islamabad. By incorporating lifestyle and demographic information, we predicted risk factors and assessed them using the traditional Odds Ratio (OR), ML techniques, and KG methods. In this study, we integrated laboratory, lifestyle, and demographic features to improve leukemia classification and subtype identification, thereby enhancing detection accuracy. The dataset comprised 363 cases and 896 controls, and to address imbalanced data, we applied various class balancing techniques to reduce biases and enhance prediction reliability. Additionally, we conducted a comparative analysis between structured and graph data for classification. The Neo4j system served as a robust tool for analyzing and querying graph data, enabling us to explore its potential benefits in leukemia classification
© 2025 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See https://www.ieee.org/publications/rights/index.html for more information.
This article has been accepted for publication in IEEE Journal of Biomedical and Health Informatics. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/JBHI.2025.3531294
explainable AI to explain the best-performing model. An overview of our work is shown in Figure 1.

A. Preparing Input Data
At the beginning of the study, we gathered data from cancer hospitals in Peshawar and Islamabad, Pakistan, after obtaining consent from participants. The clinical data were divided into demographic, lifestyle, and laboratory categories. After consulting with doctors, we adjusted the laboratory features, updating them based on appropriate ranges.

B. Risk Prediction
For predicting risk, we used ML and KG methods to determine which method yields better results. In the ML approach, we employed XGBoost to identify important features. CDR, in turn, was previously utilized for lung cancer [14]. We therefore adopted both approaches for identifying leukemia risk factors. Additionally, we used OR to validate our findings, as it is commonly employed by medical specialists to verify results [27].

1) XGBoost Algorithm: XGBoost implements gradient-boosted decision trees. Weights were assigned to all variables independently, which helped to predict results. The weights of wrongly predicted variables were passed on to the next decision tree, which made the model more precise.

2) Connection Delta Ratio (CDR): First, we transformed the structured data into a graph format using Neo4j. The resulting graph data contained nodes representing patients or healthy individuals, along with other relevant features. CDR was employed as a measure of each feature’s importance in distinguishing between patients and healthy individuals. Specifically, CDR was used to evaluate the significance of connections between case and control data based on the relationships within the graph. By utilizing CDR, we identified the top-ranked factors, where features with CDR values greater than 0.1 were considered most significant for leukemia. For calculating CDR, connections to leukemia patients are represented as Targeted Patient Connections (TPC), while connections to control data are denoted as Healthy People Connections (HPC). The CDR is then defined as
CDR = (TPC - HPC) / (TPC + HPC).

3) Odds Ratio (OR): OR was used to identify the association between two variables. The odds of an event are the ratio of the probability of the event to the probability of the non-event, and the OR compares these odds between two groups. The OR is reported with a 95% Confidence Interval (CI) to check the association in the results.
1) OR > 1: the event is more likely to occur.
2) OR < 1: the event is less likely to occur.
3) OR = 1: no association.

C. Balanced distribution of classes
An imbalanced distribution among classes adversely affects accuracy. To improve accuracy, most researchers use data-level solutions, i.e., oversampling and undersampling, to balance data [28]. We first applied classification on the simple (unbalanced) datasets and then on the balanced datasets. In undersampling techniques, examples of the majority class were reduced, while in oversampling techniques, examples of minority classes were increased. As a result, all classes had an equal number of samples.

D. Diagnosing leukemia and its type
ML is an evolving branch of computational algorithms that emulate human intelligence by learning from the surrounding environment [29]. ML models learn from labeled datasets to predict output. For the classification of leukemia and its types, we employed various ML models due to the small and structured nature of our dataset. As noted in the literature review, DL approaches are typically utilized when working with gene or image data. This study utilizes structured (tabular) and graph data as inputs for the classification process, followed by a comparative analysis of their respective outcomes. A few ML algorithms were chosen based on the literature review. We implemented LR with L2 regularization, utilizing a One-Vs-Rest strategy to address the multi-class nature of the problem. The model configuration includes 500 epochs and a random seed set to 3. Various parameters were tested for the Multi-Layer Perceptron (MLP) model, and we opted for an MLP with 5 hidden layers of 256, 128, 64, 32, and 5 neurons, respectively. It was configured with a learning rate of 0.001, employed the ReLU activation function, and
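As a rough illustration of the CDR definition in Section B, the following Python sketch scores features from their TPC and HPC counts and keeps those above the 0.1 threshold. The function names and the example counts are hypothetical and are not taken from the study. The sketch also assumes raw connection counts are used directly, since the text does not say whether counts are normalized by group size.

```python
from typing import Dict, List, Tuple


def connection_delta_ratio(tpc: int, hpc: int) -> float:
    """Return (TPC - HPC) / (TPC + HPC) for one feature node."""
    total = tpc + hpc
    if total == 0:
        raise ValueError("feature has no connections to cases or controls")
    return (tpc - hpc) / total


def rank_features(
    counts: Dict[str, Tuple[int, int]], threshold: float = 0.1
) -> List[Tuple[str, float]]:
    """Score every feature and keep those whose CDR exceeds the threshold."""
    scored = [
        (name, connection_delta_ratio(tpc, hpc))
        for name, (tpc, hpc) in counts.items()
    ]
    kept = [(name, cdr) for name, cdr in scored if cdr > threshold]
    # Highest CDR first: strongest association with the case group.
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Hypothetical (TPC, HPC) counts, invented purely for illustration.
example_counts = {
    "feature_a": (30, 10),
    "feature_b": (12, 12),
    "feature_c": (9, 6),
}

ranked = rank_features(example_counts)
print(ranked)
```

CDR ranges from -1 (the feature is connected only to controls) to +1 (connected only to cases), so a feature connected equally to both groups scores 0 and falls below the threshold.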
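The odds ratio described in Section B can be sketched from a 2x2 case-control table as follows. The 95% confidence interval here uses the standard log-based (Woolf) approximation, which is an assumption: the text does not state how the interval was computed. The function name and the counts are hypothetical.

```python
import math
from typing import Tuple


def odds_ratio_with_ci(
    a: int, b: int, c: int, d: int, z: float = 1.96
) -> Tuple[float, float, float]:
    """Odds ratio and approximate confidence interval for a 2x2 table.

    a: exposed cases      b: unexposed cases
    c: exposed controls   d: unexposed controls
    z: normal quantile (1.96 gives an approximate 95% interval)
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for this estimate")
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the Woolf approximation.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    log_or = math.log(odds_ratio)
    lower = math.exp(log_or - z * se_log_or)
    upper = math.exp(log_or + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts, invented purely for illustration.
or_value, ci_low, ci_high = odds_ratio_with_ci(a=20, b=80, c=10, d=90)
print(f"OR = {or_value:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

An interval that contains 1 means the data are compatible with no association, even when the point estimate is above 1.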
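The oversampling idea in Section C can be sketched with plain random duplication of minority-class rows until every class matches the largest one. This is only the simplest data-level variant, chosen here as an assumption: the text does not name the specific balancing techniques used. The function name, labels, and rows are hypothetical.

```python
import random
from collections import Counter, defaultdict
from typing import Hashable, List, Sequence, Tuple


def random_oversample(
    rows: Sequence[Sequence[float]], labels: Sequence[Hashable], seed: int = 0
) -> Tuple[List[Sequence[float]], List[Hashable]]:
    """Duplicate minority-class rows at random until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    # Every class is grown to the size of the largest class.
    target = max(len(members) for members in by_class.values())

    out_rows: List[Sequence[float]] = []
    out_labels: List[Hashable] = []
    for label, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Hypothetical toy data: 6 controls, 3 AML rows, 2 ALL rows.
toy_labels = ["control"] * 6 + ["AML"] * 3 + ["ALL"] * 2
toy_rows = [[float(i)] for i in range(len(toy_labels))]

balanced_rows, balanced_labels = random_oversample(toy_rows, toy_labels)
print(Counter(balanced_labels))
```

In practice, oversampling is usually applied to the training split only, so that duplicated rows do not leak into the evaluation data.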
TABLE VI: Comparison of predicted risk factors
Rank  OR                  XGBoost Algorithm     CDR
1     Passive Smoking     Polluted Environment  Passive Smoking
2     Rural Area          Gas Fuel              Wood Fuel
3     Rare Carbon Drinks  Perfume Usage         Drink Spring Water
4     Drink Spring Water  Rural Area            Rural Area
5     Wood Fuel           Passive Smoking       Poor Nutrition

Fig. 5: Bar chart of results on structured data

by [25], classification was conducted to predict ALL. However, their approach differed from the present research in the number of features used: they employed 15 features, whereas 21 features are utilized here. Furthermore, their highest accuracy, achieved with an RF classifier, was 82%, while this analysis achieved a higher accuracy of 86% on a simpler dataset. These findings suggest that the additional features in this research may have contributed to improved accuracy. Nevertheless, direct comparisons are limited due to differences in features and dataset characteristics across studies. Overall, this work demonstrates the potential of a larger feature set to improve leukemia classification accuracy. Table 1 of the Supplementary Material compares these results with previous research.

A. SHAP-Explainable AI
The RF algorithm was chosen for in-depth analysis using the SHAP algorithm, due to its highest accuracy when applied to oversampled data. SHAP values were computed for each feature to assess its average impact on the model's output. The findings revealed that weight, age, and gender play crucial roles in accurately classifying leukemia or its specific type. However, the age feature exhibited bias, since the dataset predominantly consisted of children. To address this bias, an additional RF analysis specifically targeted lifestyle and laboratory features. The outcomes, depicted in Figure 7, emphasize perfumes as the most influential lifestyle feature, particularly for classifying AML, while smoking appeared least relevant. Factors such as nutrition and environmental elements also hold significance in classification.
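To make the idea behind SHAP values concrete, the sketch below computes exact Shapley values for a tiny hand-written model by enumerating all feature subsets. It is not the SHAP library and not the RF model from this study: the toy model, the single all-zero baseline (SHAP normally averages over background data), and all names are assumptions for illustration. Brute-force enumeration is only feasible for a handful of features.

```python
from itertools import combinations
from math import factorial
from typing import Callable, List, Sequence


def shapley_values(
    predict: Callable[[List[float]], float],
    instance: Sequence[float],
    baseline: Sequence[float],
) -> List[float]:
    """Exact Shapley value of each feature for one prediction."""
    n = len(instance)

    def value(present: frozenset) -> float:
        # Features outside `present` are replaced by their baseline value.
        x = [instance[i] if i in present else baseline[i] for i in range(n)]
        return predict(x)

    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for k in range(len(others) + 1):
            # Shapley weight for coalitions of size k.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                phi += weight * (value(s | {i}) - value(s))
        phis.append(phi)
    return phis


def toy_model(x: List[float]) -> float:
    """Hypothetical model: one additive term plus one interaction term."""
    return 2.0 * x[0] + x[1] * x[2]


phi = shapley_values(toy_model, instance=[1.0, 1.0, 1.0], baseline=[0.0, 0.0, 0.0])
print(phi)
```

The attributions sum to the difference between the prediction for the instance and the prediction for the baseline, and the interaction term is split equally between the two features that share it.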
Among laboratory features, the Red Blood Cell (RBC) count emerges as the most influential. Notably, all factors displayed minimal contribution specifically in ALL.

VII. CONCLUSION

The prediction of leukemia risk has traditionally relied on conventional methods. While lifestyle changes can vary by region, this type of analysis has not been conducted on Pakistani data. Moreover, previous studies have primarily used tabular data for leukemia type classification. In this study, after collecting relevant data, we applied OR, KG, and ML methods to identify the risk factors associated with leukemia. Because the case data were limited relative to the control data, we employed class balancing strategies to address data imbalance. We also converted clinical data into a graph format and compared its classification accuracy against structured data. The structured dataset achieved the highest accuracy of 96% through oversampling using ML algorithms. Even without balancing, structured data consistently outperformed graph data, emphasizing its effectiveness for classification tasks, particularly when dealing with smaller datasets and direct node relationships.

REFERENCES

[1] Shaukat Khanum Memorial Trust, “Cancer statistics,” accessed: July, 2023. [Online]. Available: https://shaukatkhanum.org.pk/health-care-professionals-researchers/cancer-statistics/
[2] M. Du, W. Chen, K. Liu, L. Wang, Y. Hu, Y. Mao, X. Sun, Y. Luo, J. Shi, K. Shao et al., “The global burden of leukemia and its attributable factors in 204 countries and territories: Findings from the global burden of disease 2019 study and projections to 2030,” Journal of Oncology, vol. 2022, 2022.
[3] P. Anand, A. B. Kunnumakara, C. Sundaram, K. B. Harikumar, S. T. Tharakan, O. S. Lai, B. Sung, and B. B. Aggarwal, “Cancer is a preventable disease that requires major lifestyle changes,” Pharmaceutical research, vol. 25, pp. 2097–2116, 2008.
[4] M. Zhou, K. Wu, L. Yu, M. Xu, J. Yang, Q. Shen, B. Liu, L. Shi, S. Wu, B. Dong et al., “Development and evaluation of a leukemia diagnosis system using deep learning in real clinical scenarios,” Frontiers in Pediatrics, p. 616, 2021.
[5] S. Rezayi, N. Mohammadzadeh, H. Bouraghi, S. Saeedi, and A. Mohammadpour, “Timely diagnosis of acute lymphoblastic leukemia using artificial intelligence-oriented deep learning methods,” Computational Intelligence and Neuroscience, vol. 2021, 2021.
[6] A. Rehman, N. Abbas, T. Saba, S. I. u. Rahman, Z. Mehmood, and H. Kolivand, “Classification of acute lymphoblastic leukemia using deep learning,” Microscopy Research and Technique, vol. 81, no. 11, pp. 1310–1317, 2018.
[7] S. H. Kassani, P. H. Kassani, M. J. Wesolowski, K. A. Schneider, and R. Deters, “A hybrid deep learning architecture for leukemic b-lymphoblast classification,” in 2019 International Conference on Information and Communication Technology Convergence (ICTC). IEEE, 2019, pp. 271–276.
[8] J.-N. Eckardt, T. Schmittmann, S. Riechert, M. Kramer, A. S. Sulaiman, K. Sockel, F. Kroschinsky, J. Schetelig, L. Wagenführ, U. Schuler et al., “Deep learning identifies acute promyelocytic leukemia in bone marrow smears,” BMC cancer, vol. 22, no. 1, pp. 1–11, 2022.
[9] P. K. Das and S. Meher, “An efficient deep convolutional neural network based detection and classification of acute lymphoblastic leukemia,” Expert Systems with Applications, vol. 183, p. 115311, 2021.
[10] A. Chen, “A novel graph methodology for analyzing disease risk factor distribution using synthetic patient data,” Healthcare Analytics, vol. 2, p. 100084, 2022.
[11] Neo4j, “Graph algorithms,” Neo4j, Inc., May 3, 2023. [Online]. Available: https://neo4j.com/docs/graph-data-science/current/algorithms/
[12] M. Jin, S. Xu, Q. An, and P. Wang, “A review of risk factors for childhood leukemia,” Eur Rev Med Pharmacol Sci, vol. 20, no. 18, pp. 3760–3764, 2016.
[13] D. W. Kaufman, T. E. Anderson, and S. Issaragrisil, “Risk factors for leukemia in Thailand,” Annals of hematology, vol. 88, no. 11, pp. 1079–1088, 2009.
[14] M. M. Behzad, M. Abbasi, I. Oliaei, S. G. Gholiabad, and H. Rafieemehr, “Effects of lifestyle and environmental factors on the risk of acute myeloid leukemia: result of a hospital-based case-control study,” Journal of Research in Health Sciences, vol. 21, no. 3, p. e00525, 2021.
[15] N. Mahmood, S. Shahid, T. Bakhshi, S. Riaz, H. Ghufran, and M. Yaqoob, “Identification of significant risks in pediatric acute lymphoblastic leukemia (ALL) through machine learning (ML) approach,” Medical & Biological Engineering & Computing, vol. 58, no. 11, pp. 2631–2640, 2020.
[16] F. M. Onyije, A. Olsson, D. Baaken, F. Erdmann, M. Stanulla, D. Wollschläger, and J. Schüz, “Environmental risk factors for childhood acute lymphoblastic leukemia: an umbrella review,” Cancers, vol. 14, no. 2, p. 382, 2022.
[17] [Online]. Available: https://synthetichealth.github.io/synthea/
[18] M. O. Aftab, M. J. Awan, S. Khalid, R. Javed, and H. Shabir, “Executing spark bigdl for leukemia detection from microscopic images using transfer learning,” in 2021 1st International Conference on Artificial Intelligence and Data Analytics (CAIDA). IEEE, 2021, pp. 216–220.
[19] A. Abhishek, R. K. Jha, R. Sinha, and K. Jha, “Automated classification of acute leukemia on a heterogeneous dataset using machine learning and deep learning techniques,” Biomedical Signal Processing and Control, vol. 72, p. 103341, 2022.
[20] P. Rastogi, K. Khanna, and V. Singh, “Leufeatx: Deep learning–based feature extractor for the diagnosis of acute leukemia from microscopic images of peripheral blood smear,” Computers in Biology and Medicine, vol. 142, p. 105236, 2022.
[21] N. Bibi, M. Sikandar, I. Ud Din, A. Almogren, and S. Ali, “Iomt-based automated detection and classification of leukemia using deep learning,” Journal of healthcare engineering, vol. 2020, 2020.
[22] E. Nazari, A. H. Farzin, M. Aghemiri, A. Avan, M. Tara, and H. Tabesh, “Deep learning for acute myeloid leukemia diagnosis,” Journal of Medicine and Life, vol. 13, no. 3, p. 382, 2020.
[23] V. Rupapara, F. Rustam, W. Aljedaani, H. F. Shahzad, E. Lee, and I. Ashraf, “Blood cancer prediction using leukemia microarray gene data and hybrid logistic vector trees model,” Scientific Reports, vol. 12, no. 1, pp. 1–15, 2022.
[24] N. Mahmood, S. Shahid, T. Bakhshi, S. Riaz, H. Ghufran, and M. Yaqoob, “Identification of significant risks in pediatric acute lymphoblastic leukemia (ALL) through machine learning (ML) approach,” Medical & Biological Engineering & Computing, vol. 58, no. 11, pp. 2631–2640, 2020.
[25] L. Pan, G. Liu, F. Lin, S. Zhong, H. Xia, X. Sun, and H. Liang, “Machine learning applications for prediction of relapse in childhood acute lymphoblastic leukemia,” Scientific reports, vol. 7, no. 1, pp. 1–9, 2017.
[26] A. Kashef, T. Khatibi, and A. Mehrvar, “Treatment outcome classification of pediatric acute lymphoblastic leukemia patients with clinical and medical data using machine learning: A case study at mahak hospital,” Informatics in medicine unlocked, vol. 20, p. 100399, 2020.
[27] J. M. Bland and D. G. Altman, “The odds ratio,” BMJ, vol. 320, no. 7247, p. 1468, 2000.
[28] S. Fotouhi, S. Asadi, and M. W. Kattan, “A comprehensive data level analysis for cancer diagnosis on imbalanced data,” Journal of biomedical informatics, vol. 90, p. 103089, 2019.
[29] I. El Naqa and M. J. Murphy, What is machine learning? Springer, 2015.
[30] R. Yasmin, R. Abbasi, T. Saeed, M. Sadiq, N. Yasmeen, M. Iqbal, A. K. Alzahrani, N. Kizilbash, B. Ugur, N. Ahmad et al., “Epidemiological and clinical correlates of leukemia ascertained in a multiethnic cohort of Pakistan,” Available at SSRN 4179190, 2023.
[31] X. Chen, S. Jia, and Y. Xiang, “A review: Knowledge reasoning over knowledge graph,” Expert systems with applications, vol. 141, p. 112948, 2020.
[32] [Online]. Available: https://scikit-learn.org/stable/