
Shahbaz 2025 Leveraging Explainable Ai For Early

This study investigates the risk factors and classification of leukemia subtypes in Pakistan, utilizing clinical data from 364 leukemia cases and 896 controls. It employs advanced machine learning techniques and graph data to achieve a 96% accuracy rate in predicting leukemia types, while highlighting lifestyle factors such as passive smoking and poor nutrition as significant contributors to leukemia risk. The research addresses a notable gap in existing literature by integrating demographic and lifestyle data into leukemia classification methodologies.


This article has been accepted for publication in IEEE Journal of Biomedical and Health Informatics.

This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/JBHI.2025.3531294

Leveraging Explainable AI for Early Risk Prediction and Type Classification for Leukemia: Insights using Clinical Data from Pakistan

Maryam Shahbaz, Amna Basharat, Rehana Yasmin, Nafees Ahmad, Rashada Abbasi, Shujaat Hussain

M. Shahbaz, A. Basharat, and S. Hussain are with the National University of Computer and Emerging Sciences, Islamabad, Pakistan.
R. Abbasi and N. Ahmad are with the Institute of Biomedical and Genetic Engineering, Islamabad, Pakistan.
R. Yasmin is with Quaid e Azam University, Islamabad.
S. Hussain is also with the School of Engineering, Computing and Mathematics, Oxford Brookes University, Oxford, UK.

Abstract—Leukemia, a prevalent childhood cancer affecting the blood and bone marrow, necessitates a proactive approach to risk mitigation through lifestyle adaptations. Despite previous investigations into similar aspects across different cancer types, a comprehensive inquiry focusing on all leukemia subtypes within Pakistan remains a significant research gap. Acknowledging the influence of regional variations on individuals' lifestyles, this study aims to identify lifestyle and demographic factors associated with leukemia development and to predict specific leukemia subtypes using clinical data. Our data collection included 364 leukemia cases and 896 control subjects, gathered from different cancer or tertiary care hospitals in Islamabad and Peshawar, Pakistan. The data was categorized into laboratory results, demographic characteristics, and lifestyle parameters. For demographic and lifestyle analysis, we employed machine learning, statistical, and graph-based methods to assess the risk of developing leukemia. This study highlights factors associated with a higher risk of leukemia, including passive smoking, rural residence, and poor nutrition. These insights emphasize the promotion of healthier lifestyle choices to potentially reduce leukemia incidence. Additionally, we transformed the clinical dataset into graph data, which was used for leukemia classification and subtype prediction. We conducted classification on both graph and structured (tabular) data, with the structured data achieving 96% accuracy on oversampled data. To enhance the interpretability of our classification outcomes, we employed the SHapley Additive exPlanations (SHAP) algorithm to explain the classification, offering insights into the rationale behind leukemia classification.

Index Terms—Leukemia, Risk Factors, Classification, Graph data, Lifestyle, Demography

I. INTRODUCTION

Cancer, a perilous disease, includes leukemia, a cancer of the blood and bone marrow marked by uncontrolled blood cell growth. Leukemia originates in the bone marrow and often swiftly enters the bloodstream. There are four main leukemia types: Acute Lymphoblastic Leukemia (ALL), Chronic Lymphocytic Leukemia (CLL), Acute Myeloid Leukemia (AML), and Chronic Myeloid Leukemia (CML). Acute leukemia involves immature cell formation, whereas chronic leukemia entails mature cell production.

According to statistics from Shaukat Khanum Memorial Cancer Hospital and Research Center (SKMCH and RC) and Karachi Diagnostic Centre (KDC), a total of 127,929 neoplasms, of which 119,486 were malignant and 8,443 benign, were reported over the past 28 years [1]. Leukemia is a highly lethal medical condition that claims thousands of lives annually. It is the most prevalent cancer type among children. In 2019, a comprehensive data collection effort was conducted across 209 countries and territories. The recorded statistics indicated that there were 0.64 million new cases, 11.6 million disability-adjusted life years, and 0.33 million deaths attributed to leukemia. Notably, 28.11% of leukemia-related fatalities were linked to AML, while 8.95% were associated with CML. Additionally, 14.24% and 13.33% of leukemia deaths resulted from ALL and CLL, respectively [2]. It is crucial to recognize that cancer is a preventable disease that demands significant lifestyle modifications [3].

Research on leukemia classification has been conducted in various countries [4]–[9], but there remains a gap in investigations within Pakistan, particularly regarding the identification of risk factors. Recognizing regional lifestyle and demographic variations that influence leukemia is essential. Prevailing leukemia classification methods rely heavily on laboratory data, while potential insights from lifestyle and demographic data are neglected. Integrating such information could enhance our understanding of disease prevalence and mitigation strategies. Additionally, the use of advanced methods such as Machine Learning (ML) and Knowledge Graphs (KG) for leukemia risk analysis remains notably limited, as most such studies focus on other cancer types [10].

To address this research gap, we compiled a dataset encompassing laboratory, lifestyle, and demographic features, sourced from tertiary care hospitals in Peshawar and Islamabad. By incorporating lifestyle and demographic information, we predicted risk factors and assessed them using the traditional Odds Ratio (OR), ML techniques, and KG methods. In this study, we integrated laboratory, lifestyle, and demographic features to improve leukemia classification and subtype identification, thereby enhancing detection accuracy. The dataset comprised 364 cases and 896 controls, and to address the imbalanced data, we applied various class balancing techniques to reduce biases and enhance prediction reliability. Additionally, we conducted a comparative analysis between structured and graph data for classification. The Neo4j system served as a robust tool for analyzing and querying graph data, enabling us to explore its potential benefits in leukemia classification

© 2025 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See https://www.ieee.org/publications/rights/index.html for more information.

[11].

The main contributions of this article are as follows:
1) The initial step involves the collection of comprehensive data pertaining to all four types of leukemia.
2) OR and ML algorithms were used to predict leukemia risk factors, and a comparative analysis was conducted to assess and contrast their respective outcomes.
3) To address potential class imbalance issues in ML classification, various class balancing techniques were implemented.
4) Leveraging both relational and graph data, leukemia classification was performed, followed by a comparison of their respective accuracies.

The remaining content is organized as follows. Section II discusses related work. The proposed methodology is given in Section III. Experimental evaluation, results, discussion, and conclusion are provided in Sections IV, V, VI, and VII, respectively.

II. BACKGROUND

In this section, we present an overview of existing methods for predicting the risk of leukemia and its diagnosis.

A. Risk Prediction of Leukemia Patients

In the quest to identify leukemia-contributing factors, five risk factors linked to childhood leukemia have been considered [12]. These factors were determined using OR applied to demographic and medical attributes [13]. Notably, transmissible agents, parental age, and birth weight have emerged as significant influencers of childhood leukemia [13].

Factors associated with AML have been explored [14]. Data collection involved a comprehensive questionnaire encompassing personal, family medical history, lifestyle, environmental, and physical/biological agent attributes. OR, conditional multivariate and univariate Logistic Regression (LR), t-tests, and p-values were employed to identify AML risk factors. Risk factors for ALL have been assessed using data from clinical, phenotypic, and environmental domains. A study conducted in Lahore, Pakistan, involved 50 cases and 44 controls [15]. Four supervised ML algorithms were employed: Classification And Regression Tree (CART), Random Forest (RF), gradient-boosted machine, and C5.0 decision tree. CART displayed the highest accuracy at 99.83% and was used to rank features by significance. Platelets, hemoglobin, and White Blood Cells (WBCs) were the most influential, followed by gender, water intake, age, and ALL type. However, the study focused solely on predicting ALL risk factors [15]. An umbrella review of 59 articles, including 17 pooling analyses and 42 systematic reviews, identified ionizing radiation and maternal pesticide exposure as highly associated with ALL [16]. Furthermore, risk factors for lung cancer have been identified using KG [10]. Synthetic data was generated through the Synthea patient generator, and lung cancer patient data was obtained and converted into graphs using Neo4j [17]. Potential risk factors were identified using the Connection Delta Ratio (CDR) and validated against existing literature. Despite advancements in leukemia research, there remains a notable gap in studies applying advanced ML, OR, and KG to Pakistani data, particularly considering regional environmental variations. Until now, no research has been conducted to compare these three approaches for identifying risk factors.

B. Diagnosing Leukemia and Its Type

Several researchers have used blood smear images and applied machine and Deep Learning (DL) models for binary and multi-class classification of leukemia or specific leukemia types [4]–[9]. Image datasets have been used for multi-class classification, achieving 94.78% accuracy for predicting four leukemia types using transfer learning models [18]. Additionally, Convolutional Neural Network (CNN) models have been employed for 3-class leukemia classification [19], and deep features have been used to train models for predicting subtypes of Acute Leukemia [20]. Transfer learning models have also been applied to diagnose four leukemia types with a reported 100% accuracy rate [21].

Gene data has shown promise in diagnosing leukemia and predicting the survival rate of leukemia patients. For example, DL models have been employed on gene data to predict AML [22], while ML models based on gene data have been used to predict blood cancer, with a reported 100% accuracy rate [23]. However, these studies overlooked demographic and lifestyle features and were restricted to binary classification, excluding certain leukemia types. Recent studies have made only limited use of clinical data for predicting leukemia or its various types (see Table 1 in the Supplementary Material). Researchers have leveraged clinical features, including biomedical tests, alongside phenotypic and environmental factors to gauge feature significance within ML-based models, with CART outperforming other methods [24]. Another study used an RF algorithm along with fourteen clinical features to predict relapse in children with ALL [25]. Treatment outcomes in ALL have also been assessed through clinical and medical history data, achieving 94.9% accuracy using the Support Vector Machine (SVM) and XGBoost algorithms [26].

Despite advancements in the use of ML and DL models for leukemia classification through blood smears and gene data, significant gaps remain in the literature. Notably, lifestyle and demographic features have not been used to predict risk factors across all four types of leukemia. Additionally, many existing approaches are limited to binary classification, without differentiating between specific leukemia types, or to the classification of only two or three types of leukemia.

III. THE PROPOSED METHOD

This study aims to uncover key lifestyle and demographic factors contributing to leukemia. It encompasses leukemia type classification with explainable AI-driven reasoning. This research involves data collection and preprocessing. We predict risk factors and perform classification, applying balancing techniques. The classification is conducted on both structured and graph data, and the results are compared. We utilize


explainable AI to explain the best-performing model. An overview of our work is shown in Figure 1.

Fig. 1: Workflow of proposed methodology

A. Preparing Input Data

At the beginning of the study, we gathered data from cancer hospitals in Peshawar and Islamabad, Pakistan, after obtaining consent from participants. The clinical data was divided into demographic, lifestyle, and laboratory categories. After consulting with doctors, we made adjustments to the laboratory features, updating them based on appropriate ranges.

B. Risk Prediction

For predicting risk, we used ML and KG methods to determine which method yields better results. In the ML approach, we employed XGBoost to identify important features. CDR was previously utilized for lung cancer [14]; therefore, we adopted these approaches for identifying leukemia risk factors. Additionally, we used OR to validate our findings, as it is commonly employed by medical specialists to verify results [27].

1) XGBoost Algorithm: XGBoost implements gradient-boosted decision trees. Trees are added sequentially, and each new tree concentrates on the samples that the previous trees predicted incorrectly, which makes the model more precise. The fitted model assigns an importance score to each feature, which we used to identify important features.

2) Connection Delta Ratio (CDR): First, we transformed the structured data into a graph format using Neo4j. The resulting graph data contained nodes representing patients or healthy individuals, along with other relevant features. CDR was employed as a measure of each feature's importance in distinguishing between patients and healthy individuals. Specifically, CDR was used to evaluate the significance of connections between case and control data based on the relationships within the graph. By utilizing CDR, we identified the top-ranked factors, where features with CDR values greater than 0.1 were considered most significant for leukemia. For calculating CDR, connections to leukemia patients are represented as Targeted Patient Connections (TPC), while connections to control data are denoted as Healthy People Connections (HPC). The CDR is then defined as

CDR = (TPC - HPC) / (TPC + HPC).

3) Odds Ratio (OR): To identify the association between two variables, OR was used. It is defined as the ratio of the odds of an event in one group to the odds of the same event in another group, where the odds are the ratio of the probability of an event to that of a non-event. A 95% Confidence Interval (CI) is also considered to check the association found.
1) OR > 1: increases the occurrence of an event.
2) OR < 1: decreases the occurrence of an event.
3) OR = 1: no association.

C. Balanced distribution of classes

The issue of imbalanced distribution among classes adversely affects accuracy. To improve accuracy, most researchers use data-level solutions, i.e., oversampling and undersampling, to balance data [28]. We first applied classification to the simple (unbalanced) datasets and then to the balanced datasets. In undersampling techniques, examples of the majority class are reduced, while in oversampling techniques, examples of minority classes are increased. As a result, all classes have an equal number of samples.

D. Diagnosing leukemia and its type

ML is an evolving branch of computational algorithms that emulate human intelligence by learning from the surrounding environment [29]. ML models learn from labeled datasets to predict outputs. For the classification of leukemia and its types, we employed various ML models due to the small and structured nature of our dataset. As noted in the literature review, DL approaches are typically used when working with gene or image data. This study uses structured (tabular) and graph data as inputs for the classification process, followed by a comparative analysis of their respective outcomes. A few ML algorithms were chosen based on the literature review. We implemented LR with L2 regularization and a One-Vs-Rest strategy to address the multi-class nature of the problem. The model configuration includes 500 epochs and a random seed set to 3. Various parameters were tested for the Multi-Layer Perceptron (MLP) model, and we opted for an MLP with five hidden layers of 256, 128, 64, 32, and 5 neurons, respectively. It has been configured with a learning rate of 0.001, has employed the ReLU activation function, and


has utilized the Adam optimizer while maintaining consistent parameters across both graph and structured data. Additionally, K-Nearest Neighbors (KNN) was used to predict the categorical outcome from the 34 nearest neighbors. Uniform weights were assigned, so that all points within each neighborhood count equally. Furthermore, for both structured and graph data, we used 55 trees in RF, keeping the default settings for all other parameters. SVM finds a hyperplane in N-dimensional space, where N is the number of features, that separates the data points by class. The chosen hyperplane is the one with the maximum margin, and data points falling on opposite sides of it are assigned to different classes. A polynomial kernel of degree 3 was selected, with the kernel coefficient set to 'auto'. We opted for these specific configurations because they delivered the highest accuracy among the options explored.

E. SHapley Additive exPlanations (SHAP)

SHAP is an explainable AI method designed to explain individual predictions by attributing to each feature its contribution to the output of an ML model. We used this algorithm to explain the model that gave the highest accuracy. It calculates feature importance by considering the involvement of each feature in every possible combination of features, assessing their influence on the prediction outcome.

IV. EXPERIMENTAL EVALUATION

A. Data Collection

The dataset of leukemia patients was collected after written consent of the patient or legal guardian. It was collected as per the methodology approved by the Institutional Review Board (IRB) of the Institute of Biomedical and Genetic Engineering (IBGE), Islamabad (Letter No. IBGE/SARK/09/1205/2012). It includes cases and controls from all gender and ethnic groups. In total, we have 1258 samples: 261 ALL, 18 CLL, 40 AML, and 45 CML samples, and 894 non-cancer people. Data samples are split into laboratory, demographic, and lifestyle features. The features included are shown in Table I. The same data has also been reported in another research paper [30], but only ORs were computed in that paper.

TABLE I: Clinical features categorization

Kind of Data       No. of Features   Features
Laboratory Data    6                 WBC Status, RBC Status, HB Status, Platelets Status, Neutrophils Status, Lymphocytes Status
Lifestyle Data     9                 Nutrition, Microwave use, Fuel, Water source, Revenue area, Carbonated drinks, Perfumes, Smoking, Environmental conditions
Demographic Data   6                 Gender, Age, Father and Mother's age at union, Family type, Weight

B. Data Preprocessing

We examined the laboratory data and mapped it according to the ranges outlined in Table II, adjusting the feature values to be either incremental (above the range), decremental (below the range), or normal. We also mapped the demographic dataset values to align with the respective ranges. An age of 0-14 years was considered a child, 14-50 years an adult, and 50 plus years old. The Body Mass Index (BMI) was computed using the height and weight of the individual. A BMI less than 18.5 was considered underweight, 18.5-24.5 normal, 24.5-30 overweight, and greater than 30 obese. According to clinicians, the typical reproductive age range is 20 to 35 years for females and 20 to 45 years for males. For the classification analysis, the dataset was divided into training and testing samples, with 80% allocated for training and 20% for testing. Given the initial dataset of 1,258 samples, this resulted in 1,006 training samples and 252 testing samples.

TABLE II: Normal ranges of laboratory data

Feature                       Child       Male Adults   Female Adults
WBC (10^9/liter)              5-13        4-11          4-10
RBC (10^6/microliter)         4-5.2       3.5-5.6       4.5-5.5
Hb (gram/deciliter)           11.5-15.5   11-18         13-17
Platelets (10^3/microliter)   170-450     150-400       150-410
Lymphocytes (%)               30-40       12-50         25-40
Neutrophils (%)               60-70       37-75         45-70

V. RESULTS

A. Leukemia Risk Prediction

1) Odds Ratio (OR): We used OR, as presented in Figure 2, to predict the risk factors associated with demographic and lifestyle features. Passive smoking was identified as a crucial feature based on the given dataset. Moreover, the living area was also observed to be a significant factor impacting leukemia. Specifically, the dataset includes two types of areas, with rural areas showing a higher incidence of leukemia than urban areas. Among demographic features, children appear more vulnerable to leukemia. However, because child samples form the majority of this dataset, the demographic feature of age was disregarded during the analysis.

Fig. 2: OR of lifestyle features

2) XGBoost Algorithm: Figure 3 illustrates the prediction of risk factors associated with lifestyle and demographic features using the XGBoost algorithm. Environmental conditions were identified as the most significant factor among lifestyle features, suggesting that a polluted environment may contribute to leukemia. Consistent with the findings obtained through OR, XGBoost also suggests that living in rural areas is associated with a higher incidence of leukemia. However, there is a discrepancy in the prediction of risk associated with the use of gas or wood as fuel for cooking. XGBoost indicates that the use of gas fuel is linked to an increased risk of leukemia.

3) CDR: We assigned a unique identification number to each patient and linked them to their respective demographic, lifestyle, and laboratory features. The CSV data was

then transformed into a Neo4j graph format by loading the data and executing Cypher queries within the Neo4j platform. An abstract representation of the resulting graph data is presented in Figure 4.
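As a rough illustration of this kind of transformation, the sketch below turns tabular patient records into (patient, relationship, node label, value) edges, which is the shape of data that loading nodes and relationships into a graph database would consume. The column names, the `hasNutrition` relationship, and the sample rows are hypothetical and are not taken from the study's dataset; `useFuel` and `leukemia_Result` are borrowed from the relationship names visible in Algorithm 1. Only the idea of linking each patient ID to its feature values comes from the text.

```python
import csv
import io

# Hypothetical CSV excerpt; the real column names are not given in the paper.
SAMPLE_CSV = """patient_id,fuel,nutrition,result
P1,Gas,Good,No
P2,Wood,Poor,Yes
P3,Gas,Poor,Yes
"""

# Assumed mapping from CSV column to (relationship type, target node label).
RELATIONSHIPS = {
    "fuel": ("useFuel", "Fuel"),
    "nutrition": ("hasNutrition", "Nutrition"),
    "result": ("leukemia_Result", "Result"),
}


def rows_to_edges(csv_text):
    """Return one (patient_id, relationship, node_label, value) tuple per feature cell."""
    edges = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        for column, (rel_type, node_label) in RELATIONSHIPS.items():
            edges.append((row["patient_id"], rel_type, node_label, row[column]))
    return edges


edges = rows_to_edges(SAMPLE_CSV)
for edge in edges:
    print(edge)
```

Each tuple corresponds to one patient-to-feature relationship, so three patients with three mapped columns yield nine edges.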

Algorithm 1 Example of Cypher Query

// Count controls (no) and leukemia patients (yes) connected to Fuel "Gas", then compute CDR.
MATCH (n:PatientID)-[a:useFuel]->(j:Fuel {Name: "Gas"}),
      (n:PatientID)-[b:leukemia_Result]->(k:Result {Name: "Noleukemia"})
WITH count(n) AS no
MATCH (m:PatientID)-[a:useFuel]->(j:Fuel {Name: "Gas"}),
      (m:PatientID)-[b:leukemia_Result]->(k:Result {Name: "Yesleukemia"})
WITH no, count(m) AS yes
RETURN no, yes, toFloat(yes - no) / (yes + no) AS CDR
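The same computation can be mirrored outside the database. The sketch below is a plain-Python analogue of the query in Algorithm 1, assuming each person is described by a fuel value and a leukemia outcome. The records are invented for illustration, and only the formula (yes - no) / (yes + no) is taken from the query.

```python
def connection_delta_ratio(records, feature, value):
    """CDR = (TPC - HPC) / (TPC + HPC) for one feature value.

    TPC counts leukemia patients connected to the value; HPC counts controls.
    """
    tpc = sum(1 for r in records if r[feature] == value and r["leukemia"])
    hpc = sum(1 for r in records if r[feature] == value and not r["leukemia"])
    if tpc + hpc == 0:
        raise ValueError(f"no record has {feature}={value!r}")
    return (tpc - hpc) / (tpc + hpc)


# Invented records, not study data.
records = [
    {"fuel": "Gas", "leukemia": False},
    {"fuel": "Gas", "leukemia": False},
    {"fuel": "Gas", "leukemia": True},
    {"fuel": "Wood", "leukemia": True},
    {"fuel": "Wood", "leukemia": True},
    {"fuel": "Wood", "leukemia": False},
]

cdr_gas = connection_delta_ratio(records, "fuel", "Gas")    # (1 - 2) / (1 + 2)
cdr_wood = connection_delta_ratio(records, "fuel", "Wood")  # (2 - 1) / (2 + 1)
print(cdr_gas, cdr_wood)
```

A value shared mostly by controls gives a negative CDR, and a value shared mostly by patients gives a positive one, which is why the sign is read as the direction of the association.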

We calculated the CDR from the graph data. However, all of the calculated results were less than 0.5, likely due to an insufficient number of cases as compared to control data. To address this issue, we performed oversampling on the case data to balance the number of cases and controls. Algorithm 1 shows the Cypher query to calculate the CDR associated with the use of gas as fuel. Table III shows the CDR results for all features.

Fig. 3: Result of XGBoost algorithm. (a) Lifestyle features. (b) Demographic features.
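One way to see why class imbalance pushes CDR values down is to consider a feature value that is unrelated to leukemia, that is, one held by the same fraction of cases and controls. The sketch below works through that arithmetic using the sample counts reported in the Data Collection section (364 cases and 894 controls). The 30% prevalence, and the assumption that oversampling simply replicates cases until the two groups are equal in size, are illustrative assumptions rather than details from the study.

```python
def cdr(tpc, hpc):
    """Connection Delta Ratio from patient (TPC) and control (HPC) connection counts."""
    return (tpc - hpc) / (tpc + hpc)


CASES, CONTROLS = 364, 894  # counts reported in the Data Collection section
PREVALENCE = 0.30           # assumed share of each group holding the feature value

# A feature value with no association: the same share of cases and controls hold it.
baseline = cdr(PREVALENCE * CASES, PREVALENCE * CONTROLS)
print(f"imbalanced baseline CDR: {baseline:.3f}")

# After balancing (cases assumed replicated to match the control count),
# the same unrelated feature value sits at zero.
balanced = cdr(PREVALENCE * CONTROLS, PREVALENCE * CONTROLS)
print(f"balanced baseline CDR:   {balanced:.3f}")
```

Under these assumptions the imbalanced baseline is (364 - 894) / (364 + 894), roughly -0.42, regardless of the prevalence chosen. A fixed cutoff such as 0.1 is therefore much harder to reach on the raw data than on balanced data, where an unrelated feature value starts from zero.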


B. Diagnosing Leukemia and its type

KGs model complex relationships within the data. They enable us to identify key connections and insights that may not be visible through traditional classification methods [31]. Therefore, we used KG to diagnose leukemia and its types. Additionally, we applied structured approaches to allow for a meaningful comparison of results. By comparing knowledge graph results with classification outcomes, we aimed to provide a more comprehensive understanding of factors associated with leukemia.

1) Graph Data: To classify graph data, the Graph Data Science (GDS) library supports only three classification functions, i.e., RF, MLP, and LR, and we have employed all three [11]. The CSV data was converted into Neo4j graph format to enable this analysis. Using the GDS library, two types of prediction can be performed on graph data: link prediction and node prediction. This study focuses on both types within graph data, specifically using two nodes for patient ID and features. First, we performed classification on the unbalanced dataset, referred to as the simple dataset. Subsequently, the imbalances were addressed through undersampling, oversampling, Synthetic Minority Over-sampling Technique (SMOTE), and Adaptive Synthetic Sampling (ADASYN). Table IV presents the testing data results for diagnosing leukemia based on graph


data. On the graph data, accuracy declines when balancing techniques are applied to the datasets. The highest accuracy of 79% on testing data was achieved using the MLP algorithm on simple data, with an F1 score of 83%. On the simple dataset, the RF algorithm yielded the lowest accuracy at 72%.

TABLE IV: Results of leukemia prediction on graph data

Methods              LR F1   LR Acc.   RF F1   RF Acc.   MLP F1   MLP Acc.
Simple-Data          70%     74%       67%     72%       83%      79%
Under-sample         22%     31%       12%     20%       21%      26%
Over-sample          63%     69%       63%     66%       70%      76%
SMOTE-oversample     38%     50%       39%     51%       50%      58%
ADASYN-oversample    38%     50%       37%     48%       48%      50%

Fig. 4: Abstract representation of graph data

TABLE III: Results of CDR

Features           Label      Real data   Over-sampled data
Nutrition          Good       -0.56       -0.18
Nutrition          Poor       -0.011      0.2
Fuel               Wood       0.0625      0.32
Fuel               Gas        -0.05       -0.18
Water Source       Spring     0.0745      0.39
Water Source       Well       -0.218      0.13
Water Source       Filter     -0.77       -0.54
Water Source       Tap        0.345       -0.061
Microwave          No         -0.02       0.04
Microwave          Yes        -0.67       -0.3
Microwave          Frequent   -0.789      -0.5
Perfume            Frequent   -0.75       -0.47
Perfume            No         -0.3        0.17
Environment        Clean      -0.607      -0.2
Environment        Polluted   -0.61       -0.196
Carbonated Drink   Frequent   -0.44       0.04
Carbonated Drink   Yes        -0.5        -0.05
Carbonated Drink   No         -0.05       0.01
R-Smoking          Passive    0.085       0.51
R-Smoking          Frequent   -0.61       0.23
R-Smoking          Rare       -0.05       0.08
R-Smoking          No         -0.49       -0.047
Revenue Area       Urban      -0.701      -0.35
Revenue Area       Rural      -0.23       0.24

2) Structured Data: For structured data, we employed the same three models used for graph data, along with the SVM and KNN algorithms. These algorithms were selected based on a literature review. The sklearn library [32] was used to classify leukemia on structured data. Given the limited patient data, initial results were biased toward control data, necessitating the use of class balancing techniques. Table V presents the test data results for all classification models applied to simple, undersampled, oversampled, SMOTE, and ADASYN datasets. The highest accuracy of 96% was achieved by RF on oversampled data.

VI. DISCUSSION

We used statistical, graphical, and ML methods to predict risk factors from demographic and lifestyle features. Table VI shows the most important lifestyle attributes. Our findings suggest that passive smoking, rural residence, avoiding perfume use, poor nutrition, and drinking from springs are significant risk factors. The use of microwaves has also been identified as a potential factor contributing to leukemia. There is a discrepancy among the statistical, graphical, and ML analyses concerning the impact of cooking fuel and perfume use on leukemia. Statistical and graphical assessments align in attributing leukemia risk to wood usage, whereas the ML analysis diverges, implicating gas usage. Similarly, quantitative measures infer reduced leukemia risk by abstaining from perfume, conflicting with the ML findings that suggest a causal relationship between perfume usage and leukemia. Notably, only the ML analysis identifies environmental pollution as a significant factor associated with leukemia. The results of this study indicate that children have a higher risk of developing leukemia, as revealed by all the techniques employed. Additionally, the ML algorithm and graphical method suggest that underweight individuals are more susceptible to leukemia. Table 1 in Appendix A of the Supplementary Material presents a comparison between our work and previous studies focused on predicting risk factors. Various ML algorithms were used to predict leukemia and its subtypes, together with techniques for balancing the data distribution. Among these algorithms, MLP and RF demonstrated the highest accuracy, as depicted in Figure 5. RF outperformed MLP in accuracy. Among the data distribution balancing methods, oversampling, especially with SMOTE and ADASYN, proved superior to undersampling. Oversampling yielded the highest accuracy. Results from graph data showed that MLP and RF algorithms outperformed LR, with MLP achieving the highest accuracy at 79%. Comparing data techniques, both oversampling and undersampling reduced algorithm accuracy, with simple data yielding the best results. In the comparison of oversampling techniques, plain oversampling produced better results than ADASYN and SMOTE. In addition, a


TABLE V: Results of diagnosing leukemia on structured data (%)

                      SVM                  RF                   KNN                  LR                   MLP
Method                Acc  Prec Rec  F1    Acc  Prec Rec  F1    Acc  Prec Rec  F1    Acc  Prec Rec  F1    Acc  Prec Rec  F1
Simple-Data           84   76   84   80    86   79   86   82    80   72   80   75    82   75   82   78    50   81   57   50
Under-Sampling        41   43   41   40    50   45   50   46    59   40   59   47    41   37   41   37    73   77   73   73
Over-Sampling         85   85   85   85    96   96   96   96    70   70   70   68    55   55   55   54    94   94   94   94
SMOTE-Oversampling    85   81   85   83    90   86   90   88    79   76   79   77    57   56   57   56    90   91   90   90
ADASYN-Oversampling   80   80   80   79    94   94   94   94    74   75   74   73    58   58   58   57    89   90   89   89
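
The source does not state how the per-class precision, recall, and F1 values in Table V were averaged. As an illustrative sketch only, the following Python snippet assumes support-weighted averaging over classes. Under that assumption weighted recall equals accuracy by construction, which is consistent with the identical Acc and Rec columns in every row of Table V except MLP on Simple-Data. The confusion matrix and the function name `weighted_scores` are hypothetical and are not taken from the study.

```python
from typing import List, Tuple


def weighted_scores(cm: List[List[int]]) -> Tuple[float, float, float, float]:
    """Return (accuracy, precision, recall, F1) using support-weighted averaging.

    cm[i][j] counts samples whose true class is i and predicted class is j.
    """
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[k][k] for k in range(n_classes)) / total

    precision = recall = f1 = 0.0
    for k in range(n_classes):
        tp = cm[k][k]
        support = sum(cm[k])                                  # samples truly in class k
        predicted = sum(cm[i][k] for i in range(n_classes))   # samples labeled as class k
        p = tp / predicted if predicted else 0.0
        r = tp / support if support else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = support / total                              # class share of the test set
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return accuracy, precision, recall, f1


# Hypothetical 3-class confusion matrix (illustrative counts, not the paper's data).
cm = [
    [50, 5, 5],
    [4, 30, 6],
    [2, 3, 15],
]
acc, prec, rec, f1 = weighted_scores(cm)
print(f"acc={acc:.3f} prec={prec:.3f} rec={rec:.3f} f1={f1:.3f}")
```

With this averaging, the weighted recall reduces to the number of correct predictions divided by the test-set size, so any gap between the Acc and Rec columns would point to a different averaging scheme or a reporting slip.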

comparison was made between structured and graph data using the LR, MLP, and RF algorithms, as depicted in Figure 6. The LR algorithm indicated higher accuracy for structured data over graph data. The RF algorithm achieved the highest accuracy with simple oversampled structured data. In a previous study by [25], classification was conducted to predict ALL. However, their approach differed from the present research in the number of features used: they employed 15 features, whereas 21 features are utilized here. Furthermore, their highest accuracy, achieved with an RF classifier, was 82%, while this analysis achieved a higher accuracy of 86% on the simple dataset. These findings suggest that the additional features in this research may have contributed to improved accuracy. Nevertheless, direct comparisons are limited due to differences in features and dataset characteristics across studies. Overall, this work demonstrates the potential of a larger feature set to improve leukemia classification accuracy. Table 1 from Supplementary Material compared these results with previous research.

TABLE VI: Comparison of predicting risk factors

Rank   OR                   XGBoost Algorithm      CDR
1      Passive Smoking      Polluted Environment   Passive Smoking
2      Rural Area           Gas Fuel               Wood Fuel
3      Rare Carbon Drinks   Perfume Usage          Drink Spring Water
4      Drink Spring Water   Rural Area             Rural Area
5      Wood Fuel            Passive Smoking        Poor Nutrition

Fig. 5: Barchart of results on structured data

Fig. 6: Comparing results on a graph and structured data

A. SHAP-Explainable AI
The RF algorithm was chosen for in-depth analysis, using the SHAP algorithm, due to its highest accuracy when applied to oversampled data. SHAP values were utilized as independent features to assess the average impact of individual features on the algorithm’s performance. The findings revealed that weight, age, and gender play crucial roles in accurately classifying leukemia or its specific type. However, it was observed that the age feature exhibited bias since the dataset predominantly consisted of children.

Fig. 7: Average impact of features on model using SHAP algorithm

To address bias, an additional RF algorithm analysis specifically targeted lifestyle and laboratory features. The outcomes, depicted in Figure 7, emphasize perfumes as the most influential factor from lifestyle features, particularly classifying AML, while smoking appeared as least relevant. Factors such as nutrition and environmental elements also hold significance in classification.
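
To make the notion of a feature's "average impact" in Fig. 7 concrete, the following Python sketch computes exact Shapley values by enumerating feature coalitions and then averages their absolute values over a few samples. It is a minimal illustration under stated assumptions, not the study's pipeline: the scoring function `toy_model` stands in for the trained RF, the feature names and sample values are hypothetical, and absent features are replaced by a fixed baseline, which is only one of several conventions. The source does not say which SHAP explainer was used, and brute-force enumeration like this is only feasible for a handful of features.

```python
from itertools import combinations
from math import factorial
from typing import Callable, List, Sequence


def shapley_values(
    predict: Callable[[Sequence[float]], float],
    x: Sequence[float],
    baseline: Sequence[float],
) -> List[float]:
    """Exact Shapley values of predict at x, relative to a baseline input."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight of a coalition of this size that excludes feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                present = set(subset)
                without_i = [x[j] if j in present else baseline[j] for j in range(n)]
                with_i = list(without_i)
                with_i[i] = x[i]
                phi[i] += weight * (predict(with_i) - predict(without_i))
    return phi


FEATURES = ["age", "weight", "perfume_use"]  # hypothetical feature names


def toy_model(v: Sequence[float]) -> float:
    """Hypothetical scoring function standing in for a trained classifier."""
    age, weight, perfume = v
    return 0.5 * age - 0.3 * weight + 0.2 * age * perfume


samples = [[1.0, 2.0, 1.0], [3.0, 1.0, 0.0], [2.0, 4.0, 1.0]]  # illustrative inputs
baseline = [0.0, 0.0, 0.0]

per_sample = [shapley_values(toy_model, s, baseline) for s in samples]
# Mean absolute Shapley value per feature, i.e. its average impact on the output.
mean_abs = {
    name: sum(abs(phi[k]) for phi in per_sample) / len(per_sample)
    for k, name in enumerate(FEATURES)
}
for name, value in sorted(mean_abs.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {value:.3f}")
```

For each sample the values sum to the difference between the model output at that sample and at the baseline, which is the property that lets a bar chart of mean absolute values be read as a decomposition of the prediction.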


Among laboratory features, the Red Blood Cell (RBC) count emerges as the most influential. Notably, all factors displayed minimal contribution specifically in ALL.

VII. CONCLUSION
The prediction of leukemia risk has traditionally relied on conventional methods. While lifestyle changes can vary by region, this type of analysis has not been conducted on Pakistani data. Moreover, previous studies have primarily used tabular data for leukemia type classification. In this study, after collecting relevant data, we applied OR, KG, and ML methods to identify the risk factors associated with leukemia. Due to the limited control data relative to case data, we employed class balancing strategies to address data imbalance. We also converted clinical data into a graph format and compared its classification accuracy against structured data. The structured dataset achieved the highest accuracy of 96% through oversampling using ML algorithms. Even with a dataset without balancing, structured data consistently outperformed graph data, emphasizing its effectiveness for classification tasks, particularly when dealing with smaller datasets and direct node relationships.

REFERENCES
[1] Shaukat Khanum Memorial Trust, “Cancer statistics,” accessed: July, 2023. [Online]. Available: https://shaukatkhanum.org.pk/health-care-professionals-researchers/cancer-statistics/
[2] M. Du, W. Chen, K. Liu, L. Wang, Y. Hu, Y. Mao, X. Sun, Y. Luo, J. Shi, K. Shao et al., “The global burden of leukemia and its attributable factors in 204 countries and territories: Findings from the global burden of disease 2019 study and projections to 2030,” Journal of Oncology, vol. 2022, 2022.
[3] P. Anand, A. B. Kunnumakara, C. Sundaram, K. B. Harikumar, S. T. Tharakan, O. S. Lai, B. Sung, and B. B. Aggarwal, “Cancer is a preventable disease that requires major lifestyle changes,” Pharmaceutical research, vol. 25, pp. 2097–2116, 2008.
[4] M. Zhou, K. Wu, L. Yu, M. Xu, J. Yang, Q. Shen, B. Liu, L. Shi, S. Wu, B. Dong et al., “Development and evaluation of a leukemia diagnosis system using deep learning in real clinical scenarios,” Frontiers in Pediatrics, p. 616, 2021.
[5] S. Rezayi, N. Mohammadzadeh, H. Bouraghi, S. Saeedi, and A. Mohammadpour, “Timely diagnosis of acute lymphoblastic leukemia using artificial intelligence-oriented deep learning methods,” Computational Intelligence and Neuroscience, vol. 2021, 2021.
[6] A. Rehman, N. Abbas, T. Saba, S. I. u. Rahman, Z. Mehmood, and H. Kolivand, “Classification of acute lymphoblastic leukemia using deep learning,” Microscopy Research and Technique, vol. 81, no. 11, pp. 1310–1317, 2018.
[7] S. H. Kassani, P. H. Kassani, M. J. Wesolowski, K. A. Schneider, and R. Deters, “A hybrid deep learning architecture for leukemic b-lymphoblast classification,” in 2019 International Conference on Information and Communication Technology Convergence (ICTC). IEEE, 2019, pp. 271–276.
[8] J.-N. Eckardt, T. Schmittmann, S. Riechert, M. Kramer, A. S. Sulaiman, K. Sockel, F. Kroschinsky, J. Schetelig, L. Wagenführ, U. Schuler et al., “Deep learning identifies acute promyelocytic leukemia in bone marrow smears,” BMC cancer, vol. 22, no. 1, pp. 1–11, 2022.
[9] P. K. Das and S. Meher, “An efficient deep convolutional neural network based detection and classification of acute lymphoblastic leukemia,” Expert Systems with Applications, vol. 183, p. 115311, 2021.
[10] A. Chen, “A novel graph methodology for analyzing disease risk factor distribution using synthetic patient data,” Healthcare Analytics, vol. 2, p. 100084, 2022.
[11] Neo4j, “Graph algorithms,” Neo4j, Inc., May 3, 2023. [Online]. Available: https://neo4j.com/docs/graph-data-science/current/algorithms/
[12] M. Jin, S. Xu, Q. An, and P. Wang, “A review of risk factors for childhood leukemia,” Eur Rev Med Pharmacol Sci, vol. 20, no. 18, pp. 3760–3764, 2016.
[13] D. W. Kaufman, T. E. Anderson, and S. Issaragrisil, “Risk factors for leukemia in thailand,” Annals of hematology, vol. 88, no. 11, pp. 1079–1088, 2009.
[14] M. M. Behzad, M. Abbasi, I. Oliaei, S. G. Gholiabad, and H. Rafieemehr, “Effects of lifestyle and environmental factors on the risk of acute myeloid leukemia: result of a hospital-based case-control study,” Journal of Research in Health Sciences, vol. 21, no. 3, p. e00525, 2021.
[15] N. Mahmood, S. Shahid, T. Bakhshi, S. Riaz, H. Ghufran, and M. Yaqoob, “Identification of significant risks in pediatric acute lymphoblastic leukemia (all) through machine learning (ml) approach,” Medical & Biological Engineering & Computing, vol. 58, no. 11, pp. 2631–2640, 2020.
[16] F. M. Onyije, A. Olsson, D. Baaken, F. Erdmann, M. Stanulla, D. Wollschläger, and J. Schüz, “Environmental risk factors for childhood acute lymphoblastic leukemia: an umbrella review,” Cancers, vol. 14, no. 2, p. 382, 2022.
[17] [Online]. Available: https://synthetichealth.github.io/synthea/
[18] M. O. Aftab, M. J. Awan, S. Khalid, R. Javed, and H. Shabir, “Executing spark bigdl for leukemia detection from microscopic images using transfer learning,” in 2021 1st International Conference on Artificial Intelligence and Data Analytics (CAIDA). IEEE, 2021, pp. 216–220.
[19] A. Abhishek, R. K. Jha, R. Sinha, and K. Jha, “Automated classification of acute leukemia on a heterogeneous dataset using machine learning and deep learning techniques,” Biomedical Signal Processing and Control, vol. 72, p. 103341, 2022.
[20] P. Rastogi, K. Khanna, and V. Singh, “Leufeatx: Deep learning–based feature extractor for the diagnosis of acute leukemia from microscopic images of peripheral blood smear,” Computers in Biology and Medicine, vol. 142, p. 105236, 2022.
[21] N. Bibi, M. Sikandar, I. Ud Din, A. Almogren, and S. Ali, “Iomt-based automated detection and classification of leukemia using deep learning,” Journal of healthcare engineering, vol. 2020, 2020.
[22] E. Nazari, A. H. Farzin, M. Aghemiri, A. Avan, M. Tara, and H. Tabesh, “Deep learning for acute myeloid leukemia diagnosis,” Journal of Medicine and Life, vol. 13, no. 3, p. 382, 2020.
[23] V. Rupapara, F. Rustam, W. Aljedaani, H. F. Shahzad, E. Lee, and I. Ashraf, “Blood cancer prediction using leukemia microarray gene data and hybrid logistic vector trees model,” Scientific Reports, vol. 12, no. 1, pp. 1–15, 2022.
[24] N. Mahmood, S. Shahid, T. Bakhshi, S. Riaz, H. Ghufran, and M. Yaqoob, “Identification of significant risks in pediatric acute lymphoblastic leukemia (all) through machine learning (ml) approach,” Medical & Biological Engineering & Computing, vol. 58, no. 11, pp. 2631–2640, 2020.
[25] L. Pan, G. Liu, F. Lin, S. Zhong, H. Xia, X. Sun, and H. Liang, “Machine learning applications for prediction of relapse in childhood acute lymphoblastic leukemia,” Scientific reports, vol. 7, no. 1, pp. 1–9, 2017.
[26] A. Kashef, T. Khatibi, and A. Mehrvar, “Treatment outcome classification of pediatric acute lymphoblastic leukemia patients with clinical and medical data using machine learning: A case study at mahak hospital,” Informatics in medicine unlocked, vol. 20, p. 100399, 2020.
[27] J. M. Bland and D. G. Altman, “The odds ratio,” Bmj, vol. 320, no. 7247, p. 1468, 2000.
[28] S. Fotouhi, S. Asadi, and M. W. Kattan, “A comprehensive data level analysis for cancer diagnosis on imbalanced data,” Journal of biomedical informatics, vol. 90, p. 103089, 2019.
[29] I. El Naqa and M. J. Murphy, What is machine learning? Springer, 2015.
[30] R. Yasmin, R. Abbasi, T. Saeed, M. Sadiq, N. Yasmeen, M. Iqbal, A. K. Alzahrani, N. Kizilbash, B. Ugur, N. Ahmad et al., “Epidemiological and clinical correlates of leukemia ascertained in a multiethnic cohort of pakistan,” Available at SSRN 4179190, 2023.
[31] X. Chen, S. Jia, and Y. Xiang, “A review: Knowledge reasoning over knowledge graph,” Expert systems with applications, vol. 141, p. 112948, 2020.
[32] [Online]. Available: https://scikit-learn.org/stable/
