
Exploratory Data Analysis and Machine Learning Models for Stroke Prediction

Wei Fu
College of Information Science and Engineering, Northeastern University, Shenyang, China

Keywords: Stroke Prediction, Exploratory Data Analysis, Random Forest, Logistic Regression, XGBoost Models.

Abstract: Stroke risk assessment is a vital area of study in healthcare. This research delves into the application of
sophisticated analytical methods, combining exploratory data analysis (EDA) with advanced machine
learning techniques including Random Forest, Logistic Regression, and XGBoost models. These models were
deployed to predict stroke risk, leveraging key variables such as age, gender, BMI, and smoking habits.
Notably, the Random Forest models exhibited robust predictive capabilities, indicating promising prospects
for clinical implementation. By fusing the power of exploratory data analysis and machine learning algorithms,
this study significantly enhances the early detection of stroke cases. The findings hold substantial potential
for improving patient care and advancing the field of stroke risk assessment research. The integration of
exploratory data analysis and machine learning not only augments the understanding of stroke risk factors but
also paves the way for further scholarly investigations in this domain. The insights garnered from this research
serve as a cornerstone, offering valuable direction for future studies and contributing to the continuous
evolution of stroke risk assessment methodologies.

1 INTRODUCTION

Stroke is a sudden neurological disorder that typically leads to severe health consequences such as paralysis, speech impairment, and cognitive decline. Hence, early detection and intervention are critical in reducing the risk of stroke. In the field of healthcare, machine learning is widely employed for stroke prediction, as it leverages extensive patient data and multiple features to build accurate prediction models. Machine learning techniques such as decision trees, random forests, XGBoost models, and deep learning have been applied to stroke prediction.

Exploratory Data Analysis (EDA) involves assessing data quality by identifying missing values, outliers, and duplicates, summarizing data statistics, visualizing data distributions and relationships through graphs, and helping select relevant features for stroke prediction (Chun et al., 2021). It aids in gaining insights into the nature of the stroke prediction problem, guiding feature selection and engineering, and providing a foundation for subsequent machine learning model development, ultimately enhancing model accuracy and interpretability.

XGBoost models in this paper serve as a powerful predictive tool for stroke prediction (Chung et al., 2023). They excel in accuracy, feature importance analysis, handling imbalanced data, capturing non-linear relationships, and preventing over-fitting. These characteristics make XGBoost a valuable addition to the machine learning toolkit when exploring stroke risk factors and developing predictive models. Random Forest plays a key role in predictive modelling (Fernandez-Lozano et al., 2021). It assesses feature importance, handles non-linearity, reduces over-fitting, deals with missing data, and provides ensemble averaging for a stable prediction model with valuable insights into factors affecting strokes.

Logistic Regression is a crucial tool used to forecast the likelihood of stroke based on various risk factors. This statistical technique provides interpretable insights by quantifying how each risk factor impacts stroke risk, aiding in risk assessment. Its simplicity and transparency make it an essential baseline model for comparing and evaluating the performance of more complex machine learning methods in the context of stroke prediction.

The significance of lifestyle factors and patient medical records in influencing the likelihood of stroke development has been examined in various

Fu, W.
Exploratory Data Analysis and Machine Learning Models for Stroke Prediction.
DOI: 10.5220/0012783300003885
Paper published under CC license (CC BY-NC-ND 4.0)
In Proceedings of the 1st International Conference on Data Analysis and Machine Learning (DAML 2023), pages 211-217
ISBN: 978-989-758-705-4
Proceedings Copyright © 2024 by SCITEPRESS – Science and Technology Publications, Lda.

studies (Meschia et al., 2014; Harmsen et al., 2006; Nwosu et al., 2019; Pathan et al., 2020). Additionally, the utilization of machine learning models for forecasting stroke incidence has also gained traction in recent research (Jeena and Kumar, 2016; Hanifa and Raja-S, 2010). In his research paper, Soumyabrata Dev proposes the utilization of neural networks (NN), decision trees (DT), and random forests (RF) for the prediction of strokes based on patient attributes (Dev et al., 2022). In this paper, various algorithms were used, including logistic regression, random forest, and the XGBoost model, to predict stroke, and each algorithm is evaluated according to its confusion matrix.
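The workflow just described, fitting several classifiers and scoring each one against its confusion matrix, can be sketched as follows. This is a minimal illustration on synthetic data, not the author's code: the sample sizes, hyperparameters, and use of scikit-learn here are assumptions, and an XGBoost model (`xgboost.XGBClassifier`) would slot into the same loop.

```python
# Minimal sketch of the evaluation loop: fit logistic regression and
# random forest, then summarize each with a confusion matrix.
# Synthetic data stands in for the stroke dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

results = {}
for model in (LogisticRegression(max_iter=1000),
              RandomForestClassifier(random_state=0)):
    model.fit(X_tr, y_tr)
    # Rows of the matrix are true classes, columns are predicted classes.
    results[type(model).__name__] = confusion_matrix(y_te, model.predict(X_te))

for name, cm in results.items():
    tn, fp, fn, tp = cm.ravel()
    print(f"{name}: TN={tn} FP={fp} FN={fn} TP={tp}")
```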

2 DATA
Figure 2: Percentage of married people (Original).

2.1 Dataset
The dataset, referred to as the "Stroke Prediction Dataset" (Kaggle, Online), encompasses 5110 instances and is structured into 12 columns. The 'gender' column contains 3 unique values: Male, Female, and Other. The average age of the observations is around 43 years, and most of the patients have never smoked. The average glucose level and BMI across all patients are 106.147677 and 28.893237, respectively.
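Summaries of this kind can be reproduced with a few lines of standard-library Python. The toy rows below are invented for illustration (only the real Kaggle CSV yields the averages quoted above); the column names follow the dataset's schema.

```python
import statistics

# Toy rows mimicking a few of the 12 columns of the Stroke Prediction
# Dataset; the values are illustrative, not taken from the real file.
rows = [
    {"gender": "Male",   "age": 67.0, "avg_glucose_level": 228.69, "bmi": 36.6},
    {"gender": "Female", "age": 61.0, "avg_glucose_level": 202.21, "bmi": 28.1},
    {"gender": "Female", "age": 49.0, "avg_glucose_level": 171.23, "bmi": 34.4},
    {"gender": "Other",  "age": 26.0, "avg_glucose_level": 85.28,  "bmi": 22.4},
]

genders = sorted({r["gender"] for r in rows})        # unique values in 'gender'
mean_age = statistics.mean(r["age"] for r in rows)   # column averages
mean_glucose = statistics.mean(r["avg_glucose_level"] for r in rows)
mean_bmi = statistics.mean(r["bmi"] for r in rows)

print(genders)  # ['Female', 'Male', 'Other']
print(mean_age, mean_glucose, mean_bmi)
```

Reading the downloaded Kaggle file row by row with the `csv` module (or `pandas.read_csv`) and applying the same computations reproduces the figures reported in the text.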

2.2 Data Visualization


In light of the fact that this specific row in the dataset does not exhibit any severely detrimental values that could significantly impact the integrity of the data, it is recommended to refrain from its deletion. Subsequently, the author generates a histogram to visually represent the age distribution within the dataset, facilitating a comprehensive understanding of the age distribution, as depicted in Figure 1.

Figure 1: Age histogram (Original).
Figure 3: Percentage of types of work (Original).
Figure 4: Percentage of residence types (Original).

Here, several pie charts are presented, displaying the percentage distribution of categorical variables, as shown in Figures 2 to 5.
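The percentages behind pie charts like these are simply value counts normalized by the number of rows. A small sketch with invented smoking_status values (the real proportions come from the dataset itself):

```python
from collections import Counter

# Illustrative categorical values; in practice these come from columns
# such as ever_married, work_type, Residence_type, or smoking_status.
smoking_status = [
    "never smoked", "never smoked", "formerly smoked",
    "smokes", "never smoked", "Unknown",
]

counts = Counter(smoking_status)            # frequency of each category
total = len(smoking_status)
percentages = {k: round(100 * v / total, 1) for k, v in counts.items()}
print(percentages)
# {'never smoked': 50.0, 'formerly smoked': 16.7, 'smokes': 16.7, 'Unknown': 16.7}
```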


Figure 5: Percentage of smoking status (Original).

Also, it is worth noting that the dataset includes patients across all age groups. To investigate the presence of outliers, the author has provided several box plots, which can be observed in Figure 6 and Figure 7.

Figure 6: Average glucose level boxplot (Original).
Figure 7: BMI boxplot (Original).

Some values appear very high, so whether such high BMI values are plausible should be examined. Figure 8 and Figure 9 plot the relationship between average glucose level, BMI, and stroke.

Figure 8: Average glucose level by stroke histogram (Original).
Figure 9: BMI by stroke histogram (Original).

Lastly, the author plots the correlation heatmap in Figure 10 to show whether the variables are correlated with each other. Evidently, among the assumed causes, the variable with the greatest impact on stroke and BMI is age. The author also plots a scatter plot of the BMI and average glucose level variables, shown in Figure 11.

Figure 10: Correlation Heatmap (Original).
Figure 11: Scatter plot of BMI and average glucose level (Original).
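The whiskers of box plots like Figures 6 and 7 conventionally follow Tukey's 1.5×IQR rule, so the suspiciously high values can also be flagged programmatically. A small sketch with illustrative BMI-like values (the 1.5 multiplier is the conventional choice, not a constant taken from the paper):

```python
import statistics

def iqr_bounds(values):
    """Tukey's boxplot rule: points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
    are flagged as outliers."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Illustrative BMI-like values with one suspiciously high entry.
bmi = [21.0, 24.5, 26.0, 27.5, 28.0, 29.5, 31.0, 33.0, 36.0, 97.6]
low, high = iqr_bounds(bmi)
outliers = [v for v in bmi if v < low or v > high]
print(outliers)  # [97.6]
```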


3 METHOD

In this project, the author employed exploratory data analysis (EDA) to preprocess the data, followed by the application of logistic regression, random forest, and XGBoost for analysis.

3.1 Algorithm

(1) Logistic regression is a statistical approach used to investigate data, considering the influence of one or more independent factors on a particular outcome. It finds its niche in tasks where the outcome is binary, meaning it has only two possible categories, typically referred to as 0 and 1.

(2) Random forest is a machine learning technique that builds upon the principles of decision trees. It makes feature selection possible by evaluating the significance of each feature through calculation. The random forest algorithm first uses the bootstrap aggregation method to obtain training sets, and a decision tree is built for each training set. When sampling with bootstrap, one sample is selected randomly from the original set (N samples) with replacement, and one training set is generated by repeating this step N times. The probability that a single sample will be selected at least once in N samplings is:

P = 1 − (1 − 1/N)^N    (1)

When N goes to infinity:

1 − (1 − 1/N)^N ≈ 1 − 1/e ≈ 0.632    (2)

This suggests that around 63.2% of the sample data is consistently utilized as the training set for each modeling iteration. Consequently, approximately 36.8% of the training data remains unused and does not contribute to the model training process. These unused data points are commonly referred to as "out-of-bag data" or OOB data.

Consider a decision tree denoted as G−n(xn), a constituent element of a random forest model, built without the data point xn and therefore able to provide an out-of-bag prediction for it. Assuming a total of N decision trees exist within the random forest, the out-of-bag error, conventionally symbolized as r1, is computed by averaging the prediction errors over the N data points, comparing the actual values (yn) with the predictions rendered by G−n(xn).

To offer an alternative perspective: imagine an error metric, denoted r2, designed to quantify the errors associated with out-of-bag (OOB) samples following random permutations. In this context, the feature importance I(xn) associated with a specific feature xn is computed as the average across N iterations, with each iteration entailing the subtraction of r2 from r1.

(3) XGBoost, which stands for Extreme Gradient Boosting, is a powerful and widely used machine learning technique. It is particularly well-suited to structured or tabular data in supervised learning tasks. It operates as an ensemble learning technique that amalgamates the forecasts of numerous independent models, often in the form of decision trees.

3.2 Evaluation Criteria

(1) Confusion matrices are useful in the context of stroke prediction (or any binary classification problem) for evaluating the performance of predictive models. A confusion matrix presents a detailed summary of how well a model's predictions match the real outcomes in the dataset. It is particularly valuable for assessing the model's ability to make accurate predictions and for understanding the types of errors it makes.

(2) Accuracy measures the model's ability to make correct predictions by considering the total correct predictions (TP + TN) in relation to all predictions made. It provides a holistic assessment of the model's overall effectiveness:

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (3)

(3) Precision is a performance metric that assesses the reliability of a model's positive predictions. It is calculated by taking the ratio of True Positives (TP) to the sum of True Positives (TP) and False Positives (FP). In essence, precision informs the frequency with which the model's positive predictions are accurate.

(4) Recall tells how good the model is at finding all the positive cases. It is calculated by dividing the number of true positives (correctly identified positives) by the sum of true positives and false negatives. In the context of stroke prediction, recall is crucial to avoid missing high-risk stroke cases.

(5) The F1-score is a way to express both precision and recall with a single number, utilizing the


harmonic mean to find a balanced measure. Because recall and precision cannot be used independently to assess a model, the F1-score is used to balance the two indicators and make them compatible. The F1-score provides a balanced assessment of precision and recall, essentially striking a middle ground between the two metrics. It takes into account both precision (the accuracy of positive predictions) and recall (the sensitivity to detect true positives), and is calculated as F1 = 2 × Precision × Recall / (Precision + Recall).

4 RESULT

4.1 Extract Key Features

From the importance plot shown in Figure 12, age is usually the feature with the most impact across the models, followed by BMI and average glucose level, respectively. Unlike the other models, however, the most important feature in this model is BMI, followed by average glucose level, and then age.

Figure 12: All features and their importance scores (Original).

4.2 Predict Result

In summary, the results obtained from the experiments in Table 1 provide important information about how different machine learning models perform when predicting strokes. Among the models evaluated, the Random Forest model emerged as the standout performer, consistently achieving high F1-score, recall, precision, and accuracy. This underscores its effectiveness in precisely recognizing individuals who are at a heightened risk of experiencing a stroke. While the Logistic Regression and Decision Tree models exhibited respectable performance, their simplicity and interpretability make them viable options in scenarios where model transparency is paramount. Conversely, the XGBoost model's relatively poor performance suggests a need for further refinement or exploration of alternative algorithms for stroke prediction tasks. Ultimately, the choice of the most suitable model should be guided by the specific demands of the application, with consideration given to factors such as model interpretability, computational resources, and the need for fine-tuning. Nonetheless, these findings underscore the prominence of the Random Forest model as a robust choice for stroke prediction in most scenarios. Future research may focus on enhancing the performance of other models or investigating ensemble approaches to further improve predictive accuracy.

Table 1: Performance metrics of various models.

Model                 F1-score    Recall      Precision   Accuracy
Logistic Regression   0.612245    0.652174    0.576923    0.728571
Decision Tree         0.613861    0.673913    0.563636    0.721429
Random Forest         0.969671    0.987983    0.952026    0.941368
XGBoost               0.273438    0.555556    0.181347    0.848534

5 EVALUATION

In the evaluation phase, the author observed strong precision, recall, and F1-scores in Figures 12-15. In a hospital setting, the false negative area in the confusion matrix is of particular concern, as it represents cases where the model failed to predict a medical condition. This can have serious consequences, especially if timely intervention is needed. Bringing this evaluation perspective to the results, the author finds that BMI, average glucose level, and age stand out with high importance scores in Figure 12: BMI scores 112, average glucose level 108, and age 85. These findings emphasize the importance of these factors in improving early detection and enhancing patient care in real-world clinical applications.

Figure 13: Logistic Regression (Original).
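The four metrics reported for each model in Table 1 are mutually consistent: for the Logistic Regression row, all of them follow from a single confusion matrix. The counts used below (TP=30, FN=16, FP=22, TN=72) are inferred from the reported ratios rather than stated in the paper:

```python
# Sanity-check of Table 1's Logistic Regression row using the metric
# definitions from Section 3.2. The confusion-matrix counts are an
# inference from the reported ratios, not values given by the author.
tp, fn, fp, tn = 30, 16, 22, 72

precision = tp / (tp + fp)                    # 30/52
recall = tp / (tp + fn)                       # 30/46
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
accuracy = (tp + tn) / (tp + tn + fp + fn)    # 102/140

print(round(precision, 6), round(recall, 6), round(f1, 6), round(accuracy, 6))
# 0.576923 0.652174 0.612245 0.728571
```

These reproduce the table's Logistic Regression entries exactly, which is a useful consistency check when transcribing results.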


Figure 14: Decision Tree (Original).
Figure 15: RF (Original).

6 DISCUSSION

The training of the model is completed in a short time, and the precision, recall, and accuracy are very high. The table and figures above show that the Random Forest model achieves a high accuracy of 94%. Comparing Random Forest with logistic regression, the two achieve different levels of accuracy, precision, and recall, which indicates that some features contribute little to predicting strokes.

7 CONCLUSION

Within the framework of this experimental study, the standout performer emerged as the Random Forest model, boasting an impressive accuracy rate exceeding 90%. Notably, it also exhibited exceptional F1-score and AUC values, underscoring its proficiency in stroke prediction. In a comparative analysis with other predictive models, such as Logistic Regression and XGBoost, Random Forest's performance outshone its counterparts. The inherent strength of the Random Forest model lies in its adeptness at handling complex feature interactions and non-linear patterns within the dataset, attributes that contribute to its heightened predictive accuracy. This exceptional performance positions Random Forest as a prime candidate for further refinement and potential real-world application in the realm of stroke risk assessment. Nevertheless, the journey doesn't end here. Additional research and meticulous model fine-tuning are warranted to fully harness and validate the capabilities of Random Forest in practical clinical settings. Such endeavors are essential to elevate the accuracy of stroke prediction and, in turn, optimize patient care outcomes. This research serves as a pivotal stepping stone, paving the way for enhanced stroke prediction methodologies and, ultimately, improved patient well-being.

REFERENCES

M. Chun, R. Clarke, B. J. Cairns, D. Clifton, D. Bennett, Y. Chen, et al., "Stroke risk prediction using machine learning: A prospective cohort study of 0.5 million Chinese adults," Journal of the American Medical Informatics Association, vol. 28, no. 8, pp. 1719-1727, 2021.
C. C. Chung, E. C.-Y. Su, J. Chen, Y. Chen, and C.-Y. Kuo, "XGBoost-Based Simple Three-Item Model Accurately Predicts Outcomes of Acute Ischemic Stroke," Diagnostics, vol. 13, no. 5, p. 842, 2023.
C. Fernandez-Lozano, P. Hervella, V. Mato-Abad, M. Rodríguez-Yáñez, S. Suárez-Garaboa, I. López-Dequidt, A. Estany-Gestal, T. Sobrino, F. Campos, J. Castillo, et al., "Random forest-based prediction of stroke outcome," Scientific Reports, vol. 11, no. 1, p. 10071, 2021.
J. F. Meschia, C. Bushnell, B. Boden-Albala, L. T. Braun, D. M. Bravata, S. Chaturvedi, M. A. Creager, R. H. Eckel, M. S. V. Elkind, M. Fornage, et al., "Guidelines for the primary prevention of stroke: a statement for healthcare professionals from the American Heart Association/American Stroke Association," Stroke, vol. 45, no. 12, pp. 3754-3832, 2014.
P. Harmsen, G. Lappas, A. Rosengren, and L. Wilhelmsen, "Long-term risk factors for stroke: twenty-eight years of follow-up of 7457 middle-aged men in Göteborg, Sweden," Stroke, vol. 37, no. 7, pp. 1663-1667, 2006.
C. S. Nwosu, S. Dev, P. Bhardwaj, B. Veeravalli, and D. John, "Predicting stroke from electronic health records," in 2019 41st Annual International Conference


of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 5704-5707, 2019.
M. S. Pathan, Z. Jianbiao, D. John, A. Nag, and S. Dev, "Identifying stroke indicators using rough sets," IEEE Access, vol. 8, pp. 210318-210327, 2020.
R. S. Jeena and Sukesh Kumar, "Stroke prediction using SVM," in 2016 International Conference on Control, Instrumentation, Communication and Computational Technologies (ICCICCT), pp. 600-602, 2016.
S. M. Hanifa and K. Raja-S, "Stroke risk prediction through non-linear support vector classification models," Int. J. Adv. Res. Comput. Sci., vol. 1, no. 3, p. 4753, 2010.
S. Dev, H. Wang, C. S. Nwosu, N. Jain, B. Veeravalli, and D. John, "A predictive analytics approach for stroke prediction using machine learning and neural networks," Healthcare Analytics, vol. 2, p. 100032, 2022.
Kaggle, Stroke Prediction Dataset, [online] Available: https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset

