Heart Disease Prediction Using Machine Learning: A Data-Driven Approach
2024
Proceedings of the 2024 4th International Conference on Technological Advancements in Computational Sciences
Abboskhujaev Akhrorkhuja1, Danish Ather2, Rahul Chauhan3, Kireet Joshi4, Gurinder Singh5, Naina Chaudhary6
1,2,6 Amity University Tashkent, Tashkent, Uzbekistan
3 Computer Science & Engineering, Graphic Era Hill University, Dehradun, Uttarakhand, India
4 Computer Science & Engineering, Graphic Era Deemed to be University, Dehradun, Uttarakhand, India
5 Amity University Noida, Noida, Uttar Pradesh, India
E-mail: [email protected], [email protected], [email protected], [email protected]
Abstract: Heart disease remains a leading cause of mortality worldwide, accounting for a significant percentage of global deaths. Early detection and correct diagnosis are important in reducing its impact and improving patient outcomes. This study proposes a machine learning-based approach for heart disease prediction, utilising a dataset of clinical health parameters such as age, gender, cholesterol levels, and blood pressure. The model is developed using a combination of algorithms, including logistic regression, random forest, and support vector machines (SVM), to predict the likelihood of heart disease. Feature selection techniques are applied to identify the most important parameters influencing heart disease risk. The dataset is split into training and testing sets, and the models are evaluated based on accuracy, precision, recall, and F1-score. Our experimental results show that the random forest model achieved the best performance with an accuracy of 85%, outperforming the other models. This approach demonstrates the potential of machine learning in supporting early diagnosis and personalized treatment planning for patients at risk of heart disease. The proposed method can be integrated into healthcare systems to enhance predictive capabilities and facilitate proactive healthcare interventions. Future work will explore the inclusion of more diverse datasets to improve the model's generalizability.
Keywords: Heart disease, Machine learning, Healthcare, Prediction, Data science
I. INTRODUCTION
According to the World Health Organization (WHO), cardiovascular diseases (CVD), commonly known as heart diseases, cause approximately 17.9 million deaths each year and remain the leading cause of death worldwide. CVD refers to various heart conditions, including coronary artery disease, heart attacks, and heart failure, which are the most common forms. Both modifiable (such as diet, smoking, and physical inactivity) and non-modifiable (such as age and genetics) risk factors contribute to the high incidence of heart disease. Angina pectoris is a condition often associated with coronary artery disease. Accurate prediction and early diagnosis of heart disease can significantly improve patient outcomes, allowing timely interventions and reducing the burden on healthcare systems.
Heart disease risk factors have traditionally been identified using clinical algorithms, including the Framingham Risk Score and other statistical models, to help healthcare providers find the individuals at highest risk. However, these approaches often use limited data and ignore complex interactions among multiple risk factors. Big data and machine learning are a new frontier in predictive modelling, capturing overarching patterns in vast amounts of data, and have emerged as an attractive alternative to traditional heart disease prediction models. With machine learning techniques, patient data can be analyzed to highlight the major risk factors and to estimate the likelihood of heart disease more accurately than traditional methods.
In recent years, many researchers have applied machine learning models to the prediction of heart disease, using datasets that contain various clinical and demographic information. Predictive models based on techniques such as logistic regression, decision trees, random forest, and support vector machines (SVM) have been used. These models outperform traditional statistical models, especially where the relationships among variables are non-linear. While models have improved in the task of detecting recurrences, issues continue to arise when trying to generalize these predictions to different patient groups.
This research aims to develop a robust and accurate machine learning model for predicting heart disease, leveraging a dataset of medical parameters. By employing multiple machine learning algorithms, including logistic regression, random forests, and SVM, this study seeks to identify the most influential factors contributing to heart disease and evaluate the performance of each model in terms of accuracy, precision, recall, and F1-score. In doing so, the study highlights the potential of machine learning to revolutionize heart disease
Authorized licensed use limited to: VISVESVARAYA NATIONAL INSTITUTE OF TECHNOLOGY. Downloaded on March 04,2025 at 17:13:22 UTC from IEEE Xplore. Restrictions apply.
IEEE Conference ID: 62700 13th – 15th Nov. 2024
prediction, facilitating early detection and allowing for more targeted interventions.
Furthermore, the proposed model can be integrated into healthcare systems to provide clinicians with a robust tool for assessing heart disease risk and initiating interventions before adverse conditions develop. The rest of the paper is organized as follows: Section II provides background on heart disease prediction using machine learning. In Section III, we discuss the causes of heart disease and the key risk factors to look for. In Section IV, we explain the methodology followed in building the prediction model, which includes data preprocessing, feature selection, and model training. Section V presents experimental results, and Section VI draws conclusions and proposes directions for future work.
II. RELATED WORK
In recent years, interest in predicting heart disease using machine learning algorithms has increased rapidly. A range of models and methodologies has been tried to address this problem, from typical statistical models to advanced machine learning frameworks.
Various studies support the view that machine learning can increase the accuracy of heart disease prediction. For instance, logistic regression, one of the standard models, provides baseline performance for prediction, especially when combined with feature selection methods. In one study, logistic regression reached 79% classification accuracy, but model performance was limited by its assumption of linearity [1]–[4].
A. Causes of Heart Disease
Heart disease is influenced by a combination of genetic and environmental factors. Fig. 1 illustrates the main causes contributing to heart disease.
Fig. 1: Causes of Heart Disease
As shown in Fig. 1, smoking, an unhealthy diet, and low physical activity are major lifestyle factors. In addition, conditions such as high blood pressure, cholesterol, and diabetes raise the risk further. Decision trees and random forests demonstrated the greatest improvement, as they are able to capture non-linear relationships between risk factors and heart disease. The random forest model performed better than decision trees, with an accuracy of 85% in a comparative study [5].
Other algorithms such as support vector machines (SVM) are also widely used and provide high accuracy and precision in heart disease classification tasks [3]. In another work, a hybrid of SVM and Particle Swarm Optimization (PSO) was combined with a traditional classification technique for higher performance, which resulted in an accuracy of 87% [6]–[9]. Recent advances in deep learning have also made strides in heart disease prediction. Convolutional neural networks (CNNs) have been applied to classify medical images related to cardiovascular health, yielding high prediction accuracy when used in combination with clinical data [10], [11]. However, the challenge with deep learning models lies in their need for large datasets and high computational power, which limits their wide use in some healthcare settings.
A common issue faced by most machine learning models is overfitting, particularly when working with small datasets. To mitigate this, techniques such as cross-validation and regularization have been widely employed [12], [13]. Despite the high accuracy reported in various studies, generalizability across different populations remains a challenge, and the inclusion of diverse datasets is important for improving model robustness.
Table I: Comparison of Machine Learning Models for Heart Disease Prediction
Model | Accuracy | Advantages | Disadvantages
Logistic Regression | 79% [1] | Simple, interpretable | Assumes linearity
Decision Tree | 82% [5] | Captures non-linear relationships | Prone to overfitting
Random Forest | 85% [5] | Robust, reduces overfitting | High computational cost
Support Vector Machine (SVM) | 83% [14] | Effective in high-dimensional spaces | Hyperparameter tuning needed
SVM + PSO | 87% [6] | Combines SVM with optimization | Increased complexity
Convolutional Neural Network (CNN) | 90% [10] | High accuracy with large datasets | Requires large datasets
III. METHODOLOGY
In this section, we describe the methodology used for heart disease prediction. Figure 2 provides an overview of the methodology, including data preprocessing, model training, and evaluation.
Fig. 2: Overview of the Methodology for Heart Disease Prediction
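As a rough illustration of the preprocessing stage in this overview, the sketch below performs a hold-out split and then applies min-max scaling whose minimum and maximum are computed on the training rows only. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the toy patient records, the 80/20 split ratio, and the random seed are all hypothetical.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and return (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def fit_min_max(rows):
    """Per-feature minimum and maximum, computed on the given rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]

def apply_min_max(rows, mins, maxs):
    """Scale each feature with (x - min) / (max - min).

    Rows used to fit mins/maxs land in [0, 1]; unseen rows may fall
    slightly outside that range. Constant features map to 0.0.
    """
    return [
        [(x - lo) / (hi - lo) if hi > lo else 0.0
         for x, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]

# Hypothetical toy records: [age, cholesterol, resting blood pressure].
records = [
    [63, 233, 145], [37, 250, 130], [41, 204, 130], [56, 236, 120],
    [57, 354, 140], [44, 263, 120], [52, 199, 172], [54, 239, 140],
    [48, 275, 130], [49, 266, 130],
]

train, test = train_test_split(records, test_fraction=0.2, seed=42)
mins, maxs = fit_min_max(train)            # statistics from training data only
train_scaled = apply_min_max(train, mins, maxs)
test_scaled = apply_min_max(test, mins, maxs)
```

Fitting the scaling statistics on the training portion alone keeps information from the test rows out of the model-building step, which is one common way to keep the later evaluation honest.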
$$X' = \frac{X - X_{\min}}{X_{\max} - X_{\min}} \tag{1}$$
where $X$ is the original value of the feature, and $X_{\min}$ and $X_{\max}$ represent the minimum and maximum values of that feature, respectively.
B. Feature Selection
Feature selection was performed to reduce dimensionality and eliminate irrelevant or redundant variables. Two primary techniques were used:
1) Correlation Matrix
A correlation matrix was computed to identify the relationship between the independent variables and the target variable (heart disease). Variables with high correlation to the target were selected for model training.
2) Recursive Feature Elimination (RFE)
RFE is a recursive method that selects the most important features by iteratively removing the least significant ones. In each iteration, the model is trained, and features are ranked by their importance.
C. Machine Learning Models
In this study, we applied several machine learning models to predict heart disease. Below, we outline the mathematical formulations of the key models:
1) Logistic Regression
Logistic regression is a statistical model used for binary classification problems. In our case, it predicts the probability of heart disease. The logistic function, or sigmoid function, is used to model the probability, and is defined as:
$$\sigma(z) = \frac{1}{1 + e^{-z}} \tag{2}$$
where $z$ is a linear combination of the input features.
2) Support Vector Machine (SVM)
A linear SVM finds the maximum-margin separating hyperplane by solving:
$$\min_{w, b} \; \frac{1}{2} \lVert w \rVert^2 \quad \text{subject to} \quad y_i (w \cdot x_i + b) \geq 1 \;\; \forall i \tag{5}$$
where:
• $w$ is the normal vector to the hyperplane,
• $b$ is the bias term,
• $y_i$ is the class label for the i-th training example,
• $x_i$ is the feature vector for the i-th training example.
The non-linear SVM uses kernel functions to project data into a higher-dimensional space. The commonly used Radial Basis Function (RBF) kernel is defined as:
$$K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^2\right) \tag{6}$$
where $\gamma$ is a hyperparameter that defines the influence of a single training example.
3) Random Forest
Random forest is an ensemble method that constructs multiple decision trees. Each tree is built using a random subset of the data and features. The prediction is made by averaging the predictions of all trees for regression tasks or by taking the majority vote for classification tasks. For a classification task, the majority vote is defined as:
$$\hat{y} = \operatorname{mode}\{T_1(x), T_2(x), \ldots, T_K(x)\} \tag{7}$$
where $T_k(x)$ is the prediction of the k-th tree, and $K$ is the number of trees in the forest.
D. Model Evaluation
The performance of each model was evaluated using standard classification metrics. These include:
1) Accuracy
The overall accuracy of the model is given by the ratio of correctly predicted instances to the total instances:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{8}$$
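To make the quantities named in this section concrete, the following sketch implements the sigmoid function, the RBF kernel, the random forest majority vote, and accuracy as small standalone functions. It assumes the standard textbook forms of these formulas; the function names and the example inputs are hypothetical and are not taken from the study.

```python
import math
from collections import Counter

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def rbf_kernel(x_i, x_j, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * squared_distance)

def majority_vote(tree_predictions):
    """Most frequent class label among the K tree predictions.

    Ties are broken in favour of the label that appears first.
    """
    return Counter(tree_predictions).most_common(1)[0][0]

def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical inputs for illustration only.
p_neutral = sigmoid(0.0)                          # a score of 0 gives 0.5
k_same = rbf_kernel([1.0, 2.0], [1.0, 2.0], 0.5)  # identical points give 1.0
vote = majority_vote([1, 0, 1, 1, 0])             # three of five trees say 1
acc = accuracy([1, 0, 1, 1], [1, 0, 0, 1])        # three of four correct
```

A larger `gamma` makes the kernel value fall off faster with distance, which matches the description of it controlling the influence of a single training example.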
number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). The confusion matrices for Logistic Regression, Random Forest, and SVM are shown in Table III.
Table IV: Top 5 Most Important Features in Random Forest Model
Feature | Importance Score
Age | 0.25
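The TP, TN, FP, and FN counts described above are enough to compute the precision, recall, and F1-score that the paper reports alongside accuracy. The sketch below uses the standard definitions of these metrics; the function names and the toy labels are hypothetical and do not reproduce any result from the study.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for a binary classification task."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn

def precision_recall_f1(tp, fp, fn):
    """Standard definitions; each metric is 0.0 when its denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

# Hypothetical labels: 1 = heart disease, 0 = no heart disease.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
```

In a screening setting, recall is often watched closely because a false negative corresponds to a missed case of heart disease.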