Heart Ailment Prediction Using Machine Learning Methods
Abstract—The heart is the central organ of the circulatory system, pumping blood to every part
of the body, and diagnosing cardiovascular disease is a difficult but critical task. By extracting
knowledge about the disease from patient data, data mining offers a practical way to help
doctors detect disorders. We apply a variety of machine learning methods: Logistic Regression,
Support Vector Classifier (SVC), K-Nearest Neighbours Classifier (KNN), Decision Tree
Classifier, Random Forest Classifier, and Gradient Boosting Classifier. These algorithms are
applied to patient data containing 13 different factors to build a system that predicts heart
disease in less time and with greater accuracy.
Index Terms— Logistic Regression, Support Vector Classifier, K-Nearest Neighbour, Decision
Tree, Random Forest and Gradient Boosting.
I. INTRODUCTION
Rheumatic heart disease is linked to around 2% of cardiovascular disease-related fatalities worldwide. The terms
"cardiovascular disease" and "heart disease" are sometimes used interchangeably. Heart attacks, chest pain
(angina), strokes, and other illnesses caused by restricted or obstructed blood vessels are together referred to as
cardiovascular disease. A typical sign of heart disease is angina: chest pain that arises when the heart muscle
receives insufficient oxygen- and nutrient-rich arterial blood. Some people feel tightness or a squeezing sensation
around the breastbone, and the pain may also be felt in the neck, shoulder blades, upper arms, upper abdomen,
and upper back. The heart is the most important organ in the human body and controls blood flow throughout it,
so any impairment of heart function can harm other parts of the body. Heart disease is currently the leading
cause of death. According to World Health Organization (WHO) estimates, almost 12 million people die from
heart disease each year, and the WHO projects that this number will rise to 23.6 million by 2030 [6].
Other signs of this illness include dizziness, ankle swelling, shortness of breath, slow or irregular heartbeats,
fainting, lightheadedness, pain in the neck, jaw, or throat, and dullness, weakness, or coldness in parts of the body.
Heart disease can be prevented if it is detected early, which calls for more accurate diagnoses in less time.
Providing high-quality services and early, correct diagnosis is the healthcare industry's key problem. The extensive
application of machine learning, which also produces favorable results with the highest accuracy for medical diagnostics, can
III. PROPOSED METHODOLOGY
The main goal of this paper is to estimate the likelihood that a patient will develop heart disease, and data mining
is crucial to achieving this goal. This research uses the 13-factor heart disease dataset. The factors are gender, age,
exercise-induced angina, resting blood pressure, cholesterol, fasting blood sugar, chest pain type, thalassemia,
resting electrocardiography results, maximum heart rate reached, ST depression induced by exercise relative to
rest, slope, and number of major vessels. The program employs a classification technique.
A. Architecture Diagram
i. Data Import and Preprocessing
Fig. 1 shows the six steps of preprocessing: data import, duplicate removal, preprocessing, encoding, feature
scaling, and preparation of the training and testing datasets.
iii. Prediction
B. Data Source
The dataset used in the prediction process was obtained from the machine learning repository at the University of
California, Irvine. The dataset consists of 1026 instances, each with the 13 medical factors that are appropriate
for prediction [15].
C. Steps
1. Preprocessing: The data is checked for null and duplicate values and filtered. It is then encoded,
feature scaling is carried out, and finally it is split into training and testing data. Data preprocessing is
represented in fig. 1.
2. Training: The training data is fed to each of the ML algorithms, and each is then tested using the test data, as
displayed in fig. 2. Accuracies are calculated, and the model with the best accuracy is saved as the final model.
3. Prediction: Lastly, prediction is carried out using the saved model, and the output presented to the user is
shown in fig. 3.
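The three steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: it assumes scikit-learn and pandas are available, it uses a synthetic stand-in dataset (generated with make_classification) instead of the actual heart disease file, and the helper names (load_stand_in_data, preprocess, build_models, train_and_select), the 80/20 split, default hyperparameters, and the random seed are all hypothetical choices. Encoding is omitted because the stand-in data is already numeric.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def load_stand_in_data(n_rows=300, seed=0):
    """Synthetic placeholder for the 13-factor dataset (assumption, not real data)."""
    X, y = make_classification(n_samples=n_rows, n_features=13, random_state=seed)
    frame = pd.DataFrame(X, columns=[f"factor_{i}" for i in range(13)])
    frame["target"] = y
    return frame


def preprocess(frame, seed=0):
    """Step 1: drop nulls and duplicates, split into train/test, scale features."""
    frame = frame.dropna().drop_duplicates()
    X = frame.drop(columns="target")
    y = frame["target"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )
    scaler = StandardScaler().fit(X_train)  # fitted on the training split only
    return scaler.transform(X_train), scaler.transform(X_test), y_train, y_test


def build_models(seed=0):
    """The six classifiers named in the paper, with default hyperparameters."""
    return {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "SVC": SVC(),
        "KNN": KNeighborsClassifier(),
        "Decision Tree": DecisionTreeClassifier(random_state=seed),
        "Random Forest": RandomForestClassifier(random_state=seed),
        "Gradient Boosting": GradientBoostingClassifier(random_state=seed),
    }


def train_and_select(X_train, X_test, y_train, y_test):
    """Step 2: fit each model, score it on the test split, keep the most accurate."""
    scores = {}
    fitted = {}
    for name, model in build_models().items():
        fitted[name] = model.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, model.predict(X_test))
    best_name = max(scores, key=scores.get)
    return best_name, fitted[best_name], scores


X_train, X_test, y_train, y_test = preprocess(load_stand_in_data())
best_name, best_model, scores = train_and_select(X_train, X_test, y_train, y_test)

# Step 3: predict with the selected model (1 = disease indicated, 0 = not).
first_prediction = int(best_model.predict(X_test[:1])[0])
```

One design choice in this sketch differs from the order listed in step 1: the scaler is fitted after the split, on the training data only, so that no information from the test set influences the scaling.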
Figure 4. Results
Among the classification algorithms evaluated, random forest achieves the highest accuracy. A small part
of the random forest is represented in fig. 5. The UI consists of a form, shown in fig. 6, in which the user enters
the values of the factors that were used to train the model.
Figure 5. Small part of random forest
After the user has entered values for the fields (fig. 7), the model's output determines the message shown: if the
model returns 1 for those values, the UI shows “Possibility of Heart Disease”; otherwise, it shows “No Heart Disease”.
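The mapping from model output to displayed message could look like the sketch below. The function name describe_prediction is hypothetical; only the two message strings and the convention that 1 indicates possible disease come from the text.

```python
def describe_prediction(model_output):
    """Map the classifier's binary output to the message shown to the user."""
    if model_output == 1:
        return "Possibility of Heart Disease"
    return "No Heart Disease"
```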
V. MATHEMATICAL CALCULATIONS
Random forests (RF) create numerous decision trees during training. The final prediction (the mode of the
classes for classification, or the mean prediction for regression) is made by pooling the predictions from all
trees. They are referred to as "ensemble methods" because they draw on a variety of outcomes before making a
final decision. To predict the target value, each decision tree determines how to divide the data into
successively smaller subsets.
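The pooling step can be illustrated with a short sketch. The function names are hypothetical, and the inputs are assumed to be the per-tree outputs for a single sample.

```python
from collections import Counter


def pool_classification(tree_votes):
    """Return the mode: the class label predicted by the most trees."""
    return Counter(tree_votes).most_common(1)[0][0]


def pool_regression(tree_outputs):
    """Return the mean of the individual tree predictions."""
    return sum(tree_outputs) / len(tree_outputs)
```

For example, pool_classification([1, 0, 1, 1, 0]) selects class 1 because three of the five trees vote for it. Note that scikit-learn's RandomForestClassifier averages the trees' class probabilities rather than counting hard votes, so this sketch shows the general idea rather than that library's exact procedure.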
In its documentation, Scikit-learn describes the formulas used for the impurity criterion. For classification, it
uses Gini impurity (1) by default but offers entropy (2) as an alternative.
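\[ \mathit{Gini} = 1 - \sum_{i=1}^{C} p_i^{2} \tag{1} \]
\[ \mathit{Entropy} = \sum_{i=1}^{C} -p_i \log_2 p_i \tag{2} \]
where $p_i$ is the fraction of samples at the node that belong to class $i$ and $C$ is the number of classes.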
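As a worked illustration of the Gini impurity (1) and entropy (2) criteria, the sketch below computes both for the class labels at a single node. The function names are hypothetical, and the labels are assumed to be given as a plain Python list.

```python
import math


def class_proportions(labels):
    """Fraction of samples belonging to each distinct class at the node."""
    total = len(labels)
    return [labels.count(cls) / total for cls in set(labels)]


def gini_impurity(labels):
    """Gini impurity: one minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """Entropy in bits: sum of -p * log2(p) over the classes present."""
    return sum(-p * math.log2(p) for p in class_proportions(labels))
```

A node with two samples of each of two classes gives a Gini impurity of 0.5 and an entropy of 1 bit, the maximum for a binary problem, while a node containing a single class gives 0 for both.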
The importance of a feature is determined by the decrease in node impurity weighted by the probability of
reaching that node. The node probability can be calculated as the number of samples that reach the node divided
by the total number of samples. The higher the value, the more important the feature.
For every decision tree, Scikit-learn computes a node's importance using Gini importance, assuming only two
child nodes (binary tree):
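\[ ni_{j} = w_{j} C_{j} - w_{\mathrm{left}(j)} C_{\mathrm{left}(j)} - w_{\mathrm{right}(j)} C_{\mathrm{right}(j)} \]
where $ni_{j}$ is the importance of node $j$, $w_{j}$ is the weighted number of samples reaching node $j$, $C_{j}$ is
the impurity value of node $j$, $\mathrm{left}(j)$ is the child node from the left split on node $j$, and
$\mathrm{right}(j)$ is the child node from the right split on node $j$. The Gini index for each feature, calculated
from (1), is: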
Gini Index for feature 0: 0.078939
Gini Index for feature 1: 0.091710
Gini Index for feature 2: 0.064395
Gini Index for feature 3: 0.080580
Gini Index for feature 4: 0.069999
Gini Index for feature 5: 0.073910
Gini Index for feature 6: 0.089164
Gini Index for feature 7: 0.087576
Gini Index for feature 8: 0.083251
Gini Index for feature 9: 0.058257
Gini Index for feature 10: 0.076762
Gini Index for feature 11: 0.079951
Gini Index for feature 12: 0.127049
These can then be normalized to a value between 0 and 1 by dividing by the sum of all feature importance
values:
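\[ \mathit{normfi}_{ij} = \frac{\mathit{fi}_{ij}}{\sum_{k} \mathit{fi}_{kj}} \tag{3} \]
Averaging over all decision trees,
\[ \mathit{RFfi}_{i} = \frac{\sum_{j=1}^{T} \mathit{normfi}_{ij}}{T} \tag{4} \]
where $\mathit{fi}_{ij}$ is the importance of feature $i$ in tree $j$, the sum over $k$ in (3) runs over all features,
$\mathit{normfi}_{ij}$ is the normalized importance of feature $i$ in tree $j$, $T$ is the total number of trees, and
$\mathit{RFfi}_{i}$ is the importance of feature $i$ computed over all trees in the Random Forest model, as given by (4).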
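The calculations in this section can be traced with a small sketch: the importance of a single node (its weighted impurity minus the weighted impurities of its two children), the per-tree normalization in (3), and the average over trees in (4). The function names and the numbers in the usage note are hypothetical illustrations, not values from the paper's model.

```python
def node_importance(w, c, w_left, c_left, w_right, c_right):
    """Weighted impurity of a node minus the weighted impurities of its children."""
    return w * c - w_left * c_left - w_right * c_right


def normalize(importances):
    """Equation (3): divide each feature importance by the sum over all features."""
    total = sum(importances)
    return [value / total for value in importances]


def forest_importance(per_tree_importances):
    """Equation (4): average the normalized importances of each feature over all trees."""
    normalized = [normalize(tree) for tree in per_tree_importances]
    n_trees = len(normalized)
    n_features = len(normalized[0])
    return [
        sum(tree[i] for tree in normalized) / n_trees
        for i in range(n_features)
    ]
```

For instance, a root node that all samples reach (weight 1.0, Gini 0.5), split into a pure left child (weight 0.4, Gini 0) and a right child with weight 0.6 and Gini 10/36, has importance 0.5 - 0 - 0.6 × 10/36 = 1/3. Likewise, forest_importance([[2.0, 1.0, 1.0], [1.0, 1.0, 2.0]]) normalizes the two trees to [0.5, 0.25, 0.25] and [0.25, 0.25, 0.5] and averages them to [0.375, 0.25, 0.375].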
Random Forest is well suited to heart disease prediction because it can handle high-dimensional data and
automatically identifies relevant features. It is robust against the outliers and noisy data often found in heart
disease datasets. By aggregating predictions from multiple decision trees, Random Forest reduces variance
effectively. This is beneficial because individual decision trees have low bias but high variance, so combining
them retains the low bias while lowering the variance. Overall, Random Forest is a powerful and reliable
approach for heart disease prediction.
VI. CONCLUSION
Our research focuses on using various machine learning techniques to predict heart disease, and we assess the
efficacy of these algorithms on a set of factors that can be used to determine whether a patient has heart disease.
The research demonstrates how several machine learning algorithms perform in the prediction of cardiovascular
disease. The classification procedures employed in the study were implemented in Python. According to the
results above, the Random Forest Classifier is the best-performing machine learning technique of all those
examined, with an accuracy of 83.60%. The average accuracy obtained is 78.94%. K-Nearest Neighbors is the
least accurate algorithm, with an accuracy of 73.77%. Machine learning can thus be used effectively to predict
cardiac illness earlier and lower the death rate.
FUTURE SCOPE
Advanced techniques such as deep learning can be applied to further increase the accuracy of the system, ideally
approaching 100%. With the implementation of better ML systems in the healthcare sector, we can substantially
reduce the human error factor and also increase the accuracy of prediction for various conditions, such as heart
disease, liver disease, diabetes, and tumors.
REFERENCES
[1] Jothi, K. A., Subburam, S., Umadevi, V., & Hemavathy, K. C. (2021). WITHDRAWN: Heart disease prediction system
using machine learning. Materials Today: Proceedings.
[2] Kavitha, M., Gnaneswar, G., Dinesh, R., Sai, Y. R., & Suraj, R. S. (2021). Heart Disease Prediction Using Hybrid
Machine Learning Model.
[3] Maini, E., Venkateswarlu, B., Maini, B., & Marwaha, D. (2021). Machine learning–based heart disease prediction
system for Indian population: An exploratory study done in South India. Medical Journal, Armed Forces India, 77(3),
302–311.
[4] Goel, Rati, Heart Disease Prediction Using Various Algorithms of Machine Learning (July 12, 2021). Proceedings of the
International Conference on Innovative Computing & Communication (ICICC)2021.
[5] J, S. K., & Geetha, S. (2019). Prediction of Heart Disease Using Machine Learning Algorithms.
[6] Ansari, M.F., Alankar, B., Kaur, H. (2021). A Prediction of Heart Disease Using Machine Learning Algorithms. In:
Chen, J.IZ., Tavares, J.M.R.S., Shakya, S., Iliyasu, A.M. (eds) Image Processing and Capsule Networks. ICIPCN 2020.
Advances in Intelligent Systems and Computing, vol 1200. Springer, Cham.
[7] Marikani, T., & Shyamala, K. (2017). Prediction of Heart Disease using Supervised Learning Algorithms. International
Journal of Computer Applications, 165(5), 41–44.
[8] Ramalingam, V. V., Dandapath, A., & Raja, M. K. (2018). Heart disease prediction using machine learning techniques:
a survey. International Journal of Engineering & Technology, 7(2.8), 684.
[9] (n.d.-a). Heart Disease Prediction using Machine Learning Techniques. International Journal of Engineering Research &
Technology (IJERT), Volume 09(Issue 11 (November 2020)).
[10] Heart Attack Prediction Using Machine Learning Algorithms, INTERNATIONAL JOURNAL OF ENGINEERING
RESEARCH & TECHNOLOGY (IJERT) ICEI – 2022, ICEI – 2022 (Volume 10) (Issue 11).
[11] Jindal, H., Agrawal, S., Khera, R., Jain, R., & Nagrath, P. (2021). Heart disease prediction using machine learning
algorithms. IOP Conference Series: Materials Science and Engineering, 1022(1), 012072.
[12] (n.d.-a). Heart Disease Prediction using Machine Learning. INTERNATIONAL JOURNAL OF ENGINEERING
RESEARCH & TECHNOLOGY (IJERT), NCETER – 2021 (Volume 09 – Issue 11).
[13] Mohan, S., Thirumalai, C., & Srivastava, G. (2019). Effective Heart Disease Prediction Using Hybrid Machine Learning
Techniques. IEEE Access, 7, 81542–81554.
[14] Hassan, C. H. C., Khan, M. S., & Shah, M. A. (2018). Comparison of Machine Learning Algorithms in Data
classification.
[15] https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset