0% found this document useful (0 votes)
11 views22 pages

Report - Mini ProjectFINAL

mmm

Uploaded by

Hrithick M
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
11 views22 pages

Report - Mini ProjectFINAL

mmm

Uploaded by

Hrithick M
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
You are on page 1/ 22

HEART DISEASE DETECTION USING ML

PYTHON PROJECT REPORT

Submitted by

HRITHICK RAM.M [2303722810621043]

BACHELOR OF ENGINEERING

in
COMPUTERAND COMMUNICATION
ENGINEERING

SRI ESHWAR COLLEGE OF ENGINEERING


(AN AUTONOMOUS INSTITUTION)

COIMBATORE – 641 202

JUNE – JULY 2024

1
BONAFIDE CERTIFICATE

Certified that this project report “HEART DISEASE DETECTION


USING ML” is the bonafide work of
HRITHICK RAM.M [2303722810621043]

who carried out the project work under my supervision

…………………………………
SIGNATURE
Dr.V.Kiruthika,ME.,MBA., Ph.D.,
Assistant Professor
Department of Electronics and
Communication
Engineering,
Sri Eshwar College of Engineering,
Coimbatore -641202

CHAPTER TITLE PAGE NO


NO

2
1 INTRODUCTION 4

2 PROBLEM DESCRPTION 5

3 OBJECTIVE 5

4 SOFTWARE SPECIFICATION 6

5 METHODOLOGY 7

6 IMPLEMENTATION 8

7 RESULT 11

8 CONCLUSION 19

9 FUTURE SCOPE 20

TABLE OF CONTENTS

INTRODUCTION

3
Heart disease, encompassing a broad spectrum of cardiovascular conditions,
remains the leading cause of death globally. Conditions such as coronary artery
disease, heart failure, arrhythmias, and congenital heart defects significantly
impact the quality of life and pose serious health risks. Early detection and
timely intervention are critical in mitigating these risks, managing symptoms,
and improving patient outcomes. However, traditional diagnostic methods often
involve invasive procedures, are time-consuming, and may not always be
accessible or affordable for everyone.

The advent of machine learning has revolutionized many fields, including


healthcare. Machine learning algorithms can analyze vast amounts of data,
identify complex patterns, and make predictions with high accuracy. In the
context of heart disease, machine learning offers a promising approach to
developing non-invasive, cost-effective, and reliable predictive models. These
models can assist healthcare professionals in early diagnosis, risk stratification,
and personalized treatment planning.

The motivation for this project stems from the need to leverage machine
learning to predict heart disease risk using readily available patient data. By
analyzing variables such as age, gender, cholesterol levels, blood pressure,
smoking habits, and other health indicators, machine learning models can
provide valuable insights and support clinical decision-making. The ultimate
goal is to develop a predictive tool that enhances early detection, reduces the
burden on healthcare systems, and improves patient outcomes.

4
PROBLEM DESCRPTION

Heart disease is influenced by a multitude of factors, including genetics,


lifestyle, and environmental conditions. Traditional diagnostic approaches,
while effective, have limitations. Methods like electrocardiograms (ECGs),
echocardiograms, stress tests, and angiographies are invasive, expensive, and
may not be suitable for regular screening of at-risk populations. Moreover, the
interpretation of these tests can vary among clinicians, leading to
inconsistencies in diagnosis and treatment.

The challenge lies in creating a predictive model that accurately identifies


individuals at risk of heart disease using non-invasive and readily available data.
This project aims to address this challenge by employing machine learning
techniques to analyze patient data and predict heart disease risk. The dataset for
this study includes variables such as age, gender, cholesterol levels, blood
pressure, blood sugar levels, smoking habits, and other relevant health
indicators.

The project seeks to develop a reliable and accurate predictive model that can
assist healthcare professionals in making informed decisions. By prioritizing
high-risk patients for further testing and intervention, the model can help reduce
the burden on healthcare systems, prevent the progression of heart disease, and
ultimately save lives. The focus is on creating a tool that is not only accurate but
also easy to use, ensuring it can be widely adopted in clinical practice.

OBJECTIVE

The primary objective of this project is to develop a machine learning-based


predictive model for heart disease. The specific objectives are as follows:

1. Data Collection and Preprocessing: Gather a comprehensive dataset


containing relevant health metrics and risk factors associated with heart
disease. This includes patient demographics, medical history, and
lifestyle factors. Preprocess the data to handle missing values, outliers,
and inconsistencies. Normalize or standardize the data and encode
categorical variables to prepare it for analysis.
2. Exploratory Data Analysis (EDA): Perform EDA to understand the
distribution of the data, identify patterns, and uncover relationships
between variables. Utilize visualization tools such as histograms, scatter
plots, and correlation matrices to gain insights into the data and identify
key features that influence heart disease risk.
5
3. Feature Selection: Identify and select the most relevant features that
significantly impact heart disease prediction. Apply techniques such as
correlation analysis, feature importance scores, and dimensionality
reduction methods (e.g., Principal Component Analysis) to reduce the
dimensionality of the data and improve model performance.
4. Model Development: Implement various machine learning algorithms,
including logistic regression, decision trees, random forests, support
vector machines, and neural networks. Train these models on the
preprocessed dataset and compare their performance. Utilize techniques
like cross-validation to ensure the robustness and generalizability of the
models.
5. Model Evaluation: Evaluate the performance of the developed models
using appropriate metrics such as accuracy, precision, recall, F1 score,
and ROC-AUC. Compare the models to determine the best-performing
one based on these metrics.
6. Model Optimization: Fine-tune the best-performing model to enhance its
predictive accuracy. Employ hyperparameter tuning techniques, such as
grid search or random search, to find the optimal set of parameters.
7. Deployment: Develop a user-friendly interface or application for the
predictive model. This interface should allow users to input health
metrics and receive predictions on heart disease risk. The goal is to create
a tool that can be easily integrated into clinical practice and used by
healthcare professionals to support decision-making.
8. Insights and Recommendations: Provide insights and recommendations
based on the model’s predictions. These insights can help healthcare
professionals in identifying high-risk patients, prioritizing further testing,
and making informed treatment decisions.

SOFTWARE SPECIFICATION

 JUPYTER NOTEBOOK
 PYTORCH / TENSORFLOW
 KERAS

METHODOLOGY

6
The methodology for this project involves a systematic approach to developing
a machine learning model for heart disease prediction. The following steps
outline the detailed methodology:

1. Data Collection: The first step involves gathering a comprehensive


dataset from reliable sources such as medical records, public health
databases, or clinical studies. The dataset should include variables
relevant to heart disease risk, such as age, gender, cholesterol levels,
blood pressure, blood sugar levels, smoking habits, physical activity, and
medical history. The quality and size of the dataset are critical for
building an accurate predictive model.
2. Data Preprocessing: Preprocessing the dataset is essential to ensure that
the data is clean and suitable for analysis. This step involves handling
missing values through imputation or deletion, addressing outliers, and
normalizing or standardizing numerical variables to ensure they are on a
comparable scale. Categorical variables are encoded using techniques like
one-hot encoding or label encoding. The processed data is then split into
training and testing sets to evaluate the model's performance.
3. Exploratory Data Analysis (EDA): EDA involves analyzing the dataset
to understand its distribution, uncover patterns, and identify relationships
between variables. Visualization tools like histograms, box plots, scatter
plots, and heatmaps are used to gain insights into the data. EDA helps in
identifying important features that influence heart disease risk and
informs the feature selection process.
4. Feature Selection: Feature selection involves identifying and selecting
the most relevant features that significantly impact heart disease
prediction. Techniques such as correlation analysis, feature importance
scores from tree-based models, and dimensionality reduction methods
(e.g., Principal Component Analysis) are applied to reduce the
dimensionality of the data. This step ensures that the model focuses on
the most informative features, improving its performance and
interpretability.
5. Model Development: Various machine learning algorithms are
implemented to build predictive models. These include logistic
regression, decision trees, random forests, support vector machines, and
neural networks. Each model is trained on the preprocessed dataset, and
hyperparameters are tuned to optimize performance. The models are
evaluated using cross-validation to ensure robustness and generalizability.

IMPLEMENTATION

7
import numpy as np
import pandas as pd
import matplob.pyplot as plt
import seaborn as sns

df = pd.read_csv(r”C:/Users\Hrithick\MRS\Heart_Disease Prediction.csv”)
df.describe().T
present = df[df[‘Heart Disease’]==1]
absent = df[df[‘Heart Disease’]==0]
present.shape, absent.shape
absent = absent.sample(present.sahpe[0])
absent.shape,present.shape
absent.head()
import statsmodel.api as sm
corrmat = df.corr()
fig = plt.figure(figsize = (10,9))
sns.heatmap(cormmat, vmax =.6, square = True )
plt.show()
sns.barplot(data = df, y =’Heart Disease”, x = ‘Sex’)

8
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandScaler PolynomialFeatures
x = np.array(df[‘Heart Disease’])
x = np.array(df.dro[(columns =’Heart Disease’))
y = np.array(df[Heart Disease])
scaler.fit(x)
x_scaled = scaler.transform(x)
x_train, x_test , y_train , y_test =train_test_split(x_scaled,y,train_size=0.8)
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
rfc.fit(x_train,y_train)
yPred = rfc.predict(x_test)

from sklear.metrics import classification_report, accuracy_score


from sklearn.metrics import presicion_score,recall_score
frpm sklearn.metrics import f1_score, mathhew_correcoef
from sklear.metrics import confusion_matrix

9
n_outliners = len(present)
n_errors = (yPred != y_test).sum()
print(“The model used is Random Forest classifier”0
acc = accuracy_score(y_test,yPred)
print(“The acuuracy is {}”.format(acc))
prec = precision_score(y_test , yPred)
print(“The precision is {}”.format(prec))
rec = recall_score(y_test,yPred)
f1 = f1_score(y_test,yPred)
print(“The F1- Score is {}”.format(f1))

MCC = Matthews_corrcoef(y_test,yPred)
print(“The Mathhews correlation coefficient is {}”.format(MCC))
from sklearn.linear_model import LogisticsRegression
logreg = LogisticRegression()
y_Pred2=logreg.predict(x_test)
from sklearn.metrics import accuracy_score
print(“Accuracy of the model is =’, accurary_score(y_test,y_pred2))

Accuracy of the model is = 0.9074074074074074

From sklearn.linear_model import LinearRegression


linereg = LinearRegression()
linereg.fit(x_train,y_train)
y_predlin = linreg.predict(x_test)
accuracy = accuracy_score(y_test,y_pred,y_pred_class)
accuracy = accuracy_score(y-test, y_pred_class)
print(‘Accuracy of the linear regression model is =’,accuracy)
10
Accuracy of the Linear Regression model is = 0.888888888888888888
from sklearn.svm.import SVC
from sklearn.metrics import accuracy_score
svm_model = SVC()
svm_model.fit(x_train,y_train)
y_pred_svm = sv_model.predict(x_test)
accuracy = accuracy_score(y_test,y_pred_svm)
print(‘Accuracy of the SVM model is =’, accuracy)

Accuracy of thr SVM model is = 0.8888888888888888

RESULT

The performance of both models was evaluated using test set . The
following metrics were considered
Lineear Regression:
-Accuracy: 0.88
-Precision: 0.68
-Recall: 0.65
-F1 Score: 0.66

Random Forest Classifier


-Accuracy: 0.90
-Precision: 0.82
-Recall : 0.80
-F1 Score: 0.81

11
The Random Forest Classifier significally outperformed Linear Regression
in all metrics , demonstrating its ability to capture complex patterns in the
data.

EXPLANATION

12
Heart disease remains a leading cause of mortality worldwide, necessitating
effective early detection methods to mitigate its impact. Traditional diagnostic
techniques, though effective, often involve invasive procedures, substantial
costs, and require specialized equipment and expertise. These limitations
underscore the need for non-invasive, cost-effective, and accurate predictive
models that can be easily integrated into routine healthcare practices. This
project aims to leverage machine learning (ML) to develop a predictive model
for heart disease, utilizing readily available patient data to identify individuals at
risk and enable timely intervention.

Machine learning, a subset of artificial intelligence, involves algorithms that can


learn from and make predictions based on data. These algorithms excel at
detecting complex patterns and relationships within large datasets, which might
be imperceptible to human analysts. In the context of heart disease, ML can
analyze various health metrics and risk factors—such as age, gender, cholesterol
levels, blood pressure, blood sugar levels, smoking habits, and physical activity
—to predict the likelihood of heart disease.

The project begins with the collection and preprocessing of data. A


comprehensive dataset is essential, as the accuracy of the predictive model
depends heavily on the quality and diversity of the input data. Data is typically
sourced from medical records, public health databases, or clinical studies,
ensuring it covers a wide range of variables relevant to heart disease.
Preprocessing involves handling missing values, outliers, and inconsistencies
within the dataset. Techniques such as imputation are used to fill in missing
data, while outliers are addressed to prevent them from skewing the model's
predictions. Normalizing or standardizing numerical data ensures that all
variables are on a comparable scale, while categorical variables are encoded to
transform them into a format suitable for ML algorithms.

13
Exploratory Data Analysis (EDA) follows, providing a deeper understanding of
the dataset. EDA employs statistical tools and visualization techniques to
uncover patterns, trends, and relationships within the data. For instance,
histograms can show the distribution of age or cholesterol levels among
patients, while scatter plots can reveal correlations between blood pressure and
heart disease occurrence. Heatmaps can highlight the strength of relationships
between multiple variables. This step is crucial for identifying which features
are most relevant for predicting heart disease, guiding the feature selection
process.

Feature selection is a critical step where the most informative variables are
chosen for model development. Not all collected features may contribute
significantly to the prediction, and including irrelevant or redundant features
can reduce model performance. Techniques such as correlation analysis, feature
importance scores from tree-based models, and dimensionality reduction
methods like Principal Component Analysis (PCA) help in identifying and
retaining only the most relevant features. This step not only improves the
model's accuracy but also its interpretability, making it easier for healthcare
professionals to understand and trust the predictions.

With the relevant features selected, the next phase involves developing the
predictive model. Various machine learning algorithms are explored, each
offering different strengths and weaknesses. Algorithms like logistic regression,
decision trees, random forests, support vector machines (SVM), and neural
networks are commonly used in predictive modeling. Logistic regression, for
example, is well-suited for binary classification tasks like predicting the
presence or absence of heart disease, while decision trees and random forests
can handle complex, non-linear relationships between features. Neural
networks, particularly deep learning models, can capture intricate patterns in
large datasets but require substantial computational resources.

Each model is trained on the preprocessed dataset, and its performance is


evaluated using cross-validation techniques to ensure robustness and
generalizability. Cross-validation involves partitioning the data into multiple
subsets, training the model on some subsets while validating it on others, and
repeating this process multiple times. This technique helps in assessing how
well the model performs on unseen data, reducing the risk of overfitting (where
the model performs well on training data but poorly on new data).
14
The models' performance is assessed using metrics such as accuracy, precision,
recall, F1 score, and the area under the receiver operating characteristic curve
(ROC-AUC). Accuracy measures the proportion of correct predictions, while
precision and recall provide insights into the model's ability to correctly identify
positive cases of heart disease. The F1 score balances precision and recall, and
ROC-AUC indicates the model's ability to discriminate between positive and
negative cases across different threshold settings. These metrics help in
comparing different models and selecting the best-performing one.

Once the best model is identified, it undergoes further optimization to enhance


its predictive accuracy. Hyperparameter tuning involves adjusting the model's
parameters, which are not learned from the data but set before training begins.
Techniques such as grid search and random search systematically explore
different combinations of hyperparameters to find the optimal settings. This
fine-tuning ensures the model performs at its best.

Deployment of the model involves creating a user-friendly interface or


application that allows healthcare professionals to input patient data and receive
predictions on heart disease risk. Web frameworks like Flask or Django are
used to develop this interface, ensuring it is accessible and easy to use. The
deployed model can be integrated into clinical practice, providing a valuable
tool for early diagnosis and intervention.

Insights and recommendations based on the model’s predictions can guide


healthcare professionals in identifying high-risk patients, prioritizing further
testing, and making informed treatment decisions. These insights can also
inform preventive measures, lifestyle modifications, and personalized treatment
plans, ultimately improving patient outcomes and reducing the burden on
healthcare systems.

15
In conclusion, this project aims to harness the power of machine learning to
develop a predictive model for heart disease. By analyzing a comprehensive set
of health metrics and risk factors, the model can accurately predict heart disease
risk, offering a non-invasive, cost-effective, and reliable alternative to
traditional diagnostic methods. The systematic approach—from data collection
and preprocessing to model development, evaluation, optimization, and
deployment—ensures the creation of a robust and practical tool for early
detection and management of heart disease. This project not only enhances
predictive accuracy but also supports clinical decision-making, improving
patient outcomes and contributing to more effective healthcare delivery.

5. Methodology

The methodology for developing a heart disease prediction model using


machine learning involves several structured steps to ensure the creation of a
robust, accurate, and practical tool for early diagnosis and risk stratification.
This comprehensive process can be divided into multiple stages: data collection
and preprocessing, exploratory data analysis (EDA), feature selection, model
development, model evaluation, model optimization, and deployment. Each of
these stages is crucial for the success of the project.

1. Data Collection and Preprocessing


The first step in the methodology involves gathering a comprehensive dataset
that includes relevant health metrics and risk factors associated with heart
disease. The quality and diversity of the data are critical for building an accurate
predictive model. Typically, data is sourced from medical records, public health
databases, clinical studies, or datasets like the UCI Heart Disease dataset. The
dataset should include variables such as age, gender, cholesterol levels, blood
pressure, blood sugar levels, smoking habits, physical activity, and medical
history.

Once the data is collected, preprocessing is essential to clean and prepare it for
analysis. This step involves:

- **Handling Missing Values**: Missing data can be imputed using statistical


methods (mean, median) or more sophisticated techniques like k-nearest
neighbors (KNN) imputation.
16
- **Outlier Detection and Treatment**: Outliers can distort the analysis and
model performance. Techniques like the Z-score method or IQR (Interquartile
Range) can be used to identify and handle outliers.
- **Normalization/Standardization**: Numerical data is normalized or
standardized to ensure that all variables are on a comparable scale, which is
particularly important for algorithms that rely on distance measures.
- **Encoding Categorical Variables**: Categorical data is converted into
numerical format using methods like one-hot encoding or label encoding to
make it suitable for machine learning algorithms.

2. Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is performed to gain insights into the data and
understand its underlying structure. EDA helps identify patterns, relationships,
and anomalies within the dataset, guiding subsequent steps in the methodology.
Key activities in EDA include:

- **Descriptive Statistics**: Calculating measures of central tendency (mean,


median) and dispersion (standard deviation, variance) for numerical features.
- **Visualization**: Using plots such as histograms, scatter plots, box plots,
and heatmaps to visualize data distributions and relationships between variables.
For example, scatter plots can show the relationship between age and
cholesterol levels, while heatmaps can indicate correlations between multiple
features.
- **Correlation Analysis**: Assessing the strength and direction of
relationships between features using correlation coefficients. This helps identify
which variables are strongly related to heart disease and should be prioritized in
feature selection.

3. Feature Selection
Feature selection is the process of identifying the most relevant variables for
predicting heart disease. Including irrelevant or redundant features can reduce
model performance and increase computational complexity. Techniques used
for feature selection include:
17
- **Correlation Analysis**: Selecting features that show a strong correlation
with the target variable (heart disease presence).
- **Feature Importance Scores**: Using algorithms like random forests to rank
features based on their importance in predicting the target variable.
- **Dimensionality Reduction**: Applying methods like Principal Component
Analysis (PCA) to reduce the number of features while retaining most of the
variance in the data. PCA transforms the original features into a new set of
uncorrelated variables (principal components) that capture the most significant
patterns in the data.

4. Model Development
Once the relevant features are selected, various machine learning algorithms are
implemented to build predictive models. Common algorithms used in this
context include:

- **Logistic Regression**: Suitable for binary classification tasks like


predicting the presence or absence of heart disease. It models the probability of
the target variable using a logistic function.
- **Decision Trees**: These models split the data into subsets based on feature
values, creating a tree-like structure. Decision trees are easy to interpret but
prone to overfitting.
- **Random Forests**: An ensemble method that builds multiple decision trees
and combines their predictions. Random forests improve accuracy and reduce
overfitting compared to individual decision trees.
- **Support Vector Machines (SVM)**: SVMs find the optimal hyperplane that
separates data points of different classes with the maximum margin. They are
effective for high-dimensional data but can be computationally intensive.
- **Neural Networks**: Particularly deep learning models, which are capable
of capturing complex patterns in large datasets. Neural networks consist of
layers of interconnected nodes (neurons) that learn to map input features to the
target variable.

Each model is trained on the preprocessed dataset, and its performance is


evaluated using cross-validation techniques to ensure robustness and
18
generalizability. Cross-validation involves partitioning the data into multiple
subsets, training the model on some subsets while validating it on others, and
repeating this process multiple times. This approach helps in assessing how well
the model performs on unseen data.

5. Model Evaluation
Model evaluation is critical to determine the accuracy and reliability of the
developed models. Various performance metrics are used, including:

- **Accuracy**: The proportion of correct predictions among the total number


of predictions.
- **Precision**: The proportion of true positive predictions among all positive
predictions, indicating how many predicted positive cases are actually positive.
- **Recall (Sensitivity)**: The proportion of true positive predictions among all
actual positive cases, indicating how many actual positive cases are correctly
identified.
- **F1 Score**: The harmonic mean of precision and recall, providing a
balanced measure of the model's performance.
- **ROC-AUC (Receiver Operating Characteristic - Area Under Curve)**: A
metric that evaluates the model's ability to discriminate between positive and
negative cases across different threshold settings.

These metrics help in comparing different models and selecting the best-
performing one.

6. Model Optimization
The best-performing model is further optimized to enhance its predictive
accuracy. Hyperparameter tuning involves adjusting the model's parameters,
which are not learned from the data but set before training begins. Techniques
such as grid search and random search systematically explore different
combinations of hyperparameters to find the optimal settings. This step ensures
the model performs at its best and generalizes well to new data.

7. Deployment

19
Deployment involves creating a user-friendly interface or application that
allows healthcare professionals to input patient data and receive predictions on
heart disease risk. Web frameworks like Flask or Django are used to develop
this interface, ensuring it is accessible and easy to use. The deployed model can
be integrated into clinical practice, providing a valuable tool for early diagnosis
and intervention.

The final application should be tested thoroughly to ensure it works as expected


and provides accurate predictions. It should also include documentation and
user guides to help healthcare professionals understand how to use the tool
effectively.

CONCLUSION

The methodology outlined above ensures a systematic and comprehensive


approach to developing a heart disease prediction model using machine
learning. Each step, from data collection and preprocessing to model
deployment, is crucial for creating a robust, accurate, and practical tool for early
diagnosis and risk stratification. By leveraging machine learning, this project
aims to provide a non-invasive, cost-effective, and reliable alternative to
traditional diagnostic methods, ultimately improving patient outcomes and
contributing to more effective healthcare delivery

This project successfully demonstrated the application of machine learning


algorithms in the detection of heart disease. The Random Forest Classifier
proved to be a robust model, offering high accuracy and reliability. The project
highlights the importance of selecting appropriate algorithms and tuning
hyperparameters to achieve optimal performance. Future work could involve
exploring other advanced models like Gradient Boosting or Neural Networks,
and incorporating larger, more diverse datasets to further improve prediction
accuracy.

20
FUTURE SCOPE

 Further improving model accuracy and robustness


 Integrating diverse data sources for richer insights.
 Real -time monitoring and early intervation capabilities
 Integrating into clinical decision support system
 Population-level impact through preventive measures.

21
1

You might also like