Cardiovascular Health Assessment and Risk Prediction Model Project

Uploaded by

notjuicy355
SLIDEBAZAAR | [Link] RIGHTS


Introduction

Title: Predicting 10-Year Risk of Coronary Heart Disease (CHD)

Heart disease, a leading cause of death globally, is often rooted in a complex interplay
of lifestyle choices and underlying medical conditions. Early detection is crucial, as it
allows for preventative measures to be taken and potentially life-saving interventions.
This project addresses this critical need by developing a model to predict a patient's
10-year risk of Coronary Heart Disease (CHD). By leveraging machine learning, we aim
to create a more objective and efficient approach to CHD risk assessment,
empowering healthcare professionals to provide better patient care.


The Business Concern
Heart diseases, particularly Coronary Heart Disease (CHD), are a significant health
concern globally, affecting millions of lives each year. Lifestyle factors, medical history,
and demographics play crucial roles in determining an individual's risk of developing
CHD. Identifying individuals at high risk of CHD can enable early intervention and
preventive measures, potentially reducing morbidity and mortality associated with the
disease.

Impact: Early detection of CHD risk allows for timely intervention and preventative
strategies, potentially saving lives and reducing healthcare costs.


Objective
The objective of this project is to develop a predictive model that accurately estimates
the 10-year risk of Coronary Heart Disease (CHD) for individuals based on their
demographic, behavioral, and medical risk factors, leveraging data from an ongoing
cardiovascular study in Framingham, Massachusetts.
This project aims to:
• Develop a machine learning model that predicts the 10-year risk of CHD for a given
patient.
• Utilize a dataset of over 3390 patient records with 16 attributes encompassing
demographics, behaviors, and medical history.
• Identify the most significant risk factors contributing to CHD.
This model will empower healthcare professionals with a data-driven tool to improve
CHD risk assessment and patient care.


Data Description
Demographic:
Sex: male or female ("M" or "F")
Age: age of the patient (Continuous: although the recorded ages have been truncated to whole
numbers, the concept of age is continuous)

Education:
Education: 1 - Higher Secondary, 2 - Graduate, 3 - Post Graduate, 4 - Doctorate or PhD

Behavioral:
is_smoking: whether or not the patient is a current smoker ("YES" or "NO")
Cigs Per Day: the number of cigarettes that the person smoked on average in one day (can be
considered continuous, as one can have any number of cigarettes, even half a cigarette)

Medical (history):
BP Meds: whether or not the patient was on blood pressure medication (Nominal)
Prevalent Stroke: whether or not the patient had previously had a stroke (Nominal)
Prevalent Hyp: whether or not the patient was hypertensive (Nominal)
Diabetes: whether or not the patient had diabetes (Nominal)

Medical (current):
Tot Chol: total cholesterol level (Continuous)
Sys BP: systolic blood pressure (Continuous)
Dia BP: diastolic blood pressure (Continuous)
BMI: Body Mass Index (Continuous)
Heart Rate: heart rate (Continuous: in medical research, variables such as heart rate, though discrete,
are considered continuous because of the large number of possible values)
Glucose: glucose level (Continuous)

Predicted variable (desired target):
TenYearCHD: 10-year risk of coronary heart disease (CHD) (binary: "1" means "Yes", "0" means "No");
this is the dependent variable (DV)
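
The two text-coded attributes ("M"/"F" and "YES"/"NO") have to be turned into numbers before most models can use them. The sketch below shows one way this could look. The record, the dictionary keys other than is_smoking, and the 0/1 coding are assumptions made for illustration, not the project's actual encoding step.

```python
# Hypothetical raw record: the keys loosely mirror the attribute list above
# and the values are made up for illustration.
def encode_record(record):
    """Return a copy of `record` with the two text-coded fields mapped to 0/1."""
    sex_codes = {"M": 1, "F": 0}          # assumed coding
    smoking_codes = {"YES": 1, "NO": 0}   # assumed coding
    encoded = dict(record)                # leave the input untouched
    encoded["sex"] = sex_codes[record["sex"]]
    encoded["is_smoking"] = smoking_codes[record["is_smoking"]]
    return encoded


patient = {"sex": "F", "is_smoking": "YES", "age": 52, "cigs_per_day": 10.0}
encoded_patient = encode_record(patient)
print(encoded_patient)
```

An unexpected code such as "Yes" raises a KeyError here, which is usually preferable to silently producing a wrong value.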


Exploratory Data Analysis
Distribution of TenYearCHD (Target Column)

Coronary Heart Disease (CHD) is a progressive condition where plaque buildup narrows
coronary arteries, reducing blood flow to the heart. Predicting a patient's ten-year risk of
CHD opens a crucial window of opportunity for preventive measures.

Looking at the distribution, the data shows a ratio of 15% to 85% for "yes" and "no"
responses, respectively.
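
With a split this uneven, a useful reference point is the accuracy of a rule that always predicts the majority class. The sketch below computes that baseline on a made-up label list that only reproduces the 15% to 85% ratio. It is not the project's data.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Return the most common label and the accuracy of always predicting it."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)


# Illustrative labels matching the 15% / 85% ratio reported above.
labels = [1] * 15 + [0] * 85
baseline_label, baseline_accuracy = majority_baseline_accuracy(labels)
print(baseline_label, baseline_accuracy)
```

Any accuracy score for this target is easier to interpret next to this baseline, because always predicting "no" already scores about 0.85.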


Exploratory Data Analysis
Observations:
• There is a positive correlation between age and glucose levels. This means that as age
increases, glucose levels also tend to increase.
• The data points are spread out, indicating a varied range of glucose levels for each age
group.
• Some younger individuals have high glucose levels, while some older individuals have
lower glucose levels. This suggests that factors other than age may also influence
glucose levels.
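
A correlation like the one described above can be quantified with the Pearson coefficient. The following is a minimal sketch using only the standard library. The age and glucose values are invented for illustration and are not taken from the dataset.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)


# Made-up values: glucose rises with age overall, but with a lot of scatter.
ages = [35, 42, 48, 55, 61, 67]
glucose = [78, 95, 74, 88, 110, 84]
r = pearson_r(ages, glucose)
print(round(r, 3))
```

A value near +1 would mean a tight upward trend. A small positive value matches the pattern of a spread-out cloud with a slight upward tilt.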
Exploratory Data Analysis
Blood Pressure: People with higher levels of education tend to have lower systolic and
diastolic blood pressure.
Body Mass Index (BMI): There seems to be a weak inverse relationship between education
and BMI. People with higher levels of education tend to have a lower BMI.
Heart Rate: The data for heart rate is inconclusive. There seems to be no clear association
between education and heart rate.
Smoking: There seems to be an inverse relationship between education and the number of
cigarettes smoked per day. People with higher levels of education tend to smoke less.
Cholesterol: There seems to be a weak inverse relationship between education and total
cholesterol. People with higher…
Exploratory Data Analysis
• Men tend to be older.
• Men smoke more cigarettes per day.
• Men have higher total cholesterol,
systolic and diastolic blood pressure,
and BMI.
• Women tend to have higher heart
rates and blood glucose levels.


Exploratory Data Analysis
There is no clear association between BP meds and age, BMI, heart rate, or blood sugar.
On the other hand, the data suggests that people who use blood pressure medication
tend to have lower systolic and diastolic blood pressure. This is likely because blood
pressure medication is designed to lower blood pressure. Additionally, people who use
blood pressure medication tend to smoke fewer cigarettes per day. This could be
because smoking can raise blood pressure.


Exploratory Data Analysis
• Age, cigarettes per day, systolic blood pressure, and diastolic blood pressure all show a
positive correlation with prevalent stroke. This means that as these factors increase, the
number of prevalent stroke cases also increases. These factors are all known risk factors
for stroke.
• The data for total cholesterol, BMI, heart rate, and blood glucose is inconclusive. There
isn't a clear trend between these factors and prevalent stroke cases.
Exploratory Data Analysis
• Age: The prevalence of hypertension increases with age. This is likely due to a number
of factors, including the degeneration of the arteries and kidneys over time, as well as
changes in hormone levels.
• Cigarettes per day: Smoking is a major risk factor for hypertension. Smoking damages
the blood vessels and makes them narrower, which can increase blood pressure.
• Systolic blood pressure (SysBP): This is the top number in a blood pressure reading. It
represents the pressure against the artery walls when the heart beats. The higher the
systolic blood pressure, the greater the risk of hypertension.
• Diastolic blood pressure (DiaBP): This is the bottom number in a blood pressure
reading. It represents the pressure against the artery walls when the heart is at rest. The
higher the diastolic blood pressure, the greater the risk of hypertension.
Exploratory Data Analysis
• Body Mass Index (BMI): People with diabetes tend to have a higher BMI than those
who do not have diabetes. BMI is a measure of a person's weight relative to their
height. Being overweight or obese can increase your risk of type 2 diabetes.
• Smoking: Smoking is a major risk factor
for many health problems, including
heart disease, stroke, and lung cancer.
It may also increase your risk of type 2
diabetes.
• Cholesterol: High cholesterol is a risk
factor for heart disease, and people
with diabetes are also at increased risk
of heart disease.


Exploratory Data Analysis
• There is a weak positive correlation
between age and diabetes. This
means that as age increases, the
likelihood of diabetes also increases.
This is consistent with what we
already know about diabetes.
• People with diabetes tend to have a
higher body mass index (BMI) than
those who do not. This is likely
because obesity is a major risk
factor for type 2 diabetes.
• The data for smoking, total
cholesterol, systolic blood pressure
(SysBP), diastolic blood pressure
(DiaBP), heart rate, and blood sugar
is inconclusive. There isn't a clear
trend between these factors and
diabetes in this dataset.
Exploratory Data Analysis
While this analysis shows that smokers tend to be older and have higher cholesterol
levels compared to non-smokers, the impact of smoking on other health markers
remains unclear. Data on blood pressure, Body Mass Index (BMI), heart rate, and blood
glucose levels has not yielded conclusive evidence of a direct link to smoking. This
suggests that other factors may play a more significant role in influencing these health
measures for smokers and non-smokers alike. Further research is needed to understand
the complex interplay between smoking and these other health indicators.


Machine Learning Approach
Introduction to Machine Learning Methodology for Predictive Modeling
In modern data-driven research and applications, machine learning (ML) methodologies play a pivotal role in extracting insights, making predictions, and
solving complex problems across diverse domains. A well-defined ML approach encompasses several key stages, each tailored to address specific aspects of
the data analysis and modeling process.

01
The first step in any ML methodology involves data preprocessing, where raw data is transformed, cleaned, and prepared for analysis. This stage typically
includes tasks such as data cleaning to handle missing values and outliers, feature engineering to extract relevant information, and data normalization or
scaling to ensure uniformity across features.

02
Feature selection aims to identify and retain the most relevant features that contribute to the predictive power of the model while eliminating redundant or
irrelevant features. Techniques such as univariate feature selection, recursive feature elimination, and feature importance ranking based on tree-based
models are commonly employed in this stage.

03
Model selection involves choosing an appropriate ML algorithm or ensemble of algorithms based on the nature of the problem, the size and complexity of
the dataset, and performance requirements. Common ML algorithms include linear regression, decision trees, support vector machines, random forests,
and neural networks. Models are trained using labeled data.

04
Hyperparameter tuning involves optimizing the parameters of the selected ML algorithm(s) to improve model performance. Techniques such as grid
search, random search, and Bayesian optimization are utilized to systematically explore the hyperparameter space and identify the optimal…

05
Model evaluation is a critical step in assessing the performance and generalization capabilities of the trained models. Various evaluation metrics such as
accuracy, precision, recall, F1-score, and area under the receiver operating characteristic curve (AUC-ROC) are used to quantify model performance on
unseen data.

06
Once a satisfactory model is identified, it is deployed into production environments where it can make predictions on new, unseen data. Continuous
monitoring of model performance and feedback loops for model refinement are essential to ensure that the deployed model remains accurate and
reliable over time.
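
The evaluation metrics named in the model evaluation stage can all be derived from the four cells of a confusion matrix. The sketch below shows the arithmetic on a small made-up set of true and predicted labels. The function name and the data are illustrative assumptions, not the project's evaluation code.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels: 3 positives out of 10, only one of which is detected.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this example, accuracy is 0.7 while recall is only one third. That gap is why precision, recall, and F1 are worth reporting next to accuracy when the positive class is rare.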
Machine Learning Approach
[Figure: workflow diagram. Stage labels: RAW Dataset, Descriptive Analysis, Data Cleaning, Exploratory Data Analysis, Encoding, Feature Selection,
Model Selection, Training & Model, Parameter Tuning, Final Evaluation, Deployment.]

Models considered:
1. Logistic Regression
2. Decision Tree
3. Random Forest
4. Support Vector Machine


Model Evaluation
Logistic Regression
Logistic regression is a fundamental statistical technique used for binary classification
tasks, where the outcome variable takes on one of two possible values. Unlike linear
regression, which predicts continuous outcomes, logistic regression models the
probability of the binary outcome using a logistic function. The model is linear in
the log odds of the outcome, and the logistic function maps that linear combination to a
predicted probability between 0 and 1. One of the key strengths of logistic
regression lies in its interpretability, as it allows for the estimation of the effect of
each predictor variable on the probability of the outcome. Additionally, logistic
regression is relatively simple and computationally efficient, making it well-suited for
scenarios with a moderate number of predictors and a large number of observations.
Despite its simplicity, logistic regression can be powerful when applied appropriately,
providing valuable insights into the relationship between predictors and binary
outcomes in various fields such as healthcare, finance, and social sciences.

Training Accuracy Score 0.8558259587020649

Test Accuracy Score 0.8525073746312685
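
The relationship between the linear score, the log odds, and the predicted probability can be sketched in a few lines. The weights, bias, and feature values below are hypothetical numbers chosen for illustration. They are not coefficients fitted in this project.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def linear_score(features, weights, bias):
    """Weighted sum of the features; this is the log odds of the outcome."""
    return bias + sum(w * x for w, x in zip(weights, features))


def log_odds(p):
    """Inverse of the logistic function."""
    return math.log(p / (1.0 - p))


# Hypothetical, unfitted values for two standardized features.
weights = [0.8, 0.3]
bias = -1.7
features = [1.2, -0.5]

score = linear_score(features, weights, bias)
probability = sigmoid(score)
print(round(score, 3), round(probability, 3))
```

Applying log_odds to the predicted probability recovers the linear score. Because of this, each coefficient can be read as the change in log odds per unit change in its predictor.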


Model Evaluation
Decision Tree
A decision tree is a versatile and intuitive machine learning algorithm
used for both classification and regression tasks. It operates by
recursively partitioning the input space into smaller regions based on the
features that best separate the data. At each node of the tree, a decision
is made based on a feature's value, effectively splitting the dataset into
subsets that are more homogeneous with respect to the target variable.
This process continues until a stopping criterion is met, such as reaching
a maximum tree depth or when further splits fail to improve the model's
performance. Decision trees are highly interpretable, as they represent a
sequence of simple if-else decision rules. Additionally, they can handle
both numerical and categorical data, making them suitable for a wide
range of applications. However, decision trees are prone to overfitting,
especially when they are deep or when the dataset is noisy. To mitigate
this issue, techniques like pruning and ensemble methods such as
Random Forests are often employed. Overall, decision trees are valuable
tools in the machine learning toolkit due to their simplicity,
interpretability, and effectiveness in capturing complex relationships
within the data.

Test Accuracy after Hyperparameter Tuning: 0.8053097345132744

Train Accuracy after Hyperparameter Tuning: 0.9122418879056047
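
To make the splitting step concrete, the sketch below searches a single numeric feature for the threshold that gives the purest subsets. It uses Gini impurity as the splitting criterion, which is one common choice. The source does not state which criterion was used, and the age and outcome values are made up.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2


def best_threshold(values, labels):
    """Try each split 'value <= t' and return (t, weighted Gini impurity)."""
    best_t, best_impurity = None, float("inf")
    # The largest value is skipped because it would leave the right side empty.
    for t in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best_impurity:
            best_t, best_impurity = t, weighted
    return best_t, best_impurity


# Made-up, perfectly separable example.
ages = [38, 44, 49, 53, 58, 63]
chd = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_threshold(ages, chd)
print(threshold, impurity)
```

A full tree repeats this search over every feature at every node. Without a depth limit or pruning, that repetition is what lets a tree fit noise in the training data.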


Model Evaluation
Random Forest
Random Forest is a powerful and versatile machine learning
algorithm widely used for both classification and regression tasks.
It operates by constructing a multitude of decision trees during the
training phase and outputs the mode of the classes (classification)
or the mean prediction (regression) of the individual trees. The
strength of Random Forest lies in its ability to mitigate overfitting,
handle high-dimensional data, and deal with noisy and correlated
features. By aggregating the predictions from multiple trees,
Random Forest tends to generalize well to unseen data, making it
robust and reliable for a wide range of applications. Additionally,
Random Forest provides built-in feature importance scores,
allowing practitioners to gain insights into the relative significance
of different features in the dataset. Its simplicity, scalability, and
excellent performance make Random Forest a popular choice
among data scientists and researchers for tackling various real-
world problems in fields such as finance, healthcare, and
environmental science.

Test Accuracy after Hyperparameter Tuning: 0.8466076696165191

Train Accuracy after Hyperparameter Tuning: 0.8757374631268436
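
The aggregation step for classification, where the forest returns the mode of the individual tree predictions, can be illustrated as follows. The three "trees" are hand-written one-rule stand-ins with hypothetical thresholds and feature names. A real random forest would learn each tree from a bootstrap sample of the training data with a random subset of features.

```python
from collections import Counter


def forest_predict(trees, sample):
    """Return the majority vote (mode) of the trees' class predictions."""
    votes = [tree(sample) for tree in trees]
    return Counter(votes).most_common(1)[0][0]


# Hand-written stand-ins for trained trees; thresholds are illustrative only.
trees = [
    lambda s: 1 if s["age"] > 55 else 0,
    lambda s: 1 if s["sys_bp"] > 140 else 0,
    lambda s: 1 if s["cigs_per_day"] > 20 else 0,
]

sample = {"age": 60, "sys_bp": 150, "cigs_per_day": 5}
prediction = forest_predict(trees, sample)
print(prediction)
```

Two of the three rules vote for class 1, so the ensemble predicts 1 even though one rule disagrees. Averaging over many such voters is what dampens the overfitting of any single tree.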


Model Evaluation
Support Vector Machines
Support Vector Machine (SVM) is a powerful and versatile
supervised learning algorithm commonly used for classification and
regression tasks. Its fundamental principle revolves around finding
the optimal hyperplane that best separates classes in a high-
dimensional feature space. SVM aims to maximize the margin, i.e.,
the distance between the hyperplane and the nearest data points
(support vectors), thereby enhancing the model's robustness and
generalization ability. One of the key strengths of SVM lies in its
ability to handle both linear and non-linear classification problems
through the use of different kernel functions, such as linear,
polynomial, radial basis function (RBF), and sigmoid kernels. SVM is
particularly well-suited for datasets with complex decision
boundaries and is known for its effectiveness in handling high-
dimensional data. Moreover, SVM exhibits resilience to overfitting,
thanks to its regularization parameter, which helps control the
trade-off between maximizing the margin and minimizing
classification errors. Overall, SVM stands as a versatile and efficient
algorithm that continues to find applications across various
domains, including image classification, text categorization,
bioinformatics, and financial forecasting.

Training Accuracy: 0.8595132743362832

Test Accuracy: 0.8466076696165191
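
For the linear case, the margin and the classification-error term mentioned above reduce to short formulas: the margin width is 2 / ||w||, and the per-sample hinge loss is max(0, 1 - y(w·x + b)) with labels coded as -1 and +1. The sketch below evaluates both for a hypothetical hyperplane. The weights and points are illustrative, not a trained model.

```python
import math


def decision_value(w, b, x):
    """Signed score w.x + b; its sign gives the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def hinge_loss(w, b, x, y):
    """Zero for points on or beyond the correct margin (y is -1 or +1)."""
    return max(0.0, 1.0 - y * decision_value(w, b, x))


def margin_width(w):
    """Distance between the two margin boundaries, 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


# Hypothetical hyperplane and points, chosen for easy arithmetic.
w, b = [3.0, 4.0], -5.0
outside_margin = hinge_loss(w, b, [1.0, 1.0], +1)   # score 2.0, no penalty
on_boundary = hinge_loss(w, b, [1.0, 0.5], +1)      # score 0.0, penalised
print(margin_width(w), outside_margin, on_boundary)
```

The regularization parameter then weights the summed hinge losses against the margin term, which is the trade-off the paragraph above describes.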


Conclusion
Key Findings and Outcomes:

* This study has confirmed a clear association between smoking and two key health markers:
age and cholesterol. Smokers tend to be older and have higher cholesterol levels compared to
non-smokers.
* The impact of smoking on other health indicators, including blood pressure, BMI, heart rate,
and blood glucose, remains inconclusive based on the data analyzed.

Future Work Potential:

* Further investigation is needed to explore the inconclusive relationships between smoking
and blood pressure, BMI, heart rate, and blood glucose. This could involve:
  * Analyzing a larger and more diverse sample population.
  * Accounting for potential confounding factors like diet, physical activity, and
  socioeconomic status.
  * Conducting longitudinal studies to track changes in health markers over time in relation to
  smoking habits.
* Research could also delve deeper into the mechanisms by which smoking might affect
specific health outcomes.
* The findings on age and cholesterol can be used to develop targeted public health
interventions aimed at encouraging smoking cessation, particularly among older adults.

By continuing this research, we can gain a more comprehensive understanding of the full
spectrum of health consequences associated with smoking and develop more effective
strategies to promote healthy…

Thank You
