Report HFP
A PROJECT REPORT
Submitted by
BACHELOR OF ENGINEERING
in
BONAFIDE CERTIFICATE
SIGNATURE SIGNATURE
Dr. S PAVITHRA M.E., Ph.D., Dr. P KARTHIKEYAN M.E., Ph.D.,
ASSOCIATE PROFESSOR ASSOCIATE PROFESSOR
HEAD OF THE DEPARTMENT SUPERVISOR
Computer Science and Engineering Computer Science and Engineering
Chennai Institute of Technology Chennai Institute of Technology
Kundrathur, Chennai – 600069 Kundrathur, Chennai – 600069
We thank all our department teaching and non-teaching staff, who have
helped directly and indirectly to complete this project on time. Last but not least,
we extend our deep gratitude to our beloved family members for their
encouragement and their moral and financial support in carrying out this project.
Finally, we express our heartfelt and deep sense of gratitude to all faculty
members in our division and to our friends for their helping hands, valuable
support and encouragement during the project work.
TABLE OF CONTENTS
ABSTRACT
LIST OF FIGURES
MOTIVATION
1 INTRODUCTION
2 LITERATURE SURVEY
2.1 INTRODUCTION
3 SYSTEM DESIGN
3.1 INTRODUCTION
LIST OF FIGURES
3 Architecture Diagram
4 Linear Regression
5 Decision Tree
7 K Nearest Neighbour
8 Random Forest
9 Naïve Bayesian
10 K Means Clustering
12 Jupyter Notebook
14 Model Evaluation
MOTIVATION
According to the World Health Organization, the number of people who
die from heart disease and stroke is the highest compared with any other cause
of death.
In India there are about 7.23 deaths per 1,000 inhabitants. This is mainly
due to medical conditions; some deaths are caused by accidents and the rest by
crime. The major part of these deaths is due to disease.
CHAPTER 1
INTRODUCTION
1.2 SCOPE AND OBJECTIVE
Scope:
According to a report by the Global Burden of Disease study in 2016, about
1.7 million Indians die of heart disease out of 17.3 million such deaths worldwide.
The major part of these deaths is due to disease. We are trying to integrate
technology into day-to-day life, i.e., to predict whether a person might get the
disease or not using machine learning algorithms.
OBJECTIVE:
The purpose of this project is to estimate how likely a person is to
suffer heart failure, so that he or she has the chance to seek early treatment and
save their life. Here we have taken up heart failure detection: predicting whether
a person has a chance of heart failure or not.
CHAPTER 2
LITERATURE SURVEY
2.1 Introduction
A literature survey is one of the most important steps in the software
development process. Before developing a tool it is necessary to assess and
survey the time factor, resource requirements, manpower, economy and company
strength. Once these things are satisfied and fully surveyed, the next step is to
determine the software specifications for the system: which operating system the
project requires, which programming language will be used, and what other
software is needed for developing the tool and its associated operations. While
building the tool, programmers also need a lot of external support, which can be
obtained from senior programmers, from books or from websites. All of the above
considerations are taken into account before developing the proposed system.
2.2 LITERATURE REVIEW
1 - Mohammed Abdul Khaleel presented a paper surveying techniques for mining
medical data to find locally frequent diseases. The paper analyses the data mining
procedures required for medical data mining, in particular to find locally frequent
illnesses such as heart ailments, lung cancer and breast cancer, and describes data
mining as the process of extracting information to discover hidden patterns.
2 - Vembandasamy et al. performed a work to analyse and detect heart disease.
The algorithm used was the Naive Bayes algorithm, which applies Bayes'
theorem and therefore makes a strong independence assumption between
features. The dataset was obtained from a leading diabetes research institute in
Chennai, Tamil Nadu, and contains more than 500 patients. The tool used is
Weka, and classification was executed using a 70% percentage split. The
accuracy offered by Naive Bayes is 86.419%.
3 - L. Sathish Kumar and A. Padma Priya presented a paper on predicting disease
similarities using the ID3 algorithm on television and mobile phone platforms.
The paper gives an automatic approach to recognising hidden patterns of
coronary illness. The framework uses data mining methods such as the ID3
algorithm. The proposed method helps people not only to know about the disease
but also to reduce the death rate and the number of disease-affected people.
4 - Santayana Krishnan J and Dr. Geetha S from the MIT campus, Anna
University, presented a paper named Prediction of Heart Diseases using Machine
Learning Algorithms. In this paper the dataset is used to develop two machine
learning algorithms, their accuracies are calculated, and the more accurate
algorithm is identified.
CHAPTER 3
SYSTEM DESIGN
3.1 INTRODUCTION
3.2 EXISTING SYSTEM
The dataset consists of clinical parameters such as high blood pressure, age,
etc.
3.4 SYSTEM ARCHITECTURE
3.5 SOFTWARE REQUIREMENTS:
The software requirements document is the specification of the
system. It should include both a definition and a specification of requirements. It
describes what the system should do rather than how it should do it. The software
requirements provide a basis for creating the software requirements specification,
and are useful for estimating cost, planning team activities, performing tasks and
tracking the team's progress throughout the development activity.
• IDE - Google Colab or Jupyter Notebook
• Operating System - Windows 10
• Front End - Streamlit
• Back End - Python
• Packages Used - Seaborn, Pandas, Scikit-Learn, Pickle
• Algorithms Used - Random Forest, Logistic Regression, Decision Tree,
  Naive Bayes
CHAPTER 4
4.1.2 SUPERVISED LEARNING:
Supervised learning can be defined as learning with a proper
guide, or learning in the presence of a teacher: the training dataset acts as the
teacher, and predictions are then made on a separate test dataset. Supervised
learning is based on the "train me" concept. Supervised learning covers the
following tasks and algorithms:
• Classification
   • Random Forest
   • Decision Tree
• Regression
   • Linear Regression
   • Logistic Regression
• Support Vector Machines (SVM)
• Neural Networks
• Random Forest
• Gradient Boosted Trees
• Decision Trees
• Naive Bayes
LINEAR REGRESSION:
Linear regression is a supervised learning technique. It is based on the
relationship between an independent variable and a dependent variable: as seen
in the figure below, "x" is the independent variable, "y" is the dependent variable,
and the relation between them is described by the equation of a straight line,
which is why this approach is called linear regression.
The fitted equation is used to predict a dependent variable value
"y" from an independent variable value "x", so linear regression gives the linear
relationship between x (input) and y (output).
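As an illustrative sketch only (not the project's final model, since the target DEATH_EVENT is categorical), fitting a simple linear relationship with scikit-learn could look like the following; predicting serum_creatinine from age is an assumed example:

import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv('heart_failure_clinical_records_dataset.csv')
X = df[['age']]                # independent variable x
y = df['serum_creatinine']     # dependent variable y
reg = LinearRegression().fit(X, y)          # fits y = a*x + b
print('slope:', reg.coef_[0], 'intercept:', reg.intercept_)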
DECISION TREE:
A decision tree, on the other hand, is a graphical
representation of the data and is also a kind of supervised machine learning
algorithm. The tree is constructed using the entropy of the data
attributes; on the basis of the attributes, the root and the other nodes are drawn.
Entropy = -∑ pᵢ log pᵢ        (1)
When the number of nodes is imbalanced, the tree tends to overfit, which harms
the predictions; this is one reason why a decision tree can give lower accuracy
than linear regression.
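As a minimal sketch (not the exact notebook code), an entropy-based decision tree can be fitted on the project dataset with scikit-learn; limiting max_depth is one way to control the overfitting described above, and the depth of 4 used here is an assumed value:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv('heart_failure_clinical_records_dataset.csv')
X, y = df.drop(columns=['DEATH_EVENT']), df['DEATH_EVENT']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# criterion='entropy' uses the entropy measure of Eq. (1); max_depth limits tree growth
tree = DecisionTreeClassifier(criterion='entropy', max_depth=4, random_state=0)
tree.fit(X_train, y_train)
print('test accuracy:', tree.score(X_test, y_test))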
K NEAREST NEIGHBOUR
K nearest neighbour works on the basis of the distance between data
points; using this distance, distinct data points are classified relative to each
other. Points in the same group are called neighbours of each other, and the
number of neighbours is decided by the user, which plays a very crucial role in
the analysis of the dataset.
In the figure above, k = 3 indicates three neighbours, i.e. three
different types of data. Each point is represented in two-dimensional space with
coordinates (xᵢ, yᵢ), where xᵢ is the x-coordinate, yᵢ the y-coordinate, and
i = 1, 2, 3, ..., n.
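A short sketch of K nearest neighbour classification on the same dataset (k = 3 is an assumed value, not the project's tuned choice):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

df = pd.read_csv('heart_failure_clinical_records_dataset.csv')
X, y = df.drop(columns=['DEATH_EVENT']), df['DEATH_EVENT']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# k (n_neighbors) is chosen by the user and strongly influences the result
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
print('test accuracy:', knn.score(X_test, y_test))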
RANDOM FOREST
Random forests or random decision forests are an ensemble
learning method for classification, regression and other tasks that operate by
constructing a multitude of decision trees at training time and outputting the class
that is the mode of the classes or mean/average prediction of the individual trees.
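A minimal random forest sketch on the project dataset (n_estimators = 100 is an assumed value):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv('heart_failure_clinical_records_dataset.csv')
X, y = df.drop(columns=['DEATH_EVENT']), df['DEATH_EVENT']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# n_estimators is the number of trees; the majority vote of the trees gives the class
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
print('test accuracy:', rf.score(X_test, y_test))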
NAIVE BAYESIAN
Naive Bayes classifiers are a collection of classification
algorithms based on Bayes' Theorem. It is not a single algorithm but a family of
algorithms where all of them share a common principle, i.e., every pair of
features being classified is independent of each other.
In contrast, unsupervised learning works on unlabelled data: supervised
learning would say there are a mango, a banana and an apple, whereas
unsupervised learning would only say there are three different clusters.
Unsupervised algorithms involve the following processes:
• Dimensionality reduction
• Clustering
K -MEANS CLUSTERING
Clustering is one of the most common exploratory data analysis
techniques used to get an intuition about the structure of the data. It can be
defined as the task of identifying subgroups in the data such that data points in
the same subgroup (cluster) are very similar while data points in different
clusters are very different. In other words, we try to find homogeneous
subgroups within the data such that data points in each cluster are as similar
as possible according to a similarity measure such as Euclidean-based distance
or correlation-based distance.
The decision of which similarity measure to use is application-specific.
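A small illustrative sketch of k-means on two clinical features of the project dataset (the feature choice and k = 3 are assumptions for illustration only):

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

df = pd.read_csv('heart_failure_clinical_records_dataset.csv')
# scale the features so that Euclidean distance treats them comparably
features = StandardScaler().fit_transform(df[['age', 'ejection_fraction']])
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(features)
print(kmeans.labels_[:10])     # cluster assignment of the first ten patients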
PRINCIPAL COMPONENT ANALYSIS
Principal Component Analysis (PCA) is an unsupervised, non-
parametric statistical technique primarily used for dimensionality reduction in
machine learning.
The ability to generalize correctly becomes exponentially harder
as the dimensionality of the training dataset grows, as the training set covers a
dwindling fraction of the input space. Models also become more efficient as the
reduced feature set boosts learning rates and diminishes computation costs by
removing redundant features.
REINFORCEMENT
Reinforcement learning is based on an agent's ability to interact with the
environment and find out the outcome. It follows a "hit and trial" (trial and error)
concept: the agent is rewarded with positive or negative points for its actions, it
is trained on the basis of the positive rewards it accumulates, and the trained
agent is then used to perform testing on the datasets.
PYTHON
Python is an interpreted, high-level and general-purpose
programming language. Python's design philosophy emphasizes code readability
with its notable use of significant indentation. Its language constructs and object-
oriented approach aim to help programmers write clear, logical code for small
and large-scale projects.
Python is dynamically-typed and garbage-collected. It supports
multiple programming paradigms, including structured (particularly, procedural),
object-oriented and functional programming. Python is often described as a
"batteries included" language due to its comprehensive standard library.
IMPORTANCE OF PYTHON
1) Easy to Learn and Use
The Python language is incredibly easy to use and learn for
beginners and newcomers. It is one of the most accessible programming
languages available because it has a simplified, uncomplicated syntax that places
more emphasis on natural language.
2) Mature and Supportive Python Community
Python was created more than 30 years ago, which is ample
time for any programming language community to grow and mature enough to
support developers ranging from beginner to expert level. Plenty of
documentation, guides and video tutorials are available for Python, which
learners and developers of any skill level or age can use to receive the support
they need to enhance their knowledge of Python programming.
3) Corporate Sponsorship
A programming language grows faster when it is backed by a corporate sponsor,
as with Visual Basic & C# by Microsoft. The Python programming language is
heavily backed by Facebook, Amazon Web Services, and especially Google.
4) Hundreds of Python Libraries and Frameworks
Thanks to its corporate sponsorship and the big, supportive
Python community, Python has excellent libraries that you can select from to
save time and effort in the initial cycle of development. There are also many
cloud media services that offer cross-platform support through library-like tools,
which can be extremely beneficial.
5) Versatility, Efficiency, Reliability, and Speed
Ask any Python developer, and they will wholeheartedly agree
that the Python language is efficient, reliable, and much faster than most modern
languages. Python can be used in nearly any kind of environment, and one will
not face any kind of performance loss issue irrespective of the platform one is
working on.
6) Big data, Machine Learning and Cloud Computing
Cloud Computing, Machine Learning, and Big Data are some of
the hottest trends in the computer science world right now, and they help many
organizations to transform and improve their processes and workflows.
7) First-choice Language
Python is the first choice of many programmers and students,
mainly because Python is in high demand in the development market. Students
and developers always look forward to learning a language that is in high
demand, and Python is undoubtedly the most sought-after language in the market
right now.
8) The Flexibility of Python Language
The Python language is so flexible that it gives developers the
chance to try something new. A person who is an expert in Python is not limited
to building the same kinds of things over and over, but can also go on to try
making something different from before.
IMPORTANCE OF PYTHON IN MACHINE LEARNING
AI projects differ from traditional software projects. The
differences lie in the technology stack, the skills required for an AI-based project,
and the necessity of deep research. To implement your AI aspirations, you should
use a programming language that is stable, flexible, and has tools available.
Python offers all of this, which is why we see lots of Python AI projects today.
• Simple and Consistent
• Extensive selection of libraries and frameworks:
   Keras, TensorFlow, and Scikit-learn for machine learning
   NumPy for high-performance scientific computing and data analysis
   SciPy for advanced computing
   Pandas for general-purpose data analysis
   Seaborn for data visualization
• Great community and popularity
• Platform independent
• Spam filters, recommendation systems, search engines, personal assistants,
and fraud detection systems are all made possible by AI and machine
learning, and there are definitely more things to come. Product owners
want to build apps that perform well. This requires coming up with
algorithms that process information intelligently, making software act like
a human.
STREAMLIT
Streamlit is an open-source Python library that makes it easy to
create and share beautiful, custom web apps for machine learning and data
science. In just a few minutes you can build and deploy powerful data apps.
• Make sure that you have Python 3.6 - Python 3.8 installed.
• Install Streamlit using pip and run the ‘hello world’ app:
pip install streamlit
streamlit hello
To create a new app, follow the steps below:
• Open a new Python file, import Streamlit, and write some code
• Run the file with:
streamlit run [filename]
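For example, a minimal first app might look like this (app_demo.py is a hypothetical file name used only for illustration):

# app_demo.py - a minimal Streamlit app
import streamlit as st

st.title('My first Streamlit app')
name = st.text_input('Enter your name', 'Type Here')
if st.button('Greet'):
    st.success('Hello, {}!'.format(name))

Running streamlit run app_demo.py then opens the app in the browser.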
GOOGLE COLABORATORY
JUPYTER NOTEBOOK
Jupyter Notebook is used as the simulation tool and is
convenient for Python programming projects. A Jupyter notebook contains
rich text elements as well as code, including figures, equations, links and much
more. Because of this mix of rich text elements and code, notebooks are an ideal
place to bring together an analysis, its description and its results, and they can
execute the data analysis in real time. Jupyter Notebook is an open-source,
web-based interactive environment that supports live code, graphics, maps, plots,
visualizations, and narrative text.
4.2 MODULES AND DESCRIPTION
METHODOLOGY OF SYSTEM
i) About Dataset
ii) Attributes of a Data Set
iii) Steps followed in Data Analysis
• Data Cleaning
• Feature Engineering
• Training Data
• Testing Data
ABOUT DATASET:
Cardiovascular diseases (CVDs) are the number 1 cause of death
globally, taking an estimated 17.9 million lives each year, which accounts for
31% of all deaths worldwide. Heart failure is a common event caused by CVDs
and this dataset contains 12 features that can be used to predict mortality by heart
failure.
Most cardiovascular diseases can be prevented by addressing
behavioral risk factors such as tobacco use, unhealthy diet and obesity, physical
inactivity and harmful use of alcohol using population-wide strategies.
People with cardiovascular disease or who are at high
cardiovascular risk (due to the presence of one or more risk factors such as
hypertension, diabetes, hyperlipidemia or already established disease) need early
detection and management wherein a machine learning model can be of great
help.
ATTRIBUTES OF DATASET
STEPS IN DATA ANALYSIS:
DATA CLEANING:
Data cleansing or data cleaning is the process of detecting and
correcting corrupt or inaccurate records from a record set, table, or database and
refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the
data and then replacing, modifying, or deleting the dirty or coarse data.
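A short pandas sketch of these cleaning steps on the project dataset (the exact handling of missing or duplicate records here is an assumption, not the project's recorded preprocessing):

import pandas as pd

df = pd.read_csv('heart_failure_clinical_records_dataset.csv')
print(df.isnull().sum())      # count missing values per column
df = df.drop_duplicates()     # remove duplicate records
df = df.dropna()              # drop (or alternatively fill) incomplete rows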
FEATURE ENGINEERING:
Feature engineering is the process of using domain knowledge
to extract features from raw data via data mining techniques. These features can
be used to improve the machine learning algorithms. Feature engineering can be
considered as applied machine learning itself.
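As a hypothetical illustration only (this derived column is not part of the original dataset or of the project code), a domain-driven feature could be added like this:

import pandas as pd

df = pd.read_csv('heart_failure_clinical_records_dataset.csv')
# flag patients older than 60 as an additional risk indicator (hypothetical feature)
df['elderly'] = (df['age'] > 60).astype(int)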
TRAINING DATA
The observations in the training set form the experience that the
algorithm uses to learn. In supervised learning problems, each observation
consists of an observed output variable and one or more observed input variables.
TESTING DATA
The test set is a set of observations used to evaluate the
performance of the model using some performance metric. It is important that no
observations from the training set are included in the test set. If the test set does
contain examples from the training set, it will be difficult to assess whether the
algorithm has learned to generalize from the training set or has simply memorized
it. Here we have taken 80% of the data for training and the remaining 20% for
testing.
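The 80/20 split can be produced with scikit-learn as in this brief sketch (random_state is an assumed value used only for reproducibility):

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('heart_failure_clinical_records_dataset.csv')
X, y = df.drop(columns=['DEATH_EVENT']), df['DEATH_EVENT']
# 80% of the records for training, 20% held out for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print(X_train.shape, X_test.shape)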
MACHINE LEARNING MODEL
A machine learning model is a file that has been trained to
recognize certain types of patterns. You train a model over a set of data, providing
it with an algorithm that it can use to reason over and learn from those data.
• Logistic Regression
• Naive Bayes
• Support Vector Machine
• Random Forest
• Decision Tree
• Principal Component Analysis
LOGISTIC REGRESSION:
Logistic regression is a statistical model that in its basic form
uses a logistic function to model a binary dependent variable, although many
more complex extensions exist. In regression analysis, logistic regression (or logit
regression) estimates the parameters of a logistic model.
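A minimal sketch (not the exact notebook cell) of fitting a logistic regression on the heart failure data:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

df = pd.read_csv('heart_failure_clinical_records_dataset.csv')
X, y = df.drop(columns=['DEATH_EVENT']), df['DEATH_EVENT']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# the logistic function maps a linear combination of features to a probability of DEATH_EVENT
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)
print('test accuracy:', lr.score(X_test, y_test))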
NAIVE BAYESIAN
Naive Bayes classifiers are a collection of classification algorithms
based on Bayes' Theorem. It is not a single algorithm but a family of algorithms
where all of them share a common principle, i.e., every pair of features being
classified is independent of each other.
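A corresponding Naive Bayes sketch (GaussianNB is assumed here; the report does not state which Naive Bayes variant was used):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

df = pd.read_csv('heart_failure_clinical_records_dataset.csv')
X, y = df.drop(columns=['DEATH_EVENT']), df['DEATH_EVENT']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Bayes' theorem with the naive assumption that features are mutually independent
nb = GaussianNB()
nb.fit(X_train, y_train)
print('test accuracy:', nb.score(X_test, y_test))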
PRINCIPAL COMPONENT ANALYSIS
Principal Component Analysis, or PCA, is a dimensionality-
reduction method that is often used to reduce the dimensionality of large datasets,
by transforming a large set of variables into a smaller one that still contains most
of the information in the large set.
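A brief PCA sketch on the project features (keeping two components is an assumed choice for illustration):

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

df = pd.read_csv('heart_failure_clinical_records_dataset.csv')
X = StandardScaler().fit_transform(df.drop(columns=['DEATH_EVENT']))
pca = PCA(n_components=2)             # keep the first two principal components
X_reduced = pca.fit_transform(X)
print(pca.explained_variance_ratio_)  # share of variance explained by each component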
MODEL EVALUATION
CONFUSION MATRIX
When we get the data, after data cleaning, pre-processing and wrangling,
the first step is to feed it to a model and obtain output probabilities. But how can
we measure the effectiveness of the model? The better the effectiveness, the better
the performance, and that is exactly what we want. This is where the confusion
matrix comes into the limelight: the confusion matrix is a performance
measurement for a machine learning classification algorithm.
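As a minimal sketch with hypothetical labels (not the project's results), the matrix can be computed and drawn as a heat map, matching the confusion-matrix heat maps shown in the results chapter:

import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 0, 1, 0, 1]       # actual classes (illustrative values only)
y_pred = [0, 1, 1, 1, 0, 0, 0, 1]       # predicted classes (illustrative values only)
cm = confusion_matrix(y_true, y_pred)   # rows: actual class, columns: predicted class
sns.heatmap(cm, annot=True, fmt='d')
plt.show()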
ACCURACY CALCULATION
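The accuracy figures reported for each model are computed on the held-out test set. In terms of the confusion-matrix counts this is the standard definition Accuracy = (TP + TN) / (TP + TN + FP + FN), i.e. the fraction of correctly classified samples, which scikit-learn provides directly; for example, with the same illustrative labels as above:

from sklearn.metrics import accuracy_score

y_true = [0, 0, 1, 1, 0, 1, 0, 1]        # actual classes (illustrative values only)
y_pred = [0, 1, 1, 1, 0, 0, 0, 1]        # predicted classes (illustrative values only)
print(accuracy_score(y_true, y_pred))    # proportion of correct predictions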
CHAPTER 5
RESULTS AND DISCUSSIONS
SAMPLE CODING:
HeartFailurePrediction.ipynb
# Read the CSV file and print the number of rows and columns
import pandas as pd
df = pd.read_csv('heart_failure_clinical_records_dataset.csv')
df.shape
# Pair plot of the dataset, showing pairwise relationships between columns
import seaborn as sns
sns.pairplot(df)
# Heat maps of the column correlations and of missing values
sns.heatmap(df.corr(), annot=True)
sns.heatmap(df.isnull())
# Correlation matrix
correlation = df.corr()
print(correlation)
df.tail()
# Data preprocessing: drop the target and the binary columns from the feature set
X = df.drop(columns=['DEATH_EVENT', 'anaemia', 'diabetes', 'high_blood_pressure', 'sex', 'smoking'])
X
y = df['DEATH_EVENT']
y
# (model, x_test, y_test, X_axis and acc1 are defined in the training cell on the preceding page)
predictions = model.predict(x_test)
acc1 = acc1.append(pd.Series(metrics.accuracy_score(predictions, y_test)))
plt.plot(X_axis, acc1)
plt.xticks(x)
plt.title("Logistic Graph")
plt.xlabel("n_estimators")
plt.ylabel("Accuracy")
plt.grid()
plt.show()
print('Highest value: ', acc1.values.max())
acc1
plt.plot(X_axis, acc2)
plt.xticks(x)
plt.title("Random forest Graph")
plt.xlabel("n_estimators")
plt.ylabel("Accuracy")
plt.grid()
plt.show()
print('Highest value: ', acc2.values.max())
acc2
plt.grid()
plt.show()
print('Highest value: ', acc3.values.max())
acc3
# Plotting the accuracies of all algorithms in one bar chart
import numpy as np
algos = ["Logistic", "Random forest", "Naive Bayes", "Decision tree"]
lr = acc1.values.max()*100
rf = acc2.values.max()*100
nb = acc3.values.max()*100
dt = acc4.values.max()*100
accuracy = [lr, rf, nb, dt]
xpos = np.arange(len(algos))
plt.bar(xpos, accuracy, width=0.8, align="center", color=['red', 'green', 'blue', 'brown'], ec="black")
for i in range(len(algos)):
    plt.text(i, accuracy[i], accuracy[i], ha="center", va="bottom")
plt.xticks(xpos, algos)
plt.xlabel('comparison of algorithms')
plt.ylabel('percentage')
plt.title("Accuracy")
App.py
import streamlit as st
import pickle
import numpy as np

# load the trained model saved from the notebook
model = pickle.load(open('hf1.pkl', 'rb'))

def predict_forest(age, anaemia, creatinine_phosphokinase, diabetes, ejection_fraction,
                   high_blood_pressure, platelets, serum_creatinine, serum_sodium,
                   sex, smoking, time):
    input = np.array([[age, anaemia, creatinine_phosphokinase, diabetes, ejection_fraction,
                       high_blood_pressure, platelets, serum_creatinine, serum_sodium,
                       sex, smoking, time]]).astype(np.float64)
    prediction = model.predict_proba(input)
    pred = '{0:.{1}f}'.format(prediction[0][0], 2)
    return float(pred)

def main():
    st.title("Streamlit services")
    html_temp = """
    <div style="background-color:#025246;padding:10px">
    <h2 style="color:white;text-align:center;">Heart failure prediction app</h2>
    </div>
    """
    st.markdown(html_temp, unsafe_allow_html=True)
    age = st.text_input("age", "Type Here")
    anaemia = st.text_input("anaemia", "Type Here")
    # ... text_input fields for the remaining clinical attributes appear here in the full source
    smoking = st.text_input("smoking", "Type Here")
    time = st.text_input("time", "Type Here")
    safe_html = """
    <div style="background-color:#F4D03F;padding:10px">
    <h2 style="color:white;text-align:center;">You are safe</h2>
    </div>
    """
    danger_html = """
    <div style="background-color:#F08080;padding:10px">
    <h2 style="color:black;text-align:center;">You are in danger</h2>
    </div>
    """
    if st.button("Predict"):
        output = predict_forest(age, anaemia, creatinine_phosphokinase, diabetes,
                                ejection_fraction, high_blood_pressure, platelets,
                                serum_creatinine, serum_sodium, sex, smoking, time)
        st.success('The probability of heart failure is {}'.format(output))
Main.py
import streamlit as st
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

st.title('HEART FAILURE PREDICTION')
classifier_name = st.sidebar.selectbox('Select classifier', ('KNN', 'SVM', 'Random Forest'))
df = pd.read_csv("hf.csv")
X, y = df.iloc[:, :-1], df['DEATH_EVENT']
#print(X)

def add_parameter_ui(clf_name):
    # sidebar sliders for the hyperparameters of the selected classifier
    params = dict()
    if clf_name == 'SVM':
        C = st.sidebar.slider('C', 0.01, 10.0)
        params['C'] = C
    elif clf_name == 'KNN':
        K = st.sidebar.slider('K', 1, 15)
        params['K'] = K
    else:
        max_depth = st.sidebar.slider('max_depth', 2, 15)
        params['max_depth'] = max_depth
        n_estimators = st.sidebar.slider('n_estimators', 1, 100)
        params['n_estimators'] = n_estimators
    return params

params = add_parameter_ui(classifier_name)
# (the classifier clf and the X_train/X_test split are created on a page omitted
#  from this listing; acc is the resulting test accuracy)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
acc = accuracy_score(y_test, y_pred)
st.write(f'Classifier = {classifier_name}')
st.write(f'Accuracy =', acc)
PROGRAM RESULTS:
5.1 RESULTS:
Pair-Plot Graph:
Heat Map:
Correlation
Description of Dataset:
Head of Dataset:
Heat map of confusion matrix (Logistic Regression):
Logistic Graph:
Heat map of confusion matrix (Decision Tree Classifier):
Tail of Dataset:
Values of X:
Values of Y:
Output screen of app.py
Output screen after Prediction:
CHAPTER 6
6.1 CONCLUSION:
In this project we built the model that gives the best accuracy among all the
machine learning algorithms we compared, and using that model we created a
web service with Streamlit that connects the back end of the ML model to the
front end we designed. With the help of this, many people can learn about their
health condition early and take care of themselves accordingly.
REFERENCES
[1] Aditi Gavhane, Gouthami Kokkula, Isha Panday and Prof. Kailash Devadkar,
“Prediction of Heart Disease using Machine Learning”, Proceedings of the
2nd International Conference on Electronics, Communication and
Aerospace Technology (ICECA), 2018.
[3] Rairikar, A., Kulkarni, V., Sabale, V., Kale, H., & Lamgunde, A. (2017,
June). Heart disease prediction using data mining techniques. In 2017
International Conference on Intelligent Computing and Control (I2C2) (pp.
1-8). IEEE.
[4] Aldallal, A., & Al-Moosa, A. A. A. (2018, September). Using Data Mining
Techniques to Predict Diabetes and Heart Diseases. In 2018 4th
International Conference on Frontiers of Signal Processing (ICFSP) (pp.
150-154). IEEE.
[6] Hazra, A., Mandal, S., Gupta, A., & Mukherjee, A. (2017). Heart Disease
Diagnosis and Prediction Using Machine Learning and Data Mining
Techniques: A Review. Advances in Computational Sciences and
Technology.