
HEART FAILURE PREDICTION SYSTEM

USING MACHINE LEARNING ALGORITHMS

A PROJECT REPORT

Submitted by

SHAIK SHARIQ - 210420104145


TATA MASTHAN MANDEESH - 210420104168

in partial fulfillment for the award of the degree


of

BACHELOR OF ENGINEERING
in

COMPUTER SCIENCE AND ENGINEERING

ANNA UNIVERSITY: CHENNAI - 600025


MARCH 2024
CHENNAI INSTITUTE OF TECHNOLOGY
(An Autonomous Institution, Affiliated to Anna University, Chennai) - 600069

BONAFIDE CERTIFICATE

Certified that this project report “Heart Failure Prediction


System” is the bonafide work done by SHAIK SHARIQ (210420104145),
TATA MASTHAN MANDEESH (210420104168), who carried out the work
under my supervision.

SIGNATURE SIGNATURE
Dr. S PAVITHRA M.E., Ph.D., Dr. P KARTHIKEYAN M.E., Ph.D.,
ASSOCIATE PROFESSOR ASSOCIATE PROFESSOR
HEAD OF THE DEPARTMENT SUPERVISOR
Computer Science and Engineering Computer Science and Engineering
Chennai Institute of Technology Chennai Institute of Technology
Kundrathur, Chennai – 600069 Kundrathur, Chennai - 600069

Submitted for University viva voce examination held on ……………


at Chennai Institute of Technology, Kundrathur.

INTERNAL EXAMINER EXTERNAL EXAMINER


ACKNOWLEDGEMENT

We convey our profound thanks and gratitude to our honorable Chairman
Mr. P Sriram, Chennai Institute of Technology, Chennai, for providing us with
an excellent academic climate, which made this endeavor possible.

We also express our gratitude to our beloved Principal
Dr. A. RAMESH M.E., Ph.D., who constantly nurtured our standard of
education and devoted his precious time to our needs.

We are deeply indebted and pay our sincere thanks to our respectable
Head of the Department Dr. S. PAVITHRA M.E., Ph.D., of Computer
Science and Engineering for her enthusiastic support of our project and for
providing us with all the facilities in the department to complete it.

We take immense pleasure in expressing our heartfelt thanks to our Guide
Dr. P. KARTHIKEYAN M.E., Ph.D., Associate Professor, for motivating us to
study in this field and for his illuminating guidance and continuous support
in the planning and execution of this project.

We thank all our department's teaching and non-teaching staff, who have
helped directly and indirectly to complete this project in time. Last but not least,
we extend our deep gratitude to our beloved family members for their moral
support, encouragement and financial assistance in carrying out this project.
Finally, we express our heartfelt and deep sense of gratitude to all faculty
members in our division and to our friends for their helping hands, valuable
support and encouragement during the project work.
TABLE OF CONTENTS

CHAPTER NO TITLE PAGE NO

ABSTRACT i

LIST OF FIGURES ii

MOTIVATION iii

1 INTRODUCTION 1

1.1 OVERVIEW OF PROJECT 1

1.2 SCOPE AND OBJECTIVE 2

2 LITERATURE SURVEY 3

2.1 INTRODUCTION 3

2.2 LITERATURE REVIEW 4

3 SYSTEM DESIGN 6

3.1 INTRODUCTION 6

3.2 EXISTING SYSTEM 7

3.3 PROPOSED SYSTEM 7

3.4 SYSTEM ARCHITECTURE 8

3.5 SOFTWARE REQUIREMENTS 9

3.6 HARDWARE REQUIREMENTS 9

4 IMPLEMENTATION AND ANALYSIS 10

4.1 SOFTWARE SPECIFICATIONS 10


4.2 MODULES AND DESCRIPTION 24

5 RESULTS AND DISCUSSIONS 34

5.1 PROGRAM RESULTS 55

6 CONCLUSION AND FUTURE


ENHANCEMENTS 61
6.1 CONCLUSION 61
6.2 FUTURE ENHANCEMENTS 61
REFERENCES 62
ABSTRACT

The diagnosis of heart failure in most instances relies on a complex


amalgamation of clinical and pathological data. Due to this intricacy, there is a
considerable amount of interest among clinical professionals and researchers in
efficiently and accurately predicting heart failure. In this project, we have
devised a heart failure prediction system that can aid medical professionals in
forecasting the heart failure status based on patients' clinical data. Our
methodology encompasses three steps. Firstly, we carefully select 13 crucial
clinical features, such as age, sex, anemia, creatinine phosphokinase, diabetes,
ejection fraction, high blood pressure, platelets, serum-creatinine, serum sodium,
smoking, time, and Death Event. Secondly, we develop Machine Learning
algorithms to classify heart failure based on these clinical features. The
prediction accuracy is approximately 80%. Lastly, we create a user-friendly heart
failure prediction system (HFPS) that comprises various features, including an
input clinical data section, ROC curve display section, and prediction
performance display section (execution time, accuracy, sensitivity, specificity,
and prediction result). Our approaches have proven effective in predicting heart
failure in patients. The HFPS system developed in this study presents a novel
approach that can be utilized for heart failure classification.

i
LIST OF FIGURES

Figure No. NAME OF FIGURE Page No.

1 Percentage of deaths by cause (2016,2017) iii

2 Percentage of deaths by cause in 2019 iv

3 Architecture Diagram 8

4 Linear Regression 12

5 Decision Tree 13

6 Support Vector Machine 13

7 K Nearest neighbor 14

8 Random Forest 14

9 Naïve Bayesian 15

10 K means clustering 16

11 Principal Component Analysis 17

12 Jupyter Notebook 23

13 Steps to be followed in data analysis 26

14 Module Evaluation 31

ii
MOTIVATION

As we see nowadays, a major cause of death is cardiovascular disease, and many

patients report that they did not know about their condition earlier; had they known,
they would have had a chance to survive. Seeing this situation, we came up with the
idea of building a system that predicts the condition of the heart by measuring some
of the key aspects of a person's health. This system is the HFPS, the "Heart Failure
Prediction System". It can help millions of people learn about their heart condition
earlier and receive treatment at an early stage, so that they do not have to undergo
any critical condition. As humans, we pay more importance to our life than to
anything else, as it is the one thing that will not come back with time. This is a small
effort of ours to predict heart failure; if it works properly, we will try to extend it to
other diseases as well.

iii
According to the World Health Organization, the total number of people who
die from heart disease and heart stroke is the highest when compared to other
causes of death.

In India there are about 7.23 deaths per 1,000 inhabitants per year. This is
mainly due to medical conditions; some deaths are because of accidents and the rest
are due to crime. The major share of deaths is due to disease.

iv
CHAPTER 1

INTRODUCTION

1.1 OVERVIEW OF THE PROJECT


The heart is one of the most vital organs of the human body,
so care of the heart is essential. Many diseases are related to the heart, so prediction
of heart disease is necessary, and a comparative study is needed in this field. Today
many patients die because their disease is recognized only at a late stage, owing to a
lack of accurate instruments, so there is a need for more efficient algorithms for
disease prediction.

Machine Learning is an efficient technology based on training and testing.
It is a branch of Artificial Intelligence, a broad area in which machines emulate
human abilities; machine learning is the specific branch of AI in which systems are
trained to learn how to process and make use of data, which is why the combination
of these technologies is also called Machine Intelligence.

Following the definition of machine learning, the system learns from
observed data, so in this project we use biological parameters as input data, such as
high blood pressure, platelet count, diabetes, sex and age, and on the basis of these,
the algorithms are compared in terms of accuracy.

1
1.2 SCOPE AND OBJECTIVE

Scope:
According to a report by the Global Burden of Disease study in 2016, 1.7
million Indians died of heart disease out of 17.3 million such deaths worldwide.
The major share of deaths is due to disease. We are trying to integrate technology
into day-to-day life, i.e., to predict whether a person is likely to develop the disease
using machine learning algorithms.

As the world grows in all aspects, it becomes more connected with
technology. The one thing we are all afraid of is death; in India there are about
7.23 deaths per 1,000 inhabitants per year, mainly due to medical conditions, with
some deaths caused by accidents and the rest by crime.

OBJECTIVE:

The purpose of this project is to find the likelihood of heart failure
for a person so that he or she has a chance to save their life by taking early
treatment. Here we build a heart failure detector that predicts whether a person is
at risk of heart failure or not.

In this project we calculate the accuracy of four different machine learning
approaches, and on the basis of that calculation we conclude which one is best and
deploy that algorithm into an interface using Streamlit.

2
CHAPTER 2

LITERATURE SURVEY

2.1 Introduction
A literature survey is an important step in the software
development process. Before developing a tool, it is necessary to determine the
time factor, resource requirements, manpower, economy and company strength.
Once these are satisfied, the next step is to determine which operating system and
language can be used for developing the tool. Once programmers start building the
tool, they need a lot of external support, which can be obtained from senior
programmers, from books or from websites. Before building the system, these
considerations are taken into account, and the project development team surveys
all the requirements, including the software specifications of the target system:
what type of operating system the project requires and what other software is
needed to proceed with developing the tool and the associated operations.

3
2.2 LITERATURE REVIEW

1 - Mohammed Abdul Khaleel presented a paper surveying techniques for mining
medical data to find locally frequent diseases. The paper dissects the data mining
procedures required for medical data mining, especially to find locally frequent
illnesses such as heart ailments, lung cancer and breast cancer. Data mining is the
process of extracting information to discover latent patterns. In related work,
Vembandasamy et al. analyzed and detected heart disease using the Naive Bayes
algorithm, which applies Bayes' theorem and therefore makes a strong
independence assumption between features. The dataset used was obtained from a
leading diabetic research institute in Chennai, Tamil Nadu, and contains more than
500 patients. The tool used was Weka, and classification was executed using a
70% percentage split. The accuracy achieved by Naive Bayes was 86.419%.

2 - Costas Sideris, Nabil Al-shurafa, Haik Kalantarian and Mohammad
Pourhomayoun presented a paper named Remote Health Monitoring Outcome
Success Prediction using First Month and Baseline Intervention Data. RHM
systems are effective in saving costs and reducing illness. In this paper, they
describe an upgraded RHM framework, Wanda-CVD, that is cellphone-based and
intended to give remote coaching and social support to participants. CVD
prevention measures are recognized as a critical focus by healthcare organizations
around the world.

4
3 - L. Sathish Kumar and A. Padma Priya presented a paper named Prediction of
Similarities of Diseases by Using the ID3 Algorithm in Televisions and Mobile
Phones. The paper gives an automatic and concealed approach to recognizing
hidden patterns of coronary illness. The framework uses data mining methods such
as the ID3 algorithm. The proposed method helps people not only to know about
the disease but can also help to reduce the death rate and the number of
disease-affected people.

4 - Santayana Krishnan J and Dr. Geetha S from the MIT campus, Anna University,
presented a paper named Prediction of Heart Diseases using Machine Learning
Algorithms. In this paper a dataset is used and two machine learning algorithms are
developed to calculate accuracy and decide the more accurate algorithm among
them.

5
CHAPTER 3
SYSTEM DESIGN

3.1 INTRODUCTION

Design is a multi-step process that focuses on architectural flow,
procedural details, algorithms and the interfaces between modules. The design
process gives an idea about the requirements and the quality assurance of the
system before coding begins. Systems design can be considered the theory used to
develop the product. Until the 1990s, systems design had a crucial and respected
role in the data processing industry. In the 1990s, standardization of hardware and
software resulted in the ability to build modular systems. The architectural design
of a system emphasizes the design flow of the system architecture, which describes
the structure, behavior and other views of that system and its analysis. Practitioners
needed to standardize their work into a formal discipline with proper methods,
especially for new fields like information theory, and design practice changes
continuously for better analysis. Software design methodology lacks the depth,
flexibility and quantitative nature normally associated with the more classical
engineering disciplines. However, techniques for software design do exist, criteria
for design quality are available and design notation can be applied.

6
3.2 EXISTING SYSTEM
The dataset consists of clinical parameters such as high blood pressure, age,
etc.

Two machine learning algorithms were applied to the dataset to predict the
possibility of a patient having heart disease or a heart attack.

These were analyzed with classification models, namely the Naive Bayes
classifier and decision tree classification. This work can be improved by
including other machine learning algorithms.

3.3 PROPOSED SYSTEM


We approach this in three steps. Firstly, we select 13 important clinical
features; secondly, we develop four Machine Learning algorithms for
classifying heart failure based on those clinical features.

Finally, we develop a user-friendly heart failure prediction system using
Streamlit, which connects the back end of the Machine Learning model to
the front end that we designed.

These approaches are effective in predicting the heart failure of a patient.

PROPOSED SYSTEM ADVANTAGES:

Calculates the efficiency of four different algorithms.
Achieves higher efficiency compared to the existing system.
Deploys the highest-accuracy algorithm into a web interface using
Streamlit.

7
3.4 SYSTEM ARCHITECTURE

Fig 3.4 Architecture of the System Classification

8
3.5 SOFTWARE REQUIREMENTS:
The software requirements document is the specification of the
system. It should include both a definition and a specification of requirements. It
states what the system should do rather than how it should do it. The software
requirements provide a basis for creating the software requirements specification.
It is useful in estimating cost, planning team activities, performing tasks and
tracking the team's progress throughout the development activity.
• IDE - Google Colab or Jupyter Notebook
• Operating System - Windows 10
• Front End - Streamlit
• Back End - Python
• Packages Used - Seaborn, Pandas, Scikit-Learn, Pickle
• Algorithms Used - Random Forest
Logistic Regression
Decision Tree
Naive Bayes

3.6 HARDWARE REQUIREMENTS


The hardware requirements may serve as the basis for a contract
for the implementation of the system and should therefore be a complete and
consistent specification of the whole system. They are used by software engineers
as the starting point for the system design. It shows what the system does and not
how it should be implemented.

• Processor - Intel Core i5


• RAM - min 4GB
• Hard Disk - 100 GB

9
CHAPTER 4

IMPLEMENTATION AND ANALYSIS

4.1 SOFTWARE SPECIFICATIONS

4.1.1 MACHINE LEARNING:

Machine Learning is an efficient technology based on two
phases, namely training and testing: the system learns directly from data and
experience, and the resulting model is then applied to different kinds of needs as
per the algorithm required.

There are three types of Machine Learning algorithms:

Fig 4.1.1 Classification of Machine Learning Algorithms.

10
4.1.2 SUPERVISED LEARNING:
Supervised learning can be defined as learning with a proper
guide, or learning in the presence of a teacher: there is a training dataset which acts
as the teacher for making predictions on the given dataset, so for the data being
tested there is always a training dataset. Supervised learning is based on the
"train me" concept. Supervised learning has the following processes:
• Classification
• Random Forest
• Decision tree
• Regression

Regression is the phenomenon of recognizing patterns and measuring the
probability of continuous outcomes. The system has the ability to identify numbers,
their values, and groupings of numbers in the sense of width, height, etc. The
following are supervised machine learning algorithms:

• Linear Regression
• Logistical Regression
• Support Vector Machines (SVM)
• Neural Networks
• Random Forest
• Gradient Boosted Trees
• Decision Trees
• Naive Bayes

11
LINEAR REGRESSION:
Linear regression is a supervised learning technique. It is based on the
relationship between an independent variable and a dependent variable: as seen
in the figure below, variables "x" and "y" are the independent and dependent
variables, and the relation between them is shown by the equation of a line, which
is linear in nature, which is why this approach is called linear regression.
It gives a relation equation to predict a dependent variable value
"y" based on an independent variable value "x", as we can see in the figure below,
so linear regression gives the linear relationship between x (input) and y (output).

Fig 4.1.2.1 Linear Regression Technique
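As a small illustration (not part of the report's original code), a linear fit of y on x
can be obtained with scikit-learn; the data points below are made up.

# Minimal sketch of fitting y = w*x + b with scikit-learn (toy data).
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([[1.0], [2.0], [3.0], [4.0]])   # independent variable (input)
y = np.array([2.1, 4.0, 6.2, 7.9])           # dependent variable (output)

reg = LinearRegression().fit(x, y)
print(reg.coef_, reg.intercept_)             # slope and intercept of the fitted line
print(reg.predict([[5.0]]))                  # predicted y for a new x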

DECISION TREE:
A decision tree, on the other hand, is a graphical
representation of the data and is also a kind of supervised machine learning
algorithm. For the tree construction we use the entropy of the data attributes, and
on the basis of the attributes the root and other nodes are drawn.

Entropy = -∑ pi log2(pi)    (1)

In the entropy equation (1), pi is the probability of class i at the node,
and from it the entropy of each node is calculated. The attribute whose split gives
the greatest reduction in entropy (the highest information gain) is chosen for the
root node, and this process is repeated until all the nodes of the tree are constructed.

12
When the numbers of samples at the nodes are imbalanced, the tree tends to overfit,
which is not good for the calculation, and this is one of the reasons why a decision
tree can have lower accuracy compared to linear regression.
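As an illustrative sketch of equation (1) (not part of the report's code), the
following computes node entropy and the information gain of a candidate split;
the class labels are hypothetical.

# Entropy of a node and information gain of one split (toy labels).
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

parent = np.array([1, 1, 1, 0, 0, 0, 0, 1])   # class labels reaching a node
left, right = parent[:4], parent[4:]          # one candidate split of that node

gain = entropy(parent) - (len(left) / len(parent)) * entropy(left) \
       - (len(right) / len(parent)) * entropy(right)
print(entropy(parent), gain)                  # node entropy and information gain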

SUPPORT VECTOR MACHINE


It is a category of machine learning technique that works on the concept of a
hyperplane, meaning it classifies the data by creating a hyperplane between the
classes. The training sample dataset is (Xi, Yi), where i = 1, 2, 3, ..., n, Xi is the
i-th input vector and Yi is the target value. The type of hyperplane decides the type
of support vector machine; for example, if a line is used as the hyperplane, the
method is called a linear support vector machine.

Fig 4.1.2.2 Support Vector Machine Algorithm

13
K NEAREST NEIGHBOUR
K nearest neighbours works on the basis of the distance between data points, and
on the basis of this distance distinct data are classified. The other groups of data
are called neighbours of each other, and the number of neighbours is decided by
the user, which plays a very crucial role in the analysis of the dataset.

In the corresponding figure, k = 3 shows that there are three neighbours, which
means there are three different types of data. Each point is represented in
two-dimensional space with coordinates (Xi, Yi), where Xi is the x-axis value,
Yi the y-axis value and i = 1, 2, 3, ..., n.
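A small sketch (not from the report's code) of k-nearest-neighbour classification
with k = 3 on made-up two-dimensional points:

# k-nearest-neighbour classification with k = 3 (toy data).
from sklearn.neighbors import KNeighborsClassifier

X = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]   # points in two-dimensional space
y = [0, 0, 0, 1, 1, 1]                                 # two groups of neighbours

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[2, 2], [9, 9]]))                   # each query gets the majority class of its 3 nearest neighbours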

RANDOM FOREST
Random forests or random decision forests are an ensemble
learning method for classification, regression and other tasks that operate by
constructing a multitude of decision trees at training time and outputting the class
that is the mode of the classes or mean/average prediction of the individual trees.

Fig 4.1.2.3 Simplified Random Forest Algorithm

14
NAIVE BAYESIAN
Naive Bayes classifiers are a collection of classification
algorithms based on Bayes' Theorem. It is not a single algorithm but a family of
algorithms where all of them share a common principle, i.e., every pair of
features being classified is independent of each other.
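As a brief sketch (illustrative only, with made-up numbers), a Gaussian Naive
Bayes classifier applies Bayes' theorem under the independence assumption
described above:

# Gaussian Naive Bayes on toy data (feature values are hypothetical).
from sklearn.naive_bayes import GaussianNB

X = [[30, 120], [45, 150], [60, 160], [25, 110]]   # e.g. [age, blood pressure] (made up)
y = [0, 1, 1, 0]                                   # 0 = no event, 1 = event

nb = GaussianNB().fit(X, y)
print(nb.predict([[50, 155]]))         # class with the highest posterior probability
print(nb.predict_proba([[50, 155]]))   # posterior probabilities from Bayes' theorem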

4.1.3 UNSUPERVISED LEARNING:

Unsupervised learning can be defined as learning
without guidance: there is no teacher guiding the process. In unsupervised
learning, when a dataset is given, the algorithm automatically works on the
dataset and finds the patterns and relationships within it; according to the
relationships it has created, when new data is given it classifies the data and
assigns it to one of those relations. Unsupervised learning is based on a
"self-sufficient" concept.

For example, suppose there is a combination of fruits - mango,
banana and apple. When unsupervised learning is applied, it classifies them into
three different clusters on the basis of their relations with each other, and when a
new data point is given it automatically sends it to one of the clusters.

Supervised learning would say there are mangoes, bananas and apples, but
unsupervised learning only says there are three different clusters. Unsupervised
algorithms have the following processes:

15
• Dimensionality reduction
• Clustering

There are following unsupervised machine learning algorithms:


• t-SNE
• k-means clustering
• PCA

K -MEANS CLUSTERING
Clustering is one of the most common exploratory data analysis
techniques used to get an intuition about the structure of the data. It can be
defined as the task of identifying subgroups in the data such that data points in
the same subgroup (cluster) are very similar while data points in different
clusters are very different. In other words, we try to find homogeneous
subgroups within the data such that data points in each cluster are as similar
as possible according to a similarity measure such as Euclidean distance
or correlation-based distance.
The decision of which similarity measure to use is application-specific.

Fig 4.1.3.1 Clustering Before and After K-Means
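A minimal sketch of k-means clustering (not from the report's code), grouping toy
two-dimensional points into three clusters:

# k-means clustering with three clusters (toy data).
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1, 1], [1, 2], [8, 8], [9, 8], [4, 9], [4, 8]])
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
print(km.labels_)            # cluster assigned to each point
print(km.cluster_centers_)   # centroids found by the algorithm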

16
PRINCIPAL COMPONENT ANALYSIS
Principal Component Analysis (PCA) is an unsupervised, non-
parametric statistical technique primarily used for dimensionality reduction in
machine learning.
The ability to generalize correctly becomes exponentially harder
as the dimensionality of the training dataset grows, as the training set covers a
dwindling fraction of the input space. Models also become more efficient as the
reduced feature set boosts learning rates and diminishes computation costs by
removing redundant features.

Fig 4.1.3.2 Simple Graph of PCA
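A short sketch of PCA-based dimensionality reduction (illustrative only; the
feature matrix below is random, standing in for a table of clinical features):

# Reduce a 12-feature matrix to its two strongest principal components.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 12)            # 100 samples with 12 clinical-style features (random stand-in)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # share of variance kept by each component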

REINFORCEMENT
Reinforcement learning is the ability of an agent to interact with the
environment and find out the outcome. It is based on a "hit and trial" concept. In
reinforcement learning the agent is rewarded with positive or negative points, and
on the basis of the positive points the system is trained; on the basis of this
training it performs testing on the datasets.
17
PYTHON
Python is an interpreted, high-level and general-purpose
programming language. Python's design philosophy emphasizes code readability
with its notable use of significant indentation. Its language constructs and object-
oriented approach aim to help programmers write clear, logical code for small
and large-scale projects.
Python is dynamically-typed and garbage-collected. It supports
multiple programming paradigms, including structured (particularly, procedural),
object-oriented and functional programming. Python is often described as a
"batteries included" language due to its comprehensive standard library

IMPORTANCE OF PYTHON
1) Easy to Learn and Use
The Python language is incredibly easy to use and learn for
beginners and newcomers. It is one of the most accessible programming languages
available because it has a simplified syntax that is not overly complicated and
gives more emphasis to natural language.
2) Mature and Supportive Python Community
Python was created more than 30 years ago, which is a lot of
time for the community of any programming language to grow and mature
adequately to support developers ranging from beginner to expert levels. There
is plenty of documentation, and there are guides and video tutorials available for
the Python language that learners and developers of any skill level or age can use
to receive the support required to enhance their knowledge of the Python
programming language.

3) Support from Renowned Corporate Sponsors


A programming language grows faster when a corporate sponsor
backs it. For example, PHP is backed by Facebook, Java by Oracle and Sun,

18
Visual Basic & C# by Microsoft. Python Programming language is heavily
backed by Facebook, Amazon Web Services, and especially Google.
4) Hundreds of Python Libraries and Frameworks
Due to its corporate sponsorship and the big supportive community
of Python, Python has excellent libraries that you can use to save time and effort
in the initial cycle of development. There are also lots of cloud media services
that offer cross-platform support through library-like tools, which can be
extremely beneficial.
5) Versatility, Efficiency, Reliability, and Speed
Ask any python developer, and they will wholeheartedly agree
that the python language is efficient, reliable, and much faster than most modern
languages. Python can be used in nearly any kind of environment, and one will
not face any kind of performance-loss issue irrespective of the platform one is
working on.
6) Big data, Machine Learning and Cloud Computing
Cloud Computing, Machine Learning, and Big Data are some of
the hottest trends in the computer science world right now, which helps lots of
organizations to transform and improve their processes and workflows.
7) First-choice Language
Python language is the first choice for many programmers and
students due to the main reason for python being in high demand in the
development market. Students and developers always look forward to learning a
language that is in high demand. Python is undoubtedly the hottest cake in the
market now.
8) The Flexibility of Python Language
The python language is so flexible that it gives the developer the
chance to try something new. The person who is an expert in python language is
not just limited to build similar kinds of things but can also go on to try to make
something different than before.

19
IMPORTANCE OF PYTHON IN MACHINE LEARNING
AI projects differ from traditional software projects. The
differences lie in the technology stack, the skills required for an AI-based project,
and the necessity of deep research. To implement your AI aspirations, you should
use a programming language that is stable, flexible, and has tools available.
Python offers all of this, which is why we see lots of Python AI projects today.
• Simple and Consistent
• Extensive selection of libraries and frameworks
Keras, TensorFlow, and Scikit-learn for machine learning

NumPy for high-performance scientific computing and
data analysis
SciPy for advanced computing
Pandas for general-purpose data analysis
Seaborn for data visualization.
• Great community and popularity
• Platform independent
• Spam filters, recommendation systems, search engines, personal assistants,
and fraud detection systems are all made possible by AI and machine
learning, and there are definitely more things to come. Product owners
want to build apps that perform well. This requires coming up with
algorithms that process information intelligently, making software act like
a human.

• In the Python Developers Survey 2017, we observe that Python is
commonly used for web development. At first glance, web development
prevails, accounting for over 26% of the use cases shown in the image
below. However, if you combine data science and machine learning, they
make up a stunning 27%.

20
STREAMLIT
Streamlit is an open-source Python library that makes it easy to
create and share beautiful, custom web apps for machine learning and data
science. In just a few minutes you can build and deploy powerful data apps.

• Make sure that you have Python 3.6 - Python 3.8 installed.
• Install Streamlit using PIP and run the ‘helloworld’ app:
pip install streamlit

streamlit hello
To create a new app, follow the below steps
• Open a new Python file, import Streamlit, and write some code
• Run the file with:
streamlit run [filename]

Streamlit provides a caching mechanism that allows your app to
stay performant even when loading data from the web, manipulating large
datasets, or performing expensive computations.

Streamlit makes it easy for you to visualize, mutate, and share
data. The API reference is organized by activity type, like displaying data or
optimizing performance. Each section includes the methods associated with that
activity type, including examples.
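A minimal Streamlit app looks like the following sketch (the file name and widget
labels are arbitrary, not from the report's code); it is started with
"streamlit run hello_app.py".

# hello_app.py - minimal Streamlit app (illustrative only).
import streamlit as st

st.title("Hello, Streamlit")
name = st.text_input("Your name", "Type Here")
if st.button("Greet"):
    st.success("Welcome, {}!".format(name))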

21
GOOGLE COLABORATORY

Google is quite aggressive in AI research. Over many years,

Google developed an AI framework called TensorFlow and a development tool
called Colaboratory. Today TensorFlow is open-sourced, and since 2017 Google
has made Colaboratory free for public use. Colaboratory is now known as Google
Colab or simply Colab.
Another attractive feature that Google offers to developers is
the use of GPUs. Colab supports GPUs, and it is totally free. The reason for
making it free for the public could be to make its software a standard in academia
for teaching machine learning and data science. It may also have the long-term
perspective of building a customer base for Google Cloud APIs, which are sold
on a per-use basis.
Irrespective of the reasons, the introduction of Colab has eased
the learning and development of machine learning applications.

WHAT COLAB OFFERS US?

• Write and execute code in Python


• Document your code that supports mathematical equations
• Create/Upload/Share notebooks
• Import/Save notebooks from/to Google Drive
• Import/Publish notebooks from GitHub
• Import external datasets
• Integrate PyTorch, TensorFlow, Keras, OpenCV
• Free Cloud service with free GPU

22
JUPYTER NOTEBOOK
Jupyter Notebook is used as the simulation tool, and it is
comfortable for Python programming projects. A Jupyter notebook contains
rich text elements as well as code, including figures, equations, links and more.
Because of this mix of rich text elements and code, these documents are a perfect
place to bring together an analysis, its description and its results, and they can
execute data analysis in real time. Jupyter Notebook is an open-source, web-based
interactive environment supporting graphics, maps, plots, visualizations, and
narrative text.

23
4.2 MODULES AND DESCRIPTION

METHODOLOGY OF SYSTEM
i) About Dataset
ii) Attributes of a Data Set
iii) Steps followed in Data Analysis
• Data Cleaning
• Feature Engineering
• Training Data
• Testing Data

ABOUT DATASET:
Cardiovascular diseases (CVDs) are the number 1 cause of death
globally, taking an estimated 17.9 million lives each year, which accounts for
31% of all deaths worldwide. Heart failure is a common event caused by CVDs
and this dataset contains 12 features that can be used to predict mortality by heart
failure.
Most cardiovascular diseases can be prevented by addressing
behavioral risk factors such as tobacco use, unhealthy diet and obesity, physical
inactivity and harmful use of alcohol using population-wide strategies.
People with cardiovascular disease or who are at high
cardiovascular risk (due to the presence of one or more risk factors such as
hypertension, diabetes, hyperlipidemia or already established disease) need early
detection and management wherein a machine learning model can be of great
help.

24
ATTRIBUTES OF DATASET

1. Age (patient’s Age)


2. Anemia (Decrease of red blood cells or hemoglobin)
3. Creatinine_phosphokinase (Level of the CPK enzyme in the blood
(mcg/L))
4. Diabetes (If the patient has diabetes)
5. Ejection_fraction (Percentage of blood leaving the heart at each
contraction)
6. High_blood_pressure (If the patient has hypertension)
7. Platelets (Platelets in the blood (kiloplatelets/mL))
8. Serum_creatinine(Level of serum creatinine in the blood (mg/dL))
9. Serum_sodium(Level of serum sodium in the blood (mEq/L))
10. Sex (Woman or man)
11. Smoking (if a patient has a habit of smoking or not)
12. Death Event
13. Time (patient’s Follow-up Period)

25
STEPS IN DATA ANALYSIS:

Fig 4.2. Steps in Data Analysis

DATA CLEANING:
Data cleansing or data cleaning is the process of detecting and
correcting corrupt or inaccurate records from a record set, table, or database and
refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the
data and then replacing, modifying, or deleting the dirty or coarse data.
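As a small sketch of such checks on the heart failure dataset (the column handling
here is illustrative, not the report's exact cleaning code):

# Basic data-cleaning checks on the heart failure dataset.
import pandas as pd

df = pd.read_csv('heart_failure_clinical_records_dataset.csv')
print(df.isnull().sum())      # count missing values per column
df = df.drop_duplicates()     # remove duplicate records
df = df.dropna()              # drop rows with missing values, if any exist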

FEATURE ENGINEERING:
Feature engineering is the process of using domain knowledge
to extract features from raw data via data mining techniques. These features can
be used to improve the machine learning algorithms. Feature engineering can be
considered as applied machine learning itself.

26
TRAINING DATA
The observations in the training set form the experience that the
algorithm uses to learn. In supervised learning problems, each observation
consists of an observed output variable and one or more observed input variables.

TESTING DATA
The test set is a set of observations used to evaluate the
performance of the model using some performance metric. It is important that no
observations from the training set are included in the test set. If the test set does
contain examples from the training set, it is difficult to assess whether the
algorithm has learned to generalize from the training set or has simply memorized it.

We have taken 80% of the data as training data and the remaining 20% as testing data.
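A sketch of this 80/20 split (the feature/target choice mirrors the report's code; the
random_state is an added assumption for reproducibility):

# 80% training / 20% testing split of the heart failure dataset.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('heart_failure_clinical_records_dataset.csv')
X = df.drop(columns=['DEATH_EVENT'])
y = df['DEATH_EVENT']

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(len(x_train), len(x_test))   # roughly 80% and 20% of the rows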

27
MACHINE LEARNING MODEL
A machine learning model is a file that has been trained to
recognize certain types of patterns. You train a model over a set of data, providing
it an algorithm that it can use to reason over and learn from that data.

Types of models we have used:

• Logistic Regression
• Naive Bayes
• Support Vector Machine
• Random Forest
• Decision Tree
• Principal Component Analysis

LOGISTIC REGRESSION:
Logistic regression is a statistical model that in its basic form
uses a logistic function to model a binary dependent variable, although many
more complex extensions exist. In regression analysis, logistic regression (or logit
regression) is estimating the parameters of a logistic model

Fig 4.2.1 Logistic Regression Statistical Model
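The logistic (sigmoid) function at the heart of this model maps any linear score to
a value between 0 and 1; a tiny sketch (illustrative only):

# The logistic function maps a linear score z to a probability between 0 and 1.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for z in (-4, -1, 0, 1, 4):
    print(z, round(sigmoid(z), 3))   # output always lies in (0, 1)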

28
NAIVE BAYESIAN
Naive Bayes classifiers are a collection of classification algorithms
based on Bayes' Theorem. It is not a single algorithm but a family of algorithms
where all of them share a common principle, i.e., every pair of features being
classified is independent of each other.

SUPPORT VECTOR MACHINE


Support vector machines (SVMs) are powerful yet flexible
supervised machine learning algorithms which are used both for classification
and regression. But generally, they are used in classification problems. SVMs
have their unique way of implementation as compared to other machine learning
algorithms.

Fig 4.2.2. Support Vector Machine


29
RANDOM FOREST
Random forests or random decision forests are an ensemble learning
method for classification, regression and other tasks that operate by constructing
a multitude of decision trees at training time and outputting the class that is the
mode of the classes or mean/average prediction of the individual trees.

Fig 4.2.3. Graph of Random Forest


DECISION TREE
Decision trees use multiple algorithms to decide to split a node
into two or more sub-nodes. The creation of sub-nodes increases the homogeneity
of resultant sub-nodes. The decision tree splits the nodes on all available variables
and then selects the split which results in most homogeneous
sub-nodes.

Fig 4.2.4. Decision Tree Algorithm

30
PRINCIPAL COMPONENT ANALYSIS
Principal Component Analysis, or PCA, is a dimensionality-
reduction method that is often used to reduce the dimensionality of large datasets,
by transforming a large set of variables into a smaller one that still contains most
of the information in the large set.

MODEL EVALUATION

Model evaluation aims to estimate the generalization accuracy of a model
on future (unseen/out-of-sample) data. Methods for evaluating a model's
performance are divided into two categories, namely holdout and
cross-validation. Both methods use a test set (i.e., data not seen by the model) to
evaluate model performance.

Fig 4.2.5. Evaluation of System Model
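A short sketch of the cross-validation alternative (illustrative; logistic regression
and a 5-fold split are assumptions, not the report's exact evaluation code):

# 5-fold cross-validation on the heart failure dataset.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv('heart_failure_clinical_records_dataset.csv')
X, y = df.drop(columns=['DEATH_EVENT']), df['DEATH_EVENT']

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())   # accuracy on each fold and its average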

31
CONFUSION MATRIX
When we get the data, after data cleaning, pre-processing and wrangling,
the first step is to feed it to a model and obtain output in probabilities. But how
can we measure the effectiveness of our model? The better the effectiveness, the
better the performance, and that is exactly what we want. This is where the
confusion matrix comes into the limelight. The confusion matrix is a performance
measurement for machine learning algorithms.

Need of Confusion Matrix


Well, it is a performance measurement for machine learning classification
problem where output can be two or more classes.
It is a table with 4 different combinations of predicted and actual values.
It is extremely useful for measuring Recall, Precision, Specificity,
Accuracy and most importantly AUC-ROC Curve.

Figure. Need of the Confusion Matrix

32
ACCURACY CALCULATION

The accuracy of the algorithms depends on four values, namely true
positive (TP), false positive (FP), true negative (TN) and false negative (FN).

Accuracy = (TP + TN) / (TP + FP + TN + FN)    (2)

The values TP, FP, TN and FN are defined as:

TP = number of people with heart disease correctly predicted to have heart disease
TN = number of people without heart disease correctly predicted to not have heart disease
FP = number of people without heart disease incorrectly predicted to have heart disease
FN = number of people with heart disease incorrectly predicted to not have heart disease
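A small numeric sketch of equation (2), together with the sensitivity and
specificity mentioned in the abstract (the four counts below are hypothetical):

# Accuracy, sensitivity and specificity from a 2x2 confusion matrix (made-up counts).
TP, TN, FP, FN = 40, 35, 10, 15

accuracy = (TP + TN) / (TP + FP + TN + FN)   # equation (2)
sensitivity = TP / (TP + FN)                 # true positive rate (heart failure cases caught)
specificity = TN / (TN + FP)                 # true negative rate
print(accuracy, sensitivity, specificity)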

33
CHAPTER 5
RESULTS AND DISCUSSIONS
SAMPLE CODING:

HeartFailurePrediction.ipynb
# Read the CSV file and print the number of rows and columns
import pandas as pd
df = pd.read_csv('heart_failure_clinical_records_dataset.csv')
df.shape

# Pair plot the read dataset which gives correlation between columns
import seaborn as sns
sns.pairplot(df)

#Heat map
sns.heatmap(df.corr(), annot = True)

#Heat map with null values

sns.heatmap(df.isnull())

# Correlation
correlation = df.corr()
print(correlation)

# Describe the dataset


df.describe()

# First and last columns


df.head()

34
df.tail()

# Data preprocessing
X = df.drop(columns=['DEATH_EVENT','anaemia','diabetes','high_blood_pressure','sex','smoking'])
X
y=df['DEATH_EVENT']
y

# Confusion Matrix for logistic regression

import matplotlib.pyplot as plt


from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = LogisticRegression()
model.fit(x_train,y_train)
predictions = model.predict(x_test)
model = metrics.confusion_matrix(y_test, predictions)
print("Predictions made by logistic regression:")
print(model)
sns.heatmap(pd.DataFrame(model), annot=True, fmt='d')

# Accuracy for Logistic Regression

from sklearn.linear_model import LogisticRegression


from sklearn import metrics
import matplotlib.pyplot as plt
X_axis = list(range(1, 20))
acc1 = pd.Series(dtype=float)
x = range(1, 20)
for i in list(range(1, 20)):
    model = LogisticRegression()
    model.fit(x_train, y_train)
    predictions = model.predict(x_test)
    acc1 = pd.concat([acc1, pd.Series([metrics.accuracy_score(predictions, y_test)])], ignore_index=True)
plt.plot(X_axis, acc1)
plt.xticks(x)
plt.title("Logistic Graph")
plt.xlabel("n_estimators")
plt.ylabel("Accuracy")
plt.grid()
plt.show()
print('Highest value: ', acc1.values.max())
acc1

# Confusion Matrix for Random Forest

from sklearn.ensemble import RandomForestClassifier


random_model = RandomForestClassifier(n_estimators=100)
random_model.fit(x_train, y_train)
random_predictions = random_model.predict(x_test)
random_model = metrics.confusion_matrix(y_test, random_predictions)
print("Predictions made by random forest are:")
print(random_model)
sns.heatmap(pd.DataFrame(random_model), annot=True, fmt='d')

# Accuracy for Random Forest


from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
X_axis = list(range(1, 20))
acc2 = pd.Series(dtype=float)
x = range(1, 20)
for i in list(range(1, 20)):
    random_model = RandomForestClassifier(n_estimators=100)
    random_model.fit(x_train, y_train)
    random_predictions = random_model.predict(x_test)
    acc2 = pd.concat([acc2, pd.Series([metrics.accuracy_score(random_predictions, y_test)])], ignore_index=True)
plt.plot(X_axis, acc2)
plt.xticks(x)
plt.title("Random forest Graph")
plt.xlabel("n_estimators")
plt.ylabel("Accuracy")
plt.grid()
plt.show()
print('Highest value: ', acc2.values.max())
acc2

# Confusion Matrix for Naïve Bayes


from sklearn.naive_bayes import GaussianNB
nbmodel = GaussianNB()
nbmodel.fit(x_train, y_train)
nb_predictions = nbmodel.predict(x_test)
nbmodel = metrics.confusion_matrix(y_test, nb_predictions)
print("Predictions that are made by naive bayes are:")
print(nbmodel)
sns.heatmap(pd.DataFrame(nbmodel), annot=True, fmt='d')

# Accuracy for Naïve Bayes


from sklearn.naive_bayes import GaussianNB
from sklearn import metrics
X_axis = list(range(1, 20))
acc3 = pd.Series(dtype=float)
x = range(1, 20)
for i in list(range(1, 20)):
    nbmodel = GaussianNB()
    nbmodel.fit(x_train, y_train)
    nb_predictions = nbmodel.predict(x_test)
    acc3 = pd.concat([acc3, pd.Series([metrics.accuracy_score(nb_predictions, y_test)])], ignore_index=True)
plt.plot(X_axis, acc3)
plt.xticks(x)
plt.title("Naive bayes Graph")
plt.xlabel("n_estimators")
plt.ylabel("Accuracy")
plt.grid()
plt.show()
print('Highest value: ', acc3.values.max())
acc3

# Confusion Matrix for Decision Tree


from sklearn.tree import DecisionTreeClassifier
dtmodel = DecisionTreeClassifier()
dtmodel.fit(x_train, y_train)
dt_predictions = dtmodel.predict(x_test)
dtmodel = metrics.confusion_matrix(y_test, dt_predictions)
print("Predictions that are made by decision tree classifier are:")
print(dtmodel)
sns.heatmap(pd.DataFrame(dtmodel), annot=True, fmt='d')

# Accuracy for Decision Tree

from sklearn.tree import DecisionTreeClassifier


from sklearn import metrics
X_axis = list(range(1, 20))
acc4 = pd.Series(dtype=float)
x = range(1, 20)
for i in list(range(1, 20)):
    dtmodel = DecisionTreeClassifier()
    dtmodel.fit(x_train, y_train)
    dt_predictions = dtmodel.predict(x_test)
    acc4 = pd.concat([acc4, pd.Series([metrics.accuracy_score(dt_predictions, y_test)])], ignore_index=True)
plt.plot(X_axis, acc4)
plt.xticks(x)
plt.title("Decision tree Graph")
plt.xlabel("n_estimators")
plt.ylabel("Accuracy")
plt.grid()
plt.show()
print('Highest value: ', acc4.values.max())
acc4

38
# Plotting accuracies in a graph

import numpy as np
algos = ["Logistic", "Random forest", "Naive bayes", "Decision tree"]
lr = acc1.values.max() * 100
rf = acc2.values.max() * 100
nb = acc3.values.max() * 100
dt = acc4.values.max() * 100
accuracy = [lr, rf, nb, dt]
xpos = np.arange(len(algos))
plt.bar(xpos, accuracy, width=0.8, align="center", color=['red', 'green', 'blue', 'brown'], ec="black")
for i in range(len(algos)):
    plt.text(i, accuracy[i], accuracy[i], ha="center", va="bottom")
plt.xticks(xpos, algos)
plt.xlabel('Comparison of algorithms')
plt.ylabel('Percentage')
plt.title("Accuracy")
39
App.py

import streamlit as st
import pickle
import numpy as np

model = pickle.load(open('hf1.pkl', 'rb'))

def predict_forest(age, anaemia, creatinine_phosphokinase, diabetes, ejection_fraction,
                   high_blood_pressure, platelets, serum_creatinine, serum_sodium,
                   sex, smoking, time):
    input = np.array([[age, anaemia, creatinine_phosphokinase, diabetes, ejection_fraction,
                       high_blood_pressure, platelets, serum_creatinine, serum_sodium,
                       sex, smoking, time]]).astype(np.float64)
    prediction = model.predict_proba(input)
    pred = '{0:.{1}f}'.format(prediction[0][0], 2)
    return float(pred)

def main():
    st.title("Streamlit services")
    html_temp = """
    <div style="background-color:#025246;padding:10px">
    <h2 style="color:white;text-align:center;">Heart failure prediction app</h2>
    </div>
    """
    st.markdown(html_temp, unsafe_allow_html=True)
    age = st.text_input("age", "Type Here")
    anaemia = st.text_input("anaemia", "Type Here")
    creatinine_phosphokinase = st.text_input("creatinine_phosphokinase", "Type Here")
    diabetes = st.text_input("diabetes", "Type Here")
    ejection_fraction = st.text_input("ejection_fraction", "Type Here")
    high_blood_pressure = st.text_input("high_blood_pressure", "Type Here")
    platelets = st.text_input("platelets", "Type Here")
    serum_creatinine = st.text_input("serum_creatinine", "Type Here")
    serum_sodium = st.text_input("serum_sodium", "Type Here")
    sex = st.text_input("sex", "Type Here")
    smoking = st.text_input("smoking", "Type Here")
    time = st.text_input("time", "Type Here")

    safe_html = """
    <div style="background-color:#F4D03F;padding:10px">
    <h2 style="color:white;text-align:center;">You are safe</h2>
    </div>
    """
    danger_html = """
    <div style="background-color:#F08080;padding:10px">
    <h2 style="color:black;text-align:center;">You are in danger</h2>
    </div>
    """

    if st.button("Predict"):
        output = predict_forest(age, anaemia, creatinine_phosphokinase, diabetes,
                                ejection_fraction, high_blood_pressure, platelets,
                                serum_creatinine, serum_sodium, sex, smoking, time)
        st.success('The probability of heart failure is {}'.format(output))

        if output > 0.5:
            st.markdown(danger_html, unsafe_allow_html=True)
        else:
            st.markdown(safe_html, unsafe_allow_html=True)

if __name__ == '__main__':
    main()

41
Main.py
import streamlit as st
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

st.title('HEART FAILURE PREDICTION')
classifier_name = st.sidebar.selectbox('Select classifier', ('KNN', 'SVM', 'Random Forest'))

df = pd.read_csv("hf.csv")
X, y = df.iloc[:, :-1], df['DEATH_EVENT']
#print(X)

def add_parameter_ui(clf_name):
    params = dict()
    if clf_name == 'SVM':
        C = st.sidebar.slider('C', 0.01, 10.0)
        params['C'] = C
    elif clf_name == 'KNN':
        K = st.sidebar.slider('K', 1, 15)
        params['K'] = K
    else:
        max_depth = st.sidebar.slider('max_depth', 2, 15)
        params['max_depth'] = max_depth
        n_estimators = st.sidebar.slider('n_estimators', 1, 100)
        params['n_estimators'] = n_estimators
    return params

params = add_parameter_ui(classifier_name)

def get_classifier(clf_name, params):
    clf = None
    if clf_name == 'SVM':
        clf = SVC(C=params['C'])
    elif clf_name == 'KNN':
        clf = KNeighborsClassifier(n_neighbors=params['K'])
    else:
        clf = RandomForestClassifier(n_estimators=params['n_estimators'],
                                     max_depth=params['max_depth'], random_state=1234)
    return clf

clf = get_classifier(classifier_name, params)

#### CLASSIFICATION ####

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)

clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

acc = accuracy_score(y_test, y_pred)

st.write(f'Classifier = {classifier_name}')
st.write('Accuracy =', acc)

43
PROGRAM RESULTS:

44
45
46
47
48
49
5.1 RESULTS:

Pair-Plot Graph:

50
Heat Map:

Fig. Heat Map of Subplot

Heat Map of null values in the data set:

Fig. Heat Map of Null Values

51
Correlation

Description of Dataset:

Head of Dataset:

52
Heat map of confusion matrix Logistic Regression:

Fig. Heat Map of Confusion Matrix

Logistic Graph:

Fig. Graph Represents Logistic Algorithm


53
Heat map of confusion matrix Random Forest classifier

Fig. Heat map of RF Classifier

Random forest Graph:

Fig. Graph of RF Algorithm


54
Heat map of confusion matrix Naive Bayes:

Fig. Heat map of Confusion Matrix Naïve Bayes

Naive Bayes Graph:

Fig. Graph of Naive Bayes

55
Heat map of confusion matrix Decision Tree Classifier:

Fig. Heat Map of Decision Tree Matrix

Decision Tree Graph:

Fig. Graph represents Decision Tree


56
Accuracy Graph:

Fig. Graph Representing Accuracy

Tail of Dataset:

57
Values of X:

Values of Y:

58
Output screen of app.py

59
Output screen after Prediction:

60
CHAPTER 6

CONCLUSION AND FUTURE ENHANCEMENTS

6.1 CONCLUSION:
In this project we made a model that gives us the best accuracy among
all the machine learning algorithms considered, and by using that model we have
made a web service using Streamlit which connects the back end of the ML model
to the front end that we designed. With the help of this, many people can learn
their health condition early and take care of themselves accordingly.

6.2 FUTURE ENHANCEMENTS:


We have tried it on our local host. We are going to extend the deployed
model by adding features such as including all the algorithms and showing the
ROC curve to the people who use it for prediction, and after making these changes
we can implement it as an online website. It can also be used by medical
professionals as a second reference.

61
REFERENCES

[1] Aditi Gavhane, Gouthami Kokkula, Isha Panday, Prof. Kailash Devadkar,
“Prediction of Heart Disease using Machine Learning”, Proceedings of the
2nd International conference on Electronics, Communication and
Aerospace Technology (ICECA), 2018.

[2] Himanshu Sharma and M A Rizvi, “Prediction of Heart Disease using


Machine Learning Algorithms: A Survey” International Journal on Recent
and Innovation Trends in Computing and Communication Volume: 5
Issue: 8, IJRITCC August 2017.

[3] Rairikar, A., Kulkarni, V., Sabale, V., Kale, H., & Lamgunde, A. (2017,
June). Heart disease prediction using data mining techniques. In 2017
International Conference on Intelligent Computing and Control (I2C2) (pp.
1-8). IEEE.

[4] Aldallal, A., & Al-Moosa, A. A. A. (2018, September). Using Data Mining
Techniques to Predict Diabetes and Heart Diseases. In 2018 4th
International Conference on Frontiers of Signal Processing (ICFSP) (pp.
150-154). IEEE.

[5] Pahulpreet Singh Kohli and Shriya Arora, “Application of Machine


Learning in Diseases Prediction”, 4th International Conference on
Computing Communication and Automation (ICCCA), 2018.

[6] Hazra, A., Mandal, S., Gupta, A. and Mukherjee, “A Heart Disease
Diagnosis and Prediction Using Machine Learning and Data Mining
Techniques: A Review” Advances in Computational Sciences and
Technology, 2017

62
