0% found this document useful (0 votes)

0 views53 pages

Vikash_Rai_Project_Report

The document is a project report for a Bachelor of Technology degree focused on a disease prediction application using machine learning techniques. It outlines the development of a web application that predicts diseases like breast cancer, heart disease, and diabetes based on patient data and recommends suitable hospitals and doctors. The project aims to improve healthcare decision-making by providing accurate predictions and reliable recommendations to patients.

Uploaded by

raiv09827

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

0 views53 pages

Vikash_Rai_Project_Report

Uploaded by

raiv09827

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 53

DISEASE PREDICTION APPLICATION USING

MACHINE LEARNING

Major Project Report submitted in partial fulfilment of the requirement

for the award of degree of

Bachelor of Technology
In
Information Technology

Under the Supervision of

Dr.Tripti Rathee

Vikash Kumar Rai (02196303121)

MAHARAJA SURAJMAL INSTITUTE OF TECHNOLOGY

C-4, Janakpuri, New Delhi-58

Affiliated to Guru Gobind Singh Indraprastha University, Delhi

May, 2025
DEPARTMENT OF INFORMATION TECHNOLOGY

BONAFIDE CERTIFICATE

This is to certify that this Project Report is the bonafide work of V i k a s h K u m a r

R a i (02196303121), who carried out the project entitled “Disease Prediction
Application Using Machine Learning” under my supervision from January 2025
to May 2025.

InternalGuide

Dr. Tripti Rathee

Submitted for Viva Voce Examination held on

Internal Examiner External Examiner

DECLARATION

I Vikash Kumar Rai(02196303121) hereby declare that the Project Report

entitled Disease Prediction Application Using Machine Learning done by me

under the guidance of Dr.Tripti Rathee (Internal) is submitted in partial

fulfillment of the requirements for the award of Bachelor of Engineering /

Technology degree in Computer Science and Engineering.

DATE:
PLACE:
SIGNATURE OF THE STUDENT
ACKNOWLEDGEMENT

I am pleased to acknowledge my sincere thanks to Board of Management of

MAHARAJA SURAJMAL INSTITUTE OF TECHNOLOGY for their kind
encouragement in doing this project and for completing it successfully. I am
grateful to them.

I would like to express my sincere and deep sense of gratitude to my Project Guide
Dr. Tripti Rathee for his valuable guidance, suggestions and constant
encouragement paved way for the successful completion of my project work.

I wish to express my thanks to all Teaching and Non-teaching staff members of the
Department of INFORMATION TECHNOLOGY who were helpful in many
ways for the completion of the project.
ABSTRACT

The health care systems collects data and reports from the hospitals or
patient's database by machine learning and data processing techniques
which is employed to predict the disease so as to create reports
supported the results which used for various kinds of predictions for
disease and which is that the leading explanation for the human's death
since past years. Medical reports and data had been extracted from
various databases to predict a number of the required diseases which are
commonly found in people nowadays breast cancer, heart disease and
diabetes disease and make their life more critical to measure. Nowadays
technology advancement within the health care industry has been
helping people to create their process easier by suggesting hospitals and
doctors to travel to for his or her treatment, where to admit and which
hospitals are the simplest for the treating the desired disease. we've
implemented this sort of system in our application to form people’s life
simpler by predicting the disease by inputting certain data from their
reports which can give the result positive or negative supported the
disease prediction they are going to be having a choice to get
recommendation of best hospitals with best doctors nearby from the
past users or guardians..

i
LIST OF FIGURES

FIGURE NO. FIGURE NAME. PAGE NO.

4.1 Heart disease data set details. 10

4.2 Correlation formula. 12
4.3 Diabetes disease data set details. 12
4.4 Breast cancer data set details. 14
4.5 System Architecture. 15
4.6 Flowchart Diagram. 17
4.7 Algorithm selection for diabetes disease prediction. 22
4.8 Algorithm selection for heart disease prediction. 23
4.9 Algorithm selection for breast cancer prediction. 23
4.10 Logistic regression graph. 25
4.11 Random Forest algorithm architecture. 28
5.1 Diabetes disease prediction result. 32
5.2 Heart disease prediction result. 32
5.3 Breast cancer prediction result. 33
5.4 Performance analysis of logistic regression
For diabetes disease prediction. 34
5.5 Performance analysis of Random Forest classifier
For heart disease prediction. 35
5.6 Performance analysis of Random Forest classifier
For breast cancer prediction. 35
7.1 Plagiarism Report 45

ii
LIST OF TABLES

TABLE NO. TABLE NAME. PAGE NO.

4.1 Hardware Requirements 8

4.2 Software Requirements 8
4.3 Heart disease data set fields 10
4.4 Diabetes disease data set fields 11
4.5 Breast cancer data set fields 13

iii
TABLE OF CONTENTS

CHAPTER NO. TITLE PAGE NO.

ABSTRACT i

LIST OF FIGURES ii
LIST OF TABLES iii

1. INTRODUCTION 1

2. LITERATURE SURVEY 2

3. AIM AND SCOPE OF THE PRESENT 5

INVESTIGATION

4. EXPERIMENT OR MATERIALS AND

METHODS, ALGORITHMS USED 6

4.1 EXISTING AND PROPOSED 6

SYSTEM, HARWARE
AND SOFTWARE
CONFIGURATION.

4.2 DATASET DETAILS 9

4.3 METHODOLOGY 15

4.4 ALGORITHM MODELS 23

5. RESULTS AND DISCUSS,

PERFORMANNCE ANALYSIS 31

5.1 RESULTS 31

5.2 PERFORMANCE EVALUETION 34

6. CONCLUSION AND FUTURE WORK 36

6.1 SUMMARY AND CONCLUSION 36

6.2 FUTURE WORK 37

iv
REFERENCES 39

APPENDICES

A. SCREENSHOTS 40

B. PLAGARISM REPORT 44

v
CHAPTER 1
INTRODUCTION
Properly analyzing clinical documents about patients’ health anticipate the
possibility of occurrence of various diseases. In addition, acquiring
information regarding specialists of that particular disease as per the
requirement facilitates proper and efficient diagnosis. This Project provides a
novel method that uses data mining technique, namely, Logistic regression and
random forest classification algorithm for prediction of disease. Using medical
profiles such as heart rate, blood pressure through sensors and other externally
observable symptoms such as fever, cold, headache etc. that patient has,
prediction of likelihood of a disease is done. Logistic regression and random
forest classification algorithm takes these symptoms and predicts disease.
Furthermore, all the needful and adequate information regarding the predicted
disease as well as the recommended doctors is provided. Recommendation
(Future implementation) suggests the location , contact and other necessary
details of the disease specialists based on the filters chosen by the user out of
less fees, more experience, nearest location and feedback reviews of the
doctors. algorithm. Thus user can get appropriate treatment and necessary
medical advice as fast as possible. Additionally, users provide their feedback
for the recommended doctors which are then added for analysis in order to
make further recommendations based on reviews.
Healthcare industry generates terabytes of data every year. The medical
documents maintained are a pool of information regarding patients. The task
of extracting useful in formation or quality healthcare is tricky and important.
By analyzing these voluminous data we can predict the occurrence of the
disease and safe guard people. Thus, an intelligent system for disease
prediction plays a major role in controlling the disease and maintaining the
good health status for people by providing accurate and trustworthy disease
risk prediction.

40
CHAPTER 2

LITERATURE SURVEY

SL AUTHOR YEAR DESCRIPTION PROS CONS

NO.
1 M. Denil, D. 2014 explored the This paper It does not
Matheson, and relation between focuses on focus on the
N. De Freitas the willingness to proper patients
recommend the hospital point of view
hospital and other Hygiene and on easy
satisfaction staff. access and
identifiers. Author : bes choice
Klink Enberg This of hospital.
paper finds that
hospitals that focus
on resources to
improve aspects of
care such as nurses
and physician
respect, respect,
obedience, room
hygiene, etc. This
paper does not look
at patient data. The
literature in the
HCAHPS database
analysis is largely
driven by
hypothesis and only
looks at specific
aspects of patient
satisfaction or
census.

2 Watson, F. 1994 Using retrospect, It provides It lacks a

Marir they concluded that the idea that technological
non-Spanish whites patients easy access
on average always to the best
tend to go to prefer best hospitals.
hospitals that offer hospital first.
a better patient
experience for

40
all patients
compared to
hospitals commonly
used by African
American, Hispanic,
Asian / Pacific
Islander, or
multiracial
patients

3 W. Bergerud 1996 Investigated the It used the Its lacks in

relationship patients doctors point
between perspective of view who
postoperative to make is the most
morbidity and improvement imp in these
mortality and for the systems.
patients’ hospital.
perspectives of care
in surgical patients.
Author: Sheetz et al
In their article, the
satisfaction points
used were used
along with the
registered Registry
of the Michigan
Collaborative Clinic
as a measure of
patient care ideas.
A few studies have
examined the
relationship
between one
satisfaction
question and one or
more patient
information.

4 Binal T. et al 2010 Healthcare decision Healthcare It lacks in

support system support accuracy in
forswine flu system for prediction of
prediction using swine flue. the disease.
naïve bayes
classifier, focuses
onthe aspect of
medical diagnosis

40
by learning patterns
throughthe
collected data for
swine flu using naïve
bayes classifierfor
classifying the
patients of swine flu
into three
categories(least
possible, probable or
most probable),
resulting into
anaccuracy of nearly
63.33%.
Datasets used for
thisclassification
were limited in
number

40
CHAPTER 3

AIM AND SCOPE OF THE PRESENT INVESTIGATION

When we see around there are many patients that does not get the right
treatment at the right time because of their lack of decision taking about the
choice of hospital and doctors, they don’t know what you do now and end up
very serious at the end. The objective of the project is to provide the service to
patients by suggesting them the best hospital to find their cure for their
existing disease . The project is to provide a very easy solution for the
patients to get recommendation to what doctor or hospital they need to go
after diagnosed with a severe disease.This web application can find the
solution to that, no need of thinking about what should be done after
diagnosed with a severe disease. This web application handles reports to make
predictions and give results accordingly to that, a best hospitals can be
selected for their treatment and more lives can be saved.
The scope of the project is to provide a very easy solution for the patients to
get recommendation to what doctor or hospital they need to go after diagnosed
with a severe disease. In this project we will be using web development to
develop a web application that will help us to achieve our target and with
machine learning algorithms to predict the disease by using random forest
classifier algorithm and recommend the best doctors and the best hospitals
nearby by using collaborative filtering technique.

40
CHAPTER 4

EXPERIMENT OR MATERIALS AND METHODS,

ALGORITHMS USED

4.1 EXISTING AND PROPOSED SYSTEM

4.1.1 EXISTING SYSTEM

• Several online health care system has invented new ideas to

benefit people and so many online applications have features to
give recommendations on hospital and doctors.
• But they have lack of reliability and accuracy where they need to
do improvisations in the features and modules. Genuinely health
care systems might not upload the opinions of people in some
cases for the negative response and by doing manually while
collecting feedbacks from the patients, might be patients hesitate
to give complete opinion of doctors or hospitals in front of
persons where we will find the lack of quality.
• In total we have not found all features and modules at a time in
one application and there are different types of applications for
different type of diseases where they have different applications
separately for doctors and hospitals to give recommendations.

Disadvantages of Existing system:

1) Difficulties in finding the best doctors for a particular disease.

2) Tough to discover the hospitals based on the recommendation.

40
4.1.1 PROPOSED WORK

• In this research we have found the solution for the issues facing
in existing system where we have proposed the accuracy,
reliability and efficiency by developing the features of three
diseases called Heart disease, cancer disease and diabetes where
we will find most common diseases in people health and we
have installed in one application with prediction of three
diseases by analysing the symptoms collected from the patient’s
record and taking positive and negative opinions from patient’s
according to that we will give ratings to the hospitals and
doctors from best to worst.
• Guardians opinions is also very much important and they can
give feedback of them like how they were treating their patients?
Was it friendly or strictly? And how the hospital management
is? Was it clean? How is the hospitality ? When the feedback
comes to online so that patients and guardians can give both
positive and negative opinions completely without any
hesitation.
• Based on that we can provide truthful recommendations of
hospital and doctors for the people and can predict the results.
According to that prediction of particular disease we will predict
best suitable hospital and doctor to consult and to get admit into
it.

Advantages Of Proposed System:

1) Easy way of accessing the application with best

recommendations on both Hospitals and doctors.
2) Application has multiple options to make decision easily on diseases

40
4.1.2 SOFTWARE AND HARDWARE CONFIGURATIONS

🠶 Hardware: Desktop or Laptop or present generation devices.

Hardware requirements
Hardware Minimum requirements

Computer 4 GHz minimum, multi-core processor

Memory (RAM) At least 4GB, preferably higher, and

8ommensurate with concurrent usage

Hard disk space At least 10 GB

Table 4.1: Hardware Requirements.

Software requirements

Software Minimum requirements

Operating system Windows Server 2012 R2 or above

Microsoft .Net Framework v4.6.1 The HelpMaster Web Portal has been written to
(or higher) use Microsoft IIS ASP.NET technology and as such
requires the machine that IIS is running on to have
the Microsoft .NET v4.6.1 (or higher) Framework
installed as well as the ASP.Net 4.5 and .Net
Extensibility 4.5 features enabled.

Table 4.2: Software Requirements.

• Hardware : Desktop or Laptop.

• Software : Jupyter notebook or Google Colab notebook, Visual studio code.
• Programming Language : Python Programming Language,
PHP, HTML, CSS, Javascript.

40
4.2 DATASET DETAILS

4.2.1 DATASET NAMES

A data set (or dataset) is a collection of data. In the case of tabular data,
a data set corresponds to one or more database tables, where every
column of a table represents a particular variable, and each row
corresponds to a given record of the data set in question. The data set
lists values for each of the variables, such as height and weight of an
object, for each member of the data set. Data sets can also consist of a
collection of documents or files. The sources of the datasets are from
Kaggle.com.
The datasets that are used are:

• Heart disease dataset – (heart_disease.csv)

• Diabetes disease dataset – (diabetes_disease.csv)

• Breast cancer dataset – (breast_cancer.csv)

4.2.1 HEART DISEASE DATASET DETAILS

The heart disease datasets consists of 303 rows 14 columns.

The fields: age, sex, cp, trestbp,chol, fbs, restecg, thalach,exan,goldpeak,

slope, ca, thal. We have taken the 14th column as the target variable
(target).

40
4.2.1.1 TABLE OF DATASET FIELDS

Column Non-Null Count Data types

Age 303 non-null Int 64

Sex 303 non-null Int 64

Cp 303 non-null Int 64

Threstbps 303 non-null Int 64

Chol 303 non-null Int 64

Fbs 303 non-null Int 64

Restecg 303 non-null Int 64

Exang 303 non-null Int 64

Oldpeak 303 non-null Int 64

Slope 303 non-null Int 64

Ca 303 non-null Int 64

Thal 303 non-null Int 64

Target 303 non-null Int 64

Table 4.3: Heart disease data set fields.

DATA SET VISUALIZATION

Fig 4.1: Heart disease dataset details.

40
4.2.2 DIABETES DISEASE DATASET DETAILS

The diabetes disease dataset consists of 768 rows, 9 columns

Column field names : Pregnancies, Glucose , BloodPressure,
SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age.The last
column is the target (outcome).

TABLE OF DATASET FIELDS

Column Non-Null count Data Type

Pregnancies 768 non-null int64

Glucose 768 non-null int64

Blood pressure 768 non-null int64

Skin Thickness 768 non-null int64

Insulin 768 non-null int64

BMI 768 non-null float64

Diabeta pedigree Function 768 non-null float64

Age 768 non-null int64

Outcome 768 non-null int64

Table 4.4: Diabetes disease data set fields.

40
From the above 7 input fields we are only choosing the two input fields
[Glucose, BMI] based on Correlation Pearson method.

Fig 4.2: Correlation formula.

DATA SET VISUALIZATION

Fig 4.3: Diabetes disease dataset details.

40
4.2.3 BREAST CNACER DATASET DETAILS

The breast cancer dataset consists of 569 rows, 31 columns.

Column field names : radius_mean, texture_mean, perimeter_mean,
area_mean, smoothness_mean, compactness_mean, concavity_mean,
concave points_mean, symmetry_mean, fractal_dimension_mean,
radius_se, texture_se, perimeter_se, area_se, smoothness_se,
compactness_se, concavity_se, concave points_se, symmetry_se,
fractal_dimension_se, radius_worst, texture_worst, perimeter_worst,
area_worst, smoothness_worst, compactness_worst, concavity_worst,
concave points_worst, symmetry_worst, fractal_dimension_worst.
Diagnosis will be the target variable (M=1, B=0)

TABLE OF DATASET FIELDS

Column Non-Null count Data types

Diagnosis 569 non-null float64
radius_mean 569 non-null float64
texture_mean 569 non-null float64
perimeter_mean 569 non-null float64
area_mean 569 non-null float64
smoothness_mean 569 non-null float64
compactness_mean 569 non-null float64
concavity_mean 569 non-null float64
concave points_mean 569 non-null float64
symmetry_mean 569 non-null float64
fractal_dimension_mean 569 non-null float64
radius_se 569 non-null float64
texture_se 569 non-null float64
perimeter_se 569 non-null float64
area_se 569 non-null float64
smoothness_se 569 non-null float64

40
compactness_se 569 non-null float64
concavity_se 569 non-null float64
concave points_se 569 non-null float64
symmetry_se 569 non-null float64
fractal_dimension_se 569 non-null float64
radius_worst 569 non-null float64
texture_worst 569 non-null float64
perimeter_worst 569 non-null float64
area_worst 569 non-null float64
smoothness_worst 569 non-null float64
compactness_worst 569 non-null float64
concavity_worst 569 non-null float64
concave points_worst 569 non-null float64
symmetry_worst 569 non-null float64
fractal_dimension_worst 569 non-null float64

Table 4.5: Breast cancer disease dataset details.

DATA SET VISUALIZATION

Fig 4.4: Dataset used

40
4.3 METHODOLOGY

The user has to input the data where it will be stored in database and
then according to their choice the prediction will be made. After
collecting the user data from the database and the choice of predicting
the disease is to be predicted. If negative then end the process and if
positive the user will get hospital recommendations at which their best
treatment can be done.

SYSTEM ARCHITECTURE

Fig 4.5: System Architecture.

System Architecture design-identifies the general hypermedia structure for the

40
WebApp. Architecture design is tied to the goals establish for a WebApp, the
content to be presented, the users who will visit, and also the navigation
philosophy that has been established. Content architecture, focuses on the
way within which content objects and structured for presentation and
navigation. WebApp architecture, addresses the way the applying is structure
to manage user interaction, handle internal processing tasks, effect
navigation, and present content. WebApp architecture is defined within the
context of the event environment during which the appliance is to be
implemented.

MODULES IMPLEMENTED
The user has to input the data where it will be stored in database and then
according to their choice the prediction will be made. After collecting the user
data from the database and the choice of predicting the disease is to be
predicted. If negative then end the process and if positive the user will get
hospital recommendations (future ) at which their best treatment can be done.

• Application Flowchart.
• Data collection (from the user) to make dataset.
• Importing packages.
• Data pre-processing.
• Data fitting and training.
• Prediction as opted by the user.
• Result or output.

40
FLOWCHART DIAGRAM

Fig 4.6: Flowchart Diagram.

40
DATA COLLECTION

Data Collection is one of the most important tasks in building a machine

learning model. We collect the specific data based on requirements from
users to make the dataset. The dataset contains some unwanted data
also. So first we need to pre- process the data and obtain perfect data set
for algorithm.

PACKAGES IMPORTED

• Pandas : Pandas is a software library written for python for data

manipulation and analysis. It offers data structures and
operations for manipulating numerical tables and time series.

• Numpy: It is a library for the Python Programming Language,

adding support for large, multiple-dimensional arrays and
matrices, along with a large collection of high-level
mathematical functions to operate on these arrays.

• Scikit-learn Package: Scikit-learn is a free machine learning

library for python. It features various algorithms like SVM,
Random Forest , K-neighbours and Decision Tree.

• Confusion Matrix: A confusion matrix is a technique for

summarizing the performance of a classification algorithm.
Classification accuracy alone can be misleading if you have an
unequal number of observations in each class or if you have
more than two classes in your dataset.

Syntax: from sklearn.metrics import confusion_matrix.

40
• Classification Report:The classification report visualizer
displays the precision, recall, F1, and support scores for the
model.
Syntax: from sklearn.metrics import classification_report.

• Accuracy Score: Accuracy is one metric for evaluating

classification models. Informally, accuracy is the fraction of
predictions our model got right. Formally, accuracy has the
following definition:
Accuracy = Number of correct predictions / Total
number of predictions. Syntax: from sklearn.metrics
import accuracy_score.

DATA PRE-PROCESSING
It is the gathering of task related information based on some
targeted variables to analyse and produce some valuable outcome.
However, some of the data may be noisy, i.e. may contain inaccurate
values, incomplete values or incorrect values. Hence, it is must to
process the data before analysing it and coming to the results. Data
pre-processing can be done by data cleaning, data transformation, data
selection
Data pre processing is a process of preparing the raw data and
making it suitable for a machine learning model. It is the first and
crucial step while creating a machine learning model.
When creating a machine learning project, it is not always a case
that we come across the clean and formatted data. And while doing
any operation with data, it is mandatory to clean it and put in a
formatted way. So for this, we use data pre processing task.

40
A real-world data generally contains noises, missing values, and maybe in an
unusable format which cannot be directly used for machine learning models.
Data preprocessing is required tasks for cleaning the data and making
it suitable for a machine learning model which also increases the accuracy
and efficiency of a machine learning model.

It involves below steps:

o Getting the dataset

o Importing libraries

o Importing datasets

o Finding Missing Data

o Encoding Categorical Data

o Splitting dataset into training and test set

o Feature scaling

40
DATA TRAINING
Model fitting is a measure of how well a machine learning model
generalizes to similar data to that on which it was trained. A model that
is well-fitted produces more accurate outcomes. A model that is
overfitted matches the data too closely. A model that is underfitted
doesn't match closely enough
Training data is the initial dataset used to train machine learning
algorithms. Models create and refine their rules using this data. It's a set
of data samples used to fit the parameters of a machine learning model
to training it by example. Training data is also known as training
dataset, learning set, and training set. It's an essential component of
every machine learning model and helps them make accurate
predictions or perform a desired task.

ALGORITHM SELECTION

• The datasets has been tested with different supervised machine

learning algorithms and it is found that the best solution with accuracy is given by

1. Diabetes – Logistic regression algorithm

2. Heart disease – Random Forest algorithm

3. Breast Cancer – Random Forest algorithm

For hospital recommendation used collaborative filtering algorithm

40
Fig 4.7: Algorithm selection for diabetes disease prediction.

40
Fig 4.8: Algorithm selection for heart disease prediction.

Fig 4.9: Algorithm selection for breast cancer prediction.

40
PREDICTION AS OPTED BY THE USED

 Prediction of disease has three options

o Heart disease prediction
o Diabetes disease prediction
o breast cancer prediction
 Where Random forest algorithm is used for heart and breast
cancer prediction and logistic regressor algorithm is used for
diabetes prediction for these following algorithm gives the best
accuracy rate for these datasets.

4.3 ALGORITHM MODELS

4.3.1 LOGISTIC REGRESSION ALGORITHM

Logistic regression is one of the most popular Machine Learning

algorithms, which comes under the Supervised Learning technique. It is
used for predicting the categorical dependent variable using a given set
of independent variables. Logistic regression predicts the output of a
categorical dependent variable. Therefore the outcome must be a
categorical or discrete value. It can be either Yes or No, 0 or 1, true or
False, etc. but instead of giving the exact value as 0 and 1, it gives the
probabilistic values which lie between 0 and 1. Logistic Regression is
much similar to the Linear Regression except that how they are used.
Linear Regression is used for solving Regression problems, whereas
Logistic regression is used for solving the classification problems. In
Logistic regression, instead of fitting a regression line, we fit an "S"
shaped logistic function, which predicts two maximum values (0 or
1).

40
The curve from the logistic function indicates the likelihood of
something such as whether the cells are cancerous or not, a mouse is
obese or not based on its weight, etc. Logistic Regression is a
significant machine learning algorithm because it has the ability to
provide probabilities and classify new data using continuous and
discrete datasets. Logistic Regression can be used to classify the
observations using different types of data and can easily determine the
most effective variables used for the classification. The below image is
showing the logistic function:

Fig 4.10: Logistic Regression graph.

The sigmoid function is a mathematical function used to map the

predicted values to probabilities. It maps any real value into another
value within a range of 0 and 1. The value of the logistic regression
must be between 0 and 1, which cannot go beyond this limit, so it
forms a curve like the "S" form. The S-form curve is called the
Sigmoid function or the logistic function.In logistic regression, we use
the concept of the threshold value, which defines the probability of

40
either 0 or 1. Such as values above the threshold value tends to 1, and a
value below the threshold values tends to 0.
For Diabetes disease prediction: LOGISTIC REGRESSION:
Logistic regression is a supervised learning classification algorithm used t
predict the probability of a target variable. The nature of target or dependent
variable is dichotomous, which means there would be

Only two possible classes. ... Mathematically, a logistic regression

model predicts P(Y=1) as a function of X.
Syntax:fromsklearn.linear_model import LogisticRegression
obj=Logistic regression()

LOGISTIC REGRESSION
Logistic regression equation:
P=eβ0+β1X/
1+eβ0+β1X1
When all the
feature
plugged in ;
logit(p)=log(p/(1−p))=β0+β1∗ Sexmale+β2∗ age+β3∗ cigsPerDay+β4∗ totChol+β5
∗ sysB P+β6∗ glucose
To implement the Logistic Regression using Python, we will use
the same steps as we have done in previous topics of Regression.
Below are the steps:

o Data Pre-processing step

o Fitting Logistic Regression to the Training set
o Predicting the test result
o Test accuracy of the result(Creation of Confusion matrix)
o Visualizing the test set result.

40
4.3.2 RANDOM FOREST ALGORITHM

For heart disease and breast cancer prediction RANDOM FOREST

CLASSIFIER: Random forest is a supervised learning algorithm which
is used for both classification as well as regression. But however, it is
mainly used for classification problems. As we know that a forest is
made up of trees and more trees means more robust forest. Similarly,
random forest algorithm creates decision trees on data samples and then
gets the prediction from each of them and finally selects the best
solution by means of voting. It is an ensemble method which is better
than a single decision tree because it reduces the over-fitting by
averaging the result.

Random Forest is a popular machine learning algorithm that belongs to

the supervised learning technique. It can be used for both Classification
and Regression problems in ML. It is based on the concept of ensemble
learning, which is a process of combining multiple classifiers to solve a
complex problem and to improve the performance of the model. As the
name suggests, "Random Forest is a classifier that contains a number of
decision trees on various subsets of the given dataset and takes the
average to improve the predictive accuracy of that dataset." Instead of
relying on one decision tree, the random forest takes the prediction from
each tree and based on the majority votes of predictions, and it predicts
the final output.

40
The greater number of trees in the forest leads to higher accuracy
and prevents the problem of over fitting.

Implementation Steps are given below:

o Data Pre-processing step

o Fitting the Random forest algorithm to the Training set
o Predicting the test result
o Test accuracy of the result (Creation of Confusion matrix)
o Visualizing the test set result.

Fig 4.11: Random Forest Algorithm Architecture.

40
Random Forest works in two-phase first is to create the random forest
by combining N decision tree, and second is to make predictions for
each tree created in the first phase.

The Working process can be explained in the below steps and diagram:

Step-1: Select random K data points from the training set.

Step-2: Build the decision trees associated with the selected data points (Subsets).

Step-3: Choose the number N for decision trees that you want to build.

Step-4: Repeat Step 1 & 2.

Step-5: For new data points, find the predictions of each decision tree,
and assign the new data points to the category that wins the majority
votes.

Normal woods can be a gathering of trees. Here, the independence is

partitioned into vectors, and each tree gives an underlying stage division
called a x distribution. Customary timberlands give a gathering of
guaranteed trees to make a fundamental variety of trees, and Breiman
picked the best strategy, the technique for cooking or grouping each tree
in one of the Random Forests, and Breiman followed the accompanying
advances: Randomly organized N archives, yet additionally supplanted,
as should be visible from the first numbers, this is a boot test. An
illustration of this is tree establishing preparing. In the event that there is
another M info, m << M chooses something similar for every hub, and
m is a variable chosen from M, so a positive detachment from m
addresses the property to be utilized for separation. The consistent
worth of m during woods improvement. Each tree develops as large as
could be expected. try not to cut. In this manner many trees are brought
into the woods; The quantity of trees anticipated by the ntree boundary.

40
The greatest number of factors (m) chose for every hub is again called
"mtry" or k. The profundity of the tree can be constrained by hub
boundaries (for instance, the quantity of leaves), and now and then by
something like one. As referenced above, it streams from every one of
the trees that fill in the backwoods to decide the degree of
substitution in the wake of preparing or catching the woodland. Each
tree gives another example class to casting a ballot. All tree ideas were
merged and the greater part (larger part vote) grouping was affirmed at
another level. Going on here, the woodland characterizes a tree
backwoods assembled utilizing the RI timberland. In the ranger service
area, each tree was chosen and a freight test was made for substitution,
yet around 1/3 of the first material was absent. This rundown of models
is called OOB (Out of pocket) data. Each tree has its own OOB data,
which is utilized to look at the breaks in each tree in the timberland, and
is known as the OOB break estimation.

40
CHAPTER 5

RESULTS AND DISCUSSION

5.1 RESULTS

When we see around there are many patients that does not get the right
treatment at the right time because of their lack of decision taking about
the choice of hospital and doctors, they don’t know what you do now
and end up very serious at the end. The objective of the project is to
provide the service to patients by suggesting them the best hospital to
find their cure for their existing disease . The project is to provide a
very easy solution for the patients to get recommendation to what
doctor or hospital they need to go after diagnosed with a severe disease.
This web application can find the solution to that, no need of thinking
about what should be done after diagnosed with a severe disease. This
web application handles reports to make predictions and give results
accordingly to that, a best hospitals can be selected more for their
treatment and more lives can be saved. After easy login or registering
into the app the patient can predict their disease after inputting certain
reports from their medical diagnosis report which will display
accurately that the patient has the particular disease or not it will
show in form of positive or negative. After the Prediction they will be
having an option to get recommended hospital which are best for the
treatment of their disease nearby. By this way the app can save many
more lives more before its too late to get the treatment.

40
SCREENSHOTS OF RESULT

DIABETES DISEASE PREDICTION RESULTS

Fig 5.1: Diabetes disease prediction result.

HEART DISEASE PREDICTION RESULT

Fig 5.2: Heart disease prediction result.

40
BREAST CANCER PREDICTION RESULT

Fig 5.3: Breast cancer prediction result.

40
5.1 PERFORMANCE EVALUATION

DIABETES DISEASE PREDICTION PERFORMANCE ANALYSIS

Fig 5.4: Performance analysis of logistic regression for diabetes disease

prediction.

40
HEART DISEASE PREDICTION PERFORMANCE ANALYSIS

Fig 5.5: Performance analysis of Random Forest classifier for

heart disease prediction.
PERFORMANCE ANALYSIS OF BREASE CANCER
PREDICTION

Fig 5.6: Performance analysis of random forest classifier for

breast cancer prediction.

40
CHAPTER 6

CONCLUSION AND FUTURE WORK

6.1 SUMMARY AND CONCLUSION

Earlier days in hospitals they lack in technological aspects for testing

and issuing the reports which might take one day or may be more than
that to issue the report for the lab related work that are being executed
manually to predict the disease also they lack in efficiency and
accuracy. But nowadays we have ample amount of data to show that
these similar aspects or components can lead to this disease (exception
may occur), so with the help of machine learning we have tried to
implement similar system to predict the above stated disease which are
most commonly found in person these days. In this application we have
tried to implement a similar system which focuses on the three most
deadly disease heart disease, breast cancer and diabetes diseases. We
have implemented an effective way to reduce the dimensionality,
reducing and eliminating the irrelevant data and increasing the
accuracy. After the prediction of the disease a positive and negative
report will be displayed according to which the patients can get best and
nearby hospitals recommendations. It is to make easier way for patients
to find the hospitals with good quality care of doctors. In total we are
implementing our innovation ideas to give benefits to the people who
are suffering from the health issues and they can make use of this
application where they will find all good options at a time in one appeal.
Opinions given by people on hospitals and doctors plays an important
role and easily they can make decision. The goal was to use such
associations to create a patient satisfaction based the recommendation.

40
6.1 FUTURE WORK

Future implementation is to recommend hospitals based on users review

with the algorithm Collaborative Filtering: The motivation behind the
CF calculation is to ascertain the benefits of a specific item for another
item that is offered or for a chose client in view of the client's related
knowledge and afterward the thoughts of different clients.
In view of authenticity and effortlessness, we expect to pay attention to
what different clients share for all intents and purpose and love
comparable preferences. Consolidating the inclinations of the two
clients is considered by the orientation correspondence of the past. All
CF techniques share the capacity to anticipate or give groundbreaking
plans to individual clients who will appreciate utilizing past clients.
Central issues depend on the possibility of connecting customers or
items, and network is characterized as the demonstration of contracting
between the first or the best. The two most significant CF modes are
typically executed as client based objects, while the joined solicitation
technique is separated into two gatherings: memory-based and model-
based.
A memory-based approach is a heuristic in view of an assortment of
things that clients have recently esteemed and remarked on. This
expertise requires all scores,things, and clients to retain. The system
depends on the utilization of an evaluation group to track down a model
and make speculations. This innovation takes into account a normalized
web-based evaluation technique. The CF strategy utilizes the thoughts
of different networks to draw in clients. By and large, thoughts for the
individuals who use them are utilized to gather the flavor of different
clients. Hence, the CF expects that purchasers who have concurred in
the past will probably settle on what's to come.

40
The CF framework requires a lot of information handling, including
broadband, for example, web-based business and web facilitating.
Throughout the course of recent years, CF has advanced and has at long
last become perhaps the most well-known method for significantly
impacting the manner in which you approach directing. Today, PCs, as
well as the Internet, assist us with contemplating the thoughts of an
extraordinary spot with numerous individuals. People can profit from
local gatherings, permitting them to acquire information from different
clients and gaining from an assortment of items. Also, data can assist
clients with making their own thoughts or check significant items out.
Specifically, CF methods are utilized to assist clients with observing
new items they might like, get guidance on explicit items, and associate
with different clients who have comparative issues.

40
REFERENCE

[1] M. Denil, D. Matheson, and N. De Freitas, “Narrowing the Gap:

Random Forests In Therein, M., Matheson, D., & De Freitas, N.
(2014). Narrowing the Gap:

[2] W. Bergerud, “Introduction to logistic regression models with

worked forestry examples: biometrics information handbook no.
7,” no. 7, p. 147, 1996.

[3] Watson, F. Marir "Using retrospect, they concluded that non-

Spanish whites on average tend to go to hospitals that offer a better
patient experience for all patients compared to hospitals commonly
used by African American, Hispanic, Asian / Pacific Islander, or
multiracial
patients" 1994.

[4] Binal A. Thakkar, Mosin I. Hasan, Mansi A. Desai, "Healthcare

decision support system for swine flu prediction using naïve bayes
classifier",IEEE", 101-105,2010.

[5] Disease Prediction and hospital recommendation using

machine learning algorithm, www.academia.edu

[6] Random forest algorithm, javapoint.com

[7] Logistic regression algorithm, javapoint.com

[8] Youtube.com, for application reference

[9] Disease prediction and doctor recommendation system,

International Research Journal of Engineering and Technology
(IRJET)

40
APPENDIX
A.SCREENSHOTS

Screenshot 1: Application Homepage (There are options for

the user to register, login and to know
about the application)

40
Screenshot 2: Application About Us page (To know about
the application what is the purpose and what it does).

Screenshot 3: Application Register page (new users can

41
Screenshot 4: Application Login page (Already registered
users can use user id and password to login to the
application and use the application for their benefits which is
very user friendly to use.)

Screenshot 5: Application Dashboard page (where users

are given different options for their disease)

42
Screenshot 6: Report entry page (where users can give
enter according to their diagonised report )

Screenshot 7: Result page (where User get their report

after the prediction whether he is suffering
from the disease or not.)

43
B.PLAGIARISM REPORT

Fig 7.2: Plagiarism Report

Title: Smart Heath Prediction Using Machine Learning
No ratings yet
Title: Smart Heath Prediction Using Machine Learning
20 pages
PR3197 - DiseasePredictionUsingMachineLearning - Report - MAYUR SHIVAKU
100% (1)
PR3197 - DiseasePredictionUsingMachineLearning - Report - MAYUR SHIVAKU
51 pages
Lesson Plan 10
No ratings yet
Lesson Plan 10
6 pages
Disese Prediction Final-1
No ratings yet
Disese Prediction Final-1
57 pages
hearts report final pages
No ratings yet
hearts report final pages
27 pages
Disese Prediction Final-1
No ratings yet
Disese Prediction Final-1
57 pages
1822 B.E Cse Batchno 296
No ratings yet
1822 B.E Cse Batchno 296
83 pages
BT40816_Project_Report
No ratings yet
BT40816_Project_Report
34 pages
Maindra
No ratings yet
Maindra
22 pages
Health and Med Tech Sadhana
No ratings yet
Health and Med Tech Sadhana
94 pages
Final_Proj-AZRA-merged
No ratings yet
Final_Proj-AZRA-merged
36 pages
MDPS
No ratings yet
MDPS
25 pages
286IARP27
No ratings yet
286IARP27
72 pages
Heart Disease Prediction
No ratings yet
Heart Disease Prediction
70 pages
Sample TSReport
No ratings yet
Sample TSReport
32 pages
Synopsis
No ratings yet
Synopsis
6 pages
C. Karthik Chandran, M. Rajalakshmi, Sachi Nandan Mohanty, Subrata Chowdhury - Machine Learning For Healthcare Systems - Foundations and Applications-River Publishers (2023)
No ratings yet
C. Karthik Chandran, M. Rajalakshmi, Sachi Nandan Mohanty, Subrata Chowdhury - Machine Learning For Healthcare Systems - Foundations and Applications-River Publishers (2023)
251 pages
Final G04
No ratings yet
Final G04
42 pages
Report Final 2
No ratings yet
Report Final 2
58 pages
Ijarcce 2019 81210
No ratings yet
Ijarcce 2019 81210
3 pages
Final Documentation
No ratings yet
Final Documentation
10 pages
projectworddoc
No ratings yet
projectworddoc
56 pages
Human Disease Prediction (2) - 1 - Compressed
No ratings yet
Human Disease Prediction (2) - 1 - Compressed
62 pages
Blue Book Format
No ratings yet
Blue Book Format
49 pages
b22it031 report
No ratings yet
b22it031 report
29 pages
Nann Mudhalvan Report
No ratings yet
Nann Mudhalvan Report
27 pages
REPORT
No ratings yet
REPORT
33 pages
Aad Project
No ratings yet
Aad Project
70 pages
ITRByAYUSH
No ratings yet
ITRByAYUSH
58 pages
saniya lt l
No ratings yet
saniya lt l
23 pages
Text Recognition Past, Present and Future
No ratings yet
Text Recognition Past, Present and Future
7 pages
Final Report
No ratings yet
Final Report
25 pages
Final Project Report Format
No ratings yet
Final Project Report Format
27 pages
Final PROJECT-1
No ratings yet
Final PROJECT-1
10 pages
Disease Prediction Project
No ratings yet
Disease Prediction Project
25 pages
Project Word
No ratings yet
Project Word
58 pages
Healthy Food Suggestions Based On Blood Parameters Web Application
No ratings yet
Healthy Food Suggestions Based On Blood Parameters Web Application
23 pages
Hepatitis Disease Prediction using.Machine.Learning
No ratings yet
Hepatitis Disease Prediction using.Machine.Learning
52 pages
Intelligent Heart Diseases Prediction System Using Datamining Techniques0
No ratings yet
Intelligent Heart Diseases Prediction System Using Datamining Techniques0
104 pages
Compparison of Classification Algorithm For Heart Disease - Predictionpdf
No ratings yet
Compparison of Classification Algorithm For Heart Disease - Predictionpdf
34 pages
MD Kamrul Islam
No ratings yet
MD Kamrul Islam
63 pages
BDA Final
No ratings yet
BDA Final
33 pages
New Highlighted - Thesis Final V2
No ratings yet
New Highlighted - Thesis Final V2
160 pages
Heart Disease Prediction Research
No ratings yet
Heart Disease Prediction Research
45 pages
report4
No ratings yet
report4
38 pages
MINI PROJECT Kshetrika
No ratings yet
MINI PROJECT Kshetrika
41 pages
Rustamji Institute of Technology: Predictive Analytics On Health Care
No ratings yet
Rustamji Institute of Technology: Predictive Analytics On Health Care
12 pages
MINI - PROJECT - REPORT (Deeps) 1
No ratings yet
MINI - PROJECT - REPORT (Deeps) 1
40 pages
Edited - Django Website For Disease Prediction Using Machine Learning
No ratings yet
Edited - Django Website For Disease Prediction Using Machine Learning
7 pages
1822 B.E Cse Batchno 188
No ratings yet
1822 B.E Cse Batchno 188
63 pages
ilovepdf_merged_removed
No ratings yet
ilovepdf_merged_removed
33 pages
Final project report
No ratings yet
Final project report
31 pages
Submitted Research Proposal for UoG Call (Health Datasets)
No ratings yet
Submitted Research Proposal for UoG Call (Health Datasets)
19 pages
Intelligent Heart Diseases Prediction System Using Datamining Techniques0
50% (6)
Intelligent Heart Diseases Prediction System Using Datamining Techniques0
104 pages
Disease Prediction and Hospital Recommendation Using Machine Learning
No ratings yet
Disease Prediction and Hospital Recommendation Using Machine Learning
11 pages
Final year ppt 3 (1)
No ratings yet
Final year ppt 3 (1)
18 pages
Final Report of Heart Disease Prdiction
No ratings yet
Final Report of Heart Disease Prdiction
81 pages
Breast cancer prediction project
No ratings yet
Breast cancer prediction project
33 pages
LITERATURE SURVEY
No ratings yet
LITERATURE SURVEY
72 pages
Predictive Health Care-Enhancin Diagnosis and Treatment With Maching Learning
No ratings yet
Predictive Health Care-Enhancin Diagnosis and Treatment With Maching Learning
49 pages
Quality metrics for semantic interoperability in Health Informatics
From Everand
Quality metrics for semantic interoperability in Health Informatics
Alberto Moreno Conde
No ratings yet
Code of Conduct and Policies
100% (1)
Code of Conduct and Policies
5 pages
Indhu Kamireddy
No ratings yet
Indhu Kamireddy
36 pages
BMW c650gt Oil Filter Change Instructions
100% (1)
BMW c650gt Oil Filter Change Instructions
3 pages
Ador Incident Form
No ratings yet
Ador Incident Form
3 pages
KVK Kapurthala's Newsletter
No ratings yet
KVK Kapurthala's Newsletter
18 pages
SOW FOR REPAIR OF MAIN STORM RAIN WATER DRAINAGE PIPE IN COMPOUND Project 45
No ratings yet
SOW FOR REPAIR OF MAIN STORM RAIN WATER DRAINAGE PIPE IN COMPOUND Project 45
15 pages
I Ntegrat
No ratings yet
I Ntegrat
18 pages
Structural Design Practice - II: Tutorial-1 Orientation Exercise
No ratings yet
Structural Design Practice - II: Tutorial-1 Orientation Exercise
2 pages
Chapter 13 en
No ratings yet
Chapter 13 en
28 pages
Jaguar Cars Limited
No ratings yet
Jaguar Cars Limited
7 pages
The Bicycle Wheel
No ratings yet
The Bicycle Wheel
147 pages
Name: . Date: .. Each Question Has Only One Correct Answer! Please Enter Your Answers On The Answer Sheet
No ratings yet
Name: . Date: .. Each Question Has Only One Correct Answer! Please Enter Your Answers On The Answer Sheet
9 pages
Arrhythmias Types, Pathophysiology Atf
No ratings yet
Arrhythmias Types, Pathophysiology Atf
9 pages
M1SUMA-82100 2
No ratings yet
M1SUMA-82100 2
4 pages
36.) Carol Mathis Sworn Witness Statement
No ratings yet
36.) Carol Mathis Sworn Witness Statement
9 pages
React Bits PDF
No ratings yet
React Bits PDF
126 pages
Vodafone Strategy Italy
No ratings yet
Vodafone Strategy Italy
46 pages
Overstory Book Po Polsku - Google Search
No ratings yet
Overstory Book Po Polsku - Google Search
1 page
CV1 Aishwarya Sharma 61920094
100% (1)
CV1 Aishwarya Sharma 61920094
1 page
Social Science
No ratings yet
Social Science
2 pages
Span of Control
No ratings yet
Span of Control
18 pages
Tehnicka Preporuka 4, Prilog
No ratings yet
Tehnicka Preporuka 4, Prilog
8 pages
Unit3 to unit 5 MCQ
No ratings yet
Unit3 to unit 5 MCQ
10 pages
CEMS ZOuYeal
No ratings yet
CEMS ZOuYeal
17 pages
Wa0001.
No ratings yet
Wa0001.
47 pages
MMDA Traffic Violations and Penalties
No ratings yet
MMDA Traffic Violations and Penalties
6 pages
Historal Evolution of Nursing
No ratings yet
Historal Evolution of Nursing
41 pages
MEL-Top Notch 2e - Teacher Walkthrough
No ratings yet
MEL-Top Notch 2e - Teacher Walkthrough
9 pages
Augmentin I V
No ratings yet
Augmentin I V
13 pages