Project Report On Diabetes Prediction
Submitted To:
Submitted By:
STUDENT’S DECLARATION
We, the undersigned, solemnly declare that the report of the project work entitled
“DIABETES PREDICTION USING MACHINE LEARNING” is based on our own work
carried out during the course of study under the supervision of Er. Sudan Prajapati.
We assert that the statements made and conclusions drawn are an outcome of the project work.
We further declare that, to the best of our knowledge and belief, the project report does not
contain any part of any work which has been submitted for the award of any other
degree/diploma/certificate in this University.
Tribhuvan University
Institute of Science and Technology
Supervisor’s Recommendation
I hereby recommend that this project work report is satisfactory for partial fulfillment
of the requirements for the degree of Bachelor of Science in Computer Science and
Information Technology and that it be processed for evaluation.
………………............................
Er. Sudan Prajapati
Lecturer
Patan Multiple Campus
(Supervisor)
Date:
Tribhuvan University
Institute of Science and Technology
LETTER OF APPROVAL
This is to certify that the project prepared by Mr. Santosh Thapa (10052/073), Mr.
Shyam Das Shrestha (10057/073), and Mr. Rojen Bahadur Pradhan (10043/073),
entitled “DIABETES PREDICTION USING MACHINE LEARNING”, in partial
fulfillment of the requirements for the degree of B.Sc. in Computer Science and
Information Technology, has been well studied. In our opinion, it is satisfactory in
scope and quality as a project for the required degree.
……………………………
Er. Sudan Prajapati
Lecturer
Patan Multiple Campus
(Supervisor)
Department of Computer Science
and Information Technology
Patan Multiple Campus
ACKNOWLEDGEMENT
It is a great pleasure to have the opportunity to extend our heartfelt gratitude to everyone
who helped us throughout the course of this project. We are profoundly grateful to our
supervisor Er. Sudan Prajapati, lecturer at Patan Multiple Campus, for his expert
guidance, continuous encouragement, and ever-willingness to spare time from his
otherwise busy schedule for the project’s progress reviews. His continuous inspiration
enabled us to complete this project and achieve its target.
We would also like to express our deepest appreciation to Mr. Mahesh Kumar Yadav,
coordinator, and Mr. Prabin Lal Shrestha, Head of Department, Patan Multiple Campus,
Department of Computer Science and Information Technology, for their constant
motivation, support, and for providing us with a suitable working environment. We would
also like to extend our sincere regards to Er. Jyoti Prakash Chaudhary and all the faculty
members for their support and encouragement.
Finally, our special thanks go to all the staff members of the BSc CSIT department who
directly and indirectly extended their hands in making this project a success.
ABSTRACT
Diabetes is a common disease caused by a group of metabolic disorders in which blood sugar
levels remain high over a prolonged period. It affects various organs of the human body, in
particular the blood vessels and nerves. Accurate early prediction of such a disease can save
human lives. However, it is not possible to monitor patients accurately every day in all cases,
and round-the-clock consultation with a doctor is not available, since it requires a great deal of
patience, time, and expertise. To achieve this goal, this project work examines the various factors
associated with the disease using machine learning techniques. The early prognosis of diabetes
can aid in making decisions on lifestyle changes in high-risk patients and in turn reduce
complications, which can be a great milestone in the field of medicine.
Keywords: Machine Learning, Random Forest Classification, Accuracy, Recall, Precision, Flask
LIST OF FIGURES:
LIST OF TABLES:
Table of Contents
CHAPTER 1: INTRODUCTION
1.1 Problem Definition
1.2 Motivation
1.3 Objectives
CHAPTER 2: LITERATURE REVIEW
CHAPTER 3: DATASETS
CHAPTER 4: METHODS AND ALGORITHMS USED
4.1 Synthetic Minority Oversampling Technique (SMOTE)
4.2 Random Forest Classifier
CHAPTER 5: METHODOLOGY
5.1 Software Development Model
5.2 Gantt Chart
CHAPTER 6: EXPERIMENTS
6.1 Exploratory Data Analysis
6.2 Missing Value Imputation
6.3 Training and Testing
CHAPTER 7: EVALUATION METRICS
7.1 Confusion Matrix
7.2 Accuracy
7.3 Recall
7.4 Precision
CHAPTER 8: DEPLOYMENT
CHAPTER 9: CODE
CHAPTER 10: CONCLUSION
References
CHAPTER 1: INTRODUCTION
Diabetes is a disease that affects the hormone insulin, resulting in abnormal metabolism of
carbohydrates and elevated levels of sugar in the blood. High blood sugar affects several
organs of the human body and in turn damages many of its systems, in particular the blood
vessels and nerves. The causes of diabetes are not yet fully understood; many researchers
believe that both hereditary factors and environmental influences are involved. According to
the International Diabetes Federation [1], the number of people with diabetes reached 422
million in 2021, which makes up 5.34% of the world’s total adult population. Early prediction
of such a disease can help control it and save human lives. To accomplish this goal, this project
work focuses on the early prediction of diabetes by taking into account various risk factors
related to the disease. For the purpose of the study, we gathered a diagnostic dataset containing
attributes of diabetic and non-diabetic patients. Later, we discuss these attributes along with
their corresponding values. Based on these attributes, we build a prediction model using
machine learning techniques to predict diabetes. Machine learning techniques provide efficient
ways to extract knowledge by building prediction models from diagnostic medical datasets
collected from diabetic patients. Since it is difficult to select the best technique to predict
diabetes from such attributes, different algorithms have been compared for model building.
1.1 Problem Definition
The major challenge with diabetes is its early detection. Instruments are available that can
predict diabetes, but they are either expensive or not efficient enough to estimate a person’s
chance of developing diabetes. Early detection of diabetes can decrease the mortality rate and
overall complications. However, it is not possible to monitor patients accurately every day in
all cases, and round-the-clock consultation with a doctor is not available, since it requires a
great deal of patience, time, and expertise. Since a good amount of data is available in today’s
world, various machine learning algorithms can be used to analyze the data for hidden patterns.
These hidden patterns can then be used for diagnosis in medical data.
1.2 Motivation
Machine learning techniques have been widely compared and used for analysis in many kinds
of data science applications. This project is carried out with the motivation to develop an
appropriate computer-based decision support system that can aid in the early detection of
diabetes. In this project we have developed a model that classifies whether a patient will have
diabetes based on various features (i.e. potential risk factors for diabetes) using a random forest
classifier. The early prognosis of diabetes can aid in making decisions on lifestyle changes in
high-risk patients and in turn reduce complications, which can be a great milestone in the field
of medicine.
1.3 Objectives
CHAPTER 2: LITERATURE REVIEW
Various researchers have carried out studies in the area of diabetes by using machine learning
techniques to extract knowledge from existing medical data. For illustration, Marina Skurichina,
Ludmila Kuncheva, and Robert P. W. Duin studied bagging and boosting for the nearest mean
classifier and the effects of sample size on diversity and accuracy [2]. Michael Lindenbaum,
Shaul Markovitch, and Dmitry Rusakov investigated selective sampling using random field
modelling [3]. In this work, we examine real diagnostic medical data based on numerous risk
factors using popular machine learning classification techniques to assess their performance
for predicting diabetes cases.
CHAPTER 3: DATASETS
In this work, the Pima Indian Diabetes Dataset [1] has been used. The dataset was collected from
the Pima Indian female population near Phoenix, Arizona. This particular dataset has been widely
used in machine learning experiments and is currently available through the UCI repository of
standard datasets. This population has been studied continuously by the National Institute of
Diabetes and Digestive and Kidney Diseases. The UCI repository version contains 768 instances
of observations and a total of 9 attributes, with no missing values reported. The dataset contains
8 feature variables that are considered high-risk factors for the occurrence of diabetes and 1
target variable containing ‘1’ for diabetic and ‘0’ for non-diabetic patients. The 8 feature
variables along with the target variable are shown in the following table (Table 1).
Figure 1: Original Dataset Snapshot
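As a minimal sketch of loading and inspecting this dataset with pandas (the file name diabetes.csv and the exact column names are assumptions based on the publicly distributed Pima Indians Diabetes CSV, not taken from the project code):

```python
import pandas as pd

# Assumed file name and column layout of the Pima Indians Diabetes CSV.
df = pd.read_csv("diabetes.csv")

print(df.shape)             # expected (768, 9): 8 feature columns plus Outcome
print(df.columns.tolist())  # e.g. Pregnancies, Glucose, BloodPressure, SkinThickness,
                            # Insulin, BMI, DiabetesPedigreeFunction, Age, Outcome
print(df["Outcome"].value_counts())  # 1 = diabetic, 0 = non-diabetic
```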
CHAPTER 4: METHODS AND ALGORITHMS USED
The main purpose of designing this system is to predict the risk of future diabetes. We have
used various methods and algorithms throughout the different phases of the machine learning
pipeline, which are discussed below.
4.1 Synthetic Minority Oversampling Technique (SMOTE)
The minority class is over-sampled by taking each minority class sample and introducing
synthetic examples along the line segments joining any/all of the k minority class nearest
neighbors. Depending upon the amount of over-sampling required, neighbors from the k nearest
neighbors are randomly chosen. The current implementation uses five nearest neighbors. For
instance, if the amount of over-sampling needed is 200%, only two neighbors from the five
nearest neighbors are chosen and one sample is generated in the direction of each. Synthetic
samples are generated in the following way: Take the difference between the feature vector
(sample) under consideration and its nearest neighbor. Multiply this difference by a random
number between 0 and 1, and add it to the feature vector under consideration. This causes the
selection of a random point along the line segment between two specific features. This approach
effectively forces the decision region of the minority class to become more general.[3]
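Since imblearn.over_sampling.SMOTE appears in the project’s library list, a minimal, self-contained sketch of this oversampling step is given below; the synthetic stand-in data and the random_state value are illustrative assumptions, not the project’s actual inputs.

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic stand-in for the Pima features (X) and Outcome labels (y),
# with roughly the 65%/35% class imbalance described in the report.
X, y = make_classification(n_samples=768, n_features=8,
                           weights=[0.65, 0.35], random_state=42)

# SMOTE interpolates new minority samples along the segments to the k nearest
# minority-class neighbours (k = 5 here, matching the description above).
smote = SMOTE(k_neighbors=5, random_state=42)
X_res, y_res = smote.fit_resample(X, y)

print("before:", Counter(y))      # imbalanced class counts
print("after: ", Counter(y_res))  # balanced after oversampling
```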
4.2 Random Forest Classifier
Machine learning (ML) is the study of computer algorithms that improve automatically through
experience and by the use of data. Traditional programming approaches take in data and rules to
produce the desired output, whereas machine learning approaches take in data and the desired
output and produce the necessary rules.
Supervised Learning Algorithms use the input features along with the target output for training,
in contrast to the unsupervised learning algorithms that take only the input features for training.
Supervised Learning Algorithms try to find the mapping between the inputs and the outputs.
Example: regression, classification.
Random Forest Classifier is an ensemble supervised learning method for classification that
operates by constructing a multitude of decision trees at training time. To classify a particular
sample, the class predicted by most of the decision trees is selected.
To understand the working of random forest classifiers, we first need an idea of how decision
trees work. A decision tree splits the feature space recursively to build a classification tree.
The steps for building a random forest are listed below, followed by a short code sketch.
Steps:
1. Draw a bootstrap sample from the training data.
2. Randomly select m features out of the T total features, where m << T. Here T is the total
number of predictor variables, and out of these we randomly select only a few features.
3. For each node, calculate the best split point among the m selected features.
4. Split the node into two daughter nodes using the best split.
5. Repeat steps 2 to 4 until n nodes have been reached.
6. Build the forest by repeating steps 1 to 5 D times.
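A minimal sketch of training such a classifier with scikit-learn is shown below; the synthetic data and the hyperparameter values (n_estimators, max_features) are illustrative assumptions rather than the project’s actual settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data so the snippet runs on its own; in the project the
# Pima features and Outcome labels would be used instead.
X, y = make_classification(n_samples=768, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Each of the n_estimators trees is grown on a bootstrap sample; max_features
# controls how many features are considered at each split (the "m << T" above).
model = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                               random_state=42)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```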
CHAPTER 5: METHODOLOGY
CHAPTER 6: EXPERIMENTS
6.1 Exploratory Data Analysis
We can see that the number of diabetic samples is smaller than the number of non-diabetic
samples. If we plot a pie chart of the numbers of diabetic and non-diabetic patients, we can see
that 65.1% of the samples are non-diabetic and 34.9% of the samples are diabetic.
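A short sketch of how this class balance can be computed and plotted (the file name diabetes.csv and the column name Outcome are assumptions, as before):

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("diabetes.csv")  # assumed file name

# Percentage of non-diabetic (0) and diabetic (1) samples.
counts = df["Outcome"].value_counts(normalize=True) * 100
print(counts)  # roughly 65.1% for class 0 and 34.9% for class 1

counts.plot.pie(labels=["Non-diabetic", "Diabetic"], autopct="%.1f%%")
plt.title("Class distribution of the Outcome variable")
plt.show()
```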
Distribution of the Feature Variables
These plots show that the features Glucose, BloodPressure, and BMI are approximately normally
distributed, while Pregnancies, Insulin, Age, and DiabetesPedigreeFunction are right-skewed.
Correlation Matrix
The correlation matrix shows how each feature is correlated with the other features and with the
target Outcome variable. Glucose, Age, BMI, and Pregnancies are the features most correlated
with the Outcome. Insulin and DiabetesPedigreeFunction have relatively low correlation with the
Outcome, and BloodPressure and SkinThickness are the least correlated with it. The matrix also
shows that there is some correlation among the features themselves.
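A sketch of how such a correlation matrix can be computed and visualized with pandas and seaborn (file name assumed as before):

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("diabetes.csv")  # assumed file name

corr = df.corr()  # pairwise Pearson correlations between all columns
print(corr["Outcome"].sort_values(ascending=False))  # each feature vs. the target

sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation matrix")
plt.show()
```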
6.2 Missing Value Imputation
The dataset had some missing values which had been encoded as zeros. Features such as Glucose,
BloodPressure, SkinThickness, Insulin, and BMI contained values of zero, which is not
physiologically possible. The zeros in these features were therefore replaced by null values.
These null values were then imputed using the median of the corresponding feature’s values. The
median was preferred over the mean because some of the features were skewed and centered
more around the median.
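A minimal sketch of this imputation step (the file name and the exact column names are assumptions based on the standard Pima CSV):

```python
import numpy as np
import pandas as pd

df = pd.read_csv("diabetes.csv")  # assumed file name

# Zeros in these columns are not physiologically possible, so treat them as missing.
cols = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
df[cols] = df[cols].replace(0, np.nan)

# Fill each column's missing values with that column's median.
df[cols] = df[cols].fillna(df[cols].median())
print(df[cols].isna().sum())  # all zeros after imputation
```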
6.3 Training and Testing
The dataset used to build the model is usually divided into multiple subsets. In particular, two
subsets are used in different stages of the creation of the model: the training set (80%) and the
test set (20%).
The model is initially fit on a training data set, which is a set of examples used to fit the
parameters (e.g. splits of trees of the random forest) of the model. The model is trained on the
training data set using a supervised learning method.
Finally, the test data set is a data set used to provide an unbiased evaluation of a final model fit
on the training data set. If the data in the test data set has never been used in training (for
example in cross-validation), the test data set is also called a holdout data set. The term
"validation set" is sometimes used instead of "test set" in some literature (e.g., if the original data
set was partitioned into only two subsets, the test set might be referred to as the validation set).
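A sketch of the 80%/20% split with scikit-learn's train_test_split; the stratify option and random_state are illustrative choices, not necessarily what the project used.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("diabetes.csv")  # assumed file name
X = df.drop(columns=["Outcome"])
y = df["Outcome"]

# 80% training / 20% test; stratify keeps the class ratio the same in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
print(len(X_train), len(X_test))  # roughly 614 and 154 rows
```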
CHAPTER 7: EVALUATION METRICS
7.1 Confusion Matrix
The confusion matrix of the model’s predictions is shown below, where TP, FP, FN, and TN
denote the numbers of true positives, false positives, false negatives, and true negatives:

TP = 54    FP = 31
FN = 15    TN = 92
7.2 Accuracy
The accuracy is the fraction of all predictions that are correct and is calculated as:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

The accuracy obtained after balancing the data and feature extraction was 76%.
7.3 Recall
Recall is the fraction of actual positive cases that are correctly identified; a high recall indicates
that the positive class is correctly recognized (a small number of FN). Recall is calculated as:

Recall = TP / (TP + FN)

The recall obtained after balancing the data and feature selection was 0.78.
7.4 Precision
To get the value of precision, we divide the total number of correctly classified positive examples
by the total number of predicted positive examples. A high precision indicates that an example
labeled as positive is indeed positive (a small number of FP). Precision is calculated as:

Precision = TP / (TP + FP)

The precision obtained after balancing the data and feature selection was 0.64.
Evaluation Summary:

Evaluation Metric    Value
Accuracy             76%
Recall               0.78
Precision            0.64
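These definitions can be reproduced with scikit-learn's metric functions. The snippet below simply rebuilds label vectors from the reported confusion-matrix counts (TP = 54, FP = 31, FN = 15, TN = 92) and recomputes the three metrics:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Label vectors reconstructed from the reported counts.
y_true = [1] * 54 + [0] * 31 + [1] * 15 + [0] * 92   # actual classes
y_pred = [1] * 54 + [1] * 31 + [0] * 15 + [0] * 92   # predicted classes

print(confusion_matrix(y_true, y_pred))               # rows = actual, columns = predicted
print("accuracy: ", accuracy_score(y_true, y_pred))   # (54 + 92) / 192 ≈ 0.76
print("recall:   ", recall_score(y_true, y_pred))     # 54 / (54 + 15) ≈ 0.78
print("precision:", precision_score(y_true, y_pred))  # 54 / (54 + 31) ≈ 0.64
```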
CHAPTER 8: DEPLOYMENT
The best-performing model was exported as a pickle file. A Flask web application was built in
order to get the input data from users and provide it to the model for prediction. The model took
the input data and classified the user as either diabetic or non-diabetic, along with the probability
of the user being diabetic. The user interface of the Flask web application was created using
HTML and CSS.
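A minimal sketch of such a Flask deployment is shown below; the model file name, route, form handling, and template name are assumptions for illustration, not the project's actual code.

```python
import pickle

import numpy as np
from flask import Flask, render_template, request

app = Flask(__name__)

# Assumed name for the pickled Random Forest model exported earlier.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/", methods=["GET", "POST"])
def predict():
    result = None
    if request.method == "POST":
        # One numeric value per feature, in the order the model was trained on
        # (the form field order is an assumption of this sketch).
        features = [float(v) for v in request.form.values()]
        prob = model.predict_proba(np.array([features]))[0][1]
        result = {"diabetic": bool(prob >= 0.5), "probability": round(float(prob), 2)}
    # "index.html" (under templates/) is an assumed template name.
    return render_template("index.html", result=result)

if __name__ == "__main__":
    app.run(debug=True)
```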
User Interface:
Prediction:
The prediction history was stored in an SQLite database for future use and possible retraining.
CHAPTER 9: CODE
The coding portion was carried out to prepare the data, visualize it, pre-process it, build the
model, and then evaluate it. The code has been written in the Python programming language
using Jupyter Notebook as the IDE. The experiments and all the model building were done using
Python libraries. The code is available in the Git repository at the following link:
https://github.com/sththapa/Diabetes-Cases
Libraries used:
1. NumPy
2. Plotly
3. Matplotlib
4. Seaborn
5. Pandas
6. Sklearn
7. Flask
Key submodules used include sklearn.pipeline.Pipeline and imblearn.over_sampling.SMOTE.
CHAPTER 10: CONCLUSION
The early prognosis of diabetes cases can aid in making decisions on lifestyle changes in
high-risk patients and in turn reduce complications, which can be a great milestone in the field
of medicine. This project addressed feature selection and class balancing behind the models and
successfully predicted diabetes cases with around 76% accuracy. Several machine learning
algorithms were compared to find the best one; the Random Forest model gave the best result
among all the models and has therefore been selected as the final model for this project.
REFERENCES
[2] M. Skurichina, L. I. Kuncheva, and R. P. W. Duin, “Bagging and Boosting for the Nearest
Mean Classifier: Effects of Sample Size on Diversity and Accuracy.”
[3] M. Lindenbaum, S. Markovitch, and D. Rusakov, “Selective Sampling Using Random Field
Modelling.”