Project Report On Diabetes Prediction
Submitted To:
Submitted By:
STUDENT’S DECLARATION
We, the undersigned, solemnly declare that the report of the project work entitled
“DIABETES PREDICTION USING MACHINE LEARNING” is based on our own work
carried out during the course of study under the supervision of Er. Sudan Prajapati.
We assert that the statements made and conclusions drawn are an outcome of the project work.
We further declare that, to the best of our knowledge and belief, the project report does not
contain any part of any work which has been submitted for the award of any other
degree/diploma/certificate in this University.
Tribhuvan University
Institute of Science and Technology
Supervisor’s Recommendation
I hereby recommend that this project work report is satisfactory for partial fulfillment
of the requirements for the degree of Bachelor of Science in Computer Science and
Information Technology and that it be processed for evaluation.
………………............................
Er. Sudan Prajapati
Lecturer
Patan Multiple Campus
(Supervisor)
Date:
Tribhuvan University
Institute of Science and Technology
LETTER OF APPROVAL
This is to certify that the project prepared by Mr. Santosh Thapa (10052/073), Mr.
Shyam Das Shrestha (10057/073), and Mr. Rojen Bahadur Pradhan (10043/073),
entitled “DIABETES PREDICTION USING MACHINE LEARNING”, in partial
fulfillment of the requirements for the degree of B.Sc. in Computer Science and
Information Technology, has been well studied. In our opinion, it is satisfactory in
scope and quality as a project for the required degree.
……………………………
Er. Sudan Prajapati
Lecturer
Patan Multiple Campus
(Supervisor)
Department of Computer Science
and Information Technology
Patan Multiple Campus
ACKNOWLEDGEMENT
It is a great pleasure to have the opportunity to extend our heartfelt gratitude to everyone
who helped us throughout the course of this project. We are profoundly grateful to our
supervisor Er. Sudan Prajapati, lecturer at Patan Multiple Campus, for his expert
guidance, continuous encouragement, and ever-willingness to spare time from his
otherwise busy schedule for the project’s progress reviews. His continuous inspiration
enabled us to complete this project and achieve its target.
We would also like to express our deepest appreciation to Mr. Mahesh Kumar Yadav,
coordinator, and Mr. Prabin Lal Shrestha, Head of Department, Patan Multiple Campus,
Department of Computer Science and Information Technology, for their constant
motivation, support, and for providing us with a suitable working environment. We would
also like to extend our sincere regards to Er. Jyoti Prakash Chaudhary and all the faculty
members for their support and encouragement.
Finally, our special thanks go to all the staff members of the BSc CSIT department who
directly and indirectly extended their hands in making this project a success.
ABSTRACT
Diabetes is a common disease caused by a group of metabolic disorders in which blood sugar
levels remain high over a prolonged period. It affects various organs of the human body, in
particular the blood vessels and nerves. Accurate early prediction of such a disease can save
human lives. However, it is not possible to monitor patients accurately every day in all cases,
and round-the-clock consultation with a doctor is not available, since it requires a great deal of
patience, time, and expertise. To achieve this goal, this project work examines the various factors
associated with the disease using machine learning techniques. The early prognosis of diabetes
can aid in making decisions on lifestyle changes in high-risk patients and in turn reduce
complications, which can be a great milestone in the field of medicine.
Keywords: Machine Learning, Random Forest Classification, Accuracy, Recall, Precision, Flask
LIST OF FIGURES:
LIST OF TABLES:
Table of Contents
CHAPTER 1: INTRODUCTION
1.1 Problem Definition
1.2 Motivation
1.3 Objectives
CHAPTER 2: LITERATURE REVIEW
CHAPTER 3: DATASETS
CHAPTER 4: METHODS AND ALGORITHMS USED
4.1 Synthetic Minority Oversampling Technique (SMOTE)
4.2 Random Forest Classifier
CHAPTER 5: METHODOLOGY
5.1 Software Development Model
5.2 Gantt Chart
CHAPTER 6: EXPERIMENTS
6.1 Exploratory Data Analysis
6.2 Missing Value Imputation
6.3 Training and Testing
CHAPTER 7: EVALUATION METRICS
7.1 Confusion Matrix
7.2 Accuracy
7.3 Recall
7.4 Precision
CHAPTER 8: DEPLOYMENT
CHAPTER 9: CODE
CHAPTER 10: CONCLUSION
References
CHAPTER 1: INTRODUCTION
Diabetes is a disease that affects the hormone insulin, resulting in abnormal metabolism of
carbohydrates and elevated levels of sugar in the blood. High blood sugar affects several
organs of the human body and in turn damages many of its systems, in particular the blood
vessels and nerves. The causes of diabetes are not yet fully understood; many researchers
believe that both hereditary factors and environmental influences are involved. According to
the International Diabetes Federation [1], the number of people with diabetes reached 422
million in 2021, which makes up 5.34% of the world’s total adult population. Early prediction
of such a disease can help control it and save human lives. To accomplish this goal, this project
work focuses on the early prediction of diabetes by taking into account various risk factors
related to the disease. For the purpose of the study, we gathered a diagnostic dataset containing
attributes of diabetic and non-diabetic patients. Later, we discuss these attributes along with
their corresponding values. Based on these attributes, we build a prediction model using
machine learning techniques to predict diabetes. Machine learning techniques provide efficient
ways to extract knowledge by building prediction models from diagnostic medical datasets
collected from diabetic patients. Since it is difficult to select the best technique to predict
diabetes from such attributes, different algorithms have been compared for model building.
1.1 Problem Definition
The major challenge with diabetes is its early detection. Instruments are available that can
predict diabetes, but they are either expensive or not efficient enough to estimate a person’s
chance of developing diabetes. Early detection of diabetes can decrease the mortality rate and
overall complications. However, it is not possible to monitor patients accurately every day in
all cases, and round-the-clock consultation with a doctor is not available, since it requires a
great deal of patience, time, and expertise. Since a good amount of data is available in today’s
world, various machine learning algorithms can be used to analyze the data for hidden patterns.
These hidden patterns can then be used for diagnosis in medical data.
1.2 Motivation
Machine learning techniques have been widely compared and used for analysis in many kinds
of data science applications. This project is carried out with the motivation to develop an
appropriate computer-based decision support system that can aid in the early detection of
diabetes. In this project we have developed a model that classifies whether a patient will have
diabetes based on various features (i.e. potential risk factors for diabetes) using a random forest
classifier. The early prognosis of diabetes can aid in making decisions on lifestyle changes in
high-risk patients and in turn reduce complications, which can be a great milestone in the field
of medicine.
1.3 Objectives
CHAPTER 2: LITERATURE REVIEW
Various researchers have carried out studies in the area of diabetes by using machine learning
techniques to extract knowledge from existing medical data. For illustration, Marina Skurichina,
Ludmila Kuncheva, and Robert P. W. Duin studied bagging and boosting for the nearest mean
classifier and the effects of sample size on diversity and accuracy [2]. Michael Lindenbaum,
Shaul Markovitch, and Dmitry Rusakov investigated selective sampling using random field
modelling [3]. In this work, we examine real diagnostic medical data based on numerous risk
factors using popular machine learning classification techniques to assess their performance
for predicting diabetes cases.
CHAPTER 3: DATASETS
In this work, the Pima Indian Diabetes Dataset [1] has been used. The dataset was collected from
the Pima Indian female population near Phoenix, Arizona. This particular dataset has been widely
used in machine learning experiments and is currently available through the UCI repository of
standard datasets. This population has been studied continuously by the National Institute of
Diabetes and Digestive and Kidney Diseases. The UCI repository version contains 768 instances
of observations and a total of 9 attributes, with no missing values reported. The dataset contains
8 feature variables that are considered high-risk factors for the occurrence of diabetes and 1
target variable containing ‘1’ for diabetic and ‘0’ for non-diabetic patients. The 8 feature
variables along with the target variable are shown in the following table (Table 1).
Figure 1: Original Dataset Snapshot
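As a minimal sketch of loading and inspecting this dataset with pandas (the file name diabetes.csv and the exact column names are assumptions based on the publicly distributed Pima Indians Diabetes CSV, not taken from the project code):

```python
import pandas as pd

# Assumed file name and column layout of the Pima Indians Diabetes CSV.
df = pd.read_csv("diabetes.csv")

print(df.shape)             # expected (768, 9): 8 feature columns plus Outcome
print(df.columns.tolist())  # e.g. Pregnancies, Glucose, BloodPressure, SkinThickness,
                            # Insulin, BMI, DiabetesPedigreeFunction, Age, Outcome
print(df["Outcome"].value_counts())  # 1 = diabetic, 0 = non-diabetic
```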
CHAPTER 4: METHODS AND ALGORITHMS USED
The main purpose of designing this system is to predict the risk of future diabetes. We have
used various methods and algorithms throughout the different phases of the machine learning
pipeline, which are discussed below.
4.1 Synthetic Minority Oversampling Technique (SMOTE)
The minority class is over-sampled by taking each minority class sample and introducing
synthetic examples along the line segments joining any/all of the k minority class nearest
neighbors. Depending upon the amount of over-sampling required, neighbors from the k nearest
neighbors are randomly chosen. The current implementation uses five nearest neighbors. For
instance, if the amount of over-sampling needed is 200%, only two neighbors from the five
nearest neighbors are chosen and one sample is generated in the direction of each. Synthetic
samples are generated in the following way: Take the difference between the feature vector
(sample) under consideration and its nearest neighbor. Multiply this difference by a random
number between 0 and 1, and add it to the feature vector under consideration. This causes the
selection of a random point along the line segment between two specific features. This approach
effectively forces the decision region of the minority class to become more general.[3]
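Since imblearn.over_sampling.SMOTE appears in the project’s library list, a minimal, self-contained sketch of this oversampling step is given below; the synthetic stand-in data and the random_state value are illustrative assumptions, not the project’s actual inputs.

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic stand-in for the Pima features (X) and Outcome labels (y),
# with roughly the 65%/35% class imbalance described in the report.
X, y = make_classification(n_samples=768, n_features=8,
                           weights=[0.65, 0.35], random_state=42)

# SMOTE interpolates new minority samples along the segments to the k nearest
# minority-class neighbours (k = 5 here, matching the description above).
smote = SMOTE(k_neighbors=5, random_state=42)
X_res, y_res = smote.fit_resample(X, y)

print("before:", Counter(y))      # imbalanced class counts
print("after: ", Counter(y_res))  # balanced after oversampling
```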
4.2 Random Forest Classifier
Machine learning (ML) is the study of computer algorithms that improve automatically through
experience and by the use of data. Traditional programming approaches take in data and rules to
produce the desired output, whereas machine learning approaches take in data and the desired
output and produce the necessary rules.
Supervised Learning Algorithms use the input features along with the target output for training,
in contrast to the unsupervised learning algorithms that take only the input features for training.
Supervised Learning Algorithms try to find the mapping between the inputs and the outputs.
Example: regression, classification.
Random Forest Classifier is an ensemble supervised learning method for classification that
operates by constructing a multitude of decision trees at training time. To classify a particular
sample, the class predicted by most of the decision trees is selected.
To understand the working of random forest classifiers, we first need an idea of how decision
trees work. A decision tree splits the feature space recursively to build a classification tree.
The steps for building a random forest are listed below, followed by a short code sketch.
Steps:
1. Draw a bootstrap sample from the training data.
2. Randomly select m features out of the T total features, where m << T. Here T is the total
number of predictor variables, and out of these we randomly select only a few features.
3. For each node, calculate the best split point among the m selected features.
4. Split the node into two daughter nodes using the best split.
5. Repeat steps 2 to 4 until n nodes have been reached.
6. Build the forest by repeating steps 1 to 5 D times.
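A minimal sketch of training such a classifier with scikit-learn is shown below; the synthetic data and the hyperparameter values (n_estimators, max_features) are illustrative assumptions rather than the project’s actual settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data so the snippet runs on its own; in the project the
# Pima features and Outcome labels would be used instead.
X, y = make_classification(n_samples=768, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Each of the n_estimators trees is grown on a bootstrap sample; max_features
# controls how many features are considered at each split (the "m << T" above).
model = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                               random_state=42)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```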
CHAPTER 5: METHODOLOGY
CHAPTER 6: EXPERIMENTS
6.1 Exploratory Data Analysis
We can see that the number of diabetic samples is smaller than the number of non-diabetic
samples. If we plot a pie chart of the numbers of diabetic and non-diabetic patients, we can see
that 65.1% of the samples are non-diabetic and 34.9% of the samples are diabetic.
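A short sketch of how this class balance can be computed and plotted (the file name diabetes.csv and the column name Outcome are assumptions, as before):

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("diabetes.csv")  # assumed file name

# Percentage of non-diabetic (0) and diabetic (1) samples.
counts = df["Outcome"].value_counts(normalize=True) * 100
print(counts)  # roughly 65.1% for class 0 and 34.9% for class 1

counts.plot.pie(labels=["Non-diabetic", "Diabetic"], autopct="%.1f%%")
plt.title("Class distribution of the Outcome variable")
plt.show()
```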
Distribution of the Feature Variables
These plots show that the features Glucose, BloodPressure, and BMI are approximately normally
distributed, while Pregnancies, Insulin, Age, and DiabetesPedigreeFunction are right-skewed.
Correlation Matrix
The correlation matrix shows how each feature is correlated with the other features and with the
target Outcome variable. Glucose, Age, BMI, and Pregnancies are the features most correlated
with the Outcome. Insulin and DiabetesPedigreeFunction have relatively low correlation with the
Outcome, and BloodPressure and SkinThickness are the least correlated with it. The matrix also
shows that there is some correlation among the features themselves.
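A sketch of how such a correlation matrix can be computed and visualized with pandas and seaborn (file name assumed as before):

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("diabetes.csv")  # assumed file name

corr = df.corr()  # pairwise Pearson correlations between all columns
print(corr["Outcome"].sort_values(ascending=False))  # each feature vs. the target

sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation matrix")
plt.show()
```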
6.2 Missing Value Imputation
The dataset had some missing values which had been encoded as zeros. Features such as Glucose,
BloodPressure, SkinThickness, Insulin, and BMI contained values of zero, which is not
physiologically possible. The zeros in these features were therefore replaced by null values.
These null values were then imputed using the median of the corresponding feature’s values. The
median was preferred over the mean because some of the features were skewed and centered
more around the median.
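A minimal sketch of this imputation step (the file name and the exact column names are assumptions based on the standard Pima CSV):

```python
import numpy as np
import pandas as pd

df = pd.read_csv("diabetes.csv")  # assumed file name

# Zeros in these columns are not physiologically possible, so treat them as missing.
cols = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
df[cols] = df[cols].replace(0, np.nan)

# Fill each column's missing values with that column's median.
df[cols] = df[cols].fillna(df[cols].median())
print(df[cols].isna().sum())  # all zeros after imputation
```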
6.3 Training and Testing
The dataset used to build the model is usually divided into multiple subsets. In particular, two
subsets are used in different stages of the creation of the model: the training set (80%) and the
test set (20%).
The model is initially fit on a training data set, which is a set of examples used to fit the
parameters (e.g. splits of trees of the random forest) of the model. The model is trained on the
training data set using a supervised learning method.
Finally, the test data set is a data set used to provide an unbiased evaluation of a final model fit
on the training data set. If the data in the test data set has never been used in training (for
example in cross-validation), the test data set is also called a holdout data set. The term
"validation set" is sometimes used instead of "test set" in some literature (e.g., if the original data
set was partitioned into only two subsets, the test set might be referred to as the validation set).
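A sketch of the 80%/20% split with scikit-learn's train_test_split; the stratify option and random_state are illustrative choices, not necessarily what the project used.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("diabetes.csv")  # assumed file name
X = df.drop(columns=["Outcome"])
y = df["Outcome"]

# 80% training / 20% test; stratify keeps the class ratio the same in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
print(len(X_train), len(X_test))  # roughly 614 and 154 rows
```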
CHAPTER 7: EVALUATION METRICS
7.1 Confusion Matrix
The confusion matrix of the model’s predictions is shown below, where TP, FP, FN, and TN
denote the numbers of true positives, false positives, false negatives, and true negatives:

TP = 54    FP = 31
FN = 15    TN = 92
7.2 Accuracy
The accuracy is the fraction of all predictions that are correct and is calculated as:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

The accuracy obtained after balancing the data and feature extraction was 76%.
7.3 Recall
Recall is the fraction of actual positive cases that are correctly identified; a high recall indicates
that the positive class is correctly recognized (a small number of FN). Recall is calculated as:

Recall = TP / (TP + FN)

The recall obtained after balancing the data and feature selection was 0.78.
7.4 Precision
To get the value of precision, we divide the total number of correctly classified positive examples
by the total number of predicted positive examples. A high precision indicates that an example
labeled as positive is indeed positive (a small number of FP). Precision is calculated as:

Precision = TP / (TP + FP)

The precision obtained after balancing the data and feature selection was 0.64.
Evaluation Summary:

Evaluation Metric    Value
Accuracy             76%
Recall               0.78
Precision            0.64
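These definitions can be reproduced with scikit-learn's metric functions. The snippet below simply rebuilds label vectors from the reported confusion-matrix counts (TP = 54, FP = 31, FN = 15, TN = 92) and recomputes the three metrics:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Label vectors reconstructed from the reported counts.
y_true = [1] * 54 + [0] * 31 + [1] * 15 + [0] * 92   # actual classes
y_pred = [1] * 54 + [1] * 31 + [0] * 15 + [0] * 92   # predicted classes

print(confusion_matrix(y_true, y_pred))               # rows = actual, columns = predicted
print("accuracy: ", accuracy_score(y_true, y_pred))   # (54 + 92) / 192 ≈ 0.76
print("recall:   ", recall_score(y_true, y_pred))     # 54 / (54 + 15) ≈ 0.78
print("precision:", precision_score(y_true, y_pred))  # 54 / (54 + 31) ≈ 0.64
```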
CHAPTER 8: DEPLOYMENT
The best-performing model was exported as a pickle file. A Flask web application was built in
order to get the input data from users and provide it to the model for prediction. The model took
the input data and classified the user as either diabetic or non-diabetic, along with the probability
of the user being diabetic. The user interface of the Flask web application was created using
HTML and CSS.
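A minimal sketch of such a Flask deployment is shown below; the model file name, route, form handling, and template name are assumptions for illustration, not the project's actual code.

```python
import pickle

import numpy as np
from flask import Flask, render_template, request

app = Flask(__name__)

# Assumed name for the pickled Random Forest model exported earlier.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/", methods=["GET", "POST"])
def predict():
    result = None
    if request.method == "POST":
        # One numeric value per feature, in the order the model was trained on
        # (the form field order is an assumption of this sketch).
        features = [float(v) for v in request.form.values()]
        prob = model.predict_proba(np.array([features]))[0][1]
        result = {"diabetic": bool(prob >= 0.5), "probability": round(float(prob), 2)}
    # "index.html" (under templates/) is an assumed template name.
    return render_template("index.html", result=result)

if __name__ == "__main__":
    app.run(debug=True)
```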
User Interface:
Prediction:
The prediction history was stored in an SQLite database for future use and possible retraining.
CHAPTER 9: CODE
The coding portion was carried out to prepare the data, visualize it, pre-process it, build the
model, and then evaluate it. The code has been written in the Python programming language
using Jupyter Notebook as the IDE. The experiments and all the model building were done using
Python libraries. The code is available in the Git repository at the following link:
https://github.com/sththapa/Diabetes-Cases
Libraries used:
1. NumPy
2. Plotly
3. Matplotlib
4. Seaborn
5. Pandas
6. Sklearn
7. Flask
Key submodules used include sklearn.pipeline.Pipeline and imblearn.over_sampling.SMOTE.
CHAPTER 10: CONCLUSION
The early prognosis of diabetes cases can aid in making decisions on lifestyle changes in
high-risk patients and in turn reduce complications, which can be a great milestone in the field
of medicine. This project addressed feature selection and class balancing behind the models and
successfully predicted diabetes cases with around 76% accuracy. Several machine learning
algorithms were compared to find the best one; the Random Forest model gave the best result
among all the models and has therefore been selected as the final model for this project.
REFERENCES
[2] M. Skurichina, L. I. Kuncheva, and R. P. W. Duin, “Bagging and Boosting for the Nearest
Mean Classifier: Effects of Sample Size on Diversity and Accuracy.”
[3] M. Lindenbaum, S. Markovitch, and D. Rusakov, “Selective Sampling Using Random Field
Modelling.”