
Classifier Model For Diabetes Prediction

The document discusses building a classifier model to predict diabetes using medical data. It describes preprocessing data by replacing missing values and scaling features. Logistic regression and SVM models are tested; logistic regression has the higher recall (0.600 vs 0.489) in identifying diabetic patients, while SVM has slightly higher precision and accuracy.

Classifier Model for Diabetes Prediction
Group 13
Mahindapala D.P.P EG/2016/2916
Thalpavila T.W.K.M.B.K EG/2016/2997

Introduction


▪ As diabetes has become a common disease in present-day society, it has become a major problem not only for public health but also for the economy of the country.


▪ This project constructs a classifier model for diabetes prediction. It is useful for making people aware of their health and of their risk of becoming diabetic, so that they can change their habits and become healthier.

ML Definition of the problem

▶ Task - Predicting diabetes in patients using diagnostic measures
▶ Performance Measure - Recall of predictions
▶ Experience - Labelled medical data of patients from the Pima Indians Diabetes Database

Problem Statement


▪ Sri Lanka, which has a free health service, has to spend a lot of money every year on treating people who have the disease.

▪ This also has a negative impact on the economy of the country.


▪ It is important to find a way to make people aware of the situation, in order to reduce their risk of getting diabetes.

▪ It will also help to give them proper treatment on time.

Methodology


▪ In this project we have constructed a classifier model for diabetes prediction, framed as a supervised learning problem.

▪ For this binary classification problem, we use two machine learning algorithms: Logistic Regression and Support Vector Machines (SVM).


▪ The dataset has 768 data points collected from females aged 21 or older in Arizona, and consists of the following features:
  Pregnancies
  Glucose
  Blood Pressure
  Skin Thickness (Triceps)
  Insulin
  BMI
  Diabetes Pedigree Function
  Age


▪ In our dataset there are zero values for Glucose, Blood Pressure, Skin Thickness, Insulin and BMI.

▪ These features cannot be zero in the real world. Therefore, we have treated the zeros as missing values.


▪ To handle the missing values, we have chosen to replace the zeros with the median of each feature, which is reasonable because most of the values are distributed around the center.

▪ First we replaced all zeros in these features with NaN (Not-a-Number) in Python, then we replaced each NaN with the corresponding median value.
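
The two-step replacement described above could look roughly like the following sketch. It assumes the data is held in a pandas DataFrame; the column names and the five rows of values are made up for illustration and are not taken from the actual dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for the real data (values are invented).
df = pd.DataFrame({
    "Glucose": [148, 85, 0, 89, 137],
    "BloodPressure": [72, 66, 64, 0, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 0.0, 43.1],
    "Pregnancies": [6, 1, 8, 1, 0],  # zero is a valid value here, so it is left alone
})

# Features where a zero cannot occur in the real world.
zero_as_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Step 1: mark the impossible zeros as missing.
df[zero_as_missing] = df[zero_as_missing].replace(0, np.nan)

# Step 2: fill each column's NaNs with that column's median.
# DataFrame.median() skips NaN by default, so the zeros no longer drag it down.
df[zero_as_missing] = df[zero_as_missing].fillna(df[zero_as_missing].median())

print(df)
```

Converting to NaN first matters: if the zeros were left in place, they would be counted when the median is computed.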


▪ We have selected the first 650 records as the training data set and the remaining 118 records as the testing data set.
▪ Feature scaling during preprocessing can help to improve the performance of distance-based algorithms like SVM.
▪ In our project we have standardized the data so that each feature has a mean of 0 and a standard deviation of 1.
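
A minimal sketch of the split and the standardization, using scikit-learn's `StandardScaler` on synthetic data of the same shape (768 rows, 8 features). Fitting the scaler on the training rows only is an assumption made here; the slides do not say which rows were used to compute the mean and standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 768 x 8 feature matrix and its labels.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(768, 8))
y = rng.integers(0, 2, size=768)

# Positional split as described: first 650 rows for training, remaining 118 for testing.
X_train, X_test = X[:650], X[650:]
y_train, y_test = y[:650], y[650:]

# Assumption: learn mean and standard deviation from the training rows only,
# then apply the same transformation to both sets.
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

print(X_train_std.mean(axis=0).round(3), X_train_std.std(axis=0).round(3))
```

With this choice the training features have mean 0 and standard deviation 1 exactly, while the test features are only approximately standardized.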


▪ To find the best hyperparameters for the two classifiers, we have used GridSearchCV in scikit-learn. We have selected the following hyperparameters.
  Logistic Regression: C = 0.001, Solver = liblinear
  SVM: Kernel = sigmoid, C = 10, Gamma = 1
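
A hedged sketch of how such a search can be set up with `GridSearchCV`. The candidate grids, the 5-fold cross-validation and the use of recall as the scoring metric are assumptions chosen for illustration (the slides list only the selected values), and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic binary classification data with 8 features.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Hypothetical grid for Logistic Regression; it includes the selected C = 0.001.
log_reg_search = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    param_grid={"C": [0.001, 0.01, 0.1, 1, 10]},
    scoring="recall",  # assumption: optimise the stated performance measure
    cv=5,
)

# Hypothetical grid for SVM; it includes kernel = sigmoid, C = 10, gamma = 1.
svm_search = GridSearchCV(
    SVC(),
    param_grid={
        "kernel": ["linear", "rbf", "sigmoid"],
        "C": [0.1, 1, 10],
        "gamma": [0.01, 0.1, 1],
    },
    scoring="recall",
    cv=5,
)

log_reg_search.fit(X, y)
svm_search.fit(X, y)
print(log_reg_search.best_params_)
print(svm_search.best_params_)
```

On the synthetic data the winning combination will generally differ from the values selected in the project.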

Results


▪ Our priority is to minimise false negatives rather than false positives.

  False negative: a patient who is diabetic is not predicted as diabetic. This can lead to major health issues.
  False positive: a patient is falsely predicted as diabetic. The patient has to take further tests and treatment.


▪ Precision of Logistic Regression : 0.692
▪ Recall of Logistic Regression : 0.600
▪ Accuracy of Logistic Regression : 0.746

▪ Precision of SVM : 0.786
▪ Recall of SVM : 0.489
▪ Accuracy of SVM : 0.754
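
As a worked check of how these three measures relate to confusion-matrix counts, the sketch below computes them from true/false positives and negatives. The counts are not read from the slides: they are inferred by arithmetic as one set of values that is consistent with the reported figures, assuming 118 test records of which 45 are diabetic.

```python
def metrics(tp, fp, fn, tn):
    """Return (precision, recall, accuracy), each rounded to 3 decimals."""
    precision = tp / (tp + fp)                  # of predicted diabetics, how many are diabetic
    recall = tp / (tp + fn)                     # of actual diabetics, how many are found
    accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of all predictions that are correct
    return round(precision, 3), round(recall, 3), round(accuracy, 3)

# Inferred counts (an assumption, not the slides' own confusion matrices).
log_reg = metrics(tp=27, fp=12, fn=18, tn=61)
svm = metrics(tp=22, fp=6, fn=23, tn=67)

print(log_reg)  # (0.692, 0.6, 0.746)
print(svm)      # (0.786, 0.489, 0.754)
```

Under these inferred counts, Logistic Regression misses 18 diabetic patients and SVM misses 23, which is the difference the recall values express.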


Confusion matrices of the Logistic Regression and SVM classifiers


▪ According to the confusion matrices, the Logistic Regression classifier gives fewer false negatives than the SVM classifier.
▪ The Logistic Regression classifier therefore also has a higher recall than the SVM classifier.


Precision-Recall Curves for the two classifiers
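
A precision-recall curve shows how precision and recall change as the decision threshold on the classifier's score is moved. The sketch below uses scikit-learn's `precision_recall_curve` on a small set of made-up labels and scores, purely to illustrate the trade-off. These are not the project's predictions.

```python
from sklearn.metrics import precision_recall_curve

# Made-up true labels (1 = diabetic) and classifier scores for 8 patients.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.10, 0.30, 0.35, 0.40, 0.55, 0.60, 0.70, 0.90]

# One (precision, recall) pair per candidate threshold, plus a final (1, 0) point.
precision, recall, thresholds = precision_recall_curve(y_true, scores)

for p, r, t in zip(precision, recall, thresholds):
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```

Lowering the threshold raises recall (fewer false negatives), usually at the cost of precision. This is the relevant trade-off when false negatives are the main concern.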

Discussion


▪ The Logistic Regression classifier performed better on our performance measure, with a recall of 0.6 on the testing data.
▪ In other words, about 60% of diabetic patients are correctly identified as diabetic by the model.

Limitations

▪ Because the data was collected between the 1960s and the 1980s, the results may not be entirely relevant to present conditions.
▪ Other diagnostic measures, such as urine tests and haemoglobin tests, can also be used to identify diabetes.
▪ Only 768 data points, collected from patients in one area, are available.

Conclusion


▪ This project provides a good start on predicting the risk of having diabetes using medical data.
▪ Blood glucose level and BMI are the most prominent features used to identify patients as diabetic.
▪ Maintaining a healthy blood glucose level and a normal BMI is important for a healthy life.

Q&A

Pima Indians of Arizona
