A Mini Project Report On
A MACHINE LEARNING APPROACH TO DIABETES RISK
ASSESSMENT
A Dissertation submitted in partial fulfillment of the academic
requirements for the award of the degree.
Bachelor of Technology
In
CSE (Artificial Intelligence & Machine Learning)
Submitted by
(Student Name) (Roll No)
A. RAJESH REDDY (21H51A66E3)
D. SIMHADRI (21H51A66E6)
B. LOKESH (21H51A66E4)
Under the esteemed guidance of
Ms. Sana Afreen
Assistant Professor, CSE (AI&ML)
Department of Computer Science & Engineering (AI&ML)
CMR COLLEGE OF ENGINEERING & TECHNOLOGY
(Autonomous)
(NAAC Accredited with ‘A+’ Grade & NBA Accredited) (Approved by
AICTE, Permanently Affiliated to JNTU Hyderabad) KANDLAKOYA,
MEDCHAL ROAD, HYDERABAD-501401
2024-25
CMR COLLEGE OF ENGINEERING & TECHNOLOGY
KANDLAKOYA, MEDCHAL ROAD, HYDERABAD – 501401
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
(AI&ML)
CERTIFICATE
This is to certify that the Mini Project report entitled “A MACHINE LEARNING APPROACH
TO DIABETES RISK ASSESSMENT” being submitted by A. Rajesh Reddy (21H51A66E3),
D. Simhadri (21H51A66E6), B. Lokesh (21H51A66E4) in partial fulfillment for the award
of Bachelor of Technology in Computer Science & Engineering (AI&ML) is a record of
bonafide work carried out by them under my guidance and supervision.
The results embodied in this project report have not been submitted to any other University or
Institute for the award of any Degree.
Ms. Sana Afreen                    Dr. P. Sruthi
Guide, Assistant Professor         Associate Professor & HOD
Dept. of CSE (AI&ML)               Dept. of CSE (AI&ML)
ACKNOWLEDGMENT
With great pleasure, we take this opportunity to express our heartfelt gratitude to all the people who
helped make this project work a success.
We are grateful to Ms. Sana Afreen, Assistant Professor, Dept. of Computer Science and Engineering (AI&ML),
for her valuable suggestions and guidance during the execution of this project work.
We would like to thank Dr. P. Sruthi, Head of the Department of Computer Science and Engineering (AI&ML),
for moral support throughout the period of our study at CMRCET.
We are highly indebted to Major Dr. V.A. Narayana, Principal, CMRCET, for giving permission to carry out
this project in a successful and fruitful way.
We would like to thank the teaching and non-teaching staff of the Department of Computer Science and
Engineering for their co-operation.
Finally, we express our sincere thanks to Mr. Ch. Gopal Reddy, Secretary, CMR Group of Institutions, for
his continuous care. We sincerely acknowledge and thank all those who supported us, directly and indirectly,
in the completion of this project work.
A.RAJESH REDDY (21H51A66E3)
D.SIMHADRI (21H51A66E6)
B.LOKESH (21H51A66E4)
DECLARATION
We hereby declare that the results embodied in this Report of Project on “A MACHINE LEARNING
APPROACH TO DIABETES RISK ASSESSMENT” are from work carried out by us in partial
fulfillment of the requirements for the award of the B. Tech degree. We have not submitted this report to any
other university/institute for the award of any other degree.
NAME ROLL NO SIGNATURE
A.RAJESH REDDY (21H51A66E3)
D.SIMHADRI (21H51A66E6)
B.LOKESH (21H51A66E4)
ABSTRACT
Diabetes Mellitus is a critical disease that affects a large number of people. Age, obesity, lack of exercise,
heredity, lifestyle, poor diet, high blood pressure, and similar factors can contribute to Diabetes Mellitus.
People with diabetes are at high risk of conditions such as heart disease, kidney disease, stroke, eye
problems, and nerve damage. Current hospital practice is to collect the information required for diabetes
diagnosis through various tests and to provide appropriate treatment based on the diagnosis. Big Data
Analytics plays a significant role in the healthcare industry, which maintains large-volume databases.
Using big data analytics, one can study huge datasets and find hidden information and patterns to discover
knowledge from the data and predict outcomes accordingly. In existing methods, the classification and
prediction accuracy is limited. In this work, we propose a diabetes prediction model for better
classification of diabetes, which includes a few external factors responsible for diabetes along with regular
factors such as Glucose, BMI, Age, and Insulin. Classification accuracy improves with the new dataset compared
to the existing dataset. Further, we apply a pipeline model for diabetes prediction aimed at
improving classification accuracy.
Early detection of diabetes is essential to prevent serious complications in patients. The purpose of this work
is to detect and classify type 2 diabetes in patients using machine learning (ML) models, and to select the
most suitable model to predict the risk of diabetes. Five ML models are investigated: K-nearest
neighbor (K-NN), Bernoulli Naïve Bayes (BNB), decision tree (DT), logistic regression (LR), and support
vector machine (SVM). A Kaggle-hosted Pima Indian dataset containing 768 patients with and without
diabetes was used. Its variables are the number of pregnancies the patient has had, blood glucose
concentration, diastolic blood pressure, skinfold thickness, insulin level, body mass index (BMI),
diabetes pedigree function (a measure of diabetes in the family tree), age, and outcome
(with/without diabetes). The results show that the K-NN and BNB models outperform the other models.
TABLE OF CONTENTS
CHAPTER NO. DESCRIPTION PAGE NO.
LIST OF FIGURES
LIST OF TABLES
ABSTRACT 5
1 INTRODUCTION
1.1 Problem Statement 7
1.2 Research Objective 8
1.3 Project Scope and Limitations 8
2 BACKGROUND WORK
2.1 Medtronic MiniMed 670G
2.1.1 Introduction 9
2.1.2 Merits, Demerits and Challenges 9
2.1.3 Implementation 10
2.2 IBM Watson Health for Diabetes Management
2.2.1 Introduction 10
2.2.2 Merits, Demerits and Challenges 10
2.2.3 Implementation 10
3 PROPOSED SYSTEM
3.1 Objective Of Proposed Model 11
3.2 Algorithms Used For Proposed Model 11-12
3.3 UML diagrams & Architecture 13
3.4 System Implementation and Code 14-18
4 RESULTS AND DISCUSSION
4.1 Performance metrics 19-20
4.2 Result Screenshot 20
5 CONCLUSION 21
REFERENCES 22
CHAPTER 1
INTRODUCTION
1.1 PROBLEM STATEMENT
Healthcare sectors maintain large-volume databases, which may contain structured, semi-structured, or
unstructured data. Big data analytics is the process of analysing huge datasets to reveal hidden
information and patterns and thereby discover knowledge from the data. In developing countries like India,
Diabetes Mellitus (DM) has become a very severe disease. DM is classified as a Non-Communicable Disease
(NCD), and many people suffer from it. According to 2017 statistics, around 425 million people have
diabetes, and approximately 2-5 million patients lose their lives to it every year. The number of people
affected is projected to rise to 629 million by 2045.[1]
DM is classified into three types. Type-1, known as Insulin-Dependent Diabetes Mellitus (IDDM), is caused by
the body's inability to produce sufficient insulin, so the patient requires insulin injections. Type-2, also
known as Non-Insulin-Dependent Diabetes Mellitus (NIDDM), occurs when body cells are not able to use
insulin properly. The third type, gestational diabetes, is a rise in blood sugar level in a pregnant woman in
whom diabetes was not detected earlier. DM has long-term complications associated with it, and a diabetic
person is at high risk of various other health problems.
Predictive analysis incorporates a variety of machine learning algorithms, data mining techniques, and
statistical methods that use current and past data to find knowledge and predict future events. Applying
predictive analysis to healthcare data supports significant decisions and predictions. Predictive analytics
can be carried out using machine learning and regression techniques, and aims at diagnosing disease with the
best possible accuracy, enhancing patient care, optimizing resources, and improving clinical outcomes.[1]
Machine learning is considered one of the most important branches of artificial intelligence; it supports the
development of computer systems that acquire knowledge from past experience without being explicitly
programmed for every case.
Machine learning is increasingly needed to reduce human effort by supporting automation with minimal
errors. The existing method for diabetes detection uses lab tests such as fasting blood glucose and oral
glucose tolerance tests. However, this method is time-consuming. This paper focuses on building a predictive
model for diabetes prediction using machine learning algorithms and data mining techniques.
CMR College of Engineering & Technology 7
1.2 RESEARCH OBJECTIVE
The primary objective of this research is to develop a machine learning-based diabetes prediction model that
enhances the accuracy and reliability of diabetes diagnosis. By integrating traditional diagnostic factors such
as glucose levels, BMI, age, and insulin with external risk factors like lifestyle, family history, and blood
pressure, this study aims to create a more comprehensive and accurate predictive framework. The goal is to
bridge the gap in existing diagnostic methods by incorporating a wider array of data points that influence
diabetes risk, thus offering a more robust model for clinical use.
In this research, machine learning techniques will be applied to large healthcare datasets to extract hidden
patterns and insights that traditional methods might overlook. Big Data Analytics allows for the processing
and analysis of vast amounts of information from diverse sources, making it possible to study relationships
between various factors that contribute to diabetes. This study will explore how machine learning algorithms
such as decision trees, support vector machines, and neural networks can be optimized to predict diabetes
with higher accuracy. The research also seeks to identify which algorithms perform best in terms of
classification precision, sensitivity, and specificity.
1.3 PROJECT SCOPE AND LIMITATIONS
PROJECT SCOPE:
The scope of this project involves designing and developing a machine learning-based predictive model for
the early detection of Diabetes Mellitus. This model will be trained on a comprehensive dataset that includes
both traditional clinical indicators (such as glucose levels, insulin, BMI, age) and external factors (such as
family history, lifestyle, and blood pressure). The project will employ various machine learning algorithms,
including decision trees, logistic regression, and neural networks, to identify the most effective approach in
predicting diabetes. The focus will be on improving the classification accuracy of the model, with the aim of
surpassing current diagnostic tools used in clinical settings.
LIMITATIONS:
The project is limited by the quality and size of the available dataset. If the dataset lacks diversity or is
skewed towards certain demographics, the model may struggle to generalize well across different
population groups. Additionally, while the project seeks to incorporate external factors like lifestyle and
family history, the lack of consistent or standardized data for these features may limit their effectiveness in
the model.
Another limitation is the potential computational complexity of training certain machine learning models,
particularly deep learning networks, which could require significant resources and time. Furthermore,
although the model may show high predictive accuracy, it is still reliant on data that may not capture all real-
world scenarios, and it may require further refinement and testing in practical healthcare environments.
CHAPTER 2
BACKGROUND WORK
2.1 Medtronic MiniMed 670G (Artificial Pancreas System)
Introduction:
The Medtronic MiniMed 670G is a hybrid closed-loop system, often referred to as an "artificial pancreas."
It is designed to help people with Type 1 diabetes by automatically adjusting insulin delivery based on
continuous glucose monitoring (CGM) data. It is one of the first FDA-approved closed-loop insulin
systems and aims to manage glucose levels with minimal manual intervention.
Fig 1.1 Medtronic MiniMed 670G
Merits:
Real-time Monitoring and Response: The system provides continuous glucose monitoring, enabling
real-time adjustments to insulin delivery. It helps maintain blood sugar within the target range by
automatically administering basal insulin (background insulin) and prompting the user to manually
bolus (larger insulin doses) when needed.
Improved Quality of Life: For Type 1 diabetics, the 670G significantly reduces the need for constant
monitoring, manual blood glucose checks, and insulin administration. This leads to improved
glycemic control and reduces the risks of hypo- and hyperglycemia.
FDA Approval: The system is approved by the FDA, indicating that it has passed regulatory review for
clinical use.
Demerits and Challenges:
Expensive: The system is relatively costly, both in terms of the initial setup (sensor, pump) and
recurring costs (CGM sensors and pump supplies). This limits accessibility for many patients.
User Compliance: Although the system automates much of the insulin delivery, patients still need
to input carbohydrate intake manually, and user error can lead to issues with insulin dosing.
Comfort: Wearing the device can be cumbersome for some patients, as it involves attaching both
the CGM and insulin pump to the body.
Implementation:
The MiniMed 670G consists of a CGM sensor, an insulin pump, and a control algorithm. The CGM sensor
continuously measures glucose levels in the interstitial fluid and sends this data to the pump. The control
algorithm in the pump automatically adjusts basal insulin delivery based on the CGM readings. Patients are
still required to manually adjust for meals, providing a bolus dose based on carbohydrate intake.
2.2 IBM Watson Health for Diabetes Management
Introduction:
IBM Watson Health applies artificial intelligence (AI) and machine learning to help predict and manage diabetes.
Watson Health uses advanced data analytics and natural language processing to analyze both structured (e.g.,
lab results) and unstructured data (e.g., doctor notes) to help clinicians make better decisions in treating and
managing diabetes.
Merits:
Predictive Analytics: Watson Health can predict diabetes-related complications by analyzing large-
scale healthcare datasets. It provides personalized recommendations for patients, which can help
clinicians make timely adjustments to treatment plans.
Integration with EHR: Watson Health can integrate with Electronic Health Records (EHR) systems,
allowing it to gather comprehensive patient data for analysis, leading to improved patient care.
Clinical Decision Support: The system supports clinicians by providing data-driven insights and
alerts regarding patient health. For example, it can identify patients at risk for complications such as
diabetic retinopathy or cardiovascular disease.
Demerits and Challenges:
Complexity and Cost: Implementing IBM Watson Health is costly and complex, requiring large-
scale data integration and significant resources for customization based on hospital or clinic needs.
Data Privacy and Security: The use of large healthcare datasets presents challenges related to data
security and privacy. Safeguarding sensitive patient data while using AI for predictive modeling is a
critical concern.
Implementation:
Watson Health analyzes large amounts of patient data in real-time, including both historical and current clinical
data. The AI system uses machine learning models to assess patient risk and recommend treatment
adjustments. For example, it can predict if a patient is at risk for diabetes complications, suggest lifestyle
changes, or recommend medication adjustments. It also has a self-learning feature, where it continuously
updates its knowledge base with new medical research and data.
CHAPTER 3
PROPOSED SYSTEM
3.1 Objective of the Proposed Model
The primary objective of the proposed diabetes detection model is to develop an efficient and accurate
machine learning framework that can predict the likelihood of diabetes in individuals based on clinical and
lifestyle-related features. By leveraging advanced techniques such as feature selection, dimensionality
reduction, and classification algorithms, the model aims to enhance diagnostic accuracy and provide early
warnings for high-risk individuals.
The model incorporates key factors such as glucose levels, insulin, body mass index (BMI), age, blood
pressure, and additional external factors like lifestyle habits and hereditary information. By integrating these
features into the predictive model, the goal is to improve the overall classification performance, thereby
reducing false positives and negatives.
Additionally, the model is designed to be scalable and adaptable to new datasets, allowing healthcare
professionals to employ it across different demographics and regions. The long-term objective is to facilitate
early detection, aid in decision-making for treatment planning, and contribute to reducing the overall burden of
diabetes-related complications.
3.2 Algorithms Used For Proposed Model
Support Vector Machine (SVM): SVM is a supervised learning method that divides data into two categories
using a hyperplane as the dividing line. It accomplishes the same classification goal as C4.5, except that it
does not use decision trees. To reduce the likelihood of an incorrect classification, the SVM maximizes the
margin, which is the distance between the hyperplane and the data points closest to it from each class.
Scikit-learn, MATLAB, and LIBSVM are well-known software packages that can be used to build support
vector machines.
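The margin idea described above can be illustrated with a minimal sketch using Scikit-learn. The data points, the linear kernel, and the large value of C (which approximates a hard margin) are assumptions made for illustration and are not taken from the project notebook. For a linear SVM with hyperplane w.x + b = 0, the width of the margin is 2 / ||w||.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data (illustrative values, not the project's dataset).
X = np.array([[1.0, 1.0], [1.5, 0.5], [2.0, 1.5],
              [5.0, 5.0], [5.5, 4.5], [6.0, 5.5]])
y = np.array([0, 0, 0, 1, 1, 1])

# A linear kernel exposes the hyperplane w.x + b = 0 directly.
# A large C approximates a hard-margin SVM (assumed setting).
clf = SVC(kernel="linear", C=1000.0)
clf.fit(X, y)

w = clf.coef_[0]                          # normal vector of the hyperplane
b = clf.intercept_[0]                     # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)    # distance between the two margin boundaries

train_predictions = clf.predict(X)
print("support vectors:\n", clf.support_vectors_)
print("margin width:", margin_width)
```

The support vectors printed here are the closest points from each class; only these points determine where the hyperplane lies.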
We use a single algorithm in the proposed system, which lowers the time complexity. SVM (Support Vector
Machine) is the machine learning technique used to predict diabetes. Patient data can be taken into account
regardless of age or gender. The proposed system is an interactive application that asks the user to
enter data in order to generate a prediction. The updated dataset under consideration includes the following
attributes: gender, age, heart disease, hypertension, smoking history, BMI, hemoglobin A1c (HbA1c) level,
glucose level, and outcome. The proposed system takes into account patients who are younger than 21.
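How an interactive application might turn user-entered values into a prediction can be sketched as follows. This is only an illustration under stated assumptions: the feature subset, the column names, the hand-made training rows, and the helper predict_from_record are all hypothetical and are not the project's actual code or data.

```python
import pandas as pd
from sklearn.svm import SVC

# Hypothetical numeric subset of the attributes listed above (names are illustrative).
FEATURES = ["age", "bmi", "hba1c_level", "glucose_level"]

# Tiny hand-made training set (made-up values, for illustration only).
train = pd.DataFrame(
    [
        [25, 22.0, 5.0, 90, 0],
        [19, 24.5, 5.2, 95, 0],
        [40, 26.0, 5.4, 100, 0],
        [55, 33.0, 8.0, 200, 1],
        [62, 35.5, 8.5, 220, 1],
        [48, 31.0, 7.8, 190, 1],
    ],
    columns=FEATURES + ["outcome"],
)

model = SVC(kernel="linear")  # kernel choice is an assumption
model.fit(train[FEATURES], train["outcome"])


def predict_from_record(model, record):
    """Turn one user-supplied record (a dict) into a 0/1 prediction."""
    missing = [name for name in FEATURES if name not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    row = pd.DataFrame([[float(record[name]) for name in FEATURES]], columns=FEATURES)
    return int(model.predict(row)[0])


# In an interactive application this dict would be filled from user input.
example = {"age": 58, "bmi": 34.0, "hba1c_level": 8.2, "glucose_level": 210}
result = predict_from_record(model, example)
print("diabetic" if result == 1 else "non-diabetic")
```

Validating the record before prediction keeps a missing field from silently producing a wrong result.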
Details about the dataset:
The dataset consists of several medical predictor variables and one target variable, Outcome. Predictor
variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.
Pregnancies: Number of times pregnant
Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
BloodPressure: Diastolic blood pressure (mm Hg)
SkinThickness: Triceps skin fold thickness (mm)
Insulin: 2-Hour serum insulin (mu U/ml)
BMI: Body mass index (weight in kg/(height in m)^2)
DiabetesPedigreeFunction: Diabetes pedigree function
Age: Age (years)
Outcome: Class variable (0 or 1)
3.3 Designing
3.3.1 UML Diagram
Fig 2.1 WorkFlow
3.3.2 Architecture
Fig 2.2 System Architecture
3.4 Stepwise Implementation and Code
Step 1: Searched for a fully labelled diabetes dataset.
Step 2: Drew a flow chart of how to implement the diabetes prediction system and decided which ML
algorithms or techniques would fit the problem with good accuracy.
Step 3: Installed the required software: Anaconda, Jupyter, and Python.
Step 4: After installation, launched a Jupyter notebook with Python 3 from Anaconda.
Step 5: Imported all the necessary libraries, such as Pandas, NumPy, Scikit-learn (sklearn), and Matplotlib,
along with the required modules.
Fig 4.3: code of importing libraries
Step 6: Imported the diabetes dataset into the Jupyter notebook.
Step 7: Inspected the structure of the data, such as how many rows and columns are present.
Fig 3.1: code for importing data set
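Steps 5 to 7 can be sketched as below. Because the original code figures are not reproduced here, this is an assumed reconstruction: the sample rows are illustrative, and in the notebook the data would instead be read from the downloaded CSV file (for example with pd.read_csv("diabetes.csv"), where the file name is an assumption). Only the libraries the sketch actually uses are imported.

```python
import io

import pandas as pd

# A few illustrative rows in the column layout described under "Details about the dataset".
# In the notebook this would be read from the downloaded file instead (file name assumed).
SAMPLE_CSV = """Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
"""

df = pd.read_csv(io.StringIO(SAMPLE_CSV))

n_rows, n_cols = df.shape   # number of rows and columns
print(df.head())            # first few records
print(n_rows, n_cols)
print(df.dtypes)            # column types; all columns should be numeric
```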
Step 8: Checked for missing values and filled them with the mean of the respective column
(here there are 2 missing values, one in SkinThickness and the other in Insulin).
Fig 3.2: Example code for the pre-work of data set with null values
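Mean imputation as described in Step 8 can be sketched as follows. The small data frame is made up for illustration; it simply places one missing value in SkinThickness and one in Insulin to mirror the situation described.

```python
import numpy as np
import pandas as pd

# Small illustrative frame with one missing value in each of two columns.
df = pd.DataFrame({
    "Glucose": [148, 85, 183, 89],
    "SkinThickness": [35.0, 29.0, np.nan, 23.0],
    "Insulin": [0.0, np.nan, 0.0, 94.0],
})

print(df.isnull().sum())    # count of missing values per column

# Replace each missing value with the mean of its own column.
for column in ["SkinThickness", "Insulin"]:
    df[column] = df[column].fillna(df[column].mean())

remaining_missing = int(df.isnull().sum().sum())
print("missing values left:", remaining_missing)
```

pandas computes the column mean over the non-missing entries only, so the filled value is not distorted by the gap itself.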
Step 9: Once the data has no null, missing, or string values, it can be split into two parts:
Part 1: Training data
Part 2: Testing data
Here the target column is ‘Outcome’, i.e. y. Separating it from the main data gives a y data
frame containing ‘Outcome’ and an x data frame containing all the other labels. We use 80% of the data for
training and 20% for testing.
Fig 3.3 : code for splitting data frame into two parts based on the column values
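Separating the target column from the predictors, as described in Step 9, might look like the sketch below. The rows are illustrative and only three predictor columns are shown to keep the example short.

```python
import pandas as pd

# Illustrative rows; the real data frame has all eight predictor columns.
df = pd.DataFrame({
    "Glucose": [148, 85, 183, 89],
    "BMI": [33.6, 26.6, 23.3, 28.1],
    "Age": [50, 31, 32, 21],
    "Outcome": [1, 0, 1, 0],
})

y = df["Outcome"]                   # target column
x = df.drop(columns=["Outcome"])    # every other column is a predictor

print(x.shape, y.shape)
```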
Step 10: Further split the data into x_train, x_test, y_train, and y_test.
Fig 3.4: code for splitting data frame into two parts based on the column values
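An 80/20 split of this kind is commonly done with Scikit-learn's train_test_split. The sketch below uses random illustrative data; the random_state and stratify arguments are assumptions added for reproducibility and class balance, and may differ from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: 20 samples, 3 features, balanced labels.
rng = np.random.default_rng(0)
x = rng.normal(size=(20, 3))
y = np.array([0, 1] * 10)

# test_size=0.2 keeps 80% of the rows for training and 20% for testing.
# random_state and stratify are assumed settings, not taken from the report.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42, stratify=y
)

print(x_train.shape, x_test.shape)
```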
Step 11: Applied the ML algorithm, i.e. Support Vector Machine, imported from Sklearn (from sklearn import svm).
Fig 3.5: code to demonstrate fitting the algorithm to the model and resulting array of predictions made
Step 12: From the trained model, calculated the accuracy of the model.
Fig 3.6: code to calculate the accuracy of the model
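Steps 11 and 12 can be sketched together as below. Since the project's dataset is not bundled with this text, the sketch uses a synthetic stand-in generated by make_classification with eight features; the linear kernel and the random seeds are assumptions, so the accuracy it prints says nothing about the project's actual result.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the eight predictor columns (not the project's data).
x, y = make_classification(n_samples=200, n_features=8, n_informative=4, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

classifier = svm.SVC(kernel="linear")   # kernel choice is an assumption
classifier.fit(x_train, y_train)        # Step 11: fit the SVM on the training data

y_pred = classifier.predict(x_test)     # array of predictions for the test set
accuracy = accuracy_score(y_test, y_pred)   # Step 12: fraction of correct predictions
print("test accuracy:", accuracy)
```

Accuracy is measured on the held-out test set rather than the training set, so it reflects how the model handles records it has not seen.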
Step 13: Plotted the accuracy graph.
Fig 3.7: Accuracy graph model
Confusion Matrix
Fig 3.8: code for importing metrics to find the accuracy score
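A confusion matrix can be obtained from Scikit-learn's metrics module as in the following sketch. The true and predicted labels are hand-made for illustration and are not outputs of the project's model.

```python
from sklearn.metrics import confusion_matrix

# Hand-made labels for illustration (1 = diabetic, 0 = non-diabetic).
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]

# With labels=[0, 1], rows are actual classes and columns are predicted classes,
# so the layout is [[TN, FP], [FN, TP]].
matrix = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = (int(value) for value in matrix.ravel())

print(matrix)
print("TN, FP, FN, TP:", tn, fp, fn, tp)
```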
CHAPTER 4
RESULTS AND DISCUSSION
4.1 Performance Metrics
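Classification Accuracy: the ratio of the number of correct predictions to the total number of input
samples. It is given as:
$$\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions made}}$$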
Confusion Matrix: a matrix that describes the complete performance of the model.
                 Predicted: No    Predicted: Yes
Actual: No       TN               FP
Actual: Yes      FN               TP
where TP: True Positive
FP: False Positive
FN: False Negative
TN: True Negative
Accuracy for the matrix can be calculated by summing the values on the main diagonal and dividing by the
total number of samples:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
F1 Score: a measure of a test's accuracy. The F1 Score is the harmonic mean of precision and
recall, and its range is [0, 1]. It indicates how precise the classifier is (how many of its positive
predictions are correct) as well as how robust it is (how few positive instances it misses).
Mathematically, it is given as:
$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
The F1 Score tries to find the balance between precision and recall.
Precision: the number of correct positive results divided by the number of positive results predicted by
the classifier. It is expressed as:
$$\text{Precision} = \frac{TP}{TP + FP}$$
Recall: the number of correct positive results divided by the number of all relevant samples (all samples
that should have been identified as positive). In mathematical form, it is given as:
$$\text{Recall} = \frac{TP}{TP + FN}$$
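The definitions in this section can be checked numerically with a short sketch that computes each metric from the four confusion-matrix counts. The function name classification_metrics and the counts passed to it are illustrative assumptions, not results from this project.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        # Harmonic mean of precision and recall.
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Illustrative counts, not results from this project.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
print(metrics)
```

With these counts, precision (0.8) is higher than recall (about 0.67), and the F1 score falls between the two, which is the balancing behaviour described above.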
4.2 Result Screenshot
Fig 4 Output
CHAPTER 5
CONCLUSION
5.1 CONCLUSION
In this project, we developed a diabetes prediction model using machine learning techniques, focusing on
improving the classification accuracy of diabetes detection. By incorporating both traditional factors such as
glucose levels, BMI, age, and insulin along with external factors like lifestyle and hereditary conditions, we
were able to enhance the predictive performance. The application of data pre-processing, clustering, and
machine learning algorithms demonstrated an efficient method to analyze complex healthcare data. This
model can serve as a reliable tool for early diabetes detection, providing insights for healthcare professionals
to take timely preventive measures.
Future Enhancement:
In future work, the model can be enhanced by integrating additional data sources such as real-time sensor
data from wearable devices, which can provide continuous monitoring and prediction of diabetes onset.
More advanced techniques such as deep neural networks or ensemble learning methods could be
implemented to further boost the accuracy and adaptability of the model. Additionally, expanding the
dataset with more diverse population data and focusing on different types of diabetes (e.g., Type 1, Type 2,
gestational diabetes) will make the model more comprehensive. Incorporating explainability into the model
will also be crucial for gaining trust among healthcare practitioners by providing interpretable insights
alongside predictions.
REFERENCES
[1] Gauri D. Kalyankar, Shivananda R. Poojara and Nagaraj V. Dharwadkar, “Predictive Analysis of
Diabetic Patient Data Using Machine Learning and Hadoop”, International Conference on I-SMAC,
978-1-5090-3243-3, 2023.
[2] B. Nithya and Dr. V. Ilango, “Predictive Analytics in Health Care Using Machine Learning Tools and
Techniques”, International Conference on Intelligent Computing and Control Systems,
978-1-5386-2745-7, 2019.
[3] Dr. Saravana Kumar N M, Eswari T, Sampath P and Lavanya S, “Predictive Methodology for Diabetic
Data Analysis in Big Data”, 2nd International Symposium on Big Data and Cloud Computing, 2020.
[4] Aiswarya Iyer, S. Jeyalatha and Ronak Sumbaly, “Diagnosis of Diabetes Using Classification Mining
Techniques”, International Journal of Data Mining & Knowledge Management Process (IJDKP),
Vol. 5, No. 1, January 2022.
[5] P. Suresh Kumar and S. Pranavi, “Performance Analysis of Machine Learning Algorithms on Diabetes
Dataset using Big Data Analytics”, International Conference on Infocom Technologies and Unmanned
Systems, 978-1-5386-0514-1, Dec. 18-20, 2021.
[6] Mani Butwall and Shraddha Kumar, “A Data Mining Approach for the Diagnosis of Diabetes
Mellitus using Random Forest Classifier”, International Journal of Computer Applications, Volume
120, Number 8, 2019.