
Springer Lecture Notes in Computer Science (1)

The document discusses a GUI-based diabetes prediction system that utilizes a pipeline of machine learning models for early detection of diabetes. It highlights the importance of preprocessing clinical data and evaluates various algorithms, concluding that the Gradient Boosting Classifier performs best, with a training accuracy of approximately 94% and a test accuracy of about 77%. The system allows users to input health-related information, which is then processed to predict the likelihood of diabetes, demonstrating the potential of machine learning in healthcare.


GUI Based Diabetes Prediction Using Pipeline

1st Lalit Agarwal1 [[email protected]. in]
2nd Yog Singh2 [yogsingh.mait@gmail.com]
3rd Aryan Saxena2 [arysaz1808@gmail.com] and
4th Riya Verma2 [riyaa7vermaa@gmail.com]

1 Assistant Professor, Department of Electrical Engineering, MAIT, India
2 BTech, Department of Electrical Engineering, MAIT, India

Abstract. Diabetes is a widespread disease that affects individuals
globally. It can cause other serious, long-lasting problems such as heart
and kidney disease. Early detection of the disease can lead to healthier
and longer lives. Supervised machine learning models trained on suitable
datasets can facilitate the early detection and diagnosis of diabetes. Our
analysis provides a comprehensive guide to predicting diabetic patients by
identifying the attributes of the dataset and evaluating a GUI-based
diabetes prediction system that uses a pipeline.
A GUI-based diabetes prediction system that uses a pipeline involves
creating a graphical user interface that allows users to input various
health-related information, such as age, weight, and blood sugar levels.
This information is then passed through a pipeline of machine learning
models, which predicts the user's likelihood of having diabetes. The
machine learning algorithms trained with several datasets in this analysis
include Decision Tree (DT), k-Nearest Neighbor (KNN), Random Forest (RF),
Gradient Boosting (GB), LightGBM classifier and Support Vector Machine
(SVM). The pipeline typically consists of several stages, including data
preprocessing, data visualization, and model training and evaluation.
In the model training and evaluation stage, we trained the machine
learning models and tested them on held-out data. After evaluating their
performance, we determined that the Gradient Boosting Classifier (GBC) is
the best model for making predictions. Furthermore, applying
hyperparameter tuning to the GBC yielded a training accuracy of about 94%.
Once the best model is selected, it is used to make predictions on new
input data provided by the user through the GUI. The predicted diabetes
status is then displayed to the user through the GUI. The results of this
study demonstrate that appropriate preprocessing of clinical data and
application of ML-based classification can accurately and effectively
predict diabetes.

Keywords: Diabetes · Diabetes Prediction · GUI (Graphical User Interface) ·
Pipeline · Machine Learning · Random Forest (RF) · Gradient Boosting (GB).

1 Introduction
Diabetes and its related complications impose a considerable healthcare
burden on a global scale, creating significant challenges for patients,
healthcare systems, and national economies (Panel 1). According to the
World Health Organization (WHO), the global population is projected to grow
by 37% between 2000 and 2030, whereas the number of individuals affected by
diabetes is estimated to surge by 114% during the same period [1]. Diabetes
is spreading quickly in Asia, making it the main region facing a growing
epidemic of this disease. Based on conservative estimates considering
factors like population growth, aging, and urbanization, it is predicted
that by 2030, India and China will have the highest number of individuals
affected by diabetes. India is projected to have approximately 79.4 million
people with diabetes, while China is expected to have around 42.3 million.
Additionally, four more of the top ten countries with the highest diabetes
rates are in Asia: Indonesia, Pakistan, Bangladesh, and the Philippines.
The actual prevalence of diabetes may be underestimated, as these estimates
do not take into account other diabetes-related risk factors. By 2025, the
global population is expected to reach 7.9 billion, with nearly 50% of the
annual population increase contributed by six countries. Among these
countries, India, China, and Pakistan are significant contributors,
accounting for 21%, 12%, and 5% of the increase, respectively [2]. It is
important to note that Asian populations are racially diverse and possess
varying demographic, cultural, and socioeconomic attributes, which can
potentially influence the variations in diabetes causes and development.
Machine learning involves teaching computers to learn and make decisions on
their own using examples and experiences, rather than programming them to
follow specific instructions. Machine learning algorithms are considered
the "workhorse" of the so-called new era of big data. Machine
learning-based techniques have been successfully applied in many fields,
such as pattern recognition, computer vision, spacecraft engineering,
finance, entertainment, computational biology, and biomedical and medical
applications. More than half of cancer patients receive ionizing radiation
(radiotherapy) as part of their treatment, which is the mainstay of
treatment for advanced localized disease. Radiation therapy involves a
complex series of procedures that begin at the consultation stage and
extend beyond treatment to ensure that patients receive the correct dose of
radiation and respond positively to treatment. Radiotherapy can be
complicated, involving different steps where people and machines work
together to make decisions. The complexity inherent in these procedures
underscores the suitability of employing machine learning algorithms to
optimize and streamline them. These procedures encompass a range of
critical aspects, including radiation physics quality assurance,
contouring, treatment planning, image-guided radiotherapy, respiratory
motion management, treatment response modeling, and outcomes prediction.
Machine learning algorithms can acquire knowledge from the prevailing
context and extend that understanding to previously unseen tasks. This
enables significant advancements in both the safety and efficacy of
radiotherapy practice, ultimately resulting in improved patient
outcomes [3].

2 ALGORITHMS AND RELATED WORK

2.1 Pipeline

In machine learning, a pipeline consists of a series of complex data
processing steps that process a labelled training dataset to produce a
machine learning model. The model must then be deployed on a platform where
it can answer prediction queries in real time. For proper pre-processing of
prediction queries, the pipeline typically needs to be deployed alongside
the model. The deployment platform must be robust, able to handle different
machine learning models and pipelines, and easy to tune. Additionally, the
platform must maintain the quality of the model by further training it when
new training data becomes available. One method to maintain the quality of
a deployed model is online deployment, in which the platform uses online
learning methods to continuously update the model based on incoming
training data. However, online learning can be sensitive to noise and
outliers, which may increase the prediction error rate. Therefore, to
ensure a high level of quality, the online learning method must be tailored
to the specific use case [4,5]. Thus, selective online deployment of
machine learning models may not always provide robustness and simplicity.
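
The idea of deploying the pre-processing steps together with the model can be illustrated with a minimal sketch. It assumes scikit-learn, uses synthetic stand-in data from make_classification rather than the clinical dataset, and adds a StandardScaler step purely for illustration; none of these choices are stated in the paper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 8 numeric features, binary label.
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Chaining the steps means a prediction query receives exactly the same
# pre-processing that was fitted on the training data.
pipe = Pipeline([
    ("scale", StandardScaler()),                            # illustrative step
    ("model", GradientBoostingClassifier(random_state=0)),
])
pipe.fit(X_train, y_train)

predictions = pipe.predict(X_test)          # one label per test row
test_accuracy = pipe.score(X_test, y_test)  # fraction of correct labels
```

Because the fitted pipeline is a single object, it can be saved and loaded as one unit, so the GUI never has to repeat the pre-processing by hand.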

2.2 Machine Learning Algorithms

- K-Neighbors Classifier - The K-nearest neighbors (KNN) algorithm is a
  straightforward and comprehensible supervised machine learning technique
  applicable to both classification and regression tasks. However, its major
  drawback is that it becomes significantly slower as the size of the data
  in use increases.
- Support Vector Classifier - The Support Vector Classifier (SVC) is a
  supervised machine learning algorithm specifically designed for
  classification tasks. It operates by mapping data points onto a
  high-dimensional space and subsequently identifying the ideal hyperplane
  that effectively separates the data into two distinct classes.
- Decision Tree Classifier - In the Decision Tree classifier, the entropy of
  the dataset is calculated first. Lower entropy means higher certainty
  about the class, which indicates a more favorable classification result.
  The information gain of each feature is then calculated, which represents
  how much the uncertainty is reduced after partitioning the dataset on that
  feature. The dataset is partitioned on the feature with the highest
  information gain. This iterative process continues until all nodes have
  been processed.
- Random Forest - Random Forest is an ensemble learning algorithm that is
  based on decision trees. It involves creating a collection of decision
  trees, where each tree is constructed using a randomly chosen subset of
  the training set. The class of the test object is determined by
  aggregating the votes from the different decision trees.

- Gradient Boosting Classifiers - Gradient boosting classifiers are a family
  of machine learning algorithms that combine multiple weak learning models
  to form a strong and robust predictive model. Decision trees are commonly
  used as the weak learners when implementing gradient boosting.
- LightGBM - LightGBM is a gradient boosting framework built on decision
  tree-based learning algorithms, aiming to provide efficient and effective
  model training and prediction. It is optimized for large datasets and
  datasets with many features. It stands out by using a histogram-based
  method to find the best split points, which speeds up training.
  Furthermore, it employs a leaf-wise tree growth algorithm, which often
  reduces the loss faster than level-wise growth, although it can overfit
  small datasets unless the tree depth or number of leaves is limited.
  Additionally, LightGBM supports parallel and GPU-based learning, which can
  further speed up the process. In summary, LightGBM is a robust tool for
  building gradient boosted decision tree models, particularly in cases of
  large datasets.
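
The entropy and information gain computation described for the Decision Tree classifier can be sketched as follows. The glucose readings and labels are made-up toy values, and the helper names entropy and information_gain are illustrative, not taken from the paper.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting on values <= threshold."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # the split separates nothing, so nothing is gained
    weighted = (len(left) * entropy(left)
                + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted


# Toy data (made up): glucose readings with diabetic (1) / non-diabetic (0).
glucose = [85, 90, 110, 150, 160, 180]
outcome = [0, 0, 0, 1, 1, 1]
gain = information_gain(glucose, outcome, threshold=120)
```

In this toy case the threshold separates the two classes perfectly, so the gain equals the full entropy of the labels (1 bit). A tree builder would evaluate such gains for every candidate feature and threshold and split on the best one.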

2.3 Related Work

Diabetes has become a significant threat to people's lives. It is widely
prevalent around the world, including in India. It affects individuals of
all ages and is caused by factors such as lifestyle, genetics, stress, and
age. Without proper attention, diabetes can lead to severe complications.
Currently, several methods are being utilized to predict and prevent
diabetes and related illnesses. One study used machine learning algorithms,
specifically Support Vector Machine (SVM) and Random Forest (RF), to
identify potential risks of developing diabetes-related diseases. Following
data pre-processing, the authors employed step-forward and step-backward
feature selection techniques to identify the significant predictors for the
prediction task. Additionally, they applied Principal Component Analysis
(PCA) for dimensionality reduction. Their experiments showed that applying
PCA along with Random Forest (RF) resulted in an accuracy of 83% for
diabetes prediction. In comparison, the accuracy achieved using Support
Vector Machine (SVM) was 81.4%, making RF a more favorable choice for this
particular task [1].
Diabetes is a serious and life-threatening disease that is also associated
with several other health complications such as coronary failure,
blindness, and kidney diseases. In these cases, the patients must visit
diagnostic centres to get their reports after consultation. This often
involves investing both time and money. However, with the advancements in
machine learning techniques, we now have the means to address these
challenges effectively. By processing patient data, it is possible to
predict whether a patient has diabetes or not. This early prediction can
play a crucial role in mitigating the severity of the disease. The primary
objective of that research was to develop a highly accurate system for
predicting the risk level of diabetes in patients. The model incorporates
classification methods such as Decision Tree, Artificial Neural Networks
(ANN), Naive Bayes, and Support Vector Machine (SVM) algorithms. The
Decision Tree model exhibited a precision of 85%, Naive Bayes achieved 77%,
and SVM achieved 77.3%. These results indicate a substantial level of
accuracy in the employed methods [2].
The machine learning (ML) community has focused on the prediction of
diabetes, and several studies have been conducted on this topic.
Considering the severity of this disease, one article introduces a model
called Diabetes Expert System using Machine Learning Analytics (DESMLA) to
effectively predict diabetes by analyzing diabetes data. The diabetes
dataset is unbalanced, so the DESMLA model uses five commonly used
oversampling techniques (SMOTE, Borderline SMOTE, ADASYN, KMeans SMOTE, and
Gaussian SMOTE) to solve the class imbalance problem. The model also uses
decision trees (DT) and random forests (RF) as classifiers, as well as
various data preprocessing steps for diabetes prediction. Experimental
results show that DESMLA models with KMeans SMOTE and Gaussian SMOTE
perform better [3].
Diabetes is a disease characterized by elevated blood glucose levels, also
known as hyperglycemia. It occurs when the body is unable to produce enough
insulin, a hormone responsible for converting glucose from food into
energy. The increasing prevalence of this disease has motivated researchers
to work hard to develop efficient models for diagnosing it. In the
healthcare sector, there is a vast amount of data readily accessible,
making it convenient to extract valuable information for diagnosis and
develop new models that yield better results. The objective of that
research was to propose an efficient machine learning model for diabetes
prediction. Logistic regression, Support Vector Machine (SVM), and
k-nearest neighbors (k-NN) algorithms were used to classify diabetic
patients. After pre-processing and training the data, these algorithms
produced good results. In terms of accuracy on test data, logistic
regression outperformed the other models, achieving the highest accuracy of
83%. SVM and k-NN also performed well, with accuracies of 82% and 79%,
respectively. These results demonstrate the effectiveness of the proposed
model, showcasing improved performance compared to previous studies [4].

3 DIABETES DETECTION STRATEGY


The initial step is to analyze and pre-process the dataset. Subsequently,
the dataset is divided into two separate sets: the training set and the
testing set. The training set is utilized to build predictive machine
learning models by employing various algorithms. Afterward, the performance
of these algorithms is evaluated. The best ML model is then imported and
integrated with Tkinter to create a GUI that shows whether a person is
diabetic or not.
The workflow is as follows:

3.1 Dataset
We have gathered the "Pima Indians Diabetes Database" dataset from Kaggle.
The dataset originates from the National Institute of Diabetes and
Digestive and Kidney Diseases. Its primary objective is to predict whether
a patient has diabetes or not based on diagnostic measurements. The
instances included in the dataset were selected based on specific criteria
from a larger database. The dataset comprises several medical predictors,
also known as independent variables, and one target or dependent variable
named "Outcome." These independent variables consist of the number of
pregnancies a patient has had, their Body Mass Index (BMI), insulin level,
age, and so on.

3.2 Data Analysis and Data Pre-processing

Before feeding the dataset into the machine learning models, pre-processing
steps are applied to enhance model performance. These steps include
removing outliers and handling missing values, ensuring that the data is
reliable and suitable for accurate analysis.

- Outlier removal - The dataset may contain attribute values that deviate
  significantly from the rest. These values can negatively impact the
  performance of the machine learning algorithm. To remove these outliers,
  we employed the Z-score method.

- Null and missing values were handled.
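
The two pre-processing steps can be sketched in plain Python. The paper names the Z-score method but not the cut-off or the imputation rule, so the threshold of 3 and the median imputation below are assumptions, and the BMI readings are made up.

```python
import statistics


def fill_missing(column):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in column if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in column]


def zscore_mask(column, threshold=3.0):
    """Return True for each value whose |z-score| is within the threshold."""
    mean = statistics.mean(column)
    std = statistics.pstdev(column)  # population standard deviation
    if std == 0:
        return [True] * len(column)
    return [abs((v - mean) / std) <= threshold for v in column]


# Made-up BMI readings with one gap and one implausible value.
bmi = [26.0, 30.5, None, 28.1, 24.9, 31.2, 27.4,
       29.0, 25.3, 30.0, 26.8, 28.6, 500.0]
bmi_filled = fill_missing(bmi)
keep = zscore_mask(bmi_filled, threshold=3.0)
bmi_clean = [v for v, ok in zip(bmi_filled, keep) if ok]
```

With real data the mask would be computed per feature and a row dropped if any of its features is flagged. Note that a single extreme value inflates the standard deviation, so the Z-score method needs a reasonable number of rows to be effective.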

[Figure: plots of each feature with an Outcome legend (0, 1). Panels:
Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI,
DiabetesPedigreeFunction, Age.]

3.3 Model Construction

We imported a pipeline that includes several machine learning algorithms:
K-Nearest Neighbors (KNN), Support Vector Classification (SVC), Decision
Tree Classifier (DTC), Random Forest Classifier (RFC), Gradient Boosting
Classifier (GBC) and LightGBM. To construct the predictive model, 80% of
the pre-processed data was allocated for training, while the remaining 20%
was reserved for testing purposes.
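
A minimal sketch of the 80/20 split follows. The helper name train_test_split_rows and the fixed seed are illustrative assumptions; in practice scikit-learn's train_test_split serves the same purpose.

```python
import random


def train_test_split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle rows reproducibly and return (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed for repeatability
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(100))  # stand-in for 100 pre-processed records
train_rows, test_rows = train_test_split_rows(rows)
```

Shuffling before splitting avoids an ordering effect if the file happens to be sorted, and fixing the seed keeps the reported accuracies reproducible between runs.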

3.4 Model training and Evaluation

In the model training and evaluation stage, we trained the machine learning
models on the training set and tested them on the held-out test set. Their
test accuracies are listed in Table 1.

Table 1. MODEL ACCURACY

S.No.  ALGORITHM                      ACCURACY (%)
1.     KNeighbors Classifier          73.38
2.     Support Vector Machine         74.03
3.     Decision Tree Classifier       74.03
4.     Random Forest Classifier       75.12
5.     Gradient Boosting Classifier   76.62
6.     LightGBM Classifier            65.58
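
A comparison of this kind can be sketched as a loop over candidate models. The sketch assumes scikit-learn with default hyperparameters and synthetic stand-in data, so its numbers will not match Table 1. LightGBM is left out because it needs the separate lightgbm package, and the StandardScaler step is an illustrative assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption), split 80/20 as in Section 3.3.
X, y = make_classification(n_samples=400, n_features=8, n_informative=5,
                           random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

models = {
    "KNeighbors Classifier": KNeighborsClassifier(),
    "Support Vector Machine": SVC(),
    "Decision Tree Classifier": DecisionTreeClassifier(random_state=1),
    "Random Forest Classifier": RandomForestClassifier(random_state=1),
    "Gradient Boosting Classifier": GradientBoostingClassifier(random_state=1),
}

accuracy = {}
for name, model in models.items():
    # Each model gets its own pipeline, fitted only on the training set.
    pipe = make_pipeline(StandardScaler(), model)
    pipe.fit(X_train, y_train)
    accuracy[name] = 100.0 * accuracy_score(y_test, pipe.predict(X_test))

for name, acc in accuracy.items():
    print(f"{name}: {acc:.2f}%")
```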

3.5 Performance Analysis

We have evaluated the proposed models using various performance metrics and
have determined that the Gradient Boosting Classifier (GBC) and Random
Forest Classifier (RFC) are the best algorithms for the GUI-based diabetes
prediction system. The GBC model achieved a training accuracy of 94% and a
test accuracy of 77%. The RFC, on the other hand, demonstrated a training
accuracy of 96% and a test accuracy of 75%. The GBC algorithm is selected
as the best model for making predictions as it has a higher test accuracy
than the RFC. Additionally, after fine-tuning its hyperparameters, the GBC
achieved a test accuracy of about 77%.

Table 2. BEST ALGORITHM

Metric          Gradient Boosting Classifier   Random Forest Classifier
Test Accuracy   77%                            75%
Train Accuracy  94%                            96%
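
Hyperparameter tuning of the GBC, followed by a comparison of training and test accuracy, can be sketched with a small grid search. The paper does not list the parameters or values that were searched, so the grid below, the 3-fold cross-validation, and the synthetic data are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption), split 80/20.
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           random_state=2)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2)

# Hypothetical search grid; the values are illustrative only.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(GradientBoostingClassifier(random_state=2),
                      param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)  # cross-validates on the training set only

best_model = search.best_estimator_  # refitted on the full training set
train_accuracy = best_model.score(X_train, y_train)
test_accuracy = best_model.score(X_test, y_test)
```

Reporting both figures side by side, as Table 2 does, makes the gap between them visible; a training accuracy far above the test accuracy suggests the model has partly memorized the training data.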

3.6 GUI based Diabetes prediction

Once the most suitable model is identified, it is used to make predictions
on new input data provided by the user via the GUI. We utilized Joblib to
import the model and enable lightweight pipelining. To create the GUI
application, we used Tkinter, a widely used GUI library for Python that
provides developers with a convenient framework for building graphical user
interfaces. By utilizing Tkinter, we can create interactive applications
with a user-friendly interface. Tkinter offers a robust and object-oriented
interface to the Tk GUI toolkit, making it easy to design windows, buttons,
input fields, and other interactive components. The predicted diabetes
status is displayed to the user through the GUI.
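
A minimal sketch of how a saved model might be wired to a Tkinter form is given below. The file name diabetes_model.joblib, the helper names predict_status and build_gui, and the assumption that label 1 means diabetic are all hypothetical; the feature order follows the column names of the Pima dataset.

```python
FEATURES = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
            "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]


def predict_status(model, raw_values):
    """Convert the entry-field text to floats and classify one patient."""
    row = [float(raw_values[name]) for name in FEATURES]
    # Assumption: the model returns 1 for diabetic and 0 otherwise.
    return "Diabetic" if model.predict([row])[0] == 1 else "Non-Diabetic"


def build_gui(model_path="diabetes_model.joblib"):  # hypothetical file name
    """Build the window; call .mainloop() on the returned root to run it."""
    import tkinter as tk
    import joblib

    model = joblib.load(model_path)  # model saved earlier with joblib.dump
    root = tk.Tk()
    root.title("Diabetes Prediction Using Machine Learning")

    entries = {}
    for i, name in enumerate(FEATURES):
        tk.Label(root, text=f"Enter Value of {name}").grid(
            row=i, column=0, sticky="w")
        entries[name] = tk.Entry(root)
        entries[name].grid(row=i, column=1)

    # A single result label is updated in place on every prediction.
    result = tk.Label(root, text="")
    result.grid(row=len(FEATURES) + 1, column=0, columnspan=2)

    def on_predict():
        try:
            values = {name: entry.get() for name, entry in entries.items()}
            result.config(text=predict_status(model, values))
        except ValueError:
            result.config(text="Please enter numeric values")

    tk.Button(root, text="Predict", command=on_predict).grid(
        row=len(FEATURES), column=0, columnspan=2)
    return root
```

Keeping predict_status separate from the widget code means the prediction logic can be checked without opening a window, and reusing one result label avoids an old prediction showing through underneath a new one.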

[Figure: workflow of the proposed system. PIMA dataset -> visualization of
data -> data pre-processing -> training set / test set -> ML models ->
diabetes prediction]

4 CONCLUSION

Machine learning and data mining techniques play a crucial role in disease
diagnosis, particularly in predicting diabetes at an early stage, which is
vital for effective patient treatment. This paper explores different
classification methods for the medical diagnosis of diabetes patients,
focusing on their accuracy. We framed the diagnosis as a classification
problem evaluated in terms of accuracy and applied six machine learning
techniques to the Pima Indians diabetes dataset. Our models were trained on
the training set and validated on a held-out test set. From our simulation
and analysis of the data, we found that the selected model identified
diabetic and non-diabetic patients with an accuracy of 94% on 741 data
points, which corresponds to the training accuracy reported in Section 3.5;
the test accuracy was about 77%. Our results showed that the Gradient
Boosting Classifier (GBC) outperformed the other models in this study.
Additionally, we used association rule mining, and a strong association was
identified between Body Mass Index (BMI) and glucose levels with diabetes.

5 Simulation Results

Screenshot 1. Window title: Diabetes Prediction Using Machine Learning

Pregnancies
Glucose                                   143
Enter Value of BloodPressure              90
Enter Value of SkinThickness              23
Enter Value of Insulin                    59
Enter Value of BMI                        26
Enter Value of DiabetesPedigreeFunction   0.5
Enter Value of Age                        23

Result after pressing Predict: Non-Diabetic

Screenshot 2. Window title: Diabetes Prediction Using Machine Learning

Pregnancies                               2
Glucose                                   189
Enter Value of BloodPressure              90
Enter Value of SkinThickness              23
Enter Value of Insulin                    59
Enter Value of BMI                        30
Enter Value of DiabetesPedigreeFunction   0.5
Enter Value of Age                        55

Result after pressing Predict: Diabetic

6 FUTURE WORK
One limitation of this study is that it relied solely on a structured
dataset for analysis. In future studies, we plan to incorporate
unstructured data as well. Additionally, we aim to apply these methods to
prediction in other medical fields, such as different types of cancer,
psoriasis, and Parkinson's disease. We also intend to consider factors such
as physical inactivity, family history of diabetes, and smoking habits in
our analysis for diabetes diagnosis.

References
1. Sarah Wild, Gojka Roglic, Anders Green, Richard Sicree, Hilary King;
   Global Prevalence of Diabetes: Estimates for the year 2000 and
   projections for 2030. Diabetes Care, 1 May 2004.
2. Ramachandran A, Ma RC, Snehalatha C. Diabetes in Asia. Lancet. 2010
   Jan 30.
3. El Naqa, I., Murphy, M.J.; "What Is Machine Learning?"; Springer, 2015.
4. Derakhshan, Behrouz & Markl, Volker. (2019). Continuous Deployment of
   Machine Learning Pipelines.
5. Annalisa Occhipinti, Louis Rogers, Claudio Angione; A pipeline and
   comparative study of 12 machine learning models for text classification,
   Expert Systems with Applications, Volume 201, 2022.
6. Kleinbaum, D.G. and Klein, M. (2010) Logistic Regression: A Self
   Learning Text. 3rd Edition, Springer, New York.
7. A. Giri, M. V. V. Bhagavath, B. Pruthvi and N. Dubey, "A Placement
   Prediction System using k-nearest neighbors classifier," 2016 Second
   International Conference on Cognitive Computing and Information
   Processing (CCIP), Mysuru, India, 2016.
8. S. Suthaharan. Machine learning models and algorithms for big data
   classification: thinking with examples for effective learning, vol. 36,
   Springer US, 2016.
9. G. Jagannathan, K. Pillaipakkamnatt and R. N. Wright, "A Practical
   Differentially Private Random Decision Tree Classifier," 2009 IEEE
   International Conference on Data Mining Workshops, Miami, FL, USA, 2009.
10. Archana Chaudhary, Savita Kolhe, Raj Kamal, An improved random forest
    classifier for multi-class classification, Information Processing in
    Agriculture, Volume 3, Issue 4, 2016.
11. Natekin Alexey, Knoll Alois, "Gradient boosting machines, a tutorial";
    Frontiers in Neurorobotics; Volume 7; 2013.
12. S. Reshmi, S. K. Biswas, A. N. Boruah, D. M. Thounaojam and B.
    Purkayastha, "Diabetes Prediction Using Machine Learning Analytics,"
    2022 International Conference on Machine Learning, Big Data, Cloud and
    Parallel Computing (COM-IT-CON), 2022.
13. H. B. Kibria, A. Matin, N. Jahan and S. Islam, "A Comparative Study with
    Different Machine Learning Algorithms for Diabetes Disease Prediction,"
    2021 18th International Conference on Electrical Engineering, Computing
    Science and Automatic Control (CCE), 2021.
