Springer Lecture Notes in Computer Science (1)
1 Assistant Professor, Department of Electrical Engineering, MAIT, India
2 BTech, Department of Electrical Engineering, MAIT, India
Introduction
Diabetes and its related complications impose a considerable healthcare burden
on a global scale, creating significant challenges for patients, healthcare
systems, and national economies (Panel 1). According to the World Health
Organization (WHO), the global population is projected to grow by 37% between
2000 and 2030, whereas the number of individuals affected by diabetes is
estimated to surge by 114% during the same period [1]. Diabetes is spreading quickly
in Asia, making it the main region facing a growing epidemic of this disease.
Based on conservative estimates considering factors like population growth,
aging, and urbanization, it is predicted that by 2030, India and China will have
the highest number of individuals affected by diabetes. India is projected to
have approximately 79.4 million people with diabetes, while China is expected
to have around 42.3 million individuals with the disease. Additionally, four more
of the top ten countries with the highest diabetes rates are in Asia, including
Indonesia, Pakistan, Bangladesh, and the Philippines. The actual prevalence of
diabetes may be underestimated, as these estimates do not take into account
other diabetes-related risk factors. By 2025, the global population is expected
to reach 7.9 billion, with nearly 50% of the annual population increase
contributed by six countries. Among these countries, India, China, and Pakistan
are significant contributors, accounting for 21%, 12%, and 5% of the increase,
respectively [2]. It is important to note that Asian populations are racially diverse
and possess varying demographic, cultural, and socioeconomic attributes, which
can potentially influence the variations in diabetes causes and development.
Machine learning involves teaching computers to learn and make decisions on their
own using examples and experiences, rather than programming them to follow
specific instructions. Machine learning methods are considered the "workhorse" of
the so-called new era of big data. Machine learning based techniques have been
successfully applied in many fields such as pattern recognition, computer vision,
spacecraft engineering, finance, entertainment, computational biology, and
biomedical and medical applications. More than half of cancer patients receive ionizing radiation
(radiotherapy) as part of their treatment, which is the mainstay of treatment
for advanced localized disease. Radiation therapy involves a complex series of
procedures that begin at the consultation stage and extend beyond treatment
to ensure that patients receive the correct dose of radiation and respond
positively to treatment. Radiotherapy can be complicated, involving different steps
where people and machines work together to make decisions. The complexity
inherent in these procedures underscores the suitability of employing machine
learning algorithms to optimize and streamline them. These procedures
encompass a range of critical aspects, including radiation physics quality assurance,
contouring, treatment planning, image-guided radiotherapy, respiratory motion
management, treatment response modeling, and outcomes prediction. Machine
learning algorithms can acquire knowledge from
the prevailing context and extend that understanding to previously unseen tasks.
This enables significant advancements in both the safety and efficacy of
radiotherapy practice, ultimately resulting in improved patient outcomes [3].
GUI Based Diabetes Prediction Using Pipeline
2.1 Pipeline
sion of 85%, Naive Bayes achieved 77%, and SVM achieved 77.3%. These results
indicate a substantial level of accuracy in the employed methods [2].
The machine learning (ML) community has focused on the prediction of diabetes
and several studies have been conducted on this topic. Considering the severity
of this disease, this article introduces a model called Diabetes Expert System,
which uses machine learning analysis (DESMLA) to effectively predict diabetes
by analyzing diabetes data. The diabetes dataset is unbalanced, so the DESMLA
model uses five commonly used oversampling techniques: SMOTE, Borderline
SMOTE, ADASYN, KMeans SMOTE, and Gaussian SMOTE to solve the class
imbalance problem. The model also uses decision trees (DT) and random forests
(RF) as classifiers, as well as various data preprocessing steps for diabetes
prediction. Experimental results show that DESMLA models with KMeans SMOTE
and Gaussian SMOTE perform better [3].
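The oversampling idea behind the SMOTE family can be illustrated with a minimal sketch. This is plain SMOTE only, written from the general definition of the technique (interpolating between a minority sample and one of its nearest minority neighbors); it is not the DESMLA implementation, and the function name, the parameters `k` and `seed`, and the toy data are assumptions made for illustration.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by neighbor interpolation.

    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbors of the base point (excluding itself)
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        # The new point lies on the segment between base and the chosen neighbor
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Toy minority class with two features (values are illustrative only)
minority_samples = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote(minority_samples, n_new=6, k=2)
```

Variants such as Borderline SMOTE or KMeans SMOTE differ mainly in how the base points are chosen, which this sketch does not attempt to reproduce.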
Diabetes is a disease characterized by elevated blood glucose levels, also
known as hyperglycemia. It occurs when the body is unable to produce enough
insulin, a hormone responsible for converting glucose from food into energy. The
increasing prevalence of this disease has motivated researchers to work hard to
develop efficient models for diagnosing it. In the healthcare sector, there is a
vast amount of data readily accessible, making it convenient to extract valuable
information for diagnosis and develop new models that yield better results. The
objective of this research is to propose an efficient machine learning model for
diabetes prediction. Logistic regression, Support Vector Machine (SVM), and
k-nearest neighbors (k-NN) algorithms were used to classify diabetic patients.
After pre-processing the data and training on it, these algorithms produced good
results. In terms of accuracy on test data, logistic regression outperformed the other
models, achieving the highest accuracy of 83%. SVM and k-NN also performed
well, with accuracies of 82% and 79%, respectively. These results demonstrate
the effectiveness of the proposed model, showcasing improved performance
compared to previous studies [4].
3.1 Dataset
We obtained the "Pima Indians Diabetes Database" dataset from Kaggle.
The dataset used in this study originates from the National Institute of Diabetes
and Digestive and Kidney Diseases. Its primary objective is to predict whether
a patient has diabetes or not based on diagnostic measurements. The instances
included in the dataset were selected based on specific criteria from a larger
database. The dataset comprises several medical predictors, also known as
independent variables, and one target or dependent variable named "Outcome."
These independent variables include the number of pregnancies a patient has
had, their Body Mass Index (BMI), insulin level, age, and so on.
Before feeding the dataset into the machine learning models, pre-processing steps
are applied to improve model performance. These steps include removing outliers
and handling missing values, ensuring that the data is reliable and suitable for
accurate analysis.
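The text does not say how missing values were handled, so the following is only one possible approach. It assumes that missing measurements are recorded as 0 in the columns listed in `ZERO_AS_MISSING` and fills them with the column median; the column list, the function name, and the sample records are assumptions for illustration.

```python
from statistics import median

# Columns where a recorded 0 is treated as "missing" (assumption)
ZERO_AS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]


def impute_zeros_with_median(rows, columns=ZERO_AS_MISSING):
    """Return copies of rows with zeros in the given columns replaced by the column median."""
    cleaned = [dict(row) for row in rows]
    for col in columns:
        observed = [row[col] for row in rows if row[col] != 0]
        if not observed:
            continue  # no non-zero values to compute a median from
        fill = median(observed)
        for row in cleaned:
            if row[col] == 0:
                row[col] = fill
    return cleaned


# Made-up records for illustration, not taken from the dataset
records = [
    {"Glucose": 120, "BloodPressure": 70, "SkinThickness": 30, "Insulin": 0, "BMI": 31.0},
    {"Glucose": 90, "BloodPressure": 66, "SkinThickness": 28, "Insulin": 94, "BMI": 27.0},
    {"Glucose": 0, "BloodPressure": 64, "SkinThickness": 0, "Insulin": 168, "BMI": 23.0},
]
imputed = impute_zeros_with_median(records)
```

The median is used here rather than the mean because it is less sensitive to extreme values, which matters when outliers have not yet been removed.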
Outliers Removal - The dataset may contain attribute values that deviate
significantly from the rest. These values can negatively impact the performance of the
machine-learning algorithm. To remove these outliers, we employed the
Z-score method.
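A minimal sketch of Z-score filtering is given below. A value x has z-score z = (x - mean) / std, and a row is dropped when |z| exceeds a threshold in any selected column. The threshold of 3 is a common convention and an assumption here, since the text does not state the cutoff; the function name and sample data are also illustrative.

```python
from statistics import mean, pstdev


def remove_outliers_zscore(rows, columns, threshold=3.0):
    """Keep rows whose z-score is within the threshold for every listed column."""
    stats = {}
    for col in columns:
        values = [row[col] for row in rows]
        stats[col] = (mean(values), pstdev(values))

    kept = []
    for row in rows:
        is_outlier = False
        for col in columns:
            mu, sigma = stats[col]
            # A constant column (sigma == 0) cannot produce outliers
            if sigma > 0 and abs(row[col] - mu) / sigma > threshold:
                is_outlier = True
                break
        if not is_outlier:
            kept.append(row)
    return kept


# Illustrative data: 20 typical BMI values plus one extreme value
data = [{"BMI": 30.0 + (i % 5)} for i in range(20)] + [{"BMI": 90.0}]
filtered = remove_outliers_zscore(data, columns=["BMI"])
```

Note that with very few rows no value can reach a z-score of 3, so the method is only meaningful on reasonably sized samples.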
[Figure: per-feature distributions split by Outcome (0 and 1); the legible panel axes are DiabetesPedigreeFunction and Age.]
In the model training and evaluation stage, we trained the machine learning
models and tested them on the input data. Their accuracy is represented as
follows:
We have evaluated the proposed model using various performance metrics and
have determined that the Gradient Boosting Classifier (GBC) and Random
Forest Classifier (RFC) are the best algorithms for the GUI-based diabetes
prediction system. The GBC model achieved a training
accuracy of 94% and a test accuracy of 77%. On the other hand, the RFC
demonstrated a training accuracy of 96% and a test
accuracy of 75%. The GBC algorithm is selected as the best model for
making predictions as it has a higher test accuracy than the RFC. Additionally, by
fine-tuning its hyperparameters, we achieved a test accuracy of about 77%.
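The selection rule described here (prefer the model with the higher test accuracy) can be sketched as follows. The accuracy figures are the ones reported in the text; the function names and the dictionary layout are assumptions for illustration, and the sketch does not train any model.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def select_by_test_accuracy(scores):
    """Return the model name with the highest test accuracy."""
    return max(scores, key=lambda name: scores[name]["test"])


# Accuracies as reported in the text
reported = {
    "GBC": {"train": 0.94, "test": 0.77},
    "RFC": {"train": 0.96, "test": 0.75},
}
best = select_by_test_accuracy(reported)
# Train-test gap per model; a large gap suggests overfitting
gaps = {name: round(s["train"] - s["test"], 2) for name, s in reported.items()}
```

Both models show a sizeable gap between training and test accuracy (0.17 for GBC and 0.21 for RFC), which is why the comparison is made on the test set rather than on the training set.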
Once the most suitable model is identified, it is used to make predictions
on new input data provided by the user via the GUI. We used
Joblib to import the model and enable lightweight pipelining. To create the GUI
application, we used Tkinter, a widely used GUI library for Python
that provides developers with a convenient framework for building graphical
user interfaces. Tkinter offers an object-oriented interface
to the Tk GUI toolkit, making it easy to design windows, buttons, input
fields, and other interactive components. The predicted diabetes status is
displayed to the user through the GUI.
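The step between the GUI fields and the model can be sketched without any GUI code. The sketch assumes the feature names and order of the Kaggle version of the dataset and the labels "Diabetic" and "Non-Diabetic"; `parse_inputs`, `predict_status`, and the `ThresholdModel` stand-in (with its arbitrary glucose cutoff) are hypothetical names introduced only for illustration. In the actual application the strings would come from the Tkinter entry widgets and the model would be the one loaded with Joblib.

```python
FEATURES = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
            "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]
LABELS = {0: "Non-Diabetic", 1: "Diabetic"}


def parse_inputs(raw):
    """Convert text-field strings into a feature vector in column order."""
    values = []
    for name in FEATURES:
        text = raw.get(name, "").strip()
        try:
            value = float(text)
        except ValueError:
            raise ValueError(f"{name}: expected a number, got {text!r}") from None
        if value < 0:
            raise ValueError(f"{name}: value must be non-negative")
        values.append(value)
    return values


def predict_status(model, raw):
    """Run the model on one record and return the label shown to the user."""
    features = parse_inputs(raw)
    label = int(model.predict([features])[0])
    return LABELS[label]


class ThresholdModel:
    """Stand-in for the trained classifier, used only to exercise the code path."""

    def predict(self, rows):
        glucose_index = FEATURES.index("Glucose")
        # Arbitrary cutoff for demonstration; not a clinical or learned rule
        return [1 if row[glucose_index] >= 140 else 0 for row in rows]


# Illustrative form contents, as strings typed into the input fields
form = {"Pregnancies": "2", "Glucose": "189", "BloodPressure": "70",
        "SkinThickness": "30", "Insulin": "100", "BMI": "31.5",
        "DiabetesPedigreeFunction": "0.5", "Age": "45"}
status = predict_status(ThresholdModel(), form)
```

Keeping parsing and prediction in plain functions means the button callback only has to collect the field strings and display the returned label, and invalid input can be reported to the user by catching `ValueError`.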
[Figure: workflow of the proposed system: PIMA dataset, visualization of data, data pre-processing, split into training set and test set, ML models, diabetes prediction.]
4 CONCLUSION
Machine learning and data mining techniques play a crucial role in disease
diagnosis, particularly in predicting diabetes at an early stage, which is vital for
effective patient treatment. This paper explores different classification methods
for the medical diagnosis of diabetes patients, focusing on their accuracy. We
framed the task as a classification problem evaluated in terms of accuracy and
applied three machine-learning techniques to the Pima Indians diabetes dataset.
Our model was trained and validated using a test dataset. From our simulation
and analysis of the data, we found that our model accurately
identified diabetic and non-diabetic patients, with an accuracy of 94% on
741 data points. Our results showed that the Artificial Neural Network
(ANN) outperformed the other models in this study. Additionally, we used
association rule mining, which identified a strong association of Body Mass
Index (BMI) and glucose levels with diabetes.
5 Simulation Results
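[Figure: screenshots of the GUI showing the Predict button and the predicted status (for example, Non-Diabetic); one screenshot shows a Glucose field containing 189.]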
6 FUTURE WORK
One limitation of this study is that it solely relied on a structured dataset for
analysis. In future studies, we plan to incorporate unstructured data as well.
Additionally, we aim to apply these methods to other medical fields for
prediction, such as different types of cancer, psoriasis, and Parkinson's disease. We also
intend to consider factors such as physical inactivity, family history of diabetes,
and smoking habits in our analysis for diabetes diagnosis.
References
1. Sarah Wild, Gojka Roglic, Anders Green, Richard Sicree, Hilary King, "Global
Prevalence of Diabetes: Estimates for the year 2000 and projections for 2030,"
Diabetes Care, 1 May 2004.
2. Ramachandran A, Ma RC, Snehalatha C, "Diabetes in Asia," Lancet, 30 Jan 2010.
3. El Naqa, I., Murphy, M.J., "What Is Machine Learning?", Springer, 2015.
4. Derakhshan, Behrouz and Markl, Volker, "Continuous Deployment of Machine
Learning Pipelines," 2019.
5. Annalisa Occhipinti, Louis Rogers, Claudio Angione, "A pipeline and comparative
study of 12 machine learning models for text classification," Expert Systems with
Applications, Volume 201, 2022.
6. Kleinbaum, D.G. and Klein, M., Logistic Regression: A Self Learning Text,
3rd Edition, Springer, New York, 2010.
7. A. Giri, M. V. V. Bhagavath, B. Pruthvi and N. Dubey, "A Placement Prediction
System using k-nearest neighbors classifier," 2016 Second International Conference
on Cognitive Computing and Information Processing (CCIP), Mysuru, India,
2016.
8. S. Suthaharan, Machine learning models and algorithms for big data classification:
thinking with examples for effective learning, vol. 36, Springer US, 2016.
9. G. Jagannathan, K. Pillaipakkamnatt and R. N. Wright, "A Practical Differentially
Private Random Decision Tree Classifier," 2009 IEEE International Conference on
Data Mining Workshops, Miami, FL, USA, 2009.
10. Archana Chaudhary, Savita Kolhe, Raj Kamal, "An improved random forest
classifier for multi-class classification," Information Processing in Agriculture, Volume
3, Issue 4, 2016.
11. Natekin Alexey, Knoll Alois, "Gradient boosting machines, a tutorial," Frontiers
in Neurorobotics, Volume 7, 2013.
12. S. Reshmi, S. K. Biswas, A. N. Boruah, D. M. Thounaojam and B. Purkayastha,
"Diabetes Prediction Using Machine Learning Analytics," 2022 International
Conference on Machine Learning, Big Data, Cloud and Parallel Computing
(COM-IT-CON), 2022.
13. H. B. Kibria, A. Matin, N. Jahan and S. Islam, "A Comparative Study with
Different Machine Learning Algorithms for Diabetes Disease Prediction," 2021 18th
International Conference on Electrical Engineering, Computing Science and
Automatic Control (CCE), 2021.