Chatbot for Disease Prediction Using Classification Based Machine Learning Algorithms
of the International Conference on Electrical, Computer, Communications and Mechatronics Engineering (ICECCME 2023)
19-21 July 2023, Tenerife, Canary Islands, Spain
2023 3rd International Conference on Electrical, Computer, Communications and Mechatronics Engineering (ICECCME) | 979-8-3503-2297-2/23/$31.00 ©2023 IEEE | DOI: 10.1109/ICECCME57830.2023.10252636
Abstract— In today’s world, there is a plethora of medical data which, if processed into information, could alleviate the burden on health-care systems and prevent hospital readmissions. From another perspective, it could also make patients aware of the criticality of their health conditions so that they take the required action. This warrants employing machine learning and data mining techniques to resolve the system’s shortcomings and reduce overhead. Precise and prompt disease diagnosis based on the patients’ symptoms has become imperative. The proposed research creates an automated conversational chatbot solution which takes in symptoms and, using Natural Language Processing, suggests probable symptoms to the users for a complete and holistic understanding of their condition. Following the aggregation of symptoms, the system predicts the disease based on a pre-labelled dataset using various classification-based Machine Learning techniques. The aim of the proposed system is for people to be aware of their health conditions and be able to determine when a doctor’s visit is absolutely necessary.

Keywords—chatbot, healthcare, Artificial Intelligence, Data mining, Machine Learning, Symptoms Analysis, Disease Diagnosis

I. INTRODUCTION

A prosperous society [1] is one in which people are healthy. The health index is most often correlated with the happiness of an individual. In today’s world, stress indicators due to fast-paced jobs have risen, which has had a direct impact on people’s health and on their awareness of their bodies and the corresponding signs and symptoms. It is only when symptoms become severe that people visit a doctor and get themselves checked. The excuse of leading a busy life [2] is often used for neglecting one’s own health. Nowadays, people respond more promptly to social media apps like Instagram, Facebook and WhatsApp than to their own bodies and their conditions. In developing countries, appropriate medical infrastructure is not accessible to all villages and towns. Moreover, the cost of health insurance is sky-rocketing, and getting a doctor’s appointment could take weeks to months because of immense backlogs. Such countries lack hospital infrastructure, but the Internet is available to these societies at a cheaper cost.

This emerging trend in society warrants leveraging chatbot-based technology in the health sector too. Hence, our study proposes the use of a medical chatbot, also referred to as a “virtual doctor”, which is implemented using a suite of machine learning algorithms and data mining techniques. Artificial Intelligence [3] gives computers the power to imitate the human way of thinking and responding. Chatbots (text- or voice-based devices) are one such technology that has emerged and bloomed in the market for its ability to replicate a human and provide intelligent answers and guidance as and when needed.

The applications of chatbots in the health industry are immense and can change the face of the medical systems we know today. In our work, we have employed a conversational text-based chatbot which acts as an interface between the human and the AI (Artificial Intelligence) model trained to perform predictions. For making such predictions, we have used NLP (Natural Language Processing) and classification-based algorithms.

In paper [1] by R. B. Mathew et al., the proposed system comprises chatbot text processing using NLP. The work proposes an Android application with sign-up and login features for the users. The text processing is done to transform symptoms to natural language. For this purpose, they have used Python’s NLTK (Natural Language Toolkit). This toolkit provides tokenization capabilities which convert regular text into vectors (bag of words). This natural language can then be fed into the model trained using the KNN (k-nearest neighbors) machine learning algorithm. The aim is for the chatbot to predict the disease and also provide a URL for the treatment needed for the disease.

In [2], the authors have devised a chatbot-based architecture which comprises data collection; building, training and testing the model; and then predicting the disease and treatment for the user. The chatbot acts as an interface between the users and the backend ML (Machine Learning) model. The input data goes through multiple layers: text pre-processing into natural language, symptom mapping and prediction of disease using varying machine learning models. For prediction, their work uses three classification algorithms: Naïve Bayes, Decision Tree and Random Forest. From their experiments and analysis, Naïve Bayes has better accuracy than the other two. It comes to a whopping 95.1%, whereas the other two have almost the same accuracy of 91.5%.
Authorized licensed use limited to: Yenepoya Institute of Technology. Downloaded on February 06,2025 at 05:40:05 UTC from IEEE Xplore. Restrictions apply.
The authors in [4] have utilized a linear design for their chatbot system, which starts with user login, followed by symptom intake and analysis, and finally suggests whether the disease is major or minor based on the symptoms entered by the user. The paper proposes four major steps: symptom collection, symptom mapping, disease diagnosis and deciding whether the disease is major or minor. If the disease is recognized to be major, the chatbot sends back details of a doctor whom the patient can consult. The database stores user details, which helps the system retrieve old chats that users can view for their reference. For symptom analysis, the authors have used string-searching algorithms to suggest the closest symptoms based on the user input.
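The closest-symptom suggestion described for [4] can be illustrated with a short sketch. Reference [4] is not quoted here as naming a specific string-searching algorithm, so the sketch assumes approximate matching with Python's standard difflib module. The symptom list, the function name suggest_symptoms and the cutoff value are hypothetical choices made for illustration.

```python
import difflib

# Hypothetical vocabulary of known symptoms; a real system would load this
# from its symptom database.
KNOWN_SYMPTOMS = ["headache", "fever", "nausea", "cough", "fatigue", "sore throat"]


def suggest_symptoms(user_input, limit=3, cutoff=0.6):
    """Return up to `limit` known symptoms that closely match the user's text."""
    cleaned = user_input.lower().strip()
    # get_close_matches ranks candidates by similarity ratio, best match first,
    # and drops any candidate scoring below `cutoff`.
    return difflib.get_close_matches(cleaned, KNOWN_SYMPTOMS, n=limit, cutoff=cutoff)


suggestions = suggest_symptoms("hedache")
```

A misspelled entry such as "hedache" is mapped back to a known symptom, while text that resembles nothing in the vocabulary yields an empty list, which a chatbot could treat as a prompt to ask the user again.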
II. METHODOLOGY
The “doctor bot”, or the chatbot with disease-diagnosing capabilities, is implemented using NLP algorithms and data mining techniques. The architecture follows a largely linear approach involving user dialog/conversation, symptom collection and processing, training the AI model, predicting the disease and selecting the best approach using determining factors such as accuracy.

The following section covers the Architecture of the Proposed System, Symptoms Collection and Suggestion, Data Extraction and Preprocessing, and Disease Diagnosis using Machine Learning.

A. Architecture of the Proposed System

The architecture of the proposed system, as shown in Fig. 1, comprises the chatbot, which acts as a conduit between the user and the backend system (model). The user joins the bot chat and is asked basic questions to determine their age, sex and email address (used later to send the generated report for reference). After the general questions are answered, the chatbot takes in the symptoms from the user. As an additional feature, the chatbot can suggest symptoms that the user might be experiencing based on their answers, using word embedding in NLP, for comprehensive data collection. The chatbot then sends the symptom data to the prediction system for further processing.

The data provided by the chatbot is in the form of text and must be pre-processed to make it fit for the predictive machine learning models. The text data is preprocessed using natural language processing (NLP) techniques such as tokenization and vectorization. The processed data is passed through the model in order to map the symptoms to the disease and predict it with a certain accuracy. The patient’s details, along with the symptoms and the predicted disease, are then written to a file. Using SMTP (Simple Mail Transfer Protocol), the file is sent as an email to the user for a complete view of their chat. The user could also provide the bot’s analysis to the doctor if a visit is required.

Fig. 1. Architecture of the System

B. Symptoms Collection and Analysis

Once the symptoms are entered by the user, as shown in Fig. 2, the system proposes similar symptoms to get a holistic view of the patient’s medical condition. The word-embedding technique in NLP, which maps words to vectors of numbers, is utilized for this purpose. It is therefore able to recommend words based on semantic and syntactic similarities to the symptoms entered by the user. For this step in the architecture, we used word embedding with the Gensim Word2Vec model. Gensim is an open-source Python library, and its Word2Vec model uses a two-layer neural network. There are mainly two kinds of training algorithms: Continuous Bag of Words (CBOW) and Skip-gram. For this experiment, we have utilized Skip-gram, as it is able to provide a target context once the symptom is entered.

To train this model, the input used is a corpus of medical text/dictionary. The algorithm requires the input to be a list of lists, which involves some text processing on the symptom-disease dataset in CSV (comma-separated values) format. The model is then trained to create a space in which words that are similar in context cluster together. The Euclidean distance is used to compute the distance between the vectors. As a result of this step, we were able to find the keywords most similar to the symptoms entered by the user, as shown in Fig. 3. The top 3 symptoms with the closest Euclidean distance are then suggested back to the user. As a future improvement, cosine similarity can be used to determine the closest words. Moreover, for this research, we have assumed that the user has clarity about the symptoms he/she is experiencing.

C. Data Extraction and Preprocessing

• Convert the dataset into a Python dictionary to obtain a symptom-disease mapping, which is then converted into two separate lists of symptoms and disease labels.
• The list of symptoms is then converted into numerical representations using a vectorization technique. For vectorization, the symptoms’ description is transformed into features, as shown in Fig. 5, which can be used as input to the training model. The CountVectorizer algorithm is used, which results in multiple two-dimensional matrices where each row represents the symptoms with the corresponding disease label.
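The two preprocessing steps listed above can be sketched in plain Python. This is a from-scratch stand-in for the CountVectorizer step, not the authors' code: it splits on whitespace only, whereas scikit-learn's CountVectorizer applies its own tokenization rules. The toy dictionary, the helper names vectorize and knn_predict, and the choice of k are assumptions made for illustration. The final lines feed the count rows to a small k-nearest-neighbors vote, one of the three classifier families the paper compares, only to show that the matrix is usable as model input.

```python
from collections import Counter
import math

# Hypothetical toy mapping (disease -> symptom descriptions); not the paper's dataset.
disease_to_symptoms = {
    "flu": ["fever cough body ache", "fever chills cough"],
    "migraine": ["headache nausea light sensitivity", "headache nausea"],
    "allergy": ["sneezing itchy eyes", "sneezing runny nose itchy eyes"],
}

# Step 1: flatten the dictionary into parallel lists of texts and labels.
symptom_texts, disease_labels = [], []
for disease, descriptions in disease_to_symptoms.items():
    for text in descriptions:
        symptom_texts.append(text)
        disease_labels.append(disease)

# Step 2: build a sorted vocabulary, then count each vocabulary token per text.
vocabulary = sorted({tok for text in symptom_texts for tok in text.lower().split()})


def vectorize(text):
    """Map a description to a row of token counts over the vocabulary."""
    counts = Counter(text.lower().split())
    return [counts.get(tok, 0) for tok in vocabulary]


# One row per description, one column per vocabulary token.
X = [vectorize(text) for text in symptom_texts]


def knn_predict(query_text, k=3):
    """Label a new description by majority vote of its k nearest rows (Euclidean)."""
    q = vectorize(query_text)
    nearest = sorted(zip((math.dist(q, row) for row in X), disease_labels))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


prediction = knn_predict("fever and cough")
```

Words outside the vocabulary, such as "and" in the query, are simply ignored, which is also how a fitted count vectorizer treats unseen tokens.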
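Code for the Gensim Word2Vec model (Section B)
import gensim

# mylist: list of token lists built from the symptom-disease CSV
model = gensim.models.Word2Vec(
    mylist,
    window=5,      # context window size
    min_count=2,   # ignore tokens seen fewer than 2 times
    workers=4,     # training threads
    sg=1)          # sg=1 selects Skip-gram (sg=0 would be CBOW)
# The constructor already trains on mylist; this runs 10 further epochs
model.train(mylist, total_examples=len(mylist), epochs=10)
# Three nearest neighbours of 'fever', matching the top-3 suggestions in the text
model.wv.most_similar('fever', topn=3)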
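Section B describes ranking candidate symptoms by Euclidean distance and names cosine similarity as a future improvement. Gensim's most_similar itself ranks neighbors by cosine similarity, so a Euclidean ranking has to be computed from the vectors directly. The sketch below shows both rankings on hypothetical two-dimensional vectors invented for illustration; real Word2Vec vectors have many more dimensions, and the helper names are assumptions.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; close to 0 when the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))


def top_k_similar(vectors, query, k=3, metric="euclidean"):
    """Return the k words whose vectors are closest to the query word's vector."""
    distance = math.dist if metric == "euclidean" else cosine_distance
    q = vectors[query]
    ranked = sorted(
        (distance(q, vec), word) for word, vec in vectors.items() if word != query
    )
    return [word for _, word in ranked[:k]]


# Hypothetical 2-D embeddings, invented for illustration only.
toy_vectors = {
    "fever": [1.0, 1.0],
    "chills": [1.2, 0.9],
    "high_temperature": [3.0, 3.0],
    "sweating": [0.8, 1.3],
    "rash": [-1.0, 0.5],
}

euclid = top_k_similar(toy_vectors, "fever", k=3, metric="euclidean")
cosine = top_k_similar(toy_vectors, "fever", k=3, metric="cosine")
```

In this toy data, "high_temperature" points in the same direction as "fever" but has a larger magnitude, so it ranks first under cosine distance and falls out of the Euclidean top three. That difference is the practical reason the choice of metric matters for symptom suggestion.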
D. Disease Diagnosis using Classification Algorithms

In this paper, three classification algorithms are used over a set of 490 symptoms and 1500 diseases in order to compare the accuracy of the different models. Based on our experiments and observations, Logistic Regression performed better than the Decision Tree Classifier and K-Neighbors Classifier algorithms on a large dataset. About 4 epochs were run on the dataset to get maximum accuracy in training the model.

1) Logistic Regression is generally used to predict a dependent label based on independent variables (which are symptoms here). For this purpose, we have used Multinomial Logistic Regression, since we have multiple output categories for disease labels.

2) Decision Tree, as the name suggests [5], is a tree-like structure which comprises leaf and non-leaf nodes and branches. Each non-leaf node denotes a condition on an input feature, each branch denotes the decision made, and each leaf node classifies into a label (disease labels in this dataset). On application of the decision tree algorithm with symptoms as the non-leaf nodes and the disease as the class label, it was found that the accuracy was lower than that of Logistic Regression for a large dataset. In this study, WEKA (Waikato Environment for Knowledge Analysis) [5] was used to visualize the data. WEKA is an open-source tool that contains different methods for pre-processing, classification and clustering, and provides capabilities to visualize the data.

3) K-nearest neighbors [6] is an essential classification algorithm among the supervised learning algorithms. In this technique, new data (symptoms) are classified by the vote of their nearest neighbors. The classification is based on a distance function, such as the Euclidean, Manhattan or Minkowski distance. The k value in the algorithm refers to the number of neighbors. The distribution between symptoms and disease in this dataset was found to be as described in Fig. 6.

Fig. 6. Visual Graph for K-nearest neighbors

III. RESULTS

Based on our work and findings, the following accuracies were obtained for the three classification algorithms. Over a large dataset, Logistic Regression performed the best of all the classification algorithms, as shown in Table I. Previous researchers [7] have used a different combination of classification algorithms, which includes the Random Forest classifier and the Naïve Bayes classifier. However, on a very large dataset, Naïve Bayes causes overfitting, which means that the model tries to memorize the training dataset and is unable to produce accurate output for new test data. In our research, we noticed a trade-off between dataset volume and accuracy. We wanted to incorporate as many symptoms and diseases of the dataset as possible, such that we could build the accuracy starting there.

TABLE I. ACCURACY OF THE THREE DIFFERENT CLASSIFICATION ALGORITHMS USING LARGER DATASET

S. No   Algorithm             Accuracy
1       Logistic Regression   89.2%
2       Decision Tree         86.7%
3       KNN                   85.6%

IV. CONCLUSION AND FUTURE WORK

This paper proposes a disease prediction chatbot system where the user inputs basic information and discusses details of the signs and symptoms they observe. Once the symptoms are collated, the text data is processed into vectors so that the data is fit to train the models. The models are trained using three classification algorithms: Logistic Regression, Decision Tree and KNN. A huge dataset was utilized, with about 490 symptoms and 1500 diseases. On running the test cases, it was observed that Logistic Regression performed the best of the three algorithms. The aim of our work was to mimic a real-world scenario where there is an exponential increase in the medical data generated each year. Therefore, for our research, we utilized a dataset with a large number of diseases, and after conducting rigorous experiments, we were able to achieve a whopping accuracy of 89.2%. Our proposed model was able to handle the challenges of the data volume and achieve this accuracy irrespective of it.

Chatbots are the future of the healthcare industry. These chatbots [8] can also be integrated with mobile devices for ease of use and technical feasibility. If implemented correctly, AI can change the face of medical systems and make medical services available online and accessible to everyone at a cheaper price. However, the technology is evolving fast, so it can be very uncertain in terms of compliance and regulatory needs, especially when it comes to sensitive personal information and health data. In future work, cybersecurity challenges need to be considered in depth. We need to ensure that the chatbot collects data over encrypted and safe channels. In our research, Logistic Regression worked best in terms of accuracy and predicting diseases on the test data. In our
future work, we aim to broaden the scope of the classification algorithms used. CNNs (Convolutional Neural Networks) [9], weighted KNN and SVM (Support Vector Machine) can also be brought into scope to improve accuracy. The accuracy of disease prediction [10] using AI is also a cause of major concern; hence, datasets with more precise and accurate data, along with well-trained ML models, are a must.

REFERENCES

[1] R. B. Mathew, S. Varghese, S. E. Joy and S. S. Alex, "Chatbot for Disease Prediction and Treatment Recommendation using Machine Learning," 2019 3rd International Conference on Trends in Electronics and Informatics (ICOEI), Tirunelveli, India, 2019, pp. 851-856, doi: 10.1109/ICOEI.2019.8862707.
[2] J. Agarwal, M. Kumar and A. K. Srivastava, "Symptoms Based Disease Diagnosis and Treatment Recommendation," 2021 International Conference on Computational Performance Evaluation (ComPE), Shillong, India, 2021, pp. 162-167, doi: 10.1109/ComPE53109.2021.9751805.
[3] D. Madhu, C. J. N. Jain, E. Sebastain, S. Shaji and A. Ajayakumar, "A novel approach for medical assistance using trained chatbot," 2017 International Conference on Inventive Communication and Computational Technologies (ICICCT), Coimbatore, India, 2017, pp. 243-246, doi: 10.1109/ICICCT.2017.7975195.
[4] S. Divya, V. Indumathi, S. Ishwarya, M. Priyasankari and S. Kalpana Devi, "A Self-Diagnosis Medical Chatbot Using Artificial Intelligence," Journal of Web Development and Web Designing, vol. 3, no. 1, 7 Apr. 2018. Accessed 4 Mar. 2023.
[5] R. Chadha and S. Mayank, "Prediction of heart disease using data mining techniques," CSIT 4, 193–198 (2016). https://doi.org/10.1007/s40012-016-0121-0
[6] S. Chowdhury and M. P. Schoen, "Research Paper Classification using Supervised Machine Learning Techniques," 2020 Intermountain Engineering, Technology and Computing (IETC), Orem, UT, USA, 2020, pp. 1-6, doi: 10.1109/IETC47856.2020.9249211.
[7] S. Grampurohit and C. Sagarnal, "Disease Prediction using Machine Learning Algorithms," 2020 International Conference for Emerging Technology (INCET), Belgaum, India, 2020, pp. 1-7, doi: 10.1109/INCET49848.2020.9154130.
[8] A. N. V. K. Swarupa, V. H. Sree, S. Nookambika, Y. K. S. Kishore and U. R. Teja, "Disease Prediction: Smart Disease Prediction System using Random Forest Algorithm," 2021 IEEE International Conference on Intelligent Systems, Smart and Green Technologies (ICISSGT), Visakhapatnam, India, 2021, pp. 48-51, doi: 10.1109/ICISSGT52025.2021.00021.
[9] M. Gulhane and T. Sajana, "A Machine Learning based Model for Disease Prediction," 2021 International Conference on Computing, Communication and Green Engineering (CCGE), Pune, India, 2021, pp. 1-5, doi: 10.1109/CCGE50943.2021.9776374.
[10] R. Chadha, S. Mayank, A. Vardhan and T. Pradhan, "Application of Data Mining Techniques on Heart Disease Prediction: A Survey," in: Shetty, N., Prasad, N., Nalini, N. (eds) Emerging Research in Computing, Information, Communication and Applications. Springer, New Delhi, 2016. https://doi.org/10.1007/978-81-322-2553-9_38