Machine Learning for Disease Prediction
A
Synopsis Report
Submitted in Partial Fulfillment of the Requirements
for the Degree
of
Bachelor Of Technology
In
Computer Science and Engineering
by
Sundaram Shukla
Ali Akbar Ansari
Umesh Kumar
Shashi Prakash
{sundaram2020cse, ali2020cse, umesh2021lcse, shashi2020cse}@iert.ac.in
Acknowledgement
We would like to extend our heartfelt gratitude to everyone who made the completion of this report
possible. We are especially thankful to our supervisors, Dr. Vimal Mishra and Dr. Rohit, for their
invaluable help, insightful suggestions, and unwavering encouragement throughout the development
process and the writing of this report. Their assistance was pivotal in every step of our journey. We
are also deeply appreciative of the time they dedicated to proofreading and correcting our numerous
errors.
Our sincere thanks also go to our batch mates and the staff of the Computer Science and Engineering
department for their crucial support. Their permission to use the lab equipment and all the
tools in the laboratory was instrumental to our work.
Declaration
We, the undersigned, hereby declare that the project titled "Multiple Disease Prediction System
Using Machine Learning" reflects our original research efforts conducted under the guidance of
Dr. Rohit and Dr. Vimal Mishra. This project applies machine learning methodologies to tackle
the complexities of disease prediction. The content presented in this project is original and has
not been previously submitted for any academic credential. We have adhered to the ethical
standards outlined by our institution throughout the project. The synopsis and conclusions are
based on our research outcomes, ensuring originality and compliance with university guidelines.
Proper attribution has been given to the external sources used, both in the text and in the
references, wherever applicable.
Sundaram Shukla
(2001100100061)
Umesh Kumar
(2101100109006)
Shashi Prakash
(2001100100055)
Abstract
A disease can be defined as any disturbance in the structure or function of an organ of the body.
Diseases strongly affect human health and become serious if they are not detected early. Because
predicting diseases is difficult even for doctors, machine learning algorithms are increasingly
used to estimate the likelihood of a disease. A large body of work exists on predicting diseases
with machine learning algorithms. However, the accuracy and speed of existing models remain an
issue.
The proposed system employs machine learning and deep learning algorithms to predict multiple
diseases, including heart disease, diabetes, malaria, and lung cancer. It comprises random forest
and convolutional neural network (CNN) algorithms. The CNN of the proposed model is trained on
the CT scans of patients and extracts features from the images. The random forest algorithm of
the proposed model is trained on tabular datasets. The proposed model achieves an accuracy of 88%
and 78% for heart disease and diabetes, respectively. It achieves an accuracy of 95% for both
malaria and lung cancer.
Contents
1 Introduction
1.1 Literature Review
3 Implementation
3.1 Data Collection
3.2 Data Preprocessing
3.3 Random Forest Classifier
3.4 Convolutional Neural Network Model
4 Observations
4.1 Confusion Metrics
4.2 Training and validation loss
4.3 Training and validation accuracy
5 Conclusion
6 Future Work
1 Introduction
Diseases are a common occurrence in human beings, and nearly everyone experiences some form
of illness at some point. Some of the most prevalent diseases include heart disease, diabetes,
lung cancer, and malaria, which collectively affect a significant portion of the global population.
According to the World Health Organization [1], heart disease claims 17.9 million lives annually,
while lung cancer causes approximately 1.8 million deaths. Malaria results in around 0.6 million
deaths, and diabetes accounts for 1.5 million deaths each year. These staggering statistics highlight
the global impact of these diseases, often exacerbated by the lack of accurate early detection.
The early and precise detection of diseases can significantly reduce their adverse effects, yet
even skilled doctors and healthcare professionals sometimes fail to diagnose them accurately. Over
the years, numerous efforts have been made to tackle this issue, and one promising solution is the
development of a disease prediction system that leverages machine learning and deep learning
algorithms. These technologies can analyze features from patients' records, including lab test
data and medical images like CT scans, to predict the presence of diseases. By identifying whether
a patient is infected or uninfected, such systems can assist healthcare providers in making timely
and informed decisions, thereby improving patient outcomes. This project focuses on creating a
comprehensive multiple-disease prediction system, utilizing tabular and image data, to enhance
diagnostic accuracy for diseases like heart disease, diabetes, malaria, and lung cancer.
This initiative aligns with the broader goal of integrating machine learning into healthcare to
improve diagnostics and treatment strategies. The experience gained from working on this project,
including handling large datasets and applying various machine learning algorithms, will be
invaluable. The knowledge and skills developed will be instrumental in our future endeavors as
Python developers, contributing to the field of healthcare technology and beyond.
1.1 Literature Review
Apurv et al. [2] report that machine learning has been used to determine whether an individual is
affected by heart disease. Machine learning can predict cardiovascular disease by evaluating
features such as the occurrence of chest pain, the person's age, and the cholesterol level, among
others. In their study, KNN achieved a prediction accuracy of 86.88%, while random forest achieved
81.96%.
Isfafuzzaman et al. [3] collected a dataset of 203 individuals and used the XGBoost classifier
to classify whether a patient suffers from diabetes. The model achieved an accuracy of 81%.
In 2015, Kanika et al. [4] worked on feature extraction and classification to detect whether a
patient is affected by malaria. The model achieved an accuracy of 87.8%.
Muntasir et al. [5] used a CNN model to detect whether a patient is affected by lung cancer.
They collected CT scan images of patients from the Kaggle online platform and also implemented
other models, such as V3, Xception, and ResNet-50, to compare performance. The model achieved an
accuracy of 92%.
2 Proposed System Architecture
The project proposes a model that employs two algorithms: Random Forest and Convolutional Neural
Network (CNN). The random forest algorithm is trained on tabular data to predict heart disease and
diabetes. The CNN is trained on the cell image dataset to predict malaria and on the chest CT scan
image dataset to predict lung cancer. Figure 3 shows the architecture diagram of the proposed model.
A Level 0 Use Case Diagram for the Multiple Disease Prediction System illustrates a basic
interaction: a user inputs a patient's health records into the system, which then processes this data
using machine learning algorithms. The system outputs a classification, indicating whether the
patient is infected or uninfected, providing a quick and accurate health assessment.
Figure 2: Level 3 Use case diagram
The use case diagram illustrates the interaction between a user and a multiple disease prediction
system built with Streamlit. The user begins by running the Streamlit app and configuring the
system parameters. They then navigate to the disease dashboard, select the disease to be predicted,
and enter the required input data, which may include numerical values and image files. The system
incorporates robust error handling mechanisms: it displays errors for invalid string inputs, out-of-
range values, missing inputs, and invalid image files. Users must correct these errors by re-entering
the correct data or uploading valid image files. Once valid inputs are provided, the system predicts
the disease using the appropriate machine learning model, logs the results for future reference, and
displays the prediction results to the user, completing the interaction cycle. This comprehensive
flow ensures a reliable and accurate disease prediction process.
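The error-handling path described above can be sketched in plain Python. This is only an illustrative sketch, not the project's actual Streamlit code: the field names, the numeric ranges in `FIELD_RANGES`, and the function `validate_inputs` are hypothetical and chosen purely for demonstration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical field ranges, chosen only for illustration.
FIELD_RANGES: Dict[str, Tuple[float, float]] = {
    "age": (0, 120),
    "cholesterol": (50, 700),
}


def validate_inputs(raw: Dict[str, Optional[str]]) -> Tuple[Dict[str, float], List[str]]:
    """Return the parsed values plus a list of error messages for the user."""
    values: Dict[str, float] = {}
    errors: List[str] = []
    for name, (low, high) in FIELD_RANGES.items():
        text = raw.get(name)
        if text is None or text.strip() == "":
            errors.append(f"{name}: missing input")
            continue
        try:
            number = float(text)
        except ValueError:
            # Covers invalid string inputs such as "abc".
            errors.append(f"{name}: not a number")
            continue
        if not low <= number <= high:
            errors.append(f"{name}: out of range [{low}, {high}]")
            continue
        values[name] = number
    return values, errors
```

A prediction would only be requested when the returned error list is empty; otherwise the messages are shown and the user re-enters the data.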
2.2 Architecture Diagram
The architecture diagram outlines how the two machine learning models, a random forest and a CNN,
are built and trained. First, a medical dataset (heart disease, diabetes, etc.) is chosen and
preprocessed, and then it is split into training and testing sets. The training data is used to
build the two models. Finally, both models are evaluated and can be used to classify new data
points, such as diagnosing patients with a chosen disease.
Data collection [6] is the process of collecting the relevant data from various sources. The tabular
datasets of diabetes and heart disease are taken from the Siddharthan disease dataset on GitHub.
The malarial cell image dataset and chest CT Scan datasets are taken from Kaggle.
The heart disease dataset has 304 samples and 13 features, including age, sex, blood pressure,
cholesterol, resting, thalach, exang, oldpeak, slope, ca, and thal. The diabetes dataset has 769
samples and eight features, including age, body mass index, blood pressure, skin thickness,
insulin level, blood glucose level, and number of pregnancies. The malaria cell image dataset has
27,560 images, split equally between two classes. The chest CT scan dataset has 315 images labeled
in four classes, namely normal, adenocarcinoma, squamous cell carcinoma, and large cell carcinoma.
Data preprocessing is the process of enhancing the quality of data. Data collected from different
platforms is in raw form, which is not suitable for production use. Cleaning and preprocessing [7]
the data improves its quality and, in turn, model performance. For the tabular datasets,
categorical variables are encoded and numerical variables are normalized. For the image datasets,
each image is resized to 100x100 pixels. The datasets are split into 80% for training and 20% for
testing.
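As a rough illustration of the tabular preprocessing steps, the following pure-Python sketch rescales a numerical column and performs an 80/20 split. The helper names `min_max_scale` and `split_train_test`, the min-max scaling choice, and the fixed shuffle seed are assumptions made for this example; the project may rely on library routines instead.

```python
import random
from typing import List, Sequence, Tuple


def min_max_scale(column: Sequence[float]) -> List[float]:
    """Rescale one numerical column to the range [0, 1]."""
    low, high = min(column), max(column)
    if high == low:  # constant column: avoid division by zero
        return [0.0 for _ in column]
    return [(x - low) / (high - low) for x in column]


def split_train_test(rows: Sequence, test_fraction: float = 0.2,
                     seed: int = 0) -> Tuple[list, list]:
    """Shuffle the rows and split them into training and testing parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```

In practice the scaling bounds are usually computed on the training part only and then reused for the test part, so that no information from the test set leaks into training.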
The present section defines the algorithms used in the proposed system.
Random Forest [8] is an ensemble learning method that combines multiple decision trees to make
predictions. In the random forest model, a number of decision trees are employed to predict the
disease for a given input.
• Bootstrap sampling: samples are selected randomly from the original dataset with replacement.
The model used 500 random samples.
• Base model: the model used in Random Forest [9] is the decision tree. Each decision tree is
trained on the tabular dataset and makes predictions on input data points.
• In-bag: the in-bag data refers to the training dataset. 80% of the dataset is used for training
the model.
• Out-of-bag (OOB): the part of the dataset that is not present in a given bootstrap sample. These
data samples are used as validation sets. 39% of the samples are out of the bag.
• n_estimators: the number of decision trees. 500 decision trees are used.
• max_depth: the maximum depth of each tree. In the model, the maximum depth is 3.
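The relationship between bootstrap sampling and out-of-bag samples can be illustrated with a short sketch. The function `bootstrap_sample` and the row count of 243 (roughly 80% of 304 rows) are illustrative assumptions and do not reproduce the project's code. For large datasets, the expected out-of-bag share of a single bootstrap sample is about 1/e, or roughly 37%, and individual draws vary around that value.

```python
import random
from typing import List, Tuple


def bootstrap_sample(n_rows: int, rng: random.Random) -> Tuple[List[int], List[int]]:
    """Draw n_rows indices with replacement; return (in-bag, out-of-bag) indices."""
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n_rows) if i not in chosen]
    return in_bag, out_of_bag


rng = random.Random(0)
in_bag, out_of_bag = bootstrap_sample(243, rng)  # 243 is roughly 80% of 304 rows
oob_fraction = len(out_of_bag) / 243             # share of rows this tree never sees
```

Each tree in the forest would draw its own bootstrap sample, so the out-of-bag rows differ from tree to tree.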
2.5.2 Convolutional Neural Network (CNN)
Convolutional Neural Network [10] is implemented for the classification of images in two classes.
The layers of the model are arranged sequentially. The proposed model consists of four
convolutional layers, each followed by a max-pooling [11] layer, and the stack is followed by
flatten, fully connected, and output layers. The loss function evaluates how well the algorithm
models the dataset. The loss function used in the proposed model is categorical cross-entropy,
which is used for multiclass classification. The optimizer minimizes the loss and improves the
efficiency of the model; in the proposed model, Adam is used. The activation function used in the
proposed model is ReLU (Rectified Linear Unit) [12], a type of activation function used in neural
networks and deep learning models that allows them to converge faster. The layers of the model are
summarized in Table 1.
• Input layer: the input shape is (100, 100, 3), so the model takes images of size 100 x 100
pixels with 3 color channels (RGB).
• Convolutional layers: these layers apply convolution operations to the input data to extract
features and spatial hierarchies. The first layer has 32 filters, the second 64 filters, the third
128 filters, and the fourth 256 filters.
• Flatten layer: flattens the multi-dimensional output of the convolutional layers into a 1D vector.
• Fully connected layers: two fully connected layers with 256 and 128 units receive the flattened
output. They help to learn the most important features from the extracted spatial features.
• Output layer: uses a sigmoid activation function, which classifies the output into two classes,
0 (uninfected) and 1 (infected), based on the calculated probability.
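To make the layer stack above more concrete, the following sketch traces how the feature-map shape changes from the input to the flatten layer. The filter counts come from the text; the 3x3 kernel size, the absence of padding, and the 2x2 max pooling after every convolutional layer are assumptions, since the text does not state them, and `trace_shapes` is a hypothetical helper.

```python
from typing import List, Tuple

# Assumed hyperparameters (not stated in the text): 3x3 kernels without
# padding and 2x2 max pooling after every convolutional layer.
KERNEL = 3
POOL = 2
FILTERS = [32, 64, 128, 256]  # filter counts given in the text


def trace_shapes(height: int, width: int, channels: int) -> List[Tuple[int, int, int]]:
    """Return the feature-map shape after each conv + max-pool stage."""
    shapes = [(height, width, channels)]
    for n_filters in FILTERS:
        # Unpadded ("valid") convolution shrinks each side by KERNEL - 1.
        height, width = height - KERNEL + 1, width - KERNEL + 1
        # Max pooling divides each side by POOL, rounding down.
        height, width = height // POOL, width // POOL
        shapes.append((height, width, n_filters))
    return shapes


shapes = trace_shapes(100, 100, 3)
flattened_units = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
```

Under these assumptions the last feature map is 4 x 4 x 256, so the flatten layer passes 4096 values to the first fully connected layer; different kernel or padding choices would change these numbers.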
2.6 Model Training and Evaluation
The dataset of each disease is split into training and testing sets to train the random forest and
convolutional neural network models for the respective diseases and to evaluate their performance
[13]. Performance metrics including F1 score, precision, recall, and accuracy are used to determine
the predictive capacity of each model, and cross-validation is used to assess model performance.
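Cross-validation can be sketched as follows. The number of folds (k = 5 by default) and the helper `k_fold_indices` are assumptions for illustration, since the text does not specify the cross-validation scheme; shuffling is omitted to keep the example short.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n_rows: int, k: int = 5) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, validation_indices) for each of k folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_rows))
        yield train, validation
        start += size
```

A model would be trained once per fold on the training indices and scored on the validation indices, and the fold scores averaged to estimate performance.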
3 Implementation
This section describes the implementation of the proposed multiple-disease prediction system.
Random forest and convolutional neural network models are developed to build the system. Tabular
datasets are used to train the random forest classifier, and image datasets are used to train the
convolutional neural networks.
3.1 Data Collection
The datasets are collected from Kaggle and GitHub. The image datasets consist of 27,875 images and
the tabular datasets consist of 1,073 samples.
3.2 Data Preprocessing
Data preprocessing is performed on the collected datasets to enhance their quality. For the tabular
datasets, categorical variables are encoded and numerical variables are normalized to a fixed
range between 0 and 1. For the image datasets, each image is resized to 100x100 pixels.
3.3 Random Forest Classifier
The first step of the implementation is to load the datasets and perform data analysis to
understand the structure of the data. Then, data processing and analysis [14] are carried out. The
Seaborn [15] and Matplotlib [16] libraries are used for visualisation and analysis of the data.
After exploring the dataset, the next step is feature engineering. Feature engineering [17] is the
process of developing new features and their relationships from the available features. After
that, the dataset is divided into training and testing data: 80% of the dataset is used for
training and 20% for testing. The algorithm of the random forest model is described below.
Algorithm 1 Random Forest Classifier Model
Input: Training dataset, testing dataset, true output values
Output: Predicted values
function RANDOMFORESTCLASSIFIER(X_test)
    Divide the dataset: 80% for training and 20% for testing.
    Set parameters:
        n_estimators ← 500
        max_features ← N
        max_depth ← 3
    Initialize an empty list T to store decision trees.
    for i = 1 to n_estimators do
        Randomly select max_features features from all features in the dataset.
        Train a decision tree with max_depth = 3 using the selected features.
        Add the trained decision tree to T.
    end for
    Initialize an empty list P to store the prediction values.
    for each decision tree t in T do
        Make predictions on X_test using t.
        Add the predictions to P.
    end for
    Take the majority vote over P as the predicted value for each test sample.
    Compute the accuracy as the number of correct predictions divided by
        the total number of predictions.
    return the predicted values and the accuracy
end function
Explanation: In the random forest model, the input training data is randomly sampled and passed
to each decision tree. Each decision tree is trained on its portion of the training dataset. After
training, a new input value is given to the model, and each tree classifies the input point into a
category such as infected or uninfected. The majority vote of all the decision trees for a given
point is taken as the predicted value of the model.
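The voting and accuracy steps of the explanation above can be sketched in a few lines of Python. The helpers `majority_vote` and `accuracy` are illustrative names, and the per-tree predictions are assumed to be already available as lists of labels (0 for uninfected, 1 for infected).

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(tree_predictions: Sequence[Sequence[int]]) -> List[int]:
    """Combine per-tree predictions (one row per tree) into one label per sample."""
    n_samples = len(tree_predictions[0])
    final = []
    for j in range(n_samples):
        votes = Counter(tree[j] for tree in tree_predictions)
        # most_common(1) returns the label with the most votes; on a tie,
        # the label that was seen first wins.
        final.append(votes.most_common(1)[0][0])
    return final


def accuracy(predicted: Sequence[int], actual: Sequence[int]) -> float:
    """Fraction of predictions that match the true labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)
```

With an even number of trees, such as 500, ties are possible, so a deterministic tie-breaking rule like the one noted in the comment is needed.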
3.4 Convolutional Neural Network Model
The model uses Conv2D [18] layers to classify images accurately. The CNN uses a series of
convolutional layers to capture the spatial features in an image and classify it. The deeper
layers of the neural network identify more abstract features, while the lower layers identify
simple structures like edges and curves. The max-pooling layer downsamples the spatial features
and reduces the image dimensions. The feature maps are flattened into a 1D vector, and additional
fully connected layers are introduced for classification [19]. The algorithm of the convolutional
neural network is described below.
Explanation: The convolutional neural network model is trained on labeled image datasets. Image
preprocessing handles the labeling and resizing of the images. The model is trained with specified
parameters: the Adam optimizer and the categorical cross-entropy loss function. The model is
trained on the training set and tested on the test set.
4 Observations
The proposed model employs two algorithms: random forest and convolutional neural network. It
is trained in VS Code on a machine with an AMD Ryzen 3 processor. It takes around 95 minutes to
train all models. After training, the accuracy of the convolutional neural network model reaches
95% and that of the random forest model reaches 88%. The random forest model takes less than a
minute to train on the heart disease and diabetes datasets. The convolutional neural network model
takes 70 minutes to train on the malaria image dataset and approximately half an hour to train on
the lung cancer image dataset.
4.1 Confusion Metrics
The performance of each model is evaluated using a confusion matrix [20]. The confusion matrix
gives the number of correctly classified and incorrectly classified instances.
Performance Metrics: The performance metrics of the developed model are obtained from the
confusion matrix by estimating the accuracy, precision, recall, and F1 score of each class
(Table 3). For heart disease detection, the values of accuracy, precision, recall, and F1 score
are 88%, 83%, 85%, and 87%, respectively. For diabetes detection, the values of accuracy,
precision, recall, and F1 score are 78%, 79%, 70%, and 77%, respectively. For lung cancer
detection, the values of accuracy, precision, recall, and F1 score are 97%, 91%, 97%, and 94%,
and for malaria detection they are 95%, 95%, 97%, and 92%, respectively.
Figure 4 shows the confusion matrix. The results of the confusion matrix are given in Table 2.

Metric                 Value
True Positive (TP)     29
False Positive (FP)    5
True Negative (TN)     24
False Negative (FN)    3

True Negative (TN): 24 instances are correctly classified as negative (labeled 0).
True Positive (TP): 29 instances are correctly classified as positive (labeled 1).
False Positive (FP): 5 negative instances are incorrectly classified as positive.
False Negative (FN): 3 positive instances are incorrectly classified as negative.
Figure 5 shows the confusion matrix of the convolutional neural network model on the image
dataset. The images are labeled in four classes: normal, adenocarcinoma, large cell carcinoma,
and squamous cell carcinoma. The three classes other than normal are considered cancerous. The
true positive (TP) value is 26 for the normal class and 47 for the cancerous class. Based on the
confusion matrix, the accuracy of the model is 95%, the F1 score is 92%, the recall is 97%, and
the precision is 95% (Table 4).
\[ \text{Accuracy} = \frac{TP + TN}{TN + FN + FP + TP} \]
\[ \text{Precision} = \frac{TP}{FP + TP} \]
\[ \text{Recall} = \frac{TP}{TP + FN} \]
\[ \text{F1 score} = \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}} \]
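The four formulas translate directly into code. The sketch below applies them to the confusion-matrix counts listed earlier in this section (TP = 29, FP = 5, TN = 24, FN = 3); the function name `classification_metrics` is illustrative and not part of the project's code.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and F1 score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Counts taken from the confusion-matrix table earlier in this section.
metrics = classification_metrics(tp=29, fp=5, tn=24, fn=3)
```

The function returns fractions between 0 and 1; multiplying by 100 gives percentages.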
4.2 Training and validation loss
The training loss is an important metric for determining how well a deep learning model fits the
training data. The loss plot in Figure 6 shows that the model converges to a good solution. The
model is trained for 100 epochs, and the trend of the training and validation loss over the epochs
is displayed. The loss of the neural network model decreases as training progresses.
4.3 Training and validation accuracy
Training and validation accuracy describe the performance of the model. The increase in training
and validation accuracy over the epochs shows that the performance of the model is improving. The
model achieves its highest accuracy at 100 epochs. The training and validation accuracy graph is
shown in Figure 7.
Figure 7: Training and validation accuracy of proposed model
5 Conclusion
Existing systems have problems related to accuracy and speed. The project addresses the
shortcomings of conventional systems using a random forest and a convolutional neural network
built on Conv2D layers. The proposed system uses Conv2D layers to detect the relevant features and
identify critical patterns that indicate disease. The proposed model also uses a random forest
model to predict disease from tabular data. The model can distinguish between infected and
uninfected patients. The proposed system achieves an accuracy of 95% and an F1 score of 92% on
image datasets, and an accuracy of 88% and an F1 score of 87% on tabular datasets.
6 Future Work
• Integrating the multiple disease prediction system with electronic health records (EHR) can
provide a more extensive and personalized healthcare experience. Using patient data from EHR
systems can also increase the accuracy of prediction.
References
[2] B. . K. R. Garg, A. Sharma, “Heart disease prediction using machine learning techniques,”
https://iopscience.iop.org/article/10.1088/1757-899X/1022/1/012046, p. 012046, Jan. 2021.
[3] T. I. S. . K. R. Tasin, I. Nabil, “Diabetes prediction using machine learning and explainable AI
techniques,” https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10107388/, 2022.
[6] H. Taherdoost, “Data collection methods and tools for research; a step-by-step guide
to choose data collection technique for academic and business research projects,”
https://www.researchgate.net/publication/359596426_Data_Collection_Methods_and_Tools_for_Research_A_Step-by-Step_Guide_to_Choose_Data_Collection_Technique_for_Academic_and_Business_Research_Projects, Aug. 2021.
[7] C. Zelaya, G. & Vladimiro, “Towards explaining the effects of data preprocessing on machine
learning,” https://ieeexplore.ieee.org/document/8731532, 2019.
[8] B. N. l. C. S. Krishnan, V. Lavanya, “Random forest algorithm for the prediction of diabetes,”
https://www.researchgate.net/publication/336781365_Random_Forest_Algorithm_for_the_Prediction_of_Diabetes, 2019.
[10] K. . J. R. Chauhan, R. Ghanshala, “Convolutional neural network (CNN) for image detection
and recognition,”
https://www.researchgate.net/publication/332826568_Convolutional_Neural_Network_CNN_for_Image_Detection_and_Recognition, 2018.
[12] A. Agarap, “Deep learning using rectified linear units (ReLU),”
https://www.researchgate.net/publication/323956667_Deep_Learning_using_Rectified_Linear_Units_ReLU, Mar. 2018.
[13] J. Jordan, “Evaluating a machine learning model,”
https://www.jeremyjordan.me/evaluating-a-machine-learning-model/, 2017.
[16] P. Barrett, J. Hunter, J. Miller, J.-C. Hsu, and P. Greenfield, “matplotlib – a portable
python plotting package,”
https://www.researchgate.net/publication/234238535_matplotlib_--_A_Portable_Python_Plotting_Package, Dec. 2005.
[17] T. . K. Rawat, “Feature engineering (FE) tools and techniques for better classification
performance,”
https://www.researchgate.net/publication/333015077_Feature_Engineering_FE_Tools_and_Techniques_for_Better_Classification_Performance, 2019.
[18] S. . B. K. Agarwal, M. Gupta, “A new conv2d model with modified relu activation function for
identification of disease type and severity in cucumber plant,”
https://www.researchgate.net/publication/347279241_A_new_Conv2D_model_with_modified_ReLU_activation_function_for_identification_of_disease_type_and_severity_in_cucumber_plant, 2020.
[19] S. . W. Y. Lv, Q. Zhang, “Deep learning model of image classification using machine learning,”
https://downloads.hindawi.com/journals/am/2022/3351256.pdf, 2022.
[20] R. Kundu, “Confusion matrix: How to use it interpret results [examples],”
https://www.v7labs.com/blog/confusion-matrix-guide, 2022.