Machine Learning for Disease Prediction

Multiple Disease Prediction System Using Machine Learning

A
Synopsis Report
Submitted in Partial Fulfillment of the Requirements
for the Degree
of
Bachelor of Technology
in
Computer Science and Engineering

by
Sundaram Shukla
Ali Akbar Ansari
Umesh Kumar
Shashi Prakash
{sundaram2020cse, ali2020cse, umesh2021lcse, shashi2020cse}@iert.ac.in

Under the Guidance / Supervision of

Dr. Vimal Mishra & Dr. Rohit

Department of Computer Science and Engineering

Institute of Engineering and Rural Technology
Dr. A.P.J. Abdul Kalam Technical University, Lucknow
Preface
This project report is submitted in partial fulfillment of the requirements for the award of the
degree of “Bachelor of Technology (Computer Science and Engineering).” Completing the final year
project, “Multiple Disease Prediction Using Machine Learning,” has been an immensely rewarding
and educational journey. This project resulted from extensive research, meticulous experimentation,
and relentless dedication. I am profoundly grateful to my project supervisors, Dr. Vimal
Mishra and Dr. Rohit, for their unwavering support, insightful guidance, and continuous
encouragement. Their expertise has been invaluable in navigating the complexities of machine learning in
healthcare. I would also like to express my appreciation to the Institute of Engineering and Rural
Technology and the Department of Computer Science and Engineering for providing the academic
environment and resources essential for this project. Access to advanced computational resources
and an excellent library has been crucial in the research and development phases. Special thanks
are due to the organizations and repositories that provided the medical datasets used in this project,
including the UCI Machine Learning Repository, Github, Kaggle, and other public databases. Their
contributions to open science and data sharing are greatly appreciated. I am deeply thankful for the
support and collaboration of my peers and colleagues. Their constructive criticism, collaborative
discussions, and moral support have been a source of motivation throughout this journey. Finally,
my heartfelt gratitude goes to my family and friends for their constant encouragement and
understanding during the course of this project. Their unwavering support has been a source of strength,
especially during challenging times. Completing this project marks a significant milestone in my
academic journey, and I am confident that the knowledge and skills acquired will serve as a solid
foundation for my future endeavors.

Acknowledgement
We would like to extend our heartfelt gratitude to everyone who made the completion of this report
possible. We are especially thankful to our supervisors, Dr. Vimal Mishra and Dr. Rohit, for their
invaluable help, insightful suggestions, and unwavering encouragement throughout the fabrication
process and the writing of this report. Their assistance was pivotal in every step of our journey. We
are also deeply appreciative of the time they dedicated to proofreading and correcting our numerous
errors.
Our sincere thanks also go to our batch mates and the staff of the Computer Science and
Engineering department for their crucial support. Their permission to use the lab equipment and
all the tools in the laboratory was instrumental to our work.

Declaration
We, the undersigned, hereby declare that the project titled “Multiple Disease Prediction System
Using Machine Learning” reflects our original research efforts conducted under the guidance of
Dr. Rohit and Dr. Vimal Mishra. This project embodies the application of machine learning
methodologies to tackle the complexities of disease prediction. The content presented in this
project is original and has not previously been submitted for any academic credential. We have
adhered to the ethical standards outlined by our institution throughout the project. The synopsis
and conclusions are based on our research outcomes, ensuring originality and compliance with
university guidelines. Proper attribution has been given to the external sources used, both in
the text and in the references, wherever applicable.

Sundaram Shukla
(2001100100061)

Ali Akbar Ansari

(2001100100010)

Umesh Kumar
(2101100109006)

Shashi Prakash
(2001100100055)

Abstract
A disease is any disturbance in the structure or function of an organ of the body. Diseases
strongly affect human health and become serious if they are not detected early. Because predicting
diseases reliably is difficult for doctors, machine learning algorithms have been adopted to
estimate the likelihood of disease. A great deal of work has been done on disease prediction with
machine learning. However, the existing models still have issues with accuracy and speed.
The proposed system employs machine learning and deep learning algorithms to predict multiple
diseases: heart disease, diabetes, malaria, and lung cancer. It comprises a random forest and a
convolutional neural network (CNN). The CNN is trained on image data (cell images for malaria and
chest CT scans for lung cancer) and extracts the features from the images. The random forest is
trained on tabular datasets. The proposed model achieves an accuracy of 88% for heart disease and
78% for diabetes. It achieves an accuracy of 95% for both malaria and lung cancer.

Contents
1 Introduction
    1.1 Literature Review

2 Proposed System Architecture
    2.1 Use Case Diagram
    2.2 Architecture Diagram
    2.3 Data Collection
    2.4 Data Preprocessing
    2.5 Algorithms used in System
        2.5.1 Random Forest
        2.5.2 Convolutional Neural Network (CNN)
    2.6 Model Training and Evaluation

3 Implementation
    3.1 Data Collection
    3.2 Data Preprocessing
    3.3 Random Forest Classifier
    3.4 Convolutional Neural Network Model

4 Observations
    4.1 Confusion Metrics
    4.2 Training and validation loss
    4.3 Training and validation accuracy

5 Conclusion

6 Future Work
1 Introduction
Diseases are a common occurrence in human beings, and nearly everyone experiences some form
of illness at some point. Some of the most prevalent diseases include heart disease, diabetes,
lung cancer, and malaria, which collectively affect a significant portion of the global population.
According to the World Health Organization [1], heart disease claims 17.9 million lives annually,
while lung cancer causes approximately 1.8 million deaths. Malaria results in around 0.6 million
deaths, and diabetes accounts for 1.5 million deaths each year. These staggering statistics highlight
the global impact of these diseases, often exacerbated by the lack of accurate early detection.
The early and precise detection of diseases can significantly reduce their adverse effects, yet
sometimes, even skilled doctors and healthcare professionals fail to diagnose them accurately. Over
the years, numerous efforts have been made to tackle this issue, and one promising solution is the
development of a disease prediction system that leverages machine learning and deep learning
algorithms. These advanced technologies can analyze unique features from patients’ records, including
lab test data and medical images like CT scans, to predict the presence of diseases. By accurately
identifying whether a patient is infected or uninfected, such systems can assist healthcare providers
in making timely and informed decisions, thereby improving patient outcomes. This project
focuses on creating a comprehensive multiple-disease prediction system, utilizing tabular and image
data, to enhance diagnostic accuracy for diseases like heart disease, diabetes, malaria, and lung
cancer.
This initiative not only represents a significant technological advancement but also aligns with
the broader goal of integrating machine learning into healthcare to improve diagnostics and
treatment strategies. The experience gained from working on this project, including handling large
datasets and applying various machine learning algorithms, will be invaluable. The knowledge and
skills developed will be instrumental in my future endeavors as a Python developer, contributing to
the field of healthcare technology and beyond.

1.1 Literature Review

Apurv et al. [2] state that machine learning has been used to determine whether an individual is
affected by heart disease. Cardiovascular disease can be predicted by evaluating features such as
the occurrence of chest pain, the person's age, and the cholesterol level, amongst others. KNN
achieved a prediction accuracy of 86.88%, and random forest achieved 81.96%.

Isfafuzzaman et al. [3] collected a dataset of 203 individuals. They used the XGBoost classifier
to classify whether a patient suffered from diabetes. The model achieved an accuracy of 81%.

In a study carried out in 2015, Kanika et al. [4] worked on feature extraction and classification
for malaria, to detect whether a patient is affected by malaria. The model achieved an accuracy
of 87.8%.

Muntasir et al. [5] used a CNN model to detect whether a patient is affected by lung cancer.
They collected CT scan images of patients from the Kaggle online platform. They also implemented
other models, namely Inception V3, Xception, and ResNet-50, to compare the performance. The model
achieved an accuracy of 92%.

2 Proposed System Architecture

The project proposes a model that employs two algorithms: Random Forest and Convolutional Neural
Network (CNN). The random forest is trained on tabular data to predict heart disease and diabetes.
The CNN is trained on the cell image dataset to predict malaria and on the chest CT scan image
dataset to predict lung cancer. Figure 3 shows the architecture diagram of the proposed model.

2.1 Use Case Diagram

Figure 1: Level 0 Use case diagram

A Level 0 Use Case Diagram for the Multiple Disease Prediction System illustrates a basic
interaction: a user inputs a patient’s health records into the system, which then processes this data
using machine learning algorithms. The system outputs a classification, indicating whether the
patient is infected or uninfected, providing a quick and accurate health assessment.

Figure 2: Level 3 Use case diagram

The use case diagram illustrates the interaction between a user and a multiple disease prediction
system built with Streamlit. The user begins by running the Streamlit app and configuring the
system parameters. They then navigate to the disease dashboard, select the disease to be predicted,
and enter the required input data, which may include numerical values and image files. The system
incorporates robust error handling mechanisms: it displays errors for invalid string inputs, out-of-
range values, missing inputs, and invalid image files. Users must correct these errors by re-entering
the correct data or uploading valid image files. Once valid inputs are provided, the system predicts
the disease using the appropriate machine learning model, logs the results for future reference, and
displays the prediction results to the user, completing the interaction cycle. This comprehensive
flow ensures a reliable and accurate disease prediction process.
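
The error handling described above can be sketched in plain Python, independent of Streamlit. This is a minimal illustration rather than the project's actual code: the field names, the value ranges in FIELD_RANGES, and the accepted image extensions are all assumptions made for the example.

```python
from typing import Optional

# Hypothetical fields and ranges, chosen only for illustration.
FIELD_RANGES = {"age": (0, 120), "cholesterol": (50, 700)}

def validate_field(name: str, raw: Optional[str]) -> float:
    """Return the parsed value, or raise ValueError with a user-facing message."""
    if raw is None or raw.strip() == "":
        raise ValueError(f"{name}: value is missing")
    try:
        value = float(raw)
    except ValueError:
        raise ValueError(f"{name}: '{raw}' is not a number") from None
    low, high = FIELD_RANGES[name]
    if not low <= value <= high:
        raise ValueError(f"{name}: {value} is outside the range {low}-{high}")
    return value

def is_valid_image_name(filename: str) -> bool:
    """Accept only file names with an (assumed) supported image extension."""
    return filename.lower().endswith((".png", ".jpg", ".jpeg"))

# Each invalid input produces one error message instead of a prediction.
for raw in ("45", "", "abc", "500"):
    try:
        print(validate_field("age", raw))
    except ValueError as error:
        print("error:", error)
```

In a Streamlit app, the message of each ValueError would typically be shown next to the offending input so that the user can correct it and resubmit.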

2.2 Architecture Diagram

Figure 3: Architecture diagram of proposed model

The architecture diagram outlines how the two machine learning models, a random forest and a CNN,
are built and trained. A medical dataset (heart disease, diabetes, etc.) is chosen, preprocessed,
and then split into training and testing sets. The training data is used to build the models.
Finally, the models are evaluated and can be used to classify new data points, such as diagnosing
a patient for the chosen disease.

2.3 Data Collection

Data collection [6] is the process of collecting the relevant data from various sources. The tabular
datasets of diabetes and heart disease are taken from the Siddharthan disease dataset on GitHub.
The malarial cell image dataset and chest CT Scan datasets are taken from Kaggle.

The heart disease dataset has 304 samples and 13 features, including age, sex, blood pressure,
cholesterol, resting ECG, thalach, exang, oldpeak, slope, ca, and thal. The diabetes dataset has
769 samples and eight features, including age, body mass index, blood pressure, skin thickness,
insulin level, blood glucose level, and number of pregnancies. The malaria cell image dataset has
27560 images, split equally between two classes. The chest CT scan dataset has 315 images labeled
in four classes, namely normal, adenocarcinoma, squamous cell carcinoma, and large cell carcinoma.

2.4 Data Preprocessing

Data preprocessing is the process of enhancing the quality of data. Data collected from different
platforms is in raw form, which is not suitable for production use. Cleaning and preprocessing [7]
the data enhances its quality and helps to improve model performance. For the tabular datasets,
categorical variables are encoded and numerical variables are normalized. For the image datasets,
each image is resized to 100x100 pixels. The datasets are split into 80% for training and 20% for
testing.
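
As a rough sketch of two of these steps, the snippet below encodes a categorical column as integer codes and performs an 80/20 split using only the standard library. The function names, the fixed seed, and the sample category values are illustrative assumptions; the report does not say which tools were used for these steps.

```python
import random

def split_train_test(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows, then return (train_rows, test_rows)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def encode_categories(values):
    """Replace each category by an integer code; returns (codes, mapping)."""
    mapping = {category: code for code, category in enumerate(sorted(set(values)))}
    return [mapping[value] for value in values], mapping

# 304 placeholder row ids stand in for the heart disease samples.
train_rows, test_rows = split_train_test(range(304))
print(len(train_rows), len(test_rows))  # 243 61

codes, mapping = encode_categories(["male", "female", "male"])
print(codes, mapping)  # [1, 0, 1] {'female': 0, 'male': 1}
```

With 304 samples, a 20% test share gives 61 test rows, which happens to equal the number of instances counted in Table 2 (29 + 5 + 24 + 3).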

2.5 Algorithms used in System

The present section defines the algorithms used in the proposed system.

2.5.1 Random Forest

Random Forest [8] is an ensemble learning method that combines multiple decision trees to make
predictions. In the random forest model, a number of decision trees are employed to predict the
disease for a given input.

• Bootstrap sampling: samples are selected at random from the original dataset with replacement.
The model used 500 bootstrap samples.

• Base model: the model used in Random Forest [9] is the decision tree. Each decision tree is
trained on the tabular dataset and makes predictions on input data points.

• In-bag: the samples a tree is trained on. They are drawn from the training set, which is 80%
of the dataset.

• Out-of-bag (OOB): the samples that are not drawn into a tree's bootstrap sample. These samples
are used as validation sets. 39% of the samples are out of the bag.

• n_estimators: the number of decision trees. 500 decision trees are used.

• max_depth: the maximum depth of each tree. In the model, max_depth is 3.
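
Bootstrap sampling and the out-of-bag share can be illustrated with a short simulation. This is only a sketch under stated assumptions: it takes a training set of 243 rows (80% of the 304 heart disease samples) and draws 500 bootstrap samples of that same size. For a bootstrap sample as large as the training set, the expected out-of-bag share is close to 1/e, about 36.8%; the 39% quoted above is the report's own figure and is not reproduced here.

```python
import random

def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

def out_of_bag_fraction(n_samples, rng):
    """Share of rows that never appear in one bootstrap sample."""
    in_bag = set(bootstrap_indices(n_samples, rng))
    return 1 - len(in_bag) / n_samples

rng = random.Random(0)
n_train = 243  # assumed: 80% of the 304 heart disease samples
fractions = [out_of_bag_fraction(n_train, rng) for _ in range(500)]
mean_oob = sum(fractions) / len(fractions)
print(f"mean out-of-bag fraction over 500 bootstrap samples: {mean_oob:.3f}")
```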

2.5.2 Convolutional Neural Network (CNN)

A Convolutional Neural Network [10] is implemented to classify images into two classes. The layers
of the model are arranged in sequence. The proposed model consists of four convolutional layers,
each followed by a max-pooling [11] layer; the flatten, dense, and output layers come after them.
The loss function evaluates how well the model fits the dataset. The loss function used in the
proposed model is categorical cross-entropy, which is commonly used for multiclass classification.
The optimizer minimizes the loss during training; the proposed model uses Adam. The activation
function used in the proposed model is ReLU (Rectified Linear Unit) [12], a type of activation
function used in neural networks and deep learning models. ReLU allows neural network models to
converge faster. The layers of the model are shown in Table 1.

Table 1: Input and Output of CNN layers in Proposed system

Layer             Output Shape             Parameters
Input 2           (None, 100, 100, 3)      0
Conv2d 1          (None, 98, 98, 32)       896
Maxpooling2d 1    (None, 49, 49, 32)       0
Conv2d 2          (None, 47, 47, 64)       18496
Maxpooling2d 2    (None, 23, 23, 64)       0
Conv2d 3          (None, 21, 21, 128)      73856
Maxpooling2d 3    (None, 10, 10, 128)      0
Conv2d 4          (None, 8, 8, 256)        295168
Maxpooling2d 4    (None, 4, 4, 256)        0
Flatten           (None, 12800)            0
Dense             (None, 256)              3277056

• Input layer: the input shape is (100, 100, 3), so the model takes images of 100 x 100 pixels
with 3 color channels (RGB).

• Convolutional layers: these layers apply convolution operations to the input data to extract
features and spatial hierarchies. The first layer has 32 filters, the second 64 filters, the
third 128 filters, and the fourth 256 filters.

• Max-pooling layers: every convolutional layer is followed by a max-pooling layer to reduce the
input dimensions while keeping crucial information.

• Flatten layer: flattens the multi-dimensional output of the last pooling layer into a 1D vector.

• Fully connected layers: two fully connected layers with 256 and 128 units receive the flattened
output. They help to learn the most important features from the extracted spatial features.

• Output layer: uses a sigmoid activation function. It classifies the input into two classes,
0 (uninfected) and 1 (infected), based on the calculated probability.
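
The shapes and parameter counts in Table 1 can be checked with a few lines of arithmetic. The sketch below assumes 3x3 kernels, stride 1, no padding, and 2x2 pooling; none of these is stated in the report, but they are the settings that reproduce the convolutional rows of the table (for example 100 - 3 + 1 = 98).

```python
def conv2d_output(height, width, kernel=3):
    """Spatial size after an unpadded, stride-1 convolution."""
    return height - kernel + 1, width - kernel + 1

def conv2d_params(in_channels, filters, kernel=3):
    """Kernel weights plus one bias per filter."""
    return (kernel * kernel * in_channels + 1) * filters

def max_pool_output(height, width, pool=2):
    """Spatial size after non-overlapping pooling (floor division)."""
    return height // pool, width // pool

def dense_params(inputs, units):
    """One weight per input plus one bias, for every unit."""
    return (inputs + 1) * units

height, width, channels = 100, 100, 3
summary = []
for filters in (32, 64, 128, 256):
    params = conv2d_params(channels, filters)
    height, width = conv2d_output(height, width)
    summary.append(("conv2d", (height, width, filters), params))
    height, width = max_pool_output(height, width)
    summary.append(("max_pooling2d", (height, width, filters), 0))
    channels = filters

for name, shape, params in summary:
    print(name, shape, params)

print("flatten after four blocks:", height * width * channels)  # 4096
print("dense(256) on 12800 inputs:", dense_params(12800, 256))  # 3277056
```

Under these assumptions, the convolution and pooling rows match Table 1 exactly. The flattened size after the fourth pooling layer comes out as 4 x 4 x 256 = 4096, whereas the table lists 12800, which equals 10 x 10 x 128, the output of the third pooling layer. The count 3,277,056 is what a 256-unit dense layer needs for 12800 inputs, so that figure belongs to the dense layer rather than to the flatten layer, which has no trainable parameters.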

2.6 Model Training and Evaluation

The dataset of each disease is split into training and testing sets. The random forest and
convolutional neural network models are trained for their respective diseases and their performance
is evaluated [13]. Performance metrics including F1 score, precision, recall, and accuracy are used
to determine the predictive capacity of each model, and cross-validation is used to assess the
model's performance.
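
Cross-validation can be sketched as follows. The report does not state how many folds were used, so k = 5 is an assumption, as is the training-set size of 243 rows; the function only produces index splits and would be combined with whichever model is being assessed.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, validation_indices); fold sizes differ by at most one."""
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(243, k=5))
print([len(validation) for _, validation in folds])  # [49, 49, 49, 48, 48]
```

Each row is used for validation exactly once, and the k scores are usually averaged. In practice the rows would be shuffled before the folds are cut.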

3 Implementation

This section describes the implementation of the proposed multiple-disease prediction system.
A random forest and convolutional neural networks were developed to build the system. Tabular
datasets were used to train the random forest classifier, and image datasets were used to train
the convolutional neural networks.

3.1 Data Collection

The datasets are collected from Kaggle and GitHub. The image datasets consist of 27,875 images in
total and the tabular datasets consist of 1073 samples in total.

3.2 Data Preprocessing

Data preprocessing is performed on the collected datasets to enhance their quality. For the tabular
datasets, categorical variables are encoded and numerical variables are normalized to a fixed
range between 0 and 1. For the image datasets, each image is resized to 100x100 pixels.
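
Normalizing a numerical column to the range 0 to 1 is commonly done with min-max scaling, x' = (x - min) / (max - min). The sketch below assumes that this is the method meant here; the sample ages are made-up values used only to show the effect.

```python
def min_max_scale(column):
    """Scale values linearly so that the minimum maps to 0 and the maximum to 1."""
    low, high = min(column), max(column)
    if high == low:
        # A constant column carries no information; map it to 0.
        return [0.0 for _ in column]
    return [(value - low) / (high - low) for value in column]

ages = [29, 41, 53, 77]  # hypothetical values
print(min_max_scale(ages))  # [0.0, 0.25, 0.5, 1.0]
```

To avoid leaking information from the test set, the minimum and maximum are normally computed on the training split only and then reused for the test split.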

3.3 Random Forest Classifier

The first step in implementing the model is to load the datasets and perform data analysis to
understand the structure of the data. Then, data processing and analysis [14] are done. The
Seaborn library [15] and matplotlib [16] are used for visualisation and analysis of the data.
After exploring the dataset, the next step is feature engineering. Feature engineering [17] is
the process of developing new features, and the relationships between them, from the available
features. After that, the dataset is divided into training and test data: 80% of the dataset is
used for training and 20% for testing. The algorithm of the random forest model is described
below.

Algorithm 1 Random Forest Classifier Model
Input: Training dataset, Testing dataset, True output values
Output: Predicted values
function RandomForestClassifier(Xtest)
    Divide the dataset: 20% for testing and 80% for training.
    Set parameters:
        n_estimators ← 500
        max_features ← N
        max_depth ← 3
    Initialize an empty list T to store decision trees.
    for i = 1 to n_estimators do
        Randomly select max_features features from all the features in the dataset.
        Train a decision tree with max_depth = 3 using the selected features.
        Add the trained decision tree to T.
    end for
    Initialize an empty list P to store the prediction values.
    for each decision tree t in T do
        Make predictions on Xtest using t.
        Add the predictions to P.
    end for
    For each sample in Xtest, take the majority vote over the predictions in P
        as the predicted value.
    Compute the accuracy as the number of correct predictions divided by
        the total number of predictions.
    return the predicted values
end function

Explanation: In the random forest model, random samples of the training dataset are passed to the
decision trees, and each decision tree is trained on its sample. After training, a new input is
given to the model. Each tree classifies the input into a category such as infected or uninfected.
The majority vote of all the decision trees for a given input is taken as the predicted value of
the model.
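
The voting step can be sketched on its own, without training any trees. In the snippet below the per-tree predictions and the true labels are made-up values for three hypothetical trees and four test samples; only the aggregation and the accuracy calculation are shown.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """tree_predictions[t][i] is the label that tree t assigns to sample i."""
    n_samples = len(tree_predictions[0])
    voted = []
    for i in range(n_samples):
        votes = Counter(tree[i] for tree in tree_predictions)
        voted.append(votes.most_common(1)[0][0])
    return voted

def accuracy(y_true, y_pred):
    """Number of correct predictions divided by the total number of predictions."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical predictions: 1 = infected, 0 = uninfected.
tree_predictions = [
    [1, 0, 1, 1],
    [1, 0, 0, 1],
    [0, 0, 1, 1],
]
y_pred = majority_vote(tree_predictions)
print(y_pred)                          # [1, 0, 1, 1]
print(accuracy([1, 0, 0, 1], y_pred))  # 0.75
```

With two classes and an odd number of trees, a tie cannot occur. With 500 trees a tie is possible in principle, and an implementation has to define how it is resolved.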

3.4 Convolutional Neural Network Model

The model uses Conv2D [18] layers to classify images. The CNN uses a series of convolutional
layers to capture the spatial features of an image in order to classify it. The deeper layers of
the network identify more abstract features, while the earlier layers identify simple structures
such as edges and curves. The max-pooling layers downsample the feature maps and reduce their
spatial dimensions. The feature maps are then flattened into a 1D vector, and additional fully
connected layers are introduced for classification [19]. The algorithm of the convolutional
neural network is described below.

Algorithm 2 Convolutional Neural Network Model

Input:
- Dataset of cell images for malaria and images of chest CT scans for lung disease prediction
- CNN architecture parameters
- Training parameters
Output:
- Trained CNN model
Data Preprocessing:
for each image in the dataset do
    Resize the image to the input size of the convolutional neural network model
    Assign label (1 for infected, 0 for uninfected)
end for
Initialize CNN Model:
Initialize CNN architecture with the parameters
Add max pooling layers after convolutional layers
Include fully connected layers for classification
Use sigmoid activation function for classification
Split Entire Dataset:
Split dataset into training, validation, and testing sets
Training:
Initialize CNN model parameters
Define loss function used in the model
Select a suitable optimizer
while training not converged do
    Forward and backward pass
    Validate model on validation set
    if validation performance does not improve then
        Early stopping
    end if
end while

Explanation: The convolutional neural network model relies on labeled image datasets for training.
Image preprocessing handles labeling and resizing. The model is trained with the specified
parameters: the Adam optimizer and the categorical cross-entropy loss function. The model is
trained on the training set and tested on the test set.
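
The early stopping rule in Algorithm 2 can be illustrated without a neural network by applying it to a list of per-epoch validation losses. The patience of 3 epochs and the loss values below are assumptions for the example; the report does not give the stopping criterion that was actually used.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stopped_epoch, best_epoch) for a sequence of validation losses.

    Training stops once the loss has failed to improve for `patience`
    consecutive epochs; epochs are counted from 0.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Hypothetical validation losses, one per epoch.
losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.65, 0.50]
print(early_stopping_epoch(losses))  # (5, 2)
```

Here training stops at epoch 5, and the weights from epoch 2 would be the ones to keep. The later value of 0.50 is never reached, which shows the trade-off of a small patience.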

4 Observations

The proposed model employs two algorithms: random forest and convolutional neural network. The
models were trained in VS Code on a machine with an AMD Ryzen 3 processor. It takes around 95
minutes to train all models. After training, the accuracy of the convolutional neural network
model reaches 95% and that of the random forest model reaches 88%. The random forest model takes
less than a minute to train on the heart disease and diabetes datasets. The convolutional neural
network model takes 70 minutes to train on the malaria image dataset and approximately half an
hour to train on the lung cancer image dataset.

4.1 Confusion Metrics

The performance of each model is evaluated using a confusion matrix [20]. The confusion matrix
gives the number of correctly classified and incorrectly classified instances.
Performance metrics: The accuracy, precision, recall, and F1 score of each class are estimated
from the confusion matrix (Table 3). For heart disease detection, the accuracy, precision, recall,
and F1 score are 88%, 83%, 85%, and 87% respectively. For diabetes detection, they are 78%, 79%,
70%, and 77% respectively. For lung cancer detection, they are 97%, 91%, 97%, and 94%, and for
malaria they are 95%, 95%, 97%, and 92% respectively.
Figure 4 shows the confusion matrix. The results of the confusion matrix are listed in Table 2.

Figure 4: Confusion Matrix of proposed model for tabular dataset

True Negative (TN): 24 instances are correctly classified as negative (labeled 0).
True Positive (TP): 29 instances are correctly classified as positive (labeled 1).
False Positive (FP): 5 negative instances are incorrectly classified as positive.
False Negative (FN): 3 positive instances are incorrectly classified as negative.

Metric                  Value
True Positive (TP)      29
False Positive (FP)     5
True Negative (TN)      24
False Negative (FN)     3

Table 2: Confusion Matrix

Figure 5: Confusion Matrix of proposed model for image dataset

Figure 5 shows the confusion matrix of the convolutional neural network model on the image
dataset. The images are labeled in four classes: normal, adenocarcinoma, large cell carcinoma,
and squamous cell carcinoma. The three classes other than normal are considered cancerous. The
true positive (TP) value for the normal class is 26, and for the cancerous class it is 47. Based
on the confusion matrix, the accuracy of the model is 95%, the F1 score is 92%, the recall is 97%,
and the precision is 95% (Table 4).

\[
\text{Accuracy} = \frac{TP + TN}{TN + FN + FP + TP}
\]

\[
\text{Precision} = \frac{TP}{FP + TP}
\]

\[
\text{Recall} = \frac{TP}{TP + FN}
\]

\[
\text{F-1 score} = \frac{2 \cdot \text{Recall} \cdot \text{Precision}}{\text{Recall} + \text{Precision}}
\]
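
As a worked example, the four formulas can be applied to the counts listed in Table 2 (TP = 29, FP = 5, TN = 24, FN = 3). The function name and the dictionary layout are illustrative choices; the arithmetic follows the formulas above.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1 score from confusion matrix counts."""
    accuracy = (tp + tn) / (tn + fn + fp + tp)
    precision = tp / (fp + tp)
    recall = tp / (tp + fn)
    f1 = 2 * recall * precision / (recall + precision)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

metrics = classification_metrics(tp=29, fp=5, tn=24, fn=3)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
# accuracy: 0.869
# precision: 0.853
# recall: 0.906
# f1: 0.879
```

These values are computed from the Table 2 counts alone. They are close to, but not identical with, the tabular percentages quoted in the text, and the report does not state which dataset or which rounding those percentages are based on.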

4.2 Training and validation loss

The training loss is an important metric for judging how well a deep learning model fits the
training data. The loss plot in Figure 6 shows that the model is converging to a good solution.
The model is trained for 100 epochs in total. The plot displays the trend of the training loss
and the validation loss over the epochs. The loss of the neural network model decreases as the
epochs progress.

Figure 6: Training and validation loss of proposed model

4.3 Training and validation accuracy

Training and validation accuracy describe the performance of the model. The increase in training
and validation accuracy over the epochs shows that the performance of the model is improving. The
model achieves its highest accuracy at 100 epochs. The training and validation accuracy graph is
shown in Figure 7.

Figure 7: Training and validation accuracy of proposed model

Table 3: Performance Score based on Tabular Data

Classes             Accuracy    Precision    Recall    F-1 score
Heart Disease       0.88        0.83         0.85      0.87
Diabetes Disease    0.78        0.79         0.70      0.77

Table 4: Performance Score based on Image Data

Classes        Label          Accuracy    Precision    Recall    F1 score
Malaria        Parasitised    0.95        0.95         0.97      0.92
Malaria        Uninfected     0.95        0.95         0.93      0.97
Lung cancer    Cancerous      0.97        0.91         0.97      0.94
Lung cancer    Normal         1.00        1.00         1.00      1.00

5 Conclusion

Existing systems have problems related to accuracy and speed. The project addresses the
shortcomings of conventional systems using a random forest and a convolutional neural network
built from Conv2D layers. The Conv2D layers detect the relevant features in the images and
identify the critical patterns that indicate disease. The proposed model also uses a random forest
to predict disease from tabular data. The model can distinguish between infected and uninfected
patients. The proposed system achieves an accuracy of 95% and an F1 score of 92% for image
datasets, and an accuracy of 88% and an F1 score of 87% for tabular datasets.

6 Future Work

• Integrating the multiple disease prediction system with electronic health records (EHR) can
provide a more extensive and personalized healthcare experience. Using patient data from EHR
systems can also increase the accuracy of prediction.

• In the future, the model can be upgraded to predict more diseases.

References

[1] “The global health observatory,” https://www.who.int/data/gho/publications/

world-health-statistics.

[2] B. . K. R. Garg, A. Sharma, “Heart disease prediction using machine learning techniques,”
https://iopscience.iop.org/article/10.1088/1757-899X/1022/1/012046, p. 012046, jan 2021.

[3] T. I. S. . K. R. Tasin, I. Nabil, “Diabetes prediction using machine learning and explainable ai
techniques.” https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10107388/, 2022.

[4] S. K. M. . S. J. Gautam, K Jangir, “Malaria detection system using con-

volutional neural network algorithm,” https://www.igi-global.com/chapter/
malaria-detection-system-using-convolutional-neural-network-algorithm/257321, 2020.

[5] M. M. M. . A. A. Mamun, M. Mahmud, “Lcdctcnn: Lung cancer diagnosis of ct scan images

using cnn based model,” https://arxiv.org/abs/2304.04814, 2023.

[6] H. Taherdoost, “Data collection methods and tools for research; a step-by-step guide
to choose data collection technique for academic and business research projects,”
https://www.researchgate.net/publication/359596426 Data Collection Methods and Tools
for Research A Step-by-Step Guide to Choose Data Collection Technique for Academic
and Business Research Projects, 08 2021.

[7] C. Zelaya, G. & Vladimiro, “Towards explaining the effects of data preprocessing on machine
learning,” https://ieeexplore.ieee.org/document/8731532, 2019.

[8] B. N. l. C. S. Krishnan, V. Lavanya, “Random forest algorithm for the prediction of dia-
betes,” https://www.researchgate.net/publication/336781365 Random Forest Algorithm for
the Prediction of Diabetes, 2019.

[9] R. A. N. . M. I. Ali, J. Khan, “Random forests and decision trees,” https://www.researchgate.

net/publication/259235118 Random Forests and Decision Trees, 09 2012.

[10] K. . J. R. Chauhan, R. Ghanshala, “Convolutional neural network (cnn) for image detec-
tion and recognition,” https://www.researchgate.net/publication/332826568 Convolutional
Neural Network CNN for Image Detection and Recognition, 2018.

[11] R. Nirthika, S. Manivannan, A. Ramanan, and R. Wang, “Pooling in convolutional neural networks
for medical image analysis: a survey and an empirical study,”
https://link.springer.com/article/10.1007/s00521-022-06953-8#citeas, 2022.

[12] A. F. Agarap, “Deep learning using rectified linear units (ReLU),”
https://www.researchgate.net/publication/323956667_Deep_Learning_using_Rectified_Linear_Units_ReLU, 03 2018.

[13] J. Jordan, “Evaluating a machine learning model,”
https://www.jeremyjordan.me/evaluating-a-machine-learning-model/, 2017.

[14] S. Kotsiantis, D. Kanellopoulos, and P. Pintelas, “Data preprocessing for supervised learning,”
https://www.researchgate.net/publication/228084519_Data_Preprocessing_for_Supervised_Learning, 2006.

[15] M. L. Waskom, “seaborn: statistical data visualization,”
https://joss.theoj.org/papers/10.21105/joss.03021, 2021.

[16] P. Barrett, J. Hunter, J. Miller, J.-C. Hsu, and P. Greenfield, “matplotlib – a portable
Python plotting package,”
https://www.researchgate.net/publication/234238535_matplotlib_--_A_Portable_Python_Plotting_Package, 12 2005.

[17] T. Rawat and V. Khemchandani, “Feature engineering (FE) tools and techniques for better
classification performance,”
https://www.researchgate.net/publication/333015077_Feature_Engineering_FE_Tools_and_Techniques_for_Better_Classification_Performance, 2019.

[18] M. Agarwal, S. Gupta, and K. K. Biswas, “A new Conv2D model with modified ReLU activation
function for identification of disease type and severity in cucumber plant,”
https://www.researchgate.net/publication/347279241_A_new_Conv2D_model_with_modified_ReLU_activation_function_for_identification_of_disease_type_and_severity_in_cucumber_plant, 2020.

[19] Q. Lv, S. Zhang, and Y. Wang, “Deep learning model of image classification using machine learning,”
https://downloads.hindawi.com/journals/am/2022/3351256.pdf, 2022.

[20] R. Kundu, “Confusion matrix: How to use it & interpret results [examples],”
https://www.v7labs.com/blog/confusion-matrix-guide, 2022.
