Vikash_Rai_Project_Report
MACHINE LEARNING
Bachelor of Technology
In
Information Technology
Dr. Tripti Rathee
By
May, 2025
DEPARTMENT OF INFORMATION TECHNOLOGY
BONAFIDE CERTIFICATE
Internal Guide
DATE:
PLACE:
SIGNATURE OF THE STUDENT
ACKNOWLEDGEMENT
I would like to express my sincere and deep gratitude to my Project Guide,
Dr. Tripti Rathee, whose valuable guidance, suggestions and constant
encouragement paved the way for the successful completion of my project work.
I also wish to thank all teaching and non-teaching staff members of the
Department of Information Technology, who helped in many ways towards the
completion of the project.
ABSTRACT
Health care systems collect data and reports from hospital and patient
databases. Machine learning and data processing techniques can use this
data to predict diseases and to generate reports based on the results.
Such predictions matter because these diseases have been among the
leading causes of death in recent years. In this project, medical
reports and data extracted from various databases are used to predict
diseases that are commonly found in people today, namely breast cancer,
heart disease and diabetes, all of which make patients' lives harder.
Technological advances in the health care industry already help people
by suggesting which hospitals and doctors to visit for treatment, where
to get admitted, and which hospitals are best for treating a given
disease. We have implemented this kind of system in our application to
make people's lives simpler: the user enters certain values from their
medical reports, and the application returns a positive or negative
result for the selected disease. Based on the prediction, the user has
the option to get recommendations of the best nearby hospitals and
doctors, drawn from the feedback of past users or guardians.
LIST OF FIGURES
LIST OF TABLES
TABLE OF CONTENTS
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
1. INTRODUCTION
2. LITERATURE SURVEY
4.3 METHODOLOGY
5.1 RESULTS
REFERENCES
APPENDICES
A. SCREENSHOTS
B. PLAGIARISM REPORT
CHAPTER 1
INTRODUCTION
Properly analyzing clinical documents about patients' health helps anticipate
the possibility of occurrence of various diseases. In addition, obtaining
information about specialists for a particular disease, as required,
facilitates proper and efficient diagnosis. This project presents a novel
method that uses data mining techniques, namely the logistic regression and
random forest classification algorithms, for disease prediction. The
likelihood of a disease is predicted from the patient's medical profile, such
as heart rate and blood pressure measured through sensors, together with other
externally observable symptoms such as fever, cold and headache. The logistic
regression and random forest classifiers take these symptoms as input and
predict the disease. Furthermore, the necessary information about the
predicted disease and the recommended doctors is provided. The recommendation
module (a future implementation) suggests the location, contact and other
necessary details of disease specialists based on filters chosen by the user:
lower fees, more experience, nearest location and feedback reviews of the
doctors. The user can thus get appropriate treatment and the necessary
medical advice as quickly as possible. Additionally, users provide feedback
on the recommended doctors, which is then added to the analysis in order to
make further recommendations based on reviews.
The healthcare industry generates terabytes of data every year. The medical
documents that are maintained form a pool of information about patients.
Extracting useful information from them for quality healthcare is both
difficult and important. By analyzing this voluminous data we can predict the
occurrence of a disease and safeguard people. Thus, an intelligent system for
disease prediction plays a major role in controlling disease and maintaining
people's good health by providing accurate and trustworthy disease risk
prediction.
CHAPTER 2
LITERATURE SURVEY
all patients compared to hospitals commonly used by African American,
Hispanic, Asian / Pacific Islander, or multiracial patients
by learning patterns through the collected data for swine flu, using a naïve
Bayes classifier to classify swine flu patients into three categories (least
possible, probable or most probable), resulting in an accuracy of nearly
63.33%. The datasets used for this classification were limited in number.
CHAPTER 3
CHAPTER 4
4.1.1 PROPOSED WORK
• In this work we address the issues of the existing system and aim to
improve accuracy, reliability and efficiency. We build prediction
features for three diseases that are among the most common in people,
namely heart disease, breast cancer and diabetes, and combine them in
one application. The application predicts each disease by analysing
the symptoms collected from the patient's records. It also collects
positive and negative opinions from patients, and based on these we
rate hospitals and doctors from best to worst.
• Guardians' opinions are also very important. They can give feedback
on how the patient was treated (was the staff friendly or strict?)
and on the hospital management (was it clean? how was the
hospitality?). Because the feedback is submitted online, patients
and guardians can share both positive and negative opinions without
hesitation.
• Based on this feedback we can provide truthful recommendations of
hospitals and doctors. According to the predicted disease, we
recommend the most suitable hospital and doctor to consult and to
get admitted to.
4.1.2 SOFTWARE AND HARDWARE CONFIGURATIONS
Hardware requirements
Hardware Minimum requirements
Software requirements
Microsoft .NET Framework v4.6.1 (or higher): The HelpMaster Web Portal has
been written to use Microsoft IIS ASP.NET technology and as such requires the
machine that IIS is running on to have the Microsoft .NET Framework v4.6.1
(or higher) installed, as well as the ASP.NET 4.5 and .NET Extensibility 4.5
features enabled.
4.2 DATASET DETAILS
A data set (or dataset) is a collection of data. In the case of tabular data,
a data set corresponds to one or more database tables, where every
column of a table represents a particular variable, and each row
corresponds to a given record of the data set in question. The data set
lists values for each of the variables, such as height and weight of an
object, for each member of the data set. Data sets can also consist of a
collection of documents or files. The datasets used in this project were
obtained from Kaggle.com.
The datasets that are used are:
slope, ca, thal. We have taken the 14th column as the target variable
(target).
4.2.1.1 TABLE OF DATASET FIELDS
4.2.2 DIABETES DISEASE DATASET DETAILS
From the above 7 input fields we select only two, [Glucose, BMI], based
on the Pearson correlation method.
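The selection by Pearson correlation can be sketched as follows. This is a minimal illustration, not the project's actual code: the helper names `pearson` and `top_features` are hypothetical, and the records are toy values, not rows from the Kaggle dataset. Each input field is ranked by the absolute value of its correlation with the outcome, and the two strongest are kept.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def top_features(columns, target, k=2):
    """Rank features by |r| with the target and return the k strongest."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy records (illustrative values only, not taken from the Kaggle dataset).
columns = {
    "Glucose": [85, 90, 110, 150, 165, 180],
    "BMI": [22.0, 24.5, 27.0, 31.0, 33.5, 36.0],
    "BloodPressure": [70, 82, 64, 78, 66, 80],
}
outcome = [0, 0, 0, 1, 1, 1]

selected = top_features(columns, outcome, k=2)
print(selected)
```

On real data the ranking depends on the dataset, so the two selected fields should be confirmed against the actual correlation values.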
4.2.3 BREAST CANCER DATASET DETAILS
compactness_se 569 non-null float64
concavity_se 569 non-null float64
concave points_se 569 non-null float64
symmetry_se 569 non-null float64
fractal_dimension_se 569 non-null float64
radius_worst 569 non-null float64
texture_worst 569 non-null float64
perimeter_worst 569 non-null float64
area_worst 569 non-null float64
smoothness_worst 569 non-null float64
compactness_worst 569 non-null float64
concavity_worst 569 non-null float64
concave points_worst 569 non-null float64
symmetry_worst 569 non-null float64
fractal_dimension_worst 569 non-null float64
4.3 METHODOLOGY
The user enters their data, which is stored in the database, and then
chooses which disease to predict. The stored data is retrieved and the
prediction for the chosen disease is made. If the result is negative
the process ends; if it is positive the user gets recommendations of
hospitals where they can receive the best treatment.
SYSTEM ARCHITECTURE
WebApp. Architecture design is tied to the goals established for a WebApp,
the content to be presented, the users who will visit, and the navigation
philosophy that has been established. Content architecture focuses on the
way in which content objects are structured for presentation and
navigation. WebApp architecture addresses the way the application is
structured to manage user interaction, handle internal processing tasks,
effect navigation, and present content. WebApp architecture is defined within
the context of the development environment in which the application is to be
implemented.
MODULES IMPLEMENTED
The user enters their data, which is stored in the database, and then chooses
which disease to predict. The stored data is retrieved and the prediction for
the chosen disease is made. If the result is negative the process ends; if it
is positive the user gets hospital recommendations (future) where they can
receive the best treatment.
• Application Flowchart.
• Data collection (from the user) to make dataset.
• Importing packages.
• Data pre-processing.
• Data fitting and training.
• Prediction as opted by the user.
• Result or output.
FLOWCHART DIAGRAM
DATA COLLECTION
PACKAGES IMPORTED
• Classification Report: The classification report shows the
precision, recall, F1 and support scores for the model.
Syntax: from sklearn.metrics import classification_report
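A short usage sketch of `classification_report` is given here. The label lists are hypothetical (1 = positive, 0 = negative) and are not results from this project; they only show what the report contains.

```python
from sklearn.metrics import classification_report

# Hypothetical labels: 1 = disease present/predicted, 0 = absent.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

# Text table with precision, recall, F1-score and support per class.
report = classification_report(y_true, y_pred, target_names=["negative", "positive"])
print(report)

# The same numbers as a nested dictionary, convenient for further processing.
as_dict = classification_report(y_true, y_pred, output_dict=True)
print(as_dict["accuracy"])
```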
DATA PRE-PROCESSING
Data collection is the gathering of task-related information on a set
of target variables in order to analyse it and produce a useful outcome.
However, some of the data may be noisy, i.e. it may contain inaccurate,
incomplete or incorrect values. Hence, the data must be processed
before it is analysed and conclusions are drawn. Data pre-processing
can be done through data cleaning, data transformation and data
selection.
Data pre-processing is the process of preparing raw data and
making it suitable for a machine learning model. It is the first and a
crucial step in creating a machine learning model.
When working on a machine learning project, we do not always
come across clean and formatted data. Before doing any operation on
the data, it must be cleaned and put into a consistent format. This is
the purpose of the data pre-processing task.
Real-world data generally contains noise and missing values, and may be in
a format that cannot be used directly by machine learning models. Data
pre-processing is the task of cleaning the data and making it suitable for a
machine learning model, which also increases the accuracy and efficiency of
the model.
o Importing libraries
o Importing datasets
o Feature scaling
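The feature scaling step can be illustrated with standardization, which rescales a column to zero mean and unit standard deviation. The sketch below is an assumption about how scaling might be done, since the report does not state the scaler used; the function name `standardize` and the glucose values are illustrative. scikit-learn's `StandardScaler` applies the same transformation column by column.

```python
import statistics

def standardize(values):
    """Rescale a feature column to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

glucose = [85, 90, 110, 150, 165, 180]  # illustrative values only
scaled = standardize(glucose)
print([round(v, 2) for v in scaled])
```

Scaling matters most when features have very different ranges, for example glucose values in the hundreds next to BMI values in the tens.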
DATA TRAINING
Model fitting is a measure of how well a machine learning model
generalizes to data similar to that on which it was trained. A
well-fitted model produces more accurate outcomes. An overfitted model
matches the training data too closely, and an underfitted model does
not match it closely enough.
Training data is the initial dataset used to train machine learning
algorithms. Models create and refine their rules using this data. It is a
set of data samples used to fit the parameters of a machine learning
model, training it by example. Training data is also known as the
training dataset, learning set or training set. It is an essential
component of every machine learning model and helps it make accurate
predictions or perform a desired task.
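A minimal sketch of fitting and checking generalization follows, assuming scikit-learn and a synthetic stand-in dataset (the real project data is not reproduced here). The data is split into training and test sets, a logistic regression model is fitted on the training part, and accuracy is compared on both parts. A large gap between the two accuracies would suggest overfitting, while low accuracy on both would suggest underfitting.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a medical dataset (assumption: 2 informative features).
X, y = make_classification(
    n_samples=400, n_features=2, n_informative=2, n_redundant=0, random_state=0
)

# Hold out 25% of the rows to estimate how well the model generalizes.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = LogisticRegression()
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```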
ALGORITHM SELECTION
Fig 4.7: Algorithm selection for diabetes disease prediction.
Fig 4.8: Algorithm selection for heart disease prediction.
PREDICTION AS OPTED BY THE USER
The curve of the logistic function indicates the likelihood of an
outcome, such as whether cells are cancerous or not, or whether a mouse
is obese or not based on its weight. Logistic regression is a
significant machine learning algorithm because it can provide
probabilities and classify new data using both continuous and discrete
datasets. It can be used to classify observations using different types
of data and can easily determine the most effective variables for the
classification. The image below shows the logistic function:
either 0 or 1. Values above the threshold are mapped to 1, and values
below the threshold are mapped to 0.
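The thresholding described above can be sketched in a few lines. The threshold of 0.5 and the choice to map a probability exactly at the threshold to class 1 are assumptions for illustration; the report does not state the values used.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(z, threshold=0.5):
    """Return 1 if the predicted probability reaches the threshold, else 0."""
    return 1 if sigmoid(z) >= threshold else 0

for z in (-3.0, 0.0, 3.0):
    print(z, round(sigmoid(z), 3), classify(z))
```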
For Diabetes disease prediction: LOGISTIC REGRESSION:
Logistic regression is a supervised learning classification algorithm used to
predict the probability of a target variable. The nature of the target or
dependent variable is dichotomous, which means there are only two possible
classes.
LOGISTIC REGRESSION
Logistic regression equation:
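$$p = \frac{e^{\beta_0 + \beta_1 X_1}}{1 + e^{\beta_0 + \beta_1 X_1}}$$
When all the features are plugged in:
$$\operatorname{logit}(p) = \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 \cdot \text{Sex}_{male} + \beta_2 \cdot \text{age} + \beta_3 \cdot \text{cigsPerDay} + \beta_4 \cdot \text{totChol} + \beta_5 \cdot \text{sysBP} + \beta_6 \cdot \text{glucose}$$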
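As a worked example of the relationship between the linear predictor and the probability, the sketch below uses hypothetical coefficients and only two features. The numbers are chosen for illustration and are not fitted values from this project. It computes p from the linear combination and then checks that logit(p) recovers that combination.

```python
import math

# Hypothetical coefficients for illustration only (not fitted values from this project).
beta0 = -6.0
betas = {"glucose": 0.03, "age": 0.04}

def predict_probability(features):
    """Compute p from logit(p) = beta0 + sum(beta_i * x_i)."""
    z = beta0 + sum(betas[name] * value for name, value in features.items())
    return math.exp(z) / (1.0 + math.exp(z))

def logit(p):
    """Log-odds, the inverse of the logistic function."""
    return math.log(p / (1.0 - p))

patient = {"glucose": 120.0, "age": 50.0}
p = predict_probability(patient)
print(round(p, 3), round(logit(p), 3))
```

Here the linear predictor is -6.0 + 0.03 * 120 + 0.04 * 50 = -0.4, so the log-odds are negative and the probability falls below 0.5.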
To implement logistic regression using Python, we follow the same
steps as in the earlier regression topics. Below are the steps:
4.3.2 RANDOM FOREST ALGORITHM
A greater number of trees in the forest generally leads to higher
accuracy and helps reduce the problem of overfitting.
Random Forest works in two phases: the first is to create the random
forest by combining N decision trees, and the second is to make
predictions with each tree created in the first phase.
The working process can be explained in the steps and diagram below:
Step-2: Build the decision trees associated with the selected data points (Subsets).
Step-3: Choose the number N for decision trees that you want to build.
Step-5: For new data points, find the predictions of each decision tree,
and assign the new data points to the category that wins the majority
votes.
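The listed steps can be sketched as bagging with majority voting. This is a simplified stand-in and not the project's implementation: each "tree" is a one-split decision stump, whereas a real random forest grows deeper trees and also samples features at each node. The steps not shown in the list above are assumed here to be drawing a random sample of data points with replacement and repeating the tree-building step N times. All function names and data values are illustrative.

```python
import random
from collections import Counter

def fit_stump(rows, labels):
    """Find the single feature/threshold split with the fewest training errors."""
    best = None
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            for above in (0, 1):
                preds = [above if row[feature] > threshold else 1 - above for row in rows]
                errors = sum(p != y for p, y in zip(preds, labels))
                if best is None or errors < best[0]:
                    best = (errors, feature, threshold, above)
    _, feature, threshold, above = best
    return lambda row: above if row[feature] > threshold else 1 - above

def fit_forest(rows, labels, n_trees, rng):
    """Train n_trees stumps, each on a bootstrap sample drawn with replacement."""
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx]))
    return forest

def predict(forest, row):
    """Each tree votes; the class with the most votes wins."""
    votes = Counter(tree(row) for tree in forest)
    return votes.most_common(1)[0][0]

# Toy data: [glucose, bmi] -> 0/1 (illustrative values only).
rows = [[85, 22], [90, 24], [100, 26], [150, 31], [165, 33], [180, 36]]
labels = [0, 0, 0, 1, 1, 1]
forest = fit_forest(rows, labels, n_trees=15, rng=random.Random(0))
print(predict(forest, [80, 21]), predict(forest, [190, 38]))
```

An odd number of trees is used so that a two-class vote cannot end in a tie.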
The maximum number of features (m) considered at each node is also
called "mtry" or k. The depth of each tree can be limited by node-size
parameters (for example, the minimum number of samples in a leaf),
which may be set as low as one. As mentioned above, once the forest has
been trained, a new sample is passed down each of the trees in the
forest. Each tree votes for a class for the new sample, the votes of
all trees are combined, and the class with the majority of votes is
assigned to the sample. The forest considered here is built using
random input selection (Forest-RI). For each tree, a bootstrap sample
is drawn from the training data with replacement, which leaves out
around 1/3 of the original samples. These left-out samples are called
OOB (out-of-bag) data. Each tree has its own OOB data, which is used to
measure the error of each tree in the forest; this is known as the OOB
error estimate.
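The "around 1/3" figure can be checked with a small simulation. When n rows are drawn with replacement from n rows, the chance that a given row is never picked is (1 - 1/n)^n, which approaches 1/e (about 0.368) for large n. The sketch below only illustrates this property and is not part of the project code; the function name `oob_fraction` is hypothetical. In scikit-learn, `RandomForestClassifier(oob_score=True)` uses these left-out rows to report an out-of-bag accuracy in `oob_score_`.

```python
import random

def oob_fraction(n_samples, rng):
    """Draw one bootstrap sample and return the share of rows it never picked."""
    drawn = {rng.randrange(n_samples) for _ in range(n_samples)}
    return 1.0 - len(drawn) / n_samples

rng = random.Random(42)
fractions = [oob_fraction(10_000, rng) for _ in range(5)]
print([round(f, 3) for f in fractions])
```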
CHAPTER 5
5.1 RESULTS
Many patients do not get the right treatment at the right time because
they are unable to decide on a hospital and doctor; they do not know
what to do next and end up in a serious condition. The objective of the
project is to serve patients by suggesting the best hospital for
treating their existing disease. The project provides an easy way for
patients to get a recommendation of which doctor or hospital to go to
after being diagnosed with a severe disease, so they need not worry
about what should be done next. The web application uses the entered
reports to make predictions and gives results accordingly, so that the
best hospitals can be selected for treatment and more lives can be
saved. After logging in or registering, the patient enters certain
values from their medical diagnosis report, and the app shows whether
the patient is predicted to have the particular disease in the form of
a positive or negative result. After the prediction, the patient has
the option to get recommendations of nearby hospitals that are best for
treating their disease. In this way the app can help patients get
treatment before it is too late.
SCREENSHOTS OF RESULT
BREAST CANCER PREDICTION RESULT
5.1 PERFORMANCE EVALUATION
HEART DISEASE PREDICTION PERFORMANCE ANALYSIS
CHAPTER 6
6.1 FUTURE WORK
The collaborative filtering (CF) framework requires a lot of
information processing and is widely used in large-scale applications
such as e-commerce and web hosting. Over the past few years, CF has
evolved and has become one of the most popular approaches to making
recommendations. Today, computers and the Internet allow us to consider
the opinions of a large community of people. Individuals can benefit
from such communities, gaining information from other users and
learning about a variety of items. This information can also help users
form their own opinions or find relevant items. Specifically, CF
techniques are used to help users discover new items they might like,
get advice on specific items, and connect with other users who have
similar interests.
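A minimal sketch of user-based collaborative filtering applied to hospital feedback follows. It is only an illustration of the idea for the planned recommendation module: the users, hospitals, ratings and function names are all hypothetical, and cosine similarity with a similarity-weighted average is one common choice among several.

```python
import math

# Hypothetical feedback: user -> {hospital: rating on a 1-5 scale}.
ratings = {
    "user_a": {"hospital_1": 5, "hospital_2": 3, "hospital_3": 4},
    "user_b": {"hospital_1": 4, "hospital_2": 2, "hospital_3": 5},
    "user_c": {"hospital_1": 1, "hospital_2": 5},
    "new_user": {"hospital_1": 5, "hospital_2": 2},
}

def cosine_similarity(a, b):
    """Cosine similarity over the hospitals both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[h] * b[h] for h in shared)
    norm_a = math.sqrt(sum(a[h] ** 2 for h in shared))
    norm_b = math.sqrt(sum(b[h] ** 2 for h in shared))
    return dot / (norm_a * norm_b)

def predict_rating(user, hospital):
    """Similarity-weighted average of other users' ratings for the hospital."""
    weighted, total = 0.0, 0.0
    for other, other_ratings in ratings.items():
        if other == user or hospital not in other_ratings:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        weighted += sim * other_ratings[hospital]
        total += sim
    return weighted / total if total else None

print(predict_rating("new_user", "hospital_3"))
```

The function returns None when no other user has rated the hospital, which is the cold-start case a real system would need to handle separately.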
REFERENCE
APPENDIX
A. SCREENSHOTS
Screenshot 2: Application About Us page (describes the purpose of
the application and what it does).
Screenshot 4: Application Login page (registered users can log
in with their user ID and password and use the application,
which is designed to be user friendly).
Screenshot 6: Report entry page (where users can enter values
from their diagnosis report).
B. PLAGIARISM REPORT