Diabetes Prediction Using Machine Learning
Algorithm | Datasets | Healthcare | Machine Learning
Overview
In this article, we will predict whether or not a patient has diabetes based on the features we provide to our machine learning model, using the famous Pima Indians Diabetes Database. The article walks through four parts:
1. Data analysis: how the data analysis part of a data science life cycle is carried out.
2. Exploratory data analysis: EDA is one of the most important steps in a data science project's life cycle, and here we learn how to draw inferences from the visualizations and the data analysis.
3. Model building: we will train 4 ML models and then choose the best-performing one.
4. Saving the model: saving the best model with pickle so it can make predictions on real data.
Importing Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import classification_report
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
Here we will read the dataset, which is in CSV format.
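The loading line itself did not survive extraction; a minimal sketch, assuming the file is named diabetes.csv and sits in the working directory:

# Load the Pima Indians Diabetes dataset (file name/path is an assumption)
diabetes_df = pd.read_csv('diabetes.csv')
diabetes_df.head()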
Output:
Now let's see which columns are available in our dataset.
diabetes_df.columns
Output:
diabetes_df.info()
Output:
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Pregnancies               768 non-null    int64
 1   Glucose                   768 non-null    int64
 2   BloodPressure             768 non-null    int64
 3   SkinThickness             768 non-null    int64
 4   Insulin                   768 non-null    int64
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64
 8   Outcome                   768 non-null    int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
diabetes_df.describe()
Output:
To know more about the dataset, we can also look at the transposed summary – here T stands for transpose.
diabetes_df.describe().T
Output:
Now let's check whether our dataset has null values or not.
diabetes_df.isnull().head(10)
Output:
Now let’s check the number of null values our dataset has.
diabetes_df.isnull().sum()
Output:
Here, from the above code, we first checked for null values with the isnull() function and then summed them up with the sum() function. The inference we get is that there are no missing values, but that is not actually the true story: in this particular dataset all the missing values were recorded as 0, which is not good for the authenticity of the dataset. Hence we will first replace the 0 values with NaN and then start the imputation process.
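The replacement code itself was lost in extraction; a minimal sketch of the step, assuming a working copy named diabetes_df_copy and treating zeros in the clinically impossible columns as missing:

# Work on a deep copy so the original frame stays untouched (names are assumptions)
diabetes_df_copy = diabetes_df.copy(deep=True)
zero_as_missing = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
diabetes_df_copy[zero_as_missing] = diabetes_df_copy[zero_as_missing].replace(0, np.nan)
# Count the missing values per column after the replacement
diabetes_df_copy.isnull().sum()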
Output:
Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64
As mentioned above, we replace the zeros with NaN values so that we can impute them later; this maintains the authenticity of the dataset and gives us a better imputation approach, i.e. applying the mean value of each column to the null values of that respective column.
Data Visualization
p = diabetes_df.hist(figsize = (20,20))
Output:
Inference: Here we have seen the distribution of each feature, dependent and independent alike. One question that always comes up is: why do we need to see the distribution of the data at all? The answer is simple: it is the best way to start analysing a dataset, since it shows how often every kind of value occurs in graphical form, which in turn tells us the range of the data.
Now we will impute the mean value of each column into the missing values of that particular column.
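A possible implementation of that imputation, assuming the diabetes_df_copy frame from the previous step and filling each affected column with its own mean:

# Fill the NaN values introduced above with the mean of each respective column
for col in ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']:
    diabetes_df_copy[col] = diabetes_df_copy[col].fillna(diabetes_df_copy[col].mean())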
p = diabetes_df_copy.hist(figsize = (20,20))
Output:
Inference: Here we again use the hist plot to see the distribution of the dataset, but this time to see the changes after the null values have been handled, and we can clearly see the difference – for example, in the Age column, after the null values were dealt with, there is a spike in the 50 to 100 range, which is quite logical as well.
Output:
Inference: Now in the above graph also we can clearly see that there are no null values in the dataset.
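The plot this inference refers to was likely a missing-value heatmap; a sketch, assuming seaborn as imported above:

# Visual check that no null cells remain after imputation
sns.heatmap(diabetes_df_copy.isnull())
plt.show()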
Now, let's check how well our outcome column is balanced.
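The balance check itself is missing from the text; a minimal sketch, assuming a simple count plus a count plot:

# Class counts for the target column, then a bar chart of the same
print(diabetes_df_copy.Outcome.value_counts())
sns.countplot(x='Outcome', data=diabetes_df_copy)
plt.show()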
Output:
Output:
Inference: That's how a distplot can be helpful: it lets us see the distribution of the data, while the boxplot shows the outliers in that column and the other information that a box-and-whiskers plot can convey.
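The plots described in this inference were not preserved; a sketch of how they might be drawn, using Insulin purely as an illustrative column (histplot with a KDE is the modern seaborn equivalent of the distplot mentioned above):

plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
sns.histplot(diabetes_df_copy['Insulin'], kde=True)   # distribution of the column
plt.subplot(1, 2, 2)
sns.boxplot(x=diabetes_df_copy['Insulin'])            # box-and-whiskers view for outliers
plt.show()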
Output:
Scaling the Data
diabetes_df_copy.head()
Output:
Output:
That's how our dataset looks when it is scaled down: every value is now on the same scale, which will help our ML model give a better result.
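The scaling code is missing at this point; a minimal sketch, assuming sklearn's StandardScaler applied to every column except the target:

from sklearn.preprocessing import StandardScaler

# Standardize the feature columns (the scaler choice and column list are assumptions)
sc_X = StandardScaler()
X = pd.DataFrame(sc_X.fit_transform(diabetes_df_copy.drop(['Outcome'], axis=1)),
                 columns=['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
                          'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age'])
X.head()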
y = diabetes_df_copy.Outcome
y
Output:
0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: Outcome, Length: 768, dtype: int64
Model Building
Now we will split the data into training and testing data using the train_test_split function
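A sketch of that split; the test share and random state below are assumptions, not values confirmed by the article:

from sklearn.model_selection import train_test_split

# Hold out a third of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=7)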
Random Forest
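The model-building code was lost here; a minimal sketch, assuming sklearn's RandomForestClassifier (the number of trees is an assumption):

from sklearn.ensemble import RandomForestClassifier

# Fit a Random Forest on the training split
rfc = RandomForestClassifier(n_estimators=200)
rfc.fit(X_train, y_train)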
Now, after building the model, let's check its accuracy on the training dataset.
rfc_train = rfc.predict(X_train)
print("Accuracy_Score =", format(metrics.accuracy_score(y_train, rfc_train)))
So here we can see that our model is overfitted on the training dataset.
Getting the accuracy score for Random Forest
predictions = rfc.predict(X_test)
print("Accuracy_Score =", format(metrics.accuracy_score(y_test, predictions)))
Output:
Accuracy_Score = 0.7677165354330708
Output:
Decision Tree
Now we will be making the predictions on the testing data directly as it is of more importance.
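The code for this step is not preserved; a sketch, assuming sklearn's DecisionTreeClassifier with default settings:

from sklearn.tree import DecisionTreeClassifier

# Fit a Decision Tree and score it on the held-out test data
dtree = DecisionTreeClassifier()
dtree.fit(X_train, y_train)
dtree_pred = dtree.predict(X_test)
print("Accuracy_Score =", format(metrics.accuracy_score(y_test, dtree_pred)))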
Output:
XGBoost Classifier
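The training code is missing; a sketch, assuming the xgboost package's XGBClassifier with default parameters:

from xgboost import XGBClassifier

# Fit a gradient-boosted tree ensemble on the training split
xgb_model = XGBClassifier()
xgb_model.fit(X_train, y_train)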
Output:
Now we will make the predictions on the testing data directly, as it is of more importance.
xgb_pred = xgb_model.predict(X_test)
print("Accuracy_Score =", format(metrics.accuracy_score(y_test, xgb_pred)))
Output:
Output:
Support Vector Machine (SVM)
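The model construction is missing here; a sketch, assuming sklearn's SVC with default parameters:

from sklearn.svm import SVC

# Fit a support vector classifier on the training split
svc_model = SVC()
svc_model.fit(X_train, y_train)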
svc_pred = svc_model.predict(X_test)
Output:
Output:
Therefore the Random Forest is the best model for this prediction, since it has an accuracy score of 0.7677.
Feature Importance
Knowing the feature importances is quite necessary, as it shows how much weight each feature carries in the model-building phase.
rfc.feature_importances_
Output:
From the above output it is not very clear which feature is important, so we will now visualize the same.
(pd.Series(rfc.feature_importances_, index=X.columns).plot(kind='barh'))
Output:
Here, from the above graph, it is clearly visible that Glucose is the most important feature in this dataset.
Saving the Model
import pickle

# Firstly we will be using the dumps() function to save the model using pickle
saved_model = pickle.dumps(rfc)

# Then we will be loading that saved model
rfc_from_pickle = pickle.loads(saved_model)

# Lastly, after loading that model we will use it to make predictions
rfc_from_pickle.predict(X_test)
Output:
Now, for the last time, let's look at the head and tail of the dataset so that we can take a random set of features from each end to test whether our model is good enough to give the right prediction.
diabetes_df.head()
Output:
diabetes_df.tail()
Output:
Putting a data point into the model will return either 0 or 1, i.e. whether or not the person is suffering from diabetes.
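A sketch of how a single record could be passed to the trained model; the feature values below are placeholders in the column order used earlier, not the exact rows from the original run:

# Hypothetical raw values: [Pregnancies, Glucose, BloodPressure, SkinThickness,
#                           Insulin, BMI, DiabetesPedigreeFunction, Age]
sample = [[0, 137, 40, 35, 168, 43.1, 2.228, 33]]
# Because the model in this sketch was trained on scaled features, apply the same scaler first
rfc.predict(sc_X.transform(sample))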
Output:
array([1], dtype=int64)
Another one
Output:
array([0], dtype=int64)
Conclusion
After using all these patient records, we were able to build a machine learning model (Random Forest being the best one) to predict whether or not the patients in the dataset have diabetes; along with that, we were able to draw some insights from the data via data analysis and visualization.
About Me
Greetings, everyone! I'm currently working at TCS, and previously I worked as a Data Science Analyst at Zorba Consulting India. Along with full-time work, I have an immense interest in the same field, i.e. Data Science, and in other subsets of Artificial Intelligence such as Computer Vision, Machine Learning, and Deep Learning; feel free to collaborate with me on any project in the domains mentioned above (LinkedIn).
Here you can access my other articles, which are published on Analytics Vidhya as a part of the Blogathon
(link)
The media shown in this article is not owned by Analytics Vidhya and is used at the author's discretion.