Diabetes Prediction Using Machine Learning
Algorithm | Datasets | Healthcare | Machine Learning
Overview
In this article, we will predict whether or not a patient has diabetes based on the features we provide to our machine learning model, using the famous Pima Indians Diabetes Database. The article walks through four parts:
1. Data analysis: how the data analysis part of a data science life cycle is carried out.
2. Exploratory data analysis: EDA is one of the most important steps in a data science project's life cycle, and here we learn how to draw inferences from the visualizations and the data analysis.
3. Model building: we will train 4 ML models and then choose the best-performing one.
4. Saving the model: saving the best model with pickle so it can make predictions on real data.
Importing Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import classification_report
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
Here we will read the dataset, which is in CSV format.
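The loading line itself did not survive extraction; a minimal sketch, assuming the file is named diabetes.csv and sits in the working directory:

# Load the Pima Indians Diabetes dataset (file name/path is an assumption)
diabetes_df = pd.read_csv('diabetes.csv')
diabetes_df.head()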
Output:
Now let's see which columns are available in our dataset.
diabetes_df.columns
Output:
diabetes_df.info()
Output:
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Pregnancies               768 non-null    int64
 1   Glucose                   768 non-null    int64
 2   BloodPressure             768 non-null    int64
 3   SkinThickness             768 non-null    int64
 4   Insulin                   768 non-null    int64
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64
 8   Outcome                   768 non-null    int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
diabetes_df.describe()
Output:
To know more about the dataset, we can also look at the transposed summary – here T stands for transpose.
diabetes_df.describe().T
Output:
Now let's check whether our dataset has null values or not.
diabetes_df.isnull().head(10)
Output:
Now let’s check the number of null values our dataset has.
diabetes_df.isnull().sum()
Output:
Here, from the above code, we first checked for null values with the isnull() function and then summed them up with the sum() function. The inference we get is that there are no missing values, but that is not actually the true story: in this particular dataset all the missing values were recorded as 0, which is not good for the authenticity of the dataset. Hence we will first replace the 0 values with NaN and then start the imputation process.
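The replacement code itself was lost in extraction; a minimal sketch of the step, assuming a working copy named diabetes_df_copy and treating zeros in the clinically impossible columns as missing:

# Work on a deep copy so the original frame stays untouched (names are assumptions)
diabetes_df_copy = diabetes_df.copy(deep=True)
zero_as_missing = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
diabetes_df_copy[zero_as_missing] = diabetes_df_copy[zero_as_missing].replace(0, np.nan)
# Count the missing values per column after the replacement
diabetes_df_copy.isnull().sum()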
Output:
Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64
As mentioned above, we replace the zeros with NaN values so that we can impute them later; this maintains the authenticity of the dataset and gives us a better imputation approach, i.e. applying the mean value of each column to the null values of that respective column.
Data Visualization
p = diabetes_df.hist(figsize = (20,20))
Output:
Inference: Here we have seen the distribution of each feature, dependent and independent alike. One question that always comes up is: why do we need to see the distribution of the data at all? The answer is simple: it is the best way to start analysing a dataset, since it shows how often every kind of value occurs in graphical form, which in turn tells us the range of the data.
Now we will impute the mean value of each column into the missing values of that particular column.
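A possible implementation of that imputation, assuming the diabetes_df_copy frame from the previous step and filling each affected column with its own mean:

# Fill the NaN values introduced above with the mean of each respective column
for col in ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']:
    diabetes_df_copy[col] = diabetes_df_copy[col].fillna(diabetes_df_copy[col].mean())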
p = diabetes_df_copy.hist(figsize = (20,20))
Output:
Inference: Here we again use the hist plot to see the distribution of the dataset, but this time to see the changes after the null values have been handled, and we can clearly see the difference – for example, in the Age column, after the null values were dealt with, there is a spike in the 50 to 100 range, which is quite logical as well.
Output:
Inference: Now in the above graph also we can clearly see that there are no null values in the dataset.
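The plot this inference refers to was likely a missing-value heatmap; a sketch, assuming seaborn as imported above:

# Visual check that no null cells remain after imputation
sns.heatmap(diabetes_df_copy.isnull())
plt.show()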
Now, let's check how well our outcome column is balanced.
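The balance check itself is missing from the text; a minimal sketch, assuming a simple count plus a count plot:

# Class counts for the target column, then a bar chart of the same
print(diabetes_df_copy.Outcome.value_counts())
sns.countplot(x='Outcome', data=diabetes_df_copy)
plt.show()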
Output:
Output:
Inference: That's how a distplot can be helpful: it lets us see the distribution of the data, while the boxplot shows the outliers in that column and the other information that a box-and-whiskers plot can convey.
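The plots described in this inference were not preserved; a sketch of how they might be drawn, using Insulin purely as an illustrative column (histplot with a KDE is the modern seaborn equivalent of the distplot mentioned above):

plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
sns.histplot(diabetes_df_copy['Insulin'], kde=True)   # distribution of the column
plt.subplot(1, 2, 2)
sns.boxplot(x=diabetes_df_copy['Insulin'])            # box-and-whiskers view for outliers
plt.show()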
Output:
Scaling the Data
diabetes_df_copy.head()
Output:
Output:
That's how our dataset looks when it is scaled down: every value is now on the same scale, which will help our ML model give a better result.
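The scaling code is missing at this point; a minimal sketch, assuming sklearn's StandardScaler applied to every column except the target:

from sklearn.preprocessing import StandardScaler

# Standardize the feature columns (the scaler choice and column list are assumptions)
sc_X = StandardScaler()
X = pd.DataFrame(sc_X.fit_transform(diabetes_df_copy.drop(['Outcome'], axis=1)),
                 columns=['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
                          'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age'])
X.head()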
y = diabetes_df_copy.Outcome
y
Output:
0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: Outcome, Length: 768, dtype: int64
Model Building
Now we will split the data into training and testing data using the train_test_split function
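A sketch of that split; the test share and random state below are assumptions, not values confirmed by the article:

from sklearn.model_selection import train_test_split

# Hold out a third of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=7)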
Random Forest
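The model-building code was lost here; a minimal sketch, assuming sklearn's RandomForestClassifier (the number of trees is an assumption):

from sklearn.ensemble import RandomForestClassifier

# Fit a Random Forest on the training split
rfc = RandomForestClassifier(n_estimators=200)
rfc.fit(X_train, y_train)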
Now, after building the model, let's check its accuracy on the training dataset.
rfc_train = rfc.predict(X_train)
print("Accuracy_Score =", format(metrics.accuracy_score(y_train, rfc_train)))
So here we can see that our model is overfitted on the training dataset.
Getting the accuracy score for Random Forest
predictions = rfc.predict(X_test)
print("Accuracy_Score =", format(metrics.accuracy_score(y_test, predictions)))
Output:
Accuracy_Score = 0.7677165354330708
Output:
Decision Tree
Now we will be making the predictions on the testing data directly as it is of more importance.
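The code for this step is not preserved; a sketch, assuming sklearn's DecisionTreeClassifier with default settings:

from sklearn.tree import DecisionTreeClassifier

# Fit a Decision Tree and score it on the held-out test data
dtree = DecisionTreeClassifier()
dtree.fit(X_train, y_train)
dtree_pred = dtree.predict(X_test)
print("Accuracy_Score =", format(metrics.accuracy_score(y_test, dtree_pred)))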
Output:
XGBoost Classifier
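The training code is missing; a sketch, assuming the xgboost package's XGBClassifier with default parameters:

from xgboost import XGBClassifier

# Fit a gradient-boosted tree ensemble on the training split
xgb_model = XGBClassifier()
xgb_model.fit(X_train, y_train)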
Output:
Now we will make the predictions on the testing data directly, as it is of more importance.
xgb_pred = xgb_model.predict(X_test)
print("Accuracy_Score =", format(metrics.accuracy_score(y_test, xgb_pred)))
Output:
Output:
Support Vector Machine (SVM)
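The model construction is missing here; a sketch, assuming sklearn's SVC with default parameters:

from sklearn.svm import SVC

# Fit a support vector classifier on the training split
svc_model = SVC()
svc_model.fit(X_train, y_train)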
svc_pred = svc_model.predict(X_test)
Output:
Output:
Therefore the Random Forest is the best model for this prediction, since it has an accuracy score of 0.7677.
Feature Importance
Knowing the feature importances is quite necessary, as it shows how much weight each feature carries in the model-building phase.
rfc.feature_importances_
Output:
From the above output it is not very clear which feature is important, so we will now visualize the same.
(pd.Series(rfc.feature_importances_, index=X.columns).plot(kind='barh'))
Output:
Here, from the above graph, it is clearly visible that Glucose is the most important feature in this dataset.
Saving the Model
import pickle

# Firstly we will be using the dumps() function to save the model using pickle
saved_model = pickle.dumps(rfc)

# Then we will be loading that saved model
rfc_from_pickle = pickle.loads(saved_model)

# Lastly, after loading that model we will use it to make predictions
rfc_from_pickle.predict(X_test)
Output:
Now, for the last time, let's look at the head and tail of the dataset so that we can take a random set of features from each end to test whether our model is good enough to give the right prediction.
diabetes_df.head()
Output:
diabetes_df.tail()
Output:
Putting a data point into the model will return either 0 or 1, i.e. whether or not the person is suffering from diabetes.
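A sketch of how a single record could be passed to the trained model; the feature values below are placeholders in the column order used earlier, not the exact rows from the original run:

# Hypothetical raw values: [Pregnancies, Glucose, BloodPressure, SkinThickness,
#                           Insulin, BMI, DiabetesPedigreeFunction, Age]
sample = [[0, 137, 40, 35, 168, 43.1, 2.228, 33]]
# Because the model in this sketch was trained on scaled features, apply the same scaler first
rfc.predict(sc_X.transform(sample))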
Output:
array([1], dtype=int64)
Another one
Output:
array([0], dtype=int64)
Conclusion
After using all these patient records, we were able to build a machine learning model (Random Forest being the best one) to predict whether or not the patients in the dataset have diabetes; along with that, we were able to draw some insights from the data via data analysis and visualization.
About Me
Greetings, everyone! I'm currently working at TCS, and previously I worked as a Data Science Analyst at Zorba Consulting India. Along with full-time work, I have an immense interest in the same field, i.e. Data Science, and in other subsets of Artificial Intelligence such as Computer Vision, Machine Learning, and Deep Learning; feel free to collaborate with me on any project in the domains mentioned above (LinkedIn).
Here you can access my other articles, which are published on Analytics Vidhya as a part of the Blogathon
(link)
The media shown in this article is not owned by Analytics Vidhya and is used at the author's discretion.