Dataset
ARSI UNIVERSITY
COLLEGE OF BUSINESS AND ECONOMICS
DEPARTMENT OF MANAGEMENT INFORMATION SYSTEM
PROJECT TITLE: DATA SCIENCE ON DIABETES PREDICTION
DATASET
Presented by: Group A student
INTRODUCTION
According to the WHO, diabetes is a chronic disease that occurs either when the
pancreas does not produce enough insulin or when the body cannot
effectively use the insulin it produces.
Insulin is a hormone that regulates blood sugar.
Hyperglycemia, or raised blood sugar, is a common effect of uncontrolled
diabetes and over time leads to serious damage to many of the body's
systems, especially the nerves and blood vessels.
INTRODUCTION (CONT.)
Diabetes is a health condition that affects how your body turns food into energy.
Most of the food you eat is broken down into sugar (also called glucose) and
released into your bloodstream.
When your blood sugar goes up, it signals your pancreas to release insulin.
Without ongoing, careful management, diabetes can lead to a buildup of sugars in
the blood, which can increase the risk of dangerous complications, including stroke
and heart disease.
We therefore decided to predict diabetes using machine learning in Python.
Problem Statement / Business Understanding
The goal of the diabetes dataset is to diagnostically predict whether or not a patient has diabetes, based on
certain diagnostic measurements included in the dataset.
Several constraints were placed on the selection of these instances from a larger database. In particular, all
patients here are females at least 21 years old of Pima Indian heritage.
We want to know the impact of Pregnancies, Glucose, Blood Pressure, Skin Thickness, Insulin, BMI and Diabetes
Pedigree Function on the outcome, based on the available data.
Using regression analysis, we model the relationship between the dependent variable (diabetes) and the
independent variables (pregnancy, glucose, age, BMI, ...).
Objectives
Data set: the dataset originally comes from the Pima Indians Diabetes
Database and is available on Kaggle.
It consists of several medical predictor (independent) variables and one
target (dependent) variable.
The objective is to predict whether or not the patient has diabetes.
CONT.
This part covers cleaning and preparing the data: fixing inconsistencies
within the data, handling missing values, and treating collinearity.
The dataset has no explicitly missing values; however, features such as
Glucose, BloodPressure, Insulin and SkinThickness contain 0 values, which is
not possible.
We saw in df.head() that some features contain 0. This does not make sense
here and indicates a missing value, so we replace the 0 values with null (NaN).
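The zero-to-null replacement described here could look roughly like the sketch below. The small DataFrame is a made-up stand-in for the real data (only the column names follow the dataset), and filling the gaps with the column median is an assumption; the slides do not say which imputation was used.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical sample; values are illustrative, not taken from the real file.
df = pd.DataFrame({
    "Glucose":       [148, 85, 0, 89, 137],
    "BloodPressure": [72, 66, 64, 0, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin":       [0, 0, 0, 94, 168],
    "BMI":           [33.6, 26.6, 23.3, 28.1, 43.1],
    "Outcome":       [1, 0, 1, 0, 1],
})

# Columns where a value of 0 cannot be a real measurement.
zero_as_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin"]

# Replace 0 with NaN so pandas treats these entries as missing.
df[zero_as_missing] = df[zero_as_missing].replace(0, np.nan)

# Count the missing entries per column after the replacement.
missing_counts = df[zero_as_missing].isna().sum()
print(missing_counts)

# Assumed imputation strategy: fill each gap with that column's median.
df_filled = df.copy()
df_filled[zero_as_missing] = df_filled[zero_as_missing].fillna(
    df_filled[zero_as_missing].median()
)
print(df_filled.head())
```

The median is less sensitive to extreme values than the mean, which is why it is a common default for skewed columns such as Insulin.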
Data Exploration
This stage is all about building a model that best solves the problem.
It begins with a process called data splitting, where the entire data set is
split into two portions:
one for training the model (training data set) and the other for testing the
performance of the model (testing data set).
This is followed by building the model using the training data set and finally
evaluating the model using the test data set.
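The split, train, and evaluate steps above can be sketched as follows. This is a minimal illustration on synthetic data, assuming scikit-learn is available; logistic regression, the 80/20 split, and the variable names are assumptions, not choices confirmed by the slides.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 rows, 3 numeric features, one binary label.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
noise = rng.normal(scale=0.5, size=200)
y = (X[:, 0] + 0.5 * X[:, 1] + noise > 0).astype(int)

# Hold out 20% of the rows for testing; stratify keeps the class ratio similar
# in both portions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Build the model using the training portion only.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate on the held-out test portion.
y_pred = model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {test_accuracy:.2f}")
```

Keeping the test portion out of training is what makes the reported accuracy an estimate of performance on unseen patients.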
Feature Engineering
Under feature engineering, we apply feature selection and separate the data into
training data and test data.
It is then time to add important features to the dataset and to discover effective
features before fitting machine learning models.
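One simple way to shortlist features is to rank them by their absolute correlation with the target. The sketch below is only an illustration: the slides do not name the selection method, the sample values are made up, and the 0.2 threshold is an arbitrary assumption.

```python
import pandas as pd

# Small illustrative sample; only the column names follow the dataset.
df = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137, 116, 78, 115],
    "BMI":     [33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.3],
    "Age":     [50, 31, 32, 21, 33, 30, 26, 29],
    "Outcome": [1, 0, 1, 0, 1, 0, 1, 0],
})

# Absolute Pearson correlation of each feature with the target, strongest first.
correlations = (
    df.drop(columns="Outcome")
      .corrwith(df["Outcome"])
      .abs()
      .sort_values(ascending=False)
)
print(correlations)

# Keep the features whose correlation reaches the (assumed) threshold.
threshold = 0.2
selected = correlations[correlations >= threshold].index.tolist()
print("Selected features:", selected)
```

Correlation only captures linear, one-feature-at-a-time relationships, so it is best treated as a first filter and not as a final selection.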
Predictive Modeling
In our proposed predictive model, we preprocessed the raw data and applied
different feature engineering techniques to get better results.
An algorithm is used for feature selection, as it provides an unbiased selection of
important and unimportant features from an information system.
Training on the data after feature engineering plays a significant role in supervised
learning.
We used highly correlated variables for better outcomes.
Input data here refers to the test data used for prediction and for the confusion matrix.
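As an illustration of how a confusion matrix is built from test-set predictions, here is a small library-free sketch. The label lists are hypothetical, and the [[TN, FP], [FN, TP]] layout is one common convention, assumed here.

```python
from collections import Counter

# Hypothetical true labels and model predictions (1 = diabetic, 0 = not diabetic).
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

# Count each (actual, predicted) pair.
counts = Counter(zip(y_true, y_pred))
tp = counts[(1, 1)]  # diabetic, predicted diabetic
tn = counts[(0, 0)]  # not diabetic, predicted not diabetic
fp = counts[(0, 1)]  # not diabetic, predicted diabetic
fn = counts[(1, 0)]  # diabetic, predicted not diabetic

# Rows are actual classes (0, 1); columns are predicted classes (0, 1).
confusion = [[tn, fp], [fn, tp]]
accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)      # share of diabetic patients that were detected
precision = tp / (tp + fp)   # share of positive predictions that were correct

print(confusion)
print(f"accuracy={accuracy:.2f} recall={recall:.2f} precision={precision:.2f}")
```

For a screening task like this one, recall is often watched as closely as accuracy, because a false negative means a diabetic patient goes undetected.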
8. Data Visualization
THANK YOU FOR YOUR ATTENTION!