DMBI

The document outlines a project aimed at predicting diabetes onset using machine learning algorithms, specifically Bagging and Gradient Boosting Classifiers, trained on a dataset with various health indicators. It details the necessary data pre-processing steps, exploratory data analysis (EDA) methods, and the evaluation of model performance through confusion matrices and accuracy scores. The final decision on model deployment considers factors like computational efficiency and interpretability, suggesting the Bagging Classifier for simpler applications and the Gradient Boosting Classifier for more complex scenarios.

GROUP: MOHIT, KAUSHIK AND SANCHIT    ROLL: 34, 30, 38

PROBLEM STATEMENT:
"Predicting Diabetes Onset: Utilizing a dataset containing various health
indicators, including pregnancies, glucose levels, blood pressure, skin thickness,
insulin levels, BMI, pedigree function, and age, we aim to develop predictive
models that can effectively identify individuals at risk of developing diabetes. By
employing machine learning algorithms such as Bagging Classifier and Gradient
Boosting Classifier, trained on a labeled dataset, we seek to build robust models
capable of accurately classifying individuals into diabetic and non-diabetic groups.
The ultimate goal is to develop a predictive tool that can assist healthcare
professionals in early detection and intervention for individuals predisposed to
diabetes, thereby improving health outcomes and reducing the burden of the
disease.

A) WHICH DATA MINING TASK IS NEEDED FOR OUR DATASET:
Bagging and boosting are ensemble machine learning techniques. They benefit from certain data pre-processing steps common in data mining.

Data Pre-processing:

The CSV file (diabetes.csv) needs pre-processing before it can be used with bagging or boosting algorithms. This involves:
Handling missing values (filling them with appropriate strategies or removing
rows/columns with too many missing entries).
Encoding categorical variables into numerical ones (if present).
Feature scaling (ensuring all features are on a similar scale).
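
The pre-processing steps above can be sketched as follows; the tiny DataFrame, the column names, and the choice of mean imputation are illustrative stand-ins for the real diabetes.csv:

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy stand-in for diabetes.csv (the real file has more rows and columns).
df = pd.DataFrame({
    "Glucose":       [148, 0, 183, 89],   # 0 here really means "missing"
    "BloodPressure": [72, 66, 64, 0],
    "BMI":           [33.6, 26.6, 23.3, 28.1],
    "Outcome":       [1, 0, 1, 0],
})

# Replace the hidden-missing zeros with each column's mean.
cols = ["Glucose", "BloodPressure"]
imputer = SimpleImputer(missing_values=0, strategy="mean")
df[cols] = imputer.fit_transform(df[cols])

# Feature scaling: put all features on a comparable scale.
X = df.drop(columns=["Outcome"])
X_scaled = StandardScaler().fit_transform(X)
```

Note that the dataset has no explicit categorical columns, so the encoding step would be a no-op here.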

Data Splitting:

Both bagging and boosting require splitting the data into training and testing sets.
The training set is used to build the ensemble models.
The testing set is used to evaluate the final predictions from the ensemble.
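
A minimal sketch of the split, using synthetic arrays in place of the pre-processed diabetes features (the test_size and random_state values are arbitrary choices for the example):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features/labels standing in for the pre-processed diabetes data.
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

# Hold out 25% for testing; stratify keeps the class balance in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)
```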
Ensemble Methods:

1. Bagging (Bootstrap Aggregation): In bagging, multiple models are trained on different samples drawn with replacement from the original data. This process helps reduce variance in the final predictions. Data mining isn't directly involved here, but the quality of the pre-processed data is crucial.
2. Boosting: Boosting algorithms train models sequentially; each model tries to learn from the errors of the previous one. Data mining isn't directly involved, but appropriate data pre-processing ensures the models can learn effectively.
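
The sampling-with-replacement step at the heart of bagging can be illustrated with pandas on a toy column (the data and seed are made up for the example):

```python
import pandas as pd

df = pd.DataFrame({"Glucose": [148, 85, 183, 89, 137]})

# One bootstrap sample: same size as the original, drawn with replacement,
# so some rows may repeat while others are left out.
boot = df.sample(frac=1.0, replace=True, random_state=0)
```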

B) THE DATASET WE HAVE CHOSEN FOR OUR MINI PROJECT:

https://www.kaggle.com/code/faressayah/ensemble-ml-algorithms-bagging-boosting-voting#Ensemble-Machine-Learning-Algorithms-in-Python-with-scikit-learn
C) HERE WE ARE GOING TO PERFORM EDA (EXPLORATORY DATA
ANALYSIS)

1) Loading the dataset (df = pd.read_csv("/content/diabetes.csv")).
2) Checking basic information about the dataset (df.info()).
3) Checking for missing values (df.isnull().sum()).
4) Displaying descriptive statistics of the dataset (df.describe()).
5) Checking for categorical and continuous variables.
6) Visualizing correlations between features using a heatmap (sns.heatmap(df.corr(), annot=True, cmap='viridis', fmt=".2f")).
7) Visualizing the distribution of features with respect to the outcome using boxplots (sns.boxplot(x='Outcome', y=feature, data=df)).
8) Visualizing the age distribution by outcome using kernel density estimation (sns.kdeplot()).
9) Identifying and replacing missing zero values in feature columns with the mean using SimpleImputer.
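
The first few EDA steps can be sketched like this; the toy DataFrame stands in for the real diabetes.csv, which is not reproduced here:

```python
import pandas as pd

# Toy frame standing in for pd.read_csv("/content/diabetes.csv").
df = pd.DataFrame({
    "Glucose": [148, 85, 183, 89],
    "Age":     [50, 31, 32, 21],
    "Outcome": [1, 0, 1, 0],
})

df.info()                 # dtypes and non-null counts
print(df.isnull().sum())  # missing values per column
print(df.describe())      # descriptive statistics
print(df.corr())          # correlation matrix (this is what sns.heatmap visualizes)
```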
D) NOW DATA PRE-PROCESSING IS GOING TO BE DONE

As we can see, all the null values are removed from the data.
Then the training of the data is done:
E) NOW THE ALGORITHMS WE ARE IMPLEMENTING:
1) Bagging Classifier with Decision Trees:
The Bagging Classifier is utilized with Decision Trees as base estimators.
This ensemble method combines multiple decision tree models trained on
different subsets of the training data, with replacement. The final prediction
is typically determined by averaging the predictions of individual trees (for
regression) or by taking a majority vote (for classification).
It's implemented using BaggingClassifier from the sklearn.ensemble module.

2) Gradient Boosting Classifier:
The Gradient Boosting Classifier is employed, which is another ensemble learning technique. Unlike bagging, where models are trained independently and combined, gradient boosting builds models sequentially, with each new model correcting errors made by the previous ones.
It's implemented using GradientBoostingClassifier from the sklearn.ensemble module.
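
A minimal sketch of both classifiers on synthetic data; the generated dataset, estimator counts, and seeds are assumptions for illustration, not the project's actual configuration. Note that BaggingClassifier's default base estimator is already a decision tree:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the diabetes features/labels.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Bagging: trees trained in parallel on bootstrap samples (the default
# base estimator is a DecisionTreeClassifier).
bag = BaggingClassifier(n_estimators=50, random_state=42).fit(X_train, y_train)

# Boosting: trees are built sequentially, each correcting its predecessors.
gbc = GradientBoostingClassifier(n_estimators=50, random_state=42).fit(X_train, y_train)

print(bag.score(X_test, y_test), gbc.score(X_test, y_test))
```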

Here the confusion matrix is generated after training the model:


1. Building and Evaluating Bagging Classifier:

These lines create a Bagging Classifier with a Decision Tree base estimator, fit it to
the training data, and then evaluate its performance on both training and testing
data using the evaluate function.
2. Building and Evaluating Gradient Boosting Classifier:

Similarly, these lines create a Gradient Boosting Classifier, fit it to the training
data, and evaluate its performance on both training and testing data using the
evaluate function.

3) Visualizing Results:
scores_df.plot(kind='barh', figsize=(15, 8))
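
A sketch of how such a scores_df might be assembled and plotted; the accuracy numbers below are hypothetical placeholders, not results from the project:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import pandas as pd

# Hypothetical accuracy scores collected from the evaluate runs.
scores_df = pd.DataFrame({
    "Bagging Classifier":           {"Train": 0.98, "Test": 0.77},
    "Gradient Boosting Classifier": {"Train": 0.93, "Test": 0.75},
})

ax = scores_df.plot(kind='barh', figsize=(15, 8))
ax.set_xlabel("Accuracy")
```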
F.) TO IDENTIFY FROM BOTH MODELS WHICH PERFORMS THE
BEST:
The use of all the model performance measures, including confusion matrix,
accuracy score, and classification report, is done within the evaluate function for
both the training and testing sets. The evaluate function calculates these measures
for each model (Bagging Classifier and Gradient Boosting Classifier) and prints
them out for analysis.
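
One possible shape for such an evaluate function, sketched on synthetic data; the helper name matches the text, but its exact signature in the original code is an assumption:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

def evaluate(model, X_train, X_test, y_train, y_test):
    """Print confusion matrix, accuracy score and classification report
    for both the training and the testing split."""
    for name, X, y in (("TRAIN", X_train, y_train), ("TEST", X_test, y_test)):
        pred = model.predict(X)
        print(f"--- {name} ---")
        print(confusion_matrix(y, pred))
        print("accuracy:", accuracy_score(y, pred))
        print(classification_report(y, pred))

# Demonstration on synthetic data standing in for the diabetes dataset.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = BaggingClassifier(random_state=0).fit(X_train, y_train)
evaluate(model, X_train, X_test, y_train, y_test)
```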


Remarks on model performance:

Bagging Classifier:
Train Accuracy: High
Test Accuracy: Slightly lower than the train accuracy but still high
Remarks: The Bagging Classifier performs well on both training and testing data, indicating good generalization ability.

Gradient Boosting Classifier:
Train Accuracy: High
Test Accuracy: Slightly lower than the train accuracy but still high
Remarks: The Gradient Boosting Classifier also performs well on both training and testing data, showing strong predictive performance.
Business Intelligence (BI) Decision:
Based on the performance of both models, it appears that they are capable of
effectively predicting diabetes onset. However, to determine the best model for
deployment in a real-world scenario, other factors such as computational
efficiency, interpretability, and specific business requirements need to be
considered.

If computational resources are limited and a simpler model is preferred, the Bagging Classifier could be a good choice due to its slightly better performance on the testing data compared to the Gradient Boosting Classifier.

On the other hand, if interpretability is important and computational resources are sufficient, the Gradient Boosting Classifier might be preferred, as it tends to provide more interpretable results and can handle complex relationships between variables effectively.
