DMBI

The document outlines a project aimed at predicting diabetes onset using machine learning algorithms, specifically Bagging and Gradient Boosting Classifiers, trained on a dataset with various health indicators. It details the necessary data pre-processing steps, exploratory data analysis (EDA) methods, and the evaluation of model performance through confusion matrices and accuracy scores. The final decision on model deployment considers factors like computational efficiency and interpretability, suggesting the Bagging Classifier for simpler applications and the Gradient Boosting Classifier for more complex scenarios.

GROUP: MOHIT, KAUSHIK AND SANCHIT    ROLL: 34, 30, 38

PROBLEM STATEMENT:
"Predicting Diabetes Onset: Utilizing a dataset containing various health
indicators, including pregnancies, glucose levels, blood pressure, skin thickness,
insulin levels, BMI, pedigree function, and age, we aim to develop predictive
models that can effectively identify individuals at risk of developing diabetes. By
employing machine learning algorithms such as Bagging Classifier and Gradient
Boosting Classifier, trained on a labeled dataset, we seek to build robust models
capable of accurately classifying individuals into diabetic and non-diabetic groups.
The ultimate goal is to develop a predictive tool that can assist healthcare
professionals in early detection and intervention for individuals predisposed to
diabetes, thereby improving health outcomes and reducing the burden of the
disease.

A) WHICH DATA MINING TASK IS NEEDED FOR OUR DATASET:
Bagging and boosting are ensemble machine learning techniques. They benefit from certain data pre-processing steps common in data mining.

Data Pre-processing:

The CSV file (diabetes.csv) needs pre-processing before it can be used with bagging or boosting algorithms. This involves:
Handling missing values (filling them with appropriate strategies or removing
rows/columns with too many missing entries).
Encoding categorical variables into numerical ones (if present).
Feature scaling (ensuring all features are on a similar scale).
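
The pre-processing steps above can be sketched as follows; the tiny DataFrame, the column names, and the choice of mean imputation are illustrative stand-ins for the real diabetes.csv:

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy stand-in for diabetes.csv (the real file has more rows and columns).
df = pd.DataFrame({
    "Glucose":       [148, 0, 183, 89],   # 0 here really means "missing"
    "BloodPressure": [72, 66, 64, 0],
    "BMI":           [33.6, 26.6, 23.3, 28.1],
    "Outcome":       [1, 0, 1, 0],
})

# Replace the hidden-missing zeros with each column's mean.
cols = ["Glucose", "BloodPressure"]
imputer = SimpleImputer(missing_values=0, strategy="mean")
df[cols] = imputer.fit_transform(df[cols])

# Feature scaling: put all features on a comparable scale.
X = df.drop(columns=["Outcome"])
X_scaled = StandardScaler().fit_transform(X)
```

Note that the dataset has no explicit categorical columns, so the encoding step would be a no-op here.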

Data Splitting:

Both bagging and boosting require splitting the data into training and testing sets.
The training set is used to build the ensemble models.
The testing set is used to evaluate the final predictions from the ensemble.
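
A minimal sketch of the split, using synthetic arrays in place of the pre-processed diabetes features (the test_size and random_state values are arbitrary choices for the example):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features/labels standing in for the pre-processed diabetes data.
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

# Hold out 25% for testing; stratify keeps the class balance in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)
```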
Ensemble Methods:

1. Bagging (Bootstrap Aggregation): In bagging, multiple models are trained on different samples drawn with replacement from the original data. This process helps reduce variance in the final predictions. Data mining isn't directly involved here, but the quality of the pre-processed data is crucial.
2. Boosting: Boosting algorithms train models sequentially; each model tries to learn from the errors of the previous one. Data mining isn't directly involved, but appropriate data pre-processing ensures the models can learn effectively.
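
The sampling-with-replacement step at the heart of bagging can be illustrated with pandas on a toy column (the data and seed are made up for the example):

```python
import pandas as pd

df = pd.DataFrame({"Glucose": [148, 85, 183, 89, 137]})

# One bootstrap sample: same size as the original, drawn with replacement,
# so some rows may repeat while others are left out.
boot = df.sample(frac=1.0, replace=True, random_state=0)
```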

B) THE DATASET WE HAVE CHOSEN FOR OUR MINI PROJECT:

https://www.kaggle.com/code/faressayah/ensemble-ml-algorithms-bagging-boosting-voting#Ensemble-Machine-Learning-Algorithms-in-Python-with-scikit-learn
C) HERE WE ARE GOING TO PERFORM EDA (EXPLORATORY DATA
ANALYSIS)

1) Loading the dataset (df = pd.read_csv("/content/diabetes.csv")).
2) Checking basic information about the dataset (df.info()).
3) Checking for missing values (df.isnull().sum()).
4) Displaying descriptive statistics of the dataset (df.describe()).
5) Checking for categorical and continuous variables.
6) Visualizing correlations between features using a heatmap (sns.heatmap(df.corr(), annot=True, cmap='viridis', fmt=".2f")).
7) Visualizing the distribution of features with respect to the outcome using boxplots (sns.boxplot(x='Outcome', y=feature, data=df)).
8) Visualizing the age distribution by outcome using kernel density estimation (sns.kdeplot()).
9) Identifying and replacing missing zero values in feature columns with the mean using SimpleImputer.
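
The first few EDA steps can be sketched like this; the toy DataFrame stands in for the real diabetes.csv, which is not reproduced here:

```python
import pandas as pd

# Toy frame standing in for pd.read_csv("/content/diabetes.csv").
df = pd.DataFrame({
    "Glucose": [148, 85, 183, 89],
    "Age":     [50, 31, 32, 21],
    "Outcome": [1, 0, 1, 0],
})

df.info()                 # dtypes and non-null counts
print(df.isnull().sum())  # missing values per column
print(df.describe())      # descriptive statistics
print(df.corr())          # correlation matrix (this is what sns.heatmap visualizes)
```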
D) NOW DATA PRE-PROCESSING IS GOING TO BE DONE

As we can see, all the null values are removed from the data.
Then the training of the data is done:
E) NOW THE ALGORITHMS WE ARE IMPLEMENTING:
1) Bagging Classifier with Decision Trees:
The Bagging Classifier is utilized with Decision Trees as base estimators.
This ensemble method combines multiple decision tree models trained on
different subsets of the training data, with replacement. The final prediction
is typically determined by averaging the predictions of individual trees (for
regression) or by taking a majority vote (for classification).
It's implemented using BaggingClassifier from the sklearn.ensemble module.

2) Gradient Boosting Classifier:
The Gradient Boosting Classifier is employed, which is another ensemble learning technique. Unlike bagging, where models are trained independently and combined, gradient boosting builds models sequentially, with each new model correcting errors made by the previous ones.
It's implemented using GradientBoostingClassifier from the sklearn.ensemble module.
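
A minimal sketch of both classifiers on synthetic data; the generated dataset, estimator counts, and seeds are assumptions for illustration, not the project's actual configuration. Note that BaggingClassifier's default base estimator is already a decision tree:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the diabetes features/labels.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Bagging: trees trained in parallel on bootstrap samples (the default
# base estimator is a DecisionTreeClassifier).
bag = BaggingClassifier(n_estimators=50, random_state=42).fit(X_train, y_train)

# Boosting: trees are built sequentially, each correcting its predecessors.
gbc = GradientBoostingClassifier(n_estimators=50, random_state=42).fit(X_train, y_train)

print(bag.score(X_test, y_test), gbc.score(X_test, y_test))
```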

Here the confusion matrix is generated after training the model:


1. Building and Evaluating Bagging Classifier:

These lines create a Bagging Classifier with a Decision Tree base estimator, fit it to
the training data, and then evaluate its performance on both training and testing
data using the evaluate function.
2. Building and Evaluating Gradient Boosting Classifier:

Similarly, these lines create a Gradient Boosting Classifier, fit it to the training
data, and evaluate its performance on both training and testing data using the
evaluate function.

3) Visualizing Results:
scores_df.plot(kind='barh', figsize=(15, 8))
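
A sketch of how such a scores_df might be assembled and plotted; the accuracy numbers below are hypothetical placeholders, not results from the project:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import pandas as pd

# Hypothetical accuracy scores collected from the evaluate runs.
scores_df = pd.DataFrame({
    "Bagging Classifier":           {"Train": 0.98, "Test": 0.77},
    "Gradient Boosting Classifier": {"Train": 0.93, "Test": 0.75},
})

ax = scores_df.plot(kind='barh', figsize=(15, 8))
ax.set_xlabel("Accuracy")
```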
F.) TO IDENTIFY FROM BOTH MODELS WHICH PERFORMS THE
BEST:
The use of all the model performance measures, including confusion matrix,
accuracy score, and classification report, is done within the evaluate function for
both the training and testing sets. The evaluate function calculates these measures
for each model (Bagging Classifier and Gradient Boosting Classifier) and prints
them out for analysis.
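
One possible shape for such an evaluate function, sketched on synthetic data; the helper name matches the text, but its exact signature in the original code is an assumption:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

def evaluate(model, X_train, X_test, y_train, y_test):
    """Print confusion matrix, accuracy score and classification report
    for both the training and the testing split."""
    for name, X, y in (("TRAIN", X_train, y_train), ("TEST", X_test, y_test)):
        pred = model.predict(X)
        print(f"--- {name} ---")
        print(confusion_matrix(y, pred))
        print("accuracy:", accuracy_score(y, pred))
        print(classification_report(y, pred))

# Demonstration on synthetic data standing in for the diabetes dataset.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = BaggingClassifier(random_state=0).fit(X_train, y_train)
evaluate(model, X_train, X_test, y_train, y_test)
```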


Remarks on model performance:

Bagging Classifier:
Train Accuracy: High
Test Accuracy: Slightly lower than the train accuracy but still high
Remarks: The Bagging Classifier performs well on both training and testing data, indicating good generalization ability.

Gradient Boosting Classifier:
Train Accuracy: High
Test Accuracy: Slightly lower than the train accuracy but still high
Remarks: The Gradient Boosting Classifier also performs well on both training and testing data, showing strong predictive performance.
Business Intelligence (BI) Decision:
Based on the performance of both models, it appears that they are capable of
effectively predicting diabetes onset. However, to determine the best model for
deployment in a real-world scenario, other factors such as computational
efficiency, interpretability, and specific business requirements need to be
considered.

If computational resources are limited and a simpler model is preferred, the Bagging Classifier could be a good choice due to its slightly better performance on the testing data compared to the Gradient Boosting Classifier.

On the other hand, if interpretability is important and computational resources are sufficient, the Gradient Boosting Classifier might be preferred, as it tends to provide more interpretable results and can handle complex relationships between variables effectively.
