
MACHINE LEARNING

UNIT-V: Ensemble Learning

Ensemble Learning Model Combination Schemes, Voting, Error-Correcting Output Codes, Bagging: Random Forest Trees, Boosting: AdaBoost, Stacking.

Ensemble Learning Model Combination Schemes:


Ensemble learning is a general meta-approach to machine learning that seeks better
predictive performance by combining the predictions from multiple models.
Ensemble learning is a machine learning technique combining multiple individual models to
create a stronger, more accurate predictive model. By leveraging the diverse strengths of
different models, ensemble learning aims to mitigate errors, enhance performance, and
increase the overall robustness of predictions, leading to improved results across various
tasks in machine learning and data analysis.
The ensemble methods in machine learning combine the insights obtained from multiple
learning models to facilitate accurate and improved decisions. These methods follow the
same principle as asking several salespeople and friends for their opinions before buying an
air-conditioner in a showroom: each individual opinion may be imperfect, but combining them
usually leads to a better decision.
Ensemble methods are broadly grouped into two categories: Simple Ensemble Methods & Advanced Ensemble Methods

Simple Ensemble Methods:


1. Mode:
In statistical terminology, "mode" is the number or value that most often appears in a
dataset of numbers or values. In this ensemble technique, machine learning professionals
use a number of models for making predictions about each data point. The predictions
made by different models are taken as separate votes. Subsequently, the prediction made
by most models is treated as the ultimate prediction.
2. Mean/Average:
In the mean/average ensemble technique, data analysts take the average of the predictions
made by all models into account when making the ultimate prediction. Suppose, for
instance, that one hundred people rated the beta release of your travel and tourism app on a
scale of 1 to 5, where 15 people gave a rating of 1, 28 people gave a rating of 2, 37 people
gave a rating of 3, 12 people gave a rating of 4, and 8 people gave a rating of 5. The average
in this case is [(1 * 15) + (2 * 28) + (3 * 37) + (4 * 12) + (5 * 8)] / 100 = 270 / 100 = 2.7
3. Weighted Average:
In the weighted average ensemble method, data scientists assign different weights to all
the models in order to make a prediction, where the assigned weight defines the relevance
of each model.
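A minimal NumPy sketch of these three simple combination schemes; the model predictions and weights below are made-up values for illustration:

import numpy as np

# Hypothetical predictions from five models for one data point.
class_votes = np.array([1, 0, 1, 1, 0])            # class labels from 5 classifiers
reg_preds = np.array([2.4, 2.9, 2.7, 3.1, 2.5])    # numeric predictions from 5 regressors
weights = np.array([0.4, 0.1, 0.2, 0.2, 0.1])      # assumed model weights (sum to 1)

# 1. Mode (majority vote): the most frequent class label wins.
mode_prediction = np.bincount(class_votes).argmax()          # -> 1

# 2. Mean/average: the simple average of the numeric predictions.
mean_prediction = reg_preds.mean()                           # -> 2.72

# 3. Weighted average: more relevant models get a larger say.
weighted_prediction = np.average(reg_preds, weights=weights)

print(mode_prediction, mean_prediction, weighted_prediction)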

Advanced Ensemble Methods: Bagging, Boosting, and Stacking:


Ensemble learning combines multiple models to improve predictive performance, reduce
errors, and enhance generalization. The three main ensemble methods are:
1. Bagging (Bootstrap Aggregating) – Reducing Variance
2. Boosting – Reducing Bias and Improving Performance
3. Stacking – Combining Models for Better Predictions
1. Bagging (Bootstrap Aggregating)
Goal: Reduce variance and prevent overfitting.

How It Works:
 Bagging creates multiple subsets of the training data by randomly selecting data
points with replacement (bootstrap sampling).
 Each subset is used to train a separate model (e.g., decision trees in Random Forest).
 The final prediction is obtained by averaging the outputs (for regression) or majority
voting (for classification).
Steps in Bagging:
1. Create multiple random subsets of the dataset (bootstrap samples).
2. Train a separate model for each subset independently.
3. Run all models in parallel.
4. Combine their predictions by averaging (for regression) or voting (for classification).
Example: Random Forest is a popular bagging algorithm that trains multiple decision
trees and averages their predictions.
Advantages of Bagging:
 Reduces variance and prevents overfitting.
 Works well with high-variance models (e.g., decision trees).
 Can be used for both classification and regression tasks.
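A minimal sketch of bagging with scikit-learn's BaggingClassifier, whose default base learner is a decision tree; the dataset and parameter values are illustrative assumptions:

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

# Toy dataset (illustrative) for the sketch.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Bagging: 50 decision trees, each trained on a bootstrap sample drawn with
# replacement; their predictions are combined by majority voting.
bagging = BaggingClassifier(n_estimators=50, random_state=42)
bagging.fit(X_train, y_train)
print("Bagging accuracy:", bagging.score(X_test, y_test))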
2. Boosting – Reducing Bias and Improving Performance
Goal: Convert weak learners into strong learners by focusing on misclassified data.

[Figure: an illustration of the intuition behind the boosting algorithm, showing the weak learners and the progressively re-weighted dataset.]
How It Works:
 Unlike bagging, boosting trains models sequentially, where each new model focuses
on correcting the errors of the previous one.
 In each step, misclassified data points are given higher weights, so the next model
learns better.
 The final model is obtained by combining the predictions of all models, weighted by
their accuracy.
Steps in Boosting:
1. Train the first model on the entire dataset.
2. Identify misclassified data points and assign them higher weights.
3. Train the next model, focusing more on the misclassified points.
4. Repeat the process, continuously improving the model.
5. The final prediction is obtained by weighted averaging (regression) or weighted
voting (classification).
Popular Boosting Algorithms:
 AdaBoost (Adaptive Boosting) – Increases weights of misclassified points.
 Gradient Boosting – Minimizes residual errors using gradient descent.
 XGBoost – Optimized version of Gradient Boosting (faster, handles missing data).
Advantages of Boosting:
 Increases accuracy by correcting previous errors.
 Works well with complex datasets.
 Handles imbalanced data better than bagging.
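A minimal sketch of boosting using scikit-learn's GradientBoostingClassifier; the dataset and hyperparameter values are illustrative assumptions:

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each of the 100 shallow trees is fitted sequentially to the errors of the
# ensemble built so far; learning_rate shrinks each tree's contribution.
booster = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                     max_depth=3, random_state=0)
booster.fit(X_train, y_train)
print("Boosting accuracy:", booster.score(X_test, y_test))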
3. Stacking – Combining Models for Better Predictions
Goal: Improve predictive performance by learning how to best combine multiple models.

How It Works:
 Stacking trains multiple models (base learners) in parallel on the same dataset.
 Their predictions are used as inputs for a meta-model, which makes the final
prediction.
 The meta-model learns how to best combine the outputs of the base models.
Steps in Stacking:
1. Train multiple models (e.g., Decision Tree, SVM, Neural Network) on the same
dataset.
2. Collect predictions from all models.
3. Train a meta-learner (e.g., Logistic Regression) using these predictions as input.
4. The meta-model makes the final prediction.
Example:
A stacking model might combine:
 Decision Trees (capture non-linear relationships)
 Support Vector Machines (SVM) (good for high-dimensional data)
 Neural Networks (good for complex patterns)
 Meta-learner: Logistic Regression (learns how to combine all the models)
Advantages of Stacking
 Uses different algorithms to capture different patterns.
 Can outperform individual models when properly tuned.
 Works well for both classification and regression.
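A minimal scikit-learn sketch of such a stack, using a decision tree and an SVM as base learners and logistic regression as the meta-learner (the dataset and settings are illustrative):

from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Base learners trained in parallel on the same data.
base_learners = [
    ("tree", DecisionTreeClassifier(random_state=0)),
    ("svm", SVC(random_state=0)),
]

# The meta-learner (logistic regression) learns how to combine their outputs,
# using out-of-fold predictions produced by 5-fold cross-validation.
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(max_iter=1000),
                           cv=5)
stack.fit(X, y)
print("Stacking accuracy (train):", stack.score(X, y))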
Comparison of Bagging, Boosting, and Stacking

Feature | Bagging | Boosting | Stacking
Purpose | Reduce variance | Reduce bias & improve accuracy | Improve model predictions
Process | Parallel (independent models) | Sequential (corrects errors) | Parallel (combines different models)
Final Prediction | Majority voting (classification), averaging (regression) | Weighted sum of weak models | Meta-model learns best combination
Overfitting Risk | Low | Moderate (if too many iterations) | Low
Example Models | Random Forest | AdaBoost, XGBoost, Gradient Boosting | Logistic Regression as meta-learner

Voting:
Voting ensembles are an ensemble technique that trains multiple machine learning models
and then combines the predictions from all the individual models to produce the output. Voting
can be used in both classification and regression tasks. A Voting Classifier is a machine
learning model that trains an ensemble of several models and predicts an output (class)
based on the class that receives the strongest combined support. The idea is that, instead of
creating separate dedicated models and evaluating the accuracy of each one, we create a
single ensemble that trains these models together and predicts the output based on their
combined majority of votes for each output class.

There are several types of voting methods in ensemble learning:


Hard Voting:
In hard voting, each model in the ensemble predicts the class label (for classification)
or numerical value (for regression), and the final prediction is determined by majority voting
(for classification) or averaging (for regression). For classification tasks, the class that
receives the majority of votes is the final predicted class.
For example, if you have three classifiers in an ensemble predicting class labels for a binary
classification problem and they predict [1, 0, 1], the majority class is 1, so the ensemble
prediction would be 1.
Soft Voting:
In soft voting, each model in the ensemble predicts the probability estimates for all
possible outcomes (for classification) or numerical values (for regression). The final
prediction is then calculated by averaging the probability estimates from all models and
choosing the class with the highest average probability for classification tasks. Soft voting
often works better than hard voting because it takes into account the confidence of each
model's predictions.
Weighted Voting:
In this approach, different models in the ensemble are assigned different weights
based on their performance or reliability. When making predictions, the models with higher
weights have a stronger influence on the final prediction. Weighted voting allows giving
more importance to the models that are more accurate or have higher confidence.
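A minimal scikit-learn sketch of hard, soft, and weighted voting; the three base classifiers and the weights are illustrative assumptions:

from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
models = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("dt", DecisionTreeClassifier(random_state=0)),
    ("nb", GaussianNB()),
]

# Hard voting: each model casts one vote for a class label; the majority wins.
hard = VotingClassifier(estimators=models, voting="hard").fit(X, y)

# Soft voting: class probabilities are averaged; the highest average wins.
soft = VotingClassifier(estimators=models, voting="soft").fit(X, y)

# Weighted voting: more reliable models get a larger say in the average.
weighted = VotingClassifier(estimators=models, voting="soft",
                            weights=[2, 1, 1]).fit(X, y)

print(hard.score(X, y), soft.score(X, y), weighted.score(X, y))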

Error Correcting Output Codes (ECOC):


Error Correcting Output Codes (ECOC) is a strategy that decomposes a multi-class
classification problem into multiple binary classification problems. By assigning binary codes to each
class, ECOC leverages error correction principles to handle classification errors more effectively.

While binary classification is relatively straightforward, multi-class classification can


be more complex. Error Correcting Output Codes (ECOC) is a technique designed to simplify
and improve multi-class classification by decomposing it into multiple binary classification
problems.
ECOC was first introduced by Dietterich and Bakiri in 1995 as a method to enhance
the performance of classifiers by leveraging error-correcting codes. The core idea is to
represent each class with a unique binary code and then train binary classifiers to
distinguish between these codes. The concept of ECOC stems from the domain of error-
correcting codes in information theory, where redundancy is added to data to detect and
correct errors during transmission.
How Does Error Correcting Output Codes (ECOC) Work?
The process of ECOC involves two main steps: encoding and decoding.
1. Encoding (Training Phase)
In this phase, a code matrix is created, and multiple binary classifiers are trained.
Step 1: Code Matrix Construction
 The multi-class classification problem is represented as a binary classification
problem using a code matrix.
 The code matrix (also called the ECOC matrix) consists of rows representing classes
and columns representing binary classifiers.
 The entries in the matrix are typically {-1, 0, 1}:
o 1 and -1 indicate the two classes used in a binary classifier.
o 0 means the binary classifier does not consider that class.
For example, if we have four classes (C1, C2, C3, C4) and want to classify them using three
binary classifiers, our code matrix may look like this:

Class Classifier 1 Classifier 2 Classifier 3

C1 1 -1 1

C2 -1 1 1

C3 1 1 -1

C4 -1 -1 -1

Here, each row (codeword) uniquely represents a class.


Step 2: Training Binary Classifiers
 Each column represents a binary classification problem.
 A separate binary classifier is trained using the dataset where the labels are
determined by the entries in that column.
 Popular binary classifiers include Logistic Regression, Support Vector Machines
(SVM), and Decision Trees.
 The classifiers are trained to distinguish between the classes they are assigned to.
2. Decoding (Classification Phase):
In this phase, the trained classifiers are used to predict the class of a new data point.
Step 1: Making Predictions
 When a new instance is given as input, each trained binary classifier predicts a 1 or -1.
 This results in a binary vector that represents the output of all classifiers.
Example:
If a new instance is processed, and the classifiers output the vector [1, -1, 1], this vector is
compared to the rows of the code matrix.
Step 2: Codeword Matching (Hamming Distance)
 The predicted binary vector is compared to all the rows of the code matrix.
 Hamming Distance is used to measure similarity:
o Hamming Distance between two binary vectors is the number of positions
where the values are different.
 The class whose codeword has the minimum Hamming distance from the predicted
vector is chosen as the final class.
Example:
If our predicted vector is [1, -1, 1]:
 Distance to C1 = 0 (Exact match)
 Distance to C2 = 2 (Two different values)
 Distance to C3 = 2
 Distance to C4 = 2
Since C1 has the smallest Hamming distance (0), the new instance is classified as C1.
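The decoding step can be written directly in NumPy; this sketch reuses the four-class code matrix and the predicted vector [1, -1, 1] from the example above:

import numpy as np

# Code matrix from the example: one row (codeword) per class,
# one column per binary classifier.
code_matrix = np.array([
    [ 1, -1,  1],   # C1
    [-1,  1,  1],   # C2
    [ 1,  1, -1],   # C3
    [-1, -1, -1],   # C4
])
classes = ["C1", "C2", "C3", "C4"]

predicted = np.array([1, -1, 1])   # outputs of the three binary classifiers

# Hamming distance: number of positions where codeword and prediction differ.
distances = (code_matrix != predicted).sum(axis=1)
print(dict(zip(classes, distances.tolist())))                 # {'C1': 0, 'C2': 2, 'C3': 2, 'C4': 2}
print("Predicted class:", classes[int(distances.argmin())])   # C1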
ECOC with AdaBoost:
ECOC can also be used with AdaBoost, which is an ensemble learning technique that
improves classification accuracy.
 AdaBoost learners are trained iteratively, with each iteration focusing on data points
that were misclassified by previous classifiers.
 In the classification phase, AdaBoost assigns weights to each prediction, and the final
prediction is chosen based on the weighted votes.
 Instead of simply choosing the class with the minimum Hamming distance, an argmin over
the weighted errors (loss-based decoding) is used, so the class whose codeword best agrees
with the weighted AdaBoost predictions is selected.
Advantages of ECOC
 Enhances Binary Classifiers – Extends traditional binary classifiers to handle multi-
class problems.
 Error Tolerance – Introduces redundancy, making the system more robust against
misclassifications.
 Parallel Computation – Since multiple classifiers work independently, they can be
trained in parallel.
Disadvantages of ECOC
 Requires More Classifiers – Needs multiple binary classifiers, increasing computation
time.
 Code Matrix Design – The choice of a good code matrix is crucial and can affect
accuracy.
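For reference, scikit-learn provides an ECOC implementation in sklearn.multiclass.OutputCodeClassifier; a minimal sketch, where the base estimator and the code_size value are illustrative choices:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OutputCodeClassifier

X, y = load_iris(return_X_y=True)

# code_size=2 means roughly 2 * n_classes binary classifiers; the code matrix
# is generated randomly, and each column trains one binary LogisticRegression.
ecoc = OutputCodeClassifier(estimator=LogisticRegression(max_iter=1000),
                            code_size=2, random_state=0)
ecoc.fit(X, y)
print("ECOC accuracy (train):", ecoc.score(X, y))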

AdaBoost (Adaptive Boosting):


AdaBoost algorithm, short for Adaptive Boosting, is a Boosting technique used as an
Ensemble Method in Machine Learning. It is called Adaptive Boosting as the weights are re-
assigned to each instance, with higher weights assigned to incorrectly classified instances.
Boosting is used to reduce bias as well as variance for supervised learning.
It works on the principle of learners growing sequentially. Except for the first, each
subsequent learner is grown from previously grown learners. In simple words, weak learners
are converted into strong ones.
The AdaBoost algorithm works on the same principle as boosting, with a slight
difference. It builds 'n' decision trees (often shallow trees) during the training period. Once
the first decision tree/model is built, the records it misclassified are given higher weights, so
they are emphasized, and more likely to be resampled, when the second model is trained.
The process continues until the specified number of base learners has been created. Note
that repetition of records is allowed in boosting techniques that use resampling.
The models 1, 2, 3, …, N are the individual weak learners (typically decision trees). All
boosting variants work on this same sequential principle.
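A minimal scikit-learn sketch of AdaBoost in this spirit; the default weak learner in AdaBoostClassifier is a depth-1 decision tree (a "stump"), and the dataset and hyperparameters below are illustrative assumptions:

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# 50 weak learners are trained in sequence; after each round the weights of
# misclassified samples are increased so the next learner focuses on them,
# and each learner's vote is weighted by its accuracy.
ada = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=1)
ada.fit(X_train, y_train)
print("AdaBoost accuracy:", ada.score(X_test, y_test))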

Random Forest Algorithm:


What is Random Forest?
Random Forest is a supervised machine learning algorithm used for classification and
regression tasks. It is an ensemble learning method that creates multiple decision trees and
combines their outputs to produce a more accurate, stable, and generalized prediction.
The core idea of Random Forest is to use Bootstrap Aggregation (Bagging) to
improve accuracy and reduce overfitting by training multiple decision trees on different
subsets of data.
How Does Random Forest Work?
The Random Forest algorithm consists of the following key steps:
Step 1: Data Sampling (Bootstrap Sampling)
 Instead of using the entire dataset for training a single decision tree, Random Forest
creates multiple random subsets of data.
 It selects n random records (with replacement) from the dataset to train each
decision tree.
Step 2: Feature Selection (Random Feature Subset)
 At each node of a decision tree, Random Forest randomly selects a subset of
features instead of using all available features.
 This ensures diversity among trees, reducing overfitting and increasing accuracy.
Step 3: Decision Tree Construction
 Multiple decision trees are trained on different bootstrap samples.
 Each tree makes its own prediction for a given input.
Step 4: Aggregating the Predictions
 For classification, the final prediction is determined by majority voting (the most
common output among all trees).
 For regression, the final prediction is the average of all tree predictions.
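A minimal scikit-learn sketch of these four steps; the dataset and parameter values are assumptions, and oob_score=True exposes the out-of-bag estimate mentioned under the key features below:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

# 100 trees, each trained on a bootstrap sample (Step 1) and restricted to a
# random subset of features at every split (Step 2, max_features="sqrt").
# Predictions are combined by majority vote (Step 4).
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            oob_score=True, random_state=7)
rf.fit(X_train, y_train)

print("Out-of-bag score:", rf.oob_score_)   # built-in validation estimate
print("Test accuracy:", rf.score(X_test, y_test))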

Example: Classifying Fruits in a Basket


Imagine we have a basket containing different fruits, and we need to classify them as apple
or banana.
Step 1: Create Multiple Decision Trees
 We take multiple samples from the basket and train decision trees separately.
 Each tree learns from a different sample and produces an individual prediction.
Step 2: Majority Voting
 If more trees predict "apple" than "banana," then the final output is apple.
 If more trees predict "banana," then the final output is banana.
This process ensures stability, robustness, and reduced overfitting compared to a single
decision tree.
Key Features of Random Forest:
Diversity: Each decision tree is trained on different subsets of data and features, preventing
overfitting.
Less Sensitive to the Curse of Dimensionality: Since only a random subset of features is
considered at each split, it often works well with high-dimensional data.
Parallelization: Trees are built independently, allowing efficient use of CPU resources.
Built-in Validation (Out-of-Bag Samples): Since each tree is trained on a bootstrap sample,
roughly one-third of the records are left out of that tree's training set; these out-of-bag (OOB)
samples act as built-in validation data.
Stability: Combining multiple trees results in more stable and generalized predictions
compared to individual trees.
Advantages of Random Forest
1. High Accuracy
 Random Forest provides higher accuracy than a single decision tree because it
aggregates multiple trees’ decisions.
 Reduces variance and overfitting, leading to better generalization on unseen data.
2. Robustness
 Works well with noisy data and outliers because each tree makes independent
decisions.
3. Works for Both Classification & Regression
 Classification: Majority voting determines the final class.
 Regression: The average of all predictions is taken.
4. Handles Missing Values
 Some Random Forest implementations can handle missing values (for example via
surrogate splits or proximity-based imputation); in many libraries, however, values still need
to be imputed beforehand.
5. Feature Importance
 It provides insights into which features are most important for prediction, aiding in
feature selection.
6. Captures Complex Relationships
 Can model non-linear relationships between variables without needing domain-
specific knowledge.
7. Scalability
 Handles large datasets efficiently due to parallel processing.
8. Reduces Overfitting
 Multiple decision trees prevent overfitting, making it a robust model for real-world
applications.
9. Easy to Tune
 Hyperparameters such as the number of trees, tree depth, and the number of features
considered at each split can be adjusted easily using cross-validation (a minimal tuning
sketch follows this list).
10. Works Well with Imbalanced Data
 With class weighting or balanced sampling, Random Forest can give more importance to
the minority class, which helps it handle imbalanced datasets.
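As a sketch of point 9, not a prescribed recipe, hyperparameters can be tuned with cross-validation using scikit-learn's GridSearchCV; the parameter grid below is an arbitrary example:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=20, random_state=3)

param_grid = {
    "n_estimators": [50, 100, 200],    # number of trees
    "max_depth": [None, 5, 10],        # tree depth
    "max_features": ["sqrt", "log2"],  # features considered at each split
}

# 5-fold cross-validation over every combination in the grid.
search = GridSearchCV(RandomForestClassifier(random_state=3), param_grid, cv=5)
search.fit(X, y)
print("Best parameters:", search.best_params_)
print("Best CV accuracy:", search.best_score_)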
Disadvantages of Random Forest
Slow Predictions:
 More trees improve accuracy but increase the time required for predictions.
 Not ideal for real-time applications requiring fast predictions.
Complexity:
 The model is more complex than a simple decision tree and is harder to interpret.
High Memory Usage:
 Requires significant memory for storing multiple decision trees.
Lack of Explainability:
 Unlike single decision trees, Random Forest cannot explicitly describe relationships
within the data.

Stacking:
Stacking is a way to ensemble multiple classification or regression models. There are many
ways to ensemble models; the most widely known are bagging and boosting. Bagging
averages multiple similar high-variance models to decrease variance. Boosting builds
multiple incremental models to decrease bias while keeping variance small.
Stacking (sometimes called Stacked Generalization) is a different paradigm. The point of
stacking is to explore a space of different models for the same problem. The idea is that you
can attack a learning problem with different types of models, each capable of learning some
part of the problem but not the whole problem space. So you build multiple different
learners and use them to produce intermediate predictions, one prediction per learned
model. Then you add a new model which learns the same target from these intermediate
predictions.
This final model is said to be stacked on top of the others, hence the name. Thus, you might
improve your overall performance, and you often end up with a model that is better than
any individual intermediate model. Notice, however, that stacking gives no guarantee of
improvement, as is often the case with machine learning techniques.

How does stacking work?


1. We split the training data into K folds, just as in K-fold cross-validation.
2. A base model is fitted on K-1 folds, and predictions are made for the K-th fold.
3. This is repeated so that every fold of the training data receives an out-of-fold prediction.
4. The base model is then fitted on the whole training set to generate its predictions for
the test set.
5. Steps 2 to 4 are repeated for the other base models.
6. The out-of-fold predictions on the training set are used as features for the second-level
(meta) model.
7. The second-level model is used to make the final prediction on the test set.
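A minimal sketch of this procedure using scikit-learn's cross_val_predict to generate the out-of-fold predictions; the base models, meta-model, and dataset are illustrative choices:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=5)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=5)

base_models = [DecisionTreeClassifier(random_state=5), SVC(random_state=5)]

# Steps 1-3 and 5: out-of-fold predictions on the training data for each base model.
train_meta = np.column_stack([
    cross_val_predict(m, X_train, y_train, cv=5) for m in base_models
])

# Step 4: refit each base model on the full training set and predict the test set.
test_meta = np.column_stack([
    m.fit(X_train, y_train).predict(X_test) for m in base_models
])

# Steps 6-7: the second-level (meta) model learns from the intermediate predictions.
meta_model = LogisticRegression().fit(train_meta, y_train)
print("Stacked accuracy:", meta_model.score(test_meta, y_test))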
