
SCHOOL OF COMPUTING

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

Machine Learning Techniques – SEC1630

Machine Learning Essentials – SCSA 1415

UNIT 1 – ENSEMBLING MODELS
Need of Ensembling – Applications of Ensembling – Types of Ensembling: Basic Ensemble Techniques – Advanced Ensemble Techniques: Bagging, Boosting, Stacking, Blending – Techniques of Ensembling – AdaBoost.

Need of Ensembling

Ensemble learning is one of the most powerful machine learning techniques; it uses the combined output of two or more models (weak learners) to solve a particular computational
intelligence problem. For example, the Random Forest algorithm is an ensemble of multiple decision trees
combined.

Ensemble learning is primarily used to improve model performance on tasks such as classification,
prediction, and function approximation. In simple words, we can summarise ensemble learning
as follows:

"An ensemble model is a machine learning model that combines the predictions from two
or more models."

The three most common ensemble learning methods in machine learning are as follows:

o Bagging
o Boosting
o Stacking

However, we will mainly discuss stacking in this topic.

1. Bagging
Bagging is a method of ensemble modeling, which is primarily used to solve supervised machine
learning problems. It is generally completed in two steps as follows:

o Bootstrapping: A random sampling method that draws samples from the data with
replacement. In this method, random data samples are first fed to the primary model, and then a base learning algorithm is run on the samples to complete
the learning process.
o Aggregation: A step that combines the outputs of all base
models and, based on those outputs, predicts an aggregate result with greater accuracy
and reduced variance.
Example: In the Random Forest method, predictions from multiple decision trees are ensembled
in parallel. In regression problems, we take the average of these predictions as the final
output, whereas in classification problems, the class with the majority of votes is selected as the predicted class.
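
For illustration, the sketch below shows bagging with scikit-learn, comparing a plain bagged ensemble of decision trees with a Random Forest. The dataset (the library's built-in breast-cancer data), the hyperparameters, and the scikit-learn version (1.2 or later, where the base learner parameter is named estimator) are assumptions of the example, not part of the notes.

# A minimal bagging sketch (illustrative assumptions, not prescribed settings).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Bagging: bootstrap samples of the data, one decision tree per sample,
# predictions aggregated by majority vote.
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    bootstrap=True,
    random_state=42,
)

# Random Forest: bagging of decision trees plus random feature selection per split.
forest = RandomForestClassifier(n_estimators=100, random_state=42)

print("Bagging accuracy:", cross_val_score(bagging, X, y, cv=5).mean())
print("Random Forest accuracy:", cross_val_score(forest, X, y, cv=5).mean())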

2. Boosting
Boosting is an ensemble method that enables each member to learn from the preceding member's
mistakes and make better predictions. Unlike the bagging method, in boosting all
weak base learners are arranged sequentially so that each can learn from the mistakes
of its preceding learner. In this way, the weak learners together form a strong learner and
give a better predictive model with significantly improved performance.

We have a basic understanding of ensemble techniques in machine learning and their two
common methods, i.e., bagging and boosting. Now, let's discuss a different paradigm of ensemble
learning, i.e., Stacking.

3. Stacking
Stacking is one of the popular ensemble modeling techniques in machine learning. Various
weak learners are ensembled in parallel in such a way that, by combining them
with a meta learner, we can make better predictions for the future.

This ensemble technique works by feeding the combined predictions of multiple weak learners
to a meta learner so that a better output prediction model can be achieved.

In stacking, an algorithm takes the outputs of sub-models as input and attempts to learn how to
best combine the input predictions to make a better output prediction.

Stacking is also known as stacked generalization and is an extended form of the Model
Averaging Ensemble technique, in which sub-models contribute according to their
performance weights to build a new model with better predictions. This new model is stacked
on top of the others, which is why it is named stacking.

Architecture of Stacking
The architecture of the stacking model is designed in such a way that it consists of two or more
base (learner) models and a meta-model that combines the predictions of the base models. These
base models are called level-0 models, and the meta-model is known as the level-1 model. So, the
stacking ensemble method includes the original (training) data, primary-level models, primary-level
predictions, a secondary-level model, and the final prediction. The basic architecture of stacking
can be summarised as follows:
o Original data: This data is divided into n folds and serves as the training and test
data.
o Base models: These models are also referred to as level-0 models. They are trained on the
training data and produce predictions (level-0) as output.
o Level-0 Predictions: Each base model is trained on some of the training data and produces
different predictions, which are known as level-0 predictions.
o Meta Model: The architecture of the stacking model consists of one meta-model, which
helps to best combine the predictions of the base models. The meta-model is also known as
the level-1 model.
o Level-1 Prediction: The meta-model learns how to best combine the predictions of the
base models and is trained on the predictions made by the individual base models. That is,
data not used to train the base models are fed to the base models, predictions are made,
and these predictions, along with the expected outputs, provide the input-output pairs
of the dataset used to fit the meta-model.

Steps to implement Stacking models:


There are some important steps to implement stacking models in machine learning. These are
as follows:

o Split the training dataset into n folds using RepeatedStratifiedKFold, as this is the most common
approach to preparing training datasets for meta-models.
o Now the base model is fitted on the first n-1 folds, and it makes predictions for the nth
fold.
o The predictions made in the above step are added to the x1_train list.
o Repeat steps 2 and 3 for the remaining folds, so that the x1_train array covers all n parts of the data.
o Now the model is trained on all n parts and makes predictions for the sample (test) data.
o Add these predictions to the y1_test list.
o In the same way, we can find x2_train, y2_test, x3_train, and y3_test by using Models 2 and 3 for
training, respectively, to get level 2 predictions.
o Now train the meta-model on the level 1 predictions, where these predictions are used as features for
the model.
o Finally, the meta-learner can be used to make predictions on test data in the stacking model.
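
As a hedged sketch of these steps, scikit-learn's StackingClassifier automates the fold handling described above: it generates out-of-fold level-0 predictions internally and fits the meta-model on them. The base models, meta-model, and dataset below are illustrative assumptions.

# A stacking sketch with StackingClassifier (illustrative choices throughout).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Level-0 (base) models: their out-of-fold predictions become the
# training features for the meta-model.
base_models = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
    ("knn", KNeighborsClassifier()),
    ("svm", SVC(probability=True, random_state=42)),
]

# Level-1 (meta) model: learns how to best combine the base-model predictions.
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(),
    cv=5,  # internal k-fold used to generate out-of-fold level-0 predictions
)

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=42)
print("Stacking accuracy:", cross_val_score(stack, X, y, cv=cv).mean())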

Stacking Ensemble Family


There are some other ensemble techniques that can be considered forerunners of the stacking
method. For better understanding, they are grouped into different frameworks of essential
stacking so that the differences between the methods and the uniqueness of
each technique are easy to see. Let's discuss a few commonly used ensemble techniques related to stacking.

Voting ensembles:
This is one of the simplest stacking ensemble methods, in which different algorithms are used to prepare
each member individually. Unlike the stacking method, the voting ensemble uses
simple statistics instead of separately learning how to best combine predictions from the base
models.

It is useful for regression problems, where we predict the mean or median of the
predictions from the base models. It is also helpful in classification problems, where the prediction
is based on the total votes received. Predicting the label with the highest number of votes is referred to
as hard voting, whereas predicting the label with the largest sum of predicted probabilities is
referred to as soft voting.

The voting ensemble differs from the stacking ensemble in that it does not weigh models based on
each member's performance; here, all models are assumed to have the same skill level.

Member Assessment: In the voting ensemble, all members are assumed to have the same skill
sets.

Combine with Model: Instead of learning how to combine each member's predictions, it uses simple
statistics, e.g., the mean or median, to get the final prediction.
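
A minimal voting-ensemble sketch is shown below, contrasting hard and soft voting with scikit-learn's VotingClassifier; the member models and dataset are illustrative assumptions.

# Hard vs. soft voting (illustrative members and dataset).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

members = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("dt", DecisionTreeClassifier(random_state=42)),
    ("nb", GaussianNB()),
]

# Hard voting: each member casts one vote for a class label.
hard_vote = VotingClassifier(estimators=members, voting="hard")

# Soft voting: predicted class probabilities are summed and the largest sum wins.
soft_vote = VotingClassifier(estimators=members, voting="soft")

print("Hard voting accuracy:", cross_val_score(hard_vote, X, y, cv=5).mean())
print("Soft voting accuracy:", cross_val_score(soft_vote, X, y, cv=5).mean())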

Weighted Average Ensemble


The weighted average ensemble is considered the next level of the voting ensemble: it uses a
diverse collection of model types as contributing members. This method uses the training
dataset to find a weight for each ensemble member based on its performance. An
improvement over this naive approach is to weigh each member based on its performance on a
hold-out dataset, such as a validation set, or on out-of-fold predictions during k-fold cross-validation.
Furthermore, it may also involve tuning the coefficient weightings for each model using an
optimization algorithm and performance on a holdout dataset.

Member Assessment: The weighted average ensemble method weighs each member based on its performance on
the training dataset.

Combine With Model: It considers the weighted average of prediction from each member
separately.
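
The sketch below illustrates one simple way (an assumption of this example, not a prescribed recipe) to build a weighted average ensemble: each member's hold-out accuracy is used as its voting weight in a soft-voting VotingClassifier.

# Weighted average ensemble sketch: hold-out accuracy as voting weight.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

members = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("dt", DecisionTreeClassifier(random_state=42)),
    ("nb", GaussianNB()),
]

# Score each member on the hold-out set and use those scores as weights.
weights = [model.fit(X_train, y_train).score(X_val, y_val) for _, model in members]

# Soft voting with per-member weights (the members are refit internally).
weighted = VotingClassifier(estimators=members, voting="soft", weights=weights)
weighted.fit(X_train, y_train)
print("Weighted average ensemble accuracy:", weighted.score(X_val, y_val))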
Blending Ensemble:
Blending is an approach similar to stacking, with a specific configuration. It is a stacking
method that uses a hold-out validation set, rather than k-fold cross-validation, to prepare out-of-sample
predictions for the meta-model. In this method, the training dataset is first split into training and validation
sets, and the learner models are trained on the training sets. Predictions are then made on the
validation set and the test set: the validation predictions are used as features to build a new
model, which is later used to make final predictions on the test set using the prediction values as
features.

Member Predictions: The blending stacking ensemble uses out-of-sample predictions on a
validation set.

Combine With Model: Linear model (e.g., linear regression or logistic regression).
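
Since scikit-learn has no dedicated blending class, the sketch below assembles the procedure manually under the assumptions stated in the comments: base models are fit on a training split, their predictions on a hold-out validation split become meta-features, and a logistic-regression meta-model combines them.

# A blending sketch (datasets, models and split sizes are illustrative).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
X_fit, X_val, y_fit, y_val = train_test_split(X_train, y_train, random_state=42)

# Fit the base (learner) models on the training split only.
base_models = [RandomForestClassifier(random_state=42), KNeighborsClassifier()]
for model in base_models:
    model.fit(X_fit, y_fit)

# Out-of-sample predictions on the hold-out validation split become meta-features.
meta_train = np.column_stack([m.predict_proba(X_val)[:, 1] for m in base_models])
meta_test = np.column_stack([m.predict_proba(X_test)[:, 1] for m in base_models])

# A linear meta-model combines the base-model predictions.
meta_model = LogisticRegression()
meta_model.fit(meta_train, y_val)
print("Blending accuracy:", meta_model.score(meta_test, y_test))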

Super Learner Ensemble:


This method is quite similar to blending and is another specific configuration of a stacking ensemble.
It uses out-of-fold predictions from the learner models to prepare the meta-model. It can be
considered a modified form of blending that differs only in how the out-of-sample
predictions are prepared for the meta-learner.

Summary of Stacking Ensemble


Stacking is an ensemble method that enables a model to learn how to combine the predictions
of learner models using a meta-model and prepare a final model with accurate predictions. The
main benefit of a stacking ensemble is that it can harness the capabilities of a range of well-
performing models to solve classification and regression problems. Further, it helps to prepare a
better model with better predictions than any of the individual models. In this topic, we have learned
various ensemble techniques and their definitions, the stacking ensemble method, the architecture
of stacking models, and the steps to implement stacking models in machine learning.

What is Boosting in Data Mining?


Boosting is an ensemble learning method that combines a set of weak learners into a strong learner
to minimize training errors. In boosting, a random sample of data is selected, fitted with a model,
and then trained sequentially; that is, each model tries to compensate for the weaknesses of its
predecessor. At each iteration, the weak rules from each classifier are combined to form one strong
prediction rule.

Boosting is an efficient algorithm that converts a weak learner into a strong learner. It performs this
weak-to-strong conversion using weighted averages and majority votes of the predictions. These
algorithms typically use decision stumps and margin-maximizing classification for processing.
Three types of boosting algorithms are available: AdaBoost (Adaptive Boosting), Gradient Boosting,
and XGBoost. These machine learning algorithms follow the
process of training, predicting, and fine-tuning the result.

Example

Let's understand this concept with the help of an example. Take the example of
email: how would you recognise whether an email is spam or not? You can recognise it by the
following conditions:

o If an email contains lots of sources, that means it is spam.


o If an email contains only one file image, then it is spam.
o If an email contains the message "You Own a lottery of $xxxxx," it is spam.
o If an email contains some known source, then it is not spam.
o If it contains the official domain like educba.com, etc., it is not spam.

The rules mentioned above are not powerful enough on their own to recognise spam; hence, these rules are
called weak learners.

To convert weak learners into a strong learner, we combine the predictions of the weak learners using
the following methods:

1. Using the average or weighted average.
2. Considering the prediction with the higher vote.

Considering the 5 rules mentioned above, there are 3 votes for spam and 2 votes for not spam. Since
the majority of votes are for spam, we consider the email to be spam.

Why is Boosting Used?


To solve complicated problems, we require more advanced techniques. Suppose that, given a data
set of images containing images of cats and dogs, you were asked to build a model that can
classify these images into two separate classes. Like every other person, you will start by
identifying the images by using some rules given below:

1. The image has pointy ears: Cat


2. The image has cat-shaped eyes: Cat
3. The image has bigger limbs: Dog
4. The image has sharpened claws: Cat
5. The image has a wider mouth structure: Dog

These rules help us identify whether an image is a Dog or a cat. However, the prediction would be
flawed if we were to classify an image based on an individual (single) rule. These rules are called
weak learners because these rules are not strong enough to classify an image as a cat or dog.

Therefore, to ensure our prediction is more accurate, we can combine the prediction from these
weak learners by using the majority rule or weighted average. This makes a strong learner model.

In the above example, we have defined 5 weak learners, and the majority of these rules (i.e., 3 out
of 5 learners predict the image as a cat) give us the prediction that the image is a cat. Therefore,
our final output is a cat.

How does the Boosting Algorithm Work?


The basic principle behind the working of the boosting algorithm is to generate multiple weak
learners and combine their predictions to form one strong rule. These weak rules are generated by
applying base machine learning algorithms to different distributions of the data set. These
algorithms generate weak rules in each iteration. After multiple iterations, the weak learners are
combined to form a strong learner that will predict a more accurate outcome.
Here's how the algorithm works:

Step 1: The base algorithm reads the data and assigns equal weight to each sample observation.

Step 2: False predictions made by the base learner are identified. In the next iteration, these false
predictions are assigned to the next base learner with a higher weightage on these incorrect
predictions.

Step 3: Repeat step 2 until the algorithm can correctly classify the output.

Therefore, the main aim of boosting is to focus more on misclassified predictions.
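
The loop below is a simplified, illustrative sketch of these steps in the spirit of AdaBoost (the dataset, the use of decision stumps, the -1/+1 label recoding, and the number of iterations are all assumptions of the example).

# A simplified re-weighting loop in the spirit of AdaBoost.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
y = np.where(y == 0, -1, 1)  # recode labels as -1 / +1

n_samples = len(y)
weights = np.full(n_samples, 1.0 / n_samples)  # Step 1: equal weight per sample
stumps, alphas = [], []

for _ in range(10):
    # Fit a decision stump (single-split tree) on the weighted samples.
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=weights)
    pred = stump.predict(X)

    # Step 2: weighted error of this learner and its contribution (alpha).
    err = np.sum(weights[pred != y]) / np.sum(weights)
    alpha = 0.5 * np.log((1 - err) / (err + 1e-10))

    # Misclassified samples get a higher weight for the next learner.
    weights *= np.exp(-alpha * y * pred)
    weights /= weights.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Final strong learner: weighted vote of all stumps.
ensemble_pred = np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))
print("Training accuracy:", (ensemble_pred == y).mean())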

Types of Boosting
Boosting methods are focused on iteratively combining weak learners to build a strong learner
that can predict more accurate outcomes. As a reminder, a weak learner classifies data slightly
better than random guessing. This approach can provide robust results for prediction problems
and can even outperform neural networks and support vector machines on some tasks.

Boosting algorithms can differ in how they create and aggregate weak learners during the
sequential process. Three popular types of boosting methods include:

1. Adaptive boosting or AdaBoost: This method operates iteratively, identifying misclassified
data points and adjusting their weights to minimize the training error. The model continues to
optimize sequentially until it yields the strongest predictor.

AdaBoost is implemented by combining several weak learners into a single strong learner. The
weak learners in AdaBoost take into account a single input feature and draw out a single split
decision tree called the decision stump. Each observation is weighted equally while drawing out
the first decision stump.

The results from the first decision stump are analyzed, and if any observations are wrongfully
classified, they are assigned higher weights. A new decision stump is drawn by considering the
higher-weight observations as more significant. Again if any observations are misclassified, they're
given a higher weight, and this process continues until all the observations fall into the right class.

AdaBoost can be used for both classification and regression-based problems. However, it is more
commonly used for classification purposes.
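
A minimal AdaBoost sketch with scikit-learn's AdaBoostClassifier is shown below; the decision-stump base learner, the number of estimators, and the scikit-learn version (1.2 or later, where the base learner parameter is named estimator) are assumptions of the example.

# AdaBoost with decision stumps as the weak learners (illustrative settings).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Each weak learner is a decision stump (a tree with a single split).
ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=100,
    learning_rate=1.0,
    random_state=42,
)

print("AdaBoost accuracy:", cross_val_score(ada, X, y, cv=5).mean())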

2. Gradient Boosting: Gradient Boosting is also based on sequential ensemble learning. Here the
base learners are generated sequentially so that the present base learner is always more effective
than the previous one, i.e., the overall model improves with each iteration.

The difference in this boosting type is that the weights of misclassified outcomes are not
incremented. Instead, the Gradient Boosting method tries to optimize the loss function of the
previous learner by adding a new model (a weak learner) that reduces the loss.

The main idea here is to overcome the errors in the previous learner's predictions. This boosting
has three main components:
o Loss function: The choice of loss function depends on the type of problem. An advantage
of gradient boosting is that a new boosting algorithm is not needed for each loss
function.
o Weak learner: In gradient boosting, decision trees are used as the weak learners. Regression
trees output real values, which can be combined to correct the predictions. As in the
AdaBoost algorithm, small trees with a single split (decision stumps) are used; larger trees,
with around 4-8 levels, can also be used.
o Additive Model: Trees are added one at a time in this model, and existing trees remain
unchanged. During the addition of trees, gradient descent is used to minimize the loss function.

Like AdaBoost, Gradient Boosting can also be used for classification and regression problems.
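
The sketch below maps the three components named above onto scikit-learn's GradientBoostingClassifier; the dataset and the hyperparameter values are illustrative assumptions.

# Gradient boosting sketch: additive model of small trees fit by gradient descent
# on a loss chosen for the problem type (illustrative hyperparameters).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

gbm = GradientBoostingClassifier(
    n_estimators=200,   # trees are added one at a time (the additive model)
    learning_rate=0.1,  # shrinks each new tree's contribution
    max_depth=3,        # small trees act as the weak learners
    random_state=42,
)

print("Gradient Boosting accuracy:", cross_val_score(gbm, X, y, cv=5).mean())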

3. Extreme gradient boosting or XGBoost: XGBoost is an advanced gradient boosting method.
XGBoost, developed by Tianqi Chen, falls under the Distributed Machine Learning Community
(DMLC) category.

The main aim of this algorithm is to increase the speed and efficiency of computation. The
standard gradient boosting algorithm computes the output more slowly because it analyses
the data set sequentially. Therefore, XGBoost is used to boost, or extremely boost, the model's performance.

XGBoost is designed to focus on computational speed and model efficiency. The main features
provided by XGBoost are:

o Parallel Processing: XG Boost provides Parallel Processing for tree construction which uses
CPU cores while training.
o Cross-Validation: XG Boost enables users to run cross-validation of the boosting process at
each iteration, making it easy to get the exact optimum number of boosting iterations in
one run.
o Cache Optimization: It provides Cache Optimization of the algorithms for higher execution
speed.
o Distributed Computing: For training large models, XG Boost allows Distributed
Computing.
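
A hedged XGBoost sketch follows; it assumes the separate xgboost Python package is installed, and the hyperparameters are illustrative.

# XGBoost sketch (assumes the xgboost package is available).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)

xgb = XGBClassifier(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=4,
    n_jobs=-1,        # parallel tree construction across CPU cores
    random_state=42,
)

print("XGBoost accuracy:", cross_val_score(xgb, X, y, cv=5).mean())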
Benefits and Challenges of Boosting
The boosting method presents many advantages and challenges for classification or regression
problems. The benefits of boosting include:

o Ease of Implementation: Boosting can be used with several hyper-parameter tuning
options to improve fitting. No data preprocessing is required, and boosting algorithms have
built-in routines to handle missing data. In Python, the scikit-learn library of ensemble
methods makes it easy to implement the popular boosting methods, including AdaBoost,
XGBoost, etc.
o Reduction of bias: Boosting algorithms combine multiple weak learners in a sequential
method, iteratively improving upon observations. This approach can help to reduce high
bias, commonly seen in shallow decision trees and logistic regression models.
o Computational Efficiency: Since boosting algorithms have special features that increase
their predictive power during training, it can help reduce dimensionality and increase
computational efficiency.

And the challenges of boosting include:

o Overfitting: There's some dispute in the research around whether or not boosting can help
reduce overfitting or make it worse. We include it under challenges because in the instances
that it does occur, predictions cannot be generalized to new datasets.
o Intense computation: Sequential training in boosting is hard to scale up. Since each
estimator is built on its predecessors, boosting models can be computationally expensive,
although XGBoost seeks to address scalability issues in other boosting methods. Boosting
algorithms can be slower to train when compared to bagging, as a large number of
parameters can also influence the model's behavior.
o Vulnerability to outlier data: Boosting models are vulnerable to outliers or data values
that are different from the rest of the dataset. Because each model attempts to correct the
faults of its predecessor, outliers can skew results significantly.
o Real-time implementation: You might find it challenging to use boosting for real-time
implementation because the algorithm is more complex than other processes. Boosting
methods have high adaptability, so you can use various model parameters that immediately
affect the model's performance.

Applications of Boosting
Boosting algorithms are well suited for artificial intelligence projects across a broad range of
industries, including:

o Healthcare: Boosting is used to lower errors in medical data predictions, such as predicting
cardiovascular risk factors and cancer patient survival rates. For example, research shows
that ensemble methods significantly improve the accuracy in identifying patients who could
benefit from preventive treatment of cardiovascular disease while avoiding unnecessary
treatment of others. Likewise, another study found that applying boosting to multiple
genomics platforms can improve the prediction of cancer survival time.
o IT: Gradient boosted regression trees are used in search engines for page rankings, while
the Viola-Jones boosting algorithm is used for image retrieval. As noted by Cornell, boosted
classifiers allow the computations to be stopped sooner when it's clear which direction a
prediction is headed. A search engine can stop evaluating lower-ranked pages, while image
scanners will only consider images containing the desired object.
o Finance: Boosting is used with deep learning models to automate critical tasks, including
fraud detection, pricing analysis, and more. For example, boosting methods in credit card
fraud detection and financial product pricing analysis improves the accuracy of analyzing
massive data sets to minimize financial losses.
