Unit 4 Updated Notes
SCHOOL OF COMPUTING
Machine Learning Techniques – SEC1630
UNIT 1 – ENSEMBLING MODELS
Need of Ensembling – Applications of Ensembling – Types of Ensembling: Basic Ensemble Techniques – Advanced Ensemble Techniques: Bagging, Boosting, Stacking, Blending – Techniques of Ensembling – AdaBoost.
Need of Ensembling
Ensemble learning is one of the most powerful machine learning techniques. It combines the output of two or more models (weak learners) to solve a particular computational intelligence problem. For example, the Random Forest algorithm is an ensemble of many decision trees combined.
Ensemble learning is primarily used to improve model performance on tasks such as classification, prediction, and function approximation. In simple words, we can summarise ensemble learning as follows:
"An ensembled model is a machine learning model that combines the predictions from two
or more models.”
The three most common ensemble learning methods in machine learning are as follows:
o Bagging
o Boosting
o Stacking
1. Bagging
Bagging is a method of ensemble modeling, which is primarily used to solve supervised machine
learning problems. It is generally completed in two steps as follows:
o Bootstrapping: A random sampling method used to draw samples from the data with replacement. In this method, random data samples are first drawn with replacement, and then a base learning algorithm is run on each sample to complete the learning process.
o Aggregation: This is a step that involves the process of combining the output of all base
models and, based on their output, predicting an aggregate result with greater accuracy
and reduced variance.
Example: In the Random Forest method, predictions from multiple decision trees are ensembled in parallel. In regression problems, the final output is the average of these predictions, whereas in classification problems, the class receiving the majority of votes is selected as the predicted class.
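As a quick illustration, below is a minimal sketch of bagging in Python with scikit-learn. The dataset is a synthetic placeholder and the parameter values are arbitrary, not prescriptive.

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy dataset standing in for real training data
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Bagging: bootstrap samples of the data, one decision tree per sample,
# predictions aggregated by majority vote
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, bootstrap=True, random_state=42)
bagging.fit(X_train, y_train)
print("Bagging accuracy:", bagging.score(X_test, y_test))

# Random Forest: bagging of decision trees plus random feature selection
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)
print("Random Forest accuracy:", forest.score(X_test, y_test))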
2. Boosting
Boosting is an ensemble method that enables each member to learn from the preceding member's mistakes and make better predictions. Unlike the bagging method, in boosting all base (weak) learners are arranged sequentially so that each can learn from the mistakes of its preceding learner. In this way, the weak learners are combined into a strong learner, giving a predictive model with significantly improved performance.
We have a basic understanding of ensemble techniques in machine learning and their two
common methods, i.e., bagging and boosting. Now, let's discuss a different paradigm of ensemble
learning, i.e., Stacking.
3. Stacking
Stacking is one of the popular ensemble modeling techniques in machine learning. Various weak learners are ensembled in parallel, and their outputs are combined by a meta learner to produce better predictions.
This ensemble technique works by feeding the combined predictions of multiple weak learners to a meta learner so that a better output prediction model can be achieved.
In stacking, an algorithm takes the outputs of sub-models as input and attempts to learn how to
best combine the input predictions to make a better output prediction.
Stacking is also known as stacked generalization and is an extended form of the model averaging ensemble technique: instead of all sub-models participating equally, the sub-models contribute according to learned performance weights to build a new model with better predictions. This new model is stacked up on top of the others, which is why it is named stacking.
Architecture of Stacking
The architecture of the stacking model is designed in such a way that it consists of two or more base (learner) models and a meta-model that combines the predictions of the base models. The base models are called level-0 models, and the meta-model is known as the level-1 model. The stacking ensemble method therefore includes the original (training) data, primary-level models, primary-level predictions, a secondary-level model, and the final prediction. The basic architecture of stacking can be represented as shown in the image below.
o Original data: The original data is divided into n folds and serves as the training and test data.
o Base models: These models are also referred to as level-0 models. They are trained on the training data and provide their predictions (level-0 predictions) as output.
o Level-0 Predictions: Each base model is fit on part of the training data and makes its own predictions, which are known as level-0 predictions.
o Meta Model: The architecture of the stacking model consists of one meta-model, which
helps to best combine the predictions of the base models. The meta-model is also known as
the level-1 model.
o Level-1 Prediction: The meta-model learns how to best combine the predictions of the
base models and is trained on different predictions made by individual base models, i.e.,
data not used to train the base models are fed to the meta-model, predictions are made,
and these predictions, along with the expected outputs, provide the input and output pairs
of the training dataset used to fit the meta-model.
o Split the training dataset into n folds using RepeatedStratifiedKFold, as this is the most common approach to preparing training datasets for meta-models.
o A base model is fitted on n-1 of the folds and makes predictions for the held-out nth fold.
o The predictions made in the above step are added to the x1_train list.
o Repeat steps 2 and 3 for the remaining folds, so that x1_train contains an out-of-fold prediction for every training sample.
o Now the model is trained on all n parts and makes predictions for the test (sample) data.
o Add these predictions to the y1_test list.
o In the same way, x2_train, y2_test, x3_train, and y3_test are obtained by using Models 2 and 3 for training, respectively.
o Now train the meta-model on these out-of-fold predictions, which are used as features for the meta-model.
o Finally, the meta-model can be used to make predictions on the test data in the stacking model (see the sketch after this list).
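In practice, scikit-learn's StackingClassifier automates this procedure. The following is a minimal sketch assuming scikit-learn is available; the base models, meta-model, and dataset are illustrative choices only.

from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Level-0 (base) models, trained in parallel
base_models = [
    ("tree", DecisionTreeClassifier(max_depth=3)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
]

# Level-1 (meta) model learns how to combine the base-model predictions;
# cv=5 generates the out-of-fold predictions used to train the meta-model
stack = StackingClassifier(estimators=base_models, final_estimator=LogisticRegression(), cv=5)
stack.fit(X_train, y_train)
print("Stacking accuracy:", stack.score(X_test, y_test))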
Voting ensembles:
This is one of the simplest ensemble methods. Different algorithms are used to prepare the members individually, but unlike the stacking method, the voting ensemble combines the base models' predictions using simple statistics instead of learning how to best combine them.
It is useful for regression problems, where the final output is the mean or median of the predictions from the base models. It is also helpful in classification problems, where the final output is decided by the votes received for each class. Predicting the label with the highest number of votes is referred to as hard voting, whereas predicting the label with the largest sum (or average) of predicted class probabilities is referred to as soft voting.
The voting ensemble differs from the stacking ensemble in that it does not weight models based on each member's performance; all models are considered to have the same skill level.
Voting Ensemble:
Member Assessment: all members are assumed to have the same skill level.
Combine with Model: instead of learning how to combine each member's predictions, it uses a simple statistic such as the mean or median to get the final prediction.
Weighted Average Ensemble:
Member Assessment: member weights are based on performance on the training dataset.
Combine with Model: it takes the weighted average of the predictions from each member.
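The following sketch shows both hard and soft voting with scikit-learn's VotingClassifier; the member models and the soft-voting weights are illustrative assumptions, not recommendations.

from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

members = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("nb", GaussianNB()),
    ("tree", DecisionTreeClassifier(max_depth=3)),
]

# Hard voting: each member casts one vote for a class label
hard_vote = VotingClassifier(estimators=members, voting="hard")

# Soft voting: predicted class probabilities are summed and the largest total wins;
# the optional weights turn this into a weighted average ensemble
soft_vote = VotingClassifier(estimators=members, voting="soft", weights=[2, 1, 1])

for name, model in [("hard", hard_vote), ("soft", soft_vote)]:
    model.fit(X_train, y_train)
    print(name, "voting accuracy:", model.score(X_test, y_test))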
Blending Ensemble:
Blending is a similar approach to stacking with a specific configuration. It can be viewed as a stacking method that uses a separate holdout (validation) set, rather than k-fold cross-validation, to prepare the out-of-sample predictions for the meta-model. In this method, the training dataset is first split into a training set and a validation set, and the learner models are trained on the training set. Predictions are then made on the validation set and the test set; the validation-set predictions are used as features to build a new model, which is later used to make final predictions on the test set using the prediction values as features.
Combine With Model: Linear model (e.g., linear regression or logistic regression).
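Below is a minimal sketch of blending done by hand, assuming scikit-learn and NumPy are available; the split ratios, base models, and meta-model are illustrative placeholders.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
# Hold out a test set, then split the rest into training and validation sets
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.3, random_state=42)

# Base (learner) models are fit on the training split only
base_models = [DecisionTreeClassifier(max_depth=3), KNeighborsClassifier()]
for model in base_models:
    model.fit(X_train, y_train)

# Their predictions on the validation split become features for the meta-model
meta_features_val = np.column_stack([m.predict_proba(X_val)[:, 1] for m in base_models])
meta_features_test = np.column_stack([m.predict_proba(X_test)[:, 1] for m in base_models])

# A linear meta-model (logistic regression) blends the base predictions
blender = LogisticRegression()
blender.fit(meta_features_val, y_val)
print("Blending accuracy:", blender.score(meta_features_test, y_test))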
Boosting is an efficient algorithm that converts a weak learner into a strong learner. Boosting methods achieve this weak-to-strong learner conversion through weighted averages and majority votes for prediction, and they typically use decision stumps and margin-maximizing classification for processing.
Three boosting algorithms are commonly used: AdaBoost (Adaptive Boosting), Gradient Boosting, and XGBoost (Extreme Gradient Boosting). These machine learning algorithms follow a process of training, predicting, and fine-tuning the result.
Example
Let's understand this concept with the help of the following example. Let's take the example of the
email. How will you recognize your email, whether it is spam or not? You can recognize it by the
following conditions:
Individually, such rules are not powerful enough to recognize whether an email is spam or not; hence these rules are called weak learners.
To convert the weak learners into a strong learner, we combine the predictions of the weak learners using the following methods:
1. Using the average or weighted average of the predictions.
2. Considering the prediction that has the higher number of votes.
Consider the 5 rules mentioned above: there are 3 votes for spam and 2 votes for not spam. Since spam receives the higher number of votes, we consider the email to be spam.
Similarly, suppose we have a set of rules that help us identify whether an image is of a dog or a cat. However, the prediction would be flawed if we were to classify an image based on an individual (single) rule. These rules are called weak learners because they are not strong enough on their own to classify an image as a cat or a dog.
Therefore, to ensure our prediction is more accurate, we can combine the prediction from these
weak learners by using the majority rule or weighted average. This makes a strong learner model.
In the above example, we have defined 5 weak learners, and the majority of these rules (i.e., 3 out
of 5 learners predict the image as a cat) give us the prediction that the image is a cat. Therefore,
our final output is a cat.
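The combination step can be made concrete with a tiny sketch in Python; the individual predictions and the weights below are hypothetical values chosen only to mirror the 3-out-of-5 example above.

import numpy as np

# Hypothetical outputs of the 5 weak learners for one image: 1 = "cat", 0 = "dog"
weak_predictions = np.array([1, 0, 1, 1, 0])

# Majority (hard) vote: 3 of 5 learners say "cat", so the ensemble predicts "cat"
majority = int(np.round(weak_predictions.mean()))

# Weighted average: more reliable learners are given larger (hypothetical) weights
weights = np.array([0.3, 0.1, 0.25, 0.25, 0.1])
weighted = int(np.average(weak_predictions, weights=weights) >= 0.5)

print("Majority vote:", "cat" if majority else "dog")
print("Weighted vote:", "cat" if weighted else "dog")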
Step 1: The base algorithm reads the data and assigns equal weight to each sample observation.
Step 2: False predictions made by the base learner are identified. In the next iteration, these misclassified observations are given higher weights and passed to the next base learner.
Step 3: Repeat step 2 until the algorithm can correctly classify the output.
Types of Boosting
Boosting methods are focused on iteratively combining weak learners to build a strong learner that can predict more accurate outcomes. As a reminder, a weak learner classifies data only slightly better than random guessing. This approach can provide robust results for prediction problems and can even outperform neural networks and support vector machines on some tasks.
Boosting algorithms can differ in how they create and aggregate weak learners during the
sequential process. Three popular types of boosting methods include:
1. AdaBoost (Adaptive Boosting): AdaBoost is implemented by combining several weak learners into a single strong learner. The weak learners in AdaBoost each consider a single input feature and draw out a one-split decision tree called a decision stump. Every observation is weighted equally while drawing out the first decision stump.
The results from the first decision stump are analyzed, and if any observations are wrongly classified, they are assigned higher weights. A new decision stump is then drawn that treats the higher-weight observations as more significant. Again, if any observations are misclassified, they are given higher weights, and this process continues until all the observations fall into the right class.
AdaBoost can be used for both classification and regression-based problems. However, it is more
commonly used for classification purposes.
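A minimal AdaBoost sketch with scikit-learn follows; the decision-stump base learner matches the description above, while the number of estimators and the learning rate are arbitrary placeholder values.

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each weak learner is a decision stump (a tree with a single split);
# misclassified samples receive higher weights before the next stump is fit
ada = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),  # decision stump
    n_estimators=100,
    learning_rate=0.5,
    random_state=42,
)
ada.fit(X_train, y_train)
print("AdaBoost accuracy:", ada.score(X_test, y_test))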
2. Gradient Boosting: Gradient Boosting is also based on sequential ensemble learning. Here the base learners are generated sequentially so that the present base learner is always more effective than the previous one, i.e., the overall model improves with each iteration.
The difference in this boosting type is that the weights for misclassified outcomes are not
incremented. Instead, the Gradient Boosting method tries to optimize the loss function of the
previous learner by adding a new model that adds weak learners to reduce the loss function.
The main idea here is to overcome the errors in the previous learner's predictions. This boosting
has three main components:
o Loss function: The choice of loss function depends on the type of problem. An advantage of gradient boosting is that no new boosting algorithm is needed for each loss function.
o Weak learner: In gradient boosting, decision trees are used as the weak learners. Regression trees are used because they output real values that can be added together, allowing each new tree to correct the errors of the previous predictions. As in the AdaBoost algorithm, small trees with a single split (decision stumps) can be used; larger trees with 4-8 levels are also common.
o Additive Model: Trees are added one at a time in this model. Existing trees remain the
same. During the addition of trees, gradient descent is used to minimize the loss function.
Like AdaBoost, Gradient Boosting can also be used for classification and regression problems.
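A minimal gradient boosting sketch with scikit-learn's GradientBoostingClassifier is given below; the hyperparameter values are illustrative only.

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Trees are added one at a time; each new tree is fit, via gradient descent on
# the loss function, to correct the errors of the current ensemble
gbm = GradientBoostingClassifier(
    n_estimators=200,   # number of trees in the additive model
    learning_rate=0.1,  # shrinks each tree's contribution
    max_depth=3,        # weak learners are shallow trees
    random_state=42,
)
gbm.fit(X_train, y_train)
print("Gradient Boosting accuracy:", gbm.score(X_test, y_test))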
3. XGBoost (Extreme Gradient Boosting): The main aim of this algorithm is to increase the speed and efficiency of computation. Standard gradient boosting computes its output relatively slowly because it analyzes the dataset sequentially; therefore, XGBoost is used to boost, or "extremely boost," the model's performance.
XGBoost is designed to focus on computational speed and model efficiency. The main features
provided by XGBoost are:
o Parallel Processing: XGBoost provides parallel processing for tree construction, which uses multiple CPU cores while training.
o Cross-Validation: XGBoost enables users to run cross-validation of the boosting process at each iteration, making it easy to get the exact optimum number of boosting iterations in one run.
o Cache Optimization: It provides cache optimization of the algorithms for higher execution speed.
o Distributed Computing: For training large models, XGBoost allows distributed computing.
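A short sketch using the xgboost package's scikit-learn-style wrapper is shown below, assuming xgboost and scikit-learn are installed; the hyperparameter values are placeholders.

from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Gradient-boosted trees with parallel tree construction (n_jobs uses multiple CPU cores)
model = XGBClassifier(n_estimators=300, learning_rate=0.1, max_depth=4, n_jobs=-1, random_state=42)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print("XGBoost accuracy:", accuracy_score(y_test, predictions))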
Benefits and Challenges of Boosting
The boosting method presents both advantages and challenges for classification and regression problems. The main challenges of boosting include:
o Overfitting: There's some dispute in the research around whether or not boosting can help
reduce overfitting or make it worse. We include it under challenges because in the instances
that it does occur, predictions cannot be generalized to new datasets.
o Intense computation: Sequential training in boosting is hard to scale up. Since each
estimator is built on its predecessors, boosting models can be computationally expensive,
although XGBoost seeks to address scalability issues in other boosting methods. Boosting
algorithms can be slower to train when compared to bagging, as a large number of
parameters can also influence the model's behavior.
o Vulnerability to outlier data: Boosting models are vulnerable to outliers or data values
that are different from the rest of the dataset. Because each model attempts to correct the
faults of its predecessor, outliers can skew results significantly.
o Real-time implementation: You might find it challenging to use boosting for real-time
implementation because the algorithm is more complex than other processes. Boosting
methods have high adaptability, so you can use various model parameters that immediately
affect the model's performance.
Applications of Boosting
Boosting algorithms are well suited for artificial intelligence projects across a broad range of
industries, including:
o Healthcare: Boosting is used to lower errors in medical data predictions, such as predicting
cardiovascular risk factors and cancer patient survival rates. For example, research shows
that ensemble methods significantly improve the accuracy in identifying patients who could
benefit from preventive treatment of cardiovascular disease while avoiding unnecessary
treatment of others. Likewise, another study found that applying boosting to multiple
genomics platforms can improve the prediction of cancer survival time.
o IT: Gradient boosted regression trees are used in search engines for page rankings, while
the Viola-Jones boosting algorithm is used for image retrieval. As noted by Cornell, boosted
classifiers allow the computations to be stopped sooner when it's clear which direction a
prediction is headed. A search engine can stop evaluating lower-ranked pages, while image
scanners will only consider images containing the desired object.
o Finance: Boosting is used with deep learning models to automate critical tasks, including fraud detection, pricing analysis, and more. For example, boosting methods in credit card fraud detection and financial product pricing analysis improve the accuracy of analyzing massive data sets to minimize financial losses.