17 Ensemble Learning
- Dr. Sifat Momen (SfM1)
Learning goals
• After this presentation, you should be able to:
• Appreciate the use of ensemble techniques
• Understand the general idea of the ensemble approach
• Understand the idea of bagging and pasting
• Understand OOB evaluation
• Understand and apply the Random Forest algorithm
• Understand the boosting approach
• Understand the AdaBoost algorithm
Jupyter Notebook
• Please note that there is a Jupyter notebook associated with this presentation
• Please use both in parallel for optimal understanding
Wisdom of the crowd
• Aggregating the answers of a large, diverse group of people is often better than relying on a single expert; ensemble learning applies the same idea to predictors
Ensemble Methods
• Construct a set of base classifiers learned from the training data
• Predict class label of test records by combining the predictions made
by multiple classifiers (e.g., by taking majority vote)
Necessary Conditions for Ensemble Methods
• Ensemble Methods work better than a single base classifier if:
1. All base classifiers are independent of each other
2. All base classifiers perform better than random guessing
(error rate < 0.5 for binary classification)
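• To see why these conditions matter, here is a minimal sketch (the figures of 25 base classifiers and a 0.35 error rate are illustrative assumptions, not from the slides): if the base classifiers are truly independent, the majority vote is wrong only when more than half of them are wrong, and that probability is far smaller than the individual error rate.

from math import comb

n_clf, eps = 25, 0.35   # assumed: 25 independent base classifiers, each with error rate 0.35
ensemble_error = sum(
    comb(n_clf, k) * eps**k * (1 - eps)**(n_clf - k)
    for k in range(n_clf // 2 + 1, n_clf + 1)   # majority (13 or more) of the classifiers wrong
)
print(ensemble_error)   # ~0.06, well below the base error rate of 0.35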
General Approach of Ensemble Learning
• The predictions of the base classifiers are combined using a majority vote or a weighted majority vote (weighted according to their accuracy or relevance)
Constructing Ensemble Classifiers
• By manipulating training set
• Example: bagging, boosting, random forests
• By manipulating input features
• Example: random forests
• By manipulating class labels
• Example: error-correcting output coding
• By manipulating learning algorithm
• Example: injecting randomness in the initial weights of ANN
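• As a small illustration of the last item (a sketch only; the network size, dataset, and number of seeds are arbitrary assumptions), the same algorithm can be trained several times on the same data with different random initial weights, and the resulting networks combined by majority vote:

import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Same data, same algorithm: only the random weight initialisation differs per network
nets = [MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=seed)
        .fit(X_train, y_train) for seed in range(5)]

votes = np.array([net.predict(X_test) for net in nets])   # shape (5, n_test), labels 0/1
majority = (votes.mean(axis=0) > 0.5).astype(int)         # majority vote over the 5 networks
print((majority == y_test).mean())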
Voting Classifiers – Training Diverse Classifiers
• Aggregating the predictions of several diverse classifiers and predicting the class that receives the most votes is called hard voting.
• Somewhat surprisingly, this voting classifier often achieves a higher accuracy than the best classifier in the ensemble.
• In fact, even if each classifier is a weak learner (meaning it does only slightly better than random guessing),
the ensemble can still be a strong learner (achieving high accuracy), provided there are a sufficient number
of weak learners in the ensemble and they are sufficiently diverse.
• If all classifiers are able to estimate class probabilities (i.e., if they all have a predict_proba() method), then
you can tell Scikit-Learn to predict the class with the highest class probability, averaged over all the
individual classifiers. This is called soft voting.
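• A minimal Scikit-Learn sketch of both voting modes (the choice of base classifiers and dataset here is illustrative):

from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

voting_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(random_state=42)),
        ("rf", RandomForestClassifier(random_state=42)),
        ("svc", SVC(probability=True, random_state=42)),  # probability=True enables predict_proba
    ],
    voting="soft",   # average class probabilities; use voting="hard" for a plain majority vote
)
voting_clf.fit(X_train, y_train)
print(voting_clf.score(X_test, y_test))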
Voting Classifiers – Bagging and Pasting
• One way to get a diverse set of classifiers is to use very different
training algorithms, as just discussed.
• Another approach is to use the same training algorithm for every
predictor but train them on different random subsets of the training
set.
• When sampling is performed with replacement, this method is called bagging (short for bootstrap aggregating).
• When sampling is performed without replacement, it is called pasting.
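• A minimal BaggingClassifier sketch (dataset and hyperparameter values are illustrative); switching bootstrap from True to False switches from bagging to pasting:

from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 500 decision trees, each trained on 100 instances drawn from the training set
# bootstrap=True  -> sampling with replacement (bagging)
# bootstrap=False -> sampling without replacement (pasting)
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=-1, random_state=42)
bag_clf.fit(X_train, y_train)
print(bag_clf.score(X_test, y_test))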
Voting Classifiers – Bagging and Pasting
[Figures: the same training algorithm is trained on different random subsets of the training set; one diagram illustrates sampling with replacement (bagging), the other sampling without replacement (pasting)]
Out of Bag Evaluation
• With bagging, some training instances may be sampled several times
for any given predictor, while others may not be sampled at all.
• It can be shown mathematically that only about 63% of the training instances are sampled, on average, for each predictor.
• The probability that a given training instance is selected at least once in a bootstrap sample is:
  1 - (1 - 1/n)^n   (n: number of training instances)
  which approaches 1 - 1/e ≈ 0.632 as n grows large
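• A quick numerical check of this figure (a sketch; the choice of n = 10,000 is arbitrary):

import numpy as np

n = 10_000                                  # number of training instances
rng = np.random.default_rng(42)
sample = rng.integers(0, n, size=n)         # one bootstrap sample: n draws with replacement
print(len(np.unique(sample)) / n)           # fraction of distinct instances drawn, ~0.632
print(1 - (1 - 1/n) ** n)                   # closed-form probability, also ~0.632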
Out of Bag Evaluation
[Figure: the probability 1 - (1 - 1/n)^n that a given training instance is selected, plotted against the number of training instances n; the curve quickly levels off near 0.632]
• The remaining ~37% of the training instances that are not sampled are called out-of-bag (OOB) instances. Note that they are not the same 37% for all predictors.
Out of Bag Evaluation
• A bagging ensemble can be evaluated using OOB instances, without
the need for a separate validation set: indeed, if there are enough
estimators, then each instance in the training set will likely be an OOB
instance of several estimators, so these estimators can be used to
make a fair ensemble prediction for that instance.
• Once you have a prediction for each instance, you can compute the
ensemble’s prediction accuracy (or any other metric).
• In Scikit-Learn, you can set oob_score=True when creating a
BaggingClassifier to request an automatic OOB evaluation after
training.
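• A minimal sketch of OOB evaluation in Scikit-Learn (dataset and hyperparameters are illustrative):

from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    bootstrap=True, oob_score=True,        # score each instance with the estimators that never saw it
    n_jobs=-1, random_state=42)
bag_clf.fit(X_train, y_train)
print(bag_clf.oob_score_)                  # OOB accuracy estimate, no separate validation set needed
print(bag_clf.score(X_test, y_test))       # typically close to the OOB estimate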
Bagging Example
• Consider a 1-dimensional data set:
Original Data:
x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
y 1 1 1 -1 -1 -1 -1 1 1 1
• The base classifier is a decision stump (a decision tree with a single split)
• Decision rule: x <= k versus x > k, where the split point k is chosen based on entropy
• [Stump diagram: test x <= k; the True branch predicts y_left, the False branch predicts y_right]
Bagging Example
Bagging Round 1:
x: 0.1 0.2 0.2 0.3 0.4 0.4 0.5 0.6 0.9 0.9
y:   1   1   1   1  -1  -1  -1  -1   1   1
Stump: x <= 0.35 → y = 1; x > 0.35 → y = -1

Bagging Round 2:
x: 0.1 0.2 0.3 0.4 0.5 0.5 0.9   1   1   1
y:   1   1   1  -1  -1  -1   1   1   1   1
Stump: x <= 0.7 → y = 1; x > 0.7 → y = 1

Bagging Round 3:
x: 0.1 0.2 0.3 0.4 0.4 0.5 0.7 0.7 0.8 0.9
y:   1   1   1  -1  -1  -1  -1  -1   1   1
Stump: x <= 0.35 → y = 1; x > 0.35 → y = -1

Bagging Round 4:
x: 0.1 0.1 0.2 0.4 0.4 0.5 0.5 0.7 0.8 0.9
y:   1   1   1  -1  -1  -1  -1  -1   1   1
Stump: x <= 0.3 → y = 1; x > 0.3 → y = -1

Bagging Round 5:
x: 0.1 0.1 0.2 0.5 0.6 0.6 0.6   1   1   1
y:   1   1   1  -1  -1  -1  -1   1   1   1
Stump: x <= 0.35 → y = 1; x > 0.35 → y = -1
Bagging Example
Bagging Round 6:
x: 0.2 0.4 0.5 0.6 0.7 0.7 0.7 0.8 0.9   1
y:   1  -1  -1  -1  -1  -1  -1   1   1   1
Stump: x <= 0.75 → y = -1; x > 0.75 → y = 1

Bagging Round 7:
x: 0.1 0.4 0.4 0.6 0.7 0.8 0.9 0.9 0.9   1
y:   1  -1  -1  -1  -1   1   1   1   1   1
Stump: x <= 0.75 → y = -1; x > 0.75 → y = 1

Bagging Round 8:
x: 0.1 0.2 0.5 0.5 0.5 0.7 0.7 0.8 0.9   1
y:   1   1  -1  -1  -1  -1  -1   1   1   1
Stump: x <= 0.75 → y = -1; x > 0.75 → y = 1

Bagging Round 9:
x: 0.1 0.3 0.4 0.4 0.6 0.7 0.7 0.8   1   1
y:   1   1  -1  -1  -1  -1  -1   1   1   1
Stump: x <= 0.75 → y = -1; x > 0.75 → y = 1

Bagging Round 10:
x: 0.1 0.1 0.1 0.1 0.3 0.3 0.8 0.8 0.9 0.9
y:   1   1   1   1   1   1   1   1   1   1
Stump: x <= 0.05 → y = 1; x > 0.05 → y = 1
Bagging Example
• Summary of Trained Decision Stumps:
Round Split Point Left Class Right Class
1 0.35 1 -1
2 0.7 1 1
3 0.35 1 -1
4 0.3 1 -1
5 0.35 1 -1
6 0.75 -1 1
7 0.75 -1 1
8 0.75 -1 1
9 0.75 -1 1
10 0.05 1 1
Bagging Example
• Use majority vote (sign of sum of predictions) to determine class of
ensemble classifier
Round x=0.1 x=0.2 x=0.3 x=0.4 x=0.5 x=0.6 x=0.7 x=0.8 x=0.9 x=1.0
1 1 1 1 -1 -1 -1 -1 -1 -1 -1
2 1 1 1 1 1 1 1 1 1 1
3 1 1 1 -1 -1 -1 -1 -1 -1 -1
4 1 1 1 -1 -1 -1 -1 -1 -1 -1
5 1 1 1 -1 -1 -1 -1 -1 -1 -1
6 -1 -1 -1 -1 -1 -1 -1 1 1 1
7 -1 -1 -1 -1 -1 -1 -1 1 1 1
8 -1 -1 -1 -1 -1 -1 -1 1 1 1
9 -1 -1 -1 -1 -1 -1 -1 1 1 1
10 1 1 1 1 1 1 1 1 1 1
Sum   2  2  2 -6 -6 -6 -6  2  2  2
Sign  1  1  1 -1 -1 -1 -1  1  1  1
• Bagging can also increase the complexity (representation capacity) of
simple classifiers such as decision stumps
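• A minimal sketch that reproduces the vote table above from the ten trained stumps (split points and leaf classes taken from the summary of trained decision stumps):

import numpy as np

# (split point k, class for x <= k, class for x > k), rounds 1 to 10
stumps = [(0.35, 1, -1), (0.70, 1, 1), (0.35, 1, -1), (0.30, 1, -1), (0.35, 1, -1),
          (0.75, -1, 1), (0.75, -1, 1), (0.75, -1, 1), (0.75, -1, 1), (0.05, 1, 1)]

x = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
preds = np.array([[left if xi <= k else right for xi in x] for k, left, right in stumps])
total = preds.sum(axis=0)      # sum of the 10 stump predictions at each x
print(total)                   # [ 2  2  2 -6 -6 -6 -6  2  2  2]
print(np.sign(total))          # ensemble class: [ 1  1  1 -1 -1 -1 -1  1  1  1]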
Random Forest Algorithm
• Construct an ensemble of decision trees by manipulating the training set as well as the input features
• Use a bootstrap sample to train every decision tree (similar to bagging)
• Use the following tree induction algorithm:
  • At every internal node of the decision tree, randomly sample p attributes from which the split criterion is selected
  • Repeat this procedure until all leaves are pure (the trees are left unpruned)
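• A minimal Scikit-Learn sketch of a random forest (dataset and hyperparameters are illustrative):

from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 500 trees, each grown on a bootstrap sample; at every split only a random
# subset of the features (here sqrt of the total) is considered
rnd_clf = RandomForestClassifier(
    n_estimators=500, max_features="sqrt", n_jobs=-1, random_state=42)
rnd_clf.fit(X_train, y_train)
print(rnd_clf.score(X_test, y_test))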
Boosting (originally called hypothesis boosting)
• Refers to an ensemble method that can combine several weak
learners into a strong learner.
• The general idea of boosting is to train predictors sequentially, each
trying to correct its predecessor.
• AdaBoost (adaptive boosting)
• Gradient Boosting
• XGBoost
• LightGBM (Light gradient boosting machine)
AdaBoost
• One way for a new predictor to correct its predecessor is to pay a bit
more attention to the training instances that the predecessor
underfit.
• This results in new predictors focusing more and more on the hard
cases.
• This is the technique used by AdaBoost
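• A minimal Scikit-Learn sketch of AdaBoost with decision stumps (dataset and hyperparameters are illustrative):

from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 200 decision stumps trained sequentially, each paying a bit more attention
# to the training instances its predecessors got wrong
ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),
    n_estimators=200, learning_rate=0.5, random_state=42)
ada_clf.fit(X_train, y_train)
print(ada_clf.score(X_test, y_test))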
AdaBoost in detail (slightly different from the version in the textbook)
Voting Power
[Figure: a predictor's voting power (weight) plotted against its error rate, ranging from about +2.5 down to about -2.5; the weight is large and positive when the error is near 0, zero when the error is 0.5, and negative when the error exceeds 0.5]
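• As a hedged sketch of the relationship in the figure: the classic AdaBoost weight α = ½ ln((1 − ε)/ε) produces a curve of exactly this shape (the slides note their version differs slightly from the textbook, so the exact scaling here is an assumption):

import numpy as np

errors = np.array([0.05, 0.25, 0.50, 0.75, 0.95])
alpha = 0.5 * np.log((1 - errors) / errors)   # classic AdaBoost voting power
print(np.round(alpha, 2))                     # [ 1.47  0.55  0.   -0.55 -1.47]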
How is the overall classifier assembled?
• The overall classifier is assembled in a series of rounds
• For each round:
  • Pick the best "weak" classifier, h(x), to add to the overall classifier, H(x)
    • Best: the classifier that makes the fewest (weighted) errors
  • Assign the classifier a voting power α
  • Append the term αh(x) to the overall classifier
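• A minimal from-scratch sketch of this assembly loop (assuming decision stumps as weak learners, labels in {-1, +1}, and the classic AdaBoost weight and re-weighting rules, which may differ in detail from the version in the Excel file):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=10):
    # Assemble H(x) = sign(sum_t alpha_t * h_t(x)) one round at a time; y must be -1/+1
    n = len(y)
    w = np.full(n, 1.0 / n)                 # instance weights, initially uniform
    stumps, alphas = [], []
    for _ in range(n_rounds):
        # pick the best weak classifier under the current instance weights
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = w[pred != y].sum()            # weighted error rate
        if err >= 0.5:                      # no better than random guessing: stop
            break
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))   # voting power of this round
        stumps.append(stump)
        alphas.append(alpha)
        w *= np.exp(-alpha * y * pred)      # boost the weights of misclassified instances
        w /= w.sum()                        # renormalise
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.sign(scores)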
AdaBoost classifier
• Check the corresponding Excel file to see how the AdaBoost classifier works