
CSC 411 / CSC D11 / CSC C11 AdaBoost

18 AdaBoost
Boosting is a general strategy for learning classifiers by combining simpler ones. The idea of
boosting is to take a “weak classifier” — that is, any classifier that will do at least slightly better
than chance — and use it to build a much better classifier, thereby boosting the performance of the
weak classification algorithm. This boosting is done by averaging the outputs of a collection of
weak classifiers. The most popular boosting algorithm is AdaBoost, so-called because it is “adaptive.”[1] AdaBoost is extremely simple to use and implement (far simpler than SVMs), and often
gives very effective results. There is tremendous flexibility in the choice of weak classifier as well.
Boosting is a specific example of a general class of learning algorithms called ensemble methods,
which attempt to build better learning algorithms by combining multiple simpler algorithms.
Suppose we are given training data $\{(\mathbf{x}_i, y_i)\}_{i=1}^N$, where $\mathbf{x}_i \in \mathbb{R}^K$ and $y_i \in \{-1, 1\}$. And suppose we are given a (potentially large) number of weak classifiers, denoted $f_m(\mathbf{x}) \in \{-1, 1\}$, and a 0-1 loss function $I$, defined as

$$I(f_m(\mathbf{x}), y) = \begin{cases} 0 & \text{if } f_m(\mathbf{x}_i) = y_i \\ 1 & \text{if } f_m(\mathbf{x}_i) \neq y_i \end{cases} \qquad (1)$$

Then, the pseudocode of the AdaBoost algorithm is as follows:

    for $i$ from 1 to $N$: $w_i^{(1)} = 1$
    for $m = 1$ to $M$ do
        Fit weak classifier $m$ to minimize the objective function
            $\epsilon_m = \frac{\sum_{i=1}^N w_i^{(m)} I(f_m(\mathbf{x}_i) \neq y_i)}{\sum_i w_i^{(m)}}$
        where $I(f_m(\mathbf{x}_i) \neq y_i) = 1$ if $f_m(\mathbf{x}_i) \neq y_i$ and 0 otherwise
        $\alpha_m = \ln \frac{1 - \epsilon_m}{\epsilon_m}$
        for all $i$ do
            $w_i^{(m+1)} = w_i^{(m)} \, e^{\alpha_m I(f_m(\mathbf{x}_i) \neq y_i)}$
        end for
    end for

After learning, the final classifier is based on a linear combination of the weak classifiers:
$$g(\mathbf{x}) = \text{sign}\left(\sum_{m=1}^{M} \alpha_m f_m(\mathbf{x})\right) \qquad (2)$$

Essentially, AdaBoost is a greedy algorithm that builds up a “strong classifier”, i.e., $g(\mathbf{x})$, incrementally, by optimizing the weights for, and adding, one weak classifier at a time.
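To make the procedure concrete, here is a minimal sketch in Python/NumPy of the pseudocode and of Eq. (2), assuming scikit-learn's DecisionTreeClassifier with max_depth=1 as the weak classifier (a decision stump, described in Section 18.1); the function names adaboost_fit and adaboost_predict are our own, not a standard API:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost_fit(X, y, M=50):
        """Fit M weighted weak classifiers; y must take values in {-1, +1}."""
        N = X.shape[0]
        w = np.ones(N)                                   # w_i^{(1)} = 1
        stumps, alphas = [], []
        for m in range(M):
            stump = DecisionTreeClassifier(max_depth=1)  # decision stump (Section 18.1)
            stump.fit(X, y, sample_weight=w)             # weighted fit stands in for minimizing eps_m
            miss = stump.predict(X) != y                 # I(f_m(x_i) != y_i)
            eps = np.sum(w * miss) / np.sum(w)
            eps = np.clip(eps, 1e-10, 1 - 1e-10)         # numerical guard, not part of the pseudocode
            alpha = np.log((1.0 - eps) / eps)            # alpha_m = ln((1 - eps_m) / eps_m)
            w = w * np.exp(alpha * miss)                 # w_i^{(m+1)} = w_i^{(m)} exp(alpha_m I(...))
            stumps.append(stump)
            alphas.append(alpha)
        return stumps, alphas

    def adaboost_predict(stumps, alphas, X):
        """g(x) = sign(sum_m alpha_m f_m(x)), Eq. (2)."""
        F = sum(alpha * stump.predict(X) for stump, alpha in zip(stumps, alphas))
        return np.sign(F)

scikit-learn also ships its own AdaBoostClassifier, which implements a variant of this scheme.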
[1] AdaBoost was called adaptive because, unlike previous boosting algorithms, it does not need to know error bounds on the weak classifiers, nor does it need to know the number of classifiers in advance.


Figure 1: Illustration of the steps of AdaBoost. The decision boundary is shown in green for each step, and the decision stump for each step is shown as a dashed line. The results are shown after 1, 2, 3, 6, 10, and 150 steps of AdaBoost. (Figure from Pattern Recognition and Machine Learning by Chris Bishop.)


[Figure 2 panels (plots omitted): training data; classified data; loss on the training set (exponential loss vs. binary loss over iterations); $f(\mathbf{x}) = \sum_m \alpha_m f_m(\mathbf{x})$; and the decision boundary.]
Figure 2: 50 steps of AdaBoost used to learn a classifier with decision stumps.


18.1 Decision stumps


As an example of a weak classifier, we consider “decision stumps,” which are a trivial special case
of decision trees. A decision stump has the following form:

$$f(\mathbf{x}) = s(x_k > c) \qquad (3)$$

where the value in the parentheses is 1 if the $k$-th element of the vector $\mathbf{x}$ is greater than $c$, and -1 otherwise. The scalar $s$ is either -1 or 1, which allows the classifier to respond with class 1 when $x_k \leq c$. Accordingly, there are three parameters to a decision stump:
• c∈R

• k ∈ {1, ...K}, where K is the dimension of x, and

• s ∈ {−1, 1}
Because the number of possible parameter settings is relatively small, a decision stump is often
trained by brute force: discretize the real numbers from the smallest to the largest value in the
training set, enumerate all possible classifiers, and pick the one with the lowest training error. One
can be more clever in the discretization: between each pair of data points, only one classifier must
be tested (since any stump in this range will give the same value). More sophisticated methods, for
example, based on binning the data, or building CDFs of the data, may also be possible.
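As an illustration of the brute-force search just described, here is a short sketch in NumPy (the function name fit_stump and the weighting convention are ours; the weights w are the AdaBoost example weights from the pseudocode above):

    import numpy as np

    def fit_stump(X, y, w):
        """Brute-force decision stump: minimize the weighted 0-1 loss over (k, c, s).
        X is N x K, y takes values in {-1, +1}, w are nonnegative example weights."""
        N, K = X.shape
        best = (np.inf, 0, 0.0, 1)                       # (weighted error, k, c, s)
        for k in range(K):
            xs = np.unique(X[:, k])                      # sorted distinct values of feature k
            # one candidate threshold between each pair of adjacent values is enough
            thresholds = np.concatenate(([xs[0] - 1.0], (xs[:-1] + xs[1:]) / 2.0))
            for c in thresholds:
                pred = np.where(X[:, k] > c, 1, -1)      # s = +1 case
                err_pos = np.sum(w * (pred != y))
                err_neg = np.sum(w * (-pred != y))       # s = -1 just flips the prediction
                if err_pos < best[0]:
                    best = (err_pos, k, c, 1)
                if err_neg < best[0]:
                    best = (err_neg, k, c, -1)
        return best                                      # weighted error and parameters (k, c, s)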

18.2 Why does it work?


There are many different ways to analyze AdaBoost; none of them alone gives a full picture of why
AdaBoost works so well. AdaBoost was first invented based on optimization of certain bounds on training error, and, since then, a number of new theoretical properties have been discovered.

Loss function view. Here we discuss the loss function interpretation of AdaBoost. As was shown (some years after AdaBoost was first invented), AdaBoost can be viewed as greedy optimization of a particular loss function. We define $f(\mathbf{x}) = \frac{1}{2}\sum_m \alpha_m f_m(\mathbf{x})$, and rewrite the classifier as $g(\mathbf{x}) = \text{sign}(f(\mathbf{x}))$ (the factor of 1/2 has no effect on the classifier output). AdaBoost can then be viewed as optimizing the exponential loss:

$$L_{\exp}(\mathbf{x}, y) = e^{-y f(\mathbf{x})} \qquad (4)$$

so that the full learning objective function, given training data $\{(\mathbf{x}_i, y_i)\}_{i=1}^N$, is

$$E = \sum_i e^{-\frac{1}{2} y_i \sum_{m=1}^M \alpha_m f_m(\mathbf{x}_i)} \qquad (5)$$
which must be optimized with respect to the weights α and the parameters of the weak classifiers.
The optimization process is greedy and sequential: we add one weak classifier at a time, choosing it

and its $\alpha$ to be optimal with respect to $E$, and then never change it again. Note that the exponential loss is an upper bound on the 0-1 loss:

$$L_{\exp}(\mathbf{x}, y) \geq L_{0\text{-}1}(\mathbf{x}, y) \qquad (6)$$

Hence, if an exponential loss of zero is achieved, then the 0-1 loss is zero as well, and all training points are correctly classified.
Consider the weak classifier $f_m$ to be added at step $m$. The entire objective function can be written to separate out the contribution of this classifier:

$$E = \sum_i e^{-\frac{1}{2} y_i \sum_{j=1}^{m-1} \alpha_j f_j(\mathbf{x}_i) - \frac{1}{2} y_i \alpha_m f_m(\mathbf{x}_i)} \qquad (7)$$

$$= \sum_i e^{-\frac{1}{2} y_i \sum_{j=1}^{m-1} \alpha_j f_j(\mathbf{x}_i)} \, e^{-\frac{1}{2} y_i \alpha_m f_m(\mathbf{x}_i)} \qquad (8)$$

Since we are holding constant the first $m-1$ terms, we can replace them with a single constant $w_i^{(m)} = e^{-\frac{1}{2} y_i \sum_{j=1}^{m-1} \alpha_j f_j(\mathbf{x}_i)}$. Note that these are the same weights computed by the recursion used by AdaBoost, i.e., $w_i^{(m)} \propto w_i^{(m-1)} e^{-\frac{1}{2} y_i \alpha_{m-1} f_{m-1}(\mathbf{x}_i)}$. (There is a proportionality constant that can be ignored.) Hence, we have

$$E = \sum_i w_i^{(m)} e^{-\frac{1}{2} y_i \alpha_m f_m(\mathbf{x}_i)} \qquad (9)$$

We can split this into two summations, one for data correctly classified by $f_m$, and one for those misclassified:

$$E = \sum_{i: f_m(\mathbf{x}_i) = y_i} w_i^{(m)} e^{-\frac{\alpha_m}{2}} + \sum_{i: f_m(\mathbf{x}_i) \neq y_i} w_i^{(m)} e^{\frac{\alpha_m}{2}} \qquad (10)$$

Rearranging terms, we have

$$E = \left(e^{\frac{\alpha_m}{2}} - e^{-\frac{\alpha_m}{2}}\right) \sum_i w_i^{(m)} I(f_m(\mathbf{x}_i) \neq y_i) + e^{-\frac{\alpha_m}{2}} \sum_i w_i^{(m)} \qquad (11)$$

Optimizing this with respect to $f_m$ is equivalent to optimizing $\sum_i w_i^{(m)} I(f_m(\mathbf{x}_i) \neq y_i)$, which is what AdaBoost does. The optimal value for $\alpha_m$ can be derived by solving $\frac{dE}{d\alpha_m} = 0$:

$$\frac{dE}{d\alpha_m} = \frac{1}{2}\left(e^{\frac{\alpha_m}{2}} + e^{-\frac{\alpha_m}{2}}\right) \sum_i w_i^{(m)} I(f_m(\mathbf{x}_i) \neq y_i) - \frac{1}{2} e^{-\frac{\alpha_m}{2}} \sum_i w_i^{(m)} = 0 \qquad (12)$$

Dividing both sides by $\frac{1}{2} \sum_i w_i^{(m)}$, we have

$$0 = e^{\frac{\alpha_m}{2}} \epsilon_m + e^{-\frac{\alpha_m}{2}} \epsilon_m - e^{-\frac{\alpha_m}{2}} \qquad (13)$$

$$e^{\frac{\alpha_m}{2}} \epsilon_m = e^{-\frac{\alpha_m}{2}} (1 - \epsilon_m) \qquad (14)$$

$$\frac{\alpha_m}{2} + \ln \epsilon_m = -\frac{\alpha_m}{2} + \ln(1 - \epsilon_m) \qquad (15)$$

$$\alpha_m = \ln \frac{1 - \epsilon_m}{\epsilon_m} \qquad (16)$$
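A quick numerical sanity check of Eq. (16), comparing the closed form against a grid search over $\alpha_m$; the value $\epsilon_m = 0.2$ is just an illustrative choice, not from the notes:

    import numpy as np

    eps = 0.2                                # example weighted error eps_m (arbitrary)
    alphas = np.linspace(0.01, 5.0, 10000)
    # objective from Eq. (11), normalized by sum_i w_i^{(m)}
    E = (np.exp(alphas / 2) - np.exp(-alphas / 2)) * eps + np.exp(-alphas / 2)
    print(alphas[np.argmin(E)])              # numerical minimizer, approximately 1.386
    print(np.log((1 - eps) / eps))           # closed form from Eq. (16): ln(4) = 1.386...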


Figure 3: Loss functions for learning: Black: 0-1 loss. Blue: Hinge Loss. Red: Logistic regression.
Green: Exponential loss. (Figure from Pattern Recognition and Machine Learning by Chris Bishop.)

Problems with the loss function view. The exponential loss is not a very good loss function to
use in general. For example, if we directly optimize the exponential loss over all variables in the
classifier (e.g., with gradient descent), we will often get terrible performance. So the loss-function
interpretation of AdaBoost does not tell the whole story.

Margin view. One might expect that, when AdaBoost reaches zero training set error, adding any
new weak classifiers would cause overfitting. In practice, the opposite often occurs: continuing to
add weak classifiers actually improves test set performance in many situations. One explanation
comes from looking at the margins: adding classifiers tends to increase the margin size. The formal
details of this will not be discussed here.

18.3 Early stopping


It is nonetheless possible to overfit with AdaBoost, by adding too many classifiers. The solution that is normally used in practice is a procedure called early stopping. The idea is as follows. We
partition our data set into two pieces, a training set and a test set. The training set is used to train
the algorithm normally. However, at each step of the algorithm, we also compute the 0-1 binary
loss on the test set. During the course of the algorithm, the exponential loss on the training set is
guaranteed to decrease, and the 0-1 binary loss will generally decrease as well. The errors on the
testing set will also generally decrease in the first steps of the algorithm, however, at some point,
the testing error will begin to get noticeably worse. When this happens, we revert the classifier to
the form that gave the best test error, and discard any subsequent changes (i.e., additional weak
classifiers).
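A sketch of this procedure using scikit-learn's AdaBoostClassifier, which exposes the per-round predictions through staged_predict; the synthetic dataset and the choice of 200 rounds are arbitrary, for illustration only:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import AdaBoostClassifier

    # synthetic data, purely for illustration
    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    model = AdaBoostClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

    # 0-1 loss on the held-out set after each boosting round
    test_err = [np.mean(pred != y_te) for pred in model.staged_predict(X_te)]
    best_m = int(np.argmin(test_err)) + 1    # number of weak classifiers to keep
    print("best number of rounds:", best_m)

    # "reverting" the classifier amounts to keeping only the first best_m rounds,
    # e.g., by refitting with n_estimators=best_m (or evaluating via staged_predict)
    final = AdaBoostClassifier(n_estimators=best_m, random_state=0).fit(X_tr, y_tr)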
The intuition for the algorithm is as follows. When we begin learning, our initial classifier is
extremely simple and smooth. During the learning process, we add more and more complexity to the model to improve the fit to the data. At some point, adding additional complexity to the model
overfits: we are no longer modeling the decision boundary we wish to fit, but are fitting the noise
in the data instead. We use the test set to determine when overfitting begins, and stop learning at
that point.
Early stopping can be used for most iterative learning algorithms. For example, suppose we use
gradient descent to learn a regression algorithm. If we begin with weights w = 0, we are beginning
with a very smooth curve. Each step of gradient descent will make the curve less smooth, as the
entries of w get larger and larger; stopping early can prevent w from getting too large (and thus
too non-smooth).
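A small sketch of that regression example, assuming least-squares fitting of a polynomial model by gradient descent; the data, polynomial degree, and step size are all illustrative choices of ours:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, 60)
    y = np.sin(3 * x) + 0.3 * rng.standard_normal(60)         # noisy 1-D data
    x_tr, y_tr, x_va, y_va = x[:40], y[:40], x[40:], y[40:]

    def features(x, degree=12):                                # high-degree polynomial basis
        return np.vander(x, degree + 1, increasing=True)

    A_tr, A_va = features(x_tr), features(x_va)
    w = np.zeros(A_tr.shape[1])                                # start from the smoothest curve, w = 0
    lr, best = 1e-2, (np.inf, w.copy(), 0)
    for step in range(20000):
        grad = A_tr.T @ (A_tr @ w - y_tr) / len(y_tr)          # least-squares gradient
        w -= lr * grad
        va_err = np.mean((A_va @ w - y_va) ** 2)
        if va_err < best[0]:
            best = (va_err, w.copy(), step)                    # keep the best iterate seen so far
    print("best step:", best[2], "validation MSE:", best[0])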
Early stopping is very simple and very general; however, it is heuristic, as the final result one gets will depend on the particulars of the optimization algorithm being used, and not just on the objective function. However, AdaBoost's procedure is suboptimal anyway (once a weak classifier is added, it is never updated).
An even more aggressive form of early stopping is to simply stop learning at a fixed number of iterations, or by some other criterion unrelated to test set error (e.g., when the result "looks good"). In fact, practitioners often use early stopping to regularize unintentionally, simply because they halt the optimizer before it has converged, e.g., because the convergence threshold is set too high, or because they are too impatient to wait.

Copyright © 2015 Aaron Hertzmann, David J. Fleet and Marcus Brubaker
