
CSC 411 / CSC D11 / CSC C11 AdaBoost

18 AdaBoost
Boosting is a general strategy for learning classifiers by combining simpler ones. The idea of
boosting is to take a “weak classifier” — that is, any classifier that will do at least slightly better
than chance — and use it to build a much better classifier, thereby boosting the performance of the
weak classification algorithm. This boosting is done by averaging the outputs of a collection of
weak classifiers. The most popular boosting algorithm is AdaBoost, so-called because it is “adaptive.”[1] AdaBoost is extremely simple to use and implement (far simpler than SVMs), and often
gives very effective results. There is tremendous flexibility in the choice of weak classifier as well.
Boosting is a specific example of a general class of learning algorithms called ensemble methods,
which attempt to build better learning algorithms by combining multiple simpler algorithms.
Suppose we are given training data $\{(\mathbf{x}_i, y_i)\}_{i=1}^N$, where $\mathbf{x}_i \in \mathbb{R}^K$ and $y_i \in \{-1, 1\}$. And suppose we are given a (potentially large) number of weak classifiers, denoted $f_m(\mathbf{x}) \in \{-1, 1\}$, and a 0-1 loss function $I$, defined as

$$I(f_m(\mathbf{x}), y) = \begin{cases} 0 & \text{if } f_m(\mathbf{x}_i) = y_i \\ 1 & \text{if } f_m(\mathbf{x}_i) \neq y_i \end{cases} \qquad (1)$$

Then, the pseudocode of the AdaBoost algorithm is as follows:

    for $i$ from 1 to $N$: $w_i^{(1)} = 1$
    for $m = 1$ to $M$ do
        Fit weak classifier $m$ to minimize the objective function
            $\epsilon_m = \frac{\sum_{i=1}^N w_i^{(m)} I(f_m(\mathbf{x}_i) \neq y_i)}{\sum_i w_i^{(m)}}$
        where $I(f_m(\mathbf{x}_i) \neq y_i) = 1$ if $f_m(\mathbf{x}_i) \neq y_i$ and 0 otherwise
        $\alpha_m = \ln \frac{1 - \epsilon_m}{\epsilon_m}$
        for all $i$ do
            $w_i^{(m+1)} = w_i^{(m)} \, e^{\alpha_m I(f_m(\mathbf{x}_i) \neq y_i)}$
        end for
    end for

After learning, the final classifier is based on a linear combination of the weak classifiers:
$$g(\mathbf{x}) = \text{sign}\left(\sum_{m=1}^{M} \alpha_m f_m(\mathbf{x})\right) \qquad (2)$$

Essentially, AdaBoost is a greedy algorithm that builds up a “strong classifier”, i.e., $g(\mathbf{x})$, incrementally, by optimizing the weights for, and adding, one weak classifier at a time.
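To make the procedure concrete, here is a minimal sketch in Python/NumPy of the pseudocode and of Eq. (2), assuming scikit-learn's DecisionTreeClassifier with max_depth=1 as the weak classifier (a decision stump, described in Section 18.1); the function names adaboost_fit and adaboost_predict are our own, not a standard API:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost_fit(X, y, M=50):
        """Fit M weighted weak classifiers; y must take values in {-1, +1}."""
        N = X.shape[0]
        w = np.ones(N)                                   # w_i^{(1)} = 1
        stumps, alphas = [], []
        for m in range(M):
            stump = DecisionTreeClassifier(max_depth=1)  # decision stump (Section 18.1)
            stump.fit(X, y, sample_weight=w)             # weighted fit stands in for minimizing eps_m
            miss = stump.predict(X) != y                 # I(f_m(x_i) != y_i)
            eps = np.sum(w * miss) / np.sum(w)
            eps = np.clip(eps, 1e-10, 1 - 1e-10)         # numerical guard, not part of the pseudocode
            alpha = np.log((1.0 - eps) / eps)            # alpha_m = ln((1 - eps_m) / eps_m)
            w = w * np.exp(alpha * miss)                 # w_i^{(m+1)} = w_i^{(m)} exp(alpha_m I(...))
            stumps.append(stump)
            alphas.append(alpha)
        return stumps, alphas

    def adaboost_predict(stumps, alphas, X):
        """g(x) = sign(sum_m alpha_m f_m(x)), Eq. (2)."""
        F = sum(alpha * stump.predict(X) for stump, alpha in zip(stumps, alphas))
        return np.sign(F)

scikit-learn also ships its own AdaBoostClassifier, which implements a variant of this scheme.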
[1] AdaBoost was called adaptive because, unlike previous boosting algorithms, it does not need to know error bounds on the weak classifiers, nor does it need to know the number of classifiers in advance.


Figure 1: Illustration of the steps of AdaBoost. The decision boundary is shown in green for each step, and the decision stump for each step is shown as a dashed line. The results are shown after 1, 2, 3, 6, 10, and 150 steps of AdaBoost. (Figure from Pattern Recognition and Machine Learning by Chris Bishop.)


[Figure 2 panels (plots omitted): training data; classified data; loss on the training set (exponential loss vs. binary loss over iterations); $f(\mathbf{x}) = \sum_m \alpha_m f_m(\mathbf{x})$; and the decision boundary.]
Figure 2: 50 steps of AdaBoost used to learn a classifier with decision stumps.


18.1 Decision stumps


As an example of a weak classifier, we consider “decision stumps,” which are a trivial special case
of decision trees. A decision stump has the following form:

$$f(\mathbf{x}) = s(x_k > c) \qquad (3)$$

where the value in the parentheses is 1 if the $k$-th element of the vector $\mathbf{x}$ is greater than $c$, and -1 otherwise. The scalar $s$ is either -1 or 1, which allows the classifier to respond with class 1 when $x_k \leq c$. Accordingly, there are three parameters to a decision stump:
• c∈R

• k ∈ {1, ...K}, where K is the dimension of x, and

• s ∈ {−1, 1}
Because the number of possible parameter settings is relatively small, a decision stump is often
trained by brute force: discretize the real numbers from the smallest to the largest value in the
training set, enumerate all possible classifiers, and pick the one with the lowest training error. One
can be more clever in the discretization: between each pair of data points, only one classifier must
be tested (since any stump in this range will give the same value). More sophisticated methods, for
example, based on binning the data, or building CDFs of the data, may also be possible.
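As an illustration of the brute-force search just described, here is a short sketch in NumPy (the function name fit_stump and the weighting convention are ours; the weights w are the AdaBoost example weights from the pseudocode above):

    import numpy as np

    def fit_stump(X, y, w):
        """Brute-force decision stump: minimize the weighted 0-1 loss over (k, c, s).
        X is N x K, y takes values in {-1, +1}, w are nonnegative example weights."""
        N, K = X.shape
        best = (np.inf, 0, 0.0, 1)                       # (weighted error, k, c, s)
        for k in range(K):
            xs = np.unique(X[:, k])                      # sorted distinct values of feature k
            # one candidate threshold between each pair of adjacent values is enough
            thresholds = np.concatenate(([xs[0] - 1.0], (xs[:-1] + xs[1:]) / 2.0))
            for c in thresholds:
                pred = np.where(X[:, k] > c, 1, -1)      # s = +1 case
                err_pos = np.sum(w * (pred != y))
                err_neg = np.sum(w * (-pred != y))       # s = -1 just flips the prediction
                if err_pos < best[0]:
                    best = (err_pos, k, c, 1)
                if err_neg < best[0]:
                    best = (err_neg, k, c, -1)
        return best                                      # weighted error and parameters (k, c, s)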

18.2 Why does it work?


There are many different ways to analyze AdaBoost; none of them alone gives a full picture of why
AdaBoost works so well. AdaBoost was first invented based on optimization of certain bounds on training error, and, since then, a number of new theoretical properties have been discovered.

Loss function view. Here we discuss the loss function interpretation of AdaBoost. As was shown (some years after AdaBoost was first invented), AdaBoost can be viewed as greedy optimization of a particular loss function. We define $f(\mathbf{x}) = \frac{1}{2}\sum_m \alpha_m f_m(\mathbf{x})$, and rewrite the classifier as $g(\mathbf{x}) = \text{sign}(f(\mathbf{x}))$ (the factor of 1/2 has no effect on the classifier output). AdaBoost can then be viewed as optimizing the exponential loss:

$$L_{\exp}(\mathbf{x}, y) = e^{-y f(\mathbf{x})} \qquad (4)$$

so that the full learning objective function, given training data $\{(\mathbf{x}_i, y_i)\}_{i=1}^N$, is

$$E = \sum_i e^{-\frac{1}{2} y_i \sum_{m=1}^M \alpha_m f_m(\mathbf{x}_i)} \qquad (5)$$
which must be optimized with respect to the weights α and the parameters of the weak classifiers.
The optimization process is greedy and sequential: we add one weak classifier at a time, choosing it

and its $\alpha$ to be optimal with respect to $E$, and then never change it again. Note that the exponential loss is an upper bound on the 0-1 loss:

$$L_{\exp}(\mathbf{x}, y) \geq L_{0\text{-}1}(\mathbf{x}, y) \qquad (6)$$

Hence, if an exponential loss of zero is achieved, then the 0-1 loss is zero as well, and all training points are correctly classified.
Consider the weak classifier $f_m$ to be added at step $m$. The entire objective function can be written to separate out the contribution of this classifier:

$$E = \sum_i e^{-\frac{1}{2} y_i \sum_{j=1}^{m-1} \alpha_j f_j(\mathbf{x}_i) - \frac{1}{2} y_i \alpha_m f_m(\mathbf{x}_i)} \qquad (7)$$

$$= \sum_i e^{-\frac{1}{2} y_i \sum_{j=1}^{m-1} \alpha_j f_j(\mathbf{x}_i)} \, e^{-\frac{1}{2} y_i \alpha_m f_m(\mathbf{x}_i)} \qquad (8)$$

Since we are holding constant the first $m-1$ terms, we can replace them with a single constant $w_i^{(m)} = e^{-\frac{1}{2} y_i \sum_{j=1}^{m-1} \alpha_j f_j(\mathbf{x}_i)}$. Note that these are the same weights computed by the recursion used by AdaBoost, i.e., $w_i^{(m)} \propto w_i^{(m-1)} e^{-\frac{1}{2} y_i \alpha_{m-1} f_{m-1}(\mathbf{x}_i)}$. (There is a proportionality constant that can be ignored.) Hence, we have

$$E = \sum_i w_i^{(m)} e^{-\frac{1}{2} y_i \alpha_m f_m(\mathbf{x}_i)} \qquad (9)$$

We can split this into two summations, one for data correctly classified by $f_m$, and one for those misclassified:

$$E = \sum_{i: f_m(\mathbf{x}_i) = y_i} w_i^{(m)} e^{-\frac{\alpha_m}{2}} + \sum_{i: f_m(\mathbf{x}_i) \neq y_i} w_i^{(m)} e^{\frac{\alpha_m}{2}} \qquad (10)$$

Rearranging terms, we have

$$E = \left(e^{\frac{\alpha_m}{2}} - e^{-\frac{\alpha_m}{2}}\right) \sum_i w_i^{(m)} I(f_m(\mathbf{x}_i) \neq y_i) + e^{-\frac{\alpha_m}{2}} \sum_i w_i^{(m)} \qquad (11)$$

Optimizing this with respect to $f_m$ is equivalent to optimizing $\sum_i w_i^{(m)} I(f_m(\mathbf{x}_i) \neq y_i)$, which is what AdaBoost does. The optimal value for $\alpha_m$ can be derived by solving $\frac{dE}{d\alpha_m} = 0$:

$$\frac{dE}{d\alpha_m} = \frac{1}{2}\left(e^{\frac{\alpha_m}{2}} + e^{-\frac{\alpha_m}{2}}\right) \sum_i w_i^{(m)} I(f_m(\mathbf{x}_i) \neq y_i) - \frac{1}{2} e^{-\frac{\alpha_m}{2}} \sum_i w_i^{(m)} = 0 \qquad (12)$$

Dividing both sides by $\frac{1}{2} \sum_i w_i^{(m)}$, we have

$$0 = e^{\frac{\alpha_m}{2}} \epsilon_m + e^{-\frac{\alpha_m}{2}} \epsilon_m - e^{-\frac{\alpha_m}{2}} \qquad (13)$$

$$e^{\frac{\alpha_m}{2}} \epsilon_m = e^{-\frac{\alpha_m}{2}} (1 - \epsilon_m) \qquad (14)$$

$$\frac{\alpha_m}{2} + \ln \epsilon_m = -\frac{\alpha_m}{2} + \ln(1 - \epsilon_m) \qquad (15)$$

$$\alpha_m = \ln \frac{1 - \epsilon_m}{\epsilon_m} \qquad (16)$$
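A quick numerical sanity check of Eq. (16), comparing the closed form against a grid search over $\alpha_m$; the value $\epsilon_m = 0.2$ is just an illustrative choice, not from the notes:

    import numpy as np

    eps = 0.2                                # example weighted error eps_m (arbitrary)
    alphas = np.linspace(0.01, 5.0, 10000)
    # objective from Eq. (11), normalized by sum_i w_i^{(m)}
    E = (np.exp(alphas / 2) - np.exp(-alphas / 2)) * eps + np.exp(-alphas / 2)
    print(alphas[np.argmin(E)])              # numerical minimizer, approximately 1.386
    print(np.log((1 - eps) / eps))           # closed form from Eq. (16): ln(4) = 1.386...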


Figure 3: Loss functions for learning: Black: 0-1 loss. Blue: Hinge Loss. Red: Logistic regression.
Green: Exponential loss. (Figure from Pattern Recognition and Machine Learning by Chris Bishop.)

Problems with the loss function view. The exponential loss is not a very good loss function to
use in general. For example, if we directly optimize the exponential loss over all variables in the
classifier (e.g., with gradient descent), we will often get terrible performance. So the loss-function
interpretation of AdaBoost does not tell the whole story.

Margin view. One might expect that, when AdaBoost reaches zero training set error, adding any
new weak classifiers would cause overfitting. In practice, the opposite often occurs: continuing to
add weak classifiers actually improves test set performance in many situations. One explanation
comes from looking at the margins: adding classifiers tends to increase the margin size. The formal
details of this will not be discussed here.

18.3 Early stopping


It is nonetheless possible to overfit with AdaBoost, by adding too many classifiers. The solution that is normally used in practice is a procedure called early stopping. The idea is as follows. We
partition our data set into two pieces, a training set and a test set. The training set is used to train
the algorithm normally. However, at each step of the algorithm, we also compute the 0-1 binary
loss on the test set. During the course of the algorithm, the exponential loss on the training set is
guaranteed to decrease, and the 0-1 binary loss will generally decrease as well. The errors on the
testing set will also generally decrease in the first steps of the algorithm, however, at some point,
the testing error will begin to get noticeably worse. When this happens, we revert the classifier to
the form that gave the best test error, and discard any subsequent changes (i.e., additional weak
classifiers).
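A sketch of this procedure using scikit-learn's AdaBoostClassifier, which exposes the per-round predictions through staged_predict; the synthetic dataset and the choice of 200 rounds are arbitrary, for illustration only:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import AdaBoostClassifier

    # synthetic data, purely for illustration
    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    model = AdaBoostClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

    # 0-1 loss on the held-out set after each boosting round
    test_err = [np.mean(pred != y_te) for pred in model.staged_predict(X_te)]
    best_m = int(np.argmin(test_err)) + 1    # number of weak classifiers to keep
    print("best number of rounds:", best_m)

    # "reverting" the classifier amounts to keeping only the first best_m rounds,
    # e.g., by refitting with n_estimators=best_m (or evaluating via staged_predict)
    final = AdaBoostClassifier(n_estimators=best_m, random_state=0).fit(X_tr, y_tr)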
The intuition for the algorithm is as follows. When we begin learning, our initial classifier is
extremely simple and smooth. During the learning process, we add more and more complexity to the model to improve the fit to the data. At some point, adding additional complexity to the model
overfits: we are no longer modeling the decision boundary we wish to fit, but are fitting the noise
in the data instead. We use the test set to determine when overfitting begins, and stop learning at
that point.
Early stopping can be used for most iterative learning algorithms. For example, suppose we use
gradient descent to learn a regression algorithm. If we begin with weights w = 0, we are beginning
with a very smooth curve. Each step of gradient descent will make the curve less smooth, as the
entries of w get larger and larger; stopping early can prevent w from getting too large (and thus
too non-smooth).
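A small sketch of that regression example, assuming least-squares fitting of a polynomial model by gradient descent; the data, polynomial degree, and step size are all illustrative choices of ours:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, 60)
    y = np.sin(3 * x) + 0.3 * rng.standard_normal(60)         # noisy 1-D data
    x_tr, y_tr, x_va, y_va = x[:40], y[:40], x[40:], y[40:]

    def features(x, degree=12):                                # high-degree polynomial basis
        return np.vander(x, degree + 1, increasing=True)

    A_tr, A_va = features(x_tr), features(x_va)
    w = np.zeros(A_tr.shape[1])                                # start from the smoothest curve, w = 0
    lr, best = 1e-2, (np.inf, w.copy(), 0)
    for step in range(20000):
        grad = A_tr.T @ (A_tr @ w - y_tr) / len(y_tr)          # least-squares gradient
        w -= lr * grad
        va_err = np.mean((A_va @ w - y_va) ** 2)
        if va_err < best[0]:
            best = (va_err, w.copy(), step)                    # keep the best iterate seen so far
    print("best step:", best[2], "validation MSE:", best[0])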
Early stopping is very simple and very general; however, it is heuristic, as the final result one gets will depend on the particulars of the optimization algorithm being used, and not just on the objective function. However, AdaBoost's procedure is suboptimal anyway (once a weak classifier is added, it is never updated).
An even more aggressive form of early stopping is to simply stop learning at a fixed number of iterations, or by some other criterion unrelated to test set error (e.g., when the result "looks good"). In fact, practitioners often use early stopping to regularize unintentionally, simply because they halt the optimizer before it has converged, e.g., because the convergence threshold is set too high, or because they are too impatient to wait.

Copyright © 2015 Aaron Hertzmann, David J. Fleet and Marcus Brubaker
