CS 221: Artificial Intelligence
Lecture 11:
Advanced Machine Learning
Peter Norvig and Sebastian Thrun
Slide credit: Andrew Moore, Ray Mooney, Xu & Arun, Hongbo Deng
Vocabulary Test
Discriminative learning
Perceptron
Large Margin Classifiers
Kernel Trick
Boosting
Occam’s Razor
Haar Features
Linear Separation
Outline
Perceptron Algorithm
Support Vector Machines
Boosting
Basic Neuron
Perceptron Node – Threshold Logic Unit
[Diagram: inputs x1 … xn with weights w1 … wn feeding a threshold unit that outputs z]
Perceptron Learning Algorithm
[Network diagram: a perceptron with inputs x1, x2, the weights .4, .1, −.2 shown in the figure, and output z]
Training set:
x1   x2   t
.8   .3   1
.4   .1   0
First Training Instance
Input (.8, .3), target t = 1:
net = .8*.4 + .3*(-.2) = .26  →  z = 1
The output matches the target, so the weights are unchanged.
Second Training Instance
Input (.4, .1), target t = 0:
net = .4*.4 + .1*(-.2) = .14  →  z = 1
The output does not match the target, so apply the update Δwi = c(t − z)xi.
Perceptron Rule Learning
Δwij = c(tj − zj)xi
Least perturbation principle:
Only change weights if there is an error
Use a small c rather than changing the weights just enough to make the current pattern correct
Scale the update by xi
Create a network with n input and m output nodes.
Iteratively apply a pattern from the training set and apply the perceptron rule.
Each pass through the training set is an epoch.
Continue training until the total training-set error is less than some epsilon.
Perceptron Convergence Theorem: guaranteed to find a solution in finite time if a solution exists.
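To make the rule concrete, here is a minimal sketch of the loop above in Python (illustrative only: the function name, the stopping threshold, and the strict 0/1 threshold at zero are my choices, not from the slides):

```python
import numpy as np

def train_perceptron(X, T, c=0.1, epsilon=0, max_epochs=1000):
    """Perceptron rule from the slide: delta_w_i = c * (t - z) * x_i, applied pattern by pattern."""
    w = np.zeros(X.shape[1])
    for epoch in range(max_epochs):          # one pass through the training set = one epoch
        errors = 0
        for x, t in zip(X, T):
            z = 1 if w @ x > 0 else 0        # thresholded output of the unit
            if z != t:                       # least perturbation: change weights only on error
                w += c * (t - z) * x         # scaled by the small constant c and by x
                errors += 1
        if errors <= epsilon:                # stop once total training-set error is small enough
            return w, epoch
    return w, max_epochs

# The two training patterns from the slides: (.8, .3) -> t = 1 and (.4, .1) -> t = 0
w, epochs = train_perceptron(np.array([[.8, .3], [.4, .1]]), np.array([1, 0]))
```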
Outline
Perceptron Algorithm
Support Vector Machines
Boosting
Support Vector Machines: Overview
• A powerful method for 2-class classification
– Original idea: Vapnik, 1965, for linear classifiers
– SVM: Cortes and Vapnik, 1995
– Became very popular after 2001
• Better generalization (less overfitting)
• Can handle non-linearly-separable classification with a globally optimal solution
• Key ideas
– Use a kernel function to transform low-dimensional training samples to a higher dimension (to address the linear separability problem)
– Use quadratic programming (QP) to find the best classifier boundary hyperplane (for global optimality)
Linear Classifiers
f(x, w, b) = sign(w·x − b)    (output: +1 or −1)
[Figure, repeated over several slides: the same two-class dataset with a different candidate separating line each time. How would you classify this data?]
Any of these lines would be fine… but which is best?
Classifier Margin
f(x, w, b) = sign(w·x − b)
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
Maximum Margin
f(x, w, b) = sign(w·x − b)
The maximum margin linear classifier is the linear classifier with the maximum margin.
This is the simplest kind of SVM (called an LSVM).
Linear SVM
Maximum Margin
f(x, w, b) = sign(w·x − b)
The maximum margin linear classifier is the linear classifier with the maximum margin.
Support vectors are the datapoints that the margin pushes up against.
This is the simplest kind of SVM (called an LSVM).
Linear SVM
Why Maximum Margin?
f(x, w, b) = sign(w·x − b)
1. Intuitively this feels safest.
2. If we’ve made a small error in the location of the boundary, this gives us the least chance of causing a misclassification.
3. It also helps generalization.
4. There’s some theory that this is a good thing.
5. Empirically it works very, very well.
Support vectors are the datapoints that the margin pushes up against.
Computing the margin width
[Figure: the “Predict Class = +1” and “Predict Class = −1” zones, separated by the lines w·x + b = +1, 0, −1; M = margin width]
How do we compute M in terms of w and b?
Plus-plane  = { x : w·x + b = +1 }
Minus-plane = { x : w·x + b = −1 }
M = 2 / sqrt(w·w)
Copyright © 2001, 2003, Andrew W. Moore
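Filling in the algebra behind that formula (a standard derivation, not spelled out on these slides):

```latex
\begin{align*}
&\text{Let } x^- \text{ lie on the minus-plane and let } x^+ = x^- + \lambda w \text{ be the closest point on the plus-plane.}\\
&w\cdot x^+ + b = +1, \qquad w\cdot x^- + b = -1\\
&\Rightarrow\ w\cdot(x^- + \lambda w) + b = 1 \ \Rightarrow\ \lambda\,(w\cdot w) = 2 \ \Rightarrow\ \lambda = \tfrac{2}{w\cdot w}\\
&M = \|x^+ - x^-\| = \lambda\,\|w\| = \frac{2}{\sqrt{w\cdot w}} = \frac{2}{\|w\|}
\end{align*}
```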
Learning the Maximum Margin Classifier
[Figure: the same margin diagram, with the planes w·x + b = +1, 0, −1 and M = margin width]
Given a guess of w and b we can:
Compute whether all data points are in the correct half-planes
Compute the width of the margin
So now we just need to write a program to search the space of w’s and b’s to find the widest margin that matches all the datapoints. How?
Gradient descent? Simulated Annealing?
Learning via Quadratic Programming
QP is a well-studied class of optimization algorithms for optimizing a quadratic function of some real-valued variables subject to linear constraints.
Minimize both w·w (to maximize M) and the misclassification error.
Quadratic Programming
Find the argument that maximizes a quadratic criterion,
subject to n additional linear inequality constraints
and subject to e additional linear equality constraints.
Learning Maximum Margin with Noise
[Figure: the margin M between the planes w·x + b = +1, 0, −1, with some points on the wrong side]
Given a guess of w, b we can:
Compute the sum of distances of points to their correct zones
Compute the margin width
Assume R datapoints, each (xk, yk) where yk = ±1.
What should our quadratic optimization criterion be?
How many constraints will we have? What should they be?
Learning Maximum Margin with Noise
[Figure: the same diagram, with the distances εk of the misplaced points marked]
Given a guess of w, b we can:
Compute the sum of distances of points to their correct zones
Compute the margin width
Assume R datapoints, each (xk, yk) where yk = ±1.
What should our quadratic optimization criterion be?
Minimize w·w + C Σk εk, where εk is the distance of error point k from its correct place.
How many constraints will we have? 2R. What should they be?
w·xk + b ≥ +1 − εk if yk = +1
w·xk + b ≤ −1 + εk if yk = −1
εk ≥ 0 for all k
Solving the Optimization Problem
Find w and b such that
Φ(w) =wTw is minimized
and for all (xi, yi), i=1..n : yi (wTxi + b) ≥ 1
Need to optimize a quadratic function subject to linear constraints.
Quadratic optimization problems are a well-known class of
mathematical programming problems for which several (non-trivial)
algorithms exist.
The solution involves constructing a dual problem where a Lagrange
multiplier αi is associated with every inequality constraint in the
primal (original) problem:
Find α1…αn such that
Q(α) =Σαi - ½ΣΣαiαjyiyjxiTxj is maximized and
(1) Σαiyi = 0
(2) αi ≥ 0 for all αi
The Optimization Problem Solution
Given a solution α1…αn to the dual problem, solution to the primal is:
w =Σαiyixi b = yk - Σαiyixi Txk for any αk > 0
Each non-zero αi indicates that corresponding xi is a support vector.
Then the classifying function is (note that we don’t need w explicitly):
f(x) = ΣαiyixiTx + b
Notice that it relies on an inner product between the test point x and the
support vectors xi – we will return to this later.
Also keep in mind that solving the optimization problem involved
computing the inner products xiTxj between all training points.
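As an illustration of how this dual is actually solved, here is a minimal sketch using the cvxopt QP solver (the function name and tolerance handling are mine; it assumes the hard-margin case with linearly separable data, labels in {−1, +1}, and numpy float arrays):

```python
import numpy as np
from cvxopt import matrix, solvers   # assumes cvxopt is installed

def svm_dual_hard_margin(X, y):
    """Maximize sum(a) - 1/2 sum_ij a_i a_j y_i y_j x_i.x_j  s.t.  a_i >= 0, sum_i a_i y_i = 0."""
    solvers.options['show_progress'] = False          # silence solver output
    n = X.shape[0]
    K = X @ X.T                                       # Gram matrix of inner products x_i^T x_j
    P = matrix(np.outer(y, y) * K)                    # cvxopt minimizes 1/2 a'Pa + q'a ...
    q = matrix(-np.ones(n))                           # ... so maximizing sum(a) becomes minimizing -sum(a)
    G = matrix(-np.eye(n)); h = matrix(np.zeros(n))   # -a_i <= 0, i.e. a_i >= 0
    A = matrix(y.reshape(1, -1).astype(float)); b = matrix(0.0)   # sum_i a_i y_i = 0
    alphas = np.ravel(np.array(solvers.qp(P, q, G, h, A, b)['x']))
    w = (alphas * y) @ X                              # w = sum_i a_i y_i x_i
    k = np.argmax(alphas)                             # any point with a_k > 0 is a support vector
    b0 = y[k] - X[k] @ w                              # b = y_k - w . x_k
    return w, b0, alphas
```

In practice one would use a decomposition method such as SMO (mentioned later in these slides) rather than a dense QP solve, but the formulation is the same.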
Soft Margin Classification
What if the training set is not linearly separable?
Slack variables ξi can be added to allow misclassification of difficult or noisy examples; the resulting margin is called a soft margin.
[Figure: two misplaced points, each at slack distance ξi from its correct side]
Soft Margin Classification Mathematically
The old formulation:
Find w and b such that
Φ(w) =wTw is minimized
and for all (xi ,yi), i=1..n : yi (wTxi + b) ≥ 1
Modified formulation incorporates slack variables:
Find w and b such that
Φ(w) =wTw + CΣξi is minimized
and for all (xi ,yi), i=1..n : yi (wTxi + b) ≥ 1 – ξi,  ξi ≥ 0
Parameter C can be viewed as a way to control overfitting: it
“trades off” the relative importance of maximizing the margin and
fitting the training data.
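To see that trade-off concretely, here is a small illustrative sketch (not from the slides) using scikit-learn's linear SVC on made-up overlapping data:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.2, (50, 2)),   # two overlapping Gaussian classes
               rng.normal(+1.0, 1.2, (50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    violations = int((y * clf.decision_function(X) < 1).sum())   # points with nonzero slack
    # Small C tolerates many slack violations (wider margin); large C fits the training data harder.
    print(f"C={C}: support vectors={clf.n_support_.sum()}, margin violations={violations}")
```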
Non-linear SVMs
Datasets that are linearly separable with some noise work out great.
But what are we going to do if the dataset is just too hard?
How about mapping the data to a higher-dimensional space?
[Figure: 1-D data on the x axis; a hard dataset becomes separable after adding the feature x2]
Non-linear SVMs: Feature spaces
General idea: the original feature space can always be mapped to
some higher-dimensional feature space where the training set is
separable:
Φ: x → φ(x)
The “Kernel Trick”
The linear classifier relies on inner product between vectors K(xi,xj)=xiTxj
If every data point is mapped into high-dimensional space via some
transformation Φ: x → φ(x), the inner product becomes:
K(xi,xj)= φ(xi) Tφ(xj)
A kernel function is a function that is equivalent to an inner product in some
feature space.
Example:
2-dimensional vectors x = [x1 x2]; let K(xi,xj) = (1 + xiTxj)2.
Need to show that K(xi,xj) = φ(xi)Tφ(xj):
K(xi,xj) = (1 + xiTxj)2 = 1 + xi12xj12 + 2 xi1xj1xi2xj2 + xi22xj22 + 2xi1xj1 + 2xi2xj2
= [1  xi12  √2 xi1xi2  xi22  √2 xi1  √2 xi2]T [1  xj12  √2 xj1xj2  xj22  √2 xj1  √2 xj2]
= φ(xi)Tφ(xj),  where φ(x) = [1  x12  √2 x1x2  x22  √2 x1  √2 x2]
Thus, a kernel function implicitly maps data to a high-dimensional space
(without the need to compute each φ(x) explicitly).
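A quick numerical check of that identity (the two test vectors are arbitrary):

```python
import numpy as np

def poly_kernel(x, z):
    """K(x, z) = (1 + x^T z)^2 for 2-D inputs."""
    return (1.0 + x @ z) ** 2

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel above."""
    x1, x2 = x
    return np.array([1.0, x1**2, np.sqrt(2)*x1*x2, x2**2, np.sqrt(2)*x1, np.sqrt(2)*x2])

x = np.array([0.5, -1.0])
z = np.array([2.0, 0.3])
# Same value either way, but the kernel never builds the 6-dimensional vectors.
assert np.isclose(poly_kernel(x, z), phi(x) @ phi(z))
```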
What Functions are Kernels?
For some functions K(xi,xj) checking that K(xi,xj)= φ(xi) Tφ(xj) can be
cumbersome.
Mercer’s theorem:
Every positive semi-definite symmetric function is a kernel.
Positive semi-definite symmetric functions correspond to a positive semi-definite symmetric Gram matrix:
K = [ K(x1,x1)  K(x1,x2)  K(x1,x3)  …  K(x1,xn)
      K(x2,x1)  K(x2,x2)  K(x2,x3)  …  K(x2,xn)
      …
      K(xn,x1)  K(xn,x2)  K(xn,x3)  …  K(xn,xn) ]
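A small sketch of checking this condition empirically, by testing whether a Gram matrix built from sample data is positive semi-definite (the helper name and tolerance are mine):

```python
import numpy as np

def is_psd_gram(K, tol=1e-8):
    """True if the symmetric kernel (Gram) matrix K is positive semi-definite."""
    return bool(np.all(np.linalg.eigvalsh((K + K.T) / 2) >= -tol))

X = np.random.default_rng(1).normal(size=(20, 3))
K_lin = X @ X.T                                                        # linear kernel
K_rbf = np.exp(-np.sum((X[:, None] - X[None]) ** 2, axis=-1) / 2.0)    # Gaussian kernel
print(is_psd_gram(K_lin), is_psd_gram(K_rbf))   # both True: valid kernels by Mercer's condition
```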
Examples of Kernel Functions
Linear: K(xi,xj)= xiTxj
Mapping Φ: x → φ(x), where φ(x) is x itself
Polynomial of power p: K(xi,xj) = (1 + xiTxj)p
Mapping Φ: x → φ(x), where φ(x) has C(d+p, p) dimensions
Gaussian (radial-basis function): K(xi,xj) = exp(−‖xi − xj‖2 / 2σ2)
Mapping Φ: x → φ(x), where φ(x) is infinite-dimensional: every point is mapped to a function (a Gaussian); a combination of such functions for the support vectors is the separator.
The higher-dimensional space still has intrinsic dimensionality d (the mapping is not onto), but linear separators in it correspond to non-linear separators in the original space.
Non-linear SVMs Mathematically
Dual problem formulation:
Find α1…αn such that
Q(α) =Σαi - ½ΣΣαiαjyiyjK(xi, xj) is maximized and
(1) Σαiyi = 0
(2) αi ≥ 0 for all αi
The solution is:
f(x) = ΣαiyiK(xi, x) + b
Optimization techniques for finding αi’s remain the same!
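As a usage illustration (not from the slides), a Gaussian-kernel SVM handles a toy dataset that no linear separator can:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the original 2-D space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

clf = SVC(kernel='rbf', C=1.0, gamma='scale').fit(X, y)   # Gaussian (RBF) kernel
print(clf.score(X, y))                                    # near-perfect training accuracy
print(clf.support_vectors_.shape)                         # only the support vectors define f(x)
```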
Compare SVM with Perceptron
Similarity
− SVM + sigmoid kernel ~ two-layer feedforward NN
− SVM + Gaussian kernel ~ RBF network
− For most problems, SVM and NN have similar performance
Advantages
− Based on sound mathematical theory
− The learned result is more robust
− Over-fitting is not common
− Not trapped in local minima (because of QP)
− Fewer parameters to consider (kernel, error cost C)
− Works well with fewer training samples (non-support vectors do not matter much)
Disadvantages
− The problem needs to be formulated as 2-class classification
− Learning takes a long time (QP optimization)
SVM applications
SVMs were originally proposed by Boser, Guyon and Vapnik in 1992
and gained increasing popularity in late 1990s.
SVMs are currently among the best performers for a number of
classification tasks ranging from text to genomic data.
SVMs can be applied to complex data types beyond feature vectors
(e.g. graphs, sequences, relational data) by designing kernel functions
for such data.
SVM techniques have been extended to a number of tasks such as
regression [Vapnik et al. ’97], principal component analysis [Schölkopf
et al. ’99], etc.
Most popular optimization algorithms for SVMs use decomposition to
hill-climb over a subset of αi’s at a time, e.g. SMO [Platt ’99] and
[Joachims ’99]
Tuning SVMs remains a black art: selecting a specific kernel and
parameters is usually done in a try-and-see manner.
Outline
Perceptron Algorithm
Support Vector Machines
Boosting
Why boosting?
A simple algorithm for learning robust classifiers
Freund & Schapire, 1995
Friedman, Hastie, Tibshirani, 1998
Provides an efficient algorithm for sparse visual feature selection
Tieu & Viola, 2000
Viola & Jones, 2003
Easy to implement; does not require external optimization tools.
Boosting
• Defines a classifier using an additive model:
  H(x) = α1 h1(x) + α2 h2(x) + α3 h3(x) + …
  where H is the strong classifier, each ht is a weak classifier applied to the feature vector x, and αt is its weight.
We need to define a family of weak classifiers.
Boosting
It is a sequential procedure:
[Figure: training points xt=1, xt=2, …, xt along a line]
Each data point has a class label yt ∈ {+1, −1} and a weight wt = 1.
Toy example
Weak learners come from the family of lines.
Each data point has a class label yt ∈ {+1, −1} and a weight wt = 1.
A line h with p(error) = 0.5 is at chance.
Toy example
Each data point has a class label yt ∈ {+1, −1} and a weight wt = 1.
This line seems to be the best.
This is a ‘weak classifier’: it performs slightly better than chance.
Toy example
Each data point has a class label yt ∈ {+1, −1}.
We update the weights: wt ← wt exp{−yt Ht}
We set a new problem for which the previous weak classifier performs at chance again.
[This step repeats over the following slides: at each round a new weak classifier is chosen on the re-weighted data.]
Toy example
[Figure: the weak classifiers f1, f2, f3, f4 found in the rounds above]
The strong (non-linear) classifier is built as the combination of all the weak (linear) classifiers.
Adaboost Terminology
ht(x) … “weak” or basis classifier (Classifier =
Learner = Hypothesis)
Weak Classifier: < 50% error over any
distribution
Strong Classifier: thresholded linear combination
of weak classifier outputs
Formal Procedure of AdaBoost
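The procedure, as a hedged code sketch of standard discrete AdaBoost with decision-stump weak learners (the helper names are mine, and the slides' exact formulation may differ in details such as the weak-learner family):

```python
import numpy as np

def adaboost(X, y, T=20):
    """Discrete AdaBoost; y must be in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                        # start with uniform weights on the data
    stumps, alphas = [], []
    for t in range(T):
        stump, err = best_stump(X, y, w)           # weak learner minimizing weighted error
        err = max(err, 1e-10)                      # avoid division by zero for a perfect stump
        alpha = 0.5 * np.log((1 - err) / err)      # weight of this weak classifier
        pred = stump_predict(stump, X)
        w *= np.exp(-alpha * y * pred)             # re-weight: boost the misclassified points
        w /= w.sum()
        stumps.append(stump); alphas.append(alpha)
    return stumps, alphas

def best_stump(X, y, w):
    """Exhaustive search over (feature, threshold, sign) decision stumps."""
    best, best_err = None, np.inf
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (+1, -1):
                pred = sign * np.where(X[:, j] >= thr, 1, -1)
                err = w[pred != y].sum()
                if err < best_err:
                    best, best_err = (j, thr, sign), err
    return best, best_err

def stump_predict(stump, X):
    j, thr, sign = stump
    return sign * np.where(X[:, j] >= thr, 1, -1)

def strong_classify(stumps, alphas, X):
    H = sum(a * stump_predict(s, X) for s, a in zip(stumps, alphas))
    return np.sign(H)                              # thresholded linear combination of weak outputs
```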
AdaBoost with Perceptron
[Sequence of figures for t = 1, 2, …: the combined decision boundary after successive rounds of boosting perceptron weak learners.]
Testing Set Performance
Will AdaBoost eventually screw up with a fat, complex classifier?
Occam’s razor – simple is best.
Shall we stop before overfitting? Only if overfitting actually happens.
[Figure: hypothetical over-fitting curve vs. an actual typical run.]
An explanation by margin
This margin is not the margin in SVM.
[Figure: margin distribution over boosting rounds]
Although the final classifier is getting larger, the margins are still increasing.
The final classifier is actually getting simpler.
Weak detectors
Haar filters and integral image
Viola and Jones, ICCV 2001
The average intensity in the
block is computed with four
sums independently of the
block size.
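A minimal sketch of the integral image and the four-lookup block sum (the helper names are mine):

```python
import numpy as np

def integral_image(img):
    """Cumulative sums so that any rectangle sum needs only four lookups."""
    return img.cumsum(axis=0).cumsum(axis=1)

def block_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1+1, c0:c1+1] from the integral image ii, in O(1)."""
    total = ii[r1, c1]
    if r0 > 0: total -= ii[r0 - 1, c1]
    if c0 > 0: total -= ii[r1, c0 - 1]
    if r0 > 0 and c0 > 0: total += ii[r0 - 1, c0 - 1]
    return total

img = np.arange(16.0).reshape(4, 4)
ii = integral_image(img)
# Four sums, independent of the block size:
assert block_sum(ii, 1, 1, 2, 2) == img[1:3, 1:3].sum()
```

A Haar feature is then just a signed combination of a few such block sums.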
Experiments (dataset for training)
● 4916 positive training examples were hand-picked, aligned, normalized, and scaled to a base resolution of 24x24.
● 10,000 negative examples were selected by randomly picking sub-windows from 9500 images which did not contain faces.
Results
Tracking Cars (Hendrik Dahlkamp)