CS 221: Artificial Intelligence
Lecture 11:
Advanced Machine Learning
Peter Norvig and Sebastian Thrun
Slide credit: Andrew Moore, Ray Mooney, Xu & Arun, Hongbo Deng
Vocabulary Test
Discriminative learning
Perceptron
Large Margin Classifiers
Kernel Trick
Boosting
Occam’s Razor
Haar Features
Linear Separation
Outline
Perceptron Algorithm
Support Vector Machines
Boosting
Basic Neuron
Perceptron Node – Threshold Logic Unit
[Diagram: inputs x1 … xn with weights w1 … wn feeding a threshold unit that outputs z]
Perceptron Learning Algorithm
[Network diagram: a perceptron with inputs x1, x2, the weights .4, .1, −.2 shown in the figure, and output z]
Training set:
x1   x2   t
.8   .3   1
.4   .1   0
First Training Instance
Input (.8, .3), target t = 1:
net = .8*.4 + .3*(-.2) = .26  →  z = 1
The output matches the target, so the weights are unchanged.
Second Training Instance
Input (.4, .1), target t = 0:
net = .4*.4 + .1*(-.2) = .14  →  z = 1
The output does not match the target, so apply the update Δwi = c(t − z)xi.
Perceptron Rule Learning
Δwij = c(tj − zj)xi
Least perturbation principle:
Only change weights if there is an error
Use a small c rather than changing the weights just enough to make the current pattern correct
Scale the update by xi
Create a network with n input and m output nodes.
Iteratively apply a pattern from the training set and apply the perceptron rule.
Each pass through the training set is an epoch.
Continue training until the total training-set error is less than some epsilon.
Perceptron Convergence Theorem: guaranteed to find a solution in finite time if a solution exists.
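To make the rule concrete, here is a minimal sketch of the loop above in Python (illustrative only: the function name, the stopping threshold, and the strict 0/1 threshold at zero are my choices, not from the slides):

```python
import numpy as np

def train_perceptron(X, T, c=0.1, epsilon=0, max_epochs=1000):
    """Perceptron rule from the slide: delta_w_i = c * (t - z) * x_i, applied pattern by pattern."""
    w = np.zeros(X.shape[1])
    for epoch in range(max_epochs):          # one pass through the training set = one epoch
        errors = 0
        for x, t in zip(X, T):
            z = 1 if w @ x > 0 else 0        # thresholded output of the unit
            if z != t:                       # least perturbation: change weights only on error
                w += c * (t - z) * x         # scaled by the small constant c and by x
                errors += 1
        if errors <= epsilon:                # stop once total training-set error is small enough
            return w, epoch
    return w, max_epochs

# The two training patterns from the slides: (.8, .3) -> t = 1 and (.4, .1) -> t = 0
w, epochs = train_perceptron(np.array([[.8, .3], [.4, .1]]), np.array([1, 0]))
```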
Outline
Perceptron Algorithm
Support Vector Machines
Boosting
Support Vector Machines: Overview
• A powerful method for 2-class classification
– Original idea: Vapnik, 1965, for linear classifiers
– SVM: Cortes and Vapnik, 1995
– Became very popular after 2001
• Better generalization (less overfitting)
• Can handle non-linearly-separable classification with a globally optimal solution
• Key ideas
– Use a kernel function to transform low-dimensional training samples to a higher dimension (to address the linear separability problem)
– Use quadratic programming (QP) to find the best classifier boundary hyperplane (for global optimality)
Linear Classifiers
f(x, w, b) = sign(w·x − b)    (output: +1 or −1)
[Figure, repeated over several slides: the same two-class dataset with a different candidate separating line each time. How would you classify this data?]
Any of these lines would be fine… but which is best?
Classifier Margin
f(x, w, b) = sign(w·x − b)
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
Maximum Margin
f(x, w, b) = sign(w·x − b)
The maximum margin linear classifier is the linear classifier with the maximum margin.
This is the simplest kind of SVM (called an LSVM).
Linear SVM
Maximum Margin
f(x, w, b) = sign(w·x − b)
The maximum margin linear classifier is the linear classifier with the maximum margin.
Support vectors are the datapoints that the margin pushes up against.
This is the simplest kind of SVM (called an LSVM).
Linear SVM
Why Maximum Margin?
f(x, w, b) = sign(w·x − b)
1. Intuitively this feels safest.
2. If we’ve made a small error in the location of the boundary, this gives us the least chance of causing a misclassification.
3. It also helps generalization.
4. There’s some theory that this is a good thing.
5. Empirically it works very, very well.
Support vectors are the datapoints that the margin pushes up against.
Computing the margin width
[Figure: the “Predict Class = +1” and “Predict Class = −1” zones, separated by the lines w·x + b = +1, 0, −1; M = margin width]
How do we compute M in terms of w and b?
Plus-plane  = { x : w·x + b = +1 }
Minus-plane = { x : w·x + b = −1 }
M = 2 / sqrt(w·w)
Copyright © 2001, 2003, Andrew W. Moore
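Filling in the algebra behind that formula (a standard derivation, not spelled out on these slides):

```latex
\begin{align*}
&\text{Let } x^- \text{ lie on the minus-plane and let } x^+ = x^- + \lambda w \text{ be the closest point on the plus-plane.}\\
&w\cdot x^+ + b = +1, \qquad w\cdot x^- + b = -1\\
&\Rightarrow\ w\cdot(x^- + \lambda w) + b = 1 \ \Rightarrow\ \lambda\,(w\cdot w) = 2 \ \Rightarrow\ \lambda = \tfrac{2}{w\cdot w}\\
&M = \|x^+ - x^-\| = \lambda\,\|w\| = \frac{2}{\sqrt{w\cdot w}} = \frac{2}{\|w\|}
\end{align*}
```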
Learning the Maximum Margin Classifier
[Figure: the same margin diagram, with the planes w·x + b = +1, 0, −1 and M = margin width]
Given a guess of w and b we can:
Compute whether all data points are in the correct half-planes
Compute the width of the margin
So now we just need to write a program to search the space of w’s and b’s to find the widest margin that matches all the datapoints. How?
Gradient descent? Simulated Annealing?
Learning via Quadratic Programming
QP is a well-studied class of optimization algorithms for optimizing a quadratic function of some real-valued variables subject to linear constraints.
Minimize both w·w (to maximize M) and the misclassification error.
Quadratic Programming
Find the argument that maximizes a quadratic criterion,
subject to n additional linear inequality constraints
and subject to e additional linear equality constraints.
Learning Maximum Margin with Noise
[Figure: the margin M between the planes w·x + b = +1, 0, −1, with some points on the wrong side]
Given a guess of w, b we can:
Compute the sum of distances of points to their correct zones
Compute the margin width
Assume R datapoints, each (xk, yk) where yk = ±1.
What should our quadratic optimization criterion be?
How many constraints will we have? What should they be?
Learning Maximum Margin with Noise
[Figure: the same diagram, with the distances εk of the misplaced points marked]
Given a guess of w, b we can:
Compute the sum of distances of points to their correct zones
Compute the margin width
Assume R datapoints, each (xk, yk) where yk = ±1.
What should our quadratic optimization criterion be?
Minimize w·w + C Σk εk, where εk is the distance of error point k from its correct place.
How many constraints will we have? 2R. What should they be?
w·xk + b ≥ +1 − εk if yk = +1
w·xk + b ≤ −1 + εk if yk = −1
εk ≥ 0 for all k
Solving the Optimization Problem
Find w and b such that
Φ(w) =wTw is minimized
and for all (xi, yi), i=1..n : yi (wTxi + b) ≥ 1
Need to optimize a quadratic function subject to linear constraints.
Quadratic optimization problems are a well-known class of
mathematical programming problems for which several (non-trivial)
algorithms exist.
The solution involves constructing a dual problem where a Lagrange
multiplier αi is associated with every inequality constraint in the
primal (original) problem:
Find α1…αn such that
Q(α) =Σαi - ½ΣΣαiαjyiyjxiTxj is maximized and
(1) Σαiyi = 0
(2) αi ≥ 0 for all αi
The Optimization Problem Solution
Given a solution α1…αn to the dual problem, solution to the primal is:
w =Σαiyixi b = yk - Σαiyixi Txk for any αk > 0
Each non-zero αi indicates that corresponding xi is a support vector.
Then the classifying function is (note that we don’t need w explicitly):
f(x) = ΣαiyixiTx + b
Notice that it relies on an inner product between the test point x and the
support vectors xi – we will return to this later.
Also keep in mind that solving the optimization problem involved
computing the inner products xiTxj between all training points.
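As an illustration of how this dual is actually solved, here is a minimal sketch using the cvxopt QP solver (the function name and tolerance handling are mine; it assumes the hard-margin case with linearly separable data, labels in {−1, +1}, and numpy float arrays):

```python
import numpy as np
from cvxopt import matrix, solvers   # assumes cvxopt is installed

def svm_dual_hard_margin(X, y):
    """Maximize sum(a) - 1/2 sum_ij a_i a_j y_i y_j x_i.x_j  s.t.  a_i >= 0, sum_i a_i y_i = 0."""
    solvers.options['show_progress'] = False          # silence solver output
    n = X.shape[0]
    K = X @ X.T                                       # Gram matrix of inner products x_i^T x_j
    P = matrix(np.outer(y, y) * K)                    # cvxopt minimizes 1/2 a'Pa + q'a ...
    q = matrix(-np.ones(n))                           # ... so maximizing sum(a) becomes minimizing -sum(a)
    G = matrix(-np.eye(n)); h = matrix(np.zeros(n))   # -a_i <= 0, i.e. a_i >= 0
    A = matrix(y.reshape(1, -1).astype(float)); b = matrix(0.0)   # sum_i a_i y_i = 0
    alphas = np.ravel(np.array(solvers.qp(P, q, G, h, A, b)['x']))
    w = (alphas * y) @ X                              # w = sum_i a_i y_i x_i
    k = np.argmax(alphas)                             # any point with a_k > 0 is a support vector
    b0 = y[k] - X[k] @ w                              # b = y_k - w . x_k
    return w, b0, alphas
```

In practice one would use a decomposition method such as SMO (mentioned later in these slides) rather than a dense QP solve, but the formulation is the same.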
Soft Margin Classification
What if the training set is not linearly separable?
Slack variables ξi can be added to allow misclassification of difficult or noisy examples; the resulting margin is called a soft margin.
[Figure: two misplaced points, each at slack distance ξi from its correct side]
Soft Margin Classification Mathematically
The old formulation:
Find w and b such that
Φ(w) =wTw is minimized
and for all (xi ,yi), i=1..n : yi (wTxi + b) ≥ 1
Modified formulation incorporates slack variables:
Find w and b such that
Φ(w) =wTw + CΣξi is minimized
and for all (xi ,yi), i=1..n : yi (wTxi + b) ≥ 1 – ξi,  ξi ≥ 0
Parameter C can be viewed as a way to control overfitting: it
“trades off” the relative importance of maximizing the margin and
fitting the training data.
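To see that trade-off concretely, here is a small illustrative sketch (not from the slides) using scikit-learn's linear SVC on made-up overlapping data:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.2, (50, 2)),   # two overlapping Gaussian classes
               rng.normal(+1.0, 1.2, (50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    violations = int((y * clf.decision_function(X) < 1).sum())   # points with nonzero slack
    # Small C tolerates many slack violations (wider margin); large C fits the training data harder.
    print(f"C={C}: support vectors={clf.n_support_.sum()}, margin violations={violations}")
```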
Non-linear SVMs
Datasets that are linearly separable with some noise work out great.
But what are we going to do if the dataset is just too hard?
How about mapping the data to a higher-dimensional space?
[Figure: 1-D data on the x axis; a hard dataset becomes separable after adding the feature x2]
Non-linear SVMs: Feature spaces
General idea: the original feature space can always be mapped to
some higher-dimensional feature space where the training set is
separable:
Φ: x → φ(x)
The “Kernel Trick”
The linear classifier relies on inner product between vectors K(xi,xj)=xiTxj
If every data point is mapped into high-dimensional space via some
transformation Φ: x → φ(x), the inner product becomes:
K(xi,xj)= φ(xi) Tφ(xj)
A kernel function is a function that is equivalent to an inner product in some
feature space.
Example:
2-dimensional vectors x = [x1 x2]; let K(xi,xj) = (1 + xiTxj)2.
Need to show that K(xi,xj) = φ(xi)Tφ(xj):
K(xi,xj) = (1 + xiTxj)2 = 1 + xi12xj12 + 2 xi1xj1xi2xj2 + xi22xj22 + 2xi1xj1 + 2xi2xj2
= [1  xi12  √2 xi1xi2  xi22  √2 xi1  √2 xi2]T [1  xj12  √2 xj1xj2  xj22  √2 xj1  √2 xj2]
= φ(xi)Tφ(xj),  where φ(x) = [1  x12  √2 x1x2  x22  √2 x1  √2 x2]
Thus, a kernel function implicitly maps data to a high-dimensional space
(without the need to compute each φ(x) explicitly).
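A quick numerical check of that identity (the two test vectors are arbitrary):

```python
import numpy as np

def poly_kernel(x, z):
    """K(x, z) = (1 + x^T z)^2 for 2-D inputs."""
    return (1.0 + x @ z) ** 2

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel above."""
    x1, x2 = x
    return np.array([1.0, x1**2, np.sqrt(2)*x1*x2, x2**2, np.sqrt(2)*x1, np.sqrt(2)*x2])

x = np.array([0.5, -1.0])
z = np.array([2.0, 0.3])
# Same value either way, but the kernel never builds the 6-dimensional vectors.
assert np.isclose(poly_kernel(x, z), phi(x) @ phi(z))
```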
What Functions are Kernels?
For some functions K(xi,xj) checking that K(xi,xj)= φ(xi) Tφ(xj) can be
cumbersome.
Mercer’s theorem:
Every positive semi-definite symmetric function is a kernel.
Positive semi-definite symmetric functions correspond to a positive semi-definite symmetric Gram matrix:
K = [ K(x1,x1)  K(x1,x2)  K(x1,x3)  …  K(x1,xn)
      K(x2,x1)  K(x2,x2)  K(x2,x3)  …  K(x2,xn)
      …
      K(xn,x1)  K(xn,x2)  K(xn,x3)  …  K(xn,xn) ]
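A small sketch of checking this condition empirically, by testing whether a Gram matrix built from sample data is positive semi-definite (the helper name and tolerance are mine):

```python
import numpy as np

def is_psd_gram(K, tol=1e-8):
    """True if the symmetric kernel (Gram) matrix K is positive semi-definite."""
    return bool(np.all(np.linalg.eigvalsh((K + K.T) / 2) >= -tol))

X = np.random.default_rng(1).normal(size=(20, 3))
K_lin = X @ X.T                                                        # linear kernel
K_rbf = np.exp(-np.sum((X[:, None] - X[None]) ** 2, axis=-1) / 2.0)    # Gaussian kernel
print(is_psd_gram(K_lin), is_psd_gram(K_rbf))   # both True: valid kernels by Mercer's condition
```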
Examples of Kernel Functions
Linear: K(xi,xj)= xiTxj
Mapping Φ: x → φ(x), where φ(x) is x itself
Polynomial of power p: K(xi,xj) = (1 + xiTxj)p
Mapping Φ: x → φ(x), where φ(x) has C(d+p, p) dimensions
Gaussian (radial-basis function): K(xi,xj) = exp(−‖xi − xj‖2 / 2σ2)
Mapping Φ: x → φ(x), where φ(x) is infinite-dimensional: every point is mapped to a function (a Gaussian); a combination of such functions for the support vectors is the separator.
The higher-dimensional space still has intrinsic dimensionality d (the mapping is not onto), but linear separators in it correspond to non-linear separators in the original space.
Non-linear SVMs Mathematically
Dual problem formulation:
Find α1…αn such that
Q(α) =Σαi - ½ΣΣαiαjyiyjK(xi, xj) is maximized and
(1) Σαiyi = 0
(2) αi ≥ 0 for all αi
The solution is:
f(x) = ΣαiyiK(xi, x) + b
Optimization techniques for finding αi’s remain the same!
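As a usage illustration (not from the slides), a Gaussian-kernel SVM handles a toy dataset that no linear separator can:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the original 2-D space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

clf = SVC(kernel='rbf', C=1.0, gamma='scale').fit(X, y)   # Gaussian (RBF) kernel
print(clf.score(X, y))                                    # near-perfect training accuracy
print(clf.support_vectors_.shape)                         # only the support vectors define f(x)
```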
Compare SVM with Perceptron
Similarity
− SVM + sigmoid kernel ~ two-layer feedforward NN
− SVM + Gaussian kernel ~ RBF network
− For most problems, SVM and NN have similar performance
Advantages
− Based on sound mathematical theory
− The learned result is more robust
− Over-fitting is not common
− Not trapped in local minima (because of QP)
− Fewer parameters to consider (kernel, error cost C)
− Works well with fewer training samples (non-support vectors do not matter much)
Disadvantages
− The problem needs to be formulated as 2-class classification
− Learning takes a long time (QP optimization)
SVM applications
SVMs were originally proposed by Boser, Guyon and Vapnik in 1992
and gained increasing popularity in late 1990s.
SVMs are currently among the best performers for a number of
classification tasks ranging from text to genomic data.
SVMs can be applied to complex data types beyond feature vectors
(e.g. graphs, sequences, relational data) by designing kernel functions
for such data.
SVM techniques have been extended to a number of tasks such as
regression [Vapnik et al. ’97], principal component analysis [Schölkopf
et al. ’99], etc.
Most popular optimization algorithms for SVMs use decomposition to
hill-climb over a subset of αi’s at a time, e.g. SMO [Platt ’99] and
[Joachims ’99]
Tuning SVMs remains a black art: selecting a specific kernel and
parameters is usually done in a try-and-see manner.
Outline
Perceptron Algorithm
Support Vector Machines
Boosting
Why boosting?
A simple algorithm for learning robust classifiers
Freund & Schapire, 1995
Friedman, Hastie, Tibshirani, 1998
Provides an efficient algorithm for sparse visual feature selection
Tieu & Viola, 2000
Viola & Jones, 2003
Easy to implement; does not require external optimization tools.
Boosting
• Defines a classifier using an additive model:
  H(x) = α1 h1(x) + α2 h2(x) + α3 h3(x) + …
  where H is the strong classifier, each ht is a weak classifier applied to the feature vector x, and αt is its weight.
We need to define a family of weak classifiers.
Boosting
It is a sequential procedure:
[Figure: training points xt=1, xt=2, …, xt along a line]
Each data point has a class label yt ∈ {+1, −1} and a weight wt = 1.
Toy example
Weak learners come from the family of lines.
Each data point has a class label yt ∈ {+1, −1} and a weight wt = 1.
A line h with p(error) = 0.5 is at chance.
Toy example
Each data point has a class label yt ∈ {+1, −1} and a weight wt = 1.
This line seems to be the best.
This is a ‘weak classifier’: it performs slightly better than chance.
Toy example
Each data point has a class label yt ∈ {+1, −1}.
We update the weights: wt ← wt exp{−yt Ht}
We set a new problem for which the previous weak classifier performs at chance again.
[This step repeats over the following slides: at each round a new weak classifier is chosen on the re-weighted data.]
Toy example
[Figure: the weak classifiers f1, f2, f3, f4 found in the rounds above]
The strong (non-linear) classifier is built as the combination of all the weak (linear) classifiers.
Adaboost Terminology
ht(x) … “weak” or basis classifier (Classifier =
Learner = Hypothesis)
Weak Classifier: < 50% error over any
distribution
Strong Classifier: thresholded linear combination
of weak classifier outputs
Formal Procedure of AdaBoost
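The procedure, as a hedged code sketch of standard discrete AdaBoost with decision-stump weak learners (the helper names are mine, and the slides' exact formulation may differ in details such as the weak-learner family):

```python
import numpy as np

def adaboost(X, y, T=20):
    """Discrete AdaBoost; y must be in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                        # start with uniform weights on the data
    stumps, alphas = [], []
    for t in range(T):
        stump, err = best_stump(X, y, w)           # weak learner minimizing weighted error
        err = max(err, 1e-10)                      # avoid division by zero for a perfect stump
        alpha = 0.5 * np.log((1 - err) / err)      # weight of this weak classifier
        pred = stump_predict(stump, X)
        w *= np.exp(-alpha * y * pred)             # re-weight: boost the misclassified points
        w /= w.sum()
        stumps.append(stump); alphas.append(alpha)
    return stumps, alphas

def best_stump(X, y, w):
    """Exhaustive search over (feature, threshold, sign) decision stumps."""
    best, best_err = None, np.inf
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (+1, -1):
                pred = sign * np.where(X[:, j] >= thr, 1, -1)
                err = w[pred != y].sum()
                if err < best_err:
                    best, best_err = (j, thr, sign), err
    return best, best_err

def stump_predict(stump, X):
    j, thr, sign = stump
    return sign * np.where(X[:, j] >= thr, 1, -1)

def strong_classify(stumps, alphas, X):
    H = sum(a * stump_predict(s, X) for s, a in zip(stumps, alphas))
    return np.sign(H)                              # thresholded linear combination of weak outputs
```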
AdaBoost with Perceptron
[Sequence of figures for t = 1, 2, …: the combined decision boundary after successive rounds of boosting perceptron weak learners.]
Testing Set Performance
Will AdaBoost eventually screw up with a fat, complex classifier?
Occam’s razor – simple is best.
Shall we stop before overfitting? Only if overfitting actually happens.
[Figure: hypothetical over-fitting curve vs. an actual typical run.]
An explanation by margin
This margin is not the margin in SVM.
[Figure: margin distribution over boosting rounds]
Although the final classifier is getting larger, the margins are still increasing.
The final classifier is actually getting simpler.
Weak detectors
Haar filters and integral image
Viola and Jones, ICCV 2001
The average intensity in the
block is computed with four
sums independently of the
block size.
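A minimal sketch of the integral image and the four-lookup block sum (the helper names are mine):

```python
import numpy as np

def integral_image(img):
    """Cumulative sums so that any rectangle sum needs only four lookups."""
    return img.cumsum(axis=0).cumsum(axis=1)

def block_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1+1, c0:c1+1] from the integral image ii, in O(1)."""
    total = ii[r1, c1]
    if r0 > 0: total -= ii[r0 - 1, c1]
    if c0 > 0: total -= ii[r1, c0 - 1]
    if r0 > 0 and c0 > 0: total += ii[r0 - 1, c0 - 1]
    return total

img = np.arange(16.0).reshape(4, 4)
ii = integral_image(img)
# Four sums, independent of the block size:
assert block_sum(ii, 1, 1, 2, 2) == img[1:3, 1:3].sum()
```

A Haar feature is then just a signed combination of a few such block sums.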
Experiments (dataset for training)
● 4916 positive training examples were hand-picked, aligned, normalized, and scaled to a base resolution of 24x24.
● 10,000 negative examples were selected by randomly picking sub-windows from 9500 images which did not contain faces.
Results
Tracking Cars (Hendrik Dahlkamp)