ML - Lec 8-SVM As A Linear Classifier

The document discusses Support Vector Machines (SVM), a classification method rooted in statistical learning theory, highlighting its history, development, and significance in machine learning. It explains the concept of linear classifiers, the importance of maximizing the margin between classes, and the role of support vectors in determining the optimal boundary. Additionally, it touches on the mathematical formulation and optimization techniques, particularly quadratic programming, used to solve SVM problems.

L7: Support Vector Machines

Modified from Prof. Andrew W. Moore's Lecture Notes
www.cs.cmu.edu/~awm
History
• SVM is a classifier derived from statistical learning theory by Vapnik and Chervonenkis.
• SVMs were introduced by Boser, Guyon, and Vapnik at COLT-92.
• Now an important and active area of Machine Learning research.
• Special issues of the Machine Learning Journal and the Journal of Machine Learning Research have been devoted to SVMs.

SVM: 2
Linear Classifiers
A linear classifier maps an input x to an estimated label y_est through
    f(x, w, b) = sign(w · x + b)
(In the figure, one marker denotes +1 points and the other denotes -1 points.)

How would you classify this data?

e.g. x = (x1, x2), w = (w1, w2), w · x = w1x1 + w2x2

sign(w · x + b) = +1 iff w1x1 + w2x2 + b > 0
SVM: 3
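The decision rule above is simple enough to state directly in code. A minimal sketch (the weight vector w and bias b here are made-up values for illustration, not learned ones):

import numpy as np

def linear_classify(x, w, b):
    """Return +1 or -1 according to sign(w . x + b)."""
    return 1 if np.dot(w, x) + b > 0 else -1

w = np.array([2.0, -1.0])   # hypothetical weights
b = 0.5                     # hypothetical bias
print(linear_classify(np.array([1.0, 1.0]), w, b))    # w.x + b = 1.5 > 0  -> +1
print(linear_classify(np.array([-1.0, 1.0]), w, b))   # w.x + b = -2.5 < 0 -> -1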
Linear Classifiers
    f(x, w, b) = sign(w · x + b)

Many different separating lines classify this training data perfectly.
Any of these would be fine...
...but which is best?
SVM: 7
Classifier Margin
    f(x, w, b) = sign(w · x + b)

Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
SVM: 8
Maximum Margin
    f(x, w, b) = sign(w · x + b)

The maximum margin linear classifier is the linear classifier with the maximum margin.
This is the simplest kind of SVM, called a Linear SVM (LSVM).
SVM: 9
Maximum Margin
The maximum margin linear classifier is the linear classifier with the maximum margin.
Support vectors are those datapoints that the margin pushes up against.
This is the simplest kind of SVM (called an LSVM, or Linear SVM).
SVM: 10
Why Maximum Margin?
1. Intuitively this feels safest.
2. Empirically it works very well.
3. If we have made a small error in the location of the boundary (it has been jolted in its perpendicular direction), this gives us the least chance of causing a misclassification.
4. There is some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing.
SVM: 11
Estimate the Margin
Consider the separating line w · x + b = 0, where x is an input vector, w is the normal vector of the line, and b is an offset (scale) value.

• What is the distance expression for a point x to the line w · x + b = 0?

    d(x) = |x · w + b| / ‖w‖₂ = |x · w + b| / sqrt( Σ_{i=1}^{d} w_i² )

SVM: 12
http://mathworld.wolfram.com/Point-LineDistance2-Dimensional.html
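A direct translation of that distance formula, as a small sketch (the example line and point are hypothetical):

import numpy as np

def distance_to_hyperplane(x, w, b):
    """Distance from point x to the hyperplane w.x + b = 0, i.e. |w.x + b| / ||w||."""
    return abs(np.dot(w, x) + b) / np.linalg.norm(w)

# Line 3*x1 + 4*x2 - 10 = 0 and the origin: distance = |-10| / 5 = 2.0
print(distance_to_hyperplane(np.array([0.0, 0.0]), np.array([3.0, 4.0]), -10.0))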
Estimate the Margin
• What is the expression for the margin, given w and b?

    margin = min_{x ∈ D} d(x) = min_{x ∈ D} |x · w + b| / sqrt( Σ_{i=1}^{d} w_i² )

SVM: 13
Maximize Margin
We want the w, b that maximize the margin:

    argmax_{w,b} margin(w, b, D)
    = argmax_{w,b} min_{x_i ∈ D} d(x_i)
    = argmax_{w,b} min_{x_i ∈ D} |b + x_i · w| / sqrt( Σ_{i=1}^{d} w_i² )

SVM: 14
Maximize Margin
For correctly classified data:
    w · x_i + b ≥ 0 iff y_i = +1
    w · x_i + b ≤ 0 iff y_i = -1
so both cases can be written as y_i (w · x_i + b) ≥ 0.

    argmax_{w,b} min_{x_i ∈ D} |b + x_i · w| / sqrt( Σ_{i=1}^{d} w_i² )
    subject to ∀ x_i ∈ D : y_i (x_i · w + b) ≥ 0

SVM: 15
Maximize Margin
    w · x_i + b ≥ 0 iff y_i = +1,   w · x_i + b ≤ 0 iff y_i = -1,   i.e.  y_i (w · x_i + b) ≥ 0

    argmax_{w,b} min_{x_i ∈ D} |b + x_i · w| / sqrt( Σ_{i=1}^{d} w_i² )
    subject to ∀ x_i ∈ D : y_i (x_i · w + b) ≥ 0

Strategy: the line w · x + b = 0 is unchanged by rescaling, since α(w · x + b) = 0 with α ≠ 0 describes the same line. We can therefore fix the scale by requiring ∀ x_i ∈ D : |b + x_i · w| ≥ 1, with equality for the closest points. Maximizing the margin then becomes

    argmin_{w,b} Σ_{i=1}^{d} w_i²
    subject to ∀ x_i ∈ D : y_i (x_i · w + b) ≥ 1

SVM: 16
Specifying a line and margin
(The figure shows three parallel lines: the Plus-Plane, the Classifier Boundary, and the Minus-Plane.)

• How do we represent this mathematically?
• ...in m input dimensions?

SVM: 17
Specifying a line and margin
• Plus-plane  = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = -1 }

Classify as:
    +1 if w · x + b ≥ 1
    -1 if w · x + b ≤ -1
    Universe explodes if -1 < w · x + b < 1

SVM: 18
Computing the margin width
M = Margin Width. How do we compute M in terms of w and b?

• Plus-plane  = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = -1 }
Claim: The vector w is perpendicular to the Plus-Plane. Why?

SVM: 19
Computing the margin width
Claim: The vector w is perpendicular to the Plus-Plane. Why?
Let u and v be two vectors on the Plus-Plane. What is w · (u - v)?
Since w · u + b = 1 and w · v + b = 1, we get w · (u - v) = 0, so w is perpendicular to every direction lying within the Plus-Plane.

And so of course the vector w is also perpendicular to the Minus-Plane.
SVM: 20
Computing the margin width
M = Margin Width. How do we compute M in terms of w and b?

• Plus-plane  = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = -1 }
• The vector w is perpendicular to the Plus-Plane.
• Let x⁻ be any point on the Minus-Plane (any location in Rᵐ: not necessarily a datapoint).
• Let x⁺ be the closest Plus-Plane point to x⁻.

SVM: 21
Computing the margin width
• The vector w is perpendicular to the Plus-Plane.
• Let x⁻ be any point on the Minus-Plane.
• Let x⁺ be the closest Plus-Plane point to x⁻.
• Claim: x⁺ = x⁻ + λw for some value of λ. Why?

SVM: 22
Computing the margin width
The line from x⁻ to x⁺ is perpendicular to the planes, so to get from x⁻ to x⁺ you travel some distance in direction w. Hence x⁺ = x⁻ + λw for some value of λ.

SVM: 23
Computing the margin width
What we know:
• w · x⁺ + b = +1
• w · x⁻ + b = -1
• x⁺ = x⁻ + λw
• |x⁺ - x⁻| = M
It is now easy to get M in terms of w and b.
SVM: 24
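Carrying out that "easy" step explicitly, using only the four facts listed above (a worked sketch, nothing assumed beyond the slide):

w \cdot x^+ + b = +1,\quad w \cdot x^- + b = -1 \;\Rightarrow\; w \cdot (x^+ - x^-) = 2
x^+ = x^- + \lambda w \;\Rightarrow\; \lambda \,(w \cdot w) = 2 \;\Rightarrow\; \lambda = \frac{2}{w \cdot w}
M = |x^+ - x^-| = \lambda \,\|w\| = \frac{2}{\|w\|} = \frac{2}{\sqrt{w \cdot w}}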
Maximize Margin
• How does the simplified problem come about?

    argmax_{w,b} min_{x_i ∈ D} |b + x_i · w| / sqrt( Σ_{i=1}^{d} w_i² )
    subject to ∀ x_i ∈ D : y_i (x_i · w + b) ≥ 0

is equivalent to

    argmin_{w,b} Σ_{i=1}^{d} w_i²
    subject to ∀ x_i ∈ D : y_i (x_i · w + b) ≥ 1

We have: rescaling (w, b) by any constant K > 0 changes neither the classifier nor the ratio |b + x_i · w| / sqrt( Σ_i w_i² ), so we are free to fix the scale by requiring ∀ x_i ∈ D : |b + x_i · w| ≥ 1, with equality for the closest point. Under that normalization

    min_{x_i ∈ D} |b + x_i · w| / sqrt( Σ_i w_i² ) = 1 / sqrt( Σ_i w_i² )

Thus

    argmax_{w,b} min_{x_i ∈ D} |b + x_i · w| / sqrt( Σ_i w_i² ) = argmax_{w,b} 1 / sqrt( Σ_i w_i² ) = argmin_{w,b} Σ_i w_i²

SVM: 25
Maximum Margin Linear Classifier

    {w*, b*} = argmin_{w,b} Σ_{k=1}^{d} w_k²
    subject to
        y_1 (w · x_1 + b) ≥ 1
        y_2 (w · x_2 + b) ≥ 1
        ....
        y_N (w · x_N + b) ≥ 1

• How to solve it?

SVM: 26
Learning via Quadratic Programming
• QP is a well-studied class of optimization problems: optimize a quadratic criterion of some real-valued variables subject to linear constraints.
• For a detailed treatment of quadratic programming, see Convex Optimization by Stephen P. Boyd (online edition, free for downloading):

www.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf

SVM: 27
Quadratic Programming
Find
    argmax_u  c + dᵀu + uᵀRu / 2        (quadratic criterion)

Subject to
    a₁₁u₁ + a₁₂u₂ + ... + a₁ₘuₘ ≤ b₁
    a₂₁u₁ + a₂₂u₂ + ... + a₂ₘuₘ ≤ b₂
    :
    aₙ₁u₁ + aₙ₂u₂ + ... + aₙₘuₘ ≤ bₙ     (n additional linear inequality constraints)

SVM: 28
Quadratic Programming for the Linear Classifier

    {w*, b*} = argmin_{w,b} Σ_i w_i²
    subject to y_i (w · x_i + b) ≥ 1 for all training data (x_i, y_i)

In the generic QP template this has quadratic criterion wᵀI w (with c = 0 and d = 0, and minimization instead of maximization):

    {w*, b*} = argmin_{w,b} wᵀI w
    subject to the N inequality constraints
        y_1 (w · x_1 + b) ≥ 1
        y_2 (w · x_2 + b) ≥ 1
        ....
        y_N (w · x_N + b) ≥ 1

SVM: 29
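As a concrete illustration of handing this QP to a generic solver, here is a minimal sketch using the cvxopt package (not a tool named in these slides). It assumes linearly separable toy data, stacks the unknowns as u = [w_1..w_d, b], and puts a tiny value on the b entry so the quadratic term stays positive definite:

import numpy as np
from cvxopt import matrix, solvers

def hard_margin_svm(X, y):
    n, d = X.shape
    # Minimize (1/2) u^T P u with P = diag(1,...,1,~0): this is sum_i w_i^2, with b unpenalized.
    P = np.diag(np.concatenate([np.ones(d), [1e-8]]))
    q = np.zeros(d + 1)
    # Constraints y_i (w.x_i + b) >= 1, rewritten for the solver as -y_i [x_i, 1] u <= -1.
    G = -y[:, None] * np.hstack([X, np.ones((n, 1))])
    h = -np.ones(n)
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
    u = np.array(sol['x']).ravel()
    return u[:d], u[d]                      # w, b

# Toy separable data, made up for illustration
X = np.array([[2.0, 2.0], [2.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = hard_margin_svm(X, y)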
• Popular Tools - LibSVM

SVM: 30
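LibSVM can be used directly, and it also underlies the SVC class in scikit-learn, which is a convenient way to try it out. A minimal usage sketch (the data and parameter values are illustrative, not from the slides):

import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [2.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel='linear', C=1.0)     # C is the soft-margin trade-off parameter discussed below
clf.fit(X, y)
print(clf.coef_, clf.intercept_)      # the learned w and b
print(clf.support_vectors_)           # only these points determine the boundary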
Uh-oh! This is going to be a problem!
What should we do? (The two classes are not linearly separable.)

SVM: 31
Uh-oh! This is going to be a problem!
What should we do?
Idea 1:
Find the minimum w · w while also minimizing the number of training set errors.
Problem: two things to minimize makes for an ill-defined optimization.

SVM: 32
Uh-oh! This is going to be a problem!
What should we do?
Idea 1.1:
Minimize w · w + C (#train errors), where C is a tradeoff parameter.

There's a serious practical problem that's about to make us reject this approach. Can you guess what it is?
SVM: 33
Uh-oh! This is going to be a problem!
What should we do?
Idea 1.1:
Minimize w · w + C (#train errors), where C is a tradeoff parameter.

The problem: this can't be expressed as a Quadratic Programming problem, so solving it may be too slow. (Also, it doesn't distinguish between disastrous errors and near misses.)
SVM: 34
Uh-oh! This is going to be a problem!
What should we do?
Idea 2.0:
Minimize w · w + C (distance of error points to their correct place)

SVM: 35
Learning Maximum Margin with Noise
Given a guess of w, b we can:
• Compute the sum of distances of points to their correct zones (the slack values ε₂, ε₇, ε₁₁ marked in the figure)
• Compute the margin width M = 2 / sqrt(w · w)
Assume R datapoints, each (x_k, y_k) where y_k = ±1.

What should our quadratic optimization criterion be?
    Minimize  ½ w · w + C Σ_{k=1}^{R} ε_k
How many constraints will we have? R. What should they be?
    w · x_k + b ≥ 1 - ε_k   if y_k = +1
    w · x_k + b ≤ -1 + ε_k  if y_k = -1

SVM: 36
Learning Maximum Margin with Noise
(m = number of input dimensions, R = number of records)
Given a guess of w, b we can:
• Compute the sum of distances of points to their correct zones
• Compute the margin width M = 2 / sqrt(w · w)
Assume R datapoints, each (x_k, y_k) where y_k = ±1.

Our original (noiseless data) QP had m + 1 variables: w₁, w₂, ..., w_m, and b.
Our new (noisy data) QP has m + 1 + R variables: w₁, w₂, ..., w_m, b, ε₁, ..., ε_R.

Quadratic optimization criterion:
    Minimize  ½ w · w + C Σ_{k=1}^{R} ε_k
The R constraints:
    w · x_k + b ≥ 1 - ε_k   if y_k = +1
    w · x_k + b ≤ -1 + ε_k  if y_k = -1

SVM: 37
Support Vector Machine (SVM) for Noisy Data

    {w*, b*} = argmin_{w,b}  Σ_{i=1}^{d} w_i²  +  c Σ_{j=1}^{N} ε_j
    subject to
        y_1 (w · x_1 + b) ≥ 1 - ε₁
        y_2 (w · x_2 + b) ≥ 1 - ε₂
        ...
        y_N (w · x_N + b) ≥ 1 - ε_N

(In the figure, ε₁, ε₂, ε₃ are the slack distances of the points on the wrong side of their margin.)

• Any problem with the above formulation?

SVM: 38
Support Vector Machine (SVM) for Noisy Data

    {w*, b*} = argmin_{w,b}  Σ_{i=1}^{d} w_i²  +  c Σ_{j=1}^{N} ε_j
    subject to
        y_1 (w · x_1 + b) ≥ 1 - ε₁,  ε₁ ≥ 0
        y_2 (w · x_2 + b) ≥ 1 - ε₂,  ε₂ ≥ 0
        ...
        y_N (w · x_N + b) ≥ 1 - ε_N,  ε_N ≥ 0

• The added ε_j ≥ 0 constraints fix the previous formulation; the objective balances the trade-off between margin and classification errors.

SVM: 39
Support Vector Machine for Noisy Data

    {w*, b*} = argmin_{w,b}  Σ_i w_i²  +  c Σ_{j=1}^{N} ε_j
    subject to the N inequality constraints
        y_1 (w · x_1 + b) ≥ 1 - ε₁,  ε₁ ≥ 0
        y_2 (w · x_2 + b) ≥ 1 - ε₂,  ε₂ ≥ 0
        ....
        y_N (w · x_N + b) ≥ 1 - ε_N,  ε_N ≥ 0

How do we determine the appropriate value for c ?

SVM: 40
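In practice c (usually written C) is most often chosen by cross-validation over a grid of candidate values. A minimal sketch with scikit-learn (the data here are randomly generated just to make the snippet runnable):

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

np.random.seed(0)
X = np.random.randn(40, 2)
y = np.where(X[:, 0] + X[:, 1] + 0.3 * np.random.randn(40) > 0, 1, -1)

search = GridSearchCV(SVC(kernel='linear'),
                      param_grid={'C': [0.01, 0.1, 1, 10, 100]},
                      cv=5)
search.fit(X, y)
print(search.best_params_)   # the C value with the best cross-validated accuracy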
• Therefore, the problem of maximizing the margin is equivalent to
    Minimize    J(w) = ½ ‖w‖²
    Subject to  y_i (wᵀx_i + b) ≥ 1   ∀i

– Notice that J(w) is a quadratic function, which means that there exists a single global minimum and no local minima
• To solve this problem, we will use classical Lagrangian optimization techniques
– We first present the Kuhn-Tucker Theorem, which provides an essential result for the interpretation of Support Vector Machines
CSCE 666 Pattern Analysis | Ricardo Gutierrez-Osuna | CSE@TAMU 17


(Kuhn-Tucker Theorem)
• Given an optimization problem with convex domain Ω ⊆ R^N
    Minimize   f(z),  z ∈ Ω
    Subject to g_i(z) ≤ 0,  i = 1..k
               h_i(z) = 0,  i = 1..m
– with f ∈ C¹ convex and g_i, h_i affine, necessary & sufficient conditions for a normal point z* to be an optimum are the existence of α*, β* such that
    ∂L(z*, α*, β*)/∂z = 0
    ∂L(z*, α*, β*)/∂β = 0
    α_i* g_i(z*) = 0,   i = 1..k
    g_i(z*) ≤ 0,        i = 1..k
    α_i* ≥ 0,           i = 1..k
  where
    L(z, α, β) = f(z) + Σ_{i=1}^{k} α_i g_i(z) + Σ_{i=1}^{m} β_i h_i(z)
– L(z, α, β) is known as a generalized Lagrangian function
– The third condition is known as the Karush-Kuhn-Tucker (KKT) complementary condition
  • It implies that for active constraints α_i ≥ 0, whereas for inactive constraints α_i = 0
  • As we will see in a minute, the KKT condition allows us to identify the training examples that define the largest margin hyperplane. These examples will be known as Support Vectors
[Cristianini and Shawe-Taylor, 2000]
CSCE 666 Pattern Analysis | Ricardo Gutierrez-Osuna | CSE@TAMU 18
The Lagrangian dual problem
• Constrained minimization of J(w) = ½ ‖w‖² is solved by introducing the Lagrangian
    L_P(w, b, α) = ½ ‖w‖² − Σ_{i=1}^{N} α_i [ y_i (wᵀx_i + b) − 1 ]
– which yields an unconstrained optimization problem that is solved by:
  • minimizing L_P w.r.t. the primal variables w and b, and
  • maximizing L_P w.r.t. the dual variables α_i ≥ 0 (the Lagrange multipliers)
– Thus, the optimum is defined by a saddle point
– This is known as the Lagrangian primal problem

(Figure: a saddle point)
CSCE 666 Pattern Analysis | Ricardo Gutierrez-Osuna | CSE@TAMU 19


• Solution
– To simplify the primal problem, we eliminate the primal variables (w, b) using the first Kuhn-Tucker condition ∂L_P/∂z = 0
– Differentiating L_P(w, b, α) with respect to w and b, and setting to zero, yields
    ∂L_P(w, b, α)/∂w = 0  ⇒  w = Σ_{i=1}^{N} α_i y_i x_i
    ∂L_P(w, b, α)/∂b = 0  ⇒  Σ_{i=1}^{N} α_i y_i = 0
– Expansion of L_P yields
    L_P(w, b, α) = ½ wᵀw − Σ_{i=1}^{N} α_i y_i wᵀx_i − b Σ_{i=1}^{N} α_i y_i + Σ_{i=1}^{N} α_i
– Using the optimality condition ∂L_P/∂w = 0, the first term in L_P can be expressed as
    wᵀw = wᵀ Σ_{i=1}^{N} α_i y_i x_i = Σ_{i=1}^{N} α_i y_i wᵀx_i = Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j y_i y_j x_iᵀx_j
– The second term in L_P can be expressed in the same way
– The third term in L_P is zero by virtue of the optimality condition ∂L_P/∂b = 0

[Haykin, 1999]
CSCE 666 Pattern Analysis | Ricardo Gutierrez-Osuna | CSE@TAMU 20
– Merging these expressions together we obtain
    L_D(α) = Σ_{i=1}^{N} α_i − ½ Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j y_i y_j x_iᵀx_j
– subject to the (simpler) constraints α_i ≥ 0 and Σ_{i=1}^{N} α_i y_i = 0
– This is known as the Lagrangian dual problem
• Comments
– We have transformed the problem of finding a saddle point for L_P(w, b, α) into the easier one of maximizing L_D(α)
  • Notice that L_D(α) depends on the Lagrange multipliers α, not on (w, b)
– The primal problem scales with dimensionality (w has one coefficient for each dimension), whereas the dual problem scales with the amount of training data (there is one Lagrange multiplier per example)
– Moreover, in L_D(α) the training data appear only as dot products x_iᵀx_j
  • As we will see in the next lecture, this property can be cleverly exploited to perform the classification in a higher (e.g., infinite) dimensional space
CSCE 666 Pattern Analysis | Ricardo Gutierrez-Osuna | CSE@TAMU 21
Support Vectors
• The KKT complementary condition states that, for every point in the training set, the following equality must hold
    α_i [ y_i (wᵀx_i + b) − 1 ] = 0   ∀ i = 1..N
– Therefore, for each x_i, either α_i = 0 or y_i (wᵀx_i + b) − 1 = 0 must hold
  • Those points for which α_i > 0 must then lie on one of the two hyperplanes that define the largest margin (the term y_i (wᵀx_i + b) − 1 becomes zero only at these hyperplanes)
  • These points are known as the Support Vectors
  • All the other points must have α_i = 0
– Note that only the SVs contribute to defining the optimal hyperplane
    ∂L_P(w, b, α)/∂w = 0  ⇒  w = Σ_{i=1}^{N} α_i y_i x_i
– NOTE: the bias term b is found from the KKT complementary condition on the support vectors
– Therefore, the complete dataset could be replaced by only the support vectors (the points with α_i > 0), and the separating hyperplane would be the same
CSCE 666 Pattern Analysis | Ricardo Gutierrez-Osuna | CSE@TAMU 22
The Dual Form of QP
Maximize
    Σ_{k=1}^{R} α_k − ½ Σ_{k=1}^{R} Σ_{l=1}^{R} α_k α_l Q_kl     where Q_kl = y_k y_l (x_k · x_l)

Subject to these constraints:
    0 ≤ α_k ≤ C  for all k,        Σ_{k=1}^{R} α_k y_k = 0

Then define:
    w = Σ_{k=1}^{R} α_k y_k x_k

SVM: 41
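This dual is itself a standard QP, so it can be handed to a generic solver. A minimal sketch with cvxopt (not a tool named in the slides; the toy data and the value of C are made up), minimizing the negated objective:

import numpy as np
from cvxopt import matrix, solvers

def svm_dual(X, y, C=1.0):
    n = X.shape[0]
    Q = (y[:, None] * y[None, :]) * (X @ X.T)        # Q_kl = y_k y_l (x_k . x_l)
    P = matrix(Q + 1e-8 * np.eye(n))                  # tiny ridge for numerical stability
    q = matrix(-np.ones(n))                           # minimize (1/2) a^T Q a - sum_k a_k
    G = matrix(np.vstack([-np.eye(n), np.eye(n)]))    # encode 0 <= a_k <= C
    h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
    A = matrix(y.reshape(1, -1).astype(float))        # sum_k a_k y_k = 0
    b = matrix(0.0)
    alpha = np.array(solvers.qp(P, q, G, h, A, b)['x']).ravel()
    w = (alpha * y) @ X                               # w = sum_k a_k y_k x_k
    return alpha, w

X = np.array([[2.0, 2.0], [2.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha, w = svm_dual(X, y, C=10.0)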
An Equivalent QP
Maximize
    Σ_{k=1}^{R} α_k − ½ Σ_{k=1}^{R} Σ_{l=1}^{R} α_k α_l Q_kl     where Q_kl = y_k y_l (x_k · x_l)

Subject to these constraints:
    0 ≤ α_k ≤ C  for all k,        Σ_{k=1}^{R} α_k y_k = 0

Then define:
    w = Σ_{k=1}^{R} α_k y_k x_k

Datapoints with α_k > 0 will be the support vectors, so this sum only needs to be over the support vectors.

SVM: 42
Support Vectors
Support vectors are the points x_i with α_i > 0; by the KKT condition α_i [ y_i (w · x_i + b) − 1 + ε_i ] = 0, they lie on (or, with slack, inside) the margin planes w · x + b = +1 and w · x + b = −1.

    α_i = 0 for non-support vectors
    α_i > 0 for support vectors

    w = Σ_{k=1}^{R} α_k y_k x_k

The decision boundary is determined only by those support vectors!

SVM: 43
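With scikit-learn the support vectors and their multipliers can be read off a fitted model directly; a small sketch on made-up data:

import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [2.0, 3.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0], [0.0, -1.0]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel='linear', C=10.0).fit(X, y)
print(clf.support_)           # indices of the support vectors in the training set
print(clf.support_vectors_)   # the support vectors themselves
print(clf.dual_coef_)         # the corresponding nonzero alpha_k * y_k values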
The Dual Form of QP
Maximize
    Σ_{k=1}^{R} α_k − ½ Σ_{k=1}^{R} Σ_{l=1}^{R} α_k α_l Q_kl     where Q_kl = y_k y_l (x_k · x_l)

Subject to these constraints:
    0 ≤ α_k ≤ C  for all k,        Σ_{k=1}^{R} α_k y_k = 0

Then define:
    w = Σ_{k=1}^{R} α_k y_k x_k

Then classify with:
    f(x, w, b) = sign(w · x + b)

How to determine b?

SVM: 44
An Equivalent QP: Determine b
The full problem is
    {w*, b*} = argmin_{w,b}  Σ_i w_i²  +  c Σ_{j=1}^{N} ε_j
    subject to
        y_1 (w · x_1 + b) ≥ 1 - ε₁,  ε₁ ≥ 0
        ....
        y_N (w · x_N + b) ≥ 1 - ε_N,  ε_N ≥ 0

Fix w (e.g., to the value obtained from the dual). Then b is found by
    b* = argmin_{b, ε₁..ε_N}  Σ_{j=1}^{N} ε_j
    subject to
        y_1 (w · x_1 + b) ≥ 1 - ε₁,  ε₁ ≥ 0
        ....
        y_N (w · x_N + b) ≥ 1 - ε_N,  ε_N ≥ 0

A linear programming problem!

SVM: 45
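A simpler and very common shortcut, based on the KKT complementary condition stated earlier, is to average b = y_k - w · x_k over the margin support vectors (those with 0 < α_k < C). A sketch, assuming alpha and w as returned by the dual solver sketched above:

import numpy as np

def compute_bias(alpha, w, X, y, C, tol=1e-6):
    # For margin support vectors (0 < alpha_k < C), y_k (w.x_k + b) = 1, hence b = y_k - w.x_k.
    on_margin = (alpha > tol) & (alpha < C - tol)
    return np.mean(y[on_margin] - X[on_margin] @ w)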
• Parameter c is used to control the fit: with noisy data, a smaller c tolerates more training errors in exchange for a wider margin, while a larger c fits the noise more closely.

SVM: 46
Feature Transformation?
• The problem is non-linear
• Find some trick to transform the input
• Linearly separable after the feature transformation
• What features should we use?

Basic idea: the XOR problem
SVM: 47
Suppose we're in 1-dimension

What would SVMs do with this data?

(Figure: points of the two classes on a line, with the origin marked x = 0.)

SVM: 48
Suppose we're in 1-dimension

Not a big surprise: the maximum margin boundary is a point on the line, with a positive "plane" on one side and a negative "plane" on the other (x = 0 marks the origin).

SVM: 49
Harder 1-dimensional dataset

That's wiped the smirk off SVM's face: this data is not linearly separable in one dimension.
What can be done about this?

(Figure: one class near the origin x = 0, the other class farther out on both sides.)

SVM: 50
Harder 1-dimensional dataset

Map the data from the low-dimensional space to a high-dimensional space, and let the linear classifier work there:

    z_k = ( x_k , x_k² )

SVM: 51
Harder 1-dimensional dataset

Map the data from the low-dimensional space to a high-dimensional space:

    z_k = ( x_k , x_k² )

Feature enumeration: each input x_k is transformed into z_k = Φ(x_k).
SVM: 52
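The mapping above in two lines of code, on a made-up version of the hard 1-d dataset (the values are illustrative):

import numpy as np

x = np.array([-3.0, -2.0, -0.5, 0.5, 2.0, 3.0])   # 1-d inputs: outer points vs. inner points
y = np.array([ 1,    1,   -1,  -1,   1,   1])      # not separable by any single threshold on x
Z = np.column_stack([x, x ** 2])                    # z_k = (x_k, x_k^2): now the line z2 = 2 separates the classes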
Non-linear SVMs: Feature spaces
• General idea: the original input space can always be mapped
to some higher-dimensional feature space where the training
set is separable:

Φ: x → φ(x)

SVM: 53
• Polynomial features for the XOR problem

SVM: 54
Kernel methods
• Let's now see how to put together all these concepts
– Assume that our original feature vector x lives in a space R^D
– We are interested in non-linearly projecting x onto a higher-dimensional implicit space φ(x) ∈ R^{D1}, D1 > D, where classes have a better chance of being linearly separable
  • Notice that we are not guaranteeing linear separability, we are only saying that we have a better chance because of Cover's theorem
– The separating hyperplane in R^{D1} will be defined by
    Σ_{j=1}^{D1} w_j φ_j(x) + b = 0
– To eliminate the bias term b, let's augment the feature vector in the implicit space with a constant dimension φ₀(x) = 1
– Using vector notation, the resulting hyperplane becomes
    wᵀφ(x) = 0
– From our previous results, the optimal (maximum margin) hyperplane in the implicit space is given by
    w = Σ_{i=1}^{N} α_i y_i φ(x_i)

CSCE 666 Pattern Analysis | Ricardo Gutierrez-Osuna | CSE@TAMU 10


– Merging this optimal weight vector with the hyperplane equation wᵀφ(x) = 0:
    ⇒ ( Σ_{i=1}^{N} α_i y_i φ(x_i) )ᵀ φ(x) = 0
    ⇒ Σ_{i=1}^{N} α_i y_i φ(x_i)ᵀ φ(x) = 0
– and, since φ(x_i)ᵀφ(x_j) = K(x_i, x_j), the optimal hyperplane becomes
    Σ_{i=1}^{N} α_i y_i K(x_i, x) = 0
– Therefore, classification of an unknown example x is performed by computing the weighted sum of the kernel function with respect to the support vectors x_i (remember that only the support vectors have non-zero dual variables α_i)

CSCE 666 Pattern Analysis | Ricardo Gutierrez-Osuna | CSE@TAMU 11


• How do we compute the dual variables α_i in the implicit space?
– Very simple: we use the same optimization problem as before, and replace the dot product φ(x_i)ᵀφ(x_j) with the kernel K(x_i, x_j)
– The Lagrangian dual problem for the non-linear SVM is simply
    L_D(α) = Σ_{i=1}^{N} α_i − ½ Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j y_i y_j K(x_i, x_j)
– subject to the constraints
    Σ_{i=1}^{N} α_i y_i = 0
    0 ≤ α_i ≤ C,   i = 1 … N

CSCE 666 Pattern Analysis | Ricardo Gutierrez-Osuna | CSE@TAMU 12


• How do we select the implicit mapping φ(x)?
– As we saw in the example a few slides back, we will normally select a kernel function first, and then determine the implicit mapping φ(x) that it corresponds to
• Then, how do we select the kernel function K(x_i, x_j)?
– We must select a kernel for which an implicit mapping exists, that is, a kernel that can be expressed as the dot-product of two vectors
• For which kernels K(x_i, x_j) does there exist an implicit mapping φ(x)?
– The answer is given by Mercer's Condition

[Burges, 1998; Haykin, 1999]


CSCE 666 Pattern Analysis | Ricardo Gutierrez-Osuna | CSE@TAMU 13
Mercer's Condition
• Let K(x, x′) be a continuous symmetric kernel that is defined in the closed interval a ≤ x ≤ b
– The kernel can be expanded in the series
    K(x, x′) = Σ_{i=1}^{∞} λ_i φ_i(x) φ_i(x′)
  with positive coefficients λ_i > 0 ∀i
  • Strictly speaking, the space where φ(x) resides is a Hilbert space, a "generalization" of a Euclidean space where the inner product can be any inner product, not just the scalar dot product [Burges, 1998]
– For this expansion to be valid and for it to converge absolutely and uniformly, it is necessary and sufficient that the condition
    ∫_a^b ∫_a^b K(x, x′) ψ(x) ψ(x′) dx dx′ ≥ 0
  holds for all ψ(·) for which ∫_a^b ψ(x)² dx < ∞
  • The functions φ_i(x) are called eigenfunctions of the expansion, and the numbers λ_i are the eigenvalues. The fact that all of the eigenvalues are positive means that the kernel is positive definite
– Notice that the dimensionality of the implicit space can be infinitely large
– Mercer's Condition only tells us whether a kernel is actually an inner-product kernel; it does not tell us how to construct the functions φ_i(x) for the expansion
[Haykin, 1999]
CSCE 666 Pattern Analysis | Ricardo Gutierrez-Osuna | CSE@TAMU 14
• Which kernels meet Mercer's condition?
– Polynomial kernels
    K(x, x′) = ( xᵀx′ + 1 )^p
  • The degree p of the polynomial is a user-specified parameter
– Radial basis function kernels
    K(x, x′) = exp( − ‖x − x′‖² / (2σ²) )
  • The width σ is a user-specified parameter, but the number of radial basis functions and their centers are determined automatically by the number of support vectors and their values
– Two-layer perceptron
    K(x, x′) = tanh( β₀ xᵀx′ + β₁ )
  • The number of hidden neurons and their weight vectors are determined automatically by the number of support vectors and their values, respectively. The hidden-to-output weights are the Lagrange multipliers α_i
  • However, this kernel will only meet Mercer's condition for certain values of β₀ and β₁
[Burges, 1998; Haykin, 1999]
CSCE 666 Pattern Analysis | Ricardo Gutierrez-Osuna | CSE@TAMU 15
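The three kernels written out directly, as a small sketch (p, sigma, beta0 and beta1 are the user-chosen parameters mentioned above; the default values here are arbitrary):

import numpy as np

def poly_kernel(x, xp, p=2):
    return (np.dot(x, xp) + 1) ** p

def rbf_kernel(x, xp, sigma=1.0):
    return np.exp(-np.dot(x - xp, x - xp) / (2 * sigma ** 2))

def tanh_kernel(x, xp, beta0=1.0, beta1=-1.0):
    # Satisfies Mercer's condition only for certain beta0, beta1
    return np.tanh(beta0 * np.dot(x, xp) + beta1)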
Efficiency Problem in Computing Features
• Feature space mapping
• Example: all degree-2 monomials

Computing the mapped features Φ(x) explicitly and taking their dot product costs 9 multiplications in this example, whereas evaluating the kernel directly costs only 3 multiplications.
This use of a kernel function to avoid carrying out Φ(x) explicitly is known as the kernel trick.
SVM: 55
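A quick numerical check of the trick for degree-2 monomials in two dimensions: with φ(x) = (x1², √2·x1·x2, x2²) one has φ(x)·φ(x′) = (x·x′)², so the kernel reproduces the high-dimensional dot product without ever forming φ. (The feature map and test points below are a standard textbook choice, not taken from the slides.)

import numpy as np

def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.dot(phi(x), phi(xp)))   # explicit feature map: 1.0
print(np.dot(x, xp) ** 2)        # kernel trick, same value with fewer operations: 1.0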
Architecture of an SVM
(Figure: the input vector x is fed into kernel units K(x, x₁), K(x, x₂), ..., K(x, x_{D1}), one per support vector; a linear projection of their outputs plus a bias term is combined by the output neuron to produce the output variable y.)

[Haykin, 1999]
CSCE 666 Pattern Analysis | Ricardo Gutierrez-Osuna | CSE@TAMU 16
Numerical example
• To illustrate the operation of a non-linear SVM we will solve the classical XOR problem
– Dataset
  • Class 1: x₁ = (−1, −1), x₄ = (+1, +1)
  • Class 2: x₂ = (−1, +1), x₃ = (+1, −1)
– Kernel function
  • Polynomial of order 2: K(x, x′) = ( xᵀx′ + 1 )²

• Solution
– The implicit mapping can be shown to be 6-dimensional
    φ(x_i) = [ 1, √2·x_{i,1}, √2·x_{i,2}, √2·x_{i,1}x_{i,2}, x_{i,1}², x_{i,2}² ]ᵀ
– To achieve linear separability, we will use C = ∞
– The objective function for the dual problem becomes
    L_D(α) = α₁ + α₂ + α₃ + α₄ − ½ Σ_{i=1}^{4} Σ_{j=1}^{4} α_i α_j y_i y_j k_ij
  • subject to the constraints Σ_i α_i y_i = 0 and 0 ≤ α_i ≤ C, i = 1 … N
[Cherkassky and Mulier, 1998; Haykin, 1999]
CSCE 666 Pattern Analysis | Ricardo Gutierrez-Osuna | CSE@TAMU 17
– where the inner product is represented as a 4 × 4 matrix K
        9 1 1 1
    K = 1 9 1 1
        1 1 9 1
        1 1 1 9
– Optimizing with respect to the Lagrange multipliers leads to the following system of equations
     9α₁ − α₂ − α₃ + α₄ = 1
    −α₁ + 9α₂ + α₃ − α₄ = 1
    −α₁ + α₂ + 9α₃ − α₄ = 1
     α₁ − α₂ − α₃ + 9α₄ = 1
– whose solution is α₁ = α₂ = α₃ = α₄ = 0.125
– Thus, all data points are support vectors in this case

CSCE 666 Pattern Analysis | Ricardo Gutierrez-Osuna | CSE@TAMU 18
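The stated solution is easy to verify numerically by solving the 4 × 4 linear system above; a quick check:

import numpy as np

A = np.array([[ 9, -1, -1,  1],
              [-1,  9,  1, -1],
              [-1,  1,  9, -1],
              [ 1, -1, -1,  9]], dtype=float)
print(np.linalg.solve(A, np.ones(4)))   # -> [0.125 0.125 0.125 0.125]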


– For this simple problem, it is worthwhile to write the decision surface in terms of the polynomial expansion
    w = Σ_{i=1}^{4} α_i y_i φ(x_i) = [ 0, 0, 0, 1/√2, 0, 0 ]ᵀ
– resulting in the intuitive non-linear discriminant function
    g(x) = Σ_{i=1}^{6} w_i φ_i(x) = x₁x₂
– which has zero empirical error on the XOR training set

CSCE 666 Pattern Analysis | Ricardo Gutierrez-Osuna | CSE@TAMU 19


• Decision function defined by the SVM
– Notice that the decision boundaries are non-linear in the original
space 𝑅2 , but linear in the implicit space 𝑅6

CSCE 666 Pattern Analysis | Ricardo Gutierrez-Osuna | CSE@TAMU 20


• “Radial basis functions” for the XOR problem

SVM: 57
• SVMs with such kernels can solve complicated non-linear problems
• γ and c control the complexity of the decision boundary, where
    γ = 1 / (2σ²)

SVM: 58
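For example, scikit-learn's RBF SVM exposes exactly these two knobs (its gamma is the 1/(2σ²) above). A small sketch on the XOR data; the particular gamma and C values are illustrative:

import numpy as np
from sklearn.svm import SVC

X = np.array([[-1, -1], [1, 1], [-1, 1], [1, -1]], dtype=float)   # the XOR dataset again
y = np.array([1, 1, -1, -1])

clf = SVC(kernel='rbf', gamma=0.5, C=1e6).fit(X, y)   # large C approximates a hard margin
print(clf.predict(X))                                  # should reproduce all four labels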
Discussion
• Advantages of SVMs
– There are no local minima, because the solution is a QP problem
– The optimal solution can be found in polynomial time
– Few model parameters to select: the penalty term C, the kernel
function and parameters (e.g., spread 𝜎 in the case of RBF kernels)
– Final results are stable and repeatable (e.g., no random initial weights)
– SVM solution is sparse; it only involves the support vectors
– SVMs represent a general methodology for many PR problems:
classification, regression, feature extraction, clustering, novelty
detection, etc.
– SVMs rely on elegant and principled learning methods
– SVMs provide a method to control complexity independently of
dimensionality
– SVMs have been shown (theoretically and empirically) to have
excellent generalization capabilities
[Bennett and Campbell, 2000]
• Challenges
– Do SVMs always perform best? Can they beat a hand-crafted solution
for a particular problem?
– Do SVMs eliminate the model selection problem? Can the kernel
functions be selected in a principled manner? SVMs still require
selection of a few parameters, typically through cross-validation
– How does one incorporate domain knowledge? Currently this is
performed through the selection of the kernel and the introduction of
“artificial” examples
– How interpretable are the results provided by an SVM?
– What is the optimal data representation for SVM? What is the effect of
feature weighting? How does an SVM handle categorical or missing
features?
