ML - Lec 8-SVM As A Linear Classifier
SVM: 2
Linear Classifiers
(Figure: a linear classifier maps an input x through f to an estimated label y_est. Any of several candidate separating lines would classify the +1 and −1 training points correctly... but which is best?)
f(x, w, b) = sign(w · x + b)
where
    x is the input vector
    w is the normal (weight) vector of the separating line
    b is the scale (bias) value
Margin
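A tiny Python sketch of this decision rule (my illustration, not part of the slides; the weight vector, bias, and test points are made-up values):

import numpy as np

def f(x, w, b):
    # Linear classifier: +1 or -1 depending on which side of the hyperplane x falls.
    return 1 if np.dot(w, x) + b >= 0 else -1

w = np.array([1.0, 1.0])    # normal (weight) vector of the separating line
b = -0.5                    # scale / bias value
print(f(np.array([2.0, 1.0]), w, b))     # +1
print(f(np.array([-1.0, -2.0]), w, b))   # -1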
SVM: 13
Maximize Margin
(Figure: +1 and −1 training points, the decision boundary w·x + b = 0, and the margin.)
    argmax_{w,b} margin(w, b, D)
      = argmax_{w,b} min_{x_i ∈ D} d(x_i)
      = argmax_{w,b} min_{x_i ∈ D} |b + x_i · w| / √(Σ_{k=1}^{d} w_k²)
where d(x_i) is the distance from x_i to the decision boundary and d is the number of input dimensions.
SVM: 14
Maximize Margin
(Figure: +1 and −1 training points and the boundary w·x + b = 0.)
Require every training point to be on the correct side:  y_i (w·x_i + b) ≥ 0
    argmax_{w,b} min_{x_i ∈ D} |b + x_i · w| / √(Σ_{k=1}^{d} w_k²)
SVM: 15
Maximize Margin
(Figure: +1 and −1 training points, the boundary w·x + b = 0, and the margin.)
w·x_i + b ≥ 0 iff y_i = +1, and w·x_i + b ≤ 0 iff y_i = −1, so correct classification means  y_i (w·x_i + b) ≥ 0
    argmax_{w,b} min_{x_i ∈ D} |b + x_i · w| / √(Σ_{k=1}^{d} w_k²)
Strategy: solve instead
    argmin_{w,b} Σ_{k=1}^{d} w_k²
    subject to   ∀x_i ∈ D: y_i (x_i · w + b) ≥ 0    and    ∀x_i ∈ D: |b + x_i · w| ≥ 1
SVM: 17
Specifying a line and margin
(Figure: the classifier boundary with the parallel plus-plane and minus-plane.)
• Plus-plane  = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = −1 }
How do we compute the margin width M in terms of w and b?
Claim: the vector w is perpendicular to the plus-plane. Why?
SVM: 19
Computing the margin width
M = margin width. How do we compute M in terms of w and b?
• Plus-plane  = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = −1 }
• Claim: the vector w is perpendicular to the plus-plane. Why?
  Let u and v be two vectors on the plus-plane. What is w · (u − v)?
  (Both satisfy w · u + b = +1 and w · v + b = +1, so w · (u − v) = 0: w is orthogonal to every direction lying within the plane.)
• Let x⁻ be any point on the minus-plane (any location in Rᵐ, not necessarily a datapoint).
• Let x⁺ be the closest plus-plane point to x⁻.
SVM: 21
Computing the margin width
(Figure: x⁻ on the minus-plane and its closest plus-plane point x⁺; M = margin width.)
• Plus-plane  = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = −1 }
• The vector w is perpendicular to the plus-plane.
• Let x⁻ be any point on the minus-plane.
• Let x⁺ be the closest plus-plane point to x⁻.
• Claim: x⁺ = x⁻ + λw for some value of λ. Why?
SVM: 22
Computing the margin width
(Figure: x⁻, x⁺, M = margin width.)
The line from x⁻ to x⁺ is perpendicular to the planes, so to get from x⁻ to x⁺ we travel some distance in the direction of w.
• Plus-plane  = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = −1 }
• The vector w is perpendicular to the plus-plane.
• Let x⁻ be any point on the minus-plane.
• Let x⁺ be the closest plus-plane point to x⁻.
• Claim: x⁺ = x⁻ + λw for some value of λ.
SVM: 23
Computing the margin width
(Figure: x⁻, x⁺, M = margin width.)
What we know:
• w · x⁺ + b = +1
• w · x⁻ + b = −1
• x⁺ = x⁻ + λw
• |x⁺ − x⁻| = M
It is now easy to get M in terms of w and b.
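Filling in the algebra that the slide leaves implicit (a standard derivation, written here in LaTeX):

\begin{align*}
w \cdot x^{+} + b = 1
  &\;\Rightarrow\; w \cdot (x^{-} + \lambda w) + b = 1
  \;\Rightarrow\; \underbrace{(w \cdot x^{-} + b)}_{=-1} + \lambda\, w \cdot w = 1
  \;\Rightarrow\; \lambda = \frac{2}{w \cdot w},\\
M &= \lvert x^{+} - x^{-} \rvert = \lvert \lambda w \rvert = \lambda \sqrt{w \cdot w}
   = \frac{2}{w \cdot w}\sqrt{w \cdot w} = \frac{2}{\sqrt{w \cdot w}}.
\end{align*}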
SVM: 24
Maximize Margin
• How does this simplification come about?

The margin-maximization problem

    argmax_{w,b} min_{x_i ∈ D} |b + x_i · w| / √(Σ_{k=1}^{d} w_k²)    subject to ∀x_i ∈ D: y_i (x_i · w + b) ≥ 0

is equivalent to

    argmin_{w,b} Σ_{k=1}^{d} w_k²    subject to ∀x_i ∈ D: y_i (x_i · w + b) ≥ 1    (equivalently, ∀x_i ∈ D: |b + x_i · w| ≥ 1)

We have

    |b + x_i · w| / √(Σ_k w_k²)  =  (|b + x_i · w| / K) / √(Σ_k (w_k / K)²)  =  |b′ + x_i · w′| / √(Σ_k w′_k²)

for any K > 0, i.e. rescaling (w, b) to (w′, b′) = (w/K, b/K) leaves every distance unchanged. Choosing K = min_{x_i ∈ D} |b + x_i · w| makes the closest points satisfy |b′ + x_i · w′| = 1. Thus

    argmax_{w,b} min_{x_i ∈ D} |b + x_i · w| / √(Σ_k w_k²)  =  argmax_{w′,b′} 1 / √(Σ_k w′_k²)  =  argmin_{w′,b′} Σ_k w′_k²
SVM: 25
Maximum Margin Linear Classifier
{w*, b*} = argmin_{w,b} Σ_{k=1}^{d} w_k²
subject to
    y_1 (w · x_1 + b) ≥ 1
    y_2 (w · x_2 + b) ≥ 1
    ....
    y_N (w · x_N + b) ≥ 1
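To make the optimization concrete, here is a minimal Python sketch (not the lecture's code; the toy dataset is an assumption) that hands this primal QP to a general-purpose constrained solver:

import numpy as np
from scipy.optimize import minimize

# Toy linearly separable 2-D data (illustrative only)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])
d = X.shape[1]

def objective(u):                  # u = [w_1 .. w_d, b]
    w = u[:d]
    return np.dot(w, w)            # sum_k w_k^2

def margin_constraints(u):         # y_i (w . x_i + b) - 1 >= 0 for every i
    w, b = u[:d], u[d]
    return y * (X @ w + b) - 1.0

res = minimize(objective, x0=np.zeros(d + 1), method="SLSQP",
               constraints=[{"type": "ineq", "fun": margin_constraints}])
w_star, b_star = res.x[:d], res.x[d]
print("w* =", w_star, " b* =", b_star, " margin width =", 2 / np.linalg.norm(w_star))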
SVM: 26
Learning via Quadratic Programming
• QP is a well-studied class of optimization algorithms to
maximize a quadratic function of some real-valued
variables subject to linear constraints.
• For a detailed treatment of quadratic programming, see:
  Convex Optimization, Stephen P. Boyd (online edition, free for download):
  www.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf
SVM: 27
Quadratic Programming
Find   argmax_u   c + dᵀu + (uᵀ R u) / 2        Quadratic criterion
subject to linear constraints on u
SVM: 28
Quadratic Programming for the Linear Classifier
{w*, b*} = argmin_{w,b} wᵀ I_n w       (a quadratic criterion with no linear term)
subject to the N inequality constraints
    y_1 (w · x_1 + b) ≥ 1
    y_2 (w · x_2 + b) ≥ 1
    ....
    y_N (w · x_N + b) ≥ 1
SVM: 29
• Popular Tools - LibSVM
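For example, scikit-learn's SVC class is built on top of LIBSVM. A minimal usage sketch (the toy dataset and C value are illustrative assumptions, not from the slides):

import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6)          # a very large C approximates the hard-margin SVM
clf.fit(X, y)
print("w =", clf.coef_, " b =", clf.intercept_)
print("prediction for (1, 0):", clf.predict([[1.0, 0.0]]))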
SVM: 30
Uh-oh! This is going to be a problem!
What should we do?
(Figure: training points labeled +1 and −1 that are not linearly separable.)
SVM: 31
Uh-oh! This is going to be a problem!
What should we do?
Idea 1: Find minimum w·w while also minimizing the number of training-set errors.
Problem: two things to minimize makes for an ill-defined optimization.
SVM: 32
Uh-oh! This is going to be a problem!
What should we do?
Idea 1.1: Minimize  w·w + C · (#train errors),  where C is a tradeoff parameter.
SVM: 35
Learning Maximum Margin with Noise
(Figure: separating boundary with slack distances ε₂, ε₇, ε₁₁ for points on the wrong side of their margin; margin width M = 2 / √(w·w).)
Given a guess of w, b we can
• compute the sum of distances of points to their correct zones
• compute the margin width  M = 2 / √(w·w)
Assume R datapoints, each (x_k, y_k) where y_k = ±1.
SVM: 36
Learning Maximum Margin with Noise
(Figure as before; m = # input dimensions, margin width M = 2 / √(w·w).)
Given a guess of w, b we can
• compute the sum of distances of points to their correct zones
• compute the margin width
Assume R datapoints, each (x_k, y_k) where y_k = ±1.
Our original (noiseless-data) QP had m + 1 variables: w_1, w_2, …, w_m, and b.
Our new (noisy-data) QP has m + 1 + R variables: w_1, w_2, …, w_m, b, ε_1, …, ε_R.
SVM: 37
Support Vector Machine (SVM) for Noisy Data
(Figure: +1 / −1 points, with slack distances ε₁, ε₃ for points that violate their margin.)
{w*, b*} = argmin_{w,b}  Σ_{i=1}^{d} w_i²  +  c Σ_{j=1}^{N} ε_j
subject to
    y_1 (w · x_1 + b) ≥ 1 − ε_1
    y_2 (w · x_2 + b) ≥ 1 − ε_2
    ...
    y_N (w · x_N + b) ≥ 1 − ε_N
SVM: 38
Support Vector Machine (SVM) for Noisy Data
(Figure as on the previous slide.)
{w*, b*} = argmin_{w,b}  Σ_{i=1}^{d} w_i²  +  c Σ_{j=1}^{N} ε_j
subject to
    y_1 (w · x_1 + b) ≥ 1 − ε_1,   ε_1 ≥ 0
    y_2 (w · x_2 + b) ≥ 1 − ε_2,   ε_2 ≥ 0
    ...
    y_N (w · x_N + b) ≥ 1 − ε_N,   ε_N ≥ 0
SVM: 39
Support Vector Machine for Noisy Data
{w*, b*} = argmin_{w,b}  Σ_{i=1}^{d} w_i²  +  c Σ_{j=1}^{N} ε_j
subject to the N inequality constraints
    y_1 (w · x_1 + b) ≥ 1 − ε_1,   ε_1 ≥ 0
    y_2 (w · x_2 + b) ≥ 1 − ε_2,   ε_2 ≥ 0
    ....
    y_N (w · x_N + b) ≥ 1 − ε_N,   ε_N ≥ 0
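A minimal Python sketch of this noisy-data QP (toy data and the value of c are my own assumptions), which makes the m + 1 + R optimization variables explicit:

import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.5, -0.5],      # the third point is a "noisy" one
              [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, 1, -1, -1])
m, R, c = X.shape[1], X.shape[0], 1.0                    # tradeoff parameter c (assumed value)

def objective(u):                                        # u = [w (m), b (1), eps (R)]
    w, eps = u[:m], u[m + 1:]
    return np.dot(w, w) + c * np.sum(eps)

cons = [
    {"type": "ineq", "fun": lambda u: y * (X @ u[:m] + u[m]) - 1.0 + u[m + 1:]},  # margin with slack
    {"type": "ineq", "fun": lambda u: u[m + 1:]},                                 # eps_j >= 0
]
res = minimize(objective, x0=np.zeros(m + 1 + R), method="SLSQP", constraints=cons)
w_star, b_star, eps_star = res.x[:m], res.x[m], res.x[m + 1:]
print("w* =", w_star, " b* =", b_star, " slacks =", np.round(eps_star, 3))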
SVM: 40
• Therefore, the problem of maximizing the margin is
equivalent to
    Minimize      J(w) = ½ ‖w‖²
    subject to    y_i (wᵀ x_i + b) ≥ 1    ∀i
– Notice that 𝐽(𝑤) is a quadratic function, which means that there exists
a single global minimum and no local minima
• To solve this problem, we will use classical Lagrangian
optimization techniques
– We first present the Kuhn-Tucker Theorem, which provides an
essential result for the interpretation of Support Vector Machines
• The solution is a saddle point of the primal Lagrangian
    L_P(w, b, α) = ½ wᵀw − Σ_{i=1}^{N} α_i [y_i (wᵀ x_i + b) − 1]
                 = ½ wᵀw − Σ_{i=1}^{N} α_i y_i wᵀ x_i − b Σ_{i=1}^{N} α_i y_i + Σ_{i=1}^{N} α_i,    α_i ≥ 0
– Expanding the first term at the saddle point, where w = Σ_{i=1}^{N} α_i y_i x_i:
    wᵀw = wᵀ Σ_{i=1}^{N} α_i y_i x_i = Σ_{i=1}^{N} α_i y_i wᵀ x_i = Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j y_i y_j x_iᵀ x_j
– The second term in L_P can be expressed in the same way
– The third term in L_P is zero by virtue of the optimality condition ∂J/∂b = 0
[Haykin, 1999]
– Merging these expressions together we obtain
    L_D(α) = Σ_{i=1}^{N} α_i − ½ Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j y_i y_j x_iᵀ x_j
– subject to the (simpler) constraints α_i ≥ 0 and Σ_{i=1}^{N} α_i y_i = 0
– This is known as the Lagrangian dual problem
• Comments
– We have transformed the problem of finding a saddle point for L_P(w, b, α) into the easier one of maximizing L_D(α)
  • Notice that L_D(α) depends on the Lagrange multipliers α, not on (w, b)
– The primal problem scales with dimensionality (𝑤 has one coefficient
for each dimension), whereas the dual problem scales with the
amount of training data (there is one Lagrange multiplier per example)
– Moreover, in 𝐿𝐷 𝛼 training data appears only as dot products 𝑥𝑖𝑇 𝑥𝑗
• As we will see in the next lecture, this property can be cleverly exploited
to perform the classification in a higher (e.g., infinite) dimensional space
Support Vectors
• The KKT complementary condition states that, for every point in
the training set, the following equality must hold
    α_i [y_i (wᵀ x_i + b) − 1] = 0     ∀i = 1..N
– Therefore, for every training point, either α_i = 0 or y_i (wᵀ x_i + b) − 1 = 0 must hold
  • Those points for which α_i > 0 must then lie on one of the two hyperplanes that define the largest
    margin (the term y_i (wᵀ x_i + b) − 1 becomes zero only on these hyperplanes)
• These points are known as the Support Vectors
• All the other points must have 𝛼𝑖 = 0
– Note that only the SVs contribute to defining the optimal hyperplane
    ∂J(w, b, α)/∂w = 0   ⇒   w = Σ_{i=1}^{N} α_i y_i x_i
– NOTE: the bias term b is found from the KKT complementary condition on the support vectors
– Therefore, the complete dataset could be replaced by only the support vectors (the points with α_i > 0),
  and the resulting hyperplane would be exactly the same
(Figure: the support vectors, the points with α > 0, lie on the margin hyperplanes.)
Then define:
    w = Σ_{k=1}^{R} α_k y_k x_k
SVM: 41
An Equivalent QP
Maximize   Σ_{k=1}^{R} α_k − ½ Σ_{k=1}^{R} Σ_{l=1}^{R} α_k α_l Q_kl      where  Q_kl = y_k y_l (x_k · x_l)
Subject to these constraints:
    0 ≤ α_k ≤ C   ∀k          Σ_{k=1}^{R} α_k y_k = 0
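A minimal Python sketch of this dual QP (the toy dataset and C value are my own assumptions, not the lecture's code); note that the training data enters only through the dot products in Q:

import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
R, C = len(y), 10.0

Q = (y[:, None] * y[None, :]) * (X @ X.T)          # Q_kl = y_k y_l (x_k . x_l)

def neg_dual(a):                                    # maximizing the dual == minimizing its negative
    return 0.5 * a @ Q @ a - a.sum()

res = minimize(neg_dual, x0=np.zeros(R), method="SLSQP",
               bounds=[(0.0, C)] * R,
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])
alpha = res.x
w = (alpha * y) @ X                                 # w = sum_k alpha_k y_k x_k
sv = alpha > 1e-6                                   # support vectors: alpha_k > 0
b = np.mean(y[sv] - X[sv] @ w)                      # margin constraint is tight on the SVs
print("alpha =", np.round(alpha, 4), " w =", w, " b =", round(float(b), 4))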
SVM: 42
Support Vectors
(Figure: +1 / −1 points with the planes w·x + b = +1 and w·x + b = −1; the support vectors lie on them.)
Support vectors are the points x_i with α_i > 0; for them  y_i (w · x_i + b) = 1 − ε_i,  ε_i ≥ 0
    α_i = 0 for non-support vectors
    α_i > 0 for support vectors
    w = Σ_{k=1}^{R} α_k y_k x_k
The decision boundary is determined only by the support vectors!
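An illustrative cross-check (my example, not the slides'): a LIBSVM-based fit in scikit-learn exposes exactly these quantities, and w can be rebuilt from the support vectors alone:

import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=10.0).fit(X, y)
print("support vector indices:", clf.support_)     # the points with alpha_k > 0
print("support vectors:", clf.support_vectors_)
print("alpha_k * y_k:", clf.dual_coef_)
w = clf.dual_coef_ @ clf.support_vectors_           # w = sum over support vectors only
print("w =", w, " matches clf.coef_ =", clf.coef_)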
SVM: 43
The Dual Form of QP
Maximize   Σ_{k=1}^{R} α_k − ½ Σ_{k=1}^{R} Σ_{l=1}^{R} α_k α_l Q_kl      where  Q_kl = y_k y_l (x_k · x_l)
Subject to these constraints:
    0 ≤ α_k ≤ C   ∀k          Σ_{k=1}^{R} α_k y_k = 0
Then define:
    w = Σ_{k=1}^{R} α_k y_k x_k
Then classify with:
    f(x, w, b) = sign(w · x + b)
How to determine b?
SVM: 44
An Equivalent QP: Determine b
The bias b follows from the KKT complementary condition on the support vectors: for any support vector x_k with 0 < α_k < C the margin constraint is tight, y_k (w · x_k + b) = 1, so b = y_k − w · x_k (in practice averaged over all such support vectors).
SVM: 45
• The parameter c controls the trade-off between fitting the training data and tolerating noise: a large c penalizes training errors heavily, while a small c favors a wider margin at the cost of more slack.
SVM: 46
Feature Transformation ?
• The problem is non-linear
• Find some trick to transform the input
• The data become linearly separable after the feature transformation
• What features should we use?
Basic idea: the XOR problem (a small sketch follows below)
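A minimal sketch of the basic idea on the XOR problem (my own illustration; the product feature is one possible choice of transformation):

import numpy as np

X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([1, -1, -1, 1])                    # XOR-style labels: not linearly separable in (x1, x2)

phi = np.column_stack([X, X[:, 0] * X[:, 1]])   # transform (x1, x2) -> (x1, x2, x1*x2)
w, b = np.array([0.0, 0.0, 1.0]), 0.0           # in the transformed space this hyperplane separates them
print(np.sign(phi @ w + b))                     # [ 1. -1. -1.  1.], matching y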
SVM: 47
Suppose we’re in 1-dimension
What would SVMs do with this data?
(Figure: a 1-D dataset plotted along the x-axis around x = 0.)
SVM: 48
Suppose we’re in 1-dimension
(Figure: the SVM places its boundary between the two classes, with a positive "plane" on one side and a negative "plane" on the other.)
SVM: 49
Harder 1-dimensional dataset
(Figure: a 1-D dataset around x = 0 in which no single threshold separates the classes.)
SVM: 50
Harder 1-dimensional dataset
Remap the data into two dimensions:  z_k = (x_k, x_k²)
SVM: 51
Harder 1-dimensional dataset
(Figure: after the remap z_k = (x_k, x_k²) the classes are linearly separable in 2-D.)
Feature enumeration: each input x_k is transformed to φ(x_k) = z_k
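A small Python illustration of this remap (the 1-D dataset is an assumption):

import numpy as np
from sklearn.svm import SVC

x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([ 1,    1,   -1,  -1,  -1,   1,   1])    # negatives in the middle: no 1-D threshold works

Z = np.column_stack([x, x ** 2])                      # z_k = (x_k, x_k^2)
clf = SVC(kernel="linear", C=1e6).fit(Z, y)
print("training accuracy after the remap:", clf.score(Z, y))    # 1.0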
SVM: 52
Non-linear SVMs: Feature spaces
• General idea: the original input space can always be mapped
to some higher-dimensional feature space where the training
set is separable:
Φ: x → φ(x)
SVM: 53
• Polynomial features for the XOR problem
SVM: 54
Kernel methods
• Let’s now see how to put together all these concepts
– Assume that our original feature vector x lives in a space R^D
– We are interested in non-linearly projecting x onto a higher-dimensional implicit space φ(x) ∈ R^{D₁} (with D₁ > D) where classes have a better chance of being linearly separable
  • Notice that we are not guaranteeing linear separability, we are only saying that we have a better chance because of Cover's theorem
– The separating hyperplane in R^{D₁} will be defined by
    Σ_{j=1}^{D₁} w_j φ_j(x) + b = 0
– To eliminate the bias term 𝑏, let’s augment the feature vector in the
implicit space with a constant dimension 𝜑0 (𝑥) = 1
– Using vector notation, the resulting hyperplane becomes
    wᵀ φ(x) = 0
– From our previous results, the optimal (maximum-margin) hyperplane in the implicit space is given by
    w = Σ_{i=1}^{N} α_i y_i φ(x_i)
  ⇒  Σ_{i=1}^{N} α_i y_i φ(x_i)ᵀ φ(x) = 0
– and, since φ(x_i)ᵀ φ(x_j) = K(x_i, x_j), the optimal hyperplane becomes
    Σ_{i=1}^{N} α_i y_i K(x_i, x) = 0
• By Mercer's theorem, such a kernel can be expanded as
    K(x, x′) = Σ_{i=1}^{∞} λ_i φ_i(x) φ_i(x′)
  • Strictly speaking, the space where φ(x) resides is a Hilbert space, a "generalization" of a Euclidean space where the inner product can be any inner product, not just the scalar dot product [Burges, 1998]
  • With positive coefficients λ_i > 0 ∀i
– For this expansion to be valid and for it to converge absolutely and uniformly, it is necessary and sufficient that the condition
    ∫_a^b ∫_a^b K(x, x′) ψ(x) ψ(x′) dx dx′ ≥ 0
– holds for all ψ(·) for which ∫_a^b ψ(x)² dx < ∞
• The functions 𝜑𝑖 𝑥 are called eigenfunctions of the expansion, and the numbers 𝜆𝑖 are
the eigenvalues. The fact that all of the eigenvalues are positive means that the kernel is
positive definite
– Notice that the dimensionality of the implicit space can be infinitely large
– Mercer’s Condition only tells us whether a kernel is actually an inner-product
kernel, but it does not tell us how to construct the functions 𝜑𝑖 𝑥 for the
expansion
[Haykin, 1999]
• Which kernels meet Mercer's condition?
– Polynomial kernels:  K(x, x′) = (xᵀ x′ + 1)^p
  (Kernel trick: evaluating the kernel directly needs far fewer multiplications than building the explicit feature mapping and taking its dot product; in the slide's small example, 3 multiplications instead of 9.)
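A quick numerical check of the kernel trick for the degree-2 polynomial kernel in 2-D (my own example):

import numpy as np

def phi(x):
    # Explicit feature map whose dot product reproduces (x . z + 1)^2 in 2-D
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     np.sqrt(2) * x1 * x2, x1 ** 2, x2 ** 2])

def poly_kernel(x, z, p=2):
    return (np.dot(x, z) + 1.0) ** p

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(x) @ phi(z))     # explicit mapping: builds 6-D vectors first
print(poly_kernel(x, z))   # kernel trick: same value, computed directly from the 2-D inputs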
SVM: 55
Architecture of an SVM
(Figure: architecture of an SVM. The input vector x feeds kernel units K(x, x₁), K(x, x₂), …, evaluated against the support vectors; a linear projection of their outputs, together with a bias, is formed at the output neuron to produce the output variable y.)
[Haykin, 1999]
Numerical example
• To illustrate the operation of a non-linear SVM we will solve the
classical XOR problem
– Dataset
• Class 1: 𝑥1 = (−1, −1), 𝑥4 = (+1, +1)
• Class 2: 𝑥2 = (−1, +1), 𝑥3 = (+1, −1)
– Kernel function
• Polynomial of order 2:  K(x, x′) = (xᵀ x′ + 1)²
• Solution
– The implicit mapping can be shown to be 5-dimensional (plus the augmented constant dimension φ₀(x) = 1):
    φ(x) = [ 1   √2 x₁   √2 x₂   √2 x₁x₂   x₁²   x₂² ]ᵀ
– To achieve linear separability, we will use 𝐶 = ∞
– The objective function for the dual problem becomes
    L_D(α) = α₁ + α₂ + α₃ + α₄ − ½ Σ_{i=1}^{4} Σ_{j=1}^{4} α_i α_j y_i y_j k_ij
  • subject to the constraints   Σ_{i=1}^{N} α_i y_i = 0   and   0 ≤ α_i ≤ C,   i = 1 … N
[Cherkassky and Mulier, 1998; Haykin, 1999]
– where the inner products are collected in the 4 × 4 kernel matrix
    K = [ 9 1 1 1
          1 9 1 1
          1 1 9 1
          1 1 1 9 ]
– Optimizing with respect to the Lagrange multipliers leads to the
following system of equations
9𝛼1 − 𝛼2 − 𝛼3 + 𝛼4 = 1
−𝛼1 + 9𝛼2 + 𝛼3 − 𝛼4 = 1
−𝛼1 + 𝛼2 + 9𝛼3 − 𝛼4 = 1
𝛼1 − 𝛼2 − 𝛼3 + 9𝛼4 = 1
– whose solution is 𝛼1 = 𝛼2 = 𝛼3 = 𝛼4 = 0.125
– The weight vector follows from w = Σ_{i=1}^{4} α_i y_i φ(x_i), giving the discriminant
    g(x) = Σ_i w_i φ_i(x) = x₁ x₂
– which has zero empirical error on the XOR training set
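A short numerical check of this example (the code is mine; it reproduces the slide's numbers):

import numpy as np

X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)   # x1, x2, x3, x4
y = np.array([1, -1, -1, 1], dtype=float)                         # class 1: x1, x4; class 2: x2, x3

K = (X @ X.T + 1.0) ** 2                # order-2 polynomial kernel: 9 on the diagonal, 1 elsewhere
print(K)

A = (y[:, None] * y[None, :]) * K       # optimality conditions dL_D/dalpha_i = 0 give  A alpha = 1
alpha = np.linalg.solve(A, np.ones(4))
print(alpha)                            # [0.125 0.125 0.125 0.125]

g = lambda x: np.sum(alpha * y * (X @ x + 1.0) ** 2)   # g(x) = sum_i alpha_i y_i K(x_i, x)
print([g(x) for x in X])                # [1, -1, -1, 1] = x1 * x2 on the training set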
SVM: 57
• SVMs with non-linear kernels can solve complicated non-linear problems
• γ and c control the complexity of the decision boundary
(Figure: two example decision boundaries obtained with different settings.)
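An illustrative sweep (the dataset and parameter values are assumptions, not from the slides) showing how γ and C change the fitted model with an RBF kernel:

import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
for gamma in (0.1, 10.0):
    for C in (0.1, 100.0):
        clf = SVC(kernel="rbf", gamma=gamma, C=C).fit(X, y)
        print(f"gamma={gamma}, C={C} -> training accuracy {clf.score(X, y):.2f}, "
              f"#support vectors {len(clf.support_)}")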
SVM: 58
Discussion
• Advantages of SVMs
– There are no local minima, because the solution is a QP problem
– The optimal solution can be found in polynomial time
– Few model parameters to select: the penalty term C, the kernel
function and parameters (e.g., spread 𝜎 in the case of RBF kernels)
– Final results are stable and repeatable (e.g., no random initial weights)
– SVM solution is sparse; it only involves the support vectors
– SVMs represent a general methodology for many PR problems:
classification, regression, feature extraction, clustering, novelty
detection, etc.
– SVMs rely on elegant and principled learning methods
– SVMs provide a method to control complexity independently of
dimensionality
– SVMs have been shown (theoretically and empirically) to have
excellent generalization capabilities
[Bennett and Campbell, 2000]
• Challenges
– Do SVMs always perform best? Can they beat a hand-crafted solution
for a particular problem?
– Do SVMs eliminate the model selection problem? Can the kernel
functions be selected in a principled manner? SVMs still require
selection of a few parameters, typically through cross-validation
– How does one incorporate domain knowledge? Currently this is
performed through the selection of the kernel and the introduction of
“artificial” examples
– How interpretable are the results provided by an SVM?
– What is the optimal data representation for SVM? What is the effect of
feature weighting? How does an SVM handle categorical or missing
features?