ML - Lec 8-SVM As A Linear Classifier
SVM: 2
Linear Classifiers
(Figure: a linear classifier maps an input x through f to an estimated label y_est. Any of several candidate separating lines would classify the +1 and −1 training points correctly... but which is best?)
f(x, w, b) = sign(w · x + b)
where
    x is the input vector
    w is the normal (weight) vector of the separating line
    b is the scale (bias) value
Margin
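A tiny Python sketch of this decision rule (my illustration, not part of the slides; the weight vector, bias, and test points are made-up values):

import numpy as np

def f(x, w, b):
    # Linear classifier: +1 or -1 depending on which side of the hyperplane x falls.
    return 1 if np.dot(w, x) + b >= 0 else -1

w = np.array([1.0, 1.0])    # normal (weight) vector of the separating line
b = -0.5                    # scale / bias value
print(f(np.array([2.0, 1.0]), w, b))     # +1
print(f(np.array([-1.0, -2.0]), w, b))   # -1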
SVM: 13
Maximize Margin
(Figure: +1 and −1 training points, the decision boundary w·x + b = 0, and the margin.)
    argmax_{w,b} margin(w, b, D)
      = argmax_{w,b} min_{x_i ∈ D} d(x_i)
      = argmax_{w,b} min_{x_i ∈ D} |b + x_i · w| / √(Σ_{k=1}^{d} w_k²)
where d(x_i) is the distance from x_i to the decision boundary and d is the number of input dimensions.
SVM: 14
Maximize Margin
(Figure: +1 and −1 training points and the boundary w·x + b = 0.)
Require every training point to be on the correct side:  y_i (w·x_i + b) ≥ 0
    argmax_{w,b} min_{x_i ∈ D} |b + x_i · w| / √(Σ_{k=1}^{d} w_k²)
SVM: 15
Maximize Margin
(Figure: +1 and −1 training points, the boundary w·x + b = 0, and the margin.)
w·x_i + b ≥ 0 iff y_i = +1, and w·x_i + b ≤ 0 iff y_i = −1, so correct classification means  y_i (w·x_i + b) ≥ 0
    argmax_{w,b} min_{x_i ∈ D} |b + x_i · w| / √(Σ_{k=1}^{d} w_k²)
Strategy: solve instead
    argmin_{w,b} Σ_{k=1}^{d} w_k²
    subject to   ∀x_i ∈ D: y_i (x_i · w + b) ≥ 0    and    ∀x_i ∈ D: |b + x_i · w| ≥ 1
SVM: 17
Specifying a line and margin
(Figure: the classifier boundary with the parallel plus-plane and minus-plane.)
• Plus-plane  = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = −1 }
How do we compute the margin width M in terms of w and b?
Claim: the vector w is perpendicular to the plus-plane. Why?
SVM: 19
Computing the margin width
M = margin width. How do we compute M in terms of w and b?
• Plus-plane  = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = −1 }
• Claim: the vector w is perpendicular to the plus-plane. Why?
  Let u and v be two vectors on the plus-plane. What is w · (u − v)?
  (Both satisfy w · u + b = +1 and w · v + b = +1, so w · (u − v) = 0: w is orthogonal to every direction lying within the plane.)
• Let x⁻ be any point on the minus-plane (any location in Rᵐ, not necessarily a datapoint).
• Let x⁺ be the closest plus-plane point to x⁻.
SVM: 21
Computing the margin width
(Figure: x⁻ on the minus-plane and its closest plus-plane point x⁺; M = margin width.)
• Plus-plane  = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = −1 }
• The vector w is perpendicular to the plus-plane.
• Let x⁻ be any point on the minus-plane.
• Let x⁺ be the closest plus-plane point to x⁻.
• Claim: x⁺ = x⁻ + λw for some value of λ. Why?
SVM: 22
Computing the margin width
(Figure: x⁻, x⁺, M = margin width.)
The line from x⁻ to x⁺ is perpendicular to the planes, so to get from x⁻ to x⁺ we travel some distance in the direction of w.
• Plus-plane  = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = −1 }
• The vector w is perpendicular to the plus-plane.
• Let x⁻ be any point on the minus-plane.
• Let x⁺ be the closest plus-plane point to x⁻.
• Claim: x⁺ = x⁻ + λw for some value of λ.
SVM: 23
Computing the margin width
(Figure: x⁻, x⁺, M = margin width.)
What we know:
• w · x⁺ + b = +1
• w · x⁻ + b = −1
• x⁺ = x⁻ + λw
• |x⁺ − x⁻| = M
It is now easy to get M in terms of w and b.
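Filling in the algebra that the slide leaves implicit (a standard derivation, written here in LaTeX):

\begin{align*}
w \cdot x^{+} + b = 1
  &\;\Rightarrow\; w \cdot (x^{-} + \lambda w) + b = 1
  \;\Rightarrow\; \underbrace{(w \cdot x^{-} + b)}_{=-1} + \lambda\, w \cdot w = 1
  \;\Rightarrow\; \lambda = \frac{2}{w \cdot w},\\
M &= \lvert x^{+} - x^{-} \rvert = \lvert \lambda w \rvert = \lambda \sqrt{w \cdot w}
   = \frac{2}{w \cdot w}\sqrt{w \cdot w} = \frac{2}{\sqrt{w \cdot w}}.
\end{align*}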
SVM: 24
Maximize Margin
• How does this simplification come about?

The margin-maximization problem

    argmax_{w,b} min_{x_i ∈ D} |b + x_i · w| / √(Σ_{k=1}^{d} w_k²)    subject to ∀x_i ∈ D: y_i (x_i · w + b) ≥ 0

is equivalent to

    argmin_{w,b} Σ_{k=1}^{d} w_k²    subject to ∀x_i ∈ D: y_i (x_i · w + b) ≥ 1    (equivalently, ∀x_i ∈ D: |b + x_i · w| ≥ 1)

We have

    |b + x_i · w| / √(Σ_k w_k²)  =  (|b + x_i · w| / K) / √(Σ_k (w_k / K)²)  =  |b′ + x_i · w′| / √(Σ_k w′_k²)

for any K > 0, i.e. rescaling (w, b) to (w′, b′) = (w/K, b/K) leaves every distance unchanged. Choosing K = min_{x_i ∈ D} |b + x_i · w| makes the closest points satisfy |b′ + x_i · w′| = 1. Thus

    argmax_{w,b} min_{x_i ∈ D} |b + x_i · w| / √(Σ_k w_k²)  =  argmax_{w′,b′} 1 / √(Σ_k w′_k²)  =  argmin_{w′,b′} Σ_k w′_k²
SVM: 25
Maximum Margin Linear Classifier
{w*, b*} = argmin_{w,b} Σ_{k=1}^{d} w_k²
subject to
    y_1 (w · x_1 + b) ≥ 1
    y_2 (w · x_2 + b) ≥ 1
    ....
    y_N (w · x_N + b) ≥ 1
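To make the optimization concrete, here is a minimal Python sketch (not the lecture's code; the toy dataset is an assumption) that hands this primal QP to a general-purpose constrained solver:

import numpy as np
from scipy.optimize import minimize

# Toy linearly separable 2-D data (illustrative only)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])
d = X.shape[1]

def objective(u):                  # u = [w_1 .. w_d, b]
    w = u[:d]
    return np.dot(w, w)            # sum_k w_k^2

def margin_constraints(u):         # y_i (w . x_i + b) - 1 >= 0 for every i
    w, b = u[:d], u[d]
    return y * (X @ w + b) - 1.0

res = minimize(objective, x0=np.zeros(d + 1), method="SLSQP",
               constraints=[{"type": "ineq", "fun": margin_constraints}])
w_star, b_star = res.x[:d], res.x[d]
print("w* =", w_star, " b* =", b_star, " margin width =", 2 / np.linalg.norm(w_star))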
SVM: 26
Learning via Quadratic Programming
• QP is a well-studied class of optimization algorithms to
maximize a quadratic function of some real-valued
variables subject to linear constraints.
• For a detailed treatment of quadratic programming, see:
  Convex Optimization, Stephen P. Boyd (online edition, free for download):
  www.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf
SVM: 27
Quadratic Programming
Find   argmax_u   c + dᵀu + (uᵀ R u) / 2        Quadratic criterion
subject to linear constraints on u
SVM: 28
Quadratic Programming for the Linear Classifier
{w*, b*} = argmin_{w,b} wᵀ I_n w       (a quadratic criterion with no linear term)
subject to the N inequality constraints
    y_1 (w · x_1 + b) ≥ 1
    y_2 (w · x_2 + b) ≥ 1
    ....
    y_N (w · x_N + b) ≥ 1
SVM: 29
• Popular Tools - LibSVM
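For example, scikit-learn's SVC class is built on top of LIBSVM. A minimal usage sketch (the toy dataset and C value are illustrative assumptions, not from the slides):

import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6)          # a very large C approximates the hard-margin SVM
clf.fit(X, y)
print("w =", clf.coef_, " b =", clf.intercept_)
print("prediction for (1, 0):", clf.predict([[1.0, 0.0]]))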
SVM: 30
Uh-oh! This is going to be a problem!
What should we do?
(Figure: training points labeled +1 and −1 that are not linearly separable.)
SVM: 31
Uh-oh! This is going to be a problem!
What should we do?
Idea 1: Find minimum w·w while also minimizing the number of training-set errors.
Problem: two things to minimize makes for an ill-defined optimization.
SVM: 32
Uh-oh! This is going to be a problem!
What should we do?
Idea 1.1: Minimize  w·w + C · (#train errors),  where C is a tradeoff parameter.
SVM: 35
Learning Maximum Margin with Noise
(Figure: separating boundary with slack distances ε₂, ε₇, ε₁₁ for points on the wrong side of their margin; margin width M = 2 / √(w·w).)
Given a guess of w, b we can
• compute the sum of distances of points to their correct zones
• compute the margin width  M = 2 / √(w·w)
Assume R datapoints, each (x_k, y_k) where y_k = ±1.
SVM: 36
Learning Maximum Margin with Noise
(Figure as before; m = # input dimensions, margin width M = 2 / √(w·w).)
Given a guess of w, b we can
• compute the sum of distances of points to their correct zones
• compute the margin width
Assume R datapoints, each (x_k, y_k) where y_k = ±1.
Our original (noiseless-data) QP had m + 1 variables: w_1, w_2, …, w_m, and b.
Our new (noisy-data) QP has m + 1 + R variables: w_1, w_2, …, w_m, b, ε_1, …, ε_R.
SVM: 37
Support Vector Machine (SVM) for Noisy Data
(Figure: +1 / −1 points, with slack distances ε₁, ε₃ for points that violate their margin.)
{w*, b*} = argmin_{w,b}  Σ_{i=1}^{d} w_i²  +  c Σ_{j=1}^{N} ε_j
subject to
    y_1 (w · x_1 + b) ≥ 1 − ε_1
    y_2 (w · x_2 + b) ≥ 1 − ε_2
    ...
    y_N (w · x_N + b) ≥ 1 − ε_N
SVM: 38
Support Vector Machine (SVM) for Noisy Data
(Figure as on the previous slide.)
{w*, b*} = argmin_{w,b}  Σ_{i=1}^{d} w_i²  +  c Σ_{j=1}^{N} ε_j
subject to
    y_1 (w · x_1 + b) ≥ 1 − ε_1,   ε_1 ≥ 0
    y_2 (w · x_2 + b) ≥ 1 − ε_2,   ε_2 ≥ 0
    ...
    y_N (w · x_N + b) ≥ 1 − ε_N,   ε_N ≥ 0
SVM: 39
Support Vector Machine for Noisy Data
{w*, b*} = argmin_{w,b}  Σ_{i=1}^{d} w_i²  +  c Σ_{j=1}^{N} ε_j
subject to the N inequality constraints
    y_1 (w · x_1 + b) ≥ 1 − ε_1,   ε_1 ≥ 0
    y_2 (w · x_2 + b) ≥ 1 − ε_2,   ε_2 ≥ 0
    ....
    y_N (w · x_N + b) ≥ 1 − ε_N,   ε_N ≥ 0
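A minimal Python sketch of this noisy-data QP (toy data and the value of c are my own assumptions), which makes the m + 1 + R optimization variables explicit:

import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.5, -0.5],      # the third point is a "noisy" one
              [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, 1, -1, -1])
m, R, c = X.shape[1], X.shape[0], 1.0                    # tradeoff parameter c (assumed value)

def objective(u):                                        # u = [w (m), b (1), eps (R)]
    w, eps = u[:m], u[m + 1:]
    return np.dot(w, w) + c * np.sum(eps)

cons = [
    {"type": "ineq", "fun": lambda u: y * (X @ u[:m] + u[m]) - 1.0 + u[m + 1:]},  # margin with slack
    {"type": "ineq", "fun": lambda u: u[m + 1:]},                                 # eps_j >= 0
]
res = minimize(objective, x0=np.zeros(m + 1 + R), method="SLSQP", constraints=cons)
w_star, b_star, eps_star = res.x[:m], res.x[m], res.x[m + 1:]
print("w* =", w_star, " b* =", b_star, " slacks =", np.round(eps_star, 3))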
SVM: 40
• Therefore, the problem of maximizing the margin is
equivalent to
    Minimize      J(w) = ½ ‖w‖²
    subject to    y_i (wᵀ x_i + b) ≥ 1    ∀i
– Notice that 𝐽(𝑤) is a quadratic function, which means that there exists
a single global minimum and no local minima
• To solve this problem, we will use classical Lagrangian
optimization techniques
– We first present the Kuhn-Tucker Theorem, which provides an
essential result for the interpretation of Support Vector Machines
• The solution is a saddle point of the primal Lagrangian
    L_P(w, b, α) = ½ wᵀw − Σ_{i=1}^{N} α_i [y_i (wᵀ x_i + b) − 1]
                 = ½ wᵀw − Σ_{i=1}^{N} α_i y_i wᵀ x_i − b Σ_{i=1}^{N} α_i y_i + Σ_{i=1}^{N} α_i,    α_i ≥ 0
– Expanding the first term at the saddle point, where w = Σ_{i=1}^{N} α_i y_i x_i:
    wᵀw = wᵀ Σ_{i=1}^{N} α_i y_i x_i = Σ_{i=1}^{N} α_i y_i wᵀ x_i = Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j y_i y_j x_iᵀ x_j
– The second term in L_P can be expressed in the same way
– The third term in L_P is zero by virtue of the optimality condition ∂J/∂b = 0
[Haykin, 1999]
– Merging these expressions together we obtain
    L_D(α) = Σ_{i=1}^{N} α_i − ½ Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j y_i y_j x_iᵀ x_j
– subject to the (simpler) constraints α_i ≥ 0 and Σ_{i=1}^{N} α_i y_i = 0
– This is known as the Lagrangian dual problem
• Comments
– We have transformed the problem of finding a saddle point for L_P(w, b, α) into the easier one of maximizing L_D(α)
  • Notice that L_D(α) depends on the Lagrange multipliers α, not on (w, b)
– The primal problem scales with dimensionality (𝑤 has one coefficient
for each dimension), whereas the dual problem scales with the
amount of training data (there is one Lagrange multiplier per example)
– Moreover, in 𝐿𝐷 𝛼 training data appears only as dot products 𝑥𝑖𝑇 𝑥𝑗
• As we will see in the next lecture, this property can be cleverly exploited
to perform the classification in a higher (e.g., infinite) dimensional space
Support Vectors
• The KKT complementary condition states that, for every point in
the training set, the following equality must hold
    α_i [y_i (wᵀ x_i + b) − 1] = 0     ∀i = 1..N
– Therefore, for every training point, either α_i = 0 or y_i (wᵀ x_i + b) − 1 = 0 must hold
  • Those points for which α_i > 0 must then lie on one of the two hyperplanes that define the largest
    margin (the term y_i (wᵀ x_i + b) − 1 becomes zero only on these hyperplanes)
• These points are known as the Support Vectors
• All the other points must have 𝛼𝑖 = 0
– Note that only the SVs contribute to defining the optimal hyperplane
    ∂J(w, b, α)/∂w = 0   ⇒   w = Σ_{i=1}^{N} α_i y_i x_i
– NOTE: the bias term b is found from the KKT complementary condition on the support vectors
– Therefore, the complete dataset could be replaced by only the support vectors (the points with α_i > 0),
  and the resulting hyperplane would be exactly the same
(Figure: the support vectors, the points with α > 0, lie on the margin hyperplanes.)
Then define:
    w = Σ_{k=1}^{R} α_k y_k x_k
SVM: 41
An Equivalent QP
Maximize   Σ_{k=1}^{R} α_k − ½ Σ_{k=1}^{R} Σ_{l=1}^{R} α_k α_l Q_kl      where  Q_kl = y_k y_l (x_k · x_l)
Subject to these constraints:
    0 ≤ α_k ≤ C   ∀k          Σ_{k=1}^{R} α_k y_k = 0
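A minimal Python sketch of this dual QP (the toy dataset and C value are my own assumptions, not the lecture's code); note that the training data enters only through the dot products in Q:

import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
R, C = len(y), 10.0

Q = (y[:, None] * y[None, :]) * (X @ X.T)          # Q_kl = y_k y_l (x_k . x_l)

def neg_dual(a):                                    # maximizing the dual == minimizing its negative
    return 0.5 * a @ Q @ a - a.sum()

res = minimize(neg_dual, x0=np.zeros(R), method="SLSQP",
               bounds=[(0.0, C)] * R,
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])
alpha = res.x
w = (alpha * y) @ X                                 # w = sum_k alpha_k y_k x_k
sv = alpha > 1e-6                                   # support vectors: alpha_k > 0
b = np.mean(y[sv] - X[sv] @ w)                      # margin constraint is tight on the SVs
print("alpha =", np.round(alpha, 4), " w =", w, " b =", round(float(b), 4))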
SVM: 42
Support Vectors
(Figure: +1 / −1 points with the planes w·x + b = +1 and w·x + b = −1; the support vectors lie on them.)
Support vectors are the points x_i with α_i > 0; for them  y_i (w · x_i + b) = 1 − ε_i,  ε_i ≥ 0
    α_i = 0 for non-support vectors
    α_i > 0 for support vectors
    w = Σ_{k=1}^{R} α_k y_k x_k
The decision boundary is determined only by the support vectors!
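An illustrative cross-check (my example, not the slides'): a LIBSVM-based fit in scikit-learn exposes exactly these quantities, and w can be rebuilt from the support vectors alone:

import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=10.0).fit(X, y)
print("support vector indices:", clf.support_)     # the points with alpha_k > 0
print("support vectors:", clf.support_vectors_)
print("alpha_k * y_k:", clf.dual_coef_)
w = clf.dual_coef_ @ clf.support_vectors_           # w = sum over support vectors only
print("w =", w, " matches clf.coef_ =", clf.coef_)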
SVM: 43
The Dual Form of QP
Maximize   Σ_{k=1}^{R} α_k − ½ Σ_{k=1}^{R} Σ_{l=1}^{R} α_k α_l Q_kl      where  Q_kl = y_k y_l (x_k · x_l)
Subject to these constraints:
    0 ≤ α_k ≤ C   ∀k          Σ_{k=1}^{R} α_k y_k = 0
Then define:
    w = Σ_{k=1}^{R} α_k y_k x_k
Then classify with:
    f(x, w, b) = sign(w · x + b)
How to determine b?
SVM: 44
An Equivalent QP: Determine b
The bias b follows from the KKT complementary condition on the support vectors: for any support vector x_k with 0 < α_k < C the margin constraint is tight, y_k (w · x_k + b) = 1, so b = y_k − w · x_k (in practice averaged over all such support vectors).
SVM: 45
• The parameter c controls the trade-off between fitting the training data and tolerating noise: a large c penalizes training errors heavily, while a small c favors a wider margin at the cost of more slack.
SVM: 46
Feature Transformation ?
• The problem is non-linear
• Find some trick to transform the input
• The data become linearly separable after the feature transformation
• What features should we use?
Basic idea: the XOR problem (a small sketch follows below)
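A minimal sketch of the basic idea on the XOR problem (my own illustration; the product feature is one possible choice of transformation):

import numpy as np

X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([1, -1, -1, 1])                    # XOR-style labels: not linearly separable in (x1, x2)

phi = np.column_stack([X, X[:, 0] * X[:, 1]])   # transform (x1, x2) -> (x1, x2, x1*x2)
w, b = np.array([0.0, 0.0, 1.0]), 0.0           # in the transformed space this hyperplane separates them
print(np.sign(phi @ w + b))                     # [ 1. -1. -1.  1.], matching y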
SVM: 47
Suppose we’re in 1-dimension
What would SVMs do with this data?
(Figure: a 1-D dataset plotted along the x-axis around x = 0.)
SVM: 48
Suppose we’re in 1-dimension
(Figure: the SVM places its boundary between the two classes, with a positive "plane" on one side and a negative "plane" on the other.)
SVM: 49
Harder 1-dimensional dataset
(Figure: a 1-D dataset around x = 0 in which no single threshold separates the classes.)
SVM: 50
Harder 1-dimensional dataset
Remap the data into two dimensions:  z_k = (x_k, x_k²)
SVM: 51
Harder 1-dimensional dataset
(Figure: after the remap z_k = (x_k, x_k²) the classes are linearly separable in 2-D.)
Feature enumeration: each input x_k is transformed to φ(x_k) = z_k
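A small Python illustration of this remap (the 1-D dataset is an assumption):

import numpy as np
from sklearn.svm import SVC

x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([ 1,    1,   -1,  -1,  -1,   1,   1])    # negatives in the middle: no 1-D threshold works

Z = np.column_stack([x, x ** 2])                      # z_k = (x_k, x_k^2)
clf = SVC(kernel="linear", C=1e6).fit(Z, y)
print("training accuracy after the remap:", clf.score(Z, y))    # 1.0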
SVM: 52
Non-linear SVMs: Feature spaces
• General idea: the original input space can always be mapped
to some higher-dimensional feature space where the training
set is separable:
Φ: x → φ(x)
SVM: 53
• Polynomial features for the XOR problem
SVM: 54
Kernel methods
• Let’s now see how to put together all these concepts
– Assume that our original feature vector x lives in a space R^D
– We are interested in non-linearly projecting x onto a higher-dimensional implicit space φ(x) ∈ R^{D₁} (with D₁ > D) where classes have a better chance of being linearly separable
  • Notice that we are not guaranteeing linear separability, we are only saying that we have a better chance because of Cover's theorem
– The separating hyperplane in R^{D₁} will be defined by
    Σ_{j=1}^{D₁} w_j φ_j(x) + b = 0
– To eliminate the bias term 𝑏, let’s augment the feature vector in the
implicit space with a constant dimension 𝜑0 (𝑥) = 1
– Using vector notation, the resulting hyperplane becomes
    wᵀ φ(x) = 0
– From our previous results, the optimal (maximum-margin) hyperplane in the implicit space is given by
    w = Σ_{i=1}^{N} α_i y_i φ(x_i)
  ⇒  Σ_{i=1}^{N} α_i y_i φ(x_i)ᵀ φ(x) = 0
– and, since φ(x_i)ᵀ φ(x_j) = K(x_i, x_j), the optimal hyperplane becomes
    Σ_{i=1}^{N} α_i y_i K(x_i, x) = 0
• By Mercer's theorem, such a kernel can be expanded as
    K(x, x′) = Σ_{i=1}^{∞} λ_i φ_i(x) φ_i(x′)
  • Strictly speaking, the space where φ(x) resides is a Hilbert space, a "generalization" of a Euclidean space where the inner product can be any inner product, not just the scalar dot product [Burges, 1998]
  • With positive coefficients λ_i > 0 ∀i
– For this expansion to be valid and for it to converge absolutely and uniformly, it is necessary and sufficient that the condition
    ∫_a^b ∫_a^b K(x, x′) ψ(x) ψ(x′) dx dx′ ≥ 0
– holds for all ψ(·) for which ∫_a^b ψ(x)² dx < ∞
• The functions 𝜑𝑖 𝑥 are called eigenfunctions of the expansion, and the numbers 𝜆𝑖 are
the eigenvalues. The fact that all of the eigenvalues are positive means that the kernel is
positive definite
– Notice that the dimensionality of the implicit space can be infinitely large
– Mercer’s Condition only tells us whether a kernel is actually an inner-product
kernel, but it does not tell us how to construct the functions 𝜑𝑖 𝑥 for the
expansion
[Haykin, 1999]
• Which kernels meet Mercer's condition?
– Polynomial kernels:  K(x, x′) = (xᵀ x′ + 1)^p
  (Kernel trick: evaluating the kernel directly needs far fewer multiplications than building the explicit feature mapping and taking its dot product; in the slide's small example, 3 multiplications instead of 9.)
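A quick numerical check of the kernel trick for the degree-2 polynomial kernel in 2-D (my own example):

import numpy as np

def phi(x):
    # Explicit feature map whose dot product reproduces (x . z + 1)^2 in 2-D
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     np.sqrt(2) * x1 * x2, x1 ** 2, x2 ** 2])

def poly_kernel(x, z, p=2):
    return (np.dot(x, z) + 1.0) ** p

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(x) @ phi(z))     # explicit mapping: builds 6-D vectors first
print(poly_kernel(x, z))   # kernel trick: same value, computed directly from the 2-D inputs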
SVM: 55
Architecture of an SVM
(Figure: architecture of an SVM. The input vector x feeds kernel units K(x, x₁), K(x, x₂), …, evaluated against the support vectors; a linear projection of their outputs, together with a bias, is formed at the output neuron to produce the output variable y.)
[Haykin, 1999]
Numerical example
• To illustrate the operation of a non-linear SVM we will solve the
classical XOR problem
– Dataset
• Class 1: 𝑥1 = (−1, −1), 𝑥4 = (+1, +1)
• Class 2: 𝑥2 = (−1, +1), 𝑥3 = (+1, −1)
– Kernel function
• Polynomial of order 2:  K(x, x′) = (xᵀ x′ + 1)²
• Solution
– The implicit mapping can be shown to be 5-dimensional (plus the augmented constant dimension φ₀(x) = 1):
    φ(x) = [ 1   √2 x₁   √2 x₂   √2 x₁x₂   x₁²   x₂² ]ᵀ
– To achieve linear separability, we will use 𝐶 = ∞
– The objective function for the dual problem becomes
    L_D(α) = α₁ + α₂ + α₃ + α₄ − ½ Σ_{i=1}^{4} Σ_{j=1}^{4} α_i α_j y_i y_j k_ij
  • subject to the constraints   Σ_{i=1}^{N} α_i y_i = 0   and   0 ≤ α_i ≤ C,   i = 1 … N
[Cherkassky and Mulier, 1998; Haykin, 1999]
– where the inner products are collected in the 4 × 4 kernel matrix
    K = [ 9 1 1 1
          1 9 1 1
          1 1 9 1
          1 1 1 9 ]
– Optimizing with respect to the Lagrange multipliers leads to the
following system of equations
9𝛼1 − 𝛼2 − 𝛼3 + 𝛼4 = 1
−𝛼1 + 9𝛼2 + 𝛼3 − 𝛼4 = 1
−𝛼1 + 𝛼2 + 9𝛼3 − 𝛼4 = 1
𝛼1 − 𝛼2 − 𝛼3 + 9𝛼4 = 1
– whose solution is 𝛼1 = 𝛼2 = 𝛼3 = 𝛼4 = 0.125
– The weight vector follows from w = Σ_{i=1}^{4} α_i y_i φ(x_i), giving the discriminant
    g(x) = Σ_i w_i φ_i(x) = x₁ x₂
– which has zero empirical error on the XOR training set
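A short numerical check of this example (the code is mine; it reproduces the slide's numbers):

import numpy as np

X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)   # x1, x2, x3, x4
y = np.array([1, -1, -1, 1], dtype=float)                         # class 1: x1, x4; class 2: x2, x3

K = (X @ X.T + 1.0) ** 2                # order-2 polynomial kernel: 9 on the diagonal, 1 elsewhere
print(K)

A = (y[:, None] * y[None, :]) * K       # optimality conditions dL_D/dalpha_i = 0 give  A alpha = 1
alpha = np.linalg.solve(A, np.ones(4))
print(alpha)                            # [0.125 0.125 0.125 0.125]

g = lambda x: np.sum(alpha * y * (X @ x + 1.0) ** 2)   # g(x) = sum_i alpha_i y_i K(x_i, x)
print([g(x) for x in X])                # [1, -1, -1, 1] = x1 * x2 on the training set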
SVM: 57
• SVMs with non-linear kernels can solve complicated non-linear problems
• γ and c control the complexity of the decision boundary
(Figure: two example decision boundaries obtained with different settings.)
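An illustrative sweep (the dataset and parameter values are assumptions, not from the slides) showing how γ and C change the fitted model with an RBF kernel:

import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
for gamma in (0.1, 10.0):
    for C in (0.1, 100.0):
        clf = SVC(kernel="rbf", gamma=gamma, C=C).fit(X, y)
        print(f"gamma={gamma}, C={C} -> training accuracy {clf.score(X, y):.2f}, "
              f"#support vectors {len(clf.support_)}")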
SVM: 58
Discussion
• Advantages of SVMs
– There are no local minima, because the solution is a QP problem
– The optimal solution can be found in polynomial time
– Few model parameters to select: the penalty term C, the kernel
function and parameters (e.g., spread 𝜎 in the case of RBF kernels)
– Final results are stable and repeatable (e.g., no random initial weights)
– SVM solution is sparse; it only involves the support vectors
– SVMs represent a general methodology for many PR problems:
classification, regression, feature extraction, clustering, novelty
detection, etc.
– SVMs rely on elegant and principled learning methods
– SVMs provide a method to control complexity independently of
dimensionality
– SVMs have been shown (theoretically and empirically) to have
excellent generalization capabilities
[Bennett and Campbell, 2000]
• Challenges
– Do SVMs always perform best? Can they beat a hand-crafted solution
for a particular problem?
– Do SVMs eliminate the model selection problem? Can the kernel
functions be selected in a principled manner? SVMs still require
selection of a few parameters, typically through cross-validation
– How does one incorporate domain knowledge? Currently this is
performed through the selection of the kernel and the introduction of
“artificial” examples
– How interpretable are the results provided by an SVM?
– What is the optimal data representation for SVM? What is the effect of
feature weighting? How does an SVM handle categorical or missing
features?