SVM EXAMPLE
Slide 1
Overview
Classification
Support vector machine
Regularization
Slide 2
Classification
Predict a categorical output (i.e., one of two or more classes) from
input attributes (i.e., features)
f(X) = W^T X + C:  f(X) ≥ 0 → Class A,  f(X) < 0 → Class B
[Figure: Class A and Class B samples in the (feature x1, feature x2) plane, separated by the linear decision boundary f(X) = 0]
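A minimal sketch of this decision rule in Python (not part of the original slides; the weight vector W and offset C are made-up values):

import numpy as np

# Hypothetical weights and offset for a two-feature linear classifier
W = np.array([1.0, -0.5])
C = 0.2

def classify(X):
    """Return 'Class A' if f(X) = W^T X + C >= 0, else 'Class B'."""
    return "Class A" if W @ X + C >= 0 else "Class B"

print(classify(np.array([2.0, 1.0])))    # f = 1.7  -> Class A
print(classify(np.array([-1.0, 3.0])))   # f = -2.3 -> Class B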
Slide 3
Classification
Classification vs. regression
Regression:     input attributes → prediction of a real-valued output
Classification: input attributes → prediction of a categorical output
Slide 4
Classification Examples
Identify hand-written digits from US zip codes
Slide 5
Classification Examples
Identify geometrical structure from oil flow data
Slide 6
Support Vector Machine (SVM)
Support vector machine (SVM) is a popular algorithm used for
many classification problems
Key idea: maximize the classification margin, which makes the classifier more robust to noise
f(X) = W^T X + C:  f(X) ≥ 0 → Class A,  f(X) < 0 → Class B
[Figure: Class A and Class B samples in the (feature x1, feature x2) plane; the SVM places the boundary so that the margin between the two classes is as wide as possible]
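As an illustrative sketch (not from the slides), a linear SVM can be fit with scikit-learn, assuming it is installed; the toy data are made up, and note that SVC's C argument is a penalty parameter, not the offset C used in these slides:

import numpy as np
from sklearn.svm import SVC

# Made-up, linearly separable toy data with two features per sample
X = np.array([[2.0, 2.0], [1.5, 2.5], [2.5, 1.0],          # Class A
              [-2.0, -1.5], [-1.0, -2.5], [-2.5, -2.0]])   # Class B
y = np.array([1, 1, 1, -1, -1, -1])

# A very large penalty approximates the hard-margin SVM described here
clf = SVC(kernel="linear", C=1e6).fit(X, y)
print("W =", clf.coef_[0], " C (offset) =", clf.intercept_[0])
print("margin =", 2.0 / np.linalg.norm(clf.coef_[0]))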
Slide 7
Margin Calculation
To maximize the margin, we must first express the margin as a
function of W and C
f(X) = W^T X + C:  f(X) ≥ 0 → Class A,  f(X) < 0 → Class B
Decision boundary: W^T X + C = 0
Plus plane:  W^T X + C = 1
Minus plane: W^T X + C = −1
(The right-hand side can be normalized to ±1 by rescaling W and C)
The margin is the distance between the plus and minus planes; the samples lying on these planes are the support vectors.
[Figure: Class A and Class B samples with the decision boundary, plus/minus planes, margin, and support vectors labeled]
Slide 8
Margin Calculation
W is perpendicular to the plus/minus planes
Plus plane:  W^T X + C = 1
Minus plane: W^T X + C = −1
Take any two points A and B on the plus plane:
W^T A + C = 1
W^T B + C = 1
Subtracting gives W^T (A − B) = 0, i.e., W is perpendicular to (A − B) and hence to the plus plane (and, by the same argument, to the minus plane).
[Figure: points A and B on the plus plane in the (x1, x2) plane, with W drawn perpendicular to it]
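A quick numeric check of this argument (the vector W, offset C, and points A and B below are made-up values chosen so that A and B lie on the plus plane):

import numpy as np

W = np.array([0.5, 0.5])   # made-up weight vector
C = 0.0                    # made-up offset
# Two points chosen so that W^T A + C = 1 and W^T B + C = 1
A = np.array([2.0, 0.0])
B = np.array([0.0, 2.0])

print(W @ A + C, W @ B + C)   # both equal 1: A and B lie on the plus plane
print(W @ (A - B))            # 0: W is perpendicular to (A - B)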
Slide 9
Margin Calculation
The margin equals the distance between Xm (a point on the minus plane) and Xp (the corresponding point on the plus plane reached by moving along W):
Xp = Xm + λW
Margin = ||Xp − Xm||_2 = λ ||W||_2
[Figure: Xm on the minus plane W^T X + C = −1 and Xp on the plus plane W^T X + C = 1, connected by the vector λW]
Slide 10
Margin Calculation
Xp = Xm + λW
W^T Xp + C = 1      (Xp lies on the plus plane)
W^T Xm + C = −1     (Xm lies on the minus plane)
Subtracting the two equations: W^T (Xp − Xm) = λ W^T W = 2
[Figure: same construction as on the previous slide]
Slide 11
Margin Calculation
λ W^T W = 2   ⇒   λ = 2 / (W^T W)
Margin = λ ||W||_2 = λ · sqrt(W^T W) = 2 / sqrt(W^T W)
[Figure: same construction as on the previous slides]
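A one-line numeric check of the margin formula (the weight vector below is a made-up example):

import numpy as np

W = np.array([0.5, 0.5])                 # made-up weight vector
margin = 2.0 / np.sqrt(W @ W)            # 2 / sqrt(W^T W)
print(margin, 2.0 / np.linalg.norm(W))   # both print ~2.828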
Slide 12
Mathematical Formulation
Start from a set of training samples
(Xi, yi)   (i = 1, 2, ..., N)
Xi: input features of the i-th sample
yi: output label of the i-th sample
Class A → yi = 1
Class B → yi = −1
Class A (yi = 1):   W^T Xi + C ≥ 1    ⇒   yi · (W^T Xi + C) ≥ 1
Class B (yi = −1):  W^T Xi + C ≤ −1   ⇒   yi · (W^T Xi + C) ≥ 1
Both cases reduce to the single constraint yi · (W^T Xi + C) ≥ 1.
[Figure: Class A samples on or beyond the plus plane and Class B samples on or beyond the minus plane]
Slide 13
Mathematical Formulation
Formulate a convex optimization problem

max_{W,C}  2 / sqrt(W^T W)                                  Maximize the margin
S.T.  yi · (W^T Xi + C) ≥ 1   (i = 1, 2, ..., N)            All training samples are on the correct side

Equivalently, minimize W^T W:

min_{W,C}  W^T W                                            Quadratic (convex) cost
S.T.  yi · (W^T Xi + C) ≥ 1   (i = 1, 2, ..., N)            Linear constraints
(Convex optimization)
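This quadratic program maps almost line-for-line onto a convex modeling tool. A minimal sketch, assuming the cvxpy and numpy packages are available and using made-up toy data (an illustration, not the method used on these slides):

import numpy as np
import cvxpy as cp

# Made-up, linearly separable toy data
X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -1.5], [-1.0, -2.5]])
y = np.array([1, 1, -1, -1])
N, d = X.shape

W = cp.Variable(d)
C = cp.Variable()

# min W^T W  s.t.  yi (W^T Xi + C) >= 1 for all i
constraints = [int(y[i]) * (X[i] @ W + C) >= 1 for i in range(N)]
prob = cp.Problem(cp.Minimize(cp.sum_squares(W)), constraints)
prob.solve()

print("W =", W.value, " C =", C.value)
print("margin =", 2.0 / np.linalg.norm(W.value))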
Slide 14
A Simple SVM Example
Two training samples
Class A: x1 = 1, x2 = 1, y = 1
Class B: x1 = −1, x2 = −1, y = −1
f(X) = w1·x1 + w2·x2 + C:  f(X) ≥ 0 → Class A,  f(X) < 0 → Class B
Solve for w1, w2, and C to determine the classifier.
[Figure: the two training samples at (1, 1) and (−1, −1) in the (x1, x2) plane]
Slide 15
A Simple SVM Example
Two training samples
Class A: x1 = 1, x2 = 1, y = 1
Class B: x1 = −1, x2 = −1, y = −1
min_{W,C}  W^T W
S.T.  yi · (W^T Xi + C) ≥ 1   (i = 1, 2, ..., N)
[Figure: the same two training samples in the (x1, x2) plane]
Slide 16
A Simple SVM Example
min_{W,C}  w1² + w2²
S.T.   1 · (w1 + w2 + C) ≥ 1      (Class A sample)
      −1 · (−w1 − w2 + C) ≥ 1     (Class B sample)

The two constraints simplify to w1 + w2 ≥ 1 − C and w1 + w2 ≥ 1 + C, so the problem becomes

min_{W,C}  w1² + w2²
S.T.  w1 + w2 ≥ 1 − C,  w1 + w2 ≥ 1 + C

Solution: w1 = w2 = 0.5, C = 0
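A quick numeric check (not on the original slides) that this solution satisfies both margin constraints with equality:

import numpy as np

# Candidate solution from the slide
w = np.array([0.5, 0.5])
C = 0.0

X = np.array([[1.0, 1.0], [-1.0, -1.0]])   # the two training samples
y = np.array([1, -1])

# Both constraints yi (w^T Xi + C) >= 1 hold with equality,
# so both samples are support vectors and the margin is 2/||w||.
print(y * (X @ w + C))              # [1. 1.]
print(2.0 / np.linalg.norm(w))      # 2*sqrt(2) ~ 2.828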
Slide 17
A Simple SVM Example
Two training samples
Class A: x1 = 1, x2 = 1, y = 1
Class B: x1 = −1, x2 = −1, y = −1
Solution: w1 = w2 = 0.5, C = 0
[Figure: the resulting decision boundary 0.5·x1 + 0.5·x2 = 0 separating the two samples in the (x1, x2) plane]
Slide 18
Support Vector Machine with Noise
In practice, training samples may contain noise or may not be linearly separable
min_{W,C}  W^T W
S.T.  yi · (W^T Xi + C) ≥ 1   (i = 1, 2, ..., N)
(No feasible solution when the classes overlap)
[Figure: overlapping Class A and Class B samples in the (feature x1, feature x2) plane]
Slide 19
Support Vector Machine with Noise
Can be solved by convex programming
Cost: sum of two convex functions (a linear term and a quadratic term)
Constraints: linear and hence convex

min_{W,C,ξ}  Σi ξi + λ · W^T W          Linear (convex) + quadratic (convex) ⇒ convex cost
S.T.  yi · (W^T Xi + C) ≥ 1 − ξi        Linear constraints
      ξi ≥ 0
      (i = 1, 2, ..., N)
(Convex optimization)
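A minimal sketch of this soft-margin problem in cvxpy, assuming cvxpy and numpy are available; the toy data and the value of λ below are made up:

import numpy as np
import cvxpy as cp

# Made-up, slightly overlapping toy data
X = np.array([[2.0, 2.0], [1.0, 0.5], [-0.5, 0.5],        # Class A
              [-2.0, -1.5], [-1.0, -2.0], [0.5, -0.5]])   # Class B
y = np.array([1, 1, 1, -1, -1, -1])
N, d = X.shape
lam = 0.1   # made-up regularization weight

W = cp.Variable(d)
C = cp.Variable()
xi = cp.Variable(N, nonneg=True)   # slack variables, one per sample

objective = cp.Minimize(cp.sum(xi) + lam * cp.sum_squares(W))
constraints = [cp.multiply(y, X @ W + C) >= 1 - xi]
cp.Problem(objective, constraints).solve()

print("W =", W.value, " C =", C.value)
print("slacks =", np.round(xi.value, 3))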
Slide 20
Regularization
Regression vs. classification: the regularization term (weighted by λ) appears in both.

Regression:
min_α  ||A·α − B||²_2 + λ · ||α||²_2

Support vector machine:
min_{W,C,ξ}  Σi ξi + λ · W^T W
S.T.  yi · (W^T Xi + C) ≥ 1 − ξi,  ξi ≥ 0   (i = 1, 2, ..., N)
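For comparison, the regularized regression problem above can be written in the same style; a sketch with made-up A, B, and λ, again assuming cvxpy and numpy are available:

import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))   # made-up design matrix
B = rng.standard_normal(20)        # made-up targets
lam = 0.1                          # made-up regularization weight

alpha = cp.Variable(5)
# min ||A @ alpha - B||_2^2 + lam * ||alpha||_2^2
cp.Problem(cp.Minimize(cp.sum_squares(A @ alpha - B)
                       + lam * cp.sum_squares(alpha))).solve()
print(np.round(alpha.value, 3))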
Slide 21
Regularization
L1-norm regularization is used to find a sparse solution of W
Quadratic regularization (previous slide):
min_{W,C,ξ}  Σi ξi + λ · W^T W
S.T.  yi · (W^T Xi + C) ≥ 1 − ξi,  ξi ≥ 0   (i = 1, 2, ..., N)

L1-norm regularization:
min_{W,C,ξ}  Σi ξi + λ · ||W||_1
S.T.  yi · (W^T Xi + C) ≥ 1 − ξi,  ξi ≥ 0   (i = 1, 2, ..., N)
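Switching to the L1-norm regularizer is a one-line change in the same kind of cvxpy sketch (again with made-up data and λ; cp.norm1 is cvxpy's L1-norm atom):

import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0, 0.1], [1.5, 2.5, -0.2],       # Class A (3 features)
              [-2.0, -1.5, 0.3], [-1.0, -2.5, -0.1]])  # Class B
y = np.array([1, 1, -1, -1])
N, d = X.shape
lam = 1.0   # made-up regularization weight

W, C = cp.Variable(d), cp.Variable()
xi = cp.Variable(N, nonneg=True)
# The L1 regularizer cp.norm1(W) encourages zero entries in W
cp.Problem(cp.Minimize(cp.sum(xi) + lam * cp.norm1(W)),
           [cp.multiply(y, X @ W + C) >= 1 - xi]).solve()
print(np.round(W.value, 3))   # small/zero weights mark unused features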
Slide 22
Regularization
Feature selection
f(X) = W^T X + C:  f(X) ≥ 0 → Class A,  f(X) < 0 → Class B
With a sparse weight vector, e.g.
W^T = [0  0  ×  0  ×],   X = [x1  x2  x3  x4  x5]^T
only the features multiplied by nonzero weights (here x3 and x5) influence f(X); these are the important features, and the remaining features are effectively discarded.
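Continuing with a hypothetical five-feature example, reading off the selected features from a sparse W is straightforward:

import numpy as np

# Hypothetical sparse weight vector (pattern from the slide: x3 and x5 matter)
W = np.array([0.0, 0.0, 0.8, 0.0, -0.4])
feature_names = ["x1", "x2", "x3", "x4", "x5"]

selected = [name for name, w in zip(feature_names, W) if abs(w) > 1e-6]
print(selected)   # ['x3', 'x5']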
Slide 23
Summary
Classification
Support vector machine
Regularization
Slide 24