Support Vector Machines
A central result in convex optimization is that the original problem can be solved by maximizing $F$ subject to $\alpha_i \geq 0$ and $\alpha_i g(x_i) = 0$.
Hyperplanes and SVMs. Suppose we have data $(X_1, Y_1), \ldots, (X_n, Y_n)$ that can be separated by a hyperplane. Let $b + w^T x = 0$ be such a hyperplane, so that $Y_i(b + X_i^T w) > 0$ for all $i$. Any re-scaled version of the hyperplane is the same classifier, so re-scale the hyperplane so that
\[
\min_i |b + w^T X_i| = 1.
\]
After this re-scaling, $Y_i(b + X_i^T w) \geq 1$ for all $i$.
If $x_0$ is any point, then using some simple algebra, we find that the distance from $x_0$ to the hyperplane is
\[
\frac{|b + w^T x_0|}{\|w\|}.
\]
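To spell out the "simple algebra" (this is just the standard projection argument; the point $x^*$ below is introduced only for this derivation): write $x_0 = x^* + t\, w/\|w\|$, where $x^*$ is the closest point on the hyperplane, so the distance is $|t|$. Then
\[
0 = b + w^T x^* = b + w^T\Big(x_0 - t\,\frac{w}{\|w\|}\Big) = b + w^T x_0 - t\,\|w\|,
\]
so $|t| = |b + w^T x_0|/\|w\|$.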
We call the distance to the closest point the margin $\rho$. Since $\min_i |b + w^T X_i| = 1$, we see that
\[
\rho = \min_i \frac{|w^T X_i + b|}{\|w\|} = \frac{1}{\|w\|}.
\]
The support vector machine (SVM) is the hyperplane that maximizes the margin. But maximizing $1/\|w\|$ is the same as minimizing $\|w\|$, which is the same as minimizing $(1/2)\|w\|^2$. So finding the SVM corresponds to:
\[
\min_{w,b} \ \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad Y_i(w^T X_i + b) \geq 1, \qquad i = 1, \ldots, n.
\]
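As a concrete illustration, here is a minimal numerical sketch of this optimization using scipy.optimize.minimize with the SLSQP method; the toy data and all variable names are hypothetical, and a serious implementation would use a dedicated quadratic programming solver (the dual formulation below is how this is usually done).

import numpy as np
from scipy.optimize import minimize

# Hypothetical, linearly separable toy data: X is n x d, Y in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -3.0]])
Y = np.array([1.0, 1.0, -1.0, -1.0])
n, d = X.shape

# Parameter vector theta = (w_1, ..., w_d, b).
def objective(theta):
    w = theta[:d]
    return 0.5 * np.dot(w, w)               # (1/2) ||w||^2

def margin_constraints(theta):
    w, b = theta[:d], theta[d]
    return Y * (X @ w + b) - 1.0            # must be >= 0 for every i

res = minimize(objective,
               x0=np.zeros(d + 1),
               constraints=[{"type": "ineq", "fun": margin_constraints}],
               method="SLSQP")
w_hat, b_hat = res.x[:d], res.x[d]
print("w =", w_hat, "b =", b_hat, "margin =", 1.0 / np.linalg.norm(w_hat))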
The Lagrangian is
\[
L = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i\left[Y_i(w^T X_i + b) - 1\right],
\]
where $\alpha_i \geq 0$ and $\alpha_i[Y_i(w^T X_i + b) - 1] = 0$. If we set $\nabla_w L = 0$ and $\nabla_b L = 0$ we get the two equations
\[
w = \sum_i \alpha_i Y_i X_i, \qquad 0 = \sum_i \alpha_i Y_i.
\]
If we insert $w = \sum_i \alpha_i Y_i X_i$ into $L$ and use the fact that $\sum_i \alpha_i Y_i = 0$, we get
\[
L = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j Y_i Y_j (X_i^T X_j),
\]
which is maximized over $\alpha$ subject to $\alpha_i \geq 0$ and $\sum_i \alpha_i Y_i = 0$; at the solution, the complementary slackness condition $\alpha_i[Y_i(w^T X_i + b) - 1] = 0$ holds. Note two important facts: (i) this is a quadratic program so it can be solved quickly and (ii) we don't need the $X_i$'s, we only need the inner products $X_i^T X_j$.
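As an illustration of fact (i), here is a minimal sketch that hands the dual to the off-the-shelf QP solver in cvxopt (one possible choice, assumed installed); in line with fact (ii), only the Gram matrix of inner products enters. The toy data are hypothetical.

import numpy as np
from cvxopt import matrix, solvers

# Hypothetical, linearly separable toy data; only the Gram matrix enters the problem.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -3.0]])
Y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(Y)
K = X @ X.T                                   # inner products X_i^T X_j

# cvxopt solves  min (1/2) a^T P a + q^T a  s.t.  G a <= h,  A a = b,
# which matches the (negated) dual with P_ij = Y_i Y_j K_ij and q = -1.
P = matrix(np.outer(Y, Y) * K + 1e-8 * np.eye(n))   # tiny ridge for numerical stability
q = matrix(-np.ones(n))
G = matrix(-np.eye(n))                        # -alpha_i <= 0, i.e. alpha_i >= 0
h = matrix(np.zeros(n))
A = matrix(Y.reshape(1, -1))                  # sum_i alpha_i Y_i = 0
b = matrix(0.0)

solvers.options["show_progress"] = False
alpha = np.ravel(solvers.qp(P, q, G, h, A, b)["x"])
print("alpha =", np.round(alpha, 4))          # alpha_i > 0 only for support vectors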
For any $i$ with $\alpha_i > 0$, complementary slackness gives $Y_i(w^T X_i + b) = 1$, and since $Y_i^2 = 1$ this yields
\[
b = Y_i - \sum_j \alpha_j Y_j X_j^T X_i.
\]
Multiplying this by $\alpha_i Y_i$, summing over $i$, and using $Y_i^2 = 1$, $w = \sum_i \alpha_i Y_i X_i$ and $\sum_i \alpha_i Y_i = 0$, we find that
\[
0 = \sum_i \alpha_i - \|w\|^2.
\]
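Continuing the hypothetical cvxopt sketch above, $w$ and $b$ can be recovered from the dual solution, and the identity $\sum_i \alpha_i = \|w\|^2$ gives a quick sanity check:

# Continues the hypothetical variables X, Y, K, alpha from the sketch above.
support = alpha > 1e-6                        # indices with alpha_i > 0
w = (alpha * Y) @ X                           # w = sum_i alpha_i Y_i X_i
i0 = int(np.argmax(support))                  # any support vector index
b = Y[i0] - np.sum(alpha * Y * K[:, i0])      # b = Y_i - sum_j alpha_j Y_j X_j^T X_i
print(np.sum(alpha), w @ w)                   # these two numbers should (nearly) agree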
The Non-separable Case. Usually, the data are not linearly separable, so we can't assume that $Y_i(w^T X_i + b) \geq 1$. We introduce slack variables $\xi_i \geq 0$ and instead require
\[
Y_i(w^T X_i + b) \geq 1 - \xi_i.
\]
This allows points to be incorrectly classified. It also allows points to be correctly classified but lie inside the margin. We change the optimization problem to
\[
\min_{w,b,\xi} \ \frac{1}{2}\|w\|^2 + C\sum_i \xi_i
\quad \text{subject to} \quad \xi_i \geq 0, \quad Y_i(w^T X_i + b) \geq 1 - \xi_i.
\]
The Lagrangian is
\[
L = \frac{1}{2}\|w\|^2 + C\sum_i \xi_i - \sum_i \alpha_i\left[Y_i(w^T X_i + b) - 1 + \xi_i\right] - \sum_i \beta_i \xi_i.
\]
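Working through the same steps as in the separable case turns the constraint $\alpha_i \geq 0$ into the box constraint $0 \leq \alpha_i \leq C$ (this is the form that reappears below). In practice this is rarely coded by hand; a minimal sketch using scikit-learn's SVC on hypothetical toy data, with an arbitrary choice $C = 1$, is:

import numpy as np
from sklearn.svm import SVC

# Hypothetical, non-separable toy data; C controls the penalty on the slacks.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.5, size=(50, 2)),
               rng.normal(+1.0, 1.5, size=(50, 2))])
Y = np.array([-1] * 50 + [+1] * 50)

clf = SVC(kernel="linear", C=1.0)             # soft-margin linear SVM
clf.fit(X, Y)
print("number of support vectors:", len(clf.support_))
print("w =", clf.coef_.ravel(), "b =", clf.intercept_[0])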
But this bound does not use the structure of SVMs. For this, we turn to margin theory.
Theorem 1 Suppose that the sample space is contained in $\{x : \|x\| \leq r\}$. Let $H$ be the set of hyperplanes satisfying $\|w\| \leq \Lambda$ and $\min_i |w^T X_i| = 1$. Then $\mathrm{VC}(H) \leq r^2 \Lambda^2$.
Proof. Suppose that $\{x_1, \ldots, x_d\}$ can be shattered. Then for every $y \in \{-1,+1\}^d$ there exists $w$ such that $1 \leq y_i(w^T x_i)$ for all $i$. Sum over $i$ to get
\[
d \leq w^T \sum_i y_i x_i \leq \|w\| \, \Big\|\sum_i y_i x_i\Big\| \leq \Lambda \Big\|\sum_i y_i x_i\Big\|.
\]
This holds for all choices of $y_i$. So it holds if $Y_i$ is drawn uniformly over $\{-1,+1\}$. Thus $E[Y_i Y_j] = E[Y_i]E[Y_j] = 0$ for $i \neq j$ and $E[Y_i Y_i] = 1$. So
\[
d \leq \Lambda\, E\Big\|\sum_{i=1}^d Y_i x_i\Big\|
\leq \Lambda \sqrt{E\Big\|\sum_i Y_i x_i\Big\|^2}
= \Lambda \sqrt{\sum_{i,j} E[Y_i Y_j]\, x_i^T x_j}
= \Lambda \sqrt{\sum_i x_i^T x_i}
\leq \Lambda \sqrt{d r^2} = \Lambda r \sqrt{d},
\]
so that $d \leq r^2 \Lambda^2$.
If the data are separable, the hyperplane satisfies $\|w\| = 1/\rho$ so that $\Lambda^2 = 1/\rho^2$ and hence $d \leq r^2/\rho^2$. Plugging this into (1) we get
\[
R(h) \leq \widehat{R}(h) + \sqrt{\frac{2 r^2 \log((e n \rho^2)/r^2)}{n \rho^2}} + \sqrt{\frac{\log(1/\delta)}{2n}}. \tag{2}
\]
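To get a feel for (2), here is a back-of-the-envelope evaluation with purely hypothetical values $r = 1$, $\rho = 0.1$, $n = 10{,}000$ and $\delta = 0.05$:
\[
\sqrt{\frac{2 r^2 \log((e n \rho^2)/r^2)}{n \rho^2}} = \sqrt{\frac{2\log(100e)}{100}} \approx 0.33,
\qquad
\sqrt{\frac{\log(1/\delta)}{2n}} = \sqrt{\frac{\log 20}{20{,}000}} \approx 0.012,
\]
so the excess-risk term is roughly 0.35. Up to the logarithmic factor, the dominant term scales like $\sqrt{r^2/(n\rho^2)}$: a larger margin $\rho$ relative to the radius $r$ gives a tighter bound.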
Given a kernel $K$, we find the SVM by solving
\[
\max_{\alpha} \ \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j Y_i Y_j K(X_i, X_j)
\]
subject to $0 \leq \alpha_i \leq C$ and $\sum_i \alpha_i Y_i = 0$. The classifier is
\[
h(x) = \mathrm{sign}\left(\sum_i \alpha_i Y_i K(X_i, x) + b\right).
\]
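A minimal sketch of this kernelized classifier using scikit-learn's SVC; the RBF kernel, the ring-shaped toy data and all parameter values are hypothetical choices. SVC stores the products $\alpha_i Y_i$ for the support vectors in dual_coef_ and $b$ in intercept_, so the formula above can be reproduced by hand:

import numpy as np
from sklearn.svm import SVC

# Hypothetical data that is not linearly separable: two concentric rings.
rng = np.random.default_rng(1)
angle = rng.uniform(0.0, 2 * np.pi, 200)
radius = np.r_[rng.normal(1.0, 0.1, 100), rng.normal(3.0, 0.1, 100)]
X = np.c_[radius * np.cos(angle), radius * np.sin(angle)]
Y = np.r_[np.ones(100), -np.ones(100)]

clf = SVC(kernel="rbf", C=1.0, gamma=1.0)     # K(x, x') = exp(-gamma ||x - x'||^2)
clf.fit(X, Y)

# Reproduce h(x) = sign( sum_i alpha_i Y_i K(X_i, x) + b ) by hand for one point.
x0 = np.array([0.5, 0.5])
K_sv = np.exp(-clf.gamma * np.sum((clf.support_vectors_ - x0) ** 2, axis=1))
h = np.sign(clf.dual_coef_.ravel() @ K_sv + clf.intercept_[0])
print(h, clf.predict(x0.reshape(1, -1))[0])   # the two values should agree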