
Support Vector Machines

These notes are based on Mohri, Rostamizadeh and Talwalkar (2012).

Some Convex Optimization. Consider
\[
\min_x f(x) \quad \text{subject to} \quad g_i(x) \le 0, \quad i = 1, \dots, m.
\]

Define the Lagrangian
\[
L = f(x) + \sum_j \alpha_j g_j(x).
\]

The dual function is defined by
\[
F(\alpha) = \inf_x L.
\]

A central result in convex optimization is that the original problem can be solved by maximizing $F$ subject to $\alpha_i \ge 0$ and the complementary slackness conditions $\alpha_i g_i(x) = 0$.
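For a concrete (illustrative) example, take $f(x) = x^2$ with the single constraint $g(x) = 1 - x \le 0$. Then $L = x^2 + \alpha(1 - x)$, which is minimized over $x$ at $x = \alpha/2$, so $F(\alpha) = \alpha - \alpha^2/4$. Maximizing over $\alpha \ge 0$ gives $\alpha = 2$ and hence $x = 1$, which is indeed the constrained minimizer, and $\alpha\, g(x) = 0$ as required.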

Hyperplanes and SVM’s. Suppose we have data $(X_1, Y_1), \dots, (X_n, Y_n)$, with $Y_i \in \{-1, +1\}$, that can be separated by a hyperplane. Let $b + w^T x = 0$ be such a hyperplane, so that $Y_i(b + X_i^T w) > 0$ for all $i$. Any re-scaled version of the hyperplane is the same classifier, so re-scale the hyperplane so that
\[
\min_i |b + w^T X_i| = 1.
\]

If $x_0$ is any point then, using some simple algebra, we find that its distance to the hyperplane is
\[
\frac{|b + w^T x_0|}{\|w\|}.
\]
We call the distance to the closest point the margin $\rho$. Since $\min_i |b + w^T X_i| = 1$, we see that
\[
\rho = \min_i \frac{|w^T X_i + b|}{\|w\|} = \frac{1}{\|w\|}.
\]

The support vector machine (SVM) is the hyperplane that maximizes the margin. But maximizing $1/\|w\|$ is the same as minimizing $\|w\|$, which is the same as minimizing $(1/2)\|w\|^2$. So finding the SVM corresponds to:
\[
\min_{w,b} \; \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad Y_i(w^T X_i + b) \ge 1, \quad i = 1, \dots, n.
\]
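For concreteness, here is a minimal sketch of this quadratic program using the cvxpy modeling library (the choice of cvxpy and the function name are ours, not prescribed by the notes). X is an n-by-d array of features, Y an array of ±1 labels, and the data are assumed separable; otherwise the problem is infeasible.

import numpy as np
import cvxpy as cp

def hard_margin_svm(X, Y):
    # Solve min_{w,b} (1/2)||w||^2  subject to  Y_i (w^T X_i + b) >= 1.
    n, d = X.shape
    w = cp.Variable(d)
    b = cp.Variable()
    constraints = [cp.multiply(Y, X @ w + b) >= 1]
    cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints).solve()
    return w.value, b.value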

The Lagrangian for this problem is
\[
L = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i \left[ Y_i(w^T X_i + b) - 1 \right]
\]
where $\alpha_i \ge 0$ and $\alpha_i [Y_i(w^T X_i + b) - 1] = 0$. If we set $\nabla_w L = 0$ and $\nabla_b L = 0$ we get the two equations
\[
w = \sum_i \alpha_i Y_i X_i, \qquad 0 = \sum_i \alpha_i Y_i.
\]
If we insert $w = \sum_i \alpha_i Y_i X_i$ into $L$ and use the fact that $\sum_i \alpha_i Y_i = 0$, we get
\[
L = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j Y_i Y_j (X_i^T X_j).
\]

This leads to the dual optimization
\[
\max_\alpha \; \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j Y_i Y_j (X_i^T X_j)
\]
subject to $\alpha_i \ge 0$ and $\sum_i \alpha_i Y_i = 0$; at the solution, the complementary slackness conditions $\alpha_i [Y_i(w^T X_i + b) - 1] = 0$ hold. Note two important facts: (i) this is a quadratic program so it can be solved quickly and (ii) we don't need the $X_i$'s, we only need the inner products $X_i^T X_j$.
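A sketch of this dual, again using cvxpy (an assumption on our part; svm_dual is an illustrative name). Note that the data enter only through the Gram matrix G with G[i, j] = X_i^T X_j.

import numpy as np
import cvxpy as cp

def svm_dual(G, Y):
    # max_a  sum_i a_i - (1/2) sum_{i,j} a_i a_j Y_i Y_j G_ij   s.t.  a >= 0,  sum_i a_i Y_i = 0.
    n = len(Y)
    a = cp.Variable(n)
    Q = np.outer(Y, Y) * G   # Q_ij = Y_i Y_j X_i^T X_j; positive semidefinite since G is a Gram matrix
    cp.Problem(cp.Maximize(cp.sum(a) - 0.5 * cp.quad_form(a, Q)),
               [a >= 0, Y @ a == 0]).solve()
    return a.value           # the alpha_i; the nonzero entries mark the support vectors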

Consider the condition $\alpha_i [Y_i(w^T X_i + b) - 1] = 0$. If $\alpha_i > 0$ then $Y_i(w^T X_i + b) = 1$, which implies that this point lies on the boundary of the margin. Such a point is called a support vector. On the other hand, if $Y_i(w^T X_i + b) > 1$ then $\alpha_i = 0$. Since $w = \sum_i \alpha_i Y_i X_i$, this means that the hyperplane only depends on the support vectors.

If $(X_i, Y_i)$ is a support vector then $w^T X_i + b = Y_i$. Since $w = \sum_j \alpha_j Y_j X_j$, we see that
\[
b = Y_i - \sum_j \alpha_j Y_j X_j^T X_i.
\]
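Continuing the hypothetical sketch above, $b$ can be recovered from any support vector once the dual solution is available (alpha and G are as used by svm_dual; the tolerance is an arbitrary choice):

import numpy as np

def intercept_from_dual(alpha, G, Y, tol=1e-6):
    # Pick any support vector i (alpha_i > 0) and use  b = Y_i - sum_j alpha_j Y_j X_j^T X_i.
    i = int(np.argmax(alpha > tol))
    return Y[i] - np.sum(alpha * Y * G[:, i])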

Multiply the expression for $b$ by $\alpha_i Y_i$ and sum over $i$ to get
\[
\sum_i \alpha_i Y_i b = \sum_i \alpha_i Y_i^2 - \sum_{i,j} \alpha_i \alpha_j Y_i Y_j (X_i^T X_j).
\]
Since $Y_i^2 = 1$, $w = \sum_i \alpha_i Y_i X_i$ and $\sum_i \alpha_i Y_i = 0$, this implies that
\[
0 = \sum_i \alpha_i - \|w\|^2.
\]

The margin $\rho$ is $1/\|w\|$ so that
\[
\rho^2 = \frac{1}{\|w\|^2} = \frac{1}{\sum_i \alpha_i} = \frac{1}{\|\alpha\|_1}.
\]
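This identity gives a convenient numerical sanity check on a dual solution such as the hypothetical one sketched earlier; the two quantities below should agree.

import numpy as np

def margin_check(alpha, X, Y):
    # rho = 1/||w|| should equal 1/sqrt(sum_i alpha_i) when alpha solves the dual.
    w = (alpha * Y) @ X
    return 1.0 / np.linalg.norm(w), 1.0 / np.sqrt(alpha.sum())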

The Non-separable Case. Usually, the data are not linearly separable, so we can't assume that $Y_i(w^T X_i + b) \ge 1$. We introduce slack variables $\xi_i \ge 0$ and instead require
\[
Y_i(w^T X_i + b) \ge 1 - \xi_i.
\]
This allows points to be incorrectly classified. But it also allows points to be correctly classified yet lie inside the margin. We change the optimization problem to
\[
\min_{w,b,\xi} \; \frac{1}{2}\|w\|^2 + C \sum_i \xi_i
\]

subject to $Y_i(w^T X_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$. The constant $C \ge 0$ controls the amount of slack that is allowed.
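In the same (assumed) cvxpy setup as before, a minimal sketch of this soft-margin primal with explicit slack variables might look like this; soft_margin_svm is an illustrative name.

import cvxpy as cp

def soft_margin_svm(X, Y, C=1.0):
    # min (1/2)||w||^2 + C * sum_i xi_i   s.t.  Y_i (w^T X_i + b) >= 1 - xi_i,  xi_i >= 0.
    n, d = X.shape
    w, b, xi = cp.Variable(d), cp.Variable(), cp.Variable(n)
    objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
    constraints = [cp.multiply(Y, X @ w + b) >= 1 - xi, xi >= 0]
    cp.Problem(objective, constraints).solve()
    return w.value, b.value, xi.value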

The Lagrangian is
\[
L = \frac{1}{2}\|w\|^2 + C \sum_i \xi_i - \sum_i \alpha_i \left[ Y_i(w^T X_i + b) - 1 + \xi_i \right] - \sum_i \beta_i \xi_i.
\]

Setting the derivatives to 0, and using complementary slackness, leads to the conditions
\[
w = \sum_i \alpha_i Y_i X_i, \qquad 0 = \sum_i \alpha_i Y_i, \qquad C = \alpha_i + \beta_i,
\]
\[
\alpha_i = 0 \ \text{ or } \ Y_i(w^T X_i + b) = 1 - \xi_i, \qquad \beta_i = 0 \ \text{ or } \ \xi_i = 0.
\]
When $\alpha_i > 0$ we call $X_i$ a support vector. If $\alpha_i \ne 0$ then
\[
Y_i(w^T X_i + b) = 1 - \xi_i.
\]
If $\xi_i = 0$ then $X_i$ lies on the marginal hyperplane. If $\xi_i \ne 0$ then $\beta_i = 0$, which implies $\alpha_i = C$. In summary, support vectors lie on the marginal hyperplane or have $\alpha_i = C$.

The dual problem has a simple form:
\[
\max_\alpha \; \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j Y_i Y_j X_i^T X_j
\]
subject to $0 \le \alpha_i \le C$ and $\sum_i \alpha_i Y_i = 0$. Again, it is a quadratic program and only involves inner products of the $X_i$.
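Relative to the hard-margin dual sketched earlier, the only change is the box constraint on $\alpha$; a hedged sketch (same caveats as before about cvxpy and the function name):

import numpy as np
import cvxpy as cp

def soft_margin_dual(G, Y, C=1.0):
    # Same objective as the hard-margin dual; only the box constraint 0 <= a_i <= C is new.
    n = len(Y)
    a = cp.Variable(n)
    Q = np.outer(Y, Y) * G
    cp.Problem(cp.Maximize(cp.sum(a) - 0.5 * cp.quad_form(a, Q)),
               [a >= 0, a <= C, Y @ a == 0]).solve()
    return a.value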

Since the VC dimension of hyperplane classifiers in $\mathbb{R}^d$ is $d + 1$, we know that, with probability at least $1 - \delta$,
\[
R(h) \le \hat{R}(h) + \sqrt{\frac{2(d+1)\log\big(en/(d+1)\big)}{n}} + \sqrt{\frac{\log(1/\delta)}{2n}}. \tag{1}
\]

But this bound does not use the structure of SVM’s. For this, we turn to margin theory.

Margins. Recall that the margin is
\[
\rho = \min_i \frac{Y_i(w^T X_i + b)}{\|w\|}.
\]
We can improve the VC bound using the margin.

Theorem 1 Suppose that the sample space is contained in $\{x : \|x\| \le r\}$. Let $H$ be the set of hyperplanes satisfying $\|w\| \le \Lambda$ and $\min_i |w^T x_i| = 1$. Then $VC(H) \le r^2 \Lambda^2$.

Proof. Suppose that $\{x_1, \dots, x_d\}$ can be shattered. Then for every $y \in \{-1, +1\}^d$ there exists $w$ such that $1 \le y_i (w^T x_i)$ for all $i$. Sum over $i$ to get
\[
d \le w^T \sum_i y_i x_i \le \|w\| \, \Big\| \sum_i y_i x_i \Big\| \le \Lambda \, \Big\| \sum_i y_i x_i \Big\|.
\]
This holds for all choices of $y_i$, so it holds if the $Y_i$ are drawn uniformly over $\{-1, +1\}$. Thus $E[Y_i Y_j] = E[Y_i]E[Y_j] = 0$ for $i \ne j$ and $E[Y_i Y_i] = 1$. So
\[
d \le \Lambda \, E\Big\| \sum_{i=1}^d Y_i x_i \Big\| \le \Lambda \sqrt{E\Big\| \sum_i Y_i x_i \Big\|^2} = \Lambda \sqrt{\sum_{i,j} E[Y_i Y_j] \, x_i^T x_j} = \Lambda \sqrt{\sum_i x_i^T x_i} \le \Lambda \sqrt{d r^2} = \Lambda r \sqrt{d},
\]
so that $d \le r^2 \Lambda^2$. $\square$

If the data are separable, the hyperplane satisfies $\|w\| = 1/\rho$, so that $\Lambda^2 = 1/\rho^2$ and hence $d \le r^2/\rho^2$. Plugging this into (1) we get
\[
R(h) \le \hat{R}(h) + \sqrt{\frac{2 r^2 \log(e n \rho^2 / r^2)}{n \rho^2}} + \sqrt{\frac{\log(1/\delta)}{2n}} \tag{2}
\]
which is dimension independent.

Nonparametric SVM’s. We can get a nonparametric SVM using RKHS’s by replacing $x$ with a feature map $\Phi(x)$. Recall that $\Phi(x_1)^T \Phi(x_2) = K(x_1, x_2)$. So we get a nonparametric SVM by solving
\[
\max_\alpha \; \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j Y_i Y_j K(X_i, X_j)
\]
subject to $0 \le \alpha_i \le C$ and $\sum_i \alpha_i Y_i = 0$. The classifier is
\[
h(x) = \mathrm{sign}\Big( \sum_i \alpha_i Y_i K(X_i, x) + b \Big).
\]
This is a nonlinear (nonparametric) classifier.
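A sketch of how this classifier might be assembled from the pieces above, assuming alpha and b come from solving the kernelized dual (for example the hypothetical soft_margin_dual with G = rbf_kernel(X, X)); the Gaussian (RBF) kernel and the function names are our illustrative choices.

import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # One common kernel choice: K(a, b) = exp(-gamma * ||a - b||^2).
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

def kernel_svm_predict(x_new, X, Y, alpha, b, gamma=1.0):
    # h(x) = sign( sum_i alpha_i Y_i K(X_i, x) + b ).
    k = rbf_kernel(X, x_new[None, :], gamma)[:, 0]
    return np.sign(np.sum(alpha * Y * k) + b)

Off-the-shelf implementations (for example sklearn.svm.SVC) solve this same kernelized dual and expose the result through a fit/predict interface.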
