Machine Learning-Kernel Methods

This document summarizes a lecture on support vector machines (SVMs). It discusses three key topics: 1) The separable case, where training examples are linearly separable. The SVM aims to maximize the margin between classes by minimizing a regularization penalty. 2) The non-separable case, where examples are not perfectly separable. The optimization problem is modified to allow for some misclassifications by adding penalty terms. 3) A comparison of SVMs to logistic regression, noting they both aim to minimize a regularized empirical loss function, though SVMs focus on large-margin separation while logistic regression models class probabilities.


Machine learning: lecture 7

Tommi S. Jaakkola
MIT CSAIL
[email protected]

Topics

• Support vector machines
  – separable case, formulation, margin
  – non-separable case, penalties, and logistic regression
  – dual solution, kernels
  – examples, properties

Support vector machine (SVM)

• When the training examples are linearly separable we can maximize a
  geometric notion of margin (distance to the boundary) by minimizing the
  regularization penalty

      \|w_1\|^2/2 = \sum_{i=1}^d w_i^2/2

  subject to the classification constraints

      y_i [w_0 + x_i^T w_1] - 1 \ge 0   for i = 1, \ldots, n.

• The solution is defined only on the basis of a subset of examples, the
  "support vectors".

SVM: separable case

• We minimize \|w_1\|^2/2 = \sum_{i=1}^d w_i^2/2 subject to

      y_i [w_0 + x_i^T w_1] - 1 \ge 0,   i = 1, \ldots, n

  [Figure: separable x's and o's, the decision boundary f(x; \hat{w}) =
  \hat{w}_0 + x^T \hat{w}_1, the normal direction \hat{w}_1, the closest
  examples x^+ and x^-, and margin = 1/\|\hat{w}_1\| (depicted as
  |x^+ - x^-|/2).]

• The resulting margin and the "slope" \|\hat{w}_1\| are inversely related.
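
As a quick numerical illustration of the last bullet (not part of the lecture), here is a minimal numpy sketch on made-up toy data with a hand-picked separator: when the tightest classification constraint y_i [w_0 + x_i^T w_1] - 1 \ge 0 is active, the distance from the closest point to the boundary is exactly 1/\|w_1\|, so a smaller "slope" \|w_1\| means a larger margin.

```python
# Minimal sketch; toy data and the candidate separator are assumptions.
import numpy as np

X = np.array([[1.0, 1.0], [2.0, 3.0], [-1.0, -1.0], [-2.0, -2.0]])
y = np.array([+1, +1, -1, -1])

w0, w1 = 0.0, np.array([0.5, 0.5])           # candidate f(x; w) = w0 + x^T w1

# Classification constraints: y_i [w0 + x_i^T w1] - 1 >= 0 for all i,
# with equality for the closest points.
slack = y * (w0 + X @ w1) - 1
assert np.all(slack >= -1e-12)

# Geometric distance of each point to the boundary f(x; w) = 0.
dist = np.abs(w0 + X @ w1) / np.linalg.norm(w1)
print(dist.min(), 1.0 / np.linalg.norm(w1))  # both ~1.414: margin = 1/||w1||
```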

SVM: non-separable case

• When the examples are not linearly separable we can modify the optimization
  problem slightly to add a penalty for violating the classification
  constraints: we minimize

      \|w_1\|^2/2 + C \sum_{i=1}^n \xi_i

  subject to the relaxed classification constraints

      y_i [w_0 + x_i^T w_1] - 1 + \xi_i \ge 0   for i = 1, \ldots, n.

  Here \xi_i \ge 0 are called "slack" variables.

  [Figure: a non-separable mix of x's and o's around the boundary defined by
  \hat{w}_1.]

SVM: non-separable case cont'd

• We can also write the SVM optimization problem more compactly as

      C \sum_{i=1}^n (1 - y_i [w_0 + x_i^T w_1])^+ + \|w_1\|^2/2

  where (z)^+ = z if z \ge 0 and zero otherwise (i.e., returns the positive
  part). Each positive-part term plays the role of the slack \xi_i.
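
A minimal sketch (the toy data, weights and C below are assumptions) of how the two formulations on these slides line up: for fixed w the smallest feasible slack is \xi_i = (1 - y_i [w_0 + x_i^T w_1])^+, so the slack-based objective and the compact hinge form coincide.

```python
# Minimal sketch; data, weights and C are made-up, not an SVM solution.
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.0], [0.5, -0.5]])
y = np.array([+1, +1, -1, -1])
w0, w1, C = 0.1, np.array([0.6, 0.4]), 1.0

f = w0 + X @ w1
xi = np.maximum(0.0, 1.0 - y * f)            # smallest feasible slacks

slack_form = 0.5 * w1 @ w1 + C * xi.sum()
hinge_form = C * np.maximum(0.0, 1.0 - y * f).sum() + 0.5 * w1 @ w1
print(np.isclose(slack_form, hinge_form))    # True
```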


• The compact hinge-loss form above is equivalent to regularized empirical
  loss minimization

      \frac{1}{n} \sum_{i=1}^n (1 - y_i [w_0 + x_i^T w_1])^+ + \lambda \|w_1\|^2/2

  where \lambda = 1/(nC) is the regularization parameter.

SVM vs logistic regression

• When viewed from the point of view of regularized empirical loss
  minimization, SVM and logistic regression appear quite similar:

  SVM:       \frac{1}{n} \sum_{i=1}^n (1 - y_i [w_0 + x_i^T w_1])^+ + \lambda \|w_1\|^2/2

  Logistic:  \frac{1}{n} \sum_{i=1}^n -\log g(y_i [w_0 + x_i^T w_1]) + \lambda \|w_1\|^2/2

  where g(z) = (1 + \exp(-z))^{-1} is the logistic function, so that
  -\log g(y_i [w_0 + x_i^T w_1]) = -\log P(y_i | x_i, w).

  (Note that we have transformed the problem of maximizing the penalized
  log-likelihood into minimizing the negative penalized log-likelihood.)
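
A minimal sketch (the toy data, weights and \lambda are assumptions) that simply evaluates the two regularized empirical losses above side by side; the only difference is the per-example loss applied to z_i = y_i [w_0 + x_i^T w_1].

```python
# Minimal sketch; inputs are made-up, only the two objectives come from the slide.
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.0], [0.5, -0.5]])
y = np.array([+1, +1, -1, -1])
w0, w1, lam = 0.0, np.array([0.6, 0.4]), 0.1

z = y * (w0 + X @ w1)                        # z_i = y_i [w0 + x_i^T w1]
reg = lam * 0.5 * w1 @ w1

svm_obj = np.mean(np.maximum(0.0, 1.0 - z)) + reg   # hinge loss
log_obj = np.mean(np.log1p(np.exp(-z))) + reg       # -log g(z) = log(1 + e^{-z})
print(svm_obj, log_obj)
```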

SVM vs logistic regression cont'd

• The difference comes from how we penalize "errors". Both methods minimize

      \frac{1}{n} \sum_{i=1}^n \text{Loss}(y_i [w_0 + x_i^T w_1]) + \lambda \|w_1\|^2/2

  with z = y_i [w_0 + x_i^T w_1] as the argument of the loss.

• SVM:   \text{Loss}(z) = (1 - z)^+

• Regularized logistic regression:   \text{Loss}(z) = \log(1 + \exp(-z))

  [Figure: the SVM loss and the LR loss plotted as functions of z over roughly
  -4 \le z \le 4.]
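
To reproduce the plotted comparison as numbers rather than a figure, here is a minimal sketch tabulating the two losses over a grid of z values (the grid itself is an arbitrary choice):

```python
# Minimal sketch: hinge loss (1 - z)^+ vs logistic loss log(1 + exp(-z)).
import numpy as np

for z in np.linspace(-4, 4, 9):
    hinge = max(0.0, 1.0 - z)
    logistic = np.log1p(np.exp(-z))
    print(f"z = {z:+.1f}   hinge = {hinge:.3f}   logistic = {logistic:.3f}")
```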

SVM: solution, Lagrange multipliers

• Back to the separable case: how do we solve

      \min \|w_1\|^2/2   subject to
      y_i [w_0 + x_i^T w_1] - 1 \ge 0,   i = 1, \ldots, n ?

• Let's start by representing the constraints as losses

      \max_{\alpha \ge 0} \alpha (1 - y_i [w_0 + x_i^T w_1]) =
          \begin{cases} 0, & y_i [w_0 + x_i^T w_1] - 1 \ge 0 \\ \infty, & \text{otherwise} \end{cases}

  and rewrite the minimization problem in terms of these:

      \min_{w} \Big[ \|w_1\|^2/2 + \sum_{i=1}^n \max_{\alpha_i \ge 0} \alpha_i (1 - y_i [w_0 + x_i^T w_1]) \Big]

      = \min_{w} \max_{\{\alpha_i \ge 0\}} \Big[ \|w_1\|^2/2 + \sum_{i=1}^n \alpha_i (1 - y_i [w_0 + x_i^T w_1]) \Big]
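
A minimal numeric illustration (the numbers are assumptions) of the inner maximization used above: when the constraint holds, \max_{\alpha \ge 0} \alpha (1 - y_i [w_0 + x_i^T w_1]) is attained at \alpha = 0 and equals 0; when it is violated, the term can be made arbitrarily large, which is what enforces the constraint.

```python
# Minimal sketch: sweep alpha over increasing values instead of taking a true sup.
import numpy as np

def inner_term(violation, alphas=np.array([0.0, 1.0, 10.0, 1e3, 1e6])):
    # violation = 1 - y_i [w0 + x_i^T w1]
    return np.max(alphas * violation)

print(inner_term(violation=-0.5))   # constraint satisfied -> 0 (at alpha = 0)
print(inner_term(violation=+0.5))   # constraint violated  -> grows without bound
```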


SVM solution cont'd

• We can then swap 'max' and 'min':

      \min_{w} \max_{\{\alpha_i \ge 0\}} \Big[ \|w_1\|^2/2 + \sum_{i=1}^n \alpha_i (1 - y_i [w_0 + x_i^T w_1]) \Big]

      \stackrel{?}{=} \max_{\{\alpha_i \ge 0\}} \min_{w} \Big[ \|w_1\|^2/2 + \sum_{i=1}^n \alpha_i (1 - y_i [w_0 + x_i^T w_1]) \Big]

  where we call the bracketed expression J(w; \alpha). As a result we have to
  be able to minimize J(w; \alpha) with respect to the parameters w for any
  fixed setting of the Lagrange multipliers \alpha_i \ge 0.

SVM solution cont'd

• We can find the optimal \hat{w} as a function of \{\alpha_i\} by setting the
  derivatives to zero:

      \frac{\partial}{\partial w_1} J(w; \alpha) = w_1 - \sum_{i=1}^n \alpha_i y_i x_i = 0

      \frac{\partial}{\partial w_0} J(w; \alpha) = -\sum_{i=1}^n \alpha_i y_i = 0

• We can then substitute this solution back into the objective and get (after
  some algebra):

      \max_{\alpha_i \ge 0,\ \sum_i \alpha_i y_i = 0} \Big[ \|\hat{w}_1\|^2/2 + \sum_{i=1}^n \alpha_i (1 - y_i [\hat{w}_0 + x_i^T \hat{w}_1]) \Big]

      = \max_{\alpha_i \ge 0,\ \sum_i \alpha_i y_i = 0} \Big[ \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j=1}^n y_i y_j \alpha_i \alpha_j (x_i^T x_j) \Big]
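
A minimal sketch (random data, an arbitrary w_0, and a randomly chosen feasible \alpha are assumptions) checking the "after some algebra" step numerically: once w_1 = \sum_i \alpha_i y_i x_i and \sum_i \alpha_i y_i = 0 are imposed, J(w; \alpha) equals the dual expression and no longer depends on w_0.

```python
# Minimal sketch verifying the substitution step on random data.
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 3
X = rng.normal(size=(n, d))
y = np.array([+1, +1, +1, -1, -1, -1])

alpha = rng.uniform(size=n)
alpha -= y * (alpha @ y) / n        # enforce sum_i alpha_i y_i = 0 (since y_i^2 = 1)
# (the projection may leave some alpha_i negative; the identity below only
#  needs the equality constraint, so that is fine for a numeric check)

w1 = (alpha * y) @ X                # stationarity: w1 = sum_i alpha_i y_i x_i
w0 = 0.37                           # arbitrary; drops out since alpha @ y = 0

J = 0.5 * w1 @ w1 + np.sum(alpha * (1.0 - y * (w0 + X @ w1)))
Q = (y[:, None] * X) @ (y[:, None] * X).T        # Q_ij = y_i y_j x_i^T x_j
dual = alpha.sum() - 0.5 * alpha @ Q @ alpha
print(np.isclose(J, dual))          # True
```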

SVM solution: summary

• We can find the optimal setting of the Lagrange multipliers \alpha_i by
  maximizing

      \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j=1}^n y_i y_j \alpha_i \alpha_j (x_i^T x_j)

  subject to \alpha_i \ge 0 and \sum_i \alpha_i y_i = 0. Only the \alpha_i's
  corresponding to "support vectors" will be non-zero.

• We can make predictions on any new example x according to the sign of the
  discriminant function

      \hat{w}_0 + x^T \hat{w}_1 = \hat{w}_0 + x^T \Big( \sum_{i=1}^n \hat{\alpha}_i y_i x_i \Big) = \hat{w}_0 + \sum_{i \in SV} \hat{\alpha}_i y_i (x^T x_i)
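
The slides do not prescribe a particular optimizer, so here is a minimal end-to-end sketch using scipy's generic SLSQP solver (an assumption, not the lecture's method) on made-up separable data: maximize the dual, read off \hat{w}_1 = \sum_i \hat{\alpha}_i y_i x_i, recover \hat{w}_0 from an active constraint y_i [\hat{w}_0 + x_i^T \hat{w}_1] = 1 at a support vector (a standard step not spelled out on the slide), and predict with the sign of the discriminant.

```python
# Minimal sketch; the data are toy and the solver choice is an assumption.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [2.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)
Q = (y[:, None] * X) @ (y[:, None] * X).T        # Q_ij = y_i y_j x_i^T x_j

def neg_dual(alpha):                             # minimize the negative dual
    return -(alpha.sum() - 0.5 * alpha @ Q @ alpha)

res = minimize(neg_dual, x0=np.zeros(n), method="SLSQP",
               bounds=[(0.0, None)] * n,                              # alpha_i >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])  # sum_i alpha_i y_i = 0
alpha = res.x

sv = alpha > 1e-6                                # support vectors
w1 = (alpha * y) @ X
w0 = y[sv][0] - X[sv][0] @ w1                    # active constraint at a support vector

x_new = np.array([1.0, 0.5])
f = w0 + np.sum(alpha[sv] * y[sv] * (X[sv] @ x_new))   # sum over support vectors
print(np.sign(f), f, w0 + x_new @ w1)            # the two discriminant forms agree
```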

Non-linear classifier

• So far our classifier can make only linear separations.

• As with linear regression and logistic regression models, we can easily
  obtain a non-linear classifier by first mapping our examples x = [x_1\; x_2]
  into longer feature vectors

      \phi(x) = [\, x_1^2 \;\; x_2^2 \;\; \sqrt{2}\, x_1 x_2 \;\; \sqrt{2}\, x_1 \;\; \sqrt{2}\, x_2 \;\; 1 \,]

  and then applying the linear classifier to the new feature vectors \phi(x).

  [Figure: a linear separator in the feature \phi-space corresponds to a
  non-linear separator in the original x-space.]
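
A minimal sketch (the input point is an assumption) of the feature map above; the SVM, or any linear classifier, is then applied to \phi(x) instead of x.

```python
# Minimal sketch of the quadratic feature map from the slide.
import numpy as np

def phi(x):
    """Map x = [x1, x2] to [x1^2, x2^2, sqrt(2) x1 x2, sqrt(2) x1, sqrt(2) x2, 1]."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([x1 ** 2, x2 ** 2, s * x1 * x2, s * x1, s * x2, 1.0])

print(phi(np.array([0.5, -1.0])))   # a 6-dimensional feature vector
```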

Feature mapping and kernels

• Let's look at the previous example in a bit more detail:

      x \to \phi(x) = [\, x_1^2 \;\; x_2^2 \;\; \sqrt{2}\, x_1 x_2 \;\; \sqrt{2}\, x_1 \;\; \sqrt{2}\, x_2 \;\; 1 \,]

• The SVM classifier deals only with inner products of examples (or feature
  vectors). In this example,

      \phi(x)^T \phi(x') = x_1^2 x_1'^2 + x_2^2 x_2'^2 + 2 x_1 x_2 x_1' x_2' + 2 x_1 x_1' + 2 x_2 x_2' + 1
                         = (1 + x_1 x_1' + x_2 x_2')^2
                         = (1 + (x^T x'))^2

  so the inner products can be evaluated without ever explicitly constructing
  the feature vectors \phi(x)!

• K(x, x') = (1 + (x^T x'))^2 is a kernel function (an inner product in the
  feature space).

Examples of kernel functions

• Linear kernel

      K(x, x') = (x^T x')

• Polynomial kernel

      K(x, x') = (1 + (x^T x'))^p

  where p = 2, 3, \ldots. To get the feature vectors we concatenate all terms
  up to pth order polynomial terms of the components of x (weighted
  appropriately).

• Radial basis kernel

      K(x, x') = \exp\big(-\tfrac{1}{2} \|x - x'\|^2\big)

  In this case the feature space is an infinite-dimensional function space
  (use of the kernel results in a non-parametric classifier).
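
A minimal sketch (random 2-D inputs are assumptions) checking the identity \phi(x)^T \phi(x') = (1 + x^T x')^2 numerically, together with the three kernels listed above written as plain functions.

```python
# Minimal sketch: the kernel trick checked numerically for the quadratic map.
import numpy as np

def phi(x):                              # quadratic feature map from the slides
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([x1 ** 2, x2 ** 2, s * x1 * x2, s * x1, s * x2, 1.0])

def k_linear(x, xp):
    return x @ xp

def k_poly(x, xp, p=2):
    return (1.0 + x @ xp) ** p

def k_rbf(x, xp):
    return np.exp(-0.5 * np.sum((x - xp) ** 2))

rng = np.random.default_rng(1)
x, xp = rng.normal(size=2), rng.normal(size=2)
print(np.isclose(phi(x) @ phi(xp), k_poly(x, xp, p=2)))   # True: no explicit phi needed
```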


SVM examples

  [Figure: decision boundaries on the same 2-D dataset for SVMs with a linear
  kernel and with 2nd, 4th, and 8th order polynomial kernels; axes span
  roughly -1.5 to 2.]
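
The slides do not say how these plots were produced; as one possible way to reproduce boundaries of this kind, here is a minimal sketch using scikit-learn's SVC (an assumption, not the lecture's code) with polynomial kernels of increasing degree on made-up 2-D data. Setting gamma=1 and coef0=1 makes scikit-learn's polynomial kernel match (1 + x^T x')^p as defined on the kernel slide.

```python
# Minimal sketch; the dataset and hyperparameters are arbitrary choices.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(-1.5, 2.0, size=(40, 2))                  # toy 2-D inputs
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)    # non-linear labels

for degree in [1, 2, 4, 8]:
    clf = SVC(kernel="poly", degree=degree, gamma=1.0, coef0=1.0, C=10.0)
    clf.fit(X, y)
    print(degree, "training accuracy:", clf.score(X, y))
```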
