
Lecture Slides for

INTRODUCTION TO

Machine Learning
2nd Edition

ETHEM ALPAYDIN
© The MIT Press, 2010
[email protected]
http://www.cmpe.boun.edu.tr/~ethem/i2ml2e
CHAPTER 13:

Kernel Machines
Outline
Introduction
Optimal Separating Hyperplane
The Non-Separable Case: Soft Margin Hyperplane
ν-SVM
Kernel Trick
Vectorial Kernels
Defining Kernels
Introduction
Kernel machines are maximum margin methods that allow the model to be written as a sum of the influences of a subset of the training instances. These influences are given by application-specific similarity kernels, and we discuss "kernelized" classification, regression, ranking, outlier detection and dimensionality reduction, and how to choose and use kernels.
We now discuss a different approach for linear classification and regression. We should not be surprised to have so many different methods even for the simple case of a linear model. Each learning algorithm has a different inductive bias, makes different assumptions, and defines a different objective function, and thus may find a different linear model.
The model that we will discuss in this chapter, called the support vector machine (SVM), and later generalized under the name kernel machine, has been popular in recent years for a number of reasons:

1. It is a discriminant-based method and uses Vapnik's principle to never solve a more complex problem as a first step before the actual problem (Vapnik 1995).

2. After training, the parameter of the linear model, the weight vector, can be written down in terms of a subset of the training set, the so-called support vectors.

3. The output is written as a sum of the influences of the support vectors, and these are given by kernel functions that are application-specific measures of similarity between data instances.
In most learning algorithms, data points are represented as vectors, and either the dot product (as in multilayer perceptrons) or the Euclidean distance (as in radial basis function networks) is used.

4. A kernel function allows us to go beyond that. For example, G1 and G2 may be two graphs and K(G1, G2) may correspond to the number of shared paths, which we can calculate without needing to represent G1 or G2 explicitly as vectors.

5. Kernel-based algorithms are formulated as convex optimization problems, and there is a single optimum that we can solve for analytically.
Kernel Machines
Discriminant-based: No need to estimate densities first
Define the discriminant in terms of support vectors
The use of kernel functions, application-specific
measures of similarity
No need to represent instances as vectors
Convex optimization problems with a unique solution

Optimal Separating Hyperplane
Given a sample $\mathcal{X} = \{x^t, r^t\}$ where

$$r^t = \begin{cases} +1 & \text{if } x^t \in C_1 \\ -1 & \text{if } x^t \in C_2 \end{cases}$$

find $\mathbf{w}$ and $w_0$ such that

$$\mathbf{w}^T x^t + w_0 \ge +1 \quad \text{for } r^t = +1$$
$$\mathbf{w}^T x^t + w_0 \le -1 \quad \text{for } r^t = -1$$

which can be rewritten as

$$r^t(\mathbf{w}^T x^t + w_0) \ge +1$$

(Cortes and Vapnik, 1995; Vapnik, 1995)
Margin
The margin is the distance from the discriminant to the closest instances on either side. The distance of $x^t$ to the hyperplane is

$$\frac{|\mathbf{w}^T x^t + w_0|}{\lVert\mathbf{w}\rVert}$$

We require

$$\frac{r^t(\mathbf{w}^T x^t + w_0)}{\lVert\mathbf{w}\rVert} \ge \rho, \quad \forall t$$

For a unique solution, fix $\rho\lVert\mathbf{w}\rVert = 1$; then, to maximize the margin,

$$\min \frac{1}{2}\lVert\mathbf{w}\rVert^2 \quad \text{subject to} \quad r^t(\mathbf{w}^T x^t + w_0) \ge +1, \; \forall t$$
Figure (margin): For a two-class problem where the instances of the classes are shown by plus signs and dots, the thick line is the boundary and the dashed lines define the margins on either side. Circled instances are the support vectors.
$$\min \frac{1}{2}\lVert\mathbf{w}\rVert^2 \quad \text{subject to} \quad r^t(\mathbf{w}^T x^t + w_0) \ge +1, \; \forall t$$

The primal Lagrangian is

$$L_p = \frac{1}{2}\lVert\mathbf{w}\rVert^2 - \sum_{t=1}^{N} \alpha^t \left[ r^t(\mathbf{w}^T x^t + w_0) - 1 \right]$$
$$\quad\; = \frac{1}{2}\lVert\mathbf{w}\rVert^2 - \sum_{t=1}^{N} \alpha^t r^t (\mathbf{w}^T x^t + w_0) + \sum_{t=1}^{N} \alpha^t$$

Setting the derivatives to zero:

$$\frac{\partial L_p}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_t \alpha^t r^t x^t$$
$$\frac{\partial L_p}{\partial w_0} = 0 \;\Rightarrow\; \sum_t \alpha^t r^t = 0$$
Substituting back gives the dual:

$$L_d = \frac{1}{2}\mathbf{w}^T\mathbf{w} - \mathbf{w}^T \sum_t \alpha^t r^t x^t - w_0 \sum_t \alpha^t r^t + \sum_t \alpha^t$$
$$\quad\; = -\frac{1}{2}\mathbf{w}^T\mathbf{w} + \sum_t \alpha^t$$
$$\quad\; = -\frac{1}{2}\sum_t \sum_s \alpha^t \alpha^s r^t r^s (x^t)^T x^s + \sum_t \alpha^t$$

subject to

$$\sum_t \alpha^t r^t = 0 \quad \text{and} \quad \alpha^t \ge 0, \; \forall t$$

Most $\alpha^t$ are 0 and only a small number have $\alpha^t > 0$; the corresponding instances are the support vectors.
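As a minimal sketch (scikit-learn and the toy data below are assumptions, not part of the slides), a linear SVC with a very large C approximates this hard-margin problem; its dual_coef_ attribute stores the nonzero $\alpha^t r^t$, so the weight vector can be recovered as $\mathbf{w} = \sum_t \alpha^t r^t x^t$.

```python
# Minimal sketch: recover w = sum_t alpha^t r^t x^t from the dual solution
# of a near hard-margin linear SVM (scikit-learn assumed, toy data made up).
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + [2, 2],     # class +1
               rng.randn(20, 2) - [2, 2]])    # class -1
r = np.array([+1] * 20 + [-1] * 20)

svm = SVC(kernel="linear", C=1e6).fit(X, r)   # very large C ~ hard margin

# dual_coef_ holds alpha^t * r^t for the support vectors only
w = svm.dual_coef_ @ svm.support_vectors_
print(np.allclose(w, svm.coef_))              # True: same weight vector
print("support vectors:", len(svm.support_), "of", len(X))
```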
Soft Margin Hyperplane
When the data are not linearly separable, introduce slack variables $\xi^t \ge 0$ and require

$$r^t(\mathbf{w}^T x^t + w_0) \ge 1 - \xi^t$$

Soft error:

$$\sum_t \xi^t$$

The new primal is

$$L_p = \frac{1}{2}\lVert\mathbf{w}\rVert^2 + C\sum_t \xi^t - \sum_t \alpha^t\left[r^t(\mathbf{w}^T x^t + w_0) - 1 + \xi^t\right] - \sum_t \mu^t \xi^t$$
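A rough illustration (again assuming scikit-learn, whose SVC implements a soft-margin formulation of this kind): the penalty C trades margin width against slack, so a small C tolerates more violations and typically yields more support vectors.

```python
# Sketch: effect of the soft-margin penalty C on the number of support vectors
# (scikit-learn assumed; overlapping toy classes generated here).
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(1)
X = np.vstack([rng.randn(50, 2) + [1, 1],     # overlapping classes
               rng.randn(50, 2) - [1, 1]])
r = np.array([+1] * 50 + [-1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, r)
    print(f"C={C:>6}: {len(clf.support_)} support vectors")
```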
Hinge Loss

$$L_{\text{hinge}}(y^t, r^t) = \begin{cases} 0 & \text{if } y^t r^t \ge 1 \\ 1 - y^t r^t & \text{otherwise} \end{cases}$$
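A one-line NumPy version of this loss (purely illustrative, not from the slides):

```python
# Hinge loss: zero when y^t r^t >= 1, linear penalty 1 - y^t r^t otherwise.
import numpy as np

def hinge_loss(y, r):
    return np.maximum(0.0, 1.0 - y * r)

print(hinge_loss(np.array([2.0, 0.5, -1.0]), np.array([1, 1, 1])))  # [0.  0.5 2. ]
```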
ν-SVM

$$\min \frac{1}{2}\lVert\mathbf{w}\rVert^2 - \nu\rho + \frac{1}{N}\sum_t \xi^t$$

subject to

$$r^t(\mathbf{w}^T x^t + w_0) \ge \rho - \xi^t, \quad \xi^t \ge 0, \quad \rho \ge 0$$

The dual is

$$L_d = -\frac{1}{2}\sum_{t}\sum_{s} \alpha^t \alpha^s r^t r^s (x^t)^T x^s$$

subject to

$$\sum_t \alpha^t r^t = 0, \quad 0 \le \alpha^t \le \frac{1}{N}, \quad \sum_t \alpha^t \ge \nu$$

ν controls the fraction of support vectors.
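A sketch assuming scikit-learn's NuSVC, which exposes this formulation directly: ν upper-bounds the fraction of margin errors and lower-bounds the fraction of support vectors.

```python
# Sketch: nu-SVM via scikit-learn's NuSVC; larger nu -> more support vectors.
import numpy as np
from sklearn.svm import NuSVC

rng = np.random.RandomState(2)
X = np.vstack([rng.randn(50, 2) + [1, 1], rng.randn(50, 2) - [1, 1]])
r = np.array([+1] * 50 + [-1] * 50)

for nu in (0.1, 0.3, 0.6):
    clf = NuSVC(nu=nu, kernel="linear").fit(X, r)
    print(f"nu={nu}: fraction of support vectors = {len(clf.support_) / len(X):.2f}")
```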
Kernel Trick
Preprocess the input x by basis functions:

$$z = \varphi(x), \quad g(z) = \mathbf{w}^T z, \quad g(x) = \mathbf{w}^T \varphi(x)$$

The SVM solution is

$$\mathbf{w} = \sum_t \alpha^t r^t z^t = \sum_t \alpha^t r^t \varphi(x^t)$$

so

$$g(x) = \mathbf{w}^T \varphi(x) = \sum_t \alpha^t r^t \varphi(x^t)^T \varphi(x)$$
$$g(x) = \sum_t \alpha^t r^t K(x^t, x)$$
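The last line is exactly what a fitted kernel SVM evaluates. As a sketch (scikit-learn assumed; data and gamma made up), one can reproduce the decision function by hand from the stored $\alpha^t r^t$ values and the kernel:

```python
# Sketch: reproduce g(x) = sum_t alpha^t r^t K(x^t, x) + w_0 from a fitted SVC
# (dual_coef_ stores alpha^t r^t for the support vectors).
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.RandomState(3)
X = np.vstack([rng.randn(40, 2) + [1, 1], rng.randn(40, 2) - [1, 1]])
r = np.array([+1] * 40 + [-1] * 40)

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, r)

K = rbf_kernel(clf.support_vectors_, X, gamma=gamma)     # K(x^t, x), support vectors only
g = clf.dual_coef_ @ K + clf.intercept_                  # sum_t alpha^t r^t K(x^t, x) + w_0
print(np.allclose(g.ravel(), clf.decision_function(X)))  # True
```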
•The kernel trick: nonlinear data are implicitly projected into a higher-dimensional space in which they can be separated linearly by a hyperplane.
•Mathematically, this works through the Lagrangian dual formulation: there, the data appear only through dot products, which can be replaced by kernel evaluations without ever computing the mapping explicitly.
•Kernel: a kernel corresponds to mapping the low-dimensional input space into a higher-dimensional feature space in which the decision boundary becomes linear. (In simple terms, the kernel function K(x, y) returns the dot product of the mapped instances, φ(x)^T φ(y), computed directly in the original space.)
Advantages of Support Vector Machine

1. Training of the model is relatively easy.
2. The model scales relatively well to high-dimensional data.
3. SVM is a useful alternative to neural networks.
4. The trade-off between classifier complexity and error can be controlled explicitly.
5. It is useful for both linearly separable and non-linearly separable data.
6. Assured optimality: the solution is guaranteed to be the global optimum because the training problem is convex.
Disadvantages of Support Vector Machine

1. Picking the right kernel and its parameters can be computationally intensive.
2. In natural language processing (NLP), structured representations of text (e.g., word embeddings) often yield better performance, but a standard SVM operates on fixed-length feature vectors and does not directly accommodate such structures.
Vectorial Kernels
Polynomial kernel of degree q:

$$K(x^t, x) = (x^T x^t + 1)^q$$

For example, with q = 2 and two-dimensional inputs:

$$K(x, y) = (x^T y + 1)^2 = (x_1 y_1 + x_2 y_2 + 1)^2$$
$$= 1 + 2x_1 y_1 + 2x_2 y_2 + 2x_1 x_2 y_1 y_2 + x_1^2 y_1^2 + x_2^2 y_2^2$$

which corresponds to the basis functions

$$\varphi(x) = \left[1,\; \sqrt{2}x_1,\; \sqrt{2}x_2,\; \sqrt{2}x_1 x_2,\; x_1^2,\; x_2^2\right]^T$$
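A quick numerical check (not from the slides; example vectors made up) that the degree-2 polynomial kernel equals the dot product of the basis expansion φ given above:

```python
# Verify (x.y + 1)^2 == phi(x).phi(y) for the expansion listed above.
import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([1, np.sqrt(2)*x1, np.sqrt(2)*x2, np.sqrt(2)*x1*x2, x1**2, x2**2])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print((x @ y + 1) ** 2, phi(x) @ phi(y))   # both 4.0
```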
Vectorial Kernels
Radial basis function (Gaussian) kernel:

$$K(x^t, x) = \exp\left(-\frac{\lVert x^t - x\rVert^2}{2s^2}\right)$$
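A minimal sketch (not from the slides) of the Gaussian kernel matrix for a data set, with s playing the role of the spread parameter in the formula above:

```python
# Gaussian (RBF) kernel matrix between the rows of X and Y.
import numpy as np

def gaussian_kernel(X, Y, s=1.0):
    # squared Euclidean distances between all pairs of rows
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * s ** 2))

X_demo = np.random.randn(5, 2)
K = gaussian_kernel(X_demo, X_demo, s=2.0)
print(K.shape, np.allclose(np.diag(K), 1.0))   # (5, 5) True
```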
Defining kernels
Kernel “engineering”
Defining good measures of similarity
String kernels, graph kernels, image kernels, ...
Empirical kernel map: define a set of templates $m_i$ and a score function $s(x, m_i)$:

$$\varphi(x^t) = \left[s(x^t, m_1),\, s(x^t, m_2),\, \ldots,\, s(x^t, m_M)\right]^T$$

and

$$K(x, x^t) = \varphi(x)^T \varphi(x^t)$$
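A sketch of the empirical kernel map (illustrative only; the templates and score function here are made-up assumptions): map each instance to its vector of scores against M templates, then take dot products of those vectors as the kernel.

```python
# Empirical kernel map: K(x, x') = phi(x).phi(x') with phi(x) = [s(x, m_1), ..., s(x, m_M)].
import numpy as np

def empirical_kernel(X, templates, score):
    Phi = np.array([[score(x, m) for m in templates] for x in X])  # N x M score matrix
    return Phi @ Phi.T

templates = np.random.randn(3, 2)                  # M = 3 template points (assumption)
score = lambda x, m: np.exp(-np.sum((x - m)**2))   # any similarity score works here
X_demo = np.random.randn(6, 2)
print(empirical_kernel(X_demo, templates, score).shape)   # (6, 6)
```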
Multiple Kernel Learning
Fixed kernel combinations:

$$K(x, y) = cK_1(x, y), \quad c > 0$$
$$K(x, y) = K_1(x, y) + K_2(x, y)$$
$$K(x, y) = K_1(x, y)\,K_2(x, y)$$

Adaptive kernel combination:

$$K(x, y) = \sum_{i=1}^{m} \eta_i K_i(x, y)$$
$$L_d = \sum_t \alpha^t - \frac{1}{2}\sum_t \sum_s \alpha^t \alpha^s r^t r^s \sum_i \eta_i K_i(x^t, x^s)$$
$$g(x) = \sum_t \alpha^t r^t \sum_i \eta_i K_i(x^t, x)$$

Localized kernel combination:

$$g(x) = \sum_t \alpha^t r^t \sum_i \eta_i(x|\theta)\,K_i(x^t, x)$$
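A sketch of a fixed sum-of-kernels combination (scikit-learn assumed; the choice of a linear plus a Gaussian kernel, the gamma value, and the toy data are arbitrary): SVC accepts a callable kernel that returns the Gram matrix.

```python
# Fixed kernel combination K = K1 + K2, passed to SVC as a callable kernel.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import linear_kernel, rbf_kernel

def combined_kernel(A, B):
    return linear_kernel(A, B) + rbf_kernel(A, B, gamma=0.5)   # K1 + K2

rng = np.random.RandomState(4)
X = np.vstack([rng.randn(40, 2) + [1, 1], rng.randn(40, 2) - [1, 1]])
r = np.array([+1] * 40 + [-1] * 40)

clf = SVC(kernel=combined_kernel, C=1.0).fit(X, r)
print(clf.score(X, r))
```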
Multiclass Kernel Machines
1-vs-all
Pairwise separation
Error-Correcting Output Codes
Single multiclass optimization
$$\min \frac{1}{2}\sum_{i=1}^{K}\lVert\mathbf{w}_i\rVert^2 + C\sum_i \sum_t \xi_i^t$$

subject to

$$\mathbf{w}_{z^t}^T x^t + w_{z^t 0} \ge \mathbf{w}_i^T x^t + w_{i0} + 2 - \xi_i^t, \quad \forall i \ne z^t, \quad \xi_i^t \ge 0$$
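A sketch of the first two strategies listed above (scikit-learn and the iris data set are assumptions): SVC handles multiclass problems by pairwise (one-vs-one) separation, and wrapping it in OneVsRestClassifier gives the 1-vs-all scheme instead.

```python
# One-vs-one (SVC default for multiclass) versus one-vs-all (OneVsRestClassifier).
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
ovo = SVC(kernel="rbf").fit(X, y)                       # pairwise separation
ova = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)  # 1-vs-all
print(ovo.score(X, y), ova.score(X, y))
```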
SVM for Regression
Use a linear model (possibly kernelized):

$$f(x) = \mathbf{w}^T x + w_0$$

Use the ε-sensitive error function:

$$e_\epsilon(r^t, f(x^t)) = \begin{cases} 0 & \text{if } |r^t - f(x^t)| < \epsilon \\ |r^t - f(x^t)| - \epsilon & \text{otherwise} \end{cases}$$

$$\min \frac{1}{2}\lVert\mathbf{w}\rVert^2 + C\sum_t (\xi_+^t + \xi_-^t)$$

subject to

$$r^t - (\mathbf{w}^T x^t + w_0) \le \epsilon + \xi_+^t$$
$$(\mathbf{w}^T x^t + w_0) - r^t \le \epsilon + \xi_-^t$$
$$\xi_+^t,\, \xi_-^t \ge 0$$
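A sketch assuming scikit-learn's SVR, which implements this ε-sensitive formulation; only the points outside the ε-tube become support vectors.

```python
# Support vector regression on a noisy sine curve (toy data made up here).
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
Xr = np.sort(rng.uniform(-3, 3, (60, 1)), axis=0)
yr = np.sin(Xr).ravel() + 0.1 * rng.randn(60)

reg = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(Xr, yr)
print("support vectors:", len(reg.support_), "out of", len(Xr))
```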
Kernel Regression

Figure: kernel regression fits obtained with a polynomial kernel and with a Gaussian kernel.
One-Class Kernel Machines
Consider a sphere with center a and radius R
$$\min R^2 + C\sum_t \xi^t$$

subject to

$$\lVert x^t - a\rVert^2 \le R^2 + \xi^t, \quad \xi^t \ge 0$$

The dual is

$$L_d = \sum_t \alpha^t (x^t)^T x^t - \sum_t \sum_s \alpha^t \alpha^s (x^t)^T x^s$$

subject to

$$0 \le \alpha^t \le C, \quad \sum_t \alpha^t = 1$$
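As a sketch (not from the slides): scikit-learn ships a related ν-parameterized one-class formulation, OneClassSVM, which can be used for the same kind of outlier detection; it flags outliers as -1.

```python
# One-class SVM for outlier detection (toy inliers/outliers made up here).
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_in = rng.randn(100, 2)                       # "normal" data around the origin
X_out = rng.uniform(-6, 6, (10, 2))            # a few scattered outliers

oc = OneClassSVM(kernel="rbf", nu=0.05).fit(X_in)
print((oc.predict(X_out) == -1).mean())        # fraction of outliers flagged
```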
Kernel Dimensionality Reduction

Kernel PCA does PCA on the kernel matrix (equal to canonical PCA with a linear kernel).
Kernel LDA
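A sketch assuming scikit-learn's KernelPCA: a linear kernel recovers canonical PCA, while an RBF kernel gives a nonlinear embedding (data and gamma here are made up).

```python
# Kernel PCA with a linear kernel (canonical PCA) and with an RBF kernel.
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.RandomState(0)
X_demo = rng.randn(50, 5)

lin = KernelPCA(n_components=2, kernel="linear").fit_transform(X_demo)
rbf = KernelPCA(n_components=2, kernel="rbf", gamma=0.2).fit_transform(X_demo)
print(lin.shape, rbf.shape)   # (50, 2) (50, 2)
```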
