Kernel Machines
INTRODUCTION TO
Machine Learning
2nd Edition
ETHEM ALPAYDIN
© The MIT Press, 2010
[email protected]
http://www.cmpe.boun.edu.tr/~ethem/i2ml2e
CHAPTER 13:
Kernel Machines
Outline
Introduction
Optimal Separating Hyperplane
The Non-separable Case: Soft Margin Hyperplane
ν-SVM
Kernel Trick
Vectorial Kernels
Defining Kernels
Introduction
Kernel machines are maximum margin methods that
allow the model to be written as a sum of the influences
of a subset of the training instances.
These influences are given by application-specific
similarity kernels, and we discuss “kernelized”
classification, regression, ranking, outlier detection and
dimensionality reduction, and how to choose and use
kernels.
We now discuss a different approach for linear
classification and regression.
We should not be surprised to have so many different
methods even for the simple case of a linear model.
Each learning algorithm has a different inductive bias,
makes different assumptions, and defines a different
objective function and thus may find a different linear
model.
The model that we will discuss in this chapter, called the
support vector machine (SVM), and later generalized under
the name kernel machine, has been popular in recent years
for a number of reasons:
1. It is a discriminant-based method and uses Vapnik's
principle to never solve a more complex problem as a first
step before the actual problem (Vapnik 1995).
2. After training, the parameter of the linear model, the
weight vector, can be written down in terms of a subset of
the training instances, the so-called support vectors.
3. The output is written as a sum of the influences of the
support vectors, and these are given by kernel functions
that are application-specific measures of similarity
between data instances.
4. Typically in most learning algorithms, data points are
represented as vectors, and either the dot product (as in the
multilayer perceptron) or the Euclidean distance (as in
radial basis function networks) is used.
A kernel function allows us to go beyond that. For
example, G1 and G2 may be two graphs and K(G1, G2)
may correspond to the number of shared paths, which
we can calculate without needing to represent G1 or G2
explicitly as vectors (a toy sketch follows this list).
5. Kernel-based algorithms are formulated as convex
optimization problems, and there is a single optimum
that we can solve for analytically.
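As a toy illustration of point 4 (assuming a much simpler similarity than the path-counting kernel mentioned above), the sketch below compares two graphs given only as edge sets, and counts shared edges without ever building a vector representation:

# Toy structured-object similarity: two graphs given as edge sets.
# A real graph kernel would count shared paths or walks; counting shared
# edges is only meant to show that no vector representation is needed.

def edge_set(edges):
    # Normalize undirected edges so (a, b) and (b, a) compare equal.
    return {tuple(sorted(e)) for e in edges}

def toy_graph_kernel(g1, g2):
    return len(edge_set(g1) & edge_set(g2))

G1 = [("a", "b"), ("b", "c"), ("c", "d")]
G2 = [("b", "a"), ("c", "b"), ("d", "e")]
print(toy_graph_kernel(G1, G2))   # 2 shared edges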
Kernel Machines
Discriminant-based: No need to estimate densities first
Define the discriminant in terms of support vectors
The use of kernel functions, application-specific
measures of similarity
No need to represent instances as vectors
Convex optimization problems with a unique solution
Optimal Separating Hyperplane
X = {x^t, r^t}  where  r^t = +1 if x^t ∈ C_1  and  r^t = −1 if x^t ∈ C_2

Find w and w_0 such that
  w^T x^t + w_0 ≥ +1  for r^t = +1
  w^T x^t + w_0 ≤ −1  for r^t = −1
which can be rewritten as
  r^t (w^T x^t + w_0) ≥ +1

The optimal separating hyperplane solves
  min ½ ‖w‖²  subject to  r^t (w^T x^t + w_0) ≥ +1, ∀t
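A minimal sketch of this optimization, assuming scikit-learn and a small linearly separable toy set; a very large C approximates the hard-margin problem, and the check below confirms that every training instance satisfies r^t (w^T x^t + w_0) ≥ 1:

import numpy as np
from sklearn.svm import SVC

# Linearly separable toy data with labels r in {-1, +1}.
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])
r = np.array([-1, -1, -1, 1, 1, 1])

# A very large C approximates the hard-margin (optimal separating hyperplane) problem.
svm = SVC(kernel="linear", C=1e6).fit(X, r)
w, w0 = svm.coef_[0], svm.intercept_[0]

# Every training instance should satisfy r^t (w^T x^t + w_0) >= 1 (up to tolerance).
margins = r * (X @ w + w0)
print(margins)
print(margins.min() >= 1 - 1e-6)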
Margin
[Figure: For a two-class problem where the instances of the classes are shown by plus signs and dots, the thick line is the boundary and the dashed lines define the margins on either side. Circled instances are the support vectors.]
min ½ ‖w‖²  subject to  r^t (w^T x^t + w_0) ≥ +1, ∀t

Introducing Lagrange multipliers α^t ≥ 0, the primal is

  L_p = ½ ‖w‖² − Σ_t α^t [ r^t (w^T x^t + w_0) − 1 ]
      = ½ ‖w‖² − Σ_t α^t r^t (w^T x^t + w_0) + Σ_t α^t

Setting the derivatives to zero:
  ∂L_p/∂w = 0   ⇒  w = Σ_t α^t r^t x^t
  ∂L_p/∂w_0 = 0 ⇒  Σ_t α^t r^t = 0
Plugging these back in, the dual to be maximized is

  L_d = ½ w^T w − w^T Σ_t α^t r^t x^t − w_0 Σ_t α^t r^t + Σ_t α^t
      = −½ w^T w + Σ_t α^t
      = −½ Σ_t Σ_s α^t α^s r^t r^s (x^t)^T x^s + Σ_t α^t

subject to  Σ_t α^t r^t = 0  and  α^t ≥ 0, ∀t

Most α^t are 0 and only a small number have α^t > 0; they are the support vectors.
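A sketch of the relation w = Σ_t α^t r^t x^t, assuming scikit-learn: a fitted SVC stores α^t r^t for the support vectors in dual_coef_, so the weight vector can be reconstructed from the support vectors alone, and most training instances play no role in the solution:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
r = np.array([-1] * 20 + [1] * 20)

svm = SVC(kernel="linear", C=1.0).fit(X, r)

# dual_coef_ holds alpha^t * r^t for the support vectors only;
# all other alpha^t are exactly zero and those instances can be discarded.
alpha_times_r = svm.dual_coef_[0]
w_from_dual = alpha_times_r @ svm.support_vectors_

print(np.allclose(w_from_dual, svm.coef_[0]))   # True: w = sum_t alpha^t r^t x^t
print(len(svm.support_vectors_), "support vectors out of", len(X))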
Soft Margin Hyperplane
Not linearly separable: introduce slack variables ξ^t ≥ 0 and require
  r^t (w^T x^t + w_0) ≥ 1 − ξ^t

Soft error:  Σ_t ξ^t

New primal is
  L_p = ½ ‖w‖² + C Σ_t ξ^t − Σ_t α^t [ r^t (w^T x^t + w_0) − 1 + ξ^t ] − Σ_t μ^t ξ^t
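A sketch (scikit-learn assumed, with toy overlapping classes) of how the penalty factor C trades margin width against the soft error, reading each slack off as ξ^t = max(0, 1 − r^t (w^T x^t + w_0)):

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Overlapping classes: not linearly separable.
X = np.vstack([rng.normal(0, 1.5, (50, 2)), rng.normal(2, 1.5, (50, 2))])
r = np.array([-1] * 50 + [1] * 50)

for C in (0.1, 100.0):
    svm = SVC(kernel="linear", C=C).fit(X, r)
    w, w0 = svm.coef_[0], svm.intercept_[0]
    xi = np.maximum(0.0, 1.0 - r * (X @ w + w0))   # slack variables xi^t
    # Larger C penalizes slack more: smaller soft error, narrower margin (larger ||w||).
    print(f"C={C}: soft error={xi.sum():.2f}, ||w||={np.linalg.norm(w):.2f}")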
Hinge Loss
  L_hinge(y^t, r^t) = 0            if y^t r^t ≥ 1
                      1 − y^t r^t  otherwise
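A minimal numpy sketch of this loss, where y^t is the model output and r^t ∈ {−1, +1}:

import numpy as np

def hinge_loss(y, r):
    # 0 where y^t r^t >= 1 (correct side, outside the margin), 1 - y^t r^t otherwise.
    return np.maximum(0.0, 1.0 - y * r)

y = np.array([2.0, 0.5, -0.3, -1.5])   # model outputs
r = np.array([1, 1, 1, -1])            # true labels
print(hinge_loss(y, r))                # [0.  0.5 1.3 0. ]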
ν-SVM

  min ½ ‖w‖² − νρ + (1/N) Σ_t ξ^t

subject to
  r^t (w^T x^t + w_0) ≥ ρ − ξ^t,  ξ^t ≥ 0,  ρ ≥ 0

The dual is

  L_d = −½ Σ_t Σ_s α^t α^s r^t r^s (x^t)^T x^s

subject to
  Σ_t α^t r^t = 0,  0 ≤ α^t ≤ 1/N,  Σ_t α^t ≥ ν

Here ρ replaces the fixed margin of 1, and ν ∈ (0, 1] is an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors.
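A sketch assuming scikit-learn, whose NuSVC implements this formulation; the toy data and the ν values are illustrative:

import numpy as np
from sklearn.svm import NuSVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1.2, (100, 2)), rng.normal(2.5, 1.2, (100, 2))])
r = np.array([-1] * 100 + [1] * 100)

for nu in (0.05, 0.3):
    svm = NuSVC(nu=nu, kernel="linear").fit(X, r)
    frac_sv = len(svm.support_) / len(X)
    # nu lower-bounds the fraction of support vectors (and upper-bounds margin errors).
    print(f"nu={nu}: fraction of support vectors = {frac_sv:.2f}")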
Kernel Trick
Preprocess input x by basis functions
  z = φ(x),   g(z) = w^T z,   g(x) = w^T φ(x)

The SVM solution:
  w = Σ_t α^t r^t z^t = Σ_t α^t r^t φ(x^t)
  g(x) = w^T φ(x) = Σ_t α^t r^t φ(x^t)^T φ(x)
  g(x) = Σ_t α^t r^t K(x^t, x)
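A sketch (scikit-learn assumed) checking that a kernelized SVM's output really is g(x) = Σ_t α^t r^t K(x^t, x) + w_0, with the sum running only over the support vectors:

import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(2, 1, (40, 2))])
r = np.array([-1] * 40 + [1] * 40)

gamma = 0.5
svm = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, r)

x_new = np.array([[1.0, 1.0]])
# K(x^t, x) for every support vector x^t, then the weighted sum with alpha^t r^t.
K = rbf_kernel(svm.support_vectors_, x_new, gamma=gamma)       # shape (n_SV, 1)
g_manual = svm.dual_coef_[0] @ K[:, 0] + svm.intercept_[0]

print(np.isclose(g_manual, svm.decision_function(x_new)[0]))   # True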
•The kernel trick is a simple idea: nonlinearly separable data are projected into a higher-dimensional space in which they become easier to classify, that is, they can be separated linearly by a hyperplane.
•Mathematically, this is achieved through the Lagrangian (dual) formulation with Lagrange multipliers: the data appear only through dot products, which can be replaced by kernel evaluations.
•Kernel: intuitively, a kernel corresponds to mapping the data from a low-dimensional space into a higher-dimensional space in which a decision surface that is curved in the original space becomes a flat (linear) one. The kernel itself is the function K(x, y) = φ(x)^T φ(y) that returns dot products in that higher-dimensional space without computing the mapping φ explicitly.
Advantages of Support Vector Machine
Vectorial Kernels
Polynomials of degree q:
  K(x^t, x) = (x^T x^t + 1)^q

For example, when q = 2 and x, y are two-dimensional:
  K(x, y) = (x^T y + 1)²
          = (x_1 y_1 + x_2 y_2 + 1)²
          = 1 + 2 x_1 y_1 + 2 x_2 y_2 + 2 x_1 x_2 y_1 y_2 + x_1² y_1² + x_2² y_2²

which corresponds to the basis functions
  φ(x) = [1, √2 x_1, √2 x_2, √2 x_1 x_2, x_1², x_2²]^T
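A short numpy check of this identity for q = 2 in two dimensions; the kernel value equals the dot product of the expanded basis vectors, which never need to be formed in practice:

import numpy as np

def poly2_kernel(x, y):
    return (x @ y + 1.0) ** 2

def phi(x):
    # Explicit basis functions for the degree-2 polynomial kernel in 2-D.
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     np.sqrt(2) * x1 * x2,
                     x1 ** 2, x2 ** 2])

x = np.array([0.7, -1.2])
y = np.array([2.0, 0.4])
print(np.isclose(poly2_kernel(x, y), phi(x) @ phi(y)))   # True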
Vectorial Kernels
Radial-basis functions:
  K(x^t, x) = exp( − ‖x^t − x‖² / (2 s²) )
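A numpy sketch of the Gaussian kernel matrix over a data set, computed from squared Euclidean distances; the spread s is an assumed hyperparameter:

import numpy as np

def gaussian_kernel_matrix(X, s=1.0):
    # Squared Euclidean distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq_norms = (X ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * s ** 2))

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = gaussian_kernel_matrix(X, s=1.0)
print(np.round(K, 3))   # symmetric, with ones on the diagonal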
Defining kernels
Kernel “engineering”
Defining good measures of similarity
String kernels, graph kernels, image kernels, ...
Empirical kernel map: define a set of templates m_i and a score function s(x, m_i), let
  f(x^t) = [ s(x^t, m_1), s(x^t, m_2), ..., s(x^t, m_M) ]
and define
  K(x, x^t) = f(x)^T f(x^t)
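A sketch of an empirical kernel map; the templates m_i and the Gaussian score function used here are illustrative choices, not prescribed by the method:

import numpy as np

def score(x, m, s=1.0):
    # Example score function: Gaussian similarity to a template.
    return np.exp(-np.sum((x - m) ** 2) / (2.0 * s ** 2))

def empirical_map(x, templates):
    # f(x) = [s(x, m_1), ..., s(x, m_M)]
    return np.array([score(x, m) for m in templates])

templates = [np.array([0.0, 0.0]), np.array([1.0, 1.0]), np.array([3.0, 0.0])]
xa, xb = np.array([0.5, 0.5]), np.array([2.5, 0.2])

# K(x, x') = f(x)^T f(x') is a valid kernel by construction.
print(empirical_map(xa, templates) @ empirical_map(xb, templates))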
Multiple Kernel Learning
Fixed kernel combinations:
  K(x, y) = c K_1(x, y),  c > 0
  K(x, y) = K_1(x, y) + K_2(x, y)
  K(x, y) = K_1(x, y) · K_2(x, y)

Adaptive kernel combination with learned weights η_i:
  L_d = Σ_t α^t − ½ Σ_t Σ_s α^t α^s r^t r^s Σ_i η_i K_i(x^t, x^s)
  g(x) = Σ_t α^t r^t Σ_i η_i K_i(x^t, x)
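A sketch (scikit-learn assumed) of a fixed combination: two Gram matrices are summed and passed to an SVM as a precomputed kernel; the chosen kernels and their equal weighting are illustrative, and nothing is learned about the combination itself:

import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel, polynomial_kernel

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(2, 1, (30, 2))])
r = np.array([-1] * 30 + [1] * 30)

# Fixed combination K = K1 + K2 (a sum of valid kernels is a valid kernel).
K = rbf_kernel(X, X, gamma=0.5) + polynomial_kernel(X, X, degree=2)

svm = SVC(kernel="precomputed", C=1.0).fit(K, r)
print(svm.score(K, r))   # training accuracy using the combined kernel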
Multiclass Kernel Machines
1-vs-all
Pairwise separation
Error-Correcting Output Codes
Single multiclass optimization
  min ½ Σ_{i=1}^{K} ‖w_i‖² + C Σ_i Σ_t ξ_i^t

subject to
  w_{z^t}^T x^t + w_{z^t 0} ≥ w_i^T x^t + w_{i0} + 2 − ξ_i^t,  ∀i ≠ z^t,  ξ_i^t ≥ 0

where z^t denotes the class of x^t.
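A sketch of the first two approaches, assuming scikit-learn: OneVsRestClassifier trains one machine per class (1-vs-all), while SVC by itself uses pairwise (one-vs-one) separation internally; the data set and kernel settings are illustrative:

from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)   # 3 classes

# 1-vs-all: one binary SVM per class, predict the class with the largest output.
ova = OneVsRestClassifier(SVC(kernel="rbf", gamma="scale", C=1.0)).fit(X, y)
print(len(ova.estimators_), "binary machines, training accuracy:", ova.score(X, y))

# Pairwise (1-vs-1) separation: SVC's built-in multiclass strategy.
ovo = SVC(kernel="rbf", gamma="scale", C=1.0).fit(X, y)
print("pairwise training accuracy:", ovo.score(X, y))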
SVM for Regression
Use a linear model (possibly kernelized):
  f(x) = w^T x + w_0

Use the ε-sensitive error function:
  e_ε(r^t, f(x^t)) = 0                     if |r^t − f(x^t)| < ε
                     |r^t − f(x^t)| − ε    otherwise

  min ½ ‖w‖² + C Σ_t (ξ_+^t + ξ_−^t)

subject to
  r^t − (w^T x^t + w_0) ≤ ε + ξ_+^t
  (w^T x^t + w_0) − r^t ≤ ε + ξ_−^t
  ξ_+^t, ξ_−^t ≥ 0
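A sketch of the ε-sensitive error in numpy together with a fit using scikit-learn's SVR, which implements this formulation; the toy data and the values of C and ε are illustrative:

import numpy as np
from sklearn.svm import SVR

def eps_insensitive_error(r, f_x, eps=0.1):
    # 0 inside the epsilon tube, |r - f(x)| - eps outside it.
    return np.maximum(0.0, np.abs(r - f_x) - eps)

rng = np.random.default_rng(5)
X = np.sort(rng.uniform(0, 5, (60, 1)), axis=0)
r = np.sin(X[:, 0]) + rng.normal(0, 0.1, 60)

svr = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, r)
pred = svr.predict(X)
print("mean eps-insensitive error:", eps_insensitive_error(r, pred, eps=0.1).mean())
print(len(svr.support_), "support vectors out of", len(X))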
Kernel Regression
[Figure: kernel regression fits using a polynomial kernel and a Gaussian kernel.]
One-Class Kernel Machines
Consider a sphere with center a and radius R
  min R² + C Σ_t ξ^t

subject to
  ‖x^t − a‖² ≤ R² + ξ^t,  ξ^t ≥ 0

The dual is

  L_d = Σ_t α^t (x^t)^T x^t − Σ_t Σ_s α^t α^s (x^t)^T x^s

subject to
  0 ≤ α^t ≤ C,  Σ_t α^t = 1
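A sketch assuming scikit-learn; its OneClassSVM implements the related hyperplane formulation of Schölkopf et al. rather than the sphere above (with a Gaussian kernel the two coincide), and ν roughly sets the fraction of training points treated as outliers:

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(6)
X_inliers = rng.normal(0, 1, (95, 2))
X_outliers = rng.uniform(-6, 6, (5, 2))
X = np.vstack([X_inliers, X_outliers])

# nu roughly controls the fraction of training points treated as outliers.
oc = OneClassSVM(kernel="rbf", gamma=0.2, nu=0.05).fit(X)
labels = oc.predict(X)            # +1 inside the learned region, -1 outside
print("flagged as outliers:", np.sum(labels == -1))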
Kernel Dimensionality Reduction
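Kernel PCA performs PCA in the space induced by the kernel (with a linear kernel it reduces to ordinary PCA). A sketch assuming scikit-learn's KernelPCA; the data set, kernel, and gamma are illustrative:

from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# Concentric circles: not linearly separable in the original 2-D space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Project onto the leading components of an RBF-kernel PCA.
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=5.0)
Z = kpca.fit_transform(X)
print(Z.shape)   # (200, 2) projection in the kernel-induced feature space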