
MA4270: Data Modelling and Computation

Lecture 5: Kernel Methods



1 / 49
Section 1

Feature Space Mappings

2 / 49
Feature Space Mappings

In previous lectures, we looked at linear classifiers for binary labels:

    Predict positive label  ⟺  θ^T x + θ₀ > 0,

and also linear predictors for real-valued variables (linear regression):

    ŷ(x) = θ^T x + θ₀.

What if x and y are related in a highly non-linear manner?

- We can still use linear classification and regression methods!

3 / 49
Example: Fitting a quadratic

If we know that y is (well-approximated by) a quadratic function of a single input x, then we can use x to construct x̃ = [x x²]^T, and then perform linear regression with input x̃ and θ ∈ ℝ². Then ŷ = θ^T x̃ + θ₀ evaluates to

    ŷ = θ₂x² + θ₁x + θ₀,

an arbitrary quadratic function (three parameters θ₀, θ₁, θ₂)!
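To make this concrete, here is a minimal Python sketch (not from the slides; the synthetic data and the use of numpy are illustrative assumptions): map each scalar x to the feature vector [1, x, x²] and run ordinary least squares on the mapped inputs.

```python
# Minimal sketch: fit y ≈ θ2·x² + θ1·x + θ0 by mapping x to [1, x, x²]
# and performing linear regression in the mapped (feature) space.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=50)
y = 1.5 * x**2 - 0.7 * x + 0.3 + 0.1 * rng.standard_normal(50)  # synthetic quadratic data

X_tilde = np.column_stack([np.ones_like(x), x, x**2])   # feature map x -> [1, x, x²]
theta, *_ = np.linalg.lstsq(X_tilde, y, rcond=None)     # ordinary least squares
print(theta)  # approximately [0.3, -0.7, 1.5] = [θ0, θ1, θ2]
```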

4 / 49
Example: A circular classification region
Suppose that d = 2 and the binary labels are generated according to

    y_t = +1  if x1² + x2² ≤ 1,
          −1  otherwise,

yielding a circular region.

Any linear classifier with input x = [x1 x2]^T will perform poorly. But a linear classifier with input x̃ = [x1 x2 x1²+x2²]^T exists that classifies perfectly!
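As a hedged illustration of this point (scikit-learn's LogisticRegression is used purely as a stand-in for "any linear classifier"; it is not part of the lecture):

```python
# Sketch: the circular region is not linearly separable in (x1, x2),
# but becomes separable once the feature x1² + x2² is appended.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(400, 2))
y = np.where(X[:, 0]**2 + X[:, 1]**2 <= 1, +1, -1)

X_tilde = np.column_stack([X, X[:, 0]**2 + X[:, 1]**2])        # [x1, x2, x1² + x2²]
print(LogisticRegression().fit(X, y).score(X, y))              # mediocre (~0.8, close to majority vote)
print(LogisticRegression().fit(X_tilde, y).score(X_tilde, y))  # essentially perfect (~1.0)
```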
5 / 49
Example: A circular classification region

Visual illustration:
http://www.youtube.com/watch?v=3liCbRZPrZA

2D points at the bottom (not linearly separable), mapping to 3D shown above (linearly separable).

6 / 49
Feature Space Mappings

In general, we might hope to get a better classifier by mapping each x to a well-designed feature space:

    φ(x) = [φ₁(x), . . . , φ_N(x)]^T

for some real-valued functions φ₁, . . . , φ_N.

7 / 49
A word of caution!
We can always get to zero training error by adding more and more
features, but is that always a good idea? Which prediction rule will work
better in the following regression example?
[Figure: the same training data fit by a low-degree (e.g., degree-3) polynomial and by a much higher-degree polynomial that passes through every point; increasing the degree leads to overfitting.]
A more complex classifier achieves smaller training error, but runs into the
danger of overfitting. We will explore the notions of generalization error
and overfitting in a later lecture.
8 / 49
Section 2

Inner Products

9 / 49
Kernel Methods – Overview

Many machine learning algorithms depend on the data x1, . . . , xn only through the pairwise inner products ⟨xi, xj⟩.

Examples to come later: Ridge regression, SVM

Example explored in the project: K-nearest neighbors (given a new point, find the K closest points in the training set, and classify according to a majority vote)

10 / 49
Overview

Inner products capture the geometry of the data set, so one generally expects geometrically inspired algorithms (e.g., SVM) to depend only on inner products:

For algorithms that use distances, note that ‖x − x′‖² = ⟨x, x⟩ + ⟨x′, x′⟩ − 2⟨x, x′⟩, so distances can be expressed in terms of inner products.

For algorithms that use angles, note that angle(x, x′) = cos⁻¹( ⟨x, x′⟩ / (‖x‖ · ‖x′‖) ), so since ‖x‖ = √⟨x, x⟩, angles can also be expressed in terms of inner products.

11 / 49
Overview

We know that moving to feature spaces can help, so we could map each xi → φ(xi) and apply the algorithm using ⟨φ(xi), φ(xj)⟩.

A kernel function k(xi, xj) can be thought of as an inner product in a possibly implicit feature space.

Key idea. There are clever choices of the mapping φ(·) ensuring that we can efficiently compute ⟨φ(xi), φ(xj)⟩ without ever explicitly mapping to the feature space.

- In some cases, the feature space is infinite-dimensional, so we could not explicitly map to it even if we wanted to.

Kernel trick. In this lecture, we will build algorithms that only depend on the data through inner products of the form ⟨φ(xi), φ(xj)⟩. Moreover, these inner products can be computed very easily through a kernel function evaluation k(x, x′). Building algorithms that only depend on the inner product is called the ‘kernel trick’.

12 / 49
Overview

Intuition 1. The kernel function is a measure of similarity between xi and xj.

Intuition 2. The kernel trick applies to problems that depend on the geometry of the data (think of SVM, for example!). In particular, note that ‖x − x′‖² = ⟨x − x′, x − x′⟩ = ⟨x, x⟩ − 2⟨x, x′⟩ + ⟨x′, x′⟩, so algorithms depending on distances can be rewritten in terms of inner products.

13 / 49
Section 3

Formal definitions

14 / 49
Kernel methods – Formal definition
Definition

A function k : ℝ^d × ℝ^d → ℝ is said to be a positive semidefinite (PSD) kernel if
(i) it is symmetric, i.e., k(x, x′) = k(x′, x);
(ii) for any integer m > 0 and any set of inputs x1, . . . , xm in ℝ^d, the following matrix is positive semidefinite (i.e., all of its eigenvalues are non-negative):

        ⎡ k(x1, x1)  . . .  k(x1, xm) ⎤
    K = ⎢     ⋮       ⋱        ⋮     ⎥  ⪰ 0.
        ⎣ k(xm, x1)  . . .  k(xm, xm) ⎦

This matrix, with (i, j)-th entry equal to k(xi, xj), is called the kernel matrix (you might also see it referred to as the Gram matrix).
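As a quick numerical sanity check of this definition (an illustrative sketch, not part of the slides; it uses the RBF kernel introduced later in the lecture):

```python
# Build the kernel (Gram) matrix K for the RBF kernel on random inputs and
# verify numerically that K is symmetric and PSD (eigenvalues ≥ 0).
import numpy as np

def rbf_kernel(x, xp, ell=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / (2 * ell ** 2))

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))                       # 20 inputs in R^3
K = np.array([[rbf_kernel(xi, xj) for xj in X] for xi in X])

print(np.allclose(K, K.T))                             # symmetric
print(np.linalg.eigvalsh(K).min() >= -1e-10)           # eigenvalues ≥ 0 up to round-off
```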
15 / 49
Kernel methods

Theorem
A function k : ℝ^d × ℝ^d → ℝ is a PSD kernel if and only if it equals an inner product ⟨φ(x), φ(x′)⟩ for some (possibly infinite-dimensional) mapping φ(x).*

* In fact, this statement is a bit imprecise, since in general it may need to be a generalized notion of inner product beyond standard vector spaces (namely, Hilbert spaces). However, we will avoid such technicalities and focus on the standard inner product applied to real-valued vectors.

16 / 49


Kernel methods
"If" direction.

The "if" part is easy to show (at least when φ is finite-dimensional). Suppose such a mapping φ exists.

(i) One has k(x, x′) = ⟨φ(x), φ(x′)⟩ = ⟨φ(x′), φ(x)⟩ = k(x′, x); i.e., k is symmetric.

(ii) The Gram matrix is

        ⎡ k(x1, x1)  . . .  k(x1, xm) ⎤   ⎡ φ(x1)^T ⎤
    K = ⎢     ⋮       ⋱        ⋮     ⎥ = ⎢    ⋮    ⎥ [ φ(x1)  . . .  φ(xm) ].
        ⎣ k(xm, x1)  . . .  k(xm, xm) ⎦   ⎣ φ(xm)^T ⎦

In particular, K is positive semidefinite because for any z one has

    z^T K z = z^T Φ^T Φ z = ‖Φz‖² ≥ 0,

where Φ = [ φ(x1)  . . .  φ(xm) ].

17 / 49
Kernel methods
(Optional) "Only if" direction.

The "only if" part is more challenging, but can be understood fairly easily in the case that x only takes on finitely many values, using the idea of eigenvalue decomposition.

Supposing that x can only take values in a finite set {x1, . . . , xm}, the entire function is described by an m × m matrix K_full with (i, j)-th entry k(xi, xj). By assumption K_full is a PSD matrix. Hence it admits an eigenvalue decomposition of the form

    K_full = Σ_{j=1}^r λ_j v_j v_j^T,

where v_j ∈ ℝ^m (and r is the number of non-zero eigenvalues). Consider the feature map

    φ(xj) = [ √λ_1 (v_1)_j, . . . , √λ_r (v_r)_j ]^T,

an r-dimensional vector, where (v_i)_j denotes the j-th entry of the vector v_i. One can check that k(xi, xj) = ⟨φ(xi), φ(xj)⟩.
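A small numerical illustration of this construction (a sketch under the stated finite-value assumption; the particular kernel matrix used here is just an example):

```python
# Take a PSD kernel matrix K_full, build φ from its eigendecomposition,
# and check that inner products of the mapped points recover K_full.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 2))
K_full = np.exp(-0.5 * np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))  # example PSD matrix

lam, V = np.linalg.eigh(K_full)          # eigenvalues lam, eigenvectors as columns of V
lam = np.clip(lam, 0.0, None)            # clip tiny negative values caused by round-off
Phi = V * np.sqrt(lam)                   # row j of Phi is the feature vector φ(x_j)

print(np.allclose(Phi @ Phi.T, K_full))  # True: ⟨φ(x_i), φ(x_j)⟩ = k(x_i, x_j)
```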

18 / 49
Kernel methods

(Optional) "Only if" direction (continued).

Such an approach can be extended to more general scenarios via Mercer's theorem, and a fully general treatment is possible via the notion of a Reproducing Kernel Hilbert Space (RKHS).

19 / 49
Section 4

Examples

20 / 49
Example 1: Polynomial kernels

Start with the 1D setting, d = 1. Naively, one might map

    x ↦ (1, x, x²)

for quadratic features,

    x ↦ (1, x, x², x³)

for cubic features, and so on.

Similarly for higher dimensions, e.g., (x1, x2) ↦ (1, x1, x2, x1·x2, x1², x2²). Notice the presence of the cross-term x1·x2.

21 / 49
Example 1: Polynomial kernels

Now, considering the cubic example, we claim that


    x ↦ (1, √3·x, √3·x², x³)

is a "better" choice than the one above mapping to (1, x, x², x³).

First note that the set of all possible linear classifiers (meaning linear in φ(x)) is exactly the same regardless of the choice of mapping. This is because the middle two coefficients are just scaled up or down by √3.

But notice that the inner product simplifies nicely under the second choice:

    ⟨φ(x), φ(x′)⟩ = 1 + 3xx′ + 3x²(x′)² + x³(x′)³ = (1 + xx′)³.
22 / 49
Example 1: Polynomial kernels.

Generalizing this idea to polynomials of degree p and functions in d dimensions, we arrive at the polynomial kernel k(x, x′) = (1 + ⟨x, x′⟩)^p.

The computational savings of avoiding the construction of φ(x) can be much more significant than in the above example. Instead of computing a huge number of features (specifically, it can be shown to be (p+d choose d), which grows very quickly), the computation time is linear in d (do d multiplications, sum them, and apply (1 + result)^p).
23 / 49
Example 2: String kernels (not examinable)

Continuing the interpretation of a kernel being a measure of similarity, how


can we “measure similarity” between the following?
x1 = “This sentence is the first string in my data set”
x2 = “The second string in my data set is this sentence”
x3 = “Tihs sentance is the third stirng in my dataset”
One possible approach is to let k(x, x′) be the number of words appearing in both strings. This corresponds to a feature space with dimension equal to the total number of possible words (a very large dimension), and with

    φ_j(x) = 1{x contains the j-th word}.

This approach has limitations, such as not handling spelling errors well.
24 / 49
Example 2: String kernels (not examinable)

A more “robust” approach is the String Subsequence Kernel (SSK), which


looks at substrings that are not necessarily contiguous (e.g., “sentnce” is a
substring of both “sentence” and “sentance”), but decreases the weight
when this substring is spread across a longer length (e.g., “sentnce” is
also a substring of “sent this once”, but this is spread over a longer length
so carries less weight).
Can be computed somewhat-efficiently using dynamic programming
techniques.
If exact computation is still too demanding, approximate computation can be used instead.
The idea of using substrings instead of exact matches is also very
important in biology applications.

25 / 49
Other examples

Below we will introduce the commonly-used RBF kernel:


    k_RBF(x, x′) = exp( −(1/2) ‖x − x′‖² ),

which indicates that similarity exponentially decays to zero as the squared distance increases.
The kernel cookbook gives several other examples and insights.
For data represented in a not-so-standard format (e.g., text, graphs),
more unconventional/creative choices of kernels are often adopted.

26 / 49
Section 5

Constructing More Complicated Kernels from Simpler


Ones

27 / 49
Constructing More Complicated Kernels

Claim
If k1 and k2 are PSD kernels, then so are the following:

1. k(x, x′) = f(x) k1(x, x′) f(x′) for any function f
2. k(x, x′) = k1(x, x′) + k2(x, x′)
3. k(x, x′) = k1(x, x′) k2(x, x′)
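A quick numerical sanity check of properties 2 and 3 (an illustration only, using two concrete kernel matrices and the fact that a PSD matrix has non-negative eigenvalues):

```python
# The element-wise sum and product of two kernel (Gram) matrices built from
# PSD kernels are again PSD, matching closure properties 2 and 3.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((15, 3))
K1 = X @ X.T                                                # linear-kernel Gram matrix
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K2 = np.exp(-0.5 * sq)                                      # RBF-kernel Gram matrix

for K in (K1 + K2, K1 * K2):                                # sum and element-wise product
    print(np.linalg.eigvalsh(K).min() >= -1e-9)             # True, True (up to round-off)
```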

28 / 49
Constructing More Complicated Kernels

To set up notation, let φ^(1)(x) and φ^(2)(x) be the feature vectors corresponding to k1 and k2.

Proof of #1.
Let φ(x) = f(x) φ^(1)(x). Then

    ⟨φ(x), φ(x′)⟩ = f(x) ⟨φ^(1)(x), φ^(1)(x′)⟩ f(x′) = f(x) k1(x, x′) f(x′).

This means that k(x, x′) = ⟨φ(x), φ(x′)⟩, and hence k(x, x′) is a PSD kernel.

29 / 49
Constructing More Complicated Kernels

Proof of #2.
 (1)
(x)
Let (x) = (2) , and write
(x)

h (x), (x0 )i = h (x)(1) , (1)


(x0 )i+h (x)(2) , (2)
(x0 )i = k1 (x, x0 )+k2 (x, x0 ).

This means that k(x, x0 ) = h (x), (x0 )i, and hence k(x, x0 ) is a PSD
kernel.

30 / 49
Constructing More Complicated Kernels
Proof of #3.
Let φ̃(x) contain entries φ̃_ij(x) = φ^(1)_i(x) φ^(2)_j(x) for each i, j, where φ^(1)_i is the i-th feature in φ^(1) and similarly for φ^(2)_j.

We have

    ⟨φ̃(x), φ̃(x′)⟩ = Σ_{i,j} φ̃_ij(x) φ̃_ij(x′)
                   = Σ_{i,j} φ^(1)_i(x) φ^(2)_j(x) φ^(1)_i(x′) φ^(2)_j(x′)
                   = ( Σ_i φ^(1)_i(x) φ^(1)_i(x′) ) ( Σ_j φ^(2)_j(x) φ^(2)_j(x′) )
                   = k1(x, x′) k2(x, x′).

31 / 49
Section 6

RBF Kernel

32 / 49
Example 3: Radial basis function (RBF) kernel
Letting k1(x, x′) = k2(x, x′) = x^T x′ in Property 3 above, we see that (x^T x′)² is a PSD kernel.

Repeating this process, (x^T x′)^p is a PSD kernel for any integer p.

(Side note: A similar argument gives an alternative proof that the polynomial kernel (1 + x^T x′)^p is a PSD kernel.)

Recall the power series exp(t) = Σ_{j=0}^∞ t^j / j!. By applying Property 2 an "infinite number of times" (a formal proof of the validity of this is omitted here), we deduce that exp(x^T x′) is also a PSD kernel.

33 / 49
Example 3: Radial basis function (RBF) kernel

Finally, define

    k(x, x′) = exp( −(1/2) ‖x − x′‖² )                                         (6.1)
             = exp( −(1/2) ‖x‖² ) · exp( x^T x′ ) · exp( −(1/2) ‖x′‖² ),        (6.2)

and use Property 1 (with f(x) = exp(−(1/2)‖x‖²)) to deduce that k(x, x′) is a PSD kernel.

34 / 49
Example 3: Radial basis function (RBF) kernel

This kernel goes by several names: Radial basis function (RBF) kernel, Gaussian kernel, squared exponential kernel. It is usually defined with a length-scale parameter ℓ:

    k(x, x′) = exp( −‖x − x′‖² / (2ℓ²) ).

Such a parameter represents the rough scale over which the function varies.

More generally, we can have different length-scales in each dimension, ℓ = (ℓ1, . . . , ℓd).

The associated feature space is infinite-dimensional, as hinted by the fact that we used the infinite expansion e^t = Σ_{j=0}^∞ t^j / j! in its derivation.

35 / 49
Example 3: Radial basis function (RBF) kernel

An illustration of the sorts of classification regions that can be produced by


the kernel SVM (to be introduced in the next lecture) with an RBF kernel
[Figure: 2D training data (x1, x2) with +1 and −1 labels scattered in several clusters; the learned decision regions ("Classify as +1" / "Classify as −1") have complicated non-linear boundaries.]
Here the prediction rule is of the form ŷ = sign(g (x)) as usual, and
the colors in this figure represent the values of g (x).
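For readers who wish to reproduce this kind of picture, here is a hedged sketch using scikit-learn's SVC (an external-library choice for illustration; the kernel SVM itself is only introduced in the next lecture, and the synthetic labels are an assumption):

```python
# Train an SVM with an RBF kernel on 2D data and evaluate g(x) on a grid;
# sign(g(x)) gives non-linear classification regions like those shown above.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 2))
y = np.where(np.sin(X[:, 0]) * np.cos(X[:, 1]) > 0, +1, -1)   # a non-linear ground truth

clf = SVC(kernel="rbf", gamma=1.0).fit(X, y)                   # gamma = 1/(2·ℓ²)
xx, yy = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
g = clf.decision_function(np.column_stack([xx.ravel(), yy.ravel()]))
y_hat = np.sign(g).reshape(xx.shape)                           # prediction ŷ = sign(g(x))
print(clf.score(X, y))                                         # training accuracy
```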

36 / 49
Linear Regression Revisited

An example using the RBF kernel (with length-scale ℓ):

[Figure: 1D regression fits using the RBF kernel with different length-scales; a very small ℓ fits the training data perfectly (overfitting), while a very large ℓ gives an overly smooth fit (underfitting).]
- Observe that even among a given class of kernels, the choice of their parameter(s) may be very important (e.g., length-scale ℓ in this example, degree p in the polynomial example, etc.).
- Often the parameters are chosen using maximum likelihood.
- We will also cover model selection later, a special case of which is kernel selection.

37 / 49
Linear Regression Revisited

38 / 49
Section 7

Linear Regression Revisited

39 / 49
Linear Regression Revisited
Here we look at linear regression with kernels. In the next lecture we
will look at SVM with kernels. The same can be done for logistic
regression, but we will skip that.
We previously considered the regularized least squares estimator (ridge regression); in the case of no offset (θ0 = 0), it is written as follows (with X = [x1^T; . . . ; xn^T] ∈ ℝ^{n×d}, whose rows are the data points):

    θ̂ = arg min_θ ‖y − Xθ‖² + λ‖θ‖²,

and has the closed-form solution

    θ̂ = (X^T X + λI)^{-1} X^T y.

40 / 49
Linear Regression Revisited
Now let's apply some useful matrix manipulations. To help with readability, let I_d and I_n denote identity matrices with the size made explicit. First observe

    (X^T X + λI_d) X^T = X^T X X^T + λX^T = X^T (X X^T + λI_n).

Multiplying by (X^T X + λI_d)^{-1} on the left and (X X^T + λI_n)^{-1} on the right gives

    X^T (X X^T + λI_n)^{-1} = (X^T X + λI_d)^{-1} X^T,

meaning we obtain the following equivalent form for θ̂:

    θ̂ = X^T (X X^T + λI_n)^{-1} y.

Therefore, given a new input x′ ∈ ℝ^d, the prediction ŷ(x′) = θ̂^T x′ = (x′)^T θ̂ can be written as

    ŷ(x′) = (x′)^T X^T (X X^T + λI_n)^{-1} y.
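A quick numerical check of this identity (the dimensions and the value of λ are arbitrary illustrative choices):

```python
# Verify that (X^T X + λ I_d)^{-1} X^T y  equals  X^T (X X^T + λ I_n)^{-1} y.
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 30, 5, 0.1
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

theta_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)  # d×d system
theta_dual = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)    # n×n system
print(np.allclose(theta_primal, theta_dual))                        # True
```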
41 / 49
Linear Regression Revisited

Crucial observation. The prediction depends on the data only through


inner products, since

    (x′)^T X^T = [ ⟨x′, x1⟩  . . .  ⟨x′, xn⟩ ],

            ⎡ ⟨x1, x1⟩  . . .  ⟨x1, xn⟩ ⎤
    X X^T = ⎢     ⋮       ⋱       ⋮    ⎥   (the Gram matrix).
            ⎣ ⟨xn, x1⟩  . . .  ⟨xn, xn⟩ ⎦

42 / 49
Linear Regression Revisited

Therefore, we can apply the kernel trick and consider the more general prediction function

    ŷ(x′) = k(x′) (K + λI)^{-1} y,                      (7.1)

where the inner products are replaced by kernel evaluations:

    k(x′) = [ k(x′, x1)  . . .  k(x′, xn) ],

        ⎡ k(x1, x1)  . . .  k(x1, xn) ⎤
    K = ⎢     ⋮       ⋱        ⋮     ⎥ .
        ⎣ k(xn, x1)  . . .  k(xn, xn) ⎦

This is known as kernel ridge regression.
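A minimal sketch of kernel ridge regression as in (7.1), assuming an RBF kernel; the length-scale ℓ, the regularization λ, and the synthetic data are illustrative choices only:

```python
# ŷ(x') = k(x') (K + λI)^{-1} y with an RBF kernel.
import numpy as np

def rbf_kernel_matrix(A, B, ell=0.5):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * ell**2))

rng = np.random.default_rng(0)
x_train = rng.uniform(-3, 3, size=(40, 1))
y_train = np.sin(x_train[:, 0]) + 0.1 * rng.standard_normal(40)

lam = 1e-2
K = rbf_kernel_matrix(x_train, x_train)                      # n×n kernel matrix
alpha = np.linalg.solve(K + lam * np.eye(len(K)), y_train)   # (K + λI)^{-1} y

x_test = np.linspace(-3, 3, 200)[:, None]
y_pred = rbf_kernel_matrix(x_test, x_train) @ alpha          # ŷ(x') for each test input
```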

43 / 49
Linear Regression Revisited
But what is this estimator doing?
    ŷ(x′) = k(x′) (K + λI)^{-1} y.                      (7.2)

Consider performing regularized least squares on the model

    y = θ^T φ(x) + z,

where φ(x) is the feature vector corresponding to the kernel k. After obtaining an estimate θ̂, we perform the linear prediction

    θ̂^T φ(x′).

Sounds like a very difficult problem (especially if we have to perform the mapping φ).

Surprise! It turns out that this particular prediction is equal to the ŷ(x′) we derived earlier! The above simple expression actually equals the regularized least squares solution of a potentially very high-dimensional problem.
44 / 49
Linear Regression Revisited

How could this be true, given that φ looks very complicated?

Indeed, φ may be very complicated. But we are ultimately not interested in φ itself; we are only interested in the output ŷ(x′). It turns out that there is a simpler way to compute ŷ(x′) without ever having to map to φ.

This is the magic of the kernel method.

45 / 49
Linear Regression Revisited

More intuitive level

    ŷ(x′) = k(x′) (K + λI)^{-1} y.                      (7.3)

Rough intuition behind equation (7.3). The estimate ŷ is a weighted sum of the previously-observed outputs y1, . . . , yn. The more similar x′ is to the corresponding xt (i.e., the higher k(x′, xt)), the more weight is given to that yt. (But the similarities among the training points themselves also play a role through the presence of K.)

46 / 49
Effect of regularization

47 / 49
Section 8

Useful References

48 / 49
Useful References

Slide set lecture bo0.pdf from a one-day course I gave [1]
MIT lecture notes [2], lectures 6 and 7
Chapters 6 and 7 of Bishop's "Pattern Recognition and Machine Learning" book
Chapter 16 of the "Understanding Machine Learning" book
Kernel cookbook [3]
(Advanced) Lecture videos by Julien Mairal and Jean-Philippe Vert [4]

[1] https://www.comp.nus.edu.sg/~scarlett/gp_slides
[2] http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-867-machine-learning-fall-2006/lecture-notes/
[3] http://www.cs.toronto.edu/~duvenaud/cookbook/
[4] https://www.youtube.com/channel/UCotztBOmGVl9pPGIN4YqcRw/videos
49 / 49
