
MA4270: Data Modelling and Computation

Lecture 5: Kernel Methods



1 / 49
Section 1

Feature Space Mappings

2 / 49
Feature Space Mappings

In previous lectures, we looked at linear classifiers for binary labels:

    Predict positive label  ⟺  θ^T x + θ₀ > 0,

and also linear predictors for real-valued variables (linear regression):

    ŷ(x) = θ^T x + θ₀.

What if x and y are related in a highly non-linear manner?

- We can still use linear classification and regression methods!

3 / 49
Example: Fitting a quadratic

If we know that y is (well-approximated by) a quadratic function of a single input x, then we can use x to construct x̃ = [x x²]^T, and then perform linear regression with input x̃ and θ ∈ ℝ². Then ŷ = θ^T x̃ + θ₀ evaluates to

    ŷ = θ₂x² + θ₁x + θ₀,

an arbitrary quadratic function (three parameters θ₀, θ₁, θ₂)!
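To make this concrete, here is a minimal Python sketch (not from the slides; the synthetic data and the use of numpy are illustrative assumptions): map each scalar x to the feature vector [1, x, x²] and run ordinary least squares on the mapped inputs.

```python
# Minimal sketch: fit y ≈ θ2·x² + θ1·x + θ0 by mapping x to [1, x, x²]
# and performing linear regression in the mapped (feature) space.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=50)
y = 1.5 * x**2 - 0.7 * x + 0.3 + 0.1 * rng.standard_normal(50)  # synthetic quadratic data

X_tilde = np.column_stack([np.ones_like(x), x, x**2])   # feature map x -> [1, x, x²]
theta, *_ = np.linalg.lstsq(X_tilde, y, rcond=None)     # ordinary least squares
print(theta)  # approximately [0.3, -0.7, 1.5] = [θ0, θ1, θ2]
```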

4 / 49
Example: A circular classification region
Suppose that d = 2 and the binary labels are generated according to

    y_t = +1  if x1² + x2² ≤ 1,
          −1  otherwise,

yielding a circular region.

Any linear classifier with input x = [x1 x2]^T will perform poorly. But a linear classifier with input x̃ = [x1 x2 x1²+x2²]^T exists that classifies perfectly!
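As a hedged illustration of this point (scikit-learn's LogisticRegression is used purely as a stand-in for "any linear classifier"; it is not part of the lecture):

```python
# Sketch: the circular region is not linearly separable in (x1, x2),
# but becomes separable once the feature x1² + x2² is appended.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(400, 2))
y = np.where(X[:, 0]**2 + X[:, 1]**2 <= 1, +1, -1)

X_tilde = np.column_stack([X, X[:, 0]**2 + X[:, 1]**2])        # [x1, x2, x1² + x2²]
print(LogisticRegression().fit(X, y).score(X, y))              # mediocre (~0.8, close to majority vote)
print(LogisticRegression().fit(X_tilde, y).score(X_tilde, y))  # essentially perfect (~1.0)
```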
5 / 49
Example: A circular classification region

Visual illustration:
http://www.youtube.com/watch?v=3liCbRZPrZA

2D points at the bottom (not linearly separable), mapping to 3D shown above (linearly separable).

6 / 49
Feature Space Mappings

In general, we might hope to get a better classifier by mapping each x to a well-designed feature space:

    φ(x) = [φ₁(x), . . . , φ_N(x)]^T

for some real-valued functions φ₁, . . . , φ_N.

7 / 49
A word of caution!
We can always get to zero training error by adding more and more
features, but is that always a good idea? Which prediction rule will work
better in the following regression example?
[Figure: the same training data fit by a low-degree (e.g., degree-3) polynomial and by a much higher-degree polynomial that passes through every point; increasing the degree leads to overfitting.]
A more complex classifier achieves smaller training error, but runs into the
danger of overfitting. We will explore the notions of generalization error
and overfitting in a later lecture.
8 / 49
Section 2

Inner Products

9 / 49
Kernel Methods – Overview

Many machine learning algorithms depend on the data x1, . . . , xn only through the pairwise inner products ⟨xi, xj⟩.

Examples to come later: Ridge regression, SVM

Example explored in the project: K-nearest neighbors (given a new point, find the K closest points in the training set, and classify according to a majority vote)

10 / 49
Overview

Inner products capture the geometry of the data set, so one generally expects geometrically inspired algorithms (e.g., SVM) to depend only on inner products:

For algorithms that use distances, note that ‖x − x′‖² = ⟨x, x⟩ + ⟨x′, x′⟩ − 2⟨x, x′⟩, so distances can be expressed in terms of inner products.

For algorithms that use angles, note that angle(x, x′) = cos⁻¹( ⟨x, x′⟩ / (‖x‖ · ‖x′‖) ), so since ‖x‖ = √⟨x, x⟩, angles can also be expressed in terms of inner products.

11 / 49
Overview

We know that moving to feature spaces can help, so we could map each xi → φ(xi) and apply the algorithm using ⟨φ(xi), φ(xj)⟩.

A kernel function k(xi, xj) can be thought of as an inner product in a possibly implicit feature space.

Key idea. There are clever choices of the mapping φ(·) ensuring that we can efficiently compute ⟨φ(xi), φ(xj)⟩ without ever explicitly mapping to the feature space.

- In some cases, the feature space is infinite-dimensional, so we could not explicitly map to it even if we wanted to.

Kernel trick. In this lecture, we will build algorithms that only depend on the data through inner products of the form ⟨φ(xi), φ(xj)⟩. Moreover, these inner products can be computed very easily through a kernel function evaluation k(x, x′). Building algorithms that only depend on the inner product is called the ‘kernel trick’.

12 / 49
Overview

Intuition 1. The kernel function is a measure of similarity between xi and xj.

Intuition 2. The kernel trick applies to problems that depend on the geometry of the data (think of SVM, for example!). In particular, note that ‖x − x′‖² = ⟨x − x′, x − x′⟩ = ⟨x, x⟩ − 2⟨x, x′⟩ + ⟨x′, x′⟩, so algorithms depending on distances can be rewritten in terms of inner products.

13 / 49
Section 3

Formal definitions

14 / 49
Kernel methods – Formal definition
Definition

A function k : ℝ^d × ℝ^d → ℝ is said to be a positive semidefinite (PSD) kernel if
(i) it is symmetric, i.e., k(x, x′) = k(x′, x);
(ii) for any integer m > 0 and any set of inputs x1, . . . , xm in ℝ^d, the following matrix is positive semidefinite (i.e., all of its eigenvalues are non-negative):

        ⎡ k(x1, x1)  . . .  k(x1, xm) ⎤
    K = ⎢     ⋮       ⋱        ⋮     ⎥  ⪰ 0.
        ⎣ k(xm, x1)  . . .  k(xm, xm) ⎦

This matrix, with (i, j)-th entry equal to k(xi, xj), is called the kernel matrix (you might also see it referred to as the Gram matrix).
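As a quick numerical sanity check of this definition (an illustrative sketch, not part of the slides; it uses the RBF kernel introduced later in the lecture):

```python
# Build the kernel (Gram) matrix K for the RBF kernel on random inputs and
# verify numerically that K is symmetric and PSD (eigenvalues ≥ 0).
import numpy as np

def rbf_kernel(x, xp, ell=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / (2 * ell ** 2))

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))                       # 20 inputs in R^3
K = np.array([[rbf_kernel(xi, xj) for xj in X] for xi in X])

print(np.allclose(K, K.T))                             # symmetric
print(np.linalg.eigvalsh(K).min() >= -1e-10)           # eigenvalues ≥ 0 up to round-off
```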
15 / 49
Kernel methods

Theorem
A function k : ℝ^d × ℝ^d → ℝ is a PSD kernel if and only if it equals an inner product ⟨φ(x), φ(x′)⟩ for some (possibly infinite-dimensional) mapping φ(x).*

* In fact, this statement is a bit imprecise, since in general it may need to be a generalized notion of inner product beyond standard vector spaces (namely, Hilbert spaces). However, we will avoid such technicalities and focus on the standard inner product applied to real-valued vectors.

16 / 49


Kernel methods
"If" direction.

The "if" part is easy to show (at least when φ is finite-dimensional). Suppose such a mapping φ exists.

(i) One has k(x, x′) = ⟨φ(x), φ(x′)⟩ = ⟨φ(x′), φ(x)⟩ = k(x′, x); i.e., k is symmetric.

(ii) The Gram matrix is

        ⎡ k(x1, x1)  . . .  k(x1, xm) ⎤   ⎡ φ(x1)^T ⎤
    K = ⎢     ⋮       ⋱        ⋮     ⎥ = ⎢    ⋮    ⎥ [ φ(x1)  . . .  φ(xm) ].
        ⎣ k(xm, x1)  . . .  k(xm, xm) ⎦   ⎣ φ(xm)^T ⎦

In particular, K is positive semidefinite because for any z one has

    z^T K z = z^T Φ^T Φ z = ‖Φz‖² ≥ 0,

where Φ = [ φ(x1)  . . .  φ(xm) ].

17 / 49
Kernel methods
(Optional) "Only if" direction.

The "only if" part is more challenging, but can be understood fairly easily in the case that x only takes on finitely many values, using the idea of eigenvalue decomposition.

Supposing that x can only take values in a finite set {x1, . . . , xm}, the entire function is described by an m × m matrix K_full with (i, j)-th entry k(xi, xj). By assumption K_full is a PSD matrix. Hence it admits an eigenvalue decomposition of the form

    K_full = Σ_{j=1}^r λ_j v_j v_j^T,

where v_j ∈ ℝ^m (and r is the number of non-zero eigenvalues). Consider the feature map

    φ(xj) = [ √λ_1 (v_1)_j, . . . , √λ_r (v_r)_j ]^T,

an r-dimensional vector, where (v_i)_j denotes the j-th entry of the vector v_i. One can check that k(xi, xj) = ⟨φ(xi), φ(xj)⟩.
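A small numerical illustration of this construction (a sketch under the stated finite-value assumption; the particular kernel matrix used here is just an example):

```python
# Take a PSD kernel matrix K_full, build φ from its eigendecomposition,
# and check that inner products of the mapped points recover K_full.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 2))
K_full = np.exp(-0.5 * np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))  # example PSD matrix

lam, V = np.linalg.eigh(K_full)          # eigenvalues lam, eigenvectors as columns of V
lam = np.clip(lam, 0.0, None)            # clip tiny negative values caused by round-off
Phi = V * np.sqrt(lam)                   # row j of Phi is the feature vector φ(x_j)

print(np.allclose(Phi @ Phi.T, K_full))  # True: ⟨φ(x_i), φ(x_j)⟩ = k(x_i, x_j)
```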

18 / 49
Kernel methods

(Optional) "Only if" direction (continued).

Such an approach can be extended to more general scenarios via Mercer's theorem, and a fully general treatment is possible via the notion of a Reproducing Kernel Hilbert Space (RKHS).

19 / 49
Section 4

Examples

20 / 49
Example 1: Polynomial kernels

Start with the 1D setting, d = 1. Naively, one might map

    x ↦ (1, x, x²)

for quadratic features,

    x ↦ (1, x, x², x³)

for cubic features, and so on.

Similarly for higher dimensions, e.g., (x1, x2) ↦ (1, x1, x2, x1·x2, x1², x2²). Notice the presence of the cross-term x1·x2.

21 / 49
Example 1: Polynomial kernels

Now, considering the cubic example, we claim that


    x ↦ (1, √3·x, √3·x², x³)

is a "better" choice than the one above mapping to (1, x, x², x³).

First note that the set of all possible linear classifiers (meaning linear in φ(x)) is exactly the same regardless of the choice of mapping. This is because the middle two coefficients are just scaled up or down by √3.

But notice that the inner product simplifies nicely under the second choice:

    ⟨φ(x), φ(x′)⟩ = 1 + 3xx′ + 3x²(x′)² + x³(x′)³ = (1 + xx′)³.
22 / 49
Example 1: Polynomial kernels.

Generalizing this idea to polynomials of degree p and functions in d dimensions, we arrive at the polynomial kernel k(x, x′) = (1 + ⟨x, x′⟩)^p.

The computational savings of avoiding the construction of φ(x) can be much more significant than in the above example. Instead of computing a huge number of features (specifically, it can be shown to be (p+d choose d), which grows very quickly), the computation time is linear in d (do d multiplications, sum them, and apply (1 + result)^p).
23 / 49
Example 2: String kernels (not examinable)

Continuing the interpretation of a kernel being a measure of similarity, how


can we “measure similarity” between the following?
x1 = “This sentence is the first string in my data set”
x2 = “The second string in my data set is this sentence”
x3 = “Tihs sentance is the third stirng in my dataset”
One possible approach is to let k(x, x′) be the number of words appearing in both strings. This corresponds to a feature space with dimension equal to the total number of possible words (a very large dimension), and with

    φ_j(x) = 1{x contains the j-th word}.

This approach has limitations, such as not handling spelling errors well.
24 / 49
Example 2: String kernels (not examinable)

A more “robust” approach is the String Subsequence Kernel (SSK), which


looks at substrings that are not necessarily contiguous (e.g., “sentnce” is a
substring of both “sentence” and “sentance”), but decreases the weight
when this substring is spread across a longer length (e.g., “sentnce” is
also a substring of “sent this once”, but this is spread over a longer length
so carries less weight).
Can be computed somewhat-efficiently using dynamic programming
techniques.
If exact computation is still too demanding, approximate computation can be used instead.
The idea of using substrings instead of exact matches is also very
important in biology applications.

25 / 49
Other examples

Below we will introduce the commonly-used RBF kernel:


    k_RBF(x, x′) = exp( −(1/2) ‖x − x′‖² ),

which indicates that similarity exponentially decays to zero as the squared distance increases.
The kernel cookbook gives several other examples and insights.
For data represented in a not-so-standard format (e.g., text, graphs),
more unconventional/creative choices of kernels are often adopted.

26 / 49
Section 5

Constructing More Complicated Kernels from Simpler


Ones

27 / 49
Constructing More Complicated Kernels

Claim
If k1 and k2 are PSD kernels, then so are the following:

1. k(x, x′) = f(x) k1(x, x′) f(x′) for any function f
2. k(x, x′) = k1(x, x′) + k2(x, x′)
3. k(x, x′) = k1(x, x′) k2(x, x′)
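A quick numerical sanity check of properties 2 and 3 (an illustration only, using two concrete kernel matrices and the fact that a PSD matrix has non-negative eigenvalues):

```python
# The element-wise sum and product of two kernel (Gram) matrices built from
# PSD kernels are again PSD, matching closure properties 2 and 3.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((15, 3))
K1 = X @ X.T                                                # linear-kernel Gram matrix
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K2 = np.exp(-0.5 * sq)                                      # RBF-kernel Gram matrix

for K in (K1 + K2, K1 * K2):                                # sum and element-wise product
    print(np.linalg.eigvalsh(K).min() >= -1e-9)             # True, True (up to round-off)
```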

28 / 49
Constructing More Complicated Kernels

To set up notation, let φ^(1)(x) and φ^(2)(x) be the feature vectors corresponding to k1 and k2.

Proof of #1.
Let φ(x) = f(x) φ^(1)(x). Then

    ⟨φ(x), φ(x′)⟩ = f(x) ⟨φ^(1)(x), φ^(1)(x′)⟩ f(x′) = f(x) k1(x, x′) f(x′).

This means that k(x, x′) = ⟨φ(x), φ(x′)⟩, and hence k(x, x′) is a PSD kernel.

29 / 49
Constructing More Complicated Kernels

Proof of #2.
 (1)
(x)
Let (x) = (2) , and write
(x)

h (x), (x0 )i = h (x)(1) , (1)


(x0 )i+h (x)(2) , (2)
(x0 )i = k1 (x, x0 )+k2 (x, x0 ).

This means that k(x, x0 ) = h (x), (x0 )i, and hence k(x, x0 ) is a PSD
kernel.

30 / 49
Constructing More Complicated Kernels
Proof of #3.
Let φ̃(x) contain entries φ̃_ij(x) = φ^(1)_i(x) φ^(2)_j(x) for each i, j, where φ^(1)_i is the i-th feature in φ^(1) and similarly for φ^(2)_j.

We have

    ⟨φ̃(x), φ̃(x′)⟩ = Σ_{i,j} φ̃_ij(x) φ̃_ij(x′)
                   = Σ_{i,j} φ^(1)_i(x) φ^(2)_j(x) φ^(1)_i(x′) φ^(2)_j(x′)
                   = ( Σ_i φ^(1)_i(x) φ^(1)_i(x′) ) ( Σ_j φ^(2)_j(x) φ^(2)_j(x′) )
                   = k1(x, x′) k2(x, x′).

31 / 49
Section 6

RBF Kernel

32 / 49
Example 3: Radial basis function (RBF) kernel
Letting k1(x, x′) = k2(x, x′) = x^T x′ in Property 3 above, we see that (x^T x′)² is a PSD kernel.

Repeating this process, (x^T x′)^p is a PSD kernel for any integer p.

(Side note: A similar argument gives an alternative proof that the polynomial kernel (1 + x^T x′)^p is a PSD kernel.)

Recall the power series exp(t) = Σ_{j=0}^∞ t^j / j!. By applying Property 2 an "infinite number of times" (a formal proof of the validity of this is omitted here), we deduce that exp(x^T x′) is also a PSD kernel.

33 / 49
Example 3: Radial basis function (RBF) kernel

Finally, define

    k(x, x′) = exp( −(1/2) ‖x − x′‖² )                                         (6.1)
             = exp( −(1/2) ‖x‖² ) · exp( x^T x′ ) · exp( −(1/2) ‖x′‖² ),        (6.2)

and use Property 1 (with f(x) = exp(−(1/2)‖x‖²)) to deduce that k(x, x′) is a PSD kernel.

34 / 49
Example 3: Radial basis function (RBF) kernel

This kernel goes by several names: Radial basis function (RBF) kernel, Gaussian kernel, squared exponential kernel. It is usually defined with a length-scale parameter ℓ:

    k(x, x′) = exp( −‖x − x′‖² / (2ℓ²) ).

Such a parameter represents the rough scale over which the function varies.

More generally, we can have different length-scales in each dimension, ℓ = (ℓ1, . . . , ℓd).

The associated feature space is infinite-dimensional, as hinted by the fact that we used the infinite expansion e^t = Σ_{j=0}^∞ t^j / j! in its derivation.

35 / 49
Example 3: Radial basis function (RBF) kernel

An illustration of the sorts of classification regions that can be produced by


the kernel SVM (to be introduced in the next lecture) with an RBF kernel
[Figure: 2D training data (x1, x2) with +1 and −1 labels scattered in several clusters; the learned decision regions ("Classify as +1" / "Classify as −1") have complicated non-linear boundaries.]
Here the prediction rule is of the form ŷ = sign(g (x)) as usual, and
the colors in this figure represent the values of g (x).
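For readers who wish to reproduce this kind of picture, here is a hedged sketch using scikit-learn's SVC (an external-library choice for illustration; the kernel SVM itself is only introduced in the next lecture, and the synthetic labels are an assumption):

```python
# Train an SVM with an RBF kernel on 2D data and evaluate g(x) on a grid;
# sign(g(x)) gives non-linear classification regions like those shown above.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 2))
y = np.where(np.sin(X[:, 0]) * np.cos(X[:, 1]) > 0, +1, -1)   # a non-linear ground truth

clf = SVC(kernel="rbf", gamma=1.0).fit(X, y)                   # gamma = 1/(2·ℓ²)
xx, yy = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
g = clf.decision_function(np.column_stack([xx.ravel(), yy.ravel()]))
y_hat = np.sign(g).reshape(xx.shape)                           # prediction ŷ = sign(g(x))
print(clf.score(X, y))                                         # training accuracy
```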

36 / 49
Linear Regression Revisited

An example using the RBF kernel (with length-scale ℓ):

[Figure: 1D regression fits using the RBF kernel with different length-scales; a very small ℓ fits the training data perfectly (overfitting), while a very large ℓ gives an overly smooth fit (underfitting).]
- Observe that even among a given class of kernels, the choice of their parameter(s) may be very important (e.g., length-scale ℓ in this example, degree p in the polynomial example, etc.).
- Often the parameters are chosen using maximum likelihood.
- We will also cover model selection later, a special case of which is kernel selection.

37 / 49
Linear Regression Revisited

38 / 49
Section 7

Linear Regression Revisited

39 / 49
Linear Regression Revisited
Here we look at linear regression with kernels. In the next lecture we
will look at SVM with kernels. The same can be done for logistic
regression, but we will skip that.
We previously considered the regularized least squares estimator (ridge regression); in the case of no offset (θ0 = 0), it is written as follows (with X = [x1^T; . . . ; xn^T] ∈ ℝ^{n×d}, whose rows are the data points):

    θ̂ = arg min_θ ‖y − Xθ‖² + λ‖θ‖²,

and has the closed-form solution

    θ̂ = (X^T X + λI)^{-1} X^T y.

40 / 49
Linear Regression Revisited
Now let's apply some useful matrix manipulations. To help with readability, let I_d and I_n denote identity matrices with the size made explicit. First observe

    (X^T X + λI_d) X^T = X^T X X^T + λX^T = X^T (X X^T + λI_n).

Multiplying by (X^T X + λI_d)^{-1} on the left and (X X^T + λI_n)^{-1} on the right gives

    X^T (X X^T + λI_n)^{-1} = (X^T X + λI_d)^{-1} X^T,

meaning we obtain the following equivalent form for θ̂:

    θ̂ = X^T (X X^T + λI_n)^{-1} y.

Therefore, given a new input x′ ∈ ℝ^d, the prediction ŷ(x′) = θ̂^T x′ = (x′)^T θ̂ can be written as

    ŷ(x′) = (x′)^T X^T (X X^T + λI_n)^{-1} y.
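A quick numerical check of this identity (the dimensions and the value of λ are arbitrary illustrative choices):

```python
# Verify that (X^T X + λ I_d)^{-1} X^T y  equals  X^T (X X^T + λ I_n)^{-1} y.
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 30, 5, 0.1
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

theta_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)  # d×d system
theta_dual = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)    # n×n system
print(np.allclose(theta_primal, theta_dual))                        # True
```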
41 / 49
Linear Regression Revisited

Crucial observation. The prediction depends on the data only through


inner products, since

    (x′)^T X^T = [ ⟨x′, x1⟩  . . .  ⟨x′, xn⟩ ],

            ⎡ ⟨x1, x1⟩  . . .  ⟨x1, xn⟩ ⎤
    X X^T = ⎢     ⋮       ⋱       ⋮    ⎥   (the Gram matrix).
            ⎣ ⟨xn, x1⟩  . . .  ⟨xn, xn⟩ ⎦

42 / 49
Linear Regression Revisited

Therefore, we can apply the kernel trick and consider the more general prediction function

    ŷ(x′) = k(x′) (K + λI)^{-1} y,                      (7.1)

where the inner products are replaced by kernel evaluations:

    k(x′) = [ k(x′, x1)  . . .  k(x′, xn) ],

        ⎡ k(x1, x1)  . . .  k(x1, xn) ⎤
    K = ⎢     ⋮       ⋱        ⋮     ⎥ .
        ⎣ k(xn, x1)  . . .  k(xn, xn) ⎦

This is known as kernel ridge regression.
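A minimal sketch of kernel ridge regression as in (7.1), assuming an RBF kernel; the length-scale ℓ, the regularization λ, and the synthetic data are illustrative choices only:

```python
# ŷ(x') = k(x') (K + λI)^{-1} y with an RBF kernel.
import numpy as np

def rbf_kernel_matrix(A, B, ell=0.5):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * ell**2))

rng = np.random.default_rng(0)
x_train = rng.uniform(-3, 3, size=(40, 1))
y_train = np.sin(x_train[:, 0]) + 0.1 * rng.standard_normal(40)

lam = 1e-2
K = rbf_kernel_matrix(x_train, x_train)                      # n×n kernel matrix
alpha = np.linalg.solve(K + lam * np.eye(len(K)), y_train)   # (K + λI)^{-1} y

x_test = np.linspace(-3, 3, 200)[:, None]
y_pred = rbf_kernel_matrix(x_test, x_train) @ alpha          # ŷ(x') for each test input
```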

43 / 49
Linear Regression Revisited
But what is this estimator doing?
    ŷ(x′) = k(x′) (K + λI)^{-1} y.                      (7.2)

Consider performing regularized least squares on the model

    y = θ^T φ(x) + z,

where φ(x) is the feature vector corresponding to the kernel k. After obtaining an estimate θ̂, we perform the linear prediction

    θ̂^T φ(x′).

Sounds like a very difficult problem (especially if we have to perform the mapping φ).

Surprise! It turns out that this particular prediction is equal to the ŷ(x′) we derived earlier! The above simple expression actually equals the regularized least squares solution of a potentially very high-dimensional problem.
44 / 49
Linear Regression Revisited

How could this be true, given that φ looks very complicated?

Indeed, φ may be very complicated. But we are ultimately not interested in φ itself; we are only interested in the output ŷ(x′). It turns out that there is a simpler way to compute ŷ(x′) without ever having to map to φ.

This is the magic of the kernel method.

45 / 49
Linear Regression Revisited

More intuitive level

    ŷ(x′) = k(x′) (K + λI)^{-1} y.                      (7.3)

Rough intuition behind equation (7.3). The estimate ŷ is a weighted sum of the previously-observed outputs y1, . . . , yn. The more similar x′ is to the corresponding xt (i.e., the higher k(x′, xt)), the more weight is given to that yt. (But the similarities among the training points themselves also play a role through the presence of K.)

46 / 49
Effect of regularization

47 / 49
Section 8

Useful References

48 / 49
Useful References

Slide set lecture bo0.pdf from a one-day course I gave [1]
MIT lecture notes [2], lectures 6 and 7
Chapters 6 and 7 of Bishop's "Pattern Recognition and Machine Learning" book
Chapter 16 of the "Understanding Machine Learning" book
Kernel cookbook [3]
(Advanced) Lecture videos by Julien Mairal and Jean-Philippe Vert [4]

[1] https://www.comp.nus.edu.sg/~scarlett/gp_slides
[2] http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-867-machine-learning-fall-2006/lecture-notes/
[3] http://www.cs.toronto.edu/~duvenaud/cookbook/
[4] https://www.youtube.com/channel/UCotztBOmGVl9pPGIN4YqcRw/videos
49 / 49
