Lecture 05
1 / 49
Section 1
Feature Space Mappings
2 / 49
Feature Space Mappings
3 / 49
Example: Fitting a quadratic
4 / 49
Example: A circular classification region
Suppose that d = 2 and the binary labels are generated according to
$$y_t = \begin{cases} +1 & x_1^2 + x_2^2 \le 1 \\ -1 & \text{otherwise}, \end{cases}$$
yielding a circular region.
Any linear classifier with input x = [x_1  x_2]^T will perform poorly. But a linear classifier with input x̃ = [x_1  x_2  x_1^2 + x_2^2]^T exists that classifies perfectly!
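As a minimal sketch (not from the slides; the data-generation setup is an assumed example), the linear rule with w = [0, 0, -1] and b = 1 applied to the augmented features x̃ reproduces the circular region exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(500, 2))            # 2-d inputs
y = np.where((X ** 2).sum(axis=1) <= 1, 1, -1)   # +1 inside the unit circle

# Augmented feature map: x_tilde = [x1, x2, x1^2 + x2^2]
X_tilde = np.column_stack([X, (X ** 2).sum(axis=1)])

# A linear classifier in the augmented space: sign(w . x_tilde + b)
w, b = np.array([0.0, 0.0, -1.0]), 1.0
y_hat = np.sign(X_tilde @ w + b)

print("training accuracy:", np.mean(y_hat == y))  # prints 1.0 (perfect separation on this sample)
```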
5 / 49
Example: A circular classification region
Visual illustration:
http://www.youtube.com/watch?v=3liCbRZPrZA
(The video shows 2-d data of two colors being lifted into 3-d, where the two classes become linearly separable.)
6 / 49
Feature Space Mappings
We can create a more interesting classifier by applying certain functions to the data, i.e., by mapping each input through a feature mapping
$$\phi(x) = [\phi_1(x), \ldots, \phi_N(x)]^T,$$
and then training a linear classifier on the new features.
The tricky part: how do we know which functions $\phi_1, \ldots, \phi_N$ to use?
7 / 49
A word of caution!
We can always get to zero training error by adding more and more
features, but is that always a good idea? Which prediction rule will work
better in the following regression example?
[Figure: the same regression data fit with a degree-3 polynomial and with a more flexible kernel-based fit.]
A more complex classifier achieves smaller training error, but runs into the
danger of overfitting. We will explore the notions of generalization error
and overfitting in a later lecture.
8 / 49
Section 2
Inner Products
9 / 49
Kernel Methods – Overview
10 / 49
Overview
Inner products capture the geometry of the data set, so one generally expects geometrically inspired algorithms (e.g., SVM, the maximum-margin classifier) to depend only on inner products:
For algorithms that use distances, note that
$$\|x - x'\|^2 = \langle x, x\rangle + \langle x', x'\rangle - 2\langle x, x'\rangle,$$
so distances can be expressed in terms of inner products.
For algorithms that use angles, note that
$$\mathrm{angle}(x, x') = \cos^{-1}\frac{\langle x, x'\rangle}{\|x\|\cdot\|x'\|},$$
and since $\|x\| = \sqrt{\langle x, x\rangle}$, angles can also be expressed in terms of inner products.
(A quick numerical check of these identities is sketched below.)
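A minimal numerical sketch (not from the slides; the two vectors are arbitrary examples), querying only inner products:

```python
import numpy as np

x, xp = np.array([1.0, 2.0, 3.0]), np.array([-1.0, 0.5, 2.0])

# Query only inner products.
ip = lambda a, b: float(a @ b)

# Squared distance from inner products: <x,x> + <x',x'> - 2<x,x'>.
dist_sq = ip(x, x) + ip(xp, xp) - 2 * ip(x, xp)
print(np.isclose(dist_sq, np.sum((x - xp) ** 2)))   # True

# Angle from inner products: arccos(<x,x'> / (||x|| ||x'||)), using ||x|| = sqrt(<x,x>).
angle = np.arccos(ip(x, xp) / np.sqrt(ip(x, x) * ip(xp, xp)))
print(np.isclose(angle, np.arccos(x @ xp / (np.linalg.norm(x) * np.linalg.norm(xp)))))  # True
```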
11 / 49
Overview
We know that moving to feature spaces can help, so we could map each x_i → φ(x_i) and apply the algorithm using ⟨φ(x_i), φ(x_j)⟩.
A kernel function k(x_i, x_j) can be thought of as an inner product in a possibly implicit feature space.
Key idea. There are clever choices of the mapping φ(·) ensuring that we can efficiently compute ⟨φ(x_i), φ(x_j)⟩ without ever explicitly mapping to the feature space; we never need to evaluate φ itself, only the inner product.
In some cases, the feature space is infinite-dimensional, so we could not explicitly map to it even if we wanted to.
12 / 49
Overview
13 / 49
Section 3
Formal definitions
14 / 49
Kernel methods – Formal definition
Definition. A function k : R^d × R^d → R (taking in two vectors) is called a positive semi-definite (PSD) kernel if, for any m and any points x_1, . . . , x_m, the matrix
$$K = \begin{bmatrix} k(x_1, x_1) & \cdots & k(x_1, x_m) \\ \vdots & \ddots & \vdots \\ k(x_m, x_1) & \cdots & k(x_m, x_m) \end{bmatrix} \succeq 0.$$
This matrix, with (i, j)-th entry equal to k(x_i, x_j), is called the kernel matrix (you might also see it referred to as the Gram matrix).
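A minimal sketch (not from the slides) of forming the kernel matrix for a given kernel and checking that it is PSD; the degree-3 polynomial kernel and the random points are assumed examples:

```python
import numpy as np

def kernel_matrix(X, k):
    """Form the m x m kernel (Gram) matrix with entries K[i, j] = k(x_i, x_j)."""
    m = X.shape[0]
    K = np.empty((m, m))
    for i in range(m):
        for j in range(m):
            K[i, j] = k(X[i], X[j])
    return K

# Example kernel: degree-3 polynomial kernel (assumed choice for illustration).
k_poly = lambda x, xp: (1.0 + x @ xp) ** 3

X = np.random.default_rng(0).normal(size=(6, 2))
K = kernel_matrix(X, k_poly)

# PSD check: all eigenvalues of the symmetric matrix K are >= 0 (up to round-off).
print(np.linalg.eigvalsh(K).min() >= -1e-9)   # True
```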
15 / 49
Kernel methods
Theorem. A function k : R^d × R^d → R is a PSD kernel if and only if it equals an inner product ⟨φ(x), φ(x')⟩ for some (possibly infinite-dimensional) mapping φ(x).

In fact, this statement is a bit imprecise, since in general it may need to be a generalized notion of inner product beyond standard vector spaces (namely, Hilbert spaces). However, we will avoid such technicalities and focus on the standard inner product applied to real-valued vectors; in this course we consider only the finite-dimensional case.

"If" direction: if k(x, x') = ⟨φ(x), φ(x')⟩ for some mapping φ, then the kernel matrix K turns out to be PSD, since
$$z^T K z = z^T \Phi^T \Phi z = \|\Phi z\|^2 \ge 0,$$
where
$$\Phi = [\phi(x_1) \;\cdots\; \phi(x_m)].$$
17 / 49
Kernel methods
**(Optional)** "Only if" direction.
The "only if" part is more challenging, but can be understood fairly easily in the case that x only takes on finitely many values, using the idea of eigenvalue decomposition.
Supposing that x can only take values in a finite set {x_1, . . . , x_m}, the entire function is described by an m × m matrix K_full with (i, j)-th entry k(x_i, x_j). By assumption K_full is a PSD matrix. Hence it admits an eigenvalue decomposition of the form
$$K_{\mathrm{full}} = \sum_{j=1}^{r} \lambda_j v_j v_j^T,$$
where v_j ∈ R^m and r is the rank. Consider the feature map
$$\phi(x_j) = \begin{bmatrix} \sqrt{\lambda_1}\,(v_1)_j \\ \vdots \\ \sqrt{\lambda_r}\,(v_r)_j \end{bmatrix},$$
an r-dimensional vector whose i-th entry uses $(v_i)_j$, the j-th entry of the vector $v_i$.
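A minimal numerical sketch of this construction (not from the slides; the kernel and points are assumed examples): build K_full, take its eigendecomposition, and verify that the resulting feature map reproduces the kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))                      # the finite set {x_1, ..., x_m}
k = lambda x, xp: (1.0 + x @ xp) ** 2            # an assumed PSD kernel for illustration

K_full = np.array([[k(xi, xj) for xj in X] for xi in X])

# Eigendecomposition K_full = sum_j lam_j v_j v_j^T (drop numerically zero eigenvalues).
lam, V = np.linalg.eigh(K_full)
keep = lam > 1e-10
lam, V = lam[keep], V[:, keep]

# Feature map: phi(x_j) has entries sqrt(lam_i) * (v_i)_j.
Phi = V * np.sqrt(lam)           # row j of Phi is phi(x_j)

# Verify <phi(x_i), phi(x_j)> = k(x_i, x_j) for all i, j.
print(np.allclose(Phi @ Phi.T, K_full))   # True
```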
18 / 49
Kernel methods
19 / 49
Section 4
Examples
20 / 49
Example 1: Polynomial kernels
x ↦ (1, x, x²)
21 / 49
Example 1: Polynomial kernels
With either choice of feature map we will be fitting the same type of curves; the differences are only in terms of computation.
But notice the inner product simplifies nicely under the second choice:
$$\langle \phi(x), \phi(x')\rangle = 1 + 3xx' + 3x^2 (x')^2 + x^3 (x')^3 = (1 + xx')^3,$$
so the inner product is easier to compute.
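As a sketch, the explicit feature map φ(x) = (1, √3 x, √3 x², x³) (an assumption consistent with the expansion above) has exactly this inner product:

```python
import numpy as np

def phi(x):
    # Explicit feature map whose inner product matches (1 + x x')^3 for scalar x.
    return np.array([1.0, np.sqrt(3) * x, np.sqrt(3) * x**2, x**3])

def k_poly3(x, xp):
    # Kernel evaluation: no explicit features needed.
    return (1.0 + x * xp) ** 3

x, xp = 0.7, -1.3
print(np.isclose(phi(x) @ phi(xp), k_poly3(x, xp)))   # True
```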
22 / 49
Example 1: Polynomial kernels.
The computational savings of avoiding the construction of φ(x) can be much more significant than in the above example.
Instead of computing a huge number of features (specifically, the number can be shown to be $\binom{p+d}{d}$, where p is the degree of the polynomial, and this grows very large very quickly), the computation time is linear in d: do d multiplications, sum them, and apply $(1 + \mathrm{result})^p$.
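A small sketch (not from the slides; d and p are arbitrary) contrasting the feature count with the linear-in-d kernel evaluation:

```python
import numpy as np
from math import comb

d, p = 100, 5
print(comb(p + d, d))          # number of features: 96,560,646 -- huge

# Kernel evaluation needs only d multiplications, a sum, and one power.
x, xp = np.random.default_rng(1).normal(size=(2, d))
k_val = (1.0 + x @ xp) ** p    # O(d) time, no explicit feature vector
print(k_val)
```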
23 / 49
**Example 2: String kernels** (not examinable; outside the examinable syllabus)
This approach has limitations, such as not handling spelling errors well. Note also that this feature vector has a very large dimension.
24 / 49
**Example 2: String kernels**
25 / 49
Other examples
26 / 49
Section 5
Constructing More Complicated Kernels
27 / 49
Constructing More Complicated Kernels
Claim
If k_1 and k_2 are PSD kernel functions, then so are the following:
1. k(x, x') = f(x) k_1(x, x') f(x') for any function f;
2. k(x, x') = k_1(x, x') + k_2(x, x') (the sum of PSD kernels is also a PSD kernel);
3. k(x, x') = k_1(x, x') k_2(x, x').
(Use the previous theorem to prove these; a numerical sanity check is also sketched below.)
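A minimal numerical sanity check (not a proof, and not from the slides; the kernels k_1, k_2 and the function f are assumed examples) that these constructions yield PSD kernel matrices on random points:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2))

gram = lambda k: np.array([[k(xi, xj) for xj in X] for xi in X])
min_eig = lambda K: np.linalg.eigvalsh(K).min()

k1 = lambda x, xp: (1.0 + x @ xp) ** 2                    # PSD (polynomial kernel)
k2 = lambda x, xp: np.exp(-0.5 * np.sum((x - xp) ** 2))   # PSD (RBF kernel)
f = lambda x: np.sin(x[0]) + 2.0                          # an arbitrary function

k_rule1 = lambda x, xp: f(x) * k1(x, xp) * f(xp)          # rule 1
k_rule2 = lambda x, xp: k1(x, xp) + k2(x, xp)             # rule 2

for k in (k_rule1, k_rule2):
    print(min_eig(gram(k)) >= -1e-9)                      # True: kernel matrix is PSD
```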
28 / 49
Constructing More Complicated Kernels
Proof of #1. Let φ(x) = f(x) φ^(1)(x), where φ^(1) is a feature map satisfying k_1(x, x') = ⟨φ^(1)(x), φ^(1)(x')⟩, so that ⟨φ(x), φ(x')⟩ = f(x) k_1(x, x') f(x').
This means that k(x, x') = ⟨φ(x), φ(x')⟩, and hence k(x, x') is a PSD kernel.
29 / 49
Constructing More Complicated Kernels
Proof of #2.
Let
$$\phi(x) = \begin{bmatrix} \phi^{(1)}(x) \\ \phi^{(2)}(x) \end{bmatrix},$$
and write
$$\langle \phi(x), \phi(x')\rangle = \langle \phi^{(1)}(x), \phi^{(1)}(x')\rangle + \langle \phi^{(2)}(x), \phi^{(2)}(x')\rangle = k_1(x, x') + k_2(x, x').$$
This means that k(x, x') = ⟨φ(x), φ(x')⟩, and hence k(x, x') is a PSD kernel.
30 / 49
Constructing More Complicated Kernels
Proof of #3.
Let $\tilde\phi(x)$ contain entries $\tilde\phi_{ij}(x) = \phi_i^{(1)}(x)\,\phi_j^{(2)}(x)$ for each i, j, where $\phi_i^{(1)}$ is the i-th feature in $\phi^{(1)}$ and similarly for $\phi_j^{(2)}$.
We have
$$\langle \tilde\phi(x), \tilde\phi(x')\rangle = \sum_{i,j} \tilde\phi_{ij}(x)\,\tilde\phi_{ij}(x') = \sum_{i,j} \phi_i^{(1)}(x)\,\phi_j^{(2)}(x)\,\phi_i^{(1)}(x')\,\phi_j^{(2)}(x') = \Big(\sum_i \phi_i^{(1)}(x)\,\phi_i^{(1)}(x')\Big)\Big(\sum_j \phi_j^{(2)}(x)\,\phi_j^{(2)}(x')\Big) = k_1(x, x')\,k_2(x, x').$$
31 / 49
Section 6
RBF Kernel
32 / 49
Example 3: Radial basis function (RBF) kernel
33 / 49
Example 3: Radial basis function (RBF) kernel
The RBF kernel is defined by
$$k(x, x') = \exp\!\Big(-\tfrac{1}{2}\|x - x'\|^2\Big) \qquad (6.1)$$
$$\phantom{k(x, x')} = \exp\!\Big(-\tfrac{1}{2}\|x\|^2\Big)\cdot \exp\!\big(x^T x'\big)\cdot \exp\!\Big(-\tfrac{1}{2}\|x'\|^2\Big). \qquad (6.2)$$
Since $\exp(x^T x')$ is a PSD kernel, and since (by the earlier claim) $f(x)\,k_1(x, x')\,f(x')$ is also PSD for any function f, the factorization (6.2) with $f(x) = \exp(-\tfrac{1}{2}\|x\|^2)$ shows that the RBF kernel is PSD.
34 / 49
Example 3: Radial basis function (RBF) kernel
This kernel goes by several names: radial basis function (RBF) kernel, Gaussian kernel, squared exponential kernel. It is usually defined with a length-scale parameter ℓ, which represents the size or width of the kernel:
$$k(x, x') = \exp\!\Big(-\frac{1}{2\ell^2}\|x - x'\|^2\Big).$$
Such a parameter represents the rough scale over which the function varies.
More generally, we can have a different length-scale in each dimension, ℓ = (ℓ_1, . . . , ℓ_d).
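A minimal sketch (not from the slides; the inputs and length-scales are arbitrary examples) of the RBF kernel with a length-scale parameter, including the per-dimension variant:

```python
import numpy as np

def rbf(x, xp, ell=1.0):
    """RBF / Gaussian / squared-exponential kernel with length-scale(s) ell.

    ell may be a scalar or an array of per-dimension length-scales (ell_1, ..., ell_d).
    """
    diff = (np.asarray(x) - np.asarray(xp)) / np.asarray(ell)
    return np.exp(-0.5 * np.sum(diff ** 2))

x, xp = np.array([0.0, 0.0]), np.array([1.0, 2.0])
print(rbf(x, xp, ell=1.0))            # short length-scale: small similarity
print(rbf(x, xp, ell=5.0))            # longer length-scale: similarity closer to 1
print(rbf(x, xp, ell=[1.0, 4.0]))     # different length-scale per dimension
```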
35 / 49
Example 3: Radial basis function (RBF) kernel
[Figure: +1 and −1 training points scattered in the (x_1, x_2) plane, with the learned regions labeled "Classify as +1" and "Classify as −1".]
Here the prediction rule is of the form ŷ = sign(g (x)) as usual, and
the colors in this figure represent the values of g (x).
The RBF kernel is powerful for describing complicated regions of this kind.
36 / 49
Linear Regression Revisited (see the example in 1-D)
An example using the RBF kernel (with length-scale ℓ, which describes the width of the kernel):
[Figure: 1-D regression fits for several length-scales: a small ℓ overfits, a large ℓ underfits, and an intermediate ℓ fits the data well, so the length-scale can be seen as defining the complexity of the fit.]
Observe that even among a given class of kernels, the choice of their parameter(s) may be very important (e.g., length-scale ℓ in this example, degree p in the polynomial example, etc.).
Often the parameters are chosen using maximum likelihood.
We will also cover model selection later, a special case of which is kernel selection.
37 / 49
Linear Regression Revisited
38 / 49
Section 7
Linear Regression Revisited (using the idea of kernels)
39 / 49
Linear Regression Revisited
Here we look at linear regression with kernels. In the next lecture we will look at SVM with kernels. The same can be done for logistic regression, but we will skip that.
We previously considered the regularized least squares estimator (ridge regression); in the case of no offset (θ_0 = 0), it is written as follows, with $X \in \mathbb{R}^{n\times d}$ having rows $x_1^T, \ldots, x_n^T$:
$$\hat\theta = \arg\min_{\theta} \|y - X\theta\|^2 + \lambda\|\theta\|^2$$
(a least-squares objective plus a penalty term), and it has the closed-form solution
$$\hat\theta = (X^T X + \lambda I)^{-1} X^T y.$$
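A minimal sketch of this closed-form solution (not from the slides; the synthetic data and λ = 0.1 are assumed examples):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 3, 0.1
X = rng.normal(size=(n, d))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=n)

# Closed-form ridge solution: theta_hat = (X^T X + lam I)^(-1) X^T y
theta_hat = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(theta_hat)   # close to theta_true
```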
40 / 49
Linear Regression Revisited
Now let's apply some useful matrix manipulations. To help with readability, let I_d and I_n denote identity matrices with the size made explicit (recall $X \in \mathbb{R}^{n\times d}$, where n is the number of data points and d is the input dimension). First observe that
$$X^T (XX^T + \lambda I_n)^{-1} = (X^T X + \lambda I_d)^{-1} X^T,$$
meaning we obtain the following equivalent form for θ̂ (starting from the closed-form solution above):
$$\hat\theta = X^T (XX^T + \lambda I_n)^{-1} y.$$
The prediction at a new input x_0 is then $\hat y(x_0) = (x_0)^T\hat\theta = (x_0)^T X^T (XX^T + \lambda I_n)^{-1} y$, which depends on the data only through inner products:
$$(x_0)^T X^T = \begin{bmatrix} \langle x_0, x_1\rangle \\ \vdots \\ \langle x_0, x_n\rangle \end{bmatrix}^T, \qquad XX^T = \begin{bmatrix} \langle x_1, x_1\rangle & \cdots & \langle x_1, x_n\rangle \\ \vdots & \ddots & \vdots \\ \langle x_n, x_1\rangle & \cdots & \langle x_n, x_n\rangle \end{bmatrix} \;\text{(the Gram matrix)}.$$
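A quick numerical check of the matrix identity above (not from the slides; the sizes and λ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 20, 4, 0.3
X = rng.normal(size=(n, d))

lhs = X.T @ np.linalg.inv(X @ X.T + lam * np.eye(n))   # d x n
rhs = np.linalg.inv(X.T @ X + lam * np.eye(d)) @ X.T   # d x n
print(np.allclose(lhs, rhs))                           # True
```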
42 / 49
Linear Regression Revisited
Therefore, we can apply the kernel trick, replacing each inner product by k(·, ·), and consider the more general prediction function
$$\hat y(x_0) = k(x_0)(K + \lambda I)^{-1} y, \qquad (7.1)$$
where
$$k(x_0) = \begin{bmatrix} k(x_0, x_1) \\ \vdots \\ k(x_0, x_n) \end{bmatrix}^T, \qquad K = \begin{bmatrix} k(x_1, x_1) & \cdots & k(x_1, x_n) \\ \vdots & \ddots & \vdots \\ k(x_n, x_1) & \cdots & k(x_n, x_n) \end{bmatrix}.$$
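A minimal sketch of this prediction rule (not from the slides; the 1-D data, the RBF kernel choice, and the parameter values are assumed):

```python
import numpy as np

rng = np.random.default_rng(0)

# Training data: noisy observations of a nonlinear function (assumed example).
x_train = np.linspace(-3, 3, 30)
y_train = np.sin(x_train) + 0.1 * rng.normal(size=x_train.size)

ell, lam = 0.8, 0.1                                    # length-scale and regularizer
k = lambda a, b: np.exp(-0.5 * (a - b) ** 2 / ell**2)  # RBF kernel on scalars

K = k(x_train[:, None], x_train[None, :])              # n x n kernel matrix
alpha = np.linalg.solve(K + lam * np.eye(x_train.size), y_train)

def predict(x0):
    """Prediction rule (7.1): y_hat(x0) = k(x0) (K + lam I)^(-1) y."""
    return k(x0, x_train) @ alpha

print(predict(0.5), np.sin(0.5))   # prediction vs. underlying function
```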
43 / 49
Linear Regression Revisited
But what is this estimator doing? This is kernel ridge regression: write the model as
$$y = \theta^T \phi(x) + z, \qquad (*)$$
and apply regularized least squares to (*); this recovers the formula on page 41 with x replaced by φ(x), which leads exactly to the prediction rule (7.1). The important thing is the predicted value ŷ(x); we do not have to worry about computing φ (or θ) explicitly.
45 / 49
Linear Regression Revisited
46 / 49
Effect of regularization
47 / 49
Section 8
Useful References
48 / 49
Useful References
1. https://www.comp.nus.edu.sg/~scarlett/gp_slides
2. http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-867-machine-learning-fall-2006/lecture-notes/
3. http://www.cs.toronto.edu/~duvenaud/cookbook/
4. https://www.youtube.com/channel/UCotztBOmGVl9pPGIN4YqcRw/videos
49 / 49