
Lecture 16
The Kernel Trick

Kashif Javed
EED, UET, Lahore

Readings:
▪ https://people.eecs.berkeley.edu/~jrs/189/
Motivation

Φ : ℝ² → ℝ³,  (x₁, x₂) ↦ (z₁, z₂, z₃) = (x₁², x₂², √2 x₁x₂)

https://people.eecs.berkeley.edu/~jordan/courses/281B-spring04/lectures/lec3.pdf
Kernels
• Recall featurizing map Φ ∶ ℝ𝑑 → ℝ𝐷
• 𝑑 input features; 𝐷 features after featurization (Φ)
• Degree-𝑝 polynomials blow up to 𝐷 ∈ Θ(𝑑ᵖ) features.
• When 𝑑 and 𝑝 are not small, this gets computationally intractable really
fast.
• As I said in Lecture 4, if you have 100 features per feature vector and you
want to use degree-4 polynomial decision functions, then each featurized
feature vector has a length of roughly 100 million.
• Today, magically, we use those features without computing them!
Kernels
• Observation: In many learning algorithms,
– the weights can be written as a linear combo of training points, &
– we can use inner products of Φ(𝑥)’s only ⇒ don’t need to compute Φ(𝑥)!

• Algos that have the first property include ridge regression, logistic
regression, perceptron and SVM.
• Kernelization is used in many different well-known algos.
• The second property is used to speed up kernelization.
Kernels
• Observation: In many learning algorithms,
– the weights can be written as a linear combo of training points, &

• Suppose optimal weight 𝑤 = 𝑋ᵀ𝑎 = Σⁿᵢ₌₁ 𝑎ᵢ 𝑋ᵢ for some 𝑎 ∈ ℝⁿ

• Substitute this identity into alg. and optimize 𝑛 dual weights 𝑎 (aka dual
parameters) instead of 𝐷 primal weights 𝑤

Kernel Ridge Regression
• To kernelize ridge regression, we need the weights to be a linear
combination of the training points.

• Unfortunately, that only happens if we penalize the bias term 𝑤_{𝑑+1} = 𝛼, as these normal equations do:

(𝑋ᵀ𝑋 + 𝜆𝐼) 𝑤 = 𝑋ᵀ𝑦

• Fortunately, when we center 𝑋 and 𝑦, the “expected” value of the bias term is zero. The actual bias won’t usually be exactly zero, but it will often be close enough that we won’t do much harm by penalizing the bias term.
Kernel Ridge Regression
• Center 𝑋 and 𝑦 so their means are zero: 𝑋𝑖 ← 𝑋𝑖 − 𝜇𝑋 , 𝑦𝑖 ← 𝑦𝑖 − 𝜇𝑦

• 𝑋𝑖,𝑑+1 = 1 [don’t center the 1’s!]

• This lets us replace 𝐼 ′ with 𝐼 in normal equations:

(𝑋ᵀ𝑋 + 𝜆𝐼) 𝑤 = 𝑋ᵀ𝑦
Kernel Ridge Regression
• Suppose 𝑎 is a solution to

(𝑋𝑋 𝑇 + 𝜆𝐼) 𝑎 = 𝑦 (Always has a solution if 𝜆 > 0)

• Then 𝑋 𝑇 𝑦 = 𝑋 𝑇 𝑋𝑋 𝑇 𝑎 + 𝜆 𝑋 𝑇 𝑎 = (𝑋 𝑇 𝑋 + 𝜆𝐼) 𝑋 𝑇 𝑎
• (𝑋 𝑇 𝑋 + 𝜆𝐼) 𝑤 = 𝑋 𝑇 𝑦
• Therefore, 𝑤 = 𝑋ᵀ𝑎 is a solution to the normal equations, and 𝑤 is a linear combo of training points!
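• A quick numerical sanity check (a numpy sketch, not part of the original slides; the data, dimensions, and 𝜆 below are made up) that solving the dual system and setting 𝑤 = 𝑋ᵀ𝑎 recovers a solution of the primal normal equations:

import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 5, 20, 0.1                     # d > n: the regime where the dual form pays off
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

a = np.linalg.solve(X @ X.T + lam * np.eye(n), y)   # dual weights: (XXᵀ + λI) a = y
w = X.T @ a                                          # primal weights recovered from a

# w satisfies the primal normal equations (XᵀX + λI) w = Xᵀy
print(np.allclose((X.T @ X + lam * np.eye(d)) @ w, X.T @ y))   # prints True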
Kernel Ridge Regression
• By solving this eq., (𝑋𝑋 𝑇 + 𝜆𝐼) 𝑎 = 𝑦 , we get a solution 𝑎

• Multiplying this solution by 𝑋ᵀ (i.e., 𝑤 = 𝑋ᵀ𝑎) gives a solution to the original problem:

(𝑋ᵀ𝑋 + 𝜆𝐼) 𝑤 = 𝑋ᵀ𝑦
Kernel Ridge Regression
• 𝑎 is a dual solution; solves the dual form of ridge regression:

Find 𝑎 that minimizes ‖𝑋𝑋ᵀ𝑎 − 𝑦‖² + 𝜆‖𝑋ᵀ𝑎‖²

• We obtain this dual form by substituting 𝑤 = 𝑋ᵀ𝑎 into the original ridge regression cost function.
Kernel Ridge Regression
• Training: Solve (𝑋𝑋 𝑇 + 𝜆𝐼) 𝑎 = 𝑦 for 𝑎

• Testing: Regression fn is

ℎ(𝑧) = 𝑤ᵀ𝑧 = 𝑎ᵀ𝑋𝑧 = Σⁿᵢ₌₁ 𝑎ᵢ 𝑋ᵢᵀ𝑧 ⇐ a weighted sum of inner products of training and test pts
Kernel Ridge Regression
• Let 𝑘(𝑥, 𝑧) = 𝑥 𝑇 𝑧 be kernel fn.

• Later, we’ll replace 𝑥 and 𝑧 with Φ(𝑥) and Φ(𝑧), and that’s where the
magic will happen.

• Let 𝐾 = 𝑋𝑋 𝑇 be 𝑛 × 𝑛 kernel matrix.


• Each entry of this matrix is 𝐾𝑖𝑗 = 𝑘(𝑋𝑖 , 𝑋𝑗 ).

Kernel Ridge Regression
• 𝐾 may be singular. If so, probably no solution if 𝜆 = 0. Then we must
choose a positive 𝜆. But that’s okay.

• Always singular if 𝑛 > 𝑑 + 1. But don’t worry about the case 𝑛 > 𝑑 + 1,
because you would only want to use the dual form when 𝑑 > 𝑛, i.e., for
polynomial features.

• But 𝐾 could still be singular when 𝑑 > 𝑛.


Kernel Ridge Regression
• Dual/kernel ridge regr. alg:
∀𝑖, 𝑗: 𝐾ᵢⱼ ← 𝑘(𝑋ᵢ, 𝑋ⱼ) ⇐ 𝑂(𝑛²𝑑) time
Solve (𝐾 + 𝜆𝐼) 𝑎 = 𝑦 for 𝑎 ⇐ 𝑂(𝑛³) time
For each test pt 𝑧:
ℎ(𝑧) ← Σⁿᵢ₌₁ 𝑎ᵢ 𝑘(𝑋ᵢ, 𝑧) ⇐ 𝑂(𝑛𝑑) time/test pt

• Does not use 𝑋𝑖 directly! Only 𝑘. [This will become important soon.]
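• For concreteness, here is a minimal numpy sketch of this dual algorithm (illustrative code, not from the slides; the function names are my own). The kernel is a pluggable function, which is the whole point: the algorithm touches the data only through 𝑘.

import numpy as np

def linear_kernel(A, B):
    # k(x, z) = xᵀz, evaluated for all pairs of rows of A and B
    return A @ B.T

def kernel_ridge_fit(X, y, lam, kernel=linear_kernel):
    K = kernel(X, X)                                      # n x n kernel matrix, O(n²d)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)   # dual weights a, O(n³)

def kernel_ridge_predict(X, a, Z, kernel=linear_kernel):
    # h(z) = Σᵢ aᵢ k(Xᵢ, z), computed for every test point (row of Z) at once
    return kernel(Z, X) @ a

Later kernels (polynomial, Gaussian) drop in by swapping the kernel argument.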
Kernel Ridge Regression
• Important: dual ridge regression produces the same predictions as primal ridge regression with a penalized bias term (studied in Lecture 13)!

• The difference is the running time; the dual algorithm is faster if 𝑑 > 𝑛,
because the primal algorithm solves a 𝑑 × 𝑑 linear system, whereas the
dual algorithm solves an 𝑛 × 𝑛 linear system.

The Kernel Trick (aka Kernelization)
• Here’s the magic part. We can compute a polynomial kernel without
actually computing the features.

• The polynomial kernel of degree 𝑝 is 𝑘(𝑥, 𝑧) = (𝑥ᵀ𝑧 + 1)ᵖ, where 𝑥ᵀ𝑧 is a scalar.
The Kernel Trick (aka Kernelization)
• Theorem: (𝑥ᵀ𝑧 + 1)ᵖ = Φ(𝑥)ᵀΦ(𝑧) for some Φ(𝑥) containing every monomial in 𝑥 of degree 0 … 𝑝.

• Example for 𝑑 = 2, 𝑝 = 2:
(𝑥ᵀ𝑧 + 1)² = 𝑥₁²𝑧₁² + 𝑥₂²𝑧₂² + 2𝑥₁𝑧₁𝑥₂𝑧₂ + 2𝑥₁𝑧₁ + 2𝑥₂𝑧₂ + 1
= [𝑥₁²  𝑥₂²  √2 𝑥₁𝑥₂  √2 𝑥₁  √2 𝑥₂  1] · [𝑧₁²  𝑧₂²  √2 𝑧₁𝑧₂  √2 𝑧₁  √2 𝑧₂  1]
= Φ(𝑥)ᵀΦ(𝑧) ⇐ this is how we’re defining Φ(𝑥)
The Kernel Trick (aka Kernelization)
• Notice the factors of √2.
• If you try a higher polynomial degree p, you’ll see a wider variety of these
constants.
• We have no control over the constants that appear in Φ(𝑥), but they don’t
matter much, because the primal weights 𝑤 will scale themselves to
compensate.
• Even though we don’t directly compute the primal weights, they implicitly
exist in the form 𝑤 = 𝑋 𝑇 𝑎
The Kernel Trick (aka Kernelization)
• Key win: compute Φ(𝑥)ᵀΦ(𝑧) in 𝑂(𝑑) time instead of 𝑂(𝐷) = 𝑂(𝑑ᵖ), even though Φ(𝑥) has length 𝐷.

• Kernel ridge regression replaces 𝑋ᵢ with Φ(𝑋ᵢ): let 𝑘(𝑥, 𝑧) = Φ(𝑥)ᵀΦ(𝑧), but doesn’t compute Φ(𝑥) or Φ(𝑧); it computes 𝑘(𝑥, 𝑧) = (𝑥ᵀ𝑧 + 1)ᵖ.
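• A small numpy check (illustrative, not from the slides) that the explicit degree-2 feature map Φ defined above reproduces the polynomial kernel, so the 𝑂(𝑑)-time kernel stands in for the 𝑂(𝐷)-time explicit inner product:

import numpy as np

def phi(x):
    # Φ(x) for d = 2, p = 2, with the √2 factors from the expansion above
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2)*x1*x2,
                     np.sqrt(2)*x1, np.sqrt(2)*x2, 1.0])

x = np.array([0.5, -1.2])      # arbitrary example points
z = np.array([2.0, 0.3])
print(np.isclose(phi(x) @ phi(z), (x @ z + 1.0) ** 2))   # prints True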

The Kernel Trick (aka Kernelization)
• Running times for 3 ridge algorithms:

• I think what we’ve done here is pretty mind-blowing: we can now do polynomial regression with an exponentially long, high-order polynomial in less time than it would take even to write out the final polynomial.
• The running time can be asymptotically smaller than 𝐷, the number of terms in the polynomial.
Kernel Logistic Regression
• Let Φ(𝑋) be 𝑛 × 𝐷 matrix with rows Φ(𝑋𝑖 )𝑇 . (Φ(𝑋) is the design matrix of the
featurized training points.)
• Featurized logistic regression with batch grad. descent:
𝑤 ← 0 [starting point is arbitrary]
repeat until convergence
𝑤 ← 𝑤 + 𝜖Φ(𝑋)𝑇 (𝑦 − 𝑠(Φ(𝑋)𝑤)) apply s component-wise to vector Φ(𝑋)𝑤
for each test pt z
ℎ(𝑧) ← 𝑠(𝑤ᵀΦ(𝑧))
Kernel Logistic Regression
• Dualize with 𝑤 = Φ(𝑋)𝑇 𝑎

• Then the code “𝑎 ← 𝑎 + 𝜖 (𝑦 − 𝑠(Φ(𝑋)𝑤))” has the same effect as “𝑤 ← 𝑤 + 𝜖 Φ(𝑋)ᵀ(𝑦 − 𝑠(Φ(𝑋)𝑤))”.

• Let 𝐾 = Φ(𝑋) Φ(𝑋)𝑇 . (The 𝑛 × 𝑛 kernel matrix; but we don’t compute


Φ(𝑋)—we use the kernel trick.)
• Note that 𝐾𝑎 = Φ(𝑋) Φ(𝑋)ᵀ𝑎 = Φ(𝑋)𝑤. (And Φ(𝑋)𝑤 appears in the algorithm above.)
Kernel Logistic Regression
• Dual/kernel logistic regression:
𝑎 ← 0 (starting point is arbitrary)
∀𝑖, 𝑗: 𝐾ᵢⱼ ← 𝑘(𝑋ᵢ, 𝑋ⱼ) ⇐ 𝑂(𝑛²𝑑) time (kernel trick)
repeat until convergence
𝑎 ← 𝑎 + 𝜖 (𝑦 − 𝑠(𝐾𝑎)) ⇐ 𝑂(𝑛²) time/iteration [apply 𝑠 component-wise]
for each test pt 𝑧
ℎ(𝑧) ← 𝑠(Σⁿᵢ₌₁ 𝑎ᵢ 𝑘(𝑋ᵢ, 𝑧)) ⇐ 𝑂(𝑛𝑑) time/test pt (kernel trick)
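• A minimal numpy sketch of the dual algorithm above (illustrative, not from the slides; the step size 𝜖 and iteration count are made-up choices, and labels are assumed to be in {0, 1}):

import numpy as np

def s(t):
    # logistic function, applied component-wise
    return 1.0 / (1.0 + np.exp(-t))

def kernel_logreg_fit(K, y, eps=0.1, iters=1000):
    # K is the n x n kernel matrix with K[i, j] = k(X_i, X_j); y holds 0/1 labels
    a = np.zeros(len(y))                      # dual weights, a <- 0
    for _ in range(iters):                    # "repeat until convergence"
        a += eps * (y - s(K @ a))             # O(n²) per iteration
    return a

def kernel_logreg_predict(a, K_test):
    # K_test[t, i] = k(X_i, z_t) for each test point z_t
    return s(K_test @ a)                      # h(z) = s(Σᵢ aᵢ k(Xᵢ, z))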
Kernel Logistic Regression
• For classification, you can skip the logistic function 𝑠(·) and just compute
the sign of the summation.

• Kernel logistic regression computes the same answer as the primal algorithm, but the running time changes.
Kernel Logistic Regression
• Important: running times depend on original dimension 𝑑, not on length
𝐷 of Φ(·)!
• Training for j iterations:

• Primal: 𝑂(𝑛𝐷𝑗) time
• Dual (no kernel trick): 𝑂(𝑛²𝐷 + 𝑛²𝑗) time
• Kernel: 𝑂(𝑛²𝑑 + 𝑛²𝑗) time
Kernel Logistic Regression
• Alternative training: stochastic gradient descent (SGD). Primal logistic
SGD step is

𝑤 ← 𝑤 + 𝜖 (𝑦ᵢ − 𝑠(Φ(𝑋ᵢ)ᵀ𝑤)) Φ(𝑋ᵢ)

• Dual logistic SGD maintains a vector 𝑞 = 𝐾𝑎 ∈ ℝ𝑛


• Note that 𝑞ᵢ = (Φ(𝑋)𝑤)ᵢ = Φ(𝑋ᵢ)ᵀ𝑤.
Kernel Logistic Regression
• Let 𝐾∗ᵢ denote column 𝑖 of 𝐾.
𝑎 ← 0; 𝑞 ← 0; ∀𝑖, 𝑗: 𝐾ᵢⱼ ← 𝑘(𝑋ᵢ, 𝑋ⱼ) (for a different starting point, set 𝑞 ← 𝐾𝑎)
repeat until convergence
Choose random 𝑖 ∈ [1, 𝑛]
𝑎ᵢ ← 𝑎ᵢ + 𝜖 (𝑦ᵢ − 𝑠(𝑞ᵢ)) ⇐ 𝑂(1) time
𝑞 ← 𝑞 + 𝜖 (𝑦ᵢ − 𝑠(𝑞ᵢ)) 𝐾∗ᵢ ⇐ maintains 𝑞 = 𝐾𝑎; 𝑂(𝑛) time
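• A numpy sketch of this dual SGD loop (illustrative; 𝜖 and the iteration count are made up). Only one dual weight changes per step, and 𝑞 = 𝐾𝑎 is maintained in 𝑂(𝑛) time instead of being recomputed with an 𝑂(𝑛²) matrix-vector product:

import numpy as np

def kernel_logreg_sgd(K, y, eps=0.1, iters=10000, seed=0):
    rng = np.random.default_rng(seed)
    s = lambda t: 1.0 / (1.0 + np.exp(-t))    # logistic function
    n = len(y)
    a = np.zeros(n)                           # dual weights
    q = np.zeros(n)                           # invariant: q == K @ a
    for _ in range(iters):                    # "repeat until convergence"
        i = rng.integers(n)                   # choose random i (0-indexed here)
        step = eps * (y[i] - s(q[i]))
        a[i] += step                          # O(1) update of one dual weight
        q += step * K[:, i]                   # O(n) update keeps q = K a
    return a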
Kernel Logistic Regression
• SGD updates only one dual weight 𝑎ᵢ per iteration; that’s a nice benefit of
the dual formulation.

• We cleverly update 𝑞 = 𝐾𝑎 in linear time instead of performing a


quadratic-time matrix-vector multiplication.

• Primal: 𝑂(𝐷𝑗) time
• Dual (no kernel trick): 𝑂(𝐷𝑛² + 𝑛𝑗) time
• Kernel: 𝑂(𝑛²𝑑 + 𝑛𝑗) time
Kernel Logistic Regression
• Alternative testing: If # of training points and test points both exceed 𝐷/𝑑,
classifying with primal weights 𝑤 may be faster. [This applies to ridge
regression as well.]

𝑤 = Φ(𝑋)ᵀ𝑎 ⇐ 𝑂(𝑛𝐷) time (once only)


for each test pt 𝑧
ℎ(𝑧) ← 𝑠(𝑤 𝑇 Φ(𝑧)) ⇐ 𝑂(𝐷) time/test pt

The Gaussian Kernel
• Mind-blowing as the polynomial kernel is, I think our next trick is even
more mind-blowing.

• Since we can now do fast computations in spaces with exponentially large dimensions, why don’t we go all the way and generate feature vectors in an infinite-dimensional space?
The Gaussian Kernel
• Gaussian kernel, aka radial basis function kernel: there exists a Φ: ℝ𝑑 →
ℝ∞ such that

𝑘(𝑥, 𝑧) = exp(−‖𝑥 − 𝑧‖² / (2𝜎²))
[This kernel takes 𝑂(𝑑) time to compute.]
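• A short numpy sketch (illustrative, not from the slides) of the Gaussian kernel between all pairs of rows of two matrices. Either this or the polynomial kernel can be passed as the kernel argument of the dual sketches above, since those algorithms touch the data only through 𝑘(·,·):

import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # ‖a − b‖² expanded as ‖a‖² − 2 aᵀb + ‖b‖², computed for all pairs of rows at once
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                - 2.0 * A @ B.T
                + np.sum(B**2, axis=1)[None, :])
    return np.exp(-sq_dists / (2.0 * sigma**2))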

The Gaussian Kernel
• In case you’re curious, here’s the feature vector that gives you this kernel,
for the case where you have only one input feature per sample point.

• e.g., for 𝑑 = 1,
Φ(𝑥) = exp(−𝑥²/(2𝜎²)) [1, 𝑥/(𝜎√1!), 𝑥²/(𝜎²√2!), 𝑥³/(𝜎³√3!), …]ᵀ

• This is an infinite vector, and Φ(𝑥)·Φ(𝑧) is a series that converges to 𝑘(𝑥, 𝑧). Nobody actually uses this value of Φ(𝑥) directly, or even cares about it; they just use the kernel function 𝑘(·,·).
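• An illustrative numpy check (not from the slides) that truncating this infinite feature vector reproduces the Gaussian kernel for 𝑑 = 1; with enough terms, the partial sums of Φ(𝑥)·Φ(𝑧) match 𝑘(𝑥, 𝑧) to machine precision:

import numpy as np
from math import exp, factorial, sqrt

def phi_truncated(x, sigma=1.0, terms=30):
    # first `terms` entries of the infinite feature vector above
    return np.array([exp(-x**2 / (2*sigma**2)) * x**j / (sigma**j * sqrt(factorial(j)))
                     for j in range(terms)])

x, z, sigma = 0.7, -0.4, 1.0                       # arbitrary example values
approx = phi_truncated(x, sigma) @ phi_truncated(z, sigma)
exact = exp(-(x - z)**2 / (2*sigma**2))            # k(x, z)
print(abs(approx - exact) < 1e-12)                 # prints True: the series converges to k(x, z)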
The Gaussian Kernel
• At this point, it’s best not to think of points in a high-dimensional space. It’s
no longer a useful intuition.

• Instead, think of the kernel 𝑘 as a measure of how similar or close together two points are to each other.
The Gaussian Kernel
• Key observation:
• hypothesis ℎ(𝑧) = Σⁿⱼ₌₁ 𝑎ⱼ 𝑘(𝑋ⱼ, 𝑧) is a linear combo of Gaussians centered
at training pts.
• The dual weights are the coefficients of the linear combination.
• The Gaussians are a basis for the hypothesis.

The Gaussian Kernel
• A hypothesis ℎ that is a linear combination of Gaussians centered at four training points, two with positive weights and two with negative weights.
• If you use ridge regression with a Gaussian kernel, your “linear” regression will look something like this.
The Gaussian Kernel
• Very popular in practice! Why?
– Gives very smooth ℎ (in fact, ℎ is infinitely differentiable, hence continuous)

– Behaves somewhat like 𝑘-nearest neighbors (training pts close to the test
pt get bigger votes), but smoother

– Oscillates less than polynomials (depending on 𝜎)

The Gaussian Kernel
• Very popular in practice! Why?
– 𝑘(𝑥, 𝑧) interpreted as a similarity measure. Maximum when 𝑧 = 𝑥; goes to
0 as distance increases.

– Training points “vote” for the value at 𝑧, but closer points get a weightier vote.

• The “standard” kernel 𝑘(𝑥, 𝑧) = 𝑥 · 𝑧 assigns more weight to training point vectors that point in roughly the same direction as 𝑧.
• By contrast, the Gaussian kernel assigns more weight to training points near 𝑧.
The Gaussian Kernel
• Choose 𝜎 by validation.
𝜎 trades off bias vs. variance:
larger 𝜎 → wider Gaussians & smoother ℎ → more bias & less variance
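• A hedged sketch of choosing 𝜎 by hold-out validation, reusing the kernel_ridge_fit, kernel_ridge_predict, and gaussian_kernel sketches from earlier in these notes (the candidate grid, 80/20 split, and 𝜆 are illustrative choices):

import numpy as np

def choose_sigma(X, y, sigmas=(0.1, 0.3, 1.0, 3.0, 10.0), lam=0.1, seed=0):
    # hold out 20% of the training data as a validation set
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    split = int(0.8 * len(X))
    tr, va = idx[:split], idx[split:]
    best_sigma, best_err = None, np.inf
    for sigma in sigmas:
        k = lambda A, B, s=sigma: gaussian_kernel(A, B, s)   # fix sigma for this candidate
        a = kernel_ridge_fit(X[tr], y[tr], lam, kernel=k)
        pred = kernel_ridge_predict(X[tr], a, X[va], kernel=k)
        err = np.mean((pred - y[va]) ** 2)                   # validation MSE
        if err < best_err:
            best_sigma, best_err = sigma, err
    return best_sigma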

• 𝛾 ∝ 1/𝜎
• Too narrow (larger 𝛾): more oscillations
• Too wide (smaller 𝛾): fewer oscillations

• The effect of the inverse-width parameter of the Gaussian kernel (𝛾) for a fixed value of the soft-margin constant.

Asa Ben-Hur and Jason Weston, “A User’s Guide to Support Vector Machines,” Data Mining Techniques for the Life Sciences, Methods in Molecular Biology, 2010.
• For small values of 𝛾 (upper left), the decision boundary is nearly linear.
• As 𝛾 increases, the flexibility of the decision boundary increases.
• Large values of 𝛾 lead to overfitting (bottom).

Asa Ben-Hur and Jason Weston, “A User’s Guide to Support Vector Machines,” Data Mining Techniques for the Life Sciences, Methods in Molecular Biology, 2010.
ESL, Figure 12.3
• The decision boundary (solid black) of a soft-margin SVM with a Gaussian kernel.
ESL, Figure 12.3
• Observe that in this example, it comes reasonably close to the Bayes optimal decision boundary (dashed purple).
• The dashed black curves are the boundaries of the margin.
• The small black disks are the support vectors that lie on the margin boundary.
Kernels
• By the way, there are many other kernels that, like the Gaussian kernel,
are defined directly as kernel functions without worrying about Φ.
• But not every function can be a kernel function.
• A function is qualified only if it always generates a positive semidefinite
kernel matrix, for every sample.
• There is an elaborate theory about how to construct valid kernel functions.
• However, you probably won’t need it. The polynomial and Gaussian
kernels are the two most popular by far.
Kernels
• As a final word, be aware that not every featurization Φ leads to a kernel
function that can be computed faster than Θ(𝐷) time.

• In fact, the vast majority cannot.

• Featurizations that can are rare and special.

Kernels

The effect of the degree of a polynomial kernel. Higher degree polynomial kernels allow a more flexible decision boundary.

Asa Ben-Hur and Jason Weston, “A User’s Guide to Support Vector Machines,” Data Mining Techniques for the Life Sciences, Methods in Molecular Biology, 2010.