Lecture 16
The Kernel Trick
KASHIF JAVED
EED, UET, Lahore
Readings: https://people.eecs.berkeley.edu/~jrs/189/
Motivation
Φ ∶ ℝ² → ℝ³,   (𝑥1, 𝑥2) ↦ (𝑧1, 𝑧2, 𝑧3) = (𝑥1², 𝑥2², √2 𝑥1𝑥2)
https://people.eecs.berkeley.edu/~jordan/courses/281B-spring04/lectures/lec3.pdf
Kernels
• Recall featurizing map Φ ∶ ℝ𝑑 → ℝ𝐷
• 𝑑 input features; 𝐷 features after featurization (Φ)
• Degree-𝑝 polynomials blow up to 𝐷 ∈ Θ(𝑑^𝑝) features.
• When 𝑑 and 𝑝 are not small, this gets computationally intractable really
fast.
• As I said in Lecture 4, if you have 100 features per sample point and you
want to use degree-4 polynomial decision functions, then each featurized
feature vector has a length of roughly 4 million.
• Today, magically, we use those features without computing them!
Kernels
• Observation: In many learning algorithms,
– the weights can be written as a linear combo of training points, &
– we can use inner products of Φ(𝑥)’s only ⇒ don’t need to compute Φ(𝑥)!
• Algos that have the first property include ridge regression, logistic
regression, perceptron and SVM.
• Kernelization is used in many different well-known algos.
• The second property is what we use to speed up kernelization.
Kernels
• Observation: In many learning algorithms,
– the weights can be written as a linear combo of training points: 𝑤 = 𝑋^𝑇 𝑎 = Σ𝑖 𝑎𝑖 𝑋𝑖 for some 𝑎 ∈ ℝ𝑛
• Substitute this identity into the algorithm and optimize the 𝑛 dual weights 𝑎 (aka dual
parameters) instead of the 𝐷 primal weights 𝑤
Kernel Ridge Regression
• To kernelize ridge regression, we need the weights to be a linear
combination of the training points.
• Fortunately, when we center 𝑋 and 𝑦, the “expected” value of the bias term
is zero. The actual bias won’t usually be exactly zero, but it will often be
close enough that we won’t do much harm by penalizing the bias term.
Kernel Ridge Regression
• Center 𝑋 and 𝑦 so their means are zero: 𝑋𝑖 ← 𝑋𝑖 − 𝜇𝑋 , 𝑦𝑖 ← 𝑦𝑖 − 𝜇𝑦
• With the bias term penalized, the ridge regression normal equations are (𝑋^𝑇 𝑋 + 𝜆𝐼) 𝑤 = 𝑋^𝑇 𝑦
Kernel Ridge Regression
• Suppose 𝑎 is a solution to (𝑋𝑋^𝑇 + 𝜆𝐼) 𝑎 = 𝑦
• Then 𝑋^𝑇 𝑦 = 𝑋^𝑇 (𝑋𝑋^𝑇 + 𝜆𝐼) 𝑎 = 𝑋^𝑇𝑋 𝑋^𝑇 𝑎 + 𝜆 𝑋^𝑇 𝑎 = (𝑋^𝑇 𝑋 + 𝜆𝐼) 𝑋^𝑇 𝑎
• Therefore, 𝑤 = 𝑋^𝑇 𝑎 is a solution to the normal equations (𝑋^𝑇 𝑋 + 𝜆𝐼) 𝑤 = 𝑋^𝑇 𝑦, and 𝑤 is a linear
combo of training points!
Kernel Ridge Regression
• By solving the dual equation (𝑋𝑋^𝑇 + 𝜆𝐼) 𝑎 = 𝑦, we get a solution 𝑎; then 𝑤 = 𝑋^𝑇 𝑎 satisfies
the normal equations (𝑋^𝑇 𝑋 + 𝜆𝐼) 𝑤 = 𝑋^𝑇 𝑦
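• A quick numerical check of this derivation, as a minimal NumPy sketch (the data and variable names here are our own illustration, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 5, 3, 0.1
X = rng.standard_normal((n, d))   # rows are (centered) training points
y = rng.standard_normal(n)        # (centered) labels

# Dual solve: (X X^T + lambda I) a = y
a = np.linalg.solve(X @ X.T + lam * np.eye(n), y)

# Recover the primal weights as a linear combo of training points
w = X.T @ a

# w satisfies the primal normal equations (X^T X + lambda I) w = X^T y
print(np.allclose((X.T @ X + lam * np.eye(d)) @ w, X.T @ y))   # True
```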
Kernel Ridge Regression
• 𝑎 is a dual solution; it solves the dual form of ridge regression: find 𝑎 that minimizes ‖𝑋𝑋^𝑇 𝑎 − 𝑦‖² + 𝜆 ‖𝑋^𝑇 𝑎‖²
Kernel Ridge Regression
• Training: Solve (𝑋𝑋^𝑇 + 𝜆𝐼) 𝑎 = 𝑦 for 𝑎
• Testing: Regression fn is ℎ(𝑧) = 𝑤^𝑇 𝑧 = 𝑎^𝑇 𝑋𝑧 = Σ𝑖 𝑎𝑖 (𝑋𝑖^𝑇 𝑧)   ⇐ a weighted sum of inner products
Kernel Ridge Regression
• Let 𝑘(𝑥, 𝑧) = 𝑥^𝑇 𝑧 be the kernel fn.
• Later, we’ll replace 𝑥 and 𝑧 with Φ(𝑥) and Φ(𝑧), and that’s where the
magic will happen.
Kernel Ridge Regression
• Let 𝐾 = 𝑋𝑋^𝑇 be the 𝑛 × 𝑛 kernel matrix, with 𝐾𝑖𝑗 = 𝑘(𝑋𝑖, 𝑋𝑗).
• 𝐾 may be singular. If so, there is probably no solution when 𝜆 = 0; then we must
choose a positive 𝜆. But that’s okay.
• 𝐾 is always singular if 𝑛 > 𝑑 + 1. But don’t worry about the case 𝑛 > 𝑑 + 1,
because you would only want to use the dual form when 𝑑 > 𝑛, i.e., for
polynomial features.
Kernel Ridge Regression
• Dual/kernel ridge regression alg:
      ∀𝑖, 𝑗:  𝐾𝑖𝑗 ← 𝑘(𝑋𝑖, 𝑋𝑗)              ⇐ 𝑂(𝑛²𝑑) time
      Solve (𝐾 + 𝜆𝐼) 𝑎 = 𝑦 for 𝑎            ⇐ 𝑂(𝑛³) time
      for each test pt 𝑧:
          ℎ(𝑧) ← Σ𝑖 𝑎𝑖 𝑘(𝑋𝑖, 𝑧)             ⇐ 𝑂(𝑛𝑑) time/test pt
• Does not use 𝑋𝑖 directly! Only 𝑘. [This will become important soon.]
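• For concreteness, a minimal NumPy sketch of this algorithm (the function names and the default linear kernel are our own choices; 𝑋 and 𝑦 are assumed centered as above):

```python
import numpy as np

def linear_kernel(x, z):
    """k(x, z) = x^T z.  Swap in another kernel without touching the rest of the code."""
    return x @ z

def kernel_ridge_fit(X, y, lam, k=linear_kernel):
    """Build K and solve (K + lambda I) a = y for the dual weights a.
    O(n^2 d) time to build K, O(n^3) for the solve."""
    n = X.shape[0]
    K = np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])
    return np.linalg.solve(K + lam * np.eye(n), y)

def kernel_ridge_predict(X, a, z, k=linear_kernel):
    """h(z) = sum_i a_i k(X_i, z).  O(n d) time per test point; never touches Phi."""
    return sum(a_i * k(x_i, z) for a_i, x_i in zip(a, X))
```

• The kernel is passed in as a function precisely because the algorithm never uses 𝑋𝑖 except through 𝑘; later kernels (polynomial, Gaussian) can be dropped in unchanged.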
Kernel Ridge Regression
• Important: dual ridge regression produces the same predictions as primal
ridge regression (with a penalized bias term) (studied in lecture 13)!
• The difference is the running time; the dual algorithm is faster if 𝑑 > 𝑛,
because the primal algorithm solves a 𝑑 × 𝑑 linear system, whereas the
dual algorithm solves an 𝑛 × 𝑛 linear system.
The Kernel Trick (aka Kernelization)
• Here’s the magic part. We can compute a polynomial kernel without
actually computing the features.
• The polynomial kernel of degree 𝑝 is 𝑘(𝑥, 𝑧) = (𝑥^𝑇 𝑧 + 1)^𝑝   ⇐ 𝑥^𝑇 𝑧 is a scalar, so this is cheap to compute
The Kernel Trick (aka Kernelization)
• Theorem: (𝑥^𝑇 𝑧 + 1)^𝑝 = Φ(𝑥)^𝑇 Φ(𝑧) for some Φ(𝑥) containing every
monomial in 𝑥 of degree 0 . . . 𝑝.
• Example for 𝑑 = 2, 𝑝 = 2:
  (𝑥^𝑇 𝑧 + 1)² = 𝑥1²𝑧1² + 𝑥2²𝑧2² + 2𝑥1𝑧1𝑥2𝑧2 + 2𝑥1𝑧1 + 2𝑥2𝑧2 + 1
              = [𝑥1²  𝑥2²  √2𝑥1𝑥2  √2𝑥1  √2𝑥2  1] [𝑧1²  𝑧2²  √2𝑧1𝑧2  √2𝑧1  √2𝑧2  1]^𝑇
              = Φ(𝑥)^𝑇 Φ(𝑧)    ⇐ this is how we’re defining Φ(𝑥)
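• A small numerical check of this identity for 𝑑 = 2, 𝑝 = 2 (a sketch; the test vectors are arbitrary made-up inputs):

```python
import numpy as np

def phi(x):
    """The Phi defined above for d = 2, p = 2, including the sqrt(2) factors."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2)*x1*x2, np.sqrt(2)*x1, np.sqrt(2)*x2, 1.0])

x = np.array([3.0, -1.0])
z = np.array([0.5, 2.0])

lhs = (x @ z + 1) ** 2    # polynomial kernel: O(d) time
rhs = phi(x) @ phi(z)     # explicit 6-dimensional features: O(D) time
print(lhs, rhs, np.isclose(lhs, rhs))   # same value, True
```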
The Kernel Trick (aka Kernelization)
• Notice the factors of √2.
• If you try a higher polynomial degree p, you’ll see a wider variety of these
constants.
• We have no control of the constants that appear in Φ(𝑥), but they don’t
matter much, because the primal weights 𝑤 will scale themselves to
compensate.
• Even though we don’t directly compute the primal weights, they implicitly
exist in the form 𝑤 = 𝑋^𝑇 𝑎.
The Kernel Trick (aka Kernelization)
• Key win: compute Φ(𝑥)^𝑇 Φ(𝑧) in 𝑂(𝑑) time instead of 𝑂(𝐷) = 𝑂(𝑑^𝑝),
even though Φ(𝑥) has length 𝐷.
The Kernel Trick (aka Kernelization)
• Running times for 3 ridge algorithms:
Kernel Logistic Regression
• Let Φ(𝑋) be the 𝑛 × 𝐷 matrix with rows Φ(𝑋𝑖)^𝑇. (Φ(𝑋) is the design matrix of the
featurized training points.)
• Featurized logistic regression with batch gradient descent:
      𝑤 ← 0                                        [starting point is arbitrary]
      repeat until convergence
          𝑤 ← 𝑤 + 𝜖 Φ(𝑋)^𝑇 (𝑦 − 𝑠(Φ(𝑋)𝑤))          [apply 𝑠 component-wise to the vector Φ(𝑋)𝑤]
      for each test pt 𝑧
          ℎ(𝑧) ← 𝑠(𝑤^𝑇 Φ(𝑧))
Kernel Logistic Regression
• Dualize with 𝑤 = Φ(𝑋)^𝑇 𝑎
Kernel Logistic Regression
• Dual/kernel logistic regression:
      𝑎 ← 0                                        [starting point is arbitrary]
      ∀𝑖, 𝑗:  𝐾𝑖𝑗 ← 𝑘(𝑋𝑖, 𝑋𝑗)                       ⇐ 𝑂(𝑛²𝑑) time (kernel trick)
      repeat until convergence
          𝑎 ← 𝑎 + 𝜖 (𝑦 − 𝑠(𝐾𝑎))                     ⇐ 𝑂(𝑛²) time/iteration [apply 𝑠 component-wise]
      for each test pt 𝑧
          ℎ(𝑧) ← 𝑠(Σ𝑖 𝑎𝑖 𝑘(𝑋𝑖, 𝑧))                  ⇐ 𝑂(𝑛𝑑) time/test pt (kernel trick)
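• A minimal NumPy sketch of this dual algorithm (the step size, iteration count, and function names are our own illustrative choices; labels 𝑦𝑖 are assumed to be 0 or 1):

```python
import numpy as np

def s(t):
    """Logistic function, applied component-wise."""
    return 1.0 / (1.0 + np.exp(-t))

def kernel_logreg_fit(X, y, k, eps=0.1, iters=1000):
    """Dual batch gradient step from the slide: a <- a + eps (y - s(K a))."""
    n = X.shape[0]
    K = np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])   # O(n^2 d)
    a = np.zeros(n)
    for _ in range(iters):              # fixed count stands in for "repeat until convergence"
        a = a + eps * (y - s(K @ a))    # O(n^2) per iteration
    return a

def kernel_logreg_predict(X, a, z, k):
    """h(z) = s(sum_i a_i k(X_i, z)); for classification, just take the sign of the sum."""
    return s(sum(a_i * k(x_i, z) for a_i, x_i in zip(a, X)))
```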
Kernel Logistic Regression
• For classification, you can skip the logistic function 𝑠(·) and just compute
the sign of the summation.
Kernel Logistic Regression
• Important: running times depend on original dimension 𝑑, not on length
𝐷 of Φ(·)!
• Training for 𝑗 iterations: 𝑂(𝑛²𝑑 + 𝑛²𝑗) time
Kernel Logistic Regression
• Alternative training: stochastic gradient descent (SGD). The primal logistic
SGD step, for a randomly chosen training point 𝑖, is
      𝑤 ← 𝑤 + 𝜖 (𝑦𝑖 − 𝑠(Φ(𝑋𝑖)^𝑇 𝑤)) Φ(𝑋𝑖)
Kernel Logistic Regression
• Let 𝐾∗𝑖 denote column 𝑖 of 𝐾. Dual/kernel logistic regression with SGD:
      𝑎 ← 0;  𝑞 ← 0;  ∀𝑖, 𝑗:  𝐾𝑖𝑗 ← 𝑘(𝑋𝑖, 𝑋𝑗)      [for a different starting point, set 𝑞 ← 𝐾𝑎]
      repeat until convergence
          choose random 𝑖 ∈ [1, 𝑛]
          𝑎𝑖 ← 𝑎𝑖 + 𝜖 (𝑦𝑖 − 𝑠(𝑞𝑖))                   ⇐ 𝑂(1) time
          𝑞 ← 𝑞 + 𝜖 (𝑦𝑖 − 𝑠(𝑞𝑖)) 𝐾∗𝑖                 ⇐ maintains 𝑞 = 𝐾𝑎;  𝑂(𝑛) time
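• The same loop in NumPy (a sketch; the step size, iteration count, and names are ours; labels are assumed to be 0 or 1):

```python
import numpy as np

def kernel_logreg_sgd(X, y, k, eps=0.1, iters=5000, seed=0):
    """Dual SGD: each step updates one dual weight a_i and maintains q = K a in O(n) time."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    K = np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])
    a = np.zeros(n)
    q = np.zeros(n)                                        # invariant: q == K @ a
    for _ in range(iters):                                 # stands in for "repeat until convergence"
        i = rng.integers(n)
        step = eps * (y[i] - 1.0 / (1.0 + np.exp(-q[i])))  # eps * (y_i - s(q_i))
        a[i] += step                                       # O(1)
        q += step * K[:, i]                                # O(n): q stays equal to K @ a
    return a
```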
Kernel Logistic Regression
• SGD updates only one dual weight 𝑎𝑖 per iteration; that’s a nice benefit of
the dual formulation.
Kernel Logistic Regression
• Alternative testing: If # of training points and test points both exceed 𝐷/𝑑,
classifying with primal weights 𝑤 may be faster. [This applies to ridge
regression as well.]
The Gaussian Kernel
• Mind-blowing as the polynomial kernel is, I think our next trick is even
more mind-blowing.
The Gaussian Kernel
• Gaussian kernel, aka radial basis function kernel: there exists a Φ: ℝ𝑑 →
ℝ∞ such that
𝑘(𝑥, 𝑧) = exp(−‖𝑥 − 𝑧‖² / (2𝜎²))
[This kernel takes 𝑂(𝑑) time to compute.]
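• In code, the kernel itself is one line (a sketch; the function name and the explicit 𝜎 argument are our own):

```python
import numpy as np

def gaussian_kernel(x, z, sigma):
    """k(x, z) = exp(-||x - z||^2 / (2 sigma^2)); O(d) time per evaluation."""
    diff = x - z
    return np.exp(-(diff @ diff) / (2.0 * sigma**2))
```

• It can be passed as the kernel 𝑘 to the dual ridge and logistic regression sketches above; nothing else in those algorithms changes.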
The Gaussian Kernel
• In case you’re curious, here’s the feature vector that gives you this kernel,
for the case where you have only one input feature per sample point.
• e.g., for 𝑑 = 1,
Φ(𝑥) = exp(−𝑥²/(2𝜎²)) [1,  𝑥/(𝜎√1!),  𝑥²/(𝜎²√2!),  𝑥³/(𝜎³√3!),  …]^𝑇
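• A quick check that truncating this infinite feature vector reproduces the Gaussian kernel for 𝑑 = 1 (a sketch; the truncation length and sample values are arbitrary):

```python
import numpy as np
from math import factorial

def phi_truncated(x, sigma, terms=30):
    """First `terms` entries of the infinite feature vector above."""
    pref = np.exp(-x**2 / (2 * sigma**2))
    return np.array([pref * x**k / (sigma**k * np.sqrt(factorial(k))) for k in range(terms)])

x, z, sigma = 0.8, -0.3, 1.0
series = phi_truncated(x, sigma) @ phi_truncated(z, sigma)   # Phi(x)^T Phi(z), truncated
exact = np.exp(-(x - z)**2 / (2 * sigma**2))                 # the Gaussian kernel
print(series, exact)   # agree to many decimal places
```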
The Gaussian Kernel
• Key observation: the hypothesis ℎ(𝑧) = Σ𝑗 𝑎𝑗 𝑘(𝑋𝑗, 𝑧) is a linear combo of Gaussians centered
at training pts.
• The dual weights are the coefficients of the linear combination.
• The Gaussians are a basis for the hypothesis.
The Gaussian Kernel
• A hypothesis ℎ that is a linear
combination of Gaussians
centered at four training
points, two with positive
weights and two with
negative weights.
• If you use ridge regression
with a Gaussian kernel, your
“linear” regression will look
something like this.
The Gaussian Kernel
• Very popular in practice! Why?
– Gives a very smooth ℎ (in fact, ℎ is infinitely differentiable)
– Behaves somewhat like 𝑘-nearest neighbors (training pts close to the test
pt get bigger votes), but smoother
The Gaussian Kernel
• Very popular in practice! Why?
– 𝑘(𝑥, 𝑧) interpreted as a similarity measure. Maximum when 𝑧 = 𝑥; goes to
0 as distance increases.
– Training points “vote” for value at z, but closer points get weightier vote.
The Gaussian Kernel
• Choose 𝜎 by validation.
𝜎 trades off bias vs. variance:
larger 𝜎 → wider Gaussians & smoother ℎ → more bias & less variance
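• A sketch of how that validation could look for kernel ridge regression (everything here, the names, 𝜆, the candidate 𝜎 values, and the random stand-in data, is our own illustration, not from the lecture):

```python
import numpy as np

def gaussian_K(A, B, sigma):
    """Kernel matrix K[i, j] = exp(-||A_i - B_j||^2 / (2 sigma^2)) between two point sets."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * sigma**2))

def val_error(Xtr, ytr, Xva, yva, sigma, lam=0.1):
    """Fit dual ridge regression with this sigma on the training set; return validation MSE."""
    a = np.linalg.solve(gaussian_K(Xtr, Xtr, sigma) + lam * np.eye(len(Xtr)), ytr)
    preds = gaussian_K(Xva, Xtr, sigma) @ a     # h(z) = sum_i a_i k(X_i, z) for each validation z
    return np.mean((preds - yva) ** 2)

# Random stand-in data, just to make the selection loop runnable:
rng = np.random.default_rng(0)
Xtr, ytr = rng.standard_normal((40, 2)), rng.standard_normal(40)
Xva, yva = rng.standard_normal((20, 2)), rng.standard_normal(20)
best_sigma = min([0.1, 0.3, 1.0, 3.0, 10.0],
                 key=lambda s: val_error(Xtr, ytr, Xva, yva, s))
print(best_sigma)
```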
• 𝛾 ∝ 1/𝜎
• Narrower Gaussians (larger 𝛾): more oscillations in ℎ
• Wider Gaussians (smaller 𝛾): fewer oscillations in ℎ
ESL, Figure 12.3
• Observe that in this example,
it comes reasonably close to
the Bayes optimal decision
boundary (dashed purple).
• The dashed black curves are
the boundaries of the margin.
• The small black disks are the
support vectors that lie on the
margin boundary.
Kernels
• By the way, there are many other kernels that, like the Gaussian kernel,
are defined directly as kernel functions without worrying about Φ.
• But not every function can be a kernel function.
• A function qualifies only if it always generates a positive semidefinite
kernel matrix, for every possible set of sample points.
• There is an elaborate theory about how to construct valid kernel functions.
• However, you probably won’t need it. The polynomial and Gaussian
kernels are the two most popular by far.
Kernels
• As a final word, be aware that not every featurization Φ leads to a kernel
function that can be computed faster than Θ(𝐷) time.