Lecture 16
The Kernel Trick
KASHIF JAVED
EED, UET, Lahore
Readings: https://people.eecs.berkeley.edu/~jrs/189/
Motivation
Φ ∶ ℝ² → ℝ³,   (𝑥1, 𝑥2) ↦ (𝑧1, 𝑧2, 𝑧3) = (𝑥1², 𝑥2², √2 𝑥1𝑥2)
https://people.eecs.berkeley.edu/~jordan/courses/281B-spring04/lectures/lec3.pdf
Kernels
• Recall featurizing map Φ ∶ ℝ𝑑 → ℝ𝐷
• 𝑑 input features; 𝐷 features after featurization (Φ)
• Degree-𝑝 polynomials blow up to 𝐷 ∈ Θ(𝑑^𝑝) features.
• When 𝑑 and 𝑝 are not small, this gets computationally intractable really
fast.
• As I said in Lecture 4, if you have 100 features per sample point and you
want to use degree-4 polynomial decision functions, then each featurized
feature vector has a length of roughly 4 million.
• Today, magically, we use those features without computing them!
Kernels
• Observation: In many learning algorithms,
– the weights can be written as a linear combo of training points, &
– we can use inner products of Φ(𝑥)’s only ⇒ don’t need to compute Φ(𝑥)!
• Algos that have the first property include ridge regression, logistic
regression, perceptron and SVM.
• Kernelization is used in many different well-known algos.
• The second property is what we use to speed up kernelization.
Kernels
• Observation: In many learning algorithms,
– the weights can be written as a linear combo of training points: 𝑤 = 𝑋^𝑇 𝑎 = Σ𝑖 𝑎𝑖 𝑋𝑖 for some 𝑎 ∈ ℝ𝑛
• Substitute this identity into the algorithm and optimize the 𝑛 dual weights 𝑎 (aka dual
parameters) instead of the 𝐷 primal weights 𝑤
Kernel Ridge Regression
• To kernelize ridge regression, we need the weights to be a linear
combination of the training points.
• Fortunately, when we center 𝑋 and 𝑦, the “expected” value of the bias term
is zero. The actual bias won’t usually be exactly zero, but it will often be
close enough that we won’t do much harm by penalizing the bias term.
Kernel Ridge Regression
• Center 𝑋 and 𝑦 so their means are zero: 𝑋𝑖 ← 𝑋𝑖 − 𝜇𝑋 , 𝑦𝑖 ← 𝑦𝑖 − 𝜇𝑦
• With the bias term penalized, the ridge regression normal equations are (𝑋^𝑇 𝑋 + 𝜆𝐼) 𝑤 = 𝑋^𝑇 𝑦
Kernel Ridge Regression
• Suppose 𝑎 is a solution to (𝑋𝑋^𝑇 + 𝜆𝐼) 𝑎 = 𝑦
• Then 𝑋^𝑇 𝑦 = 𝑋^𝑇 (𝑋𝑋^𝑇 + 𝜆𝐼) 𝑎 = 𝑋^𝑇𝑋 𝑋^𝑇 𝑎 + 𝜆 𝑋^𝑇 𝑎 = (𝑋^𝑇 𝑋 + 𝜆𝐼) 𝑋^𝑇 𝑎
• Therefore, 𝑤 = 𝑋^𝑇 𝑎 is a solution to the normal equations (𝑋^𝑇 𝑋 + 𝜆𝐼) 𝑤 = 𝑋^𝑇 𝑦, and 𝑤 is a linear
combo of training points!
Kernel Ridge Regression
• By solving the dual equation (𝑋𝑋^𝑇 + 𝜆𝐼) 𝑎 = 𝑦, we get a solution 𝑎; then 𝑤 = 𝑋^𝑇 𝑎 satisfies
the normal equations (𝑋^𝑇 𝑋 + 𝜆𝐼) 𝑤 = 𝑋^𝑇 𝑦
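• A quick numerical check of this derivation, as a minimal NumPy sketch (the data and variable names here are our own illustration, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 5, 3, 0.1
X = rng.standard_normal((n, d))   # rows are (centered) training points
y = rng.standard_normal(n)        # (centered) labels

# Dual solve: (X X^T + lambda I) a = y
a = np.linalg.solve(X @ X.T + lam * np.eye(n), y)

# Recover the primal weights as a linear combo of training points
w = X.T @ a

# w satisfies the primal normal equations (X^T X + lambda I) w = X^T y
print(np.allclose((X.T @ X + lam * np.eye(d)) @ w, X.T @ y))   # True
```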
Kernel Ridge Regression
• 𝑎 is a dual solution; it solves the dual form of ridge regression: find 𝑎 that minimizes ‖𝑋𝑋^𝑇 𝑎 − 𝑦‖² + 𝜆 ‖𝑋^𝑇 𝑎‖²
Kernel Ridge Regression
• Training: Solve (𝑋𝑋^𝑇 + 𝜆𝐼) 𝑎 = 𝑦 for 𝑎
• Testing: Regression fn is ℎ(𝑧) = 𝑤^𝑇 𝑧 = 𝑎^𝑇 𝑋𝑧 = Σ𝑖 𝑎𝑖 (𝑋𝑖^𝑇 𝑧)   ⇐ a weighted sum of inner products
Kernel Ridge Regression
• Let 𝑘(𝑥, 𝑧) = 𝑥^𝑇 𝑧 be the kernel fn.
• Later, we’ll replace 𝑥 and 𝑧 with Φ(𝑥) and Φ(𝑧), and that’s where the
magic will happen.
Kernel Ridge Regression
• Let 𝐾 = 𝑋𝑋^𝑇 be the 𝑛 × 𝑛 kernel matrix, with 𝐾𝑖𝑗 = 𝑘(𝑋𝑖, 𝑋𝑗).
• 𝐾 may be singular. If so, there is probably no solution when 𝜆 = 0; then we must
choose a positive 𝜆. But that’s okay.
• 𝐾 is always singular if 𝑛 > 𝑑 + 1. But don’t worry about the case 𝑛 > 𝑑 + 1,
because you would only want to use the dual form when 𝑑 > 𝑛, i.e., for
polynomial features.
Kernel Ridge Regression
• Dual/kernel ridge regression alg:
      ∀𝑖, 𝑗:  𝐾𝑖𝑗 ← 𝑘(𝑋𝑖, 𝑋𝑗)              ⇐ 𝑂(𝑛²𝑑) time
      Solve (𝐾 + 𝜆𝐼) 𝑎 = 𝑦 for 𝑎            ⇐ 𝑂(𝑛³) time
      for each test pt 𝑧:
          ℎ(𝑧) ← Σ𝑖 𝑎𝑖 𝑘(𝑋𝑖, 𝑧)             ⇐ 𝑂(𝑛𝑑) time/test pt
• Does not use 𝑋𝑖 directly! Only 𝑘. [This will become important soon.]
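• For concreteness, a minimal NumPy sketch of this algorithm (the function names and the default linear kernel are our own choices; 𝑋 and 𝑦 are assumed centered as above):

```python
import numpy as np

def linear_kernel(x, z):
    """k(x, z) = x^T z.  Swap in another kernel without touching the rest of the code."""
    return x @ z

def kernel_ridge_fit(X, y, lam, k=linear_kernel):
    """Build K and solve (K + lambda I) a = y for the dual weights a.
    O(n^2 d) time to build K, O(n^3) for the solve."""
    n = X.shape[0]
    K = np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])
    return np.linalg.solve(K + lam * np.eye(n), y)

def kernel_ridge_predict(X, a, z, k=linear_kernel):
    """h(z) = sum_i a_i k(X_i, z).  O(n d) time per test point; never touches Phi."""
    return sum(a_i * k(x_i, z) for a_i, x_i in zip(a, X))
```

• The kernel is passed in as a function precisely because the algorithm never uses 𝑋𝑖 except through 𝑘; later kernels (polynomial, Gaussian) can be dropped in unchanged.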
Kernel Ridge Regression
• Important: dual ridge regression produces the same predictions as primal
ridge regression (with a penalized bias term) (studied in lecture 13)!
• The difference is the running time; the dual algorithm is faster if 𝑑 > 𝑛,
because the primal algorithm solves a 𝑑 × 𝑑 linear system, whereas the
dual algorithm solves an 𝑛 × 𝑛 linear system.
The Kernel Trick (aka Kernelization)
• Here’s the magic part. We can compute a polynomial kernel without
actually computing the features.
• The polynomial kernel of degree 𝑝 is 𝑘(𝑥, 𝑧) = (𝑥^𝑇 𝑧 + 1)^𝑝   ⇐ 𝑥^𝑇 𝑧 is a scalar, so this is cheap to compute
The Kernel Trick (aka Kernelization)
• Theorem: (𝑥^𝑇 𝑧 + 1)^𝑝 = Φ(𝑥)^𝑇 Φ(𝑧) for some Φ(𝑥) containing every
monomial in 𝑥 of degree 0 . . . 𝑝.
• Example for 𝑑 = 2, 𝑝 = 2:
  (𝑥^𝑇 𝑧 + 1)² = 𝑥1²𝑧1² + 𝑥2²𝑧2² + 2𝑥1𝑧1𝑥2𝑧2 + 2𝑥1𝑧1 + 2𝑥2𝑧2 + 1
              = [𝑥1²  𝑥2²  √2𝑥1𝑥2  √2𝑥1  √2𝑥2  1] [𝑧1²  𝑧2²  √2𝑧1𝑧2  √2𝑧1  √2𝑧2  1]^𝑇
              = Φ(𝑥)^𝑇 Φ(𝑧)    ⇐ this is how we’re defining Φ(𝑥)
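• A small numerical check of this identity for 𝑑 = 2, 𝑝 = 2 (a sketch; the test vectors are arbitrary made-up inputs):

```python
import numpy as np

def phi(x):
    """The Phi defined above for d = 2, p = 2, including the sqrt(2) factors."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2)*x1*x2, np.sqrt(2)*x1, np.sqrt(2)*x2, 1.0])

x = np.array([3.0, -1.0])
z = np.array([0.5, 2.0])

lhs = (x @ z + 1) ** 2    # polynomial kernel: O(d) time
rhs = phi(x) @ phi(z)     # explicit 6-dimensional features: O(D) time
print(lhs, rhs, np.isclose(lhs, rhs))   # same value, True
```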
The Kernel Trick (aka Kernelization)
• Notice the factors of √2.
• If you try a higher polynomial degree p, you’ll see a wider variety of these
constants.
• We have no control of the constants that appear in Φ(𝑥), but they don’t
matter much, because the primal weights 𝑤 will scale themselves to
compensate.
• Even though we don’t directly compute the primal weights, they implicitly
exist in the form 𝑤 = 𝑋^𝑇 𝑎.
The Kernel Trick (aka Kernelization)
• Key win: compute Φ(𝑥)^𝑇 Φ(𝑧) in 𝑂(𝑑) time instead of 𝑂(𝐷) = 𝑂(𝑑^𝑝),
even though Φ(𝑥) has length 𝐷.
The Kernel Trick (aka Kernelization)
• Running times for 3 ridge algorithms:
Kernel Logistic Regression
• Let Φ(𝑋) be the 𝑛 × 𝐷 matrix with rows Φ(𝑋𝑖)^𝑇. (Φ(𝑋) is the design matrix of the
featurized training points.)
• Featurized logistic regression with batch gradient descent:
      𝑤 ← 0                                        [starting point is arbitrary]
      repeat until convergence
          𝑤 ← 𝑤 + 𝜖 Φ(𝑋)^𝑇 (𝑦 − 𝑠(Φ(𝑋)𝑤))          [apply 𝑠 component-wise to the vector Φ(𝑋)𝑤]
      for each test pt 𝑧
          ℎ(𝑧) ← 𝑠(𝑤^𝑇 Φ(𝑧))
Kernel Logistic Regression
• Dualize with 𝑤 = Φ(𝑋)^𝑇 𝑎
Kernel Logistic Regression
• Dual/kernel logistic regression:
      𝑎 ← 0                                        [starting point is arbitrary]
      ∀𝑖, 𝑗:  𝐾𝑖𝑗 ← 𝑘(𝑋𝑖, 𝑋𝑗)                       ⇐ 𝑂(𝑛²𝑑) time (kernel trick)
      repeat until convergence
          𝑎 ← 𝑎 + 𝜖 (𝑦 − 𝑠(𝐾𝑎))                     ⇐ 𝑂(𝑛²) time/iteration [apply 𝑠 component-wise]
      for each test pt 𝑧
          ℎ(𝑧) ← 𝑠(Σ𝑖 𝑎𝑖 𝑘(𝑋𝑖, 𝑧))                  ⇐ 𝑂(𝑛𝑑) time/test pt (kernel trick)
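• A minimal NumPy sketch of this dual algorithm (the step size, iteration count, and function names are our own illustrative choices; labels 𝑦𝑖 are assumed to be 0 or 1):

```python
import numpy as np

def s(t):
    """Logistic function, applied component-wise."""
    return 1.0 / (1.0 + np.exp(-t))

def kernel_logreg_fit(X, y, k, eps=0.1, iters=1000):
    """Dual batch gradient step from the slide: a <- a + eps (y - s(K a))."""
    n = X.shape[0]
    K = np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])   # O(n^2 d)
    a = np.zeros(n)
    for _ in range(iters):              # fixed count stands in for "repeat until convergence"
        a = a + eps * (y - s(K @ a))    # O(n^2) per iteration
    return a

def kernel_logreg_predict(X, a, z, k):
    """h(z) = s(sum_i a_i k(X_i, z)); for classification, just take the sign of the sum."""
    return s(sum(a_i * k(x_i, z) for a_i, x_i in zip(a, X)))
```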
Kernel Logistic Regression
• For classification, you can skip the logistic function 𝑠(·) and just compute
the sign of the summation.
Kernel Logistic Regression
• Important: running times depend on original dimension 𝑑, not on length
𝐷 of Φ(·)!
• Training for 𝑗 iterations: 𝑂(𝑛²𝑑 + 𝑛²𝑗) time
Kernel Logistic Regression
• Alternative training: stochastic gradient descent (SGD). The primal logistic
SGD step, for a randomly chosen training point 𝑖, is
      𝑤 ← 𝑤 + 𝜖 (𝑦𝑖 − 𝑠(Φ(𝑋𝑖)^𝑇 𝑤)) Φ(𝑋𝑖)
Kernel Logistic Regression
• Let 𝐾∗𝑖 denote column 𝑖 of 𝐾. Dual/kernel logistic regression with SGD:
      𝑎 ← 0;  𝑞 ← 0;  ∀𝑖, 𝑗:  𝐾𝑖𝑗 ← 𝑘(𝑋𝑖, 𝑋𝑗)      [for a different starting point, set 𝑞 ← 𝐾𝑎]
      repeat until convergence
          choose random 𝑖 ∈ [1, 𝑛]
          𝑎𝑖 ← 𝑎𝑖 + 𝜖 (𝑦𝑖 − 𝑠(𝑞𝑖))                   ⇐ 𝑂(1) time
          𝑞 ← 𝑞 + 𝜖 (𝑦𝑖 − 𝑠(𝑞𝑖)) 𝐾∗𝑖                 ⇐ maintains 𝑞 = 𝐾𝑎;  𝑂(𝑛) time
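• The same loop in NumPy (a sketch; the step size, iteration count, and names are ours; labels are assumed to be 0 or 1):

```python
import numpy as np

def kernel_logreg_sgd(X, y, k, eps=0.1, iters=5000, seed=0):
    """Dual SGD: each step updates one dual weight a_i and maintains q = K a in O(n) time."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    K = np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])
    a = np.zeros(n)
    q = np.zeros(n)                                        # invariant: q == K @ a
    for _ in range(iters):                                 # stands in for "repeat until convergence"
        i = rng.integers(n)
        step = eps * (y[i] - 1.0 / (1.0 + np.exp(-q[i])))  # eps * (y_i - s(q_i))
        a[i] += step                                       # O(1)
        q += step * K[:, i]                                # O(n): q stays equal to K @ a
    return a
```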
Kernel Logistic Regression
• SGD updates only one dual weight 𝑎𝑖 per iteration; that’s a nice benefit of
the dual formulation.
Kernel Logistic Regression
• Alternative testing: If # of training points and test points both exceed 𝐷/𝑑,
classifying with primal weights 𝑤 may be faster. [This applies to ridge
regression as well.]
The Gaussian Kernel
• Mind-blowing as the polynomial kernel is, I think our next trick is even
more mind-blowing.
The Gaussian Kernel
• Gaussian kernel, aka radial basis function kernel: there exists a Φ: ℝ𝑑 →
ℝ∞ such that
𝑘(𝑥, 𝑧) = exp(−‖𝑥 − 𝑧‖² / (2𝜎²))
[This kernel takes 𝑂(𝑑) time to compute.]
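• In code, the kernel itself is one line (a sketch; the function name and the explicit 𝜎 argument are our own):

```python
import numpy as np

def gaussian_kernel(x, z, sigma):
    """k(x, z) = exp(-||x - z||^2 / (2 sigma^2)); O(d) time per evaluation."""
    diff = x - z
    return np.exp(-(diff @ diff) / (2.0 * sigma**2))
```

• It can be passed as the kernel 𝑘 to the dual ridge and logistic regression sketches above; nothing else in those algorithms changes.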
The Gaussian Kernel
• In case you’re curious, here’s the feature vector that gives you this kernel,
for the case where you have only one input feature per sample point.
• e.g., for 𝑑 = 1,
Φ(𝑥) = exp(−𝑥²/(2𝜎²)) [1,  𝑥/(𝜎√1!),  𝑥²/(𝜎²√2!),  𝑥³/(𝜎³√3!),  …]^𝑇
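• A quick check that truncating this infinite feature vector reproduces the Gaussian kernel for 𝑑 = 1 (a sketch; the truncation length and sample values are arbitrary):

```python
import numpy as np
from math import factorial

def phi_truncated(x, sigma, terms=30):
    """First `terms` entries of the infinite feature vector above."""
    pref = np.exp(-x**2 / (2 * sigma**2))
    return np.array([pref * x**k / (sigma**k * np.sqrt(factorial(k))) for k in range(terms)])

x, z, sigma = 0.8, -0.3, 1.0
series = phi_truncated(x, sigma) @ phi_truncated(z, sigma)   # Phi(x)^T Phi(z), truncated
exact = np.exp(-(x - z)**2 / (2 * sigma**2))                 # the Gaussian kernel
print(series, exact)   # agree to many decimal places
```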
The Gaussian Kernel
• Key observation: the hypothesis ℎ(𝑧) = Σ𝑗 𝑎𝑗 𝑘(𝑋𝑗, 𝑧) is a linear combo of Gaussians centered
at training pts.
• The dual weights are the coefficients of the linear combination.
• The Gaussians are a basis for the hypothesis.
The Gaussian Kernel
• A hypothesis ℎ that is a linear
combination of Gaussians
centered at four training
points, two with positive
weights and two with
negative weights.
• If you use ridge regression
with a Gaussian kernel, your
“linear” regression will look
something like this.
The Gaussian Kernel
• Very popular in practice! Why?
– Gives a very smooth ℎ (in fact, ℎ is infinitely differentiable)
– Behaves somewhat like 𝑘-nearest neighbors (training pts close to the test
pt get bigger votes), but smoother
The Gaussian Kernel
• Very popular in practice! Why?
– 𝑘(𝑥, 𝑧) interpreted as a similarity measure. Maximum when 𝑧 = 𝑥; goes to
0 as distance increases.
– Training points “vote” for value at z, but closer points get weightier vote.
The Gaussian Kernel
• Choose 𝜎 by validation.
𝜎 trades off bias vs. variance:
larger 𝜎 → wider Gaussians & smoother ℎ → more bias & less variance
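• A sketch of how that validation could look for kernel ridge regression (everything here, the names, 𝜆, the candidate 𝜎 values, and the random stand-in data, is our own illustration, not from the lecture):

```python
import numpy as np

def gaussian_K(A, B, sigma):
    """Kernel matrix K[i, j] = exp(-||A_i - B_j||^2 / (2 sigma^2)) between two point sets."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * sigma**2))

def val_error(Xtr, ytr, Xva, yva, sigma, lam=0.1):
    """Fit dual ridge regression with this sigma on the training set; return validation MSE."""
    a = np.linalg.solve(gaussian_K(Xtr, Xtr, sigma) + lam * np.eye(len(Xtr)), ytr)
    preds = gaussian_K(Xva, Xtr, sigma) @ a     # h(z) = sum_i a_i k(X_i, z) for each validation z
    return np.mean((preds - yva) ** 2)

# Random stand-in data, just to make the selection loop runnable:
rng = np.random.default_rng(0)
Xtr, ytr = rng.standard_normal((40, 2)), rng.standard_normal(40)
Xva, yva = rng.standard_normal((20, 2)), rng.standard_normal(20)
best_sigma = min([0.1, 0.3, 1.0, 3.0, 10.0],
                 key=lambda s: val_error(Xtr, ytr, Xva, yva, s))
print(best_sigma)
```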
• 𝛾 ∝ 1/𝜎
• Narrower Gaussians (larger 𝛾): more oscillations in ℎ
• Wider Gaussians (smaller 𝛾): fewer oscillations in ℎ
ESL, Figure 12.3
• Observe that in this example,
it comes reasonably close to
the Bayes optimal decision
boundary (dashed purple).
• The dashed black curves are
the boundaries of the margin.
• The small black disks are the
support vectors that lie on the
margin boundary.
Kernels
• By the way, there are many other kernels that, like the Gaussian kernel,
are defined directly as kernel functions without worrying about Φ.
• But not every function can be a kernel function.
• A function qualifies only if it always generates a positive semidefinite
kernel matrix, for every possible set of sample points.
• There is an elaborate theory about how to construct valid kernel functions.
• However, you probably won’t need it. The polynomial and Gaussian
kernels are the two most popular by far.
Kernels
• As a final word, be aware that not every featurization Φ leads to a kernel
function that can be computed faster than Θ(𝐷) time.