MATH3340 Mathematics of Machine Learning
Chinese University of Hong Kong
March 2025
Outline
1. Representer theorem
2. Kernels
3. Algorithm and analysis
4. Generalization guarantees for Lipschitz-continuous losses
Introduction
Linear models: $f_\theta(x) = \theta^\top \varphi(x)$, where $\varphi(x) \in \mathbb{R}^d$ is the feature map.
• Estimation error is bounded as $1/\sqrt{n}$, independently of the dimension:
$$R_n(\mathcal{F}) \le \frac{D}{\sqrt{n}} \sqrt{\mathbb{E}\big[\|\varphi(x)\|_2^2\big]}$$
• Convex optimization
$\Longrightarrow$ take $d = +\infty$: $f_\theta(x) = \langle \theta, \varphi(x)\rangle$, with $\theta \in \mathcal{H}$ (a Hilbert space) and $\varphi(x) \in \mathcal{H}$.
A Hilbert space $\mathcal{H}$ is a complete vector space equipped with an inner product $\langle \cdot, \cdot\rangle$. It satisfies the following properties:
• Vector space structure: vector addition and scalar multiplication are defined.
• Inner product $\langle \cdot, \cdot\rangle$ mapping pairs of vectors in $\mathcal{H}$ to a scalar (real or complex), satisfying:
  – conjugate symmetry: $\langle x, y\rangle = \overline{\langle y, x\rangle}$;
  – linearity in the first argument: $\langle ax + by, z\rangle = a\langle x, z\rangle + b\langle y, z\rangle$;
  – positive definiteness: $\langle x, x\rangle \ge 0$, with equality if and only if $x = 0$.
• Completeness: every Cauchy sequence in the space converges to a limit within the space.
Norm induced by the inner product:
$$\|x\| = \sqrt{\langle x, x\rangle}$$
Orthogonality: two vectors $x$ and $y$ are orthogonal if $\langle x, y\rangle = 0$.
• Finite-dimensional example: $\mathbb{R}^n$ with the standard dot product.
• Infinite-dimensional example: the space of square-integrable functions $L^2(\Omega)$ with the inner product
$$\langle f, g\rangle = \int_\Omega f(x)\, g(x)\, dx$$
Representer theorem
$d = +\infty$: $f_\theta(x) = \langle \theta, \varphi(x)\rangle$, $\theta \in \mathcal{H}$ (Hilbert space), $\varphi(x) \in \mathcal{H}$.
Data $x_1, \dots, x_n \in \mathcal{X}$, $y_1, \dots, y_n \in \mathcal{Y}$, and $f_\theta(x) = \langle \theta, \varphi(x)\rangle$ with $\varphi : \mathcal{X} \to \mathcal{H}$.
Regularized ERM:
$$\min_{\theta \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n \ell\big(y_i, \langle \theta, \varphi(x_i)\rangle\big) + \frac{\lambda}{2}\|\theta\|_{\mathcal{H}}^2$$
Key features: the observations $x_i$ are accessed only through the dot products $\langle \theta, \varphi(x_i)\rangle$, and the penalty is the Hilbert norm $\|\theta\|_{\mathcal{H}}$.
Representer theorem
Consider a feature map $\varphi : \mathcal{X} \to \mathcal{H}$. Let $(x_1, \dots, x_n) \in \mathcal{X}^n$ and assume that the functional $\Psi : \mathbb{R}^{n+1} \to \mathbb{R}$ is strictly increasing in its last variable. Then the infimum of
$$\Psi\big(\langle \theta, \varphi(x_1)\rangle, \dots, \langle \theta, \varphi(x_n)\rangle, \|\theta\|^2\big)$$
can be obtained by restricting to a vector $\theta$ in the linear span of $\varphi(x_1), \dots, \varphi(x_n)$, i.e.,
$$\theta = \sum_{i=1}^n \alpha_i \varphi(x_i), \quad \text{with } \alpha \in \mathbb{R}^n.$$
Theorem (Kimeldorf, Wahba, 1971)
For $\lambda > 0$, the minimization problem $\frac{1}{n}\sum_{i=1}^n \ell\big(y_i, \langle \theta, \varphi(x_i)\rangle\big) + \lambda\|\theta\|^2$ can be restricted to
$$\theta = \sum_{i=1}^n \alpha_i \varphi(x_i), \quad \alpha \in \mathbb{R}^n.$$
No convexity is assumed on the loss function $\ell$.
$$\min_{\theta \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n \ell\big(y_i, \langle \varphi(x_i), \theta\rangle\big) + \frac{\lambda}{2}\|\theta\|_{\mathcal{H}}^2, \qquad \theta = \sum_{i=1}^n \alpha_i \varphi(x_i).$$
Question: how to solve the problem?
$$\langle \theta, \varphi(x_i)\rangle = \Big\langle \sum_{j=1}^n \alpha_j \varphi(x_j),\, \varphi(x_i)\Big\rangle = \sum_{j=1}^n \alpha_j \langle \varphi(x_i), \varphi(x_j)\rangle = \sum_{j=1}^n \alpha_j k(x_i, x_j) = (K\alpha)_i$$
Kernel function: $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, $k(x, x') = \langle \varphi(x), \varphi(x')\rangle$.
Kernel matrix: $K \in \mathbb{R}^{n \times n}$, $K_{ij} = k(x_i, x_j)$ (the Gram matrix of the feature vectors).
$$\|\theta\|^2 = \Big\|\sum_{i=1}^n \alpha_i \varphi(x_i)\Big\|^2 = \sum_{i,j=1}^n \alpha_i \alpha_j \langle \varphi(x_i), \varphi(x_j)\rangle = \sum_{i,j=1}^n \alpha_i \alpha_j k(x_i, x_j) = \sum_{i,j=1}^n \alpha_i \alpha_j K_{ij} = \alpha^\top K\alpha.$$
Equivalent problem:
$$\min_{\alpha \in \mathbb{R}^n} \frac{1}{n}\sum_{i=1}^n \ell\big(y_i, (K\alpha)_i\big) + \frac{\lambda}{2}\alpha^\top K\alpha.$$
For any test point $x \in \mathcal{X}$,
$$f(x) = \langle \theta, \varphi(x)\rangle = \sum_{i=1}^n \alpha_i \langle \varphi(x_i), \varphi(x)\rangle = \sum_{i=1}^n \alpha_i k(x, x_i).$$
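These identities are easy to check numerically. Below is a minimal sketch assuming a linear kernel $k(x, x') = x^\top x'$ (so that $\varphi(x) = x$) and random data; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3
X = rng.standard_normal((n, d))        # rows are phi(x_i) = x_i for the linear kernel
alpha = rng.standard_normal(n)

K = X @ X.T                            # kernel matrix, K_ij = <x_i, x_j>
theta = X.T @ alpha                    # theta = sum_i alpha_i phi(x_i)

# <theta, phi(x_i)> equals (K alpha)_i
assert np.allclose(X @ theta, K @ alpha)
# ||theta||^2 equals alpha^T K alpha
assert np.allclose(theta @ theta, alpha @ K @ alpha)
# prediction at a new point: f(x) = <theta, x> = sum_i alpha_i k(x, x_i)
x_new = rng.standard_normal(d)
assert np.allclose(theta @ x_new, alpha @ (X @ x_new))
```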
Kernel trick
• Replace the search space $\mathcal{H}$ by $\mathbb{R}^n$.
• Separate the representation problem from the design of algorithms and their analysis.
Message: there is no need to compute the feature vector $\varphi(x)$ explicitly; only dot products are needed.
Kernels
Definition (Positive-definite kernels)
A function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a positive-definite kernel if and only if all kernel matrices resulting from $k$ are symmetric positive semidefinite.
$K$ is positive semidefinite $\iff$ all its eigenvalues are non-negative $\iff$ $\alpha^\top K\alpha \ge 0$ for all $\alpha \in \mathbb{R}^n$.
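As a quick numerical sanity check (a minimal sketch; the variable names are illustrative): the Gram matrix of any set of feature vectors is symmetric with non-negative eigenvalues, up to round-off.

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.standard_normal((6, 4))            # 6 feature vectors in R^4
K = Phi @ Phi.T                              # Gram matrix, K_ij = <phi_i, phi_j>

assert np.allclose(K, K.T)                   # symmetry
assert np.linalg.eigvalsh(K).min() > -1e-10  # positive semidefinite (up to round-off)

alpha = rng.standard_normal(6)               # equivalently, the quadratic form is non-negative
assert alpha @ K @ alpha >= -1e-10
```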
Theorem (Aronszajn, 1950)
A function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a positive-definite kernel if and only if there exists a Hilbert space $\mathcal{H}$ and a function $\varphi : \mathcal{X} \to \mathcal{H}$ such that for all $x, x' \in \mathcal{X}$, $k(x, x') = \langle \varphi(x), \varphi(x')\rangle_{\mathcal{H}}$.
No assumption is made on the input space $\mathcal{X}$, and no regularity assumption is made on $k$.
$k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a positive-definite kernel $\Longrightarrow$ there exists $\varphi : \mathcal{X} \to \mathcal{H}$ ($\varphi$: feature map; $\mathcal{H}$: feature space) such that $k(x, x') = \langle \varphi(x), \varphi(x')\rangle_{\mathcal{H}}$.
Reproducing property of the kernel space:
$$\langle k(\cdot, x), f\rangle = f(x), \qquad \langle k(\cdot, x), k(\cdot, x')\rangle = k(x, x')$$
Such spaces are called reproducing kernel Hilbert spaces (RKHS): function evaluation is the dot product with a function (which encodes smoothness).
Kernel calculus
The set of positive-definite kernels on a set $\mathcal{X}$ is a cone: if $k_1, k_2 : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ are positive-definite kernels and $\lambda \ge 0$, then
• $k_1 + k_2$ is a positive-definite kernel,
• $\lambda k_1$ is a positive-definite kernel,
• $k_1 k_2$ is a positive-definite kernel.
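A small numerical illustration of these closure rules (a sketch assuming two linear base kernels on the same points; the names are illustrative): the sum, non-negative scaling, and pointwise product of kernel matrices all remain positive semidefinite.

```python
import numpy as np

rng = np.random.default_rng(1)
X1 = rng.standard_normal((8, 3))
X2 = rng.standard_normal((8, 5))
K1, K2 = X1 @ X1.T, X2 @ X2.T        # two PSD kernel matrices on the same 8 points
lam = 2.5

# k1 + k2, lambda * k1, and k1 * k2 (elementwise/Hadamard product) stay PSD
for M in (K1 + K2, lam * K1, K1 * K2):
    assert np.linalg.eigvalsh(M).min() > -1e-10
```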
Linear kernel
$\mathcal{X} = \mathbb{R}^d$, $k(x, x') = x^\top x'$.
$\varphi(x) = x$ is an explicit feature map.
Function space: linear functions $f_\theta(x) = \theta^\top x$ with $\ell_2$-penalty $\|\theta\|_2^2$.
• If $d > n$, the representer theorem is useful if $k(x, x')$ is easy to compute (cost $O(d)$).
• Naive running time to compute the kernel matrix: $O(n^2 d)$.
Polynomial kernel
$\mathcal{X} = \mathbb{R}^d$, $k(x, x') = (x^\top x')^s$, $s \in \mathbb{N}$, $s \ge 1$.
$k$ is positive-definite as a product of positive-definite kernels.
$$k(x, x') = \Big(\sum_{i=1}^d x_i x_i'\Big)^s \quad \text{(multinomial theorem)}$$
$$= \sum_{\alpha_1 + \cdots + \alpha_d = s} \binom{s}{\alpha_1, \dots, \alpha_d}\, (x_1 x_1')^{\alpha_1} \cdots (x_d x_d')^{\alpha_d} = \sum_{\alpha_1 + \cdots + \alpha_d = s} \binom{s}{\alpha_1, \dots, \alpha_d}\, \big(x_1^{\alpha_1} \cdots x_d^{\alpha_d}\big)\big((x_1')^{\alpha_1} \cdots (x_d')^{\alpha_d}\big) = \varphi(x)^\top \varphi(x'),$$
with
$$\varphi(x)_{\alpha_1, \dots, \alpha_d} = \binom{s}{\alpha_1, \dots, \alpha_d}^{1/2} x_1^{\alpha_1} \cdots x_d^{\alpha_d}.$$
Polynomial kernel
The dimension of $\varphi(x)$ is $\binom{d+s-1}{s} \sim C_s\, d^s$, hence the kernel trick is useful.
The space of functions is the set of homogeneous polynomials of degree $s$.
• Example: $d = 2$, $s = 2$: $\varphi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$.
Binary classification: separation by an ellipsoid in the input space corresponds to linear separation in the feature space.
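A quick numerical check of this explicit feature map for $d = 2$, $s = 2$ (a minimal sketch; the helper name is illustrative):

```python
import numpy as np

def phi(x):
    # explicit feature map for the homogeneous polynomial kernel with d = 2, s = 2
    return np.array([x[0]**2, np.sqrt(2.0) * x[0] * x[1], x[1]**2])

rng = np.random.default_rng(2)
x, xp = rng.standard_normal(2), rng.standard_normal(2)

k_trick = (x @ xp) ** 2          # kernel trick: O(d) work
k_explicit = phi(x) @ phi(xp)    # explicit feature space: dimension grows like d^s
assert np.allclose(k_trick, k_explicit)
```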
Translation-invariant kernels on $\mathcal{X} = [0, 1]$
$$k(x, x') = q(x - x')$$
with $q : [0, 1] \to \mathbb{R}$ extended 1-periodically: $q(x + 1) = q(x)$ for all $x$.
Goal: show how such kernels emerge from penalties on the Fourier coefficients of functions.
Consider $f : \mathbb{R} \to \mathbb{R}$ 1-periodic with $\int_0^1 f(x)^2\, dx$ finite.
Orthonormal basis: $e_m : x \mapsto e^{2i m\pi x}$, $m \in \mathbb{Z}$, since
$$\int_0^1 e^{2i m\pi x}\big(e^{2i m'\pi x}\big)^*\, dx = \int_0^1 e^{2i(m - m')\pi x}\, dx = 0, \quad m \ne m'.$$
$$f = \sum_{m \in \mathbb{Z}} \langle f, e_m\rangle_{L^2[0,1]}\, e_m$$
Fourier series:
$$f(x) = \sum_{m \in \mathbb{Z}} \hat{f}_m e^{2i m\pi x}, \qquad \hat{f}_m = \langle f, e_m\rangle = \int_0^1 f(x) e^{-2i m\pi x}\, dx$$
Preliminaries:
• Parseval's identity: $\int_0^1 |f(x)|^2\, dx = \sum_{m \in \mathbb{Z}} |\hat{f}_m|^2$.
• If $f$ is differentiable, $f'(x) = \sum_{m \in \mathbb{Z}} \hat{f}_m (2i m\pi) e^{2i m\pi x}$, i.e., the Fourier coefficients of $f'$ are $(2i m\pi)\hat{f}_m$.
Consequence:
$$\int_0^1 |f'(x)|^2\, dx = \sum_{m \in \mathbb{Z}} \big|\hat{f}_m (2i m\pi)\big|^2 = \sum_{m \in \mathbb{Z}} |2m\pi|^2\, |\hat{f}_m|^2$$
Sobolev norm:
$$\Omega(f) = \|f\|^2 = \int_0^1 |f(x)|^2\, dx + \int_0^1 |f'(x)|^2\, dx = \sum_{m \in \mathbb{Z}} (1 + 4\pi^2 m^2)\, |\hat{f}_m|^2$$
• Learning problem: $\inf_f \widehat{R}(f) + \frac{\lambda}{2}\|f\|^2$ with $\lambda > 0$: the penalty favors smooth functions.
The penalty $\|f\|^2$ can be interpreted through a feature map and its associated dot product $\big(\langle a, b\rangle = \sum_{m \in \mathbb{Z}} a_m b_{-m}\big)$:
$$\varphi(x)_m = \frac{e^{2i m\pi x}}{\sqrt{1 + 4\pi^2 m^2}}, \qquad f_\theta(x) = \langle \theta, \varphi(x)\rangle \quad (\theta \in \ell_2).$$
Fourier series:
$$f_\theta(x) = \sum_{m \in \mathbb{Z}} (\hat{f}_\theta)_m\, e^{2i m\pi x} = \sum_{m \in \mathbb{Z}} \underbrace{(\hat{f}_\theta)_m \sqrt{1 + 4\pi^2 m^2}}_{\theta_m}\; \frac{e^{2i m\pi x}}{\sqrt{1 + 4\pi^2 m^2}},$$
so that
$$\|\theta\|^2 = \sum_{m \in \mathbb{Z}} |\theta_m|^2 = \Omega(f_\theta) \quad \text{(Sobolev penalty)}.$$
The corresponding kernel is
$$k(x, x') = \langle \varphi(x), \varphi(x')\rangle = \sum_{m \in \mathbb{Z}} \frac{e^{2i m\pi(x - x')}}{1 + 4\pi^2 m^2} = q(x - x').$$
More generally, any penalty of the form $\sum_{m \in \mathbb{Z}} c_m |\hat{f}_m|^2$ defines a squared RKHS norm, provided $c_m$ is strictly positive for all $m \in \mathbb{Z}$ and $\sum_m \frac{1}{c_m} < \infty$.
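A numerical sketch of this construction (assuming a symmetric truncation of the sum at $|m| \le M$; the function name and constants are illustrative): the truncated series already gives a real, symmetric, positive semidefinite kernel matrix on points of $[0, 1]$.

```python
import numpy as np

def q_sobolev(delta, M=200):
    # truncation of q(delta) = sum_{m in Z} exp(2i*pi*m*delta) / (1 + 4*pi^2*m^2) to |m| <= M
    m = np.arange(1, M + 1)
    return 1.0 + 2.0 * np.sum(np.cos(2 * np.pi * m * delta) / (1 + 4 * np.pi**2 * m**2))

rng = np.random.default_rng(3)
x = rng.uniform(size=10)                       # points in [0, 1]
K = np.array([[q_sobolev(xi - xj) for xj in x] for xi in x])

assert np.allclose(K, K.T)                     # symmetric
assert np.linalg.eigvalsh(K).min() > -1e-10    # positive semidefinite
```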
Translation-invariant kernels on $\mathbb{R}^d$
$\mathcal{X} = \mathbb{R}^d$, $k(x, x') = q(x - x')$ [e.g. $k(x, x') = e^{-\alpha\|x - x'\|_2^2}$].
Question: when is this a kernel?
Fourier transform: $\hat{f}(\omega) = \int_{\mathbb{R}^d} f(x) e^{-i\omega^\top x}\, dx$
Inversion formula: $f(x) = \frac{1}{(2\pi)^d}\int_{\mathbb{R}^d} \hat{f}(\omega) e^{i\omega^\top x}\, d\omega$
Parseval's identity: $\int_{\mathbb{R}^d} |f(x)|^2\, dx = \frac{1}{(2\pi)^d}\int_{\mathbb{R}^d} |\hat{f}(\omega)|^2\, d\omega$

Theorem (Bochner's theorem)
$k(x, x') = q(x - x')$ is positive-definite $\iff$ $\hat{q}(\omega) \ge 0$ for all $\omega \in \mathbb{R}^d$.
Translation-invariant kernels on $\mathbb{R}^d$
Summary:
(1) If $q$ has a non-negative Fourier transform, then $k(x, x') = q(x - x')$ is a positive-definite kernel:
$$k(x, x') = \frac{1}{(2\pi)^d}\int_{\mathbb{R}^d} \sqrt{\hat{q}(\omega)}\, e^{i\omega^\top x}\,\Big(\sqrt{\hat{q}(\omega)}\, e^{i\omega^\top x'}\Big)^*\, d\omega = \int_{\mathbb{R}^d} \varphi(x)_\omega\, \varphi(x')_\omega^*\, d\omega.$$
(2) The associated norm on functions is
$$\Omega(f) = \frac{1}{(2\pi)^d}\int_{\mathbb{R}^d} \frac{|\hat{f}(\omega)|^2}{\hat{q}(\omega)}\, d\omega.$$
• Exponential kernel: $q(u) = e^{-\|u\|_2}$.
• Gaussian kernel: $q(u) = e^{-\|u\|_2^2}$.
Fourier transform of the exponential kernel:
$$\hat{q}(\omega) = \frac{2^d \pi^{(d-1)/2}\,\Gamma\big(\frac{d+1}{2}\big)}{(1 + \|\omega\|_2^2)^{(d+1)/2}}$$
For $d$ odd, $\hat{q}(\omega)^{-1}$ is a sum of monomials: the norm penalizes all derivatives up to total order $(d+1)/2$.
The case $d = 1$:
$$\|f\|_{\mathcal{H}}^2 = \frac{1}{2\pi}\int_{\mathbb{R}} \frac{|\hat{f}(\omega)|^2}{\hat{q}(\omega)}\, d\omega = \frac{1}{4\pi}\int_{\mathbb{R}} |\hat{f}(\omega)|^2\, d\omega + \frac{1}{4\pi}\int_{\mathbb{R}} |\omega\hat{f}(\omega)|^2\, d\omega = \frac{1}{2}\int_{\mathbb{R}} |f(x)|^2\, dx + \frac{1}{2}\int_{\mathbb{R}} |f'(x)|^2\, dx$$
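A numerical sanity check of the $d = 1$ case (a sketch using simple quadrature on a truncated grid; grid size and tolerance are arbitrary choices): the Fourier transform of $q(u) = e^{-|u|}$ should match $\hat{q}(\omega) = 2/(1 + \omega^2)$, i.e., the formula above with $d = 1$.

```python
import numpy as np

# approximate q_hat(omega) = int e^{-|u|} e^{-i*omega*u} du on a truncated grid
u = np.linspace(-40.0, 40.0, 200001)
du = u[1] - u[0]
q = np.exp(-np.abs(u))

for omega in (0.0, 0.5, 1.0, 3.0):
    q_hat = np.sum(q * np.cos(omega * u)) * du   # imaginary part vanishes by symmetry
    assert np.isclose(q_hat, 2.0 / (1.0 + omega**2), atol=1e-3)
```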
Examples:
• Exponential kernel: $q(u) = e^{-\|u\|_2}$.
• Gaussian kernel: $q(u) = e^{-\|u\|_2^2}$.
Fourier transform of the Gaussian kernel:
$$\hat{q}(\omega) = \pi^{d/2} e^{-\|\omega\|_2^2/4}$$
The power series expansion of $\hat{q}(\omega)^{-1}$ shows that the RKHS norm penalizes derivatives of all orders (functions in the RKHS are infinitely differentiable).
Kernels beyond $\mathcal{X} = \mathbb{R}^d$: Sequences
An alphabet $A$; the observations $x_1, \dots, x_n$ are sequences of letters from $A$.
$\varphi(x)$ is indexed by all sequences $y$ of length $\ell$:
$\varphi(x)_y = 1$ if $y$ is contained in $x$ (and $0$ otherwise).
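A minimal sketch of such a sequence kernel (assuming the presence-of-substring convention above; the function names are illustrative): the kernel counts the length-$\ell$ strings contained in both sequences.

```python
def substrings(x: str, ell: int) -> set[str]:
    """All length-ell contiguous substrings contained in x."""
    return {x[i:i + ell] for i in range(len(x) - ell + 1)}

def k_seq(x: str, xp: str, ell: int = 3) -> int:
    """k(x, x') = <phi(x), phi(x')> with phi(x)_y = 1 iff y (of length ell) is contained in x."""
    return len(substrings(x, ell) & substrings(xp, ell))

print(k_seq("GATTACA", "ATTACCA"))   # shared 3-mers: ATT, TTA, TAC -> 3
```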
Algorithm
Goal:
$$\min_{f \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n \ell\big(y_i, f(x_i)\big) + \frac{\lambda}{2}\|f\|_{\mathcal{H}}^2$$
Representer theorem: $f(x) = \langle \theta, \varphi(x)\rangle$, with
$$\theta = \sum_{i=1}^n \alpha_i \varphi(x_i) \quad\Longrightarrow\quad f(x) = \sum_{i=1}^n \alpha_i k(x, x_i)$$
• $\alpha$ is obtained as the minimizer of
$$\frac{1}{n}\sum_{i=1}^n \ell\big(y_i, (K\alpha)_i\big) + \frac{\lambda}{2}\alpha^\top K\alpha.$$
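For a general smooth loss this finite-dimensional problem can be solved by plain gradient descent on $\alpha$. Below is a minimal sketch assuming the logistic loss $\ell(y, u) = \log(1 + e^{-yu})$ with labels in $\{-1, +1\}$, a Gaussian kernel, and a fixed step size; these specific choices are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(4)
n, lam = 60, 0.1
X = rng.standard_normal((n, 2))
y = np.sign(X[:, 0] * X[:, 1] + 0.1 * rng.standard_normal(n))    # labels in {-1, +1}

K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))      # Gaussian kernel matrix

kmax = np.linalg.eigvalsh(K).max()
step = 1.0 / (kmax * (0.25 * kmax / n + lam))                    # 1/L, L an upper bound on the Hessian norm

alpha = np.zeros(n)
for _ in range(500):
    u = K @ alpha                                                # u_i = (K alpha)_i
    g = -y / (1.0 + np.exp(y * u))                               # g_i = d/du ell(y_i, u_i), logistic loss
    alpha -= step * (K @ g / n + lam * u)                        # gradient of the objective in alpha

print("training accuracy:", np.mean(np.sign(K @ alpha) == y))
```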
Example: least-squares (kernel ridge regression)
$$\min_{\alpha \in \mathbb{R}^n} \frac{1}{2n}\|y - K\alpha\|_2^2 + \frac{\lambda}{2}\alpha^\top K\alpha = \min_{\alpha \in \mathbb{R}^n} F(\alpha)$$
$$F'(\alpha) = \frac{1}{n} K(K\alpha - y) + \lambda K\alpha = 0 \iff (K^2 + n\lambda K)\alpha = Ky$$
A solution is $\alpha = (K + n\lambda I)^{-1} y$; it may not be unique if $K$ is not invertible.
Issue: $K$ often has tiny eigenvalues.
Gradient descent: Hessian matrix $F''(\alpha) = \frac{1}{n}K^2 + \lambda K =: H$; the number of iterations grows with the condition number $\frac{\lambda_{\max}(H)}{\lambda_{\min}(H)}$.
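A minimal kernel ridge regression sketch following this closed form (Gaussian kernel on synthetic 1D data; every specific choice below is illustrative rather than prescribed by the slides):

```python
import numpy as np

rng = np.random.default_rng(5)
n, lam = 50, 1e-2
x = np.sort(rng.uniform(-3, 3, n))
y = np.sin(x) + 0.1 * rng.standard_normal(n)

def gauss_kernel(a, b):
    return np.exp(-(a[:, None] - b[None, :]) ** 2)

K = gauss_kernel(x, x)
alpha = np.linalg.solve(K + n * lam * np.eye(n), y)   # alpha = (K + n*lambda*I)^{-1} y

x_test = np.linspace(-3, 3, 200)                      # predictions: f(x) = sum_i alpha_i k(x, x_i)
f_test = gauss_kernel(x_test, x) @ alpha

H = K @ K / n + lam * K                               # gradient-descent Hessian
eig = np.linalg.eigvalsh(H)
print("condition number of H:", eig.max() / max(eig.min(), 1e-15))
```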
Alternative: factor $K = \Phi\Phi^\top$, with $\Phi \in \mathbb{R}^{n \times m}$ and $m$ the rank of $K$, and solve
$$\min_{\beta \in \mathbb{R}^m} \frac{1}{n}\sum_{i=1}^n \ell\big(y_i, (\Phi\beta)_i\big) + \frac{\lambda}{2}\|\beta\|_2^2$$
Optimality condition: $\frac{1}{n}\Phi^\top g + \lambda\beta = 0$, where $g_i$ is the derivative of $\ell(y_i, \cdot)$ at $(\Phi\beta)_i$.
One then obtains $\alpha \in \mathbb{R}^n$ via $\alpha = -\frac{1}{\lambda n} g$, and $\beta = \Phi^\top\alpha$ is the explicit feature-space representation.
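A sketch of this reparameterization for the squared loss, assuming $\Phi$ is obtained from an eigendecomposition of $K$ (the comparison with the closed-form $\alpha$ of the previous slide is only an illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
n, lam = 40, 1e-2
x = rng.uniform(-3, 3, n)
y = np.sin(x) + 0.1 * rng.standard_normal(n)
K = np.exp(-(x[:, None] - x[None, :]) ** 2)

# factor K = Phi Phi^T through an eigendecomposition, keeping the numerical rank
w, U = np.linalg.eigh(K)
keep = w > 1e-10 * w.max()
Phi = U[:, keep] * np.sqrt(w[keep])                     # Phi in R^{n x m}, m = numerical rank of K
m = Phi.shape[1]

# squared loss: beta minimizes (1/2n)||y - Phi beta||^2 + (lambda/2)||beta||^2
beta = np.linalg.solve(Phi.T @ Phi + n * lam * np.eye(m), Phi.T @ y)

# same fitted values as the kernel-space solution alpha = (K + n*lambda*I)^{-1} y
alpha = np.linalg.solve(K + n * lam * np.eye(n), y)
assert np.allclose(Phi @ beta, K @ alpha, atol=1e-6)
```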
Analysis
$$\hat{f} = \mathop{\arg\min}_{f \in \mathcal{H}} \widehat{R}(f) \quad \text{such that } \|f\|_{\mathcal{H}} \le D$$
Question: what is the performance on unseen data, $\mathbb{E}\big[R(\hat{f}) - R^*\big]$, where $R$ is the expected risk,
$$R(f) = \mathbb{E}\big[\ell(y, f(x))\big] \quad \text{for all } f : \mathcal{X} \to \mathbb{R}?$$
$$R(\hat{f}) - R^* = \underbrace{R(\hat{f}) - \inf_{\|f\| \le D} R(f)}_{\text{estimation error}} + \underbrace{\inf_{\|f\| \le D} R(f) - R^*}_{\text{approximation error}}$$
$G$-Lipschitz-continuous loss and $\|\varphi(x)\|_{\mathcal{H}}^2 \le R^2$ $\Longrightarrow$
$$\mathbb{E}[\text{estimation error}] \le \frac{4GDR}{\sqrt{n}}$$
Approximation error: $\inf_{\|f\| \le D} R(f) - R^*$.
Let $f_* = \arg\min R(f)$ (the Bayes predictor). By the $G$-Lipschitz continuity of the loss and Jensen's inequality,
$$R(f) - R(f_*) \le \mathbb{E}\big[|\ell(y, f(x)) - \ell(y, f_*(x))|\big] \le \mathbb{E}\big[G\,|f(x) - f_*(x)|\big] \le G\sqrt{\mathbb{E}\big[|f(x) - f_*(x)|^2\big]} = G\,\|f - f_*\|_{L^2(p)},$$
so
$$\text{approximation error} \le G \inf_{\|f\|_{\mathcal{H}} \le D} \|f - f_*\|_{L^2(p)}.$$
• In $\mathbb{R}^d$, with $f_*(x) = \theta_*^\top\varphi(x)$ and $f(x) = \theta^\top\varphi(x)$: $\|f - f_*\|_{L^2(p)} \le R\,\|\theta - \theta_*\|_2$.
The approximation error goes to 0 as $D \to \infty$ (when $f_*$ can be approximated by functions in $\mathcal{H}$).
Note: if $f_* \in \mathcal{H}$ and $D \ge \|f_*\|$, then the approximation error is 0.
Risk decomposition (assuming $\sup_{x \in \mathcal{X}} k(x, x) \le R^2$): a dimension-free bound,
$$\mathbb{E}\big[R(\hat{f}_D)\big] - R(f_*) \le \frac{4GDR}{\sqrt{n}} + G \inf_{\|f\|_{\mathcal{H}} \le D} \|f - f_*\|_{L^2(p)}$$
Optimal choice of the radius:
$$D = \frac{\sqrt{n}}{4R}\sqrt{\inf_{f \in \mathcal{H}}\Big\{\|f - f_*\|_{L^2(p)}^2 + \frac{16R^2\|f\|_{\mathcal{H}}^2}{n}\Big\}}$$
Key quantity:
$$A(\mu, f_*) = \inf_{f \in \mathcal{H}}\big\{\|f - f_*\|_{L^2(p)}^2 + \mu\|f\|_{\mathcal{H}}^2\big\}$$
It captures the tradeoff between the estimation and approximation errors.
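As a short worked consequence of the dimension-free bound above: in the well-specified case $f_* \in \mathcal{H}$, taking $D = \|f_*\|_{\mathcal{H}}$ makes the approximation term vanish, so
$$\mathbb{E}\big[R(\hat{f}_D)\big] - R(f_*) \le \frac{4GR\,\|f_*\|_{\mathcal{H}}}{\sqrt{n}},$$
a dimension-free $1/\sqrt{n}$ rate, consistent with the bound announced in the introduction.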
Key quantity: $A(\mu, f_*) = \inf_{f \in \mathcal{H}}\big\{\|f - f_*\|_{L^2(p)}^2 + \mu\|f\|_{\mathcal{H}}^2\big\}$
• If $f_* \in \mathcal{H}$: $A(\mu, f_*) \le \mu\|f_*\|_{\mathcal{H}}^2$ (well-specified case).
• If $f_* \notin \mathcal{H}$ but can be arbitrarily well approximated by $f \in \mathcal{H}$: $A(\mu, f_*)$ tends to zero as $\mu \to 0$, but with no rate (unless further conditions hold).
• If $f_*$ cannot be approximated: ...
Summary
For kernel methods, if $f_*$ has $s$ bounded derivatives:
• Excess risk (if the regularization parameter is well chosen):
$$R(\hat{f}) - R^* \sim n^{-\frac{s}{d+2s}}$$
• Adaptive!
For neural networks: even more adaptivity.