
MATH3340 Mathematics of Machine Learning

Lecturer: Bangti Jin ([email protected])

Chinese University of Hong Kong

March 2025

1 / 35
Outline

1 Representer theorem

2 Kernels
   Algorithm and analysis
   Generalization guarantees: Lipschitz-continuous losses

2 / 35
Introduction

Linear models: $f_\theta(x) = \theta^\top \varphi(x)$, where $\varphi(x) \in \mathbb{R}^d$ is the feature map

• Estimation error is bounded as $1/\sqrt{n}$, independent of the dimension:
$$R_n(\mathcal{F}) \le \frac{D}{\sqrt{n}} \sqrt{\mathbb{E}\big[\|\varphi(x)\|_2^2\big]}$$
• Convex optimization

$\Longrightarrow$ $d = +\infty$: $f_\theta(x) = \langle\theta, \varphi(x)\rangle$, $\theta \in \mathcal{H}$ (Hilbert space), $\varphi(x) \in \mathcal{H}$

3 / 35
A Hilbert space $\mathcal{H}$ is a complete vector space equipped with an inner product $\langle\cdot,\cdot\rangle$. It satisfies the following properties:
  Vector space structure: vector addition and scalar multiplication are defined
  Inner product $\langle\cdot,\cdot\rangle$ that maps pairs of vectors in $\mathcal{H}$ to a scalar (real or complex), satisfying:
    conjugate symmetry: $\langle x, y\rangle = \overline{\langle y, x\rangle}$
    linearity in the first argument: $\langle a x + b y, z\rangle = a\langle x, z\rangle + b\langle y, z\rangle$
    positive definiteness: $\langle x, x\rangle \ge 0$, with equality if and only if $x = 0$
  Completeness: every Cauchy sequence in the space converges to a limit within the space

4 / 35
Norm induced by the inner product:
$$\|x\| = \sqrt{\langle x, x\rangle}$$

Orthogonality: two vectors $x$ and $y$ are orthogonal if $\langle x, y\rangle = 0$

Finite-dimensional example: $\mathbb{R}^n$ with the standard dot product.

Infinite-dimensional example: the space of square-integrable functions $L^2(\Omega)$ with the inner product
$$\langle f, g\rangle = \int_\Omega f(x)\, g(x)\,dx$$

5 / 35
Representer theorem

$d = +\infty$: $f_\theta(x) = \langle\theta, \varphi(x)\rangle$, $\theta \in \mathcal{H}$ (Hilbert space), $\varphi(x) \in \mathcal{H}$

Data $x_1, \ldots, x_n \in \mathcal{X}$, $y_1, \ldots, y_n \in \mathcal{Y}$, predictor $f_\theta(x) = \langle\theta, \varphi(x)\rangle$ with $\varphi : \mathcal{X} \to \mathcal{H}$.
Regularized ERM:
$$\min_{\theta \in \mathcal{H}} \ \frac{1}{n}\sum_{i=1}^n \ell\big(y_i, \langle\theta, \varphi(x_i)\rangle\big) + \frac{\lambda}{2}\|\theta\|_{\mathcal{H}}^2$$

Key feature: the observations $x_i$ are accessed only through the dot products $\langle\theta, \varphi(x_i)\rangle$, and the penalty is the Hilbert norm $\|\theta\|$.

6 / 35
Representer theorem
Consider a feature map $\varphi : \mathcal{X} \to \mathcal{H}$. Let $(x_1, \ldots, x_n) \in \mathcal{X}^n$ and assume that the functional $\Psi : \mathbb{R}^{n+1} \to \mathbb{R}$ is strictly increasing in the last variable. Then the infimum of
$$\Psi\big(\langle\theta, \varphi(x_1)\rangle, \ldots, \langle\theta, \varphi(x_n)\rangle, \|\theta\|^2\big)$$
can be obtained by restricting to a vector $\theta$ in the linear span of $\varphi(x_1), \ldots, \varphi(x_n)$, i.e.
$$\theta = \sum_{i=1}^n \alpha_i \varphi(x_i), \quad \text{with } \alpha \in \mathbb{R}^n$$

7 / 35
Theorem (Kimeldorf, Wahba, 1971)
For $\lambda > 0$, the minimization of $\frac{1}{n}\sum_{i=1}^n \ell\big(y_i, \langle\theta, \varphi(x_i)\rangle\big) + \lambda\|\theta\|^2$ can be restricted to
$$\theta = \sum_{i=1}^n \alpha_i \varphi(x_i), \quad \alpha \in \mathbb{R}^n$$

No convexity is assumed on the loss function $\ell$.

8 / 35
$$\min_{\theta \in \mathcal{H}} \ \frac{1}{n}\sum_{i=1}^n \ell\big(y_i, \langle\varphi(x_i), \theta\rangle\big) + \frac{\lambda}{2}\|\theta\|_{\mathcal{H}}^2, \qquad \theta = \sum_{i=1}^n \alpha_i \varphi(x_i).$$

Question: how to solve the problem?

$$\langle\theta, \varphi(x_i)\rangle = \Big\langle \sum_{j=1}^n \alpha_j \varphi(x_j),\, \varphi(x_i) \Big\rangle = \sum_{j=1}^n \alpha_j \langle\varphi(x_i), \varphi(x_j)\rangle = \sum_{j=1}^n \alpha_j k(x_i, x_j) = (K\alpha)_i$$

Kernel function: $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, $k(x, x') = \langle\varphi(x), \varphi(x')\rangle$.

Kernel matrix: $K \in \mathbb{R}^{n \times n}$, $K_{ij} = k(x_i, x_j)$ (the Gram matrix of the feature vectors)


9 / 35
$$\|\theta\|^2 = \Big\|\sum_{i=1}^n \alpha_i \varphi(x_i)\Big\|^2 = \sum_{i,j=1}^n \alpha_i \alpha_j \langle\varphi(x_i), \varphi(x_j)\rangle = \sum_{i,j=1}^n \alpha_i \alpha_j k(x_i, x_j) = \sum_{i,j=1}^n \alpha_i \alpha_j K_{ij} = \alpha^\top K \alpha.$$

Equivalent problem:
$$\min_{\alpha \in \mathbb{R}^n} \ \frac{1}{n}\sum_{i=1}^n \ell\big(y_i, (K\alpha)_i\big) + \frac{\lambda}{2}\,\alpha^\top K \alpha.$$

For any test point $x \in \mathcal{X}$,
$$f(x) = \langle\theta, \varphi(x)\rangle = \sum_{i=1}^n \alpha_i \langle\varphi(x_i), \varphi(x)\rangle = \sum_{i=1}^n \alpha_i k(x, x_i)$$

10 / 35
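To make the formulas above concrete, here is a minimal numerical sketch (assuming numpy; the Gaussian kernel and random data are arbitrary illustrative choices): it builds the kernel matrix $K$, evaluates the training predictions $(K\alpha)_i$, and evaluates $f(x) = \sum_i \alpha_i k(x, x_i)$ at a new point.

```python
import numpy as np

def gaussian_kernel(x, xp, gamma=1.0):
    """k(x, x') = exp(-gamma * ||x - x'||_2^2), a positive-definite kernel."""
    return np.exp(-gamma * np.sum((x - xp) ** 2))

rng = np.random.default_rng(0)
n, d = 5, 3
X = rng.standard_normal((n, d))      # training inputs x_1, ..., x_n
alpha = rng.standard_normal(n)       # coefficients alpha in R^n

# Kernel matrix K_ij = k(x_i, x_j): the Gram matrix of the feature vectors
K = np.array([[gaussian_kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

# Training predictions: <theta, phi(x_i)> = (K alpha)_i
train_pred = K @ alpha

# Prediction at a new test point: f(x) = sum_i alpha_i k(x, x_i)
x_new = rng.standard_normal(d)
f_new = sum(alpha[i] * gaussian_kernel(x_new, X[i]) for i in range(n))
print(train_pred, f_new)
```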
Kernel trick

• Replace the search space $\mathcal{H}$ by $\mathbb{R}^n$

• Separate the representation problem from the design of algorithms and their analysis

Message: there is no need to explicitly compute the feature vector $\varphi(x)$; only dot products $k(x, x') = \langle\varphi(x), \varphi(x')\rangle$ are required.

11 / 35
Kernels

Definition (Positive-definite kernels)


A function k : X × X → R is a positive-definite kernel if and only if all
kernel matrices resulting from k are symmetric positive semidefinite.

$K$ is positive semidefinite $\iff$ all eigenvalues are non-negative $\iff$ $\alpha^\top K \alpha \ge 0$ for all $\alpha \in \mathbb{R}^n$

12 / 35
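As a numerical sanity check of this definition, one can build a kernel matrix from arbitrary points and verify that its eigenvalues are non-negative; a small sketch (assuming numpy, with a Gaussian kernel as an example of a positive-definite kernel):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((8, 2))                    # arbitrary points x_1, ..., x_8

# Gaussian kernel matrix K_ij = exp(-||x_i - x_j||_2^2)
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists)

eigvals = np.linalg.eigvalsh(K)                    # K is symmetric: use eigvalsh
print(eigvals)                                     # all non-negative up to round-off
assert np.all(eigvals >= -1e-10)                   # K is positive semidefinite
```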
Theorem (Aronszajn, 1950)
A function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a positive-definite kernel if and only if there exists a Hilbert space $\mathcal{H}$ and a function $\varphi : \mathcal{X} \to \mathcal{H}$ such that for all $x, x' \in \mathcal{X}$, $k(x, x') = \langle\varphi(x), \varphi(x')\rangle_{\mathcal{H}}$.

No assumption on the input space $\mathcal{X}$, no regularity assumption on $k$.

13 / 35
$k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a positive-definite kernel $\Longrightarrow$ there exists $\varphi : \mathcal{X} \to \mathcal{H}$ ($\varphi$: feature map; $\mathcal{H}$: feature space) such that
$$k(x, x') = \langle\varphi(x), \varphi(x')\rangle_{\mathcal{H}}.$$

Reproducing property of the kernel space:
$$\langle k(\cdot, x), f\rangle = f(x), \qquad \langle k(\cdot, x), k(\cdot, x')\rangle = k(x, x')$$

Reproducing kernel Hilbert spaces (RKHS)

Function evaluation is the dot product with a function (smoothness)

14 / 35
Kernel calculus
The set of positive-definite kernels on a set X is a cone.
If $k_1, k_2 : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ are positive-definite kernels and $\lambda \ge 0$, then

• $k_1 + k_2$ is a positive-definite kernel
• $\lambda k_1$ is a positive-definite kernel
• $k_1 k_2$ is a positive-definite kernel

15 / 35
Linear kernel
$\mathcal{X} = \mathbb{R}^d$, $k(x, x') = x^\top x'$
$\varphi(x) = x$ is the explicit feature map
Function space: linear functions $f_\theta(x) = \theta^\top x$ with $\ell_2$-penalty $\|\theta\|_2^2$.
• If $d > n$, the representer theorem is useful when $k(x, x')$ is easy to compute ($O(d)$)
• Naive running time to compute the kernel matrix: $O(n^2 d)$

16 / 35
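For the linear kernel the kernel matrix is just the Gram matrix $XX^\top$; a minimal sketch of the $O(n^2 d)$ computation (assuming numpy, with random data):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 1000                  # d > n: the kernel formulation is attractive
X = rng.standard_normal((n, d))

K = X @ X.T                       # K_ij = x_i^T x_j, computed in O(n^2 d)
print(K.shape)                    # (n, n): the problem now lives in R^n instead of R^d
```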
Polynomial kernel
$\mathcal{X} = \mathbb{R}^d$, $k(x, x') = (x^\top x')^s$, $s \in \mathbb{N}$, $s \ge 1$
$k$ is positive-definite as a product of positive-definite kernels

$$k(x, x') = \Big(\sum_{i=1}^d x_i x_i'\Big)^s \ \text{(multinomial theorem)} = \sum_{\alpha_1 + \cdots + \alpha_d = s} \binom{s}{\alpha_1, \ldots, \alpha_d} \underbrace{(x_1 x_1')^{\alpha_1} \cdots (x_d x_d')^{\alpha_d}}_{(x_1^{\alpha_1} \cdots x_d^{\alpha_d})\,\big((x_1')^{\alpha_1} \cdots (x_d')^{\alpha_d}\big)} = \varphi(x)^\top \varphi(x'),$$
$$\varphi(x)_{\alpha_1, \ldots, \alpha_d} = \binom{s}{\alpha_1, \ldots, \alpha_d}^{1/2} x_1^{\alpha_1} \cdots x_d^{\alpha_d}$$

17 / 35
Polynomial kernel
The dimension of $\varphi(x)$ is $\binom{d+s-1}{s} \sim C_s\, d^s$ for large $d$, so the kernel trick is useful.

The space of functions is the set of homogeneous polynomials of degree $s$.

• Example: $d = 2$, $s = 2$, $\varphi(x) = (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2)$

Binary classification: separation by an ellipsoid in input space corresponds to linear separation in feature space.
18 / 35
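A quick numerical check of the identity $k(x, x') = \varphi(x)^\top \varphi(x')$ for the $d = 2$, $s = 2$ case, with the explicit feature map written out (a sketch, assuming numpy):

```python
import numpy as np

def phi(x):
    """Explicit feature map for d = 2, s = 2: degree-2 monomials with multinomial weights."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2.0) * x[0] * x[1]])

rng = np.random.default_rng(3)
x, xp = rng.standard_normal(2), rng.standard_normal(2)

k_implicit = (x @ xp) ** 2        # polynomial kernel (x^T x')^s with s = 2
k_explicit = phi(x) @ phi(xp)     # dot product of the explicit features
print(k_implicit, k_explicit)     # the two values coincide
```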
Translation-invariant kernels on X = [0, 1]

$k(x, x') = q(x - x')$
$q : [0, 1] \to \mathbb{R}$ extended 1-periodically: $q(x + 1) = q(x)$ for all $x$.
Goal: show how such kernels emerge from penalties on the Fourier coefficients of functions.
Setting: $f : \mathbb{R} \to \mathbb{R}$ is 1-periodic and $\int_0^1 f(x)^2\,dx$ is finite.
Orthonormal basis: $e_m : x \mapsto e^{2im\pi x}$, $m \in \mathbb{Z}$,
$$\int_0^1 e^{2im\pi x}\,\big(e^{2im'\pi x}\big)^*\,dx = \int_0^1 e^{2i(m-m')\pi x}\,dx = 0, \quad m \ne m'$$

19 / 35
$f = \sum_{m \in \mathbb{Z}} \langle f, e_m\rangle_{L^2[0,1]}\, e_m$
Fourier series:
$$f(x) = \sum_{m \in \mathbb{Z}} \hat f_m\, e^{2im\pi x}, \qquad \hat f_m = \langle f, e_m\rangle = \int_0^1 f(x)\, e^{-2im\pi x}\,dx$$

Preliminaries:
• Parseval's identity: $\int_0^1 |f(x)|^2\,dx = \sum_{m \in \mathbb{Z}} |\hat f_m|^2$
• If $f$ is differentiable, $f'(x) = \sum_{m \in \mathbb{Z}} \hat f_m (2im\pi)\, e^{2im\pi x}$ (the Fourier coefficient of $f'$ is $2im\pi \hat f_m$)

20 / 35
Consequence:
$$\int_0^1 |f'(x)|^2\,dx = \sum_{m \in \mathbb{Z}} \big|\hat f_m (2im\pi)\big|^2 = \sum_{m \in \mathbb{Z}} |2m\pi|^2\, |\hat f_m|^2$$

Sobolev norm:
$$\Omega(f) = \|f\|^2 = \int_0^1 |f(x)|^2\,dx + \int_0^1 |f'(x)|^2\,dx = \sum_{m \in \mathbb{Z}} (1 + 4\pi^2 m^2)\, |\hat f_m|^2$$

• Learning problem: $\inf_f \hat R(f) + \frac{\lambda}{2}\|f\|^2$, $\lambda > 0$: favors smooth functions

21 / 35
The penalty $\|f\|^2$ can be interpreted through a feature map and its associated dot product ($\langle a, b\rangle = \sum_{m \in \mathbb{Z}} a_m b_{-m}$):

• Feature map: $\varphi(x)_m = e^{2im\pi x}/\sqrt{1 + 4\pi^2 m^2}$, $f_\theta(x) = \langle\theta, \varphi(x)\rangle$ ($\theta \in \ell^2$)
• Fourier series:
$$f_\theta(x) = \sum_{m \in \mathbb{Z}} (\hat f_\theta)_m\, e^{2im\pi x} = \sum_{m \in \mathbb{Z}} \underbrace{(\hat f_\theta)_m \sqrt{1 + 4\pi^2 m^2}}_{\theta_m}\, \frac{e^{2im\pi x}}{\sqrt{1 + 4\pi^2 m^2}}$$
• $\|\theta\|^2 = \sum_{m \in \mathbb{Z}} |\theta_m|^2 = \Omega(f_\theta)$ (Sobolev penalty)
$$k(x, x') = \langle\varphi(x), \varphi(x')\rangle = \sum_{m \in \mathbb{Z}} \frac{e^{2im\pi(x - x')}}{1 + 4\pi^2 m^2} = q(x - x')$$

Any penalty of the form $\sum_{m \in \mathbb{Z}} c_m |\hat f_m|^2$ defines a squared RKHS norm, provided $c_m$ is strictly positive for all $m \in \mathbb{Z}$ and $\sum_m \frac{1}{c_m} < \infty$.
22 / 35
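The kernel $q(x - x') = \sum_{m} e^{2im\pi(x-x')}/(1 + 4\pi^2 m^2)$ can be evaluated approximately by truncating the Fourier sum; a sketch (assuming numpy; the truncation level is an arbitrary choice):

```python
import numpy as np

def sobolev_kernel(x, xp, M=200):
    """Truncated sum for q(x - x') = sum_{|m| <= M} exp(2i*pi*m*(x - x')) / (1 + 4*pi^2*m^2)."""
    m = np.arange(-M, M + 1)
    coeffs = 1.0 / (1.0 + 4.0 * np.pi ** 2 * m ** 2)
    val = np.sum(coeffs * np.exp(2j * np.pi * m * (x - xp)))
    return val.real               # the imaginary parts cancel by symmetry in m

print(sobolev_kernel(0.3, 0.7))   # depends only on x - x'
print(sobolev_kernel(0.0, 0.4))   # same value: translation invariance
```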
Translation-invariant kernels on $\mathbb{R}^d$
$\mathcal{X} = \mathbb{R}^d$, $k(x, x') = q(x - x')$ [e.g.: $k(x, x') = e^{-\alpha\|x - x'\|_2^2}$]
Question: when is this a kernel?
Fourier transform: $\hat f(\omega) = \int_{\mathbb{R}^d} f(x)\, e^{-i\omega^\top x}\,dx$
Inversion formula: $f(x) = \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} \hat f(\omega)\, e^{i\omega^\top x}\,d\omega$
Parseval's identity: $\int_{\mathbb{R}^d} |f(x)|^2\,dx = \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} |\hat f(\omega)|^2\,d\omega$

Theorem (Bochner's theorem)

$k(x, x') = q(x - x')$ is positive-definite $\iff$ $\hat q(\omega) \ge 0$ for all $\omega \in \mathbb{R}^d$.

23 / 35
Translation-invariant kernels on $\mathbb{R}^d$
Summary:
(1) If $q$ has a non-negative Fourier transform, then $k(x, x') = q(x - x')$ is a positive-definite kernel:
$$k(x, x') = \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} \sqrt{\hat q(\omega)}\, e^{i\omega^\top x}\, \big(\sqrt{\hat q(\omega)}\, e^{i\omega^\top x'}\big)^*\,d\omega = \int_{\mathbb{R}^d} \varphi(x)_\omega\, \varphi(x')_\omega^*\,d\omega$$

(2) The associated norm on functions is
$$\Omega(f) = \int \frac{|\hat f(\omega)|^2}{\hat q(\omega)}\,d\omega.$$

24 / 35
• Exponential kernel: $q(u) = e^{-\|u\|_2}$.
• Gaussian kernel: $q(u) = e^{-\|u\|_2^2}$.

Fourier transform of the exponential kernel:
$$\hat q(\omega) = 2^d \pi^{(d-1)/2}\, \Gamma\Big(\frac{d+1}{2}\Big)\, \frac{1}{(1 + \|\omega\|_2^2)^{(d+1)/2}}$$

For $d$ odd, $\hat q(\omega)^{-1}$ is a sum of monomials: the norm penalizes all derivatives up to total order $(d+1)/2$.
The case $d = 1$:
$$\|f\|_{\mathcal{H}}^2 = \frac{1}{2\pi}\int_{\mathbb{R}} \frac{|\hat f(\omega)|^2}{\hat q(\omega)}\,d\omega = \frac{1}{4\pi}\int_{\mathbb{R}} |\hat f(\omega)|^2\,d\omega + \frac{1}{4\pi}\int_{\mathbb{R}} |\omega \hat f(\omega)|^2\,d\omega = \frac{1}{2}\int_{\mathbb{R}} |f(x)|^2\,dx + \frac{1}{2}\int_{\mathbb{R}} |f'(x)|^2\,dx$$

25 / 35
Examples:
• Exponential kernel: $q(u) = e^{-\|u\|_2}$.
• Gaussian kernel: $q(u) = e^{-\|u\|_2^2}$.

Fourier transform of the Gaussian kernel:
$$\hat q(\omega) = \pi^{d/2}\, e^{-\|\omega\|_2^2/4}$$
Power series expansion of $\hat q(\omega)^{-1}$: the RKHS norm penalizes derivatives of all orders (functions are infinitely differentiable).

26 / 35
Kernels beyond $\mathcal{X} = \mathbb{R}^d$: Sequences
An alphabet $A$; an input $x = (x_1, \ldots, x_n)$ is a sequence of letters $x_i \in A$.
$\varphi(x)$ is indexed by all sequences $y$ of length $\ell$:
$\varphi(x)_y = 1$ if $y$ is contained in $x$ (and $0$ otherwise), so $k(x, x') = \langle\varphi(x), \varphi(x')\rangle$ counts the length-$\ell$ sequences contained in both $x$ and $x'$; a small sketch follows below.

27 / 35
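A small sketch of such a sequence kernel, reading "contained in" as "occurs as a contiguous substring" (the alphabet, strings and length $\ell$ below are arbitrary illustrative choices):

```python
def substring_features(x, ell):
    """Distinct length-ell contiguous substrings of x; phi(x)_y = 1 iff y is in this set."""
    return {x[i:i + ell] for i in range(len(x) - ell + 1)}

def substring_kernel(x, xp, ell=3):
    """k(x, x') = <phi(x), phi(x')> = number of length-ell strings contained in both x and x'."""
    return len(substring_features(x, ell) & substring_features(xp, ell))

print(substring_kernel("ACGTACGT", "CGTACGGA"))   # number of shared 3-letter substrings
```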
Algorithm
Goal:
$$\min_{f \in \mathcal{H}} \ \frac{1}{n}\sum_{i=1}^n \ell\big(y_i, f(x_i)\big) + \frac{\lambda}{2}\|f\|_{\mathcal{H}}^2$$
Representer theorem: $f(x) = \langle\theta, \varphi(x)\rangle$, and
$$\theta = \sum_{i=1}^n \alpha_i \varphi(x_i) \ \Longrightarrow\ f(x) = \sum_{i=1}^n \alpha_i k(x, x_i)$$

• $\alpha$ is obtained as the minimizer of
$$\frac{1}{n}\sum_{i=1}^n \ell\big(y_i, (K\alpha)_i\big) + \frac{\lambda}{2}\,\alpha^\top K \alpha.$$

28 / 35
Example: least squares (kernel ridge regression)
$$\min_{\alpha \in \mathbb{R}^n} \ \frac{1}{2n}\|y - K\alpha\|_2^2 + \frac{\lambda}{2}\alpha^\top K\alpha =: \min_{\alpha \in \mathbb{R}^n} F(\alpha)$$

$$F'(\alpha) = \frac{1}{n} K(K\alpha - y) + \lambda K\alpha = 0 \iff (K^2 + n\lambda K)\alpha = Ky$$

A solution is $\alpha = (K + n\lambda I)^{-1} y$;
it may not be unique if $K$ is not invertible.
Issue: $K$ often has tiny eigenvalues.
Gradient descent: Hessian matrix $F''(\alpha) = K^2 + n\lambda K =: H$;
speed $\propto \frac{\lambda_{\max}(H)}{\lambda_{\min}(H)}$ = condition number
29 / 35
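A compact sketch of kernel ridge regression following the formula $\alpha = (K + n\lambda I)^{-1} y$ (assuming numpy; the Gaussian kernel and the synthetic data are arbitrary choices):

```python
import numpy as np

def gaussian_K(A, B, gamma=1.0):
    """Kernel matrix K_ij = exp(-gamma * ||a_i - b_j||_2^2) between two sets of points."""
    sq = np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

rng = np.random.default_rng(4)
n, lam = 50, 0.1
X = rng.uniform(-1, 1, (n, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(n)

K = gaussian_K(X, X)
alpha = np.linalg.solve(K + n * lam * np.eye(n), y)   # alpha = (K + n*lambda*I)^{-1} y

X_test = np.linspace(-1, 1, 5)[:, None]
f_test = gaussian_K(X_test, X) @ alpha                # f(x) = sum_i alpha_i k(x, x_i)
print(f_test)
```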
Alternative: factorize $K = \Phi\Phi^\top$, with $\Phi \in \mathbb{R}^{n \times m}$, $m$ the rank of $K$:
$$\min_{\beta \in \mathbb{R}^m} \ \frac{1}{n}\sum_{i=1}^n \ell\big(y_i, (\Phi\beta)_i\big) + \frac{\lambda}{2}\|\beta\|_2^2$$

Optimality condition: $\frac{1}{n}\Phi^\top g + \lambda\beta = 0$, where $g \in \mathbb{R}^n$ collects the gradients of $\ell$ with respect to its second argument at the points $(\Phi\beta)_i$.

Then obtain $\alpha \in \mathbb{R}^n$ via $\alpha = -\frac{1}{\lambda n}\, g$,

and the explicit feature-space representation is $\beta = \Phi^\top \alpha$.

30 / 35
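A sketch of this equivalence for the square loss, factorizing $K = \Phi\Phi^\top$ through an eigendecomposition (one possible choice of factorization; assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(5)
n, lam = 30, 0.1
X = rng.standard_normal((n, 2))
y = rng.standard_normal(n)
K = np.exp(-np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))   # Gaussian kernel matrix

# Factorize K = Phi Phi^T via an eigendecomposition (K is symmetric PSD)
w, V = np.linalg.eigh(K)
keep = w > 1e-12
Phi = V[:, keep] * np.sqrt(w[keep])       # Phi in R^{n x m}, m = numerical rank of K

# Explicit feature-space ridge problem in beta (square loss)
m = Phi.shape[1]
beta = np.linalg.solve(Phi.T @ Phi + n * lam * np.eye(m), Phi.T @ y)

# Kernelized solution alpha = (K + n*lambda*I)^{-1} y
alpha = np.linalg.solve(K + n * lam * np.eye(n), y)

print(np.allclose(Phi @ beta, K @ alpha, atol=1e-6))   # same training predictions
print(np.allclose(beta, Phi.T @ alpha, atol=1e-6))     # beta = Phi^T alpha
```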
Analysis

$$\hat f = \operatorname*{arg\,min}_{f \in \mathcal{H}} \hat R(f) \quad \text{such that } \|f\|_{\mathcal{H}} \le D$$
Question: performance on unseen data?
$\mathbb{E}\big[R(\hat f)\big] - R^*$, where $R$ is the expected risk: for all $f : \mathcal{X} \to \mathbb{R}$, $R(f) = \mathbb{E}[\ell(y, f(x))]$
$$R(\hat f) - R^* = \underbrace{R(\hat f) - \inf_{\|f\| \le D} R(f)}_{\text{estimation error}} + \underbrace{\inf_{\|f\| \le D} R(f) - R^*}_{\text{approximation error}}$$

$G$-Lipschitz-continuous loss and $\|\varphi(x)\|_{\mathcal{H}}^2 \le R^2$ $\Longrightarrow$
$$\mathbb{E}[\text{estimation error}] \le \frac{4GDR}{\sqrt{n}}$$
31 / 35
Approximation error: $\inf_{\|f\| \le D} R(f) - R^*$
Let $f_* = \arg\min R(f)$ (the Bayes predictor). By $G$-Lipschitz-continuity of the loss and Jensen's inequality,
$$R(f) - R(f_*) \le \mathbb{E}\big[|\ell(y, f(x)) - \ell(y, f_*(x))|\big] \le \mathbb{E}\big[G\,|f(x) - f_*(x)|\big] \le G\sqrt{\mathbb{E}\big[|f(x) - f_*(x)|^2\big]} = G\,\|f - f_*\|_{L^2(p)}$$

Hence: approximation error $\le G \inf_{\|f\|_{\mathcal{H}} \le D} \|f - f_*\|_{L^2(p)}$

• In $\mathbb{R}^d$, with $f_*(x) = \theta_*^\top \varphi(x)$ and $f(x) = \theta^\top \varphi(x)$: $\|f - f_*\|_{L^2(p)} \le R\,\|\theta - \theta_*\|_2$.

The approximation error goes to 0 when $D \to \infty$.

Note: if $f_* \in \mathcal{H}$ and $D \ge \|f_*\|$, then the approximation error $= 0$.

32 / 35
Risk decomposition, under the assumption $\sup_{x \in \mathcal{X}} k(x, x) \le R^2$.
Dimension-free bound:
$$\mathbb{E}[R(\hat f_D^{\,c})] - R(f_*) \le \frac{4GDR}{\sqrt{n}} + G \inf_{\|f\|_{\mathcal{H}} \le D} \|f - f_*\|_{L^2(p)}$$

Optimal choice of the radius: $D = \frac{\sqrt{n}}{4R}\sqrt{\inf_{f \in \mathcal{H}} \big\{\|f - f_*\|_{L^2(p)}^2 + \tfrac{16 R^2 \|f\|_{\mathcal{H}}^2}{n}\big\}}$

Key quantity:
$$A(\mu, f_*) = \inf_{f \in \mathcal{H}} \big\{\|f - f_*\|_{L^2(p)}^2 + \mu\|f\|_{\mathcal{H}}^2\big\}$$

Tradeoff between estimation and approximation errors.

33 / 35
Key quantity: $A(\mu, f_*) = \inf_{f \in \mathcal{H}} \{\|f - f_*\|_{L^2(p)}^2 + \mu\|f\|_{\mathcal{H}}^2\}$
• If $f_* \in \mathcal{H}$: $A(\mu, f_*) \le \mu\|f_*\|_{\mathcal{H}}^2$ (well-specified case)
• If $f_* \notin \mathcal{H}$ but can be arbitrarily well approximated by $f \in \mathcal{H}$: $A(\mu, f_*)$ tends to zero as $\mu \to 0$, but with no rate (unless further conditions hold)
• If $f_*$ cannot be approximated ...

34 / 35
Summary

For kernel methods, if $f_*$ has $s$ bounded derivatives:

• Excess risk (if the regularization parameter is well chosen):
$$R(\hat f) - R^* \sim n^{-\frac{s}{d + 2s}}$$

• Adaptive!
For neural networks: even more adaptivity.

35 / 35
