MATH3340 Mathematics of Machine Learning
Chinese University of Hong Kong
March 2025
Outline
1. Representer theorem
2. Kernels
3. Algorithm and analysis
4. Generalization guarantees for Lipschitz-continuous losses
Introduction
Linear models: $f_\theta(x) = \theta^\top \varphi(x)$, where $\varphi(x) \in \mathbb{R}^d$ is the feature map.
• Estimation error is bounded as $1/\sqrt{n}$, independently of the dimension:
$$R_n(\mathcal{F}) \le \frac{D}{\sqrt{n}} \sqrt{\mathbb{E}\big[\|\varphi(x)\|_2^2\big]}$$
• Convex optimization
$\Longrightarrow$ take $d = +\infty$: $f_\theta(x) = \langle \theta, \varphi(x)\rangle$, with $\theta \in \mathcal{H}$ (a Hilbert space) and $\varphi(x) \in \mathcal{H}$.
A Hilbert space $\mathcal{H}$ is a complete vector space equipped with an inner product $\langle \cdot, \cdot\rangle$. It satisfies the following properties:
• Vector space structure: vector addition and scalar multiplication are defined.
• Inner product $\langle \cdot, \cdot\rangle$ mapping pairs of vectors in $\mathcal{H}$ to a scalar (real or complex), satisfying:
  – conjugate symmetry: $\langle x, y\rangle = \overline{\langle y, x\rangle}$;
  – linearity in the first argument: $\langle ax + by, z\rangle = a\langle x, z\rangle + b\langle y, z\rangle$;
  – positive definiteness: $\langle x, x\rangle \ge 0$, with equality if and only if $x = 0$.
• Completeness: every Cauchy sequence in the space converges to a limit within the space.
Norm induced by the inner product:
$$\|x\| = \sqrt{\langle x, x\rangle}$$
Orthogonality: two vectors $x$ and $y$ are orthogonal if $\langle x, y\rangle = 0$.
• Finite-dimensional example: $\mathbb{R}^n$ with the standard dot product.
• Infinite-dimensional example: the space of square-integrable functions $L^2(\Omega)$ with the inner product
$$\langle f, g\rangle = \int_\Omega f(x)\, g(x)\, dx$$
Representer theorem
$d = +\infty$: $f_\theta(x) = \langle \theta, \varphi(x)\rangle$, $\theta \in \mathcal{H}$ (Hilbert space), $\varphi(x) \in \mathcal{H}$.
Data $x_1, \dots, x_n \in \mathcal{X}$, $y_1, \dots, y_n \in \mathcal{Y}$, and $f_\theta(x) = \langle \theta, \varphi(x)\rangle$ with $\varphi : \mathcal{X} \to \mathcal{H}$.
Regularized ERM:
$$\min_{\theta \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n \ell\big(y_i, \langle \theta, \varphi(x_i)\rangle\big) + \frac{\lambda}{2}\|\theta\|_{\mathcal{H}}^2$$
Key features: the observations $x_i$ are accessed only through the dot products $\langle \theta, \varphi(x_i)\rangle$, and the penalty is the Hilbert norm $\|\theta\|_{\mathcal{H}}$.
Representer theorem
Consider a feature map $\varphi : \mathcal{X} \to \mathcal{H}$. Let $(x_1, \dots, x_n) \in \mathcal{X}^n$ and assume that the functional $\Psi : \mathbb{R}^{n+1} \to \mathbb{R}$ is strictly increasing in its last variable. Then the infimum of
$$\Psi\big(\langle \theta, \varphi(x_1)\rangle, \dots, \langle \theta, \varphi(x_n)\rangle, \|\theta\|^2\big)$$
can be obtained by restricting to a vector $\theta$ in the linear span of $\varphi(x_1), \dots, \varphi(x_n)$, i.e.,
$$\theta = \sum_{i=1}^n \alpha_i \varphi(x_i), \quad \text{with } \alpha \in \mathbb{R}^n.$$
Theorem (Kimeldorf, Wahba, 1971)
For $\lambda > 0$, the minimization problem $\frac{1}{n}\sum_{i=1}^n \ell\big(y_i, \langle \theta, \varphi(x_i)\rangle\big) + \lambda\|\theta\|^2$ can be restricted to
$$\theta = \sum_{i=1}^n \alpha_i \varphi(x_i), \quad \alpha \in \mathbb{R}^n.$$
No convexity is assumed on the loss function $\ell$.
$$\min_{\theta \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n \ell\big(y_i, \langle \varphi(x_i), \theta\rangle\big) + \frac{\lambda}{2}\|\theta\|_{\mathcal{H}}^2, \qquad \theta = \sum_{i=1}^n \alpha_i \varphi(x_i).$$
Question: how to solve the problem?
$$\langle \theta, \varphi(x_i)\rangle = \Big\langle \sum_{j=1}^n \alpha_j \varphi(x_j),\, \varphi(x_i)\Big\rangle = \sum_{j=1}^n \alpha_j \langle \varphi(x_i), \varphi(x_j)\rangle = \sum_{j=1}^n \alpha_j k(x_i, x_j) = (K\alpha)_i$$
Kernel function: $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, $k(x, x') = \langle \varphi(x), \varphi(x')\rangle$.
Kernel matrix: $K \in \mathbb{R}^{n \times n}$, $K_{ij} = k(x_i, x_j)$ (the Gram matrix of the feature vectors).
$$\|\theta\|^2 = \Big\|\sum_{i=1}^n \alpha_i \varphi(x_i)\Big\|^2 = \sum_{i,j=1}^n \alpha_i \alpha_j \langle \varphi(x_i), \varphi(x_j)\rangle = \sum_{i,j=1}^n \alpha_i \alpha_j k(x_i, x_j) = \sum_{i,j=1}^n \alpha_i \alpha_j K_{ij} = \alpha^\top K\alpha.$$
Equivalent problem:
$$\min_{\alpha \in \mathbb{R}^n} \frac{1}{n}\sum_{i=1}^n \ell\big(y_i, (K\alpha)_i\big) + \frac{\lambda}{2}\alpha^\top K\alpha.$$
For any test point $x \in \mathcal{X}$,
$$f(x) = \langle \theta, \varphi(x)\rangle = \sum_{i=1}^n \alpha_i \langle \varphi(x_i), \varphi(x)\rangle = \sum_{i=1}^n \alpha_i k(x, x_i).$$
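These identities are easy to check numerically. Below is a minimal sketch assuming a linear kernel $k(x, x') = x^\top x'$ (so that $\varphi(x) = x$) and random data; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3
X = rng.standard_normal((n, d))        # rows are phi(x_i) = x_i for the linear kernel
alpha = rng.standard_normal(n)

K = X @ X.T                            # kernel matrix, K_ij = <x_i, x_j>
theta = X.T @ alpha                    # theta = sum_i alpha_i phi(x_i)

# <theta, phi(x_i)> equals (K alpha)_i
assert np.allclose(X @ theta, K @ alpha)
# ||theta||^2 equals alpha^T K alpha
assert np.allclose(theta @ theta, alpha @ K @ alpha)
# prediction at a new point: f(x) = <theta, x> = sum_i alpha_i k(x, x_i)
x_new = rng.standard_normal(d)
assert np.allclose(theta @ x_new, alpha @ (X @ x_new))
```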
Kernel trick
• Replace the search space $\mathcal{H}$ by $\mathbb{R}^n$.
• Separate the representation problem from the design of algorithms and their analysis.
Message: there is no need to compute the feature vector $\varphi(x)$ explicitly; only dot products are needed.
Kernels
Definition (Positive-definite kernels)
A function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a positive-definite kernel if and only if all kernel matrices resulting from $k$ are symmetric positive semidefinite.
$K$ is positive semidefinite $\iff$ all its eigenvalues are non-negative $\iff$ $\alpha^\top K\alpha \ge 0$ for all $\alpha \in \mathbb{R}^n$.
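As a quick numerical sanity check (a minimal sketch; the variable names are illustrative): the Gram matrix of any set of feature vectors is symmetric with non-negative eigenvalues, up to round-off.

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.standard_normal((6, 4))            # 6 feature vectors in R^4
K = Phi @ Phi.T                              # Gram matrix, K_ij = <phi_i, phi_j>

assert np.allclose(K, K.T)                   # symmetry
assert np.linalg.eigvalsh(K).min() > -1e-10  # positive semidefinite (up to round-off)

alpha = rng.standard_normal(6)               # equivalently, the quadratic form is non-negative
assert alpha @ K @ alpha >= -1e-10
```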
Theorem (Aronszajn, 1950)
A function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a positive-definite kernel if and only if there exists a Hilbert space $\mathcal{H}$ and a function $\varphi : \mathcal{X} \to \mathcal{H}$ such that for all $x, x' \in \mathcal{X}$, $k(x, x') = \langle \varphi(x), \varphi(x')\rangle_{\mathcal{H}}$.
No assumption is made on the input space $\mathcal{X}$, and no regularity assumption is made on $k$.
$k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a positive-definite kernel $\Longrightarrow$ there exists $\varphi : \mathcal{X} \to \mathcal{H}$ ($\varphi$: feature map; $\mathcal{H}$: feature space) such that $k(x, x') = \langle \varphi(x), \varphi(x')\rangle_{\mathcal{H}}$.
Reproducing property of the kernel space:
$$\langle k(\cdot, x), f\rangle = f(x), \qquad \langle k(\cdot, x), k(\cdot, x')\rangle = k(x, x')$$
Such spaces are called reproducing kernel Hilbert spaces (RKHS): function evaluation is the dot product with a function (which encodes smoothness).
Kernel calculus
The set of positive-definite kernels on a set $\mathcal{X}$ is a cone: if $k_1, k_2 : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ are positive-definite kernels and $\lambda \ge 0$, then
• $k_1 + k_2$ is a positive-definite kernel,
• $\lambda k_1$ is a positive-definite kernel,
• $k_1 k_2$ is a positive-definite kernel.
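A small numerical illustration of these closure rules (a sketch assuming two linear base kernels on the same points; the names are illustrative): the sum, non-negative scaling, and pointwise product of kernel matrices all remain positive semidefinite.

```python
import numpy as np

rng = np.random.default_rng(1)
X1 = rng.standard_normal((8, 3))
X2 = rng.standard_normal((8, 5))
K1, K2 = X1 @ X1.T, X2 @ X2.T        # two PSD kernel matrices on the same 8 points
lam = 2.5

# k1 + k2, lambda * k1, and k1 * k2 (elementwise/Hadamard product) stay PSD
for M in (K1 + K2, lam * K1, K1 * K2):
    assert np.linalg.eigvalsh(M).min() > -1e-10
```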
Linear kernel
$\mathcal{X} = \mathbb{R}^d$, $k(x, x') = x^\top x'$.
$\varphi(x) = x$ is an explicit feature map.
Function space: linear functions $f_\theta(x) = \theta^\top x$ with $\ell_2$-penalty $\|\theta\|_2^2$.
• If $d > n$, the representer theorem is useful if $k(x, x')$ is easy to compute (cost $O(d)$).
• Naive running time to compute the kernel matrix: $O(n^2 d)$.
Polynomial kernel
$\mathcal{X} = \mathbb{R}^d$, $k(x, x') = (x^\top x')^s$, $s \in \mathbb{N}$, $s \ge 1$.
$k$ is positive-definite as a product of positive-definite kernels.
$$k(x, x') = \Big(\sum_{i=1}^d x_i x_i'\Big)^s \quad \text{(multinomial theorem)}$$
$$= \sum_{\alpha_1 + \cdots + \alpha_d = s} \binom{s}{\alpha_1, \dots, \alpha_d}\, (x_1 x_1')^{\alpha_1} \cdots (x_d x_d')^{\alpha_d} = \sum_{\alpha_1 + \cdots + \alpha_d = s} \binom{s}{\alpha_1, \dots, \alpha_d}\, \big(x_1^{\alpha_1} \cdots x_d^{\alpha_d}\big)\big((x_1')^{\alpha_1} \cdots (x_d')^{\alpha_d}\big) = \varphi(x)^\top \varphi(x'),$$
with
$$\varphi(x)_{\alpha_1, \dots, \alpha_d} = \binom{s}{\alpha_1, \dots, \alpha_d}^{1/2} x_1^{\alpha_1} \cdots x_d^{\alpha_d}.$$
Polynomial kernel
The dimension of $\varphi(x)$ is $\binom{d+s-1}{s} \sim C_s\, d^s$, hence the kernel trick is useful.
The space of functions is the set of homogeneous polynomials of degree $s$.
• Example: $d = 2$, $s = 2$: $\varphi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$.
Binary classification: separation by an ellipsoid in the input space corresponds to linear separation in the feature space.
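A quick numerical check of this explicit feature map for $d = 2$, $s = 2$ (a minimal sketch; the helper name is illustrative):

```python
import numpy as np

def phi(x):
    # explicit feature map for the homogeneous polynomial kernel with d = 2, s = 2
    return np.array([x[0]**2, np.sqrt(2.0) * x[0] * x[1], x[1]**2])

rng = np.random.default_rng(2)
x, xp = rng.standard_normal(2), rng.standard_normal(2)

k_trick = (x @ xp) ** 2          # kernel trick: O(d) work
k_explicit = phi(x) @ phi(xp)    # explicit feature space: dimension grows like d^s
assert np.allclose(k_trick, k_explicit)
```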
Translation-invariant kernels on $\mathcal{X} = [0, 1]$
$$k(x, x') = q(x - x')$$
with $q : [0, 1] \to \mathbb{R}$ extended 1-periodically: $q(x + 1) = q(x)$ for all $x$.
Goal: show how such kernels emerge from penalties on the Fourier coefficients of functions.
Consider $f : \mathbb{R} \to \mathbb{R}$ 1-periodic with $\int_0^1 f(x)^2\, dx$ finite.
Orthonormal basis: $e_m : x \mapsto e^{2i m\pi x}$, $m \in \mathbb{Z}$, since
$$\int_0^1 e^{2i m\pi x}\big(e^{2i m'\pi x}\big)^*\, dx = \int_0^1 e^{2i(m - m')\pi x}\, dx = 0, \quad m \ne m'.$$
$$f = \sum_{m \in \mathbb{Z}} \langle f, e_m\rangle_{L^2[0,1]}\, e_m$$
Fourier series:
$$f(x) = \sum_{m \in \mathbb{Z}} \hat{f}_m e^{2i m\pi x}, \qquad \hat{f}_m = \langle f, e_m\rangle = \int_0^1 f(x) e^{-2i m\pi x}\, dx$$
Preliminaries:
• Parseval's identity: $\int_0^1 |f(x)|^2\, dx = \sum_{m \in \mathbb{Z}} |\hat{f}_m|^2$.
• If $f$ is differentiable, $f'(x) = \sum_{m \in \mathbb{Z}} \hat{f}_m (2i m\pi) e^{2i m\pi x}$, i.e., the Fourier coefficients of $f'$ are $(2i m\pi)\hat{f}_m$.
Consequence:
$$\int_0^1 |f'(x)|^2\, dx = \sum_{m \in \mathbb{Z}} \big|\hat{f}_m (2i m\pi)\big|^2 = \sum_{m \in \mathbb{Z}} |2m\pi|^2\, |\hat{f}_m|^2$$
Sobolev norm:
$$\Omega(f) = \|f\|^2 = \int_0^1 |f(x)|^2\, dx + \int_0^1 |f'(x)|^2\, dx = \sum_{m \in \mathbb{Z}} (1 + 4\pi^2 m^2)\, |\hat{f}_m|^2$$
• Learning problem: $\inf_f \widehat{R}(f) + \frac{\lambda}{2}\|f\|^2$ with $\lambda > 0$: the penalty favors smooth functions.
The penalty $\|f\|^2$ can be interpreted through a feature map and its associated dot product $\big(\langle a, b\rangle = \sum_{m \in \mathbb{Z}} a_m b_{-m}\big)$:
$$\varphi(x)_m = \frac{e^{2i m\pi x}}{\sqrt{1 + 4\pi^2 m^2}}, \qquad f_\theta(x) = \langle \theta, \varphi(x)\rangle \quad (\theta \in \ell_2).$$
Fourier series:
$$f_\theta(x) = \sum_{m \in \mathbb{Z}} (\hat{f}_\theta)_m\, e^{2i m\pi x} = \sum_{m \in \mathbb{Z}} \underbrace{(\hat{f}_\theta)_m \sqrt{1 + 4\pi^2 m^2}}_{\theta_m}\; \frac{e^{2i m\pi x}}{\sqrt{1 + 4\pi^2 m^2}},$$
so that
$$\|\theta\|^2 = \sum_{m \in \mathbb{Z}} |\theta_m|^2 = \Omega(f_\theta) \quad \text{(Sobolev penalty)}.$$
The corresponding kernel is
$$k(x, x') = \langle \varphi(x), \varphi(x')\rangle = \sum_{m \in \mathbb{Z}} \frac{e^{2i m\pi(x - x')}}{1 + 4\pi^2 m^2} = q(x - x').$$
More generally, any penalty of the form $\sum_{m \in \mathbb{Z}} c_m |\hat{f}_m|^2$ defines a squared RKHS norm, provided $c_m$ is strictly positive for all $m \in \mathbb{Z}$ and $\sum_m \frac{1}{c_m} < \infty$.
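A numerical sketch of this construction (assuming a symmetric truncation of the sum at $|m| \le M$; the function name and constants are illustrative): the truncated series already gives a real, symmetric, positive semidefinite kernel matrix on points of $[0, 1]$.

```python
import numpy as np

def q_sobolev(delta, M=200):
    # truncation of q(delta) = sum_{m in Z} exp(2i*pi*m*delta) / (1 + 4*pi^2*m^2) to |m| <= M
    m = np.arange(1, M + 1)
    return 1.0 + 2.0 * np.sum(np.cos(2 * np.pi * m * delta) / (1 + 4 * np.pi**2 * m**2))

rng = np.random.default_rng(3)
x = rng.uniform(size=10)                       # points in [0, 1]
K = np.array([[q_sobolev(xi - xj) for xj in x] for xi in x])

assert np.allclose(K, K.T)                     # symmetric
assert np.linalg.eigvalsh(K).min() > -1e-10    # positive semidefinite
```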
Translation-invariant kernels on $\mathbb{R}^d$
$\mathcal{X} = \mathbb{R}^d$, $k(x, x') = q(x - x')$ [e.g. $k(x, x') = e^{-\alpha\|x - x'\|_2^2}$].
Question: when is this a kernel?
Fourier transform: $\hat{f}(\omega) = \int_{\mathbb{R}^d} f(x) e^{-i\omega^\top x}\, dx$
Inversion formula: $f(x) = \frac{1}{(2\pi)^d}\int_{\mathbb{R}^d} \hat{f}(\omega) e^{i\omega^\top x}\, d\omega$
Parseval's identity: $\int_{\mathbb{R}^d} |f(x)|^2\, dx = \frac{1}{(2\pi)^d}\int_{\mathbb{R}^d} |\hat{f}(\omega)|^2\, d\omega$

Theorem (Bochner's theorem)
$k(x, x') = q(x - x')$ is positive-definite $\iff$ $\hat{q}(\omega) \ge 0$ for all $\omega \in \mathbb{R}^d$.
Translation-invariant kernels on $\mathbb{R}^d$
Summary:
(1) If $q$ has a non-negative Fourier transform, then $k(x, x') = q(x - x')$ is a positive-definite kernel:
$$k(x, x') = \frac{1}{(2\pi)^d}\int_{\mathbb{R}^d} \sqrt{\hat{q}(\omega)}\, e^{i\omega^\top x}\,\Big(\sqrt{\hat{q}(\omega)}\, e^{i\omega^\top x'}\Big)^*\, d\omega = \int_{\mathbb{R}^d} \varphi(x)_\omega\, \varphi(x')_\omega^*\, d\omega.$$
(2) The associated norm on functions is
$$\Omega(f) = \frac{1}{(2\pi)^d}\int_{\mathbb{R}^d} \frac{|\hat{f}(\omega)|^2}{\hat{q}(\omega)}\, d\omega.$$
• Exponential kernel: $q(u) = e^{-\|u\|_2}$.
• Gaussian kernel: $q(u) = e^{-\|u\|_2^2}$.
Fourier transform of the exponential kernel:
$$\hat{q}(\omega) = \frac{2^d \pi^{(d-1)/2}\,\Gamma\big(\frac{d+1}{2}\big)}{(1 + \|\omega\|_2^2)^{(d+1)/2}}$$
For $d$ odd, $\hat{q}(\omega)^{-1}$ is a sum of monomials: the norm penalizes all derivatives up to total order $(d+1)/2$.
The case $d = 1$:
$$\|f\|_{\mathcal{H}}^2 = \frac{1}{2\pi}\int_{\mathbb{R}} \frac{|\hat{f}(\omega)|^2}{\hat{q}(\omega)}\, d\omega = \frac{1}{4\pi}\int_{\mathbb{R}} |\hat{f}(\omega)|^2\, d\omega + \frac{1}{4\pi}\int_{\mathbb{R}} |\omega\hat{f}(\omega)|^2\, d\omega = \frac{1}{2}\int_{\mathbb{R}} |f(x)|^2\, dx + \frac{1}{2}\int_{\mathbb{R}} |f'(x)|^2\, dx$$
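A numerical sanity check of the $d = 1$ case (a sketch using simple quadrature on a truncated grid; grid size and tolerance are arbitrary choices): the Fourier transform of $q(u) = e^{-|u|}$ should match $\hat{q}(\omega) = 2/(1 + \omega^2)$, i.e., the formula above with $d = 1$.

```python
import numpy as np

# approximate q_hat(omega) = int e^{-|u|} e^{-i*omega*u} du on a truncated grid
u = np.linspace(-40.0, 40.0, 200001)
du = u[1] - u[0]
q = np.exp(-np.abs(u))

for omega in (0.0, 0.5, 1.0, 3.0):
    q_hat = np.sum(q * np.cos(omega * u)) * du   # imaginary part vanishes by symmetry
    assert np.isclose(q_hat, 2.0 / (1.0 + omega**2), atol=1e-3)
```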
Examples:
• Exponential kernel: $q(u) = e^{-\|u\|_2}$.
• Gaussian kernel: $q(u) = e^{-\|u\|_2^2}$.
Fourier transform of the Gaussian kernel:
$$\hat{q}(\omega) = \pi^{d/2} e^{-\|\omega\|_2^2/4}$$
The power series expansion of $\hat{q}(\omega)^{-1}$ shows that the RKHS norm penalizes derivatives of all orders (functions in the RKHS are infinitely differentiable).
Kernels beyond $\mathcal{X} = \mathbb{R}^d$: Sequences
An alphabet $A$; the observations $x_1, \dots, x_n$ are sequences of letters from $A$.
$\varphi(x)$ is indexed by all sequences $y$ of length $\ell$:
$\varphi(x)_y = 1$ if $y$ is contained in $x$ (and $0$ otherwise).
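A minimal sketch of such a sequence kernel (assuming the presence-of-substring convention above; the function names are illustrative): the kernel counts the length-$\ell$ strings contained in both sequences.

```python
def substrings(x: str, ell: int) -> set[str]:
    """All length-ell contiguous substrings contained in x."""
    return {x[i:i + ell] for i in range(len(x) - ell + 1)}

def k_seq(x: str, xp: str, ell: int = 3) -> int:
    """k(x, x') = <phi(x), phi(x')> with phi(x)_y = 1 iff y (of length ell) is contained in x."""
    return len(substrings(x, ell) & substrings(xp, ell))

print(k_seq("GATTACA", "ATTACCA"))   # shared 3-mers: ATT, TTA, TAC -> 3
```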
Algorithm
Goal:
$$\min_{f \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n \ell\big(y_i, f(x_i)\big) + \frac{\lambda}{2}\|f\|_{\mathcal{H}}^2$$
Representer theorem: $f(x) = \langle \theta, \varphi(x)\rangle$, with
$$\theta = \sum_{i=1}^n \alpha_i \varphi(x_i) \quad\Longrightarrow\quad f(x) = \sum_{i=1}^n \alpha_i k(x, x_i)$$
• $\alpha$ is obtained as the minimizer of
$$\frac{1}{n}\sum_{i=1}^n \ell\big(y_i, (K\alpha)_i\big) + \frac{\lambda}{2}\alpha^\top K\alpha.$$
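For a general smooth loss this finite-dimensional problem can be solved by plain gradient descent on $\alpha$. Below is a minimal sketch assuming the logistic loss $\ell(y, u) = \log(1 + e^{-yu})$ with labels in $\{-1, +1\}$, a Gaussian kernel, and a fixed step size; these specific choices are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(4)
n, lam = 60, 0.1
X = rng.standard_normal((n, 2))
y = np.sign(X[:, 0] * X[:, 1] + 0.1 * rng.standard_normal(n))    # labels in {-1, +1}

K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))      # Gaussian kernel matrix

kmax = np.linalg.eigvalsh(K).max()
step = 1.0 / (kmax * (0.25 * kmax / n + lam))                    # 1/L, L an upper bound on the Hessian norm

alpha = np.zeros(n)
for _ in range(500):
    u = K @ alpha                                                # u_i = (K alpha)_i
    g = -y / (1.0 + np.exp(y * u))                               # g_i = d/du ell(y_i, u_i), logistic loss
    alpha -= step * (K @ g / n + lam * u)                        # gradient of the objective in alpha

print("training accuracy:", np.mean(np.sign(K @ alpha) == y))
```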
Example: least-squares (kernel ridge regression)
$$\min_{\alpha \in \mathbb{R}^n} \frac{1}{2n}\|y - K\alpha\|_2^2 + \frac{\lambda}{2}\alpha^\top K\alpha = \min_{\alpha \in \mathbb{R}^n} F(\alpha)$$
$$F'(\alpha) = \frac{1}{n} K(K\alpha - y) + \lambda K\alpha = 0 \iff (K^2 + n\lambda K)\alpha = Ky$$
A solution is $\alpha = (K + n\lambda I)^{-1} y$; it may not be unique if $K$ is not invertible.
Issue: $K$ often has tiny eigenvalues.
Gradient descent: Hessian matrix $F''(\alpha) = \frac{1}{n}K^2 + \lambda K =: H$; the number of iterations grows with the condition number $\frac{\lambda_{\max}(H)}{\lambda_{\min}(H)}$.
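A minimal kernel ridge regression sketch following this closed form (Gaussian kernel on synthetic 1D data; every specific choice below is illustrative rather than prescribed by the slides):

```python
import numpy as np

rng = np.random.default_rng(5)
n, lam = 50, 1e-2
x = np.sort(rng.uniform(-3, 3, n))
y = np.sin(x) + 0.1 * rng.standard_normal(n)

def gauss_kernel(a, b):
    return np.exp(-(a[:, None] - b[None, :]) ** 2)

K = gauss_kernel(x, x)
alpha = np.linalg.solve(K + n * lam * np.eye(n), y)   # alpha = (K + n*lambda*I)^{-1} y

x_test = np.linspace(-3, 3, 200)                      # predictions: f(x) = sum_i alpha_i k(x, x_i)
f_test = gauss_kernel(x_test, x) @ alpha

H = K @ K / n + lam * K                               # gradient-descent Hessian
eig = np.linalg.eigvalsh(H)
print("condition number of H:", eig.max() / max(eig.min(), 1e-15))
```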
Alternative: factor $K = \Phi\Phi^\top$, with $\Phi \in \mathbb{R}^{n \times m}$ and $m$ the rank of $K$, and solve
$$\min_{\beta \in \mathbb{R}^m} \frac{1}{n}\sum_{i=1}^n \ell\big(y_i, (\Phi\beta)_i\big) + \frac{\lambda}{2}\|\beta\|_2^2$$
Optimality condition: $\frac{1}{n}\Phi^\top g + \lambda\beta = 0$, where $g_i$ is the derivative of $\ell(y_i, \cdot)$ at $(\Phi\beta)_i$.
One then obtains $\alpha \in \mathbb{R}^n$ via $\alpha = -\frac{1}{\lambda n} g$, and $\beta = \Phi^\top\alpha$ is the explicit feature-space representation.
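A sketch of this reparameterization for the squared loss, assuming $\Phi$ is obtained from an eigendecomposition of $K$ (the comparison with the closed-form $\alpha$ of the previous slide is only an illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
n, lam = 40, 1e-2
x = rng.uniform(-3, 3, n)
y = np.sin(x) + 0.1 * rng.standard_normal(n)
K = np.exp(-(x[:, None] - x[None, :]) ** 2)

# factor K = Phi Phi^T through an eigendecomposition, keeping the numerical rank
w, U = np.linalg.eigh(K)
keep = w > 1e-10 * w.max()
Phi = U[:, keep] * np.sqrt(w[keep])                     # Phi in R^{n x m}, m = numerical rank of K
m = Phi.shape[1]

# squared loss: beta minimizes (1/2n)||y - Phi beta||^2 + (lambda/2)||beta||^2
beta = np.linalg.solve(Phi.T @ Phi + n * lam * np.eye(m), Phi.T @ y)

# same fitted values as the kernel-space solution alpha = (K + n*lambda*I)^{-1} y
alpha = np.linalg.solve(K + n * lam * np.eye(n), y)
assert np.allclose(Phi @ beta, K @ alpha, atol=1e-6)
```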
Analysis
$$\hat{f} = \mathop{\arg\min}_{f \in \mathcal{H}} \widehat{R}(f) \quad \text{such that } \|f\|_{\mathcal{H}} \le D$$
Question: what is the performance on unseen data, $\mathbb{E}\big[R(\hat{f}) - R^*\big]$, where $R$ is the expected risk,
$$R(f) = \mathbb{E}\big[\ell(y, f(x))\big] \quad \text{for all } f : \mathcal{X} \to \mathbb{R}?$$
$$R(\hat{f}) - R^* = \underbrace{R(\hat{f}) - \inf_{\|f\| \le D} R(f)}_{\text{estimation error}} + \underbrace{\inf_{\|f\| \le D} R(f) - R^*}_{\text{approximation error}}$$
$G$-Lipschitz-continuous loss and $\|\varphi(x)\|_{\mathcal{H}}^2 \le R^2$ $\Longrightarrow$
$$\mathbb{E}[\text{estimation error}] \le \frac{4GDR}{\sqrt{n}}$$
Approximation error: $\inf_{\|f\| \le D} R(f) - R^*$.
Let $f_* = \arg\min R(f)$ (the Bayes predictor). By the $G$-Lipschitz continuity of the loss and Jensen's inequality,
$$R(f) - R(f_*) \le \mathbb{E}\big[|\ell(y, f(x)) - \ell(y, f_*(x))|\big] \le \mathbb{E}\big[G\,|f(x) - f_*(x)|\big] \le G\sqrt{\mathbb{E}\big[|f(x) - f_*(x)|^2\big]} = G\,\|f - f_*\|_{L^2(p)},$$
so
$$\text{approximation error} \le G \inf_{\|f\|_{\mathcal{H}} \le D} \|f - f_*\|_{L^2(p)}.$$
• In $\mathbb{R}^d$, with $f_*(x) = \theta_*^\top\varphi(x)$ and $f(x) = \theta^\top\varphi(x)$: $\|f - f_*\|_{L^2(p)} \le R\,\|\theta - \theta_*\|_2$.
The approximation error goes to 0 as $D \to \infty$ (when $f_*$ can be approximated by functions in $\mathcal{H}$).
Note: if $f_* \in \mathcal{H}$ and $D \ge \|f_*\|$, then the approximation error is 0.
Risk decomposition (assuming $\sup_{x \in \mathcal{X}} k(x, x) \le R^2$): a dimension-free bound,
$$\mathbb{E}\big[R(\hat{f}_D)\big] - R(f_*) \le \frac{4GDR}{\sqrt{n}} + G \inf_{\|f\|_{\mathcal{H}} \le D} \|f - f_*\|_{L^2(p)}$$
Optimal choice of the radius:
$$D = \frac{\sqrt{n}}{4R}\sqrt{\inf_{f \in \mathcal{H}}\Big\{\|f - f_*\|_{L^2(p)}^2 + \frac{16R^2\|f\|_{\mathcal{H}}^2}{n}\Big\}}$$
Key quantity:
$$A(\mu, f_*) = \inf_{f \in \mathcal{H}}\big\{\|f - f_*\|_{L^2(p)}^2 + \mu\|f\|_{\mathcal{H}}^2\big\}$$
It captures the tradeoff between the estimation and approximation errors.
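As a short worked consequence of the dimension-free bound above: in the well-specified case $f_* \in \mathcal{H}$, taking $D = \|f_*\|_{\mathcal{H}}$ makes the approximation term vanish, so
$$\mathbb{E}\big[R(\hat{f}_D)\big] - R(f_*) \le \frac{4GR\,\|f_*\|_{\mathcal{H}}}{\sqrt{n}},$$
a dimension-free $1/\sqrt{n}$ rate, consistent with the bound announced in the introduction.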
Key quantity: $A(\mu, f_*) = \inf_{f \in \mathcal{H}}\big\{\|f - f_*\|_{L^2(p)}^2 + \mu\|f\|_{\mathcal{H}}^2\big\}$
• If $f_* \in \mathcal{H}$: $A(\mu, f_*) \le \mu\|f_*\|_{\mathcal{H}}^2$ (well-specified case).
• If $f_* \notin \mathcal{H}$ but can be arbitrarily well approximated by $f \in \mathcal{H}$: $A(\mu, f_*)$ tends to zero as $\mu \to 0$, but with no rate (unless further conditions hold).
• If $f_*$ cannot be approximated: ...
Summary
For kernel methods, if $f_*$ has $s$ bounded derivatives:
• Excess risk (if the regularization parameter is well chosen):
$$R(\hat{f}) - R^* \sim n^{-\frac{s}{d+2s}}$$
• Adaptive!
For neural networks: even more adaptivity.