Empirical Process (Sara Van de Geer)
January 2020
Contents
1 Introduction
5 Symmetrization
5.1 Intermezzo: some facts about (conditional) expectations
5.1.1 Suprema in-/outside the expectation
5.1.2 Iterated expectations
5.2 Symmetrization with means
5.3 Symmetrization with probabilities
5.4 Exercises
7 M-estimators
7.1 What is an M-estimator?
7.2 Consistency
7.3 Exercises
Introduction
The empirical distribution. The unknown P can be estimated from the data
in the following way. Suppose first that we are interested in the probability that
an observation falls in A, where A is a certain set chosen by the researcher. We
denote this probability by P (A). Now, from the frequentist point of view, the
probability of an event is nothing else than the limit of relative frequencies of
occurrences of that event as the number of occasions of possible occurrences n
grows without limit. So it is natural to estimate P (A) with the frequency of A,
i.e., with
Pn(A) := (number of Xi ∈ A) / n.
We now define the empirical measure Pn as the probability law that assigns to
a set A the probability Pn (A). We regard Pn as an estimator of the unknown
P.
Figure 1
Figure 1 plots the distribution function F (x) = 1 − 1/x2 , x ≥ 1 (smooth curve)
and the empirical distribution function F̂n (stair function) of a sample from F
with sample size n = 200.
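As an aside (not part of the notes), the picture in Figure 1 is easy to reproduce numerically. The sketch below draws from F(x) = 1 − 1/x² by inverse-transform sampling and computes the empirical distribution function; grid, seed and sample size are illustrative choices.

```python
import bisect
import math
import random

def sample_from_F(n, rng):
    # Inverse transform: if U ~ Uniform(0, 1) then X = 1/sqrt(1 - U)
    # has distribution function F(x) = 1 - 1/x^2, x >= 1.
    return [1.0 / math.sqrt(1.0 - rng.random()) for _ in range(n)]

def ecdf(sample):
    # Empirical distribution function t -> (number of X_i <= t) / n.
    xs = sorted(sample)
    n = len(xs)
    return lambda t: bisect.bisect_right(xs, t) / n

rng = random.Random(1)
data = sample_from_F(200, rng)
F_hat = ecdf(data)
F = lambda x: 1.0 - 1.0 / (x * x) if x >= 1.0 else 0.0

# sup-distance over a fine grid -- small for n = 200, as Figure 1 suggests
sup_dist = max(abs(F_hat(t) - F(t)) for t in (1.0 + 0.01 * k for k in range(2000)))
print(sup_dist)
```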
Sample mean. The mean of X is µ := E(X); its empirical version is the sample mean X̄n.
Sample median. The median of X is the value m that satisfies F (m) = 1/2
(assuming there is a unique solution). Its empirical version is any value m̂n with F̂n(m̂n) as close as possible to 1/2.
More generally, an estimator Tn of a parameter θ is called consistent if Tn → θ as n → ∞.
Nonparametric models.
Nonparametric models cannot be described by finitely many parameters.
Example: density estimation. An example of a nonparametric model is
where one assumes that the density f of the distribution function F on R
exists, but all one assumes about it is some kind of “smoothness” (e.g. that f has a
continuous first derivative). In that case, one may propose e.g. to
use the histogram as estimator of f . This is an example of a nonparametric
estimator.
Histograms. Suppose our aim is estimating the density f (x) at a given point
x. The density is defined as the derivative of the distribution function F at x:
f(x) = lim_{h→0} [F(x + h) − F(x)]/h = lim_{h→0} P(x, x + h]/h.
Here, (x, x + h] is the interval with left endpoint x (not included) and right
endpoint x + h (included). Unfortunately, replacing P by Pn here and letting h → 0 does not work: for small h the empirical measure Pn(x, x + h] is either zero or at least 1/n. One therefore keeps a fixed bandwidth h > 0 and uses the histogram-type estimator fˆn,h(x) := Pn(x, x + h]/h.
Figure 2
The question arises: how should one choose the bandwidth h? One may want to apply
a data-dependent choice. For example, introduce for fˆn = fˆn,h depending on h
the risk function
R(fˆn,h) := ∫ (fˆn,h(t) − f(t))² dt.
Note that
R(fˆn,h) = ∫ fˆn,h²(t) dt − 2 ∫ fˆn,h(t) f(t) dt + ∫ f²(t) dt,
where the last term does not depend on h.
The cross term can be estimated using the empirical distribution:
2 ∫ fˆn,h(t) dF̂n(t) = (2/n) Σ_{i=1}^n fˆn,h(Xi).
But using the data twice, both for estimation as well as for estimation of the
performance of the estimator, is perhaps not a good idea as it may lead to
“overfitting”. Therefore, we propose here to use a fresh sample, that is, {X′i}_{i=1}^m, i.i.d. copies of X independent of {Xi}_{i=1}^n, and apply the estimated bandwidth
ĥ := arg min_{h>0} [ ∫ fˆn,h²(t) dt − (2/m) Σ_{i=1}^m fˆn,h(X′i) ].
The final density estimator fˆn,ĥ can then be based on the pooled sample
(X1, . . . , Xn, X′1, . . . , X′m),
with sample size n + m. In other words, for the selection of the bandwidth a
sample splitting technique is used. In order to further improve performance, a
common technique is “cross-validation” consisting of several sample splits (into
training and test sets).
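The bandwidth selection just described can be sketched in a few lines of Python. The Gaussian data, the grid of candidate bandwidths and the seed below are illustrative choices, not taken from the notes.

```python
import math
import random

def hist_density(sample, h):
    # Histogram density estimator with bin width h: f_hat(x) = P_n(bin of x) / h.
    n = len(sample)
    counts = {}
    for x in sample:
        j = math.floor(x / h)
        counts[j] = counts.get(j, 0) + 1
    f_hat = lambda x: counts.get(math.floor(x / h), 0) / (n * h)
    # integral of f_hat^2 has a closed form: sum over bins of h * (count/(n h))^2
    int_f2 = sum(c * c for c in counts.values()) / (n * n * h)
    return f_hat, int_f2

def estimated_risk(train, test, h):
    # h-dependent part of the risk: integral f_hat^2 - (2/m) sum_i f_hat(X'_i),
    # with the cross term estimated on the fresh (test) sample.
    f_hat, int_f2 = hist_density(train, h)
    return int_f2 - 2.0 * sum(f_hat(x) for x in test) / len(test)

rng = random.Random(7)
data = [rng.gauss(0.0, 1.0) for _ in range(400)]
train, test = data[:200], data[200:]        # sample splitting
grid = [0.05, 0.1, 0.2, 0.4, 0.8, 1.6]
h_hat = min(grid, key=lambda h: estimated_risk(train, test, h))
print(h_hat)
```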
Example: the classification problem. Let Y ∈ {0, 1} be a response variable
(a label) and X ∈ X a co-variable (or input). For all x ∈ X we define the
probability that the label is Y = 1 when the co-variable takes the value x:
η(x) := P(Y = 1 | X = x).
Our aim is now to predict the label given some input, say x0 . Call the predicted
value y0 . Bayes rule says: predict the most likely label, that is predict
y0 := 1 if η(x0) > 1/2, 0 if η(x0) < 1/2, and (undecided; say) 1 if η(x0) = 1/2.
Note that η(x) > 1/2 if and only if the log-odds ratio
log[ η(x) / (1 − η(x)) ]
is strictly positive. When X is a subset of r-dimensional Euclidean space Rr
one may want to assume the parametric logistic regression model, where the
log-odds ratio is a linear function of the (row) vector x:
log[ η(x) / (1 − η(x)) ] = α0 + xβ0, x ∈ X,
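A minimal sketch of Bayes' rule under the logistic model; the parameter values α0 = −1, β0 = 2 are made up for illustration and are not from the notes.

```python
import math

ALPHA0, BETA0 = -1.0, 2.0   # hypothetical parameter values

def eta(x):
    # Under the logistic model, log(eta/(1 - eta)) = alpha0 + x*beta0,
    # so eta(x) is the logistic function of the log-odds.
    return 1.0 / (1.0 + math.exp(-(ALPHA0 + x * BETA0)))

def bayes_rule(x):
    # Predict the most likely label; the tie eta = 1/2 goes to label 1.
    return 1 if eta(x) >= 0.5 else 0

# eta(x) > 1/2 exactly when the log-odds alpha0 + x*beta0 is strictly
# positive, which here means x > 0.5
print(bayes_rule(0.4), bayes_rule(0.6))
```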
n1/3 (θ̂ − θ0 )
¹ When the sample is split into a training set and a test set, this symmetry is lost.
Chapter 2
Glivenko Cantelli classes
This chapter introduces the notation and (part of) the problem setting.
IP( lim_{n→∞} Tn = T ) = 1.
µ := EX
X̄n := (1/n) Σ_{i=1}^n Xi, n ≥ 1.
X̄n → µ, a.s.
Now, let
F (t) := P (X ≤ t), t ∈ R,
F̂n(t) := (1/n) #{Xi ≤ t, 1 ≤ i ≤ n}, t ∈ R,
H0 : F = F0 .
Test statistic:
Tn := sup_t |F̂n(t) − F0(t)|.
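For a continuous F0 the supremum in Tn is attained at the order statistics, which gives a simple way to compute it. The uniform null hypothesis below is an arbitrary test case, not one from the notes.

```python
import random

def ks_statistic(sample, F0):
    # T_n = sup_t |F_hat_n(t) - F0(t)|; for continuous F0 it suffices to
    # check F_hat_n just before and at each jump, i.e. at the order statistics.
    xs = sorted(sample)
    n = len(xs)
    Tn = 0.0
    for i, x in enumerate(xs, start=1):
        Tn = max(Tn, abs(i / n - F0(x)), abs(F0(x) - (i - 1) / n))
    return Tn

rng = random.Random(3)
xs = [rng.random() for _ in range(500)]          # H0: F0(t) = t is true here
Tn = ks_statistic(xs, lambda t: min(max(t, 0.0), 1.0))
print(Tn)
```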
Questions:
D = {l(−∞,t] : t ∈ R}
P g := Eg(X),
and
Pn g := (1/n) Σ_{i=1}^n g(Xi).
We define
G := { (1/2) log[ (p + p0)/(2p0) ] 1{p0 > 0} : p ∈ P }.
Theorem 2.5.1 Suppose that G is GC. Then
h(p̂, p0) →IP 0.
It follows that
h²( (p̂ + p0)/2, p0 ) ≤ ‖Pn − P‖G.
Finally,
h²(p, p0) = (1/2) ∫ (√p − √p0)²
= 4 ∫ [ ( √((p + p0)/2) + √p0 ) / ( √p + √p0 ) ]² (1/2)( √((p + p0)/2) − √p0 )²
≤ 16 h²( (p + p0)/2, p0 ),
since the squared ratio in brackets is at most 4. So we conclude, taking p = p̂,
h²(p̂, p0) ≤ 16 ‖Pn − P‖G.
By assumption, G is GC, i.e., ‖Pn − P‖G →IP 0. □
Chapter 3
(Exponential) probability inequalities
A statistician is almost never sure about something, but often says that something
holds “with large probability”. We study probability inequalities for deviations of
means from their expectations. These are exponential inequalities: the probability
that the deviation is large is exponentially small. (We will in fact see that the
inequalities are similar to those obtained under Gaussian assumptions.) Exponentially
small probabilities are useful when one wants to prove that, with large probability,
a whole collection of events holds simultaneously: it then suffices to show that the
sum of the small probabilities that any single event fails is still small.
P(X ≥ a) ≤ Eφ(X)/φ(a).
Proof.
Eφ(X) = ∫ φ(x) dP(x) = ∫_{X≥a} φ(x) dP(x) + ∫_{X<a} φ(x) dP(x)
≥ ∫_{X≥a} φ(x) dP(x) ≥ ∫_{X≥a} φ(a) dP(x)
= φ(a) ∫_{X≥a} dP = φ(a) P(X ≥ a).
□
IP( |X̄n − µ| ≥ a ) ≤ E(X̄n − µ)²/a² = σ²/(na²) → 0 ∀ a > 0.
Thus
X̄n →IP µ
((weak) law of large numbers¹). In a reformulation we put a = σ√(t/n):
IP( |X̄n − µ| ≥ σ√(t/n) ) ≤ 1/t, ∀ t > 0.
IP( max_{1≤j≤N} |(Pn − P)gj| ≥ a ) ≤ Σ_{j=1}^N IP( |(Pn − P)gj| ≥ a ) ≤ Nσ²/(na²) ∀ a > 0.
¹ This implies X̄n → µ almost surely by a martingale argument (skipped here).
² The union bound says that IP(A ∪ B) ≤ IP(A) + IP(B) for any two sets A and B.
Then
IP( Σ_{i=1}^n Xi ≥ a ) ≤ exp[ −a²/(2b²) ] ∀ a > 0,
or reformulated
IP( Σ_{i=1}^n Xi ≥ b√(2t) ) ≤ exp[−t] ∀ t > 0.
Hoeffding's condition: X1, . . . , Xn are independent with EXi = 0 and |Xi| ≤ ci, i = 1, . . . , n.
Lemma 3.2.1 Assume Hoeffding's condition. Let b² := Σ_{i=1}^n ci². Then for all λ > 0,
E exp[ λ Σ_{i=1}^n Xi ] ≤ exp[ λ²b²/2 ].
Define now
αi := (ci − Xi)/(2ci).
Then αi ∈ [0, 1] and
Xi = αi(−ci) + (1 − αi)ci,
so, by convexity of the exponential function,
exp[λXi] ≤ αi exp[−λci] + (1 − αi) exp[λci].
Taking expectations and using EXi = 0 (so that Eαi = 1/2):
E exp[λXi] ≤ (1/2) exp[−λci] + (1/2) exp[λci].
Now, for all x,
exp[−x] + exp[x] = 2 Σ_{k=0}^∞ x^{2k}/(2k)!,
whereas
exp[x²/2] = Σ_{k=0}^∞ x^{2k}/(2^k k!).
Since
(2k)! ≥ 2^k k!,
we see that (exp[−x] + exp[x])/2 ≤ exp[x²/2], and hence
E exp[λXi] ≤ exp[λ²ci²/2].
Therefore, by independence,
E exp[ λ Σ_{i=1}^n Xi ] = Π_{i=1}^n E exp[λXi] ≤ exp[ λ² Σ_{i=1}^n ci² / 2 ].
□
Theorem 3.2.1 (Hoeffding's inequality) Assume Hoeffding's condition. Let b² := Σ_{i=1}^n ci². Then
IP( Σ_{i=1}^n Xi ≥ a ) ≤ exp[ −a²/(2 Σ_{i=1}^n ci²) ] ∀ a > 0,
or reformulated
IP( Σ_{i=1}^n Xi ≥ b√(2t) ) ≤ exp[−t] ∀ t > 0.
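A quick Monte-Carlo sanity check of Hoeffding's inequality for Rademacher signs (ci = 1); the values of n, a, the seed and the number of repetitions below are arbitrary.

```python
import math
import random

def hoeffding_bound(a, cs):
    # P(sum_i X_i >= a) <= exp(-a^2 / (2 * sum_i c_i^2))
    return math.exp(-a * a / (2.0 * sum(c * c for c in cs)))

rng = random.Random(11)
n, reps, a = 100, 2000, 20.0
cs = [1.0] * n                       # X_i = +-1 with probability 1/2 each
hits = 0
for _ in range(reps):
    if sum(rng.choice((-1.0, 1.0)) for _ in range(n)) >= a:
        hits += 1
freq = hits / reps
print(freq, hoeffding_bound(a, cs))
```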
One can also derive an inequality for the expectation of a maximum (instead
of a probability inequality).
Lemma 3.2.2 Consider N functions gj : X → R, j = 1, . . . , N, with, for some constant K,
Egj(X) = 0 and sup_{x∈X} |gj(x)| ≤ K, ∀ j ∈ {1, . . . , N}.
Then
E max_{1≤j≤N} |Pn gj| ≤ K √( 2 log(2N)/n ).
Proof. We have, using e^{|x|} ≤ e^x + e^{−x},
E exp[λ|nPn gj|] ≤ E exp[λnPn gj] + E exp[−λnPn gj] ≤ 2 exp[λ²nK²/2],
where the last step is Lemma 3.2.1 with ci = K, i = 1, . . . , n.
Hence
E max_{1≤j≤N} n|Pn gj| = (1/λ) E log exp[ λ max_{1≤j≤N} n|Pn gj| ]
≤ (1/λ) log E exp[ λ max_{1≤j≤N} n|Pn gj| ]   (by Jensen's inequality)
= (1/λ) log E max_{1≤j≤N} exp[ λn|Pn gj| ]
≤ (1/λ) log Σ_{j=1}^N E exp[ λn|Pn gj| ]
≤ (1/λ) log( 2N exp[λ²nK²/2] )
= log(2N)/λ + λnK²/2.
We minimize the last expression over λ. Take the derivative and set it to zero:
−log(2N)/λ² + nK²/2 = 0.
This gives
λ = √( 2 log(2N)/n ) / K,
and with this value of λ,
log(2N)/λ + λnK²/2 = K√( n log(2N)/2 ) + K√( n log(2N)/2 ) = K√( 2n log(2N) ).
Dividing by n completes the proof.
□
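The bound of Lemma 3.2.2 can be observed by simulation. For simplicity the sketch below uses N independent Rademacher averages (in the lemma the gj share one sample, but the bound holds regardless of dependence between the averages); n, N and the seed are arbitrary.

```python
import math
import random

def lemma_bound(K, N, n):
    # E max_{j <= N} |P_n g_j| <= K * sqrt(2 * log(2N) / n)
    return K * math.sqrt(2.0 * math.log(2.0 * N) / n)

rng = random.Random(5)
n, N, reps = 200, 50, 200
total = 0.0
for _ in range(reps):
    total += max(
        abs(sum(rng.choice((-1.0, 1.0)) for _ in range(n))) / n
        for _ in range(N)
    )
emp_max = total / reps          # Monte-Carlo estimate of E max_j |avg_j|
print(emp_max, lemma_bound(1.0, N, n))
```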
Condition 3.3.1 (Bernstein's condition) For all i,
EXi = 0,  E|Xi|^m ≤ (m!/2) K^{m−2} σi², m = 2, 3, . . . .
Lemma 3.3.1 Suppose Bernstein's condition. Define b² := Σ_{i=1}^n σi². Then for all 0 < λ < 1/K,
E exp[ λ Σ_{i=1}^n Xi ] ≤ exp[ λ²b² / (2(1 − λK)) ].
Or reformulated:
IP( Σ_{i=1}^n Xi ≥ b√(2t) + Kt ) ≤ exp[−t] ∀ t > 0.
Proof. For each i,
E exp[λXi] = 1 + Σ_{m=2}^∞ λ^m E Xi^m / m! ≤ 1 + (λ²σi²/2) Σ_{m=2}^∞ (λK)^{m−2}
= 1 + λ²σi² / (2(1 − λK))
≤ exp[ λ²σi² / (2(1 − λK)) ].
It follows that
E exp[ λ Σ_{i=1}^n Xi ] = Π_{i=1}^n E exp[λXi] ≤ exp[ λ²b² / (2(1 − λK)) ].
Now apply Chebyshev's inequality to Σ_{i=1}^n Xi, with φ(x) = exp[λx], x ∈ R. We arrive at
IP( Σ_{i=1}^n Xi ≥ a ) ≤ exp[ λ²b² / (2(1 − λK)) − λa ].
Take
λ = a / (Ka + b²)
to complete the first part. For the reformulation, choose a = b√(2t) + Kt.
□
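Uniform variables on [−1, 1] satisfy Bernstein's condition with K = 1 and σi² = 1/3 (e.g. E|Xi|² = 1/3 and E|Xi|^m = 1/(m+1) ≤ (m!/2)·(1/3) for m ≥ 2), which allows a numerical check of the reformulated inequality; n, t and the seed are arbitrary.

```python
import math
import random

def bernstein_threshold(t, b2, K):
    # P(sum_i X_i >= b*sqrt(2t) + K*t) <= exp(-t)
    return math.sqrt(b2) * math.sqrt(2.0 * t) + K * t

rng = random.Random(2)
n, reps, t, K = 150, 2000, 2.0, 1.0
b2 = n / 3.0                # sigma_i^2 = 1/3 for X_i uniform on [-1, 1]
thr = bernstein_threshold(t, b2, K)
hits = sum(
    1 for _ in range(reps)
    if sum(rng.uniform(-1.0, 1.0) for _ in range(n)) >= thr
)
print(hits / reps, math.exp(-t))
```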
Corollary 3.3.1 Consider N functions gj : X → R, j = 1, . . . , N, with, for some constants σ² and K,
Egj(X) = 0,  Egj²(X) ≤ σ²,  sup_{x∈X} |gj(x)| ≤ K,  ∀ j ∈ {1, . . . , N}.
3.4 Exercises
P (X ≥ a) ≤ exp[λ2 /2 − λa].
P (X ≥ a) ≤ exp[−a2 /2].
E exp[λ|X|] − 1 − λ|X| ≤ ς².
Show that Bernstein’s condition (Condition 3.3.1) holds with appropriate con-
stants σ 2 and K.
Exercise 3.4.4 Recall the class
G := { (1/2) log[ (p + p0)/(2p0) ] 1{p0 > 0} : p ∈ P },  p0 := dP/dµ,
defined in Section 2.5. Show that for all g ∈ G the random variable g(X) − Eg(X) satisfies Bernstein's condition (Condition 3.3.1). Hint: use Exercise 3.4.3.
Chapter 4
ULLNs based on entropy with bracketing
F(x) := P(X ≤ x), x ∈ R,
F̂n(x) := (1/n) #{Xi ≤ x, 1 ≤ i ≤ n}.
Theorem 4.1.1 (the classical Glivenko Cantelli Theorem) It holds that
sup_{x∈R} |F̂n(x) − F(x)| →IP 0.
|l(−∞,x] − F (x)| ≤ 1
Let now 0 < δ < 1 be arbitrary and take a0 < a1 < · · · < aN such that
F (aj ) − F (aj−1 ) = δ. Note that then N ≤ 1 + 1/δ. If x ∈ (aj−1 , aj ] we clearly
have
(−∞, aj−1 ] ⊂ (−∞, x] ⊂ (−∞, aj ],
that is, we can “bracket” l(−∞,x] between a lower function l(−∞,aj−1 ] and an
upper function l(−∞,aj ] . This gives for x ∈ (aj−1 , aj ]
4.2 Entropy
Definition 4.2.1 Consider a subset S of a metric space with metric d. For any
δ > 0, let N (δ, S, d) be the minimum number of balls with radius δ, necessary
to cover S. Then N (δ, S, d) is called the δ-covering number of S. Moreover
H(·, S, d) := log N (·, S, d)
is called the entropy of S.
Figure 3
Example 4.2.1 Let S = [−1, 1]^r be the r-dimensional hypercube. Take as metric
d(x, y) := max_{1≤k≤r} |xk − yk| =: ‖x − y‖∞, x ∈ R^r, y ∈ R^r.
Then
N(δ, S, d) ≤ ⌈1/δ⌉^r ≤ (1 + 1/δ)^r,
H(δ, S, d) ≤ r log(1 + 1/δ).
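The covering-number bound for the hypercube corresponds to a simple grid construction, which can be verified numerically; r = 3 and δ = 1/4 below are illustrative values.

```python
import math

r, delta = 3, 0.25
per_coord = math.ceil(1.0 / delta)
# Centres spaced 2*delta apart cover [-1, 1] in each coordinate, so their
# r-fold products cover [-1, 1]^r in the sup-norm: N <= ceil(1/delta)^r.
centres = [-1.0 + delta + 2.0 * delta * k for k in range(per_coord)]
N = per_coord ** r

# worst sup-norm distance from a fine grid of test points to a nearest centre
worst = max(
    min(abs(x - c) for c in centres)
    for x in (-1.0 + 0.001 * k for k in range(2001))
)
print(N, worst, r * math.log(1.0 + 1.0 / delta))
```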
We define for q ≥ 1 the norm ‖g‖q := ( ∫ |g|^q dP )^{1/q} and the space Lq(P) := {g : ‖g‖q < ∞}. We let
L∞ = {g : kgk∞ < ∞},
where
kgk∞ := sup |g(x)|
x∈X
is the sup-norm.
Definition 4.3.1 Consider a class G ⊂ Lq(P). For any δ > 0, let {[gjL, gjU]}_{j=1}^N ⊂ Lq(P) be such that
(i) ∀ j: gjL ≤ gjU and ‖gjU − gjL‖q ≤ δ,
(ii) ∀ g ∈ G ∃ j such that gjL ≤ g ≤ gjU.
We then call {[gjL, gjU]}_{j=1}^N a δ-bracketing set. The δ-covering number with bracketing is
Nq,B(δ, G, P) := min{ N : ∃ δ-bracketing set {[gjL, gjU]}_{j=1}^N }.
We partition the interval [0, 1] into m ≤ 1 + 1/δ intervals (aj−1, aj] with length at most δ, 0 = a0 < · · · < am = 1. For g ∈ G and x ∈ (aj−1, aj] we let
g̃(x) := ⌊g(aj)/δ⌋ δ,
with ⌊g(aj)/δ⌋ the integer part of g(aj)/δ. Then g(aj) − δ < g̃(x) ≤ g(aj).
We have at most 1 + 1/δ choices for ⌊g(a1)/δ⌋. Given ⌊g(aj)/δ⌋, there are at most 7 choices for ⌊g(aj+1)/δ⌋. The total number of functions g̃ as g varies is thus at most
(1 + 1/δ) × 7 × · · · × 7  (m − 1 factors of 7)  ≤ (1 + 1/δ) 7^{1/δ}.
Thus
H∞(δ, G) ≤ log(1 + 1/δ) + (log 7)/δ ∀ 0 < δ < 1.
Proof. We use the same arguments as in the proof of Theorem 4.1.1: let δ > 0 be arbitrary and let {[gjL, gjU]}_{j=1}^N ⊂ L1(P) be a minimal δ-covering set with bracketing (N = NB(δ, G, P)). Because this is a finite set, we know that¹
max_{1≤j≤N} |(Pn − P)gjL| ∨ max_{1≤j≤N} |(Pn − P)gjU| →IP 0.
Let G ⊂ Lq (P ).
Definition 4.4.1 The envelope function (or envelope) is
G := sup_{g∈G} |g|.
Note that
Hq,B (·, G, P ) < ∞ ⇒ G ∈ Lq (P ).
Similarly
H∞ (·, G) < ∞ ⇒ kGk∞ < ∞.
¹ For a and b in R we let a ∨ b := max{a, b} and a ∧ b := min{a, b}.
Then by continuity
lim wθ,ρ = 0.
ρ→0
Define
Bθ := {ϑ : d(θ, ϑ) < ρθ }.
Then {Bθ : θ ∈ Θ} is an open cover of Θ. Since Θ is compact there exists a
finite sub-cover, say
For θ ∈ Bj we have
with
P (gjU − gjL )q = P (2wθj ,ρθj )q ≤ (2δ)q .
□
Here is an example where the ULLN is used for proving consistency of certain
maximum likelihood estimators. Let Y ∈ R be a random variable with unknown
h(p, p̃) = ( (1/2) ∫ (√p − √p̃)² )^{1/2}.
2
Lemma 4.6.1 It holds that
IP
h(pF̂ , pF0 ) → 0.
is continuous for d, so
F ↦ 2pF / (pF + pF0)
is continuous for d as well. Let
G := { 2pF / (pF + pF0) : F ∈ F }.
This class has envelope
G ≤ 2.
It follows from Lemma 4.5.1 that G is GC:
‖Pn − P‖G →IP 0.
We have
0 ≤ Pn log[ 2pF̂ / (pF̂ + pF0) ]
≤ Pn[ 2pF̂ / (pF̂ + pF0) − 1 ]
= (Pn − P)[ 2pF̂ / (pF̂ + pF0) ] + P[ 2pF̂ / (pF̂ + pF0) ] − 1.
² We skip details.
But
h²(p, p0) = (1/2) ∫ (√p − √p0)²
= (1/2) ∫ (p − p0)² / (√p + √p0)²
≤ (1/2) ∫ (p − p0)² / (p + p0)
= (1/2) ∫ (p0 − p)² / (p0 + p) + (1/2) ∫ (p0 − p)   [the last integral equals 0]
= ∫ p0 (p0 − p) / (p0 + p)
= ∫ ( 1 − 2p/(p + p0) ) p0
= 1 − P( 2p/(p + p0) ).
4.7 Exercises
Exercise 4.7.1 Complete the proof of Theorem 4.1.1 allowing for distribution
functions F with jumps.
Exercise 4.7.2 Let
d2²(x, y) := Σ_{k=1}^r (xk − yk)² =: ‖x − y‖2², x ∈ R^r, y ∈ R^r,
and let
S := {x : kxk2 ≤ 1}
be the r-dimensional ball. Show that
Chapter 5
Symmetrization
Pn := (1/n) Σ_{i=1}^n δXi,  Pn′ := (1/n) Σ_{i=1}^n δX′i.
and likewise
‖Pn′ − P‖G := sup_{g∈G} |(Pn′ − P)g|,
and
‖Pn − Pn′‖G := sup_{g∈G} |(Pn − Pn′)g|.
EE(Y |X) = EY
and
E‖Pn − P‖G ≤ E‖Pn − Pn′‖G.
Proof. Obviously,
E(Pn g | X) = Pn g
and
E(Pn′ g | X) = P g.
So
(Pn − P)g = E[ (Pn − Pn′)g | X ].
Hence
‖Pn − P‖G = sup_{g∈G} |(Pn − P)g| = sup_{g∈G} |E[(Pn − Pn′)g | X]| ≤ E[ ‖Pn − Pn′‖G | X ],
and taking expectations finishes the proof.
□
Definition 5.2.1 A Rademacher sequence {σi}_{i=1}^n is a sequence of independent random variables σi with
IP(σi = 1) = IP(σi = −1) = 1/2 ∀ i.
Let {σi }ni=1 be a Rademacher sequence, independent of the two samples X and
X0 . We define the symmetrized empirical measure
Pnσ g := (1/n) Σ_{i=1}^n σi g(Xi), g ∈ G.
Let
‖Pnσ‖G := sup_{g∈G} |Pnσ g|.
E‖Pn − P‖G ≤ 2 E‖Pnσ‖G.
Proof.
E‖Pn − Pn′‖G = E‖Pnσ − Pn′σ‖G ≤ E‖Pnσ‖G + E‖Pn′σ‖G = 2 E‖Pnσ‖G.
□
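The inequality E‖Pn − P‖G ≤ 2E‖Pnσ‖G can be observed numerically for half-line indicators under the uniform distribution. The grid of thresholds, n, the seed and the repetition count below are arbitrary choices.

```python
import random

def sup_dev(xs, ts):
    # ||P_n - P||_G for G = {1(-inf, t] : t in ts}, P = Uniform[0, 1]
    n = len(xs)
    return max(abs(sum(1 for x in xs if x <= t) / n - t) for t in ts)

def sup_sym(xs, signs, ts):
    # ||P_n^sigma||_G : supremum of the symmetrized empirical measure
    n = len(xs)
    return max(
        abs(sum(s for x, s in zip(xs, signs) if x <= t)) / n for t in ts
    )

rng = random.Random(9)
n, reps = 100, 300
ts = [k / 20.0 for k in range(21)]
dev = sym = 0.0
for _ in range(reps):
    xs = [rng.random() for _ in range(n)]
    signs = [rng.choice((-1, 1)) for _ in range(n)]
    dev += sup_dev(xs, ts)
    sym += sup_sym(xs, signs, ts)
dev, sym = dev / reps, sym / reps
print(dev, 2.0 * sym)
```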
Proof. Let IPX denote the conditional probability given X. If kPn − P kG > δ,
we know that for some random function g∗ = g∗ (X) depending on X,
5.4 Exercises
Exercise 5.4.1 Use the same arguments as in the proof of Lemma 3.2.2 to show that
E( max_{1≤j≤N} |Pnσ gj| | X ) ≤ Rn √( 2 log(2N)/n ),
with
Rn² := max_{1≤j≤N} ‖gj‖n²,
Chapter 6
ULLNs based on symmetrization
In this chapter, we prove uniform laws of large numbers for the empirical mean
of functions g of the individual observations, when g varies over a class G of
functions. First, we study the case where G is finite. Symmetrization is used
in order to be able to apply Hoeffding's inequality. Hoeffding's inequality gives
exponentially small probabilities for the deviation of averages from their expectations.
So, considering only a finite number of such averages, the difference between these
averages and their expectations will, with large probability, be small for all of
them simultaneously.
If G is not finite, we approximate it by a finite set. A δ-approximation is called
a δ-covering, and the number of elements of a minimal δ-covering is called the
δ-covering number.
We introduce Vapnik Chervonenkis (VC) classes. These are classes with small
covering numbers.
Remark. It can be shown that if ‖Pn − P‖G →IP 0, then also ‖Pn − P‖G →a.s. 0. This involves e.g. martingale arguments. We will not consider this issue.
IP( ∪_{k=1}^N Ak ) ≤ Σ_{k=1}^N IP(Ak) ≤ N max_{1≤k≤N} IP(Ak)   (union bound).
Lemma 6.1.1 Let G be a finite class of functions, with cardinality |G| := N >
1. Suppose that for some finite constant K,
max kgk∞ ≤ K.
g∈G
Then we have
IP( ‖Pnσ‖G > K √( 2(log N + t)/n ) ) ≤ 2 exp[−t] ∀ t > 0.
Proof.
• By Hoeffding's inequality, for each g ∈ G,
IP( |Pnσ g| > K √(2t/n) ) ≤ 2 exp[−t] ∀ t > 0.
□
Definition 6.1.2 Let S be some subset of a metric space (Λ, d). For δ > 0, the
δ-covering number N (δ, S, d) of S is the minimum number of balls with radius
δ, necessary to cover S, i.e. the smallest value of N , such that there exist
s1, . . . , sN in¹ Λ with
min_{j=1,...,N} d(s, sj) ≤ δ, ∀ s ∈ S.
kgk∞ ≤ K, ∀ g ∈ G.
|Pnσ g| ≤ |Pnσ gj | + δ.
So
kPnσ kG ≤ max |Pnσ gj | + δ.
j=1,...N
• Conclude that
IP_X( ‖Pnσ‖G > δ + K √( 2(log N + t)/n ) ) ≤ 2 exp[−t] ∀ t > 0.
• But then
IP( ‖Pnσ‖G > 2δ + K √(2t/n) )
≤ 2 exp[−t] + IP( K √( 2H1(δ, G, Pn)/n ) > δ ) ∀ t > 0.
G := {g ↑, 0 ≤ g ≤ 1}.
g̃(x) := ⌈g(x)/δ⌉ δ, x ∈ R,
We conclude that
N∞(δ, G, Pn) ≤ (m + n − 1 choose m),
and hence
(1/n) H1(δ, G, Pn) →IP 0.
Then
‖Pn − P‖G →IP 0.
(i) Let GK := { g 1{G ≤ K} : g ∈ G }. This class is uniformly bounded by K and since H1(·, GK, Pn) ≤ H1(·, G, Pn) we conclude from Theorem 6.1.1 that
‖Pn − P‖GK →IP 0.
(ii) We have
□
That is, count the number of sets in D, where two sets D1 and D2 are considered equal if (D1 △ D2) ∩ {ξ1, . . . , ξn} = ∅. Here
D1 △ D2 := (D1 ∩ D2ᶜ) ∪ (D1ᶜ ∩ D2)
is the symmetric difference between D1 and D2.
Remark. For our purposes, we will not need to calculate ∆D(ξ1, . . . , ξn) exactly; a good enough upper bound suffices.
D = {l(−∞,t] : t ∈ R}.
∆D(ξ1, . . . , ξn) ≤ n + 1.
Example. Let D be the collection of all finite subsets of X. Then, if the points ξ1, . . . , ξn are distinct,
∆D(ξ1, . . . , ξn) = 2^n.
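These two counts are easy to verify computationally for a small point set. The code below enumerates enough half-lines to realize every achievable intersection pattern, and all finite subsets of the points; the point set itself is chosen arbitrarily.

```python
import itertools

def num_intersections(indicator_family, points):
    # Delta_D(xi_1,...,xi_n): number of distinct traces D ∩ {xi_1,...,xi_n},
    # represented here as distinct 0/1-patterns on the points.
    return len({tuple(ind(x) for x in points) for ind in indicator_family})

points = [0.3, 1.2, 2.7, 3.1, 4.9]
n = len(points)

# Half-lines (-inf, t]: thresholds just above each point (plus one below all)
# realize every possible trace, and there are exactly n + 1 of them.
half_lines = [
    (lambda x, t=t: int(x <= t)) for t in [-1.0] + [p + 0.01 for p in points]
]
count_half = num_intersections(half_lines, points)

# The collection of all finite subsets shatters the points: 2^n traces.
subsets = [
    (lambda x, S=frozenset(S): int(x in S))
    for k in range(n + 1)
    for S in itertools.combinations(points, k)
]
count_subsets = num_intersections(subsets, points)
print(count_half, count_subsets)
```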
Proof of the only if part. This follows from applying Theorem 6.1.1 to
G = {lD : D ∈ D}.
To see this, note first that a class of indicator functions is uniformly bounded
by 1. This is also true for the centred version, i.e. we can take K = 1 in
Theorem 6.1.1. Moreover, writing N∞ (·, G, Pn ) for the covering number of G
for the (pseudo-)metric induced by the (pseudo-)norm
we see that
N1 (·, G, Pn ) ≤ N∞ (·, G, Pn ).
(1/n) H1(δ, {lD : D ∈ D}, Pn) ≤ (1/n) log ∆D(X1, . . . , Xn) →IP 0.
□
Examples.
a) X = R, D = {l(−∞,t] : t ∈ R}. Since mD(n) ≤ n + 1, D is VC.
b) X = R^d, D = {l(−∞,t] : t ∈ R^d}. Since mD(n) ≤ (n + 1)^d, D is VC.
c) X = R^d, D = { {x : θᵀx > t} : (θᵀ, t)ᵀ ∈ R^{d+1} }. Since mD(n) ≤ 2^d n^d, D is VC.
Proof. Exercise. □
Examples.
Example. Let X = [0, 1]2 , and let D be the collection of all convex subsets of
X . Then D is not VC, but when P is uniform, D is GC.
Definition 6.3.2 The VC dimension of D is
The following lemma is beautiful, but to avoid digressions, we will not provide
a proof.
Lemma 6.3.2 (Sauer-Shelah Lemma) We have that D is VC if and only if V(D) < ∞. In fact, we have for V = V(D),
mD(n) ≤ Σ_{k=0}^V (n choose k).
Examples (X = Rd ).
a) G = {g(x) = θ0 + θ1 x1 + . . . + θd xd : θ ∈ Rd+1 },
b) G = {g(x) = |θ0 + θ1 x1 + . . . + θd xd | : θ ∈ Rd+1 } .
c) d = 1, G = { g(x) = (a + bx) 1{x ≤ c} + (d + ex) 1{x > c} : (a, b, c, d, e)ᵀ ∈ R⁵ }.
Definition 6.4.2 Let S be some subset of a metric space (Λ, d). For δ > 0,
the δ-packing number D(δ, S, d) of S is the largest value of N , such that there
exist s1 , . . . , sN in S with
d(sk , sj ) > δ, ∀ k 6= j.
But
N (δ/2, {s1 , . . . , sN }, d) = N.
□
Theorem 6.4.1 Let Q be any probability measure on X and let N1(·, G, Q) be the covering number of G endowed with the metric corresponding to the L1(Q) norm. For a VC class G with VC dimension V, we have, for a constant A depending only on V,
N1( δ Q(G), G, Q ) ≤ A (1/δ)^{2V}, 0 < δ ≤ 1.
Proof. Without loss of generality, assume Q(G) = 1. Choose S ∈ X with
distribution dQS = GdQ. Given S = s, choose T uniformly in the interval
[−G(s), G(s)]. Let g1 , . . . , gN be a maximal set in G, such that Q(|gj − gk |) > δ
for j ≠ k. Consider a pair j ≠ k. Given S = s, the probability that T falls in between the two graphs of gj and gk is |gj(s) − gk(s)|/(2G(s)); unconditionally, it is Q(|gj − gk|)/2 > δ/2. Hence, for n independent copies (S1, T1), . . . , (Sn, Tn), the probability that none of the Ti falls in between the graphs of gj and gk is at most
(1 − δ/2)^n.
The probability that for some pair j ≠ k none of the Ti falls in between the graphs of gj and gk is then at most
(N choose 2) (1 − δ/2)^n ≤ (1/2) exp[ 2 log N − nδ/2 ] ≤ 1/2 < 1,
when we choose n as the smallest integer such that
n ≥ 4 log N / δ.
So for such a value of n, with positive probability, for every j ≠ k some of the Ti fall in between the graphs of gj and gk. Therefore, we must have
N ≤ c n^V.
But then, for N ≥ exp[δ/4],
N ≤ c (4 log N/δ + 1)^V
≤ c (8 log N/δ)^V
= c (16V log N^{1/(2V)} / δ)^V
≤ c (16V/δ)^V N^{1/2},
using log x ≤ x. So
N ≤ c² (16V/δ)^{2V}.
□
Corollary 6.4.1 By Theorem 6.1.2 and Theorem 6.4.1 we arrive at the fol-
lowing important conclusion:
G VC & P G < ∞ ⇒ G GC
6.5 Exercises
Exercise 6.5.1 Let G be a finite class of functions, with cardinality |G| := N >
1. Suppose that for some finite constant K,
max kgk∞ ≤ K.
g∈G
Hq (·, GK , Pn ) ≤ Hq (·, G, Pn ).
Exercise 6.5.4 Are the following classes of sets (functions) VC? Why (not)?
Exercise 6.5.5 Let G be the class of all functions g on [0, 1] with derivative ġ
satisfying |ġ| ≤ 1. Check that G is not VC. Show that G is GC by using partial
integration and the Glivenko Cantelli Theorem for the empirical distribution
function.
Chapter 7
M-estimators
Let Θ be a parameter space (a subset of some metric space with metric d) and
let for θ ∈ Θ,
γθ : X → R,
be some loss function. We assume P |γθ | < ∞ for all θ ∈ Θ. We estimate the
unknown parameter
θ0 := arg min P γθ ,
θ∈Θ
by the M-estimator
θ̂n := arg min Pn γθ .
θ∈Θ
Examples.
(i) Location estimators. X = R, Θ = R, and
(i.a) γθ (x) = (x − θ)2 (estimating the mean),
(i.b) γθ (x) = |x − θ| (estimating the median).
(ii) Maximum likelihood. {pθ : θ ∈ Θ} family of densities w.r.t. a σ-finite
dominating measure µ, and
γθ = − log pθ .
If dP/dµ = pθ0 , θ0 ∈ Θ, then indeed θ0 is a minimizer of P (γθ ), θ ∈ Θ.
(ii.a) Poisson distribution:
pθ(x) = e^{−θ} θ^x / x!, θ > 0, x ∈ {0, 1, 2, . . .}.
pθ(x) = e^{θ−x} / (1 + e^{θ−x})², θ ∈ R, x ∈ R.
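A grid-search sketch showing that the two location losses in example (i) recover (approximately) the sample mean and the sample median; the Gaussian data, the grid of candidate values and the seed are illustrative choices.

```python
import random
import statistics

def m_estimate(sample, loss, grid):
    # M-estimator: minimise the empirical risk theta -> P_n gamma_theta
    risk = lambda theta: sum(loss(x - theta) for x in sample) / len(sample)
    return min(grid, key=risk)

rng = random.Random(4)
sample = [rng.gauss(1.0, 1.0) for _ in range(500)]
grid = [k / 100.0 for k in range(-100, 301)]   # candidate thetas in [-1, 3]

theta_sq = m_estimate(sample, lambda u: u * u, grid)   # (i.a): near the mean
theta_abs = m_estimate(sample, abs, grid)              # (i.b): near the median
print(theta_sq, statistics.mean(sample))
print(theta_abs, statistics.median(sample))
```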
7.2 Consistency
Define for θ ∈ Θ,
R(θ) = P γθ ,
and
Rn (θ) = Pn γθ .
Definition 7.2.1 We say that θ0 is well-separated if for all η > 0,
inf{ R(θ) : d(θ, θ0) > η } > R(θ0).
i.e., that {γθ : θ ∈ Θ} is a GC class. Then R(θ̂n) →IP R(θ0). If moreover θ0 is well-separated, also θ̂n →IP θ0.
Proof. Since θ̂n minimizes Rn, we have Rn(θ̂n) ≤ Rn(θ0), and hence
0 ≤ R(θ̂n) − R(θ0)
= [R(θ̂n) − R(θ0)] − [Rn(θ̂n) − Rn(θ0)] + [Rn(θ̂n) − Rn(θ0)]
≤ [R(θ̂n) − R(θ0)] − [Rn(θ̂n) − Rn(θ0)] →IP 0.
So R(θ̂n) →IP R(θ0) and hence, if θ0 is well-separated, θ̂n →IP θ0.
□
G := sup_{θ∈Θ} |γθ|.
Then
sup_{θ∈Θ} |Rn(θ) − R(θ)| →IP 0.
Define, for ε > 0,
α := ε / (ε + ‖θ̂n − θ0‖)
and
θ̃n := αθ̂n + (1 − α)θ0.
Then
‖θ̃n − θ0‖ = α‖θ̂n − θ0‖ ≤ ε.
Moreover, by convexity of θ ↦ Rn(θ),
Rn(θ̃n) ≤ αRn(θ̂n) + (1 − α)Rn(θ0) ≤ Rn(θ0).
It follows from the arguments used in the proof of Proposition 7.2.1 that R(θ̃n) →IP R(θ0). The convexity and the uniqueness of θ0 now imply that ‖θ̃n − θ0‖ →IP 0.
□
Example 7.2.1 Let X ∈ R, Θ = R and for some q ≥ 1
γθ (x) := |x − θ|q , x ∈ R.
θ̂n →IP θ0.
We may extend this to the situation where there are co-variables: replace X by
(X, Y ) where X ∈ Rr is a row-vector containing the co-variables and Y ∈ R is
the response variable. The loss function is
γθ (x, y) = |y − xθ|q .
Then E|Y − Xθ0 |q < ∞ together with uniqueness of θ0 yields consistency.
Example 7.2.2 Replace X by (X, Y ) where X ∈ [0, 1] is a co-variable and
Y = θ0 (X) + ξ,
with the noise ξ ∼ N (0, σ 2 ). We use least squares loss:
γθ (x, y) := (y − θ(x))2 , x ∈ [0, 1], y ∈ R.
We assume that θ0 ∈ Θ where, for a given m ∈ N, Θ is the “Sobolev” space
Θ := { θ : [0, 1] → R, ∫₀¹ |θ^(m)(x)|² dx ≤ 1 }.
Here θ^(m) denotes the m-th derivative of the function θ. We endow the space Θ with the sup-norm
‖θ‖ := ‖θ‖∞ = sup_{x∈[0,1]} |θ(x)|.
Then one can show (we omit the details here) that for any ε > 0,
Θ_ε := { θ : [0, 1] → R, ‖θ − θ0‖ ≤ ε, ∫₀¹ |θ^(m)(x)|² dx ≤ 1 }
is compact. It follows that under the assumption that θ0 is unique (which is
true for example if the distribution of X is absolutely continuous with a density
that stays away from zero), the (non-parametric) least squares estimator θ̂n is
consistent in sup-norm.
7.3 Exercises
Chapter 8
Uniform central limit theorems
After having studied uniform laws of large numbers, a natural question is:
can we also prove uniform central limit theorems? It turns out that precisely
defining what a uniform central limit theorem is, is quite involved, and actually
beyond our scope. In this chapter we will therefore only briefly indicate the
results, and not present any proofs. These sections reveal only a glimpse of
the topic of weak convergence on abstract spaces. The thing to remember from
them is the concept of asymptotic continuity, because we will use that concept in
our statistical applications.
Let X = R.
Theorem 8.1.1 (Central limit theorem in R) Suppose EX = µ and var(X) = σ² exist. Then
IP( √n (X̄n − µ)/σ ≤ z ) → Φ(z), for all z,
i.e.
√n aᵀ(X̄n − µ) →L N(0, aᵀΣa), for all a ∈ R^d. □
Let X = R. Recall the definition of the distribution function F and the empir-
ical distribution function F̂n :
F(t) = P(X ≤ t), t ∈ R,
F̂n(t) = (1/n) #{Xi ≤ t, 1 ≤ i ≤ n}, t ∈ R.
Define
Wn(t) := √n (F̂n(t) − F(t)), t ∈ R.
Also, by the central limit theorem in R² (Section 8.2), for all s < t,
(Wn(s), Wn(t))ᵀ →L N(0, Σ(s, t)),
where
Σ(s, t) = [ F(s)(1 − F(s))  F(s)(1 − F(t)) ; F(s)(1 − F(t))  F(t)(1 − F(t)) ].
We are now going to consider the stochastic process Wn = {Wn (t) : t ∈ R}.
The process Wn is called the (classical) empirical process.
Definition 8.3.1 Let K0 be the collection of bounded functions on [0, 1]. The stochastic process B(·) ∈ K0 is called the standard Brownian bridge if
- B(0) = B(1) = 0,
- for all r ≥ 1 and all t1, . . . , tr ∈ (0, 1), the vector (B(t1), . . . , B(tr))ᵀ is multivariate normal, with mean zero and covariances EB(s)B(t) = (s ∧ t) − st.
Thus, WF = B ◦ F .
Theorem 8.3.1 (Donsker’s theorem) Consider Wn and WF as elements of
the space K of bounded functions on R. We have
L
Wn → WF ,
that is,
Ef(Wn) → Ef(WF),
for all continuous and bounded functions f. □
P g := Eg(X),
Let us recall the central limit theorem for g fixed. Denote the variance of g(X)
by
σ 2 (g) := var(g(X)) = P g 2 − (P g)2 .
If σ²(g) < ∞, we have
νn(g) →L N(0, σ²(g)).
The central limit theorem also holds for finitely many g simultaneously. Let gk
and gl be two functions and denote the covariance between gk (X) and gl (X) by
Ef(νn) → Ef(ν).
kgk2 := P g 2 , g ∈ L2 (P ),
Chapter 9
Chaining and asymptotic continuity
9.1 Chaining
We consider in this section the symmetrized empirical process and work conditionally on X = (X1, . . . , Xn). We let IP_X be the conditional distribution given X. We describe the chaining technique in this context.
Define the empirical norm
‖g‖n := √(Pn g²)
and the empirical radius
Rn := sup_{g∈G} ‖g‖n.
For notational convenience, we index the functions in G by a parameter θ ∈ Θ: G = {gθ : θ ∈ Θ}. Let, for s = 0, 1, 2, . . ., {gjs}_{j=1}^{Ns} ⊂ G be a minimal 2^{−s}Rn-covering set of (G, ‖·‖n). So Ns = N2(2^{−s}Rn, G, Pn), and for each θ there exists a gθs ∈ {g1s, . . . , g_{Ns}^s} such that ‖gθ − gθs‖n ≤ 2^{−s}Rn. We use the parameter θ here to indicate which function in the covering set approximates a particular g. We may choose gθ0 ≡ 0, since ‖gθ‖n ≤ Rn. We let Hs := log Ns for all s.
Let S ∈ N be fixed later and let, for gj^{S+1} ∈ {g1^{S+1}, . . . , g_{N_{S+1}}^{S+1}},
gj,S^S := arg min{ ‖gj^{S+1} − gk^S‖n : gk^S ∈ {g1^S, . . . , g_{N_S}^S} }.
One can think of this as telescoping from gθ to gθ^{S+1}, i.e. we follow a path taking smaller and smaller steps. As S → ∞, we have max_{1≤i≤n} |gθ(Xi) − gθ^{S+1}(Xi)| → 0. The term Σ_{s=0}^S (gj,S^{s+1} − gj,S^s) can be handled by exploiting the fact that as θ varies, each summand involves only finitely many functions.
We define
Jn := Σ_{s=0}^S 2^{−s} Rn √(2H_{s+1}).
Then, with αs := 2^{−s}Rn( √(2H_{s+1}) + √(2(1 + s)(1 + t)) ),
Σ_{s=0}^S αs = Jn + Σ_{s=0}^S 2^{−s}Rn √(2(1 + s)(1 + t)) ≤ Jn + 4Rn √(1 + t).
Therefore
IP_X( max_j | Σ_{s=0}^S Pnσ(gj,S^{s+1} − gj,S^s) | ≥ Jn/√n + 4Rn √((1 + t)/n) )
≤ Σ_{s=0}^S IP_X( max_j |Pnσ(gj,S^{s+1} − gj,S^s)| ≥ αs/√n )
≤ Σ_{s=0}^S 2 exp[−(1 + s)(1 + t)]
≤ 2 exp[−t]. □
9.3 De-symmetrizing
We let
R := sup kgk
g∈G
be the diameter of G.
For any probability measure Q, we let H2 (·, G, Q) be the entropy of G endowed
with the metric induced by the L2 (Q)-norm k · kQ .
Condition 9.3.1 For all probability measures Q it holds that
H2(δ, G, Q) ≤ H(δ), δ > 0,
for some given (Q-free) function H. We then define
J(ρ) := 2 ∫₀^ρ √(2H(u)) du, ρ > 0.
Proof. Take S as the smallest value in N such that 2^{−S} ≤ 1/√n. Then ‖gθ − gθ^{S+1}‖n ≤ 2^{−(S+1)} Rn ≤ Rn/(2√n). On the set where Rn ≤ 2R and ‖G‖n ≤ 2‖G‖ we have
2 ∫_{2^{−(S+2)}Rn}^{Rn} √( 2H2(u, G, Pn) ) du ≤ ‖G‖ J(4R‖G‖).
Theorem 9.4.1 Assume Condition 9.3.1 and that G has envelope G with P(G²) < ∞. Then νn is asymptotically continuous.
Proof. Define for δ > 0 and g0 ∈ G
G(δ) := {g ∈ G : ‖g − g0‖ ≤ δ}.
By Theorem 9.3.1,
IP( ‖Pn − P‖G(δ) > 8‖G‖ J(8δ‖G‖)/√n + 32δ √((1 + t)/n) )
≤ 8 exp[−t] + 4 IP( sup_{g∈G(δ)∪{2G+g0}} |(Pn − P)(g − g0)²| > δ² ) ∀ t > 0.
□
9.6 Exercises
Let
J(Rn) := Σ_{s=0}^∞ 2^{−s} Rn √( 2H2(2^{−s}Rn, G, Pn) ),
where we assume that the sum converges. Show that
IP_X( sup_{g∈G} |Pnσ g| ≥ J(Rn)/√n + 4Rn √((1 + t)/n) ) ≤ exp[−t] ∀ t > 0.
Chapter 10
Asymptotic normality of
M-estimators
where
ℓ = (ℓ1, . . . , ℓr)ᵀ : X → R^r,
Definition 10.1.2 Let θ̂n,1 and θ̂n,2 be two asymptotically linear estimators of θ0, with asymptotic variances σ1² and σ2² respectively. Then
e1,2 := σ2²/σ1²
is called the asymptotic relative efficiency of θ̂n,1 with respect to θ̂n,2.
We start with 3 conditions a, b and c, which are easier to check but more
stringent. We later relax them to conditions aa, bb and cc.
ψθ(x) := (∂/∂θ) γθ(x), x ∈ X.
Condition b. We have as θ → θ0 ,
{ψθ : |θ − θ0| < ε}
Lemma 10.2.1 Suppose conditions a,b and c. Then θ̂n is asymptotically linear
with influence function
` = −V −1 ψθ0 ,
so
√n (θ̂n − θ0) →L N(0, V⁻¹ J V⁻¹),
where
J := P ψθ0 ψθ0ᵀ.
with
γ(x) = x² 1{|x| ≤ k} + (2k|x| − k²) 1{|x| > k}, x ∈ R.
Here, 0 < k < ∞ is some fixed constant, chosen by the statistician. We will
now verify Conditions a, b and c.
a)
+2k
if x − θ ≤ k
ψθ (x) = −2(x − θ) if |x − θ| ≤ k .
−2k if x − θ ≥ k
b) We have
(d/dθ) ∫ ψθ dP = 2( F(k + θ) − F(−k + θ) ),
where F(t) = P(X ≤ t), t ∈ R, is the distribution function. So
V = 2( F(k + θ0) − F(−k + θ0) ).
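The Huber location estimator can be computed by iteratively reweighted means; the IRLS scheme below is one standard way to solve the score equation (it is not taken from the notes), and k = 1.345 and the contaminated data are illustrative choices.

```python
import random

def huber_location(sample, k=1.345, iters=100):
    # Solve sum_i psi_theta(x_i) = 0 by iterative reweighting: inside [-k, k]
    # an observation gets weight 1, outside it gets weight k/|x - theta|.
    theta = sum(sample) / len(sample)
    for _ in range(iters):
        w = [
            1.0 if abs(x - theta) <= k else k / abs(x - theta) for x in sample
        ]
        theta = sum(wi * xi for wi, xi in zip(w, sample)) / sum(w)
    return theta

rng = random.Random(8)
clean = [rng.gauss(0.0, 1.0) for _ in range(200)]
contaminated = clean + [50.0] * 10     # gross outliers pull the mean away
print(huber_location(contaminated), sum(contaminated) / len(contaminated))
```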
The sample median can be regarded as the limiting case of a Huber estimator, with k ↓ 0. However, the loss function γθ(x) = |x − θ| is not differentiable, i.e., it does not satisfy condition a. For even sample sizes, we do nevertheless have the score equation F̂n(θ̂n) − 1/2 = 0. Let us investigate this more closely.
Let X ∈ R have distribution F , and let F̂n be the empirical distribution. The
population median θ0 is a solution of the equation
F(θ0) = 1/2.
We assume this solution exists and also that F has positive density f in a
neighbourhood of θ0 . Consider now for simplicity even sample sizes n and let
the sample median θ̂n be any solution of
F̂n (θ̂n ) = 0.
Then we get

In other words,

√n (θ̂n − θ0) = −Wn(θ0)/f(θ0) + oIP(1).
So the influence function is

ℓ(x) =  −1/(2 f(θ0))   if x ≤ θ0,
        +1/(2 f(θ0))   if x > θ0,

with asymptotic variance

σ² = 1/(4 f(θ0)²).
We are now in a position to compare median and mean. It is easily seen that the asymptotic relative efficiency of the mean as compared to the median is

e1,2 = 1/(4 σ0² f(θ0)²),

where σ0² = var(X). So e1,2 = π/2 for the normal distribution, and e1,2 = 1/2 for the double exponential (Laplace) distribution. The density of the double exponential distribution is

f(x) = (1/√(2σ0²)) exp[−√2 |x − θ0| / σ0], x ∈ R.
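The value e1,2 = π/2 for the normal distribution is easy to check by simulation (the sample size, replication count and seed below are my own choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 400, 5000
samples = rng.standard_normal((reps, n))          # theta0 = 0, sigma0 = 1

var_mean = n * np.mean(samples, axis=1).var()     # ~ sigma0^2 = 1
var_median = n * np.median(samples, axis=1).var() # ~ 1/(4 f(0)^2) = pi/2

are = var_median / var_mean                       # e_{1,2} with 1 = mean, 2 = median
print(are)  # close to pi/2 ~ 1.571
```

Replacing the normal samples by Laplace samples (`rng.laplace`) would, by the same computation, give a ratio close to 1/2.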
lim_{θ→θ0} [γθ − γθ0 − (θ − θ0)^T ψ0] / |θ − θ0| = 0.

Suppose that for some ε > 0, the class G := {gθ : 0 < |θ − θ0| < ε} is asymptotically continuous at 0.
Lemma 10.4.1 Suppose conditions aa, bb and cc are met. Then θ̂n has influence function

ℓ = −V^{−1} ψ0,

and so

√n (θ̂n − θ0) →_L N(0, V^{−1} J V^{−1}),

where J = P ψ0 ψ0^T.
Pn(γθ − γθ0)

= (Pn − P)(γθ − γθ0) + P(γθ − γθ0)

= (Pn − P)gθ |θ − θ0| + (θ − θ0)^T Pn ψ0 + P(γθ − γθ0)

= oIP(1/√n) |θ − θ0| + (θ − θ0)^T Pn ψ0 + P(γθ − γθ0)

= oIP(1/√n) |θ − θ0| + (θ − θ0)^T Pn ψ0 + (1/2)(θ − θ0)^T V (θ − θ0) + o(|θ − θ0|²)

= (1/2)|V^{1/2}(θ − θ0)|² + (o(|θ − θ0|) + OIP(1/√n)) |θ − θ0|.
Because Pn(γθ̂n − γθ0) ≤ 0, the previous display applied with θ = θ̂n gives |θ̂n − θ0| = OIP(1/√n). The previous display applied to the sequence θ̃n := θ0 − V^{−1} Pn ψ0 gives

Pn(γθ̃n − γθ0) = −(1/2)|V^{−1/2} Pn ψ0|² + oIP(1/n).

Because Pn(γθ̂n − γθ0) ≤ Pn(γθ̃n − γθ0) we get

(θ̂n − θ0)^T Pn ψ0 + (1/2)(θ̂n − θ0)^T V (θ̂n − θ0) ≤ −(1/2)|V^{−1/2} Pn ψ0|² + oIP(1/n),

or

(1/2)|V^{1/2}(θ̂n − θ0) + V^{−1/2} Pn ψ0|² = oIP(1/n).

Thus

V^{1/2}(θ̂n − θ0) = −V^{−1/2} Pn ψ0 + oIP(1/√n),

or

θ̂n − θ0 = −V^{−1} Pn ψ0 + oIP(1/√n). □
10.5 Exercises
Exercise 10.5.1 Suppose X has the logistic distribution with location param-
eter θ0 . Show that the maximum likelihood estimator has asymptotic variance
equal to 3, and the median has asymptotic variance equal to 4. Hence, the
asymptotic relative efficiency of the maximum likelihood estimator as compared
to the median is 4/3.
m(x) = β0^0 + β1^0 x1 + · · · + βr^0 xr.
Assume that given X = x, the random variable Y − m(x) has a density f not depending on x, with f positive in a neighbourhood of zero. Suppose moreover that

Σ = E ( 1    X^T
        X    X X^T )

exists and is invertible. Let

β̂n = arg min_{b ∈ R^{r+1}} (1/n) Σ_{i=1}^n |Yi − b0 − b1 Xi,1 − · · · − br Xi,r|.

Show that

√n (β̂n − β^0) →_L N( 0, (1/(4 f(0)²)) Σ^{−1} ).
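As an illustrative check of this limit (everything below, including the brute-force solver, is my own sketch and not part of the exercise), for r = 1 one can use the fact that some least absolute deviations line interpolates two data points, so searching over all pairs finds an exact minimizer:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 120
x = rng.uniform(-1.0, 1.0, n)
y = 1.0 + 2.0 * x + rng.standard_normal(n)   # beta^0 = (1, 2), normal errors

# For r = 1 an LAD minimizer can be taken to pass through two data points
# (a vertex of the underlying linear program), so brute force over pairs works.
best, b_hat = np.inf, None
for i in range(n):
    for j in range(i + 1, n):
        b1 = (y[j] - y[i]) / (x[j] - x[i])
        b0 = y[i] - b1 * x[i]
        loss = np.abs(y - b0 - b1 * x).mean()
        if loss < best:
            best, b_hat = loss, (b0, b1)
print(b_hat)  # near (1, 2)
```

The spread of b_hat around (1, 2) over repeated samples is governed, for large n, by the covariance Σ^{−1}/(4f(0)²) of the exercise.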
Probability inequalities for the least squares estimator (LSE) are obtained, under conditions on the entropy of the class of regression functions. In the examples, we study smooth regression functions, functions of bounded variation, concave functions, and image restoration. Results for the entropies of various classes of functions are taken from the literature on approximation theory.
The empirical inner product between error and regression function is written as

(ε, g)n = (1/n) Σ_{i=1}^n εi g(zi).
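In code, (ε, g)n is simply an average; the following sketch (the noise, g and sample size are arbitrary choices of mine) also illustrates that its typical size is of order ‖g‖n/√n:

```python
import numpy as np

def empirical_inner(eps, gvals):
    """(eps, g)_n = n^{-1} sum_i eps_i g(z_i)."""
    return (eps * gvals).mean()

rng = np.random.default_rng(3)
n = 10_000
z = np.linspace(0.0, 1.0, n)
gvals = np.sin(2 * np.pi * z)          # an arbitrary regression function g
eps = rng.standard_normal(n)           # centred noise

val = empirical_inner(eps, gvals)
norm_n = np.sqrt((gvals ** 2).mean())  # ||g||_n
print(val, norm_n / np.sqrt(n))        # val fluctuates on the scale ||g||_n/sqrt(n)
```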
G(δ) := {g ∈ G : ‖g − g0‖n ≤ δ}
□
The main idea to arrive at rates of convergence for ĝn is to invoke the basic inequality. The modulus of continuity of the process {(ε, g − g0)n : g ∈ G(δ)} can be derived from the entropy H2(·, G(δ), Qn) of G(δ), endowed with the metric induced by the norm ‖ · ‖n.
Condition 11.0.1 For all δ > 0, the entropy integral

J(δ) := 2 ∫_0^δ √(2 H2(u, G(δ), Qn)) du

converges.

it holds that

IP(‖ĝn − g0‖n > δn) ≤ (e/(e − 1)) exp[−t].
The function

j ↦ [J(2^j δn) + 4 · 2^j δn √(1 + t + j)] / (2^j δn)²

is the sum of two decreasing functions and hence is decreasing. So for all j ∈ N

[J(2^j δn) + 4 · 2^j δn √(1 + t + j)] / (2^j δn)² ≤ [J(δn) + 4 δn √(1 + t)] / δn² ≤ √n / 8,

so that

(1/2)(2^{j−1} δn)² = (1/8)(2^j δn)² ≥ J(2^j δn)/√n + 4 · 2^j δn √((1 + t + j)/n).
It follows that

IP(‖ĝn − g0‖n > δn)

≤ Σ_{j=1}^∞ IP( sup_{g ∈ G(2^j δn)} (ε, g − g0)n ≥ J(2^j δn)/√n + 4 · 2^j δn √((1 + t + j)/n) )

≤ Σ_{j=1}^∞ exp[−(t + j)] ≤ (e/(e − 1)) exp[−t]. □
11.3 Examples
It yields that

IP( ‖ĝn − g0‖n > 8 √(A0 r/n) + 4 √((1 + t)/n) ) ≤ (e/(e − 1)) exp[−t], ∀ t > 0.
(Note that we made extensive use here of the fact that it suffices to calculate the local entropy of G.)
Let ψk(x) = x^{k−1}, k = 1, . . . , m, ψ(x) = (ψ1(x), . . . , ψm(x))^T and Σn = ∫ ψ ψ^T dQn. Denote the smallest eigenvalue of Σn by λn, and assume that
Define for g ∈ G,

ḡ := ∫ g dQn.

max_{i=1,...,n} |g(xi)| ≤ ḡ + 1.
G = {g : [0, 1] → R, 0 ≤ ġ ≤ 1, ġ decreasing}.

Then G is a subset of

{g : [0, 1] → R, ∫_0^1 |dġ| ≤ 2}.
Birman and Solomjak (1967) prove that for all m ∈ {2, 3, . . .},

H∞( δ, {g : [0, 1] → [0, 1] : ∫_0^1 |g^{(m)}(x)| dx ≤ 1} ) ≤ A δ^{−1/m}, for all δ > 0.
Again, our class G is not uniformly bounded, but we can write for g ∈ G,

g = g1 + g2,

with g1(x) := θ1 + θ2 x and ‖g2‖∞ ≤ 2. Assume now that (1/n) Σ_{i=1}^n (xi − x̄)² stays away from 0. Then, we obtain for a constant A0

IP( ‖ĝn − g0‖n > 16 A0^{2/5} n^{−2/5} + 4 √((1 + t)/n) ) ≤ (e/(e − 1)) exp[−t], ∀ t > 0.
G = conv(K), K := {lD : D ∈ D}.

Assume that

N2(δ, K, Qn) ≤ c δ^{−w}, for all δ > 0.

Then from Ball and Pajor (1990),

H2(δ, G, Qn) ≤ A δ^{−2w/(2+w)}, for all δ > 0.
≤ (e/(e − 1)) exp[−t], ∀ t > 0.
We observe

Ykl = g0(xkl) + εkl,

with xkl = (uk, vl), uk = k/m, vl = l/m, k, l ∈ {1, . . . , m}. The total number of pixels is thus n = m².
Suppose that

D0 ∈ D = {all convex subsets of [0, 1]²},

and write

G := {lD : D ∈ D}.

Dudley (1984) shows that for all δ > 0 sufficiently small

H2(δ, G, Qn) ≤ A δ^{−1/2},
11.4 Exercises
We revisit the regression problem of the previous chapter. One has observations {Yi}_{i=1}^n and fixed co-variables x1, . . . , xn, where the response variables satisfy the regression

Yi = g0(xi) + εi, i = 1, . . . , n,

where ε1, . . . , εn are independent and centred noise variables, and where g0 is an unknown function on X. The errors are assumed to be N(0, σ²)-distributed.
Let Ḡ be a collection of regression functions. The regularized least squares estimator is

ĝn = arg min_{g ∈ Ḡ} { (1/n) Σ_{i=1}^n |Yi − g(xi)|² + pen(g) }.
When this aim is indeed reached, we loosely say that ĝn satisfies an oracle
inequality. In fact, what (*) says is that ĝn behaves as the noiseless version g∗ .
That means so to speak that we “overruled” the variance of the noise.
In Section 12.1, we recall the definitions of estimation and approximation error.
Section 12.2 calculates the estimation error when one employs least squares
estimation, without penalty, over a finite model class. The estimation error
turns out to behave as the log-cardinality of the model class. Section 12.3
shows that when considering a collection of nested finite models, a penalty
pen(g) proportional to the log-cardinality of the smallest class containing g will
indeed mimic the oracle over this collection of models. In Section 12.4, we
consider general penalties. It turns out that the (local) entropy of the model
Let G be a model class. Consider first the least squares estimator without penalty

ĝn(·, G) = arg min_{g ∈ G} (1/n) Σ_{i=1}^n |Yi − g(xi)|².
be the best approximation of g0 within the model G. Then ‖g∗(·, G) − g0‖n² is the (squared) approximation error if the model G is used. We define

which trades off approximation error ‖g∗(·, G) − g0‖n² against complexity pen(G). As we will see, taking pen(G) proportional to (an estimate of) the estimation error of ĝn(·, G) will (up to constants and possibly (log n)-factors) balance estimation error and approximation error.
The result of Lemma 12.2.1 below implies that the estimation error is pro-
portional to log |G|/n, i.e., it is logarithmic in the number of elements in the
parameter space. We present the result in terms of a probability inequality.
An inequality for e.g., the average excess risk follows from this (see Exercise
12.6.1).
Lemma 12.2.1 We have for all t > 0 and 0 < δ < 1,

IP( ‖ĝn − g0‖n² ≥ (1/(1 − δ)) [ (1 + δ) ‖g∗ − g0‖n² + 4(log |G| + t)/(nδ) ] ) ≤ exp[−t].
If (ε, ĝn − g∗)n ≤ √(2(log |G| + t)/n) ‖ĝn − g∗‖n, we have, using 2√(ab) ≤ a + b for all non-negative a and b,

‖ĝn − g0‖n² ≤ 2 √(2(log |G| + t)/n) ‖ĝn − g∗‖n + ‖g∗ − g0‖n²

≤ 2 √(2(log |G| + t)/n) ( ‖ĝn − g0‖n + ‖g∗ − g0‖n ) + ‖g∗ − g0‖n²

≤ δ ‖ĝn − g0‖n² + 4(log |G| + t)/(nδ) + (1 + δ) ‖g∗ − g0‖n². □
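To see the log |G|/n scaling of Lemma 12.2.1 in a toy simulation (the finite model class, the distances and all constants below are my own choices), place candidate functions at increasing ‖ · ‖n-distances from g0 and average the excess risk of the empirical risk minimizer over repetitions:

```python
import numpy as np

rng = np.random.default_rng(4)
n, K, reps = 200, 64, 500
g0 = np.zeros(n)                                    # true regression function

# Candidate functions g_k = t_k * u_k at ||.||_n-distance t_k from g0.
U = rng.standard_normal((K, n))
U /= np.sqrt((U ** 2).mean(axis=1, keepdims=True))  # normalize: ||u_k||_n = 1
t = np.linspace(0.0, 1.0, K)
G = t[:, None] * U                                  # t_0 = 0, so g0 itself is in G

excess = 0.0
for _ in range(reps):
    y = g0 + rng.standard_normal(n)                 # N(0, 1) noise
    risks = ((y - G) ** 2).mean(axis=1)             # empirical risk of each g in G
    k = int(np.argmin(risks))
    excess += ((G[k] - g0) ** 2).mean()             # squared ||g_hat - g0||_n
excess /= reps
print(excess, np.log(K) / n)                        # both small, comparable order
```

In this toy setup the averaged excess risk stays within a modest constant multiple of log |G|/n, as the lemma suggests.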
But if

(ε, ĝn − g∗)n ≤ √(2(4 log |G(ĝn)| + t)/n) ‖ĝn − g∗‖n,

the basic inequality gives

‖ĝn − g0‖n² ≤ 2 √(2(4 log |G(ĝn)| + t)/n) ‖ĝn − g∗‖n + ‖g∗ − g0‖n² + pen(g∗) − pen(ĝn)

≤ 2 √(2(4 log |G(ĝn)| + t)/n) ( ‖ĝn − g0‖n + ‖g∗ − g0‖n ) + ‖g∗ − g0‖n² + pen(g∗) − pen(ĝn)

≤ δ ‖ĝn − g0‖n² + 4(4 log |G(ĝn)| + t)/(nδ) − pen(ĝn) + (1 + δ) ‖g∗ − g0‖n² + pen(g∗)

= δ ‖ĝn − g0‖n² + (1 + δ) ‖g∗ − g0‖n² + pen(g∗) + 4t/(nδ),

by the definition of pen(g). □
In the general case with possibly infinite model classes G, we may replace the
log-cardinality of a class by its entropy.
We moreover define
and
G(δ) = {g ∈ Ḡ : τ 2 (g) ≤ δ 2 }, δ > 0.
Consider the entropy H(·, G(δ), Qn ) of G(δ). Suppose it is finite for each δ, and
in fact that the square root of the entropy is integrable:
Condition 12.4.1 One has

J(δ) := 2 ∫_0^{2δ} √(2 H(u, G(δ), Qn)) du < ∞, ∀ δ > 0. (∗∗)
This means that near u = 0, the entropy H(u, G(δ), Qn ) is not allowed to grow
faster than 1/u2 .
Theorem 12.4.1 Assume Condition 12.4.1. Suppose that J(δ)/δ² is a decreasing function of δ. Then for all t > 0 and

(•) δn² ≥ 4 τ²(g∗) + 8 J(δn)/√n + 8 δn √((1 + t)/n),

we have

IP(τ(ĝn) > δn) ≤ (e/(e − 1)) exp[−t].
We can apply the same arguments as in the proof of Theorem 11.2.1: since

δn² ≥ 4 τ²(g∗) + 8 J(δn)/√n + 8 δn √((1 + t)/n),

so that

(2^{j−1} δn)² − τ²(g∗) ≥ 2 J(2^j δn)/√n + 8 · 2^j δn √((1 + t + j)/n),

and hence

Σ_{j=1}^∞ IP( sup_{g ∈ G(2^j δn)} 2(ε, g − g∗)n ≥ (2^{j−1} δn)² − τ²(g∗) )

≤ Σ_{j=1}^∞ IP( sup_{g ∈ G(2^j δn)} 2(ε, g − g∗)n ≥ 2 J(2^j δn)/√n + 8 · 2^j δn √((1 + t + j)/n) )

≤ Σ_{j=1}^∞ exp[−(t + j)] ≤ (e/(e − 1)) exp[−t]. □
Suppose X = [0, 1]. Let Ḡ be the class of functions on [0, 1] which have derivatives of all orders. The m-th derivative of a function g ∈ Ḡ on [0, 1] is denoted by g^{(m)}. Define for a given 1 ≤ p < ∞, and given smoothness m ∈ {1, 2, . . .},

I^p(g) = ∫_0^1 |g^{(m)}(x)|^p dx, g ∈ Ḡ.
Corollary 12.5.1 By applying Lemma 12.5.1, we find that for some constant c1,

‖ĝn − g0‖n² + λ² I^p(ĝn) ≤ 4 min_g { ‖g − g0‖n² + λ² I^p(g) } + OIP( (n λ^{2/(pm)})^{−2pm/(2pm+p−2)} + (log(1/λ) ∨ 1)/n ).
For choosing the smoothing parameter λ, the above suggests the penalty

pen(g) = min_λ { λ² I^p(g) + C0 (n λ^{2/(pm)})^{−2pm/(2pm+p−2)} },
where C0′ depends on C0 and m. From the computational point of view (in particular, when p = 2), it may be convenient to carry out the penalized least squares as in the previous subsection, for all values of λ, yielding the estimators

ĝn(·, λ) = arg min_g { (1/n) Σ_{i=1}^n |Yi − g(xi)|² + λ² I^p(g) }.
Then the estimator with the penalty of this subsection is ĝn(·, λ̂n), where

λ̂n = arg min_{λ>0} { (1/n) Σ_{i=1}^n |Yi − ĝn(xi, λ)|² + C0 (n λ^{2/(pm)})^{−2pm/(2pm+p−2)} }.
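For p = 2 and m = 1 the complexity term, as I read the formula above, specializes to C0/(nλ), and the two-step procedure can be sketched numerically (the discretization of I²(g), the λ-grid, the data and C0 = 1 are all arbitrary illustrative choices, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
x = np.linspace(0.0, 1.0, n)
g0 = np.sin(2 * np.pi * x)
y = g0 + 0.5 * rng.standard_normal(n)

# First differences: (1/(n-1)) * ||D g||^2 discretizes I^2(g) = int |g'|^2.
D = (np.eye(n - 1, n, 1) - np.eye(n - 1, n)) * (n - 1)

def fit(lam):
    """Penalized LSE: argmin_g (1/n)||y - g||^2 + lam^2 * (1/(n-1))||D g||^2."""
    A = np.eye(n) / n + lam ** 2 * (D.T @ D) / (n - 1)
    return np.linalg.solve(A, y / n)

# Select lambda by penalizing the residual sum of squares with C0/(n*lambda),
# the complexity term for p = 2, m = 1; C0 = 1.0 is an arbitrary choice.
C0 = 1.0
lams = np.logspace(-3, 0, 30)
crit = [np.mean((y - fit(l)) ** 2) + C0 / (n * l) for l in lams]
lam_hat = lams[int(np.argmin(crit))]
g_hat = fit(lam_hat)
print(lam_hat)
```

Since the criterion only needs the fitted residuals of ĝn(·, λ) on a grid of λ-values, the whole procedure costs one linear solve per grid point here.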
Thus, the estimator adapts to small values of I(g0). For example, when m = 1 and I(g0) = 0 (i.e., when g0 is the constant function), the excess risk of the estimator converges with parametric rate 1/n. If we knew that g0 is constant, we would of course use Σ_{i=1}^n Yi/n as estimator. Thus, this penalized estimator mimics an oracle.
12.6 Exercises
Using the identity EZ = ∫_0^∞ IP(Z > t) dt for a non-negative random variable Z, derive bounds for the average excess risk E‖ĝn − g0‖n² of the estimator considered in this chapter.