
Empirical Process Theory

Sara van de Geer

January 2020
Contents

1 Introduction

2 Glivenko Cantelli classes
2.1 Law of large numbers for real-valued random variables
2.2 Law of large numbers for R^d-valued random variables
2.3 Definition Glivenko Cantelli classes of sets
2.4 Convergence of averages to their expectations
2.5 ULLN and maximum likelihood

3 (Exponential) probability inequalities
3.1 Chebyshev's inequality
3.2 Hoeffding's inequality
3.3 Bernstein's inequality
3.4 Exercises

4 ULLNs based on entropy with bracketing
4.1 The classical Glivenko Cantelli Theorem
4.2 Entropy
4.3 Entropy with bracketing
4.4 The envelope function
4.5 Entropy with bracketing and compactness
4.6 Maximum likelihood for mixture models
4.7 Exercises

5 Symmetrization
5.1 Intermezzo: some facts about (conditional) expectations
5.1.1 Suprema in-/outside the expectation
5.1.2 Iterated expectations
5.2 Symmetrization with means
5.3 Symmetrization with probabilities
5.4 Exercises

6 ULLNs based on symmetrization
6.1 Classes of functions
6.2 Classes of sets
6.3 Vapnik Chervonenkis classes
6.4 VC graph classes of functions
6.5 Exercises

7 M-estimators
7.1 What is an M-estimator?
7.2 Consistency
7.3 Exercises

8 Uniform central limit theorems
8.1 Real-valued random variables
8.2 R^r-valued random variables
8.3 Donsker's theorem
8.4 Donsker classes

9 Chaining and asymptotic continuity
9.1 Chaining
9.2 Increments of the symmetrized process
9.3 De-symmetrizing
9.4 Asymptotic continuity of the empirical process
9.5 Application to VC graph classes
9.6 Exercises

10 Asymptotic normality of M-estimators
10.1 Asymptotic linearity
10.2 Conditions a, b and c for asymptotic normality
10.3 Asymptotics for the median
10.4 Conditions aa, bb and cc for asymptotic normality
10.5 Exercises

11 Rates of convergence for LSEs
11.1 Gaussian errors
11.2 Rates of convergence
11.3 Examples
11.4 Exercises

12 Regularized least squares
12.1 Estimation and approximation error
12.2 Finite models
12.3 Nested finite models
12.4 General penalties
12.5 Application to the "classical" penalty
12.5.1 Fixed smoothing parameter
12.5.2 Overruling the variance in this case
12.6 Exercises
Chapter 1

Introduction

This introduction motivates why, from a statistician's point of view, it is interesting to study empirical processes. We indicate that any estimator is some function of the empirical measure. In these lectures, we study convergence of the empirical measure, as sample size increases.

In the simplest case, a data set consists of observations on a single variable,


say real-valued observations. Suppose there are n such observations, denoted
by X1 , . . . , Xn . For example, Xi could be the reaction time of individual i to
a given stimulus, or the number of car accidents on day i, etc. Suppose now
that each observation follows the same probability law P . This means that the
observations are relevant if one wants to predict the value of a new observation
X say (the reaction time of a hypothetical new subject, or the number of car
accidents on a future day, etc.). Thus, a common underlying distribution P
allows one to generalize the outcomes.

An estimator is any given function Tn (X1 , . . . , Xn ) of the data. Let us review


some common estimators.

The empirical distribution. The unknown P can be estimated from the data
in the following way. Suppose first that we are interested in the probability that
an observation falls in A, where A is a certain set chosen by the researcher. We
denote this probability by P (A). Now, from the frequentist point of view, the
probability of an event is nothing else than the limit of relative frequencies of
occurrences of that event as the number of occasions of possible occurrences n
grows without limit. So it is natural to estimate P (A) with the frequency of A,
i.e, with

Pn(A) = (number of times an observation Xi falls in A) / (total number of observations) = #{Xi ∈ A} / n.


We now define the empirical measure Pn as the probability law that assigns to
a set A the probability Pn (A). We regard Pn as an estimator of the unknown
P.

The empirical distribution function. The distribution function of X is defined as
F(x) = P(X ≤ x),
and the empirical distribution function is
F̂n(x) = #{Xi ≤ x} / n.
[Figure 1]
Figure 1 plots the distribution function F(x) = 1 − 1/x², x ≥ 1 (smooth curve) and the empirical distribution function F̂n (stair function) of a sample from F with sample size n = 200.
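The empirical distribution function is straightforward to compute. The following sketch (not part of the original notes) evaluates F̂n for a Pareto sample with F(x) = 1 − 1/x², as in Figure 1; it assumes NumPy is available, and the seed and evaluation points are illustrative choices.

import numpy as np

# Empirical distribution function F_hat_n(x) = #{X_i <= x}/n for a Pareto(2) sample.
rng = np.random.default_rng(0)
n = 200
# Inverse-transform sampling: if U ~ Uniform(0,1), then X = 1/sqrt(1-U) has F(x) = 1 - 1/x^2, x >= 1.
X = 1.0 / np.sqrt(1.0 - rng.uniform(size=n))

def F_hat(x, sample):
    """Fraction of observations less than or equal to x."""
    return np.mean(sample <= x)

for x in (1.5, 2.0, 5.0):
    print(f"x={x}: F_hat={F_hat(x, X):.3f}, F={1.0 - 1.0 / x**2:.3f}")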

Means and averages. The theoretical mean

µ := E(X)

(E stands for Expectation), can be estimated by the sample average


X̄n := (X1 + · · · + Xn) / n.
More generally, let g be a real-valued function on R. Then
( g(X1) + · · · + g(Xn) ) / n
is an estimator of Eg(X).

Sample median. The median of X is the value m that satisfies F(m) = 1/2 (assuming there is a unique solution). Its empirical version is any value m̂n such that F̂n(m̂n) is equal to, or as close as possible to, 1/2. In the above example F(x) = 1 − 1/x², so the theoretical median is m = √2 = 1.4142. In the ordered sample, the 100th observation is equal to 1.4166 and the 101st observation is equal to 1.4191. A common choice for the sample median is the average of these two values. This gives m̂n = 1.4179.

Properties of estimators. Let Tn = Tn (X1 , . . . , Xn ) be an estimator of the


real-valued parameter θ. Then it is desirable that Tn is in some sense close to θ.
A minimum requirement is that the estimator approaches θ as the sample size
increases. This is called consistency. To be more precise, suppose the sample
X1 , . . . , Xn are the first n of an infinite sequence X1 , X2 , . . . of independent
copies of X. Then Tn is called strongly consistent if, with probability one,

Tn → θ as n → ∞.

Note that consistency of frequencies as estimators of probabilities, or means as


estimators of expectations, follows from the (strong) law of large numbers. In
general, an estimator Tn can be a complicated function of the data. In that
case, it is helpful to know that the convergence of means to their expectations
is uniform over a class. The latter is a major topic in empirical process theory.

Parametric models. The distribution P may be partly known beforehand.


The unknown parts of P are called parameters of the model. For example, if
the Xi are yes/no answers to a certain question (the binary case), we know that
P allows only two possibilities, say 1 and 0 (yes = 1, no = 0). There is only one parameter, say the probability of a yes answer, θ = P(X = 1). More generally,
in a parametric model, it is assumed that P is known up to a finite number
of parameters θ = (θ1 , · · · , θr ). We then often write P = Pθ . When there are
infinitely many parameters (which is for example the case when P is completely
unknown), the model is called nonparametric.

Nonparametric models.
Nonparametric models cannot be described by finitely many parameters.
Example: density estimation. An example of a nonparametric model is
where one assumes that the density f of the distribution function F on R
exists, but all one assumes about it is some kind of "smoothness" (e.g. that f has a continuous first derivative). In that case, one may propose e.g. to
use the histogram as estimator of f . This is an example of a nonparametric
estimator.
Histograms. Suppose our aim is estimating the density f (x) at a given point
x. The density is defined as the derivative of the distribution function F at x:

f(x) = lim_{h→0} ( F(x + h) − F(x) ) / h = lim_{h→0} P(x, x + h] / h.
Here, (x, x + h] is the interval with left endpoint x (not included) and right endpoint x + h (included). Unfortunately, replacing P by Pn here does not work, as for h small enough, Pn(x, x + h] will be equal to zero. Therefore,


instead of taking the limit as h → 0, we fix h at a (small) positive value, called
the bandwidth. The estimator of f(x) thus becomes
f̂n(x) = Pn(x, x + h] / h = #{Xi ∈ (x, x + h]} / (nh).
A plot of this estimator at points x ∈ {x0 , x0 + h, x0 + 2h, . . .} is called a
histogram.
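As an illustration (not from the notes), a histogram estimator can be coded directly from the definition f̂n(x) = Pn(x, x + h]/h. NumPy, the Pareto(2) sample and the grid below are assumptions made for the example.

import numpy as np

# Histogram estimator f_hat(x) = P_n(x, x+h]/h on the cells (x0 + j*h, x0 + (j+1)*h].
rng = np.random.default_rng(1)
X = 1.0 / np.sqrt(1.0 - rng.uniform(size=200))      # Pareto(2) sample; true density f(x) = 2/x^3, x >= 1

def histogram_estimator(sample, x0, h, n_cells):
    n = len(sample)
    edges = x0 + h * np.arange(n_cells + 1)
    counts = np.array([np.sum((sample > edges[j]) & (sample <= edges[j + 1]))
                       for j in range(n_cells)])
    return edges[:-1], counts / (n * h)             # left endpoints and f_hat per cell

left, f_hat = histogram_estimator(X, x0=1.0, h=0.5, n_cells=10)
for x, fh in zip(left, f_hat):
    print(f"({x:.1f}, {x + 0.5:.1f}]: f_hat = {fh:.3f},  f(left endpoint) = {2 / x**3:.3f}")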
Figure 2 shows the histogram, with bandwidth h = 0.5, for the sample of size
n = 200 from the Pareto distribution with parameter θ = 2. The solid line is
the density of this distribution.
[Figure 2]
The question arises: how should one choose the bandwidth h? One may want to apply a data-dependent choice. For example, introduce for f̂n = f̂_{n,h}, depending on h, the risk function
R(f̂_{n,h}) := ∫ ( f̂_{n,h}(t) − f(t) )² dt.

Note that
R(f̂_{n,h}) = ∫ f̂_{n,h}²(t) dt − 2 ∫ f̂_{n,h}(t) f(t) dt + ∫ f²(t) dt,
where the last term does not depend on h.

The double product term in the middle can be written as
2 ∫ f̂_{n,h}(t) dF(t).
It can be estimated by
2 ∫ f̂_{n,h}(t) dF̂n(t) = (2/n) Σ_{i=1}^n f̂_{n,h}(Xi).
But using the data twice, both for estimation and for assessing the performance of the estimator, is perhaps not a good idea, as it may lead to "overfitting". Therefore, we propose here to use a fresh sample, that is, {X'_i}_{i=1}^m, i.i.d. copies of X independent of {Xi}_{i=1}^n, and apply the estimated bandwidth
ĥ := arg min_{h>0} { ∫ f̂_{n,h}²(t) dt − (2/m) Σ_{i=1}^m f̂_{n,h}(X'_i) }.

Note that the total sample is now

(X_1, . . . , X_n, X'_1, . . . , X'_m),

with sample size n + m. In other words, for the selection of the bandwidth a
sample splitting technique is used. In order to further improve performance, a
common technique is “cross-validation” consisting of several sample splits (into
training and test sets).
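A minimal sketch (not from the notes) of this sample-splitting bandwidth choice: for each candidate h, the risk is estimated, up to the term not depending on h, by ∫ f̂²_{n,h} − (2/m) Σ_i f̂_{n,h}(X'_i). NumPy, the Pareto example, the candidate grid and the integration range are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(2)
X_train = 1.0 / np.sqrt(1.0 - rng.uniform(size=200))   # sample used to build f_hat_{n,h}
X_test = 1.0 / np.sqrt(1.0 - rng.uniform(size=200))    # independent sample used to select h

def f_hat(x, sample, h):
    """Histogram estimator with cells (1 + j*h, 1 + (j+1)*h]."""
    j = np.ceil((np.asarray(x, dtype=float) - 1.0) / h) - 1.0
    left = 1.0 + j * h
    counts = np.array([np.sum((sample > l) & (sample <= l + h)) for l in left])
    return counts / (len(sample) * h)

def estimated_risk(h, train, test, upper=50.0):
    mid = np.arange(1.0, upper, h) + h / 2.0            # one midpoint per cell
    int_f2 = np.sum(f_hat(mid, train, h) ** 2) * h      # integral of f_hat^2 (piecewise constant)
    cross = 2.0 * np.mean(f_hat(test, train, h))        # (2/m) * sum_i f_hat(X'_i)
    return int_f2 - cross

candidates = [0.1, 0.2, 0.5, 1.0, 2.0]
risks = [estimated_risk(h, X_train, X_test) for h in candidates]
print("estimated risks:", dict(zip(candidates, np.round(risks, 3))))
print("selected bandwidth h_hat:", candidates[int(np.argmin(risks))])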
Example: the classification problem. Let Y ∈ {0, 1} be a response variable
(a label) and X ∈ X a co-variable (or input). For all x ∈ X we define the
probability that the label is Y = 1 when the co-variable takes the value x:

η(x) := P (Y = 1|X = x).

Our aim is now to predict the label given some input, say x0 . Call the predicted
value y0 . Bayes rule says: predict the most likely label, that is predict

1
 if η(x0 ) > 1/2
y0 := undecided, say 1, if η(x0 ) = 1/2 .

0 if η(x0 ) < 1/2

Note that η(x) > 1/2 if and only if the log-odds ratio
log( η(x) / (1 − η(x)) )
is strictly positive. When X is a subset of r-dimensional Euclidean space R^r one may want to assume the parametric logistic regression model, where the log-odds ratio is a linear function of the (row) vector x:
log( η(x) / (1 − η(x)) ) = α^0 + x β^0,  x ∈ X,

with α^0 ∈ R an unknown intercept and β^0 ∈ R^r an unknown (column) vector of coefficients. Bayes rule is now l_{A^0}, where A^0 := {x : α^0 + xβ^0 ≥ 0} is a half-space. Maximum likelihood estimators α̂_MLE and β̂_MLE of the unknown parameters can be obtained from labeled data {(Xi, Yi)}_{i=1}^n, which are i.i.d. copies of (X, Y). This is called logistic regression. Then the estimated Bayes rule is l_{Â_MLE}, where Â_MLE := {x : α̂_MLE + x β̂_MLE ≥ 0}.
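As an illustration (not part of the notes), the logistic regression MLE can be computed by gradient ascent on the log-likelihood, and the estimated Bayes rule is then the indicator of {x : α̂ + xβ̂ ≥ 0}. The data-generating parameters, step size and iteration count below are assumptions made for the sketch.

import numpy as np

rng = np.random.default_rng(3)
n, r = 500, 2
X = rng.normal(size=(n, r))
alpha0, beta0 = -0.5, np.array([1.0, 2.0])              # "true" parameters (illustrative)
eta = 1.0 / (1.0 + np.exp(-(alpha0 + X @ beta0)))       # eta(x) = P(Y = 1 | X = x)
Y = (rng.uniform(size=n) < eta).astype(float)

Xa = np.hstack([np.ones((n, 1)), X])                    # prepend an intercept column
theta = np.zeros(r + 1)                                 # theta = (alpha, beta)
for _ in range(5000):                                   # gradient ascent on the log-likelihood
    p = 1.0 / (1.0 + np.exp(-(Xa @ theta)))
    theta += (1.0 / n) * Xa.T @ (Y - p)

alpha_hat, beta_hat = theta[0], theta[1:]
x_new = np.array([0.3, -0.2])
y_pred = int(alpha_hat + x_new @ beta_hat >= 0)         # estimated Bayes rule l_{A_hat_MLE}(x_new)
print("alpha_hat:", round(alpha_hat, 2), "beta_hat:", np.round(beta_hat, 2), "prediction:", y_pred)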
The risk R(A) of a predictor lA is the probability that it makes a mistake:

R(A) = P( Y ≠ l_A(X) ) = P( |Y − l_A(X)| = 1 ).



Then one may verify that Bayes rule is
A^0 = arg min_{A⊂X} R(A).

One may mimic this by replacing the unknown P by the empirical measure Pn. Then the empirical risk becomes
Rn(A) = (1/n) #{ (Xi, Yi) : Yi ≠ l_A(Xi) } = (1/n) Σ_{i=1}^n |Yi − l_A(Xi)|,

and the empirical risk minimizer is
Â_ERM := arg min_{A∈A} Rn(A),
where A is a given collection of subsets of X, for example all half-spaces, as is the case in logistic regression. In these notes, we will develop theory for deducing how close Â = Â_MLE or Â = Â_ERM is to A^0, for example in terms of the measure of the symmetric difference Â∆A^0.
Sub-example. Here we give an illustration of what can be achieved, although
the details will not be treated in these notes. Suppose X ∈ R, that A0 = [θ0 , ∞)
and that A is the collection of all half-intervals (for example X could be the
person’s pitch of voice, and Y the person’s gender). Let ÂERM =: [θ̂, ∞). Then,
using arguments from empirical process theory, one can show that under certain
conditions, θ̂ converges with rate n−1/3 and that

n^{1/3}( θ̂ − θ_0 )
converges in distribution to the location of the maximum of a two-sided Brownian motion minus a parabola.
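A small sketch (not from the notes) of empirical risk minimization over half-intervals A = [θ, ∞): since Rn only changes at the observed Xi, it suffices to minimize over those points. The data-generating mechanism with θ_0 = 0 is an illustrative assumption.

import numpy as np

rng = np.random.default_rng(4)
n = 500
X = rng.normal(size=n)
eta = 1.0 / (1.0 + np.exp(-3.0 * X))                     # eta(x) = P(Y=1 | X=x), crosses 1/2 at theta_0 = 0
Y = (rng.uniform(size=n) < eta).astype(int)

def empirical_risk(theta, X, Y):
    """R_n([theta, inf)) = fraction of i with Y_i != 1{X_i >= theta}."""
    return np.mean(Y != (X >= theta).astype(int))

candidates = np.sort(X)                                  # the empirical risk only jumps at data points
risks = np.array([empirical_risk(t, X, Y) for t in candidates])
theta_hat = candidates[np.argmin(risks)]
print("theta_hat =", round(float(theta_hat), 3), " empirical risk =", round(float(risks.min()), 3))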


Conclusion. An estimator Tn is some function of the data X1 , . . . , Xn . If it is
a symmetric function of the data (which is typically the case when the ordering
in the data contains no information1 ), we may write Tn = T (Pn ), where Pn is
the empirical distribution. Roughly speaking, the main purpose in theoretical
statistics is studying the difference between T (Pn ) and T (P ). We therefore are
interested in convergence of Pn to P in a broad enough sense. This is what
empirical process theory is about.

¹ When the sample is split into a training set and a test set, this symmetry is lost.
Chapter 2

Glivenko Cantelli classes

This chapter introduces the notation and (part of ) the problem setting.

Let X1 , . . . , Xn , . . . be i.i.d. copies of a random variable X with values in X and


with distribution P . The distribution of the sequence X1 , X2 , . . . (+ perhaps
some auxiliary variables) is denoted by IP.

Let {Tn , T } be a collection of real-valued random variables.


Definition 2.0.1 The sequence Tn converges in probability to T if for all ε > 0,
lim_{n→∞} IP( |Tn − T| > ε ) = 0.
Notation: Tn →_IP T, or Tn = T + o_IP(1).
Moreover, Tn converges almost surely (a.s.) to T if
IP( lim_{n→∞} Tn = T ) = 1.

Remark. Convergence almost surely implies convergence in probability.


Definition 2.0.2 We say that {Tn} remains bounded in probability if
lim_{M→∞} limsup_{n→∞} IP( |Tn| > M ) = 0.
Notation: Tn = O_IP(1).

2.1 Law of large numbers for real-valued random variables

Consider the case X = R. Suppose the mean

µ := EX


exists. Define the average
X̄n := (1/n) Σ_{i=1}^n Xi,  n ≥ 1.

Then, by the law of large numbers, as n → ∞,

X̄n → µ, a.s.

Now, let
F (t) := P (X ≤ t), t ∈ R,

be the theoretical distribution function, and

F̂n(t) := (1/n) #{Xi ≤ t, 1 ≤ i ≤ n},  t ∈ R,

be the empirical distribution function. Then by the law of large numbers, as


n → ∞,
F̂n (t) → F (t), a.s. for all t.

We will prove the Glivenko Cantelli Theorem, which says that

sup_t |F̂n(t) − F(t)| → 0,  a.s.

This is a uniform law of large numbers (ULLN).

Application: Kolmogorov’s goodness-of-fit test. We want to test

H0 : F = F0 .

Test statistic:
Tn := sup_t |F̂n(t) − F0(t)|.

Reject H0 for large values of Tn .
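The Kolmogorov statistic is easy to compute for a continuous F0 because the supremum is attained at the order statistics. A minimal sketch (not from the notes; assumes NumPy, and the uniform null hypothesis is an illustrative choice):

import numpy as np

def kolmogorov_statistic(sample, F0):
    """T_n = sup_t |F_hat_n(t) - F_0(t)| for continuous F_0, via the order statistics."""
    x = np.sort(sample)
    n = len(x)
    i = np.arange(1, n + 1)
    return max(np.max(np.abs(i / n - F0(x))), np.max(np.abs((i - 1) / n - F0(x))))

rng = np.random.default_rng(5)
X = rng.uniform(size=200)
T_n = kolmogorov_statistic(X, lambda t: t)   # H_0: F_0(t) = t on [0, 1]
print("T_n =", round(float(T_n), 4))         # reject H_0 for large values of T_n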

2.2 Law of large numbers for R^d-valued random variables

Questions:

(i) What is a natural extension of half-intervals in R to higher dimensions?

(ii) Does Glivenko Cantelli hold for this extension?



2.3 Definition Glivenko Cantelli classes of sets

Let, for any (measurable¹) A ⊂ X,
Pn(A) := (1/n) #{Xi ∈ A, 1 ≤ i ≤ n}.
We call Pn the empirical measure (based on X1 , . . . , Xn ).

Let D be a collection of subsets of X .


Definition 2.3.1 The collection D is called a Glivenko Cantelli (GC) class if
sup_{D∈D} |Pn(D) − P(D)| →_IP 0.

Example. Let X = R. The class of half-intervals
D = { (−∞, t] : t ∈ R }
is GC. But when e.g. P is the uniform distribution on [0, 1] (i.e., F(t) = t, 0 ≤ t ≤ 1), the class
B = { all (Borel) subsets of [0, 1] }
is not GC.

2.4 Convergence of averages to their expectations

Notation. For a function g : X → R, we write
P g := Eg(X)
and
Pn g := (1/n) Σ_{i=1}^n g(Xi).

Let G be a collection of real-valued functions on X .


Definition 2.4.1 The class G is called a Glivenko Cantelli (GC) class if
sup_{g∈G} |Pn g − P g| →_IP 0.
We will often use the notation
‖Pn − P‖_G := sup_{g∈G} |(Pn − P)g|.
¹ We will skip measurability issues, and most of the time do not mention explicitly the requirement of measurability of certain sets or functions. This means that everything has to be understood modulo measurability.

2.5 ULLN and maximum likelihood

Let {P_θ : θ ∈ Θ} be a collection of probability measures dominated by a σ-finite measure µ. Suppose that P = P_{θ_0} for some θ_0 ∈ Θ. Define
p_θ := dP_θ/dµ,  θ ∈ Θ.
The maximum likelihood estimator (if it exists) is
θ̂ := arg max_{θ∈Θ} Pn log p_θ.
We now let P := {p_θ : θ ∈ Θ} be the collection of densities in our model, write p_0 := p_{θ_0} for the true density and p̂ := p_{θ̂} for the estimated density. Thus
p̂ = arg max_{p∈P} Pn log p.

Definition 2.5.1 The Hellinger distance between densities p and p̃ is
h(p, p̃) := ( (1/2) ∫ (√p − √p̃)² dµ )^{1/2}.
We define
G := { (1/2) log( (p + p_0)/(2p_0) ) l{p_0 > 0} : p ∈ P }.
Theorem 2.5.1 Suppose that G is GC. Then
h(p̂, p_0) →_IP 0.

Proof. By the concavity of the log-function,
log( (p + p_0)/(2p_0) ) l{p_0 > 0} ≥ (1/2) log( p/p_0 ) l{p_0 > 0}.
Thus
0 ≤ Pn ( log( p̂/p_0 ) l{p_0 > 0} )
  ≤ 2 Pn ( log( (p̂ + p_0)/(2p_0) ) l{p_0 > 0} )
  = 2 (Pn − P) ( log( (p̂ + p_0)/(2p_0) ) l{p_0 > 0} ) + 2 P ( log( (p̂ + p_0)/(2p_0) ) l{p_0 > 0} ).
But, using (1/2) log y = log √y ≤ √y − 1,
(1/2) P ( log( (p̂ + p_0)/(2p_0) ) l{p_0 > 0} )
  ≤ P ( √( (p̂ + p_0)/(2p_0) ) l{p_0 > 0} ) − 1
  = ∫ √( (p̂ + p_0)/2 ) √p_0 dµ − 1
  = −h²( (p̂ + p_0)/2, p_0 ).

It follows that
h²( (p̂ + p_0)/2, p_0 ) ≤ ‖Pn − P‖_G.
Finally, writing √p̂ − √p_0 = (p̂ − p_0)/(√p̂ + √p_0) and p̂ − p_0 = 2( √((p̂ + p_0)/2) − √p_0 )( √((p̂ + p_0)/2) + √p_0 ),
h²(p̂, p_0) = (1/2) ∫ ( √p̂ − √p_0 )² dµ
           = (1/2) ∫ 4 [ ( √((p̂ + p_0)/2) + √p_0 ) / ( √p̂ + √p_0 ) ]² ( √((p̂ + p_0)/2) − √p_0 )² dµ
           ≤ 16 h²( (p̂ + p_0)/2, p_0 ),
since √((p̂ + p_0)/2) + √p_0 ≤ 2( √p̂ + √p_0 ).

So we conclude
h²(p̂, p_0) ≤ 16 ‖Pn − P‖_G.
By assumption, G is GC, i.e., ‖Pn − P‖_G →_IP 0. □
Chapter 3

(Exponential) probability
inequalities

A statistician is almost never sure about something, but often says that some-
thing holds “with large probability”. We study probability inequalities for devia-
tions of means from their expectations. These are exponential inequalities, that
is, the probability that the deviation is large is exponentially small. (We will in
fact see that the inequalities are similar to those obtained if we assume Gaus-
sian distributions.) Exponentially small probabilities are useful indeed when one
wants to prove that with large probability a whole collection of events holds si-
multaneously. It then suffices to show that adding up the small probabilities that
one such an event does not hold, still gives something small.

3.1 Chebyshev’s inequality

Theorem 3.1.1 (Chebyshev's inequality) Consider a random variable X ∈ R with distribution P, and an increasing function φ : R → [0, ∞). Then for all a with φ(a) > 0, we have
P(X ≥ a) ≤ Eφ(X) / φ(a).

Proof.
Eφ(X) = ∫ φ(x) dP(x) = ∫_{x≥a} φ(x) dP(x) + ∫_{x<a} φ(x) dP(x)
      ≥ ∫_{x≥a} φ(x) dP(x) ≥ ∫_{x≥a} φ(a) dP(x) = φ(a) P(X ≥ a). □

Example. Let X ∈ R, with mean µ := EX and finite variance σ² := var(X) = E(X − µ)². Let X_1, . . . , X_n, . . . be i.i.d. copies of X and X̄n := Σ_{i=1}^n Xi / n be the sample average. Then by Chebyshev's inequality
IP( |X̄n − µ| ≥ a ) ≤ E(X̄n − µ)² / a² = σ²/(n a²) → 0  ∀ a > 0.
Thus
X̄n →_IP µ
(the (weak) law of large numbers¹). In a reformulation, we put a = σ √(t/n):
IP( |X̄n − µ| ≥ σ √(t/n) ) ≤ 1/t,  ∀ t > 0.
In other words (see Definition 2.0.2),
X̄n = µ + O_IP(1/√n).
We say that X̄n converges in probability to µ with rate 1/√n.
Example. Let {g_1, . . . , g_N} be a collection of N real-valued functions with domain X, such that for some fixed constant σ², var(g_j(X)) ≤ σ² for all j ∈ {1, . . . , N}. This class of functions is allowed to depend on n, and N is allowed to grow with n. Then by the union bound²
IP( max_{1≤j≤N} |(Pn − P) g_j| ≥ a ) ≤ Σ_{j=1}^N IP( |(Pn − P) g_j| ≥ a ) ≤ N σ²/(n a²)  ∀ a > 0.
We may reformulate this to
IP( max_{1≤j≤N} |(Pn − P) g_j| ≥ σ √(N t/n) ) ≤ 1/t  ∀ t > 0.
Thus, with σ kept fixed, we conclude that
max_{1≤j≤N} |(Pn − P) g_j| = O_IP( √(N/n) ).
Our aim is now to improve here the dependence on N.

Example. Let X be N(0, 1)-distributed. By Exercise 3.4.1,
P(X ≥ a) ≤ exp[−a²/2]  ∀ a > 0.

¹ This implies X̄n → µ almost surely by a martingale argument (skipped here).
² The union bound says that IP(A ∪ B) ≤ IP(A) + IP(B) for any two sets A and B.

Corollary 3.1.1 Let X_1, . . . , X_n be independent real-valued random variables, and suppose, for all i, that X_i is N(0, σ_i²)-distributed. Define
b² = Σ_{i=1}^n σ_i².
Then
IP( Σ_{i=1}^n X_i ≥ a ) ≤ exp[ −a²/(2b²) ]  ∀ a > 0,
or, reformulated,
IP( Σ_{i=1}^n X_i ≥ b √(2t) ) ≤ exp[−t]  ∀ t > 0.

Corollary 3.1.2 Let Z_1, . . . , Z_N be possibly dependent real-valued random variables, and let Z_j have the N(0, σ_j²)-distribution, j = 1, . . . , N. Define σ² := max_{1≤j≤N} σ_j². Then
IP( max_{1≤j≤N} |Z_j| ≥ σ √(2(t + log N)) ) ≤ 2N exp[−(t + log N)] = 2 exp[−t]  ∀ t > 0.

3.2 Hoeffding’s inequality

Let X1 , . . . , Xn be independent real-valued random variables.


Definition 3.2.1 (Hoeffding's condition) Suppose that for all i, E Xi = 0, and for certain constants ci > 0,
|Xi| ≤ ci.

Lemma 3.2.1 Assume Hoeffding's condition. Let b² := Σ_{i=1}^n ci². Then for all λ > 0,
E exp[ λ Σ_{i=1}^n Xi ] ≤ exp[ λ² b²/2 ].

Proof. Let λ > 0. By the convexity of the exponential function exp[λx], we


know that for any 0 ≤ α ≤ 1,

exp[αλx + (1 − α)λy] ≤ α exp[λx] + (1 − α) exp[λy].

Define now
α_i := (c_i − X_i)/(2c_i).
Then
Xi = αi (−ci ) + (1 − αi )ci ,
so
exp[λXi ] ≤ αi exp[−λci ] + (1 − αi ) exp[λci ].

But then, since E α_i = 1/2, we find
E exp[λ X_i] ≤ (1/2) exp[−λ c_i] + (1/2) exp[λ c_i].
Now, for all x,
exp[−x] + exp[x] = 2 Σ_{k=0}^∞ x^{2k}/(2k)!,
whereas
exp[x²/2] = Σ_{k=0}^∞ x^{2k}/(2^k k!).
Since (2k)! ≥ 2^k k!, we thus know that
exp[−x] + exp[x] ≤ 2 exp[x²/2],
and hence
E exp[λ X_i] ≤ exp[λ² c_i²/2].
Therefore,
E exp[ λ Σ_{i=1}^n X_i ] ≤ exp[ λ² Σ_{i=1}^n c_i² / 2 ]. □
Theorem 3.2.1 (Hoeffding's inequality) Assume Hoeffding's condition. Let b² := Σ_{i=1}^n c_i². Then
IP( Σ_{i=1}^n X_i ≥ a ) ≤ exp[ −a²/(2 Σ_{i=1}^n c_i²) ]  ∀ a > 0,
or, reformulated,
IP( Σ_{i=1}^n X_i ≥ b √(2t) ) ≤ exp[−t]  ∀ t > 0.

Proof. It follows from Chebyshev's inequality that
IP( Σ_{i=1}^n X_i ≥ a ) ≤ exp[ λ² b²/2 − λ a ].
Take λ = a/b² to obtain the first part. Take a = b √(2t) to arrive at the reformulation. □

Corollary 3.2.1 Consider N functions g_j : X → R, j = 1, . . . , N, with, for some constant K,
∀ j ∈ {1, . . . , N}:  E g_j(X) = 0  and  sup_{x∈X} |g_j(x)| ≤ K.
Then for all j,
IP( |Pn g_j| ≥ K √(2t/n) ) ≤ 2 exp[−t],  ∀ t > 0.
Hence by the union bound,
IP( max_{1≤j≤N} |Pn g_j| ≥ K √(2(t + log N)/n) ) ≤ 2 exp[−t],  ∀ t > 0.

One can also derive an inequality for the expectation of a maximum (instead of a probability inequality).

Lemma 3.2.2 Consider N functions g_j : X → R, j = 1, . . . , N, with, for some constant K,
∀ j ∈ {1, . . . , N}:  E g_j(X) = 0  and  sup_{x∈X} |g_j(x)| ≤ K.
Then
E ( max_{1≤j≤N} |Pn g_j| ) ≤ K √( 2 log(2N)/n ).

Proof. Since e^{|x|} ≤ e^x + e^{−x}, Lemma 3.2.1 gives
E exp[ λ |n Pn g_j| ] ≤ E exp[ λ n Pn g_j ] + E exp[ −λ n Pn g_j ] ≤ 2 exp[ λ² n K²/2 ].
Hence
E ( max_{1≤j≤N} n |Pn g_j| )
  = (1/λ) E log exp[ λ max_{1≤j≤N} n |Pn g_j| ]
  ≤ (1/λ) log E exp[ λ max_{1≤j≤N} n |Pn g_j| ]          (by Jensen's inequality)
  = (1/λ) log E ( max_{1≤j≤N} exp[ λ n |Pn g_j| ] )
  ≤ (1/λ) log Σ_{j=1}^N E exp[ λ n |Pn g_j| ]
  ≤ (1/λ) log ( 2N exp[ λ² n K²/2 ] )
  = log(2N)/λ + λ n K²/2.
We minimize the last expression over λ: setting the derivative to zero,
−log(2N)/λ² + n K²/2 = 0,
gives λ = √(2 log(2N)) / (K √n), and with this value of λ,
log(2N)/λ + λ n K²/2 = 2 K √( n log(2N)/2 ) = K √( 2 n log(2N) ).
Dividing by n yields the bound of the lemma. □
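A Monte Carlo sketch of Lemma 3.2.2 (not from the notes): for bounded, centred functions g_j we compare an estimate of E max_j |Pn g_j| with the bound K√(2 log(2N)/n). The choice g_j(x) = cos(jx) with X uniform on (0, 2π), which gives E g_j(X) = 0 and K = 1, is an illustrative assumption.

import numpy as np

rng = np.random.default_rng(6)
n, N, K, reps = 500, 50, 1.0, 200
js = np.arange(1, N + 1)

max_devs = []
for _ in range(reps):
    X = rng.uniform(0.0, 2.0 * np.pi, size=n)
    Pn_g = np.mean(np.cos(np.outer(X, js)), axis=0)   # P_n g_j with g_j(x) = cos(j x); E g_j(X) = 0
    max_devs.append(np.max(np.abs(Pn_g)))

print("estimated E max_j |P_n g_j| :", round(float(np.mean(max_devs)), 4))
print("bound K*sqrt(2 log(2N)/n)   :", round(K * np.sqrt(2.0 * np.log(2 * N) / n), 4))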

3.3 Bernstein’s inequality

Bernstein’s inequality is a close relative of Hoeffding’s inequality and an impor-


tant tool in empirical process theory. We present it here. However, unfortu-
nately we have no space in these lecture notes to show its power.
Let X1 , . . . , Xn be independent real-valued random variables.
Definition 3.3.1 (Bernstein's condition) For all i,
E X_i = 0,
E |X_i|^m ≤ (m!/2) K^{m−2} σ_i²,  m = 2, 3, . . . .
Lemma 3.3.1 Suppose Bernstein's condition holds. Define b² := Σ_{i=1}^n σ_i². Then for all 0 < λ < 1/K,
E exp[ λ Σ_{i=1}^n X_i ] ≤ exp[ λ² b² / (2(1 − λK)) ].

Proof. We have for 0 < λ < 1/K,
E exp[λ X_i] = 1 + Σ_{m=2}^∞ (λ^m/m!) E X_i^m
             ≤ 1 + (λ²/2) Σ_{m=2}^∞ (λK)^{m−2} σ_i²
             = 1 + λ² σ_i² / (2(1 − λK))
             ≤ exp[ λ² σ_i² / (2(1 − λK)) ].
The result now follows from the independence of X_1, . . . , X_n. □

Theorem 3.3.1 (Bernstein's inequality) Suppose Bernstein's condition holds. Define b² = Σ_{i=1}^n σ_i². We have
IP( Σ_{i=1}^n X_i ≥ a ) ≤ exp[ −a²/(2(aK + b²)) ]  ∀ a > 0,
or, reformulated,
IP( Σ_{i=1}^n X_i ≥ b √(2t) + Kt ) ≤ exp[−t]  ∀ t > 0.

Proof. We have for 0 < λ < 1/K, as in the proof of Lemma 3.3.1,
E exp[λ X_i] ≤ exp[ λ² σ_i² / (2(1 − λK)) ].
It follows that
E exp[ λ Σ_{i=1}^n X_i ] = Π_{i=1}^n E exp[λ X_i] ≤ exp[ λ² b² / (2(1 − λK)) ].
Now apply Chebyshev's inequality to Σ_{i=1}^n X_i, with φ(x) = exp[λx], x ∈ R. We arrive at
IP( Σ_{i=1}^n X_i ≥ a ) ≤ exp[ λ² b² / (2(1 − λK)) − λ a ].
Take
λ = a/(Ka + b²)
to complete the first part. For the reformulation, choose a = b √(2t) + Kt. □
Corollary 3.3.1 Consider N functions g_j : X → R, j = 1, . . . , N, with, for some constants σ² and K,
∀ j ∈ {1, . . . , N}:  E g_j(X) = 0,  E g_j²(X) ≤ σ²,  sup_{x∈X} |g_j(x)| ≤ K.
Then for all j,
IP( |Pn g_j| ≥ σ √(2t/n) + Kt/n ) ≤ 2 exp[−t],  ∀ t > 0.
Hence by the union bound,
IP( max_{1≤j≤N} |Pn g_j| ≥ σ √(2(t + log N)/n) + K(t + log N)/n ) ≤ 2 exp[−t],  ∀ t > 0.
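To see what Bernstein's inequality buys over Hoeffding's when the variance is small, one can simply compare the two deviation levels. A minimal sketch, with illustrative values of n, t, K and σ (not from the notes):

import numpy as np

# Both bounds hold with probability at least 1 - 2*exp(-t):
#   Hoeffding (Corollary 3.2.1):  |P_n g| <= K * sqrt(2 t / n)
#   Bernstein (Corollary 3.3.1):  |P_n g| <= sigma * sqrt(2 t / n) + K * t / n
n, t = 1000, np.log(100.0)          # t chosen so that 2*exp(-t) = 0.02
K, sigma = 1.0, 0.1                 # |g| <= K, E g^2(X) <= sigma^2 (illustrative)

hoeffding = K * np.sqrt(2.0 * t / n)
bernstein = sigma * np.sqrt(2.0 * t / n) + K * t / n
print("Hoeffding deviation level:", round(float(hoeffding), 4))
print("Bernstein deviation level:", round(float(bernstein), 4))   # much smaller when sigma << K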

3.4 Exercises

Exercise 3.4.1 Let X be N (0, 1)-distributed. Show that for λ > 0,

E exp[λX] = exp[λ2 /2].

Conclude that for all a > 0,

P (X ≥ a) ≤ exp[λ2 /2 − λa].

Take λ = a to find the inequality

P (X ≥ a) ≤ exp[−a2 /2].

Exercise 3.4.2 Consider N functions g_j : X → R, j = 1, . . . , N, with, for some constants σ² and K,
∀ j ∈ {1, . . . , N}:  E g_j(X) = 0,  E g_j²(X) ≤ σ²,  sup_{x∈X} |g_j(x)| ≤ K.
Derive an inequality for E( max_{1≤j≤N} |Pn g_j| ) using Bernstein's inequality (Theorem 3.3.1) instead of Hoeffding's inequality (Theorem 3.2.1).
Exercise 3.4.3 Let X_1, . . . , X_n be i.i.d. copies of a random variable X ∈ R with mean zero. Suppose that for some constants λ and ς,
E( exp[λ|X|] − 1 − λ|X| ) ≤ ς².
Show that Bernstein's condition (Definition 3.3.1) holds with appropriate constants σ² and K.
Exercise 3.4.4 Recall the class
G := { (1/2) log( (p + p_0)/(2p_0) ) l{p_0 > 0} : p ∈ P },  p_0 := dP/dµ,
defined in Section 2.5. Show that for all g ∈ G the random variable g(X) − Eg(X) satisfies Bernstein's condition (Definition 3.3.1). Hint: use Exercise 3.4.3.
Chapter 4

ULLNs based on entropy with


bracketing

4.1 The classical Glivenko Cantelli Theorem

For the case X = R we define the theoretical distribution function

F (x) := P (X ≤ x), x ∈ R

and the empirical distribution function

F̂n(x) := (1/n) #{Xi ≤ x, 1 ≤ i ≤ n}.
Theorem 4.1.1 (the classical Glivenko Cantelli Theorem) It holds that
sup_{x∈R} |F̂n(x) − F(x)| →_IP 0.

Remark 4.1.1 The convergence in probability can be strengthened to conver-


gence almost surely.
Proof of Theorem 4.1.1 for the case F continuous. We have for all x

|l(−∞,x] − F (x)| ≤ 1

so that by Hoeffding’s inequality (Theorem 3.2.1)


 
p
IP |F̂n (x) − F (x)| ≥ 2t/n ≤ 2 exp[−t] ∀ t > 0.

Let now 0 < δ < 1 be arbitrary and take a0 < a1 < · · · < aN such that
F (aj ) − F (aj−1 ) = δ. Note that then N ≤ 1 + 1/δ. If x ∈ (aj−1 , aj ] we clearly
have
(−∞, aj−1 ] ⊂ (−∞, x] ⊂ (−∞, aj ],


that is, we can “bracket” l(−∞,x] between a lower function l(−∞,aj−1 ] and an
upper function l(−∞,aj ] . This gives for x ∈ (aj−1 , aj ]

F̂n (x) − F (x) ≤ F̂n (aj ) − F (x)


≤ F̂n (aj ) − F (aj−1 )
= F̂n (aj ) − F (aj ) + δ
and
F̂n (x) − F (x) ≥ F̂n (aj−1 ) − F (x)
≥ F̂n (aj−1 ) − F (aj )
= F̂n (aj−1 ) − F (aj−1 ) − δ
It follows that
sup_{x∈R} |F̂n(x) − F(x)| ≤ max_{j=1,...,N} |F̂n(a_j) − F(a_j)| + δ.

By Hoeffding's inequality and the union bound,
IP( max_{j=1,...,N} |F̂n(a_j) − F(a_j)| ≥ √(2(t + log N)/n) ) ≤ 2 exp[−t],  ∀ t > 0.
Take
t = nδ²/4
and take n sufficiently large:
n ≥ (4/δ²) log N.
Then
√(2(t + log N)/n) ≤ δ,
and we find
IP( max_{j=1,...,N} |F̂n(a_j) − F(a_j)| ≥ δ ) ≤ 2 exp[−nδ²/4].

Hence for n ≥ 4 log(1 + 1/δ)/δ²,
IP( sup_{x∈R} |F̂n(x) − F(x)| ≥ 2δ ) ≤ 2 exp[−nδ²/4]. □

4.2 Entropy

Definition 4.2.1 Consider a subset S of a metric space with metric d. For any
δ > 0, let N (δ, S, d) be the minimum number of balls with radius δ, necessary
to cover S. Then N (δ, S, d) is called the δ-covering number of S. Moreover
H(·, S, d) := log N (·, S, d)
is called the entropy of S.

Remark. We sometimes require the centres of the balls to lie in S.

[Figure 3]
Example 4.2.1 Let S = [−1, 1]^r be the r-dimensional hypercube. Take as metric
d(x, y) := max_{1≤k≤r} |x_k − y_k| =: ‖x − y‖_∞,  x ∈ R^r, y ∈ R^r.
Then for all 0 < δ ≤ 1 (writing ⌈a⌉ := min{m ∈ N : m ≥ a} for a > 0),
N(δ, S, d) ≤ ⌈1/δ⌉^r ≤ (1 + 1/δ)^r,
H(δ, S, d) ≤ r log(1 + 1/δ).
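Example 4.2.1 can be checked numerically: a grid with spacing 2δ gives an explicit δ-covering of [−1, 1]^r in the sup-norm with at most ⌈1/δ⌉^r centres. The following sketch (not from the notes; NumPy, r and δ are illustrative assumptions) constructs such a covering and verifies the radius.

import numpy as np
from itertools import product

def covering_grid(r, delta):
    """Centres of a delta-covering of [-1, 1]^r in the sup-norm (a grid with spacing 2*delta)."""
    m = int(np.ceil(1.0 / delta))                       # points per coordinate
    pts_1d = -1.0 + delta * (2 * np.arange(m) + 1)
    return np.array(list(product(pts_1d, repeat=r)))

r, delta = 2, 0.25
centres = covering_grid(r, delta)
print("number of centres:", len(centres), " bound ceil(1/delta)^r:", int(np.ceil(1.0 / delta)) ** r)

# Check that every point of a random cloud in [-1, 1]^r is within delta of some centre.
x = np.random.default_rng(7).uniform(-1.0, 1.0, size=(1000, r))
dist = np.min(np.max(np.abs(x[:, None, :] - centres[None, :, :]), axis=2), axis=1)
print("max sup-norm distance to nearest centre:", round(float(dist.max()), 4), "<= delta =", delta)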

4.3 Entropy with bracketing

We define for q ≥ 1

Lq (P ) := {g : X → R : P |g|q < ∞}.

Endow this space with the norm

kgkq := (P |g|q )1/q , g ∈ Lq (P ).

We let
L∞ = {g : kgk∞ < ∞},

where
kgk∞ := sup |g(x)|
x∈X

is the sup-norm.

Definition 4.3.1 Consider a class G ⊂ Lq (P ). For any δ > 0, let {[gjL , gjU ]}N
j=1 ⊂
Lq (P ) be such that
(i) ∀ j, gjL ≤ gjU and kgjU − gjL kq ≤ δ,
(ii) ∀ g ∈ G ∃ j such that gjL ≤ g ≤ gjU .
We then call {[gjL , gjU ]}N
j=1 a δ−bracketing set.
The δ-covering number with bracketing is
 
L U N
Nq,B (δ, G, P ) := min N : ∃ δ−bracketing set {[gj , gj ]}j=1 .

The entropy with bracketing is

Hq,B (·, G, P ) := log Nq,B (·, G, P ).

Remark 4.3.1 Note that

Hq,B (δ, G, P ) ≤ H∞ (δ/2, G)

where H∞ (·, G) := H(·, G, k · k∞ ) is the entropy of G endowed with the metric


induced by the sup-norm k · k∞ .
Example 4.3.1 A function g on a subinterval X of R is called L-Lipschitz if

|g(x) − g(x̃)| ≤ L|x − x̃| ∀ x, x̃ ∈ X .

Let now X = (0, 1] and

G := {g : (0, 1] → [0, 1] : g 1−Lipschitz}.

We partition the interval (0, 1] into m ≤ 1 + 1/δ intervals (a_{j−1}, a_j] of length at most δ, 0 = a_0 < · · · < a_m = 1. For g ∈ G and x ∈ (a_{j−1}, a_j] we let
g̃(x) := ⌊g(a_j)/δ⌋ δ,
where ⌊a⌋ denotes the integer part of a. Then, since g(a_j) − δ < ⌊g(a_j)/δ⌋ δ ≤ g(a_j),
|g(x) − g̃(x)| ≤ |g(x) − g(a_j)| + δ ≤ 2δ.
We have at most 1 + 1/δ choices for ⌊g(a_1)/δ⌋. Given ⌊g(a_j)/δ⌋, we have
|⌊g(a_{j+1})/δ⌋ − ⌊g(a_j)/δ⌋| ≤ 2 + |g(a_{j+1}) − g(a_j)|/δ ≤ 3,
so there are at most 7 choices for ⌊g(a_{j+1})/δ⌋. The total number of functions g̃ as g varies is thus at most
(1 + 1/δ) × 7 × · · · × 7  (m − 1 factors of 7)  ≤ (1 + 1/δ) 7^{1/δ}.
Thus
H_∞(δ, G) ≤ log(1 + 1/δ) + (log 7)/δ  ∀ 0 < δ < 1.

Example 4.3.2 Let X = R and G := {l(−∞,x] : x ∈ R}. Let F := P (X ≤ ·)


be continuous (say). Then
Hq,B (δ, G, P ) ≤ q log(1 + 1/δ), 0 < δ ≤ 1.
In this case H_∞(δ, G) = ∞ for all 0 < δ < 1.
Recall the notation
kPn − P kG := sup |(Pn − P )g|.
g∈G

Lemma 4.3.1 (ULLN based on entropy with bracketing) Suppose
H_{1,B}(·, G, P) < ∞.
Then
‖Pn − P‖_G →_IP 0.

Proof. We use the same arguments as in the proof of Theorem 4.1.1: let δ > 0 be arbitrary and let {[g_j^L, g_j^U]}_{j=1}^N ⊂ L_1(P) be a minimal δ-bracketing set (N = N_{1,B}(δ, G, P)). Because this is a finite set, we know that¹
( max_{1≤j≤N} |(Pn − P)g_j^L| ) ∨ ( max_{1≤j≤N} |(Pn − P)g_j^U| ) →_IP 0.
Now use that, when g_j^L ≤ g ≤ g_j^U and P(g_j^U − g_j^L) ≤ δ,
(Pn − P)g ≤ Pn g_j^U − P g_j^L ≤ (Pn − P)g_j^U + δ,
(Pn − P)g ≥ Pn g_j^L − P g_j^U ≥ (Pn − P)g_j^L − δ. □

4.4 The envelope function

Let G ⊂ Lq (P ).
Definition 4.4.1 The envelope function (or envelope) is
G := sup_{g∈G} |g|.

We call G uniformly bounded if


kGk∞ < ∞.

Note that
Hq,B (·, G, P ) < ∞ ⇒ G ∈ Lq (P ).
Similarly
H∞ (·, G) < ∞ ⇒ kGk∞ < ∞.
1
for a and b in R we let a ∨ b := max{a, b} and a ∧ b := min{a, b}.

4.5 Entropy with bracketing and compactness

Lemma 4.5.1 Let G := {gθ : θ ∈ Θ} where (Θ, d) is a compact metric space.


Suppose
- θ 7→ gθ (x) is continuous for all x ∈ X ,
- G(·) := supθ∈Θ |gθ (·)| ∈ Lq (P ).
Then
Hq,B (·, G, P ) < ∞.

Proof. Define for θ ∈ Θ and ρ > 0
w_{θ,ρ}(·) := sup_{ϑ: d(θ,ϑ)<ρ} |g_θ(·) − g_ϑ(·)|.
Then by continuity
lim_{ρ→0} w_{θ,ρ} = 0.
Since G ∈ L_q(P), by dominated convergence,
lim_{ρ→0} P w_{θ,ρ}^q = 0.
Let δ > 0 be arbitrary and let ρ_θ be such that
P w_{θ,ρ_θ}^q ≤ δ^q.
Define
B_θ := {ϑ : d(θ, ϑ) < ρ_θ}.
Then {B_θ : θ ∈ Θ} is an open cover of Θ. Since Θ is compact there exists a finite sub-cover, say
{ B_j := {ϑ : d(θ_j, ϑ) < ρ_{θ_j}} }_{j=1}^N.
For θ ∈ B_j we have
g_θ ≤ g_{θ_j} + w_{θ_j, ρ_{θ_j}} =: g_j^U,
g_θ ≥ g_{θ_j} − w_{θ_j, ρ_{θ_j}} =: g_j^L,
with
P( g_j^U − g_j^L )^q = P( 2 w_{θ_j, ρ_{θ_j}} )^q ≤ (2δ)^q. □

4.6 Maximum likelihood for mixture models

Here is an example where the ULLN is used for proving consistency of certain
maximum likelihood estimators. Let Y ∈ R be a random variable with unknown

distribution function F_0. Given Y = y, let X have a known distribution with density k(x|y). Then X has mixture density
p_{F_0} := ∫ k(·|y) dF_0(y).
We observe n i.i.d. copies X_1, . . . , X_n of X. Let F be the collection of all distribution functions. The maximum likelihood estimator of F_0 is
F̂ := arg max_{F∈F} Pn log p_F.

Recall the definition of the Hellinger distance (Definition 2.5.1):
h(p, p̃) = ( (1/2) ∫ (√p − √p̃)² )^{1/2}.

Lemma 4.6.1 It holds that
h(p_{F̂}, p_{F_0}) →_IP 0.

Proof. We have F ⊂ (F*, d), where d is the metric induced by the "vague topology". The space (F*, d) is compact.² The map
F ↦ p_F = ∫ k(·|y) dF(y)
is continuous for d, so
F ↦ 2p_F/(p_F + p_{F_0})
is continuous for d as well. Let
G := { 2p_F/(p_F + p_{F_0}) : F ∈ F* }.
This class has envelope
G ≤ 2.
It follows from Lemma 4.5.1 that G is GC:
‖Pn − P‖_G →_IP 0.
We have (since F̂ maximizes Pn log p_F and (F̂ + F_0)/2 ∈ F, and using log x ≤ x − 1)
0 ≤ Pn log( 2p_{F̂}/(p_{F̂} + p_{F_0}) )
  ≤ Pn ( 2p_{F̂}/(p_{F̂} + p_{F_0}) − 1 )
  = (Pn − P)( 2p_{F̂}/(p_{F̂} + p_{F_0}) ) + P( 2p_{F̂}/(p_{F̂} + p_{F_0}) ) − 1.

² We skip details.

But
h²(p, p_0) = (1/2) ∫ (√p − √p_0)²
           = (1/2) ∫ (p − p_0)² / (√p + √p_0)²
           ≤ (1/2) ∫ (p − p_0)² / (p + p_0)
           = (1/2) ∫ (p_0 − p)²/(p_0 + p) + (1/2) ∫ (p_0 − p)     [the last integral equals 0]
           = ∫ p_0 (p_0 − p)/(p_0 + p)
           = ∫ ( 1 − 2p/(p + p_0) ) p_0
           = 1 − P( 2p/(p + p_0) ).
Thus we have shown (taking p = p_{F̂}) that
h²(p_{F̂}, p_{F_0}) ≤ ‖Pn − P‖_G,
whence the result. □

4.7 Exercises

Exercise 4.7.1 Complete the proof of Theorem 4.1.1 allowing for distribution
functions F with jumps.
Exercise 4.7.2 Let
d²(x, y) := Σ_{k=1}^r (x_k − y_k)² =: ‖x − y‖_2²,  x ∈ R^r, y ∈ R^r,
and let
S := {x : ‖x‖_2 ≤ 1}
be the r-dimensional ball. Show that
H(δ, S, d) ≤ const. r log(1/δ)  ∀ δ > 0
(where the "const." does not depend on r or δ).


Exercise 4.7.3 Extend the result of Example 4.3.1 to L-Lipschitz functions
with L possibly unequal to 1.
Chapter 5

Symmetrization

Symmetrization is a technique based on the following idea. Suppose you have


some estimation method, and want to know how well it performs. Suppose you
have a sample of size n, the so-called training set and a second sample, say also
of size n, the so-called test set. Then we may use the training set to calculate
the estimator, and the test set to check its performance. For example, suppose
we want to know how large the maximal deviation is between certain averages
and expectations. We cannot calculate this maximal deviation directly, as the
expectations are unknown. Instead, we can calculate the maximal deviation
between the averages in the two samples. Symmetrization is closely related: it
splits the sample of size 2n randomly in two subsamples of size n.

Let X ∈ X be a random variable with distribution P. We consider two independent sets of independent copies of X, X := (X_1, . . . , X_n) and X' := (X'_1, . . . , X'_n).

Let G be a class of real-valued functions on X. Consider the empirical measures
Pn := (1/n) Σ_{i=1}^n δ_{X_i},   P'_n := (1/n) Σ_{i=1}^n δ_{X'_i}.

Here δ_x denotes a point mass at x. Define
‖Pn − P‖_G := sup_{g∈G} |(Pn − P)g|,
and likewise
‖P'_n − P‖_G := sup_{g∈G} |(P'_n − P)g|,
and
‖Pn − P'_n‖_G := sup_{g∈G} |(Pn − P'_n)g|.


5.1 Intermezzo: some facts about (conditional) expectations

5.1.1 Suprema in-/outside the expectation

Let Z ∈ Z be a random variable and, for a set T, let f_t : Z → R be defined for all t ∈ T. Then
E sup_{t∈T} |f_t(Z)| ≥ sup_{t∈T} E |f_t(Z)| ≥ sup_{t∈T} |E f_t(Z)|.

5.1.2 Iterated expectations

Let X and Y be random variables, say in R. Then

EE(Y |X) = EY

and for functions f : R → R and g : R2 → R,

E[f (X)g(X, Y )|X] = f (X)E[g(X, Y )|X].

Moreover, for sets A ⊂ R and B ⊂ R²,

P (Y ∈ A|X) = E[lA (Y )|X],

and

P (X ∈ A, (X, Y ) ∈ B|X) = E[lA (X)lB (X, Y )|X]


= lA (X)E[lB (X, Y )|X]
= lA (X)P ((X, Y ) ∈ B|X).

5.2 Symmetrization with means

Lemma 5.2.1 We have
E ‖Pn − P‖_G ≤ E ‖Pn − P'_n‖_G.

Proof. Obviously,
E( Pn g | X ) = Pn g
and
E( P'_n g | X ) = P g.
So
(Pn − P)g = E[ (Pn − P'_n)g | X ].

Hence
‖Pn − P‖_G = sup_{g∈G} |(Pn − P)g| = sup_{g∈G} | E[ (Pn − P'_n)g | X ] |.
Now use the observation of Subsection 5.1.1. We see that
sup_{g∈G} | E[ (Pn − P'_n)g | X ] | ≤ E[ ‖Pn − P'_n‖_G | X ].
So we have now shown that
‖Pn − P‖_G ≤ E[ ‖Pn − P'_n‖_G | X ].
Finally, use iterated expectations (Subsection 5.1.2):
E ‖Pn − P‖_G ≤ E E[ ‖Pn − P'_n‖_G | X ] = E ‖Pn − P'_n‖_G. □
Definition 5.2.1 A Rademacher sequence {σ_i}_{i=1}^n is a sequence of independent random variables σ_i with
IP(σ_i = 1) = IP(σ_i = −1) = 1/2  ∀ i.

Let {σ_i}_{i=1}^n be a Rademacher sequence, independent of the two samples X and X'. We define the symmetrized empirical measure
Pn^σ g := (1/n) Σ_{i=1}^n σ_i g(X_i),  g ∈ G.
Let
‖Pn^σ‖_G := sup_{g∈G} |Pn^σ g|.

Lemma 5.2.2 We have
E ‖Pn − P‖_G ≤ 2 E ‖Pn^σ‖_G.

Proof. Consider the symmetrized version of the second sample X':
P'_n{}^σ g := (1/n) Σ_{i=1}^n σ_i g(X'_i).
Then ‖Pn − P'_n‖_G has the same distribution as ‖Pn^σ − P'_n{}^σ‖_G. So
E ‖Pn − P'_n‖_G = E ‖Pn^σ − P'_n{}^σ‖_G ≤ E ‖Pn^σ‖_G + E ‖P'_n{}^σ‖_G = 2 E ‖Pn^σ‖_G. □
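Lemma 5.2.2 can be illustrated numerically for a finite class: compare a Monte Carlo estimate of E‖Pn − P‖_G with 2E‖Pn^σ‖_G. The sketch below (not from the notes) uses indicators g_j = l_{(−∞, t_j]} with X uniform on [0, 1], so that P g_j = t_j; all concrete choices are illustrative.

import numpy as np

rng = np.random.default_rng(8)
n, reps = 200, 500
ts = np.linspace(0.05, 0.95, 19)                      # thresholds t_j

emp, sym = [], []
for _ in range(reps):
    X = rng.uniform(size=n)
    G = (X[:, None] <= ts[None, :]).astype(float)     # G[i, j] = g_j(X_i)
    sigma = rng.choice([-1.0, 1.0], size=n)           # Rademacher sequence
    emp.append(np.max(np.abs(G.mean(axis=0) - ts)))   # sup_j |(P_n - P) g_j|
    sym.append(np.max(np.abs(sigma @ G)) / n)         # sup_j |P_n^sigma g_j|

print("estimate of E||P_n - P||_G    :", round(float(np.mean(emp)), 4))
print("estimate of 2 E||P_n^sigma||_G:", round(2.0 * float(np.mean(sym)), 4))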

5.3 Symmetrization with probabilities

Lemma 5.3.1 Let δ > 0. Suppose that for all g ∈ G,
IP( |(Pn − P)g| > δ/2 ) ≤ 1/2.
Then
IP( ‖Pn − P‖_G > δ ) ≤ 2 IP( ‖Pn − P'_n‖_G > δ/2 ).

Proof. Let IP_X denote the conditional probability given X. If ‖Pn − P‖_G > δ, we know that for some random function g_* = g_*(X) depending on X,
|(Pn − P)g_*| > δ.
Because X' is independent of X, we also know that
IP_X( |(P'_n − P)g_*| > δ/2 ) ≤ 1/2.
Thus,
IP( |(Pn − P)g_*| > δ and |(P'_n − P)g_*| ≤ δ/2 )
  = E IP_X( |(Pn − P)g_*| > δ and |(P'_n − P)g_*| ≤ δ/2 )
  = E [ IP_X( |(P'_n − P)g_*| ≤ δ/2 ) l{ |(Pn − P)g_*| > δ } ]
  ≥ (1/2) E l{ |(Pn − P)g_*| > δ }
  = (1/2) IP( |(Pn − P)g_*| > δ ).
It follows that
IP( ‖Pn − P‖_G > δ ) ≤ IP( |(Pn − P)g_*| > δ )
  ≤ 2 IP( |(Pn − P)g_*| > δ and |(P'_n − P)g_*| ≤ δ/2 )
  ≤ 2 IP( |(Pn − P'_n)g_*| > δ/2 )
  ≤ 2 IP( ‖Pn − P'_n‖_G > δ/2 ). □
Corollary 5.3.1 Let δ > 0. Suppose that for all g ∈ G,
IP( |(Pn − P)g| > δ/2 ) ≤ 1/2.
Then
IP( ‖Pn − P‖_G > δ ) ≤ 4 IP( ‖Pn^σ‖_G > δ/4 ).

5.4 Exercises

Exercise 5.4.1 Use the same arguments as in the proof of Lemma 3.2.2 to show that
E( max_{1≤j≤N} |Pn^σ g_j| | X ) ≤ R_n √( 2 log(2N)/n ),
with
R_n² := max_{1≤j≤N} ‖g_j‖_n²,
where for a function g : X → R, ‖·‖_n is a shorthand (abuse of) notation for the empirical norm
‖g‖_n := √( Pn g² ),  g : X → R.
Chapter 6

ULLNs based on
symmetrization

In this chapter, we prove uniform laws of large numbers for the empirical mean
of functions g of the individual observations, when g varies over a class G of
functions. First, we study the case where G is finite. Symmetrization is used
in order to be able to apply Hoeffding’s inequality. Hoeffding’s inequality gives
exponential small probabilities for the deviation of averages from their expec-
tations. So considering only a finite number of such averages, the difference
between these averages and their expectations will be small for all averages si-
multaneously, with large probability.
If G is not finite, we approximate it by a finite set. A δ-approximation is called
a δ-covering, and the number of elements of a minimal δ-covering is called the
δ-covering number.
We introduce Vapnik Chervonenkis (VC) classes. These are classes with small
covering numbers.

Let X ∈ X be a random variable with distribution P . Consider a class G of


real-valued functions on X, and consider i.i.d. copies X_1, X_2, . . . of X. In this chapter, we address the problem of proving ‖Pn − P‖_G →_IP 0. If this is the case, we call G a Glivenko Cantelli (GC) class.

Remark. It can be shown that if ‖Pn − P‖_G →_IP 0, then also ‖Pn − P‖_G → 0 almost surely. This involves e.g. martingale arguments. We will not consider this issue.

6.1 Classes of functions

Notation. The sup-norm of a function g is


kgk∞ := sup |g(x)|.
x∈X


Elementary observation. Let {A_k}_{k=1}^N be a finite collection of events. Then, by the union bound,
IP( ∪_{k=1}^N A_k ) ≤ Σ_{k=1}^N IP(A_k) ≤ N max_{1≤k≤N} IP(A_k).

Lemma 6.1.1 Let G be a finite class of functions, with cardinality |G| := N > 1. Suppose that for some finite constant K,
max_{g∈G} ‖g‖_∞ ≤ K.
Then we have
IP( ‖Pn^σ‖_G > K √(2(log N + t)/n) ) ≤ 2 exp[−t]  ∀ t > 0,
and for all t > 0 such that log N + t ≥ 1,
IP( ‖Pn − P‖_G > 4K √(2(log N + t)/n) ) ≤ 8 exp[−t].

Proof.
• By Hoeffding's inequality, for each g ∈ G,
IP( |Pn^σ g| > K √(2t/n) ) ≤ 2 exp[−t]  ∀ t > 0.
• Use the elementary observation to conclude that
IP( ‖Pn^σ‖_G > K √(2(log N + t)/n) ) ≤ 2 exp[−t]  ∀ t > 0.
• By Chebyshev's inequality, for each g ∈ G and for K²/(nδ²) ≤ 1/2,
IP( |(Pn − P)g| > δ ) ≤ var(g(X))/(nδ²) ≤ K²/(nδ²) ≤ 1/2.
• Hence, by symmetrization with probabilities, when log N + t ≥ 1,
IP( ‖Pn − P‖_G > 4K √(2(log N + t)/n) ) ≤ 4 IP( ‖Pn^σ‖_G > K √(2(log N + t)/n) ) ≤ 8 exp[−t]. □

We recall some definitions from Chapter 4.



Definition 6.1.1 The envelope G of a collection of functions G is defined by

G(x) = sup |g(x)|, x ∈ X .


g∈G

Definition 6.1.2 Let S be some subset of a metric space (Λ, d). For δ > 0, the
δ-covering number N (δ, S, d) of S is the minimum number of balls with radius
δ, necessary to cover S, i.e. the smallest value of N , such that there exist
s1 , . . . , sN in1 Λ with

min d(s, sj ) ≤ δ, ∀ s ∈ S.
j=1,...,N

The set s1 , . . . , sN is then called a minimal δ-covering of S. The logarithm


H(·, S, d) := log N (·, S, d) of the covering number is called the entropy of S.

Notation. We let Nq (·, G, Pn ) be the covering numbers of G endowed with the


metric corresponding to the empirical norm

(Pn |g|q )1/q , g : X → R,

and let Hq (·, G, Pn ) be the entropy for this metric.


Theorem 6.1.1 Suppose
‖g‖_∞ ≤ K  ∀ g ∈ G.
Assume moreover that for all δ > 0,
(1/n) H_1(δ, G, Pn) →_IP 0.
Then
‖Pn − P‖_G →_IP 0.

Proof. Let δ > 0. Let g_1, . . . , g_N, with N = N_1(δ, G, Pn), be a minimal δ-covering of G.
• When Pn(|g − g_j|) ≤ δ, we have
|Pn^σ g| ≤ |Pn^σ g_j| + δ.
So
‖Pn^σ‖_G ≤ max_{j=1,...,N} |Pn^σ g_j| + δ.
• Let IP_X denote conditional probability given X. By Hoeffding's inequality and the elementary observation, we have
IP_X( max_{j=1,...,N} |Pn^σ g_j| > K √(2(log N + t)/n) ) ≤ 2 exp[−t]  ∀ t > 0.

¹ We will sometimes require that {s_j}_{j=1}^N ⊂ S.


Conclude that
IP_X( ‖Pn^σ‖_G > δ + K √(2(log N + t)/n) ) ≤ 2 exp[−t]  ∀ t > 0.
• But then
IP( ‖Pn^σ‖_G > 2δ + K √(2t/n) ) ≤ 2 exp[−t] + IP( K √(2 H_1(δ, G, Pn)/n) > δ )  ∀ t > 0.
• Now, use the symmetrization with probabilities to conclude that
IP( ‖Pn − P‖_G > 8δ + 4K √(2t/n) ) ≤ 8 exp[−t] + 4 IP( K √(2 H_1(δ, G, Pn)/n) > δ )  ∀ t ≥ 1.
Since δ is arbitrary, this concludes the proof. □
Example 6.1.1 Consider the case X = R and
G := {g ↑ : 0 ≤ g ≤ 1},
the non-decreasing functions with values in [0, 1]. Let N_∞(·, G, Pn) be the covering number of G for the (pseudo-)metric induced by the (pseudo-)norm
max_{1≤i≤n} |g(X_i)|,  g : X → R.
Then N_1(·, G, Pn) ≤ N_∞(·, G, Pn). Let δ > 0. We approximate g ∈ G by
g̃(x) := ⌈g(x)/δ⌉ δ,  x ∈ R,
where ⌈a⌉ := min{b ∈ N : b ≥ a}, a ≥ 0. Now we count how many such functions g̃ we obtain as g varies. The function g̃ has at most m ≤ 1 + 1/δ jumps. For the covering, the jumps only matter at X_1, . . . , X_n, so we can represent g̃ by a sequence of n − 1 zeroes and m ones. The number of such sequences is
( m + n − 1 choose m ).
We conclude that
N_∞(δ, G, Pn) ≤ ( m + n − 1 choose m ),
and hence
H_∞(δ, G, Pn) := log N_∞(δ, G, Pn) ≤ log ( m + n − 1 choose m ) ≤ m log(m + n − 1) ≤ (1 + 1/δ) log(n + 1/δ).
By Theorem 6.1.1 we conclude that the class of monotone functions G is GC.

We next replace the assumption that G is uniformly bounded by the weaker


assumption that its envelope G is P -integrable.

Theorem 6.1.2 Suppose
G ∈ L_1(P)
and that for all δ > 0,
(1/n) H_1(δ, G, Pn) →_IP 0.
Then
‖Pn − P‖_G →_IP 0.

Proof. It holds that for all g ∈ G and any K > 0,
|(Pn − P)g| ≤ |(Pn − P) g l{G ≤ K}|  +  |(Pn − P) g l{G > K}|  =: (i) + (ii).
(i) Let G_K := {g l{G ≤ K} : g ∈ G}. This class is uniformly bounded by K and, since H_1(·, G_K, Pn) ≤ H_1(·, G, Pn), we conclude from Theorem 6.1.1 that
‖Pn − P‖_{G_K} →_IP 0.
(ii) We have
sup_{g∈G} |(Pn − P) g l{G > K}| ≤ (Pn + P) G l{G > K} = (Pn − P) G l{G > K} + 2 P G l{G > K},
where the first term tends to 0 in probability as n → ∞ for every K, and the second term tends to 0 as K → ∞. □

6.2 Classes of sets

Let D be a collection of subsets of X , and let {ξ1 , . . . , ξn } be n points in X .



Definition 6.2.1 We write
∆_D(ξ_1, . . . , ξ_n) := card( { D ∩ {ξ_1, . . . , ξ_n} : D ∈ D } ),
the number of subsets of {ξ_1, . . . , ξ_n} that D distinguishes. That is, count the number of sets in D, where two sets D_1 and D_2 are considered as equal if D_1 ∆ D_2 ∩ {ξ_1, . . . , ξ_n} = ∅. Here
D_1 ∆ D_2 = (D_1 ∩ D_2^c) ∪ (D_1^c ∩ D_2)
is the symmetric difference between D_1 and D_2.

Remark. For our purposes, we will not need to calculate ∆_D(ξ_1, . . . , ξ_n) exactly, but only a good enough upper bound.

Example. Let X = R and
D = { (−∞, t] : t ∈ R }.
Then for all {ξ_1, . . . , ξ_n} ⊂ R,
∆_D(ξ_1, . . . , ξ_n) ≤ n + 1.

Example. Let D be the collection of all finite subsets of X. Then, if the points ξ_1, . . . , ξ_n are distinct,
∆_D(ξ_1, . . . , ξ_n) = 2^n.
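The first example can be verified by brute force: enumerate the distinct traces D ∩ {ξ_1, . . . , ξ_n} for the half-intervals (−∞, t]. A small sketch (not from the notes; the random points are illustrative):

import numpy as np

rng = np.random.default_rng(9)
xi = rng.normal(size=10)

# As t sweeps over R, a new trace of (-inf, t] on {xi_1, ..., xi_n} appears only at the data
# points themselves (plus one threshold below all of them), giving n + 1 traces for distinct points.
thresholds = np.concatenate(([xi.min() - 1.0], np.sort(xi)))
traces = {tuple(xi <= t) for t in thresholds}
print("n =", len(xi), " Delta_D =", len(traces), " (= n + 1), versus 2^n =", 2 ** len(xi))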

Theorem 6.2.1 (Vapnik and Chervonenkis (1971)) We have
(1/n) log ∆_D(X_1, . . . , X_n) →_IP 0
if and only if
sup_{D∈D} |Pn(D) − P(D)| →_IP 0.

Proof of the only if part. This follows from applying Theorem 6.1.1 to

G = {lD : D ∈ D}.

To see this, note first that a class of indicator functions is uniformly bounded
by 1. This is also true for the centred version, i.e. we can take K = 1 in
Theorem 6.1.1. Moreover, writing N∞ (·, G, Pn ) for the covering number of G
for the (pseudo-)metric induced by the (pseudo-)norm

max |g(Xi )|, g : X → R,


1≤i≤n

we see that
N1 (·, G, Pn ) ≤ N∞ (·, G, Pn ).

But for 0 < δ < 1,
N_∞(δ, {l_D : D ∈ D}, Pn) = ∆_D(X_1, . . . , X_n).
So indeed, if (1/n) log ∆_D(X_1, . . . , X_n) →_IP 0, then also
(1/n) H_1(δ, {l_D : D ∈ D}, Pn) ≤ (1/n) log ∆_D(X_1, . . . , X_n) →_IP 0. □

6.3 Vapnik Chervonenkis classes

Definition 6.3.1 Let

mD (n) = sup{4D (ξ1 , . . . , ξn ) : ξ1 , . . . , ξn ∈ X }.

We say that D is a Vapnik Chervonenkis (VC) class if for certain constants c


and V , and for all n,
mD (n) ≤ cnV ,
i.e., if mD (n) does not grow faster than a polynomial in n.

Important conclusion: For sets VC ⇒ GC .

Examples.
a) X = R, D = { (−∞, t] : t ∈ R }. Since m_D(n) ≤ n + 1, D is VC.
b) X = R^d, D = { (−∞, t] : t ∈ R^d }. Since m_D(n) ≤ (n + 1)^d, D is VC.
c) X = R^d, D = { {x : θ^T x > t} : (θ, t) ∈ R^{d+1} }. Since m_D(n) ≤ 2^d n^d, D is VC.

The VC property is closed under measure theoretic operations:


Lemma 6.3.1 Let D, D_1 and D_2 be VC. Then the following classes are also VC:
(i) D^c = {D^c : D ∈ D},
(ii) D_1 ∩ D_2 = {D_1 ∩ D_2 : D_1 ∈ D_1, D_2 ∈ D_2},
(iii) D_1 ∪ D_2 = {D_1 ∪ D_2 : D_1 ∈ D_1, D_2 ∈ D_2}.

Proof. Exercise.
u
t

Examples.

- the class of intersections of two half-spaces,
- all ellipsoids,
- all half-ellipsoids,
- in R, the class { {x : θ_1 x + · · · + θ_r x^r ≤ t} : (θ, t) ∈ R^{r+1} }.

There are classes that are GC, but not VC.

Example. Let X = [0, 1]2 , and let D be the collection of all convex subsets of
X . Then D is not VC, but when P is uniform, D is GC.
Definition 6.3.2 The VC dimension of D is

V (D) = inf{n : mD (n) < 2n }.

The following lemma is beautiful, but to avoid digressions, we will not provide
a proof.
Lemma 6.3.2 (Sauer-Shelah Lemma) We have that D is VC if and only if V(D) < ∞. In fact, we have, for V = V(D),
m_D(n) ≤ Σ_{k=0}^{V} ( n choose k ). □

6.4 VC graph classes of functions

Definition 6.4.1 The subgraph of a function g : X → R is

subgraph(g) = {(x, t) ∈ X × R : g(x) ≥ t}.

A collection of functions G is called a VC class if the subgraphs {subgraph(g) :


g ∈ G} form a VC class.

Example. G = {l_D : D ∈ D} is VC if D is VC.

Examples (X = R^d).
a) G = { g(x) = θ_0 + θ_1 x_1 + · · · + θ_d x_d : θ ∈ R^{d+1} },
b) G = { g(x) = |θ_0 + θ_1 x_1 + · · · + θ_d x_d| : θ ∈ R^{d+1} },
c) d = 1, G = { g(x) = a + bx if x ≤ c, g(x) = d + ex if x > c : (a, b, c, d, e) ∈ R^5 },
d) d = 1, G = { g(x) = e^{θx} : θ ∈ R }.



Definition 6.4.2 Let S be some subset of a metric space (Λ, d). For δ > 0,
the δ-packing number D(δ, S, d) of S is the largest value of N , such that there
exist s1 , . . . , sN in S with

d(s_k, s_j) > δ,  ∀ k ≠ j.

Lemma 6.4.1 For all δ > 0


(i) N (δ, S, d) ≤ D(δ, S, d),
(ii) D(δ, S, d) ≤ N (δ/2, S, d).
Proof. Let s_1, . . . , s_N in S, with d(s_k, s_j) > δ for all k ≠ j, be a maximal such set.
(i) Then for any s ∈ S there is a j ∈ {1, . . . , N } such that d(s, sj ) ≤ δ (since
otherwise we could add s to the packing).
(ii) Since {s1 , . . . , sN } ⊂ S

N (δ/2, S, d) ≥ N (δ/2, {s1 , . . . , sN }, d).

But
N (δ/2, {s1 , . . . , sN }, d) = N.
u
t
Theorem 6.4.1 Let Q be any probability measure on X, let G be the envelope of G, and let N_1(·, G, Q) be the covering number of G endowed with the metric corresponding to the L_1(Q) norm. For a VC class G with VC dimension V, we have, for a constant A depending only on V,
N_1( δ QG, G, Q ) ≤ max( A δ^{−2V}, e^{δ/4} )  ∀ δ > 0.
Proof. Without loss of generality, assume Q(G) = 1. Choose S ∈ X with distribution dQ_S = G dQ. Given S = s, choose T uniformly in the interval [−G(s), G(s)]. Let g_1, . . . , g_N be a maximal set in G such that Q(|g_j − g_k|) > δ for j ≠ k. Consider a pair j ≠ k. Given S = s, the probability that T falls in between the two graphs of g_j and g_k is
|g_j(s) − g_k(s)| / (2G(s)).
So the unconditional probability that T falls in between the two graphs of g_j and g_k is
∫ ( |g_j(s) − g_k(s)| / (2G(s)) ) dQ_S(s) = Q(|g_j − g_k|)/2 > δ/2.
Now, choose n independent copies {(S_i, T_i)}_{i=1}^n of (S, T). The probability that none of these fall in between the graphs of g_j and g_k is then at most
(1 − δ/2)^n.
The probability that for some j ≠ k none of these fall in between the graphs of g_j and g_k is then at most
( N choose 2 ) (1 − δ/2)^n ≤ (1/2) exp[ 2 log N − nδ/2 ] ≤ 1/2 < 1,
when we choose n as the smallest integer such that
n ≥ 4 log N / δ.
So for such a value of n, with positive probability, for every pair j ≠ k some of the T_i fall in between the graphs of g_j and g_k. Therefore, by the VC property, we must have
N ≤ c n^V.
But then, for N ≥ exp[δ/4] (so that 4 log N/δ ≥ 1),
N ≤ c ( 4 log N/δ + 1 )^V ≤ c ( 8 log N/δ )^V = c ( 16 V log(N^{1/(2V)}) / δ )^V ≤ c ( 16V/δ )^V N^{1/2},
using log x ≤ x. So
N ≤ c² ( 16V/δ )^{2V}. □
Corollary 6.4.1 By Theorem 6.1.2 and Theorem 6.4.1 we arrive at the fol-
lowing important conclusion:
G VC & P G < ∞ ⇒ G GC

6.5 Exercises

Exercise 6.5.1 Let G be a finite class of functions, with cardinality |G| := N > 1. Suppose that for some finite constant K,
max_{g∈G} ‖g‖_∞ ≤ K.
Use Bernstein's inequality to show that for
δ² ≥ (4 log N / n) ( δK + K² )
one has
IP( ‖Pn − P‖_G > δ ) ≤ 2 exp[ −nδ² / (4(δK + K²)) ].

Exercise 6.5.2 Let GK := {gl{G ≤ K} : g ∈ G}. Show that

Hq (·, GK , Pn ) ≤ Hq (·, G, Pn ).

Exercise 6.5.3 Present a proof of Lemma 6.3.1.

Exercise 6.5.4 Are the following classes of sets (functions) VC? Why (not)?

1) The class of all rectangles in Rd .

2) The class of all monotone functions on R.

3) The class of functions on [0, 1] given by
G = { g(x) = a e^{bx} + c e^{dx} : (a, b, c, d) ∈ [0, 1]^4 }.

4) The class of all sections in R2 (a section is of the form {(x1 , x2 ) : x1 =


a1 + r sin t, x2 = a2 + r cos t, θ1 ≤ t ≤ θ2 }, for some (a1 , a2 ) ∈ R2 , some r > 0,
and some 0 ≤ θ1 ≤ θ2 ≤ 2π).

5) The class of all star-shaped sets in R2 (a set D is star-shaped if for some


a ∈ D and all b ∈ D also all points on the line segment joining a and b are in
D).

Exercise 6.5.5 Let G be the class of all functions g on [0, 1] with derivative ġ
satisfying |ġ| ≤ 1. Check that G is not VC. Show that G is GC by using partial
integration and the Glivenko Cantelli Theorem for the empirical distribution
function.
Chapter 7

M-estimators

7.1 What is an M-estimator?

Let X1 , . . . , Xn , . . . be i.i.d. copies of a random variable X with values in X and


with distribution P .

Let Θ be a parameter space (a subset of some metric space with metric d) and
let for θ ∈ Θ,
γθ : X → R,
be some loss function. We assume P |γθ | < ∞ for all θ ∈ Θ. We estimate the
unknown parameter
θ0 := arg min P γθ ,
θ∈Θ

by the M-estimator
θ̂n := arg min Pn γθ .
θ∈Θ

We assume that θ0 exists and is unique and that θ̂n exists.

Examples.
(i) Location estimators. X = R, Θ = R, and
(i.a) γθ (x) = (x − θ)2 (estimating the mean),
(i.b) γθ (x) = |x − θ| (estimating the median).
(ii) Maximum likelihood. {pθ : θ ∈ Θ} family of densities w.r.t. a σ-finite
dominating measure µ, and
γθ = − log pθ .
If dP/dµ = pθ0 , θ0 ∈ Θ, then indeed θ0 is a minimizer of P (γθ ), θ ∈ Θ.
(ii.a) Poisson distribution:

θx
pθ (x) = eθ , θ > 0, x ∈ {1, 2, . . .}.
x!

51
52 CHAPTER 7. M-ESTIMATORS

(ii.b) Logistic distribution:

eθ−x
pθ (x) = , θ ∈ R, x ∈ R.
(1 + eθ−x )2

7.2 Consistency

Define for θ ∈ Θ,
R(θ) = P γθ ,
and
Rn (θ) = Pn γθ .
Definition 7.2.1 We say that θ0 is well-separated if for all η > 0

inf{R(θ) : d(θ, θ0 ) > η} > R(θ0 ).

We first present an easy proposition with a too stringent condition (•).


Proposition 7.2.1 Suppose that
IP
(•) sup |Rn (θ) − R(θ)| → 0,
θ∈Θ

IP
i.e., that {γθ : θ ∈ Θ} is a GC class. Then R(θ̂n ) → R(θ0 ). If moreover θ0 is
IP
well-separated, also θ̂n → θ0 .
Proof. We have

0 ≤ R(θ̂n ) − R(θ0 )
= [R(θ̂n ) − R(θ0 )] − [Rn (θ̂n ) − Rn (θ0 )] + [Rn (θ̂n ) − Rn (θ0 )]
IP
≤ [R(θ̂n ) − R(θ0 )] − [Rn (θ̂n ) − Rn (θ0 )] → 0.
IP IP
So R(θ̂n ) → R(θ0 ) and hence, if θ0 is well-separated, θ̂n → θ0 .
u
t

The assumption (•) is rather severe. It is close to requiring compactness of Θ.


Lemma 7.2.1 Suppose that (Θ, d) is compact and that θ 7→ γθ is continuous.
Moreover, assume that P (G) < ∞, where

G = sup |γθ |.
θ∈Θ

Then
IP
sup |Rn (θ) − R(θ)| → 0.
θ∈Θ
7.2. CONSISTENCY 53

Proof. This is Lemma 4.5.1. u


t

We give a lemma which replaces compactness by a convexity assumption.


Lemma 7.2.2 Suppose that Θ is a convex subset of a normed vector space with
norm k · k and that θ 7→ γθ , θ ∈ Θ is convex. Suppose that for some  > 0,
- Θ := {θ ∈ Θ : kθ − θ0 k ≤ } is compact,
- P (G ) < ∞ where G := supθ∈Θ |γθ |.
IP
Then θ̂n → θ0 .
Proof. By Lemma 4.5.1
IP
sup |Rn (θ) − R(θ)| → 0.
kθ−θ0 k≤

Define

α=
 + kθ̂n − θ0 k
and
θ̃n = αθ̂n + (1 − α)θ0 .
Then
kθ̃n − θ0 k ≤ .
Moreover,
Rn (θ̃n ) ≤ αRn (θ̂n ) + (1 − α)Rn (θ0 ) ≤ Rn (θ0 ).
It follows from the arguments used in the proof of Proposition 7.2.1 that
IP
R(θ̃n ) → R(θ0 ). The convexity and the uniqueness of θ0 implies now that

IP
kθ̃n − θ0 k → 0.

But then also


kθ̃n − θ0 k IP
kθ̂n − θ0 k = → 0.
 − kθ̃n − θ0 k

u
t
Example 7.2.1 Let X ∈ R, Θ = R and for some q ≥ 1

γθ (x) := |x − θ|q , x ∈ R.

Assume E|X − θ0 |q < ∞. Then

G := sup |γθ | ≤ 2q−1 (|X − θ0 |q + q ) ∈ L1 (P ).


kθ−θ0 k≤

Hence (assuming uniqueness of θ0 )

IP
θ̂n → θ0 .
54 CHAPTER 7. M-ESTIMATORS

We may extend this to the situation where there are co-variables: replace X by
(X, Y ) where X ∈ Rr is a row-vector containing the co-variables and Y ∈ R is
the response variable. The loss function is
γθ (x, y) = |y − xθ|q .
Then E|Y − Xθ0 |q < ∞ together with uniqueness of θ0 yields consistency.
Example 7.2.2 Replace X by (X, Y ) where X ∈ [0, 1] is a co-variable and
Y = θ0 (X) + ξ,
with the noise ξ ∼ N (0, σ 2 ). We use least squares loss:
γθ (x, y) := (y − θ(x))2 , x ∈ [0, 1], y ∈ R.
We assume that θ0 ∈ Θ where, for a given m ∈ N, Θ is the “Sobolev” space
 Z 1 
(m) 2
θ := θ : [0, 1] → R, |θ (x)| dx ≤ 1 .
0

Here θ(m)denotes the m-th derivative of the function θ. We endow the space
Θ with the sup-norm
kθk := kθk∞ = sup |θ(x)|.
x∈[0,1]

Then one can show (we omit the details here) that for any 
 Z 1 
(m) 2
Θ := θ : [0, 1] → R, kθ − θ0 k ≤ , |θ (x)| dx ≤ 1
0
is compact. It follows that under the assumption that θ0 is unique (which is
true for example if the distribution of X is absolutely continuous with a density
that stays away from zero), the (non-parametric) least squares estimator θ̂n is
consistent in sup-norm.

7.3 Exercises

Exercise 7.3.1 Let Y ∈ {0, 1} be a binary response variable and X ∈ R be a


co-variable. Assume the logistic regression model
1
Pθ0 (Y = 1|X = x) = ,
1 + exp[α0 + xβ0 ]
where θ0 = (α0 , β0 ) ∈ R2 is an unknown parameter. Let {(Xi , Yi )}ni=1 be i.i.d.
copies of (X, Y ). Show consistency of the MLE of θ0 (assuming uniqueness of
θ0 ).

Exercise 7.3.2 Suppose X1 , . . . , Xn are i.i.d. real-valued random variables


with density p0 = dP/dµ on [0,1]. Here, µ is Lebesgue measure on [0, 1].
Suppose it is given that p0 ∈ P, with P the set of all decreasing densities
bounded from above by 2 and from below by 1/2. Let p̂n be the MLE. Can you
show consistency of p̂n ? For what metric? Hint: use the result of Section 2.5
and Example 6.1.1.
Chapter 8

Uniform central limit


theorems

After having studied uniform laws of large numbers, a natural question is:
can we also prove uniform central limit theorems? It turns out that precisely
defining what a uniform central limit theorem is, is quite involved, and actually
beyond our scope. In this Chapter we will therefore only briefly indicate the
results, and not present any proofs. These sections only reveal a glimpse of
the topic of weak convergence on abstract spaces. The thing to remember from
them is the concept asymptotic continuity, because we will use that concept in
our statistical applications.

8.1 Real-valued random variables

Let X = R.
Theorem 8.1.1 (Central limit theorem in R) Suppose EX = µ, and
var(X) = σ 2 exist. Then
√ 
n(X̄n − µ)
IP ≤z → Φ(z), f or all z,
σ

where Φ is the standard normal distribution function.


u
t
Notation. √
n(X̄n − µ) L
→ N (0, 1),
σ
or
√ L
n(X̄n − µ) → N (0, σ 2 ).

55
56 CHAPTER 8. UNIFORM CENTRAL LIMIT THEOREMS

8.2 Rr -valued random variables

Let X1 , X2 , . . . be i.i.d. Rr -valued random variables copies of X, (X ∈ X = Rr ),


with expectation µ = EX, and covariance matrix Σ = EXX T − µµT .
Theorem 8.2.1 (Central limit theorem in Rr ) We have
√ L
n(X̄n − µ) → N (0, Σ),

i.e. √  T  L
n a (X̄n − µ) → N (0, aT Σa), f or all a ∈ Rd .

u
t.

8.3 Donsker’s theorem

Let X = R. Recall the definition of the distribution function F and the empir-
ical distribution function F̂n :

F (t) = P (X ≤ t), t ∈ R,
1
F̂n (t) = #{Xi ≤ t, 1 ≤ i ≤ n}, t ∈ R.
n
Define √
Wn (t) := n(F̂n (t) − F (t)), t ∈ R.

By the central limit theorem in R (Section 8.1), for all t


L
Wn (t) → N (0, F (t)(1 − F (t))) .

Also, by the central limit theorem in R2 (Section 8.2), for all s < t,
 
Wn (s) L
→ N (0, Σ(s, t)),
Wn (t)

where  
F (s)(1 − F (s)) F (s)(1 − F (t))
Σ(s, t) = .
F (s)(1 − F (t)) F (t)(1 − F (t))

We are now going to consider the stochastic process Wn = {Wn (t) : t ∈ R}.
The process Wn is called the (classical) empirical process.
Definition 8.3.1 Let K0 be the collection of bounded functions on [0, 1] The
stochastic process B(·) ∈ K0 , is called the standard Brownian bridge if
- B(0) = B(1) = 0,  
B(t1 )
- for all r ≥ 1 and all t1 , . . . , tr ∈ (0, 1), the vector  ...  is multivariate
 

B(tr )
8.4. DONSKER CLASSES 57

normal with mean zero,


- for all s ≤ t, cov(B(s), B(t)) = s(1 − t),
- the sample paths of B are a.s. continuous.
We next consider the process WF defined as

WF (t) = B(F (t)) : t ∈ R.

Thus, WF = B ◦ F .
Theorem 8.3.1 (Donsker’s theorem) Consider Wn and WF as elements of
the space K of bounded functions on R. We have

L
Wn → WF ,

that is,
I (Wn ) → Ef
Ef I (WF ),
for all continuous and bounded functions f .
u
t

Reflection. Suppose F is continuous. Then, since B is almost surely continu-


ous, also WF = B◦F is almost surely continuous. So Wn must be approximately
continuous as well in some sense. Indeed, we have for any t and any sequence
tn converging to t,
IP
|Wn (tn ) − Wn (t)| → 0.
This is called asymptotic continuity.

8.4 Donsker classes

Let X1 , . . . , Xn , . . . be i.i.d. copies of a random variable X, with values in the


space X , and with distribution P . Consider a class G of functions g : X → R.
The (theoretical) mean of a function g is

P g := Eg(X),

and the (empirical) average (based on the n observations X1 , . . . , Xn ) is


n
1X
Pn g := g(Xi ).
n
i=1

Here Pn is the empirical distribution (based on X1 , . . . , Xn ).


Definition 8.4.1 The empirical process indexed by G is

νn (g) := n(Pn − P )g, g ∈ G.
58 CHAPTER 8. UNIFORM CENTRAL LIMIT THEOREMS

Let us recall the central limit theorem for g fixed. Denote the variance of g(X)
by
σ 2 (g) := var(g(X)) = P g 2 − (P g)2 .
If σ 2 (g) < ∞, we have
L
νn (g) → N (0, σ 2 (g)).

The central limit theorem also holds for finitely many g simultaneously. Let gk
and gl be two functions and denote the covariance between gk (X) and gl (X) by

σ(gk , gl ) := cov(gk (X), gl (X)) = P gk gl − (P gk )(P gl ).

Then, whenever σ 2 (gk ) < ∞ for k = 1, . . . , r,


 
νn (g1 )
 ..  L
 .  → N (0, Σg1 ,...,gr ),
νn (gr )

where Σg1 ,...,gr is the variance-covariance matrix


 2 
σ (g1 ) . . . σ(g1 , gr )
(∗) Σg1 ,...,gr =  .. .. ..
.
 
. . .
2
σ(g1 , gr ) . . . σ (gr )

Definition 8.4.2 Let ν be a Gaussian process indexed by G. Assume that for


each r ∈ N and for each finite collection {g1 , . . . , gr } ⊂ G, the r-dimensional
vector  
ν(g1 )
 .. 
 . 
ν(gr )
has a N (0, Σg1 ,...,gr )-distribution, with Σg1 ,...,gr defined in (*). We then call ν
the P -Brownian bridge indexed by G.
Definition 8.4.3 Consider νn and ν as bounded functions on G. We call G a
P -Donsker class if
L
νn → ν,
that is, if for all continuous and bounded functions f , we have

I (νn ) → Ef
Ef I (ν).

Definition 8.4.4 The process νn on G is called asymptotically continuous at


IP
g0 ∈ G if for all (possibly random) sequences {gn } ⊂ G with σ(gn − g0 ) → 0, we
have
IP
|νn (gn ) − νn (g0 )| → 0.
If this is true for all g0 ∈ G we call νn on G asymptotically continuous.
8.4. DONSKER CLASSES 59

We will use the notation

kgk2 := P g 2 , g ∈ L2 (P ),

i.e., k · k is the L2 (P )-norm.

Remark. Note that σ(g) ≤ kgk.


Definition 8.4.5 The class G is called totally bounded for the metric induced
by k · k, if its entropy H2 (·, G, P ) is finite.
Theorem 8.4.1 Suppose that G is totally bounded. Then G is a P -Donsker
class if and only if νn (as process on G) is asymptotically continuous.
u
t
60 CHAPTER 8. UNIFORM CENTRAL LIMIT THEOREMS
Chapter 9

Chaining and asymptotic


continuity

9.1 Chaining

We consider in this section the symmetrized empirical process and work condi-
tionally on X = (X1 , . . . , Xn ). We let IPX be the conditional distribution given
X. We describe the chaining technique in this context
Define the empirical norm p
kgkn := Pn g 2
and the empirical radius
Rn := sup kgkn .
g∈G
For notational convenience, we index the functions in G by a parameter θ ∈ Θ:
G = {gθ : θ ∈ Θ}. Let for s = 0, 1, 2, . . ., {gjs }N −s
j=1 ⊂ G be a minimal 2 Rn -
s

covering set of (G, k·kn ). So Ns = N2 (2−s Rn , G, Pn ), and for each θ, there exists
s } such that kg − g s k ≤ 2−s R . We use the parameter θ
a gθs ∈ {g1s , . . . , gNs θ θ n n
here to indicate which function in the covering set approximates a particular g.
We may choose gθ0 ≡ 0, since kgθ kn ≤ Rn . We let Hs := log Ns for all s.
Let S ∈ N to be fixed later and let for gjS+1 ∈ {g1S+1 , . . . , gN
S+1
S+1
},
 
S
gj,S := arg min kgjS+1 − gkS kn : gkS ∈ {g1S , . . . , gN
S
S
}

and for s ∈ {0, . . . , S − 1},


 
s
gj,S := arg min kgjs+1 − gks kn : gks ∈ {g1s , . . . , gN
s
s
} .

Then we can write for gθS+1 = gjS+1


S
X
s+1
gθ = (gj,S s
− gj,S ) + (gθ − gθS+1 ).
s=0

61
62 CHAPTER 9. CHAINING AND ASYMPTOTIC CONTINUITY

One can think of this as telescoping from gθ to gθS+1 , i.e. we follow a path taking
smaller and smaller steps. As S → ∞, we have max1≤i≤n |gθ (Xi )−gθS+1 (Xi )| →
0. The term Ss=0 (gj,S
s+1 s ) can be handled by exploiting the fact that as θ
P
− gj,S
varies, each summand involves only finitely many functions.

9.2 Increments of the symmetrized process

We define
S
X
2−s Rn
p
Jn := 2Hs+1 .
s=0

Remark We may use the bound


Z Rn p
Jn ≤ 2 2H2 (u, G, Pn )du.
2−(S+2) Rn

The right-hand side is called (up to constants) Dudley’s entropy integral.

Theorem 9.2.1 We have


 S r 
X
σ s+1 s Jn 1+t
IPX max | Pn (gj,S − gj,S )| ≥ √ + 4Rn ≤ 2 exp[−t] ∀ t > 0.
j n n
s=0

Proof. By Hoeffding’s inequality (Theorem 3.2.1), for all j and s


 r 
σ s+1 s −s 2t
IPX |Pn (gj,S − gj,S )| ≥ 2 Rn ≤ 2 exp[−t] ∀ t > 0.
n

Therefore (by the union bound), for all s


 r 
σ s+1 s −s 2(Hs+1 + t)
IPX max |Pn (gj,S − gj,S )| ≥ 2 Rn ≤ 2 exp[−t] ∀ t > 0.
j n

Fix t and let for s = 0, . . . , S


 
αs := 2−s Rn
p p
2Hs+1 + 2(1 + s)(1 + t) .

Then

S S
X X √
2−s Rn
p
αs = J n + 2(1 + s)(1 + t) ≤ Jn + 4Rn 1 + t.
s=0 s=0
9.3. DE-SYMMETRIZING 63

Therefore
 S r 
X
σ s+1 s Jn 1+t
IPX max | Pn (gj,S − gj,S )| ≥ √ + 4Rn
j n n
s=0
XS  
σ s+1 s
≤ IPX max |Pn (gj,S − gj,S )| ≥ αs
j
s=0
XS
≤ 2 exp[−(1 + s)(1 + t)]
s=0
≤ 2 exp[−t]

u
t

9.3 De-symmetrizing

Recall the (theoretical) L2 (P )-norm


p
kgk = P g 2 , g ∈ L2 (P ).

We let
R := sup kgk
g∈G

be the diameter of G.
For any probability measure Q, we let H2 (·, G, Q) be the entropy of G endowed
with the metric induced by the L2 (Q)-norm k · kQ .
Condition 9.3.1 For all probability measures Q it holds that

H2 (ukGkQ , G, Q) ≤ H(u), ∀ u > 0

for some decreasing function H(·) satisfying


Z 1p
J (R) := H(u)du < ∞.
0

We then define Z ρp
J (ρ) := 2 2H(u)du, ρ > 0.
0

Theorem 9.3.1 Suppose Condition 9.3.1 is met and G ∈ L2 (P ). Then


 r 
4kGkJ (4RkGk) 1+t
IP kPn − P kG > √ + 32R
n n
 
2 2
≤ 8 exp[−t] + 4IP sup |(Pn − P )g | > R ∀t>0
g∈G∪{G}
64 CHAPTER 9. CHAINING AND ASYMPTOTIC CONTINUITY


Proof. Take S as the smallest value in N such that 2−S ≤ 1/ n. Then

kgθ − gθS+1 kn ≤ 2−(S+1) Rn ≤ Rn /(2 n). On the set where Rn ≤ 2R and
kGkn ≤ 2kGk we have
Z Rn p
2 2H2 (u, G, Pn )du ≤ kGkJ (4RkGk)
2−(S+2) Rn

so for X in this set


 r 
σ kGkJ (4RkGk) 1+t
IPX kPn kG ≥ √ + 8R ≤ 2 exp[−t] ∀ t > 0.
n n
We can then de-symmetrize
 r 
4kGkJ (4RkGk) 1+t
IP kPn − P kG > √ + 32R
n n
 r 
σ kGkJ (4RkGk) 1+t
≤ 4IP kPn kG > √ + 8R .
n n
 
2 2
≤ 8 exp[−t] + 4IP sup |(Pn − P )g | > R .
g∈G∪{G}
u
t

9.4 Asymptotic continuity of the empirical process

Theorem 9.4.1 Assume Condition 9.3.1 and that G has envelope G, with
P (G2 ) < ∞. Then νn is asymptotically continuous.
Proof. Define for δ > 0 and g0 ∈ G
G(δ) = {g ∈ G : kg − g0 k ≤ δ}.

By Theorem 9.3.1
 r 
8kGkJ (8δkGk) 1+t
IP kPn − P kG(δ) > √ + 32δ
n n
 
2 2
≤ 8 exp[−t] + 4IP sup |(Pn − P )(g − g0 ) | > δ ∀ t > 0.
g∈G(δ)∪{2G+g0 }

Take  > 0 arbitrary and t = log(8/). Since J (8δkGk) ↓ 0 as δ ↓ 0, we see


that there is a δ > 0 such that

  
2 2
IP nkPn − P kG(δ) > ) ≤  + 4IP sup |(Pn − P )(g − g0 ) | > δ .
g∈G(δ)∪{2G+g0 }

By the ULLN for {(g − g0 )2 : g ∈ G(δ) ∪ {2G + g0 }},


 
IP sup |(Pn − P )(g − g0 )2 | > δ 2 → 0.
g∈G(δ)∪{2G+g0 }

u
t
9.5. APPLICATION TO VC GRAPH CLASSES 65

9.5 Application to VC graph classes

Theorem 9.5.1 Suppose that G is a VC class with envelope G ∈ L2 (P ). Then


{νn (g) : g ∈ G} is asymptotically continuous.
Proof. We recall Theorem 6.2.1:

N1 (δQG, G, Q) ≤ max(Aδ −2V , eδ/4 ), ∀ δ > 0.

Now note that for any two functions g and g̃ in G,


Z Z
2
(g − g̃) dQ ≤ |g − g̃|dQ̄,

where dQ̄ := 2GdQ. This gives

N2 (δkGkQ , G, Q) ≤ N1 (δ 2 kGk2Q , G, Q̄).

The measure Q̄ is perhaps not a probability measure, but since it is a finite


measure this only effects the constants. In other words, Condition 9.3.1 is met.
Apply Theorem 9.4.1 to finish the proof. u
t

Remark. In particular, suppose that a VC class G with square integrable


envelope G is parametrized by θ in some parameter space Θ ⊂ Rr , i.e. G =
{gθ : θ ∈ Θ}. Let zn (θ) = νn (gθ ). Question: do we have that for a (random)
sequence θn with θn → θ0 (in probability), also
IP
|zn (θn ) − zn (θ0 )| → 0?
IP
Indeed, if kgθ − gθ0 k → 0 as θ converges to θ0 , the answer is yes. In other words,
we need here mean square continuity of the map θ 7→ gθ .

9.6 Exercises

Exercise 9.6.1 Let 1 , . . . , n be i.i.d. N (0, 1) random variables, independent


of X1 , . . . , Xn . Define
n
1X
(, g)n := i g(Xi ), g ∈ G.
n
i=1

Let

X
2−s
p
J(Rn ) := 2H2 (u, G, Pn )du,
s=0
where we assume that the sum converges. Show that
 r 
J(Rn ) 1+t
IPX sup(, g)n ≥ √ + 4Rn ≤ exp[−t] ∀ t > 0.
g∈G n n
66 CHAPTER 9. CHAINING AND ASYMPTOTIC CONTINUITY
Chapter 10

Asymptotic normality of
M-estimators

Consider an M-estimator θ̂n of a finite dimensional parameter θ0 . We will give


conditions for asymptotic normality of θ̂n . It turns out that these conditions in
fact imply asymptotic linearity. Our first set of conditions include differentiabil-
ity in θ at each x of the loss function γθ (x). The proof of asymptotic normality
is then the easiest. In the second set of conditions, only differentiability in
quadratic mean of γθ is required.
The results of the previous chapter (asymptotic continuity) supply us with an
elegant way to handle remainder terms in the proofs.

In this chapter, we assume that θ0 is an interior point of Θ ⊂ Rr . Moreover,


we assume that we already showed that θ̂n is consistent.

10.1 Asymptotic linearity

Definition 10.1.1 The (sequence of ) estimator(s) θ̂n of θ0 is called asymptot-


ically linear if we may write
√ √
n(θ̂n − θ0 ) = nPn ` + oIP (1),

where  
`1
 .. 
` =  .  : X → Rr ,
`r

satisfies P ` = 0 and P (`2k ) < ∞, k = 1, . . . , r. The function ` is then called


the influence function. For the case r = 1, we call σ 2 := P `2 the asymptotic
variance.

67
68 CHAPTER 10. ASYMPTOTIC NORMALITY OF M-ESTIMATORS

Definition 10.1.2 Let θ̂n,1 and θ̂n,2 be two asymptotically linear estimators of
θ0 , with asymptotic variance σ12 and σ22 respectively. Then

σ22
e1,2 :=
σ12

is called the asymptotic relative efficiency (of θ̂n,1 as compared to θ̂n,2 ).

10.2 Conditions a, b and c for asymptotic normality

We start with 3 conditions a, b and c, which are easier to check but more
stringent. We later relax them to conditions aa, bb and cc.

Condition a. There exists an  > 0 such that θ 7→ γθ (x) is differentiable for


all |θ − θ0 | <  and all x, with derivative


ψθ (x) := γθ (x), x ∈ X .
∂θ

Condition b. We have as θ → θ0 ,

P (ψθ − ψθ0 ) = V (θ − θ0 ) + o(1)|θ − θ0 |,

where V ∈ Rr×r is a positive definite matrix.

Condition c. There exists an  > 0 such that the class

{ψθ : |θ − θ0 | < }

is asymptotically continuous at ψθ0 . Moreover,

lim kψθ − ψθ0 k = 0.


θ→θ0

Lemma 10.2.1 Suppose conditions a,b and c. Then θ̂n is asymptotically linear
with influence function
` = −V −1 ψθ0 ,
so √ L
n(θ̂n − θ0 ) → N (0, V −1 JV −1 ),
where
J = P ψθ0 ψθT0 .

Proof. Recall that θ0 is an interior point of Θ, and minimizes P γθ , so that


P ψθ0 = 0. Because θ̂n is consistent, it is eventually a solution of the score
equations
Pn ψθ̂n = 0.
10.2. CONDITIONS A, B AND C FOR ASYMPTOTIC NORMALITY 69

Rewrite the score equations as

0 = Pn ψθ̂n = Pn (ψθ̂n − ψθ0 ) + Pn ψθ0


= (Pn − P )(ψθ̂n − ψθ0 ) + P ψθ̂n + Pn ψθ0 .

Now, use condition b and the asymptotic continuity of {ψθ : |θ − θ0 | ≤ }, to


obtain
0 = oIP (n−1/2 ) + V (θ̂n − θ0 ) + o(|θ̂n − θ0 |) + Pn ψθ0 .
This yields
(θ̂n − θ0 ) = −V −1 Pn ψθ0 + oIP (n−1/2 ).
u
t
Example 10.2.1 (Huber estimator) Let X = R, Θ = R. The Huber esti-
mator corresponds to the loss function

γθ (x) = γ(x − θ),

with
γ(x) = x2 l{|x| ≤ k} + (2k|x| − k 2 )l{|x| > k}, x ∈ R.
Here, 0 < k < ∞ is some fixed constant, chosen by the statistician. We will
now verify Conditions a, b and c.
a) 
+2k
 if x − θ ≤ k
ψθ (x) = −2(x − θ) if |x − θ| ≤ k .

−2k if x − θ ≥ k

b) We have Z
d
ψθ dP = 2(F (k + θ) − F (−k + θ)),

where F (t) = P (X ≤ t), t ∈ R is the distribution function. So

V = 2(F (k + θ0 ) − F (−k + θ0 )).

c) Clearly ψθ : θ ∈ R is a VC graph class, with envelope Ψ ≤ 2k. The asymptotic


continuity follows from Theorem 9.5.1.
So the Huber estimator θ̂n has influence function

−k
 F (k+θ0 )−F (−k+θ0 ) if x − θ0 ≤ −k


x−θ0
`(x) = F (k+θ0 )−F (−k+θ0 ) if |x − θ0 | ≤ k .

k
if x − θ0 ≥ k


F (k+θ0 )−F (−k+θ0 )

The asymptotic variance is


R k+θ0
2
k 2 F (−k + θ0 ) + −k+θ0 (x − θ0 )2 dF (x) + k 2 (1 − F (k + θ0 ))
σ = .
(F (k + θ0 ) − F (−k + θ0 ))2
70 CHAPTER 10. ASYMPTOTIC NORMALITY OF M-ESTIMATORS

10.3 Asymptotics for the median

The sample median can be regarded as the limiting case of a Huber estimator,
with k ↓ 0. However, the loss function γθ (x) = |x − θ| is not differentiable, i.e.,
does not satisfy condition a. For even sample sizes, we do nevertheless have the
score equation F̂n (θ̂n ) − 21 = 0. Let us investigate this closer.

Let X ∈ R have distribution F , and let F̂n be the empirical distribution. The
population median θ0 is a solution of the equation

1
F (θ0 ) = .
2
We assume this solution exists and also that F has positive density f in a
neighbourhood of θ0 . Consider now for simplicity even sample sizes n and let
the sample median θ̂n be any solution of

F̂n (θ̂n ) = 0.

Then we get

0 = F̂n (θ̂n ) − F (θ0 )


h i h i
= F̂n (θ̂n ) − F (θ̂n ) + F (θ̂n ) − F (θ0 )
1 h i
= √ Wn (θ̂n ) + F (θ̂n ) − F (θ0 ) ,
n

where Wn := n(F̂n − F ) is the empirical process. Since F is continuous at
θ0 , and θ̂n → θ0 , we have by the asymptotic continuity of the empirical process
(use the VC-property of intervals) that Wn (θ̂n ) = Wn (θ0 ) + oIP (1). We thus
arrive at
√ h i
0 = Wn (θ0 ) + n F (θ̂n ) − F (θ0 ) + oIP (1)

= Wn (θ0 ) + n[f (θ0 ) + o(1)][θ̂n − θ0 ].

In other words,
√ Wn (θ0 )
n(θ̂n − θ0 ) = − + oIP (1).
f (θ0 )
So the influence function is
(
1
− 2f (θ 0)
if x ≤ θ0
`(x) = 1
,
+ 2f (θ 0)
if x > θ0

and the asymptotic variance is

1
σ2 = .
4f (θ0 )2
10.4. CONDITIONS AA, BB AND CC FOR ASYMPTOTIC NORMALITY71

We are now in the position to compare median and mean. It is easily seen that
the asymptotic relative efficiency of the mean as compared to the median is
1
e1,2 = ,
4σ02 f (θ0 )2

where σ02 = var(X). So e1,2 = π/2 for the normal distribution, and e1,2 = 1/2
for the double exponential (Laplace) distribution. The density of the double
exponential distribution is
" √ #
1 2|x − θ0 |
f (x) = p 2 exp − , x ∈ R.
2σ0 σ0

10.4 Conditions aa, bb and cc for asymptotic nor-


mality

We are now going to relax the condition of differentiability of γθ .

Condition aa. (Differentiability in quadratic mean.) There exists a function


ψ0 : X → Rr , with P ψ0,k
2 < ∞, k = 1, . . . , r, such that

γθ − γθ0 − (θ − θ0 )T ψ0
lim = 0.
θ→θ0 |θ − θ0 |

Condition bb. We have as θ → θ0 ,


1
P (γθ − γθ0 ) = (θ − θ0 )T V (θ − θ0 ) + o(1)|θ − θ0 |2 ,
2
with V ∈ Rr×r a positive definite matrix.

Condition cc. Define


T
θ −γθ0 −(θ−θ0 ) ψ0

|θ−θ0 | θ 6= θ0
gθ = .
0 θ = θ0

Suppose that for some  > 0, the class G := {gθ : 0 < |θ − θ0 | < } is
asymptotically continuous at 0.
Lemma 10.4.1 Suppose conditions aa, bb and cc are met. Then θ̂n has influ-
ence function
` = −V −1 ψ0 ,
and so
L
(θ̂n − θ0 ) → N (0, V −1 JV −1 ),
p

where J = P ψ0 ψ0T .
72 CHAPTER 10. ASYMPTOTIC NORMALITY OF M-ESTIMATORS

Proof. Since {gθ : |θ −θ0 | ≤ } is asymptotically continuous we have as θ → θ0

Pn (γθ − γθ0 )
= (Pn − P )(γθ − γθ0 ) + P (γθ − γθ0 )
= (Pn − P )gθ |θ − θ0 | + (θ − θ0 )T Pn ψ0 + P (γθ − γθ0 )

= oIP (1/ n)|θ − θ0 | + (θ − θ0 )T Pn ψ0 + P (γθ − γθ0 )
√ 1
= oIP (1/ n)|θ − θ0 | + (θ − θ0 )T Pn ψ0 + (θ − θ0 )T V (θ − θ0 ) + o(|θ − θ0 |2 )
2
1 1/2 √ 2
= V (θ − θ0 ) + o(|θ − θ0 |) + OIP (1/ n)
2

Because Pn (γθ̂n − γθ0 ) ≤ 0 the previous applied with θ = θ̂n gives |θ̂n − θ0 | =

OIP (1/ n). The previous applied to the sequence θ̃n := θ0 − V −1/2 Pn ψ0 gives
1
Pn (γθ̃n − γθ0 ) = − |V −1/2 Pn ψ0 |2 + oIP (1/n).
2
Because Pn (γθ̂n − γθ0 ) ≤ Pn (γθ̃n − γθ0 ) we get
1 1
(θ̂n − θ0 )T Pn ψ0 + (θ̂n − θ0 )T V (θ̂n − θ0 )+ ≤ − |V −1/2 Pn ψ0 |2 + oIP (1/n)
2 2
or
2
1 1/2
V (θ̂n − θ0 ) + V −1/2 Pn ψ0 = oIP (1/n).
2
Thus √
V 1/2 (θ̂n − θ0 ) = −V −1/2 Pn ψ0 + oIP (1/ n)
or √
θ̂n − θ0 = −V −1 Pn ψ0 + oIP (1/ n).
u
t

10.5 Exercises

Exercise 10.5.1 Suppose X has the logistic distribution with location param-
eter θ0 . Show that the maximum likelihood estimator has asymptotic variance
equal to 3, and the median has asymptotic variance equal to 4. Hence, the
asymptotic relative efficiency of the maximum likelihood estimator as compared
to the median is 4/3.

Exercise 10.5.2 Let (Xi , Yi ), i = 1, . . . , n, . . . be i.i.d. copies of (X, Y ), where


X ∈ Rr and Y ∈ R. Suppose that the conditional distribution of Y given X = x
has median m(x) = β00 + β10 x1 + . . . βr0 xr , with
 0
β0
 .. 
β =  .  ∈ Rr+1 .
0

βr0
10.5. EXERCISES 73

Assume that given X = x, the random variable Y − m(x) has a density f not
depending on x, with f positive in a neighbourhood of zero. Suppose moreover
that  
1 X
Σ=E
X XX T
exists and is invertible. Let
n
1X
β̂n = arg min |Yi − b0 − b1 Xi,1 − . . . − br Xi,r | ,
b∈Rr+1 n
i=1

be the least absolute deviations (LAD) estimator. Show that


 
L 1
n(β̂n − β 0 ) → N 0, 2 Σ−1 ,
4f (0)

by verifying conditions aa, bb and cc.


74 CHAPTER 10. ASYMPTOTIC NORMALITY OF M-ESTIMATORS
Chapter 11

Rates of convergence for LSEs

Probability inequalities for the least squares estimator (LSE) are obtained, un-
der conditions on the entropy of the class of regression functions. In the ex-
amples, we study smooth regression functions, functions of bounded variation,
concave functions, and image restoration. Results for the entropies of various
classes of functions is taken from the literature on approximation theory.

Let Y1 , . . . , Yn be real-valued observations, satisfying


Yi = g0 (xi ) + i , i = 1, . . . , n,
with x1 , . . . , xn (fixed) covariates in a space X , 1 , . . . , n independent errors
with expectation zero, and with the unknown regression function g0 in a given
class G of regression functions. The least squares estimator is
n
X
ĝn := arg min (Yi − g(xi ))2 .
g∈G
i=1

Throughout, we assume that a minimizer ĝn ∈ G of the sum of squares exists,


but it need not be unique.
The following notation will be used. The empirical measure of the covariates is
n
1X
Qn := δ xi .
n
i=1

For g a function on Z, we denote its squared L2 (Qn )-norm by


n
1X 2
kgk2n := g (xi ).
n
i=1

The empirical inner product between error and regression function is written
as
n
1X
(, g)n = i g(zi ).
n
i=1

75
76 CHAPTER 11. RATES OF CONVERGENCE FOR LSES

Finally, we let for δ > 0

G(δ) := {g ∈ G : kg − g0 kn ≤ δ}

denote a ball around g0 with radius δ, intersected with G.


Lemma 11.0.1 (Basic inequality). It holds that

kĝn − g0 k2n ≤ 2(, ĝn − g0 )n .

Proof. This is rewriting the inequality


n
X n
X
2
(Yi − ĝn (xi )) ≤ (Yi − g0 (xi ))2 .
i=1 i=1

u
t
The main idea to arrive at rates of convergence for ĝn is to invoke the basic
inequality. The modulus of continuity of the process {(, g−g0 )n : g ∈ G(δ)} can
be derived from the entropy H2 (·, G(δ), Qn ) of G(δ), endowed with the metric
induced the the norm k · kn .
Condition 11.0.1 For all δ > 0, the entropy integral
Z δp
J(δ) := 2 2H2 (u, G(δ), Qn )du
0

exists (i.e. is finite).

11.1 Gaussian errors

To simplify the exposition, will assume that

1 , . . . , n are i.i, d, N (0, 1)−distributed.

Then, as in Theorem 9.2.1


 
J(δ)
IP sup (, g − g0 )n ≥ √ + 4δ(1 + t) ≤ exp[−t] ∀ t > 0.
g∈G(δ) n

11.2 Rates of convergence

Theorem 11.2.1 Assume Condition 11.0.1 and that J(δ)/δ 2 is decreasing in


δ. Then for all t > 0 and for
√ 2 √
 
nδn ≥ 8 J(δn ) + 4δn 1 + t

it holds that
e
IP(kĝ − g0 kn > δn ) ≤ exp[−t].
e−1
11.3. EXAMPLES 77

Proof. We use the “peeling device”



X  
j−1 2
IP(kĝ − g0 kn > δn ) ≤ IP sup 2(, g − g0 )n ≥ (2 δn ) .
j=1 g∈G(2j δn )

The function √
J(2j δn ) + 42j δ 1 + t + j
j→
7
(2j δ)2
is the sum of two decreasing functions and hence is decreasing. So for all j ∈ N
√ √ √
J(2j δn ) + 42j δ 1 + t + j J(δn ) + 4δn 1 + t n
≤ ≤
(2j δn )2 δn2 8

so that r
1 j−1 2 1 j 2 J(2j δn ) 1+t+j
(2 δn ) = (2 δn ) ≥ √ + 42j δn
2 8 n n
It follows that

IP(kĝ − g0 kn > δn )
∞ r
J(2j δ)
 
X
j 1+t+j
≤ IP sup (, g − g0 )n ≥ √ + 42 δ
g∈G(2j δn ) n n
j=1
e
≤ exp[−(t + j)] = exp[−t].
e−1
u
t

11.3 Examples

Example 11.3.1 Linear regression Let

G = {g(x) = θ1 ψ1 (x) + . . . + θr ψr (x) : θ ∈ Rr }.

One may verify


δ + 4u
H2 (u, G(δ), Qn ) ≤ r log( ), for all 0 < u < δ, δ > 0.
u
So Z δ p √ Z δ
δ + 4u
J(δ) = 2 2H2 (u, G(δ), Qn )du ≤ 2 2r log1/2 ( )du
0 0 δ
√ Z 1 √
= 2 2rδ log1/2 (1 + 4v)dv := A0 rδ.
0
So Theorem 11.2.1 can be applied with
 r r 
r 1+t
δ n ≥ 8 A0 +4 .
n n
78 CHAPTER 11. RATES OF CONVERGENCE FOR LSES

It yields that
 r r !
r 1+t e
IP kĝn − g0 kn > 8 A0 +4 ≤ exp[−t] ∀ t > 0.
n n e−1

(Note that we made extensive use here from the fact that it suffices to calculate
the local entropy of G.)

Example 11.3.2 Smooth functions Let


 Z 1 
G = g : [0, 1] → R, |g (m) (x)|2 dx ≤ 1 .
0

Let k−1 T
R ψTk (x) = x , k = 1, . . . , m, ψ(x) = (ψ1 (x), . . . , ψm (x)) and Σn =
ψψ dQn . Denote the smallest eigenvalue of Σn by λn , and assume that

λn ≥ λ > 0, for all n ≥ n0 .

One can show (Kolmogorov and Tihomirov (1959)) that


1
H2 (δ, G(δ), Qn ) ≤ Aδ − m , for small δ > 0,

where the constant A depends on λ. Thus


1/2 1
J(δ) ≤ A0 δ 1− 2m

for some constant A0 . For


 m r 
m
2m+1 − 2m+1 1+t
δn ≥ 16 A0 n +4
n
we get
√ √
 
nδn2 ≥ 8 J(δn ) + 4δn 1 + t .

Hence, we find from Theorem 11.2.1 that


 m r !
m 1 + t
IP kĝn − g0 kn > 16 A02m+1 n− 2m+1 + 4
n
e
≤ exp[−t] ∀ t > 0.
e−1

Example 11.3.3 Functions of bounded variation in R Let


 Z 
G = g : R → R, |dg| ≤ 1 .

Without loss of generality, we may assume that x1 ≤ . . . ≤ xn . The derivative


should be understood in the generalized sense:
Z n
X
|dg| := |g(xi ) − g(xi−1 )|.
i=2
11.3. EXAMPLES 79

Define for g ∈ G, Z
ḡ := gdQn .

Then it is easy to see that,

max |g(xi )| ≤ ḡ + 1.
i=1,...,n

One can now show (Birman and Solomjak (1967)) that

H2 (δ, G(δ), Qn ) ≤ Aδ −1 , for small δ > 0,

and therefore, for some constant A0


 1 r !
− 1 1 + t e
IP kĝn − g0 kn > 16 A03 n 3 + 4 ≤ exp[−t] ∀ t > 0.
n e−1

Example 11.3.4 Concave functions Let

G = {g : [0, 1] → R, 0 ≤ ġ ≤ 1, ġ decreasing}.

Then G is a subset of
 Z 1 
g : [0, 1] → R, |dġ| ≤ 2 .
0

Birman and Solomjak (1967) prove that for all m ∈ {2, 3, . . .},
  Z 1 
1
H∞ δ, g : [0, 1] → [0, 1] : (m)
|g (x)|dx ≤ 1 ≤ Aδ − m , for all δ > 0.
0

Again, our class G is not uniformly bounded, but we can write for g ∈ G,

g = g1 + g2 ,

with g1 (x) := θ1 + θ2 x and kg2 k∞ ≤ 2. Assume now that n1 ni=1 (xi − x̄)2 stays
P
away from 0. Then, we obtain for a constant A0
 2 r !
2 1 + t e
IP kĝn − g0 kn > 16 A05 n− 5 + 4 ≤ exp[−t] ∀ t > 0.
n e−1

Example 11.3.5 Image restoration


Case (i). Let X ⊂ R2 be some subset of the plane. Each site x ∈ X has a
certain grey-level g0 (x), which is expressed as a number between 0 and 1, i.e.,
g0 (x) ∈ [0, 1]. We have noisy data on a set of n = n1 n2 pixels {xkl : k =
1, . . . , n1 , l = 1, . . . , n2 } ⊂ X :

Ykl = g0 (xkl ) + kl ,


80 CHAPTER 11. RATES OF CONVERGENCE FOR LSES

where the measurement errors {kl : k = 1, . . . , n1 , l = 1, . . . , n2 } are inde-


pendent N (0, 1) random variables. Now, each patch of a certain grey-level is a
mixture of certain amounts of black and white. Let

G = conv(K),

be the closed convex hull of K, where where

K := {lD : D ∈ D}.

Assume that
N2 (δ, K, Qn ) ≤ cδ −w , for all δ > 0.
Then from Ball and Pajor(1990),
2w
H2 (δ, G, Qn ) ≤ Aδ − 2+w , for all δ > 0.

It follows from Theorem 11.2.1 for a constant A0


 2+w r !
2+w 1 + t
IP kĝn − g0 kn > 16 A04+4W n− 4+4w + 4
n

e
≤ exp[−t] ∀ t > 0.
e−1

Case (ii). Consider a black-and-white image observed with noise. Let X =


[0, 1]2 be the unit square, and
(
1, if x is black,
g0 (x) =
0, if x is white.

The black part of the image is

D0 := {x ∈ [0, 1]2 : g0 (x) = 1}.

We observe
Ykl = g0 (xkl ) + kl ,
with xkl = (uk , vl ), xk = k/m, xl = l/m, k, l ∈ {1, . . . , m}. The total number
of pixels is thus n = m2 .
Suppose that
D0 ∈ D = {all convex subsets of [0, 1]2 },
and write
G := {lD : D ∈ D}.
Dudley (1984) shows that for all δ > 0 sufficiently small
1
H2 (δ, G, Qn ) ≤ Aδ − 2 ,
11.4. EXERCISES 81

so that for a constant A0


 2 r !
5 −5
2 1+t
IP kĝn − g0 kn > 16 A0 n + 4
n
e
≤ exp[−t] ∀ t > 0.
e−1
Let D̂n be the estimate of the black area, so that ĝn = lD̂n . For two sets D1 and
D2 , denote the symmetric difference by
D1 ∆D2 := (D1 ∩ D2c ) ∪ (D1c ∩ D2 ).
Since Qn (D) = klD k2n , we find
4
Qn (D̂n ∆D0 ) = OIP (n− 5 ).

Remark. In higher dimensions, say X = [0, 1]r , r ≥ 2, the class G of indicators


of convex sets has entropy
r−1
H2 (δ, G, Qn ) ≤ Aδ − 2 , δ ↓ 0,
provided that the pixels are on a regular grid (see Dudley (1984)). So the rate
is then ( 4
OIP (n− r+3 ) , if r ∈ {2, 3, 4},
Qn (D̂n ∆D0 ) =
? if r ≥ 5.
4
For r ≥ 5, the rate of convergence can still be shown to be OIP (n− r+3 ). One
needs to refine Theorem 11.2.1 for the case of a diverging entropy integral.

11.4 Exercises

Exercise 11.4.1 Let Y1 , . . . , Yn be independent, Gaussian random variables,


with EYi = α0 for i = 1, . . . , bnγ0 c, and EYi = β0 for i = bnγ0 c + 1, . . . , n,
where α0 , β0 and the change point γ0 are completely unknown. Write g0 (i) =
g(i; α0 , β0 , γ0 ) = α0 l{1 ≤ i ≤ bnγ0 c} + β0 l{bnγ0 c + 1 ≤ i ≤ n}. We call
the parameter (α0 , β0 , γ0 ) identifiable if α0 6= β0 and γ0 ∈ (0, 1). Let ĝn =
g(·; α̂n , β̂n , γ̂n ) be the least squares estimator. Show that if α0 , β0 , γ0 is identi-
√ √
fiable, then kĝn − g0 kn = OIP (1/ n), and |α̂n − α0 | = OIP (1/ n), |β̂n − β0 | =

OIP (1/ n), and |γ̂n − p γ0 | = OIP (1/n). If (α0 , β0 , γ0 ) is not identifiable, show
that kĝn − g0 kn = OIP ( log log n/n).

Exercise 11.4.2 Let xi = i/n, i = 1, . . . , n, and let G consist of the functions


(
α1 + α2 x, if x ≤ γ
g(x) = .
β1 + β2 x, if x > γ
Suppose g0 ∈ G is continuous, but does have a kink at γ0 : α1,0 = α2,0 = 0,

β1,0 = − 21 , β2,0 = 1, and γ0 = 21 . Show that kĝn − g0 kn = OIP (1/ n), and that
√ √
|α̂n − α0 | = OIP (1/ n), |β̂n − β0 | = OIP (1/ n) and |γ̂n − γ0 | = OP (1/n1/3 ).
82 CHAPTER 11. RATES OF CONVERGENCE FOR LSES

Exercise 11.4.3 If G is a uniformly bounded class of increasing functions, show


that it follows from Theorem 11.2.1 that kĝn −g0 kn = OIP (n−1/3 (log n)1/3 ). (By
a more tight bound on the entropy one finds the rate OIP (n−1/3 ) as in Example
11.3.3).
Chapter 12

Regularized least squares

We revisit the regression problem of the previous chapter. One has observations
{Yi }ni=1 , and fixed co-variables x1 , . . . , xn , where the response variables satisfy
the regression
Yi = g0 (xi ) + i , i = 1, . . . , n,
where 1 , . . . , n are independent and centred noise variables, and where g0 is
an unknown function on X . The errors are assumed to be N (0, σ 2 )-distributed.
Let Ḡ be a collection of regression functions. The regularized least squares
estimator is ( n )
1X 2
ĝn = arg min |Yi − g(xi )| + pen(g) .
g∈Ḡ n
i=1

Here pen(g) is a penalty on the complexity of the function g. Let Qn be the


empirical distribution of x1 , . . . , xn and k · kn be the L2 (Qn )-norm. Define

g∗ = arg min kg − g0 k2n + pen(g) .



g∈Ḡ

Our aim is to show that

I n − g0 k2n ≤ const. kg∗ − g0 k2n + pen(g∗ ) .



(∗) Ekĝ

When this aim is indeed reached, we loosely say that ĝn satisfies an oracle
inequality. In fact, what (*) says is that ĝn behaves as the noiseless version g∗ .
That means so to speak that we “overruled” the variance of the noise.
In Section 12.1, we recall the definitions of estimation and approximation error.
Section 12.2 calculates the estimation error when one employs least squares
estimation, without penalty, over a finite model class. The estimation error
turns out to behave as the log-cardinality of the model class. Section 12.3
shows that when considering a collection of nested finite models, a penalty
pen(g) proportional to the log-cardinality of the smallest class containing g will
indeed mimic the oracle over this collection of models. In Section 12.4, we
consider general penalties. It turns out that the (local) entropy of the model

83
84 CHAPTER 12. REGULARIZED LEAST SQUARES

classes plays a crucial rule. The local entropy a finite-dimensional space is


proportional to its dimension. For a finite class, the entropy is (bounded by)
its log-cardinality.
Whether or not (*) holds true depends on the choice of the penalty. When the
penalty is taken “too small” there will appear an additional term showing that
not all variance was “killed”. Section 12.5 presents an example.
Throughout this chapter, we assume the noise level σ > 0 to be known. In that
case, by a rescaling argument, one can assume without loss of generality that
σ = 1. In general, one needs a good estimate of an upper bound for σ, because
the penalties considered in this chapter depend on the noise level. When one
replaces the unknown noise level σ by an estimated upper bound, the penalty
in fact becomes data dependent.

12.1 Estimation and approximation error

Let G be a model class. Consider first the least squares estimator without
penalty
n
1X
ĝn (·, G) = arg min |Yi − g(xi )|2 .
g∈G n
i=1

If we have a collection of models {G}, a penalty is usually some measure of


the complexity of the model class G. With some abuse of notation, write this
penalty as pen(G). The corresponding penalty on the functions g is then

pen(g) = min pen(G).


G: g∈G

An estimator that makes a data-dependent choice among the possible models


is ( n )
1X
ĝn = arg min |Yi − ĝn (xi , G)|2 + pen(G) ,
G∈{G} n
i=1

where ĝn (·, G) is the least squares estimator over G. Let

g∗ (·, G) := arg min kg − g0 kn


g∈G

be the best approximation of g0 within the model G. Then kg∗ (·, G) − g0 k2n is
the (squared) approximation error if the model G is used. We define

g∗ = arg min {kg∗ (·, G) − g0 k2n + pen(G)},


G∈{G}

which trades off approximation error kg∗ (·, G)−g0 k2n against complexity pen(G).
As we will see, taking pen(G) proportional to (an estimate of) the estimation
error of ĝn (·, G) will (up to constants and possibly (log n)-factors) balance esti-
mation error and approximation error.
12.2. FINITE MODELS 85

12.2 Finite models

Let G be a finite collection of functions, with cardinality |G| ≥ 2. Consider the


least squares estimator over G
n
1X
ĝn = arg min |Yi − g(xi )|2 .
f ∈G n
i=1

In this section, G is fixed, and we do not explicitly express the dependency of


ĝn on G. Define
kg∗ − g0 kn = min kg − g0 kn .
g∈G

The dependence of g∗ on G is also not expressed in the notation of this section.


Alternatively stated, we take here
(
0 ∀g ∈ G
pen(g) = .
∞ ∀g ∈ Ḡ\G

The result of Lemma 12.2.1 below implies that the estimation error is pro-
portional to log |G|/n, i.e., it is logarithmic in the number of elements in the
parameter space. We present the result in terms of a probability inequality.
An inequality for e.g., the average excess risk follows from this (see Exercise
12.6.1).
Lemma 12.2.1 We have for all t > 0 and 0 < δ < 1,
  
2 1 2 4(log |G| + t)
IP kĝn − g0 kn ≥ (1 + δ)kg∗ − g0 kn + ≤ exp[−t].
1−δ nδ

Proof. We have the basic inequality


kĝn − g0 k2n ≤ 2(, ĝn − g∗ )n + kg∗ − g0 k2n .
For all t > 0, by the union bound
r !
(, g − g∗ )n 2(log |G| + t)
IP sup > ≤ exp[−t] ∀ t > 0.
g∈G, kg−g∗ kn >0 kg − g∗ kn n

q √
If (, ĝn − g∗ )n ≤ 2(log n|G|+t) kĝn − g∗ kn , we have, using 2 ab ≤ a + b for all
non-negative a and b,
r
2(log |G| + t)
kĝn − g0 k2n ≤ 2 kĝn − g∗ kn + kg∗ − g0 k2n
r n
 
2(log |G| + t)
≤ 2 kĝn − g0 kn + kg∗ − g0 kn + kg∗ − g0 k2n
n
4(log |G|/(nδ) + t)
≤ δkĝn − g0 k2n + + (1 + δ)kg∗ − g0 k2n .

u
t
86 CHAPTER 12. REGULARIZED LEAST SQUARES

12.3 Nested finite models

Let G1 ⊂ G2 ⊂ · · · be a collection of nested, finite models, and let Ḡ = ∪∞


m=1 Gm .
We assume log |G1 | > 1.
As indicated in Section 12.1, it is a good strategy to take the penalty propor-
tional to the estimation error. In the present context, this works as follows.
Define
G(g) = Gm(g) , m(g) = arg min{m : g ∈ Gm },
and for some 0 < δ < 1,
16 log |G(g)|
pen(g) = .

In coding theory, this penalty is quite familiar: when encoding a message using
an encoder from Gm , one needs to send, in addition to the encoded message,
log2 |Gm | bits to tell the receiver which encoder was used.
Let
n
( )
1X
ĝn = arg min |Yi − g(xi )|2 + pen(g) ,
g∈Ḡ n
i=1
and
g∗ = arg min kg − g0 k2n + pen(g) .

g∈Ḡ

Lemma 12.3.1 We have, for all t > 0 and 0 < δ < 1,


  
2 1 2 2t
IP kĝn − g0 kn > (1 + δ)kg∗ − g0 kn + pen(g∗ ) + ≤ exp[−t].
1−δ nδ

Proof. Write down the basic inequality


kĝn − g0 k2n + pen(ĝn ) ≤ 2(, ĝn − g∗ )n + kg∗ − g0 k2n + pen(g∗ ).
We invoke a peeling device. Define Ḡj = {g : 2j < | log G(g)| ≤ 2j+1 }, j =
0, 1, . . .. We have for all t > 0,
r !
2(4 log |G(g)| + t)
IP ∃ g ∈ Ḡ : (, g − g∗ )n > kg − g∗ kn
n

r !
X 2(42j + t)
≤ IP ∃g ∈ Ḡj , (, g − g∗ )n > kg − g∗ kn
n
j=0

r !
X 2(log |Ḡj | + 2j+1 + t)
≤ IP ∃g ∈ Ḡj , (, g − g∗ )n > kg − g∗ kn
n
j=0

X
≤ exp[−(2j+1 + t)]
j=0
X∞ Z ∞
≤ exp[−(j + 1 + t)] ≤ exp[−(x + t)] = exp[−t].
j=0 0
12.4. GENERAL PENALTIES 87

But if r
2(4 log |G(ĝn )| + t)
(, ĝn − g∗ )n ≤ kĝn − g∗ kn
n
the Basic inequality gives

kĝn − g0 k2n
r
2(4 log |G(ĝn )| + t)
≤ 2 kĝn − g∗ kn
n
+ kg∗ − g0 k2n + pen(g∗ ) − pen(ĝn )
r  
2(4 log |G(ĝn )| + t)
≤ 2 kĝn − g0 kn + kg∗ − g0 kn
n
+ kg∗ − g0 k2n + pen(g∗ ) − pen(ĝn )
4(4 log |G(ĝn )| + t)
≤ δkĝn − g0 k2n + − pen(ĝn )

+ (1 + δ)kg∗ − g0 k2n + pen(g∗ )
4t
= δkĝn − g0 k2n + (1 + δ)kg∗ − g0 k2n + pen(g∗ ) + ,

by the definition of pen(g). u
t

12.4 General penalties

In the general case with possibly infinite model classes G, we may replace the
log-cardinality of a class by its entropy.

Recall the definition of the estimator


( n )
1X
ĝn = arg min |Yi − g(xi )|2 + pen(g) ,
g∈Ḡ n
i=1

and of the noiseless version

g∗ = arg min kg − g0 k2n + pen(g) .



g∈Ḡ

We moreover define

τ 2 (g) := kg − g0 k2n + pen(g), g ∈ Ḡ,

and
G(δ) = {g ∈ Ḡ : τ 2 (g) ≤ δ 2 }, δ > 0.
Consider the entropy H(·, G(δ), Qn ) of G(δ). Suppose it is finite for each δ, and
in fact that the square root of the entropy is integrable:
Condition 12.4.1 One has
Z 2δ p
J(δ) := 2 2H(u, G(δ), Qn )du < ∞, ∀ δ > 0. (∗∗)
0
88 CHAPTER 12. REGULARIZED LEAST SQUARES

This means that near u = 0, the entropy H(u, G(δ), Qn ) is not allowed to grow
faster than 1/u2 .
Theorem 12.4.1 Assume Condition 12.4.1. Suppose that J(δ)/δ 2 is decreas-
ing function of δ. Then for all t > 0 and
 r 
J(δn ) 1+t
(•) δn2 2
≥ 4τ (g∗ ) + 8 √ + 8δn
n n

we have
e
IP(τ (ĝn ) > δn ) ≤ exp[−t].
e−1

Proof. Since δn ≥ τ (g∗ ) we know that when g ∈ G(2j δn )

kg − g∗ kn ≤ kg − g0 kn + kg∗ − g0 kn ≤ τ (g) + τ (g∗ ) ≤ 2j δn + δn ≤ 2j+1 δn .

By the Basic inequality


IP(τ (ĝn ) > δn ) ≤

X  
IP sup 2(, g − g∗ )n ≥ (2j−1 δn )2 − τ 2 (g∗ ) .
j=1 g∈G(2j δn )

We can apply the same arguments as in the proof of Theorem 11.2.1: since
 r 
J(δn ) 1+t
δn2 2
≥ 4τ (g∗ ) + 8 √ + 8δn
n n

and δ 7→ J(δ)/δ 2 is decreasing, it holds for all j ∈ N


r
J(2j δn )
 
1+t+j
(2j δn )2 ≥ 4τ 2 (g∗ ) + 8 √ + 82j δn
n n

so that r
jδ )
 
j−1 2 2 J(2 n j 1+t+j
(2 δn ) − τ (g∗ ) ≥ 2 √ + 82 δn
n n
and hence

X  
j−1 2 2
IP sup 2(, g − g∗ )n ≥ (2 δn ) − τ (g∗ )
j=1 g∈G(2j δn )
∞ r
J(2j δn )
  
X
j 1+t+j
≤ IP sup 2(, g − g∗ )n ≥ 2 √ + 82 δn
g∈G(2j δn ) n n
j=1

X e
≤ exp[−(t + j)] = exp[−t].
e−1
j=1

u
t
12.5. APPLICATION TO THE “CLASSICAL” PENALTY 89

12.5 Application to the “classical” penalty

Suppose X = [0, 1]. Let Ḡ be the class of functions on [0, 1] which have deriva-
tives of all orders. The m-th derivative of a function g ∈ Ḡ on [0, 1] is denoted
by g (m) . Define for a given 1 ≤ p < ∞, and given smoothness m ∈ {1, 2, . . .},
Z 1
I p (g) = |g (m) (x)|p dx, g ∈ Ḡ.
0

We consider two cases. In Subsection 12.5.1, we fix a tuning parameter λ > 0


and take the penalty pen(g) = λ2 I p (g). After some calculations, we then show
that in general the variance has not been “overruled”, i.e., we do not arrive at an
estimator that behaves as a noiseless version, because there still is an additional
term. However, this additional term can now be “killed” by including it in the
penalty. It all boils down in Subsection 12.5.2 to a data dependent choice for λ,
2
or alternatively viewed, a penalty of the form pen(g) = λ̃2 I 2m+1 (g), with λ̃ > 0
depending on m and n. This penalty allows one to adapt to small values for
I(g0 ).

12.5.1 Fixed smoothing parameter

For a function g ∈ Ḡ, we define the penalty


pen(g) = λ2 I p (g),
with a given λ > 0.
Lemma 12.5.1 The entropy integral J can be bounded by
r !
2pm+2−p
− 1 1
J(δ) ≤ A0 δ 2pm λ pm + δ log( ∨ 1) δ > 0.
λ
Here, A0 is a constant depending on m and p.
Proof. This follows from the fact that
H∞ (u, {g ∈ Ḡ : I(g) ≤ 1, |g| ≤ 1}) ≤ Au−1/m , u > 0
where the constant A depends on m and p (see Birman and Solomjak (1967)).
For g ∈ G(δ), we have
 2
δ p
I(g) ≤ ,
λ
and
kg − g∗ kn ≤ δ.
We therefore may write g ∈ G(δ) as g = g1 + g2 , with |g1 | ≤ I(g1 ) = I(g) and
kg2 − g∗ kn ≤ δ + I(g). It is now not difficult to show that for some constant A1
  2  !
δ pm − 1 δ
H2 (u, G(δ), Qn ) ≤ A1 u m + log , 0 < u < δ.
λ (λ ∧ 1)u
u
t
90 CHAPTER 12. REGULARIZED LEAST SQUARES

Corollary 12.5.1 By applying Lemma 12.5.1 , we find that for some constant
c1 ,
kĝn − g0 k2n + λ2 I p (ĝn ) ≤ 4 min{kg − g0 k2n + λ2 I p (g)}
g
2pm !
1
∨1
 
1 2pm+p−2 log λ
+OIP 2 + .
nλ pm n

12.5.2 Overruling the variance in this case

For choosing the smoothing parameter λ, the above suggests the penalty
(   2pm )
C0 2pm+p−2
pen(g) = min λ2 I p (g) + 2 ,
λ
nλ pm

with C0 a suitable constant. The minimization within this penalty yields


2m 2
pen(g) = C00 n− 2m+1 I 2m+1 (g),

where C00 depends on C0 and m. From the computational point of view (in
particular, when p = 2), it may be convenient to carry out the penalized least
squares as in the previous subsection, for all values of λ, yielding the estimators
( n )
1X
ĝn (·, λ) = arg min |Yi − g(xi )|2 + λ2 I p (g) .
g n
i=1

Then the estimator with the penalty of this subsection is ĝn (·, λ̂n ), where
( n   2pm )
1X C0 2pm+p−2
λ̂n = arg min |Yi − ĝn (xi , λ)|2 + 2 . .
λ>0 n nλ pm
i=1

One arrives at the following corollary.


Corollary 12.5.2 For an appropriate, large enough, choice of C00 depending
on p and m, we have
2m 2
kĝn − g0 k2n + C00 n− 2m+1 I 2m+1 (ĝn )
n 2m 2
o
≤ 4 min kg − g0 k2n + C00 n− 2m+1 I 2m+1 (g) + OIP (1/n).
g

Thus, the estimator adapts to small values of I(g0 ). For example, when m = 1
and I(g0 ) = 0 (i.e., when g0 is the constant function), the excess risk of the
estimator converges withPparametric rate 1/n. If we knew that g0 is constant, we
would of course use the ni=1 Yi /n as estimator. Thus, this penalized estimator
mimics an oracle.
12.6. EXERCISES 91

12.6 Exercises

Exercise 12.6.1 Using the formula


Z ∞
EZ = IP(Z ≥ t)dt
0

for a non-negative random variable Z, derive bounds for the average excess risk
I n − g0 k2n of the estimator considered in this chapter.
Ekĝ

You might also like