LIVIU I NICOLAESCU - A Graduate Course in Probability-World Scientific Publishing (2023)
To my mom.
Introduction
chains. The chapter ends with a brief discussion of Markov Chain Monte Carlo methods.
The last chapter of the book is the shortest and is devoted to the classical ergodic
theorems. I have included it because I felt I owed it to the reader to highlight a
principle that unifies and clarifies the main limit theorems in Chapters 2 and 4.
As the title indicates, this book is meant as an introduction to the modern, i.e.,
post Kolmogorov’s axiomatization, theory of probability. The reader is assumed
to have some familiarity with measure theory and integration and be comfortable
with the basic objects and concepts of modern analysis: metric/topological spaces,
convergence, compactness. In a few places, familiarity with basic concepts of func-
tional analysis is assumed. It could serve as a textbook for a year-long basic graduate
course in probability. With this purpose in mind I have included a relatively large
number of exercises, many of them nontrivial and highlighting aspects I did not
include in the main body of the text.
The book grew out of notes of a one-semester graduate course in probability
that I taught at the University of Notre Dame. That course covered Chapter 1, the
classical limit theorems (Secs. 2.1–2.3) and discrete time martingales (Secs. 3.1–
3.2). Some of the proofs appear in fine print as a suggestion to the potential
student/instructor that they can be skipped at a first encounter with this subject.
Work on this book has been my constant happy companion during these improb-
able times. I hope I was able to convey my curiosity, fascination and enthusiasm
about probability and convince some readers to dig deeper into this intellectually
rewarding subject.
I want to thank World Scientific for a most professional, helpful and pleasant
collaboration over the years.
Contents
Introduction vii
1. Foundations 1
1.1 Measurable spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.1 Sigma-algebras . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.2 Measurable maps . . . . . . . . . . . . . . . . . . . . . . . 6
1.2 Measures and integration . . . . . . . . . . . . . . . . . . . . . . . 13
1.2.1 Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.2.2 Independence and conditional probability . . . . . . . . . . 21
1.2.3 Integration of measurable functions . . . . . . . . . . . . . 31
1.2.4 Lp spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
1.2.5 Measures on compact metric spaces . . . . . . . . . . . . . 39
1.3 Invariants of random variables . . . . . . . . . . . . . . . . . . . . . 41
1.3.1 The distribution and the expectation of a random variable 41
1.3.2 Higher order integral invariants of random variables . . . . 48
1.3.3 Classical examples of discrete random variables . . . . . . . 53
1.3.4 Classical examples of continuous probability distributions . 64
1.3.5 Product probability spaces and independence . . . . . . . . 69
1.3.6 Convolutions of Borel measures on the real axis . . . . . . 77
1.3.7 Modes of convergence of random variables . . . . . . . . . 83
1.4 Conditional expectation . . . . . . . . . . . . . . . . . . . . . . . . 91
1.4.1 Conditioning on a sigma sub-algebra . . . . . . . . . . . . . 92
1.4.2 Some applications of conditioning . . . . . . . . . . . . . . 102
1.4.3 Conditional independence . . . . . . . . . . . . . . . . . . . 110
1.4.4 Kernels and regular conditional distributions . . . . . . . . 111
1.4.5 Disintegration of measures . . . . . . . . . . . . . . . . . . 120
1.5 What are stochastic processes? . . . . . . . . . . . . . . . . . . . . 123
1.5.1 Definition and examples . . . . . . . . . . . . . . . . . . . . 123
1.5.2 Kolmogorov’s existence theorem . . . . . . . . . . . . . . . 127
3. Martingales 259
3.1 Basic facts about martingales . . . . . . . . . . . . . . . . . . . . . 260
3.1.1 Definition and examples . . . . . . . . . . . . . . . . . . . . 260
3.1.2 Discrete stochastic integrals . . . . . . . . . . . . . . . . . . 266
3.1.3 Stopping and sampling: discrete time . . . . . . . . . . . . 269
3.1.4 Applications of the optional sampling theorem . . . . . . . 274
3.1.5 Concentration inequalities: martingale techniques . . . . . 280
3.2 Limit theorems: discrete time . . . . . . . . . . . . . . . . . . . . . 286
3.2.1 Almost sure convergence . . . . . . . . . . . . . . . . . . . 286
3.2.2 Uniform integrability . . . . . . . . . . . . . . . . . . . . . 293
3.2.3 Uniformly integrable martingales . . . . . . . . . . . . . . . 298
3.2.4 Applications of the optional sampling theorem . . . . . . . 304
3.2.5 Uniformly integrable submartingales . . . . . . . . . . . . . 310
3.2.6 Maximal inequalities and Lp -convergence . . . . . . . . . . 317
3.2.7 Backwards martingales . . . . . . . . . . . . . . . . . . . . 322
3.2.8 Exchangeable sequences of random variables . . . . . . . . 325
3.3 Continuous time martingales . . . . . . . . . . . . . . . . . . . . . 333
3.3.1 Generalities about filtered processes . . . . . . . . . . . . . 333
Bibliography 525
Index 533
Chapter 1
Foundations
At the beginning of the twentieth century probability was in a fluid state. There
was no clear mathematical concept of probability, and ad-hoc methods were used to
rigorously formulate classical questions. Probability at that stage was a collection of
interesting problems in search of a coherent setup. According to Jean Ville, a PhD
student of M. Fréchet, in Paris probability was viewed among mathematicians as “an
honorable pastime for those who distinguished themselves in pure mathematics”.
The whole enterprise seemed to be concerned with concepts that lie outside
mathematics. Henri Poincaré himself wrote that “one can hardly give a satisfactory
definition of probability”. As Richard von Mises pointed out in 1928, the German
word for probability, “wahrscheinlich”, translates literally as “truth resembling”;
see [155]. Bertrand Russell was quoted as saying in 1929 that “Probability is the
most important concept in modern science, especially as nobody has the slightest
notion of what it means”. The philosophical underpinnings of this concept are
discussed even today. For more on this aspect we refer to the recent delightful
book [45].
In his influential 1900 International Congress address in Paris D. Hilbert recog-
nized this state of affairs and the importance of the subject. In the sixth problem of
his famous list of 23 he asks, among other things, for rigorous foundations of prob-
ability. These were laid by A. N. Kolmogorov in his famous 1933 monograph [94].
According to Kolmogorov himself, this was not a research work, but a work of syn-
thesis. A brilliant synthesis I might add. His point of view was universally adopted
and modern probability theory was born. The theory of probability can now be
informally divided into two eras: before and after Kolmogorov.
The present chapter is devoted to this foundational work of Kolmogorov. The
pillars of probability theory are the concept of probability or sample space, ran-
dom variables, independence, conditional expectations, and consistency, i.e., the
existence of random variables or processes with prescribed statistics.
So efficient is his axiomatization that to the untrained eye, probability, as envis-
aged by Kolmogorov, may seem like a slice of measure theory. In a 1963 interview
Kolmogorov complained that his axioms have been so successful on the theoretic
side that many mathematicians lost interest in the problems and applications that
were and are the main engines of growth of this subject. I understand his criti-
cism since I too was one of those mathematicians that was not interested in these
applications. Now I know better.
In this chapter I present these pillars of probability theory and prove their main
properties. I have included a large number of detailed examples meant to convey
the subtleties, depth, power and richness of these concepts. No abstract theorem
can capture this richness.
I want to close with a personal anecdote that I find revealing. A few years
ago, at a conference, I had a conversation with J. M. Bismut, a known probabilist
whose mathematical interests were becoming more and more geometric. He noticed
that I was in the middle of a mathematical transition in the opposite direction
and asked me what prompted it. I explained my motivation, how I discovered
that probability is not just a glorious part of measure theory and how much I
struggled to truly understand the concept of conditional expectation, a concept
eminently probabilistic. He smiled and said: “Probability theory is measure theory
plus conditional expectation”. I know it is an oversimplification, but it contains a
lot of truth.
1.1.1 Sigma-algebras
Fix a nonempty set Ω.
Definition 1.1. (a) A collection A of subsets of Ω is called an algebra of Ω if it
satisfies the following conditions
(i) ∅, Ω ∈ A.
(ii) ∀A, B ∈ A, A ∪ B ∈ A.
(iii) ∀A ∈ A, Ac ∈ A.
We will refer to it as the Bernoulli algebra with success A. Note that S_A is the pullback of 2^{0,1} via the indicator function I_A : Ω → {0, 1}.
(d) If C ⊂ 2Ω is a family of subsets of Ω, then we denote by σ(C) the σ-algebra
generated by C, i.e., the intersection of all σ-algebras that contain C. In particular,
if S1 , S2 are σ-algebras of Ω, then we set
S1 ∨ S2 := σ(S1 ∪ S2 ).
More generally, for any family (S_i)_{i∈I} of σ-algebras we set
    ⋁_{i∈I} S_i := σ( ⋃_{i∈I} S_i ).
The sets A_n are called the chambers of the partition. Then the σ-algebra generated by this partition consists of all the subsets of Ω that are unions of chambers. This σ-algebra can be viewed as the σ-algebra generated by the map
    X : Ω → N,  X = Σ_{n∈N} n I_{A_n},
so that A_n = X^{-1}({n}).
(f) If (S_i)_{i∈I} is a family of (σ-)algebras of Ω, then their intersection
    ⋂_{i∈I} S_i ⊂ 2^Ω
is a (σ-)algebra of Ω.
(g) If (Ω_1, S_1) and (Ω_2, S_2) are two measurable spaces, then we denote by S_1 ⊗ S_2 the sigma-algebra of Ω_1 × Ω_2 generated by the collection
    { S_1 × S_2 : S_1 ∈ S_1, S_2 ∈ S_2 } ⊂ 2^{Ω_1 × Ω_2}.
(h) If X is a topological space and T_X ⊂ 2^X denotes the family of open subsets, then the Borel σ-algebra of X, denoted by B_X, is the σ-algebra generated by T_X. The sets in B_X are called the Borel subsets of X. Note that since any open set in R^n is a countable union of open cubes we have
    B_{R^n} = B_R^{⊗n}. (1.1.2)
Any finite dimensional real vector space V can be equipped with a topology by
choosing a linear isomorphism L : V → Rdim V . This topology is independent of
the choice of the isomorphism L. It can be alternatively identified as the smallest
topology on V such that all the linear maps V → R are continuous. We denote by
BV the sigma-algebra of Borel subsets determined by this topology.
We set R̄ = [−∞, ∞]. As a topological space it is homeomorphic to [−1, 1]. For
simplicity we will refer to the Borel subsets of R̄ simply as Borel sets.
(i) If (Ω, S) is a measurable space and X ⊂ Ω, then the collection
    S|_X := { S ∩ X : S ∈ S } ⊂ 2^X
Remark 1.4. In measure theory and analysis, sigma-algebras lie in the background and rarely come to the forefront. In probability they play a more prominent role, having to do with how they are interpreted.
One should think of Ω as the collection of all the possible outcomes of a random
experiment. A σ-algebra of Ω can be viewed as the totality of information we can
collect using certain measurements about the outcomes ω ∈ Ω. Let us explain this
vague statement on a simple example.
Suppose we are given a function X : Ω → R and the only thing that we can absolutely confirm about the outcome ω of an experiment is whether X(ω) ≤ x, for any given x ∈ R. In other words, we can detect by measurements the collection of sets
of sets
    {X ≤ x} := X^{-1}((−∞, x]),  x ∈ R.
In particular, we can detect whether X(ω) > x, i.e., we can detect the sets
{X > x} = {X ≤ x}c . More generally, we can determine the sets
{a < X ≤ b} = {X > a} ∩ {X ≤ b}.
Indeed, we can do this using two experiments: one experiment to decide if X ≤ a and one to decide if X ≤ b.
We say that a set S is X-measurable if given ω ∈ Ω we can decide by doing
countably many measurements on X whether ω ∈ S. If S1 , . . . , Sn , . . . ⊂ Ω are
known to be X-measurable, then their union is X-measurable. Indeed,
    ω ∈ ⋃_{n∈N} S_n ⟺ ∃n ∈ N : ω ∈ S_n.
Let us observe that the set theoretic conditions imposed on a sigma-algebra have
logical/linguistic counterparts. Thus, the statement
    ω ∈ ⋂_{i∈I} S_i
(i) ∅, Ω ∈ C.
(ii) if A, B ∈ C and A ⊂ B, then B \ A ∈ C.
(iii) If A1 ⊂ A2 ⊂ · · · belong to C, then so does their union.
□
Proof. Since any σ-algebra is a λ-system we deduce Λ(P) ⊂ σ(P). Thus it suffices
to show that
σ(P) ⊂ Λ(P). (1.1.3)
Equivalently, it suffices to show that Λ(P) is a σ-algebra. This happens if and only
if the λ-system Λ(P) is also a π-system. Hence it suffices to show that Λ(P) is closed
under (finite) intersections.
Fix A ∈ Λ(P) and set
    L_A := { B ∈ 2^Ω : A ∩ B ∈ Λ(P) }.
Example 1.10. (a) The composition of two measurable maps is a measurable map.
(b) A subset S ⊂ Ω is S-measurable if and only if the indicator function I S is a
measurable function.
(c) If A is the σ-algebra generated by a finite or countable partition
    Ω = ⊔_{i∈I} A_i,  I ⊂ N,
Proof. Clearly (i) ⇒ (ii). The opposite implication follows from the π–λ theorem since the set
    { C ∈ S_2 : F^{-1}(C) ∈ S_1 }
Proof. It follows from the previous corollary by observing that the collection
    { (−∞, x] : x ∈ R } ⊂ 2^R
Proof. (i) ⇒ (ii) Observe that if the maps F1 , F2 are measurable then
    F_1^{-1}(S_1), F_2^{-1}(S_2) ∈ S,  ∀S_1 ∈ S_1, S_2 ∈ S_2
Definition 1.15. For any measurable space (Ω, S) we denote by L0 (S) = L0 (Ω, S)
the space of S-measurable random variables, i.e., (S, BR̄ )-measurable functions
Ω → R̄.
The subset of L0 (Ω, S) consisting of nonnegative functions is denoted by
L^0_+(Ω, S), while the subspace of L^0(Ω, S) consisting of bounded measurable functions is denoted L^∞(Ω, S). □
Proof. (i) Denote by D the subset of R̄^2 consisting of the pairs (x, y) for which x + y is well defined. Observe that X + Y is the composition of two measurable maps
    Ω → D ⊂ R̄^2,  ω ↦ (X(ω), Y(ω)),    and    D → R̄,  (x, y) ↦ x + y.
Above, the first map is measurable according to Corollary 1.14 and the second map
is Borel measurable since it is continuous. The measurability of XY and cX is
established in a similar fashion.
(ii) We will show that for any x ∈ R the set {X_∞ > x} is S-measurable. Note that
    X_∞(ω) > x ⟺ ∀ν ∈ N, ∃N = N(ω) ∈ N : ∀n ≥ N : X_n(ω) > x + 1/ν.
Equivalently,
    {X_∞ > x} = ⋂_{ν∈N} ⋃_{N∈N} ⋂_{n≥N} {X_n > x + 1/ν} ∈ S.
(iii) The proof is very similar to the proof of (ii) so we leave the details to the reader. □
Corollary 1.18. For any function f ∈ L^0(Ω, S), its positive and negative parts,
    f^+ := max(f, 0),  f^− := max(−f, 0),
belong to L^0_+(Ω, S).
and its range is contained in R0 + R1 . This is a finite set since R0 , R1 are finite.
Clearly the multiplication of an elementary function by a scalar also produces an
elementary function.
Let us observe that any nonnegative measurable function is the limit of an
increasing sequence of elementary functions. For n ∈ N we define
    D_n : [0, ∞) → [0, ∞),  D_n(r) := Σ_{k=1}^{n2^n} ((k − 1)/2^n) I_{[(k−1)2^{−n}, k2^{−n})}(r).
Let us observe that if r ∈ [0, n], then D_n(r) truncates the binary expansion of r after n digits. E.g., if r ∈ [0, 1) and
    r = 0.ε_1 ε_2 . . . ε_n . . . := Σ_{k=1}^∞ ε_k/2^k,  ε_k ∈ {0, 1},
then
    D_n(r) = 0.ε_1 . . . ε_n.
This shows that (D_n)_{n∈N} is a nondecreasing sequence of functions and
    lim_{n→∞} D_n(r) = r, ∀r ≥ 0.
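The dyadic truncations D_n are easy to experiment with. The sketch below is our own illustration (the function `D` plays the role of D_n); it checks monotonicity in n and the convergence D_n(r) → r at a sample point.

```python
import math

def D(n: int, r: float) -> float:
    """Dyadic truncation D_n: equals (k-1)/2^n on [(k-1)/2^n, k/2^n)
    for k = 1, ..., n*2^n, and 0 for r >= n (outside the sum's range)."""
    if r < 0 or r >= n:
        return 0.0
    return math.floor(r * 2**n) / 2**n

r = math.pi / 4  # an arbitrary point of [0, 1)
approx = [D(n, r) for n in range(1, 30)]
assert all(a <= b for a, b in zip(approx, approx[1:]))  # nondecreasing in n
assert 0 <= r - approx[-1] < 2**-29                     # D_n(r) -> r from below
```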
(i) I Ω ∈ M.
(ii) If f, g ∈ M are bounded and a, b ∈ R, then af + bg ∈ M.
(iii) If (fn ) is an increasing sequence of nonnegative random variables in M with
finite limit f∞ , then f∞ ∈ M.
□
Proof. Clearly, (ii) ⇒ (i). To prove that (i) ⇒ (ii) consider the family M of σ(F)-measurable functions of the form X′ ◦ F, X′ ∈ L^0(Ω′, S′). We will prove that M = L^0(Ω, σ(F)). We will achieve this using the monotone class theorem.
Step 1. I Ω ∈ M.
Step 2. M is a vector space. Indeed, if X, Y ∈ M and a, b ∈ R, then there exist S′-measurable functions X′, Y′ such that
    X = X′ ◦ F,  Y = Y′ ◦ F,  aX + bY = (aX′ + bY′) ◦ F.
Hence aX + bY ∈ M.
Step 3. I_A ∈ M, ∀A ∈ σ(F). Indeed, since A ∈ σ(F) there exists A′ ∈ S′ such that
    A = F^{-1}(A′),
so I_A = I_{A′} ◦ F. Hence M contains all the σ(F)-measurable elementary functions.
Define
    Ω″ := { ω′ ∈ Ω′ : the limit lim_{n→∞} X′_n(ω′) exists and is finite }.
Remark 1.25. We see that, in its simplest form, Corollary 1.24 describes a mea-
sure theoretic form of functional dependence. Thus, if in a given experiment we
can measure the quantities X_1, . . . , X_n and we know that the statement X ≤ c can be decided only by measuring the quantities X_1, . . . , X_n, then X is in fact a (measurable) function of X_1, . . . , X_n. In plain English this sounds tautological. In particular, this justifies the choice of the term “measurable”. □
1.2.1 Measures
Throughout this section (Ω, S) will denote a measurable space. Given a function
f : X → R we will use the notation {f ≤ c} to denote the subset f^{-1}((−∞, c]).
A_1 ⊂ A_2 ⊂ · · ·
such that
    ⋃_{n∈N} A_n = Ω  and  µ[A_n] < ∞, ∀n ∈ N.
The measure is called finite if µ[Ω] < ∞. A probability measure is a measure P such that P[Ω] = 1. We will denote by Prob(Ω, S) the set of probability measures on (Ω, S). □
(i) µ is finitely additive, i.e., for any finite collection of pairwise disjoint S-measurable sets A_1, . . . , A_n we have
    µ[ ⋃_{k=1}^n A_k ] = Σ_{k=1}^n µ[A_k].
If µ[Ω] < ∞ and µ is finitely additive, then the increasing continuity condition (ii) is equivalent to the decreasing continuity condition, i.e., for any decreasing sequence of S-measurable sets B_1 ⊃ B_2 ⊃ · · ·
    µ[ ⋂_{n∈N} B_n ] = lim_{n→∞} µ[B_n]. (1.2.3)
Indeed, the sequence B_n^c = Ω \ B_n is increasing and µ[B_n^c] = µ[Ω] − µ[B_n]. This last equality could be meaningless if µ[Ω] = ∞. □
Definition 1.28. (a) A measured space is a triplet (Ω, S, µ), where (Ω, S) is a measurable space and µ : S → [0, ∞] is a measure. □
Our next result shows that a finite measure is uniquely determined by its re-
striction to an algebra generating the sigma-algebra where it is defined.
Proposition 1.29. Consider a measurable space (Ω, S) and two finite measures µ_1, µ_2 : S → [0, ∞] such that µ_1[Ω] = µ_2[Ω] < ∞. Then the collection
    E := { S ∈ S : µ_1[S] = µ_2[S] }
is a λ-system. In particular, if µ_1[C] = µ_2[C] for any set C that belongs to a π-system C, then µ_1 and µ_2 coincide on the σ-algebra generated by C.
Definition 1.30. A probability space, or sample space, is a measured space (Ω, S, P), where P is a probability measure. In this case we use the following terminology. □
Example 1.31. (a) If (Ω, S) is a measurable space, then for any ω_0 ∈ Ω, the Dirac measure concentrated at ω_0 is the probability measure
    δ_{ω_0} : S → [0, ∞),  δ_{ω_0}[S] = 1 if ω_0 ∈ S, and δ_{ω_0}[S] = 0 if ω_0 ∉ S.
by setting
    µ[{(ω_1, . . . , ω_n)}] = µ_1[{ω_1}] · · · µ_n[{ω_n}], ∀(ω_1, . . . , ω_n) ∈ Ω_1 × · · · × Ω_n.
In particular, there exists a probability measure β_p^{⊗n} on {0, 1}^n.
Note that we have a random variable
    N : {0, 1}^n → N_0,  N(ε_1, . . . , ε_n) = ε_1 + · · · + ε_n, ∀ε_1, . . . , ε_n ∈ {0, 1}.
Its distribution is given by
    P[N = k] = Σ_{ε_1+···+ε_n=k} p^k q^{n−k} = C(n, k) p^k q^{n−k}.
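The binomial identity above can be verified by brute force for small n: enumerate {0, 1}^n, give each outcome its product-measure weight, and sum over the fiber {N = k}. A sketch (the values of p and n are arbitrary illustrative choices):

```python
from itertools import product
from math import comb, isclose

p, n = 0.3, 6
q = 1 - p

for k in range(n + 1):
    # sum of the weights p^k q^(n-k) over all (e_1, ..., e_n) with e_1+...+e_n = k
    direct = sum(p**sum(e) * q**(n - sum(e))
                 for e in product((0, 1), repeat=n) if sum(e) == k)
    assert isclose(direct, comb(n, k) * p**k * q**(n - k))
```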
(f) The Lebesgue measure λ defines a measure on B_R. For any compact interval [a, b] the uniform probability measure on [a, b] is
    (1/(b − a)) I_{[a,b]} λ. □
Definition 1.32. Let X be a topological space. As usual B_X denotes the σ-algebra of Borel subsets of X. A measure on X is called Borel if it is defined on B_X. □
Definition 1.33. Suppose that X ∈ L^0(Ω, S, P). Its distribution is the Borel probability measure P_X on R̄ defined by
    P_X[B] = P[X ∈ B], ∀B ∈ B_{R̄}.
Definition 1.34. Suppose that µ is a measure on the measurable space (Ω, S).
Remark 1.35. (a) It may be helpful to think of a sample space (Ω, S, P) as the
collection of all possible outcomes ω of an experiment with unpredictable results.
The observer may not be able to distinguish through measurements all the possi-
ble outcomes, but she is able to distinguish some features or properties of various
outcomes. An event can be understood as the collection of all outcomes having an observable or measurable property. The probability P assigns a likelihood to a certain property being observed at the end of such a random experiment.
Take for example the experiment of flipping n times a coin with 0/1 faces. One natural sample space for this experiment is based on the set Ω = {0, 1}^n.
If we assume that the coin is fair, then it is natural to conclude that each outcome
ω ∈ Ω is equally likely. Suppose that we can distinguish all the outcomes. In this
case
S = 2Ω .
Since there are 2^n outcomes that are equally likely to occur we obtain a probability measure P given by
    P[S] = |S|/2^n, ∀S ∈ S.
the collection S of observable properties. For example, in the situation of n fair coin
tosses, the number N of 1’s observed at the end of n tosses is a random variable.
(b) Often one speaks of sampling a probability distribution on R. Modern computer
systems can sample many distributions. More concretely, we say that a probability
measure µ on (R, BR ) can be sampled by a computer system if that computer can
produce a random1 experiment whose outcome is a random number X so that, when
we run the experiment a large number of times n, it generates numbers x1 , . . . , xn
and, for any c ∈ R, the fraction of these numbers that is ≤ c is very close to µ[(−∞, c]].
When we speak of sampling a random variable X, we really mean sampling its probability distribution P_X. □
1 The precise term is pseudo-random since one cannot really simulate randomness.
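This description of sampling can be illustrated with the uniform measure on [0, 1], for which µ[(−∞, c]] = c when c ∈ [0, 1]. A small sketch (the sample size and test points are arbitrary choices):

```python
import random

random.seed(0)
n = 100_000
xs = [random.random() for _ in range(n)]  # pseudo-random samples of the uniform measure

for c in (0.1, 0.5, 0.9):
    frac = sum(x <= c for x in xs) / n    # empirical fraction of samples <= c
    assert abs(frac - c) < 0.01           # close to mu[(-inf, c]] = c
```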
    S^µ := { S ∪ N : S ∈ S, N ∈ N_µ } ⊂ 2^Ω.
    µ̄[S ∪ N] := µ[S], ∀S ∈ S, N ∈ N_µ. □
(c) µ is countably additive, i.e., for any sequence (A_n)_{n∈N} of disjoint sets in F whose union is a set A ∈ F we have
    µ[A] = Σ_{n≥1} µ[A_n].
(ii) The premeasure µ is called σ-finite if there exists a sequence of sets (Ω_n)_{n∈N} in F such that
    Ω = ⋃_{n∈N} Ω_n,  µ[Ω_n] < ∞, ∀n ∈ N.
For a proof of the next central result we refer to [4, Sec. 1.3], [50, Chap. 3] or
[92, Thm. 1.53, 1.65].
(ii) For any A ∈ σ(F) and any ε > 0 there exist mutually disjoint sets A_1, . . . , A_m ∈ F and B_1, . . . , B_n ∈ F such that
    A ⊂ ⋃_{j=1}^m A_j,  µ̃[ ⋃_{j=1}^m A_j \ A ] < ε,
and
    µ̃[ A Δ ⋃_{k=1}^n B_k ] < ε. □
Example 1.40. Let F denote the collection of subsets of R that are unions of intervals of the type (a, b], −∞ ≤ a < b < ∞. This is an algebra of sets. Any F ∈ F can be written, in a non-unique way, as a union
    F = ⋃_{i=1}^n (a_i, b_i],  a_i < b_i ≤ a_{i+1} < b_{i+1}, ∀i = 1, . . . , n − 1.
While this decomposition is not unique, the sum
    λ[F] = Σ_{i=1}^n (b_i − a_i)
depends only on F and not on the decomposition. It is not very hard to show that the correspondence
    F ∋ F ↦ λ[F] ∈ [0, ∞]
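The fact that λ[F] does not depend on the chosen decomposition can be mirrored computationally: reduce any finite list of intervals (a, b] to a canonical merged form before summing the lengths. A sketch (the function name is our own):

```python
def premeasure(intervals):
    """Total length of a finite union of intervals (a, b], computed from a
    canonical merged decomposition, hence independent of the input decomposition."""
    merged = []
    for a, b in sorted(intervals):
        if merged and a <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], b))
        else:
            merged.append((a, b))
    return sum(b - a for a, b in merged)

# two decompositions of the same set (0, 3] ∪ (5, 7]
assert premeasure([(0, 3), (5, 7)]) == premeasure([(0, 1), (1, 3), (5, 6), (6, 7)]) == 5
```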
From the definition of inf as the greatest lower bound we deduce that there exists x_* ∈ (x_∞, x_0) such that F(x_*) < p_0. Thus F(x_n) ≤ F(x_*). Since p_n ↗ p_0 we deduce p_n > F(x_*) for all n sufficiently large. This implies
    x_* ∉ { x : F(x) ≥ p_n } = [Q(p_n), ∞),
i.e., x_n = Q(p_n) > x_*, for all n sufficiently large. This contradicts the fact that x_n → x_∞ < x_*.
If λ_{[0,1]} denotes the Lebesgue measure² on [0, 1], then
    λ_{[0,1]}[ {p : Q(p) ≤ x} ] = λ_{[0,1]}[ (0, F(x)] ] = F(x), ∀x ∈ R,
since Q(p) ≤ x if and only if p ≤ F(x). Hence the pushforward measure Q_# λ_{[0,1]} satisfies (1.2.4): since it coincides with µ_F on the π-system consisting of the intervals of the form (a, b], it coincides with µ_F on the sigma-algebra of Borel sets.
When F is the cumulative distribution function of a random variable X, the associated quantile function is called the quantile of the random variable X. □
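The identity Q_# λ_{[0,1]} = µ_F is the basis of inverse-transform sampling: feeding uniform samples through the quantile function produces samples of µ_F. A sketch for the exponential distribution, whose quantile function has a closed form (the rate value is an arbitrary choice):

```python
import math
import random

rate = 2.0

def Q(p: float) -> float:
    """Quantile of the exponential law with F(x) = 1 - exp(-rate*x):
    Q(p) = inf{x : F(x) >= p} = -log(1 - p)/rate."""
    return -math.log1p(-p) / rate

random.seed(1)
xs = [Q(random.random()) for _ in range(100_000)]  # pushforward of uniform samples

for c in (0.2, 0.5, 1.0):
    F = 1 - math.exp(-rate * c)
    frac = sum(x <= c for x in xs) / len(xs)
    assert abs(frac - F) < 0.01  # empirical CDF matches F
```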
The next concepts are purely probabilistic in nature. They have no natural coun-
terpart in the traditional measure theory.
Remark 1.47. (a) We want to emphasize that the independence condition is sen-
sitive to the choice of probability measure involved in this definition.
2 The proof of the existence of the Lebesgue measure is based on Carathéodory's extension theorem.
    = 1/2^{k+1} = P[E_0] · Π_{i∈I} P[E_i].
Thus, any n of the events E_0, E_1, . . . , E_n are independent. Finally, note that
    Π_{i=0}^n P[E_i] = 1/2^{n+1},  while  P[E_0 ∩ E_1 ∩ · · · ∩ E_n] = 0 if n is odd and 1/2^n if n is even.
This shows the events E_0, E_1, . . . , E_n are dependent.
(c) If Ω belongs to each of the families of events A_1, . . . , A_n, then these families are independent if and only if
    P[A_1 ∩ · · · ∩ A_n] = P[A_1] · · · P[A_n], ∀A_k ∈ A_k, k = 1, . . . , n. □
then
    P[A ∩ S_2 ∩ · · · ∩ S_n] = lim_{ν→∞} P[A_ν ∩ S_2 ∩ · · · ∩ S_n] = lim_{ν→∞} P[A_ν] P[S_2] · · · P[S_n] = P[A] P[S_2] · · · P[S_n].
    F_α = σ(C_α), ∀α ∈ A,
and the family (C_α)_{α∈A} is independent. The conclusion now follows from Proposition 1.48. □
Remark 1.53. (a) An event S is a tail event of the sequence (S_n)_{n∈N} if
    S ∈ ⋁_{n>m} S_n, ∀m ∈ N.
The sequence of σ-algebras (S_n)_{n∈N} can be viewed as an information stream. The tail events are described by a stream of information and are characterized by the fact that their occurrence is unaffected by information at finitely many moments of time in the stream.
(b) To a sequence of random variables X_n : (Ω, S, P) → R we associate the sequence of σ-algebras S_n = σ(X_n) and the event C = “the sequence (X_n)_{n≥1} converges”. To see that this is a tail event note that T_m = σ(X_{m+1}, X_{m+2}, . . . ) and
    C = ⋂_{m∈N} C_m,
Theorem 1.54 (Kolmogorov's 0–1 law). If A is a tail event of the independency (S_n)_{n∈N}, then P[A] = 0 or P[A] = 1.
□
Corollary 1.56. Suppose that (X_n)_{n∈N} is a sequence of independent random variables on the probability space (Ω, S, P). Then the series
    Σ_{n∈N} X_n
is either almost surely convergent, or almost surely divergent. In other words, the almost sure convergence is a zero-one event. □
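A classical instance of this dichotomy is the series Σ_n ε_n/n with independent Rademacher signs ε_n, which in fact converges almost surely. The sketch below only illustrates the stabilization of the partial sums; no finite simulation can, of course, certify an almost sure statement.

```python
import random

random.seed(2)

def partial_sums(n_terms):
    """Record two deep partial sums of sum_n eps_n / n with random signs eps_n."""
    s, out = 0.0, {}
    for n in range(1, n_terms + 1):
        s += random.choice((-1.0, 1.0)) / n
        if n in (10_000, 20_000):
            out[n] = s
    return out

# across many independent runs the tail contribution is tiny
for _ in range(100):
    ps = partial_sums(20_000)
    assert abs(ps[20_000] - ps[10_000]) < 0.1
```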
Definition 1.57. Suppose that A, B are events in the sample space (Ω, S, P) such that P[B] ≠ 0. The conditional probability of A given B is the number
    P[A | B] := P[A ∩ B] / P[B]. □
Note that we have the useful product formula
    P[A ∩ B] = P[A | B] P[B]. (1.2.9)
In particular, we deduce that A, B are independent if and only if P[A] = P[A | B].
Note that the map
    P[− | B] : S → [0, 1],  S ↦ P[S | B]
Indeed,
    P[S] = Σ_{i∈I} P[S ∩ A_i] = Σ_{i∈I} P[S | A_i] P[A_i],
where the second equality follows from the product formula (1.2.9).
Example 1.59. Suppose that we have an urn containing b black balls and r red
balls. A ball is drawn from the urn and discarded. Without knowing its color, what
is the probability that a second ball drawn is black?
For k = 1, 2 denote by Bk the event “the k-th drawn ball is black ”. We are asked
to find P(B2 ). The first drawn ball is either black (B1 ) or not black (B1c ). From
the law of total probability we deduce
    P[B_2] = P[B_2 | B_1] P[B_1] + P[B_2 | B_1^c] P[B_1^c].
Observing that
    P[B_1] = b/(b + r)  and  P[B_1^c] = r/(b + r),
we conclude
    P[B_2] = ((b − 1)/(b + r − 1)) · (b/(b + r)) + (b/(b + r − 1)) · (r/(b + r))
        = (b(b − 1) + br)/((b + r)(b + r − 1)) = b(b + r − 1)/((b + r)(b + r − 1)) = b/(b + r) = P[B_1].
Thus, the probability that the second extracted ball is black is equal to the proba-
bility that the first extracted ball is black. This seems to contradict our intuition
because when we extract the second ball the composition of available balls at that
time is different from the initial composition.
This is a special case of a more general result, due to S. Poisson, [31, Sec. 5.3].
Suppose in an urn containing b black and r red balls, n balls have been drawn first
and discarded without their colors being noted. If another ball is drawn next, the
probability that it is black is the same as if we had drawn this ball at the outset,
without having discarded the n balls previously drawn.
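Poisson's result is easy to test by simulation: draw and discard the first ball without recording its color, then observe the second. A sketch with illustrative values of b and r:

```python
import random

random.seed(3)
b, r, trials = 5, 7, 200_000

black_second = 0
for _ in range(trials):
    urn = ["B"] * b + ["R"] * r
    random.shuffle(urn)
    urn.pop(0)                     # first ball drawn and discarded, color unseen
    black_second += urn[0] == "B"  # is the second drawn ball black?

# P[B2] agrees with P[B1] = b/(b + r)
assert abs(black_second / trials - b / (b + r)) < 0.01
```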
Example 1.60 (The ballot problem). This is one of the oldest problems in
probability. A person starts at S0 ∈ Z and every second (or epoch) he flips a
fair coin: Heads, he moves ahead, Tails he takes one step back. We denote by Sn
its location after n coin flips. The sequence of random variables (Sn )n∈N is called
the standard (or unbiased) random walk on Z.
Formally we have a sequence of independent random variables (Xn )n∈N such
that
    P[X_n = 1] = P[X_n = −1] = 1/2, ∀n ∈ N.
The random variables with this distribution are called Rademacher random vari-
ables. Then
Sn = S0 + X1 + · · · + Xn .
    S_0 = 0,  I_n := {1, . . . , n},
    H_n := #{ k ∈ I_n : X_k = 1 },  T_n := #{ k ∈ I_n : X_k = −1 }.
Thus Hn is the number of Heads during the first n coin flips, while Tn denotes the
number of Tails during the first n coin flips. Note that
n = Hn + Tn , Sn = S0 + Hn − Tn = S0 + 2Hn − n.
We deduce that
    S_n = m ⟺ n + m − S_0 = 2H_n.
In particular this shows that S_n ≡ n − S_0 mod 2, ∀n ∈ N. Moreover,
    S_n = m ⟺ H_n = (n + m − S_0)/2,
and we deduce
    P[S_n = m] = C(n, (n + m − S_0)/2) · 2^{−n} if m ≡ n − S_0 mod 2, and P[S_n = m] = 0 otherwise.
It is convenient to visualize the random walk as a zig-zag obtained by successively
connecting by a line segment the point (n − 1, Sn−1 ) to the point (n, Sn ), n ∈ N.
The connecting line segment has slope Xn ; see Figure 1.1.
Suppose that y ∈ N and S0 = 0. The ballot problem asks what is the probability
py that
Sk > 0, ∀k = 1, . . . , n − 1 given that Sn = y.
One can think of a zigzag as describing a succession of votes in favor of one of the
two candidates H or T . When the zigzag goes up, a vote for H is cast, and when
it goes down, a vote in favor of T is cast. We know that at the end of the election
H was declared winner with y votes over T . Thus py is the probability that H was
always ahead during the voting process.
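The classical answer, known as the ballot theorem, is p_y = y/n. One can estimate p_y by shuffling a voting record with a fixed final margin and counting the records in which H stays strictly ahead; the sketch below uses arbitrary illustrative values of n and y.

```python
import random

random.seed(4)
n, y = 12, 4                 # n votes, final margin y, so H gets (n + y)/2 votes
trials = 50_000
steps = [1] * ((n + y) // 2) + [-1] * ((n - y) // 2)  # a record with S_n = y

ahead = 0
for _ in range(trials):
    random.shuffle(steps)    # all orderings with S_n = y are equally likely
    s, positive = 0, True
    for x in steps:
        s += x
        if s <= 0:           # H not strictly ahead at some moment
            positive = False
            break
    ahead += positive

# estimate of p_y versus the ballot theorem value y/n = 1/3
assert abs(ahead / trials - y / n) < 0.01
```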
Fig. 1.2 The zigzag Z^r traces Z until Z hits the horizontal axis. At this moment the zigzag Z^r follows the opposite motion of Z (dashed line).
Proof. According to the law of total probability, the denominator in the right-hand side of (1.2.12) equals P[S]. Thus, the equality (1.2.12) is equivalent to

P[A_{i_0} | S] P[S] = P[S | A_{i_0}] P[A_{i_0}].

The product formula shows that both sides of the above equality are equal to P[A_{i_0} ∩ S]. □
Example 1.63 (Biased coins). We say that a coin has bias θ ∈ (0, 1) if the probability of showing Heads when flipped is θ. Suppose that we have an urn containing c_1 coins with bias θ_1 and c_2 coins with bias θ_2. Let n := c_1 + c_2 denote the total number of coins and set p_i := c_i/n, i = 1, 2. We assume that

c_1 < c_2 and θ_1 > θ_2, (1.2.13)

i.e., there are fewer coins with the higher bias. We draw a coin at random, flip it twice, and get Heads both times. What is the probability that the coin we have drawn has the higher bias?
If θ denotes the (unknown) bias of the coin drawn at random, then we can think of θ as a random variable that takes the two values θ_1, θ_2 with probabilities

P[θ_i] := P[θ = θ_i] = p_i, i = 1, 2.

Denote by E the event that two successive flips produce Heads. Then

P[E | θ_i] := P[E | θ = θ_i] = θ_i².
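The computation the example calls for is a one-line application of Bayes' formula. Here is a sketch with illustrative numbers of our choosing (c_1 = 2, c_2 = 8, θ_1 = 0.9, θ_2 = 0.4; they satisfy (1.2.13)), cross-checked by simulation:

```python
import random

def posterior(c1, c2, th1, th2):
    """P[theta = th1 | E] via Bayes' formula (1.2.12)."""
    p1, p2 = c1 / (c1 + c2), c2 / (c1 + c2)
    return p1 * th1 ** 2 / (p1 * th1 ** 2 + p2 * th2 ** 2)

exact = posterior(2, 8, 0.9, 0.4)

# Monte Carlo cross-check of the same experiment
random.seed(2)
heads2 = high = 0
for _ in range(200_000):
    th = 0.9 if random.random() < 0.2 else 0.4        # draw a coin
    if random.random() < th and random.random() < th:  # two Heads
        heads2 += 1
        high += (th == 0.9)
print(exact, high / heads2)  # both ≈ 0.5586
```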
as follows. If

f = Σ_{i=1}^M a_i I_{A_i},  A_1, . . . , A_M disjoint,

then

µ[f] = ∫_Ω f(ω) µ(dω) := Σ_{i=1}^M a_i µ[A_i].

Note that if

f = Σ_{j=1}^N b_j I_{B_j},  B_1, . . . , B_N disjoint,
then a_i = b_j if A_i ∩ B_j ≠ ∅. Hence

Σ_i a_i µ[A_i] = Σ_i Σ_j a_i µ[A_i ∩ B_j] = Σ_j Σ_i b_j µ[A_i ∩ B_j] = Σ_j b_j µ[B_j].

This shows that the value of ∫_Ω f(ω) µ(dω) is independent of the decomposition of f as a linear combination of indicators of pairwise disjoint measurable sets.
The above integration map satisfies the following elementary properties. We say that f is µ-integrable if

µ[f⁺], µ[f⁻] < ∞.

We denote by L¹(Ω, S, µ) the set of µ-integrable functions and by L¹₊(Ω, S, µ) the subset of nonnegative µ-integrable functions. Moreover,

∀f ∈ L⁰₊(Ω, S) : µ[f > 0] = 0 ⟺ ∫_Ω f dµ = 0. (1.2.17)
Proof. The sequence (µ[f_n]) is nondecreasing and is bounded above by µ[f]. Hence it has a, possibly infinite, limit and lim_{n→∞} µ[f_n] ≤ µ[f]. The proof of the opposite inequality

lim_{n→∞} µ[f_n] ≥ µ[f]

relies on a clever trick. Fix g ∈ E⁺_f, c ∈ (0, 1), and set

S_n := {ω ∈ Ω; f_n(ω) ≥ cg(ω)}.

Since f = lim f_n and (f_n) is a nondecreasing sequence of functions we deduce that (S_n) is a nondecreasing sequence of measurable sets whose union is Ω. For any elementary function h the product I_{S_n} h is also elementary. For any n ∈ N we have f_n ≥ f_n I_{S_n} ≥ cg I_{S_n} so that

µ[f_n] ≥ µ[I_{S_n} f_n] ≥ cµ[g I_{S_n}].

Write g = Σ_j g_j I_{A_j} with A_j measurable and pairwise disjoint. The sequence of sets (A_j ∩ S_n)_{n∈N} is nondecreasing and its union is A_j, so that

lim_{n→∞} µ[f_n] ≥ c Σ_j g_j lim_{n→∞} µ[A_j ∩ S_n] = c Σ_j g_j µ[A_j] = cµ[g].

Hence

lim_{n→∞} µ[f_n] ≥ cµ[g], ∀g ∈ E⁺_f, ∀c ∈ (0, 1),

so that

lim_{n→∞} µ[f_n] ≥ cµ[f], ∀c ∈ (0, 1).

Letting c ↗ 1 we deduce lim_{n→∞} µ[f_n] ≥ µ[f]. □
Corollary 1.69 (Markov's Inequality). Suppose that f ∈ L¹₊(Ω, S, µ). Then, for any C > 0, we have

µ[{f ≥ C}] ≤ (1/C) ∫_Ω f dµ. (1.2.19)

In particular, f < ∞, µ-a.e.
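A numerical illustration (our sketch, with an exponential sample as the nonnegative integrable function): the empirical tails never exceed the Markov bound.

```python
import random

random.seed(3)
# f >= 0 integrable: exponential(1) variates, so the integral of f is 1
sample = [random.expovariate(1.0) for _ in range(100_000)]
mean = sum(sample) / len(sample)
for C in (1.0, 2.0, 5.0):
    tail = sum(x >= C for x in sample) / len(sample)
    print(C, tail, mean / C)   # tail always stays below the bound mean/C
```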
Remark 1.72. The presentation so far had to tread carefully around a nagging
problem: given f, g in L1 (Ω, S, µ), then f (ω) + g(ω) may not be well defined for
some ω. For example, it could happen that f (ω) = ∞, g(ω) = −∞. Fortunately,
Corollary 1.70 shows that the set of such ω’s is negligible. Moreover, if we redefine
f and g to be equal to zero on the set where they had infinite values, then their
integrals do not change. For this reason we alter the definition of L1 (Ω, S, µ) as
follows.
L¹(Ω, S, µ) := { f : (Ω, S) → R; f measurable, ∫_Ω |f| dµ < ∞ }.
Thus, in the sequel the integrable functions will be assumed to be everywhere finite. With this convention, the space L¹(Ω, S, µ) is a vector space and the Lebesgue integral is a linear functional

µ : L¹(Ω, S, µ) → R,  f ↦ µ[f]. □
Recall that for any sequence (x_n)_{n∈N} of real numbers we have

lim inf_{n→∞} x_n = lim_{k→∞} x*_k,  x*_k := inf_{n≥k} x_n.
Proof. Set

g_k := inf_{n≥k} f_n.

Proposition 1.17(iii) implies that g_k ∈ L⁰₊(Ω, S). The sequence (g_k) is nondecreasing and g_k ≤ f_n for every n ≥ k, i.e.,

∫_Ω g_k dµ ≤ inf_{n≥k} ∫_Ω f_n dµ.

Letting k → ∞ and using the Monotone Convergence Theorem on the left-hand side we deduce

∫_Ω lim inf_{n→∞} f_n dµ = lim_{k→∞} ∫_Ω g_k dµ ≤ lim_{k→∞} inf_{n≥k} ∫_Ω f_n dµ = lim inf_{n→∞} ∫_Ω f_n dµ. □
The next result illustrates one of the advantages of the Lebesgue integral over
the Riemann integral: one needs less restrictive conditions to pass to the limit under
the Lebesgue integral.
Theorem 1.75 (Change in variables). Suppose that (Ω_0, S_0), (Ω_1, S_1) are measurable spaces and

Φ : (Ω_0, S_0) → (Ω_1, S_1)

is a measurable map. Fix a measure µ_0 : S_0 → [0, ∞] and a measurable function f ∈ L⁰(Ω_1, S_1). Then

f ∈ L¹(Ω_1, S_1, Φ_# µ_0) ⟺ f ◦ Φ ∈ L¹(Ω_0, S_0, µ_0)

and

∫_{Ω_0} f ◦ Φ dµ_0 = ∫_{Ω_1} f d(Φ_# µ_0). (1.2.21)
Proof. Note that it suffices to prove the theorem in the case f ≥ 0. The result is obviously true if f ∈ Elem₊(Ω_1, S_1). The general case follows from the Monotone Convergence Theorem using the increasing approximation [f]_n ↗ f of f by elementary functions; see (1.1.7). □
Remark 1.76. Unlike the well known change-in-variables formula, the map Φ in (1.2.21) need not be bijective, only measurable.
If Φ is bijective with measurable inverse, then for any measure µ_1 on (Ω_1, S_1), (1.2.21) applied to the map Φ⁻¹ reads

∫_{Ω_1} f(ω_1) µ_1(dω_1) = ∫_{Ω_0} f(Φ ω_0) (Φ⁻¹_# µ_1)(dω_0), ∀f ∈ L¹(Ω_1, S_1, µ_1). (1.2.22)

In particular, if Ω_i are open subsets of Rⁿ, Φ : Ω_0 → Ω_1 is a C¹-diffeomorphism onto, and µ_1 is the Lebesgue measure on Ω_1, then (1.2.22) reads

∫_{Ω_1} f(y) λ(dy) = ∫_{Ω_0} f(Φ x) |det J_Φ(x)| λ(dx), (1.2.23)

where J_Φ(x) is the Jacobian of the C¹ map x ↦ Φx. □
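A quick numerical check of the one-dimensional case of (1.2.23), with a map and test function of our choosing (T(x) = x² on (0, 1), so |det J_T(x)| = 2x, and f(y) = cos y), using a midpoint-rule sketch:

```python
import math

# Both sides of (1.2.23) should equal the integral of cos over (0, 1),
# i.e. sin(1) ≈ 0.841471.
N = 100_000
h = 1.0 / N
lhs = sum(math.cos((i + 0.5) * h) for i in range(N)) * h
rhs = sum(math.cos(((i + 0.5) * h) ** 2) * 2 * ((i + 0.5) * h)
          for i in range(N)) * h
print(lhs, rhs)  # both ≈ sin(1)
```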
Proposition 1.77. Let f ∈ L⁰₊(Ω, S). Suppose that µ : S → [0, ∞] is a sigma-finite measure. Define

µ_f : S → [0, ∞],  µ_f[S] = ∫_S f dµ := ∫_Ω I_S f dµ.
Definition 1.78. Suppose that µ, ν are two measures on the measurable space (Ω, S). We say that ν is absolutely continuous with respect to µ, and we write this ν ≪ µ, if

∀S ∈ S : µ[S] = 0 ⇒ ν[S] = 0. □
(i) ν ≪ µ.
(ii) There exists ρ ∈ L⁰₊(Ω, S) such that ν = µ_ρ, i.e.,

ν[S] = ∫_S ρ(ω) µ(dω), ∀S ∈ S.

The function ρ is not unique, but it defines a unique element in L⁰₊(Ω, S, µ) which we denote by dν/dµ and to which we will refer as the density of ν relative to µ. □
1.2.4 Lp spaces
We recall here an important class of Banach spaces. For proofs and many more details we refer to [50; 102; 148]. We define an equivalence relation ∼_µ on L⁰(Ω, S) by declaring f ∼_µ g iff µ[f ≠ g] = 0. Note that

f ∈ L¹(Ω, S, µ) and g ∼_µ f ⇒ g ∈ L¹(Ω, S, µ) and ∫_Ω g dµ = ∫_Ω f dµ.

We set
(i) For any p ∈ [1, ∞], the pair (Lᵖ(Ω, S, µ), ‖ − ‖_p) is a Banach space.
(ii) If p ∈ [1, ∞), the vector subspace of p-integrable elementary functions is dense in Lᵖ(Ω, S, µ). In particular, if S is generated as a sigma-algebra by a countable collection of sets, then Lᵖ(Ω, S, µ) is separable. □

The above density result follows from a combined application of the Monotone Class Theorem and the Monotone Convergence Theorem; see Exercise 1.4.
Suppose that (Ω, S, µ) is a measured space and p ∈ [1, ∞]. Denote by q the exponent conjugate to p, i.e.,

1/p + 1/q = 1 ⟺ q = p/(p − 1).
To see that this is indeed the case fix C ∈ C_X and, for any n ∈ N, denote by D_n the closed set

D_n := {x ∈ X; dist(x, C) ≥ 1/n}.

Define f_n ∈ C_b(X),

f_n(x) := dist(x, D_n) / ( dist(x, D_n) + dist(x, C) ).

The function f_n is identically 1 on C and identically 0 on D_n. Moreover

lim_{n→∞} f_n(x) = I_C(x), ∀x ∈ X.

Using the Dominated Convergence Theorem we deduce

µ[C] = lim_{n→∞} I_µ[f_n] = lim_{n→∞} I_ν[f_n] = ν[C]. □
Corollary 1.85. Suppose that X is a metric space and µ is a finite Borel measure on X. Then the space C_b(X) is dense in L¹(X, B_X, µ). □
For a proof we refer to [52, Sec. IV.6, Thm. 3] or [148, Thm. 13.5].
Example 1.87. We can use the above result to construct probability measures on a smooth compact manifold M of dimension m. As shown in e.g. [122, Sec. 3.4.1] a Riemann metric g on M defines a continuous linear functional

C(M) ∋ f ↦ ∫_M f dV_g ∈ R,

usually referred to as the integral with respect to the volume element determined by g. The Riesz Representation Theorem shows that this corresponds to the integral with respect to a finite Borel measure Vol_g on M called the metric measure. The metric volume of M is then

Vol_g[M] = ∫_M I_M dV_g.
More generally, for any Borel measurable function f : R → R such that f(X) is integrable or nonnegative we have³

E[f(X)] = ∫_R f(x) P_X(dx). (1.3.4)

In other words, the expectation of a random variable is determined by its probability distribution alone, and not by the precise nature of the sample space on which it is defined.
For example, the random variables N and R described at the beginning of this section have the same distribution and thus they have the same mean

E[N] = E[R] = (1 + · · · + 6)/6 = 7/2.
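In code, (1.3.4) says that expectations can be computed from the distribution alone; the realization only enters through P_X. A small sketch (ours) for a fair die:

```python
import random
from fractions import Fraction as F

# E[f(N)] depends only on the distribution P_N, per (1.3.4)
die = {k: F(1, 6) for k in range(1, 7)}

def expect(dist, f=lambda x: x):
    return sum(f(x) * p for x, p in dist.items())

print(expect(die))                   # 7/2
print(expect(die, lambda x: x * x))  # E[N^2] = 91/6

# any realization with this distribution gives the same mean
random.seed(9)
emp = sum(random.randint(1, 6) for _ in range(100_000)) / 100_000
print(emp)  # ≈ 3.5
```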
Remark 1.89 (Bertrand's paradox). More often than not, in concrete problems the sample space where a random variable is defined is not explicitly mentioned. Sometimes this can create a problem. Consider the following classical example.
Pick a chord at random on a unit circle. What is the probability that its length is at most √3, the length of the edge of an equilateral triangle inscribed in that unit circle?
The answer depends on the concept of "at random" we utilize.
For example, we can think that a chord is determined by two points θ_1, θ_2 on the circle or, equivalently, by a pair of numbers in [0, 2π]. The corresponding chord has length ≤ √3 if and only if the circular distance between θ_1 and θ_2 is at most 2π/3, i.e., |θ_1 − θ_2| ≤ 2π/3 or |θ_1 − θ_2| ≥ 4π/3. The complementary region of the square [0, 2π]², where 2π/3 < |θ_1 − θ_2| < 4π/3, consists of two congruent regions, each the difference of an isosceles right triangle with legs of size 4π/3 and one with legs of size 2π/3; together they occupy one third of the square. Assuming that the point (θ_1, θ_2) is chosen uniformly inside the square [0, 2π]², we deduce that the probability that the chord has length at most √3 is 2/3.
On the other hand, a chord is uniquely determined by the location of its midpoint inside the unit circle. The chord has length at most √3 if and only if the midpoint is at distance at least 1/2 from the center. Assuming that the midpoint is chosen uniformly inside this circle, we deduce that the probability that the chord is at most √3 is 3/4, since the disk of radius 1/2 occupies 1/4 of the unit disk.
We can try to decide empirically which is the correct answer, but any simulation/experiment must adopt a certain model of randomness. Things are even more complex. The set of chords has a natural symmetry given by the group of rotations about the origin. Any "reasonable" model of randomness ought to be compatible with this symmetry. In mathematical terms this means that the underlying probability measure ought to be invariant with respect to this symmetry.
As a set, we can identify the set of chords with the unit disk: we can describe
a chord by indicating the location of its midpoint. The problem boils down to
choosing a rotation invariant Borel measure on the unit disk. The quotient of the
³ In undergraduate probability classes this formula is often referred to as LOTUS: the Law Of The Unconscious Statistician.
disk with respect to the group of rotations is a segment. In particular, any probability measure µ on the unit interval defines a rotation invariant probability measure P_µ on the unit disk, determined by the requirements

P_µ[0 ≤ r ≤ r_1, θ_0 ≤ θ ≤ θ_1] = µ([0, r_1]) · (θ_1 − θ_0)/(2π).

Hence, there are infinitely many geometric randomness models. In our first model of randomness, the measure µ is the distribution of |cos((Θ_1 − Θ_2)/2)|, where Θ_1, Θ_2 are independent, uniformly distributed on [0, 2π]. In the second model of randomness the measure µ is 2r dr. □
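The two computations can be reproduced by simulation; the sketch below (our code) implements both randomness models and produces the two different answers:

```python
import math
import random

random.seed(4)
T = 200_000
s3 = math.sqrt(3)

# Model 1: endpoints at independent uniform angles;
# the chord length is 2|sin((theta1 - theta2)/2)|
short1 = 0
for _ in range(T):
    d = random.uniform(0, 2 * math.pi) - random.uniform(0, 2 * math.pi)
    short1 += 2 * abs(math.sin(d / 2)) <= s3

# Model 2: midpoint uniform in the unit disk (rejection sampling);
# chord <= sqrt(3) iff the midpoint is at distance >= 1/2 from the center
short2 = 0
for _ in range(T):
    while True:
        x, y = random.uniform(-1, 1), random.uniform(-1, 1)
        if x * x + y * y <= 1:
            break
    short2 += (x * x + y * y >= 0.25)

print(short1 / T, short2 / T)  # ≈ 2/3 and ≈ 3/4
```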
Example 1.90. Suppose that n ≥ 3 birds are arranged along a circle looking towards the center. At a given moment each bird randomly and independently turns its head to the left or to the right, with equal probabilities. After they turn their heads, some birds will be visible to one of their neighbors, and some not.
Denote by X_n the number of birds that are invisible to their neighbors. We want to compute E[X_n], the expected number of invisible birds. We leave it to the reader to convince herself/himself that X_n is indeed a well defined mathematical object.
For k = 1, . . . , n we denote by Bk the event that the k-th bird is invisible to its
neighbors. Then

X_n = Σ_{k=1}^n I_{B_k}  and  E[X_n] = Σ_{k=1}^n E[I_{B_k}] = Σ_{k=1}^n P[B_k] = nP[B_1].
The probability that the first bird is invisible to its neighbors is computed by observing that this happens iff its right neighbor turns its head right and its left neighbor turns its head left. Since they do this independently with probabilities 1/2 we deduce

P[B_1] = (1/2) · (1/2) = 1/4.

Hence

E[X_n] = n/4.
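A simulation sketch (ours; the left/right encoding below is one of the two symmetric conventions, and either choice gives the same probability 1/4 per bird):

```python
import random

random.seed(5)

def invisible_count(n):
    """One trial: each bird looks left (0) or right (1); bird k is
    counted invisible iff both neighbors look away from it."""
    looks = [random.randint(0, 1) for _ in range(n)]
    return sum(looks[(k - 1) % n] == 0 and looks[(k + 1) % n] == 1
               for k in range(n))

n, trials = 12, 100_000
avg = sum(invisible_count(n) for _ in range(trials)) / trials
print(avg)  # ≈ n/4 = 3
```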
To appreciate how efficient this computation is we present an alternate method. We will determine the expectation by determining the probability distribution of X_n, or equivalently its probability generating function (pgf)

G_{X_n}(t) = E[t^{X_n}] = Σ_{k≥0} P[X_n = k] t^k.

I learned the argument below from Luke Whitmer, a student in one of my undergraduate probability courses.
Assume the birds sit on the edges of a convex n-gon P_n. Orienting an edge corresponds to describing in which direction the corresponding bird is looking. We will refer to a choice of orientations of the
Similarly, we denote by P⁻_n the polygon obtained from P_n by collapsing the edges E_2, E_4, . . . . We can take the collapsed edges as vertices of the new polygon. Its edges are

E⁻_1 = E_1,  E⁻_2 = E_3,  . . . ,  E⁻_k = E_{2k−1}.

Note that

G_{X_n}(t) = (1/2ⁿ) P_n(t).

We denote by y_{m,j} the number of oriented m-gons with j out-vertices and we set

Q_m(t) := Σ_{j≥0} y_{m,j} t^j = Σ_{ω∈Ω_m} t^{y_m(ω)}.
labelling {1, 2, . . . , m} of the vertices of Q_m. If y_m(ω) = j then z_m(ω) = j, so the set S of locations of in-/out-vertices has cardinality 2j,

S = {1 ≤ ℓ_1 < ℓ_2 < · · · < ℓ_{2j} ≤ m}.

The above discussion shows that if ℓ_1 is an out/in-vertex, then all the odd-indexed vertices ℓ_3, ℓ_5, . . . are out/in-vertices, while the even-indexed vertices ℓ_2, ℓ_4, . . . are in/out-vertices. This shows that

y_{m,j} = 2 \binom{m}{2j},  Q_m(t) = Σ_{j≥0} 2 \binom{m}{2j} t^j,  Q_m(t²) = (1 + t)^m + (1 − t)^m.

Hence

P_{2k}(t²) = ( (1 + t)^k + (1 − t)^k )² = (1 + t)^{2k} + (1 − t)^{2k} + 2(1 − t²)^k,
P_{2k+1}(t²) = (1 − t)^{2k+1} + (1 + t)^{2k+1}.
We conclude that

G_{X_n}(t) = (1/2ⁿ) × { (1 − √t)^{2k+1} + (1 + √t)^{2k+1},  n = 2k + 1,
                      { (1 + √t)^{2k} + (1 − √t)^{2k} + 2(1 − t)^k,  n = 2k.

The mean of X_n is

E[X_n] = G′_{X_n}(1). □
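For small n the pgf can be cross-checked by brute-force enumeration of all 2ⁿ head-turns; the following sketch (ours) confirms both G′(1) = n/4 and the closed formula at a sample point for n = 8:

```python
from fractions import Fraction as F
from itertools import product

def pgf_coeffs(n):
    """Coefficients of G_{X_n}(t) by enumerating all 2^n head-turns
    (0 = left, 1 = right; bird i invisible iff both neighbors look away)."""
    coeffs = [F(0)] * (n + 1)
    for looks in product((0, 1), repeat=n):
        k = sum(looks[(i - 1) % n] == 0 and looks[(i + 1) % n] == 1
                for i in range(n))
        coeffs[k] += F(1, 2 ** n)
    return coeffs

c = pgf_coeffs(8)
mean = sum(k * p for k, p in enumerate(c))          # G'(1)
# closed formula for n = 2k = 8 evaluated at t = 1/4 (so sqrt(t) = 1/2)
val = sum(float(p) * 0.25 ** k for k, p in enumerate(c))
closed = ((1 + 0.5) ** 8 + (1 - 0.5) ** 8 + 2 * (1 - 0.25) ** 4) / 2 ** 8
print(mean, abs(val - closed))  # 2 (= n/4) and ≈ 0
```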
Theorem 1.91. Suppose that (Ω, S, P) is a probability space and F, G ⊂ S are two independent sigma-subalgebras. If X ∈ L¹(Ω, F, P), Y ∈ L¹(Ω, G, P), then XY ∈ L¹(Ω, S, P) and

E[XY] = E[X] E[Y]. (1.3.9)

Proof. Observe that the equality (1.3.9) is bilinear in X and Y. The equality holds for X = I_F, F ∈ F, and Y = I_G, G ∈ G, and thus it holds for X ∈ Elem(Ω, F) and Y ∈ Elem(Ω, G).
If X, Y are nonnegative, then D_n[X] D_n[Y] ↗ XY and the Monotone Convergence Theorem shows that (1.3.9) holds for X, Y ≥ 0. The bilinearity of this equality implies that it holds in the claimed generality. □
Proof. Follows inductively from Corollary 1.92 by observing that for any k = 2, . . . , n the random variables f_1(X_1) · · · f_{k−1}(X_{k−1}) and f_k(X_k) are independent. □

(i) For any Borel measurable function f : R → R such that f(X) ∈ L¹ and any F ∈ F

E[f(X) I_F] = P[F] E[f(X)].
(ii) The random variable X is independent of F.
The following is not the usual definition of a convex function (see Exercise 1.23)
but it has the advantage that it is better suited for the applications we have in
mind.
Proof. Observe that when ϕ is linear the theorem is valid in the stronger form ϕ(E[X]) = E[ϕ(X)]. We can find a linear function ℓ : R → R such that ϕ(x) ≥ ℓ(x), ∀x ∈ I, and it is clear that if the theorem is valid for the nonnegative convex function g := ϕ − ℓ, then it is also valid for ϕ. Note that E[g(X)] ∈ [0, ∞]
4 The graph of such an ` is tangent to the graph of ϕ at x0 .
and thus the addition E[g(X)] + ℓ(E[X]) is well defined and yields a well defined E[ϕ(X)] when ϕ(X) is integrable or nonnegative. Moreover ϕ(X) is integrable if and only if g(X) is so. Because of this, we set

E[ϕ(X)] := ∞ if ϕ(X) is not integrable.

Set µ := E[X] and observe that µ ∈ I since X ∈ I a.s. Choose a linear function ℓ : R → R such that

ℓ(x) ≤ ϕ(x), ∀x ∈ I  and  ℓ(µ) = ϕ(µ).

Then

ϕ(E[X]) = ϕ(µ) = ℓ(µ) = E[ℓ(X)] ≤ E[ϕ(X)].

If ϕ is strictly convex, then we can choose ℓ(x) such that

ℓ(x) < ϕ(x), ∀x ∈ I \ {µ}  and  ℓ(µ) = ϕ(µ).

If X is not a.s. constant, neither is the nonnegative random variable ϕ(X) − ℓ(X), so

E[ϕ(X)] − ϕ(µ) = E[ϕ(X) − ℓ(X)] > 0. □
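A numerical illustration (our sketch) of the strict inequality for the strictly convex ϕ(x) = x² and a non-constant X, here uniform on [0, 1]:

```python
import random

random.seed(6)
# Jensen for phi(x) = x^2: phi(E[X]) = 1/4 < E[phi(X)] = 1/3
xs = [random.uniform(0, 1) for _ in range(100_000)]
mean = sum(xs) / len(xs)
mean_sq = sum(x * x for x in xs) / len(xs)
print(mean ** 2, mean_sq)  # ≈ 0.25 < ≈ 0.3333
```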
quantity

µ_k[X] := E[X^k].

Note that µ_1[X] = E[X].
Var[X] = E[(X − µ)²].

Note that

Var[X] = 0 ⟺ X = E[X] a.s.

This shows that the variance is a special case of ϕ-entropy. More precisely,

Var[X] = H_ϕ[X],  ϕ(x) = x².

Note that

Var[Z] = E[a² X̄²] = a² Var[X].
Theorem 1.98 (Chebyshev's inequality). Let X ∈ L²(Ω, S, P). Set µ := E[X] and σ := σ[X]. Then

P[|X − µ| ≥ cσ] ≤ 1/c², ∀c > 0. (1.3.16)

Equivalently

P[|X − µ| ≥ r] ≤ Var[X]/r² = σ²/r², ∀r > 0. (1.3.17)

Proof. Set Y := |X − µ|². Then

P[|X − µ| ≥ r] = P[Y ≥ r²] ≤ (1/r²) E[Y] = Var[X]/r²,

where the middle inequality follows from (1.2.19). Chebyshev's inequality (1.3.16) now follows from (1.3.17) by setting r = cσ. □
Proof. Set

µ_X := E[X],  X̄ = X − µ_X,  µ_Y := E[Y],  Ȳ = Y − µ_Y.

(i) We have

Cov[X, Y] = E[X̄ Ȳ] = E[XY] − E[µ_X Y] − E[µ_Y X] + µ_X µ_Y = E[XY] − µ_X µ_Y,

since E[µ_X Y] = E[µ_Y X] = µ_X µ_Y.
(ii) Corollary 1.92 shows that if X, Y are independent, then E[XY] = µ_X µ_Y, i.e., Cov[X, Y] = 0.
(iii) Next

Var[X + Y] = E[(X̄ + Ȳ)²] = E[X̄²] + E[Ȳ²] + 2E[X̄ Ȳ] = Var[X] + Var[Y] + 2 Cov[X, Y].

(iv) This follows from (ii) and (iii). □
M_X : I → R,  M_X(t) = E[e^{tX}]. □

Proposition 1.104. Let X be such that M_X(t) is defined for all t ∈ (−t_0, t_0).
(i) The moment generating function determines the moments of X. More precisely, the function

(−t_0, t_0) ∋ t ↦ M_X(t)

is smooth and

M_X^{(k)}(0) = µ_k[X], ∀k = 1, 2, . . . . (1.3.19)
□
Remark 1.106 (The moment problem). Denote by Prob the set of Borel probability measures on the real axis and by Prob_{∞−} the subset of Prob consisting of probability measures p such that

∫_R |x|^k p[dx] < ∞, ∀k ∈ N.

We denote by R^{N₀} the set of sequences of real numbers s = (s_n)_{n≥0}. We have a map
Part (i) of the moment problem is completely understood in the sense that several necessary and sufficient conditions are known for a sequence s to be the sequence of moments of a probability measure on R. We refer to [137, Chap. 3] for more details.
As for part (ii), it is known that a sequence s can be the sequence of moments of several probability measures; see Exercise 1.30. On the other hand, there are known sufficient conditions on s guaranteeing the uniqueness of the measure with that sequence of moments; see [137, Chap. 4] for more details. In particular, if X is a random variable such that e^{tX} is integrable for any t in an open interval containing 0, then P_X is uniquely determined by its moments, [137, Cor. 4.14]. □
We formulate for the record the last uniqueness result mentioned above. In Exercise 2.45 we outline a proof of this special case.

Theorem 1.107. Let X, Y ∈ L⁰(Ω, S, P) be such that there exists r > 0 with the property that M_X(t), M_Y(t) < ∞ for all |t| < r. Then

X =ᵈ Y ⟺ M_X(t) = M_Y(t), ∀|t| < r. □
Note that

G_X(1) = 1,  G′_X(1) = E[X],  G″_X(1) = E[X(X − 1)]. (1.3.20)

Similarly, if X, Y are two independent N₀-valued random variables, then

G_{X+Y}(t) = E[t^{X+Y}] = E[t^X] E[t^Y] = G_X(t) G_Y(t).
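The identity G_{X+Y} = G_X G_Y is exactly polynomial multiplication of the coefficient sequences; a short exact-arithmetic sketch (ours, with Bin(2, 1/2) and Bin(1, 1/2) as illustrative inputs):

```python
from fractions import Fraction as F

def pgf(dist, deg):
    """pgf as the list of coefficients of 1, t, t^2, ..., t^deg."""
    return [dist.get(k, F(0)) for k in range(deg + 1)]

def poly_mul(a, b):
    out = [F(0)] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

# X ~ Bin(2, 1/2) and Y ~ Bin(1, 1/2), independent
X = {0: F(1, 4), 1: F(1, 2), 2: F(1, 4)}
Y = {0: F(1, 2), 1: F(1, 2)}

conv = {}                      # distribution of X + Y by convolution
for i, p in X.items():
    for j, q in Y.items():
        conv[i + j] = conv.get(i + j, F(0)) + p * q

print(pgf(conv, 3) == poly_mul(pgf(X, 2), pgf(Y, 1)))  # True
```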
R. We throw in the third category the random variables that do not fit in these two categories. We want to describe a few classical examples of discrete and continuous random variables that play an important role in probability. Throughout our presentation we will frequently assume that given a sequence (µ_n)_{n∈N} of Borel probability measures on R there exists a probability space (Ω, S, P) and independent random variables X_n : (Ω, S, P) → R such that P_{X_n} = µ_n, ∀n ∈ N. The fact that such a thing is possible is a consequence of Kolmogorov's existence theorem, Theorem 1.195.
We begin by introducing some frequently occurring discrete random variables by describing the random experiments where they appear.
Note that any random variable with range {0, 1} is a Bernoulli random variable since X = I_{{X=1}}. □
Since the events (S_k)_{1≤k≤n} are independent we deduce from Corollary 1.93 that

Var[N] = Σ_{k=1}^n Var[I_{S_k}] = npq.

the sum of the numbers on the dice is 7. In this case success is when the sum is 7 and it is not hard to see that the probability of success is 1/6.
so

G_N(s) = G_{I_{S_1}}(s)^n = (q + ps)^n,  M_N(t) = M_{I_{S_1}}(t)^n = (q + pe^t)^n. □

This string of Bernoulli trials can be realized abstractly in the probability space

( {0, 1}^n, 2^{{0,1}^n}, β_p^{⊗n} ). □
Moreover

E[T_1] = Σ_{n≥1} npq^{n−1} = p Σ_{n≥1} nq^{n−1} = p (d/dq) Σ_{n≥0} q^n = p/(1 − q)² = 1/p. (1.3.21)

Here is a simple plausibility test for this result. Suppose we roll a die until we first roll a 1. The probability of rolling a 1 is 1/6, so it is to be expected that we need, on average, 6 rolls until we roll our first 1.
We have

E[T_1²] − E[T_1] = Σ_{n=1}^∞ n(n − 1)pq^{n−1} = Σ_{n=2}^∞ n(n − 1)pq^{n−1}
= pq Σ_{n=2}^∞ n(n − 1)q^{n−2} = pq (d²/dq²) (1/(1 − q)) = 2pq/(1 − q)³ = 2q/p².

We deduce that

E[T_1²] = 2q/p² + 1/p,  Var[T_1] = 2q/p² + 1/p − 1/p² = q/p².

Note that, for qe^t < 1,

M_{T_1}(t) = E[e^{tT_1}] = Σ_{n=1}^∞ pq^{n−1} e^{nt} = pe^t Σ_{m=0}^∞ (qe^t)^m = pe^t/(1 − qe^t).
Consider now a more general situation. Fix k ∈ N and perform independent Bernoulli trials until we observe the k-th success. Denote by T_k the number of trials until we record the k-th success. Note that

T_k = T_1 + (T_2 − T_1) + (T_3 − T_2) + · · · + (T_k − T_{k−1}).

Due to the independence of the trials, once we observe the i-th success it is as if we start the experiment anew, so the waiting time T_{i+1} − T_i until we observe the next success, the (i + 1)-th, is a random variable with the same distribution as T_1,

T_{i+1} − T_i =ᵈ T_1, ∀i ∈ N.

Hence E[T_{i+1} − T_i] = E[T_1] = 1/p so

E[T_k] = kE[T_1] = k/p. (1.3.22)

The probability distribution of T_k is computed as follows. Note that T_k = n if during the first n − 1 trials we observed exactly k − 1 successes, and at the n-th trial we observed another success. Hence

P[T_k = n] = \binom{n−1}{k−1} p^{k−1} q^{n−k} · p = \binom{n−1}{k−1} p^k q^{n−k}, (1.3.23)

and

M_{T_k}(t) = ( pe^t/(1 − qe^t) )^k.

Since the waiting times between two consecutive successes are independent random variables we deduce

Var[T_k] = k Var[T_1] = kq/p².

The above probability measure on R is called the negative binomial distribution and T_k is called a negative binomial random variable corresponding to k successes with probability p. We write this T_k ∼ NegBin(k, p). □
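A simulation sketch (ours) of the waiting time T_k, checking E[T_k] = k/p and Var[T_k] = kq/p² for illustrative parameters k = 3, p = 1/4:

```python
import random

random.seed(7)

def waiting_time(k, p):
    """Number of Bernoulli(p) trials up to and including the k-th success."""
    n = successes = 0
    while successes < k:
        n += 1
        successes += random.random() < p
    return n

k, p, trials = 3, 0.25, 100_000
sample = [waiting_time(k, p) for _ in range(trials)]
mean = sum(sample) / trials
var = sum((x - mean) ** 2 for x in sample) / trials
print(mean, var)  # ≈ k/p = 12 and ≈ k*q/p^2 = 36
```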
Let us describe a classical and less than obvious application of the geometric
random variables.
We have

M′_N(t) = λe^t e^{λ(e^t − 1)},  M″_N(t) = λe^t e^{λ(e^t − 1)} + (λe^t)² e^{λ(e^t − 1)},

so

E[N²] = M″_N(0) = λ + λ²,  Var[N] = λ. □
Using the above equality with m = 0 we obtain the better known formula

P[A_1 ∪ · · · ∪ A_n] = 1 − P[Ω_0] = Σ_{k=1}^n (−1)^{k−1} Σ_{I⊂I_n, |I|=k} P[A_I] = Σ_{k=1}^n (−1)^{k−1} s_k. (1.3.26)
= Σ_{k=0}^{n−m} (−1)^k Σ_{|J|=m+k} c(J) I_{A_J}.

Now observe that for any subset J ⊂ I_n of cardinality m + k there are \binom{m+k}{m} different ways of writing I_{A_J} as a product

I_{A_J} = I_{A_I} I_{A_{J\I}},  |I| = m.

Thus c(J) = \binom{m+k}{m} for |J| = m + k. We deduce

Σ_{|J|=m+k} c(J) I_{A_J} = \binom{m+k}{m} T_{m+k}.
Note that I_{A_I}(ω) = 0 if |I| > r(ω). In particular, this shows that all the terms in the inequality (1.3.29) are equal to zero if r(ω) < m.
Suppose that r(ω) ≥ m. Then, for any k ≤ r, we have

T_k(ω) = Σ_{I⊂I_ω, |I|=k} I_{A_I}(ω) = \binom{r}{k}.

Observe that

a_k = \binom{r}{m} \binom{p}{k},  p := r − m.

The inequality (1.3.32) reduces to

\binom{p}{0} − \binom{p}{1} + · · · + \binom{p}{2ℓ−2} − \binom{p}{2ℓ−1} ≤ 0,

0 ≤ \binom{p}{0} − \binom{p}{1} + \binom{p}{2} − · · · − \binom{p}{2ℓ−1} + \binom{p}{2ℓ},

where 2ℓ ≤ p. These inequalities are immediate consequences of two well known properties of the binomial coefficients, namely their symmetry

\binom{p}{k} = \binom{p}{p−k},
Similarly, for n ≥ m we denote by Ω^n_m the set of points in Ω that belong to exactly m of the sets A_{n,1}, . . . , A_{n,n}. Using the Bonferroni inequalities we deduce that for fixed ℓ and n > 2ℓ + m we have

Σ_{k=0}^{2ℓ−1} (−1)^k \binom{m+k}{m} s^n_{m+k} ≤ P[Ω^n_m] ≤ Σ_{k=0}^{2ℓ} (−1)^k \binom{m+k}{m} s^n_{m+k}. (1.3.33)

Suppose now that there exists λ > 0 such that, for any q ∈ N, we have

lim_{n→∞} s^n_q = λ^q/q!. (1.3.34)

If we let n → ∞ in (1.3.33) we obtain

(λ^m/m!) Σ_{k=0}^{2ℓ−1} (−1)^k λ^k/k! ≤ lim inf_{n→∞} P[Ω^n_m] ≤ lim sup_{n→∞} P[Ω^n_m] ≤ (λ^m/m!) Σ_{k=0}^{2ℓ} (−1)^k λ^k/k!.
Then Ω^n_m = {X_n = m} and thus we showed that if (1.3.34) holds, then

lim_{n→∞} P[X_n = m] = P[Poi(λ) = m],

where we recall that Poi(λ) denotes a Poisson random variable with parameter λ.
The phenomenon depicted above is referred to under the generic name of poissonization or Poisson approximation. Let us observe that if the events A_{n,k} are independent and P[A_{n,i}] = λ/n, then

s^n_k = \binom{n}{k} (λ/n)^k ∼ λ^k/k! as n → ∞.

In this case X_n ∼ Bin(n, λ/n). The success probability λ/n is small for large n and for this reason the Poisson distribution is sometimes referred to as the law of rare events.
The estimation techniques based on various versions of the inclusion-exclusion principle are called sieves. We refer to [143, Chaps. 2, 3] for a more detailed description of far reaching generalizations of the inclusion-exclusion principle and associated sieves. □
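The law of rare events is easy to see numerically; the sketch below (ours) compares the Bin(n, λ/n) and Poi(λ) pmfs for λ = 2:

```python
import math

def binom_pmf(n, p, m):
    return math.comb(n, m) * p ** m * (1 - p) ** (n - m)

def poi_pmf(lam, m):
    return math.exp(-lam) * lam ** m / math.factorial(m)

lam = 2.0
errs = []
for n in (10, 100, 1000):
    errs.append(max(abs(binom_pmf(n, lam / n, m) - poi_pmf(lam, m))
                    for m in range(10)))
print(errs)  # the pointwise error shrinks as n grows
```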
Thus the expected number of fixed points is rather low: a random permutation has, on average, one fixed point.
Let us compute the probability distribution of F. For each I ⊂ I_n we set

E_I = ∩_{i∈I} E_i.

Thus σ ∈ E_I if and only if the permutation σ fixes all the points in I. We deduce that if |I| = k, then

P[E_I] = (n − k)!/n!  and  s_k := Σ_{|I|=k} P[E_I] = \binom{n}{k} (n − k)!/n! = 1/k!.
Note that if F(σ) = m, then σ fixes exactly m points, and (1.3.25) yields

P[F = m] = Σ_{k=0}^{n−m} (−1)^k \binom{m+k}{m} s_{m+k} = (1/m!) Σ_{k=0}^{n−m} (−1)^k/k!.

In particular, the proportion of derangements is

P[F = 0] = Σ_{k=0}^n (−1)^k/k!.

The equality E[F] = 1 yields an interesting identity

1 = Σ_{m=1}^n m P[F = m] = Σ_{m=1}^n (1/(m − 1)!) Σ_{k=0}^{n−m} (−1)^k/k!.
Note that

lim_{n→∞} P[F_n = m] = e^{−1}/m!. (1.3.36)

The sequence (e^{−1}/m!)_{m≥0} describes the Poisson distribution Poi(1). □
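The Poisson limit (1.3.36) is already visible for moderate n; a simulation sketch (ours) with n = 20:

```python
import math
import random

random.seed(8)

def fixed_points(n):
    perm = list(range(n))
    random.shuffle(perm)
    return sum(perm[i] == i for i in range(n))

n, trials = 20, 100_000
counts = [0] * (n + 1)
for _ in range(trials):
    counts[fixed_points(n)] += 1

for m in range(4):
    print(m, counts[m] / trials, math.exp(-1) / math.factorial(m))
# empirical frequencies already sit close to the Poi(1) limit e^{-1}/m!
```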
Fig. 1.3 The graph of γ_{0,σ} for σ = 1 (dotted curve) and σ = 0.1 (continuous curve).
We have

E[Y] = (1/√(2π)) ∫_R y e^{−y²/2} dy = 0,
and

Var[Y] = E[Y²] = (1/√(2π)) ∫_R y² e^{−y²/2} dy = (2/√(2π)) ∫_0^∞ y² e^{−y²/2} dy

(s = y²/2, y = √(2s))

= (2/√π) ∫_0^∞ s^{1/2} e^{−s} ds = (2/√π) Γ(3/2) = (2/√π) · (1/2) Γ(1/2) = 1,

where at the last two steps we used basic facts about the Gamma function recalled in Proposition A.2. We deduce that

X ∼ N(µ, σ²) ⇒ E[X] = µ, Var[X] = σ². (1.3.38)

A variable X ∼ N(0, 1) is called a standard normal random variable. Its cdf

Φ(x) := P[X ≤ x] = (1/√(2π)) ∫_{−∞}^x e^{−s²/2} ds (1.3.39)

plays an important role in probability and statistics. The quantity

P[X > x] / γ_1(x)

is called the Mills ratio of the standard normal random variable. It satisfies the inequalities

( x/(x² + 1) ) γ_1(x) ≤ P[X > x] ≤ (1/x) γ_1(x). (1.3.40)

In Exercise 1.25 we outline a proof of this inequality.
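The bounds (1.3.40) can be checked numerically via the complementary error function, since P[X > x] = erfc(x/√2)/2 (our sketch):

```python
import math

def gauss_tail(x):                  # P[X > x] for X ~ N(0, 1)
    return 0.5 * math.erfc(x / math.sqrt(2))

def gauss_density(x):               # the density gamma_1(x)
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

for x in (0.5, 1.0, 2.0, 4.0):
    lo = x / (x * x + 1) * gauss_density(x)
    hi = gauss_density(x) / x
    print(x, lo, gauss_tail(x), hi)  # lo <= tail <= hi in each row
```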
Observe that if X ∼ N(0, 1) and σ ∈ R, then σX ∼ N(0, σ²) and

M_{σX}(t) = E[e^{tσX}] = M_X(σt).
From the definition of the Gamma function we deduce that g_ν(x; λ) is indeed a probability density, i.e.,

∫_0^∞ g_ν(x; λ) dx = 1.
We will use the notation X ∼ Gamma(ν, λ) to indicate that PX = Γν,λ .
The Gamma(1, λ)-random variables play a special role in probability. They are
called exponential random variables with parameter λ. We will use the notation
X ∼ Exp(λ) to indicate that X is such a random variable. The distribution of Exp(λ) is

λe^{−λx} I_{(0,∞)}(x) dx.
We will have more to say about exponential variables in the next subsection.
The parameter ν is sometimes referred to as the shape parameter. Figure 1.4
may explain the reason for this terminology.
Var[X] = µ_2[X] − µ_1[X]² = Γ(ν + 2)/(λ² Γ(ν)) − ν²/λ² = ( ν(ν + 1) − ν² )/λ² = ν/λ².

Finally, if X ∼ Gamma(ν, λ), then for t < λ we have

M_X(t) = (λ^ν/Γ(ν)) ∫_0^∞ x^{ν−1} e^{−(λ−t)x} dx

(x = y/(λ − t))

= λ^ν/( Γ(ν)(λ − t)^ν ) ∫_0^∞ y^{ν−1} e^{−y} dy = ( λ/(λ − t) )^ν. □
Hence

Var[X] = E[X²] − (E[X])² = ( a/(a + b) ) · ( (a + 1)/(a + b + 1) ) − ( a/(a + b) )²
= ( a/(a + b) ) · ( (a + 1)(a + b) − a(a + b + 1) )/( (a + b)(a + b + 1) ) = ab/( (a + b)²(a + b + 1) ).

The distribution Beta(1/2, 1/2) is called the arcsine distribution. In this case

β_{1/2,1/2}(x) = 1/( π √(x(1 − x)) ),

and

∫_0^x β_{1/2,1/2}(s) ds = (2/π) arcsin √x.

We refer to Exercise 1.35 for an alternate interpretation of Beta(1/2, 1/2). □
In Appendix A.2 we have listed the basic integral invariants of several frequently
occurring probability distributions.
Proof. We prove only the statement concerning f^{ω_1}. For simplicity we will write f_{ω_1} instead of f^{ω_1}. We will use the Monotone Class Theorem 1.21.
Denote by M the collection of functions f ∈ L⁰(Ω_0 × Ω_1, S_0 ⊗ S_1) such that f_{ω_1} is S_0-measurable, ∀ω_1 ∈ Ω_1. Clearly if f, g ∈ M are bounded then af + bg ∈ M, ∀a, b ∈ R.
The collection R of rectangles is a π-system. Note that for any rectangle R = S_0 × S_1 the function f = I_R belongs to M. Indeed, for any ω_1 ∈ Ω_1 we have

f_{ω_1} = { I_{S_0},  ω_1 ∈ S_1,
          { 0,  ω_1 ∈ Ω_1 \ S_1.

If (f_n) is an increasing sequence of functions in M so is the sequence of slices f_{n,ω_1}, so the limit f is also in M. By the Monotone Class Theorem the collection M contains all the nonnegative measurable functions. Since M is a vector space, it must coincide with L⁰(Ω_0 × Ω_1, S_0 ⊗ S_1).
When f ∈ L⁰₊, but f is allowed to have infinite values, the function f is the increasing limit of a sequence in M. Hence this situation is also included in the conclusions of the lemma. □
is well defined.
This follows from the Monotone Class Theorem arguing exactly as in the proof of Lemma 1.123.
For S ∈ S_0 ⊗ S_1 we set

µ_{1,0}[S] = I_{1,0}[I_S].

Note that

I_1[I_{S_0×S_1}](ω_0) = ∫_{Ω_1} I_{S_0×S_1}(ω_0, ω_1) µ_1(dω_1).

If ω_0 ∈ Ω_0 \ S_0 the integral is 0. If ω_0 ∈ S_0 the integral is

∫_{Ω_1} I_{S_1} dµ_1 = µ_1[S_1].

Hence

I_1[I_{S_0×S_1}] = µ_1[S_1] I_{S_0}.

We deduce

µ_{1,0}[S_0 × S_1] = ∫_{Ω_0} µ_1[S_1] I_{S_0} dµ_0 = µ_0[S_0] · µ_1[S_1].

We want to show that if ν is another measure on S such that ν[R] = µ_{1,0}[R] for any R ∈ R, then

ν[A] = µ_{1,0}[A], ∀A ∈ S.

To see this assume first that µ_0 and µ_1 are finite measures. Then Ω_0 × Ω_1 ∈ R,

µ_{1,0}[Ω_0 × Ω_1] = ν[Ω_0 × Ω_1] < ∞,

and since R is a π-system we deduce from Proposition 1.29 that µ_{1,0} = ν on S.
To deal with the general case choose two increasing sequences E^i_n ∈ S_i, i = 0, 1, such that

µ_i[E^i_n] < ∞, ∀n  and  Ω_i = ∪_{n≥1} E^i_n, i = 0, 1.

Define

E_n := E^0_n × E^1_n,  µ^n_i[S_i] := µ_i[S_i ∩ E^i_n], S_i ∈ S_i, i = 0, 1,  ν^n[A] := ν[A ∩ E_n], ∀A ∈ S.
The above construction can be iterated. More precisely, given sigma-finite measure spaces $(\Omega_k,\mathcal{S}_k,\mu_k)$, $k=1,\dots,n$, we have a measure $\mu=\mu_1\otimes\cdots\otimes\mu_n$ uniquely determined by the condition
$$\mu[S_1\times S_2\times\cdots\times S_n] = \mu_1[S_1]\,\mu_2[S_2]\cdots\mu_n[S_n],\quad\forall S_k\in\mathcal{S}_k,\ k=1,\dots,n.$$
Remark 1.125. Recall that $\lambda$ denotes the Lebesgue measure on $\mathbb{R}$. The measure $\lambda^{\otimes n}$ on $\mathcal{B}_{\mathbb{R}^n}$ is called the $n$-dimensional Lebesgue measure and will be denoted by $\lambda^n$, or simply $\lambda$ when no confusion is possible. A subset of $\mathbb{R}^n$ is called Lebesgue measurable if it belongs to the completion of the Borel sigma-algebra with respect to the Lebesgue measure.
One can prove that if a function f : Rn → R is absolutely Riemann integrable
(see [121, Chap. 15]), then it is also Lebesgue integrable with respect to the Lebesgue
measure on Rn and, moreover
$$\int_{\mathbb{R}^n} f(x)\,|dx| = \int_{\mathbb{R}^n} f(x)\,\lambda[dx],$$
where the left-hand-side integral is the (improper) Riemann integral.
We recommend that the reader try to prove this fact, or at least try to understand why a Riemann integrable function defined on a cube is Lebesgue measurable. This is not obvious because there exist Riemann integrable functions that are not Borel measurable.
For example, if $C\subset[0,1]$ is the Cantor set, then there exists a subset $A$ of $C$ that is not Borel, because the cardinality of the power set $2^C$ is bigger than the cardinality of the family of Borel subsets of $C$. The subset $A$ is Lebesgue measurable since $C$ is Lebesgue negligible. The indicator function $\boldsymbol{I}_A$ is Riemann integrable but not Borel measurable.
The change of variables formula for the Riemann integral shows that if $U,V$ are open subsets of $\mathbb{R}^n$ and $F:U\to V$ is a $C^1$-diffeomorphism onto $V$, then
$$\big(F^{-1}\big)_{\#}\lambda_V[dx] = |\det J_F(x)|\,\lambda_U[dx]. $$
⊔⊓
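This identity is easy to sanity-check numerically in dimension one. The following sketch (my own illustration, not from the book; the function names are mine) compares $\int_V f\,d\lambda$ with $\int_U f(F(x))\,|F'(x)|\,dx$ for $F(x)=x^2$ on $U=V=(0,1)$:

```python
# 1-D change of variables check: integral of f over V should equal
# the integral of f(F(x))|F'(x)| over U, with F(x) = x**2 on U = (0,1).
def riemann(g, a, b, n=100_000):
    """Midpoint Riemann sum of g over [a, b]."""
    h = (b - a) / n
    return sum(g(a + (i + 0.5) * h) for i in range(n)) * h

f = lambda y: y ** 3  # an arbitrary integrand on V; exact integral is 1/4
lhs = riemann(f, 0.0, 1.0)                            # ∫_V f dλ
rhs = riemann(lambda x: f(x ** 2) * 2 * x, 0.0, 1.0)  # ∫_U f(F(x))|F'(x)| dx
assert abs(lhs - 0.25) < 1e-6 and abs(lhs - rhs) < 1e-6
```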
Proof. We have
$$\int_0^\infty px^{p-1}\,P[X>x]\,dx = \int_0^\infty px^{p-1}\Big(\int_\Omega \boldsymbol{I}_{\{X>x\}}(\omega)\,P[d\omega]\Big)dx$$
$$= \int_{\substack{(\omega,x)\in\Omega\times[0,\infty)\\ 0\le x<X(\omega)}} px^{p-1}\,\big(P\otimes\lambda\big)[d\omega\,dx]$$
(use Fubini-Tonelli)
$$= \int_\Omega\Big(\int_0^{X(\omega)} px^{p-1}\,dx\Big)P[d\omega] = \int_\Omega X^p(\omega)\,P[d\omega] = E[X^p]. $$
⊔⊓
Example 1.127. Suppose that $X$ is a random variable that takes only nonnegative integral values. Then
$$P_X = \sum_{n\ge0} P[X=n]\,\delta_n,$$
and
$$E[X] = \int_0^\infty P[X>x]\,dx \quad (1.3.43)$$
$$= \sum_{n\ge0}\int_n^{n+1} P[X>x]\,dx = \sum_{n\ge0} P[X>n]. \quad (1.3.44)$$
Similarly, since $2\sum_{n\ge0}nP[T>n] = E[T(T-1)] = E[T^2]-E[T]$ and $P[T>n]=q^n$,
$$\mu_2[T]-E[T] = E[T^2]-E[T] = 2\sum_{n\ge1}nq^n = 2q\sum_{n\ge1}nq^{n-1} = \frac{2q}{(1-q)^2} = \frac{2q}{p^2}.$$
In particular, since $E[T]=\frac1p$,
$$\operatorname{Var}[T] = E[T^2]-\big(E[T]\big)^2 = \frac{2q}{p^2}+\frac1p-\frac1{p^2} = \frac{q}{p^2}. $$
⊔⊓
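The tail-sum formulas above lend themselves to a direct numerical check. The following sketch (mine, not the book's; the parameter value is arbitrary) recovers the mean $1/p$ and variance $q/p^2$ of a geometric variable from its tails $P[T>n]=q^n$:

```python
# Tail-sum check for a geometric variable T with P[T > n] = q**n.
p = 0.3
q = 1 - p
N = 2000  # truncation level; the geometric tails decay fast enough

mean = sum(q ** n for n in range(N))                      # E[T] = sum P[T > n]
second_factorial = 2 * sum(n * q ** n for n in range(N))  # E[T(T-1)]
var = second_factorial + mean - mean ** 2                 # Var[T]

assert abs(mean - 1 / p) < 1e-9
assert abs(var - q / p ** 2) < 1e-9
```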
This random variable describes the waiting time for an event to happen, e.g., the waiting time for a laptop to crash, or the waiting time for a bus to arrive at a bus station. The quantity $\lambda e^{-\lambda t}dt$ is the probability that the waiting time is in the interval $(t,t+dt]$. Then
$$P[T>t] = \int_t^\infty \lambda e^{-\lambda\tau}\,d\tau = e^{-\lambda t},\qquad E[T] = \int_0^\infty e^{-\lambda t}\,dt = \frac1\lambda.$$
We see that $\frac1\lambda$ is measured in units of time. For this reason $\lambda$ is called the rate and describes how many rare events take place per unit of time.
Similarly
$$\mu_2[T] = E[T^2] = 2\int_0^\infty t\,P[T>t]\,dt = 2\int_0^\infty te^{-\lambda t}\,dt = \frac{2}{\lambda^2}\int_0^\infty se^{-s}\,ds = \frac{2}{\lambda^2}\Gamma(2) = \frac{2}{\lambda^2}.$$
The function $S(t):=P[T>t]$ is called the survival function. For example, if $T$ denotes the life span of a laptop, then $S(t)$ is the probability that a laptop survives more than $t$ units of time.
The exponential distribution enjoys the so-called memoryless property
$$P[T>t+s\,|\,T>s] = P[T>t]. \quad (1.3.45)$$
For example, if $T$ is the waiting time for a bus to arrive then, given that you have waited more than $s$ units of time, the probability that you will have to wait at least $t$ extra is the same as if you had not waited at all. The proof of (1.3.45) is immediate:
$$P[T>t+s\,|\,T>s] = \frac{P[T>t+s]}{P[T>s]} = \frac{e^{-\lambda(t+s)}}{e^{-\lambda s}} = e^{-\lambda t} = P[T>t]. $$
⊔⊓
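The memoryless property is also easy to observe empirically. A Monte Carlo sketch (mine, not from the book; the parameter values are arbitrary) compares the conditional tail $P[T>t+s\,|\,T>s]$ with the unconditional tail $P[T>t]$:

```python
import math
import random

# Monte Carlo check of the memoryless property of Exp(lam):
# P[T > t+s | T > s] should equal P[T > t] = exp(-lam*t).
random.seed(0)
lam, s, t, n = 1.0, 0.5, 0.7, 200_000
samples = [random.expovariate(lam) for _ in range(n)]

survived_s = [x for x in samples if x > s]
cond_tail = sum(x > s + t for x in survived_s) / len(survived_s)
plain_tail = sum(x > t for x in samples) / n

assert abs(cond_tail - math.exp(-lam * t)) < 0.02
assert abs(plain_tail - math.exp(-lam * t)) < 0.02
```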
was denoted by
$$\int_0^a u(x)\,dF_k(x).$$
This classical notation is a bit ambiguous due to the following simple fact:
$$\int_{[0,a]} u(x)\,\mu_k[dx] = u(0)F_k(0) + \int_{(0,a]} u(x)\,\mu_k[dx].$$
We want to prove a version of the integration by parts formula. Namely, we will show that if one of the functions $F_0,F_1$ is continuous, then
$$\int_0^a F_0(x)\,dF_1(x) = F_0(a)F_1(a) - F_0(0)F_1(0) - \int_0^a F_1(x)\,dF_0(x). \quad (1.3.46)$$
Assume for simplicity that $F_1$ is continuous, so $F_1(0)=0$. Set $\mu:=\mu_0\otimes\mu_1$. Observe that
$$F_0(a)F_1(a)-F_0(0)F_1(0) = F_0(a)F_1(a) = \mu[S_a],\quad S_a := [0,a]\times[0,a].$$
Since $F_1$ is continuous,
$$\int_0^a F_1(x)\,dF_0(x) = \int_{[0,a]}\Big(\int_{[0,a]}\boldsymbol{I}_{[0,x)}(y)\,\mu_1[dy]\Big)\mu_0[dx] = \mu[R_0],$$
where
$$R_0 := \big\{(x,y)\in\mathbb{R}^2;\ 0\le y<x\le a\big\}.$$
Similarly
$$\int_0^a F_0(y)\,dF_1(y) = \int_{[0,a]}\Big(\int_{[0,a]}\boldsymbol{I}_{[0,y]}(x)\,\mu_0[dx]\Big)\mu_1[dy] = \mu[R_1],$$
$$R_1 := \big\{(x,y)\in\mathbb{R}^2;\ 0\le x\le y\le a\big\}.$$
Observe that the regions $R_0,R_1$ are disjoint.
The region $R_0$ is the part of the square $S_a=[0,a]\times[0,a]$ strictly below the diagonal $y=x$, while $R_1$ is the part of this square above or on this diagonal. Hence $S_a=R_0\cup R_1$ and thus
$$\mu[R_0]+\mu[R_1] = \mu[S_a].$$
Let us observe that the integration by parts formula is not true if both $F_0,F_1$ are discontinuous. Take for example the case $\mu_0=\mu_1=\frac12\big(\delta_1+\delta_3\big)$. Then
$$F_0(x)=F_1(x)=F(x) = \begin{cases} 0, & x<1,\\ \tfrac12, & 1\le x<3,\\ 1, & x\ge3.\end{cases}$$
The reason for this failure has a simple geometric origin: the diagonal $\{y=x\}$ may not be $\mu_0\otimes\mu_1$-negligible. The continuity assumption allowed us to discard the diagonal of the square because in this case it is indeed negligible. ⊔⊓
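The failure can be checked by hand. A small exact computation (my own sketch, not from the book) with $\mu_0=\mu_1=\frac12(\delta_1+\delta_3)$ and $a=4$ evaluates both Stieltjes integrals as sums over the atoms and exhibits the defect as exactly the diagonal mass $\mu[\{y=x\}]=\sum_x\mu_0[\{x\}]^2$:

```python
from fractions import Fraction

# mu0 = mu1 = (delta_1 + delta_3)/2;  F(x) = mu[[0, x]].
atoms = {1: Fraction(1, 2), 3: Fraction(1, 2)}
F = lambda x: sum(w for a, w in atoms.items() if a <= x)

a = 4
# Stieltjes integral over [0, a]: weighted sum of F at the integrator's atoms.
int_F0_dF1 = sum(F(x) * w for x, w in atoms.items())
int_F1_dF0 = int_F0_dF1  # the two measures coincide

lhs = int_F0_dF1 + int_F1_dF0                       # 3/4 + 3/4 = 3/2
rhs = F(a) * F(a) - F(0) * F(0)                     # 1
diagonal_mass = sum(w * w for w in atoms.values())  # mu({y = x}) = 1/2
assert lhs == rhs + diagonal_mass  # (1.3.46) fails exactly by the diagonal mass
```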
X : (Ω, S, P) → (V, BV ).
X1 , . . . , Xn : (Ω, S, P) → R
X := (X1 , . . . , Xn ) : (Ω, S, P) → Rn .
⊔⊓
X1 , . . . , Xn ∈ L0 (Ω, S, P)
are random variables with probability distributions PX1 , . . . , PXn . The following
statements are equivalent.
Proof. The random variables $X_1,\dots,X_n$ are independent iff for any Borel sets $B_1,\dots,B_n\subset\mathbb{R}$ we have
$$P[X_1\in B_1,\dots,X_n\in B_n] = P[X_1\in B_1]\cdots P[X_n\in B_n]$$
$$\Longleftrightarrow\ P_{X_1,\dots,X_n}[B_1\times\cdots\times B_n] = \big(P_{X_1}\otimes\cdots\otimes P_{X_n}\big)[B_1\times\cdots\times B_n].$$
Thus the random variables $X_1,\dots,X_n$ are independent iff the measures $P_{X_1,\dots,X_n}$ and $P_{X_1}\otimes\cdots\otimes P_{X_n}$ coincide on the set of rectangles $B_1\times\cdots\times B_n$, i.e.,
$$P_{X_1,\dots,X_n} = P_{X_1}\otimes\cdots\otimes P_{X_n}. $$
⊔⊓
Remark 1.135. (a) Suppose that $F_\mu$ is the cdf of the probability measure $\mu$, i.e., $F_\mu(c)=\mu[(-\infty,c]]$, $\forall c\in\mathbb{R}$. Then the cdf $F_{\mu*\nu}$ of $\mu*\nu$ satisfies
$$F_{\mu*\nu}(c) = \int_{\mathbb{R}} F_\mu(c-x)\,\nu[dx],\quad\forall c\in\mathbb{R}.$$
We write this equality as
$$F_{\mu*\nu} = F_\mu*\nu. \quad (1.3.48)$$
If $\mu$ and $\nu$ are absolutely continuous with respect to the Lebesgue measure $\lambda$ on $\mathbb{R}$, so
$$\mu[dx] = \rho_\mu(x)dx,\quad \nu[dx] = \rho_\nu(x)dx,\quad \rho_\mu,\rho_\nu\in L^1(\mathbb{R},\lambda),$$
then $\mu*\nu\ll\lambda$ and
$$\big(\mu*\nu\big)[dx] = \rho_{\mu*\nu}(x)dx,\quad \rho_{\mu*\nu}(x) = \big(\rho_\mu*\rho_\nu\big)(x) := \int_{\mathbb{R}}\rho_\mu(x-y)\,\nu[dy].$$
To see this it suffices to check that for any $c\in\mathbb{R}$ we have
$$\big(\mu*\nu\big)[(-\infty,c]] = \int_{-\infty}^c \rho_{\mu*\nu}(x)\,dx.$$
We have
$$\big(\mu*\nu\big)[(-\infty,c]] = \int_{\mathbb{R}}\mu[(-\infty,c-y]]\,\nu[dy] = \int_{\mathbb{R}}\Big(\int_{-\infty}^{c-y}\rho_\mu(x)\,dx\Big)\nu[dy] = \int_{\mathbb{R}}\Big(\int_{-\infty}^{c}\rho_\mu(z-y)\,dz\Big)\nu[dy]$$
(use Fubini)
$$= \int_{-\infty}^c\Big(\int_{\mathbb{R}}\rho_\mu(z-y)\,\nu[dy]\Big)dz = \int_{-\infty}^c \rho_{\mu*\nu}(z)\,dz.$$
$$\boldsymbol{1}_{\mathbb{R}} : (\mathbb{R},\mathcal{B}_{\mathbb{R}},\mu)\to\mathbb{R},\quad \boldsymbol{1}_{\mathbb{R}}(x)=x.$$
If $\mu_1,\mu_2,\mu_3$ are different Borel probability measures on $\mathbb{R}$, then we can define three independent random variables
$$X_1,X_2,X_3 : \big(\mathbb{R}^3,\mathcal{B}_{\mathbb{R}^3},\mu_1\otimes\mu_2\otimes\mu_3\big)\to\mathbb{R},\quad X_k(x_1,x_2,x_3)=x_k,\ k=1,2,3.$$
Similarly
(c) The operation of convolution makes sense for any finite Borel measures $\mu,\nu$ on $\mathbb{R}$ and satisfies the same commutativity and associativity properties we encountered in the case of probability measures. Note that $\big(\mu*\nu\big)[\mathbb{R}] = \mu[\mathbb{R}]\cdot\nu[\mathbb{R}]$. ⊔⊓
T1 = S1 , T2 = S2 − S1 , . . . , Tn = Sn − Sn−1 , . . .
If $n>0$, then $N(t)=n$ if and only if the $n$-th bus arrived sometime during the interval $[0,t]$, i.e., $S_n\le t$, but the $(n+1)$-th bus has not arrived in this time interval. We deduce
$$P[N(t)=n] = P\big[\{S_n\le t\}\setminus\{S_{n+1}\le t\}\big] = P[S_n\le t] - P[S_{n+1}\le t].$$
If we denote by $F_n(t)$ the cdf of $S_n$, then we can rewrite the above equality in the form
$$P[N(t)=n] = F_n(t)-F_{n+1}(t).$$
We have
$$P_{S_n} = \underbrace{\mathrm{Exp}(\lambda)*\cdots*\mathrm{Exp}(\lambda)}_{n} = \underbrace{\mathrm{Gamma}(\lambda,1)*\cdots*\mathrm{Gamma}(\lambda,1)}_{n} \overset{(1.6.6a)}{=} \mathrm{Gamma}(\lambda,n).$$
Hence, for $n>0$,
$$F_{n+1}(t) = \frac{\lambda^{n+1}}{\Gamma(n+1)}\int_0^t s^ne^{-\lambda s}\,ds = \frac{\lambda^{n+1}}{n!}\int_0^t s^ne^{-\lambda s}\,ds.$$
For $n>0$, we integrate by parts to obtain
$$F_{n+1}(t) = -\frac{\lambda^n}{n!}s^ne^{-\lambda s}\Big|_{s=0}^{s=t} + \frac{\lambda^n}{(n-1)!}\int_0^t s^{n-1}e^{-\lambda s}\,ds = -\frac{(t\lambda)^n}{n!}e^{-\lambda t} + F_n(t).$$
Hence
$$P[N(t)=n] = F_n(t)-F_{n+1}(t) = \frac{(t\lambda)^n}{n!}e^{-\lambda t},\quad n>0. \quad (1.3.49)$$
This shows that $N(t)$ is a Poisson random variable, $N(t)\sim\mathrm{Poi}(\lambda t)$.
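The identification $N(t)\sim\mathrm{Poi}(\lambda t)$ is easy to test empirically. A Monte Carlo sketch (mine, not from the book; the parameter choices are arbitrary) simulates exponential inter-arrivals and checks that the counts have the Poisson mean and variance $\lambda t$:

```python
import random

# Simulate a Poisson stream: exponential inter-arrival times with rate lam,
# count arrivals N(t) in [0, t]; N(t) should be Poi(lam*t), mean = var = lam*t.
random.seed(0)
lam, t, runs = 2.0, 3.0, 20_000

def count_arrivals():
    s, n = random.expovariate(lam), 0
    while s <= t:
        n += 1
        s += random.expovariate(lam)
    return n

counts = [count_arrivals() for _ in range(runs)]
mean = sum(counts) / runs
var = sum((c - mean) ** 2 for c in counts) / runs
assert abs(mean - lam * t) < 0.15   # E[N(t)] = 6
assert abs(var - lam * t) < 0.5     # Var[N(t)] = 6
```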
The Poisson process plays an important role in probability since it appears
in many situations and displays many surprising phenomena. One such interesting
phenomenon is the waiting time paradox, [59, I.4]. To better appreciate this paradox
we consider two separate situations.
Suppose first that buses arrive at a bus station following a Poisson stream with
frequency λ. Bob arrives at the bus station at a time t ≥ 0, the bus is not there
and he is waiting for the next one. His waiting time is
$$W_t := S_{N(t)+1} - t.$$
We want to compute its expectation $w_t := E[W_t]$. There are two possible heuristic arguments.
(i) The memoryless property of the exponential distribution shows that $w_t$ should be independent of $t$, so $w_t = w_0 = \frac1\lambda$.
(ii) Bob's arrival time $t$ is uniformly distributed in the inter-arrival interval $\big[S_{N(t)},S_{N(t)+1}\big]$ of expected length $\frac1\lambda$ and, as in the earlier deterministic computation, the expectation should be half its length, $\frac1{2\lambda}$.
We will show that (i) provides the correct answer. However, even the reasoning (ii) holds a bit of truth. To see what is happening we compute the expectations of $S_{N(t)}$ and $S_{N(t)+1}$. We have
$$E[S_{N(t)}] = \int_0^t P[S_{N(t)}>x]\,dx.$$
Note that
$$P[S_{N(t)}>x] = \sum_{n\ge0}P[S_{N(t)}>x,\ N(t)=n].$$
Since $\{S_{N(t)}>x,\ N(t)=n\} = \{x<S_n\le t,\ S_n+T_{n+1}>t\}$ and, denoting by $\rho(s,\tau)$ the joint density of $(S_n,T_{n+1})$, we have
$$P[x<S_n\le t,\ S_n+T_{n+1}>t] = \int_{\substack{x<s\le t\\ s+\tau>t}}\rho(s,\tau)\,ds\,d\tau$$
$$= \int_x^t\Big(\int_{t-s}^{\infty}\rho(s,\tau)\,d\tau\Big)ds = \int_x^t P[T_{n+1}>t-s]\,\frac{\lambda^n}{(n-1)!}s^{n-1}e^{-\lambda s}\,ds$$
$$= \int_x^t e^{-\lambda(t-s)}\,\frac{\lambda^n}{(n-1)!}s^{n-1}e^{-\lambda s}\,ds = \frac{\lambda^ne^{-\lambda t}}{(n-1)!}\int_x^t s^{n-1}\,ds = \frac{\lambda^ne^{-\lambda t}}{n!}\big(t^n-x^n\big).$$
We deduce
$$P[S_{N(t)}>x] = \sum_{n\ge0}\frac{e^{-\lambda t}\lambda^n}{n!}\big(t^n-x^n\big) = 1-e^{-\lambda(t-x)},$$
$$E[S_{N(t)}] = \int_0^t\big(1-e^{-\lambda(t-x)}\big)dx = t-e^{-\lambda t}\int_0^te^{\lambda x}\,dx = t-\frac{e^{-\lambda t}}{\lambda}\big(e^{\lambda t}-1\big).$$
Hence
$$E[S_{N(t)}] = t-\frac1\lambda+\frac{e^{-\lambda t}}{\lambda} = \frac1\lambda\Big(E[N(t)]-1+e^{-\lambda t}\Big). \quad (1.3.50)$$
Let us compute $E[S_{N(t)+1}]$. Again, we have
$$P[S_{N(t)+1}>x] = \sum_{n\ge0}P[S_{N(t)+1}>x,\ N(t)=n],$$
and
$$P[S_{N(t)+1}>x,\ N(t)=n] = P[S_n\le t,\ S_{n+1}\ge\max(t,x)] = \begin{cases} P[S_n\le t,\ S_n+T_{n+1}\ge t], & x\le t,\\ P[S_n\le t,\ S_n+T_{n+1}\ge x], & x>t.\end{cases}$$
For any $c\ge t$ we have
$$P[S_n\le t,\ S_n+T_{n+1}\ge c] = \int_{\substack{s\le t\\ s+\tau\ge c}}\rho(s,\tau)\,ds\,d\tau = \int_0^t\Big(\int_{c-s}^\infty\rho(s,\tau)\,d\tau\Big)ds = \frac{\lambda^n}{(n-1)!}\int_0^t e^{-\lambda(c-s)}s^{n-1}e^{-\lambda s}\,ds = \frac{e^{-\lambda c}(\lambda t)^n}{n!}.$$
Observing that
$$\sum_{n\ge0}\frac{e^{-\lambda c}(\lambda t)^n}{n!} = e^{-\lambda(c-t)},$$
we deduce that
$$P[S_{N(t)+1}>x] = \begin{cases} 1, & x\le t,\\ e^{-\lambda(x-t)}, & x>t.\end{cases}$$
Hence
$$E[S_{N(t)+1}] = \int_0^t dx + e^{\lambda t}\int_t^\infty e^{-\lambda x}\,dx = t+\frac1\lambda = \frac1\lambda\big(E[N(t)]+1\big), \quad (1.3.51)$$
and
$$w_t = E[S_{N(t)+1}] - t = \frac1\lambda.$$
In fact much more is true. One can show (see [132, Sec. 3.6]) that the waiting time $W_t$ is an exponential random variable, $W_t\sim\mathrm{Exp}(\lambda)$, in agreement with the conclusion of the argument (i).
The above computations are a bit counterintuitive. The number of buses arriving during a time interval $[0,t]$ is $N(t)$. The buses arrive with a frequency of $\lambda$ per unit of time, so we should expect to wait $t = \frac1\lambda E[N(t)]$ units of time for $N(t)$ buses to arrive. Formula (1.3.50) shows that we should expect less. On the other hand, formula (1.3.51) shows that we should expect $\frac1\lambda\big(E[N(t)]+1\big)$ units of time for $N(t)+1$ buses to arrive! We refer to Remark 3.71 for an explanation of this paradoxical divergence of conclusions.
The above computations show that the expectation of $L_t = S_{N(t)+1}-S_{N(t)}$ is
$$E[L_t] = \frac2\lambda - \frac{e^{-\lambda t}}{\lambda} \approx \frac2\lambda\quad\text{for $t$ large.}$$
This shows that even the argument (ii) captures a bit of what is going on, since $w_t$ is close to half the expected length of the inter-arrival interval $\big[S_{N(t)},S_{N(t)+1}\big]$.
The Poisson processes are special cases of renewal processes. For an enjoyable and highly readable introduction to renewal processes we refer to [59] or [132, Chap. 3]. For a more in-depth presentation of these processes and some of their practical applications we refer to [5]. ⊔⊓
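The waiting-time paradox is easy to reproduce numerically. A Monte Carlo sketch (my own illustration, not from the book): simulate the arrival stream, record Bob's waiting time $W_t$ and the length $L_t$ of the inter-arrival interval straddling $t$.

```python
import math
import random

# Waiting-time paradox: for a rate-lam Poisson stream observed at time t,
# E[W_t] = 1/lam while the straddling gap has E[L_t] = (2 - exp(-lam*t))/lam.
random.seed(0)
lam, t, runs = 1.0, 5.0, 20_000
w_sum = l_sum = 0.0
for _ in range(runs):
    prev, s = 0.0, random.expovariate(lam)
    while s <= t:
        prev, s = s, s + random.expovariate(lam)
    w_sum += s - t       # W_t = S_{N(t)+1} - t
    l_sum += s - prev    # L_t = S_{N(t)+1} - S_{N(t)}
assert abs(w_sum / runs - 1 / lam) < 0.05
assert abs(l_sum / runs - (2 - math.exp(-lam * t)) / lam) < 0.05
```

Note how the observed gap averages almost $2/\lambda$, twice the typical inter-arrival time: sampling at a fixed time favors long gaps.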
Definition 1.138. For any sequence of events $(A_n)_{n\in\mathbb{N}}\subset\mathcal{S}$ we denote by $\{A_n\ \text{i.o.}\}$ the event "$A_n$ occurs infinitely often",
$$\{A_n\ \text{i.o.}\} := \bigcap_{m\ge1}\bigcup_{n\ge m}A_n.$$
Thus
$$\omega\in\{A_n\ \text{i.o.}\}\ \Longleftrightarrow\ \forall m\in\mathbb{N}\ \exists n\ge m:\ \omega\in A_n. $$
⊔⊓
(i) If
$$\sum_{n\ge1}P[A_n] < \infty,$$
then $P[A_n\ \text{i.o.}] = 0$.
(ii) Conversely, if the events $(A_n)_{n\in\mathbb{N}}$ are independent, then $P[A_n\ \text{i.o.}]\in\{0,1\}$ and
$$P[A_n\ \text{i.o.}] = 0\ \Longleftrightarrow\ \sum_{n\ge1}P[A_n]<\infty. \quad (1.3.52)$$
Proof. (i) Set $N:=\sum_{n\ge1}\boldsymbol{I}_{A_n}$ and note that $\{A_n\ \text{i.o.}\} = \{N=\infty\}$. From the Monotone Convergence Theorem we deduce
$$E[N] = \sum_{n\ge1}E[\boldsymbol{I}_{A_n}] = \sum_{n\ge1}P[A_n] < \infty,$$
so $P[N=\infty]=0$.
(ii) Kolmogorov's 0-1 theorem shows that when the events $(A_n)_{n\ge1}$ are independent we have $P[A_n\ \text{i.o.}]\in\{0,1\}$.
To prove (1.3.52) we have to show that if
$$\sum_{n\ge1}P[A_n] = \infty,$$
then $P[A_n\ \text{i.o.}] = 1$. Using the independence of the events $A_n^c$, $n\ge m$, we have
$$P\Big[\bigcup_{n\ge m}A_n\Big] = 1 - P\Big[\bigcap_{n\ge m}A_n^c\Big] = 1 - \prod_{n\ge m}\big(1-P[A_n]\big)$$
($1-x\le e^{-x}$, $\forall x\in\mathbb{R}$)
$$\ge 1 - e^{-\sum_{n\ge m}P[A_n]} = 1.$$
Hence
$$P[A_n\ \text{i.o.}] = \lim_{m\to\infty}P\Big[\bigcup_{n\ge m}A_n\Big] = 1. $$
⊔⊓
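The dichotomy in the two lemmas can be illustrated numerically (an illustration of mine, not from the book): with independent events of probability $1/n$ the occurrences keep coming, while with probability $1/n^2$ only a handful ever occur, matching the divergent vs. convergent sums.

```python
import random

# Borel-Cantelli intuition: independent events A_n with P[A_n] = 1/n keep
# occurring (divergent sum); with P[A_n] = 1/n**2 only finitely many occur.
random.seed(0)
N = 100_000
hits_div = [n for n in range(1, N + 1) if random.random() < 1 / n]
hits_conv = [n for n in range(1, N + 1) if random.random() < 1 / n ** 2]

# Expected counts: the harmonic sum ~ log N diverges; sum 1/n^2 < pi^2/6.
assert sum(1 / n for n in range(1, N + 1)) > 11
assert sum(1 / n ** 2 for n in range(1, N + 1)) < 1.645
assert len(hits_div) > len(hits_conv)  # divergent case keeps producing events
```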
Remark 1.140. Statement (i) in Theorem 1.139 is usually referred to as the First
Borel-Cantelli Lemma while statement (ii) is usually referred to as the Second Borel-
Cantelli Lemma. Exercises 3.12 and 3.18 present refinements of the Borel-Cantelli
lemmas. ⊔⊓
Corollary 1.141. Suppose that there exists X ∈ L0 (Ω, S, P) such that the sequence
Xn ∈ L0 (Ω, S, P) satisfies
$$\sum_{n\ge1}P\big[|X_n-X|>\varepsilon\big]<\infty,\quad\forall\varepsilon>0.$$
Then $X_n\xrightarrow{\ \mathrm{a.s.}\ }X$. ⊔⊓
(i) Xn → X in probability as n → ∞.
(ii) dist(Xn , X) → 0 as n → ∞.
Proof. Set
$$\rho(x):=\min(|x|,1),\qquad Y_n := X_n - X.$$
Using Markov's inequality we deduce that for any $n\ge1$ and any $\varepsilon\in(0,1)$ we have
$$\varepsilon P[|Y_n|>\varepsilon] = \varepsilon P[\rho(Y_n)>\varepsilon] \le E[\rho(Y_n)] = \mathrm{dist}(Y_n,0).$$
This shows that (ii) $\Rightarrow$ (i).
Conversely, observe that, for any $\varepsilon>0$, we have
$$E[\rho(Y_n)] = \int_{|Y_n|\le\varepsilon}\rho(Y_n)\,dP + \int_{|Y_n|>\varepsilon}\rho(Y_n)\,dP \le \varepsilon + P[|Y_n|>\varepsilon].$$
This proves that $0\le\liminf\mathrm{dist}(Y_n,0)\le\limsup\mathrm{dist}(Y_n,0)\le\varepsilon$, $\forall\varepsilon>0$. ⊔⊓
The next result describes the relationships between a.s. convergence and con-
vergence in probability.
Theorem 1.144. Let X, Xn ∈ L0 (Ω, S, P). Then the following hold.
Definition 1.146. Let $p\in[1,\infty)$. We say that the sequence $(X_n)_{n\in\mathbb{N}}\subset L^0(\Omega,\mathcal{S},P)$ converges in $p$-mean or in $L^p$ to $X\in L^0(\Omega,\mathcal{S},P)$ if
$$X,X_n\in L^p(\Omega,\mathcal{S},P),\ \forall n\in\mathbb{N},$$
and
$$\lim_{n\to\infty}E\big[|X_n-X|^p\big] = 0. $$
⊔⊓
The equality (1.3.54) is due to Chvátal and Sankoff [32], but we will follow the
presentation in [144, Chap. 1].
The key observation is that the sequence $(\ell_n)_{n\in\mathbb{N}}$ is superadditive, i.e.,
$$\ell_n + \ell_m \le \ell_{m+n},\quad\forall m,n\in\mathbb{N}. \quad (1.3.55)$$
The proof is very simple. We set Zn = (Xn , Yn ) and we observe that
the random variable Ln is an invariant of the sequence of pairs (Z1 , . . . , Zn ),
Ln = L(Z1 , . . . , Zn ). Clearly
Lm = L(Zn+1 , . . . , Zn+m ), ∀m, n ∈ N.
If we concatenate the longest common subsequence of (Z1 , . . . , Zn ) with the longest
common subsequence of (Zn+1 , . . . , Zn+m ) we obtain a common subsequence of
(Z1 , . . . , Zn , Zn+1 , . . . , Zn+m ) of length
L(Z1 , . . . , Zn ) + L(Zn+1 , . . . , Zn+m )
showing that
L(Z1 , . . . , Zn ) + L(Zn+1 , . . . , Zn+m ) ≤ L(Z1 , . . . , Zn , Zn+1 , . . . , Zn+m ),
i.e.,
Lm + Ln ≤ Lm+n , ∀m, n ∈ N. (1.3.56)
Taking the expectations of both sides in the above inequality we obtain (1.3.55).
The conclusion (1.3.54) is now an immediate consequence of the following ele-
mentary result.
Lemma 1.151 (Fekete). Suppose that $(x_n)_{n\ge1}$ is a subadditive sequence of real numbers, i.e.,
$$x_{m+n}\le x_m+x_n,\quad\forall m,n\in\mathbb{N}.$$
Then
$$\lim_{n\to\infty}\frac{x_n}{n} = \mu := \inf_{n\ge1}\frac{x_n}{n}.$$
Proof. For any $c>\mu$ we can find $k=k(c)$ such that $x_k<kc$. The subadditivity condition implies $x_{kn}\le nx_k$, $\forall n\in\mathbb{N}$, so that
$$\mu\le\frac{x_{nk}}{nk}<c,\quad\forall n\in\mathbb{N}.$$
Hence
$$\mu\le\liminf_{n\to\infty}\frac{x_n}{n}\le c,\quad\forall c>\mu,$$
i.e.,
$$\mu = \liminf_{n\to\infty}\frac{x_n}{n}.$$
Now observe that for any $n\ge k$ there exist $m\in\mathbb{N}$ and $r\in\{0,1,\dots,k-1\}$ such that $n=mk+r$. Hence, with the convention $x_0:=0$,
$$x_n \le mx_k + x_r < mkc + x_r = (n-r)c + x_r,$$
so that
$$\frac{x_n}{n} < \frac{(n-r)c}{n} + \frac{M}{n},\quad M := \max_{0\le r\le k-1}|x_r|.$$
Hence
$$\limsup_{n\to\infty}\frac{x_n}{n}\le\limsup_{n\to\infty}\frac{(n-r)c}{n} = c,\quad\forall c>\mu.$$
This completes the proof of the lemma. ⊔⊓
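Fekete's lemma can be watched in action numerically. A sketch with a made-up sequence (mine, not the book's): $x_n = n+\sqrt{n}$ is subadditive because $\sqrt{m+n}\le\sqrt{m}+\sqrt{n}$, and $x_n/n = 1+1/\sqrt{n}$ decreases to $\mu = \inf_n x_n/n = 1$.

```python
import math

# Fekete's lemma demo: x_n = n + sqrt(n) is subadditive, and x_n / n
# decreases to mu = inf_n x_n / n = lim_n x_n / n = 1.
x = lambda n: n + math.sqrt(n)

# spot-check subadditivity: x_{m+n} <= x_m + x_n
assert all(x(m + n) <= x(m) + x(n) for m in range(1, 50) for n in range(1, 50))

ratios = [x(n) / n for n in (1, 10, 100, 10_000, 1_000_000)]
assert all(a > b for a, b in zip(ratios, ratios[1:]))  # decreasing toward mu
assert abs(ratios[-1] - 1.0) < 1e-2                    # approaches mu = 1
```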
The conclusion (1.3.54) follows from Fekete's Lemma applied to the sequence $x_n=-\ell_n$: the superadditivity (1.3.55) shows that
$$\frac{\ell_n}{n}\to r = r(\pi) := \sup_n\frac{\ell_n}{n}.$$
Since $r\ge\ell_1 = E[L_1]$, the Cauchy-Schwarz inequality implies
$$r \ge E[L_1] = \sum_{a\in\mathcal{A}}\pi(a)^2 \ge \frac1k\Big(\sum_{a\in\mathcal{A}}\pi(a)\Big)^2 = \frac1k > 0.$$
The exact value of $r(\pi)$ is not known in general. In Example 3.34, using more sophisticated techniques, we will show that $\frac{L_n}{n}$ converges to the constant $r$ and that $L_n$ is highly concentrated around its mean $\ell_n\approx rn$. ⊔⊓
Example 1.152. Consider the interval $[-1,1]$ equipped with the uniform probability measure $\frac12dx$. Consider the sequence of bounded, nonnegative random variables
$$X_n = 2^n\boldsymbol{I}_{[-2^{-n},2^{-n}]}.$$
Note that $X_n\to0$ a.s. but
$$E[X_n] = \frac{2^n}{2}\int_{-2^{-n}}^{2^{-n}}dx = 1,\quad\forall n.$$
As we will see later in Chapter 3, the reason why the convergence in mean fails is the high concentration of $X_n$ on sets of smaller and smaller measure. ⊔⊓
then $X_n\to X$ in $L^1$ and
$$\lim_{n\to\infty}E[X_n] = E[X]. \quad (1.3.57)$$
⊔⊓
Remark 1.154. The Bounded Convergence Theorem does not follow immediately from the Dominated Convergence Theorem, which involves a.s. convergence. However, using Theorem 1.144(iii) we can use the Dominated Convergence Theorem to provide an alternate proof of the Bounded Convergence Theorem. ⊔⊓
Moreover,
$$Y\in L^1\ \Longleftrightarrow\ \sum_\alpha|y_\alpha|\,P[F_\alpha]<\infty.$$
Suppose now that $X\in L^1(\Omega,\mathcal{S},P)$. We define the expectation of $X$ given the event $F_\alpha$ to be the expectation of $X$ with respect to the conditional probability $P[-|F_\alpha]$, i.e., the number
$$\bar{x}_\alpha = E[X|F_\alpha] := \frac{1}{P[F_\alpha]}E[X\boldsymbol{I}_{F_\alpha}] = \frac{1}{P[F_\alpha]}\int_{F_\alpha}X(\omega)\,P[d\omega]. \quad (1.4.1)$$
We obtain an $\mathcal{F}$-measurable random variable
$$\bar{X} = \sum_\alpha\bar{x}_\alpha\boldsymbol{I}_{F_\alpha}.$$
Note that
$$|\bar{x}_\alpha|\le\frac{1}{P[F_\alpha]}E\big[|X|\boldsymbol{I}_{F_\alpha}\big],$$
so
$$E[|\bar{X}|]\le\sum_\alpha E\big[|X|\boldsymbol{I}_{F_\alpha}\big] = E[|X|]<\infty.$$
Since
$$E[\bar{X}\boldsymbol{I}_{F_\alpha}] = E[X\boldsymbol{I}_{F_\alpha}],\quad\forall\alpha\in A,$$
we deduce
$$E[\bar{X}\boldsymbol{I}_F] = E[X\boldsymbol{I}_F],\quad\forall F\in\mathcal{F}. \quad (1.4.2)$$
Note that if
$$Y = \sum_\alpha y_\alpha\boldsymbol{I}_{F_\alpha}$$
is another $\mathcal{F}$-measurable, integrable random variable that satisfies (1.4.2), then
$$P[F_\alpha]\,\bar{x}_\alpha = E[X\boldsymbol{I}_{F_\alpha}] = E[Y\boldsymbol{I}_{F_\alpha}] = P[F_\alpha]\,y_\alpha,\quad\forall\alpha\in A,$$
so that $y_\alpha=\bar{x}_\alpha$, $\forall\alpha$, i.e., $\bar{X}$ is uniquely determined by (1.4.2).
If in (1.4.2) we set $F=\Omega$ we deduce
$$E[X] = E[\bar{X}] = \sum_\alpha\bar{x}_\alpha\,P[F_\alpha] = \sum_\alpha E[X|F_\alpha]\,P[F_\alpha]. \quad (1.4.3)$$
When $X=\boldsymbol{I}_S$, then
$$E[\boldsymbol{I}_S|F_\alpha] = \frac{P[S\cap F_\alpha]}{P[F_\alpha]} = P[S|F_\alpha].$$
In this special case the equality (1.4.3) becomes the law of total probability
$$P[S] = \sum_\alpha P[S|F_\alpha]\,P[F_\alpha]. \quad (1.4.4)$$
⊔⊓
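For a sigma-algebra generated by a finite partition, the construction above is just blockwise averaging. A small self-contained check (a sketch with made-up numbers, mine and not the book's) verifies the defining property (1.4.2) and the identity (1.4.3) exactly:

```python
from fractions import Fraction as Fr

# Finite Omega with uniform P; F generated by the partition below;
# E[X || F] is the blockwise average of X.
omega = range(6)
P = {w: Fr(1, 6) for w in omega}
X = {0: 1, 1: 5, 2: 3, 3: 2, 4: 2, 5: 8}
partition = [{0, 1, 2}, {3, 4, 5}]

def cond_exp(w):
    block = next(F for F in partition if w in F)
    pF = sum(P[v] for v in block)
    return sum(X[v] * P[v] for v in block) / pF   # x-bar on this block

# (1.4.2): E[Xbar * I_F] = E[X * I_F] for every block F
for F in partition:
    assert sum(cond_exp(w) * P[w] for w in F) == sum(X[w] * P[w] for w in F)
# (1.4.3): E[Xbar] = E[X]
assert sum(cond_exp(w) * P[w] for w in omega) == sum(X[w] * P[w] for w in omega)
```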
The next result explains why the condition (1.4.2) is key to our further devel-
opments.
I am using different notations for the conditional expectation given an event, $E[X|F]$, and the conditional expectation given a sigma-subalgebra, $E[X\,\|\,\mathcal{F}]$, for a simple reason: I want to emphasize visually that the first is a number while the latter is a function.
Remark 1.158. Using the Monotone Convergence Theorem and the Monotone
Class Theorem we deduce that the following are equivalent.
Note that
$$E[Y] = E\big[E[Y\,\|\,X]\big] = E[f(X)].$$
Thus
$$E[Y] = E[f(X)] = \int_{\mathbb{R}}f(x)\,P_X[dx].$$
This shows that $P_X$ is absolutely continuous with respect to the Lebesgue measure on $\mathbb{R}$ and
$$P_X[dx] = p_X(x)dx.$$
Similarly
$$P_Y[dy] = p_Y(y)dy,\quad p_Y(y) = \int_{\mathbb{R}}p_{X,Y}(x,y)\,dx.$$
Classically, the probability distributions $P_X$ and $P_Y$ are called the marginal distributions of the random vector $(X,Y)$. We define
$$p_{Y|X=x}(y) = \begin{cases}\dfrac{p_{X,Y}(x,y)}{p_X(x)}, & p_X(x)\ne0,\\ 0, & p_X(x)=0.\end{cases}$$
⊔⊓
Theorem 1.162. For any $X\in L^1(\Omega,\mathcal{S},P)$ and any sigma subalgebra $\mathcal{F}\subset\mathcal{S}$ there exists a conditional expectation $E[X\,\|\,\mathcal{F}]\in L^1(\Omega,\mathcal{F},P)$.

Proof. We follow the approach in [160]. We establish the existence gradually, first under more restrictive assumptions.

Step 1. Assume $X\in L^2(\Omega,\mathcal{S},P)$. Then $L^2(\Omega,\mathcal{F},P)$ is a closed subspace of $L^2(\Omega,\mathcal{S},P)$. Denote by $P_{\mathcal{F}}X$ the orthogonal projection of $X$ on this closed subspace. We claim that
$$P_{\mathcal{F}}X = E[X\,\|\,\mathcal{F}], \quad (1.4.14a)$$
$$X\ge0\ \Rightarrow\ E[X\,\|\,\mathcal{F}]\ge0. \quad (1.4.14b)$$
In particular,
$$E\big[(X-Y)\boldsymbol{I}_F\big] = 0,\quad\forall F\in\mathcal{F}.$$
is linear.
Step 2. Assume $X\in L^1(\Omega,\mathcal{S},P)$. Decompose $X = X^+-X^-$ and, for $n\in\mathbb{N}$, set
$$X_n^{\pm} = \min\{X^{\pm},n\}.$$
Since $X_n^{\pm}-X_m^{\pm}\ge0$ a.s. if $m\le n$, we deduce from (1.4.14b) that the variables $Y_n^{\pm}:=E[X_n^{\pm}\,\|\,\mathcal{F}]$ satisfy
$$0\le Y_m^{\pm}\le Y_n^{\pm},\ \text{a.s.},\quad\forall m\le n.$$
We set
$$Y^{\pm} := \lim_{n\to\infty}Y_n^{\pm}.$$
This shows that the random variables $Y^{\pm}$ are integrable and in particular a.s. finite. We set
$$Y := Y^+ - Y^-.$$
We will show that $Y$ is a version of the conditional expectation of $X$ given $\mathcal{F}$. Let $F\in\mathcal{F}$. Then
$$E[X\boldsymbol{I}_F] = E[X^+\boldsymbol{I}_F] - E[X^-\boldsymbol{I}_F] = \lim_{n\to\infty}E[X_n^+\boldsymbol{I}_F] - \lim_{n\to\infty}E[X_n^-\boldsymbol{I}_F]$$
Remark 1.163. (a) The sigma subalgebra $\mathcal{F}$ should be viewed as encoding partial information that we have about a random experiment. Following a terminology frequently used in statistics, we refer to the $\mathcal{F}$-measurable random variables as predictors determined by the information contained in $\mathcal{F}$.
Step 1 in the above proof shows that the conditional expectation $\bar{X}$ of a random variable $X$, given the partial information $\mathcal{F}$, should be viewed as the predictor that best approximates $X$ given the information $\mathcal{F}$. The missing part $X-\bar{X}$ is independent of $\mathcal{F}$, so it is unknowable given only the information encoded by $\mathcal{F}$.
Note that when $\mathcal{F}=\{\emptyset,\Omega\}$, then
$$E[X\,\|\,\mathcal{F}] = E[X]\,\boldsymbol{I}_\Omega.$$
To put it differently, if the only information we have about a random experiment is that there will be an outcome, then the best we can predict about a numerical characteristic of that outcome is its expectation.
(b) A random variable $X\in L^1(\Omega,\mathcal{S},P)$ defines a signed measure
$$\mu_X : \mathcal{F}\to\mathbb{R},\quad \mu_X[F] = \int_F X(\omega)\,P[d\omega],\quad\forall F\in\mathcal{F}.$$
This measure is absolutely continuous with respect to $P$ (restricted to $\mathcal{F}$). The Radon-Nikodym theorem implies that there exists an $\mathcal{F}$-measurable integrable function $\rho_X\in L^1(\Omega,\mathcal{F},P)$ such that $\mu_X[d\omega] = \rho_X(\omega)P[d\omega]$, i.e.,
$$\int_F X(\omega)\,P[d\omega] = \int_F \rho_X(\omega)\,P[d\omega],\quad\forall F\in\mathcal{F}.$$
This shows that $\rho_X = E[X\,\|\,\mathcal{F}]$ a.s. ⊔⊓
In other words,
$$E\big[E[X\,\|\,\mathcal{F}]\big] = E[X]. \quad (1.4.15)$$
$$L^1(\Omega,\mathcal{S},P)\ni X\mapsto E[X\,\|\,\mathcal{F}]\in L^1(\Omega,\mathcal{F},P)$$
$$E[XY\,\|\,\mathcal{F}] = Y\,E[X\,\|\,\mathcal{F}].$$
$$E[X_n\,\|\,\mathcal{F}]\nearrow E[X\,\|\,\mathcal{F}],\ \text{a.s.}$$
(viii) If $X_n\to X$ a.s. and there exists $Y\in L^1(\Omega,\mathcal{S},P)$ such that $|X_n|\le Y$ a.s., then $E[X_n\,\|\,\mathcal{F}]\to E[X\,\|\,\mathcal{F}]$ a.s.
$$E[-\,\|\,\mathcal{F}] : L^p(\Omega,\mathcal{S},P)\to L^p(\Omega,\mathcal{F},P)$$
$$\big\|E[X\,\|\,\mathcal{F}]\big\|_{L^p}\le\|X\|_{L^p}.$$
$$E[X\,\|\,\mathcal{F}\vee\mathcal{G}] = E[X\,\|\,\mathcal{F}].$$
$$E[X\,\|\,\mathcal{G}] = E[X].$$
Proof. (i) Follows by choosing $F=\Omega$ in (1.4.7). (ii) Follows from the proof of Theorem 1.162.
(iii) The linearity follows from the fact that the defining condition (1.4.7) is linear in $X$. Now let $X\in L^1(\Omega,\mathcal{S},P)$. We have $X = X^+-X^-$. Choose versions $Y^{\pm}$ of $E[X^{\pm}\,\|\,\mathcal{F}]$. Then $Y^{\pm}\ge0$ and
$$\big|E[X\,\|\,\mathcal{F}]\big| = |Y^+-Y^-|\le Y^++Y^- = E[X^++X^-\,\|\,\mathcal{F}] = E\big[|X|\,\big\|\,\mathcal{F}\big].$$
Hence
$$\big\|E[X\,\|\,\mathcal{F}]\big\|_{L^1}\le E\Big[E\big[|X|\,\big\|\,\mathcal{F}\big]\Big] = E[|X|] = \|X\|_{L^1}.$$
We have to show that $YZ$ is a version of $E[XY\,\|\,\mathcal{F}]$, i.e.,
$$E[XY\boldsymbol{I}_F] = E[ZY\boldsymbol{I}_F],\quad\forall F\in\mathcal{F}. \quad (1.4.16)$$
Let $F\in\mathcal{F}$. Since $Z$ is a version of $E[X\,\|\,\mathcal{F}]$ we deduce from (1.4.8) that
$$E[XU] = E[ZU],\quad\forall U\in L^\infty(\Omega,\mathcal{F},P).$$
In particular, $\forall F\in\mathcal{F}$, taking $U = Y\boldsymbol{I}_F$ we have
$$E[XY\boldsymbol{I}_F] = E[ZU] = E[ZY\boldsymbol{I}_F].$$
ing. The Monotone Convergence Theorem implies that $\|X-X_n\|_{L^1}\to0$. From (iii) we deduce
$$\|Y_n-Y\|_{L^1}\le\|X_n-X\|_{L^1}.$$
Proposition 1.147 implies that $Y_n$ admits a subsequence that converges a.s. to $Y$. Since the sequence $Y_n$ is increasing, we deduce that the whole sequence converges a.s. to $Y$.
(vii) Set
$$Y_k = \inf_{n\ge k}X_n.$$
i.e.,
$$E[X\,\|\,\mathcal{F}]\le\liminf_{n\to\infty}E[X_n\,\|\,\mathcal{F}].$$
(ix) We need to use a less familiar property of convex functions, [4, Thm. 6.3.4]. More precisely, there exist sequences of real numbers $(a_n)_{n\in\mathbb{N}}$ and $(b_n)_{n\in\mathbb{N}}$ such that
$$\varphi(x) = \sup_{n\in\mathbb{N}}(a_nx+b_n),\quad\forall x\in\mathbb{R}.$$
Hence, with $\ell_n(x):=a_nx+b_n$,
$$\varphi\big(E[X\,\|\,\mathcal{F}]\big) = \sup_{n\in\mathbb{N}}\ell_n\big(E[X\,\|\,\mathcal{F}]\big) = \sup_{n\in\mathbb{N}}E[\ell_n(X)\,\|\,\mathcal{F}] \le E[\varphi(X)\,\|\,\mathcal{F}].$$
(x) Let $G\in\mathcal{G}$ and $F\in\mathcal{F}$. Then the random variables $\boldsymbol{I}_G$ and $X\boldsymbol{I}_F$ are independent, so
$$E[X\boldsymbol{I}_{F\cap G}] = E[X\boldsymbol{I}_F\boldsymbol{I}_G] = E[X\boldsymbol{I}_F]\,P[G].$$
If $Y$ is a version of $E[X\,\|\,\mathcal{F}]$, then $Y$ is $\mathcal{F}$-measurable and thus independent of $\mathcal{G}$, so
$$E[Y\boldsymbol{I}_{F\cap G}] = E[Y\boldsymbol{I}_F\boldsymbol{I}_G] = E[Y\boldsymbol{I}_F]\,P[G] = E[X\boldsymbol{I}_F]\,P[G] = E[X\boldsymbol{I}_{F\cap G}],\quad\forall F\in\mathcal{F},\ G\in\mathcal{G},$$
so that $E[X\,\|\,\mathcal{F}\vee\mathcal{G}] = Y$, i.e., $E[X\,\|\,\mathcal{F}\vee\mathcal{G}] = E[X\,\|\,\mathcal{F}]$. ⊔⊓
⁷When $\varphi$ is $C^1$ the family $\ell_n$ coincides with the family $(\ell_q)_{q\in\mathbb{Q}}$, $\ell_q(x) = \varphi'(q)(x-q)+\varphi(q)$.
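On a sigma-algebra generated by a finite partition the properties above reduce to elementary computations with block averages. A small exact check of the tower property (1.4.15) and the conditional Jensen inequality (a sketch with made-up numbers, mine and not the book's):

```python
from fractions import Fraction as Fr

# Omega = {0,...,5}, uniform P; F generated by the partition below.
P = Fr(1, 6)
X = {0: -2, 1: 4, 2: 1, 3: 0, 4: 3, 5: 6}
partition = [{0, 1}, {2, 3, 4}, {5}]
phi = lambda x: x * x  # a convex function

def cond(f, w):
    """E[f || F](w): average of f over the block containing w."""
    block = next(B for B in partition if w in B)
    return sum(Fr(f[v]) for v in block) / len(block)

# Tower property: E[ E[X || F] ] = E[X]
assert sum(cond(X, w) * P for w in X) == sum(Fr(x) * P for x in X.values())
# Conditional Jensen: phi(E[X || F]) <= E[phi(X) || F] pointwise
phiX = {w: phi(x) for w, x in X.items()}
assert all(phi(cond(X, w)) <= cond(phiX, w) for w in X)
```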
Example 1.167. Suppose that a player rolls a die indefinitely. More formally, we are given a sequence of independent random variables $(X_n)_{n\in\mathbb{N}}$, uniformly distributed on $I_6:=\{1,2,\dots,6\}$.
For $k\in\mathbb{N}$, we say that a $k$-run occurred at time $n$ if $n\ge k$ and
$$X_n = X_{n-1} = \cdots = X_{n-k+1} = 6.$$
We set
$$R = R_k := \big\{n;\ \text{a $k$-run occurred at time $n$}\big\}\subset\mathbb{N}\cup\{\infty\},\quad T = T_k := \inf R_k,$$
where $\inf\emptyset := \infty$. Thus $T$ is the moment when the first $k$-run occurs. We want to show that $E[T]<\infty$.
Note that for each $n\in\mathbb{N}$ the event $\{T\le n\}$ belongs to the sigma-algebra $\mathcal{F}_n$ generated by $X_1,\dots,X_n$. The explanation is simple: if we know the results of the first $n$ rolls of the die we can decide whether a $k$-run has occurred. Consider the conditional probability
$$P\big[\{T\le n+k\}\,\|\,\mathcal{F}_n\big] = E\big[\boldsymbol{I}_{\{T\le n+k\}}\,\|\,\mathcal{F}_n\big].$$
We have
$$P[T\le n+k\,\|\,\mathcal{F}_n] = \sum_{i_1,\dots,i_n=1}^6 p_{i_1,\dots,i_n|k}\,\boldsymbol{I}_{S_{i_1,\dots,i_n}},$$
where
$$p_{i_1,\dots,i_n|k} = P[T\le n+k\,|\,X_1=i_1,\dots,X_n=i_n].$$
Note that, irrespective of the $i_j$-s, we have
$$p_{i_1,\dots,i_n|k}\ge\frac{1}{6^k} =: r.$$
Hence
$$P[T\le n+k\,\|\,\mathcal{F}_n]\ge r,\quad\forall n.$$
In particular,
$$P[T>n+k\,\|\,\mathcal{F}_n]\le(1-r)<1,\quad\forall n\in\mathbb{N}.$$
$$P[T>n+(\ell+1)k] = E\Big[\boldsymbol{I}_{\{T>n+\ell k\}}\,P\big[T>n+(\ell+1)k\,\|\,\mathcal{F}_{n+\ell k}\big]\Big] \le (1-r)\,E\big[\boldsymbol{I}_{\{T>n+\ell k\}}\big] = (1-r)\,P[T>n+\ell k].$$
Iterating, we deduce that for any $i\in\{1,\dots,k\}$ and any $\ell\in\mathbb{N}$ we have
$$P[T>i+\ell k]\le(1-r)^\ell\,P[T>i]\le(1-r)^\ell.$$
Hence
$$E[T] = \sum_{n\ge0}P[T>n]\le k\sum_{\ell\ge0}(1-r)^\ell = \frac{k}{r}<\infty.$$
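In fact $E[T_k]$ can be computed exactly by conditioning on the current run length: if $e_i$ denotes the expected number of additional rolls given a current run of $i$ sixes, then $e_i = 1+\frac16e_{i+1}+\frac56e_0$ with $e_k=0$, which yields $E[T_k]=\frac{6(6^k-1)}{5}$. This recurrence and closed form are standard facts about waiting times for runs, not taken from the text; a sketch:

```python
from fractions import Fraction as Fr

def expected_k_run(k):
    """Solve e_i = 1 + (1/6) e_{i+1} + (5/6) e_0, e_k = 0, for e_0 = E[T_k]."""
    # Represent e_i = a + b * e_0 and back-substitute starting from e_k = 0.
    a, b = Fr(0), Fr(0)
    for _ in range(k):
        a, b = 1 + a / 6, Fr(5, 6) + b / 6
    return a / (1 - b)   # solve e_0 = a + b * e_0

assert expected_k_run(1) == 6
assert expected_k_run(2) == 42
assert all(expected_k_run(k) == Fr(6) * (6 ** k - 1) / 5 for k in range(1, 8))
```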
Example 1.168 (Optimal stopping with finite horizon). Let us consider the
following abstract situation. Suppose we are given N random variables
X1 , . . . , XN ∈ L0 Ω, S, P .
Note that
T = T1 ⊃ T2 ⊃ · · · ⊃ TN .
A stopping time T belongs to Tn if and only if the decision to stop comes only after
we have observed the first n random variables in the stream, X1 , . . . , Xn .
We will detect an optimal stopping strategy using a process of “successive ap-
proximations”. The first approximation is the simplest strategy: pick the reward
only at the end, after we have observed all the N variables in the stream. In this
case the reward is YN = RN . This may not give us the largest expected reward
because some of the up-stream rewards could have been higher. We tweak this
strategy a bit to produce a better outcome.
We wait to observe the first $N-1$ variables in the stream, and then decide what to do. At this moment our reward is $R_{N-1}$. To decide what to do next we compare this reward with the expected reward given that we observed $X_1,\dots,X_{N-1}$, i.e., with the conditional expectation $E[Y_N\,\|\,\mathcal{F}_{N-1}] = E[R_N\,\|\,\mathcal{F}_{N-1}]$. This is an $\mathcal{F}_{N-1}$-measurable quantity, i.e., a quantity that is computable from the knowledge of $X_1,\dots,X_{N-1}$.
If the reward $R_{N-1}$ that we have in our hands is bigger than what we expect to gain given our current information, we choose it and we stop. If not, we wait one more step to stop. More formally, we stop after $N-1$ steps if $R_{N-1}\ge E[R_N\,\|\,\mathcal{F}_{N-1}]$ and we continue one more step otherwise. The decision is thus based on the random variable $Y_{N-1} = \max\big\{R_{N-1},\ E[Y_N\,\|\,\mathcal{F}_{N-1}]\big\}$.
$$E[R_{T_n}\,\|\,\mathcal{F}_n] = Y_n. \quad (1.4.18b)$$
Hence
$$E[R_{T_n}] = E[Y_n]\ge E[R_T],\quad\forall T\in\mathcal{T}_n.$$
For $T\in\mathcal{T}_{n-1}$ and $A\in\mathcal{F}_{n-1}$, with $T':=\max\{T,n\}\in\mathcal{T}_n$, we have
$$\int_A R_T\,dP = \int_{A\cap\{T=n-1\}}R_{n-1}\,dP + \int_{A\cap\{T\ge n\}}R_{T'}\,dP$$
($\{T\ge n\}\in\mathcal{F}_{n-1}$)
$$= \int_{A\cap\{T=n-1\}}R_{n-1}\,dP + \int_{A\cap\{T\ge n\}}E[R_{T'}\,\|\,\mathcal{F}_{n-1}]\,dP$$
$$= \int_{A\cap\{T=n-1\}}R_{n-1}\,dP + \int_{A\cap\{T\ge n\}}E\big[E[R_{T'}\,\|\,\mathcal{F}_n]\,\big\|\,\mathcal{F}_{n-1}\big]\,dP$$
(use the induction assumption $E[R_{T'}\,\|\,\mathcal{F}_n]\le Y_n$)
$$\le \int_{A\cap\{T=n-1\}}\underbrace{R_{n-1}}_{\le Y_{n-1}}\,dP + \int_{A\cap\{T\ge n\}}\underbrace{E[Y_n\,\|\,\mathcal{F}_{n-1}]}_{\le Y_{n-1}}\,dP \le \int_A Y_{n-1}\,dP.$$
This proves the inequality (1.4.18a).
To prove the equality (1.4.18b), we run the above argument with $T = T_{n-1}$. Observe that in this case
$$U_n := \{T=n-1\} = \big\{R_{n-1}\ge E[Y_n\,\|\,\mathcal{F}_{n-1}]\big\} = \{Y_{n-1}=R_{n-1}\}, \quad (1.4.20a)$$
$$V_n := \{T_{n-1}>n-1\} = \big\{R_{n-1}<E[Y_n\,\|\,\mathcal{F}_{n-1}]\big\} = \big\{Y_{n-1}=E[Y_n\,\|\,\mathcal{F}_{n-1}]\big\}. \quad (1.4.20b)$$
Remark 1.169. The procedure for determining the optimal time $T_1$ outlined in the above example is a bit counterintuitive. The maximal expected reward is $E[Y_1]$. By construction, the random variable $Y_1$ is $\mathcal{F}_1$-measurable, and thus has the form $f(X_1)$ for some Borel measurable function $f:\mathbb{R}\to\mathbb{R}$. Thus we can determine $Y_1$ knowing only the initial input $X_1$. On the other hand, the definition of $Y_1$ by descending induction used the knowledge of the entire stream $X_1,\dots,X_N$, not just the initial input $X_1$.
What is true is that we can compute the maximal expected reward without running the stream. On the other hand, the moment we stop and the actual reward when stopped are random quantities. It is conceivable that if we do not stop when $T_1$ tells us to stop we could get a higher reward later on. However, on average, we cannot beat the stopping strategy $T_1$.
We will illustrate this process on the classical secretary problem. ⊔⊓
$$= \sum_{n=1}^N P[V_n=v_N,\ T=n] = P[V_T=v_N].$$
We want to find a stopping time $T$ that maximizes $E[R_T]$, i.e., the probability that Bob picks the biggest prize. Let us make a few remarks.
⁸Think of $N$ secretaries interviewing for a single job; the values $v_1,\dots,v_N$ rank their job suitability, the higher the value the more suitable. The interviewer learns the value $v_k$ only at the time of the interview.
1. Observe that the rankings $(X_n)_{n\in\mathbb{N}}$ defined in (1.4.21) are independent and
$$P[X_n=j] = \frac1n,\quad\forall 1\le j\le n\le N. \quad (1.4.22)$$
Indeed, the random vector $(V_1,\dots,V_N)$ can be identified with a random permutation $\varphi\in\mathfrak{S}_N$ of $I_N$,
$$(V_1,\dots,V_N) = \big(v_{\varphi(1)},\dots,v_{\varphi(N)}\big).$$
The rank $X_n$ is then a function of $\varphi$,
$$X_n(\varphi) := \#\big\{j\le n;\ \varphi(j)\ge\varphi(n)\big\}.$$
To reach the desired conclusion observe that the map
$$\vec{X} : \mathfrak{S}_N\to I_1\times I_2\times\cdots\times I_N,\quad \varphi\mapsto\big(X_1(\varphi),\dots,X_N(\varphi)\big)$$
is a bijection.⁹
2. We have
$$R_n = E\big[\boldsymbol{I}_{\{V_n=v_N\}}\,\|\,\mathcal{F}_n\big] = \frac{n}{N}\,\boldsymbol{I}_{\{X_n=1\}}.$$
Indeed, the conditional expectation $R_n = E[\boldsymbol{I}_{\{V_n=v_N\}}\,\|\,X_n]$ is a function of $x_n\in I_n$ and we have
$$R_n(x_n) = E\big[\boldsymbol{I}_{\{V_n=v_N\}}\,\big|\,X_n=x_n\big] = P[V_n=v_N\,|\,X_n=x_n].$$
This probability is zero if $x_n>1$. Now observe that, since $\{V_n=v_N\}\subset\{X_n=1\}$,
$$P[V_n=v_N\,|\,X_n=1] = \frac{P[V_n=v_N]}{P[X_n=1]} = \frac{1/N}{1/n} = \frac{n}{N}.$$
Following (1.4.17) and (1.4.18a) we set $y_n = E[Y_n]$. The quantity $y_n$ is the probability of Bob obtaining the largest prize among the strategies that discard the first $(n-1)$ selected prizes. We have
$$Y_N = R_N = \boldsymbol{I}_{\{V_N=v_N\}},\quad y_N = \frac1N.$$
Since $\{V_N=v_N\} = \{X_N=1\}$ is independent of $\mathcal{F}_{N-1}$ we deduce
$$E\big[\boldsymbol{I}_{\{V_N=v_N\}}\,\|\,\mathcal{F}_{N-1}\big] = E\big[\boldsymbol{I}_{\{V_N=v_N\}}\big] \overset{(1.4.22)}{=} \frac1N = y_N,$$
$$Y_{N-1} = \max\Big\{R_{N-1},\ E\big[\boldsymbol{I}_{\{V_N=v_N\}}\,\|\,\mathcal{F}_{N-1}\big]\Big\} = \max\{R_{N-1},y_N\} = \frac{N-1}{N}\boldsymbol{I}_{\{X_{N-1}=1\}} + \frac1N\boldsymbol{I}_{\{X_{N-1}>1\}},$$
$$y_{N-1} = \frac1N + \frac{N-2}{N-1}\,y_N.$$
^9 From the equality φ^{−1}(N) = max{ j; X_j(φ) = 1 } we deduce inductively that X⃗ is injective. It is also surjective since S_N and ∏_{n=1}^N I_n have the same cardinality.
Similarly,

E[ Y_{N−1} ∥ F_{N−2} ] = E[ Y_{N−1} ] = y_{N−1},

Y_{N−2} = max{ R_{N−2}, y_{N−1} },

y_{N−2} = (1/(N−2)) max{ (N−2)/N, y_{N−1} } + ((N−3)/(N−2)) y_{N−1}.

Iterating we deduce

Y_n = max{ R_n, y_{n+1} } = max{ n/N, y_{n+1} } I_{X_n=1} + y_{n+1} I_{X_n>1},

y_n = (1/n) max{ n/N, y_{n+1} } + ((n−1)/n) y_{n+1}.
While it is difficult to find an explicit formula for yn , the above equalities can be
easily implemented on a computer. The optimal probability is pN = y1 . Here is a
less than optimal but simple R code that computes y1 given N .
optimal <- function(N) {
  p <- 1/N                  # y_N = 1/N
  m <- N - 1
  for (i in 1:m) {          # backward recursion over n = N-1, N-2, ..., 1
    p <- max((N - i)/N, p)/(N - i) + ((N - i - 1)/(N - i))*p
  }
  p                         # y_1, the optimal probability p_N
}
Here are some results. Below, p_N denotes the optimal probability of choosing the largest among N prizes.

N   | 3   | 4     | 5     | 6      | 8      | 100    | 200
p_N | 0.5 | 0.458 | 0.433 | 0.4277 | 0.4098 | 0.3710 | 0.3694
Note that y_{n+1} ≤ y_n, with equality precisely when y_{n+1} ≥ n/N. We deduce that

y_{n+1} ≥ n/N ⇒ y_{n+1} = y_n = ··· = y_1.

We set

N_* := max{ n; y_n ≥ (n − 1)/N },

so y_{N_*+1} < y_{N_*} = y_{N_*−1} = ··· = y_1. The optimal strategy is given by the stopping time T_{N_*}: reject the first N_* − 1 selected gifts and then pick the first gift that is more valuable than any of the preceding ones.
N   | 3 | 4 | 8 | 10 | 50 | 100 | 1000
N_* | 3 | 3 | 5 | 5  | 20 | 39  | 370
For example, for N = 10 the values y_n are as follows.

n   | 1     | 2     | 3     | 4     | 5     | 6     | 7    | 8    | 9    | 10
y_n | 0.398 | 0.398 | 0.398 | 0.398 | 0.398 | 0.372 | 0.32 | 0.26 | 0.18 | 0.1
In this case N_* = 5 and the optimal strategy corresponds to the stopping time T_5: reject the first four gifts and then accept the first gift more valuable than any of the previously seen ones. In this case the probability of choosing the most valuable gift is p_10 ≈ 0.398.
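The backward recursion is equally easy to replicate in other languages; the following Python sketch (function name ours) reproduces the values y_n tabulated above for N = 10:

```python
def secretary_values(N):
    """Backward recursion y_N = 1/N, y_n = max(n/N, y_{n+1})/n + ((n-1)/n) y_{n+1}."""
    y = [0.0] * (N + 2)            # y[n] stores y_n for 1 <= n <= N
    y[N] = 1.0 / N
    for n in range(N - 1, 0, -1):
        y[n] = max(n / N, y[n + 1]) / n + (n - 1) / n * y[n + 1]
    return y[1:N + 1]

print([round(v, 3) for v in secretary_values(10)])
# y_1 = y_2 = ... = y_5-ish ≈ 0.3987 and y_10 = 0.1 (the table truncates to 0.398)
```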
Let us sketch what happens as N → ∞. Consider the sequence z^N := (z_n)_{1≤n≤N+1} defined by backwards induction:

z_{N+1} = 0, z_n = ((n−1)/n) z_{n+1} + 1/N, 1 ≤ n ≤ N.
One can show by backwards induction that zn ≤ yn , ∀n ≤ N and zn = yn , ∀n ≥ N∗ .
Denote by f_N : [0, 1] → R the continuous function that is linear on each of the intervals [(i−1)/N, i/N] and such that

f_N(i/N) = z_{N+1−i}, i = 0, 1, ..., N.

Note that

f_N((i+1)/N) − f_N(i/N) = z_{N−i} − z_{N−i+1} = 1/N − (1/(N−i)) z_{N−i+1} = (1/N) ( 1 − (1/(1 − i/N)) f_N(i/N) ).
We recognize here the Euler scheme for the initial value problem

f′ = 1 − (1/(1−t)) f, f(0) = 0, (1.4.23)

corresponding to the subdivision {i/N} of [0, 1].
The unique solution of this equation is f(t) = −(1−t) log(1−t), and f_N(t) converges to f(t) uniformly on compact subsets of [0, 1). In fact (see [25, Sec. 212]), for every T ∈ (0, 1) there exists C_T > 0 such that

sup_{t∈[0,T]} | f_N(t) − f(t) | ≤ C_T / N.
We deduce that

lim_{n/N→τ} N( z_n − z_{n+1} ) = 1 + log τ  { > 0, τ > 1/e; < 0, τ < 1/e }.

This implies that as N → ∞ we have

N_*/N → 1/e ≈ 0.368, y_{N_*} = z_{N_*} → 1/e.

For details we refer to [28, Sec. 3.3] or [69].
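The Euler-scheme picture can be tested numerically. The sketch below (names ours) computes z^N for N = 1000, checks the C_T/N-type error bound away from t = 1, and confirms that the maximum of z^N is close to 1/e:

```python
import math

def z_values(N):
    """z_{N+1} = 0, z_n = ((n-1)/n) z_{n+1} + 1/N, computed downward."""
    z = [0.0] * (N + 2)
    for n in range(N, 0, -1):
        z[n] = (n - 1) / n * z[n + 1] + 1.0 / N
    return z

N = 1000
z = z_values(N)
# f_N(i/N) = z_{N+1-i}; compare with the exact solution f(t) = -(1-t) log(1-t)
err = max(abs(z[N + 1 - i] + (1 - i / N) * math.log(1 - i / N))
          for i in range(int(0.9 * N) + 1))
print(err)                  # of size C/N, as the uniform estimate predicts
print(max(z), 1 / math.e)   # the peak of z^N sits near the value 1/e
```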
As explained in [69], a (nearly) optimal strategy is as follows. Denote by m the largest integer satisfying

(N − 1/2)/e + 1/2 ≤ m ≤ (N − 1/2)/e + 3/2.

Reject the first m prizes and accept the next prize more valuable than any of the preceding ones. □
(ii) F_+ ⊥⊥_{F_0} F_−.
Remark 1.173. You should think of a system evolving in time. Then F_0 collects the present information about the system, F_− collects the past information and F_+ collects the future information. Roughly speaking, the above proposition shows that the information about an event given the present and the past coincides with the information given the present if and only if the future is independent of the past given the present. □
The kernel K is called a probability kernel or a Markovian kernel if K_{ω_0}[−] is a probability measure on (Ω_1, F_1) for any ω_0 ∈ Ω_0.
The condition (K_1) above shows that a kernel is a family (K_{ω_0}[−])_{ω_0∈Ω_0} of measures on (Ω_1, F_1) parametrized by Ω_0. Condition (K_2) is a measurability condition on this family. For this reason kernels are also known as random measures.
(i) For any f ∈ L⁰₊(Ω_1, F_1) we define its pullback by K to be the function

K*f : Ω_0 → [0, ∞], K*f(ω_0) = ∫_{Ω_1} f(ω_1) K_{ω_0}[dω_1].

Then K*f ∈ L⁰₊(Ω_0, F_0).

(ii) For any measure µ : F_0 → [0, ∞] we define its push-forward by K to be the measure K_*µ : F_1 → [0, ∞],

K_*µ[F_1] := ∫_{Ω_0} K_{ω_0}[F_1] µ[dω_0] ∈ [0, ∞], F_1 ∈ F_1.
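On finite spaces a Markovian kernel is simply a row-stochastic matrix, and the pullback and push-forward become matrix-vector products. A small Python sketch (all data invented) that also verifies the duality ⟨µ, K*f⟩ = ⟨K_*µ, f⟩ of (1.4.25):

```python
# Kernel on finite spaces: Omega0 = {0,1,2}, Omega1 = {0,1}.
# Each row K[w0] is a probability measure on Omega1 (a Markovian kernel).
K = [[0.5, 0.5],
     [0.2, 0.8],
     [1.0, 0.0]]
f = [3.0, -1.0]           # a function on Omega1
mu = [0.1, 0.6, 0.3]      # a probability measure on Omega0

# pullback  (K^* f)(w0) = sum_{w1} f(w1) K_{w0}[w1]
Kf = [sum(f[w1] * K[w0][w1] for w1 in range(2)) for w0 in range(3)]
# push-forward  (K_* mu)[w1] = sum_{w0} K_{w0}[w1] mu[w0]
Kmu = [sum(K[w0][w1] * mu[w0] for w0 in range(3)) for w1 in range(2)]

lhs = sum(mu[w0] * Kf[w0] for w0 in range(3))   # <mu, K^* f>
rhs = sum(Kmu[w1] * f[w1] for w1 in range(2))   # <K_* mu, f>
print(lhs, rhs)   # the two pairings coincide
```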
The statement (ii) follows from the Monotone Convergence theorem and (K_1). For part (iii), fix the measure µ. Observe that for F ∈ F_1 we have

⟨µ, K* I_F⟩ = ∫_{Ω_0} K* I_F(ω_0) µ[dω_0] = ∫_{Ω_0} K_{ω_0}[F] µ[dω_0] = K_*µ[F] = ⟨K_*µ, I_F⟩.

Thus (1.4.25) holds for f = I_F, F ∈ F_1. The general case follows by invoking the Monotone Class Theorem. □
u
where δ_{ω_1} denotes the Dirac measure on (Ω_1, F_1) concentrated at ω_1; see Example 1.31(a). Observe that for any measure µ on F_0 and any f ∈ L⁰₊(Ω_1, F_1) we have

K^T_* µ = T_# µ, (K^T)* f = T* f := f ◦ T.

Thus, (1.4.25) contains as a special case the change of variables formula (1.2.21).
(b) Consider the random measure K : Ω × B_R → [0, 1] in Example 1.174. Given a probability measure µ on (Ω, S) we have

K_*µ = Ber(f̄) = (1 − f̄) δ_0 + f̄ δ_1, f̄ := E_µ[f].
A priori, the negligible set N depends on the family (F_n)_{n≥1}, and there might not exist one negligible set that works for all such increasing families. When such a thing is possible we say that the conditional probability P[− ∥ S] admits a regular version

Q : Ω × F → [0, 1].

Proposition 1.178. If

Q : Ω × F → [0, 1], (ω, F) ↦ Q_ω[F]
The equality (1.4.26) can be written in the less precise but more intuitive way

E[X ∥ S] = ∫_Ω X(η) P[dη ∥ S]. (1.4.27)
• the random variable Q[F] is S̃-measurable, and
• for any S̃ ∈ S̃, F ∈ F we have

P̃[ S̃ ∩ T^{−1}(F) ] = ∫_{Ω̃} Q_{ω̃}[F] P̃[dω̃]. (1.4.29)

□
Remark 1.181. (a) The above is not the usual definition of a Lusin space, but it has the advantage that it emphasizes the compactness feature we need in the proof of Kolmogorov's existence theorem.
There are plenty of Lusin spaces. In fact, a topological space that is not Lusin is rather unusual. We refer to [15; 39] for a more in-depth presentation of these spaces and their applications in measure theory and probability. To give the reader a taste of the fauna of Lusin spaces we list a few examples.
(b) From a measure theoretic point of view the Lusin spaces are indistinguishable
from the Polish spaces. More precisely, for any Lusin space X, there exists a Polish
space Y and a Borel measurable bijection Φ : X → Y such that the inverse is also
Borel measurable; see [35, Prop. 8.6.13].
The Polish spaces have another important property: a Polish space equipped with the σ-algebra of Borel subsets is isomorphic as a measurable space to a Borel subset E of [0, 1] equipped with the σ-algebra of Borel subsets. For a proof we refer to [126, Sec. I.2]. Moreover, any two Borel subsets of R are measurably isomorphic if and only if they have the same cardinality, [126, Ch. I, Thm. 2.12].
On the other hand, it is known that the continuum hypothesis holds for the Borel subsets of a Polish space; see [39, Appendix III.80] or [98, XII.6]. In particular, any Borel subset of R is either finite, countable, or has the cardinality of the continuum. Combined with the above, this shows that any Borel subset of [0, 1] is measurably isomorphic to a compact subset of [0, 1]. Hence any Lusin space is Borel isomorphic to a compact metric space!
The measurable spaces isomorphic to a Borel subset E of [0, 1] equipped with the σ-algebra of Borel subsets are called standard measurable spaces and play an important role in probability. Hence the Lusin spaces are standard measurable spaces. □
Then, for every measurable map Y : (Ω, F) → (Y, B_Y) and every σ-subalgebra S ⊂ F there exists a regular version

Q : Ω × B_Y → [0, 1], (ω, B) ↦ Q_ω[B],

Q[B] = P[ Y ∈ B ∥ S ] a.s., ∀B ∈ B_Y.
Idea of proof. For a complete proof we refer to [33, Thm. IV.2.10], [39, III.71], [40, IX.11] or [135, II.89].
We can assume that Y is a compact metric space. Fix a dense countable subset U ⊂ C(Y) such that 1 ∈ U and U is a vector space over Q. We can find representatives Φ(u) of E[u(Y) ∥ S] such that the map

U ∋ u ↦ Φ(u) ∈ L¹(Ω, S, P)

is Q-linear, Φ(1) = 1 and Φ(u) ≥ 0 if u ≥ 0. For every nonnegative f ∈ C(Y) we set

Φ*(f) := sup{ Φ(u); u ∈ U, 0 ≤ u ≤ f }.

Moreover

E[Y ∥ S] = ∫_R y Q[dy], P-a.s. on Ω. □
Suppose that the distribution P_0 is known and we would like to get information about the distribution of (X_1, Y_1) by investigating T(X_0, Y_0). The above equality shows that knowledge of T adds nothing to our understanding of the density g^T(x, y) beyond what we know from (X_0, Y_0). □
Note that

P[ X_(n) ≤ x ] = P[ X_k ≤ x, ∀k = 1, ..., n ] = (x/L)^n,

so that the probability distribution of X_(n) is

P_n[dx] = n (x^{n−1}/L^n) I_{[0,L]}(x) dx.

Similarly,

P[ X_(1) > x ] = P[ X_k > x, ∀k = 1, ..., n ] = ((L − x)/L)^n,
so, denoting by G := X_(n) − X_(1) the range of the sample,

P[ G ≤ g ] = ∫_{[0,L]} P[ X_(n) ≤ X_(1) + g | X_(1) = x_1 ] ρ_1(x_1) dx_1,

where

P[ X_(n) ≤ x_1 + g | X_(1) = x_1 ] = ( g^{n−1}/(L − x_1)^{n−1} ) I_{[0,L−g]}(x_1) + I_{[L−g,L]}(x_1).
Thus

P[ G ≤ g ] = ∫_0^{L−g} (n g^{n−1}/L^n) dx_1 + ∫_{[L−g,L]} ρ_1(x_1) dx_1 = n g^{n−1}(L − g)/L^n + g^n/L^n.
We deduce

(d/dg) P[ G ≤ g ] = n(n−1)g^{n−2}/L^{n−1} − n² g^{n−1}/L^n + n g^{n−1}/L^n = ( n(n−1)g^{n−2}/L^{n−1} ) ( 1 − g/L ).

Thus, the probability distribution of G is

P_G[dg] = ( n(n−1)g^{n−2}/L^{n−1} ) ( 1 − g/L ) I_{[0,L]}(g) dg.

If L = 1 then the above distribution is the Beta distribution Beta(n − 1, 2). □
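A quick Monte Carlo sanity check of the formula for P[G ≤ g], with parameters chosen arbitrarily:

```python
import random

random.seed(1)
n, L, g = 5, 2.0, 0.7
trials = 200_000

hits = 0
for _ in range(trials):
    xs = [random.uniform(0, L) for _ in range(n)]
    if max(xs) - min(xs) <= g:        # the range G = X_(n) - X_(1)
        hits += 1

exact = n * g**(n - 1) * (L - g) / L**n + (g / L)**n
print(hits / trials, exact)           # empirical frequency vs. the closed formula
```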
For F_0 ∈ F_0 we have

µ[ F_0 × Ω_1 ] = ∫_{F_0} K_{ω_0}[Ω_1] µ_0[dω_0] = µ_0[F_0],

since K_{ω_0}[Ω_1] = 1, so that µ_0 = (π_0)_# µ. This shows that the measure µ_0 is uniquely determined, a priori, by µ. We can rewrite (1.4.32) as

µ[dω_0 dω_1] = (π_0)_# µ[dω_0] K_{ω_0}[dω_1]. (1.4.33)
Example 1.188. Consider a random 2-dimensional vector (X, Y) with joint distribution P_{X,Y} ∈ Prob(R²). According to Corollary 1.187, the distribution P_X of X disintegrates the joint distribution P_{X,Y}. Suppose that K_x[dy] is the associated disintegration kernel, i.e.,

P_{X,Y}[dxdy] = K_x[dy] P_X[dx].

Then, for any measurable function f : R → R such that f(Y) ∈ L¹ there exists a measurable function g : R → R such that g(X) = E[f(Y) ∥ X]. As in Remark 1.160 we denote g(x) by E[f(Y) ∥ X = x]. We can give a more explicit description of g(x) using the disintegration kernel. More precisely, we will show that

g(x) = ∫_R f(y) K_x[dy].
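A toy numerical illustration (all data invented): when the kernel is an explicit Bernoulli family, the kernel integral can be compared against an empirical conditional average.

```python
import random

random.seed(0)
# Toy disintegration: X uniform on {0, 1}, and the kernel K_x = Ber(p[x]),
# i.e. conditional on X = x, Y is Bernoulli with success probability p[x].
p = {0: 0.2, 1: 0.7}
f = lambda y: 3 * y - 1                       # an arbitrary integrable f

# kernel integral g(x) = ∫ f(y) K_x[dy] = f(0)(1 - p[x]) + f(1) p[x]
g = {x: f(0) * (1 - p[x]) + f(1) * p[x] for x in (0, 1)}

# Monte Carlo: E[f(Y) | X = x] estimated by averaging f(Y) over samples with X = x
samples = []
for _ in range(100_000):
    x = random.randint(0, 1)
    y = int(random.random() < p[x])
    samples.append((x, y))

for x0 in (0, 1):
    vals = [f(y) for (x, y) in samples if x == x0]
    print(x0, sum(vals) / len(vals), g[x0])   # the two columns agree closely
```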
Let

µ_0[dx] = ( √(1 + |f′(x)|²) / L ) · λ[dx] ∈ Prob([0, 1]).

Then the Borel probability measure P_{K,µ_0} on [0, 1] × R corresponds to the integration with respect to the normalized arclength along the graph of f. □
Example 1.190. Suppose that X_1, ..., X_n are independent random variables with common distribution p(x) λ[dx]. Denote by X the random vector (X_1, ..., X_n). Suppose that f : R^n → R is a Borel measurable function. Denote by P the distribution of the random vector (X, f(X)). This is disintegrated by the distribution µ_0 := P_X of the random vector X. The disintegration kernel K is the conditional distribution of f(X) given X. We deduce that

K_{x_1,...,x_n}[−] = δ_{f(x_1,...,x_n)}.

If B_0 is a Borel subset of R^n and B_1 is a Borel subset of R, then

P[ B_0 × B_1 ] = ∫_{B_0} I_{B_1}( f(x_1, ..., x_n) ) p(x_1) ··· p(x_n) dx_1 ··· dx_n.

Using a notation dear to theoretical physicists we can rewrite the above equality as

P[ dx_1 ··· dx_n dy ] = δ( y − f(x_1, ..., x_n) ) p(x_1) ··· p(x_n) dx_1 ··· dx_n dy,

where δ(z) denotes the Dirac "function" on the real axis. □
We have already met stochastic processes, though we have not called them so. This section is meant to be a first encounter with this vast subject. We have a rather restricted goal, namely to explain what they are, describe a few basic features and, more importantly, show that stochastic processes with prescribed statistics do exist as mathematical objects.
It comes with a natural family of random variables Xt : C [0, 1] → R, t ∈ [0, 1].
More precisely, the random variable Xt associates to a function f ∈ C [0, 1] its
value at t, Xt (f ) = f (t). Note that At = Xt ◦ A. Thus we can view A• as a random
continuous function.
Suppose now that (Xt )t∈T is a general family of random variables
Xt : (Ω, S, P) → (X, F).
This defines a map
X : T × Ω → X, T × Ω 3 (t, ω) 7→ X(t, ω) := Xt (ω) ∈ X,
such that Xt is measurable for any t.
Equivalently, we can view this as a map
X : Ω → XT = the space of functions f : T → X, (1.5.1)
where for each ω ∈ Ω we have a function X(ω) : T → X, t 7→ Xt (ω).
It is convenient to regard X^T as a product of copies X_t of X,

X^T = ∏_{t∈T} X_t.
distribution of the stochastic process (Xt )t∈T . In this way we can view the process
as defining a random function T → X.
For any finite set I = {t_1, ..., t_m} ⊂ T we have a sigma-algebra F_I in X^I,

F_I = F_{t_1} ⊗ ··· ⊗ F_{t_m},

and we obtain a random "vector"

X^I : (Ω, S) → (X^I, F_I), ω ↦ ( X_{t_1}(ω), ..., X_{t_m}(ω) ) ∈ X^I.
10 Take a few seconds to convince yourself of the validity of (1.5.2).
P_n = µ_0 ⊗ ··· ⊗ µ_n.

More precisely, we set K_{x_0,...,x_n}[dx] = K_{x_n}[dx]. In this case the measures P_n on F_{Î_n} are defined by

P_n[ dx_0 dx_1 ··· dx_n ] = µ_0[dx_0] K_{x_0}[dx_1] ··· K_{x_{n−1}}[dx_n],

P_n[S] = ∫_{X^{Î_n}} I_S(x⃗) µ_0[dx_0] K_{x_0}[dx_1] ··· K_{x_{n−2}}[dx_{n−1}] K_{x_{n−1}}[dx_n]. (1.5.4)

The above is an iterated integral, going from right to left, i.e., we first integrate with respect to x_n, next with respect to x_{n−1}, etc. Such a situation occurs in the context of Markov chains. □
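For a finite-state Markov chain, the iterated integral (1.5.4) collapses to nested sums evaluated right to left. A minimal sketch (chain invented):

```python
# Two-state Markov chain: initial law mu0, one-step kernel K (row-stochastic).
mu0 = [0.5, 0.5]
K = [[0.9, 0.1],
     [0.4, 0.6]]

def path_prob(path):
    """P[X_0 = path[0], ..., X_n = path[n]] = mu0[x0] K[x0][x1] ... K[x_{n-1}][x_n]."""
    p = mu0[path[0]]
    for a, b in zip(path, path[1:]):
        p *= K[a][b]
    return p

# P_2[S] for the cylinder set S = {x_0 = 0, x_2 = 1}: sum over the free
# coordinate x_1 -- the iterated integral becomes a plain sum.
prob = sum(path_prob([0, x1, 1]) for x1 in (0, 1))
print(prob)
```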
Fix a topological space X and a parameter set T. We denote by 2^T_0 the collection of finite subsets of T. For I ∈ 2^T_0 we denote by B_I the Borel σ-algebra in X^I equipped with the product topology. For any finite subsets I ⊂ J ⊂ T we denote by P_{IJ} the natural projection X^J → X^I. For t ∈ T we denote by π_t the natural projection

π_t : X^T → X, π_t(x) = x_t.

Remark 1.193. The sigma-algebra E_T can also be identified with the σ-algebra of the Borel subsets of X^T equipped with the product topology. □
Thus, using (1.5.5),

P_I[ π_I(C) ] = P_I[ P_{KI}^{−1}( π_K(C) ) ] = (P_{KI})_# P_I[ π_K(C) ] = P_K[ π_K(C) ],

and, similarly,

P_J[ π_J(C) ] = P_J[ P_{KJ}^{−1}( π_K(C) ) ] = (P_{KJ})_# P_J[ π_K(C) ] = P_K[ π_K(C) ].
Equivalently, if we set

B_n := C_∞ \ ⋃_{k=1}^n C_k ∈ F,
(i) The measure µ is called outer regular if for any Borel set B ∈ B_X we have

µ[B] = inf_{U ⊃ B, U open} µ[U].

(ii) The measure µ is called inner regular if for any Borel set B ∈ B_X we have

µ[B] = sup_{C ⊂ B, C closed} µ[C].

(iii) The measure µ is called regular if it is both inner and outer regular.

(iv) The measure µ is called Radon if it is outer regular, and for any Borel set B ∈ B_X we have

µ[B] = sup_{K ⊂ B, K compact} µ[K].

From the above definition it is clear that

µ is Radon ⇒ µ is regular.
A deep result in measure theory states that any Borel probability measure on a Lusin space is Radon, [15, Thm. 7.4.3]. For our immediate needs we can get away with a lot less. We have the following useful result, [126, Chap. II, Thm. 1.2]. A proof is outlined in Exercise 1.53.

Lemma 1.199. Let Y be a compact metric space. Then any Borel probability measure on Y is Radon. □
We can now complete the proof of Kolmogorov's theorem. Suppose there exists a decreasing sequence of sets (B_n)_{n≥1} in C_T such that

⋂_{n∈N} B_n = ∅.
For any n ∈ N, the space X^{I_n} is a compact metric space and Lemma 1.199 implies that the Borel probability measure P_{I_n} on X^{I_n} is Radon. Hence, for any n > 0, there exists a compact set K_n ⊂ S_n such that

P_{I_n}[ S_n \ K_n ] < δ/2^{n+1}.

Set

C_n := K_n × X^{T\I_n} ⊂ S_n × X^{T\I_n} = B_n.

Tikhonov's compactness theorem shows that all the products X^{T\I_n} are compact with respect to the product topology. Hence the sets C_n = K_n × X^{T\I_n} are also compact. Note that

P̂[ B_n \ C_n ] = P_{I_n}[ S_n \ K_n ] < δ/2^{n+1}, ∀n ∈ N. (1.5.8)
Set

D_n := ⋂_{j=1}^n C_j ∈ C_T.

Since B_n \ D_n ⊂ ⋃_{j=1}^n (B_j \ C_j), using (1.5.8) we deduce

P̂[ B_n \ D_n ] ≤ Σ_{j=1}^n P̂[ B_j \ C_j ] < Σ_{j=1}^n δ/2^{j+1} < δ/2.

Hence

P̂[ D_n ] > δ/2, ∀n ≥ 1.
This shows that D_n is nonempty ∀n ∈ N, so the decreasing sequence of nonempty and compact sets D_n has a nonempty intersection. We have reached a contradiction since

∅ = ⋂_{n≥1} B_n ⊃ ⋂_{n≥1} D_n ≠ ∅. □
Note that PXn = P, ∀n and the joint distribution of X1 , . . . , Xn is P⊗n . Thus, the
random variables (Xn ) are independent and have identical distributions. We have
thus proved the following fact.
Remark 1.201. The proof of Theorem 1.195 uses in an essential fashion the topological nature of the projective family of measures (P_I)_{I∈2^T_0}. We want to emphasize that in this theorem the set of parameters T is arbitrary.
If the set of parameters T is countable, say T = N_0, then one can avoid the topological assumptions.
Consider for example the projective family of measures P_n constructed in Example 1.191. Recall briefly its construction: we are given a sequence of measurable spaces (X_n, F_n)_{n≥0} and measures P_n on

( X_0 × ··· × X_n, F_0 ⊗ ··· ⊗ F_n ), n ≥ 0.

A theorem of C. Ionescu-Tulcea (see [85, Thm. 8.24]) states that there exists a unique probability measure P_∞ on F^{⊗∞} such that

(P_n)_# P_∞ = P_n, ∀n ≥ 0,

where P_n denotes the natural projection X^∞ → X_0 × ··· × X_n.
As a special case of this result let us mention an infinite-dimensional version of Fubini-Tonelli: given measures µ_n on F_n, there exists a unique measure µ_∞ on F^{⊗∞} such that

(P_n)_# µ_∞ = ⊗_{k=0}^n µ_k.

For this reason we will denote the measure µ_∞ by ⊗_{n=0}^∞ µ_n. □
1.6 Exercises
Exercise 1.1. Let S_0, S_1 be two sigma-algebras of subsets of a set Ω. Prove that the following are equivalent. □
Prove that (Ω_∞, d) is a compact metric space. Hint. Use the diagonal procedure to
Exercise 1.5. Suppose that (Ω, F, µ) is a measured space and (S, d) a metric space.
Consider a function
F : S × Ω → R, (s, ω) 7→ Fs (ω)
satisfying the following properties.
is differentiable at t_0 and

(d/dt)|_{t=t_0} ∫_Ω F(t, ω) µ[dω] = ∫_Ω F′(t_0, ω) µ[dω]. □
Exercise 1.7. Prove that the random variables N_1, ..., N_m that appear in Example 1.112 on the coupon collector problem can be realized as measurable functions defined on the same probability space. Hint. Use Exercise 1.3. □
Exercise 1.9 (M. Gardner). A family has two children. Find the conditional probability that both children are boys in each of the following situations.

probabilities.^12 What is the probability that A occurs before B? Hint. Consider the event C = (A ∪ B)^c = neither A nor B. Condition on the result of the first experiment, which can be A, B or C. □
Exercise 1.12. Suppose that X, Y : (Ω, S, P) → R are two random variables whose ranges 𝒳 and 𝒴 are countable subsets of R. We set

E[X ∥ Y] = Σ_{y∈𝒴} E[X ∥ Y = y] I_{Y=y} ∈ L⁰(Ω, σ(Y), P),

where

E[X ∥ Y = y] := Σ_{x∈𝒳} x P[ X = x ∥ Y = y ].

The random variable E[X ∥ Y] is called the conditional expectation of X given Y. Prove that

E[X] = E[ E[X ∥ Y] ]. □
12 For example if we roll a pair of dice, A could be the event “the sum is 4” and B could be the
Exercise 1.13 (Polya's urn). An urn U contains r_0 red balls and g_0 green balls. At each stage a ball is selected at random from the urn, we observe its color, we return it to the urn and then we add another ball of the same color. We denote by R_n the number of red balls and by G_n the number of green balls at stage n. Finally, we denote by C_n the "concentration" of red balls at stage n,

C_n = R_n/(R_n + G_n).

(i) Show that E[ C_{n+1} ∥ R_n ] = C_n, where the conditional expectation E[ C_{n+1} ∥ R_n ] is defined in Exercise 1.12.
(ii) Show that E[C_n] = r_0/(r_0 + g_0), ∀n ∈ N. □
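A simulation consistent with part (ii) — a numerical sketch, not a solution of the exercise:

```python
import random

random.seed(2)

def polya_mean_concentration(r0, g0, steps, trials):
    """Average of C_n = R_n/(R_n+G_n) after `steps` draws, over many independent runs."""
    total = 0.0
    for _ in range(trials):
        r, g = r0, g0
        for _ in range(steps):
            if random.random() < r / (r + g):  # a red ball is drawn...
                r += 1                         # ...returned together with a new red ball
            else:
                g += 1
        total += r / (r + g)
    return total / trials

avg = polya_mean_concentration(1, 3, 50, 20_000)
print(avg)  # stays near r0/(r0+g0) = 0.25, consistent with (ii)
```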
Exercise 1.14. Prove the claim about the events S_k at the end of Example 1.110. □
Exercise 1.17. A box contains n identical balls labelled 1, ..., n. Draw one ball, uniformly at random, and record its label N. Next flip a fair coin N times. What is the expected number of heads you get? Hint. Use Wald's formula. □
Exercise 1.20. Let N = N_m be the random variable defined in the coupon collector problem described in Example 1.112. Show that

Var[N_m] = m Σ_{k=1}^m (m − k)/k². □
(i) Prove that P[ N_ℓ ≤ 1 ] = 1 when ℓ < d.
(ii) Prove that E[ N_{ℓ_0+ℓ_1} ] = E[ N_{ℓ_0} ] + E[ N_{ℓ_1} ], ∀ℓ_0, ℓ_1 > 0.
(iii) Compute E[ N_ℓ ], ℓ > 0. □
in terms of Φ and γ_1. □
if the integral on the right-hand side is finite. Moreover, equality holds iff p = q. Hint. Show that p(x) − p(x) log p(x) ≤ q(x) − p(x) log q(x), ∀x ∈ R.

(iii) Show that if p ∈ Dens(R) satisfies

E[p] = 0 = E[γ_1], Var[p] = 1 = Var[γ_1],

then Ent[p] ≤ Ent[γ_1], with equality iff p = γ_1. □
(i) X ∼ Poi(λ).
(ii) E[ λ f(X + 1) − X f(X) ] = 0, for any bounded function f : N_0 → R. □
Exercise 1.30. Let Y ∼ N (0, 1) be a standard normal random variable and set
X := exp(Y ).
(ii) Prove that the probability distribution P_X of X is given by the log-normal law

P_X[dx] = p(x) dx, p(x) = { (1/(x√(2π))) e^{−(log x)²/2}, x > 0; 0, x ≤ 0 },

where log denotes the natural logarithm.

(iii) For α ∈ [−1, 1] we set

p_α(x) = { p(x)( 1 + α sin(2π log x) ), x > 0; 0, x ≤ 0 }.

Prove that for any α ∈ [−1, 1] and any n ∈ N_0 we have

∫_R x^n p_α(x) dx = e^{n²/2}.

Thus, for any α ∈ [−1, 1], the function p_α(x) is a probability density on R and the probability measure p_α(x) dx has the same moments as X. □
(i) Show that the power series defining PG_X is convergent for any |s| < 1. Moreover, ∀t ≤ 0 we have

M_X(t) = PG_X(e^t).

(ii) Compute PG_X when X ∼ Bin(n, p), X ∼ Geom(p), X ∼ Poi(λ). □
Exercise 1.33. Let µ_0, µ_1 ∈ Prob([0, 1]) be two Borel probability measures. Prove that the following statements are equivalent.

(i) ∫_0^1 x^n µ_0[dx] = ∫_0^1 x^n µ_1[dx], ∀n ∈ N. □
Exercise 1.35. Consider the interval [−π/2, π/2] equipped with the probability measure

P[dx] = (1/π) λ[dx],

λ = the Lebesgue measure. We regard the function
Exercise 1.36. For any a, b > 0 we define the incomplete Beta function

B_{a,b} : (0, 1) → R, B_{a,b}(x) = (1/B(a, b)) ∫_0^x t^{a−1}(1 − t)^{b−1} dt,

where B(a, b) is the Beta function (A.1.2).

(i) Show that

x^a (1 − x)^b / (a B(a, b)) = B_{a,b}(x) − B_{a+1,b}(x), (1.6.7a)

x^a (1 − x)^b / (b B(a, b)) = B_{a,b+1}(x) − B_{a,b}(x). (1.6.7b)

(ii) Show that if k, n ∈ N, k < n, we have

B_{k,n+1−k}(x) = Σ_{a=k}^n \binom{n}{a} x^a (1 − x)^{n−a}. (1.6.8)
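Identity (1.6.8) is easy to confirm numerically; the sketch below (crude quadrature, names ours) compares the two sides for one choice of parameters:

```python
import math

def binom_tail(n, k, x):
    """Right-hand side of (1.6.8): sum_{a=k}^{n} C(n,a) x^a (1-x)^(n-a)."""
    return sum(math.comb(n, a) * x**a * (1 - x)**(n - a) for a in range(k, n + 1))

def inc_beta(a, b, x, steps=100_000):
    """B_{a,b}(x) via a midpoint rule; 1/B(a,b) = Gamma(a+b)/(Gamma(a)Gamma(b))."""
    h = x / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * h
        total += t**(a - 1) * (1 - t)**(b - 1)
    return total * h * math.gamma(a + b) / (math.gamma(a) * math.gamma(b))

n, k, x = 10, 4, 0.3
print(inc_beta(k, n + 1 - k, x), binom_tail(n, k, x))  # the two values agree, per (1.6.8)
```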
p ≥ 0, ∫_{R^n} p(x_1, ..., x_n) dx_1 ··· dx_n = 1,

where the matrix A = (a_{ij})_{1≤i,j≤n} is invertible with inverse A^{−1} = (a^{ij})_{1≤i,j≤n}.
(iii) Set

X̄ := (1/n)( X_1 + ··· + X_n ), S² := (1/(n−1)) Σ_{i=1}^n ( X_i − X̄ )²,

T_n := X̄ / (S/√n).

Prove that T_n ∼ Stud_{n−1}, where Stud_p denotes the Student t-distribution with p degrees of freedom,

Stud_p = (1/√(pπ)) ( Γ((p+1)/2) / Γ(p/2) ) (1 + t²/p)^{−(p+1)/2} dt, t ∈ R, p > 0. □
Exercise 1.39. Fix a probability space (Ω, S, P). Show that L⁰(Ω, S, P) equipped with the metric dist defined in (1.3.53) is a complete metric space. More precisely, show that if a sequence of random variables X_n ∈ L⁰(Ω, S, P) is Cauchy in probability, i.e.,

lim_{m,n→∞} P[ |X_m − X_n| > r ] = 0, ∀r > 0,
Exercise 1.41. Suppose that X, Y are independent random variables with distributions P_X and, respectively, P_Y. Let f : R² → R be a Borel measurable function such that f(X, Y) is integrable. Show that

E[ f(X, Y) ∥ X ] = h(X),

where

h(x) = ∫_R f(x, y) P_Y[dy]. □
(i) X = Y a.s.
(ii) For any bounded Borel measurable function f : R → R, E[ f(X) ∥ F ] = f(Y) a.s. □
Define ord : R^n → C_n,

(x_1, ..., x_n) ↦ ord(x_1, ..., x_n) = (x_(1), x_(2), ..., x_(n)),

where

x_(1) = min{x_1, ..., x_n}, x_(2) = min( {x_1, ..., x_n} \ {x_(1)} ), .... □
Exercise 1.45. Suppose that X_1, ..., X_{n−1} are independent and uniformly distributed in [0, 1]. Consider their order statistics

X_(1) ≤ ··· ≤ X_(n−1)

and the corresponding spacings^16

S_1 = X_(1), S_2 = X_(2) − X_(1), ..., S_n = 1 − X_(n−1).

Denote by L_n the largest spacing, L_n = max(S_1, ..., S_n).

^15 To appreciate how surprising the conclusion (v) is, think of an institution that buys a large number n of computers, all of the same brand, where X_1, ..., X_n denote the lifetimes of these machines. Each is expected to last 1/λ years. The random variable X_(1) is the lifetime of the first computer that breaks down. The result in (v) shows that we should expect the first breakdown pretty soon, in 1/(nλ) years!
^16 The n − 1 points X_1, ..., X_{n−1} divide the interval [0, 1] into n subintervals and the spacings are the lengths of these subintervals.
have the same distribution. Hint. Use (iii) and Exercise 1.44(vi). Deduce that^17

E[L_n] = (1/n) Σ_{k=1}^n 1/k. □
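A Monte Carlo check of this formula for the expected largest spacing (a sketch, n chosen arbitrarily):

```python
import random

random.seed(3)
n, trials = 8, 100_000

total = 0.0
for _ in range(trials):
    pts = sorted(random.random() for _ in range(n - 1))     # n-1 uniform points
    spacings = [b - a for a, b in zip([0.0] + pts, pts + [1.0])]
    total += max(spacings)                                  # largest spacing L_n

exact = sum(1 / k for k in range(1, n + 1)) / n             # (1/n) H_n
print(total / trials, exact)
```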
Remark 1.202. Observe that the above exercise produces a strange identity,

Σ_{k=1}^n 1/k = Σ_{k=1}^n (−1)^{k+1} (1/k) \binom{n}{k}. □
Exercise 1.46. Consider the Poisson process (N(t))_{t≥0} with intensity λ described in Example 1.136. □

^17 This equality shows that E[L_n] ∼ (log n)/n, which is substantially higher than the mean of each individual spacing, E[S_k] = 1/n, ∀k.
Exercise 1.47. Consider the Poisson process (N(t))_{t≥0} with intensity λ described in Example 1.136. Let S be a nonnegative random variable independent of the arrival times (T_n)_{n≥0} of the Poisson process. For any arrival time T_n we denote by Z_{T_n,S} the number of arrival times located in the interval (T_n, T_n + S],

Z_{T_n,S} := #{ k > n; T_n < T_k ≤ T_n + S }.

Prove that

P[ Z_{T_n,S} = k ] = ∫_0^∞ e^{−λs} ( (λs)^k / k! ) P_S[ds]. □
Exercise 1.48. Suppose that N(t) is a Poisson process (see Example 1.136) with intensity λ and arrival times

T_1 ≤ T_2 ≤ ··· .

Fix t > 0 and let (X_n)_{n≥1} be i.i.d. random variables uniformly distributed in [0, t]. Prove that, conditional on N(t) = n, the random vectors

( T_1, ..., T_n ) and ( X_(1), ..., X_(n) )

have the same distribution. □
Exercise 1.49. Suppose that the 20 contestants at a quiz show are each given
the same question, and that each answers it correctly, independently of the others,
with probability P . But the difficulty of the question is that P itself is a random
variable.18 Suppose, for the sake of illustration, that P is uniformly distributed
over the interval (0, 1].
(i) What is the probability that exactly two of the contestants answer the question correctly?
(ii) What is the expected number of contestants that answer a question correctly? □
Exercise 1.50 (Skorokhod). Denote by Prob_0(R) the set of Borel probability measures µ on R such that

∫_R x µ[dx] = 0.

Clearly Prob_0(R) is a convex subset of the set Prob(R) of Borel probability measures on R.
For u, v ≥ 0 such that u + v > 0 we define the bipolar measure

β_{u,v} := ( v/(u+v) ) δ_{−u} + ( u/(u+v) ) δ_v ∈ Prob_0(R).

Let Q := { (u, v) ∈ R²; u, v ≥ 0, u + v > 0 }. We regard β_{u,v} as a random measure (or Markov kernel) β : Q × B_R → R,

β( (u, v), B ) = β_{u,v}[B].

Prove that for any µ ∈ Prob_0(R) there exists a Borel probability measure ν on Q such that µ = β_* ν. In other words, any measure µ ∈ Prob_0(R) is a mixture of bipolar measures. □

^18 Think of P as a random Bernoulli measure of the kind discussed in Example 1.174.
Exercise 1.51. Given sigma-algebras F_±, F_0 ⊂ S, prove that the following are equivalent.

(i) F_+ ⊥⊥_{F_0} F_−.
(ii) F_+ ⊥⊥_{F_0} ( F_0 ∨ F_− ). □

Exercise 1.52. Given sigma-algebras F_±, F_0 ⊂ S, prove that the following are equivalent.

(i) F_+ ⊥⊥ ( F_0 ∨ F_− ).
(ii) F_+ ⊥⊥ F_0 and F_+ ⊥⊥_{F_0} F_−. □
Exercise 1.53. Suppose that µ is a Borel probability measure on the metric space (X, d). Denote by C the collection of Borel subsets S of X satisfying the regularity property: for any ε > 0 there exists a closed subset C_ε ⊂ S and an open subset O_ε ⊃ S such that

µ[ O_ε \ C_ε ] < ε. □
Exercise 1.54. Suppose that (X, d) is a compact metric space and µ is a finite Borel measure on X. Prove that for any p ∈ [1, ∞) the space C(X) of continuous functions on X is dense in L^p(X, µ). Hint. Use Exercise 1.53 to show that for any Borel subset B ⊂ X the indicator function I_B can be approximated in L^p by continuous functions. □
Chapter 2
Limit theorems
The limit theorems have preoccupied mathematicians since the dawn of probability. The first law of large numbers goes back to Jacob Bernoulli at the end of the seventeenth century. The Golden Theorem in his Ars Conjectandi is what we call today a weak law of large numbers. Bernoulli considers an urn that contains a large number of black and white balls. If p ∈ (0, 1) is the proportion of white balls in the urn and we draw with replacement a large number n of balls, then the proportion p_n of white balls among the extracted ones is, with high confidence, within a given open interval containing p.
His result lacked foundations since the concept of probability lacked a proper definition. The situation improved at the beginning of the twentieth century when E. Borel proved a strong form of Bernoulli's law. Borel too lacked a good definition of a probability space, but he worked rigorously. In modern terms, he used the interval [0, 1] with the Lebesgue measure as probability space. He then proceeded to construct explicitly a sequence of functions X_n : [0, 1] → R which, viewed as random variables, are i.i.d. with common distribution Ber(1/2).
It took the efforts of Khinchin and Kolmogorov to settle the general case. The strong law of large numbers states that if (X_n)_{n∈N} are i.i.d. random variables with finite mean µ, then the empirical mean

M_n = (1/n) Σ_{k=1}^n X_k
extremely small under certain conditions. The fourth section is devoted to uniform limit theorems of the Glivenko-Cantelli type. We have included this section due to its applications in machine learning. In particular, we show how such results, coupled with the concentration inequalities, lead to Probably Approximately Correct, or PAC, learning.
The last section of this chapter is devoted to a brief introduction to the Brownian
motion. This is such a fundamental object that we thought that any student of
probability ought to make its acquaintance as soon as possible. As always, along
the way we present many, we hope, interesting examples.
This section is devoted to the (Strong) Law of Large Numbers. We follow Kol-
mogorov’s approach based on random series, a subject of independent interest.
Proof. For n ∈ N we denote by S_n the n-th partial sum of the series (2.1.1),

S_n := Σ_{k=1}^n X_k.

The L²-convergence follows immediately from (2.1.2b) which, coupled with the independence of the random variables (X_n), implies that the sequence (S_n) is Cauchy in L² since

‖ S_{n+k} − S_n ‖²_{L²} = Σ_{j=1}^k Var[ X_{n+j} ], ∀k, n ∈ N.
Σ_{k=1}^n E[ I_{A_k} S_n² ] = Σ_{k=1}^n E[ I_{A_k}( S_k² + 2 S_k (S_n − S_k) + (S_n − S_k)² ) ]

(using I_{A_k}, I_{A_k} S_k ⊥⊥ S_n − S_k)

= Σ_{k=1}^n ( E[ I_{A_k} S_k² ] + 2 E[ I_{A_k} S_k ] E[ S_n − S_k ] + E[ I_{A_k} ] E[ (S_n − S_k)² ] )

(the middle terms vanish since E[S_n − S_k] = 0, and the last terms are ≥ 0)

≥ Σ_{k=1}^n E[ I_{A_k} S_k² ] ≥ a² Σ_{k=1}^n P[A_k] = a² P[ M_n ≥ a ],

where in the last step we used S_k² ≥ a² on A_k. □
We can now complete the proof of Theorem 2.1. Using Kolmogorov's maximal
inequality for the sequence $(X_{m+n})_{n\in\mathbb{N}}$ we deduce that for any $n \in \mathbb{N}$ we have
$$\mathbb{P}\Big[\max_{1\le k\le n}|S_{m+k} - S_m| > \varepsilon\Big] \le \frac{1}{\varepsilon^2}\mathrm{Var}\big[S_{m+n} - S_m\big] = \frac{1}{\varepsilon^2}\sum_{k=1}^n \mathrm{Var}[X_{m+k}] \le \frac{1}{\varepsilon^2}\underbrace{\sum_{k\ge1}\mathrm{Var}[X_{m+k}]}_{=:r_m}.$$
Thus
$$\mathbb{P}\Big[\sup_{n\ge1}|S_{m+n} - S_m| > \varepsilon\Big] \le \frac{r_m}{\varepsilon^2}. \tag{2.1.4}$$
We set
$$Y_m := \sup_{i,j\ge m}|S_i - S_j|, \quad Z_m := \sup_{n\ge1}|S_{m+n} - S_m|.$$
Now observe that $(S_n)$ converges a.s. iff $Y_m \to 0$ a.s. The sequence $Y_m$ is nonincreasing and thus it converges a.s. to a random variable $Y \ge 0$. We will show that $Y = 0$ a.s.
Note that, for $i, j > m$ we have
$$|S_i - S_j| \le |S_i - S_m| + |S_j - S_m| \le 2Z_m,$$
so $Y_m \le 2Z_m$, $\forall m$, and thus
$$\{Y_m > 2\varepsilon\} \subset \{Z_m > \varepsilon\} \Rightarrow \mathbb{P}\big[Y_m > 2\varepsilon\big] \le \mathbb{P}\big[Z_m > \varepsilon\big].$$
The inequality (2.1.4) reads
$$\mathbb{P}\big[Z_m > \varepsilon\big] \le \frac{r_m}{\varepsilon^2}, \quad \forall m \ge 1,\ \forall\varepsilon > 0.$$
Since $r_m$ is the tail of a convergent series, $r_m \to 0$, and hence
$$\lim_{m\to\infty}\mathbb{P}\big[Y_m > 2\varepsilon\big] \le \lim_{m\to\infty}\mathbb{P}\big[Z_m > \varepsilon\big] = 0.$$
As $Y \le Y_m$ for every $m$, this shows $\mathbb{P}[Y > 2\varepsilon] = 0$ for every $\varepsilon > 0$, so $Y = 0$ a.s. $\square$
Example 2.3. Consider a sequence of i.i.d. Bernoulli random variables $(N_k)_{k\in\mathbb{N}}$ with success probability $\frac12$. The resulting random variables $R_k = (-1)^{N_k}$ are called Rademacher random variables and take only the values $\pm1$ with equal probabilities. We obtain the random series
$$\sum_{k\ge1}\frac{(-1)^{N_k}}{k} = \sum_{k\ge1}\frac{R_k}{k}.$$
Loosely speaking, this is a version of the harmonic series with random signs
$$\pm1 \pm \frac12 \pm \frac13 \pm \cdots, \tag{2.1.5}$$
where the $\pm$-choices at any term are equally likely and also independent of the choices at the other terms of the series. We set
$$X_k = \frac{(-1)^{N_k}}{k}.$$
We know that if all the terms are positive, a probability zero event, then we obtain the harmonic series, which is divergent. On the other hand,
$$\mathbb{E}\big[X_k\big] = 0, \quad \mathrm{Var}\big[X_k\big] = \frac{1}{k^2}.$$
Since
$$\sum_{k\ge1}\frac{1}{k^2} < \infty,$$
the random series is a.s. convergent. Thus, if we flip a fair coin with two sides, a $+$ side and a $-$ side, and we assign the signs in (2.1.5) according to the coin flips, the resulting series is convergent with probability 1! $\square$
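The convergence can be watched in a simulation. The sketch below is our illustration (not from the text): it follows one sample path of the random-sign harmonic series and records the partial sums at a few checkpoints.

```python
import random

random.seed(0)

def partial_sums(n_terms, checkpoints):
    """Partial sums of sum_k R_k/k along one sample path of fair random
    signs R_k = +/-1, recorded at the given checkpoint indices."""
    out, total = [], 0.0
    cps = set(checkpoints)
    for k in range(1, n_terms + 1):
        total += random.choice((-1.0, 1.0)) / k
        if k in cps:
            out.append(total)
    return out

s = partial_sums(200000, [50000, 100000, 200000])
# The checkpointed sums differ only by the tail sum_{k>n} R_k/k, whose
# variance sum_{k>n} 1/k^2 is roughly 1/n, so the sums have stabilized.
```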
Remark 2.4. Kolmogorov also established necessary and sufficient conditions for convergence in his three series theorem. Before we state it let us introduce a convenient notation. For any random variable $X$ and any positive constant $C$ we denote by $X^C$ the truncation
$$X^C := XI_{\{|X|\le C\}} = \begin{cases} X, & |X| \le C,\\ 0, & |X| > C.\end{cases} \tag{2.1.6}$$
The three series theorem states that the series $\sum_{n\ge1}X_n$ of independent random variables converges a.s. if and only if, for some $C > 0$, the following three series all converge:
$$\sum_{n\ge1}\mathbb{P}\big[|X_n| > C\big], \quad \sum_{n\ge1}\mathbb{E}\big[X_n^C\big], \quad \sum_{n\ge1}\mathrm{Var}\big[X_n^C\big].$$
Proof. We use the same strategy as in Example 1.149. Denote by $\sigma^2$ the common variance of the random variables $X_n$. Since they are independent we have $\mathrm{Var}\big[S_n\big] = n\sigma^2$, so
$$\mathrm{Var}\big[S_n/n\big] = \frac{1}{n^2}\mathrm{Var}\big[S_n\big] = \frac{\sigma^2}{n}.$$
Let $\varepsilon > 0$. Note that $\mathbb{E}\big[S_n/n\big] = \mu$. Chebyshev's inequality (1.3.17) implies
$$\mathbb{P}\big[|S_n/n - \mu| > \varepsilon\big] \le \frac{\sigma^2}{n\varepsilon^2} \to 0 \quad \text{as } n \to \infty.$$
Thus $S_n/n \to \mu$ in probability. $\square$
Note that
$$\mathbb{P}\big[|M_n| > \varepsilon\big] = \mathbb{P}\big[|M_n|^4 > \varepsilon^4\big] \le \frac{1}{\varepsilon^4}\mathbb{E}\big[M_n^4\big] = \frac{1}{n^4\varepsilon^4}\mathbb{E}\big[S_n^4\big].$$
Observe that
$$\mathbb{E}\big[S_n^4\big] = \sum_{i,j,k,\ell=1}^n \mathbb{E}\big[X_iX_jX_kX_\ell\big]. \tag{2.1.8}$$
Let $i \ne j$. Due to the independence of the random variables $(X_n)_{n\in\mathbb{N}}$ we have
$$\mathbb{E}\big[X_i^2X_j^2\big] = \mathbb{E}\big[X_i^2\big]\mathbb{E}\big[X_j^2\big] = \sigma^4, \quad \mathbb{E}\big[X_iX_j^3\big] = \mathbb{E}\big[X_i\big]\mathbb{E}\big[X_j^3\big] = 0.$$
Hence
$$\mathbb{P}\big[|M_n| > \varepsilon\big] = O\Big(\frac{1}{n^2\varepsilon^4}\Big) \quad \text{as } n \to \infty,$$
so that, for any $\varepsilon > 0$,
$$\sum_{n\ge1}\mathbb{P}\big[|M_n| > \varepsilon\big] < \infty,$$
and the Borel-Cantelli lemma implies $M_n \to 0$ a.s.
Remark 2.8. The above Strong Law of Large Numbers is not the most general, but its proof makes the role of independence much more visible. More precisely, the independence, or the small correlations, force the fourth moment of $S_n$ to be "unnaturally" small and thus the large fluctuations around the mean are highly unlikely, i.e., $\mathbb{P}\big[|M_n| > \varepsilon\big]$ is very small for large $n$. $\square$
Theorem 2.9 (The Strong Law of Large Numbers). Suppose that $(X_n)_{n\ge1}$ is a sequence of i.i.d. random variables $X_n \in L^1(\Omega, \mathcal{S}, \mathbb{P})$. Then
$$\lim_{n\to\infty}\frac{1}{n}S_n = \mu \quad a.s.$$
Lemma 2.10 (Kronecker's Lemma). Suppose that $(a_n)_{n\in\mathbb{N}}$ and $(x_n)_{n\in\mathbb{N}}$ are sequences of real numbers satisfying the following conditions: $(a_n)$ is positive, increasing and unbounded, and the series $\sum_{n\ge1}x_n/a_n$ is convergent. Then
$$\lim_{n\to\infty}\frac{1}{a_n}\sum_{k=1}^n x_k = 0.$$
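A quick numerical illustration of the lemma (ours, with an arbitrary choice of sequences): take $a_n = n$ and $x_k = (-1)^{k+1}\sqrt{k}$, so that $\sum_k x_k/a_k$ is a convergent alternating series; the lemma then predicts that the averages $\frac{1}{a_n}\sum_{k\le n}x_k$ tend to 0.

```python
import math

def kronecker_average(n):
    """(1/a_n) * sum_{k<=n} x_k with a_n = n and x_k = (-1)**(k+1)*sqrt(k).
    The series sum_k x_k/a_k = sum_k (-1)**(k+1)/sqrt(k) converges
    (alternating series), so Kronecker's lemma predicts the average -> 0."""
    return sum((-1) ** (k + 1) * math.sqrt(k) for k in range(1, n + 1)) / n

vals = [abs(kronecker_average(n)) for n in (10, 100, 1000, 10000)]
# The averages decay like 1/(2*sqrt(n)), in line with the lemma.
```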
is a.s. convergent. The independence assumption will finally play a role because we
will invoke the one-series theorem. Clearly the random variables $Z_n$ are independent. We claim that
$$\sum_{k\ge1}\frac{\mathrm{Var}[Z_k]}{k^2} < \infty. \tag{2.1.12}$$
1 Use Exercise 2.3 with pk,n = 1/n.
We have
$$\mathrm{Var}\big[Z_k\big] = \mathrm{Var}\big[Y_k\big] = \mathbb{E}\big[Y_k^2\big] - \mathbb{E}\big[Y_k\big]^2 \le \mathbb{E}\big[Y_k^2\big]$$
$$\stackrel{(1.3.43)}{=} \int_0^\infty 2y\,\mathbb{P}\big[|Y_k| > y\big]\,dy = \int_0^\infty 2y\,\mathbb{P}\big[k \ge |X_k| > y\big]I_{\{y<k\}}\,dy$$
$$\le \int_0^\infty 2y\,\mathbb{P}\big[|X_k| > y\big]I_{\{y<k\}}\,dy.$$
Thus
$$\sum_{k\ge1}\frac{\mathrm{Var}[Z_k]}{k^2} \le \sum_{k\ge1}\frac{1}{k^2}\int_0^\infty 2y\,\mathbb{P}\big[|X_k| > y\big]I_{\{y<k\}}\,dy$$
$$= \int_0^\infty \Big(\sum_{k\ge1}\frac{1}{k^2}I_{\{y\le k\}}\Big)2y\,\mathbb{P}\big[|X_1| > y\big]\,dy = \int_0^\infty \underbrace{2y\sum_{k\ge y}\frac{1}{k^2}}_{=:w(y)}\,\mathbb{P}\big[|X_1| > y\big]\,dy.$$
We claim that
$$w(y) < 6, \quad \forall y \ge 0. \tag{2.1.13}$$
Indeed, for $y \le 1$ we have
$$w(y) = 2y\sum_{k\ge1}\frac{1}{k^2} \le 4y \le 4,$$
while for larger $y$, comparing the sum with an integral yields $\sum_{k\ge y}\frac{1}{k^2} \le \frac{1}{\lfloor y\rfloor - 1}$, so
$$w(y) \le \frac{2y}{\lfloor y\rfloor - 1} \le \frac{2\lfloor y\rfloor + 2}{\lfloor y\rfloor - 1} = 2 + \frac{4}{\lfloor y\rfloor - 1} < 6.$$
Using (2.1.13) we deduce
$$\sum_{k\ge1}\frac{\mathrm{Var}[Z_k]}{k^2} < 6\int_0^\infty \mathbb{P}\big[|X_1| > y\big]\,dy = 6\,\mathbb{E}\big[|X_1|\big] < \infty.$$
This proves (2.1.12) and completes the proof of the SLLN, assuming Lemma 2.10. $\square$
We have2
$$\sum_{k=1}^n a_ky_k = \sum_{k=1}^n a_k(s_k - s_{k-1}) = a_ns_n - \sum_{k=1}^n s_{k-1}(a_k - a_{k-1}).$$
Now set
$$w_k := a_k - a_{k-1}, \quad p_{n,k} := \frac{w_k}{a_n}.$$
Since $(a_n)_{n\in\mathbb{N}}$ is increasing, positive and unbounded we deduce
$$\sum_{k=1}^n p_{n,k} = 1, \quad \forall n \ge 1, \qquad \lim_{n\to\infty}p_{n,k} = 0, \quad \forall k. \tag{2.1.14}$$
Observe that
$$\frac{1}{a_n}\sum_{k=1}^n a_ky_k = s_n - \sum_{k=1}^n p_{n,k}s_{k-1}. \qquad\square$$
Then the empirical mean $\frac{1}{n}S_n$ converges in probability to $\mu$. $\square$
2 This is classically known as Abel’s trick. It is a discrete version of the integration by parts trick.
Remark 2.12. Let us observe that in Theorem 2.6 the random variables $X_n$ need not be independent or identically distributed. Assuming all have mean 0, all we need for the Weak Law of Large Numbers to hold is that the random variables are pairwise uncorrelated,
$$\mathbb{E}\big[X_mX_n\big] = \mathbb{E}\big[X_m\big]\mathbb{E}\big[X_n\big], \quad \forall m \ne n. \tag{2.1.15}$$
In Exercise 2.6 we ask the reader to show that the WLLN holds even if we assume something weaker than (2.1.15), namely that if $|m - n| \gg 1$, the random variables $X_m$ and $X_n$ are weakly correlated. More precisely,
$$\lim_{k\to\infty}\sup_{m\in\mathbb{N}}\mathbb{E}\big[X_mX_{m+k}\big] = 0.$$
Similarly, for the Strong Law of Large Numbers to hold we do not need the variables to be independent. The theorem continues to hold if the variables are identically distributed, integrable and only pairwise independent. For a proof we refer to [53, Sec. 2.4].
The arguments in the proof of Theorem 2.7 show that the SLLN holds even when the variables $X_n$ are neither independent, nor identically distributed. Assuming that all the variables have mean zero, the SLLN holds if any four of them are independent, and the only assumption about their distributions is a uniform bound on the fourth moments, $\sup_n\mathbb{E}\big[X_n^4\big] < \infty$.
A natural philosophical question arises. What makes the Law of Large Numbers
possible? The above discussion suggests that it is a consequence of a mysterious
interplay between some form of independence and some “asynchronicity”: their fluc-
tuations around the mean cannot be in resonance. These features can be observed
in the other Laws of Large Numbers we will discuss in this text.
If the random variables are independent, but not necessarily identically dis-
tributed, there are known necessary and sufficient conditions for the WLLN to
hold. We refer to [59, IX], [70, §22.], or [127, Chap. 4] for details. t
u
Remark 2.13. Suppose that $(X_n)_{n\ge1}$ is a sequence of i.i.d. variables. The Strong Law of Large Numbers shows that if they have finite mean $\mu$, then the empirical means
$$M_n = \frac{1}{n}\big(X_1 + \cdots + X_n\big)$$
converge a.s. to $\mu$. If $\mu = \infty$ and $M_n$ converge a.s. to a random variable $M_\infty$, then $M_\infty$ is a.s. constant. Exercise 2.9 outlines a proof of this fact. $\square$
Example 2.14. Suppose we roll a fair die a large number $n$ of times and we denote by $S_n$ the number of times we roll a 1. Intuition tells us that if the die is fair, then for large $n$, the fraction of times we get a 1 should be close to $\frac16$, i.e.,
$$\frac{S_n}{n} \approx \frac16 \quad \text{for } n \gg 0.$$
This follows from the SLLN. Indeed, the above experiment is encoded by a sequence $(X_n)_{n\in\mathbb{N}}$ of i.i.d. Bernoulli random variables with success probability $p = \frac16$. Then
$$S_n = \sum_{k=1}^n X_k,$$
and the SLLN shows that
$$\frac{S_n}{n} \to \mathbb{E}\big[X_1\big] = \frac16 \quad \text{a.s. as } n \to \infty.$$
It helps to visualize a computer simulation of such an experiment. Suppose we roll a die a large number $N$ of times. For $i = 1, \ldots, N$ we denote by $f_i$ the frequency of 1-s during the first $i$ trials, i.e.,
$$f_i = \frac{S_i}{i}.$$
The resulting vector $(f_i)_{1\le i\le N} \in \mathbb{R}^N$ is called the relative or cumulative frequency. The R code below simulates one such experiment when we roll the die 12,000 times.

N<-12000
x<-sample(1:6, N, replace=TRUE)  # N independent rolls of a fair die
rolls<-x==1                      # TRUE exactly when we rolled a 1
rel_freq<-cumsum(rolls)/(1:N)    # f_i = S_i/i, the cumulative frequency
Fig. 2.1 The frequencies $f_i$ fluctuate wildly initially and then stabilize around the horizontal line $y = 1/6$, in perfect agreement with the SLLN.
Example 2.16. A good example to have in mind is the "alphabet" of the English language. In this alphabet we throw in not just the letters, but also the punctuation signs and the blank space. The elements $x_i$ are letters/symbols of the alphabet. The probabilities $p(x_i)$ can be viewed as the frequencies of the symbols $x_i$ in written texts. One way to estimate these frequencies3 is to count the number of their occurrences in a large text, say Moby Dick.
Another good example is the alphabet $\{0,1\}$ used in computer languages. The frequencies are $p(0) = p(1) = \frac12$. $\square$
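A frequency estimate of the kind just described can be sketched in a few lines of Python (our illustration; the tiny sample string is of course no substitute for a large corpus such as Moby Dick):

```python
from collections import Counter

def symbol_frequencies(text):
    """Empirical probabilities p(x) of the symbols in a text, treating
    the blank space as a symbol of the alphabet."""
    counts = Counter(text)
    total = sum(counts.values())
    return {sym: c / total for sym, c in counts.items()}

# A tiny illustrative sample; real estimates would use a large corpus.
p = symbol_frequencies("call me ishmael some years ago")
```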
We can view the entropy as a function $\mathrm{Ent}_2 : \Delta_{m-1} \to [0,\infty)$. One can check that it is concave since the function $[0,\infty) \ni x \mapsto f(x) = -x\log_2 x$ is strictly concave. We have
$$\mathrm{Ent}_2\big[p\big] = \sum_{i=1}^m f(p_i).$$
Jensen's inequality shows that
$$\frac{1}{m}\sum_{i=1}^m f(p_i) \le f\Big(\frac{1}{m}\sum_{i=1}^m p_i\Big) = f(1/m) = \frac{\log_2 m}{m},$$
3 As a curiosity, the letter "e" is the most frequent letter of the English language; it appears 13%
of the time in large texts. It is for this reason that it has the simplest Morse code, a dot.
with equality if and only if $p_1 = \cdots = p_m = \frac{1}{m}$. We deduce
$$\mathrm{Ent}_2\big[p\big] \le \log_2|\mathcal{X}|, \quad \forall p \in \mathrm{Prob}(\mathcal{X}), \tag{2.1.17}$$
with equality if and only if $p$ is the uniform probability measure. We will see later that the above is a special case of the Gibbs' inequality (2.3.8). Intuitively, this inequality says that among all the probability measures on a finite set, the uniform one is the most "chaotic", the least "predictable".
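The inequality (2.1.17) is easy to verify numerically; the snippet below (our illustrative sketch) computes $\mathrm{Ent}_2[p]$ for the uniform measure and for a biased one on a 4-element set.

```python
import math

def ent2(p):
    """Binary entropy Ent2[p] = -sum_i p_i * log2(p_i), with 0*log 0 := 0."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

m = 4
uniform = [1 / m] * m
biased = [0.5, 0.25, 0.125, 0.125]
# Jensen's inequality: Ent2[p] <= log2(m), with equality iff p is uniform.
```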
We will refer to the elements of $\mathcal{X}^n$ as words of length $n$. The term "word" is a bit misleading. For example, when $\mathcal{X}$ is the English alphabet as above, an element of $\mathcal{X}^n$ with large $n$ can be thought of as the sequence of symbols appearing in a large text. On the other hand, we can think of $\mathcal{X}^n$ itself as a new alphabet with frequencies
$$p_n(x_1,\ldots,x_n) = p(x_1)\cdots p(x_n).$$
The amount of "surprise" of a word $(x_1,\ldots,x_n)$ is
$$S(x_1,\ldots,x_n) = \sum_{k=1}^n S(x_k).$$
The entropy of $(\mathcal{X}^n, p_n)$ is
$$\mathrm{Ent}_2\big[p_n\big] = n\,\mathrm{Ent}_2\big[p\big].$$
We denote by $\mathcal{X}^*$ the disjoint union of the sets $\mathcal{X}^n$,
$$\mathcal{X}^* = \bigsqcup_{n\in\mathbb{N}}\mathcal{X}^n,$$
(i) $p_n\big[A_\varepsilon^{(n)}\big] > 1 - \varepsilon$.
(ii) $|A_\varepsilon^{(n)}| \le 2^{n(\mathrm{Ent}_2[p]+\varepsilon)}$.
(iii) $|A_\varepsilon^{(n)}| \ge (1-\varepsilon)2^{n(\mathrm{Ent}_2[p]-\varepsilon)}$.
The Asymptotic Equipartition Property (or AEP) shows that a typical set has probability nearly 1, all its elements are nearly equiprobable, and its cardinality is nearly $2^{n\,\mathrm{Ent}_2[p]}$. The inequality (2.1.17) shows that if $p$ is not the uniform probability measure on $\mathcal{X}$, then
$$2^{\mathrm{Ent}_2[p]} < |\mathcal{X}|.$$
Hence, if $\varepsilon > 0$ is sufficiently small, then
$$\frac{|A_\varepsilon^{(n)}|}{|\mathcal{X}^n|} \to 0$$
exponentially fast as n → ∞. That is, the typical sets have high probability and
are “extremely small” if the entropy is small.
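The three properties of the AEP can be checked by brute force for a small alphabet. The snippet below is our illustration, with the arbitrary choices $p(0) = 0.8$, $p(1) = 0.2$, $n = 12$, $\varepsilon = 0.11$; it enumerates all $2^{12}$ words.

```python
import math
from itertools import product

p = {0: 0.8, 1: 0.2}                  # illustrative two-symbol alphabet
H = -sum(q * math.log2(q) for q in p.values())   # Ent2[p], about 0.722 bits
n, eps = 12, 0.11

typical, prob_typical = [], 0.0
for word in product(p, repeat=n):
    logp = sum(math.log2(p[x]) for x in word)    # log2 of p_n(word)
    if abs(-logp / n - H) <= eps:                # word is eps-typical
        typical.append(word)
        prob_typical += 2.0 ** logp

size_bound = 2.0 ** (n * (H + eps))              # property (ii) of the AEP
```

For such a small $n$ the typical set does not yet carry probability $1-\varepsilon$; the AEP is an asymptotic statement, and the probability of the typical set climbs toward 1 as $n$ grows.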
This suggests the following coding procedure. Fix $\varepsilon > 0$ so that $1 - \varepsilon$ will be our confidence level. For $n > N(\varepsilon)$ the set $A_\varepsilon^{(n)}$ has about $2^L$ elements, where $L \approx n\,\mathrm{Ent}_2\big[p\big]$, and thus we can find an injection
$$I : A_\varepsilon^{(n)} \to B_L.$$
For $x \in A_\varepsilon^{(n)}$ we attach the symbol 1 at the beginning of the word $I(x) \in B_L$ and the resulting word in $B_{L+1}$ will encode $x$. It uses $L+1$ bits. The first bit is 1 and indicates that the word $x$ is typical.
We are less careful with the atypical words. Choose any map
$$J : \mathcal{X}^n \setminus A_\varepsilon^{(n)} \to B_L$$
and encode an atypical word $x$ using the binary word $J(x)$ with a prefix 0 attached to indicate that it is atypical. The resulting map $C : \mathcal{X}^n \to B_{L+1}$ is not injective, but if two words have the same code, they must be atypical and thus occur with very small frequency. This is an example of compression.
Take for example the English language. There are various estimates for its entropy, starting with the pioneering work of Claude Shannon. Most recent ones4 vary from 1 to 1.5 bits. How do we efficiently encode texts consisting of $n = 10^6$ symbols, say? For example, "Moby Dick" has 206,052 words and the average length of an English word is 5 letters, so "Moby Dick" consists of about 1.03 million symbols. Forgetting capitalization and punctuation, there are $26^n$ such texts and a brute-force encoding would require $26^n$ codewords to cover all the possibilities. The above result however says that roughly $2^{1.5n}$ texts suffice to capture nearly surely almost everything. The term compression is fully justified since this is a much smaller fraction of the total number of possible texts. Also, we only need codewords of length 1.5 million bits. Thus we need roughly 1.5 megabits to encode such a text. If the letters of the alphabet were uniformly distributed in human texts5 then the entropy would be $\log_2(26) \approx 4.70 > 3\times1.5$ and we would need more than three times that amount of memory to store it.
Remark 2.19. The story does not end here and much more precise results are available. To describe some of them note first that for any alphabet $\mathcal{X}$ there is an obvious operation of concatenation
$$* : \mathcal{X}^m \times \mathcal{X}^n \to \mathcal{X}^{m+n}, \quad (x, x') \mapsto x * x',$$
where the word $x * x'$ is obtained by writing in succession the word $x$ followed by $x'$. Note that this code uses on average $\frac{L+1}{n} \approx \mathrm{Ent}_2\big[p\big]$ bits per symbol in a word. This is an example of compression.
4 A Google search with the keywords “entropy of the English language” will provide many more
but we can safely call the resulting texts highly atypical of the English texts humans are used to.
Such codes are called Shannon codes. The above code is a Shannon code. In fact it
is a special example of the famous Huffman code, [37].
Let us discuss a particularly suggestive experiment that highlights a defining
feature of Huffman codes and reveals one interpretation of the entropy of an alpha-
bet.
Suppose we have an urn containing the letters a, b, c, d, in proportions $p_a, p_b, p_c, p_d$. A person randomly draws a letter from the urn and you are supposed to guess what it is by asking YES/NO questions. Think YES = 1, NO = 0. The above code describes an optimal guessing strategy. Here it is.
(1) Ask first if the letter is a → 1. If the answer is YES (= 1), the game is over.
The game has length 1 with probability 1/2.
(01) If the answer is NO (= 0) the letter can only be b, c or d. Ask if the letter is
b → 01. If the answer is YES (= 1) the game is over. The game has length 2
with probability 1/3.
(001) If the answer is NO (= 0) ask if the letter is c → 001. The game has length
3 with probability 1/6.
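The expected length of the game is easy to compute. The sketch below is ours; we assume, purely for illustration, the proportions $p_a = 1/2$, $p_b = 1/3$, $p_c = p_d = 1/12$, which are consistent with the game-length probabilities 1/2, 1/3, 1/6 quoted above.

```python
def expected_questions(probs):
    """Expected number of YES/NO questions when the letters are guessed one
    by one in order of decreasing probability; the last two letters both
    cost m-1 questions, since the final NO identifies the last letter."""
    probs = sorted(probs, reverse=True)
    m = len(probs)
    return sum(q * min(i + 1, m - 1) for i, q in enumerate(probs))

# Hypothetical proportions consistent with the game-length probabilities.
e = expected_questions([1 / 2, 1 / 3, 1 / 12, 1 / 12])
```

The value $e = 5/3 \approx 1.67$ questions lies just above the entropy $\mathrm{Ent}_2[p] \approx 1.63$ bits of this distribution, illustrating the interpretation of entropy mentioned above.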
For more details about information theory and its applications we refer to [37; 112]. For a more informal introduction to information theory we refer to [60]. The eminently readable [71] contains a historical perspective on the evolution of information theory. Kolmogorov's brief but intuition-rich survey [95] is a good place to start learning about the mathematical theory of information. $\square$
The goal of this section is to prove a striking classical result that adds to the information provided by the Law of Large Numbers.
Suppose that (Xn )n∈N is a sequence of i.i.d. random variables with mean µ and
finite variance σ 2 . Note that the sum Sn := X1 + · · · + Xn has mean nµ and
variance nσ 2 . Loosely speaking, the central limit theorem states that for large n
the probability distribution of Sn “resembles” very much a Gaussian with the same
mean and variance.
For example, if the Xn -s are Bernoulli random variables with success probability
p, then µ = p, σ 2 = pq and Sn ∼ Bin(n, p). In Figure 2.2 we have illustrated what
happens in the case p = 0.3 and n = 65.
The vertical lines depict the probability mass function of the binomial distri-
bution while the curve wrapping them is the Gaussian with the same mean and
variance. They obviously do “resemble”. However, we need to define precisely
what we mean by “resemble”.
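The "resemblance" in Figure 2.2 can be quantified directly. The snippet below is our illustration: it compares the $\mathrm{Bin}(65, 0.3)$ mass function with the Gaussian density of the same mean and variance, as in the figure.

```python
import math

n, p = 65, 0.3
mu, var = n * p, n * p * (1 - p)      # mean 19.5, variance 13.65

def binom_pmf(k):
    """P[S_n = k] for S_n ~ Bin(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def gauss_density(x):
    """Gaussian density with the same mean and variance as S_n."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

# Largest pointwise gap between the pmf and the wrapping Gaussian curve.
max_gap = max(abs(binom_pmf(k) - gauss_density(k)) for k in range(n + 1))
```

The gap is a small fraction of the peak value $\approx 0.11$ of the pmf, which is what the figure shows visually.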
$$\|f\|_\infty := \sup_{x\in X}|f(x)|.$$
(i) We say that the sequence $(\mu_n)$ converges vaguely to $\mu \in \mathrm{Meas}(X)$, and we write this $\mu_n \dashrightarrow \mu$, if
$$\lim_{n\to\infty}\int_X f(x)\,\mu_n[dx] = \int_X f(x)\,\mu[dx], \quad \forall f \in C_0(X). \tag{2.2.1}$$
(ii) We say that the sequence $(\mu_n)$ converges weakly to $\mu \in \mathrm{Meas}(X)$, and we write this $\mu_n \Rightarrow \mu$, if
$$\lim_{n\to\infty}\int_X f(x)\,\mu_n[dx] = \int_X f(x)\,\mu[dx], \quad \forall f \in C_b(X). \tag{2.2.2}$$
$$\mathbb{P}_{X_n} \Rightarrow \mathbb{P}_X \quad \text{in } \mathrm{Prob}(X),$$
i.e.,
$$\lim_{n\to\infty}\mathbb{E}\big[f(X_n)\big] = \mathbb{E}\big[f(X)\big], \quad \forall f \in C_b(X). \tag{2.2.3}$$
We will use the notation $X_n \xrightarrow{d} X$ to indicate that $X_n$ converges to $X$ in distribution. $\square$
The sum in the right-hand side of the above equality is a Riemann sum for $f$ corresponding to the uniform partition
$$0 < \frac{1}{n} < \frac{2}{n} < \cdots < \frac{n-1}{n} < 1.$$
Since $f$ is Riemann integrable we deduce
$$\lim_{n\to\infty}\frac{1}{n}\sum_{k=1}^n f(k/n) = \int_0^1 f(x)\,dx = \int_{\mathbb{R}} f(x)\,\mu[dx]. \qquad\square$$
Example 2.24. There exist vaguely convergent sequences of Borel probability measures on $\mathbb{R}$ that are not weakly convergent. Take for example $\mu_n = \delta_n$, $n \in \mathbb{N}$. Then $\mu_n \dashrightarrow 0$, yet $\mu_n$ does not converge weakly to 0 since $\mu_n\big[\mathbb{R}\big] = 1$, $\forall n$. $\square$
Proof. We deduce from Corollary 1.145 that for any $f \in C_b(\mathbb{R})$ the random variables $f(X_n)$ converge in probability to $f(X)$. The bounded convergence theorem implies
$$\lim_{n\to\infty}\mathbb{E}\big[f(X_n)\big] = \mathbb{E}\big[f(X)\big], \quad \forall f \in C_b(\mathbb{R}). \qquad\square$$
Proof. (i) ⇒ (ii) According to Theorem 1.198 the measure $\mu$ is regular, i.e., for any $\varepsilon > 0$ there exists a closed set $C_\varepsilon \subset U$ such that
$$\mu\big[U\big] > \mu\big[C_\varepsilon\big] > \mu\big[U\big] - \varepsilon.$$
Consider the continuous function
$$f : \mathbb{R} \to [0,1], \quad f(x) = \frac{\mathrm{dist}(x, U^c)}{\mathrm{dist}(x, U^c) + \mathrm{dist}(x, C_\varepsilon)}.$$
Since $f = 1$ on $C_\varepsilon$ and $f = 0$ outside $U$ we have
$$\mu_n\big[f\big] \le \mu_n\big[U\big], \quad \forall n \in \mathbb{N}.$$
In particular, we deduce that, $\forall\varepsilon > 0$, we have
$$\mu\big[U\big] - \varepsilon < \mu\big[C_\varepsilon\big] \le \mu\big[f\big] = \lim_{n\to\infty}\mu_n\big[f\big] \le \liminf_n \mu_n\big[U\big].$$
Hence
$$\lim_n \mu_n\big[C\big] = \mu\big[C\big], \quad \lim_n \mu_n\big[B\big] = \lim_n \mu_n\big[C\big] + \lim_n \mu_n\big[\partial B\big] = \mu\big[B\big].$$
(iv) ⇒ (i). Clearly it suffices to show that $\mu_n\big[f\big] \to \mu\big[f\big]$ for any nonnegative, bounded, continuous function $f$ on $\mathbb{R}^k$.
Suppose that $f$ is such a function. Set $K = \sup f$. For any $\nu \in \mathrm{Prob}\big(\mathbb{R}^k\big)$ we can regard $f$ as a random variable $(\mathbb{R}^k, \mathcal{B}_{\mathbb{R}^k}, \nu) \to \mathbb{R}$. The integral $\nu\big[f\big]$ is then the expectation of this random variable. Note that
$$\nu\big[\{f = t\}\big] = 0 \Rightarrow \nu\big[\partial\{f > t\}\big] = 0. \qquad\square$$
Thus, the sequence $\mu_n$ satisfies the condition (ii) in the Portmanteau Theorem 2.28, where $U$ is any open interval of the real axis. Since any open set of the real axis is a disjoint union of countably many open intervals, we deduce that condition (ii) in the Portmanteau Theorem is satisfied for all the open sets $U \subset \mathbb{R}$. $\square$
Theorem 2.30 (Slutsky). Suppose that (Xn )n∈N and (Yn )n∈N are sequences of
random variables such that (Xn ) converges in distribution to X and Yn converges
in probability to c ∈ R. Then the sum Xn + Yn converges in distribution to X + c.
We can now formulate and prove the main convergence criterion of this sub-
section.
Theorem 2.31. Suppose that (µn )n∈N is a sequence of finite Borel measures on
R. Fix a subset F ⊂ Cb (R) whose closure in Cb (R) contains C0 (R). The following
statements are equivalent.
Proof. Since
$$\mu\big[\mathbb{R}\big] = \int_{\mathbb{R}} I_{\mathbb{R}}\,\mu[dx] = \lim_{n\to\infty}\int_{\mathbb{R}} I_{\mathbb{R}}\,\mu_n[dx],$$
we can replace the measures $\mu_n$ by $\frac{1}{\mu_n[\mathbb{R}]}\mu_n$ and thus we can assume that all the measures $\mu_n$ are probability measures. In this case (ii) reads
$\mu_n$ converges vaguely to $\mu$ and $\mu$ is a probability measure,
while (iii) reads
$\mu$ is a probability measure and $\mu_n\big[f\big] \to \mu\big[f\big]$, $\forall f \in \mathcal{F}$.
Obviously (i) ⇒ (ii) and (ii) ⇒ (iii). It suffices to prove that (ii) ⇒ (i) and (iii) ⇒ (ii). We will need the following result.
Proof. Let $B \subset \mathbb{R}^k$ be a Borel set and $\varepsilon > 0$. According to Theorem 1.198, the measure $\mu$ is regular. Hence, there exists a closed set $C \subset B$ such that
$$\mu\big[B \setminus C\big] < \frac{\varepsilon}{2}.$$
On the other hand, we can find $R > 0$ sufficiently large such that
$$\mu\big[B_R(0)\big] > \mu\big[\mathbb{R}^k\big] - \frac{\varepsilon}{2}.$$
We set $K := B_R \cap C$. The set $K$ is clearly compact and
$$\mu\big[C \setminus K\big] \le \mu\big[\mathbb{R}^k \setminus B_R(0)\big] < \frac{\varepsilon}{2}.$$
Thus $\mu\big[B \setminus K\big] \le \mu\big[B \setminus C\big] + \mu\big[C \setminus K\big] < \varepsilon$. $\square$
(ii) ⇒ (i) We will show that the sequence $(\mu_n)$ satisfies the condition (ii) in the portmanteau theorem. Now let $U \subset \mathbb{R}^k$ be an open set and $\varepsilon > 0$. Lemma 2.32 shows that for any $\varepsilon > 0$ there exists a compact set $K \subset U$ such that
$$\mu\big[K\big] > \mu\big[U\big] - \varepsilon.$$
Now choose $r < \frac12\mathrm{dist}(K, U^c)$ and set
$$C_r := \big\{x \in \mathbb{R}^k;\ \mathrm{dist}(x, K) \ge r\big\}.$$
$$\le \lim_n \mu_n\big[f_\varepsilon\big] + \frac{\varepsilon}{2} = \mu\big[f_\varepsilon\big] + \frac{\varepsilon}{2} < \mu[\varphi] + \varepsilon.$$
The above inequalities hold for any $\varepsilon > 0$ so
$$\liminf_n \mu_n[\varphi] = \limsup_n \mu_n[\varphi] = \mu[\varphi]. \qquad\square$$
Proof. The implication (i) ⇒ (ii) is obvious. To prove that (ii) ⇒ (i) observe first that any compactly supported continuous function can be uniformly approximated by compactly supported smooth functions6, so the closure in $C_b(\mathbb{R}^k)$ of the set of bounded Lipschitz functions contains $C_0(\mathbb{R}^k)$. The measure $\mu$ is a probability measure since the constant function $I_{\mathbb{R}^k}$ is bounded and Lipschitz and thus
$$\mu\big[I_{\mathbb{R}^k}\big] = \lim_{n\to\infty}\mu_n\big[I_{\mathbb{R}^k}\big] = 1.$$
Corollary 2.35. Suppose that $(X_n)_{n\in\mathbb{N}}$ and $X$ are random variables with ranges contained in $\mathbb{Z}$. Then $X_n \Rightarrow X$ if and only if
$$\lim_{n\to\infty}\mathbb{P}\big[X_n = k\big] = \mathbb{P}\big[X = k\big], \quad \forall k \in \mathbb{Z}. \tag{2.2.5}$$
$$= \lim_{n\to\infty}\mathbb{P}\big[k - 1/2 < X_n \le k + 1/2\big] = \lim_{n\to\infty}\mathbb{P}\big[X_n = k\big].$$
The next result generalizes Fatou's Lemma. However, our proof relies on Fatou's Lemma.
Proposition 2.36. Suppose that the sequence of random variables $(X_n)_{n\in\mathbb{N}}$ converges in distribution to $X$. Then
$$\mathbb{E}\big[|X|\big] \le \liminf_{n\to\infty}\mathbb{E}\big[|X_n|\big].$$
Proof. The Mapping Theorem 2.25 implies that the sequence $(|X_n|)_{n\in\mathbb{N}}$ converges in distribution to $|X|$. Thus
$$\lim_{n\to\infty}\mathbb{P}\big[|X_n| > t\big] = \mathbb{P}\big[|X| > t\big]$$
for all $t$ outside a countable subset of $[0,\infty)$. Using (1.3.43) we deduce
$$\mathbb{E}\big[|X|\big] = \int_0^\infty \mathbb{P}\big[|X| > t\big]\,dt, \quad \mathbb{E}\big[|X_n|\big] = \int_0^\infty \mathbb{P}\big[|X_n| > t\big]\,dt, \quad \forall n.$$
Fatou's Lemma implies
$$\int_0^\infty \mathbb{P}\big[|X| > t\big]\,dt \le \liminf_{n\to\infty}\int_0^\infty \mathbb{P}\big[|X_n| > t\big]\,dt. \qquad\square$$
so
$$\partial_\xi e^{ix\xi} = ixe^{ix\xi} \in L^1\big(\mathbb{R}, \mathbb{P}_X\big).$$
Let $\Gamma_v \in \mathrm{Prob}(\mathbb{R})$ be the Gaussian measure with mean 0 and variance $v > 0$,
$$\Gamma_v[dx] = \gamma_v(x)dx, \quad \gamma_v(x) = \frac{1}{\sqrt{2\pi v}}e^{-\frac{x^2}{2v}}.$$
Proposition 2.38.
$$\widehat{\Gamma}_v(\xi) = e^{-\frac{v\xi^2}{2}} = \sqrt{\frac{2\pi}{v}}\,\gamma_{1/v}(\xi), \quad \forall v > 0. \tag{2.2.6}$$
Proof. We have
$$\widehat{\Gamma}_v(\xi) = \frac{1}{\sqrt{2\pi v}}\int_{\mathbb{R}} e^{-\frac{x^2}{2v}}e^{i\xi x}\,dx = \frac{1}{\sqrt{2\pi}}\int_{\mathbb{R}} e^{-\frac{y^2}{2}}e^{i\sqrt{v}\,\xi y}\,dy = \widehat{\Gamma}_1\big(\sqrt{v}\,\xi\big).$$
Thus it suffices to determine
$$f(\xi) = \widehat{\Gamma}_1(\xi) = \frac{1}{\sqrt{2\pi}}\int_{\mathbb{R}} e^{-\frac{x^2}{2}}e^{i\xi x}\,dx.$$
The imaginary part of the above integrand is an odd function (in $x$) so $f(\xi)$ is real, $\forall\xi$, i.e.,
$$f(\xi) = \frac{1}{\sqrt{2\pi}}\int_{\mathbb{R}} e^{-\frac{x^2}{2}}\cos(\xi x)\,dx.$$
The function
$$\frac{d}{d\xi}e^{-\frac{x^2}{2}}\cos(\xi x) = -xe^{-\frac{x^2}{2}}\sin(\xi x)$$
is integrable (in the $x$ variable). This shows that $f(\xi)$ is differentiable (see Exercise 1.6) and
$$f'(\xi) = -\frac{1}{\sqrt{2\pi}}\int_{\mathbb{R}} xe^{-\frac{x^2}{2}}\sin(\xi x)\,dx = \frac{1}{\sqrt{2\pi}}\int_{\mathbb{R}}\Big(\frac{d}{dx}e^{-\frac{x^2}{2}}\Big)\sin(\xi x)\,dx$$
(integrate by parts)
$$= -\frac{\xi}{\sqrt{2\pi}}\int_{\mathbb{R}} e^{-\frac{x^2}{2}}\cos(\xi x)\,dx = -\xi f(\xi).$$
Thus
$$f'(\xi) + \xi f(\xi) = 0,$$
so that
$$\frac{d}{d\xi}\Big(e^{\xi^2/2}f(\xi)\Big) = 0 \Longleftrightarrow f(\xi) = Ce^{-\frac{\xi^2}{2}}.$$
Since $f(0) = 1$ we deduce $C = 1$ and thus $\widehat{\Gamma}_1(\xi) = e^{-\frac{\xi^2}{2}}$. $\square$
Proof of Fact 1. The idea behind this fact is that the Fourier transform and the convolution interact in a nice way. More precisely we will show that
$$\rho_v(x) = \frac{1}{\sqrt{2\pi v}}\int_{\mathbb{R}} e^{ix\xi}\gamma_{1/v}(\xi)\widehat{\mu}(-\xi)\,d\xi. \tag{2.2.7}$$
Using (2.2.6) we deduce
$$\sqrt{2\pi v}\,\gamma_v(x) = e^{-\frac{x^2}{2v}} = \int_{\mathbb{R}} e^{ix\xi}\gamma_{1/v}(\xi)\,d\xi.$$
We deduce
$$\rho_v(x) = \frac{1}{\sqrt{2\pi v}}\int_{\mathbb{R}}\Big(\int_{\mathbb{R}} e^{i(x-y)\xi}\gamma_{1/v}(\xi)\,d\xi\Big)\mu[dy]$$
(use Fubini)
$$= \frac{1}{\sqrt{2\pi v}}\int_{\mathbb{R}} e^{ix\xi}\gamma_{1/v}(\xi)\Big(\int_{\mathbb{R}} e^{-iy\xi}\mu[dy]\Big)d\xi = \frac{1}{\sqrt{2\pi v}}\int_{\mathbb{R}} e^{ix\xi}\gamma_{1/v}(\xi)\widehat{\mu}(-\xi)\,d\xi.$$
$$= \int_{\mathbb{R}}\underbrace{\Big(\int_{\mathbb{R}}\gamma_v(x-y)f(x)\,dx\Big)}_{=:f_v(y)}\mu[dy].$$
where $T_yf(z) := f(z+y)$. Fix $y$, $\varepsilon > 0$ and a $\delta = \delta(\varepsilon) > 0$ such that
$$\sup_{|z|<\delta}\big|f(z+y) - f(y)\big| < \frac{\varepsilon}{2}.$$
Then
$$\big|f_v(y) - f(y)\big| = \big|\Gamma_v\big[T_yf\big] - f(y)\big| = \Big|\int_{\mathbb{R}}\big(f(z+y) - f(y)\big)\Gamma_v[dz]\Big|$$
$$\le \int_{|z|<\delta}\big|f(z+y) - f(y)\big|\,\Gamma_v[dz] + \int_{|z|\ge\delta}\big|f(z+y) - f(y)\big|\,\Gamma_v[dz]$$
$$\le \sup_{|z|<\delta}\big|f(z+y) - f(y)\big| + 2M\,\Gamma_v\big[|z| > \delta\big]$$
$$\stackrel{(1.3.17)}{\le} \sup_{|z|<\delta}\big|f(z+y) - f(y)\big| + \frac{2Mv}{\delta^2} < \frac{\varepsilon}{2} + \frac{2Mv}{\delta^2}.$$
Hence
$$\limsup_{v\searrow0}\big|f_v(y) - f(y)\big| \le \frac{\varepsilon}{2}, \quad \forall\varepsilon > 0,$$
so that
$$\lim_{v\searrow0}f_v(y) = f(y), \quad \forall y \in \mathbb{R}. \qquad\square$$
Remark 2.40. (a) The above theorem can be rephrased as stating that the collection of trigonometric functions
$$\mathbb{R} \ni x \mapsto \cos(\xi x),\ \sin(\xi x); \quad \xi \in \mathbb{R}$$
is not separating! More precisely, there exist two distinct probability measures $\mu_0, \mu_1$ such that
We refer to [111, Chap. IV, Sec. 15, p. 231] for more details.
(b) The range of the Fourier transform
$$\mathrm{Prob}(\mathbb{R}) \ni \mu \mapsto \widehat{\mu} \in C_b(\mathbb{R})$$
Additionally, the function $\widehat{\mu}$ is positive definite. This means that, for any $n \in \mathbb{N}$ and any $\xi_1, \ldots, \xi_n \in \mathbb{R}$, the hermitian matrix
$$\big(\widehat{\mu}(\xi_i - \xi_j)\big)_{1\le i,j\le n}$$
is nonnegative definite. It turns out that the above necessary conditions characterize the range of the Fourier transform. This is the content of the celebrated Bochner theorem. For a proof we refer to [59, p. 622], [68, §II.3], or [135, I.24]. $\square$
$$\lim_{n\to\infty}\widehat{\mu}_n(\xi) = \widehat{\mu}(\xi).$$
(ii) ⇒ (i) We carry the proof in two steps. For any $v > 0$ and any $f \in C_b(\mathbb{R})$ we define $f_v : \mathbb{R} \to \mathbb{R}$,
$$f_v(x) = \int_{\mathbb{R}} f(x-y)\,\Gamma_v[dy].$$
Observe that
$$f_v(x) = \int_{\mathbb{R}} f(x-y)\gamma_v(y)\,dy = \int_{\mathbb{R}} f(z)\gamma_v(x-z)\,dz.$$
$$\stackrel{(2.2.7)}{=} \frac{1}{\sqrt{2\pi v}}\int_{\mathbb{R}}\Big(\int_{\mathbb{R}} e^{ix\xi}\gamma_{1/v}(\xi)\widehat{\nu}(-\xi)\,d\xi\Big)f(x)\,dx$$
$$= \frac{1}{\sqrt{2\pi v}}\int_{\mathbb{R}}\underbrace{\Big(\int_{\mathbb{R}} e^{ix\xi}f(x)\,dx\Big)}_{=:\widehat{f}(\xi)}\gamma_{1/v}(\xi)\widehat{\nu}(-\xi)\,d\xi = \frac{1}{\sqrt{2\pi v}}\int_{\mathbb{R}}\widehat{f}(\xi)\gamma_{1/v}(\xi)\widehat{\nu}(-\xi)\,d\xi.$$
The function $\widehat{f}(\xi)$ is well defined since $f \in C_0(\mathbb{R})$. The Dominated Convergence theorem shows that $\widehat{f}$ is continuous. Moreover
$$\big|\widehat{f}(\xi)\big| \le \int_{\mathbb{R}}|f(x)|\,dx.$$
We deduce that, $\forall n \in \mathbb{N}$,
$$\mu_n\big[f_v\big] = \frac{1}{\sqrt{2\pi v}}\int_{\mathbb{R}}\widehat{f}(\xi)\gamma_{1/v}(\xi)\widehat{\mu}_n(-\xi)\,d\xi.$$
The Dominated Convergence theorem shows that
$$\lim_{n\to\infty}\frac{1}{\sqrt{2\pi v}}\int_{\mathbb{R}}\widehat{f}(\xi)\gamma_{1/v}(\xi)\widehat{\mu}_n(-\xi)\,d\xi = \frac{1}{\sqrt{2\pi v}}\int_{\mathbb{R}}\widehat{f}(\xi)\gamma_{1/v}(\xi)\widehat{\mu}(-\xi)\,d\xi = \mu\big[f_v\big].$$
For $x \in \mathbb{R}$,
$$\big|f_v(x) - f(x)\big| = \Big|\int_{\mathbb{R}} f(x-y)\Gamma_v[dy] - f(x)\Big| = \Big|\int_{\mathbb{R}}\big(f(x-y) - f(x)\big)\Gamma_v[dy]\Big| \le \int_{\mathbb{R}}\big|f(x-y) - f(x)\big|\,\Gamma_v[dy]$$
$$= \int_{|y|\le r}\big|f(x-y) - f(x)\big|\,\Gamma_v[dy] + \int_{|y|>r}\big|f(x-y) - f(x)\big|\,\Gamma_v[dy]$$
$$\le \omega(r)\int_{|y|\le r}\Gamma_v[dy] + 2M\int_{|y|>r}\Gamma_v[dy]$$
Remark 2.42. (a) One can show that if a sequence $\mu_n \in \mathrm{Prob}(\mathbb{R})$ converges weakly to a probability measure $\mu$, then $\widehat{\mu}_n(\xi)$ converges to $\widehat{\mu}(\xi)$ uniformly on compacts; see Exercise 2.37.
(b) In Theorem 2.41 we assumed that the limit of the sequence of characteristic functions $(\widehat{\mu}_n)_{n\in\mathbb{N}}$ is the characteristic function of a probability measure $\mu$. This assumption is not necessary.
One can show that if the characteristic functions $\widehat{\mu}_n(\xi)$ converge pointwise to a continuous function $f$, then $f$ is the characteristic function of a probability measure. In Exercise 2.36 we describe the main steps of the rather involved proof of this more general result. $\square$
Remark 2.43. P. Lévy, [105, §17, p. 47], introduced a metric $d_L$ on $\mathrm{Prob}(\mathbb{R})$. More precisely, given $\mu_0, \mu_1 \in \mathrm{Prob}(\mathbb{R})$ with cumulative distribution functions
$$F_i(x) = \mu_i\big[(-\infty, x]\big], \quad x \in \mathbb{R}, \quad i = 0, 1,$$
the Lévy metric is the length of the largest segment cut out by the graphs $\Gamma_0, \Gamma_1$ of $F_0, F_1$ along a line of the form $x + y = a$. The graphs are made continuous by adding vertical segments connecting $F_i(x-0)$ to $F_i(x)$ at the points of discontinuity. Intuitively, the distance is the diagonal of the largest square with sides parallel to the axes that can be squeezed between the curves $\Gamma_0$ and $\Gamma_1$.
More precisely,
$$d_L(\mu_0, \mu_1) = \sup_{a\in\mathbb{R}}\mathrm{dist}_{\mathbb{R}^2}\big(p_0(a), p_1(a)\big),$$
where $p_i(a)$ is the intersection of the graph $\Gamma_i$ with the line $x + y = a$. Note that if we write $p_i(a) = (x_i, y_i)$, then $y_i = F_i(x_i)$,7 and
$$d_L(\mu_0, \mu_1) = \sup\big\{\sqrt{2}\,|x_0 - x_1|;\ x_0 + F_0(x_0) = x_1 + F_1(x_1)\big\}.$$
Lévy refers to the convergence with respect to the metric $d_L$ as "convergence from the point of view of Bernoulli". He shows (see [105, §17]) that a sequence of probability measures $\mu_n$ converges in the metric $d_L$ to a probability measure $\mu$ if and only if the characteristic functions $\widehat{\mu}_n$ converge to the characteristic function $\widehat{\mu}$. Hence, the convergence in the metric $d_L$ is the weak convergence, so that $d_L$ metrizes the weak convergence. $\square$
Observe that $\bar{X}_n$ are i.i.d. with mean 0 and variance $v$, while $Z_n$ has mean 0 and variance 1. Denote by $\Phi(\xi)$ their common characteristic function, $\Phi(\xi) = \mathbb{E}\big[e^{i\xi\bar{X}_1}\big]$. We have
$$\Phi_{Z_n}(\xi) = \Phi_{\bar{S}_n/\sqrt{nv}}(\xi) = \Phi_{\bar{S}_n}\big(\xi/\sqrt{nv}\big) = \mathbb{E}\Big[\exp\Big(\frac{i\xi}{\sqrt{nv}}\sum_{k=1}^n \bar{X}_k\Big)\Big]$$
7 At a point of discontinuity this reads $y_i \in \big[F_i(x_i - 0), F_i(x_i)\big]$.
(the variables $\exp\big(\frac{i\xi}{\sqrt{nv}}\bar{X}_k\big)$, $1 \le k \le n$, are independent)
$$= \prod_{k=1}^n \mathbb{E}\Big[\exp\Big(\frac{i\xi}{\sqrt{nv}}\bar{X}_k\Big)\Big] = \Phi\big(\xi/\sqrt{nv}\big)^n.$$
Lemma 2.45. Suppose that $(c_n)_{n\ge1}$ is a convergent sequence of complex numbers and $c = \lim_{n\to\infty}c_n$. Then
$$\lim_{n\to\infty}\Big(1 + \frac{c_n}{n}\Big)^n = e^c.$$
Assuming Lemma 2.45 we deduce that, for any $\xi \in \mathbb{R}$, we have
$$\lim_{n\to\infty}\Phi_{Z_n}(\xi) = \lim_{n\to\infty}\Big(1 - \frac{\xi^2}{2n} + o(1/n)\Big)^n = e^{-\frac{\xi^2}{2}} = \Phi_{\Gamma_1}(\xi).$$
Proof of Lemma 2.45. Set $c = a + bi$, $c_n = a_n + b_ni$, so that $a_n \to a$, $b_n \to b$. We set
$$z_n = 1 + \frac{c_n}{n} = 1 + \frac{a_n}{n} + \frac{b_n}{n}i.$$
For large $n$, $z_n = r_ne^{i\theta_n}$, where
$$r_n = \big((1 + a_n/n)^2 + b_n^2/n^2\big)^{1/2} = \big(1 + 2a/n + o(1/n)\big)^{1/2},$$
$$|\theta_n| < \frac{\pi}{2}, \quad \tan\theta_n = \frac{1}{n}\cdot\frac{b_n}{1 + a_n/n}.$$
Thus
$$\theta_n = \arctan\Big(\frac{1}{n}\cdot\frac{b_n}{1 + a_n/n}\Big) = \frac{b}{n} + o(1/n) \quad \text{as } n \to \infty.$$
We deduce that as $n \to \infty$ we have
$$z_n^n = \big(1 + 2a/n + o(1/n)\big)^{n/2}\cdot e^{i(b + o(1))} \to e^a\cdot e^{ib} = e^c. \qquad\square$$
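The lemma is easy to test numerically. The sketch below is ours, with an arbitrary illustrative complex value $c$: it compares $(1 + c/n)^n$ with $e^c$ for increasing $n$.

```python
import cmath

def power_limit(c, n):
    """(1 + c/n)^n for complex c, to be compared with e^c."""
    return (1 + c / n) ** n

c = 0.3 + 1.2j   # an arbitrary illustrative complex number
errs = [abs(power_limit(c, n) - cmath.exp(c)) for n in (10, 100, 10000)]
# The error shrinks roughly like |c|^2 * |e^c| / (2n), confirming the limit.
```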
Remark 2.46. There is a more refined version of the Central Limit Theorem that does not require that the random variables be identically distributed, only independent. More precisely, we have the following result of Lindeberg.
Suppose that $(X_n)_{n\ge0}$ is a sequence of independent random variables with zero means and finite variances. We set $S_n := X_1 + \cdots + X_n$ and $s_n^2 := \mathrm{Var}\big[S_n\big]$. Suppose that, $\forall t > 0$,
$$\lim_{n\to\infty}\frac{1}{s_n^2}\sum_{k=1}^n \mathbb{E}\big[I_{\{|X_k| > ts_n\}}X_k^2\big] = 0. \tag{2.2.9}$$
Then $\frac{1}{s_n}S_n$ converges in distribution to a standard normal random variable.
Note that if the random variables $X_n$ are also identically distributed with common variance $\sigma^2$, then $s_n^2 = n\sigma^2$. Then
$$\frac{1}{s_n^2}\sum_{k=1}^n \mathbb{E}\big[I_{\{|X_k| > ts_n\}}X_k^2\big] = \frac{1}{\sigma^2}\mathbb{E}\big[I_{\{|X_1| > t\sigma\sqrt{n}\}}X_1^2\big] \to 0$$
as $n \to \infty$. Hence condition (2.2.9) is satisfied when the random variables are i.i.d. For a proof of Lindeberg's theorem we refer to [59, Sec. VIII.4].
For even more general versions of the CLT we refer to [72; 127]. $\square$
Suppose that $(X_n)_{n\in\mathbb{N}}$ is a sequence of i.i.d. random variables with mean 0. Let $S_n := X_1 + \cdots + X_n$.
The Strong Law of Large Numbers shows that $\frac1nS_n \to 0$ a.s. A concentration inequality offers quantitative information on the probability that $\frac1nS_n$ deviates from 0 by a given amount $\varepsilon$. More concretely, it gives an upper bound for the probability that $\frac1n|S_n| > \varepsilon$. If the random variables $X_n$ have finite second moments, $\sigma^2 = \mathrm{Var}\big[X_1\big]$, then we have seen that Chebyshev's inequality yields the estimate
$$\mathbb{P}\big[|S_n| > n\varepsilon\big] = \mathbb{P}\big[S_n^2 > n^2\varepsilon^2\big] < \frac{\mathrm{Var}\big[S_n\big]}{n^2\varepsilon^2} = \frac{\sigma^2}{n\varepsilon^2}.$$
In the proof of Theorem 2.7 we have shown that if the variables $X_n$ have a stronger integrability property, namely $\mathbb{E}\big[X_n^4\big] < \infty$, then there exists a constant $C > 0$ such that for any $n \in \mathbb{N}$ and any $\varepsilon > 0$ we have
$$\mathbb{P}\big[|S_n| > n\varepsilon\big] \le \frac{C}{n^2\varepsilon^4},$$
showing that $\frac1nS_n$ is even more concentrated around its mean. Loosely speaking, we expect higher concentration around the mean if the $X_n$ have lighter tails, i.e., the probabilities $\mathbb{P}\big[|X_n| > x\big]$ decay fast as $x \to \infty$.
In this section we want to describe some quantitative results stating that, under appropriate light-tail assumptions, for any $\varepsilon > 0$ the probability $\mathbb{P}\big[|S_n| > n\varepsilon\big]$ decays exponentially fast to 0 as $n \to \infty$. The subject of concentration inequalities has witnessed an explosive growth in the last three decades so we will only be able to scratch the surface. For more on this subject we refer to [17].
so $\Psi_X(\lambda) \ge 0$.
Here is the key idea of Chernoff's method. For $x > 0$ we have
$$\mathbb{P}\big[X > x\big] = \mathbb{P}\big[e^{\lambda X} > e^{\lambda x}\big] \le \frac{1}{e^{\lambda x}}\mathbb{E}\big[e^{\lambda X}\big], \quad \forall\lambda \in J_+,$$
where at the last step we used Markov's inequality. Hence
$$\mathbb{P}\big[X > x\big] \le e^{-(x\lambda - \Psi_X(\lambda))}, \quad \forall\lambda \in J_+.$$
Set
$$I_+(x) := \sup_{\lambda\in J_+}\big(x\lambda - \Psi_X(\lambda)\big),$$
and
$$\mathbb{P}\big[X < x + \mu\big] \le e^{-I_-(x)}, \quad I_-(x) := \sup_{\lambda\in J_-}\big((x+\mu)\lambda - \Psi_X(\lambda)\big), \quad \forall x < 0. \tag{2.3.4}$$
Suppose that (Xn )n∈N is a sequence of i.i.d. random variables such that
M(λ) = MXk (λ) < ∞,
for any λ in an open interval J containing 0. Set
µ := E Xk , Sn := X1 + · · · + Xn .
Then
\[ E[S_n] = n\mu, \qquad M_{S_n}(\lambda) = M(\lambda)^n, \qquad \Psi_{S_n}(\lambda) = n\Psi(\lambda). \]
We deduce that
\[ \sup_{\lambda \in J_+} \big( (nx + n\mu)\lambda - \Psi_{S_n}(\lambda) \big) = n I_+(x), \quad \forall x > 0, \]
and
\[ \sup_{\lambda \in J_-} \big( (nx + n\mu)\lambda - \Psi_{S_n}(\lambda) \big) = n I_-(x), \quad \forall x < 0. \]
We deduce
\[ P\Big[ \frac{1}{n} S_n - \mu > x \Big] = P\big[ S_n - n\mu > nx \big] \le e^{-n I_+(x)}, \quad \forall x > 0, \]  (2.3.5a)
\[ P\Big[ \frac{1}{n} S_n - \mu < x \Big] = P\big[ S_n - n\mu < nx \big] \le e^{-n I_-(x)}, \quad \forall x < 0. \]  (2.3.5b)
We have reached a remarkable conclusion. The assumption M(λ) < ∞ for λ in an
open neighborhood of the origin implies that the probability that the empirical mean
1
n Sn deviates from the theoretical mean µ by a fixed amount x decays exponentially
to 0 as n → ∞. In other words, n1 Sn is highly concentrated around its mean and
the above inequalities quantify this fact.
To gain some more insight on the above estimates it is useful to list a few
properties of the function I+ (x).
Taking the logarithm of both sides of the above inequality we obtain the convexity of Ψ_X(λ). Next observe that
\[ \Psi_X'(0) = \frac{M_X'(0)}{M_X(0)} = 0. \]
Since Ψ_X(λ) is convex, its graph sits above the tangent at λ = 0, so Ψ_X(λ) ≥ 0, ∀λ ∈ J.
(iii) For t_1, t_2 ∈ (0, 1) such that t_1 + t_2 = 1 and for x_1, x_2 > 0 we have
\[ I_+(t_1 x_1 + t_2 x_2) = \sup_{\lambda \in (0,r)} \big( (t_1 x_1 + t_2 x_2)\lambda - \Psi_X(\lambda) \big) \]
\[ = \sup_{\lambda \in (0,r)} \Big( t_1\big( x_1\lambda - \Psi_X(\lambda) \big) + t_2\big( x_2\lambda - \Psi_X(\lambda) \big) \Big) \le t_1 I_+(x_1) + t_2 I_+(x_2). \]
\[ \lambda x - \Psi_X(\lambda) \le 0, \quad \forall \lambda \le 0, \]
proving that
\[ I(x) = \sup_{\lambda \in J} \big( \lambda x - \Psi_X(\lambda) \big) = \sup_{\lambda \in J_+} \big( \lambda x - \Psi_X(\lambda) \big). \]
where h−, −i denotes the canonical inner product in Rn . One can show that f ∗ is
also convex and lower semicontinuous and f = (f ∗ )∗ . The conjugate f ∗ is sometimes
called the Fenchel-Legendre conjugate of f . Observe that I(x) is the conjugate of
the convex function Ψ_X(λ). □
Example 2.49. Suppose that X ∼ Bin(p). Then E[X] = p, M_X(λ) = q + pe^λ. For x ∈ R we have
\[ f_x(\lambda) := (x+p)\lambda - \Psi_X(\lambda) = (x+p)\lambda - \log\big( q + p e^\lambda \big), \]
\[ f_x'(\lambda) = x + p - \frac{p e^\lambda}{q + p e^\lambda}, \]
and f_x'(λ) = 0 if
\[ p(x + p - 1) e^\lambda = -q(x+p), \quad \text{i.e.,} \quad p e^\lambda = q\, \frac{x+p}{q-x}. \]
This forces x ∈ (−p, q). In this case
\[ \lambda = \log q - \log p + \log(x+p) - \log(q-x) = \log\frac{x+p}{p} - \log\frac{q-x}{q}, \]
\[ I(x) = (x+p)\log\frac{x+p}{p} - (x+p)\log\frac{q-x}{q} + \log\frac{q-x}{q} = (x+p)\log\frac{x+p}{p} + (q-x)\log\frac{q-x}{q}. \] □
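As a numerical sanity check of Example 2.49, one can compare the closed-form rate I(x) with a brute-force evaluation of the supremum that defines it. The sketch below is illustrative only (the grid-search parameters are ad hoc assumptions, not from the book):

```python
import math

def rate(x, p):
    """Closed-form rate I(x) of Example 2.49 for X ~ Bin(p); q = 1 - p, x in (-p, q)."""
    q = 1.0 - p
    return (x + p) * math.log((x + p) / p) + (q - x) * math.log((q - x) / q)

def rate_by_sup(x, p, grid=20000, lam_max=10.0):
    """Brute-force sup over lambda > 0 of (x + p)*lambda - log(q + p*exp(lambda))."""
    q = 1.0 - p
    best = 0.0
    for i in range(1, grid + 1):
        lam = lam_max * i / grid
        best = max(best, (x + p) * lam - math.log(q + p * math.exp(lam)))
    return best
```

For instance, with p = 0.3 and x = 0.2 the two computations agree to grid accuracy, and I(0) = 0, as it must be at the mean.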
Remark 2.50. Suppose that P, Q are two Borel probability measures on R that
are mutually absolutely continuous,
\[ P \ll Q \quad \text{and} \quad Q \ll P. \]
We denote by ρ_{P|Q} := dP/dQ the density of P with respect to Q. We define the Kullback-Leibler divergence
\[ D_{KL}\big( P \,\|\, Q \big) := \int_{\mathbb{R}} \log\frac{dP}{dQ}\; P[dx]. \]  (2.3.7)
(a) Suppose that P is the probability distribution Bin(p),
P = qδ0 + pδ1 .
For x ∈ (−p, q) consider the probability distribution
Qx = (q − x)δ0 + (p + x)δ1 .
Then
\[ D_{KL}\big( Q_x \,\|\, P \big) = (x+p)\log\frac{x+p}{p} + (q-x)\log\frac{q-x}{q}. \]
This is the rate I(x) we found in Example 2.49.
(b) Let X be a random variable with probability distribution Q and set
Z := ρP|Q (X). Then
\[ E[Z] = \int_{\mathbb{R}} \frac{dP}{dQ}\, dQ = \int_{\mathbb{R}} dP = 1, \]
\[ E[Z \log Z] = \int_{\mathbb{R}} \frac{dP}{dQ} \log\frac{dP}{dQ}\, dQ = \int_{\mathbb{R}} \log\frac{dP}{dQ}\, dP = D_{KL}\big( P \,\|\, Q \big). \]
Thus
\[ E[Z \log Z] - E[Z] \log E[Z] = D_{KL}\big( P \,\|\, Q \big), \]
showing that Kullback-Leibler divergence is a special case of ϕ-entropy (1.3.12).
More precisely, the above equality shows that
\[ D_{KL}\big( P \,\|\, Q \big) = H_\varphi(Z), \qquad \varphi(z) = z\log z,\ z > 0. \]
In particular this yields Gibbs' inequality
\[ D_{KL}\big( P \,\|\, Q \big) \ge 0. \]  (2.3.8)
Above, we could have used instead of the natural logarithm a logarithm in any base > 1 and reach the same conclusion. In particular, if we work with log₂ and set
\[ D_2\big( P \,\|\, Q \big) := \int_{\mathbb{R}} \log_2\frac{dP}{dQ}\; P[dx], \]
then Gibbs' inequality continues to hold in this case as well:
\[ D_2\big( P \,\|\, Q \big) \ge 0. \]  (2.3.9)
Let X be a finite subset of R. Assume that we are given a function p : X → (0, 1] such that
\[ \sum_{x \in X} p(x) = 1, \]
so p defines the probability measure
\[ P_p = \sum_{x \in X} p(x)\, \delta_x \in \operatorname{Prob}(\mathbb{R}). \] □
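For discrete measures the Kullback-Leibler divergence of Remark 2.50 is a finite sum, and both Gibbs' inequality (2.3.8) and the identity D_KL(Q_x ∥ P) = I(x) of part (a) can be verified directly. A minimal sketch (the function names are mine, not the book's):

```python
import math

def kl(ps, qs):
    """D_KL(P || Q) = sum_x p(x) log(p(x)/q(x)) for distributions on a common finite set."""
    return sum(p * math.log(p / q) for p, q in zip(ps, qs) if p > 0)

def bernoulli_rate(x, p):
    """D_KL(Q_x || P) for P = Bin(p), Q_x = (q - x, p + x); equals I(x) of Example 2.49."""
    q = 1.0 - p
    return kl([q - x, p + x], [q, p])
```

By Gibbs' inequality the returned value is always nonnegative, and it vanishes exactly when the two distributions coincide.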
\[ P\big[ |X - \mu| > x \big] \le 2 e^{-\frac{x^2}{2\sigma^2}}, \quad \forall x > 0. \]  (2.3.11b)
Observe that if X_1, X_2 are independent random variables and X_k ∈ G(σ_k²), k = 1, 2, then
\[ a_1 X_1 + a_2 X_2 \in G\big( a_1^2 \sigma_1^2 + a_2^2 \sigma_2^2 \big), \quad \forall a_1, a_2 \in \mathbb{R}. \]
where the last inequality is obtained by inspecting the Taylor series of the two terms and using the inequality 2^n n! ≤ (2n)!. Hence R ∈ G(1). Similarly, cR ∈ G(1), ∀c ∈ [0, 1]. □
For these estimates to be useful we need to have some simple ways of recognizing
subgaussian random variables.
Proposition 2.54. Suppose that X is a centered random variable, i.e., E[X] = 0. If there exists C > 0 such that
\[ E\big[ X^{2k} \big] \le k!\, C^k, \quad \forall k \in \mathbb{N}, \]
then X ∈ G(4C).
We deduce
\[ E\big[ e^{\lambda X} \big] \le E\big[ e^{\lambda X} \big] \cdot E\big[ e^{-\lambda X'} \big] = E\big[ e^{\lambda(X - X')} \big] = \sum_{k=0}^{\infty} \frac{\lambda^{2k}}{(2k)!} E\big[ (X - X')^{2k} \big]. \]
Since the function x^{2k} is convex we have
\[ (x+y)^{2k} \le 2^{2k-1}\big( x^{2k} + y^{2k} \big), \quad \forall x, y \in \mathbb{R}, \]
so
\[ E\big[ (X - X')^{2k} \big] \le 2^{2k} E\big[ X^{2k} \big] \le 2^{2k} k!\, C^k = \frac{(2k)!}{(2k-1)!!} (2C)^k \le \frac{(2k)!}{k!} (2C)^k. \]
Hence
\[ E\big[ e^{\lambda X} \big] \le \sum_{k=0}^{\infty} \frac{(2C\lambda^2)^k}{k!} = e^{2C\lambda^2}. \]
Hence X ∈ G(4C). □
so that R ∈ G(4). We see that this estimate is not as good as the one in Example 2.53. □
Proof. Let us first observe that any random variable Y such that Y ∈ [a, b] a.s. satisfies
\[ \operatorname{Var}(Y) \le \frac{(b-a)^2}{4}. \]
Indeed, if µ = E[Y], then Y − µ ∈ [a − µ, b − µ]. If
\[ m = \frac{(a-\mu) + (b-\mu)}{2} \]
is the midpoint of [a − µ, b − µ], then
\[ \big| (Y - \mu) - m \big| \le \frac{b-a}{2} \]
and
\[ \operatorname{Var}(Y) \le E\big[ (Y-\mu)^2 \big] + m^2 = E\Big[ \big( (Y-\mu) - m \big)^2 \Big] \le \frac{(b-a)^2}{4}. \]
Observe next that we can assume that X is centered. Indeed, if µ = E[X], then the centered variable X − µ satisfies X − µ ∈ [a − µ, b − µ] and (b − a) = (b − µ) − (a − µ).
Denote by P the probability distribution of X. For any λ ∈ R we denote by P_λ the probability measure on R given by
\[ P_\lambda[dx] = \frac{e^{\lambda x}}{E\big[ e^{\lambda X} \big]}\, P[dx]. \]  (2.3.14)
Note that P_λ is also supported on [a, b]. Since E[X] = 0 we have Ψ_X'(0) = 0. A direct computation shows that
\[ \Psi_X''(\lambda) = \int_{\mathbb{R}} x^2\, P_\lambda[dx] - \Big( \int_{\mathbb{R}} x\, P_\lambda[dx] \Big)^2. \]
The last term is the variance of a random variable Z with probability distribution
Pλ . Since Pλ is supported in [a, b] we have Z ∈ [a, b] and we deduce
\[ \Psi_X''(\lambda) = \operatorname{Var}(Z) \le \frac{(b-a)^2}{4}. \]  (2.3.15)
Using the Taylor approximation with Lagrange remainder we deduce that for some ξ ∈ [0, λ] we have
\[ \Psi_X(\lambda) = \underbrace{\Psi_X(0) + \lambda \Psi_X'(0)}_{=0} + \frac{\lambda^2}{2} \Psi_X''(\xi) \le \frac{\lambda^2 (b-a)^2}{8}. \]
Hence X ∈ G((b − a)²/4). □
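The conclusion of this proof, Ψ_X(λ) ≤ λ²(b−a)²/8 for a centered variable supported on [a, b], is easy to probe numerically for discrete distributions. The sketch below uses an arbitrary three-point test distribution of my choosing:

```python
import math

def log_mgf(values, probs, lam):
    """Psi_X(lambda) = log E[exp(lambda X)] for a discrete random variable."""
    return math.log(sum(p * math.exp(lam * v) for v, p in zip(values, probs)))

# A centered variable supported on [a, b] = [-1, 2] (an illustrative test case).
values = [-1.0, 0.0, 2.0]
probs = [0.5, 0.25, 0.25]          # mean = -0.5 + 0 + 0.5 = 0
a, b = -1.0, 2.0
for lam in (-2.0, -0.5, 0.1, 1.0, 3.0):
    # Hoeffding's lemma: Psi_X(lambda) <= lambda^2 (b - a)^2 / 8.
    assert log_mgf(values, probs, lam) <= lam ** 2 * (b - a) ** 2 / 8
```

The inequality holds for every λ, positive or negative, exactly as the Taylor argument above predicts.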
is achieved when
\[ \frac{d}{d\lambda}\big( \lambda y - \Psi_{Y-1}(\lambda) \big) = y + 1 - \frac{1}{1 - 2\lambda} = 0. \]
Solving this equation for λ we get
\[ 1 - 2\lambda = \frac{1}{y+1} \iff \lambda = \frac{y}{2(y+1)} \]
and
\[ I(y) = \frac{y^2}{2(y+1)} + \frac{y}{2(y+1)} - \frac{1}{2}\log(1+y) = \frac{y}{2} - \frac{1}{2}\log(y+1) \ge \frac{y^2}{4}. \]
Hence
\[ P\big[ Y - 1 < -y \big] \vee P\big[ Y - 1 > y \big] \le e^{-\frac{y^2}{4}}, \quad \forall y > 0. \]
We deduce that
\[ P\Big[ \frac{1}{n} Z_n - 1 > y \Big] < e^{-\frac{n y^2}{4}}, \quad \forall y > 0, \]
\[ P\Big[ \frac{1}{n} Z_n - 1 < y \Big] < e^{-\frac{n y^2}{4}}, \quad \forall y \in (-1, 0). \]
Thus, for large n the random vector (1/√n)\vec{X} is highly concentrated around the unit sphere in R^n. This is sometimes referred to as the Poincaré phenomenon. In Exercise 2.52 we describe another proof of this result. □
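The Poincaré phenomenon is striking in simulation: the normalized norm ‖X‖/√n of a standard Gaussian vector barely fluctuates once n is large. A minimal sketch (sample sizes are illustrative):

```python
import math
import random

def norm_over_sqrt_n(n, rng):
    """||X|| / sqrt(n) for a standard Gaussian vector X in R^n."""
    return math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n)) / n)

if __name__ == "__main__":
    rng = random.Random(0)
    for n in (10, 100, 10000):
        samples = [norm_over_sqrt_n(n, rng) for _ in range(200)]
        print(n, min(samples), max(samples))   # the samples cluster around 1 as n grows
```

For n = 10000 all sampled values sit within a few percent of 1, matching the e^{−ny²/4} tail estimate above.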
\[ C = \{ x_1, \dots, x_N \} \subset \mathbb{R}^n. \] □
Remark 2.59. Let us highlight some remarkable features of the above result. Note
first that the dimension d(N, ε, p0 ) is independent of the dimension of the ambient
space Rn where the cloud C resides. Moreover, d(N, ε, p0 ) is substantially smaller
than the size N of the cloud.
For example, if we choose the confidence level p₀ = 10⁻³, the distortion factor ε = 10⁻¹ and the size of the cloud N = 10¹², then
\[ \frac{4}{\varepsilon^2} \log\frac{N}{p_0} = 60 \cdot 10^2 \log 10 < 14 \cdot 10^3 \ll 10^{12}. \]
The cloud C could be chosen in a Hilbert space and we can choose as ambient space the subspace span(C) that has dimension n ≤ N. In this case the vectors Y_k := (1/√N)\vec{X}_k, k = 1, . . . , d, have with high confidence norm 1:
\[ P\Big[ \max_{1 \le k \le d} \big| \|Y_k\| - 1 \big| > \delta \Big] \le 2d\, e^{-\frac{N\delta^2}{4}}. \]
They are also, with high confidence, mutually orthogonal. Indeed, Exercise 2.49 shows that for |r| < 1/2
\[ P\Big[ \max_{i < j} \big| \langle Y_i, Y_j \rangle \big| > r \Big] \le 2\binom{d}{2} e^{-\frac{N r^2}{12}}. \]
This shows that the operator (1/√N)A is with high confidence very close to the orthogonal projection P_{\vec{X}_1, \dots, \vec{X}_d} onto the random d-dimensional⁹ subspace span{\vec{X}_1, \dots, \vec{X}_d}. Consequently, with high confidence, the operator
\[ \sqrt{\frac{N}{d}}\; P_{\vec{X}_1, \dots, \vec{X}_d} \]
distorts very little the distances between the points in C. The projected cloud has identical size and similar geometry, but lives in a subspace of much smaller dimension. □
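This dimension-reduction phenomenon is easy to reproduce numerically. The sketch below is a simplified illustration, not the construction of the remark: instead of the projection built from the random vectors X⃗_k it uses an independent d × n Gaussian matrix scaled by 1/√d, which exhibits the same Johnson–Lindenstrauss-type behavior; all dimensions are illustrative choices:

```python
import math
import random

def random_projection(points, d, rng):
    """Map points of R^n to R^d via x -> (1/sqrt(d)) G x, G a d x n standard Gaussian matrix."""
    n = len(points[0])
    G = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(d)]
    s = 1.0 / math.sqrt(d)
    return [[s * sum(row[j] * x[j] for j in range(n)) for row in G] for x in points]

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

if __name__ == "__main__":
    rng = random.Random(7)
    n, N, d = 500, 12, 200
    cloud = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(N)]
    proj = random_projection(cloud, d, rng)
    ratios = [dist(proj[i], proj[j]) / dist(cloud[i], cloud[j])
              for i in range(N) for j in range(i + 1, N)]
    print(min(ratios), max(ratios))   # all pairwise distance ratios close to 1
```

All pairwise distances survive the drop from 500 to 200 dimensions up to a small relative distortion.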
need not be negligible. In other words, the set of ω's such that the functions F_n(−, ω) do not converge pointwise to the function F(−) need not be negligible. We will show that this is not the case.
⁹It is not hard to see that dim span{\vec{X}_1, \dots, \vec{X}_d} = d a.s.
Lemma 2.60. The function DnF is measurable and DnF ≤ DnU , with equality if F is
continuous.
Proof. Let us first show that D_n is indeed measurable. We will show that
\[ D_n = \sup_{x \in \mathbb{Q}} \big| F_n(x) - F(x) \big|. \]  (2.4.2)
Now observe that the functions x 7→ Fn (x, ω), F (x) are right-continuous so there
exists a sequence of rational numbers (qn )n∈N such that qn > xn and
\[ \Big| \big| F_n(x_n, \omega) - F(x_n) \big| - \big| F_n(q_n, \omega) - F(q_n) \big| \Big| < \frac{1}{n}. \]
Hence
\[ \lim_{n\to\infty} \big| F_n(q_n, \omega) - F(q_n) \big| = \lim_{n\to\infty} \big| F_n(x_n, \omega) - F(x_n) \big|. \]
\[ \stackrel{(1.2.6)}{=} \frac{1}{n} \sum_{k=1}^{n} I_{\{Q(Y_k) \le x\}} - F(x) = F_n(x) - F(x). \]
Thus
\[ D_n^F = \sup_{x \in \mathbb{R}} \big| F_n(x) - F(x) \big| = \sup_{x \in \mathbb{R}} \big| U_n(F(x)) - U(F(x)) \big| \]
Proof. Lemma 2.60 shows that it suffices to prove the theorem only in the special case when the random variables are uniformly distributed. Thus we assume F = U.
Fix a partition P of [0, 1], P = {0 = x₀ < x₁ < x₂ < · · · < x_m = 1}. Set
\[ \|\mathcal{P}\| := \max_{1 \le k \le m} (x_k - x_{k-1}). \]
Remark 2.62. Suppose that (Xn )n∈N is a sequence of i.i.d. random variables with
common cdf F (x). Form the empirical (cumulative) distribution function
\[ F_n(x) = \frac{1}{n} \sum_{k=1}^{n} I_{(-\infty, x]}(X_k). \]
For an “elementary” proof of this fact we refer to [57]. For a more sophisticated
proof that reveals the significance of the strange series above we refer to [12] or [51].
□
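The statistic D_n of the Glivenko-Cantelli theorem is straightforward to compute and simulate. A minimal sketch for the uniform case F = U (the closed-form for the supremum over a sorted sample is a standard observation, stated in the docstring):

```python
import random

def kolmogorov_stat(sample):
    """D_n = sup_x |F_n(x) - x| against the uniform cdf F(x) = x on [0, 1].

    For a sorted sample u_(1) <= ... <= u_(n) the supremum is attained next to a
    sample point, so it equals max over k of max((k+1)/n - u_(k+1), u_(k+1) - k/n).
    """
    xs = sorted(sample)
    n = len(xs)
    return max(max((k + 1) / n - x, x - k / n) for k, x in enumerate(xs))

if __name__ == "__main__":
    rng = random.Random(3)
    for n in (100, 10000):
        print(n, kolmogorov_stat([rng.random() for _ in range(n)]))
```

The printed values shrink roughly like 1/√n, consistent with D_n → 0 a.s.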
2.4.2 VC-theory
We want to present a generalization of the Glivenko-Cantelli theorem based on
ideas pioneered by V. N. Vapnik and A. Ja. Cervonenkis [151] that turned out to
be very useful in machine learning. Our presentation follows [129, Chap. II]. For
more recent developments we refer to [51; 70; 150; 156].
Fix a Borel probability measure µ on X := RN . Any sequence of i.i.d. random
vectors
Xn : (Ω, S, P) → X = RN
\[ P_n[B] = \frac{1}{n} \sum_{k=1}^{n} I_B(X_k). \]
To prove that Dn → 0 a.s. we will employ a different strategy than before. More
precisely we intend
to show that, under certain assumptions on the family (Bt )t∈T ,
the probability P Dn > ε decays very fast as n → ∞, for any ε > 0. This will
guarantee that the series
\[ \sum_{n \in \mathbb{N}} P\big[ D_n > \varepsilon \big] \]
is convergent for any ε > 0 and thus, according to Corollary 1.141, the sequence
Dn converges a.s. to 0. To obtain these tail estimates we will rely on some clever
symmetrization tricks.
To state the first symmetrization result choose another sequence Xn0 : Ω → X ,
n ∈ N, of i.i.d. random variables, independent of (Xn )n∈N , but with the same
distribution. Set
n
1X 0
Yk0 (t) := I Bt Xk0 , Zn0 (t) := Yk (t) − µ Yk0 (t) , ∀n ∈ N, t ∈ T,
n
k=1
Equivalently,
\[ D_{n,n} = \sup_{t \in T} \frac{1}{n} \Big| \sum_{k=1}^{n} \big( Y_{n+k}(t) - Y_k(t) \big) \Big|. \]
\[ \ge 1 - \frac{1}{n\varepsilon^2}. \]
Proof. The key observation is that, because Ξ_k(t) = Y_k'(t) − Y_k(t) is symmetric, it has the same distribution as R_k Ξ_k(t). Set
\[ S_n(t) := \frac{1}{n} \sum_{k=1}^{n} R_k Y_k(t), \qquad S_n'(t) := \frac{1}{n} \sum_{k=1}^{n} R_k Y_k'(t), \qquad \overline{S}_n(t) := \sum_{k=1}^{n} \overline{Y}_k(t). \]
¹⁰Here we are making a tacit assumption that there exists such a sequence of random variables R_n defined on Ω. For example, if we choose Ω to be the probability space (X, µ^{⊗N}) ⊗ (X, µ^{⊗N}) ⊗ {−1, 1}^{⊗N} all the above choices are possible. The choice of Ω is irrelevant because the Glivenko-Cantelli theorem is a result about (X, µ^{⊗N}).
" n
#
1 X 0 ε 1 0 ε
P sup Yk (t) − Yk (t) > = P sup | Sn (t) − Sn (t) | >
t∈R n 2 t∈R n 2
k=1
1 ε 1 0 ε 1 ε
≤ P sup | Sn (t) | > + P sup | Sn (t) | > = 2P sup Sn (t) > ,
t∈R n 4 t∈R n 4 t∈R n 4
where we used the fact that Rk Yk0 (t) and Rk Yk (t) have the same distributions. t
u
where ~x := (x1 , . . . , xn ) ∈ X n
and
yk (t, ~x) = I Bt (xk ) ∈ {0, 1}, ∀k = 1, . . . , n, t ∈ T.
Hence
\[ P\big[ D_n > \varepsilon \big] \le 4 \int_{X^n} P\Big[ \sup_{t \in T} S_t(\vec{x}) > \frac{\varepsilon}{4} \Big]\; \mu^{\otimes n}[dx_1 \cdots dx_n]. \]  (2.4.9)
For each n ∈ N, t ∈ T and \vec{x} ∈ X^n we set I_n := {1, . . . , n},
\[ C_t(\vec{x}) := \big\{ k \in I_n \,;\; y_k(t, \vec{x}) = 1 \big\} = \big\{ k \in I_n \,;\; x_k \in B_t \big\}. \]
Roughly speaking, C_t(\vec{x}) = B_t ∩ {x_1, . . . , x_n}. Set
\[ \mathcal{C}_n(\vec{x}) := \big\{ C \subset I_n \,;\; \exists t \in T,\ C = C_t(\vec{x}) \big\}. \]
We can now finally understand the role of the Rademacher symmetrization. The
sums
\[ \sum_{k=1}^{n} R_k\, y_k(t, \vec{x}) \]
are of the type appearing in Hoeffding's inequality (2.3.12), where R_k y_k(t, \vec{x}) ∈ G(1) by the computation in Example 2.53. We deduce
\[ P\big[ S_C > \varepsilon/4 \big] \le 2 e^{-n\varepsilon^2/32}, \quad \forall C \subset I_n. \]
Hence
\[ P\Big[ \sup_{t \in T} S_t(\vec{x}) > \varepsilon/4 \Big] \le 2 \big| \mathcal{C}_n(\vec{x}) \big|\, e^{-n\varepsilon^2/32}. \]  (2.4.10)
Using this in (2.4.9) we deduce
\[ P\big[ D_n > \varepsilon \big] \le 8 e^{-n\varepsilon^2/32} \int_{X^n} \big| \mathcal{C}_n(\vec{x}) \big|\; \mu^{\otimes n}[dx_1 \cdots dx_n]. \]  (2.4.11)
n
We have a rough bound |Cn (~x)| ≤ 2 but it is not helpful. At this point we add our
last and crucial assumption.
VC. The family F = (Bt )t∈T satisfies V C-condition.11 This means that there exists
d ∈ N such that
sup |Cn (~x)| = O(nd ) as n → ∞.
x∈X n
~
With this assumption in place we deduce that there exists K > 0 such that
\[ 2 \big| \mathcal{C}_n(\vec{x}) \big| \le K\big( n^d + 1 \big), \quad \forall n \in \mathbb{N},\ \forall \vec{x} \in X^n, \]
so that
\[ P\big[ D_n > \varepsilon \big] \le 8K e^{-n\varepsilon^2/32}\big( n^d + 1 \big). \]  (2.4.12)
In the above estimate the constant K is independent of the distribution µ. Since
\[ \sum_{n \in \mathbb{N}} e^{-n\varepsilon^2/32}\big( n^d + 1 \big) < \infty, \quad \forall \varepsilon > 0, \]
we deduce that D_n → 0 a.s. We have thus proved the following wide-ranging generalization of the Glivenko-Cantelli theorem.
" n #
1 X
0
= E Yk (t) − Yk (t) k Yk , 1 ≤ k ≤ n
n
k=1
" n
#
X
Yk0 (t)
≤E Yk (t) − k Yk , 1 ≤ k ≤ n ≤ E Dn,n k Yk , 1 ≤ k ≤ n .
k=1
Hence
n
1 X
Dn (t) = sup Yk (t) − E Yk (t) ≤ E Dn,n k Yk , 1 ≤ k ≤ n .
n
k=1
By taking the expectations of both sides of the above inequality we obtain (2.4.13).
A similar argument as in the proof of the Rademacher symmetrization lemma yields
\[ E\big[ D_{n,n} \big] \le 2\, \underbrace{E\Big[ \sup_{t \in T} \frac{1}{n} \Big| \sum_{k=1}^{n} R_k Y_k(t) \Big| \Big]}_{=: R_n(T)}. \]
The sequence Rn (T ) is called the Rademacher complexity of the family (Bt )t∈T .
Azuma’s inequality (3.1.14), a refined concentration inequality, shows that Dn
is highly concentrated around its mean. The VC condition can be used to show
that the Rademacher complexity goes to 0 as n → ∞. Thus the mean of Dn goes
to 0 as n → ∞. Combining these facts one can obtain an inequality very similar to
(2.4.11). For details we refer to [156, Sec. 4.2] or Exercise 3.20.
(b) One can obtain bounds for the tails of Dn by a Chernoff-like technique, by
obtaining bounds for E Φ(Dn ) , where Φ : [0, ∞) → R is a convex increasing
function; see Exercise 2.53. We refer to [130] or [150] for details. □
The key assumption is VC and we want to discuss it in some detail and describe
several nontrivial examples of families of sets satisfying this condition.
Fix an ambient space X and F ⊂ 2X a family of subsets of X . The shadow of
F on a subset A is the family
\[ \mathcal{F}_A := \big\{ F \cap A \,;\; F \in \mathcal{F} \big\} \subset 2^A. \]
Thus, any subset A such that |A| ≤ dimV C (F) is shattered by F. In other words, if
k = dim_VC(F), then for any n ≤ k we have
\[ s_{\mathcal{F}}(n) = 2^n = \sum_{j=0}^{\min(n,k)} \binom{n}{j}. \]
We have the following remarkable dichotomy. For a proof we refer to [51, Thm. 4.1.2] or [70, Thm. 3.6.3].
(i) Suppose that F consists of all the lower half-lines (−∞, t] ⊂ R, t ∈ R. Note
that if A = {a1 , a2 }, a1 < a2 , then any half-line that contains a2 must also
contain a1 so that dimV C (F) ≤ 1.
(ii) Suppose that F consists of all the open-half spaces of the vector space Rn . A
classical theorem of Radon [114, Thm. 1.3.1] shows that any subset A ⊂ Rn
of cardinality n + 2 contains a subset A0 that cannot be separated from its
complement A \ A0 by a hyperplane. Thus dimV C (F) ≤ n + 1. With a bit
more work one can show that in fact we have equality.
(iii) The above example is a special case of the following general result, [51,
Thm. 4.2.1].
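Example (i) above can be probed with a few lines of code: enumerating the traces a family of sets cuts out on a finite sample makes the linear growth of |C_n| for half-lines concrete. The sketch below uses illustrative sample points and thresholds of my choosing:

```python
def traces(points, predicates):
    """The distinct traces C_t = {k : x_k in B_t} cut out on a finite sample
    by a family of sets, each set given by its indicator predicate."""
    return {frozenset(i for i, x in enumerate(points) if pred(x)) for pred in predicates}

# Half-lines (-infinity, t]: on n distinct reals they cut out at most n + 1 traces,
# so |C_n| grows linearly and the VC condition holds here with d = 1.
points = [0.3, 1.7, 2.2, 5.0, 9.1]
halflines = [lambda x, t=t: x <= t for t in (-1.0, 0.5, 2.0, 3.0, 6.0, 10.0)]
```

With interleaved thresholds the family realizes exactly n + 1 = 6 traces on these 5 points, never all 2⁵ subsets: no two-point set {a₁, a₂} with a₁ < a₂ can be shattered, since any half-line containing a₂ also contains a₁.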
Then
\[ \mathcal{F}_0 \sqcap \mathcal{F}_1 := \big\{ F_0 \cap F_1 \,;\; F_k \in \mathcal{F}_k,\ k = 0, 1 \big\}, \]
\[ \mathcal{F}_0 \otimes \mathcal{F}_1 := \big\{ F_0 \times F_1 \,;\; F_k \in \mathcal{F}_k,\ k = 0, 1 \big\}, \]
\[ = E\big[ I_{B_t} + I_G - 2 I_{B_t} I_G \big] = L_\mu(t). \]
Thus, even if we do not know µ we can estimate L_µ(t) using the random functionals
\[ L_n(t) = \frac{1}{n} \sum_{k=1}^{n} \big( I_{B_t}(X_k) + I_G(X_k) - 2 I_{B_t \cap G}(X_k) \big) = \frac{1}{n} \sum_{k=1}^{n} \big( I_{B_t}(X_k) + Y_k - 2 Y_k I_{B_t}(X_k) \big). \]
If (B_t)_{t∈T} is a VC-family, then so is the family (B_t ∩ G)_{t∈T} and (2.4.11) shows that there exist constants K, c > 0, independent of the mysterious µ, such that
\[ P\Big[ \sup_{t \in T} \big| L_n(t) - L_\mu(t) \big| > \varepsilon \Big] \le K e^{-c n \varepsilon^2}, \quad \forall n. \]
Thus, for large n, L_n(t_n) is, with high confidence, within ε of the absolute minimum L_µ(t₀) = 0. Hopefully, this signifies that t_n is close to t₀. In the language of machine learning we say that the hypothesis class (B_t)_{t∈T} is PAC learnable, where PAC stands for Probably Approximately Correct. For more details we refer to [139; 152].
Remark 2.69. The results in this section only scratch the surface of the vast
subject concerned with the limits of empirical processes. We have limited our
presentation to 0-1-functions. The theory is more general than that.
Suppose that (U, U) is a measurable space and
Xn : Ω, S, µ → (U, U)
The Brownian motion bears the name of its discoverer, the botanist R. Brown who
observed in 1827 the chaotic motion of a particle of pollen in a fluid. Its study took
off at the beginning of the 20th century and has since witnessed dramatic growth.
It has popped up in many branches of science and has led to the development of many
new branches of mathematics. In the theory of stochastic processes it plays a role
similar to the role of Gaussian random variables in classical probability. It is such a
fundamental and rich object that I believe any student learning the basic principles
of probability needs to have a minimal introduction to it.
I drew my inspiration from many sources and I want to mention a few that I used more extensively: [12; 53; 103; 106; 136; 145]. My approach is not the most “efficient” one since I wanted to use the discussion of the Brownian motion as an opportunity to introduce the reader to several other important concepts concerning stochastic processes.
2.5.1 Heuristics
To get a grasp on the Brownian motion on a line, we consider first a discretization.
We assume that the pollen particle performs a random walk along the line starting
at the origin. Every unit of time τ it moves to the right or to the left, with
equal probabilities, a distance δ. We denote by Snδ,τ its location after n steps, or
equivalently, its location at time nτ , assuming we start the clock when the motion
begins.
When δ = τ = 1 we obtain the standard random walk on Z,
\[ S_n^{1,1} = S_n := \sum_{k=1}^{n} X_k, \]
We assume that during the (n + 1)-th jump the particle travels with constant speed 1 so we can assume that its location at time t ∈ [n, n + 1) is
\[ W^1(t) = S_n + (t - n) X_{n+1} = S_{\lfloor t \rfloor} + \big( t - \lfloor t \rfloor \big) X_{\lfloor t \rfloor + 1}. \]
If we sample the random variables (X_n), then W¹(t) is a piecewise linear function with linear pieces of slopes ±1. Its graph is a zig-zag of the type depicted in Figure 2.3.
Suppose now that the pollen particle performs these random jumps at a much faster rate, say ν jumps per second, and the size (in absolute value) of the jump is δ meters. We choose δ to depend on the frequency ν and we intend to let ν → ∞. Assuming that during a jump its speed is constant we deduce that this speed is δν meters per second and its location at time t will be
\[ W^{\nu,\delta}(t) = \delta S_{\lfloor \nu t \rfloor} + \underbrace{\delta \big( \nu t - \lfloor \nu t \rfloor \big) X_{\lfloor \nu t \rfloor + 1}}_{=: R^{\nu,\delta}(t)}. \]
To understand this formula observe that in the time interval [0, t] the particle performed ⌊νt⌋ complete jumps of size δ. It completed the last one at time ⌊νt⌋/ν. From this moment to t it travels in the direction X_{⌊νt⌋+1} with speed δν for a duration of time t − ⌊νt⌋/ν.
Assuming that in finite time the particle will stay within a bounded region it is reasonable to assume that
\[ \forall t, \quad \sup_{\nu} E\big[ W^{\nu,\delta}(t)^2 \big] < \infty. \]  (2.5.1)
Now observe that δS_{⌊νt⌋} and R^{ν,δ} are mean zero independent random variables so that
\[ E\big[ W^{\nu,\delta}(t)^2 \big] = \delta^2 E\big[ S_{\lfloor \nu t \rfloor}^2 \big] + E\big[ R^{\nu,\delta}(t)^2 \big] = \delta^2 \lfloor \nu t \rfloor + E\big[ R^{\nu,\delta}(t)^2 \big]. \]
Clearly E[R^{ν,δ}(t)²] ∈ [0, δ²], so for (2.5.1) to hold we need
\[ \sup_{\nu} \delta^2 \nu < \infty. \]
For each ν, the collection (W ν (t))t≥0 is a real valued random process parametrized
by [0, ∞). Think of it as a random real valued function defined on [0, ∞). It turns
out that the random processes (W ν (t))t≥0 have a sort of limit as ν → ∞. The next
result states this in a more precise form.
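The rescaled walk is easy to simulate. The sketch below makes the natural choice δ = ν^{−1/2} suggested by the constraint that δ²ν stays bounded (the frequency and sample sizes are illustrative), and checks by Monte Carlo that the variance at time 1 stays equal to 1 for every ν:

```python
import random

def rescaled_walk(nu, rng, t_max=1.0):
    """Values of W^nu at the jump times k/nu: delta * S_k with delta = nu ** -0.5."""
    delta = nu ** -0.5
    s, path = 0, [0.0]
    for _ in range(int(nu * t_max)):
        s += rng.choice((-1, 1))
        path.append(delta * s)
    return path

if __name__ == "__main__":
    rng = random.Random(1)
    # Var W^nu(1) = delta^2 * nu = 1 for every nu; a Monte Carlo check:
    ends = [rescaled_walk(400, rng)[-1] for _ in range(2000)]
    print(sum(e * e for e in ends) / len(ends))   # approximately 1
```

Plotting a sampled path for large ν produces exactly the steep zig-zag curves of Figure 2.4.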
exists in distribution and it is a Gaussian random variable with mean zero and
variance t. Moreover, if
0 ≤ s0 < t0 ≤ s1 < t1 ≤ · · · ≤ sk < tk , k ≥ 1,
then the increments
W (t0 ) − W (s0 ), W (t1 ) − W (s1 ), . . . , W (tk ) − W (sk )
are independent.
Proof. Fix 0 ≤ s < t. For ν sufficiently large we have ⌊νs⌋ < ⌊νt⌋ and
\[ W^\nu(t) - W^\nu(s) = \underbrace{\nu^{-1/2} \big( S_{\lfloor \nu t \rfloor} - S_{\lfloor \nu s \rfloor} \big)}_{Y_\nu} + \underbrace{\big( R^\nu(t) - R^\nu(s) \big)}_{Z_\nu}. \]
are independent and the above argument shows that they converge in law to the Gaussian random variables
\[ W(t_j) - W(s_j), \quad j = 0, 1, \dots, k. \]
Corollary 2.29 implies that these increments are also independent. □
(i) W (0) = 0.
(ii) For any 0 ≤ s < t the increment W (t) − W (s) is a Gaussian random variable
with mean zero and variance t − s.
(iii) For any
\[ 0 \le s_0 < t_0 \le s_1 < t_1 \le \cdots \le s_k < t_k, \quad k \ge 1, \]
the increments
\[ W(t_0) - W(s_0),\ W(t_1) - W(s_1),\ \dots,\ W(t_k) - W(s_k) \]
are independent.
A pre-Brownian motion on [0, 1] is a collection of real valued random variables (W(t))_{t∈[0,1]} satisfying (i)–(iii) above with the s's and t's in [0, 1]. □
We have thus proved that a suitable rescaling of the standard random walk on Z converges to a pre-Brownian motion. In Figure 2.4 we have depicted the graph of a sample of W^ν(t) for ν = 100. Its graph is also a piecewise linear curve, but its linear pieces are much steeper, of slopes ±ν^{1/2}.
Suppose that (W(t))_{t≥0} is a pre-Brownian motion on [0, ∞). As explained in Subsection 1.5.1, this process defines a probability measure on R^{[0,∞)} equipped with the product sigma-algebra, called the distribution of the process. We want to
show that any two pre-Brownian motions have the same distributions. This requires
a small digression in the world of Gaussian measures and processes. In the next
subsection we survey some basic facts concerning these concepts. In Exercise 2.54
we ask the reader to fill in some of the details of this digression.
\[ C(\xi, \eta) = \frac{1}{4} \big( v_\mu(\xi + \eta) - v_\mu(\xi - \eta) \big) = E_\mu\big[ \big( \xi - m_\mu(\xi) \big)\big( \eta - m_\mu(\eta) \big) \big]. \]
Then (see Exercises 2.54(ii) + (iii)) the mean mµ is a linear functional mµ : V ∗ → R
and the covariance Cµ is a symmetric and positive semidefinite bilinear form on V ∗ .
The proof of the above result is based on the Fourier transform and its main
steps are described in Exercise 2.54. In the sequel we will refer to the mean zero
Gaussian measures as centered.
(d) Suppose (−, −) is an inner product on the vector space V with associated norm
k − k. We can then identify V ∗ with V and the symmetric bilinear forms on V ∗
with symmetric operators. The centered Gaussian measure on V whose covariance
form is given by the inner product is
\[ \Gamma_1[dx] = \frac{1}{(2\pi)^{\dim V / 2}}\, e^{-\frac{1}{2}\|x\|^2}\, dx. \]
C :V∗×V∗ →R
is Gaussian since its components are independent Gaussian random variables; see
Example 2.73(a). Observing that
\[ \big( W(t_1), \dots, W(t_n) \big) = \big( X_1,\ X_1 + X_2,\ \dots,\ X_1 + \cdots + X_n \big) \]
we deduce from Example 2.73(c) that the vector (W(t_1), . . . , W(t_n)) is also Gaussian as a linear image of a Gaussian vector. Thus, any pre-Brownian motion is a Gaussian process. It is centered since all the random variables W(t) have mean zero. Its distribution is a probability measure on the path space R^{[0,∞)} uniquely determined by the covariance kernel
determined by the covariance kernel
K : [0, ∞) × [0, ∞) → R, K(s, t) = E W (s)W (t) .
We claim that
Hence
\[ E\big[ W(s) W(t) \big] = s = \min(s, t). \]
We see that all pre-Brownian motions have the same covariance form and thus they
all have the same distribution.
Conversely, suppose that X(t) t≥0 is a centered Gaussian process whose co-
variance form is given by (2.5.3). Then this process is a pre-Brownian motion.
Indeed,
\[ E\big[ X(0)^2 \big] = K(0, 0) = 0, \qquad E\big[ X(t)^2 \big] = K(t, t) = t. \]
We see that these distributions are independent of the choice of pre-Brownian mo-
tion B. This shows that if
B i : (Ωi , Fi , Pi ) → R, i = 1, 2,
are two pre-Brownian motions, then for any measurable set S ⊂ R^N and any injection τ : N → [0, ∞) we have
\[ P_1\big[ \big( B^1_{\tau(n)} \big)_{n \in \mathbb{N}} \in S \big] = P_2\big[ \big( B^2_{\tau(n)} \big)_{n \in \mathbb{N}} \in S \big]. \] □
then the series defining F (t) converges in L2 Ω, S, P , for any t ∈ T . To see this,
This proves that the sequence Fn (t) n∈N is Cauchy in L2 Ω, S, P . The family
F = F (t) t∈T is a centered Gaussian random process. It is convenient to think
of F as a random function. Its value F (t) at t is not a deterministic quantity, it is
random.
The covariance kernel is
\[ K(s, t) = K_F(s, t) = E\big[ F(s) F(t) \big] = \sum_{n \in \mathbb{N}} f_n(s) f_n(t). \]
for any ω ∈ Ω \ Nt the series F (t, ω) in (2.5.4) converges. We will denote by F (t, ω)
its sum. Set
\[ N := \bigcup_{t \in T} N_t. \]
In particular, this also shows that the image W (H) of X is a closed subspace of
L2 (Ω, S, P) consisting of centered Gaussian random variables. Such a subspace is
called a Gaussian Hilbert space. Obviously there is a natural bijection between
Gaussian white noises and Gaussian Hilbert spaces.
Here is how one can construct Gaussian white noises. Fix a separable Hilbert space H with inner product (−, −). Next, fix a Hilbert basis (e_n)_{n∈N}. Every element in H can then be decomposed along this basis:
\[ h = \sum_{n \in \mathbb{N}} a_n(h) e_n, \qquad a_n(h) := (h, e_n). \]
Choose a sequence of independent standard normal random variables (X_n)_{n∈N} defined on a probability space (Ω, S, P). For h ∈ H we set
\[ W(h) = \sum_{n \in \mathbb{N}} a_n(h) X_n. \]
From Parseval's identity we deduce that
\[ \sum_{n \in \mathbb{N}} a_n(h)^2 = \|h\|_H^2, \]
proving that the series defining W(h) converges in L². The collection (W(h))_{h∈H} is a Gaussian process and its covariance is
\[ K(h_0, h_1) = E\big[ W(h_0) W(h_1) \big] = \sum_{n \in \mathbb{N}} a_n(h_0) a_n(h_1) = (h_0, h_1). \]
In particular, this proves that the correspondence h ↦ W(h) is an isometry, and thus we have produced a Gaussian white noise.
As a special example, suppose that H = L²([0, ∞), λ). Fix a Hilbert basis (f_n)_{n∈N} and construct the Gaussian noise as above,
\[ L^2\big( [0, \infty), \lambda \big) \ni f \mapsto W(f) = \sum_{n \in \mathbb{N}} a_n(f) X_n, \qquad a_n(f) = \int_0^\infty f(t) f_n(t)\, dt. \]
For each t ∈ [0, ∞) we set
\[ B(t) := W\big( I_{[0,t]} \big) = \sum_{n \in \mathbb{N}} \Big( \int_0^t f_n(s)\, ds \Big) X_n. \]  (2.5.6)
Note that
\[ E\big[ B(s) B(t) \big] = \int_0^\infty I_{[0,s]}(x) I_{[0,t]}(x)\, dx = \min(s, t). \]
This shows that B(t) is a pre-Brownian motion.
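The series (2.5.6) can be evaluated numerically once a concrete Hilbert basis is chosen. The sketch below uses the cosine basis of L²[0, 1] — one convenient choice among many, not singled out by the text — and checks the covariance E[B(s)B(t)] = min(s, t) by Monte Carlo on a truncated sum:

```python
import math
import random

def brownian_partial_sum(t, xs):
    """Truncation of the series (2.5.6) for the cosine basis of L^2[0, 1]:
    f_0 = 1, f_n(s) = sqrt(2) cos(n pi s), so int_0^t f_n = sqrt(2) sin(n pi t)/(n pi)."""
    total = t * xs[0]
    for n in range(1, len(xs)):
        total += math.sqrt(2.0) * math.sin(n * math.pi * t) / (n * math.pi) * xs[n]
    return total

if __name__ == "__main__":
    rng = random.Random(5)
    s, t, trials, terms = 0.3, 0.7, 3000, 150
    acc = 0.0
    for _ in range(trials):
        xs = [rng.gauss(0.0, 1.0) for _ in range(terms)]
        acc += brownian_partial_sum(s, xs) * brownian_partial_sum(t, xs)
    print(acc / trials)   # approximately min(0.3, 0.7) = 0.3
```

The empirical covariance settles near min(s, t) = 0.3, reflecting the classical Fourier identity min(s, t) = st + Σ 2 sin(nπs)sin(nπt)/(nπ)².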
Observe that if s 6= t and |u| < |t − s|/2, then the random variables
\[ \frac{1}{u}\big( B(s+u) - B(s) \big) \quad \text{and} \quad \frac{1}{u}\big( B(t+u) - B(t) \big) \]
are independent.
Now we need to make a leap of faith and pretend we can differentiate with respect to t. (We really cannot.) Letting u → 0 we deduce that B'(t) and B'(s) are independent Gaussian random variables. Differentiating with the same abandon the equality (2.5.6) we deduce
\[ B'(t) = \sum_{n \in \mathbb{N}} f_n(t) X_n. \]  (2.5.7)
Thus, the elusive B'(t) is a random “function” of the kind described in Example 2.77 with one big difference: in this case the condition (2.5.5) is not satisfied. Observe that the “value” of B' at a point t is independent of its value at a point s. Thus, the value of B' at a point carries no information about its value at a different point, so B'(t) is a completely chaotic random “function”; it is what is commonly referred to as white noise.
As we will see in the next subsection, the function B(t) cannot be differentiated at any point. Moreover, the series (2.5.7) does not converge in a classical sense. However, it can be shown to converge in the sense of distributions. For an excellent discussion of this aspect we refer to [68, Sec. III.4].
For any function f ∈ L²([0, ∞)) we define its Wiener integral
\[ \int_0^t f(s)\, dB(s) := W\big( I_{[0,t]} f \big). \]  (2.5.8)
In Exercise 2.60 we give an alternate definition of this object that justifies this choice of notation. In particular we deduce that
\[ B(t) = \int_0^t dB(s). \]
Even though B'(t) does not exist in any meaningful way, the above intuition is nevertheless very important since it is what led to the very important concepts of Ito integral and stochastic differential equations. □
Definition 2.80. Let (Ω, S, P) be a probability space, T a set, and (X, F) a measurable space. Consider stochastic processes
\[ X, Y : T \times \Omega \to X, \quad (t, \omega) \mapsto X_t(\omega), Y_t(\omega). \] □
• ∀t ∈ T , Xt = Yt a.s. and,
• for any ω ∈ Ω \ N_α, and any s, t ∈ T we have
\[ \big| Y_s(\omega) - Y_t(\omega) \big| \le C(\omega) |s - t|^\alpha. \]
Proof. We follow the presentation in [103, Thm. 29]. Without loss of generality we can assume that T = [0, 1]. We denote by D the set of dyadic numbers in [0, 1]:
\[ D = \bigcup_{n \ge 0} D_n, \qquad D_n = \Big\{ \frac{k}{2^n} \,;\; 0 \le k \le 2^n \Big\}. \]
From the assumption (2.5.9) and Markov's inequality we deduce that for any s, t ∈ T and any a > 0 we have
\[ P\big[ |X_s - X_t| > a \big] \le \frac{1}{a^q} E\big[ |X_s - X_t|^q \big] \le \frac{K}{a^q} |t - s|^{1+r}. \]
Applying this inequality to s = (j − 1)/2^n, t = j/2^n and a = 2^{−nα} we deduce
Hence
\[ P\Big[ \underbrace{\bigcup_{k=1}^{2^n} \big\{ |X_{(k-1)/2^n} - X_{k/2^n}| > 2^{-n\alpha} \big\}}_{H_n} \Big] \le K \rho^n. \]
Note that since α < r/q we have ρ ∈ (0, 1). From the Borel-Cantelli Lemma we
deduce that
P Hn i.o. = 0.
Thus, there exists a negligible set N with the following property: for any ω ∈ Ω \ N
there exists n0 (ω) ∈ N so that for any n ≥ n0 (ω), and any k = 1, . . . , 2n we have
Lemma 2.82. Let f : D → R be a function. Suppose that there exist α ∈ (0, 1),
n0 ∈ N and K > 0 such that, ∀n ≥ n0 and any k = 1, . . . , 2n
This shows that for every ω ∈ Ω \ N the map D 3 t 7→ Xt (ω) is Hölder continuous
with exponent α. This completes the proof of (2.5.10).
Step 2. We can now produce the claimed modification. For every ω ∈ Ω \ N the map
\[ D \ni t \mapsto X_t(\omega) \]
admits a unique α-Hölder extension T ∋ t ↦ X̄_t(ω) ∈ R. For t₀ ∈ T we have
\[ \lim_{\substack{t \to t_0 \\ t \in D}} \overline{X}_t = \overline{X}_{t_0}. \]
Since
\[ \lim_{\substack{t \to t_0 \\ t \in D}} E\big[ |X_t - X_{t_0}|^q \big] = 0 \]
we deduce that X_{t₀} = X̄_{t₀} a.s. Hence the process (X̄_t)_{t∈T} is a modification of (X_t)_{t∈T} whose paths are a.s. α-Hölder continuous. □
Proof of Lemma 2.82. Let 0 ≤ m < n₀, s = (i − 1)/2^m and t = i/2^m. For j = 0, 1, . . . , 2^{n₀−m} we set s_j := s + j/2^{n₀}. Then
\[ |f(t) - f(s)| \le \sum_{j=1}^{2^{n_0 - m}} K\, 2^{-n_0 \alpha} = K 2^{n_0 - m} 2^{-n_0 \alpha} = \underbrace{K 2^{(n_0 - m) + \alpha(m - n_0)}}_{=: K_1} 2^{-m\alpha}. \]
\[ \le K_1 2^{-p\alpha} \sum_{i=0}^{\infty} 2^{-i\alpha} = \underbrace{K_1 \frac{2^\alpha}{2^\alpha - 1}}_{=: K_2} 2^{-p\alpha} \le 2^\alpha K_2 |s - t|^\alpha, \]
where at the last step we used the fact that 2^{−p} < 2|s − t|. Similarly
\[ |f(u) - f(t)| \le 2^\alpha K_2 |s - t|^\alpha. \]
Hence
\[ |f(s) - f(t)| \le |f(s) - f(u)| + |f(u) - f(t)| \le 2^{1+\alpha} K_2 |t - s|^\alpha. \]
This proves the lemma with C(n₀, α, K) = 2^{1+α} K₂. □
Remark 2.83. (a) Using Exercise 2.59 one can modify the modification in Theorem 2.81 to be α-Hölder continuous for any α ∈ (0, r/q), not just for a fixed α in this range.
(b) The argument in the proof of Lemma 2.82 is an elementary incarnation of
the chaining technique. For a wide-ranging generalization of the continuity Theorem 2.81 and the chaining technique we refer to [101, Chap. 11]. □
Corollary 2.84. Suppose that (Wt )t≥0 is a pre-Brownian motion. Then for any
α ∈ (0, 1/2) the process (Wt ) admits a modification whose paths are a.s. α-Hölder
continuous. In particular, Brownian motions exist.
Proof. Set δ := 1/2 − α. Note that W_t − W_s is Gaussian with mean 0 and variance |t − s|. Then D := |t − s|^{−1/2}(W_t − W_s) ∼ N(0, 1) so that, ∀q ≥ 1, we have
Remark 2.85. I want to say a few words about Paul Lévy’s elegant construction
of the Brownian motion, [106, Sec. 1].
He produces the Brownian motion on [0, 1] as a limit of random piecewise linear
functions Ln with nodes on the dyadic sets
    Dn := { k/2^n ;  0 ≤ k ≤ 2^n },  n ≥ 0.
They are successively better approximations of the Brownian motion. The 0-th
order approximation is the random linear function L0 (t) such that L0 (0) = 0 and
L0 (1) is a standard normal random variable.
The n-th order approximation Ln satisfies the following conditions.
• It is linear on each of the intervals [ (k − 1)/2^n, k/2^n ], and Ln(0) = 0.
• The increments
    Ln(k/2^n) − Ln((k − 1)/2^n),  k = 1, . . . , 2^n,
  are normal random variables with mean zero and variance 1/2^n.
• Ln(t) = Ln−1(t), ∀t ∈ Dn−1.
Set D := ∪_{n≥0} Dn and consider a family (Xt)t∈D of independent standard normal random variables.
Then
L0 (t) := tX1 .
The approximation Ln+1 is obtained from Ln as follows. If t0 < t1 are two con-
secutive points in Dn and t∗ ∈ Dn+1 is the midpoint of [t0 , t1 ], then Ln+1 (t∗ ) is
obtained by mimicking (2.5.12), i.e.,
    Ln+1(t∗) = (1/2)( Ln(t0) + Ln(t1) ) + (√(t1 − t0)/2) Xt∗
             = (1/2)( Ln(t0) + Ln(t1) ) + (1/2^{1+n/2}) Xt∗.
To prove that the sequence Ln (t) converges uniformly a.s. it suffices to show that
the series of random variables
    Σ_{n≥0} Un,  Un := sup_{t∈[0,1]} |Ln+1(t) − Ln(t)|,
converges a.s.
Denote by Mn the set of midpoints of the 2n intervals determined by Dn ,
Mn = Dn+1 \ Dn . From the construction of Ln we deduce that
    Un = (1/2^{1+n/2}) max_{τ∈Mn} |Xτ|.
We deduce that for any c > 0 we have
    P[Un > c] ≤ 2^n P[Y > 2^{1+n/2} c],  Y ∼ N(0, 1).
The Mills ratio inequalities (1.3.40) coupled with the Borel-Cantelli lemma lead to the claimed convergence. □
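The refinement procedure above is short enough to implement directly. The following sketch is my own illustration, not code from the book: it keeps the values of Ln at the nodes of Dn and inserts the independent midpoint corrections of standard deviation 1/2^{1+n/2}.

```python
import numpy as np

def levy_brownian(levels, rng):
    """Levy's construction: successive piecewise-linear approximations
    of Brownian motion on [0, 1] with nodes on the dyadic sets D_n."""
    # 0-th approximation: L_0(t) = t * X_1, recorded at the nodes {0, 1}.
    vals = np.array([0.0, rng.standard_normal()])
    for n in range(levels):
        mid = 0.5 * (vals[:-1] + vals[1:])               # linear interpolation at midpoints
        mid = mid + 2.0 ** (-1 - n / 2) * rng.standard_normal(mid.size)
        new = np.empty(2 * vals.size - 1)
        new[0::2] = vals                                  # old nodes are untouched
        new[1::2] = mid                                   # corrected midpoints interleaved
        vals = new
    return vals  # values at k/2^levels, k = 0, ..., 2^levels

rng = np.random.default_rng(0)
path = levy_brownian(10, rng)   # approximation L_10 at the nodes of D_10
```

By construction Ln+1 agrees with Ln on Dn, so refining a seeded path never moves the already-computed nodes.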
Let us observe that if (B(t)) is a standard Brownian motion, then B(0) = 0 a.s.
For this reason, the standard Brownian motion is also referred to as the Brownian
motion started at 0. For x ∈ R we set B x (t) = x + B(t). We will refer to B x (t) as
the Brownian motion started at x.
The topology induced by this metric is the topology of uniform convergence on the
compact subsets of [0, ∞). One can prove (see Exercise 2.61) that the Borel algebra
of this metric space coincides with the sigma algebra generated by the functions
Evt : C → R, Evt (f ) = f (t).
More generally, for any finite subset I ⊂ [0, ∞) we have a measurable evaluation map
EvI : C → RI , f 7→ f |I .
Proposition 1.29 shows that if µ0 , µ1 are two probability measures on C such that
(EvI )# µ0 = (EvI )# µ1
for any finite subset I⊂ [0, ∞), then µ0 = µ1 .
Note that if Xt t≥0 is a stochastic process defined on a probability space
(Ω, S, P) whose paths are continuous, then it defines a map
    X : Ω → C,  ω ↦ X(ω),  X(ω)(t) = Xt(ω).
The map X is measurable since its compositions with the evaluation maps EvI are measurable. Thus the stochastic process defines a probability measure
    PX := X# P ∈ Prob(C, BC).
The next result suggests that the paths of a Brownian motion are very rough,
i.e., they have poor differentiability properties.
Proof. The Gaussian random variables X^n_k = B_{t^n_k} − B_{t^n_{k−1}}, 1 ≤ k ≤ pn, are independent, have mean zero and moments
    E[ (X^n_k)² ] = t^n_k − t^n_{k−1},  E[ (X^n_k)⁴ ] = 3( t^n_k − t^n_{k−1} )².
From the first equality we deduce E[Qn(c)] = c. Moreover
    Σ_{k=1}^{pn} (X^n_k)² − c = Σ_{k=1}^{pn} ( (X^n_k)² − (t^n_k − t^n_{k−1}) ) =: Σ_{k=1}^{pn} Y^n_k.
The random variables Y^n_k are independent and have mean zero so
    ‖ Σ_{k=1}^{pn} (X^n_k)² − c ‖²_{L²} = Σ_{k=1}^{pn} ‖Y^n_k‖²_{L²}.
Hence
    ‖ Σ_{k=1}^{pn} (X^n_k)² − c ‖²_{L²} = 2 Σ_{k=1}^{pn} ( t^n_k − t^n_{k−1} )² ≤ 2 μn Σ_{k=1}^{pn} ( t^n_k − t^n_{k−1} ) = 2 μn c → 0 as n → ∞. □
On a subsequence nj we have Qnj(c) → c > 0 a.s. On the other hand, if for some ω ∈ Ω the function t ↦ Bt(ω) were Hölder with exponent α > 1/2 on [0, c], then for some constant C = Cω > 0 independent of n we would have
    0 ≤ Qn(c)(ω) ≤ C²ω Σ_k ( t^n_k − t^n_{k−1} )^{2α} ≤ C²ω μn^{2α−1} c → 0.
This proves that Bt is a.s. not α-Hölder on [0, c] for any α > 1/2.
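The L² convergence Qn(c) → c is easy to watch numerically. The sketch below is my own illustration (names are made up): it simulates the Brownian increments over the uniform partition of [0, c] with mesh μn = c/pn and returns the sum of their squares.

```python
import numpy as np

def quadratic_variation(c, num_steps, rng):
    """Sum of squared increments of a simulated Brownian path over a
    uniform partition 0 = t_0 < t_1 < ... < t_p = c with mesh c/num_steps."""
    dt = c / num_steps
    increments = rng.standard_normal(num_steps) * np.sqrt(dt)  # B_{t_k} - B_{t_{k-1}} ~ N(0, dt)
    return float(np.sum(increments ** 2))

rng = np.random.default_rng(1)
for p in (10, 12, 14):
    q = quadratic_variation(1.0, 2 ** p, rng)
    # q concentrates around c = 1; its variance is 2 c^2 / 2^p
```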
On the other hand, we know that the paths of the Brownian motion are Hölder
continuous for any exponent < 1/2. A 1933 result of Paley, Wiener, Zygmund [124]
shows that they have very poor differentiability properties. First some historical
context.
One question raised in the 19th century was whether there exist continuous
functions on an interval that are nowhere differentiable. Apparently Gauss believed
that there are no such functions. K. Weierstrass explicitly produced in 1872 such
examples defined by lacunary (or sparse) Fourier series. In 1931 S. Banach [7]
and S. Mazurkewicz [115] independently showed that the complement of the set of
nowhere differentiable functions in the metric space of continuous functions on a
compact interval is very small, meagre in the Baire category sense.
The 1933 result of Paley, Wiener, Zygmund that we want to discuss is similar in nature. They prove that the complement of the set of continuous nowhere differentiable functions f ∈ C is negligible with respect to the Wiener measure.
Proof. We follow the very elegant argument of Dvoretzky, Erdös, Kakutani [54].
We will show that for any interval I = [a, b) ⊂ [0, ∞) the paths of (Bt ) are a.s.
nowhere differentiable on I. Assume the Brownian motion is defined on a probability
space (Ω, S, P). This probability space could be the space C equipped with the
Wiener measure. For ease of presentation we assume that I = [0, 1). Consider the
set
    S := { ω ∈ Ω;  the path Bt(ω) is nowhere differentiable on [0, 1) }.
The set S may not be measurable13 but we will show that its complement is con-
tained in a measurable subset of Ω of measure zero.
Let us observe that if ω ∈ Ω \ S, i.e., the path t 7→ Bt (ω) is differentiable at a
point t0 ∈ [0, 1], then there exist M, N ∈ N such that for any n ≥ N there exists
k ∈ {1, . . . , n − 2} with the property that
    | B(k−1+i)/n(ω) − B(k+i)/n(ω) | ≤ M/n,  ∀i = 0, 1, 2.
To see this set f (t) = Bt (ω), m = |f 0 (t0 )|, M = bmc + 2. Then there exists ε > 0
so that if s, t ∈ (t0 − ε, t0 + ε), s < t we have
|f (s) − f (t)| ≤ M (t − s).
13 In 1936 S. Mazurkewicz proved that the set S is not a Borel subset of C.
Now choose N such that 1/N < ε/6 and, for n ≥ N, choose k ∈ {1, 2, . . . , n} such that
    t0 − ε < (k − 1)/n, k/n, (k + 1)/n, (k + 2)/n < t0 + ε. (2.5.15)
We deduce that
    Ω \ S ⊂ ∪_{M∈N} ∪_{N∈N} XM,N,  where XM,N := ∩_{n≥N} ∪_{k=1}^{n} ∩_{i=0}^{2} { |B(k−1+i)/n − B(k+i)/n| ≤ M/n }.
Clearly, the set XM,N is measurable and it suffices to show it is negligible. We have
    P[XM,N] ≤ inf_{n≥N} Σ_{k=1}^{n−2} P[ max_{0≤i≤2} |B(k−1+i)/n − B(k+i)/n| ≤ M/n ]. (2.5.16)
Now observe that the increments B(k−1)/n − Bk/n are independent Gaussians with
mean zero and variance 1/n. We deduce
    P[XM,N] ≤ inf_{n≥N} Σ_{k=1}^{n−2} P[ |B(k−1)/n − Bk/n| ≤ M/n ]³.
The exponent 3 above will make all the difference. It appears because of the constraint (2.5.15) on N. Since √n ( B(k−1)/n − Bk/n ) is standard normal, the random variable B(k−1)/n − Bk/n is normal with variance 1/n and we have
    P[ |B(k−1)/n − Bk/n| ≤ M/n ] = 2 √(n/(2π)) ∫_0^{M/n} e^{−x²n/2} dx
    (x = M y/n)
    = 2 √(n/(2π)) (M/n) ∫_0^1 e^{−M²y²/(2n)} dy ≤ (2/√(2π)) M n^{−1/2} =: C M n^{−1/2}.
Hence
    Σ_{k=1}^{n−2} P[ |B(k−1)/n − Bk/n| ≤ M/n ]³ ≤ n C³M³ n^{−3/2} = C³M³ n^{−1/2},  ∀n ≥ N,
and (2.5.16) implies that P[XM,N] = 0. □
2.6 Exercises
Exercise 2.2. Suppose that (Xn )n≥1 is a sequence of independent random vari-
ables. Prove that the following statements are equivalent.
(i) The series Σ_{n≥1} Xn converges in probability.
(ii) The series Σ_{n≥1} Xn converges a.s.
Remark 2.89. The so-called Lévy equivalence theorem, [47, §III.2], [50, §9.7] or [105, §43], states that a series with independent terms converges a.s. iff it converges in probability, iff it converges in distribution. □
Show that if (xn ) is a sequence of real numbers that converges to a number x, then
the sequence of weighted averages
    yn := Σ_{k=1}^{n} pn,k xk
converges to the same number x. □
Exercise 2.5 (Bernstein). For each x ∈ [0, 1] we consider a sequence (Bkx )k∈N of
i.i.d. Bernoulli random variables with probability of success x. We set
    S^x_n = Σ_{k=1}^{n} B^x_k.
By the strong law of large numbers, S^x_n/n → x a.s. as n → ∞.
The dominated convergence theorem implies that for any continuous function f : [0, 1] → R we have
Set
(ii) Prove that as n → ∞ the polynomials Bnf (x) converge uniformly on [0, 1]
to f (x).
Hint. For (ii) imitate the argument in Step 2 of the proof of Theorem 2.41. □
Prove that
    (1/n)( X1 + · · · + Xn ) → 0 in probability as n → ∞. □
Fig. 2.5 The graph of f(x) = sin(4πx) (the continuous curve) and of the degree 50 Bernstein polynomial B^f_50(x) (the dotted curve).
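Exercise 2.5 concerns the Bernstein polynomial B^f_n(x) = Σ_{k=0}^{n} f(k/n) C(n,k) x^k (1−x)^{n−k} = E[f(S^x_n/n)] pictured in Figure 2.5. A minimal evaluation sketch (my own illustration, not code from the book):

```python
import math

def bernstein(f, n, x):
    """Degree-n Bernstein polynomial of f at x: E[f(S_n^x / n)] for
    S_n^x ~ Bin(n, x), written out as an explicit sum."""
    return sum(f(k / n) * math.comb(n, k) * x**k * (1 - x)**(n - k)
               for k in range(n + 1))

f = lambda x: math.sin(4 * math.pi * x)
# For a C^2 function the uniform error is at most ||f''|| x(1-x)/(2n),
# so it shrinks like 1/n; here we just probe it on a grid.
err = max(abs(bernstein(f, 200, i / 100) - f(i / 100)) for i in range(101))
```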
Exercise 2.7. Suppose that a player rolls a die indefinitely. More formally, we are given a sequence of independent random variables (Xn)n∈N, uniformly distributed on I6 := {1, 2, . . . , 6}. For k ∈ N, we say that a k-run occurred at time
n if n ≥ k and
Xn = Xn−1 = · · · = Xn−k+1 = 6.
For n ∈ N we set
    Rn = R^k_n := #{ m ≤ n;  a k-run occurred at time m },
    T = Tk := min{ n ≥ k;  Rn > 0 }.
Thus T is the moment when the first k-run occurs. As shown in Example 1.167, E[T] < ∞.
(i) Compute E[T].
(ii) Prove that Rn/n converges in probability to 1/6^k. Hint. For n ≥ k set
    Yn := I{Xn=6} · · · I{Xn−k+1=6}.
Observe that Rn = Yk + · · · + Yn. □
Exercise 2.8 (A. Renyi). Suppose that (An )n≥0 is a sequence of events in the
sample space (Ω, S, P) with the following properties.
• A0 = Ω.
• P An 6= 0, ∀n ≥ 0.
• There exists ρ ∈ (0, 1] satisfying
    lim_{n→∞} P[An | Ak] = ρ,  ∀k ≥ 0. (2.6.1)
Set Xn := I{An} − ρ.
(iii) Conclude that the sequence (An) satisfies the mixing condition with density ρ:
    lim_{n→∞} P[An ∩ A] = ρ P[A],  ∀A ∈ S. (2.6.2)
Thus, in the long run, the set An occupies the same proportion ρ of any measurable set A. □
Exercise 2.9 (A. Renyi). Suppose that (Xn )n∈N is a sequence of i.i.d., almost
surely finite random variables. Set
    Mn := ( X1 + · · · + Xn )/n.
Assume that the empirical means Mn converge in probability to a random variable
M . The goal of the exercise is to prove that M is a.s. constant. We argue by
contradiction.
Assume M is not a.s. constant. Let F : R → [0, 1] be the cdf of M,
    F(m) = P[M ≤ m].
(i) Prove that there exist two continuity points a < b of F(x) such that
    p0 := F(b) − F(a) = P[a < M ≤ b] ∈ (0, 1).
    S|B = { S ∩ B ;  S ∈ S }. □
    S ∆ S′ = (S \ S′) ∪ (S′ \ S).
Define d : S × S → [0, ∞),
    d(S0, S1) = μ[S0 ∆ S1].
λ :S→ R
Prove that the sets Sk,ε ⊂ S are closed with respect to the metric d and
    S = ∪_{k∈N} Sk,ε,  ∀ε > 0. □
Exercise 2.11 (A. Renyi). Let (Ω, S, P) be a probability space and suppose that (An) is a stable sequence of events, i.e., for any B ∈ S the sequence P[An ∩ B] has a finite limit λ[B] and λ[Ω] ∈ (0, 1). Prove that λ : S → [0, 1] is a finite measure
Exercise 2.12 (A. Renyi). Let (Ω, S, P) be a probability space and suppose that
(An)n∈N is a sequence of events such that the limits
    λ0 = lim_{n→∞} P[An],  λk := lim_{n→∞} P[Ak ∩ An],  k ∈ N,
exist and λ0 ∈ (0, 1). Denote by X the linear span of the indicators I{An} and by X̄ its closure in L².
    L(ξ) = lim_{n→∞} E[ξ I{An}] = E[ρ ξ],  ∀ξ ∈ L²(Ω, S, P).
(iv) Show that (An )n∈N is a stable sequence with density ρ. (Note that when ρ
is constant the sequence satisfies the mixing condition (2.6.1) with density
ρ = λ0.) □
Exercise 2.13. Suppose that f : [0, 1] → [0, 1] is a continuous function that is not
identically 0 or 1. For n ∈ N we set
    An = ∪_{k=0}^{n−1} [ k/n, k/n + f(k/n) ).
Show that (An)n≥1 is a stable sequence of events and compute its density. □
Exercise 2.14. Suppose that π is a probability measure on In = {1, 2, . . . , n}, pi = π[{i}], with the convention p0 := 0. Consider a sequence (Xn)n∈N of i.i.d. random variables uniformly distributed on [0, 1]. For j ∈ In and m ∈ N we set
    Zm,j := #{ 1 ≤ k ≤ m;  Σ_{i=0}^{j−1} pi ≤ Xk < Σ_{i=0}^{j} pi },  Hm := (1/m) Σ_{j=1}^{n} Zm,j log2 pj.
Prove that
    lim_{m→∞} Hm = −Ent2[π] = Σ_{j=1}^{n} pj log2 pj,  a.s. □
Exercise 2.15. Let (Xn )n∈N be a sequence of i.i.d. Bernoulli random variables with
success probability 12 and (Yn )n∈N a sequence of i.i.d. Bernoulli random variables
with success probability 13 . (The sequences (Xn ) and (Yn ) may not be independent
We set
    F := ∪_{n∈N} Fn.
(i) Prove that for any n ∈ N the measure Qn is absolutely continuous with respect to Pn. Compute the density dQn/dPn of Qn with respect to Pn.
(ii) Prove that Q is not absolutely continuous with respect to P. Hint. Use the Law
of Large Numbers.
□
Exercise 2.16. Show that the Gaussian measures Γv[dx] = γ0,v(x)[dx],
    γ0,v(x) := (1/√(2πv)) e^{−x²/(2v)},
converge weakly to the Dirac measure δ0 on R as the variance v converges to 0.
Hint. Use Chebyshev's inequality (1.3.17): Γv[ |x| > c ] ≤ v/c². □
Exercise 2.18. Fix λ > 0. Show that as n → ∞ we have Bin(n, λ/n) ⇒ Poi(λ),
where Bin(n, λ/n) denotes the binomial probability distribution corresponding to
n independent trials with success probability λ/n and Poi(λ) denotes the Poisson
distribution with parameter λ. □
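The convergence Bin(n, λ/n) ⇒ Poi(λ) can be quantified numerically by the total variation distance between the two pmfs. The sketch below is mine (the truncation level kmax is an ad hoc choice, harmless for moderate λ):

```python
import math

def binom_pmf(n, p, k):
    # math.comb(n, k) returns 0 when k > n, so summing past n is safe
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(lam, k):
    return math.exp(-lam) * lam**k / math.factorial(k)

def tv_distance(n, lam, kmax=60):
    """Total variation distance between Bin(n, lam/n) and Poi(lam);
    for moderate lam the mass beyond kmax is negligible."""
    return 0.5 * sum(abs(binom_pmf(n, lam / n, k) - poisson_pmf(lam, k))
                     for k in range(kmax + 1))

# The distance shrinks roughly like lam^2 / n (Le Cam's bound).
distances = [tv_distance(n, 3.0) for n in (50, 100, 1000)]
```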
Remark 2.91. Let me comment on why the result in Exercise 2.19 is surprising. Consider the following concrete situation.
Assume n = 2r and suppose that we want to distribute 2r gifts to r children.
We want to do this in the “fairest” possible way since the gifts, of equal value, are
different, and several kids may desire the same gift. To remove any bias, “common
sense” suggests that each gift should be given to a child chosen uniformly at random.
There are twice as many gifts as children so what can go wrong? Part (ii) of this exercise shows that for n large, almost surely about e^{−2} r ≈ 0.13 r children will receive no gifts! □
    log P[XN > x] ≤ −x²/2  and  lim_{N→∞} log P[XN > x] = −x²/2,  ∀x > 0. □
Exercise 2.22 (P. Lévy). Consider the random variables Ln defined in Exercise 1.11. Prove that as n → ∞ the random variables Ln/n converge in distribution to the arcsine distribution Beta(1/2, 1/2); see Example 1.122. Hint. You need to use Stirling's formula (A.1.7) with error estimate (A.1.8). □
Exercise 2.23. Suppose that (Xn )n∈N and (Yn )n∈N are two sequences of random
variables such that Xn → X in distribution and
    lim_{n→∞} P[Xn ≠ Yn] = 0.
Then Yn → X in distribution. □
Exercise 2.24. Suppose that (Xn )n∈N and (Yn )n∈N are two sequences of random
variables such that
• Xn converges in distribution to X.
• Yn converges in distribution to Y .
• Xn is independent of Yn for every n and X is independent of Y .
□
Exercise 2.25. Suppose that (Xn )n∈N and (Yn )n∈N are sequences of random vari-
ables with the following properties.
Prove that for any Borel measurable function f : R → R the sequence of random
vectors ( f (Xn ), Yn ) converges in distribution to ( f (X), Y ). Hint. Fix a Borel measurable
function f . It suffices to show that for any continuous and bounded functions u, v : R → R we have
    lim_{n→∞} E[ u(f(Xn)) v(Yn) ] = E[ u(f(X)) v(Y) ].
Consider the Borel measurable functions vn defined by E[v(Yn) ∥ Xn] = vn(Xn). □
Exercise 2.26. Suppose that (Xn)n∈N and (Yn)n∈N are two sequences of random variables such that Xn converges in distribution to X and Yn converges in probability to the constant c. Prove that the random vector (Xn, Yn) converges in distribution to (X, c). Hint. Prove that (Xn, Yn) converges in probability to (Xn, c) and then use Exercise 2.23. □
Exercise 2.27. Suppose that (Xn)n∈N is a sequence of i.i.d. L² random variables with μ = E[Xn], σ² = Var[Xn]. Set
    X̄n = (1/n)( X1 + · · · + Xn ),  Yn = (1/(n−1)) Σ_{k=1}^{n} ( Xk − X̄n )².
Prove that E[Yn] = σ² and Yn → σ² in probability. □
Exercise 2.28 (Trotter). We outline a proof of the CLT that does not rely on
the characteristic function.
For any random variable X and any f ∈ Cb (R) we denote by TX f the function
R → R given by
TX f (y) = E f (X + y) , y ∈ R.
□
Exercise 2.29. Suppose that (Xn )n∈N is a sequence of i.i.d. Bernoulli random
variables with success probability p = 12 . For each n ∈ N we set
    Sn := Σ_{k=1}^{n} (1/2^k) Xk. □
Remark 2.92. Part (iv) of the above exercise is essentially a universality property of the simplest random experiment: tossing a fair coin. If we are able to perform this experiment repeatedly and independently, then we can approximate any probability distribution. In other words, we can approximately sample any probability distribution by flipping fair coins. □
Exercise 2.30. Suppose that (Xn )n∈N is a sequence of i.i.d. random variables uni-
formly distributed in [0, L], L > 0. For n ∈ N we set
    X(n) := max{ X1, X2, . . . , Xn }.
Prove that lim_{n→∞} E[X(n)] = L and X(n) → L in probability. Hint. Have a look at Exercise 1.44. □
Exercise 2.31. Suppose that (Xn)n∈N is a sequence of i.i.d. random variables uniformly distributed in [0, 1]. Denote by X^(n)_(1), X^(n)_(2), . . . , X^(n)_(n) the order statistics of the first n of them; see Exercise 1.44. Prove that for any k ∈ N the random variable n X^(n)_(k) converges in distribution to Gamma(1, k). □
Exercise 2.32. Suppose that (μn)n≥0 is a sequence of finite Borel measures. Denote their distribution functions by
    Fn(x) = μn[(−∞, x]],  ∀n ∈ N, x ∈ R.
Let μ ∈ Meas(R) be a finite Borel measure with distribution function
    F(x) = μ[(−∞, x]],  ∀x ∈ R.
Prove that the following are equivalent. □
Exercise 2.33. Suppose that (μn)n∈N is a sequence of Borel measures on R such that
    sup_n μn[R] < ∞.
Set Fn(x) = μn[(−∞, x]], x ∈ R. Prove that (μn) contains a subsequence converging vaguely to a finite Borel measure. Hint. Construct a subsequence (nk) such that the sequence Fnk(q) is convergent for any q ∈ Q. □
(i) Prove that the sequence (µn )n∈N is tight. Hint. Use Exercise 2.35.
(ii) Show that f is the characteristic function of a Borel probability measure µ.
Hint. Use Exercise 2.34.
(iii) Prove that µn converges weakly to µ.
□
Exercise 2.38. Suppose that X is a random variable and ϕ(ξ) is its characteristic
function
ϕ(ξ) = E eiξX .
(i) Prove that for any r > 0 there exists xr ∈ R such that
CX (r) = P |X − xr | > r .
(ii) Prove that if Var[X] < ∞, then
    CX(r) ≤ (1/r²) Var[X].
(iii) Prove that if X, Y are independent random variables, then
CX+Y (r) ≥ max CX (r), CY (r) , ∀r > 0.
(iv) Suppose that (Xn ) is a sequence of independent random variables. We set
Sn := X1 + · · · + Xn , Cn,N = CSN −Sn .
Show that the limits
    Cn(r) = lim_{N→∞} Cn,N(r)
exist for every n, and that the resulting sequence Cn(r) is nondecreasing.
(v) Show that
    lim_{n→∞} Cn(r)
is independent of r and is either 0 or 1. □
(i) Prove that the Poisson distributions and the Gaussian distributions are in-
finitely divisible.
(ii) Prove that any linear combination of independent infinitely divisible random
variables is an infinitely divisible random variable.
(iii) Suppose that (Xn )n∈N is a sequence of i.i.d. random variables with common
distribution ν ∈ Prob(R). Denote by N (t), t ≥ 0 a Poisson process with
intensity λ > 0; see Example 1.136. For t ≥ 0 we set
    Y(t) = Σ_{k=1}^{N(t)} Xk.
The distribution of Y (t) denoted by Qt , is called a compound Poisson dis-
tribution. The distribution ν is called the compounding distribution. Show
that
    Qt = e^{−λt} Σ_{n=0}^{∞} ( (λt)^n / n! ) ν^{∗n}
and deduce that Qt ∗ Qs = Qt+s, ∀t, s ≥ 0. In particular Qt is infinitely divisible.
□
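For a compounding law ν supported on finitely many integers, the series defining Qt can be evaluated by repeated convolution, and the semigroup property Qt ∗ Qs = Qt+s checked numerically. A sketch of mine under these assumptions (names and truncation levels are ad hoc):

```python
import math
import numpy as np

def compound_poisson_pmf(nu, lam, t, nmax=60, size=200):
    """Q_t = e^{-lam t} * sum_n ((lam t)^n / n!) nu^{*n}, truncated at nmax,
    for a compounding law nu given as a pmf vector on {0, 1, ..., len(nu)-1}."""
    q = np.zeros(size)
    conv = np.zeros(size)
    conv[0] = 1.0                                    # nu^{*0} = delta_0
    for n in range(nmax):
        w = math.exp(-lam * t) * (lam * t) ** n / math.factorial(n)
        q += w * conv
        conv = np.convolve(conv, nu)[:size]          # nu^{*(n+1)}
    return q

nu = np.array([0.0, 0.5, 0.3, 0.2])   # made-up compounding law on {1, 2, 3}
qt = compound_poisson_pmf(nu, 1.0, 0.7)
qs = compound_poisson_pmf(nu, 1.0, 0.5)
qts = compound_poisson_pmf(nu, 1.0, 1.2)
# np.convolve(qt, qs) should reproduce qts, i.e., Q_t * Q_s = Q_{t+s}
```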
(ii) Let f ∈ L²(R, μ). Prove that there exists r1 > 0 such that for any complex number z with |Re z| < r1 the complex valued function R ∋ x ↦ e^{izx} f(x) ∈ C is μ-integrable and the resulting function
    F(z) = ∫_R e^{izx} f(x) μ[dx]
is holomorphic in the strip |Re z| < r1.
(iii) Prove that R[x], the space of polynomials with real coefficients, is dense in L²(R, μ). Hint. You have to show that if f ∈ L²(R, μ) satisfies
    ∫_R f(x) x^n μ[dx] = 0,  ∀n ≥ 0,
then f = 0 in L²(R, μ). □
Exercise 2.45. Suppose that μ0, μ1 are two Borel probability measures such that ∃t0 > 0:
    ∫_R e^{tx} μ0[dx] = ∫_R e^{tx} μ1[dx],  ∀|t| < t0.
Fix r > 0 as in Exercise 2.44(ii) such that the functions
    Fk(z) = ∫_R e^{izx} μk[dx],  k = 0, 1,
are well defined and holomorphic in the strip |Re z| < r. Show that F0 = F1 and deduce that μ0 = μ1. Hint. Set F = F1 − F0. Use the Cauchy-Riemann equations to prove that (d^n F/dz^n)|_{z=t} = 0, ∀n ∈ N, ∀t ∈ (−r, r). □
Exercise 2.46 (De Moivre). Let Xn ∼ Bin(n, 1/2) and Y ∼ N(0, 1). Prove that
    lim_{n→∞} P[ |Xn − n/2| ≤ (r/2)√n ] / P[ |Y| < r ] = 1,  ∀r > 0. □
Exercise 2.49. Suppose that X, Y are independent random normal variables. Set
Z = XY .
Thus for spheres of large dimension n most of the volume is concentrated near
the Equator {x0 = 0}! Hint. Choose independent standard normal random variables Y0, . . . , Yn and set Z = Y0² + · · · + Yn². Show that the random vector
    (X0, . . . , Xn) = (1/√Z)( Y0, . . . , Yn )
is uniformly distributed on S^n. To conclude use Exercise 1.38. □
(i) Show that the map V∗ ∋ ξ ↦ m[ξ] ∈ R is linear and thus defines an element
    m = mμ ∈ (V∗)∗ ≅ V,
called the mean of the Gaussian measure. Moreover
    mμ = ∫_V x μ[dx] ∈ V.
(ii) Define C = Cμ : V∗ × V∗ → R,
    C(ξ, η) = (1/4)( v[ξ + η] − v[ξ − η] ) = Eμ[ (ξ − m[ξ])(η − m[η]) ].
Show that C is a bilinear form, symmetric and positive semidefinite. It is called the covariance form of the Gaussian measure μ.
(iii) Show that if µ0 , µ1 are Gaussian measures on V0 and respectively V1 , then
the product µ0 ⊗ µ1 is a Gaussian measure on V0 ⊕ V1 . Moreover,
m[µ0 ⊗ µ1 ] = mµ0 ⊕ mµ1 , Cµ0 ⊗µ1 = Cµ0 ⊕ Cµ1 .
We set
    Γ^n_1 := Γ1 ⊗ · · · ⊗ Γ1  (n factors).
(iv) Suppose that V0 , V1 are real finite dimensional vector spaces, µ is a Gaussian
measure on V0 and A : V0 → V1 is a linear map. Denote by µA the pushfor-
ward of µ via the map A, µA := A# µ. Prove that µA is a Gaussian measure
on V1 with mean mµA = Amµ and covariance form
CA : V1∗ × V1∗ → R, CA (ξ1 , η1 ) = Cµ (A∗ ξ1 , A∗ η1 ), ∀ξ1 , η1 ∈ V1∗ .
Above, A∗ : V1∗ → V0∗ is the dual (transpose) of the linear map A.
(v) Fix a basis {e1 , . . . , en } of V so we can identify V and V ∗ with Rn and
C with a symmetric positive semidefinite matrix. Denote by A its unique
positive semidefinite square root. Show that the pushforward A# Γ1n is a
Gaussian measure on Rn with mean zero and covariance form C = A2 .
(vi) Define the Fourier transform of a measure μ ∈ Prob(V) to be the function
    μ̂ : V∗ → C,  μ̂(ξ) = Eμ[ e^{i⟨ξ,x⟩} ] = ∫_V e^{i⟨ξ,x⟩} μ[dx].
Exercise 2.55. Let (Ω, S, P) be a probability space and E a finite dimensional real
vector space. Recall that a Borel measurable map X : (Ω, S, P) → E is called a
Gaussian random vector if its distribution PX = X# P is a Gaussian measure on E;
see Exercise 2.54.
Suppose that X1 , . . . , Xn ∈ L0 (Ω, S, P) are jointly Gaussian random variables,
i.e., the random vector X ~ = (X1 , . . . , Xn ) : Ω → Rn is Gaussian.
(i) Prove that each of the variables X1, . . . , Xn is Gaussian and the covariance form
    C : Rn × Rn → R
of the Gaussian measure PX⃗ ∈ Prob(Rn) is given by the matrix (cij)1≤i,j≤n,
    cij = Cov[Xi, Xj],  ∀1 ≤ i, j ≤ n.
(ii) Prove that X1 , . . . , Xn are independent if and only if the matrix (cij )1≤i,j≤n
is diagonal, i.e.,
    E[Xi Xj] = E[Xi] E[Xj],  ∀i ≠ j.
Hint. Use the results in Exercise 2.54. □
(i) Prove that X0 = E[X0 ∥ X1, . . . , Xn].
(ii) Suppose that the covariance matrix C of the Gaussian vector (X1 , . . . , Xn ) is
invertible. Denote by L = [`1 , . . . , `n ] the 1 × n matrix
`i = E X0 Xi , i = 1, . . . , n.
Prove that
    X0 = L · C^{−1} · X,  X := (X1, . . . , Xn)^T.
Hint. For (i) use Exercise 2.55(ii) and (1.4.10). □
Remark 2.93. The result in Exercise 2.56 is remarkable. Let us explain its typical
use in statistics.
Suppose we want to understand the random quantity X0 and all we truly under-
stand are the random variables X1 , . . . , Xn . A quantity of the form f (X1 , . . . , Xn ) is
called a predictor, and the simplest predictors are of the form c0 +c 1 X1 +· · ·+cn Xn.
These are called linear predictors. The conditional expectation E X0 k X1 , . . . , Xn
is the predictor closest to X0 . The linear predictor closest to X0 is called the linear
regression. The coefficients c0 , c1 , . . . , cn corresponding to the linear regression are
obtained via the least squares approximation.
The result in the above exercise shows that, when the random variables X0, X1, . . . , Xn are jointly Gaussian, the best predictor of X0, given X1, . . . , Xn, is the linear predictor. This is another reason why Gaussian variables are extremely convenient to work with in practice. □
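The formula L · C^{−1} · X of Exercise 2.56 can be checked by pure linear algebra: the residual of the linear predictor with coefficients L C^{−1} is uncorrelated with every Xi. A sketch of mine, with a made-up covariance matrix:

```python
import numpy as np

# Joint covariance of (X0, X1, X2, X3): any symmetric positive definite
# matrix will do; this one is fabricated for illustration.
A = np.array([[2.0, 0.3, 0.0, 0.5],
              [0.1, 1.0, 0.4, 0.0],
              [0.0, 0.2, 1.5, 0.3],
              [0.2, 0.0, 0.1, 1.0]])
sigma = A @ A.T                    # covariance of the Gaussian vector A @ Z

C = sigma[1:, 1:]                  # covariance matrix of (X1, X2, X3)
L = sigma[0, 1:]                   # L_i = E[X0 X_i] (mean-zero case)
coeffs = np.linalg.solve(C, L)     # predictor coefficients: L @ C^{-1} (C symmetric)

# Cov(X0 - coeffs @ X, X_i) = L_i - (coeffs @ C)_i vanishes for every i,
# which is exactly the orthogonality behind the least squares fit.
residual_cov = L - coeffs @ C
```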
Exercise 2.57 (Maxwell). Suppose that (Xn )n∈N is a sequence of mean zero
i.i.d. random variables. For each n ∈ N we denote by Vn the random vector
Vn := (X1 , . . . , Xn ). Prove that the following are equivalent.
□
Exercise 2.59. Suppose that T is a compact interval of the real axis, and (Xt )t∈T ,
(Yt )t∈T , (Zt )t∈T are real valued stochastic processes such that (Yt ) and (Zt ) are
modifications of (Xt ) with a.s. continuous paths. Prove that the processes (Yt ) and
(Zt) are indistinguishable. □
Exercise 2.60. Fix a Brownian motion (Bt )t≥0 defined on a probability space
(Ω, S, P). Denote by E the vector subspace of L2 [0, 1], λ spanned by the functions
I (s,t] , 0 ≤ s < t ≤ 1.
Show that
    Σ_{k=1}^{n} ck ( B_{tk} − B_{sk} ) = Σ_{i=1}^{m} c′i ( B_{t′i} − B_{s′i} ) =: W(f).
Exercise 2.61. The space F := C([0, ∞)) of continuous functions [0, ∞) → R is equipped with a natural metric d,
    d(f, g) = Σ_{n∈N} (1/2^n) min{ 1, dn(f, g) },  dn(f, g) := sup_{t∈[n−1,n]} |f(t) − g(t)|.
Chapter 3
Martingales
The usefulness of the martingale property was recognized by P. Lévy (condition (C)
in [105, Chap. VIII]), but it was J. L. Doob [47] who realized its full potential by
discovering its most important properties: optional stopping/sampling, existence of
asymptotic limits, maximal inequalities.
I have to admit that when I was first introduced to martingales they looked
alien to me. Why would anyone be interested in such things? What are really these
martingales?
I can easily answer the first question. Martingales are ubiquitous, they appear
in the most unexpected of situations, though not always in an obvious way, and
they are “well behaved”. Since their appearance on the probabilistic scene these
stochastic processes have found many applications.
As for the true meaning of this concept, let me first remark that the name "martingale" itself is a bit unusual. It is a French word with an equestrian meaning (harness) but, according to [113], the term was used among French gamblers when referring to a gambling system. I personally cannot communicate clearly, beyond a formal definition, what the true meaning of this concept is. I believe it is a fundamental concept of probability theory and I subscribe to R. Feynman's attitude: it is more useful to know how electromagnetic waves behave than to know what they look like. The same could be said about the concept of martingale and, why not, about the concept of probability. I hope that the large selection of examples discussed in this chapter will give the reader a sense of this concept.
This chapter is divided into two parts. The first and bigger part is devoted to
discrete time martingales. The second and smaller part is devoted to continuous
time martingales. I have included many and varied applications of martingales
with the hope that they will allow the reader to see the many facets of this concept
and convince him/her of its power and versatility. My presentation was inspired
by many sources and I want to single out [75; 33; 47; 53; 102; 103; 135; 160] that
influenced me the most.
Xt : (Ω, S, P) → R, t ∈ T.
Fs ⊂ Ft , ∀s ≤ t.
We set
    F∞ := ∨_{t∈T} Ft.
Fn = σ(X0 , X1 , . . . , Xn ), n ∈ N0 ,
Definition 3.2. Suppose that (Ω, S, P) is equipped with a filtration F• = (Ft)t∈T. An
F• -martingale is a family of random variables Xt : (Ω, S, P) → R, t ∈ T, satisfying
the following two conditions.
(i) The family is adapted to the filtration F• and Xt is integrable for any t ∈ T.
(ii) For all s, t ∈ T, s < t, we have Xs = E Xt kFs .
Note that a sequence of random variables (Xn )n∈N0 is a discrete time submartin-
gale (resp. martingale) with respect to a filtration (Fn )n∈N0 of F if
    E[Xn+1 ∥ Fn] ≥ Xn  (resp. E[Xn+1 ∥ Fn] = Xn),  ∀n ∈ N0.
Remark 3.3. Suppose that (Xn )n≥0 is a sequence of integrable random variables
and
Fn = σ(X1 , . . . , Xn ).
is a martingale since
    E[Xn+1 ∥ Fn] = E[ E[X ∥ Fn+1] ∥ Fn ] = E[X ∥ Fn] = Xn.
Example 3.5 (Unbiased random walk). Suppose that (Xn )n∈N is a sequence
of independent integrable random variables such that E[Xn ] = 0, ∀n ∈ N0 .
One should think that Xn is the size of the n-th step so that the location after
n steps is
Sn = X1 + · · · + Xn .
    = E[Xn+1] + X1 + · · · + Xn = Sn. □
Example 3.7 (Biased random walk). Suppose that (Xn )n∈N are i.i.d. random
variables such that the moment generating function
µ(λ) := E eλXn
The random variable Zn can be interpreted as the population of the n-th generation of a species that had ℓ individuals at n = 0 and such that the number of offspring of a given individual is a random variable with distribution μ. The j-th individual of the n-th generation has Xn,j offspring. We will refer to μ as the reproduction law.
The sequence (Zn)n≥0 is known as the Galton-Watson process or the branching process with reproduction law μ.
When ℓ = 1 this process can be visualized as a random rooted tree. The root v0 has Z1 = X0,1 successor vertices v1,1, . . . , v1,Z1. The vertex v1,i has X1,i successors etc.; see Figure 3.1. For any n ∈ N0 we have
    Zn+1 = Σ_{k=1}^{∞} Σ_{j=1}^{k} Xn,j I{Zn=k}
so
    E[Zn+1 ∥ Fn] = Σ_{k=1}^{∞} E[ Σ_{j=1}^{k} Xn,j I{Zn=k} ∥ Fn ]
                 = Σ_{k=1}^{∞} Σ_{j=1}^{k} E[Xn,j ∥ Fn] I{Zn=k}
    (Xn,j ⊥⊥ Fn, ∀n, j)
                 = Σ_{k=1}^{∞} Σ_{j=1}^{k} E[Xn,j] I{Zn=k} = m Σ_{k=0}^{∞} k I{Zn=k} = m Zn.
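The identity E[Zn+1 ∥ Fn] = m Zn forces E[Zn] = m^n ℓ. For a reproduction law supported on finitely many integers this can be verified exactly by propagating the distribution of Zn with convolutions; the sketch below is mine, with a made-up μ.

```python
import numpy as np

def next_generation_pmf(pmf_z, mu, size=400):
    """Exact pmf of Z_{n+1}: condition on Z_n = k and mix the k-fold
    convolutions mu^{*k} with weights P[Z_n = k]."""
    out = np.zeros(size)
    conv = np.zeros(size)
    conv[0] = 1.0                              # mu^{*0} = delta_0
    for k, pk in enumerate(pmf_z):
        out += pk * conv
        conv = np.convolve(conv, mu)[:size]    # mu^{*(k+1)}
    return out

mu = np.array([0.2, 0.3, 0.5])                 # P[offspring = 0, 1, 2]; mean m = 1.3
pmf = np.zeros(400)
pmf[1] = 1.0                                   # Z_0 = 1 individual
means = []
for n in range(4):
    pmf = next_generation_pmf(pmf, mu)
    means.append(float(np.dot(np.arange(400), pmf)))  # should equal m^{n+1}
```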
Example 3.9 (Polya's urn). An urn contains r > 0 red balls and g > 0 green balls. At each moment of time we draw a ball uniformly at random from the balls existing
Fn = σ(R0 , G0 , · · · , Rn , Gn ) = σ(X0 , X1 , . . . , Xn ).
so
    E[Xn+1 ∥ Fn] = Σ_{i,j>0} (i/(i+j)) E[ I{Rn+1=i, Gn+1=j} ∥ Fn ].
Moreover
    E[ I{Rn+1=i, Gn+1=j} ∥ Fn ] = Σ_{k,ℓ>0} P[ Rn+1 = i, Gn+1 = j ∥ Rn = k, Gn = ℓ ] I{Rn=k, Gn=ℓ}
    = ( (i−c)/(i+j−c) ) I{Rn=i−c, Gn=j} + ( (j−c)/(i+j−c) ) I{Rn=i, Gn=j−c}.
We deduce
    E[Xn+1 ∥ Fn] = Σ_{i,j} (i/(i+j)) ( (i−c)/(i+j−c) ) I{Rn=i−c, Gn=j} + Σ_{i,j} (i/(i+j)) ( (j−c)/(i+j−c) ) I{Rn=i, Gn=j−c}
    (u = i − c, v = j in the first sum; u = i, v = j − c in the second)
    = Σ_{u,v} ( (u+c)/(u+v+c) ) ( u/(u+v) ) I{Rn=u, Gn=v} + Σ_{u,v} ( u/(u+v+c) ) ( v/(u+v) ) I{Rn=u, Gn=v}
    = Σ_{u,v} ( u(u+v+c)/((u+v)(u+v+c)) ) I{Rn=u, Gn=v} = Σ_{u,v} ( u/(u+v) ) I{Rn=u, Gn=v} = Xn. □
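In the classical case c = 1 (the drawn ball is returned together with one extra ball of the same color) the martingale Xn = Rn/(Rn + Gn) is easy to simulate, and its expectation stays at the initial fraction r/(r + g). A sketch of mine, with arbitrary parameters:

```python
import numpy as np

def polya_fraction(r, g, steps, trials, rng):
    """Simulate Polya's urn with c = 1 extra ball per draw; return the
    fraction of red balls after `steps` draws, one value per trial."""
    red = np.full(trials, float(r))
    total = float(r + g)
    for _ in range(steps):
        draw_red = rng.random(trials) < red / total   # draw uniformly among balls
        red += draw_red                                # replace + add one of same color
        total += 1.0
    return red / total

rng = np.random.default_rng(3)
frac = polya_fraction(1, 1, 50, 20000, rng)
# the sample mean of frac stays near r/(r+g) = 0.5 (martingale property)
```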
Consider the simple random walk on Γ that starts at a given vertex v0; the probability of transitioning from a vertex u to a neighbor v is equal to 1/deg(u).
Denote by Vn the location after n steps of the walk. Suppose that F : V Γ → R is
a harmonic function. Then the sequence of random variables
Xn = F (Vn ), n ∈ N0 ,
(i) If (X_n)_{n∈N_0} is a martingale and φ : R → R is a convex function such that φ(X_n) is integrable ∀n ∈ N_0, then the conditional Jensen inequality implies that the sequence φ(X_n) is a submartingale. Indeed, Jensen's inequality implies
E[ φ(X_{n+1}) | F_n ] ≥ φ( E[ X_{n+1} | F_n ] ) = φ(X_n). □
The next result formalizes the discussion at the beginning of this subsection.
M_{n+1} − M_n = ( X_{n+1} − X_n ) − ( C_{n+1} − C_n ), ∀n ∈ N_0. (3.1.1b)
Note that C_{n+1} − C_n is F_n-measurable so (C_n) is predictable. By construction M_• is an F_•-martingale. Clearly, if X_• is a submartingale then, tautologically, C_• is nondecreasing.
Uniqueness. Suppose that X_• is a submartingale, M'_• is a martingale, and C'_• is a nondecreasing predictable process such that
M'_0 = C'_0 = 0, X_n = X_0 + M'_n + C'_n, ∀n ∈ N_0.
We deduce
E[ X_{n+1} | F_n ] − X_n = ( E[ M'_{n+1} | F_n ] − M'_n ) + ( E[ C'_{n+1} | F_n ] − C'_n ),
where the first difference on the right is 0 and, C'_• being predictable, the second equals C'_{n+1} − C'_n. This shows that the increments of C'_n are given by (3.1.1a), so C'_n = C_n. In particular, M'_n = M_n, ∀n ∈ N. □
so
C_n = Σ_{k=1}^n E[ X_k | F_{k−1} ]
and
M_n = S_n − Σ_{k=1}^n E[ X_k | F_{k−1} ] = Σ_{k=1}^n ( X_k − E[ X_k | F_{k−1} ] ).
If the variables X_n are independent, then
M_n = Σ_{k=1}^n ( X_k − E[ X_k ] ). □
Example 3.16. Suppose that (X_n)_{n≥1} are independent random variables with zero means and finite variances. We set S_0 = 0,
S_n = X_1 + ··· + X_n.
Then
E[ S_n^2 ] = Σ_{k=1}^n E[ X_k^2 ] < ∞, ∀n ≥ 1.
Thus (S_•) is an L^2-martingale. From the computations in Example 3.14 we deduce
⟨S_•⟩_n = Σ_{k=1}^n E[ X_k^2 ] = Σ_{k=1}^n E[ (S_k − S_{k−1})^2 ].
This explains why we refer to ⟨S_•⟩ as the quadratic variation. □
Remark 3.18. (a) When X_• is a martingale the process (H · X)_• is called the discrete stochastic integral of H with respect to X and it is alternatively denoted
∫_0^n H dX := (H · X)_n.
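Unwinding the definition, the discrete stochastic integral is the martingale transform (H · X)_n = Σ_{k=1}^n H_k (X_k − X_{k−1}); a minimal sketch (my own notation, not the book's):

```python
def discrete_stochastic_integral(H, X):
    """Return the partial sums (H.X)_n = sum_{k<=n} H[k] * (X[k] - X[k-1]).

    X = [X_0, X_1, ..., X_N]; H = [H_1, ..., H_N], with H_k known at time
    k-1 (predictability is the caller's responsibility).
    """
    out = [0]
    for k in range(1, len(X)):
        out.append(out[-1] + H[k - 1] * (X[k] - X[k - 1]))
    return out
```

When X is a martingale and H is bounded and predictable, (H · X) is again a martingale; the snippet merely computes the transform pathwise.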
Example 3.19. Observe first that a discrete time process (Y_n)_{n∈N} on (Ω, S, P) can be viewed as a map
Y : N × Ω → R, (n, ω) ↦ Y_n(ω).
We equip N × Ω with the product σ-algebra. A measurable set X ⊂ N × Ω defines a stochastic process
I_X : N × Ω → {0, 1}, (I_X)_n = I_{X_n}, X_n := { ω ∈ Ω; (n, ω) ∈ X }.
The set X is called F_•-predictable if the process I_X is such. More precisely, this means that X_0 ∈ F_0 and, for any n ∈ N, the set X_n is F_{n−1}-measurable. □
Example 3.21. (a) For each n ∈ N_0 the constant random variable equal to n is a stopping time.
(b) Suppose that (X_n)_{n∈N_0} is F_•-adapted and C ⊂ R is a Borel set. We define the hitting time of C to be the random variable
H_C : Ω → N_0 ∪ {∞}, H_C(ω) := min{ n ∈ N_0; X_n(ω) ∈ C }.
This is a stopping time since
{ H_C ≤ n } = ⋃_{k≤n} { X_k ∈ C } ∈ F_n.
Definition 3.22. Let X_• = (X_n)_{n∈N_0} be a process adapted to the filtration (F_n)_{n≥0}. For any stopping time T we denote by X_•^T the process stopped at T, defined by
X_n^T := X_{T∧n}, where X_{T∧n}(ω) = X_{min(T(ω),n)}(ω) = { X_n(ω), n ≤ T(ω); X_{T(ω)}(ω), T(ω) < n. } (3.1.4) □
Note that the process stopped at T is also adapted to the filtration (Fn )n≥0 .
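On a concrete path, the hitting time of Example 3.21 and the stopped process X^T of Definition 3.22 are computed mechanically; a small sketch (the function names are mine):

```python
def hitting_time(path, in_C):
    """First n with path[n] in C, or None if the path never enters C."""
    for n, x in enumerate(path):
        if in_C(x):
            return n
    return None

def stopped_path(path, T):
    """The path of X^T_n = X_{min(T, n)}; T = None means 'never stopped'."""
    if T is None:
        return list(path)
    return [path[min(T, n)] for n in range(len(path))]
```

For the path 0, 1, 2, 1, 3 and C = [2, ∞) the hitting time is 2 and the stopped path freezes at the value 2 from then on.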
Proposition 3.23. Suppose that S and T are stopping times such that S ≤ T. Define
]]T, ∞[[ := { (n, ω) ∈ N_0 × Ω; T(ω) < n },
]]S, T]] := { (n, ω) ∈ N_0 × Ω; S(ω) < n ≤ T(ω) }.
Then ]]T, ∞[[, [[0, T]] and ]]S, T]] are predictable subsets of N_0 × Ω.
Suppose now that (X_n)_{n∈N_0} is a (sub)martingale and T is a stopping time. Then S_0 = 0 is also a stopping time, S_0 ≤ T. As we have seen above, the set ]]S_0, T]] = ]]0, T]] is predictable and the process (I_{]]0,T]]} · X)_• is a submartingale.
For every n ∈ N we have
(I_{]]0,T]]} · X)_n = (I_{]]0,T]]})_n ( X_n − X_{n−1} ) + ··· + (I_{]]0,T]]})_1 ( X_1 − X_0 )
Example 3.25. Suppose that T is the hitting time of a Borel set C ⊂ R. Then the event E belongs to F_T if, at any moment of time n, we can decide using the information F_n available to us at time n whether, up to that moment, we have visited C and the event E has occurred. □
X_T ∈ L^1, (3.1.7b)
lim_{n→∞} E[ I_{T>n} |X_n| ] = 0. (3.1.7c) □
Roughly speaking, the Doob conditions state that the random process (X_n)_{n≥0} is not sampled "too late". In Proposition 3.66 we provide another characterization of the Doob conditions in terms of the asymptotic behavior of the stopped process X_•^T. They were first spelled out by J. L. Doob in his influential monograph [47].
Proof. We follow the original approach in [47, VII.2]; see also [4, Thm. 6.7.4]. Suppose that (X_n)_{n≥0} is a martingale. Set A_m := {S = m}. Then
E[ X_S ] = Σ_{m≥0} E[ X_S I_{A_m} ]
Clearly (3.1.7c) implies (3.1.8). Let us show that (3.1.8) ⇒ (3.1.7c). Assume first that X_• is a martingale. Fix m, n ∈ N_0, m < n. Observing that {T > m} ∈ F_m we deduce
E[ X_m I_{T>m} ] = E[ X_{m+1} I_{T>m} ] = E[ X_{m+1} I_{T=m+1} ] + E[ X_{m+1} I_{T>m+1} ]
= E[ X_{m+1} I_{T=m+1} ] + E[ X_{m+2} I_{T=m+2} ] + E[ X_{m+2} I_{T>m+2} ]
= ··· = E[ X_{m+1} I_{T=m+1} ] + ··· + E[ X_n I_{T=n} ] + E[ X_n I_{T>n} ] = E[ X_T I_{m<T≤n} ] + E[ X_n I_{T>n} ].
We deduce
E[ X_m I_{T>m} ] − E[ X_n I_{T>n} ] = E[ X_T I_{m<T≤n} ], ∀n > m.
If we let n → ∞ in the above equality and recall that T < ∞ a.s., X_T ∈ L^1 and X_•^+ satisfies (3.1.8), we deduce
lim_{n→∞} E[ X_n^− I_{T>n} ] − E[ X_m^− I_{T>m} ] = E[ X_T I_{T>m} ] − E[ X_m^+ I_{T>m} ].
Using the Optional Sampling Theorem 3.24 for the stopping times S ≡ m and T we deduce
E[ X_T I_{T>m} ] − E[ X_m^+ I_{T>m} ] = E[ X_m I_{T>m} ] − E[ X_m^+ I_{T>m} ] = −E[ X_m^− I_{T>m} ].
Hence
lim_{n→∞} E[ X_n^− I_{T>n} ] − E[ X_m^− I_{T>m} ] = −E[ X_m^− I_{T>m} ],
so that
lim_{n→∞} E[ |X_n| I_{T>n} ] = lim_{n→∞} E[ X_n^+ I_{T>n} ] + lim_{n→∞} E[ X_n^− I_{T>n} ] = 0.
Suppose now that X_• is a submartingale, with Doob decomposition X_n = X_0 + M_n + C_n. We deduce that the martingale Y_• = X_0 + M_• satisfies (3.1.8) and thus (3.1.7c). Next, observe that
X_n^+ = ( Y_n^+ + C_n ) I_{Y_n≥0} + ( C_n − Y_n^− ) I_{0<Y_n^−≤C_n}.
Hence
lim_{n→∞} E[ |X_n| I_{T>n} ] ≤ lim_{n→∞} E[ ( |Y_n| + C_n ) I_{T>n} ] = 0. □
Example 3.30 (The Ballot Problem). Let us consider again the ballot problem first discussed in Example 1.60. Recall the setup.
Two candidates A and B run for an election. Candidate A received a votes while candidate B received b votes, where b < a. The votes were counted in random order, so any permutation of the a + b votes cast is equally likely. We have shown in Example 1.60 that the probability that A was ahead throughout the count is
p = (a − b)/(a + b).
We want to describe an alternate proof using martingale methods. Our presentation is inspired from [117, Sec. 12.2].
Set n := a + b and denote by D_k the number of votes by which A was ahead after the k-th vote was tabulated. Note that D_n = a − b. Let X_k denote the random variable indicating the k-th vote: X_k = 1 if the vote went to A, and X_k = −1 if the vote went to B, so that
D_0 = 0, D_k = X_1 + ··· + X_k.
For k = 0, 1, …, n − 1 we denote by R_k the ratio
R_k := D_{n−k}/(n − k).
In other words, R_k is candidate A's lead, in percentages, after the (n − k)-th counted vote. Let us first show that (R_k) is a martingale with respect to the filtration
F_k = σ( R_0, …, R_k ) = σ( D_n, D_{n−1}, …, D_{n−k} ).
Denote by A_{n−k} and B_{n−k} the numbers of votes received by A and respectively B among the first n − k counted votes, so that A_{n−k} + B_{n−k} = n − k and A_{n−k} − B_{n−k} = D_{n−k}. Thus, if D_{n−k} is known, the (n − k)-th vote could have been either a vote for A, and the probability of such a vote is A_{n−k}/(n − k), or it could have been a vote for B, and the probability of such a vote is B_{n−k}/(n − k). Hence
E[ D_{n−k−1} | D_{n−k} ] = ( D_{n−k} − 1 ) A_{n−k}/(n − k) + ( D_{n−k} + 1 ) B_{n−k}/(n − k)
= D_{n−k} − D_{n−k}/(n − k) = (n − k − 1)/(n − k) · D_{n−k}.
Dividing by (n − k − 1) we deduce that (R_k)_{0≤k≤n−1} is indeed a martingale.
Now define the stopping times
S := min{ 0 ≤ k ≤ n − 1; R_k = 0 },
where min ∅ := ∞, and T := min(S, n − 1). The stopping time T is bounded and the Optional Sampling Theorem 3.28 implies
E[ R_T ] = E[ R_0 ] = E[ D_n/n ] = (a − b)/(a + b).
Now observe that
E[ R_T ] = E[ R_T I_{S=∞} ] + E[ R_T I_{S<∞} ].
Note that R_T = 0 on {S < ∞}. Observe that if S = ∞, then D_k > 0 for all 1 ≤ k ≤ n. Hence T = n − 1 on {S = ∞}, so R_T = D_1 = 1 on {S = ∞}. We conclude that
(a − b)/(a + b) = E[ R_T ] = P[ S = ∞ ] = the probability that candidate A led throughout the vote count. □
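For small vote counts the ballot identity can be confirmed by exhaustive enumeration (my own sketch, independent of the martingale argument):

```python
from fractions import Fraction
from itertools import permutations

def prob_A_always_ahead(a, b):
    """Exact probability that A leads strictly throughout a random count."""
    votes = [1] * a + [-1] * b           # +1 for A, -1 for B
    good = total = 0
    seen = set()
    for order in permutations(votes):    # ballots of one candidate are
        if order in seen:                # identical, so deduplicate
            continue
        seen.add(order)
        total += 1
        lead, ahead = 0, True
        for v in order:
            lead += v
            if lead <= 0:
                ahead = False
                break
        good += ahead
    return Fraction(good, total)
```

For a = 3, b = 2 this returns 1/5 = (a − b)/(a + b), as the example predicts.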
To visualize this, think that we have an urn with balls labeled by the letters in A in proportions given by π. We sample the urn with replacement and record in succession the labels we draw. We are interested in the moment we first observe the labels a_1, …, a_ℓ in succession as we sample the urn. As a special case, think that we flip a fair coin and stop the first time we see T, H, T, H in succession. In this case A = {H, T}, π(H) = π(T) = 1/2, a = THTH.
An amusing quote by Bertrand Russell comes to mind. "There is a special department of Hell for students of probability. In this department there are many typewriters and many monkeys. Every time that a monkey walks on a typewriter, it types by chance one of Shakespeare's sonnets."
We will compute E[ T_a ] by using a clever martingale method due to Li [107]. The precise answer is contained in (3.1.11).
Let us first observe that E[ T_a ] < ∞. This follows from a very useful trick, [160, E10.5], generalizing the result in Example 1.167. In Exercise 3.6 we ask the reader to provide a proof of this result. It is a nice application of various properties of the conditional expectation. In the case at hand (3.1.9) is satisfied with N_0 = ℓ and r = ( min_{a∈A} π(a) )^ℓ.
Following [107] we consider the following betting game involving the House (casino) and a random number of players. At each moment of time n = 1, 2, … the House samples the alphabet A according to the probability distribution π. (The House runs a chance game with set of outcomes A and probability distribution π.) The outcome of this sampling is the sequence of i.i.d. random variables A_n. Set f(a) := 1/π(a), a ∈ A.
The first player adopts the following a-based strategy.
• At time 0 he bets his fortune F_0^1 = 1 that the outcome of the first game is A_1 = a_1. If A_1 = a_1 his fortune will change to F_1^1 = f(a_1) = 1/π(a_1). Otherwise, he will lose his fortune F_0^1 to the House, so F_1^1 = 0 in this case.
• At time 1 he bets his fortune that A_2 = a_2. If he wins, i.e., A_2 = a_2, his fortune at time 2 will grow to F_2^1 = f(a_2) F_1^1. If he loses, he will have to surrender all his fortune to the House.
• In general, if k ≤ ℓ and his fortune at time k − 1 is F_{k−1}^1 (the fortune could be 0 at that moment), the player bets all his fortune, f(a_k) on a dollar, that A_k = a_k. If this happens, his fortune will grow to F_k^1 = f(a_k) F_{k−1}^1. Otherwise, he will surrender his fortune F_{k−1}^1 to the House, so F_k^1 = 0 in this case.
Concisely, if we define
M_k^1 = { f(a_k) I_{A_k=a_k}, 1 ≤ k ≤ ℓ; 1, k > ℓ, }
then
F_n^1 = Π_{k=1}^n M_k^1.
Since E[ M_k^1 ] = 1 we deduce that F_•^1 and X_•^1 = F_•^1 − 1 are martingales.
In general, for m = 1, 2, …, the m-th player also plays ℓ rounds using the same strategy as the first player, but with a delay of m − 1 units of time.
Thus, the second player skips game 1 and only starts betting at the 2nd game, using the same betting strategy as if the game started when he began playing: at his j-th round he bets f(a_j) on a dollar that the outcome is A_{j+1} = a_j. The third player skips the first two games etc.
In general, at his j-th round, the m-th player bets f(a_j) on a dollar that the outcome is A_{j+m−1} = a_j. We denote by F_n^m the fortune of the m-th player at time n. More precisely, if we set
M_k^m := { f(a_{k−m+1}) I_{A_k=a_{k−m+1}}, m ≤ k ≤ m + ℓ − 1; 1, k < m or k ≥ m + ℓ, }
then
F_n^m := Π_{k=1}^n M_k^m, X_n^m = F_n^m − 1, n = 1, 2, ….
Note that F_n^m = 1 for n < m because the m-th player skips the games 1, 2, …, m − 1. Define
S_n := Σ_{m≥1} X_n^m = Σ_{m=1}^n X_n^m = Σ_{m=1}^n F_n^m − n.
In other words, S_n is the sum of the profits of all the players after n games. The process S_• is obviously a martingale. Note that
S_T = Σ_{m≤T} F_T^m − T, T = T_a.
Recall that T is the first moment of time such that
A_{T−ℓ+1} = a_1, A_{T−ℓ+2} = a_2, …, A_T = a_ℓ. (3.1.10)
Thus the player T − ℓ + 1 will be the first player to hit the jackpot, i.e., observe the pattern a during the first ℓ games he plays. This proves F_T^m = 0 for m ≤ T − ℓ. Indeed, the minimality of T implies
( A_m, …, A_{m+ℓ−1} ) ≠ ( a_1, …, a_ℓ )
and thus I_{A_m=a_1} ··· I_{A_{m+ℓ−1}=a_ℓ} = 0.
Hence
S_T + T = Σ_{m≤T} F_T^m = F_T^{T−ℓ+1} + F_T^{T−ℓ+2} + F_T^{T−ℓ+3} + ···
= f(a_1) ··· f(a_ℓ) + Π_{j=1}^{ℓ−1} f(a_j) δ_{a_{j+1},a_j} + Π_{j=1}^{ℓ−2} f(a_j) δ_{a_{j+2},a_j} + ···
= Σ_{k=0}^{ℓ−1} Π_{j=1}^{ℓ−k} f(a_j) δ_{a_{j+k},a_j} =: τ(a).
On the other hand, for any n < T,
Σ_{m≤n} F_n^m ≤ f(a_1) ··· f(a_ℓ) + Π_{j=1}^{ℓ−1} f(a_j) δ_{a_{j+1},a_j} + Π_{j=1}^{ℓ−2} f(a_j) δ_{a_{j+2},a_j} + ··· = τ(a).
Hence
|S_n| I_{T>n} ≤ ( τ(a) + n ) I_{T>n} ≤ ( τ(a) + T ) I_{T>n}.
Since E[ T ] < ∞ we deduce
lim_{n→∞} E[ |S_n| I_{T>n} ] = 0.
This shows that the stopping time T_a satisfies Doob's conditions, so that
E[ T_a ] = τ(a) = Σ_{k=0}^{ℓ−1} Π_{j=1}^{ℓ−k} δ_{a_{j+k},a_j} / π(a_j). (3.1.11)
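Formula (3.1.11) is mechanical to evaluate; here is a small sketch (my own helper name, with f(a) = 1/π(a) and 0-based indexing of the pattern):

```python
from fractions import Fraction

def expected_waiting_time(pattern, pi):
    """E[T_a] from (3.1.11): sum over shifts k of prod_j delta_{a_{j+k},a_j}/pi(a_j)."""
    ell = len(pattern)
    total = Fraction(0)
    for k in range(ell):
        term = Fraction(1)
        for j in range(ell - k):
            if pattern[j + k] != pattern[j]:     # delta_{a_{j+k}, a_j} = 0
                term = Fraction(0)
                break
            term /= Fraction(pi[pattern[j]])     # multiply by f(a_j) = 1/pi(a_j)
        total += term
    return total
```

For a fair coin this reproduces the values quoted below: E[T_a] = 16 for a = TTHH, 14 for HHH, and 20 for THTH.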
Let us describe this equality using a more convenient notation. Denote by V(A) the vocabulary of the alphabet A,
V(A) = ⊔_{ℓ≥0} A^ℓ, A^0 := {∅}.
then the waiting time τ(a) coincides with the waiting time T to observe the first occurrence of a k-run of 6-s discussed in Example 1.167. In this case we have
E[ T ] = Σ_{j=1}^k 6^j = ( 6^{k+1} − 6 )/5.
We refer to Example A.22 for an R-code that simulates sampling an alphabet until a given pattern is observed.
Let us discuss in more detail the special case A = {H, T}, π(H) = π(T) = 1/2. Suppose that a is the pattern a = TTHH and b = HHH. Observe that ⟨L_j a, R_j a⟩ = 1 for j = 4 and 0 otherwise. Hence E[ T_a ] = 16. A similar computation shows that E[ T_b ] = 14. Thus we have to wait a longer time for the pattern a to occur.
On the other hand, a formula of Conway (see Exercise 3.14) shows that
P[ T_b < T_a ] / P[ T_a < T_b ] = ( Φ(a,a) − Φ(b,a) ) / ( Φ(b,b) − Φ(a,b) ).
We have ⟨L_j a, R_j b⟩ = 0, ∀j, and ⟨R_j a, L_j b⟩ = 1 for j = 1, 2, so that
Φ(a,b) = 0, Φ(b,a) = 6, P[ T_b < T_a ] / P[ T_a < T_b ] = (16 − 6)/(14 − 0) = 5/7.
We have reached a somewhat surprising conclusion: although, on average, we have to wait a shorter amount of time to observe the pattern b, it is less likely that we will observe b before a. The odds that b will appear first versus that a will appear first are 5 : 7.
There are other strange phenomena. We should mention M. Gardner's even stranger nontransitivity paradox [66, Chap. 5]. More precisely, given any pattern a ∈ A^k there exists a pattern b ∈ A^k such that b is more likely to occur before a.
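The probability P[T_b < T_a] itself can be computed exactly with a small absorbing Markov chain on the last max(|a|, |b|) − 1 observed symbols, solving the resulting linear system in exact arithmetic. This is entirely my own sketch (not the book's method) and assumes neither pattern is a suffix of the other.

```python
from fractions import Fraction
from itertools import product

def prob_b_before_a(a, b, alphabet="HT"):
    """Exact P[pattern b is completed before pattern a], i.i.d. uniform letters."""
    m = max(len(a), len(b)) - 1
    states = [""] + ["".join(p) for L in range(1, m + 1)
                     for p in product(alphabet, repeat=L)]
    idx = {s: i for i, s in enumerate(states)}
    step = Fraction(1, len(alphabet))
    n = len(states)
    # Linear system x = A x + c, where x[s] = P[b first | current suffix s].
    A = [[Fraction(0)] * n for _ in range(n)]
    c = [Fraction(0)] * n
    for s in states:
        i = idx[s]
        for ch in alphabet:
            t = s + ch
            if t.endswith(b):              # b completed: contributes prob. step
                c[i] += step
            elif t.endswith(a):            # a completed first: contributes 0
                pass
            else:                          # keep only the last m symbols
                A[i][idx[t[-m:] if len(t) > m else t]] += step
    # Solve (I - A) x = c by Gauss-Jordan elimination over the rationals.
    M = [[(Fraction(1) if r == col else Fraction(0)) - A[r][col]
          for col in range(n)] + [c[r]] for r in range(n)]
    for col in range(n):
        piv = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[piv] = M[piv], M[col]
        inv = M[col][col]
        M[col] = [v / inv for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                f = M[r][col]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[col])]
    return M[idx[""]][-1]
```

For a = TTHH and b = HHH this returns 5/12, i.e., odds of 5 : 7 that b appears first, matching the computation in the text.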
Theorem 3.33 (Azuma). Suppose that (X_n)_{n≥0} is a martingale adapted to a filtration F_• = (F_n)_{n≥0} of the probability space (Ω, S, P). Assume that for any n ∈ N there exist constants a_n < b_n such that the differences D_n = X_n − X_{n−1} satisfy
a_n ≤ D_n ≤ b_n a.s.
Then
∀x > 0, P[ |X_n − X_0| > x ] ≤ 2 e^{ −2x² / ( s_1² + ··· + s_n² ) }, s_k = b_k − a_k. (3.1.14)
We set
Z_n(λ) := E[ e^{λD_n} | F_{n−1} ], ∀n ∈ N, λ ∈ R.
We claim that
∀n ∈ N, ∀λ ∈ R, Z_n(λ) ≤ e^{λ²s_n²/8} a.s. (3.1.16)
Obviously this implies that
E[ e^{λ(X_n−X_0)} ] ≤ e^{λ²s_n²/8} E[ e^{λ(X_{n−1}−X_0)} ].
To prove (3.1.16), fix an atom S of F_{n−1} and denote by E_S the expectation with respect to the conditional probability P[ − | S ]. Since (X_n) is a martingale we deduce
E_S[ D_n ] = 0,
and Hoeffding's lemma yields
E[ Z_n(λ) I_S ] = E[ e^{λD_n} I_S ] = P[S] E_S[ e^{λD_n} ] ≤ P[S] e^{λ²s_n²/8}.
all with the same distribution π. We denote by L_n the length of the longest common subsequence of two random words
( X_1, …, X_n ) and ( Y_1, …, Y_n ).
We set
R_n := L_n/n, R := sup_n R_n.
We will show that R_n is highly concentrated around its mean r_n. We follow the presentation in [144, Sec. 1.3].
Set ℓ_n := E[ L_n ], Z_n = (X_n, Y_n). Consider the finite filtration
F_0 := σ(∅), F_j = σ( Z_1, …, Z_j ), j = 1, …, n,
so that
P[ |R_n − r_n| ≥ x ] ≤ 2 e^{−nx²/2}.
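The random variable L_n is the classical longest-common-subsequence length; it is computed by the standard dynamic program (my own sketch):

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of sequences x and y."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        cur = [0]
        for j, yj in enumerate(y, 1):
            # Extend a match, or carry the best value seen so far.
            cur.append(prev[j - 1] + 1 if xi == yj else max(prev[j], cur[-1]))
        prev = cur
    return prev[-1]
```

The rolling single-row table keeps the memory footprint at O(|y|) while the running time is O(|x||y|).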
Example 3.35 (Bin packing). The bin packing problem has a short formulation: pack n items of sizes x_1, …, x_n ∈ [0, 1] in as few bins of maximum capacity 1 each. We denote by B_n(x_1, …, x_n) the smallest number of bins we can use to pack the items of sizes x_1, …, x_n.
As in the case of the longest common subsequence problem, the bin packing problem has a probabilistic counterpart. Consider independent random variables X_n ∼ Unif([0, 1]), n ∈ N, defined on a probability space (Ω, S, P). We will describe the behavior of b_n := E[ B_n(X_1, …, X_n) ] as n → ∞.
Note that
X_1 + ··· + X_n ≤ B_n(X_1, …, X_n) ≤ n,
and
b_{n+m} ≤ b_n + b_m, ∀n, m ∈ N.
Setting r_n := b_n/n, we deduce from Fekete's Lemma 1.151 that
lim_{n→∞} r_n = r := inf_n r_n.
The inequalities (3.1.17) show that r ∈ [1/2, 1].
We set R_n := B_n/n. We deduce from (3.1.18) and Fekete's Lemma that
R_n → R := inf_n R_n a.s. and r = E[ R ].
We want to show that R_n is highly concentrated around its mean. We use the same approach as in Example 3.34. We set
F_0 = { ∅, Ω }, F_j = σ( X_1, …, X_j ), U_j = U_{n,j} := E[ B_n | F_j ],
so the collection (U_j)_{0≤j≤n} is a martingale adapted to the filtration (F_j)_{0≤j≤n}. There exist Borel measurable maps F_j : [0, 1]^j → N such that U_j = F_j(X_1, …, X_j). More precisely,
F_j(x_1, …, x_j) = ∫_{[0,1]^{n−j}} B_n( x_1, …, x_j, x_{j+1}, …, x_n ) dx_{j+1} ··· dx_n.
This shows that R_n is highly concentrated around its mean and that R_n → r a.s. In this case it is known that r = 1/2. More precisely, there is an algorithm called MATCH which takes as input the sizes x_1, …, x_n of the n items and packs them into M_n = M_n(x_1, …, x_n) boxes, where
n/2 ≤ E[ B_n ] ≤ E[ M_n ] ≤ n/2 + O(√n).
This is the best one can hope for since it is also known that
E[ B_n ] ≥ n/2 + ( √3 − 1 ) √( n/(24π) ) + o(√n).
For details we refer to [34, Sec. 5.1]. □
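Computing B_n exactly is NP-hard; in practice greedy heuristics such as first-fit give the kind of upper bounds the example quotes. A first-fit sketch (my own illustration; the MATCH algorithm mentioned above is a more refined pairing scheme):

```python
def first_fit(sizes, capacity=1.0):
    """Pack items greedily into the first bin that fits; return the bins."""
    bins = []
    for x in sizes:
        for b in bins:
            if sum(b) + x <= capacity:
                b.append(x)
                break
        else:                      # no existing bin fits: open a new one
            bins.append([x])
    return bins
```

For example, first_fit([0.5, 0.5, 0.6, 0.4]) packs the items into 2 bins, which here is optimal.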
The tricks used in the above examples are generalized and refined in McDiarmid’s
inequality.
Let us observe that the above condition is satisfied if and only if f is Lipschitz with respect to the Hamming distance on S^n,
d_H : S^n × S^n → [0, ∞), d_H(s, t) := Σ_{k=1}^n I_{R∖{0}}( s_k − t_k ). (3.1.20)
where h_{k−1}(x_1, …, x_{k−1}) := E_k[ e^{λ(g_k − E_k[g_k])} ]. Fix x_1, …, x_{k−1}; Hoeffding's lemma then shows that
E[ e^{λD_k} | F_{k−1} ] ≤ e^{λ²L_k²/8} a.s., i.e., E[ e^{λ(Z_n − Z_0)} ] ≤ e^{λ²(L_1² + ··· + L_n²)/8}. □
We have seen in the previous section how the Optional Stopping Theorem combined
with quite a bit of ingenuity can produce miraculous results. This section is devoted
to another miraculous property of martingales, namely, their rather nice asymptotic
behavior. The foundational results in this section are all due to J. L. Doob. To
convince the reader of the amazing versatility of martingales we have included a
large eclectic collection of concrete applications.
(Figure: a sequence with downcrossing times S_1, S_2 and upcrossing times T_1, T_2 marked along the time axis.)
The terms S_k are called the downcrossing times while the terms T_k are called the upcrossing times of the sequence (α_n)_{n≥0}. We define the upcrossing numbers
N_n( [a, b], α ) := #{ k ∈ N; T_k(α) ≤ n }, n ∈ N, (3.2.1)
N_∞( [a, b], α ) := lim_{n→∞} N_n( [a, b], α ).
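Counting upcrossings is a pathwise operation; the following sketch (my own) scans a finite sequence for completed upcrossings of [a, b]:

```python
def upcrossings(seq, a, b):
    """Number of completed upcrossings of [a, b] by the finite sequence seq."""
    count, below = 0, False
    for x in seq:
        if not below:
            if x <= a:          # downcrossing: first wait to dip to level a
                below = True
        elif x >= b:            # an excursion reaching b completes one upcrossing
            count += 1
            below = False
    return count
```

The alternating scan mirrors the definition of S_1 < T_1 < S_2 < T_2 < ···: an upcrossing is only counted after the sequence has first dropped to a and then climbed to b.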
Lemma 3.38. Suppose that α = (αn )n≥0 is a sequence of real numbers. Then the
following statements are equivalent.
Suppose now that (Xn )n∈N0 is a process adapted to the filtration F• . Then, for
any k ∈ N, the down/up-crossing times Sk (X) and Tk (X) are stopping times.
we see that it suffices to prove the result in the special case X ≥ 0 and a = 0 < b. In other words, it suffices to prove that if X ≥ 0, then
b E[ N_n([0, b], X) ] ≤ E[ X_n − X_0 ]. (3.2.3)
The key fact underlying this inequality is the existence of a submartingale Y that lies above the random process N_n([0, b], X) and, in the mean, below the process X.
Consider the predictable process
H = Σ_{k=1}^∞ I_{]]S_k(X), T_k(X)]]},
i.e.,
H_n = Σ_{k=1}^∞ I_{S_k(X) < n ≤ T_k(X)}.
changes to X_n at the end of the n-th trading day. Then Y_n is the profit following this strategy at the end of n days. Clearly the profit will be at least as big as b × the number of upcrossings of the interval (0, b). This is the content of the following fundamental inequality:
Y_n ≥ b N_n( [0, b], X ). (3.2.4)
Here is a formal proof of this inequality. Let M := N_n( [0, b], X ). Then
Y_n = Σ_{j=1}^n H_j ( X_j − X_{j−1} ) = Σ_{k=1}^M Σ_{j=S_k(X)+1}^{T_k(X)} ( X_j − X_{j−1} ) + Σ_{j=S_{M+1}+1}^{n} ( X_j − X_{j−1} )
= Σ_{k=1}^M ( X_{T_k} − X_{S_k} ) + I_{S_{M+1}<n} ( X_n − X_{S_{M+1}} ).
Hence
b E[ N_n([0, b], X) ] ≤ E[ Y_n ], ∀n ∈ N.
Note that the inequality (3.2.4) does not rely on the fact that X is a submartingale.
The process (H_n) is predictable and thus
E[ Y_k − Y_{k−1} | F_{k−1} ] = E[ H_k ( X_k − X_{k−1} ) | F_{k−1} ] = H_k E[ X_k − X_{k−1} | F_{k−1} ].
Since X is a submartingale we deduce
E[ X_k − X_{k−1} | F_{k−1} ] ≥ 0.
On the other hand, H_k ≤ 1, so that
H_k E[ X_k − X_{k−1} | F_{k−1} ] ≤ E[ X_k − X_{k−1} | F_{k−1} ].
Hence
E[ Y_k − Y_{k−1} ] ≤ E[ X_k − X_{k−1} ]. (3.2.5)
We deduce
b E[ N_n( [0, b], X ) ] ≤ E[ Y_n ] = Σ_{k=1}^n E[ Y_k − Y_{k−1} ] ≤ E[ X_n − X_0 ]. □
so that
sup_{n∈N_0} E[ X_n^− ] < ∞
Proof. Set
M := sup_{n∈N_0} E[ |X_n| ].
Now let a, b ∈ Q, a < b. Doob's upcrossing inequality shows that, for all n ≥ 1, we have
(b − a) E[ N_n([a, b], X_•) ] ≤ E[ (X_n − a)^+ ] ≤ |a| + E[ |X_n| ] ≤ |a| + M.
Letting n → ∞ we deduce E[ N_∞([a, b], X_•) ] < ∞, and thus N_∞([a, b], X_•) < ∞ a.s. By removing a countable family of negligible sets (one for each pair of rational numbers a, b, a < b) we deduce that there exists a negligible set N ⊂ Ω such that ∀ω ∈ Ω ∖ N we have
N_∞([a, b], X_•)(ω) < ∞, ∀a, b ∈ Q, a < b.
Lemma 3.38 implies that the sequence X_• converges a.s. to a random variable X_∞. The integrability of X_∞ follows from Fatou's lemma:
E[ |X_∞| ] ≤ lim inf_{n→∞} E[ |X_n| ] < ∞. □
Proof. Observe that Y_n = −X_n is a submartingale and Y_n^+ = 0. The result now follows from the Submartingale Convergence Theorem. □
Thus, the expected population decays exponentially to zero. Something more dramatic holds. Since Z_n ≥ 1 if Z_n > 0, we deduce
P[ Z_n > 0 ] = P[ Z_n ≥ 1 ] ≤ E[ Z_n ] = ℓ m^n.
Hence, since m < 1,
Σ_{n≥0} P[ Z_n > 0 ] < ∞.
The Borel-Cantelli Lemma implies that P[ Z_n > 0 i.o. ] = 0.
2 To the post-pandemic reader. I wrote most of this book during the great covid pandemic. I even taught this example to a group of masked students that were numbed by the news about the R-factor. The mean m is a close relative of this R-factor. This example explains the desirability of R < 1.
Thus, a population of bacteria that have on average less than one successor will die out, i.e., with probability 1 there exists n ∈ N such that Z_n = 0. If we set
E_n := { Z_k = 0, ∀k ≥ n } = { Z_n = 0 },
then the event
E = ⋃_{n≥0} E_n
Remark 3.46. It is known that a random series with independent terms converges a.s. if and only if it converges in probability; see Exercise 2.2. However, there exist martingales that converge in probability, but not a.s.
Here is one such example, [53, Example 4.2.14]. Consider the following random walk (X_n)_{n≥0} on Z, where you should think of X_n as the location at time n. We set X_0 = 0. If X_{n−1} is known, then
P[ X_n = ±1 | X_{n−1} = 0 ] = 1/(2n), P[ X_n = 0 | X_{n−1} = 0 ] = 1 − 1/n,
P[ X_n = 0 | X_{n−1} = x ≠ 0 ] = 1 − 1/n, P[ X_n = nx | X_{n−1} = x ≠ 0 ] = 1/n.
The existence of such a process is guaranteed by Kolmogorov's theorem.
Denote by F_n the sigma-algebra generated by the random variables X_0, X_1, …, X_n. From the construction we deduce that E[ X_n | X_{n−1} ] = X_{n−1}, so (X_n) is a martingale with respect to the filtration F_n. Let p_n := P[ X_n ≠ 0 ]. Note that
p_n = P[ X_n ≠ 0 | X_{n−1} = 0 ] P[ X_{n−1} = 0 ] + P[ X_n ≠ 0 | X_{n−1} ≠ 0 ] P[ X_{n−1} ≠ 0 ]
= (1/n)( 1 − p_{n−1} ) + (1/n) p_{n−1} = 1/n.
Hence
lim_{n→∞} P[ X_n ≠ 0 ] = 0.
Hence
Σ_{n≥1} E[ I_{F_n} | F_{n−1} ] = Σ_{n≥1} 1/n = ∞,
then X_n converges in probability to 0 but not a.s. For details we refer to [128]. □
S_0 = 1, S_n = 1 + X_1 + ··· + X_n, n ∈ N,
is a martingale describing the evolution of the walk. Denote by N the first moment the walk reaches the origin, i.e.,
N := inf{ n ∈ N; S_n = 0 }.
Observe that N < ∞ a.s.; see Exercise 3.13. Consider the random walk stopped at N,
Y_n := S_n^N = S_{n∧N}.
From the Optional Stopping Theorem 3.24 we deduce that Y_n is a martingale which, by construction, is also nonnegative. Clearly Y_n → 0 a.s. since N < ∞ a.s. This convergence is not L^1 since
E[ Y_n ] = E[ Y_0 ] = 1, ∀n ∈ N. □
Proof. The sequence Z_n := |X| I_{|X|>n} converges a.s. to 0 and |Z_n| ≤ |X|, ∀n. The desired conclusion now follows from the Dominated Convergence theorem. □
i.e., χ(0) < ∞. Indeed, ∀X ∈ X, if r is sufficiently large so that χ(r) < 1, we have
E[ |X| ] = E[ |X| I_{|X|<r} ] + E[ |X| I_{|X|≥r} ] ≤ r + χ(r) < ∞. □
Proof. (i) ⇒ (ii) Fix ε > 0. There exists r_ε > 0 such that χ(r_ε) < ε/2. Now fix δ > 0 such that δr_ε < ε/2. Then, for any X ∈ X and any S ∈ S such that P[S] < δ, we have
E[ |X| I_S ] = E[ |X| I_{S∩{|X|<r_ε}} ] + E[ |X| I_{S∩{|X|≥r_ε}} ]
≤ r_ε P[S] + E[ |X| I_{|X|≥r_ε} ] ≤ δr_ε + χ(r_ε) < ε. □
(i) X is UI.
(ii)
lim_{r→∞} sup_{X∈X} ∫_r^∞ P[ |X| > x ] dx = 0.
(iii) There exists a convex increasing function f : [0, ∞) → [0, ∞) which is also superlinear,
lim_{r→∞} f(r)/r = ∞,
and satisfies
sup_{X∈X} E[ f(|X|) ] < ∞. (3.2.8)
Note that h(0) ≤ r + h(r) < ∞. Since h(r) = o(1) as r → ∞ we can find
0 = r_0 < r_1 < r_2 < ···
such that
h(r_n) ≤ h(0)/2^n, ∀n ∈ N.
Now define
g(r) := Σ_{n≥0} I_{[r_n,∞)}(r), f(x) = ∫_0^x g(r) dr.
Note that g(r) is nondecreasing and lim_{r→∞} g(r) = ∞. This shows that f is increasing, convex and superlinear. Using the Fubini-Tonelli theorem as in the proof of Proposition 1.126 we deduce
E[ f(|X|) ] = E[ ∫_0^{|X|} g(r) dr ] = Σ_{n≥0} E[ ∫_{r_n}^∞ I_{|X|>x} dx ] ≤ Σ_{n≥0} h(r_n) ≤ h(0) Σ_{n≥0} 2^{−n} < ∞.
Then X is UI. □
Proof. Lemma 3.48 shows that the family {X} consisting of the integrable random variable X is uniformly integrable. Theorem 3.54 implies that there exists a superlinear, convex, increasing function f : [0, ∞) → [0, ∞) such that E[ f(|X|) ] < ∞. From the conditional Jensen inequality in Theorem 1.166(ix) we deduce that
|X_i| = | E[ X | F_i ] | ≤ E[ |X| | F_i ].
The next result clarifies the importance of the uniform integrability condition.
From Corollary 1.145 we deduce that Φ_{M(ε)}(X_n) converges to Φ_{M(ε)}(X) in probability. Moreover,
|Φ_{M(ε)}(X_n)| < M(ε), ∀n ∈ N.
The Bounded Convergence Theorem 1.153 implies that there exists n(ε) > 0 such that for any n ≥ n(ε) we have
E[ |Φ_{M(ε)}(X_n) − Φ_{M(ε)}(X)| ] < ε/2.
From (3.2.9) we deduce that E[ |X_n − X| ] < ε for n > n(ε).
Clearly (ii) ⇒ (iii) since X_n → X in L^1 implies ‖X_n‖_{L^1} → ‖X‖_{L^1}.
(iii) ⇒ (i) For any M > 0 consider the continuous function
Ψ_M : [0, ∞) → R, Ψ_M(x) = { x, x ∈ [0, M−1]; 0, x ≥ M; linear, x ∈ (M−1, M). } (3.2.10)
Then
E[ |X_n| I_{|X_n|>M(ε)} ] < | E[ |X_n| ] − E[ |X| ] | + | E[ Ψ_{M(ε)}(|X|) ] − E[ Ψ_{M(ε)}(|X_n|) ] | + ε/2.
We can choose n = n(ε, M(ε)) so that for n > n(ε) we have
| E[ |X_n| ] − E[ |X| ] | + | E[ Ψ_{M(ε)}(|X|) ] − E[ Ψ_{M(ε)}(|X_n|) ] | < ε/2.
Hence for any M ≥ M(ε)
sup_{n>n(ε)} E[ |X_n| I_{|X_n|>M} ] ≤ sup_{n>n(ε)} E[ |X_n| I_{|X_n|>M(ε)} ] < ε.
Remark 3.58. (a) The implication (iii) ⇒ (ii) is sometimes referred to as Scheffé’s
Lemma.
(b) We used the Bounded Convergence Theorem to prove the implication (i) ⇒
(ii). Obviously the Bounded Convergence Theorem is a special case of this impli-
cation. One can prove the equivalence (i) ⇐⇒ (ii) without relying on the Bounded
Convergence Theorem; see [50, Thm. 10.3.6].
(c) The sequence in Example 2.27 converges in law and is uniformly integrable, yet it does not converge in probability. This shows that in the above theorem we cannot relax the convergence-in-probability condition to convergence in law. □
Theorem 3.59. Suppose that (X_n)_{n∈N_0} is a martingale adapted to the filtration (F_n)_{n≥0} of (Ω, F, P). Set
F_∞ := ⋁_{n≥0} F_n = σ( F_n, n ≥ 0 ).
If the above conditions are satisfied, then the limiting random variable X_∞ in (ii) and (iii) is related to the random variable X in (iv) via the equality X_∞ = E[ X | F_∞ ], i.e.,
lim_{n→∞} E[ X | F_n ] = E[ X | F_∞ ]
a.s. and in L^1.
Proof. Note that if a martingale (X_n) is UI, then it is bounded in L^1 and, according to Theorem 3.41, converges a.s. to an integrable random variable X_∞. In view of the previous discussion the statements (i)–(iii) are equivalent. The implication (iv) ⇒ (i) follows from Corollary 3.56. The only thing left to prove is (iii) ⇒ (iv). More precisely, we will show that if X_n → X_∞ in L^1, then
X_n = E[ X_∞ | F_n ] a.s., ∀n ∈ N_0.
In other words, we have to show that, for all m ∈ N_0 and all A ∈ F_m, we have
E[ X_m I_A ] = E[ X_∞ I_A ].
Now let n → ∞.
Suppose now that for some integrable random variable X we have X_n = E[ X | F_n ]. We want to show that
lim_n X_n = X_∞ := E[ X | F_∞ ].
In particular, we deduce
Corollary 3.60 (Lévy's 0-1 law). For any set A ∈ F_∞, the random variables
E[ I_A | F_n ], n ∈ N,
converge a.s. and in L^1 to I_A; in particular, for any H ∈ F_∞,
E[ I_H | F_n ] → I_H a.s.
If p_0 = µ(0) > 0, then, with probability 1, the population either becomes extinct or explodes, i.e.,
E = U^c, P[ E ∪ U ] = 1. (3.2.13)
In particular,
E = { lim_n Z_n = 0 }. (3.2.14)
Note that
∀ν ∈ N_0, ∃δ(ν) ∈ (0, 1) : ∀n ∈ N, P[ E | Z_1, …, Z_n ] ≥ δ(ν) on {Z_n ≤ ν}. (3.2.15)
Indeed, if the population of the n-th generation has at most ν individuals then the probability that there will be no (n + 1)-th generation is at least p_0^ν. More formally,
P[ E | Z_1, …, Z_n ] ≥ P[ E_{n+1} | Z_1, …, Z_n ] = P[ E_{n+1} | Z_n ].
We have
P[ E_{n+1} | Z_n ] I_{Z_n≤ν} = Σ_{k=0}^ν P[ E_{n+1} | Z_n = k ] I_{Z_n=k} = Σ_{k=0}^ν p_0^k I_{Z_n=k} ≥ p_0^ν I_{Z_n≤ν}.
This proves (3.2.15) with δ(ν) = p_0^ν.
Since the Z_n are integer valued we deduce that
Lemma 3.63. Suppose that (Z_n)_{n≥1} is a sequence of nonnegative random variables. Set
E = { Z_n = 0 for some n }, B := { sup_n Z_n < ∞ }.
If (Z_n) satisfies (3.2.15), then
E ⊃ B. (3.2.16)
In our special case, B = U^c. Note also that if the population dies at a time n_0, then Z_n = 0, ∀n ≥ n_0. Hence E ⊂ B or, in view of (3.2.16), E = B = U^c. This proves the claimed dichotomy (3.2.13).
When m = 1, then W_n = Z_n converges almost surely to an integrable random variable and we see that
{ lim_n Z_n < ∞ } ⊂ { sup_n Z_n < ∞ } ⊂ E,
and we deduce that
1 ≥ P[ E ] ≥ P[ lim_n Z_n < ∞ ] = 1.
Thus, when m = 1 and the probability of having no successors is positive, i.e., µ(0) > 0, the extinction probability is also 1. One can show (see [6, Sec. I.9] or [89]) that if m = 1 and
σ² := Var[ X_{n,j} ] = Σ_k k(k−1) µ(k) < ∞,
then
lim_{n→∞} n P[ Z_n > 0 ] = 2/σ².
Thus, the probability of the population surviving more than n generations, given that individuals have on average 1 successor, is O(1/n).
When m > 1 the extinction probability is still positive but < 1. Exercise 3.26 describes this probability and gives additional information about the distribution of W. For more details about branching processes we refer to [6; 80]. □
X̂_T = X_∞^T := lim_{n→∞} X_{T∧n} = lim_{n→∞} X_n^T. (3.2.18)
Using | E[ X | F ] | ≤ E[ |X| | F ] we deduce
E[ |X̂_T| ] ≤ Σ_{n≥0} E[ I_{T=n} E[ |X_∞| | F_n ] ] + E[ I_{T=∞} |X_∞| ].
Proposition 3.66. Suppose that (X_n)_{n≥0} is a martingale adapted to the filtration (F_n)_{n≥0} and T is an a.s. finite stopping time adapted to the same filtration. Then the following statements are equivalent.
(i) The stopping time satisfies Doob's conditions (3.1.7b) and (3.1.7c).
(ii) The stopped martingale X_n^T = X_{T∧n} is UI.
Proof. (i) ⇒ (ii) Consider the submartingale |X_n|. Since T satisfies Doob's conditions we deduce from Theorem 3.28 that
E[ |X_T| ] ≥ E[ |X_{T∧n}| ] ≥ E[ |X_0| ], ∀n ≥ 0.
Thus
lim sup_{n→∞} E[ |X_{T∧n}| ] ≤ E[ |X_T| ].
Since T < ∞ a.s. we have X_{T∧n} → X_T a.s., and Fatou's lemma gives the opposite inequality for the lim inf, so that
lim_{n→∞} E[ |X_{T∧n}| ] = E[ |X_T| ]. □
E[ X_T | F_S ] = X_S.
E[ X_T | F_S ] = E[ X_∞^T | F_S ] = X_S^T = X_S. □
Proposition 3.69. Suppose that (Mn)n≥0 is a random process adapted to the filtration (Fn)n≥0 such that
E[ |Mn| ] < ∞, ∀n,
and T is a stopping time adapted to the same filtration. Suppose that
E[ T ] < ∞, (3.2.21a)
M_{T∧n} = Σ_{k=0}^{n−1} M_k ( I_{T≥k} − I_{T≥k+1} ) + M_n I_{T≥n} = M_0 + Σ_{k=1}^{n} ( M_k − M_{k−1} ) I_{T≥k},
so
|M_{T∧n}| ≤ |M_0| + Σ_{k=1}^{n} | M_k − M_{k−1} | I_{T≥k}.
Set
Y := |M_0| + Σ_{k=1}^{∞} | M_k − M_{k−1} | I_{T≥k}.
Clearly |M_{T∧n}| ≤ Y, ∀n ≥ 0. We will show that E[Y] < ∞. We have ({T ≥ k} ∈ F_{k−1})
E[ |M_k − M_{k−1}| I_{T≥k} ] = E[ E[ |M_k − M_{k−1}| I_{T≥k} ‖ F_{k−1} ] ]
= E[ I_{T≥k} E[ |M_k − M_{k−1}| ‖ F_{k−1} ] ] ≤ C E[ I_{T≥k} ] = C P[ T ≥ k ],
where the last inequality uses (3.2.21b). Thus
E[Y] ≤ E[ |M_0| ] + C Σ_{k=1}^{∞} P[ T ≥ k ] = E[ |M_0| ] + C E[T] < ∞,
where the last equality uses (3.2.21a). □
Theorem 3.70 (Wald's formula). Suppose that (Yn)n≥1 is a sequence of i.i.d. integrable random variables with finite mean µ. Set
S_n := Σ_{k=1}^{n} Y_k,
and let T be a stopping time adapted to the filtration F_n := σ(Y_1, . . . , Y_n) such that E[T] < ∞. Then the following hold.
(i) E[ S_T ] = µ E[T].
(ii) If µ = 0 and σ² := Var[Y_1] < ∞, then Var[S_T] = E[S_T²] = σ² E[T].

Proof. (i) Set M_n := S_n − nµ, so that M_n = Ȳ_n + M_{n−1}, where Ȳ_n := Y_n − µ. Then
E[ M_n ‖ F_{n−1} ] = E[ Ȳ_n + M_{n−1} ‖ F_{n−1} ] = M_{n−1}.
Observe that
E[ |M_n − M_{n−1}| ‖ F_{n−1} ] = E[ |Ȳ_n| ‖ F_{n−1} ] = E[ |Ȳ_n| ] = E[ |Ȳ_1| ],
so that (3.2.21b) is satisfied. We deduce from Proposition 3.69 that the stopped martingale M_n^T is UI and the Optional Sampling Theorem implies
0 = E[M_0] = E[M_T] = E[S_T] − µ E[T].
(ii) From (i) we deduce E[S_T] = 0, so Var[S_T] = E[S_T²]. Set
Q_n := Σ_{k=1}^{n} Y_k².
We have
E[ S_n² ] = Σ_{k=1}^{n} E[ Y_k² ] + 2 Σ_{1≤i<j≤n} E[ Y_i Y_j ] = E[ Q_n ].
We deduce from Proposition 3.69 that the stopped martingale Z_n^T, Z_n := Q_n − σ²n, is UI and the Optional Sampling Theorem implies
0 = E[Z_0] = E[Z_T] = E[Q_T] − σ² E[T] = E[S_T²] − σ² E[T] = Var[S_T] − σ² E[T]. □
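Wald's identity is easy to test by simulation. In the following Python sketch (function names are ours, not the book's), Y_k is uniform on {1, 2, 3}, so µ = 2, and T is the first time S_n reaches a target, a stopping time with E[T] < ∞; the sample averages of S_T and of µT should agree up to Monte Carlo error:

```python
import random

def wald_sample(target=20, trials=4000, seed=7):
    """Average S_T and T over many runs, where Y_k is uniform on {1,2,3}
    (mu = 2) and T = inf{n : S_n >= target} is a stopping time."""
    rng = random.Random(seed)
    tot_S = tot_T = 0
    for _ in range(trials):
        s = n = 0
        while s < target:
            s += rng.choice((1, 2, 3))
            n += 1
        tot_S += s
        tot_T += n
    return tot_S / trials, tot_T / trials
```

By Wald's formula the two returned averages satisfy E[S_T] = 2 E[T] exactly; the simulation reproduces this up to sampling noise.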
S_{n+1} > t, i.e., that n is the largest index k such that S_k ≤ t. If Wald's formula were true in this case it would predict E[ S_{N(t)} ] = t. However, we know from (1.3.50) that
E[ S_{N(t)} ] = t − 1/λ + e^{−λt}/λ.
Let us observe that T = N(t) + 1 is a stopping time adapted to the filtration F_n. Indeed,
{T = n} ⟺ {N(t) = n − 1}
Thus, the ruin probability is 1 − p_k(N) = (N − k)/N = 1 − k/N. In Example A.19 we describe R code simulating this situation.
B. p ≠ 1/2, so the game is biased. Consider De Moivre's martingale Mn defined in Example 3.7, i.e.,
M_n = (q/p)^{S_n}.
The stopped martingale M^T is UI since it is bounded. Hence
E[ M_T ] = E[ M_0 ] = (q/p)^k.
If we set p_k(N) := P[ S_T = N ], then we deduce
(q/p)^k = (q/p)^0 · P[ S_T = 0 ] + (q/p)^N · P[ S_T = N ] = ( 1 − p_k(N) ) + (q/p)^N p_k(N).
Hence
p_k(N) = ( (q/p)^k − 1 ) / ( (q/p)^N − 1 ). □
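The book's Example A.19 simulates this situation in R; as an illustrative Python sketch (function names are ours), the closed formula for p_k(N) can be checked against a direct simulation of the biased random walk absorbed at 0 and N:

```python
import random

def ruin_win_prob(k, N, p):
    """Exact probability p_k(N) of reaching N before 0 from fortune k,
    via De Moivre's martingale: ((q/p)^k - 1) / ((q/p)^N - 1)."""
    if p == 0.5:
        return k / N
    r = (1.0 - p) / p
    return (r**k - 1.0) / (r**N - 1.0)

def ruin_win_sim(k, N, p, trials=20000, seed=3):
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        s = k
        while 0 < s < N:
            s += 1 if rng.random() < p else -1
        wins += (s == N)
    return wins / trials
```

For k = 5, N = 10, p = 0.45 the exact value is about 0.268, noticeably below the fair-game value 1/2.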
Example 3.73 (The coupon collector problem revisited). Let us recall the coupon collector problem we discussed in Example 1.112.
Suppose that each box of cereal contains one of m different coupons. Once you obtain one of every type of coupon, you can send in for a prize. Ann wants that prize and, for that reason, she buys one box of cereal every day. Assuming that the coupon in each box is chosen independently and uniformly at random from the m possibilities, and that Ann does not collaborate with others to collect coupons, how many boxes of cereal is she expected to buy before she obtains at least one of every type of coupon?
Let N denote the number of boxes bought until Ann has at least one of every coupon. We have shown in Example 1.112 that
E[N] = mH_m, H_m := 1 + 1/2 + · · · + 1/(m−1) + 1/m.
Suppose now that Ann has a little brother, Bob, and, every time she collects a coupon she already has, she gives it to Bob. At the moment she completes her collection, Bob is missing B coupons. What is the expectation of B?
To answer this question we follow the approach in [61, Sec. 12.5]. Assume that
the coupons are labelled 1, . . . , m. We denote by Ck the label of the coupon Ann
found in the k-th box she bought. Thus (Ck )k≥1 are i.i.d., uniformly distributed in
{1, . . . , m}. We set
Fn := σ C1 , . . . , Cn , n ∈ N.
Denote by Xn the number of coupons Ann is missing after she bought n cereal
boxes and by Yn the number of coupons that have appeared exactly one time in the
first n boxes Ann bought.
E[ ∆Z_n ‖ F_n ] = 0. (3.2.24)
Let us observe that, when Ann buys a new cereal box, there are only three mutually exclusive possibilities. The first possibility corresponds to Ann obtaining a new coupon, the second possibility corresponds to Bob obtaining a new coupon, and the third possibility occurs when the (n + 1)-th coupon is owned by both Ann and Bob. Hence
∆Z_n = I_{∆X_n=−1} ( f(X_n − 1, Y_n + 1) − f(X_n, Y_n) ) + I_{∆Y_n=−1} ( f(X_n, Y_n − 1) − f(X_n, Y_n) ),
and, by (3.2.23),
E[ ∆Z_n ‖ F_n ] = (X_n/m) ( f(X_n − 1, Y_n + 1) − f(X_n, Y_n) ) + (Y_n/m) ( f(X_n, Y_n − 1) − f(X_n, Y_n) ) = 0. □
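The baseline formula E[N] = mH_m is simple to verify numerically. A minimal Python sketch (function names are ours, not the book's):

```python
import random

def harmonic(m):
    """H_m = 1 + 1/2 + ... + 1/m."""
    return sum(1.0 / j for j in range(1, m + 1))

def mean_boxes(m=10, trials=4000, seed=11):
    """Average number of boxes until Ann owns all m coupon types,
    with each box's coupon uniform on {0, ..., m-1}."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        seen, n = set(), 0
        while len(seen) < m:
            seen.add(rng.randrange(m))
            n += 1
        total += n
    return total / trials
```

For m = 10 the theoretical value mH_m is about 29.29, and the sample mean should land close to it.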
Theorem 3.75. If (Xn )n∈N0 is a submartingale, then the following are equivalent.
Corollary 3.76. Suppose that X• = (Xn )n∈N0 is a submartingale with Doob de-
composition Xn = X0 + Mn + Cn , where (Mn )n∈N0 is the martingale component
and (Cn )n∈N0 is the predictable compensator. Then the following are equivalent.
Before each game the player bets a sum s, called the stake, that cannot be larger than his fortune at that moment. If he wins, his fortune increases by the amount that he bet. Otherwise he loses his stake.
The player starts with a sum of money x and decides that he will play until
the first moment his fortune goes above a set sum, the goal, say 1. His strategy is
based on a function σ(x). If his fortune after n games is Xn , then the amount he
wagers for the next game depends on his current fortune Xn and is σ(Xn ). The
player stops playing when, either he is broke, or he has reached (or surpassed) his
goal. The function σ is known as the strategy of the gambler.
We denote by π(x, σ) the probability that the gambler will reach his goal using
the strategy σ, given that his initial fortune is x.
We want to show that the strategy that maximizes the winning probability
π(x, σ) is the “go-bold ” strategy: if your fortune is less than half the goal, bet it
all, and if your fortune is more than half the goal, bet as much as you need to
reach your goal. Our presentation follows [64, §24.8]. To find out about gambling
strategies for more complex games we refer to [49].
First let us introduce the appropriate formalism. The strategies will be chosen
from a space S, the collection of measurable functions σ : [0, ∞) → [0, ∞) such that
σ(x) ≤ x, ∀x ∈ [0, 1] and σ(x) = 0, ∀x > 1.
Note that the stopping rule is built into the definition of S.
The sequence of games is encoded by a sequence of i.i.d. random variables (Yn)n∈N such that
P[ Yn = 1 ] = p, P[ Yn = −1 ] = 1 − p, 0 < p < 1/2.
For each x ≥ 0 and each σ ∈ S define inductively a sequence of random variables X_n = X_n^{x,σ},
X_0^{x,σ} = x, X_{n+1} = X_n + σ(X_n) Y_{n+1}, n ≥ 0. (3.2.27)
We denote by (Ω, S, P) the probability space where the random variables X_n and Y_n are defined. Thus X_n^{x,σ} is the fortune of the player after n games starting with initial fortune x and using the strategy σ. Note that σ(X_n) is the amount of money the player bets before the (n + 1)-th game. It depends only on his fortune X_n at that time. If Y_{n+1} = 1 the player gains σ(X_n) and if Y_{n+1} = −1, the player loses this amount. His strategy σ stays the same for the duration of the game.
Let us observe first that
X_∞^{x,σ} := lim_{n→∞} X_n^{x,σ}
exists a.s. and in L¹. We will prove this by showing that X_n^{x,σ} is a bounded supermartingale.
Since σ(x) ≤ x we deduce x − σ(x) ≥ 0, and we deduce inductively that X_n ≥ 0 a.s., ∀n. Next, observe that the fortune stops changing once it exceeds 1 (the bets are then 0), while a fortune y ≤ 1 increases to at most y + σ(y) ≤ 2y ≤ 2. We deduce inductively that X_n ≤ max(x, 2) a.s., ∀n.
We have E[Yn] = 2p − 1 < 0 and thus
E[ X_{n+1} ‖ F_n ] = X_n + σ(X_n) E[ Y_{n+1} ] ≤ X_n.
Thus
lim_{n→∞} σ( X_n(ω) ) = σ( X_∞(ω) ) > 0.
Lemma 3.81. Let σ0 ∈ S and set h0(x) := h(x, σ0), π0(x) := π(x, σ0). If h0 is continuous and satisfies
h0(x) ≥ p h0(x + s) + (1 − p) h0(x − s), ∀0 ≤ s ≤ x ≤ 1, (3.2.29)
then, for any σ ∈ S and any x ∈ [0, 1] we have π(x, σ0) = h0(x) ≥ π(x, σ).
Proof. Fix σ ∈ S and x ∈ [0, 1] and set X_n = X_n^{x,σ}. We set h0(x) = 1 for x ≥ 1. This is a natural condition: if the initial fortune is greater than the goal then the probability of achieving the goal is 1.
Observe that the random process Y_n = h0(X_n) is a supermartingale. Indeed,
E[ h0(X_{n+1}) ‖ F_n ] = E[ h0( X_n + σ(X_n) Y_{n+1} ) ‖ F_n ]
= p h0( X_n + σ(X_n) ) + (1 − p) h0( X_n − σ(X_n) ) ≤ h0(X_n),
where the last inequality follows from (3.2.29). Thus Y_n is a bounded supermartingale, so E[Y_0] ≥ E[Y_n], and E[Y_0] = E[ h0(X_0^{x,σ}) ] = h0(x).
On the other hand, since h0 is continuous and bounded we deduce that h0(X_n) converges a.s. and in L¹ to h0(X_∞^{x,σ}). Thus
h0(x) ≥ E[ Y_∞ ] = E[ h0(X_∞^{x,σ}) ] ≥ P[ X_∞^{x,σ} ≥ 1 ] ≥ π(x, σ). □
Define σ0 ∈ S by
σ0(x) := min(x, 1 − x) for x ∈ [0, 1], σ0(x) := 0 for x > 1,
and set h0(x) := h(x, σ0), π0(x) := π(x, σ0). We want to show that σ0 satisfies all the conditions of Lemma 3.81.
Clearly σ0 is a continuous strategy. By construction, for any x ∈ [0, 1] we have 0 ≤ X_n^{x,σ0} ≤ 1 a.s., so
π0(x) = h0(x) = E[ X_∞^{x,σ0} ], ∀x ∈ [0, 1].
The functions
[0, 1] ∋ x ↦ x + σ0(x) ∈ [0, 1], [0, 1] ∋ x ↦ x − σ0(x) ∈ [0, 1]
are non-decreasing. We deduce inductively that if x ≤ y then
E[ X_n^{x,σ0} ] ≤ E[ X_n^{y,σ0} ].
Set
D := { k/2^n ; n ∈ N0, 0 ≤ k ≤ 2^n }.
We will prove by induction on n that (3.2.29) holds for x of the form x = k/2^n. Start with n = 1, so x = 1/2. Using (3.2.30) we have
h(1/2) − p h(1/2 + s) − (1 − p) h(1/2 − s)
= p − p( p + (1 − p) h(2s) ) − (1 − p) p h(1 − 2s)
= p(1 − p)( 1 − h(2s) − h(1 − 2s) ) ≥ 0,
where at the last step we used the fact that h(x) ≤ x, ∀x ∈ [0, 1].
For the inductive step, assume that n > 1 and x = k/2^n, k < 2^n. Choose s ∈ [0, x]. We consider several cases.
Case 1. x + s ≤ 1/2. Using (3.2.30) and the induction hypothesis we deduce
p h(x + s) + (1 − p) h(x − s) = p( p h(2x + 2s) + (1 − p) h(2x − 2s) ) ≤ p h(2x) = h(x).
Case 2. x − s ≥ 1/2. Similar to Case 1.
Case 3. x ≤ 1/2 and x + s ≥ 1/2. Using (3.2.30) we have
A := h(x) − p h(x + s) − (1 − p) h(x − s)
= p h(2x) − p( p + (1 − p) h(2x + 2s − 1) ) − (1 − p) p h(2x − 2s)
= p( h(2x) − p − (1 − p) h(2x + 2s − 1) − (1 − p) h(2x − 2s) ).
Observe that 1/2 ≤ x + s ≤ 2x. Using (3.2.30) we deduce
h(2x) = p + (1 − p) h(4x − 1),
so that
A = p( p + (1 − p) h(4x − 1) − p − (1 − p) h(2x + 2s − 1) − (1 − p) h(2x − 2s) )
= p(1 − p)( h(4x − 1) − h(2x + 2s − 1) − h(2x − 2s) )
= (1 − p)( h(2x − 1/2) − p h(2x + 2s − 1) − p h(2x − 2s) ),
where we used p h(4x − 1) = h(2x − 1/2), valid by (3.2.30) since 2x − 1/2 ≤ 1/2. Since p ≤ 1 − p,
A ≥ (1 − p)( h(2x − 1/2) − p h(2x + 2s − 1) − (1 − p) h(2x − 2s) ).
The induction hypothesis implies h(2x − 1/2) − p h(2x + 2s − 1) − (1 − p) h(2x − 2s) ≥ 0.
Case 4. x ≥ 1/2 and x − s ≤ 1/2. This is similar to the previous case.
As an illustration let us compute h0(21/32). Note that 21/32 has the binary expansion
21/32 = 0.10101,
so that
h(21/32) = f1 ∘ f0 ∘ f1 ∘ f0 (p) = f1 ∘ f0 ∘ f1 (p²) = f1 ∘ f0 ( p + p² − p³ ) = f1 ( p² + p³ − p⁴ )
= p + (1 − p)( p² + p³ − p⁴ ) = p + p² + p³ − p⁴ − p³ − p⁴ + p⁵ = p + p² − 2p⁴ + p⁵.
For example, if the winning probability is p = 0.4, then h0(21/32) = 0.51904 > 0.5. Thus, although the winning probability p < 0.5, using this strategy with an initial fortune 21/32, the odds of increasing the fortune to 1 are better than 50 : 50.
If the initial fortune is x = 1/4, then using its binary expansion 1/4 = 0.01 we deduce
h0(1/4) = p h0(1/2) = p².
In this case, if p = 0.4, the probability of reaching the goal is 0.16, substantially smaller. □
Theorem 3.82 (Doob's maximal inequality). Suppose that (Xn)n∈N0 is a submartingale. Set
X̃_n := sup_{k≤n} X_k.
Applying the Optional Sampling Theorem 3.28 to the bounded stopping times T ∧ n and n we deduce
E[ X_{T∧n} ] ≤ E[ X_n ].
so X_{T∧n} ≥ a I_A + X_n I_{A^c}. We deduce
a P[A] + E[ X_n I_{A^c} ] ≤ E[ X_{T∧n} ] ≤ E[ X_n ] = E[ X_n I_A ] + E[ X_n I_{A^c} ].
This implies the first inequality in (3.2.31). The second inequality is trivial. □
Corollary 3.83. Suppose that Yn is a martingale. We set
Y_n^* := max_{k≤n} |Y_k|.
Theorem 3.84 (Doob's Lp-inequality). Let p > 1 and suppose that (Xn)n∈N0 is a positive submartingale such that Xn ∈ Lp, ∀n ≥ 0. Set
X̃_n := sup_{k≤n} X_k,
where
1/p + 1/q = 1, i.e., q = p/(p − 1).
In particular, if (Yn )n∈N0 is a martingale and if
Proof. Clearly (3.2.32) ⇒ (3.2.33). Note that (X_n^p)n≥0 is also a submartingale and X̃_n ∈ Lp. From Doob's maximal inequality we deduce
a P[ X̃_n ≥ a ] ≤ E[ X_n I_{X̃_n ≥ a} ],
so
(1/p) E[ X̃_n^p ] = ∫_0^∞ a^{p−1} P[ X̃_n ≥ a ] da ≤ ∫_0^∞ a^{p−2} E[ X_n I_{X̃_n ≥ a} ] da,
where the first equality is (1.3.43) and the inequality uses (3.2.31).
Definition 3.85. Let p ∈ [1, ∞). A martingale (Xn )n∈N0 is called an Lp -martingale
if
E |Xn |p < ∞, ∀n ∈ N0 .
X_∞ ∈ Lp(Ω, F_∞, P).
Moreover,
E[ (X_∞^*)^p ] ≤ ( p/(p − 1) )^p E[ |X_∞|^p ].
Example 3.87 (Kolmogorov's one series theorem). Suppose that (Xn)n≥0 is a sequence of independent random variables such that
E[ Xn ] = 0, ∀n ≥ 0, Σ_{n≥0} Var[ Xn ] < ∞.
Then the series Σ_{n≥0} Xn converges a.s.
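A classical instance is the random harmonic series Σ εn/n with i.i.d. fair signs εn = ±1: the terms are centered and Σ 1/n² < ∞, so the partial sums converge a.s. The Python sketch below (function name is ours) makes the stabilization of the partial sums visible:

```python
import random

def random_signs_series(N=20000, seed=5):
    """Partial sums of sum_{n>=1} eps_n/n with i.i.d. signs eps_n = +/-1.
    Since E[eps_n/n] = 0 and sum Var = sum 1/n^2 < infinity, the partial
    sums converge a.s. by Kolmogorov's one series theorem."""
    rng = random.Random(seed)
    s, out = 0.0, []
    for n in range(1, N + 1):
        s += (1.0 if rng.random() < 0.5 else -1.0) / n
        out.append(s)
    return out
```

Past n = 10000 the partial sums barely move, while the unsigned harmonic series would still be drifting off to infinity.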
Example 3.88 (Likelihood ratio). This example has its origin in statistics. Suppose that we have a random quantity and we have reasons to believe that its probability distribution is either of the form p(x)dx or q(x)dx, where p, q : R → [0, ∞) are mutually absolutely continuous probability densities on R:
∫_R p(x) dx = ∫_R q(x) dx = 1.
We want to describe a statistical test that helps deciding which is the real distribu-
tion. Our presentation follows [75, Sec. 12.8].
We take a large number of samples of the random quantity, or equivalently,
suppose that we are given a sequence of i.i.d. random variables (Xn )n≥1 with com-
mon probability density f , where f is one of the two densities p or q. Assume for
simplicity that p(x), q(x) > 0, for almost any x ∈ R.
The products
Y_n := Π_{k=1}^{n} p(X_k)/q(X_k)
are called likelihood ratios. Note that if f = q, then E[ Yn ] = 1, ∀n.
To decide whether f = q or f = p we fix a (large) positive number a and a large n ∈ N and adopt the prediction strategy
f_n := p if Y_n ≥ a, f_n := q if Y_n < a.
We want to show that this strategy picks the correct density with high confidence, i.e., P[ f = f_n ] is very close to 1 for large n and a.
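In practice the test is run on log Y_n. For instance, taking p to be the N(1,1) density and q the N(0,1) density gives log( p(x)/q(x) ) = x − 1/2 per sample. The Python sketch below (names and the choice of densities are ours, for illustration only) implements the decision rule:

```python
import math
import random

def log_likelihood_ratio(xs, mu=1.0):
    # p = N(mu,1) versus q = N(0,1): log(p(x)/q(x)) = mu*x - mu^2/2
    return sum(mu * x - 0.5 * mu * mu for x in xs)

def decide(true_density, n=200, a=1000.0, seed=13):
    """Draw n i.i.d. samples from the true density ('p' or 'q') and
    predict 'p' if Y_n >= a, 'q' otherwise (computed on the log scale)."""
    rng = random.Random(seed)
    mean = 1.0 if true_density == "p" else 0.0
    xs = [rng.gauss(mean, 1.0) for _ in range(n)]
    return "p" if log_likelihood_ratio(xs) >= math.log(a) else "q"
```

Under f = q the log-likelihood ratio drifts to −∞ at rate n/2 (here), and under f = p it drifts to +∞, so already for moderate n the rule essentially never misclassifies.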
Example 3.89. Consider again the branching process in Example 3.8. Suppose that the reproduction law µ satisfies
m = Σ_{k=0}^{∞} k µ(k) < ∞, Σ_{k=0}^{∞} k² µ(k) < ∞.
We set
σ² := Var[µ] = Σ_{k=0}^{∞} k² µ(k) − m².
Note that
Z_{n+1} = Σ_{k=1}^{∞} Σ_{j=1}^{k} X_{n,j} I_{Z_n=k} = Σ_{j=1}^{∞} X_{n,j} Σ_{k≥j} I_{Z_n=k} = Σ_{j=1}^{∞} X_{n,j} I_{Z_n≥j},
so
E[ Z_{n+1}² ‖ F_n ] = Σ_{k,j=1}^{∞} E[ I_{Z_n≥j, Z_n≥k} X_{n,j} X_{n,k} ‖ F_n ]
( X_{n,j}, X_{n,k} ⊥⊥ F_n )
= Σ_{k,j=1}^{∞} I_{Z_n≥j, Z_n≥k} E[ X_{n,j} X_{n,k} ] = Σ_{k,j=1}^{∞} I_{Z_n≥j, Z_n≥k} ( m² + δ_{jk} σ² )
= m² Σ_{j=1}^{∞} I_{Z_n≥j} Σ_{k=1}^{∞} I_{Z_n≥k} + σ² Σ_{k=1}^{∞} I_{Z_n≥k}
( Z_n = Σ_{k≥1} I_{Z_n≥k} )
= m² ( Σ_{k=1}^{∞} I_{Z_n≥k} )² + σ² Σ_{k=1}^{∞} I_{Z_n≥k} = m² Z_n² + σ² Z_n.
Hence
E[ Z_{n+1}² ] = m² E[ Z_n² ] + σ² E[ Z_n ] = m² E[ Z_n² ] + σ² m^n E[ Z_0 ] = m² E[ Z_n² ] + σ² m^n ℓ.
We set
q_{n+1} := m^{−2n} E[ Z_n² ]
and
C := inf_{n≤0} E[ X_n ] > −∞.
Then the following hold.
Moreover,
X_{−∞} ≤ E[ X_n ‖ F_{−∞} ], (3.2.34)
with equality if (Xn)n∈−N0 is a martingale.
Hence
E[ X_n^− ] ≤ E[ X_n^+ ] − C ≤ E[ X_0^+ ] − C, ∀n ≤ 0,
and consequently,
Z := sup_{n≤0} E[ |X_n| ] < ∞. (3.2.35)
and the G_n^K-submartingale Y_n^K := X_{(−K+n)∧0}.
This proves that, for any rational numbers a < b, the nondecreasing sequence
K ↦ N_K([a, b], Y^K)
is also bounded, and thus it has a finite limit N_∞([a, b], X) as K → ∞. An obvious version of Lemma 3.38 shows that X_n has an a.s. limit as n → −∞. The limit is an F_{−∞}-measurable random variable X_{−∞}. Fatou's Lemma shows that
E[ |X_{−∞}| ] < ∞.
Step 3. Uniform integrability. This is obvious if (Xn)n≤0 is a martingale since X_n = E[ X_0 ‖ F_n ]. We have
E[ |X_{−n}| I_{|X_{−n}|>a} ] = −E[ X_{−n} ] + E[ X_{−n} I_{X_{−n}≥−a} ] + E[ X_{−n} I_{X_{−n}>a} ]
≤ −E[ X_{−K} ] + ε/2 + E[ X_{−n} I_{X_{−n}≥−a} ] + E[ X_{−n} I_{X_{−n}>a} ].
Now observe that, for any H ∈ F_{−n}, we have X_{−n} ≤ E[ X_{−K} ‖ F_{−n} ] (n ≥ K), so
E[ X_{−n} I_H ] ≤ E[ X_{−K} I_H ].
Hence, if H = {X_{−n} ≥ −a} or H = {X_{−n} > a}, then
E[ X_{−n} I_{X_{−n}≥−a} ] + E[ X_{−n} I_{X_{−n}>a} ] ≤ E[ X_{−K} I_{X_{−n}≥−a} ] + E[ X_{−K} I_{X_{−n}>a} ],
and
E[ |X_{−n}| I_{|X_{−n}|>a} ] ≤ −E[ X_{−K} ] + E[ X_{−K} I_{X_{−n}≥−a} ] + E[ X_{−K} I_{X_{−n}>a} ] + ε/2
≤ E[ |X_{−K}| I_{|X_{−n}|≥a} ] + ε/2.
From Markov's inequality and (3.2.35) we deduce
P[ |X_{−m}| > a ] ≤ Z/a, ∀m ∈ N0.
Since the family consisting of the single random variable X_{−K} is uniformly integrable, we deduce that there exists δ = δ(ε) > 0 such that, for any A ∈ F satisfying P[A] < δ, we have
E[ |X_{−K}| I_A ] < ε/2.
We deduce that for any a > 0 such that Z/a < δ(ε) we have
E[ |X_{−n}| I_{|X_{−n}|>a} ] ≤ E[ |X_{−K}| I_{|X_{−n}|≥a} ] + ε/2 < ε.
This proves that the family (X_{−n})n∈N0 is UI.
Step 4. Conclusion. Finally, observe that for any A ∈ F_{−∞} and any n ≤ m ≤ 0 we have E[ X_n I_A ] ≤ E[ X_m I_A ]. If we let n → −∞ we deduce
E[ X_{−∞} I_A ] ≤ E[ X_m I_A ], ∀m ≤ 0, A ∈ F_{−∞}.
This is precisely the inequality (3.2.34). When (Xn) is a martingale all the above inequalities are equalities. □
Example 3.92. (a) A sequence of i.i.d. random variables (Xn )n≥1 is exchangeable.
(b) Suppose that (µλ)λ∈Λ is a family of Borel probability measures on R parametrized by a probability space (Λ, S, PΛ) such that, for any Borel subset B ⊂ R, the function
Λ ∋ λ ↦ µ_λ[B]
is measurable. In other words, µ• is a random probability measure. In the language of kernels, µ• is a Markov kernel (Λ, S) → (R, B_R).
For each λ ∈ Λ we have a product measure µ_λ^{⊗n} on R^n equipped with its natural σ-algebra B_n = B_R^{⊗n}. The mixture of the family (µλ) directed by PΛ is the measure
µ_{PΛ}^n [B] := ∫_Λ µ_λ^{⊗n}[B] PΛ[dλ], B ∈ B_n.
For example, suppose that ν is a Borel probability measure on Λ = [0, 1]. For any p ∈ [0, 1] define
µ_p = Bin(p) = (1 − p)δ_0 + p δ_1 ∈ Prob({0, 1}).
Then we obtain the mixtures µ_ν^n ∈ Prob({0, 1}^n) defined by
µ_ν^n [ {(ε_1, . . . , ε_n)} ] = ∫_{[0,1]} (1 − p)^{n−k} p^k ν[dp], k = ε_1 + · · · + ε_n.
The collection µ_ν^n ∈ Prob({0, 1}^n), n ∈ N, is a projective family and thus it defines a measure µ_ν^∞ on {0, 1}^N.
Theorem 3.93 (de Finetti). Suppose that X := (Xn)n∈N is an exchangeable sequence of integrable random variables defined on the same probability space (Ω, F, P). Set
S_n := X^{−1}(S_n), ∀n ∈ N ∪ {∞}.
Then the following hold.
(i) The random variables (Xn)n≥1 are conditionally independent given S_∞.
(ii) The random variables (Xn)n≥1 are identically distributed given S_∞, i.e., there exists a negligible subset N ∈ F such that, on Ω \ N,
P[ Xi ≤ x ‖ S_∞ ] = P[ Xj ≤ x ‖ S_∞ ], ∀i, j ∈ N, ∀x ∈ R.
July 19, 2022 15:9 ws-book961x669 An Introduction to Probability 12800-main page 327
Martingales 327
Proof. We follow the presentation in [91]. Without any loss of generality we can assume that (Ω, F) = (R^N, B_N) and X_n(x_1, x_2, . . . ) = x_n. In this case S_n = S_n. Observe that the exchangeability condition implies that the random variables X_n are identically distributed. Suppose that f : R → R is a measurable function such that f(X_1) ∈ L¹. We claim that
(1/n)( f(X_1) + · · · + f(X_n) ) = E[ f(X_k) ‖ S_n ], ∀1 ≤ k ≤ n. (3.2.36)
Note that S_n = X^{−1}(B_n), where B_n is the σ-subalgebra of B_N consisting of S_n-invariant subsets. In particular, a function g : Ω → R is S_n-measurable iff there exists an n-symmetric function Φ such that g = Φ(X).
Let A ∈ S_n and choose an n-symmetric function Φ such that I_A = Φ(X). Then, for 1 ≤ j ≤ n, we have
E[ f(X_j)Φ(X) ] = E[ f(X_j) Φ(X_j, X_2, . . . , X_{j−1}, X_1, X_{j+1}, . . . ) ] = E[ f(X_1)Φ(X) ],
so that
E[ f(X_1) I_A ] = E[ f(X_j)Φ(X) ] = E[ ( (f(X_1) + · · · + f(X_n))/n ) I_A ].
The equality (3.2.36) follows by observing that f(X_1) + · · · + f(X_n) is S_n-measurable.
The convergence theorem for backwards martingales (Corollary 3.91) shows that the empirical mean
( f(X_1) + · · · + f(X_n) )/n
converges a.s. and in L¹ to E[ f(X_1) ‖ S_∞ ]. By choosing f(x) = x we obtain the a.s. and L¹ convergence of the sample means X̄_n to E[ X_1 ‖ S_∞ ].
Remark 3.94. Suppose that (Xn)n∈N is an exchangeable sequence of random variables defined on the probability space (Ω, F, P). Denote by S_∞ the sigma-algebra of exchangeable events. Suppose that
Q : Ω × B_R → [0, 1], (ω, B) ↦ Q_ω[B]
is a Markov kernel such that
∀B ∈ B_R : P[ X_1 ∈ B ‖ S_∞ ] = Q[B], a.s.
Then, for any B_1, . . . , B_n ∈ B_R,
P[ X_1 ∈ B_1, . . . , X_n ∈ B_n ] = E[ Q[B_1] · · · Q[B_n] ] = ∫_Ω Q_ω^{⊗n}[ B_1 × · · · × B_n ] P[dω].
Thus the distribution of the sequence (Xn) is a mixture of i.i.d. driven by the random distribution Q. □
Proof. We follow the approach in [29, Sec. 7.3, Thm. 4]. For a different but related proof we refer to [1, Cor. (3.10)].
Denote by S_∞^* and T_∞^* the completions of S_∞ and respectively T_∞. We have
lim_{n→∞} (1/n) Σ_{k=1}^{n} I_{X_k≤x} = P[ X_1 ≤ x ‖ S_∞ ].
Clearly the limit in the left-hand side is T_∞-measurable since it is not affected by changing finitely many of the random variables. Hence
P[ X_1 ≤ x ‖ S_∞ ] = P[ X_1 ≤ x ‖ T_∞ ] = P[ X_1 ≤ x ‖ T_∞^* ]. (3.2.39)
Similarly, for any x_1, . . . , x_n ∈ R, the random variable
Π_{k=1}^{n} P[ X_k ≤ x_k ‖ S_∞ ]
is T_∞-measurable. Hence, for any S ∈ S_∞ we have
P[ S ∩ ∩_{k=1}^{n} {X_k ≤ x_k} ‖ T_∞ ] = E[ I_S · Π_{k=1}^{n} P[ X_k ≤ x_k ‖ S_∞ ] ‖ T_∞ ]
= E[ Π_{k=1}^{n} P[ X_k ≤ x_k ‖ S_∞ ] ‖ T_∞ ] P[ S ‖ T_∞ ]
= Π_{k=1}^{n} P[ X_k ≤ x_k ‖ T_∞ ] · P[ S ‖ T_∞ ],
where the last equality uses (3.2.39).
Thus S_∞ and X_1, . . . , X_n are conditionally independent given T_∞, so S_∞ and (Xn)n∈N are conditionally independent given T_∞. Since S_∞ ⊂ σ(Xn, n ∈ N) we deduce that any S ∈ S_∞ is conditionally independent of itself given T_∞. We have
P[ S ‖ T_∞ ] = P[ S ‖ T_∞ ]²,
so P[ S ‖ T_∞ ] ∈ {0, 1} a.s. Set
T = T_S := { ω : P[ S ‖ T_∞ ](ω) = 1 }.
Then T ∈ T_∞, T ⊂ S and
P[ T ∩ S ] = E[ I_T P[ S ‖ T_∞ ] ] = P[T],
P[ S ] = E[ P[ S ‖ T_∞ ] ] = E[ I_T ] = P[T].
This concludes the proof. □
Observe that a sequence of i.i.d. random variables (Xn )n≥1 is exchangeable. The
Kolmogorov 0-1 law and the above proposition imply the following result.
Theorem 3.96 (Hewitt-Savage 0-1 Law). If (Xn)n≥1 is a sequence of i.i.d. random variables and A ∈ S_∞, then P[A] ∈ {0, 1}. □
Proof. From De Finetti's Theorem 3.93 we deduce that X̄_n converges a.s. and in L¹ to E[ X_1 ‖ S_∞ ]. Proposition 3.95 implies that E[ X_1 ‖ S_∞ ] = E[ X_1 ‖ T_∞ ], and Kolmogorov's 0-1 law shows that E[ X_1 ‖ T_∞ ] = E[ X_1 ] a.s.
Then
S = P[ X_1 = 1 ‖ S_∞ ], (3.2.40a)
P[ X_1 = · · · = X_k = 1, X_{k+1} = · · · = X_n = 0 ‖ S ] = S^k (1 − S)^{n−k}, (3.2.40b)
P[ X_1 = · · · = X_k = 1, X_{k+1} = · · · = X_n = 0 ] = E[ S^k (1 − S)^{n−k} ]. (3.2.40c)
In particular, the moment generating function of S is
E[ e^{tS} ] = Σ_{n≥0} (t^n/n!) P[ X_1 = · · · = X_n = 1 ].
= P[ X_1 = 1 ‖ S_∞ ]^k · P[ X_1 = 0 ‖ S_∞ ]^{n−k} = S^k (1 − S)^{n−k}.
= E[ S^k (1 − S^k) ‖ S ] = S^k (1 − S^k).
Clearly,
P[ X_1 = 1, . . . , X_k = 1, X_{k+1} = 0, . . . , X_n = 0 ] = E[ S^k (1 − S)^{n−k} ]. □
Let us observe that the sequence (Xn )n≥1 is exchangeable. We prove by induction
that (X1 , . . . , Xn ) is exchangeable. For n = 1 the result is trivial.
Let n > 1 and ε_1, . . . , ε_n ∈ {0, 1}. We denote by r_k and g_k the number of red balls and respectively green balls after the k-th draw. We deduce
P[ X_1 = ε_1, . . . , X_n = ε_n ] =
Π_{k=1}^{n} ( ε_k r_{k−1} + (1 − ε_k) g_{k−1} ) / Π_{k=1}^{n} ( r + g + (k − 1)c ), if c > 0,
z_0^k (1 − z_0)^{n−k}, k = ε_1 + · · · + ε_n, if c = 0,
where z_0 = Z_0 = r/(r + g). When c > 0 the denominator above is independent of {ε_1, . . . , ε_n}. We set S_n := ε_1 + · · · + ε_n and we rewrite the numerator in the form
Π_{k=1}^{n} ( ε_k r_{k−1} + (1 − ε_k) g_{k−1} ) = Π_{i=1}^{S_n} ( r + c(i − 1) ) · Π_{j=1}^{n−S_n} ( g + c(j − 1) ).
is a [0, 1]-valued random variable and (3.2.40c) with k = n shows that, for any n ≥ 0, we have
∫_0^1 z^n P_{Z_∞}[dz] = E[ Z_∞^n ] = P[ X_1 = · · · = X_n = 1 ]
= B(ρ + n, γ)/B(ρ, γ) = (1/B(ρ, γ)) ∫_0^1 z^n z^{ρ−1}(1 − z)^{γ−1} dz, if c > 0,
= z_0^n = ∫_0^1 z^n δ_{z_0}[dz], if c = 0. (3.2.41)
The distribution in the case c > 0 is the Beta distribution with parameters ρ, γ discussed in Example 1.122. □
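For r = g = c = 1 we get ρ = γ = 1, and Beta(1,1) is the uniform distribution on [0,1], so E[Z_∞] = 1/2 and E[Z_∞²] = 1/3. The Python sketch below (function name is ours) simulates the urn and checks these two moments for the red-ball fraction after many draws:

```python
import random

def polya_fractions(r=1, g=1, c=1, draws=500, trials=4000, seed=17):
    """Simulate Polya's urn: draw a ball uniformly, return it together
    with c extra balls of the same color. Returns the final fraction of
    red balls in each trial; for r = g = c = 1 the limiting fraction is
    Beta(1,1), i.e., uniform on [0,1]."""
    rng = random.Random(seed)
    out = []
    for _ in range(trials):
        red, green = r, g
        for _ in range(draws):
            if rng.random() < red / (red + green):
                red += c
            else:
                green += c
        out.append(red / (red + green))
    return out
```

The first and second empirical moments should be close to 1/2 and 1/3, in contrast with an i.i.d. coin, whose empirical frequency would concentrate at a single point.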
□
Definition 3.104. Fix a filtration F• = (Ft)t≥0 of the probability space (Ω, F, P). □
Lemma 3.105. For any stopping time T adapted to the filtration F• the collection F_T is a σ-algebra. □
Proof. Indeed,
{T < t} = ∪_{n≥1} { T ≤ t − 1/n },
and {T ≤ t − 1/n} ∈ F_{t−1/n} ⊂ F_t. □
Definition 3.107. Fix a probability space (Ω, F, P) and a filtration F• = (Ft)t≥0 of F. We set
F_{t+} := ∩_{s>t} F_s, t ≥ 0. □
Remark 3.108. If (Ft)t≥0 is a filtration of the complete probability space (Ω, F, P), then the usual augmentation of (Ft) is the minimal filtration (F̂t) containing (Ft) and satisfying the usual conditions. More precisely, if N ⊂ F is the collection of probability zero events, then
F̂_t = ∩_{s∈(t,∞)} σ( N, F_s ). □
Proposition 3.109. Consider a random variable T : Ω → [0, ∞]. Then the follow-
ing statements are equivalent.
Example 3.110. Suppose that (Xt)t≥0 is a process adapted to F• and Γ ⊂ R. The (Γ-)début time of (Xt) is the function
D_Γ : Ω → [0, ∞], D_Γ(ω) = inf{ t ≥ 0 ; X_t(ω) ∈ Γ }.
6 Recall that this means that any set contained in a P-null subset is measurable.
7 This settles an inconsistency in the existing literature. Many authors refer to stopping times as optional times, while our optional times are sometimes referred to as weakly optional times. When the filtration is right continuous all these terms refer to the same concept, that of stopping time.
(i) If Γ is open, and the paths of Xt are right continuous, then the début time DΓ
is a stopping time of (Xt ), while the hitting time HΓ is an optional time.
(ii) If Γ is closed, and the paths of Xt are continuous, then the début time DΓ is a
stopping time of (Xt ), while the hitting time HΓ is an optional time.
We deduce from the above that if the filtration Ft is right-continuous and the
paths of (Xt ) are continuous, then both DΓ and HΓ are stopping times if Γ is either
open or closed. □
If the filtration F• satisfies the usual conditions, a much more general result is true. More precisely, we have the following highly nontrivial result of Dellacherie and Meyer [39, Thm. IV.50].
Theorem 3.111 (Début Theorem). Suppose that the filtration F• satisfies the
usual conditions and (Xt )t≥0 is an F• -progressive process. Then, for any Borel
subset Γ ⊂ R, the début time DΓ is a stopping time. □
Ft -measurable.
Proof. We prove only (i). The rest are left to the reader as an exercise. To prove
that the sublevel set {T ≤ c} is measurable we have to show that for any t ≥ 0 the
intersection
{T ≤ c} ∩ {T ≤ t} = {T ≤ t ∧ c}
Definition 3.113. Fix a filtered probability space (Ω, F•, P). Given a random process (Xt)t≥0 and an F•-stopping time T : Ω → [0, ∞] we denote by X_T I_{T<∞} the random variable
X_T I_{T<∞} : ω ↦ X_{T(ω)}(ω) if T(ω) < ∞, and ω ↦ 0 if T(ω) = ∞. □
Proof. The statements (i) and (ii) are immediate. The last statement concerning time inversion requires a bit more work. We follow the approach in the proof of [33, Thm. VIII.1.6].
Observe first that Xt is a Gaussian process with mean zero and covariances
E[ X_s X_t ] = min(s, t), ∀s, t ≥ 0.
Thus it suffices to show that (Xt) is a.s. continuous at t = 0, i.e.,
lim_{t↘0} X_t = 0 a.s.
is a martingale since the above summands have mean zero and are independent. Applying Doob's maximal inequalities (3.2.31) to the discrete submartingales
Y_k = D_{n,k}², 0 ≤ k ≤ m, m = 1, 2, . . . ,
we obtain a bound
≤ (1/(n²ε²)) E[ |X_{n+1} − X_n|² ] = 1/(n²ε²).
Since Σ_{n≥1} 1/n² < ∞ we deduce from the Borel-Cantelli Lemma that
lim_{n→∞} (1/n) sup_{s∈[0,1]} |X_{n+s} − X_n| = 0, a.s. □
F_t = σ( N, B_s , 0 ≤ s ≤ t ).
Proof. We follow the approach in the proof of [33, Thm. VII.3.20]. It suffices to prove that (Ft) is right-continuous, i.e.,
∩_{t>t0} F_t = F_{t0}.
We set
G = F_{t0}, G_n = σ( B_{t0+2^{−n}} − B_{t0+2^{−n−1}} ), n ∈ N.
From Corollary 3.61 we deduce that F_{t0} = T_∞. On the other hand, T_∞ ⊃ F_{t0+}, so F_{t0+} = F_{t0}. □
Proposition 3.118. Suppose that (Bt)t≥0 is a standard Brownian motion and F_t = σ( B_s , 0 ≤ s ≤ t ). Then
P[ T_a < ∞ ] = 1, ∀a ∈ R.
In particular, a.s.,
limsup_{t→∞} B_t = ∞, liminf_{t→∞} B_t = −∞.
where the second is an increasing limit. The rescaling invariance of the Brownian motion implies
P[ sup_{0≤s≤1} B_s > δ ] = P[ sup_{0≤s≤1/δ²} B_s^δ > 1 ].
We deduce
P[ sup_{s≥0} B_s > 1 ] = lim_{δ↘0} P[ sup_{0≤s≤1/δ²} B_s^δ > 1 ] = 1.
Replacing B by −B we deduce
P[ inf_{s≥0} B_s < −M ] = 1, ∀M > 0.
Remark 3.119. The above result shows that, with probability 1, the Brownian motion has a zero on any arbitrarily small interval [0, ε]. As a matter of fact, the set of zeros of a Brownian motion is a large set: its Hausdorff dimension is a.s. 1/2, [118, Thm. 4.24]. □
Let us observe that if (Bt)t≥0 is a Brownian motion, then for any t0 ≥ 0, the process
( B_{t+t0} − B_{t0} )_{t≥0}
is also a Brownian motion, independent of σ( B_s , 0 ≤ s ≤ t0 ). We will refer to this elementary fact as the simple Markov property. We want to show that a stronger result holds, where t0 is allowed to be random.
Theorem 3.120 (The strong Markov property). Suppose that (Bt)t≥0 is a standard Brownian motion and T is a stopping time with respect to the filtration F_t = σ( B_s , 0 ≤ s ≤ t ) such that P[ T < ∞ ] > 0. For every t ≥ 0 we set
B_t^{(T)} := I_{T<∞} ( B_{T+t} − B_T ).
Then, with respect to the probability measure P[ − ‖ T < ∞ ], the process B_t^{(T)} is a standard Brownian motion, independent of F_T. □
Let us show first that the conclusions of the theorem follow from the above lemma. Set S_∞ := {T < ∞}. Assume first that P[ S_∞ ] = 1. Then (3.3.5) reads
E[ I_A F( B_{t1}^{(T)}, . . . , B_{tp}^{(T)} ) ] = P[A] E[ F( B_{t1}, . . . , B_{tp} ) ]. (3.3.6)
Indeed, if we set A = Ω in (3.3.6) we deduce that B_t^{(T)} is a Brownian motion. In particular, for every choice of t_1, . . . , t_p ≥ 0, the vectors
( B_{t1}^{(T)}, . . . , B_{tp}^{(T)} ) and ( B_{t1}, . . . , B_{tp} )
have the same distribution. Next, (3.3.6) implies that for every choice of t_1, . . . , t_p ≥ 0 the vector ( B_{t1}^{(T)}, . . . , B_{tp}^{(T)} ) is independent of F_T.
If P[ S_∞ ] < 1 and we denote by E_{S_∞} the expectation with respect to the probability measure P[ − ‖ S_∞ ], then (3.3.5) implies
E_{S_∞}[ I_A F( B_{t1}^{(T)}, . . . , B_{tp}^{(T)} ) ] = P[A] E[ F( B_{t1}, . . . , B_{tp} ) ].
Arguing as before we reach the conclusions of Theorem 3.120 assuming the validity of Lemma 3.121. □
Proof of Lemma 3.121. For the clarity of exposition we discuss only the case P[ S_∞ ] = 1. The case P[ S_∞ ] < 1 requires no new ideas, and the details can be safely left to the reader.
For every t ≥ 0 and any n ∈ N we denote by [t]_n the smallest rational number of the form k/2^n that is ≥ t. Note that the quantities [T]_n are stopping times: stopping the process at [T]_n corresponds to stopping the process at the first time of the form k/2^n after T. Then
lim_{n→∞} [T]_n = T
and
F( B_{t1}^{(T)}, . . . , B_{tp}^{(T)} ) = lim_{n→∞} F( B_{t1}^{([T]_n)}, . . . , B_{tp}^{([T]_n)} ),
so that
E[ I_A F( B_{t1}^{(T)}, . . . , B_{tp}^{(T)} ) ] = lim_{n→∞} Σ_{k=0}^{∞} E[ I_A I_{(k−1)2^{−n} < T ≤ k2^{−n}} F( B_{t1}^{([T]_n)}, . . . , B_{tp}^{([T]_n)} ) ].
is F_{k2^{−n}}-measurable. From the simple Markov property of the Brownian motion we deduce
E[ I_{A_{k,n}} F( B_{t1}^{([T]_n)}, . . . , B_{tp}^{([T]_n)} ) ]
= E[ I_{A_{k,n}} F( B_{t1+k2^{−n}} − B_{k2^{−n}}, . . . , B_{tp+k2^{−n}} − B_{k2^{−n}} ) ]
= P[ A_{k,n} ] E[ F( B_{t1}, . . . , B_{tp} ) ].
Observing that
Σ_{k=0}^{∞} P[ A_{k,n} ] = P[A],
we deduce
Σ_{k=0}^{∞} E[ I_A I_{(k−1)2^{−n}<T≤k2^{−n}} F( B_{t1}^{([T]_n)}, . . . , B_{tp}^{([T]_n)} ) ]
= Σ_{k=0}^{∞} E[ I_{A_{k,n}} F( B_{t1}^{([T]_n)}, . . . , B_{tp}^{([T]_n)} ) ]
= Σ_{k=0}^{∞} P[ A_{k,n} ] E[ F( B_{t1}, . . . , B_{tp} ) ] = P[A] E[ F( B_{t1}, . . . , B_{tp} ) ]. □
Let us present some applications of the strong Markov property. For a ∈ R we define the hitting time
T_a := inf{ t > 0 ; B_t = a }.
This is a stopping time for the standard Brownian motion B_t and Proposition 3.118 shows that
P[ T_a < ∞ ] = 1.
Remark 3.123. The above result is called the reflection principle for a simple reason. In the region t ≥ T_a the graph of the function t ↦ B̃_t, viewed as a curve in the Cartesian plane with coordinates (t, x), is the reflection of the graph of B_t in the horizontal line x = a. This reflection principle is intimately related to André's reflection trick. □
$$S_t := \sup_{u\le t} B_u.$$
$$\mathbb{P}[S_t\ge a] = 2\,\mathbb{P}[B_t\ge a] = \mathbb{P}[B_t\ge a] + \mathbb{P}[B_t\le -a] = \mathbb{P}\big[|B_t|\ge a\big]. \tag{3.3.8}$$
□

Corollary 3.125. For every $a>0$ the stopping time $T_a$ has the same distribution as $\dfrac{a^2}{B_1^2}$ and has density
$$f_a(t) = \frac{a}{\sqrt{2\pi t^3}}\,\exp\Big(-\frac{a^2}{2t}\Big)\,\boldsymbol{I}_{\{t>0\}}.$$
Proof. Note that
$$\mathbb{P}[T_a\le t] = \mathbb{P}[S_t\ge a] = \mathbb{P}\big[|B_t|\ge a\big] = \mathbb{P}\big[B_t^2\ge a^2\big] = \mathbb{P}\big[tB_1^2\ge a^2\big] = \mathbb{P}\Big[\frac{a^2}{B_1^2}\le t\Big].$$
The statement about $f_a$ now follows from the fact that $B_1$ is a standard normal random variable. □
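Corollary 3.125 can be sanity-checked numerically: since $T_a$ has the same distribution as $a^2/B_1^2$, we can sample it exactly from a single standard normal draw and compare the empirical value of $\mathbb{P}[T_a\le t]$ with $\mathbb{P}[|B_t|\ge a]=2\big(1-\Phi(a/\sqrt{t})\big)$. A minimal Python sketch; the parameters, seed, and sample size are our own illustrative choices, not part of the text:

```python
import math
import random

def phi(x):
    """Standard normal CDF, expressed through the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def sample_hitting_time(a, rng):
    """Sample T_a exactly using T_a =_d a^2 / B_1^2 (Corollary 3.125)."""
    z = rng.gauss(0.0, 1.0)
    return (a / z) ** 2

rng = random.Random(1)
a, t, n = 1.0, 1.0, 200_000
emp = sum(sample_hitting_time(a, rng) <= t for _ in range(n)) / n
# Reflection principle: P[T_a <= t] = P[|B_t| >= a] = 2(1 - Phi(a/sqrt(t)))
exact = 2.0 * (1.0 - phi(a / math.sqrt(t)))
print(emp, exact)   # the two numbers should agree to about two decimals
```

With these parameters the common value is roughly $0.317$, the classical $\mathbb{P}[|B_1|\ge 1]$.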
u
Definition 3.126. A random process $(X_t)_{t\ge0}$ adapted to the filtration $(\mathcal{F}_t)_{t\ge0}$ such that $X_t\in L^1$, $\forall t$, is called a
• martingale if
$$\mathbb{E}[X_t\,\|\,\mathcal{F}_s] = X_s, \quad \forall\, 0\le s<t,$$
• submartingale if
$$\mathbb{E}[X_t\,\|\,\mathcal{F}_s] \ge X_s, \quad \forall\, 0\le s<t,$$
• supermartingale if
$$\mathbb{E}[X_t\,\|\,\mathcal{F}_s] \le X_s, \quad \forall\, 0\le s<t. \qquad\Box$$
The case (i) follows from (3.3.9). The case (iii) is the continuous time analogue of Example 3.7 and the proof is similar. To prove (ii) note that
$$\mathbb{E}\big[\widetilde{Z}_t^2\,\|\,\mathcal{F}_s\big] = \mathbb{E}\big[(\widetilde{Z}_s + \widetilde{Z}_t - \widetilde{Z}_s)^2\,\|\,\mathcal{F}_s\big] = \widetilde{Z}_s^2 + \mathbb{E}\big[\widetilde{Z}_t^2\big] - 2\,\mathbb{E}\big[\widetilde{Z}_s\widetilde{Z}_t\big] + \mathbb{E}\big[\widetilde{Z}_s^2\big]$$
$$= \widetilde{Z}_s^2 + \mathbb{E}\big[\widetilde{Z}_t^2\big] - 2\,\mathbb{E}\Big[\mathbb{E}\big[\widetilde{Z}_s\widetilde{Z}_t\,\|\,\mathcal{F}_s\big]\Big] + \mathbb{E}\big[\widetilde{Z}_s^2\big] = \widetilde{Z}_s^2 + \mathbb{E}\big[\widetilde{Z}_t^2\big] - \mathbb{E}\big[\widetilde{Z}_s^2\big].$$
Hence
$$\mathbb{E}\Big[\widetilde{Z}_t^2 - \mathbb{E}\big[\widetilde{Z}_t^2\big]\,\Big\|\,\mathcal{F}_s\Big] = \widetilde{Z}_s^2 - \mathbb{E}\big[\widetilde{Z}_s^2\big].$$
Classical examples of processes with independent increments are the Brownian motion, the Poisson process, or more generally the Lévy processes, [33, Chap. VII].
If $B_t$ is a 1-dimensional Brownian motion started at 0, adapted to $\mathcal{F}_t$, then $B_t$ is a normal random variable with mean 0 and variance $t$, for each $t>0$. The moment generating function of $B_t$ is
$$M_{B_t}(\theta) = \mathbb{E}\big[e^{\theta B_t}\big] = e^{\frac{\theta^2 t}{2}},$$
where $H_n(x)$ is the $n$-th Hermite polynomial (1.6.4). We can rewrite the above equality as
$$e^{\theta B_t - \frac{\theta^2}{2}t} = \sum_{n\ge0}\frac{\theta^n}{n!}\,M_n(t), \quad M_n(t) = t^{n/2}H_n\big(B_t/\sqrt{t}\,\big).$$
Remark 3.131. Suppose that $(X_t)_{t\ge0}$ is an $R$-submartingale. Fix a negligible set $N\subset\Omega$ such that $t\mapsto X_t(\omega)$ is an $R$-function for any $\omega\in\Omega\setminus N$. Fix a dense countable subset $D$ of $[0,\infty)$.
Note that for every open interval $I\subset[0,\infty)$ we have
This shows that $(X_t)_{t\ge0}$ is a separable process in the sense of Doob, [47, II.2]. This means that there exist
such that, for any closed interval $I\subset\mathbb{R}$, and any open subset $O$ of $[0,\infty)$, the sets
□
For any function $f:[0,\infty)\to\mathbb{R}$, any rational numbers $a<b$ and any $S\subset[0,\infty)$ we denote by $N(f,S,[a,b])$ the supremum of the set of integers $k$ such that there exist
$$s_1 < t_1 < \cdots < s_k < t_k$$
in $S$ such that $f(s_i)\le a$, $f(t_i)\ge b$, $\forall i=1,\dots,k$.
For $m\in\mathbb{N}$ we set $N_m(f,[a,b]) := N(f,D_m,[a,b])$. Equivalently, $N_m(f,[a,b])$ is the number of upcrossings of the strip $[a,b]$ by the restriction $f|_{D_m}$. Note that
$$N_m\big(f,[a,b]\big) \le N_{m+1}\big(f,[a,b]\big), \quad \forall m,$$
and
$$N\big(f,D,[a,b]\big) = \lim_{m\to\infty} N_m\big(f,[a,b]\big).$$
exists a.s. We leave it to the reader to convince her/himself that since the process $X_\bullet$ is separable (see Remark 3.131) the limit
$$X_\infty = \lim_{t\to\infty} X_t$$
exists a.s. The boundedness assumption (3.3.13) coupled with Fatou's lemma implies that $X_\infty$ is integrable. □
The above theorem implies immediately the following continuous time counterpart of Theorem 3.59.
exists a.s. Let $T:\Omega\to[0,\infty]$ be a stopping time adapted to the filtration $(\mathcal{F}_t)$. The optional sampling of $X_\bullet$ at $T$ is the random variable
$$X_T(\omega) = \boldsymbol{I}_{\{T<\infty\}}\,X_{T(\omega)}(\omega) + \boldsymbol{I}_{\{T=\infty\}}\,X_\infty(\omega).$$
(iii) $\mathbb{E}[X_S] = \mathbb{E}[X_\infty] = \mathbb{E}[X_0]$.
Proof. We set
$$S_n = \sum_{k=0}^{\infty}\frac{k+1}{2^n}\,\boldsymbol{I}_{\{k2^{-n}<S\le(k+1)2^{-n}\}} + \infty\cdot\boldsymbol{I}_{\{S=\infty\}},$$
$$T_n = \sum_{k=0}^{\infty}\frac{k+1}{2^n}\,\boldsymbol{I}_{\{k2^{-n}<T\le(k+1)2^{-n}\}} + \infty\cdot\boldsymbol{I}_{\{T=\infty\}}.$$
Observe that $S_n\ge S$, $T_n\ge T$ and $S_n\le T_n$, $\forall n$.
Let us show that $S_n$ is $\mathcal{F}_S$-measurable and $T_n$ is $\mathcal{F}_T$-measurable. In other words, we have to show that
$$\{S_n\le c\}\cap\{S\le s\}\in\mathcal{F}_s, \quad \forall c,s\ge0.$$
Note that
$$\{S\le s\}\cap\{S_n\le c\} = \bigcup_{(k+1)2^{-n}\le c}\{S\le s\}\cap\big\{k2^{-n}<S\le(k+1)2^{-n}\big\} = \bigcup_{(k+1)2^{-n}\le c}\big\{k2^{-n}<S\le\min\big(s,(k+1)2^{-n}\big)\big\}\in\mathcal{F}_s.$$
The restricted process $(X_t)_{t\in D_n}$ is a UI discrete martingale with respect to the filtration $\mathcal{F}^n_\bullet := (\mathcal{F}_t)_{t\in D_n}$. The above arguments show that $S_n$ and $T_n$ are stopping times with respect to these filtrations. We deduce from the discrete Optional Sampling Theorem 3.64 that
$$X_{S_n} = X^n_{S_n} = \mathbb{E}\big[X^n_{T_n}\,\|\,\mathcal{F}^n_{S_n}\big] = \mathbb{E}\big[X_{T_n}\,\|\,\mathcal{F}_{S_n}\big],$$
and
$$X_{S_n} = \mathbb{E}\big[X_\infty\,\|\,\mathcal{F}_{S_n}\big], \quad X_{T_n} = \mathbb{E}\big[X_\infty\,\|\,\mathcal{F}_{T_n}\big].$$
Now observe that since $(X_t)$ is a.s. right continuous we have
$$X_S = \lim_{n\to\infty}X_{S_n} \ \text{ and } \ X_T = \lim_{n\to\infty}X_{T_n} \ \text{ a.s.}$$
The families $(X_{S_n})$ and $(X_{T_n})$ are UI so the above convergences also hold in $L^1$. Since $\mathcal{F}_S\subset\mathcal{F}_{S_n}\subset\mathcal{F}_{T_n}$ and the conditional expectation map
$$\mathbb{E}\big[-\,\|\,\mathcal{F}_S\big] : L^1(\Omega,\mathcal{F},\mathbb{P}) \to L^1(\Omega,\mathcal{F}_S,\mathbb{P})$$
is a contraction we deduce
$$X_S = \mathbb{E}\big[X_S\,\|\,\mathcal{F}_S\big] = \lim_{n\to\infty}\mathbb{E}\big[X_{S_n}\,\|\,\mathcal{F}_S\big] = \lim_{n\to\infty}\mathbb{E}\big[X_{T_n}\,\|\,\mathcal{F}_S\big] = \mathbb{E}\big[X_T\,\|\,\mathcal{F}_S\big],$$
where the above limits are in $L^1$. □
Corollary 3.138. Suppose that $(X_t)_{t\ge0}$ is an $R$-martingale and $S,T$ are bounded stopping times such that $S\le T$ a.s. Then the following hold.
Proof. Fix $t_0>0$ such that $S,T\le t_0$ a.s. Then the stopped process $X_{t\wedge t_0}$ is a UI $R$-martingale. The conclusions now follow from Theorem 3.137 applied to this stopped martingale. □
Proof. We begin by proving (ii). For $s<t$, the stopping times $s\wedge T$ and $t\wedge T$ are bounded and $s\wedge T\le t\wedge T$. The random variables $X_{t\wedge T}$ are $\mathcal{F}_{t\wedge T}$-measurable and thus $\mathcal{F}_t$-measurable since $\mathcal{F}_{t\wedge T}\subset\mathcal{F}_t$. To prove (3.3.14) it suffices to check that for any $A\in\mathcal{F}_t$ we have
$$\mathbb{E}\big[X_T\boldsymbol{I}_A\big] = \mathbb{E}\big[X_{t\wedge T}\boldsymbol{I}_A\big].$$
Decompose $\boldsymbol{I}_A = \boldsymbol{I}_{A\cap\{T\le t\}} + \boldsymbol{I}_{A\cap\{T>t\}}$. We have
$$X_T\boldsymbol{I}_{A\cap\{T\le t\}} = X_{t\wedge T}\boldsymbol{I}_{A\cap\{T\le t\}}$$
so that
$$\mathbb{E}\big[X_T\boldsymbol{I}_{A\cap\{T\le t\}}\big] = \mathbb{E}\big[X_{t\wedge T}\boldsymbol{I}_{A\cap\{T\le t\}}\big]. \tag{3.3.16}$$
On the other hand, we deduce from Theorem 3.137 that
$$X_{t\wedge T} = \mathbb{E}\big[X_T\,\|\,\mathcal{F}_{t\wedge T}\big].$$
Since $A\cap\{T>t\}\in\mathcal{F}_{t\wedge T}$, this implies
$$\mathbb{E}\big[X_{t\wedge T}\boldsymbol{I}_{A\cap\{T>t\}}\big] = \mathbb{E}\big[X_T\boldsymbol{I}_{A\cap\{T>t\}}\big]. \tag{3.3.17}$$
The desired conclusion follows by adding (3.3.16) and (3.3.17). The assertion (3.3.15) follows from the fact that the stopped martingale $X^T$ is UI. Part (i) now follows from (ii) applied to the sequence of UI martingales
$$(X^n_t)_{t\ge0} := (X_{n\wedge t})_{t\ge0}, \quad n\in\mathbb{N}.$$
Indeed, the martingales $X^n$ are compatible with $\mathcal{F}_t$ and for $s<t$ we have
$$\mathbb{E}\big[X^n_{T\wedge t}\,\|\,\mathcal{F}_s\big] = \mathbb{E}\Big[\mathbb{E}\big[X^n_T\,\|\,\mathcal{F}_t\big]\,\Big\|\,\mathcal{F}_s\Big] = \mathbb{E}\big[X^n_T\,\|\,\mathcal{F}_s\big] = X^n_{T\wedge s}.$$
Example 3.140. Suppose that $(B_t)_{t\ge0}$ is a Brownian motion started at 0 and $(\mathcal{F}_t)_{t\ge0}$ is its canonical filtration. For any $a\in\mathbb{R}$ we set
$$T_a := \inf\{t\ge0:\ B_t=a\}.$$
According to Proposition 3.118(ii), $\mathbb{P}[T_a<\infty]=1$.
(a) We want to show that if $a<0<b$, then
$$\mathbb{P}[T_a<T_b] = \frac{b}{b-a}, \quad \mathbb{P}[T_a>T_b] = \frac{-a}{b-a}. \tag{3.3.18}$$
Consider the stopping time $T=T_a\wedge T_b$ and the stopped martingale $M_t=B_{T\wedge t}$. This martingale is UI since $|M_t|\le|a|\vee|b|$. We deduce
$$0 = \mathbb{E}[M_0] = \mathbb{E}[M_\infty] = \mathbb{E}[B_T] = a\,\mathbb{P}[T_a<T_b] + b\,\mathbb{P}[T_b<T_a].$$
The equalities (3.3.18) follow by observing that the probabilities $\mathbb{P}[T_a<T_b]$ and $\mathbb{P}[T_a>T_b]$ satisfy a second linear constraint
$$\mathbb{P}[T_a<T_b] + \mathbb{P}[T_a>T_b] = 1.$$
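The identities (3.3.18) hold, with the same optional sampling argument, for the simple symmetric random walk with integer barriers, which makes them easy to test numerically. A Python sketch (the barriers, seed, and sample size are our own illustrative choices, not part of the text):

```python
import random

def first_hit(a, b, rng):
    """Run a simple symmetric random walk from 0 until it hits a or b
    (a < 0 < b); return the barrier that is hit first."""
    s = 0
    while a < s < b:
        s += rng.choice((-1, 1))
    return s

rng = random.Random(2)
a, b, n = -2, 3, 40_000
hits_a = sum(first_hit(a, b, rng) == a for _ in range(n)) / n
print(hits_a)   # should be close to b/(b-a) = 3/5
```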
(b) For $a>0$ we set
$$U_a := \inf\{t\ge0:\ |B_t|=a\} = T_a\wedge T_{-a}.$$
We want to show that
$$\mathbb{E}[U_a] = a^2. \tag{3.3.19}$$
To see this consider the martingale of Example 3.3.9(ii), $M_t = B_t^2 - t$. The stopped process $M_{t\wedge U_a}$ is still a martingale so
$$\mathbb{E}\big[M_{t\wedge U_a}\big] = \mathbb{E}[M_0] = 0 \ \text{ and } \ \mathbb{E}\big[B^2_{t\wedge U_a}\big] = \mathbb{E}[t\wedge U_a].$$
The Monotone Convergence Theorem implies that
$$\lim_{t\to\infty}\mathbb{E}[t\wedge U_a] = \mathbb{E}[U_a].$$
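The same martingale argument, applied with $S_n^2 - n$, shows that the simple symmetric random walk exits $(-a,a)$, $a\in\mathbb{N}$, in expected time $a^2$. A quick Monte Carlo check of this discrete analogue (parameters and seed are our own illustrative choices):

```python
import random

def exit_time(a, rng):
    """Number of steps a simple symmetric walk started at 0 needs
    to reach |position| = a."""
    s, n = 0, 0
    while abs(s) < a:
        s += rng.choice((-1, 1))
        n += 1
    return n

rng = random.Random(3)
a, reps = 3, 20_000
mean_time = sum(exit_time(a, rng) for _ in range(reps)) / reps
print(mean_time)   # should be close to a^2 = 9
```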
(c) Fix $a>0$. We want to compute the moment generating function of $T_a$. To this aim, we consider for any $\lambda\in\mathbb{R}$ the martingale of Example 3.3.9(iii)
$$X^\lambda_t := \exp\Big(\lambda B_t - \frac{\lambda^2 t}{2}\Big). \tag{3.3.20}$$
For $\lambda>0$ the stopped martingale $Y^\lambda_t = X^\lambda_{t\wedge T_a}$ is bounded, thus UI, and we deduce
$$1 = \mathbb{E}\big[Y^\lambda_0\big] = \mathbb{E}\big[Y^\lambda_\infty\big] = e^{\lambda a}\,\mathbb{E}\big[e^{-\frac{\lambda^2 T_a}{2}}\big].$$
Replacing $\lambda$ with $\sqrt{2\lambda}$ we deduce
$$\mathbb{E}\big[e^{-\lambda T_a}\big] = e^{-a\sqrt{2\lambda}}.$$
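This Laplace transform can be checked numerically by sampling $T_a$ exactly via Corollary 3.125, $T_a \stackrel{d}{=} a^2/Z^2$ with $Z$ standard normal. A Python sketch (the values of $a$, $\lambda$, the seed, and the sample size are our own illustrative choices):

```python
import math
import random

rng = random.Random(4)
a, lam, n = 1.0, 1.0, 200_000

# Sample T_a exactly via T_a =_d a^2 / Z^2 (Corollary 3.125) and
# estimate the Laplace transform E[exp(-lam * T_a)].
acc = 0.0
for _ in range(n):
    z = rng.gauss(0.0, 1.0)
    acc += math.exp(-lam * (a / z) ** 2)
est = acc / n
exact = math.exp(-a * math.sqrt(2.0 * lam))   # e^{-a sqrt(2 lam)}
print(est, exact)
```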
3.4 Exercises
Exercise 3.1. Suppose that $(X_n)_{n\ge0}$ is a sequence of integrable random variables and $(q_n)_{n\ge1}$ is a sequence of nonzero real numbers such that, for any $n\in\mathbb{N}$,
$$\mathbb{E}\big[X_n\,\|\,\mathcal{F}_{n-1}\big] = q_nX_{n-1}, \quad \mathcal{F}_{n-1} := \sigma(X_0,\dots,X_{n-1}).$$
Exercise 3.2. Suppose that $(X_n)_{n\ge0}$ is a martingale with respect to a filtration $(\mathcal{F}_n)_{n\ge0}$ such that $X_0=0$ and $\mathbb{E}\big[|X_n|^2\big]<\infty$, $\forall n$. Using the sequence of differences $D_n = X_n - X_{n-1}$, $n\ge1$, we construct two new processes, the optional quadratic variation
$$Q_n = \sum_{k=1}^{n} D_k^2$$
and the predictable quadratic variation
$$V_n = \sum_{k=1}^{n}\mathbb{E}\big[D_k^2\,\|\,\mathcal{F}_{k-1}\big].$$
Prove that the processes
$$A_n = X_n^2 - Q_n, \quad B_n = X_n^2 - V_n$$
are martingales with respect to $(\mathcal{F}_n)_{n\ge0}$. □
Exercise 3.3. Let $x_1,\dots,x_r\in\mathbb{R}$. Fix a family $\{I_n,J_n;\ n\in\mathbb{N}\}$ of independent random variables such that $I_n,J_n$ are uniformly distributed on $\{1,\dots,n-1\}$, $\forall n\ge2$. Define inductively
$$X_n := \begin{cases} x_n, & n\le r,\\ X_{I_n}+X_{J_n}, & n>r,\end{cases}$$
and set
$$Y_n := \frac{1}{n(n+1)}\sum_{k=1}^{n} X_k.$$
Prove that the sequence $(Y_n)$ is a martingale with respect to the filtration $\sigma(X_1,\dots,X_n)$. □
Exercise 3.10 (Pratt's Lemma). Let $(X_n)$, $(Y_n)$, $(Z_n)$ be three sequences of integrable random variables with the following properties.
(i) $X_n\le Y_n\le Z_n$, $\forall n$.
(ii) $X_n\xrightarrow{p}X$, $Y_n\xrightarrow{p}Y$, $Z_n\xrightarrow{p}Z$.
(iii) $\mathbb{E}[X_n]\to\mathbb{E}[X]$, $\mathbb{E}[Z_n]\to\mathbb{E}[Z]$.
Prove that $\mathbb{E}[Y_n]\to\mathbb{E}[Y]$. □
Exercise 3.12 (P. Lévy). Suppose that $(\Omega,\mathcal{S},\mathbb{P})$ is a probability space and $(\mathcal{F}_n)_{n\ge1}$ is a filtration of sigma-subalgebras. Let $(F_n)$ be a sequence of events such that $F_n\in\mathcal{F}_n$, $\forall n$. We set
$$X_n = \sum_{k=1}^{n}\Big(\boldsymbol{I}_{F_k} - \mathbb{E}\big[\boldsymbol{I}_{F_k}\,\|\,\mathcal{F}_{k-1}\big]\Big).$$
(i) Prove that $X_n$ is a martingale and $|X_n-X_{n-1}|\le 4$, $\forall n$. Hint. Have a look at Example 3.14.
(ii) Prove that
$$\{F_n \ \text{i.o.}\} = \Big\{\sum_{n\ge1}\mathbb{E}\big[\boldsymbol{I}_{F_n}\,\|\,\mathcal{F}_{n-1}\big]=\infty\Big\}.$$
Hint. Use Exercise 3.11.
(iii) Deduce from (ii) the second Borel-Cantelli Lemma, Theorem 1.139(ii). □
(i) Compute $p_n$. Hint. Use André's reflection trick in Example 1.60.
(ii) Show that $p_n\to1$ as $n\to\infty$. □
Exercise 3.14. Consider the situation in Example 3.31. We have a finite set $\mathcal{A}$ called the alphabet, and a probability distribution $\pi$ on $\mathcal{A}$ such that $\pi[a]\ne0$, $\forall a\in\mathcal{A}$. Fix two words
$$a = (a_1,\dots,a_k)\in\mathcal{A}^k, \quad b = (b_1,\dots,b_\ell)\in\mathcal{A}^\ell$$
and assume that $b$ is not a subword of $a$, i.e.,
$$(a_{i+1},\dots,a_{i+\ell}) \ne (b_1,\dots,b_\ell), \quad \forall i=0,\dots,k-\ell.$$
Let $(A_n)_{n\ge1}$ be i.i.d. $\mathcal{A}$-valued random variables with common distribution $\pi$. As in Example 3.31 we denote by $T_b$ the time to observe the pattern $b$.
Hint. Consider the same martingale $(X_n)$ as in Example 3.31. Observe that $X_k = \Phi(a,b)-k$ given that $A_j=a_j$, $j=1,\dots,k$. (ii) Note that $\mathbb{E}[T_b] = \mathbb{E}[T] + \mathbb{E}[T_b - T]$ and (i) gives a formula for $\mathbb{E}\big[T_b - T\,\|\,T=T_a\big]$. □
Prove that $M(D)$ is bijective and for any polynomial $P$ the sequence
$$Y_n = M(D)^{-n}P(S_n), \quad n\ge1,$$
is a martingale. Find $Y_n$ when $P(x)=x$ and $P(x)=x^2$.
Hint. Set $P_n := T^{-n}P$ and express $\mathbb{E}\big[P_{n+1}(S_n+X_{n+1})\,\|\,X_1,\dots,X_n\big]$ using the operator $M(D)$. □
Exercise 3.17. Suppose that $(X_n)_{n\ge0}$ is a martingale with respect to the filtration $\mathcal{F}_\bullet = (\mathcal{F}_n)_{n\ge0}$ such that $\mathbb{E}\big[X_n^2\big]<\infty$, $\forall n$. The sequence $(X_n^2)_{n\ge0}$ is a submartingale. □
(i) Prove that the probability $p_g$ that he reaches his goal is $\le \dfrac{f_0}{g}$.
(ii) Prove that if $B_n\le\min(F_{n-1},\,g-F_{n-1})$, $\forall n\ge1$, then $p_g = \dfrac{f_0}{g}$.
(iii) Find $p_g$ if $B_n = \frac12 F_{n-1}$. □
Exercise 3.20. Suppose that we are given a sequence of i.i.d. random vectors
$$X_n : (\Omega,\mathcal{S},\mathbb{P}) \to \mathcal{X} := \mathbb{R}^N$$
and a collection $\mathcal{F}$ of uniformly bounded measurable functions $f:\mathcal{X}\to\mathbb{R}$, i.e., there exists $C>0$ such that $\|f\|_{L^\infty}\le C$, $\forall f\in\mathcal{F}$. For $n\in\mathbb{N}$ we set
$$D_n(\mathcal{F}) := \sup_{f\in\mathcal{F}}\Big|\frac1n\sum_{k=1}^{n}\bar f(X_k)\Big|, \quad \bar f(x) := f(x) - \mathbb{E}\big[f(X_1)\big],$$
$$R_n(\mathcal{F}) = \sup_{f\in\mathcal{F}}\mathbb{E}\Big[\Big|\frac1n\sum_{k=1}^{n}R_k f(X_k)\Big|\Big],$$
where $(R_n)_{n\ge1}$ is a sequence of independent Rademacher random variables that are also independent of $(X_n)_{n\ge1}$. Assume that $D_n$ is measurable. □
Exercise 3.21. Consider the standard random walk $(S_n)_{n\ge0}$ on $\mathbb{Z}$ started at $a$, i.e.,
$$S_0 = a\in\mathbb{Z}, \quad S_n = a + X_1+\cdots+X_n,$$
where $(X_n)_{n\ge1}$ are i.i.d. with $\mathbb{P}[X_n=\pm1]=\frac12$. Fix $a,g\in\mathbb{N}_0$, $a<g$, and set
$$T_a := \min\{n\in\mathbb{N};\ S_n=0 \ \text{or}\ S_n=g\}.$$
□
Exercise 3.23. Suppose that $(X_n)_{n\ge1}$ is a sequence of i.i.d., nonnegative, integer valued random variables with finite mean. Set
$$S_n := X_1+\cdots+X_n.$$
For $k=1,\dots,n$, set $\mathcal{F}_{-k} = \sigma(S_k,S_{k+1},\dots,S_n)$, $Y_{-k} = S_k/k$.
Exercise 3.25. Suppose that $(X_n)_{n\ge0}$ is a supermartingale such that there exist $f_0,g>0$ with the property
$$X_0 = f_0 \ \text{a.s.}, \quad 0\le X_n\le g \ \text{a.s.}, \ \forall n\in\mathbb{N}.$$
Prove that for any stopping time $T$ such that $\mathbb{P}[T<\infty]=1$ we have
$$\mathbb{P}[X_T = g] \le \frac{f_0}{g}. \qquad\Box$$
Exercise 3.26. Consider the branching process $(Z_n)_{n\ge0}$ with initial condition $Z_0=1$ and reproduction law $\mu\in\operatorname{Prob}(\mathbb{N}_0)$ such that
$$m := \mathbb{E}[\mu] = \sum_{n\ge0} n\mu_n < \infty, \quad \mu_n := \mu[n].$$
We set
$$f_n(s) := \underbrace{f\circ\cdots\circ f}_{n}(s), \quad n\in\mathbb{N}.$$
(i) Show that if $m>1$ the equation $f(s)=s$ has a unique solution $r=r(\mu)$ in the interval $(0,1)$. Compute $r(\mu)$ when
$$\mu_n = qp^n, \ n\in\mathbb{N}_0,$$
where $p\in(1/2,1)$, $q = 1-p$.
(ii) Prove that $\mathbb{P}[Z_n=0] = f_n(0)$.
(iii) Denote by $E$ the extinction event
$$E = \bigcup_{n\ge0}\{Z_n=0\}.$$
Prove that
$$\mathbb{P}[E] = \begin{cases}1, & m\le1,\\ r(\mu), & m>1.\end{cases}$$
(iv) Assume $m>1$. Prove that the sequence $\big(r^{Z_n}\big)_{n\ge0}$ is a martingale.
(v) Set
$$W_n := \frac{1}{m^n}Z_n.$$
Assume
$$m>1, \quad \mathbb{E}\big[Z_1^2\big] = \sum_{n\ge0} n^2\mu_n < \infty,$$
and set
$$W := \lim_{n\to\infty} W_n.$$
Prove that
$$\varphi'(0) = 1, \quad \varphi(\lambda) = f\big(\varphi(\lambda/m)\big) = \sum_{n=0}^{\infty}\mu_n\,\varphi(\lambda/m)^n, \quad \forall\,\operatorname{Re}\lambda\ge0. \tag{3.4.2}$$
(vi) Prove that there exists at most one probability measure $\nu\in\operatorname{Prob}\big([0,\infty)\big)$ such that
$$\int_0^\infty t^2\,\nu[dt] < \infty$$
and $\nu$ satisfies (3.4.2).
Hint. Consider two such measures $\nu_k$, $k=0,1$, and denote by $\Phi_k(t)$ their characteristic functions. Set $\Phi(t) = \Phi_1(t)-\Phi_0(t)$, $\gamma(t) = \Phi(t)/t$, $t\ne0$. Prove that $|\gamma(mt)|\le|\gamma(t)|$ and conclude that $\Phi\equiv0$. □
Exercise 3.28. Suppose that $(X_n)_{n\ge0}$ is an $L^2$-martingale adapted to the filtration $(\mathcal{F}_n)_{n\ge0}$ and $\langle X_\bullet\rangle$ is its quadratic variation; see Definition 3.15. Fix a bounded predictable process $(H_n)_{n\ge0}$ and form the discrete stochastic integral $(H\bullet X)$; see Theorem 3.17.
Exercise 3.29. Suppose that $(X_n)_{n\in\mathbb{N}}$ is an exchangeable sequence of random variables and $T$ is a stopping time adapted to the filtration $\mathcal{F}_n = \sigma(X_1,\dots,X_n)$. Prove that if $T<N$ a.s., then $X_{T+1}$ has the same distribution as $X_1$. □
Exercise 3.30. Suppose that $(X_n)_{n\in\mathbb{N}}$ is a sequence of random variables such that for any $n\in\mathbb{N}$ the distribution of the random vector $(X_1,\dots,X_n)$ is orthogonally invariant, i.e., for any $T\in O(n)$, $T_\#\mathbb{P}_{X_1,\dots,X_n} = \mathbb{P}_{X_1,\dots,X_n}$. Prove that $(X_n)_{n\in\mathbb{N}}$ are conditionally i.i.d. $N(0,\sigma^2)$ given a random variable $\sigma^2\ge0$. □
Exercise 3.37. Suppose that $(W_t)_{t\ge0}$ is a pre-Brownian motion defined on a probability space $(\Omega,\mathcal{S},\mathbb{P})$; see Definition 2.71. Let $t_0,\delta\ge0$. Set
$$R(t_0,\delta) = \sup_{t\in\mathbb{Q}\cap[t_0,t_0+\delta]}\big|W(t)-W(t_0)\big|. \qquad\Box$$
Exercise 3.38. Let $(B_t)_{t\ge0}$ be a standard Brownian motion and $-a<0<b$. Set $T=\min(T_{-a},T_b)$ where for $c\in\mathbb{R}$ we set $T_c = \inf\{t\ge0;\ B_t=c\}$. Prove that
$$\mathbb{E}[T] = \mathbb{E}\big[B_T^2\big] = ab. \qquad\Box$$
Exercise 3.39 (P. Lévy). Let $(B_t)_{t\ge0}$ be a standard Brownian motion and $c>0$. For $a\in\mathbb{R}$ we denote by $r_a$ the reflection $r_a:\mathbb{R}\to\mathbb{R}$, $r_a(x) = 2a-x$.
(i) Prove that for any Borel subsets $U_-\subset(-\infty,-c]$, $U_+\subset[c,\infty)$ we have
$$\mathbb{P}\big[T_c<T_{-c},\ B_1\in U_-\big] + \mathbb{P}\big[T_c>T_{-c},\ B_1\in r_c(U_-)\big] = \mathbb{P}\big[B_1\in r_c(U_-)\big],$$
$$\mathbb{P}\big[T_c>T_{-c},\ B_1\in U_+\big] + \mathbb{P}\big[T_c<T_{-c},\ B_1\in r_{-c}(U_+)\big] = \mathbb{P}\big[B_1\in r_{-c}(U_+)\big].$$
(ii) Denote by $J$ the interval $[-c,c]$. Prove that
$$\mathbb{P}\big[T_c\le T_{-c}\wedge1,\ B_1\in J\big] = \mathbb{P}\big[B_1\in r_c(J)\big] - \mathbb{P}\big[T_c>T_{-c},\ B_1\in r_c(J)\big],$$
$$\mathbb{P}\big[T_{-c}\le T_c\wedge1,\ B_1\in J\big] = \mathbb{P}\big[B_1\in r_{-c}(J)\big] - \mathbb{P}\big[T_c<T_{-c},\ B_1\in r_{-c}(J)\big].$$
(iii) Prove that
$$\mathbb{P}\Big[\sup_{t\in[0,1]}|B_t|<c\Big] = \mathbb{P}[B_1\in J] - \Big(\mathbb{P}\big[T_c\le T_{-c}\wedge1,\ B_1\in J\big] + \mathbb{P}\big[T_{-c}\le T_c\wedge1,\ B_1\in J\big]\Big). \qquad\Box$$
Remark 3.142. Exercise 3.39 is a special case of a more general result called the support theorem. For any continuous function $f:[0,1]\to\mathbb{R}$ such that $f(0)=0$ and any $\varepsilon>0$ we have
$$\mathbb{P}\Big[\sup_{t\in[0,1]}\big|B_t-f(t)\big|\le\varepsilon\Big] > 0. \tag{3.4.3}$$
Think of $f(t)$ as tracing the motion of the tip of an infinitesimally fine pen as you sign a planar piece of paper, starting at the origin.
Chapter 4
Markov chains
Markov chains form a special but sufficiently general class of stochastic processes. Their investigation requires a diverse arsenal of techniques, probabilistic and otherwise, and they reveal important patterns that arise in many other instances.
The foundations of this theory were laid by the Russian mathematician A. A. Markov at the beginning of the twentieth century. By most accounts, Markov was a rather unconventional individual. He discovered what we now know as Markov chains in his attempts to contradict Pavel Nekrasov, a mathematician/theologian of that time who maintained on theological grounds that the Law of Large Numbers was specific to independent events/random variables and cannot be seen in other contexts. Markov succeeded in proving Nekrasov wrong and in the process laid the foundations of the theory of Markov chains. For more on the history of this concept we refer to the very readable article [82].
So what did Markov discover? Think of a Markov chain as a random walk on a finite set $\mathcal{X}$. From a given location $x$ the walker can go to a location $x'$ with probability $q_{x,x'}$. Suppose that at some location $x_0\in\mathcal{X}$ we placed a pile of sand consisting of giddy grains of sand: every second one of them starts this random walk and performs a billion steps (think of a fixed but very large number of steps). After all the grains of sand have performed this ritual, the initial pile of sand is redistributed at various points of $\mathcal{X}$. Denote by $m^1_x$ the mass of the pile of sand relocated at $x$. Next, collect the piles from their locations and move them back to the initial location $x_0$.
Running the above experiment again we get a new distribution of piles of sand at the points of $\mathcal{X}$. Denote the mass at $x$ by $m^2_x$. Markov observed that
$$\frac{m^1_x}{m^2_x}\approx1, \quad \forall x.$$
Run the experiment a third time to obtain a third distribution of mass $(m^3_x)_{x\in\mathcal{X}}$ and the conclusion is the same:
$$\frac{m^1_x}{m^3_x}\approx1, \quad \forall x.$$
To put it differently, if $m$ is the mass of the pile of sand at $x_0$, then, for any $x\in\mathcal{X}$,
$$\frac{m^1_x}{m} \approx \frac{m^2_x}{m} \approx \frac{m^3_x}{m} \approx \cdots.$$
This phenomenon is one manifestation of the Law of Large Numbers for Markov chains.
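The pile-of-sand experiment is just the distribution dynamics $\mu\mapsto\mu Q^n$: for a well-behaved (irreducible, aperiodic) chain the redistributed masses stabilize and forget where the pile started. The following Python sketch illustrates this on a small transition matrix of our own choosing; the matrix and the number of iterations are purely illustrative:

```python
def step(mu, Q):
    """One step of the distribution dynamics: mu -> mu * Q
    (row vector times stochastic matrix)."""
    n = len(mu)
    return [sum(mu[i] * Q[i][j] for i in range(n)) for j in range(n)]

# A hypothetical irreducible, aperiodic 3-state chain.
Q = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.1, 0.4, 0.5]]

mu1, mu2 = [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]   # two different starting piles
for _ in range(100):
    mu1, mu2 = step(mu1, Q), step(mu2, Q)
print(mu1)
print(mu2)   # the two rows agree: both converge to the stationary distribution
```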
In the more than a century since its creation, the theory of Markov chains has witnessed dramatic growth and generalizations, and has found applications to unexpected problems. For example, Google's PageRank algorithm is a special application of the Law of Large Numbers for Markov chains.
The present chapter is an introduction to the theory of Markov chains. We present the classical results and spend some time on more recent developments. As always, we try to illustrate the power of the theory on many concrete examples. Needless to say, we barely scratch the surface of this subject.
In the sequel X will denote a finite or countable set equipped with the discrete
topology. We will refer to it as the state space. The Borel sigma-algebra of X
coincides with the sigma-algebra 2X of all subsets of X .
$$X_n : (\Omega,\mathcal{S},\mathbb{P}) \to (\mathcal{X},2^{\mathcal{X}}), \quad n\in\mathbb{N}_0,$$
$$\forall n\in\mathbb{N},\ x_0,x_1,\dots,x_n,x_{n+1}\in\mathcal{X}.$$
The filtration associated to the Markov chain is the sequence of sigma-subalgebras
$$\mathcal{F}_n := \sigma(X_0,\dots,X_n), \quad n\in\mathbb{N}_0.$$
$$\mathbb{P}\big[X_{n+1}=x'\,\big|\,X_n=x\big] = \mathbb{P}\big[X_1=x'\,\big|\,X_0=x\big].$$
Remark 4.2. (a) Let us observe that the Markov property can be written in the more compact form
$$\mathbb{P}\big[X_{n+1}=x\,\|\,X_n\big] = \mathbb{P}\big[X_{n+1}=x\,\|\,\mathcal{F}_n\big], \quad \forall n\in\mathbb{N},\ x\in\mathcal{X}. \tag{4.1.2}$$
In view of Proposition 1.172, the last property is equivalent to the conditional independence
$$X_{n+1} \perp\!\!\!\perp_{X_n} \mathcal{F}_{n-1}, \quad \forall n\in\mathbb{N}. \tag{4.1.3}$$
Exercise 1.51 shows that this is also equivalent to the condition
$$X_{n+1} \perp\!\!\!\perp_{X_n} \mathcal{F}_n, \quad \forall n\in\mathbb{N}. \tag{4.1.4}$$
One can show that this is further equivalent to
$$\sigma(X_{n+1},X_{n+2},\dots) \perp\!\!\!\perp_{X_n} \mathcal{F}_n. \tag{4.1.5}$$
This is colloquially expressed as saying that the future is conditionally independent
of the past given the present.
(b) It is convenient to think of a Markov chain with state space X as describing the
random walk of a grasshopper hopscotching on the elements of X . The decision
where to jump next is not influenced by the past, but only by the current location
and the current time. For a homogeneous Markov chain the decision where to
jump next depends only on the current location and not on the “time” n when the
grasshopper reaches that state. Thus Qx0 ,x1 is the probability that the grasshopper,
currently located at x0 , will jump to x1 .
We can represent an HMC with state space X and transition matrix Q as a
directed graph (loops allowed) with vertex set X constructed as follows: there is a
directed edge from $x_0$ to $x_1$ if and only if $Q_{x_0,x_1}>0$. □
If $(X_n)_{n\ge0}$ is a homogeneous Markov chain (or HMC for brevity), then its transition matrix $Q$ is stochastic, i.e.,
$$Q_{x_0,x_1}\ge0, \quad \sum_{x\in\mathcal{X}}Q_{x_0,x} = 1, \quad \forall x_0,x_1\in\mathcal{X}. \tag{4.1.6}$$
In other words, the entries of the matrix $Q$ are nonnegative and the sum of the entries in each row is equal to 1.
If $\mu_n$ is the distribution of $X_n$, then, for any $x\in\mathcal{X}$, we have
$$\mathbb{P}[X_{n+1}=x] = \sum_{x_0\in\mathcal{X}}\mathbb{P}[X_n=x_0]\,Q_{x_0,x} = \sum_{x_0\in\mathcal{X}}\mu_n[x_0]\,Q_{x_0,x}.$$
¹I made the decision to break with tradition and use the letter $Q$ to denote the transition matrix after teaching this topic and realizing that there were too many $P$'s on the blackboard, and this sometimes confused the audience.
Think of $\mu_n$ and $\mu_{n+1}$ as matrices consisting of a single row. We can rewrite the above equality as an equality of matrices $\mu_{n+1}=\mu_nQ$. In particular,
$$\mu_n = \mu_0 Q^n, \tag{4.1.7}$$
where $Q^n$ denotes the $n$-th power of the matrix $Q$, $Q^n = \big(Q^n_{x,y}\big)_{x,y\in\mathcal{X}}$. From (4.1.7) we deduce that
$$\mathbb{P}\big[X_n=x_n\,\big|\,X_0=x_0\big] = Q^n_{x_0,x_n}. \tag{4.1.8}$$
For this reason the matrix $Q^n$ is also known as the $n$-th step transition matrix.
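The identity (4.1.8) is easy to test numerically: simulate the chain for a few steps and compare the empirical distribution with the corresponding row of $Q^n$. A Python sketch; the 3-state matrix, seed, and sample size below are our own illustrative choices:

```python
import random

def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def simulate_step(x, Q, rng):
    """Jump from state x according to the probabilities in row Q[x]."""
    u, acc = rng.random(), 0.0
    for y, q in enumerate(Q[x]):
        acc += q
        if u < acc:
            return y
    return len(Q[x]) - 1

# A hypothetical 3-state transition matrix.
Q = [[0.1, 0.6, 0.3],
     [0.4, 0.4, 0.2],
     [0.3, 0.3, 0.4]]

Q3 = mat_mul(mat_mul(Q, Q), Q)    # three-step transition matrix Q^3
rng = random.Random(5)
reps, counts = 100_000, [0, 0, 0]
for _ in range(reps):
    x = 0
    for _ in range(3):
        x = simulate_step(x, Q, rng)
    counts[x] += 1
emp = [c / reps for c in counts]
print(emp)    # should approximate row 0 of Q^3, i.e. P[X_3 = y | X_0 = 0]
```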
Let us show that given any matrix $Q:\mathcal{X}\times\mathcal{X}\to[0,1]$ satisfying (1.2.19) and any probability measure $\mu$ on $\mathcal{X}$, there exists a homogeneous Markov chain with state space $\mathcal{X}$, initial distribution $\mu$ and transition matrix $Q$, i.e., $\operatorname{Markov}(\mathcal{X},\mu,Q)\ne\emptyset$. Observe that we can view $Q$ as a kernel or random probability measure.
Consider the set X N0 equipped with the natural product sigma algebra E; see
Definition 1.192. In this case it coincides with the sigma algebra generated by
π-system consisting of the cylinders
To prove that such a measure does indeed exist for any $\mu$ and $Q$ we will rely on Kolmogorov's existence theorem, Theorem 1.195.
The equalities (4.1.9) define probability measures $\mathbb{P}_k = \mathbb{P}^{\mu,Q}_k$ on the product spaces $\mathcal{X}^{\{0,1,\dots,k\}}$ by setting
$$\mathbb{P}_k\big[(s_0,\dots,s_k)\big] := \mu[s_0]\prod_{i=1}^{k}Q_{s_{i-1},s_i}. \tag{4.1.10}$$
The family of measures $(\mathbb{P}_k)_{k\ge0}$ is projective since the transition matrix $Q$ is stochastic. Indeed,
$$\mathbb{P}_{k+1}\big[(s_0,\dots,s_k)\times\mathcal{X}\big] = \sum_{x\in\mathcal{X}}\mathbb{P}_{k+1}\big[(s_0,\dots,s_k,x)\big] = \mu[s_0]\prod_{i=1}^{k}Q_{s_{i-1},s_i}\underbrace{\Big(\sum_{x}Q_{s_k,x}\Big)}_{=1} = \mathbb{P}_k\big[(s_0,\dots,s_k)\big]. \tag{4.1.12}$$
For $x\in\mathcal{X}$ we set
$$\mathbb{E}_x := \mathbb{E}_{\delta_x}. \tag{4.1.15}$$
$$\Theta : \mathcal{X}^{\mathbb{N}_0}\to\mathcal{X}^{\mathbb{N}_0}, \quad \Theta(x_0,x_1,x_2,\dots) = (x_1,x_2,\dots).$$
$$X_n : \mathcal{X}^{\mathbb{N}_0}\to\mathcal{X}, \quad X_n(x) = x_n, \ n\in\mathbb{N}_0.$$
$$\mathbb{E}_\mu\big[F\circ\Theta^n\,\|\,\mathcal{E}_n\big] = \mathbb{E}_\mu\big[F\,\|\,X_n\big]. \tag{4.1.16}$$
We will show that for any $A\subset\mathcal{X}$ we have the equality of random variables
$$\mathbb{P}\big[X_{n+1}\in A\,\|\,\mathcal{E}_n\big] = Q_{X_n}[A] = \sum_{a\in A}Q_{X_n,a}. \tag{4.1.17}$$
Note that
$$\boldsymbol{I}_{C_{A_0,\dots,A_N}} = \prod_{k=0}^{N}\boldsymbol{I}_{\{X_k\in A_k\}},$$
and
$$\boldsymbol{I}_{C_{A_0,\dots,A_N}}\circ\Theta^n = \prod_{k=0}^{N}\boldsymbol{I}_{\{X_{n+k}\in A_k\}}.$$
To verify (4.1.16) for sets of this form and arbitrary $n$ we argue by induction on $N$. For $N=1$ this follows from (4.1.17). For the inductive step note that
$$\mathbb{E}\big[\boldsymbol{I}_{\{X_n=x_0,X_{n+1}=x_1,\dots,X_{n+N}=x_N\}}\,\|\,\mathcal{E}_n\big] = \mathbb{E}\Big[\prod_{k=0}^{N}\boldsymbol{I}_{\{X_{n+k}=x_k\}}\,\Big\|\,\mathcal{E}_n\Big]$$
$$= \mathbb{E}\Big[\boldsymbol{I}_{\{X_n=x_0\}}\,\mathbb{E}\Big[\prod_{k=1}^{N}\boldsymbol{I}_{\{X_{n+k}=x_k\}}\,\Big\|\,\mathcal{E}_{n+1}\Big]\,\Big\|\,\mathcal{E}_n\Big]$$
$$= \mathbb{E}\Big[\boldsymbol{I}_{\{X_n=x_0\}}\,\underbrace{\mathbb{E}\Big[\prod_{k=1}^{N}\boldsymbol{I}_{\{X_{n+k}=x_k\}}\,\Big\|\,X_{n+1}\Big]}_{=:f(X_{n+1})}\,\Big\|\,\mathcal{E}_n\Big] \qquad \big(\sigma(X_{n+1})\subset\mathcal{E}_{n+1}\big)$$
$$= \mathbb{E}\Big[\prod_{k=0}^{N}\boldsymbol{I}_{\{X_{n+k}=x_k\}}\,\Big\|\,X_n\Big]. \qquad\Box$$
Remark 4.4. We have deduced (4.1.16) relying on the Markov property. The above proof shows that the Markov property (4.1.17) is a special case of (4.1.16). For this reason we can take (4.1.16) as the definition of the Markov property. □
$$\mathbb{P}_{\vec X} := \vec X_{\#}\mathbb{P} \in \operatorname{Prob}\big(\mathcal{X}^{\mathbb{N}_0},\mathcal{E}\big).$$
It is uniquely determined by the equalities
$$\mathbb{P}_{\vec X}\big[C_{s_0,s_1,\dots,s_k}\big] := \mathbb{P}\big[X_0=s_0,\dots,X_k=s_k\big] = \mu_0[s_0]\prod_{i=1}^{k}Q_{s_{i-1},s_i}. \tag{4.1.19}$$
We deduce
$$\mathbb{P}_{\vec X} = \mathbb{P}_\mu.$$
For every $F\in L^1\big(\mathcal{X}^{\mathbb{N}_0},\mathcal{E},\mathbb{P}_\mu\big)$ we have
$$\mathbb{E}_{\mathbb{P}}\big[F(X_0,X_1,\dots)\big] = \mathbb{E}_\mu[F] = \int_{\mathcal{X}^{\mathbb{N}_0}}F(x)\,\mathbb{P}_\mu[dx].$$
This is a special case of the change in variables formula (1.2.21).
This shows that the distribution of the Markov chain is uniquely determined by the initial distribution $\mu\in\operatorname{Prob}(\mathcal{X})$ and the transition matrix $Q$.
Remark 4.5. One can define any HMC on probability spaces other than $\mathcal{X}^{\mathbb{N}_0}$. Here is such a construction corresponding to state space $\mathcal{X}$, transition matrix $Q$ and initial probability distribution $\mu$. We set $\mu_x := \mu[x]$.
First, a little bit of terminology. We say that an interval is convenient if it is either empty or of the form $[a,b)$, $a<b$. If $[a,b)$, $[c,d)$ are nonempty convenient intervals, then we say that $[a,b)$ precedes $[c,d)$, and we write $[a,b)\prec[c,d)$, if $b\le c$. The empty set is allowed to precede or succeed any nonempty convenient interval. Assume that $\mathcal{X}$ is a subset of $\mathbb{N}$; as such it is equipped with a total order.
The probability space is the unit interval $[0,1)$ equipped with the Lebesgue measure. The random variables $X_n$ depend on the choice of initial distribution and are defined inductively as follows.
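The interval bookkeeping above can be sketched in code: each row of $Q$ partitions $[0,1)$ into consecutive convenient intervals, ordered like the states, and a uniform point selects the next state by the interval it lands in. This is a simplified per-step version of the construction, not the book's full deterministic coupling; the 2-state matrix and seed are our own illustrative choices:

```python
import random

def row_partition(row):
    """Partition [0, 1) into consecutive intervals [a, b), one per state,
    with lengths given by the transition probabilities in the row."""
    intervals, a = [], 0.0
    for q in row:
        intervals.append((a, a + q))
        a += q
    return intervals

def next_state(x, Q, u):
    """The next state is the unique y whose interval contains u."""
    for y, (a, b) in enumerate(row_partition(Q[x])):
        if a <= u < b:
            return y
    return len(Q[x]) - 1      # guard against rounding at the right endpoint

# A hypothetical 2-state chain.
Q = [[0.25, 0.75],
     [0.50, 0.50]]
rng = random.Random(6)
path = [0]
for _ in range(10):
    path.append(next_state(path[-1], Q, rng.random()))
print(path)
```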
4.1.2 Examples
The homogeneous Markov chains appear in many and diverse situations. According
to the discussion in the previous subsection, to describe an HMC it suffices to
describe the state space X and the transition matrix Q. We will remain vague
about the initial distribution µ.
Example 4.6 (Gambler's ruin). Consider the gambler's ruin problem discussed in Example 3.72. The state space is $\mathcal{X}=\{0,1,\dots,N\}$. Then $X_n$ is the fortune of a gambler at time $n$. The gambler flips a fair coin with two faces labeled $\pm1$. If his fortune is strictly between 0 and $N$, then his fortune changes by the amount shown on the face of the coin. The game stops when his fortune reaches either 0 or $N$. Concretely,
$$Q_{k,k+1} = Q_{k,k-1} = \frac12, \quad Q_{k,j} = 0 \ \text{ if } |k-j|>1, \quad 0<k,j<N.$$
The directed graph describing this HMC is depicted in Figure 4.1 where, for clarity, we have omitted the loops at 0 and $N$. □
Example 4.7 (The Ehrenfest Urn). Consider the following situation. There are $B$ balls in two urns. Equivalently, think of an urn with two chambers. Pick one of these $B$ balls uniformly at random and move it into the other box/chamber. Denote by $X_n$ the number of balls in the left box at time $n$. Then $(X_n)_{n\ge0}$ is an HMC with transition probabilities
$$Q_{i,i+1} = \frac{B-i}{B},\ i=0,1,\dots,B-1, \quad Q_{i,i-1} = \frac{i}{B},\ i=1,\dots,B, \quad Q_{i,j}=0,\ |i-j|>1.$$
This HMC is known as the Ehrenfest urn. Note that during this process it is more likely that a ball moves from the more crowded box to the less crowded one, similarly to what happens in diffusion processes. □
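The diffusive behavior of the Ehrenfest urn is easy to see in simulation: the chain mean-reverts toward the balanced configuration, so its long-run time average is $B/2$. A Python sketch (the value of $B$, seed, and run length are our own illustrative choices):

```python
import random

def ehrenfest_step(x, B, rng):
    """Move one of the B balls, chosen uniformly, to the other chamber:
    with probability x/B a ball leaves the left box, otherwise one enters."""
    return x - 1 if rng.random() < x / B else x + 1

rng = random.Random(7)
B, steps = 10, 200_000
x, total = 5, 0
for _ in range(steps):
    x = ehrenfest_step(x, B, rng)
    total += x
print(total / steps)   # long-run average should be close to B/2 = 5
```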
Example 4.9 (Random walk on $\mathbb{Z}^d$). Suppose that $(X_n)_{n\ge1}$ are i.i.d. $\mathbb{Z}^d$-valued random variables. Denote by $\pi$ their common distribution. Set
$$S_0 = 0, \quad S_n = X_1+\cdots+X_n.$$
Then the random process $(S_n)_{n\in\mathbb{N}_0}$ is an HMC with transition matrix
$$Q_{m,n} = \mathbb{P}[X_1=n-m] = \pi[n-m], \quad m,n\in\mathbb{Z}^d.$$
One can imagine this process as a person starting at the origin of $\mathbb{Z}^d$ and walking with random step sizes, with $X_n$ the size of the $n$-th step.
A standard random walk is obtained as follows. Denote by $e_1,\dots,e_d$ the canonical basis of $\mathbb{Z}^d$ and choose $\pi$ to be uniformly distributed on the set $\{\pm e_1,\dots,\pm e_d\}$, i.e.,
$$\pi[\pm e_k] = \frac{1}{2d}, \quad k=1,\dots,d.$$
For example, when $d=1$, this corresponds to a random walk on $\mathbb{Z}$ where, at each moment, going one step ahead or one step back is decided by flipping a fair coin. □
Example 4.11 (The branching process). Consider again the branching process with reproduction law $\mu$ described in Example 3.8. Recall that it deals with the evolution of a population of individuals of a species with $\mu[j]$ denoting the probability that a given individual will have $j\in\mathbb{N}_0$ offspring.
Denote by $Z_n$ the size of the $n$-th generation population. We assume that $Z_0=1$. Then $(Z_n)_{n\ge0}$ is an HMC with state space $\mathbb{N}_0$.
To see this, choose a sequence of i.i.d. random variables $(\xi_k)_{k\in\mathbb{N}}$ with common distribution $\mu$. Then
$$\mathbb{P}\big[Z_{n+1}=j\,\big|\,Z_n=i\big] = \mathbb{P}\big[\xi_1+\cdots+\xi_i=j\big].$$
The distribution of the random variable $\xi_1+\cdots+\xi_i$ is $\mu^{*i}$, the convolution of $i$ copies of $\mu$. More precisely,
$$\mu^{*i}[j] = \sum_{k_1+\cdots+k_i=j}\mu[k_1]\cdots\mu[k_i].$$
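A short simulation makes the branching dynamics concrete. The standard fact $\mathbb{E}[Z_n]=m^n$, where $m$ is the mean of the reproduction law, gives a testable prediction; the reproduction law, seed, and sample sizes below are our own illustrative choices, not from the text:

```python
import random

def offspring(rng):
    """A hypothetical reproduction law: 0, 1 or 2 offspring with
    probabilities 0.25, 0.25, 0.5; its mean is m = 1.25."""
    u = rng.random()
    return 0 if u < 0.25 else (1 if u < 0.5 else 2)

def generation(z, rng):
    """One step of the chain: Z_{n+1} = xi_1 + ... + xi_{Z_n}."""
    return sum(offspring(rng) for _ in range(z))

rng = random.Random(8)
n_gen, reps = 5, 20_000
total = 0
for _ in range(reps):
    z = 1
    for _ in range(n_gen):
        z = generation(z, rng)
    total += z
print(total / reps)   # should be close to m^5 = 1.25^5, about 3.05
```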
Example 4.12 (Queuing). Customers arrive for service and take their place in a waiting line. During each period of time one customer is served, if at least one customer is present. During a service period new customers may arrive. We assume that the number of customers that arrive during the $n$-th service period is a random variable $\xi_n$, and that the random variables $\xi_1,\xi_2,\dots$ are i.i.d. with common distribution $\mu\in\operatorname{Prob}(\mathbb{N}_0)$. We set $\mu_i := \mu[i]$, $i\in\mathbb{N}_0$. For notational convenience we set $\mu_n=0$ for $n<0$.
We denote by $X_n$ the number of customers in line at the end of the $n$-th period. Note that
$$X_{n+1} = (X_n-1)_+ + \xi_n.$$
The sequence $(X_n)_{n\ge0}$ is an HMC with state space $\mathbb{N}_0$ and transition matrix
$$Q_{i,j} = \begin{cases}\mu_j, & i=0,\\ \mu_{j-i+1}, & i>0.\end{cases} \qquad\Box$$
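The recursion $X_{n+1}=(X_n-1)_+ + \xi_n$ is one line of code, and a simple drift computation (not carried out in the text) shows that when $m:=\mathbb{E}[\xi_1]<1$ the long-run fraction of periods with an empty line is $1-m$. A Python sketch; the arrival law, seed, and run length are our own illustrative choices:

```python
import random

def arrivals(rng):
    """Hypothetical arrival law: 0, 1 or 2 customers with probabilities
    0.5, 0.3, 0.2, so the mean load is m = 0.7 < 1."""
    u = rng.random()
    return 0 if u < 0.5 else (1 if u < 0.8 else 2)

rng = random.Random(9)
steps, x, idle = 200_000, 0, 0
for _ in range(steps):
    if x == 0:
        idle += 1
    x = max(x - 1, 0) + arrivals(rng)   # X_{n+1} = (X_n - 1)_+ + xi_n
print(idle / steps)   # long-run fraction of empty periods, close to 1 - m = 0.3
```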
In this section we will consistently adopt the dynamical point of view on Markov
chains described in Remark 4.2(b) and extract some useful consequences.
Recall that to an HMC with state space $\mathcal{X}$ we can associate a directed graph with vertex set $\mathcal{X}$; see Remark 4.2(b). A walk from $x$ to $x'$ in this graph is a sequence of vertices
$$x = x_0, x_1, \dots, x_n = x'$$
such that, for any $i=1,\dots,n$, there exists a directed edge from $x_{i-1}$ to $x_i$. If $x\ne x'$, then $x'$ is accessible from $x$ if there is a walk from $x$ to $x'$.
Observe that
$$Q^{m+n}_{x_0,x_2} = Q^m_{x_0,x_1}Q^n_{x_1,x_2} + \sum_{x\in\mathcal{X}\setminus\{x_1\}}\underbrace{Q^m_{x_0,x}Q^n_{x,x_2}}_{\ge0} > 0.$$
Definition 4.16. The equivalence classes of the relation $\leftrightarrow$ are called the communication classes of the given HMC. □
Example 4.17. (a) Consider the HMC associated to the gambler's ruin problem described in Example 4.6. The state space is $\{0,1,\dots,N\}$ and there are three communication classes
$$C_0 = \{0\}, \quad C = \{1,\dots,N-1\}, \quad C_N = \{N\}.$$
Note that no state in $C$ is accessible from $C_0$ or $C_N$.
(b) The HMC associated to the Ehrenfest urn model in Example 4.7 has state space $\{0,1,\dots,N\}$ and any two states communicate so that there is only a single communication class.
(c) The HMC corresponding to the random placement of balls problem in Example 4.8 has state space $\{0,1,\dots,r\}$ and communication classes
$$C_0 = \{0\},\ C_1 = \{1\},\ \dots,\ C_r = \{r\}.$$
Note that for $j>i$, the class $C_j$ is accessible from the class $C_i$. □
Definition 4.18. Let (Xn )n∈N0 be an HMC with state space X and transition
matrix Q.
□
Example 4.19. For the HMC corresponding to the random placement of balls problem in Example 4.8 with state space $\{0,\dots,r\}$, all the subsets $\{k,k+1,\dots,r\}$ are closed and the state $r$ is absorbing. This is not an irreducible Markov chain. □
[Figure 4.2: a digraph on the vertex set $\{1,\dots,8\}$.]
Example 4.21. Consider an HMC with associated digraph depicted in Figure 4.2. It consists of three communication classes
$$C_1 := \{1,2,3,4\}, \quad C_2 := \{5,7,8\}, \quad C_3 := \{6\}.$$
The communication class $C_3$ is closed while $C_1$ and $C_2$ are not. The only irreducible set is $C_3$. In particular the state 6 is absorbing.
Lemma 4.22. Let C ⊂ X be a closed subset. Then the following are equivalent.
(i) C is irreducible.
(ii) C is a communication class.
Definition 4.23. Suppose that $(X_n)_{n\in\mathbb{N}_0}$ is an HMC with state space $\mathcal{X}$ and transition matrix $Q$.
(ii) The period of a state $x$ is $d = d(x) := \gcd\mathcal{P}_x$, where "gcd" stands for greatest common divisor. When $\mathcal{P}_x=\emptyset$ we set $d(x):=\infty$.
(iii) A state $x$ is called aperiodic if $d(x)=1$. □
Lemma 4.24. Let $(X_n)_{n\ge0}$ be an HMC with state space $\mathcal{X}$ and transition matrix $Q$. Suppose that $x\in\mathcal{X}$ and $d(x)<\infty$. Then the following hold.
$$\mathcal{P}_{x,y} + \mathcal{P}_y + \mathcal{P}_{y,x} \subset \mathcal{P}_x$$
we deduce that $d(x)\,|\,\mathcal{P}_y$ so $d(x)\,|\,d(y)$. Reversing the roles of $x,y$ in the above argument we deduce $d(y)\,|\,d(x)$ so $d(x)=d(y)$. □
According to the above result, all the states of an irreducible HMC have the
same period so we can speak of the period of that HMC.
Definition 4.25. An irreducible HMC is called aperiodic if each of its states has period 1. □
Example 4.26. (i) Each state of the standard random walk on Z has period 2.
More generally, given a vertex v in a locally finite, connected graph, the set of possible return times of v with respect to the standard random walk coincides with the set of lengths of paths in the graph that start and end at v. Since there is such a path of length 2, we deduce that the vertex v is aperiodic if and only if there exists a path of odd length starting and ending at v.
(ii) The Ehrenfest urn in Example 4.7 is irreducible with period 2. □
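The period d(x) = gcd Px of a state in a finite chain can be computed numerically by recording which powers n ≤ N satisfy Q^n_{x,x} > 0. A small sketch (the cutoff N and helper names are our own choices), applied to the Ehrenfest urn:

```python
from math import gcd

def period(Q, x, N=50):
    """gcd of {n <= N : Q^n_{x,x} > 0}; for a small finite chain a moderate N suffices."""
    n_states = len(Q)
    d = 0
    P = [row[:] for row in Q]          # P holds Q^n, starting at n = 1
    for n in range(1, N + 1):
        if P[x][x] > 0:
            d = gcd(d, n)
        # P <- P @ Q
        P = [[sum(P[i][k] * Q[k][j] for k in range(n_states))
              for j in range(n_states)] for i in range(n_states)]
    return d

# Ehrenfest urn with B = 4 balls: Q_{k,k+1} = (B-k)/B, Q_{k,k-1} = k/B.
B = 4
Q = [[0.0] * (B + 1) for _ in range(B + 1)]
for k in range(B + 1):
    if k < B:
        Q[k][k + 1] = (B - k) / B
    if k > 0:
        Q[k][k - 1] = k / B
# Every state of the Ehrenfest chain has period 2, as claimed in Example 4.26(ii).
```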
Proposition 4.27. Let (Xn )n≥0 be an irreducible HMC with state space X , tran-
sition matrix Q, and period d < ∞. Fix x0 ∈ X . Consider the HM C (Yn )n≥0
with state space X , initial state Y0 = x0 and transition matrix T = Qd . Denote by
CT the set of communication classes of T . For each x ∈ X we denote by [x]T the
T -communication class of x. Then the following hold.
(i) There exists a bijection $r = r_{x_0} : \mathcal{C}_T \to \mathbb{Z}/d\mathbb{Z}$ such that $r([x]_T) = k \bmod d$ iff $Q^n_{x_0,x} > 0$ for some $n \equiv k \bmod d$.

Conversely, suppose that r(x) = r(y). Fix $n_x, n_y \in \mathbb{N}_0$ such that $Q^{n_x}_{x_0,x},\ Q^{n_y}_{x_0,y} > 0$. Choose N large enough such that $Nd > n_x$ and $Nd \in P_{x_0}$. Then $Nd - n_x \in P_{x,x_0}$ and $n_y \in P_{x_0,y}$, so $Nd - n_x + n_y \in P_{x,y}$. Hence
Remark 4.28. Suppose that (Xn )n≥0 is an HMC as in the above proposition and
C0 , C1 , . . . , Cd−1 ⊂ X
\[ S \in \mathcal{F}_T \iff S \cap \{T \le n\} \in \mathcal{F}_n, \quad \forall n \in \mathbb{N}_0. \]
Example 4.29 (Return times). Let (Xn )n∈N0 be an HMC with state space X .
For A ⊂ X we define
\[ T_A := \min\{ n \ge 1;\ X_n \in A \}. \]
We will refer to TA as the return time to A. This is a stopping time with respect
to the canonical filtration Fn . For x ∈ X we set Tx := T{x} .
Note that the event S belongs to $\mathcal{F}_{T_A}$ if at any moment n we can decide, using the information collected in $\mathcal{F}_n$ up to that point, whether S has occurred and whether we have returned to A by that moment. □
Example 4.30 (Hitting times). Let (Xn )n∈N0 be an HMC with state space X .
For A ⊂ X we define
\[ H_A := \min\{ n \ge 0;\ X_n \in A \}. \]
We will refer to HA as the hitting time of A. This is a stopping time with respect
to the canonical filtration $\mathcal{F}_n$. For x ∈ X we set $H_x := H_{\{x\}}$. □
process
Yn : (Ω, S, PΛ ) → X
is Markov(X , δx , Q) and independent of FT .
\[ = Q_{x_n,x_{n+1}} \sum_{k=0}^{\infty} P\big(\{T = k\} \cap S\big) = Q_{x_n,x_{n+1}}\, P(S). \]
\[ = P\big(S \cap \Gamma_0 \cap \{T < \infty\}\big)\, Q_{x_0,x_1} \cdots Q_{x_{n-1},x_n} = P(S \cap \Lambda)\, Q_{x_0,x_1} \cdots Q_{x_{n-1},x_n}, \]
i.e.,
\[ P\big(S \cap \Gamma_n \cap \Lambda\big) = P(S \cap \Lambda)\, Q_{x_0,x_1} \cdots Q_{x_{n-1},x_n}. \]
Then
\[ P\big(S \cap \Gamma_n \mid \Lambda\big) = \frac{P(S \cap \Lambda)}{P(\Lambda)} \cdot Q_{x_0,x_1} \cdots Q_{x_{n-1},x_n}, \qquad x_0 = x. \]
Since the stochastic process $Y_n : (\Omega, \mathcal{S}, P_\Lambda) \to \mathcal{X}$ is Markov(X , δx , Q) we deduce
\[ Q_{x_0,x_1} \cdots Q_{x_{n-1},x_n} = P\big(\Gamma_n \mid \Lambda\big). \]
Hence $P(S \cap \Gamma_n \mid \Lambda) = P(S \mid \Lambda) \cdot P(\Gamma_n \mid \Lambda)$. □
In the following subsections we will have plenty of opportunities to see the strong
Markov principle at work.
Definition 4.32. A state x ∈ X is called recurrent or persistent if $P_x(T_x < \infty) = 1$. Otherwise it is called transient. □
Example 4.33. If X is finite and the chain is irreducible, then any state of X is recurrent. Indeed, the ‘sooner-rather-than-later’ Lemma 3.32 implies that $E_x[T_x] < \infty$, ∀x ∈ X . □
\[ N_x := \#\big\{ k \in \mathbb{N};\ T_x^k < \infty \big\} = \#\big\{ n \in \mathbb{N};\ X_n = x \big\} \in \mathbb{N} \cup \{\infty\}. \]
Thus $T_x^k$ is the time of the k-th return to x. We will refer to $N_x$ as the number of returns to x. We set
\[ p = p_x := P_x\big(T_x < \infty\big). \]
Lemma 4.34. For any n ∈ N0 we have $P_x(N_x \ge n) = p^n$. In particular, if x is recurrent, i.e., p = 1, then $N_x = \infty$ a.s. and, if x is transient, then
\[ E_x[N_x] = \frac{p}{1-p}. \]
Proof. Set $p_n := P_x(N_x \ge n)$. We will prove inductively that $p_n = p^n$.
Clearly $P_x(N_x \ge 1) = P_x(T_x < \infty) = p$. Suppose that $p_n = p^n$. The post-$T_x^n$ process $Y_k = X_{T_x^n + k}$ starts at x and the strong Markov property implies that it is an HMC with the same transition matrix. In particular, the probability that it returns to x is p. On the other hand, this process returns to x if and only if $N_x \ge n + 1$. Since the post-$T_x^n$ process is independent of $\mathcal{F}_{T_x^n}$ we deduce
\[ P_x\big(N_x \ge n+1\big) = p\, P_x\big(N_x \ge n\big) = p^{n+1}. \]
□
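Lemma 4.34 can be sanity-checked numerically on a small transient chain by comparing $E_x[N_x] = \sum_n Q^n_{x,x}$ with p/(1 − p). The three-state chain below is our own toy example, not one from the text:

```python
import numpy as np

# A small transient chain: 0 -> 1 w.p. 1; 1 -> 0 w.p. 0.3, 1 -> 2 w.p. 0.7;
# state 2 is absorbing.
Q = np.array([[0.0, 1.0, 0.0],
              [0.3, 0.0, 0.7],
              [0.0, 0.0, 1.0]])

# Starting at 0, the only way to return to 0 is 0 -> 1 -> 0, so
# p = P_0(T_0 < oo) = 0.3 and Lemma 4.34 predicts E_0[N_0] = p/(1-p).
p = 0.3
expected = p / (1 - p)

# Compute E_0[N_0] = sum_{n >= 1} Q^n_{0,0}, truncated at a large cutoff.
Qn = np.eye(3)
total = 0.0
for _ in range(200):
    Qn = Qn @ Q
    total += Qn[0, 0]   # adds Q^n_{0,0} for n = 1, 2, ...
```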
and
\[ E_x[N_x] = E_x\Big[ \sum_{n\in\mathbb{N}} I_{\{X_n = x\}} \Big] = \sum_{n\in\mathbb{N}} Q^n_{x,x}. \]
(i) x ↔ y,
(ii) $P_y(T_x < \infty) = 1$,
(iii) the state y is recurrent. □
The above result shows that if C is a communication class then either all states in C are recurrent, or all states in C are transient. In the first case C is called a recurrence class and in the second case a transience class. An irreducible HMC consists of a single communication class; accordingly, an irreducible HMC is either transient or recurrent.
Proposition 4.38. Suppose that (Xn )n≥0 is an irreducible transient HMC with
state space X , transition matrix Q and initial distribution µ. Then,
\[ E_\mu[N_x] < \infty, \quad \forall x \in \mathcal{X}. \]
Proof. We first prove that given $x_0 \in \mathcal{X}$ there exists $C = C_{x_0} > 0$ such that
\[ E_y[N_{x_0}] \le C, \quad \forall y \in \mathcal{X}. \]
Indeed, using the strong Markov property as in the proof of Lemma 4.34 we deduce that for any y ∈ X we have
\[ E_y[N_{x_0}] = \sum_{n \ge 1} P_y\big(N_{x_0} \ge n\big) = \sum_{n \ge 1} P_{x_0}\big(N_{x_0} \ge n-1\big)\, P_y\big(T_{x_0} < \infty\big) \]
\[ = \underbrace{P_{x_0}\big(N_{x_0} \ge 0\big)}_{=1}\, P_y\big(T_{x_0} < \infty\big) + P_y\big(T_{x_0} < \infty\big) \underbrace{\sum_{m \ge 1} P_{x_0}\big(N_{x_0} \ge m\big)}_{= E_{x_0}[N_{x_0}]} \]
\[ = P_y\big(T_{x_0} < \infty\big)\big(1 + E_{x_0}[N_{x_0}]\big) \le \underbrace{1 + E_{x_0}[N_{x_0}]}_{=: C_{x_0}}. \]
Hence
\[ P\big(\forall y \in C:\ N_y = \infty \mid X_0 \in C\big) = 1 - P\big(\exists y \in C:\ N_y < \infty \mid X_0 \in C\big) = 1. \]
□
Each state of X is in itself a communication class of the new Markov chain. The
state space X is partitioned into two types: transient states and recurrent states.
The recurrent states are closed, i.e., they are absorbing as states in X . Given a
recurrent state R ∈ X , no other communication class is accessible from R.
Example 4.41. Consider for example the gambler’s ruin problem with total fortune
N ∈ N; see Example 4.6. This can be viewed as a Markov chain with state space
{0, 1, . . . , N } and transition probabilities
\[ q_{i,i\pm 1} = \frac{1}{2}, \quad \forall\, 0 < i < N, \qquad q_{0,0} = q_{N,N} = 1. \]
The communication classes of this Markov chain are
{0}, {N }, {1, 2, . . . , N − 1}.
The first two are recurrent while the third is transient. □
Example 4.42 (G. Polya). (a) Consider the standard random walk on Z. We
denote by Q the transition matrix. This is an irreducible Markov chain and each
state has period 2. To decide whether it is transient or recurrent it suffices to check whether the origin is. Note that $Q^{2n-1}_{0,0} = 0$, ∀n ∈ N. To compute $Q^{2n}_{0,0}$ we observe that a path of length 2n starts and ends at the origin if and only if it consists of exactly n steps to the right and n steps to the left. Since each such step occurs with probability 1/2 we deduce
\[ Q^{2n}_{0,0} = \frac{1}{2^{2n}} \binom{2n}{n} = \frac{(2n)!}{2^{2n}(n!)^2}. \]
Using Stirling’s formula (A.1.6) we deduce that, as n → ∞, we have
\[ \frac{(2n)!}{2^{2n}(n!)^2} \sim \frac{\sqrt{4\pi n}}{2\pi n} \sim \frac{1}{\sqrt{\pi n}}, \]
so
\[ \sum_{n\in\mathbb{N}} Q^n_{0,0} = \infty. \]
(b) For the standard random walk on $\mathbb{Z}^2$ a similar computation shows that
\[ Q^{2n}_{0,0} = \left( \frac{1}{2^{2n}} \binom{2n}{n} \right)^2 \sim \frac{1}{\pi n} \quad \text{as } n \to \infty. \]
Hence, again,
\[ \sum_{n\in\mathbb{N}} Q^n_{0,0} = \infty. \]
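The asymptotics above are easy to check numerically; the helpers below (our own names) evaluate $Q^{2n}_{0,0}$ exactly for the walks on Z and Z²:

```python
from math import comb, pi, sqrt

def q2n_Z(n):
    """Q^{2n}_{0,0} = C(2n,n)/4^n for the simple random walk on Z."""
    return comb(2 * n, n) / 4 ** n

def q2n_Z2(n):
    """Q^{2n}_{0,0} = (C(2n,n)/4^n)^2 for the simple random walk on Z^2."""
    return q2n_Z(n) ** 2

n = 500
# Compare with the asymptotics 1/sqrt(pi*n) and 1/(pi*n) derived above;
# both ratios should be close to 1 for large n.
ratio1 = q2n_Z(n) * sqrt(pi * n)
ratio2 = q2n_Z2(n) * (pi * n)
```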
Hence
\[ \sum_{j+k+\ell = n} p_{j,k,\ell}^2 \le \Big( \max_{j,k,\ell} p_{j,k,\ell} \Big) \sum_{j+k+\ell = n} p_{j,k,\ell} = \max_{j,k,\ell} p_{j,k,\ell}, \]
so
\[ Q^{2n}_{0,0} \le \frac{1}{2^{2n}} \binom{2n}{n}\, \max_{j,k,\ell} p_{j,k,\ell}. \]
Let us observe that the maximum value of pj,k,` is achieved when j, k, ` are as close
to n/3 as possible. Indeed, if $j \le k \le \ell$ and $j < \ell$, then
\[ (j+1)!\,(\ell - 1)! = \frac{j+1}{\ell}\, j!\,\ell! \le j!\,\ell!, \]
so
\[ p_{j+1,k,\ell-1} \ge p_{j,k,\ell}. \]
Assume now that n = 3m. We deduce
\[ Q^{2n}_{0,0} \le \frac{1}{2^{2n}} \binom{2n}{n} \frac{(3m)!}{(m!)^3\, 3^{3m}}. \]
Using again Stirling’s formula we deduce that, as m → ∞, we have
\[ \frac{(3m)!}{(m!)^3\, 3^{3m}} \sim \frac{\sqrt{6\pi m}}{(2\pi m)^{3/2}} = \frac{\sqrt{3}}{2\pi m}. \]
On the other hand,
\[ \frac{1}{2^{2n}} \binom{2n}{n} \sim \frac{1}{\sqrt{\pi n}} = \frac{1}{\sqrt{3\pi m}}. \]
We deduce that
\[ Q^{6m}_{0,0} = O\big(m^{-3/2}\big) \quad \text{as } m \to \infty. \]
Arguing in a similar fashion we deduce
\[ Q^{6m+2}_{0,0},\ Q^{6m+4}_{0,0} = O\big(m^{-3/2}\big) \quad \text{as } m \to \infty. \]
We conclude that
\[ \sum_{n\in\mathbb{N}} Q^n_{0,0} = \sum_{n\in\mathbb{N}} Q^{2n}_{0,0} < \infty, \]
so the origin is a transient state: the standard random walk on $\mathbb{Z}^3$ is transient.
Definition 4.43. An invariant or stationary measure for the HMC (Xn )n∈N0 is a
σ-finite measure λ on X such that λ = λQ, i.e.,
\[ \lambda_x = \sum_{y\in\mathcal{X}} \lambda_y\, Q_{y,x}, \quad \forall x \in \mathcal{X}. \tag{4.2.3} \]
$P(X_0 = x) = \pi_x$, ∀x ∈ X . Then
\[ P(X_1 = x) = \sum_{y\in\mathcal{X}} P\big(X_1 = x \mid X_0 = y\big)\, \pi_y = \sum_{y} \pi_y\, Q_{y,x} = \pi_x. \]
Iterating we deduce that the random variables (Xn )n∈N0 are identically distributed.
For x, y ∈ X we set
\[ R_{y,x} := P\big(X_0 = x \mid X_1 = y\big) = \frac{P\big(X_1 = y \mid X_0 = x\big)\, P(X_0 = x)}{P(X_1 = y)} = \frac{\pi_x}{\pi_y}\, Q_{x,y}. \]
Note that for every y ∈ X we have
\[ \sum_{x} R_{y,x} = \frac{1}{\pi_y} \sum_{x} \pi_x\, Q_{x,y} = 1, \]
so (Ry,x )x,y∈X is a stochastic matrix describing the so called time reversed chain.
Suppose now that π is a probability distribution on X such that πx > 0, ∀x ∈ X
and satisfying
\[ Q_{y,x} = R_{y,x} = \frac{\pi_x}{\pi_y}\, Q_{x,y}, \quad \forall x, y \in \mathcal{X}. \tag{4.2.4} \]
From the equality
\[ 1 = \sum_{x} Q_{y,x} = \sum_{x} \frac{\pi_x}{\pi_y}\, Q_{x,y}, \]
we deduce that π is a stationary distribution and the time reversed chain coincides
with the initial chain. This is the reason why the chains satisfying (4.2.4) are called
reversible. □
Definition 4.45. An irreducible HMC with state space X and transition matrix Q is called reversible if there exists a function λ : X → (0, ∞) satisfying the detailed balance equations
\[ \lambda_x\, Q_{x,y} = \lambda_y\, Q_{y,x}, \quad \forall x, y \in \mathcal{X}. \]
□
Example 4.46. (a) If Qx,y = Qy,x for any x, y ∈ X , then the corresponding chain
is reversible and any uniform measure on X is invariant. This happens for example
if (Xn )n≥0 describes the standard random walk on Zd .
(b) In the case of the standard random walk on a locally finite connected graph we have, for x ∼ y,
\[ Q_{x,y} = \frac{1}{\deg x}, \qquad \deg x \cdot Q_{x,y} = 1 = \deg y \cdot Q_{y,x}. \]
Hence Q is in detailed balance with invariant measure x ↦ deg x. If, additionally, X is finite, then the probability measure
\[ \pi_x = \frac{\deg x}{\sum_y \deg y} \]
is invariant. □
Example 4.47 (The Ehrenfest urn). Consider the Ehrenfest urn model de-
tailed in Example 4.7. We recall that the state space is X = {0, 1, . . . , B}, B ∈ N
and the only nontrivial transition probabilities are
\[ Q_{k,k+1} = \frac{B-k}{B}, \qquad Q_{k,k-1} = \frac{k}{B}. \]
Note that
\[ \frac{Q_{k,k+1}}{Q_{k+1,k}} = \frac{B-k}{k+1} = \frac{\binom{B}{k+1}}{\binom{B}{k}}. \]
Then the measure $k \mapsto \lambda_k = \binom{B}{k}$ is invariant and
\[ \pi_k = \frac{1}{2^B} \binom{B}{k}, \quad k = 0, 1, \dots, B, \]
is an invariant probability distribution. □
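The detailed balance computation for the Ehrenfest urn is easy to verify numerically; the following sketch (the choice B = 6 is arbitrary) checks both detailed balance and invariance of the binomial distribution:

```python
import numpy as np
from math import comb

B = 6
Q = np.zeros((B + 1, B + 1))
for k in range(B + 1):
    if k < B:
        Q[k, k + 1] = (B - k) / B
    if k > 0:
        Q[k, k - 1] = k / B

# The Binomial(B, 1/2) distribution from Example 4.47.
pi_dist = np.array([comb(B, k) / 2 ** B for k in range(B + 1)])

# Detailed balance: pi_k * Q_{k,k+1} == pi_{k+1} * Q_{k+1,k}.
balanced = all(
    np.isclose(pi_dist[k] * Q[k, k + 1], pi_dist[k + 1] * Q[k + 1, k])
    for k in range(B)
)
# Invariance: pi Q = pi.
invariant = np.allclose(pi_dist @ Q, pi_dist)
```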
Theorem 4.48. Suppose that the HMC (Xn )n∈N0 is irreducible and recurrent. Fix
x0 ∈ X , and denote by T0 the time of first return to x0 , i.e.,
\[ T_0 := T_{x_0} = \min\{ n \ge 1;\ X_n = x_0 \}. \]
For any x ∈ X , define
\[ N_x = \sum_{n\in\mathbb{N}} I_{\{X_n = x\}}\, I_{\{n \le T_0\}} = \sum_{n=1}^{T_0} I_{\{X_n = x\}}, \tag{4.2.6a} \]
\[ \lambda_x = \lambda_{x,x_0} = \begin{cases} E_{x_0}[N_x], & x \ne x_0, \\ 1, & x = x_0. \end{cases} \tag{4.2.6b} \]
In other words, λx is the expected number of visits to x before returning to x0 when
starting from x0 . Then the following hold.

(i) We have $0 < \lambda_x < \infty$ for every x ∈ X , and the measure $\lambda = (\lambda_x)_{x\in\mathcal{X}}$ is invariant.
(ii) $\lambda(\mathcal{X}) = E_{x_0}[T_0]$.
(iii) The measure λ is the unique invariant measure such that λx0 = 1.
Proof. (i) We follow the approach in [20, Thm. 3.2.1]. Clearly λx0 = 1. For
x ∈ X \ {x0 } and n ∈ N we set
\[ p_x(n) := P_{x_0}\big( X_n = x,\ n \le T_0 \big). \]
Thus, px (n) is the probability of visiting state x at time n before returning to x0 .
The equality (4.2.6a) implies that
\[ \lambda_x = \sum_{n\in\mathbb{N}} p_x(n), \quad \forall x \ne x_0. \tag{4.2.7} \]
Let us prove that λ satisfies (4.2.3). Observe first that px (1) = Qx0 ,x . From the
Markov property we deduce
\[ p_x(n) = \sum_{y \ne x_0} p_y(n-1)\, Q_{y,x}, \quad n \ge 2. \tag{4.2.8} \]
We deduce that
\[ \lambda_x \overset{(4.2.7)}{=} \sum_{n\in\mathbb{N}} p_x(n) = p_x(1) + \sum_{y \ne x_0} \Big( \underbrace{\sum_{n\in\mathbb{N}} p_y(n)}_{=\lambda_y} \Big)\, Q_{y,x} = \sum_{y\in\mathcal{X}} \lambda_y\, Q_{y,x}. \]
This proves (4.2.3). Let us now show that the λx defined by (4.2.6b) are positive.
Suppose that λx = 0 for some x ∈ X . Obviously x 6= x0 . Moreover, from the
equality $\lambda = \lambda Q^n$, ∀n ∈ N, we deduce
\[ 0 = \lambda_x = Q^n_{x_0,x} + \sum_{y \ne x_0} \lambda_y\, Q^n_{y,x}. \]
Thus $Q^n_{x_0,x} = 0$, ∀n ∈ N, which contradicts the fact that x0 and x communicate.
Finally, let us prove that $\lambda_x < \infty$, ∀x. Observe that
\[ 1 = \lambda_{x_0} = \sum_{x\in\mathcal{X}} \lambda_x\, Q^n_{x,x_0}. \tag{4.2.9} \]
Suppose that $\lambda_x = \infty$ for some x ≠ x0 . Since the chain is irreducible, the state x communicates with x0 , so there exists n = n(x) such that $Q^n_{x,x_0} \ne 0$. The equality (4.2.9) then implies $\lambda_x \le \frac{1}{Q^n_{x,x_0}} < \infty$, a contradiction.
(ii) We have
\[ \sum_{x\in\mathcal{X}} \sum_{n \ge 1} I_{\{X_n = x\}}\, I_{\{n \le T_0\}} = \sum_{n \ge 1} I_{\{n \le T_0\}} \sum_{x\in\mathcal{X}} I_{\{X_n = x\}} = \sum_{n \ge 1} I_{\{n \le T_0\}} = T_0. \]
Hence
\[ \lambda(\mathcal{X}) = \sum_{x\in\mathcal{X}} \lambda_x = E_{x_0}\Big[ \sum_{x\in\mathcal{X}} \sum_{n \ge 1} I_{\{X_n = x\}}\, I_{\{n \le T_0\}} \Big] = E_{x_0}[T_0]. \]
(iii) We follow the approach in [2; 123]. Consider the matrix $K : \mathcal{X} \times \mathcal{X} \to [0,1]$,
\[ K_{x,y} = \begin{cases} Q_{x,y}, & y \ne x_0, \\ 0, & y = x_0. \end{cases} \]
Consider the sequence $(\mu_n)_{n\ge 0}$ of measures on X defined by
\[ \mu_0 = \delta_{x_0}, \qquad \mu_n(\{x\}) = p_x(n) = P_{x_0}\big( X_n = x,\ n < T_0 \big), \quad x \in \mathcal{X}. \]
Note that $\mu_n(\{x_0\}) = 0$ for all n ≥ 1. The equality (4.2.8) implies that
\[ \mu_n = \mu_{n-1} K, \quad \forall n \ge 1, \]
so $\mu_n = \delta_{x_0} K^n$. Observe that
\[ \lambda = \sum_{n \ge 0} \mu_n = \sum_{n \ge 0} \delta_{x_0} K^n. \]
Fix an invariant measure ν such that $\nu_{x_0} = 1$. The invariance condition reads $\nu = \delta_{x_0} + \nu K$. We deduce
\[ \nu = \delta_{x_0} + \big( \delta_{x_0} + \nu K \big) K = \delta_{x_0} + \delta_{x_0} K + \nu K^2. \]
Remark 4.49. The example of the standard random walk on $\mathbb{Z}^3$ shows that even transient chains can admit invariant measures. □
Suppose that $(X_n)_{n\ge 0}$ is irreducible and recurrent. For each x ∈ X we denote by $\pi^x$ the unique invariant measure on X such that $\pi^x(\{x\}) = 1$. We know that for x, y ∈ X the measure $\pi^y$ is a positive multiple of $\pi^x$,
\[ \pi^y = c_{y,x}\, \pi^x. \]
From the equality $\pi^x(\{x\}) = 1$ we deduce $c_{y,x} = \pi^y(\{x\})$, so that
\[ \pi^y = \pi^y(\{x\})\, \pi^x. \tag{4.2.10} \]
From Theorem 4.48(ii) we deduce that
\[ \pi^x(\mathcal{X}) = E_x[T_x]. \tag{4.2.11} \]
In particular, this shows that the following statements are equivalent
(i) The chain is called positively recurrent if $E_x[T_x] < \infty$ for some x ∈ X . □
Corollary 4.52. Any irreducible HMC with finite state space X admits a unique stationary probability measure. □

Proof. As shown in Example 4.33, such a chain is recurrent since the state space is finite. The finiteness of X implies (4.2.12). □
To count them, observe that each 2 × 3 sub-rectangle of the chess board determines four edges in this graph, two for each diagonal. The same is true for each 3 × 2 sub-rectangle. Moreover, any knight move corresponds to a diagonal of a unique such rectangle. If N2×3 and N3×2 denote the numbers of 2 × 3, respectively 3 × 2, rectangles, then
\[ N_{2\times 3} = N_{3\times 2} =: N, \]
so that Z = 16N. Now observe that since a 3 × 2 rectangle is uniquely determined by the location of its lower left corner, we have N = 6 × 7 = 42, so Z = 16 · 42 = 672. If x corresponds to the left-hand square, then deg x = 4, so
\[ E_x[T_x] = \frac{1}{\pi_x} = \frac{16 \cdot 42}{4} = 4 \cdot 42 = 168. \]
Thus, given that the knight starts at x, the expected time to return to x is 168. □
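As a sanity check, one can compute the degrees of the knight's-move graph directly and read off the expected return time from π_x = deg x / Σ_y deg y. Counting each undirected edge once from each endpoint gives Σ deg = 336, and a corner square (degree 2) has expected return time 336/2 = 168, the same value obtained above. (The helper below is our own illustration.)

```python
def knight_degree(r, c):
    """Number of legal knight moves from square (r, c) on an 8x8 board."""
    moves = [(1, 2), (2, 1), (-1, 2), (-2, 1),
             (1, -2), (2, -1), (-1, -2), (-2, -1)]
    return sum(0 <= r + dr < 8 and 0 <= c + dc < 8 for dr, dc in moves)

total = sum(knight_degree(r, c) for r in range(8) for c in range(8))
# pi_x = deg x / total, so E_x[T_x] = total / deg x; a corner has degree 2.
corner_return = total // knight_degree(0, 0)
```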
Theorem 4.54. Suppose that (Xn ) is an irreducible HMC with state space X and transition matrix Q. Then the following are equivalent.

(i) The chain is positively recurrent.
(ii) The chain admits an invariant probability measure.

Proof. We have already shown that (i) ⇒ (ii). To prove the implication (ii) ⇒ (i) fix an invariant probability measure π. Thus $\pi = \pi Q^n$, ∀n ∈ N, so that
\[ \pi(\{y\}) = \sum_{x\in\mathcal{X}} \pi(\{x\})\, Q^n_{x,y}, \quad \forall y \in \mathcal{X},\ n \in \mathbb{N}. \tag{4.2.14} \]
Fix $y_0 \in \mathcal{X}$ such that $\pi(\{y_0\}) \ne 0$. We prove first that if the chain is recurrent then it has to be positively recurrent. Denote by $\lambda^{y_0}$ the unique invariant measure such that $\lambda^{y_0}(\{y_0\}) = 1$. The measure $\lambda^{y_0}$ is a constant multiple of π, so it is finite. Hence
\[ E_{y_0}[T_{y_0}] = \lambda^{y_0}(\mathcal{X}) < \infty, \]
Hence
\[ \sum_{n} Q^n_{x,y_0} < \infty \quad \text{and} \quad \lim_{n\to\infty} Q^n_{x,y_0} = 0, \quad \forall x \in \mathcal{X}. \tag{4.2.15} \]
On the other hand, the equality (4.2.15) coupled with the Dominated Convergence theorem implies
\[ \lim_{n\to\infty} \int_{\mathcal{X}} q_n(x)\, \pi[dx] = 0. \]
This contradiction completes the proof. □
Example 4.55. We have shown in Example 4.42 that the standard random walks on Z and $\mathbb{Z}^2$ are recurrent. Let us show that they are null recurrent.
Note that for k = 1, 2, the measure on $\mathbb{Z}^k$ defined by $\lambda(\{x\}) = 1$, $\forall x \in \mathbb{Z}^k$, is invariant. By Theorem 4.48, λ is the unique invariant measure such that $\lambda(\{0\}) = 1$. Since $\lambda(\mathbb{Z}^k) = \infty$ we deduce that there is no finite invariant measure. □
Note that x ∈ Y. For y ∈ Y we have
\[ E_y[T_x] = E_y\big[ T_x \mid X_1 = y \big]\, Q_{y,y} + \sum_{z \ne y} E_y\big[ T_x \mid X_1 = z \big]\, Q_{y,z}. \]
Hence
\[ y \ne z,\ Q_{y,z} > 0 \ \Rightarrow\ E_y\big[ T_x \mid X_1 = z \big] < \infty. \]
Now observe that for z ≠ y
\[ E_y\big[ T_x \mid X_1 = z \big] = \begin{cases} 1, & z = x, \\ 1 + E_z[T_x], & z \ne x. \end{cases} \]
We deduce that if y ∈ Y and $Q_{y,z} > 0$, then z ∈ Y. We conclude iteratively that y ∈ Y for every y with x → y. Since the chain is irreducible we deduce Y = X . □
Let $(X_n)_{n\ge 0}$ be an HMC with state space X and transition matrix Q. Recall that for any set A ⊂ X we denoted by $T_A$ the time of first return to A,
\[ T_A := \inf\{ n \ge 1;\ X_n \in A \}. \]
Note that $T_A \le T_a$, ∀a ∈ A, so
\[ E_x[T_A] \le E_x[T_a], \quad \forall x \in \mathcal{X},\ \forall a \in A. \]
\[ = M_A \sum_{k=0}^{\infty} P_{a_0}\big( T_{b_0} > k \big) = M_A\, E_{a_0}[T_{b_0}] < \infty, \]
since the chain $(Y_k)$ is positively recurrent. □
Suppose that (Xn )n∈N0 is an irreducible, positively recurrent HMC with state space
X and transition matrix Q. It thus has a unique stationary probability measure
π∞ ∈ Prob(X ). In this section we will provide a dynamical description of π∞ and
prove a Law of Large Numbers that involves this measure.
For n ∈ N we set
\[ \nu(n) = \nu_{x_0}(n) := \sum_{k=1}^{n} I_{\{X_k = x_0\}}. \]
In other words, the random variable νx0 (n) is the number of returns to x0 during
the interval [1, n].
\[ \lim_{n\to\infty} \frac{1}{\nu_{x_0}(n)} \sum_{k=1}^{n} f(X_k) = \int_{\mathcal{X}} f(x)\, \pi^0[dx] = \sum_{x\in\mathcal{X}} f(x)\, \pi^0(\{x\}), \quad P_{x_0}\text{-a.s.} \tag{4.3.2} \]
Proof. We follow the proof in [20, Prop. 3.4.1]. Using the decomposition $f = f^+ - f^-$ we see that it suffices to consider only the case when f is nonnegative. Let $T_0 = \tau_1 \le \tau_2 \le \cdots$ denote the successive times of return to x0 . We set
\[ U_p := \sum_{k = \tau_{p-1}+1}^{\tau_p} f(X_k). \]
The strong Markov property shows that the random variables U1 , U2 , . . . are i.i.d.
We have
\[ E_{x_0}[U_1] = E_{x_0}\Big[ \sum_{k=1}^{T_0} f(X_k) \Big] = E_{x_0}\Big[ \sum_{k=1}^{T_0} \sum_{x\in\mathcal{X}} f(x)\, I_{\{X_k = x\}} \Big] = \sum_{x\in\mathcal{X}} f(x)\, E_{x_0}\Big[ \sum_{k=1}^{T_0} I_{\{X_k = x\}} \Big] \overset{(4.3.1)}{=} \sum_{x\in\mathcal{X}} f(x)\, \pi^0(\{x\}). \]
In other words,
\[ \frac{1}{p} \sum_{k=1}^{\tau_p} f(X_k) \to \sum_{x\in\mathcal{X}} f(x)\, \pi^0(\{x\}), \quad P_{x_0}\text{-a.s.} \]
Observing that $\tau_{\nu(n)} \le n < \tau_{\nu(n)+1}$, we deduce that for nonnegative f we have
\[ \frac{1}{\nu(n)} \sum_{k=1}^{\tau_{\nu(n)}} f(X_k) \le \frac{1}{\nu(n)} \sum_{k=1}^{n} f(X_k) \le \frac{1}{\nu(n)} \sum_{k=1}^{\tau_{\nu(n)+1}} f(X_k) = \frac{\nu(n)+1}{\nu(n)} \cdot \frac{1}{\nu(n)+1} \sum_{k=1}^{\tau_{\nu(n)+1}} f(X_k). \]
Corollary 4.60 (Ergodic Theorem). Suppose that $(X_n)_{n\ge 0}$ is a positively recurrent irreducible HMC with state space X , transition matrix Q and stationary distribution π∞ . Let $f \in L^1(\mathcal{X}, \pi_\infty)$. Then, for any µ ∈ Prob(X ) we have
\[ \lim_{n\to\infty} \frac{1}{n} \sum_{k=1}^{n} f(X_k) = \int_{\mathcal{X}} f(x)\, \pi_\infty[dx], \quad P_\mu\text{-a.s.} \tag{4.3.4} \]
Proof. Assume first that $(X_n)$ are defined on the path space $(\mathcal{X}^{\mathbb{N}_0}, \mathcal{E}, P_\mu)$. Suppose µ = δx0 . If we divide both sides of (4.3.2) by $E_{x_0}[T_{x_0}]$ we deduce
\[ \lim_{n\to\infty} \frac{1}{\nu(n)\, E_{x_0}[T_{x_0}]} \sum_{k=1}^{n} f(X_k) = \int_{\mathcal{X}} f(x)\, \pi_\infty[dx]. \]
\[ P_\mu(C) \overset{(4.1.13)}{=} \sum_{x\in\mathcal{X}} \mu(\{x\})\, P_x(C) = 1. \]
Suppose now that the random maps $(X_n)$ are defined on a probability space (Ω, S, P), not necessarily the path space. Using the map $\vec{X} : \Omega \to \mathcal{X}^{\mathbb{N}_0}$ we reduce this case to the situation we have discussed above. □
The Ergodic Theorem is a Law of Large Numbers for a sequence of, not neces-
sarily independent, random variables.
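The Ergodic Theorem can be illustrated by simulation: for a positively recurrent chain, the time average of f = I_{{0}} along one long trajectory approaches π∞({0}). The two-state chain below is our own toy example (solving πQ = π gives the stationary distribution (2/3, 1/3)):

```python
import random

random.seed(0)

# Two-state chain with transition matrix Q; its stationary distribution,
# from pi Q = pi, is pi = (2/3, 1/3).
Q = [[0.9, 0.1], [0.2, 0.8]]
pi0 = 2 / 3

x, visits0, n = 0, 0, 200_000
for _ in range(n):
    x = 0 if random.random() < Q[x][0] else 1
    visits0 += (x == 0)

freq0 = visits0 / n   # time average of f = I_{state 0}; approaches pi0
```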
If X, Y are X -valued random variables, then the variation distance between them
is defined to be the variation distance between their distributions PX , PY ,
dv (X, Y ) := dv (PX , PY ).
Now define
\[ B := \big\{ x \in \mathcal{X};\ \mu(\{x\}) \ge \nu(\{x\}) \big\}. \]
\[ \nu(A) - \mu(A) \le \nu(B^c) - \mu(B^c). \]
Observe that
\[ \big( \mu(B) - \nu(B) \big) + \big( \mu(B^c) - \nu(B^c) \big) = \mu(\mathcal{X}) - \nu(\mathcal{X}) = 0. \]
Hence
\[ \sup_{A\subset\mathcal{X}} \big( \mu(A) - \nu(A) \big) = \mu(B) - \nu(B) = \frac{1}{2} \sum_{x\in\mathcal{X}} \big| \mu(\{x\}) - \nu(\{x\}) \big|. \]
□
For example, if Ω = R equipped with the Borel sigma-algebra and µ, ν are given by densities p and respectively q, then
\[ \|\mu - \nu\| = \int_{\mathbb{R}} \big| p(x) - q(x) \big|\, dx. \]
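The identity $d_v(\mu,\nu) = \frac12 \sum_x |\mu(\{x\}) - \nu(\{x\})|$ can be checked against the defining supremum by brute force on a small state space (the two distributions below are our own toy data):

```python
from itertools import chain, combinations

# Two distributions on a 4-point space.
mu = [0.1, 0.2, 0.3, 0.4]
nu = [0.25, 0.25, 0.25, 0.25]

# Half the L^1 distance ...
dv = 0.5 * sum(abs(m - n) for m, n in zip(mu, nu))

# ... equals the sup over all events A of |mu(A) - nu(A)|:
points = range(4)
subsets = chain.from_iterable(combinations(points, k) for k in range(5))
sup = max(abs(sum(mu[i] for i in A) - sum(nu[i] for i in A)) for A in subsets)
```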
We will use the notation $\mu \xleftrightarrow{\lambda} \nu$ to indicate that λ is a coupling of µ with ν. We will denote by Couple(µ, ν) the set of couplings of µ with ν.
A coupling of a pair of X -valued random variables X, Y is defined to be an X × X -valued random variable Z whose distribution is a coupling of the distributions of X and Y . □
The next result explains the relevance of couplings in estimating the variation
distance between two measures.
Then,
\[ d_v(\mu, \nu) \le \lambda\big( \mathcal{X}^{(2)} \big), \quad \forall \lambda \in \mathrm{Couple}(\mu, \nu). \tag{4.3.5} \]
\[ \ge \lambda\big( A \times \mathcal{X} \big) - \lambda\big( \mathcal{X} \times A \big) = \mu(A) - \nu(A). \]
Hence
\[ \lambda\big( \mathcal{X}^{(2)} \big) \ge \sup_{A\subset\mathcal{X}} \big( \mu(A) - \nu(A) \big) = d_v(\mu, \nu). \]
□
Remark 4.65. The inequality (4.3.5) is optimal in the sense that there exists a coupling λ ∈ Couple(µ, ν) such that $d_v(\mu,\nu) = \lambda\big( \mathcal{X}^{(2)} \big)$. For details we refer to [20, Sec. 4.1.2] or [104, Sec. 4.2]. □
Definition 4.66. Two X -valued stochastic processes $(X_n)_{n\in\mathbb{N}_0}$ and $(Y_n)_{n\in\mathbb{N}_0}$ are said to couple if the random variable
\[ T = \min\{ n \in \mathbb{N};\ X_m = Y_m,\ \forall m \ge n \} \]
is a.s. finite. The random variable T is called the coupling time. □
Lemma 4.67. Suppose that the X -valued processes (Xn )n∈N0 and (Yn )n∈N0 couple
with coupling time T . Then
\[ d_v(X_n, Y_n) \le P(T > n). \]
In particular,
\[ \lim_{n\to\infty} d_v(X_n, Y_n) = 0. \]
Theorem 4.68. Suppose that Q is a probability transition matrix on the state space
X such that the associated HMC’s are irreducible, aperiodic and positively recur-
rent. Denote by π the unique invariant probability measure on X . Then, for any
µ ∈ Prob(X ),
\[ \lim_{n\to\infty} d_v\big( \mu Q^n, \pi \big) = 0. \tag{4.3.6} \]
In particular, if µ = δx0 we deduce that
\[ \pi(\{x\}) = \lim_{n\to\infty} Q^n_{x_0,x}. \tag{4.3.7} \]
Proof. Consider two independent HMCs $(X_n)_{n\ge 0}$ and $(Y_n)_{n\ge 0}$ with state space X and transition matrix Q such that the initial distribution of $(X_n)$ is µ and the initial distribution of $(Y_n)$ is π. Since π is stationary, the probability distribution of $Y_n$ is π, ∀n. According to Lemma 4.67 it suffices to show that the stochastic processes $(X_n)$, $(Y_n)$ couple.
Consider the stochastic process $(X_n, Y_n)$. This is an HMC with state space X × X and transition matrix
\[ \widehat{Q}_{(x_0,y_0),(x_1,y_1)} = Q_{x_0,x_1} \cdot Q_{y_0,y_1}. \]
This shows that if $Q^k_{x,y} \ne 0$, then $\widetilde{Q}^n_{x,y} > 0$, ∀n ≥ k. Using the terminology of generalized convergence in [79], we can say that the Euler means of the sequence $\big( Q^n_{x,y} \big)_{n\ge 0}$ converge to the invariant measure. □
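The convergence (4.3.7) is easy to observe numerically: iterate µ ↦ µQ and compare with the stationary distribution computed as a left eigenvector. The 3-state matrix below is our own example, chosen irreducible and aperiodic:

```python
import numpy as np

# An irreducible aperiodic chain on 3 states; its stationary distribution
# is pi = (0.25, 0.5, 0.25) (it satisfies detailed balance with Q).
Q = np.array([[0.5, 0.5, 0.0],
              [0.25, 0.5, 0.25],
              [0.0, 0.5, 0.5]])

# Stationary distribution via the left eigenvector for eigenvalue 1.
w, V = np.linalg.eig(Q.T)
pi = np.real(V[:, np.argmin(np.abs(w - 1))])
pi = pi / pi.sum()

# Iterate mu Q^n from an arbitrary start; d_v(mu Q^n, pi) -> 0.
mu = np.array([1.0, 0.0, 0.0])
for _ in range(100):
    mu = mu @ Q
dv = 0.5 * np.abs(mu - pi).sum()
```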
Observe that
\[ E\big[ f(X_{n+1}) \,\big\|\, \mathcal{F}_n \big] = E\big[ f(X_{n+1}) \,\big\|\, X_n \big] = Qf(X_n), \quad \forall n \ge 0. \tag{4.3.8} \]
Indeed,
\[ E\big[ f(X_{n+1}) \mid X_n = x \big] = \sum_{y\in\mathcal{X}} f(y)\, P\big( X_{n+1} = y \mid X_n = x \big) = \sum_{y\in\mathcal{X}} Q_{x,y} f(y) = (Qf)(x). \]
Thus $\big( f(X_n) \big)_{n\ge 0}$ is a martingale iff ∆f = 0, i.e., f is harmonic with respect to this Laplacian. This sequence is a supermartingale iff ∆f ≥ 0, i.e., f is superharmonic with respect to the Laplacian ∆.
A function f : X → R is said to be harmonic on a subset U ⊂ X if
\[ \Delta f(u) = 0,\ \forall u \in U \iff f(u) = \sum_{x\in\mathcal{X}} Q_{u,x} f(x),\ \forall u \in U. \]
Then the sequence $\big( M^f_n \big)_{n\ge 0}$ is a martingale.
and
\[ E\big[ M^f_{n+1} - M^f_n \,\big\|\, \mathcal{F}_n \big] = E\big[ f(X_{n+1}) \,\big\|\, \mathcal{F}_n \big] - f(X_n) + \Delta f(X_n) \overset{(4.3.8)}{=} Qf(X_n) - f(X_n) + \Delta f(X_n) = 0. \]
□
Theorem 4.75. Suppose that the HMC (Xn )n≥0 is irreducible and recurrent. Then
any bounded Lyapunov function is constant.
Thus, the sequence $h(X_n)$ has a.s. two different limit points $h(x_0)$ and $h(x_1)$, and thus $h(X_n)$ is a.s. divergent! □
Corollary 4.76. If the irreducible HMC $(X_n)_{n\ge 0}$ admits a nonconstant, bounded Lyapunov function, then it must be transient. □
Example 4.77. Suppose that (Xn )n≥0 describes the simple random walk on a
locally finite connected graph G = (V, E) with vertex set V and edge set E.
where u ∼ v indicates that the vertices u and v are neighbors, i.e., connected by an
edge.
Suppose that G is a rooted binary tree. This means that G is a tree, it has a unique vertex v0 of degree 1, and every other vertex has degree 3. One can think that any vertex other than the root has a unique direct ancestor and two direct successors, while the root has a unique successor.
One can think of v0 as the generation zero vertex. Its unique successor is the generation 1 vertex. Its two successors form the second generation of vertices. Their 4 successors determine the third generation etc. Equivalently, a vertex belongs to the n-th generation, n > 1, if its predecessor is in the (n − 1)-th generation. We obtain in this fashion a generation function $g : V \to \mathbb{N}_0$.
Define
\[ f : V \to (0, \infty), \qquad f(v) = 2^{-g(v)}. \]
Any vertex v ≠ v0 has two neighbors of generation g(v) + 1 and one neighbor of generation g(v) − 1. Hence
\[ \sum_{u \sim v} f(u) = 2^{-g(v)+1} + 2 \cdot 2^{-g(v)-1} = 3 \cdot 2^{-g(v)} = 3 f(v), \]
so that
\[ f(v) = \frac{1}{3} \sum_{u \sim v} f(u), \quad \forall v \in V \setminus \{v_0\}. \]
Definition 4.78. A function $f \in L^0(\mathcal{X})$ is called coercive if, for any C > 0, the set $\{f \le C\}$ is a finite subset of X . □
Proposition 4.79. Let $(X_n)_{n\ge 0}$ be an irreducible HMC with state space X and transition matrix Q. Suppose that there exists a nonnegative coercive function f : X → [0, ∞) and a finite set A ⊂ X such that
\[ \sum_{y\in\mathcal{X}} Q_{x,y} f(y) \le f(x), \quad \forall x \in \mathcal{X} \setminus A. \tag{4.3.9} \]
Then the chain $(X_n)_{n\ge 0}$ is recurrent.
Remark 4.80. We want to mention that the condition (4.3.9) is also necessary for recurrence. For a proof we refer to [56, Sec. 2.2]. □
Theorem 4.81 (Foster). Let (Xn )n≥0 be an irreducible HMC with state space X
and transition matrix Q. Suppose that there exists a function f : X → [0, ∞), a
finite set A ⊂ X and ε > 0 such that
\[ \sum_{y\in\mathcal{X}} Q_{x,y} f(y) \le f(x) - \varepsilon, \quad \forall x \in \mathcal{X} \setminus A, \tag{4.3.10a} \]
\[ \sum_{y\in\mathcal{X}} Q_{x,y} f(y) < \infty, \quad \forall x \in A. \tag{4.3.10b} \]
Then the chain $(X_n)_{n\ge 0}$ is positively recurrent.
Proof. We follow [56, Sec. 2.2]. Denote by $T_A$ the time of first return to A and set $Y_n := X_{n \wedge T_A}$. Suppose $X_0 = x \in \mathcal{X} \setminus A$. Then (4.3.10a) reads
\[ E_x\big[ f(Y_{n+1}) \,\big\|\, \mathcal{F}_n \big] - f(Y_n) \le -\varepsilon\, I_{\{T_A > n\}}. \]
Thus $f(Y_n)$ is a nonnegative supermartingale and
\[ E_x\big[ f(Y_{n+1}) \big] - E_x\big[ f(Y_n) \big] \le -\varepsilon\, P_x\big( T_A > n \big). \]
Hence
\[ E_x\big[ f(Y_{n+1}) \big] - f(x) = E_x\big[ f(Y_{n+1}) \big] - E_x\big[ f(Y_0) \big] \le -\varepsilon \sum_{k=0}^{n} P_x\big( T_A > k \big), \]
so that
\[ \sum_{k=0}^{n} P_x\big( T_A > k \big) \le \frac{1}{\varepsilon} f(x). \]
Letting n → ∞ we deduce
\[ E_x[T_A] \le \frac{1}{\varepsilon} f(x), \quad \forall x \in \mathcal{X} \setminus A. \tag{4.3.11} \]
Now let a ∈ A. Then
\[ E_a[T_A] = \sum_{b\in A} Q_{a,b} + \sum_{x\in\mathcal{X}\setminus A} Q_{a,x}\big( 1 + E_x[T_A] \big) = 1 + \sum_{x\in\mathcal{X}\setminus A} Q_{a,x}\, E_x[T_A] \overset{(4.3.11)}{\le} 1 + \frac{1}{\varepsilon} \sum_{x\in\mathcal{X}\setminus A} Q_{a,x} f(x) \overset{(4.3.10b)}{<} \infty. \]
Thus $E_a[T_A] < \infty$, ∀a ∈ A, and Proposition 4.57 implies that $(X_n)_{n\ge 0}$ is positively recurrent. □
Example 4.83. Consider the biased random walk on N0 = {0, 1, . . . } with transi-
tion probabilities
Q0,1 = 1, Qn,n+1 = pn , Qn,n−1 = qn := 1 − pn , ∀n ∈ N.
Above pn , qn > 0, ∀n ∈ N, so that the corresponding Markov chain is irreducible.
Consider the coercive function
f : N0 → [0, ∞), f (n) = n.
Then, ∀n ≥ 1 we have
\[ \Delta f(n) = n - \big( p_n(n+1) + q_n(n-1) \big) = q_n - p_n. \]
Thus, if $q_n \ge p_n$ for all n, this random walk is recurrent. Moreover, if
\[ \inf_{n\in\mathbb{N}} (q_n - p_n) > 0, \]
then this random walk is positively recurrent. □
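Foster's bound (4.3.11), $E_x[T_A] \le f(x)/\varepsilon$, can be illustrated by Monte Carlo on the biased walk above with constant $p_n = 0.4$, $q_n = 0.6$, f(n) = n, A = {0}, ε = 0.2; for this walk $E_x[T_0] = x/(q-p)$, so the bound is attained. (The simulation below is our own sketch.)

```python
import random

random.seed(1)

# Biased walk on N_0 with constant up-probability p = 0.4; Foster's bound
# with f(n) = n and eps = 0.2 gives E_x[T_0] <= 5x, attained for this walk.
p, x0, runs = 0.4, 5, 20_000

def hitting_time_of_zero(x):
    t = 0
    while x > 0:
        x += 1 if random.random() < p else -1
        t += 1
    return t

mean_T = sum(hitting_time_of_zero(x0) for _ in range(runs)) / runs
# mean_T should be close to x0 / (q - p) = 25.
```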
The reversibility means that there exists a function c : X → (0, ∞) such that
c(y)Qy,x = c(x)Qx,y ∀x, y ∈ X . (4.4.1)
Note that any positive multiple of c also satisfies (4.4.1). We set
c(x, y) := c(x)Qx,y , ∀x, y ∈ X .
The detailed balance condition (4.4.1) shows that c(x, y) = c(y, x) and c(x, y) ≠ 0 iff $Q_{x,y} > 0$. Note that
\[ Q_{x,y} = \frac{c(x,y)}{c(x)}, \qquad c(x) = \sum_{y \sim x} c(x,y). \tag{4.4.2} \]
As observed by R. Bott and by H. Weyl, see e.g. [16], the physical laws of electric
networks have simple geometric interpretations, best expressed in the language of
Hodge theory.
The main objects in Hodge theory are the chain/cochain complexes. To define
them we need to make some choices.
Consider a locally finite graph (X , E). An orientation of the graph is a subset
E+ ⊂ E such that for any edge (x, y) ∈ E either (x, y) ∈ E+ or (y, x) ∈ E+ , but
not both.
One can obtain such an E+ by assigning orientations (arrows) along the edges.
Define E+ as the collection of positively oriented edges. More precisely (x, y) ∈ E+ ,
if and only if the arrow of the oriented edge goes from x to y.
The vector space of 0-chains, denoted by C0 , consists of formal sums of the type
\[ j := \sum_{x\in\mathcal{X}} j(x)\,[x], \qquad j(x) \in \mathbb{R},\ \forall x \in \mathcal{X}. \]
• If i ∈ C1 , then
\[ \partial i := \sum_{x\in\mathcal{X}} w(x)\,[x], \qquad w(x) = \sum_{y\in N(x)} i(x,y) = -\sum_{y\in N(x)} i(y,x), \quad \forall x \in \mathcal{X}. \]
In particular, for $x_0, x_1 \in \mathcal{X}$,
\[ \partial\big( [x_0] \mapsto [x_1] \big) = [x_1] - [x_0]. \]
• If $j \in C_0^{\mathrm{cpt}}$, then
\[ \partial j = \sum_{x\in\mathcal{X}} j(x) \in \mathbb{R}. \]
Let us observe that for any compactly supported current i we have $\partial^2 i = 0$. Indeed,
\[ \partial(\partial i) = \sum_{x\in\mathcal{X}} \sum_{y\in N(x)} i(x,y) = \sum_{(x,y)\in E} i(x,y) = 0, \]
since $i(x,y) = -i(y,x)$.
Remark 4.84. If X is infinite, then there could exist 1-chains i such that $\partial i \in C_0^{\mathrm{cpt}}$ yet $\partial^2 i \ne 0$. □
The (finite) paths in the graph are special examples of compactly supported 1-chains. By a path of length n we understand a sequence of neighbors
\[ x_0, x_1, \dots, x_n, \qquad x_{k-1} \sim x_k,\ x_{k-1} \ne x_k,\ \forall k = 1, \dots, n. \]
We denote by C1∞ the space of finite energy 1-chains. The space C1∞ is endowed
with a (resistor) inner product
\[ \langle i_1, i_2 \rangle_r := \sum_{(x,y)\in E_+} r(x,y)\, i_1(x,y)\, i_2(x,y). \tag{4.4.7} \]
Lemma 4.85. A 1-cochain v is exact if and only if its integral along any closed path is 0. Equivalently, this means that the integral along a path depends only on the endpoints of the path. □
Definition 4.86. A Kirchhoff current is a finite energy current i such that its dual $i^* = Ri$ is exact. A function $u \in C_0$ such that $i^* = -du$ is called a potential of the Kirchhoff current. □
Proof. We have
\[ \langle \Delta u, v \rangle_c = \sum_{x\in\mathcal{X}} c(x)\,\Delta u(x)\, v(x) = \sum_{x\in\mathcal{X}} \sum_{y\in N(x)} c(x) Q_{x,y} \big( u(x) - u(y) \big) v(x) = \sum_{x,y\in\mathcal{X}} c(x,y) \big( u(x) - u(y) \big) v(x) \]
\[ = \sum_{(x,y)\in E_+} c(x,y) \big( u(x) - u(y) \big) v(x) + \sum_{(y,x)\in E_+} c(x,y) \big( u(x) - u(y) \big) v(x). \]
The same argument, with the roles of u and v reversed, shows that
\[ \langle du, dv \rangle_c = \langle u, \Delta v \rangle_c. \]
The above expressions are well defined since both dv and ∆v are compactly supported because the graph is locally finite. □
Proof. We have $0 = \langle \Delta u, u \rangle_c = \langle du, du \rangle_c$. Hence du = 0 and, since X is connected, we deduce that u is constant. Since S ≠ ∅, we deduce that u is identically zero. □
The energy of u is
\[ \langle du, du \rangle_c = \langle u, \Delta u \rangle_c = \sum_{x\in\mathcal{X}} c(x)\, u(x)\, \Delta u(x) = c(x_+)\, u(x_+)\, \Delta u(x_+) = u(x_+)\, j(x_+). \tag{4.4.17} \]
Now observe that $u(x_+) = 1$ so that
\[ \Delta u(x_+) = 1 - \sum_{x\in N(x_+)} Q_{x_+,x}\, u(x) = 1 - \sum_{x\in N(x_+)} Q_{x_+,x}\, P_x\big( H_{S_-} > H_{x_+} \big) = P_{x_+}\big( T_{x_+} > H_{S_-} \big). \]
Hence
\[ \mathcal{E}_c(du) \overset{(4.4.17)}{=} c(x_+)\,\Delta u(x_+) = j_{x_+,S_-}(x_+) = c(x_+)\, P_{x_+}\big( T_{x_+} > H_{S_-} \big) =: \kappa(x_+, S_-). \tag{4.4.18} \]
The quantity κ(x+ , S− ) is called the effective conductance from x+ to S− . Its inverse is called the effective resistance between x+ and S− and it is denoted by $R_{\mathrm{eff}}(x_+, S_-)$. Thus
\[ R_{\mathrm{eff}}(x_+, S_-) = \frac{1}{c(x_+)\, P_{x_+}\big( T_{x_+} > H_{S_-} \big)}. \]
We set
\[ u_{x_+,S_-}(x) := \frac{1}{\kappa(x_+, S_-)}\, u(x) = \frac{1}{\kappa(x_+, S_-)}\, P_x\big( H_{x_+} < H_{S_-} \big). \]
This is the potential of the compactly supported Kirchhoff current $i_{x_+,S_-}$ with source
\[ j = j_{x_+,S_-}, \qquad j(x) = \frac{1}{\kappa(x_+, S_-)}\, c(x)\, \Delta u(x) = \frac{c(x)}{c(x_+)\, P_{x_+}\big( T_{x_+} > H_{S_-} \big)}\, \Delta u(x), \tag{4.4.20} \]
where u is defined in (4.4.15), $u(x) = P_x\big( H_{x_+} < H_{S_-} \big)$.
Its energy is
\[ E_{x_+,S_-} := \mathcal{E}_c\big( du_{x_+,S_-} \big) = \frac{1}{\kappa(x_+,S_-)^2}\, \mathcal{E}_c(du) = \frac{1}{\kappa(x_+,S_-)} = u_{x_+,S_-}(x_+). \tag{4.4.21} \]
Since
\[ u(x) = P_x\big( H_{x_+} < H_{S_-} \big) \le 1 = u(x_+), \quad \forall x \in \mathcal{X}, \]
we deduce
\[ 0 \le u_{x_+,S_-}(x) \le u_{x_+,S_-}(x_+) = E_{x_+,S_-}, \quad \forall x \in \mathcal{X}. \tag{4.4.22} \]
Let us observe that if X is finite and S− = {x− }, then the equality (4.4.16) shows that
\[ 0 = \sum_{x\in\mathcal{X}} j_{x_+,x_-}(x) = j_{x_+,x_-}(x_+) + j_{x_+,x_-}(x_-), \]
and thus
\[ j_{x_+,x_-}(x) = \begin{cases} \pm 1, & x = x_\pm, \\ 0, & x \ne x_\pm. \end{cases} \]
A flow from x+ to S− satisfies the second Kirchhoff law if and only if it has
finite energy and i∗ is the differential of a function u : X → R. We will refer to
such flows as Kirchhoff flows.
Proof. We have
\[ \sum_{y\in N(x)} i(x,y) = 0, \quad \forall x \in \mathcal{X}, \]
so
\[ \sum_{(x,y)\in E_+} i(x,y)\big( u(y) - u(x) \big) = \frac{1}{2} \sum_{(x,y)\in E} i(x,y)\big( u(y) - u(x) \big) \]
\[ = -\frac{1}{2} \sum_{x\in\mathcal{X}} u(x) \sum_{y\in N(x)} i(x,y) + \frac{1}{2} \sum_{y\in\mathcal{X}} u(y) \sum_{x\in N(y)} i(x,y) = 0. \]
All the above sums involve only finitely many terms since i is compactly supported. □
(i) The current i0 := ix+ ,S− defined by (4.4.19) is the unique compactly supported
Kirchhoff current with source the dipole j 0 = jx+ ,S− . In particular it is a
Kirchhoff flow from x+ to S− .
(ii) The voltage function $u = u_{x_+,S_-}$ that determines i0 is the unique solution of the boundary value problem
\[ \Delta v(x) = 0, \quad \forall x \in \mathcal{X} \setminus \big( S_- \cup \{x_+\} \big), \qquad v(x) = \begin{cases} \dfrac{1}{c(x_+)}, & x = x_+, \\ 0, & x \in S_-. \end{cases} \tag{4.4.23} \]
(iii) The energy of i0 is
\[ \mathcal{E}(i_0) = E_{x_+,S_-} = u_{x_+,S_-}(x_+) = \frac{1}{c(x_+)\, P_{x_+}\big( T_{x_+} > H_{S_-} \big)} = R_{\mathrm{eff}}(x_+, S_-). \]
(iv) If i1 is another compactly supported flow from x+ to S− , then
\[ \mathcal{E}(i_1) \ge \mathcal{E}\big( i_{x_+,S_-} \big). \]
Proof. (i) Set $u_0 := u_{x_+,S_-}$. Recall that $u_0$ has compact support. Suppose that $i_1$ is another compactly supported Kirchhoff flow from $x_+$ to $S_-$. Then there exists a function $u_1 : \mathcal{X}\to\mathbb{R}$ such that $i_1^* = du_1$. We deduce from (4.4.12) that the functions $u_k$, $k = 0, 1$, are solutions of the same equation
$$ \Delta u_k(x) = \frac{1}{c(x)}\, j_0(x), \quad \forall x\in\mathcal{X}. $$
If we write $u = u_1 - u_0$, then $\Delta u = 0$ on $\mathcal{X}$. The function $u$ may not have compact support, but $du$ does. We have
$$ \langle du, du\rangle_c = \frac{1}{2}\sum_{(x,y)\in E} c(x,y)\big(u(x) - u(y)\big)\big(u(x) - u(y)\big) $$
$$ = \frac{1}{2}\sum_{(x,y)\in E} c(x,y)\big(u(x) - u(y)\big)u(x) - \frac{1}{2}\sum_{(x,y)\in E} c(x,y)\big(u(x) - u(y)\big)u(y) $$
$$ = \frac{1}{2}\sum_{x\in\mathcal{X}} u(x)\underbrace{\sum_{y\in N(x)} c(x,y)\big(u(x) - u(y)\big)}_{=c(x)\Delta u(x)=0} + \frac{1}{2}\sum_{y\in\mathcal{X}} u(y)\underbrace{\sum_{x\in N(y)} c(y,x)\big(u(y) - u(x)\big)}_{=c(y)\Delta u(y)=0} = 0. $$
Hence du = 0 so that i0 = i1 .
(ii) If $v_1$, $v_2$ are two compactly supported solutions of (4.4.23) and $v = v_1 - v_2$, then the argument above shows that $\langle dv, dv\rangle_c = 0$ and, since $v$ is compactly supported, we deduce that $v = 0$.
The equality (iii) follows from (4.4.21).
Writing $i_1 = i_0 + i$ with $\partial i = 0$ and using $i_0^* = du$, we obtain
$$ \mathcal{E}\big[i_1\big] = \mathcal{E}\big[i_0\big] + \mathcal{E}\big[i\big] + 2\langle du, i^*\rangle_c \ge \mathcal{E}\big[i_0\big] + 2\langle du, i^*\rangle_c. $$
Lemma 4.90 shows that $\langle du, i^*\rangle_c = 0$ since $i$ has compact support and $\partial i = 0$. □
Remark 4.92. (a) Part (iv) of the theorem is known as the Thomson or Dirichlet Principle. It classically states that the Kirchhoff flow is the least energy compactly supported flow sourced by the dipole $j_{x_+,S_-}$. Observe that the energy of the Kirchhoff flow carries information about the dynamics of the Markov chain associated to the electric network.
(b) The Kirchhoff flow from x+ to S− is the unique compactly supported current i
such that
• ∂i(x+ ) = −1.
• There exists a function u : X → R, identically zero on S− , such that i∗ = du.
□
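On a finite network the characterization in Remark 4.92(b) can be turned into a direct numerical computation: solve the grounded linear system and read off $R_{\mathrm{eff}}(x_+, x_-) = u(x_+)$. The following is a minimal sketch (not from the book; the function name and the test network are illustrative), assuming the conductances are given as a symmetric matrix:

```python
import numpy as np

def effective_resistance(c, x_plus, x_minus):
    """R_eff(x+, x-) for a finite network with conductance matrix c.

    Solves the grounded problem of (4.4.25): a unit current is injected
    at x+ and extracted at x-, with u(x-) = 0; then R_eff = u(x+),
    cf. (4.4.26).
    """
    n = c.shape[0]
    L = np.diag(c.sum(axis=1)) - c                 # weighted graph Laplacian
    keep = [i for i in range(n) if i != x_minus]   # ground the sink
    rhs = np.zeros(n)
    rhs[x_plus] = 1.0                              # dipole source at x+
    u = np.zeros(n)
    u[keep] = np.linalg.solve(L[np.ix_(keep, keep)], rhs[keep])
    return u[x_plus]

# 4-cycle with unit conductances: two parallel paths of resistance 2 each,
# so R_eff(0, 2) = 1
c = np.zeros((4, 4))
for i, j in [(0, 1), (1, 2), (2, 3), (3, 0)]:
    c[i, j] = c[j, i] = 1.0
assert abs(effective_resistance(c, 0, 2) - 1.0) < 1e-9
```

The grounded Laplacian is invertible because the network is connected, so the sketch applies to any finite connected electric network.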
4.4.5 Degenerations
To proceed further we perform a reduction to a finite network. We set
$$ S_+ := \mathcal{X}\setminus S_-, \qquad \partial S_+ := \big\{\, s\in S_-;\ N(s)\cap S_+ \ne \emptyset \,\big\}. $$
For simplicity we assume that x+ does not have any neighbor in S− . We obtain a
new finite electric network X /S− described as follows.
• Its vertex set is $S_+\cup\{x_-\}$. Think of the vertices in $S_-$ as having been identified to a single point $x_-$.
• The conductances $c_*(x,y)$ of $\mathcal{X}/S_-$ are defined according to the rule
$$ c_*(x,y) = \begin{cases} c(x,y), & x, y\in S_+,\\[2pt] \sum_{s\in\partial S_+} c(x, s), & x\in S_+,\ y = x_-,\\[2pt] \sum_{s\in\partial S_+} c(s, y), & x = x_-,\ y\in S_+. \end{cases} $$
∆∗ u = ∆u on S+ .
Moreover
$$ c_*(x_\pm)\,\Delta_* u_*(x_\pm) = \pm 1. $$
Thus $u_*$ is the potential of the Kirchhoff flow on $\mathcal{X}/S_-$ from $x_+$ to $x_-$. We denote by $\mathcal{E}_{x_+,x_-}$ its energy. Note that the induced Kirchhoff flow on $\mathcal{X}/S_-$ has the same energy as the original Kirchhoff flow on $\mathcal{X}$, i.e.,
$$ \mathcal{E}_{x_+,x_-} = \mathcal{E}_{x_+,S_-}. \tag{4.4.24} $$
On the finite graph X /S− the flows from x+ to x− can be thought of as paths
from x+ and x− . They all have finite energy. The Kirchhoff flow is the path with
minimal energy from x+ to x− .
In view of this reduction to finite graphs we concentrate on finite electric networks. Suppose $(\mathcal{X}, E, c)$ is such a network and $x_+, x_-\in\mathcal{X}$, $x_+\ne x_-$. For finite
graphs the finite energy condition is automatically satisfied and a flow from x+ to
x− is simply a 1-chain i such that
∂i = [x− ] − [x+ ].
The source [x+ ] − [x− ] is called a dipole with source x+ and sink x− .
The flow condition involves only the topology of the graph and is independent of the
physics/geometry of the network encoded by the conductance function. However,
the Kirchhoff flow depends on the physics/geometry of the network.
Denote by $i = i_{x_+,x_-}$ the Kirchhoff flow with source $x_+$ and sink $x_-$. Its potential grounded at $x_-$ is the function $u : \mathcal{X}\to\mathbb{R}$ uniquely determined by the equations
$$ \Delta u = 0 \ \text{on}\ \mathcal{X}\setminus\{x_+, x_-\}, \qquad u(x_-) = 0, \qquad c(x_+)\Delta u(x_+) = 1. \tag{4.4.25} $$
Then $i_{x_+,x_-} = r^{-1}\, du_{x_+,x_-}$. The energy of this flow is
$$ \mathcal{E}_{x_+,x_-} = u_{x_+,x_-}(x_+) = \frac{1}{c(x_+)\,\mathbb{P}_{x_+}\big[\, T_{x_+} > T_{x_-} \,\big]}. \tag{4.4.26} $$
This quantity is an invariant of the quadruplet (X , c, x+ , x− ).
Clearly, if we vary the conductance function the energy changes, and a flow that has minimal energy for one choice of conductance may fail to have this property for another choice or, equivalently, for another choice of the resistance function $r(x,y) = \frac{1}{c(x,y)}\in(0,\infty]$. We will indicate the dependence of $\mathcal{E}_{x_+,x_-}$ on $r$ using the notation $\mathcal{E}_{x_+,x_-}(r)$.
Suppose we change the conductance function to a new function $c'$ that is bigger or, equivalently, such that $r'(x,y)\le r(x,y)$. Then for any current $i$ we have
$$ \mathcal{E}_r\big[i\big] = \frac{1}{2}\sum_{(x,y)\in E} r(x,y)\, i(x,y)^2 \ \ge\ \frac{1}{2}\sum_{(x,y)\in E} r'(x,y)\, i(x,y)^2 = \mathcal{E}_{r'}\big[i\big]. $$
Theorem 4.93 (Rayleigh). The energy of the Kirchhoff flow with given source and sink increases when the resistance function increases or, equivalently, when the conductance function decreases.
Proof. Suppose that we decrease the resistance of an edge from $r(x,y)$ to $r'(x,y)$. Denote by $i(r)$ the Kirchhoff flow with source $x_+$, sink $x_-$ and choice of resistance $r$. Define $i(r')$ in a similar fashion. We have
$$ \mathcal{E}_{x_+,x_-}(r) = \mathcal{E}_r\big[i(r)\big] \ge \mathcal{E}_{r'}\big[i(r)\big] \ge \mathcal{E}_{r'}\big[i(r')\big] = \mathcal{E}_{x_+,x_-}(r'). \qquad \square $$
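Rayleigh's monotonicity principle can be checked numerically: raising any conductance can only lower the effective resistance between two fixed nodes. A small sketch (the random network is an illustrative choice; the helper `reff` solves the grounded system of (4.4.25)):

```python
import numpy as np

def reff(c, a, b):
    # grounded potential: unit current a -> b with u(b) = 0; R_eff = u(a)
    n = c.shape[0]
    L = np.diag(c.sum(1)) - c
    keep = [i for i in range(n) if i != b]
    rhs = np.zeros(n)
    rhs[a] = 1.0
    u = np.zeros(n)
    u[keep] = np.linalg.solve(L[np.ix_(keep, keep)], rhs[keep])
    return u[a]

rng = np.random.default_rng(0)
A = np.triu(rng.uniform(0.5, 2.0, (6, 6)), 1)      # random complete network
c = A + A.T
c2 = c.copy()
c2[1, 3] = c2[3, 1] = c[1, 3] + 5.0                # raise one conductance
# resistance between 0 and 5 cannot increase (Theorem 4.93)
assert reff(c2, 0, 5) <= reff(c, 0, 5) + 1e-12
```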
We can use this principle to produce estimates for $\mathcal{E}_{x_+,x_-}(r)$ in terms of $\mathcal{E}_{x_+,x_-}(r')$ if $r'$ is chosen wisely, making $\mathcal{E}_{x_+,x_-}(r')$ easier to compute. One way to simplify the computation of $\mathcal{E}_{x_+,x_-}(r')$ is to modify the topology of the graph. We can achieve this by pushing $r$ to extreme values. Let us describe two such degenerations.
Suppose $y_0, y_1\in\mathcal{X}\setminus\{x_+, x_-\}$ are two nodes connected by an edge. Upon rescaling $c$ we can assume that $c(y_0, y_1) = 1 = r(y_0, y_1)$. We have a family of deformed resistances
$$ r_t : E\to(0,\infty), \quad t > 0, \qquad r_t(x, x') = \begin{cases} t, & (x, x') = (y_0, y_1) \text{ or } (y_1, y_0),\\ r(x, x'), & \text{otherwise}. \end{cases} $$
• $\mathcal{X}' = \big(\mathcal{X}\setminus\{y_0, y_1\}\big)\cup\{*\}$.

[Figure: the edge $(y_0, y_1)$, with adjacent conductances $c_1, c_2, c_3, c_4$, is collapsed to a single vertex $*$ carrying the same adjacent conductances.]
and
$$ \lim_{t\to\ell}\mathcal{E}_t = \mathcal{E}_\ell, \quad \ell = 0, \infty. $$
Fix a path $\gamma$ from $x_+$ to $x_-$ avoiding the edge $(y_0, y_1)$; its energy $\mathcal{E}_r[\gamma]$ is independent of $t$ since the path avoids the only edge whose resistance depends on $t$. We deduce from Thomson's principle that
$$ \mathcal{E}_t \le \mathcal{E}_r[\gamma], \quad \forall t > 0. $$
This shows that the family of functions ut : X → [0, ∞) is relatively compact with
respect to the usual topology of the finite dimensional vector space RX .
2. t → ∞. In this case observe that
lim ct (x, y) = c∞ (x, y), ∀x, y ∈ X .
t→∞
We will show that as t → ∞ the family ut has only one limit point. Suppose that
for a sequence tn → ∞ the functions utn converge to a function v. The function
utn satisfies the equation
$$ \sum_{y\in\mathcal{X}} c_{t_n}(x,y)\big(u_{t_n}(x) - u_{t_n}(y)\big) = \begin{cases} 0, & x\ne x_\pm,\\ \pm 1, & x = x_\pm, \end{cases} \qquad u_{t_n}(x_-) = 0. $$
Letting n → ∞ we deduce that v satisfies
$$ \sum_{y\in\mathcal{X}} c_\infty(x,y)\big(v(x) - v(y)\big) = \begin{cases} 0, & x\ne x_\pm,\\ \pm 1, & x = x_\pm, \end{cases} \qquad v(x_-) = 0. $$
According to Theorem 4.91(ii) the above equation has a unique solution, the potential $u_\infty$ of the Kirchhoff flow from $x_+$ to $x_-$ grounded at $x_-$ in $(\mathcal{X}_\infty, c_\infty)$, proving that
$$ \lim_{t\to\infty} u_t = u_\infty. $$
The equality
$$ \lim_{t\to\infty}\mathcal{E}_t = \mathcal{E}_\infty $$
is obvious.
3. $t\to 0$. The above argument fails in this case because $c_t(y_0, y_1) = \frac{1}{t}$. Pick a sequence $t_n\searrow 0$ such that $u_{t_n}$ has a limit $u_0$ as $t_n\to 0$. To simplify the presentation we will write $u_t$ instead of $u_{t_n}$. We will show that
$$ u_0(y_0) = u_0(y_1). \tag{4.4.27} $$
We set
$$ N_*(y_0) := N(y_0)\setminus\{y_1\}, \qquad N_*(y_1) := N(y_1)\setminus\{y_0\}, \qquad c_*(y_i) := \sum_{y\in N_*(y_i)} c(y_i, y), \quad i = 0, 1, $$
so that
$$ \big(1 + t\,c_*(y_0)\big)u_t(y_0) - u_t(y_1) = t\sum_{y\in N_*(y_0)} c(y_0, y)\,u_t(y), $$
$$ \big(1 + t\,c_*(y_1)\big)u_t(y_1) - u_t(y_0) = t\sum_{y\in N_*(y_1)} c(y_1, y)\,u_t(y). $$
Thus $\big(u_t(y_0), u_t(y_1)\big)$ is the solution of the $2\times 2$ non-homogeneous linear system
$$ \underbrace{\begin{bmatrix} a_0(t) & -1\\ -1 & a_1(t) \end{bmatrix}}_{=:A(t)}\cdot\begin{bmatrix} u_t(y_0)\\ u_t(y_1) \end{bmatrix} = t\cdot\underbrace{\begin{bmatrix} c_0(t)\\ c_1(t) \end{bmatrix}}_{=:\alpha(t)}, $$
where
$$ a_i(t) = 1 + t\,c_*(y_i), \qquad c_i(t) = \sum_{y\in N_*(y_i)} c(y_i, y)\,u_t(y), \quad i = 0, 1. $$
Note that
$$ \det A(t) = a_0(t)a_1(t) - 1 = t\big(c_*(y_0) + c_*(y_1)\big) + O(t^2) = t\,c'(*) + O(t^2). $$
Set
$$ A_0(t) = \begin{bmatrix} c_0(t) & -1\\ c_1(t) & a_1(t) \end{bmatrix}, \qquad A_1(t) = \begin{bmatrix} a_0(t) & c_0(t)\\ -1 & c_1(t) \end{bmatrix}. $$
Using Cramer's rule we deduce
$$ u_t(y_0) = \frac{t\det A_0(t)}{\det A(t)} = \frac{a_1(t)c_0(t) + c_1(t)}{c'(*) + O(t)}. $$
Hence
$$ u_0(y_0) = u_0(y_1) = \bar{u}_0(*) := \frac{\sum_{y\in N'(*)} c'(*, y)\,u_0(y)}{c'(*)}. $$
This proves (4.4.27). The equality (4.4.28a) is obvious. Observe that
$$ \bar{u}_0(*)\sum_{y\in N'(*)} c'(*, y) = \sum_{y\in N'(*)} c'(*, y)\,\bar{u}_0(y), $$
i.e.,
$$ \sum_{y\in N'(*)} c'(*, y)\big(\bar{u}_0(*) - \bar{u}_0(y)\big) = 0, $$
and
$$ N'(x) = \big(N(x)\setminus\{y_0, y_1\}\big)\cup\{*\}. $$
This proves the equality (4.4.28b). This determines $\bar{u}_0$ uniquely and shows that
$$ \lim_{t\to 0} u_t(x) = \bar{u}_0\big(p(x)\big). $$
Note that
$$ \mathcal{E}_t = \frac{1}{2}\sum_{(x,y)\in E} c_t(x,y)\big(u_t(x) - u_t(y)\big)^2. $$
There are two problematic terms in the above sum, corresponding to $(x,y) = (y_0, y_1)$ or $(y_1, y_0)$, and their contribution to the energy is
$$ \frac{1}{t}\big(u_t(y_0) - u_t(y_1)\big)^2. $$
Now observe that
$$ u_t(y_0) - u_t(y_1) = \frac{c_0(t)\big(a_1(t) - 1\big) - c_1(t)\big(a_0(t) - 1\big)}{c'(*) + O(t)} = t\,\frac{c_*(y_1)c_0(t) - c_*(y_0)c_1(t)}{c'(*) + O(t)}. $$
Hence
$$ \frac{1}{t}\big(u_t(y_0) - u_t(y_1)\big)^2 = O(t) \ \text{ as } t\to 0, $$
so
$$ \lim_{t\to 0}\mathcal{E}_t = \lim_{t\to 0}\,\frac{1}{2}\sum_{(x,y)\in E\setminus\{(y_0,y_1),(y_1,y_0)\}} c_t(x,y)\big(u_t(x) - u_t(y)\big)^2 = \frac{1}{2}\sum_{(x,y)\in E'} c'(x,y)\big(u_0(x) - u_0(y)\big)^2 = \mathcal{E}_0. \qquad \square $$
Remark 4.95. (i) Let us explain what happens if the edge $(y_0, y_1)$ disconnects the graph but $x_+, x_-$ lie in the same connected component of the resulting graph. Denote by $(\mathcal{X}_0, E_0)$ the connected component containing $x_+, x_-$ and by $(\mathcal{X}_*, E_*)$ the other component. The compactness part of the argument still works since the energy of $u_t$ is bounded by the energy of a path in $(\mathcal{X}_0, E_0)$ connecting $x_+$ to $x_-$. Denote by $u_t^0$ the restriction of $u_t$ to $\mathcal{X}_0$ and by $u_t^*$ its restriction to $\mathcal{X}_*$.
Then
$$ \mathcal{E}\big[du_t\big] = \underbrace{\frac{1}{2}\sum_{(x,y)\in E_0} c(x,y)\big(u_t(x) - u_t(y)\big)^2}_{=:\mathcal{E}_t^0} + t\big(u_t(y_0) - u_t(y_1)\big)^2 + \underbrace{\frac{1}{2}\sum_{(x,y)\in E_*} c(x,y)\big(u_t(x) - u_t(y)\big)^2}_{=:\mathcal{E}_t^*}. $$
Note that
$$ \lim_{t\to 0} t\big(u_t(y_0) - u_t(y_1)\big)^2 = 0. $$
Arguing exactly as in Step 2 of the proof of Theorem 4.94 one can show that $u_t^0$ converges to $u^0_{x_+,x_-}$, the potential, grounded at $x_-$, of the Kirchhoff flow in $\mathcal{X}_0$ from $x_+$ to $x_-$. If $u_*$ is any limit point of $u_t^*$, then $u_*$ satisfies $\Delta_* u_* = 0$, so
$$ \langle du_*, du_*\rangle_* = \langle\Delta_* u_*, u_*\rangle_* = 0, $$
and $0$ is the only limit point of $\mathcal{E}_t^*$ as $t\to 0$.
We deduce
$$ \mathcal{E}_c\big[du\big] \le \lim_{t\to 0}\mathcal{E}_{c_t}\big[du_t\big] = \mathcal{E}_{c_0}\big[du_0\big], $$
so the energy of the Kirchhoff flow from $x_+$ to $x_-$ in $\mathcal{X}$ is not greater than the energy of the similar flow in $\mathcal{X}_0$.
(ii) To understand why shorting is tricky recall that $\mathcal{X}$ is finite, so the Markov chain defined by the conductance $c_t$ has an invariant probability measure given by
$$ \pi_t(x) = \frac{c_t(x)}{Z_t}, \qquad Z_t = \sum_{x\in\mathcal{X}} c_t(x). $$
4.4.6 Applications
We want to illustrate the usefulness of the above results on some concrete examples.
When the graph $(\mathcal{X}, E)$ is finite and all the edges have the same conductances, the Kirchhoff flow from $x_+$ to $x_-$ can be described explicitly in terms of certain counts of spanning trees, [74, Thm. 1.16]. In particular, its energy $K(x_+, x_-)$ is a topological invariant of the quadruplet $(\mathcal{X}, E, x_+, x_-)$ described explicitly in terms of spanning trees.
If we now assign conductances $c$ to the edges, the energy $\mathcal{E}_{x_+,x_-}(c)$ of the Kirchhoff flow from $x_+$ to $x_-$ satisfies
$$ \frac{1}{\sup c(x,y)}\, K(x_+, x_-) \le \mathcal{E}_{x_+,x_-}(c) \le \frac{1}{\inf c(x,y)}\, K(x_+, x_-). $$
The computation of K(x+ , x− ) is impractical for complicated graphs, but the above
rather rough estimate expresses in a simple fashion the fact that E x+ ,x− (c) depends
on both the topology and the geometry of the electrical network.
Example 4.96. Suppose that (X , E, c) is a finite electric network such that the
underlying graph is a tree. Then for any pair of points x+ , x− there exists a unique
1-chain i such that
∂i = [x− ] − [x+ ].
It is described by a minimal path
x+ = x0 , x1 , . . . , xn = x− .
This is the Kirchhoff flow from $x_+$ to $x_-$ and its energy is
$$ \mathcal{E}_{x_+,x_-} = \sum_{i=1}^n r(x_{i-1}, x_i) = \sum_{i=1}^n\frac{1}{c(x_{i-1}, x_i)}. $$
As a special case of this, consider the Ehrenfest urn model. Recall that the state space is the set $\mathcal{X} := \{0, 1, \dots, B\}$, $B\in\mathbb{N}$, and transition matrix $Q$ given by
$$ Q_{k,k-1} = \frac{k}{B}, \ \forall k\ge 1, \qquad Q_{j,j+1} = \frac{B-j}{B}, \ \forall j < B. $$
As explained in Example 4.47, this can be described as an electric network whose underlying graph is a path
$$ 0\to 1\to\cdots\to B, $$
with conductances
$$ c(j, j+1) = \binom{B}{j}\frac{B-j}{B} = \binom{B-1}{j}. $$
In particular,
$$ c(j) = \binom{B-1}{j} + \binom{B-1}{j-1} = \binom{B}{j}. $$
If $B$ is even, $B = 2N$, then
$$ \mathcal{E}_{0,N} = \mathcal{E}_{N,0} = \sum_{j=0}^{N-1}\frac{1}{\binom{2N-1}{j}}. $$
Thus
$$ \mathbb{P}_N\big[T_N > T_0\big] = \frac{1}{c(N)\,\mathcal{E}_{N,0}}, \qquad \mathbb{P}_0\big[T_0 > T_N\big] = \frac{1}{c(0)\,\mathcal{E}_{N,0}}. $$
Hence
$$ \frac{\mathbb{P}_0\big[T_0 > T_N\big]}{\mathbb{P}_N\big[T_N > T_0\big]} = \frac{c(N)}{c(0)} = \binom{2N}{N} \sim \frac{4^N}{\sqrt{\pi N}}. $$
In particular, this shows that $\mathbb{P}_N\big[T_N > T_0\big]$ is extremely small for large $N$. Thus, if initially the two chambers hold equal numbers of balls, the probability that, during the random transfers of balls between them, one of the chambers will continuously have fewer than half the balls until it empties is extremely small. In fact, the expected time of emptying the left chamber while starting with equal numbers of balls in both is (see [87, Sec. VII.3, p. 175] with $s = 2N$)
$$ \mathbb{E}_N\big[T_0\big] \sim 4^N\big(1 + A/N\big) \ \text{ as } N\to\infty, \quad 1\le A\le 2. \tag{4.4.29} $$
This example is historically important because it was used to explain an apparent contradiction between Boltzmann's kinetic theory of gases and classical thermodynamics. We refer to [13; 84] for more details. □
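The closed-form quantities in this example are easy to check numerically. A short sketch (the choice $B = 10$ is arbitrary):

```python
from math import comb

B = 10
N = B // 2                                # 2N = 10 balls
# conductances along the path 0 - 1 - ... - B (Example 4.47)
c_edge = lambda j: comb(B - 1, j)         # c(j, j+1) = binom(B-1, j)
E_N0 = sum(1 / c_edge(j) for j in range(N))   # energy = sum of resistances
c0, cN = comb(B, 0), comb(B, N)           # c(j) = binom(B, j)
p0 = 1 / (c0 * E_N0)                      # P_0[T_0 > T_N]
pN = 1 / (cN * E_N0)                      # P_N[T_N > T_0]
# the ratio is exactly binom(2N, N), which grows like 4^N / sqrt(pi N)
assert abs(p0 / pN - comb(B, N)) < 1e-9
assert 0 < pN < p0 <= 1
```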
Note that the balls Bn are finite. For n ∈ N we denote by C(n) the total number
of edges connecting a point in Σn−1 to a point in Σn .
[Figure 4.5: top, the network $\mathcal{X}^n$ with spheres $S_1, S_2, \dots, S_{n-1}$ separating $x_+$ from $x_n^-$; bottom, the network $\mathcal{X}^n_*$ obtained by shorting each sphere.]

Denote by $\mathcal{E}_{x_+,x_n^-}$ the energy of the Kirchhoff flow in $\mathcal{X}^n$ from $x_+$ to $x_n^-$.
As we have seen,
$$ \frac{1}{c(x_+)\,\mathbb{P}_{x_+}\big[\, T_{x_+} > H_{S_n^-} \,\big]} = \mathcal{E}_{x_+,x_n^-}. $$
Observe that the collapsed network $\mathcal{X}/S_n^-$ is obtained from the collapsed network $\mathcal{X}/S_{n+1}^-$ by first shorting the edges in $\Sigma_n\subset\mathcal{X}/S_{n+1}^-$ and then shorting the edge $(x_n^-, x_{n+1}^-)$. Hence
$$ \mathcal{E}_{x_+,x_n^-} \le \mathcal{E}_{x_+,x_{n+1}^-}. $$
We set
$$ \mathcal{E}_{x_+,\infty} := \lim_{n\to\infty}\mathcal{E}_{x_+,x_n^-} = \lim_{n\to\infty}\frac{1}{c(x_+)\,\mathbb{P}_{x_+}\big[\, T_{x_+} > H_{S_n^-} \,\big]}. $$
Thus
$$ \lim_{n\to\infty}\mathbb{P}_{x_+}\big[\, T_{x_+} > H_{S_n^-} \,\big] = \frac{1}{c(x_+)\,\mathcal{E}_{x_+,\infty}}. $$
We deduce that the associated Markov chain is recurrent if and only if E x+ ,∞ = ∞
and transient otherwise.
To estimate $\mathcal{E}_{x_+,x_n^-}$ from below we short edges in $\mathcal{X}/S_n^-$. First we short the edges between points in $\Sigma_k$, $k = 1, \dots, n-1$. We obtain the electric network $\mathcal{X}^n_*$ at the bottom of Figure 4.5. As explained in Example 4.96, the energy of the Kirchhoff flow in $\mathcal{X}^n_*$ from $x_+$ to $x_n^-$ is
$$ \mathcal{E}_n = \sum_{k=1}^n\frac{1}{C(k)} \le \mathcal{E}_{x_+,x_n^-}. $$
Hence
$$ \mathcal{E}_{x_+,\infty} \ge \sum_{k=1}^\infty\frac{1}{C(k)}. $$
We deduce that if
$$ \sum_{k=1}^\infty\frac{1}{C(k)} = \infty, $$
then the corresponding Markov chain is recurrent.
To estimate $\mathcal{E}_{x_+,\infty}$ from above we use the cutting trick. We gradually remove edges so that the component containing $x_+$ has infinitely many vertices. Restricting to the component containing $x_+$ we obtain an electric network with bigger $\mathcal{E}_{x_+,\infty}$ according to Theorem 4.94 and Remark 4.95(iii). Thus if the graph $(\mathcal{X}, E)$ contains a connected subgraph $(\mathcal{X}_0, E_0)$ such that the random walk on $\mathcal{X}_0$ is transient, then the random walk on $(\mathcal{X}, E)$ is also transient. □
Each of the four vertices of this square is connected to $\Sigma_n$ through 3 edges. The interior of each of the four edges contains $n-2$ lattice points and each of them is connected to $\Sigma_n$ through 2 edges. Thus
$$ C(n) = 12 + 8(n-2) = 8n - 4, \quad \forall n\in\mathbb{N}. $$
Since
$$ \sum_{n\ge 1}\frac{1}{8n-4} = \infty, $$
the random walk on $\mathbb{Z}^2$ is recurrent.
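The count $C(n) = 8n - 4$ can be confirmed by brute force, assuming $\Sigma_n$ denotes the $\ell^1$-sphere $\{|x_1| + |x_2| = n\}$ in $\mathbb{Z}^2$ (an assumption, but the one consistent with the vertex and edge counts above):

```python
# Brute-force check of C(n) = 8n - 4 for the l^1-spheres of Z^2.
def C(n):
    """Number of nearest-neighbor edges joining the sphere of radius n-1
    to the sphere of radius n."""
    sphere_prev = [(x, y) for x in range(-n, n + 1) for y in range(-n, n + 1)
                   if abs(x) + abs(y) == n - 1]
    count = 0
    for (x, y) in sphere_prev:
        for dx, dy in [(1, 0), (-1, 0), (0, 1), (0, -1)]:
            if abs(x + dx) + abs(y + dy) == n:
                count += 1
    return count

assert all(C(n) == 8 * n - 4 for n in range(1, 30))
```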
Define $B_n$, $\Sigma_n$, $S_n^-$ as in the previous example. We assume that the tree is radially symmetric about the root, i.e., for any $n\in\mathbb{N}$ the vertices on the sphere $\Sigma_n$ have the same number $s_n$ of successors. Set
$$ \sigma_k := |\Sigma_k|. $$
Note that for any $k\ge 0$ we have
$$ \sigma_{k+1} = s_0 s_1\cdots s_k. $$
We want to investigate the unbiased random walk on this tree. Equivalently,
this means assigning conductance 1 to every edge. We want to solve the equation
$$ \Delta u(x) = \begin{cases} 0, & x\in B_n\setminus\{x_+\},\\[2pt] \frac{1}{d(x_+)}, & x = x_+. \end{cases} $$
For HMCs with finite state space the theory simplifies somewhat and new techniques become available. For convenience we will denote by $R_m$ the space of row vectors and by $C_m$ the space of column vectors. We will denote the row vectors using Greek letters and we will think of them as signed measures on $\mathcal{X}$. The matrix $Q$ acts on row vectors by right multiplication, $\mu\mapsto\mu\cdot Q$, and on column vectors by left multiplication, $v\mapsto Q\cdot v$.
A signed measure $\mu\in R_m$ is a probability measure if
$$ \mu_k\ge 0, \ \forall k\in I_m, \qquad \mu\cdot e = 1. $$
Let $\mathrm{Prob}_m\subset R_m$ denote the space of probability measures on $I_m$. We equip $R_m$ with the variation norm
$$ \|\alpha\|_v := \sum_{k=1}^m|\alpha_k|. $$
Observe that if $\mu, \nu\in\mathrm{Prob}_m$, then
$$ d_v(\mu, \nu) = \frac{1}{2}\|\mu - \nu\|_v. $$
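The identity $d_v(\mu,\nu) = \frac{1}{2}\|\mu - \nu\|_v$ translates into a one-line computation, which can also be compared against the event formulation $\max_A|\mu(A) - \nu(A)|$ by brute force. A sketch (the example measures are illustrative):

```python
import numpy as np
from itertools import combinations

def tv(mu, nu):
    # d_v(mu, nu) = (1/2) * ||mu - nu||_v for probability row vectors
    mu, nu = np.asarray(mu, float), np.asarray(nu, float)
    return 0.5 * np.abs(mu - nu).sum()

mu, nu = [0.2, 0.3, 0.5], [0.5, 0.25, 0.25]
# brute-force max_A |mu(A) - nu(A)| over all nonempty events A
events = max(abs(sum(mu[i] - nu[i] for i in S))
             for k in range(1, 4) for S in combinations(range(3), k))
assert abs(tv(mu, nu) - events) < 1e-12
```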
Note that a column vector
$$ z = \begin{bmatrix} z_1\\ \vdots\\ z_m \end{bmatrix}\in C_m $$
is an eigenvector of $Q^\top$ corresponding to an eigenvalue $\lambda\in\mathbb{C}$ if and only if the row vector $z^\top$ is a left eigenvector of $Q$, i.e.,
$$ z^\top\cdot Q = \lambda z^\top. $$
The matrix $Q$ and its transpose $Q^\top$ have the same eigenvalues. The vector $e$ is an eigenvector of $Q$ corresponding to the eigenvalue 1. We deduce that there exists a row vector $\alpha\in R_m$ such that
$$ \alpha\cdot Q = \alpha. $$
If α had nonnegative entries, then it would be an invariant measure for the HMC
defined by Q. The classical Perron-Frobenius theory explains when this is the case
and much more.
Observe that the HMC defined by $Q$ is irreducible if and only if
$$ \forall i, j\in\mathcal{X}, \ \exists\, n > 0 \ \text{such that} \ Q^n_{i,j} > 0. $$
Additionally, it is aperiodic if and only if $Q$ is primitive, i.e., there exists $n_0\in\mathbb{N}$ such that
$$ Q^n_{i,j} > 0, \quad \forall n > n_0, \ \forall\, 1\le i, j\le m. $$
For a proof of the following result we refer to [65, Chap. XIII] or [138, Chap. 8].
(i) All the eigenvalues of $Q^\top$ are contained in the unit disk.
(ii) If $Q$ is irreducible, then there exists $p\in\mathbb{N}$ such that
$$ \lambda\in\mathrm{Spec}(Q) \ \text{and} \ |\lambda| = 1 \iff \lambda^p = 1. $$
Moreover, every eigenvalue on the unit circle is simple.
(iii) The matrix $Q$ is primitive if and only if $p = 1$. In this case
$$ \rho := \max\big\{\, |\lambda|;\ \lambda\in\mathrm{Spec}(Q),\ \lambda\ne 1 \,\big\} < 1. \qquad \square $$
where
$$ \ker_r\big(\mathbb{1} - Q\big) = \big\{\, \alpha\in R_m;\ \alpha\cdot(\mathbb{1} - Q) = 0 \,\big\} = \mathrm{span}(\pi), \qquad \ker_r B(Q) := \big\{\, \alpha\in R_m;\ \alpha\cdot B(Q) = 0 \,\big\}. $$
Thus any $\alpha\in R_m$ admits a unique decomposition
$$ \alpha = \alpha_0 + \alpha^\perp, \quad \alpha_0\in\ker_r\big(\mathbb{1} - Q\big), \ \alpha^\perp\in\ker_r B(Q). $$
Indeed,
$$ \langle Qx, y\rangle_\pi = \sum_i\sum_j Q_{ij}x_j y_i\pi_i = \sum_j\sum_i\pi_j Q_{ji}x_j y_i = \sum_j\Big(\sum_i Q_{ji}y_i\Big)\pi_j x_j = \langle x, Qy\rangle_\pi. $$
In this case all the eigenvalues are real, the operator $Q$ is diagonalizable, and (4.5.1) improves to
$$ Q^n_{ij} - \pi_j = O\big(\rho^n\big). \tag{4.5.2} $$
In general finding or estimating the SLE (second largest eigenvalue) can be a daunting task. If some symmetry is present, the task is sometimes manageable.
If $Q$ denotes the transition matrix of this Markov chain, then for any $f\in L^2(G)$ we have
$$ Qf(x) = \sum_{x'\in G} Q_{x,x'}f(x') = \frac{1}{d}\sum_{k=1}^d\frac{f(x+e_k) + f(x-e_k)}{2}, \tag{4.5.3a} $$
$$ \Delta f(x) = f(x) - Qf(x) = -\frac{1}{d}\sum_{k=1}^d\frac{f(x+e_k) - 2f(x) + f(x-e_k)}{2}. \tag{4.5.3b} $$
One can verify that the induced operator Q : L2 (G) → L2 (G) is symmetric since Q
is reversible but we will not rely on this fact in this example.
To compute the eigenvalues of Q : L2 (G) → L2 (G) we use Fourier analysis. This
requires a little bit of representation theory and we will refer to [149] for the proofs
of all the claims below.
A character of $G$ is a group morphism
$$ \chi : G\to S^1 := \big\{\, z\in\mathbb{C};\ |z| = 1 \,\big\}. $$
The set $\widehat{G}$ of characters is a group itself with respect to the pointwise multiplication of characters. It is called the dual group.
Denote by $R_n$ the group of $n$-th roots of unity,
$$ R_n := \big\{\, z\in\mathbb{C}^*;\ z^n = 1 \,\big\}. $$
Observe that for any character $\chi$, the complex numbers $\chi(e_k)$ are $n$-th roots of 1. In fact, the map
$$ \rho : \widehat{G}\to R_n^d, \quad \widehat{G}\ni\chi\mapsto(\rho_1, \dots, \rho_d) = \big(\chi(e_1), \dots, \chi(e_d)\big)\in R_n^d, $$
is a group isomorphism.
The function
$$ \widehat{G}\ni\chi\mapsto\widehat{f}(\chi) := \langle f, \chi\rangle\in\mathbb{C} $$
is called the Fourier transform of $f$. More explicitly,
$$ \widehat{f}(\chi) = \frac{1}{|G|}\sum_{x\in G} f(x)\overline{\chi(x)}. $$
Thus
$$ Qf = Q\Big(\sum_\chi\widehat{f}(\chi)\chi\Big) = \sum_\chi m(\chi)\widehat{f}(\chi)\chi. $$
In other words, the orthonormal basis $\big\{\chi;\ \chi\in\widehat{G}\big\}$ diagonalizes $Q$ and
$$ \mathrm{Spec}(Q) = \big\{\, m(\chi);\ \chi\in\widehat{G} \,\big\}. $$
If we write
$$ \chi(e_k) = \rho_k = \cos\theta_k + i\sin\theta_k\in R_n, $$
then $\chi(e_k) + \chi(-e_k) = \rho_k + \bar{\rho}_k = 2\cos\theta_k$ and
$$ m(\chi) = \frac{1}{d}\sum_{k=1}^d\cos\theta_k, \qquad \theta_k\in\Big\{0, \frac{2\pi}{n}, \dots, \frac{2\pi(n-1)}{n}\Big\}. $$
Then
$$ \mu\cdot Q^N = \sum_\chi m(\chi)^N\,\widehat{\mu}(\chi)\,\chi. \qquad \square $$
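For $d = 1$ (the cycle $\mathbb{Z}/n\mathbb{Z}$) the spectrum $\{m(\chi)\} = \{\cos(2\pi j/n)\}$ can be compared with a direct numerical diagonalization; a sketch (the value $n = 7$ is arbitrary):

```python
import numpy as np

n = 7                                   # the cycle Z/nZ, i.e. d = 1
Q = np.zeros((n, n))
for i in range(n):
    Q[i, (i + 1) % n] = Q[i, (i - 1) % n] = 0.5

spec = np.sort(np.linalg.eigvalsh(Q))   # Q is symmetric here
predicted = np.sort([np.cos(2 * np.pi * j / n) for j in range(n)])
assert np.allclose(spec, predicted)     # Spec(Q) = { m(chi) }
```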
Example 4.103 (The Ehrenfest urn revisited). The random walk $(X_n)_{n\ge 0}$ on
$$ V_d := \{0, 1\}^d, $$
the set of vertices of the hypercube $[0,1]^d$, is intimately related to the Ehrenfest urn; see Example 4.7.
To see this, consider the states
$$ s_k := \big\{\, x = (x_1, \dots, x_d)\in V_d;\ |x| := x_1 + \cdots + x_d = k \,\big\}, \quad k = 0, 1, \dots, d. $$
If we think of the vertices $x\in V_d$ as vectors of bits 0/1, then the random walk has a simple description: if located at $x\in V_d$, pick a random component of $x$ and flip it to the opposite bit. Note that
$$ \mathbb{P}\big[X_{n+1}\in s_{k+1}\,\big|\,X_n\in s_k\big] = \frac{d-k}{d}, \qquad \mathbb{P}\big[X_{n+1}\in s_{k-1}\,\big|\,X_n\in s_k\big] = \frac{k}{d}. $$
We recognize here the transition rules for the Ehrenfest urn model with $d$ particles/balls. Thus, if on our walk along the vertices of the hypercube we only keep track of the state we are in, we obtain the Markov chain defined by Ehrenfest's urn model.
For concrete computations it is convenient to have an alternate description of
this phenomenon. Denote by Sd the group of permutations of {1, . . . , d}. There is
an obvious left action of Sd on Vd ,
$$ \varphi\cdot(x_1, \dots, x_d) = \big(x_{\varphi(1)}, \dots, x_{\varphi(d)}\big), \quad \forall\varphi\in S_d,\ (x_1, \dots, x_d)\in\{0, 1\}^d. $$
On the other hand, $V_d$ is equipped with a metric, the so-called Hamming distance,
$$ \delta(x, y) = \sum_{i=1}^d|x_i - y_i|, \quad x, y\in V_d. $$
Two vertices $x, y\in V_d$ are neighbors (connected by an edge of the cube) iff $\delta(x, y) = 1$. Since the above action of $S_d$ preserves the Hamming distance we deduce that $S_d$ acts by graph automorphisms, i.e.,
$$ \forall x, y\in V_d,\ \varphi\in S_d:\quad x\sim y \iff \varphi\cdot x\sim\varphi\cdot y. $$
Observe also that the states $s_k$, $k = 0, 1, \dots, d$, are the orbits of the above action of $S_d$. Thus, the state space of the Ehrenfest urn model can be identified with $\overline{V}_d := S_d\backslash V_d$, the space of orbits of the above left action. Denote by $\pi$ the invariant probability measure of the random walk on $V_d$ and by $\overline{\pi}$ the invariant measure of the Ehrenfest urn model,
$$ \overline{\pi}_k = \frac{1}{2^d}\binom{d}{k}. $$
If $\mathrm{Proj} : V_d\to S_d\backslash V_d$ is the natural projection, then
$$ \mathrm{Proj}_\#\pi = \overline{\pi}. $$
The left action of $S_d$ on $V_d$ induces a right action on the space $L^2(V_d, \pi)$,
$$ (f\cdot\varphi)(x) = f\big(\varphi\cdot x\big), \quad \forall f : V_d\to\mathbb{R},\ x\in V_d,\ \varphi\in S_d. $$
We denote by $L^2(V_d, \pi)^{S_d}$ the subspace consisting of invariant functions, i.e., functions constant along the orbits of $S_d$. The pullback
$$ \mathrm{Proj}^* : L^2\big(S_d\backslash V_d, \overline{\pi}\big)\to L^2\big(V_d, \pi\big), \quad f\mapsto f\circ\mathrm{Proj}, $$
intertwines the operator $\overline{Q}$ on $L^2\big(S_d\backslash V_d, \overline{\pi}\big)$ with $Q$, i.e., $Q\circ\mathrm{Proj}^* = \mathrm{Proj}^*\circ\overline{Q}$.
If $\lambda\in\mathrm{Spec}(Q)$ and $\chi\in\ker\big(\lambda - Q\big)$ is an eigenfunction of $Q$, then (4.5.9) implies that $\chi\cdot\varphi\in\ker\big(\lambda - Q\big)$, $\forall\varphi\in S_d$.
For every $\vec{\epsilon}\in\{-1, 1\}^d$ we set
$$ w(\vec{\epsilon}\,) = \#\big\{\, j;\ \epsilon_j = -1 \,\big\}. $$
Note that
$$ \sum_j\epsilon_j = d - 2w(\vec{\epsilon}\,), \qquad \lambda(\vec{\epsilon}\,) = 1 - \frac{2w(\vec{\epsilon}\,)}{d}. $$
If $\lambda_j = 1 - \frac{2j}{d}$, then
$$ \ker\big(\lambda_j - Q\big) = \mathrm{span}\big\{\, \chi_{\vec{\epsilon}}\,;\ w(\vec{\epsilon}\,) = j \,\big\}. $$
The orthogonal projection $\Pi$ onto $L^2(V_d, \pi)^{S_d}$ is the symmetrization operator
$$ L^2(V_d)\ni f\mapsto\Pi f = \frac{1}{d!}\sum_{\varphi\in S_d} f\cdot\varphi\in L^2(V_d, \pi)^{S_d}. $$
Since the action of $S_d$ permutes the characters, we deduce
$$ \Pi\chi_{\vec{\epsilon}} = \Pi\chi_{\varphi\cdot\vec{\epsilon}}, \quad \forall\varphi\in S_d. $$
Thus $\Pi\chi_{\vec{\epsilon}}$ depends only on $w(\vec{\epsilon}\,)$. We set
$$ \Psi_j := \Pi\chi_{\vec{\epsilon}}, \quad w(\vec{\epsilon}\,) = j. $$
Note that
$$ \Psi_j = \frac{1}{\binom{d}{j}}\sum_{w(\vec{\epsilon}\,)=j}\chi_{\vec{\epsilon}}. \tag{4.5.10} $$
Since the eigenfunctions $\chi_{\vec{\epsilon}}$ with fixed weight $w(\vec{\epsilon}\,) = j$ span the eigenspace of $Q$ corresponding to the eigenvalue $\lambda_j$, we deduce that
$$ \ker\big(\lambda_j - \overline{Q}\big) = \mathrm{span}\{\Psi_j\}, $$
so $\dim\ker\big(\lambda - \overline{Q}\big)\le 1$, $\forall\lambda\in\mathrm{Spec}(\overline{Q})\subset\mathrm{Spec}(Q)$; in particular, every eigenvalue of $\overline{Q}$ is simple. Recall that
$$ |x| = \sum_i x_i = \#\big\{\, i;\ x_i = 1 \,\big\}. $$
Observe that
$$ K(x, z) = \sum_{j=0}^d\Big(\sum_{w(\vec{\epsilon}\,)=j}\chi_{\vec{\epsilon}}(x)\Big)z^j \overset{(4.5.10)}{=} \sum_{j=0}^d\binom{d}{j}\Psi_j(x)\,z^j. $$
Thus
$$ (1-z)^{|x|}(1+z)^{d-|x|} = \sum_j\binom{d}{j}\Psi_j(x)\,z^j. \tag{4.5.12} $$
Integrating the equality $K(x, z)^2 = (1-z)^{2|x|}(1+z)^{2(d-|x|)}$ over $V_d$ with the uniform probability measure $\pi$ we deduce
$$ \int_{V_d} K(x, z)^2\,\pi(dx) = \frac{1}{2^d}\sum_{k=0}^d\binom{d}{k}(1-z)^{2k}(1+z)^{2(d-k)} = \Big(\frac{(1-z)^2 + (1+z)^2}{2}\Big)^d = (1+z^2)^d. $$
This shows that
$$ \big\|\Psi_j\big\|^2_{L^2(\pi)} = \frac{1}{\binom{d}{j}}. $$
We identify $L^2\big(\overline{V}_d, \overline{\pi}\big)$ with $\mathbb{R}^{1+d}$ via
$$ L^2\big(\overline{V}_d, \overline{\pi}\big)\ni f\mapsto\begin{bmatrix} f(0)\\ \vdots\\ f(d) \end{bmatrix}\in\mathbb{R}^{1+d}, $$
with the inner product
$$ \langle u, v\rangle_{\overline{\pi}} := \frac{1}{2^d}\big(Bu, v\big), $$
where $(-,-)$ denotes the canonical inner product on $\mathbb{R}^{d+1}$,
$$ (u, v) = \sum_{i=0}^d u_i v_i, $$
and $B$ is the diagonal matrix
$$ B = \mathrm{Diag}\Big(\binom{d}{0}, \dots, \binom{d}{d}\Big). $$
We denote by $c_{kj}$ the coefficient of $z^j$ in $(1-z)^k(1+z)^{d-k}$. If we think of the invariant eigenfunction $\Psi_j$ as a function on $\overline{V}_d$, $\Psi_j(k) := \Psi_j(x)$, $|x| = k$, then (4.5.12) shows that $\binom{d}{j}\Psi_j(k) = c_{kj}$, i.e.,
$$ \binom{d}{j}\Psi_j = \underbrace{\begin{bmatrix} c_{0j}\\ \vdots\\ c_{dj} \end{bmatrix}}_{=:C_j}. $$
Using the equality $\|\Psi_j\|^2_{L^2(\pi)} = 1/\binom{d}{j}$ we deduce
$$ \langle C_i, C_j\rangle_{\overline{\pi}} = \binom{d}{i}\delta_{ij}, \quad \forall i, j = 0, 1, \dots, d. $$
In other words,
$$ \frac{1}{2^d}\big(BCx, Cy\big) = \big(Bx, y\big), \quad \forall x, y\in\mathbb{R}^{d+1}. $$
Hence
$$ C^\top BC = 2^d B. \tag{4.5.13} $$
The matrix $C$ has another miraculous symmetry. To prove it we need to go back to the definition of the entries $c_{kj}$,
$$ (1-z)^k(1+z)^{d-k} = \sum_j c_{kj}z^j. $$
On one hand,
$$ \big((1+z) + u(1-z)\big)^d = \sum_k\binom{d}{k}(1-z)^k(1+z)^{d-k}u^k = \sum_{k,j}\binom{d}{k}c_{kj}\,z^j u^k. $$
On the other hand, the left-hand side is symmetric in $u$ and $z$, so it also equals
$$ \sum_j\binom{d}{j}(1-u)^j(1+u)^{d-j}z^j = \sum_{k,j}\binom{d}{j}c_{jk}\,z^j u^k. $$
Hence
$$ \binom{d}{k}c_{kj} = \binom{d}{j}c_{jk}, \quad \forall j, k. $$
This can be written in the more compact form
$$ (BC)_{kj} = (BC)_{jk} \iff BC = (BC)^\top = C^\top B. $$
Using this in (4.5.13) we deduce $BC^2 = 2^d B$, so that
$$ C^{-1} = \frac{1}{2^d}C. $$
Hence
$$ Q^n = C\Lambda^n C^{-1} = \frac{1}{2^d}C\Lambda^n C, \quad \forall n\ge 0. \tag{4.5.14} $$
The above formula was first obtained by M. Kac [84]. Since then, many different
proofs were offered [87; 88; 147]. For more about the rich history and the ubiquity
of the Ehrenfest urn we refer to [13; 147]. As a curiosity, we want to mention that
the spectrum of Q was known to J. J. Sylvester in the 19th century.
One can use (4.5.14) to obtain important information about the dynamics of the Ehrenfest urn, such as the return or first passage times $T_i$, $i = 0, 1, \dots, d$. We refer to [13; 84; 87; 88] for more details.
The above “miraculous” properties of the matrix C are manifestations of the
remarkable symmetries of the Krawtchouk polynomials. We refer to [43; 44] for
more about these polynomials and their applications in probability. □
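The algebra above can be verified numerically: build $C$ from the coefficients $c_{kj}$, check $C^2 = 2^d\mathbb{1}$, confirm that the columns of $C$ diagonalize the Ehrenfest transition matrix, and recover Kac's formula (4.5.14). A sketch (the choice $d = 6$ is arbitrary):

```python
import numpy as np
from math import comb

d = 6
# c_{kj} = coefficient of z^j in (1 - z)^k (1 + z)^(d - k)
C = np.array([[sum((-1)**i * comb(k, i) * comb(d - k, j - i)
                   for i in range(0, j + 1))
               for j in range(d + 1)]
              for k in range(d + 1)], dtype=float)

# Ehrenfest transition matrix on {0, ..., d}
Q = np.zeros((d + 1, d + 1))
for k in range(d + 1):
    if k > 0:
        Q[k, k - 1] = k / d
    if k < d:
        Q[k, k + 1] = (d - k) / d

Lam = np.diag([1 - 2 * j / d for j in range(d + 1)])
assert np.allclose(C @ C, 2**d * np.eye(d + 1))      # C^{-1} = C / 2^d
assert np.allclose(Q @ C, C @ Lam)                   # columns diagonalize Q
n = 5                                                # Kac's formula (4.5.14)
assert np.allclose(np.linalg.matrix_power(Q, n),
                   C @ np.linalg.matrix_power(Lam, n) @ C / 2**d)
```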
Lemma 4.104. For all $u, v\in L^2(\mathcal{X}, \pi)$,
$$ \mathcal{E}(u, v) = \frac{1}{2}\sum_{x,y\in\mathcal{X}}\pi_x Q_{x,y}\big(u(x) - u(y)\big)\big(v(x) - v(y)\big). $$
Proof.
$$ \sum_{x,y\in\mathcal{X}}\pi_x Q_{x,y}\big(u(x) - u(y)\big)\big(v(x) - v(y)\big) = \underbrace{\sum_{x,y\in\mathcal{X}} Q_{x,y}\big(u(x) - u(y)\big)v(x)\pi_x}_{=:A} - \underbrace{\sum_{x,y\in\mathcal{X}}\pi_x Q_{x,y}\big(u(x) - u(y)\big)v(y)}_{=:B}. $$
Note that
$$ A = \sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{X}} Q_{x,y}\big(u(x) - u(y)\big)v(x)\pi_x = \sum_{x\in\mathcal{X}}\big(u(x) - (Qu)(x)\big)v(x)\pi_x = \langle\Delta u, v\rangle_\pi. $$
Let us observe that the reversible Markov chain is defined by an electric network with conductances $c(x, y)$, where
$$ c(x, y) := \pi_x Q_{x,y}. $$
Then $\forall u, v\in L^2(\mathcal{X}, \pi)$,
$$ \mathcal{E}\big(u, v\big) = \frac{1}{2}\sum_{x,y\in\mathcal{X}} c(x, y)\big(u(x) - u(y)\big)\big(v(x) - v(y)\big) = \langle du, dv\rangle_c, $$
where $\langle -, -\rangle_c$ is the inner product (4.4.9) on 1-cochains and $d$ is the coboundary operator (4.4.8).
The classical Ritz-Rayleigh description of the eigenvalues of a symmetric operator shows that
$$ \mu_2 := \inf\big\{\,\mathcal{E}(u, u);\ \|u\|_\pi = 1,\ \langle u, e\rangle_\pi = 0 \,\big\}. $$
Thus a Poincaré type inequality $\mu_2\ge m$ is equivalent to
$$ \frac{1}{2}\sum_{x,y} c(x, y)\big(u(x) - u(y)\big)^2 \ge m\sum_x c(x)u(x)^2 \quad\text{whenever}\quad \sum_{x\in\mathcal{X}} u(x)c(x) = 0. $$
x∈X
To state our first Poincaré type inequality we need a few geometric preliminaries.
To our reversible Markov chain we associate a graph G with vertex set X . Two
vertices x, y are connected by an edge iff Q(x, y) 6= 0. We write x ∼ y if x and y
are connected by an edge in G. This graph could have loops. It is connected since
the Markov chain is irreducible. We set
b := (x, y) ∈ X × X ; x ∼ y .
E (4.5.15)
We think of the elements of E
b as edges of G equipped with an orientation. For any
u : X → R and e = (x , x ) ∈ E
0 00 b we set
δe u := u(x00 ) − u(x0 ).
We can speak of the conductance $c(e)$ of any oriented edge $e = (x, y)$,
$$ c(e) := c(x, y) = \pi_x Q_{x,y}. $$
Note that
$$ \mathcal{E}(u, u) = \frac{1}{2}\sum_{e\in\widehat{E}} c(e)(\delta_e u)^2. \tag{4.5.16} $$
A path in $G$ between two vertices $x, y$ is a succession of vertices
$$ \gamma : x = x_0\sim x_1\sim\cdots\sim x_{\ell-1}\sim x_\ell = y, $$
where we do not allow repeated edges. The number $\ell$ is called the length of $\gamma$ and is denoted by $\ell(\gamma)$. The path $\gamma$ determines a collection of oriented edges
$$ e_i = (x_{i-1}, x_i), \quad i = 1, \dots, \ell. $$
We will use the notation e ∈ γ to indicate that e is one of the oriented edges
determined by γ.
We denote by $\Gamma$ the collection of paths in $G$. It comes with an obvious equivalence relation: two paths are equivalent if they have the same initial and final points. Fix a collection $\mathcal{C}$ of representatives of this equivalence relation. Thus, $\mathcal{C}$ contains exactly one path $\gamma_{x,y}$ for every pair $(x, y)$ of vertices, and this path connects $x$ to $y$.
Following [46] we set
$$ K(\mathcal{C}) := \sup_{e\in\widehat{E}} K(\mathcal{C}, e), \qquad K(\mathcal{C}, e) := \frac{1}{c(e)}\sum_{\mathcal{C}\ni\gamma_{x,y}\ni e}\ell(\gamma_{x,y})\,\pi_x\pi_y. \tag{4.5.17} $$
If an oriented edge $e$ is not contained in any path $\gamma\in\mathcal{C}$ we set $K(\mathcal{C}, e) = 0$.
Proof. We follow the approach in the proof of [46, Proposition 1]. Set $K = K(\mathcal{C})$. Let $u\in L^2(\mathcal{X}, \pi)$. For any $x, y\in\mathcal{X}$ we have the telescoping equality
$$ u(y) - u(x) = \sum_{e\in\gamma_{x,y}}\delta_e u. $$
Using the Cauchy-Schwarz inequality we deduce
$$ \big(u(y) - u(x)\big)^2 = \Big(\sum_{e\in\gamma_{x,y}}\delta_e u\Big)^2 \le \ell(\gamma_{x,y})\sum_{e\in\gamma_{x,y}}(\delta_e u)^2. $$
Now observe that
$$ \mathrm{Var}_\pi\big[u\big] = \frac{1}{2}\sum_{x,y}\big(u(y) - u(x)\big)^2\pi_x\pi_y \le \frac{1}{2}\sum_{x,y}\ell(\gamma_{x,y})\,\pi_x\pi_y\sum_{e\in\gamma_{x,y}}(\delta_e u)^2 $$
$$ = \frac{1}{2}\sum_{e\in\widehat{E}}(\delta_e u)^2\sum_{\gamma_{x,y}\ni e}\ell(\gamma_{x,y})\,\pi_x\pi_y = \frac{1}{2}\sum_{e\in\widehat{E}} c(e)(\delta_e u)^2\underbrace{\frac{1}{c(e)}\sum_{\gamma_{x,y}\ni e}\ell(\gamma_{x,y})\,\pi_x\pi_y}_{\le K} $$
$$ \le \frac{K}{2}\sum_{e\in\widehat{E}} c(e)(\delta_e u)^2 = K\,\mathcal{E}\big(u, u\big). \tag{4.5.18} \qquad \square $$
Example 4.106. Suppose that our Markov chain corresponds to the random walk
on the Cayley graph of the cyclic group Z/nZ, n odd; see Example 4.102. Equiva-
lently, it is the random walk on the set
X = {xi }i∈Z/nZ
of vertices of a regular n-gon, where at each vertex we are equally likely to move to
one of its two neighbors. In this case we have
$$ \pi_x = \frac{1}{n}, \qquad Q_{x_i,x_{i+1}} = Q_{x_i,x_{i-1}} = \frac{1}{2}, \quad \forall i\in\mathbb{Z}/n\mathbb{Z}, $$
$$ c(x_i, x_j) = \frac{1}{2n}\times\begin{cases} 1, & i = j\pm 1,\\ 0, & \text{otherwise}. \end{cases} $$
As collection $\mathcal{C}$, we choose the geodesics (shortest paths) connecting pairs of vertices. Since $n$ is odd, for every $x, y\in\mathcal{X}$ there exists a unique such geodesic $\gamma_{x,y}$ and it has length $< \frac{n}{2}$. Due to the symmetry of the graph the quantity
$$ K(e) := \frac{1}{c(e)}\sum_{\gamma_{x,y}\ni e}\ell(\gamma_{x,y})\,\pi_x\pi_y = \frac{2}{n}\sum_{\gamma_{x,y}\ni e}\ell(\gamma_{x,y}) $$
is independent of $e$, so
$$ K(\mathcal{C}) = K(e), \quad \forall e\in\widehat{E}. $$
Averaging over the $n$ edges of the graph we deduce
$$ K(\mathcal{C}) = \frac{1}{n}\sum_e K(e) = \frac{2}{n^2}\sum_e\sum_{\gamma_{x,y}\ni e}\ell(\gamma_{x,y}) = \frac{2}{n^2}\sum_{x,y}\sum_{e\in\gamma_{x,y}}\ell(\gamma_{x,y}) = \frac{2}{n^2}\sum_{x,y}\ell(\gamma_{x,y})^2 $$
($n = 2m+1$)
$$ = \frac{2}{n}\sum_{i=1}^n\ell(\gamma_{x_1,x_i})^2 = \frac{4}{n}\sum_{i=1}^m i^2 = \frac{n^2}{6} + O(n), \quad\text{as } n\to\infty. $$
Hence
$$ \lambda_2 \le 1 - \frac{6}{n^2} + O\big(n^{-3}\big), \quad\text{as } n\to\infty. $$
Thus, for large $n$ this estimate is of the same order as the precise estimate (4.5.6) with $d = 1$. □
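For the odd cycle, the constant $K(\mathcal{C})$ and the resulting bound can be computed by brute force; by symmetry it suffices to examine one edge. A sketch (paths are counted for ordered pairs whose geodesic crosses the unoriented edge $\{0, 1\}$, which matches the $n^2/6$ normalization of the computation above):

```python
import numpy as np

n = 15                                    # an odd cycle Z/nZ
pi = 1.0 / n                              # stationary measure
c_e = 1.0 / (2 * n)                       # c(x_i, x_{i+1}) = pi * Q = 1/(2n)
dist = lambda i, j: min((i - j) % n, (j - i) % n)   # geodesic distance

def crosses(i, j):
    # geodesic between i and j passes through the edge {0, 1}
    return (dist(i, j) == dist(i, 0) + 1 + dist(1, j)
            or dist(i, j) == dist(i, 1) + 1 + dist(0, j))

# K(C, e) = (1/c(e)) * sum of l(gamma_{x,y}) pi_x pi_y over paths through e
K = sum(dist(i, j) * pi * pi
        for i in range(n) for j in range(n)
        if i != j and crosses(i, j)) / c_e

lam2 = np.cos(2 * np.pi / n)              # exact second eigenvalue of Q
assert abs(K - n**2 / 6) < n              # K(C) = n^2/6 + O(n)
assert lam2 <= 1 - 1 / K                  # Poincare bound lambda_2 <= 1 - 1/K
```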
We want to describe another geometric estimate for µ2 of the type first described
in Riemannian geometry by J. Cheeger [27].
The volume of a set $S\subset\mathcal{X}$ is computed using the stationary measure $\pi$,
$$ V(S, Q) := \pi\big[S\big] = \sum_{s\in S}\pi_s. $$
then
$$ \frac{A(\partial\widetilde{S}, \widetilde{Q})}{V(\widetilde{S}, \widetilde{Q})} = \frac{A(\partial S, Q)}{V(S, Q)}. \qquad \square $$
To get a feeling of the meaning of h(Q) suppose that Q corresponds to the unbiased
random walk on a connected graph G with vertex set X . For any S ⊂ X , the
area A(∂S) is, up to a multiplicative constant, the number of edges connecting a
vertex in S with a vertex outside S. The volume V (S) is, up to a multiplicative
constant the sum of degrees of vertices in S, or equivalently, V (S) − A(∂S) is twice
the number of edges with both endpoints in $S$. Thus, a "large" $h(Q)$ signifies that, for any subset $S$ of $\mathcal{X}$, a large fraction of the edges with at least one endpoint in $S$ have the other endpoint outside $S$.
As an example of a graph with small $h$, think of a "bottleneck", i.e., a graph obtained by connecting with a single edge two disjoint copies of a complete graph.
Various versions of Cheeger’s isoperimetric constant of a (connected) graph play
a key role in the definition of expander families of graphs, [96; 108]. It was in
that context that the connection with random walks on graphs was discovered. For
general reversible Markov chains we have the following result due to Jerrum and
Sinclair [83].
Theorem 4.108. Let Q denote the transition matrix of a reversible Markov chain
with finite state space X . Then
$$ \mu_2 \ge \frac{h(Q)^2}{2}. $$
In particular,
$$ \lambda_2 \le 1 - \frac{h(Q)^2}{2}. $$
Proof. We follow the presentation in [46]. Let $u\in L^2(\mathcal{X}, \pi)$. Set $u_+ := \max(u, 0)$. We set
$$ S_u := \{\, u > 0 \,\}\subset\mathcal{X}, \qquad h(u) = \inf_{S\subset S_u}\frac{A(\partial S, Q)}{V(S)}. $$
Lemma 4.109. If $u\in L^2(\mathcal{X}, \pi)$ and $u_+\ne 0$, then
$$ \mathcal{E}\big(u_+, u_+\big) \ge \frac{h(u)^2}{2}\,\|u_+\|_\pi^2. \tag{4.5.20} $$
Proof. We can assume without any loss of generality that $u = u_+$. Then
$$ 2\sum_{u(x)<u(y)}\big(u(y)^2 - u(x)^2\big)c(x, y) = \sum_{x,y}\big|u(x)^2 - u(y)^2\big|\,c(x, y) $$
$$ \le \Big(\underbrace{\sum_{x,y}\big(u(x) - u(y)\big)^2 c(x, y)}_{=2\mathcal{E}(u,u)}\Big)^{1/2}\Big(\sum_{x,y}\underbrace{\big(u(x) + u(y)\big)^2}_{\le 2(u(x)^2 + u(y)^2)} c(x, y)\Big)^{1/2} $$
$$ \le \big(2\mathcal{E}(u, u)\big)^{1/2}\Big(\underbrace{2\sum_{x,y}\big(u(x)^2 + u(y)^2\big)c(x, y)}_{=4\|u\|_\pi^2}\Big)^{1/2} = 2^{3/2}\,\mathcal{E}(u, u)^{1/2}\,\|u\|_\pi. $$
We deduce

    2^{3/2} E(u, u)^{1/2} ‖u‖_π ≥ 2 ∑_{u(x)<u(y)} ( u(y)² − u(x)² ) c(x, y)

    = 4 ∑_{x,y} ( ∫_{u(x)}^{u(y)} t dt ) c(x, y) = 4 ∫_0^∞ t ( ∑_{u(x)≤t<u(y)} c(x, y) ) dt.
We deduce

    ∫_0^∞ t ( ∑_{u(x)≤t<u(y)} c(x, y) ) dt ≥ h(u) ∫_0^∞ t π( u > t ) dt = ( h(u) / 2 ) ‖u‖²_π,

where at the last step we used (1.3.43), and thus (4.5.20) follows. □

Observe next that

    µ > 0, ∆u ≤ µu on {u > 0}  ⇒  µ‖u₊‖²_π ≥ E( u₊, u₊ ).    (4.5.21)

Indeed,

    µ‖u₊‖²_π ≥ ⟨u₊, ∆u⟩_π = E( u₊, u ) ≥ E( u₊, u₊ ).

Combining (4.5.20) and (4.5.21) we deduce that µ ≥ h(u)²/2 if ∆u ≤ µu on
{u > 0} ≠ ∅.
Suppose now that u is a nontrivial eigenfunction corresponding to the eigenvalue
µ₂ of ∆. Since

    ∑_x u(x) π_x = 0
The quantity h(Q) is rather difficult to compute but lower estimates are easier
to obtain. Consider a collection C of paths in G as in the definition (4.5.17). We
set

    κ(C) = sup_{e∈E} κ(C, e),  κ(C, e) = (1/c(e)) ∑_{C∋γ_{x,y}∋e} π_x π_y.
Clearly

    W(S) = π(S) π(Sᶜ) ≥ (1/2) π(S) = (1/2) V(S).

On the other hand,

    W(S) ≤ ∑_{e∈∂S} ∑_{γ_{x,y}∋e} π_x π_y = ∑_{e∈∂S} c(e) κ(e) ≤ κ ∑_{e∈∂S} c(e) = κ A(∂S),

and we deduce

    κ A(∂S) ≥ (1/2) V(S). □
For all intents and purposes, the normalizing constant Z is not effectively avail-
able to us. Still, we would like to produce an X -valued random variable with
distribution π.
The theory of Markov chains will allow us to produce, for any given ε > 0, an
X -valued random variable with distribution ν within ε (in total variation
distance) from the desired but unknowable distribution π.
The Metropolis algorithm will allow us to achieve this. The input of the algo-
rithm is a pair (G, w), where G is a graph with vertex set X and w is a weight on its set of
vertices, i.e., a function w : X → (0, ∞) such that

    ∑_{x∈X} w(x) < ∞.
The graph G is called the candidate graph. Often the candidate graph is suggested
by the problem at hand.
A good example to have in mind is the set X of Internet nodes, which we want
to sample uniformly. In this case the weight w is a constant
function. To simplify the presentation we assume that the graph is connected and
the standard random walk on it is primitive.
The output of the algorithm is the transition matrix Q of a reversible, irreducible
and aperiodic Markov chain with state space X and whose equilibrium probability
π is proportional to w. We will refer to this Markov chain as the Metropolis chain
with candidate graph G and equilibrium distribution π. If we run this Markov chain
starting from an initial vertex x0 ∈ X , then for n sufficiently large, the state Xn
reached after n steps will have a distribution close to π.
The transitions of this Markov chain are described by an acceptance-rejection
strategy based on the standard random walk on the graph G. More precisely, the
transitions from a vertex x to one of its neighbors follows these rules.
(i) Pick one of the neighbors y of x equally likely among its d(x) neighbors. (This
is what we would do if we were to perform a standard random walk on G.)
This is the acceptance part.
(ii) The transition to y is decided by a comparison between the weight w(y) at
y and the weight w(x) at x. More precisely, we accept the move to y with
probability min( 1, (w(y)/d(y)) / (w(x)/d(x)) ). Otherwise we reject the move and stay put at
x. This is the rejection part.
Above, N (x) denotes the set of neighbors of x in the candidate graph. Let us show
that w(x) Q_{x,y} = w(y) Q_{y,x} for all neighbors x, y. Indeed,

    w(x) Q_{x,y} = (w(x)/d(x)) min( 1, (w(y)/d(y)) / (w(x)/d(x)) ) = min( w(x)/d(x), w(y)/d(y) )

    = (w(y)/d(y)) min( 1, (w(x)/d(x)) / (w(y)/d(y)) ) = w(y) Q_{y,x}.
If the random walk on the candidate graph G is primitive, then so is the Metropolis
chain. If not, we replace the Metropolis chain with its lazy version; see Remark 4.69.
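As a sanity check, one can assemble the Metropolis transition matrix for a small hypothetical candidate graph and verify detailed balance exactly; the graph and weights below are invented for illustration, and exact rational arithmetic avoids any rounding issues.

```python
from fractions import Fraction

def metropolis_matrix(neighbors, w):
    """Transition matrix of the Metropolis chain: propose a uniform neighbor,
    accept with probability min(1, (w(y)/d(y)) / (w(x)/d(x)))."""
    Q = {}
    for x, nbrs in neighbors.items():
        stay = Fraction(0)
        for y in nbrs:
            ratio = Fraction(w[y], len(neighbors[y])) / Fraction(w[x], len(nbrs))
            accept = min(Fraction(1), ratio)
            Q[x, y] = Fraction(1, len(nbrs)) * accept
            stay += Fraction(1, len(nbrs)) * (1 - accept)
        Q[x, x] = stay  # rejected proposals keep the chain at x
    return Q

# A 4-cycle as candidate graph, with unequal target weights.
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
w = {0: 1, 1: 2, 2: 3, 3: 4}
Q = metropolis_matrix(neighbors, w)

# Detailed balance w(x) Q_{x,y} = w(y) Q_{y,x}: the chain is reversible and
# its equilibrium distribution is proportional to w.
for x in w:
    for y in neighbors[x]:
        assert w[x] * Q[x, y] == w[y] * Q[y, x]
```

Reversibility plus stochasticity of the rows immediately gives that π ∝ w is stationary.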
We refer to [77; 141] for applications of this algorithm to combinatorics. In
general, it is difficult to estimate the SLE or the rate of convergence of the Metropolis
chain, but in practice it works well.
Example 4.111. A few years ago (2016–17) I asked Mike McCaffrey, at that time
a student writing his senior thesis under my supervision, to read Diaconis’ excel-
lent survey [42] and then to try to implement numerically the decryption strategy
described in that paper, based on the Metropolis algorithm. I want to report some
of McCaffrey’s nice findings. For more details I refer to his senior thesis [116].
Let me first outline the encryption problem and the decryption strategy proposed
in [42]. The encryption method is a simple substitution cipher. Scramble the 26
letters of the English alphabet E. The encryption is captured by a permutation ϕ
of the set E, or equivalently, an element ϕ of the symmetric group S₂₆.
The decryption problem asks to determine the decoding permutation ϕ−1 given
a text encoded by the (unknown) permutation ϕ. Thus, we need to find one element
in a set of 26! elements. To appreciate how large 26! is, it helps to have in mind that
a pile of 26! grains of sand will cover the continental United States with a layer of
sand 0.6 miles (approx. 1 kilometer) thick. We are supposed to find a single grain
of sand in this huge pile. Needle in a haystack sounds optimistic!
The strategy outlined in [42] goes as follows. There are 262 pairs of letters in
the English alphabet E, and they appear as adjacent letters in English texts with
a certain frequency. E.g., one would encounter quite frequently the pair “th”, less
so pairs such as “tt” or “tw”. We denote by f (s₁, s₂) the frequency of the pair
of letters (s1 , s2 ). More precisely f (s1 , s2 ) is the conditional probability that in an
English text the letter s₁ is followed by s₂. To any text of length n, viewed as a string
of n letters, x = x₁ . . . xₙ, we associate the weight

    w(x) := ∏_{i=2}^{n} f( x_{i−1}, x_i ).
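In practice one works with the logarithm of w(x), since a product of thousands of frequencies underflows. A toy sketch follows; the bigram table is invented for illustration, while real values are estimated from a reference corpus as in [42], and the smoothing floor for unseen pairs is an assumption.

```python
import math

# Invented toy frequencies; a real table f(s1, s2) is estimated from a corpus.
f = {("T", "H"): 0.30, ("H", "E"): 0.25, ("T", "T"): 0.02, ("T", "W"): 0.01}
FLOOR = 1e-4  # smoothing for pairs absent from the toy table (an assumption)

def log_weight(text):
    """log w(x) = sum_{i=2}^{n} log f(x_{i-1}, x_i)."""
    return sum(math.log(f.get(pair, FLOOR)) for pair in zip(text, text[1:]))

print(log_weight("THE"))  # common pairs: comparatively large log-weight
print(log_weight("TWT"))  # rare or unseen pairs: much smaller log-weight
```

The Metropolis chain on S₂₆ then compares the (log-)weights of the texts decoded by the current and the proposed permutations.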
We expect the weight w_true of the decoded text to be a lot higher than the weight of
the encoded text. In the above example, the weight of the original text is 2.6 × 10¹¹⁵
times higher than that of the cyphered text!
After 3, 000 steps in the random walk governed by the above Metropolis algo-
rithm, the output was close to the original text:
THE PROLALINITY THAT WE MAY FAIN ID THE STRUGGNE OUGHT
DOT TO KETER US FROM THE SUPPORT OF A JAUSE WE LENIEVE
TO LE BUST
5 He actually used an alphabet consisting of 27 symbols, the 26 letters of the alphabet and a 27-th
Mike then tested this algorithm on a bigger text. He chose the easily recognizable
Gettysburg address by Abraham Lincoln.
The most vivid confirmation of the power of this method came when he presented
his results to a mixed group of students in the College of Science of the University
of Notre Dame. He began his presentation by projecting the ciphered Gettysburg
address, but the audience was left in the dark about the nature of original text.
While Mike was describing the problem and the decoding strategy, his laptop was
running the algorithm in the background and every few seconds the text on the
screen would scramble, revealing a new text resembling more and more an English
text. Ten minutes or so into his presentation the audience was able to recognize
without difficulty the Gettysburg address. It took about 120 steps in the Metropolis
random walk to reach an easily recognizable albeit misspelled text! □
4.6 Exercises
Exercise 4.1. Consider the construction in Remark 4.5 of an HMC with initial
distribution µ and transition matrix Q as a sequence of random variables defined
on [0, 1) equipped with the Lebesgue measure λ. For every t ∈ [0, 1) there exists
x = x(t) ∈ X N0
uniquely determined by

    t ∈ ⋂_{n≥0} I_{x₀,...,xₙ}.
(i) Prove that the resulting map Ψ : [0, 1) → X N0 given by t 7→ x(t) is measurable
and Ψ# λ = Pµ . Hint. Use the π-λ theorem.
(ii) Prove that the map Ψ is injective and its image is shift-invariant and has
Pµ -negligible complement.
(iii) Describe the map t ↦ x(t) when X = {0, 1}, µ(0) = µ(1) = 1/2 and

        [ 1/2  1/2 ]
    Q = [ 1/2  1/2 ] .

Describe explicitly the random variables

    Xn : [0, 1) → R,  Xn(t) = xn(t),  where x(t) = ( x₀(t), x₁(t), . . . ) ∈ {0, 1}^{N₀}. □
Exercise 4.2. Two people A, B play the following game. Two dice are tossed. If
the sum of the numbers showing is less than 7, A collects a dollar from B. If the
total is greater than 7, then B collects a dollar from A. If a 7 appears, then the
person with the fewest dollars collects a dollar from the other. If the persons have
the same amount, then no dollars are exchanged. The game continues until one
person runs out of dollars. Let A’s number of dollars represent the states. We know
that each person starts with 3 dollars.
(i) Show that the evolution of A is governed by a Markov chain. Describe its
transition matrix.
(ii) If A reaches 0 or 6, then he stays there with probability 1. What is the
probability that B loses in 3 tosses of the dice?
(iii) What is the probability that A loses in 5 or fewer tosses?
□
Exercise 4.5. Suppose that (Yn)n≥0 is a sequence of i.i.d., N₀-valued random vari-
ables with common probability generating function

    G(s) = ∑_{k≥0} p_k s^k,  p_k := P( Yn = k ),  ∀k, n ∈ N₀.

Let Xn be the amount of water in a reservoir at noon on day n. During the 24 hour
period beginning at this hour a quantity Yn flows into the reservoir, and just before
noon a quantity of one unit of water is removed, if this amount is available.
The maximum capacity of the reservoir is K; excessive inflows are spilled and lost.
Show that (Xn)n≥0 is an HMC, and describe the transition matrix and its stationary
distribution in terms of G. □
Exercise 4.6. Denote by Xn the capital of a gambler at the end of the n-th game.
He relies on the following gambling strategy. If his fortune is ≥ $4 he gambles
$2, expecting to win $4, $3, $2 with respective probabilities 0.25, 0.30, 0.45. If
his capital is 1, 2 or 3 dollars he bets $1, expecting to earn $2 or $0 with
probabilities 0.45 and 0.55, respectively. When his fortune is 0 he stops
gambling.
(i) Show that (Xn)n≥0 is a homogeneous Markov chain, compute its transition
probabilities and classify its states.
(ii) Set

    T := inf{ n ∈ N; Xn = 0 }.

Show that P( T < ∞ ) = 1.
(iii) Compute E[ T ]. □
Exercise 4.7. Suppose that (Xn)n≥1 is a sequence of nonnegative i.i.d., contin-
uously distributed random variables. Consider the sequence of records (R_k)_{k∈N}
defined inductively by the rule

    R₁ = 1,  R_{k+1} = inf{ n > R_k ; Xn > max( X₁, . . . , X_{n−1} ) }.

Show that the sequence (R_k) is a Markov chain with state space N and then
compute its transition probabilities. Is this a homogeneous chain? □
(i) Show that (Xn)n≥0 is a homogeneous Markov chain with transition probabili-
ties

    P( X_{n+1} = k | Xn = j ) = q_{k−j} if k ≥ j, and 0 if k < j,

where

    q_j = ∫_0^∞ e^{−λz} ( (λz)^j / j! ) P_Z( dz ).

Hint. Use Exercise 1.47.
(ii) Set µ = E[Z], r := λµ. Prove that the above chain is positively recurrent if
and only if r < 1.
(iii) Assume that r < 1 and c₂ := E[Z²] < ∞. Prove that

    lim_{n→∞} E[ Xn ] = r + λ²c₂ / ( 2(1 − r) ). □
Exercise 4.9. Suppose that (Xn )n∈N0 is an irreducible HMC with state space X
and transition matrix Q. Prove that the following statements are equivalent.
□
Exercise 4.10. Suppose that (Xn)n≥0 is an irreducible Markov chain with state
space X , transition probability matrix Q and x0 ∈ X .
Prove that

    τ_x = ∑_{y≠x0} Q_{x,y} τ_y,  ∀x ∈ X \ {x0}.    (4.6.1)
(ii) Show that if x0 is transient, then there exists x ∈ X \ {x0} such that τ_x ≠ 0.
(iii) Suppose there exists a function α : X \ {x0 } → [−1, 1], not identically zero,
satisfying (4.6.1). Prove that x0 is transient.
Exercise 4.11. Suppose that (Xn)n≥0 is a transient irreducible Markov chain with
state space X . Prove that, with probability 1, the chain will exit any finite subset
F ⊂ X , never to return, i.e.,

    P[ lim_{n→∞} I_F( Xn ) = 0 ] = 1. □
Exercise 4.12. Bobby’s business fluctuates in successive years between three
states: 0 = bankruptcy, 1 = verge of bankruptcy, 2 = solvency. The
transition matrix giving the probability of evolving from state to state is

        [ 1    0     0    ]
    Q = [ 0.5  0.25  0.25 ] .
        [ 0.5  0.25  0.25 ]

(i) What is the expected number of years until Bobby’s business goes bankrupt,
assuming it starts in solvency?
(ii) Bobby’s rich father decides that it is bad for the family name if his son goes
bankrupt. Thus, when state 0 is entered, his father infuses Bobby’s business
with cash, returning him to solvency with probability 1. The transition
matrix for this Markov chain is

        [ 0    0     1    ]
    P = [ 0.5  0.25  0.25 ] .
        [ 0.5  0.25  0.25 ]

Show that this Markov chain is irreducible and aperiodic, and find the expected num-
ber of years between cash infusions from his father. □
(i) Prove that for any r ∈ (0, 1) the Markov chain defined by the stochastic matrix
Q(r) = (1 − r)Q + rC is irreducible and aperiodic. Denote by πr the unique
stationary probability measure.
(ii) Prove that πr converges as r → 0 to a stationary probability measure π0 of the
HMC defined by Q.
(iii) Describe π0 in the special case when the HMC determined by Q consists of
exactly two communication classes C1 and C2 and there exist xi ∈ Ci , i = 1, 2
such that Qx1 ,x2 > 0.
□
Exercise 4.14. The random walk of a chess piece on a chessboard is governed by the
rule: the feasible moves are equally likely. Suppose that a rook and a bishop start
at the same corner of a 4 × 4 chessboard and perform these random walks. Denote
by T the time they meet again at the same corner. Find E[ T ]. □
Exercise 4.15. Consider the HMC with state space X = {0, 1, 2, . . . } and transi-
tion matrix Q defined by

    Q_{n,n+k} = \binom{n}{k} / 2^{n+1},  ∀0 ≤ k ≤ n, n ≥ 1,

    Q_{0,1} = 1,  Q_{n,0} = 1/2,  ∀n ≥ 1.

Prove that the chain is irreducible, positively recurrent and aperiodic, and find
E₀[ T₀ ]. □
Exercise 4.16. Let Kn+1 denote the complete graph with n + 1 vertices
v0, v1, . . . , vn. Denote by (Xn)n≥0 the random walk on Kn+1 with transition rules

    Q_{v_i,v_j} = 1/n,  ∀i > 0, j ≥ 0, j ≠ i,  Q_{v_0,v_i} = 0,  Q_{v_0,v_0} = 1.

Thus the vertex v0 is absorbing. For i > 0 we denote by Hi the time to reach the
vertex v0 starting at vi,

    Hi := min{ j ≥ 0 : X0 = v_i, X_j = v_0 }.

Prove that E[ Hi ] = n, ∀i > 0. □
Exercise 4.18. We generate a sequence Bn of bits, i.e., 0’s and 1’s, as follows. The
first two bits are chosen randomly and independently with equal probabilities. (Flip
a fair 0/1 coin twice and record the results.) If B1, . . . , Bn are generated, then we
generate Bn+1 according to the rules

    P( B_{n+1} = 0 | Bn = B_{n−1} = 0 ) = 1/2 = P( B_{n+1} = 0 | Bn = 0, B_{n−1} = 1 ),

    P( B_{n+1} = 0 | Bn = 1, B_{n−1} = 0 ) = 1/4 = P( B_{n+1} = 0 | Bn = B_{n−1} = 1 ).

What is the proportion of 0’s in the long run? □
Exercise 4.19. Consider the Markov chain with state space X = N0 and transition
probabilities

    Q_{n,n−1} = 1, ∀n ∈ N,  Q_{0,n} = p_n, ∀n ∈ N₀,  ∑_{n≥0} p_n = 1.

Find a necessary and sufficient condition on the distribution (p_n)_{n≥0} guaranteeing
that the above HMC is positively recurrent. □
Exercise 4.20. Suppose that a gambler plays a fair game with winning probability
p = 1/2. He starts with an initial fortune X0 = 1 dollar. His goal is to reach a
fortune of g dollars, g ∈ N. He stops if he reaches this fortune or goes broke, and he
employs a bold strategy: at every game he stakes the largest amount of money that will
get him closest to, but not above, g. He cannot bet a sum greater than his fortune
at that moment. Denote by Xn his fortune after the n-th game.
(i) Prove that (Xn)n≥0 is an HMC. Describe its state space and its transition
matrix.
(ii) Prove that the player reaches his goal with probability 1/g and goes broke with
probability (g − 1)/g. □
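The bold strategy above is easy to simulate; the seeded sketch below is not a solution of the exercise, but it makes the claimed probability 1/g plausible.

```python
import random

def bold_play(g, rng):
    """Play the bold strategy from $1: stake min(X, g - X) in each fair game
    until the fortune hits 0 or g."""
    x = 1
    while 0 < x < g:
        stake = min(x, g - x)
        x += stake if rng.random() < 0.5 else -stake
    return x

rng = random.Random(0)
g, trials = 5, 200_000
wins = sum(bold_play(g, rng) == g for _ in range(trials))
print(wins / trials)  # empirically close to 1/g = 0.2
```

The analytic value follows from the optional stopping theorem, since a fair game makes (Xn) a bounded martingale.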
Exercise 4.21. Consider the standard random walk in Z² started at the origin.
For each m ∈ Z we denote by Tm the first moment the random walk reaches the
line x + y = m, and we denote by (Um, Vm) the point where this walk intersects the
above line. Find the probability distributions of Tm, Um and Vm. □
Exercise 4.23. Suppose that X is an at most countable set equipped with the
discrete topology, µ ∈ Prob(X ), and Q : X × X → [0, 1] is a stochastic matrix. Let
Xn : (Ω, S, P) → X be a sequence of measurable maps.

    E[ f( X_{n+1} ) | Xn, . . . , X0 ] = Q∗f( Xn )

(ii) (Lévy) Prove that (Xn)n≥0 ∈ Markov(µ, Q) if and only if, for any
f ∈ L∞(X , µ), the sequence

    Y₀ = f( X₀ ),  Yn = f( Xn ) − ∑_{k=0}^{n−1} ( Qf( X_k ) − f( X_k ) )

is a martingale with respect to the filtration F_n = σ( X0, X1, . . . , Xn ). □
Exercise 4.24. Consider an irreducible HMC with finite state space X and tran-
sition matrix Q. We denote by π the invariant probability distribution. For every
x ∈ X we denote by Hx the hitting time of x,

    Hx := min{ n ≥ 0; Xn = x }. □
Exercise 4.25. Suppose that (Xn)n∈N0 is an HMC with state space X and tran-
sition matrix Q. Suppose that B ⊂ X and HB is the hitting time of B,

    HB := min{ n ≥ 0; Xn ∈ B }.

We define

    h_B : X → [0, 1],  h_B(x) = P_x( HB < ∞ ) = P( HB < ∞ | X0 = x ). □
Exercise 4.26. Suppose that (Xn)n∈N0 is an HMC with state space X and tran-
sition matrix Q. For x ∈ X we denote by Tx the return time to x,

    Tx := min{ n ≥ 1; Xn = x }.

We set

    f_{x,y}(n) := P_x( T_y = n ),

    F_{x,y}(s) := ∑_{n≥0} f_{x,y}(n) s^n,  P_{x,y}(s) := ∑_{n≥0} Q^n_{x,y} s^n,

    f_{x,y} := F_{x,y}(1) = ∑_{n≥0} f_{x,y}(n) = P_x( T_y < ∞ ).

(iii) Set T_y^{(1)} := T_y and define inductively T_y^{(k)} := min{ n > T_y^{(k−1)}; Xn = y },
k > 1. Prove that

    P_x( T_y^{(k)} < ∞ ) = f_{x,y} f_{y,y}^{k−1}. □
Exercise 4.28. Consider a positively recurrent HMC (Xn)n≥0 with state space X
and transition matrix Q. Denote by π the stationary distribution. For x ∈ X we
denote by Tx the first return time to x and for y ∈ X we set

    N_{x,y} := #{ n ∈ N; n ≤ Tx, Xn = y },  G(x, y) := E_x[ N_{x,y} ].
Exercise 4.29. Consider a positively recurrent HMC (Xn)n≥0 with state space X ,
transition matrix Q and stationary distribution π. Suppose that T is a stopping
time adapted to (Xn)n≥0 and let x ∈ X be such that E_x[ T ] < ∞. We denote by
G_T(x, y) the expected number of visits to y before T , when started at x, i.e.,

    G_T(x, y) = E_x[ N^T_{x,y} ],  N^T_{x,y} = #{ n ≥ 0; X0 = x, Xn = y, n ≤ T }.

Prove that G_T(x, y) = π_y E_x[ T ]. □
Exercise 4.31. Let (Xn)n≥0 be an irreducible Markov chain with finite state space
X , transition matrix Q and invariant probability measure µ ∈ Prob(X ). Assume
that the initial distribution is also µ, i.e., P_{X0} = µ. For n ∈ N0 we set (see
Exercise 2.50 for notation)

    H_n = (1/(n+1)) Ent₂( X0, X1, . . . , Xn ),  L_n = Ent₂( Xn | X_{n−1}, . . . , X0 ).

(i) Prove that the sequence (L_n)_{n≥0} is non-increasing and nonnegative. Denote
by L its limit.
(ii) Prove that

    H_n = (1/(n+1)) ∑_{k=0}^{n} L_k.

(iii) Prove that the sequence (H_n) is convergent and its limit is L.
(iv) Prove that

    L = − ∑_{x∈X} µ_x Ent₂( Q_{x,−} ) = − ∑_{x,y∈X} µ_x Q_{x,y} log₂ Q_{x,y}.

The number L is called the entropy rate of the irreducible Markov chain. We
denote it by Ent₂( X , Q ). □
Exercise 4.32. Let Q denote the n × n transition matrix describing the random
walk on a complete graph with n vertices. Find the spectrum of Q. □
Exercise 4.33 (Doeblin). Suppose that (Xn)n≥0 is an HMC with state space X ,
initial distribution µ and transition matrix Q satisfying the Doeblin condition

    ∃ε > 0, ∃x0 ∈ X : Q_{x,x0} > ε, ∀x ∈ X .

Denote by M the space of finite signed measures ρ on X . For ρ ∈ M we set

    ‖ρ‖₁ := ∑_{x∈X} |ρ_x| < ∞,  ρ_x := ρ( {x} ).

Prove that if ρ ∈ M and

    ∑_{x∈X} ρ_x = 0,

then

    ‖ρQ‖₁ ≤ (1 − ε) ‖ρ‖₁. □
Exercise 4.34. Suppose that (Xn)n≥0 is an HMC with state space X , initial dis-
tribution µ and transition matrix Q. For each n ∈ N we set

    A_n := (1/(n+1)) ∑_{k=0}^{n} Q^k.

Suppose that there exist N ∈ N, x0 ∈ X and ε > 0 such that

    ( A_N )_{x,x0} > ε,  ∀x ∈ X .

Prove that the HMC is irreducible, positively recurrent and the unique invariant
probability measure π satisfies

    ‖µA_n − π‖₁ ≤ N / ( (n + 1)ε ),  ∀n ∈ N. □
Chapter 5
Ergodic theory is a rather eclectic subject with applications in many areas of math-
ematics, including probability. The ergodic hypothesis first appeared in the works
of L. Boltzmann on statistical mechanics, [23]. The modern formulation of this hy-
pothesis, due to Y. Sinai, came much later, in 1963, and it took a few more decades
to be adjudicated mathematically.
Our rather modest goal in this chapter is to describe enough of the fundamentals
of this theory so we can shed new light on some of the fundamental limit theorems
we have proved in the previous chapters. For more details we refer to [3; 11; 36; 93;
131; 154] that served as our main sources of inspiration.
(b) Consider the n-dimensional torus Tⁿ := Rⁿ/Zⁿ. Set I = [0, 1] and observe that
the natural projection π : Iⁿ → Tⁿ is Borel measurable. We denote by P_{Tⁿ} the
push-forward by π of the Lebesgue measure on Iⁿ. Let us observe that the resulting
probability space is isomorphic to the product of n copies of (S¹, B_{S¹}, P_{S¹}). Suppose
that A ∈ SLn(Z), i.e., A is an n × n matrix with integer coefficients and determinant
1. Then A(Zⁿ) = Zⁿ and thus we have a well defined induced map

    T_A : Rⁿ/Zⁿ → Rⁿ/Zⁿ.

This map is clearly bijective and Borel measurable. It is also measure preserving
since det A = 1.
In [3] Arnold and Avez memorably depicted the action of the map T_A for

        [ 1  1 ]
    A = [ 1  2 ]  ∈ SL₂(Z)

as in Figure 5.1. This map is popularly known as Arnold’s cat map.
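Measure preservation has a discrete shadow that is easy to check directly (a small illustrative sketch, not from the text): since det A = 1, the map induced by A on each finite lattice (Z/NZ)² inside the torus is a bijection.

```python
def cat_map_image_count(N):
    """Apply A = [[1, 1], [1, 2]] modulo N to every point of the lattice
    (Z/NZ)^2 and count distinct images; det A = 1 forces a bijection, the
    discrete counterpart of T_A preserving the measure on the torus."""
    images = {((x + y) % N, (x + 2 * y) % N) for x in range(N) for y in range(N)}
    return len(images)

for N in (2, 10, 101):
    print(N, cat_map_image_count(N) == N * N)  # True: A permutes the lattice
```

The inverse map is induced by A⁻¹ = [[2, −1], [−1, 1]], again an integer matrix of determinant 1.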
(c) In the previous examples, the maps were automorphisms of the corresponding
probability spaces. Here is an example of a measure preserving map that is not
bijective. More precisely, define

    Q : S¹ → S¹,  Q(z) = z².

Then the Lebesgue measure (1/2π) dθ is Q-invariant. If we identify S¹ with R mod Z,
then we can describe Q as the map Q : [0, 1) → [0, 1) given by

    Q(x) = 2x mod 1.
(d) Consider the tent map T : [0, 1] → [0, 1], T (x) = min(2x, 2 − 2x). Equivalently,
this is the unique continuous map such that T (0) = T (1) = 0, T (1/2) = 1 and it is
linear on each of the intervals [0, 1/2] and [1/2, 1]. Its graph looks like a tent with
vertices (0, 0), (1/2, 1) and (1, 0).
This map preserves the Lebesgue measure. Indeed, if I ⊂ [0, 1] is a compact
interval then T −1 (I) consists of two intervals I± , symmetrically located with respect
to the midpoint 1/2 of [0, 1], and each having half the size of I.
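Equivalently, if x is uniformly distributed on [0, 1], then T(x) is again uniform. A seeded Monte Carlo check of this pushforward property (an illustration only):

```python
import random

def tent(x):
    """The tent map T(x) = min(2x, 2 - 2x) on [0, 1]."""
    return min(2 * x, 2 - 2 * x)

rng = random.Random(1)
n = 100_000
pts = [tent(rng.random()) for _ in range(n)]

# If T preserves Lebesgue measure, the pushed-forward sample is again uniform:
for c in (0.25, 0.5, 0.75):
    print(c, sum(p <= c for p in pts) / n)  # each frequency is close to c
```

The check works because {T ≤ c} = [0, c/2] ∪ [1 − c/2, 1] has total length exactly c, matching the two-preimage argument above.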
(e) Suppose that X is a compact metric space and T : X → X is a continuous
map. Denote by Prob(X) the set of Borel probability measures on X. The map T
induces a push-forward map T# : Prob(X) → Prob(X). The T -invariant measures
are precisely the fixed points of T#. One can show (see Exercise 5.2) that the set
Prob_T(X) of T -invariant measures is nonempty, convex and closed with respect to
weak convergence. □
(i) For example, a sequence of i.i.d. random variables is stationary. More generally,
an exchangeable sequence of random variables is stationary.
(ii) Suppose that (Xn )n≥0 is an HMC with state space X and transition matrix Q
and initial distribution µ. The sequence (Xn )n≥0 is stationary if and only if µ
is an invariant distribution, i.e., µ = µ · Q.
Definition 5.3. Suppose that (Ω, S) is a measurable space and T : (Ω, S) → (Ω, S)
is a measurable map.
Proposition 5.5. Suppose that (Ω, S) is a measurable space and T : (Ω, S) → (Ω, S)
is a measurable map. Then the following hold.
Proof. (i) This follows from the fact that S ∈ I_T if and only if S = T⁻¹(S).
(ii) Suppose that f is T -invariant. Then for any x ∈ R the set S = {f ≤ x} is
T -invariant since I_S ◦ T = I_{{f◦T ≤ x}} = I_S.
Conversely, if f is I_T-measurable, then f⁻¹({y}) ∈ I_T, ∀y ∈ R, and

    ( f ◦ T )⁻¹({y}) = T⁻¹( f⁻¹({y}) ) = f⁻¹({y}).
Remark 5.6. Consider the path space U_X = X^N. We have the tail sigma-subalgebra

    T_∞ = ⋂_{m≥1} T_m,  T_m = σ( U_m, U_{m+1}, . . . ).

Note that

    S ∈ T_{m+1} ⇐⇒ Θ^m S ∈ T_1 = U,  ∀m ≥ 0.

The shift map Θ is surjective and if S is Θ-invariant, then (5.1.3a) and (5.1.3b)
imply that ΘS = S. In particular, Θ^m S = S ∈ T_1, so S ∈ T_m, ∀m. Hence, in the
universal case I = I_Θ ⊂ T_∞.
Observe that the sigma-algebras I_Θ and T_∞ do not depend on any choice of prob-
ability measure on U_X. □
Proof. (i) The fact that J_T is a sigma-algebra follows immediately from the defi-
nition of a quasi-invariant set.
(ii) Denote by Ī the P-completion of I. Let S̄ ∈ Ī. There exists S ∈ I such that
P( S̄ ∆ S ) = 0. Since T is measure preserving we deduce

    0 = P( T⁻¹( S̄ ∆ S ) ) = P( T⁻¹(S̄) ∆ S )

and thus

    P( T⁻¹(S̄) ∆ S̄ ) = E[ | I_{T⁻¹(S̄)} − I_{S̄} | ]
    ≤ E[ | I_{T⁻¹(S̄)} − I_S | ] + E[ | I_S − I_{S̄} | ] = 0.

Conversely, if S ∈ J define

    S̄ := ⋂_{n∈N} S_n,  S_n := ⋃_{k≥n} T^{−k}(S).

Then

    I_{S̄} = lim_{n→∞} I_{S_n}.
Proposition 5.10. The map T is ergodic if and only if any P-quasi-invariant set
is a zero-one event, i.e., has measure 0 or 1. □
m∈N
is a zero-one algebra. □
Example 5.13. Consider the map Q : [0, 1) → [0, 1) discussed in Example 5.1(iii).
The interval [0, 1) embeds in {0, 1}^N,

    [0, 1) ∋ x = ∑_{n=1}^{∞} ε_n/2^n ↦ ( ε₁, ε₂, . . . ) ∈ {0, 1}^N.

The image of the map is a shift-invariant subset of {0, 1}^N. Its complement is neg-
ligible with respect to the product measure on {0, 1}^N and the restriction of the
product measure to the image of this embedding coincides with the Lebesgue mea-
sure; see Exercise 1.3(vii). The space {0, 1}^N equipped with the product measure
is the path space corresponding to an i.i.d. sequence of Bernoulli random variables
with success probability 1/2. Hence the shift map is ergodic, proving that the map
Q is also ergodic. □
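Concretely, ergodicity of the shift says that for almost every x the orbit averages of, say, the first binary digit converge to its mean. A seeded sketch, simulating a "typical" point by fair coin flips (direct float iteration of Q is useless here, since each doubling discards one bit of precision):

```python
import random

rng = random.Random(2)
bits = [rng.randrange(2) for _ in range(100_000)]  # a "typical" point of {0,1}^N

# Birkhoff average of f(eps) = eps_1 along the shift orbit: this is the
# fraction of the orbit of x spent in [1/2, 1) under Q, i.e., the running
# frequency of 1-digits.
avg = sum(bits) / len(bits)
print(avg)  # close to the mean 1/2
```

This is exactly the normal-number phenomenon: Lebesgue-almost every x ∈ [0, 1) has binary digit frequency 1/2.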
    | P( A ∩ B ) − P( A ) P( B ) | = | ∫_B ( E[ I_A | T_m ] − P(A) ) dP | ≤ ∫_Ω | E[ I_A | T_m ] − P(A) | dP.

Hence

    sup_{B∈T_m} | P( A ∩ B ) − P( A ) P( B ) | ≤ ∫_Ω | E[ I_A | T_m ] − P(A) | dP.    (5.1.5)
isometries

    T̂ : L^p( Ω, S, P ) → L^p( Ω, S, P ),  ∀p ≥ 1,

so that

    ∫_Ω f I_S dP = ∫_Ω T̂( f I_S ) dP = ∫_Ω ( T̂f ) · I_S dP,  ∀S ∈ J, f ∈ L¹( Ω, S, P ).    (5.1.6)

We set

    P_T f = E[ f | J ].
To summarize,

    T is ergodic ⇐⇒ dim Q_T = 1.    (5.1.9)

For each n we denote by A_n the n-th temporal average/mean operator

    f ↦ A_n f = (1/n)( 1 + T̂ + T̂² + · · · + T̂^{n−1} ) f.

Note that A_n defines linear operators

    A_n : L^p( Ω, S, P ) → L^p( Ω, S, P ),  p ≥ 1,

satisfying

    ‖A_n f‖_p ≤ ‖f‖_p,  ∀f ∈ L^p.    (5.1.10)
Remark 5.16. Let me briefly explain the intuition behind the temporal averages A_n(f).
Think of Ω as the space of states of a physical system that evolves in discrete time.
Thus, if the system was initially in the state ω, it will be in the state T^n(ω) after
n units of time.
A function f : Ω → R can be viewed as a macroscopic quantity that associates
to each state ω a measurable numerical quantity f(ω). Note that for each n ∈ N
and each ω ∈ Ω,

    ( A_{n+1} f )(ω) = ( f(ω) + f(Tω) + · · · + f(T^n ω) ) / (n + 1)

is the average value of the macroscopic quantity f as the system evolves for n units
of time. □
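For a concrete feel for temporal averages, take the rotation T(x) = x + α mod 1 of the circle with α irrational — a standard measure-preserving map, used here only as an illustration and not among the examples above. The temporal averages of f(x) = cos(2πx) settle down to its spatial mean 0:

```python
import math

alpha = math.sqrt(2) - 1  # an irrational rotation number

def f(x):
    return math.cos(2 * math.pi * x)

# Temporal average A_n f evaluated at the point omega = 0:
n = 100_000
x, total = 0.0, 0.0
for _ in range(n):
    total += f(x)
    x = (x + alpha) % 1.0

print(total / n)  # the spatial mean of f over the circle is 0
```

For this map the deviation is in fact of order 1/n, since the partial sums of cos(2πkα) form a bounded geometric-type sum.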
We have the following mean ergodic theorem due to John von Neumann.
Theorem 5.17 (L²-Mean ergodic theorem). Suppose that (Ω, S, P) is a proba-
bility space and T : Ω → Ω is a measure preserving map. Then, ∀f ∈ L²(Ω, S, P),
the temporal averages A_n f converge in L² to the orthogonal projection of f onto
the space Q_T of quasi-invariant functions, i.e.,

    (1/n)( 1 + T̂ + T̂² + · · · + T̂^{n−1} ) f → P_T f = E[ f | J ].

In particular, if T is ergodic we have

    (1/n)( 1 + T̂ + T̂² + · · · + T̂^{n−1} ) f → E[ f ] I_Ω  in L².
Proof. Denote by X₂ the collection of functions f ∈ L²(Ω, S, P) such that A_n f
converges in L² to some function A_∞ f . Clearly X₂ is a vector space. We will
gradually show that X₂ = L²(Ω, S, P) and A_∞ = P_T.
1. Q_T ⊂ X₂ and A_∞ f = f , ∀f ∈ Q_T.
Indeed, A_n f = f , ∀f ∈ Q_T.
2. ∀f ∈ X₂, we have T̂f ∈ X₂ and A_∞ f ∈ Q_T.
and T̂f_j − f_j ∈ Q_T^⊥, ∀j.
5. A_∞ f = P_T f , ∀f ∈ X₂.
We have

    f − A_n f = (1/n) ∑_{k=1}^{n−1} ( f − T̂^k f ) ∈ Q_T^⊥.
6. X2 is closed.
Let (fk )k∈N be a sequence in X2 that converges in L2 to f . To show that f ∈ X2
we will show that the sequence An f is Cauchy. Fix ε > 0. We have
‖A_n f − A_m f‖₂ ≤ ‖A_n f − A_n f_k‖₂ + ‖A_n f_k − A_m f_k‖₂ + ‖A_m f_k − A_m f‖₂
7. X₂ = L²(Ω, S, P).
We know that Q_T ⊂ X₂ and

    Range( T̂ − 1 ) ⊂ Q_T^⊥ ∩ X₂.

    (1/n)( 1 + T̂ + T̂² + · · · + T̂^{n−1} ) f → E[ f | J ]  in L¹ as n → ∞.
Proof. Denote by X₁ the collection of functions f ∈ L¹(Ω, S, P) such that A_n f
converges in L¹ to some function A_∞ f . Since ‖·‖₁ ≤ ‖·‖₂ we deduce that
X₂ ⊂ X₁. The argument in Step 6 of the proof of Theorem 5.17 extends without
change to the L¹ setting since, according to (5.1.10), the operators A_n are contractions
The claim (i) is the difficult one. Temporarily assuming its validity we will show
how it implies (ii) and the conclusion of the theorem.
Proof of (ii) assuming (i). Observe that for any f ∈ L¹ we have

    A_n( T̂f − f ) = (1/n)( T̂^n f − f ).

In particular, if f ∈ L^∞ we deduce

    ‖A_n( T̂f − f )‖_∞ ≤ (2/n) ‖f‖_∞,

so A_n( T̂f − f ) → 0 a.s., so ( T̂f − f ) ∈ X₀ if f ∈ L^∞.
Suppose now that f ∈ L¹; then f = f₊ − f₋ and ( T̂f )_± = T̂f_±. Thus it suffices
to show that T̂f − f ∈ X₀ if f ∈ L¹ and f ≥ 0 a.s.
In this case we can find a sequence of elementary functions f_n such that f_n ↗ f .
Hence

    T̂f_n − f_n → T̂f − f  in L¹.

Since the functions f_n are bounded, so are the functions T̂f_n − f_n, and we deduce
that T̂f_n − f_n ∈ X₀. We know from (i) that X₀ is L¹-closed. This proves (ii).
From (ii) we deduce that T̂f − f ∈ X₀, ∀f ∈ L² ⊂ L¹. Since X₀ is closed in L¹
we deduce from the proof of Theorem 5.17 that

    closure_{L²}( range( T̂ − 1 ) ) ⊂ closure_{L¹}( range( T̂ − 1 ) ) ⊂ X₀.
    ∀λ > 0, f ∈ L¹( Ω, S, P ) :  λ P( { M|f| > λ } ) ≤ ‖f‖₁.    (5.1.11) □
Let us first explain why the Maximal Ergodic Lemma implies the claim (i).
Suppose that the sequence (fk ) in X0 converges in L1 to a function f . We want
to show that the sequence A_n(f) is a.s. Cauchy, i.e., for every ε > 0, the set

    ⋃_N X_N( f, ε ),  where X_N( f, ε ) := ⋂_{m,n>N} { |A_n(f) − A_m(f)| < ε },

has probability 1.
Letting N → ∞ we deduce

    lim_{N→∞} P( X_N(f, ε) ) ≥ lim_{N→∞} P( { 2M|f_k − f| < ε/2 } ∩ X_N( f_k, ε/2 ) ).

Since f_k ∈ X₀ we have

    lim_{N→∞} P( X_N( f_k, ε/2 ) ) = 1.

Hence, ∀k,

    lim_{N→∞} P( { 2M|f_k − f| < ε/2 } ∩ X_N( f_k, ε/2 ) ) = P( { 2M|f_k − f| < ε/2 } ).

We deduce that

    lim_{N→∞} P( X_N(f, ε) ) ≥ P( { 2M|f_k − f| < ε/2 } ) ≥ 1 − (4/ε) ‖f − f_k‖₁,  ∀k,

where the last inequality follows from (5.1.11). Letting k → ∞ we obtain (5.1.12).
Proof of the Maximal Lemma. Let us observe that the inequality (5.1.11) follows
from

    ∫_{ {M[g]>0} } g dP ≥ 0,  ∀g ∈ L¹( Ω, S, P ).    (5.1.13)

We will present two proofs of (5.1.13). The first proof, due to F. Riesz, is a bit
longer but a bit more intuitive. The second proof, due to A. Garsia [67], is a lot
shorter but less intuitive.
Set

    X := { M g > 0 } ⊂ Ω.

Define

    S_n(g) := ∑_{j=0}^{n−1} g ◦ T^j,  M_n g := max_{1≤k≤n} S_k(g),  X_k := { M_k g > 0 }.    (5.1.14)
At the last step we used the fact that the Cesàro means of a convergent sequence
have the same limit as the sequence; see Exercise 2.3 with p_{n,k} = 1/n. Thus, it suffices
to show that

    ∑_{k=1}^{n} ∫_{X_k} g dP ≥ 0,  ∀n ≥ 0.    (5.1.15)

Fix n. We have

    ∑_{k=1}^{n} ∫_{X_k} g dP = ∑_{j=0}^{n−1} ∫_{X_{n−j}} g dP = ∑_{j=0}^{n−1} ∫_{T^{−j}(X_{n−j})} g ◦ T^j dP,

where at the last step we used the change-in-variables formula (1.2.21) and the fact
that T is measure preserving. We set Y_j := T^{−j}( X_{n−j} ). Hence

    ∑_{k=1}^{n} ∫_{X_k} g dP = ∫_Ω ( ∑_{j=0}^{n−1} g( T^j ω ) I_{Y_j}(ω) ) P( dω ).
This proves that each of the terms x_{j₀}, x_{j₀+1}, . . . , x_{ℓ₀} is a leading term. Their sum
is obviously nonnegative.
Consider now the (shorter) sequence

    y :  y₀ = x_{ℓ₀+1}, . . . , y_{m−1} := x_n,  m := n − 1 − ℓ₀ < n − 1.

The induction assumption implies that the sum of the leading terms of y is ≥ 0.
The minimality of j₀ implies that the leading terms of x are x_{j₀}, . . . , x_{ℓ₀} together
with the leading terms of y. This proves Lemma 5.21 and completes the proof of
Theorem 5.19. □
Second proof of (5.1.13). We continue using the notations (5.1.14). Set $G_n:=M_n g$. Since $X_n\nearrow X$, it suffices to show that
$$\int_{X_n}g\,d\mathbb{P}\ \ge\ 0,\qquad\forall n.$$
The operator $f\mapsto\widehat{T}f$ is monotone, i.e., $f_0\le f_1\Rightarrow\widehat{T}f_0\le\widehat{T}f_1$, and we deduce that for $1\le k\le m$ we have
$$S_{k-1}(g)\ \le\ \max_{1\le j\le m-1}S_j(g)\ =\ G_{m-1}\ \le\ G_{m-1}^+$$
and
$$S_k(g)\ =\ g+\widehat{T}S_{k-1}(g)\ \le\ g+\widehat{T}G_{m-1}\ \le\ g+\widehat{T}G_{m-1}^+,$$
so that
$$G_{m-1}\ \le\ G_m\ \le\ g+\widehat{T}G_{m-1}^+,\qquad\forall m\in\mathbb{N},$$
or equivalently
$$g\ \ge\ G_n-\widehat{T}G_n^+,\qquad\forall n.$$
We deduce
$$\int_{X_n}g\ \ge\ \int_{X_n}G_n-\int_{X_n}\widehat{T}G_n^+$$
($\widehat{T}G_n^+\ge0$ on $\Omega$, $G_n=G_n^+$ on $X_n$, $G_n^+=0$ on $\Omega\setminus X_n$)
$$\ge\ \int_{X_n}G_n^+-\int_\Omega\widehat{T}G_n^+\ =\ \int_\Omega G_n^+-\int_\Omega G_n^+\ =\ 0,$$
where, at the last step, we used the equality $\int_\Omega\widehat{T}G_n^+=\int_\Omega G_n^+$, which holds because $T$ is measure preserving. □
Remark 5.22. In Remark 5.11 we suggested that the ergodicity condition points
to a chaotic behavior of the dynamics of the iterates of T . The ergodic theorem
makes this much more precise.
5.2 Applications
Ergodicity is the unifying principle behind some of the limit theorems we have
discussed in the previous chapters and it is the source of many interesting non-
probabilistic results.
Example 5.23 (I.i.d. random variables). Suppose that $(X_n)_{n\in\mathbb{N}}$ is a sequence of i.i.d. integrable random variables defined on the same probability space $(\Omega,\mathcal{S},\mathbb{P})$. Kolmogorov's 0-1 theorem shows that this is a Kolmogorov family, thus ergodic. Consider the coordinate maps on the path space
$$U_n:\mathbb{R}^{\mathbb{N}}\to\mathbb{R},\qquad U_n\big(u_1,u_2,\dots\big)=u_n.$$
The Ergodic Theorem implies
$$\frac{1}{n}\big(U_1+\cdots+U_n\big)\to\mathbb{E}\big[U_1\big]\quad \mu\text{-a.s.}$$
Observing that $X_n=U_n\circ\vec{X}$ we deduce the Strong Law of Large Numbers. □
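The conclusion of Example 5.23 is easy to probe numerically. The sketch below is an illustration, not part of the text: it uses the hypothetical choice of Uniform(0,1) samples, so that $\mathbb{E}[U_1]=1/2$, and computes the Birkhoff average along a simulated path.

```python
import random

# Monte Carlo illustration of Example 5.23: for an i.i.d. integrable
# sequence the Birkhoff averages (1/n)(U_1 + ... + U_n) converge a.s.
# to E[U_1].  Here the U_i are Uniform(0,1), so E[U_1] = 1/2.
def birkhoff_average(n, seed=0):
    rng = random.Random(seed)
    return sum(rng.random() for _ in range(n)) / n

print(birkhoff_average(100_000))  # close to 0.5
```

Increasing `n` shrinks the deviation from $1/2$, exactly as the a.s. convergence predicts.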
Example 5.24 (Markov chains). Consider an HMC $(X_n)_{n\ge0}$ with state space $\mathcal{X}$, transition matrix $Q$ and initial distribution $\mu$. The path space of this Markov chain (see Theorem 4.3) is the probability space
$$\mathcal{U}=\mathcal{U}_\mu=\big(\mathcal{X}^{\mathbb{N}_0},\mathcal{E},\mathbb{P}_\mu\big),$$
where
$$U_n\big(u_0,u_1,u_2,\dots\big)=u_n.$$
Recall that for any $x\in\mathcal{X}$ we set $\mathbb{P}_x=\mathbb{P}_{\delta_x}$, where $\delta_x$ is the Dirac measure on $\mathcal{X}$ concentrated at $x$. Then
$$\mathbb{P}_\mu=\sum_{x\in\mathcal{X}}\mu_x\mathbb{P}_x,\qquad \mu_x=\mu\big[\{x\}\big]. \tag{5.2.1}$$
Denote by $\mathcal{I}\subset\mathcal{E}$ the sigma-algebra of $\Theta$-invariant sets, where $\Theta$ denotes the shift on $\mathcal{U}$. Fix $x\in\mathcal{X}$ and let $A\in\mathcal{J}_x$.
The Markov property (4.1.16) implies that
$$\mathbb{E}_x\big[\mathbb{I}_A\circ\Theta^n\,\big\|\,\mathcal{E}_n\big]=\mathbb{E}_{X_n}\big[\mathbb{I}_A\big],\qquad \mathcal{E}_n=\sigma(X_0,\dots,X_n).$$
Hence
$$\underbrace{\mathbb{P}_{X_n}\big[A\big]}_{=:f_n}=\mathbb{E}_{X_n}\big[\mathbb{I}_A\big]\to\mathbb{I}_A\quad\text{a.s.}$$
On the other hand, $\mathbb{P}_x\big[X_n=x\ \text{i.o.}\big]=1$, since the chain is recurrent. Thus
$$\mathbb{P}\big[\,f_n=\mathbb{P}_x[A]\ \text{i.o.}\,\big]=1.$$
Hence $\mathbb{I}_A=\mathbb{P}_x[A]$ a.s., so $\mathbb{P}_x[A]\in\{0,1\}$. Using (5.2.1) we deduce that
$$\mathbb{P}_\mu\big[A\big]\in\{0,1\}\ \text{for any initial distribution}\ \mu.$$
If the chain is positively recurrent and $\pi_\infty$ is the invariant distribution, then $\Theta$ is measure preserving and we deduce that $\mathcal{J}$ is a zero-one algebra so $\Theta$ is ergodic. We see that the Ergodic Theorem for Markov chains (Corollary 4.60) is a special case of Birkhoff's Ergodic Theorem because any $f\in L^1\big(\mathcal{X},\pi_\infty\big)$ induces a function
$$\bar f=f\circ U_0\in L^1\big(\mathcal{X}^{\mathbb{N}_0},\mathcal{E},\mathbb{P}_{\pi_\infty}\big).$$
The fact that the shift map is ergodic allows us to state results stronger than Corollary 4.60. For any finite set $B\subset\mathcal{X}\times\mathcal{X}$ we obtain a function $F_B\in L^1\big(\mathcal{X}^{\mathbb{N}_0},\mathcal{E},\mathbb{P}_{\pi_\infty}\big)$,
$$F_B(u_0,u_1,u_2,\dots)=\mathbb{I}_B(u_0,u_1)$$
Example 5.25 (Weyl's equidistribution theorem). Fix $\varphi\in(0,2\pi)$ and denote by $R_\varphi$ the planar counterclockwise rotation of angle $\varphi$ about the origin. This induces a transformation of the unit circle
$$S^1:=\big\{z\in\mathbb{C};\ |z|=1\big\}.$$
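The equidistribution phenomenon behind this example is easy to simulate. The sketch below is an illustration, not part of the text; it uses the arbitrary choice $\varphi=1$ radian, for which $\varphi/2\pi$ is irrational, and checks that the orbit of the rotation visits an arc with frequency proportional to its length.

```python
import math

# The orbit {n*phi mod 2*pi} of an irrational rotation equidistributes:
# the fraction of the first n orbit points falling in an arc converges
# to (arc length)/(2*pi).
def visit_frequency(phi, arc_len, n):
    theta, hits = 0.0, 0
    for _ in range(n):
        theta = (theta + phi) % (2 * math.pi)
        if theta < arc_len:
            hits += 1
    return hits / n

print(visit_frequency(1.0, math.pi / 2, 200_000))  # close to 1/4
```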
5.2.2 Mixing
Suppose that $T$ is a measure preserving transformation of a probability space $(\Omega,\mathcal{S},\mathbb{P})$. Note that if $T$ is ergodic, then the $L^2$ ergodic theorem implies that
$$\frac{1}{n}\sum_{k=0}^{n-1}f\circ T^k\ \xrightarrow{\ L^2\ }\ \mathbb{E}\big[f\big]\,\mathbb{I}_\Omega,\qquad\forall f\in L^2\big(\Omega,\mathcal{S},\mathbb{P}\big).$$
If we take the inner product with $g\in L^2$ of both sides in the above equality we deduce
$$\frac{1}{n}\sum_{k=0}^{n-1}\int_\Omega\big(f\circ T^k\big)g\,d\mathbb{P}\ \to\ \mathbb{E}\big[f\big]\,\mathbb{E}\big[g\big],\qquad\forall f,g\in L^2\big(\Omega,\mathcal{S},\mathbb{P}\big). \tag{5.2.4}$$
$$\mathbb{P}\big[A\big]\big(1-\mathbb{P}\big[A\big]\big)=0.$$
$$\forall f,g\in L^2\big(\Omega,\mathcal{S},\mathbb{P}\big):\quad\lim_{n\to\infty}\int_\Omega\big(f\circ T^n\big)\,g\,d\mathbb{P}=\int_\Omega f\,d\mathbb{P}\int_\Omega g\,d\mathbb{P}. \tag{5.2.9}$$
Clearly, if a measure preserving map satisfies (5.2.9), then it is mixing. The above argument has the following immediate generalization.
□
Proof. We will show that the shift map $\Theta$ satisfies (5.2.6). Denote by $U_n$ the coordinate maps on the path space, $U_n:\mathcal{X}^{\mathbb{N}}\to\mathcal{X}$, and set
$$\mathcal{T}_n:=\sigma\big(U_n,U_{n+1},\dots\big).$$
In view of Proposition 5.26 it suffices to show that (5.2.6) is satisfied for any $A,B\in\mathcal{C}$. Suppose that
$$A=C_{x_{i_1},\dots,x_{i_k}},\qquad B=C_{x_{j_1},\dots,x_{j_m}}.$$
For $n>j_m$ we have
$$\Theta^{-n}(A)\cap B=C_{x_{j_1},\dots,x_{j_m},x_{i_1+n},\dots,x_{i_k+n}},\qquad x_{i_j+n}:=x_{i_j},$$
and
$$\mathbb{P}_\pi\big[\Theta^{-n}(A)\cap B\big]=\pi_{x_{j_1}}Q^{j_2-j_1}_{x_{j_1},x_{j_2}}\cdots Q^{j_m-j_{m-1}}_{x_{j_{m-1}},x_{j_m}}\,Q^{n+i_1-j_m}_{x_{j_m},x_{i_1}}\,Q^{i_2-i_1}_{x_{i_1},x_{i_2}}\cdots Q^{i_k-i_{k-1}}_{x_{i_{k-1}},x_{i_k}}. \tag{5.2.11}$$
(ii) ⇒ (i) Suppose that the shift map is mixing. To prove that it is aperiodic we argue by contradiction and assume the period $d$ is bigger than 1. As in Proposition 4.27 consider the communication classes of $Q^d$,
$$C_1,C_2,\dots,C_d\subset\mathcal{X}.$$
Hence
$$\mathbb{P}\big[X_{n+1}\in C_{i+1\bmod d}\,\big\|\,X_n\in C_{i\bmod d}\big]=1,\qquad\forall n\ge0,\ i=1,\dots,d.$$
Consider the sets
$$A_i=\big\{u\in\mathcal{X}^{\mathbb{N}_0};\ u_0\in C_i\big\},\quad i=1,2,\dots.$$
Then $\Theta^{-n}A_i=A_{i+n\bmod d}$, $A_i\cap A_j=\emptyset$ if $i\not\equiv j\bmod d$. We deduce that for any $n\in\mathbb{N}$ we have
$$\mathbb{P}_\pi\big[\Theta^{-nd}(A_0)\cap A_1\big]=0,\qquad \mathbb{P}_\pi\big[\Theta^{-nd-1}(A_0)\cap A_1\big]=\mathbb{P}_\pi\big[A_1\big]=\pi\big[A_1\big]\ne0.$$
Example 5.30 (The tent map). Consider the tent map T : [0, 1] → [0, 1] intro-
duced in Example 5.1(d). Recall that T is the continuous map [0, 1] → [0, 1] such
that T (0) = 0 = T (1), T (1/2) = 1 and T is linear on each of the intervals [0, 1/2]
and [1/2, 1]. We want to show that T is mixing.
Consider the Haar basis of $L^2\big([0,1]\big)$. Recall its definition. It consists of the Haar functions
$$\mathcal{H}=\{H_0\}\cup\big\{H_{n,k};\ n\ge0,\ 0\le k<2^n\big\}.$$
The subspaces $\mathcal{H}_n$ are mutually orthogonal and the collection $\mathcal{H}$ spans a dense subspace of $L^2\big([0,1]\big)$; see [24, Sec. 9.2]. Moreover
$$\widehat{T}\mathcal{H}_n\subset\mathcal{H}_{n+1},\qquad\forall n\ge0.$$
Thus if $m,n\ge0$, $0\le j<2^m$, $0\le k<2^n$ we have
$$\big(\widehat{T}^\ell H_{m,j},H_{n,k}\big)_{L^2}=0=\int_0^1 H_{m,j}(x)\,dx\,\int_0^1 H_{n,k}(x)\,dx,\qquad\forall\ell>n-m.$$
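The resulting decay of correlations can be observed directly. The sketch below is a numerical illustration, not part of the text: it estimates the correlation in (5.2.9) for the tent map with $f=g=\mathbb{I}_A$, $A=[0,1/4]$, by Monte Carlo.

```python
import random

# Monte Carlo check of the mixing relation (5.2.9) for the tent map,
# with f = g = indicator of A = [0, 1/4]:
#   (1/N) sum_i f(x_i) * f(T^n x_i)  -->  lambda(A)^2 = 1/16.
# At n = 1 the correlation is still 1/8; for larger n it settles at 1/16.
# n is kept small so roundoff in the repeated doubling stays negligible.
def tent(x):
    return 2 * x if x <= 0.5 else 2 - 2 * x

def correlation(n, samples=200_000, seed=1):
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        x = rng.random()
        y = x
        for _ in range(n):
            y = tent(y)
        hits += (x <= 0.25) and (y <= 0.25)
    return hits / samples

print(correlation(1), correlation(8))  # ~0.125 vs ~0.0625
```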
Denote by $\langle-,-\rangle$ the canonical inner product in $\mathbb{R}^d$ and by $A^*$ the transpose of $A$. Clearly
$$A^*\in SL_d(\mathbb{Z})\quad\text{and}\quad A^*\big(\mathbb{Z}^d\big)=\mathbb{Z}^d.$$
For each $\vec m\in\mathbb{Z}^d$ we denote by $O_{\vec m}$ the orbit of the action of $A^*$ on $\mathbb{Z}^d$, i.e., the set
$$O_{\vec m}=\big\{(A^*)^n\vec m;\ n\ge0\big\}.$$
For any $\vec m\in\mathbb{Z}^d$ we define the character $\chi_{\vec m}\in L^2\big(\mathbb{T}^d,\mu\big)$,
$$\chi_{\vec m}(\theta)=e^{i\langle\vec m,\theta\rangle}=e^{i(m_1\theta_1+\cdots+m_d\theta_d)},\qquad i:=\sqrt{-1}.$$
The set of characters
$$\mathcal{C}_d:=\big\{\chi_{\vec m};\ \vec m\in\mathbb{Z}^d\big\}\subset L^2\big(\mathbb{T}^d,\mu\big) \tag{5.2.12}$$
is an orthonormal family that spans a vector subspace dense in $L^2\big(\mathbb{T}^d,\mu\big)$.
The unitary operator $\widehat{T}_A:L^2\big(\mathbb{T}^d,\mu\big)\to L^2\big(\mathbb{T}^d,\mu\big)$ has the explicit description
Theorem 5.32. Let A ∈ SLd (Z), d > 1. The following are equivalent.
Proof. We follow the approach in [36, Sec. 4.3]. We only need to prove (i) ⇒ (ii)
⇒ (iii).
(i) ⇒ (ii) We argue by contradiction. Suppose there exists $\vec m\in\mathbb{Z}^d\setminus\{0\}$ such that $O_{\vec m}$ is finite. Denote by $n$ the smallest $n\in\mathbb{N}$ such that $(A^*)^n\vec m=\vec m$. Then the function
$$f=\chi_{\vec m}+\chi_{A^*\vec m}+\cdots+\chi_{(A^*)^{n-1}\vec m}$$
is $\widehat{T}_A$-invariant and nonconstant since the functions $1,\chi_{\vec m},\dots,\chi_{(A^*)^{n-1}\vec m}$ are linearly independent. Hence $A$ is not ergodic.
(ii) ⇒ (iii) We apply Proposition 5.26 to the set of characters $\mathcal{C}_d$ in (5.2.12). Note that
$$\int_{\mathbb{T}^d}\chi_{\vec m}\,d\mu=\begin{cases}1,&\vec m=0,\\ 0,&\vec m\ne0.\end{cases}$$
Clearly if $f=g=1$, then (5.2.10) holds trivially. Suppose $f\ne1$. Then
$$\int_\Omega f\,d\mathbb{P}\int_\Omega g\,d\mathbb{P}=0.$$
Assumption (ii) implies that $\widehat{T}^n_A f$ is a character different from $g$ for all $n$ sufficiently large and thus
$$\big(\widehat{T}^n_A f,g\big)_{L^2}=0,\qquad\forall n\gg0.$$
We deduce that $A$ is mixing. □
The condition (ii) above holds if and only if none of the eigenvalues of $A$ are roots of 1. Observe that if one eigenvalue of $A\in SL_2(\mathbb{Z})$ is a root of 1 then all eigenvalues are roots of 1. We deduce that the only such matrices in $SL_2(\mathbb{Z})$ are
$$\pm\begin{pmatrix}1&0\\0&1\end{pmatrix}\quad\text{and}\quad\pm\begin{pmatrix}0&-1\\1&0\end{pmatrix}.$$
In particular, this shows that Arnold's cat map is mixing. □
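For the cat map the eigenvalue criterion is easy to verify directly. A quick numerical sanity check (an illustration, not part of the text):

```python
import math

# Eigenvalues of the cat map matrix A = [[2, 1], [1, 1]]: trace 3,
# determinant 1, so lambda = (3 +- sqrt(5))/2.  Neither lies on the
# unit circle, hence neither is a root of 1, and the induced torus
# automorphism is mixing by the criterion above.
tr, det = 3, 1
disc = math.sqrt(tr * tr - 4 * det)
lam1, lam2 = (tr + disc) / 2, (tr - disc) / 2
print(lam1, lam2)  # one eigenvalue > 1, the other in (0, 1)
```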
Remark 5.33. There is another condition that is intermediate between mixing and ergodicity. More precisely, a measure preserving self-map of a probability space $(\Omega,\mathcal{S},\mathbb{P})$ is called weakly mixing if
$$\lim_{n\to\infty}\frac{1}{n}\sum_{k=0}^{n-1}\Big|\,\mathbb{P}\big[T^{-k}(A)\cap B\big]-\mathbb{P}\big[A\big]\,\mathbb{P}\big[B\big]\,\Big|=0 \tag{5.2.13}$$
for any $A,B\in\mathcal{S}$. Clearly (5.2.13) implies (5.2.5), so weakly mixing maps are ergodic. Since convergent sequences are Cesàro convergent, we deduce that (5.2.6) implies (5.2.13), so mixing maps are weakly mixing.
It turns out that most weakly mixing automorphisms of a probability space $(\Omega,\mathcal{S},\mathbb{P})$ are not mixing. More precisely, the mixing operators form a meagre (first Baire category) subset of the set of weakly mixing automorphisms [78, p. 77].
5.3 Exercises
Exercise 5.1. Suppose that (Ω, S) is a measurable space and T : (Ω, S) → (Ω, S) a
measurable map. Denote by ProbT (Ω, S) the set of T -invariant probability measures
P : S → [0, 1].
(i) Prove that ProbT (Ω, S) is a convex subset of the space of finite measures on S.
(ii) Prove that $T$ is ergodic with respect to a probability measure $\mathbb{P}$ if and only if $\mathbb{P}$ is an extremal point of $\operatorname{Prob}_T(\Omega,\mathcal{S})$, i.e., $\mathbb{P}$ cannot be written as a convex combination $\mathbb{P}=(1-t)\mathbb{P}_0+t\mathbb{P}_1$, $t\in(0,1)$, $\mathbb{P}\ne\mathbb{P}_0,\mathbb{P}_1$. □
(i) Prove that the sequence $(\mathbb{P}_n)_{n\in\mathbb{N}}$ contains a subsequence $(\mathbb{P}_{n_k})$ that converges weakly to a Borel probability measure $\mathbb{P}_*$ on $X$, i.e.,
$$\lim_{k\to\infty}\int_X f(x)\,\mathbb{P}_{n_k}\big[dx\big]=\int_X f(x)\,\mathbb{P}_*\big[dx\big],\qquad\forall f\in C(X).$$
Hint. Use the Banach–Alaoglu compactness theorem.
(ii) Prove that $\mathbb{P}_*$ is $T$-invariant.
(iii) Prove that the set $\operatorname{Prob}_T(X)$ of $T$-invariant Borel probability measures on $X$ is convex and closed with respect to the weak convergence. □
$$c_1\,\mathbb{P}\big[A\big]\,\mathbb{P}\big[B\big]\ \le\ \mathbb{P}\big[T^{-1}(A)\cap B\big]\ \le\ c_2\,\mathbb{P}\big[A\big]\,\mathbb{P}\big[B\big]. \tag{5.3.1}$$
□
(i)
$$\bigvee_{n\ge1}\mathcal{F}_n=\mathcal{S}.$$
(ii) $T^{-1}(\mathcal{F}_n)\subset\mathcal{F}_n$, $\forall n\in\mathbb{N}$.
(iii) For any $k\in\mathbb{N}$ the intersection
$$\bigcap_{n\ge1}\big(T^k\big)^{-1}(\mathcal{F}_n)$$
is a 0-1 sigma subalgebra. □
Set
$$\Omega_S:=\big\{\omega\in\Omega\setminus S;\ T^n\omega\notin S,\ \forall n\ge1\big\}.$$
Exercise 5.8. Consider an irreducible HMC (Xn )n≥0 with finite state space X ,
transition matrix Q and whose initial distribution is the stationary distribution µ.
The path space of this Markov chain is (see Theorem 4.3)
$$\mathcal{U}_\mu=\big(\mathcal{X}^{\mathbb{N}_0},\mathcal{E},\mathbb{P}_\mu\big).$$
For $n\in\mathbb{N}_0$ we denote by $U_n$ the $n$-th coordinate map $U_n(u_0,u_1,\dots)=u_n$. Let
$$f:\mathcal{U}_\mu\to\mathbb{R},\qquad f(u_0,u_1,\dots)=-\log_2 Q_{u_0,u_1}.$$
where $\operatorname{Ent}_2\big(\mathcal{X},Q\big)$ denotes the entropy rate of the Markov chain described in Exercise 4.31.
(ii) Prove that
$$-\lim_{n\to\infty}\frac{1}{n}\log_2\big(Q_{U_0,U_1}\cdots Q_{U_{n-1},U_n}\big)=\operatorname{Ent}_2\big(\mathcal{X},Q\big)\quad\text{a.s.}$$
□
Exercise 5.9 (Kac). Consider the map $Q:[0,1)\to[0,1)$, $x\mapsto Qx:=2x\bmod 1$. Show that $Q$ is mixing with respect to the Lebesgue measure. Hint. See Example 5.13. □
Exercise 5.10. Consider the tent map $T:[0,1]\to[0,1]$, $T(x)=\min(2x,2-2x)$ and the logistic map $L:[0,1]\to[0,1]$, $L(x)=4x(1-x)$.
(i) Prove that the map $\Phi:[0,1]\to[0,1]$, $\Phi(x)=\frac{1}{2}\big(1-\cos(\pi x)\big)$ is a homeomorphism and $L\circ\Phi=\Phi\circ T$.
(ii) Describe the measure $\mu:=\Phi_\#\lambda$, where $\lambda$ is the Lebesgue measure on $[0,1]$.
(iii) Prove that the logistic map preserves $\mu$ and it is mixing with respect to this measure. □
Exercise 5.12. Consider the Haar functions $H_{n,k}$ used in Example 5.30. We define the Rademacher functions,
$$R_n:[0,1]\to\mathbb{R},\qquad R_n=2^{-n/2}\sum_{0\le k<2^n}H_{n,k},\quad n\ge0.$$
(ii) Prove that the functions $(R_n)_{n\ge0}$, viewed as random variables defined on the probability space $([0,1],\lambda)$, are i.i.d. □
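With the Haar convention of Example 5.30, $R_n(x)$ equals $+1$ when the $(n{+}1)$-st binary digit of $x$ is $0$ and $-1$ when it is $1$, which makes part (ii) easy to probe numerically. A small illustrative check (not part of the exercise):

```python
import random

# With the Haar convention of Example 5.30, R_n(x) = 1 - 2*b_{n+1}(x),
# where b_{n+1}(x) is the (n+1)-st binary digit of x.  Under Lebesgue
# measure these are fair +/-1 signs; independence shows up in the
# product moment E[R_0 R_3] = E[R_0] E[R_3] = 0, checked empirically.
def rademacher(n, x):
    return 1 - 2 * (int(x * 2 ** (n + 1)) % 2)

rng = random.Random(2)
pairs = [(rademacher(0, x), rademacher(3, x))
         for x in (rng.random() for _ in range(100_000))]
mean_prod = sum(a * b for a, b in pairs) / len(pairs)
print(mean_prod)  # close to 0
```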
Exercise 5.14. Consider the baker's transform $B:[0,1]^2\to[0,1]^2$,
$$B(x,y)=\begin{cases}\big(q(2x),\,q(y/2)\big),&x\le1/2,\\[2pt]\big(q(2x),\,q((y+1)/2)\big),&x>1/2,\end{cases}$$
where $q(t)$ denotes the fractional part of the real number $t$, $q(t)=t-\lfloor t\rfloor$. Prove that $B$ is mixing with respect to the Lebesgue measure. Consider the map $\Phi:\{0,1\}^{\mathbb{Z}}\to[0,1]^2$ given by $\Phi(u)=\big(x(u),y(u)\big)$,
$$x(u)=\sum_{n=0}^{\infty}\frac{u(-n)}{2^{n+1}},\qquad y(u)=\sum_{n=1}^{\infty}\frac{u(n)}{2^n}.$$
Denote by $\pi$ the uniform measure on $\{0,1\}$ and by $\pi^\infty$ the induced product measure on $\{0,1\}^{\mathbb{Z}}$.
(i) Prove that $\Phi_\#\pi^\infty=\lambda$, where $\lambda$ is the Lebesgue measure on the square $[0,1]^2$.
(ii) Show that $B\circ\Phi=\Phi\circ\Theta$, where $\Theta$ is the shift defined in Exercise 5.13.
(iii) Prove that the baker's transform is mixing with respect to the Lebesgue measure. □
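As a warm-up for (i), one can check numerically that $B$ preserves Lebesgue measure. The sketch below is an illustration, not part of the exercise:

```python
import random

# Monte Carlo sanity check that the baker's transform preserves area:
# pushing uniform samples through B, the fraction landing in the square
# [0,1/2] x [0,1/2] should stay near its area, 1/4.
def baker(x, y):
    if x <= 0.5:
        return (2 * x) % 1.0, (y / 2) % 1.0
    return (2 * x) % 1.0, ((y + 1) / 2) % 1.0

rng = random.Random(3)
trials = 100_000
hits = 0
for _ in range(trials):
    x, y = baker(rng.random(), rng.random())
    if x <= 0.5 and y <= 0.5:
        hits += 1
print(hits / trials)  # close to 0.25
```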
For $k\in\mathbb{N}$ we set $I_k:=\big(1/(k+1),1/k\big)$. Any $x\in(0,1]$ has a continued fraction decomposition
$$x=[0:a_1:a_2:\cdots]:=0+\cfrac{1}{a_1+\cfrac{1}{a_2+\cdots}},\qquad a_n=a_n(x)\in\mathbb{N}_0,\ \forall n\in\mathbb{N}.$$
(The number $x$ is rational if and only if $a_n=0$ for all $n$ sufficiently large.) Set $[0,1]_*:=[0,1]\setminus\mathbb{Q}$.
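The digits $a_n(x)$ can be computed by iterating the Gauss map $x\mapsto\operatorname{frac}(1/x)$, the dynamical system behind this decomposition. A short illustrative sketch (not part of the text):

```python
from math import sqrt

# Continued fraction digits a_n(x), obtained by iterating the Gauss map
# x -> frac(1/x) and recording the integer parts along the way.
def cf_digits(x, n):
    digits = []
    for _ in range(n):
        if x == 0:
            break
        inv = 1.0 / x
        a = int(inv)
        digits.append(a)
        x = inv - a
    return digits

# The golden-ratio conjugate (sqrt(5)-1)/2 has all digits equal to 1.
print(cf_digits((sqrt(5) - 1) / 2, 8))  # [1, 1, 1, 1, 1, 1, 1, 1]
```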
Appendix A
Definition A.1 (Gamma and Beta functions). The Gamma function is the function
$$\Gamma:(0,\infty)\to\mathbb{R},\qquad \Gamma(x)=\int_0^\infty t^{x-1}e^{-t}\,dt. \tag{A.1.1}$$
The Beta function is the function of two positive variables
$$B(x,y):=\frac{\Gamma(x)\Gamma(y)}{\Gamma(x+y)},\qquad x,y>0. \tag{A.1.2}$$
□
We gather here a few basic facts about the Gamma and Beta functions used in
the text. For proofs we refer to [100, Chap. 1] or [158, Chap. 12].
(i) Γ(1) = 1.
(ii) Γ(x + 1) = xΓ(x), ∀x > 0.
(iii) For any $n=1,2,\dots$ we have
$$\Gamma(n)=(n-1)!. \tag{A.1.3}$$
(iv) $\Gamma(1/2)=\sqrt{\pi}$.
(v) For any $x,y>0$ we have Euler's formula
$$B(x,y)=\int_0^1 s^{x-1}(1-s)^{y-1}\,ds=\int_0^\infty\frac{u^{x-1}}{(1+u)^{x+y}}\,du. \tag{A.1.4}$$
□
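These identities are convenient to sanity-check numerically; a short sketch using Python's `math.gamma` (an illustration, not part of the text):

```python
import math

# Spot-checks of properties (i)-(iv) and of definition (A.1.2).
assert abs(math.gamma(1.0) - 1.0) < 1e-12                    # (i)
assert abs(math.gamma(4.5) - 3.5 * math.gamma(3.5)) < 1e-9   # (ii)
assert abs(math.gamma(6.0) - math.factorial(5)) < 1e-9       # (iii)
assert abs(math.gamma(0.5) - math.sqrt(math.pi)) < 1e-12     # (iv)

def beta(x, y):
    # Definition (A.1.2)
    return math.gamma(x) * math.gamma(y) / math.gamma(x + y)

print(beta(2, 3))  # Gamma(2)Gamma(3)/Gamma(5) = 2/24 = 1/12
```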
If we make the change of variables $s=\frac{x}{\sqrt{2}}$, so that $s^2=\frac{x^2}{2}$ and $ds=\frac{1}{\sqrt{2}}\,dx$, then we deduce
$$\sqrt{\pi}=\frac{1}{\sqrt{2}}\int_{-\infty}^{\infty}e^{-\frac{x^2}{2}}\,dx.$$
The function $\Gamma(x)$ grows very fast as $x\to\infty$. Its asymptotics is governed by Stirling's formula
$$x\,\Gamma(x)\sim\sqrt{2\pi x}\left(\frac{x}{e}\right)^{x}\ \text{as}\ x\to\infty. \tag{A.1.6}$$
Note that for $n\in\mathbb{N}$ the above estimate reads
$$n!\sim\sqrt{2\pi n}\left(\frac{n}{e}\right)^{n}\ \text{as}\ n\to\infty. \tag{A.1.7}$$
There are very sharp estimates for the ratio
$$q_n=\frac{n!}{\sqrt{2\pi n}\,\big(\frac{n}{e}\big)^{n}}:\qquad \frac{1}{12n+1}<\log q_n<\frac{1}{12n}. \tag{A.1.8}$$
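These bounds are tight enough to be worth verifying numerically; a sketch (an illustration, not part of the text):

```python
import math

# Check of the sharp Stirling bounds (A.1.8):
#   1/(12n+1) < log q_n < 1/(12n),  q_n = n! / (sqrt(2 pi n) (n/e)^n).
# log q_n is computed via lgamma to avoid overflow for large n.
def log_qn(n):
    return (math.lgamma(n + 1)
            - 0.5 * math.log(2 * math.pi * n)
            - n * (math.log(n) - 1))

for n in (1, 2, 5, 10, 100):
    assert 1 / (12 * n + 1) < log_qn(n) < 1 / (12 * n)
print(log_qn(10))  # slightly below 1/120
```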
We denote by $\omega_n$ the volume of the $n$-dimensional Euclidean unit ball
$$B^n:=\big\{x\in\mathbb{R}^n;\ \|x\|\le1\big\},\qquad \|x\|=\sqrt{x_1^2+\cdots+x_n^2},$$
and by $\sigma_{n-1}$ the area of the unit sphere
$$S^{n-1}=\big\{x\in\mathbb{R}^n;\ \|x\|=1\big\}.$$
Then
$$\sigma_{n-1}=\frac{2\,\Gamma(1/2)^n}{\Gamma(n/2)},\qquad \omega_n=\frac{1}{n}\sigma_{n-1}=\frac{\Gamma(1/2)^n}{\Gamma\big(\frac{n}{2}+1\big)}. \tag{A.1.9}$$
$$X\sim\operatorname{Bin}(n,p)\iff\mathbb{P}[X=k]=\binom{n}{k}p^kq^{n-k},\quad k=0,1,\dots,n,\ q=1-p.$$
$$X\sim\operatorname{NegBin}(k,p)\iff\mathbb{P}[X=n]=\binom{n-1}{k-1}p^kq^{n-k},\quad n=k,k+1,\dots$$
$$X\sim\operatorname{HGeom}(w,b,n)\iff\mathbb{P}[X=k]=\frac{\binom{w}{k}\binom{b}{n-k}}{\binom{w+b}{n}},\quad k=0,1,\dots,w.$$
$$X\sim\operatorname{Poi}(\lambda),\ \lambda>0\iff\mathbb{P}[X=n]=e^{-\lambda}\frac{\lambda^n}{n!},\quad n=0,1,\dots$$
$$X\sim\operatorname{Unif}(a,b)\iff\mathbb{P}_X=\frac{1}{b-a}\,\mathbb{I}_{[a,b]}\,dx.$$
$$X\sim N(\mu,\sigma^2),\ \mu\in\mathbb{R},\ \sigma>0\iff\mathbb{P}_X=\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}\,dx,\quad x\in\mathbb{R}.$$
$$X\sim\operatorname{Gamma}(\nu,\lambda)\iff p_X(x)=\frac{\lambda^\nu}{\Gamma(\nu)}x^{\nu-1}e^{-\lambda x}\,\mathbb{I}_{[0,\infty)}\,dx.$$
$$X\sim\operatorname{Beta}(a,b)\iff p_X=\frac{1}{B(a,b)}x^{a-1}(1-x)^{b-1}\,\mathbb{I}_{(0,1)}\,dx.$$
$$X\sim\operatorname{Stud}_p\iff p_X=\frac{1}{\sqrt{p\pi}}\,\frac{\Gamma\big(\frac{p+1}{2}\big)}{\Gamma\big(\frac{p}{2}\big)}\,\frac{1}{\big(1+x^2/p\big)^{(p+1)/2}}\,dx,\quad x\in\mathbb{R}.$$
A.3 A glimpse at R
find very useful is "The Book of R", [38]. Often I ask Google how to do this or that in R and I receive many satisfactory solutions.
x<-c(1,2,4.5)
x[k]
The command
x[j:k]
will generate all the entries of x from the j-th to the k-th. If you want to add an entry to x, say you want to generate the longer vector (1, 2, 4.5, 7), use the command
c(x,7)
For long vectors this approach can be time consuming. The process of describing
vectors can be accelerated if the entries of the vector x are subject to patterns. For
example, the vector of length 22 with all entries equal to the same number, say 1.5,
can be generated using the command
rep(1.5, 22)
To generate the vector listing in increasing order all the integers between −2
and 10 (included) use the command
(-2):10
sum(x)
To add all the natural numbers from 50 to 200 use the command
sum(50:200)
The result is 18,875.
You can sort the entries of a vector, if they are numerical. For example
> z<-c(1,4,3)
> sort(z)
[1] 1 3 4
A very convenient feature of working with vectors in R is that the basic algebraic operations involving numbers extend to vectors, componentwise. For example, if z is the above vector, and y = (1, 8, 9), then the command y/z returns (1/1, 8/4, 9/3) = (1, 2, 3), while z^2 returns (1, 16, 9). □
sum(x<5)
Example A.5 (Functions in R). One can define and work with functions in R.
For example, to define the function
f (q) = 1 + 6q + 10q 2 (1 − q)4
function1<-function(x){x^2}
function2<-function(x){1-cos(x)}
curve(function1, col=1)
curve(function2, col=2, add=TRUE)
Above col stands for “color”. When this option is used different graphs are
depicted in different colors.
Here is how we define in R the indicator function of the unit disc in the plane,
$$\mathbb{I}_D(x,y)=\begin{cases}1,&x^2+y^2\le1,\\0,&x^2+y^2>1.\end{cases}$$
indicator<-function(x,y) as.integer(x^2+y^2<=1)
Example A.6 (Samples with replacement). For example, to sample with re-
placement 7 balls from a bin containing balls labeled 1 through 23 use the R com-
mand
sample(1:23,7, replace=TRUE)
sample(1:6,137, replace=TRUE)
□
Example A.7 (Rolling a die). Let us show how to simulate rolling a die a num-
ber n of times and then count how many times we get 6. Suppose n = 20. We
indicate this using the command
n<-20
We now roll the die n times and store the results in a vector x
x<-sample(1:6, n, replace=TRUE)
Next we test which of the entries of x are equal to 6 and store the results of these
20 tests in a vector y
y<-x==6
The entries of y are TRUE or FALSE, depending on whether the corresponding entry of x was equal to 6 or not. To find how many entries of y are TRUE use the command
sum(y)
The result is equal to the number of 6s we got during the string of 20 rolls of a fair
die.
We can visualize data. Suppose we roll a die a large number N = 12000 of times. For each 1 ≤ k ≤ N we denote by z(k) the fraction of the first k rolls when we rolled a 6. For k → ∞ the Law of Large Numbers states that this frequency should approach 1/6. The vector z can be generated in R using the commands
N<-12000
x<-sample(1:6, N, replace=TRUE)
z<-cumsum(x==6)/(1:N)
Above, cumsum stands for "cumulative sum". The input of this operator is a numerical vector $x=(x_1,\dots,x_n)$. The output is a numerical vector $s$ of the same dimension, with $s_k=x_1+\cdots+x_k$. We can visualize the fluctuations of z(k) around the expected value 1/6 using the R code
(i) The geometric distribution in R is slightly different from the one described in this book. In R, the range of a Geom(p) variable T is {0, 1, ...} and its pmf is P[T = n] = p(1−p)^n. In this book, a geometric random variable has range {1, 2, ...} and its pmf is P[T = n] = p(1−p)^{n−1}; see Example A.14.
(ii) In R the equality nbinom(k, p) = n represents the number of failures until we register the k-th success; see Example A.15.
The above commands by themselves mean nothing if they are not accompanied
by one of the prefixes
You can learn more details using R’s help function. The examples below describe
some concrete situations.
Example A.13 (Binomial). For example, suppose that X ∼ Bin(10, 0.2), i.e., X
is the number of successes in a sequence of 10 independent Bernoulli trials with
success probability 0.2.
To find the probability P(X = 3) use the R command
dbinom(3,10,0.2)
If FX (x) = P(X ≤ x) is the cdf of X, then you can compute FX (4) using the R
command
pbinom(4,10,0.2)
To generate 253 random samples of X use the command
rbinom(253,10,0.2)
To find the 0.8-quantile of X use the R command
qbinom(0.8,10,0.2)
□
unif(min=a, max=b)
exp(rate=lambda)
norm(mean=mu, sd=sigma)
norm
□
Example A.18. Here are some concrete examples. To find the probability density
of exp3 at x = 1.7 use the command
dexp(1.7, 3)
Example A.19 (Gambler's ruin). Consider two players, the first with fortune $a and the second with fortune $b. Set N := a + b. They flip a fair coin: Heads, player 1 gets a dollar from player 2; Tails, player 1 gives a dollar to player 2. The game ends when one of them is ruined. One can simulate this in R using the code
r<-function(a,N){
t<-0
x<-a
v<-c(0,N)
while(all(v!=x)){
f<-sample(0:1,1, replace=TRUE)
x<-x+(2*f-1)
t<-t+1
}
y<-c(x,t)
y
}
The output is a two-dimensional vector. Its first entry is the fortune of the first player at the end of the game, while the second entry is the duration of the game, i.e., the number of coin flips until one of them is ruined.
To compute the winning probability of the first player and the expected duration
of a game we can use the Law of Large Numbers and run a large number G of games
empiric_r<-function(G,a,N){
P<-c()
T<-c()
for(i in 1:G){
g<-r(a,N) # play one game and record both of its outputs
P<-c(P,g[1])
T<-c(T,g[2])
}
c(sum(P==N)/G,sum(T)/G)
}
For example, to run G = 1200 games with the first player's initial fortune a = 8 and the combined fortune of the two players N = 15, use the command
empiric_r(1200,8,15)
The output is a two-dimensional vector. Its first entry describes the fraction of
the G games won by the first player, and the second entry is the average duration
of these G games.
One can also visualize a game. The code below produces a vector whose entries
describe the evolution of the fortune of the first player.
rgr<-function(a,N){
x<-a
z<-c(a)
v<-c(0,N)
while(all(v!=x)){
f<-sample(0:1,1,replace=TRUE)
x<-x+(2*f-1)
z<-c(z,x)
}
z
}
For given values of N and a, say N = 25, a = 12, one can visualize the evolution of the fortune of the first player using the code below. Its output is a graph similar to the one in Figure A.2.
N<-25
a<-12
u<-rgr(a,N)
l<-length(u)-1
plot(0:l, u,type="l", xlab="# of flips",
ylab="the fortune of the first player",ylim=c(0,N))
abline(h=c(0,N),col=c("red","red") )
□
Example A.20 (Buffon’s needle problem). The R program below uses the
Buffon needle problem (see Exercise 1.22) to find an approximation of π.
for (i in 1:N){
y<-runif(1, min=-1/2,max=1/2) #this locates
# the center of the needle
t<-runif(1, min=-pi/2,max=pi/2)#this determines
□
Example A.21 (Monte Carlo). The R-command lines below implement the
Monte Carlo strategy for computing a double integral over the unit square
The next code describes a Monte-Carlo computation of the area of the unit
circle.
□
Example A.22. Suppose that we have a probability distribution prob on the al-
phabet {1, 2, . . . , L}. One experiment consists of sampling the alphabet according to
the distribution prob until we first observe the given word (or pattern) patt. The
following R-routine performs m such experiments and returns an m-dimensional
vector f whose components are the cumulative means of the waiting times
k
1X
fk = Tj , k = 1, . . . , m,
k j=1
Tpatt_unif<-function(patt, m, L){
k<-length(patt)
T<-c()
for (i in 1:m){
x<-sample(1:L,k,replace=TRUE)
n<-k
while ( all(x[(n-k+1):n]==patt)==0 ){
x<-c(x, sample(1:L,1,replace=TRUE) )
n<-n+1
}
T<-c(T,n)
}
f<-cumsum(T)/(1:m)
f
}
In the uniform case, the expected waiting time to observe the pattern patt can be determined using the routine below that relies on the identity (3.1.11) in Example 3.31. (The original listing initialized t with 2^n and tested overlaps with any; for an alphabet of size L the base term is L^n, and an overlap of length j requires all the compared entries to match.)
tau<-function(patt,L){
n<-length(patt)
m<-n-1
t<-L^n
for (i in 1:m){
j<-n-i
k<-i+1
# add L^(n-i) when the length-j prefix equals the length-j suffix
t<-t+ all(patt[1:j]==patt[k:n])*L^(n-i)
}
t
}
Bibliography
D. Aldous: Exchangeability and related topics, École d'été de Probabilités de Saint-Flour XIII-1983, pp. 2–199, Lect. Notes in Math., vol. 1117, Springer Verlag, 1985.
D. Aldous: Probability Theory, Course notes, Spring 2017.
https://www.stat.berkeley.edu/~aldous/205B/chewi_notes.pdf
V. I. Arnold, A. Avez: Ergodic Problems of Classical Mechanics, Addison Wesley, 1968.
R. B. Ash: Probability and Measure Theory, (with contributions from C. Doléans-Dade),
2nd Edition, Academic Press, 2000.
S. Asmussen: Applied Probability and Queues, 2nd Edition, Stoch. Modelling and Appl.
Probab., vol. 51, Springer Verlag, 2003.
K. B. Athreya, P. E. Ney: Branching Processes, Springer Verlag, 1972.
S. Banach: Über die Baire'sche Kategorie gewisser Funktionenmengen, Studia Mathematica, 3(1931), 174–179.
P. Bamberg, S. Sternberg: A Course in Mathematics for Students of Physics, vol. 2,
Cambridge University Press, 1990.
R. N. Bhattacharya, E. C. Waymire: Stochastic Processes with Applications, SIAM, 2009.
R. N. Bhattacharya, E. C. Waymire: A Basic Course in Probability Theory, 2nd Edition,
Springer Verlag, 2016.
P. Billingsley: Ergodic Theory and Information, John Wiley & Sons, 1965.
P. Billingsley: Convergence of Probability Measures, 2nd Edition, John Wiley & Sons,
1999.
N. H. Bingham: Fluctuation theory for the Ehrenfest Urn, Adv. Appl. Prob. 23(1991),
598–611.
D. Blackwell, D. Freedman: The tail σ-field of a Markov chain and a theorem of Orey,
Ann. Math. Statist., 35(1964), 1291–1295.
V. I. Bogachev: Measure Theory. Vol. 1, Springer Verlag, 2007.
R. Bott: On induced representations, in volume The Mathematical Heritage of Hermann
Weyl, Proc. Symp. Pure Math., vol. 48, Amer. Math. Soc., 1988
S. Boucheron, G. Lugosi, P. Massart: Concentration Inequalities. A Nonasymptotic Theory
of Independence, Oxford University Press, 2013.
N. Bourbaki: General Topology, Part 2, Hermann, 1966.
L. Breiman: Probability, SIAM, 1992.
P. Brémaud: Markov Chains, Gibbs Fields, Monte Carlo Simulations and Queues, Springer
Verlag, 1999.
H. Weyl: Über die Gleichverteilung von Zahlen mod Eins, Math. Ann., 77(1916), 313–352.
E. T. Whittaker, G. N. Watson: A Course in Modern Analysis, 4th Edition, Cambridge
University Press, 1950.
H. S. Wilf: Generatingfunctionology, Academic Press, 1994.
D. Williams: Probability with Martingales, Cambridge University Press, 1991.
Index
effective function
conductance, 421 Beta, 507
resistance, 421 convex, 47
Ehrenfest urn, 375, 383, 394, 433, 445, elementary, 9
450, 473 Gamma, 507
electric network, 414 strictly convex, 47
cutting, 426
shorting, 426 Galton-Watson process, 263, 290, 300
empirical gambler’s ruin, 307, 375, 390, 519
distribution, 200 Gamma function, 66, 507
process, 204 Gaussian
empirical gap, 119 Hilbert space, 224
entropy, 48, 164, 252, 472 measure, 65, 218, 254
information, 164, 193 covariance form, 218
rate, 472 process, 220
relative, 252 centered, 220
Shannon, 164, 193 random function, 222
equidistribution, 492 random variables, 65
ergodic map, 480 regression, 256
ess sup, 38 vector, 218, 254, 255
Euler means, 406 white noise, 223, 364
event, 14 gaussian
almost sure, 14 measure
exchangeable, 326 centered, 219
improbable, 14 graph, 265
permutable, 326 locally finite, 265
exchangeable, 325 random walk on, 265
sequence, 325, 331, 477
expectation, 42 Haar
exponential martingale, 346 basis, 498
extinction functions, 498, 503
event, 291, 300 Hamming distance, 285, 445
probability, 291 harmonic function, 265, 408, 420
Hermite polynomials, 139, 251, 346
Fatou’s lemma, 35 hitting time, 269, 336, 385, 408, 420
filtration, 260 HMC, 369
complete, 335 Laplacian of, 408
right-continuous, 335 reversible, 393
usual, 335 time reversed, 393
flow hypothesis class, 212
Kirchhoff, 422 PAC learnable, 213
formula
Bayes’, 30 independence
Fourier inversion, 443 conditional, 110
Stirling, 390, 392 independency, 21
Stirling’s, 508 independent
Viète, 246 events, 21
Wald, 137, 305, 306 families, 21
formula Stirling, 244 random variables, 21
Fourier transform, 179, 443 indicator function, xi
inequality law
Azuma, 209, 280, 282, 284 of rare events, 63
Bonferroni, 60 of total probability, 26
motivic, 60 law of total probability, 122
Cauchy-Schwartz, 90 lazy chain, 406, 460
Chebyshev, 49 Lebesgue
Doob’s Lp , 318, 348 measurable, 72
Doob’s maximal, 317, 347 measure, 72
Doob’s upcrossing, 287, 289, 349 Lebesgue integral, 32
Gibbs, 165, 193 Lebesgue measure, 19
Hölder, 38, 191 lemma
Hoeffding, 195, 208, 280, 285 ‘sooner-rather-than-later’, 276, 307
Jensen, 47, 164 Borel-Cantelli, 83–85, 157, 227, 290, 338
Kolmogorov’s maximal, 153 first, 84
Markov, 34 second, 84, 356, 358
McDiarmid, 285 Fatou, 36, 178, 179, 289, 296, 303, 323,
Mills ratio, 66, 140, 231 349, 411
Minkowski, 38 Fekete, 89, 283
infinitely divisible Hoeffding, 196, 281
distribution, 249 Kronecker, 358
random variable, 249 Kronecker’s, 158
integrable, 32 maximal, 488
invariance principle, 232 Sauer, 210
invariant Scheffé’s, 298
distribution, 392, 397 likelihood, 30
function, 478 likelihood ratio, 320
measure, 392 log-normal distribution, 142
set, 478 logistic map, 503
irreducible longest common subsequence problem, 88
HMC, 380 Lusin space, see space
set, 380 Lyapunov function, 407
coercive, 410
joint probability distribution, 76
map
kernel, 111 measurable, 6
disintegration, 120 Markob property
Markovian, 112 strong, 389
probability, 112 Markov
pullback, 112 chain, 368
push-forward by, 112 aperiodic, 383, 406
Kirchhoff current, 418 irreducible, 380
potential of, 418 null recurrent, 397
Kirchhoff’s laws, 416, 418 positively recurrent, 397, 400, 411
Kolmogorov sequence, 481 recurrent, 388, 410, 435
Koopman operator, 483 reversible, 393, 413, 441
Kullback-Leibler divergence, 192, 253 transient, 388, 409, 435
path space, 371
L-process, 333 Markov property, 368, 373, 385, 395
Lévy’s martingale, 409 strong, 385, 387–389, 400, 401
Laplacian, 408, 451 martingale, 260–264, 266, 268, 270, 272,
Optional Sampling, 272, 275, 278, 290, UI, 293, 294, 296, 298, 302–305, 307, 311,
302–305, 307, 310, 317 319
Optional Stopping, 270, 292, 302 uniform integrability, 293
Perron-Frobenius, 439 unimodality, 62
portmanteau, 172 union bound, 62
Radon–Nicodym, 98 upcrossing, 286
Radon–Nikodym, 37 number, 286
Raleigh, 426 usual conditions, 335, 338, 347
Riesz Representation, 40, 117
Slutsky, 175, 216, 252 vague convergence, 170
Strong Law of Large Numbers, 157, Vapnik-Chervonenkis, see VC
321, 330, 492 variance, 49
submartingale convergence, 289, 292, variation distance, 403
409, 411 VC
Tikhonov’s compactness, 131 dimension, 210
Wald’s formula, 305 family, 210
weak law of large numbers, 160
Weyl’s equidistribution, 494 waiting time, 80
theorem Kolmogorov continuity, 226 walk, 379
tight family, 248 weak convergence, 170
time weakly mixing, 500
hitting, 269 weight function, 15
optional, 334 Wiener
stopping, 269, 334 integral, 225, 257, 364
transience class, 388 measure, 232
transient state, 386 process, 225
transition matrix, 369 WLLN, 160
n-th step, 370
locally finite, 407 zero-one
tree, 436 algebra, 25, 480, 481, 493
radially symmetric, 436 event, 25, 480