
A Graduate Course in Probability


Other World Scientific Titles by the Author

Lectures on the Geometry of Manifolds


ISBN: 978-981-02-2836-1

Lectures on the Geometry of Manifolds


Second Edition
ISBN: 978-981-270-853-3
ISBN: 978-981-277-862-8 (pbk)

Introduction to Real Analysis


ISBN: 978-981-121-038-9
ISBN: 978-981-121-075-4 (pbk)

Lectures on the Geometry of Manifolds


Third Edition
ISBN: 978-981-121-481-3
ISBN: 978-981-121-595-7 (pbk)

A Graduate Course in Probability


ISBN: 978-981-125-508-3



A Graduate Course in Probability
LIVIU I NICOLAESCU
University of Notre Dame, USA

World Scientific
NEW JERSEY • LONDON • SINGAPORE • BEIJING • SHANGHAI • HONG KONG • TAIPEI • CHENNAI • TOKYO



Published by
World Scientific Publishing Co. Pte. Ltd.
5 Toh Tuck Link, Singapore 596224
USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601
UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE

Library of Congress Control Number: 2022033524

British Library Cataloguing-in-Publication Data


A catalogue record for this book is available from the British Library.

A GRADUATE COURSE IN PROBABILITY

Copyright © 2023 by World Scientific Publishing Co. Pte. Ltd.


All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or
mechanical, including photocopying, recording or any information storage and retrieval system now known or to
be invented, without written permission from the publisher.

For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center,
Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from
the publisher.

ISBN 978-981-125-508-3 (hardcover)


ISBN 978-981-125-509-0 (ebook for institutions)
ISBN 978-981-125-510-6 (ebook for individuals)

For any available supplementary material, please visit


https://www.worldscientific.com/worldscibooks/10.1142/12800#t=suppl

Printed in Singapore



July 19, 2022 15:9 ws-book961x669 An Introduction to Probability 12800-main page v

To my mom.




Introduction

In no other branch of mathematics is it so easy for experts to blunder as in probability theory.
Martin Gardner

I have to confess that my mathematical formation is not that of a probabilist. I am a geometer/analyst by training. About fifteen years ago I stumbled on some probabilistic geometry questions. The ad-hoc methods I used were producing encouraging but unsatisfactory answers. A chance encounter with a trained probabilist led me to a rather advanced monograph dealing with related problems from a probabilistic viewpoint. I spent a sabbatical year learning probability so I could understand that book.
I eventually did understand that book: I was able to phrase the original questions in a better language, and I even answered questions I could not have conceived of before. A "side effect" of this effort was that I got a taste of probability.
To the geometer in me, probabilistic thinking looked (and still looks) like mathematics with a bit more, somewhat like classical mechanics: mathematics with a sprinkle of physical intuition. I find this subject fresh, full of interesting and enticing questions. This is how my probabilistic journey began, and I have been enjoying it since. In the meantime I have matured a bit more by teaching probability at both the undergraduate and graduate levels. This book partially reflects this personal journey.
Probability theory has grown out of many concrete examples and questions, and I firmly believe that probabilistic thinking can only be grasped through examples. Compared to other mathematical areas I am familiar with, probability contains an unusually large number of counterintuitive results. To me, these represent one of the attractive features of the subject. So a substantial part of this book is devoted to examples, some truly fundamental and quite a few more esoteric, but aesthetically very pleasing and pedagogically very revealing. Some of these examples are recurring, appearing in many places in the text, and, as we develop more and more sophisticated technology, we dig deeper and deeper into them.
While teaching probability I discovered that probabilistic simulations enhance the understanding of probabilistic thinking. That is why I have included a brief introduction to R and a few simple scripts that allow one to do basic Monte-Carlo simulations. I hope I can tempt the reader to try a few of these and be amazed, as my students and I were, at the remarkable agreement between practice and theory.
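The book's simulation snippets are written in R; purely as an illustration of the kind of Monte-Carlo experiment meant here, the following is a Python sketch (all names in it are mine, not the book's) estimating π by random sampling:

```python
import random

def monte_carlo_pi(n_samples, seed=1):
    """Estimate pi by sampling points uniformly in the unit square and
    counting the fraction that lands inside the quarter disk x^2 + y^2 <= 1."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # The area of the quarter disk is pi/4, so scale the hit fraction by 4.
    return 4.0 * inside / n_samples

if __name__ == "__main__":
    # The estimates drift toward pi = 3.14159... as the sample size grows.
    for n in (100, 10_000, 1_000_000):
        print(n, monte_carlo_pi(n))
```

By the Law of Large Numbers discussed in Chapter 2, the estimate converges almost surely to π as the number of samples grows.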
I have divided the book into five chapters. The first one concentrates on the measure theoretic foundations of probability; its theoretical part is essentially the content of Kolmogorov's foundational monograph. I assume that the reader is familiar with measure theory and integration. I survey this subject and present complete proofs only of results that have important probabilistic applications or significance.
The first genuinely probabilistic concept is that of independence and I prove
early on Kolmogorov’s zero-one theorem. It is a striking all-or-nothing result and
its deeper implications are gradually revealed in the later parts of the book. The
ubiquitous concept of random variable and its numerical characteristics are dis-
cussed in detail. Along the way I discuss the various modes of convergence of
random variables. I made sure the reader has the opportunity to see these ideas
at work, so I present many classical random variables and some of their probabilistic occurrences. Among the classical problems/themes I discuss, I should mention the inclusion-exclusion principle, sieves and Poissonization, Poisson processes, the coupon collector problem, and the longest common subsequence problem.
Section 4, one of the largest of this chapter, is devoted to the concept of condi-
tional expectation, a central probabilistic concept that takes some getting used to.
Analytically, the existence of conditional expectation is a simple consequence of the
Radon-Nikodym theorem. This, however, hides its probabilistic significance. I opted
for the more involved approach that reveals the meaning of this object as the best
predictor given certain information.
To get to the heart of the rather subtle concept of conditional expectation I tried
to present many examples, from simple computations to more sophisticated appli-
cations to stochastic optimization problems such as the classical secretary problem.
I spend considerable time on the concepts of kernels (a.k.a. random measures), regular conditional distributions, and disintegration of measures, describing the various connections between them. I opted to only sketch the proof of the existence of
regular conditional distributions since I felt that the missing details add little to
the understanding of this important concept. Instead, I have included a large and
varied number of concrete examples to give the reader a better feel for this concept.
The last section of this chapter is an introduction to stochastic processes. The
central result of this section is Kolmogorov’s existence/consistency theorem that
guarantees that various objects discussed in the previous sections do indeed have
a mathematical existence. I decided to present a complete proof of this result so
the reader can see the source of this existence, namely Tikhonov’s compactness
theorem, a result that is deeply rooted in the foundations of mathematics.

Chapter 2 is devoted to a major theme in probability, the law of large numbers


and its relatives. The first section is devoted to the Strong Law of Large Numbers.
I present Kolmogorov’s proof that reduces this result to the convergence of random
series with independent summands. I find the Law of Large Numbers philosophically
surprising since it extracts order out of chaos. The Monte Carlo method is one
convincing manifestation of this order-out-of-chaos phenomenon. I could not pass up the opportunity to introduce the concept of entropy and its application, via the law of
large numbers to coding/compression of data. The second section is devoted to the
central limit theorem.
The third section is devoted to concentration inequalities. We describe the basics
of Chernoff’s estimates and produce a few fundamental concentration inequalities.
As an application we discuss the Johnson-Lindenstrauss lemma, stating that the geometry of a cloud of points in a high-dimensional vector space is, with high confidence, little disturbed by an orthogonal projection onto a random subspace of much smaller
dimension.
Section 4 is devoted to more modern considerations, namely uniform limits of
empirical processes. The Glivenko-Cantelli theorem is the pioneering result in this direction.
I also discuss more recent results showing how this uniform convergence can be
obtained by combining the concentration results in the previous section and the
concept of VC-families/dimension. I briefly describe the significance of such results
to PAC-learning, a concept central to machine learning.
The last section of this chapter is a brief introduction to the theory of Brownian
motion. I used it as an opportunity to discuss more concepts and results involv-
ing stochastic processes such as Gaussian processes and Kolmogorov’s continuity
theorem.
Chapter 3 is devoted to the castle that J. L. Doob built, namely the theory of
(sub)martingales, discrete and continuous. I present in detail the theoretical pillars
of this edifice: stopping/sampling, asymptotic behavior, maximal inequalities and I
discuss a large and diverse collection of examples: occurrence of patterns, Galton-Watson processes, optimal gambling strategies, the Azuma and McDiarmid inequalities and their applications to combinatorial optimization problems, backwards martingales, exchangeable sequences, de Finetti's theorem, asymptotics in Pólya's urn problem, and Brownian motion.
Chapter 4 is an introduction to Markov chains. This beautiful and rich subject is still active and growing, and has many applications and ramifications. The first three sections are devoted to the "classical" part of this subject and culminate with the
law of large numbers for such stochastic processes. Section 4 is devoted to a more
recent (1950’s) point of view, namely the connection between reversible Markov
chains and electrical networks. I adopt a more geometric approach based on the old
observations of H. Weyl and R. Bott (see [16]) that Kirchhoff's laws have a Hodge theoretic description. The last section is devoted to finite Markov chains; I describe various ways of estimating the rate of convergence of irreducible recurrent Markov chains. The chapter ends with a brief discussion of Markov Chain Monte Carlo methods.
The last chapter of the book is the shortest and is devoted to the classical ergodic
theorems. I have included it because I felt I owed it to the reader to highlight a
principle that unifies and clarifies the main limit theorems in Chapters 2 and 4.
As the title indicates, this book is meant as an introduction to the modern, i.e.,
post Kolmogorov’s axiomatization, theory of probability. The reader is assumed
to have some familiarity with measure theory and integration and be comfortable
with the basic objects and concepts of modern analysis: metric/topological spaces,
convergence, compactness. In a few places, familiarity with basic concepts of func-
tional analysis is assumed. It could serve as a textbook for a year-long basic graduate
course in probability. With this purpose in mind I have included a relatively large
number of exercises, many of them nontrivial and highlighting aspects I did not
include in the main body of the text.
The book grew out of notes for a one-semester graduate course in probability
that I taught at the University of Notre Dame. That course covered Chapter 1, the
classical limit theorems (Secs. 2.1–2.3) and discrete time martingales (Secs. 3.1–
3.2). Some of the proofs appear in fine print as a suggestion to the potential
student/instructor that they can be skipped at a first encounter with this subject.
Work on this book has been my constant happy companion during these improb-
able times. I hope I was able to convey my curiosity, fascination and enthusiasm
about probability and convince some readers to dig deeper into this intellectually
rewarding subject.
I want to thank World Scientific for a most professional, helpful and pleasant
collaboration over the years.

Notre Dame, May 2022



Notation and conventions

• We set N := Z>0 , N0 := Z≥0 .


• For n ∈ N we set In := {1, 2, . . . , n}.
• For n ∈ N we denote by Sn the group of permutations of In .
• We set R+ := [0, ∞).
• For x ∈ R we set ⌊x⌋ := max Z ∩ (−∞, x], ⌈x⌉ := min Z ∩ [x, ∞).
• x ∧ y := min(x, y), x ∨ y := max(x, y).

• i := √−1.
• Given a subset A of a set X we denote by A^c its complement (in X).
• For any set X we denote by 2^X the collection of all the subsets of X.
• For any set X we denote by 2^X_0 the collection of all the finite subsets of X.
• We will denote by |S| or #S the cardinality of a set S.
• If T is a topological space, then we denote by BT the σ-algebra of Borel
subsets of T .
• We denote by λ the standard Lebesgue measure on R and by λn the stan-
dard Lebesgue measure on Rn .
• If (Ω, F) is a measurable space and (Ai )i∈I is a collection of subsets of
F, then σ(Ai , i ∈ I) is the smallest sub-σ-algebra of F containing all the
collections Ai .
• For a collection (Xi )i∈I of random variables defined on the same probabil-
ity space we denote by σ(Xi ; i ∈ I) the sub-σ-algebra generated by these
variables.
• Given an ambient set Ω and a subset A ⊂ Ω we denote by I_A : Ω → {0, 1} the indicator function of A, defined by I_A(ω) = 1 if ω ∈ A and I_A(ω) = 0 if ω ∉ A.
• We denote by ω_n the volume of the unit ball in R^n and by σ_{n−1} the “area” of the unit sphere in R^n:

  ω_n = (1/n)·σ_{n−1},  σ_{n−1} = 2Γ(1/2)^n / Γ(n/2).
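These formulas are easy to sanity-check numerically, using Γ(1/2) = √π; the following is an illustrative Python check (not part of the book, function names are mine):

```python
import math

def sphere_area(n):
    """sigma_{n-1}: 'area' of the unit sphere in R^n, via 2*Gamma(1/2)^n / Gamma(n/2)."""
    return 2.0 * math.gamma(0.5) ** n / math.gamma(n / 2.0)

def ball_volume(n):
    """omega_n: volume of the unit ball in R^n, via omega_n = sigma_{n-1} / n."""
    return sphere_area(n) / n

# Known low-dimensional values: circle circumference 2*pi, disk area pi,
# sphere area 4*pi, ball volume 4*pi/3.
print(sphere_area(2), ball_volume(2))
print(sphere_area(3), ball_volume(3))
```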


Contents

Introduction vii

Notation and conventions xi

1. Foundations 1
1.1 Measurable spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.1 Sigma-algebras . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.2 Measurable maps . . . . . . . . . . . . . . . . . . . . . . . 6
1.2 Measures and integration . . . . . . . . . . . . . . . . . . . . . . . 13
1.2.1 Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.2.2 Independence and conditional probability . . . . . . . . . . 21
1.2.3 Integration of measurable functions . . . . . . . . . . . . . 31
1.2.4 Lp spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
1.2.5 Measures on compact metric spaces . . . . . . . . . . . . . 39
1.3 Invariants of random variables . . . . . . . . . . . . . . . . . . . . . 41
1.3.1 The distribution and the expectation of a random variable 41
1.3.2 Higher order integral invariants of random variables . . . . 48
1.3.3 Classical examples of discrete random variables . . . . . . . 53
1.3.4 Classical examples of continuous probability distributions . 64
1.3.5 Product probability spaces and independence . . . . . . . . 69
1.3.6 Convolutions of Borel measures on the real axis . . . . . . 77
1.3.7 Modes of convergence of random variables . . . . . . . . . 83
1.4 Conditional expectation . . . . . . . . . . . . . . . . . . . . . . . . 91
1.4.1 Conditioning on a sigma sub-algebra . . . . . . . . . . . . . 92
1.4.2 Some applications of conditioning . . . . . . . . . . . . . . 102
1.4.3 Conditional independence . . . . . . . . . . . . . . . . . . . 110
1.4.4 Kernels and regular conditional distributions . . . . . . . . 111
1.4.5 Disintegration of measures . . . . . . . . . . . . . . . . . . 120
1.5 What are stochastic processes? . . . . . . . . . . . . . . . . . . . . 123
1.5.1 Definition and examples . . . . . . . . . . . . . . . . . . . . 123
1.5.2 Kolmogorov’s existence theorem . . . . . . . . . . . . . . . 127


1.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133

2. Limit theorems 151


2.1 The Law of Large Numbers . . . . . . . . . . . . . . . . . . . . . . 152
2.1.1 Random series . . . . . . . . . . . . . . . . . . . . . . . . . 152
2.1.2 The Law of Large Numbers . . . . . . . . . . . . . . . . . . 156
2.1.3 Entropy and compression . . . . . . . . . . . . . . . . . . . 163
2.2 The Central Limit Theorem . . . . . . . . . . . . . . . . . . . . . . 169
2.2.1 Weak and vague convergence . . . . . . . . . . . . . . . . . 170
2.2.2 The characteristic function . . . . . . . . . . . . . . . . . . 179
2.2.3 The Central Limit Theorem . . . . . . . . . . . . . . . . . 186
2.3 Concentration inequalities . . . . . . . . . . . . . . . . . . . . . . . 188
2.3.1 The Chernoff bound . . . . . . . . . . . . . . . . . . . . . . 189
2.3.2 Some applications . . . . . . . . . . . . . . . . . . . . . . . 194
2.4 Uniform laws of large numbers . . . . . . . . . . . . . . . . . . . . 200
2.4.1 The Glivenko-Cantelli theorem . . . . . . . . . . . . . . . . 201
2.4.2 VC-theory . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
2.4.3 PAC learning . . . . . . . . . . . . . . . . . . . . . . . . . . 212
2.5 The Brownian motion . . . . . . . . . . . . . . . . . . . . . . . . . 214
2.5.1 Heuristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
2.5.2 Gaussian measures and processes . . . . . . . . . . . . . . . 217
2.5.3 The Brownian motion . . . . . . . . . . . . . . . . . . . . . 225
2.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236

3. Martingales 259
3.1 Basic facts about martingales . . . . . . . . . . . . . . . . . . . . . 260
3.1.1 Definition and examples . . . . . . . . . . . . . . . . . . . . 260
3.1.2 Discrete stochastic integrals . . . . . . . . . . . . . . . . . . 266
3.1.3 Stopping and sampling: discrete time . . . . . . . . . . . . 269
3.1.4 Applications of the optional sampling theorem . . . . . . . 274
3.1.5 Concentration inequalities: martingale techniques . . . . . 280
3.2 Limit theorems: discrete time . . . . . . . . . . . . . . . . . . . . . 286
3.2.1 Almost sure convergence . . . . . . . . . . . . . . . . . . . 286
3.2.2 Uniform integrability . . . . . . . . . . . . . . . . . . . . . 293
3.2.3 Uniformly integrable martingales . . . . . . . . . . . . . . . 298
3.2.4 Applications of the optional sampling theorem . . . . . . . 304
3.2.5 Uniformly integrable submartingales . . . . . . . . . . . . . 310
3.2.6 Maximal inequalities and Lp -convergence . . . . . . . . . . 317
3.2.7 Backwards martingales . . . . . . . . . . . . . . . . . . . . 322
3.2.8 Exchangeable sequences of random variables . . . . . . . . 325
3.3 Continuous time martingales . . . . . . . . . . . . . . . . . . . . . 333
3.3.1 Generalities about filtered processes . . . . . . . . . . . . . 333

3.3.2 The Brownian motion as a filtered process . . . . . . . . . 337


3.3.3 Definition and examples of continuous time martingales . . 345
3.3.4 Limit theorems . . . . . . . . . . . . . . . . . . . . . . . . . 347
3.3.5 Sampling and stopping . . . . . . . . . . . . . . . . . . . . 349
3.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354

4. Markov chains 367


4.1 Markov chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368
4.1.1 Definition and basic concepts . . . . . . . . . . . . . . . . . 368
4.1.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . 375
4.2 The dynamics of homogeneous Markov chains . . . . . . . . . . . . 378
4.2.1 Classification of states . . . . . . . . . . . . . . . . . . . . . 378
4.2.2 The strong Markov property . . . . . . . . . . . . . . . . . 384
4.2.3 Transience and recurrence . . . . . . . . . . . . . . . . . . . 386
4.2.4 Invariant measures . . . . . . . . . . . . . . . . . . . . . . . 392
4.3 Asymptotic behavior . . . . . . . . . . . . . . . . . . . . . . . . . . 400
4.3.1 The ergodic theorem . . . . . . . . . . . . . . . . . . . . . . 401
4.3.2 Aperiodic chains . . . . . . . . . . . . . . . . . . . . . . . . 403
4.3.3 Martingale techniques . . . . . . . . . . . . . . . . . . . . . 407
4.4 Electric networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413
4.4.1 Reversible Markov chains as electric networks . . . . . . . 413
4.4.2 Sources, currents and chains . . . . . . . . . . . . . . . . . 414
4.4.3 Kirchhoff’s laws and Hodge theory . . . . . . . . . . 416
4.4.4 A probabilistic perspective on Kirchhoff’s laws . . . . . . . 420
4.4.5 Degenerations . . . . . . . . . . . . . . . . . . . . . . . . . 424
4.4.6 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . 432
4.5 Finite Markov chains . . . . . . . . . . . . . . . . . . . . . . . . . . 438
4.5.1 The Perron-Frobenius theory . . . . . . . . . . . . . . . . . 438
4.5.2 Variational methods . . . . . . . . . . . . . . . . . . . . . . 450
4.5.3 Markov Chain Monte Carlo . . . . . . . . . . . . . . . . . . 458
4.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 463

5. Elements of Ergodic Theory 475


5.1 The ergodic theorem . . . . . . . . . . . . . . . . . . . . . . . . . . 475
5.1.1 Measure preserving maps and invariant sets . . . . . . . . . 475
5.1.2 Ergodic theorems . . . . . . . . . . . . . . . . . . . . . . . 482
5.2 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 492
5.2.1 Limit theorems . . . . . . . . . . . . . . . . . . . . . . . . . 492
5.2.2 Mixing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 495
5.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 501

Appendix A A few useful facts 507


A.1 The Gamma function . . . . . . . . . . . . . . . . . . . . . . . . . . 507
A.2 Basic invariants of frequently used probability distributions . . . . 509
A.3 A glimpse at R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 510

Bibliography 525
Index 533

Chapter 1

Foundations

At the beginning of the twentieth century probability was in a fluid state. There
was no clear mathematical concept of probability, and ad-hoc methods were used to
rigorously formulate classical questions. Probability at that stage was a collection of
interesting problems in search of a coherent setup. According to Jean Ville, a PhD
student of M. Fréchet, in Paris probability was viewed among mathematicians as “an
honorable pastime for those who distinguished themselves in pure mathematics”.
The whole enterprise seemed to be concerned with concepts that lie outside
mathematics. Henri Poincaré himself wrote that “one can hardly give a satisfactory
definition of probability”. As Richard von Mises pointed out in 1928, the German
word for probability, “wahrscheinlich”, translates literally as “truth resembling”;
see [155]. Bertrand Russell was quoted as saying in 1929 that “Probability is the
most important concept in modern science, especially as nobody has the slightest
notion of what it means”. The philosophical underpinnings of this concept are
discussed even today. For more on this aspect we refer to the recent delightful
book [45].
In his influential 1900 International Congress address in Paris D. Hilbert recog-
nized this state of affairs and the importance of the subject. In the sixth problem of
his famous list of 23 he asks, among other things, for rigorous foundations of prob-
ability. These were laid by A. N. Kolmogorov in his famous 1933 monograph [94].
According to Kolmogorov himself, this was not a research work, but a work of syn-
thesis. A brilliant synthesis I might add. His point of view was universally adopted
and modern probability theory was born. The theory of probability can now be
informally divided into two eras: before and after Kolmogorov.
The present chapter is devoted to this foundational work of Kolmogorov. The
pillars of probability theory are the concept of probability or sample space, ran-
dom variables, independence, conditional expectations, and consistency, i.e., the
existence of random variables or processes with prescribed statistics.
So efficient is his axiomatization that to the untrained eye, probability, as envis-
aged by Kolmogorov, may seem like a slice of measure theory. In a 1963 interview
Kolmogorov complained that his axioms had been so successful on the theoretical side that many mathematicians lost interest in the problems and applications that


were and are the main engines of growth of this subject. I understand his criticism, since I too was one of those mathematicians who were not interested in these applications. Now I know better.
In this chapter I present these pillars of probability theory and prove their main
properties. I have included a large number of detailed examples meant to convey
the subtleties, depth, power and richness of these concepts. No abstract theorem
can capture this richness.
I want to close with a personal anecdote that I find revealing. A few years
ago, at a conference, I had a conversation with J. M. Bismut, a known probabilist
whose mathematical interests were becoming more and more geometric. He noticed
that I was in the middle of a mathematical transition in the opposite direction
and asked me what prompted it. I explained my motivation, how I discovered
that probability is not just a glorious part of measure theory and how much I
struggled to truly understand the concept of conditional expectation, a concept
eminently probabilistic. He smiled and said: “Probability theory is measure theory
plus conditional expectation”. I know it is an oversimplification, but it contains a
lot of truth.

1.1 Measurable spaces

1.1.1 Sigma-algebras
Fix a nonempty set Ω.
Definition 1.1. (a) A collection A of subsets of Ω is called an algebra of Ω if it
satisfies the following conditions

(i) ∅, Ω ∈ A.
(ii) ∀A, B ∈ A, A ∪ B ∈ A.
(iii) ∀A ∈ A, A^c ∈ A.

(b) A collection S of subsets of Ω is called a σ-algebra (or sigma-algebra) of Ω if it is an algebra of Ω and the union of any countable subfamily of S is a set in S, i.e.,

  ∀(An)n∈N ∈ S^N : ⋃n≥1 An ∈ S.  (1.1.1)

(c) A measurable space is a pair (Ω, S), where S is a sigma-algebra of subsets of Ω. The subsets S ∈ S are called (S-)measurable. □

Remark 1.2. To prove that an algebra S is a σ-algebra it suffices to verify (1.1.1) only for increasing sequences of subsets Bn ∈ S. Indeed, if (An)n∈N is an arbitrary family in S, then the new family of sets in S,

  Bn = A1 ∪ · · · ∪ An, n ∈ N,

is increasing and its union coincides with the union of the family (An)n∈N. □
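The reduction in Remark 1.2 can be checked directly on finitely many sets; a small Python illustration (not from the book; the sets An are made up for the example):

```python
# Replacing an arbitrary sequence (A_n) by B_n = A_1 ∪ ... ∪ A_n produces
# an increasing sequence with the same union at every stage.
A = [{1, 4}, {2}, {1, 3}, set()]

B = []
running = set()
for a in A:
    running = running | a  # B_n = A_1 ∪ ... ∪ A_n
    B.append(set(running))

assert all(B[i] <= B[i + 1] for i in range(len(B) - 1))  # the B_n increase
assert set().union(*B) == set().union(*A)                # same total union
print(B)
```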

Example 1.3. (a) The collection 2^Ω of all subsets of Ω is obviously a σ-algebra.
(b) Suppose that S is a (σ-)algebra of a set Ω and F : Ω̂ → Ω is a map. Then the preimage

  F^−1(S) = { F^−1(S) ; S ∈ S }

is a (σ-)algebra of subsets of Ω̂. The σ-algebra F^−1(S) is denoted by σ(F) and it is called the σ-algebra generated by F, or the pullback of S via F. We will often use the more suggestive notation

  {F ∈ S} := F^−1(S) = { ω̂ ∈ Ω̂ ; F(ω̂) ∈ S }.

(c) Given A ⊂ Ω we denote by S_A the σ-algebra generated by A, i.e.,

  S_A = { ∅, A, A^c, Ω }.

We will refer to it as the Bernoulli algebra with success A. Note that S_A is the pullback of 2^{0,1} via the indicator function I_A : Ω → {0, 1}.
(d) If C ⊂ 2^Ω is a family of subsets of Ω, then we denote by σ(C) the σ-algebra generated by C, i.e., the intersection of all σ-algebras that contain C. In particular, if S1, S2 are σ-algebras of Ω, then we set

  S1 ∨ S2 := σ(S1 ∪ S2).

More generally, for any family (Si)i∈I of σ-algebras we set

  ⋁i∈I Si := σ( ⋃i∈I Si ).

(e) Suppose that we are given a countable partition {An}n∈N of Ω,

  Ω = ⊔n∈N An.

The sets An are called the chambers of the partition. Then the σ-algebra generated by this partition is the σ-algebra consisting of all the subsets of Ω that are unions of chambers. This σ-algebra can be viewed as the σ-algebra generated by the map

  X : Ω → N, X = ∑n∈N n·I_An,

so that An = X^−1({n}).
(f) If (Si)i∈I is a family of (σ-)algebras of Ω, then their intersection

  ⋂i∈I Si ⊂ 2^Ω

is a (σ-)algebra of Ω.
(g) If (Ω1, S1) and (Ω2, S2) are two measurable spaces, then we denote by S1 ⊗ S2 the sigma-algebra of Ω1 × Ω2 generated by the collection

  { S1 × S2 : S1 ∈ S1, S2 ∈ S2 } ⊂ 2^{Ω1×Ω2}.

(h) If X is a topological space and T_X ⊂ 2^X denotes the family of open subsets, then the Borel σ-algebra of X, denoted by B_X, is the σ-algebra generated by T_X. The sets in B_X are called the Borel subsets of X. Note that, since any open set in R^n is a countable union of open cubes, we have

  B_{R^n} = (B_R)^{⊗n}.  (1.1.2)

Any finite dimensional real vector space V can be equipped with a topology by choosing a linear isomorphism L : V → R^{dim V}. This topology is independent of the choice of the isomorphism L. It can alternatively be identified as the smallest topology on V such that all the linear maps V → R are continuous. We denote by B_V the sigma-algebra of Borel subsets determined by this topology.
We set R̄ := [−∞, ∞]. As a topological space it is homeomorphic to [−1, 1]. For simplicity we will refer to the Borel subsets of R̄ simply as Borel sets.
(i) If (Ω, S) is a measurable space and X ⊂ Ω, then the collection

  S|X := { S ∩ X : S ∈ S } ⊂ 2^X

is a σ-algebra of X, called the trace of S on X. □
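For finite sample spaces, the generated σ-algebra σ(C) of part (d) can be computed by brute force: close the generating family under complements and pairwise unions until nothing new appears. An illustrative Python sketch (not from the book; the function name is mine):

```python
def generated_sigma_algebra(omega, generators):
    """Return sigma(C) as a set of frozensets, for a finite ambient set omega."""
    omega = frozenset(omega)
    sigma = {frozenset(), omega} | {frozenset(g) for g in generators}
    while True:
        # Close under complements and (here, finite) unions until stable.
        new = {omega - a for a in sigma} | {a | b for a in sigma for b in sigma}
        if new <= sigma:
            return sigma
        sigma |= new

omega = {1, 2, 3, 4}

# Example 1.3(c): a single set A generates the Bernoulli algebra
# {emptyset, A, A^c, Omega}.
print(sorted(map(sorted, generated_sigma_algebra(omega, [{1, 2}]))))

# Example 1.3(e): the partition with chambers {1}, {2}, {3,4} generates
# all 2^3 = 8 unions of chambers.
print(len(generated_sigma_algebra(omega, [{1}, {2}, {3, 4}])))
```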

Remark 1.4. In measure theory and analysis, sigma-algebras lie in the background
and rarely come to the forefront. In probability they play a more prominent role,
owing to how they are interpreted.
One should think of Ω as the collection of all the possible outcomes of a random
experiment. A σ-algebra of Ω can be viewed as the totality of information we can
collect using certain measurements about the outcomes ω ∈ Ω. Let us explain this
vague statement on a simple example.
Suppose we are given a function X : Ω → R and the only thing that we can
absolutely confirm about the outcome ω of an experiment is whether X(ω) ≤ x for
any given x ∈ R. In other words, we can detect by measurements the collection
of sets

    {X ≤ x} := X −1 ( (−∞, x] ), x ∈ R.
In particular, we can detect whether X(ω) > x, i.e., we can detect the sets
{X > x} = {X ≤ x}c . More generally, we can determine the sets
{a < X ≤ b} = {X > a} ∩ {X ≤ b}.
Indeed, we can do this using two experiments: one experiment to decide if X ≤ a
and one to decide if X ≤ b.
We say that a set S is X-measurable if given ω ∈ Ω we can decide by doing
countably many measurements on X whether ω ∈ S. If S1 , . . . , Sn , . . . ⊂ Ω are
known to be X-measurable, then their union is X-measurable. Indeed,
    ω ∈ ⋃_{n∈N} Sn ⇐⇒ ∃n ∈ N : ω ∈ Sn .

Let us observe that the set theoretic conditions imposed on a sigma-algebra have
logical/linguistic counterparts. Thus, the statement
    ω ∈ ⋂_{i∈I} Si

translates into the formula ∀i ∈ I, ω ∈ Si , while the statement

    ω ∈ ⋃_{i∈I} Si

translates into the formula ∃i ∈ I, ω ∈ Si .


Conversely, statements involving the quantifiers ∃, ∀ can be translated into set
theoretic statements.
The information we can gather by such measurements of the function X is encoded
in the sigma-algebra σ(X) = X −1 (BR ) generated by the map X : Ω → (R, BR ). □
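On a finite sample space the closure operations defining a σ-algebra can be carried out by brute force, which makes the preceding discussion concrete. The sketch below (plain Python, with hypothetical helper names, not from the text) generates σ(X) from the level sets {X ≤ x} of the head-count X in two coin flips; note that the resulting σ-algebra cannot separate outcomes with the same value of X.

```python
def generated_sigma_algebra(omega, generators):
    """Close a collection of subsets of a finite set Omega under
    complement and union (countable unions reduce to finite ones here)."""
    omega = frozenset(omega)
    sets = {frozenset(), omega} | {frozenset(g) for g in generators}
    changed = True
    while changed:
        changed = False
        for a in list(sets):
            if (omega - a) not in sets:      # closure under complement
                sets.add(omega - a)
                changed = True
            for b in list(sets):
                if (a | b) not in sets:      # closure under union
                    sets.add(a | b)
                    changed = True
    return sets

# Measurements of X(omega) = number of Heads in two coin flips.
omega = [(0, 0), (0, 1), (1, 0), (1, 1)]
X = lambda w: w[0] + w[1]
level_sets = [{w for w in omega if X(w) <= x} for x in (0, 1, 2)]
sigma_X = generated_sigma_algebra(omega, level_sets)

# sigma(X) cannot separate (0,1) from (1,0): every measurable set
# contains both of these outcomes or neither of them.
assert all(((0, 1) in s) == ((1, 0) in s) for s in sigma_X)
print(len(sigma_X))  # prints 8: one set per union of the fibers {X = k}
```

The eight sets are exactly the unions of the three fibers {X = 0}, {X = 1}, {X = 2}, in agreement with Example 1.10(c) below.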

Definition 1.5. Let C be a collection of subsets of a set Ω. We say that C is a
π-system if it is closed under finite intersections, i.e.,

    ∀A, B ∈ C : A ∩ B ∈ C.

The collection C is called a λ-system if it satisfies the following conditions.

(i) ∅, Ω ∈ C.
(ii) If A, B ∈ C and A ⊂ B, then B \ A ∈ C.
(iii) If A1 ⊂ A2 ⊂ · · · belong to C, then so does their union. □

Note that a collection C is a σ-algebra if it is simultaneously a π and a λ-system.


Since the intersection of any family of λ-systems is a λ-system we deduce that for
any collection C ⊂ 2Ω there exists a smallest λ-system containing C. We denote
this system by Λ(C) and we will refer to it as the λ-system generated by C.

Example 1.6. Suppose that H is the collection of half-infinite intervals

    (−∞, x], x ∈ R.

Then H is a π-system of R. The λ-system generated by H contains all the open
intervals. Since any open subset of R is a countable union of open intervals we
deduce that Λ(H) coincides with the Borel σ-algebra BR . □

Theorem 1.7 (Dynkin’s π − λ theorem). Suppose that P is a π-system. Then


Λ(P) = σ(P).
In other words, any λ-system that contains P, also contains the σ-algebra generated
by P.

Proof. Since any σ-algebra is a λ-system we deduce Λ(P) ⊂ σ(P). Thus it suffices
to show that
σ(P) ⊂ Λ(P). (1.1.3)
Equivalently, it suffices to show that Λ(P) is a σ-algebra. This happens if and only
if the λ-system Λ(P) is also a π-system. Hence it suffices to show that Λ(P) is closed
under (finite) intersections.
Fix A ∈ Λ(P) and set

    LA := { B ∈ 2Ω : A ∩ B ∈ Λ(P) }.

It suffices to show that

    Λ(P) ⊂ LA , ∀A ∈ Λ(P).    (1.1.4)

Observe that LA is a λ-system. Indeed, Ω ∈ LA since A ∈ Λ(P). The properties (ii)
and (iii) in the definition of a λ-system are clearly satisfied since Λ(P) is a λ-system.
Thus, since Λ(P) is the smallest λ-system containing P, to prove (1.1.4) it suffices
to show that

    P ⊂ LA , ∀A ∈ Λ(P).    (1.1.5)

Note that since P is a π-system,

    P ⊂ LA , ∀A ∈ P.

Since LA is a λ-system containing P, this yields

    Λ(P) ⊂ LA , ∀A ∈ P.

Thus, if A ∈ P and B ∈ Λ(P), then B ∈ LA , i.e., A ∩ B ∈ Λ(P), i.e., A ∈ LB . Hence

    P ⊂ LB , ∀B ∈ Λ(P).

This proves (1.1.5) and completes the proof of the π − λ theorem. □
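On a finite Ω both Λ(C) and σ(C) can be computed by exhaustive closure, which gives a quick sanity check of the π − λ theorem. The sketch below uses hypothetical helper names; note that on a finite set condition (iii) for a λ-system is automatic, since an increasing chain of finitely many subsets stabilizes.

```python
def lambda_system(omega, gens):
    """Smallest lambda-system containing gens: close under
    {empty, Omega} and proper differences B \\ A for A a subset of B."""
    omega = frozenset(omega)
    sets = {frozenset(), omega} | {frozenset(g) for g in gens}
    changed = True
    while changed:
        changed = False
        for a in list(sets):
            for b in list(sets):
                if a < b and (b - a) not in sets:
                    sets.add(b - a)
                    changed = True
        # condition (iii) adds nothing: increasing unions stabilize on a finite set
    return sets

def sigma_algebra(omega, gens):
    """Smallest sigma-algebra containing gens: close under complement and union."""
    omega = frozenset(omega)
    sets = {frozenset(), omega} | {frozenset(g) for g in gens}
    changed = True
    while changed:
        changed = False
        for a in list(sets):
            if (omega - a) not in sets:
                sets.add(omega - a)
                changed = True
            for b in list(sets):
                if (a | b) not in sets:
                    sets.add(a | b)
                    changed = True
    return sets

omega = {1, 2, 3, 4}
pi_sys = [{1}, {1, 2}]      # closed under intersections: a pi-system
assert lambda_system(omega, pi_sys) == sigma_algebra(omega, pi_sys)

not_pi = [{1, 2}, {2, 3}]   # {1,2} ∩ {2,3} = {2} is missing: not a pi-system
assert lambda_system(omega, not_pi) != sigma_algebra(omega, not_pi)
```

For the π-system the two closures agree, as Theorem 1.7 predicts; for the non-π-system the λ-system generated is strictly smaller than the σ-algebra.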

1.1.2 Measurable maps

Definition 1.8. A map F : Ω1 → Ω2 is called measurable with respect to the σ-
algebras Si on Ωi , i = 1, 2, or (S1 , S2 )-measurable, if F −1 (S2 ) ⊂ S1 , i.e.,

    F −1 (S2 ) ∈ S1 , ∀S2 ∈ S2 .

Two measurable spaces (Ωi , Si ), i = 1, 2, are called isomorphic if there exists a
bijection F : Ω1 → Ω2 such that F −1 (S2 ) = S1 or, equivalently, both F and its
inverse F −1 are measurable. □

Definition 1.9. Suppose that (Ω, S) is a measurable space. A function
f : Ω → R̄ is called S-measurable if, for any Borel subset B ⊂ R̄, we have
f −1 (B) ∈ S. □

Example 1.10. (a) The composition of two measurable maps is a measurable map.
(b) A subset S ⊂ Ω is S-measurable if and only if the indicator function I S is a
measurable function.
(c) If A is the σ-algebra generated by a finite or countable partition

    Ω = ⨆_{i∈I} Ai , I ⊂ N,

then a function f : Ω → (R, BR ) is A-measurable if and only if it is constant on the
chambers Ai of this partition. □

Proposition 1.11. Consider a map F : (Ω1 , S1 ) → (Ω2 , S2 ) between two measurable


spaces. Suppose that C2 is a π-system of Ω2 such that σ(C2 ) = S2 . Then the
following statements are equivalent.

(i) The map F is measurable.


(ii) F −1 (C) ∈ S1 , ∀C ∈ C2 .

Proof. Clearly (i) ⇒ (ii). The opposite implication follows from the π − λ theorem
since the set

    { C ∈ S2 : F −1 (C) ∈ S1 }

is a λ-system containing the π-system C2 that generates S2 . □

Corollary 1.12. If F : X → Y is a continuous map between topological spaces,


then it is (BX , BY )-measurable.

Proof. Denote by TY the collection of open subsets of Y . Then TY is a π-system


and, by definition, it generates BY . Since F is continuous, for any U ∈ TY the set
F −1 (U ) is open in X and thus belongs to BX . □

Corollary 1.13. Let (Ω, S) be a measurable space. A function X : Ω → R is


(S, BR )-measurable if and only if the sets X −1 ( (−∞, x] ) are S-measurable for any
x ∈ R.

Proof. It follows from the previous corollary by observing that the collection

    { (−∞, x] : x ∈ R } ⊂ 2R

is a π-system and the σ-algebra it generates is BR . □

Corollary 1.14. Consider a pair of maps between measurable spaces


Fi : (Ω, S) → (Ωi , Si ), i = 1, 2.
Then the following statements are equivalent.

(i) The maps Fi are measurable.



(ii) The map

    F1 × F2 : Ω → Ω1 × Ω2 , ω ↦ ( F1 (ω), F2 (ω) )

is (S, S1 ⊗ S2 )-measurable.

Proof. (i) ⇒ (ii) Observe that if the maps F1 , F2 are measurable then

    F1−1 (S1 ), F2−1 (S2 ) ∈ S, ∀S1 ∈ S1 , S2 ∈ S2
    ⇒ (F1 × F2 )−1 (S1 × S2 ) = F1−1 (S1 ) ∩ F2−1 (S2 ) ∈ S, ∀S1 ∈ S1 , S2 ∈ S2 .

Since the collection { S1 × S2 : Si ∈ Si , i = 1, 2 } is a π-system that, by definition,
generates S1 ⊗ S2 , we see that the last statement is equivalent to the measurability
of F1 × F2 .
(ii) ⇒ (i) For i = 1, 2 we denote by πi the natural projection Ω1 × Ω2 → Ωi ,
(ω1 , ω2 ) ↦ ωi . The maps πi are (S1 ⊗ S2 , Si )-measurable and Fi = πi ◦ (F1 × F2 ). □

Definition 1.15. For any measurable space (Ω, S) we denote by L0 (S) = L0 (Ω, S)
the space of S-measurable random variables, i.e., (S, BR̄ )-measurable functions
Ω → R̄.
The subset of L0 (Ω, S) consisting of nonnegative functions is denoted by
L0+ (Ω, S), while the subspace of L0 (Ω, S) consisting of bounded measurable
functions is denoted L∞ (Ω, S). □

Remark 1.16. The algebraic operations on R admit (partial) extensions to R̄:

    c + (±∞) = ±∞, ∞ + ∞ = ∞, c · ∞ = ∞, ∀c > 0.

As we know, there are a few “illegal” operations:

    ∞ − ∞, 0 · ∞, 0/0, etc. □
Proposition 1.17. Fix a measurable space (Ω, S). Then the following hold.

(i) For any X, Y ∈ L0 (Ω, S) and any c ∈ R we have

    X + Y, XY, cX ∈ L0 (Ω, S),

whenever these functions are well defined.
(ii) If (Xn )n∈N is a sequence in L0 (Ω, S) such that, for any ω ∈ Ω, the limit

    X∞ (ω) = lim_{n→∞} Xn (ω)

exists and is finite, then X∞ : Ω → R̄ is also S-measurable.
(iii) If (Xn )n∈N is a sequence in L0 (Ω, S) such that, for any ω ∈ Ω, the quantities

    Y∞ (ω) = inf_{n∈N} Xn (ω), Z∞ (ω) = sup_{n∈N} Xn (ω)

are finite, then Y∞ , Z∞ ∈ L0 (Ω, S).

Proof. (i) Denote by D the subset of R̄2 consisting of the pairs (x, y) for which
x + y is well defined. Observe that X + Y is the composition of two measurable
maps

    Ω → D ⊂ R̄2 , ω ↦ ( X(ω), Y (ω) ), D → R̄, (x, y) ↦ x + y.

Above, the first map is measurable according to Corollary 1.14 and the second map
is Borel measurable since it is continuous. The measurability of XY and cX is
established in a similar fashion.
(ii) We will show that for any x ∈ R the set { X∞ > x } is S-measurable. Note
that

    X∞ (ω) > x ⇐⇒ ∃ν ∈ N, ∃N = N (ω) ∈ N : ∀n ≥ N : Xn (ω) > x + 1/ν.

Equivalently,

    { X∞ > x } = ⋃_{ν∈N} ⋃_{N∈N} ⋂_{n≥N} { Xn > x + 1/ν } ∈ S.

(iii) The proof is very similar to the proof of (ii) so we leave the details to the
reader. □

Corollary 1.18. For any function f ∈ L0 (Ω, S), its positive and negative parts,

    f + := max(f, 0), f − := max(−f, 0),

belong to L0+ (Ω, S).

Proof. The function f + is the composition of the continuous function
x+ = max(x, 0) with f ; similarly, f − = (−f )+ . □

Definition 1.19. A function f ∈ L0 (Ω, S) is called an elementary or step function if
its range is a finite subset of R. We denote by Elem(Ω, S) the set of elementary
functions. □

More concretely, a function f : Ω → R is elementary if there exist finitely many


disjoint measurable sets A1 , . . . , AN ∈ S, and constants c1 , . . . , cN ∈ R such that
    f (ω) = ∑_{k=1}^{N} ck I_{Ak} (ω), ∀ω ∈ Ω.    (1.1.6)

The decomposition (1.1.6) of an elementary function f is not unique. Among the
various decompositions there is a canonical one,

    f = ∑_{r∈R} r I_{f −1 (r)} .

The above sum is finite since f −1 (r) is empty for all but finitely many r’s.
Let us also observe that Elem(Ω, S) is a vector space. Indeed if f0 , f1 are elemen-
tary functions with ranges R0 and respectively R1 , then their sum is measurable

and its range is contained in R0 + R1 . This is a finite set since R0 , R1 are finite.
Clearly the multiplication of an elementary function by a scalar also produces an
elementary function.
Let us observe that any nonnegative measurable function is the limit of an
increasing sequence of elementary functions. For n ∈ N we define
Dn : [0, ∞) → [0, ∞),

    Dn (r) := ∑_{k=1}^{n2^n} ((k − 1)/2^n) I_{[(k−1)2^{−n}, k2^{−n})} (r) + n I_{[n,∞)} (r).

Let us observe that if r ∈ [0, n), then Dn (r) truncates the binary expansion of r
after n digits. E.g., if r ∈ [0, 1) and

    r = 0.ε1 ε2 . . . εn . . . := ∑_{k≥1} εk /2^k , εk ∈ {0, 1},

then

    Dn (r) = 0.ε1 . . . εn .

This shows that (Dn )n∈N is a nondecreasing sequence of functions and

    lim_{n→∞} Dn (r) = r, ∀r ≥ 0.

For f ∈ L0+ (Ω, S) and n ∈ N we define Dn [f ] : (Ω, S) → [0, ∞),

    Dn [f ](ω) := Dn ( f (ω) )
    = ∑_{k=1}^{n2^n} ((k − 1)/2^n) I_{[(k−1)2^{−n}, k2^{−n})} ( f (ω) ) + n I_{[n,∞)} ( f (ω) ).    (1.1.7)

We deduce that the sequence of nonnegative elementary functions Dn [f ] converges
increasingly to f .
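The dyadic truncations Dn of (1.1.7) are easy to compute: for r < n the value Dn (r) is floor(r · 2^n)/2^n, and Dn clamps to n on [n, ∞). A small numerical check of the monotone convergence Dn [f ] ↗ f (an illustrative Python sketch, not from the text):

```python
import math

def D(n, r):
    """Dyadic truncation from (1.1.7): floor(r * 2^n) / 2^n on [0, n),
    clamped to n on [n, infinity)."""
    if r >= n:
        return float(n)
    return math.floor(r * 2**n) / 2**n

f = lambda x: x * x            # a nonnegative measurable function on [0, 2]
for x in [0.0, 0.3, 1.1, 1.9]:
    vals = [D(n, f(x)) for n in range(1, 30)]
    # D_n[f](x) is nondecreasing in n ...
    assert all(a <= b for a, b in zip(vals, vals[1:]))
    # ... and converges to f(x), with error at most 2^(-n) once n > f(x)
    assert abs(vals[-1] - f(x)) < 1e-6
```

The error bound 2^{−n} on [0, n) is exactly the width of the dyadic intervals in the sum (1.1.7).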

Definition 1.20. Let (Ω, S) be a measurable space. A collection M of S-measurable
random variables is called a monotone class of (Ω, S) if it satisfies the following
conditions.

(i) I Ω ∈ M.
(ii) If f, g ∈ M are bounded and a, b ∈ R, then af + bg ∈ M.
(iii) If (fn ) is an increasing sequence of nonnegative random variables in M with
finite limit f∞ , then f∞ ∈ M. □

Theorem 1.21 (Monotone Class Theorem). Suppose that M is a mono-


tone class of the measurable space (Ω, S) and C is a π-system that generates
S and such that I C ∈ M, ∀C ∈ C. Then M contains L∞ (Ω, S) and all the
nonnegative S-measurable functions.

Proof. Observe that the collection

    A := { A ∈ S : I A ∈ M }

is a λ-system containing the π-system C, so A = S by the π − λ theorem. Thus
M contains all the indicators I A , A ∈ S, and hence, by linearity, all the elementary
functions. Since any nonnegative measurable function is an increasing pointwise
limit of elementary functions we deduce that M contains all the nonnegative mea-
surable functions. Finally, if f is a bounded measurable function, then f + , f − are
nonnegative and bounded measurable functions so f + , f − ∈ M and thus

    f = f + − f − ∈ M. □

Definition 1.22. The σ-algebra generated by a collection (Xi )i∈I of real-valued
functions on a set Ω is

    σ( Xi , i ∈ I ) := ⋁_{i∈I} Xi−1 (BR ). □

The next result provides an interpretation of the concept of measurability along


the lines of Remark 1.4.

Theorem 1.23 (Dynkin). Suppose that F : (Ω, S) → (Ω′ , S′ ) is a measurable
map. Let X : Ω → R be an S-measurable function. Then the following are equiva-
lent.

(i) The function X is ( σ(F ), BR )-measurable.
(ii) There exists an (S′ , BR )-measurable function X ′ : Ω′ → R such that
X = X ′ ◦ F.

Proof. Clearly, (ii) ⇒ (i). To prove that (i) ⇒ (ii) consider the family M of σ(F )-
measurable functions of the form X ′ ◦ F , X ′ ∈ L0 (Ω′ , S′ ). We will prove that
M = L0 ( Ω, σ(F ) ). We will achieve this using the monotone class theorem.

Step 1. I Ω ∈ M.

Step 2. M is a vector space. Indeed, if X, Y ∈ M and a, b ∈ R, then there exist
S′ -measurable functions X ′ , Y ′ such that

    X = X ′ ◦ F, Y = Y ′ ◦ F, aX + bY = (aX ′ + bY ′ ) ◦ F.

Hence aX + bY ∈ M.

Step 3. I A ∈ M, ∀A ∈ σ(F ). Indeed, since A ∈ σ(F ) there exists A′ ∈ S′ such
that

    A = F −1 (A′ ),

so I A = I A′ ◦ F . Hence M contains all the σ(F )-measurable elementary functions.

Step 4. Suppose now that X ∈ L0 ( Ω, σ(F ) ) is nonnegative. Then there exists an
increasing sequence (Xn )n∈N of σ(F )-measurable nonnegative elementary functions
that converges pointwise to X. For every n ∈ N there exists an S′ -measurable
elementary function Xn′ : Ω′ → R such that

    Xn (ω) = Xn′ ( F (ω) ), ∀ω ∈ Ω.

Define

    Ω′0 := { ω ′ ∈ Ω′ : the limit lim_{n→∞} Xn′ (ω ′ ) exists and is finite }.

Let us observe that Ω′0 is S′ -measurable because

    ω ′ ∈ Ω′0 ⇐⇒ ∀ν ≥ 1, ∃N ≥ 1, ∀m, n ≥ N : |Xn′ (ω ′ ) − Xm′ (ω ′ )| < 1/ν,

i.e.,

    Ω′0 = ⋂_{ν∈N} ⋃_{N≥1} ⋂_{m,n>N} { |Xn′ − Xm′ | < 1/ν }.

Clearly, F (Ω) ⊂ Ω′0 . For any ω ′ ∈ Ω′ we set

    X∞′ (ω ′ ) := lim_{n→∞} Xn′ (ω ′ ) if ω ′ ∈ Ω′0 , and X∞′ (ω ′ ) := 0 if ω ′ ∈ Ω′ \ Ω′0 .

Arguing as in the proof of Proposition 1.17(ii) we deduce that X∞′ is S′ -measurable.
For any ω ∈ Ω the sequence Xn′ ( F (ω) ) = Xn (ω) is increasing and the limit

    lim_{n→∞} Xn′ ( F (ω) )

exists and is finite. Hence

    X∞′ ( F (ω) ) = X(ω), ∀ω ∈ Ω.

This proves that M is a monotone class in L0 ( Ω, σ(F ) ) that is also a vector space,
so it coincides with L0 ( Ω, σ(F ) ). □

Corollary 1.24. Suppose that X1 , . . . , Xn : (Ω, S) → R are S-measurable random
variables. Then the function X : Ω → R is σ(X1 , . . . , Xn )-measurable if and only if
there exists a BRn -measurable function u : Rn → R such that

    X = u( X1 , . . . , Xn ).

Proof. Apply the above theorem with (Ω′ , S′ ) = (Rn , BRn ) and

    F (ω) = ( X1 (ω), . . . , Xn (ω) ). □

Remark 1.25. We see that, in its simplest form, Corollary 1.24 describes a mea-
sure theoretic form of functional dependence. Thus, if in a given experiment we
can measure the quantities X1 , . . . , Xn , and the information X ≤ c can be decided
by measuring only the quantities X1 , . . . , Xn , then X is in fact a (measurable)
function of X1 , . . . , Xn . In plain English this sounds tautological. In particular,
this justifies the choice of the term “measurable”. □
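For finite sample spaces, Corollary 1.24 can be tested directly: X is σ(X1 , . . . , Xn )-measurable exactly when X is constant on the fibers of F = (X1 , . . . , Xn ), in which case the function u can be tabulated. A sketch (the helper name `factor_through` is hypothetical):

```python
def factor_through(omega, X, Fs):
    """If X is constant on the fibers of F = (F1, ..., Fk), return a
    dictionary u with X = u(F1, ..., Fk); otherwise return None."""
    u = {}
    for w in omega:
        key = tuple(F(w) for F in Fs)
        if key in u and u[key] != X(w):
            return None            # X is not sigma(F)-measurable
        u[key] = X(w)
    return u

omega = [(a, b) for a in (0, 1) for b in (0, 1)]   # two coin flips
X1 = lambda w: w[0]
X2 = lambda w: w[1]
S  = lambda w: w[0] + w[1]        # sigma(X1, X2)-measurable

u = factor_through(omega, S, [X1, X2])
assert u is not None
assert all(S(w) == u[(X1(w), X2(w))] for w in omega)   # S = u(X1, X2)

# The first flip alone does not determine the sum:
assert factor_through(omega, S, [X1]) is None
```

The failure in the last line reflects that {S ≤ 1} cannot be decided by measuring X1 alone.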

1.2 Measures and integration

1.2.1 Measures

Throughout this section (Ω, S) will denote a measurable space. Given a function
f : Ω → R we will use the notation {f ≤ c} to denote the subset f −1 ( (−∞, c] ).
The sets {a ≤ f ≤ b} etc. are defined in a similar fashion.

Definition 1.26. A measure on (Ω, S) is a function µ : S → [0, ∞], S ↦ µ[ S ], such
that the following hold.

• µ[ ∅ ] = 0, and
• it is σ-additive, i.e., for any sequence of pairwise disjoint S-measurable sets
(An )n∈N we have

    µ[ ⋃_{n∈N} An ] = ∑_{n≥1} µ[ An ].    (1.2.1)

The measure is called σ-finite if there exists an increasing sequence of S-measurable
sets

    A1 ⊂ A2 ⊂ · · ·

such that

    ⋃_{n∈N} An = Ω and µ[ An ] < ∞, ∀n ∈ N.

The measure is called finite if µ[ Ω ] < ∞. A probability measure is a measure P
such that P[ Ω ] = 1. We will denote by Prob(Ω, S) the set of probability measures
on (Ω, S). □

Remark 1.27. The σ-additivity condition (1.2.1) is equivalent to a pair of condi-
tions that are more convenient to verify in concrete situations.

(i) µ is finitely additive, i.e., for any finite collection of pairwise disjoint
S-measurable sets A1 , . . . , An we have

    µ[ ⋃_{k=1}^{n} Ak ] = ∑_{k=1}^{n} µ[ Ak ].

(ii) µ is increasingly continuous, i.e., for any increasing sequence of S-measurable
sets A1 ⊂ A2 ⊂ · · ·

    µ[ ⋃_{n∈N} An ] = lim_{n→∞} µ[ An ].    (1.2.2)

If µ[ Ω ] < ∞ and µ is finitely additive, then the increasing continuity condition
(ii) is equivalent to the decreasing continuity condition, i.e., for any decreasing
sequence of S-measurable sets B1 ⊃ B2 ⊃ · · ·

    µ[ ⋂_{n∈N} Bn ] = lim_{n→∞} µ[ Bn ].    (1.2.3)

Indeed, the sequence Bnc = Ω \ Bn is increasing and µ[ Bnc ] = µ[ Ω ] − µ[ Bn ]. This
last equality could be meaningless if µ[ Ω ] = ∞. □

Definition 1.28. A measured space is a triplet (Ω, S, µ), where (Ω, S) is a
measurable space and µ : S → [0, ∞] is a measure. □

Our next result shows that a finite measure is uniquely determined by its re-
striction to an algebra generating the sigma-algebra where it is defined.

Proposition 1.29. Consider a measurable space (Ω, S) and two finite measures
µ1 , µ2 : S → [0, ∞] such that µ1 [ Ω ] = µ2 [ Ω ] < ∞. Then the collection

    E := { S ∈ S : µ1 [ S ] = µ2 [ S ] }

is a λ-system. In particular, if µ1 [ C ] = µ2 [ C ] for any set C that belongs to a
π-system C, then µ1 and µ2 coincide on the σ-algebra generated by C.

Proof. Clearly ∅, Ω ∈ E. If A, B ∈ E and A ⊂ B, then

    µ1 [ A ] = µ2 [ A ] < ∞, µ1 [ B ] = µ2 [ B ] < ∞,

so

    µ1 [ B \ A ] = µ1 [ B ] − µ1 [ A ] = µ2 [ B ] − µ2 [ A ] = µ2 [ B \ A ],

so B \ A ∈ E. The condition (iii) in the Definition 1.5 of a λ-system follows from
the σ-additivity of the measures µ1 , µ2 . □

Definition 1.30. A probability space, or sample space, is a measured space (Ω, S, P),
where P is a probability measure. In this case we use the following terminology.

• The subsets S ∈ S are called the events of the sample space.
• An event S ∈ S is called almost sure (or a.s.) if P[ S ] = 1. An event S is called
improbable if P[ S ] = 0.
• The measurable functions X : (Ω, S, P) → R are called random variables.
• A random variable X : (Ω, S, P) → R̄ is called a.s. finite if

    P[ |X| < ∞ ] = 1.

• A random variable on (Ω, S, P) is called deterministic if there exists c ∈ R such
that X = c a.s. □

Traditionally the random variables have capitalized names X, Y, Z etc. to distin-
guish them from deterministic quantities that are indicated by lowercase letters.
We will try to adhere to this convention throughout this book.

Example 1.31. (a) If (Ω, S) is a measurable space, then for any ω0 ∈ Ω, the Dirac
measure concentrated at ω0 is the probability measure

    δω0 : S → [0, ∞), δω0 [ S ] = 1 if ω0 ∈ S, and δω0 [ S ] = 0 if ω0 ∉ S.

(b) Suppose that S is a finite or countable set. A measure on (S, 2S ) is uniquely
determined by the function

    w : S → [0, ∞], w(s) = µ[ {s} ].

We say that µ[{s}] is the mass of s with respect to µ. The function w is referred to
as the weight function of the measure. Often, for simplicity we will write

    µ[ s ] := µ[ {s} ].

The associated measure µw is a probability measure if

    ∑_{s∈S} w(s) = 1.

When S is finite and

    w(s) = 1/|S|, ∀s ∈ S,

then the associated probability measure µw is called the uniform probability measure
on the finite set S.
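A weight function on a finite or countable set determines the measure by summation, as in the formulas above. A minimal sketch (the helper name is hypothetical):

```python
def measure_from_weights(w):
    """The measure on (S, 2^S) determined by a weight function w,
    given by mu[A] = sum of w(s) over s in A."""
    return lambda A: sum(w[s] for s in A)

S = {"a", "b", "c"}
w = {s: 1 / len(S) for s in S}        # uniform weights w(s) = 1/|S|
mu = measure_from_weights(w)

assert abs(mu(S) - 1.0) < 1e-12       # total mass 1: a probability measure
assert abs(mu({"a", "b"}) - 2 / 3) < 1e-12
assert mu(set()) == 0                 # mu of the empty set is 0
```

Additivity is automatic here, since the sum over a disjoint union splits into the sums over the pieces.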
(c) Suppose that F : (Ω, S) → (Ω′ , S′ ) is a measurable map between measurable
spaces. Then any measure µ on Ω induces a measure F# µ on Ω′ according to the
rule

    F# µ[ S ′ ] := µ[ F −1 (S ′ ) ].

The measure F# µ is called the pushforward of µ via F .


(d) Fix a set T with two elements, T = {0, 1}. For any p ∈ (0, 1) the probability
measure βp : 2T → [0, ∞) defined by

    βp [ 1 ] = p, βp [ 0 ] = q := 1 − p,

is called the Bernoulli distribution with success probability p. We abbreviate it by
Ber(p).
(e) Given finite or countable sets Ω1 , . . . , Ωn , and probability measures
µi : 2Ωi → [0, 1], we obtain a probability measure

    µ := µ1 ⊗ · · · ⊗ µn : 2Ω1 ×···×Ωn → [0, 1]

by setting

    µ[ (ω1 , . . . , ωn ) ] = µ1 [ ω1 ] · · · µn [ ωn ], ∀(ω1 , . . . , ωn ) ∈ Ω1 × · · · × Ωn .

In particular, there exists a probability measure βp⊗n on {0, 1}n .
Note that we have a random variable

    N : {0, 1}n → N0 , N (ε1 , . . . , εn ) = ε1 + · · · + εn , ∀ε1 , . . . , εn ∈ {0, 1}.

The push-forward P = Pn,p := N# βp⊗n is a probability measure on {0, 1, . . . , n}
called the binomial distribution corresponding to n independent trials with success
probability p and failure probability q = 1 − p. It is abbreviated Bin(n, p). Note
that Bin(1, p) = Ber(p). For any k ∈ {0, 1, . . . , n} we have

    P = ∑_{k=0}^{n} P[ k ] δk ,

where

    P[ k ] = βp⊗n [ N = k ] = ∑_{ε1 +···+εn =k} βp⊗n [ (ε1 , . . . , εn ) ]
    = ∑_{ε1 +···+εn =k} p^k q^{n−k} = C(n, k) p^k q^{n−k} ,

where C(n, k) denotes the binomial coefficient.
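The identity P[ k ] = C(n, k) p^k q^{n−k} can be checked by literally pushing the product Bernoulli measure forward under N, summing the mass of each string over its fiber N = k (an illustrative Python sketch):

```python
from itertools import product
from math import comb, isclose

def pushforward_binomial(n, p):
    """Push the product Bernoulli measure on {0,1}^n forward under
    N(e1, ..., en) = e1 + ... + en, returning the pmf on {0, ..., n}."""
    pmf = [0.0] * (n + 1)
    for eps in product((0, 1), repeat=n):
        weight = 1.0
        for e in eps:                      # mass of the singleton {eps}
            weight *= p if e == 1 else (1 - p)
        pmf[sum(eps)] += weight            # the mass lands on N(eps)
    return pmf

n, p = 5, 0.3
pmf = pushforward_binomial(n, p)
for k in range(n + 1):                     # agrees with C(n,k) p^k q^(n-k)
    assert isclose(pmf[k], comb(n, k) * p**k * (1 - p)**(n - k))
```

Each fiber {N = k} contains C(n, k) strings of equal mass p^k q^{n−k}, which is exactly what the loop accumulates.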

(f) The Lebesgue measure λ defines a measure on BR . For any compact interval
[a, b] the uniform probability measure on [a, b] is

    (1/(b − a)) I_{[a,b]} λ. □
Definition 1.32. Let X be a topological space. As usual BX denotes the σ-algebra
of Borel subsets of X. A measure on X is called Borel if it is defined on BX . □

The Lebesgue measure on R is a Borel measure.

Definition 1.33. Suppose that X ∈ L0 (Ω, S, P). Its distribution is the Borel prob-
ability measure PX on R̄ defined by

    PX [ B ] = P[ X ∈ B ], ∀B ∈ BR̄ .

In other words, PX is the pushforward of P by X, PX = X# P. □

Definition 1.34. Suppose that µ is a measure on the measurable space (Ω, S).

(i) A set N ⊂ Ω is called µ-negligible if there exists a set S ∈ S such that

    N ⊂ S and µ[ S ] = 0.

We denote by Nµ the collection of µ-negligible sets.
(ii) The σ-algebra S is said to be complete with respect to µ (or µ-complete) if it
contains all the µ-negligible subsets.
(iii) The µ-completion of S is the σ-algebra Sµ := σ(S, Nµ ). □

Remark 1.35. (a) It may be helpful to think of a sample space (Ω, S, P) as the
collection of all possible outcomes ω of an experiment with unpredictable results.
The observer may not be able to distinguish through measurements all the possi-
ble outcomes, but she is able to distinguish some features or properties of various
outcomes. An event can be understood as the collection of all the outcomes having
an observable or measurable property. The probability P assigns to each observable
property the likelihood that it is observed at the end of such a random experiment.
Take for example the experiment of flipping n times a coin with 0/1 faces. One
natural sample space for this experiment is based on the set Ω = {0, 1}n .
If we assume that the coin is fair, then it is natural to conclude that each outcome
ω ∈ Ω is equally likely. Suppose that we can distinguish all the outcomes. In this
case

    S = 2Ω .

Since there are 2^n outcomes that are equally likely to occur we obtain a probability
measure P given by

    P[ S ] = |S|/2^n , ∀S ∈ S.

A random variable on a sample space is a numerical attribute X that we can assign
to each outcome ω of a random experiment with the following feature: for any
c ∈ R the property X(ω) ≤ c is observable, i.e., the set X −1 ( (−∞, c] ) belongs to
the collection S of observable properties. For example, in the situation of n fair coin
tosses, the number N of 1’s observed at the end of n tosses is a random variable.
(b) Often one speaks of sampling a probability distribution on R. Modern computer
systems can sample many distributions. More concretely, we say that a probability
measure µ on (R, BR ) can be sampled by a computer system if that computer can
produce a random1 experiment whose outcome is a random number X so that, when
we run the experiment a large number of times n, it generates numbers x1 , . . . , xn
and, for any c ∈ R, the fraction of these numbers that is ≤ c is very close to
µ[ (−∞, c] ].
When we speak of sampling a random variable X, we really mean sampling its
probability distribution PX . □
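The sampling criterion in (b) is easy to test for distributions a computer can already sample. For instance, for the uniform probability measure on [0, 2] from Example 1.31(f) we have µ[ (−∞, c] ] = c/2 for c ∈ [0, 2], and the empirical fraction of samples ≤ c should be close to it (an illustrative sketch using Python's random module; the tolerance is a hypothetical choice):

```python
import random

random.seed(0)
n = 200_000
# Sample the uniform probability measure on [0, 2].
xs = [random.uniform(0.0, 2.0) for _ in range(n)]

for c in (0.25, 1.0, 1.7):
    fraction = sum(x <= c for x in xs) / n   # empirical fraction <= c
    mu = c / 2.0                             # mu[(-infty, c]] = (c - 0)/(2 - 0)
    assert abs(fraction - mu) < 0.01         # close for large n
```

The typical deviation is of order 1/sqrt(n), which for n = 200,000 is far below the tolerance used here.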

Clearly Sµ is the smallest µ-complete σ-algebra containing S. The proof of the


following result can be safely left to the reader.

1 The precise term is pseudo-random since one cannot really simulate randomness.

Proposition 1.36. Suppose that µ is a measure on the σ-algebra S ⊂ 2Ω .
(i) The completion Sµ has the alternate description

    Sµ = { S ∪ N : S ∈ S, N ∈ Nµ } ⊂ 2Ω .

(ii) The measure µ admits a unique extension to a measure µ̄ : Sµ → [0, ∞]. More
precisely,

    ∀S ∈ S, N ∈ Nµ : µ̄[ S ∪ N ] = µ[ S ]. □

Definition 1.37. A set S ⊂ R is called Lebesgue measurable if it belongs to the
λ-completion of BR . □

The most versatile method of constructing measures is the Carathéodory Extension
Theorem. We first need to introduce the appropriate concepts.

Definition 1.38. Fix a set Ω and an algebra F ⊂ 2Ω .

(i) A function µ : F → [0, ∞] is called a premeasure if it satisfies the following
conditions.

    (a) µ[ ∅ ] = 0.
    (b) µ is finitely additive, i.e., for any finite collection of disjoint sets
    A1 , . . . , An ∈ F we have

        µ[ ⋃_{k=1}^{n} Ak ] = ∑_{k=1}^{n} µ[ Ak ].

    (c) µ is countably additive, i.e., for any sequence (An )n∈N of disjoint sets in F
    whose union is a set A ∈ F we have

        µ[ A ] = ∑_{n≥1} µ[ An ].

(ii) The premeasure µ is called σ-finite if there exists a sequence of sets (Ωn )n∈N
in F such that

    Ω = ⋃_{n∈N} Ωn , µ[ Ωn ] < ∞, ∀n ∈ N. □

For a proof of the next central result we refer to [4, Sec. 1.3], [50, Chap. 3] or
[92, Thm. 1.53, 1.65].

Theorem 1.39 (Carathéodory Extension Theorem). Suppose that F is an
algebra of subsets of Ω and µ : F → [0, ∞] is a σ-finite premeasure on F. Then the
following hold.

(i) The premeasure µ admits a unique extension to a measure µ̃ : σ(F) → [0, ∞].
(ii) For any A ∈ σ(F) and any ε > 0 there exist mutually disjoint sets
A1 , . . . , Am ∈ F and B1 , . . . , Bn ∈ F such that

    A ⊂ ⋃_{j=1}^{m} Aj , µ̃[ ⋃_{j=1}^{m} Aj \ A ] < ε,

and

    µ̃[ A ∆ ⋃_{k=1}^{n} Bk ] < ε. □

Example 1.40. Let F denote the collection of subsets of R that are finite unions of
intervals of the type (a, b], −∞ ≤ a < b ≤ ∞ (where (a, ∞] is understood as the
ray (a, ∞)). This is an algebra of sets. Any F ∈ F can be written in a (non)unique
way as a union

    F = ⋃_{i=1}^{n} (ai , bi ], ai < bi ≤ ai+1 < bi+1 , ∀i = 1, . . . , n − 1.

While this decomposition is not unique, the sum

    λ[ F ] = ∑_{i=1}^{n} (bi − ai )

depends only on F and not on the decomposition. It is not very hard to show that
the correspondence

    F ∋ F ↦ λ[ F ] ∈ [0, ∞]

is finitely additive. The fact that λ is a premeasure, i.e., it is (conditionally) sigma-
additive, is much more subtle, and it is ultimately rooted in the compactness of
the closed and bounded intervals of R. For details we refer to [4, Sec. 1.4] or [50,
Chap. 3]. The resulting measure on BR is called the Lebesgue measure on R and we
continue to denote it by λ. □


Definition 1.42. A distribution function is a right-continuous nondecreasing func-
tion

    F : R → [0, 1]

such that F (−∞) = 0 and F (∞) = 1. □

Example 1.43. Suppose that X is a random variable defined on the probability
space (Ω, S, P). The function

    FX : R → [0, 1], FX (x) = P[ X ≤ x ],

is a distribution function called the cumulative distribution function or cdf of the
random variable X. □

Example 1.44 (Lebesgue-Stieltjes measures). Suppose that F : R → [0, 1] is a
distribution function. Then there exists a unique Borel probability measure µ = µF
on BR such that

    µ[ (x, y] ] = F (y) − F (x), ∀x ≤ y ∈ R.    (1.2.4)

The uniqueness follows from the fact that the collection of intervals (−∞, x] is a
π-system that generates the Borel algebra of R. The existence follows from
Caratheodory’s extension theorem; see [4, Sec. 1.4] or [50, Chap. 3]. Below we will
describe another existence proof that relies only on the existence of the usual
Lebesgue measure.
The above measure µF is called the Stieltjes probability measure associated to the
distribution function F . Its extension to the µF -completion of BR is called the
Lebesgue-Stieltjes measure associated to the distribution function F .
Conversely, if µ is a Borel probability measure on R, then µ is the Stieltjes
measure associated to its cumulative distribution function (cdf) F : R → [0, 1],
F (x) = µ[ (−∞, x] ]. □

Example 1.45 (Quantiles). Here is an alternate description of this measure,
based on a construction frequently used in statistics. Suppose that F : R → [0, 1] is
a distribution function. The quantile function of F is a generalized inverse of the
nondecreasing function F . Define

    Q : [0, 1] → R, Q(p) := inf{ x : p ≤ F (x) } = inf F −1 ( [p, 1] ).    (1.2.5)

Since F is right-continuous the above definition is equivalent to

    F −1 ( [p, 1] ) = [ Q(p), ∞ ).

Suppose that x0 is a point of discontinuity of F and we set

    p0− := lim_{x↗x0} F (x) < F (x0 ) =: p0 .

Note that Q(p0 ) = x0 and, if p ∈ (p0− , p0 ], then Q(p) = x0 .
Note that for any x ∈ R we have

    0 ≤ y ≤ F (x) ⇐⇒ Q(y) ≤ x,    (1.2.6)

    Q−1 ( [−∞, x] ) = [ 0, F (x) ].    (1.2.7)

Indeed, ℓ ∈ Q−1 ( [−∞, x] ) if and only if Q(ℓ) ≤ x, i.e., ℓ ≤ F (x). In particular,

    Q−1 ( (x, y] ) = ( F (x), F (y) ], ∀ − ∞ ≤ x ≤ y ≤ ∞.

The quantile is left continuous. Indeed, let pn ↗ p0 . We will show that

    lim_{n} Q(pn ) = Q(p0 ).

Note that lim_{n} Q(pn ) ≤ Q(p0 ) since Q is nondecreasing. To prove that we have
equality we argue by contradiction. Set xn := Q(pn ), x0 = Q(p0 ). Suppose

    lim_{n} xn = x∞ < x0 = inf{ x : F (x) ≥ p0 }.

From the definition of inf as the greatest lower bound we deduce that there exists
x∗ ∈ (x∞ , x0 ) such that F (x∗ ) < p0 . Thus F (xn ) ≤ F (x∗ ). Since pn ↗ p0 we
deduce pn > F (x∗ ) for all n sufficiently large. This implies

    x∗ ∉ { x : F (x) ≥ pn } = [ Q(pn ), ∞ ),

i.e., xn = Q(pn ) > x∗ , for all n sufficiently large. This contradicts the fact that
xn → x∞ < x∗ .
If λ[0,1] denotes the Lebesgue measure2 on [0, 1], then

    Q# λ[0,1] [ (x, y] ] = λ[0,1] [ Q−1 ( (x, y] ) ] = F (y) − F (x).

Hence the pushforward measure Q# λ[0,1] satisfies (1.2.4); since it coincides with µF
on the π-system consisting of the intervals of the form (a, b], it coincides with µF
on the sigma-algebra of Borel sets.
When F is the cumulative distribution function of a random variable X, the asso-
ciated quantile function is called the quantile of the random variable X. □
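Example 1.45 is precisely the inverse transform sampling method used in practice. The sketch below computes Q over a finite grid for the Ber(0.3) cdf, checks the Galois relation (1.2.6) at the jump, and verifies empirically that Q# λ[0,1] recovers the measure (the helper Q is hypothetical and assumes the support is contained in the grid):

```python
import random

def Q(p, F, xs):
    """Quantile Q(p) = inf{x : p <= F(x)}, computed over a grid xs of
    candidate points containing the support of the distribution."""
    return min(x for x in xs if p <= F(x))

# cdf of the Ber(0.3) distribution: jumps at 0 (of size 0.7) and at 1.
F = lambda x: 0.0 if x < 0 else (0.7 if x < 1 else 1.0)
xs = [0, 1]

# Galois relation (1.2.6): Q(y) <= x  <=>  y <= F(x), checked across the jump.
for y in (0.1, 0.7, 0.71, 1.0):
    for x in xs:
        assert (Q(y, F, xs) <= x) == (y <= F(x))

# Pushing lambda_[0,1] forward by Q recovers the measure: P[X = 0] is about 0.7.
random.seed(1)
samples = [Q(random.random(), F, xs) for _ in range(100_000)]
assert abs(samples.count(0) / len(samples) - 0.7) < 0.01
```

Note how the flat part (p0−, p0] = (0, 0.7] of F is sent by Q to the single atom x0 = 0, exactly as described above.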

1.2.2 Independence and conditional probability

The next concepts are purely probabilistic in nature. They have no natural coun-
terpart in the traditional measure theory.

Definition 1.46. (a) The events A1 , A2 , . . . , An of a sample space (Ω, S, P) are
called independent if, for any nonempty subset {i1 , . . . , ik } ⊂ {1, . . . , n}, we have

    P[ Ai1 ∩ · · · ∩ Aik ] = P[ Ai1 ] · · · P[ Aik ].

(b) The families of events A1 , . . . , An ⊂ S are called independent if for any Ai ∈ Ai ,
i = 1, . . . , n, the events A1 , . . . , An are independent.
(c) The (possibly infinite) collection of families of events (Ai )i∈I is called indepen-
dent if for any i1 , . . . , in ∈ I the finite collection Ai1 , . . . , Ain is independent.
(d) An independency is an independent collection (Si )i∈I of sigma-subalgebras of S.
(e) The collection of random variables Xi ∈ L0 (Ω, S), i ∈ I, is called independent
if the collection of σ-algebras ( σ(Xi ) )i∈I is independent. □

+ We will use the notation X ⊥⊥ Y to indicate that the random variables X, Y are
independent.

Remark 1.47. (a) We want to emphasize that the independence condition is sen-
sitive to the choice of probability measure involved in this definition.
2 The proof of the existence of the Lebesgue measure is based on Caratheodory’s extension

theorem.

(b) It is possible that n + 1 events be dependent although any n of them are
independent. Here is one such instance, [146, Ex. 3.5]. Suppose we flip a fair coin
n times. In this case a natural sample space is
    Ω = 2^{In} = {0, 1}^n ,
with the uniform probability measure. (Above, 1 = Heads.) For k = 1, . . . , n we
denote by Ek the event “Heads at the k-th flip”, i.e.,
    Ek = { ω = (ω1 , . . . , ωn ) ∈ Ω; ωk = 1 }.
Denote by E0 the event “the number of Heads in these n flips is even”, i.e.,
    E0 = { ω ∈ Ω; ω1 + · · · + ωn ∈ 2Z }.
Clearly
    P[Ek ] = 1/2, ∀k = 1, . . . , n.
Since the probability of flipping an even number of Heads is equal to the probability
of flipping an odd number of Heads, we deduce that
    P[E0 ] = 1/2.
For any subset I ⊂ {0, 1, . . . , n} we set
    EI := ∩_{i∈I} Ei .
The events E1 , . . . , En are independent. Observe that for any subset I ⊂ In ,
|I| = k < n, we have
    P[ E0 ∩ EI ] = P[ { ω ∈ Ω; ωi = 1, ∀i ∈ I, Σ_{j∉I} ωj ≡ |I| mod 2 } ]
        = P[ { ω ∈ Ω; ωi = 1, ∀i ∈ I } ] · P[ { ω ∈ Ω; Σ_{j∉I} ωj ≡ |I| mod 2 } ]
        = (1/2^k ) · (1/2) = 1/2^{k+1} = P[E0 ] · ∏_{i∈I} P[Ei ].
Thus, any n of the events E0 , E1 , . . . , En are independent. Finally, note that
    ∏_{i=0}^{n} P[Ei ] = 1/2^{n+1} ,
while P[ E0 ∩ E1 ∩ · · · ∩ En ] equals 0 when n is odd and 1/2^n when n is even.
This shows the events E0 , E1 , . . . , En are dependent.
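For small n the claims in (b) can be verified by brute-force enumeration of Ω = {0, 1}^n. A short Python sketch (the helper names are our own):

```python
import math
from itertools import combinations, product

n = 3
omega = list(product([0, 1], repeat=n))  # uniform sample space {0,1}^n

def prob(event):
    """P[event] under the uniform measure on omega."""
    return sum(1 for w in omega if event(w)) / len(omega)

# E_k = "Heads at the k-th flip", E_0 = "the number of Heads is even".
events = {0: lambda w: sum(w) % 2 == 0}
for k in range(1, n + 1):
    events[k] = (lambda j: lambda w: w[j - 1] == 1)(k)

def is_independent(idx):
    """Check the product rule for every nonempty subfamily of idx."""
    return all(
        prob(lambda w: all(events[i](w) for i in sub))
        == math.prod(prob(events[i]) for i in sub)
        for r in range(1, len(idx) + 1)
        for sub in combinations(idx, r)
    )

# Any n of the n+1 events are independent, but the whole family is not.
assert all(is_independent([i for i in range(n + 1) if i != drop])
           for drop in range(n + 1))
assert not is_independent(list(range(n + 1)))
```

All probabilities here are dyadic rationals, so the floating-point equality tests are exact.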
(c) If Ω is contained in each of the families of events A1 , . . . , An , then these families
are independent if and only if
P A1 ∩ · · · ∩ An = P A1 · · · P An , ∀Ak ∈ Ak , k = 1, . . . , n.
     
t
u

Proposition 1.48. Let (Ω, S, P) be a sample space and suppose that P1 , . . . , Pn ⊂ S
are π-systems, each containing Ω. The following statements are equivalent.

(i) The families P1 , . . . , Pn are independent.
(ii) The collection of σ-algebras σ(P1 ), . . . , σ(Pn ) is independent.

Proof. Clearly it suffices to prove only (i) ⇒ (ii). Fix Si ∈ Pi , i = 2, . . . , n. Let
    I := { S ∈ S : P[ S ∩ S2 ∩ · · · ∩ Sn ] = P[S] P[S2 ] · · · P[Sn ] }.
Note that P1 ⊂ I. Next let us observe that I is a λ-system. Indeed, if A, B ∈ I and
A ⊂ B, then
    P[ (B \ A) ∩ S2 ∩ · · · ∩ Sn ] = P[ (B ∩ S2 ∩ · · · ∩ Sn ) \ (A ∩ S2 ∩ · · · ∩ Sn ) ]
    = P[B] P[S2 ] · · · P[Sn ] − P[A] P[S2 ] · · · P[Sn ] = P[B \ A] P[S2 ] · · · P[Sn ].
If A1 ⊂ A2 ⊂ · · · ⊂ Aν ⊂ · · · is an increasing sequence of events in I and
    A = limν→∞ Aν = ∪_{ν≥1} Aν ,
then
    P[ A ∩ S2 ∩ · · · ∩ Sn ] = limν→∞ P[ Aν ∩ S2 ∩ · · · ∩ Sn ]
    = limν→∞ P[Aν ] P[S2 ] · · · P[Sn ] = P[A] P[S2 ] · · · P[Sn ].
The π–λ theorem implies that σ(P1 ) ⊂ I, so that
    P[ A1 ∩ S2 ∩ · · · ∩ Sn ] = P[A1 ] P[S2 ] · · · P[Sn ],
for all A1 ∈ σ(P1 ), Si ∈ Pi , i = 2, . . . , n. Repeating the above argument we deduce
    P[ A1 ∩ A2 ∩ · · · ∩ An ] = P[A1 ] P[A2 ] · · · P[An ], ∀Ak ∈ σ(Pk ), k = 1, . . . , n.
Remark 1.47 shows that the σ-algebras σ(P1 ), . . . , σ(Pn ) are independent. ⊔⊓

Corollary 1.49. Consider the random variables X1 , . . . , Xn : (Ω, S, P) → R. The
following statements are equivalent.

(i) The random variables X1 , . . . , Xn are independent.
(ii) For any x1 , . . . , xn ∈ R
    P[ X1 ≤ x1 , . . . , Xn ≤ xn ] = P[X1 ≤ x1 ] · · · P[Xn ≤ xn ].

Proof. It follows from Proposition 1.48 applied to the π-systems
    Pk := { {Xk ≤ xk } : xk ∈ R }, k = 1, . . . , n.
⊔⊓

Corollary 1.50 (Partition of independencies). Suppose that (Si )i∈I is an in-
dependency of (Ω, S, P). For any partition (Iα )α∈A of I we set
    Fα := ∨_{i∈Iα} Si , α ∈ A.
Then the collection (Fα )α∈A is also an independency.

Proof. Denote by Cα the π-system obtained by taking intersections of finitely many
events from ∪_{i∈Iα} Si . Then
    Fα = σ(Cα ), ∀α ∈ A,
and the family (Cα )α∈A is independent. The conclusion now follows from Proposi-
tion 1.48. ⊔⊓

Corollary 1.51. Suppose that the random variables X1 , . . . , Xn ∈ L0 (Ω, S, P) are
independent. Then for any 1 < k < n and any Borel measurable functions
f : R̄k → R̄, g : R̄n−k → R̄, the random variables
    f (X1 , . . . , Xk ), g(Xk+1 , . . . , Xn )
are independent. ⊔⊓

Definition 1.52 (Tail algebra). Consider a sequence (Sn )n∈N of sub-σ-algebras
of (Ω, S, P). The tail algebra of this sequence is the σ-algebra
    T = T(Sn ) := ∩_{m∈N} Tm ,  Tm := ∨_{n>m} Sn .                  (1.2.8)
The events in T are called tail events. ⊔⊓

Remark 1.53. (a) An event S is a tail event of the sequence (Sn )n∈N if
    S ∈ ∨_{n>m} Sn , ∀m ∈ N.
The sequence of σ-algebras (Sn )n∈N can be viewed as an information stream. The
tail events are described by such a stream of information and are characterized by
the fact that their occurrence is unaffected by the information at finitely many
moments of time in the stream.
(b) To a sequence of random variables Xn : (Ω, S, P) → R we associate the sequence
of σ-algebras Sn = σ(Xn ) and the event C = “the sequence (Xn )n≥1 converges”. To
see that this is a tail event note that Tm = σ(Xm+1 , Xm+2 , . . . ) and
    C = ∩_{m∈N} Cm ,
where Cm is the event “the sequence (Xk )k≥m converges”. Clearly Cm ∈ Tm . ⊔⊓

Theorem 1.54 (Kolmogorov’s 0–1 law). If A is a tail event of the independency
(Sn )n∈N , then P[A] = 0 or P[A] = 1.

Proof. Let Tm be as in (1.2.8). According to the principle of partition of indepen-
dencies the collection S1 , . . . , Sm , Tm is an independency and, since T ⊂ Tm , the
collection S1 , . . . , Sm , T is also an independency, ∀m ∈ N. We deduce that for any
m ∈ N the σ-algebras
    ∨_{k=1}^{m} Sk ,  T
are independent, so {T0 , T} is an independency. Hence, for any A ∈ T and any
B ∈ T0 , we have
    P[ A ∩ B ] = P[A] P[B].
If above we choose B = A ∈ T ⊂ T0 we deduce
    P[A] = P[A]² , ∀A ∈ T ⇒ P[A] ∈ {0, 1}, ∀A ∈ T.
⊔⊓

Definition 1.55. Let (Ω, S, P) be a probability space. A zero-one event is an
event S ∈ S such that P[S] ∈ {0, 1}. A zero-one algebra is a sigma-subalgebra
F ⊂ S consisting of zero-one events. ⊔⊓

Corollary 1.56. Suppose that (Xn )n∈N is a sequence of independent random vari-
ables on the probability space (Ω, S, P). Then the series
    Σ_{n∈N} Xn
is either almost surely convergent, or almost surely divergent. In other words, its
almost sure convergence is a zero-one event. ⊔⊓

Definition 1.57. Suppose that A, B are events in the sample space (Ω, S, P) such
that P[B] ≠ 0. The conditional probability of A given B is the number
    P[ A | B ] := P[ A ∩ B ] / P[B]. ⊔⊓

Note that we have the useful product formula
    P[ A ∩ B ] = P[ A | B ] P[B].                                   (1.2.9)
In particular, we deduce that A, B are independent if and only if P[A] = P[ A | B ].
Note that the map
    P[ − | B ] : S → [0, 1], S ↦ P[ S | B ],
is also a probability measure on S. We say that it is the probability measure obtained
by conditioning on B.
Remark 1.58. Observe that the events A1 , . . . , An , n ≥ 2, are independent if and
only if, for any nonempty subset I ⊂ {1, . . . , n} of cardinality < n, and any j ∉ I, we
have
    P[ Aj | AI ] = P[Aj ], where AI := ∩_{i∈I} Ai . ⊔⊓

Suppose we are given a finite or countable measurable partition of (Ω, S, P),
    Ω = ⊔_{i∈I} Ai , I ⊂ N, P[Ai ] ≠ 0, ∀i.
The law of total probability states that
    P[S] = Σ_{i∈I} P[ S | Ai ] P[Ai ], ∀S ∈ S.                      (1.2.10)
Indeed,
    P[S] = Σ_{i∈I} P[ S ∩ Ai ] = Σ_{i∈I} P[ S | Ai ] P[Ai ],
where the last equality follows from (1.2.9).

Example 1.59. Suppose that we have an urn containing b black balls and r red
balls. A ball is drawn from the urn and discarded. Without knowing its color, what
is the probability that a second ball drawn is black?
For k = 1, 2 denote by Bk the event “the k-th drawn ball is black”. We are asked
to find P[B2 ]. The first drawn ball is either black (B1 ) or not black (B1c ). From
the law of total probability we deduce
    P[B2 ] = P[ B2 | B1 ] P[B1 ] + P[ B2 | B1c ] P[B1c ].
Observing that
    P[B1 ] = b/(b + r)  and  P[B1c ] = r/(b + r),
we conclude
    P[B2 ] = (b − 1)/(b + r − 1) · b/(b + r) + b/(b + r − 1) · r/(b + r)
           = ( b(b − 1) + br ) / ( (b + r)(b + r − 1) )
           = b(b + r − 1) / ( (b + r)(b + r − 1) ) = b/(b + r) = P[B1 ].
Thus, the probability that the second extracted ball is black is equal to the proba-
bility that the first extracted ball is black. This seems to contradict our intuition
because when we extract the second ball the composition of available balls at that
time is different from the initial composition.
This is a special case of a more general result, due to S. Poisson, [31, Sec. 5.3].
Suppose in an urn containing b black and r red balls, n balls have been drawn first
and discarded without their colors being noted. If another ball is drawn next, the
probability that it is black is the same as if we had drawn this ball at the outset,
without having discarded the n balls previously drawn.

To quote John Maynard Keynes, [90, p. 394],


This is an exceedingly good example of the failure to perceive that a probability
cannot be influenced by the occurrence of a material event but only by such
knowledge as we may have, respecting the occurrence of the event. ⊔⊓
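The computation in Example 1.59 can be checked with exact rational arithmetic; the following Python sketch (the naming is our own) conditions on the color of the first ball exactly as in the text:

```python
from fractions import Fraction

def p_second_black(b, r):
    """P[B2] via the law of total probability (1.2.10)."""
    n = b + r
    p_b1 = Fraction(b, n)                   # P[B1]
    p_b2_given_b1 = Fraction(b - 1, n - 1)  # P[B2 | B1]
    p_b2_given_r1 = Fraction(b, n - 1)      # P[B2 | B1^c]
    return p_b1 * p_b2_given_b1 + (1 - p_b1) * p_b2_given_r1

# P[B2] = P[B1] = b/(b+r) for every composition of the urn.
assert all(p_second_black(b, r) == Fraction(b, b + r)
           for b in range(1, 8) for r in range(1, 8))
```

The exhaustive check over small urns illustrates Poisson's general observation quoted above: discarding balls of unknown color does not change the probability.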

Example 1.60 (The ballot problem). This is one of the oldest problems in
probability. A person starts at S0 ∈ Z and every second (or epoch) he flips a
fair coin: Heads, he moves one step ahead, Tails, he takes one step back. We denote
by Sn his location after n coin flips. The sequence of random variables (Sn )n∈N is
called the standard (or unbiased) random walk on Z.
Formally, we have a sequence of independent random variables (Xn )n∈N such
that
    P[ Xn = 1 ] = P[ Xn = −1 ] = 1/2, ∀n ∈ N.
The random variables with this distribution are called Rademacher random vari-
ables. Then
    Sn = S0 + X1 + · · · + Xn .
Set In := {1, . . . , n} and
    Hn := #{ k ∈ In ; Xk = 1 },  Tn := #{ k ∈ In ; Xk = −1 }.
Thus Hn is the number of Heads during the first n coin flips, while Tn denotes the
number of Tails during the first n coin flips. Note that
    n = Hn + Tn ,  Sn = S0 + Hn − Tn = S0 + 2Hn − n.
We deduce that
    Sn = m ⇐⇒ n + m − S0 = 2Hn .
In particular, this shows that Sn ≡ n − S0 mod 2, ∀n ∈ N. Moreover,
    Sn = m ⇐⇒ Hn = (n + m − S0 )/2,
and we deduce
    P[ Sn = m ] = C( n, (n + m − S0 )/2 ) 2^{−n} if m ≡ n − S0 mod 2,
and P[ Sn = m ] = 0 otherwise, where C(n, k) := n!/( k!(n − k)! ) denotes the
binomial coefficient.
It is convenient to visualize the random walk as a zig-zag obtained by successively
connecting by a line segment the point (n − 1, Sn−1 ) to the point (n, Sn ), n ∈ N.
The connecting line segment has slope Xn ; see Figure 1.1.
Suppose that y ∈ N and S0 = 0. The ballot problem asks what is the probability
py that
Sk > 0, ∀k = 1, . . . , n − 1 given that Sn = y.
One can think of a zigzag as describing a succession of votes in favor of one of the
two candidates H or T . When the zigzag goes up, a vote for H is cast, and when
it goes down, a vote in favor of T is cast. We know that at the end of the election
H was declared winner with y votes over T . Thus py is the probability that H was
always ahead during the voting process.

Fig. 1.1 A zig-zag describing a random walk started at S0 = 0.

We set Hn := a, Tn := b, so n = a + b, y = a − b. The sample space in this
problem is the space Ωn,y of zigzags ω that start at the origin and end at (n, y).
There are
    |Ωn,y | = C(n, a) = C(a + b, b)
equally likely such zigzags, where C(n, k) denotes the binomial coefficient. We seek
the probability of the event
    E := { ω ∈ Ωn,y ; ω touches the horizontal axis }.
Then py = 1 − P[E].
We will compute P[E] by conditioning on S1 . There is a silent trap on our way.
Since the first vote is equally likely to have been H or T , one might be tempted to
think that P[ S1 = ±1 ] = 1/2. This is however not the case, since the zig-zags in
Ωn,y are subject to an extra condition, namely the location (n, y) of their endpoints.
We have
    P[E] = P[ E | S1 = −1 ] P[ S1 = −1 ] + P[ E | S1 = 1 ] P[ S1 = 1 ]
         = P[ S1 = −1 ] + P[ E | S1 = 1 ] P[ S1 = 1 ].
Note that there are
    C(n − 1, a) = C(a + b − 1, a)
equally likely zigzags from (1, −1) to (n, a − b), so
    P[ S1 = −1 ] = C(n − 1, a)/C(n, a) = b/n = b/(a + b).
There are
    C(n − 1, a − 1) = C(a + b − 1, b)

equally likely zigzags from (1, 1) to (n, a − b), so
    P[ S1 = 1 ] = C(n − 1, a − 1)/C(n, a) = a/(a + b).
To count the number of zigzags from (1, 1) to (n, a − b) that touch the horizontal
axis we rely on a clever and versatile trick called André’s reflection trick.
For each such zigzag Z denote by k(Z) the first moment it touches the horizontal
axis. Denote by Z r the zigzag obtained from Z by reflecting in the horizontal axis
the part of Z from k(Z) to n; see Figure 1.2.

Fig. 1.2 The zigzag Z r traces Z until Z hits the horizontal axis. At this moment the zigzag Z r
follows the opposite motion of Z (dashed line).

The end point of Z r is (n, −(a − b)). The transformation Z ↦ Z r produces a
bijection between the zigzags with origin (1, 1) and endpoint (n, a − b) that touch
the horizontal axis and the zigzags with origin (1, 1) and endpoint (n, −(a − b)).
Indeed, any zigzag Z′ : (1, 1) → (n, b − a) must cross the horizontal axis. After the
first touch we reflect it in this axis and obtain a zigzag Z : (1, 1) → (n, a − b) such
that Z r = Z′. Clearly Z touches the horizontal axis.
The number of zigzags (1, 1) → (n, b − a) is
    C(n − 1, b − 1) = C(a + b − 1, a).
Hence
    P[ E | S1 = 1 ] = C(a + b − 1, a)/C(a + b − 1, a − 1)
                    = (a − 1)! b! / ( a! (b − 1)! ) = b/a.
We deduce
    P[E] = (b/a) · a/(a + b) + b/(a + b) = 2b/(a + b)
and
    py = 1 − 2b/(a + b) = (a − b)/(a + b) = y/n = Sn /n.            (1.2.11)
⊔⊓
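Formula (1.2.11) can be confirmed by enumerating all zigzags with the prescribed endpoint; a small Python sketch (the names are ours):

```python
from fractions import Fraction
from itertools import product

def ballot_probability(n, y):
    """Among zigzags with S_0 = 0 and S_n = y, the fraction of those
    with S_k > 0 for k = 1, ..., n - 1."""
    paths = [p for p in product([1, -1], repeat=n) if sum(p) == y]

    def stays_positive(path):
        s = 0
        for step in path[:-1]:  # checks S_1, ..., S_{n-1}
            s += step
            if s <= 0:
                return False
        return True

    return Fraction(sum(stays_positive(p) for p in paths), len(paths))

# Matches p_y = y/n from (1.2.11).
assert ballot_probability(7, 3) == Fraction(3, 7)
assert ballot_probability(8, 2) == Fraction(1, 4)
```

Enumeration is exponential in n, so this only serves as a sanity check for small elections; the reflection argument above gives the answer for all n.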

Proposition 1.61 (Bayes’ formula). Suppose we are given a finite or countable
measurable partition of (Ω, S, P),
    Ω = ⊔_{i∈I} Ai , I ⊂ N, P[Ai ] ≠ 0, ∀i.
Then, for any S ∈ S such that P[S] ≠ 0 and any i0 ∈ I, we have
    P[ Ai0 | S ] = P[ S | Ai0 ] P[Ai0 ] / Σ_{i∈I} P[ S | Ai ] P[Ai ].   (1.2.12)

Proof. According to the law of total probability, the denominator in the right-
hand side of (1.2.12) equals P[S]. Thus, the equality (1.2.12) is equivalent to
    P[ Ai0 | S ] P[S] = P[ S | Ai0 ] P[Ai0 ].
The product formula shows that both sides of the above equality are equal to
P[ Ai0 ∩ S ]. ⊔⊓

Remark 1.62. We should mention here a terminology favored by statisticians.

• The events Ak are called hypotheses.
• The probability P[Ak ] is called the prior (probability).
• The probability P[ Ak | S ] is called the posterior (probability).
• The probability P[ S | Ak ] is called the likelihood.

Here is one frequent application of Bayes’ principle. Suppose that we have observed
a random event S which we know can be caused only by one of the random events
Ai . To decide which of the events Ai is more likely to have caused S we need to find
the largest of the posteriors P[ Ai | S ]. Bayes’ formula shows that the most likely
cause maximizes the numerator P[ S | Ai ] P[Ai ]. ⊔⊓

Example 1.63 (Biased coins). We say that a coin has bias θ ∈ (0, 1) if the
probability of showing Heads when flipped is θ. Suppose that we have an urn
containing c1 coins with bias θ1 and c2 coins with bias θ2 . Let n := c1 + c2 denote
the total number of coins and set pi := ci /n, i = 1, 2. We assume that
    c1 < c2 and θ1 > θ2 ,                                           (1.2.13)
i.e., there are fewer coins with higher bias. We draw a coin at random, we flip it
twice and we get Heads both times. What is the probability that the coin we have
drawn has the higher bias?
If θ denotes the (unknown) bias of the coin drawn at random, then we can think
of θ as a random variable that takes two values θ1 , θ2 with probabilities
    P[θi ] := P[ θ = θi ] = pi , i = 1, 2.
Denote by E the event that two successive flips produce Heads. Then
    P[ E | θi ] := P[ E | θ = θi ] = θi² .

Bayes’ formula shows that
    P[ θ1 | E ] = P[ E | θ1 ] P[θ1 ] / ( P[ E | θ1 ] P[θ1 ] + P[ E | θ2 ] P[θ2 ] )
                = p1 θ1² / ( p1 θ1² + p2 θ2² ) = 1 / ( 1 + (p2 θ2²)/(p1 θ1²) ).
Our assumption (1.2.13) shows that
    c2 /c1 = p2 /p1 > 1 > θ2 /θ1 .
Observe that if c2 θ2² > c1 θ1² , then
    P[ θ1 | E ] < 1/2.
Thus, in this case, if we observe two Heads, then the coin we randomly drew from
the urn is less likely to be the one with the bigger bias. For example, if θ1 = 2/3,
θ2 = 1/3 and c2 > 8c1 , then p2 θ2²/(p1 θ1²) = c2 /(4c1 ) > 2, so that
    P[ θ1 | E ] < 1/3,
and the randomly drawn coin is less likely to be the one heavily biased towards
Heads. ⊔⊓
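The posterior in Example 1.63 is easy to evaluate exactly; a Python sketch with exact rationals (the function name is ours):

```python
from fractions import Fraction

def posterior_theta1(c1, c2, theta1, theta2, heads_in_a_row=2):
    """P[theta = theta1 | E] by Bayes' formula (1.2.12), where E is the
    event of observing 'heads_in_a_row' Heads."""
    p1 = Fraction(c1, c1 + c2)
    p2 = Fraction(c2, c1 + c2)
    like1 = Fraction(theta1) ** heads_in_a_row
    like2 = Fraction(theta2) ** heads_in_a_row
    return p1 * like1 / (p1 * like1 + p2 * like2)

# theta1 = 2/3, theta2 = 1/3, c2 = 9 c1 > 8 c1: two Heads leave the
# heavily biased coin the *less* likely explanation.
post = posterior_theta1(1, 9, Fraction(2, 3), Fraction(1, 3))
assert post == Fraction(4, 13) and post < Fraction(1, 3)
```

Increasing `heads_in_a_row` shows how further evidence eventually overwhelms the prior in favor of the coin with the larger bias.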

1.2.3 Integration of measurable functions

We outline below, mostly without proofs, the construction and the basic facts about
integration of measurable functions. For details we refer to [50; 102; 148].
Fix a measured space (Ω, S, µ). Recall that Elem(Ω, S) denotes the vector
space of elementary S-measurable functions (see Definition 1.19). We denote by
Elem+ (Ω, S) the convex cone of Elem(Ω, S) consisting of nonnegative elementary
functions. Define
    µ : Elem+ (Ω, S) → [0, ∞], f ↦ µ[f ] = ∫Ω f (ω) µ(dω),
as follows. If
    f = Σ_{i=1}^{M} ai I_{Ai} ,  A1 , . . . , AM disjoint,
then
    µ[f ] = ∫Ω f (ω) µ(dω) := Σ_{i=1}^{M} ai µ[Ai ].

Note that if
    f = Σ_{j=1}^{N} bj I_{Bj} ,  B1 , . . . , BN disjoint,

then ai = bj whenever Ai ∩ Bj ≠ ∅. Hence
    Σ_i ai µ[Ai ] = Σ_i Σ_j ai µ[Ai ∩ Bj ] = Σ_i Σ_j bj µ[Ai ∩ Bj ] = Σ_j bj µ[Bj ].
This shows that the value of ∫Ω f (ω) µ(dω) is independent of the decomposition of
f as a linear combination of indicators of pairwise disjoint measurable sets.
The above integration map satisfies the following elementary properties:
    ∀f, g ∈ Elem+ (Ω, S) : f ≤ g ⇒ µ[f ] ≤ µ[g],                    (1.2.14a)
    ∀a, b ≥ 0, f, g ∈ Elem+ (Ω, S) : µ[af + bg] = a µ[f ] + b µ[g].  (1.2.14b)

For f ∈ L0+ (Ω, S) we set
    Ef+ := { g ∈ Elem+ (Ω, S); g ≤ f }.
The set Ef+ is nonempty since 0 ∈ Ef+ . Define
    µ[f ] = ∫Ω f dµ = ∫Ω f (ω) µ(dω) := sup_{g∈Ef+} ∫Ω g(ω) µ(dω) ∈ [0, ∞].   (1.2.15)

Definition 1.64. A measurable function f ∈ L0 (Ω, S) is called µ-integrable if
    µ[f+ ], µ[f− ] < ∞.
In this case we define its Lebesgue integral to be
    ∫Ω f dµ = ∫Ω f (ω) µ(dω) = µ[f ] := µ[f+ ] − µ[f− ].
We denote by L1 (Ω, S, µ) the set of µ-integrable functions and by L1+ (Ω, S, µ) the
set of µ-integrable nonnegative functions. ⊔⊓

Note that
    ∀f, g ∈ L0+ (Ω, S) : f ≤ g ⇒ µ[f ] ≤ µ[g].                      (1.2.16)
Moreover,
    ∀f ∈ L0+ (Ω, S) : µ[f > 0] = 0 ⇐⇒ ∫Ω f dµ = 0.                  (1.2.17)
The integral L0+ ∋ f ↦ µ[f ] ∈ [0, ∞] enjoys the following key continuity
property which is the “workhorse” of the Lebesgue integration theory.

Theorem 1.65 (Monotone Convergence theorem). Suppose that (fn )n∈N is
a sequence in L0+ (Ω, S) that converges increasingly to f ∈ L0+ (Ω, S). Then
    µ[fn ] ↗ µ[f ] as n → ∞.

   
Proof. The sequence µ[fn ] is nondecreasing and is bounded above by µ[f ]. Hence
it has a, possibly infinite, limit and limn→∞ µ[fn ] ≤ µ[f ]. The proof of the opposite
inequality
    limn→∞ µ[fn ] ≥ µ[f ]
relies on a clever trick. Fix g ∈ Ef+ , c ∈ (0, 1), and set
    Sn := { ω ∈ Ω; fn (ω) ≥ c g(ω) }.
Since f = lim fn and (fn ) is a nondecreasing sequence of functions, we deduce that
(Sn ) is a nondecreasing sequence of measurable sets whose union is Ω. For any
elementary function h the product I_{Sn} h is also elementary. For any n ∈ N we have
fn ≥ fn I_{Sn} ≥ c g I_{Sn} , so that
    µ[fn ] ≥ µ[ I_{Sn} fn ] ≥ c µ[ g I_{Sn} ].
If we write g as a finite linear combination
    g = Σ_j gj I_{Aj} ,
with the Aj pairwise disjoint, then we deduce
    µ[fn ] ≥ c µ[ g I_{Sn} ] = c Σ_j gj µ[ Aj ∩ Sn ].
The sequence of sets (Aj ∩ Sn )n∈N is nondecreasing and its union is Aj , so that
    limn→∞ µ[fn ] ≥ c Σ_j gj limn→∞ µ[ Aj ∩ Sn ] = c Σ_j gj µ[Aj ] = c µ[g].
Hence
    limn→∞ µ[fn ] ≥ c µ[g], ∀g ∈ Ef+ , ∀c ∈ (0, 1),
so that
    limn→∞ µ[fn ] ≥ c µ[f ], ∀c ∈ (0, 1).
Letting c ↗ 1 we deduce limn→∞ µ[fn ] ≥ µ[f ]. ⊔⊓

Corollary 1.66. For any f ∈ L0+ (Ω, S) we have
    µ[f ] = limn→∞ µ[ Dn [f ] ]. ⊔⊓

Corollary 1.67. For any f, g ∈ L1 (Ω, S, µ) and a, b ∈ R such that af + bg is well
defined, we have af + bg ∈ L1 (Ω, S, µ) and
    ∫Ω (af + bg) dµ = a ∫Ω f dµ + b ∫Ω g dµ.                        (1.2.18)
Moreover, if f, g ∈ L1 (Ω, S, µ) and f (ω) ≤ g(ω), ∀ω ∈ Ω, then
    ∫Ω f dµ ≤ ∫Ω g dµ. ⊔⊓

Since |f | = f+ + f− we deduce the following result.

Corollary 1.68. Let f ∈ L0 (Ω, S). Then
    f ∈ L1 (Ω, S, µ) ⇐⇒ |f | ∈ L1 (Ω, S, µ). ⊔⊓

Corollary 1.69 (Markov’s Inequality). Suppose that f ∈ L1+ (Ω, S, µ). Then,
for any C > 0, we have
    µ[ {f ≥ C} ] ≤ (1/C) ∫Ω f dµ.                                   (1.2.19)
In particular, f < ∞, µ-a.e.

Proof. Note that
    C I_{f ≥C} ≤ f ⇒ C µ[ {f ≥ C} ] = ∫Ω C I_{f ≥C} dµ ≤ ∫Ω f dµ. ⊔⊓
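On a finite measure space Markov's inequality can be checked exhaustively; a Python sketch with exact arithmetic (the setup is our own choice):

```python
from fractions import Fraction

# Uniform measure on {1, ..., 10}; take f(w) = w.
omega = range(1, 11)
mu_point = Fraction(1, 10)

def integral(f):
    """Integral of f against the uniform measure."""
    return sum(mu_point * f(w) for w in omega)

f = lambda w: Fraction(w)
for C in range(1, 12):
    lhs = sum(mu_point for w in omega if f(w) >= C)  # mu[{f >= C}]
    rhs = integral(f) / C                            # (1/C) * integral
    assert lhs <= rhs                                # (1.2.19)
```

The bound is tight only when the mass of f concentrates on the level set {f ≥ C}, which is why the inequality is typically loose in practice.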

Corollary 1.70. If f ∈ L1 (Ω, S, µ), then µ[ {|f | = ∞} ] = 0.

Proof. Note that
    µ[ {|f | = ∞} ] = µ[ ∩_{n∈N} {|f | > n} ].
On the other hand, Markov’s inequality implies
    µ[ {|f | > n} ] ≤ µ[ |f | ]/n → 0. ⊔⊓

Proposition 1.71. Suppose f, g ∈ L0 (Ω, S) and f = g, µ-a.e. Then
    f ∈ L1 (Ω, S, µ) ⇐⇒ g ∈ L1 (Ω, S, µ).
Moreover, if one of the above equivalent conditions holds, then µ[f ] = µ[g]. ⊔⊓

Remark 1.72. The presentation so far had to tread carefully around a nagging
problem: given f, g ∈ L1 (Ω, S, µ), the sum f (ω) + g(ω) may not be well defined for
some ω. For example, it could happen that f (ω) = ∞, g(ω) = −∞. Fortunately,
Corollary 1.70 shows that the set of such ω’s is negligible. Moreover, if we redefine
f and g to be equal to zero on the set where they had infinite values, then their
integrals do not change. For this reason we alter the definition of L1 (Ω, S, µ) as
follows:
    L1 (Ω, S, µ) := { f : (Ω, S) → R; f measurable, ∫Ω |f | dµ < ∞ }.
Thus, in the sequel the integrable functions will be assumed to be everywhere finite.
With this convention, the space L1 (Ω, S, µ) is a vector space and the Lebesgue
integral is a linear functional
    µ : L1 (Ω, S, µ) → R, f ↦ µ[f ]. ⊔⊓

Recall that for any sequence (xn )n∈N of real numbers we have
    lim inf_{n→∞} xn = lim_{k→∞} x∗k , where x∗k := inf_{n≥k} xn .
The sequence (x∗k ) is nondecreasing. The Monotone Convergence Theorem has the
following useful immediate consequence.

Theorem 1.73 (Fatou’s Lemma). Suppose that (fn )n∈N is a sequence in
L0+ (Ω, S). Then
    ∫Ω ( lim inf_{n→∞} fn (ω) ) µ(dω) ≤ lim inf_{n→∞} ∫Ω fn dµ. ⊔⊓

Proof. Set
    gk := inf_{n≥k} fn .
Proposition 1.17(iii) implies that gk ∈ L0+ (Ω, S). The sequence (gk ) is nondecreasing
and
    lim inf_{n→∞} fn = lim_{k→∞} gk .
The Monotone Convergence Theorem implies that
    ∫Ω ( lim inf_{n→∞} fn (ω) ) µ(dω) = lim_{k→∞} ∫Ω gk dµ.
Note that gk ≤ fn , ∀n ≥ k, and thus
    ∫Ω gk dµ ≤ ∫Ω fn dµ, ∀n ≥ k,
i.e.,
    ∫Ω gk dµ ≤ inf_{n≥k} ∫Ω fn dµ.
Letting k → ∞ we deduce
    lim_{k→∞} ∫Ω gk dµ ≤ lim_{k→∞} inf_{n≥k} ∫Ω fn dµ = lim inf_{n→∞} ∫Ω fn dµ. ⊔⊓

The next result illustrates one of the advantages of the Lebesgue integral over
the Riemann integral: one needs less restrictive conditions to pass to the limit under
the Lebesgue integral.

Theorem 1.74 (Dominated Convergence). Suppose (fn )n∈N is a sequence in
L1 (Ω, S, µ) satisfying the following properties.

(i) There exists f ∈ L0 (Ω, S) such that
    limn→∞ fn (ω) = f (ω), ∀ω ∈ Ω.
(ii) There exists g ∈ L1 (Ω, S, µ) such that
    |fn (ω)| ≤ g(ω), ∀ω ∈ Ω, n ∈ N.

Then f ∈ L1 (Ω, S, µ) and
    limn→∞ ∫Ω fn dµ = ∫Ω f dµ,                                      (1.2.20a)
    limn→∞ ∫Ω |fn (ω) − f (ω)| µ(dω) = 0.                           (1.2.20b)

Proof. Set gn := g − fn . Then gn ≥ 0 and lim gn = g − f . Fatou’s Lemma implies
    ∫Ω (g − f ) dµ ≤ lim inf ∫Ω (g − fn ) dµ = ∫Ω g dµ − lim sup ∫Ω fn dµ.
We deduce
    lim sup ∫Ω fn dµ ≤ ∫Ω f dµ.
Arguing in the same fashion with the sequence g + fn we deduce
    ∫Ω f dµ ≤ lim inf ∫Ω fn dµ.
Hence
    ∫Ω f dµ ≤ lim inf ∫Ω fn dµ ≤ lim sup ∫Ω fn dµ ≤ ∫Ω f dµ.
This proves (1.2.20a). The equality (1.2.20b) follows by applying (1.2.20a) to the
sequence |fn − f |, which converges pointwise to 0 and is dominated by 2g. ⊔⊓
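The role of the dominating function g can be seen on two textbook sequences on ([0, 1], λ); the integrals below are computed in closed form by hand, and the code (our own sketch) merely tabulates them:

```python
from fractions import Fraction

# f_n(x) = x**n is dominated by g = 1 and converges to 0 a.e.;
# its integrals 1/(n + 1) converge to the integral of the limit, 0.
dominated = [Fraction(1, n + 1) for n in range(1, 100)]
assert all(a > b for a, b in zip(dominated, dominated[1:]))
assert dominated[-1] == Fraction(1, 100)

# h_n = n * 1_{(0, 1/n)} also converges to 0 pointwise, but admits no
# integrable dominating g: its integrals n * (1/n) stay equal to 1,
# so the limit of the integrals is 1 while the integral of the limit is 0.
undominated = [Fraction(n, 1) * Fraction(1, n) for n in range(1, 100)]
assert all(v == 1 for v in undominated)
```

The second sequence is the standard example showing that hypothesis (ii) of Theorem 1.74 cannot be dropped.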

Theorem 1.75 (Change in variables). Suppose that (Ω0 , S0 ), (Ω1 , S1 ) are mea-
surable spaces and
    Φ : (Ω0 , S0 ) → (Ω1 , S1 )
is a measurable map. Fix a measure µ0 : S0 → [0, ∞] and a measurable function
f ∈ L0 (Ω1 , S1 ). Then
    f ∈ L1 (Ω1 , S1 , Φ# µ0 ) ⇐⇒ f ◦ Φ ∈ L1 (Ω0 , S0 , µ0 ),
and
    ∫Ω0 f ◦ Φ dµ0 = ∫Ω1 f dΦ# µ0 .                                  (1.2.21)

Proof. Note that it suffices to prove the theorem in the case f ≥ 0. The result
is obviously true if f ∈ Elem+ (Ω1 , S1 ). The general case follows from the Mono-
tone Convergence Theorem using the increasing approximation [f ]n ↗ f of f by
elementary functions; see (1.1.7). ⊔⊓

Remark 1.76. Unlike the well known change-in-variables formula, the map Φ in
(1.2.21) need not be bijective, only measurable.
If T : Ω0 → Ω1 is bijective with measurable inverse, then for any measure µ1 on
(Ω1 , S1 ) the formula (1.2.21) applied to the map T −1 reads
    ∫Ω1 f (ω1 ) µ1 (dω1 ) = ∫Ω0 f (T ω0 ) (T −1 )# µ1 (dω0 ),        (1.2.22)
∀f ∈ L1 (Ω1 , S1 , µ1 ).
In particular, if the Ωi are open subsets of Rn , T : Ω0 → Ω1 is a C 1 -diffeomorphism
onto, and µ1 is the Lebesgue measure on Ω1 , then (1.2.22) reads
    ∫Ω1 f (y) λ(dy) = ∫Ω0 f (T x) |det JT (x)| λ(dx),               (1.2.23)
where JT (x) is the Jacobian of the C 1 map x ↦ T x. ⊔⊓
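Formula (1.2.23) can be sanity-checked numerically in one dimension with the diffeomorphism T(x) = x² on (0, 1), for which |det J_T(x)| = 2x; the quadrature helper below is our own sketch:

```python
import math

def midpoint_integral(f, a, b, n=10_000):
    """Midpoint-rule approximation of the integral of f over (a, b)."""
    h = (b - a) / n
    return sum(f(a + (k + 0.5) * h) for k in range(n)) * h

# T(x) = x**2 maps (0, 1) diffeomorphically onto (0, 1), |det J_T(x)| = 2x.
f = math.sin
lhs = midpoint_integral(f, 0.0, 1.0)                           # int f(y) dy
rhs = midpoint_integral(lambda x: f(x * x) * 2 * x, 0.0, 1.0)  # int f(Tx)|J| dx
assert abs(lhs - rhs) < 1e-6
```

Both quadratures approximate 1 − cos 1, so the agreement reflects (1.2.23) rather than a coincidence of the discretization.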

Proposition 1.77. Let f ∈ L0+ (Ω, S). Suppose that µ : S → [0, ∞] is a sigma-finite
measure. Define
    µf : S → [0, ∞], µf [S] = ∫S f dµ := ∫Ω I_S f dµ.
Then µf is a measure. Moreover,
    µf0 = µf1 ⇐⇒ f0 = f1 , µ-almost everywhere. ⊔⊓

The above result has an important converse. To state it we need to introduce
the concept of absolute continuity.

Definition 1.78. Suppose that µ, ν are two measures on the measurable space
(Ω, S). We say that ν is absolutely continuous with respect to µ, and we write this
ν ≪ µ, if
    ∀S ∈ S : µ[S] = 0 ⇒ ν[S] = 0. ⊔⊓

For a proof of the next result we refer to [15; 33; 148].

Theorem 1.79 (Radon–Nikodym). Suppose that µ, ν are two σ-finite measures
on the measurable space (Ω, S). The following statements are equivalent.

(i) ν ≪ µ.
(ii) There exists ρ ∈ L0+ (Ω, S) such that ν = µρ , i.e.,
    ν[S] = ∫S ρ(ω) µ(dω), ∀S ∈ S.

The function ρ is not unique, but it defines a unique element in L0+ (Ω, S, µ)
which we denote by dν/dµ and to which we will refer as the density of ν relative
to µ. ⊔⊓
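On a finite measurable space the Radon–Nikodym density is simply the pointwise ratio of the point masses wherever µ does not vanish; a Python sketch (the measures are our own illustrative choices):

```python
from fractions import Fraction
from itertools import chain, combinations

omega = ['a', 'b', 'c', 'd']
mu = {'a': Fraction(1, 4), 'b': Fraction(1, 4),
      'c': Fraction(1, 2), 'd': Fraction(0)}
nu = {'a': Fraction(1, 8), 'b': Fraction(3, 8),
      'c': Fraction(1, 2), 'd': Fraction(0)}
# nu << mu: nu vanishes on {'d'}, the only mu-null atom.

rho = {w: (nu[w] / mu[w] if mu[w] else Fraction(0)) for w in omega}

# Verify nu[S] = integral over S of rho d(mu) for every S in 2^omega.
for S in chain.from_iterable(combinations(omega, r)
                             for r in range(len(omega) + 1)):
    assert sum(nu[w] for w in S) == sum(rho[w] * mu[w] for w in S)
```

The value of rho on the µ-null atom is arbitrary, which mirrors the statement that dν/dµ is unique only as an element of L0+ (Ω, S, µ).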

1.2.4 Lp spaces

We recall here an important class of Banach spaces. For proofs and many more
details we refer to [50; 102; 148]. We define an equivalence relation ∼µ on L0 (Ω, S)
by declaring f ∼µ g iff µ[ f ≠ g ] = 0. Note that
    f ∈ L1 (Ω, S, µ) and g ∼µ f ⇒ g ∈ L1 (Ω, S, µ) and ∫Ω g dµ = ∫Ω f dµ.
We set
    L0 (Ω, S, µ) := L0 (Ω, S)/∼µ , L1 (Ω, S, µ) := L1 (Ω, S, µ)/∼µ .
For p ∈ [1, ∞) we set
    Lp (Ω, S, µ) := { f ∈ L0 (Ω, S, µ); |f |p ∈ L1 (Ω, S, µ) },
    Lp (Ω, S, µ) := Lp (Ω, S, µ)/∼µ .



We will refer to the functions in Lp (Ω, S, µ) as p-integrable functions. For p ∈ [1, ∞)
and f ∈ Lp (Ω, S, µ) we set
    ‖f ‖p := ( ∫ |f |p dµ )^{1/p} .
Define
    L∞ (Ω, S, µ) := { [f ] ∈ L0 (Ω, S, µ); ∃g ∈ L∞ (Ω, S), g ∼µ f }.
For f ∈ L0 (Ω, S) we define
    ‖f ‖∞ = ess sup |f | := inf{ a ≥ 0; µ[ |f | > a ] = 0 }.
Note that this quantity only depends on the ∼µ -equivalence class of f and
    L∞ (Ω, S, µ) = { f ∈ L0 (Ω, S, µ); ‖f ‖∞ < ∞ }.
In this fashion we obtain for every p ∈ [1, ∞] maps
    ‖ − ‖p : Lp (Ω, S, µ) → [0, ∞).

Theorem 1.80 (Hölder inequality). Let p, q ∈ [1, ∞] be such that
    1/p + 1/q = 1.
Then for any f ∈ Lp (Ω, S, µ) and g ∈ Lq (Ω, S, µ) we have f g ∈ L1 (Ω, S, µ) and
    ∫ |f g| dµ ≤ ‖f ‖p · ‖g‖q .                                      (1.2.24)
⊔⊓
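Hölder's inequality (1.2.24) is easy to probe numerically on a finite uniform measure space; the sketch below (our own setup) uses the conjugate pair p = 3, q = 3/2:

```python
import random

rng = random.Random(2)
# Discrete measure space: uniform measure on N points; check (1.2.24).
N = 50
mu = 1.0 / N
f = [rng.uniform(-1, 1) for _ in range(N)]
g = [rng.uniform(-1, 1) for _ in range(N)]

p, q = 3.0, 1.5  # conjugate exponents: 1/3 + 2/3 = 1
lhs = sum(abs(a * b) for a, b in zip(f, g)) * mu
norm_f = (sum(abs(a) ** p for a in f) * mu) ** (1 / p)
norm_g = (sum(abs(b) ** q for b in g) * mu) ** (1 / q)
assert lhs <= norm_f * norm_g + 1e-12
```

Rerunning with other seeds or other conjugate pairs (including p = q = 2, the Cauchy–Schwarz case) leaves the inequality intact, as it must.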

Theorem 1.81 (Minkowski’s inequality). Let p ∈ [1, ∞]. Then
    ∀f, g ∈ Lp (Ω, S, µ) : ‖f + g‖p ≤ ‖f ‖p + ‖g‖p . ⊔⊓

Theorem 1.82. Fix a sigma-finite measured space (Ω, S, µ).

(i) For any p ∈ [1, ∞], the pair ( Lp (Ω, S, µ), ‖ − ‖p ) is a Banach space.
(ii) If p ∈ [1, ∞), the vector subspace of p-integrable elementary functions is dense
in Lp (Ω, S, µ). In particular, if S is generated as a sigma-algebra by a countable
collection of sets, then Lp (Ω, S, µ) is separable. ⊔⊓

The above density result follows from a combined application of the Monotone
Class Theorem and the Monotone Convergence Theorem; see Exercise 1.4.
Suppose that (Ω, S, µ) is a measured space and p ∈ [1, ∞]. Denote by q the
exponent conjugate to p, i.e.,
    1/p + 1/q = 1 ⇐⇒ q = p/(p − 1).

If g ∈ Lq (Ω, S, µ), then Hölder’s inequality shows that f g ∈ L1 , ∀f ∈ Lp (Ω, S, µ),
and the resulting linear map
    Lp (Ω, S, µ) ∋ f ↦ ξg (f ) := ∫ g f dµ ∈ R
is continuous.

Theorem 1.83. Suppose that (Ω, S, µ) is a sigma-finite measured space and
p ∈ (1, ∞). Then the map
    Lq (Ω, S, µ) ∋ g ↦ ξg ∈ Lp (Ω, S, µ)∗ = the dual of the Banach space Lp (Ω, S, µ)
is a bijective isometry of Banach spaces. ⊔⊓

1.2.5 Measures on compact metric spaces

Up to this point we have indicated how one can use a measure to define an integral.
The integral is a linear functional on an appropriate space of measurable functions.
On certain measurable spaces one can invert this process. Suppose that X is a
topological space and B = BX is the sigma-algebra of Borel sets. We denote by
Cb (X) the vector space of bounded continuous functions on X. This is equipped
with the sup-norm
    ‖f ‖∞ = sup_{x∈X} |f (x)|.
Any finite Borel measure µ on B defines via integration a continuous linear func-
tional
    Iµ : Cb (X) → R, Iµ [f ] = ∫X f (x) µ(dx).
This linear functional satisfies the positivity condition
    Iµ [f ] ≥ 0, ∀f ∈ Cb (X), f ≥ 0.                                (Pos)
On metric spaces the measure µ is uniquely determined by the associated functional
Iµ . More precisely, we have the following fact.
Proposition 1.84. If X is a metric space and µ, ν are two finite Borel measures
such that
    Iµ [f ] = Iν [f ], ∀f ∈ Cb (X),
then µ[B] = ν[B] for any Borel subset B ⊂ X.

Proof. Since the Borel sigma-algebra of X is generated by the π-system CX of
closed subsets, it suffices to show that
    µ[C] = ν[C], ∀C ∈ CX .
To see that this is indeed the case, fix C ∈ CX and, for any n ∈ N, denote by Dn
the closed set
    Dn := { x ∈ X; dist(x, C) ≥ 1/n }.
Define fn ∈ Cb (X),
    fn (x) := dist(x, Dn ) / ( dist(x, Dn ) + dist(x, C) ).
The function fn is identically 1 on C and identically 0 on Dn . Moreover,
    limn→∞ fn (x) = I_C (x), ∀x ∈ X.
Using the Dominated Convergence Theorem we deduce
    µ[C] = limn→∞ Iµ [fn ] = limn→∞ Iν [fn ] = ν[C]. ⊔⊓

We want to include a useful consequence of the above proof.

Corollary 1.85. Suppose that X is a metric space and µ is a finite Borel measure
on X. Then the space Cb (X) is dense in L1 (X, BX , µ). ⊔⊓

We have the following remarkable result.

Theorem 1.86 (Riesz Representation). Suppose that X is a compact metric
space and L is a continuous linear functional on C(X) satisfying the positivity
condition (Pos). Then there exists a unique finite Borel measure µ on X such that
    L[f ] = Iµ [f ], ∀f ∈ C(X). ⊔⊓

For a proof we refer to [52, Sec. IV.6, Thm. 3] or [148, Thm. 13.5].

Example 1.87. We can use the above result to construct probability measures on a smooth compact manifold M of dimension m. As shown in e.g. [122, Sec. 3.4.1], a Riemann metric g on M defines a continuous linear functional
C(M) ∋ f ↦ ∫_M f dV_g ∈ R,
usually referred to as the integral with respect to the volume element determined by g. The Riesz Representation Theorem shows that this corresponds to the integral with respect to a finite Borel measure Vol_g on M called the metric measure. The metric volume of M is then
Vol_g[M] = ∫_M dV_g.
We can associate to it the metric probability measure P_g,
P_g[B] := Vol_g[B] / Vol_g[M],
for any Borel subset B ⊂ M.
In particular, if M is a compact submanifold of a Euclidean space R^N, then it comes equipped with an induced metric and, as such, with a finite metric measure µ_M and thus with a probability measure P_M. We will refer to this probability measure as the Euclidean probability measure.
Suppose for example that M = S^m is the unit sphere in R^{m+1},
S^m := { (x_0, x_1, . . . , x_m) ∈ R^{m+1} ; x_0^2 + · · · + x_m^2 = 1 }.
The Euclidean volume of S^m is (see e.g. [122, Eq. (9.1.10)])
σ_m := 2π^{(m+1)/2} / Γ( (m+1)/2 )
and the Euclidean probability measure is
P_{S^m} = (1/σ_m) µ_{S^m}.
For example, if m = 1, then µ_{S^1} is expressed traditionally as dθ, where θ is the angular coordinate. Hence
P_{S^1}[dθ] = (1/2π) dθ.  (1.2.25)
If we use spherical coordinates (ϕ, θ) on S^2, where ϕ denotes the Latitude and θ the Longitude, then
P_{S^2}[dϕ dθ] = (1/4π) sin ϕ dϕ dθ.  (1.2.26) □
1.3 Invariants of random variables

We have defined random variables as measurable functions on a probability space. In concrete examples this probability space is not specifically mentioned. In fact, there could be different looking random variables describing essentially the same random quantity.
Consider the simplest example of rolling a fair die and observing the number N that shows up. The possible values of N are {1, . . . , 6}. We equip I_6 with the uniform probability measure and then we can view N as the map
N : I_6 → R, N(k) = k, ∀k ∈ I_6.
Consider now a different experiment. Pick a point x uniformly at random in (0, 1]. We receive a reward R(x) = k ∈ I_6 if ⌈6x⌉ = k. The functions N and R are obviously different, but the random quantities they describe are very similar and should have many things in common.
This is analogous to the situation we encounter in geometry or physics when
the same physical or geometric object can be given different descriptions using
different coordinates. The laws of physics or geometry are however independent of
coordinates. Technically, this means they are described in terms of tensors.
In this section we explain a few basic techniques for describing the behavior of
random variables that capture the similarities we observe intuitively.
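The similarity between N and R can already be made concrete: both induce the same law on {1, . . . , 6}. The following sketch (our own illustration; the dictionary names are hypothetical) computes both laws exactly, using the fact that the preimage R^{-1}(k) is the interval ((k − 1)/6, k/6] ⊂ (0, 1], of Lebesgue measure 1/6:

```python
from fractions import Fraction

# Law of N: N(k) = k on I_6 equipped with the uniform probability measure.
law_N = {k: Fraction(1, 6) for k in range(1, 7)}

# Law of R: push forward the Lebesgue measure on (0, 1] under x -> ceil(6x).
# The preimage of k is the interval ((k - 1)/6, k/6], whose length is 1/6.
law_R = {k: Fraction(k, 6) - Fraction(k - 1, 6) for k in range(1, 7)}

assert law_N == law_R   # N and R have the same probability distribution
```

Both maps therefore push their (very different) sample spaces onto the same probability measure on {1, . . . , 6}.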

1.3.1 The distribution and the expectation of a random variable


Fix a probability space (Ω, S, P). For any random variable X ∈ L0 (Ω, S) the most
basic invariant is its probability distribution or the law of X, i.e., the pushforward
PX := X# P. (1.3.1)
Thus P_X is a Borel probability measure on R̄ and, as such, it is uniquely determined by the cumulative distribution function (cdf)
F(x) = F_X(x) := P[X ≤ x].
More precisely, P_X can be identified with the associated Lebesgue–Stieltjes measure,
P_X = dF_X.
When the random variable X is discrete, i.e., the range of X is a finite or countable discrete subset X ⊂ R, then P_X is completely determined by the "mass" of each x ∈ X,
P_X[{x}] = P[X = x].
For this reason, in this case the probability distribution of X is often referred to as the probability mass function (or pmf) of X.
Given a Borel probability measure µ on R̄, we will use the notation X ∼ µ to indicate that the probability distribution of X is µ, i.e., P_X = µ.
Any probability measure µ on (R̄, B_R̄) tautologically defines a random variable with probability distribution µ. If we denote by 1_R̄ the identity map R̄ → R̄, then the random variable
X = 1_R̄ : (R̄, B_R̄, µ) → R̄
has probability distribution P_X = µ. Because of this fact random variables are often identified with their probability distributions. We will use the notations
X =^d Y or X ∼ Y
to indicate that X and Y have the same distribution.
Definition 1.88 (Expectation). The expectation or the mean of the integrable random variable X ∈ L^1(Ω, S, P) is the quantity
E[X] = E_P[X] := ∫_Ω X(ω) P[dω]. □
We deduce from the Change in Variables Theorem 1.75 that
∫_R x P_X[dx] = ∫_R 1_R(x) X_#P[dx] = ∫_Ω 1_R(X(ω)) P[dω] = E[X],
so we obtain the useful formula
E[X] = ∫_R x P_X[dx].  (1.3.2)
If F(x) = F_X(x) is the cdf of X, F(x) = P[X ≤ x], then the distribution P_X is the Lebesgue–Stieltjes measure dF determined by F and (1.3.2) takes the classical form
E[X] = ∫_R x dF(x).  (1.3.3)
The above equality shows that
X =^d Y ⇒ E[X] = E[Y].
More generally, for any Borel measurable function f : R → R such that f(X) is integrable or nonnegative we have³
E[f(X)] = ∫_R f(x) P_X[dx].  (1.3.4)
In other words, the expectation of a random variable is determined by its probability distribution alone, and not by the precise nature of the sample space on which it is defined.
For example, the random variables N and R described at the beginning of this section have the same distribution and thus they have the same mean
E[N] = E[R] = (1 + · · · + 6)/6 = 7/2.
Remark 1.89 (Bertrand's paradox). More often than not, in concrete problems the sample space where a random variable is defined is not explicitly mentioned. Sometimes this can create a problem. Consider the following classical example.
Pick a chord at random on a unit circle. What is the probability that its length is at most √3, the length of the edge of an equilateral triangle inscribed in that unit circle?
The answer depends on the concept of “at random” we utilize.
For example, we can think that a chord is determined by two points θ1, θ2 on the circle or, equivalently, by a pair of numbers in [0, 2π]. The corresponding chord has length ≤ √3 if and only if the angular distance between the endpoints is at most 2π/3, i.e., |θ1 − θ2| ≤ 2π/3 or |θ1 − θ2| ≥ 4π/3. The region in the square [0, 2π]² occupied by such pairs (θ1, θ2) consists of a band around the diagonal together with two isosceles right triangles with legs of size 2π/3 with vertices at (0, 2π) and (2π, 0). Assuming that the point (θ1, θ2) is chosen uniformly inside the square [0, 2π]², we deduce that the probability that the chord has length at most √3 is 2/3.
On the other hand, a chord is uniquely determined by the location of its midpoint inside the unit circle. The chord has length at most √3 if and only if the midpoint is at distance at least 1/2 from the center. Assuming that the midpoint is chosen uniformly inside this circle, we deduce that the probability that the chord is at most √3 is 3/4, since the disk of radius 1/2 occupies 1/4 of the unit disk.
We can try to decide empirically which is the correct answer, but any simulation/experiment must adopt a certain model of randomness. Things are even more
complex. The set of chords has a natural symmetry given by the group of rotations
about the origin. Any “reasonable” model of randomness ought to be compati-
ble with this symmetry. In mathematical terms this means that the underlying
probability measure ought to be invariant with respect to this symmetry.
As a set, we can identify the set of chords with the unit disk: we can describe
a chord by indicating the location of its midpoint. The problem boils down to
choosing a rotation invariant Borel measure on the unit disk. The quotient of the disk with respect to the group of rotations is a segment. In particular, any probability measure µ on the unit interval defines a rotation invariant probability measure P_µ on the unit disk, determined by the requirement
P_µ[0 ≤ r ≤ r1, θ0 ≤ θ ≤ θ1] = ( (θ1 − θ0)/2π ) µ[[0, r1]].

³ In undergraduate probability classes this formula is often referred to as LOTUS: the Law Of The Unconscious Statistician.
Hence, there are infinitely many geometric randomness models. In our first model of randomness, the measure µ is the distribution of |cos((Θ1 − Θ2)/2)|, where Θ1, Θ2 are independent uniformly distributed on [0, 2π]. In the second model of randomness the measure µ is 2r dr. □
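A quick Monte Carlo sketch (our own illustration, not from the text) makes the disagreement between the two models tangible; the chord determined by endpoint angles θ1, θ2 has length 2|sin((θ1 − θ2)/2)|:

```python
import math
import random

random.seed(1)
N = 200_000
SIDE = math.sqrt(3)  # side length of the inscribed equilateral triangle

# Model 1: choose the two endpoints uniformly on the circle.
hits = sum(
    2 * abs(math.sin((random.uniform(0, 2 * math.pi)
                      - random.uniform(0, 2 * math.pi)) / 2)) <= SIDE
    for _ in range(N)
)
p_endpoints = hits / N  # close to 2/3

# Model 2: choose the chord's midpoint uniformly in the unit disk
# (rejection sampling from the bounding square).
hits = count = 0
while count < N:
    x, y = random.uniform(-1, 1), random.uniform(-1, 1)
    if x * x + y * y <= 1:
        count += 1
        # chord length 2*sqrt(1 - r^2) <= sqrt(3)  <=>  r >= 1/2
        if x * x + y * y >= 0.25:
            hits += 1
p_midpoint = hits / N  # close to 3/4

print(p_endpoints, p_midpoint)
```

Both estimators answer the same question, yet they converge to different numbers because they encode different measures on the set of chords.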

If X, Y ∈ L^1(Ω, S, P) and a, b ∈ R, then aX + bY ∈ L^1(Ω, S, P) and
E[aX + bY] = aE[X] + bE[Y].  (1.3.5)
The above linearity of the expectation is a very powerful tool. Here is a simple illustration.
Example 1.90. Suppose that n ≥ 3 birds are arranged along a circle looking towards the center. At a given moment each bird randomly and independently turns its head to the left or to the right, with equal probabilities. After they turn their heads, some birds will be visible to one of their neighbors, and some not. Denote by X_n the number of birds that are invisible to their neighbors. We want to compute E[X_n], the expected number of invisible birds. We leave the reader to convince herself/himself that X_n is indeed a well defined mathematical object.
For k = 1, . . . , n we denote by B_k the event that the k-th bird is invisible to its neighbors. Then
X_n = Σ_{k=1}^n I_{B_k} and E[X_n] = Σ_{k=1}^n E[I_{B_k}] = Σ_{k=1}^n P[B_k] = nP[B_1].
The probability that the first bird is invisible to its neighbors is computed by observing that this happens iff its right neighbor turns its head right and its left neighbor turns its head left. Since they do this independently with probabilities 1/2 we deduce
P[B_1] = 1/2 · 1/2 = 1/4.
Hence
E[X_n] = n/4.
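Under the stated assumptions, this expectation can be confirmed by exhaustive enumeration of all 2^n equally likely head configurations (a sketch; the encoding 0 = left, 1 = right and the helper name are our own):

```python
from fractions import Fraction
from itertools import product

def expected_invisible(n):
    """Exact E[X_n] over all 2**n equally likely head configurations."""
    total = 0
    for heads in product((0, 1), repeat=n):  # 0 = left, 1 = right
        for k in range(n):
            # Bird k is invisible iff its left neighbor turns left (away from it)
            # and its right neighbor turns right (away from it).
            if heads[(k - 1) % n] == 0 and heads[(k + 1) % n] == 1:
                total += 1
    return Fraction(total, 2 ** n)

print([expected_invisible(n) for n in range(3, 8)])  # each equals n/4
```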
To appreciate how efficient this computation is we present an alternate method. We will determine the expectation by determining the probability distribution of X_n, or equivalently its probability generating function (pgf)
G_{X_n}(t) = E[t^{X_n}] = Σ_{k≥0} P[X_n = k] t^k.
I learned the argument below from Luke Whitmer, a student in one of my undergraduate probability courses.
Assume the birds sit on the edges of a convex n-gon P_n. Orienting an edge corresponds to describing in which direction the corresponding bird is looking. We will refer to a choice of orientations of the
edges of P_n as an orientation of P_n. We denote by Ω_n the collection of orientations of P_n. Note that
|Ω_n| = 2^n.
Fix a cyclic clockwise labelling of the vertices of the n-gon, v_1, v_2, . . . , v_n, and define v_m for m ∈ N by requiring v_i = v_j if i ≡ j mod n. The i-th bird sits on the edge E_i := [v_i, v_{i+1}]. The i-th bird, or equivalently the edge E_i, is invisible to its neighbors if E_{i−1} is oriented from v_i to v_{i−1} and E_{i+1} is oriented from v_{i+1} to v_{i+2}. Given an orientation ω of P_n we denote by x_n(ω) the number of invisible edges in this orientation. Thus
P[X_n = j] = #{ω ∈ Ω_n ; x_n(ω) = j} / 2^n.
We distinguish two cases.
1. n = 2k. Denote by P_n^+ the polygon obtained from P_n by collapsing the edges E_1, E_3, E_5, . . . . As vertices of the new polygon we can take the collapsed edges. The edges of the new polygon are
E_1^+ = E_2, E_2^+ = E_4, . . . , E_k^+ = E_{2k}.
Similarly, we denote by P_n^− the polygon obtained from P_n by collapsing the edges E_2, E_4, . . . . We can take the collapsed edges as vertices of the new polygon. Its edges are
E_1^− = E_1, E_2^− = E_3, . . . , E_k^− = E_{2k−1}.
Note that an orientation of P_n induces orientations of P_n^± and, conversely, orientations of P_n^± determine an orientation of P_n. We denote by Ω_n^± the set of orientations of P_n^±. We thus have a bijection
Ω_n ∋ ω ↦ (ω_+, ω_−) ∈ Ω_n^+ × Ω_n^−.
Suppose now that we have an oriented m-gon Q_m. If q_1, . . . , q_m are the vertices of Q_m, we say that q_i is an out-vertex if both edges at q_i are oriented away from q_i, and it is an in-vertex if both edges at q_i are oriented towards q_i. A neutral vertex is a vertex with one incoming edge and one outgoing edge. For an orientation ω of Q_m we denote by y_m(ω) the number of out-vertices.
Fix an orientation on P_n. An edge E_i is invisible in this orientation if and only if the corresponding vertex of the collapsed polygon is an out-vertex: for i odd this vertex lies in P_n^+, and for i even it lies in P_n^−. Note that
x_{2k}(ω) = y_k(ω_+) + y_k(ω_−).  (1.3.6)


We denote by x_{n,j} the number of oriented n-gons with j invisible edges and we set
P_n(t) = Σ_{j≥0} x_{n,j} t^j = Σ_{ω∈Ω_n} t^{x_n(ω)}.
Note that
G_{X_n}(t) = (1/2^n) P_n(t).
We denote by y_{m,j} the number of oriented m-gons with j out-vertices and we set
Q_m(t) := Σ_{j≥0} y_{m,j} t^j = Σ_{ω∈Ω_m} t^{y_m(ω)}.
From (1.3.6) we deduce
P_{2k}(t) = Q_k(t)².  (1.3.7)

2. n = 2k + 1. Fix an orientation of P_n. Consider a new oriented n-gon Q_n with edges, in clockwise order,
E'_1, E'_2, . . . , E'_n,
where E'_i carries the orientation of the edge E_{(2i−1) mod n} of P_n. Denote the vertices of Q_n by q_1, q_2, . . . , q_n, so the two edges that meet at q_i are E'_{i−1} and E'_i.
Imagine stepping in a clockwise fashion on the edges of P_n, skipping every other edge and labelling by E'_i the i-th edge we stepped on. Observe that the edge E_{2i mod n} of P_n is invisible iff the vertex q_{i+1} (where E'_i ↔ E_{2i−1} and E'_{i+1} ↔ E_{2i+1} meet) is an out-vertex. Thus, the number of invisible edges of P_n is equal to the number of out-vertices of Q_n. Hence
P_{2k+1}(t) = Q_{2k+1}(t).  (1.3.8)
To determine Qm (t) fix an orientation ω of an m-gon Qm . As we travel clockwise from one vertex to
the next, the out- and in-vertices alternate: once we leave an out-vertex, the first non-neutral vertex we
meet is an in-vertex and similarly once we leave an in-vertex the first non-neutral vertex we encounter is
an out-vertex. In particular this shows that there is an equal number of in and out-vertices. Fix a cyclic
labelling {1, 2, . . . , m} of the vertices of Q_m. If y_m(ω) = j, then there are also j in-vertices, so the set S of locations of the in-/out-vertices has cardinality 2j,
S = {1 ≤ ℓ_1 < ℓ_2 < · · · < ℓ_{2j} ≤ m}.
The above discussion shows that if ℓ_1 is an out/in-vertex, then all the vertices ℓ_3, ℓ_5, . . . are out/in-vertices, while the even vertices ℓ_2, ℓ_4, . . . are in/out-vertices. This shows that
y_{m,j} = 2 \binom{m}{2j},  Q_m(t) = Σ_{j≥0} 2 \binom{m}{2j} t^j,
Q_m(t²) = (1 + t)^m + (1 − t)^m.
Hence
P_{2k}(t²) = ( (1 + t)^k + (1 − t)^k )² = (1 + t)^{2k} + (1 − t)^{2k} + 2(1 − t²)^k,
P_{2k+1}(t²) = (1 − t)^{2k+1} + (1 + t)^{2k+1}.
We conclude that
G_{X_n}(t) = (1/2^n) ( (1 − √t)^{2k+1} + (1 + √t)^{2k+1} ),  n = 2k + 1,
G_{X_n}(t) = (1/2^n) ( (1 + √t)^{2k} + (1 − √t)^{2k} + 2(1 − t)^k ),  n = 2k.
The mean of X_n is
E[X_n] = G′_{X_n}(1). □
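As a sanity check on the closed formula, one can compare it with the pgf computed by direct enumeration (a sketch under our own 0 = left, 1 = right encoding of the head turns; helper names are hypothetical):

```python
import math
from itertools import product

def pgf_bruteforce(n, t):
    """E[t^{X_n}] computed over all 2**n equally likely configurations."""
    total = 0.0
    for heads in product((0, 1), repeat=n):  # 0 = left, 1 = right
        x = sum(
            1 for k in range(n)
            if heads[(k - 1) % n] == 0 and heads[(k + 1) % n] == 1
        )
        total += t ** x
    return total / 2 ** n

def pgf_closed(n, t):
    """The closed formula for G_{X_n}(t) derived above (requires t >= 0)."""
    s = math.sqrt(t)
    if n % 2 == 1:                       # n = 2k + 1
        return ((1 - s) ** n + (1 + s) ** n) / 2 ** n
    k = n // 2                           # n = 2k
    return ((1 + s) ** n + (1 - s) ** n + 2 * (1 - t) ** k) / 2 ** n

for n in range(3, 9):
    for t in (0.0, 0.25, 1.0, 2.0):
        assert math.isclose(pgf_bruteforce(n, t), pgf_closed(n, t), abs_tol=1e-12)
```

In particular pgf_closed(n, 1.0) equals 1 for every n, as any pgf must.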

Theorem 1.91. Suppose that (Ω, S, P) is a probability space and F, G ⊂ S are two independent sigma-subalgebras. If X ∈ L^1(Ω, F, P) and Y ∈ L^1(Ω, G, P), then XY ∈ L^1(Ω, S, P) and
E[XY] = E[X] E[Y].  (1.3.9)

Proof. Observe that the equality (1.3.9) is bilinear in X and Y. The equality holds for X = I_F, F ∈ F, and Y = I_G, G ∈ G, and thus it holds for X ∈ Elem(Ω, F) and Y ∈ Elem(Ω, G).
If X, Y are nonnegative, then D_n[X] D_n[Y] ↗ XY and the Monotone Convergence Theorem shows that (1.3.9) holds for X, Y ≥ 0. The bilinearity of this equality implies that it holds in the claimed generality. □
Corollary 1.92. Suppose that X, Y ∈ L^1(Ω, S, P) are independent random variables such that XY ∈ L^1(Ω, S, P). Then
E[XY] = E[X] E[Y].  (1.3.10)

Proof. Use Theorem 1.91 with F = σ(X) and G = σ(Y). □
Corollary 1.93. Suppose that the random variables X_1, . . . , X_n : (Ω, S, P) → R are independent. Then, for any Borel measurable functions f_1, . . . , f_n : R → R such that
f_i(X_i) ∈ L^1(Ω, S, P), ∀i = 1, . . . , n,
we have f_1(X_1) · · · f_n(X_n) ∈ L^1(Ω, S, P) and
E[f_1(X_1) · · · f_n(X_n)] = E[f_1(X_1)] · · · E[f_n(X_n)].
Proof. Follows inductively from Corollary 1.92 by observing that for any k = 2, . . . , n the random variables f_1(X_1) · · · f_{k−1}(X_{k−1}) and f_k(X_k) are independent. □
Corollary 1.94. Let X ∈ L^1(Ω, S, P) and suppose that F ⊂ S is a sigma-subalgebra. Then the following are equivalent.

(i) For any Borel measurable function f : R → R such that f(X) ∈ L^1 and any F ∈ F,
E[f(X) I_F] = P[F] E[f(X)].
(ii) The random variable X is independent of F.

Proof. The implication (i) ⇒ (ii) follows by using f = I_{(−∞,x]}, x ∈ R. The converse follows from Corollary 1.92. □
The following is not the usual definition of a convex function (see Exercise 1.23), but it has the advantage that it is better suited for the applications we have in mind.

Definition 1.95. Let I be an interval of the real axis. A continuous function ϕ : I → R is called convex if for any x_0 ∈ I there exists a linear function ℓ(x) such that⁴
ℓ(x_0) = ϕ(x_0), ℓ(x) ≤ ϕ(x), ∀x ∈ I.
The convex function is called strictly convex if for any x_0 ∈ I there exists a linear function ℓ(x) such that
ℓ(x_0) = ϕ(x_0), ℓ(x) < ϕ(x), ∀x ∈ I \ {x_0}. □

For example, if ϕ : I → R is C², then ϕ is convex (resp. strictly convex) if
ϕ″(x) ≥ 0 (resp. ϕ″(x) > 0), ∀x ∈ I.
Theorem 1.96 (Jensen's Inequality). Suppose that (Ω, S, P) is a probability space, X ∈ L^1(Ω, S, P), and ϕ : I → R is a convex function defined on an interval I that contains the range of X. Then E[ϕ(X)] is well defined (possibly infinite) and
ϕ(E[X]) ≤ E[ϕ(X)].  (1.3.11)
Moreover, if ϕ is strictly convex, then ϕ(E[X]) = E[ϕ(X)] iff X is a.s. constant.
Proof. Observe that when ϕ is linear the theorem is valid in the stronger form ϕ(E[X]) = E[ϕ(X)]. We can find a linear function ℓ : R → R such that ϕ(x) ≥ ℓ(x), ∀x ∈ I, and it is clear that if the theorem is valid for the nonnegative convex function g := ϕ − ℓ, then it is also valid for ϕ. Note that E[g(X)] ∈ [0, ∞]

⁴ The graph of such an ℓ is tangent to the graph of ϕ at x_0.
and thus the addition E[g(X)] + ℓ(E[X]) is well defined and yields a well defined E[ϕ(X)] when ϕ(X) is integrable or nonnegative. Moreover, ϕ(X) is integrable if and only if g(X) is. Because of this, we set
E[ϕ(X)] := ∞ if ϕ(X) is not integrable.
Set µ := E[X] and observe that µ ∈ I since X ∈ I a.s. Choose a linear function ℓ : R → R such that
ℓ(x) ≤ ϕ(x), ∀x ∈ I and ℓ(µ) = ϕ(µ).
Then
ϕ(E[X]) = ϕ(µ) = ℓ(µ) = E[ℓ(X)] ≤ E[ϕ(X)].
If ϕ is strictly convex, then we can choose ℓ(x) such that
ℓ(x) < ϕ(x), ∀x ∈ I \ {µ} and ℓ(µ) = ϕ(µ).
If X is not a.s. constant, neither is the nonnegative random variable ϕ(X) − ℓ(X), so
E[ϕ(X)] − ϕ(µ) = E[ϕ(X) − ℓ(X)] > 0. □
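For a quick numerical illustration of (1.3.11) (our own example, not from the text), take X a fair die roll and the strictly convex ϕ(x) = x²; since X is not a.s. constant the inequality is strict:

```python
from fractions import Fraction

values = range(1, 7)                   # a fair die roll
p = Fraction(1, 6)                     # uniform probabilities
mean = sum(p * x for x in values)      # E[X] = 7/2

phi = lambda x: x * x                  # strictly convex on R
lhs = phi(mean)                        # phi(E[X]) = 49/4
rhs = sum(p * phi(x) for x in values)  # E[phi(X)] = 91/6

assert lhs < rhs                       # Jensen, strict: X is not a.s. constant
print(lhs, rhs)
```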

For any convex function ϕ : R → R we define the ϕ-entropy of an integrable random variable X to be the quantity
H_ϕ[X] := E[ϕ(X)] − ϕ(E[X]).  (1.3.12)
Jensen's inequality shows that H_ϕ[X] ≥ 0.
1.3.2 Higher order integral invariants of random variables

On a probability space (Ω, S, P) we have the inclusions
L^{p_1}(Ω, S, P) ⊂ L^{p_0}(Ω, S, P), ∀1 ≤ p_0 < p_1 ≤ ∞.
Indeed, let X ∈ L^{p_1}(Ω, S, P). Set
p := p_1/p_0, ϕ(x) = x^p, x ≥ 0, Y = |X|^{p_0}.
Since p_1 > p_0 the function ϕ is convex and, using (1.3.11), we have
‖X‖_{p_0}^{p_1} = ϕ(E[Y]) ≤ E[ϕ(Y)] = E[|X|^{p_1}] = ‖X‖_{p_1}^{p_1}.
In particular, if p_0 = 1 ≤ p we deduce
E[|X|]^p ≤ E[|X|^p].  (1.3.13)
Given k ∈ N and X ∈ L^k(Ω, S, P) we define the k-th moment of X to be the quantity
µ_k[X] := E[X^k].
Note that µ_1[X] = E[X].
Definition 1.97 (Variance). Let (Ω, S, P) be a probability space. Suppose that X ∈ L²(Ω, S, P) is a random variable with mean µ := E[X]. The variance of X is the real number
Var[X] := E[(X − µ)²].
The standard deviation of X is the quantity
σ[X] := √(Var[X]). □

Note that
Var[X] = 0 ⟺ X = E[X] a.s.
The variance can be given the alternate description
Var[X] = E[X²] − E[X]² = µ_2[X] − µ_1[X]².  (1.3.14)
Indeed, if we set µ := E[X], then
Var[X] = E[X² − 2µX + µ²] = E[X²] − 2µE[X] + µ² = E[X²] − E[X]².
This shows that the variance is a special case of ϕ-entropy. More precisely,
Var[X] = H_ϕ[X], ϕ(x) = x².
Note that
Var[aX + b] = a² Var[X], ∀a, b ∈ R.  (1.3.15)
Indeed, set X̄ := X − µ and Z := aX + b. Then
Var[X] = E[X̄²], Z − E[Z] = a(X − E[X]) = aX̄,
so
Var[Z] = E[a²X̄²] = a² Var[X].
Theorem 1.98 (Chebyshev's inequality). Let X ∈ L²(Ω, S, P). Set µ := E[X] and σ := σ[X]. Then
P[|X − µ| ≥ cσ] ≤ 1/c², ∀c > 0.  (1.3.16)
Equivalently,
P[|X − µ| ≥ r] ≤ Var[X]/r² = σ²/r², ∀r > 0.  (1.3.17)

Proof. Set Y := |X − µ|². Then, by (1.2.19),
P[|X − µ| ≥ r] = P[Y ≥ r²] ≤ E[Y]/r² = Var[X]/r².
Chebyshev's inequality (1.3.16) now follows from (1.3.17) by setting r = cσ. □
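For a concrete illustration (our own example), let X be a fair die roll, so µ = 7/2 and Var[X] = 35/12; the exact tail probabilities indeed stay below the bound (1.3.17) for every threshold:

```python
from fractions import Fraction

values = range(1, 7)
p = Fraction(1, 6)
mu = sum(p * x for x in values)                 # E[X]   = 7/2
var = sum(p * (x - mu) ** 2 for x in values)    # Var[X] = 35/12

for r in (Fraction(1), Fraction(2), Fraction(5, 2)):
    tail = sum(p for x in values if abs(x - mu) >= r)  # P[|X - mu| >= r]
    assert tail <= var / r ** 2                        # Chebyshev's bound
```

Note that for small r the bound exceeds 1 and carries no information; it only becomes useful for deviations of at least a couple of standard deviations.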
Definition 1.99. Let (Ω, S, P) be a probability space and X, Y ∈ L²(Ω, S, P). We set
µ_X := E[X], µ_Y := E[Y].

(i) The covariance of X, Y is the quantity
Cov[X, Y] := E[(X − µ_X)(Y − µ_Y)].
(ii) If X, Y are not deterministic, we define the correlation coefficient of X and Y to be
ρ[X, Y] := Cov[X, Y] / (σ[X] σ[Y]). □
Proposition 1.100. Let X, Y ∈ L²(Ω, S, P). Then the following hold.

(i) Cov[X, Y] = E[XY] − E[X] E[Y].
(ii) If X, Y are independent, then Cov[X, Y] = 0.
(iii) Var[X + Y] = Var[X] + Var[Y] + 2 Cov[X, Y].
(iv) If X, Y are independent, then Var[X + Y] = Var[X] + Var[Y].

Proof. Set
µ_X := E[X], X̄ = X − µ_X, µ_Y = E[Y], Ȳ = Y − µ_Y.
(i) We have
Cov[X, Y] = E[X̄Ȳ] = E[XY] − E[µ_X Y] − E[µ_Y X] + µ_X µ_Y = E[XY] − µ_X µ_Y,
since E[µ_X Y] = E[µ_Y X] = µ_X µ_Y.
(ii) Corollary 1.92 shows that if X, Y are independent, then E[XY] = µ_X µ_Y, i.e., Cov[X, Y] = 0.
(iii) Next,
Var[X + Y] = E[(X̄ + Ȳ)²] = E[X̄²] + E[Ȳ²] + 2E[X̄Ȳ] = Var[X] + Var[Y] + 2 Cov[X, Y].
(iv) This follows from (ii) and (iii). □
Corollary 1.101. If X_1, . . . , X_n ∈ L²(Ω, S, P) are independent, then
Var[X_1 + · · · + X_n] = Var[X_1] + · · · + Var[X_n].  (1.3.18) □
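Identity (1.3.18) can be verified exactly for sums of independent Bernoulli variables, since for independent integer-valued summands the pmf of the sum is a convolution (a sketch with our own helper names):

```python
from fractions import Fraction

def convolve(pmf1, pmf2):
    """pmf of the sum of two independent integer-valued random variables."""
    out = {}
    for a, pa in pmf1.items():
        for b, pb in pmf2.items():
            out[a + b] = out.get(a + b, Fraction(0)) + pa * pb
    return out

def variance(pmf):
    mean = sum(pr * x for x, pr in pmf.items())
    return sum(pr * (x - mean) ** 2 for x, pr in pmf.items())

p = Fraction(1, 3)
bern = {0: 1 - p, 1: p}            # a Bernoulli(1/3) variable, variance pq = 2/9
total = {0: Fraction(1)}           # pmf of the empty sum
for n in range(1, 6):
    total = convolve(total, bern)  # add one more independent copy
    assert variance(total) == n * variance(bern)   # exactly (1.3.18)
```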
Example 1.102. Consider a probability space (Ω, S, P) and two events A, B ∈ S. We have
Cov[I_A, I_B] = P[A ∩ B] − P[A] P[B].
Thus A, B are independent iff Cov[I_A, I_B] = 0. □
Definition 1.103 (Moment generating function). Let X be a random variable defined on a probability space (Ω, S, P) such that e^{tX} ∈ L^1(Ω, S, P) for all t in an open interval I containing 0. The moment generating function or mgf of X is the function
M_X : I → R, M_X(t) = E[e^{tX}]. □
The proof of the following result is left to you as an exercise.

Proposition 1.104. Let X be such that M_X(t) is defined for all t ∈ (−t_0, t_0).

(i) The moment generating function determines the moments of X. More precisely, the function
(−t_0, t_0) ∋ t ↦ M_X(t)
is smooth and
M_X^{(k)}(0) = µ_k[X], ∀k = 1, 2, . . . .  (1.3.19)
(ii) The power series
Σ_{n≥0} µ_n[X] t^n/n!
converges to M_X(t), ∀t ∈ (−t_0, t_0). □
Corollary 1.105. Suppose that X_1, . . . , X_n ∈ L^0(Ω, S, P) are independent random variables such that e^{tX_k} ∈ L^1(Ω, S, P) for any k = 1, . . . , n and any t in an open interval I ⊂ R that contains the origin. Then
M_{X_1+···+X_n}(t) = M_{X_1}(t) · · · M_{X_n}(t), ∀t ∈ I.

Proof. This is a special case of Corollary 1.93 corresponding to the choices
f_1(x) = · · · = f_n(x) = e^{tx}, t ∈ I. □
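As a numerical illustration of Corollary 1.105 (our own example), take n independent Bernoulli(p) variables, each with mgf q + pe^t (q = 1 − p); their sum has the standard binomial pmf, and computing E[e^{tN}] directly from that pmf recovers the product (q + pe^t)^n:

```python
import math

def mgf_binomial(n, p, t):
    """E[e^{tN}] computed from the binomial pmf of N = X_1 + ... + X_n."""
    return sum(
        math.comb(n, k) * p ** k * (1 - p) ** (n - k) * math.exp(t * k)
        for k in range(n + 1)
    )

def mgf_bernoulli(p, t):
    return (1 - p) + p * math.exp(t)   # M_{X_k}(t) = q + p e^t

n, p = 7, 0.3
for t in (-1.0, 0.0, 0.5, 2.0):
    assert math.isclose(mgf_binomial(n, p, t), mgf_bernoulli(p, t) ** n)
```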
Remark 1.106 (The moment problem). Denote by Prob the set of Borel probability measures on the real axis and by Prob_{∞−} the subset of Prob consisting of the probability measures p such that
∫_R |x|^k p[dx] < ∞, ∀k ∈ N.
For p ∈ Prob_{∞−} and k ∈ N_0 we set
µ_k[p] := ∫_R x^k p[dx].
We denote by R^{N_0} the set of sequences of real numbers s = (s_n)_{n≥0}. We have a map
µ : Prob_{∞−} → R^{N_0}, µ[p] = (µ_n[p])_{n≥0}.
The moment problem asks the following.

(i) Describe the range of µ, i.e., given a sequence of real numbers s = (s_0, s_1, . . .), decide if there exists p ∈ Prob_{∞−} such that µ_n[p] = s_n, ∀n ≥ 0.
(ii) Is it true that the moments uniquely determine a probability measure, i.e., given s in the range of µ, is it true that there exists a unique p ∈ Prob_{∞−} such that µ[p] = s?

Part (i) of the moment problem is completely understood in the sense that several necessary and sufficient conditions are known for a sequence s to be the sequence of moments of a probability measure on R. We refer to [137, Chap. 3] for more details.
As for part (ii), it is known that a sequence s can be the sequence of moments of several probability measures; see Exercise 1.30. On the other hand, there are known sufficient conditions on s guaranteeing the uniqueness of the measure with that sequence of moments; see [137, Chap. 4] for more details. In particular, if X is a random variable such that e^{tX} is integrable for any t in an open interval containing 0, then P_X is uniquely determined by its moments, [137, Cor. 4.14]. □
We formulate for the record the last uniqueness result mentioned above. In Exercise 2.45 we outline a proof of this special case.

Theorem 1.107. Let X, Y ∈ L^0(Ω, S, P) be such that there exists r > 0 with the property that
E[e^{tX}], E[e^{tY}] < ∞, ∀|t| < r.
Then
X =^d Y ⟺ M_X(t) = M_Y(t), ∀|t| < r. □
Corollary 1.108. Suppose that P_0, P_1 are Borel probability measures on R supported on [0, 1], i.e.,
P_0[R \ [0, 1]] = P_1[R \ [0, 1]] = 0.
Then
P_0 = P_1 ⟺ ∫_R x^n P_0[dx] = ∫_R x^n P_1[dx], ∀n ∈ N. □

Proof. Note that
∫_R |x|^n P_i[dx] ≤ 1 ⇒ ∫_R e^{tx} P_i[dx] < ∞, ∀t ∈ R,
and, expanding e^{tx} into a power series,
∫_R e^{tx} P_0[dx] = ∫_R e^{tx} P_1[dx], ∀t ∈ R ⟺ ∫_R x^n P_0[dx] = ∫_R x^n P_1[dx], ∀n ∈ N_0.
The conclusion now follows from Theorem 1.107. □
To a random variable X with range contained in N_0 = {0, 1, 2, . . .} we can associate its probability generating function (or pgf)
G_X(t) := Σ_{n≥0} P[X = n] t^n = E[t^X].
Note that
G_X(1) = 1, G′_X(1) = E[X], G″_X(1) = E[X(X − 1)].  (1.3.20)
Similarly, if X, Y are two independent N_0-valued random variables, then
G_{X+Y}(t) = E[t^{X+Y}] = E[t^X] E[t^Y] = G_X(t) G_Y(t).

1.3.3 Classical examples of discrete random variables


The theory of probability has grown mostly from concrete intriguing examples. In
this process people encountered various frequently occurring patterns encoded by
some ubiquitous random variables. We describe a few of them in the following sub-
sections. These examples are part of the theory of probability and have many and
varied uses. Their knowledge is absolutely necessary for a genuine understanding
of probability.
Before Kolmogorov (and currently in most undergraduate probability courses),
the world of random variables was divided into three categories: discrete, continu-
ous and neither, or mixed. The discrete random variables are those whose ranges
are discrete subsets of R. A random variable X is called continuous if its probability
distribution PX is absolutely continuous with respect to the Lebesgue measure on
R. We throw in the third category the random variables that do not fit in these
two categories. We want to describe a few classical examples of discrete and continuous random variables that play an important role in probability. Throughout
our presentation we will frequently assume that given a sequence (µn )n∈N of Borel
probability measures on R there exists a probability space (Ω, S, P) and indepen-
dent random variables Xn : (Ω, S, P) → R such that PXn = µn , ∀n ∈ N. The fact
that such a thing is possible is a consequence of Kolmogorov’s existence theorem,
Theorem 1.195.
We begin by introducing some frequently occurring discrete random variables
by describing the random experiments where they appear.

Example 1.109 (Bernoulli random variables). Suppose we perform a random experiment aiming to observe the occurrence of a certain event S, p := P[S]. When S has occurred we say that we have registered a success. Traditionally, such an experiment is called a Bernoulli trial with success probability p. When the event S is not observed we say that the experiment was a failure. The failure probability is q := 1 − p. The Bernoulli trial is encoded by the random variable I_S which takes the value 1 when we register a success, and the value 0 otherwise. We also say that I_S is a Bernoulli random variable, and observe that
E[I_S] = p, Var[I_S] = E[I_S²] − E[I_S]² = p − p² = pq.
Note that any random variable with range {0, 1} is a Bernoulli random variable since X = I_{X=1}. □
Example 1.110 (Binomial random variables). Suppose that we perform the experiment in the above example n times, and the results of these experiments are independent of each other. We denote by N the number of successes observed during these n trials.⁵ We say that N is a binomial random variable corresponding to n trials with success probability p and we indicate this by N ∼ Bin(n, p).
For k = 1, . . . , n we denote by S_k the event "the k-th trial was a success". Then
N = Σ_{k=1}^n I_{S_k} and E[N] = Σ_{k=1}^n E[I_{S_k}] = np.
Since the events (S_k)_{1≤k≤n} are independent we deduce from Corollary 1.93 that
Var[N] = Σ_{k=1}^n Var[I_{S_k}] = npq.
Next observe that
G_{I_{S_1}}(s) = q + ps, M_{I_{S_1}}(t) = q + pe^t,
⁵ Think for example that you roll a pair of dice 10 times and you aim to count how many times the sum of the numbers on the dice is 7. In this case success is when the sum is 7 and it is not hard to see that the probability of success is 1/6.
so
G_N(s) = G_{I_{S_1}}(s)^n = (q + ps)^n, M_N(t) = M_{I_{S_1}}(t)^n = (q + pe^t)^n.
This string of Bernoulli trials can be realized abstractly in the probability space
({0, 1}^n, 2^{{0,1}^n}, β_p^{⊗n})
described in Example 1.31(e). The events
S_k := {(ε_1, . . . , ε_n) ∈ {0, 1}^n ; ε_k = 1}, k = 1, . . . , n,
are independent and P[S_k] = p, ∀k = 1, . . . , n. Then
I_{S_k}(ε) = ε_k, ∀ε = (ε_1, . . . , ε_n) ∈ {0, 1}^n.
As explained in Example 1.31(e), the probability distribution of N is given by the equalities
P[N = k] = \binom{n}{k} p^k q^{n−k}, k = 0, 1, . . . , n.
Equivalently,
P_N = Σ_{k=0}^n \binom{n}{k} p^k q^{n−k} δ_k. □
Example 1.111 (Waiting for successes). Suppose that we perform independent Bernoulli trials until we register the first success. We denote by T_1 the moment we observe the first success, T_1 ∈ {1, 2, . . . , ∞}. The random variable T_1 is a geometric random variable with success probability p. We write this T_1 ∼ Geom(p).
Observe that T_1 = n iff the first n − 1 trials were failures and the n-th trial was a success. Thus
P[T_1 = n] = q^{n−1} p.
In particular, P[T_1 = ∞] = 0. We deduce that the probability distribution of T_1 is
P_{T_1} = Σ_{n≥1} p q^{n−1} δ_n.
Moreover,
E[T_1] = Σ_{n≥1} n p q^{n−1} = p Σ_{n≥1} n q^{n−1} = p (d/dq) Σ_{n≥0} q^n = p/(1 − q)² = 1/p.  (1.3.21)
Here is a simple plausibility test for this result. Suppose we roll a die until we first roll a 1. The probability of rolling a 1 is 1/6, so it is to be expected that on average we need 6 rolls until we roll our first 1.
We have
$$E[T_1^2] - E[T_1] = \sum_{n=1}^\infty n(n-1)pq^{n-1} = \sum_{n=2}^\infty n(n-1)pq^{n-1} = pq\sum_{n=2}^\infty n(n-1)q^{n-2} = pq\,\frac{d^2}{dq^2}\Bigl(\frac{1}{1-q}\Bigr) = \frac{2pq}{(1-q)^3} = \frac{2q}{p^2}.$$
We deduce that
$$E[T_1^2] = \frac{2q}{p^2} + \frac{1}{p},\qquad \operatorname{Var}[T_1] = \frac{2q}{p^2} + \frac{1}{p} - \frac{1}{p^2} = \frac{q}{p^2}.$$
Note that
$$M_{T_1}(t) = E\bigl[e^{tT_1}\bigr] = \sum_{n=1}^\infty pq^{n-1}e^{nt} = pe^t\sum_{m=0}^\infty \bigl(qe^t\bigr)^m = \frac{pe^t}{1-qe^t},\qquad qe^t < 1.$$
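The values $E[T_1] = 1/p$ and $\operatorname{Var}[T_1] = q/p^2$ can be verified by truncating the defining series; the snippet below is an illustrative check only (the neglected tail is geometrically small).

```python
p = 0.25
q = 1.0 - p
N = 2000  # truncation point; the neglected tail is O(q^N)

mean = sum(n * p * q**(n - 1) for n in range(1, N + 1))
second = sum(n * n * p * q**(n - 1) for n in range(1, N + 1))
var = second - mean**2
# expect mean ~ 1/p = 4 and var ~ q/p^2 = 12
```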
Consider now a more general situation. Fix $k\in\mathbb{N}$ and perform independent Bernoulli trials until we observe the $k$-th success. Denote by $T_k$ the number of trials until we record the $k$-th success. Note that
$$T_k = T_1 + (T_2 - T_1) + (T_3 - T_2) + \cdots + (T_k - T_{k-1}).$$
Due to the independence of the trials, once we observe the $i$-th success it is as if we start the experiment anew, so the waiting time $T_{i+1} - T_i$ until we observe the next success, the $(i+1)$-th, is a random variable with the same distribution as $T_1$,
$$T_{i+1} - T_i \stackrel{d}{=} T_1,\quad \forall i\in\mathbb{N}.$$
Hence $E[T_{i+1} - T_i] = E[T_1] = \frac{1}{p}$ so
$$E[T_k] = kE[T_1] = \frac{k}{p}. \tag{1.3.22}$$
The probability distribution of $T_k$ is computed as follows. Note that $T_k = n$ iff during the first $n-1$ trials we observed exactly $k-1$ successes, and at the $n$-th trial we observed another success. Hence
$$P[T_k = n] = \binom{n-1}{k-1}p^{k-1}q^{n-k}\cdot p = \binom{n-1}{k-1}p^k q^{n-k}, \tag{1.3.23}$$
and
$$M_{T_k}(t) = \Bigl(\frac{pe^t}{1-qe^t}\Bigr)^k.$$
Since the waiting times between two consecutive successes are independent random variables we deduce
$$\operatorname{Var}[T_k] = k\operatorname{Var}[T_1] = \frac{kq}{p^2}.$$
The above probability measure on $\mathbb{R}$ is called the negative binomial distribution and $T_k$ is called a negative binomial random variable corresponding to $k$ successes with success probability $p$. We write this $T_k \sim \operatorname{NegBin}(k,p)$. $\qquad\square$
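Formula (1.3.23) can also be sanity-checked numerically: truncating the series over $n$, the pmf should sum to (nearly) 1 and reproduce the mean $k/p$ from (1.3.22). The snippet below is an illustration, not part of the text.

```python
from math import comb

k, p = 3, 0.4
q = 1.0 - p
N = 500  # truncation point for the series over n

pmf = {n: comb(n - 1, k - 1) * p**k * q**(n - k) for n in range(k, N + 1)}
total = sum(pmf.values())                  # should be ~ 1
mean = sum(n * w for n, w in pmf.items())  # should be ~ k/p = 7.5
```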

Let us describe a classical and less than obvious application of the geometric
random variables.

Example 1.112 (The coupon collector problem). The coupon collector's problem arises from the following scenario.

Suppose that each box of cereal contains one of $m$ different coupons. Once you obtain one of every type of coupon, you can send in for a prize. Ann wants that prize and, for that reason, she buys one box of cereal every day. Assuming that the coupon in each box is chosen independently and uniformly at random from the $m$ possibilities and that Ann does not collaborate with others to collect coupons, how many boxes of cereal is she expected to buy before she obtains at least one of every type of coupon?

Let $N$ denote the number of boxes bought until Ann has at least one of every coupon. We want to determine $E[N]$. For $i = 1,\dots,m-1$ denote by $N_i$ the number of boxes she bought while she had exactly $i$ distinct coupons. The first box she bought contained one coupon. Then she bought $N_1$ boxes, the last of which contained a coupon she did not already have. After $1 + N_1$ boxes she has two coupons. Next, she bought $N_2$ boxes, the last of which contained a third distinct coupon, etc. Hence⁶
$$N = 1 + N_1 + \cdots + N_{m-1}.$$
Let us observe first that for $i = 1,\dots,m-1$ we have
$$N_i \sim \operatorname{Geom}(p_i),\quad p_i = \frac{m-i}{m},\quad q_i = 1 - p_i = \frac{i}{m}.$$
Indeed, at the moment she has $i$ coupons, a success occurs when she buys one of the remaining $m-i$ coupons. The probability of buying one such coupon is thus $\frac{m-i}{m}$. Think of buying a box at this time as a Bernoulli trial with success probability $\frac{m-i}{m}$. The number $N_i$ is then equal to the number of trials until she registers the first success. This argument also shows that the random variables $N_i$ are independent. In particular,
$$E[N_i] = \frac{1}{p_i} = \frac{m}{m-i}.$$
From the linearity of expectation we deduce
$$E[N] = 1 + E[N_1] + E[N_2] + \cdots + E[N_{m-1}] = m\underbrace{\Bigl(1 + \frac{1}{2} + \cdots + \frac{1}{m-1} + \frac{1}{m}\Bigr)}_{=:H_m}.$$
Asymptotically $H_m$ differs from $\log m$ by the mysterious Euler-Mascheroni constant $\gamma \approx 0.5772$, i.e.,
$$\lim_{m\to\infty}(H_m - \log m) = \gamma.$$
Thus the expected number of boxes needed to collect all the $m$ coupons is about $m\log m + m\gamma$. $\qquad\square$
⁶ Here we tacitly assume that we can describe the quantities $N_i$ as measurable functions defined on the same probability space. In Exercise 1.7 we ask the reader to do this. It is more challenging than it looks.
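The identity $E[N] = mH_m$ and the approximation $m\log m + m\gamma$ are easy to compare numerically; the following snippet is an illustrative check, not part of the text.

```python
from math import log

def expected_boxes(m):
    # E[N] = m * H_m, with H_m the m-th harmonic number
    return m * sum(1.0 / j for j in range(1, m + 1))

m = 1000
gamma = 0.5772156649  # Euler-Mascheroni constant
exact = expected_boxes(m)
approx = m * log(m) + gamma * m
# for m = 1000 the two agree to within about half a box; the next
# correction in the expansion of H_m is 1/(2m)
```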

Remark 1.113. We can ask a more general question. For $k \ge 1$ we denote by $X_k = X_{k,m}$ the number of boxes Ann has to buy until she has at least $k$ copies of each of the $m$ coupons. We have seen that $E[X_{1,m}] = mH_m$. One can show that as $m\to\infty$ we have
$$E[X_{k,m}] = m\bigl(\log m + (k-1)\log\log m + \gamma - \log(k-1)! + o(1)\bigr),$$
where $\gamma$ is the Euler-Mascheroni constant. For details we refer to [55; 120]. $\qquad\square$

Example 1.114 (The hypergeometric distribution). Suppose that we have a bin containing $w$ white balls and $b$ black balls. We select $n$ balls at random from the bin and we denote by $X$ the number of white balls among the selected ones. This is a random variable with range $\{0,1,\dots,n\}$ called the hypergeometric random variable with parameters $w, b, n$. We will use the notation $X \sim \operatorname{HGeom}(w,b,n)$ to indicate this and we will refer to its pmf as the hypergeometric distribution. For example, if $A$ is the number of aces in a random poker hand, then $A \sim \operatorname{HGeom}(4, 48, 5)$.

To compute $P[X = k]$ when $X \sim \operatorname{HGeom}(w,b,n)$ note that a favorable outcome for the event $\{X = k\}$ is determined by a choice of $k$ white balls (out of $w$) and another independent choice of $n-k$ black balls (out of $b$), so that the number of favorable outcomes is
$$\binom{w}{k}\binom{b}{n-k}.$$
The number of possible outcomes of a random draw of $n$ balls is $\binom{w+b}{n}$. Hence
$$P[X = k] = \frac{\binom{w}{k}\binom{b}{n-k}}{\binom{w+b}{n}}.$$
Its probability generating function is
$$G_X(s) = \frac{1}{\binom{N}{n}}\sum_{k=0}^{w}\binom{w}{k}\binom{b}{n-k}s^k,\quad N := w+b.$$
We can identify $G_X(s)$ as the coefficient of $x^n$ in the polynomial
$$Q(s,x) = \frac{1}{\binom{N}{n}}(1+sx)^w(1+x)^b.$$
We have
$$\frac{\partial Q}{\partial s}(s,x) = \frac{wx}{\binom{N}{n}}(1+sx)^{w-1}(1+x)^b.$$
The mean of $X$ is $G_X'(1)$ and it is equal to the coefficient of $x^n$ in
$$\frac{\partial Q}{\partial s}(1,x) = \frac{wx}{\binom{N}{n}}(1+x)^{N-1},$$
i.e.,
$$G_X'(1) = \frac{w\binom{N-1}{n-1}}{\binom{N}{n}} = \frac{wn}{N} = \frac{wn}{w+b}.$$
Hence
$$E\bigl[\operatorname{HGeom}(w,b,n)\bigr] = \frac{w}{w+b}\cdot n. \tag{1.3.24}$$
$\qquad\square$
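The hypergeometric pmf and the mean (1.3.24) can be checked numerically on the poker-hand example from the text; the snippet below is an illustration only.

```python
from math import comb

w, b, n = 4, 48, 5  # aces in a random poker hand, as in the example
N = w + b

pmf = [comb(w, k) * comb(b, n - k) / comb(N, n) for k in range(n + 1)]
total = sum(pmf)                              # Vandermonde identity: should be 1
mean = sum(k * p for k, p in enumerate(pmf))  # should be w*n/N = 20/52
```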

Example 1.115 (Poisson random variables). These random variables count the number $N$ of random rare events that occur in a given unit of time. E.g., $N$ could be the number of computers in a large organization that die during one fiscal year. They depend on a parameter $\lambda$ and we indicate this using the notation $N \sim \operatorname{Poi}(\lambda)$. If $N \sim \operatorname{Poi}(\lambda)$, then
$$P[N = n] = e^{-\lambda}\frac{\lambda^n}{n!},\quad\text{i.e.,}\quad P_N = \sum_{n=0}^\infty e^{-\lambda}\frac{\lambda^n}{n!}\delta_n.$$
Then
$$E[N] = \sum_{n\ge 0} e^{-\lambda}\frac{n\lambda^n}{n!} = \lambda e^{-\lambda}\sum_{n\ge 1}\frac{\lambda^{n-1}}{(n-1)!} = \lambda.$$
The moment generating function of $N$ is
$$M_N(t) = E\bigl[e^{tN}\bigr] = \sum_{n\ge 0} e^{-\lambda}\frac{(\lambda e^t)^n}{n!} = e^{\lambda(e^t-1)}.$$
We have
$$M_N'(t) = \lambda e^t e^{\lambda(e^t-1)},\qquad M_N''(t) = \lambda e^t e^{\lambda(e^t-1)} + (\lambda e^t)^2 e^{\lambda(e^t-1)},$$
so
$$E[N^2] = M_N''(0) = \lambda + \lambda^2,\qquad \operatorname{Var}[N] = \lambda. \qquad\square$$
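The mean and variance of the Poisson distribution can be confirmed by truncating the series; this snippet is an illustrative check, not part of the text.

```python
from math import exp, factorial

lam = 3.0
N = 60  # truncation point; the Poisson tail beyond N is negligible here

pmf = [exp(-lam) * lam**n / factorial(n) for n in range(N + 1)]
total = sum(pmf)
mean = sum(n * p for n, p in enumerate(pmf))
second = sum(n * n * p for n, p in enumerate(pmf))
var = second - mean**2
# expect total ~ 1, mean ~ lam, var ~ lam
```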

Example 1.116 (The inclusion-exclusion principle). Suppose that $(\Omega, S, P)$ is a probability space and $A_1,\dots,A_n\in S$. Set
$$I_n := \{1,\dots,n\}.$$
For $m = 0,1,\dots,n$ we denote by $\Omega_m$ the set of points $\omega\in\Omega$ that belong to exactly $m$ of the sets $A_1,\dots,A_n$. Note that
$$\Omega_0^c = A_1\cup\cdots\cup A_n.$$
For $I\subset I_n$ we set
$$A_I := \begin{cases}\bigcap_{i\in I}A_i, & I\ne\emptyset,\\ \Omega, & I = \emptyset.\end{cases}$$
For $k\in\{0,1,2,\dots,n\}$ we define
$$s_k := \sum_{\substack{I\subset I_n\\|I|=k}} P[A_I].$$
The inclusion-exclusion principle states that
$$P[\Omega_m] = \sum_{k=0}^{n-m}(-1)^k\binom{m+k}{m}s_{m+k},\quad \forall m = 0,1,\dots,n. \tag{1.3.25}$$

Using the above equality with $m = 0$ we obtain the better known formula
$$P[A_1\cup\cdots\cup A_n] = 1 - P[\Omega_0] = \sum_{k=1}^n(-1)^{k-1}\sum_{\substack{I\subset I_n\\|I|=k}}P[A_I] = \sum_{k=1}^n(-1)^{k-1}s_k. \tag{1.3.26}$$

To prove (1.3.25) we set
$$T_k := \sum_{\substack{I\subset I_n\\|I|=k}} I_{A_I}.$$
Then
$$I_{\Omega_m} = \sum_{k=0}^{n-m}(-1)^k\binom{m+k}{m}T_{m+k}. \tag{1.3.27}$$
Indeed
$$I_{\Omega_m} = \sum_{\substack{I\subset I_n\\|I|=m}}\Bigl(\prod_{i\in I}I_{A_i}\Bigr)\Bigl(\prod_{j\in I_n\setminus I}I_{A_j^c}\Bigr) = \sum_{\substack{I\subset I_n\\|I|=m}}\Bigl(\prod_{i\in I}I_{A_i}\Bigr)\prod_{j\in I_n\setminus I}\bigl(1 - I_{A_j}\bigr) = \sum_{k=0}^{n-m}(-1)^k\sum_{|J|=m+k}c(J)\,I_{A_J}.$$
Now observe that for any subset $J\subset I_n$ of cardinality $m+k$ there are $\binom{m+k}{m}$ different ways of writing $I_{A_J}$ as a product
$$I_{A_J} = I_{A_I}\,I_{A_{J\setminus I}},\quad |I| = m.$$
Thus $c(J) = \binom{m+k}{m}$ for $|J| = m+k$. We deduce
$$\sum_{|J|=m+k}c(J)\,I_{A_J} = \binom{m+k}{m}T_{m+k}.$$
Using the linearity of expectation we deduce from (1.3.27) that
$$P[\Omega_m] = E[I_{\Omega_m}] = \sum_{k=0}^{n-m}(-1)^k\binom{m+k}{m}E[T_{m+k}],$$
where $E[T_{m+k}] = s_{m+k}$.
Associated to the equality (1.3.25) there is a sequence of inequalities called the Bonferroni inequalities. For $\ell\in\mathbb{N}$ and $\frac{n-m}{2}\ge\ell$,
$$\sum_{k=0}^{2\ell-1}(-1)^k\binom{m+k}{m}s_{m+k} \le P[\Omega_m] \le \sum_{k=0}^{2\ell}(-1)^k\binom{m+k}{m}s_{m+k}. \tag{1.3.28}$$
The above inequalities follow from the "motivic" Bonferroni inequalities
$$\sum_{k=0}^{2\ell-1}(-1)^k\binom{m+k}{m}T_{m+k} \le I_{\Omega_m} \le \sum_{k=0}^{2\ell}(-1)^k\binom{m+k}{m}T_{m+k},\quad 1\le\ell\le\frac{n-m}{2}. \tag{1.3.29}$$

To prove this we fix $\omega\in\Omega$. We have to show that
$$\sum_{k=0}^{2\ell-1}(-1)^k\binom{m+k}{m}T_{m+k}(\omega) \le I_{\Omega_m}(\omega) \le \sum_{k=0}^{2\ell}(-1)^k\binom{m+k}{m}T_{m+k}(\omega) \tag{1.3.30}$$
for $\ell\le\frac{n-m}{2}$. Define
$$I_\omega := \{i\in I_n;\ \omega\in A_i\},\quad r(\omega) := |I_\omega| = \sum_{k=1}^n I_{A_k}(\omega).$$
Note that $I_{A_I}(\omega) = 0$ if $|I| > r(\omega)$. In particular, this shows that all the terms in the inequality (1.3.29) are equal to zero if $r(\omega) < m$.

Suppose that $r(\omega)\ge m$. Then, for any $k\le r$, we have
$$T_k(\omega) = \sum_{\substack{I\subset I_\omega\\|I|=k}} I_{A_I}(\omega) = \binom{r}{k}.$$

Thus, the inequality (1.3.30) evaluated at $\omega$ is equivalent to
$$\sum_{k=0}^{2\ell-1}(-1)^k\binom{m+k}{m}\binom{r}{m+k} \le I_{\Omega_m}(\omega) \le \sum_{k=0}^{2\ell}(-1)^k\binom{m+k}{m}\binom{r}{m+k}. \tag{1.3.31}$$
The inclusion-exclusion identity (1.3.25) shows that the inequalities become equalities for $2\ell > r-m$ so we assume $2\ell \le r-m$.

For $r = m$ the inequality (1.3.31) is obvious since the sums in the left and right-hand sides consist of a single term equal to $1 = I_{\Omega_m}(\omega)$. Assume $r > m$. In this case (1.3.31) is equivalent to
$$\sum_{k=0}^{2\ell-1}(-1)^k a_k \le 0 \le \sum_{k=0}^{2\ell}(-1)^k a_k,\qquad a_k := \binom{m+k}{m}\binom{r}{m+k}. \tag{1.3.32}$$

Observe that
$$a_k = \binom{r}{m}\binom{p}{k},\quad p = r-m.$$
The inequality (1.3.32) reduces to
$$\binom{p}{0} - \binom{p}{1} + \cdots + \binom{p}{2\ell-2} - \binom{p}{2\ell-1} \le 0,$$
$$0 \le \binom{p}{0} - \binom{p}{1} + \binom{p}{2} + \cdots - \binom{p}{2\ell-1} + \binom{p}{2\ell},$$
where $2\ell\le p$. These inequalities are immediate consequences of two well known properties of the binomial coefficients, namely their symmetry
$$\binom{p}{k} = \binom{p}{p-k},$$

and their unimodality
$$\binom{p}{0} \le \binom{p}{1} \le \cdots \le \binom{p}{\lfloor p/2\rfloor} = \binom{p}{\lfloor (p+1)/2\rfloor} \ge \binom{p}{\lfloor p/2\rfloor+1} \ge \cdots \ge \binom{p}{p}.$$
For $m = 0$ we obtain the inequalities
$$\sum_{k=1}^n P[A_k] - \sum_{1\le i<j\le n}P[A_i\cap A_j] \le P[A_1\cup\cdots\cup A_n] \le \sum_{k=1}^n P[A_k].$$
The right-hand-side inequality is referred to as the union bound. $\qquad\square$
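The identity (1.3.25) can be verified exhaustively on a small example. The following snippet (an illustration only) checks it exactly, in rational arithmetic, for randomly generated events in a finite uniform probability space.

```python
import random
from fractions import Fraction
from itertools import combinations
from math import comb

random.seed(7)
omega = list(range(12))          # finite sample space with uniform measure
n = 4
A = [set(random.sample(omega, 5)) for _ in range(n)]  # events A_1..A_n

def P(S):
    return Fraction(len(S), len(omega))

def s(q):  # s_q = sum over |I| = q of P[A_I]
    total = Fraction(0)
    for I in combinations(range(n), q):
        AI = set(omega)
        for i in I:
            AI &= A[i]
        total += P(AI)
    return total

ok = True
for m in range(n + 1):
    omega_m = {w for w in omega if sum(w in Ai for Ai in A) == m}
    rhs = sum((-1) ** k * comb(m + k, m) * s(m + k) for k in range(n - m + 1))
    ok = ok and (P(omega_m) == rhs)
```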

Example 1.117 (Sieves and poissonization). Suppose now that we have an upper triangular array of measurable sets $(A_{n,i})_{i\in I_n}$, $n\in\mathbb{N}$:
$$\begin{array}{llll}
A_{1,1}\\
A_{2,1}, & A_{2,2}\\
\vdots & \vdots\\
A_{n,1}, & A_{n,2}, & \cdots & A_{n,n}\\
\vdots & \vdots & \vdots & \vdots
\end{array}$$
For $n\ge q$ we set
$$s^n_q := \sum_{\substack{I\subset I_n\\|I|=q}} P[A_{n,I}],\quad A_{n,I} = \bigcap_{i\in I}A_{n,i}.$$
Similarly, for $n\ge m$ we denote by $\Omega^n_m$ the set of points in $\Omega$ that belong to exactly $m$ of the sets $A_{n,1},\dots,A_{n,n}$. Using the Bonferroni inequalities we deduce that for fixed $\ell$ and $n > 2\ell+m$ we have
$$\sum_{k=0}^{2\ell-1}(-1)^k\binom{m+k}{m}s^n_{m+k} \le P[\Omega^n_m] \le \sum_{k=0}^{2\ell}(-1)^k\binom{m+k}{m}s^n_{m+k}. \tag{1.3.33}$$
Suppose now that there exists $\lambda > 0$ such that, for any $q\in\mathbb{N}$, we have
$$\lim_{n\to\infty}s^n_q = \frac{\lambda^q}{q!}. \tag{1.3.34}$$
If we let $n\to\infty$ in (1.3.33) we obtain
$$\frac{\lambda^m}{m!}\sum_{k=0}^{2\ell-1}(-1)^k\frac{\lambda^k}{k!} \le \liminf_{n\to\infty}P[\Omega^n_m] \le \limsup_{n\to\infty}P[\Omega^n_m] \le \frac{\lambda^m}{m!}\sum_{k=0}^{2\ell}(-1)^k\frac{\lambda^k}{k!}.$$
If we now let $\ell\to\infty$ we deduce
$$\lim_{n\to\infty}P[\Omega^n_m] = \frac{e^{-\lambda}\lambda^m}{m!}.$$
We can rephrase this in an equivalent way. Set
$$X_n := \sum_{k=1}^n I_{A_{n,k}}.$$
Then $\Omega^n_m = \{X_n = m\}$ and thus we showed that if (1.3.34) holds, then
$$\lim_{n\to\infty}P[X_n = m] = P[\operatorname{Poi}(\lambda) = m],$$
where we recall that $\operatorname{Poi}(\lambda)$ denotes a Poisson random variable with parameter $\lambda$.

The phenomenon depicted above is referred to under the generic name of poissonization or Poisson approximation. Let us observe that if the events $A_{n,k}$ are independent and $P[A_{n,i}] = \frac{\lambda}{n}$, then
$$s^n_k = \binom{n}{k}\Bigl(\frac{\lambda}{n}\Bigr)^k \sim \frac{\lambda^k}{k!}\quad\text{as } n\to\infty.$$
In this case $X_n \sim \operatorname{Bin}(n,\lambda/n)$. The success probability $\frac{\lambda}{n}$ is small for large $n$ and for this reason the Poisson distribution is sometimes referred to as the law of rare events.

The estimation techniques based on various versions of the inclusion-exclusion principle are called sieves. We refer to [143, Chaps. 2, 3] for a more detailed description of far reaching generalizations of the inclusion-exclusion principle and associated sieves. $\qquad\square$
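The convergence $\operatorname{Bin}(n,\lambda/n)\to\operatorname{Poi}(\lambda)$ is easy to observe numerically; the snippet below (an illustration only) shows the approximation error at a fixed point $m$ shrinking as $n$ grows.

```python
from math import comb, exp, factorial

lam, m = 2.0, 3

def binom_pmf(n, p, k):
    return comb(n, k) * p**k * (1 - p)**(n - k)

poisson = exp(-lam) * lam**m / factorial(m)
errs = [abs(binom_pmf(n, lam / n, m) - poisson) for n in (10, 100, 1000, 10000)]
# the approximation error shrinks roughly like 1/n
```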

Example 1.118 (Fixed points of random permutations). Let us show how the above arguments work on the classical derangements problem. Denote by $S_n$ the group of permutations of $I_n$; we equip it with the uniform probability measure so each permutation $\sigma$ has probability $\frac{1}{n!}$. For each $\sigma\in S_n$ we denote by $F(\sigma) = F_n(\sigma)$ its number of fixed points, i.e.,
$$F(\sigma) = \#\{k\in I_n;\ \sigma(k) = k\}.$$
Thus $F: S_n \to \{0,1,\dots,n\}$ can be viewed as a random variable.

A derangement is a permutation $\sigma$ with no fixed points, i.e., $F(\sigma) = 0$. A concrete occurrence of a derangement can be observed when a group of $n$, slightly inebriated, passengers board a plane and pick seats at random. A derangement occurs when none of them sits in his/her preassigned seat.

We want to compute the probability distribution of $F$, i.e., the probabilities
$$P[F = m],\quad m = 0,1,\dots,n.$$
For $j\in I_n$ we denote by $E_j$ the event $\{\sigma(j) = j\}$. The set of permutations that fix $j$ can be identified with the set of permutations of $I_n\setminus\{j\}$ so
$$P[E_j] = \frac{(n-1)!}{n!} = \frac{1}{n}.$$
Observe that
$$F = \sum_{k=1}^n I_{E_k},$$
so
$$E[F] = \sum_{k=1}^n E[I_{E_k}] = \sum_{k=1}^n P[E_k] = 1. \tag{1.3.35}$$
Thus the expected number of fixed points is rather low: a random permutation has, on average, one fixed point.

Let us compute the probability distribution of $F$. For each $I\subset I_n$ we set
$$E_I = \bigcap_{i\in I}E_i.$$
Thus $\sigma\in E_I$ if and only if the permutation $\sigma$ fixes all the points in $I$. We deduce that if $|I| = k$, then
$$P[E_I] = \frac{(n-k)!}{n!}\quad\text{and}\quad s_k := \sum_{|I|=k}P[E_I] = \binom{n}{k}\frac{(n-k)!}{n!} = \frac{1}{k!}.$$
Note that if $F(\sigma) = m$, then $\sigma$ fixes exactly $m$ points and (1.3.25) yields
$$P[F = m] = \sum_{k=0}^{n-m}(-1)^k\binom{m+k}{m}s_{m+k} = \frac{1}{m!}\sum_{k=0}^{n-m}(-1)^k\frac{1}{k!}.$$
In particular, the probability that a random permutation is a derangement is
$$P[F = 0] = \sum_{k=0}^n(-1)^k\frac{1}{k!}.$$
The equality $E[F] = 1$ yields an interesting identity
$$1 = \sum_{m=1}^n mP[F = m] = \sum_{m=1}^n\frac{1}{(m-1)!}\Bigl(\sum_{k=0}^{n-m}(-1)^k\frac{1}{k!}\Bigr).$$
Note that
$$\lim_{n\to\infty}P[F_n = m] = \frac{e^{-1}}{m!}. \tag{1.3.36}$$
The sequence $\frac{e^{-1}}{m!}$, $m\ge 0$, describes the Poisson distribution $\operatorname{Poi}(1)$. $\qquad\square$
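For small $n$ the distribution of $F$ can be checked by brute force over all of $S_n$; the snippet below (illustrative only) compares exact counts against the formula for $P[F=m]$ using rational arithmetic.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial

n = 6
counts = [0] * (n + 1)  # counts[m] = number of permutations with m fixed points
for sigma in permutations(range(n)):
    counts[sum(sigma[k] == k for k in range(n))] += 1

ok = True
for m in range(n + 1):
    formula = Fraction(1, factorial(m)) * sum(
        Fraction((-1) ** k, factorial(k)) for k in range(n - m + 1))
    ok = ok and Fraction(counts[m], factorial(n)) == formula
```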

1.3.4 Classical examples of continuous probability distributions


We want to describe a few examples of random variables whose probability distributions are absolutely continuous with respect to the Lebesgue measure on the real line. Such distributions are classically known as continuous probability distributions. Their probabilistic significance will gradually be revealed in the book.

Example 1.119 (Uniform distribution). A random variable $X$ is said to be uniformly distributed or uniform in the interval $[a,b]$, and we write this $X \sim \operatorname{Unif}(a,b)$, if
$$P_X[dx] = \frac{1}{b-a}I_{[a,b]}\,dx.$$
When $X \sim \operatorname{Unif}(a,b)$ we have
$$M_X(t) = \frac{1}{b-a}\int_a^b e^{tx}dx = \frac{e^{tb}-e^{ta}}{t(b-a)} = \sum_{n\ge 1}\frac{b^n - a^n}{n(b-a)}\cdot\frac{t^{n-1}}{(n-1)!}.$$
In particular we deduce
$$\mu_n[X] = \frac{1}{n+1}\cdot\frac{b^{n+1}-a^{n+1}}{b-a}. \tag{1.3.37}$$
$\qquad\square$
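Formula (1.3.37) can be checked by direct numerical integration; the snippet below is an illustrative check only.

```python
a, b = 1.0, 3.0

def unif_moment(n, steps=100000):
    # midpoint Riemann sum for E[X^n], X ~ Unif(a, b)
    h = (b - a) / steps
    return sum((a + (i + 0.5) * h) ** n * h for i in range(steps)) / (b - a)

ok = all(
    abs(unif_moment(n) - (b**(n + 1) - a**(n + 1)) / ((n + 1) * (b - a))) < 1e-6
    for n in range(1, 5)
)
```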

Fig. 1.3 The graph of $\gamma_{0,\sigma}$ for $\sigma = 1$ (dotted curve) and $\sigma = 0.1$ (continuous curve).

Example 1.120 (Gaussian random variables). The Gaussian or normal random variables form a 2-parameter family $N(\mu,\sigma^2)$, $\mu\in\mathbb{R}$, $\sigma > 0$, where $X \sim N(\mu,\sigma^2)$ iff
$$P_X[dx] = \gamma_{\mu,\sigma^2}(x)dx,\quad \gamma_{\mu,\sigma^2}(x) := \frac{1}{\sqrt{2\pi}\,\sigma}e^{-\frac{(x-\mu)^2}{2\sigma^2}}.$$
We will use the simpler notation $\gamma_{\sigma^2}(x) := \gamma_{0,\sigma^2}(x)$. The measure
$$\Gamma_{\mu,\sigma^2}[dx] := \gamma_{\mu,\sigma^2}(x)dx$$
is called the Gaussian measure on $\mathbb{R}$ with mean $\mu$ and variance $\sigma^2$. Let us observe
$$X \sim N(\mu,\sigma^2) \iff \frac{1}{\sigma}(X-\mu) \sim N(0,1).$$
Indeed, if we set
$$Y := \frac{1}{\sigma}\bigl(X-\mu\bigr),$$
then
$$P[Y\le y] = P\bigl[(X-\mu)/\sigma\le y\bigr] = P[X\le\sigma y+\mu] = \int_{-\infty}^{\sigma y+\mu}\gamma_{\mu,\sigma^2}(x)dx = \sigma\int_{-\infty}^{y}\gamma_{\mu,\sigma^2}(\sigma t+\mu)dt = \int_{-\infty}^{y}\gamma_{0,1}(t)dt.$$
Thus
$$E[X] = \sigma E[Y] + \mu,\qquad \operatorname{Var}[X] = \sigma^2\operatorname{Var}[Y].$$
We have
$$E[Y] = \frac{1}{\sqrt{2\pi}}\int_{\mathbb{R}}ye^{-y^2/2}dy = 0,$$
and
$$\operatorname{Var}[Y] = E[Y^2] = \frac{1}{\sqrt{2\pi}}\int_{\mathbb{R}}y^2e^{-y^2/2}dy = \frac{2}{\sqrt{2\pi}}\int_0^\infty y^2e^{-y^2/2}dy$$
$(s = y^2/2,\ y = \sqrt{2s})$
$$= \frac{2}{\sqrt{\pi}}\int_0^\infty s^{1/2}e^{-s}ds = \frac{2}{\sqrt{\pi}}\,\Gamma(3/2) = \frac{2}{\sqrt{\pi}}\cdot\frac{1}{2}\Gamma(1/2) = 1,$$
where at the last two steps we used basic facts about the Gamma function recalled in Proposition A.2. We deduce that
$$X \sim N(\mu,\sigma^2) \Rightarrow E[X] = \mu,\quad \operatorname{Var}[X] = \sigma^2. \tag{1.3.38}$$
A variable $X \sim N(0,1)$ is called a standard normal random variable. Its cdf
$$\Phi(x) := P[X\le x] = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^x e^{-s^2/2}ds \tag{1.3.39}$$
plays an important role in probability and statistics. The quantity
$$\frac{P[X>x]}{\gamma_1(x)}$$
is called the Mills ratio of the standard normal random variable. It satisfies the inequalities
$$\frac{x}{x^2+1}\,\gamma_1(x) \le P[X>x] \le \frac{1}{x}\,\gamma_1(x). \tag{1.3.40}$$
In Exercise 1.25 we outline a proof of this inequality.

Observe that if $X \sim N(0,1)$ and $\sigma\in\mathbb{R}$, then $\sigma X \sim N(0,\sigma^2)$ and
$$M_{\sigma X}(t) = E\bigl[e^{t\sigma X}\bigr] = M_X(\sigma t).$$
On the other hand, if $X \sim N(0,1)$, then
$$M_X(t) = \frac{1}{\sqrt{2\pi}}\int_{\mathbb{R}}e^{tx-x^2/2}dx = \frac{1}{\sqrt{2\pi}}\int_{\mathbb{R}}e^{-(x^2-2tx+t^2)/2}e^{t^2/2}dx = e^{t^2/2}\cdot\underbrace{\frac{1}{\sqrt{2\pi}}\int_{\mathbb{R}}e^{-(x-t)^2/2}dx}_{=1} = e^{t^2/2}.$$
Thus
$$\mu_{2m}[X] = \frac{(2m)!}{2^m m!} = (2m-1)!!,\qquad \mu_{2m-1}[X] = 0,\quad\forall m\in\mathbb{N}. \qquad\square$$
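The Mills ratio bounds (1.3.40) can be verified numerically using the complementary error function, since $P[X>x] = \tfrac{1}{2}\operatorname{erfc}(x/\sqrt{2})$; the snippet below is an illustration, not part of the text.

```python
from math import erfc, exp, pi, sqrt

def gamma1(x):  # standard normal density
    return exp(-x * x / 2) / sqrt(2 * pi)

def tail(x):    # P[X > x] for X ~ N(0, 1), via the complementary error function
    return erfc(x / sqrt(2)) / 2

ok = all(
    x / (x**2 + 1) * gamma1(x) <= tail(x) <= gamma1(x) / x
    for x in (0.5, 1.0, 2.0, 4.0)
)
```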
Example 1.121 (Gamma distributions). The Gamma distributions with parameters $\nu, \lambda$ are defined by
$$\Gamma_{\nu,\lambda}[dx] = g_\nu(x;\lambda)dx,$$
where $g_\nu(x;\lambda)$, $\lambda,\nu > 0$, are given by
$$g_\nu(x;\lambda) = \frac{\lambda^\nu}{\Gamma(\nu)}x^{\nu-1}e^{-\lambda x}I_{(0,\infty)}. \tag{1.3.41}$$

From the definition of the Gamma function we deduce that $g_\nu(x;\lambda)$ is indeed a probability density, i.e.,
$$\int_0^\infty g_\nu(x;\lambda)dx = 1.$$
We will use the notation $X \sim \operatorname{Gamma}(\nu,\lambda)$ to indicate that $P_X = \Gamma_{\nu,\lambda}$.

The $\operatorname{Gamma}(1,\lambda)$-random variables play a special role in probability. They are called exponential random variables with parameter $\lambda$. We will use the notation $X \sim \operatorname{Exp}(\lambda)$ to indicate that $X$ is such a random variable. The distribution of $\operatorname{Exp}(\lambda)$ is
$$\operatorname{Exp}(\lambda) \sim \lambda e^{-\lambda x}I_{(0,\infty)}dx.$$
We will have more to say about exponential variables in the next subsection.

The parameter $\nu$ is sometimes referred to as the shape parameter. Figure 1.4 may explain the reason for this terminology.

Fig. 1.4 The graphs of $g_\nu(x;\lambda)$ for $\nu > 1$ and $\nu < 1$.

For $n = 1, 2, 3, \dots$ the distribution $\operatorname{Gamma}(n,\lambda)$ is also known as an Erlang distribution and has a simple probabilistic interpretation. If the waiting time $T$ for a certain event is exponentially distributed with rate $\lambda$, e.g., the waiting time for a bus to arrive, then the waiting time for $n$ of these events to occur independently and in succession is a $\operatorname{Gamma}(n,\lambda)$ random variable. We will prove this later.

The distribution $g_{n/2}(x;1/2)$, where $n = 1, 2, \dots$, plays an important role in statistics; it is also known as the chi-squared distribution with $n$ degrees of freedom and it is traditionally denoted by $\chi^2(n)$. One can show that if $X_1,\dots,X_n$ are independent standard normal random variables, then the random variable
$$X_1^2 + \cdots + X_n^2$$
has a chi-squared distribution with $n$ degrees of freedom.

If $X \sim \operatorname{Gamma}(\nu,\lambda)$ is a Gamma distributed random variable, then $X$ is $s$-integrable for any $s\ge 1$. Moreover, for any $k\in\{1,2,\dots\}$ we have
$$\mu_k[X] = \frac{\lambda^\nu}{\Gamma(\nu)}\int_0^\infty x^{k+\nu-1}e^{-\lambda x}dx$$
$(x = \lambda^{-1}t,\ dx = \lambda^{-1}dt)$
$$= \frac{1}{\lambda^k\Gamma(\nu)}\int_0^\infty t^{k+\nu-1}e^{-t}dt = \frac{\Gamma(k+\nu)}{\lambda^k\Gamma(\nu)}.$$
We deduce
$$E[X] = \mu_1[X] = \frac{\Gamma(\nu+1)}{\lambda\Gamma(\nu)} = \frac{\nu}{\lambda},$$
$$\operatorname{Var}[X] = \mu_2[X] - \mu_1[X]^2 = \frac{\Gamma(\nu+2)}{\lambda^2\Gamma(\nu)} - \frac{\nu^2}{\lambda^2} = \frac{(\nu+1)\nu - \nu^2}{\lambda^2} = \frac{\nu}{\lambda^2}.$$
Finally, if $X \sim \operatorname{Gamma}(\nu,\lambda)$, then for $t < \lambda$ we have
$$M_X(t) = \frac{\lambda^\nu}{\Gamma(\nu)}\int_0^\infty x^{\nu-1}e^{-(\lambda-t)x}dx$$
$(x = y/(\lambda-t))$
$$= \frac{\lambda^\nu}{\Gamma(\nu)(\lambda-t)^\nu}\int_0^\infty y^{\nu-1}e^{-y}dy = \Bigl(\frac{\lambda}{\lambda-t}\Bigr)^\nu. \qquad\square$$
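The moment formula $\mu_k[X] = \Gamma(k+\nu)/(\lambda^k\Gamma(\nu))$ can be checked against a direct numerical integration of $x^k g_\nu(x;\lambda)$; this snippet is only an illustration.

```python
from math import exp, gamma

nu, lam = 2.5, 1.5

def gamma_moment(k, upper=50.0, steps=200000):
    # midpoint Riemann sum of x^k * g_nu(x; lam) over (0, upper)
    h = upper / steps
    c = lam**nu / gamma(nu)
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        total += c * x ** (k + nu - 1) * exp(-lam * x) * h
    return total

ok = all(
    abs(gamma_moment(k) - gamma(k + nu) / (lam**k * gamma(nu))) < 1e-4
    for k in (1, 2)
)
```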

Example 1.122 (Beta distributions). The Beta distribution with parameters $a, b > 0$ is defined by the probability density function
$$\beta_{a,b}(x) = \frac{1}{B(a,b)}x^{a-1}(1-x)^{b-1}I_{(0,1)}.$$
The normalizing constant $B(a,b)$ is the Beta function (A.1.2),
$$B(a,b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}.$$
We will use the notation $X \sim \operatorname{Beta}(a,b)$ to indicate that the pdf of $X$ is a Beta distribution with parameters $a, b$.

Suppose that $X \sim \operatorname{Beta}(a,b)$. Then
$$E[X] = \frac{1}{B(a,b)}\int_0^1 x^a(1-x)^{b-1}dx = \frac{B(a+1,b)}{B(a,b)} \stackrel{(A.1.4)}{=} \frac{\Gamma(a+1)\Gamma(a+b)}{\Gamma(a)\Gamma(a+b+1)} = \frac{a}{a+b},$$
$$E[X^2] = \frac{1}{B(a,b)}\int_0^1 x^{a+1}(1-x)^{b-1}dx = \frac{\Gamma(a+2)\Gamma(a+b)}{\Gamma(a)\Gamma(a+b+2)} = \frac{a(a+1)}{(a+b)(a+b+1)}.$$
Hence
$$\operatorname{Var}[X] = E[X^2] - E[X]^2 = \frac{a}{a+b}\Bigl(\frac{a+1}{a+b+1} - \frac{a}{a+b}\Bigr) = \frac{a}{a+b}\cdot\frac{(a+1)(a+b) - a(a+b+1)}{(a+b)(a+b+1)} = \frac{ab}{(a+b)^2(a+b+1)}.$$
The distribution $\operatorname{Beta}(1/2,1/2)$ is called the arcsine distribution. In this case
$$\beta_{1/2,1/2}(x) = \frac{1}{\pi}\cdot\frac{1}{\sqrt{x(1-x)}},$$
and
$$\int_0^x \beta_{1/2,1/2}(s)ds = \frac{2}{\pi}\arcsin\sqrt{x}.$$
We refer to Exercise 1.35 for an alternate interpretation of $\operatorname{Beta}(1/2,1/2)$. $\qquad\square$
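The Beta moment computations above can be cross-checked through the Gamma-function representation of $B(a,b)$; the snippet below is an illustrative check, not part of the text.

```python
from math import gamma

def B(a, b):  # Beta function via the Gamma function
    return gamma(a) * gamma(b) / gamma(a + b)

a, b = 2.0, 5.0
m1 = B(a + 1, b) / B(a, b)   # E[X]
m2 = B(a + 2, b) / B(a, b)   # E[X^2]
mean = a / (a + b)
var = a * b / ((a + b) ** 2 * (a + b + 1))
```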

In Appendix A.2 we have listed the basic integral invariants of several frequently
occurring probability distributions.

1.3.5 Product probability spaces and independence


Suppose $(\Omega_i, S_i)$, $i = 0, 1$, are two measurable spaces. Recall that $S_0\otimes S_1$ is the sigma-algebra of subsets of $\Omega_0\times\Omega_1$ generated by the collection $\mathcal{R}$ of "rectangles" of the form $S_0\times S_1$, $S_i\in S_i$, $i = 0, 1$.
The goal of this subsection is to show that two sigma-finite measures $\mu_i$ on $S_i$, $i = 0, 1$, induce in a canonical way a measure $\mu_0\otimes\mu_1$ uniquely determined by the condition
$$\mu_0\otimes\mu_1[S_0\times S_1] = \mu_0[S_0]\mu_1[S_1],\quad \forall S_i\in S_i,\ i = 0,1.$$
The collection $\mathcal{A}$ of subsets of $\Omega_0\times\Omega_1$ that are finite disjoint unions of rectangles is an algebra. This suggests using Carathéodory's existence theorem to prove this claim.

We choose a different route that bypasses Carathéodory's existence theorem. This alternate, more efficient approach is driven by the Monotone Class Theorem and simultaneously proves a central result in integration theory, the Fubini-Tonelli Theorem. For every measurable space $(\Omega, S)$ we denote by $L^0(\Omega, S)^*$ the space of $S$-measurable functions $f:\Omega\to\mathbb{R}$.

Lemma 1.123. Suppose that
$$f\in L^0(\Omega_0\times\Omega_1, S_0\otimes S_1)^* \cup L^0_+(\Omega_0\times\Omega_1, S_0\otimes S_1).$$
Then, for any $\omega_1\in\Omega_1$, the function $f^0_{\omega_1}:\Omega_0\to\mathbb{R}$,
$$f^0_{\omega_1}(\omega_0) = f(\omega_0,\omega_1),$$
is $S_0$-measurable and, for any $\omega_0\in\Omega_0$, the function $f^1_{\omega_0}:(\Omega_1, S_1)\to\mathbb{R}$,
$$f^1_{\omega_0}(\omega_1) = f(\omega_0,\omega_1),$$
is $S_1$-measurable.

Proof. We prove only the statement concerning $f^0_{\omega_1}$. For simplicity we will write $f_{\omega_1}$ instead of $f^0_{\omega_1}$. We will use the Monotone Class Theorem 1.21.

Denote by $\mathcal{M}$ the collection of functions $f\in L^0(\Omega_0\times\Omega_1, S_0\otimes S_1)^*$ such that $f_{\omega_1}$ is $S_0$-measurable, $\forall\omega_1\in\Omega_1$. Clearly, if $f, g\in\mathcal{M}$ are bounded, then $af + bg\in\mathcal{M}$, $\forall a, b\in\mathbb{R}$.

The collection $\mathcal{R}$ of rectangles is a $\pi$-system. Note that for any rectangle $R = S_0\times S_1$ the function $f = I_R$ belongs to $\mathcal{M}$. Indeed, for any $\omega_1\in\Omega_1$ we have
$$f_{\omega_1} = \begin{cases} I_{S_0}, & \omega_1\in S_1,\\ 0, & \omega_1\in\Omega_1\setminus S_1.\end{cases}$$
If $(f_n)$ is an increasing sequence of functions in $\mathcal{M}$, so is the sequence of slices $f_{n,\omega_1}$, so the limit $f$ is also in $\mathcal{M}$. By the Monotone Class Theorem the collection $\mathcal{M}$ contains all the nonnegative measurable functions. Since $\mathcal{M}$ is a vector space, it must coincide with $L^0(\Omega_0\times\Omega_1, S_0\otimes S_1)$.

When $f\in L^0_+$, but $f$ is allowed to have infinite values, the function $f$ is the increasing limit of a sequence in $\mathcal{M}$. Hence this situation is also included in the conclusions of the lemma. $\qquad\square$

Theorem 1.124 (Fubini-Tonelli). Let $(\Omega_i, S_i, \mu_i)$, $i = 0, 1$, be two sigma-finite measured spaces.

(i) There exists a measure $\mu$ on $S_0\otimes S_1$ uniquely determined by the equalities
$$\mu[S_0\times S_1] = \mu_0[S_0]\mu_1[S_1],\quad \forall S_0\in S_0,\ S_1\in S_1.$$
We will denote this measure by $\mu_0\otimes\mu_1$.
(ii) For each nonnegative function $f\in L^0_+(\Omega_0\times\Omega_1, S_0\otimes S_1)$ the functions
$$\omega_0\mapsto I_1[f](\omega_0) := \int_{\Omega_1}f(\omega_0,\omega_1)\mu_1[d\omega_1]\in[0,\infty],$$
$$\omega_1\mapsto I_0[f](\omega_1) := \int_{\Omega_0}f(\omega_0,\omega_1)\mu_0[d\omega_0]\in[0,\infty]$$
are measurable and
$$\int_{\Omega_0}\Bigl(\int_{\Omega_1}f(\omega_0,\omega_1)\mu_1[d\omega_1]\Bigr)\mu_0[d\omega_0] = \int_{\Omega_0\times\Omega_1}f(\omega_0,\omega_1)\,\mu_0\otimes\mu_1[d\omega_0 d\omega_1] = \int_{\Omega_1}\Bigl(\int_{\Omega_0}f(\omega_0,\omega_1)\mu_0[d\omega_0]\Bigr)\mu_1[d\omega_1]. \tag{1.3.42}$$
In particular, if only one of the three terms above is finite, then all three are finite and equal.
(iii) Let $f\in L^0(\Omega_0\times\Omega_1, S_0\otimes S_1, \mu_0\otimes\mu_1)$. Then $f\in L^1(\Omega_0\times\Omega_1, S_0\otimes S_1, \mu_0\otimes\mu_1)$ if and only if at least one of the terms in (1.3.42) is well defined and finite. In this case all these terms are equal.
Proof. We will carry out the proof in several steps.

Step 1. We will prove that for every positive function $f\in L^0(\Omega_0\times\Omega_1, S_0\otimes S_1)$ the nonnegative function
$$\omega_0\mapsto I_1[f](\omega_0) = \int_{\Omega_1}f(\omega_0,\omega_1)\mu_1[d\omega_1]$$
is measurable, so that the integral
$$I_{1,0}[f] := \int_{\Omega_0}\Bigl(\int_{\Omega_1}f(\omega_0,\omega_1)\mu_1[d\omega_1]\Bigr)\mu_0[d\omega_0]\in[0,\infty]$$
is well defined.

This follows from the Monotone Class Theorem arguing exactly as in the proof of Lemma 1.123.
For $S\in S_0\otimes S_1$ we set
$$\mu_{1,0}[S] = I_{1,0}[I_S].$$
Note that
$$I_1[I_{S_0\times S_1}](\omega_0) = \int_{\Omega_1}I_{S_0\times S_1}(\omega_0,\omega_1)\mu_1[d\omega_1].$$
If $\omega_0\in\Omega_0\setminus S_0$ the integral is 0. If $\omega_0\in S_0$ the integral is
$$\int_{\Omega_1}I_{S_1}\,d\mu_1 = \mu_1[S_1].$$
Hence
$$I_1[I_{S_0\times S_1}] = \mu_1[S_1]\,I_{S_0}.$$
We deduce
$$\mu_{1,0}[S_0\times S_1] = \int_{\Omega_0}\mu_1[S_1]\,I_{S_0}\,d\mu_0 = \mu_0[S_0]\cdot\mu_1[S_1].$$
Clearly if $A, A'\in S$ are disjoint, then $I_{A\cup A'} = I_A + I_{A'}$ so that
$$I_{1,0}[I_{A\cup A'}] = I_{1,0}[I_A] + I_{1,0}[I_{A'}]$$
and
$$\mu_{1,0}[A\cup A'] = \mu_{1,0}[A] + \mu_{1,0}[A'].$$
If
$$A_1\subset A_2\subset\cdots$$
is an increasing sequence of sets in $S$ and
$$A = \bigcup_{n\ge 1}A_n,$$
then invoking the Monotone Convergence Theorem we first deduce that $I_1[I_{A_n}]$ is a nondecreasing sequence of measurable functions converging to $I_1[I_A]$, and then we conclude that $\mu_{1,0}[A_n]$ converges to $\mu_{1,0}[A]$. Hence $\mu_{1,0}$ is a measure on $S = S_0\otimes S_1$.
 

Step 2. A similar argument shows that
$$\mu_{0,1}[S] = \int_{\Omega_1}\Bigl(\int_{\Omega_0}I_S(\omega_0,\omega_1)\mu_0[d\omega_0]\Bigr)\mu_1[d\omega_1]$$
is also a sigma-finite measure on $S = S_0\otimes S_1$. Note that
$$\mu_{1,0}[S_0\times S_1] = \mu_{0,1}[S_0\times S_1],\quad \forall S_0\in S_0,\ S_1\in S_1.$$
Thus $\mu_{1,0}[R] = \mu_{0,1}[R]$, $\forall R\in\mathcal{R}$.

We want to show that if $\nu$ is another measure on $S$ such that $\nu[R] = \mu_{1,0}[R]$ for any $R\in\mathcal{R}$, then $\nu[A] = \mu_{1,0}[A]$, $\forall A\in S$.
To see this assume first that $\mu_0$ and $\mu_1$ are finite measures. Then $\Omega_0\times\Omega_1\in\mathcal{R}$,
$$\mu_{1,0}[\Omega_0\times\Omega_1] = \nu[\Omega_0\times\Omega_1] < \infty,$$
and since $\mathcal{R}$ is a $\pi$-system we deduce from Proposition 1.29 that $\mu_{1,0} = \nu$ on $S$.
To deal with the general case choose two increasing sequences $E^i_n\in S_i$, $i = 0, 1$, such that
$$\mu_i[E^i_n] < \infty,\ \forall n,\quad\text{and}\quad \Omega_i = \bigcup_{n\ge 1}E^i_n,\quad i = 0,1.$$
Define
$$E_n := E^0_n\times E^1_n,\quad \mu^n_i[S_i] := \mu_i[S_i\cap E^i_n],\ S_i\in S_i,\ i = 0,1,$$
$$\nu^n[A] := \nu[A\cap E_n],\quad \forall A\in S.$$
Using the measures $\mu^n_i$ we form as above the measures $\mu^n_{1,0}$ and we observe that
$$\mu^n_{1,0}[A] = \mu_{1,0}[A\cap E_n],\quad \forall n,\ \forall A\in S.$$
For any rectangle $R$, the intersection $R\cap E_n$ is a rectangle and
$$\mu^n_{1,0}[R] = \nu^n[R],\quad \forall n.$$
Thus
$$\mu^n_{1,0}[A] = \nu^n[A],\quad \forall n\in\mathbb{N},\ A\in S.$$
If we let $n\to\infty$ in the above equality we deduce that $\mu_{1,0} = \nu$ on $S$.

We deduce that $\mu_{0,1} = \mu_{1,0}$. Thus the measures $\mu_{0,1}$ and $\mu_{1,0}$ coincide on the algebra of sets generated by the rectangles and thus they must coincide on $S_0\otimes S_1$. This common measure is denoted by $\mu_0\otimes\mu_1$ and it clearly satisfies statement (i) in the theorem.

Step 3. From Step 2 we deduce that (1.3.42) is true for $f = I_S$, $\forall S\in S_0\otimes S_1$. From this, using the Monotone Class Theorem exactly as in the proof of Lemma 1.123, we deduce (1.3.42) in its entire generality. The claim in (iii) follows from the fact that any integrable function $f$ is the difference of two nonnegative integrable functions $f = f^+ - f^-$ and the claim is true for $f^\pm$. $\qquad\square$

The above construction can be iterated. More precisely, given sigma-finite measured spaces $(\Omega_k, S_k, \mu_k)$, $k = 1,\dots,n$, we have a measure $\mu = \mu_1\otimes\cdots\otimes\mu_n$ uniquely determined by the condition
$$\mu[S_1\times S_2\times\cdots\times S_n] = \mu_1[S_1]\mu_2[S_2]\cdots\mu_n[S_n],\quad \forall S_k\in S_k,\ k = 1,\dots,n.$$

Remark 1.125. Recall that $\lambda$ denotes the Lebesgue measure on $\mathbb{R}$. The measure $\lambda^{\otimes n}$ on $\mathcal{B}_{\mathbb{R}^n}$ is called the $n$-dimensional Lebesgue measure and will be denoted by $\lambda_n$ or simply $\lambda$, when no confusion is possible. A subset of $\mathbb{R}^n$ is called Lebesgue measurable if it belongs to the completion of the Borel sigma-algebra with respect to the Lebesgue measure.

One can prove that if a function $f:\mathbb{R}^n\to\mathbb{R}$ is absolutely Riemann integrable (see [121, Chap. 15]), then it is also Lebesgue integrable with respect to the Lebesgue measure on $\mathbb{R}^n$ and, moreover,
$$\int_{\mathbb{R}^n}f(x)\,|dx| = \int_{\mathbb{R}^n}f(x)\,\lambda[dx],$$
where the left-hand-side integral is the (improper) Riemann integral.

We recommend the reader to try to prove this fact, or at least to try to understand why a Riemann integrable function defined on a cube is Lebesgue measurable. This is not obvious because there exist Riemann integrable functions that are not Borel measurable.
For example, if $C\subset[0,1]$ is the Cantor set, then there exists a subset $A$ of $C$ that is not Borel because the cardinality of the set $2^C$ is bigger than the cardinality of the family of Borel subsets of $C$. The subset $A$ is Lebesgue measurable since $C$ is Lebesgue negligible. The indicator function $I_A$ is Riemann integrable but not Borel measurable.

The change of variables formula for the Riemann integral shows that if $U, V$ are open subsets of $\mathbb{R}^n$ and $F: U\to V$ is a $C^1$-diffeomorphism onto $V$, then
$$F^{-1}_{\#}\lambda_V[dx] = |\det J_F(x)|\,\lambda_U[dx]. \qquad\square$$

Let us present a few useful consequences of Fubini’s theorem.

Proposition 1.126. Suppose that $X$ is a nonnegative random variable defined on the probability space $(\Omega, S, P)$. For any $p\in[1,\infty)$ we have
$$E[X^p] = p\int_0^\infty x^{p-1}P[X > x]\,dx. \tag{1.3.43}$$

Proof. We have
$$p\int_0^\infty x^{p-1}P[X > x]dx = \int_0^\infty\Bigl(\int_\Omega px^{p-1}I_{\{X>x\}}(\omega)P[d\omega]\Bigr)dx = \int_{\substack{(\omega,x)\in\Omega\times[0,\infty)\\ 0\le x<X(\omega)}} px^{p-1}\,P\otimes\lambda[d\omega dx]$$
(use Fubini-Tonelli)
$$= \int_\Omega\Bigl(\int_0^{X(\omega)}px^{p-1}dx\Bigr)P[d\omega] = \int_\Omega X^p(\omega)P[d\omega] = E[X^p]. \qquad\square$$
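Identity (1.3.43) can be checked numerically on a concrete law. For $X\sim\operatorname{Exp}(\lambda)$ one has $P[X>x]=e^{-\lambda x}$ and $E[X^p]=\Gamma(p+1)/\lambda^p$; the snippet below (an illustration only) compares the two sides.

```python
from math import exp, gamma

lam, p = 2.0, 2.5
upper, steps = 30.0, 200000  # truncation and resolution of the integral

h = upper / steps
total = 0.0
for i in range(steps):
    x = (i + 0.5) * h
    total += p * x ** (p - 1) * exp(-lam * x) * h  # P[X > x] = exp(-lam*x)

exact = gamma(p + 1) / lam**p  # E[X^p] for X ~ Exp(lam)
```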

We want to point out that when $p = 1$ the equality
$$\int_{\substack{(\omega,x)\in\Omega\times[0,\infty)\\ 0\le x<X(\omega)}} P\otimes\lambda[d\omega dx] = E[X]$$
simply says that $E[X]$ is equal to the "area" below the graph of the function $X:\Omega\to[0,\infty)$.

Example 1.127. Suppose that $X$ is a random variable that takes only nonnegative integer values. Then
$$P_X = \sum_{n\ge 0}P[X = n]\,\delta_n,$$
and
$$E[X] \stackrel{(1.3.43)}{=} \int_0^\infty P[X > x]dx = \sum_{n\ge 0}\int_n^{n+1}P[X > x]dx = \sum_{n\ge 0}P[X > n]. \tag{1.3.44}$$
Let us apply this identity to a geometric random variable with success probability $p$, $T \sim \operatorname{Geom}(p)$. Note that $P[T > n]$ is the probability that the waiting time for a success is $> n$ or, equivalently, the probability that the first $n$ trials are failures. Hence
$$P[T > n] = q^n\quad\text{so}\quad E[T] = \sum_{n\ge 0}q^n = \frac{1}{1-q} = \frac{1}{p}.$$
Similarly, applying (1.3.43) with exponent 2 we get
$$\mu_2[T] = E[T^2] = 2\sum_{n\ge 0}\int_n^{n+1}x\,P[T > n]dx = \sum_{n\ge 0}(2n+1)P[T > n] = 2\sum_{n\ge 0}nP[T > n] + E[T].$$
Now
$$2\sum_{n\ge 1}nq^n = 2q\sum_{n\ge 1}nq^{n-1} = \frac{2q}{(1-q)^2} = \frac{2q}{p^2},$$
so that
$$E[T^2] = \frac{2q}{p^2} + \frac{1}{p}.$$
In particular
$$\operatorname{Var}[T] = E[T^2] - E[T]^2 = \frac{2q}{p^2} + \frac{1}{p} - \frac{1}{p^2} = \frac{q}{p^2}. \qquad\square$$

Example 1.128. Suppose that T is an exponential random variable with parameter


λ, i.e., a random variable with the exponential probability distribution
PT dt = λe−λt I (0,∞) dt.
 

This random variable describes the waiting time for an event to happen, e.g., the
waiting time for a laptop to crash, or the waiting time for a bus to arrive at a bus
station. The quantity λe−λt dt is the probability that the waiting time is in the
interval (t, t + dt]. Then
Z ∞ Z ∞
−λτ −λ 1
e−λt dt = .
   
P T >t = λe dτ = e , E T =
t 0 λ
We see that λ1 is measured in units of time. For this reason λ is called the rate and
describes how many rare events take place per unit of time.
Similarly
Z ∞ Z ∞ Z ∞
2
µ2 T = E T 2 = 2 te−λt dt = 2 se−s ds
   
tP[T > t]dt = 2
0 0 λ 0

2 2
= Γ(2) = 2 .
λ2 λ
 
The function S(t) := P T > t is called the survival function. For example, if T
denotes the life span of a laptop, then S(t) is the probability that a laptop survives
more than g units of time.
The exponential distribution enjoys the so-called memoryless property

    P[T > t + s | T > s] = P[T > t].                                   (1.3.45)

For example, if T is the waiting time for a bus to arrive then, given that you've
waited more than s units of time, the probability that you will have to wait at
least t extra units is the same as if you had not waited at all. The proof of (1.3.45)
is immediate:

    P[T > t + s | T > s] = P[T > t + s]/P[T > s] = e^{−λ(t+s)}/e^{−λs}
                         = e^{−λt} = P[T > t].                         □
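The memoryless property (1.3.45) can be illustrated by simulation; a minimal sketch, where the rate λ = 2.0 and the thresholds s, t are arbitrary choices:

```python
import random
from math import exp, log

random.seed(1)
lam, s, t = 2.0, 0.5, 0.7  # hypothetical rate and thresholds

# Sample Exp(lam) waiting times by inversion: T = -log(U)/lam.
samples = [-log(random.random()) / lam for _ in range(500_000)]

# Compare the conditional frequency P[T > t+s | T > s] with P[T > t].
survived_s = [x for x in samples if x > s]
cond = sum(x > t + s for x in survived_s) / len(survived_s)
uncond = sum(x > t for x in samples) / len(samples)
theory = exp(-lam * t)  # the common theoretical value e^{-lambda t}
```

Both empirical frequencies land close to e^{−λt}, as (1.3.45) predicts.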

Example 1.129 (Integration by parts). Suppose that µ0 , µ1 are two Borel


probability measures on R supported on [0, ∞), i.e.,

    µk[(−∞, 0)] = 0,  k = 0, 1.

We set

    Fk(x) = µk[(−∞, x]],  k = 0, 1,

so that µk is the Lebesgue–Stieltjes measure determined by Fk. Note that

    Fk(0) = µk[{0}].

Classically, the integral

    ∫_{[0,a]} u(x) µk[dx]

was denoted by

    ∫_0^a u(x) dFk(x).

This classical notation is a bit ambiguous due to the following simple fact:

    ∫_{[0,a]} u(x) µk[dx] = u(0) Fk(0) + ∫_{(0,a]} u(x) µk[dx].
We want to prove a version of the integration by parts formula. Namely, we will
show that if one of the functions F0, F1 is continuous, then

    ∫_0^a F0(x) dF1(x) = F0(a)F1(a) − F0(0)F1(0) − ∫_0^a F1(x) dF0(x).   (1.3.46)

Assume for simplicity that F1 is continuous, so F1(0) = 0. Set µ := µ0 ⊗ µ1.
Observe that

    F0(a)F1(a) − F0(0)F1(0) = F0(a)F1(a) = µ[Sa],  Sa := [0, a] × [0, a].
Sa

Using the Fubini–Tonelli theorem we deduce

    ∫_0^a F1(x) dF0(x) = ∫_{[0,a]} ( ∫_R I_{(−∞,x]}(y) µ1[dy] ) µ0[dx]

(F1 is continuous)

    = ∫_{[0,a]} ( ∫_{[0,a]} I_{[0,x)}(y) µ1[dy] ) µ0[dx] = µ[R0],

where

    R0 := {(x, y) ∈ R²; 0 ≤ y < x ≤ a}.
Similarly

    ∫_0^a F0(y) dF1(y) = ∫_{[0,a]} ( ∫_{[0,a]} I_{[0,y]}(x) µ0[dx] ) µ1[dy] = µ[R1],

    R1 := {(x, y) ∈ R²; 0 ≤ x ≤ y ≤ a}.

Observe that the regions R0, R1 are disjoint. The region R0 is the part of the
square Sa = [0, a] × [0, a] strictly below the diagonal y = x, while R1 is the part
of this square on or above this diagonal. Hence Sa = R0 ∪ R1 and thus

    µ[R0] + µ[R1] = µ[Sa].
Let us observe that the integration by parts formula is not true if both F0, F1 are
discontinuous. Take for example the case µ0 = µ1 = ½(δ1 + δ3). Then

    F0(x) = F1(x) = F(x) = { 0,    x < 1,
                             1/2,  1 ≤ x < 3,
                             1,    x ≥ 3.


In this case we have

    ∫_0^2 F(x) dF(x) = ∫_{[0,2]} F(x) µ0[dx] = ½ F(1) = ¼,   F(2)² = ¼,

so

    2 ∫_0^2 F(x) dF(x) ≠ F(2)².

The reason for this failure has a simple geometric origin: the diagonal {y = x}
may not be µ0 ⊗ µ1-negligible. The continuity assumption allowed us to discard the
diagonal of the square because in this case it is indeed negligible.   □
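The counterexample above can be reproduced with a few lines of code for purely atomic Lebesgue–Stieltjes integrals; a small sketch, hypothetical in its implementation details:

```python
# Purely atomic Lebesgue-Stieltjes integration, reproducing the failure of
# integration by parts for mu0 = mu1 = (delta_1 + delta_3)/2.
atoms = {1.0: 0.5, 3.0: 0.5}

def F(x):
    """cdf of the atomic measure: F(x) = mu[(-inf, x]]."""
    return sum(w for a, w in atoms.items() if a <= x)

def integral_F_dF(lo, hi):
    """int over [lo, hi] of F(x) against the atomic measure above."""
    return sum(w * F(a) for a, w in atoms.items() if lo <= a <= hi)

a = 2.0
lhs = 2 * integral_F_dF(0.0, a)  # 2 * int_0^2 F dF = 1/2
rhs = F(a) ** 2                  # what (1.3.46) would predict: F(2)^2 = 1/4
```

The two sides disagree, exactly as computed in the text.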

Definition 1.130. Fix a probability space (Ω, S, P).

(i) Suppose that V is a finite dimensional vector space. We denote by BV the


sigma-algebra of Borel subsets of V . A V -valued random vector is a measurable
map

X : (Ω, S, P) → (V, BV ).

Its probability distribution is the pushforward measure PX := X_# P. By definition, PX is a Borel probability measure on V.
(ii) The joint probability distribution of the random variables

X1 , . . . , Xn : (Ω, S, P) → R

is the probability distribution of the random vector

X := (X1 , . . . , Xn ) : (Ω, S, P) → Rn .

We will denote by PX1 ,...,Xn the joint distribution.

□

Observe that the joint probability distribution PX1,...,Xn is uniquely determined
by the probabilities

    P[X1 ≤ x1, . . . , Xn ≤ xn],  x1, . . . , xn ∈ R.

Proposition 1.131. Suppose that (Ω, S, P) is a probability space and

X1 , . . . , Xn ∈ L0 (Ω, S, P)

are random variables with probability distributions PX1 , . . . , PXn . The following
statements are equivalent.

(i) The random variables X1 , . . . , Xn are independent.


(ii) PX1 ,...,Xn = PX1 ⊗ · · · ⊗ PXn .

Proof. The random variables X1, . . . , Xn are independent iff for any Borel sets
B1, . . . , Bn ⊂ R we have

    P[X1 ∈ B1, . . . , Xn ∈ Bn] = P[X1 ∈ B1] · · · P[Xn ∈ Bn]
    ⟺ PX1,...,Xn[B1 × · · · × Bn] = PX1 ⊗ · · · ⊗ PXn[B1 × · · · × Bn].

Thus the random variables X1, . . . , Xn are independent iff the measures PX1,...,Xn
and PX1 ⊗ · · · ⊗ PXn coincide on the set of rectangles B1 × · · · × Bn, i.e.,
PX1,...,Xn = PX1 ⊗ · · · ⊗ PXn.   □

1.3.6 Convolutions of Borel measures on the real axis

Definition 1.132. Let µ, ν be two probability measures on (R, B_R). The convolution of µ with ν is the probability measure µ ∗ ν on (R, B_R) defined by

    (µ ∗ ν)[B] = ∫_R µ[B − y] ν[dy],  ∀B ∈ B_R.                        (1.3.47)
□

Denote by Sy the shift Sy : R → R, Sy(x) = x + y, and set µy := (Sy)_# µ. Note
that for any Borel set B ⊂ R we have

    µy[B] = µ[Sy^{−1}(B)] = µ[B − y],

so we can rewrite (1.3.47) in the form

    (µ ∗ ν)[−] = ∫_R µy[−] ν[dy].

A simple argument based on the Monotone Convergence Theorem shows that µ ∗ ν
is indeed a Borel measure on R. By letting B = R in (1.3.47) we see that µ ∗ ν is
indeed a probability measure.
Note that µ ∗ ν is a mixture in the sense that it is obtained by averaging the
family of probability measures (µy)_{y∈R} with respect to the probability measure
ν[dy]. For example, if

    ν = Σ_{i=1}^n (1/n) δ_{xi},

then

    µ ∗ ν = (1/n) Σ_{i=1}^n µ_{xi}.

Proposition 1.133. Let µ, ν be probability measures on (R, BR ) and


Φ : R2 → R, Φ(x, y) = x + y.
Then µ ∗ ν = Φ# (µ ⊗ ν) = ν ∗ µ.

Proof. Let B ∈ B_R and set B̂ = Φ^{−1}(B). Set

    B̂y := {x; (x, y) ∈ B̂} = B − y.

Then

    Φ_#(µ ⊗ ν)[B] = ∫_{R²} I_{B̂} (µ ⊗ ν)[dxdy]

(use Fubini–Tonelli)

    = ∫_R ( ∫_R I_{B̂y} µ[dx] ) ν[dy] = ∫_R µ[B − y] ν[dy] = (µ ∗ ν)[B].

The equality µ ∗ ν = ν ∗ µ follows by changing the order of integration in the
Fubini–Tonelli theorem.   □

Corollary 1.134. Let X, Y ∈ L0 (Ω, S, P) be two independent random variables with


distributions PX and PY . Then
PX+Y = PX ∗ PY .

Proof. Since X, Y are independent we have PX,Y = PX ⊗ PY . Note that
PX+Y = Φ_# PX,Y . The conclusion now follows from Proposition 1.133.   □
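In the discrete case, Corollary 1.134 can be checked directly: convolving two pmfs reproduces the distribution of the sum of two independent random variables. A sketch with two fair dice (the die example is an illustration, not from the text):

```python
import random
from collections import Counter

# pmf of a fair die: a probability measure on {1, ..., 6}.
mu = {k: 1 / 6 for k in range(1, 7)}

def convolve(mu, nu):
    """pmf of mu * nu, i.e., of X + Y with X ~ mu, Y ~ nu independent."""
    out = Counter()
    for x, px in mu.items():
        for y, py in nu.items():
            out[x + y] += px * py
    return dict(out)

conv = convolve(mu, mu)

# Monte Carlo cross-check of P[X + Y = 7].
random.seed(2)
n = 200_000
freq7 = sum(random.randint(1, 6) + random.randint(1, 6) == 7 for _ in range(n)) / n
```

The exact convolution gives P[X + Y = 7] = 1/6, and the simulated frequency agrees.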

Remark 1.135. (a) Suppose that Fµ is the cdf of the probability measure µ, i.e.,
Fµ(c) = µ[(−∞, c]], ∀c ∈ R. Then the cdf Fµ∗ν of µ ∗ ν satisfies

    Fµ∗ν(c) = ∫_R Fµ(c − x) ν[dx],  ∀c ∈ R.

We write this equality as

    Fµ∗ν = Fµ ∗ ν.                                                     (1.3.48)

If µ and ν are absolutely continuous with respect to the Lebesgue measure λ on R,
so that

    µ[dx] = ρµ(x) dx,  ν[dx] = ρν(x) dx,  ρµ, ρν ∈ L¹(R, λ),

then µ ∗ ν ≪ λ and

    (µ ∗ ν)[dx] = ρµ∗ν(x) dx,  ρµ∗ν(x) = (ρµ ∗ ρν)(x) := ∫_R ρµ(x − y) ν[dy].

To see this it suffices to check that for any c ∈ R we have

    (µ ∗ ν)[(−∞, c]] = ∫_{−∞}^c ρµ∗ν(x) dx.

We have

    (µ ∗ ν)[(−∞, c]] = ∫_R µ[(−∞, c − y]] ν[dy]
                     = ∫_R ( ∫_{−∞}^{c−y} ρµ(x) dx ) ν[dy]
                     = ∫_R ( ∫_{−∞}^{c} ρµ(z − y) dz ) ν[dy]

(use Fubini)

    = ∫_{−∞}^c ( ∫_R ρµ(z − y) ν[dy] ) dz = ∫_{−∞}^c ρµ∗ν(z) dz.

(b) Any Borel probability measure µ on R is the probability distribution of the
random variable

    1_R : (R, B_R, µ) → R,  1_R(x) = x.

If µ1, µ2, µ3 are different Borel probability measures on R, then we can define
three independent random variables

    X1, X2, X3 : (R³, B_{R³}, µ1 ⊗ µ2 ⊗ µ3) → R,
    Xk(x1, x2, x3) = xk,  k = 1, 2, 3.

Note that PXk = µk, ∀k = 1, 2, 3. Since (X1 + X2) ⊥⊥ X3 and X1 ⊥⊥ (X2 + X3) we
deduce

    (µ1 ∗ µ2) ∗ µ3 = P(X1+X2)+X3 = PX1+(X2+X3) = µ1 ∗ (µ2 ∗ µ3).

Similarly

    µ1 ∗ µ2 = PX1+X2 = PX2+X1 = µ2 ∗ µ1.

(c) The operation of convolution makes sense for any finite Borel measures µ, ν on
R and satisfies the same commutativity and associativity properties we encountered
in the case of probability measures. Note that (µ ∗ ν)[R] = µ[R] · ν[R].   □

Example 1.136 (Poisson processes). Suppose that we have a stream of events


occurring in succession at random times S1 ≤ S2 ≤ S3 ≤ · · · such that the waiting
times between two successive occurrences

T1 = S1 , T2 = S2 − S1 , . . . , Tn = Sn − Sn−1 , . . .

are i.i.d. exponential random variables Tn ∼ Exp(λ), n = 1, 2, . . . . We set S0 := 0.


It may help to think of the sequence (Tn ) as inter-arrival times for a bus. The first
bus arrives at the station at time S1 = T1 . Once the n-th bus has left the station,
the waiting time for the next bus to arrive is an exponential random variable Tn+1
independent of the preceding waiting times. From this point of view, Sn is the
arrival time of the n-th bus.
For t > 0 we denote by N(t) the number of events that occurred during the time
interval [0, t]. In terms of streams of buses, N(t) would count the number of buses
that have arrived at the station in the interval [0, t]. In other words

    N(t) = max{n ≥ 1; Sn ≤ t} = #{n ≥ 1; Sn ≤ t}.

This is a discrete random variable with range {0, 1, 2, 3, . . . }. The collection of
random variables N(t), t ≥ 0, is called the Poisson process with intensity λ.
Note that

    N(t) = Σ_{n≥1} I_{[0,t]}(Sn).
Let us find the distribution (pmf) of N(t). We have

    P[N(t) = 0] = P[T1 > t] = e^{−λt} = the survival function of Exp(λ).

If n > 0, then N(t) = n if and only if the n-th bus arrived sometime during the
interval [0, t], i.e., Sn ≤ t, but the (n+1)-th bus has not arrived in this time interval.
We deduce

    P[N(t) = n] = P[{Sn ≤ t} \ {Sn+1 ≤ t}] = P[Sn ≤ t] − P[Sn+1 ≤ t].

If we denote by Fn(t) the cdf of Sn, then we can rewrite the above equality in the
form

    P[N(t) = n] = Fn(t) − Fn+1(t).

We have

    PSn = Exp(λ) ∗ · · · ∗ Exp(λ)   (n times)
        = Gamma(λ, 1) ∗ · · · ∗ Gamma(λ, 1) = Gamma(λ, n),

where the last equality follows from (1.6.6a).
Hence, for n > 0,

    Fn+1(t) = (λ^{n+1}/Γ(n + 1)) ∫_0^t s^n e^{−λs} ds = (λ^{n+1}/n!) ∫_0^t s^n e^{−λs} ds.

For n > 0, we integrate by parts to obtain

    Fn+1(t) = −[ (λ^n/n!) s^n e^{−λs} ]_{s=0}^{s=t} + (λ^n/(n − 1)!) ∫_0^t s^{n−1} e^{−λs} ds
            = −((tλ)^n/n!) e^{−λt} + Fn(t).

Hence

    P[N(t) = n] = Fn(t) − Fn+1(t) = ((tλ)^n/n!) e^{−λt},  n > 0.       (1.3.49)
This shows that N (t) is a Poisson random variable, N (t) ∼ Poi(λt).
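The conclusion N(t) ∼ Poi(λt) can be verified by simulating the exponential inter-arrival times directly; a minimal sketch, with arbitrary intensity and time horizon:

```python
import random
from math import exp, factorial, log

random.seed(3)
lam, t = 1.5, 4.0  # hypothetical intensity and time horizon

def count_arrivals():
    """N(t) for i.i.d. Exp(lam) inter-arrival times."""
    s, n = 0.0, 0
    while True:
        s += -log(random.random()) / lam  # next inter-arrival time
        if s > t:
            return n
        n += 1

counts = [count_arrivals() for _ in range(100_000)]
emp_mean = sum(counts) / len(counts)

# Empirical pmf at k = 5 versus the Poi(lam * t) mass from (1.3.49).
k = 5
emp_pk = sum(c == k for c in counts) / len(counts)
poi_pk = exp(-lam * t) * (lam * t) ** k / factorial(k)
```

The empirical mean approaches λt and the empirical pmf matches (1.3.49).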
The Poisson process plays an important role in probability since it appears
in many situations and displays many surprising phenomena. One such interesting
phenomenon is the waiting time paradox, [59, I.4]. To better appreciate this paradox
we consider two separate situations.
Suppose first that buses arrive at a bus station following a Poisson stream with
frequency λ. Bob arrives at the bus station at a time t ≥ 0, the bus is not there
and he is waiting for the next one. His waiting time is
    Wt := S_{N(t)+1} − t.

We want to compute its expectation wt := E[Wt]. There are two possible heuristic
arguments.

(i) The memoryless property of the exponential distribution shows that wt should
be independent of t, so wt = w0 = 1/λ.
(ii) Bob's arrival time t is uniformly distributed in the inter-arrival interval
[S_{N(t)}, S_{N(t)+1}] of expected length 1/λ and, as in the earlier deterministic
computation, the expectation should be half its length, 1/(2λ).

We will show that (i) provides the correct answer. However, even the reasoning
(ii) holds a bit of truth. To see what is happening we compute the expectations of
S_{N(t)} and S_{N(t)+1}. We have

    E[S_{N(t)}] = ∫_0^t P[S_{N(t)} > x] dx.

Note that

    P[S_{N(t)} > x] = Σ_{n≥0} P[S_{N(t)} > x, N(t) = n].

On the other hand,

    P[S_{N(t)} > x, N(t) = n] = P[x < Sn ≤ t, Sn + Tn+1 > t].

The random variables Sn and Tn+1 are independent and the joint distribution of
(Sn, Tn+1) is

    PSn,Tn+1[ds dτ] = (λ^n/(n − 1)!) s^{n−1} e^{−λs} · λ e^{−λτ} ds dτ =: ρ(s, τ) ds dτ,
so

    P[x < Sn ≤ t, Sn + Tn+1 > t] = ∫∫_{x<s≤t, s+τ>t} ρ(s, τ) ds dτ
    = ∫_x^t ( ∫_{t−s}^∞ ρ(s, τ) dτ ) ds = ∫_x^t P[Tn+1 > t − s] (λ^n/(n − 1)!) s^{n−1} e^{−λs} ds
    = (λ^n/(n − 1)!) ∫_x^t e^{−λ(t−s)} s^{n−1} e^{−λs} ds
    = (λ^n e^{−λt}/(n − 1)!) ∫_x^t s^{n−1} ds = (λ^n e^{−λt}/n!) (t^n − x^n).
We deduce

    P[S_{N(t)}] > x] = Σ_{n≥0} (e^{−λt} λ^n/n!) (t^n − x^n) = 1 − e^{−λ(t−x)},

    E[S_{N(t)}] = ∫_0^t ( 1 − e^{−λ(t−x)} ) dx = t − e^{−λt} ∫_0^t e^{λx} dx
                = t − (e^{−λt}/λ)(e^{λt} − 1).

Hence

    E[S_{N(t)}] = t − 1/λ + e^{−λt}/λ = (1/λ)( E[N(t)] − 1 + e^{−λt} ).   (1.3.50)

 
Let us compute E[S_{N(t)+1}]. Again, we have

    P[S_{N(t)+1} > x] = Σ_{n≥0} P[S_{N(t)+1} > x, N(t) = n],

and

    P[S_{N(t)+1} > x, N(t) = n] = P[Sn ≤ t, Sn+1 ≥ max(t, x)]
    = { P[Sn ≤ t, Sn + Tn+1 ≥ t],  x ≤ t,
        P[Sn ≤ t, Sn + Tn+1 ≥ x],  x > t.

For any c ≥ t we have

    P[Sn ≤ t, Sn + Tn+1 ≥ c] = ∫∫_{s≤t, s+τ≥c} ρ(s, τ) ds dτ
    = ∫_0^t ( ∫_{c−s}^∞ ρ(s, τ) dτ ) ds = (λ^n/(n − 1)!) ∫_0^t e^{−λ(c−s)} s^{n−1} e^{−λs} ds
    = e^{−λc} (λt)^n/n!.

Observing that

    Σ_{n≥0} e^{−λc} (λt)^n/n! = e^{−λ(c−t)},

we deduce that

    P[S_{N(t)+1} > x] = { 1,             x ≤ t,
                          e^{−λ(x−t)},   x > t.

Hence

    E[S_{N(t)+1}] = ∫_0^t dx + ∫_t^∞ e^{−λ(x−t)} dx = t + 1/λ
                  = (1/λ)( E[N(t)] + 1 ),                              (1.3.51)

and

    wt = E[S_{N(t)+1}] − t = 1/λ.
In fact much more is true. One can show (see [132, Sec. 3.6]) that the waiting
time Wt is an exponential random variable, Wt ∼ Exp(λ), in agreement with the
conclusion of the argument (i).
The above computations are a bit counterintuitive. The number of buses arriving
during a time interval [0, t] is N(t). The buses arrive at a rate of λ per unit of
time, so we should expect to wait t = (1/λ) E[N(t)] units of time for N(t) buses
to arrive. Formula (1.3.50) shows that we should expect less. On the other hand,
formula (1.3.51) shows that we should expect (1/λ)( E[N(t)] + 1 ) units of time
for N(t) + 1 buses to arrive! We refer to Remark 3.71 for an explanation of this
paradoxical divergence of conclusions.
The above computations show that the expectation of Lt = S_{N(t)+1} − S_{N(t)} is

    E[Lt] = 2/λ − e^{−λt}/λ ≈ 2/λ for t large.

This shows that even the argument (ii) captures a bit of what is going on, since wt
is close to half the expected length of the inter-arrival interval [S_{N(t)}, S_{N(t)+1}].
The Poisson processes are special cases of renewal processes. For an enjoyable
and highly readable introduction to renewal processes we refer to [59] or [132,
Chap. 3]. For a more in-depth presentation of these processes and some of their
practical applications we refer to [5].   □
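The waiting-time paradox is easy to see in simulation: the interval straddling the inspection time t is roughly twice as long as a typical inter-arrival interval, yet the residual wait still averages 1/λ. A hedged sketch (rate and observation time are arbitrary):

```python
import random
from math import log

random.seed(4)
lam, t = 1.0, 10.0  # hypothetical rate and observation time

def straddling_arrivals():
    """Return (S_{N(t)}, S_{N(t)+1}), the arrival times straddling t."""
    prev = cur = 0.0
    while cur <= t:
        prev = cur
        cur += -log(random.random()) / lam  # Exp(lam) inter-arrival time
    return prev, cur

trials = 100_000
wait = length = 0.0
for _ in range(trials):
    lo, hi = straddling_arrivals()
    wait += hi - t     # W_t = S_{N(t)+1} - t
    length += hi - lo  # L_t = length of the straddling interval
wait /= trials
length /= trials
```

The averages come out near 1/λ and 2/λ respectively, matching (1.3.51) and the formula for E[Lt].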

1.3.7 Modes of convergence of random variables


Fix a probability space (Ω, S, P).

Definition 1.137 (Almost sure convergence). We say that the sequence of
random variables

    Xn ∈ L⁰(Ω, S, P),  n ∈ N,

converges almost surely (or a.s.) to X ∈ L⁰(Ω, S, P) if there exists Ω0 ∈ S such that

    P[Ω0] = 1,  lim_{n→∞} Xn(ω) = X(ω),  ∀ω ∈ Ω0.

We will use the notation Xn →a.s. X to indicate the a.s. convergence.   □

To describe a useful criterion for a.s. convergence we need to rely on a very


versatile classical result.

Definition 1.138. For any sequence of events (An)_{n∈N} ⊂ S we denote by {An i.o.}
the event "An occurs infinitely often",

    {An i.o.} := ∩_{m≥1} ∪_{n≥m} An.

Thus

    ω ∈ {An i.o.} ⟺ ∀m ∈ N ∃n ≥ m : ω ∈ An.   □

Theorem 1.139 (Borel-Cantelli Lemma). Consider a sequence of events


(An )n∈N ⊂ S.

(i) If

    Σ_{n≥1} P[An] < ∞,

then P[An i.o.] = 0.
(ii) Conversely, if the events (An)_{n∈N} are independent, then P[An i.o.] ∈ {0, 1},
and

    P[An i.o.] = 0 ⟺ Σ_{n≥1} P[An] < ∞.                                (1.3.52)

Proof. (i) We set

    N := Σ_{n≥1} I_{An}.

Note that {An i.o.} = {N = ∞}. From the Monotone Convergence Theorem we
deduce

    E[N] = Σ_{n≥1} E[I_{An}] = Σ_{n≥1} P[An] < ∞,

so P[N = ∞] = 0.

(ii) Kolmogorov’s
 0-1
 theorem shows that when the events (An )n≥1 are independent
we have P An i.o. ∈ {0, 1}.
To prove (1.3.52) we have to show that if
X  
P An = ∞,
n≥1
 
then P An i.o. = 1. We have
" # " #
[ \
c
P An = 1 − P An
n≥m n≥m

(use the independence of An )


Y  
=1− 1 − P An
n≥m

(1 − x ≤ e−x , ∀x ∈ R)
P
≥ 1 − e− n≥m P[An ]
= 1.

Hence
" #
  [
P An i.o. = lim P An = 1.
m→∞
n≥m

t
u

Remark 1.140. Statement (i) in Theorem 1.139 is usually referred to as the First
Borel-Cantelli Lemma while statement (ii) is usually referred to as the Second Borel-
Cantelli Lemma. Exercises 3.12 and 3.18 present refinements of the Borel-Cantelli
lemmas.   □
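The dichotomy in the second Borel–Cantelli Lemma can be made tangible numerically: for independent events, the probability that none of A_m, . . . , A_M occurs is the product Π (1 − P[An]). A small sketch contrasting a divergent series of probabilities with a convergent one (the choices 1/n and 1/n² are illustrative):

```python
from math import prod

def none_occurs(p, m, M):
    """P[no A_n occurs for m <= n <= M], events independent with P[A_n] = p(n)."""
    return prod(1 - p(n) for n in range(m, M + 1))

m, M = 10, 10_000
# Divergent sum (P[A_n] = 1/n): the product telescopes to (m-1)/M -> 0,
# so some A_n eventually occurs, consistent with P[A_n i.o.] = 1.
div_tail = none_occurs(lambda n: 1 / n, m, M)
# Convergent sum (P[A_n] = 1/n^2): the product stays bounded away from 0.
conv_tail = none_occurs(lambda n: 1 / n**2, m, M)
```

Here `div_tail` equals (m − 1)/M = 0.0009 exactly (up to rounding), while `conv_tail` stays near 0.9.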

Observe that Xn → X a.s. if and only if, for any ν ∈ N,

    P[ |Xn − X| > 1/ν i.o. ] = 0.

The Borel–Cantelli Lemma now implies the following result.

Corollary 1.141. Suppose that there exists X ∈ L⁰(Ω, S, P) such that the sequence
Xn ∈ L⁰(Ω, S, P) satisfies

    Σ_{n≥1} P[ |Xn − X| > ε ] < ∞,  ∀ε > 0.

Then Xn → X a.s.

Proof. The Borel–Cantelli Lemma implies that

    P[ |Xn − X| > ε i.o. ] = 0,  ∀ε > 0.

Hence, for any ε > 0 there exists a negligible set Sε ∈ S such that, for any
ω ∈ Ω \ Sε, we have

    lim sup_{n→∞} |Xn(ω) − X(ω)| ≤ ε.

Set

    S∞ = ∪_{k∈N} S_{1/k}.

We deduce that for any ω ∈ Ω \ S∞ we have

    lim sup_{n→∞} |Xn(ω) − X(ω)| ≤ 1/k,  ∀k ∈ N.   □

Definition 1.142. We say that the sequence Xn ∈ L⁰(Ω, S, P) converges in probability to the random variable X ∈ L⁰(Ω, S, P) if, ∀ε > 0, we have

    lim_{n→∞} P[ |Xn − X| > ε ] = 0.

We will use the notation Xn →p X to indicate convergence in probability.   □

Observe that if Xn → X in probability and, for any n ∈ N, we have Xn = Xn′
a.s., then Xn′ → X in probability. Thus the convergence in probability is correctly
defined in L⁰(Ω, S, P).
The convergence in probability is equivalent to the convergence defined by a
metric on L⁰(Ω, S, P). For X, Y ∈ L⁰(Ω, S, P) we set

    dist(X, Y) := E[ min(|X − Y|, 1) ].                                (1.3.53)

Clearly dist(X, Y) = dist(Y, X) and

    dist(X, Z) ≤ dist(X, Y) + dist(Y, Z).

Note that dist(X, Y) = 0 iff X = Y a.s., so "dist" is a metric on L⁰(Ω, S, P).
Proposition 1.143. Let X, Xn ∈ L0 (Ω, S, P). Then the following statements are
equivalent.

(i) Xn → X in probability as n → ∞.
(ii) dist(Xn , X) → 0 as n → ∞.

Proof. Set

    ρ(x) := min(|x|, 1),  Yn := Xn − X.

Using Markov's inequality we deduce that for any n ≥ 1 and any ε ∈ (0, 1) we have

    ε P[ |Yn| > ε ] = ε P[ ρ(Yn) > ε ] ≤ E[ ρ(Yn) ] = dist(Xn, X).

This shows that (ii) ⇒ (i).
Conversely, observe that, for any ε > 0, we have

    E[ ρ(Yn) ] = ∫_{|Yn|≤ε} ρ(Yn) dP + ∫_{|Yn|>ε} ρ(Yn) dP ≤ ε + P[ |Yn| > ε ].

This proves that 0 ≤ lim inf dist(Yn, 0) ≤ lim sup dist(Yn, 0) ≤ ε, ∀ε > 0.   □

The next result describes the relationships between a.s. convergence and con-
vergence in probability.
Theorem 1.144. Let X, Xn ∈ L0 (Ω, S, P). Then the following hold.

(i) If Xn → X a.s., then Xn → X in probability.


(ii) If Xn → X in probability, then (Xn ) contains a subsequence that converges
a.s. to X.
(iii) The sequence Xn converges in probability to X if and only if any subsequence
contains a further subsequence that is a.s. convergent to X.

Proof. (i) Set Yn := Xn − X. Since Yn → 0 a.s. we have min(|Yn|, 1) → 0 a.s.
From the Dominated Convergence Theorem we deduce

    dist(Xn, X) = E[ min(|Yn|, 1) ] → 0,

so that Yn →p 0.
(ii) Suppose that Yn → 0 in probability. We deduce that for any k ∈ N there exists
nk ∈ N such that

    ∀n ≥ nk :  P[ |Yn| > 1/k ] < 1/2^k.

Now observe that for any m > 0, the series

    Σ_{k≥1} P[ |Y_{nk}| > 1/m ]

is convergent since, for k > m, we have

    P[ |Y_{nk}| > 1/m ] ≤ P[ |Y_{nk}| > 1/k ] < 1/2^k.

The desired conclusion now follows from Corollary 1.141.
(iii) Recall that a sequence in a metric space converges to a given point if and
only if any subsequence contains a sub-subsequence converging to that point. The
properties (i) and (ii) show that the sequence (Xn) satisfies this condition with
respect to the metric dist defined by (1.3.53).   □

Corollary 1.145. If the sequence (Xn) in L⁰(Ω, S, P) converges in probability to
X, then for any continuous function f : R → R the sequence f(Xn) converges in
probability to f(X).

Proof. The sequence (Xn) satisfies the necessary and sufficient condition (iii) in
Theorem 1.144. Since f is continuous, the sequence f(Xn) satisfies this necessary
and sufficient condition as well.   □

Definition 1.146. Let p ∈ [1, ∞). We say that the sequence (Xn)_{n∈N} ⊂ L⁰(Ω, S, P)
converges in p-mean or in Lp to X ∈ L⁰(Ω, S, P) if

    X, Xn ∈ Lp(Ω, S, P),  ∀n ∈ N,

and

    lim_{n→∞} E[ |Xn − X|^p ] = 0.   □

Proposition 1.147. If Xn → X in p-mean, then Xn → X in probability. In
particular, (Xn) admits a subsequence that converges a.s. to X.

Proof. Set Yn := Xn − X. Then

    P[ |Yn| > ε ] = P[ |Yn|^p > ε^p ]  ≤(1.2.19)  (1/ε^p) E[ |Yn|^p ] → 0 as n → ∞.
□

Example 1.148. For each n ∈ N and each 1 ≤ k ≤ n we set

    A_{k,n} = [(k − 1)/n, k/n],  X_{k,n} = I_{A_{k,n}} : [0, 1] → R.

Then the sequence of random variables

    X_{1,1}, X_{1,2}, X_{2,2}, X_{1,3}, X_{2,3}, X_{3,3}, . . .

converges in mean and in probability to 0. It does not converge a.s. to 0 because
for any x ∈ [0, 1] infinitely many of these random variables are equal to 1 at x.
The related sequence Y_{k,n} = n X_{k,n} converges in probability to 0 but not in mean
since ‖Y_{k,n}‖_{L¹} = 1.   □
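The two claims of this "typewriter" example can be confirmed mechanically: along the sequence, P[X_{k,n} ≠ 0] = 1/n → 0, yet every fixed point is covered in each block of length n. A small sketch (the point x0 and block cutoff are arbitrary):

```python
from fractions import Fraction

# The typewriter sequence: X_{k,n} = indicator of [(k-1)/n, k/n] on [0, 1].
def X(n, k, x):
    return 1 if (k - 1) / n <= x <= k / n else 0

x0 = 0.3      # an arbitrary point of [0, 1]
hits = 0      # how many terms of the sequence equal 1 at x0
probs = []    # P[X_{k,n} != 0] = 1/n along the sequence
for n in range(1, 201):
    for k in range(1, n + 1):
        probs.append(Fraction(1, n))
        hits += X(n, k, x0)
```

Each block n contributes at least one hit at x0, so `hits` grows without bound even though `probs` tends to 0.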

Example 1.149 (Bernoulli). Suppose that (Xn)_{n≥1} is a sequence of i.i.d.
Bernoulli random variables with winning probability p = ½. Set

    Sn = X1 + · · · + Xn ∼ Bin(n, 1/2),  Mn = (1/n) Sn.

Then

    Var[Mn] = (1/n²) Var[Sn] = (1/n) Var[Ber(1/2)] = 1/(4n).

Hence

    ‖Mn − 1/2‖_{L²} = 1/(2√n) → 0 as n → ∞,

so that Mn converges in 2-mean to ½ and thus, in probability, to ½. Intuitively, Mn
is the fraction of Heads in a string of n independent fair coin flips. From Chebyshev's
inequality we deduce that

    P[ |Mn − 1/2| > ε ] ≤ 1/(4nε²).

It turns out that this probability is much smaller. In (2.3.11a) we will show that

    P[ |Mn − 1/2| > ε ] ≤ 2 e^{−2nε²}.

For example, if n = 1000 and ε = 0.1, then

    P[ |M1000 − 1/2| > 0.1 ] ≤ 2 e^{−20} ≈ 4.1 × 10⁻⁹.   □
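The gap between the Chebyshev bound and the exponential bound is easy to see empirically; a hedged sketch with n = 1000 and ε = 0.1 (the number of Monte Carlo trials is an arbitrary choice):

```python
import random
from math import exp

random.seed(5)
n, eps, trials = 1000, 0.1, 2000

def large_deviation():
    """Is the fraction of heads in n fair flips more than eps away from 1/2?"""
    heads = sum(random.getrandbits(1) for _ in range(n))
    return abs(heads / n - 0.5) > eps

emp = sum(large_deviation() for _ in range(trials)) / trials

chebyshev = 1 / (4 * n * eps**2)      # = 0.025
hoeffding = 2 * exp(-2 * n * eps**2)  # ~ 4.1e-9
```

In a run of this size the deviation essentially never occurs, consistent with the exponential bound being far sharper than Chebyshev's.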

Example 1.150 (Longest common subsequence). Consider a finite set A,
|A| = k, called the alphabet. A word of length n in the alphabet A is a finite
sequence of the form

    x := (x1, . . . , xn) ∈ Aⁿ.

A subsequence of such a word is a word of the form

    (x_{f(1)}, . . . , x_{f(ℓ)}) ∈ A^ℓ,

where f is an increasing function f : {1, . . . , ℓ} → {1, . . . , n}. The natural number ℓ
is called the length of the subsequence.
A common subsequence of two words x, y ∈ Aⁿ is a word w ∈ A^ℓ that is a subsequence of both. For example, if A = {H, T}, then H, T, H, T, T is a common
subsequence of both words

    H, T, T, H, H, T, T  and  T, H, T, H, T, T, H.

We are interested in the length of the longest common subsequence of two random
words of length n on the alphabet A. Such a problem arises in genetics. In that
case the alphabet is {A, C, T, G}. The DNA molecules are described by (very long)
words in this alphabet. The existence of a long common subsequence of two such
words is an indication of a common ancestor of two living organisms with those
DNAs.
From a mathematical point of view, we fix a probability measure π on an alphabet A and we choose independent random variables

    {Xn, Yn; n ∈ N},

where Xn, Yn are A-valued and have common distribution π.
One can think that these random variables are obtained as follows. Two individuals independently roll identical "dice" with faces labeled by A and whose
occurrences are governed by π. The first individual generates the sequence (Xn)
while the second individual generates the sequence (Yn). We denote by Ln the length
of the longest common subsequence of the words

    (X1, . . . , Xn)  and  (Y1, . . . , Yn).
We want to prove that a.s. and in L¹ we have

    lim_{n→∞} Ln/n = R(π) := sup_{n≥1} Ln/n.                            (1.3.54)

In particular, this shows that

    lim_{n→∞} Ln/n ≥ L1.

Note that L1 is a Bernoulli random variable with success probability

    p = Σ_{a∈A} π[a]².

The equality (1.3.54) is due to Chvátal and Sankoff [32], but we will follow the
presentation in [144, Chap. 1].
The key observation is that the sequence ℓn := E[Ln], n ∈ N, is superadditive, i.e.,

    ℓn + ℓm ≤ ℓm+n,  ∀m, n ∈ N.                                        (1.3.55)

The proof is very simple. We set Zn = (Xn, Yn) and we observe that
the random variable Ln is an invariant of the sequence of pairs (Z1, . . . , Zn),
Ln = L(Z1, . . . , Zn). Clearly

    Lm = L(Zn+1, . . . , Zn+m),  ∀m, n ∈ N.

If we concatenate the longest common subsequence of (Z1, . . . , Zn) with the longest
common subsequence of (Zn+1, . . . , Zn+m) we obtain a common subsequence of
(Z1, . . . , Zn, Zn+1, . . . , Zn+m) of length

    L(Z1, . . . , Zn) + L(Zn+1, . . . , Zn+m),

showing that

    L(Z1, . . . , Zn) + L(Zn+1, . . . , Zn+m) ≤ L(Z1, . . . , Zn, Zn+1, . . . , Zn+m),

i.e.,

    Lm + Ln ≤ Lm+n,  ∀m, n ∈ N.                                        (1.3.56)

Taking the expectations of both sides in the above inequality we obtain (1.3.55).
The conclusion (1.3.54) is now an immediate consequence of the following ele-
mentary result.

Lemma 1.151 (Fekete). Suppose that (xn)_{n≥1} is a subadditive sequence of real
numbers, i.e.,

    xm+n ≤ xm + xn,  ∀m, n ∈ N.

Then

    lim_{n→∞} xn/n = µ := inf_{n≥1} xn/n.

Proof. For any c > µ we can find k = k(c) such that xk ≤ kc. The subadditivity condition implies xkn ≤ n xk, ∀n ∈ N, so that

    µ ≤ x_{nk}/(nk) ≤ xk/k ≤ c,  ∀n ∈ N.

Hence

    µ ≤ lim inf_{n→∞} xn/n ≤ c,  ∀c > µ,

i.e.,

    µ = lim inf_{n→∞} xn/n.

Now observe that for any n ≥ k, there exist m ∈ N and r ∈ {0, 1, . . . , k − 1} such
that n = mk + r (with the convention x0 := 0). Hence

    xn ≤ m xk + xr ≤ mkc + xr,

so that

    xn/n ≤ (n − r)c/n + M/n,  M := max{ |x0|, . . . , |x_{k−1}| }.

Hence

    lim sup_{n→∞} xn/n ≤ lim sup_{n→∞} (n − r)c/n = c,  ∀c > µ.

This completes the proof of the lemma.   □

The conclusion (1.3.54) follows from Fekete's Lemma applied to the sequence
xn := −Ln. The inequality (1.3.56) shows that

    Ln/n → R := sup_{n≥1} Ln/n.

Set r = r(π) := E[R]. The Cauchy–Schwarz inequality implies

    r ≥ E[L1] = Σ_{a∈A} π[a]² ≥ (1/k) ( Σ_{a∈A} π(a) )² = 1/k > 0.

The Dominated Convergence Theorem implies that

    r = lim_{n→∞} (1/n) E[Ln].

The exact value of r(π) is not known in general. In Example 3.34, using more
sophisticated techniques, we will show that the limit R(π) is constant, R(π) = r,
and Ln is highly concentrated around its mean rn.   □
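The objects of this example are all computable: Ln is produced by the standard longest-common-subsequence dynamic program, and the pathwise superadditivity (1.3.56) can be checked on random words. A sketch (alphabet {H, T}, word length, and split point are arbitrary):

```python
import random

random.seed(6)

def lcs_len(x, y):
    """Longest common subsequence length, standard O(|x| |y|) dynamic program."""
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, a in enumerate(x, 1):
        for j, b in enumerate(y, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if a == b else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

n = 400
X = [random.choice("HT") for _ in range(n)]
Y = [random.choice("HT") for _ in range(n)]

# Pathwise superadditivity (1.3.56): splitting the words can only lose matches.
m = n // 2
L_whole = lcs_len(X, Y)
L_split = lcs_len(X[:m], Y[:m]) + lcs_len(X[m:], Y[m:])
ratio = L_whole / n  # a rough estimate of the limit for the binary alphabet
```

On the worked example from the text, `lcs_len` confirms that H, T, H, T, T is a common subsequence of the two length-7 words.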

The concept of convergence in probability is weaker than the concepts of convergence a.s. or in p-mean. In many applications it is useful to know sufficient additional assumptions that will guarantee that a sequence convergent in probability
is also convergent in p-mean. The a.s. convergence does not guarantee convergence
in mean. The next elementary example is typical of what can go wrong.

Example 1.152. Consider the interval [−1, 1] equipped with the uniform probability measure ½ dx. Consider the sequence of bounded, nonnegative random variables

    Xn = 2ⁿ I_{[−2⁻ⁿ, 2⁻ⁿ]}.

Note that Xn → 0 a.s. but

    E[Xn] = (2ⁿ/2) ∫_{−2⁻ⁿ}^{2⁻ⁿ} dx = 1,  ∀n.

As we will see later in Chapter 3, the reason why the convergence in mean fails is
the high concentration of Xn on sets of smaller and smaller measure.   □
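This spike example is easy to probe numerically: the Monte Carlo mean stays near 1 for every n, while at any fixed nonzero point the sequence is eventually 0. A minimal sketch (the sample size and the test point are arbitrary):

```python
import random

random.seed(7)

# X_n = 2^n on [-2^{-n}, 2^{-n}], 0 elsewhere, under uniform measure on [-1, 1].
def X(n, x):
    return 2.0**n if abs(x) <= 2.0**-n else 0.0

trials = 400_000
xs = [random.uniform(-1.0, 1.0) for _ in range(trials)]
means = [sum(X(n, x) for x in xs) / trials for n in range(1, 6)]

# Pointwise convergence to 0: for fixed x != 0, X_n(x) vanishes eventually.
x0 = 0.37
tail = [X(n, x0) for n in range(1, 30)]
```

Every entry of `means` is close to 1 even though the pointwise values in `tail` are eventually 0.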

Our next result is an example of a sufficient condition for a sequence converging


in probability to also converge in the mean. It is a stepping stone towards the more
refined results that we will discuss in Chapter 3.

Theorem 1.153 (Bounded Convergence Theorem). Suppose that (Xn) is a
sequence in L¹(Ω, S, P) that converges in probability to X ∈ L¹(Ω, S, P). If the
sequence (Xn) is bounded in L^∞(Ω, S, P), i.e.,

    M := sup_{n∈N} ‖Xn‖_∞ < ∞,

then Xn → X in L¹ and

    lim_{n→∞} E[Xn] = E[X].                                            (1.3.57)

Proof. We follow the approach in [153, Thm. 1.4]. Since

    | E[Xn] − E[X] | ≤ E[ |Xn − X| ],

and |Xn − X| → 0 in probability, it suffices to consider only the special case X = 0
and Xn ≥ 0 a.s. In such an instance the claimed L¹-convergence follows from
(1.3.57).
For any ε > 0 we have

    E[Xn] = E[ Xn I_{Xn≤ε} ] + E[ Xn I_{Xn>ε} ] ≤ ε + M P[Xn > ε].

Letting n → ∞ and taking into account that Xn ≥ 0 and Xn → 0 in probability we
deduce

    0 ≤ lim inf_{n→∞} E[Xn] ≤ lim sup_{n→∞} E[Xn] ≤ ε,  ∀ε > 0.   □

Remark 1.154. The Bounded Convergence theorem does not follow immediately
from the Dominated Convergence Theorem which involves a.s. convergence. How-
ever, using Theorem 1.144(iii) we can use the Dominated Convergence Theorem to
provide an alternate proof of the Bounded Convergence Theorem.   □

1.4 Conditional expectation

The concept of conditioning is a central pillar of the theory of probability. It has


a genuinely probabilistic origin and very rich and subtle ramifications. Also, it
takes some time getting used to it. This concept is one important reason why in
probability sigma-algebras play a much more important role than in analysis.
Fix a probability space (Ω, S, P).

1.4.1 Conditioning on a sigma sub-algebra


The main formal constructions of this section are best understood if we first consider
a special but very useful example.

Example 1.155 (Conditioning on a partition). Suppose that (Ω, S, P) is a
probability space and (Fα)_{α∈A}, A ⊂ N, is a finite or countable partition of Ω with
measurable and non-negligible chambers, i.e.,

    Fα ∈ S,  P[Fα] > 0,  ∀α ∈ A.

We denote by F the sigma-algebra generated by this partition. In other words,
F ∈ F if and only if it is a union of chambers Fα. This means that ∃B ⊂ A such
that

    F = ∪_{β∈B} Fβ.

Observe that a function Y : Ω → R is F-measurable if and only if there exist real
numbers (yα)_{α∈A} such that

    Y = Σ_{α∈A} yα I_α,  I_α := I_{Fα}.

Moreover,

    Y ∈ L¹ ⟺ Σ_α |yα| P[Fα] < ∞.

Suppose now that X ∈ L¹(Ω, S, P). We define the expectation of X given the event
Fα to be the expectation of X with respect to the conditional probability P[− | Fα],
i.e., the number

    x̄α = E[X | Fα] := (1/P[Fα]) E[X I_α] = (1/P[Fα]) ∫_{Fα} X(ω) P[dω].   (1.4.1)

We obtain an F-measurable random variable

    X̄ = Σ_α x̄α I_α.

Note that

    |x̄α| ≤ (1/P[Fα]) E[ |X| I_α ],

so

    E[ |X̄| ] ≤ Σ_α E[ |X| I_α ] = E[ |X| ] < ∞.

Since

    E[ X̄ I_α ] = E[ X I_α ],  ∀α ∈ A,

we deduce

    E[ X̄ I_F ] = E[ X I_F ],  ∀F ∈ F.                                 (1.4.2)

Note that if

    Y = Σ_α yα I_α

is another F-measurable, integrable random variable that satisfies (1.4.2), then

    P[Fα] x̄α = E[X I_α] = E[Y I_α] = P[Fα] yα,  ∀α ∈ A,

so that yα = x̄α, ∀α, i.e., X̄ is uniquely determined by (1.4.2).
If in (1.4.2) we set F = Ω we deduce

    E[X] = E[X̄] = Σ_α x̄α P[Fα] = Σ_α E[X | Fα] P[Fα].                 (1.4.3)

When X = I_S, then

    E[I_S | Fα] = P[S ∩ Fα]/P[Fα] = P[S | Fα].

In this special case the equality (1.4.3) becomes the law of total probability

    P[S] = Σ_α P[S | Fα] P[Fα].                                        (1.4.4)
□
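For a finite partition, formulas (1.4.1)–(1.4.3) can be computed directly. A minimal sketch on a hypothetical six-point sample space with a two-chamber partition (all the concrete numbers are illustrative choices, not from the text):

```python
# E[X || F] for F generated by a finite partition, per formula (1.4.1).
omega = range(6)
P = {w: 1 / 6 for w in omega}            # uniform probability
partition = [{0, 1, 2}, {3, 4, 5}]       # chambers F_0, F_1
X = {0: 2.0, 1: -1.0, 2: 5.0, 3: 0.0, 4: 3.0, 5: 7.0}

def cond_exp(X, partition, P):
    """The F-measurable function Xbar, constant on each chamber."""
    out = {}
    for F in partition:
        pF = sum(P[w] for w in F)
        xbar = sum(X[w] * P[w] for w in F) / pF  # E[X | F_alpha]
        for w in F:
            out[w] = xbar
    return out

Xbar = cond_exp(X, partition, P)

# The defining property (1.4.2) with F = Omega yields the tower rule (1.4.3).
E_X = sum(X[w] * P[w] for w in omega)
E_Xbar = sum(Xbar[w] * P[w] for w in omega)
```

Here X̄ is constant on each chamber (the chamber averages 2 and 10/3), and its expectation coincides with E[X], as (1.4.3) requires.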

The next result explains why the condition (1.4.2) is key to our further devel-
opments.

Proposition 1.156. If F ⊂ S is a sigma sub-algebra and Y0, Y1 ∈ L¹(Ω, F, P) are
two F-measurable random variables such that

    E[Y0 I_F] = E[Y1 I_F],  ∀F ∈ F,                                    (1.4.5)

then Y0 = Y1 a.s.

Proof. Set Z = Y0 − Y1. Then Z is F-measurable, integrable and satisfies

    E[Z I_F] = 0,  ∀F ∈ F.                                             (1.4.6)

If we let F = {Z > 1/n}, n ∈ N, we deduce that

    (1/n) P[Z > 1/n] ≤ E[ Z I_{Z>1/n} ] = 0,  ∀n ∈ N.

Thus

    P[Z > 1/n] = 0, ∀n ∈ N  ⇒  P[Z > 0] = 0.

A similar argument shows that P[Z < 0] = 0.   □

Definition 1.157. Let F ⊂ S be a sigma sub-algebra and X ∈ L¹(Ω, S, P). A
version of the conditional expectation of X given F is an F-measurable random
variable X̄ ∈ L¹(Ω, F, P) such that

    E[X̄ I_F] = E[X I_F],  ∀F ∈ F.                                     (1.4.7)
□

According to Proposition 1.156, any two random variables X̄0, X̄1 ∈ L¹(Ω, F, P)
satisfying (1.4.7) are a.s. equal. Their equivalence class in L¹(Ω, F, P) is denoted by
E[X ‖ F] and it is called the conditional expectation of X given F.
I am using different notations for the conditional expectation given an event,
E[X | F], and the conditional expectation given a sigma-subalgebra, E[X ‖ F], for
a simple reason: I want to emphasize visually that the first is a number and the
latter is a function.

Remark 1.158. Using the Monotone Convergence Theorem and the Monotone Class Theorem we deduce that the following are equivalent.

(i) The random variable X̄ ∈ L1 (Ω, F, P) is a representative of E[X ‖ F ].
(ii) For any Y ∈ L∞ (Ω, F, P)

E[X Y ] = E[X̄ Y ].   (1.4.8)

(iii) There exists a π-system A ⊂ F that generates F such that

E[X I_A ] = E[X̄ I_A ], ∀A ∈ A.   (1.4.9)

From Corollary 1.94 we deduce that

X̄ = E[X ‖ F ] ⟺ X̄ is F-measurable and (X − X̄) ⊥⊥ F.   (1.4.10)   ⊔⊓

Definition 1.159. Given random variables X ∈ L0 (Ω, S, P), Y ∈ L1 (Ω, S, P) we write

E[Y ‖ X ] := E[Y ‖ σ(X) ],

where σ(X) denotes the sigma-subalgebra generated by X. This random variable is called the conditional expectation of Y given X.   ⊔⊓
 
Remark 1.160. A function Ȳ ∈ L1 (Ω, σ(X), P) represents E[Y ‖ X ] if, for any x ∈ R, we have

∫_{X≤x} Ȳ(ω) P[dω] = ∫_{X≤x} Y(ω) P[dω].

Since E[Y ‖ X ] is σ(X)-measurable, we deduce from Dynkin’s Theorem 1.23 that there exists a Borel measurable function f : R → R such that

f (X) = E[Y ‖ X ] a.s.

This is equivalent to the statement

E[Y I_{X≤x} ] = E[f (X) I_{X≤x} ], ∀x ∈ R.   (1.4.11)

The function f (x) is called the conditional expectation of Y given X = x and it is denoted by E[Y | X = x ]. Think of it as the conditional expectation of Y given the possibly negligible event {X = x}.

Note that

E[Y ] = E[Ȳ ] = E[ E[Y ‖ X ] ] = E[f (X) ].

Thus

E[Y ] = E[f (X) ] = ∫_R f (x) P_X [dx].

We can rewrite the last equality as

E[Y ] = ∫_R E[Y | X = x ] P_X [dx].   (1.4.12)

Computing the expectation of Y by relying on the above identity is referred to as computing the expectation of Y by conditioning on X. This generalizes the elementary situation in Exercise 1.12.   ⊔⊓

Example 1.161. Suppose that X, Y : (Ω, S, P) → R are two random variables such that their joint probability distribution P_{X,Y} ∈ Prob(R²) is absolutely continuous with respect to the Lebesgue measure on R². This means that there exists a Lebesgue integrable function

p_{X,Y} : R² → [0, ∞)

such that

P[ (X, Y ) ∈ B ] = ∫_B p_{X,Y}(x, y) dx dy, ∀B ∈ B_{R²}.

We denote by P_X and respectively P_Y the probability distributions of X and respectively Y. Note that the cumulative distribution function F_X of X is

F_X(c) = P[X ≤ c ] = ∫_{−∞}^c ( ∫_R p_{X,Y}(x, y) dy ) dx = ∫_{−∞}^c p_X(x) dx, where p_X(x) := ∫_R p_{X,Y}(x, y) dy.

This shows that P_X is absolutely continuous with respect to the Lebesgue measure on R and

P_X [dx] = p_X(x) dx.

Similarly

P_Y [dy] = p_Y(y) dy, p_Y(y) := ∫_R p_{X,Y}(x, y) dx.

Classically, the probability distributions P_X and P_Y are called the marginal distributions of the random vector (X, Y ). We define

p_{Y |X=x}(y) := p_{X,Y}(x, y)/p_X(x) if p_X(x) ≠ 0, and p_{Y |X=x}(y) := 0 if p_X(x) = 0.

Assume that Y is integrable. Define

f : R → R, f (x) := ∫_R y p_{Y |X=x}(y) dy = (1/p_X(x)) ∫_R y p_{X,Y}(x, y) dy if p_X(x) ≠ 0, and f (x) := 0 if p_X(x) = 0.

Using Fubini and the integrability of Y we deduce that the above integrals are well defined and the resulting function f is Borel measurable. Note that

f (x) p_X(x) = ∫_R y p_{X,Y}(x, y) dy, ∀x ∈ R.

We want to show that f (x) = E[Y | X = x ], i.e., f (X) is a version of E[Y ‖ X ]. We will show that it satisfies (1.4.11). Let c ∈ R. We have

E[f (X) I_{X≤c} ] = ∫_{−∞}^c f (x) p_X(x) dx = ∫_R ( ∫_R y p_{X,Y}(x, y) dy ) I_{(−∞,c]}(x) dx
= ∫_{R²} y I_{(−∞,c]}(x) p_{X,Y}(x, y) dx dy = E[Y I_{X≤c} ].   (1.4.13)

The function f (x) is the conditional expectation E[Y | X = x ] discussed in Remark 1.160.
Note that the event {X = x} has probability zero, so this nomenclature should be taken with a grain of salt since we cannot apply (1.4.1). Intuitively,

E[Y | X = x ] = lim_{ε↘0} E[Y | {|X − x| < ε} ] = lim_{ε↘0} E[Y I_{|X−x|<ε} ] / P[ |X − x| < ε ],

where the second equality is (1.4.1).   ⊔⊓
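For a concrete sanity check of the formula f (x) = (1/p_X(x)) ∫ y p_{X,Y}(x, y) dy, one can take a joint density with a known conditional mean. The Python sketch below (ours, not from the text; the choice ρ = 0.6 is arbitrary) uses a standard bivariate normal with correlation ρ, for which it is classical that E[Y | X = x ] = ρx, and approximates the integrals by Riemann sums:

```python
import math

rho = 0.6  # correlation of a standard bivariate normal (our choice)

def p_xy(x, y):
    """Joint density p_{X,Y} of the standard bivariate normal."""
    z = (x * x - 2 * rho * x * y + y * y) / (2 * (1 - rho ** 2))
    return math.exp(-z) / (2 * math.pi * math.sqrt(1 - rho ** 2))

def f(x, h=0.01, L=8.0):
    """f(x) = (1/p_X(x)) * integral of y p_{X,Y}(x,y) dy, via Riemann sums."""
    ys = [-L + h * i for i in range(int(2 * L / h) + 1)]
    num = sum(y * p_xy(x, y) for y in ys) * h
    den = sum(p_xy(x, y) for y in ys) * h
    return num / den

# Known closed form in the Gaussian case: E[Y | X = x] = rho * x.
print(f(1.0), f(-0.5))  # close to 0.6 and -0.3
```

The numerically computed conditional mean matches the closed form ρx to several decimal places, which is exactly the statement that f (X) is a version of E[Y ‖ X ].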

One issue we need to address is the existence of the conditional expectation. There is a fast proof based on the Radon–Nikodym theorem. We will use a more roundabout approach that sheds additional light on the nature of conditional expectation. As an aside, let us mention that this approach leads to an alternate proof of the Radon–Nikodym theorem that does not rely on the concept of signed measure.

Theorem 1.162. For any X ∈ L1 (Ω, S, P) and any sigma sub-algebra F ⊂ S there exists a conditional expectation E[X ‖ F ] ∈ L1 (Ω, F, P).

Proof. We follow the approach in [160]. We establish the existence gradually, first under more restrictive assumptions.
Step 1. Assume X ∈ L2 (Ω, S, P). Then L2 (Ω, F, P) is a closed subspace of L2 (Ω, S, P). Denote by P_F X the orthogonal projection of X on this closed subspace. We claim that

P_F X = E[X ‖ F ],   (1.4.14a)

X ≥ 0 ⇒ E[X ‖ F ] ≥ 0.   (1.4.14b)

Set Y = P_F X. Since X − Y ⊥ L2 (Ω, F, P) we deduce

E[ (X − Y ) Z ] = 0, ∀Z ∈ L2 (Ω, F, P).

In particular,

E[ (X − Y ) I_F ] = 0, ∀F ∈ F.

This proves (1.4.14a). Now suppose that X ≥ 0. For any n ∈ N we have

0 ≤ E[X I_{Y ≤−1/n} ] = E[Y I_{Y ≤−1/n} ] ≤ −(1/n) P[Y ≤ −1/n ],

so

P[Y ≤ −1/n ] = 0, ∀n ∈ N.

This proves (1.4.14b). Clearly, the resulting map

L2 (Ω, S, P) ∋ X ↦ E[X ‖ F ] ∈ L2 (Ω, F, P)

is linear.
Step 2. Assume X ∈ L1 (Ω, S, P). Decompose X = X⁺ − X⁻ and, for n ∈ N, set

Xn± := min{ X ±, n }.

Note that Xn± ∈ L∞ (Ω, S, P) and, as n → ∞, Xn± ↗ X ± a.s. From Step 1 we deduce that the random variables Xn± have conditional expectations given F. Choose versions

Yn± := E[Xn± ‖ F ].

Since Xn± − Xm± ≥ 0 a.s. if m ≤ n, we deduce from (1.4.14b) that

0 ≤ Ym± ≤ Yn± a.s., ∀m ≤ n.

We set

Y ± := lim_{n→∞} Yn± .

From the Monotone Convergence Theorem we deduce that

∞ > E[X ± ] = lim_{n→∞} E[Xn± ] = lim_{n→∞} E[Yn± ] = E[Y ± ].

This shows that the random variables Y ± are integrable and in particular a.s. finite. We set

Y := Y ⁺ − Y ⁻.

We will show that Y is a version of the conditional expectation of X given F. Let F ∈ F. Then

E[X I_F ] = E[X ⁺ I_F ] − E[X ⁻ I_F ] = lim_{n→∞} E[Xn⁺ I_F ] − lim_{n→∞} E[Xn⁻ I_F ]
= lim_{n→∞} E[Yn⁺ I_F ] − lim_{n→∞} E[Yn⁻ I_F ] = E[Y ⁺ I_F ] − E[Y ⁻ I_F ] = E[Y I_F ].

This proves that Y is a version of E[X ‖ F ].   ⊔⊓
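Step 1 identifies E[X ‖ F ] with an orthogonal projection: it is the best L²-predictor of X among the F-measurable random variables. The following Python sketch (ours; the toy setup and all names are our own) estimates a conditional expectation empirically by block averaging and confirms that no other function of the conditioning variable achieves a smaller empirical mean squared error:

```python
import random

random.seed(1)

# Toy data: U takes values 0,1,2 ("the information"), X = U + Gaussian noise.
n = 10_000
U = [random.randrange(3) for _ in range(n)]
X = [u + random.gauss(0.0, 1.0) for u in U]

# Empirical conditional expectation E[X || sigma(U)]: the average of X
# over each block {U = u} of the partition generated by U.
means = {u: sum(x for x, v in zip(X, U) if v == u) /
            sum(1 for v in U if v == u) for u in range(3)}
proj = [means[u] for u in U]

def mse(pred):
    return sum((x - q) ** 2 for x, q in zip(X, pred)) / n

# Block averages minimize the empirical L2 error among all predictors
# that are functions of U alone, e.g. the naive guess g(U) = U:
print(mse(proj) <= mse([float(u) for u in U]))  # True
```

The block averages are close to u itself here (the noise has mean zero), and they always beat any competing function of U in empirical L², mirroring the projection characterization (1.4.14a).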

Remark 1.163. (a) The sigma sub-algebra F should be viewed as encoding partial information that we have about a random experiment. Following a terminology frequently used in statistics, we refer to the F-measurable random variables as predictors determined by the information contained in F.
Step 1 in the above proof shows that the conditional expectation X̄ of a random variable X, given the partial information F, should be viewed as the predictor that best approximates X given the information F. The missing part X − X̄ is independent of F, so it is unknowable given only the information encoded by F.
Note that when F = {∅, Ω}, then

E[X ‖ F ] = E[X ] I_Ω .

To put it differently, if the only information we have about a random experiment is that there will be an outcome, then the best we can predict about a numerical characteristic of that outcome is its expectation.
(b) A random variable X ∈ L1 (Ω, S, P) defines a signed measure

μ_X : F → R, μ_X [F ] = ∫_F X(ω) P[dω], ∀F ∈ F.

This measure is absolutely continuous with respect to P (restricted to F). The Radon–Nikodym theorem implies that there exists an F-measurable integrable function ρ_X ∈ L1 (Ω, F, P) such that μ_X [dω] = ρ_X (ω) P[dω], i.e.,

∫_F X(ω) P[dω] = ∫_F ρ_X (ω) P[dω], ∀F ∈ F.

This shows that ρ_X = E[X ‖ F ] a.s.   ⊔⊓

Definition 1.164. Given a sigma subalgebra F ⊂ S and an event S ∈ S, we define the conditional probability of S given F to be the random variable

P[S ‖ F ] := E[ I_S ‖ F ].   ⊔⊓

Example 1.165 (Conditioning on an event). Suppose that S ∈ S is an event such that 0 < P[S ] < 1. Let Y ∈ L1 (Ω, S, P). Then

E[Y ‖ I_S ] = E[Y | S ] I_S + E[Y | S^c ] I_{S^c} ,

where we recall that (see (1.4.1))

E[Y | S ] = (1/P[S ]) E[Y I_S ].   ⊔⊓
Our next result lists the main properties of the conditional expectation.

Theorem 1.166. Suppose that F ⊂ S is a sigma sub-algebra. Then the following hold.

(i) Let X ∈ L1 (Ω, S, P). If Y is any version of E[X ‖ F ], then E[Y ] = E[X ]. In other words,

E[ E[X ‖ F ] ] = E[X ].   (1.4.15)

(ii) If X, Y ∈ L1 (Ω, S, P) and X ≤ Y a.s., then E[X ‖ F ] ≤ E[Y ‖ F ] a.s.
(iii) The map

L1 (Ω, S, P) ∋ X ↦ E[X ‖ F ] ∈ L1 (Ω, F, P)

is a linear contraction, i.e., it is linear and satisfies

‖ E[X ‖ F ] ‖_{L¹} ≤ ‖X‖_{L¹} , ∀X ∈ L1 (Ω, S, P).

(iv) If X ∈ L1 (Ω, S, P) and Y ∈ L∞ (Ω, F, P), then

E[X Y ‖ F ] = Y E[X ‖ F ].

(v) If G ⊂ F is another sigma sub-algebra, then for any X ∈ L1 (Ω, S, P) we have

E[X ‖ G ] = E[ E[X ‖ F ] ‖ G ].

(vi) If 0 ≤ Xn ↗ X a.s. and X ∈ L1 (Ω, S, P), then

E[Xn ‖ F ] ↗ E[X ‖ F ] a.s.

(vii) If Xn ∈ L1 (Ω, S, P), n ∈ N, Xn ≥ 0 a.s. and lim inf Xn ∈ L1 , then

E[ lim inf Xn ‖ F ] ≤ lim inf E[Xn ‖ F ] a.s.

(viii) If Xn → X a.s. and there exists Y ∈ L1 (Ω, S, P) such that |Xn | ≤ Y a.s., then

E[Xn ‖ F ] → E[X ‖ F ] a.s.

(ix) If X ∈ L1 (Ω, S, P) and ϕ : R → R is a convex function such that ϕ(X) is integrable, then

ϕ( E[X ‖ F ] ) ≤ E[ ϕ(X) ‖ F ] a.s.

In particular, if we choose ϕ(x) = |x|^p , p ≥ 1, we deduce that the conditional expectation defines a map

E[ − ‖ F ] : Lp (Ω, S, P) → Lp (Ω, F, P)

that is a linear contraction, i.e.,

‖ E[X ‖ F ] ‖_{Lp} ≤ ‖X‖_{Lp} .

(x) If G is another sigma-algebra that is independent of σ(X) ∨ F, then

E[X ‖ F ∨ G ] = E[X ‖ F ].

In particular, if X ∈ L1 (Ω, S, P) is independent of G, then

E[X ‖ G ] = E[X ].

Proof. (i) Follows by choosing F = Ω in (1.4.7). (ii) Follows from the proof of Theorem 1.162.
(iii) The linearity follows from the fact that the defining condition (1.4.7) is linear in X. Now let X ∈ L1 (Ω, S, P). We have X = X ⁺ − X ⁻. Choose versions Y ± of E[X ± ‖ F ]. Then Y ± ≥ 0 and

| E[X ‖ F ] | = |Y ⁺ − Y ⁻| ≤ Y ⁺ + Y ⁻ = E[X ⁺ + X ⁻ ‖ F ] = E[ |X| ‖ F ].

Hence

‖ E[X ‖ F ] ‖_{L¹} ≤ E[ E[ |X| ‖ F ] ] = E[ |X| ] = ‖X‖_{L¹} .
(iv) Choose a version Z of E[X ‖ F ]. Let Y ∈ L∞ (Ω, F, P). We have to show that Y Z is a version of E[X Y ‖ F ], i.e.,

E[X Y I_F ] = E[Z Y I_F ], ∀F ∈ F.   (1.4.16)

Since Z is a version of E[X ‖ F ] we deduce from (1.4.8) that

E[X U ] = E[Z U ], ∀U ∈ L∞ (Ω, F, P).

In particular, taking U = Y I_F we obtain, ∀F ∈ F,

E[X (Y I_F ) ] = E[Z (Y I_F ) ] = E[Z Y I_F ].

Thus Z Y satisfies (1.4.16).


(v) Choose a version Y of E[X ‖ F ] and a version Z of E[Y ‖ G ]. We have to show that Z is also a version of E[X ‖ G ]. Let G ∈ G. Since G ∈ G ⊂ F, we have

E[X I_G ] = E[Y I_G ] = E[Z I_G ].
(vi) Choose versions Yn of E[Xn ‖ F ] and Y of E[X ‖ F ]. Note that (Yn ) is increasing. The Monotone Convergence Theorem implies that ‖X − Xn ‖_{L¹} → 0. From (iii) we deduce

‖Yn − Y ‖_{L¹} ≤ ‖Xn − X‖_{L¹} .

Proposition 1.147 implies that (Yn ) admits a subsequence that converges a.s. to Y. Since the sequence (Yn ) is increasing, we deduce that the whole sequence converges a.s. to Y.
(vii) Set

Yk := inf_{n≥k} Xn .

The sequence of random variables (Yk ) is increasing and converges a.s. to X := lim inf Xn . We deduce from (vi) that

E[Yk ‖ F ] ↗ E[X ‖ F ].

Note that since Yk ≤ Xn , ∀n ≥ k, we have

E[Yk ‖ F ] ≤ Zk := inf_{n≥k} E[Xn ‖ F ],

so

E[X ‖ F ] = lim_k E[Yk ‖ F ] ≤ lim_k Zk = lim inf E[Xn ‖ F ].

(viii) Set Yn := Xn + Y. Then Yn ≥ 0 and Yn → X + Y a.s. We deduce from (vii) that

E[X ‖ F ] + E[Y ‖ F ] ≤ lim inf E[Xn ‖ F ] + E[Y ‖ F ],

i.e.,

E[X ‖ F ] ≤ lim inf E[Xn ‖ F ].

Similarly, we set Zn := Y − Xn . Then Zn ≥ 0 and Zn → Y − X a.s. Applying (vii) to Zn we deduce

lim sup E[Xn ‖ F ] ≤ E[X ‖ F ].

(ix) We need to use a less familiar property of convex functions, [4, Thm. 6.3.4]. More precisely, there exist sequences of real numbers (an )n∈N and (bn )n∈N such that

ϕ(x) = sup_{n∈N} (an x + bn ), ∀x ∈ R.

Set ℓn (x) := an x + bn .⁷ Clearly

ℓn ( E[X ‖ F ] ) = E[ ℓn (X) ‖ F ] ≤ E[ ϕ(X) ‖ F ].

Hence

ϕ( E[X ‖ F ] ) = sup_{n∈N} ℓn ( E[X ‖ F ] ) = sup_{n∈N} E[ ℓn (X) ‖ F ] ≤ E[ ϕ(X) ‖ F ].

(x) Let G ∈ G and F ∈ F. Then the random variables I_G and X I_F are independent, so

E[X I_{F ∩G} ] = E[X I_F I_G ] = E[X I_F ] P[G ].

If Y is a version of E[X ‖ F ], then Y is F-measurable and thus independent of G, so

E[Y I_{F ∩G} ] = E[Y I_F I_G ] = E[Y I_F ] P[G ] = E[X I_F ] P[G ] = E[X I_{F ∩G} ], ∀F ∈ F, G ∈ G.

Since the collection

{ F ∩ G; F ∈ F, G ∈ G }

is a π-system generating F ∨ G, we deduce from Dynkin’s (π − λ) theorem that

E[Y I_S ] = E[X I_S ], ∀S ∈ F ∨ G,

so that E[X ‖ F ∨ G ] = Y, i.e., E[X ‖ F ] = E[X ‖ F ∨ G ].   ⊔⊓
⁷ When ϕ is C¹ the family (ℓn ) coincides with the family (ℓq )q∈Q , ℓq (x) = ϕ′(q)(x − q) + ϕ(q).

1.4.2 Some applications of conditioning

To give the reader a taste of the power of conditional expectation we describe some nontrivial and less advertised applications.

Example 1.167. Suppose that a player rolls a die an indefinite number of times. More formally, we are given a sequence of independent random variables (Xn )n∈N , uniformly distributed on I6 := {1, 2, . . . , 6}.
For k ∈ N, we say that a k-run occurred at time n if n ≥ k and

Xn = Xn−1 = · · · = Xn−k+1 = 6.

We set

R = Rk := { n; a k-run occurred at time n } ⊂ N, T = Tk := inf Rk ∈ N ∪ {∞},

where inf ∅ := ∞. Thus T is the moment when the first k-run occurs. We want to show that E[T ] < ∞.
Note that for each n ∈ N the event {T ≤ n} belongs to the sigma algebra Fn generated by X1 , . . . , Xn . The explanation is simple: if we know the results of the first n rolls of the die we can decide whether a k-run has occurred. Consider the conditional probability

P[ {T ≤ n + k} ‖ Fn ] = E[ I_{T ≤n+k} ‖ Fn ].

This conditional probability is a random variable. Since the sigma-algebra Fn is defined by the partition

S_{i1 ,...,in} := {X1 = i1 , . . . , Xn = in }, ij ∈ {1, . . . , 6},

we see that P[T ≤ n + k ‖ Fn ] has the form

P[T ≤ n + k ‖ Fn ] = Σ_{i1 ,...,in =1}^{6} p_{i1 ,...,in |k} I_{S_{i1 ,...,in}} ,

where

p_{i1 ,...,in |k} = P[T ≤ n + k | X1 = i1 , . . . , Xn = in ].
Note that, irrespective of the ij -s, we have

p_{i1 ,...,in |k} ≥ 1/6^k =: r.

Hence

P[T ≤ n + k ‖ Fn ] ≥ r, ∀n.

In particular,

P[T > n + k ‖ Fn ] ≤ 1 − r < 1, ∀n ∈ N.
Now observe that for any n ∈ N, ℓ ∈ N0 we have {T > n + ℓk} ∈ F_{n+ℓk} . Hence

P[T > n + (ℓ + 1)k ] = E[ I_{T >n+(ℓ+1)k} I_{T >n+ℓk} ]

= E[ I_{T >n+ℓk} E[ I_{T >n+(ℓ+1)k} ‖ F_{n+ℓk} ] ]
≤ (1 − r) E[ I_{T >n+ℓk} ] = (1 − r) P[T > n + ℓk ].
Iterating, we deduce that for any i ∈ {1, . . . , k} and any ℓ ∈ N we have

P[T > i + ℓk ] < (1 − r)^ℓ P[T > i ] ≤ (1 − r)^ℓ .

Now observe that

E[T ] = Σ_{n∈N0} P[T > n ] = Σ_{i=1}^{k} Σ_{ℓ∈N0} P[T > i + ℓk ] < Σ_{i=1}^{k} Σ_{ℓ∈N0} (1 − r)^ℓ = k/r < ∞.
 
This proves that E[T ] is finite. In Example 3.31 we will use martingale techniques to show that

E[T ] = (6^{k+1} − 6)/5.   ⊔⊓
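A quick Monte Carlo experiment (ours, not part of the text) is consistent with the exact value E[T ] = (6^{k+1} − 6)/5 announced above; for k = 2 this value is 42.

```python
import random

def first_run_time(k, rng):
    """Roll a fair die until k consecutive sixes occur; return that time."""
    run, t = 0, 0
    while run < k:
        t += 1
        run = run + 1 if rng.randrange(1, 7) == 6 else 0
    return t

rng = random.Random(0)          # fixed seed for reproducibility
trials, k = 20_000, 2
est = sum(first_run_time(k, rng) for _ in range(trials)) / trials
exact = (6 ** (k + 1) - 6) / 5  # = 42.0 for k = 2
print(est, exact)               # the estimate should be close to 42
```

With 20 000 trials the standard error is well below 1, so the empirical mean lands within a fraction of a unit of 42.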

Example 1.168 (Optimal stopping with finite horizon). Let us consider the following abstract situation. Suppose we are given N random variables

X1 , . . . , XN ∈ L0 (Ω, S, P).

For n ∈ IN := {1, 2, . . . , N } we denote by Fn the sigma-algebra generated by X1 , . . . , Xn . Suppose that we are also given a sequence of rewards

Rn ∈ L1 (Ω, Fn , P), n ∈ IN .

A stopping time is a random variable T : (Ω, S, P) → IN such that {T ≤ n} ∈ Fn , ∀n ∈ IN . Equivalently, T is a stopping time if and only if {T = n} ∈ Fn , ∀n. Note that if T is a stopping time, then {T ≥ n} = Ω \ {T ≤ n − 1} ∈ Fn−1 .
One should think of the collection X1 , . . . , XN as a finite stream of random quantities flowing in time, one quantity per unit of time. The reward Rn depends only on the observed values X1 , . . . , Xn , i.e., Rn = Rn (X1 , . . . , Xn ). A stopping time describes a decision when to stop the stream based only on the information accumulated up to the decision moment. After we observe the first quantity X1 , we can decide whether T = 1. If this is not the case, we observe a second quantity and, using the information about X1 and X2 , we can decide whether to stop, i.e., whether T = 2, or not. We continue until we either observe all the random quantities or reach the first n such that T = n.
We set

R_T := Σ_{n∈IN} Rn I_{T =n} .

In other words, R_T is the reward at the random stopping time T. We denote by T the collection of all possible stopping times. Note that

E[ |R_T | ] ≤ Σ_{n=1}^{N} E[ |Rn | ] < ∞, ∀T ∈ T.

We want to show that there exists T∗ ∈ T such that

E[R_{T∗} ] = r := sup_{T ∈T} E[R_T ].

Such a T∗ is called an optimal stopping time. To prove the existence of an optimal time we establish a Fermat-like optimality condition that the optimal stopping times satisfy. We follow [28, Chap. 3].
For n ∈ IN we set

Tn := { T ∈ T; T ≥ n }.

Note that

T = T1 ⊃ T2 ⊃ · · · ⊃ TN .

A stopping time T belongs to Tn if and only if the decision to stop comes only after we have observed the first n random variables in the stream, X1 , . . . , Xn .
We will detect an optimal stopping strategy using a process of “successive approximations”. The first approximation is the simplest strategy: pick the reward only at the end, after we have observed all N variables in the stream. In this case the reward is YN = RN . This may not give us the largest expected reward because some of the up-stream rewards could have been higher. We tweak this strategy a bit to produce a better outcome.
We wait to observe the first N − 1 variables in the stream, and then decide what to do. At this moment our reward is RN −1 . To decide what to do next we compare this reward with the expected reward RN given that we observed X1 , . . . , XN −1 , i.e., with the conditional expectation E[YN ‖ FN −1 ] = E[RN ‖ FN −1 ]. This is an FN −1 -measurable quantity, i.e., a quantity that is computable from the knowledge of X1 , . . . , XN −1 .
If the reward RN −1 that we have in our hands is bigger than what we expect to gain given our current information, we choose it and we stop. If not, we wait one more step to stop. More formally, we stop after N − 1 steps if RN −1 ≥ E[RN ‖ FN −1 ] and we continue one more step otherwise. The decision is thus based on the random variable YN −1 = max{ RN −1 , E[YN ‖ FN −1 ] }.


This heuristic suggests the following backwards induction:

YN := RN , Yn := max{ Rn , E[Yn+1 ‖ Fn ] },   (1.4.17)

Tn := min{ i ≥ n; Ri ≥ Yi } = min{ i ≥ n; Ri = Yi }.

Note that Tn ≥ n and, for any k ≥ n,

{Tn > k} = ∩_{i=n}^{k} { Ri < E[Yi+1 ‖ Fi ] } ∈ Fk .

Hence Tn ∈ Tn . We claim that for any n = 1, . . . , N we have

Yn ≥ E[R_T ‖ Fn ], ∀T ∈ Tn ,   (1.4.18a)

E[R_{Tn} ‖ Fn ] = Yn .   (1.4.18b)

Hence

E[R_{Tn} ‖ Fn ] = Yn ≥ E[R_T ‖ Fn ], ∀T ∈ Tn .

By taking expectations we deduce

E[R_{Tn} ] = sup_{T ∈Tn} E[R_T ].   (1.4.19)

This shows that the stopping time T1 is optimal.
The optimal stopping strategy T1 has a natural description: stop at the first moment when the reward at hand is bigger than the expected future reward, given the information we have at that moment. The stopping strategy Tn is similar, but delayed for n units of time.
We will prove (1.4.18a) and (1.4.18b) by backwards induction on n.
The inequality (1.4.18a) is clearly true for n = N . Assume it is true for n. Let T ∈ Tn−1 and set T ′ := max{T, n}. Then T ′ ∈ Tn . For A ∈ Fn−1 we have

∫_A R_T = ∫_{A∩{T =n−1}} Rn−1 + ∫_{A∩{T ≥n}} R_{T ′}

({T ≥ n} ∈ Fn−1 )

= ∫_{A∩{T =n−1}} Rn−1 + ∫_{A∩{T ≥n}} E[R_{T ′} ‖ Fn−1 ]

= ∫_{A∩{T =n−1}} Rn−1 + ∫_{A∩{T ≥n}} E[ E[R_{T ′} ‖ Fn ] ‖ Fn−1 ]

(use the induction assumption E[R_{T ′} ‖ Fn ] ≤ Yn )

≤ ∫_{A∩{T =n−1}} Rn−1 + ∫_{A∩{T ≥n}} E[Yn ‖ Fn−1 ] ≤ ∫_A Yn−1 ,

where in the last step we used Rn−1 ≤ Yn−1 and E[Yn ‖ Fn−1 ] ≤ Yn−1 . This proves the inequality (1.4.18a).
To prove the equality (1.4.18b), we run the above argument with T = Tn−1 . Observe that in this case

Un := {T = n − 1} = { Rn−1 ≥ E[Yn ‖ Fn−1 ] } = { Yn−1 = Rn−1 },   (1.4.20a)

Vn := {Tn−1 > n − 1} = { Rn−1 < E[Yn ‖ Fn−1 ] } = { Yn−1 = E[Yn ‖ Fn−1 ] }.   (1.4.20b)

We have Tn−1 = n − 1 on Un and Tn−1 = Tn on Vn so that

∫_A R_{Tn−1} = ∫_{A∩Un} Rn−1 + ∫_{A∩Vn} R_{Tn}

(Vn ∈ Fn−1 )

= ∫_{A∩Un} Rn−1 + ∫_{A∩Vn} E[ E[R_{Tn} ‖ Fn ] ‖ Fn−1 ]

(Yn = E[R_{Tn} ‖ Fn ] by induction)

= ∫_{A∩Un} Rn−1 + ∫_{A∩Vn} E[Yn ‖ Fn−1 ]

(use (1.4.20a) and (1.4.20b))

= ∫_A max{ Rn−1 , E[Yn ‖ Fn−1 ] } = ∫_A Yn−1 .   ⊔⊓

Remark 1.169. The procedure for determining the optimal time T1 outlined in the above example is a bit counterintuitive. The maximal expected reward is E[Y1 ]. By construction, the random variable Y1 is F1 -measurable, and thus has the form f (X1 ) for some Borel measurable function f : R → R. Thus we can determine Y1 knowing only the initial input X1 . On the other hand, the definition of Y1 by descending induction used the knowledge of the entire stream X1 , . . . , XN , not just the initial input X1 .
What is true is that we can compute the maximal expected reward without running the stream. On the other hand, the moment we stop and the actual reward when stopped are random quantities. It is conceivable that if we do not stop when T1 tells us to stop we could get a higher reward later on. However, on average, we cannot beat the stopping strategy T1 .
We will illustrate this process on the classical secretary problem.   ⊔⊓
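The backward recursion (1.4.17) becomes fully explicit when the stream consists of i.i.d. rewards. The Python sketch below (our illustration, not from the book) takes Rn = Xn with Xn i.i.d. Uniform(0, 1); independence gives E[Yn+1 ‖ Fn ] = yn+1 := E[Yn+1 ], and for a uniform variable E[max(X, y)] = (1 + y²)/2, so the values yn can be tabulated exactly:

```python
# y_n = E[Y_n] for the reward stream R_n = X_n, X_n iid Uniform(0,1).
# Independence gives E[Y_{n+1} || F_n] = y_{n+1}, so (1.4.17) becomes
# Y_n = max(X_n, y_{n+1}), and E[max(X, y)] = (1 + y*y)/2 for X uniform.
def values(N):
    ys = [0.0] * (N + 1)
    ys[N] = 0.5                     # y_N = E[X_N] = 1/2
    for n in range(N - 1, 0, -1):   # backwards induction
        ys[n] = (1.0 + ys[n + 1] ** 2) / 2.0
    return ys[1:]                   # [y_1, ..., y_N]

print(values(3))  # [0.6953125, 0.625, 0.5]
```

The sequence yn is decreasing in n: the more of the stream still lies ahead, the larger the expected optimal reward.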

Example 1.170 (The secretary problem). Suppose we have a box with N prizes with values v1 < · · · < vN . Bob would like to pick the most valuable item but he does not know the actual values vn . He is allowed to sample them successively without replacement. At the j-th draw he is told the value Vj of the j-th prize. He can either accept the j-th prize or he can decline it and ask to sample another one. A prize once declined cannot be accepted later on. We are interested in a strategy that maximizes the probability that Bob picks the most valuable prize.⁸
Consider the relative rankings

Xn := #{ j ≤ n; Vj ≥ Vn }.   (1.4.21)

Thus, Xn counts how many prizes unveiled up to the moment n are at least as valuable as the n-th prize revealed. In particular, if Xn = 1, then Vn is the largest of the observed values V1 , . . . , Vn .
We might be tempted to set the reward Rn = I_{Vn =vN } , but this is not Fn -measurable. We can fix this issue by setting

Rn := E[ I_{Vn =vN } ‖ Xn ].

Observe that for any stopping time T we have

E[R_T ] = Σ_{n=1}^{N} ∫_{T =n} Rn = Σ_{n=1}^{N} ∫_{T =n} I_{Vn =vN } = Σ_{n=1}^{N} P[Vn = vN , T = n ] = P[V_T = vN ].

We want to find a stopping time T that maximizes E[R_T ], i.e., the probability that Bob picks the biggest prize. Let us make a few remarks.
⁸ Think of N secretaries interviewing for a single job, the values v1 , . . . , vN ranking their job suitability, the higher the value the more suitable. The interviewer learns the value vk only at the time of the interview.

1. Observe that the rankings (Xn )n∈N defined in (1.4.21) are independent and

P[Xn = j ] = 1/n, ∀1 ≤ j ≤ n ≤ N.   (1.4.22)

Indeed, the random vector (V1 , . . . , VN ) can be identified with a random permutation ϕ ∈ SN of IN ,

(V1 , . . . , VN ) = (vϕ(1) , . . . , vϕ(N ) ).

The rank Xn is then a function of ϕ,

Xn (ϕ) := #{ j ≤ n; ϕ(j) ≥ ϕ(n) }.

To reach the desired conclusion observe that the map

X⃗ : SN → I1 × I2 × · · · × IN , ϕ ↦ (X1 (ϕ), . . . , XN (ϕ)),

is a bijection.⁹
2. We have

Rn = (n/N ) I_{Xn =1} .

Indeed, the conditional expectation Rn = E[ I_{Vn =vN } ‖ Xn ] is a function of xn ∈ In and we have

Rn (xn ) = E[ I_{Vn =vN } | Xn = xn ] = P[Vn = vN | Xn = xn ].

This probability is zero if xn > 1. Now observe that, since {Vn = vN } ⊂ {Xn = 1},

P[Vn = vN | Xn = 1 ] = P[Vn = vN ] / P[Xn = 1 ] = (1/N )/(1/n) = n/N.
 
Following (1.4.17) and (1.4.18a) we set yn := E[Yn ]. The quantity yn is the probability of Bob obtaining the largest prize among the strategies that discard the first (n − 1) selected prizes. We have

YN = RN = I_{VN =vN } , yN = 1/N.

Since {VN = vN } = {XN = 1} is independent of FN −1 we deduce from (1.4.22)

E[ I_{VN =vN } ‖ FN −1 ] = E[ I_{VN =vN } ] = 1/N = yN ,

YN −1 = max{ RN −1 , E[ I_{VN =vN } ‖ FN −1 ] } = max{ RN −1 , yN } = ((N − 1)/N ) I_{XN −1 =1} + (1/N ) I_{XN −1 >1} ,

yN −1 = 1/N + ((N − 2)/(N − 1)) yN .
⁹ From the equality ϕ⁻¹ (N ) = max{ j; Xj (ϕ) = 1 } we deduce inductively that X⃗ is injective. It is also surjective since SN and ∏_{n=1}^{N} In have the same cardinality.

Similarly

E[YN −1 ‖ FN −2 ] = E[YN −1 ] = yN −1 ,

YN −2 = max{ RN −2 , yN −1 } = max{ (N − 2)/N, yN −1 } I_{XN −2 =1} + yN −1 I_{XN −2 >1} ,

yN −2 = (1/(N − 2)) max{ (N − 2)/N, yN −1 } + ((N − 3)/(N − 2)) yN −1 ,

where we used (1.4.22). Iterating we deduce

Yn = max{ Rn , yn+1 } = max{ n/N, yn+1 } I_{Xn =1} + yn+1 I_{Xn >1} ,

yn = (1/n) max{ n/N, yn+1 } + ((n − 1)/n) yn+1 .
While it is difficult to find an explicit formula for yn , the above equalities can be easily implemented on a computer. The optimal probability is pN = y1 . Here is a less than optimal but simple R code that computes y1 given N.

optimal <- function(N) {
  p <- 1/N
  m <- N - 1
  for (i in 1:m) {
    p <- max((N - i)/N, p)/(N - i) + ((N - i - 1)/(N - i))*p
  }
  p
}

Here are some results. Below, pN denotes the optimal probability of choosing the largest among N prizes.

N    3    4      5      6       8       100     200
pN   0.5  0.458  0.433  0.4277  0.4098  0.3710  0.3694
Note that yn+1 ≤ yn , with equality precisely when yn+1 ≥ n/N. We deduce that

yn+1 ≥ n/N ⇒ yn+1 = yn = · · · = y1 .

We set

N∗ := max{ n; yn ≥ (n − 1)/N },

so yN∗ +1 < yN∗ = yN∗ −1 = · · · = y1 . The optimal strategy is given by the stopping time TN∗ : reject the first N∗ − 1 selected prizes and then pick the first prize that is more valuable than any of the preceding ones.

N    3  4  8  10  50  100  1000
N∗   3  3  5  5   20  39   370

For example, for N = 10 we have

n    1      2      3      4      5      6      7     8     9     10
yn   0.398  0.398  0.398  0.398  0.398  0.372  0.32  0.26  0.18  0.1

In this case N∗ = 5 and the optimal strategy corresponds to the stopping time T5 : reject the first four prizes and then accept the first prize more valuable than any of the previously seen ones. In this case the probability of choosing the most valuable prize is p10 ≈ 0.398.
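The value p10 ≈ 0.398 can be corroborated by simulation. The following Python sketch (ours, not part of the text) plays the strategy T5 , reject the first four prizes and then accept the first record, on random permutations:

```python
import random

random.seed(42)

def secretary_trial(N, cutoff, rng):
    """One round: reject the first cutoff-1 prizes, then take the first record."""
    perm = list(range(1, N + 1))   # prize values; N is the most valuable
    rng.shuffle(perm)
    best_seen = 0
    for n, v in enumerate(perm, start=1):
        if v > best_seen:
            if n >= cutoff:
                return v == N      # accept prize n; win iff it is the best
            best_seen = v
    return False                   # the best prize was among the rejected ones

N, cutoff, trials = 10, 5, 20_000
wins = sum(secretary_trial(N, cutoff, random) for _ in range(trials))
p = wins / trials
print(p)   # should be close to p10 = 0.398
```

With 20 000 rounds the standard error is about 0.0035, so the empirical frequency lands within roughly 0.01 of the theoretical value.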
Let us sketch what happens as N → ∞. Consider the sequence z^N := (zn )1≤n≤N +1 defined by the backwards induction

zN +1 = 0, zn = ((n − 1)/n) zn+1 + 1/N, 1 ≤ n ≤ N.

One can show by backwards induction that zn ≤ yn , ∀n ≤ N, and zn = yn , ∀n ≥ N∗ .
Denote by fN : [0, 1] → R the continuous function that is linear on each of the intervals [(i − 1)/N, i/N ] and such that

fN (i/N ) = zN +1−i , i = 0, 1, . . . , N.

Note that

fN ((i + 1)/N ) − fN (i/N ) = zN −i − zN −i+1 = 1/N − zN −i+1 /(N − i) = (1/N ) ( 1 − (1/(1 − i/N )) fN (i/N ) ).

We recognize here the Euler scheme for the initial value problem

f ′ = 1 − (1/(1 − t)) f, f (0) = 0,   (1.4.23)

corresponding to the subdivision {i/N } of [0, 1].
The unique solution of this equation is f (t) = −(1 − t) log(1 − t) and fN (t) converges to f (t) uniformly on the compacts of [0, 1). In fact (see [25, Sec. 212]), for every T ∈ (0, 1) there exists C = CT > 0 such that

sup_{t∈[0,T ]} |fN (t) − f (t)| ≤ CT /N.

Set gN (t) := fN (1 − t); see Figure 1.5. Note that zn = zn^N = gN ((n − 1)/N ), n = 1, . . . , N + 1. We deduce that if n/N → τ ∈ (0, 1] as N → ∞, then

zn^N → g(τ ) := −τ log τ, (N/n) zn^N → − log τ.

From the equality

N (zn − zn+1 ) = 1 − (N/n) zn+1 , ∀1 ≤ n ≤ N,

Fig. 1.5 The graph of g100 .

we deduce that

lim_{n/N →τ} N (zn − zn+1 ) = 1 + log τ, which is < 0 for τ < 1/e and > 0 for τ > 1/e.

This implies that, as N → ∞,

N∗ /N → 1/e ≈ 0.368, yN∗ = zN∗ → 1/e.

For details we refer to [28, Sec. 3.3] or [69].
As explained in [69], a (nearly) optimal strategy is as follows. Denote by m the largest integer satisfying

(N − 1/2)/e + 1/2 ≤ m ≤ (N − 1/2)/e + 3/2.

Reject the first m prizes and accept the next prize more valuable than any of the preceding ones.   ⊔⊓

1.4.3 Conditional independence


Suppose that (Ω, S, P) is a probability space.

Definition 1.171. Fix a sigma-subalgebra G of S. The family (Fi )i∈I of sigma-subalgebras of S is said to be conditionally independent given G if, for any finite subset J ⊂ I and any events Fj ∈ Fj , j ∈ J, we have

E[ ∏_{j∈J} I_{Fj} ‖ G ] = ∏_{j∈J} E[ I_{Fj} ‖ G ] a.s.

Given sigma algebras F, G, H ⊂ S we use the notation F ⊥⊥_G H to indicate that F is conditionally independent of H given G.   ⊔⊓

The next proposition generalizes the result in Exercise 1.8.

Proposition 1.172 (Doob–Markov). Given sigma algebras F± , F0 ⊂ S the following are equivalent.

(i) E[X+ ‖ F− ∨ F0 ] = E[X+ ‖ F0 ] a.s., ∀X+ ∈ L1 (Ω, F+ , P).
(ii) F+ ⊥⊥_{F0} F− .

Proof. The condition (i) is equivalent to

E[X X+ ] = E[ X E[X+ ‖ F0 ] ], ∀X ∈ L∞ (Ω, F0 ∨ F− , P).   (1.4.24)

The condition (ii) is equivalent to

E[X+ X− ‖ F0 ] = E[X+ ‖ F0 ] E[X− ‖ F0 ], ∀X± ∈ L∞ (Ω, F± , P).

Note that since E[X+ ‖ F0 ] is an F0 -measurable random variable we have

E[X+ ‖ F0 ] E[X− ‖ F0 ] = E[ X− E[X+ ‖ F0 ] ‖ F0 ].

Thus, (ii) is equivalent to

E[X+ X− ‖ F0 ] = E[ X− E[X+ ‖ F0 ] ‖ F0 ],

i.e., for any nonnegative, bounded, F0 -measurable random variable X0 we have

E[X0 X− X+ ] = E[ X0 X− E[X+ ‖ F0 ] ].

Since F0 ∨ F− coincides with the sigma-algebra generated by the collection of random variables X0 X− , X0 ∈ L∞ (Ω, F0 , P), X− ∈ L∞ (Ω, F− , P), we deduce that the last equality is equivalent to (1.4.24), i.e., (i) is equivalent to (ii).   ⊔⊓

Remark 1.173. You should think of a system evolving in time. Then F0 collects the present information about the system, F− collects the past information and F+ collects the future information. Roughly speaking, the above proposition shows that the prediction of a future event given the present and the past coincides with the prediction given the present alone if and only if the future is independent of the past given the present.   ⊔⊓

1.4.4 Kernels and regular conditional distributions


Suppose that (Ω0 , F0 ) and (Ω1 , F1 ) are two measurable spaces. A kernel from (Ω0 , F0 ) to (Ω1 , F1 ) is a function

K : Ω0 × F1 → [0, ∞], (ω0 , F1 ) ↦ K_{ω0} [F1 ],

with the following properties.

(K1 ) For each ω0 ∈ Ω0 , the map

F1 ∋ F1 ↦ K_{ω0} [F1 ] ∈ [0, ∞]

is a measure. We will denote this measure by K_{ω0} [dω1 ].
(K2 ) For each F1 ∈ F1 , the function

Ω0 ∋ ω0 ↦ K_{ω0} [F1 ] ∈ [0, ∞]

is F0 -measurable. We will denote this random variable by K[F1 ].
July 19, 2022 15:9 ws-book961x669 An Introduction to Probability 12800-main page 112

112 An Introduction to Probability

 
The kernel K is called a probability kernel or a Markovian kernel if K_{ω0} [−] is a probability measure on (Ω1 , F1 ) for any ω0 ∈ Ω0 .
The condition (K1 ) above shows that a kernel is a family (K_{ω0} [−])_{ω0 ∈Ω0} of measures on (Ω1 , F1 ) parametrized by Ω0 . Condition (K2 ) is a measurability condition on this family. For this reason kernels are also known as random measures.

Example 1.174. Consider the Bernoulli measure

βp := qδ0 + pδ1 ∈ Prob(R), p ∈ [0, 1], q = 1 − p.

To obtain a random measure we let p be a random quantity. More precisely, if f : (Ω, S) → [0, 1] is a measurable function, then

β_{f(ω)} = (1 − f(ω))δ0 + f(ω)δ1

defines a Markov kernel K : Ω × B_R → [0, 1],

Kω[B] = (1 − f(ω))δ0[B] + f(ω)δ1[B]. □

Given a measure µ on the measurable space (Ω, F) and a nonnegative measurable function f ∈ L0+(Ω, F) we set

⟨µ, f⟩ := µ[f] = ∫_Ω f(ω) µ[dω] ∈ [0, ∞].

Theorem 1.175. Suppose that K : Ω0 × F1 → [0, ∞] is a kernel from (Ω0, F0) to (Ω1, F1).

(i) For any f ∈ L0+(Ω1, F1) we define its pullback by K to be the function

K^*f : Ω0 → [0, ∞], K^*f(ω0) = ∫_{Ω1} f(ω1) Kω0[dω1].

Then K^*f ∈ L0+(Ω0, F0).

(ii) For any measure µ : F0 → [0, ∞] we define its push-forward by K to be the measure K_*µ : F1 → [0, ∞],

K_*µ[F1] := ∫_{Ω0} Kω0[F1] µ[dω0] ∈ [0, ∞], F1 ∈ F1.

Then K_*µ is a σ-additive measure on (Ω1, F1).

(iii) The pullback and push-forward by K are adjoints of each other. More precisely, for any measure µ on (Ω0, F0) and any measurable function f ∈ L0+(Ω1, F1) we have

⟨µ, K^*f⟩ = ⟨K_*µ, f⟩. (1.4.25)

Proof. (i) For any F ∈ F1 we have K^*I_F(ω0) = Kω0[F], so K^*I_F ∈ L0+(Ω0, F0). Clearly the correspondence f ↦ K^*f is monotone and the conclusion follows from the fact that a nonnegative function is measurable iff it is the limit of an increasing sequence of simple functions.

The statement (ii) follows from the Monotone Convergence theorem and (K1).
For part (iii), fix the measure µ. Observe that for F ∈ F1 we have

⟨µ, K^*I_F⟩ = ∫_{Ω0} K^*I_F(ω0) µ[dω0] = ∫_{Ω0} Kω0[F] µ[dω0] = K_*µ[F] = ⟨K_*µ, I_F⟩.

Thus (1.4.25) holds for f = I_F, F ∈ F1. The general case follows by invoking the Monotone Class Theorem. □

When K is a Markovian kernel and µ is a probability measure, the push-forward K_*µ is also a probability measure. For any F1 ∈ F1, the number K_*µ[F1] is the expectation of the random variable ω0 ↦ Kω0[F1] with respect to µ. The measure K_*µ is said to be a mixture of the random measures ω0 ↦ Kω0[−] driven by µ.
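This two-stage description can be checked numerically. The sketch below samples the mixture K_*µ built from the Bernoulli kernel of Example 1.174, with the illustrative assumption that the random parameter p is uniformly distributed on [0, 1]: first draw ω0 ~ µ, then draw from Kω0[−]. Since K_*µ[{1}] = E_µ[p] = 1/2, the mixture is Ber(1/2).

```python
import random

# Two-stage sampling from the mixture K_*mu: first draw omega0 ~ mu
# (here p ~ Uniform[0,1], an illustrative choice), then draw from the
# Bernoulli measure K_p = (1 - p) delta_0 + p delta_1.
rng = random.Random(0)

def sample_mixture():
    p = rng.random()                       # omega0 ~ mu
    return 1 if rng.random() < p else 0    # then sample K_p[-]

trials = 200_000
freq = sum(sample_mixture() for _ in range(trials)) / trials
# The mixture assigns mass E_mu[p] = 1/2 to {1}.
assert abs(freq - 0.5) < 0.01
```

The assertion is a Monte Carlo check of the identity K_*µ[F] = ∫ Kω0[F] µ[dω0] in this special case.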
Example 1.176. (a) Suppose that (Ω0, F0), (Ω1, F1) are two measurable spaces and

T : (Ω0, F0) → (Ω1, F1)

is a measurable map. Then T defines a kernel

K^T : Ω0 × F1 → [0, 1], K^T_{ω0}[F1] = δ_{T(ω0)}[F1],

where δ_{ω1} denotes the Dirac measure on (Ω1, F1) concentrated at ω1; see Example 1.31(a). Observe that for any measure µ on F0 and any f ∈ L0+(Ω1, F1) we have

K^T_* µ = T_# µ, (K^T)^* f = T^* f := f ◦ T.

Thus, (1.4.25) contains as a special case the change of variables formula (1.2.21).

(b) Consider the random measure K : Ω × B_R → [0, 1] in Example 1.174. Given a probability measure µ on (Ω, S) we have

K_*µ = Ber(f̄) = (1 − f̄)δ0 + f̄ δ1, f̄ := E_µ[f].

(c) Suppose that X is a finite or countable set. A kernel from (X, 2^X) to (X, 2^X) is defined by a function (matrix) K : X × X → [0, ∞] via the equality

Kx[S] = Σ_{s∈S} K(x, s), ∀x ∈ X, S ⊂ X.

The kernel is Markovian if

Σ_{x′∈X} K(x, x′) = 1, ∀x ∈ X.

(d) Suppose that f : R² → [0, ∞) is an integrable function such that

∫_R f(x, y) dy = 1, ∀x ∈ R.

It defines a Markovian kernel

K : R × B_R → [0, 1], Kx[B] = ∫_B f(x, y) dy,

for all x ∈ R and any Borel subset B ⊂ R. We can rewrite this as Kx[dy] = f(x, y) dy. □
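In the finite case (c) a Markovian kernel is just a row-stochastic matrix, and the push-forward K_*µ of a probability vector µ is a vector-matrix product. A minimal sketch, with made-up matrix entries chosen only for illustration:

```python
# A Markovian kernel on X = {0, 1, 2} as a row-stochastic matrix K;
# the entries below are illustrative, not taken from the text.
K = [[0.5, 0.5, 0.0],
     [0.1, 0.6, 0.3],
     [0.0, 0.2, 0.8]]
assert all(abs(sum(row) - 1.0) < 1e-12 for row in K)  # Markovian condition

# Push-forward K_*mu[{y}] = sum_x mu({x}) K(x, y); with mu = delta_0
# this recovers the measure K_0[-], i.e., the first row of the matrix.
mu = [1.0, 0.0, 0.0]
push = [sum(mu[x] * K[x][y] for x in range(3)) for y in range(3)]
assert push == K[0]
```

The same one-line product computes K_*µ for any initial probability vector µ on X.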

Suppose that (Ω, F, P) is a probability space and S ⊂ F is a sigma sub-algebra. For every event F ∈ F the random variable

P[F ∥ S] := E[I_F ∥ S]

is called the conditional probability of F given S. The random variable P[F ∥ S] is unique only up to equality off a negligible set.


Note that for any increasing family (Fn )n≥1 ⊂ F there exists a negligible set
N ⊂ Ω such that
lim P Fn k S (ω) = P lim Fn k S (ω), ∀ω ∈ Ω \ N.
   
n n

A priori, the negligible set N depends on the family (Fn )n≥1 , and there might not
exist one negligible set that works for all such increasing families. When such a
thing is possible we say that the conditional probability P − k S admits a regular


version. Here is the precise definition.

Definition 1.177. Let (Ω, F, P) be a probability space and S ⊂ F a sigma subalgebra. A regular version of P[− ∥ S] is a probability or Markovian kernel Q from (Ω, S) to (Ω, F),

Q : Ω × F → [0, 1],

such that, for any F ∈ F, the random variable Ω ∋ ω ↦ Qω[F] is a version of P[F ∥ S]. In other words,

• the map ω ↦ Qω[F] is S-measurable and
• for any S ∈ S, F ∈ F we have

P[S ∩ F] = ∫_S Qω[F] P[dω]. □

Proposition 1.178. If

Q : Ω × F → [0, 1], (ω, F) ↦ Qω[F]

is a regular version of P[− ∥ S], then

∀X ∈ L¹(Ω, F, P), E[X ∥ S](ω) = ∫ X(η) Qω[dη] = Q^*X(ω) a.s. (1.4.26)

Proof. Note that (1.4.26) holds in the special case X = I_F because

Q^*I_F(ω) = Qω[F] = P[F ∥ S](ω) = E[I_F ∥ S](ω).

The general case follows from the Monotone Class theorem. □

The equality (1.4.26) can be written in the less precise but more intuitive way

E[X ∥ S] = ∫ X(η) P[dη ∥ S]. (1.4.27)


More generally, consider a measurable map T : (Ω̃, F̃) → (Ω, F). Let P̃ be a probability measure on (Ω̃, F̃) and suppose that S̃ ⊂ F̃ is a sigma subalgebra. For every F ∈ F we set

P_T[F ∥ S̃] := P̃[T ∈ F ∥ S̃] = E_P̃[T^*I_F ∥ S̃] = E_P̃[I_{T⁻¹(F)} ∥ S̃]. (1.4.28)

We will refer to P_T[− ∥ S̃] as the conditional distribution of T given S̃. Observe that when

(Ω̃, F̃) = (Ω, F), P̃ = P and T = 1_Ω,

then

P_{1_Ω}[− ∥ S̃] = P[− ∥ S̃].
Note that for any increasing family (Fn)n≥1 ⊂ F we have

lim_{n→∞} P_T[Fn ∥ S̃] = P_T[lim_n Fn ∥ S̃] a.s.

We say that P_T[− ∥ S̃] admits a regular version if we can choose representatives for each P_T[F ∥ S̃], F ∈ F, so that the above equality holds for any increasing sequence (Fn). Here is a more precise definition.
Definition 1.179. Let (Ω̃, F̃, P̃) be a probability space and T : (Ω̃, F̃) → (Ω, F) be a measurable map. Fix a sigma-subalgebra S̃ ⊂ F̃. A regular version of the conditional probability distribution P_T[− ∥ S̃] of the map T conditioned on S̃ is a probability kernel Q from (Ω̃, S̃) to (Ω, F) such that, for any F ∈ F, the random variable Q[F] is a version of P_T[F ∥ S̃]. In other words,

• the random variable Q[F] is S̃-measurable and
• for any S ∈ S̃, F ∈ F we have

P̃[S ∩ T⁻¹(F)] = ∫_S Qω̃[F] P̃[dω̃]. (1.4.29)

□

A conditional probability distribution need not admit a regular version. For that to happen we need to impose conditions on F, the sigma-algebra on the target space. We need to make a brief topological digression.

Definition 1.180. A Lusin space is a topological space homeomorphic to a Borel subset of a compact metric space. □

Remark 1.181. (a) The above is not the usual definition of a Lusin space, but it has the advantage that it emphasizes the compactness feature we need in the proof of Kolmogorov's existence theorem.
There are plenty of Lusin spaces. In fact, a topological space that is not Lusin is rather unusual. We refer to [15; 39] for a more in-depth presentation of these spaces and their applications in measure theory and probability. To give the reader a taste of the fauna of Lusin spaces we list a few examples.

• The Euclidean spaces Rⁿ are Lusin spaces.
• A Borel subset of a Lusin space is also a Lusin space.
• The Cartesian product of two Lusin spaces is a Lusin space.
• A less obvious example is that of Polish spaces, i.e., complete separable metric spaces. More precisely, every Polish space is homeomorphic to a countable intersection of open subsets of [0, 1]^N; see [18, Chap. IX, Sec. 6.1, Corollary 1].
• A space is Lusin if and only if it is homeomorphic to a Borel subset of a Polish space.

(b) From a measure-theoretic point of view the Lusin spaces are indistinguishable from the Polish spaces. More precisely, for any Lusin space X there exists a Polish space Y and a Borel measurable bijection Φ : X → Y such that the inverse is also Borel measurable; see [35, Prop. 8.6.13].
The Polish spaces have another important property. More precisely, a Polish space equipped with the σ-algebra of Borel subsets is isomorphic as a measurable space to a Borel subset E of [0, 1] equipped with the σ-algebra of Borel subsets. For a proof we refer to [126, Sec. I.2]. Moreover, two Borel subsets of R are measurably isomorphic if and only if they have the same cardinality, [126, Ch. I, Thm. 2.12].
On the other hand, it is known that the continuum hypothesis holds for the Borel subsets of a Polish space; see [39, Appendix III.80] or [98, XII.6]. In particular, any Borel subset of R is either finite, countable or has the continuum cardinality. This shows that any Borel subset of [0, 1] is measurably isomorphic to a compact subset of [0, 1]. Hence any Lusin space is Borel isomorphic to a compact metric space!
The measurable spaces isomorphic to a Borel subset E of [0, 1] equipped with the σ-algebra of Borel subsets are called standard measurable spaces and play an important role in probability. Hence the Lusin spaces are standard measurable spaces. □

We have the following general existence result.

Theorem 1.182 (Existence of regular conditional probabilities). Suppose that

• (Ω, F, P) is a probability space,
• Y is a Lusin space and
• B_Y is the sigma algebra of Borel subsets of Y.

Then, for every measurable map Y : (Ω, F) → (Y, B_Y) and every σ-subalgebra S ⊂ F, there exists a regular version

Q : Ω × B_Y → [0, 1], (ω, B) ↦ Qω[B]

of the conditional distribution P_Y[− ∥ S]. This means that

Q[B] = P[Y ∈ B ∥ S] a.s., ∀B ∈ B_Y.

Moreover, for any measurable function f : (Y, B_Y) → R we have

E[f ◦ Y ∥ S](ω) = ∫_Y f(y) Qω[dy], ∀ω ∈ Ω. (1.4.30)

Idea of proof. For a complete proof we refer to [33, Th. IV.2.10], [39, III.71], [40, IX.11] or [135, II.89].
We can assume that Y is a compact metric space. Fix a dense countable subset U ⊂ C(Y) such that 1 ∈ U and U is a vector space over Q. We can find representatives Φ(u) of E[u(Y) ∥ S] such that the map

U ∋ u ↦ Φ(u) ∈ L¹(Ω, S, P)

is Q-linear, Φ(1) = 1 and Φ(u) ≥ 0 if u ≥ 0. For every nonnegative f ∈ C(Y) we set

Φ∗(f) := sup{ Φ(u) ; u ∈ U, 0 ≤ u ≤ f }.

One can show that

Φ∗(f) = inf{ Φ(u) ; u ∈ U, u ≥ f }.

For arbitrary f ∈ C(Y) we set

Φ∗(f) = Φ∗(f⁺) − Φ∗(f⁻).

One can show that the resulting map

C(Y) ∋ f ↦ Φ∗(f) ∈ L¹(Ω, S, P)

is R-linear, Φ∗(1) = 1 and Φ∗(f) ≥ 0 if f ≥ 0. The Riesz Representation Theorem 1.86 implies that for any ω ∈ Ω there exists a probability measure µω : B_Y → [0, 1] such that

Φ∗(f)(ω) = ∫_Y f(y) µω[dy].

One then shows that for any B ∈ B_Y the map Ω ∋ ω ↦ µω[B] ∈ [0, 1] is S-measurable and thus it is a regular version of the conditional distribution of Y given S. □

In the special case when S is the σ-algebra generated by a measurable map X : Ω → X, X some measurable space, we use the notation

P_Y[dy ∥ X] := P_Y[dy ∥ σ(X)]

to denote a regular version of the conditional distribution of Y given X. This is a random Borel measure on Y.
Example 1.183. Consider the special case of Theorem 1.182 where Y = R and Y ∈ L¹(Ω, F, P). For any sigma subalgebra S ⊂ F there exists a kernel

Q : Ω × B_R → [0, 1]

such that

P[Y ≤ y ∥ S] = Q[(−∞, y]].

Moreover,

E[Y ∥ S] = ∫_R y Q[dy], P-a.s. on Ω. □

Example 1.184. Suppose that X0, Y0, X1, Y1 are random variables and T : R² → R^k is a Borel measurable map. Denote by P0 the joint probability distribution of (X0, Y0). Suppose that the joint distribution of (X1, Y1) has the form

P1[dxdy] = g(T(x, y)) P0[dxdy]

for some nonnegative measurable function g : R^k → [0, ∞).
We denote by Pi[− ∥ T] the regular conditional probability Pi[− ∥ σ(T)]. In other words, for any bounded nonnegative measurable function f : R^k → [0, ∞) and any Borel set B ⊂ R² we have Pi[B ∥ T] ∈ L0+(R², σ(T)) and

∫_{R²} I_B f(T(x, y)) Pi[dxdy] = ∫_{R²} Pi[B ∥ T] f(T(x, y)) Pi[dxdy], i = 0, 1.

Note that

∫_{R²} I_B f(T(x, y)) P1[dxdy] = ∫_{R²} I_B f(T(x, y)) g(T(x, y)) P0[dxdy]

= ∫_{R²} P0[B ∥ T] f(T(x, y)) g(T(x, y)) P0[dxdy]

= ∫_{R²} P0[B ∥ T] f(T(x, y)) P1[dxdy].

Hence

P1[B ∥ T] = P0[B ∥ T], ∀B ∈ B_{R²}.

Suppose that the distribution P0 is known and we would like to get information about the distribution of (X1, Y1) by investigating T(X0, Y0). The above equality shows that knowledge of T adds nothing to our understanding of the density g(T(x, y)) beyond what we know from (X0, Y0). □

Example 1.185. Suppose that X1, ..., Xn are independent and uniformly distributed in the interval [0, L]. Set

X(n) := max_{1≤k≤n} Xk, X(1) := min_{1≤k≤n} Xk.

Note that

P[X(n) ≤ x] = P[Xk ≤ x, ∀k = 1, ..., n] = (x/L)ⁿ,

so that the probability distribution of X(n) is

Pn[dx] = n (x^{n−1}/Lⁿ) I_{[0,L]}(x) dx.

Similarly,

P[X(1) > x] = P[Xk > x, ∀k = 1, ..., n] = (L − x)ⁿ/Lⁿ,

so the probability distribution of X(1) is

P1[dx] = n ((L − x)^{n−1}/Lⁿ) I_{[0,L]}(x) dx =: ρ1(x) dx.

Let us compute the conditional distribution P_{X(n)}[dxn ∥ X(1)]. We begin by computing the random variables

P[X(n) ≤ xn ∥ X(1)], 0 ≤ xn ≤ L.

Observe first that, for all 0 ≤ x1, xn ≤ L,

E[I_{X(n)≤xn} I_{X(1)≥x1}] = P[x1 ≤ Xk ≤ xn, ∀k = 1, ..., n] = (xn − x1)_+ⁿ/Lⁿ.

We need to find a function f(X(1)) = f_{xn}(X(1)) such that

E[f(X(1)) I_{X(1)≥x1}] = (xn − x1)_+ⁿ/Lⁿ, ∀x1,

i.e.,

∫_{[x1,L]} f(x) ρ1(x) dx = (xn − x1)_+ⁿ/Lⁿ, ∀x1.

Differentiating with respect to x1 we deduce

f(x1) ρ1(x1) = n (xn − x1)_+^{n−1}/Lⁿ.

Hence

P[X(n) ≤ xn ∥ X(1) = x1] = n (xn − x1)_+^{n−1}/(Lⁿ ρ1(x1)) = (xn − x1)_+^{n−1}/(L − x1)^{n−1}.

Thus, the conditional probability distribution of X(n) given that X(1) = x1 is

P_{X(n)}[dxn ∥ X(1) = x1] = ((n − 1)(xn − x1)_+^{n−2}/(L − x1)^{n−1}) dxn.

We define the empirical gap or sample range to be the random variable G = X(n) − X(1). To find the distribution of G we condition on X(1):

P[G ≤ g] = ∫_{[0,L]} P[X(n) ≤ x1 + g ∥ X(1) = x1] P_{X(1)}[dx1]
= ∫_{[0,L]} P[X(n) ≤ X(1) + g ∥ X(1) = x1] ρ1(x1) dx1.

Now observe that

P[X(n) ≤ X(1) + g ∥ X(1) = x1] = ∫_{[0,min(L,x1+g)]} ((n − 1)(xn − x1)_+^{n−2}/(L − x1)^{n−1}) dxn
= (g^{n−1}/(L − x1)^{n−1}) I_{[0,L−g]}(x1) + I_{[L−g,L]}(x1).

Thus

P[G ≤ g] = ∫₀^{L−g} (n g^{n−1}/Lⁿ) dx1 + ∫_{[L−g,L]} ρ1(x1) dx1 = n g^{n−1}(L − g)/Lⁿ + gⁿ/Lⁿ.

We deduce

(d/dg) P[G ≤ g] = n(n − 1)g^{n−2}/L^{n−1} − n² g^{n−1}/Lⁿ + n g^{n−1}/Lⁿ = (n(n − 1)g^{n−2}/L^{n−1})(1 − g/L).

Thus, the probability distribution of G is

P_G[dg] = (n(n − 1)g^{n−2}/L^{n−1})(1 − g/L) I_{[0,L]}(g) dg.

If L = 1 the above distribution is the Beta distribution Beta(n − 1, 2). □
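The closed formula for the law of G is easy to test by simulation. A minimal Monte Carlo sketch (the values L = 1, n = 5, g = 0.6 are illustrative choices) compares the empirical frequency of {G ≤ g} with the exact value P[G ≤ g] = n g^{n−1}(1 − g) + gⁿ:

```python
import random

# Monte Carlo check for Example 1.185 with L = 1: the sample range
# G = X_(n) - X_(1) of n i.i.d. Uniform[0,1] variables satisfies
# P[G <= g] = n g^{n-1} (1 - g) + g^n, i.e. G ~ Beta(n-1, 2).
rng = random.Random(1)
n, trials, g = 5, 200_000, 0.6

hits = 0
for _ in range(trials):
    xs = [rng.random() for _ in range(n)]
    if max(xs) - min(xs) <= g:
        hits += 1

exact = n * g**(n - 1) * (1 - g) + g**n    # = 0.33696 for these values
assert abs(hits / trials - exact) < 0.01
```

With 200,000 trials the Monte Carlo error is well below the asserted tolerance.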

1.4.5 Disintegration of measures


Suppose that (Ωi, Fi), i = 0, 1, are two measurable spaces and

K : Ω0 × F1 → [0, ∞)

is a kernel from (Ω0, F0) to (Ω1, F1). Then any measure µ0 on (Ω0, F0) defines a measure µ = µ_{K,µ0} on (Ω, F) := (Ω0 × Ω1, F0 ⊗ F1) via the equality

µ[S] = ∫_{Ω0} ( ∫_{Ω1} I_S(ω0, ω1) Kω0[dω1] ) µ0[dω0]. (1.4.31)

We say that a measure µ on (Ω0 × Ω1, F0 ⊗ F1) is disintegrated by µ0, or that µ0 disintegrates µ, if µ is of the form µ_{K,µ0} defined above. In this case K is called the disintegration kernel. Often we will use the notation

µ[dω0 dω1] = µ0[dω0] Kω0[dω1]. (1.4.32)

Observe that if K is a Markovian kernel and µ0 is a probability measure, then µ_{K,µ0} is a probability measure. In this case, for emphasis, we use the notation P_{K,µ0}.

Example 1.186. For any probability measures µi on (Ωi, Fi), i = 0, 1, the product measure µ = µ0 ⊗ µ1 is disintegrated by µ0 since

µ = P_{K,µ0}, Kω0[−] = µ1[−]. □

Suppose that the measure µ on (Ω, F) := (Ω0 × Ω1, F0 ⊗ F1) is disintegrated by µ0, µ = P_{K,µ0}. Consider the natural projections

πi : Ω → Ωi, πi(ω0, ω1) = ωi, i = 0, 1,

and set F̃0 := π0⁻¹(F0) ⊂ F := F0 ⊗ F1.

For F0 ∈ F0 we have

µ[F0 × Ω1] = ∫_{F0} ( ∫_{Ω1} Kω0[dω1] ) µ0[dω0] = µ0[F0],

since Kω0[Ω1] = 1, so that µ0 = (π0)_# µ. This shows that the measure µ0 is uniquely determined a priori by µ. We can rewrite (1.4.32) as

µ[dω0 dω1] = (π0)_# µ[dω0] Kω0[dω1]. (1.4.33)

Next, for any S = F0 × Ω1 ∈ F̃0 and any F1 ∈ F1, we have by (1.4.31)

µ[π1⁻¹(F1) ∩ S] = µ[F0 × F1] = ∫_{F0} Kω0[F1] µ0[dω0].

Thus, the disintegration kernel K is a regular version of the conditional distribution of the measurable map π1 conditioned on F̃0; see (1.4.28). Note also that if µ1 := (π1)_# µ, then, for any F1 ∈ F1, we have

µ1[F1] = µ[Ω0 × F1] = ∫_{Ω0} Kω0[F1] µ0[dω0].

In other words, µ1 = K_*µ0. Thus, µ1 is a mixture of the measures (Kω0[−])_{ω0∈Ω0} driven by µ0.
Conversely, any regular version of the conditional distribution µ[− ∥ F̃0] produces a disintegration kernel of the measure µ. Theorem 1.182 implies the next result.

Corollary 1.187. If (Ω1, F1) is isomorphic as a measurable space to a Lusin space equipped with the Borel sigma algebra, then, for any measurable space (Ω0, F0), any probability measure µ on (Ω0 × Ω1, F0 ⊗ F1) is disintegrated by µ0 = (π0)_# µ. □

Example 1.188. Consider a random 2-dimensional vector (X, Y) with joint distribution

P_{X,Y} ∈ Prob(R²).

According to Corollary 1.187, the distribution P_X of X disintegrates the joint distribution P_{X,Y}. Suppose that Kx[dy] is the associated disintegration kernel, i.e.,

P_{X,Y}[dxdy] = Kx[dy] P_X[dx].

Then, for any measurable function f : R → R such that f(Y) ∈ L¹, there exists a measurable function g : R → R such that g(X) = E[f(Y) ∥ X]. As in Remark 1.160 we denote g(x) by E[f(Y) ∥ X = x]. We can give a more explicit description of g(x) using the disintegration kernel. More precisely, we will show that

g(x) = ∫_R f(y) Kx[dy].

A Monotone Class argument shows that g is Borel measurable. For any x0 ∈ R we have

E[f(Y) I_{X≤x0}] = ∫_{R²} f(y) I_{(−∞,x0]}(x) P_{X,Y}[dxdy]

= ∫_R ( ∫_R f(y) Kx[dy] ) I_{(−∞,x0]}(x) P_X[dx] = ∫_R g(x) I_{(−∞,x0]}(x) P_X[dx]

= E[g(X) I_{X≤x0}].

Since the sets {X ≤ x0} form a π-system that generates σ(X), we deduce that

E[f(Y) I_S] = E[g(X) I_S], ∀S ∈ σ(X).

Thus

g(X) = E[f(Y) ∥ X].

We write this as

E[f(Y) ∥ X] = ∫_R f(y) K_X[dy].

Hence the conditional expectations E[f(Y) ∥ X] are determined by the kernel K that disintegrates the joint probability distribution P_{X,Y}.
In particular, if B ⊂ R is a Borel set and f = I_B, we have the law of total probability

P[Y ∈ B] = E[I_B(Y)] = ∫_R E[I_B(Y) ∥ X = x] P_X[dx],

where

E[I_B(Y) ∥ X = x] = P[Y ∈ B ∥ X = x] = ∫_B Kx[dy].

This proves that the disintegration kernel Kx[dy] is a regular conditional distribution of Y given X, i.e.,

Kx[dy] = P[Y ∈ [y, y + dy] ∥ X = x] "=" P[X ∈ [x, x + dx], Y ∈ [y, y + dy]] / P[X ∈ [x, x + dx]].

Observe that if P_{X,Y} is absolutely continuous with respect to the Lebesgue measure on R², so that

P_{X,Y}[dxdy] = p(x, y) dxdy,

then

Kx[dy] = (p(x, y)/p0(x)) dy, p0(x) = ∫_R p(x, y) dy,

where we set p(x, y)/p0(x) = 0 if p0(x) = 0. Then

E[f(Y) ∥ X = x] = ∫_R f(y) (p(x, y)/p0(x)) dy. □
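The last formula can be illustrated numerically. The sketch below uses the made-up joint density p(x, y) = x + y on [0, 1]² (a valid probability density, chosen only for illustration), for which p0(x) = x + 1/2 and E[Y ∥ X = x] = (x/2 + 1/3)/(x + 1/2) in closed form; the code approximates the defining integrals by the midpoint rule:

```python
# Conditional expectation via the density ratio of Example 1.188:
# E[f(Y) | X = x] = (integral of f(y) p(x,y) dy) / p0(x), here with the
# illustrative joint density p(x, y) = x + y on [0,1]^2 and f(y) = y.
def p(x, y):
    return x + y

def cond_exp_Y(x, m=10_000):
    # midpoint-rule approximations on [0, 1]
    ys = [(k + 0.5) / m for k in range(m)]
    num = sum(y * p(x, y) for y in ys) / m   # ~ integral of y p(x,y) dy
    den = sum(p(x, y) for y in ys) / m       # ~ p0(x) = x + 1/2
    return num / den

x = 0.25
exact = (x / 2 + 1 / 3) / (x + 0.5)          # closed form for this density
assert abs(cond_exp_Y(x) - exact) < 1e-6
```

The midpoint rule is more than accurate enough here since the integrands are quadratic polynomials in y.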

Example 1.189. Suppose that f : [0, 1] → R is a C¹-function whose graph has length L. Define a random measure

K : [0, 1] × B_R → [0, 1], Kx = δ_{f(x)}.

Let

µ0[dx] = (√(1 + |f′(x)|²)/L) · λ[dx] ∈ Prob([0, 1]).

Then the Borel probability measure P_{K,µ0} on [0, 1] × R corresponds to the integration with respect to the normalized arclength along the graph of f. □

Example 1.190. Suppose that X1, ..., Xn are independent random variables with common distribution p(x)λ[dx]. Denote by X the random vector (X1, ..., Xn). Suppose that f : Rⁿ → R is a Borel measurable function. Denote by P the distribution of the random vector (X, f(X)). This is disintegrated by the distribution µ0 := P_X of the random vector X. The disintegration kernel K is the conditional distribution of f(X) given X. We deduce that

K_{x1,...,xn}[−] = δ_{f(x1,...,xn)}.

If B0 is a Borel subset of Rⁿ and B1 is a Borel subset of R, then

P[B0 × B1] = ∫_{B0} I_{B1}(f(x1, ..., xn)) p(x1) ⋯ p(xn) dx1 ⋯ dxn.

Using a notation dear to theoretical physicists we can rewrite the above equality as

P[dx1 ⋯ dxn dy] = δ(y − f(x1, ..., xn)) p(x1) ⋯ p(xn) dx1 ⋯ dxn dy,

where δ(z) denotes the Dirac "function" on the real axis. □

1.5 What are stochastic processes?

We have already met stochastic processes, though we have not called them so. This section is meant to be a first encounter with this vast subject. We have a rather restricted goal, namely, to explain what stochastic processes are, describe a few of their basic features and, more importantly, show that stochastic processes with prescribed statistics do exist as mathematical objects.

1.5.1 Definition and examples


A stochastic process is simply a family (Xt )t∈T of random variables parametrized
by a set T . They are all defined on the same probability space (Ω, S, P). The
variables could be real valued, vector valued or we can allow them to be valued in
a measurable space (X, F), where F is a sigma-algebra of subsets of X. Frequently
X = Rn for some n but, as we will see below, it is very easy to produce more
complicated examples.

Obviously stochastic processes exist, but once we impose some restrictions on their behavior, the existence of such stochastic processes is less obvious. A classical situation much investigated in probability is that of families (Xt)t∈T of real valued random variables that are independent, identically distributed (or i.i.d. for brevity). We denote by P_X their common distribution.
A basic question arises. Given a Borel probability measure µ on R and a set T, can we find a probability space (Ω, S, P) and independent random variables

Xt : (Ω, S, P) → R, t ∈ T,

such that P_{Xt} = µ, ∀t ∈ T?
When T is finite, say T := {1, 2, ..., n}, the answer is positive. As probability space we can take

(Ω, S, P) := (Rⁿ, B_{Rⁿ}, µ^{⊗n}).

The random variables are then the coordinate functions

Xk : Rⁿ → R, Xk(x1, ..., xn) = xk, k = 1, ..., n.

Using the notation R^T instead of Rⁿ we see that we have defined a probability measure on the space of functions T → R.
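The finite case is easy to simulate: a sample from µ^{⊗n} is a tuple of n independent draws from µ, and the coordinate functions simply read off its entries. A sketch with the illustrative choice µ = Uniform[0, 1], checking the independence relation E[X1 X2] = E[X1] E[X2] = 1/4:

```python
import random

# A sample omega ~ mu^{⊗n} on R^n is a tuple of n independent draws
# from mu; the coordinate maps X_k(omega) = x_k are then i.i.d. with
# common law mu.  Here mu = Uniform[0,1], an illustrative choice.
rng = random.Random(4)
n, trials = 3, 200_000

acc = 0.0
for _ in range(trials):
    omega = tuple(rng.random() for _ in range(n))   # omega ~ mu^{⊗n}
    acc += omega[0] * omega[1]                      # X_1(omega) X_2(omega)

# Independence of the coordinates gives E[X_1 X_2] = (1/2)(1/2) = 1/4.
assert abs(acc / trials - 0.25) < 0.005
```

The same construction, coordinate by coordinate, is what Kolmogorov's theorem extends to infinite parameter sets T.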
What happens if T is infinite, say T = N? In this case we seek a sequence (Xn)n∈N of i.i.d. random variables with common probability distribution µ. A substantial portion of probability is devoted to investigating such sequences and it would be embarrassing, to say the least, if it turned out they do not exist. We will see that this is not the case.
It is also very easy to stumble into situations in which the random variables are not independent, or take values in some infinite dimensional space. Here is such a situation.
Suppose that A0, A1, ..., An is a family of i.i.d. (real valued) random variables defined on the probability space (Ω, S, P). For every t ∈ [0, 1] we set

At := A0 + A1 t + ⋯ + An tⁿ.

We now have on our hands a family of random variables (At)t∈[0,1]. These are dependent. To understand why, suppose for simplicity that the variables Ak have mean zero and variance 1. Then At has mean zero and for any s, t ∈ [0, 1]

Cov[As, At] = E[As At] = 1 + (st) + ⋯ + (st)ⁿ ≥ 1 > 0.

Thus the random variables (At) are dependent.
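The covariance computation above can be verified by simulation. In the sketch below the coefficients Ak are taken to be ±1 with equal probability, an illustrative choice of a mean-zero, variance-one distribution:

```python
import random

# Monte Carlo check of Cov[A_s, A_t] = 1 + st + ... + (st)^n for the
# random polynomial A_t = A_0 + A_1 t + ... + A_n t^n with i.i.d.
# mean-0, variance-1 coefficients (here A_k = +/-1 with probability 1/2).
rng = random.Random(2)
n, s, t, trials = 3, 0.5, 0.8, 200_000

acc = 0.0
for _ in range(trials):
    a = [rng.choice((-1.0, 1.0)) for _ in range(n + 1)]
    As = sum(a[k] * s**k for k in range(n + 1))
    At = sum(a[k] * t**k for k in range(n + 1))
    acc += As * At            # E[A_s] = 0, so Cov[A_s, A_t] = E[A_s A_t]

exact = sum((s * t)**k for k in range(n + 1))   # 1 + st + ... + (st)^n
assert abs(acc / trials - exact) < 0.02
```

The nonzero covariance is exactly what rules out independence of the family (At).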



Let X denote the Banach space C([0, 1]) equipped with the sup norm. The family (At) defines a map

A : Ω → X, Ω ∋ ω ↦ A(ω) := ( t ↦ Σ_{k=0}^n Ak(ω) t^k ) ∈ X,

and one can show that this map is measurable with respect to the Borel sigma-algebra on X = C([0, 1]). The push-forward of P via the map A defines a Borel probability measure P_A on X, so (X, B_X, P_A) is a probability space.
It comes with a natural family of random variables Xt : C([0, 1]) → R, t ∈ [0, 1]. More precisely, the random variable Xt associates to a function f ∈ C([0, 1]) its value at t, Xt(f) = f(t). Note that At = Xt ◦ A. Thus we can view A• as a random continuous function.
Suppose now that (Xt)t∈T is a general family of random variables

Xt : (Ω, S, P) → (X, F).

This defines a map

X : T × Ω → X, T × Ω ∋ (t, ω) ↦ X(t, ω) := Xt(ω) ∈ X,

such that Xt is measurable for any t.
Equivalently, we can view this as a map

X : Ω → X^T = the space of functions f : T → X, (1.5.1)

where for each ω ∈ Ω we have a function X(ω) : T → X, t ↦ Xt(ω).
It is convenient to regard X^T as a product of copies Xt of X,

X^T = ∏_{t∈T} Xt.

Each copy Xt is equipped with a copy Ft of the sigma-algebra F.
The map (1.5.1) is measurable with respect to the sigma-algebra F^T on X^T, the smallest sigma-algebra S on X^T such that all the evaluation maps

Evt : (X^T, S) → (X, F), Evt(f) := f(t),

are measurable. Equivalently,

F^T = ⋁_{t∈T} Evt⁻¹(F).

The push-forward by X defines a probability measure P_X on (X^T, F^T) called the distribution of the stochastic process (Xt)t∈T. In this way we can view the process as defining a random function T → X.
For any finite set I = {t1, ..., tm} ⊂ T we have a sigma-algebra F_I on X^I,

F_I = F_{t1} ⊗ ⋯ ⊗ F_{tm},

and we obtain a random "vector"

X_I : (Ω, S) → (X^I, F_I), ω ↦ (X_{t1}(ω), ..., X_{tm}(ω)) ∈ X^I.

We denote by P_I its probability distribution, P_I := (X_I)_# P. Note that we have a tautological measurable projection π_I : X^T → X^I, and

P_I = (π_I)_# P_X.
Suppose now that J ⊂ T is another finite set containing I,

J = {t1, ..., tm, t_{m+1}, ..., tn}, n > m.

We get in a similar fashion a probability measure P_J on X^J. Now observe that we have a canonical projection

P_{IJ} : X^J → X^I, (x_{t1}, ..., x_{tm}, x_{t_{m+1}}, ..., x_{tn}) ↦ (x_{t1}, ..., x_{tm}),

and, tautologically, we have¹⁰

(P_{IJ})_# P_J = P_I (1.5.2)

since X_I = P_{IJ}(X_J). Proposition 1.29 shows that P_X is the unique probability measure P on X^T such that for any finite subset I ⊂ T we have

P_I = (π_I)_# P.

A family of measures P_I on X^I, I a finite subset of T, constrained by the compatibility condition (1.5.2) for all finite subsets I ⊂ J ⊂ T is said to be a projective family. Note that to any probability measure P on (X^T, F^T) there is an associated projective family of probability measures

P_I := (π_I)_# P.

There are other ways of constructing projective families.

Example 1.191. Suppose that we are given a sequence (Xn, Fn)n≥0 of measurable spaces. For n ∈ N0 := {0, 1, ...} we set

X_{Î_n} := X0 × ⋯ × Xn, F_{Î_n} := F0 ⊗ ⋯ ⊗ Fn.

Consider a family of Markovian kernels Kn : (X_{Î_n}, F_{Î_n}) → (X_{n+1}, F_{n+1}), n ∈ N0. In other words, we have a random probability measure

X_{Î_n} ∋ (x0, ..., xn) ↦ K_{x0,x1,...,xn}[dx_{n+1}]

on (X_{n+1}, F_{n+1}). Then, starting with a probability measure µ0 on (X0, F0), we obtain inductively, using the prescription (1.4.32), a projective family of probability measures,

P0 = µ0, P_{n+1} = P_{Kn,Pn}. (1.5.3)

This means that for any S ∈ F_{Î_{n+1}} we have

P_{n+1}[S] = ∫_{X_{Î_n}} ( ∫_{X_{n+1}} I_S(x⃗, x_{n+1}) K_{x⃗}[dx_{n+1}] ) Pn[dx⃗], x⃗ = (x0, ..., xn).

Equivalently, Pn disintegrates P_{n+1} and Kn is the disintegration kernel.
Denote by P_{n,n+1} the natural projection X_{Î_{n+1}} → X_{Î_n},

(x0, x1, ..., xn, x_{n+1}) ↦ (x0, x1, ..., xn).


Since Kn is a Markovian kernel, i.e.,

∫_{X_{n+1}} K_{x⃗}[dx′] = 1, ∀x⃗ ∈ X_{Î_n},
10 Take a few seconds to convince yourself of the validity of (1.5.2).

we deduce that Pn = (P_{n,n+1})_# P_{n+1}, ∀n ∈ N0. This shows that the collection (Pn)_{n∈N0} is a projective family of probability measures.
Note that if Kn is deterministic, i.e., K_{x0,...,xn}[−] is independent of (x0, ..., xn), then we can think of Kn as a probability measure µ_{n+1} on X_{n+1}. In this case

Pn = µ0 ⊗ ⋯ ⊗ µn.

If (Xn, Fn) = (X, F) for all n ≥ 0, we can obtain kernels Kn as above starting from a single Markovian kernel K : (X, F) → (X, F),

K : X × F → [0, 1], (x, F) ↦ Kx[F].

More precisely, we set K_{x0,...,xn}[dx] := K_{xn}[dx].
In this case the measures Pn on F_{Î_n} are defined by

Pn[dx0 dx1 ⋯ dxn] = µ0[dx0] K_{x0}[dx1] ⋯ K_{x_{n−1}}[dxn].

More precisely, for any S ∈ F_{Î_n} we have

Pn[S] = ∫_{X_{Î_n}} I_S(x⃗) µ0[dx0] K_{x0}[dx1] ⋯ K_{x_{n−2}}[dx_{n−1}] K_{x_{n−1}}[dxn]. (1.5.4)

The above is an iterated integral, going from right to left, i.e., we first integrate with respect to xn, next with respect to x_{n−1}, etc.
Such a situation occurs in the context of Markov chains. □
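The iterated-integral formula (1.5.4) is exactly the rule for sampling a Markov chain path: draw x0 ~ µ0, then x_{k+1} ~ K_{x_k}[dx]. A sketch for a two-state chain, with a made-up initial distribution and transition matrix chosen only for illustration, checking the probability of one particular path:

```python
import random

# Two-state Markov chain on X = {0, 1} following (1.5.4): mu0 is the
# initial distribution and K the (row-stochastic) transition kernel.
# The illustrative path (0, 1, 0) has P_2-probability
# mu0(0) K(0,1) K(1,0) = 0.3 * 0.1 * 0.4 = 0.012.
mu0 = [0.3, 0.7]
K = [[0.9, 0.1],
     [0.4, 0.6]]

def sample_path(n, rng):
    x = 0 if rng.random() < mu0[0] else 1          # x0 ~ mu0
    path = [x]
    for _ in range(n):
        x = 0 if rng.random() < K[x][0] else 1     # x_{k+1} ~ K_{x_k}[dx]
        path.append(x)
    return tuple(path)

rng = random.Random(3)
trials = 300_000
freq = sum(sample_path(2, rng) == (0, 1, 0) for _ in range(trials)) / trials
exact = mu0[0] * K[0][1] * K[1][0]
assert abs(freq - exact) < 0.002
```

Reading the product µ0[dx0] K_{x0}[dx1] K_{x1}[dx2] from left to right gives the sampling order; integrating it from right to left gives the measure Pn of (1.5.4).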

1.5.2 Kolmogorov’s existence theorem

Fix a topological space X and a parameter set T. We denote by 2^T_0 the collection of finite subsets of T. For I ∈ 2^T_0 we denote by B_I the Borel σ-algebra of X^I equipped with the product topology. For any finite subsets I ⊂ J ⊂ T we denote by P_{IJ} the natural projection X^J → X^I.
For t ∈ T we denote by πt the natural projection

πt : X^T → X, πt(x) = xt.

More generally, for any I ∈ 2^T_0 we define π_I : X^T → X^I by setting

X^T ∋ x ↦ π_I(x) = (xi)_{i∈I} ∈ X^I.

Definition 1.192. The natural σ-algebra E_T on X^T is the smallest σ-algebra E ⊂ 2^{X^T} such that all the maps πt, t ∈ T, are (E, B_X)-measurable, i.e., the σ-algebra generated by the family of σ-algebras πt⁻¹(B_X). □

Remark 1.193. The sigma-algebra E_T can also be identified with the σ-algebra of the Borel subsets of X^T equipped with the product topology. □

A cylinder is a subset of X^T of the form

π_I⁻¹(S) = S × X^{T∖I}, I ∈ 2^T_0, S ∈ B_I.

We denote by C_T the collection of cylinders. Clearly C_T is an algebra of sets that generates the natural σ-algebra E_T.

Definition 1.194. A projective family of probability measures on X^T is a family of probability measures P_I on (X^I, B_I), I ∈ 2^T_0, such that for any I ⊂ J in 2^T_0 we have

P_I = (P_{IJ})_# P_J. (1.5.5)

□

As discussed in the previous subsection, any measure on X^T defines a canonical projective family. Kolmogorov's existence (or consistency) theorem states that all projective families are obtained in this fashion.

Theorem 1.195 (Kolmogorov existence theorem). Suppose that X is a Lusin space, i.e., a Borel subset of a compact metric space; see Definition 1.180. For any projective family (P_I)_{I∈2^T_0} of probability measures on X^T there exists a probability measure P̂ on E_T uniquely determined by the requirement: ∀I ∈ 2^T_0, P_I = (π_I)_# P̂. This means that for any B_I ∈ B_I,

P̂[π_I⁻¹(B_I)] = P_I[B_I]. (1.5.6)

Proof. The uniqueness follows from Proposition 1.29.
The existence is a rather deep result, ultimately based on Tikhonov's compactness theorem. We follow the approach in [135, Secs. 30, 31].
Observe that C is a cylinder if and only if

∃I ∈ 2^T_0 and B_I ∈ B_I such that C = π_I⁻¹(B_I).

For I ∈ 2^T_0 we set C^I_T := π_I⁻¹(B_I) ⊂ E_T. Note that

C ∈ C^I_T ⟺ C = B_I × X^{T∖I}, B_I ∈ B_I, (1.5.7a)

∅ ≠ C ∈ C^I_T ∩ C^J_T ⟹ C ∈ C^{I∩J}_T. (1.5.7b)

Define

P̂_I : C^I_T → [0, ∞), P̂_I[C] := P_I[π_I(C)].

Note that if C ∈ C^I_T ∩ C^J_T, then, according to (1.5.7b), C ∈ C^K_T for some K ⊂ I ∩ J. Then

π_I(C) = P_{KI}⁻¹(π_K(C)), π_J(C) = P_{KJ}⁻¹(π_K(C)).

Thus, using (1.5.5),

P_I[π_I(C)] = P_I[P_{KI}⁻¹(π_K(C))] = (P_{KI})_# P_I[π_K(C)] = P_K[π_K(C)],

and, similarly,
PJ [ πJ (C) ] = PJ [ P−1KJ ( πK (C) ) ] = (PKJ )# PJ [ πK (C) ] (1.5.5) = PK [ πK (C) ].
Hence, if C ∈ CIT ∩ CJT , then PbI [ C ] = PbJ [ C ].
We have thus defined a finitely additive measure Pb on the algebra
CT = ∪I∈2T0 CIT .
To invoke Carathéodory's extension theorem (Theorem 1.39) it suffices to show that Pb is countably additive on CT .
Suppose that (Cn )n∈N is a sequence of disjoint sets in CT and
C∞ = ∪n≥1 Cn ∈ CT .
We have to show that
Pb[ C∞ ] = Σn≥1 Pb[ Cn ].
Equivalently, if we set
Bn := C∞ \ ( C1 ∪ · · · ∪ Cn ) ∈ CT ,
we have to show that
limn→∞ Pb[ Bn ] = 0.
More explicitly, we will show that if (Bn )n≥1 is a decreasing sequence of sets in CT with empty intersection, then Pb[ Bn ] → 0 as n → ∞. To complete this step we need to make a brief foundational digression.

Digression 1.196 (Regularity of Borel measures). When dealing with mea-


sures on topological spaces there are several desirable compatibility conditions be-
tween the measure-theoretic objects and the topological ones.

Definition 1.197. Let X be a topological space and µ a Borel measure on X.

(i) The measure µ is called outer regular if for any Borel set B ∈ BX we have
µ[ B ] = inf { µ[ U ] ; U ⊃ B, U open }.
(ii) The measure µ is called inner regular if for any Borel set B ∈ BX we have
µ[ B ] = sup { µ[ C ] ; C ⊂ B, C closed }.
(iii) The measure µ is called regular if it is both inner and outer regular.

(iv) The measure µ is called Radon if it is outer regular, and for any Borel set B ∈ BX , we have
µ[ B ] = sup { µ[ K ] ; K ⊂ B, K compact }.
From the above definition it is clear that
µ is Radon ⇒ µ is regular.
A deep result in measure theory states that any Borel probability measure on a Lusin space is Radon, [15, Thm. 7.4.3]. For our immediate needs we can get by with a lot less. We have the following useful result, [126, Chap. II, Thm. 1.2]. A proof is outlined in Exercise 1.53.

Theorem 1.198. Any Borel probability measure on a metric space is regular.

This concludes our digression. ⊓⊔

As mentioned in Remark 1.181(b), any Lusin space is Borel isomorphic to a compact metric space. Thus it suffices to prove Kolmogorov's theorem only in the special case when X is a compact metric space. From Theorem 1.198 we deduce the following result.

Lemma 1.199. Let Y be a compact metric space. Then any Borel probability measure on Y is Radon. ⊓⊔

We can now complete the proof of Kolmogorov's theorem. Suppose there exists a decreasing sequence of sets (Bn )n≥1 in CT such that
∩n∈N Bn = ∅.
We want to prove that
limn→∞ Pb[ Bn ] = 0.
We argue by contradiction. Suppose that
limn→∞ Pb[ Bn ] = δ > 0.

We can find a strictly increasing sequence of finite subsets of T ,
I1 ⊂ I2 ⊂ · · · ,
and Borel subsets Sn ⊂ XIn , n ∈ N, such that
Bn = Sn × XT \In , ∀n ∈ N.
We set
I∞ := ∪n≥1 In ⊂ T.

For any n ∈ N, the space XIn is a compact metric space and Lemma 1.199 implies that the Borel probability measure PIn on XIn is Radon. Hence, for any n > 0, there exists a compact set Kn ⊂ Sn such that
PIn [ Sn \ Kn ] < δ/2n+1 .
Set
Cn := Kn × XT \In ⊂ Sn × XT \In = Bn .
Tikhonov's compactness theorem shows that all the products XT \In are compact with respect to the product topology. Hence the sets Cn = Kn × XT \In are also compact. Note that
Pb[ Bn \ Cn ] = PIn [ Sn \ Kn ] < δ/2n+1 , ∀n ∈ N. (1.5.8)
Set
Dn := C1 ∩ · · · ∩ Cn ∈ CT .
Observe that (Dn )n∈N is a decreasing sequence of compact subsets of XT and
Pb[ Bn \ Dn ] = Pb[ ∪j=1..n (Bn \ Cj ) ] ≤ Σj=1..n Pb[ Bn \ Cj ]
≤ Σj=1..n Pb[ Bj \ Cj ] (1.5.8) < Σj=1..n δ/2j+1 < δ/2.
Hence
Pb[ Dn ] > δ/2, ∀n ≥ 1.
This shows that Dn is nonempty ∀n ∈ N, so the decreasing sequence of nonempty compact sets Dn has a nonempty intersection. We have reached a contradiction since
∅ = ∩n≥1 Bn ⊃ ∩n≥1 Dn ≠ ∅. ⊓⊔

The real axis R is a Lusin space. Given a Borel probability measure P on R we can construct trivially a projective family PI , I ∈ 2N0 . More precisely, PI = P⊗|I| on RI . We deduce that we have a natural Borel probability measure on RN . We have natural random variables on this probability space,
Xn : RN → R, Xn (x) = xn , ∀x = (x1 , x2 , . . . ) ∈ RN .

Note that PXn = P, ∀n and the joint distribution of X1 , . . . , Xn is P⊗n . Thus, the
random variables (Xn ) are independent and have identical distributions. We have
thus proved the following fact.

Corollary 1.200. For any probability measure P ∈ Prob(R, BR ), there exists a probability space (Ω, S, P) and a sequence of independent identically distributed (or i.i.d. for brevity) random variables Xn : (Ω, S, P) → R, n ∈ N, with common distribution P. ⊓⊔

Remark 1.201. The proof of Theorem 1.195 uses in an essential fashion the topological nature of the projective family of measures (PI )I∈2T0 . We want to emphasize that in this theorem the set of parameters T is arbitrary.
If the set of parameters T is countable, say T = N0 , then one can avoid the topological assumptions.
Consider for example the projective family of measures Pn constructed in Example 1.191. Recall briefly its construction: we are given a sequence of measurable spaces (Xn , Fn )n≥0 and measures Pn on
X0 × · · · × Xn , F0 ⊗ · · · ⊗ Fn ,
such that Pn disintegrates Pn+1 , ∀n ≥ 0. (Observe that this condition is automatically satisfied if each Xn is a Lusin space.) Set
X∞ := Πn≥0 Xn ,

denote by πn the natural projection X∞ → Xn and by F⊗∞ the sigma-algebra
F⊗∞ := ∨n≥0 πn−1 (Fn ).
A theorem of C. Ionescu-Tulcea (see [85, Thm. 8.24]) states that there exists a unique probability measure P∞ on F⊗∞ such that
(Pn )# P∞ = Pn , ∀n ≥ 0,
where Pn denotes the natural projection X∞ → X0 × · · · × Xn .
As a special case of this result let us mention an infinite-dimensional version of Fubini-Tonelli: given measures µn on Fn , there exists a unique measure µ∞ on F⊗∞ such that
(Pn )# µ∞ = ⊗k=0..n µk .
For this reason we will denote the measure µ∞ by ⊗n≥0 µn . ⊓⊔

1.6 Exercises

Exercise 1.1. Let S0 , S1 be two sigma-algebras of a set Ω. Prove that the following
are equivalent.

(i) The union S0 ∪ S1 is a sigma-algebra.


(ii) Either S0 ⊂ S1 or S1 ⊂ S0 .

t
u

Exercise 1.2 (Alexandrov). Suppose that K is a compact metric space, F is an


algebra of subsets of K and µ : F → [0, 1] is a finitely additive function satisfying
the following property. For any F ∈ F and any ε > 0 there exist sets F± ∈ F such
that
 
cl(F− ) ⊂ F ⊂ int(F+ ), µ F+ \ F− < ε.
Prove that µ is a premeasure. t
u

Exercise 1.3. Fix a set Ω of finite cardinality m and a probability measure π on


Ω. Set
Ω∞ := ΩN
so the elements of Ω∞ are functions ω : N → Ω. For every n ∈ N define
πn : Ω∞ → Ωn , πn (ω) = (ω1 , . . . , ωn ),
and denote by Cn the collection of sets of the form
C = πn−1 (S), S ⊂ Ωn , n ∈ N.
Note that C1 ⊂ C2 ⊂ · · · . Set
[
C := Cn .
n∈N

(i) Show that Cn is a σ-algebra of subsets of Ω∞ , ∀n ∈ N.


(ii) For any n ∈ N define βn = βn,π : Cn → [0, 1],
βn [ πn−1 (S) ] := π ⊗n [ S ] = Σ(ω1 ,...,ωn )∈S Πj=1..n π[ {ωj } ].

Show that βn is a well defined measure on Cn and
βn+1 |Cn = βn .

(iii) Equip Ω∞ with the metric
d(ω, η) = Σn∈N (1/2n ) h(ωn , ηn ), h(ω, η) = 0 if ω = η, h(ω, η) = 1 if ω ≠ η.
Prove that (Ω∞ , d) is a compact metric space. Hint. Use the diagonal procedure to show that any sequence in Ω∞ admits a convergent subsequence.



(iv) Define β = βπ : C → [0, 1],
β |Cn = βn .
Show that β is a well defined premeasure on C. Hint. Use Exercise 1.2.
(v) Denote by β̄ = β̄π the extension of β as a measure to the σ-algebra σ(C). (Its existence is guaranteed by the Caratheodory extension theorem.) For ω ∈ Ω we set
∆ω := { η ∈ Ω∞ : ∃m ∈ N such that ηn = ω, ∀n > m }.
Show that ∆ω ∈ σ(C) and β̄[ ∆ω ] = 0.
(vi) Define Xn : Ω∞ → Ω, Xn (ω) = ωn . Show that the random variables (Xn )n∈N are independent and have the same distribution π.
(vii) Let Ω = {0, 1}, π = the uniform measure on {0, 1}, and consider Ω∞ = {0, 1}N equipped with the measure β̄ = β̄π constructed as above. Show that the map
B : (Ω∞ , σ(C)) → ([0, 1], B[0,1] ), B = Σn∈N Xn /2n ,
is measurable and find B# β̄. ⊓⊔
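Part (vii) can be probed numerically: truncating B = Σn Xn /2n at 32 fair bits and comparing the empirical CDF with that of the Lebesgue measure on [0, 1]. A minimal sketch (the 32-bit cutoff and the sample size are arbitrary choices, not part of the exercise):

```python
import random

random.seed(0)

# Truncate B = sum_{n >= 1} X_n / 2^n at 32 fair bits; one call to
# getrandbits(32) supplies the digits X_1, ..., X_32 at once.
samples = [random.getrandbits(32) / 2**32 for _ in range(100_000)]

# If B_# beta-bar is the Lebesgue measure on [0, 1], the empirical CDF
# should track F(t) = t.
ecdf = {t: sum(s <= t for s in samples) / len(samples) for t in (0.25, 0.5, 0.75)}
```

The agreement of the empirical CDF with F(t) = t is the numerical shadow of the answer B# β̄ = λ|[0,1] .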

Exercise 1.4. Suppose that (Ω, S, µ) is a finite11 measured space and A ⊂ S a countable family of measurable subsets that generates S, σ(A) = S. Assume Ω ∈ A. Denote by R[A] the vector space spanned by I A , A ∈ A. Fix p ∈ [1, ∞) and denote by Mp the intersection of L∞ (Ω, S, µ) with the Lp -closure of R[A].

(i) Prove that Mp = L∞ (Ω, S, µ).
(ii) Prove that R[A] is dense in Lp (Ω, S, µ).
(iii) Prove that Lp (Ω, S, µ) is separable. ⊓⊔

Exercise 1.5. Suppose that (Ω, F, µ) is a measured space and (S, d) a metric space.
Consider a function
F : S × Ω → R, (s, ω) 7→ Fs (ω)
satisfying the following properties.

(i) For any s ∈ S the function Ω ∋ ω ↦ Fs (ω) ∈ R is measurable.
(ii) For any ω ∈ Ω the function S ∋ s ↦ Fs (ω) ∈ R is continuous.
(iii) There exists h ∈ L1 (Ω, S, µ) such that |Fs (ω)| ≤ h(ω), ∀(s, ω) ∈ S × Ω.

Prove that Fs ∈ L1 (Ω, S, µ), ∀s ∈ S, and the resulting function
S ∋ s ↦ ∫Ω Fs (ω) µ[ dω ] ∈ R
is continuous. Hint. Use the Dominated Convergence Theorem. ⊓⊔
11 The sigma-finite situation follows from the finite situation in a standard fashion.

Exercise 1.6. Suppose that (Ω, F, µ) is a measured space and I ⊂ R is an open


interval. Consider a function
F : I × Ω → R, (t, ω) 7→ F (t, ω)
satisfying the following properties.

(i) For any t ∈ I the function F (t, −) : Ω → R is integrable,
∫Ω |F (t, ω)| µ[ dω ] < ∞.

(ii) For any ω ∈ Ω the function I ∋ t ↦ F (t, ω) ∈ R is differentiable at t0 ∈ I. We


denote by F 0 (t0 , ω) its derivative.
(iii) There exists h ∈ L1 (Ω, S, µ) and c > 0 such that
|F (t, ω) − F (t0 , ω)| ≤ h(ω)|t − t0 |, ∀(t, ω) ∈ I × Ω.

Prove that the function


Z
 
I 3 t 7→ F (t, ω)µ dω ∈ R

is differentiable at t0 and
Z  Z
d
F 0 (t0 , ω)µ dω .
   
F (t, ω)µ dω = t
u
dt t=t0 Ω Ω

Exercise 1.7. Prove that the random variables N1 , . . . , Nm that appear in Exam-
ple 1.112 on the coupon collector problem can be realized as measurable functions
defined on the same probability space. Hint. Use Exercise 1.3. t
u

Exercise 1.8 (Markov). Let (Ω, S, P) be a sample space and A− , A0 , A+ ∈ S, P[A0 ] ≠ 0. We say that A+ is independent of A− given A0 if
P[A+ ∩ A− |A0 ] = P[A+ |A0 ]P[A− |A0 ].
Show that A+ is independent of A− given A0 if and only if
P[A+ |A0 ∩ A− ] = P[A+ |A0 ]. ⊓⊔

Exercise 1.9 (M. Gardner). A family has two children. Find the conditional
probability that both children are boys in each of the following situations.

(i) One of the children is a boy.


(ii) One of the children is a boy born on a Thursday. t
u

Exercise 1.10. A random experiment is performed repeatedly and the outcome of


an experiment is independent of the outcomes of the previous experiments. While
performing these experiments we keep track of the occurrence of the mutually ex-
clusive events A and B, i.e., A ∩ B = ∅. We assume that A and B have positive

probabilities.12 What is the probability that A occurs before B? Hint. Consider the event C = (A ∪ B)c = neither A nor B. Condition on the result of the first experiment, which can be A, B or C. ⊓⊔

Exercise 1.11. Consider the standard random walk on Z started at 0. More precisely, we are given a sequence of i.i.d. random variables (Xn )n∈N such that P[ Xn = 1 ] = P[ Xn = −1 ] = 1/2, ∀n, and we set
Sn := X1 + · · · + Xn .
Let T denote the time of the first return to 0,
T := min{ n ∈ N; Sn = 0 }.
Set fn = P[ T = n ], un := P[ Sn = 0 ].
 
(i) Prove that u2n = P[ S1 ≠ 0, S2 ≠ 0, . . . , S2n ≠ 0 ]. Deduce that f2n = u2n−2 − u2n . Hint. Use André's reflection principle in Example 1.60.
(ii) Visualize the random walk as a zig-zag of the kind depicted in Figure 1.1. For such a zigzag z we denote by Ln (z) the number of its first n segments that are above the x axis. Equivalently,
Ln (z) := #{ k; 1 ≤ k ≤ n; Sk > −ε }.
For example, for the zig-zag z in Figure 1.1 we have
L8 (z) = L9 (z) = L10 (z) = 8.
Show that
P[ L2n = m ] = u2k u2n−2k if m = 2k ≤ 2n, and P[ L2n = m ] = 0 if m ≡ 1 mod 2.
(iii) Prove that P[ L2n = 2k | S2n = 0 ] = 1/(n + 1). ⊓⊔
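Part (i) can be sanity-checked by brute force for small n, since all 2^{2n} walk paths can be enumerated exactly; the helper names below are hypothetical:

```python
from itertools import product
from math import comb

def u(two_n):
    # u_{2n} = P[S_{2n} = 0] = C(2n, n) / 2^{2n} for the simple random walk
    return comb(two_n, two_n // 2) / 2**two_n

def p_no_return(two_n):
    # Exact P[S_1 != 0, ..., S_{2n} != 0], enumerating all 2^{2n} paths.
    count = 0
    for steps in product((-1, 1), repeat=two_n):
        s = 0
        for x in steps:
            s += x
            if s == 0:
                break
        else:
            count += 1   # the walk never touched 0 in the first 2n steps
    return count / 2**two_n

checks = {two_n: (p_no_return(two_n), u(two_n)) for two_n in (2, 4, 6, 8)}
```

For 2n = 2 both quantities equal 1/2, and the agreement persists for the other enumerated cases.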

Exercise 1.12. Suppose that X, Y : (Ω, S, P) → R are two random variables whose ranges X and Y are countable subsets of R. We set
E[ X ‖ Y ] = Σy∈Y E[ X ‖ Y = y ] I {Y =y} ∈ L0 (Ω, σ(Y ), P),
where
E[ X ‖ Y = y ] := Σx∈X x P[ X = x ‖ Y = y ].
The random variable E[ X ‖ Y ] is called the conditional expectation of X given Y . Prove that
E[X] = E[ E[X ‖ Y ] ]. ⊓⊔
12 For example if we roll a pair of dice, A could be the event "the sum is 4" and B could be the event "the sum is 7". In this case
P[ A ] = 3/36 = 1/12, P[ B ] = 6/36 = 1/6.

Exercise 1.13 (Polya's urn). An urn U contains r0 red balls and g0 green balls. At each stage a ball is selected at random from the urn, we observe its color, we return it to the urn and then we add another ball of the same color. We denote by Rn the number of red balls and by Gn the number of green balls at stage n. Finally, we denote by Cn the "concentration" of red balls at stage n,
Cn = Rn /(Rn + Gn ).

(i) Show that E[ Cn+1 ‖ Rn ] = Cn , where the conditional expectation E[ Cn+1 ‖ Rn ] is defined in Exercise 1.12.
(ii) Show that E[ Cn ] = r0 /(r0 + g0 ), ∀n ∈ N. ⊓⊔
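A short simulation is a useful sanity check for (ii): the mean concentration stays at r0 /(r0 + g0 ) regardless of the number of stages. A sketch, with arbitrarily chosen parameters:

```python
import random

random.seed(1)

def mean_concentration(r0, g0, steps, trials=50_000):
    # Average red-ball concentration C_steps over many runs of the urn.
    total = 0.0
    for _ in range(trials):
        r, g = r0, g0
        for _ in range(steps):
            if random.random() < r / (r + g):
                r += 1   # drew red: return it and add one more red
            else:
                g += 1   # drew green: return it and add one more green
        total += r / (r + g)
    return total / trials

mean_c = mean_concentration(2, 3, steps=20)   # E[C_n] should stay 2/5
```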

Exercise 1.14. Prove the claim about the events Sk at the end of Example 1.110.
t
u

Exercise 1.15 (Banach’s matchbox problem). An eminent mathematician fu-


els a smoking habit by keeping matches in both trouser pockets. When impelled
by need, he reaches a hand into a randomly selected pocket and grabs about for a
match. Suppose he starts with n matches in each pocket. What is the probabil-
ity that when he first discovers a pocket to be empty of matches the other pocket
contains exactly m matches? t
u

Exercise 1.16. Suppose that Xn ∈ L1 (Ω, S, P), n ∈ N, is a sequence of independent and identically distributed (i.i.d.) random variables and T ∈ L1 (Ω, S, P) is a random variable with range contained in N and independent of the variables Xn . Define ST : Ω → R,
ST (ω) = Σn=1..T (ω) Xn (ω).
Prove Wald's formula
E[ ST ] = E[ T ] E[ X1 ]. (1.6.1)
⊓⊔

Exercise 1.17. A box contains n identical balls labelled 1, . . . , n. Draw one ball, uniformly at random, and record its label N . Next flip a fair coin N times. What is the expected number of heads? Hint. Use Wald's formula. ⊓⊔
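A simulation sketch of this experiment, checking the answer Wald's formula (1.6.1) predicts, E[ ST ] = E[ N ] · (1/2) = (n + 1)/4 (the function name is hypothetical):

```python
import random

random.seed(2)

def average_heads(n, trials=100_000):
    # Draw a label N uniformly from {1, ..., n}, then flip N fair coins.
    total = 0
    for _ in range(trials):
        N = random.randint(1, n)
        total += sum(random.getrandbits(1) for _ in range(N))
    return total / trials

n = 10
sim = average_heads(n)
wald = (n + 1) / 2 * 0.5   # E[T] * E[X_1], per (1.6.1)
```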

Exercise 1.18. Suppose that X ∈ L0 (Ω, S, P) is a nonnegative random variable. Prove that if the range of X is contained in N0 , then
E[ X ] − 1 ≤ Σn≥0 P[ X > n ] ≤ E[ X ].
In particular, conclude that
X ∈ L1 (Ω, S, P) ⇐⇒ Σn≥0 P[ X > n ] < ∞.
Hint. Use (1.3.44). ⊓⊔
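For N0 -valued X the tail sum is in fact exactly E[X] = Σn≥0 P[ X > n ], which is consistent with the sandwich above. A quick exact-arithmetic check on a toy distribution (the weights are made up for the illustration):

```python
from fractions import Fraction

# A toy N0-valued distribution; the weights are hypothetical.
pmf = {0: Fraction(1, 4), 1: Fraction(1, 4), 2: Fraction(1, 3), 5: Fraction(1, 6)}

EX = sum(k * p for k, p in pmf.items())

# Tail sum: sum_{n >= 0} P[X > n]; the terms vanish for n >= max(pmf).
tail_sum = sum(
    sum(p for k, p in pmf.items() if k > n) for n in range(max(pmf))
)
```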

Exercise 1.19. There are n unstable molecules m1 , . . . , mn in a row. One of the n − 1 pairs of neighbors, chosen uniformly at random, combines to form a stable dimer. This process continues until there remain Un isolated molecules, no two of which are adjacent.

(i) Show that the probability pn that m1 remains uncombined satisfies
(n − 1)pn = p1 + p2 + · · · + pn−2 .
Deduce that
pn = Σk=0..n−1 (−1)k /k! → e−1 as n → ∞.
Hint. Condition on the first pair of molecules (mr , mr+1 ) that gets combined.
(ii) Show that the probability qr,n that the molecule mr remains uncombined is pr pn−r+1 .
(iii) Show that
E[ Un ] = Σr=1..n qr,n .
(iv) Show that
limn→∞ (1/n) E[ Un ] = e−2 . ⊓⊔
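Part (iv) can be checked by simulation. Choosing a uniformly random available pair at each step is equivalent to scanning the n − 1 pairs in a uniformly random order and accepting a pair when both molecules are still free (a standard reformulation of random sequential adsorption of dimers); the sketch below relies on that equivalence, and the parameters are arbitrary:

```python
import math
import random

random.seed(3)

def isolated_fraction(n, trials):
    # Scan the n - 1 neighbor pairs in uniformly random order and accept a
    # pair when both molecules are still free; count leftover molecules.
    total = 0
    for _ in range(trials):
        free = [True] * n
        pairs = list(range(n - 1))
        random.shuffle(pairs)
        for i in pairs:
            if free[i] and free[i + 1]:
                free[i] = free[i + 1] = False
        total += sum(free)
    return total / (n * trials)

frac = isolated_fraction(500, 400)
target = math.exp(-2)   # limiting value from part (iv)
```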

Exercise 1.20. Let N = Nm be the random variable defined in the coupon collector problem described in Example 1.112. Show that
Var[ Nm ] = m Σk=1..m (m − k)/k2 . ⊓⊔

Exercise 1.21 (The Birthday Problem). Let N ∈ N. Consider a sequence (Xn )n∈N of independent random variables uniformly distributed on the finite set {1, . . . , N }. Define BN to be the birthday random variable13
BN (ω) = min{ j ∈ N : ∃1 ≤ i < j such that Xj (ω) = Xi (ω) }.
Compute the probabilities
P[ BN ≤ k ], k = 1, . . . , N. ⊓⊔
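Since BN ≤ k exactly when the first k draws are not all distinct, P[ BN ≤ k ] = 1 − Πi=0..k−1 (N − i)/N. A short numerical check recovering the classical N = 365 threshold:

```python
def p_birthday(N, k):
    # P[B_N <= k] = 1 - prod_{i=0}^{k-1} (N - i)/N: a repeat occurs among
    # the first k draws from {1, ..., N}.
    p_distinct = 1.0
    for i in range(k):
        p_distinct *= (N - i) / N
    return 1.0 - p_distinct

p22 = p_birthday(365, 22)
p23 = p_birthday(365, 23)   # first k for which the probability exceeds 1/2
```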

Exercise 1.22 (Buffon's Problem). A needle of length ` is thrown at random on a plane ruled by parallel lines distance d apart. Denote by N` the number of lines that intersect the needle.
lines that intersect the needle.
13 You should think of B
N as follows. Suppose that you have an urn with N balls labelled 1, . . . , N .
Suppose we perform the following experiment: draw a ball at random, record its label, put it back
in the box, and then repeat until you notice that the label you’ve drawn has appeared before. The
random variable BN is the first moment when you’ve noticed a label that was drawn before. The
classical birthday problem is the special case N = 365.

 
(i) Prove that P[ N` = 1 ] = 2`/(πd) when ` < d.
(ii) Prove that E[ N`0 +`1 ] = E[ N`0 ] + E[ N`1 ], ∀`0 , `1 > 0.
(iii) Compute E[ N` ], ` > 0. ⊓⊔
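A Monte Carlo sketch of the short-needle case (i), where the crossing probability is 2`/(πd); the needle is reduced to the distance y from its center to the nearest line and its acute angle θ with the lines:

```python
import math
import random

random.seed(4)

def buffon_crossing_prob(ell, d, trials=200_000):
    # Short needle (ell < d): y = distance from the needle's center to the
    # nearest line, theta = acute angle with the lines; the needle crosses
    # a line iff (ell/2) * sin(theta) >= y.
    hits = 0
    for _ in range(trials):
        y = random.uniform(0.0, d / 2)
        theta = random.uniform(0.0, math.pi / 2)
        hits += (ell / 2) * math.sin(theta) >= y
    return hits / trials

sim = buffon_crossing_prob(1.0, 2.0)
exact = 2 * 1.0 / (math.pi * 2.0)   # 2*ell/(pi*d)
```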

Exercise 1.23. Suppose that I is an interval of the real axis and f : I → R is a


continuous function. Prove that the following are equivalent.

(i) For any x, y ∈ I, and any t ∈ (0, 1) we have f ((1 − t)x + ty) ≤ (1 − t)f (x) + tf (y).
(ii) For any x0 ∈ I there exists a linear function ` : R → R such that
`(x0 ) = f (x0 ), `(x) ≤ f (x), ∀x ∈ I.

t
u

Exercise 1.24 (Hermite polynomials). Suppose that X ∼ N (0, 1) so
PX [ dx ] = Γ1 [ dx ] := γ 1 (x)λ[ dx ], γ 1 (x) = (1/√(2π)) e−x²/2 .
We denote by R[x] the space of polynomials with real coefficients. Define the linear operators
P, Q : R[ x ] → R[ x ],
(P f )(x) = f ′ (x), (Qf )(x) = −f ′ (x) + xf (x). (1.6.2)


 
(i) Prove that for any f ∈ R x we have
(P Q − QP )f = f.
 
(ii) Denote by H0 ∈ R x the constant polynomial identically equal to 1. Show
that for any n ∈ N the function
Hn := Qn H0
is a degree n polynomial satisfying
P Hn = nHn−1 , QP Hn = nHn , ∀n ∈ N,
and
Hn = xHn−1 − (n − 1)Hn−2 , ∀n ≥ 2.
The polynomials Hn (x) are called
 the Hermite polynomials.
(iii) Show that for any f, g ∈ R x
Z Z
P f (x)g(x) Γ1 [dx] = f (x)Qg(x) Γ1 [dx]. (1.6.3)
R R
(iv) Show that
x2 x2
Hn (x) = (−1)n e P n e−

2 2 , ∀n ∈ N. (1.6.4)

(v) Show that for any m, n ∈ N0 we have
∫R Hn (x)Hm (x) Γ1 [dx] = n! δmn .
(vi) Show that
Σn≥0 (λn /n!) Hn (x) = eλx−λ²/2 . (1.6.5)
(vii) Suppose that f ∈ R[x], deg f ≤ n. Prove that
f (x) = Σk=0..n (1/k!) E[ f (k) (X) ] Hk (x). ⊓⊔
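The recursion Hn = xHn−1 − (n − 1)Hn−2 of part (ii) makes the Hermite polynomials easy to compute symbolically; the coefficient-list representation below is an implementation choice, not part of the exercise. The derivative check confirms P Hn = nHn−1 for a small case:

```python
def hermite(n):
    # Coefficient lists (index = power of x) built from the recursion
    # H_m = x H_{m-1} - (m - 1) H_{m-2}, with H_0 = 1, H_1 = x.
    H = [[1], [0, 1]]
    for m in range(2, n + 1):
        xh = [0] + H[m - 1]                        # x * H_{m-1}
        sub = [-(m - 1) * c for c in H[m - 2]]     # -(m-1) * H_{m-2}
        H.append([a + (sub[i] if i < len(sub) else 0) for i, a in enumerate(xh)])
    return H[n]

def deriv(p):
    # Coefficients of p'(x); this is the operator P of (1.6.2).
    return [i * c for i, c in enumerate(p)][1:] or [0]

H3 = hermite(3)   # x^3 - 3x
H4 = hermite(4)   # x^4 - 6x^2 + 3
```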

Exercise 1.25. Suppose that X ∼ N (0, 1), i.e.,
PX [ dx ] = γ 1 (x)dx, γ 1 (x) = (1/√(2π)) e−x²/2 .
Set Φ(x) := P[ X > x ]. Prove the Mills ratio inequalities (1.3.40), i.e.,
( x/(x2 + 1) ) γ 1 (x) ≤ Φ(x) ≤ (1/x) γ 1 (x), ∀x > 0.
Hint. For the upper bound observe that
−QΦ = ∫x∞ Φ(t)dt > 0,
where Q is the operator defined in (1.6.2). Next express
∫x∞ (QΦ)(t)dt ≤ 0
in terms of Φ and γ 1 . ⊓⊔
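The two bounds can be checked numerically via math.erfc, since Φ(x) = erfc(x/√2)/2 (a check, not a proof):

```python
import math

def gauss_tail(x):
    # Phi(x) = P[X > x] for X ~ N(0, 1), via the complementary error function.
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def gauss_density(x):
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

# Mills ratio sandwich: x/(x^2+1) * gamma_1(x) <= Phi(x) <= gamma_1(x)/x.
mills_ok = all(
    x / (x * x + 1) * gauss_density(x) <= gauss_tail(x) <= gauss_density(x) / x
    for x in (0.5, 1.0, 2.0, 4.0, 8.0)
)
```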

Exercise 1.26. We denote by Dens(R) the space of probability densities on R, i.e., functions p ∈ L1 (R, λ) such that
∫R p(x)dx = 1 and p(x) ≥ 0 almost everywhere.

For p ∈ Dens(R) we set
E[ p ] := ∫R x p(x)dx, Var[ p ] := ∫R x2 p(x)dx − E[ p ]2 .
The entropy14 of p ∈ Dens(R) is the quantity
Ent[ p ] := − ∫R p(x) log p(x)dx ∈ [0, ∞],
where we set 0 · log 0 = 0.
14 The entropy is a measure of disorder or randomness of the probability density: the higher the

entropy the less predictable is the associated random variable.



(i) Show that if
γ 1 (x) := (1/√(2π)) e−x²/2 ,
then
Ent[ γ 1 ] = (1 + log 2π)/2.
(ii) Show that if p, q ∈ Dens(R) and q(x) > 0, ∀x ∈ R, then
Ent[ p ] ≤ − ∫R p(x) log q(x)dx
if the integral on the right hand side is finite. Moreover, equality holds iff p = q.
Hint. Show that p(x) − p(x) log p(x) ≤ q(x) − p(x) log q(x), ∀x ∈ R.
(iii) Show that if p ∈ Dens(R) satisfies
E[ p ] = 0 = E[ γ 1 ], Var[ p ] = 1 = Var[ γ 1 ],
then Ent[ p ] ≤ Ent[ γ 1 ] with equality iff p = γ 1 . ⊓⊔

Exercise 1.27. Let X : (Ω, S, P) → N0 be a random variable and λ > 0. Prove that the following are equivalent.

(i) X ∼ Poi(λ).
(ii) E[ λf (X + 1) − Xf (X) ] = 0 for any bounded function f : N0 → R. ⊓⊔

Exercise 1.28. Prove Proposition 1.104. ⊓⊔

Exercise 1.29. Show that
MN (t) = pet /(1 − qet ) if N ∼ Geom(p),
MN (t) = exp( λ(et − 1) ) if N ∼ Poi(λ),
and
MX (t) = λ/(λ − t) if X ∼ Exp(λ). ⊓⊔

Exercise 1.30. Let Y ∼ N (0, 1) be a standard normal random variable and set
X := exp(Y ).

(i) Show that
E[ X n ] = en²/2 , ∀n ∈ N.
(ii) Prove that the probability distribution PX of X is given by the log-normal law
PX [ dx ] = p(x)dx, p(x) = ( 1/(x√(2π)) ) e−(log x)²/2 for x > 0, p(x) = 0 for x ≤ 0,
where log denotes the natural logarithm.
where log denotes the natural logarithm.
(iii) For α ∈ [−1, 1] we set
pα (x) = p(x)( 1 + α sin(2π log x) ) for x > 0, pα (x) = 0 for x ≤ 0.
Prove that for any α ∈ [−1, 1] and any n ∈ N0 we have
∫R xn pα (x)dx = en²/2 .
Thus, for any α ∈ [−1, 1], the function pα (x) is a probability density on R and the probability measure pα (x)dx has the same moments as X. ⊓⊔

Exercise 1.31. Let X : (Ω, S, P) → R be a random variable with range contained in N0 = {0, 1, 2, . . . }. Its probability generating function (or pgf for brevity) is the formal power series
P GX (s) = Σn≥0 P[ X = n ] sn .

(i) Show that the power series defining P GX is convergent for any |s| < 1. Moreover, ∀t ≤ 0 we have
MX (t) = P GX (et ).
(ii) Compute P GX when X ∼ Bin(n, p), X ∼ Geom(p), X ∼ Poi(λ). ⊓⊔

Exercise 1.32. Show that


Gamma(ν0 , λ) ∗ Gamma(ν1 , λ) = Gamma(ν0 + ν1 , λ), ∀ν0 , ν1 > 0, (1.6.6a)

N (0, v0 ) ∗ N (0, v1 ) = N (0, v0 + v1 ), ∀v0 , v1 > 0, (1.6.6b)

Poi(λ0 ) ∗ Poi(λ1 ) = Poi(λ0 + λ1 ), ∀λ0 , λ1 > 0. (1.6.6c)


Hint. Use Theorem 1.107, Corollary 1.105 and Corollary 1.134. t
u

Exercise 1.33. Let µ0 , µ1 ∈ Prob([0, 1]) be two Borel probability measures. Prove that the following statements are equivalent.

(i) ∫01 xn µ0 [ dx ] = ∫01 xn µ1 [ dx ], ∀n ∈ N.
(ii) For any Borel subset B ⊂ [0, 1], µ0 [ B ] = µ1 [ B ]. ⊓⊔

Exercise 1.34. Denote by Prob = Prob(R, BR ) the space of probability measures on (R, BR ). Show that (Prob, ∗) is a commutative semigroup with unit δ0 , the Dirac measure concentrated at 0. ⊓⊔

Exercise 1.35. Consider the interval [−π/2, π/2] equipped with the probability measure
P[ dx ] = (1/π) λ[ dx ],
λ = the Lebesgue measure. We regard the function
X : [−π/2, π/2] → R, X(t) = sin2 t,
as a random variable on this probability space. Prove that X ∼ Beta(1/2, 1/2). ⊓⊔

Exercise 1.36. For any a, b > 0 we define the incomplete Beta function
Ba,b : (0, 1) → R, Ba,b (x) = ( 1/B(a, b) ) ∫0x ta−1 (1 − t)b−1 dt,
where B(a, b) is the Beta function (A.1.2).

(i) Prove that
xa (1 − x)b /( aB(a, b) ) = Ba,b (x) − Ba+1,b (x), (1.6.7a)
xa (1 − x)b /( bB(a, b) ) = Ba,b+1 (x) − Ba,b (x). (1.6.7b)
(ii) Show that if k, n ∈ N, k < n, we have
Bk,n+1−k (x) = Σa=k..n C(n, a) xa (1 − x)n−a . (1.6.8)

Exercise 1.37. Suppose that X1 , . . . , Xn : (Ω, S, µ) → R are random variables with joint probability distribution
PX1 ,...,Xn [ dx1 · · · dxn ] = p(x1 , . . . , xn )dx1 · · · dxn ,
p ≥ 0, ∫Rn p(x1 , . . . , xn )dx1 · · · dxn = 1.
Consider the new random variables
Yi = Σj=1..n aij Xj , aij ∈ R,
where the matrix A = (aij )1≤i,j≤n is invertible with inverse A−1 = (bij )1≤i,j≤n . Prove that the joint distribution of Y1 , . . . , Yn is given by the density
q(y1 , . . . , yn ) = ( 1/| det A| ) p( b11 y1 + · · · + b1n yn , . . . , bn1 y1 + · · · + bnn yn ). ⊓⊔
Exercise 1.38. Suppose that X1 , . . . , XN are independent standard normal random variables. For n = 1, . . . , N we denote by Rn2 the random variable X12 + · · · + Xn2 .

(i) Prove that
Rn2 ∼ χ2 (n) := Gamma(ν, λ), ν = n/2, λ = 1/2.
(ii) Prove that
Rn2 /RN2 ∼ Beta(a, b), where a = n/2, b = (N − n)/2.
(iii) Set
X̄ := (1/n)(X1 + · · · + Xn ), S 2 := ( 1/(n − 1) ) Σi=1..n (Xi − X̄)2 .
Prove that (n − 1)S 2 ∼ χ2 (n − 1).
(iv) Set
Tn := X̄/(S/√n).
Prove that Tn ∼ Studn−1 , where Studp denotes the Student t-distribution with p degrees of freedom,
Studp [ dt ] = ( 1/√(pπ) ) ( Γ((p + 1)/2)/Γ(p/2) ) ( 1 + t2 /p )−(p+1)/2 dt, t ∈ R, p > 0. ⊓⊔

Exercise 1.39. Fix a probability space (Ω, S, P). Show that L0 (Ω, S, P) equipped
with the metric dist defined in (1.3.53) is a complete metric space. More precisely,
show that if a sequence of random variables Xn ∈ L0 (Ω, S, P) is Cauchy in proba-
bility, i.e.,
 
limm,n→∞ P[ |Xm − Xn | > r ] = 0, ∀r > 0,

then there exists a random variable X ∈ L0 (Ω, S, P) such that Xn → X in


probability. t
u

Exercise 1.40. Prove the claim in Remark 1.158. t


u

Exercise 1.41. Suppose that X, Y are independent random variables with distributions PX and respectively PY . Let f : R2 → R be a Borel measurable function such that f (X, Y ) is integrable. Show that
E[ f (X, Y ) ‖ X ] = h(X),
where
h(x) = ∫R f (x, y) PY [ dy ]. ⊓⊔

Exercise 1.42. Suppose that (Ω, S, P) is a probability space, F ⊂ S a sigma-subalgebra and X ∈ L0 (Ω, S), Y ∈ L0 (Ω, F). Prove that the following are equivalent.

(i) X = Y a.s.
(ii) For any bounded Borel measurable function f : R → R, E[ f (X) ‖ F ] = f (Y ) a.s. ⊓⊔

Exercise 1.43. Suppose that the sequence of independent random variables


(Xn )n∈N converges in probability. Prove that it is a.s. constant. t
u

Exercise 1.44. For n ∈ N we denote by Cn the cone in Rn defined by
Cn := { (x1 , . . . , xn ) ∈ Rn : x1 ≤ x2 ≤ · · · ≤ xn }.
Define ord : Rn → Cn ,
(x1 , . . . , xn ) ↦ ord(x1 , . . . , xn ) = (x(1) , x(2) , . . . , x(n) ),
where
x(1) = min{x1 , . . . , xn }, x(2) = min( {x1 , . . . , xn } \ {x(1) } ), . . . .
In other words, x(1) , . . . , x(n) are the numbers x1 , . . . , xn rearranged in increasing order.

Suppose X1 , . . . , Xn are n i.i.d. random variables with common cdf
F (x) = ∫−∞x p(s)ds, p ∈ L1 (R, λ).
The order statistics of the random variables X1 , . . . , Xn is the random vector
ord(X) := (X(1) , . . . , X(n) ), where X = (X1 , . . . , Xn ).

(i) Show that the distribution of ord(X) is
Pord(X) [ dx1 · · · dxn ] = n! p(x1 ) · · · p(xn ) I Cn (x1 , . . . , xn ) dx1 · · · dxn .
(ii) Denote by F(j) the cdf of the component X(j) , F(j) (x) = P[ X(j) ≤ x ]. Prove that
F(j) (x) = Σk=j..n C(n, k) F (x)k ( 1 − F (x) )n−k .
(iii) Suppose that X1 , . . . , Xn ∼ Unif(0, 1). Show that
X(j) ∼ Beta(j, n + 1 − j), E[ X(j) ] = j/(n + 1).
(iv) Suppose that X1 , . . . , Xn ∼ Unif(0, 1) and consider the random vector Y = (X(2) , . . . , X(n) ). Compute the conditional distribution of Y given X(1) ,
PY [ dy2 · · · dyn ‖ X(1) = x ].
(v) Suppose that X1 , . . . , Xn ∼ Exp(λ). Show that15
X(1) ∼ Exp(nλ), E[ X(1) ] = 1/(nλ).
(vi) Suppose that X1 , . . . , Xn ∼ Exp(λ). Show that
nX(1) , (n − 1)( X(2) − X(1) ), . . . , 2( X(n−1) − X(n−2) ), X(n) − X(n−1)
are independent Exp(λ) random variables. Hint. Use (i) and Exercise 1.37. ⊓⊔
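Part (iii) is easy to test by simulation: for Unif(0, 1) samples, E[ X(j) ] = j/(n + 1). A sketch with n = 5 (the sample size is an arbitrary choice):

```python
import random

random.seed(6)

n, trials = 5, 40_000
sums = [0.0] * n
for _ in range(trials):
    xs = sorted(random.random() for _ in range(n))   # one order statistic sample
    for j, x in enumerate(xs):
        sums[j] += x

# Empirical E[X_(j)]; the exact value is j/(n+1) = j/6 for j = 1..5.
means = [s / trials for s in sums]
```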

Exercise 1.45. Suppose that X1 , . . . , Xn−1 are independent and uniformly distributed in [0, 1]. Consider their order statistics
X(1) ≤ · · · ≤ X(n−1)
and the corresponding spacings16
S1 = X(1) , S2 = X(2) − X(1) , . . . , Sn = 1 − X(n−1) .
Denote by Ln the largest spacing, Ln = max(S1 , . . . , Sn ).
15 To appreciate how surprising the conclusion of (v) is, think that an institution buys a large number n of computers, all of the same brand, and X1 , . . . , Xn denote the lifetimes of these machines. Each is expected to last 1/λ years. The random variable X(1) is the lifetime of the first computer that breaks down. The result in (v) shows that we should expect the first breakdown pretty soon, in 1/(nλ) years!
16 The n − 1 points X , . . . , X
1 n−1 divide the interval [0, 1] into n subintervals and the spacings are
the lengths of these subintervals.

(i) Prove that (S1 , . . . , Sn ) is uniformly distributed in the simplex
∆n := { (s1 , . . . , sn ) ∈ [0, 1]n ; s1 + · · · + sn = 1 }.
Deduce that E[ Sk ] = 1/n, ∀k = 1, . . . , n.
(ii) Use (i) to show that
E[ Ln ] = (1/n) Σk=1..n (−1)k+1 C(n, k)(1/k).
(iii) Let Y1 , . . . , Yn be independent Exp(1) random variables. Set Tn = Y1 + · · · + Yn . Find the joint distribution of (Y1 , . . . , Yn , Tn ) and show that the random variables
Y1 /Tn , . . . , Yn /Tn
have the same joint distribution as S1 , . . . , Sn . Deduce that Ln has the same distribution as
( max1≤k≤n Yk )/Tn .
(iv) Prove that Ln and
(1/Tn ) Σk=1..n Yk /k
have the same distribution. Hint. Use (iii) and Exercise 1.44(vi). Deduce that17
E[ Ln ] = (1/n) Σk=1..n 1/k. ⊓⊔

Remark 1.202. Observe that the above exercise produces a strange identity,
Σk=1..n 1/k = Σk=1..n (−1)k+1 C(n, k)(1/k). ⊓⊔
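The identity can be verified in exact rational arithmetic for small n (a check, not a proof):

```python
from fractions import Fraction
from math import comb

def harmonic(n):
    # H_n = sum_{k=1}^n 1/k as an exact rational
    return sum(Fraction(1, k) for k in range(1, n + 1))

def alternating_sum(n):
    # sum_{k=1}^n (-1)^{k+1} C(n, k) / k as an exact rational
    return sum(Fraction((-1) ** (k + 1) * comb(n, k), k) for k in range(1, n + 1))

identity_holds = all(harmonic(n) == alternating_sum(n) for n in range(1, 30))
```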

Exercise 1.46. Consider the Poisson process (N (t))t≥0 with intensity λ described in Example 1.136.

(i) Find the distribution of Wt = TN (t)+1 − t.
(ii) Show that N (t + h) − N (t) ∼ Poi(λh), t ≥ 0, h > 0. ⊓⊔
17 This equality shows that E[ Ln ] ∼ (log n)/n, which is substantially higher than the mean of each individual spacing, E[ Sk ] = 1/n, ∀k.

Exercise 1.47. Consider the Poisson process (N (t))t≥0 with intensity λ described in Example 1.136. Let S be a nonnegative random variable independent of the arrival times (Tn )n≥0 of the Poisson process. For any arrival time Tn we denote by ZTn ,S the number of arrival times located in the interval (Tn , Tn + S],
ZTn ,S := #{ k > n; Tn < Tk ≤ Tn + S }.
Prove that
P[ ZTn ,S = k ] = ∫0∞ e−λs ( (λs)k /k! ) PS [ ds ]. ⊓⊔
Exercise 1.48. Suppose that N (t) is a Poisson process (see Example 1.136) with intensity λ and arrival times
T1 ≤ T2 ≤ · · · .
Fix t > 0 and let (Xn )n≥1 be i.i.d. random variables uniformly distributed in [0, t]. Prove that, conditional on N (t) = n, the random vectors
(T1 , . . . , Tn ) and (X(1) , . . . , X(n) )
have the same distribution. ⊓⊔
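The description of the Poisson process via i.i.d. Exp(λ) interarrival times gives an easy simulation; for a Poi(λt) count the mean and the variance should both be close to λt. A sketch with λ = 2, t = 3 (arbitrary parameters):

```python
import random

random.seed(5)

def poisson_counts(lam, t, trials=20_000):
    # N(t) realized by summing i.i.d. Exp(lam) interarrival times.
    counts = []
    for _ in range(trials):
        s, k = 0.0, 0
        while True:
            s += random.expovariate(lam)
            if s > t:
                break
            k += 1
        counts.append(k)
    return counts

counts = poisson_counts(2.0, 3.0)
mean = sum(counts) / len(counts)                          # expect lam*t = 6
var = sum((c - mean) ** 2 for c in counts) / len(counts)  # expect 6 as well
```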

Exercise 1.49. Suppose that the 20 contestants at a quiz show are each given
the same question, and that each answers it correctly, independently of the others,
with probability P . But the difficulty of the question is that P itself is a random
variable.18 Suppose, for the sake of illustration, that P is uniformly distributed
over the interval (0, 1].

(i) What is the probability that exactly two of the contestants answer the question
correctly?
(ii) What is the expected number of contestants that answer a question correctly?

t
u

Exercise 1.50 (Skorokhod). Denote by Prob0 (R) the set of Borel probability measures µ on R such that
∫R x µ[ dx ] = 0.
Clearly Prob0 (R) is a convex subset of the set Prob(R) of Borel probability measures on R.
For u, v ≥ 0 such that u + v > 0 we define the bipolar measure
βu,v := ( v/(u + v) ) δ−u + ( u/(u + v) ) δv ∈ Prob0 (R).
Let Q := { (u, v) ∈ R2 ; u, v ≥ 0, u + v > 0 }. We regard βu,v as a random measure (or Markov kernel) β : Q × BR → R,
β( (u, v), B ) = βu,v [ B ].
18 Think of P as a random Bernoulli measure of the kind discussed in Example 1.174.

Prove that for any µ ∈ Prob0(R) there exists a Borel probability measure ν on Q such that µ = β∗ν. In other words, any measure µ ∈ Prob0(R) is a mixture of bipolar measures. □

Exercise 1.51. Given sigma-algebras F±, F0 ⊂ S, prove that the following are equivalent.

(i) F+ ⊥⊥_{F0} F−.
(ii) F+ ⊥⊥_{F0} (F0 ∨ F−).

□

Exercise 1.52. Given sigma-algebras F±, F0 ⊂ S, prove that the following are equivalent.

(i) F+ ⊥⊥ (F0 ∨ F−).
(ii) F+ ⊥⊥ F0 and F+ ⊥⊥_{F0} F−.

□

Exercise 1.53. Suppose that µ is a Borel probability measure on the metric space (X, d). Denote by C the collection of Borel subsets S of X satisfying the regularity property: for any ε > 0 there exist a closed subset Cε ⊂ S and an open subset Oε ⊃ S such that
µ(Oε \ Cε) < ε.

(i) Show that S ∈ C ⇒ S^c := X \ S ∈ C.
(ii) Show that any closed set belongs to C.19
(iii) Show that C is a π-system.
(iv) Show that C is a λ-system.
(v) Show that C coincides with the family of Borel subsets.

□

Exercise 1.54. Suppose that (X, d) is a compact metric space and µ is a finite
Borel measure on X. Prove that for any p ∈ [1, ∞) the space C(X) of continuous
functions on X is dense in Lp (X, µ). Hint. Use Exercise 1.53 to show that for any Borel subset
B ⊂ X the indicator function I_B can be approximated in Lp by continuous functions. □

19 This is where the fact that X is a metric space plays an important role.





Chapter 2

Limit theorems

The limit theorems have preoccupied mathematicians since the dawn of probability.
The first law of large numbers goes back to Jacob Bernoulli at the end of the
seventeenth century. The Golden Theorem in his Ars Conjectandi is what we call
today a weak law of large numbers. Bernoulli considers an urn that contains a large
number of black and white balls. If p ∈ (0, 1) is the proportion of white balls in the
urn and we draw with replacement a large number n of balls, then the proportion
pn of white balls among the extracted ones is with high confidence within a given
open interval containing p.
His result lacked foundations since the concept of probability lacked a proper
definition. The situation improved at the beginning of the twentieth century when
E. Borel proved a strong form of Bernoulli’s law. Borel too lacked a good definition
of a probability space, but he worked rigorously. In modern terms, he used the
interval [0, 1] with the Lebesgue measure as probability space. He then proceeded
to construct explicitly a sequence of functions Xn : [0, 1] → R which, viewed as random variables, are i.i.d. with common distribution Ber(1/2).
It took the efforts of Khinchin and Kolmogorov to settle the general case. The
strong law of large numbers states that if (Xn )n∈N are i.i.d. random variables with
finite mean µ, then the empirical mean
Mn = (1/n) ∑_{k=1}^n Xk

converges a.s. to the theoretical mean µ.


This chapter is devoted to these limit theorems. In the first section we investi-
gate the SLLN = Strong Law of Large Numbers. The approach we use is due to
Kolmogorov. It reduces this law to the convergence of random series of independent
random variables.
The second section is devoted to the Central Limit Theorem stating that the
distribution of Mn is very close to the distribution of a Gaussian random variable
with the same mean and variance as Mn. The third section is more modern, and
it is devoted to concentration inequalities. These state in a quantitative fashion
that the probability that Mn deviates from the mean µ by a certain amount is


extremely small under certain conditions. The fourth section is devoted to uniform limit theorems of the Glivenko–Cantelli type. We have included this section due to its applications in machine learning. In particular, we show how such results, coupled with the concentration inequalities, lead to Probably Approximately Correct, or PAC, learning.
The last section of this chapter is devoted to a brief introduction to the Brownian
motion. This is such a fundamental object that we thought that any student of
probability ought to make its acquaintance as soon as possible. As always, along
the way we present many, we hope, interesting examples.

2.1 The Law of Large Numbers

This section is devoted to the (Strong) Law of Large Numbers. We follow Kol-
mogorov’s approach based on random series, a subject of independent interest.

2.1.1 Random series


Fix a probability space (Ω, S, P) and consider a sequence of independent random
variables
Xn : (Ω, S, P) → R, n ∈ N.
The independence of the random variables (Xn ) allows us to invoke Kolmogorov’s
0-1 theorem and conclude that the random series
∑_{n∈N} Xn    (2.1.1)
either converges almost surely, or diverges almost surely. We describe one simple sufficient condition for convergence.

Theorem 2.1 (Kolmogorov's one series). Suppose that

E[Xn] = 0, ∀n ∈ N,    (2.1.2a)
∑_{n≥1} Var[Xn] < ∞.    (2.1.2b)

Then the series (2.1.1) converges almost surely and in L².

Proof. For n ∈ N we denote by Sn the n-th partial sum of the series (2.1.1),
Sn := ∑_{k=1}^n Xk.
The L²-convergence follows immediately from (2.1.2b) which, coupled with the independence of the random variables (Xn), implies that the sequence (Sn) is Cauchy in L² since
‖S_{n+k} − Sn‖²_{L²} = ∑_{j=1}^k Var[X_{n+j}], ∀k, n ∈ N.

The proof of the a.s. convergence is more difficult. It relies on a fundamental


inequality which we will further generalize in the next chapter. The independence
of the random variables (Xn ) is used crucially in its proof.

Lemma 2.2 (Kolmogorov's maximal inequality). Set
Mn := max_{1≤k≤n} |Sk|.
Then, for all a > 0, we have
P[Mn > a] ≤ (1/a²) Var[Sn] = (1/a²) ∑_{k=1}^n Var[Xk].    (2.1.3)

Proof of Kolmogorov's maximal inequality. Define
N : Ω → N ∪ {∞}, N(ω) := inf{ n ≥ 1; |Sn(ω)| > a }.
Notice that N(ω) is the first n ∈ N ∪ {∞} such that |Sn(ω)| > a, i.e.,
N(ω) = k ⟺ |S1(ω)|, . . . , |S_{k−1}(ω)| ≤ a and |Sk(ω)| > a.
This shows that the event Ak = {N = k} is in the σ-algebra generated by X1, . . . , Xk. Since Sn − Sk = X_{k+1} + · · · + Xn, we deduce that I_{Ak}, I_{Ak} Sk are independent of Sn − Sk. We have
Var[Sn] = E[Sn²] ≥ E[Sn² I_{Mn>a}] = ∑_{k=1}^n E[I_{Ak} Sn²]
= ∑_{k=1}^n E[ I_{Ak}( Sk² + 2Sk(Sn − Sk) + (Sn − Sk)² ) ]
(using I_{Ak}, I_{Ak} Sk ⊥⊥ Sn − Sk)
= ∑_{k=1}^n ( E[I_{Ak} Sk²] + 2 E[I_{Ak} Sk] E[Sn − Sk] + E[I_{Ak}] E[(Sn − Sk)²] ),
where the middle term vanishes since E[Sn − Sk] = 0, and the last term is ≥ 0. Hence
Var[Sn] ≥ ∑_{k=1}^n E[I_{Ak} Sk²] ≥ a² ∑_{k=1}^n P[Ak] = a² P[Mn > a],
where at the last inequality we used the fact that Sk² > a² on Ak. □

We can now complete the proof of Theorem 2.1. Using Kolmogorov's maximal inequality for the sequence (X_{m+n})_{n∈N} we deduce that for any n ∈ N we have
P[ max_{1≤k≤n} |S_{m+k} − Sm| > ε ] ≤ (1/ε²) Var[S_{m+n} − Sm] = (1/ε²) ∑_{k=1}^n Var[X_{m+k}]
≤ (1/ε²) ∑_{k≥1} Var[X_{m+k}] =: (1/ε²) rm.

Thus
P[ sup_{n≥1} |S_{m+n} − Sm| > ε ] ≤ rm/ε².    (2.1.4)
We set
Ym := sup_{i,j≥m} |Si − Sj|, Zm := sup_{n≥1} |S_{m+n} − Sm|.
Now observe that Sn converges a.s. iff Ym → 0 a.s. The sequence Ym is nonincreasing and thus it converges a.s. to a random variable Y ≥ 0. We will show that Y = 0 a.s.
Note that, for i, j > m, we have
|Si − Sj| ≤ |Si − Sm| + |Sj − Sm| ≤ 2Zm,
so Ym ≤ 2Zm, ∀m, and therefore
{Ym > 2ε} ⊂ {Zm > ε} ⇒ P[Ym > 2ε] ≤ P[Zm > ε].
The inequality (2.1.4) reads
P[Zm > ε] ≤ rm/ε², ∀m ≥ 1, ∀ε > 0.
Since rm → 0 as m → ∞ by (2.1.2b), we deduce
lim_{m→∞} P[Ym > 2ε] = lim_{m→∞} P[Zm > ε] = 0.
On the other hand, for any ε > 0 we have
0 ≤ Y ≤ Ym, ∀m ⇒ P[Y > ε] ≤ P[Ym > ε], ∀m.
We conclude that
P[Y > ε] = 0, ∀ε > 0 ⇒ Y = 0 a.s.
□

Example 2.3. Consider a sequence of i.i.d. Bernoulli random variables (Nk)_{k∈N} with success probability 1/2. The resulting random variables Rk = (−1)^{Nk} are called Rademacher random variables and take only the values ±1 with equal probabilities. We obtain the random series
∑_{k≥1} (−1)^{Nk}/k = ∑_{k≥1} Rk/k.

Loosely speaking, this is a version of the harmonic series with random signs
±1 ± 1/2 ± 1/3 ± · · · ,    (2.1.5)
where the ±-choices at any term are equally likely and also independent of the choices at the other terms of the series. We set
Xk = (−1)^{Nk}/k.

We know that if all the terms are positive, a probability zero event, then we obtain the harmonic series which is divergent. On the other hand,
E[Xk] = 0, Var[Xk] = 1/k².
Since
∑_{k≥1} 1/k² < ∞,
we deduce from Kolmogorov's one series theorem that the series
∑_{k≥1} Xk
is a.s. convergent. Thus, if we flip a fair coin with two sides, a + side and a − side, and we assign the signs in (2.1.5) according to the coin flips, the resulting series is convergent with probability 1! □
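The a.s. convergence can also be watched numerically. The Python sketch below (an illustration, not from the text; the book's own simulations use R) tracks the partial sums of the random-sign harmonic series (2.1.5): the early partial sums swing widely, while the late ones cluster near a limiting value, as Kolmogorov's maximal inequality predicts.

```python
import random

def random_harmonic_partial_sums(n_terms, seed=0):
    """Partial sums of sum_k R_k / k with i.i.d. Rademacher signs R_k."""
    rng = random.Random(seed)
    total, sums = 0.0, []
    for k in range(1, n_terms + 1):
        sign = 1 if rng.random() < 0.5 else -1  # Rademacher sign
        total += sign / k
        sums.append(total)
    return sums

sums = random_harmonic_partial_sums(100_000)
# The first few partial sums oscillate by at least 1/2 (the second step
# alone has size 1/2); the tail of the sequence barely moves at all.
spread_early = max(sums[:100]) - min(sums[:100])
spread_late = max(sums[50_000:]) - min(sums[50_000:])
print(spread_early, spread_late)
```

Rerunning with different seeds produces different limiting values, but the stabilization of the tail occurs with probability 1.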

Remark 2.4. Kolmogorov also established necessary and sufficient conditions for
convergence in his three series theorem. Before we state it let us introduce a conve-
nient notation. For any random variable X and any positive constant C we denote
by X C the truncation
(
C X, |X| ≤ C,
X := XI {|X|≤C} = (2.1.6)
0, |X| > C.

Theorem 2.5 (Kolmogorov's three series theorem). Consider a sequence of independent random variables Xn ∈ L⁰(Ω, S, P). The following statements are equivalent.

(i) The series
∑_{n≥1} Xn    (2.1.7)
converges almost surely.
(ii) For any C > 0 the following three series are convergent:
∑_{n≥1} P[|Xn| > C] = ∑_{n≥1} P[Xn ≠ Xn^C],
∑_{n≥1} E[Xn^C],
∑_{n≥1} Var[Xn^C].

For a proof we refer to [33, Sec. 3.7] or [140, IV.§2]. □

2.1.2 The Law of Large Numbers


The frequentist interpretation of probability asserts that the probability of an event
is roughly the frequency of the occurrence of that event in a very large number
of independent trials. The Law of Large Numbers formalizes this intuition. The
surprising thing, at least to this author, is that reality respects the theory so closely:
the Law of Large Numbers adds a surprising level of predictability to uncertainty!
Throughout this section (Xn)_{n≥1} is a sequence of i.i.d. random variables Xn ∈ L¹(Ω, S, P). Set
µ := E[Xn], Sn := X1 + · · · + Xn.
The various versions of the Law of Large Numbers state that the empirical means
Sn /n converge in an appropriate sense to the theoretical mean µ. The convergence
in probability is usually referred to as the Weak Law of Large Numbers (or WLLN)
while the a.s. convergence is known as the Strong Law of Large Numbers (or SLLN).
We begin by presenting a few special, but historically important, cases.

Theorem 2.6 (Markov). If Xn ∈ L²(Ω, S, P), then (1/n) Sn → µ in probability.

Proof. We use the same strategy as in Example 1.149. Denote by σ² the common variance of the random variables Xn. Since they are independent we have Var[Sn] = nσ², so
Var[Sn/n] = (1/n²) Var[Sn] = σ²/n.
Let ε > 0. Note that E[Sn/n] = µ. Chebyshev's inequality (1.3.17) implies
P[|Sn/n − µ| > ε] ≤ σ²/(nε²) → 0 as n → ∞.
Thus Sn/n → µ in probability. □
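Chebyshev's bound used in this proof is easy to test empirically. The hypothetical Python snippet below (an illustration, not from the text) estimates P[|Sn/n − µ| > ε] for Uniform(0,1) summands, where µ = 1/2 and σ² = 1/12, and compares the empirical frequency with the bound σ²/(nε²).

```python
import random

def deviation_probability(n, eps, trials=2000, seed=1):
    """Empirical estimate of P(|S_n/n - mu| > eps) for Uniform(0,1) summands."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        s = sum(rng.random() for _ in range(n))
        if abs(s / n - 0.5) > eps:
            hits += 1
    return hits / trials

eps = 0.05
sigma2 = 1 / 12  # variance of Uniform(0,1)
for n in (10, 100, 1000):
    emp = deviation_probability(n, eps)
    bound = sigma2 / (n * eps * eps)  # Chebyshev bound sigma^2/(n eps^2)
    print(n, emp, bound)
```

The empirical deviation probabilities fall under the Chebyshev bound and shrink as n grows, which is exactly the convergence in probability asserted by the theorem.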

Theorem 2.7 (Cantelli). If Xn ∈ L⁴(Ω, S, P), then (1/n) Sn → µ almost surely.

Proof. By replacing Xn with Yn := Xn − µ we can assume µ = 0. We set
σ² := µ2[Xk], r4 := µ4[Xk], Mn := Sn/n.
Note that
P[|Mn| > ε] = P[|Mn|⁴ > ε⁴] ≤ (1/ε⁴) E[Mn⁴] = (1/(n⁴ε⁴)) E[Sn⁴].
Observe that
E[Sn⁴] = ∑_{i,j,k,ℓ=1}^n E[Xi Xj Xk Xℓ].    (2.1.8)
Let i ≠ j. Due to the independence of the random variables (Xn)_{n∈N} we have
E[Xi² Xj²] = E[Xi²] E[Xj²] = σ⁴, E[Xi Xj³] = E[Xi] E[Xj³] = 0.

Similarly, for distinct i, j, k, ℓ, we have
E[Xi Xj Xk Xℓ] = 0.
Thus
E[Sn⁴] = n r4 + 6 ∑_{j<k} σ⁴ = n r4 + 3n(n − 1) σ⁴ = O(n²) as n → ∞.
Hence
P[|Mn| > ε] = O(1/(n²ε⁴)) as n → ∞,
so that, for any ε > 0,
∑_{n≥1} P[|Mn| > ε] < ∞.
Corollary 1.141 implies that Mn → 0 a.s. □

Remark 2.8. The above Strong Law of Large Numbers is not the most general, but its proof makes the role of independence much more visible. More precisely, the independence, or the small correlations, force the fourth moment of Sn to be "unnaturally" small, and thus the large fluctuations around the mean are highly unlikely, i.e., P[|Mn| > ε] is very small for large n. □

The next result, due to Kolmogorov, generalizes both results above.

Theorem 2.9 (The Strong Law of Large Numbers). Suppose that (Xn)_{n≥1} is a sequence of i.i.d. random variables Xn ∈ L¹(Ω, S, P). Then
lim_{n→∞} (1/n) Sn = µ a.s.

Proof. We accomplish this in several steps.


Step 1. Truncate. Set
Yn := Xn I_{|Xn|≤n}, Tn := Y1 + · · · + Yn.
We claim that
P[Xn ≠ Yn i.o.] = 0.    (2.1.9)
Indeed, since the random variables (Xn) are identically distributed, we have
∑_{k≥1} P[|Xk| > k] = ∑_{k≥1} P[|X1| > k] ≤ ∫_0^∞ P[|X1| > t] dt = E[|X1|] < ∞,
where the last equality is (1.3.43), and Borel–Cantelli's Lemma implies that
P[|Xk| > k i.o.] = 0.
This is equivalent to (2.1.9). We deduce from (2.1.9) that
lim_{n→∞} (1/n)(Sn − Tn) = 0 a.s.

Thus, it suffices to show that
lim_{n→∞} (1/n) Tn = µ a.s.    (2.1.10)
Step 2. Centering. The sequence (E[Yk])_{k≥1} converges to µ = E[X1] as k → ∞. Indeed, since the random variables are identically distributed, we have
E[Yk] = E[Xk I_{|Xk|≤k}] = E[X1 I_{|X1|≤k}] → E[X1],
where at the last step we used the Dominated Convergence theorem. It follows that the sequence (E[Yn]) is also Cesàro convergent¹ to the same limit, i.e.,
lim_{n→∞} (1/n) E[Tn] = lim_{n→∞} (1/n) ∑_{k=1}^n E[Yk] = µ.
Thus, it suffices to prove that
lim_{n→∞} (1/n) ∑_{k=1}^n ( Yk − E[Yk] ) = 0 a.s.
We set
Zn := Yn − E[Yn].
We have to prove that the Cesàro means of Zn converge to 0 a.s., i.e.,
lim_{n→∞} (1/n) ∑_{k=1}^n Zk = 0 a.s.    (2.1.11)

Step 3. Conclusion. We will rely on the following elementary result.

Lemma 2.10 (Kronecker's Lemma). Suppose that (an)_{n∈N} and (xn)_{n∈N} are sequences of real numbers satisfying the following conditions.

(i) The sequence (an) is increasing, positive and unbounded.
(ii) The series ∑_{n≥1} xn/an is convergent.

Then
lim_{n→∞} (1/an) ∑_{k=1}^n xk = 0.

Assume temporarily the validity of Kronecker's lemma. Thus, to prove (2.1.11) it suffices to show that the random series
∑_{n≥1} Zn/n
is a.s. convergent. The independence assumption will finally play a role because we will invoke the one-series theorem. Clearly the random variables Zn/n are independent. We claim that
∑_{k≥1} Var[Zk]/k² < ∞.    (2.1.12)
1 Use Exercise 2.3 with pk,n = 1/n.

We have
Var[Zk] = Var[Yk] = E[Yk²] − E[Yk]² ≤ E[Yk²]
= ∫_0^∞ 2y P[|Yk| > y] dy = ∫_0^∞ 2y P[k ≥ |Xk| > y] I_{y<k} dy ≤ ∫_0^∞ 2y P[|Xk| > y] I_{y<k} dy,
where the first equality on the second line is (1.3.43). Thus
∑_{k≥1} Var[Zk]/k² ≤ ∑_{k≥1} (1/k²) ∫_0^∞ 2y P[|Xk| > y] I_{y<k} dy
= ∫_0^∞ ( ∑_{k>y} 1/k² ) 2y P[|X1| > y] dy = ∫_0^∞ w(y) P[|X1| > y] dy,
where
w(y) := 2y ∑_{k>y} 1/k².
We claim that
w(y) < 6, ∀y ≥ 0.    (2.1.13)
Indeed, for y ≤ 1 we have
w(y) ≤ 2y ∑_{k≥1} 1/k² ≤ 4y ≤ 4.
For y ∈ (1, 2] we have
w(y) = 2y ∑_{k≥2} 1/k² < 2y ≤ 4.
For y > 2 we have
∑_{k>y} 1/k² ≤ ∫_{⌊y⌋−1}^∞ dt/t² = 1/(⌊y⌋ − 1),
so
w(y) ≤ 2y/(⌊y⌋ − 1) < (2⌊y⌋ + 2)/(⌊y⌋ − 1) = 2 + 4/(⌊y⌋ − 1) ≤ 6.
Using (2.1.13) we deduce
∑_{k≥1} Var[Zk]/k² ≤ 6 ∫_0^∞ P[|X1| > y] dy = 6 E[|X1|] < ∞.
This proves (2.1.12) and completes the proof of the SLLN, assuming Lemma 2.10. □

Proof of Lemma 2.10. Set
yn := xn/an, s0 = a0 := 0, sn := ∑_{k=1}^n yk, n ≥ 1,
so that the sequence (sn)_{n≥1} is convergent. We have to show that
lim_{n→∞} (1/an) ∑_{k=1}^n ak yk = 0.
We have²
∑_{k=1}^n ak yk = ∑_{k=1}^n ak (sk − s_{k−1}) = an sn − ∑_{k=1}^n s_{k−1}(ak − a_{k−1}).
Now set
wk := ak − a_{k−1}, p_{n,k} := wk/an.
Since (an)_{n∈N} is increasing, positive and unbounded, we deduce
∑_{k=1}^n p_{n,k} = 1, ∀n ≥ 1, lim_{n→∞} p_{n,k} = 0, ∀k.    (2.1.14)
Observe that
(1/an) ∑_{k=1}^n ak yk = sn − ∑_{k=1}^n p_{n,k} s_{k−1}.
The conditions (2.1.14) imply that (see Exercise 2.3)
lim_{n→∞} ∑_{k=1}^n p_{n,k} s_{k−1} = lim_{n→∞} sn. □
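Kronecker's lemma is easy to sanity-check numerically. The Python sketch below (an illustration, not from the text) takes an = n and xn = (−1)^{n+1}, so that ∑ xn/an is the convergent alternating harmonic series, and verifies that the averages (1/an) ∑_{k≤n} xk tend to 0.

```python
import math

# a_n = n, x_n = (-1)^(n+1): the series sum x_n / a_n is the alternating
# harmonic series 1 - 1/2 + 1/3 - ..., which converges (to log 2), so
# Kronecker's lemma predicts (1/n) * sum_{k<=n} x_k -> 0.
def kronecker_average(n):
    partial = sum((-1) ** (k + 1) for k in range(1, n + 1))  # sum of x_k
    return partial / n

for n in (10, 1000, 100000):
    print(n, kronecker_average(n))

# The hypothesis also holds: the partial sums of x_n / a_n approach log 2.
alternating = sum((-1) ** (k + 1) / k for k in range(1, 100001))
print(alternating, math.log(2))
```

Here the partial sums of xk alternate between 1 and 0, so the averages are bounded by 1/n, in agreement with the lemma.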

Since a.s. convergence implies convergence in probability we deduce from the


SLLN the Weak Law of Large Numbers (or WLLN).

Corollary 2.11. Suppose that Xn ∈ L¹(Ω, S, P), n ∈ N, is a sequence of i.i.d. random variables with common mean µ. We set
Sn = ∑_{k=1}^n Xk.
Then the empirical mean (1/n) Sn converges in probability to µ. □
2 This is classically known as Abel’s trick. It is a discrete version of the integration by parts trick.

Remark 2.12. Let us observe that in Theorem 2.6 the random variables Xn need not be independent or identically distributed. Assuming all have mean 0, all we need for the Weak Law of Large Numbers to hold is that the random variables are pairwise uncorrelated,
E[Xm Xn] = E[Xm] E[Xn], ∀m ≠ n,    (2.1.15)
and the only constraint on their distribution is
sup_n E[Xn²] < ∞.
In Exercise 2.6 we ask the reader to show that the WLLN holds even if we assume something weaker than (2.1.15), namely that if |m − n| ≫ 1, the random variables Xm and Xn are weakly correlated. More precisely,
lim_{k→∞} sup_{m∈N} E[Xm X_{m+k}] = 0.
Similarly, for the Strong Law of Large Numbers to hold we do not need the variables to be independent. The theorem continues to hold if the variables are identically distributed, integrable and only pairwise independent. For a proof we refer to [53, Sec. 2.4].
The arguments in the proof of Theorem 2.7 show that the SLLN holds even when the variables Xn are neither independent, nor identically distributed. Assuming that all the variables have mean zero, the SLLN holds if any four of them are independent, and the only assumption about their distributions is
sup_n E[Xn⁴] < ∞.

A natural philosophical question arises. What makes the Law of Large Numbers possible? The above discussion suggests that it is a consequence of a mysterious interplay between some form of independence and some "asynchronicity": the fluctuations of the variables around the mean cannot be in resonance. These features can be observed in the other Laws of Large Numbers we will discuss in this text.
If the random variables are independent, but not necessarily identically distributed, there are known necessary and sufficient conditions for the WLLN to hold. We refer to [59, IX], [70, §22], or [127, Chap. 4] for details. □

Remark 2.13. Suppose that (Xn)_{n≥1} is a sequence of i.i.d. variables. The Strong Law of Large Numbers shows that if they have finite mean µ, then the empirical means
Mn = (1/n)( X1 + · · · + Xn )
converge a.s. to µ. If µ = ∞ and Mn converges a.s. to a random variable M∞, then M∞ is a.s. constant. Exercise 2.9 outlines a proof of this fact. □

Example 2.14. Suppose we roll a fair die a large number n of times and we denote by Sn the number of times we roll a 1. Intuition tells us that if the die is fair, then for large n, the fraction of times we get a 1 should be close to 1/6, i.e.,
Sn/n ≈ 1/6 for n ≫ 0.
This follows from the SLLN. Indeed, the above experiment is encoded by a sequence (Xn)_{n∈N} of i.i.d. Bernoulli random variables with success probability p = 1/6. Then
Sn = ∑_{k=1}^n Xk,
and by the SLLN
Sn/n → E[X1] = 1/6 a.s. as n → ∞.
It helps to visualize a computer simulation of such an experiment. Suppose we roll
a die a large number N of times. For i = 1, . . . , N we denote by fi the frequency of
1-s during the first i trials, i.e.,
fi = Si/i.
The resulting vector (fi )1≤i≤N ∈ RN is called relative or cumulative frequency.
The R code below simulates one such experiment when we roll the die 12,000 times.
N<-12000
x<-sample(1:6, N, replace=TRUE)
rolls<-x==1
rel_freq<-cumsum(rolls)/(1:N)

plot(1:N,rel_freq,type="l", xlab="Number of rolls",

ylab="The frequency of occurrence of 1",


main="Average number 1-s during random rolls of die")
abline(h=1/6,col="red")
The output is a plot of the collection of points (i, fi ) depicted in Figure 2.1.
t
u

Example 2.15 (The Monte-Carlo method). Consider a box (parallelepiped)
Bk := I1 × · · · × Ik ⊂ R^k,
where I1, . . . , Ik ⊂ R are nontrivial bounded intervals. Consider independent random variables X1, . . . , Xk, where Xj is uniformly distributed on Ij. The probability distribution of the random vector X = (X1, . . . , Xk) is
(1/λk(Bk)) I_{Bk} λk,

Fig. 2.1 The frequencies fi fluctuate wildly initially and then stabilize around the horizontal line y = 1/6, in perfect agreement with the SLLN.

where we recall that λk denotes the Lebesgue measure on R^k. If f : Bk → R is integrable, then
(1/λk(Bk)) ∫_{Bk} f(x) λk(dx) = E[f(X)].
Suppose that X_n = (X_{n,1}, . . . , X_{n,k}), n ∈ N, is a sequence of i.i.d. random vectors uniformly distributed in Bk. Then the sequence of random variables (f(X_n))_{n∈N} is i.i.d., with the same distribution as f(X). The SLLN implies that the sequence of random variables
Zn = (1/n)( f(X_1) + · · · + f(X_n) )
converges a.s. to
(1/λk(Bk)) ∫_{Bk} f(x) λk(dx).
This fact can be used to produce approximations to integrals using probabilistic
methods. When the dimension k is large these methods are, to this day, the only
viable methods for approximating integrals of functions of many variables.
In Example A.21 we describe a computer implementation of this strategy using the programming language R. □
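In the same spirit as the R implementation of Example A.21, here is a hypothetical one-dimensional Python sketch (not from the text): it approximates ∫_0^1 x² dx = 1/3 by averaging f over uniform samples, exactly as the SLLN argument above prescribes.

```python
import random

def monte_carlo_integral(f, lo, hi, n, seed=2):
    """Approximate the integral of f over [lo, hi] by the SLLN:
    average f at n uniform samples, then multiply by the interval length."""
    rng = random.Random(seed)
    length = hi - lo
    return length * sum(f(lo + length * rng.random()) for _ in range(n)) / n

# Example: integral of x^2 over [0, 1], whose exact value is 1/3.
approx = monte_carlo_integral(lambda x: x * x, 0.0, 1.0, 200_000)
print(approx)  # close to 1/3
```

The error decays like n^{−1/2} regardless of the dimension, which is why the method remains viable for integrals over high-dimensional boxes where grid-based quadrature breaks down.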

2.1.3 Entropy and compression


Let us describe a surprising application of the law of large numbers. Suppose that we are given a finite set X equipped with a probability measure P defined by the function p : X → [0, 1],
p(x) := P[{x}].
We will refer to the pair (X, p) as an alphabet.

Example 2.16. A good example to have in mind is the "alphabet" of the English language. In this alphabet we throw in not just the letters, but also the punctuation signs and the blank space. The elements xi are letters/symbols of the alphabet. The probabilities p(xi) can be viewed as the frequency of the symbol xi in written texts. One way to estimate these frequencies3 is to count the number of their occurrences in a large text, say Moby Dick.
Another good example is the alphabet {0, 1} used in computer languages. The frequencies are p(0) = p(1) = 1/2. □

For a letter xi of the alphabet we define the “surprise” or “information” con-


tained in the letter xi to be the quantity
S(xi ) := − log2 p(xi ).
The base 2 of the logarithm is the convention used in information theory and we
will stick with it. The unit of measure of surprise/information is the bit. Note that
S(xi ) ∈ [0, ∞]. Observe that the less likely the letter xi , the bigger the surprise.
The Shannon entropy or the information entropy of the alphabet is the quantity
Ent2[p] := E_p[S] = −∑_{x∈X} p(x) log2 p(x),    (2.1.16)
where we adhere to the convention 0 · log 0 = 0. Thus the entropy is the expected "surprise" of the alphabet. Suppose, for example, that an urn contains 99 black balls and only one white ball. We would be extremely surprised if, when we randomly draw a ball from the urn, it turns out to be the white one. The average amount of surprise in this case is
−0.99 log2(0.99) − 0.01 log2(0.01) ≈ 0.08.
If p0 is the uniform probability measure on X, then
Ent2[p0] = log2 |X|.
Let m := |X|. Note that Prob(X) can be identified with the (m − 1)-dimensional simplex
∆m = { p = (p1, . . . , pm) ∈ [0, ∞)^m; p1 + · · · + pm = 1 }.
We can view the entropy as a function Ent2 : ∆m → [0, ∞). One can check that it is concave since the function [0, ∞) ∋ x ↦ f(x) = −x log2 x is strictly concave. We have
Ent2[p] = ∑_{i=1}^m f(pi).
Jensen’s inequality shows that
m m
!
1 X 1 X log2 m
f (pi ) ≤ f pi = f (1/m) = ,
m i=1 m i=1 m
3 As a curiosity, the letter "e" is the most frequent letter of the English language; it appears 13% of the time in large texts. It is for this reason that it has the simplest Morse code, a dot.
July 19, 2022 15:9 ws-book961x669 An Introduction to Probability 12800-main page 165

Limit theorems 165

with equality if and only if p1 = · · · = pm = 1/m. We deduce
Ent2[p] ≤ log2 |X|, ∀p ∈ Prob(X),    (2.1.17)
with equality if and only if p is the uniform probability measure. We will see later that the above is a special case of the Gibbs' inequality (2.3.8). Intuitively, this inequality says that among all the probability measures on a finite set, the uniform one is the most "chaotic", the least "predictable".
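The entropy formula (2.1.16) and the bound (2.1.17) can be checked directly. The small Python helper below is an illustration, not part of the text; it reproduces both the urn computation above and the maximality of the uniform measure.

```python
import math

def entropy2(p):
    """Shannon entropy (base 2) of a finite probability vector,
    with the convention 0 * log 0 = 0."""
    return -sum(q * math.log2(q) for q in p if q > 0)

print(entropy2([0.25] * 4))    # uniform on 4 letters: 2.0 = log2(4)
print(entropy2([0.99, 0.01]))  # the urn example: ~ 0.0808
print(entropy2([0.5, 0.3, 0.2]), "<=", math.log2(3))
```

Any non-uniform vector gives a value strictly below log2 of the alphabet size, in agreement with (2.1.17).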
We will refer to the elements of X n as words of length n. The term “word” is a
bit misleading. For example, when X is the English alphabet as above, an element
of X n with large n can be thought of as the sequence of symbols appearing in a
large text. On the other hand, we can think of X n itself as a new alphabet with
frequencies
pn (x1 , . . . , xn ) = p(x1 ) · · · p(xn ).
The amount of “surprise” of a word (x1 , . . . , xn ) is
n
X
S(x1 , . . . , xn ) = S(xk ).
k=1

The entropy of (X n , pn ) is
   
Ent2 pn = n Ent2 p .
We denote by X ∗ the disjoint union of the sets X n ,
G
X∗= X n,
n∈N

and we will refer to it as the vocabulary of the alphabet X .


Fix an alphabet (X, p). We want to describe an efficient way of encoding the words in X^n by words in the vocabulary of the binary alphabet B := {0, 1}. Thus, we want to construct a code map C : X^n → B^∗ such that the words x ∈ X^n with high frequency are encoded by words in B^∗ of short length. Normally we would require that C be injective, but we are willing to sacrifice precision a bit for the sake of efficiency. We would be happy if the probability that two different words have the same code is very small, i.e., the event
x, x′ ∈ X^n, x ≠ x′ and C(x) = C(x′)
has a very small probability.
(n)
Definition 2.17. Let ε > 0. The ε-typical set Aε^(n) with respect to p is the set Aε^(n) ⊂ X^n consisting of words (x1, x2, . . . , xn) with the property
2^{−n(Ent2[p]+ε)} ≤ pn(x1, x2, . . . , xn) ≤ 2^{−n(Ent2[p]−ε)}.    (2.1.18)
□

Theorem 2.18 (Asymptotic Equipartition Property). For any ε > 0 there


exists N = N (ε) such that for any n > N (ε), the following hold.

(i) pn(Aε^(n)) > 1 − ε.
(ii) |Aε^(n)| ≤ 2^{n(Ent2[p]+ε)}.
(iii) |Aε^(n)| ≥ (1 − ε) 2^{n(Ent2[p]−ε)}.

Proof. We sample (X, p) according to the frequencies p(x) and we obtain a sequence (Xn)_{n∈N} of i.i.d. X-valued random variables distributed according to p. We obtain random words (X1, . . . , Xn), n ∈ N. The average amount of surprise per letter in such a word is
(1/n) S(X1, . . . , Xn) = (1/n) ∑_{k=1}^n S(Xk).
The law of large numbers shows that the random variables (1/n) S(X1, . . . , Xn) converge in probability to Ent2[p]. Now observe that
(X1, . . . , Xn) ∈ Aε^(n) ⟺ Ent2[p] − ε ≤ (1/n) ∑_{k=1}^n S(Xk) ≤ Ent2[p] + ε,
so
pn[Aε^(n)] = P[ Ent2[p] − ε ≤ (1/n) ∑_{k=1}^n S(Xk) ≤ Ent2[p] + ε ] → 1
as n → ∞. Fix N = N(ε) such that
pn[Aε^(n)] > 1 − ε, ∀n > N(ε).
Note that for n > N(ε)
1 = ∑_{x∈X^n} pn(x) ≥ ∑_{x∈Aε^(n)} pn(x) ≥ 2^{−n(Ent2[p]+ε)} |Aε^(n)|,
and thus we have
|Aε^(n)| ≤ 2^{n(Ent2[p]+ε)}.
Finally, for n > N(ε) we have
1 − ε < pn[Aε^(n)] ≤ ∑_{x∈Aε^(n)} 2^{−n(Ent2[p]−ε)} = 2^{−n(Ent2[p]−ε)} |Aε^(n)|,
and we conclude that |Aε^(n)| ≥ (1 − ε) 2^{n(Ent2[p]−ε)}. □

The Asymptotic Equipartition Property (or AEP) shows that a typical set has probability nearly 1, all its elements are nearly equiprobable, and its cardinality is nearly 2^{n Ent2[p]}. The inequality (2.1.17) shows that if p is not the uniform probability measure on X, then
2^{Ent2[p]} < |X|.
Hence, if ε > 0 is sufficiently small, then
|Aε^(n)| / |X^n| → 0

exponentially fast as n → ∞. That is, the typical sets carry almost all the probability and yet form an "extremely small" fraction of all the words of length n.
This suggests the following coding procedure. Fix ε > 0, so that 1 − ε will be our confidence level. For n > N(ε) the set Aε^(n) has at most 2^L elements, where L := ⌈n(Ent2[p] + ε)⌉, and thus we can find an injection
I : Aε^(n) → B^L.
For x ∈ Aε^(n) we attach the symbol 1 at the beginning of the word I(x) ∈ B^L and the resulting word in B^{L+1} will encode x. It uses L + 1 bits. The first bit is 1 and indicates that the word x is typical.
We are less careful with the atypical words. Choose any map
J : X^n \ Aε^(n) → B^L
and encode an atypical word x using the binary word J(x) with a prefix 0
attached to indicate that it is atypical. The resulting map C : X n → BL+1 is not
injective, but if two words have the same code, they must be atypical and thus
occur with very small frequency. This is an example of compression.
Take for example the English language. There are various estimates for its entropy, starting with the pioneering work of Claude Shannon. Most recent ones4 vary from 1 to 1.5 bits per symbol. How do we encode efficiently texts consisting of n = 10⁶ symbols, say? For example, "Moby Dick" has 206,052 words and the average length of an English word is 5 letters, so "Moby Dick" consists of about 1.03 million symbols. Forgetting capitalization and punctuation, there are 26^n such texts and a brute encoding would require 26^n codewords to cover all the possibilities. The above result however says that roughly 2^{1.5n} codewords suffice to capture nearly surely almost everything. The term compression is fully justified since this is a much smaller fraction of the total number of possible texts. Also, we only need codewords of length about 1.5 million bits, so roughly 1.5 megabits suffice to encode such a text. If the letters of the alphabet were uniformly distributed in human texts5, then the entropy would be log2(26) ≈ 4.70 > 3 × 1.5 and we would need more than three times that amount of memory to store such a text.
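The AEP can be seen concretely on the two-letter alphabet B with a biased measure. The exhaustive Python computation below (an illustration with hypothetical parameters, not from the text) enumerates all words of a modest length n and shows that the typical set already carries most of the probability while occupying a tiny fraction of the 2^n words.

```python
import math
from itertools import product

def typical_fraction(p, n, eps):
    """Exact p^n-probability and relative size of the eps-typical set
    A_eps^(n) for the alphabet {0, 1} with p(1) = p, per (2.1.18)."""
    ent = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    lo = 2 ** (-n * (ent + eps))
    hi = 2 ** (-n * (ent - eps))
    prob = 0.0
    count = 0
    for word in product((0, 1), repeat=n):
        k = sum(word)                       # number of 1s in the word
        pw = p ** k * (1 - p) ** (n - k)    # probability of this word
        if lo <= pw <= hi:                  # the typicality condition
            prob += pw
            count += 1
    return prob, count / 2 ** n

prob, frac = typical_fraction(0.9, 16, 0.2)
print(prob, frac)
```

Even at n = 16 the typical set captures well over half of the probability mass while containing far less than 1% of all 2^16 words; increasing n sharpens both effects, as Theorem 2.18 guarantees.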

Remark 2.19. The story does not end here and much more precise results are available. To describe some of them, note first that for any alphabet X there is an obvious operation of concatenation
∗ : X^m × X^n → X^{m+n}, (x, x′) ↦ x ∗ x′,
where the word x ∗ x′ is obtained by writing in succession the word x followed by x′. (Note that the code C constructed above uses on average (L + 1)/n ≈ Ent2[p] bits per symbol of a word; this is the sense in which it is a compression.)
This is an example of compression.
4 A Google search with the keywords “entropy of the English language” will provide many more

details on this subject.


5 The famous monkey on a typewriter produces texts where the letters are uniformly distributed,

but we can safely call the resulting texts highly atypical of the English texts humans are used to.

A binary code for the alphabet (X, p) is an injection
C : X → B^∗.
For each x ∈ X we denote by LC(x) the length of the code word C(x). The expected length of a codeword is
ℓC := E[LC] = ∑_{x∈X} LC(x) p(x).
Note that C extends to a map
C^∗ : X^∗ → B^∗, C^∗(x1, . . . , xn) = C(x1) ∗ · · · ∗ C(xn).
The code C is called uniquely decodable if its extension C ∗ : X ∗ → B∗ is also
injective.
An important subclass of uniquely decodable codes are instantaneous codes. A
code C is called instantaneous if no codeword is a prefix of some other codeword.
E.g., if one of the codewords is 10, then no other codeword can begin with 10.
Here is a very revealing example. Consider an alphabet A consisting of four
letters, A := {a, b, c, d}, with frequencies
p_a = 1/2, p_b = 1/3, p_c = p_d = 1/12.
Consider the following instantaneous code
a → 1, b → 01, c → 001, d → 000.
The expected code length is
1/2 + 2/3 + 3/12 + 3/12 = 5/3 ≈ 1.666.
The entropy of the alphabet is
Ent₂[A] = log₂2/2 + log₂3/3 + log₂12/6 ≈ 1.625.
Kraft’s inequality shows that for any uniquely decodable code C we have
ℓ_C ≥ Ent₂[A].
Moreover, there exist optimal codes C such that
ℓ_C ≤ Ent₂[A] + 1.
Such codes are called Shannon codes. The above code is a Shannon code. In fact it
is a special example of the famous Huffman code, [37].
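The two numbers in this example are easy to verify directly; here is a small sketch in Python, using the code {a → 1, b → 01, c → 001, d → 000} from above:

```python
from fractions import Fraction
from math import log2

p = {"a": Fraction(1, 2), "b": Fraction(1, 3),
     "c": Fraction(1, 12), "d": Fraction(1, 12)}
code = {"a": "1", "b": "01", "c": "001", "d": "000"}

# expected codeword length  ℓ_C = Σ L_C(x) p(x)
ell = sum(len(code[x]) * p[x] for x in p)
assert ell == Fraction(5, 3)

# entropy  Ent₂[A] = Σ p(x) log₂(1/p(x))
ent = sum(float(px) * log2(1 / float(px)) for px in p.values())
print(float(ell))        # 1.6666...
print(round(ent, 4))     # ≈ 1.6258

# Kraft's inequality holds for this instantaneous (hence uniquely decodable) code
assert sum(Fraction(1, 2) ** len(w) for w in code.values()) <= 1
```

As the text states, ℓ_C lands between Ent₂[A] and Ent₂[A] + 1.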
Let us discuss a particularly suggestive experiment that highlights a defining
feature of Huffman codes and reveals one interpretation of the entropy of an alpha-
bet.
Suppose we have an urn containing the letters a, b, c, d, in proportions
p_a, p_b, p_c, p_d. A person randomly draws a letter from the urn and you are sup-
posed to guess what it is by asking YES/NO questions. Think YES = 1, NO = 0.
The above code describes an optimal guessing strategy. Here it is.

Limit theorems 169

(1) Ask first if the letter is a → 1. If the answer is YES (= 1), the game is over.
The game has length 1 with probability 1/2.
(01) If the answer is NO (= 0) the letter can only be b, c or d. Ask if the letter is
b → 01. If the answer is YES (= 1) the game is over. The game has length 2
with probability 1/3.
(001) If the answer is NO (= 0) ask if the letter is c → 001. Whatever the answer,
the game is over: YES means c, NO means d. The game has length 3 with
probability 1/6.
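The guessing game can be simulated; here is a sketch (using Python's random module, seeded for reproducibility) whose average number of questions should approach the expected code length 5/3:

```python
import random

random.seed(0)
letters = ["a", "b", "c", "d"]
weights = [6, 4, 1, 1]  # proportions 1/2, 1/3, 1/12, 1/12 (scaled by 12)

def questions(letter):
    # strategy from the text: ask "is it a?", then "is it b?", then "is it c?"
    if letter == "a":
        return 1          # YES on the first question
    if letter == "b":
        return 2          # NO, then YES
    return 3              # c and d are both resolved by the third question

n_games = 200_000
total = sum(questions(random.choices(letters, weights)[0]) for _ in range(n_games))
print(total / n_games)    # close to 5/3 ≈ 1.667
```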

For more details about information theory and its applications we refer to [37;
112]. For a more informal introduction to information theory we refer to [60]. The
eminently readable [71] contains historical perspective on the evolution of informa-
tion theory. Kolmogorov’s brief but intuition-rich survey [95] is a good place
to start learning about the mathematical theory of information. □

2.2 The Central Limit Theorem

The goal of this section is to prove a striking classical result that complements the
Law of Large Numbers.

Fig. 2.2 Visualizing the Central Limit Theorem.

Suppose that (Xn )n∈N is a sequence of i.i.d. random variables with mean µ and
finite variance σ 2 . Note that the sum Sn := X1 + · · · + Xn has mean nµ and
variance nσ 2 . Loosely speaking, the central limit theorem states that for large n
the probability distribution of Sn “resembles” very much a Gaussian with the same
mean and variance.
For example, if the Xn's are Bernoulli random variables with success probability
p, then µ = p, σ² = pq (q := 1 − p), and Sn ∼ Bin(n, p). In Figure 2.2 we have
illustrated what happens in the case p = 0.3 and n = 65.

The vertical lines depict the probability mass function of the binomial distri-
bution while the curve wrapping them is the Gaussian with the same mean and
variance. They obviously do “resemble”. However, we need to define precisely
what we mean by “resemble”.

2.2.1 Weak and vague convergence


Let (X, d) be a metric space. Denote by Meas(X) the set of finite Borel measures
on X and by Prob(X) ⊂ Meas(X) the space of Borel probability measures on X.
We denote by C0(X) the space of continuous functions X → R with compact
support and by Cb(X) the space of bounded continuous functions X → R. This is
a Banach space with respect to the sup-norm
‖f‖∞ := sup_{x∈X} |f(x)|.
For any f ∈ Cb(X) and µ ∈ Meas(X) we set
µ[f] := ∫_X f(x) µ[dx] < ∞.

Definition 2.20. Consider a sequence (µn)n∈N of finite Borel measures on X.

(i) We say that the sequence (µn) converges vaguely to µ ∈ Meas(X), and we
write this µn 99K µ, if
lim_{n→∞} ∫_X f(x) µn[dx] = ∫_X f(x) µ[dx], ∀f ∈ C0(X). (2.2.1)
(ii) We say that the sequence (µn) converges weakly to µ ∈ Meas(X), and we write
this µn ⇒ µ, if
lim_{n→∞} ∫_X f(x) µn[dx] = ∫_X f(x) µ[dx], ∀f ∈ Cb(X). (2.2.2)
(iii) A sequence of random variables (Xn)n∈N valued in X is said to converge in
law or in distribution to the random variable X if
PXn ⇒ PX in Prob(X),
i.e.,
lim_{n→∞} E[f(Xn)] = E[f(X)], ∀f ∈ Cb(X). (2.2.3)

We will use the notation Xn →d X to indicate that Xn converges to X in
distribution. □

Definition 2.21. A collection F ⊂ Cb(X) is called separating if, given µ0,
µ1 ∈ Meas(X) such that µ0[f] = µ1[f], ∀f ∈ F, we have µ0 = µ1. □

As shown in Proposition 1.84, the collection Cb (X) is separating so the above


definition is not vacuous for any metric space.
In the remainder of the subsection we will focus exclusively on the special case
when X = Rk equipped with its natural metric.

Lemma 2.22. The collection C0(Rk) is separating. More precisely, let µ0,
µ1 ∈ Meas(Rk). If
µ0[f] = µ1[f], ∀f ∈ C0(Rk),
then µ0 = µ1.

Proof. According to Proposition 1.29 it suffices to show that
µ0[Br(x0)] = µ1[Br(x0)], ∀r > 0, x0 ∈ Rk,
where Br(x0) denotes the closed ball of radius r centered at x0. Fix r > 0, x0 ∈ Rk
and set
Sn := { x ∈ Rk ; dist(x, x0) ≥ r + 1/n }.
For n ∈ N define fn : Rk → [0, 1],
fn(x) = dist(x, Sn) / ( dist(x, Br(x0)) + dist(x, Sn) ).
Observe that fn is continuous, supp fn ⊂ B_{r+1/n}(x0) and
lim_{n→∞} fn(x) = I_{Br(x0)}(x) a.s.
The Dominated Convergence Theorem implies that
∫_{Rk} I_{Br(x0)}(x) µ0[dx] = lim_{n→∞} ∫_{Rk} fn(x) µ0[dx]
= lim_{n→∞} ∫_{Rk} fn(x) µ1[dx] = ∫_{Rk} I_{Br(x0)}(x) µ1[dx]. □

Lemma 2.22 shows that a sequence of Borel probability measures on Rk has at most
one vague limit, i.e., if µn 99K µ and µn 99K µ0, then µ = µ0.

Example 2.23. Let
µn = (1/n) ∑_{k=1}^n δ_{k/n}.
Then
µn ⇒ µ = I_{[0,1]}(x)dx ∼ Unif(0, 1).
Indeed, if f ∈ Cb(R), then
∫_R f(x) µn[dx] = (1/n) ∑_{k=1}^n f(k/n).
The sum in the right-hand side of the above equality is a Riemann sum for f
corresponding to the uniform partition
0 < 1/n < 2/n < · · · < (n − 1)/n < 1.
Since f is Riemann integrable we deduce
lim_{n→∞} (1/n) ∑_{k=1}^n f(k/n) = ∫_0^1 f(x)dx = ∫_R f(x) µ[dx]. □
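The Riemann-sum computation in this example can be tested numerically; here is a sketch with the bounded continuous function f(x) = x², for which ∫₀¹ f(x)dx = 1/3:

```python
def mu_n(f, n):
    """Integrate f against µ_n = (1/n) Σ δ_{k/n}, i.e. a right-endpoint Riemann sum."""
    return sum(f(k / n) for k in range(1, n + 1)) / n

f = lambda x: x * x
exact = 1 / 3  # ∫₀¹ x² dx
for n in (10, 100, 1000):
    print(n, mu_n(f, n))
# the error decays like O(1/n)
assert abs(mu_n(f, 1000) - exact) < 1e-3
```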

Example 2.24. There exist vaguely convergent sequences of Borel probability mea-
sures on R that are not weakly convergent. Take for example µn = δn, n ∈ N. Then
µn 99K 0, yet µn does not converge weakly to 0 since µn[R] = 1, ∀n. □

Theorem 2.25 (Mapping theorem). Suppose that F : Rk → Rm is a contin-
uous function and Xn : (Ω, S, P) → Rk, n ∈ N, is a sequence of random vectors
converging in distribution to the random vector X. Then the sequence of random
vectors Yn = F(Xn) converges in distribution to Y = F(X).

Proof. Let f ∈ Cb(Rm). Then f ◦ F ∈ Cb(Rk) and
E[f(Yn)] = E[(f ◦ F)(Xn)] → E[(f ◦ F)(X)] = E[f(Y)]. □

Proposition 2.26. If the random variables Xn converge in probability to X, then


they also converge in law to X. In particular, if Xn converge in p-mean to X, then
they also converge in law to X.

Proof. We deduce from Corollary 1.145 that for any f ∈ Cb(R) the random vari-
ables f(Xn) converge in probability to f(X). The bounded convergence theorem
implies
lim_{n→∞} E[f(Xn)] = E[f(X)], ∀f ∈ Cb(R). □

Example 2.27. Fix a standard normal random variable X. Then PX = P−X, so
−X is a standard normal random variable as well. Consider the constant sequence
Xn = X, n ∈ N.
Then PXn ⇒ P−X, but Xn does not converge to −X in probability. □

Theorem 2.28 (Portmanteau theorem). Let µn ∈ Prob(Rk ), n ∈ N, be a se-


quence of Borel probability measures on Rk . The following statements are equivalent.

(i) The sequence (µn)n∈N converges weakly to µ ∈ Prob(Rk).
(ii) For any open set U ⊂ Rk we have
µ[U] ≤ lim inf_{n→∞} µn[U].
(iii) For any closed set C ⊂ Rk we have
µ[C] ≥ lim sup_{n→∞} µn[C].
(iv) For any Borel set B ⊂ Rk such that µ[∂B] = 0 we have
µ[B] = lim_{n→∞} µn[B].

Proof. (i) ⇒ (ii) According to Theorem 1.198 the measure µ is regular, i.e., for
any ε > 0 there exists a closed set Cε ⊂ U such that
µ[U] ≥ µ[Cε] > µ[U] − ε.
Consider the continuous function
f : Rk → [0, 1], f(x) = dist(x, U^c) / ( dist(x, U^c) + dist(x, Cε) ).
Since f = 1 on Cε and f = 0 outside U we have
µn[Cε] ≤ µn[f] ≤ µn[U], ∀n ∈ N.
In particular, we deduce that, ∀ε > 0, we have
µ[U] − ε < µ[Cε] ≤ µ[f] = lim_{n→∞} µn[f] ≤ lim inf_n µn[U].
This proves (ii).


(ii) ⇐⇒ (iii) Follows from the following facts:

• The set U is open iff U^c is closed.
• For any Borel set B ⊂ Rk, µ[B^c] = 1 − µ[B].
 
(ii) + (iii) ⇒ (iv). Let B ⊂ Rk be a Borel set such that µ[∂B] = 0. Denote by U
the interior of B and by C its closure, so that ∂B = C \ U. We deduce
µ[B] = µ[C] = µ[U].
Thus
lim sup_n µn[C] ≤ µ[C] = µ[B] = µ[U] ≤ lim inf_n µn[U].
Since ∂B is closed we deduce
lim sup_n µn[∂B] ≤ µ[∂B] = 0.
Hence
µn[C] = µn[U] + µn[∂B], lim_n µn[∂B] = 0,
so
lim inf_n µn[U] = lim inf_n µn[C].

Hence
lim_n µn[C] = µ[C], and, since µn[U] ≤ µn[B] ≤ µn[C], ∀n,
lim_n µn[B] = µ[B].
   
(iv) ⇒ (i). Clearly it suffices to show that µn f → µ f , for any nonnegative,
bounded, continuous function f on Rk . 
Suppose that f be such a function. Set K = sup f . For any ν ∈ Prob Rk we
can regard f as a random variable (Rk , BRk , ν) → R. The integral ν f is then the


expectation of this random variable. Using Proposition 1.126 with p = 1 we deduce


that
Z Z Z K
     
Eν f = f (x)ν[dx] = ν f >t = ν f > t dt.
R R 0

Note that
   
ν f = t = 0 ⇒ ν ∂{f > t} = 0.

Observe next that for any n ∈ N we have



# t ∈ R; ν[f = t] ≥ 1/n ≤ n,

so, for any ν ∈ Prob(Rk ) the set



t ∈ R; µ[f = t] > 0

is at most a countable set. We deduce from (iv) that


   
lim µn f > t = µ f > t for almost any t.
n→∞

From the Dominated Convergence Theorem we deduce


Z K Z K
   
lim µn [f ] = lim µn f > t dt = µ f > t dt = µ[f ].
n→∞ n→∞ 0 0

t
u

Corollary 2.29. Let Xn, n ∈ N, be a sequence of random variables. Denote by
Fn(x) the cdf of Xn,
Fn(x) = P[Xn ≤ x], x ∈ R.
The following statements are equivalent.

(i) The random variables Xn converge in law to the random variable X.
(ii) If F(x) is the cdf of X, then
lim_{n→∞} Fn(x) = F(x)
for any point of continuity x of F.


Proof. Set µn := PXn, µ := PX. The condition (ii) is a special case of condition
(iv) of the Portmanteau Theorem, so (i) ⇒ (ii).
(ii) ⇒ (i) Denote by X ⊂ R the set of points of continuity of F. Note that its
complement R \ X is at most countable, so X is dense. Note that for a, b ∈ X,
a < b, we have
P[a < X < b] = F(b) − F(a).
For any a, b ∈ R, a < b, and any ε > 0 there exist aε, bε ∈ X, a < aε < bε < b, such
that
F(bε) − F(aε) = P[aε < X < bε] > P[a < X < b] − ε.
Hence
lim_{n→∞} ( Fn(bε) − Fn(aε) ) = F(bε) − F(aε) > P[a < X < b] − ε.
On the other hand,
P[a < Xn < b] ≥ P[aε < Xn ≤ bε] = Fn(bε) − Fn(aε), ∀n,
so that
lim inf_{n→∞} P[a < Xn < b] ≥ P[a < X < b] − ε, ∀ε > 0,
i.e.,
lim inf_{n→∞} P[a < Xn < b] ≥ P[a < X < b], ∀a < b ∈ R.
Thus, the sequence µn satisfies the condition (ii) in the Portmanteau Theorem
2.28, where U is any open interval of the real axis. Since any open set of
the real axis is a disjoint union of countably many open intervals, we deduce that
condition (ii) in the Portmanteau Theorem is satisfied for all open sets U ⊂ R. □

Theorem 2.30 (Slutsky). Suppose that (Xn )n∈N and (Yn )n∈N are sequences of
random variables such that (Xn ) converges in distribution to X and Yn converges
in probability to c ∈ R. Then the sum Xn + Yn converges in distribution to X + c.

Proof. Without loss of generality we can assume c = 0. We follow the argument
in [12, Chap. 1, Sec. 3]. Fix a closed subset C ⊂ R. For ε > 0 set
Cε := { x ∈ R ; dist(x, C) ≤ ε }.
The set Cε is closed and we have
P[Xn + Yn ∈ C] ≤ P[|Yn| > ε] + P[Xn ∈ Cε].
Letting n → ∞ we deduce from the assumptions and the Portmanteau Theorem
that
lim sup_{n→∞} P[Xn + Yn ∈ C] ≤ lim sup_{n→∞} P[Xn ∈ Cε] ≤ P[X ∈ Cε].
Now let ε ↘ 0, observing that Cε ↘ C. □

We can now formulate and prove the main convergence criterion of this subsection.

Theorem 2.31. Suppose that (µn)n∈N is a sequence of finite Borel measures on
R. Fix a subset F ⊂ Cb(R) whose closure in Cb(R) contains C0(R). The following
statements are equivalent.

(i) The sequence (µn) converges weakly to µ ∈ Meas(R).
(ii) The sequence (µn) converges vaguely to µ ∈ Meas(R) and
µ[R] = lim_{n→∞} µn[R].
(iii) µ ∈ Meas(R) and
lim_{n→∞} ∫_R f(x) µn[dx] = ∫_R f(x) µ[dx], ∀f ∈ F,
µ[R] = lim_{n→∞} µn[R]. (2.2.4)

Proof. Since
ν[R] = ∫_R I_R(x) ν[dx], ∀ν ∈ Meas(R),
we can replace the measures µn by (1/µn[R]) µn and thus we can assume that all the
measures µn are probability measures. In this case (ii) reads
µn converges vaguely to µ and µ is a probability measure,
while (iii) reads
µ is a probability measure and µn[f] → µ[f], ∀f ∈ F.
Obviously (i) ⇒ (ii) and (i) ⇒ (iii). It suffices to prove that (ii) ⇒ (i) and (iii)
⇒ (ii). We will need the following result.

Lemma 2.32. Any finite Borel measure µ ∈ Meas(Rk) is Radon, i.e., for any
Borel set B ⊂ Rk and any ε > 0, there exists a compact set K ⊂ B such that
µ[B \ K] < ε.

Proof. Let B ⊂ Rk be a Borel set and ε > 0. According to Theorem 1.198, the
measure µ is regular. Hence, there exists a closed set C ⊂ B such that
µ[B \ C] < ε/2.
On the other hand, we can find R > 0 sufficiently large such that
µ[B̄R(0)] > µ[Rk] − ε/2,
where B̄R(0) denotes the closed ball of radius R centered at 0. We set
K := B̄R(0) ∩ C. The set K is clearly compact and
µ[C \ K] ≤ µ[Rk \ B̄R(0)] < ε/2.
Thus µ[B \ K] ≤ µ[B \ C] + µ[C \ K] < ε. □

(ii) ⇒ (i) We will show that the sequence (µn) satisfies the condition (ii) in the
Portmanteau Theorem. Let U ⊂ R be an open set and ε > 0. Lemma 2.32
shows that there exists a compact set K ⊂ U such that
µ[K] > µ[U] − ε.
Now choose r < (1/2) dist(K, U^c) and set
Cr := { x ∈ R ; dist(x, K) ≥ r }.
The set Cr is closed and its complement
Vr := { x ∈ R ; dist(x, K) < r } ⊂ U
is precompact. Consider the continuous function
ϕ : R → [0, ∞), ϕ(x) = dist(x, Cr) / ( dist(x, K) + dist(x, Cr) ).
Observe that it vanishes on Cr and thus it has compact support contained in U.
Moreover, ϕ = 1 on K. Thus
µn[K] ≤ µn[ϕ] ≤ µn[U].
Letting n → ∞ we deduce
µ[U] − ε < µ[K] ≤ µ[ϕ] = lim_n µn[ϕ] ≤ lim inf_n µn[U], ∀ε > 0.
This establishes condition (ii) of the Portmanteau Theorem.


(iii) ⇒ (ii) Let ϕ ∈ C0 (Rk ). For any ε > 0 choose fε ∈ F such that kϕ − fε k∞ < 2ε ,
i.e.,
ε ε
fε − ≤ ϕ ≤ fε + .
2 2
Then
  ε     ε
µn fε − ≤ µn ϕ ≤ µn fε + .
2 2
Letting n → ∞ we deduce
  ε   ε
µ ϕ − ε < µ[fε ] − = lim µn fε − ≤ lim inf µn [ϕ] ≤ lim sup µn [ϕ]
2 2 n n

  ε ε
≤ lim µn fε + = µ[fε ] + < µ[ϕ] + ε.
2 2
The above inequalities hold for any ε > 0 so
lim inf µn [ϕ] = lim sup µn [ϕ] = µ[ϕ].
n n
t
u

Corollary 2.33. Consider a sequence µn ∈ Prob(R) and µ ∈ Meas(R). Then the
following are equivalent.

(i) The sequence (µn) converges weakly to µ.
(ii) For any bounded Lipschitz function f : R → R we have
lim_{n→∞} µn[f] = µ[f].

Proof. The implication (i) ⇒ (ii) is obvious. To prove that (ii) ⇒ (i) observe first
that any compactly supported continuous function can be uniformly approximated
by compactly supported smooth functions6, so the closure in Cb(R) of the set
of bounded Lipschitz functions contains C0(R). The measure µ is a probability
measure since the constant function I_R is bounded and Lipschitz and thus
µ[I_R] = lim_{n→∞} µn[I_R] = 1.
The conclusion now follows from Theorem 2.31. □

Corollary 2.34. If a sequence µn ∈ Prob(R) converges vaguely to a probability
measure, then it also converges weakly. □

Corollary 2.35. Suppose that (Xn)n∈N and X are random variables with ranges
contained in Z. Then Xn ⇒ X if and only if
lim_{n→∞} P[Xn = k] = P[X = k], ∀k ∈ Z. (2.2.5)

Proof. The condition (2.2.5) is clearly satisfied if Xn ⇒ X since
P[X = k] = P[k − 1/2 < X ≤ k + 1/2]
= lim_{n→∞} P[k − 1/2 < Xn ≤ k + 1/2] = lim_{n→∞} P[Xn = k].
Conversely, if (2.2.5) is satisfied, then
E[ϕ(Xn)] → E[ϕ(X)], ∀ϕ ∈ C0(R).
The conclusion now follows from Theorem 2.31. □
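A classical illustration of this corollary (standard, though not worked out in the text) is the Poisson limit of binomials: if Xn ∼ Bin(n, λ/n), then P[Xn = k] → e^{−λ}λ^k/k!, so Xn converges in law to a Poisson variable. A quick numerical sketch:

```python
from math import comb, exp, factorial

lam = 2.0

def binom_pmf(n, k):
    # pmf of Bin(n, λ/n)
    p = lam / n
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k):
    # pmf of Pois(λ)
    return exp(-lam) * lam**k / factorial(k)

for k in range(5):
    print(k, binom_pmf(10_000, k), poisson_pmf(k))

# pointwise convergence of the pmfs, as in (2.2.5)
assert all(abs(binom_pmf(10_000, k) - poisson_pmf(k)) < 1e-3 for k in range(10))
```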

The next result generalizes Fatou’s Lemma. However, our proof relies on Fatou’s
Lemma.

Proposition 2.36. Suppose that the sequence of random variables (Xn)n∈N con-
verges in distribution to X. Then
E[|X|] ≤ lim inf_{n→∞} E[|Xn|].
In particular, X is integrable if the sequence (Xn)n∈N is bounded in L1, i.e.,
sup_n E[|Xn|] < ∞.
6 One simple way to see this is to use Weierstrass's approximation theorem.

Proof. The Mapping Theorem 2.25 implies that the sequence (|Xn|)n∈N converges
in distribution to |X|. Thus
lim_{n→∞} P[|Xn| > t] = P[|X| > t]
for all t outside a countable subset of [0, ∞). Using (1.3.43) we deduce
E[|X|] = ∫_0^∞ P[|X| > t] dt, E[|Xn|] = ∫_0^∞ P[|Xn| > t] dt, ∀n.
Fatou’s Lemma implies
∫_0^∞ P[|X| > t] dt ≤ lim inf_{n→∞} ∫_0^∞ P[|Xn| > t] dt. □

2.2.2 The characteristic function


The key ingredient in the proof of the CLT is the concept of Fourier transform, or
characteristic function, of a finite Borel measure µ ∈ Meas(R). This is the complex
valued function
µ̂ : R → C, µ̂(ξ) = ∫_R e^{iξx} µ[dx].
Note that µ is a probability measure if and only if µ̂(0) = 1.
The characteristic function of a random variable X is the Fourier transform
ΦX(ξ) of its probability distribution,
ΦX(ξ) = P̂X(ξ) = E[e^{iξX}].
Note that
|ΦX(ξ)| ≤ 1, ∀ξ ∈ R.
Moreover, ΦX(0) = E[1] = 1.
From the Dominated Convergence theorem we deduce that ΦX is a continuous
function R → C. Thus, the Fourier transform is a map
Prob(R) ∋ µ ↦ µ̂ ∈ Cb(R, C).

Proposition 2.37. Let X ∈ L2(Ω, S, P). Then ΦX ∈ C2(R) and
Φ′X(0) = i E[X], Φ″X(0) = −E[X²].

Proof. Denote by PX the probability distribution of X, so PX ∈ Prob(R). Then
ΦX(ξ) = ∫_R e^{ixξ} PX[dx].
Note that since X ∈ L2 we have
∫_R |x| PX[dx] < ∞, ∫_R x² PX[dx] < ∞,

so
∂ξ e^{ixξ} = i x e^{ixξ} ∈ L1(R, PX), ∂²ξ e^{ixξ} = −x² e^{ixξ} ∈ L1(R, PX).
This shows (see Exercise 1.6) that the integral
∫_R e^{ixξ} PX[dx]
is twice differentiable with respect to the parameter ξ and we have
Φ′X(ξ) = i ∫_R x e^{iξx} PX[dx], Φ″X(ξ) = −∫_R x² e^{iξx} PX[dx].
Using the Dominated Convergence theorem we deduce that the function
ξ ↦ −∫_R x² e^{iξx} PX[dx]
is continuous, so ΦX ∈ C2(R). □

Let Γv ∈ Prob(R) be the Gaussian measure with mean 0 and variance v > 0,
Γv[dx] = γv(x)dx, γv(x) = (1/√(2πv)) e^{−x²/(2v)}.

Proposition 2.38.
Γ̂v(ξ) = e^{−vξ²/2} = √(2π/v) γ_{1/v}(ξ), ∀v > 0. (2.2.6)
Proof. We have
Γ̂v(ξ) = (1/√(2πv)) ∫_R e^{−x²/(2v)} e^{iξx} dx = (1/√(2π)) ∫_R e^{−y²/2} e^{i√v ξy} dy = Γ̂1(√v ξ).
Thus it suffices to determine
f(ξ) = Γ̂1(ξ) = (1/√(2π)) ∫_R e^{−x²/2} e^{iξx} dx.
The imaginary part of the above integrand is an odd function (in x), so f(ξ) is real, ∀ξ,
i.e.,
f(ξ) = (1/√(2π)) ∫_R e^{−x²/2} cos(ξx) dx.
The function
(d/dξ)( e^{−x²/2} cos(ξx) ) = −x e^{−x²/2} sin(ξx)
is integrable (in the x variable). This shows that f(ξ) is differentiable (see Exer-
cise 1.6) and
f′(ξ) = −(1/√(2π)) ∫_R x e^{−x²/2} sin(ξx) dx = (1/√(2π)) ∫_R (d/dx)( e^{−x²/2} ) sin(ξx) dx
(integrate by parts)
= −(ξ/√(2π)) ∫_R e^{−x²/2} cos(ξx) dx = −ξ f(ξ).
Thus
f′(ξ) + ξ f(ξ) = 0,
so that
(d/dξ)( e^{ξ²/2} f(ξ) ) = 0 ⇐⇒ f(ξ) = C e^{−ξ²/2}.
Since f(0) = 1 we deduce C = 1 and thus Γ̂1(ξ) = e^{−ξ²/2}. □
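Formula (2.2.6) can also be checked by naive numerical integration; here is a minimal sketch using the standard library only (midpoint rule on [−10√v, 10√v]; the cutoff and step count are ad hoc choices):

```python
from cmath import exp as cexp
from math import exp, pi, sqrt

def char_gaussian(xi, v, steps=100_000):
    # numerically integrate  Γ̂_v(ξ) = ∫ γ_v(x) e^{iξx} dx  by the midpoint rule
    a = 10 * sqrt(v)
    h = 2 * a / steps
    total = 0j
    for j in range(steps):
        x = -a + (j + 0.5) * h
        total += exp(-x * x / (2 * v)) / sqrt(2 * pi * v) * cexp(1j * xi * x) * h
    return total

for v, xi in [(1.0, 0.7), (2.0, 1.3)]:
    approx = char_gaussian(xi, v)
    assert abs(approx - exp(-v * xi * xi / 2)) < 1e-6  # matches e^{-vξ²/2}
    assert abs(approx.imag) < 1e-9                     # real, by symmetry
```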

Theorem 2.39. A probability measure µ ∈ Prob(R) is uniquely determined by its
characteristic function, i.e., the map
Prob(R) ∋ µ ↦ µ̂ ∈ Cb(R, C)
is injective.

Proof. For any v > 0 and µ ∈ Prob(R) we set µv := Γv ∗ µ. According to
Remark 1.135 we have
µv[dx] = ρv(x)dx, ρv(x) = ∫_R γv(x − y) µ[dy].
The theorem follows from the following two facts.
Fact 1. The family (µv)v>0 is completely determined by µ̂.
Fact 2. The family (µv)v>0 converges weakly to µ as v ↘ 0, i.e.,
lim_{v↘0} µv[f] = µ[f], ∀f ∈ Cb(R).

Proof of Fact 1. The idea behind this fact is that the Fourier transform and the
convolution interact in a nice way. More precisely, we will show that
ρv(x) = (1/√(2πv)) ∫_R e^{ixξ} γ_{1/v}(ξ) µ̂(−ξ) dξ. (2.2.7)
Using (2.2.6) we deduce
√(2πv) γv(x) = e^{−x²/(2v)} = ∫_R e^{ixξ} γ_{1/v}(ξ) dξ.
We deduce
ρv(x) = (1/√(2πv)) ∫_R ( ∫_R e^{i(x−y)ξ} γ_{1/v}(ξ) dξ ) µ[dy]
(use Fubini)
= (1/√(2πv)) ∫_R e^{ixξ} γ_{1/v}(ξ) ( ∫_R e^{−iyξ} µ[dy] ) dξ = (1/√(2πv)) ∫_R e^{ixξ} γ_{1/v}(ξ) µ̂(−ξ) dξ.

Proof of Fact 2. Let f ∈ Cb(R). A simple application of Fubini shows that
∫_R f(x) µv[dx] = ∫_R f(x) ( ∫_R γv(x − y) µ[dy] ) dx = ∫_R ( ∫_R γv(x − y) f(x) dx ) µ[dy]
= ∫_R fv(y) µ[dy], fv(y) := ∫_R γv(x − y) f(x) dx.
The function fv is obviously continuous. If M := sup_{x∈R} |f(x)|, then
|fv(y)| ≤ M ∫_R γv(x − y) dx = M ∫_R γv(z) dz = M (x = z + y).
On the other hand,
fv(y) = ∫_R γv(x − y) f(x) dx = ∫_R γv(z) f(z + y) dz = Γv[Ty f],
where Ty f(z) := f(z + y). Fix y ∈ R, ε > 0 and δ = δ(ε) > 0 such that
sup_{|z|<δ} |f(z + y) − f(y)| < ε/2.
Then
|fv(y) − f(y)| = |Γv[Ty f − f(y)]| = | ∫_R ( f(z + y) − f(y) ) Γv[dz] |
≤ ∫_{|z|<δ} |f(z + y) − f(y)| Γv[dz] + ∫_{|z|≥δ} |f(z + y) − f(y)| Γv[dz]
≤ sup_{|z|<δ} |f(z + y) − f(y)| + 2M Γv[|z| ≥ δ]
(1.3.17)
≤ sup_{|z|<δ} |f(z + y) − f(y)| + 2Mv/δ² < ε/2 + 2Mv/δ², ∀v > 0.
Hence
lim sup_{v↘0} |fv(y) − f(y)| ≤ ε/2, ∀ε > 0,
so that
lim_{v↘0} fv(y) = f(y), ∀y ∈ R.
The Dominated Convergence theorem now implies
lim_{v↘0} µv[f] = lim_{v↘0} µ[fv] = µ[ lim_{v↘0} fv ] = µ[f]. □

Remark 2.40. (a) The above theorem can be rephrased as stating that the collection
of trigonometric functions
{ R ∋ x ↦ cos(ξx), sin(ξx) ; ξ ∈ R }
is separating. However, the smaller family
{ R ∋ x ↦ cos(ξx), sin(ξx) ; |ξ| < 1 }
is not separating! More precisely, there exist two distinct probability measures
µ0, µ1 such that
µ̂0(ξ) = µ̂1(ξ), ∀|ξ| < 1.
We refer to [111, Chap. IV, Sec. 15, p. 231] for more details.
(b) The range of the Fourier transform
Prob(R) ∋ µ ↦ µ̂ ∈ Cb(R, C)
can also be characterized. Note first that, ∀µ ∈ Prob(R),
µ̂(0) = µ[R] = 1, and µ̂(−ξ) is the complex conjugate of µ̂(ξ), ∀ξ ∈ R.
Additionally, the function µ̂ is positive definite. This means that, for any n ∈ N
and any ξ1, . . . , ξn ∈ R, the Hermitian matrix
( µ̂(ξi − ξj) )_{1≤i,j≤n}
is positive semidefinite, i.e., for any z1, . . . , zn ∈ C we have
∑_{1≤i,j≤n} µ̂(ξi − ξj) zi z̄j ≥ 0.
This follows by observing that
∑_{1≤i,j≤n} µ̂(ξi − ξj) zi z̄j = ∫_R | ∑_{k=1}^n zk e^{iξk x} |² µ[dx].
It turns out that these necessary conditions characterize the range of the
Fourier transform. This is the content of the celebrated Bochner theorem. For a
proof we refer to [59, p. 622], [68, §II.3], or [135, I.24]. □

Theorem 2.41 (Lévy’s Continuity Theorem). Let (µn)n∈N be a sequence in
Prob(R) and µ ∈ Prob(R). The following statements are equivalent.

(i) The sequence (µn)n∈N converges weakly to µ.
(ii) For any ξ ∈ R,
lim_{n→∞} µ̂n(ξ) = µ̂(ξ).
n→∞

Proof. Our presentation is influenced by François Le Gall’s lecture notes [102].
(i) ⇒ (ii) Since µn ⇒ µ we deduce that for any ξ ∈ R we have
lim_{n→∞} ∫_R cos(ξx) µn[dx] = ∫_R cos(ξx) µ[dx],
lim_{n→∞} ∫_R sin(ξx) µn[dx] = ∫_R sin(ξx) µ[dx].
(ii) ⇒ (i) We carry out the proof in two steps. For any v > 0 and any f ∈ Cb(R) we
define fv : R → R,
fv(x) = ∫_R f(x − y) Γv[dy].
It is easy to see that fv ∈ Cb(R).
Step 1. We will show that
lim_{n→∞} µn[fv] = µ[fv], ∀v > 0, ∀f ∈ C0(R). (2.2.8)
Observe that
fv(x) = ∫_R f(x − y) γv(y) dy = ∫_R f(z) γv(x − z) dz.
Let ν ∈ Prob(R). Then
ν[fv] = ∫_R ( ∫_R f(z) γv(z − x) dz ) ν[dx] = ∫_R f(z) ( ∫_R γv(z − x) ν[dx] ) dz
(2.2.7)
= (1/√(2πv)) ∫_R ( ∫_R e^{izξ} γ_{1/v}(ξ) ν̂(−ξ) dξ ) f(z) dz
= (1/√(2πv)) ∫_R ( ∫_R e^{izξ} f(z) dz ) γ_{1/v}(ξ) ν̂(−ξ) dξ = (1/√(2πv)) ∫_R f̂(ξ) γ_{1/v}(ξ) ν̂(−ξ) dξ,
where f̂(ξ) := ∫_R e^{izξ} f(z) dz.
The function f̂(ξ) is well defined since f ∈ C0(R). The Dominated Convergence
theorem shows that f̂ is continuous. Moreover,
|f̂(ξ)| ≤ ∫_R |f(x)| dx.
We deduce that, ∀n ∈ N,
µn[fv] = (1/√(2πv)) ∫_R f̂(ξ) γ_{1/v}(ξ) µ̂n(−ξ) dξ.
The Dominated Convergence theorem shows that
lim_{n→∞} ∫_R f̂(ξ) γ_{1/v}(ξ) µ̂n(−ξ) dξ = ∫_R f̂(ξ) γ_{1/v}(ξ) µ̂(−ξ) dξ = √(2πv) µ[fv].

Step 2. If f ∈ Cb(R) is uniformly continuous, then fv converges to f uniformly as
v ↘ 0.
Set
M := sup_{x∈R} |f(x)|, ω(r) := sup_{x,y∈R, |x−y|≤r} |f(x) − f(y)|, r > 0.
Since f is uniformly continuous we have
lim_{r↘0} ω(r) = 0.
For x ∈ R,
|fv(x) − f(x)| = | ∫_R f(x − y) Γv[dy] − f(x) |
= | ∫_R f(x − y) Γv[dy] − ∫_R f(x) Γv[dy] | ≤ ∫_R |f(x − y) − f(x)| Γv[dy]
= ∫_{|y|≤r} |f(x − y) − f(x)| Γv[dy] + ∫_{|y|>r} |f(x − y) − f(x)| Γv[dy]
≤ ω(r) ∫_{|y|≤r} Γv[dy] + 2M ∫_{|y|>r} Γv[dy]
(use Chebyshev’s inequality to estimate the second integral)
≤ ω(r) + 2M v/r², ∀v, r > 0.
Now choose α ∈ (0, 1/2) and r = v^α. We deduce that
sup_{x∈R} |fv(x) − f(x)| ≤ ω(v^α) + 2M v^{1−2α} → 0 as v ↘ 0.
This completes Step 2.
We deduce that the family
F := { ϕv ; v > 0, ϕ ∈ C0(R) }
contains C0(R) in its closure and µn[f] → µ[f] for any f ∈ F. The conclusion
follows from Theorem 2.31. □

Remark 2.42. (a) One can show that if a sequence µn ∈ Prob(R) converges weakly
to a probability measure µ, then µ̂n(ξ) converges to µ̂(ξ) uniformly on compacts;
see Exercise 2.37.
(b) In Theorem 2.41 we assumed that the limit of the sequence of characteristic
functions (µ̂n)n∈N is the characteristic function of a probability measure µ. This
assumption is not necessary.
One can show that if the characteristic functions µ̂n(ξ) converge pointwise
to a continuous function f, then f is the characteristic function of a probability
measure. In Exercise 2.36 we describe the main steps of the rather involved proof
of this more general result. □
u

Remark 2.43. P. Lévy, [105, §17, p. 47], introduced a metric dL on Prob(R). More
precisely, given µ0, µ1 ∈ Prob(R) with cumulative distribution functions
Fi(x) = µi[(−∞, x]], x ∈ R, i = 0, 1,
the Lévy metric is the length of the largest segment cut out by the graphs Γ0, Γ1
of F0, F1 along a line of the form x + y = a. The graphs are made continuous by
adding vertical segments connecting Fi(x − 0) to Fi(x) at the points of discontinuity.
Intuitively, the distance is the diagonal of the largest square with sides parallel to
the axes that can be squeezed between the curves Γ0 and Γ1.
More precisely,
dL(µ0, µ1) = sup_{a∈R} dist_{R²}( p0(a), p1(a) ),
where pi(a) is the intersection of the graph Γi with the line x + y = a. Note that if
we write pi(a) = (xi, yi), then yi = Fi(xi),7 and
dL(µ0, µ1) = sup{ √2 |x0 − x1| ; x0 + F0(x0) = x1 + F1(x1) }.
Lévy refers to the convergence with respect to the metric dL as “convergence from
the point of view of Bernoulli”. He shows (see [105, §17]) that a sequence of proba-
bility measures µn converges in the metric dL to a probability measure µ if and only
if the characteristic functions µ̂n converge to the characteristic function µ̂. Hence,
the convergence in the metric dL is the weak convergence, so that dL metrizes the
weak convergence. □

2.2.3 The Central Limit Theorem


We can now state and prove the main result of this section.

Theorem 2.44 (Central Limit Theorem). Suppose that Xn ∈ L2(Ω, S, P) is a
sequence of i.i.d. random variables with common mean µ and common variance v. Set
X̄n = Xn − µ, S̄n = ∑_{k=1}^n (Xk − µ), Zn = (1/√(nv)) S̄n = (1/√(nv)) ( ∑_{k=1}^n Xk − nµ ).
Then Zn ⇒ N(0, 1).

Proof. According to Lévy’s continuity theorem it suffices to show that
lim_{n→∞} ΦZn(ξ) = Φ_{Γ1}(ξ) = e^{−ξ²/2}.
Observe that the X̄n are i.i.d. with mean 0 and variance v, while Zn has mean 0 and
variance 1. Denote by Φ(ξ) their common characteristic function, Φ(ξ) = E[e^{iξX̄1}].
We have
ΦZn(ξ) = Φ_{S̄n/√(nv)}(ξ) = Φ_{S̄n}( ξ/√(nv) ) = E[ exp( i (ξ/√(nv)) ∑_{k=1}^n X̄k ) ]
7 At a point of discontinuity this reads yi ∈ [Fi(xi − 0), Fi(xi)].

 
(the variables exp( i (ξ/√(nv)) X̄k ), 1 ≤ k ≤ n, are independent)
= ∏_{k=1}^n E[ exp( i (ξ/√(nv)) X̄k ) ] = Φ( ξ/√(nv) )^n.

Proposition 2.37 shows that the function Φ(η) is C², so as η → 0 we have
Φ(η) = Φ(0) + Φ′(0)η + (1/2)Φ″(0)η² + o(η²) = 1 + i E[X̄1] η − (1/2) E[X̄1²] η² + o(η²)
(E[X̄1] = 0, E[X̄1²] = Var[X̄1] = v)
= 1 − (v/2) η² + o(η²).
Now let η = ξ/√(nv), n ≫ 0. We deduce
Φ( ξ/√(nv) )^n = ( 1 − ξ²/(2n) + o(1/n) )^n.
At this point we want to invoke the following result.

Lemma 2.45. Suppose that (cn)n≥1 is a convergent sequence of complex numbers
and
c = lim_{n→∞} cn.
Then
lim_{n→∞} ( 1 + cn/n )^n = e^c.

Assuming Lemma 2.45 we deduce that, for any ξ ∈ R, we have
lim_{n→∞} ΦZn(ξ) = lim_{n→∞} ( 1 − ξ²/(2n) + o(1/n) )^n = e^{−ξ²/2} = Φ_{Γ1}(ξ).

Proof of Lemma 2.45. Set c = a + bi, cn = an + bn i, so that an → a, bn → b. We set
zn = 1 + cn/n = 1 + an/n + (bn/n) i.
For large n, zn = rn e^{iθn}, where
rn = ( (1 + an/n)² + bn²/n² )^{1/2} = ( 1 + 2a/n + o(1/n) )^{1/2},
|θn| < π/2, tan θn = (bn/n) / ( 1 + an/n ).
Thus
θn = arctan( (bn/n) / (1 + an/n) ) = b/n + o(1/n) as n → ∞.
We deduce that as n → ∞ we have
zn^n = ( 1 + 2a/n + o(1/n) )^{n/2} · e^{i(b+o(1))} → e^a · e^{ib} = e^c. □
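Lemma 2.45 is easy to probe numerically; here is a sketch with a genuinely complex sequence cn = c + i/√n → c:

```python
import cmath

def power_limit(c, n):
    cn = c + 1j / n**0.5  # a sequence c_n → c
    return (1 + cn / n) ** n

c = -0.5 + 0.3j
for n in (10, 1000, 100_000):
    print(n, power_limit(c, n))

# the gap to e^c shrinks as n → ∞
assert abs(power_limit(c, 100_000) - cmath.exp(c)) < 1e-2
```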

Remark 2.46. There is a more refined version of the Central Limit Theorem that
does not require that the random variables be identically distributed, only indepen-
dent. More precisely, we have the following result of Lindeberg.
Suppose that (Xn)n≥1 is a sequence of independent random variables with zero
means and finite variances. We set Sn := X1 + · · · + Xn and s²n := Var[Sn].
Suppose that, ∀t > 0,
lim_{n→∞} (1/s²n) ∑_{k=1}^n E[ I_{|Xk|>t sn} Xk² ] = 0. (2.2.9)
Then (1/sn) Sn converges in distribution to a standard normal random variable.
Note that if the random variables Xn are also identically distributed with com-
mon variance σ², then s²n = nσ². Then
(1/s²n) ∑_{k=1}^n E[ I_{|Xk|>t sn} Xk² ] = (1/σ²) E[ I_{|X1|>tσ√n} X1² ] → 0
as n → ∞. Hence condition (2.2.9) is satisfied when the random variables are i.i.d.
For a proof of Lindeberg’s theorem we refer to [59, Sec. VIII.4].
For even more general versions of the CLT we refer to [72; 127]. □
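The Central Limit Theorem can be watched in action with a quick Monte Carlo experiment (a sketch, not from the book): sum i.i.d. Unif(0, 1) variables, standardize as in Theorem 2.44, and compare the empirical cdf of Zn with the standard normal cdf Φ(x) = (1 + erf(x/√2))/2:

```python
import bisect
import random
from math import erf, sqrt

random.seed(1)
mu, v = 0.5, 1 / 12          # mean and variance of Unif(0, 1)
n, n_samples = 30, 20_000

def z_n():
    # standardized sum  Z_n = (S_n - nµ)/√(nv)
    s = sum(random.random() for _ in range(n))
    return (s - n * mu) / sqrt(n * v)

def phi(x):
    # standard normal cdf
    return 0.5 * (1 + erf(x / sqrt(2)))

samples = sorted(z_n() for _ in range(n_samples))
for x in (-1.0, 0.0, 1.0):
    emp = bisect.bisect_right(samples, x) / n_samples
    print(x, round(emp, 3), round(phi(x), 3))
    assert abs(emp - phi(x)) < 0.02
```

Already for n = 30 the empirical cdf is within Monte Carlo noise of the Gaussian one.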

2.3 Concentration inequalities

Suppose that (Xn)n∈N is a sequence of i.i.d. random variables with mean 0. Let
Sn := X1 + · · · + Xn.
The Strong Law of Large Numbers shows that (1/n)Sn → 0 a.s. A concentration
inequality offers quantitative information on the probability that (1/n)Sn deviates
from 0 by a given amount ε. More concretely, it gives an upper bound for the
probability P[|Sn| > nε]. If the random variables Xn have finite second moments,
say σ² = Var[X1], then we have seen that Chebyshev’s inequality yields the estimate
P[|Sn| > nε] = P[S²n > n²ε²] < Var[Sn]/(n²ε²) = σ²/(nε²).
In the proof of Theorem 2.7 we have shown that if the variables Xn have a stronger
integrability property, namely E[X⁴n] < ∞, then there exists a constant C > 0 such
that for any ε > 0 and any n ∈ N we have
P[|Sn| > nε] ≤ C/(n²ε⁴),
showing that (1/n)Sn is even more concentrated around its mean. Loosely speaking,
we expect higher concentration around the mean if the Xn have lighter tails, i.e., if
the probabilities
P[|Xn| > x]
decay fast as x → ∞.
In this section we want to describe some quantitative results stating that, under
appropriate light-tail assumptions, for any ε > 0 the probability P[|Sn| > nε]
decays exponentially fast to 0 as n → ∞. The subject of concentration inequalities
has witnessed an explosive growth in the last three decades so we will only be able
to scratch the surface. For more on this subject we refer to [17].
Limit theorems 189

2.3.1 The Chernoff bound


Many useful concentration inequalities are based on the Chernoff method. Let us describe its basics.
 Suppose that X is a centered, i.e., mean zero, random variable such that
  M_X(λ) := E[e^{λX}] < ∞, ∀λ ∈ J,
where J is an open interval containing the origin. We set
  J_± := { λ ∈ J : ±λ > 0 }.
Note that this implies that X has moments of any order and thus it imposes severe restrictions on the tail of X. We define the cumulant of X to be the function
  Ψ_X : J → R, Ψ_X(λ) := log M_X(λ).
The function x ↦ e^{λx} is convex and Jensen's inequality shows
  E[e^{λX}] ≥ e^{λE[X]} = 1,
so Ψ_X(λ) ≥ 0.
Here is the key idea of Chernoff's method. For x > 0 we have
  P[X > x] = P[e^{λX} > e^{λx}] ≤ (1/e^{λx}) E[e^{λX}], ∀λ ∈ J_+,
where at the last step we used Markov's inequality. Hence
  P[X > x] ≤ e^{−( xλ − Ψ_X(λ) )}, ∀λ ∈ J_+.
Set
  I_+(x) := sup_{λ∈J_+} ( xλ − Ψ_X(λ) ).
We obtain in this fashion the Chernoff bound
  P[X > x] ≤ e^{−I_+(x)}, I_+(x) := sup_{λ∈J_+} ( xλ − Ψ_X(λ) ), ∀x > 0.  (2.3.1)
Note that I_+(x) ≥ 0 since Ψ_X(λ) ≥ 0. Arguing in a similar fashion we deduce
  P[X < x] ≤ e^{−I_−(x)}, I_−(x) := sup_{λ∈J_−} ( xλ − Ψ_X(λ) ), ∀x < 0.  (2.3.2)
 
More generally, if X has a nonzero mean µ, then X = X − µ is centered. If E eλX
exists for λ ∈ J, then
ΨX (λ) = ΨX̄ = ΨX (λ) − λµ,
and we deduce
P X > x + µ ≤ e−I+ (x) , I+ (x) := sup (x + µ)λ − ΨX (λ) , ∀x > 0
  
(2.3.3)
λ∈J+

and
P X > x + µ ≤ e−I− (x) , I− (x) := sup (x + µ)λ − ΨX (λ) , ∀x < 0. (2.3.4)
  
λ∈J−

Suppose that (X_n)_{n∈N} is a sequence of i.i.d. random variables such that
  M(λ) = M_{X_k}(λ) < ∞
for any λ in an open interval J containing 0. Set
  µ := E[X_k], S_n := X_1 + · · · + X_n.
Then
  E[S_n] = nµ, M_{S_n}(λ) = M(λ)^n, Ψ_{S_n}(λ) = nΨ(λ).
We deduce that
  sup_{λ∈J_+} ( (nx + nµ)λ − Ψ_{S_n}(λ) ) = nI_+(x), ∀x > 0,
and
  sup_{λ∈J_−} ( (nx + nµ)λ − Ψ_{S_n}(λ) ) = nI_−(x), ∀x < 0.
We deduce
  P[ (1/n)S_n − µ > x ] = P[S_n − nµ > nx] ≤ e^{−nI_+(x)}, ∀x > 0,  (2.3.5a)
  P[ (1/n)S_n − µ < x ] = P[S_n − nµ < nx] ≤ e^{−nI_−(x)}, ∀x < 0.  (2.3.5b)
We have reached a remarkable conclusion. The assumption M(λ) < ∞ for λ in an open neighborhood of the origin implies that the probability that the empirical mean (1/n)S_n deviates from the theoretical mean µ by a fixed amount x decays exponentially to 0 as n → ∞. In other words, (1/n)S_n is highly concentrated around its mean and the above inequalities quantify this fact.
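None of the following appears in the text; it is a quick numerical sketch of the Chernoff recipe. The helper `rate` (a name invented here) evaluates the supremum defining I(x) on a grid of λ's for a standard normal variable, where Ψ(λ) = λ²/2 and the supremum is known in closed form to be x²/2.

```python
import math

def rate(psi, x, lams):
    # Numerically evaluate I(x) = sup_lam (x*lam - psi(lam)) over a grid.
    return max(x * lam - psi(lam) for lam in lams)

# Cumulant of a standard normal: psi(lam) = lam^2 / 2; the supremum is
# attained at lam = x and equals x^2 / 2.
psi_normal = lambda lam: lam * lam / 2.0
grid = [k / 1000.0 for k in range(1, 5000)]  # grid inside J_+ = (0, infinity)

x = 1.5
i_numeric = rate(psi_normal, x, grid)
i_exact = x * x / 2.0
```

The grid value can only undershoot the true supremum, so agreement up to the grid resolution is the expected outcome.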
To gain some more insight on the above estimates it is useful to list a few properties of the function I_+(x).

Proposition 2.47. Suppose that the centered random variable X satisfies
  M_X(λ) = E[e^{λX}] < ∞, ∀λ ∈ J,
where J ⊂ R is an open interval containing 0. Then the following hold.

(i) M_X(0) = 1, M′_X(0) = 0, M″_X(0) = Var[X].
(ii) The function J ∋ λ ↦ Ψ_X(λ) ∈ R is convex and nonnegative. Moreover Ψ″_X(0) > 0.
(iii) The function
  I : R → [0, ∞], I(x) = sup_{λ∈J} ( λx − Ψ_X(λ) )
is convex. Moreover I(x) = I_X(x) = I_±(x) for ±x > 0.
(iv) I(x) > 0 if x ≠ 0.

Proof. (i) is an elementary computation.
 (ii) To prove that Ψ_X(λ) is convex let t_1, t_2 ∈ (0, 1) be such that t_1 + t_2 = 1. Then, using Hölder's inequality with p = 1/t_1 and q = 1/t_2 we deduce that for any λ_1, λ_2 ∈ R we have
  E[e^{t_1λ_1X + t_2λ_2X}] ≤ E[ (e^{t_1λ_1X})^{1/t_1} ]^{t_1} · E[ (e^{t_2λ_2X})^{1/t_2} ]^{t_2} = E[e^{λ_1X}]^{t_1} E[e^{λ_2X}]^{t_2}.
Taking the logarithm of both sides of the above inequality we obtain the convexity of Ψ_X(λ). Next observe that
  Ψ′_X(0) = M′_X(0)/M_X(0) = 0.
Since Ψ_X(λ) is convex, its graph sits above the tangent at λ = 0, so Ψ_X(λ) ≥ 0, ∀λ ∈ J.
 (iii) For t_1, t_2 ∈ (0, 1) such that t_1 + t_2 = 1 and for x_1, x_2 > 0 we have
  I_+(t_1x_1 + t_2x_2) = sup_{λ∈J_+} ( λ(t_1x_1 + t_2x_2) − Ψ_X(λ) )
  = sup_{λ∈J_+} ( t_1(λx_1 − Ψ_X(λ)) + t_2(λx_2 − Ψ_X(λ)) ) ≤ t_1I_+(x_1) + t_2I_+(x_2).
Observe that for x > 0 we have
  λx − Ψ_X(λ) ≤ 0, ∀λ ≤ 0,
proving that
  I(x) = sup_{λ∈J} ( λx − Ψ_X(λ) ) = sup_{λ∈J_+} ( λx − Ψ_X(λ) ).
 (iv) Observe that
  Ψ″_X(λ) = ( M″_X(λ)M_X(λ) − M′_X(λ)² ) / M_X(λ)²  (2.3.6)
so Ψ″_X(0) = M″_X(0) = Var[X] > 0. This proves that λx − Ψ_X(λ) > 0 for |λ| small and x ≠ 0, so I(x) > 0 if x ≠ 0. □

Remark 2.48. As explained in [134, §12], to any convex lower semicontinuous function f : R^n → (−∞, ∞] we can associate a conjugate
  f* : R^n → (−∞, ∞], f*(p) = sup_{x∈R^n} ( ⟨p, x⟩ − f(x) ),
where ⟨−, −⟩ denotes the canonical inner product in R^n. One can show that f* is also convex and lower semicontinuous and f = (f*)*. The conjugate f* is sometimes called the Fenchel-Legendre conjugate of f. Observe that I(x) is the conjugate of the convex function Ψ_X(λ). □

  
Example 2.49. Suppose that X ∼ Bin(p). Then E[X] = p, M_X(λ) = q + pe^λ. For x ∈ R we have
  f_x(λ) := (x + p)λ − Ψ_X(λ) = (x + p)λ − log(q + pe^λ),
  f′_x(λ) = x + p − pe^λ/(q + pe^λ),
and f′_x(λ) = 0 if
  p(x + p − 1)e^λ = −q(x + p), i.e., pe^λ = q(x + p)/(q − x).
This forces x ∈ (−p, q). In this case
  λ = log q − log p + log(x + p) − log(q − x) = log((x + p)/p) − log((q − x)/q),
  I(x) = (x + p) log((x + p)/p) − (x + p) log((q − x)/q) + log((q − x)/q)
   = (x + p) log((x + p)/p) + (q − x) log((q − x)/q). □
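The closed formula for I(x) obtained above can be sanity-checked by brute force. The sketch below (our own illustration, with made-up helper names and toy parameter values) maximizes f_x(λ) = (x + p)λ − log(q + pe^λ) on a grid and compares the result with the closed form.

```python
import math

p, q = 0.3, 0.7  # a Bernoulli parameter and its complement (toy values)

def rate_closed(x):
    # I(x) = (x+p) log((x+p)/p) + (q-x) log((q-x)/q), valid for x in (-p, q)
    return (x + p) * math.log((x + p) / p) + (q - x) * math.log((q - x) / q)

def rate_grid(x, step=1e-3, lo=-10.0, hi=10.0):
    # Brute-force sup of f_x(lam) = (x+p)*lam - log(q + p*e^lam) on a grid
    n = int((hi - lo) / step)
    return max((x + p) * (lo + k * step) - math.log(q + p * math.exp(lo + k * step))
               for k in range(n + 1))

x = 0.2
gap = abs(rate_closed(x) - rate_grid(x))  # tiny: limited only by grid resolution
```

Note also that I(0) = 0, consistent with the fact that no large-deviation penalty is paid for zero deviation from the mean.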

Remark 2.50. Suppose that P, Q are two Borel probability measures on R that are mutually absolutely continuous,
  P ≪ Q and Q ≪ P.
We denote by ρ_{P|Q} := dP/dQ the density of P with respect to Q. We define the Kullback-Leibler divergence
  D_KL(P ∥ Q) := ∫_R log(dP/dQ) P[dx].  (2.3.7)
(a) Suppose that P is the probability distribution Bin(p),
  P = qδ_0 + pδ_1.
For x ∈ (−p, q) consider the probability distribution
  Q_x = (q − x)δ_0 + (p + x)δ_1.
Then
  D_KL(Q_x ∥ P) = (x + p) log((x + p)/p) + (q − x) log((q − x)/q).
This is the rate I(x) we found in Example 2.49.
This is the rate I(x) we found in Example 2.49.
(b) Let X be a random variable with probability distribution Q and set Z := ρ_{P|Q}(X). Then
  E[Z] = ∫_R (dP/dQ) dQ = ∫_R dP = 1,
  E[Z log Z] = ∫_R (dP/dQ) log(dP/dQ) dQ = ∫_R log(dP/dQ) dP = D_KL(P ∥ Q).
Thus
  E[Z log Z] − E[Z] log E[Z] = D_KL(P ∥ Q),
showing that the Kullback-Leibler divergence is a special case of ϕ-entropy (1.3.12). More precisely, the above equality shows that
  D_KL(P ∥ Q) = H_ϕ[Z], ϕ(z) = z log z, z > 0.
In particular this yields Gibbs' inequality
  D_KL(P ∥ Q) ≥ 0.  (2.3.8)
Above, we could have used instead of the natural logarithm any logarithm in a base > 1 and reach the same conclusion. In particular, if we work with log_2 and we set
  D_2(P ∥ Q) = ∫_R log_2(dP/dQ) P[dx],
then Gibbs' inequality continues to hold in this case as well:
  D_2(P ∥ Q) ≥ 0.  (2.3.9)
Let X be a finite subset of R. Assume that we are given a function p : X → (0, 1] such that
  Σ_{x∈X} p(x) = 1,
so p defines the probability measure
  P_p = Σ_{x∈X} p(x)δ_x ∈ Prob(R).
Recall that its Shannon entropy (see (2.1.16)) is the quantity
  Ent_2[p] = −Σ_{x∈X} p(x) log_2 p(x).
The uniform probability measure on X is
  P_0 = Σ_{x∈X} p_0(x)δ_x = (1/|X|) Σ_{x∈X} δ_x.
Note that P_p and P_0 are mutually absolutely continuous. Gibbs' inequality shows that
  D_2(P_p ∥ P_0) ≥ 0.
On the other hand,
  D_2(P_p ∥ P_0) = Σ_{x∈X} p(x) log_2( |X| p(x) ) = log_2 |X| + Σ_{x∈X} p(x) log_2 p(x) ≥ 0.
We obtained again the inequality (2.1.17),
  Ent_2[p] ≤ log_2 |X| = Ent_2[p_0].  (2.3.10) □
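As an illustration (not part of the text), the divergence of a finite distribution from the uniform one can be computed directly, and the identity Ent(p) = log|X| − D(P_p ∥ P_0) together with Gibbs' inequality recovers the entropy bound (2.3.10). Natural logarithms are used below; the 3-point distribution is an arbitrary choice.

```python
import math

def kl(ps, qs):
    # Kullback-Leibler divergence D(P || Q) for finite distributions, in nats
    return sum(pi * math.log(pi / qi) for pi, qi in zip(ps, qs) if pi > 0)

P = [0.5, 0.25, 0.25]        # an arbitrary distribution on a 3-point set
U = [1 / 3, 1 / 3, 1 / 3]    # the uniform reference measure P_0

div = kl(P, U)                             # Gibbs: div >= 0
ent = -sum(pi * math.log(pi) for pi in P)  # Shannon entropy in nats
# D(P || P_0) = log|X| - Ent(P), hence Ent(P) <= log|X|
```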

Example 2.51. Suppose that X ∼ N(0, 1). Then, for any λ ∈ R,
  M_X(λ) = (1/√2π) ∫_R e^{λx} e^{−x²/2} dx = (1/√2π) ∫_R e^{−(x²−2λx+λ²)/2} e^{λ²/2} dx = e^{λ²/2}.
Note that Y = σX ∼ N(0, σ²) and
  M_Y(λ) = M_X(σλ) = e^{σ²λ²/2}, Ψ_Y(λ) = σ²λ²/2.
The supremum
  I(x) := sup_{λ∈R} ( xλ − σ²λ²/2 )
is achieved for λ = λ_x = x/σ² and it is equal to
  I(x) = x²/(2σ²).
In other words, if X ∼ N(0, σ²), then
  P[|X| > x] ≤ 2 max( P[X < −x], P[X > x] ) ≤ 2e^{−x²/(2σ²)}. □

2.3.2 Some applications


Often an explicit description of ΨX (λ) may either not be possible, or it could be
too complicated to be useful. That is why it is more practical to have simple ways
of producing upper bounds for the moment generating function.

Definition 2.52. A random variable X with mean µ is said to be subgaussian of type σ², and we write this X ∈ G(σ²), if
  Ψ_{X−µ}(λ) ≤ σ²λ²/2, ∀λ ∈ R ⟺ E[e^{λ(X−µ)}] ≤ e^{λ²σ²/2}, ∀λ ∈ R. □

Note that if X ∈ G(σ²) and ±x > 0, then
  sup_{±λ≥0} ( xλ − Ψ_{X−µ}(λ) ) ≥ x²/(2σ²),
and thus
  max( P[X − µ < −x], P[X − µ > x] ) ≤ e^{−x²/(2σ²)}, ∀x > 0,  (2.3.11a)
  P[|X − µ| > x] ≤ 2e^{−x²/(2σ²)}, ∀x > 0.  (2.3.11b)
Observe that if X_1, X_2 are independent random variables and X_k ∈ G(σ_k²), k = 1, 2, then
  a_1X_1 + a_2X_2 ∈ G(a_1²σ_1² + a_2²σ_2²), ∀a_1, a_2 ∈ R.

In particular, if X_1, . . . , X_n are centered, independent random variables in G(σ²), then we have
  (1/n)(X_1 + · · · + X_n) ∈ G(σ²/n),
and thus we obtain Hoeffding's inequality
  P[ |(1/n)(X_1 + · · · + X_n)| > x ] ≤ 2e^{−nx²/(2σ²)}, ∀x > 0.  (2.3.12)
Example 2.53. Suppose that R is a Rademacher random variable, i.e., it takes only the values ±1 with equal probabilities. Then
  E[e^{λR}] = cosh λ ≤ e^{λ²/2},
where the last inequality is obtained by inspecting the Taylor series of the two terms and using the inequality 2^n n! ≤ (2n)!. Hence R ∈ G(1). Similarly, cR ∈ G(1), ∀c ∈ [0, 1]. □
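Hoeffding's inequality (2.3.12) is easy to probe by simulation. The following sketch (ours, not the book's) estimates P[|S_n/n| > x] for Rademacher summands by Monte Carlo and compares it with the bound 2e^{−nx²/2}; the seed and sample sizes are arbitrary choices.

```python
import math
import random

random.seed(0)
n, trials, x = 200, 2000, 0.25

# Hoeffding bound (2.3.12) for Rademacher variables (sigma^2 = 1)
bound = 2 * math.exp(-n * x * x / 2)

exceed = sum(
    abs(sum(random.choice((-1, 1)) for _ in range(n))) > n * x
    for _ in range(trials)
)
freq = exceed / trials  # empirical P(|S_n/n| > x); sits below the bound
```

The empirical frequency is in fact noticeably smaller than the bound, which is not tight for Rademacher sums.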

For these estimates to be useful we need to have some simple ways of recognizing
subgaussian random variables.
 
Proposition 2.54. Suppose that X is a centered random variable, i.e., E[X] = 0. If there exists C > 0 such that
  E[X^{2k}] ≤ k! C^k, ∀k ∈ N,
then X ∈ G(4C).

Proof. We rely on a very useful symmetrization trick. Choose a random variable X′ independent of X but with the same distribution as X. Then the random variable Y = X − X′ is symmetric, i.e., Y and −Y have the same probability distribution. Observe next that since −X′ is centered we have
  E[e^{−λX′}] ≥ e^{−λE[X′]} = 1, ∀λ ∈ R.
We deduce
  E[e^{λX}] ≤ E[e^{λX}] · E[e^{−λX′}] = E[e^{λ(X−X′)}] = Σ_{k≥0} (λ^{2k}/(2k)!) E[(X − X′)^{2k}].
Since the function x^{2k} is convex we have
  (x + y)^{2k} ≤ 2^{2k−1}(x^{2k} + y^{2k}), ∀x, y ∈ R,
so
  E[(X − X′)^{2k}] ≤ 2^{2k} E[X^{2k}] ≤ 2^{2k} k! C^k = ( (2k)!/(2k−1)!! ) (2C)^k ≤ ( (2k)!/k! ) (2C)^k.
Hence
  E[e^{λX}] ≤ Σ_{k≥0} (2Cλ²)^k/k! = e^{2Cλ²}.
Hence X ∈ G(4C). □

Example 2.55. Suppose that R is a Rademacher random variable. Clearly
  E[R^{2k}] = 1 ≤ k! 1^k, ∀k ∈ N,
so that R ∈ G(4). We see that this estimate is not as good as the one in Example 2.53. □

The next result offers a sharper estimate under certain conditions.

Proposition 2.56 (Hoeffding's lemma). Suppose that X is a random variable such that X ∈ [a, b] a.s. Then X ∈ G( (b − a)²/4 ), i.e.,
  E[e^{λ(X−µ)}] ≤ e^{λ²(b−a)²/8}, ∀λ ∈ R.  (2.3.13)

Proof. Let us first observe that any random variable Y such that Y ∈ [a, b] a.s. satisfies
  Var[Y] ≤ (b − a)²/4.
Indeed, if µ = E[Y], then Y − µ ∈ [a − µ, b − µ]. If
  m = ( (a − µ) + (b − µ) )/2
is the midpoint of [a − µ, b − µ], then
  |(Y − µ) − m| ≤ (b − a)/2
and
  Var[Y] ≤ Var[Y] + m² = E[ ( (Y − µ) − m )² ] ≤ (b − a)²/4.
Observe next that we can assume that X is centered. Indeed, if µ = E[X], then the centered variable X − µ satisfies X − µ ∈ [a − µ, b − µ] and (b − a) = (b − µ) − (a − µ).
 Denote by P the probability distribution of X. For any λ ∈ R we denote by P_λ the probability measure on R given by
  P_λ[dx] = ( e^{λx}/E[e^{λX}] ) P[dx].  (2.3.14)
Note that P_λ is also supported on [a, b]. Since E[X] = 0 we have Ψ′_X(0) = 0. We deduce from (2.3.6) that
  Ψ″_X(λ) = (1/E[e^{λX}]) E[X²e^{λX}] − ( E[Xe^{λX}]/E[e^{λX}] )²
   = ∫_R x² P_λ[dx] − ( ∫_R x P_λ[dx] )².
The last term is the variance of a random variable Z with probability distribution P_λ. Since P_λ is supported in [a, b] we have Z ∈ [a, b] and we deduce
  Ψ″_X(λ) = Var[Z] ≤ (b − a)²/4.  (2.3.15)
Using the Taylor approximation with Lagrange remainder we deduce that for some ξ ∈ [0, λ] we have
  Ψ_X(λ) = Ψ_X(0) + λΨ′_X(0) + (λ²/2)Ψ″_X(ξ) ≤ λ²(b − a)²/8,
where Ψ_X(0) = Ψ′_X(0) = 0. Hence X ∈ G( (b − a)²/4 ). □
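Hoeffding's lemma can be checked against a case where the moment generating function is explicit: for U uniform on [0, 1] (so b − a = 1) the centered MGF is E[e^{λ(U−1/2)}] = sinh(λ/2)/(λ/2), and (2.3.13) says it is dominated by e^{λ²/8}. The numerical check below is our addition, not part of the text.

```python
import math

def mgf_centered_uniform(lam):
    # Exact MGF of U - 1/2 with U ~ Uniform[0, 1]: sinh(lam/2) / (lam/2)
    t = lam / 2
    return math.sinh(t) / t if t else 1.0

def hoeffding_bound(lam):
    # e^{lam^2 (b-a)^2 / 8} with b - a = 1
    return math.exp(lam * lam / 8)

ok = all(
    mgf_centered_uniform(lam) <= hoeffding_bound(lam)
    for lam in (k / 10 for k in range(-80, 81) if k != 0)
)
```

The inequality sinh(t)/t ≤ e^{t²/2} also follows termwise from the Taylor series, since (2k + 1)! ≥ 2^k k!.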

Hoeffding's Lemma shows that if R is a Rademacher random variable, then R ∈ G(1) as in Example 2.53, which is an improvement over Proposition 2.54. If R_1, . . . , R_n are independent Rademacher random variables, then for any c_1, . . . , c_n ∈ [−1, 1] we have c_kR_k ∈ G(1) and we deduce from Hoeffding's inequality that
  P[ |(1/n)(c_1R_1 + · · · + c_nR_n)| > r ] ≤ 2e^{−nr²/2}.  (2.3.16)
Example 2.57 (The Poincaré phenomenon). Suppose that X is a standard normal random variable and Y = X². Then
  M_Y(λ) = E[e^{λX²}] = (1/√2π) ∫_R e^{(2λ−1)x²/2} dx.
This integral converges only for λ < 1/2 and in this case it is equal to
  M_Y(λ) = 1/√(1 − 2λ).
In particular, X² is not subgaussian. Note that E[Y] = E[X²] = 1. Hence
  M_{Y−1}(λ) = e^{−λ}/√(1 − 2λ), Ψ_{Y−1}(λ) = −λ − (1/2) log(1 − 2λ).
Since Y ≥ 0 we have P[Y − 1 < y] = 0 for y ≤ −1. For y ∈ (−1, ∞) the supremum
  I(y) := sup_{λ<1/2} ( λy − Ψ_{Y−1}(λ) )
is achieved when
  (d/dλ)( λy − Ψ_{Y−1}(λ) ) = y + 1 − 1/(1 − 2λ) = 0.
Solving this equation for λ we get
  1 − 2λ = 1/(y + 1) ⟺ λ = y/(2(y + 1)),
and
  I(y) = y²/(2(y + 1)) + y/(2(y + 1)) − (1/2) log(1 + y) = y/2 − (1/2) log(y + 1) ≥ y²/4.
Hence
  P[Y − 1 < −y] ∨ P[Y − 1 > y] ≤ e^{−y²/4}, ∀y > 0.
Suppose now that
  X⃗ = (X_1, . . . , X_n)
is a Gaussian random vector, where the X_k are independent standard normal random variables. The square of its Euclidean norm is the chi-squared random variable
  Z_n = ∥X⃗∥² = Σ_{k=1}^n X_k².
We deduce that
  P[ (1/n)Z_n − 1 > y ] < e^{−ny²/4}, ∀y > 0,
  P[ (1/n)Z_n − 1 < y ] < e^{−ny²/4}, ∀y ∈ (−1, 0).
Thus, for large n the random vector (1/√n)X⃗ is highly concentrated around the unit sphere in R^n. This is sometimes referred to as the Poincaré phenomenon. In Exercise 2.52 we describe another proof of this result. □
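A quick simulation (not from the text) makes the Poincaré phenomenon visible: over many independent samples of a Gaussian vector in a high-dimensional space, ∥X⃗∥²/n never strays far from 1. The dimensions and seed below are arbitrary toy values.

```python
import random

random.seed(1)
n, trials = 2000, 200

# Sample Z_n / n = ||X||^2 / n for a standard Gaussian vector in R^n,
# and record the largest deviation from 1 observed over all trials.
max_dev = max(
    abs(sum(random.gauss(0, 1) ** 2 for _ in range(n)) / n - 1)
    for _ in range(trials)
)
# ||X|| / sqrt(n) stays very close to 1: the Poincaré phenomenon
```

The standard deviation of Z_n/n is √(2/n) ≈ 0.03 here, so even the worst of 200 trials remains within a few percent of 1.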

We conclude this section with a remarkable application of the Poincaré phenomenon. Consider a Gaussian random vector in R^n,
  X⃗ = (X_1, . . . , X_n),
where the components X_k are independent standard normal random variables. Note that for any unit vector u⃗ = (u_1, . . . , u_n) the inner product
  ⟨u⃗, X⃗⟩ = u_1X_1 + · · · + u_nX_n
is a mean zero Gaussian random variable. Moreover
  Var[⟨u⃗, X⃗⟩] = E[|⟨u⃗, X⃗⟩|²] = 1 = ∥u⃗∥².
Suppose that we are now given d such independent⁸ random vectors
  X⃗_j = (X_{1,j}, . . . , X_{n,j}), 1 ≤ j ≤ d.
We obtain a random map
  A : R^n → R^d, R^n ∋ u⃗ ↦ (Y_1, . . . , Y_d) := ( ⟨u⃗, X⃗_1⟩, . . . , ⟨u⃗, X⃗_d⟩ ).  (2.3.17)
If ∥u⃗∥ = 1, the components of Au⃗ are independent standard normal random variables, so that ∥Au⃗∥² is a chi-squared random variable. We set B := (1/√d)A and we deduce from Example 2.57 that for any ε ∈ (0, 1) and any unit vector u⃗ we have
  P[ |∥Bu⃗∥² − 1| > ε ] ≤ 2e^{−dε²/4}.
⁸ Independence is meant in the probabilistic sense, not linear independence.



Suppose now that we have a large cloud of points
  C = {x_1, . . . , x_N} ⊂ R^n.
For 1 ≤ i < j ≤ N we write v_ij = x_j − x_i. We deduce that
  P[ ∥Bv_ij∥/∥v_ij∥ ∉ [1 − ε, 1 + ε] for some 1 ≤ i < j ≤ N ] ≤ 2 (N choose 2) e^{−dε²/4} ≤ N² e^{−dε²/2}.
Now fix a confidence level 0 < p_0 < 1 and observe that the bound N² e^{−dε²/2} < p_0 holds as soon as
  dε² > 4 log(N/p_0), i.e., d > (4/ε²) log(N/p_0).
We have thus proved the following remarkable result.

Theorem 2.58 (Lindenstrauss-Johnson). Fix ε > 0, p_0 ∈ (0, 1) and a cloud C of N points in R^n. If
  d = d(N, ε, p_0) := (4/ε²) log(N/p_0),  (2.3.18)
then, with probability at least 1 − p_0, the random Gaussian map B = (1/√d)A, where A is described by (2.3.17), distorts very little the relative distances between the points in C, i.e.,
  (1 − ε)∥Bx − By∥ ≤ ∥x − y∥ ≤ (1 + ε)∥Bx − By∥. □

Remark 2.59. Let us highlight some remarkable features of the above result. Note first that the dimension d(N, ε, p_0) is independent of the dimension of the ambient space R^n where the cloud C resides. Moreover, d(N, ε, p_0) is substantially smaller than the size N of the cloud.
 For example, if we choose the confidence level p_0 = 10⁻³, the distortion factor ε = 10⁻¹ and the size of the cloud N = 10¹², then
  (4/ε²) log(N/p_0) = 60 · 10² log 10 < 14 · 10³ ≪ 10¹².
 The cloud C could be chosen in a Hilbert space and we can choose as ambient space the subspace span(C), which has dimension n ≤ N. In this case the vectors Y_k := (1/√N)X⃗_k, k = 1, . . . , d, have, with high confidence, norm close to 1:
  P[ |∥Y_k∥ − 1| > δ for some 1 ≤ k ≤ d ] ≤ 2de^{−Nδ²/4}.
They are also, with high confidence, mutually orthogonal. Indeed, Exercise 2.49 shows that for |r| < 1/2,
  P[ |⟨Y_i, Y_j⟩| > r for some i < j ] ≤ 2 (d choose 2) e^{−Nr²/12}.

This shows that the operator (1/√N)A is with high confidence very close to the orthogonal projection P_{X⃗_1,...,X⃗_d} onto the random d-dimensional⁹ subspace span{X⃗_1, . . . , X⃗_d}. This shows that, with high confidence, the operator
  √(N/d) P_{X⃗_1,...,X⃗_d}
distorts very little the distances between the points in C. The projected cloud has identical size, similar geometry, but lives in a subspace of much smaller dimension. □
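The random map (2.3.17) is straightforward to simulate. The sketch below (our illustration; all sizes and the seed are toy values) projects a small cloud of Gaussian points from R^n to R^d with B = (1/√d)A and measures the worst relative distortion of pairwise distances.

```python
import math
import random

random.seed(2)
n, d, N = 400, 200, 20  # ambient dim, target dim, cloud size (toy values)

cloud = [[random.gauss(0, 1) for _ in range(n)] for _ in range(N)]
rows = [[random.gauss(0, 1) for _ in range(n)] for _ in range(d)]
s = 1 / math.sqrt(d)

def project(x):
    # B x = (1/sqrt(d)) A x, with A the random Gaussian map (2.3.17)
    return [s * sum(r_i * x_i for r_i, x_i in zip(row, x)) for row in rows]

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

proj = [project(x) for x in cloud]
worst = max(
    abs(dist(proj[i], proj[j]) / dist(cloud[i], cloud[j]) - 1)
    for i in range(N) for j in range(i + 1, N)
)
# the worst pairwise distortion stays small even though d < n
```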

2.4 Uniform laws of large numbers

Fix a Borel probability measure µ on R. Suppose that
  X_n : (Ω, S, P) → R, n ∈ N,
is a sequence of i.i.d. random variables with common probability distribution µ. For any Borel set B ⊂ R the random variables I_B(X_n) are i.i.d. and have finite means
  m_B := P[X_1 ∈ B] = µ[B].
The Strong Law of Large Numbers shows that the empirical means
  M_n[B] := (1/n)( I_B(X_1) + · · · + I_B(X_n) ) = #{1 ≤ k ≤ n : X_k ∈ B}/n
converge a.s. to µ[B]. In particular, this provides an asymptotic confirmation of the "frequentist" interpretation of probability as the ratio of favorable cases to the number of possible cases.
 If we choose B of the form (−∞, x], then we obtain the empirical cdf
  F_n(x) = M_n[(−∞, x]] = #{1 ≤ k ≤ n : X_k ≤ x}/n.
This is a random quantity (variable), F_n(x) = F_n(x, ω), ω ∈ Ω. For each n ∈ N, the collection (F_n(x))_{x∈R} is an example of empirical process.
 For any x ∈ R, the random variable F_n(x) converges a.s. to F(x), where F is the cdf of µ,
  F(x) = µ[(−∞, x]].
For x ∈ R the set N_x ⊂ Ω on which F_n(x, ω) does not converge to F(x) is negligible but, since R is not countable, the union
  N = ∪_{x∈R} N_x
need not be negligible. In other words, the set of ω's such that the functions F_n(−, ω) do not converge pointwise to the function F(−) could conceivably be non-negligible. We will show that in fact this set is negligible.
⁹ It is not hard to see that dim span{X⃗_1, . . . , X⃗_d} = d a.s.

2.4.1 The Glivenko-Cantelli theorem


Define
  D_n = D_n^F : Ω → [0, ∞), D_n(ω) := sup_{x∈R} |F_n(x, ω) − F(x)|.  (2.4.1)
For a fixed ω ∈ Ω the sequence of functions (F_n(−, ω))_{n∈N} converges uniformly to F(−) if and only if D_n(ω) → 0. We will show that this is the case for almost all ω.
 Denote by U(y) the cdf of the uniform distribution on [0, 1],
  U(y) = 0 for y < 0, U(y) = y for y ∈ [0, 1], U(y) = 1 for y > 1,
and by Q the quantile of F defined in (1.2.5), Q : [0, 1] → R,
  Q(ℓ) := inf{ x : ℓ ≤ F(x) } = inf F⁻¹([ℓ, ∞]) = inf F⁻¹([ℓ, 1]).

Lemma 2.60. The function D_n^F is measurable and D_n^F ≤ D_n^U, with equality if F is continuous.

Proof. Let us first show that D_n is indeed measurable. We will show that
  D_n = sup_{x∈Q} |F_n(x) − F(x)|.  (2.4.2)
According to Proposition 1.17(iii) the quantity in the right-hand side is measurable. Fix ω ∈ Ω. There exists then a sequence of real numbers (x_m)_{m∈N} such that
  lim_{m→∞} |F_n(x_m, ω) − F(x_m)| = D_n(ω).
Now observe that the functions x ↦ F_n(x, ω), F(x) are right-continuous so there exists a sequence of rational numbers (q_m)_{m∈N} such that q_m > x_m and
  | |F_n(x_m, ω) − F(x_m)| − |F_n(q_m, ω) − F(q_m)| | < 1/m.
Hence
  lim_{m→∞} |F_n(q_m, ω) − F(q_m)| = lim_{m→∞} |F_n(x_m, ω) − F(x_m)| = D_n(ω),
thus proving that the functions (2.4.2) are measurable.
 Consider now a sequence of i.i.d. random variables (Y_n)_{n∈N} uniformly distributed on [0, 1]. Denote by U_n the associated empirical cdf-s,
  U_n(x) = (1/n) Σ_{k=1}^n I_{(−∞,x]}(Y_k).
Then X_n = Q(Y_n) are i.i.d. with common cdf F. Note that
  U_n(F(x)) − F(x) = (1/n) Σ_{k=1}^n I_{Y_k ≤ F(x)} − F(x)
  = (1/n) Σ_{k=1}^n I_{Q(Y_k) ≤ x} − F(x) = F_n(x) − F(x),
where we used (1.2.6) at the middle step. Thus
  D_n^F = sup_{x∈R} |F_n(x) − F(x)| = sup_{x∈R} |U_n(F(x)) − U(F(x))| ≤ sup_{y∈R} |U_n(y) − U(y)| = D_n^U.
Observe that if F is continuous, then ∀y ∈ (0, 1), ∃x ∈ R such that F(x) = y, so
  sup_{x∈R} |U_n(F(x)) − U(F(x))| = sup_{y∈R} |U_n(y) − U(y)|. □

Theorem 2.61 (Glivenko-Cantelli). Suppose that (X_n)_{n∈N} is a sequence of i.i.d. random variables with common distribution µ and cdf F. Denote by F_n(x) the empirical cdf-s
  F_n(x) = (1/n) Σ_{k=1}^n I_{(−∞,x]}(X_k).
Then, almost surely, F_n(x) converges uniformly to F(x), i.e.,
  D_n^F → 0 a.s. as n → ∞,
where D_n is defined by (2.4.1).

Proof. Lemma 2.60 shows that it suffices to prove the theorem only in the special case when the random variables are uniformly distributed. Thus we assume F = U.
 Fix a partition P of [0, 1], P = {0 = x_0 < x_1 < x_2 < · · · < x_m = 1}. Set
  ∥P∥ := max_{1≤k≤m} (x_k − x_{k−1}).
For x ∈ [x_{k−1}, x_k) and n ∈ N we have
  U_n(x_{k−1}) ≤ U_n(x) ≤ U_n(x_k),
so that
  x − U_n(x) ≤ x_k − U_n(x_{k−1}) ≤ ∥P∥ + ( x_{k−1} − U_n(x_{k−1}) ),
  U_n(x) − x ≤ U_n(x_k) − x_{k−1} ≤ ∥P∥ + ( U_n(x_k) − x_k ).
If we set
  D_n(P) := max_{1≤k≤m} |x_k − U_n(x_k)|,
we deduce that for any partition P of [0, 1] we have
  D_n(P) ≥ 0, D_n(P) − ∥P∥ ≤ D_n^U ≤ D_n(P) + ∥P∥.
Now consider a sequence P_k of partitions such that
  ∥P_k∥ < 1/k, ∀k ∈ N.
As we mentioned at the beginning of this subsection, the Strong Law of Large Numbers implies that, for each fixed x,
  U(x) − U_n(x) = x − U_n(x) → 0 a.s. as n → ∞.
Thus
  ∀k ∈ N: lim_{n→∞} D_n(P_k) = 0 a.s.
Hence
  −1/k < −∥P_k∥ ≤ lim inf_{n→∞} D_n^U ≤ lim sup_{n→∞} D_n^U ≤ ∥P_k∥ < 1/k, ∀k ∈ N.
Letting k → ∞ we deduce the desired conclusion. □
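The Glivenko-Cantelli theorem can be watched in action for uniform samples, where F = U and the supremum in (2.4.1) is attained at the jump points of F_n. The following sketch is our own illustration; the sample sizes and seed are arbitrary.

```python
import random

random.seed(3)

def sup_deviation(sample):
    # D_n = sup_x |F_n(x) - F(x)| for F = U, the cdf of Uniform[0, 1].
    # F_n jumps from k/n to (k+1)/n at the k-th order statistic, while
    # F(x) = x there, so the sup is attained at a sample point.
    xs = sorted(sample)
    n = len(xs)
    return max(max(abs((k + 1) / n - x), abs(k / n - x))
               for k, x in enumerate(xs))

d_100 = sup_deviation([random.random() for _ in range(100)])
d_10000 = sup_deviation([random.random() for _ in range(10000)])
# D_n shrinks (at rate ~ 1/sqrt(n), cf. Remark 2.62) as the sample grows
```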

Remark 2.62. Suppose that (X_n)_{n∈N} is a sequence of i.i.d. random variables with common cdf F(x). Form the empirical (cumulative) distribution function
  F_n(x) = (1/n) Σ_{k=1}^n I_{(−∞,x]}(X_k),
and the corresponding deviation D_n := sup_{x∈R} |F_n(x) − F(x)|. The Glivenko-Cantelli theorem shows that D_n → 0 a.s.
 On the other hand, observe that for each x ∈ R the random variables I_{(−∞,x]}(X_n) are i.i.d. Bernoulli random variables with success probability F(x). The central limit theorem shows
  √n( F_n(x) − F(x) ) ⇒ N( 0, F(x)(1 − F(x)) ).
The Kolmogorov-Smirnov theorem states that
  √n D_n ⇒ D_∞, P[D_∞ > c] = 2 Σ_{m≥1} (−1)^{m−1} e^{−2c²m²}.
For an "elementary" proof of this fact we refer to [57]. For a more sophisticated proof that reveals the significance of the strange series above we refer to [12] or [51]. □

2.4.2 VC-theory
We want to present a generalization of the Glivenko-Cantelli theorem based on
ideas pioneered by V. N. Vapnik and A. Ja. Cervonenkis [151] that turned out to
be very useful in machine learning. Our presentation follows [129, Chap. II]. For
more recent developments we refer to [51; 70; 150; 156].
Fix a Borel probability measure µ on X := R^N. Any sequence of i.i.d. random vectors
  X_n : (Ω, S, P) → X = R^N
with common distribution µ defines empirical probabilities
  P_n := (1/n) Σ_{k=1}^n δ_{X_k}.
The empirical probabilities are random measures on (X, B_X). More precisely, for any Borel subset B ⊂ X, P_n[B] is the random variable
  P_n[B] = (1/n) Σ_{k=1}^n I_B(X_k).
Suppose we are given a family F := (B_t)_{t∈T} of Borel subsets of X = R^N, N ≥ 1, parametrized by a set T. We assume T is a Borel subset of another Euclidean space R^p and we denote by B_T its Borel algebra. For example, we can choose X = R, B_t = (−∞, t], t ∈ T = R.
 For each n ∈ N we obtain a stochastic process parametrized by T,
  P_n : T × Ω → [0, 1], P_n(t, ω) = P_n[B_t](ω) = (1/n) Σ_{k=1}^n I_{B_t}(X_k(ω)).
For each t ∈ T we obtain a random variable
  P_n(t) : Ω → [0, 1], ω ↦ P_n(t, ω).
The collection of random variables (P_n(t))_{t∈T} is an example of empirical process.
Note that
  E[P_n(t)] = µ[B_t], Var[P_n(t)] = (1/n)Var[P_1(t)] = v_t/n,
where
  v_t := µ[B_t]( 1 − µ[B_t] ) ≤ 1/4.
Set Y_k(t) := I_{B_t}(X_k). The Strong Law of Large Numbers implies that
  Z_n(t) := P_n(t) − µ[B_t] = (1/n) Σ_{k=1}^n ( Y_k(t) − E[Y_k(t)] ) → 0 a.s. as n → ∞.
Moreover, Chebyshev's inequality shows that
  P[|Z_n(t)| > ε] ≤ v_t/(nε²) ≤ 1/(4nε²).  (2.4.3)
Can we conclude that Z_n(t) → 0 uniformly in t a.s., in the precise sense described in Glivenko-Cantelli's theorem?
 To proceed further we will need to make some further assumptions on the family (B_t)_{t∈T}. Later we will have a few things to say about their feasibility. Set
  D_n := sup_{t∈T} |Z_n(t)| : Ω → [0, 1].

Here is our first measure theoretic assumption.

M_1. The function D_n is measurable.

To prove that D_n → 0 a.s. we will employ a different strategy than before. More precisely, we intend to show that, under certain assumptions on the family (B_t)_{t∈T}, the probability P[D_n > ε] decays very fast as n → ∞, for any ε > 0. This will guarantee that the series
  Σ_{n∈N} P[D_n > ε]
is convergent for any ε > 0 and thus, according to Corollary 1.141, the sequence D_n converges a.s. to 0. To obtain these tail estimates we will rely on some clever symmetrization tricks.
 To state the first symmetrization result choose another sequence X′_n : Ω → X, n ∈ N, of i.i.d. random variables, independent of (X_n)_{n∈N}, but with the same distribution. Set
  Y′_k(t) := I_{B_t}(X′_k), Z′_n(t) := (1/n) Σ_{k=1}^n ( Y′_k(t) − µ[B_t] ), ∀n ∈ N, t ∈ T,
  D_{n,n} := sup_{t∈T} |Z′_n(t) − Z_n(t)|.  (2.4.4)
Equivalently,
  D_{n,n} = sup_{t∈T} (1/n) | Σ_{k=1}^n ( Y′_k(t) − Y_k(t) ) |.

Here are our next measure theoretic assumptions.

M′_1. The function D_{n,n} is measurable.
M_2. For any n > 0 and any ε > 0 there exists a measurable map
  τ : (Ω, σ(X_1, . . . , X_n)) → (T, B_T)
such that |Z_n(τ)| > ε on {D_n > ε}, i.e.,
  D_n(ω) > ε ⟹ |Z_n(τ(ω), ω)| > ε.  (2.4.5)

Lemma 2.63 (First symmetrization lemma).
  P[D_n > ε] ≤ 2P[D_{n,n} > ε/2], ∀ε > 0, ∀n > 2/ε².  (2.4.6)

Proof. Choose a measurable map τ : (Ω, σ(X_1, . . . , X_n)) → (T, B_T) satisfying M_2. Then τ is independent of Z′_n and we deduce from (2.4.3) that
  E[ I_{|Z′_n(τ)| ≤ ε/2} ∥ σ(X_1, . . . , X_n) ] = E[ I_{|Z′_n(t)| ≤ ε/2} ]|_{t = τ(X_1,...,X_n)} ≥ 1 − 1/(nε²),
  P[ |Z′_n(τ)| ≤ ε/2 ∥ D_n ] = E[ E[ I_{|Z′_n(τ)| ≤ ε/2} ∥ σ(X_1, . . . , X_n) ] ∥ D_n ] ≥ 1 − 1/(nε²).
Integrating over {D_n > ε} we deduce
  ( 1 − 1/(nε²) ) P[D_n > ε] ≤ P[ |Z′_n(τ)| ≤ ε/2, D_n > ε ]
  ≤ P[ |Z′_n(τ)| ≤ ε/2, |Z_n(τ)| > ε ] ≤ P[ |Z′_n(τ) − Z_n(τ)| > ε/2 ]
  ≤ P[ sup_{t∈T} |Z′_n(t) − Z_n(t)| > ε/2 ],
where we used (2.4.5) at the second step. The inequality (2.4.6) follows by observing that for n > 2/ε² we have 1 − 1/(nε²) > 1/2. □

Note that the variables (Y_n(t))_{n∈N} are independent Bernoulli random variables with success probability p_t = µ[B_t]. The random variables (Y′_n(t)) are also of the same kind and also independent of the Y's. The key gain is that the random variables
  Ξ_n(t) = Y′_n(t) − Y_n(t)
are symmetric, i.e., Ξ_n(t) and −Ξ_n(t) have the same distribution. They take only the values −1, 0, 1 with distributions
  P[Ξ_t = ±1] = p_t(1 − p_t), P[Ξ_t = 0] = 1 − 2p_t(1 − p_t).
The advantage of working with symmetric random variables will become apparent after we describe our second symmetrization trick, known as Rademacher symmetrization.
 Recall that a Rademacher random variable is a random variable that takes only the values ±1, with equal probabilities. Suppose that (R_n)_{n∈N} is a sequence of independent Rademacher random variables¹⁰ that are also independent of the variables X_n and X′_n.
 Observe that the random variables Ȳ_n := R_nY_n are also symmetric.
Observe that the random variables Y n := Rn Yn are also symmetric.

Lemma 2.64 (Rademacher symmetrization). For any n ∈ N we have
  P[ sup_{t∈T} (1/n)|Σ_{k=1}^n ( Y′_k(t) − Y_k(t) )| > ε/2 ] ≤ 2P[ sup_{t∈T} (1/n)|Σ_{k=1}^n Ȳ_k(t)| > ε/4 ].  (2.4.7)

Proof. The key observation is that, because Ξ_k(t) = Y′_k(t) − Y_k(t) is symmetric, it has the same distribution as R_kΞ_k(t). Set
  S_n(t) := Σ_{k=1}^n R_kY_k(t), S′_n(t) := Σ_{k=1}^n R_kY′_k(t), S̄_n(t) := Σ_{k=1}^n Ȳ_k(t).
¹⁰ Here we are making the tacit assumption that there exists such a sequence of random variables R_n defined on Ω. For example, if we choose Ω to be the probability space (X, µ^⊗N) ⊗ (X, µ^⊗N) ⊗ {−1, 1}^⊗N, all the above choices are possible. The choice of Ω is irrelevant because the Glivenko-Cantelli theorem is a result about (X, µ^⊗N).

Then
  P[ sup_{t∈T} (1/n)|Σ_{k=1}^n ( Y′_k(t) − Y_k(t) )| > ε/2 ] = P[ sup_{t∈T} (1/n)|S′_n(t) − S_n(t)| > ε/2 ]
  ≤ P[ sup_{t∈T} (1/n)|S_n(t)| > ε/4 ] + P[ sup_{t∈T} (1/n)|S′_n(t)| > ε/4 ] = 2P[ sup_{t∈T} (1/n)|S̄_n(t)| > ε/4 ],
where we used the fact that R_kY′_k(t) and R_kY_k(t) have the same distribution. □

Putting together all of the above we deduce
  P[D_n > ε] ≤ 4P[ sup_{t∈T} (1/n)|Σ_{k=1}^n R_kY_k(t)| > ε/4 ], ∀ε > 0, n > 2/ε².  (2.4.8)
To make further progress we condition on the variables (X_n) and we deduce
  P[ sup_{t∈T} (1/n)|Σ_{k=1}^n R_kY_k(t)| > ε/4 ]
   = ∫_{X^n} P[ sup_{t∈T} |S_t(x⃗)| > ε/4 ] µ^⊗n[dx_1 · · · dx_n],
where x⃗ := (x_1, . . . , x_n) ∈ X^n,
  S_t(x⃗) := (1/n) Σ_{k=1}^n R_k y_k(t, x⃗),
and
  y_k(t, x⃗) = I_{B_t}(x_k) ∈ {0, 1}, ∀k = 1, . . . , n, t ∈ T.
Hence
  P[D_n > ε] ≤ 4 ∫_{X^n} P[ sup_{t∈T} |S_t(x⃗)| > ε/4 ] µ^⊗n[dx_1 · · · dx_n].  (2.4.9)
For each n ∈ N, t ∈ T and x⃗ ∈ X^n we set I_n := {1, . . . , n},
  C_t(x⃗) := { k ∈ I_n : y_k(t, x⃗) = 1 } = { k ∈ I_n : x_k ∈ B_t }.
Roughly speaking, C_t(x⃗) = B_t ∩ {x_1, . . . , x_n}. Set
  C_n(x⃗) := { C ⊂ I_n : ∃t ∈ T, C = C_t(x⃗) }.
For every C ⊂ I_n we set
  S_C := (1/n) Σ_{k∈C} R_k,
so that S_t(x⃗) = S_{C_t(x⃗)}. Hence
  P[ sup_{t∈T} |S_t(x⃗)| > ε/4 ] = P[ max_{C∈C_n(x⃗)} |S_C| > ε/4 ] ≤ Σ_{C∈C_n(x⃗)} P[ |S_C| > ε/4 ].

We can now finally understand the role of the Rademacher symmetrization. The sums
  Σ_{k=1}^n R_k y_k(t, x⃗)
are of the type appearing in Hoeffding's inequality (2.3.12), where R_k y_k(t, x⃗) ∈ G(1) by the computation in Example 2.53. We deduce
  P[ |S_C| > ε/4 ] ≤ 2e^{−nε²/32}, ∀C ⊂ I_n.
We deduce
  P[ sup_{t∈T} |S_t(x⃗)| > ε/4 ] ≤ 2|C_n(x⃗)| e^{−nε²/32}.  (2.4.10)
Using this in (2.4.9) we deduce
\[
\mathbb{P}\big[\,D_n>\varepsilon\,\big]\le 8e^{-n\varepsilon^2/32}\int_{\mathcal{X}^n}\big|\mathcal{C}_n(\vec{x})\big|\,\mu^{\otimes n}\big[\,dx_1\cdots dx_n\,\big]. \tag{2.4.11}
\]
We have the rough bound |C_n(\vec{x})| ≤ 2^n, but it is not helpful. At this point we add our last and crucial assumption.
VC. The family F = (B_t)_{t∈T} satisfies the VC-condition.¹¹ This means that there exists d ∈ ℕ such that
\[
\sup_{\vec{x}\in\mathcal{X}^n}\big|\mathcal{C}_n(\vec{x})\big|=O(n^d)\ \text{ as }\ n\to\infty.
\]

With this assumption in place we deduce that there exists K > 0 such that
\[
2\,\big|\mathcal{C}_n(\vec{x})\big|\le K\big(n^d+1\big),\qquad \forall n\in\mathbb{N},\ \forall \vec{x}\in\mathcal{X}^n,
\]
so that
\[
\mathbb{P}\big[\,D_n>\varepsilon\,\big]\le 8Ke^{-n\varepsilon^2/32}\big(n^d+1\big). \tag{2.4.12}
\]
In the above estimate the constant K is independent of the distribution µ. Since
\[
\sum_{n\in\mathbb{N}}e^{-n\varepsilon^2/32}\big(n^d+1\big)<\infty,\qquad \forall \varepsilon>0,
\]
the Borel-Cantelli lemma implies that D_n → 0 a.s. We have thus proved the following wide ranging generalization of the Glivenko-Cantelli theorem.

Theorem 2.65 (Vapnik-Chervonenkis). Suppose that F = (B_t)_{t∈T} is a family of Borel subsets of X = R^N parametrized by a Borel subset T of some Euclidean space, and µ is a Borel probability measure on X. Assume that µ, F satisfy the conditions M1, M′1, M2.
Fix a sequence of independent random vectors X_n : Ω → X with common distribution µ. Form the empirical measures
\[
\mu_n:\Omega\times\mathcal{B}_{\mathcal{X}}\to[0,\infty],\qquad \mu_n^{\omega}\big[\,B\,\big]=\frac{1}{n}\sum_{k=1}^{n}\mathbb{I}_B\big(X_k(\omega)\big).
\]
If F satisfies the VC-condition, then, almost surely,
\[
\mu_n\big[\,B\,\big]\to\mu\big[\,B\,\big]\ \text{ as }\ n\to\infty
\]
uniformly in B ∈ F, i.e.,
\[
\lim_{n\to\infty}\sup_{B\in\mathcal{F}}\big|\,\mu_n\big[\,B\,\big]-\mu\big[\,B\,\big]\,\big|=0\ \text{ a.s.} \qquad\Box
\]
¹¹VC = Vapnik-Chervonenkis.
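In the simplest instance of this circle of ideas, the family of half-lines B_t = (−∞, t] in R, the quantity D_n is the Kolmogorov-Smirnov distance between the empirical and true distribution functions, and D_n → 0 a.s. is the classical Glivenko-Cantelli theorem. A minimal Monte Carlo sketch of this convergence (our choice µ = Exp(1) and all names are illustrative; NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(0)

def sup_deviation(sample):
    """D_n = sup_t |mu_n((-inf,t]) - mu((-inf,t])| for mu = Exp(1).

    For half-lines the sup is attained at a sample point, just before
    or just after a jump of the empirical CDF."""
    x = np.sort(sample)
    n = len(x)
    cdf = 1.0 - np.exp(-x)                 # true CDF of Exp(1) at the sample points
    upper = np.arange(1, n + 1) / n - cdf  # empirical CDF just after each jump
    lower = cdf - np.arange(0, n) / n      # ... and just before it
    return max(upper.max(), lower.max())

# D_n shrinks (roughly like n^{-1/2}) as n grows, uniformly over the family.
for n in [100, 1000, 10000]:
    d = np.mean([sup_deviation(rng.exponential(size=n)) for _ in range(20)])
    print(n, round(d, 4))
```

The exponential bound (2.4.12) is visible in how quickly the averaged deviations decay with n.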
Remark 2.66. (a) The technical assumptions M1, M′1, M2 are measure-theoretic in nature and are automatically satisfied if the space of parameters T is countable. There are quite general (and very technical) results that guarantee that these conditions hold in a rather broad range of situations, [129, Appendix C].
There are more sophisticated ways of bypassing M1 and M′1 and we refer to [51], [70] or [150] for details. Section 1.1 in [150] does a particularly clear and efficient job of describing these measurability issues and the methods that were proposed over the years to circumvent them.
If one assumes the condition VC, one can bypass assumption M2 by using a weaker form of the first symmetrization trick. Observe first that
\[
\mathbb{E}\big[\,D_n\,\big]\le \mathbb{E}\big[\,D_{n,n}\,\big]. \tag{2.4.13}
\]
Indeed,
\[
\Big|\frac{1}{n}\sum_{k=1}^{n}\big(Y_k(t)-\mathbb{E}\big[Y_k(t)\big]\big)\Big|
=\Big|\,\mathbb{E}\Big[\frac{1}{n}\sum_{k=1}^{n}\big(Y_k(t)-Y_k'(t)\big)\,\Big\|\,Y_k,\ 1\le k\le n\Big]\Big|
\]
\[
\le \mathbb{E}\Big[\frac{1}{n}\Big|\sum_{k=1}^{n}\big(Y_k(t)-Y_k'(t)\big)\Big|\,\Big\|\,Y_k,\ 1\le k\le n\Big]
\le \mathbb{E}\big[\,D_{n,n}\,\big\|\,Y_k,\ 1\le k\le n\,\big].
\]
Hence
\[
D_n=\sup_{t}\Big|\frac{1}{n}\sum_{k=1}^{n}\big(Y_k(t)-\mathbb{E}\big[Y_k(t)\big]\big)\Big|\le \mathbb{E}\big[\,D_{n,n}\,\big\|\,Y_k,\ 1\le k\le n\,\big].
\]
By taking the expectations of both sides of the above inequality we obtain (2.4.13). A similar argument as in the proof of the Rademacher symmetrization lemma yields
\[
\mathbb{E}\big[\,D_{n,n}\,\big]\le 2\,\mathbb{E}\Big[\underbrace{\sup_{t\in T}\frac{1}{n}\Big|\sum_{k=1}^{n}R_kY_k(t)\Big|}_{=:\mathcal{R}_n(T)}\Big].
\]

The sequence Rn (T ) is called the Rademacher complexity of the family (Bt )t∈T .
Azuma’s inequality (3.1.14), a refined concentration inequality, shows that Dn
is highly concentrated around its mean. The VC condition can be used to show
that the Rademacher complexity goes to 0 as n → ∞. Thus the mean of Dn goes
to 0 as n → ∞. Combining these facts one can obtain an inequality very similar to
(2.4.11). For details we refer to [156, Sec. 4.2] or Exercise 3.20.
(b) One can obtain bounds for the tails of D_n by a Chernoff-like technique, obtaining bounds for E[Φ(D_n)], where Φ : [0, ∞) → ℝ is a convex increasing function; see Exercise 2.53. We refer to [130] or [150] for details. □
The key assumption is VC and we want to discuss it in some detail and describe
several nontrivial examples of families of sets satisfying this condition.
Fix an ambient space X and a family F ⊂ 2^X of subsets of X. The shadow of F on a subset A is the family
\[
\mathcal{F}_A:=\big\{\,F\cap A;\ F\in\mathcal{F}\,\big\}\subset 2^A.
\]
Note that for a finite set A we have
\[
|\mathcal{F}_A|\le 2^{|A|}.
\]
When we have equality above we say that A is shattered by F. Thus, A is shattered
by F if any subset of A is in the shadow of F. We set
\[
s_{\mathcal{F}}(n):=\max\big\{\,|\mathcal{F}_A|;\ |A|=n\,\big\}.
\]
Thus s_F(n) is the size of the largest shadow on a subset of X of cardinality n. Note that s_F(n) ≤ 2^n.
For a nonempty F we define its VC-dimension to be
\[
\dim_{VC}(\mathcal{F}):=\max\big\{\,n\in\mathbb{N};\ s_{\mathcal{F}}(n)=2^n\,\big\}.
\]
Thus, any subset A such that |A| ≤ dim_VC(F) is shattered by F. In other words, if k = dim_VC(F), then for any n ≤ k we have
\[
s_{\mathcal{F}}(n)=2^n=\sum_{j=0}^{\min(n,k)}\binom{n}{j}.
\]

We have the following remarkable dichotomy. For a proof we refer to [51, Thm. 4.1.2] or [70, Thm. 3.6.3].

Theorem 2.67 (Sauer Lemma). If dim_VC(F) = k < ∞, then
\[
\forall n>k:\quad s_{\mathcal{F}}(n)\le P_k(n):=\sum_{j=0}^{\min(n,k)}\binom{n}{j}.
\]
Note that P_k(n) is a polynomial of degree k in n. □
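The dichotomy is stark: either s_F(n) = 2^n for all n, or s_F(n) is dominated by the polynomial P_k(n). The bound itself is one line of code (a quick sketch; `sauer_bound` is our name):

```python
from math import comb

def sauer_bound(n, k):
    """P_k(n) = sum_{j=0}^{min(n,k)} C(n, j), Sauer's bound on s_F(n)."""
    return sum(comb(n, j) for j in range(min(n, k) + 1))

# For n <= k the bound equals 2^n (an n-point set may still be shattered);
# past n = k it grows only polynomially, like n^k / k!.
for n in [3, 5, 10, 20, 40]:
    print(n, sauer_bound(n, 3), 2 ** n)
```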

Define the density of F to be
\[
\operatorname{dens}(\mathcal{F})=\inf\big\{\,r>0;\ s_{\mathcal{F}}(n)=O(n^r)\ \text{as}\ n\to\infty\,\big\}.
\]
We see that the family F satisfies the condition VC if and only if dens(F) < ∞. Sauer's lemma implies that dens(F) ≤ dim_VC(F), so that
\[
\operatorname{dens}(\mathcal{F})<\infty\ \Longleftrightarrow\ \dim_{VC}(\mathcal{F})<\infty.
\]
We see that a family F satisfies the condition VC if and only if its VC-dimension
is finite. A family with finite VC-dimension is called a VC-family.
Note that dimV C (F) < k if and only if any set A ⊂ X of cardinality k contains
a subset A0 with the property that any set in F that contains A0 also contains an
element in A\A0 . Intuitively, the sets in F cannot separate A0 from its complement
in A. Let us give some examples of VC families.
(i) Suppose that F consists of all the lower half-lines (−∞, t] ⊂ R, t ∈ R. Note
that if A = {a1 , a2 }, a1 < a2 , then any half-line that contains a2 must also
contain a1 so that dimV C (F) ≤ 1.
(ii) Suppose that F consists of all the open half-spaces of the vector space R^n. A classical theorem of Radon [114, Thm. 1.3.1] shows that any subset A ⊂ R^n of cardinality n + 2 contains a subset A_0 that cannot be separated from its complement A \ A_0 by a hyperplane. Thus dim_VC(F) ≤ n + 1. With a bit more work one can show that in fact we have equality.
(iii) The above example is a special case of the following general result, [51, Thm. 4.2.1].

Theorem 2.68. Let X be a set. Suppose that V is a finite dimensional vector space of functions f : X → R. The space V defines two families of subsets of X,
\[
\mathcal{F}_V^{>0}=\big\{\,\{f>0\};\ f\in V\,\big\},\qquad \mathcal{F}_V^{\ge 0}=\big\{\,\{f\ge 0\};\ f\in V\,\big\}.
\]
Then
\[
\dim_{VC}\big(\mathcal{F}_V^{>0}\big)=\dim_{VC}\big(\mathcal{F}_V^{\ge 0}\big)=\dim V.
\]


 

(iv) If F_0, F_1 are two VC-families of subsets of a set X, then F_0 ∪ F_1 is also a VC-family. Moreover (see [51, Thm. 4.5.1])
\[
\operatorname{dens}(\mathcal{F}_0\cup\mathcal{F}_1)=\max\big(\operatorname{dens}(\mathcal{F}_0),\operatorname{dens}(\mathcal{F}_1)\big),
\]
and (see [51, Prop. 4.5.2])
\[
\dim_{VC}\big(\mathcal{F}_0\cup\mathcal{F}_1\big)\le \dim_{VC}\mathcal{F}_0+\dim_{VC}\mathcal{F}_1+1.
\]
The above inequality is optimal.


(v) If F_0, F_1 are two VC-families of subsets of a set X and we set
\[
\mathcal{F}_0\sqcap\mathcal{F}_1:=\big\{\,F_0\cap F_1;\ F_k\in\mathcal{F}_k,\ k=0,1\,\big\},
\]
then (see [51, Thm. 4.5.3])
\[
\operatorname{dens}\big(\mathcal{F}_0\sqcap\mathcal{F}_1\big)\le \operatorname{dens}(\mathcal{F}_0)+\operatorname{dens}(\mathcal{F}_1).
\]




(vi) If F_k is a VC-family of subsets of X_k, k = 0, 1, and we define
\[
\mathcal{F}_0\otimes\mathcal{F}_1:=\big\{\,F_0\times F_1;\ F_k\in\mathcal{F}_k,\ k=0,1\,\big\},
\]
then F_0 ⊗ F_1 is a VC-family of subsets of X_0 × X_1; see [51, Thm. 4.5.3]. Moreover
\[
\operatorname{dens}(\mathcal{F}_0\otimes\mathcal{F}_1)\le \operatorname{dens}(\mathcal{F}_0)+\operatorname{dens}(\mathcal{F}_1).
\]
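All the notions above (shadows, shattering, VC-dimension) can be checked by brute force on small finite configurations, which is a useful sanity check for examples such as (i). A sketch (finite point sets stand in for X; the function names are ours):

```python
from itertools import combinations

def shadow(family, A):
    """All traces F ∩ A, F in the family, as frozensets."""
    return {frozenset(F & A) for F in family}

def vc_dim(family, points):
    """Largest n such that some n-point subset of `points` is shattered."""
    d = 0
    for n in range(1, len(points) + 1):
        if any(len(shadow(family, set(A))) == 2 ** n
               for A in combinations(points, n)):
            d = n
    return d

pts = [1.0, 2.0, 3.0, 4.0]
# Half-lines (-inf, t] restricted to pts: no 2-point set is shattered.
half_lines = [{p for p in pts if p <= t} for t in [0.5, 1.5, 2.5, 3.5, 4.5]]
# Closed intervals [a, b]: pairs are shattered but no triple is
# (an interval cannot pick the two endpoints without the middle point).
intervals = [{p for p in pts if a <= p <= b}
             for a in [0.5, 1.5, 2.5, 3.5] for b in [1.5, 2.5, 3.5, 4.5]]
print(vc_dim(half_lines, pts))  # 1
print(vc_dim(intervals, pts))   # 2
```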


2.4.3 PAC learning


Let us explain why the above results are relevant in machine learning. Suppose that we are dealing with a 0-1 good/bad decision problem.
More precisely, we want to determine when a parameter x ∈ R^N is "good", i.e., determine the set G of "good" parameters. For example, we know from other considerations that a parameter x ∈ R is good if and only if x ≤ t_0, but we do not know the precise value of t_0. However, we have some information about the "good" set: it is of the form (−∞, t], t ∈ R.
More generally, for one reason or another we are led to believe that the set G belongs to a family (B_t)_{t∈T}, where T ⊂ R^p and B_t is a Borel subset of R^N. The family (B_t)_{t∈T} is called a hypothesis class. Thus we seek t_0 ∈ T such that B_{t_0} = G.
Consider a silly but suggestive example. Suppose that we want to decide when a banana is good. The goodness of a banana is decided by, say, three parameters: Color, Flavor, Softness, or CFS. Hence the good bananas are defined by some measurable subset in the CFS space. Suppose we have a collection F of categories of bananas, each category being defined by constraints in the CFS space.
We are allowed to ask an Oracle to pick a banana at random and answer the following yes/no questions. Does the chosen banana belong to a given category B_t? Is the chosen banana a good banana? However, the Oracle won't tell us which of the categories of bananas is the good category. Saying that a banana is good and belongs to a category B_t only says that the banana belongs to B_t ∩ G. We are supposed to learn the good category G by repeating the above experiment many times and recording the answers.
Technically, the Oracle puts at our disposal a sequence of i.i.d. R^N-valued random vectors (R^N plays the role of the CFS space)
\[
X_n:\big(\Omega,\mathcal{S},\mathbb{P}\big)\to\mathbb{R}^N,\qquad n\in\mathbb{N},
\]
and the values Y_n = I_G(X_n), n ∈ ℕ. However, we do not know the common probability distribution µ of these random vectors.
If we knew this probability distribution, then we could find G = B_{t_0} as a minimizer of the deterministic functional L_µ : T → [0, 1],
\[
L_\mu(t)=\frac{1}{n}\sum_{k=1}^{n}\mathbb{P}\big[\,\mathbb{I}_{B_t}(X_k)\ne Y_k\,\big]
=\frac{1}{n}\sum_{k=1}^{n}\mathbb{P}\big[\,\mathbb{I}_{B_t}(X_k)\ne \mathbb{I}_G(X_k)\,\big]
=\frac{1}{n}\sum_{k=1}^{n}\mathbb{P}\big[\,\mathbb{I}_{B_t\Delta G}(X_k)=1\,\big]=\mu\big[\,B_t\,\Delta\,G\,\big].
\]
In fact L_µ(t_0) = 0. Note that
\[
\mu\big[\,B_t\,\Delta\,G\,\big]=\mathbb{E}\big[\,\mathbb{I}_{B_t\Delta G}\,\big]=\mathbb{E}\big[\,\mathbb{I}_{B_t}+\mathbb{I}_G-2\,\mathbb{I}_{B_t\cap G}\,\big].
\]
The law of large numbers shows that P-a.s. we have
\[
\lim_{n\to\infty}\frac{1}{n}\sum_{k=1}^{n}\big(\mathbb{I}_{B_t}(X_k)+\mathbb{I}_G(X_k)-2\,\mathbb{I}_{B_t\cap G}(X_k)\big)
=\mathbb{E}\big[\,\mathbb{I}_{B_t}+\mathbb{I}_G-2\,\mathbb{I}_{B_t}\mathbb{I}_G\,\big]=L_\mu(t).
\]
Thus, even if we do not know µ we can estimate L_µ(t) using the random functionals
\[
L_n(t)=\frac{1}{n}\sum_{k=1}^{n}\big(\mathbb{I}_{B_t}(X_k)+\mathbb{I}_G(X_k)-2\,\mathbb{I}_{B_t\cap G}(X_k)\big)
=\frac{1}{n}\sum_{k=1}^{n}\big(\mathbb{I}_{B_t}(X_k)+Y_k-2\,Y_k\,\mathbb{I}_{B_t}(X_k)\big).
\]

If (B_t)_{t∈T} is a VC-family, then so is the family (B_t ∩ G)_{t∈T}, and (2.4.11) shows that there exist constants K, c > 0, independent of the mysterious µ, such that
\[
\mathbb{P}\Big[\sup_{t\in T}\big|\,L_n(t)-L_\mu(t)\,\big|>\varepsilon\Big]\le Ke^{-cn\varepsilon^2},\qquad \forall n.
\]

Thus, if we ask the Oracle to give us a large sample (x_1, y_1), ..., (x_n, y_n) of (X_1, Y_1), ..., (X_n, Y_n), we obtain a deterministic functional
\[
L_n(t;x_1,\dots,x_n)=\frac{1}{n}\sum_{k=1}^{n}\big(\mathbb{I}_{B_t}(x_k)+y_k-2\,\mathbb{I}_{B_t}(x_k)\,y_k\big).
\]
If we find t_n such that L_n(t_n; x_1, ..., x_n) < ε/2, then
\[
\mathbb{P}\big[\,L_\mu(t_n)>\varepsilon\,\big]\le \mathbb{P}\big[\,|L_n(t_n)-L_\mu(t_n)|>\varepsilon/2\,\big]\le Ke^{-cn\varepsilon^2/4}.
\]

Thus, for large n, Ln (tn ) is, with high confidence, within ε of the absolute minimum
LP (t0 ) = 0. Hopefully, this signifies that tn is close to t0 . In the language of machine
learning we say that the hypothesis class (Bt )t∈T is PAC learnable, where PAC
stands for Probably Approximatively Correct. For more details we refer to [139;
152].
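For the threshold example that opened this subsection (G = (−∞, t_0], hypothesis class B_t = (−∞, t]) the whole learning scheme fits in a few lines. A hedged sketch on synthetic data (the choice µ = U[0, 1], the sample size and all names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
t0 = 0.7                  # the unknown "good" threshold: G = (-inf, t0]

# The Oracle hands us a sample x_k ~ mu (here mu = U[0,1], also unknown
# to the learner) together with the labels y_k = I_G(x_k).
n = 5000
x = rng.uniform(size=n)
y = x <= t0

def empirical_loss(t):
    """L_n(t) = fraction of sample points misclassified by B_t = (-inf, t]."""
    return np.mean((x <= t) != y)

# Minimize L_n over the hypothesis class; the sample points themselves
# suffice as candidate thresholds.
candidates = np.sort(x)
t_n = candidates[np.argmin([empirical_loss(t) for t in candidates])]
print(abs(t_n - t0))  # small for large n, with high probability
```

The uniform deviation bound guarantees that this empirical minimizer has true loss within ε of 0, with confidence 1 − Ke^{−cnε²}, whatever µ is.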

Remark 2.69. The results in this section only scratch the surface of the vast
subject concerned with the limits of empirical processes. We have limited our
presentation to 0-1-functions. The theory is more general than that.
Suppose that (U, U) is a measurable space and
\[
X_n:\big(\Omega,\mathcal{S},\mu\big)\to(U,\mathcal{U})
\]
is a sequence of i.i.d. measurable maps with common distribution P = (X_n)_# µ, ∀n. Fix a family F of bounded measurable functions U → R. We obtain a random measure
\[
P_n:=\frac{1}{n}\sum_{k=1}^{n}\delta_{X_k}.
\]

We obtain a stochastic process parametrized by f ∈ F,
\[
\big(P_n-P\big)f:=\frac{1}{n}\sum_{k=1}^{n}\big(f(X_k)-\mathbb{E}\big[f(X_k)\big]\big)\in L^\infty\big(\Omega,\mathcal{S},\mu\big),\qquad f\in\mathcal{F}.
\]
When F consists of indicator functions of measurable sets we obtain the situation described in this section.
For each f the SLLN shows that
\[
\big(P_n-P\big)f\to 0\ \text{ a.s.},
\]
while the CLT shows that
\[
\sqrt{n}\,\big(P_n-P\big)f\Rightarrow N\big(0,v(f)\big),\qquad v(f):=\operatorname{Var}\big[f(X_n)\big],\ \forall n.
\]
What can be said about the limit of the process P_n − P?
Just like there are different flavors of convergence of random variables, there are many ways in which stochastic processes can converge. Various measurability issues make empirical processes trickier to handle. We refer to [51; 70; 129; 150; 156] for more details about this problem. □

2.5 The Brownian motion

The Brownian motion bears the name of its discoverer, the botanist R. Brown, who observed in 1827 the chaotic motion of a particle of pollen in a fluid. Its study took off at the beginning of the 20th century and has since witnessed dramatic growth. It popped up in many branches of science and has led to the development of many new branches of mathematics. In the theory of stochastic processes it plays a role similar to the role of Gaussian random variables in classical probability. It is such a fundamental and rich object that I believe any student learning the basic principles of probability needs to have a minimal introduction to it.
I drew my inspiration from many sources and I want to mention a few that I used more extensively, [12; 53; 103; 106; 136; 145]. My approach is not the most "efficient" one since I wanted to use the discussion of the Brownian motion as an opportunity to introduce the reader to several other important concepts concerning stochastic processes.

2.5.1 Heuristics
To get a grasp on the Brownian motion on a line, we consider first a discretization. We assume that the pollen particle performs a random walk along the line starting at the origin. Every unit of time τ it moves a distance δ to the right or to the left, with equal probabilities. We denote by S_n^{δ,τ} its location after n steps or, equivalently, its location at time nτ, assuming we start the clock when the motion begins.
When δ = τ = 1 we obtain the standard random walk on ℤ,
\[
S_n^{1,1}=S_n:=\sum_{k=1}^{n}X_k,
\]
where (X_n)_{n≥1} is a sequence of independent Rademacher variables, i.e., random variables taking the values ±1 with equal probabilities.
We assume that during the (n + 1)-th jump the particle travels with constant speed 1, so we can assume that its location at time t ∈ [n, n + 1) is
\[
W^1(t)=S_n+(t-n)X_{n+1}=S_{\lfloor t\rfloor}+\big(t-\lfloor t\rfloor\big)X_{\lfloor t\rfloor+1}.
\]
If we sample the random variables (X_n), then W^1(t) is a piecewise linear function with linear pieces of slopes ±1. Its graph is a zig-zag of the type depicted in Figure 2.3.

Fig. 2.3 The zig-zag depicting a random walk.

Suppose now that the pollen particle performs these random jumps at a much faster rate, say ν jumps per second, and the size (in absolute value) of each jump is δ meters. We choose δ to depend on the frequency ν and we intend to let ν → ∞. Assuming that during a jump its speed is constant, we deduce that this speed is δν meters per second and its location at time t will be
\[
W^{\nu,\delta}(t)=\delta S_{\lfloor\nu t\rfloor}+\underbrace{\delta\big(\nu t-\lfloor\nu t\rfloor\big)X_{\lfloor\nu t\rfloor+1}}_{=:R^{\nu,\delta}(t)}.
\]
To understand this formula observe that in the time interval [0, t] the particle performed ⌊νt⌋ complete jumps of size δ. It completed the last one at time ⌊νt⌋/ν. From this moment to t it travels in the direction X_{⌊νt⌋+1} with speed δν for a duration of time t − ⌊νt⌋/ν.
Assuming that in finite time the particle will stay within a bounded region, it is reasonable to assume that
\[
\forall t,\qquad \sup_{\nu}\mathbb{E}\big[\,W^{\nu,\delta}(t)^2\,\big]<\infty. \tag{2.5.1}
\]
Now observe that δS_{⌊νt⌋} and R^{ν,δ}(t) are mean zero independent random variables so that
\[
\mathbb{E}\big[\,W^{\nu,\delta}(t)^2\,\big]=\delta^2\,\mathbb{E}\big[\,S_{\lfloor\nu t\rfloor}^2\,\big]+\mathbb{E}\big[\,R^{\nu,\delta}(t)^2\,\big]=\delta^2\lfloor\nu t\rfloor+\mathbb{E}\big[\,R^{\nu,\delta}(t)^2\,\big].
\]
Clearly E[R^{ν,δ}(t)²] ∈ [0, δ²], so for (2.5.1) to hold we need
\[
\sup_{\nu}\delta^2\nu<\infty.
\]
We achieve this by setting δ = ν^{-1/2} and we set
\[
W^{\nu}(t):=W^{\nu,\nu^{-1/2}}(t)=\nu^{-1/2}S_{\lfloor\nu t\rfloor}+R^{\nu}(t),\qquad
R^{\nu}(t):=\nu^{-1/2}\big(\nu t-\lfloor\nu t\rfloor\big)X_{\lfloor\nu t\rfloor+1}. \tag{2.5.2}
\]
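The scaling δ = ν^{−1/2} can be tested numerically: with this choice the variance of W^ν(t) stabilizes near t as ν grows. A small simulation sketch of (2.5.2) (NumPy assumed; all names are ours):

```python
import numpy as np

rng = np.random.default_rng(2)

def W_nu(t, nu, X):
    """Rescaled walk of (2.5.2): nu^{-1/2} S_{[nu t]} plus the
    linear-interpolation remainder R^nu(t)."""
    m = int(np.floor(nu * t))
    S = X[:m].sum()
    R = (nu * t - m) * X[m]
    return (S + R) / np.sqrt(nu)

nu, t = 10_000, 1.5
samples = np.array([W_nu(t, nu, rng.choice([-1, 1], size=int(nu * t) + 1))
                    for _ in range(2000)])
print(samples.mean(), samples.var())  # approx 0 and approx t = 1.5
```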
For each ν, the collection (W ν (t))t≥0 is a real valued random process parametrized
by [0, ∞). Think of it as a random real valued function defined on [0, ∞). It turns
out that the random processes (W ν (t))t≥0 have a sort of limit as ν → ∞. The next
result states this in a more precise form.

Proposition 2.70. Let 0 ≤ s < t. Then as ν → ∞ the random variable W^ν(t) − W^ν(s) converges in distribution to a Gaussian random variable with mean zero and variance t − s. In particular, since W^ν(0) = 0, we deduce that the limit
\[
W(t)=\lim_{\nu\to\infty}W^{\nu}(t)
\]
exists in distribution and is a Gaussian random variable with mean zero and variance t. Moreover, if
\[
0\le s_0<t_0\le s_1<t_1\le\cdots\le s_k<t_k,\qquad k\ge 1,
\]
then the increments
\[
W(t_0)-W(s_0),\ W(t_1)-W(s_1),\ \dots,\ W(t_k)-W(s_k)
\]
are independent.

Proof. Fix 0 ≤ s < t. For ν sufficiently large we have ⌊νs⌋ < ⌊νt⌋ and
\[
W^{\nu}(t)-W^{\nu}(s)=\underbrace{\nu^{-1/2}\big(S_{\lfloor\nu t\rfloor}-S_{\lfloor\nu s\rfloor}\big)}_{Y_\nu}+\underbrace{\big(R^{\nu}(t)-R^{\nu}(s)\big)}_{Z_\nu}.
\]
Observe first that
\[
\lim_{\nu\to\infty}\mathbb{E}\big[\,Z_\nu^2\,\big]=0.
\]
In particular, this shows that Z_ν converges in probability to 0. On the other hand,
\[
Y_\nu=\frac{\sqrt{\lfloor\nu t\rfloor-\lfloor\nu s\rfloor}}{\sqrt{\nu}}\cdot\frac{1}{\sqrt{\lfloor\nu t\rfloor-\lfloor\nu s\rfloor}}\sum_{k=\lfloor\nu s\rfloor+1}^{\lfloor\nu t\rfloor}X_k.
\]
The Central Limit Theorem shows that the second factor converges in distribution to a standard normal random variable. Since
\[
\lim_{\nu\to\infty}\frac{\sqrt{\lfloor\nu t\rfloor-\lfloor\nu s\rfloor}}{\sqrt{\nu}}=\sqrt{t-s},
\]
we deduce that Y_ν converges in distribution to a Gaussian random variable with mean zero and variance t − s. Invoking Slutsky's theorem (Theorem 2.30) we deduce that Y_ν + Z_ν converges in distribution to a Gaussian random variable with mean zero and variance t − s.
Now let
0 ≤ s0 < t0 ≤ s1 < t1 ≤ · · · ≤ sk < tk , k ≥ 1.
For large ν the random variables
\[
\nu^{-1/2}\big(S_{\lfloor\nu t_j\rfloor}-S_{\lfloor\nu s_j\rfloor}\big),\qquad j=0,1,\dots,k,
\]
are independent and the above argument shows that they converge in law to the Gaussians
\[
W(t_j)-W(s_j),\qquad j=0,1,\dots,k.
\]
Corollary 2.29 implies that these increments are also independent. □

Definition 2.71 (Pre-Brownian motion). A pre-Brownian motion on [0, ∞) is a collection of real valued random variables (W(t))_{t≥0} with the following properties.

(i) W(0) = 0.
(ii) For any 0 ≤ s < t the increment W(t) − W(s) is a Gaussian random variable with mean zero and variance t − s.
(iii) For any
\[
0\le s_0<t_0\le s_1<t_1\le\cdots\le s_k<t_k,\qquad k\ge 1,
\]
the increments
\[
W(t_0)-W(s_0),\ W(t_1)-W(s_1),\ \dots,\ W(t_k)-W(s_k)
\]
are independent.

A pre-Brownian motion on [0, 1] is a collection of real valued random variables (W(t))_{t∈[0,1]} satisfying (i)-(iii) above with the s's and t's in [0, 1]. □

We have thus proved that a suitable rescaling of the standard random walk on ℤ converges to a pre-Brownian motion. In Figure 2.4 we have depicted the graph of a sample of W^ν(t) for ν = 100. Its graph is also a piecewise linear curve, but its linear pieces are much steeper, of slopes ±ν^{1/2}.
Suppose that (W(t))_{t≥0} is a pre-Brownian motion on [0, ∞). As explained in Subsection 1.5.1, this process defines a probability measure on R^{[0,∞)} equipped with the product sigma-algebra B^{[0,∞)}, called the distribution of the process. We want to show that any two pre-Brownian motions have the same distribution. This requires a small digression into the world of Gaussian measures and processes. In the next subsection we survey some basic facts concerning these concepts. In Exercise 2.54 we ask the reader to fill in some of the details of this digression.

2.5.2 Gaussian measures and processes


Let V be an n-dimensional real vector space. We denote by V ∗ its dual,
V ∗ = Hom(V, R). We have a natural pairing
h−, −i : V ∗ × V → R, hξ, xi := ξ(x), ∀ξ ∈ V ∗ , x ∈ V.
Fig. 2.4 Approximating the Brownian motion.

A Borel probability measure µ ∈ Prob(V) is called Gaussian if, for every linear functional ξ ∈ V*, the resulting random variable ξ : (V, B_V, µ) → R is Gaussian with mean m[ξ] and variance v[ξ], i.e., (see Example 1.120)
\[
P_\xi\big[\,dx\,\big]=\Gamma_{m[\xi],v[\xi]}\big[\,dx\,\big]=
\begin{cases}
\dfrac{1}{\sqrt{2\pi v[\xi]}}\,e^{-\frac{(x-m[\xi])^2}{2v[\xi]}}\,dx, & v[\xi]\ne 0,\\[2mm]
\delta_{m[\xi]}, & v[\xi]=0.
\end{cases}
\]
Equivalently, this means that the characteristic function of P_ξ is
\[
c_{P_\xi}(t)=\mathbb{E}\big[\,e^{it\xi}\,\big]=e^{-\frac{v[\xi]t^2}{2}+itm[\xi]}.
\]
A random vector X : (Ω, S, P) → V is called Gaussian if its probability distribution is a Gaussian measure on V. The random variables X_1, ..., X_n : (Ω, S, P) → R are called jointly Gaussian if the random vector
\[
\vec{X}:\Omega\to\mathbb{R}^n,\qquad \vec{X}(\omega)=\big(X_1(\omega),\dots,X_n(\omega)\big),
\]
is Gaussian. This means that for any real constants ξ_1, ..., ξ_n, the linear combination
\[
\xi_1X_1+\cdots+\xi_nX_n
\]
is a Gaussian random variable.
For any Gaussian measure µ on the finite dimensional vector space V with mean m_µ[ξ] and variance v_µ[ξ] we define its covariance form to be
\[
C=C_\mu:V^*\times V^*\to\mathbb{R},
\]
\[
C(\xi,\eta)=\frac{1}{4}\big(v_\mu\big[\xi+\eta\big]-v_\mu\big[\xi-\eta\big]\big)=\mathbb{E}_\mu\Big[\big(\xi-m_\mu[\xi]\big)\big(\eta-m_\mu[\eta]\big)\Big].
\]
Then (see Exercises 2.54(ii) + (iii)) the mean m_µ is a linear functional m_µ : V* → R and the covariance C_µ is a symmetric and positive semidefinite bilinear form on V*.

Proposition 2.72. A Gaussian measure on a vector space is uniquely determined by its mean and covariance form. □

The proof of the above result is based on the Fourier transform and its main steps are described in Exercise 2.54. In the sequel we will refer to mean zero Gaussian measures as centered.
Example 2.73. (a) If X_1, ..., X_n are independent Gaussian random variables, then any linear combination
\[
\xi_1X_1+\cdots+\xi_nX_n
\]
is also Gaussian. In particular, if X_1, ..., X_n are independent standard normal random variables, then the random vector \vec{X} = (X_1, ..., X_n) is Gaussian and its distribution is the standard Gaussian measure on R^n,
\[
\Gamma_1\big[\,dx\,\big]=\frac{1}{(2\pi)^{n/2}}e^{-\frac{1}{2}\|x\|^2}dx.
\]
(b) If \vec{X} = (X_1, ..., X_n) is a Gaussian random vector, then the mean of its distribution is the vector
\[
m\big[\,\vec{X}\,\big]:=\big(\mathbb{E}\big[X_1\big],\dots,\mathbb{E}\big[X_n\big]\big)
\]
and the covariance form of its distribution is the n × n matrix C with entries the covariances of the components, i.e.,
\[
C_{ij}=\operatorname{Cov}\big(X_i,X_j\big)=\mathbb{E}\Big[\big(X_i-\mathbb{E}\big[X_i\big]\big)\big(X_j-\mathbb{E}\big[X_j\big]\big)\Big],\qquad 1\le i,j\le n.
\]
(c) If µ is a Gaussian measure on a finite dimensional vector space U and A : U → V is a linear map to another vector space, then the pushforward A_#µ is also a Gaussian measure on V. In particular, if
\[
\vec{X}=(X_1,\dots,X_n)
\]
is a Gaussian vector and A is an m × n matrix, then the vector \vec{Y} = A\vec{X} is also Gaussian. Note that
\[
\vec{Y}=(Y_1,\dots,Y_m),\qquad Y_i=\sum_{j=1}^{n}a_{ij}X_j,\quad i=1,\dots,m.
\]

(d) Suppose (−, −) is an inner product on the vector space V with associated norm ‖−‖. We can then identify V* with V and the symmetric bilinear forms on V* with symmetric operators. The centered Gaussian measure on V whose covariance form is given by the inner product is
\[
\Gamma_1\big[\,dx\,\big]=\frac{1}{(2\pi)^{\dim V/2}}e^{-\frac{1}{2}\|x\|^2}dx.
\]
If A : V → V is a symmetric linear operator, then the pushforward A_#Γ_1 is the Gaussian measure with covariance form C = A². More precisely,
\[
C(v_1,v_2)=(Av_1,Av_2)=(A^2v_1,v_2).
\]
If, additionally, A is invertible, then
\[
A_\#\Gamma_1\big[\,dx\,\big]=\frac{1}{\sqrt{\det(2\pi A^2)}}\,e^{-\frac{1}{2}\|A^{-1}x\|^2}dx.
\]
We deduce that for any bilinear, symmetric, positive semidefinite form
\[
C:V^*\times V^*\to\mathbb{R}
\]
there exists a centered Gaussian measure admitting C as covariance form. Indeed, if we fix a metric on V then we can identify C with a symmetric, positive semidefinite operator C : V → V. If A = √C, then the Gaussian measure A_#Γ_1 is centered and has covariance form C. □
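This last observation is also the standard recipe for sampling a centered Gaussian with a prescribed covariance form: push the standard Gaussian forward by A = √C. A sketch (NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(3)

# A prescribed symmetric positive semidefinite covariance form C.
C = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# A = sqrt(C) via the spectral decomposition; then A_# Gamma_1 has
# covariance form A^2 = C.
w, U = np.linalg.eigh(C)
A = U @ np.diag(np.sqrt(w)) @ U.T

Z = rng.standard_normal((100_000, 2))   # samples of the standard Gaussian Gamma_1
Y = Z @ A.T                             # pushforward: y = A z

print(np.cov(Y.T))  # approx C
```

Any factorization C = AAᵀ (e.g. Cholesky) works equally well; the symmetric square root is simply the choice made in the text.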

Definition 2.74 (Gaussian processes). A Gaussian process parametrized by a set T is a collection of random variables (X(t))_{t∈T} defined on the same probability space (Ω, S, P) such that, for any finite subset I = {t_1, ..., t_n} ⊂ T, the random vector X_I := (X(t_1), ..., X(t_n)) is Gaussian. We denote by Γ_I its distribution. The process is called centered if E[X(t)] = 0, ∀t ∈ T. □
Suppose that (X(t))_{t∈T} is a Gaussian process. Its distribution is a probability measure on R^T uniquely determined by the Gaussian measures Γ_I, I a finite subset of T. In turn, these probability measures are uniquely determined by the mean function
\[
m:T\to\mathbb{R},\qquad m(t)=\mathbb{E}\big[\,X(t)\,\big],
\]
and the covariance kernel
\[
K:T\times T\to\mathbb{R},\qquad K(s,t)=\operatorname{Cov}\big(X(s),X(t)\big).
\]
Example 2.75. Suppose that (W(t))_{t≥0} is a pre-Brownian motion. For any 0 ≤ t_1 < · · · < t_n the random vector
\[
\big(X_1,\dots,X_n\big)=\big(W(t_1),\,W(t_2)-W(t_1),\,\dots,\,W(t_n)-W(t_{n-1})\big)
\]
is Gaussian since its components are independent Gaussian random variables; see Example 2.73(a). Observing that
\[
\big(W(t_1),\dots,W(t_n)\big)=\big(X_1,\,X_1+X_2,\,\dots,\,X_1+\cdots+X_n\big),
\]
we deduce from Example 2.73(c) that the vector (W(t_1), ..., W(t_n)) is also Gaussian, as a linear image of a Gaussian vector. Thus, any pre-Brownian motion is a Gaussian process. It is centered since all the random variables W(t) have mean zero. Its distribution is a probability measure on the path space R^{[0,∞)} uniquely determined by the covariance kernel
\[
K:[0,\infty)\times[0,\infty)\to\mathbb{R},\qquad K(s,t)=\mathbb{E}\big[\,W(s)W(t)\,\big].
\]
We claim that
\[
K(s,t)=\min(s,t),\qquad \forall s,t\ge 0. \tag{2.5.3}
\]
Indeed, assume without any loss of generality that s ≤ t. Then
\[
\mathbb{E}\big[\,W(s)W(t)\,\big]=\mathbb{E}\big[\,W(s)^2\,\big]+\mathbb{E}\big[\,W(s)\big(W(t)-W(s)\big)\,\big].
\]
The first summand is equal to s according to property (ii) of a pre-Brownian motion. Property (iii) implies
\[
\mathbb{E}\big[\,W(s)\big(W(t)-W(s)\big)\,\big]=\mathbb{E}\big[\,W(s)\,\big]\cdot\mathbb{E}\big[\,W(t)-W(s)\,\big]=0.
\]
Hence
\[
\mathbb{E}\big[\,W(s)W(t)\,\big]=s=\min(s,t).
\]
We see that all pre-Brownian motions have the same covariance kernel and thus they all have the same distribution.
Conversely, suppose that (X(t))_{t≥0} is a centered Gaussian process whose covariance kernel is given by (2.5.3). Then this process is a pre-Brownian motion. Indeed,
\[
\mathbb{E}\big[\,X(0)^2\,\big]=K(0,0)=0,
\]
so X(0) = 0 a.s. Next, observe that
\[
\mathbb{E}\big[\,X(t)^2\,\big]=K(t,t)=t.
\]
Each increment X(t) − X(s), s < t, is Gaussian and
\[
\operatorname{Var}\big(X(t)-X(s)\big)=K(t,t)-2K(s,t)+K(s,s)=t-s.
\]
Finally, suppose that 0 ≤ s_1 < t_1 ≤ · · · ≤ s_n < t_n. Then the n-dimensional random vector of increments
\[
\vec{Y}:=\big(X(t_1)-X(s_1),\,\dots,\,X(t_n)-X(s_n)\big)
\]
is centered Gaussian. The equality (2.5.3) implies that
\[
\operatorname{Cov}\big(Y_i,Y_j\big)=0,\qquad \forall i\ne j,
\]
and we deduce from Exercise 2.55 that the components of \vec{Y} are independent. This proves that (X(t))_{t≥0} is a pre-Brownian motion. □
Remark 2.76 (Brownian events). Consider an arbitrary pre-Brownian motion
\[
B_t:\big(\Omega,\mathcal{F},\mathbb{P}\big)\to\mathbb{R},\qquad t\ge 0.
\]
We define the σ-algebra of Brownian events to be the σ-subalgebra of F generated by the family of random variables B_t, t ≥ 0. Concretely, any Brownian event E has the form
\[
E=\Big\{\big(B_{\tau(n)}\big)_{n\in\mathbb{N}}\in S\Big\},
\]
where S ⊂ R^ℕ is a measurable subset and τ : ℕ → [0, ∞) is an injection.
The restriction of P to the σ-algebra of Brownian events is uniquely determined by the distributions of the Gaussian random vectors
\[
\big(B_{t_1},\dots,B_{t_n}\big),\qquad n\in\mathbb{N},\ t_1,\dots,t_n\ge 0.
\]
In turn, the distribution of such a vector is uniquely determined by the covariances: for s ≤ t,
\[
\mathbb{E}\big[\,B_sB_t\,\big]=\mathbb{E}\big[\,B_s\big(B_s+B_t-B_s\big)\,\big]=\mathbb{E}\big[\,B_s^2\,\big]=s=\min(s,t).
\]
We see that these distributions are independent of the choice of pre-Brownian motion B. This shows that if
\[
B^i:\big(\Omega_i,\mathcal{F}_i,\mathbb{P}_i\big)\to\mathbb{R},\qquad i=1,2,
\]
are two pre-Brownian motions, then for any measurable set S ⊂ R^ℕ and any injection τ : ℕ → [0, ∞) we have
\[
\mathbb{P}_1\Big[\big(B^1_{\tau(n)}\big)_{n\in\mathbb{N}}\in S\Big]=\mathbb{P}_2\Big[\big(B^2_{\tau(n)}\big)_{n\in\mathbb{N}}\in S\Big]. \qquad\Box
\]
Example 2.77 (Gaussian random functions). Suppose that f_n : T → R, n ∈ ℕ, is a sequence of functions defined on a set T and (X_n)_{n∈ℕ} is a sequence of independent standard normal random variables defined on a probability space (Ω, S, P). For each t ∈ T we have a series of random variables
\[
F(t)=\sum_{n\in\mathbb{N}}X_nf_n(t).
\]
We want to emphasize that F(t) also depends on the random parameter ω ∈ Ω,
\[
F(t)=F(t,\omega)=\sum_{n\in\mathbb{N}}X_n(\omega)f_n(t). \tag{2.5.4}
\]
The above is a series of real numbers.
Observe that if the sequence of functions f_n satisfies the condition
\[
\sum_{n\in\mathbb{N}}f_n(t)^2<\infty,\qquad \forall t\in T, \tag{2.5.5}
\]
then the series defining F(t) converges in L²(Ω, S, P), for any t ∈ T. To see this, consider the partial sums
\[
F_n(t)=\sum_{k=1}^{n}X_kf_k(t).
\]
Then, for m < n, we have
\[
\mathbb{E}\big[\,\big(F_n(t)-F_m(t)\big)^2\,\big]=\sum_{k=m+1}^{n}f_k(t)^2\,\mathbb{E}\big[\,X_k^2\,\big]=\sum_{k=m+1}^{n}f_k(t)^2.
\]
This proves that the sequence (F_n(t))_{n∈ℕ} is Cauchy in L²(Ω, S, P). The family F = (F(t))_{t∈T} is a centered Gaussian random process. It is convenient to think of F as a random function. Its value F(t) at t is not a deterministic quantity; it is random.
The covariance kernel is
\[
K(s,t)=K_F(s,t)=\mathbb{E}\big[\,F(s)F(t)\,\big]=\sum_{n\in\mathbb{N}}f_n(s)f_n(t).
\]
The above series is absolutely convergent since
\[
2|f_n(s)f_n(t)|\le f_n(s)^2+f_n(t)^2,\qquad \forall n,s,t.
\]
Note that since the random vector (F(s), F(t)) is Gaussian, the random variables F(s), F(t) are independent iff they are uncorrelated, i.e., E[F(s)F(t)] = 0. Thus the covariance kernel can be viewed as a measure of the dependency between the values of F at different points s, t ∈ T.
Using Kolmogorov's one series theorem we deduce from the L² convergence that for any t ∈ T there exists a measurable subset N_t ⊂ Ω such that P[N_t] = 0 and, for any ω ∈ Ω \ N_t, the series F(t, ω) in (2.5.4) converges. We will denote by F(t, ω) its sum. Set
\[
N:=\bigcup_{t\in T}N_t.
\]
For ω ∈ Ω \ N we obtain a genuine function
\[
F_\omega:T\to\mathbb{R},\qquad F_\omega(t)=F(t,\omega).
\]
The function F_ω is referred to as a path of the stochastic process. We encounter here one of the recurring headaches in the theory of stochastic processes. Namely, if T is not countable, the set N may not be negligible, so the paths may not exist a.s. If the parameter space T has additional structure, one could ask if the paths are compatible in some fashion with that structure. For example, if T is an interval of the real axis, we could ask if the paths are continuous functions of t. □
Example 2.78. A Gaussian white noise is a triplet H, (Ω, S, P), W , where




• H is a separable real Hilbert space,


• (Ω, S, P) is a probability space and,
• W : H → L2 (Ω, S, P), h 7→ W h is an isometry of H into L2 (Ω, S, P) such
that, for any h ∈ H, the random variable Wh is centered Gaussian.
July 19, 2022 15:9 ws-book961x669 An Introduction to Probability 12800-main page 224

224 An Introduction to Probability

Since X is an isometry we deduce that


Var W (h) = E W (h)2 = khk2H .
   

In particular, this also shows that the image W (H) of X is a closed subspace of
L2 (Ω, S, P) consisting of centered Gaussian random variables. Such a subspace is
called a Gaussian Hilbert space. Obviously there is a natural bijection between
Gaussian white noises and Gaussian Hilbert spaces.
Here is how one can construct Gaussian white noises. Fix a separable Hilbert
space H with inner product (−, −). Next, fix a Hilbert basis of (en )n∈N . Every
element in H can then be decomposed along this basis
X
h= an (h)en , an (h) := (h, en ).
n∈N

Choose a sequence of independent standard normal random variables (Xn )n∈N de-
fined on a probability space (Ω, S, P). For h ∈ H we set
 X
W h = an (h)Xn .
n∈N
From Parseval’s identity we deduce that
X
an (h)2 = khk2H
n∈N
 
proving that the series defining W h converges in L2 . The collection W (h) h∈H
is a Gaussian process and its covariance is
   X
K(h0 , h1 ) = E W h0 W h1 = an (h0 )an (h1 ) = (h0 , h1 ).
n∈N

In particular, this proves that the correspondence h 7→ W h is an isometry, and
thus we have produced a Gaussian white noise.
As a special example, suppose that H = L2 [0, ∞), λ). Fix a Hilbert basis
(fn )n∈N and construct the Gaussian noise as above
L²([0, ∞), λ) ∋ f ↦ W(f) = ∑_{n∈N} an(f) Xn,  an(f) = ∫₀^∞ f(t) fn(t) dt.
For each t ∈ [0, ∞) we set
B(t) := W(I_{[0,t]}) = ∑_{n∈N} ( ∫₀^t fn(s) ds ) Xn. (2.5.6)

Note that
E[B(s)B(t)] = ∫₀^∞ I_{[0,s]}(x) I_{[0,t]}(x) dx = min(s, t).
This shows that B(t) is a pre-Brownian motion.
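The truncated series (2.5.6) can be checked against this covariance for a concrete basis. Restricting to L²([0, 1]) and choosing the cosine basis f0 = 1, fn(u) = √2 cos(nπu) (our choice, for illustration only), the integrals ∫₀^t fn are available in closed form, so the covariance of the truncated series can be compared with min(s, t) without any sampling:

```python
import math

def truncated_cov(s, t, num_terms=5000):
    """Covariance of the truncated series (2.5.6) for the orthonormal basis
    f0 = 1, fn(u) = sqrt(2) cos(n*pi*u) of L^2([0,1]); the coefficients are
    int_0^t f0 = t and int_0^t fn = sqrt(2) sin(n*pi*t)/(n*pi)."""
    cov = s * t  # contribution of the constant basis vector f0
    for n in range(1, num_terms + 1):
        cov += 2.0 * math.sin(n * math.pi * s) * math.sin(n * math.pi * t) / (n * math.pi) ** 2
    return cov
```

As num_terms grows this converges to min(s, t): the correction series is the classical eigenfunction expansion of the Brownian bridge kernel min(s, t) − st.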
Observe that if s ≠ t and |u| < |t − s|/2, then the random variables
(1/u)(B(s + u) − B(s))  and  (1/u)(B(t + u) − B(t))
are independent.

Limit theorems 225

Now we need to make a leap of faith and pretend we can differentiate with respect to t. (We really cannot.) Letting u → 0 we deduce that B′(t) and B′(s) are independent Gaussian random variables. Differentiating the equality (2.5.6) with the same abandon we deduce
B′(t) = ∑_{n∈N} fn(t) Xn. (2.5.7)

Thus, the elusive B′(t) is a random “function” of the kind described in Example 2.77 with one big difference: in this case the condition (2.5.5) is not satisfied. Observe that the “value” of B′ at a point t is independent of its value at a point s. Thus, the value of B′ at a point carries no information about its value at a different point, so B′(t) is a completely chaotic random “function”; it is what is commonly referred to as white noise.
As we will see in the next subsection, the function B(t) is not differentiable at any point. Moreover, the series (2.5.7) does not converge in a classical sense. However, it can be shown to converge in the sense of distributions. For an excellent discussion of this aspect we refer to [68, Sec. III.4].
For any function f ∈ L²([0, ∞)) we define its Wiener integral
∫₀^t f(s) dB(s) := W(I_{[0,t]} f). (2.5.8)

In Exercise 2.60 we give an alternate definition of this object that justifies this choice of notation. In particular we deduce that
B(t) = ∫₀^t dB(s).
Even though B′(t) does not exist in any meaningful way, the above intuition is nevertheless very important since it is what led to the very important concepts of Itô integral and stochastic differential equations. t
u

2.5.3 The Brownian motion


We have almost everything we need to define the concept of Brownian motion and
prove its existence.

Definition 2.79. A stochastic process (B(t))t≥0 defined on a probability space (Ω, S, P) is called a standard Brownian motion or Wiener process if the following hold.

(i) B(t) is a pre-Brownian motion.


(ii) For any ω ∈ Ω the path Bω : [0, ∞) → R, t 7→ B(t, ω) is continuous. t
u

To prove the existence of a standard Brownian motion we need a bit more


terminology and another fundamental result of Kolmogorov.

Definition 2.80. Let (Ω, S, P) be a probability space, T a set, and (X, F) a measurable space. Consider stochastic processes
X, Y : T × Ω → X, (t, ω) ↦ Xt(ω), Yt(ω).

(i) The process Y is said to be a modification or version of X, and we denote this X ∼ Y, if for any t ∈ T there exists a negligible subset Nt such that
Xt(ω) = Yt(ω), ∀ω ∈ Ω \ Nt.
(ii) The processes X, Y are said to be indistinguishable and we denote this X ≈ Y ,
if there exists a negligible subset N such that
Xt (ω) = Yt (ω), ∀t ∈ T, ∀ω ∈ Ω \ N.
(iii) The processes X, Y are said to be stochastically equivalent, and we denote this
X ∼s Y , if for any finite subset I ⊂ T the random vectors X I and Y I have
the same distribution.

t
u

Note that ≈, ∼, ∼s are equivalence relations and


X ≈ Y =⇒ X ∼ Y =⇒ X ∼s Y.
We have shown that any two pre-Brownian motions are stochastically equivalent.
We want to prove something stronger namely, that any pre-Brownian motion admits
a version whose paths are almost surely continuous maps [0, ∞) → R. We begin by
proving a more general result.

Theorem 2.81 (Kolmogorov’s Continuity Theorem). Suppose that T is a


compact interval of the real axis, (Ω, F, P) is a probability space and
X : T × Ω → R, (t, ω) 7→ Xt (ω)
is a stochastic process such that there exist constants q, r, K > 0 with the property that
E[|Xs − Xt|^q] ≤ K|s − t|^{1+r}, ∀s, t ∈ T. (2.5.9)
Then, for any α ∈ (0, r/q), the process X admits a modification Y whose paths
are almost surely Hölder continuous with exponent α. This means, that for any
α ∈ (0, r/q) there exists a stochastic process (Yt )t∈T , a negligible subset Nα ⊂ Ω
and a measurable function
C = Cα : Ω → [0, ∞),
such that

• ∀t ∈ T , Xt = Yt a.s. and,
• for any ω ∈ Ω \ Nα , and any s, t ∈ T we have
Ys (ω) − Yt (ω) ≤ C(ω)|s − t|α .

Proof. We follow the presentation in [103, Thm. 29]. Without loss of generality we can assume that T = [0, 1]. We denote by D the set of dyadic numbers in [0, 1],
D = ⋃_{n≥0} Dn,  Dn = { k/2^n ; 0 ≤ k ≤ 2^n }.

Fix α ∈ (0, r/q). We carry out the proof in two steps.


Step 1. We will show that there exists a measurable negligible subset N ⊂ Ω and a measurable function C : Ω → [0, ∞) such that
|Xt(ω) − Xs(ω)| ≤ C(ω)|t − s|^α, ∀s, t ∈ D. (2.5.10)

From the assumption (2.5.9) and Markov’s inequality we deduce that for any s, t ∈ T and any a > 0 we have
P[|Xs − Xt| > a] ≤ (1/a^q) E[|Xs − Xt|^q] ≤ (K/a^q) |t − s|^{1+r}.
Applying this inequality to s = (k − 1)/2^n, t = k/2^n and a = 2^{−nα} we deduce
P[ |X_{(k−1)/2^n} − X_{k/2^n}| > 2^{−nα} ] ≤ K 2^{nqα} 2^{−n−nr} = K(ρ/2)^n,  ρ := 2^{αq−r}.
Hence
P[Hn] ≤ Kρ^n,  Hn := ⋃_{k=1}^{2^n} { |X_{(k−1)/2^n} − X_{k/2^n}| > 2^{−nα} }.

Note that since α < r/q we have ρ ∈ (0, 1). From the Borel-Cantelli Lemma we deduce that
P[Hn i.o.] = 0.

Thus, there exists a negligible set N with the following property: for any ω ∈ Ω \ N
there exists n0 (ω) ∈ N so that for any n ≥ n0 (ω), and any k = 1, . . . , 2n we have

|X_{(k−1)/2^n}(ω) − X_{k/2^n}(ω)| ≤ 2^{−nα}.

At this point we invoke an elementary result whose proof we postpone.

Lemma 2.82. Let f : D → R be a function. Suppose that there exist α ∈ (0, 1), n0 ∈ N and K > 0 such that, ∀n ≥ n0 and any k = 1, . . . , 2^n,
|f((k − 1)/2^n) − f(k/2^n)| ≤ K 2^{−nα}. (2.5.11)
Then there exists C = C(n0, α, K) such that
|f(s) − f(t)| ≤ C|s − t|^α, ∀t, s ∈ D.

t
u

This shows that for every ω ∈ Ω \ N the map D ∋ t ↦ Xt(ω) is Hölder continuous with exponent α. This completes the proof of (2.5.10).
Step 2. We can now produce the claimed modification. For every ω ∈ Ω \ N the map
D ∋ t ↦ Xt(ω)
admits a unique α-Hölder extension T ∋ t ↦ X̄t(ω) ∈ R. For t0 ∈ T we have
lim_{t→t0, t∈D} Xt = X̄t0.
Since
lim_{t→t0, t∈D} E[|Xt − Xt0|^q] = 0
we deduce that Xt0 = X̄t0 a.s. Hence the process (X̄t)_{t∈T} is a modification of (Xt)_{t∈T} whose paths are a.s. α-Hölder continuous. t
u

Proof of Lemma 2.82. Let 0 ≤ m < n0, s = (i − 1)/2^m and t = i/2^m. For j = 0, 1, . . . , 2^{n0−m} we set sj = s + j/2^{n0}. Then
|f(t) − f(s)| ≤ ∑_{j=1}^{2^{n0−m}} K 2^{−n0·α} = K 2^{n0−m} 2^{−n0·α} = K 2^{(n0−m)+α(m−n0)} 2^{−mα} =: K1 2^{−mα}.

Thus, f satisfies (2.5.11) for any n ≥ 1, possibly with a different constant
K′ = max(K1, K).
Let s, t ∈ D. Assume s < t. Let p be the largest positive integer such that t − s < 2^{−p}. Note that 2^{−(p+1)} ≤ t − s < 2^{−p}. Set ℓ := ⌊2^p s⌋. Then
ℓ/2^p ≤ s ≤ ℓ/2^p + 1/2^{p+1} =: u ≤ t ≤ (ℓ + 1)/2^p.

Then for some m, n ∈ N we have
s = u − ε1/2^{p+1} − · · · − εm/2^{p+m},  t = u + η1/2^{p+1} + · · · + ηn/2^{p+n},
where εi, ηj ∈ {0, 1}. For i = 1, . . . , m, and j = 1, . . . , n we set
si = u − ε1/2^{p+1} − · · · − εi/2^{p+i},  tj := u + η1/2^{p+1} + · · · + ηj/2^{p+j}.
2 2 2 2
Using (2.5.11) we deduce
|f(u) − f(s)| ≤ |f(u) − f(s1)| + · · · + |f(s_{m−1}) − f(s_m)| ≤ ∑_{i=1}^{m} K1 2^{−(p+i)α}
≤ K1 2^{−pα} ∑_{i=1}^{∞} 2^{−iα} = (K1/(2^α − 1)) 2^{−pα} =: K2 2^{−pα} ≤ 2^α K2 |s − t|^α,

where at the last step we used the fact that 2^{−p} < 2|s − t|. Similarly
|f(u) − f(t)| ≤ 2^α K2 |s − t|^α.
Hence
|f(s) − f(t)| ≤ |f(s) − f(u)| + |f(u) − f(t)| ≤ 2^{1+α} K2 |t − s|^α.
This proves the lemma with C(n0, α, K) = 2^{1+α} K2. t
u

Remark 2.83. (a) Using Exercise 2.59 one can modify the modification in Theorem 2.81 to be α-Hölder continuous for any α ∈ (0, r/q), not just for a fixed α in this range.
(b) The argument in the proof of Lemma 2.82 is an elementary incarnation of
the chaining technique. For a wide ranging generalization of the continuity Theo-
rem 2.81 and the chaining technique we refer to [101, Chap. 11]. t
u

Corollary 2.84. Suppose that (Wt )t≥0 is a pre-Brownian motion. Then for any
α ∈ (0, 1/2) the process (Wt ) admits a modification whose paths are a.s. α-Hölder
continuous. In particular, Brownian motions exist.
Proof. Set δ = 1/2 − α. Note that Wt − Ws is Gaussian with mean 0 and variance |t − s|, so D := (1/√|t − s|)(Wt − Ws) ∼ N(0, 1) and, ∀q ≥ 1, we have
E[|Wt − Ws|^q] = |t − s|^{q/2} E[|D|^q].
If we choose q > 1/δ, then we deduce that
(q/2 − 1)/q = 1/2 − 1/q > α
and Theorem 2.81 implies that (Wt) admits a modification (W̄t)_{t≥0} whose paths are a.s. α-Hölder continuous.
Recall that this means that there exists a measurable negligible set N ⊂ Ω such that ∀ω ∈ Ω \ N the path t ↦ W̄t(ω) is continuous. Now define
B : [0, ∞) × Ω → R,  (t, ω) ↦ Bt(ω) = W̄t(ω) if ω ∈ Ω \ N,  Bt(ω) = 0 if ω ∈ N.
Clearly (Bt )t≥0 is a (standard) Brownian motion. t
u

Remark 2.85. I want to say a few words about Paul Lévy’s elegant construction
of the Brownian motion, [106, Sec. 1].
He produces the Brownian motion on [0, 1] as a limit of random piecewise linear functions Ln with nodes on the dyadic sets
Dn := { k/2^n ; 0 ≤ k ≤ 2^n },  n ≥ 0.
They are successively better approximations of the Brownian motion. The 0-th
order approximation is the random linear function L0 (t) such that L0 (0) = 0 and
L0 (1) is a standard normal random variable.
The n-th order approximation Ln satisfies the following conditions.


• It is linear on each of the intervals [(k − 1)/2^n, k/2^n], and Ln(0) = 0.
• The increments
Ln(k/2^n) − Ln((k − 1)/2^n),  k = 1, . . . , 2^n,
are normal random variables with mean zero and variance 1/2^n.
• Ln(t) = Ln−1(t), ∀t ∈ Dn−1.

To explain how to produce Ln(t) given Ln−1(t) we only need to explain how to produce Ln((2k − 1)/2^n) given that
Ln(j/2^{n−1}) = Ln−1(j/2^{n−1}),  j = k − 1, k.

To “guess” what Ln((2k − 1)/2^n) should be, we take our inspiration from the Brownian motion that we want to approximate.
Consider two moments of time t0 < t1 in [0, 1]. Then B(t0 ) ∼ N (0, t0 ),
B(t1 ) ∼ N (0, t1 ) and B(t1 ) − B(t0 ) is a normal random variable with mean 0,
variance t1 − t0 , independent of B(t0 ). Denote by t∗ the midpoint of [t0 , t1 ],
t∗ = (t0 + t1 )/2.
Consider the linear interpolation
Z = (1/2)(B(t0) + B(t1)).
The difference
∆ := B(t∗) − Z = (1/2)(B(t∗) − B(t0)) + (1/2)(B(t∗) − B(t1))
is a sum of two independent normal random variables, that are also independent
of B(t0 ). Thus ∆ is a normal random variable with mean 0, variance (t1 − t0 )/4,
independent of B(t0). We write
B(t∗) = Z + ∆ = (1/2)(B(t0) + B(t1)) + (√(t1 − t0)/2) X, (2.5.12)
where X is a standard normal random variable.
We can now describe Lévy’s prescription. We set
D = ⋃_{n≥0} Dn

and consider a family (Xt )t∈D of independent standard normal random variables.
Then
L0 (t) := tX1 .
The approximation Ln+1 is obtained from Ln as follows. If t0 < t1 are two con-
secutive points in Dn and t∗ ∈ Dn+1 is the midpoint of [t0 , t1 ], then Ln+1 (t∗ ) is
obtained by mimicking (2.5.12), i.e.,

Ln+1(t∗) = (1/2)(Ln(t0) + Ln(t1)) + (√(t1 − t0)/2) Xt∗
= (1/2)(Ln(t0) + Ln(t1)) + (1/2^{1+n/2}) Xt∗.
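The refinement step above is easy to implement. The sketch below is our illustration (the name `levy_refine` is an assumption): it carries Ln on the grid Dn as an array and produces Ln+1 by the midpoint rule, with noise of standard deviation 1/2^{1+n/2}.

```python
import numpy as np

def levy_refine(path, n, rng):
    """One step of Lévy's construction: given L_n sampled on D_n (an array of
    2**n + 1 values), return L_{n+1} sampled on D_{n+1}."""
    coarse = np.asarray(path, dtype=float)
    fine = np.empty(2 * len(coarse) - 1)
    fine[0::2] = coarse                         # L_{n+1} agrees with L_n on D_n
    mid = 0.5 * (coarse[:-1] + coarse[1:])      # linear interpolation Z
    fine[1::2] = mid + rng.standard_normal(len(mid)) / 2 ** (1 + n / 2)
    return fine

rng = np.random.default_rng(0)
path = np.array([0.0, rng.standard_normal()])   # L_0 on D_0 = {0, 1}
for n in range(10):
    path = levy_refine(path, n, rng)            # now L_10 on D_10
```

By construction each refinement leaves the values on the coarser grid unchanged, which is the property Ln(t) = Ln−1(t), ∀t ∈ Dn−1.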

To prove that the sequence Ln(t) converges uniformly a.s. it suffices to show that the series of random variables
∑_{n≥0} sup_{t∈[0,1]} |Ln+1(t) − Ln(t)| =: ∑_{n≥0} Un
converges a.s.
Denote by Mn the set of midpoints of the 2^n intervals determined by Dn, Mn = Dn+1 \ Dn. From the construction of Ln we deduce that
Un = (1/2^{1+n/2}) max_{τ∈Mn} |Xτ|.
We deduce that for any c > 0 we have
P[Un > c] ≤ 2^n P[Y > 2^{1+n/2} c],  Y ∼ N(0, 1).

The Mills ratio inequalities (1.3.40) coupled with the Borel-Cantelli lemma lead to
the claimed convergence. t
u

Let us observe that if (B(t)) is a standard Brownian motion, then B(0) = 0 a.s.
For this reason, the standard Brownian motion is also referred to as the Brownian
motion started at 0. For x ∈ R we set B x (t) = x + B(t). We will refer to B x (t) as
the Brownian motion started at x.

Remark 2.86 (The Wiener measure). The space C := C([0, ∞)) of continuous functions [0, ∞) → R is equipped with a natural metric d,
d(f, g) = ∑_{n∈N} (1/2^n) min(1, dn(f, g)),  dn(f, g) := sup_{t∈[n−1,n]} |f(t) − g(t)|.

The topology induced by this metric is the topology of uniform convergence on the
compact subsets of [0, ∞). One can prove (see Exercise 2.61) that the Borel algebra
of this metric space coincides with the sigma algebra generated by the functions
Evt : C → R, Evt (f ) = f (t).
More generally, for any finite subset I ⊂ [0, ∞) we have a measurable evaluation map
EvI : C → R^I,  f ↦ f|I.
Proposition 1.29 shows that if µ0 , µ1 are two probability measures on C such that
(EvI )# µ0 = (EvI )# µ1
for any finite subset I ⊂ [0, ∞), then µ0 = µ1.
Note that if (Xt)t≥0 is a stochastic process defined on a probability space (Ω, S, P) whose paths are continuous, then it defines a map
X : Ω → C,  Ω ∋ ω ↦ X(ω) ∈ C,  X(ω)(t) = Xt(ω).

The map X is measurable since its composition with all the evaluation maps EvI
are measurable. Thus the stochastic process defines a probability measure
PX := X#P ∈ Prob(C, BC)


called the distribution of the process.


Suppose that B^0, B^1 are two Brownian motions defined on possibly different probability spaces. They have distributions
W0, W1 ∈ Prob(C, BC).
These distributions coincide since the finite dimensional distributions (EvI)# Wi, i = 0, 1, are centered Gaussian with identical covariances
E[B^i_{t1} B^i_{t2}] = min(t1, t2),  ∀t1, t2 ∈ I, i = 0, 1.
Thus, the Brownian motions determine a probability measure W on C uniquely determined by the requirement that for any finite subset {t1, . . . , tn} ⊂ [0, ∞) the random vector
(Evt1, . . . , Evtn)
is centered Gaussian with covariances E[Evti Evtj] = min(ti, tj). This measure is known as the Wiener measure. We denote it by W.
Note that W is the unique probability measure on C such that the canonical process
Bt : (C, BC, W) → R,  C ∋ f ↦ Evt(f) = f(t)
is itself a Brownian motion, i.e.,
E_W[Bs Bt] = min(s, t),  ∀s, t ≥ 0. (2.5.13)
We have proved the existence of Wiener’s measure by relying on the existence of
Brownian motion. Conversely, if by some other method we can construct the Wiener
measure on C, then as a bonus we deduce the existence of Brownian motions. Here
is one such alternate method.
Consider a sequence of i.i.d. random variables (Xn )n∈N with mean 0 and variance
1. We set
S0 = 0, Sn = X1 + · · · + Xn , n ∈ N.
Imitating (2.5.2), for ν ∈ N and t ≥ 0 we set
W_ν(t) := ν^{−1/2} S_{⌊νt⌋} + R_ν(t),  R_ν(t) := ν^{−1/2} (νt − ⌊νt⌋) X_{⌊νt⌋+1}. (2.5.14)
For each ν, the paths of the random process W_ν are continuous and piecewise linear. The above discussion shows that it defines a Borel probability measure P_ν = P_{W_ν} on C.
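A minimal sketch of the process (2.5.14) (ours; the name `donsker_path` is an assumption): it evaluates the rescaled, linearly interpolated random walk at prescribed times.

```python
import numpy as np

def donsker_path(X, nu, ts):
    """Evaluate W_nu(t) of (2.5.14) at the times ts, where X = (X_1, X_2, ...)
    is the step sequence (i.i.d., mean 0, variance 1), stored 0-indexed."""
    Xp = np.concatenate([np.asarray(X, dtype=float), [0.0]])  # pad; the pad is
    # only indexed when nu*t is an integer, where its coefficient vanishes
    S = np.concatenate([[0.0], np.cumsum(Xp[:-1])])           # S_0, S_1, ...
    ts = np.asarray(ts, dtype=float)
    k = np.floor(nu * ts).astype(int)                         # k = floor(nu*t)
    return (S[k] + (nu * ts - k) * Xp[k]) / np.sqrt(nu)       # S_k + R_nu term

rng = np.random.default_rng(0)
steps = rng.choice([-1.0, 1.0], size=1000)   # Bernoulli ±1 steps: mean 0, var 1
w = donsker_path(steps, 1000, np.linspace(0.0, 1.0, 11))
```

At grid times t = k/ν this reduces to S_k/√ν; between grid points it interpolates linearly, exactly as in (2.5.14).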
Donsker’s Invariance Principle shows that the sequence P_ν converges weakly to a probability measure P_∞ satisfying (2.5.13). In other words, P_∞ is the
Wiener measure. We can view the Invariance Principle as a functional version of
the Central Limit Theorem. Its proof requires an in depth investigation of the space
of probability measures on Polish spaces12 and is beyond the scope of this text. For
a most readable presentation of Donsker’s theorem and some of its consequences we
refer to [12], [19, Chap. 13]. t
u
12 Recall that a Polish space is a complete separable metric space.

The next result suggests that the paths of a Brownian motion are very rough,
i.e., they have poor differentiability properties.

Proposition 2.87 (The quadratic variation of Brownian paths). Consider a Brownian motion (Bt)t≥0 defined on the probability space (Ω, S, P). Fix c > 0 and let
0 = t^n_0 < t^n_1 < · · · < t^n_{pn} = c,  n ∈ N,
be a sequence of subdivisions of [0, c] with mesh
µn := sup_{1≤k≤pn} (t^n_k − t^n_{k−1})
tending to 0 as n → ∞. Define the quadratic variations
Qn(c) := ∑_{k=1}^{pn} ( B_{t^n_k} − B_{t^n_{k−1}} )².

Then E[Qn(c)] = c, ∀n, and Qn(c) → c in L²(Ω, S, P) as n → ∞.

Proof. The Gaussian random variables X^n_k = B_{t^n_k} − B_{t^n_{k−1}}, 1 ≤ k ≤ pn, are independent, have mean zero and moments
E[(X^n_k)²] = t^n_k − t^n_{k−1},  E[(X^n_k)⁴] = 3(t^n_k − t^n_{k−1})².
From the first equality we deduce E[Qn(c)] = c. Moreover
∑_{k=1}^{pn} (X^n_k)² − c = ∑_{k=1}^{pn} ( (X^n_k)² − (t^n_k − t^n_{k−1}) ) =: ∑_{k=1}^{pn} Y^n_k.

The random variables Y^n_k are independent and have mean zero, so
‖ ∑_{k=1}^{pn} (X^n_k)² − c ‖²_{L²} = ∑_{k=1}^{pn} ‖Y^n_k‖²_{L²}.

Now observe that
‖Y^n_k‖²_{L²} = E[(X^n_k)⁴] − 2(t^n_k − t^n_{k−1}) E[(X^n_k)²] + (t^n_k − t^n_{k−1})² = 2(t^n_k − t^n_{k−1})².
Hence
‖ ∑_{k=1}^{pn} (X^n_k)² − c ‖²_{L²} = 2 ∑_{k=1}^{pn} (t^n_k − t^n_{k−1})² ≤ 2µn ∑_{k=1}^{pn} (t^n_k − t^n_{k−1}) = 2µn c → 0 as n → ∞.

t
u
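The concentration Qn(c) → c is easy to observe in simulation. The sketch below (ours; the name `quadratic_variation` is an assumption) computes the quadratic variation of a simulated Brownian path over the uniform subdivision of [0, c].

```python
import numpy as np

def quadratic_variation(c=1.0, n=100_000, seed=0):
    """Quadratic variation Q_n(c) of a simulated Brownian path over the uniform
    subdivision of [0, c] with n intervals: increments are i.i.d. N(0, c/n)."""
    rng = np.random.default_rng(seed)
    dB = np.sqrt(c / n) * rng.standard_normal(n)
    return float(np.sum(dB ** 2))

q = quadratic_variation()
```

By the computation above, Var(Qn(c)) = 2c²/n for the uniform subdivision, so with n = 10^5 the simulated value sits within a few thousandths of c.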

On a subsequence nj we have Q_{nj}(c) → c > 0 a.s. On the other hand, if for some ω ∈ Ω the function t ↦ Bt(ω) were Hölder with exponent α > 1/2 on [0, c], then for some constant C = Cω > 0 independent of n we would have
0 ≤ Qn(c)(ω) ≤ Cω² ∑_k (t^n_k − t^n_{k−1})^{2α} ≤ Cω² µn^{2α−1} c → 0.
This proves that Bt is a.s. not α-Hölder on [0, c] for any α > 1/2.
On the other hand, we know that the paths of the Brownian motion are Hölder
continuous for any exponent < 1/2. A 1933 result of Paley, Wiener, Zygmund [124]
shows that they have very poor differentiability properties. First some historical
context.
One question raised in the 19th century was whether there exist continuous
functions on an interval that are nowhere differentiable. Apparently Gauss believed
that there are no such functions. K. Weierstrass explicitly produced in 1872 such
examples defined by lacunary (or sparse) Fourier series. In 1931 S. Banach [7]
and S. Mazurkewicz [115] independently showed that the complement of the set of
nowhere differentiable functions in the metric space of continuous functions on a
compact interval is very small, meagre in the Baire category sense.
The 1933 result of Paley, Wiener, Zygmund that we want to discuss is similar in nature. They prove that the complement of the set of continuous nowhere differentiable functions f ∈ C is negligible with respect to the Wiener measure.

Theorem 2.88 (Paley, Wiener, Zygmund). The paths of a Brownian motion


(Bt )t≥0 are a.s. nowhere differentiable.

Proof. We follow the very elegant argument of Dvoretzky, Erdös, Kakutani [54].
We will show that for any interval I = [a, b) ⊂ [0, ∞) the paths of (Bt ) are a.s.
nowhere differentiable on I. Assume the Brownian motion is defined on a probability
space (Ω, S, P). This probability space could be the space C equipped with the
Wiener measure. For ease of presentation we assume that I = [0, 1). Consider the
set

S := ω ∈ Ω; the path Bt (ω) is nowhere differentiable on [0, 1) .
The set S may not be measurable13 but we will show that its complement is con-
tained in a measurable subset of Ω of measure zero.
Let us observe that if ω ∈ Ω \ S, i.e., the path t 7→ Bt (ω) is differentiable at a
point t0 ∈ [0, 1], then there exist M, N ∈ N such that for any n ≥ N there exists
k ∈ {1, . . . , n − 2} with the property that
|B_{(k−1+i)/n}(ω) − B_{(k+i)/n}(ω)| ≤ M/n,  ∀i = 0, 1, 2.
To see this set f(t) = Bt(ω), m = |f′(t0)|, M = ⌊m⌋ + 2. Then there exists ε > 0 so that if s, t ∈ (t0 − ε, t0 + ε), s < t, we have
|f(s) − f(t)| ≤ M(t − s).
13 In 1936 S. Mazurkewicz proved that the set S is not a Borel subset of C.

Now choose N such that 1/N < ε/6 and, for n ≥ N, choose k ∈ {1, 2, . . . , n} such that
t0 − ε < (k − 1)/n, k/n, (k + 1)/n, (k + 2)/n < t0 + ε. (2.5.15)
We deduce that
Ω \ S ⊂ ⋃_{M∈N} ⋃_{N∈N} X_{M,N},  X_{M,N} := ⋂_{n≥N} ⋃_{k=1}^{n} ⋂_{i=0}^{2} { |B_{(k−1+i)/n} − B_{(k+i)/n}| ≤ M/n }.

Clearly, the set X_{M,N} is measurable and it suffices to show it is negligible. We have
P[X_{M,N}] ≤ inf_{n≥N} ∑_{k=1}^{n−2} P[ max_{0≤i≤2} |B_{(k−1+i)/n} − B_{(k+i)/n}| ≤ M/n ]. (2.5.16)

Now observe that the increments B(k−1)/n − Bk/n are independent Gaussians with
mean zero and variance 1/n. We deduce
P[X_{M,N}] ≤ inf_{n≥N} ∑_{k=1}^{n−2} P[ |B_{(k−1)/n} − B_{k/n}| ≤ M/n ]³.

The exponent 3 above will make all the difference. It appears because of the constraint (2.5.15) on N. Since √n (B_{(k−1)/n} − B_{k/n}) is standard normal, the random variable B_{(k−1)/n} − B_{k/n} is normal with variance 1/n and we have
P[ |B_{(k−1)/n} − B_{k/n}| ≤ M/n ] = 2√(n/(2π)) ∫₀^{M/n} e^{−x²n/2} dx
(x = My/n)
= 2√(n/(2π)) (M/n) ∫₀^{1} e^{−M²y²/(2n)} dy ≤ C M n^{−1/2},  C := 2/√(2π).

Hence
∑_{k=1}^{n−2} P[ |B_{(k−1)/n} − B_{k/n}| ≤ M/n ]³ ≤ n C³M³ n^{−3/2} = C³M³ n^{−1/2},  ∀n ≥ N,
and (2.5.16) implies that P[X_{M,N}] = 0. t
u

2.6 Exercises

Exercise 2.1 (Skorokhod). Suppose that X1 , . . . , Xn are independent random


variables. We set Sk = X1 + · · · + Xk, k = 1, . . . , n. Let α > 0 and set
c := sup_{1≤j≤n} P[ |Sn − Sj| > α ],  Mn := sup_{1≤j≤n} |Sj|.

Prove that if c < 1, then
P[Mn > 2α] ≤ (1/(1 − c)) P[ |Sn| > α ].
Hint. Denote by J the first j such that |Sj| > 2α. Note that P[Mn > 2α] = P[J ≤ n] and
P[ |Sn| > α ] ≥ P[ |Sn| > α, Mn > 2α ] = ∑_{j=1}^{n} P[ |Sn| > α, J = j ] ≥ ∑_{j=1}^{n} P[ |Sn − Sj| ≤ α, J = j ].

Observe that the event {J = j} is independent of Sn − Sj . t


u

Exercise 2.2. Suppose that (Xn )n≥1 is a sequence of independent random vari-
ables. Prove that the following statements are equivalent.
(i) The series ∑_{n≥1} Xn converges in probability.
(ii) The series ∑_{n≥1} Xn converges a.s.

Hint. Use Exercise 2.1. t


u

Remark 2.89. The so-called Lévy equivalence theorem, [47, §III.2], [50, §9.7] or [105, §43], states that a series with independent terms converges a.s. iff it converges in probability, iff it converges in distribution. t
u

Exercise 2.3. Consider an infinite array of nonnegative numbers P = (pn,k )k,n≥1


satisfying the following conditions.

(i) The array is lower triangular, i.e., pn,k = 0, ∀k > n.


(ii) For every n, the n-th row of P defines a probability distribution on
In = {1, 2, . . . , n}, i.e.,
∑_{k=1}^{n} pn,k = 1, ∀n ≥ 1.
(iii) The sequence determined by each column of P converges to 0, i.e.,
lim pn,k = 0, ∀k ≥ 1.
n→∞

Show that if (xn ) is a sequence of real numbers that converges to a number x, then
the sequence of weighted averages
yn := ∑_{k=1}^{n} pn,k xk
converges to the same number x. t
u

Exercise 2.4. In this exercise we describe the acceptance-rejection method


frequently used in Monte-Carlo simulations. For any nonnegative function f : R → [0, ∞) we denote by Gf the region below its graph,
Gf := { (x, y) ∈ R²; 0 ≤ y ≤ f(x) }.
(i) Suppose that we are given a probability density p : R → [0, ∞),
∫_R p(x) dx = 1.

For any positive constant c we set
µcp := (1/c) I_{Gcp}(x, y) dxdy.
Since area(Gcp ) = c we deduce that µcp defines a Borel probability measure
on R2 . The natural projection R2 3 (x, y) 7→ x ∈ R is a random variable
X defined on the probability space (R2 , BR2 , µcp ). Prove that the probability
distribution of X is p(x)dx.
(ii) Suppose that X is a random variable with probability distribution p(x)dx. Let
U be a random variable independent of X and uniformly distributed over [0, 1].
Prove that the probability distribution of the random vector (X, cp(X)U ) is µcp .
(iii) Let q : R → [0, ∞) be another probability density such that, there exists c > 0
with the property that
q(x) ≤ cp(x), ∀x ∈ R.
Suppose that (Un)n∈N is a sequence of i.i.d. random variables uniformly distributed on [0, 1] and (Xn)n∈N is a sequence of i.i.d., independent of the Un’s
and with common distribution p(x)dx. Denote by N the random variable

N = inf{ n ∈ N : cp(Xn)Un ≤ q(Xn) }.
Prove that
 
E[N] = c.

Hint. Consider the random vector Vn = (Xn, cp(Xn)Un), observe that
N = inf{ n ∈ N; Vn ∈ Gq },

and use part (ii) to show that N is a geometric random variable.


(iv) Define Y = XN , i.e.,
Y(ω) = X_{N(ω)}(ω).
 
From (iii) we know that P N < ∞ = 1 so Y is defined outside a probability
zero set. Prove that the probability distribution of the random variable Y
is q(y)dy.

t
u

Remark 2.90 (Acceptance-Rejection method). Suppose that a computer can


sample the distribution Unif(0, 1) and it can sample the distribution p(x)dx. We
can then sample the distribution q(y)dy as follows. Sample successively and independently Unif(0, 1) and p(x)dx and denote by Un and respectively Xn the samples obtained at the n-th trial. Stop at the first trial N when the inequality cUn ≤ q(Xn)/p(Xn)
is observed. Set Y = XN . The results in the above exercise show that the expected
waiting time to observe this inequality is c and the random number Y samples the
distribution q(y)dy. t
u
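A minimal sketch of this recipe (ours; the concrete densities are only an illustration): we sample q(x) = 2x on [0, 1] from the uniform proposal p ≡ 1, so that c = 2 works, and record the average number of trials per accepted sample, which should be close to c.

```python
import numpy as np

def sample_triangular(num_samples, seed=0):
    """Acceptance-rejection sampling of q(x) = 2x on [0, 1] from the uniform
    proposal p = 1 with c = 2 (so q <= c*p everywhere).  Returns the accepted
    samples and the average number of trials per accepted sample."""
    rng = np.random.default_rng(seed)
    samples, trials = [], 0
    while len(samples) < num_samples:
        x, u = rng.uniform(), rng.uniform()
        trials += 1
        if 2.0 * u <= 2.0 * x:      # c * U_n <= q(X_n)/p(X_n), here U_n <= x
            samples.append(x)
    return np.array(samples), trials / num_samples

ys, avg_trials = sample_triangular(20_000)
```

Here E[Y] = ∫₀¹ y · 2y dy = 2/3, and the expected number of trials is the constant c = 2, in agreement with part (iii) of the exercise.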

Exercise 2.5 (Bernstein). For each x ∈ [0, 1] we consider a sequence (Bkx )k∈N of
i.i.d. Bernoulli random variables with probability of success x. We set
S^x_n = ∑_{k=1}^{n} B^x_k.

Note that Snx /n ∈ [0, 1] and the SLLN shows that

Snx /n → x a.s. as n → ∞.

The dominated convergence theorem implies that for any continuous function f : [0, 1] → R we have
lim_{n→∞} E[ f(S^x_n/n) ] = f(x).

Set
B^f_n(x) := E[ f(S^x_n/n) ].
(i) Show that
B^f_n(x) = ∑_{k=0}^{n} \binom{n}{k} x^k (1 − x)^{n−k} f(k/n).

(ii) Prove that as n → ∞ the polynomials Bnf (x) converge uniformly on [0, 1]
to f (x).

Hint. For (ii) imitate the argument in Step 2 of the proof of Theorem 2.41. t
u
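The Bernstein polynomials of part (i) can be evaluated directly; the sketch below is ours (the name `bernstein` is an assumption).

```python
from math import comb

def bernstein(f, n, x):
    """Degree-n Bernstein polynomial B_n^f(x) = sum_k C(n,k) x^k (1-x)^(n-k) f(k/n)."""
    return sum(comb(n, k) * x**k * (1 - x)**(n - k) * f(k / n) for k in range(n + 1))
```

Two exact identities make good checks: Bernstein polynomials reproduce affine functions exactly, and for f(t) = t² one has B_n^f(x) = x² + x(1 − x)/n, since Var(S^x_n/n) = x(1 − x)/n.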

Exercise 2.6. Suppose that Xn ∈ L2 (Ω, S, P) is a sequence of random variables


with mean zero and variance one such that
 
lim_{k→∞} E[Xm Xm+k] = 0, uniformly in m.

Prove that
(1/n)(X1 + · · · + Xn) → 0 in probability as n → ∞. t
u

Fig. 2.5 The graph of f(x) = sin(4πx) (the continuous curve) and of the degree 50 Bernstein polynomial B^f_50(x) (the dotted curve).

Exercise 2.7. Suppose that a player rolls a die an indefinite amount of times. More
formally, we are given a sequence independent random variables (Xn )n∈N , uniformly
distributed on I6 := {1, 2, . . . , 6}. For k ∈ N, we say that a k-run occurred at time
n if n ≥ k and
Xn = Xn−1 = · · · = Xn−k+1 = 6.
For n ∈ N we set
Rn = R^k_n := #{ m ≤ n; a k-run occurred at time m },  T = Tk := min{ n ≥ k; Rn > 0 }.
Thus T is the moment when the first k-run occurs. As shown in Example 1.167, E[T] < ∞.
(i) Compute E[T].
(ii) Prove that Rn/n converges in probability to 1/6^k. Hint. For n ≥ k set
Yn := I_{Xn=6} · · · I_{X_{n−k+1}=6}.
Observe that Rn = Yk + · · · + Yn.

t
u

Exercise 2.8 (A. Renyi). Suppose that (An )n≥0 is a sequence of events in the
sample space (Ω, S, P) with the following properties.

• A0 = Ω.

• P[An] ≠ 0, ∀n ≥ 0.
• There exists ρ ∈ (0, 1] satisfying
lim_{n→∞} P[An | Ak] = ρ, ∀k ≥ 0. (2.6.1)
Set Xn := I_{An} − ρ.

(i) Prove that
lim_{n→∞} E[Xn Xk] = 0, ∀k ∈ N.
(ii) Prove that for any X ∈ L²(Ω, S, P) we have
lim_{n→∞} E[Xn X] = 0.
(iii) Conclude that the sequence (An) satisfies the mixing condition with density ρ:
lim_{n→∞} P[An ∩ A] = ρ P[A], ∀A ∈ S. (2.6.2)
Thus, in the long run, the set An occupies the same proportion ρ of any measurable set A.

t
u

Exercise 2.9 (A. Renyi). Suppose that (Xn )n∈N is a sequence of i.i.d., almost
surely finite random variables. Set
Mn := (X1 + · · · + Xn)/n.
Assume that the empirical means Mn converge in probability to a random variable
M. The goal of the exercise is to prove that M is a.s. constant. We argue by contradiction. Assume M is not a.s. constant. Let F : R → [0, 1] be the cdf of M, F(m) = P[M ≤ m].

(i) Prove that there exist two continuity points a < b of F(x) such that
p0 := F(b) − F(a) = P[a < M ≤ b] ∈ (0, 1).
(ii) Prove that there exists ν0 ∈ N such that
P[a < Mn ≤ b] > 0, ∀n ≥ ν0.
(iii) Set A0 = Ω and
An = { a < Mν0+n ≤ b }, n ≥ 1.
Prove that the sequence (An) satisfies the condition (2.6.1) with ρ = p0.
(iv) Set B := { a < M ≤ b }. Prove that the restriction of Mn to (B, S|B, P[· | B]) converges in probability to M|B. Here
S|B = { S ∩ B; S ∈ S }.
(v) Deduce that p0 = 1, thus contradicting (i).

t
u

Exercise 2.10 (Vitali-Hahn-Saks). Suppose that (Ω, S, µ) is a probability space. Define an equivalence relation on S by setting S ∼ S′ if µ(S∆S′) = 0, where ∆ denotes the symmetric difference
S∆S′ = (S \ S′) ∪ (S′ \ S).
Define d : S × S → [0, ∞),
d(S0, S1) := µ(S0∆S1).

(i) Prove that ∀S0, S1, S2 ∈ S we have
d(S0, S1) = d(S1, S0),  d(S0, S2) ≤ d(S0, S1) + d(S1, S2),
and d(S0, S1) = 0 iff S0 ∼ S1.
(ii) Prove that d defines a complete metric d̄ on S̄ := S/∼.
(iii) Suppose that λ : S → R is a probability measure that is absolutely continuous with respect to µ. Hence λ(S0) = λ(S1) if S0 ∼ S1. Prove that the induced function
λ̄ : S̄ → R
is continuous with respect to the metric d̄.


(iv) Suppose that (λn) is a sequence of probability measures on S such that λn ≪ µ for any n ∈ N and, ∀S ∈ S, the sequence λn(S) has a finite limit λ(S). Prove that λ : S → R is finitely additive and λ(S) = 0 if µ(S) = 0.


(v) For any ε > 0 and k ∈ N we set


 
Sk,ε := S ∈ S; sup λl S − λk+n S ≤ ε .
   
m∈N

Prove that the sets Sk,ε ⊂ S are closed with respect to the metric d and
[
S= Sk,ε , ∀ε > 0.
k∈N

(vi) Prove that λ̄ : S̄ → [0, 1] is continuous and deduce that λ is a probability measure. Hint. It suffices to show that for any decreasing sequence (Sn) in S with empty intersection we have lim λ(Sn) = 0. Deduce this from (v) and Baire’s theorem.

t
u

Exercise 2.11 (A. Renyi). Let (Ω, S, P) be a probability space and suppose that (An) is a stable sequence of events, i.e., for any B ∈ S the sequence P[An ∩ B] has a finite limit λ(B) and λ(Ω) ∈ (0, 1). Prove that λ : S → [0, 1] is a finite measure absolutely continuous with respect to P, λ ≪ P. Denote by ρ the density of λ with respect to P, ρ = dλ/dP. The function ρ is called the density of the stable sequence of events. Hint. Use Exercise 2.10. t
u

Exercise 2.12 (A. Renyi). Let (Ω, S, P) be a probability space and suppose that (An)n∈N is a sequence of events such that the limits
λ0 = lim_{n→∞} P[An],  λk := lim_{n→∞} P[Ak ∩ An], k ∈ N,
exist and λ0 ∈ (0, 1). Denote by X the linear span of the indicators I_{An} and by X̄ its closure in L².

(i) Prove that ∀ξ ∈ X there exists a limit
L(ξ) := lim_{n→∞} E[ξ I_{An}].
(ii) Prove that ∀ξ ∈ L²(Ω, S, P) there exists a limit
L(ξ) = lim_{n→∞} E[ξ I_{An}].
(iii) Show that there exists ρ ∈ L²(Ω, S, P) such that L(ξ) = E[ρξ], ∀ξ ∈ L²(Ω, S, P).
(iv) Show that (An)n∈N is a stable sequence with density ρ. (Note that when ρ is constant the sequence satisfies the mixing condition (2.6.2) with density ρ = λ0.)

t
u

Exercise 2.13. Suppose that f : [0, 1] → [0, 1] is a continuous function that is not
identically 0 or 1. For n ∈ N we set
An = ⋃_{k=0}^{n−1} [ k/n, k/n + f(k/n)/n ).

Show that (An )n≥1 is a stable sequence of events and compute its density. t
u

Exercise 2.14. Suppose that π is a probability measure on In = {1, 2, . . . , n}, pi = π({i}). Consider a sequence (Xn)n∈N of i.i.d. random variables uniformly distributed on [0, 1]. For j ∈ In and m ∈ N we set
Z_{m,j} := #{ 1 ≤ k ≤ m; ∑_{i=0}^{j−1} pi ≤ Xk < ∑_{i=0}^{j} pi } (with p0 := 0),  Hm := (1/m) ∑_{j=1}^{n} Z_{m,j} log₂ pj.
Prove that
lim_{m→∞} Hm = −Ent₂(π) = ∑_{j=1}^{n} pj log₂ pj, a.s.

t
u

Exercise 2.15. Let (Xn)n∈N be a sequence of i.i.d. Bernoulli random variables with
success probability 1/2 and (Yn)n∈N a sequence of i.i.d. Bernoulli random variables
with success probability 1/3. (The sequences (Xn) and (Yn) may not be independent
of each other.) Set B = {0, 1} and denote by Fn the sigma-algebra of B^N generated
by the cylinders
Cₖ^ε := { x = (x1, x2, . . . ) ∈ B^N ; xk = ε }, k = 1, 2, . . . , n, ε = 0, 1.
We set
F := ⋃_{n∈N} Fn.
The sequences (Xn) (resp. (Yn)) define probability measures P = Ber(1/2)^⊗N (resp.
Q = Ber(1/3)^⊗N) on B^N; see Subsection 1.5.1. Denote by Pn (resp. Qn) the restrictions
of P (resp. Q) to Fn.

(i) Prove that for any n ∈ N the measure Qn is absolutely continuous with respect
to Pn. Compute the density dQn/dPn of Qn with respect to Pn.
(ii) Prove that Q is not absolutely continuous with respect to P. Hint. Use the Law
of Large Numbers. □

Exercise 2.16. Show that the Gaussian measures Γv[dx] = γ0,v(x) dx,
γ0,v(x) := (1/√(2πv)) e^{−x²/(2v)},
converge weakly to the Dirac measure δ0 on R as the variance v converges to 0.
Hint. Use Chebyshev's inequality (1.3.17): Γv[ |x| > c ] ≤ v/c². □

Exercise 2.17. Let (Xn) be a sequence of geometric random variables,
Xn ∼ Geom(1/n). Prove that
Yn := (1/n) Xn ⇒ X ∼ Exp(1).
Hint. Show that P[ Yn > y ] → e^{−y} as n → ∞, ∀y ≥ 0. □
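The convergence in Exercise 2.17 is easy to check by simulation. The sketch below (a minimal illustration, not part of the text; it assumes Geom(p) counts the trials up to and including the first success, and samples it by inverse transform) estimates the tail P[Yn > 1] for large n and compares it with e^{−1}.

```python
import math
import random

def geom(p, rng):
    """Sample Geom(p): the number of independent trials up to and
    including the first success (inverse-transform sampling)."""
    u = rng.random()
    return max(1, math.ceil(math.log1p(-u) / math.log1p(-p)))

def tail_estimate(n, y, samples, seed=0):
    """Empirical P(Y_n > y) for Y_n = X_n / n, X_n ~ Geom(1/n)."""
    rng = random.Random(seed)
    return sum(geom(1.0 / n, rng) > y * n for _ in range(samples)) / samples

# For large n the tail P(Y_n > y) should approach e^{-y}.
print(tail_estimate(10_000, 1.0, 20_000))  # close to exp(-1) ≈ 0.368
```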
 

Exercise 2.18. Fix λ > 0. Show that as n → ∞ we have Bin(n, λ/n) ⇒ Poi(λ),
where Bin(n, λ/n) denotes the binomial probability distribution corresponding to
n independent trials with success probability λ/n and Poi(λ) denotes the Poisson
distribution with parameter λ. □
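Exercise 2.18 can be checked numerically: since both distributions are supported on the integers, one can compute the total variation distance between Bin(n, λ/n) and Poi(λ) exactly with standard-library arithmetic. The sketch below (an illustration, with λ = 3 an arbitrary choice) shows the distance shrinking as n grows.

```python
import math

def binom_pmf(n, p, k):
    """P(Bin(n, p) = k)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(lam, k):
    """P(Poi(lam) = k)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def tv_distance(n, lam, kmax=60):
    """Total variation distance between Bin(n, lam/n) and Poi(lam),
    truncated at kmax (the neglected tails are negligible here)."""
    return 0.5 * sum(abs(binom_pmf(n, lam / n, k) - poisson_pmf(lam, k))
                     for k in range(min(n, kmax) + 1))

for n in (10, 100, 1000):
    print(n, tv_distance(n, 3.0))  # the distance shrinks as n grows
```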

Exercise 2.19 (Occupancy Problem). Suppose that n balls are successively
and randomly placed in r boxes, i.e., all boxes are equally likely to be the
destination of a given ball. Let Nr = Nr,n denote the number of empty boxes.

(i) Compute the expectation and variance of Nr.
Hint. Nr = Σ_{k=1}^{r} I_{B_{k,n}}, where B_{k,n} = { box k is empty after all the n balls
have been randomly placed in the r boxes }.
(ii) Show that if n/r → c > 0 as r → ∞, then Nr/r → e^{−c} in probability.
(iii) Compute P[ Nr,n = k ]. Hint. Use the inclusion-exclusion equality (1.3.25).

(iv) Show that if re^{−n/r} → λ as r → ∞, then Nr,n converges in distribution to
Poi(λ). □
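Part (ii) is easy to see empirically. The sketch below (a minimal Monte Carlo illustration; the box and ball counts are arbitrary choices) throws n = cr balls into r boxes and compares the fraction of empty boxes with e^{−c}.

```python
import math
import random

def empty_boxes(n, r, rng):
    """Throw n balls uniformly into r boxes; return the number of empty boxes."""
    occupied = set(rng.randrange(r) for _ in range(n))
    return r - len(occupied)

r, c = 100_000, 2.0
frac = empty_boxes(int(c * r), r, random.Random(1)) / r
print(frac, math.exp(-c))  # the two numbers should nearly agree
```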

Remark 2.91. Let me comment on why the result in Exercise 2.19 is surprising. Con-
sider the following concrete situation.
Assume n = 2r and suppose that we want to distribute 2r gifts to r children.
We want to do this in the "fairest" possible way since the gifts, of equal value, are
different, and several kids may desire the same gift. To remove any bias, "common
sense" suggests that each gift should be given to a child chosen uniformly at random.
There are twice as many gifts as children so what can go wrong? Part (ii) of this
exercise shows that for r large, almost surely, about e^{−2} r ≈ 0.135 r children will
receive no gifts! □

Exercise 2.20 (Coupon collector problem). For n ∈ N denote by Nn the num-
ber of boxes of cereal one has to purchase in order to obtain all the n coupons of a
collection; see Example 1.112. Recall that E[ Nn ] ∼ n log n as n → ∞. Prove that
lim_{n→∞} P[ Nn − n log n ≤ nx ] = exp( −e^{−x} ).
Hint. Reduce to Exercise 2.19(iv). □
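The Gumbel limit above can be observed directly. The sketch below (an illustrative simulation; n = 200 coupons and x = 1 are arbitrary choices) estimates P[Nn − n log n ≤ nx] and compares it with exp(−e^{−x}).

```python
import math
import random

def coupons_needed(n, rng):
    """Number of purchases until all n coupon types have been collected."""
    seen, count = set(), 0
    while len(seen) < n:
        seen.add(rng.randrange(n))
        count += 1
    return count

rng = random.Random(2)
n, trials, x = 200, 2000, 1.0
hits = sum(coupons_needed(n, rng) - n * math.log(n) <= n * x
           for _ in range(trials))
frac = hits / trials
print(frac, math.exp(-math.exp(-x)))  # empirical vs Gumbel limit
```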

Exercise 2.21. For N ∈ N denote by BN the birthday random variable defined in
Exercise 1.21.

(i) Show that as N → ∞, the sequence of random variables
XN := BN/√N
converges in law to a Rayleigh random variable, i.e., a random variable X with
probability distribution
PX[dx] = x e^{−x²/2} I_{[0,∞)}(x) dx.
(ii) Prove that
lim_{N→∞} E[ XN ] = E[ X ] = √(π/2).
Hint. Observe that P[ X > x ] = e^{−x²/2}. Using the Taylor expansion of log(1 − t) at t = 0 show that
log P[ XN > x ] ≤ −x²/2 and lim_{N→∞} log P[ XN > x ] = −x²/2, ∀x > 0.
□
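The Rayleigh limit of the birthday variable can be illustrated numerically. The sketch below assumes, as in the classical birthday problem, that B_N is the number of uniform draws from a set of size N until some value appears a second time; it compares E[X_N] ≈ E[B_N]/√N with √(π/2).

```python
import math
import random

def first_repeat(N, rng):
    """Number of uniform draws from {0, ..., N-1} made until some value
    appears for the second time."""
    seen = set()
    while True:
        v = rng.randrange(N)
        if v in seen:
            return len(seen) + 1
        seen.add(v)

rng = random.Random(3)
N, trials = 10_000, 4000
mean = sum(first_repeat(N, rng) for _ in range(trials)) / (trials * math.sqrt(N))
print(mean, math.sqrt(math.pi / 2))  # E[X_N] vs E[X] = sqrt(pi/2)
```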

Exercise 2.22 (P. Lévy). Consider the random variables Ln defined in Exer-
cise 1.11. Prove that as n → ∞ the random variables Ln/n converge in distribution
to the arcsine distribution Beta(1/2, 1/2); see Example 1.122. Hint. You need to use
Stirling's formula (A.1.7) with the error estimate (A.1.8). □

Exercise 2.23. Suppose that (Xn)n∈N and (Yn)n∈N are two sequences of random
variables such that Xn → X in distribution and
lim_{n→∞} P[ Xn ≠ Yn ] = 0.
Prove that Yn → X in distribution. □

Exercise 2.24. Suppose that (Xn )n∈N and (Yn )n∈N are two sequences of random
variables such that

• Xn converges in distribution to X.
• Yn converges in distribution to Y .
• Xn is independent of Yn for every n and X is independent of Y .

Prove the following.

(i) The random vector (Xn , Yn ) converges in distribution to (X, Y ).


(ii) The sum Xn + Yn converges in distribution to X + Y .

t
u

Exercise 2.25. Suppose that (Xn)n∈N and (Yn)n∈N are sequences of random vari-
ables with the following properties.

(i) The random variables (Xn)n∈N are identically distributed.
(ii) The sequence of random vectors (Xn, Yn) converges in distribution to the ran-
dom vector (X, Y).

Prove that for any Borel measurable function f : R → R the sequence of random
vectors ( f(Xn), Yn ) converges in distribution to ( f(X), Y ). Hint. Fix a Borel measurable
function f. It suffices to show that for any continuous and bounded functions u, v : R → R we have
lim_{n→∞} E[ u( f(Xn) ) v(Yn) ] = E[ u( f(X) ) v(Y) ].
Consider the Borel measurable functions vn defined by vn(Xn) = E[ v(Yn) k Xn ]. □

Exercise 2.26. Suppose that (Xn )n∈N and (Yn )n∈N are two sequences of random
variables such that Xn converges in distribution to X and Yn converges in probability
to the constant c. Prove that the random vector (Xn , Yn ) converges in distribution
to (X, c). Hint. Prove that (Xn , c) converges in probability to (X, c) and then use Exercise 2.23. t
u

Exercise 2.27. Suppose that (Xn)n∈N is a sequence of i.i.d. L² random variables
with µ = E[ Xn ], σ² = Var[ Xn ]. Set
X̄n = (1/n)( X1 + · · · + Xn ), Yn = (1/(n−1)) Σ_{k=1}^{n} ( Xk − X̄n )².
Prove that E[ Yn ] = σ² and Yn → σ² in probability. □

Exercise 2.28 (Trotter). We outline a proof of the CLT that does not rely on
the characteristic function.
For any random variable X and any f ∈ Cb(R) we denote by TX f the function
R → R given by
TX f(y) = E[ f(X + y) ], y ∈ R.

(i) Prove that the correspondence f ↦ TX f induces a bounded linear map
Cb(R) → Cb(R) satisfying ‖TX f‖ ≤ ‖f‖, where
‖g‖ = sup_{x∈R} |g(x)|, ∀g ∈ Cb(R).
(ii) Show that if X ⊥⊥ Y, then TX ∘ TY = T_{X+Y} = TY ∘ TX.
(iii) Let (Xn)n∈N be a sequence of i.i.d. square integrable random variables such that
E[ Xn ] = 0, Var[ Xn ] = 1, ∀n ∈ N.
Additionally, fix a sequence (Yn)n∈N of independent standard normal random
variables that are also independent of the Xn's. Set
Un = (1/√n) Σ_{k=1}^{n} Xk, Vn = (1/√n) Σ_{k=1}^{n} Yk.
Prove that for any f ∈ Cb(R) we have
‖T_{Un} f − T_{Vn} f‖ ≤ n ‖T_{X1/√n} f − T_{Y1/√n} f‖.
(iv) Let f ∈ C²₀(R). Show that
lim_{n→∞} n ‖T_{X1/√n} f − T_{Y1/√n} f‖ = 0.
(v) Prove that T_{Vn} = T_Z, where Z is a standard normal random variable.
(vi) Show that for any f ∈ C²₀(R) we have
lim_{n→∞} E[ f(Un) ] = E[ f(Z) ].

□

Exercise 2.29. Suppose that (Xn)n∈N is a sequence of i.i.d. Bernoulli random
variables with success probability p = 1/2. For each n ∈ N we set
Sn := Σ_{k=1}^{n} Xk/2^k.

(i) Find the probability distribution of Sn.
(ii) Prove that for any p ∈ [1, ∞] the sequence Sn converges a.s. and in Lp to a
random variable S uniformly distributed on [0, 1].
(iii) Compute the characteristic functions Fn(ξ) = E[ exp(iξSn) ] and deduce
Viète's formula
sin ξ / ξ = ∏_{n=1}^{∞} cos( ξ/2ⁿ ).

(iv) Suppose that µ is a Borel probability measure on R with quantile Q : [0, 1] → R,
Q(p) = inf{ x ∈ R ; µ( (−∞, x] ) ≥ p }.
Prove that the sequence Q(Sn) converges a.s. to a random variable with dis-
tribution µ. Have a look at Example 1.45.

□

Remark 2.92. Part (iv) of the above exercise is essentially a universality property
of the simplest random experiment: tossing a fair coin. If we are able to perform
this experiment repeatedly and independently, then we can approximate any prob-
ability distribution. In other words, we can approximately sample any probability
distribution by flipping fair coins. □
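The coin-flip sampling scheme of Exercise 2.29(iv) can be sketched concretely. The illustration below (a minimal sketch; the target Exp(1), whose quantile is Q(p) = −log(1 − p), is an arbitrary choice) builds S_n from fair-coin flips and applies Q, then checks that the sample mean is near E[Exp(1)] = 1.

```python
import math
import random

def quantile_exp(p):
    """Quantile of Exp(1): Q(p) = -log(1 - p)."""
    return -math.log1p(-p)

def sample_from_coins(n_flips, rng):
    """Approximate an Exp(1) sample from fair-coin flips: S = sum X_k/2^k
    is (nearly) uniform on [0, 1]; then apply the quantile Q."""
    s = sum(rng.randrange(2) / 2 ** k for k in range(1, n_flips + 1))
    return quantile_exp(s)

rng = random.Random(4)
samples = [sample_from_coins(40, rng) for _ in range(20_000)]
mean = sum(samples) / len(samples)
print(mean)  # should be close to E[Exp(1)] = 1
```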

Exercise 2.30. Suppose that (Xn)n∈N is a sequence of i.i.d. random variables uni-
formly distributed in [0, L], L > 0. For n ∈ N we set
X(n) := max{ X1, X2, . . . , Xn }.
Prove that lim_{n→∞} E[ X(n) ] = L and X(n) → L in probability. Hint. Have a look at
Exercise 1.44. □

Exercise 2.31. Suppose that (Xn)n∈N is a sequence of i.i.d. random variables uni-
formly distributed in [0, 1]. Denote by X^n_(1), X^n_(2), . . . , X^n_(n) the order statistics of the
first n of them; see Exercise 1.44. Prove that for any k ∈ N the random variable
nX^n_(k) converges in distribution to Gamma(1, k). □

Exercise 2.32. Suppose that (µn)n≥0 is a sequence of finite Borel measures. De-
note by Fn their distribution functions,
Fn(x) = µn( (−∞, x] ), ∀n ∈ N, x ∈ R.
Let µ ∈ Meas(R) be a finite Borel measure with distribution function
F(x) = µ( (−∞, x] ), ∀x ∈ R.
Prove that the following are equivalent.

(i) The measures (µn) converge vaguely to µ.
(ii) For every point x of continuity of F,
lim_{n→∞} Fn(x) = F(x).

□

Exercise 2.33. Suppose that (µn)n∈N is a sequence of Borel measures on R such
that
sup_n µn(R) < ∞.
Set Fn(x) = µn( (−∞, x] ), x ∈ R. Prove that (µn) contains a subsequence converg-
ing vaguely to a finite measure. Hint. Construct a subsequence (nk) such that the
sequence Fnk(q) is convergent for any q ∈ Q. □

Exercise 2.34. A subset M ⊂ Prob(R) is called tight if for any ε > 0 there exists a
compact set K ⊂ R such that µ( R \ K ) < ε, ∀µ ∈ M. Prove that a tight sequence of
probability measures admits a weakly convergent subsequence. Hint. Use Exercise 2.33.
□

Exercise 2.35. Let µ ∈ Prob(R) be a Borel probability measure with characteristic
function µ̂. Prove that for any r > 0 we have
µ( {|x| > r} ) ≤ (1/C) ∫₀¹ ( 1 − Re µ̂(t/r) ) dt, C := inf_{|x|≥1} ( 1 − sin x / x ).
□

Exercise 2.36. This exercise describes a strengthening of Lévy's continuity the-
orem. Suppose that (µn) is a sequence of Borel probability measures on R with
characteristic functions µ̂n(ξ). Assume that the functions µ̂n(ξ) converge point-
wise to a function f : R → C that is continuous at 0.

(i) Prove that the sequence (µn)n∈N is tight. Hint. Use Exercise 2.35.
(ii) Show that f is the characteristic function of a Borel probability measure µ.
Hint. Use Exercise 2.34.
(iii) Prove that µn converges weakly to µ.

□

Exercise 2.37. Suppose that (µn) is a sequence of Borel probability measures on
R that converges weakly to a probability measure µ. Prove that the characteristic
functions µ̂n converge to µ̂ uniformly on compact subsets of R. □

Exercise 2.38. Suppose that X is a random variable and ϕ(ξ) is its characteristic
function,
ϕ(ξ) = E[ e^{iξX} ].
Prove that the following are equivalent.

(i) X is a.s. constant.
(ii) There exists r > 0 such that |ϕ(ξ)| = 1, ∀ξ ∈ [−r, r].

Hint. Use an independent copy X′ of X. □

Exercise 2.39 (Lévy). The concentration function of a random variable X is the
function
CX : [0, ∞) → [0, 1], CX(r) := inf_{x∈R} P[ |X − x| > r ].

(i) Prove that for any r > 0 there exists xr ∈ R such that
CX(r) = P[ |X − xr| > r ].
CX (r) = P |X − xr | > r .

 
(ii) Prove that if Var X < ∞ is integrable, then
1  
CX (r) ≤ 2 Var X .
r
(iii) Prove that if X, Y are independent random variables, then

CX+Y (r) ≥ max CX (r), CY (r) , ∀r > 0.
(iv) Suppose that (Xn ) is a sequence of independent random variables. We set
Sn := X1 + · · · + Xn , Cn,N = CSN −Sn .
Show that the limits
Cn (r) = lim Cn,N (r)
N →∞
exists for every n, and the resulting sequence Cn (r) is nondecreasing.
(v) Show that
lim Cn (r)
n→∞
is independent of r and it is either 0 or 1.

t
u

Exercise 2.40. A probability measure µ ∈ Prob(R) is said to be an infinitely
divisible distribution if for any n ∈ N there exists µn ∈ Prob(R) such that
µ = µn^{∗n} := µn ∗ · · · ∗ µn (n convolution factors).
A random variable is called infinitely divisible if its distribution is such.

(i) Prove that the Poisson distributions and the Gaussian distributions are in-
finitely divisible.
(ii) Prove that any linear combination of independent infinitely divisible random
variables is an infinitely divisible random variable.
(iii) Suppose that (Xn)n∈N is a sequence of i.i.d. random variables with common
distribution ν ∈ Prob(R). Denote by N(t), t ≥ 0, a Poisson process with
intensity λ > 0; see Example 1.136. For t ≥ 0 we set
Y(t) = Σ_{k=1}^{N(t)} Xk.
The distribution of Y(t), denoted by Qt, is called a compound Poisson dis-
tribution. The distribution ν is called the compounding distribution. Show
that
Qt = e^{−λt} Σ_{n=0}^{∞} ( (λt)ⁿ/n! ) ν^{∗n}
and deduce that Qt ∗ Qs = Qt+s, ∀t, s ≥ 0. In particular Qt is infinitely
divisible.

(iv) Compute the characteristic function of Qt .


(v) Prove that any weak limit of infinitely divisible distributions is also infinitely
divisible.

t
u

Exercise 2.41. Give an example of a sequence of random variables
Xn ∈ L¹(Ω, S, P) such that Xn converge in distribution to 0 but
lim_{n→∞} E[ Xn ] = ∞. □

Exercise 2.42 (Skorokhod). Suppose that µn ∈ Prob(R), n ∈ N, is a sequence
converging weakly to µ. Denote by Fn : R → [0, 1] the distribution function of µn,
Fn(x) = µn( (−∞, x] ),
and by Qn the associated quantile function (see (1.2.5))
Qn : [0, 1] → R, Qn(t) = inf{ x ; t ≤ Fn(x) }.
We can regard Qn as random variables defined on the probability space
([0, 1], B_{[0,1]}, λ_{[0,1]}),
where λ_{[0,1]} denotes the Lebesgue measure on [0, 1]. As shown in Example 1.44,
µn = (Qn)# λ_{[0,1]}, so that µn is the probability distribution of Qn. Prove that the
sequence Qn converges a.s. on [0, 1] to a random variable with probability distribution µ.
In other words, given any sequence µn ∈ Prob(R) that converges weakly to
µ ∈ Prob(R), we can find a sequence of random variables Xn, defined on the same
probability space, with P_{Xn} = µn and such that Xn converges a.s. to a random
variable X with distribution µ. □

Exercise 2.43. Suppose that the sequence of random variables Xn : (Ω, S, P) → R,


n ∈ N, converges in distribution to the random variable X. Prove that for any
continuous function f : R → R the random variables f (Xn ) converge in distribution
to f (X). Hint. Use Exercise 2.42. t
u

Exercise 2.44. Let µ be a Borel probability measure on R satisfying
∃r0 > 0 : ∫_R e^{tx} µ[dx] < ∞, ∀|t| < r0.

(i) Let p ∈ [1, ∞). Prove that the map
L^p(R, µ) ∋ f ↦ Tf ∈ Cb(R, C), (Tf)(ξ) = ∫_R e^{iξx} f(x) µ[dx],
is injective. Hint. Reduce to Theorem 2.39 by writing f = f₊ − f₋.



(ii) Let f ∈ L²(R, µ). Prove that there exists r1 > 0 such that for any complex num-
ber z with |Re z| < r1 the complex valued function R ∋ x ↦ e^{izx} f(x) ∈ C
is µ-integrable and the resulting function
z ↦ F(z) = ∫_R e^{izx} f(x) µ[dx]
is holomorphic in the strip |Re z| < r1.
(iii) Prove that R[x], the space of polynomials with real coefficients, is dense in
L²(R, µ). Hint. You have to show that if f ∈ L²(R, µ) satisfies
∫_R f(x) xⁿ µ[dx] = 0, ∀n ≥ 0,
then f = 0 µ-a.s. Use (i) and (ii) to achieve this.
(iv) Consider the Hermite polynomials (Hn(x))n≥0 described in Exercise 1.24.
Prove that the collection
(1/√(n!)) Hn, n ≥ 0,
is a complete orthonormal basis of the Hilbert space L²(R, γ1), where γ1 is the
standard Gaussian measure on R.

□

Exercise 2.45. Suppose that µ0, µ1 are two Borel probability measures such that
∃t0 > 0 :
∫_R e^{tx} µ0[dx] = ∫_R e^{tx} µ1[dx], ∀|t| < t0.
Fix r > 0 as in Exercise 2.44(ii) such that for any complex number z the functions
z ↦ Fk(z) = ∫_R e^{izx} µk[dx], k = 0, 1,
are well defined and holomorphic in the strip |Re z| < r. Show that F0 = F1
and deduce that µ0 = µ1. Hint. Set F = F1 − F0. Use the Cauchy-Riemann equations to prove
that dⁿF/dzⁿ |_{z=t} = 0, ∀n ∈ N, ∀t ∈ (−r, r). □

Exercise 2.46 (De Moivre). Let Xn ∼ Bin(n, 1/2) and Y ∼ N(0, 1). Prove that
lim_{n→∞} P[ |Xn − n/2| ≤ (r/2)√n ] / P[ |Y| < r ] = 1, ∀r > 0. □

Exercise 2.47 (t-statistic). Suppose that (Xn)n∈N is a sequence of i.i.d. random
variables such that E[ Xn ] = 0, E[ Xn² ] = σ² < ∞, ∀n. We set
Mn = (1/n) Σ_{k=1}^{n} Xk, Vn = (1/(n−1)) Σ_{k=1}^{n} ( Xk − Mn )², Tn = √n Mn/√Vn.

(i) Prove that Vn converges in probability to σ².



(ii) Prove that Tn converges in distribution to a standard normal random variable.
Hint. Use the CLT and Slutsky's theorem. □
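The normal limit of the t-statistic is visible even for markedly non-Gaussian inputs. The sketch below (an illustration; the centered Exp(1) variables, with mean 0 and variance 1, are an arbitrary choice) estimates P[|Tn| ≤ 1.96] and compares it with the standard normal value ≈ 0.95.

```python
import math
import random

def t_statistic(xs):
    """T_n = sqrt(n) * M_n / sqrt(V_n) for a sample xs."""
    n = len(xs)
    m = sum(xs) / n
    v = sum((x - m) ** 2 for x in xs) / (n - 1)
    return math.sqrt(n) * m / math.sqrt(v)

rng = random.Random(5)
n, trials = 500, 3000
# Centered exponential variables: mean 0, variance 1 (an arbitrary
# non-Gaussian choice).
cover = sum(
    abs(t_statistic([rng.expovariate(1.0) - 1.0 for _ in range(n)])) <= 1.96
    for _ in range(trials)
) / trials
print(cover)  # should be close to P(|N(0,1)| <= 1.96) ≈ 0.95
```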

Exercise 2.48. Suppose that X = Xλ is a Gamma(1, λ) random variable (see
Example 1.121) and Y = Yλ is a random variable such that
P[ Y = n k X ] = (Xⁿ/n!) e^{−X}, ∀n = 0, 1, 2, . . . .
In other words, conditioned on X = x the random variable Y is Poi(x).

(i) Compute the characteristic function of Y.
(ii) Show that the random variable
(1/√(Var[ Yλ ])) ( Yλ − E[ Yλ ] )
converges in distribution to N(0, 1) as λ → ∞.

□

Exercise 2.49. Suppose that X, Y are independent standard normal random variables.
Set Z = XY.

(i) Show that
MZ(λ) = 1/√(1 − λ²), |λ| < 1,
and deduce that
ΨZ(λ) ≤ λ²/( 2(1 − λ²) ), ∀λ ∈ (−1, 1).
(ii) Prove that IZ (see Proposition 2.47) satisfies
IZ(z) ≥ (1/3)(sin z)² ≥ (1/12) z², ∀|z| < π/6.

□

Exercise 2.50. Let 𝒳 be a finite set. The entropy of a random variable
X : (Ω, S, P) → 𝒳 is
Ent₂[ X ] := Ent₂[ PX ] = −Σ_{x∈𝒳} pX(x) log₂ pX(x), pX(x) = P[ {X = x} ].
Given two random variables Xi : (Ω, S, P) → 𝒳i, i = 1, 2, we define their relative
entropy to be
Ent₂[ X2 k X1 ] := −Σ_{(x1,x2)∈𝒳1×𝒳2} p_{X1,X2}(x1, x2) log₂ ( p_{X1,X2}(x1, x2) / p_{X1}(x1) ),
where p_{X1,X2}(x1, x2) = P[ {X1 = x1, X2 = x2} ].

(i) Show that if Xi : (Ω, S, P) → 𝒳i, i = 1, 2, are random variables, then
Ent₂[ X2 ] − Ent₂[ X2 k X1 ] = D_{KL}( P_{(X1,X2)}, P_{X1} ⊗ P_{X2} ),
where D_{KL} is the Kullback-Leibler divergence defined in (2.3.7).
(ii) Suppose that we are given n finite sets 𝒳i, i = 1, . . . , n, and n maps
Xi : (Ω, S, P) → 𝒳i.
We denote by Ent₂[ X1, . . . , Xn ] the entropy of the product random variable
(X1, . . . , Xn) : Ω → 𝒳1 × · · · × 𝒳n.
Prove that
Ent₂[ X1, . . . , Xn ] = Σ_{k=1}^{n} Ent₂[ Xk k (Xk−1, . . . , X1) ]. □

Exercise 2.51 (Herbst). Let φ : [0, ∞) → R, φ(x) = x log x, where 0 · log 0 := 0.
For any nonnegative random variable Z we set
H[ Z ] = Ent_φ[ Z ] = E[ φ(Z) ] − φ( E[ Z ] ).
Suppose that X is a random variable such that MX(λ) = E[ e^{λX} ] < ∞ for all λ in
an open interval J containing 0. We set HX(λ) := H[ e^{λX} ]. Prove that if
HX(λ) ≤ (λ²σ²/2) MX(λ), ∀λ ∈ J,
then X ∈ G(σ²). □

Exercise 2.52 (Poincaré phenomenon). Denote by Sⁿ the unit sphere in R^{n+1},
Sⁿ := { (x0, x1, . . . , xn) ; Σ_{k=0}^{n} xk² = 1 }.
Suppose that (X0, . . . , Xn) is a random point uniformly distributed on Sⁿ with
respect to the canonical Euclidean volume on Sⁿ. Prove that there exists C > 0
such that
∀n ∈ N, r ∈ [0, 1] : P[ |X0| > r ] ≤ C e^{−nr²/2}.
Thus for spheres of large dimension n most of the volume is concentrated near
the Equator {x0 = 0}! Hint. Choose independent standard normal random variables Y0, . . . , Yn and set
Z = Y0² + · · · + Yn². Show that the random vector
(X0, . . . , Xn) = (1/√Z)( Y0, . . . , Yn )
is uniformly distributed on Sⁿ. To conclude use Exercise 1.38. □
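The concentration near the equator can be observed with the very construction suggested in the hint: normalize a standard Gaussian vector to get a uniform point on the sphere. The sketch below (an illustration; n = 400 and r = 0.1 are arbitrary choices) compares the empirical tail probability P[|X0| > r] with e^{−nr²/2}.

```python
import math
import random

def sphere_coordinate(n, rng):
    """First coordinate of a uniform random point on S^n, obtained by
    normalizing an (n+1)-dimensional standard Gaussian vector."""
    ys = [rng.gauss(0.0, 1.0) for _ in range(n + 1)]
    norm = math.sqrt(sum(y * y for y in ys))
    return ys[0] / norm

rng = random.Random(6)
n, r, trials = 400, 0.1, 5000
tail = sum(abs(sphere_coordinate(n, rng)) > r for _ in range(trials)) / trials
print(tail, math.exp(-n * r * r / 2))  # empirical tail vs e^{-n r^2/2}
```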

Exercise 2.53. Suppose that ψ : [0, ∞) → [0, ∞) is an Orlicz function, i.e., it is
convex, increasing and ψ(x) → ∞ as x → ∞. Fix a probability space (Ω, S, P). For
any Orlicz function ψ and any random variable X ∈ L⁰(Ω, S, P) we set
‖X‖_ψ = inf{ t > 0 ; E[ ψ(|X|/t) ] ≤ 1 },
where inf ∅ := ∞. We set
L^ψ(Ω, S, P) = { X ∈ L⁰(Ω, S, P) ; ‖X‖_ψ < ∞ }
and we denote by L^ψ the quotient of L^ψ modulo a.s. equality.



(i) Prove that L^ψ(Ω, S, P) is a normed space.
(ii) Show that when ψ(x) = x^p, p ∈ [1, ∞), we have L^ψ = L^p.
(iii) Let ψ(x) = e^{x²} − 1. Prove that X is subgaussian if and only if X ∈ L^ψ. □

Exercise 2.54. Let V be an n-dimensional real vector space. We denote by V ∗ its


dual, V ∗ = Hom(V, R). We have a natural pairing
h−, −i : V ∗ × V → R, hξ, xi := ξ(x), ∀ξ ∈ V ∗ , x ∈ V.
A Borel probability measure µ ∈ Prob(V) is called Gaussian if for every linear
functional ξ ∈ V∗, the resulting random variable
ξ : (V, B_V, µ) → R
is Gaussian with mean m[ξ] and variance v[ξ], i.e., (see Example 1.120)
Pξ[dx] = Γ_{m[ξ],v[ξ]}[dx] = (1/√(2πv[ξ])) e^{−(x−m[ξ])²/(2v[ξ])} dx.
A random vector X : (Ω, S, P) → V is called Gaussian if its probability distribution
is a Gaussian measure on V.

(i) Show that the map V∗ ∋ ξ ↦ m[ξ] ∈ R is linear and thus defines an element
m = mµ ∈ (V∗)∗ ≅ V,
called the mean of the Gaussian measure. Moreover
mµ = ∫_V x µ[dx] ∈ V.
(ii) Define C = Cµ : V∗ × V∗ → R,
C(ξ, η) = (1/4)( v[ξ + η] − v[ξ − η] ) = Eµ[ (ξ − m[ξ])(η − m[η]) ].
Show that C is a bilinear form, it is symmetric and positive semidefinite. It
is called the covariance form of the Gaussian measure µ.
(iii) Show that if µ0, µ1 are Gaussian measures on V0 and respectively V1, then
the product µ0 ⊗ µ1 is a Gaussian measure on V0 ⊕ V1. Moreover,
m[µ0 ⊗ µ1] = mµ0 ⊕ mµ1, C_{µ0⊗µ1} = C_{µ0} ⊕ C_{µ1}.
We set
Γ1ⁿ := Γ1 ⊗ · · · ⊗ Γ1 (n factors).
Γ1ⁿ is called the canonical Gaussian measure on Rⁿ. More explicitly,
Γ1ⁿ[dx] = (1/(2π)^{n/2}) e^{−|x|²/2} dx,
where |x| denotes the Euclidean norm of the vector x ∈ Rⁿ.

(iv) Suppose that V0 , V1 are real finite dimensional vector spaces, µ is a Gaussian
measure on V0 and A : V0 → V1 is a linear map. Denote by µA the pushfor-
ward of µ via the map A, µA := A# µ. Prove that µA is a Gaussian measure
on V1 with mean mµA = Amµ and covariance form
CA : V1∗ × V1∗ → R, CA (ξ1 , η1 ) = Cµ (A∗ ξ1 , A∗ η1 ), ∀ξ1 , η1 ∈ V1∗ .
Above, A∗ : V1 → V0∗ is the dual (transpose) of the linear map A.
(v) Fix a basis {e1 , . . . , en } of V so we can identify V and V ∗ with Rn and
C with a symmetric positive semidefinite matrix. Denote by A its unique
positive semidefinite square root. Show that the pushforward A# Γ1n is a
Gaussian measure on Rn with mean zero and covariance form C = A2 .
(vi) Define the Fourier transform of a measure µ ∈ Prob(V) to be the function
µ̂ : V∗ → C, µ̂(ξ) = Eµ[ e^{iξ} ] = ∫_V e^{i⟨ξ,x⟩} µ[dx].
Show that if µ is a Gaussian measure on V with mean m and covariance form C,
then
µ̂(ξ) = e^{i m[ξ]} e^{−C(ξ,ξ)/2}, ∀ξ ∈ V∗.
(vii) Use the ideas in the proof of Theorem 2.39 to show that a Gaussian measure
is uniquely determined by its mean and covariance form. We denote by Γ_{m,C}
the Gaussian measure with mean m and covariance C.
(viii) Suppose that C is a symmetric positive definite n × n matrix. Prove that the
Gaussian measure on Rⁿ with mean 0 and covariance form C is
Γ_{0,C}[dx] = (1/√(det(2πC))) e^{−⟨C⁻¹x,x⟩/2} dx,
where ⟨−, −⟩ denotes the canonical inner product on Rⁿ. Hint. Analyze first the
case when C is a diagonal matrix. □

Exercise 2.55. Let (Ω, S, P) be a probability space and E a finite dimensional real
vector space. Recall that a Borel measurable map X : (Ω, S, P) → E is called a
Gaussian random vector if its distribution PX = X# P is a Gaussian measure on E;
see Exercise 2.54.
Suppose that X1 , . . . , Xn ∈ L0 (Ω, S, P) are jointly Gaussian random variables,
i.e., the random vector X ~ = (X1 , . . . , Xn ) : Ω → Rn is Gaussian.

(i) Prove that each of the variables X1 , . . . , Xn is Gaussian and the covariance
form
C : Rn × Rn → R
of the Gaussian measure PX~ ∈ Prob(Rn ) is given by the matrix (cij )1≤i,j≤n
 
cij = Cov Xi , Xj , ∀1 ≤ i, j ≤ n.

(ii) Prove that X1, . . . , Xn are independent if and only if the matrix (cij)1≤i,j≤n
is diagonal, i.e.,
E[ Xi Xj ] = E[ Xi ] E[ Xj ], ∀i ≠ j.
Hint. Use the results in Exercise 2.54. □

Exercise 2.56 (Gaussian regression). Suppose that X0, X1, . . . , Xn are jointly
Gaussian random variables with zero means. Let X̄0 denote the orthogonal projec-
tion of X0 ∈ L²(Ω, S, P) onto the finite dimensional subspace
span{ X1, . . . , Xn } ⊂ L²(Ω, S, P).

(i) Prove that X̄0 = E[ X0 k X1, . . . , Xn ].
(ii) Suppose that the covariance matrix C of the Gaussian vector (X1, . . . , Xn) is
invertible. Denote by L = [ℓ1, . . . , ℓn] the 1 × n matrix
ℓi = E[ X0 Xi ], i = 1, . . . , n.
Prove that
X̄0 = L · C⁻¹ · X, X := (X1, . . . , Xn)^T.
Hint. For (i) use Exercise 2.55(ii) and (1.4.10). □

Remark 2.93. The result in Exercise 2.56 is remarkable. Let us explain its typical
use in statistics.
Suppose we want to understand the random quantity X0 and all we truly under-
stand are the random variables X1, . . . , Xn. A quantity of the form f(X1, . . . , Xn) is
called a predictor, and the simplest predictors are of the form c0 + c1 X1 + · · · + cn Xn.
These are called linear predictors. The conditional expectation E[ X0 k X1, . . . , Xn ]
is the predictor closest to X0. The linear predictor closest to X0 is called the linear
regression. The coefficients c0, c1, . . . , cn corresponding to the linear regression are
obtained via the least squares approximation.
The result in the above exercise shows that, when the random variables
X0, X1, . . . , Xn are jointly Gaussian, the best predictor of X0, given X1, . . . , Xn,
is the linear predictor. This is another reason why Gaussian variables are ex-
tremely convenient to work with in practice. □

Exercise 2.57 (Maxwell). Suppose that (Xn )n∈N is a sequence of mean zero
i.i.d. random variables. For each n ∈ N we denote by Vn the random vector
Vn := (X1 , . . . , Xn ). Prove that the following are equivalent.

(i) The random variables Xn are Gaussian.


(ii) For any n ∈ N and for any orthogonal map T : Rn → Rn the random vectors
Vn and RVn have identical distributions.

t
u

Exercise 2.58. Suppose that X is a standard normal random variable and Z is a


Bernoulli random variable, independent of X, with success probability p = 12 .

(i) Prove that Y = XZ is also a standard normal random variable.


(ii) Prove that X + Y is not Gaussian.

t
u

Exercise 2.59. Suppose that T is a compact interval of the real axis, and (Xt )t∈T ,
(Yt )t∈T , (Zt )t∈T are real valued stochastic processes such that (Yt ) and (Zt ) are
modifications of (Xt ) with a.s. continuous paths. Prove that the processes (Yt ) and
(Zt ) are indistinguishable. t
u

Exercise 2.60. Fix a Brownian motion (Bt)t≥0 defined on a probability space
(Ω, S, P). Denote by E the vector subspace of L²([0, 1], λ) spanned by the functions
I_{(s,t]}, 0 ≤ s < t ≤ 1.

(i) Prove that any function f ∈ E admits a convenient representation, i.e., a
representation of the form
f = Σ_{k=1}^{n} ck I_{(sk,tk]}, ck ∈ R,
where the intervals (sj, tj], (sk, tk] are disjoint for j ≠ k.
(ii) Let f ∈ E and consider two convenient representations of f,
Σ_{k=1}^{n} ck I_{(sk,tk]} = f = Σ_{k=1}^{m} c′k I_{(s′k,t′k]}.
Show that
Σ_{k=1}^{n} ck ( B_{tk} − B_{sk} ) = Σ_{k=1}^{m} c′k ( B_{t′k} − B_{s′k} ) =: W(f).
(iii) Show that for any f ∈ E we have W(f) ∈ L²(Ω, S, P) and
‖W(f)‖_{L²(Ω)} = ‖f‖_{L²([0,1])}.
(iv) Prove that the map W : E → L²(Ω, S, P) is linear and extends to a linear isom-
etry W : L²([0, 1], λ) → L²(Ω, S, P) whose image consists of Gaussian random
variables. In other words, this isometry is a Gaussian white noise. The map
W is called the Wiener integral. It is customary to write
W(f) = ∫₀¹ f(s) dBs. □


Exercise 2.61. The space F := C([0, ∞)) of continuous functions [0, ∞) → R is
equipped with a natural metric d,
d(f, g) = Σ_{n∈N} 2^{−n} min( 1, dn(f, g) ), dn(f, g) := sup_{t∈[n−1,n]} |f(t) − g(t)|.
Denote by B_F the Borel algebra of F. For each t ≥ 0 we define Et : F → R,
Et(f) = f(t). We set
St = Et⁻¹( B_R ), ∀t ≥ 0, S = σ( ⋃_{t≥0} St ).

(i) Prove that Et is a continuous function on F, ∀t ≥ 0.
(ii) Prove that B_F = S.
(iii) Suppose that (Ω, A) is a measurable space and W : Ω → F is a map
Ω ∋ ω ↦ Wω ∈ F.
Prove that W is (A, B_F)-measurable if and only if for any t ≥ 0 the function
W_t : (Ω, A) → R, ω ↦ Wω(t)
is measurable. Hint. Use (ii). □
Chapter 3

Martingales

The usefulness of the martingale property was recognized by P. Lévy (condition (C)
in [105, Chap. VIII]), but it was J. L. Doob [47] who realized its full potential by
discovering its most important properties: optional stopping/sampling, existence of
asymptotic limits, maximal inequalities.
I have to admit that when I was first introduced to martingales they looked
alien to me. Why would anyone be interested in such things? What are really these
martingales?
I can easily answer the first question. Martingales are ubiquitous, they appear
in the most unexpected of situations, though not always in an obvious way, and
they are “well behaved”. Since their appearance on the probabilistic scene these
stochastic processes have found many applications.
As for the true meaning of this concept, let me first remark that the name
"martingale" itself is a bit unusual. It is a French word that has an equestrian meaning
(harness) but, according to [113], the term was used among French gamblers
when referring to a gambling system. I personally cannot communicate clearly,
beyond a formal definition, what the true meaning of this concept is. I believe it
is a fundamental concept of probability theory and I subscribe to R. Feynman's
attitude: it is more useful to know how electromagnetic waves behave than
to know what they look like. The same could be said about the concept of martingale
and, why not, about the concept of probability. I hope that the large selection of
examples discussed in this chapter will give the reader a sense of this concept.
This chapter is divided into two parts. The first and bigger part is devoted to
discrete time martingales. The second and smaller part is devoted to continuous
time martingales. I have included many and varied applications of martingales
with the hope that they will allow the reader to see the many facets of this concept
and convince him/her of its power and versatility. My presentation was inspired
by many sources and I want to single out [75; 33; 47; 53; 102; 103; 135; 160] that
influenced me the most.


3.1 Basic facts about martingales

We need to introduce some basic terminology.

3.1.1 Definition and examples


Suppose that (Ω, S, P) is a probability space and T ⊂ R. Recall that a random or
stochastic process with parameter space T is a family of random variables

Xt : (Ω, S, P) → R, t ∈ T.

A T-filtration of the probability space (Ω, S, P) is a family F• = (Ft )t∈T of sub-σ-


algebras of S such that

Fs ⊂ Ft , ∀s ≤ t.

We set
F_∞ := ⋁_{t∈T} F_t.

A family of random variables X_t : (Ω, S, P) → R, t ∈ T, is said to be adapted to
the filtration F_• = (F_t)_{t∈T} if X_t is F_t-measurable for any t.

Remark 3.1. If we think of a σ-algebra as encoding all the measurable information


in a given random experiment, then we can think of a T-filtration as an increasing
flow of information. For example, if T = N0 , and (Xn )n≥0 is a sequence of random
variables, then the collection

Fn = σ(X0 , X1 , . . . , Xn ), n ∈ N0 ,

is a filtration of σ-algebras. At epoch n, information about Xn becomes available to


us, on top of the information about X0 , X1 , . . . , Xn−1 that we have collected along
the way. □

Definition 3.2. Suppose that (Ω, S, P) is equipped with a filtration F_• = (F_t)_{t∈T}. An
F_•-martingale is a family of random variables X_t : (Ω, S, P) → R, t ∈ T, satisfying
the following two conditions.

(i) The family is adapted to the filtration F_• and X_t is integrable for any t ∈ T.
(ii) For all s, t ∈ T, s < t, we have X_s = E[X_t ∥ F_s].

The family of integrable random variables (X_t)_{t∈T} is called an F_•-submartingale
(resp. supermartingale) if it is adapted to the filtration and for any s, t ∈ T, s < t,
we have X_s ≤ E[X_t ∥ F_s] (resp. X_s ≥ E[X_t ∥ F_s]).
When T is a discrete subset of R we say that the (sub- or super-)martingale is
a discrete time (sub/super)martingale. □

Note that a sequence of random variables (X_n)_{n∈N0} is a discrete time submartingale
(resp. martingale) with respect to a filtration (F_n)_{n∈N0} of S if
E[X_{n+1} ∥ F_n] ≥ X_n (resp. E[X_{n+1} ∥ F_n] = X_n), ∀n ∈ N0.

Remark 3.3. Suppose that (X_n)_{n≥0} is a sequence of integrable random variables
and
F_n = σ(X_0, X_1, . . . , X_n).
Then E[X_{n+1} ∥ F_n] is a measurable function of the variables X_0, . . . , X_n,
E[X_{n+1} ∥ F_n] = f_{n+1}(X_0, X_1, . . . , X_n), f_{n+1} : R^{n+1} → R.
The sequence (X_n)_{n≥0} is a martingale if and only if f_{n+1}(x_0, x_1, . . . , x_n) = x_n,
∀n ≥ 0, ∀x_0, . . . , x_n ∈ R. If the joint distribution of (X_0, . . . , X_n) is described by a
density p_n(x_0, . . . , x_n), then
f_{n+1}(x_0, . . . , x_n) = ∫_R x_{n+1} ( p_{n+1}(x_0, . . . , x_n, x_{n+1}) / p_n(x_0, . . . , x_n) ) dx_{n+1}.
□

Example 3.4 (Closed martingales). Suppose that F_• = (F_n)_{n∈N0} is a filtration
of S and X ∈ L^1(Ω, S, P). Then the sequence of random variables
X_n = E[X ∥ F_n] ∈ L^1(Ω, F_n, P), n ∈ N0,
is a martingale since
E[X_{n+1} ∥ F_n] = E[ E[X ∥ F_{n+1}] ∥ F_n ] = E[X ∥ F_n] = X_n.
Such a martingale is called a closed or Doob martingale. □

Example 3.5 (Unbiased random walk). Suppose that (X_n)_{n∈N} is a sequence
of independent integrable random variables such that E[X_n] = 0, ∀n ∈ N.
One should think of X_n as the size of the n-th step, so that the location after
n steps is
S_n = X_1 + · · · + X_n.
Set F_n := σ(X_1, . . . , X_n), S_0 := 0. Then the sequence (S_n)_{n∈N0} is a martingale
adapted to the filtration F_n. Indeed,
E[S_{n+1} ∥ F_n] = E[X_{n+1} ∥ X_1, . . . , X_n] + E[X_1 + · · · + X_n ∥ X_1, . . . , X_n]
= E[X_{n+1}] + X_1 + · · · + X_n = 0 + S_n = S_n. □
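The computation above can be checked numerically. The sketch below is not part of the text; the function name and parameters are ad hoc. It buckets simulated fair ±1 walks by their position S_5 and estimates the conditional mean of S_6 within each bucket; the martingale property E[S_{n+1} ∥ F_n] = S_n predicts that each estimate matches the conditioning value.

```python
import random

def conditional_step_means(num_paths=200_000, n=5, seed=1):
    """Bucket simulated fair +/-1 walks by S_n and estimate E[S_{n+1} | S_n]."""
    rng = random.Random(seed)
    buckets = {}
    for _ in range(num_paths):
        s = sum(rng.choice((-1, 1)) for _ in range(n))             # S_n
        buckets.setdefault(s, []).append(s + rng.choice((-1, 1)))  # S_{n+1}
    return {s: sum(v) / len(v) for s, v in buckets.items()}

for s, m in sorted(conditional_step_means().items()):
    print(f"S_5 = {s:+d}: estimated E[S_6 | S_5] = {m:+.3f}")
```

Each printed estimate should be close to the conditioning value s, up to Monte Carlo noise.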

Example 3.6 (Random products). Suppose that (Y_n)_{n∈N} are positive i.i.d.
random variables such that E[Y_1] = 1. Then the sequence of products
Z_n = Y_1 Y_2 · · · Y_n, n ∈ N,
is a martingale adapted to the filtration F_n = σ(Y_1, . . . , Y_n). Indeed,
E[Z_{n+1} ∥ Y_1, . . . , Y_n] = E[Y_{n+1}] Y_1 · · · Y_n = Z_n. □

Example 3.7 (Biased random walk). Suppose that (X_n)_{n∈N} are i.i.d. random
variables such that the moment generating function
µ(λ) := E[e^{λX_n}]
is well defined for λ in some interval Λ. We set
S_n := X_1 + · · · + X_n, M_n = M_n(λ) := e^{λS_n} µ(λ)^{−n}, F_n := σ(X_1, . . . , X_n).
If we define
Y_n := (1/µ(λ)) e^{λX_n},
then we deduce that
E[Y_n] = 1, M_n = Y_1 · · · Y_n.
From the previous example we deduce that ( M_n(λ) )_{n∈N} is a martingale.
As a concrete example, suppose that the random variables X_n are all of binomial
type,
P[X_n = 1] = p, P[X_n = −1] = q = 1 − p.
In this case µ(λ) = pe^λ + qe^{−λ}. Note that if e^λ = q/p, then µ(λ) = 1 and we deduce
that
M_n = (q/p)^{S_n}
is a martingale. This is sometimes referred to as De Moivre’s martingale. □
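A quick numerical sanity check of De Moivre's martingale, not taken from the text (the function name and the parameter values p = 0.6, n = 10 are ad hoc): since M_n = (q/p)^{S_n} is a martingale with M_0 = 1, its mean stays equal to 1 at every time.

```python
import random

def de_moivre_mean(p=0.6, n=10, trials=200_000, seed=29):
    """Estimate E[M_n] for De Moivre's martingale M_n = (q/p)**S_n."""
    q = 1 - p
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # one p-biased +/-1 walk of length n
        s = sum(1 if rng.random() < p else -1 for _ in range(n))
        total += (q / p) ** s
    return total / trials

print(f"E[M_10] ≈ {de_moivre_mean():.3f} (the martingale property predicts 1)")
```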

Example 3.8 (Galton-Watson/branching processes). Fix a probability measure µ on N0 such that
m := Σ_{k∈N0} k µ(k) < ∞, µ(k) := µ({k}),
and µ(k_0) > 0 for some k_0 > 0. Consider next a sequence (X_{n,j})_{j,n∈N0} of i.i.d.
N0-valued random variables with common probability distribution µ. Fix ℓ ∈ N,
set Z_0 := ℓ, and for any n ∈ N0 define
Z_{n+1} = Σ_{j=1}^{Z_n} X_{n,j}, F_n = σ( X_{k,j}; k ∈ N0, k < n ).

Fig. 3.1 Three generations of a Galton-Watson (random) tree. Here Z_1 = 3, Z_2 = 2 + 1 + 3 = 6, Z_3 = 3 + 2 + 1 + 2 + 3 = 11.

The random variable Z_n can be interpreted as the population of the n-th generation
of a species that had ℓ individuals at n = 0 and such that the number of offspring
of a given individual is a random variable with distribution µ. The j-th individual
of the n-th generation has X_{n,j} offspring. We will refer to µ as the reproduction
law.
The sequence (Z_n)_{n≥0} is known as the Galton-Watson process or the branching
process with reproduction law µ.
When ℓ = 1 this process can be visualized as a random rooted tree. The root v_0
has Z_1 = X_{0,1} successor vertices v_{1,1}, . . . , v_{1,Z_1}. The vertex v_{1,i} has X_{1,i} successors
etc.; see Figure 3.1. For any n ∈ N0 we have
Z_{n+1} = Σ_{k=1}^∞ ( Σ_{j=1}^k X_{n,j} ) I_{{Z_n=k}},
so
E[Z_{n+1} ∥ F_n] = Σ_{k=1}^∞ E[ ( Σ_{j=1}^k X_{n,j} ) I_{{Z_n=k}} ∥ F_n ]
= Σ_{k=1}^∞ E[ Σ_{j=1}^k X_{n,j} ∥ F_n ] I_{{Z_n=k}}   (I_{{Z_n=k}} is F_n-measurable)
= Σ_{k=1}^∞ ( Σ_{j=1}^k E[X_{n,j}] ) I_{{Z_n=k}}   (X_{n,j} ⊥⊥ F_n, ∀n, j)
= Σ_{k=1}^∞ km I_{{Z_n=k}} = mZ_n.
This proves that the sequence Y_n = m^{−n} Z_n, n ∈ N0, defines a martingale. □
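The martingale property of Y_n = m^{-n} Z_n can be probed by simulation. The following sketch is not part of the text; the function name and the particular reproduction law are ad hoc illustrations. Since E[Y_n] = E[Y_0] = ℓ, the estimated ratio should stay near ℓ for every n.

```python
import random

def galton_watson_mean_ratio(offspring_probs, n_gens=6, ell=1, trials=30_000, seed=7):
    """Estimate E[Z_n / m**n] for a Galton-Watson process with Z_0 = ell."""
    ks = range(len(offspring_probs))
    m = sum(k * p for k, p in zip(ks, offspring_probs))  # mean of reproduction law
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        z = ell
        for _ in range(n_gens):
            if z == 0:
                break  # extinction: the population stays at 0
            # each of the z individuals reproduces independently with law mu
            z = sum(rng.choices(ks, weights=offspring_probs, k=z))
        total += z / m ** n_gens
    return total / trials

# reproduction law: mu(0) = 1/4, mu(1) = 1/2, mu(2) = 1/4, so m = 1
print(f"E[Z_6 / m^6] ≈ {galton_watson_mean_ratio([0.25, 0.5, 0.25]):.3f}")
```

The martingale identity predicts the value ℓ = 1 regardless of the number of generations simulated.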

Example 3.9 (Polya’s urn). An urn contains r > 0 red balls and g > 0 green
balls. At each moment of time we draw a ball uniformly at random from the balls existing
at that moment and we replace it by c + 1 balls of the same color, c ≥ 0. Denote by
R_n and G_n the number of red and, respectively, green balls in the urn after the n-th
draw. We denote by X_n the ratio of red balls after n draws, i.e.,
X_n := R_n / (R_n + G_n) = R_n / (r + g + cn).
Note that when c = 1, the scheme can be alternatively described as randomly adding
at each moment of time a red/green ball with probability equal to the fraction of
red/green balls that exist at that moment in the urn.
We set
F_n = σ(R_0, G_0, . . . , R_n, G_n) = σ(X_0, X_1, . . . , X_n).

We will show that (X_•) is an F_•-martingale. To see this, observe that
X_n = Σ_{i,j>0} ( i/(i+j) ) I_{{R_n=i, G_n=j}},
so
E[X_{n+1} ∥ F_n] = Σ_{i,j>0} ( i/(i+j) ) E[ I_{{R_{n+1}=i, G_{n+1}=j}} ∥ F_n ].
Now observe that
E[ I_{{R_{n+1}=i, G_{n+1}=j}} ∥ F_n ]
= Σ_{k,ℓ>0} P[ R_{n+1}=i, G_{n+1}=j ∥ R_n=k, G_n=ℓ ] I_{{R_n=k, G_n=ℓ}}
= ( (i−c)/(i+j−c) ) I_{{R_n=i−c, G_n=j}} + ( (j−c)/(i+j−c) ) I_{{R_n=i, G_n=j−c}}.
We deduce
E[X_{n+1} ∥ F_n] = Σ_{i,j} ( i/(i+j) ) · ( (i−c)/(i+j−c) ) I_{{R_n=i−c, G_n=j}}
+ Σ_{i,j} ( i/(i+j) ) · ( (j−c)/(i+j−c) ) I_{{R_n=i, G_n=j−c}}
(substituting u = i − c, v = j in the first sum and u = i, v = j − c in the second)
= Σ_{u,v} ( (u+c)/(u+v+c) ) · ( u/(u+v) ) I_{{R_n=u, G_n=v}} + Σ_{u,v} ( u/(u+v+c) ) · ( v/(u+v) ) I_{{R_n=u, G_n=v}}
= Σ_{u,v} ( u(u+v+c) / ((u+v)(u+v+c)) ) I_{{R_n=u, G_n=v}} = Σ_{u,v} ( u/(u+v) ) I_{{R_n=u, G_n=v}} = X_n. □
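A short simulation, not from the text and with ad hoc parameter choices, illustrates the conclusion: since (X_n) is a martingale, E[X_n] = X_0 = r/(r + g) for every n, even though individual urn trajectories drift far from this value.

```python
import random

def polya_red_fraction(r=3, g=2, c=1, draws=100, trials=10_000, seed=11):
    """Estimate E[X_n], the expected fraction of red balls after n draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        red, green = r, g
        for _ in range(draws):
            if rng.random() < red / (red + green):
                red += c      # drew red: return it together with c extra red balls
            else:
                green += c
        total += red / (red + green)
    return total / trials

print(f"E[X_100] ≈ {polya_red_fraction():.3f}; X_0 = r/(r+g) = {3/5:.3f}")
```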

Example 3.10 (Random walks on graphs). Suppose that Γ is a connected
simple graph with vertex set V Γ and edge set E Γ. Assume that there are no multiple
edges between two vertices u, v ∈ V Γ. Assume that Γ is locally finite, i.e., for any
vertex u ∈ V Γ, its set of neighbors N(u) is finite. We set deg(u) := |N(u)|.
A function F : V Γ → R is called harmonic if
F(u) = ( 1/deg(u) ) Σ_{v∈N(u)} F(v).
Consider the simple random walk on Γ that starts at a given vertex v_0, where the
probability of transitioning from a vertex u to a neighbor v is equal to 1/deg(u).
Denote by V_n the location of the walk after n steps. Suppose that F : V Γ → R is
a harmonic function. Then the sequence of random variables
X_n = F(V_n), n ∈ N0,
is a martingale with respect to the filtration F_n = σ(V_0, V_1, . . . , V_n). Moreover,
E[X_n] = F(v_0), ∀n ∈ N0. □
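The one-step identity behind this example is E[F(V_{n+1}) ∥ V_n = u] = (1/deg u) Σ_{v∈N(u)} F(v); when F is harmonic the right-hand side is F(u), which gives the martingale property. The sketch below, not from the text, estimates this conditional mean from one long simulated walk on a small hypothetical graph, for an arbitrary (not necessarily harmonic) function F, and compares it with the neighborhood average.

```python
import random

# A small example graph (adjacency lists) and an arbitrary function F on it.
GRAPH = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2]}
F = {0: 1.0, 1: -2.0, 2: 0.5, 3: 4.0}

def conditional_means(steps=400_000, seed=3):
    """Estimate E[F(V_{n+1}) | V_n = u] for each vertex u from one long walk."""
    rng = random.Random(seed)
    samples = {u: [] for u in GRAPH}
    v = 0
    for _ in range(steps):
        u, v = v, rng.choice(GRAPH[v])   # uniform step to a neighbor
        samples[u].append(F[v])
    return {u: sum(vals) / len(vals) for u, vals in samples.items()}

for u, est in sorted(conditional_means().items()):
    avg = sum(F[w] for w in GRAPH[u]) / len(GRAPH[u])
    print(f"vertex {u}: estimate {est:+.3f} vs neighborhood average {avg:+.3f}")
```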

Example 3.11 (New (sub)martingales from old). Suppose that (Ω, S, P) is


equipped with a filtration F• = (Fn )n∈N0 and Xn ∈ L1 (Ω, F, P) is a sequence
of random variables adapted to the above filtration.

(i) If (X_n)_{n∈N0} is a martingale and ϕ : R → R is a convex function such that ϕ(X_n)
is integrable ∀n ∈ N0, then the conditional Jensen inequality implies that the
sequence ϕ(X_n) is a submartingale. Indeed, Jensen’s inequality implies
E[ϕ(X_{n+1}) ∥ F_n] ≥ ϕ( E[X_{n+1} ∥ F_n] ) = ϕ(X_n).

(ii) If (X_n)_{n∈N0} is a submartingale and ϕ : R → R is a nondecreasing convex
function such that ϕ(X_n) is integrable ∀n ∈ N0, then the sequence ϕ(X_n) is
a submartingale. Indeed, follow the same argument as above, where at the
last step one uses the fact that ϕ is nondecreasing. In particular, if (X_n)_{n≥0} is a
submartingale, then so is (X_n^+)_{n≥0}, x^+ := max(0, x).
(iii) If (X_n)_{n∈N0} is a supermartingale and ϕ : R → R is a nondecreasing concave
function such that ϕ(X_n) is integrable ∀n ∈ N0, then the sequence ϕ(X_n) is a
supermartingale. Indeed,
E[ϕ(X_{n+1}) ∥ F_n] ≤ ϕ( E[X_{n+1} ∥ F_n] ) ≤ ϕ(X_n).

In particular, if (X_n)_{n∈N0} is a supermartingale, then so is ( min(X_n, c) )_{n≥0},
∀c ∈ R. □

3.1.2 Discrete stochastic integrals


Fix a probability space (Ω, S, P) and an N0-filtration F_• of S. If C_• is an increasing
F_•-adapted process of integrable random variables, then obviously C_• is a submartingale. If we add to this process
a martingale M_•, then the resulting process X_• = M_• + C_• is a submartingale.
It turns out that all submartingales can be obtained in this fashion. In fact, the
increasing process C_• can be chosen to be of a special type: the random variable
C_{n+1} can be chosen to be F_n-measurable, i.e., the value of C_• at time n + 1 is
predictable at time n, in the sense that it can be determined from the information available to us
at time n, encoded in the σ-algebra F_n.

Definition 3.12. A sequence of random variables {H_n : Ω → R, n ∈ N0} is called
F_•-previsible or predictable if H_0 is F_0-measurable and H_n is F_{n−1}-measurable
∀n ∈ N. □

The next result formalizes the discussion at the beginning of this subsection.

Proposition 3.13 (Doob decomposition of discrete submartingales). Let


X• = (Xn )n∈N0 be an (Fn )n∈N0 -adapted process such that Xn ∈ L1 , ∀n ∈ N0 .
Then the following statements are equivalent.

(i) The process X• is a submartingale.


(ii) There exists an F• -martingale M• and an F• -predictable nondecreasing process
C• such that
M0 = 0 = C0 , Xn = X0 + Mn + Cn , ∀n ≥ 0.

Moreover, when X_• is a submartingale, the martingale M_• and the nondecreasing
predictable process C_• are uniquely determined by X_•, up to indistinguishability.
In this case M_• is called the martingale component of the submartingale X_•
and C_• is called the compensator of X_•. We denote it by C(X_•). The decomposition
X_n = X_0 + M_n + C_n is called the Doob decomposition of the submartingale X_•.

Proof. Existence. We describe M_n and C_n in terms of their increments. More
precisely,
C_{n+1} − C_n = E[X_{n+1} − X_n ∥ F_n] = E[X_{n+1} ∥ F_n] − X_n, ∀n ∈ N0, (3.1.1a)
M_{n+1} − M_n = X_{n+1} − X_n − (C_{n+1} − C_n), ∀n ∈ N0. (3.1.1b)
Note that C_{n+1} − C_n is F_n-measurable, so (C_n) is predictable. By construction M_•
is an F_•-martingale. Clearly, if X_• is a submartingale then, tautologically, C_• is
nondecreasing.
Uniqueness. Suppose that X_• is a submartingale, M_•′ is a martingale, and C_•′ is a
nondecreasing predictable process such that
M_0′ = C_0′ = 0, X_n = X_0 + M_n′ + C_n′, ∀n ∈ N0.

We deduce
E[X_{n+1} ∥ F_n] − X_n = ( E[M′_{n+1} ∥ F_n] − M_n′ ) + ( E[C′_{n+1} ∥ F_n] − C_n′ ) = C′_{n+1} − C_n′,
since the first parenthesis vanishes (M_•′ is a martingale) and C′_{n+1} is F_n-measurable.
This shows that the increments of C_n′ are given by (3.1.1a), so C_n′ = C_n. In particular, M_n′ = M_n, ∀n ∈ N. □
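A concrete numerical illustration of the Doob decomposition, not from the text and with ad hoc naming: for the fair ±1 walk S_n, the process X_n = S_n² is a submartingale and E[X_{n+1} − X_n ∥ F_n] = 1 no matter what S_n is, so the compensator is C_n = n and the Doob decomposition is S_n² = (S_n² − n) + n. The sketch estimates the compensator increment from (3.1.1a) by bucketing on the value of S_3.

```python
import random

def compensator_increment(trials=100_000, seed=5):
    """For X_n = S_n**2 (S_n a fair +/-1 walk), estimate the compensator
    increment E[X_4 - X_3 | F_3] from (3.1.1a), bucketed by the value of S_3."""
    rng = random.Random(seed)
    buckets = {}
    for _ in range(trials):
        s = sum(rng.choice((-1, 1)) for _ in range(3))   # S_3
        eps = rng.choice((-1, 1))                        # next step
        buckets.setdefault(s, []).append((s + eps) ** 2 - s ** 2)
    return {s: sum(d) / len(d) for s, d in buckets.items()}

for s, inc in sorted(compensator_increment().items()):
    print(f"S_3 = {s:+d}: estimated increment ≈ {inc:+.3f}")
```

Every estimate should be close to 1, independently of the conditioning value.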

Example 3.14. Suppose that (X_n)_{n≥0} is a sequence of nonnegative integrable
random variables and X_0 = 0. Then
S_n = X_1 + · · · + X_n
is a submartingale with respect to the filtration F_n = σ(X_1, . . . , X_n). Indeed,
E[S_n ∥ F_{n−1}] = E[X_n ∥ F_{n−1}] + S_{n−1} ≥ S_{n−1}.
Consider the Doob decomposition S_n = M_n + C_n. The compensator C_n satisfies
C_{n+1} − C_n = E[S_{n+1} ∥ F_n] − S_n = E[X_{n+1} ∥ F_n],
so
C_n = Σ_{k=1}^n E[X_k ∥ F_{k−1}]
and
M_n = S_n − Σ_{k=1}^n E[X_k ∥ F_{k−1}] = Σ_{k=1}^n ( X_k − E[X_k ∥ F_{k−1}] ).
If the variables X_n are independent, then
M_n = Σ_{k=1}^n ( X_k − E[X_k] ). □

Definition 3.15 (Quadratic variation). Suppose that (X_n)_{n≥0} is a martingale
adapted to the filtration (F_n)_{n≥0} such that E[X_n²] < ∞, ∀n ≥ 0. The compensator
of the submartingale (X_n²)_{n≥0} is called the quadratic variation and it is denoted
by ⟨X_•⟩. □

Example 3.16. Suppose that (X_n)_{n≥1} are independent random variables with zero
means and finite variances. We set S_0 = 0,
S_n = X_1 + · · · + X_n.
Then
E[S_n²] = Σ_{k=1}^n E[X_k²] < ∞, ∀n ≥ 1.
Thus (S_•) is an L²-martingale. From the computations in Example 3.14 we deduce
⟨S_•⟩_n = Σ_{k=1}^n E[X_k²] = Σ_{k=1}^n E[(S_k − S_{k−1})²].
This explains why we refer to ⟨S_•⟩ as the quadratic variation. □

Theorem 3.17 (Discrete Stochastic Integral). Suppose that (X_n)_{n∈N0} is an
F_•-adapted process and (H_n)_{n∈N} is a bounded predictable process. Define the process
(H · X)_• by setting
(H · X)_0 := 0, (H · X)_n = H_1(X_1 − X_0) + · · · + H_n(X_n − X_{n−1}), ∀n ∈ N. (3.1.2)
Then the following hold.

(i) If (X_n)_{n∈N0} is a martingale, then the process (H · X)_n, n ∈ N0, is also an
F_•-adapted martingale.
(ii) If (X_n)_{n∈N0} is a submartingale and H_n ≥ 0, ∀n ∈ N, then the process
(H · X)_n, n ∈ N0, is also an F_•-adapted submartingale.

Proof. (i) Clearly (H · X)_n ∈ L¹(Ω, F_n, P). We have
E[(H · X)_{n+1} ∥ F_n] = E[H_{n+1}(X_{n+1} − X_n) ∥ F_n] + (H · X)_n
(H_{n+1} is F_n-measurable)
= H_{n+1} E[X_{n+1} − X_n ∥ F_n] + (H · X)_n = H_{n+1}( E[X_{n+1} ∥ F_n] − X_n ) + (H · X)_n
((X_n) is a martingale)
= (H · X)_n.
The proof of (ii) is similar. □

Remark 3.18. (a) When X_• is a martingale the process (H · X)_• is called the
discrete stochastic integral of H with respect to X and it is alternatively denoted
∫^n H dX := (H · X)_n.
One should think of X_n as a random signed measure assigning mass X_n − X_{n−1} to
the point n.
(b) The discrete stochastic integral has a stock-trading interpretation. Suppose that
X_n represents the price of a stock at the end of the n-th trading day. A day trader
buys H_n shares at the beginning of the n-th trading day, based on the information
available then. This information is encoded by the sigma-algebra F_{n−1}, and the
price of a share at the beginning of the n-th trading day is X_{n−1}. He sells the shares at
the end of the n-th trading day. The resulting profit at the end of day n is then
H_n(X_n − X_{n−1}). We deduce that (H · X)_n represents the profit of the day trader
after n trading days.
(c) The special case of Theorem 3.17 in which the variables H_n are Bernoulli random
variables was discovered by P. Halmos and is classically known as the impossibility
of systems theorem. In this case H_n represents the decision of a gambler whether or
not to play the next game, based on the information gathered during the games he has observed
so far. □
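The impossibility-of-systems phenomenon in Remark 3.18(c) is easy to see numerically. The sketch below is not from the text; the function name and the sample strategy are ad hoc. Against a fair ±1 price process, any predictable 0/1 betting rule H produces a profit (H · X)_n with mean zero.

```python
import random

def strategy_profit(strategy, n_days=30, trials=100_000, seed=2):
    """Average profit (H . X)_n of a predictable 0/1 betting strategy against a
    fair +/-1 price process; `strategy` maps the observed increments to 0 or 1."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        increments, profit = [], 0.0
        for _ in range(n_days):
            h = strategy(increments)     # decided before seeing the next increment
            dx = rng.choice((-1, 1))
            profit += h * dx
            increments.append(dx)
        total += profit
    return total / trials

# "Bet only after two consecutive losses" -- a typical gambler's system.
after_two_losses = lambda incs: 1 if incs[-2:] == [-1, -1] else 0
print(f"average profit ≈ {strategy_profit(after_two_losses):+.4f}")
```

The printed average should be near 0, as Theorem 3.17(i) predicts for any predictable strategy.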

The applicability of Theorem 3.17 depends on our ability to produce interesting
predictable processes. We describe one very useful class of examples.

Example 3.19. Observe first that a discrete time process (Y_n)_{n∈N} on (Ω, S, P) can
be viewed as a map
Y : N × Ω → R, (n, ω) ↦ Y_n(ω).
We equip N × Ω with the product σ-algebra. A measurable set X ⊂ N × Ω defines
a stochastic process
I_X : N × Ω → {0, 1}, (I_X)_n = I_{X_n}, X_n := { ω ∈ Ω; (n, ω) ∈ X }.
The set X is called F_•-predictable if the process I_X is such. More precisely, this
means that X_0 ∈ F_0 and, for any n ∈ N, the set X_n is F_{n−1}-measurable. □

3.1.3 Stopping and sampling: discrete time


We want to describe one technique that makes martingales extremely useful in
applications. Fix a probability space (Ω, S, P).

Definition 3.20. A random variable T : (Ω, S, P) → N0 ∪ {∞} is called a stopping
time adapted to the filtration F_• = (F_n)_{n≥0}, or an F_•-stopping time, if
{T ≤ n} ∈ F_n, ∀n ∈ N0 ∪ {∞}.
If (X_n)_{n∈N} is an F_•-adapted process and T is an F_•-stopping time, then the T-sample of the process is the random variable
X_T := Σ_{n∈N0} X_n I_{{T=n}}. (3.1.3)
Observe that X_T = 0 on the set {T = ∞}. □

Example 3.21. (a) For each n ∈ N0 the constant random variable equal to n is a
stopping time.
(b) Suppose that (X_n)_{n∈N0} is F_•-adapted and C ⊂ R is a Borel set. We define the
hitting time of C to be the random variable
H_C : Ω → N0 ∪ {∞}, H_C(ω) := min{ n ∈ N0; X_n(ω) ∈ C }.
This is a stopping time since
{H_C ≤ n} = ∪_{k≤n} {X_k ∈ C}
and the process (X_n) is F_•-adapted.
(c) If S, T are stopping times, then S ∧ T = min(S, T) and S ∨ T = max(S, T) are
also stopping times.
(d) If (T_k)_{k∈N} is a sequence of stopping times, then inf T_k, sup T_k, lim inf T_k and
lim sup T_k are also stopping times. □

Definition 3.22. Let X_• = (X_n)_{n∈N} be a process adapted to the filtration (F_n)_{n≥0}.
For any stopping time T we denote by X_•^T the process stopped at T, defined by
X_n^T := X_{T∧n}, where X_{T∧n}(ω) = X_{min(T(ω),n)}(ω) = X_n(ω) if n ≤ T(ω), and X_{T(ω)}(ω) if T(ω) < n. (3.1.4)
□

Note that the process stopped at T is also adapted to the filtration (Fn )n≥0 .

Proposition 3.23. Suppose that S, T are stopping times such that S ≤ T. Define
]]T, ∞[[ := { (n, ω) ∈ N0 × Ω; T(ω) < n },
]]S, T]] := { (n, ω) ∈ N0 × Ω; S(ω) < n ≤ T(ω) }.
Then ]]T, ∞[[, [[0, T]] and ]]S, T]] are predictable subsets of N0 × Ω.

Proof. We have ]]T, ∞[[_n = {T < n} = {T ≤ n − 1} ∈ F_{n−1}. Next observe that
I_{[[0,T]]} = 1 − I_{]]T,∞[[},
so I_{[[0,T]]} is a predictable process as a linear combination of predictable processes.
Finally observe that since S ≤ T we have
I_{]]S,T]]} = I_{[[0,T]]} − I_{[[0,S]]},
so I_{]]S,T]]} is predictable as a linear combination of predictable processes. □

Suppose now that (X_n)_{n∈N} is a (sub)martingale and T is a stopping time. Then
S_0 = 0 is also a stopping time, S_0 ≤ T. As we have seen above, the process
I_{]]S_0,T]]} = I_{]]0,T]]} is predictable, so I_{]]0,T]]} · X is a (sub)martingale.
For every n ∈ N we have
(I_{]]0,T]]} · X)_n = ( I_{]]0,T]]} )_n ( X_n − X_{n−1} ) + · · · + ( I_{]]0,T]]} )_1 ( X_1 − X_0 )
= I_{{T≥n}}( X_n − X_{n−1} ) + · · · + I_{{T≥1}}( X_1 − X_0 ) = X_{T∧n} − X_0 = X_n^T − X_0.
Thus
X_•^T = X_0 + ( I_{]]0,T]]} · X )_•. (3.1.5)

This proves the following result.

Theorem 3.24 (Optional Stopping Theorem). Suppose that
X_n : (Ω, F, P) → R, n ≥ 0,
is a (sub)martingale adapted to the filtration F_• and T is an F_•-stopping time. Then
X_•^T, the process stopped at T, is also a (sub)martingale adapted to F_•. □

Suppose that T : (Ω, S, P) → N0 ∪ {∞} is a stopping time adapted to the
filtration F_•. We define
F_T := { E ∈ F : E ∩ {T ≤ n} ∈ F_n, ∀n ∈ N0 ∪ {∞} }
= { E ∈ F : E ∩ {T = n} ∈ F_n, ∀n ∈ N0 ∪ {∞} }. (3.1.6)
Tautologically, the random variable T is F_T-measurable.

Example 3.25. Suppose that T is the hitting time of a Borel set C ⊂ R. Then
the event E belongs to FT if, at any moment of time n, we can decide using the
information Fn available to us at time n whether, up to that moment, we have
visited C and the event E has occurred. □

A few remarks are in order.

• The collection F_T is a σ-subalgebra of F. It is called the past-until-T σ-algebra.
• The random variable X_T is F_T-measurable. Indeed,
{X_T ≤ c} ∩ {T = n} = {X_n ≤ c} ∩ {T = n} ∈ F_n, ∀n.
• If S, T are stopping times such that S ≤ T, then F_S ⊂ F_T.

Definition 3.26. Suppose that (X_n)_{n∈N0} is an F_•-(sub)martingale and T is an
F_•-stopping time. We say that the stopping time T satisfies the Doob conditions¹
if the following hold.
P[T < ∞] = 1. (3.1.7a)
X_T ∈ L¹. (3.1.7b)
lim_{n→∞} E[ I_{{T>n}} |X_n| ] = 0. (3.1.7c)
□

Roughly speaking, the Doob conditions state that the random process (Xn )n≥0
is not sampled “too late”. In Proposition 3.66 we provide another characterization of
the Doob conditions in terms of the asymptotic behavior of the stopped process X•T .

Example 3.27. Suppose that T is a bounded F_•-stopping time. Then T satisfies
the Doob conditions.
To see this, choose N ∈ N such that T < N a.s. The stopped process X_•^T is a
(sub)martingale, so X_T = X_{T∧N} = X_N^T ∈ L¹. As for the condition (3.1.7c),
note that since T < N a.s., the variable I_{{T>n}} |X_n| is a.s. 0 for n ≥ N. □
1 There is no consensus on terminology in the literature. We use the term Doob conditions since

they were first spelled out by J. L. Doob in his influential monograph [47].

Theorem 3.28 (Optional Sampling Theorem). Suppose that X_n : (Ω, F, P)
→ R, n ≥ 0, is a (sub)martingale adapted to the filtration F_•, and S ≤ T are
stopping times adapted to the same filtration. If T satisfies Doob’s conditions in
Definition 3.26, then
E[X_T] ≥ E[X_S].
If X_• is a martingale, then
E[X_T] = E[X_S] = E[X_0].

Proof. We follow the original approach in [47, VII.2]; see also [4, Thm. 6.7.4].
Suppose that (X_n)_{n≥0} is a martingale. Set A_m := {S = m}. Then
E[X_S] = Σ_{m≥0} E[X_S I_{A_m}],
so it suffices to show that
∀m ≥ 0 : E[X_T I_{A_m}] = E[X_S I_{A_m}].
We have
E[X_S I_{A_m}] = E[X_m I_{A_m} I_{{T=m}}] + E[X_m I_{A_m} I_{{T>m}}]
= E[X_T I_{A_m} I_{{T=m}}] + E[X_m I_{A_m} I_{{T>m}}]
(A_m ∩ {T > m} ∈ F_m, X_• martingale)
= E[X_T I_{A_m} I_{{T=m}}] + E[X_{m+1} I_{A_m} I_{{T>m}}]
= E[X_T I_{A_m} I_{{T=m}}] + E[X_{m+1} I_{A_m} I_{{T=m+1}}] + E[X_{m+1} I_{A_m} I_{{T>m+1}}]
= E[X_T I_{A_m} I_{{m≤T≤m+1}}] + E[X_{m+1} I_{A_m} I_{{T>m+1}}]
(A_m ∩ {T > m + 1} ∈ F_{m+1}, X_• martingale)
= E[X_T I_{A_m} I_{{m≤T≤m+1}}] + E[X_{m+2} I_{A_m} I_{{T>m+1}}]
= E[X_T I_{A_m} I_{{m≤T≤m+2}}] + E[X_{m+2} I_{A_m} I_{{T>m+2}}].
Iterating this procedure we deduce that, ∀n > 0, we have
E[X_S I_{A_m}] = E[X_T I_{A_m} I_{{m≤T≤m+n}}] + E[X_{m+n} I_{A_m} I_{{T>m+n}}].
The condition (3.1.7c) shows that
lim_{n→∞} E[X_{m+n} I_{{T>m+n}}] = 0,
so
E[X_S I_{A_m}] = E[X_T I_{A_m} I_{{m≤T<∞}}] = E[X_T I_{A_m}].
The submartingale situation is dealt with similarly. □
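A classical use of the Optional Sampling Theorem is gambler's ruin, sketched here numerically (not from the text; names and parameters are ad hoc). For a fair ±1 walk and T the hitting time of {−a, b}, the theorem gives 0 = E[S_0] = E[S_T] = −a P(hit −a first) + b P(hit b first), so the walk reaches b before −a with probability a/(a + b).

```python
import random

def ruin_probability(a=4, b=6, trials=50_000, seed=13):
    """Estimate P(a fair +/-1 walk hits +b before -a)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        s = 0
        while -a < s < b:
            s += rng.choice((-1, 1))
        wins += (s == b)
    return wins / trials

est = ruin_probability()
print(f"P(hit +6 before -4) ≈ {est:.3f}; optional sampling predicts 4/10 = 0.400")
```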

Remark 3.29. Suppose that T is an a.s. finite F_•-stopping time and X_• is an F_•-submartingale such that X_T ∈ L¹. Then T satisfies Doob’s conditions if and only
if
lim_{n→∞} E[X_n^+ I_{{T>n}}] = 0. (3.1.8)
Clearly (3.1.7c) implies (3.1.8). Let us show that (3.1.8) ⇒ (3.1.7c). Assume first that X_• is a martingale.
Fix m, n ∈ N0, m < n. Observing that {T > m} ∈ F_m we deduce
E[X_m I_{{T>m}}] = E[X_{m+1} I_{{T>m}}] = E[X_{m+1} I_{{T=m+1}}] + E[X_{m+1} I_{{T>m+1}}]
({T > m + 1} ∈ F_{m+1})
= E[X_{m+1} I_{{T=m+1}}] + E[X_{m+2} I_{{T>m+1}}]
= E[X_{m+1} I_{{T=m+1}}] + E[X_{m+2} I_{{T=m+2}}] + E[X_{m+2} I_{{T>m+2}}]
= · · · = E[X_{m+1} I_{{T=m+1}}] + · · · + E[X_n I_{{T=n}}] + E[X_n I_{{T>n}}] = E[X_T I_{{m<T≤n}}] + E[X_n I_{{T>n}}].
We deduce
E[X_m I_{{T>m}}] − E[X_n I_{{T>n}}] = E[X_T I_{{m<T≤n}}], ∀n > m.
Using the equality X_• = X_•^+ − X_•^− we deduce that, ∀n > m,
E[X_n^− I_{{T>n}}] − E[X_m^− I_{{T>m}}] = E[X_T I_{{m<T≤n}}] − E[X_m^+ I_{{T>m}}] + E[X_n^+ I_{{T>n}}].
If we let n → ∞ in the above equality and recall that T < ∞ a.s., X_T ∈ L¹ and X_•^+ satisfies (3.1.8), we
deduce
lim_{n→∞} E[X_n^− I_{{T>n}}] − E[X_m^− I_{{T>m}}] = E[X_T I_{{T>m}}] − E[X_m^+ I_{{T>m}}].
Using the Optional Stopping Theorem 3.24 for the stopping times S ≡ m and T we deduce
E[X_T I_{{T>m}}] − E[X_m^+ I_{{T>m}}] = E[X_m I_{{T>m}}] − E[X_m^+ I_{{T>m}}] = −E[X_m^− I_{{T>m}}].
Hence
lim_{n→∞} E[X_n^− I_{{T>n}}] − E[X_m^− I_{{T>m}}] = −E[X_m^− I_{{T>m}}],
so that
lim_{n→∞} E[|X_n| I_{{T>n}}] = lim_{n→∞} E[X_n^+ I_{{T>n}}] + lim_{n→∞} E[X_n^− I_{{T>n}}] = 0.
Suppose now that X_• is a submartingale. Consider its Doob decomposition X_n = X_0 + M_n + C_n. If
X_• satisfies (3.1.8), then
0 ≤ (X_0 + M_n)^+ ≤ X_n^+,
and we deduce that the martingale Y_• = X_0 + M_• satisfies (3.1.8) and thus (3.1.7c). Next, observe that
X_n^+ = (Y_n + C_n) I_{{Y_n≥0}} + (C_n − Y_n^−) I_{{0<Y_n^−≤C_n}}.
This proves that 0 ≤ C_n ≤ X_n^+ + Y_n^−, so
lim_{n→∞} E[C_n I_{{T>n}}] ≤ lim_{n→∞} E[(X_n^+ + Y_n^−) I_{{T>n}}] = 0.
Hence
lim_{n→∞} E[|X_n| I_{{T>n}}] ≤ lim_{n→∞} E[(|Y_n| + C_n) I_{{T>n}}] = 0. □

3.1.4 Applications of the optional sampling theorem


It is time to give the reader a first taste of the versatility of the optional sampling
theorem. After we present more properties of martingales we will be able to extend
the range of applications of this theorem.

Example 3.30 (The Ballot Problem). Let us consider again the ballot problem
first discussed in Example 1.60. Recall the setup.
Two candidates A and B run for an election. Candidate A received a votes
while candidate B received b votes, where b < a. The votes were counted in random
order, so any permutation of the a + b votes cast is equally likely. We have shown
in Example 1.60 that the probability that A was ahead throughout the count is
p = (a − b)/(a + b).
We want to describe an alternative proof using martingale methods. Our presentation is inspired by [117, Sec. 12.2].
Set n := a + b and denote by D_k the number of votes by which A was
ahead when the k-th vote was tabulated. Note that D_n = a − b. Let X_k denote
the random variable indicating the k-th vote: X_k = 1 if the vote went for
A, and X_k = −1 if the vote went for B, so that
D_0 = 0, D_k = X_1 + · · · + X_k.
For k = 0, 1, . . . , n we denote by R_k the ratio
R_k := D_{n−k} / (n − k).
In other words, R_k is candidate A’s lead, in percentage terms, after the (n − k)-th
counted vote. Let us first show that (R_k) is a martingale with respect to the filtration
F_k = σ(R_0, . . . , R_k) = σ(D_n, D_{n−1}, . . . , D_{n−k}).
 

Thus, conditioning on F_k corresponds to conditioning on the results of the last k + 1
counts. Observe that, given D_{n−k}, the value D_{n−k−1} one vote earlier is independent
of the values at the later counts D_{n−k+1}, . . . , D_n. In other words,
E[D_{n−k−1} ∥ D_{n−k}, . . . , D_n] = E[D_{n−k−1} ∥ D_{n−k}].
One might be tempted to think of D_{n−k} as a random walk in reverse, but there is
a silent trap: there is a condition at the n-th step in reverse, namely D_0 = 0.
To compute the above conditional expectation, denote by A_m (resp. B_m) the
number of votes A (resp. B) has received after m counts. Thus
D_m = A_m − B_m, m = A_m + B_m.
Note that A_m and B_m are determined by D_m via the equalities
A_m = (D_m + m)/2, B_m = (m − D_m)/2.
Thus, if D_{n−k} is known, the (n − k)-th vote could have been either a vote for A,
and the probability of such a vote is A_{n−k}/(n−k), or it could have been a vote for B, and
the probability of such a vote is B_{n−k}/(n−k). Hence
E[D_{n−k−1} ∥ D_{n−k}] = ( D_{n−k} − 1 ) A_{n−k}/(n−k) + ( D_{n−k} + 1 ) B_{n−k}/(n−k)
= D_{n−k} − D_{n−k}/(n−k) = D_{n−k} (n−k−1)/(n−k).
Dividing by n − k − 1 we deduce that (R_k)_{0≤k≤n−1} is indeed a martingale.
Now define the stopping times
S := min{ 0 ≤ k ≤ n − 1; R_k = 0 },
where min ∅ := ∞, and T := min(S, n − 1). The stopping time T is bounded and
the Optional Sampling Theorem 3.28 implies
E[R_T] = E[R_0] = D_n/n = (a − b)/(a + b).
Now observe that
E[R_T] = E[R_T I_{{S=∞}}] + E[R_T I_{{S<∞}}].
Note that R_T = 0 on {S < ∞}. Observe that if S = ∞, then D_k > 0 for all 1 ≤ k ≤ n.
Hence T = n − 1 on {S = ∞}, so R_T = D_1 = 1 on {S = ∞}. Therefore
(a − b)/(a + b) = E[R_T] = P[S = ∞]
= the probability that candidate A led throughout the vote count. □
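The ballot formula is easy to confirm by brute force. The sketch below is not from the text; the function name and the vote counts a = 7, b = 3 are ad hoc. It shuffles the a + b ballots uniformly and counts the fraction of orderings in which A's lead stays strictly positive throughout.

```python
import random

def ballot_lead_probability(a=7, b=3, trials=100_000, seed=17):
    """Estimate the probability that A stays strictly ahead during the count."""
    rng = random.Random(seed)
    votes = [1] * a + [-1] * b
    ahead = 0
    for _ in range(trials):
        rng.shuffle(votes)
        lead, always = 0, True
        for v in votes:
            lead += v
            if lead <= 0:       # A's lead dropped to 0 or below
                always = False
                break
        ahead += always
    return ahead / trials

print(f"P(A always ahead) ≈ {ballot_lead_probability():.3f}; (a-b)/(a+b) = 0.400")
```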

Example 3.31 (Expected time to observe a pattern). Suppose that we are


given a finite set (alphabet) A and a probability distribution π on it so that
π(a) := π({a}) > 0, ∀a ∈ A.
Define
f : A → (0, ∞), f(a) = 1/π(a).
Fix a word (or pattern) of length ℓ > 0 in this alphabet, a = (a_1, . . . , a_ℓ) ∈ A^ℓ.
Suppose that (A_n)_{n≥1} is a sequence of independent A-valued random variables
with common distribution π. We say that the pattern a is observed at time n if
n ≥ ℓ and
(A_{n−ℓ+1}, A_{n−ℓ+2}, . . . , A_n) = (a_1, a_2, . . . , a_ℓ).
We let T = T_a denote the first time the pattern a is observed,
T_a := min{ n ≥ ℓ; (A_{n−ℓ+1}, A_{n−ℓ+2}, . . . , A_n) = (a_1, a_2, . . . , a_ℓ) }.

To visualize this, think that we have an urn with balls labeled by the letters in A,
in proportions given by π. We sample the urn with replacement and record in
succession the labels we draw. We are interested in the first moment we observe
the labels a_1, . . . , a_ℓ in succession as we sample the urn. As a special case, think
that we flip a fair coin and stop the first time we see T, H, T, H in succession. In this
case A = {H, T}, π(H) = π(T) = 1/2, a = THTH.
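One can estimate E[T_a] for this fair-coin special case by simulation before doing any theory; the sketch below is not from the text, and its function name is ad hoc. (The martingale computation developed in this example identifies the exact value; for a = THTH the waiting time turns out to have mean 2⁴ + 2² = 20, the second term reflecting the overlap TH between a prefix and a suffix of the pattern.)

```python
import random

def mean_waiting_time(pattern="THTH", trials=50_000, seed=23):
    """Monte Carlo estimate of E[T_a]: fair-coin flips until `pattern` appears."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        window, flips = "", 0
        while not window.endswith(pattern):
            # keep only the last len(pattern) flips
            window = (window + rng.choice("HT"))[-len(pattern):]
            flips += 1
        total += flips
    return total / trials

print(f"E[T_THTH] ≈ {mean_waiting_time():.2f}")
```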
An amusing quote by Bertrand Russell comes to mind. “There is a special
department of Hell for students of probability. In this department there are many
typewriters and many monkeys. Every time that a monkey walks on a typewriter,
it types by chance one of Shakespeare’s sonnets.”
We will compute E[T_a] using a clever martingale method due to Li [107].
The precise answer is contained in (3.1.11).
Let us first observe that E[T_a] < ∞. This follows from a very useful trick, [160,
E10.5], generalizing the result in Example 1.167.

Lemma 3.32 (‘Sooner-rather-than-later’). Suppose that T is a stopping time
adapted to the filtration (F_n)_{n∈N0} with the property that there exist r_0 > 0 and
N_0 ∈ N such that
∀n ∈ N0, P[T ≤ n + N_0 ∥ F_n] > r_0. (3.1.9)
Then there exists c ∈ (0, 1) such that P[T > n] < c^n, ∀n > N_0. In particular,
E[T] = Σ_{n≥0} P[T > n] < ∞. □

In Exercise 3.6 we ask the reader to provide a proof of this result. It is a nice
application of various properties of the conditional expectation.
In the case at hand, (3.1.9) is satisfied with N_0 = ℓ and r_0 = ( min_{a∈A} π(a) )^ℓ.
Following [107] we consider the following betting game involving the House
(casino) and a random number of players. At each moment of time n = 1, 2, . . . the
House samples the alphabet A according to the probability distribution π. (The
House runs a chance game with set of outcomes A and probability distribution π.)
The outcome of this sampling is the sequence of i.i.d. random variables An .
The first player adopts the following a-based strategy.

• At time 0 he bets his fortune F_0^1 = 1 that the outcome of the first game is
A_1 = a_1. If A_1 = a_1, his fortune changes to F_1^1 = f(a_1) = 1/π(a_1). Otherwise,
he loses his fortune F_0^1 to the House, so F_1^1 = 0 in this case.
• At time 1 he bets his fortune that A_2 = a_2. If he wins, i.e., A_2 = a_2, his fortune
at time 2 grows to F_2^1 = f(a_2)F_1^1. If he loses, he has to turn over all his
fortune to the House.
• In general, if k ≤ ℓ and his fortune at time k − 1 is F_{k−1}^1 (the fortune could
be 0 at that moment), the player bets all his fortune, at odds f(a_k) on a dollar, that
A_k = a_k. If this happens, his fortune grows to F_k^1 = f(a_k)F_{k−1}^1. Otherwise,
he surrenders his fortune F_{k−1}^1 to the House, so F_k^1 = 0 in this case.

• At time ` the first gambler stops playing, so Fn1 = F`1 , ∀n ≥ `.


• We denote by Xn1 the profit of the first player at time n, Xn1 = Fn1 −F01 = Fn1 −1.

Concisely, if we define
\[ M_k^1 = \begin{cases} f(a_k)\, I_{\{A_k = a_k\}}, & 1 \le k \le \ell,\\ 1, & k < 1 \text{ or } k > \ell, \end{cases} \]
then
\[ F_n^1 = \prod_{k=1}^{n} M_k^1. \]
Since E[M_k^1] = 1 we deduce that F_•^1 and X_•^1 = F_•^1 − 1 are martingales.
In general, for m = 1, 2, …, the m-th player also plays ℓ rounds using the same strategy as the first player, but with a delay of m − 1 units of time.
Thus, the second player skips game 1 and only starts betting before the 2nd game
using the same betting strategy as if the game started when he began playing: at
his j-th round he bets f (aj ) on a dollar that the outcome is Aj+1 = aj . The third
player skips the first two games etc.
In general, at his j-th round, the m-th player bets f (aj ) on a dollar that the
outcome is Aj+m−1 = aj . We denote by Fnm the fortune of the m-th player at time
n. More precisely, if we set
\[ M_k^m := \begin{cases} f(a_{k-m+1})\, I_{\{A_k = a_{k-m+1}\}}, & m \le k \le m+\ell-1,\\ 1, & k < m \text{ or } k \ge m+\ell, \end{cases} \]
then
\[ F_n^m := \prod_{k=1}^{n} M_k^m, \quad X_n^m = F_n^m - 1, \quad n = 1, 2, \dots. \]
Note that F_n^m = 1 for n < m because the m-th player skips the games n = 1, 2, …, m − 1. Define
\[ S_n := \sum_{m\ge 1} X_n^m = \sum_{m=1}^{n} X_n^m = \sum_{m=1}^{n} F_n^m - n. \]
In other words, Sn is the sum of the profits of all the players after n games. The
process S• is obviously a martingale. Note that
\[ S_T = \sum_{m\le T} F_T^m - T, \quad T = T_a. \]
Recall that T is the first moment of time such that
\[ A_{T-\ell+1} = a_1, \ A_{T-\ell+2} = a_2, \ \dots, \ A_T = a_\ell. \tag{3.1.10} \]
Thus the player T − ℓ + 1 will be the first player to hit the jackpot, i.e., to observe the pattern a during the first ℓ games he plays. This proves F_T^m = 0 for m ≤ T − ℓ. Indeed, the minimality of T implies
\[ \big(A_m, \dots, A_{m+\ell-1}\big) \ne (a_1, \dots, a_\ell), \]
and thus I_{\{A_m = a_1\}} \cdots I_{\{A_{m+\ell-1} = a_\ell\}} = 0.

The fortune of the player T − ℓ + 1 at time T is
\[ F_T^{T-\ell+1} = f(a_1) \cdots f(a_\ell). \]
Using the equalities (3.1.10) we deduce that the fortune of the next player, T − ℓ + 2, at time T is nonzero if and only if
\[ (a_2, \dots, a_\ell) = (a_1, \dots, a_{\ell-1}). \]
In this case the fortune is f(a_1) ⋯ f(a_{ℓ−1}). Similarly,
\[ F_T^{T-\ell+3} = \begin{cases} f(a_1)\cdots f(a_{\ell-2}), & (a_1, \dots, a_{\ell-2}) = (a_3, \dots, a_\ell),\\ 0, & (a_1, \dots, a_{\ell-2}) \ne (a_3, \dots, a_\ell). \end{cases} \]
More generally, denote by δ_{α,β} the Kronecker symbol
\[ \delta_{\alpha,\beta} := \begin{cases} 1, & \alpha = \beta,\\ 0, & \alpha \ne \beta. \end{cases} \]
We deduce
\[ S_T + T = \underbrace{F_T^1 + \cdots + F_T^{T-\ell}}_{=0} + F_T^{T-\ell+1} + F_T^{T-\ell+2} + \cdots + F_T^{T} \]
\[ = \underbrace{f(a_1)\cdots f(a_\ell)}_{F_T^{T-\ell+1}} + \underbrace{\prod_{j=1}^{\ell-1} f(a_j)\,\delta_{a_{j+1},a_j}}_{F_T^{T-\ell+2}} + \underbrace{\prod_{j=1}^{\ell-2} f(a_j)\,\delta_{a_{j+2},a_j}}_{F_T^{T-\ell+3}} + \cdots \]
\[ = \sum_{k=0}^{\ell-1} \prod_{j=1}^{\ell-k} f(a_j)\,\delta_{a_{j+k},a_j} =: \tau(a). \]

Hence S_T = τ(a) − T. If we could show that T satisfies Doob's conditions (Definition 3.26), then we could invoke the Optional Sampling Theorem 3.28 and conclude that
\[ 0 = E[S_0] = E[S_T] = \tau(a) - E[T]. \]
Let us show that the stopping time does indeed satisfy Doob's conditions. Since E[T] < ∞ and S_T = τ(a) − T we deduce S_T ∈ L^1. Arguing as above we deduce that if n < T, then
\[ F_n^1 + \cdots + F_n^n \le F_n^{n-\ell+1} + \cdots + F_n^n \le f(a_1)\cdots f(a_\ell) + \prod_{j=1}^{\ell-1} f(a_j)\,\delta_{a_{j+1},a_j} + \prod_{j=1}^{\ell-2} f(a_j)\,\delta_{a_{j+2},a_j} + \cdots = \tau(a). \]

Hence
\[ |S_n|\, I_{\{T>n\}} \le \big(\tau(a) + n\big) I_{\{T>n\}} \le \big(\tau(a) + T\big) I_{\{T>n\}}. \]
Since E[T] < ∞ we deduce
\[ \lim_{n\to\infty} E\big[\, |S_n|\, I_{\{T>n\}} \,\big] = 0. \]
This shows that the stopping time T_a satisfies Doob's conditions, so that
\[ E[T_a] = \tau(a) = \sum_{k=0}^{\ell-1} \prod_{j=1}^{\ell-k} \frac{\delta_{a_{j+k},a_j}}{\pi(a_j)}. \tag{3.1.11} \]

Let us describe this equality using a more convenient notation. Denote by V(A) the vocabulary of the alphabet A,
\[ V(A) = \bigsqcup_{\ell \ge 0} A^\ell, \quad A^0 := \{\varnothing\}. \]
We denote by ℓ(a) the length of a word a. We define a weight w = w_π : V(A) → (0,∞) by setting
\[ w(a_1, \dots, a_\ell) = \prod_{k=1}^{\ell} f(a_k), \quad w(\varnothing) = 1. \]

For a = (a_1, …, a_ℓ) ∈ A^ℓ, ℓ ≥ 1, and j = 1, …, ℓ we define the left/right tail maps
\[ L_j, R_j : A^\ell \to A^j, \quad L_j(a) = (a_1, \dots, a_j), \quad R_j(a) = (a_{\ell-j+1}, \dots, a_\ell). \]
Thus, R_j retains only the last j letters of a word while L_j retains the first j letters. Given two words a, b ∈ V(A) we set
\[ \langle a, b \rangle := \begin{cases} 1, & a = b,\\ 0, & a \ne b. \end{cases} \]
Now define
\[ \Phi : V(A) \times V(A) \to [0,\infty), \quad \Phi(a,b) = \sum_{j=1}^{\ell(a)\wedge\ell(b)} \big\langle R_j a, L_j b \big\rangle\, w\big(L_j b\big). \tag{3.1.12} \]
We can rewrite (3.1.11) as
\[ E[T_a] = \Phi(a,a). \tag{3.1.13} \]
In the special case when A = {1, 2, …, 6}, π is the uniform counting probability and
\[ a = \underbrace{6 \cdots 6}_{k} \in A^k, \]
then the waiting time τ(a) coincides with the waiting time T to observe the first occurrence of a k-run of 6s discussed in Example 1.167. In this case we have
\[ E[T] = \sum_{j=1}^{k} 6^j = \frac{6^{k+1}-6}{5}. \]
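The closed formula (3.1.11) is easy to check numerically. The book's simulations are in R (see Example A.22); the sketch below is an illustrative Python translation of the same experiment, assuming the payoff f(a) = 1/π(a) used above. It evaluates τ(a) directly and compares it with a Monte Carlo estimate of the waiting time.

```python
import random

def expected_wait(pattern, pi):
    """tau(a) from (3.1.11): sum over k of prod_j delta(a_{j+k}, a_j)/pi(a_j)."""
    l = len(pattern)
    total = 0.0
    for k in range(l):
        term = 1.0
        for j in range(l - k):
            if pattern[j + k] != pattern[j]:
                term = 0.0
                break
            term /= pi[pattern[j]]
        total += term
    return total

def waiting_time(pattern, pi, rng):
    """Sample i.i.d. letters from pi until the pattern occurs; return the time."""
    letters, weights = zip(*pi.items())
    window, t = [], 0
    while tuple(window) != tuple(pattern):
        t += 1
        window.append(rng.choices(letters, weights)[0])
        window = window[-len(pattern):]
    return t

pi6 = {face: 1 / 6 for face in range(1, 7)}
print(round(expected_wait((6, 6), pi6)))   # 42 = (6**3 - 6) / 5

rng = random.Random(1)
est = sum(waiting_time((6, 6), pi6, rng) for _ in range(20000)) / 20000
print(est)                                 # Monte Carlo estimate, close to 42
```

The pattern, seed and sample size are arbitrary choices made for the illustration.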

We refer to Example A.22 for an R-code that simulates sampling an alphabet until
a given pattern is observed.
Let us discuss in more detail the special case A = {H, T}, π(H) = π(T) = 1/2, so that f(H) = f(T) = 2. Suppose that a is the pattern a = (T, T, H, H) and b = (H, H, H). Observe that ⟨R_j a, L_j a⟩ = 1 for j = 4 and 0 otherwise, hence E[T_a] = Φ(a,a) = 2⁴ = 16. A similar computation shows that E[T_b] = Φ(b,b) = 2³ + 2² + 2 = 14. Thus, on average, we have to wait a longer time for the pattern a to occur.
On the other hand, a formula of Conway (see Exercise 3.14) shows that
\[ \frac{P[T_b < T_a]}{P[T_a < T_b]} = \frac{\Phi(a,a) - \Phi(a,b)}{\Phi(b,b) - \Phi(b,a)}. \]
We have ⟨R_j b, L_j a⟩ = 0 for all j, while ⟨R_j a, L_j b⟩ = 1 for j = 1, 2, so that
\[ \Phi(b,a) = 0, \quad \Phi(a,b) = 2 + 4 = 6, \quad \frac{P[T_b < T_a]}{P[T_a < T_b]} = \frac{16 - 6}{14 - 0} = \frac{5}{7}. \]
We have reached a somewhat surprising conclusion: although, on average, we have
to wait a shorter amount of time to observe the pattern b, it is less likely that we
will observe b before a. The odds that b will appear first versus that a will appear
first are 5 : 7.
There are other strange phenomena. We should mention M. Gardner's even stranger nontransitivity paradox [66, Chap. 5]. More precisely, given any pattern a ∈ A^k there exists a pattern b ∈ A^k such that b is more likely to occur before a, i.e., P[T_b < T_a] > 1/2. As shown by Guibas and Odlyzko [76], if a = (a_1, …, a_k), we can choose b to be of the form b = (b_1, a_1, …, a_{k−1}). ⊓⊔

3.1.5 Concentration inequalities: martingale techniques


Hoeffding’s inequality (2.3.12) has a martingale counterpart usually referred to as
Azuma’s inequality.

Theorem 3.33 (Azuma). Suppose that (X_n)_{n≥0} is a martingale adapted to a filtration F_• = (F_n)_{n≥0} of the probability space (Ω, S, P). Assume that for any n ∈ N there exist constants a_n < b_n such that the differences D_n = X_n − X_{n−1} satisfy
\[ a_n \le D_n \le b_n \quad \text{a.s.} \]
Then, for every x > 0,
\[ P\big[\, |X_n - X_0| > x \,\big] \le 2\, e^{-2x^2/(s_1^2 + \cdots + s_n^2)}, \quad s_k = b_k - a_k. \tag{3.1.14} \]

Proof. The strategy is a variation on Chernoff's method. Set
\[ D_n := X_n - X_{n-1}, \quad S_n^2 := s_1^2 + \cdots + s_n^2, \quad \forall n \in \mathbb{N}. \]
We will prove inductively that
\[ X_n - X_0 \in G(S_n^2/4), \ \text{i.e.,} \ E\big[e^{\lambda(X_n - X_0)}\big] \le e^{\lambda^2 S_n^2/8}, \quad \forall n \in \mathbb{N},\ \lambda \in \mathbb{R}. \tag{3.1.15} \]
Assuming this, the inequality (3.1.14) follows from (2.3.11b).

To prove (3.1.15) note that since (X_n) is a martingale we have
\[ E\big[e^{\lambda(X_n - X_0)} \,\|\, F_{n-1}\big] = e^{\lambda(X_{n-1} - X_0)}\, E\big[e^{\lambda D_n} \,\|\, F_{n-1}\big]. \]
We set
\[ Z_n(\lambda) := E\big[e^{\lambda D_n} \,\|\, F_{n-1}\big], \quad \forall n \in \mathbb{N},\ \lambda \in \mathbb{R}. \]
We claim that
\[ \forall n \in \mathbb{N},\ \forall \lambda \in \mathbb{R}, \quad Z_n(\lambda) \le e^{\lambda^2 s_n^2/8} \ \text{a.s.} \tag{3.1.16} \]
Obviously this implies that
\[ E\big[e^{\lambda(X_n - X_0)}\big] \le e^{\lambda^2 s_n^2/8}\, E\big[e^{\lambda(X_{n-1} - X_0)}\big], \]
from which we can conclude inductively that X_n − X_0 ∈ G(S_n²/4).
To prove (3.1.16) observe that, by construction, Z_n(λ) is F_{n−1}-measurable. We have to show that for any S ∈ F_{n−1} such that P[S] ≠ 0,
\[ E\big[Z_n(\lambda)\, I_S\big] \le P[S]\, e^{\lambda^2 s_n^2/8}. \]
Denote by D_n^S the restriction of D_n to S, viewed as a random variable on the probability space (S, F_{n−1} ∩ S, P_S), where
\[ P_S[A] = P[A \,\|\, S] = \frac{P[A]}{P[S]}, \quad \forall A \in F_{n-1} \cap S. \]
We denote by E_S the expectation on (S, F_{n−1} ∩ S, P_S). Since E[D_n ‖ F_{n−1}] = 0 we deduce
\[ E_S\big[D_n^S\big] = 0. \]
Clearly a_n ≤ D_n^S ≤ b_n. We deduce from Hoeffding's Lemma (Proposition 2.56) that
\[ E_S\big[e^{\lambda D_n^S}\big] \le e^{\lambda^2 s_n^2/8}, \]
and therefore, for all S ∈ F_{n−1} with P[S] ≠ 0, we have
\[ E\big[Z_n(\lambda)\, I_S\big] = E\big[e^{\lambda D_n} I_S\big] = P[S]\, E_S\big[e^{\lambda D_n^S}\big] \le P[S]\, e^{\lambda^2 s_n^2/8}. \]
This concludes the proof of Azuma's inequality. ⊓⊔

The strength of Azuma’s inequality is best appreciated in concrete examples.
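As a quick numerical sanity check of (3.1.14), take a simple ±1 random walk: the differences satisfy −1 ≤ D_k ≤ 1, so s_k = 2 and the bound reads 2 exp(−x²/(2n)). The Python sketch below is an illustration written for this text; the parameters are arbitrary.

```python
import math, random

rng = random.Random(0)
n, trials, x = 100, 20000, 25
exceed = 0
for _ in range(trials):
    # martingale with increments D_k = +-1, so a_k = -1, b_k = 1, s_k = 2
    walk = sum(rng.choice((-1, 1)) for _ in range(n))
    if abs(walk) > x:
        exceed += 1

empirical = exceed / trials
bound = 2 * math.exp(-2 * x * x / (4 * n))   # (3.1.14) with s_1^2+...+s_n^2 = 4n
print(empirical, "<=", bound)
```

The empirical tail frequency sits well below the Azuma bound, as it must.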

Example 3.34 (Longest common subsequence). We want to have another look at the problem of the longest common subsequence first discussed in Example 1.150. Let us briefly recall the set-up.
We are given a finite set (alphabet) A, |A| = k, and a family of independent A-valued random variables
\[ X_m, Y_n, \quad m, n \in \mathbb{N}, \]

all with the same distribution π. We denote by L_n the length of the longest common subsequence of the two random words
\[ (X_1, \dots, X_n) \ \text{and} \ (Y_1, \dots, Y_n). \]
We set
\[ R_n := \frac{1}{n} L_n, \quad R := \sup_n R_n. \]
In Example 1.150 we have shown that
\[ \frac{1}{n} L_n \to R \quad \text{a.s.,} \]
and
\[ \lim_{n\to\infty} E[R_n] = r(\pi) := E[R]. \]
We want to show that R_n is highly concentrated around its mean r_n := E[R_n]. We follow the presentation in [144, Sec. 1.3]. Set ℓ_n := E[L_n], Z_n = (X_n, Y_n). Consider the finite filtration
\[ F_0 := \sigma(\varnothing), \quad F_j = \sigma\big(Z_1, \dots, Z_j\big), \quad j = 1, \dots, n. \]


Form the Doob (closed) martingale U_j := E[L_n ‖ F_j]. Note that U_0 = ℓ_n. The random variable L_n is a function of the Z_j's,
\[ L_n = L_n(Z_1, \dots, Z_n), \]
and U_j is a function of Z_1, …, Z_j, U_j = F_j(Z_1, …, Z_j). More precisely,
\[ F_j(z_1, \dots, z_j) = E\big[L_n \,\big|\, Z_1 = z_1, \dots, Z_j = z_j\big] = E\big[L_n(z_1, \dots, z_j, Z_{j+1}, \dots, Z_n)\big] \]
\[ = \int_{(A^2)^{n-j}} L_n(z_1, \dots, z_j, z_{j+1}, \dots, z_n)\, \pi^{\otimes 2(n-j)}\big(dz_{j+1} \cdots dz_n\big). \]

Note that for any z_1, …, z_{j−1}, z_j, z_j', z_{j+1}, …, z_n ∈ A² we have
\[ -1 \le L_n(z_1, \dots, z_{j-1}, z_j', z_{j+1}, \dots, z_n) - L_n(z_1, \dots, z_{j-1}, z_j, z_{j+1}, \dots, z_n) \le 1. \]
Integrating with respect to z_j', z_{j+1}, …, z_n we deduce
\[ -1 \le F_{j-1}(z_1, \dots, z_{j-1}) - F_j(z_1, \dots, z_j) \le 1. \]
Hence |U_j − U_{j−1}| ≤ 1. From Azuma's inequality with s_n = 2 we deduce
\[ P\big[\, |L_n - \ell_n| \ge nx \,\big] \le 2 e^{-nx^2/2}, \]
so that
\[ P\big[\, |R_n - r_n| \ge x \,\big] \le 2 e^{-nx^2/2}. \]

This proves that R_n is highly concentrated around its mean. Obviously,
\[ \forall \varepsilon > 0, \quad \sum_{n\ge 1} P\big[\, |R_n - r_n| \ge \varepsilon \,\big] < \infty, \]
and Corollary 1.141 implies that R_n − r_n → 0 a.s. On the other hand, we know from Example 1.150 that
\[ R_n \to R \ \text{a.s. and} \ r_n \to r(\pi) = E[R]. \]
Hence (1/n)L_n converges almost surely to the constant r(π).
We write r(k) instead of r(π) when π is the uniform distribution on an alphabet of cardinality k. In this case one has additional information about the rate of convergence of r_n to r(k). However, the exact value of r(k) remains elusive, even for small k. ⊓⊔
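For concreteness, L_n can be computed by the classical dynamic program, and a small simulation makes the concentration of R_n = L_n/n visible. The following Python sketch is ours, not the book's; the alphabet and sample sizes are arbitrary choices.

```python
import random

def lcs_len(x, y):
    """Length of the longest common subsequence (classical O(n^2) DP)."""
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if xi == yj else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(x)][len(y)]

print(lcs_len("AGGTAB", "GXTXAYB"))   # 4, realized by "GTAB"

# R_n = L_n / n for random words over a two-letter alphabet
rng = random.Random(0)
n, trials = 200, 50
samples = []
for _ in range(trials):
    u = [rng.randint(0, 1) for _ in range(n)]
    v = [rng.randint(0, 1) for _ in range(n)]
    samples.append(lcs_len(u, v) / n)
print(min(samples), max(samples))     # a narrow range, as Azuma predicts
```

The spread of the 50 sampled values of R_200 is on the order of a few percent, consistent with the exponential tail bound above.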

Example 3.35 (Bin packing). The bin packing problem has a short formulation: pack n items of sizes x_1, …, x_n ∈ [0,1] into as few bins of maximum capacity 1 each as possible. We denote by B_n(x_1, …, x_n) the smallest number of bins we can use to pack the items of sizes x_1, …, x_n.
As in the case of the longest common subsequence problem, the bin packing problem has a probabilistic counterpart. Consider independent random variables X_n ∼ Unif([0,1]), n ∈ N, defined on a probability space (Ω, S, P). We will describe the behavior of b_n := E[B_n(X_1, …, X_n)] as n → ∞.
Note that
\[ X_1 + \cdots + X_n \le B_n(X_1, \dots, X_n) \le n. \]

By taking expectations we deduce
\[ \frac{n}{2} \le b_n \le n, \quad \forall n \in \mathbb{N}, \tag{3.1.17} \]
showing that b_n has linear growth as n → ∞. On the other hand,
\[ B_{n+m}(X_1, \dots, X_n, X_{n+1}, \dots, X_{n+m}) \le B_n(X_1, \dots, X_n) + B_m(X_{n+1}, \dots, X_{n+m}), \tag{3.1.18} \]
and thus
\[ b_{n+m} \le b_n + b_m, \quad \forall n, m \in \mathbb{N}. \]
Setting r_n := b_n/n, we deduce from Fekete's Lemma 1.151 that
\[ \lim_{n\to\infty} r_n = r := \inf_n r_n. \]
The inequalities (3.1.17) show that r ∈ [1/2, 1]. We set R_n := B_n/n. We deduce from (3.1.18) and Fekete's Lemma that
\[ R_n \to R := \inf_n R_n \ \text{a.s., and} \ r = E[R]. \]

We want to show that R_n is highly concentrated around its mean. We use the same approach as in Example 3.34. We set
\[ F_j = \sigma(X_1, \dots, X_j), \quad F_0 = \{\varnothing, \Omega\}. \]
Fix n ∈ N. For j = 0, 1, …, n we set
\[ U_j = U_{n,j} := E\big[B_n \,\|\, F_j\big], \]
so the collection (U_j)_{0≤j≤n} is a martingale adapted to the filtration (F_j)_{0≤j≤n}. There exist Borel measurable maps F_j : [0,1]^j → R such that U_j = F_j(X_1, …, X_j). More precisely,
\[ F_j(x_1, \dots, x_j) = \int_{[0,1]^{n-j}} B_n\big(x_1, \dots, x_j, x_{j+1}, \dots, x_n\big)\, dx_{j+1} \cdots dx_n. \]

For any x_1, …, x_{j−1}, x_j, x_j', x_{j+1}, …, x_n ∈ [0,1] we have
\[ -1 \le B_n\big(x_1, \dots, x_{j-1}, x_j', x_{j+1}, \dots, x_n\big) - B_n\big(x_1, \dots, x_{j-1}, x_j, x_{j+1}, \dots, x_n\big) \le 1. \]
Integrating with respect to x_j', x_{j+1}, …, x_n we deduce |U_j − U_{j−1}| ≤ 1. Invoking Azuma's inequality we deduce, as in Example 3.34, that
\[ P\big[\, |R_n - r_n| > x \,\big] \le 2 e^{-nx^2/2}. \]

This shows that R_n is highly concentrated around its mean and that R_n → r a.s.
In this case it is known that r = 1/2. More precisely, there is an algorithm called MATCH which takes as input the sizes x_1, …, x_n of the n items and packs them into M_n = M_n(x_1, …, x_n) bins, where
\[ \frac{n}{2} \le E[B_n] \le E[M_n] \le \frac{n}{2} + O(\sqrt{n}). \]
This is the best one can hope for, since it is also known that
\[ E[B_n] \ge \frac{n}{2} + \big(\sqrt{3}-1\big)\sqrt{\frac{n}{24\pi}} + o(\sqrt{n}). \]
For details we refer to [34, Sec. 5.1]. ⊓⊔
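Computing B_n exactly is an NP-hard combinatorial problem, but the simple first-fit-decreasing heuristic (our illustrative stand-in, not the MATCH algorithm mentioned above) already exhibits the linear growth and a ratio near r = 1/2. Integer sizes with capacity 100 avoid floating-point issues.

```python
import random

def first_fit_decreasing(sizes, capacity=100):
    """Heuristic upper bound for B_n: sort items in decreasing order and
    drop each one into the first bin with enough room, opening bins as needed."""
    bins = []
    for s in sorted(sizes, reverse=True):
        for i, load in enumerate(bins):
            if load + s <= capacity:
                bins[i] += s
                break
        else:
            bins.append(s)
    return len(bins)

print(first_fit_decreasing([50, 50, 40, 40, 10, 10]))   # 2

rng = random.Random(0)
n = 2000
sizes = [rng.randint(1, 100) for _ in range(n)]         # uniform sizes, capacity 100
ratio = first_fit_decreasing(sizes) / n
print(ratio)                                            # hovers near r = 1/2
```

The heuristic only gives an upper bound on B_n, but for uniform sizes its ratio already concentrates close to 1/2, as the example predicts.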

The tricks used in the above examples are generalized and refined in McDiarmid's inequality.

Definition 3.36 (Bounded difference property). Suppose that S is a set. A function f : S^n → R, n ∈ N, is said to satisfy the bounded difference property if there exist L_1, …, L_n > 0 such that
\[ \big| f(s_1, \dots, s_{k-1}, s, s_{k+1}, \dots, s_n) - f(s_1, \dots, s_{k-1}, s', s_{k+1}, \dots, s_n) \big| \le L_k, \tag{3.1.19} \]
for all k = 1, …, n and all s_1, …, s_{k−1}, s, s', s_{k+1}, …, s_n ∈ S. ⊓⊔

Let us observe that the above condition is satisfied if and only if f is Lipschitz with respect to the Hamming distance on S^n,
\[ d_H : S^n \times S^n \to [0,\infty), \quad d_H(s, t) := \sum_{k=1}^{n} I_{\mathbb{R}\setminus\{0\}}(s_k - t_k). \tag{3.1.20} \]

Theorem 3.37 (McDiarmid's inequality). Suppose that X_1, …, X_n : (Ω, S, P) → R are independent random variables and f : R^n → R satisfies the bounded difference property with constants L_1, …, L_n. If Z = f(X_1, …, X_n) is integrable, then
\[ P\big[\, Z - E[Z] > t \,\big] \le e^{-2t^2/L^2}, \quad L^2 = L_1^2 + \cdots + L_n^2. \tag{3.1.21} \]

Proof. Denote by P_k the distribution of X_k. Let F_0 = {∅, Ω}, F_k := σ(X_1, …, X_k), and set
\[ Z_k := E\big[Z \,\|\, F_k\big], \quad k = 0, \dots, n, \]
so that Z_n = Z and Z_0 = E[Z]. Since X_1, …, X_n are independent we deduce that for all ω ∈ Ω,
\[ Z_k(\omega) = g_k\big(X_1(\omega), \dots, X_k(\omega)\big), \]
where
\[ g_k(x_1, \dots, x_k) = \int_{\mathbb{R}^{n-k}} f(x_1, \dots, x_k, x_{k+1}, \dots, x_n)\, P_{k+1}(dx_{k+1}) \cdots P_n(dx_n). \]
Note that
\[ g_{k-1}(x_1, \dots, x_{k-1}) = E_k\big[g_k\big] := \int_{\mathbb{R}} g_k(x_1, \dots, x_{k-1}, x_k)\, P_k(dx_k). \]
Hence
\[ D_k = g_k - E_k[g_k], \quad E\big[e^{\lambda D_k} \,\|\, F_{k-1}\big] = h_{k-1}(X_1, \dots, X_{k-1}), \]
where h_{k−1}(x_1, …, x_{k−1}) := E_k[e^{λ(g_k − E_k[g_k])}]. Fix x_1, …, x_{k−1} and set
\[ a_k = a_k(x_1, \dots, x_{k-1}) := \inf_{x_k} g_k(x_1, \dots, x_{k-1}, x_k), \quad b_k = b_k(x_1, \dots, x_{k-1}) := \sup_{y_k} g_k(x_1, \dots, x_{k-1}, y_k). \]
We deduce that
\[ 0 \le b_k - a_k \le \sup_{x_k, y_k} \big( g_k(x_1, \dots, x_{k-1}, y_k) - g_k(x_1, \dots, x_{k-1}, x_k) \big) \le L_k. \]
We deduce from Hoeffding's inequality (2.3.13) that
\[ E_k\big[e^{\lambda(g_k - E_k[g_k])}\big] \le e^{\lambda^2 L_k^2/8}, \quad \forall x_1, \dots, x_{k-1}. \]
Hence
\[ E\big[e^{\lambda D_k} \,\|\, F_{k-1}\big] \le e^{\lambda^2 L_k^2/8} \ \text{a.s.,} \quad \text{i.e.,} \quad E\big[e^{\lambda(Z_n - Z_0)}\big] \le e^{\lambda^2 (L_1^2 + \cdots + L_n^2)/8}. \]
⊓⊔

In Exercise 3.20 we outline an important application of McDiarmid’s inequality.
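As a numerical illustration of Theorem 3.37, take f = the number of distinct values among the n coordinates: changing one coordinate changes f by at most 1, so L_k = 1 and L² = n. The sketch below is ours; it checks the two-sided variant of (3.1.21), which carries an extra factor of 2, by simulation.

```python
import math, random

def distinct(xs):
    """Changing a single coordinate changes the number of distinct values
    by at most 1, so f satisfies (3.1.19) with L_k = 1."""
    return len(set(xs))

rng = random.Random(0)
n, m, trials, t = 60, 30, 20000, 8
samples = [distinct([rng.randint(1, m) for _ in range(n)]) for _ in range(trials)]
mean = sum(samples) / trials
empirical = sum(abs(z - mean) > t for z in samples) / trials
bound = 2 * math.exp(-2 * t * t / n)       # two-sided McDiarmid, L^2 = n
print(mean, empirical, "<=", bound)
```

The choice of n, m and t is arbitrary; any bounded-difference functional of independent coordinates would do.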



3.2 Limit theorems: discrete time

We have seen in the previous section how the Optional Stopping Theorem combined
with quite a bit of ingenuity can produce miraculous results. This section is devoted
to another miraculous property of martingales, namely, their rather nice asymptotic
behavior. The foundational results in this section are all due to J. L. Doob. To
convince the reader of the amazing versatility of martingales we have included a
large eclectic collection of concrete applications.

3.2.1 Almost sure convergence


Fix a probability space (Ω, S, P) and an N_0-filtration F_• of S. We will investigate the behavior of an F_•-submartingale (X_n)_{n∈N_0} as n → ∞. The key to this investigation is Doob's upcrossing inequality.
Given real numbers a < b and a sequence of real numbers α = (α_n)_{n≥0} we define inductively the sequences
\[ \big(S_k(\alpha) = S_k(\alpha; a, b)\big)_{k\ge 1} \ \text{and} \ \big(T_k(\alpha) = T_k(\alpha; a, b)\big)_{k\ge 1} \]
in N_0 ∪ {∞} as follows. We set
\[ S_1(\alpha) := \inf\big\{ n \ge 0 :\ \alpha_n \le a \big\}, \quad T_1(\alpha) := \inf\big\{ n \ge S_1 :\ \alpha_n \ge b \big\}. \]
Thus, S_1 is the first moment the sequence α drops below a, and T_1 is the first moment after S_1 when the sequence α crosses the upper level b. We then define inductively
\[ S_{k+1}(\alpha) := \inf\big\{ n \ge T_k :\ \alpha_n \le a \big\}, \quad T_{k+1}(\alpha) := \inf\big\{ n \ge S_{k+1} :\ \alpha_n \ge b \big\}, \]
where we set inf ∅ = ∞; see Figure 3.2.


[Figure omitted: a sample path with the marked times S_1, T_1, S_2, T_2.]
Fig. 3.2 Up/downcrossing of the interval [a, b].

The terms S_k are called downcrossing times while the terms T_k are called the upcrossing times of the sequence (α_n)_{n≥0}. We define the upcrossing numbers
\[ N_n\big([a,b], \alpha\big) := \#\big\{ k \in \mathbb{N} :\ T_k(\alpha) \le n \big\}, \ n \in \mathbb{N}, \qquad N_\infty\big([a,b], \alpha\big) := \lim_{n\to\infty} N_n\big([a,b], \alpha\big). \tag{3.2.1} \]

The importance of the upcrossing numbers in convergence problems is explained by


the following elementary but rather clever result. In Exercise 3.8 we ask the reader
to provide a proof.

Lemma 3.38. Suppose that α = (αn )n≥0 is a sequence of real numbers. Then the
following statements are equivalent.

(i) The sequence α has a limit (possibly infinite).


(ii) For any rational numbers a < b the total number of upcrossings N∞ ([a, b], α)
is finite. t
u
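The inductive definition of the down/upcrossing times amounts to a single scan of the sequence: wait for a value ≤ a, then for a value ≥ b, and count each completed pair. A minimal Python sketch (ours, for illustration):

```python
def upcrossings(alpha, a, b):
    """N_n([a,b], alpha): the number of completed upcrossings, following
    the inductive definition of S_k (drop to <= a) and T_k (rise to >= b)."""
    count, waiting_for_b = 0, False
    for x in alpha:
        if not waiting_for_b and x <= a:
            waiting_for_b = True       # reached S_k, now look for T_k
        elif waiting_for_b and x >= b:
            count += 1                 # reached T_k: one more upcrossing
            waiting_for_b = False
    return count

print(upcrossings([0, 2, -1, 3, -1, 2], a=0, b=1))   # 3
osc = [(-1) ** n for n in range(100)]                # oscillates, does not converge
print(upcrossings(osc, a=-0.5, b=0.5))               # 49: grows without bound
```

The oscillating sequence illustrates Lemma 3.38: its upcrossing numbers of [−1/2, 1/2] grow linearly, so it has no limit.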

Suppose now that (Xn )n∈N0 is a process adapted to the filtration F• . Then, for
any k ∈ N, the down/up-crossing times Sk (X) and Tk (X) are stopping times.

Theorem 3.39 (Doob's upcrossing inequality). Assume that X = (X_n)_{n∈N_0} is a submartingale. Then for any real numbers a < b we have
\[ (b-a)\, E\big[N_n([a,b], X)\big] \le E\big[(X_n - a)^+\big] - E\big[(X_0 - a)^+\big], \quad x^+ := \max(x, 0). \tag{3.2.2} \]

Proof. Since (X − a)^+ is a submartingale and
\[ N_n\big([a,b], X\big) = N_n\big([0, b-a], (X-a)^+\big), \]
we see that it suffices to prove the result in the special case X ≥ 0 and a = 0 < b. In other words, it suffices to prove that if X ≥ 0, then
\[ b\, E\big[N_n([0,b], X)\big] \le E\big[X_n - X_0\big]. \tag{3.2.3} \]
The key fact underlying this inequality is the existence of a submartingale Y that lies above the random process b·N_n([0,b], X) and, in the mean, below the process X. Consider the predictable process
\[ H = \sum_{k=1}^{\infty} I_{]]S_k(X),\, T_k(X)]]}, \]
i.e.,
\[ H_n = \sum_{k=1}^{\infty} I_{\{S_k(X) < n \le T_k(X)\}}. \]
Since the intervals
\[ \big(S_1(X), T_1(X)\big], \ \big(S_2(X), T_2(X)\big], \ \dots \]
are pairwise disjoint (when finite), we have H_n ≤ 1. Set Y_n := (H · X)_n.
In stock market terms, think of the following investing strategy. Start buying a stock when its price hits zero, and sell it at the end of the trading day. Continue buying (and selling) the stock as long as its price at the start of the trading day is below b. Once the price crosses b, stop buying and wait until the price hits 0 again. The price of the stock at the beginning of the n-th trading day is X_{n−1} and

changes to X_n at the end of the n-th trading day. Then Y_n is the profit following this strategy at the end of n days. Clearly the profit will be at least as big as b times the number of upcrossings of the interval (0, b). This is the content of the following fundamental inequality:
\[ Y_n \ge b\, N_n\big([0,b], X\big). \tag{3.2.4} \]
Here is a formal proof of this inequality. Let M := N_n([0,b], X). Then
\[ Y_n = \sum_{j=1}^{n} H_j\big(X_j - X_{j-1}\big) = \sum_{k=1}^{M} \sum_{j=S_k(X)+1}^{T_k(X)} (X_j - X_{j-1}) + \sum_{j=S_{M+1}+1}^{n} (X_j - X_{j-1}) \]
\[ = \sum_{k=1}^{M} \big(X_{T_k} - X_{S_k}\big) + I_{\{S_{M+1} < n\}}\big(X_n - X_{S_{M+1}}\big) \]
(use the fact that X_{S_{M+1}} = 0 and X_n ≥ 0)
\[ \ge \sum_{k=1}^{M} \underbrace{\big(X_{T_k} - X_{S_k}\big)}_{\ge b} \ \ge\ bM = b\, N_n\big([0,b], X\big). \]

Hence
\[ b\, E\big[N_n([0,b], X)\big] \le E[Y_n], \quad \forall n \in \mathbb{N}. \]
Note that the inequality (3.2.4) does not rely on the fact that X is a submartingale. The process (H_n) is predictable and thus
\[ E\big[Y_k - Y_{k-1} \,\|\, F_{k-1}\big] = E\big[H_k(X_k - X_{k-1}) \,\|\, F_{k-1}\big] = H_k\, E\big[(X_k - X_{k-1}) \,\|\, F_{k-1}\big]. \]
Since X is a submartingale we deduce
\[ E\big[(X_k - X_{k-1}) \,\|\, F_{k-1}\big] \ge 0. \]
On the other hand, H_k ≤ 1, so that
\[ H_k\, E\big[(X_k - X_{k-1}) \,\|\, F_{k-1}\big] \le E\big[(X_k - X_{k-1}) \,\|\, F_{k-1}\big]. \]
Hence
\[ E\big[Y_k - Y_{k-1}\big] \le E\big[X_k - X_{k-1}\big]. \tag{3.2.5} \]
We deduce
\[ b\, E\big[N_n([0,b], X)\big] \le E[Y_n] = \sum_{k=1}^{n} E\big[Y_k - Y_{k-1}\big] \le E\big[X_n - X_0\big]. \]
⊓⊔

Remark 3.40. We should ponder why the inequality (3.2.5) is miraculous. We know that H_k ∈ [0,1] so, whenever X_k ≤ X_{k−1}, i.e., the price of the stock goes down, we have X_k − X_{k−1} ≤ H_k(X_k − X_{k−1}) = Y_k − Y_{k−1}. The inequality (3.2.5) shows that this is not the expected behavior. The fact that X_• is a submartingale biases the price in favor of increase. That is the reason why (3.2.5) holds. ⊓⊔

Theorem 3.41 (Submartingale Convergence Theorem). Suppose that (X_n)_{n∈N_0} is a submartingale satisfying
\[ \sup_{n\in\mathbb{N}_0} E\big[X_n^+\big] < \infty. \tag{3.2.6} \]
Then X_n converges almost surely to an integrable random variable X_∞.

Remark 3.42. Observe that since X_n is a submartingale we have
\[ E[X_0] \le E[X_n] = E\big[X_n^+\big] - E\big[X_n^-\big], \quad x^- = \max(-x, 0), \]
so that
\[ \sup_{n\in\mathbb{N}_0} E\big[X_n^-\big] < \infty, \]
showing that (3.2.6) is equivalent to
\[ \sup_{n\in\mathbb{N}_0} E\big[|X_n|\big] < \infty. \tag{3.2.7} \]
⊓⊔

Proof. Set
\[ M := \sup_{n\in\mathbb{N}_0} E\big[|X_n|\big]. \]
Now let a, b ∈ Q, a < b. Doob's upcrossing inequality shows that, for all n ≥ 1, we have
\[ (b-a)\, E\big[N_n([a,b], X_\bullet)\big] \le E\big[(X_n - a)^+\big] \le |a| + E\big[|X_n|\big] \le |a| + M. \]
Letting n → ∞ we deduce E[N_∞([a,b], X_•)] < ∞, and thus N_∞([a,b], X_•) < ∞ a.s. By removing a countable family of negligible sets (one for each pair of rational numbers a, b, a < b) we deduce that there exists a negligible set N ⊂ Ω such that for all ω ∈ Ω \ N we have
\[ N_\infty\big([a,b], X_\bullet(\omega)\big) < \infty, \quad \forall a, b \in \mathbb{Q},\ a < b. \]
Lemma 3.38 implies that the sequence X_• converges a.s. to a random variable X_∞. The integrability of X_∞ follows from Fatou's lemma:
\[ E\big[|X_\infty|\big] \le \liminf_{n\to\infty} E\big[|X_n|\big] < \infty. \]
⊓⊔

Corollary 3.43. Suppose that (Xn )n∈N0 is a nonnegative supermartingale. Then


Xn converges a.s. to an integrable random variable X∞ .

Proof. Observe that Yn = −Xn is a submartingale and Yn+ = 0. The result now
follows from the Submartingale Convergence Theorem. t
u

Corollary 3.44. Suppose that (X_n)_{n∈N_0} is a submartingale adapted to the filtration (F_n)_{n∈N_0} and T is an a.s. finite stopping time. If sup_n E[|X_n|] < ∞, then
\[ \lim_{n\to\infty} X_n^T = \lim_{n\to\infty} X_{n\wedge T} = X_T \quad \text{a.s.} \]

Proof. Note that (X_•^+)^T = (X_•^T)^+, so (X_•^T)^+ is a submartingale. The Optional Sampling Theorem applied to the bounded stopping times n ∧ T ≤ n implies
\[ E\big[X_{n\wedge T}^+\big] \le E\big[X_n^+\big], \]
so that
\[ \sup_n E\big[X_{n\wedge T}^+\big] < \infty. \]
The conclusion now follows from the Submartingale Convergence Theorem. ⊓⊔

Example 3.45 (Galton-Watson/branching processes).² Consider again the branching process in Example 3.8 with reproduction law µ ∈ Prob(N_0) and mean m,
\[ 0 < m := \sum_{n\ge 0} n\,\mu(n) < \infty. \]

As explained in Example 3.8, the sequence
\[ W_n = \frac{1}{m^n} Z_n, \quad n \in \mathbb{N}_0, \]
is a nonnegative martingale so, according to Corollary 3.43, it converges a.s. to an integrable random variable W_∞.
If m < 1, the original sequence Z_n = m^n W_n converges a.s. and in mean to 0. Moreover,
\[ E[Z_n] = m^n E[Z_0] = m^n \ell. \]

Thus, the expected population decays exponentially to zero. Something more dramatic holds. Since Z_n ≥ 1 if Z_n > 0, we deduce
\[ P[Z_n > 0] = P[Z_n \ge 1] \le E[Z_n] = \ell m^n. \]
Hence
\[ \sum_{n\ge 0} P[Z_n > 0] < \infty. \]
The Borel-Cantelli Lemma implies that P[Z_n > 0 i.o.] = 0.
2 To the post pandemic reader. I wrote most of this book during the great covid pandemic. I

even taught this example to a group of masked students that were numbed by the news about the
R-factor. The mean m is a close relative of this R-factor. This example explains the desirability
of R < 1.

Thus, a population of bacteria whose members have on average less than one successor will die out, i.e., with probability 1 there exists n ∈ N such that Z_n = 0. If we set
\[ E_n := \{ Z_k = 0,\ \forall k \ge n \} = \{ Z_n = 0 \}, \]
then the event
\[ E = \bigcup_{n\ge 0} E_n \]
is called the extinction event. Note that
\[ E_0 \subset E_1 \subset \cdots \subset E_n \subset \cdots. \]
The probability of E is called the extinction probability. We see that when m < 1, the extinction probability is 1. ⊓⊔
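A simulation makes the subcritical extinction visible. The offspring law below (0 or 2 children, mean m = 2p) is our arbitrary choice of a reproduction law with m < 1; any law with mean below 1 behaves the same way.

```python
import random

def dies_out(z0, p_two, generations, rng):
    """Galton-Watson population: each individual independently has 2
    children with probability p_two, else 0, so the offspring mean is
    m = 2*p_two.  Returns True if the population hits 0 within the horizon."""
    z = z0
    for _ in range(generations):
        z = sum(2 for _ in range(z) if rng.random() < p_two)
        if z == 0:
            return True
    return False

rng = random.Random(0)
# subcritical case: m = 2 * 0.25 = 0.5 < 1, so extinction has probability 1
extinct = sum(dies_out(1, 0.25, 60, rng) for _ in range(2000))
print(extinct, "of 2000 populations are extinct")
```

Since P[Z_60 > 0] ≤ m^60 = 2^{−60}, every one of the 2000 simulated populations is, for all practical purposes, certain to be extinct by generation 60.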

Remark 3.46. It is known that a random series with independent terms converges a.s. if and only if it converges in probability; see Exercise 2.2. However, there exist martingales that converge in probability, but not a.s.
Here is one such example, [53, Example 4.2.14]. Consider the following random walk (X_n)_{n≥0} on Z, where you should think of X_n as the location at time n. We set X_0 = 0. If X_{n−1} is known, then
\[ P\big[X_n = \pm 1 \,\|\, X_{n-1} = 0\big] = \frac{1}{2n}, \quad P\big[X_n = 0 \,\|\, X_{n-1} = 0\big] = 1 - \frac{1}{n}, \]
\[ P\big[X_n = 0 \,\|\, X_{n-1} = x \ne 0\big] = 1 - \frac{1}{n}, \quad P\big[X_n = nx \,\|\, X_{n-1} = x \ne 0\big] = \frac{1}{n}. \]
The existence of such a process is guaranteed by Kolmogorov's theorem.
Denote by F_n the sigma-algebra generated by the random variables X_0, X_1, …, X_n. From the construction we deduce that E[X_n ‖ X_{n−1}] = X_{n−1}, so (X_n) is a martingale with respect to the filtration F_n. Let p_n := P[X_n ≠ 0]. Note that
\[ p_n = P\big[X_n \ne 0 \,\big|\, X_{n-1} = 0\big]\, P\big[X_{n-1} = 0\big] + P\big[X_n \ne 0 \,\big|\, X_{n-1} \ne 0\big]\, P\big[X_{n-1} \ne 0\big] \]
\[ = \frac{1}{n}(1 - p_{n-1}) + \frac{1}{n}\, p_{n-1} = \frac{1}{n}. \]
Hence
\[ \lim_{n\to\infty} P\big[X_n \ne 0\big] = 0, \]
so that X_n converges in probability to 0. To show it does not converge a.s. it suffices to show that it does not converge a.s. to 0.
Denote by F_n the event {X_n ≠ 0}. The random variables X_n have integer values, so F_n = {|X_n| ≥ 1}. Note that F_n ∈ F_n and
\[ E\big[I_{F_n} \,\|\, F_{n-1}\big] = \frac{1}{n}\, I_{\{X_{n-1}=0\}} + \frac{1}{n}\, I_{\{X_{n-1}\ne 0\}} = \frac{1}{n}. \]

Hence
\[ \sum_{n\ge 1} E\big[I_{F_n} \,\|\, F_{n-1}\big] = \sum_{n\ge 1} \frac{1}{n} = \infty. \]
The conditional Borel-Cantelli result in Exercise 3.12 implies that
\[ P\big[\,|X_n| \ge 1 \ \text{i.o.}\,\big] = P\big[F_n \ \text{i.o.}\big] = 1. \]
Thus (X_n) does not converge a.s. to 0.
Recently (2021) Iosif Pinelis gave another beautiful example of a martingale converging in probability but not a.s. Here is, briefly, the construction. Choose a sequence of independent geometric random variables
\[ (T_n)_{n\ge 1}, \quad T_n \sim \mathrm{Geom}(p_n). \]
We perform the following delayed and frequently stopped random walk on Z. We start at X_0 = 0, wait for T_1 moments, and then begin a standard random walk on Z until we first return to the origin. At that moment we take a break lasting T_2 moments and begin the standard walk again until we return to the origin, etc. Denote by X_n the location after n moments. Then (X_n) is a martingale (with respect to an appropriate filtration). Moreover, if
\[ \sum_{n\ge 1} \sqrt{p_n} < \infty, \]
then X_n converges in probability to 0 but not a.s. For details we refer to [128]. ⊓⊔

Example 3.47. The assumptions in the (sub)martingale convergence theorem are not strong enough to guarantee L^1-convergence. The following example shows what can happen.
Consider the standard random walk on Z that starts at 1. Each second the traveler takes a size 1 step forward or back with equal probability. More precisely, consider a sequence of i.i.d. Rademacher random variables (X_n)_{n∈N},
\[ P[X_n = 1] = P[X_n = -1] = \frac{1}{2}. \]
Then the sequence
\[ S_0 = 1, \quad S_n = 1 + X_1 + \cdots + X_n, \quad n \in \mathbb{N}, \]
is a martingale describing the evolution of the walk. Denote by N the first moment the walk reaches the origin, i.e.,
\[ N := \inf\big\{ n \in \mathbb{N} :\ S_n = 0 \big\}. \]
Observe that N < ∞ a.s.; see Exercise 3.13. Consider the random walk stopped at N,
\[ Y_n := S_n^N = S_{n\wedge N}. \]
From the Optional Stopping Theorem 3.24 we deduce that Y_n is a martingale which, by construction, is also nonnegative. Clearly Y_n → 0 a.s. since N < ∞ a.s. This convergence is not L^1 since
\[ E[Y_n] = E[Y_0] = 1, \quad \forall n \in \mathbb{N}. \]
⊓⊔
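A short simulation illustrates the dichotomy: the stopped walk Y_n = S_{n∧N} is eventually 0 on almost every path, yet its expectation stays pinned at 1. The Python sketch below is illustrative; the horizon and sample size are arbitrary.

```python
import random

rng = random.Random(42)
n, trials = 400, 20000
total, absorbed = 0.0, 0
for _ in range(trials):
    s = 1                          # S_0 = 1
    for _ in range(n):
        if s == 0:                 # the walk is stopped at N
            break
        s += rng.choice((-1, 1))
    total += s                     # contribution of Y_n = S_{n∧N}
    if s == 0:
        absorbed += 1

print(total / trials)              # sample mean of Y_n; E[Y_n] = 1 exactly
print(absorbed / trials)           # fraction of paths with N <= n, close to 1
```

Most paths have already hit 0, so Y_n is usually 0; the mean is kept at 1 by a small number of paths on which Y_n is large. This is exactly why the a.s. limit 0 cannot be an L^1 limit.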

3.2.2 Uniform integrability


We will describe in this subsection necessary and sufficient conditions guaranteeing
that a sequence that converges in probability also converges in p-mean.
We begin with a basic fact.

Lemma 3.48. Let X ∈ L^1(Ω, S, P). Then
\[ \lim_{n\to\infty} E\big[|X|\, I_{\{|X|\ge n\}}\big] = 0. \]

Proof. The sequence Z_n := |X| I_{\{|X|>n\}} converges a.s. to 0 and |Z_n| ≤ |X|, ∀n. The desired conclusion now follows from the Dominated Convergence theorem. ⊓⊔

Definition 3.49 (Uniform integrability). A collection X ⊂ L^1(Ω, S, P) is called uniformly integrable (or UI for brevity) if
\[ \lim_{r\to\infty} E\big[|X|\, I_{\{|X|\ge r\}}\big] = 0 \ \text{uniformly in } X \in \mathcal{X}. \tag{UI_1} \]
⊓⊔

Remark 3.50. (a) Let X ⊂ L^1(Ω, S, P). Set
\[ \chi(r) = \chi(r, \mathcal{X}) := \sup_{X\in\mathcal{X}} E\big[|X|\, I_{\{|X|\ge r\}}\big]. \]
Then X is uniformly integrable iff lim_{r→∞} χ(r) = 0.
(b) A uniformly integrable family X ⊂ L^1(Ω, S, P) is bounded in the L^1-norm, i.e., χ(0) < ∞. Indeed, for every X ∈ X and every r sufficiently large so that χ(r) < 1, we have
\[ E[|X|] = E\big[|X|\, I_{\{|X|<r\}}\big] + E\big[|X|\, I_{\{|X|\ge r\}}\big] \le r + \chi(r) < \infty. \]
⊓⊔

Theorem 3.51. Let X ⊂ L^1(Ω, S, P). Then the following statements are equivalent.

(i) X is uniformly integrable.
(ii) The family X is L^1-bounded and, for any ε > 0, there exists δ = δ(ε) > 0 such that, for all X ∈ X and any S ∈ S, we have
\[ P[S] \le \delta \ \Rightarrow\ E\big[|X|\, I_S\big] = \int_S |X(\omega)|\, P(d\omega) < \varepsilon. \tag{UI_2} \]

Proof. (i) ⇒ (ii) Fix ε > 0. There exists r_ε > 0 such that χ(r_ε) < ε/2. Now fix δ > 0 such that δ r_ε < ε/2. Then, for any X ∈ X and any S ∈ S such that P[S] < δ, we have
\[ E\big[|X|\, I_S\big] = E\big[|X|\, I_{S\cap\{|X|<r_\varepsilon\}}\big] + E\big[|X|\, I_{S\cap\{|X|\ge r_\varepsilon\}}\big] \le r_\varepsilon P[S] + E\big[|X|\, I_{\{|X|\ge r_\varepsilon\}}\big] \le \delta r_\varepsilon + \chi(r_\varepsilon) < \varepsilon. \]

(ii) ⇒ (i) Set
\[ B := \sup_{X\in\mathcal{X}} E\big[|X|\big] < \infty. \]
Markov's inequality implies that for r > 0 we have
\[ P\big[|X| > r\big] \le \frac{B}{r}, \quad \forall X \in \mathcal{X}. \]
Fix ε > 0 and r_ε > 0 such that B/r_ε < δ(ε). Then P[|X| > r_ε] < δ(ε), for all X ∈ X. Assumption (ii) implies
\[ \chi(r_\varepsilon) = \sup_{X\in\mathcal{X}} E\big[|X|\, I_{\{|X|>r_\varepsilon\}}\big] < \varepsilon. \]
⊓⊔

Remark 3.52. We should draw attention to the qualitatively different conditions


(UI1 ) and (UI2 ). Condition (UI1 ) involves only the probability distributions of
the random variables X ∈ X with no mention of the probability space on which
they are defined, whereas condition (UI2 ) makes explicit reference to their domain
of definition (Ω, S, P). It is related to the absolute continuity condition in real
analysis. t
u

Corollary 3.53. Let X ⊂ L^1(Ω, S, P) be a family of random variables such that there exists Z ∈ L^1(Ω, S, P) with the property
\[ |X| \le |Z| \ \text{a.s.,} \quad \forall X \in \mathcal{X}. \]
Then X is UI.

Proof. The family X satisfies condition (ii) of Theorem 3.51. ⊓⊔

Theorem 3.54. Let X ⊂ L^1(Ω, S, P). Then the following statements are equivalent.

(i) X is UI.
(ii)
\[ \lim_{r\to\infty} \sup_{X\in\mathcal{X}} \int_r^\infty P\big[|X| > x\big]\, dx = 0. \]
(iii) There exists a convex increasing function f : [0,∞) → [0,∞) which is also superlinear,
\[ \lim_{r\to\infty} \frac{f(r)}{r} = \infty, \]
and satisfies
\[ \sup_{X\in\mathcal{X}} E\big[f(|X|)\big] < \infty. \tag{3.2.8} \]

Proof. (i) ⇐⇒ (ii) Proposition 1.126 shows that
\[ \int_r^\infty P\big[|X| > x\big]\, dx = E\big[|X|\, I_{\{|X|>r\}}\big], \quad \forall X \in \mathcal{X}. \]
(ii) ⇒ (iii) Set
\[ h(r) := \sup_{X\in\mathcal{X}} \int_r^\infty P\big[|X| > x\big]\, dx. \]
Note that h(0) ≤ r + h(r) < ∞. Since h(r) = o(1) as r → ∞, we can find 0 = r_0 < r_1 < r_2 < ⋯ such that
\[ h(r_n) \le \frac{h(0)}{2^n}, \quad \forall n \in \mathbb{N}. \]
Now define
\[ g(r) := \sum_{n\ge 0} I_{[r_n,\infty)}(r), \quad f(x) = \int_0^x g(r)\, dr. \]
Note that g(r) is nondecreasing and lim_{r→∞} g(r) = ∞. This shows that f is increasing, convex and superlinear. Using the Fubini-Tonelli theorem as in the proof of Proposition 1.126 we deduce
\[ E\big[f(|X|)\big] = E\Big[\int_0^{|X|} g(r)\, dr\Big] = E\Big[\sum_{n\ge 0} \int_{r_n}^\infty I_{\{|X|>x\}}\, dx\Big] \le \sum_{n\ge 0} h(r_n) \le h(0) \sum_{n\ge 0} \frac{1}{2^n}. \]
(iii) ⇒ (i) For every n ∈ N there exists r_n > 0 such that
\[ x > r_n \ \Rightarrow\ x < \frac{f(x)}{n}. \]
We deduce that for any X ∈ X we have
\[ E\big[|X|\, I_{\{|X|>r_n\}}\big] \le \frac{1}{n} E\big[f(|X|)\, I_{\{|X|>r_n\}}\big] \le \frac{1}{n} E\big[f(|X|)\big]. \]
The conclusion now follows from (3.2.8). ⊓⊔

If in the above theorem we choose f(r) = r^p, p > 1, we obtain the following
result.

Corollary 3.55. Let X ⊂ L1(Ω, S, P) be a family of random variables such that
there exists p ∈ (1, ∞) with the property

    sup_{X∈X} E[ |X|^p ] < ∞.

Then X is UI.                                                                 □
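For readers who want to see Corollary 3.55 numerically, here is a small Python sketch (our illustration, not part of the text; the two families below are ad-hoc choices). It estimates the tail expectations E[ |X| I_{|X|>M} ] by Monte Carlo, first for a family that is bounded in L1 but not UI, then for a family bounded in L2, which the corollary guarantees is UI.

```python
import random
random.seed(0)

def sample_X(n):
    # X_n = n with probability 1/n, else 0: E|X_n| = 1 for every n, so the
    # family is bounded in L^1, but all of its mass escapes to infinity
    return float(n) if random.random() < 1.0 / n else 0.0

def sample_Y(n):
    # Y_n = sqrt(n) with probability 1/n: E[Y_n^2] = 1, so the family is
    # bounded in L^2 and Corollary 3.55 applies
    return n ** 0.5 if random.random() < 1.0 / n else 0.0

def tail(sampler, n, M, trials=200_000):
    # Monte Carlo estimate of E[ |X| I_{|X|>M} ]
    return sum(v for v in (sampler(n) for _ in range(trials)) if v > M) / trials

ns = (10, 100, 1_000, 10_000)
sup_tail_X = max(tail(sample_X, n, 50) for n in ns)  # stays near 1: not UI
sup_tail_Y = max(tail(sample_Y, n, 50) for n in ns)  # small: the family is UI
print(sup_tail_X, sup_tail_Y)
```

The first supremum stalls near 1 no matter how large the cutoff M is, while the second is at most E[Y_n²]/M = 1/M, which is the L²-bound at work.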

Corollary 3.56. Let X ∈ L1(Ω, F, P) and suppose that (Fi)i∈I is a family of sigma
subalgebras. Set Xi := E[ X ‖ Fi ], i ∈ I. Then the family (Xi)i∈I is UI.

Proof. Lemma 3.48 shows that the family {X} consisting of the integrable random
variable X is uniformly integrable. Theorem 3.54 implies that there exists a super-
linear, convex, increasing function f : [0, ∞) → [0, ∞) such that E[ f(|X|) ] < ∞.
From the conditional Jensen inequality in Theorem 1.166(ix) we deduce that

    |Xi| = | E[ X ‖ Fi ] | ≤ E[ |X| ‖ Fi ].

Since f is increasing and convex we deduce that

    f( |Xi| ) ≤ f( E[ |X| ‖ Fi ] ) ≤ E[ f(|X|) ‖ Fi ].

Taking the expectations of both sides of this inequality we deduce

    E[ f(|Xi|) ] ≤ E[ f(|X|) ], ∀i ∈ I.

Using Theorem 3.54 again we deduce that the family (Xi)i∈I is UI.             □

The next result clarifies the importance of the uniform integrability condition.

Theorem 3.57. Consider a sequence (Xn) in L1(Ω, S, P) that converges in proba-
bility to X. Then the following statements are equivalent.

(i) The sequence (Xn) is UI.
(ii) The limit X is integrable and the sequence (Xn) converges to X in the L1-norm.
(iii) The limit X is integrable and

    lim_{n→∞} E[ |Xn| ] = E[ |X| ].

Proof. We follow the approach in [53, Thm. 5.5.2].
(i) ⇒ (ii). For every M > 0 we define ΦM : R → R,

    ΦM(x) = M for x ≥ M,   ΦM(x) = x for |x| < M,   ΦM(x) = −M for x ≤ −M.

We have

    E[ |Xn − X| ] ≤ E[ |Xn − ΦM(Xn)| ] + E[ |ΦM(Xn) − ΦM(X)| ] + E[ |ΦM(X) − X| ]
        ≤ 2E[ |Xn| I_{|Xn|>M} ] + E[ |ΦM(Xn) − ΦM(X)| ] + 2E[ |X| I_{|X|>M} ].

The sequence (Xn) is uniformly integrable and Theorem 3.51 implies that
sup_n E[ |Xn| ] < ∞. Fatou's Lemma applied to an a.s. convergent subsequence of
(Xn) implies that X ∈ L1. We conclude that for any ε > 0 there exists M = M(ε) > 0
such that

    2E[ |Xn| I_{|Xn|>M} ] + 2E[ |X| I_{|X|>M} ] < ε/2, ∀n ∈ N.          (3.2.9)

From Corollary 1.145 we deduce that ΦM(ε)(Xn) converges to ΦM(ε)(X) in proba-
bility. Moreover,

    |ΦM(ε)(Xn)| ≤ M(ε), ∀n ∈ N.

The Bounded Convergence Theorem 1.153 implies that there exists n = n(ε) > 0
such that for any n ≥ n(ε) we have

    E[ |ΦM(ε)(Xn) − ΦM(ε)(X)| ] < ε/2.

From (3.2.9) we deduce that E[ |Xn − X| ] < ε for n > n(ε).
Clearly (ii) ⇒ (iii) since Xn → X in L1 implies ‖Xn‖_{L1} → ‖X‖_{L1}.
(iii) ⇒ (i) For any M > 0 consider the continuous function ΨM : [0, ∞) → R,

    ΨM(x) = x for x ∈ [0, M − 1],   ΨM(x) = 0 for x ≥ M,   ΨM linear on (M − 1, M).

The Dominated Convergence Theorem implies that ΨM(|X|) converges to |X| in
L1 as M → ∞. Thus, there exists M = M(ε) such that

    E[ |X| ] − E[ ΨM(|X|) ] < ε/2, ∀M ≥ M(ε).                          (3.2.10)

Using the Bounded Convergence Theorem as in the proof of the implication (i) ⇒
(ii) we deduce that

    E[ ΨM(|Xn|) ] → E[ ΨM(|X|) ], ∀M > 0.                              (3.2.11)

Thus, for any n ∈ N we have

    E[ |Xn| I_{|Xn|>M(ε)} ] ≤ E[ |Xn| ] − E[ ΨM(ε)(|Xn|) ]
        = ( E[ |Xn| ] − E[ |X| ] )
          + ( E[ |X| ] − E[ ΨM(ε)(|X|) ] ) + ( E[ ΨM(ε)(|X|) ] − E[ ΨM(ε)(|Xn|) ] )
        < | E[ |Xn| ] − E[ |X| ] | + | E[ ΨM(ε)(|X|) ] − E[ ΨM(ε)(|Xn|) ] | + ε/2,

where at the last step we used (3.2.10). We can choose n = n(ε, M(ε)) so that for
n > n(ε) we have

    | E[ |Xn| ] − E[ |X| ] | + | E[ ΨM(ε)(|X|) ] − E[ ΨM(ε)(|Xn|) ] | < ε/2.

Hence for any M ≥ M(ε)

    sup_{n>n(ε)} E[ |Xn| I_{|Xn|>M} ] ≤ sup_{n>n(ε)} E[ |Xn| I_{|Xn|>M(ε)} ] < ε.

Now choose M1 > M(ε) such that

    E[ |Xn| I_{|Xn|>M1} ] < ε, ∀n = 1, 2, . . . , n(ε).

Hence for M ≥ M1 we have

    sup_{n∈N} E[ |Xn| I_{|Xn|>M} ] < ε.

Thus (Xn) is uniformly integrable.                                            □

Remark 3.58. (a) The implication (iii) ⇒ (ii) is sometimes referred to as Scheffé's
Lemma.
(b) We used the Bounded Convergence Theorem to prove the implication (i) ⇒
(ii). Obviously the Bounded Convergence Theorem is a special case of this impli-
cation. One can prove the equivalence (i) ⇐⇒ (ii) without relying on the Bounded
Convergence Theorem; see [50, Thm. 10.3.6].
(c) The sequence in Example 2.27 converges in law and is uniformly integrable, yet
it does not converge in probability. This shows that in the above theorem we cannot
relax the convergence-in-probability condition to convergence in law.         □
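The dichotomy in Theorem 3.57 is easy to observe experimentally. In the Python sketch below (our illustration, with ad-hoc sequences), both sequences tend to 0 in probability, but only the second satisfies E[|Xn|] → 0 = E[|0|]; the theorem then explains why only the second converges in L1.

```python
import random
random.seed(7)

def mc_mean(sampler, n, trials=200_000):
    # Monte Carlo estimate of E|X_n|
    return sum(sampler(n) for _ in range(trials)) / trials

def A(n):
    # A_n = n·1_{U<1/n}: A_n → 0 in probability, yet E|A_n| = 1 for all n,
    # so condition (iii) fails; by Theorem 3.57, (A_n) is not UI and A_n
    # does not converge to 0 in L¹
    return float(n) if random.random() < 1.0 / n else 0.0

def B(n):
    # B_n = 1_{U<1/n}: B_n → 0 in probability and E|B_n| = 1/n → 0,
    # so (iii) holds and Theorem 3.57 upgrades this to L¹ convergence
    return 1.0 if random.random() < 1.0 / n else 0.0

a_means = [mc_mean(A, n) for n in (10, 100, 1000)]
b_means = [mc_mean(B, n) for n in (10, 100, 1000)]
print(a_means, b_means)
```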

3.2.3 Uniformly integrable martingales


We can now formulate and prove a refinement of Theorem 3.41.

Theorem 3.59. Suppose that (Xn)n∈N0 is a martingale adapted to the filtration
(Fn)n≥0 of (Ω, F, P). Set

    F∞ := ⋁_{n≥0} Fn = σ( Fn , n ≥ 0 ).

The following are equivalent.

(i) The collection (Xn)n∈N0 is UI.
(ii) The sequence (Xn)n∈N0 converges a.s. and L1 to a random variable X∞.
(iii) The sequence (Xn)n∈N0 converges L1 to a random variable X∞.
(iv) There exists an integrable random variable X such that

    Xn = E[ X ‖ Fn ], ∀n ∈ N0.

If the above conditions are satisfied, then the limiting random variable X∞
in (ii) and (iii) is related to the random variable X in (iv) via the equality
X∞ = E[ X ‖ F∞ ], i.e.,

    lim_{n→∞} E[ X ‖ Fn ] = E[ X ‖ F∞ ]

a.s. and L1.

Proof. Note that if a martingale (Xn) is UI, then it is bounded in L1 and, according
to Theorem 3.41, converges a.s. to an integrable random variable X∞. In view of
the previous discussion the statements (i)-(iii) are equivalent. The implication (iv)
⇒ (i) follows from Corollary 3.56. The only thing left to prove is (iii) ⇒ (iv).
More precisely, we will show that if Xn → X∞ in L1, then

    Xn = E[ X∞ ‖ Fn ], a.s., ∀n ∈ N0.

In other words, we have to show that, for all m ∈ N0 and all A ∈ Fm, we have

    E[ Xm I_A ] = E[ X∞ I_A ].

Since (Xn) is a martingale we deduce that, for n > m, we have

    E[ Xm I_A ] = E[ E[ Xn ‖ Fm ] I_A ] = E[ E[ Xn I_A ‖ Fm ] ] = E[ Xn I_A ].

Now let n → ∞.
Suppose now that for some integrable random variable X we have
Xn = E[ X ‖ Fn ]. We want to show that

    lim_n Xn = X∞ := E[ X ‖ F∞ ],

i.e., for any F ∈ F∞ we have

    E[ X∞ I_F ] = E[ X I_F ].

Denote by Z ⊂ F∞ the collection of F ∈ F∞ for which the above holds. Clearly
Fn ⊂ Z for every n. Moreover, Z is a λ-system and contains the π-system

    ⋃_{n≥0} Fn.

Thus it contains F∞, the σ-algebra generated by this system.                  □

Theorem 3.59 implies that

    lim_{n→∞} E[ X ‖ Fn ] = E[ X ‖ F∞ ]  a.s. and L1, ∀X ∈ L1(Ω, F, P).    (3.2.12)

In particular, we deduce

Corollary 3.60 (Lévy's 0-1 law). For any set A ∈ F∞, the random variables

    E[ I_A ‖ Fn ], n ∈ N,

converge a.s. and L1 to I_A as n → ∞.                                         □
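A concrete closed martingale on which (3.2.12) can be watched: take U uniform on [0, 1] and let Fn be the σ-algebra generated by the first n binary digits of U, so that E[ U ‖ Fn ] is the midpoint of the dyadic interval of length 2^{-n} containing U. The Python sketch below (our illustration, under these assumptions) estimates the L1-distance to the limit.

```python
import random
random.seed(1)

def cond_exp_U(u, n):
    # E[ U | F_n ] where F_n = σ(first n binary digits of U): conditionally
    # on F_n, U is uniform on the dyadic interval [k/2^n, (k+1)/2^n)
    # containing u, so the conditional expectation is its midpoint
    k = int(u * 2 ** n)
    return (k + 0.5) / 2 ** n

us = [random.random() for _ in range(10_000)]
l1_errors = []
for n in (1, 4, 8, 16):
    l1_errors.append(sum(abs(cond_exp_U(u, n) - u) for u in us) / len(us))
print(l1_errors)  # the exact L¹-error at level n is 2^{-(n+2)}
```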

Corollary 3.61 (Kolmogorov's 0-1 law). Suppose that G1, G2, . . . are indepen-
dent σ-subalgebras of F. We set

    Tn := σ( Gn+1, Gn+2, . . . )

and form the tail σ-algebra

    T∞ = ⋂_{n≥1} Tn.

Then T∞ is a 0-1 sigma-algebra,

    H ∈ T∞ ⇒ P[ H ] ∈ {0, 1}.

Proof. Define Fn := σ( G1, . . . , Gn ) and let H ∈ T∞ ⊂ F∞. By Lévy's 0-1 law we
have

    E[ I_H ‖ Fn ] → I_H a.s.

On the other hand, H ∈ Tn and Tn ⊥⊥ Fn, so that E[ I_H ‖ Fn ] = P[ H ], and thus
P[ H ] = I_H a.s.                                                             □

Example 3.62. Consider again the Galton-Watson branching process in Exam-
ple 3.8. Suppose that the reproduction law µ satisfies

    m := Σ_{n≥0} n µ(n) < ∞.

Assume m ≥ 1. Consider the extinction event defined in Example 3.45,

    E := ⋃_{n≥0} En,   En := { Zk = 0, ∀k ≥ n }.

Consider next the event

    U := { sup_n Zn = ∞ }.

We want to prove that if the probability that an individual has no successor is
positive, then, with probability 1, the population either extinguishes in finite time
or explodes. In particular, it cannot stabilize to a finite nonzero limit. More
precisely, we have the following dichotomy result.

    If p0 = µ(0) > 0, then, with probability 1, the population either becomes
    extinct or explodes, i.e.,

        E = U^c,  P[ E ∪ U ] = 1.                                      (3.2.13)

    In particular,

        E = { lim_n Zn = 0 }.                                          (3.2.14)

Note that

    ∀ν ∈ N0, ∃δ(ν) ∈ (0, 1) : ∀n ∈ N,
    P[ E ‖ Z1, . . . , Zn ] ≥ δ(ν) on {Zn ≤ ν}.                        (3.2.15)

Indeed, if the population of the n-th generation has at most ν individuals, then the
probability that there will be no (n + 1)-th generation is at least p0^ν. More
formally,

    P[ E ‖ Z1, . . . , Zn ] ≥ P[ En+1 ‖ Z1, . . . , Zn ] = P[ En+1 ‖ Zn ].

We have

    P[ En+1 ‖ Zn ] I_{Zn≤ν} = Σ_{k=0}^{ν} P[ En+1 ‖ Zn = k ] I_{Zn=k}
        = Σ_{k=0}^{ν} p0^k I_{Zn=k} ≥ p0^ν I_{Zn≤ν}.

This proves (3.2.15) with δ(ν) = p0^ν. Since the Zn are integer valued, we deduce
that E = { lim_n Zn = 0 }, i.e., (3.2.14).

Lemma 3.63. Suppose that (Zn)n≥1 is a sequence of nonnegative random variables.
Set

    E := { Zn = 0 for some n },   B := { sup_n Zn < ∞ }.

If (Zn) satisfies (3.2.15), then

    E ⊃ B.                                                             (3.2.16)

Proof. Set Fn := σ(Z1, . . . , Zn), Bν := { sup_n Zn ≤ ν }, so that

    B1 ⊂ B2 ⊂ · · · ,   ⋃_ν Bν = B.

We have E[ I_E ‖ Fn ] ≥ δ(ν) on Bν. Letting n → ∞ we deduce from Lévy's 0-1
theorem (Corollary 3.60) that

    lim_{n→∞} E[ I_E ‖ Fn ] = I_E.

Hence Bν ⊂ E for any ν. Hence B ⊂ E.                                          □

In our special case, B = U^c. Note also that if the population dies at a time n0,
then Zn = 0, ∀n ≥ n0. Hence E ⊂ B or, in view of (3.2.16), E = B = U^c.
This proves the claimed dichotomy (3.2.13).
When m = 1, Wn = Zn converges almost surely to an integrable random
variable and we see that

    { lim_n Zn < ∞ } ⊂ { sup_n Zn < ∞ } ⊂ E,

and we deduce that

    1 ≥ P[ E ] ≥ P[ lim_n Zn < ∞ ] = 1.

Thus, when m = 1 and the probability of having no successors is positive, i.e.,
µ(0) > 0, the extinction probability is also 1. One can show (see [6, Sec. I.9]
or [89]) that if m = 1 and

    σ² := Var[ Xn,j ] = Σ_k k(k − 1) µ(k) < ∞,

then

    lim_{n→∞} n P[ Zn > 0 ] = 2/σ².

Thus, the probability of the population surviving more than n generations, given
that individuals have on average 1 successor, is O(1/n).
When m > 1 the extinction probability is still positive but < 1. Exercise 3.26
describes this probability and gives additional information about the distribution
of W. For more details about branching processes we refer to [6; 80].         □
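The dichotomy (3.2.13) is easy to observe in simulation. The Python sketch below (ours; the offspring law is an ad-hoc choice, not from the text) uses µ(0) = 1/4, µ(1) = 1/4, µ(2) = 1/2, so m = 5/4 and p0 = 1/4 > 0; the extinction probability is the smallest root of s = 1/4 + s/4 + s²/2, namely s = 1/2. A population cap serves as a crude proxy for the explosion event {sup_n Zn = ∞}.

```python
import random
random.seed(8)

def next_gen(z):
    # each of the z individuals gets an i.i.d. µ-distributed number of
    # successors, µ = (1/4, 1/4, 1/2) on {0, 1, 2}
    return sum(random.choices((0, 1, 2), weights=(1, 1, 2), k=z))

R, CAP, GENS = 1_000, 1_000, 300
extinct = exploded = 0
for _ in range(R):
    z = 1
    for _ in range(GENS):
        z = next_gen(z)
        if z == 0 or z > CAP:
            break
    if z == 0:
        extinct += 1
    elif z > CAP:
        exploded += 1  # crude proxy for explosion: the population passed the cap
p_ext, p_exp = extinct / R, exploded / R
print(p_ext, p_exp, p_ext + p_exp)
```

Virtually every run ends in one of the two bins, and the extinction frequency hovers near the theoretical value 1/2; intermediate outcomes, populations that neither die nor pass the cap, are vanishingly rare, as (3.2.13) predicts.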

Suppose that (Xn)n∈N0 is a process adapted to a filtration F• such that Xn
converges a.s. to a random variable X∞ as n → ∞. If T is a stopping time adapted
to the same filtration, finite or not, we set

    X̂T := Σ_{n∈N0} Xn I_{T=n} + X∞ I_{T=∞} = XT + X∞ I_{T=∞}.         (3.2.17)

Note that

    P[ T = ∞ ] = 0 ⇒ X̂T = XT a.s.,

and

    X̂T = X^T_∞ := lim_{n→∞} XT∧n = lim_{n→∞} X^T_n.                   (3.2.18)

Theorem 3.64 (Optional sampling: UI martingales). Suppose that
X• = (Xn)n∈N0 is a UI martingale and T is a stopping time,
not necessarily a.s. finite. Then X̂T ∈ L1 and

    X̂T = E[ X∞ ‖ FT ].                                                (3.2.19)

Moreover, if S, T are stopping times such that S ≤ T, then

    E[ X̂T ‖ FS ] = X̂S.                                               (3.2.20)

Proof. Let us first prove that X̂T ∈ L1. We have

    E[ |X̂T| ] ≤ Σ_{n≥0} E[ I_{T=n} |Xn| ] + E[ I_{T=∞} |X∞| ]
        = Σ_{n≥0} E[ I_{T=n} | E[ X∞ ‖ Fn ] | ] + E[ I_{T=∞} |X∞| ]

    (use | E[ X ‖ F ] | ≤ E[ |X| ‖ F ])

        ≤ Σ_{n≥0} E[ I_{T=n} E[ |X∞| ‖ Fn ] ] + E[ I_{T=∞} |X∞| ]

    (use the definition of conditional expectation and {T = n} ∈ Fn)

        = Σ_{n≥0} E[ I_{T=n} |X∞| ] + E[ I_{T=∞} |X∞| ] = E[ |X∞| ] < ∞.

Moreover, for A ∈ FT we have

    E[ I_A X̂T ] = Σ_{n∈N0} E[ I_{A∩{T=n}} Xn ] + E[ I_{A∩{T=∞}} X∞ ]

    (use I_{A∩{T=n}} Xn = E[ I_{A∩{T=n}} X∞ ‖ Fn ])

        = Σ_{n∈N0} E[ I_{A∩{T=n}} X∞ ] + E[ I_{A∩{T=∞}} X∞ ] = E[ I_A X∞ ],

and thus X̂T = E[ X∞ ‖ FT ]. Since FS ⊂ FT, the equality (3.2.20) follows
immediately from (3.2.19) and the properties of conditional expectation.      □

Corollary 3.65 (Optional Stopping). Suppose that (Xn)n∈N0 is a UI martin-
gale and T is any stopping time. Then the stopped martingale X^T_n = XT∧n is also
a uniformly integrable martingale with respect to the filtration FT∧n.

Proof. From Theorem 3.64 we deduce that XT∧n = E[ X∞ ‖ FT∧n ], and Corol-
lary 3.56 implies that this family is UI.                                     □

Doob’s conditions in Definition 3.26 are closely related to uniform integrability.

Proposition 3.66. Suppose that (Xn )n≥0 is a martingale adapted to the filtration
(Fn )n≥0 and T is an a.s. finite stopping time adapted to the same filtration. Then
the following statements are equivalent.

(i) The stopping time satisfies Doob's conditions (3.1.7b) and (3.1.7c).
(ii) The stopped martingale X^T_n = XT∧n is UI.

Proof. (i) ⇒ (ii) Consider the submartingale |Xn|. Since T satisfies Doob's condi-
tions, we deduce from Theorem 3.28 that

    E[ |XT| ] ≥ E[ |XT∧n| ], ∀n ≥ 0.

Thus

    lim sup_{n→∞} E[ |XT∧n| ] ≤ E[ |XT| ].

Since lim_{n→∞} XT∧n = XT a.s., we deduce from Fatou's Lemma that

    E[ |XT| ] ≤ lim inf_{n→∞} E[ |XT∧n| ],

so that

    lim_{n→∞} E[ |XT∧n| ] = E[ |XT| ].

The desired conclusion now follows from Theorem 3.57.
(ii) ⇒ (i) Observe first that lim_{n→∞} XT∧n = XT and, since X^T_n is UI, we
deduce that XT is integrable. Now observe that

    E[ |Xn| I_{T>n} ] = E[ |XT∧n| I_{T>n} ].

Since P[ T < ∞ ] = 1 we deduce lim_{n→∞} P[ T > n ] = 0. Finally, using the fact
that the stopped martingale X^T_n is UI, we deduce

    lim_{n→∞} E[ |XT∧n| I_{T>n} ] = 0.                                        □

Corollary 3.67 (Optional Sampling Theorem). Suppose that (Xn)n≥0 is a
martingale adapted to the filtration (Fn), S ≤ T are stopping times adapted to the
same filtration, and T satisfies the Doob conditions (3.1.7a, 3.1.7b, 3.1.7c). Then

    E[ XT ‖ FS ] = XS.

Proof. Note that X^T is UI and, since X^S = (X^T)^S, we deduce from Theorem 3.64
that

    E[ XT ‖ FS ] = E[ X^T_∞ ‖ FS ] = X^T_S = XS.                              □

3.2.4 Applications of the optional sampling theorem


Let us observe that the above discussion yields an alternate proof of the Optional
Sampling Theorem 3.28. We restate it below.

Corollary 3.68 (Optional Sampling Theorem). Suppose that (Xn)n≥0 is a
martingale adapted to the filtration (Fn)n≥0 and T is an a.s. finite stopping time
such that the stopped martingale X^T_n = Xn∧T is UI, i.e., T satisfies Doob's
conditions. Then E[ XT ] = E[ X0 ].                                           □

The Optional Sampling Theorem is a versatile tool for computing expectations.


Its applicability is greatly enhanced once we have simple criteria for recognizing
when a stopped martingale is UI. We have the following result of J. L. Doob, [47,
Thm. VII.2.2].

Proposition 3.69. Suppose that (Mn)n≥0 is a random process adapted to the fil-
tration (Fn)n≥0 such that

    E[ |Mn| ] < ∞, ∀n,

and T is a stopping time adapted to the same filtration. Suppose that

    E[ T ] < ∞,                                                       (3.2.21a)

    ∃C > 0 : ∀n ∈ N,  E[ |Mn − Mn−1| ‖ Fn−1 ] ≤ C.                    (3.2.21b)

Then the stopped process M^T_n = MT∧n is UI.

Proof. We will show that there exists Y ∈ L1(Ω, F, P) such that

    |MT∧n| ≤ Y, ∀n ∈ N.

Note that

    MT∧n = Σ_{k=0}^{n−1} Mk I_{T=k} + Mn I_{T≥n}
         = Σ_{k=0}^{n−1} Mk ( I_{T≥k} − I_{T≥k+1} ) + Mn I_{T≥n}
         = M0 + Σ_{k=1}^{n} ( Mk − Mk−1 ) I_{T≥k},

so

    |MT∧n| ≤ |M0| + Σ_{k=1}^{n} |Mk − Mk−1| I_{T≥k}.

Set

    Y := |M0| + Σ_{k≥1} |Mk − Mk−1| I_{T≥k}.

 
Clearly |MT∧n| ≤ Y, ∀n ≥ 0. We will show that E[ Y ] < ∞. We have

    E[ |Mk − Mk−1| I_{T≥k} ] = E[ E[ |Mk − Mk−1| I_{T≥k} ‖ Fk−1 ] ]

    ({T ≥ k} ∈ Fk−1)

        = E[ I_{T≥k} E[ |Mk − Mk−1| ‖ Fk−1 ] ] ≤ C E[ I_{T≥k} ] = C P[ T ≥ k ],

where we used (3.2.21b) at the last step. Thus

    E[ Y ] ≤ E[ |M0| ] + C Σ_{k≥1} P[ T ≥ k ] = E[ |M0| ] + C E[ T ] < ∞,

by (3.2.21a).                                                                 □

Theorem 3.70 (Wald's formula). Suppose that (Yn)n≥1 is a sequence of i.i.d.
integrable random variables with finite mean µ. Set

    Sn := Σ_{k=1}^{n} Yk.

Let T be a stopping time adapted to the filtration Fn = σ(Y1, . . . , Yn) and such
that E[ T ] < ∞. The following hold.

(i) E[ ST ] = µ E[ T ].
(ii) Suppose additionally that

    Yn ∈ L2,  µ = 0,  σ² = Var[ Yn ].

Then Var[ ST ] = σ² E[ T ].

Proof. (i) Set Ȳn := Yn − µ and

    Mn := Sn − nµ = Σ_{k=1}^{n} Ȳk.

Then

    E[ Mn ‖ Fn−1 ] = E[ Ȳn + Mn−1 ‖ Fn−1 ]
        = E[ Ȳn ‖ Fn−1 ] + E[ Mn−1 ‖ Fn−1 ] = E[ Ȳn ] + Mn−1 = Mn−1.

Observe that

    E[ |Mn − Mn−1| ‖ Fn−1 ] = E[ |Ȳn| ‖ Fn−1 ] = E[ |Ȳn| ] = E[ |Ȳ1| ],

so that (3.2.21b) is satisfied. We deduce from Proposition 3.69 that the stopped
martingale M^T_n is UI and the Optional Sampling Theorem implies

    0 = E[ M0 ] = E[ MT ] = E[ ST ] − µ E[ T ].

     
(ii) From (i) we deduce E[ ST ] = 0, so Var[ ST ] = E[ S²_T ]. Set

    Qn := Σ_{k=1}^{n} Y²_k.

We have

    E[ S²_n ] = Σ_{k=1}^{n} E[ Y²_k ] + 2 Σ_{1≤i<j≤n} E[ Yi Yj ] = E[ Qn ].

As in (i) we observe that Zn = Qn − nσ² is a martingale adapted to the filtration
Fn, the increments Qn − Qn−1 are independent of Fn−1, and

    E[ |Zn − Zn−1| ‖ Fn−1 ] = E[ |Zn − Zn−1| ] ≤ E[ Y²_n ] + σ² = 2σ².

We deduce from Proposition 3.69 that the stopped martingale Z^T_n is UI and the
Optional Sampling Theorem implies

    0 = E[ Z0 ] = E[ ZT ] = E[ QT ] − σ² E[ T ]
      = E[ S²_T ] − σ² E[ T ] = Var[ ST ] − σ² E[ T ].                        □
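Both identities of Theorem 3.70 are easy to test by Monte Carlo. In the Python sketch below (our illustration; the step distributions and stopping times are ad-hoc choices) the first experiment has µ = 2 with T the first time the sum reaches 20, and the second has centered ±1 steps with T the first exit from (−5, 5).

```python
import random
random.seed(2)

def hit_level(target=20):
    # Y_k uniform on {1, 2, 3}: µ = 2. T = min{n : S_n ≥ target} is a
    # stopping time with E[T] ≤ target < ∞.
    s = n = 0
    while s < target:
        s += random.choice((1, 2, 3)); n += 1
    return s, n

def hit_boundary(a=5):
    # Y_k = ±1 fair: µ = 0, σ² = 1. T = min{n : |S_n| = a}, with E[T] = a².
    s = n = 0
    while abs(s) < a:
        s += random.choice((-1, 1)); n += 1
    return s, n

R = 40_000
pos = [hit_level() for _ in range(R)]
mean_ST, mean_T1 = sum(s for s, _ in pos) / R, sum(n for _, n in pos) / R
ctr = [hit_boundary() for _ in range(R)]
meansq_ST, mean_T2 = sum(s * s for s, _ in ctr) / R, sum(n for _, n in ctr) / R
print(mean_ST, 2 * mean_T1)  # Wald (i):  E[S_T] = µ·E[T]
print(meansq_ST, mean_T2)    # Wald (ii): Var[S_T] = σ²·E[T]
```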

Remark 3.71. In Exercise 1.16 we described a version (1.6.1) of Wald's formula
that has a different nature than the one presented in Theorem 3.70. The random
time T in (1.6.1) is independent of the variables Xn, and the proof of (1.6.1) is a
simple exercise in conditioning.
In Theorem 3.70 the random time T is quite dependent on these variables, given
that it is adapted to the filtration Fn = σ(X1, . . . , Xn), and the proof of the corre-
sponding version of Wald's formula required the machinery of martingale theory.
We want to point out that without some assumptions on T we cannot expect
the equality E[ ST ] = µ E[ T ] to hold. Here is an example.
Suppose that the random variables Xn are exponential with parameter λ. For
fixed t > 0 we set

    N(t) := max{ n ≥ 0; Sn ≤ t }.

The collection (N(t))t>0 is the Poisson process introduced in Example 1.136. Thus,
N(t) ∼ Poi(λt), so that

    E[ N(t) ] = λt.

In this case

    µ = E[ Exp(λ) ] = 1/λ.

For fixed t, the random variable N(t) is not a stopping time adapted to the
filtration Fn = σ(X1, . . . , Xn). Indeed, knowing S1, . . . , Sn, we cannot conclude
that

Sn+1 > t, i.e., that n is the largest index k such that Sk ≤ t. If Wald's for-
mula were true in this case, it would predict E[ SN(t) ] = t. However, we know from
(1.3.50) that

    E[ SN(t) ] = t − 1/λ + e^{−λt}/λ.

Let us observe that T = N(t) + 1 is a stopping time adapted to the filtration Fn.
Indeed,

    T = n ⇐⇒ N(t) = n − 1
      ⇐⇒ X1 + · · · + Xn−1 ≤ t and X1 + · · · + Xn > t,

so that {T = n} ∈ σ(X1, . . . , Xn). Wald's formula implies

    E[ SN(t)+1 ] = E[ N(t) + 1 ] · E[ X1 ] = (λt + 1)/λ = t + 1/λ.

This agrees with our earlier conclusion (1.3.51).                             □
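The two expectations appearing in this remark can be checked numerically. The Python sketch below (ours; λ and t are ad-hoc choices) simulates the renewal times of a Poisson process with λ = 2 up to t = 3 and compares E[S_{N(t)}] with the formula from (1.3.50), and E[S_{N(t)+1}] with the value t + 1/λ predicted by Wald's formula.

```python
import math, random
random.seed(3)

lam, t, R = 2.0, 3.0, 100_000
sum_SN = sum_SN1 = 0.0
for _ in range(R):
    s = 0.0
    while True:
        x = random.expovariate(lam)  # i.i.d. Exp(λ) inter-arrival times
        if s + x > t:                # the next arrival would land past t
            sum_SN += s              # S_{N(t)}: last arrival time before t
            sum_SN1 += s + x         # S_{N(t)+1}: first arrival time after t
            break
        s += x
print(sum_SN / R,  t - 1 / lam + math.exp(-lam * t) / lam)  # compare with (1.3.50)
print(sum_SN1 / R, t + 1 / lam)                             # Wald with T = N(t)+1
```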

Example 3.72 (Gambler's Ruin). Suppose that

    Xn : (Ω, F, P) → {−1, 1}, n ∈ N,

is a sequence of i.i.d. random variables with common probability distribution

    P[ Xn = 1 ] = p,  P[ Xn = −1 ] = q = 1 − p,  p ∈ (0, 1).

Fix k ∈ N and set

    S0 := k,  Sn := k + X1 + · · · + Xn, ∀n ∈ N.

Intuitively, k is the initial fortune of a gambler who plays a sequence of independent
games where he wins $1 with probability p and loses $1 with probability q. Then
Sn is the fortune of the gambler after n games. The game stops when the gambler
is out of money, or his fortune reaches a prescribed threshold N > k.
The sequence (Sn)n∈N is a random process adapted to the filtration
Fn = σ(X1, . . . , Xn). The random variable

    T = Tk := min{ n ∈ N; Sn ∈ {0, N} }                                (3.2.22)

is a stopping time adapted to this filtration. It is the moment the gambler stops
playing. The 'sooner-rather-than-later' Lemma 3.32 implies that E[ T ] < ∞ since
T satisfies (3.1.9):

    ∀n ∈ N0,  P[ T ≤ n + N ‖ Fn ] > r^N,  r = min(p, q).

In particular, P[ T < ∞ ] = 1. We want to compute pk(N) := P[ ST = N ]. We
distinguish two cases.
A. p = 1/2, so that the game is fair. Then Sn is a martingale. Consider now
the stopped process S^T. It is a UI martingale since it is uniformly bounded. We
deduce from the Optional Sampling Theorem that

    k = E[ S0 ] = E[ ST ] = pk(N) · N  ⇒  pk(N) = k/N.

Thus, the ruin probability is 1 − pk(N) = (N − k)/N = 1 − k/N. In Example A.19
we describe R codes simulating this situation.
B. p ≠ 1/2, so the game is biased. Consider De Moivre's martingale Mn defined in
Example 3.7, i.e.,

    Mn = (q/p)^{Sn}.

The stopped martingale M^T is UI since it is bounded. Hence

    E[ MT ] = E[ M0 ] = (q/p)^k.

If we set pk(N) := P[ ST = N ], then we deduce

    (q/p)^k = P[ ST = 0 ] (q/p)^0 + P[ ST = N ] (q/p)^N
            = ( 1 − pk(N) ) + pk(N) (q/p)^N.

Hence

    pk(N) = ( (q/p)^k − 1 ) / ( (q/p)^N − 1 ).                                □
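Both formulas for pk(N) are easy to confirm by simulation. The Python sketch below is our analogue of the R code mentioned in Example A.19, not a copy of it; it estimates the probability of reaching the goal for the fair game and for roulette odds.

```python
import random
random.seed(4)

def reaches_goal(k, N, p):
    # one play of the game: start at k, stop at 0 (ruin) or N (goal)
    s = k
    while 0 < s < N:
        s += 1 if random.random() < p else -1
    return s == N

k, N, R = 3, 10, 50_000
est_fair = sum(reaches_goal(k, N, 0.5) for _ in range(R)) / R

p = 18 / 38                  # betting on black at roulette
r = (1 - p) / p              # the ratio q/p
est_biased = sum(reaches_goal(k, N, p) for _ in range(R)) / R
exact_biased = (r ** k - 1) / (r ** N - 1)
print(est_fair, k / N)       # fair game: p_k(N) = k/N
print(est_biased, exact_biased)
```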

Example 3.73 (The coupon collector problem revisited). Let us recall the
coupon collector problem we discussed in Example 1.112.
Suppose that each box of cereal contains one of m different coupons. Once you
obtain one of every type of coupon, you can send in for a prize. Ann wants that
prize and, for that reason, she buys one box of cereal every day. Assuming that the
coupon in each box is chosen independently and uniformly at random from the m
possibilities, and that Ann does not collaborate with others to collect coupons, how
many boxes of cereal is she expected to buy before she obtains at least one of every
type of coupon?
Let N denote the number of boxes bought until Ann has at least one of every
coupon. We have shown in Example 1.112 that

    E[ N ] = m Hm,   Hm := 1 + 1/2 + · · · + 1/(m−1) + 1/m.

Suppose now that Ann has a little brother, Bob, and, every time she collects a
coupon she already has, she gives it to Bob. At the moment when she completes
her collection, Bob is missing B coupons. What is the expectation of B?
To answer this question we follow the approach in [61, Sec. 12.5]. Assume that
the coupons are labelled 1, . . . , m. We denote by Ck the label of the coupon Ann
found in the k-th box she bought. Thus (Ck)k≥1 are i.i.d., uniformly distributed in
{1, . . . , m}. We set

    Fn := σ( C1, . . . , Cn ), n ∈ N.

Denote by Xn the number of coupons Ann is missing after she has bought n cereal
boxes, and by Yn the number of coupons that have appeared exactly one time in
the first n boxes Ann bought.

From the equality

    N = min{ n ∈ N; Xn = 0 },

we deduce that N is a stopping time adapted to the filtration F•. Moreover,
YN = B, the number of coupons Bob is missing at the moment Ann completes her
collection.
Fix a function f : N²0 → [0, ∞) satisfying the difference equation

    x( f(x − 1, y + 1) − f(x, y) ) + y( f(x, y − 1) − f(x, y) ) = 0, ∀x, y ≥ 1,  (3.2.23)

and form the process Zn := f(Xn, Yn), n ∈ N.

Lemma 3.74. The process Z• is a martingale adapted to the filtration Fn.

Proof. We set ∆Zn := Zn+1 − Zn. Note that Zn is Fn-measurable, so we have to
show that

    E[ ∆Zn ‖ Fn ] = 0.                                                 (3.2.24)

Let us observe that, when Ann buys a new cereal box, there are only three mutually
exclusive possibilities:

    ∆Xn = −1,  ∆Yn = −1,  ∆Xn = ∆Yn = 0.

The first possibility corresponds to Ann obtaining a new coupon, the second possi-
bility corresponds to Bob obtaining a new coupon, and the third possibility occurs
when the (n + 1)-th coupon is owned by both Ann and Bob. Hence

    ∆Zn = I_{∆Xn=−1} ( f(Xn − 1, Yn + 1) − f(Xn, Yn) )
        + I_{∆Yn=−1} ( f(Xn, Yn − 1) − f(Xn, Yn) ).

Now observe that

    E[ I_{∆Xn=−1} ‖ Fn ] = Xn/m  and  E[ I_{∆Yn=−1} ‖ Fn ] = Yn/m.

To understand the first equality, observe that if Ann is missing Xn coupons at time
n, then the probability of getting a new one in the new box is Xn/m. The second
equality is proved in a similar fashion. Hence

    E[ ∆Zn ‖ Fn ] = (Xn/m)( f(Xn − 1, Yn + 1) − f(Xn, Yn) )
        + (Yn/m)( f(Xn, Yn − 1) − f(Xn, Yn) ),

which vanishes by (3.2.23).                                                   □

The martingale (Zn)n≥0 is bounded, so it is uniformly integrable, and we deduce
from the Optional Sampling Theorem that

    E[ f(0, YN) ] = E[ ZN ] = E[ Z1 ] = E[ f(X1, Y1) ] = E[ f(m, 0) ].

This holds for any function f satisfying (3.2.23). Now observe that the function

    f : N²0 → [0, ∞),   f(x, y) = Hx + y/(1 + x),

where

    H0 = 0,  Hx = 1 + 1/2 + · · · + 1/x, ∀x > 0,

satisfies (3.2.23), and since f(0, y) = y we conclude

    E[ YN ] = Hm ∼ log m as m → ∞.

For example, if m = 30, then Hm ≈ 3.99, so at the moment Ann has the complete
collection of 30 coupons, we expect that her little brother is missing only about 4
of them. Nearly there.                                                        □
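The identity E[ YN ] = Hm invites a quick simulation. The Python sketch below (our illustration) runs Ann's collection to completion many times, counts the coupons seen exactly once, and compares the average with H30 ≈ 3.995.

```python
import random
random.seed(5)

def bob_missing(m):
    # run Ann's collection to completion; Bob owns every coupon that Ann
    # received at least twice, so he is missing the singletons
    counts = [0] * m
    have = 0
    while have < m:
        c = random.randrange(m)
        if counts[c] == 0:
            have += 1
        counts[c] += 1
    return sum(1 for c in counts if c == 1)  # Y_N: coupons seen exactly once

m, R = 30, 20_000
est = sum(bob_missing(m) for _ in range(R)) / R
H_m = sum(1.0 / j for j in range(1, m + 1))  # H_30 ≈ 3.995
print(est, H_m)
```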

3.2.5 Uniformly integrable submartingales


The proof of Theorem 3.59 yields the following submartingale counterpart.

Theorem 3.75. If (Xn)n∈N0 is a submartingale, then the following are equivalent.

(i) The collection (Xn)n∈N0 is uniformly integrable.
(ii) The sequence (Xn)n∈N0 converges a.s. and L1 to a random variable X∞.
(iii) The sequence (Xn)n∈N0 converges L1 to a random variable X∞.             □

Corollary 3.76. Suppose that X• = (Xn)n∈N0 is a submartingale with Doob de-
composition Xn = X0 + Mn + Cn, where (Mn)n∈N0 is the martingale component
and (Cn)n∈N0 is the predictable compensator. Then the following are equivalent.

(i) The submartingale (Xn)n∈N0 is uniformly integrable.
(ii) The martingale (Mn)n∈N0 and the compensator (Cn)n∈N0 are uniformly
integrable.

Proof. Clearly (ii) ⇒ (i). To prove the converse note that

    E[ |Cn| ] = E[ Cn ] = E[ Xn ] − E[ X0 ],

and since (Xn) is uniformly integrable we deduce

    sup_n E[ |Cn| ] ≤ sup_n E[ |Xn| ] − E[ X0 ] < ∞.

The limit C∞ := lim_{n→∞} Cn exists because (Cn) is nondecreasing. The Monotone
Convergence theorem implies that C∞ is integrable. Since |Cn| = Cn ≤ C∞, ∀n,
we deduce that the family (Cn) is UI.                                         □

We can use Doob's decomposition to prove a submartingale version of Theo-
rem 3.64.

Corollary 3.77. If X• = (Xn)n∈N0 is a uniformly integrable submartingale, then
for any stopping time T the stopped submartingale X^T_n = XT∧n is a uniformly
integrable submartingale.

Proof. Consider the Doob decomposition of X•, Xn = X0 + Mn + Cn. From
Corollary 3.76 we deduce that M• and C• are UI. Moreover, the Doob decomposi-
tion of X^T is X^T = X0 + M^T + C^T. Corollary 3.65 shows that M^T is UI, and
C^T is UI since 0 ≤ C^T_n ≤ C∞ ∈ L1.                                          □

Theorem 3.78 (Optional Sampling: UI submartingales). Suppose that

    X• = (Xn)n∈N0

is a UI submartingale. Then for any stopping times S, T such that S ≤ T we have

    X̂S ≤ E[ X̂T ‖ FS ].                                               (3.2.25)

In particular, if we let T = ∞,

    X̂S ≤ E[ X∞ ‖ FS ].                                                (3.2.26)

Proof. If X• = M• + C• is the Doob decomposition of X•, then

    X^S_• = M^S_• + C^S_•,   X^T_• = M^T_• + C^T_•.

Since X^S = (X^T)^S, we deduce that X̂S = X̂^T_S. Then, since ĈS is FS-
measurable,

    X̂S = M̂S + ĈS = E[ M^T_∞ ‖ FS ] + E[ ĈS ‖ FS ]       (by (3.2.19))
        ≤ E[ M^T_∞ ‖ FS ] + E[ ĈT ‖ FS ]
        = E[ M^T_∞ ‖ FS ] + E[ C^T_∞ ‖ FS ] = E[ X̂T ‖ FS ].                  □

Corollary 3.79 (Optional Sampling). Suppose that Y• = (Yn)n∈N0 is a uni-
formly integrable submartingale and S, T are a.s. finite stopping times such that
S ≤ T. Then YS ≤ E[ YT ‖ FS ].

Proof. Use (3.2.26) with the UI submartingale X = Y^T and observe that

    X∞ = Y^T_∞ = ŶT = YT.                                                    □

Example 3.80 (Optimal Gambling Strategies). Consider a game of chance
where the winning probability is p < 1/2. For example, in the red-and-black roulette
game one bets on black with winning probability p = 18/38 ≈ 0.473. (The fair case
p = 1/2 is discussed in Exercise 3.19.)
July 19, 2022 15:9 ws-book961x669 An Introduction to Probability 12800-main page 312

312 An Introduction to Probability

Before each game the player bets a sum s, called the stake, that cannot be larger
than his fortune at that moment. If he wins, his fortune increases by the amount
that he bet. Otherwise he loses his stake.
The player starts with a sum of money x and decides that he will play until
the first moment his fortune goes above a set sum, the goal, say 1. His strategy is
based on a function σ(x). If his fortune after n games is Xn , then the amount he
wagers for the next game depends on his current fortune Xn and is σ(Xn ). The
player stops playing when, either he is broke, or he has reached (or surpassed) his
goal. The function σ is known as the strategy of the gambler.
We denote by π(x, σ) the probability that the gambler will reach his goal using
the strategy σ, given that his initial fortune is x.
We want to show that the strategy that maximizes the winning probability
π(x, σ) is the “go-bold ” strategy: if your fortune is less than half the goal, bet it
all, and if your fortune is more than half the goal, bet as much as you need to
reach your goal. Our presentation follows [64, §24.8]. To find out about gambling
strategies for more complex games we refer to [49].
First let us introduce the appropriate formalism. The strategies will be chosen
from a space S, the collection of measurable functions σ : [0, ∞) → [0, ∞) such that

    σ(x) ≤ x, ∀x ∈ [0, 1]  and  σ(x) = 0, ∀x > 1.

Note that the stopping rule is built into the definition of S.
The sequence of games is encoded by a sequence of i.i.d. random variables
(Yn)n∈N such that

    P[ Yn = 1 ] = p,  P[ Yn = −1 ] = 1 − p,  0 < p < 1/2.

For each x ≥ 0 and each σ ∈ S define inductively a sequence of random variables
Xn = Xn^{x,σ},

    X0^x = x,  Xn+1 = Xn + σ(Xn) Yn+1,  n ≥ 0.                         (3.2.27)

We denote by (Ω, S, P) the probability space where the random variables Xn and
Yn are defined. Thus Xn^{x,σ} is the fortune of the player after n games, starting
with initial fortune x and using the strategy σ. Note that σ(Xn) is the amount of
money the player bets before the (n + 1)-th game. It depends only on his fortune
Xn at that time. If Yn = 1 the player gains σ(Xn) and if Yn = −1 he loses this
amount. His strategy σ stays the same for the duration of the game.
Let us observe first that

    X∞^{x,σ} := lim_{n→∞} Xn^{x,σ}

exists a.s. and L1. We will prove this by showing that Xn^{x,σ} is a bounded super-
martingale.
Since σ(x) ≤ x we have x − σ(x) ≥ 0, and we deduce inductively that Xn ≥ 0,
a.s., ∀n. Next, observe that if Xn ≤ 1, then Xn+1 ≤ Xn + σ(Xn) ≤ Xn + 1 ≤ 2,
while if Xn > 1, then σ(Xn) = 0 and Xn+1 = Xn. We deduce inductively that
Xn ≤ max(x, 2), a.s., ∀n.

 
We have E[ Yn ] = 2p − 1 < 0 and thus

    E[ Xn+1 ‖ Fn ] = Xn + σ(Xn) E[ Yn+1 ] ≤ Xn.

Thus (Xn) is a uniformly bounded supermartingale, and in particular UI. Set

    h(x, σ) := E[ min(X∞^{x,σ}, 1) ]  and  π(x, σ) := P[ X∞^{x,σ} ≥ 1 ].

Observe that

    x ≥ h(x, σ) ≥ π(x, σ), ∀x ∈ [0, 1], σ ∈ S.                         (3.2.28)

Indeed, since (Xn) is a supermartingale and the function x ↦ min(x, 1) is concave
and nondecreasing, the sequence min(Xn, 1) is also a supermartingale. Using the
continuity of x ↦ min(x, 1) and the a.s. convergence of Xn we deduce that

    min(Xn^{x,σ}, 1) → min(X∞^{x,σ}, 1) a.s.

Since 0 ≤ min(Xn, 1) ≤ 1, we deduce from the Dominated Convergence theorem
that

    x = E[ min(X0^{x,σ}, 1) ] ≥ E[ min(X∞^{x,σ}, 1) ] ≥ E[ I_{X∞≥1} ]
      ≥ P[ X∞ ≥ 1 ] ≥ π(x, σ).
Let us observe that if the strategy σ is continuous and σ(x) > 0 for x ∈ (0, 1), then

    h(x, σ) = π(x, σ).

Set again Xn = Xn^{x,σ}. We will prove that P[ 0 < X∞ < 1 ] = 0. We argue by
contradiction and assume that

    P[ 0 < X∞ < 1 ] > 0.

Thus there exists ω ∈ Ω such that X∞(ω) ∈ (0, 1) and

    lim_{n→∞} Xn(ω) = X∞(ω).

Then

    lim_{n→∞} σ( Xn(ω) ) = σ( X∞(ω) ) > 0.

On the other hand,

    σ( Xn(ω) ) = |Xn+1(ω) − Xn(ω)| → 0 as n → ∞.

This contradiction shows that P[ 0 < X∞ < 1 ] = 0, so

    E[ min(X∞, 1) ] = P[ X∞ ≥ 1 ].
We have the following optimality criterion.

Lemma 3.81. Let σ0 ∈ S and set h0(x) := h(x, σ0), π0(x) := π(x, σ0). If h0 is
continuous and satisfies

    h0(x) ≥ p h0(x + s) + (1 − p) h0(x − s), ∀ 0 ≤ s ≤ x ≤ 1,          (3.2.29)

then, for any σ ∈ S and any x ∈ [0, 1], we have π(x, σ0) = h0(x) ≥ π(x, σ).
July 19, 2022 15:9 ws-book961x669 An Introduction to Probability 12800-main page 314

314 An Introduction to Probability

Proof. Fix σ ∈ S and x ∈ [0, 1] and set Xn = Xn^{x,σ}. We set h0(x) = 1 for x ≥ 1.
This is a natural condition: if the initial fortune is greater than the goal, then the
probability of achieving the goal is 1.
Observe that the random process Un := h0(Xn) is a supermartingale. Indeed,

    E[h0(X_{n+1}) | Fn] = E[h0(Xn + σ(Xn)Y_{n+1}) | Fn]
    = E[ h0(Xn + σ(Xn)) I{Y_{n+1} = 1} + h0(Xn − σ(Xn)) I{Y_{n+1} = −1} | Fn ]
    = p h0(Xn + σ(Xn)) + (1 − p) h0(Xn − σ(Xn)) ≤ h0(Xn),

where at the last step we used (3.2.29). Thus (Un) is a bounded supermartingale, so

    h0(x) = E[h0(X0^{x,σ})] = E[U0] ≥ E[Un], ∀n.

On the other hand, since h0 is continuous and bounded, we deduce that h0(Xn)
converges a.s. and in L^1 to h0(X∞). Since h0(y) ≥ I{y ≥ 1}, we conclude

    h0(x) ≥ lim_{n→∞} E[Un] = E[h0(X∞^{x,σ})] ≥ P[X∞^{x,σ} ≥ 1] = π(x, σ). □

Define σ0 ∈ S by

    σ0(x) := min(x, 1 − x) for x ∈ [0, 1],  σ0(x) := 0 for x ≥ 1,

and set h0(x) := h(x, σ0), π0(x) := π(x, σ0). We want to show that σ0 satisfies all
the conditions of Lemma 3.81.
Clearly σ0 is a continuous strategy. By construction, for any x ∈ [0, 1] we have
0 ≤ Xn^{x,σ0} ≤ 1 a.s., so

    π0(x) = h0(x) = E[X∞^{x,σ0}], ∀x ∈ [0, 1].
The functions

    [0, 1] ∋ x ↦ x + σ0(x) ∈ [0, 1],  [0, 1] ∋ x ↦ x − σ0(x) ∈ [0, 1]

are non-decreasing. We deduce inductively that if x ≤ y then

    E[Xn^{x,σ0}] ≤ E[Xn^{y,σ0}]

and, by letting n → ∞, we deduce that h0(x) ≤ h0(y), so that h0 is non-decreasing.
By conditioning on Y1 we deduce that

    h0(x) = p h0(2x)  if x ≤ 1/2,
    h0(x) = p + (1 − p) h0(2x − 1)  if 1/2 ≤ x ≤ 1,   (3.2.30)
    h0(x) = 1  if x > 1.
July 19, 2022 15:9 ws-book961x669 An Introduction to Probability 12800-main page 315

Martingales 315

Set

    D := { k/2^n ; n ∈ N0, 0 ≤ k ≤ 2^n }.

We will prove by induction on n that (3.2.29) holds for x of the form x = k/2^n. Start
with n = 1, so x = 1/2. We have

    h(1/2) − p h(1/2 + s) − (1 − p) h(1/2 − s)
    = p − p( p + (1 − p) h(2s) ) − (1 − p) p h(1 − 2s)   (by (3.2.30))
    = p(1 − p)( 1 − h(2s) − h(1 − 2s) ) ≥ 0,

where at the last step we used the fact that h(x) ≤ x, ∀x ∈ [0, 1].
For the inductive step, assume that n > 1 and x = k/2^n, k < 2^n. Choose s ∈ [0, x].
We consider several cases.

Case 1. x + s ≤ 1/2. Using (3.2.30) and the induction hypothesis we deduce

    p h(x + s) + (1 − p) h(x − s) = p( p h(2x + 2s) + (1 − p) h(2x − 2s) )
    ≤ p h(2x) = h(x).

Case 2. x − s ≥ 1/2. Similar to Case 1.

Case 3. x ≤ 1/2 and x + s ≥ 1/2. Using (3.2.30) we have

    A := h(x) − p h(x + s) − (1 − p) h(x − s)
    = p h(2x) − p( p + (1 − p) h(2x + 2s − 1) ) − (1 − p) p h(2x − 2s)
    = p( h(2x) − p − (1 − p) h(2x + 2s − 1) − (1 − p) h(2x − 2s) ).

Observe that 1/2 ≤ x + s ≤ 2x. Using (3.2.30) we deduce

    h(2x) = p + (1 − p) h(4x − 1)

so that

    A = p( p + (1 − p) h(4x − 1) − p − (1 − p) h(2x + 2s − 1) − (1 − p) h(2x − 2s) )
    = p(1 − p)( h(4x − 1) − h(2x + 2s − 1) − h(2x − 2s) ).

Since 2x − 1/2 ≤ 1/2, (3.2.30) gives h(2x − 1/2) = p h(4x − 1), and therefore

    A = (1 − p)( h(2x − 1/2) − p h(2x + 2s − 1) − p h(2x − 2s) )
    ≥ (1 − p)( h(2x − 1/2) − p h(2x + 2s − 1) − (1 − p) h(2x − 2s) )   (p ≤ 1 − p).

The induction hypothesis implies h(2x − 1/2) − p h(2x + 2s − 1) − (1 − p) h(2x − 2s) ≥ 0.

Case 4. x ≥ 1/2 and x − s ≤ 1/2. This is similar to the previous case.
July 19, 2022 15:9 ws-book961x669 An Introduction to Probability 12800-main page 316

316 An Introduction to Probability

We can now prove that h0 is continuous. Since h is nondecreasing we deduce
that the right/left limits h(x±) exist at each x ∈ [0, 1]. Since (3.2.29) holds for
every x in a dense set we deduce

    h(x−) ≥ p h( (x + s)− ) + (1 − p) h( (x − s)− ), ∀0 ≤ s < x ≤ 1.

Now let s ↘ 0 to conclude

    h(x−) ≥ p h(x+) + (1 − p) h(x−) ⇒ p h(x−) ≥ p h(x+),

so that h(x−) = h(x+), i.e., h is continuous. Since D is dense in [0, 1] we deduce
that h0 satisfies (3.2.29) on [0, 1]. We can now invoke Lemma 3.81 to deduce that

    π(x, σ0) = h0(x) ≥ π(x, σ), ∀x ∈ [0, 1], σ ∈ S,

i.e., σ0 is an optimal gambling strategy.
Let us explain how to compute h0(x), x ∈ D. Every number x ∈ D has a binary
expansion

    x = 0.ε1 ε2 · · · = Σ_{n≥1} εn/2^n,

where εn ∈ {0, 1}, and εn = 0 for n ≫ 0. Note that

    x < 1/2 ⟺ ε1 = 0.

The first equation in (3.2.30) reads

    h(0.0 ε2 · · · ) = p · h(0.ε2 · · · ).

In particular,

    h(0.0···0 ε_{k+1} ε_{k+2} · · · ) = p^k h(0.ε_{k+1} ε_{k+2} · · · ),

where the expansion on the left begins with k zeros. The second equation in
(3.2.30) reads

    h(0.1 ε2 · · · ) = p + (1 − p) h(0.ε2 · · · ).

We define f0, f1 : [0, 1] → [0, 1] by

    f0(x) = p x,  f1(x) = p + (1 − p) x.

The above discussion shows that

    h(0.ε1 ε2 · · · εn) = f_{ε1}( h(0.ε2 · · · εn) ).

Since h(0.1) = h(1/2) = p, we deduce by iteration that if

    x = 0.ε1 ε2 · · · εn 1,

then

    h(x) = f_{ε1} ∘ f_{ε2} ∘ · · · ∘ f_{εn}(p).

Thus h is uniquely determined on D and, since D is dense in [0, 1], the function h is
uniquely determined on [0, 1]. Let us emphasize that h0(x) depends on the winning
probability p.
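The iteration h(x) = f_{ε1} ∘ · · · ∘ f_{εn}(p) is easy to implement. The following sketch (an illustration of ours, not from the text; the helper name is hypothetical) evaluates h0 at a terminating dyadic x by extracting its binary digits and composing the maps f0, f1 from the inside out.

```python
from fractions import Fraction

def bold_play_value(x: Fraction, p: float) -> float:
    """h0(x) for a terminating dyadic x in [0, 1]: peel off the binary digits
    eps_1, eps_2, ... of x and apply f0(y) = p*y, f1(y) = p + (1 - p)*y."""
    if x <= 0:
        return 0.0
    if x >= 1:
        return 1.0
    digits = []
    while x != 0:                      # terminates iff x is dyadic
        x *= 2
        if x >= 1:
            digits.append(1)
            x -= 1
        else:
            digits.append(0)
    # the terminating expansion ends in 1, and h0(0.1) = h0(1/2) = p
    v = p
    for eps in reversed(digits[:-1]):  # apply f_{eps_n}, ..., f_{eps_1}
        v = p + (1 - p) * v if eps == 1 else p * v
    return v
```

For instance, bold_play_value(Fraction(21, 32), 0.4) evaluates to ≈ 0.51904, i.e., p + p² − 2p⁴ + p⁵ at p = 0.4.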
As an illustration, let us compute h0(21/32). Note that 21/32 has the binary
expansion

    21/32 = 0.10101

so that

    h(21/32) = f1 ∘ f0 ∘ f1 ∘ f0(p) = f1 ∘ f0 ∘ f1(p^2)
    = f1 ∘ f0( p + p^2 − p^3 ) = f1( p^2 + p^3 − p^4 ) = p + (1 − p)(p^2 + p^3 − p^4)
    = p + p^2 + p^3 − p^4 − p^3 − p^4 + p^5 = p + p^2 − 2p^4 + p^5.

For example, if the winning probability is p = 0.4, then h0(21/32) = 0.519 > 0.5.
Thus, although the winning probability p < 0.5, using this strategy with an initial
fortune 21/32, the odds of increasing the fortune to 1 are better than 50 : 50.
If the initial fortune is x = 1/4, then using its binary expansion 1/4 = 0.01 we
deduce

    h0(1/4) = p h0(1/2) = p^2.

In this case, if p = 0.4, the probability of reaching the goal is 0.16, substantially
smaller. □
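These exact values can be double-checked by direct simulation; the sketch below (our illustration, with a hypothetical function name) plays the bold strategy σ0(x) = min(x, 1 − x) repeatedly and records how often the fortune reaches 1.

```python
import random

def bold_play_win_rate(x: float, p: float, trials: int, seed: int = 0) -> float:
    """Empirical probability that bold play reaches fortune 1 starting from x."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        f = x
        while 0.0 < f < 1.0:
            bet = min(f, 1.0 - f)            # the bold-play stake sigma0(f)
            f = f + bet if rng.random() < p else f - bet
        wins += f >= 1.0
    return wins / trials
```

With x = 21/32, p = 0.4 and a large number of trials, the empirical frequency clusters around 0.519, in agreement with the computation above.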

3.2.6 Maximal inequalities and L^p-convergence


The results in this subsection are wide-ranging generalizations of Kolmogorov's
one series theorem. They depend on Doob's maximal inequality, which generalizes
Kolmogorov's inequality (2.1.3).

Theorem 3.82 (Doob’s maximal inequality). Suppose that (Xn )n∈N0 is a sub-
martingale. Set
X̃n := sup Xk .
k≤n

Then, for any a > 0, we have


h i h i
aP X̃n ≥ a ≤ E Xn I { Xen ≥a} ≤ E Xn+ .
 
(3.2.31)

Proof. Let us introduce the stopping time

    T := inf{ n ≥ 0 ; Xn ≥ a }.

Then

    A := { X̃n ≥ a } = { sup_{k≤n} Xk ≥ a } = { T ≤ n }.

Applying the Optional Sampling Theorem 3.28 to the bounded stopping times T∧n
and n we deduce

    E[X_{T∧n}] ≤ E[Xn].

On the other hand,

    X_{T∧n}(ω) = X_T(ω) I_A(ω) + Xn(ω) I_{A^c}(ω) ≥ a I_A(ω) + Xn(ω) I_{A^c}(ω),

so X_{T∧n} ≥ a I_A + Xn I_{A^c}. We deduce

    a P[A] + E[Xn I_{A^c}] ≤ E[X_{T∧n}] ≤ E[Xn] = E[Xn I_A] + E[Xn I_{A^c}].

This implies the first inequality in (3.2.31). The second inequality is trivial. □
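Because Theorem 3.82 concerns finitely many steps, it can be verified exactly on small examples. The sketch below (our illustration, not from the text) enumerates all 2^n paths of a simple symmetric random walk Sk and evaluates the three quantities in (3.2.31) for the submartingale Xk = |Sk|.

```python
from itertools import product

def doob_maximal_terms(n: int, a: float):
    """Exact values of a*P[max_k X_k >= a], E[X_n 1{max >= a}], E[X_n^+]
    for X_k = |S_k|, S_k a simple symmetric random walk with S_0 = 0."""
    w = 0.5 ** n                         # probability of each sign pattern
    lhs = mid = rhs = 0.0
    for signs in product((-1, 1), repeat=n):
        s, xmax = 0, 0
        for e in signs:
            s += e
            xmax = max(xmax, abs(s))     # running max of X_k = |S_k|
        xn = abs(s)
        if xmax >= a:
            lhs += a * w
            mid += xn * w
        rhs += xn * w                    # X_n >= 0, so X_n^+ = X_n
    return lhs, mid, rhs
```

For every n and a the three returned numbers are nondecreasing, as (3.2.31) predicts.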

Corollary 3.83. Suppose that (Yn) is a martingale. We set

    Yn* := max_{k≤n} |Yk|.

Then for every c > 0 and any p ∈ [1, ∞) we have

    P[Yn* > c] ≤ (1/c^p) E[|Yn|^p].

Proof. Doob's maximal inequality applied to the submartingale Xn = |Yn|^p yields

    P[Yn* > c] = P[ max_{0≤k≤n} |Yk|^p > c^p ] ≤ (1/c^p) E[|Yn|^p]. □

Theorem 3.84 (Doob's L^p-inequality). Let p > 1 and suppose that (Xn)_{n∈N0}
is a positive submartingale such that Xn ∈ L^p, ∀n ≥ 0. Set

    X̃n := sup_{k≤n} Xk.

Then for any n ≥ 0 we have

    E[X̃n^p]^{1/p} ≤ q E[Xn^p]^{1/p},   (3.2.32)

where

    1/p + 1/q = 1, i.e., q = p/(p − 1).

In particular, if (Yn)_{n∈N0} is a martingale and if

    Yn* := max_{k≤n} |Yk|,

then for any n ≥ 0 we have^3

    ‖Yn*‖_{L^p} ≤ q ‖Yn‖_{L^p}.   (3.2.33)

^3 Note that q = p/(p − 1) is the exponent conjugate to p: 1/p + 1/q = 1.
Proof. Clearly (3.2.32) ⇒ (3.2.33). Note that (Xn^p)_{n≥0} is also a submartingale
and X̃n ∈ L^p. From Doob's maximal inequality we deduce

    a P[X̃n ≥ a] ≤ E[Xn I{X̃n ≥ a}],

so, using (1.3.43) and (3.2.31),

    (1/p) E[X̃n^p] = ∫_0^∞ a^{p−1} P[X̃n ≥ a] da ≤ ∫_0^∞ a^{p−2} E[Xn I{X̃n ≥ a}] da.

Switching the order of integration we deduce

    ∫_0^∞ a^{p−2} E[Xn I{X̃n ≥ a}] da = E[ Xn ∫_0^{X̃n} a^{p−2} da ] = (1/(p−1)) E[Xn X̃n^{p−1}]

(use Hölder's inequality with exponents p and q, 1/q = 1 − 1/p)

    ≤ (1/(p−1)) E[Xn^p]^{1/p} E[X̃n^p]^{(p−1)/p}.

Hence

    (1/p) E[X̃n^p] ≤ (1/(p−1)) E[Xn^p]^{1/p} E[X̃n^p]^{(p−1)/p}.

This proves (3.2.32). □
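As with the maximal inequality, (3.2.33) can be checked exactly on a small example. The sketch below (an illustration of ours, not part of the text) takes p = q = 2 and the martingale Yk = Sk, a simple symmetric random walk, for which E[Sn²] = n.

```python
from itertools import product

def doob_l2_terms(n: int):
    """Exact E[(Y_n^*)^2] and q^2 E[Y_n^2] (q = 2 for p = 2) for Y_k = S_k,
    a simple symmetric random walk, by enumerating all 2^n paths."""
    w = 0.5 ** n
    e_max_sq = e_yn_sq = 0.0
    for signs in product((-1, 1), repeat=n):
        s, ymax = 0, 0
        for e in signs:
            s += e
            ymax = max(ymax, abs(s))     # Y_n^* = max_k |S_k|
        e_max_sq += ymax * ymax * w
        e_yn_sq += s * s * w             # equals n exactly
    return e_max_sq, 4.0 * e_yn_sq
```

The first returned value never exceeds the second, in agreement with (3.2.33).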

Definition 3.85. Let p ∈ [1, ∞). A martingale (Xn)_{n∈N0} is called an L^p-martingale
if

    E[|Xn|^p] < ∞, ∀n ∈ N0.

A bounded L^p-martingale is a martingale (Xn)_{n∈N0} such that

    sup_{n∈N0} E[|Xn|^p] < ∞. □

Corollary 3.86 (L^p-martingale convergence theorem). Suppose that
(Xn)_{n∈N0} is a bounded L^p-martingale for some p > 1. Set

    Xn* := max_{k≤n} |Xk|,  X∞* := sup_{k≥0} |Xk| = lim_{n→∞} Xn*.

Then (Xn)_{n∈N0} is a UI martingale and Xn converges a.s. and in L^p to a random
variable

    X∞ ∈ L^p(Ω, F∞, P).

Moreover,

    E[(X∞*)^p] ≤ ( p/(p − 1) )^p E[|X∞|^p].
Proof. From the Monotone Convergence Theorem we deduce

    E[(X∞*)^p] = lim_{n→∞} E[(Xn*)^p] ≤ ( p/(p − 1) )^p sup_{n≥0} E[|Xn|^p] < ∞,

so X∞* ∈ L^p and |Xn| ≤ X∞*, ∀n ≥ 0. The desired conclusions now follow from the
martingale convergence theorem and the Dominated Convergence Theorem. □

Example 3.87 (Kolmogorov's one series theorem). Suppose that (Xn)_{n≥0} is
a sequence of independent random variables such that

    E[Xn] = 0, ∀n ≥ 0,  Σ_{n≥0} Var[Xn] < ∞.

Then the random series X0 + X1 + · · · is a.s. and L^2-convergent. Indeed, the
sequence of partial sums

    Sn = X0 + · · · + Xn

is a bounded L^2-martingale and so it converges a.s. and in L^2. □
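To see the theorem at work numerically, take Xn = εn/n with i.i.d. signs εn, so that Σ Var[Xn] = Σ 1/n² < ∞. The sketch below (our illustration) estimates E[(S_{n2} − S_{n1})²], which by independence of the zero-mean terms equals Σ_{n1 < n ≤ n2} 1/n² and is small for large n1, consistent with the a.s. and L² convergence of the partial sums.

```python
import random

def tail_mean_square(n1: int, n2: int, paths: int, seed: int = 1) -> float:
    """Empirical E[(S_{n2} - S_{n1})^2] for the series sum_n eps_n / n,
    eps_n i.i.d. uniform signs; theory gives sum_{n1 < n <= n2} 1/n^2."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(paths):
        tail = sum(rng.choice((-1.0, 1.0)) / n for n in range(n1 + 1, n2 + 1))
        acc += tail * tail
    return acc / paths
```

The estimate matches the theoretical tail sum closely, and both tend to 0 as n1 grows.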

Example 3.88 (Likelihood ratio). This example has its origin in statistics. Sup-
pose that we have a random quantity and we have reasons to believe that its prob-
ability distribution is either of the form p(x)dx or q(x)dx, where p, q : R → [0, ∞)
are mutually absolutely continuous probability densities on R,

    ∫_R p(x) dx = ∫_R q(x) dx = 1.

We want to describe a statistical test that helps decide which is the real distribu-
tion. Our presentation follows [75, Sec. 12.8].
We take a large number of samples of the random quantity, or equivalently,
suppose that we are given a sequence of i.i.d. random variables (Xn)_{n≥1} with com-
mon probability density f, where f is one of the two densities p or q. Assume for
simplicity that p(x), q(x) > 0 for almost any x ∈ R.
The products

    Yn := Π_{k=1}^n p(Xk)/q(Xk)

are called likelihood ratios. Note that if f = q, then E[Yn] = 1, ∀n.
To decide whether f = q or f = p we fix a (large) positive number a and a large
n ∈ N and adopt the prediction strategy

    fn := p if Yn ≥ a,  fn := q if Yn < a.

We want to show that this strategy picks the correct density with high confidence,
i.e., P[f = fn] is very close to 1 for large n and a.
If f = q, then Yn is a product of i.i.d. nonnegative random variables with mean
1 and, as shown in Example 3.6, it is a martingale with respect to the filtration
Fn = σ(X1, . . . , Xn).
The function log is strictly concave and we deduce from Jensen's inequality

    E[ log( p(Xn)/q(Xn) ) ] < log E[ p(Xn)/q(Xn) ] = 0.

The Strong Law of Large Numbers shows that

    (1/n) Σ_{k=1}^n log( p(Xk)/q(Xk) ) → E[log Y1] < 0, a.s.

Thus

    log Yn = Σ_{k=1}^n log( p(Xk)/q(Xk) ) → −∞ a.s.

Thus, if f = q, then Yn → 0 a.s. In particular, P[f = fn] = P[Yn < a] → 1 as
n → ∞.
If f = p, then a similar argument shows that 1/Yn → 0 a.s. We deduce that

    Yn → 0 if f = q,  Yn → ∞ if f = p.

Moreover, Doob's maximal inequality (3.2.31) shows that if f = q, so that Yn is a
martingale, we have

    P[ max_{1≤k≤n} Yk ≥ a ] ≤ 1/a.

Thus, if f = q, the probability that Yn overshoots the level a ≫ 1 is small and this
statistical test makes the right decision with high confidence. □
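The behavior just described is easy to observe numerically. In the sketch below (our illustration; the names are ours) we take p the density of N(1, 1) and q that of N(0, 1), so that log(p(x)/q(x)) = x − 1/2, and we sample under f = q.

```python
import math
import random

def lr_under_q(n: int, a: float, paths: int, seed: int = 2):
    """Sample X_k ~ q = N(0,1) with p = N(1,1), so log Y_n = sum_k (X_k - 1/2).
    Returns (fraction of paths with Y_n < a, fraction with max_k Y_k >= a)."""
    rng = random.Random(seed)
    decide_q = overshoot = 0
    log_a = math.log(a)
    for _ in range(paths):
        logy, logy_max = 0.0, 0.0
        for _ in range(n):
            logy += rng.gauss(0.0, 1.0) - 0.5
            logy_max = max(logy_max, logy)
        decide_q += logy < log_a
        overshoot += logy_max >= log_a
    return decide_q / paths, overshoot / paths
```

Empirically the test almost always declares f = q, and the overshoot frequency stays below the bound 1/a supplied by Doob's maximal inequality.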

Example 3.89. Consider again the branching process in Example 3.8. Suppose
that the reproduction law µ satisfies

    m = Σ_{k=0}^∞ k µ(k) < ∞,  Σ_{k=0}^∞ k^2 µ(k) < ∞.

We set

    σ^2 := Var[µ] = Σ_{k=0}^∞ k^2 µ(k) − m^2.

Note that

    Z_{n+1} = Σ_{k=1}^∞ ( Σ_{j=1}^k X_{n,j} ) I{Zn = k} = Σ_{j=1}^∞ X_{n,j} Σ_{k≥j} I{Zn = k}
    = Σ_{j=1}^∞ X_{n,j} I{Zn ≥ j},

so

    E[Z_{n+1}^2 | Fn] = E[ Σ_{j,k=1}^∞ I{Zn ≥ j, Zn ≥ k} X_{n,j} X_{n,k} | Fn ]

(X_{n,j}, X_{n,k} ⊥⊥ Fn)

    = Σ_{j,k=1}^∞ I{Zn ≥ j, Zn ≥ k} E[X_{n,j} X_{n,k}] = Σ_{j,k=1}^∞ I{Zn ≥ j, Zn ≥ k} ( m^2 + δ_{jk} σ^2 )

    = m^2 Σ_{j=1}^∞ I{Zn ≥ j} Σ_{k=1}^∞ I{Zn ≥ k} + σ^2 Σ_{k=1}^∞ I{Zn ≥ k}

(Zn = Σ_{k≥1} I{Zn ≥ k})

    = m^2 Zn^2 + σ^2 Zn.

Hence, with ℓ := E[Z0] and E[Zn] = m^n ℓ,

    E[Z_{n+1}^2] = m^2 E[Zn^2] + σ^2 E[Zn] = m^2 E[Zn^2] + σ^2 m^n ℓ.

We set

    qn := m^{−2n} E[Zn^2]

and we get from the above that

    q_{n+1} = qn + m^{−n−2} σ^2 ℓ.

This shows that if m > 1, then the sequence (qn) converges, so the martingale
Wn := m^{−n} Zn is bounded in L^2 and converges in L^2 and a.s. The limit W∞ is
not a.s. zero if ℓ = E[Z0] > 0 because

    E[W∞] = E[W0] = E[Z0] = ℓ.

We refer to Exercise 3.26 for more details about W∞. □
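The convergence of Wn = m^{−n} Zn is visible in simulation. The sketch below (our illustration, with hypothetical names) uses a Poisson(m) reproduction law, for which the offspring mean is m and σ² = m, and checks that the sample mean of Wn stays near E[Z0] = 1, as the martingale property demands.

```python
import math
import random

def sample_poisson(rng: random.Random, lam: float) -> int:
    """Knuth's multiplicative method for sampling Poisson(lam)."""
    limit, k, prod = math.exp(-lam), 0, 1.0
    while True:
        prod *= rng.random()
        if prod <= limit:
            return k
        k += 1

def mean_normalized_population(m: float, generations: int, paths: int,
                               seed: int = 3) -> float:
    """Sample mean of W_n = Z_n / m^n over independent Galton-Watson trees
    with Poisson(m) offspring and Z_0 = 1."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(paths):
        z = 1
        for _ in range(generations):
            z = sum(sample_poisson(rng, m) for _ in range(z))
        total += z / m ** generations
    return total / paths
```

With m = 1.5 the sample mean of W_n remains close to 1 even though Z_n itself grows geometrically.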

3.2.7 Backwards martingales


Suppose that the parameter set T is

    T = −N0 = { 0, −1, −2, . . . }.

In this case a T-filtration (Fn)_{n∈−N0} is called a backwards filtration. We set

    F_{−∞} := ∩_{n≤0} Fn.

A backwards martingale (submartingale, supermartingale) is a martingale (resp.
submartingale, supermartingale) adapted to a backwards filtration.

Theorem 3.90 (Convergence of backwards submartingales). Suppose that
F• = (Fn)_{n∈−N0} is a backwards filtration of (Ω, F, P) and X• = (Xn)_{n∈−N0} is an
F•-submartingale, i.e.,

    Xn ≤ E[Xm | Fn], ∀n, m ∈ −N0, n ≤ m,

and

    C := inf_{n≤0} E[Xn] > −∞.

Then the following hold.
(i) The family (Xn)_{n∈−N0} is UI.
(ii) There exists X_{−∞} ∈ L^1(Ω, F_{−∞}, P) such that Xn → X_{−∞} a.s. and in L^1 as
n → −∞.

Moreover,

    X_{−∞} ≤ E[Xn | F_{−∞}],   (3.2.34)

with equality if (Xn)_{n∈−N0} is a martingale.

Proof. Step 1. Boundedness in L^1. Observe that (Xn^+) is a submartingale and
thus

    E[Xn^+] ≤ E[X0^+], ∀n ≤ 0.

On the other hand, by assumption,

    E[Xn] = E[Xn^+] − E[Xn^−] ≥ C, ∀n ≤ 0.

Hence

    E[Xn^−] ≤ −C + E[Xn^+] ≤ −C + E[X0^+], ∀n ≤ 0,

and consequently,

    Z := sup_{n≤0} E[|Xn|] < ∞.   (3.2.35)

Step 2. Almost sure convergence. For K ∈ N consider the (increasing) filtration

    G^K_n := F_{(−K+n)∧0}, n ∈ N0,

and the G^K_n-submartingale Y^K_n := X_{(−K+n)∧0}. Thus

    Y^K_0 = X_{−K}, Y^K_1 = X_{−K+1}, . . . , Y^K_K = X0, Y^K_{K+1} = X0, . . . .

Doob's upcrossing inequality applied to the submartingale Y^K_n shows that, for any
rational numbers a < b, we have

    (b − a) E[ N_K([a, b], Y^K) ] ≤ E[(X0 − a)^+] − E[(X_{−K} − a)^+]
    ≤ E[(X0 − a)^+] ≤ |a| + E[|X0|].

This proves that, for any rational numbers a < b, the expectations of the nondecreasing
sequence

    K ↦ N_K([a, b], Y^K)

are bounded, and thus this sequence has an a.s. finite limit N∞([a, b], X) as K → ∞.
An obvious version of Lemma 3.38 shows that Xn has an a.s. limit as n → −∞. The
limit is an F_{−∞}-measurable random variable X_{−∞}. Fatou's Lemma shows that

    E[|X_{−∞}|] < ∞.
Step 3. Uniform integrability. This is obvious if (Xn)_{n≤0} is a martingale since
Xn = E[X0 | Fn] and the conclusion follows from Corollary 3.56.
and the conclusion follows from Corollary 3.56.


In general, if (Xn)_{n≤0} is a submartingale, we have

    E[Xn] ≤ E[Xm], ∀n ≤ m ≤ 0.

Since the sequence E[Xn] is bounded below, it has a finite limit as n → −∞. Thus,
for any ε > 0, there exists K = K(ε) > 0 such that

    E[X_{−n}] ≥ E[X_{−K}] − ε/2, ∀n ≥ K.

For n > K and a > 0 we have

    E[|X_{−n}| I{|X_{−n}| > a}] = E[(−X_{−n}) I{X_{−n} < −a}] + E[X_{−n} I{X_{−n} > a}]
    = −E[X_{−n}] + E[X_{−n} I{X_{−n} ≥ −a}] + E[X_{−n} I{X_{−n} > a}]
    ≤ −E[X_{−K}] + ε/2 + E[X_{−n} I{X_{−n} ≥ −a}] + E[X_{−n} I{X_{−n} > a}].

Now observe that, for any H ∈ F_{−n}, we have

    X_{−n} I_H ≤ E[X_{−K} | F_{−n}] I_H = E[X_{−K} I_H | F_{−n}],

so

    E[X_{−n} I_H] ≤ E[X_{−K} I_H].

Hence, if H = {X_{−n} ≥ −a} or H = {X_{−n} > a}, then

    E[X_{−n} I{X_{−n} ≥ −a}] + E[X_{−n} I{X_{−n} > a}]
    ≤ E[X_{−K} I{X_{−n} ≥ −a}] + E[X_{−K} I{X_{−n} > a}],

and

    E[|X_{−n}| I{|X_{−n}| > a}]
    ≤ −E[X_{−K}] + E[X_{−K} I{X_{−n} ≥ −a}] + E[X_{−K} I{X_{−n} > a}] + ε/2
    ≤ E[|X_{−K}| I{|X_{−n}| > a}] + ε/2.

From Markov's inequality and (3.2.35) we deduce

    P[|X_{−m}| > a] ≤ Z/a, ∀m ∈ N0.

Since the family consisting of the single random variable X_{−K} is uniformly
integrable, there exists δ = δ(ε) > 0 such that, for any A ∈ F satisfying P[A] < δ,
we have

    E[|X_{−K}| I_A] < ε/2.

We deduce that for any a > 0 such that Z/a < δ(ε) we have

    E[|X_{−n}| I{|X_{−n}| > a}] ≤ E[|X_{−K}| I{|X_{−n}| > a}] + ε/2 < ε, ∀n > K.

Since the finitely many variables X0, X_{−1}, . . . , X_{−K} also form a UI family, this
proves that the family (X_{−n})_{n∈N0} is UI.

Step 4. Conclusion. Finally, observe that for any A ∈ F_{−∞} and any n ≤ m ≤ 0
we have E[Xn I_A] ≤ E[Xm I_A]. If we let n → −∞ we deduce

    E[X_{−∞} I_A] ≤ E[Xm I_A], ∀m ≤ 0, A ∈ F_{−∞}.

This is precisely the inequality (3.2.34). When (Xn) is a martingale all the above
inequalities are equalities. □
Corollary 3.91 (Backwards Martingale Convergence). Suppose that
(Gn)_{n∈N0} is a decreasing family of σ-subalgebras of F and

    Z ∈ L^1(Ω, F, P).

Then the sequence E[Z | Gn] converges a.s. and in L^1 to E[Z | G∞], where

    G∞ = ∩_{n≥0} Gn.

Proof. Apply the previous theorem to the backwards filtration Fn := G_{−n}, n ≤ 0,
and the martingale Zn := E[Z | Fn], n ≤ 0. □

3.2.8 Exchangeable sequences of random variables


An n-dimensional random vector X = (X1 , . . . , Xn ) is called exchangeable if, for any
permutation π of {1, . . . , n} the random vectors (X1 , . . . , Xn ) and (Xπ(1) , . . . , Xπ(n) )
have identical distributions.
A sequence of random variables (Xk )k∈N is called exchangeable if for any n ∈ N
the random vector (X1 , . . . , Xn ) is exchangeable. One also refers to an exchangeable
sequence as an exchangeable process.
Equivalently, if we denote by Sn the subgroup of permutations ϕ of N such that
ϕ(r) = r, ∀r > n, then the sequence (Xn )n≥1 is exchangeable if for any n ∈ N and
any ϕ ∈ Sn the sequences (Xn )n∈N and (Xϕ(n) )n∈N are identically distributed.

Example 3.92. (a) A sequence of i.i.d. random variables (Xn)_{n≥1} is exchangeable.
(b) Suppose that (µ_λ)_{λ∈Λ} is a family of Borel probability measures on R
parametrized by a probability space (Λ, S, P_Λ) such that, for any Borel subset
B ⊂ R, the function

    Λ ∋ λ ↦ µ_λ[B]

is measurable. In other words, µ• is a random probability measure. In the language
of kernels, µ• is a Markov kernel (Λ, S) → (R, B_R).
For each λ ∈ Λ we have a product measure µ_λ^{⊗n} on R^n equipped with its natural
σ-algebra B_n = B_R^{⊗n}. The mixture of the family (µ_λ) directed by P_Λ is the measure
µ_Λ^n defined by the averaging formula

    µ_Λ^n[B] := ∫_Λ µ_λ^{⊗n}[B] P_Λ[dλ], ∀B ∈ B_n.

The collection (µ_Λ^n)_{n∈N} forms a projective family. Kolmogorov's existence theorem
shows that this family induces a unique probability measure µ_Λ^∞ on R^N. The random
variables

    Xn : R^N → R,  Xn(x1, x2, . . . ) = xn

form an exchangeable sequence. The measure µ_Λ^∞ is called a mixture of i.i.d.'s
directed by the random measure µ•.
For example, suppose that ν is a Borel probability measure on Λ = [0, 1]. For
any p ∈ [0, 1] define

    µ_p = Bin(p) = (1 − p)δ_0 + p δ_1 ∈ Prob({0, 1}).

Then we obtain the mixtures µ_ν^n ∈ Prob({0, 1}^n) defined by

    µ_ν^n[{(ε1, . . . , εn)}] = ∫_{[0,1]} p^k (1 − p)^{n−k} ν[dp],  k = ε1 + · · · + εn.

The collection µ_ν^n ∈ Prob({0, 1}^n), n ∈ N, is a projective family and thus it defines
a measure µ_ν^∞ on {0, 1}^N.
The random vector X = (X1, X2, . . . ) with distribution µ_ν^∞ defines an exchange-
able sequence of Bernoulli random variables. Observe that their common success
probability is

    p̄ := P[Xn = 1] = ∫_{[0,1]} p ν[dp], ∀n ∈ N. □
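A short simulation makes the mixture construction concrete. The sketch below (our illustration) takes ν to be the Lebesgue measure on [0, 1]: it draws p ~ Uniform[0, 1] once per path and then tosses Bernoulli(p) coins. Exchangeability predicts P[X1 = 1, X2 = 0] = P[X1 = 0, X2 = 1] = ∫ p(1 − p) ν[dp] = 1/6, while the common success probability is ∫ p ν[dp] = 1/2.

```python
import random

def mixture_frequencies(paths: int, seed: int = 4):
    """Empirical P[X=(1,0)], P[X=(0,1)], P[X1=1] for the nu = Uniform[0,1]
    mixture of i.i.d. Bernoulli(p) pairs."""
    rng = random.Random(seed)
    c10 = c01 = c1 = 0
    for _ in range(paths):
        p = rng.random()                 # p ~ nu
        x1 = rng.random() < p
        x2 = rng.random() < p
        c10 += x1 and not x2
        c01 += x2 and not x1
        c1 += x1
    return c10 / paths, c01 / paths, c1 / paths
```

Note that X1 and X2 are exchangeable but not independent here: P[X1 = X2 = 1] = ∫ p² ν[dp] = 1/3 ≠ (1/2)².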

Denote by B the Borel σ-algebra of R. The groups Sn act on R^N by permuting
the first n coordinates, and we say that a function Φ : R^N → R is n-symmetric if
it is Sn-invariant. We denote by Sn ⊂ B^{⊗N} the σ-subalgebra generated by the
n-symmetric measurable functions Φ : R^N → R. Equivalently,

    S ∈ Sn ⟺ σ(S) = S, ∀σ ∈ Sn.

We set

    S∞ := ∩_{n≥1} Sn ⊂ B^{⊗N}.

We will refer to S∞ as the σ-algebra of permutable or exchangeable events associated
to the exchangeable sequence (Xn)_{n∈N}. Note that S∞ ⊃ T∞, where T∞ denotes the
tail σ-algebra of the coordinate sequence Xn : R^N → R,

    Xn(x1, x2, . . . ) = xn, n ∈ N.
It turns out that exchangeable sequences have a rather nice structure.

Theorem 3.93 (de Finetti). Suppose that X := (Xn)_{n∈N} is an exchangeable se-
quence of integrable random variables defined on the same probability space (Ω, F, P).
Set

    Sn := X^{−1}(Sn), ∀n ∈ N ∪ {∞}.

Then the following hold.

(i) The random variables (Xn)_{n≥1} are conditionally independent given S∞.
(ii) The random variables (Xn)_{n≥1} are identically distributed given S∞, i.e., there
exists a negligible subset N ∈ F such that, on Ω \ N,

    P[Xi ≤ x | S∞] = P[Xj ≤ x | S∞], ∀i, j ∈ N, ∀x ∈ R.
(iii) The empirical means

    (X1 + · · · + Xn)/n

converge a.s. and in L^1 to E[X1 | S∞].
 

Proof. We follow the presentation in [91]. Without any loss of generality we can
assume that (Ω, F) = (RN , BN ) and Xn (x1 , x2 , . . . ) = xn . In this case Sn = Sn .
Observe that the exchangeability condition implies that the random variables Xn
are identically distributed. Suppose that f : R → R is a measurable function such
that f (X1 ) ∈ L1 . We claim that
1   
∀k ∈ N, f (X1 ) + · · · + f (Xn ) = E f (Xk )kSn . (3.2.36)
n
Note that Sn = X −1 (Bn ), where Bn is the σ-subalgebra of BN consisting of Sn -
invariant subsets. In particular, a function g : Ω → R is Sn -measurable iff there
exists an n-symmetric function Φ such that g = Φ(X).
Let A ∈ Sn and choose an n-symmetric function Φ such that I A = Φ(X). Then,
for 1 ≤ j ≤ n we have
   
E f (Xj )Φ(X) = E f (Xj )Φ(Xj , X2 , . . . , Xj−1 , X1 , Xj+1 , . . . )
 
= E f (X1 )Φ(X) ,
so that
 
  f (X1 ) + · · · + f (Xn )  
E f (X1 )I A = E I A = E f (Xj )Φ(X) .
n
The equality (3.2.36) follows by observing that f (X1 )+· · ·+f (Xn ) is Sn -measurable.
The convergence theorem for backwards martingales (Corollary 3.91) shows that
the empirical mean
f (X1 ) + · · · + f (Xn )
n
converges a.s. and L to E f (X1 ) k S∞ . By choosing f (x) = x we obtain the
1
 

statement (iii) of Theorem 3.93.


By choosing f (x) = I (−∞,x] we deduce
#{j ≤ n; Xj ≤ x}
= F (x) := P X1 ≤ x k S∞ ,
 
lim (3.2.37)
n→∞ n
a.s. and L1 .
Let k ∈ N. For n ≥ k we set (n)k := n(n − 1) · · · (n − k + 1). Suppose that
f1 , . . . , fk : R → R are bounded and measurable. The above argument generalizes
to prove that for n ≥ k we have
1 X
f1 (Xj1 ) · · · fk (Xjk ) = E f1 (X1 ) · · · fk (Xk ) k Sn .
 
Ak,n :=
(n)k j ,...,j
1 k
ji distinct
Using the backwards martingale convergence theorem we deduce

    lim_{n→∞} A_{k,n} = E[ f1(X1) · · · fk(Xk) | S∞ ].   (3.2.38)

Consider now

    B_{k,n} := (1/n^k) Σ_{j1,...,jk=1}^n f1(X_{j1}) · · · fk(X_{jk}) = Π_{i=1}^k ( fi(X1) + · · · + fi(Xn) )/n.

We deduce from (3.2.36) that

    lim_{n→∞} B_{k,n} = Π_{i=1}^k E[ fi(X1) | S∞ ].

Now observe that

    A_{k,n} − B_{k,n} = O(1/n) as n → ∞,

since the contribution to B_{k,n} corresponding to k-tuples with j1, . . . , jk not distinct
is O(n^{k−1}) and n^k ∼ (n)_k as n → ∞. If we choose

    fi = I_{(−∞, xi]}, 1 ≤ i ≤ k,

we deduce from (3.2.37) that

    P[X1 ≤ x1, . . . , Xk ≤ xk | S∞] = Π_{i=1}^k P[Xi ≤ xi | S∞].

This proves (i) and (ii) of the theorem. □

Remark 3.94. Suppose that (Xn)_{n∈N} is an exchangeable sequence of random vari-
ables defined on the probability space (Ω, F, P). Denote by S∞ the σ-algebra of
exchangeable events. Suppose that

    Q : Ω × B_R → [0, 1],  (ω, B) ↦ Q_ω[B]

is a regular version of the conditional distribution P_{X1}[dx | S∞], i.e.,

    ∀B ∈ B_R :  P[X1 ∈ B | S∞] = Q[B], a.s.

De Finetti's theorem implies that

    P[X1 ∈ B | S∞] = lim_{n→∞} (1/n) Σ_{k=1}^n I_B(Xk) = P[Xm ∈ B | S∞], ∀m ∈ N.

Thus the random variables (Xn) are equidistributed, conditional on S∞.
Let us show that the distribution of the sequence (Xn)_{n∈N} is a mixture directed
by the random measure ω ↦ Q_ω[−], as in Example 3.92(b).
Indeed, for any Borel subsets B1, . . . , Bn ⊂ R we have

    P[X1 ∈ B1, . . . , Xn ∈ Bn] = E[ E[ I_{B1}(X1) · · · I_{Bn}(Xn) | S∞ ] ]

(use the conditional independence given S∞)

    = E[ E[I_{B1}(X1) | S∞] · · · E[I_{Bn}(Xn) | S∞] ]
    = E[ Q[B1] · · · Q[Bn] ] = ∫_Ω Q_ω^{⊗n}[ B1 × · · · × Bn ] P[dω].

Thus the distribution of the sequence (Xn) is a mixture of i.i.d.'s directed by the
random distribution Q. □
The σ-algebra S∞ ⊂ B^{⊗N} of permutable events of an exchangeable sequence
(Xn)_{n∈N} contains its tail σ-algebra T∞. It turns out that they are not so different.
We have the following general result.

Proposition 3.95. Suppose that (Xn)_{n∈N} is an exchangeable sequence of random
variables. Then the P-completion of S∞ coincides with the P-completion of the tail
σ-algebra T∞.

Proof. We follow the approach in [29, Sec. 7.3, Thm. 4]. For a different but related
proof we refer to [1, Cor. (3.10)]. Since T∞ ⊂ S∞, it suffices to show that every
S ∈ S∞ coincides, up to a negligible set, with an event in T∞.
Denote by S∞* and T∞* the completions of S∞ and respectively T∞. We have

    lim_{n→∞} (1/n) Σ_{k=1}^n I{Xk ≤ x} = P[X1 ≤ x | S∞].

Clearly the limit in the left-hand side is T∞-measurable since it is not affected by
changing finitely many of the random variables. Hence

    P[X1 ≤ x | S∞] = P[X1 ≤ x | T∞] = P[X1 ≤ x | T∞*].   (3.2.39)

Similarly, for any x1, . . . , xn ∈ R, the random variable

    Π_{k=1}^n P[Xk ≤ xk | S∞]

is T∞-measurable. Hence, for any S ∈ S∞ we have

    P[ S ∩ ∩_{k=1}^n {Xk ≤ xk} | T∞ ] = E[ I_S · Π_{k=1}^n P[Xk ≤ xk | S∞] | T∞ ]
    = E[ Π_{k=1}^n P[Xk ≤ xk | S∞] | T∞ ] · P[S | T∞]
    = Π_{k=1}^n P[Xk ≤ xk | T∞] · P[S | T∞],

where at the last step we used (3.2.39). Thus S∞ and X1, . . . , Xn are conditionally
independent given T∞, so S∞ and (Xn)_{n∈N} are conditionally independent given T∞.
Since S∞ ⊂ σ(Xn ; n ∈ N), we deduce that any S ∈ S∞ is conditionally independent
of itself given T∞. We have

    P[S | T∞] = P[S | T∞]^2,

so P[S | T∞] ∈ {0, 1} a.s., ∀S ∈ S∞. Set

    T = T_S = { ω : P[S | T∞](ω) = 1 }.

Then T ∈ T∞ and

    P[T ∩ S] = E[ I_T P[S | T∞] ] = P[T],
    P[S] = E[ P[S | T∞] ] = E[I_T] = P[T].

Hence P[S Δ T] = 0, i.e., S belongs to the completion T∞*. This concludes the
proof. □
Observe that a sequence of i.i.d. random variables (Xn)_{n≥1} is exchangeable. The
Kolmogorov 0-1 law and the above proposition imply the following result.

Theorem 3.96 (Hewitt-Savage 0-1 Law). If (Xn)_{n≥1} is a sequence of i.i.d. ran-
dom variables and A ∈ S∞, then P[A] ∈ {0, 1}. □

Corollary 3.97 (The Strong Law of Large Numbers). Suppose that
(Xn)_{n∈N} is a sequence of i.i.d. integrable random variables. Then

    X̄n := (1/n)( X1 + · · · + Xn )

converges a.s. and in L^1 to E[X1].

Proof. From de Finetti's Theorem 3.93 we deduce that X̄n converges a.s. and in L^1
to E[X1 | S∞]. Proposition 3.95 shows that E[X1 | S∞] = E[X1 | T∞] and Kolmogorov's
0-1 theorem shows that E[X1 | T∞] = E[X1]. □

Theorem 3.98. Suppose that (Xn : (Ω, F, P) → {0, 1})_{n∈N} is an exchangeable
sequence of Bernoulli random variables. Set

    S := lim_{n→∞} (1/n)( X1 + · · · + Xn ).

Then

    S = P[X1 = 1 | S∞],   (3.2.40a)

    P[X1 = · · · = Xk = 1, X_{k+1} = · · · = Xn = 0 | S] = S^k (1 − S)^{n−k},   (3.2.40b)

    P[X1 = · · · = Xk = 1, X_{k+1} = · · · = Xn = 0] = E[ S^k (1 − S)^{n−k} ].   (3.2.40c)

In particular, the moment generating function of S is

    E[e^{tS}] = Σ_{n≥0} P[X1 = · · · = Xn = 1] t^n/n!.

Proof. Using de Finetti’s theorem we deduce that S = E X1 k S∞ . Observe that


 

X1 = I {X1 =1} so that


S = E X1 k S∞ = E I {X1 =1} k S∞ = P X1 = 1 k S∞ .
     

Note that 0 ≤ S ≤ 1 a.s. and


1 − S = E 1 − I {X1 =1} k S∞ = P Xn = 0 k S∞ .
   

Then, since X1 , . . . , Xn are conditionally i.i.d. given S∞ , we have


P X1 = 1, . . . , Xk = 1, Xk+1 = 0, . . . , Xn = 0 k S∞
 

k  n−k
= P X1 = 1 k S∞ P X1 = 0 k S∞ = S k (1 − S)n−k .

July 19, 2022 15:9 ws-book961x669 An Introduction to Probability 12800-main page 331

Martingales 331

Since S is S∞ -measurable we have


 
P X1 = 1, . . . , Xk = 1, Xk+1 = 0, . . . , Xn = 0 k S
h  i
= E P X1 = 1, . . . , Xk = 1, Xk+1 = 0, . . . , Xn = 0 k S∞

S

= E S k (1 − S k ) k S = S k (1 − S k ).
 

Clearly,
P X1 = 1, . . . , Xk = 1, Xk+1 = 0, . . . , Xn = 0 = E S k (1 − S)n−k .
   

t
u

Example 3.99 (Polya’s urn revisited). We want to conclude this introduction


to exchangeability with an application to Polya’s urn problem introduced in Exam-
ple 3.9. We recall this process.
We start with an urn containing r > 0 red balls and g > 0 green balls. At
each moment of time we draw a ball uniformly likely from the balls existing at that
moment, we replace it by c + 1 balls of the same color, c ≥ 0. Denote by Rn and
Gn the number of red and respectively green balls in the urn after the nth draw.
As we have seen in Example 3.9 the ratio of red balls
Rn Rn
Zn = =
Rn + Gn r + g + cn
is a bounded martingale and thus it has an a.s. and L1 limit Z∞ . We will determine
this limit using de Finetti’s theorem. We discuss only the nontrivial case c > 0.
Introduce the {0, 1}-valued random variables (Xn )n≥1 where Xn = 1 if the
n-drawn ball is red and Xn = 0 if it is green. Then
Rn = r + cSn , Sn := X1 + · · · + Xn ,
and we deduce that
cSn Rn
lim = lim = Z∞ .
n→∞ cn n→∞ Rn + Gn

Let us observe that the sequence (Xn )n≥1 is exchangeable. We prove by induction
that (X1 , . . . , Xn ) is exchangeable. For n = 1 the result is trivial.
Let n > 1 and 1 , . . . , n ∈ {0, 1}. We denote by rk and gk the number of red
balls and respectively green balls after the k-th draw. We deduce
 Qn 
k=1 k rk−1 +(1−k )gk−1

 Qn , c > 0,
   k=1 (r+g+(k−1)c)
P X1 = 1 , . . . , Xn = n =


 k
z0 (1 − z0 )n−k , c = 0,
r
where z0 = Z0 = r+g . When c > 0 the denominator above is independent of
{1 , . . . , n }. We set Sn := 1 + · · · + n and we rewrite the numerator in the form
Sn
n
Y  Y  n−S
Yn 
k rk−1 + (1 − k )gk−1 = r + c(i − 1) g + c(j − 1) .
k=1 i=1 j=1
The last expression only depends on Sn, which is obviously a symmetric function
of the variables ε1, . . . , εn; when c = 0 the probability z0^{Sn}(1 − z0)^{n−Sn} is also
a symmetric function of ε1, . . . , εn. Hence (X1, . . . , Xn) is exchangeable.
When c > 0, we set

    ρ := r/c,  γ := g/c,

and we deduce

    P[X1 = · · · = Xk = 1, X_{k+1} = · · · = Xn = 0]
    = Π_{i=0}^{k−1} (ρ + i) · Π_{j=0}^{n−k−1} (γ + j) / Π_{m=0}^{n−1} (ρ + γ + m)
    = ( Γ(ρ + γ)/(Γ(ρ)Γ(γ)) ) · ( Γ(ρ + k)Γ(γ + n − k)/Γ(ρ + γ + n) )
    = B(ρ + k, γ + n − k)/B(ρ, γ),   (3.2.41)

where B(x, y) denotes the Beta function

    B(x, y) = Γ(x)Γ(y)/Γ(x + y) = ∫_0^1 t^{x−1}(1 − t)^{y−1} dt.
We now invoke Theorem 3.98. Note that

    Z∞ = lim_{n→∞} (1/n)( X1 + · · · + Xn ) = S = P[X1 = 1 | S∞]

is a [0, 1]-valued random variable and (3.2.40c) with k = n shows that, for any
n ≥ 0, we have

    ∫_0^1 z^n P_{Z∞}[dz] = E[Z∞^n] = P[X1 = · · · = Xn = 1]

    = B(ρ + n, γ)/B(ρ, γ) = ∫_0^1 z^n · z^{ρ−1}(1 − z)^{γ−1}/B(ρ, γ) dz  if c > 0,
    = z0^n = ∫_0^1 z^n δ_{z0}[dz]  if c = 0,

where we used (3.2.41) and δ_{z0} is the Dirac measure concentrated at z0. Since the
probability measures on [0, 1] are uniquely determined by their moments (see Corol-
lary 1.108) we deduce

    P_{Z∞}[dz] = z^{ρ−1}(1 − z)^{γ−1}/B(ρ, γ) dz  if c > 0,
    P_{Z∞}[dz] = δ_{z0}[dz]  if c = 0.

The distribution in the case c > 0 is the Beta distribution with parameters ρ, γ
discussed in Example 1.122. □
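The Beta limit law is easy to confirm numerically. The sketch below (our illustration, with hypothetical names) simulates many independent Polya urns with r = g = c = 1, for which ρ = γ = 1 and the limit Z∞ should be Beta(1, 1), i.e., uniform on [0, 1], with mean 1/2 and variance 1/12.

```python
import random

def polya_fraction_stats(r: int, g: int, c: int, draws: int, urns: int,
                         seed: int = 5):
    """(sample mean, sample variance) of the red fraction Z_n over many urns;
    for large n these approximate the moments of the Beta(r/c, g/c) limit."""
    rng = random.Random(seed)
    zs = []
    for _ in range(urns):
        red, green = r, g
        for _ in range(draws):
            if rng.random() * (red + green) < red:
                red += c                 # drew red: return it with c extra reds
            else:
                green += c
        zs.append(red / (red + green))
    mean = sum(zs) / urns
    var = sum((z - mean) ** 2 for z in zs) / urns
    return mean, var
```

Changing r, g, c changes the limit law accordingly; the sample moments track those of Beta(r/c, g/c).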
3.3 Continuous time martingales

The study of martingales parametrized by T = [0, ∞) faces a few fundamental


technical difficulties stemming from the fact that the space of parameters is not
countable. To deal with these issues we need to introduce several new concepts.

3.3.1 Generalities about filtered processes


Suppose that (Ω, F, P) is a probability space and F• = (Ft )t≥0 is a filtration of
sigma-subalgebras of S. We denote by Proc(F• ) the collection of random processes
(parametrized by T) that are adapted to the filtration F• . If no confusion is possible,
we will use the simpler notation Proc when referring to adapted processes.
A function f : [0, ∞) → R is called an R-function 4 if it is right continuous with
left limits. It is called an L-function 5 if it is left continuous with right limits.

Definition 3.100. Let X• = Xt : (Ω, F, P) → R t∈[0,∞) be a random process,




not necessarily adapted to the filtration F• .

(i) We say that the random process X• is measurable if the map


X : [0, ∞) × Ω → R, (t, ω) 7→ Xt (ω)
is measurable with respect to the σ-algebra B[0,∞) ⊗ F.
(ii) We say that the random process X• is progressively measurable or progressive
(with respect to the filtration F• ) if for any t > 0, the map
[0, t] × Ω ∋ (s, ω) ↦ Xs (ω) ∈ R
is B[0,t] ⊗ Ft measurable, where B[0,t] denotes the σ-algebra of Borel subsets of
[0, t].
(iii) A subset A ⊂ [0, ∞) × Ω is called progressive if the associated process
I A : [0, ∞) × Ω → R
is progressive.
(iv) We say that the adapted random process X• is an R-process (resp. L-process)
if there exists a negligible subset N ⊂ Ω such that, for any ω ∈ Ω \ N , the
function T ∋ t ↦ Xt (ω) is an R-function (resp. L-function).

⊓⊔

Remark 3.101. The progressive subsets of [0, ∞) × Ω form a σ-subalgebra of


B[0,∞) ⊗ F that we denote by Fprog . Observe that a process is progressively measurable if and only if it is Fprog -measurable. For this reason we will denote by
Proc(Fprog ) or Procprog the collection of progressive processes.
An F• -progressive process is also adapted to the filtration F• so
Proc(Fprog ) ⊂ Proc(F• ). ⊓⊔
4 A.k.a. cadlag function, continue à droite limite à gauche.
5 A.k.a. caglad function, continue à gauche limite à droite.

Proposition 3.102. Suppose that X• ∈ Proc(F• ) is either an R-process or an


L-process. Then X• is a progressive process.

Proof. Assume X is an R-process. The case of L-processes is similar. Fix t ≥ 0.


For each n ∈ N, we subdivide the interval [0, t] into n intervals of the same size. For
n ∈ N, define
Xⁿ : [0, t] × Ω → R,   Xⁿs (ω) = X_{kt/n} (ω) if s ∈ [(k − 1)t/n, kt/n), 1 ≤ k ≤ n,   Xⁿt (ω) = Xt (ω).
Since X• is an R-process we deduce that there exists a negligible subset N ⊂ Ω
such that
lim Xsn (ω) = Xs (ω), ∀s ∈ [0, t], ω ∈ Ω \ N.
n→∞

Clearly the function X n : [0, t] × Ω → R is B[0,t] ⊗ Ft -measurable. It follows that


the a.s. limit X : [0, t] × Ω → R is also B[0,t] ⊗ Ft -measurable, ∀t ≥ 0. ⊓⊔

We have the following nontrivial result, [30].

Theorem 3.103 (Chung-Doob). Suppose that


X• = Xt : (Ω, F, P) → R

t∈[0,∞)

is a measurable process adapted to the filtration F• . Then X• admits a progressive


modification. ⊓⊔

Definition 3.104. Fix a filtration F• = (Ft )t≥0 of the probability space (Ω, F, P).

(i) An F• -stopping time is a random variable T : Ω → [0, ∞] such that


{T ≤ t} ∈ Ft , ∀t ≥ 0.


(ii) An F• -optional time is a random variable T : Ω → [0, ∞] such that


{T < t} ∈ Ft , ∀t > 0.


(iii) If T : Ω → [0, ∞] is a stopping time, then the past before T is the collection


FT ⊂ F∞ consisting of the sets F ∈ F satisfying the property F ∩{T ≤ t} ∈ Ft ,
∀t ≥ 0.

⊓⊔

Lemma 3.105. For any stopping time T adapted to the filtration F• the collection
FT is a σ-algebra. ⊓⊔

The proof is left to the reader as an exercise.

Lemma 3.106. Any stopping time T is an optional time.



Proof. Indeed,

{T < t} = ⋃_{n≥1} { T ≤ t − 1/n },

and {T ≤ t − 1/n} ∈ F_{t−1/n} ⊂ Ft . ⊓⊔

Definition 3.107. Fix a probability space (Ω, F, P) and a filtration F• = (Ft )t≥0
of F. We set
\
Ft+ := Fs , t ≥ 0.
s>t

(i) We say that the filtration F• = (Ft )t≥0 is right-continuous if


Ft = Ft+ , ∀t ≥ 0.
(ii) We say that the filtration Ft is P-complete if the probability space (Ω, F, P) is
P-complete6 and the collection N ⊂ F of P-negligible events is contained in Ft ,
∀t ≥ 0.
(iii) We say that the filtration Ft satisfies the usual conditions (or that it is usual )
if it is both right-continuous and P-complete.

⊓⊔

Remark 3.108. If (Ft )t≥0 is a filtration of the complete probability space (Ω, F, P),
then the usual augmentation of (Ft ) is the minimal filtration (F̂t ) containing (Ft )
and satisfying the usual conditions. More precisely if N ⊂ F is the collection of
probability zero events, then
F̂t = ⋂_{s∈(t,∞)} σ(N, Fs ). ⊓⊔

Proposition 3.109. Consider a random variable T : Ω → [0, ∞]. Then the follow-
ing statements are equivalent.

(i) T is an optional time for (Ft ).


(ii) T is a stopping time for (Ft+ ).

In particular, if Ft is right-continuous, then T is a stopping time if and only if


it is an optional time.7 ⊓⊔

Example 3.110. Suppose that (Xt )t≥0 is a process adapted to F• and Γ ⊂ R. The
(Γ-) début time of (Xt ) is the function

DΓ : Ω → [0, ∞],   DΓ (ω) = inf{ t ≥ 0 ; Xt (ω) ∈ Γ },
6 Recall that this means that any set contained in a P-null subset is measurable.
7 This settles an inconsistency in the existing literature. Many authors refer to stopping times
as optional times, while our optional times are sometimes referred to as weakly optional times.
When the filtration is right continuous all these terms refer to the same concept, that of stopping
time.

and the (Γ-)hitting time of (Xt ) is the function



HΓ : Ω → [0, ∞],   HΓ (ω) = inf{ t > 0 ; Xt (ω) ∈ Γ }.
The following facts are not hard to prove; see [85, Lemma 9.6], [103, Prop. 3.9].

(i) If Γ is open, and the paths of Xt are right continuous, then the début time DΓ
is a stopping time of (Xt ), while the hitting time HΓ is an optional time.
(ii) If Γ is closed, and the paths of Xt are continuous, then the début time DΓ is a
stopping time of (Xt ), while the hitting time HΓ is an optional time.

We deduce from the above that if the filtration Ft is right-continuous and the
paths of (Xt ) are continuous, then both DΓ and HΓ are stopping times if Γ is either
open or closed. ⊓⊔

If the filtration F• satisfies the usual conditions, a much more general result is
true. More precisely, we have the following highly nontrivial result of Dellacherie
and Meyer [39, Thm. IV.50].

Theorem 3.111 (Début Theorem). Suppose that the filtration F• satisfies the
usual conditions and (Xt )t≥0 is an F• -progressive process. Then, for any Borel
subset Γ ⊂ R, the début time DΓ is a stopping time. ⊓⊔

We list below a few elementary properties of stopping times.

Proposition 3.112. Fix a filtered probability space (Ω, F• , P).

(i) If T is a stopping time, then T is also FT -measurable.


(ii) If S is a stopping time and T is an FS -measurable random variable such that
T ≥ S, then T is also a stopping time and FS ⊂ FT .
(iii) Suppose that S, T are stopping times. Then S ∧ T and S ∨ T are also stopping
times and
FS∧T = FS ∩ FT .
(iv) An increasing limit of stopping times is a stopping time while a decreasing
limit of stopping times is an optional time.
(v) Suppose that T is a stopping time. A function

T < ∞ 3 ω 7→ Y (ω) ∈ R
is FT -measurable if and only if, ∀t ≥ 0, the restriction of Y to {T ≤ t} is


Ft -measurable.

Proof. We prove only (i). The rest are left to the reader as an exercise. To prove
that the sublevel set {T ≤ c} is measurable we have to show that for any t ≥ 0 the
intersection
{T ≤ c} ∩ {T ≤ t} = {T ≤ t ∧ c}

is Ft -measurable. This is a consequence of the fact that T is a stopping time for the filtration (Ft ). ⊓⊔

Definition 3.113. Fix a filtered probability space (Ω, F• , P). Given a random process (Xt )t≥0 and an F• -stopping time T : Ω → [0, ∞] we denote by XT the random variable

XT (ω) = X_{T (ω)} (ω) I_{{T (ω)<∞}} ,  i.e., XT (ω) = X_{T (ω)} (ω) if T (ω) < ∞ and XT (ω) = 0 if T (ω) = ∞. ⊓⊔

The proof of the following result is left to the reader as an exercise.

Proposition 3.114. If (Xt )t≥0 is a progressively measurable random process and


T is a stopping time, then the random variable XT is FT -measurable. ⊓⊔

3.3.2 The Brownian motion as a filtered process


Let us illustrate the concepts introduced in the previous subsection on the stochastic
process defined by the Brownian motion. The following result should be obvious
from the definition of the Brownian motion.

Proposition 3.115. Suppose that B is a Brownian motion. Then the following


hold.
Symmetry. The stochastic process −B is also a Brownian motion.
Time rescaling. For any c > 0 the rescaled Brownian motion
B^c_t := (1/√c) B_{ct}
is another standard Brownian motion.
Time inversion. The stochastic process
Xt := tB_{1/t} for t > 0,   X0 := 0,
is another standard Brownian motion.

Proof. The first two statements are immediate. The last statement concerning
time inversion requires a bit more work. We follow the approach in the proof of [33,
Thm. VIII.1.6].
Observe first that (Xt ) is a Gaussian process with mean zero and covariances
 
E Xs Xt = min(s, t), ∀s, t ≥ 0.
Thus it suffices to show that (Xt ) is a.s. continuous, i.e.,
lim_{t↘0} Xt = 0  a.s.

Equivalently, we will show that

lim_{t→∞} (1/t) Bt = 0  a.s.

Note that for n ∈ N and t ∈ (n, n + 1] we have

(1/t) |Bt | ≤ (1/n) |Bn | + (1/n) |Bt − Bn | ≤ (1/n) |Bn | + (1/n) sup_{s∈[0,1]} |Bn+s − Bn |

(the process (Bt ) is a.s. continuous)

= (1/n) |Bn | + (1/n) sup_{s∈[0,1]∩Q} |Bn+s − Bn |.

The Strong Law of Large Numbers shows that

lim_{n→∞} (1/n) Bn = 0  a.s.

For each m = 1, 2, . . . , the process

D^{(m)}_k := B_{n+k/m} − Bn = ∑_{j=1}^{k} ( B_{n+j/m} − B_{n+(j−1)/m} ),   0 ≤ k ≤ m,

is a martingale in k since the above summands have mean zero and are independent. Applying Doob's maximal inequalities (3.2.31) to the discrete submartingales

Y_k := |D^{(m)}_k |² ,   0 ≤ k ≤ m,  m = 1, 2, . . . ,

we deduce that, for any ε > 0,

P[ sup_{s∈[0,1]} |Bn+s − Bn | > nε ] = P[ sup_{s∈[0,1]∩Q} |Bn+s − Bn |² > n²ε² ]
≤ (1/(n²ε²)) E[ |Bn+1 − Bn |² ] = 1/(n²ε²).

Since ∑_{n≥1} 1/n² < ∞ we deduce from the Borel-Cantelli Lemma that

lim_{n→∞} (1/n) sup_{s∈[0,1]} |Bn+s − Bn | = 0,  a.s. ⊓⊔
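The rescaling invariance used throughout this section can be checked by simulation. The following sketch is not from the book; the discretization step and the sample counts are arbitrary choices. It verifies that the rescaled value B^c_t = c^{−1/2} B_{ct} has variance t, as a standard Brownian motion must.

```python
import random

def brownian_at(t, n, rng):
    """Sample B_t as a sum of n independent N(0, t/n) increments."""
    dt = t / n
    return sum(rng.gauss(0.0, dt ** 0.5) for _ in range(n))

rng = random.Random(1)
c, t = 4.0, 1.5
# The rescaled value B^c_t = c^{-1/2} B_{ct} should again have variance t.
vals = [brownian_at(c * t, 60, rng) / c ** 0.5 for _ in range(6000)]
var = sum(v * v for v in vals) / len(vals)
print(var)   # should be close to t = 1.5
```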

Theorem 3.116. Suppose that B : [0, ∞) × Ω → R is a Brownian motion and


(Ω, F, P) is a complete probability space. Let N denote the collection of P-negligible
events. We set

Ft = σ( N, Bs , 0 ≤ s ≤ t ).


Then the filtration (Ft )t≥0 satisfies the usual conditions.



Proof. We follow the approach in the proof of [33, Thm. VII.3.20]. It suffices to
prove that (Ft ) is right-continuous, i.e.,
\
Ft0 = Ft .
t>t0

We set
G = Ft0 ,   Gn = σ( B_{t0 +2^{−n}} − B_{t0 +2^{−n−1}} ),  n ∈ N.


Clearly the σ-algebras G, G1 , . . . , are independent. Set


Tn := σ(G, Gn+1 , Gn+2 , . . . ),   T∞ := ⋂_{n∈N} Tn .

From Corollary 3.61 we deduce that Ft0 = T∞ . On the other hand, T∞ ⊃ Ft0 + so
Ft0 + = Ft0 . t
u

Corollary 3.117 (Blumenthal's 0-1 law). If H ∈ F0+ , then P[H] ∈ {0, 1}. ⊓⊔

Proposition 3.118. Suppose that (Bt )t≥0 is a standard Brownian motion and
Ft = σ( Bs , 0 ≤ s ≤ t ).


Then the following hold.

(i) For any ε > 0 we have


" #  
P sup Bs > 0 = P inf Bs < 0 = 1.
s∈[0,ε] s∈[0,ε]

(ii) For any a ∈ R we set


Ta := inf{ t ≥ 0 : Bt = a }.

Then
 
P[ Ta < ∞ ] = 1,  ∀a ∈ R.
In particular, a.s.,
lim sup Bt = ∞, lim inf Bt = −∞.
t→∞ t→∞

Proof. (i) For any c ≠ 0, the rescaled process

B^c_t := (1/c) B_{c² t} ,  t ≥ 0
is also a standard Brownian motion. Note that since the paths of Bt are continuous
we have
sup_{t∈[0,1]} Bt = sup_{t∈Q∩[0,1]} Bt .

Thus the set

{ ω ; sup_{t∈[0,1]} Bt (ω) > 0 }

is a Brownian event. The discussion in Remark 2.76 shows that

P[ sup_{t∈[0,1]} Bt > 0 ] = P[ sup_{t∈[0,1]} B^c_t > 0 ],  ∀c ≠ 0.   (3.3.1)

If we let c = −1 in the above equality we deduce

P[ sup_{t∈[0,1]} Bt > 0 ] = P[ inf_{t∈[0,1]} Bt < 0 ].   (3.3.2)

If we let c = n, n ∈ N, we deduce

P[ sup_{t∈[0,1]} Bt > 0 ] = P[ sup_{t∈[0,1/n]} Bt > 0 ],  ∀n > 0.   (3.3.3)

We denote by En the Brownian event { sup_{t∈[0,1/n]} Bt > 0 }. Clearly


E1 ⊃ E2 ⊃ · · · ⊃ En ⊃ · · ·
and En ∈ F1/n . We deduce from (3.3.3) that P[En ] = P[E1 ], ∀n. If we set E∞ := ⋂_n En ,

then we deduce that E∞ ∈ F0+ and P[E∞ ] = P[E1 ]. Blumenthal's 0-1 theorem implies that

P[En ] = P[E∞ ] ∈ {0, 1}.

Now observe that E1 ⊃ {B1/2 > 0}, so

P[E1 ] ≥ P[ B1/2 > 0 ] = 1/2 > 0.
Hence
" #  
P sup Bt > 0 = P inf Bt < 0 = 1, ∀n ∈ N. (3.3.4)
t∈[0,1/n] t∈[0,1/n]

This shows that a path of the Brownian motion oscillates wildly.


(ii) We have

1 = P[ sup_{0≤s≤1} Bs > 0 ] = lim_{δ↘0} P[ sup_{0≤s≤1} Bs > δ ],

where the second is an increasing limit. The rescaling invariance of the Brownian motion implies

P[ sup_{0≤s≤1} Bs > δ ] = P[ sup_{0≤s≤1/δ²} B^δ_s > 1 ].

We deduce

P[ sup_{s≥0} Bs > 1 ] = lim_{δ↘0} P[ sup_{0≤s≤1/δ²} B^δ_s > 1 ] = 1.

Another rescaling argument shows that


 
P[ sup_{s≥0} Bs > M ] = 1,  ∀M > 0.

Replacing B by −B we deduce
 
P[ inf_{s≥0} Bs < −M ] = 1,  ∀M > 0.

The conclusion (ii) is now obvious. ⊓⊔

Remark 3.119. The above result shows that, with probability 1 the Brownian
motion has a zero on any arbitrarily small interval [0, ε]. As a matter of fact, the
set of zeros of a Brownian motion is a large set: its Hausdorff dimension is a.s. 1/2, [118, Thm. 4.24]. ⊓⊔

Let us observe that if (Bt )t≥0 is a Brownian motion, then for any t0 ≥ 0, the
process

Bt+t0 − Bt0 t≥0

is also a Brownian motion, independent of σ Bs , 0 ≤ s ≤ t0 . We will refer to this
elementary fact as the simple Markov property. We want to show that a stronger
result holds where t0 is allowed to be random.

Theorem 3.120 (The strong Markov property). Suppose that (Bt )t≥0 is a standard Brownian motion and T is a stopping time with respect to the filtration Ft = σ(Bs , 0 ≤ s ≤ t) such that P[T < ∞] > 0. For every t ≥ 0 we set

B^{(T )}_t := I_{{T <∞}} ( BT +t − BT ).

Then, with respect to the probability measure P[ − ‖ T < ∞ ], the process (B^{(T )}_t ) is a standard Brownian motion, independent of FT .

Proof. We follow the approach in [103, Thm. 2.20].

Lemma 3.121. Fix A ∈ FT . Let F : Rp → R be a bounded continuous function.


Then, ∀t1 , . . . , tp ≥ 0, we have
E[ I_A I_{{T <∞}} F( B^{(T )}_{t1} , . . . , B^{(T )}_{tp} ) ] = P[ A ∩ {T < ∞} ] E[ F( Bt1 , . . . , Btp ) ].   (3.3.5)
⊓⊔

Let us show first that the conclusions of the theorem follow from the above lemma. Set S∞ := {T < ∞}. Assume first that P[S∞ ] = 1. Then (3.3.5) reads

E[ I_A F( B^{(T )}_{t1} , . . . , B^{(T )}_{tp} ) ] = P[A] E[ F( Bt1 , . . . , Btp ) ].   (3.3.6)

Indeed, if we set A = Ω in (3.3.6) we deduce that (B^{(T )}_t ) is a Brownian motion. In particular, for every choice of t1 , . . . , tp ≥ 0, the vectors

( B^{(T )}_{t1} , . . . , B^{(T )}_{tp} )  and  ( Bt1 , . . . , Btp )

have the same distribution. Next, (3.3.6) implies that for every choice of t1 , . . . , tp ≥ 0 the vector ( B^{(T )}_{t1} , . . . , B^{(T )}_{tp} ) is independent of FT .
If P[S∞ ] < 1, and we denote by E_{S∞} the expectation with respect to the probability measure P[ − ‖ S∞ ], then (3.3.5) implies

E_{S∞}[ I_A F( B^{(T )}_{t1} , . . . , B^{(T )}_{tp} ) ] = P[ A ‖ S∞ ] E[ F( Bt1 , . . . , Btp ) ].

Arguing as before we reach the conclusions of Theorem 3.120 assuming the validity of Lemma 3.121. ⊓⊔

Proof of Lemma 3.121. For the clarity of exposition we discuss only the case P[S∞ ] = 1. The case P[S∞ ] < 1 requires no new ideas. The details can be safely left to the reader.
For every t ≥ 0 and any n ∈ N we denote by [t]n the smallest rational number of the form k/2ⁿ that is ≥ t. Note that the quantities [T ]n are stopping times: stopping
the process at [T ]n corresponds to stopping the process at the first time of the form
k/2n after T . Then
lim [T ]n = T
n→∞

and
F( B^{(T )}_{t1} , . . . , B^{(T )}_{tp} ) = lim_{n→∞} F( B^{([T ]n )}_{t1} , . . . , B^{([T ]n )}_{tp} ).

From the Dominated Convergence theorem we deduce that

E[ I_A F( B^{(T )}_{t1} , . . . , B^{(T )}_{tp} ) ] = lim_{n→∞} E[ I_A F( B^{([T ]n )}_{t1} , . . . , B^{([T ]n )}_{tp} ) ]
= lim_{n→∞} ∑_{k=0}^{∞} E[ I_A I_{{(k−1)2^{−n} <T ≤k2^{−n}}} F( B^{([T ]n )}_{t1} , . . . , B^{([T ]n )}_{tp} ) ].

Observe now that if A ∈ FT , then the event

A_{k,n} := A ∩ { (k − 1)2^{−n} < T ≤ k2^{−n} } = A ∩ { T ≤ k2^{−n} } ∩ { T > (k − 1)2^{−n} }

is F_{k2^{−n}} -measurable.
From the simple Markov property of the Brownian motion we deduce

E[ I_{A_{k,n}} F( B^{([T ]n )}_{t1} , . . . , B^{([T ]n )}_{tp} ) ]
= E[ I_{A_{k,n}} F( B_{t1 +k2^{−n}} − B_{k2^{−n}} , . . . , B_{tp +k2^{−n}} − B_{k2^{−n}} ) ]
= P[ A_{k,n} ] E[ F( Bt1 , . . . , Btp ) ].
Observing that

∑_{k=0}^{∞} P[ A_{k,n} ] = P[A],

we deduce

∑_{k=0}^{∞} E[ I_A I_{{(k−1)2^{−n} <T ≤k2^{−n}}} F( B^{([T ]n )}_{t1} , . . . , B^{([T ]n )}_{tp} ) ]
= ∑_{k=0}^{∞} E[ I_{A_{k,n}} F( B^{([T ]n )}_{t1} , . . . , B^{([T ]n )}_{tp} ) ]
= ∑_{k=0}^{∞} P[ A_{k,n} ] E[ F( Bt1 , . . . , Btp ) ] = P[A] E[ F( Bt1 , . . . , Btp ) ]. ⊓⊔

Let us present some applications of the strong Markov property. For
a ∈ R we define the hitting time

Ta := inf{ t > 0 ; Bt = a }.
This is a stopping time for the standard Brownian motion Bt and Proposi-
tion 3.118(ii) shows that
 
P[ Ta < ∞ ] = 1.

Theorem 3.122 (Reflection Principle). Fix a ∈ R. If (Bt )t≥0 is a standard


Brownian motion, then the process
B̃t = Bt for t < Ta ,   B̃t = 2a − Bt for t ≥ Ta ,   (3.3.7)
is also a standard Brownian motion.

Proof. We follow the approach in [135, I.13]. Consider the processes


Yt = Bt I [[0,Ta ]] , Zs = Bs+Ta − a, s ≥ 0.
By the strong Markov property, Z is a standard Brownian motion, independent
of Y . The process −Z is also a Brownian motion independent of Y . Thus, the
processes (Y, Z) and (Y, −Z) have the same distribution. The map

(Y, Z) 7→ ϕ(Y, Z) := Yt I [[0,Ta ]] + a + Zt−Ta I ]]Ta ,∞[[
produces a continuous process, which therefore has the same law as ϕ(Y, −Z). Now observe that ϕ(Y, Z) = B and ϕ(Y, −Z) = B̃. ⊓⊔

Remark 3.123. The above result is called the reflection principle for a simple
reason. In the region t ≥ Ta the graph of the function t → B̃t , viewed as a curve
in the Cartesian plane with coordinates (t, x), is the reflection of the graph of Bt in
the horizontal line x = a. This reflection principle is intimately related to André’s
reflection trick. ⊓⊔

Corollary 3.124. Define

St := sup_{u≤t} Bu .

Then, for any a, y, t ≥ 0 we have


   
P St ≥ a, Bt ≤ a − y = P Bt ≥ a + y . (3.3.8)

In particular, St has the same distribution as |Bt |.

Proof. Note that St ≥ a if and only if Ta ≤ t. We have

P[ St ≥ a, Bt ≤ a − y ] = P[ Ta ≤ t, Bt ≤ a − y ] = P[ B̃t ≥ a + y ]  (by (3.3.7))

(use the Reflection Principle)

= P[ Bt ≥ a + y ].

Now observe that

P[ St ≥ a ] = P[ St ≥ a, Bt ≥ a ] + P[ St ≥ a, Bt ≤ a ],

where P[ St ≥ a, Bt ≥ a ] = P[ Bt ≥ a ] and, by (3.3.8) with y = 0, P[ St ≥ a, Bt ≤ a ] = P[ Bt ≥ a ]. Hence

P[ St ≥ a ] = 2P[ Bt ≥ a ] = P[ Bt ≥ a ] + P[ Bt ≤ −a ] = P[ |Bt | ≥ a ]. ⊓⊔
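Corollary 3.124 can be illustrated numerically. The sketch below is not from the book; the grid and sample sizes are arbitrary, and a discrete-time maximum slightly undershoots the true supremum. It compares the empirical means of S₁ and |B₁|, which should agree since the two random variables have the same distribution.

```python
import random

rng = random.Random(2)
n_steps, n_paths = 600, 1500
dt = 1.0 / n_steps
mean_sup, mean_abs = 0.0, 0.0
for _ in range(n_paths):
    b, s = 0.0, 0.0
    for _ in range(n_steps):
        b += rng.gauss(0.0, dt ** 0.5)
        s = max(s, b)          # running supremum S_t
    mean_sup += s
    mean_abs += abs(b)         # |B_1|
mean_sup /= n_paths
mean_abs /= n_paths
print(mean_sup, mean_abs)      # both near E[|B_1|] = sqrt(2/pi)
```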

Corollary 3.125. For every a > 0 the stopping time Ta has the same distribution as a²/B1² and has density

fa (t) = ( a/√(2πt³) ) exp( −a²/(2t) ) I_{{t>0}} .

Proof. Note that

P[ Ta ≤ t ] = P[ St ≥ a ] = P[ |Bt | ≥ a ] = P[ Bt² ≥ a² ] = P[ tB1² ≥ a² ] = P[ a²/B1² ≤ t ].

The statement about fa now follows from the fact that B1 is a standard normal random variable. ⊓⊔
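The density fa can be sanity-checked deterministically: integrating it over (0, t₀] must reproduce P[Ta ≤ t₀] = P[|B_{t₀}| ≥ a] = erfc(a/√(2t₀)). A minimal midpoint-rule sketch (the step count is an arbitrary choice):

```python
import math

def f_a(a, t):
    """Density of the hitting time T_a (Corollary 3.125)."""
    return a / math.sqrt(2 * math.pi * t ** 3) * math.exp(-a * a / (2 * t))

a, t0 = 1.0, 2.0
n = 200000
h = t0 / n
# Midpoint-rule approximation of P[T_a <= t0]; the integrand vanishes
# rapidly as t -> 0, so the improper endpoint causes no trouble.
cdf = sum(f_a(a, (k + 0.5) * h) for k in range(n)) * h
target = math.erfc(a / math.sqrt(2 * t0))   # P[|B_{t0}| >= a]
print(cdf, target)
```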

3.3.3 Definition and examples of continuous time martingales


Fix a filtered probability space Ω, F, (Ft )t≥0 , P .


Definition 3.126. A random process (Xt )t≥0 adapted to the filtration (Ft )t≥0 such
that Xt ∈ L1 , ∀t, is called a

• martingale if E[ Xt ‖ Fs ] = Xs , ∀0 ≤ s < t,
• submartingale if E[ Xt ‖ Fs ] ≥ Xs , ∀0 ≤ s < t,
• supermartingale if E[ Xt ‖ Fs ] ≤ Xs , ∀0 ≤ s < t.

⊓⊔

Example 3.127 (Uniformly integrable martingales). To any integrable random variable X we can associate the martingale Xt := E[ X ‖ Ft ]. ⊓⊔

Example 3.128 (Processes with independent increments). Suppose that


the random process (Zt )t≥0 has independent increments, i.e., for any n ∈ N and
any
0 ≤ s1 < t1 ≤ s2 < t2 ≤ · · · ≤ sn < tn ,
the increments
Zt1 − Zs1 , Zt2 − Zs2 , . . . , Ztn − Zsn
are independent. The process (Zt ) is adapted to the natural filtration
(Ft )t≥0 , Ft = σ Zs , s ≤ t .


We deduce that, ∀0 ≤ s < t, the increment Zt − Zs is independent of Fs so

E[ Zt ‖ Fs ] − Zs = E[ (Zt − Zs ) ‖ Fs ] = E[ Zt − Zs ].

Hence

E[ Zt − E[Zt ] ‖ Fs ] = Zs − E[Zs ],  ∀0 ≤ s < t.   (3.3.9)
Then

(i) if Zt ∈ L¹ , ∀t ≥ 0, then Z̃t := Zt − E[Zt ] is a martingale;
(ii) if Zt ∈ L² , ∀t ≥ 0, then Yt := Z̃t² − E[ Z̃t² ] is a martingale;
(iii) if, for some θ ∈ R, we have E[ e^{θZt} ] < ∞, ∀t ≥ 0, then

Xt := e^{θZt} / E[ e^{θZt} ]

is a martingale.

The case (i) follows from (3.3.9). The case (iii) is the continuous time analogue of Example 3.7 and the proof is similar. To prove (ii) note that

E[ Z̃t² ‖ Fs ] = E[ (Z̃s + Z̃t − Z̃s )² ‖ Fs ]
= Z̃s² + 2Z̃s E[ (Z̃t − Z̃s ) ‖ Fs ] + E[ (Z̃t − Z̃s )² ‖ Fs ] = Z̃s² + E[ (Z̃t − Z̃s )² ]

(the middle term vanishes because E[ (Z̃t − Z̃s ) ‖ Fs ] = 0)

= Z̃s² + E[ Z̃t² ] − 2E[ Z̃s Z̃t ] + E[ Z̃s² ].

Since E[ Z̃s Z̃t ] = E[ Z̃s E[ Z̃t ‖ Fs ] ] = E[ Z̃s² ], this equals Z̃s² + E[ Z̃t² ] − E[ Z̃s² ]. Hence

E[ Z̃t² − E[ Z̃t² ] ‖ Fs ] = Z̃s² − E[ Z̃s² ].

Classical examples of processes with independent increments are the Brownian mo-
tion, the Poisson process, or more generally the Lévy processes, [33, Chap. VII].
If Bt is a 1-dimensional Brownian motion started at 0, adapted to Ft , then Bt is
a normal random variable with mean 0 and variance t, for each t > 0. The moment
generating function of Bt is
M_{Bt} (θ) = E[ e^{θBt} ] = e^{θ² t/2} .
 

We deduce from the above that


Bt ,   Bt² − t,   e^{θBt − θ² t/2}

are martingales, ∀θ ∈ R. The martingale


( e^{θBt − θ² t/2} )_{t≥0}
is called the exponential martingale of the Brownian motion.
Note that if we set λ := θ√t and X = Bt /√t, then

e^{θBt − θ² t/2} = e^{λX−λ²/2} = ∑_{n≥0} Hn (X) λⁿ/n! ,  by (1.6.5),

where Hn (x) is the n-th Hermite polynomial (1.6.4). We can rewrite the above
equality as
e^{θBt − θ² t/2} = ∑_{n≥0} (θⁿ/n!) Mn (t),   Mn (t) = t^{n/2} Hn ( Bt /√t ).

Each of the coefficients Mn (t) is a continuous time martingale. Note that


M1 (t) = Bt ,  M2 (t) = Bt² − t. ⊓⊔
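The Hermite expansion above is a purely analytic identity and can be verified directly. The sketch below, not from the book, computes the probabilists' Hermite polynomials via the standard recurrence H_{n+1}(x) = xH_n(x) − nH_{n−1}(x) and compares the truncated series with the closed form e^{λx−λ²/2}.

```python
import math

def hermite(n, x):
    """Probabilists' Hermite polynomial H_n(x), via the recurrence
    H_{n+1}(x) = x H_n(x) - n H_{n-1}(x), with H_0 = 1, H_1 = x."""
    h_prev, h = 1.0, x
    if n == 0:
        return h_prev
    for k in range(1, n):
        h_prev, h = h, x * h - k * h_prev
    return h

lam, x = 0.7, 1.3
series = sum(hermite(n, x) * lam ** n / math.factorial(n) for n in range(30))
closed = math.exp(lam * x - lam * lam / 2)
print(series, closed)   # the truncated series matches the closed form
```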

Example 3.129 (New submartingales from old). If (Xt )t≥0 is a martingale 


and f : R → R is a convex function such that f (Xt ) ∈ L1 , ∀t ≥ 0, then f (Xt ) t≥0
is a submartingale. If (Xt )t≥0 is only a submartingale and additionally, f is non-
decreasing, then f (Xt ) t≥0 is a submartingale. t
u

3.3.4 Limit theorems


Fix a filtered probability space Ω, (Ft )t≥0 , F, P .


Definition 3.130. An R-(sub/super)martingale is a (sub/super)martingale


(Xt )t≥0 adapted to the filtration (Ft )t≥0 such that the paths of Xt are a.s. R-
functions. t
u

Remark 3.131. Suppose that (Xt )t≥0 is an R-submartingale. Fix a negligible set
N ⊂ Ω such that t 7→ Xt (ω) is an R-function for any ω ∈ Ω \ N. Fix a dense
countable subset D of [0, ∞).
Note that for every open interval I ⊂ [0, ∞) we have

sup_{t∈D∩I} Xt (ω) = sup_{t∈I} Xt (ω),   inf_{t∈D∩I} Xt (ω) = inf_{t∈I} Xt (ω),   ∀ω ∈ Ω \ N.   (3.3.10)

This shows that (Xt )t≥0 is a separable process in the sense of Doob, [47, II.2]. This
means that there exist

• a countable dense subset D ⊂ [0, ∞), and


• a negligible subset N ⊂ Ω,

such that, for any closed interval I ⊂ R, and any open subset O of [0, ∞), the sets

{ ω ; Xs (ω) ∈ I, ∀s ∈ D ∩ O }  and  { ω ; Xt (ω) ∈ I, ∀t ∈ O }


 

differ by a subset of N. A dense countable subset D with the above property is


called a separability set. ⊓⊔

Before we proceed to investigate the properties of R-submartingales we want to understand how restrictive the assumption that the paths are a.s. R-functions is.
The proof of Theorem 3.90 shows that if (Xt )t≥0 is an R-submartingale, then, for
any bounded
  set S ⊂ [0, ∞) the family (Xs )s∈S is UI. This implies that the function
t 7→ E Xt is an R-function. We have a more precise result, [103, Sec. 3.3], [135,
II.65-67].

Theorem 3.132 (Doob’s regularization theorem). If the filtration (Ft )t≥0


satisfies the usual conditions, then a submartingale (Xt )t≥0 adapted to this filtration
 
admits an R-submartingale modification if and only if the function t ↦ E[Xt ] is right continuous. ⊓⊔

Theorem 3.133 (Doob’s maximal inequality). Suppose that (Xt )t≥0 is an R-


submartingale. Then, for any a, t > 0 we have
" #
aP sup |Xs | > a ≤ E |Xt+ ≤ E |Xt | + E |X0 | .
     
(3.3.11)
s∈[0,t]

Proof. For any m ∈ N we set


 
Dm := { 0, t/m, . . . , (m − 1)t/m, t },   D := ⋃_{m∈N} Dm .

The discrete Doob maximal inequality (3.2.31) implies that


   
aP[ sup_{s∈Dm} |Xs | > a ] ≤ 2E[ Xt⁺ ] − E[ X0 ]  and  aP[ sup_{s∈D} |Xs | > a ] ≤ 2E[ Xt⁺ ] − E[ X0 ].

As observed in Remark 3.131, (X• ) is a separable process, so (3.3.10) implies

P[ sup_{s∈D} |Xs | > a ] = P[ sup_{s∈[0,t]} |Xs | > a ]. ⊓⊔

Theorem 3.134 (Doob’s Lp -inequality). Suppose that (Xt )t≥0 is an R-


martingale. Then, for any t > 0 and p > 1 we have
" # p1
1 1
E sup |Xs |p ≤ qkXt kLp , =1− . (3.3.12)
s∈[0,t] q p

Proof. Argue as in the proof of Theorem 3.133 by relying on the separability of


(X• ) and the discrete Lp -inequality (3.2.33). ⊓⊔
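For p = 2 and the Brownian martingale Xt = Bt, (3.3.12) reads E[sup_{s≤1} |Bs|²] ≤ 4E[B₁²] = 4. A Monte Carlo sketch (not from the book; the discretization and sample sizes are arbitrary choices) illustrating that the bound holds with room to spare:

```python
import random

rng = random.Random(3)
n_steps, n_paths = 500, 2000
dt = 1.0 / n_steps
second_moment = 0.0
for _ in range(n_paths):
    b, m = 0.0, 0.0
    for _ in range(n_steps):
        b += rng.gauss(0.0, dt ** 0.5)
        m = max(m, abs(b))     # running sup of |B_s| over the grid
    second_moment += m * m
second_moment /= n_paths
# Doob's L^2 bound: E[sup |B_s|^2] <= 4 E[B_1^2] = 4; the true value lies
# strictly between E[B_1^2] = 1 and the bound.
print(second_moment)
```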

Theorem 3.135. Suppose that (Xt )t≥0 is an R-submartingale and


 
sup_{t>0} E[ |Xt | ] < ∞.   (3.3.13)

Then there exists an integrable random variable X∞ such that


lim_{t→∞} Xt = X∞  a.s.

Proof. For any m ∈ N we set


Dm := 2^{−m} N,  m ∈ N,   D = ⋃_{m∈N} Dm .

For any function f : [0, ∞) → R, any rational numbers a < b and any S ⊂ [0, ∞) we denote by N (f, S, [a, b]) the supremum of the set of integers k such that there exist
s1 < t1 < · · · < sk < tk
in S such that f (si ) ≤ a, f (ti ) ≥ b, ∀i = 1, . . . , k.
For m ∈ N we set Nm (f, [a, b]) := N (f, Dm , [a, b]). Equivalently, Nm (f, [a, b]) is the number of upcrossings of the strip [a, b] by the restriction of f to Dm . Note that
 
Nm X, [a, b] ≤ Nm+1 X, [a, b] , ∀m,
and

N (f, D, [a, b]) = lim Nm X, [a, b] .
m→∞

Doob's upcrossing inequality (3.2.2) implies

(b − a) E[ Nm (X, [a, b]) ] ≤ sup_{t>0} E[ (Xt − a)⁺ ] − E[ (X0 − a)⁺ ],  ∀m ∈ N.

Letting m → ∞ we deduce from the Monotone Convergence Theorem

(b − a) E[ N (X, D, [a, b]) ] ≤ sup_{t>0} E[ (Xt − a)⁺ ] − E[ (X0 − a)⁺ ] < ∞.

Thus N (X, D, [a, b]) < ∞ a.s. so the limit

X∞ := lim_{t→∞, t∈D} Xt

exists a.s. We leave it to the reader to convince her/himself that, since the process X• is separable (see Remark 3.131), the limit

X∞ = lim_{t→∞} Xt

exists a.s. The boundedness assumption (3.3.13) coupled with Fatou's lemma implies that X∞ is integrable. ⊓⊔

The above theorem implies immediately the following continuous time counter-
part of Theorem 3.59.

Theorem 3.136 (UI martingales). Suppose that (Xt )t≥0 is a UI R-martingale. Then

X∞ = lim_{t→∞} Xt

exists a.s. and in L¹ , and

Xt = E[ X∞ ‖ Ft ],  ∀t > 0. ⊓⊔

3.3.5 Sampling and stopping


Suppose that (Xt )t≥0 is an R-submartingale such that
X∞ = lim_{t→∞} Xt

exists a.s. Let T : Ω → [0, ∞] be a stopping time adapted to the filtration (Ft ).
The optional sampling of X• at T is the random variable
XT (ω) = I_{{T <∞}} X_{T (ω)} (ω) + I_{{T =∞}} X∞ (ω).

Theorem 3.137 (Optional sampling). Suppose that (Xt )t≥0 is a UI R-martingale and S, T are stopping times such that S ≤ T . Then the following hold.

(i) The random variables XS , XT are integrable.
(ii) XS = E[ XT ‖ FS ] = E[ X∞ ‖ FS ].
(iii) E[XS ] = E[X∞ ] = E[X0 ].

Proof. We set

Sn = ∑_{k=0}^{∞} ((k + 1)/2ⁿ) I_{{k2^{−n} <S≤(k+1)2^{−n}}} + ∞ · I_{{S=∞}} ,

Tn = ∑_{k=0}^{∞} ((k + 1)/2ⁿ) I_{{k2^{−n} <T ≤(k+1)2^{−n}}} + ∞ · I_{{T =∞}} .
Observe that Sn ≥ S, Tn ≥ T and Sn ≤ Tn , ∀n.
Let us show that Sn is FS -measurable and Tn is FT -measurable. In other words, we have to show that

{Sn ≤ c} ∩ {S ≤ s} ∈ Fs ,  ∀c, s ≥ 0.
Note that

{S ≤ s} ∩ {Sn ≤ c} = {S ≤ s} ∩ ⋃_{(k+1)2^{−n} ≤c} { k2^{−n} < S ≤ (k + 1)2^{−n} }
= ⋃_{(k+1)2^{−n} ≤c} { k2^{−n} < S ≤ min( s, (k + 1)2^{−n} ) } ∈ Fs .

Proposition 3.112(ii) now implies that Sn is a stopping time. A similar argument


shows that Tn is a stopping time. Note that
Sn ↘ S and Tn ↘ T as n → ∞.
−n
For n ∈ N0 set Dn = 2 N0 . For each n ∈ N0 the stochastic process
Xⁿ := (Xt )t∈Dn ,


is a UI discrete martingale with respect to the filtration F•n := (Ft )t∈Dn . The
above arguments show that Sn and Tn are stopping times with respect to these
filtrations. We deduce from the discrete Optional Sampling Theorems 3.64 that
XSn = XⁿSn = E[ XⁿTn ‖ FⁿSn ] = E[ XTn ‖ FSn ],

and

XSn = E[ X∞ ‖ FSn ],   XTn = E[ X∞ ‖ FTn ].
Now observe that since (Xt ) is a.s. right continuous we have
XS = lim XSn and XT = lim XTn a.s.
n→∞ n→∞

The families (XSn ) and (XTn ) are UI so the above convergences also hold in L¹ . Since FS ⊂ FSn ⊂ FTn and the conditional expectation map

E[ − ‖ FS ] : L¹ (Ω, F, P) → L¹ (Ω, FS , P)

is a contraction we deduce

XS = E[ XS ‖ FS ] = lim_{n→∞} E[ XSn ‖ FS ] = lim_{n→∞} E[ XTn ‖ FS ] = E[ XT ‖ FS ],

where the above convergences are in L¹ . ⊓⊔

Corollary 3.138. Suppose that (Xt )t≥0 is an R-martingale and S, T are bounded stopping times such that S ≤ T a.s. Then the following hold.

(i) The random variables XS , XT are integrable.
(ii) XS = E[ XT ‖ FS ].

Proof. Fix t0 > 0 such that S, T ≤ t0 a.s. Then the stopped process Xt∧t0 is a UI R-martingale. The conclusions now follow from Theorem 3.137 applied to this stopped martingale. ⊓⊔

Corollary 3.139 (Optional stopping). Let (Xt )t≥0 be an R-martingale compat-


ible with the filtration (Ft )t≥0 . Then the following hold.

(i) The stopped process


XtT := XT ∧t
is an R-martingale compatible with the same filtration (Ft )t≥0 .
(ii) If additionally (Xt )t≥0 is UI, then so is the stopped process and we have
 
XT ∧t = E[ XT ‖ Ft ],   (3.3.14)

XT = lim_{t→∞} Xt   a.s. and in L¹ .   (3.3.15)

Proof. We begin by proving (ii). For s < t, the stopping times s ∧ T and t ∧ T are
bounded and s ∧ T ≤ t ∧ T . The random variables Xt∧T are Ft∧T -measurable and
thus Ft -measurable since Ft∧T ⊂ Ft . To prove (3.3.14) it suffices to check that for
any A ∈ Ft we have
   
E[ XT I_A ] = E[ Xt∧T I_A ].
Decompose I_A = I_{A∩{T ≤t}} + I_{A∩{T >t}} . We have

XT I_{A∩{T ≤t}} = Xt∧T I_{A∩{T ≤t}} ,

so that

E[ XT I_{A∩{T ≤t}} ] = E[ Xt∧T I_{A∩{T ≤t}} ].   (3.3.16)
On the other hand, we deduce from Theorem 3.137 that
Xt∧T = E[ XT ‖ Ft∧T ].
 

Now observe that


A ∩ {T > t} ∈ Ft and A ∩ {T > t} ∈ FT ,
so A ∩ {T > t} ∈ Ft ∩ FT = Ft∧T . Hence
Xt∧T I_{A∩{T >t}} = E[ XT I_{A∩{T >t}} ‖ Ft∧T ],

E[ Xt∧T I_{A∩{T >t}} ] = E[ XT I_{A∩{T >t}} ].   (3.3.17)

The desired conclusion follows by adding (3.3.16) and (3.3.17). The assertion
(3.3.15) follows from the fact that the stopped martingale X T is UI. Part (i) now
follows from (ii) applied to the sequence of UI martingales
(Xtn )t≥0 := (Xn∧t )t≥0 , n ∈ N.
Indeed, the martingales X n are compatible with Ft and for s < t we have
E[ XⁿT ∧t ‖ Fs ] = E[ E[ XⁿT ‖ Ft ] ‖ Fs ] = E[ XⁿT ‖ Fs ] = XⁿT ∧s .
       

Now let n → ∞ and observe that for n > t we have XⁿT ∧t = XT ∧t . ⊓⊔

Example 3.140. Suppose that (Bt )t≥0 is a Brownian motion started at 0 and
(Ft )t≥0 is its canonical filtration. For any a ∈ R we set

Ta := inf{ t ≥ 0 : Bt = a }.

According to Proposition 3.118(ii), P[ Ta < ∞ ] = 1.
(a) We want to show that if a < 0 < b, then
  b   −a
P Ta < Tb = , P Ta > Tb = . (3.3.18)
b−a b−a
Consider the stopping time T = T_a ∧ T_b and the stopped martingale M_t = B_{T∧t}.
This martingale is UI since |M_t| ≤ |a| ∨ |b|. We deduce

    0 = E[ M_0 ] = E[ M_∞ ] = E[ B_T ] = a P[ T_a < T_b ] + b P[ T_b < T_a ].

The equalities (3.3.18) follow by observing that the probabilities P[ T_a < T_b ] and
P[ T_a > T_b ] satisfy a second linear constraint

    P[ T_a < T_b ] + P[ T_a > T_b ] = 1.
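The identity (3.3.18) holds verbatim for the simple random walk started at 0 with integer barriers a < 0 < b, and in that discrete setting it is easy to sanity-check by plain Monte Carlo. The sketch below is an illustration, not from the book; the parameters are arbitrary.

```python
import random

def hit_a_before_b(a, b, rng):
    """Run one simple random walk from 0; return True if it reaches a before b."""
    s = 0
    while a < s < b:
        s += rng.choice((-1, 1))
    return s == a

rng = random.Random(0)
a, b, n_runs = -3, 2, 20000
p_hat = sum(hit_a_before_b(a, b, rng) for _ in range(n_runs)) / n_runs
print(p_hat, "vs exact", b / (b - a))   # estimate should be close to 2/5
```

With 20000 runs the Monte Carlo error is a few tenths of a percent, so the estimate lands close to the exact value b/(b−a) = 0.4.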
(b) For a > 0 we set

    U_a := inf{ t ≥ 0 : |B_t| = a } = T_a ∧ T_{−a}.

We want to show that

    E[ U_a ] = a².      (3.3.19)

To see this consider the martingale of Example 3.3.9(ii), M_t = B_t² − t. The stopped
process M_{t∧U_a} is still a martingale so

    E[ M_{t∧U_a} ] = E[ M_0 ] = 0  and  E[ B²_{t∧U_a} ] = E[ t ∧ U_a ].

The Monotone Convergence Theorem implies that

    lim_{t→∞} E[ t ∧ U_a ] = E[ U_a ].

The martingale B_{t∧U_a} is bounded, |B_{t∧U_a}| ≤ a, ∀t ≥ 0, and we deduce from the
Dominated Convergence Theorem that

    E[ U_a ] = lim_{t→∞} E[ t ∧ U_a ] = lim_{t→∞} E[ B²_{t∧U_a} ] = E[ B²_{U_a} ] = a².
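The identity (3.3.19) also holds exactly for the simple random walk: the expected exit time of (−a, a) starting from 0 is a² for integer a. A quick Monte Carlo check (my sketch, not from the book; the values of a and the run count are illustrative):

```python
import random

def exit_time(a, rng):
    """Number of steps until a simple random walk from 0 first reaches +a or -a."""
    s, n = 0, 0
    while abs(s) < a:
        s += rng.choice((-1, 1))
        n += 1
    return n

rng = random.Random(1)
a, n_runs = 4, 20000
mean_T = sum(exit_time(a, rng) for _ in range(n_runs)) / n_runs
print(mean_T, "vs exact", a * a)
```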
July 19, 2022 15:9 ws-book961x669 An Introduction to Probability 12800-main page 353

Martingales 353

(c) Fix a > 0. We want to compute the moment generating function of T_a. To this
aim, we consider for any λ ∈ R the martingale of Example 3.3.9(iii)

    X^λ_t := exp( λB_t − λ²t/2 ).      (3.3.20)

For λ > 0 the stopped martingale Y^λ_t = X^λ_{t∧T_a} is bounded, thus UI, and we deduce

    1 = E[ Y^λ_0 ] = E[ Y^λ_∞ ] = e^{λa} E[ e^{−λ²T_a/2} ].

Replacing λ with √(2λ) we deduce

    E[ e^{−λT_a} ] = e^{−a√(2λ)}.

This can be alternatively verified using the distribution of T_a computed in Corollary 3.125.
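This Laplace transform has a classical discrete counterpart: for the first-passage time T of the simple random walk to level 1, the generating function is E[s^T] = (1 − √(1−s²))/s. The sketch below (my illustration, not from the book) checks that identity by exact dynamic programming over the surviving walk positions:

```python
from math import sqrt

def first_passage_pgf(s, n_max=300):
    """Truncated series Sum_{n<=n_max} P[T = n] s^n, where T is the first
    hitting time of +1 by a simple random walk started at 0."""
    dist = {0: 1.0}          # mass over positions of walks that have not hit 1 yet
    total = 0.0
    for n in range(1, n_max + 1):
        new = {}
        for pos, p in dist.items():
            for step in (-1, 1):
                q = pos + step
                if q == 1:
                    total += 0.5 * p * s**n   # walk hits 1 at time n
                else:
                    new[q] = new.get(q, 0.0) + 0.5 * p
        dist = new
    return total

s = 0.9
print(first_passage_pgf(s), "vs closed form", (1 - sqrt(1 - s * s)) / s)
```

For s = 0.9 the truncation error of the series at n_max = 300 is far below 10⁻⁹, so the two printed numbers agree to many digits.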
(d) We want to compute the Laplace transform (or moment generating function)
of U_a. Consider the stopped martingale Z^λ_t := X^λ_{t∧U_a}, where X^λ_t is defined as in
(3.3.20). We deduce as above that

    1 = E[ e^{λB_{U_a}} e^{−λ²U_a/2} ].

The computations in (a) show that

    P[ B_{U_a} = a ] = P[ B_{U_a} = −a ] = 1/2.

Note that

    P[ U_a ≤ u ] = P[ B_{U_a} = a, U_a ≤ u ] + P[ B_{U_a} = −a, U_a ≤ u ].

Using the symmetry B_t ↦ −B_t we deduce

    P[ B_{U_a} = a, U_a ≤ u ] = P[ B_{U_a} = −a, U_a ≤ u ] = ½ P[ U_a ≤ u ]
      = P[ B_{U_a} = a ] P[ U_a ≤ u ] = P[ B_{U_a} = −a ] P[ U_a ≤ u ],

proving that B_{U_a} and U_a are independent. Hence

    1 = E[ e^{λB_{U_a}} ] E[ e^{−λ²U_a/2} ] = cosh(λa) E[ e^{−λ²U_a/2} ].  ⊓⊔
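The same optional-stopping argument works for the simple random walk: one checks that s^n cosh(θS_n) is a martingale when cosh θ = 1/s, which yields E[s^{U_a}] = 1/cosh(aθ) for the exit time of (−a, a). A numerical check of this discrete analogue (my sketch, not from the book):

```python
from math import acosh, cosh

def exit_time_pgf(a, s, n_max=600):
    """Truncated series Sum P[U_a = n] s^n for the exit time of (-a, a)
    by a simple random walk started at 0."""
    dist = {0: 1.0}              # surviving mass over positions strictly inside (-a, a)
    total = 0.0
    for n in range(1, n_max + 1):
        new = {}
        for pos, p in dist.items():
            for step in (-1, 1):
                q = pos + step
                if abs(q) == a:
                    total += 0.5 * p * s**n   # walk exits at time n
                else:
                    new[q] = new.get(q, 0.0) + 0.5 * p
        dist = new
    return total

a, s = 3, 0.9
theta = acosh(1.0 / s)
print(exit_time_pgf(a, s), "vs", 1.0 / cosh(a * theta))
```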
July 19, 2022 15:9 ws-book961x669 An Introduction to Probability 12800-main page 354

354 An Introduction to Probability

3.4 Exercises

Exercise 3.1. Suppose that (X_n)_{n≥0} is a sequence of integrable random variables
and (q_n)_{n≥1} is a sequence of nonzero real numbers such that, for any n ∈ N,

    E[ X_n ‖ F_{n−1} ] = q_n X_{n−1},   F_{n−1} := σ(X_0, …, X_{n−1}).

Define Q_0 = 1, Q_n = q_1 ⋯ q_n, ∀n ∈ N, and set Y_n := (1/Q_n) X_n. Prove that
(Y_n)_{n≥0} is a martingale adapted to the filtration (F_n)_{n≥0}.  ⊓⊔

Exercise 3.2. Suppose that (X_n)_{n≥0} is a martingale with respect to a filtration
(F_n)_{n≥0} such that X_0 = 0 and E[ |X_n|² ] < ∞, ∀n. Using the sequence of differences
D_n = X_n − X_{n−1}, n ≥ 1, we construct two new processes, the optional quadratic
variation

    Q_n = Σ_{k=1}^n D_k²

and the predictable quadratic variation

    V_n = Σ_{k=1}^n E[ D_k² ‖ F_{k−1} ].

Prove that the processes

    A_n = X_n² − Q_n,   B_n = X_n² − V_n

are martingales with respect to (F_n)_{n≥0}.  ⊓⊔
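For a ±1 random walk both variations are explicit: D_k² ≡ 1, so Q_n = V_n = n and A_n = B_n = X_n² − n. A small sketch computing them from one sampled path (purely illustrative, not part of the exercise):

```python
import random

rng = random.Random(2)
n = 50
D = [rng.choice((-1, 1)) for _ in range(n)]            # martingale differences
X = [0]
for d in D:
    X.append(X[-1] + d)

Q = [sum(d * d for d in D[:k]) for k in range(n + 1)]  # optional quadratic variation
# For Rademacher steps E[D_k^2 | F_{k-1}] = 1, so the predictable variation is V_n = n.
assert all(Q[k] == k for k in range(n + 1))
A = [X[k] ** 2 - Q[k] for k in range(n + 1)]           # here A_n = B_n = X_n^2 - n
print(A[-1])
```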

Exercise 3.3. Let x_1, …, x_r ∈ R. Fix a family {I_n, J_n ; n ∈ N} of independent
random variables such that I_n, J_n are uniformly distributed on {1, …, n−1},
∀n ≥ 2. Define inductively

    X_n := x_n for n ≤ r,   X_n := X_{I_n} + X_{J_n} for n > r,

and set

    Y_n := (1/(n(n+1))) Σ_{k=1}^n X_k.

Prove that the sequence (Y_n) is a martingale with respect to the filtration
σ(X_1, …, X_n).  ⊓⊔

Exercise 3.4. Prove all the claims in Example 3.21.  ⊓⊔

Exercise 3.5 (Optional switching). Suppose that F• := (F_n)_{n≥0} is a filtration
of the probability space (Ω, S, P) and (X_n)_{n≥0}, (Y_n)_{n≥0} are two F•-martingales.
Let T : Ω → N_0 be a stopping time adapted to F• such that X_T = Y_T on {T < ∞}.
For n ∈ N_0 define Z_n : Ω → R by

    Z_n(ω) = X_n(ω) if n ≤ T(ω),   Z_n(ω) = Y_n(ω) if n > T(ω).

Prove that (Z_n)_{n≥0} is a martingale adapted to F•.  ⊓⊔
July 19, 2022 15:9 ws-book961x669 An Introduction to Probability 12800-main page 355

Martingales 355

Exercise 3.6. Prove Lemma 3.32.  ⊓⊔

Exercise 3.7 (Dubins' inequality). Let X• = (X_n)_{n≥0} be a nonnegative
supermartingale adapted to the filtration F• of a probability space (Ω, S, P). For
0 ≤ a < b denote by N_n([a, b], X) the number of upcrossings of [a, b] by X• up to
time n; see (3.2.1). Prove that for any k = 1, 2, …, n

    P[ N_n([a, b], X) ≥ k ] ≤ (a/b)^k E[ min(1, X_0/a) ].
Exercise 3.8. Prove Lemma 3.38.  ⊓⊔

Exercise 3.9. Suppose that X_n ∈ L¹(Ω, S, P), n ∈ N, is a uniformly integrable
sequence of random variables that converges in law to the random variable X,
X_n ⇒ X. Then X ∈ L¹(Ω, S, P) and

    lim_{n→∞} E[ X_n^± ] = E[ X^± ],   lim_{n→∞} E[ |X_n| ] = E[ |X| ],

    lim_{n→∞} E[ X_n ] = E[ X ].  ⊓⊔

Exercise 3.10 (Pratt's Lemma). Let (X_n), (Y_n), (Z_n) be three sequences of
integrable random variables with the following properties.

(i) X_n ≤ Y_n ≤ Z_n, ∀n.
(ii) X_n → X, Y_n → Y, Z_n → Z in probability.
(iii) E[ X_n ] → E[ X ], E[ Z_n ] → E[ Z ].

Prove that E[ Y_n ] → E[ Y ].  ⊓⊔

Exercise 3.11. Suppose that (X_n)_{n≥0} is a martingale defined on a probability
space (Ω, S, P) such that, for some M > 0,

    |X_n − X_{n−1}| ≤ M  a.s.,  ∀n ∈ N.

Define

    A := { ω ∈ Ω ; lim_{n→∞} X_n(ω) exists and is finite },

    B := { ω ∈ Ω ; lim inf_{n→∞} X_n(ω) = −∞, lim sup_{n→∞} X_n(ω) = ∞ }.

Prove that P[ A ∪ B ] = 1. In other words, when a martingale (with bounded
increments) does not have a limit, it oscillates wildly.  ⊓⊔

Hint. For C > 0 look at T_C^± = min{ n ; ±X_n > C }.

Exercise 3.12 (P. Lévy). Suppose that (Ω, S, P) is a probability space and
(F_n)_{n≥1} is a filtration of sigma-subalgebras. Let (F_n) be a sequence of events
such that F_n ∈ F_n, ∀n. We set

    X_n = Σ_{k=1}^n ( I_{F_k} − E[ I_{F_k} ‖ F_{k−1} ] ).

(i) Prove that X_n is a martingale and |X_n − X_{n−1}| ≤ 4, ∀n. Hint. Have a look at
    Example 3.14.
(ii) Prove that

        { F_n i.o. } = { Σ_{n≥1} E[ I_{F_n} ‖ F_{n−1} ] = ∞ }.

    Hint. Use Exercise 3.11.
(iii) Deduce from (ii) the second Borel-Cantelli Lemma, Theorem 1.139(ii).  ⊓⊔

Exercise 3.13. Suppose that (X_n)_{n∈N} is a sequence of independent Rademacher
random variables,

    P[ X_n = 1 ] = P[ X_n = −1 ] = 1/2,  ∀n.

Set

    S_n := X_1 + ⋯ + X_n,   p_n := P[ ∃k = 1, …, n, S_k < 0 ].

(i) Compute p_n. Hint. Use André's reflection trick in Example 1.60.
(ii) Show that p_n → 1 as n → ∞.  ⊓⊔
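The reflection trick yields the closed form P[ S_k ≥ 0, ∀k ≤ n ] = 2^{−n} C(n, ⌊n/2⌋), hence p_n = 1 − 2^{−n} C(n, ⌊n/2⌋) → 1. A short dynamic-programming check of this formula (my sketch, not part of the exercise):

```python
from math import comb

def p_n(n):
    """P[some partial sum S_k < 0, k <= n] for a Rademacher walk, by exact DP."""
    dist = {0: 1.0}                      # distribution of S_k among paths with S_j >= 0 so far
    for _ in range(n):
        new = {}
        for s, p in dist.items():
            for step in (-1, 1):
                t = s + step
                if t >= 0:               # paths dipping below 0 are discarded
                    new[t] = new.get(t, 0.0) + 0.5 * p
        dist = new
    return 1.0 - sum(dist.values())

n = 10
print(p_n(n), "vs", 1 - comb(n, n // 2) / 2**n)
```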

Exercise 3.14. Consider the situation in Example 3.31. We have a finite set A,
called alphabet, and a probability distribution π on A such that π[a] ≠ 0, ∀a ∈ A. Fix
two words

    a = (a_1, …, a_k) ∈ A^k,   b = (b_1, …, b_ℓ) ∈ A^ℓ

and assume that b is not a subword of a, i.e.,

    (a_{i+1}, …, a_{i+ℓ}) ≠ (b_1, …, b_ℓ),  ∀i = 0, …, k − ℓ.

Let (A_n)_{n≥1} be i.i.d. A-valued random variables with common distribution π. As in
Example 3.31 we denote by T_b the time to observe the pattern b.

(i) Prove that

        E[ T_b ‖ A_1 = a_1, …, A_k = a_k ] − k = Φ(b, b) − Φ(a, b),

    where Φ is defined by (3.1.12).
(ii) Set p_a := P[ T_a < T_b ], p_b := P[ T_b < T_a ], T = min(T_a, T_b). Prove that

        p_a Φ(a, a) + p_b Φ(b, a) = E[ T ] = p_a Φ(a, b) + p_b Φ(b, b).

(iii) Show that

        p_b / p_a = ( Φ(a, a) − Φ(a, b) ) / ( Φ(b, b) − Φ(b, a) ).

Hint. Consider the same martingale (X_n) as in Example 3.31. Observe that X_k = Φ(a, b) − k given
that A_j = a_j, j = 1, …, k. (ii) Note that E[ T_b ] = E[ T ] + E[ T_b − T ] and (i) gives a formula for
E[ T_b − T ‖ T = T_a ].  ⊓⊔

Exercise 3.15 (Kakutani). Let (X_n) be a sequence of independent positive ran-
dom variables such that E[ X_n ] = 1. Consider the product martingale

    Y_n = Π_{k=1}^n X_k.

Doob's convergence theorem shows that Y_n converges a.s. to a random variable Y_∞
satisfying E[ Y_∞ ] ≤ 1. Set a_n := E[ X_n^{1/2} ]. Prove that the following are equivalent.

(i) E[ Y_∞ ] = 1.
(ii) Y_n → Y_∞ in L¹.
(iii) The martingale (Y_n)_{n∈N} is UI.
(iv) Π_n a_n > 0.
(v) Σ_n (1 − a_n) < ∞.  ⊓⊔

Exercise 3.16. Consider the unbiased random walk in Example 3.5,

    S_0 = a ∈ Z,   S_n = a + X_1 + ⋯ + X_n,  n ≥ 1,

where (X_n)_{n≥1} are i.i.d. random variables such that E[ X_n ] = 0, ∀n. Set
F_n = σ(X_1, …, X_n), n ∈ N.

(i) Assume σ² := E[ X_1² ] < ∞. Show that the sequence ( S_n² − nσ² )_{n≥0} is a
    martingale with respect to the filtration F_n.
(ii) Assume that M(t) = E[ e^{tX_1} ] exists for all |t| < t_0, t_0 > 0. Show that for
    |t| < t_0 and n ∈ N

        Z_n(t) := e^{tS_n} / M(t)^n

    is a martingale with respect to the filtration F_n.
(iii) Set D = d/dx. We define M(D) : R[x] → R[x] by the equality

        M(D)P(x) = Σ_{k≥0} ( M^{(k)}(0) / k! ) D^k P(x).

    Prove that M(D) is bijective and for any polynomial P the sequence

        Y_n = M(D)^{−n} P(S_n),  n ≥ 1,

    is a martingale. Find Y_n when P(x) = x and P(x) = x².

Hint. Set P_n := M(D)^{−n} P and express E[ P_{n+1}(S_n + X_{n+1}) ‖ X_1, …, X_n ] using the operator M(D).  ⊓⊔

Exercise 3.17. Suppose that (X_n)_{n≥0} is a martingale with respect to the filtra-
tion F• = (F_n)_{n≥0} such that E[ X_n² ] < ∞, ∀n. The sequence (X_n²)_{n≥0} is a sub-
martingale and thus, according to Proposition 3.13, it admits a Doob decomposition
X_n² = X_0² + M_n + C_n, where (M_n)_{n≥0} is a martingale and the compensator (C_n) is
a predictable, nondecreasing process. Set

    A_n = X_0² + C_n,   A_∞ = lim_{n→∞} A_n = sup_{n∈N} A_n.

(i) Prove that E[ sup_{n≥0} X_n² ] ≤ 4 E[ A_∞ ]. Hint. Use Doob's L²-maximal inequality.
(ii) Prove that lim_{n→∞} X_n exists and is finite a.s. on the set { A_∞ < ∞ }. Hint.
    For a > 0 set N_a = min{ n ; A_{n+1} > a² }. Show that it is a stopping time adapted to the
    filtration F•. Apply (i) to the stopped martingale X_{n∧N_a}.
(iii) Suppose that f : [0, ∞) → [1, ∞) is an increasing function such that

        ∫_0^∞ f(t)/t² dt < ∞.

    Prove that X_n/f(A_n) → 0 a.s. on the set { A_∞ = ∞ }. Hint. Set H_n = 1/f(A_n), ∀n ∈ N.
    Let Y• denote the martingale defined by the discrete stochastic integral (H · X)•; see (3.1.2). Use
    the Doob decomposition of Y_n² to prove that Y_n converges a.s. Conclude using Kronecker's
    lemma, Lemma 2.10.  ⊓⊔

Exercise 3.18 (Dubins-Freedman). Suppose that (Ω, S, P) is a probability
space and (F_n)_{n≥1} is a filtration of sigma-subalgebras. Let (F_n) be a sequence
of events such that F_n ∈ F_n, ∀n. We set

    X_n = Σ_{k=1}^n ( I_{F_k} − f_k ),   f_k := E[ I_{F_k} ‖ F_{k−1} ].

(i) Prove that (X_n)_{n≥0} is a martingale and E[ X_n² ] < ∞, ∀n ≥ 0.
(ii) Define S = { Σ_n f_n = ∞ }. Prove that

        ( Σ_{k=1}^n I_{F_k} ) / ( Σ_{k=1}^n f_k ) → 1,  a.s. on S.      (3.4.1)

(iii) Deduce from (3.4.1) the conclusion of Exercise 3.12(ii). Thus (3.4.1) is a gen-
    eralization of the second Borel-Cantelli lemma, Theorem 1.139(ii).  ⊓⊔

Exercise 3.19 (Conservation of fairness). A fair coin is flipped repeatedly and
independently. A gambler starts with an initial fortune f_0 > 0. Before the n-th
flip, his fortune is F_{n−1}. Based only on the information available to him at that
moment, the gambler bets a sum B_n, 0 ≤ B_n ≤ F_{n−1}. If the n-th flip shows
Heads he earns B_n dollars and if it shows Tails, he loses B_n dollars. The gambler
stops gambling when he is broke or at the first moment when he reaches his goal,
i.e., F_n ≥ g, where g > 0 is set in advance of his gambling.

(i) Prove that the probability p_g that he reaches his goal is ≤ f_0/g.
(ii) Prove that if B_n ≤ min(F_{n−1}, g − F_{n−1}), ∀n ≥ 1, then p_g = f_0/g.
(iii) Find p_g if B_n = ½ F_{n−1}.  ⊓⊔

Remark 3.141. Note that if f_0, g ∈ N and the gambling strategy is B_n = 1
whenever his fortune is < g, the above problem reduces to the classical Gambler's
ruin problem discussed in Example 3.72. The name "conservation of fairness"
seems appropriate: for any gambling strategy satisfying (ii) and based only on
the information available at each moment, the probability of reaching the goal is
the same, f_0/g.  ⊓⊔

Exercise 3.20. Suppose that we are given a sequence of i.i.d. random vectors

    X_n : (Ω, S, P) → X := R^N

and a collection F of uniformly bounded measurable functions f : X → R, i.e., there
exists C > 0 such that ‖f‖_{L^∞} ≤ C, ∀f ∈ F. For n ∈ N we set

    D_n(F) := sup_{f∈F} | (1/n) Σ_{k=1}^n f̄(X_k) |,   f̄(x) := f(x) − E[ f(X_1) ],

    R_n(F) := E[ sup_{f∈F} (1/n) Σ_{k=1}^n R_k f(X_k) ],

where (R_n)_{n≥1} is a sequence of independent Rademacher random variables that are
also independent of (X_n)_{n≥1}. Assume that D_n is measurable.

(i) Prove that the function G_n : X^n → R,

        G_n(x_1, …, x_n) = sup_{f∈F} | (1/n) Σ_{k=1}^n f̄(x_k) |,

    satisfies the bounded difference property (Definition 3.36) with L_k = 2C/n,
    ∀k = 1, …, n.
(ii) Prove that

        P[ D_n − E[ D_n ] ≤ t ] ≥ 1 − e^{−nt²/(2C²)},  ∀t > 0.

(iii) Prove that for any δ > 0

        P[ D_n ≤ 2R_n + δ ] ≥ 1 − e^{−nδ²/(2C²)}.

Hint. Use (2.4.13) in Remark 2.66.  ⊓⊔

Exercise 3.21. Consider the standard random walk (S_n)_{n≥0} on Z started at a,
i.e.,

    S_0 = a ∈ Z,   S_n = a + X_1 + ⋯ + X_n,

where (X_n)_{n≥1} are i.i.d. with P[ X_n = ±1 ] = 1/2. Fix a, g ∈ N_0, a < g, and set

    T_a := min{ n ∈ N ; S_n = 0 or S_n = g }.

(i) Show P[ S_{T_a} = 0 ] = (g−a)/g and P[ S_{T_a} = g ] = a/g.
(ii) Show that E[ T_a ] = a(g − a).
(iii) Compute the pgf of T_a,

    G_{T_a}(s) := Σ_{n=0}^∞ P[ T_a = n ] s^n.  ⊓⊔
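Parts (i)–(ii) amount to solving the linear recurrences p(x) = ½p(x−1) + ½p(x+1) and u(x) = 1 + ½u(x−1) + ½u(x+1) with boundary values at 0 and g. The sketch below (my illustration, not from the book) solves the resulting tridiagonal systems directly with the Thomas algorithm:

```python
def solve_tridiag(n, diag, off, rhs):
    """Solve a constant-coefficient tridiagonal system (diag on the diagonal,
    off on the sub/super-diagonals) by forward elimination and back substitution."""
    d = [diag] * n
    r = list(rhs)
    for i in range(1, n):
        w = off / d[i - 1]
        d[i] -= w * off
        r[i] -= w * r[i - 1]
    x = [0.0] * n
    x[-1] = r[-1] / d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = (r[i] - off * x[i + 1]) / d[i]
    return x

g, a = 10, 3
# p(x) - 0.5 p(x-1) - 0.5 p(x+1) = 0 for 0 < x < g, with p(0) = 0, p(g) = 1
p = solve_tridiag(g - 1, 1.0, -0.5, [0.0] * (g - 2) + [0.5])
# u(x) - 0.5 u(x-1) - 0.5 u(x+1) = 1 for 0 < x < g, with u(0) = u(g) = 0
u = solve_tridiag(g - 1, 1.0, -0.5, [1.0] * (g - 1))
print(p[a - 1], u[a - 1])   # expect a/g = 0.3 and a(g - a) = 21
```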

Exercise 3.22. Suppose that (X_n)_{n≥0} is a martingale adapted to the filtration
(F_n)_{n≥0} and T is a stopping time adapted to the same filtration such that
P[ T < ∞ ] = 1 and X_T ∈ L¹. Prove that

    E[ X_T ‖ F_n ] = X_n  on {T ≥ n}.

Hint. Have a look at the Proof of Theorem 3.28.

Exercise 3.23. Suppose that (X_n)_{n≥1} is a sequence of i.i.d., nonnegative, integer
valued random variables with finite mean. Set

    S_n := X_1 + ⋯ + X_n.

For k = 1, …, n, set F_{−k} = σ( S_k, S_{k+1}, …, S_n ), Y_{−k} = S_k/k.

(i) Prove that for j ≤ k we have

        E[ X_j ‖ F_{−k} ] = S_k/k.

(ii) Prove that ( Y_{−k} )_{1≤k≤n} is a martingale with respect to the filtration
    ( F_{−k} )_{1≤k≤n}. (Compare with Example 3.30.)
(iii) Show that

        P[ S_k < k, ∀1 ≤ k ≤ n ‖ S_n ] = ( 1 − S_n/n )^+.

Hint. (iii) Set T = inf{ −n ≤ k ≤ −1 ; Y_k ≥ 1 }, where we define inf ∅ = −1. Use Exercise 3.22.  ⊓⊔

Exercise 3.24. Suppose that f : [0, 1] → R is a Lebesgue integrable function. For
any n ∈ N_0 we define the step function f_n : [0, 1] → R by setting f_n(0) = 0 and

    f_n(x) = 2^n ∫_{(k−1)/2^n}^{k/2^n} f(t) dt,   if 0 ≤ (k−1)/2^n < x ≤ k/2^n ≤ 1.

Prove that f_n converges a.s. and in L¹ to f as n → ∞.  ⊓⊔

Exercise 3.25. Suppose that (X_n)_{n≥0} is a supermartingale such that there exist
f_0, g > 0 with the property

    X_0 = f_0 a.s.,   0 ≤ X_n ≤ g a.s., ∀n ∈ N.

Prove that for any stopping time T such that P[ T < ∞ ] = 1 we have

    P[ X_T = g ] ≤ f_0/g.  ⊓⊔

Exercise 3.26. Consider the branching process (Z_n)_{n≥0} with initial condition
Z_0 = 1 and reproduction law µ ∈ Prob(N_0) such that

    m := E[ µ ] = Σ_{n≥0} n µ_n < ∞,   µ_n := µ[n].

Assume µ_0 > 0. Denote by f(s) the probability generating function (pgf) of µ,

    f(s) = Σ_{n≥0} µ_n s^n.

We set

    f_n(s) := f ∘ ⋯ ∘ f(s)  (n-fold composition),  n ∈ N.

(i) Show that if m > 1 the equation f(s) = s has a unique solution r = r(µ) in
    the interval (0, 1). Compute r(µ) when

        µ_n = q p^n,  n ∈ N_0,

    where p ∈ (1/2, 1), q = 1 − p.
(ii) Prove that P[ Z_n = 0 ] = f_n(0).
(iii) Denote by E the extinction event

        E = ∪_{n≥0} { Z_n = 0 }.

    Prove that

        P[ E ] = 1 if m ≤ 1,   P[ E ] = r(µ) if m > 1.

(iv) Assume m > 1. Prove that the sequence ( r^{Z_n} )_{n≥0} is a martingale.
(v) Set

        W_n := Z_n / m^n.

    Assume

        m > 1,   E[ Z_1² ] = Σ_{n≥0} n² µ_n < ∞,

    and set

        W := lim_{n→∞} W_n.

    Denote by P_W the probability distribution of W and by ϕ(λ) its Laplace trans-
    form,

        ϕ(λ) = E[ e^{−λW} ] = ∫ e^{−λw} P_W[dw],  λ ∈ C, Re λ ≥ 0.

    Prove that

        −ϕ′(0) = 1,   ϕ(λ) = f( ϕ(λ/m) ) = Σ_{n≥0} µ_n ϕ(λ/m)^n,  ∀ Re λ ≥ 0.      (3.4.2)

(vi) Prove that there exists at most one probability measure ν ∈ Prob([0, ∞))
    such that

        ∫_0^∞ t² ν[dt] < ∞

    and its Laplace transform

        ϕ_ν(λ) := ∫_0^∞ e^{−λt} ν[dt],  λ ∈ C, Re λ ≥ 0,

    satisfies (3.4.2).

Hint. Consider two such measures ν_k, k = 0, 1, and denote by Φ_k(t) their characteristic functions.
Set Φ(t) = Φ_1(t) − Φ_0(t), γ(t) = Φ(t)/t, t ≠ 0. Prove that |γ(mt)| ≤ |γ(t)| and conclude that
Φ ≡ 0.  ⊓⊔
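For the geometric law µ_n = q p^n of part (i), f(s) = q/(1 − ps) and the fixed point in (0, 1) is r = q/p. The monotone iteration f_n(0) ↗ P[E] of parts (ii)–(iii) can be watched numerically (my sketch, not from the book):

```python
p = 0.7
q = 1.0 - p                      # mean offspring m = p/q > 1

def f(s):
    """pgf of the geometric reproduction law mu_n = q p^n."""
    return q / (1.0 - p * s)

s = 0.0                          # f_n(0) = P[Z_n = 0], nondecreasing in n
for _ in range(200):
    s = f(s)
print(s, "vs extinction probability q/p =", q / p)
```

Near the fixed point the iteration contracts at rate f′(r) = q/p < 1, so 200 iterations are far more than enough for machine precision.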

Exercise 3.27. Let S_n denote the group of permutations of I_n := {1, …, n}. We
equip it with the uniform probability measure. A run of a permutation π is a pair
(s, r) ∈ I_n × I_n, s ≤ r, such that

    π_{s−1} > π_s < π_{s+1} < ⋯ < π_r > π_{r+1},

where π_0 := n + 1 and π_{n+1} := 0. We denote by R_n(π) the number of runs of
π ∈ S_n. Set

    X_n := n R_n − ½ n(n+1).

(i) For π ∈ S_{n+1} we set k_π := π^{−1}(n+1) and denote by ϕ_π the unique increasing
    bijection

        ϕ_π : I_n → I_{n+1} \ {k_π}.

    Set π̄ := π ∘ ϕ_π. Show that the random maps

        S_{n+1} ∋ π ↦ k_π ∈ I_{n+1},   S_{n+1} ∋ π ↦ π̄ ∈ S_n

    are independent and uniformly distributed on their ranges.
(ii) Prove that (X_n) is a martingale.
(iii) Compute E[ R_n ] and E[ R_n² ].

(iv) Show that

        lim_{n→∞} E[ ( R_n/n − 1/2 )² ] = 0.  ⊓⊔

Exercise 3.28. Suppose that (X_n)_{n≥0} is an L²-martingale adapted to the filtration
(F_n)_{n≥0} and ⟨X•⟩ is its quadratic variation; see Definition 3.15. Fix a bounded
predictable process (H_n)_{n≥0} and form the discrete stochastic integral (H • X) (see
Theorem 3.17).

(i) Show that

        E[ X_n² ] − E[ X_0² ] = E[ ⟨X⟩_n ].

(ii) Prove that the martingale (H • X) is an L² martingale.
(iii) Prove that

        ⟨H • X⟩_n = (H² • ⟨X⟩)_n := Σ_{k=1}^n H_k² ( ⟨X⟩_k − ⟨X⟩_{k−1} ),  ∀n ≥ 1.

(iv) Prove that

        E[ (H • X)_n² ] = E[ Σ_{k=1}^n H_k² (X_k − X_{k−1})² ],  ∀n ≥ 1.
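For a ±1 walk with a predictable sign strategy H_k = ±1, the right-hand side of (iv) is exactly n, so E[(H • X)_n²] = n regardless of the strategy. A Monte Carlo sanity check of this particular instance (my illustration, not from the book):

```python
import random

rng = random.Random(3)
n, n_runs = 10, 20000

def integral_squared():
    """(H . X)_n^2 for one path, with the predictable choice H_k = sign of X_{k-1}."""
    x, acc = 0, 0.0
    for _ in range(n):
        h = 1 if x >= 0 else -1          # depends only on the past: predictable
        d = rng.choice((-1, 1))          # martingale increment D_k
        x += d
        acc += h * d
    return acc * acc

lhs = sum(integral_squared() for _ in range(n_runs)) / n_runs
print(lhs, "vs", n)                      # E[(H.X)_n^2] = sum E[H_k^2 D_k^2] = n
```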

Exercise 3.29. Suppose that (X_n)_{n∈N} is an exchangeable sequence of random vari-
ables and T is a stopping time adapted to the filtration F_n = σ(X_1, …, X_n). Prove
that if T < N a.s. for some N ∈ N, then X_{T+1} has the same distribution as X_1.  ⊓⊔

Exercise 3.30. Suppose that (X_n)_{n∈N} is a sequence of random variables such that
for any n ∈ N the distribution of the random vector (X_1, …, X_n) is orthogonally
invariant, i.e., for any T ∈ O(n), T_# P_{X_1,…,X_n} = P_{X_1,…,X_n}. Prove that (X_n)_{n∈N} are
conditionally i.i.d. N(0, σ²) given a random variable σ² ≥ 0.  ⊓⊔

Exercise 3.31. Prove Lemma 3.105.  ⊓⊔

Exercise 3.32. Finish the proof of Proposition 3.112.  ⊓⊔

Exercise 3.33. Prove Proposition 3.114.  ⊓⊔

Exercise 3.34. Let N(t) be a Poisson process with intensity λ as described in
Example 1.136. Denote by (F_t) the natural filtration, F_t = σ( N(s), s ≤ t ).

(i) Prove that N(t) is an R-process.
(ii) Prove that F_{t+} = F_t, ∀t ≥ 0.
(iii) Prove that E[ N(t) ‖ F_s ] = E[ N(t) ‖ N(s) ], ∀0 ≤ s < t.  ⊓⊔

Exercise 3.35. Suppose that W : L²([0, ∞)) → L²(Ω, S, P) is a Gaussian white
noise; see Example 2.78. Fix f ∈ L²([0, ∞)) and consider the Wiener integral (see
Example 2.78 and Exercise 2.60)

    X_t = ∫_0^t f(s) dB(s) := W( I_{[0,t]} f ),  t ≥ 0.

(i) Prove that (X_t) is an L² martingale adapted to the filtration

        F_t := σ( X_s, s ≤ t ).

(ii) Use Kolmogorov's Continuity Theorem 2.81 to show that (X_t)_{t≥0} admits a
    continuous modification.  ⊓⊔

Exercise 3.36. Let B(t), t ≥ 0, be a one-dimensional Brownian motion started at
0. For each n ∈ N and each t ≥ 0 we set

    X_t^n := Σ_{k=1}^n B( (k−1)t/n ) ( B(kt/n) − B((k−1)t/n) ).

(i) Prove that for any n ∈ N the stochastic process X_t^n is an L²-martingale.
(ii) Prove that for each t ≥ 0, X_t^n converges to ½( B(t)² − t ) in L² as n → ∞.  ⊓⊔
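The mechanism behind (ii) is the pathwise summation-by-parts identity Σ b_{k−1}(b_k − b_{k−1}) = ½( b_n² − Σ (b_k − b_{k−1})² ), combined with the fact that the quadratic variation Σ(ΔB)² converges to t in L². The identity itself is pure algebra and can be checked on any sampled path; the sketch below uses a crude Gaussian-increment discretization (my illustration, not from the book):

```python
import random

rng = random.Random(4)
t, n = 1.0, 1000
dB = [rng.gauss(0.0, (t / n) ** 0.5) for _ in range(n)]   # Brownian increments
B = [0.0]
for d in dB:
    B.append(B[-1] + d)

riemann = sum(B[k - 1] * (B[k] - B[k - 1]) for k in range(1, n + 1))
qv = sum(d * d for d in dB)                  # quadratic variation, close to t
identity = 0.5 * (B[n] ** 2 - qv)            # summation by parts, exact pathwise
print(abs(riemann - identity), qv)           # first number is ~0 (rounding only)
```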

Exercise 3.37. Suppose that (W_t)_{t≥0} is a pre-Brownian motion defined on a prob-
ability space (Ω, S, P); see Definition 2.71. Let t_0, δ ≥ 0. Set

    R(t_0, δ) = sup_{t∈Q∩[t_0,t_0+δ]} | W(t) − W(t_0) |.

(i) Prove that

        P[ R(t_0, δ) > ε ] ≤ 3δ²/ε⁴,  ∀ε, δ > 0.

    Hint. Use Doob's maximal inequalities.
(ii) Prove that W_t is a.s. uniformly continuous on Q_{≥0} and conclude that (W_t)
    admits a modification continuous on [0, ∞).  ⊓⊔

Exercise 3.38. Let (B_t)_{t≥0} be a standard Brownian motion and −a < 0 < b. Set
T = min(T_{−a}, T_b), where for c ∈ R we set T_c = inf{ t ≥ 0 ; B_t = c }. Prove that

    E[ T ] = E[ B_T² ] = ab.  ⊓⊔

Exercise 3.39 (P. Lévy). Let (Bt )t≥0 be a standard Brownian motion and c > 0.
For a ∈ R we denote by ra the reflection ra : R → R, ra (x) = 2a − x.

(i) Prove that for any Borel subsets U_− ⊂ (−∞, −c], U_+ ⊂ [c, ∞) we have

    P[ T_c < T_{−c}, B_1 ∈ U_− ] + P[ T_c > T_{−c}, B_1 ∈ r_c(U_−) ] = P[ B_1 ∈ r_c(U_−) ],

    P[ T_c > T_{−c}, B_1 ∈ U_+ ] + P[ T_c < T_{−c}, B_1 ∈ r_{−c}(U_+) ] = P[ B_1 ∈ r_{−c}(U_+) ].

(ii) Denote by J the interval [−c, c]. Prove that

    P[ T_c ≤ T_{−c} ∧ 1, B_1 ∈ J ] = P[ B_1 ∈ r_c(J) ] − P[ T_c > T_{−c}, B_1 ∈ r_c(J) ],

    P[ T_{−c} ≤ T_c ∧ 1, B_1 ∈ J ] = P[ B_1 ∈ r_{−c}(J) ] − P[ T_c < T_{−c}, B_1 ∈ r_{−c}(J) ].

(iii) Prove that

    P[ sup_{t∈[0,1]} |B_t| < c ] = P[ B_1 ∈ J ]
        − ( P[ T_c ≤ T_{−c} ∧ 1, B_1 ∈ J ] + P[ T_{−c} ≤ T_c ∧ 1, B_1 ∈ J ] ).

(iv) Prove that

    P[ sup_{t∈[0,1]} |B_t| < c ] = P[ |B_1| ≤ c ] − P[ c ≤ |B_1| ≤ 3c ] + P[ 3c ≤ |B_1| ≤ 5c ] − ⋯ .  ⊓⊔

Remark 3.142. Exercise 3.39 is a special case of a more general result called the
support theorem. For any continuous function f : [0, 1] → R such that f(0) = 0 and
any ε > 0 we have

    P[ sup_{t∈[0,1]} | B_t − f(t) | ≤ ε ] > 0.      (3.4.3)

For a proof we refer to [63, Ch. 1, Thm. (38)].
    Let us describe an amusing application of this fact. Suppose that (B_t^i)_{t≥0},
i = 1, 2, are two independent Brownian motions and f^i : [0, 1] → R, i = 1, 2, are two
continuous functions such that f^i(0) = 0. The equality (3.4.3) implies immediately
that for any ε > 0 we have

    P[ max_i sup_{t∈[0,1]} | B_t^i − f^i(t) | ≤ ε ]
      = P[ sup_{t∈[0,1]} | B_t^1 − f^1(t) | ≤ ε ] · P[ sup_{t∈[0,1]} | B_t^2 − f^2(t) | ≤ ε ] > 0.      (3.4.4)

The pair of functions (f^1, f^2) defines a path

    F : [0, 1] → R²,   F(t) = ( f^1(t), f^2(t) ).

Think of F(t) as tracing the motion of the tip of an infinitesimally fine pen as you
sign a planar piece of paper, starting at the origin.

Any other path G = (g¹, g²) : [0, 1] → R² satisfying

    | g^i(t) − f^i(t) | < ε,  ∀t ∈ [0, 1],  i = 1, 2,

will follow closely the original motion of the fine pen, producing a curve essentially
indistinguishable with the naked eye from the original signature. In fact, if ε > 0 is
small enough, one cannot distinguish the two curves, even using a magnifying glass.
    The random path (B_t^1, B_t^2) is the so-called planar Brownian motion started at
the origin. The equality (3.4.3) shows that the probability that this random path
follows closely the motion of the tip of the fine pen is positive. For this reason the
inequality (3.4.3) is sometimes referred to as Lévy's forgery theorem.  ⊓⊔

Chapter 4

Markov chains

The Markov chains form a special but sufficiently general class of examples of
stochastic processes. Their investigation requires a diverse arsenal of techniques,
probabilistic and otherwise, and they reveal important patterns arising in many other
instances.
    The foundations of this theory were laid by the Russian mathematician
A. A. Markov at the beginning of the twentieth century. By most accounts, Markov
was a rather unconventional individual. He discovered what we now know as Markov
chains in his attempts to contradict Pavel Nekrasov, a mathematician/theologian
of that time who maintained on a theological basis that the Law of Large Numbers
was specific to independent events/random variables and cannot be seen in other
contexts. Markov succeeded in proving Nekrasov wrong and in the process laid the
foundations of the theory of Markov chains. For more on the history of this concept
we refer to the very readable article [82].
So what did Markov discover? Think of a Markov chain as a random walk
on a finite set X. From a given location x the walker can go to a location x′ with
probability q_{x,x′}. Suppose that at some location x_0 ∈ X we placed a pile of sand
consisting of giddy grains of sand: every second one of them starts this random walk
and performs a billion steps (think of a fixed but very large number of steps). After
all the grains of sand performed this ritual, the initial pile of sand is redistributed
at various points of X. Denote by m_x^1 the mass of the pile of sand relocated at
x. Next, collect the piles from their locations and move them back to the initial
location x_0.
    Running the above experiment again, we get a new distribution of piles of sand
at the points of X. Denote the mass at x by m_x^2. Markov observed that

    m_x^1 / m_x^2 ≈ 1,  ∀x.

Run the experiment a third time to obtain a third distribution of mass (m_x^3)_{x∈X}
and the conclusion is the same:

    m_x^1 / m_x^3 ≈ 1,  ∀x.


To put it differently, if m is the mass of the pile of sand at x_0, then, for any x ∈ X,

    m_x^1 / m ≈ m_x^2 / m ≈ m_x^3 / m ≈ ⋯ .

This phenomenon is one manifestation of the Law of Large Numbers for Markov
chains.
    During the more than a century since its creation, the theory of Markov chains
has witnessed dramatic growth and generalizations, and has found applications
in unexpected problems. For example, Google's PageRank algorithm is a special
application of the Law of Large Numbers for Markov chains.
    The present chapter is an introduction to the theory of Markov chains. We
present the classical results and spend some time on some more recent developments.
As always, we try to illustrate the power of the theory on many concrete examples.
Needless to say, we barely scratch the surface of this subject.

4.1 Markov chains

In the sequel X will denote a finite or countable set equipped with the discrete
topology. We will refer to it as the state space. The Borel sigma-algebra of X
coincides with the sigma-algebra 2X of all subsets of X .

4.1.1 Definition and basic concepts

Definition 4.1. A Markov chain with state space X is a sequence of random
variables

    X_n : (Ω, S, P) → (X, 2^X),  n ∈ N_0,

satisfying the Markov property

    P[ X_{n+1} = x_{n+1} | X_n = x_n ] = P[ X_{n+1} = x_{n+1} | X_n = x_n, …, X_0 = x_0 ],      (4.1.1)

∀n ∈ N, x_0, x_1, …, x_n, x_{n+1} ∈ X.
    The filtration associated to the Markov chain is the sequence of sigma-
subalgebras

    F_n := σ(X_0, …, X_n),  n ∈ N_0.

The probability distribution of X_0 is called the initial distribution of the system.
    The Markov chain is called homogeneous if, for any x, x′ ∈ X, and any n ∈ N
we have

    P[ X_{n+1} = x′ | X_n = x ] = P[ X_1 = x′ | X_0 = x ].

In this case the function

    Q : X × X → [0, 1],   Q(x_0, x_1) = Q_{x_0,x_1} = P[ X_1 = x_1 | X_0 = x_0 ]


 

is called the transition matrix¹ of the homogeneous Markov chain. We denote by
Markov(X, µ, Q) the collection of HMC-s with state space X, initial distribution
µ and transition matrix Q.  ⊓⊔

Remark 4.2. (a) Let us observe that the Markov property can be written in the
more compact form

    P[ X_{n+1} = x ‖ X_n ] = P[ X_{n+1} = x ‖ F_n ],  ∀n ∈ N, x ∈ X.      (4.1.2)

In view of Proposition 1.172, the last property is equivalent to the conditional
independence

    X_{n+1} ⊥⊥ F_{n−1} given X_n,  ∀n ∈ N.      (4.1.3)

Exercise 1.51 shows that this is also equivalent to the condition

    X_{n+1} ⊥⊥ F_n given X_n,  ∀n ∈ N.      (4.1.4)

One can show that this is further equivalent to

    σ(X_{n+1}, X_{n+2}, …) ⊥⊥ F_n given X_n.      (4.1.5)

This is colloquially expressed as saying that the future is conditionally independent
of the past given the present.
(b) It is convenient to think of a Markov chain with state space X as describing the
random walk of a grasshopper hopscotching on the elements of X . The decision
where to jump next is not influenced by the past, but only by the current location
and the current time. For a homogeneous Markov chain the decision where to
jump next depends only on the current location and not on the “time” n when the
grasshopper reaches that state. Thus Qx0 ,x1 is the probability that the grasshopper,
currently located at x0 , will jump to x1 .
We can represent an HMC with state space X and transition matrix Q as a
directed graph (loops allowed) with vertex set X constructed as follows: there is a
directed edge from x0 to x1 if and only if Qx0 ,x1 > 0. t
u

If (X_n)_{n≥0} is a homogeneous Markov chain (or HMC for brevity), then its tran-
sition matrix Q is stochastic, i.e.,

    Q_{x_0,x_1} ≥ 0,   Σ_{x∈X} Q_{x_0,x} = 1,  ∀x_0, x_1 ∈ X.      (4.1.6)

In other words, the entries of the matrix Q are nonnegative and the sum of the
entries in each row is equal to 1.
    If µ_n is the distribution of X_n, then, for any x ∈ X we have

    P[ X_{n+1} = x ] = Σ_{x′∈X} P[ X_n = x′ ] Q_{x′,x} = Σ_{x′∈X} µ_n[x′] Q_{x′,x}.
¹I made the decision to break with the tradition and use the letter Q to denote the transition
matrix after teaching this topic and realizing that there were too many P's on the blackboard and
this sometimes confused the audience.

Think of µ_n and µ_{n+1} as matrices consisting of a single row. We can rewrite the
above equality as an equality of matrices µ_{n+1} = µ_n Q. In particular,

    µ_n = µ_0 Q^n,      (4.1.7)

where Q^n denotes the n-th power of the matrix Q, Q^n = ( Q^n_{x,y} )_{x,y∈X}. From
(4.1.7) we deduce that

    P[ X_n = x_n | X_0 = x_0 ] = Q^n_{x_0,x_n}.      (4.1.8)

For this reason the matrix Q^n is also known as the n-th step transition matrix.
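The matrix relation (4.1.7) is directly computable, and for well-behaved chains µ_0 Q^n converges to a stationary distribution — the phenomenon behind Markov's sand-pile observation. A small sketch with a two-state chain (my example, not from the book):

```python
def step(mu, Q):
    """One transition: the row vector mu becomes mu Q."""
    n = len(mu)
    return [sum(mu[i] * Q[i][j] for i in range(n)) for j in range(n)]

Q = [[0.9, 0.1],
     [0.2, 0.8]]          # stochastic: each row sums to 1
mu = [1.0, 0.0]           # initial distribution concentrated at state 0
for _ in range(100):
    mu = step(mu, Q)
print(mu)                 # approaches the stationary distribution (2/3, 1/3)
```

For this Q the stationary distribution solves πQ = π, giving π = (2/3, 1/3), and the convergence rate is governed by the second eigenvalue 0.7, so 100 steps already reach machine precision.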
    Let us show that given any matrix Q : X × X → [0, 1] satisfying (1.2.19) and any
probability measure µ on X, there exists a homogeneous Markov chain with state
space X, initial distribution µ and transition matrix Q, i.e., Markov(X, µ, Q) ≠ ∅.
    Observe that we can view Q as a kernel or random probability measure

    Q̂ : X × 2^X → [0, 1],   (x, A) ↦ Q̂_x[A] = Σ_{a∈A} Q_{x,a}.

Note that Q̂_x[−] is a probability measure on X. It is described by row x of the
matrix Q.
    Consider the set X^{N_0} equipped with the natural product sigma-algebra E; see
Definition 1.192. In this case it coincides with the sigma-algebra generated by the
π-system consisting of the cylinders

    C_{s_0,s_1,…,s_k} := { x = (x_n)_{n∈N_0} ∈ X^{N_0} ; x_i = s_i, ∀i = 0, …, k }.

Let us observe that there exists a probability measure P_µ : E → [0, 1] uniquely
determined by the conditions

    P_µ[ C_{s_0,s_1,…,s_k} ] = µ[s_0] Π_{i=1}^k Q_{s_{i−1},s_i}.      (4.1.9)

To prove that such a measure does indeed exist for any µ and Q we will rely on
Kolmogorov's existence theorem, Theorem 1.195.
    The equalities (4.1.9) define probability measures P_k = P^{µ,Q}_k on the product
spaces X^{{0,1,…,k}} by setting

    P_k[ (s_0, …, s_k) ] := µ[s_0] Π_{i=1}^k Q_{s_{i−1},s_i}.      (4.1.10)

Note that for f : X^{{0,1,…,k}} → R we have

    ∫_{X^{{0,1,…,k}}} f(x_0, …, x_k) P_k[ dx_0 ⋯ dx_k ]
      = Σ_{x_0∈X} Σ_{x_1∈X} ⋯ Σ_{x_k∈X} µ[x_0] Q_{x_0,x_1} ⋯ Q_{x_{k−1},x_k} f(x_0, …, x_k).      (4.1.11)
July 19, 2022 15:9 ws-book961x669 An Introduction to Probability 12800-main page 371

Markov chains 371

The family of measures (P_k)_{k≥0} is projective since the transition matrix Q is stochas-
tic. Indeed,

    P_{k+1}[ (s_0, …, s_k) × X ] = Σ_{x∈X} P_{k+1}[ (s_0, …, s_k, x) ]

      = µ[s_0] ( Π_{i=1}^k Q_{s_{i−1},s_i} ) Σ_{x∈X} Q_{s_k,x}

      = µ[s_0] Π_{i=1}^k Q_{s_{i−1},s_i} = P_k[ (s_0, …, s_k) ],      (4.1.12)

since Σ_{x∈X} Q_{s_k,x} = 1.
    Kolmogorov's existence theorem then implies the existence of P_µ ∈ Prob( X^{N_0} )
satisfying (4.1.9). Note that

    P_µ = Σ_{x∈X} µ[x] P_x,   P_x := P_{δ_x}.      (4.1.13)
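The projectivity computation (4.1.12) is just row-stochasticity of Q; on a finite alphabet one can verify it by brute-force enumeration of the cylinder probabilities (4.1.9). The sketch below is my illustration, not from the book:

```python
from itertools import product

Q = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}   # stochastic transition matrix
mu = {0: 0.5, 1: 0.5}                             # initial distribution

def cylinder_prob(path):
    """P_mu of the cylinder {X_0 = s_0, ..., X_k = s_k}, formula (4.1.9)."""
    p = mu[path[0]]
    for prev, cur in zip(path, path[1:]):
        p *= Q[prev][cur]
    return p

# Projectivity: summing out the last coordinate of a length-3 cylinder
# recovers the corresponding length-2 cylinder, as in (4.1.12).
for s in product((0, 1), repeat=2):
    marginal = sum(cylinder_prob(s + (x,)) for x in (0, 1))
    assert abs(marginal - cylinder_prob(s)) < 1e-12

# And the whole length-3 family is a probability distribution:
total = sum(cylinder_prob(s) for s in product((0, 1), repeat=3))
print(total)   # 1.0 up to rounding
```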

    For n ∈ N_0 we denote by E_n the sub-sigma-algebra of E generated by X_0,
X_1, …, X_n. Note that P_n can be identified with the restriction of P_µ to E_n.
    For µ ∈ Prob(X) we denote by E_µ the expectation (integral) with respect to P_µ,

    E_µ : L¹( X^{N_0}, E, P_µ ) → R,   E_µ[ F ] = ∫_{X^{N_0}} F(x) P_µ[dx].      (4.1.14)

For x ∈ X we set

    E_x := E_{δ_x}.      (4.1.15)

We have a shift operator

    Θ : X^{N_0} → X^{N_0},   Θ(x_0, x_1, x_2, …) = (x_1, x_2, …).

Note that X_n = X_0 ∘ Θ^n, where Θ^n = Θ ∘ ⋯ ∘ Θ (n times).
n

Theorem 4.3. Consider the random variables

    X_n : X^{N_0} → X,   X_n(x) = x_n,  n ∈ N_0.

Then the stochastic process (X_n)_{n∈N_0} is an HMC, defined on (X^{N_0}, E, P_µ), with
state space X, transition matrix Q and initial distribution µ. The probability space
(X^{N_0}, E, P_µ) is called the path space of this HMC.
    Moreover, if F ∈ L¹( X^{N_0}, E, P_µ ), then

    E_µ[ F ∘ Θ^n ‖ E_n ] = E_µ[ F ∘ Θ^n ‖ X_n ].      (4.1.16)
July 19, 2022 15:9 ws-book961x669 An Introduction to Probability 12800-main page 372

372 An Introduction to Probability

Proof. For each $x$ we have a probability measure $Q_x$ on $\mathcal{X}$ given by
\[
Q_x\big[\{x'\}\big]=Q_{x,x'},\quad\forall x'\in\mathcal{X}.
\]
We will show that for any $A\subset\mathcal{X}$ we have the equality of random variables
\[
\mathbb{P}\big[X_{n+1}\in A\,\|\,\mathcal{E}_n\big]=Q_{X_n}\big[A\big]=\sum_{a\in A}Q_{X_n,a}.\tag{4.1.17}
\]
Let $B\in\mathcal{E}_n$. It is a cylinder of the form
\[
B=\{X_0\in B_0,\dots,X_n\in B_n\},\quad B_0,B_1,\dots,B_n\subset\mathcal{X}.
\]
Then
\[
\mathbb{E}\big[\boldsymbol{I}_A(X_{n+1})\boldsymbol{I}_B\big]=\mathbb{P}_\mu\big[\{X_{n+1}\in A\}\cap B\big]=\mathbb{P}_\mu\big[X_0\in B_0,\dots,X_n\in B_n,\;X_{n+1}\in A\big]
\stackrel{(4.1.11)}{=}\int_B Q_{X_n}\big[A\big]\,d\mathbb{P}_\mu.
\]
This proves (4.1.17).
The random measure $Q_{X_n}$ is a regular version of the conditional probability $\mathbb{P}\big[X_{n+1}\in-\,\|\,X_n\big]$, i.e.,
\[
Q_{X_n}\big[S\big]=\mathbb{P}\big[X_{n+1}\in S\,\|\,X_n\big],\quad\forall S\subset\mathcal{X}.
\]
Using Proposition 1.178 we deduce that for every bounded function $f:\mathcal{X}\to\mathbb{R}$ we have
\[
\mathbb{E}\big[f(X_{n+1})\,\|\,\mathcal{E}_n\big]=\sum_{x\in\mathcal{X}}Q_{X_n,x}f(x).\tag{4.1.18}
\]

Let $\mathcal{M}\subset L^1\big(\mathcal{X}^{\mathbb{N}_0},\mathcal{E},\mathbb{P}_\mu\big)$ denote the collection of functions $F$ satisfying (4.1.16). Clearly $\mathcal{M}$ is a vector space and, if $F_n$ is a sequence in $\mathcal{M}$ such that $F_n\nearrow F$, $F\in L^1$, then $F\in\mathcal{M}$. To show that $\mathcal{M}=L^1$ we use the Monotone Class Theorem, so it suffices to show that there exists a $\pi$-system $\mathcal{C}\subset\mathcal{E}$ that generates $\mathcal{E}$ such that $\boldsymbol{I}_C\in\mathcal{M}$ for all $C\in\mathcal{C}$. Denote by $\mathcal{C}$ the set of cylinders
\[
C_{A_0,A_1,\dots,A_N}:=\big\{x\in\mathcal{X}^{\mathbb{N}_0};\;x_i\in A_i,\;i=0,1,\dots,N\big\}.
\]
Note that
\[
\boldsymbol{I}_{C_{A_0,\dots,A_N}}=\prod_{k=0}^{N}\boldsymbol{I}_{\{X_k\in A_k\}},\qquad
\boldsymbol{I}_{C_{A_0,\dots,A_N}}\circ\Theta^n=\prod_{k=0}^{N}\boldsymbol{I}_{\{X_{n+k}\in A_k\}}.
\]
By definition $\mathcal{C}$ generates $\mathcal{E}$. Since $\mathcal{M}$ is a vector space it suffices to check that $\boldsymbol{I}_C\in\mathcal{M}$ for $C\in\mathcal{C}$ of the form
\[
C=C_{A_0,\dots,A_N},\quad A_k=\{x_k\},\;x_k\in\mathcal{X},\;k=0,1,\dots,N.
\]

To verify (4.1.16) for sets of this form and arbitrary $n$ we argue by induction on $N$. For $N=1$ this follows from (4.1.17). For the inductive step note that
\[
\mathbb{E}\big[\boldsymbol{I}_{\{X_n=x_0,X_{n+1}=x_1,\dots,X_{n+N}=x_N\}}\,\|\,\mathcal{E}_n\big]
=\mathbb{E}\Big[\prod_{k=0}^{N}\boldsymbol{I}_{\{X_{n+k}=x_k\}}\,\Big\|\,\mathcal{E}_n\Big]
\]
\[
=\mathbb{E}\Big[\boldsymbol{I}_{\{X_n=x_0\}}\,\mathbb{E}\Big[\prod_{k=1}^{N}\boldsymbol{I}_{\{X_{n+k}=x_k\}}\,\Big\|\,\mathcal{E}_{n+1}\Big]\,\Big\|\,\mathcal{E}_n\Big]
\]
(use the inductive assumption)
\[
=\mathbb{E}\Big[\boldsymbol{I}_{\{X_n=x_0\}}\,\underbrace{\mathbb{E}\Big[\prod_{k=1}^{N}\boldsymbol{I}_{\{X_{n+k}=x_k\}}\,\Big\|\,X_{n+1}\Big]}_{=:f(X_{n+1})}\,\Big\|\,\mathcal{E}_n\Big]
\]
\[
=\mathbb{E}\big[\boldsymbol{I}_{\{X_n=x_0\}}f(X_{n+1})\,\|\,X_n\big]
=\mathbb{E}\Big[\boldsymbol{I}_{\{X_n=x_0\}}\,\underbrace{\mathbb{E}\Big[\prod_{k=1}^{N}\boldsymbol{I}_{\{X_{n+k}=x_k\}}\,\Big\|\,X_{n+1}\Big]}_{=f(X_{n+1})}\,\Big\|\,X_n\Big]
\quad(\sigma(X_n)\subset\mathcal{E}_{n+1})
\]
\[
=\mathbb{E}\Big[\prod_{k=0}^{N}\boldsymbol{I}_{\{X_{n+k}=x_k\}}\,\Big\|\,X_n\Big].
\]
□

Remark 4.4. We have deduced (4.1.16) relying on the Markov property. The above proof shows that the Markov property (4.1.17) is a special case of (4.1.16). For this reason we can take (4.1.16) as the definition of the Markov property. □

Given a homogeneous Markov chain $X_n:(\Omega,\mathcal{S},\mathbb{P})\to\mathcal{X}$, $n\ge 0$, with state space $\mathcal{X}$, initial distribution $\mu$ and transition matrix $Q$, we obtain a measurable map
\[
\vec{X}:(\Omega,\mathcal{S})\to\big(\mathcal{X}^{\mathbb{N}_0},\mathcal{E}\big),\quad \omega\mapsto\vec{X}(\omega)=\big(X_n(\omega)\big)_{n\ge 0}.
\]
The distribution of the Markov chain is the pushforward measure
\[
\mathbb{P}_{\vec{X}}:=\vec{X}_{\#}\mathbb{P}\in\mathrm{Prob}\big(\mathcal{X}^{\mathbb{N}_0},\mathcal{E}\big).
\]
It is uniquely determined by the equalities
\[
\mathbb{P}_{\vec{X}}\big[C_{s_0,s_1,\dots,s_k}\big]:=\mathbb{P}\big[X_0=s_0,\dots,X_k=s_k\big]=\mu[s_0]\prod_{i=1}^{k}Q_{s_{i-1},s_i}.\tag{4.1.19}
\]
We deduce
\[
\mathbb{P}_{\vec{X}}=\mathbb{P}_\mu.
\]
For every $F\in L^1\big(\mathcal{X}^{\mathbb{N}_0},\mathcal{E},\mathbb{P}_\mu\big)$ we have
\[
\mathbb{E}_{\mathbb{P}}\big[F(X_0,X_1,\dots)\big]=\mathbb{E}_\mu\big[F\big]=\int_{\mathcal{X}^{\mathbb{N}_0}}F(x)\,\mathbb{P}_\mu\big[dx\big].
\]
This is a special case of the change of variables formula (1.2.21).
This shows that the distribution of the Markov chain is uniquely determined by the initial distribution $\mu\in\mathrm{Prob}(\mathcal{X})$ and the transition matrix $Q$.
Remark 4.5. One can define any HMC on probability spaces other than $\mathcal{X}^{\mathbb{N}_0}$. Here is such a construction corresponding to state space $\mathcal{X}$, transition matrix $Q$ and initial probability distribution $\mu$. We set $\mu_x:=\mu[x]$.
First, a little bit of terminology. We say that an interval is convenient if it is either empty or of the form $[a,b)$, $a<b$. If $[a,b)$, $[c,d)$ are nonempty convenient intervals, then we say that $[a,b)$ precedes $[c,d)$, and we write $[a,b)\prec[c,d)$, if $b\le c$. The empty set is allowed to precede or succeed any nonempty convenient interval. Assume that $\mathcal{X}$ is a subset of $\mathbb{N}$. As such, it is equipped with a total order.
The probability space is the unit interval $[0,1)$ equipped with the Lebesgue measure. The random variables $X_n$ depend on the choice of initial distribution, and are defined inductively as follows.

• Partition $[0,1)$ into convenient intervals $I_x=I_x^0$, $x\in\mathcal{X}$, of Lebesgue measures $\mu_x=\lambda\big[I_x^0\big]$, such that
\[
x<x'\Rightarrow I_x^0\prec I_{x'}^0.
\]
• Partition each interval $I_{x_0}^0$ into convenient intervals $I_{x_0,x_1}^1$ of sizes $\mu_{x_0}Q_{x_0,x_1}$, $x_0,x_1\in\mathcal{X}$, such that
\[
x<x'\Rightarrow I_{x_0,x}^1\prec I_{x_0,x'}^1.
\]
• Inductively, partition each interval $I_{x_0,x_1,\dots,x_n}^n$ into convenient intervals $I_{x_0,x_1,\dots,x_n,x_{n+1}}^{n+1}$ of sizes
\[
\lambda\big[I_{x_0,x_1,\dots,x_n}^n\big]\,Q_{x_n,x_{n+1}}=\mu_{x_0}\prod_{j=0}^{n}Q_{x_j,x_{j+1}},
\]
such that
\[
x<x'\Rightarrow I_{x_0,\dots,x_n,x}^{n+1}\prec I_{x_0,\dots,x_n,x'}^{n+1}.
\]

Now define $X_n:[0,1)\to\mathcal{X}$ by setting
\[
X_n(t)=x_n\ \text{ if }\ t\in\bigcup_{x_0,x_1,\dots,x_{n-1}\in\mathcal{X}}I_{x_0,x_1,\dots,x_n}^n.
\]
Note that these random variables are defined on the same probability space $\big([0,1),\mathcal{B}_{[0,1)},\lambda\big)$, but they depend on the choice of the initial distribution.
This is different from the construction based on path spaces. In that case we are given measurable maps defined on the same measurable space and we obtain different HMCs by choosing different probability measures. □
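The nested-interval construction can be tried out numerically. Below is a minimal sketch, assuming a finite state space $\{0,\dots,m-1\}$; the function name `chain_path` and the scanning of consecutive subintervals are choices of the sketch, not the book's notation.

```python
def chain_path(t, mu, Q, n):
    """Evaluate X_0(t), ..., X_n(t) for t in [0, 1): level 0 splits [0, 1)
    in proportion to mu, and each interval I^k_{x_0,...,x_k} is split in
    proportion to the row Q[x_k] of the transition matrix."""
    lo, hi = 0.0, 1.0
    weights = mu               # relative sizes of the next-level subintervals
    path = []
    for _ in range(n + 1):
        x, a = 0, lo
        width = hi - lo
        # Scan the consecutive subintervals of [lo, hi) until the one containing t.
        while a + weights[x] * width <= t:
            a += weights[x] * width
            x += 1
        path.append(x)
        lo, hi = a, a + weights[x] * width
        weights = Q[x]
    return path

# Fair-coin chain: mu = (1/2, 1/2) and a doubly stochastic Q; the path of t
# is then read off from the binary expansion of t.
mu = [0.5, 0.5]
Q = [[0.5, 0.5], [0.5, 0.5]]
```

For this particular choice the $n$-th binary digit of $t$ is exactly $X_n(t)$, which makes the dependence of the $X_n$ on the initial distribution easy to visualize.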

4.1.2 Examples
Homogeneous Markov chains appear in many diverse situations. According to the discussion in the previous subsection, to describe an HMC it suffices to describe the state space $\mathcal{X}$ and the transition matrix $Q$. We will remain vague about the initial distribution $\mu$.

Example 4.6 (Gambler's ruin). Consider the gambler's ruin problem discussed in Example 3.72. The state space is $\mathcal{X}=\{0,1,\dots,N\}$ and $X_n$ is the fortune of the gambler at time $n$. The gambler flips a fair coin with two faces labeled $\pm 1$. If the gambler's fortune is strictly between $0$ and $N$, then it changes by the amount shown on the face of the coin. The game stops when the fortune reaches either $0$ or $N$. Concretely,
\[
Q_{N,k}=0,\;\forall k<N,\quad Q_{N,N}=1,\qquad Q_{0,j}=0,\;\forall j>0,\quad Q_{0,0}=1,
\]
\[
Q_{k,k+1}=Q_{k,k-1}=\frac{1}{2},\qquad Q_{k,j}=0\ \text{ if }\ |k-j|>1,\quad 0<k,j<N.
\]
The directed graph describing this HMC is depicted in Figure 4.1 where, for clarity, we have omitted the loops at $0$ and $N$. □

Fig. 4.1 The gambler's ruin chain (each interior edge carries probability 0.5).
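As a numerical companion to the example, here is a sketch (with an illustrative $N=4$, not taken from the text) of the transition matrix and of the pushforward $\mu\mapsto\mu Q$ of distributions; iterating it shows the mass accumulating at the absorbing barriers.

```python
N = 4
Q = [[0.0] * (N + 1) for _ in range(N + 1)]
Q[0][0] = Q[N][N] = 1.0                  # absorbing states 0 and N
for k in range(1, N):
    Q[k][k - 1] = Q[k][k + 1] = 0.5      # fair coin in the interior

def push(dist, Q):
    """One step of the chain dynamics on distributions: mu -> mu Q."""
    n = len(Q)
    return [sum(dist[i] * Q[i][j] for i in range(n)) for j in range(n)]

dist = [0.0, 1.0, 0.0, 0.0, 0.0]         # start with fortune 1
for _ in range(200):
    dist = push(dist, Q)
# For the fair game the absorption probabilities are linear in the starting
# fortune: starting from k, the chain reaches N with probability k/N.
```

After 200 iterations essentially all mass sits at the two barriers, in the ratio predicted by the ruin probabilities $k/N$ and $1-k/N$.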

Example 4.7 (The Ehrenfest Urn). Consider the following situation. There are $B$ balls in two urns. Equivalently, think of an urn with two chambers. Pick one of these $B$ balls uniformly at random and move it to the other box/chamber. Denote by $X_n$ the number of balls in the left box at time $n$. Then $(X_n)_{n\ge 0}$ is an HMC with transition probabilities
\[
Q_{i,i+1}=\frac{B-i}{B},\;i=0,1,\dots,B-1,\qquad Q_{i,i-1}=\frac{i}{B},\;i=1,\dots,B,\qquad Q_{i,j}=0,\;|i-j|>1.
\]
This HMC is known as the Ehrenfest urn. Note that during this process it is more likely that a ball moves from the more crowded box to the less crowded one, similarly to what happens in diffusion processes. □

Example 4.8 (Random placement of balls). Consider a sequence of independent trials, each consisting of randomly placing a ball in one of $r$ given urns. We say that the system is in state $k$ if exactly $k$ urns are occupied.
We obtain an HMC with state space $\{0,1,\dots,r\}$ and transition probabilities
\[
Q_{j,j}=\frac{j}{r},\qquad Q_{j,j+1}=\frac{r-j}{r},\quad 0\le j<r,
\]
and of course $Q_{j,k}=0$ for any other pair $(j,k)$. If $X_0=0$, so initially all urns are empty, then $X_n=r-N_{r,n}$, where $N_{r,n}$ is the number of empty urns investigated in Exercise 2.19. □

Example 4.9 (Random walk on $\mathbb{Z}^d$). Suppose that $(X_n)_{n\ge 1}$ are i.i.d. $\mathbb{Z}^d$-valued random variables. Denote by $\pi$ their common distribution. Set
\[
S_0=0,\quad S_n=X_1+\cdots+X_n.
\]
Then the random process $(S_n)_{n\in\mathbb{N}_0}$ is an HMC with transition matrix
\[
Q_{m,n}=\mathbb{P}\big[X_1=n-m\big]=\pi\big[n-m\big],\quad m,n\in\mathbb{Z}^d.
\]
One can imagine this process as a person starting at the origin of $\mathbb{Z}^d$ and walking with random step sizes, $X_n$ being the size of the $n$-th step.
A standard random walk is obtained as follows. Denote by $e_1,\dots,e_d$ the canonical basis of $\mathbb{Z}^d$ and choose $\pi$ to be uniformly distributed on the set $\{\pm e_1,\dots,\pm e_d\}$, i.e.,
\[
\pi\big[\pm e_k\big]=\frac{1}{2d},\quad k=1,\dots,d.
\]
For example, when $d=1$, this corresponds to a random walk on $\mathbb{Z}$ where, at each moment, going one step ahead or one step back is decided by flipping a fair coin. □
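The distribution of $S_n$ can be computed exactly by convolving the step distribution $\pi$ with itself $n$ times. The sketch below (illustrative, for $d=1$) makes the transition rule $Q_{m,n}=\pi[n-m]$ concrete.

```python
def convolve(dist, pi):
    """Distribution of S + X, where S ~ dist and X ~ pi are independent."""
    out = {}
    for s, p in dist.items():
        for z, q in pi.items():
            out[s + z] = out.get(s + z, 0.0) + p * q
    return out

pi = {1: 0.5, -1: 0.5}      # standard walk on Z: steps +/-1 with probability 1/2
dist = {0: 1.0}             # S_0 = 0
for _ in range(4):
    dist = convolve(dist, pi)
# P[S_4 = 0] = C(4, 2) / 2^4 = 6/16
```

Note the parity constraint visible in the result: after four steps only the even sites carry mass, in line with the period-2 behavior discussed later in Example 4.26.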

Example 4.10 (Simple random walk on a graph). Consider an undirected graph $G=(V,E)$, where $V$ is the set of vertices and $E$ denotes the set of edges. We do not allow multiple edges connecting two vertices. For every vertex $v$ we denote by $\deg(v)$ its degree, i.e., the number of edges of $E$ at $v$. We assume that the graph is locally finite, i.e., $\deg(v)<\infty$, $\forall v\in V$.
Suppose now that a grasshopper hopscotches on the set of vertices $V$ according to the following rule: if situated at a vertex $v_0$, the grasshopper will jump to one of the neighbors of $v_0$ in $V$ chosen uniformly at random. Denote by $X_n$ the location of the grasshopper at time $n$. Then $(X_n)_{n\ge 0}$ is an HMC with state space $V$ and transition matrix
\[
Q_{v_0,v_1}=\begin{cases}\dfrac{1}{\deg(v_0)}, & \text{if }v_0\text{ and }v_1\text{ are neighbors},\\[2mm] 0, & \text{otherwise}.\end{cases}
\]
□
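The transition matrix here is determined by the adjacency structure alone. A small sketch (the function name and the sample graph are illustrative choices):

```python
def walk_matrix(adj):
    """Q[v][w] = 1/deg(v) for each neighbor w of v (absent entries are 0)."""
    return {v: {w: 1.0 / len(nbrs) for w in nbrs} for v, nbrs in adj.items()}

# A triangle a-b-c with a pendant vertex d attached to a.
adj = {"a": ["b", "c", "d"], "b": ["a", "c"], "c": ["a", "b"], "d": ["a"]}
Q = walk_matrix(adj)
```

Each row of `Q` sums to 1, as it must for a stochastic matrix, and the pendant vertex `d` jumps to `a` with probability one.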

Example 4.11 (The branching process). Consider again the branching process with reproduction law $\mu$ described in Example 3.8. Recall that it deals with the evolution of a population of individuals of a species, with $\mu[j]$ denoting the probability that a given individual will have $j\in\mathbb{N}_0$ offspring.
Denote by $Z_n$ the size of the $n$-th generation population. We assume that $Z_0=1$. Then $(Z_n)_{n\ge 0}$ is an HMC with state space $\mathbb{N}_0$.
To see this, choose a sequence of i.i.d. random variables $(\xi_k)_{k\in\mathbb{N}}$ with common distribution $\mu$. Then
\[
\mathbb{P}\big[Z_{n+1}=j\,\|\,Z_n=i\big]=\mathbb{P}\big[\xi_1+\cdots+\xi_i=j\big].
\]
The distribution of the random variable $\xi_1+\cdots+\xi_i$ is $\mu^{*i}$, the convolution of $i$ copies of $\mu$. More precisely,
\[
\mu^{*i}\big[j\big]=\sum_{k_1+\cdots+k_i=j}\mu\big[k_1\big]\cdots\mu\big[k_i\big].
\]
The transition matrix is then $Q_{i,j}=\mu^{*i}\big[j\big]$. □
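Since the rows of the branching transition matrix are convolution powers of $\mu$, they can be generated mechanically. A sketch, with a hypothetical reproduction law:

```python
def convolve(a, b):
    """Convolution of two distributions on N_0 given as dicts j -> probability."""
    out = {}
    for j1, p1 in a.items():
        for j2, p2 in b.items():
            out[j1 + j2] = out.get(j1 + j2, 0.0) + p1 * p2
    return out

def branching_row(mu, i):
    """Row i of the transition matrix: Q_{i,j} = mu^{*i}[j]."""
    row = {0: 1.0}               # mu^{*0} = delta_0: an empty generation stays empty
    for _ in range(i):
        row = convolve(row, mu)
    return row

mu = {0: 0.25, 1: 0.5, 2: 0.25}  # illustrative reproduction law
row2 = branching_row(mu, 2)      # offspring distribution of a 2-individual generation
```

Note that row 0 is $\delta_0$, i.e., the state $0$ (extinction) is absorbing.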

Example 4.12 (Queueing). Customers arrive for service and take their place in a waiting line. During each period of time one customer is served, if at least one customer is present. During a service period new customers may arrive. We assume that the number of customers that arrive during the $n$-th service period is a random variable $\xi_n$, and that the random variables $\xi_1,\xi_2,\dots$ are i.i.d. with common distribution $\mu\in\mathrm{Prob}(\mathbb{N}_0)$. We set $\mu_i:=\mu[i]$, $i\in\mathbb{N}_0$. For notational convenience we set $\mu_n=0$ for $n<0$.
We denote by $X_n$ the number of customers in line at the end of the $n$-th period. Note that
\[
X_{n+1}=(X_n-1)_++\xi_n.
\]
The sequence $(X_n)_{n\ge 0}$ is an HMC with state space $\mathbb{N}_0$ and transition matrix
\[
Q_{i,j}=\begin{cases}\mu_j, & i=0,\\ \mu_{j-i+1}, & i>0.\end{cases}
\]
□

Example 4.13 (Noisy dynamical systems). Suppose that $T:\mathcal{X}\to\mathcal{X}$ is a selfmap of an at most countable set $\mathcal{X}$. This defines a dynamical system $(T^n)_{n\in\mathbb{N}}$ which can be viewed as a trivial homogeneous Markov chain with transition matrix
\[
Q_{x,x'}=\begin{cases}1, & x'=T(x),\\ 0, & x'\ne T(x).\end{cases}
\]
We can obtain more general Markov chains if we work with "noisy" selfmaps. More precisely, suppose that $(S,\mathcal{S})$ is a measurable space and
\[
T:\mathcal{X}\times S\to\mathcal{X},\quad (x,s)\mapsto T_s(x)
\]
is a measurable map. In other words, $(T_s)$ is a measurable family of maps $\mathcal{X}\to\mathcal{X}$. Fixing an $S$-valued "noise", i.e., a sequence
\[
Z_n:(\Omega,\mathcal{F},\mathbb{P})\to(S,\mathcal{S}),\quad n\in\mathbb{N},
\]
of i.i.d. $S$-valued random variables, and an $\mathcal{X}$-valued random variable $X_0$ independent of the $Z$'s, we obtain a noisy dynamical system
\[
X_{n+1}=T_{Z_{n+1}}(X_n),\quad\forall n\in\mathbb{N}_0.
\]
Hence
\[
X_n=T_{Z_n}\circ\cdots\circ T_{Z_1}(X_0).
\]
The sequence $(X_n)_{n\ge 0}$ is an HMC. Indeed,
\[
\mathbb{P}\big[X_{n+1}=x_{n+1}\,\|\,X_n=x_n,\dots,X_0=x_0\big]
=\mathbb{P}\big[T_{Z_{n+1}}(x_n)=x_{n+1}\,\|\,X_n=x_n,\dots,X_0=x_0\big]
=\mathbb{P}\big[T_{Z_{n+1}}(x_n)=x_{n+1}\big],
\]
since the event $\{X_n=x_n,\dots,X_0=x_0\}$ belongs to the sigma-algebra generated by $X_0,Z_1,\dots,Z_n$ and thus is independent of $Z_{n+1}$. On the other hand, obviously
\[
\mathbb{P}\big[T_{Z_{n+1}}(x_n)=x_{n+1}\big]=\mathbb{P}\big[X_{n+1}=x_{n+1}\,\|\,X_n=x_n\big].
\]
This Markov chain is homogeneous since the random variables $Z_n$ are identically distributed.
The standard random walk on $\mathbb{Z}$ is a Markov chain generated in an obvious way by a random dynamical system defined by the map
\[
T:\mathbb{Z}\times\mathbb{Z}\to\mathbb{Z},\quad (m,z)\mapsto T_z(m)=m+z,
\]
and the noise described by a sequence of i.i.d. Rademacher random variables $(Z_n)_{n\in\mathbb{N}}$. Then $X_{n+1}=X_n+Z_{n+1}$. One can show that any Markov chain can be produced in this fashion, as iterates of random maps. □
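The random-maps description translates directly into code. Here is a sketch of the standard walk as iterated random maps $T_z(m)=m+z$ (the helper names `T` and `run` are choices of the sketch):

```python
import random

def T(m, z):
    """The family of maps T_z(m) = m + z defining the walk."""
    return m + z

def run(x0, noise):
    """Iterate X_{n+1} = T_{Z_{n+1}}(X_n) along a given noise sequence."""
    x, path = x0, [x0]
    for z in noise:
        x = T(x, z)
        path.append(x)
    return path

random.seed(0)
noise = [random.choice([-1, 1]) for _ in range(10)]  # Rademacher noise
path = run(0, noise)
```

Swapping in a different family $T_s$ or a different noise distribution produces other chains; only the i.i.d. structure of the noise is needed for homogeneity.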

4.2 The dynamics of homogeneous Markov chains

In this section we will consistently adopt the dynamical point of view on Markov
chains described in Remark 4.2(b) and extract some useful consequences.

4.2.1 Classification of states


Suppose that (Xn )n≥0 is an HMC with state space X and transition matrix Q.

Definition 4.14. (a) A state $x_1\in\mathcal{X}$ is said to be accessible from a state $x_0\in\mathcal{X}$, and we denote this by $x_0\to x_1$, if $Q^n_{x_0,x_1}>0$ for some $n\in\mathbb{N}_0$.
(b) The states $x_0$ and $x_1$ communicate if $x_0\to x_1$ and $x_1\to x_0$. We indicate this using the notation $x_0\leftrightarrow x_1$. □

Recall that to an HMC with state space $\mathcal{X}$ we can associate a directed graph with vertex set $\mathcal{X}$; see Remark 4.2(b). A walk from $x$ to $x'$ in this graph is a sequence of vertices
\[
x=x_0,x_1,\dots,x_n=x'
\]
such that, for any $i=1,\dots,n$, there exists a directed edge from $x_{i-1}$ to $x_i$. If $x\ne x'$, then $x'$ is accessible from $x$ if and only if there is a walk from $x$ to $x'$.

Proposition 4.15. The communication relation "$\leftrightarrow$" is an equivalence relation.

Proof. Reflexivity: $x\leftrightarrow x$ since $Q^0_{x,x}=1$.
Symmetry: the relation is symmetric by definition.
Transitivity: suppose that $x_0\leftrightarrow x_1$ and $x_1\leftrightarrow x_2$. Then there exist $m,n\in\mathbb{N}_0$ such that
\[
Q^m_{x_0,x_1}>0\ \text{ and }\ Q^n_{x_1,x_2}>0.
\]
Observe that
\[
Q^{m+n}_{x_0,x_2}=Q^m_{x_0,x_1}Q^n_{x_1,x_2}+\underbrace{\sum_{x\in\mathcal{X}\setminus\{x_1\}}Q^m_{x_0,x}Q^n_{x,x_2}}_{\ge 0}>0.
\]
Hence $x_0\to x_2$. The opposite relation $x_2\to x_0$ is proved in identical fashion. □

Definition 4.16. The equivalence classes of the relation ↔ are called the commu-
nication classes of the given HMC. t
u

Example 4.17. (a) Consider the HMC associated to the gambler's ruin problem described in Example 4.6. The state space is $\{0,1,\dots,N\}$ and there are three communication classes
\[
C_0=\{0\},\quad C=\{1,\dots,N-1\},\quad C_N=\{N\}.
\]
Note that no state in $C$ is accessible from $C_0$ or $C_N$.
(b) The HMC associated to the Ehrenfest urn model in Example 4.7 has state space $\{0,1,\dots,B\}$ and any two states communicate, so there is only a single communication class.
(c) The HMC corresponding to the random placement of balls problem in Example 4.8 has state space $\{0,1,\dots,r\}$ and communication classes
\[
C_0=\{0\},\;C_1=\{1\},\dots,C_r=\{r\}.
\]
Note that for $j>i$, the class $C_j$ is accessible from the class $C_i$. □
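For a finite chain the communication classes can be computed mechanically: $y$ is accessible from $x$ iff there is a directed path of positive entries from $x$ to $y$, and a class is a maximal set of mutually accessible states. A sketch (using the gambler's ruin with an illustrative $N=3$):

```python
def accessible(Q, x):
    """All states reachable from x (including x itself, since n = 0 is allowed)."""
    seen, stack = {x}, [x]
    while stack:
        u = stack.pop()
        for v, p in enumerate(Q[u]):
            if p > 0 and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def communication_classes(Q):
    n = len(Q)
    reach = [accessible(Q, x) for x in range(n)]
    classes = []
    for x in range(n):
        cls = frozenset(y for y in reach[x] if x in reach[y])
        if cls not in classes:
            classes.append(cls)
    return classes

# Gambler's ruin with N = 3: classes {0}, {1, 2}, {3}.
Q = [[1, 0, 0, 0],
     [0.5, 0, 0.5, 0],
     [0, 0.5, 0, 0.5],
     [0, 0, 0, 1]]
```

The same reachability computation also decides closedness: a class is closed iff no positive entry leads out of it.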

Definition 4.18. Let (Xn )n∈N0 be an HMC with state space X and transition
matrix Q.

(i) A subset $C\subset\mathcal{X}$ is closed with respect to this HMC if no state outside $C$ is accessible from a state in $C$.
(ii) A subset of $\mathcal{X}$ is called irreducible if it is closed and contains no proper closed subset.
(iii) A state $x\in\mathcal{X}$ is called absorbing if the set $\{x\}$ is irreducible.
(iv) The HMC is called irreducible if its state space is irreducible.

□

Example 4.19. For the HMC corresponding to the random placement of balls
problem in Example 4.8 with state space {0, . . . , r}, all the subsets {k, k + 1, . . . , r}
are closed and the state r is absorbing. This is not an irreducible Markov chain. t u

Note that a subset $C\subset\mathcal{X}$ is closed if and only if for any $x\in C$ we have
\[
\sum_{y\in C}Q_{x,y}=1.
\]
Equivalently, this means that $\mathbb{P}\big[X_n\in C\,\|\,X_0\in C\big]=1$.
Using the intuition of the randomly hopping grasshopper, this says that once the grasshopper steps into a closed set it will be trapped there. In particular, this argument proves the following result.

Proposition 4.20. A closed subset of the state space of an HMC is a union of


communication classes. t
u

Fig. 4.2 An HMC with a single irreducible subset.

Example 4.21. Consider an HMC with associated digraph depicted in Figure 4.2.
It consists of three communication classes
C1 := {1, 2, 3, 4}, C2 := {5, 7, 8}, C3 := {6}.
The communication class C3 is closed while C1 and C2 are not. The only irreducible
set is C3 . In particular the state 6 is absorbing.

Fig. 4.3 Another HMC with a single irreducible subset.

Suppose now that we change the directions of the edges 4 → 5 and 4 → 6 as


depicted in Figure 4.3. This HMC has the same communication classes C1 , C2 , C3 ,
but this time only C1 is closed. t
u

Lemma 4.22. Let C ⊂ X be a closed subset. Then the following are equivalent.

(i) C is irreducible.
(ii) C is a communication class.

In particular, an HMC is irreducible if and only if it consists of a single com-


munication class.
Proof. The implication (ii) $\Rightarrow$ (i) follows from Proposition 4.20: a closed set is a union of communication classes.
(i) $\Rightarrow$ (ii) Suppose that $C$ is an irreducible subset of $\mathcal{X}$. In particular, $C$ is a union of communication classes
\[
C=\bigcup_{j=1}^{N}C_j,\quad N\in\mathbb{N}\cup\{\infty\}.
\]
Suppose that $N\ge 2$. Set $j_0:=1$. The class $C_{j_0}$ is not closed: otherwise it would be a proper closed subset of $C$, contradicting irreducibility. Hence there exist $j_1\ne j_0$, $x_{j_0}\in C_{j_0}$ and $x_{j_1}\in C_{j_1}$ such that $x_{j_0}\to x_{j_1}$. In fact, any state in $C_{j_1}$ is accessible from any state in $C_{j_0}$. We write this $C_{j_0}\to C_{j_1}$.
Next, we can find $j_2\notin\{j_0,j_1\}$ such that $C_{j_1}\to C_{j_2}$. Clearly no state in $C_{j_0}\cup C_{j_1}$ is accessible from $C_{j_2}$. Iterating, we obtain a (possibly finite) subsequence in $\mathbb{N}\cap[1,N]$,
\[
j_0,j_1,\dots,j_k,\dots,
\]
where the $j_k$'s are pairwise distinct, such that
\[
C_{j_0}\to C_{j_1}\to C_{j_2}\to\cdots
\]
and no state in $C_{j_0}\cup\cdots\cup C_{j_k}$ is accessible from $C_{j_{k+1}}$. Note that
\[
C'=\bigcup_{k\ge 1}C_{j_k}\subset C\setminus C_{j_0}
\]
is a proper closed subset of $C$, contradicting the fact that $C$ is irreducible. □
u

Definition 4.23. Suppose that $(X_n)_{n\in\mathbb{N}_0}$ is an HMC with state space $\mathcal{X}$ and transition matrix $Q$.

(i) The set of periods of a state $x\in\mathcal{X}$ is
\[
\mathcal{P}_x:=\big\{n\in\mathbb{N};\;Q^n_{x,x}>0\big\}.
\]
(ii) The period of a state $x$ is $d=d(x):=\gcd\mathcal{P}_x$, where "gcd" stands for greatest common divisor. When $\mathcal{P}_x=\emptyset$ we set $d(x):=\infty$.
(iii) A state $x$ is called aperiodic if $d(x)=1$.

□

Lemma 4.24. Let (Xn )n≥0 be an HMC with state space X and transition matrix
Q. Suppose that x ∈ X and d(x) < ∞. Then the following hold.

(i) The set Px is a semigroup of the additive semigroup (N, +).


(ii) There exists N ∈ N such that nd(x) ∈ Px , ∀n ≥ N .
(iii) If x ↔ y, then d(x) = d(y).

Proof. (i) Follows from the fact that $Q^{m+n}_{x,x}\ge Q^m_{x,x}Q^n_{x,x}$.
(ii) We claim that there exist $k\ge 2$ and $m_1,\dots,m_k\in\mathcal{P}_x$ such that
\[
d(x)=\gcd(m_1,m_2,\dots,m_k).
\]
Pick $m_1,m_2\in\mathcal{P}_x$ and set $d_1:=\gcd(m_1,m_2)$. Then $d\le d_1$. If $d<d_1$ define
\[
d_2:=\min\big\{\gcd(m_1,m_2,m);\;m\in\mathcal{P}_x\big\}.
\]
Then $d\le d_2\le d_1$. If $d_2<d_1$, choose $m_3\in\mathcal{P}_x$ such that
\[
d_2=\gcd(m_1,m_2,m_3)\ge d.
\]
If $d<d_2$ define
\[
d_3:=\min\big\{\gcd(m_1,m_2,m_3,m);\;m\in\mathcal{P}_x\big\}.
\]
If $d_3=d_2$ we stop because then $d_3=d$. If $d_3<d_2$, we iterate the above procedure. Clearly this procedure will stop after finitely many iterations.
An old result of I. Schur [159, Thm. 3.15.2] implies that the set
\[
\big\{m_1n_1+\cdots+m_kn_k;\;n_1,\dots,n_k\in\mathbb{N}\big\}
\]
contains all the sufficiently large multiples of $d$. Since $\mathcal{P}_x$ is a semigroup, this set is contained in $\mathcal{P}_x$, proving (ii).
(iii) For $x,y\in\mathcal{X}$ we set
\[
\mathcal{P}_{x,y}:=\big\{n\in\mathbb{N};\;Q^n_{x,y}>0\big\}.
\]
Thus $\mathcal{P}_x=\mathcal{P}_{x,x}$. Suppose $x\leftrightarrow y$. Note that
\[
\mathcal{P}_{x,y}+\mathcal{P}_{y,x}\subset\mathcal{P}_x.
\]
Hence $d(x)$ divides every element of $\mathcal{P}_{x,y}+\mathcal{P}_{y,x}$. From the inclusion
\[
\mathcal{P}_{x,y}+\mathcal{P}_y+\mathcal{P}_{y,x}\subset\mathcal{P}_x
\]
we deduce that $d(x)$ divides every element of $\mathcal{P}_y$, so $d(x)\,|\,d(y)$. Reversing the roles of $x$ and $y$ in the above argument we deduce $d(y)\,|\,d(x)$, so $d(x)=d(y)$. □

According to the above result, all the states of an irreducible HMC have the
same period so we can speak of the period of that HMC.

Definition 4.25. An irreducible HMC is called aperiodic if each of its states has
period 1. t
u

Example 4.26. (i) Each state of the standard random walk on $\mathbb{Z}$ has period 2.
More generally, given a vertex $v$ in a locally finite, connected graph, its set of periods with respect to the standard random walk coincides with the set of lengths of paths in the graph that start and end at $v$. Since there is such a path of length 2, we deduce that the vertex is aperiodic if and only if there exists a path of odd length starting and ending at $v$.
(ii) The Ehrenfest urn in Example 4.7 is irreducible with period 2. □
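For a finite chain the period can be computed by taking the gcd of the return lengths $n$ with $(Q^n)_{x,x}>0$ up to a finite horizon. A sketch (the cutoff `n_max` is an assumption of the sketch; it works once the gcd has stabilized):

```python
from math import gcd

def period(Q, x, n_max=20):
    """gcd of {n <= n_max : (Q^n)_{x,x} > 0}; returns 0 if no return occurs."""
    n = len(Q)
    P = [row[:] for row in Q]        # P = Q^1
    d = 0
    for step in range(1, n_max + 1):
        if step > 1:                 # update P to Q^step
            P = [[sum(P[i][k] * Q[k][j] for k in range(n)) for j in range(n)]
                 for i in range(n)]
        if P[x][x] > 0:
            d = gcd(d, step)
    return d

# Standard walk on the cycle Z/4Z: every state has period 2.
cycle = [[0, 0.5, 0, 0.5],
         [0.5, 0, 0.5, 0],
         [0, 0.5, 0, 0.5],
         [0.5, 0, 0.5, 0]]
```

Adding a self-loop anywhere (a "lazy" step) puts $1$ in the set of periods and makes the state aperiodic.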

Proposition 4.27. Let $(X_n)_{n\ge 0}$ be an irreducible HMC with state space $\mathcal{X}$, transition matrix $Q$, and period $d<\infty$. Fix $x_0\in\mathcal{X}$. Consider the HMC $(Y_n)_{n\ge 0}$ with state space $\mathcal{X}$, initial state $Y_0=x_0$ and transition matrix $T=Q^d$. Denote by $\mathcal{C}_T$ the set of communication classes of $T$. For each $x\in\mathcal{X}$ we denote by $[x]_T$ the $T$-communication class of $x$. Then the following hold.

(i) There exists a bijection $r=r_{x_0}:\mathcal{C}_T\to\mathbb{Z}/d\mathbb{Z}$ such that $r\big([x]_T\big)=k\bmod d$ iff there exists $n\in\mathbb{N}_0$ such that $Q^{nd+k}_{x_0,x}>0$.
(ii) If $Q_{x,y}>0$, then $r(y)\equiv r(x)+1\bmod d$.
(iii) Each $T$-communication class is $T$-closed.

Proof. As in the proof of Lemma 4.24 we set
\[
\mathcal{P}_{x,y}:=\big\{n\in\mathbb{N};\;Q^n_{x,y}>0\big\}.
\]
Let us observe that
\[
\forall x,y\in\mathcal{X},\;\forall n,m\in\mathcal{P}_{x,y}:\quad n\equiv m\bmod d.\tag{4.2.1}
\]
The claim is obviously true if $n=m$. Suppose that $n>m$. Pick $k\in\mathcal{P}_{y,x}$; then $m+k,\,n+k\in\mathcal{P}_{x,x}$, so $d$ divides $(n+k)-(m+k)=n-m$.
Thus, for any $x,y\in\mathcal{X}$ there exists $r=r(x,y)\in\{0,1,\dots,d-1\}$ such that
\[
\mathcal{P}_{x,y}\subset r(x,y)+d\mathbb{N}_0.
\]
For any $x\in\mathcal{X}$ we set $r(x):=r(x_0,x)$. We want to prove that
\[
[x]_T=[y]_T\iff r(x)=r(y).\tag{4.2.2}
\]
Indeed, suppose that $[x]_T=[y]_T$. Then there exists $n$ such that $T^n_{x,y}>0$, i.e., $Q^{nd}_{x,y}>0$. Fix $m\in\mathbb{N}_0$ such that $Q^m_{x_0,x}>0$. Then $Q^{m+nd}_{x_0,y}>0$. Clearly
\[
r(y)\equiv m+nd\equiv m\equiv r(x)\bmod d.
\]
Conversely, suppose that $r(x)=r(y)$. Fix $n_x,n_y\in\mathbb{N}_0$ such that $Q^{n_x}_{x_0,x},\,Q^{n_y}_{x_0,y}>0$. Choose $N$ large enough such that $Nd>n_x$ and $Nd\in\mathcal{P}_{x_0}$. Then $Nd-n_x\in\mathcal{P}_{x,x_0}$ and $n_y\in\mathcal{P}_{x_0,y}$, so $Nd-n_x+n_y\in\mathcal{P}_{x,y}$. Hence
\[
0=r(y)-r(x)\equiv n_y-n_x\bmod d.
\]
We deduce that there exists $m\in\mathbb{N}_0$ such that $md=Nd-n_x+n_y$. Hence $T^m_{x,y}>0$. In other words, $y$ is $T$-accessible from $x$. A symmetric argument shows that $x$ is $T$-accessible from $y$, so that $[x]_T=[y]_T$. This proves (4.2.2) and (i).
The statements (ii) and (iii) follow immediately from the congruence
\[
r(x,z)\equiv r(x,y)+r(y,z)\bmod d,
\]
so that $r(y)\equiv r(x)+r(x,y)\bmod d$. □

Remark 4.28. Suppose that $(X_n)_{n\ge 0}$ is an HMC as in the above proposition and
\[
C_0,C_1,\dots,C_{d-1}\subset\mathcal{X}
\]
are the communication classes of $Q^d$. If $X_0\in C_i$, then $X_n\in C_{i+n\bmod d}$, for any $n$. Thus a grasshopper hopscotching following the prescriptions of this Markov chain will jump from a region $C_i$ to somewhere in the next region $C_{i+1}$ and so on, returning after $d$ jumps to the region where it started.
Observe also that the map $r:\mathcal{C}_T\to\mathbb{Z}/d\mathbb{Z}$ depends on the choice of $x_0$. On the other hand, the action of $\mathbb{Z}/d\mathbb{Z}$ on $\mathcal{C}_T$ is independent of $x_0$ and it is free and transitive. Thus $\mathcal{C}_T$ is naturally a $\mathbb{Z}/d\mathbb{Z}$-torsor. □

4.2.2 The strong Markov property


Suppose that Xn : (Ω, S, P) → X , n ≥ 0 is an HMC with state space X and
transition matrix Q. As usual, we denote by (Fn )n≥0 the filtration of S determined
by this random process, i.e., Fn = σ(X0 , . . . , Xn ), n ∈ N0 .
Suppose now that T is a stopping time adapted to this filtration. In (3.1.6) we
defined the sigma-algebra FT associated to T by the requirements

S ∈ FT ⇐⇒S ∩ T ≤ n ∈ Fn , ∀n ∈ N0 .


Example 4.29 (Return times). Let $(X_n)_{n\in\mathbb{N}_0}$ be an HMC with state space $\mathcal{X}$. For $A\subset\mathcal{X}$ we define
\[
T_A:=\min\big\{n\ge 1;\;X_n\in A\big\}.
\]
We will refer to $T_A$ as the return time to $A$. This is a stopping time with respect to the canonical filtration $\mathcal{F}_n$. For $x\in\mathcal{X}$ we set $T_x:=T_{\{x\}}$.
Note that an event $S$ belongs to $\mathcal{F}_{T_A}$ if, at any moment $n$, the information collected up to that moment in $\mathcal{F}_n$ suffices to decide whether $S$ occurred and whether we have returned to $A$ by that moment. □

Example 4.30 (Hitting times). Let $(X_n)_{n\in\mathbb{N}_0}$ be an HMC with state space $\mathcal{X}$. For $A\subset\mathcal{X}$ we define
\[
H_A:=\min\big\{n\ge 0;\;X_n\in A\big\}.
\]
We will refer to $H_A$ as the hitting time of $A$. This is a stopping time with respect to the canonical filtration $\mathcal{F}_n$. For $x\in\mathcal{X}$ we set $H_x:=H_{\{x\}}$. □

Theorem 4.31. Let $X_n:(\Omega,\mathcal{S},\mathbb{P})\to\mathcal{X}$, $n\in\mathbb{N}_0$, be an HMC with initial distribution $\mu$ and transition matrix $Q$. Suppose that $T$ is a stopping time adapted to the canonical filtration $(\mathcal{F}_n)_{n\in\mathbb{N}_0}$. Conditional on $X_T=x\in\mathcal{X}$ and $T<\infty$, the stochastic process
\[
Y_n:=X_{T+n},\quad n\in\mathbb{N}_0,
\]
is in $\mathrm{Markov}(\mathcal{X},\delta_x,Q)$ and independent of $\mathcal{F}_T$. More explicitly, if $\Lambda$ is the event
\[
\Lambda=\big\{T<\infty,\;X_T=x\big\},
\]
and $\mathbb{P}_\Lambda:\mathcal{S}\to[0,1]$ is the probability measure $\mathbb{P}_\Lambda\big[S\big]=\mathbb{P}\big[S\,\|\,\Lambda\big]$, then the stochastic process
\[
Y_n:(\Omega,\mathcal{S},\mathbb{P}_\Lambda)\to\mathcal{X}
\]
is $\mathrm{Markov}(\mathcal{X},\delta_x,Q)$ and independent of $\mathcal{F}_T$.

Proof. For $n\in\mathbb{N}$ denote by $T_n$ the stopping time $T_n:=T+n$. Denote by $S$ the event
\[
S=\{T<\infty\}\cap\big\{X_T=x_0=x,\;X_{T+1}=x_1,\dots,X_{T+n}=x_n\big\}
=\Lambda\cap\big\{X_{T+1}=x_1,\dots,X_{T+n}=x_n\big\}.
\]
Note that $S\in\mathcal{F}_{T_n}$. We have to show that
\[
\mathbb{P}_\Lambda\big[Y_{n+1}=x_{n+1}\,\|\,S\big]=Q_{x_n,x_{n+1}},
\]
i.e.,
\[
\frac{\mathbb{P}_\Lambda\big[\{X_{T+n+1}=x_{n+1}\}\cap S\cap\Lambda\big]}{\mathbb{P}_\Lambda\big[S\big]}=Q_{x_n,x_{n+1}},
\]
or
\[
\frac{\mathbb{P}\big[\{X_{T+n+1}=x_{n+1}\}\cap S\cap\Lambda\big]}{\mathbb{P}\big[S\cap\Lambda\big]}
=\frac{\mathbb{P}\big[\{X_{T+n+1}=x_{n+1}\}\cap S\big]}{\mathbb{P}\big[S\big]}=Q_{x_n,x_{n+1}}.
\]
We have
\[
\mathbb{P}\big[\{X_{T+n+1}=x_{n+1}\}\cap S\big]=\sum_{k\ge 0}\mathbb{P}\big[\{X_{k+n+1}=x_{n+1}\}\cap S\cap\{T=k\}\big]
\]
(use the Markov property (4.1.2) and the definition of $S$)
\[
=\sum_{k\ge 0}\mathbb{P}\big[\{T=k\}\cap S\big]\,Q_{x_n,x_{n+1}}
=Q_{x_n,x_{n+1}}\sum_{k\ge 0}\mathbb{P}\big[\{T=k\}\cap S\big]=Q_{x_n,x_{n+1}}\,\mathbb{P}\big[S\big].
\]
To prove the independence of $\mathcal{F}_T$ given $\Lambda=\{X_T=x,\,T<\infty\}$ it suffices to show that each of the events
\[
\Gamma_0=\big\{X_T=x\big\},\;\dots,\;\Gamma_n=\big\{X_T=x_0=x,\;X_{T+1}=x_1,\dots,X_{T+n}=x_n\big\},\;\dots
\]
is independent of $\mathcal{F}_T$ given $\Lambda$. Let $S\in\mathcal{F}_T$. We have
\[
\mathbb{P}\big[S\cap\Gamma_n\cap\Lambda\big]=\sum_{k\ge 0}\mathbb{P}\big[S\cap\Gamma_n\cap\{T=k\}\big]
\]
(use the Markov property repeatedly)
\[
=\sum_{k\ge 0}\mathbb{P}\big[\{T=k\}\cap S\cap\Gamma_0\big]\,Q_{x_0,x_1}\cdots Q_{x_{n-1},x_n}
\]
\[
=\mathbb{P}\big[S\cap\Gamma_0\cap\{T<\infty\}\big]\,Q_{x_0,x_1}\cdots Q_{x_{n-1},x_n}
=\mathbb{P}\big[S\cap\Lambda\big]\,Q_{x_0,x_1}\cdots Q_{x_{n-1},x_n},
\]
i.e.,
\[
\mathbb{P}\big[S\cap\Gamma_n\cap\Lambda\big]=\mathbb{P}\big[S\cap\Lambda\big]\,Q_{x_0,x_1}\cdots Q_{x_{n-1},x_n}.
\]
Then
\[
\mathbb{P}\big[S\cap\Gamma_n\,\|\,\Lambda\big]=\frac{\mathbb{P}\big[S\cap\Lambda\big]}{\mathbb{P}\big[\Lambda\big]}\cdot Q_{x_0,x_1}\cdots Q_{x_{n-1},x_n},\quad x_0=x.
\]
Since the stochastic process $Y_n:(\Omega,\mathcal{S},\mathbb{P}_\Lambda)\to\mathcal{X}$ is $\mathrm{Markov}(\mathcal{X},\delta_x,Q)$ we deduce
\[
Q_{x_0,x_1}\cdots Q_{x_{n-1},x_n}=\mathbb{P}\big[\Gamma_n\,\|\,\Lambda\big].
\]
Hence $\mathbb{P}\big[S\cap\Gamma_n\,\|\,\Lambda\big]=\mathbb{P}\big[S\,\|\,\Lambda\big]\cdot\mathbb{P}\big[\Gamma_n\,\|\,\Lambda\big]$. □

In the following subsections we will have plenty of opportunities to see the strong
Markov principle at work.

4.2.3 Transience and recurrence


Suppose that $(X_n)_{n\in\mathbb{N}_0}$ is an HMC with state space $\mathcal{X}$ and transition matrix $Q$. For any $x\in\mathcal{X}$ we denote by $T_x$ the return time to $x$, i.e.,
\[
T_x:=\min\big\{n\ge 1;\;X_n=x\big\}.
\]
Moreover,
\[
\mathbb{P}_x:=\mathbb{P}\big[-\,\|\,X_0=x\big],\quad \mathbb{E}_x:=\mathbb{E}\big[-\,\|\,X_0=x\big].
\]

Definition 4.32. A state $x\in\mathcal{X}$ is called recurrent or persistent if $\mathbb{P}_x\big[T_x<\infty\big]=1$. Otherwise it is called transient. □

Example 4.33. If $\mathcal{X}$ is finite and the chain is irreducible, then any state of $\mathcal{X}$ is recurrent. Indeed, the 'sooner-rather-than-later' Lemma 3.32 implies that $\mathbb{E}_x\big[T_x\big]<\infty$, $\forall x\in\mathcal{X}$. □

We set $T^1_x:=T_x$ and we define inductively
\[
T^{k+1}_x:=\min\big\{n>T^k_x;\;X_n=x\big\},
\]
\[
N_x:=\#\big\{k\in\mathbb{N};\;T^k_x<\infty\big\}=\#\big\{n\in\mathbb{N};\;X_n=x\big\}\in\mathbb{N}\cup\{\infty\}.
\]
Thus $T^k_x$ is the time of the $k$-th return to $x$. We will refer to $N_x$ as the number of returns to $x$. We set
\[
p=p_x:=\mathbb{P}_x\big[T_x<\infty\big].
\]
 
Lemma 4.34. For any $n\in\mathbb{N}_0$ we have $\mathbb{P}_x\big[N_x\ge n\big]=p^n$. In particular, if $x$ is recurrent, i.e., $p=1$, then $N_x=\infty$ a.s. and, if $x$ is transient, then
\[
\mathbb{E}_x\big[N_x\big]=\frac{p}{1-p}.
\]

Proof. Set $p_n:=\mathbb{P}_x\big[N_x\ge n\big]$. We will prove inductively that $p_n=p^n$. Clearly $\mathbb{P}_x\big[N_x\ge 1\big]=\mathbb{P}_x\big[T_x<\infty\big]=p$. Suppose that $p_n=p^n$. The post-$T^n_x$ process $Y_k=X_{T^n_x+k}$ starts at $x$ and the strong Markov property implies that it is an HMC with the same transition matrix. In particular, the probability that it returns to $x$ is $p$. On the other hand, $(Y_k)$ returns to $x$ if and only if $N_x\ge n+1$. Since the post-$T^n_x$ process is independent of $\mathcal{F}_{T^n_x}$ we deduce
\[
\mathbb{P}\big[N_x\ge n+1\big]=p\,\mathbb{P}\big[N_x\ge n\big]=p^{n+1}.
\]
□

Corollary 4.35. Assume that $X_0=x$ a.s. Then the following hold:
\[
x\ \text{is recurrent}\iff N_x=\infty\ \text{a.s.}\iff \mathbb{E}_x\big[N_x\big]=\infty,
\]
\[
x\ \text{is transient}\iff \mathbb{E}_x\big[N_x\big]<\infty.
\]
□

Clearly the recurrence/transience of a state depends only on the transition ma-


trix. The next result characterizes these features in terms of the transition matrix.

Theorem 4.36. Let $x\in\mathcal{X}$. The following statements are equivalent.

(i) The state $x$ is recurrent.
(ii) $\displaystyle\sum_{n\in\mathbb{N}}Q^n_{x,x}=\infty$.

Proof. Observe that
\[
N_x=\sum_{n\in\mathbb{N}}\boldsymbol{I}_{\{X_n=x\}}
\]
and
\[
\mathbb{E}_x\big[N_x\big]=\sum_{n\in\mathbb{N}}\mathbb{E}_x\big[\boldsymbol{I}_{\{X_n=x\}}\big]=\sum_{n\in\mathbb{N}}Q^n_{x,x}.
\]
The result now follows from Corollary 4.35. □
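The criterion can be checked numerically on a tiny transient example, combined with Lemma 4.34: on $\{0,1\}$ with state $1$ absorbing, the chain returns to $0$ with probability $p=\frac12$, so $\sum_n (Q^n)_{0,0}$ should equal $p/(1-p)=1$. (The truncation at 60 terms and the helper name are choices of this sketch.)

```python
Q = [[0.5, 0.5],
     [0.0, 1.0]]                     # state 1 is absorbing; state 0 is transient

def matpow_entry(Q, n, x):
    """Return (Q^n)_{x,x} by naive repeated multiplication."""
    m = len(Q)
    P = [[float(i == j) for j in range(m)] for i in range(m)]
    for _ in range(n):
        P = [[sum(P[i][k] * Q[k][j] for k in range(m)) for j in range(m)]
             for i in range(m)]
    return P[x][x]

total = sum(matpow_entry(Q, n, 0) for n in range(1, 60))
# Here (Q^n)_{0,0} = (1/2)^n, so the series converges to 1 = p/(1-p): the
# state 0 is transient, consistent with Theorem 4.36.
```

For a recurrent state the partial sums would instead grow without bound, as with $(Q^{2n})_{0,0}\sim (\pi n)^{-1/2}$ for the standard walk on $\mathbb{Z}$ treated in Example 4.42.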

Corollary 4.37. Let $x,y\in\mathcal{X}$. If $x\to y$ and $x$ is recurrent, then

(i) $x\leftrightarrow y$,
(ii) $\mathbb{P}_y\big[T_x<\infty\big]=1$,
(iii) the state $y$ is recurrent.

Proof. The state $x$ is recurrent, so $N_x=\infty$ a.s. Since $x\to y$ we deduce that
\[
\mathbb{P}_x\big[T_y<\infty\big]=\mathbb{P}\big[T_y<\infty\,\|\,X_0=x\big]>0.
\]
The post-$T_y$ chain $Y_n=X_{n+T_y}$, $n\ge 0$, will almost surely reach $x$ since $N_x=\infty$ a.s. Using the strong Markov property at $T_y$ we deduce that $(Y_n)$ has the same transition matrix $Q$. Hence $y\to x$, i.e., $x\leftrightarrow y$. In particular, the original chain started at $y$ will almost surely reach $x$, i.e., $\mathbb{P}\big[T_x<\infty\,\|\,X_0=y\big]=1$.
Since $x\leftrightarrow y$ there exist $j,k\in\mathbb{N}$ such that
\[
c=\min\{Q^j_{x,y},Q^k_{y,x}\}>0.
\]
We deduce
\[
Q^{n+j+k}_{y,y}\ge Q^k_{y,x}Q^n_{x,x}Q^j_{x,y}\ge c^2Q^n_{x,x},\quad\forall n\in\mathbb{N}.
\]
Hence
\[
\sum_{m\ge 1}Q^m_{y,y}\ge\sum_{m>j+k}Q^m_{y,y}\ge c^2\sum_{n\ge 1}Q^n_{x,x}=\infty.
\]
□

The above result shows that if $C$ is a communication class then either all states in $C$ are recurrent, or all states in $C$ are transient. In the first case $C$ is called a recurrence class and in the second case $C$ is called a transience class. An irreducible HMC consists of a single communication class. Accordingly, an irreducible HMC can be either transient or recurrent.

Proposition 4.38. Suppose that $(X_n)_{n\ge 0}$ is an irreducible transient HMC with state space $\mathcal{X}$, transition matrix $Q$ and initial distribution $\mu$. Then
\[
\mathbb{E}_\mu\big[N_x\big]<\infty,\quad\forall x\in\mathcal{X}.
\]

Proof. We first prove that, given $x_0\in\mathcal{X}$, there exists $C=C_{x_0}>0$ such that
\[
\mathbb{E}_y\big[N_{x_0}\big]\le C,\quad\forall y\in\mathcal{X}.
\]
Indeed, using the strong Markov property as in the proof of Lemma 4.34 we deduce that for any $y\in\mathcal{X}$ we have
\[
\mathbb{E}_y\big[N_{x_0}\big]=\sum_{n\ge 1}\mathbb{P}_y\big[N_{x_0}\ge n\big]=\sum_{n\ge 1}\mathbb{P}_{x_0}\big[N_{x_0}\ge n-1\big]\,\mathbb{P}_y\big[T_{x_0}<\infty\big]
\]
\[
=\underbrace{\mathbb{P}_{x_0}\big[N_{x_0}\ge 0\big]}_{=1}\mathbb{P}_y\big[T_{x_0}<\infty\big]
+\mathbb{P}_y\big[T_{x_0}<\infty\big]\underbrace{\sum_{m\ge 1}\mathbb{P}_{x_0}\big[N_{x_0}\ge m\big]}_{=\mathbb{E}_{x_0}[N_{x_0}]}
\]
\[
=\mathbb{P}_y\big[T_{x_0}<\infty\big]\big(1+\mathbb{E}_{x_0}\big[N_{x_0}\big]\big)\le\underbrace{1+\mathbb{E}_{x_0}\big[N_{x_0}\big]}_{C_{x_0}}.
\]
Now observe that
\[
\mathbb{E}_\mu\big[N_{x_0}\big]=\sum_{y\in\mathcal{X}}\mu\big[y\big]\,\mathbb{E}_y\big[N_{x_0}\big]\le C_{x_0}.
\]
□

Theorem 4.39. Suppose that $C$ is a recurrence class and $X_0=x\in C$ a.s. Then
\[
\mathbb{P}\big[N_y=\infty\,\|\,X_0=x\big]=1,\quad\forall y\in C.
\]
In particular, with probability one, the chain visits every state of $C$ infinitely often, i.e.,
\[
\mathbb{P}\big[\forall y\in C,\;N_y=\infty\,\|\,X_0\in C\big]=1.
\]

Proof. Let $x,y\in C$. We have
\[
\mathbb{P}\big[N_x=\infty\,\|\,X_0=x\big]=1,\quad \mathbb{P}\big[T_y<\infty\,\|\,X_0=x\big]=1.
\]
The strong Markov property shows that the post-$T_y$ chain has the same distribution as the chain started at $y$. Since $y$ is recurrent we have $\mathbb{P}\big[N_y=\infty\,\|\,X_0=x\big]=1$ and we deduce that
\[
\mathbb{P}\big[N_y<\infty\,\|\,X_0=x\big]=0,\quad\forall x,y\in C,
\]
\[
\mathbb{P}\big[N_y<\infty\,\|\,X_0\in C\big]=0,\quad\forall y\in C.
\]
In particular,
\[
\mathbb{P}\big[\exists y\in C,\;N_y<\infty\,\|\,X_0\in C\big]\le\sum_{y\in C}\mathbb{P}\big[N_y<\infty\,\|\,X_0\in C\big]=0.
\]
Hence
\[
\mathbb{P}\big[\forall y\in C,\;N_y=\infty\,\|\,X_0\in C\big]=1-\mathbb{P}\big[\exists y\in C,\;N_y<\infty\,\|\,X_0\in C\big]=1.
\]
□

Proposition 4.40. A recurrence communication class C is closed.

Proof. Suppose that C is not closed. Then there exist c ∈ C and x ∈ X \ C such that c → x. Since C is a communication class, x does not communicate with any y ∈ C, so no state of C is accessible from x. Fix n0 ∈ N such that
    p := P[X_{n0} = x | X0 = c] > 0.
Since the chain cannot return to C from x, we deduce that
    P[X_n ∈ X \ C, ∀n ≥ n0 | X0 = c] ≥ p.
In particular P[N_c < n0 | X0 = c] ≥ p. This contradicts the fact that P[N_c = ∞ | X0 = c] = 1.  □

The set of communication classes X̄ := X/↔ is itself the state space of an HMC with transition matrix
    Q̄_{C,C'} = P[X1 ∈ C' | X0 ∈ C].
Each state of X̄ is in itself a communication class of the new Markov chain. The state space X̄ is partitioned into two types of states: transient states and recurrent states. The recurrent states are closed, i.e., they are absorbing as states in X̄. Given a recurrent state R ∈ X̄, no other communication class is accessible from R.

Example 4.41. Consider for example the gambler's ruin problem with total fortune N ∈ N; see Example 4.6. This can be viewed as a Markov chain with state space {0, 1, . . . , N} and transition probabilities
    q_{i,i±1} = 1/2, ∀0 < i < N,  q_{0,0} = q_{N,N} = 1.
The communication classes of this Markov chain are
    {0}, {N}, {1, 2, . . . , N − 1}.
The first two are recurrent while the third is transient.  □
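The communication classes in Example 4.41 can also be found mechanically. The sketch below is my own illustration (the choice N = 4 and all names are mine): it computes the accessibility relation by transitive closure and then groups mutually accessible states.

```python
def gamblers_ruin_Q(N):
    # Transition matrix of the gambler's ruin chain on {0, 1, ..., N}:
    # absorbing barriers at 0 and N, fair coin flips in between.
    Q = [[0.0] * (N + 1) for _ in range(N + 1)]
    Q[0][0] = Q[N][N] = 1.0
    for i in range(1, N):
        Q[i][i - 1] = Q[i][i + 1] = 0.5
    return Q

def communication_classes(Q):
    # x leads to y iff Q^n_{x,y} > 0 for some n >= 0; compute reachability
    # by transitive closure, then group mutually accessible states.
    n = len(Q)
    reach = [[Q[x][y] > 0 or x == y for y in range(n)] for x in range(n)]
    for k in range(n):
        for x in range(n):
            for y in range(n):
                reach[x][y] = reach[x][y] or (reach[x][k] and reach[k][y])
    classes = []
    for x in range(n):
        cls = frozenset(y for y in range(n) if reach[x][y] and reach[y][x])
        if cls not in classes:
            classes.append(cls)
    return classes

classes = communication_classes(gamblers_ruin_Q(4))
```

For N = 4 this recovers the three classes {0}, {4} and {1, 2, 3} of the example.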

If X is finite, the argument in Example 4.33 shows that a communication class is closed iff it is recurrent.

Example 4.42 (G. Polya). (a) Consider the standard random walk on Z. We denote by Q the transition matrix. This is an irreducible Markov chain and each state has period 2. To decide whether it is transient or recurrent it suffices to verify whether the origin is such. Note that Q^{2n−1}_{0,0} = 0, ∀n ∈ N. To compute Q^{2n}_{0,0} we observe that a path of length 2n starts and ends at the origin if and only if it consists of exactly n steps to the right and n steps to the left. Since each such step occurs with probability 1/2 we deduce
    Q^{2n}_{0,0} = (1/2^{2n}) C(2n, n) = (2n)!/(2^{2n}(n!)²).
Using Stirling's formula (A.1.6) we deduce that, as n → ∞, we have
    (2n)!/(2^{2n}(n!)²) ∼ √(4πn)/(2πn) = 1/√(πn),
so
    Σ_{n∈N} Q^n_{0,0} = ∞.
Thus, the 1-dimensional standard random walk is recurrent.
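The Stirling asymptotics are easy to probe numerically. A minimal sketch, assuming nothing beyond the exact formula Q^{2n}_{0,0} = C(2n, n)/2^{2n} derived above:

```python
import math

def return_prob_1d(n):
    # Q^{2n}_{0,0} = C(2n, n) / 2^{2n} for the simple random walk on Z.
    return math.comb(2 * n, n) / 4**n

# Stirling's formula predicts Q^{2n}_{0,0} ~ 1/sqrt(pi*n); the ratio of the
# exact value to the approximation should be close to 1 for large n.
exact = return_prob_1d(500)
approx = 1 / math.sqrt(math.pi * 500)
ratio = exact / approx
```

For n = 500 the ratio is already within a few parts in ten thousand of 1, consistent with the 1/√(πn) asymptotics (and with the divergence of the series Σ Q^{2n}_{0,0}).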


(b) Consider the standard walk on Z². It is irreducible. We want to decide if the origin is recurrent or transient. To compute Q^{2n}_{0,0} we observe that a path of length 2n starts and ends at the origin if and only if the number of steps up is equal to the number of steps down and the number of steps to the right is equal to the number of steps to the left. We deduce that
    Q^{2n}_{0,0} = Σ_{k=0}^{n} (2n)!/(4^{2n}(k!)²((n−k)!)²) = ((2n)!/(4^{2n}(n!)²)) Σ_{k=0}^{n} C(n, k)².
Using Newton's binomial formula in the equality
    (x + y)^{2n} = (x + y)^n (x + y)^n
and identifying the coefficients of x^n y^n on either side of the above equality, we deduce
    C(2n, n) = Σ_{k=0}^{n} C(n, k) C(n, n−k) = Σ_{k=0}^{n} C(n, k)²,
so that
    Q^{2n}_{0,0} = ( C(2n, n)/2^{2n} )² ∼ 1/(πn) as n → ∞.
Hence, again,
    Σ_{n∈N} Q^n_{0,0} = ∞,
so the standard 2-dimensional random walk is also recurrent.
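The collapse of the double-factorial sum into a single binomial coefficient can be checked numerically. A sketch (the parameter n = 20 is an arbitrary choice of mine):

```python
import math

def return_prob_2d(n):
    # Q^{2n}_{0,0} for the simple random walk on Z^2, summed over the number
    # k of steps to the right (= number of steps to the left).
    s = sum(math.factorial(2 * n) // (math.factorial(k)**2 * math.factorial(n - k)**2)
            for k in range(n + 1))
    return s / 4**(2 * n)

# Vandermonde's identity collapses the sum: Q^{2n}_{0,0} = (C(2n,n)/2^{2n})^2,
# which is asymptotic to 1/(pi*n).
n = 20
closed_form = (math.comb(2 * n, n) / 4**n) ** 2
```

For n = 20 the raw sum and the closed form agree to machine precision, and both are already within a couple of percent of 1/(πn).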


(c) Consider the standard random walk on Z³. Arguing as in the 2-dimensional case we deduce
    Q^{2n}_{0,0} = (1/6^{2n}) Σ_{j+k+ℓ=n} (2n)!/((j!)²(k!)²(ℓ!)²) = (1/2^{2n}) C(2n, n) Σ_{j+k+ℓ=n} ( n!/(j!k!ℓ!3ⁿ) )².
Now observe that
    Σ_{j+k+ℓ=n} n!/(j!k!ℓ!3ⁿ) = (1/3 + 1/3 + 1/3)ⁿ = 1.
Setting p_{j,k,ℓ} := n!/(j!k!ℓ!3ⁿ), we deduce
    Σ p²_{j,k,ℓ} ≤ ( max_{j,k,ℓ} p_{j,k,ℓ} ) Σ p_{j,k,ℓ} = max_{j,k,ℓ} p_{j,k,ℓ},
so
    Q^{2n}_{0,0} ≤ (1/2^{2n}) C(2n, n) max_{j,k,ℓ} p_{j,k,ℓ}.
Let us observe that the maximum value of p_{j,k,ℓ} is achieved when j, k, ℓ are as close to n/3 as possible. Indeed, if j ≤ k ≤ ℓ and j < ℓ, then
    (j+1)!(ℓ−1)! = ((j+1)/ℓ) j!ℓ! ≤ j!ℓ!,
so
    p_{j+1,k,ℓ−1} ≥ p_{j,k,ℓ}.
Assume now that n = 3m. We deduce
    Q^{2n}_{0,0} ≤ (1/2^{2n}) C(2n, n) (3m)!/((m!)³3^{3m}).
Using again Stirling's formula we deduce that, as m → ∞, we have
    (3m)!/((m!)³3^{3m}) ∼ √(6πm)/(2πm)^{3/2} = √3/(2πm).
On the other hand,
    (1/2^{2n}) C(2n, n) ∼ 1/√(πn) = 1/√(3πm).
We deduce that
    Q^{6m}_{0,0} = O(m^{−3/2}) as m → ∞.
Arguing in a similar fashion we deduce
    Q^{6m+2}_{0,0}, Q^{6m+4}_{0,0} = O(m^{−3/2}) as m → ∞.
We conclude that
    Σ_{n∈N} Q^n_{0,0} = Σ_{n∈N} Q^{2n}_{0,0} < ∞,
so the standard 3-dimensional random walk is transient!  □

4.2.4 Invariant measures


Suppose that (X_n)_{n∈N0} is an HMC with state space X and transition matrix Q. We will identify a σ-finite measure λ on X with the function
    λ : X → [0, ∞), x ↦ λ_x = λ({x}).

Definition 4.43. An invariant or stationary measure for the HMC (X_n)_{n∈N0} is a σ-finite measure λ on X such that λ = λQ, i.e.,
    λ_x = Σ_{y∈X} λ_y Q_{y,x}, ∀x ∈ X.     (4.2.3)
An invariant or stationary distribution is an invariant probability measure.  □

Example 4.44 (Time reversal). Suppose that π is an invariant probability distribution for (X_n)_{n≥0} such that π_x > 0, ∀x ∈ X. Suppose additionally that
    P[X0 = x] = π_x, ∀x ∈ X.
Then
    P[X1 = x] = Σ_{y∈X} P[X1 = x | X0 = y] π_y = Σ_y π_y Q_{y,x} = π_x.
Iterating, we deduce that the random variables (X_n)_{n∈N0} are identically distributed. For x, y ∈ X we set
    R_{y,x} := P[X0 = x | X1 = y] = P[X1 = y | X0 = x] P[X0 = x] / P[X1 = y] = (π_x/π_y) Q_{x,y}.
Note that for every y ∈ X we have
    Σ_x R_{y,x} = (1/π_y) Σ_x π_x Q_{x,y} = 1,
so (R_{y,x})_{x,y∈X} is a stochastic matrix describing the so called time reversed chain.
Suppose now that π is a probability distribution on X such that π_x > 0, ∀x ∈ X, satisfying
    Q_{y,x} = R_{y,x} = (π_x/π_y) Q_{x,y}, ∀x, y ∈ X.     (4.2.4)
From the equality
    1 = Σ_x Q_{y,x} = Σ_x (π_x/π_y) Q_{x,y}
we deduce that π is a stationary distribution and that the time reversed chain coincides with the initial chain. This is the reason why chains satisfying (4.2.4) are called reversible.  □

Definition 4.45. An irreducible HMC with state space X and transition matrix Q is called reversible if there exists a function λ : X → (0, ∞) satisfying the detailed balance equations
    λ_y Q_{y,x} = λ_x Q_{x,y}, ∀x, y ∈ X.     (4.2.5)
□

Observe that if Q satisfies the detailed balance equations, then an argument as in Example 4.44 shows that λ defines a Q-invariant measure.
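This implication is easy to verify numerically on a toy example. The sketch below uses a small birth-death chain of my own choosing (the simple random walk on a path with four vertices) and checks both the detailed balance equations (4.2.5) and the invariance equation (4.2.3) for λ_x = deg x.

```python
# Simple random walk on the path graph 0-1-2-3; lambda_x = deg(x) satisfies
# detailed balance, hence should be an invariant measure.
deg = [1, 2, 2, 1]
n = 4
Q = [[0.0] * n for _ in range(n)]
for x in range(n):
    for y in (x - 1, x + 1):
        if 0 <= y < n:
            Q[x][y] = 1 / deg[x]

lam = [float(d) for d in deg]
detailed_balance = all(
    abs(lam[x] * Q[x][y] - lam[y] * Q[y][x]) < 1e-12
    for x in range(n) for y in range(n))

# Invariance: (lam Q)_x = sum_y lam_y Q_{y,x} should equal lam_x for every x.
lamQ = [sum(lam[y] * Q[y][x] for y in range(n)) for x in range(n)]
```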

Example 4.46. (a) If Q_{x,y} = Q_{y,x} for any x, y ∈ X, then the corresponding chain is reversible and any uniform measure on X is invariant. This happens for example if (X_n)_{n≥0} describes the standard random walk on Z^d.

(b) In the case of the standard random walk on a locally finite connected graph we have, for adjacent vertices x and y,
    Q_{x,y} = 1/deg x,  deg x · Q_{x,y} = 1 = deg y · Q_{y,x}.
Hence Q is in detailed balance with invariant measure x ↦ deg x. If, additionally, X is finite, then the probability measure
    π_x = deg x / Σ_y deg y
is invariant.  □

Example 4.47 (The Ehrenfest urn). Consider the Ehrenfest urn model detailed in Example 4.7. We recall that the state space is X = {0, 1, . . . , B}, B ∈ N, and the only nontrivial transition probabilities are
    Q_{k,k+1} = (B−k)/B,  Q_{k,k−1} = k/B.
Note that
    Q_{k,k+1}/Q_{k+1,k} = (B−k)/(k+1) = C(B, k+1)/C(B, k).
Then the measure k ↦ λ_k = C(B, k) is invariant and
    π_k = (1/2^B) C(B, k),  k = 0, 1, . . . , B,
is an invariant probability distribution.  □
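The invariance of π can be confirmed directly from the transition probabilities of Example 4.47. A minimal sketch, with B = 6 chosen arbitrarily:

```python
import math

B = 6

def Q(k, j):
    # Ehrenfest transition probabilities: k balls in the left urn.
    if j == k + 1:
        return (B - k) / B
    if j == k - 1:
        return k / B
    return 0.0

# Candidate stationary distribution pi_k = C(B, k) / 2^B.
pi = [math.comb(B, k) / 2**B for k in range(B + 1)]

# Stationarity check: pi_j = sum_k pi_k Q(k, j) for every state j.
piQ = [sum(pi[k] * Q(k, j) for k in range(B + 1)) for j in range(B + 1)]
```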

Theorem 4.48. Suppose that the HMC (X_n)_{n∈N0} is irreducible and recurrent. Fix x0 ∈ X, and denote by T0 the time of first return to x0, i.e.,
    T0 := T_{x0} = min{ n ≥ 1; X_n = x0 }.
For any x ∈ X define
    N_x = Σ_{n∈N} I_{X_n=x} I_{n≤T0} = Σ_{n=1}^{T0} I_{X_n=x},     (4.2.6a)
    λ_x = λ_{x,x0} = E_{x0}[N_x] if x ≠ x0, and λ_{x0} = 1.     (4.2.6b)
In other words, λ_x is the expected number of visits to x before returning to x0 when starting from x0. Then, the following hold.

(i) λ_x ∈ (0, ∞), ∀x ∈ X, and the associated measure λ on X, given by λ({x}) = λ_x, ∀x ∈ X, is invariant.
(ii) λ(X) = E_{x0}[T_{x0}].
(iii) The measure λ is the unique invariant measure such that λ_{x0} = 1.

Proof. (i) We follow the approach in [20, Thm. 3.2.1]. Clearly λ_{x0} = 1. For x ∈ X \ {x0} and n ∈ N we set
    p_x(n) := P_{x0}[X_n = x, n ≤ T0].
Thus, p_x(n) is the probability of visiting state x at time n before returning to x0. The equality (4.2.6a) implies that
    λ_x = Σ_{n∈N} p_x(n), ∀x ≠ x0.     (4.2.7)
Let us prove that λ satisfies (4.2.3). Observe first that p_x(1) = Q_{x0,x}. From the Markov property we deduce, for n ≥ 2,
    p_x(n) = Σ_{y≠x0} p_y(n−1) Q_{y,x}.     (4.2.8)
We deduce that
    λ_x = Σ_{n∈N} p_x(n) = p_x(1) + Σ_{y≠x0} ( Σ_{n∈N} p_y(n) ) Q_{y,x}
    = λ_{x0} Q_{x0,x} + Σ_{y≠x0} λ_y Q_{y,x} = Σ_y λ_y Q_{y,x},
using λ_{x0} = 1 and p_x(1) = Q_{x0,x}. This proves (4.2.3). Let us now show that the λ_x defined by (4.2.6b) are positive. Suppose that λ_x = 0 for some x ∈ X. Obviously x ≠ x0. Moreover, from the equality λ = λQⁿ, ∀n ∈ N, we deduce
    0 = λ_x = Q^n_{x0,x} + Σ_{y≠x0} λ_y Q^n_{y,x}.
Thus Q^n_{x0,x} = 0, ∀n ∈ N, which contradicts the fact that x0 and x communicate. Finally, let us prove that λ_x < ∞, ∀x. Observe that
    1 = λ_{x0} = Σ_{x∈X} λ_x Q^n_{x,x0}.     (4.2.9)
Since the chain is irreducible, every x communicates with x0, so there exists n = n(x) such that Q^n_{x,x0} ≠ 0. The equality (4.2.9) then implies λ_x ≤ 1/Q^n_{x,x0} < ∞.

(ii) We have
    Σ_{x∈X} Σ_{n≥1} I_{X_n=x} I_{n≤T0} = Σ_{n≥1} ( Σ_{x∈X} I_{X_n=x} ) I_{n≤T0} = Σ_{n≥1} I_{n≤T0} = T0.
Hence
    λ(X) = Σ_{x∈X} λ_x = Σ_{x∈X} E_{x0}[ Σ_{n≥1} I_{X_n=x} I_{n≤T0} ] = E_{x0}[T0].

(iii) We follow the approach in [2; 123]. Consider the matrix K : X × X → [0, 1],
    K_{x,y} = Q_{x,y} if y ≠ x0, and K_{x,y} = 0 if y = x0.
Consider the sequence (µ_n)_{n≥0} of measures on X defined by
    µ_0 = δ_{x0},  µ_n[x] = P_{x0}[X_n = x, n < T0] (= p_x(n) for x ≠ x0), x ∈ X.
Note that µ_n[x0] = 0, ∀n ≥ 1. The equality (4.2.8) implies that
    µ_n = µ_{n−1} K, ∀n ≥ 1,
so µ_n = δ_{x0} Kⁿ. Observe that
    λ = Σ_{n≥0} µ_n = Σ_{n≥0} δ_{x0} Kⁿ.
Fix an invariant measure ν such that ν_{x0} = 1. The invariance condition reads ν = δ_{x0} + νK. We deduce
    ν = δ_{x0} + (δ_{x0} + νK)K = δ_{x0} + δ_{x0}K + νK².
Arguing inductively, we deduce
    ν = Σ_{m=0}^{n} δ_{x0} K^m + νK^{n+1} ≥ Σ_{m=0}^{n} δ_{x0} K^m, ∀n ∈ N.
Letting n → ∞ we deduce ν ≥ λ. The difference σ = ν − λ is also an invariant measure and σ[x0] = 0. Set σ_x := σ[x].
Fix x ∈ X \ {x0}. Since the Markov chain is irreducible, there exists n ∈ N such that Q^n_{x,x0} > 0. From the equality σ = σQⁿ we deduce
    0 = σ_{x0} = σ_x Q^n_{x,x0} + Σ_{y≠x} σ_y Q^n_{y,x0} ≥ σ_x Q^n_{x,x0}.
Hence σ_x = 0, ∀x ∈ X, so that ν = λ.  □
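The series construction λ = Σ_{n≥0} δ_{x0} Kⁿ used in part (iii) of the proof of Theorem 4.48 can be carried out numerically. The sketch below uses a small 3-state chain of my own choosing; it truncates the series (the tail decays geometrically for a finite irreducible chain) and checks that the result is invariant and normalized at x0.

```python
# K is Q with the x0-column zeroed out; lambda = sum_{n>=0} delta_{x0} K^n.
Q = [[0.0, 1.0, 0.0],
     [0.5, 0.0, 0.5],
     [0.5, 0.5, 0.0]]
x0, n = 0, 3

K = [[Q[x][y] if y != x0 else 0.0 for y in range(n)] for x in range(n)]
row = [1.0 if x == x0 else 0.0 for x in range(n)]  # delta_{x0}
lam = row[:]
for _ in range(200):  # truncated series; 200 terms are far more than enough here
    row = [sum(row[x] * K[x][y] for x in range(n)) for y in range(n)]
    lam = [a + b for a, b in zip(lam, row)]

# Invariance check: (lam Q)_y = sum_x lam_x Q_{x,y} should equal lam_y.
lamQ = [sum(lam[x] * Q[x][y] for x in range(n)) for y in range(n)]
```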

Remark 4.49. The example of the standard random walk on Z³ shows that even transient chains can admit invariant measures.  □

Suppose that (X_n)_{n≥0} is irreducible and recurrent. For each x ∈ X we denote by π^x the unique invariant measure on X such that π^x[x] = 1. We know that for x, y ∈ X the measure π^y is a positive multiple of π^x,
    π^y = c_{y,x} π^x.
From the equality π^x[x] = 1 we deduce c_{y,x} = π^y[x], so that
    π^y = π^y[x] π^x.     (4.2.10)
From Theorem 4.48(ii) we deduce that
    π^x(X) = E_x[T_x].     (4.2.11)
In particular this shows that the following statements are equivalent:

(i) ∃x ∈ X such that E_x[T_x] < ∞.
(ii) ∀x ∈ X, E_x[T_x] < ∞.

Definition 4.50. Let (X_n)_{n≥0} be an irreducible recurrent HMC.

(i) The chain is called positively recurrent if E_x[T_x] < ∞ for some x ∈ X.
(ii) The chain is called null recurrent if E_x[T_x] = ∞ for all x ∈ X.  □

Corollary 4.51. An irreducible recurrent HMC is positively recurrent if and only if for some (equivalently, for all) x ∈ X we have
    π^x(X) < ∞.     (4.2.12)
In particular, a positively recurrent irreducible HMC admits a unique invariant probability measure π_∞ described by
    π_∞ = (1/E_x[T_x]) π^x, ∀x ∈ X.
In other words,
    π_∞[x] = 1/E_x[T_x], ∀x ∈ X.     (4.2.13)

Proof. The equality (4.2.13) follows from the equality π^x[x] = 1.  □

Corollary 4.52. Any irreducible HMC with finite state space X admits a unique stationary probability measure.

Proof. As shown in Example 4.33, this chain is recurrent since the state space is finite. The finiteness of X implies (4.2.12).  □

Example 4.53 (Random knight moves). Consider a regular 8 × 8 chess table and a knight that starts in the lower left-hand corner and then moves randomly along, making each permissible move with equal probability.
This is an example of a random walk on a graph with vertex set X consisting of the centers of the 64 squares of the board, where two vertices are connected by an edge precisely when the knight can pass from one square to the other in one move.
It is easily seen that this is a connected graph, so the corresponding random walk is irreducible and has a unique invariant probability distribution given by
    π[x] = deg x / Z,  Z = Σ_{y∈X} deg y.
Now observe that Z is twice the number of edges of this graph.
To count the edges, observe that each 2 × 3 sub-rectangle of the chess table determines two edges of this graph, one for each of its long diagonals. The same is true for the 3 × 2 sub-rectangles. Moreover, any knight move corresponds to a diagonal of a unique such rectangle. If N_{2×3} and N_{3×2} denote the numbers of 2 × 3, respectively 3 × 2, sub-rectangles, then
    N_{2×3} = N_{3×2} =: N,
so the number of edges is 2 · 2N = 4N and Z = 8N. Since a 3 × 2 rectangle is uniquely determined by the location of its lower left corner, we have N = 6 × 7 = 42, so Z = 8 · 42 = 336.
If x corresponds to the lower left-hand square, then deg x = 2 (a knight in a corner has exactly two moves), so
    E_x[T_x] = 1/π[x] = Z/deg x = 336/2 = 168.
Thus, given that the knight starts at x, the expected time to return to x is 168.  □
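The counts behind the answer 168 can be verified by brute force. The sketch below builds the knight's graph square by square and recovers Z = Σ deg y = 336 (counting each edge twice, once per endpoint), corner degree 2, and the expected return time 168.

```python
# Degrees of the knight's-move graph on the 8x8 board.
moves = [(1, 2), (2, 1), (-1, 2), (-2, 1), (1, -2), (2, -1), (-1, -2), (-2, -1)]
deg = {}
for r in range(8):
    for c in range(8):
        deg[(r, c)] = sum(1 for dr, dc in moves
                          if 0 <= r + dr < 8 and 0 <= c + dc < 8)

Z = sum(deg.values())            # twice the number of edges
corner = deg[(0, 0)]             # the lower left-hand square
expected_return = Z / corner     # E_x[T_x] = 1/pi[x] = Z/deg(x)
```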

Theorem 4.54. Suppose that (Xn ) is an irreducible HMC with state space X and
transition matrix Q. Then the following are equivalent.

(i) The chain is positively recurrent.


(ii) There exists an invariant probability measure.

Proof. We have already shown that (i) ⇒ (ii). To prove the implication (ii) ⇒ (i), fix an invariant probability measure π. Thus π = πQⁿ, ∀n ∈ N, so that
    π[y] = Σ_{x∈X} π[x] Q^n_{x,y}, ∀y ∈ X, n ∈ N.     (4.2.14)
Fix y0 ∈ X such that π[y0] ≠ 0. We prove first that if the chain is recurrent then it has to be positively recurrent. Denote by λ^{y0} the unique invariant measure such that λ^{y0}[y0] = 1. The measure λ^{y0} is a constant multiple of π, so it is finite. Hence
    E_{y0}[T_{y0}] = λ^{y0}(X) < ∞,
showing that the chain is in fact positively recurrent.
We now argue by contradiction that the chain is indeed recurrent. Assume that our Markov chain is transient. Proposition 4.38 implies that for any x ∈ X we have
    ∞ > E_x[N_{y0}] = Σ_n Q^n_{x,y0}.
Hence
    Σ_n Q^n_{x,y0} < ∞ and lim_{n→∞} Q^n_{x,y0} = 0, ∀x ∈ X.     (4.2.15)
Set q_n(x) := Q^n_{x,y0}, ∀x ∈ X. The equality (4.2.14) implies that
    ∫_X q_n(x) π[dx] = π[y0] > 0, ∀n ∈ N.
On the other hand, the equality (4.2.15) coupled with the Dominated Convergence Theorem implies
    lim_{n→∞} ∫_X q_n(x) π[dx] = 0.
This contradiction completes the proof.  □

Example 4.55. We have shown in Example 4.42 that the standard random walks on Z and Z² are recurrent. Let us show that they are null recurrent.
Note that for k = 1, 2, the measure on Z^k defined by λ[x] = 1, ∀x ∈ Z^k, is invariant. By Theorem 4.48, λ is the unique invariant measure such that λ[0] = 1. Since λ(Z^k) = ∞, we deduce that there is no finite invariant measure.  □

Proposition 4.56. Suppose that (X_n)_{n≥0} is an irreducible, positively recurrent HMC with state space X and transition matrix Q. Then
    E_y[T_x] < ∞, ∀x, y ∈ X.

Proof. Fix x ∈ X. Let Y ⊂ X denote the set of y ∈ X such that E_y[T_x] < ∞. Note that x ∈ Y.
For y ∈ Y we have
    E_y[T_x] = E_y[T_x | X1 = y] Q_{y,y} + Σ_{z≠y} E_y[T_x | X1 = z] Q_{y,z}.
Hence
    y ∈ Y, z ≠ y, Q_{y,z} > 0 ⇒ E_y[T_x | X1 = z] < ∞.
Now observe that for z ≠ y,
    E_y[T_x | X1 = z] = 1 if z = x, and E_y[T_x | X1 = z] = 1 + E_z[T_x] if z ≠ x.
We deduce that if y ∈ Y and Q_{y,z} > 0, then z ∈ Y. Iterating from x ∈ Y, we conclude that y ∈ Y for every y accessible from x. Since the chain is irreducible, we deduce Y = X.  □

Let (X_n)_{n≥0} be an HMC with state space X and transition matrix Q. Recall that for any set A ⊂ X we denoted by T_A the time of first return to A,
    T_A := inf{ n ≥ 1; X_n ∈ A }.
Note that T_A ≤ T_a, ∀a ∈ A, so
    E_x[T_A] ≤ E_x[T_a], ∀x ∈ X, ∀a ∈ A.
We deduce that if the chain is irreducible and positively recurrent, then
    E_x[T_A] < ∞, ∀x ∈ X, ∀A ⊂ X, A ≠ ∅.
We have a sort of converse.


Proposition 4.57. Suppose that (X_n)_{n≥0} is an irreducible HMC with state space X and transition matrix Q. If there exists a finite subset A ⊂ X such that
    E_a[T_A] < ∞, ∀a ∈ A,
then (X_n)_{n≥0} is positively recurrent.

Proof. We follow the approach in [20, Chap. 5, Sec. 1.1]. Set
    M_A := max_{a∈A} E_a[T_A].
Consider the epochs of return to A,
    T¹ := T_A,  T^{k+1} := min{ n > T^k; X_n ∈ A }.
Fix a0 ∈ A and suppose that (X_n)_{n≥0} starts at a0, X0 = a0 a.s. We set
    Y0 := X0,  Y_k := X_{T^k}, k ∈ N.
The strong Markov property shows that (Y_k)_{k≥0} is an HMC with state space A. Since (X_n) is irreducible, so is (Y_k)_{k≥0}. Since A is finite, the chain (Y_k) is positively recurrent. Denote by T̂0 the time of first return to a0 of the chain (Y_k)_{k≥0}. Set
    S0 = T¹,  S_k = T^{k+1} − T^k.
If T_{a0} denotes the time of first return to a0 of the original chain, then
    T_{a0} = Σ_{k=0}^{∞} S_k I_{k<T̂0},  E_{a0}[T_{a0}] = Σ_{k=0}^{∞} E_{a0}[S_k I_{k<T̂0}].
On the other hand,
    E_{a0}[S_k I_{k<T̂0}] = Σ_{a∈A} E_{a0}[S_k I_{k<T̂0} I_{X_{T^k}=a}].
Observe that the event {k < T̂0} belongs to F_{T^k}. We deduce
    E_{a0}[S_k I_{k<T̂0} I_{X_{T^k}=a}] = E_{a0}[S_k | k < T̂0, X_{T^k} = a] P_{a0}[k < T̂0, X_{T^k} = a]
(use the strong Markov property for T^k)
    = E_a[T_A] P_{a0}[k < T̂0, X_{T^k} = a] ≤ M_A P_{a0}[k < T̂0, X_{T^k} = a].
Hence
    E_{a0}[T_{a0}] ≤ M_A Σ_{k=0}^{∞} ( Σ_{a∈A} P_{a0}[T̂0 > k, X_{T^k} = a] )
    = M_A Σ_{k=0}^{∞} P_{a0}[T̂0 > k] = M_A E_{a0}[T̂0] < ∞,
since the chain (Y_k) is positively recurrent.  □

4.3 Asymptotic behavior

Suppose that (Xn )n∈N0 is an irreducible, positively recurrent HMC with state space
X and transition matrix Q. It thus has a unique stationary probability measure
π∞ ∈ Prob(X ). In this section we will provide a dynamical description of π∞ and
prove a Law of Large Numbers that involves this measure.

4.3.1 The ergodic theorem


Fix an arbitrary state x0 ∈ X, let T0 := T_{x0} denote the time of first return to x0, and denote by π^0 the unique invariant measure on X such that π^0[x0] = 1. The results in the previous section show that
    π^0[x] = E_{x0}[ Σ_{n≥1} I_{X_n=x} I_{n≤T0} ] = E_{x0}[ Σ_{n=1}^{T0} I_{X_n=x} ], ∀x ∈ X.     (4.3.1)
For n ∈ N we set
    ν(n) = ν_{x0}(n) := Σ_{k=1}^{n} I_{X_k=x0}.
In other words, the random variable ν_{x0}(n) is the number of returns to x0 during the interval [1, n].

Proposition 4.58. Suppose that f ∈ L¹(X, π^0). Then
    lim_{n→∞} (1/ν_{x0}(n)) Σ_{k=1}^{n} f(X_k) = ∫_X f(x) π^0[dx] = Σ_{x∈X} f(x) π^0[x], P_{x0}-a.s.     (4.3.2)

Proof. We follow the proof in [20, Prop. 3.4.1]. Using the decomposition f = f⁺ − f⁻ we see that it suffices to consider only the case when f is nonnegative. Let T0 = τ1 < τ2 < · · · denote the successive times of return to x0. We set
    U_p := Σ_{k=τ_{p−1}+1}^{τ_p} f(X_k).
The strong Markov property shows that the random variables U1, U2, . . . are i.i.d. We have
    E_{x0}[U1] = E_{x0}[ Σ_{k=1}^{T0} f(X_k) ] = E_{x0}[ Σ_{k=1}^{T0} Σ_{x∈X} f(x) I_{X_k=x} ]
    = Σ_{x∈X} f(x) E_{x0}[ Σ_{k=1}^{T0} I_{X_k=x} ] = Σ_{x∈X} f(x) π^0[x],
where the last step uses (4.3.1). The Strong Law of Large Numbers implies
    lim_{p→∞} (1/p) Σ_{k=1}^{p} U_k = Σ_{x∈X} f(x) π^0[x], P_{x0}-a.s.
In other words,
    (1/p) Σ_{k=1}^{τ_p} f(X_k) → Σ_{x∈X} f(x) π^0[x], P_{x0}-a.s.
Observing that τ_{ν(n)} ≤ n < τ_{ν(n)+1}, we deduce that for nonnegative f we have
    (1/ν(n)) Σ_{k=1}^{τ_{ν(n)}} f(X_k) ≤ (1/ν(n)) Σ_{k=1}^{n} f(X_k) ≤ (1/ν(n)) Σ_{k=1}^{τ_{ν(n)+1}} f(X_k)
    = ((ν(n)+1)/ν(n)) · (1/(ν(n)+1)) Σ_{k=1}^{τ_{ν(n)+1}} f(X_k).
Since the chain is recurrent we have ν(n) → ∞, so
    lim_{n→∞} (ν(n)+1)/ν(n) = 1.
Hence the extremes in the last inequality converge to Σ_{x∈X} f(x) π^0[x], P_{x0}-a.s.  □

Corollary 4.59. We have
    ν(n)/n → 1/E_{x0}[T_{x0}] = π_∞[x0], P_{x0}-a.s.     (4.3.3)

Proof. Let f = 1 in Proposition 4.58. This f is integrable, so
    n/ν(n) → π^0(X) = E_{x0}[T_{x0}],
using (4.2.11).  □

Corollary 4.60 (Ergodic Theorem). Suppose that (X_n)_{n≥0} is a positively recurrent irreducible HMC with state space X, transition matrix Q and stationary distribution π_∞. Let f ∈ L¹(X, π_∞). Then, for any µ ∈ Prob(X) we have
    lim_{n→∞} (1/n) Σ_{k=1}^{n} f(X_k) = ∫_X f(x) π_∞[dx], P_µ-a.s.     (4.3.4)

Proof. Assume first that the (X_n) are defined on the path space (X^{N0}, E, P_µ). Suppose µ = δ_{x0}. If we divide both sides of (4.3.2) by E_{x0}[T_{x0}] we deduce
    lim_{n→∞} (1/(ν(n) E_{x0}[T_{x0}])) Σ_{k=1}^{n} f(X_k) = ∫_X f(x) π_∞[dx].
Now observe that
    (1/n) Σ_{k=1}^{n} f(X_k) = (ν(n) E_{x0}[T_{x0}]/n) · (1/(ν(n) E_{x0}[T_{x0}])) Σ_{k=1}^{n} f(X_k),
and (4.3.3) implies
    ν(n) E_{x0}[T_{x0}]/n → 1.
More generally, for any µ ∈ Prob(X),
    P_µ = Σ_{x∈X} µ[x] P_x ∈ Prob(X^{N0}, E),
we denote by C the set
    C := { x = (x0, x1, . . . ) ∈ X^{N0} : lim_{n→∞} (1/n)( f(x1) + · · · + f(x_n) ) = ∫_X f(x) π_∞[dx] }.
From the above we deduce that P_x[C] = 1, ∀x ∈ X. Then, using (4.1.13),
    P_µ[C] = Σ_{x∈X} µ[x] P_x[C] = 1.
Suppose now that the random maps (X_n) are defined on a probability space (Ω, S, P), not necessarily the path space. Using the map X⃗ : Ω → X^{N0} we reduce this case to the situation we have discussed above.  □

The Ergodic Theorem is a Law of Large Numbers for a sequence of not necessarily independent random variables.
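The Ergodic Theorem can be illustrated by simulation: the empirical occupation frequencies along a single long trajectory approach the stationary distribution. The sketch below uses a small irreducible, aperiodic chain; the chain and its stationary distribution π = (1/3, 4/9, 2/9) are my own toy example, and the trajectory length is arbitrary.

```python
import random

Q = [[0.0, 1.0, 0.0],
     [0.5, 0.0, 0.5],
     [0.5, 0.5, 0.0]]
pi = [1/3, 4/9, 2/9]   # solves pi Q = pi, sum pi = 1

random.seed(1)          # fixed seed so the run is reproducible
x, counts = 0, [0, 0, 0]
N = 200_000
for _ in range(N):
    x = random.choices(range(3), weights=Q[x])[0]
    counts[x] += 1

# Taking f = indicator of state y in (4.3.4), (1/n) sum f(X_k) is the
# empirical frequency of y, which should approach pi[y].
freq = [c / N for c in counts]
```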

4.3.2 Aperiodic chains


When (Xn )n≥0 is an irreducible, aperiodic, positively recurrent HMC the Ergodic
Theorem can be considerably strengthened. We need to introduce some terminology.
The variation distance d_v(µ, ν) between two probability measures µ, ν ∈ Prob(X) is defined by
    d_v(µ, ν) := (1/2) Σ_{x∈X} |µ[x] − ν[x]|.

If X, Y are X -valued random variables, then the variation distance between them
is defined to be the variation distance between their distributions PX , PY ,

dv (X, Y ) := dv (PX , PY ).

Lemma 4.61. Let µ, ν ∈ Prob(X). Then
    d_v(µ, ν) = sup_{A⊂X} ( µ(A) − ν(A) ) = sup_{A⊂X} | µ(A) − ν(A) |.

Proof. The second equality follows from the elementary observation
    |µ(A) − ν(A)| = max{ µ(A) − ν(A), µ(A^c) − ν(A^c) }.
Now define
    B := { x ∈ X; µ[x] ≥ ν[x] }.
Observe that for any A ⊂ X we have
    µ(A) − ν(A) ≤ µ(A ∩ B) − ν(A ∩ B) ≤ µ(B) − ν(B).
The first inequality follows from the fact that, for x ∈ A ∩ B^c, we have µ[x] < ν[x]. A similar reasoning shows that
    ν(A) − µ(A) ≤ ν(B^c) − µ(B^c).
Observe that
    ( µ(B) − ν(B) ) + ( µ(B^c) − ν(B^c) ) = µ(X) − ν(X) = 0.
Hence
    sup_{A⊂X} ( µ(A) − ν(A) ) = µ(B) − ν(B) = (1/2) Σ_{x∈X} |µ[x] − ν[x]|.  □
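Both formulas for d_v can be compared directly on a small example; the two distributions below are my own choice. The half-L¹ sum from the definition and the supremum over events from Lemma 4.61 agree.

```python
from itertools import combinations

mu = [0.1, 0.4, 0.3, 0.2]
nu = [0.25, 0.25, 0.25, 0.25]

# Defining formula: half the L^1 distance.
dv_half_l1 = 0.5 * sum(abs(m - n) for m, n in zip(mu, nu))

# Lemma 4.61: supremum of mu(A) - nu(A) over all subsets A of the state space.
dv_sup = 0.0
for r in range(5):
    for A in combinations(range(4), r):
        dv_sup = max(dv_sup, sum(mu[x] - nu[x] for x in A))
```

The supremum is attained at B = {x; µ[x] ≥ ν[x]}, exactly as in the proof of the lemma.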

Remark 4.62. Suppose that µ, ν are two probability measures on a measurable space (Ω, S). Then their difference λ = µ − ν is a signed measure on (Ω, S). As such, λ has a positive part λ⁺, a negative part λ⁻, and a variation |λ| = λ⁺ + λ⁻. We set
    ‖λ‖ := |λ|(Ω).
For example, if Ω = R is equipped with the Borel sigma-algebra and µ, ν are given by densities p and respectively q, then
    ‖µ − ν‖ = ∫_R |p(x) − q(x)| dx.
If µ, ν are probability measures on a finite or countable set X, then
    d_v(µ, ν) = (1/2) ‖µ − ν‖.  □
One technique for estimating the variation distance between two probability
measures on X is based on the idea of coupling.

Definition 4.63. Let µ, ν ∈ Prob(X). A coupling of µ with ν is a probability measure λ on X × X whose marginals equal µ and respectively ν, i.e.,
    λ({x0} × X) = µ[x0],  λ(X × {x1}) = ν[x1], ∀x0, x1 ∈ X.
We will denote by Couple(µ, ν) the set of couplings of µ with ν.
A coupling of a pair of X-valued random variables X, Y is defined to be an X × X-valued random variable Z whose distribution is a coupling of the distributions of X and Y.  □

The next result explains the relevance of couplings in estimating the variation
distance between two measures.

Proposition 4.64. Let µ, ν ∈ Prob(X). Set
    X^{(2)} := { (x0, x1) ∈ X²; x0 ≠ x1 }.
Then,
    d_v(µ, ν) ≤ λ(X^{(2)}), ∀λ ∈ Couple(µ, ν).     (4.3.5)

Proof. For any A ⊂ X we have
    X^{(2)} ⊃ A × A^c = (A × X) \ (A × A),
so
    λ(X^{(2)}) ≥ λ(A × X) − λ(A × A) ≥ λ(A × X) − λ(X × A) = µ(A) − ν(A).
Hence
    λ(X^{(2)}) ≥ sup_{A⊂X} ( µ(A) − ν(A) ) = d_v(µ, ν).  □

Remark 4.65. The inequality (4.3.5) is optimal in the sense that there exists a coupling λ ∈ Couple(µ, ν) such that d_v(µ, ν) = λ(X^{(2)}). For details we refer to [20, Sec. 4.1.2] or [104, Sec. 4.2].  □

Definition 4.66. Two X-valued stochastic processes (X_n)_{n∈N0} and (Y_n)_{n∈N0} are said to couple if the random variable
    T = min{ n ∈ N; X_m = Y_m, ∀m ≥ n }
is a.s. finite. The random variable T is called the coupling time.  □

Lemma 4.67. Suppose that the X-valued processes (X_n)_{n∈N0} and (Y_n)_{n∈N0} couple with coupling time T. Then
    d_v(X_n, Y_n) ≤ P[T > n].
In particular,
    lim_{n→∞} d_v(X_n, Y_n) = 0.

Proof. For all A ⊂ X we have
    P[X_n ∈ A] − P[Y_n ∈ A] = P[X_n ∈ A, T ≤ n] + P[X_n ∈ A, T > n]
        − P[Y_n ∈ A, T ≤ n] − P[Y_n ∈ A, T > n]
(X_n = Y_n for n ≥ T)
    = P[X_n ∈ A, T > n] − P[Y_n ∈ A, T > n] ≤ P[X_n ∈ A, T > n] ≤ P[T > n].  □

Theorem 4.68. Suppose that Q is a probability transition matrix on the state space X such that the associated HMCs are irreducible, aperiodic and positively recurrent. Denote by π the unique invariant probability measure on X. Then, for any µ ∈ Prob(X),
    lim_{n→∞} d_v(µQⁿ, π) = 0.     (4.3.6)
In particular, if µ = δ_{x0} we deduce that
    π[x] = lim_{n→∞} Q^n_{x0,x}.     (4.3.7)
n→∞

Proof. Consider two independent HMCs (X_n)_{n≥0} and (Y_n)_{n≥0} with state space X and transition matrix Q, such that the initial distribution of (X_n) is µ and the initial distribution of (Y_n) is π. Since π is stationary, the probability distribution of Y_n is π for every n. According to Lemma 4.67 it suffices to show that the stochastic processes (X_n), (Y_n) couple.
Consider the stochastic process (X_n, Y_n). This is an HMC with state space X × X and transition matrix Q̂,
    Q̂_{(x0,y0),(x1,y1)} = Q_{x0,x1} · Q_{y0,y1}.
Since Q is irreducible and aperiodic, for any x0, x1, y0, y1 ∈ X there exists N0 > 0 such that Q^n_{x1,x1} · Q^n_{y1,y1} > 0, ∀n ≥ N0. We deduce that there exists N ≥ N0 such that
    Q^n_{x0,x1} · Q^n_{y0,y1} > 0, ∀n ≥ N.
This shows that Q̂ is irreducible.
The product measure π ⊗ π is an invariant probability measure of the chain (X_n, Y_n). Theorem 4.54 implies that Q̂ is positively recurrent. In particular, for any x ∈ X, the chain (X_n, Y_n) will almost surely reach (x, x) in finite time, so the processes couple.  □

Remark 4.69. Suppose that Q is the transition matrix of an irreducible, positively recurrent Markov chain with state space X and invariant probability measure π. We form a new stochastic matrix
    Q̃ = (1/2)(1 + Q).
The chain with this transition matrix is called the lazy version of the original chain. It is irreducible, and π is the invariant probability measure of the lazy version as well. However, the lazy chain is also aperiodic, even if the original chain is not. This follows from the equality
    Q̃^n_{x,y} = (1/2ⁿ) Σ_{k=0}^{n} C(n, k) Q^k_{x,y}.
This shows that if Q^k_{x,y} ≠ 0, then Q̃^n_{x,y} > 0, ∀n ≥ k. Using the terminology of generalized convergence in [79], we can say that the Euler means of the sequence (Q^n_{x,y})_{n≥0} converge to the invariant measure.  □
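The effect of laziness is easy to see in action. In the sketch below (the deterministic 3-cycle is my own toy example of a chain with period 3) the powers of Q never converge, since Q³ is the identity, while the powers of the lazy matrix (1 + Q)/2 converge to the uniform stationary distribution.

```python
def mat_mul(A, B):
    # Plain matrix product for small square matrices.
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Deterministic cycle 0 -> 1 -> 2 -> 0: irreducible with period 3.
Q = [[0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0]]
lazy = [[0.5 * ((i == j) + Q[i][j]) for j in range(3)] for i in range(3)]

P = [[float(i == j) for j in range(3)] for i in range(3)]
for _ in range(60):
    P = mat_mul(P, lazy)        # lazy^60: every row approaches (1/3, 1/3, 1/3)

Q3 = mat_mul(mat_mul(Q, Q), Q)  # Q^3 is the identity: the cycle never mixes
```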

4.3.3 Martingale techniques


Suppose that (X_n)_{n≥0} is an HMC with state space X, transition matrix Q and initial distribution π0. We assume that all the random variables X_n are defined on the same probability space (Ω, S, P). Set
    π_n := π0 · Qⁿ, ∀n ∈ N.
We denote by (F_n) the filtration
    F_n = σ(X0, X1, . . . , X_n) ⊂ S, n ≥ 0.
We want to investigate the (sub/super)martingales with respect to this filtration and show some of their applications to the dynamics of the underlying HMC.
Note that any function X → R is measurable with respect to the sigma-algebra 2^X. For this reason we will denote by L⁰(X) the space of functions X → R. We think of a function f ∈ L⁰(X) as a column vector (f(x))_{x∈X} and we denote by Qf the function described by the multiplication of the matrix Q with the column vector f. More precisely,
    (Qf)(x) = Σ_{y∈X} Q_{x,y} f(y), ∀x ∈ X.
There is a small problem with this definition: if X is infinite, then the above series may be divergent. Since the rows of Q define probability distributions on X, the above sums are finite if f is bounded. We obtain in this fashion a linear map
    Q : L^∞(X) → L^∞(X), f ↦ Qf.
We say that the transition matrix is locally finite if each of its rows has only finitely many nonzero entries. Equivalently, at each state x ∈ X the system can transition only to finitely many states. In this case Q defines a linear map
    Q : L⁰(X) → L⁰(X).
Note that
    Q I_X = I_X,  f ≥ 0 ⇒ Qf ≥ 0.
If we think of π_n as a row vector, then for any g ∈ L¹(X, π_n) we have
    ∫_X g dπ_n = π_n · g,
where the "·" denotes the multiplication of the one-row matrix π_n with the one-column matrix g. We deduce
    π_n · g = (π0 · Qⁿ) · g = π0 · (Qⁿ g).
Thus
    g ∈ L¹(X, π_n), ∀n ≥ 0 ⇐⇒ Qⁿ g ∈ L¹(X, π0), ∀n ≥ 0.
Definition 4.70. A function h ∈ L¹(X, π0) is called a Lyapunov function of the HMC (X_n)_{n≥0} if the stochastic process (h(X_n))_{n≥0} is a supermartingale adapted to the filtration (F_n), i.e.,
    h(X_n) ∈ L¹(Ω, S, P),  E[h(X_{n+1}) | F_n] ≤ h(X_n), ∀n ≥ 0.  □

Since the distribution of X_n is π_n, we deduce that
    f(X_n) ∈ L¹(Ω, S, P) ⇐⇒ f ∈ L¹(X, π_n).
The Markov condition implies E[f(X_{n+1}) | F_n] = E[f(X_{n+1}) | X_n]. Let us observe that
    E[f(X_{n+1}) | F_n] = E[f(X_{n+1}) | X_n] = (Qf)(X_n), ∀n ≥ 0.     (4.3.8)
Indeed,
    E[f(X_{n+1}) | X_n = x] = Σ_{y∈X} f(y) P[X_{n+1} = y | X_n = x] = Σ_{y∈X} Q_{x,y} f(y) = (Qf)(x).

Proposition 4.71. Let f ∈ L^∞(X). Then the sequence (f(X_n))_{n≥0} is a martingale (resp. supermartingale) iff Qf = f (resp. Qf ≤ f).  □

Definition 4.72. The operator ∆ := 1 − Q : L^∞(X) → L^∞(X) is called the Laplacian² of the HMC.  □

Observe that for any f ∈ L^∞(X) we have
    (∆f)(x) = Σ_{y∈X} Q_{x,y} ( f(x) − f(y) ).
Thus (f(X_n)) is a martingale iff ∆f = 0, i.e., f is harmonic with respect to this Laplacian. The sequence is a supermartingale iff ∆f ≥ 0, i.e., f is superharmonic with respect to the Laplacian ∆.
A function f : X → R is said to be harmonic on a subset U ⊂ X if
    ∆f(u) = 0, ∀u ∈ U ⇐⇒ f(u) = Σ_{x∈X} Q_{u,x} f(x), ∀u ∈ U.

Example 4.73. Fix a nonempty subset Y ⊂ X and x0 ∈ X \ Y. We denote by H_Y the hitting time of Y, and by H_{x0} the hitting time of x0. For x ∈ X we set
    f(x) = P_x[H_{x0} < H_Y],
i.e., f(x) is the probability that the system started at x hits x0 before it hits Y. Note that
    0 = f(y) ≤ f(x) ≤ 1 = f(x0), ∀x ∈ X, ∀y ∈ Y.
Note also that for any x ∉ {x0} ∪ Y we have
    f(x) = P_x[H_{x0} < H_Y] = Σ_{z∈X} P_x[H_{x0} < H_Y | X1 = z] Q_{x,z} = Σ_{z∈X} f(z) Q_{x,z} = (Qf)(x),
where we used P_x[H_{x0} < H_Y | X1 = z] = f(z). Thus f is harmonic on X \ ({x0} ∪ Y).  □
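Example 4.73 can be made concrete on the gambler's ruin chain of Example 4.41. Taking x0 = N and Y = {0}, the hitting probability is f(x) = x/N (a standard fact for the fair-coin walk, used here without proof), and the sketch below checks that f is harmonic away from the absorbing states.

```python
# Gambler's ruin chain on {0, ..., N}: f(x) = P_x[hit N before 0] = x/N.
N = 10
f = [x / N for x in range(N + 1)]

def Qf(x):
    # (Qf)(x) = sum_z Q_{x,z} f(z) for the fair-coin chain of Example 4.41.
    if x in (0, N):
        return f[x]            # absorbing states: Q_{x,x} = 1
    return 0.5 * f[x - 1] + 0.5 * f[x + 1]

# Harmonicity on X \ ({x0} u Y) = {1, ..., N-1}: f(x) = (Qf)(x) there.
harmonic_inside = all(abs(Qf(x) - f[x]) < 1e-12 for x in range(1, N))
```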
2 Here we are using the geometers’ convention. As defined, the Laplacian is nonnegative definite.

Proposition 4.74 (Lévy's martingale). Suppose that f ∈ B(X). For each n ∈ N0 we set

Mnf := f(Xn) − f(X0) + ∑_{k=0}^{n−1} ∆f(Xk)
    = f(Xn) − f(X0) + ∑_{k=0}^{n−1} ( f(Xk) − E[f(Xk+1) | Xk] ).

Then the sequence (Mnf)n≥0 is a martingale.
Proof. Note that

M(n+1)f − Mnf = f(Xn+1) − f(Xn) + ∆f(Xn)

and

E[ M(n+1)f − Mnf | Fn ] = E[f(Xn+1) | Fn] − f(Xn) + ∆f(Xn)
  = Qf(Xn) − f(Xn) + ∆f(Xn) = 0,

using (4.3.8) in the last step. □
Let I : X → R denote the indicator of X, I(x) = 1, ∀x ∈ X. Since Q is a stochastic matrix we have QI = I, so that the constant functions are harmonic.
Theorem 4.75. Suppose that the HMC (Xn )n≥0 is irreducible and recurrent. Then
any bounded Lyapunov function is constant.
Proof. We argue by contradiction. Suppose that h is a non-constant bounded Lyapunov function on X. There exist x0, x1 ∈ X such that h(x0) ≠ h(x1).

Suppose that π0 = δx0. The sequence h(Xn) is a bounded supermartingale. The Submartingale Convergence Theorem implies that the sequence h(Xn) converges a.s. On the other hand, since (Xn) is recurrent we deduce

P[ Xn = x0 i.o. ] = P[ Xn = x1 i.o. ] = 1.

Thus, the sequence h(Xn) has a.s. two different limit points h(x0) and h(x1), and thus h(Xn) is a.s. divergent! □
Corollary 4.76. If the irreducible HMC (Xn)n≥0 admits a nonconstant, bounded Lyapunov function, then it must be transient. □
Example 4.77. Suppose that (Xn)n≥0 describes the simple random walk on a locally finite connected graph G = (V, E) with vertex set V and edge set E. A function f : V → R is then superharmonic with respect to this Markov chain if its value at each vertex is at least the average of its values at the neighboring vertices:

f(v) ≥ (1/deg v) ∑_{u∼v} f(u),   ∀v ∈ V,

where u ∼ v indicates that the vertices u and v are neighbors, i.e., connected by an edge.
Suppose that G is a rooted binary tree. This means that G is a tree, it has a unique vertex v0 of degree 1 (the root), and every other vertex has degree 3. One can think of any vertex other than the root as having a unique direct ancestor and two direct successors, while the root has a unique successor.

One can think of v0 as the generation zero vertex. Its unique successor is the generation 1 vertex. Its two successors form the second generation of vertices. Their 4 successors determine the third generation etc. Equivalently, a vertex belongs to the n-th generation, n > 1, if its predecessor is in the (n − 1)-th generation. We obtain in this fashion a generation function

g : V → N0,   g(v) := the generation of the vertex v.

Define

f : V → [0, 1],   f(v) = 2^{−g(v)}.

Any vertex v ≠ v0 has two neighbors of generation g(v) + 1 and one neighbor of generation g(v) − 1. Hence

∑_{u∼v} f(u) = 2^{−g(v)+1} + 2 · 2^{−g(v)−1} = 3 · 2^{−g(v)} = 3 f(v),

so that

f(v) = (1/3) ∑_{u∼v} f(u),   ∀v ∈ V \ {v0}.

Obviously f(v0) > f(v), ∀v ∈ V \ {v0}. This proves that f is superharmonic, nonconstant and bounded, so the random walk on G is transient. □
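The generation g(Xn) of the walk is itself a Markov chain on N0: from generation n ≥ 1 it moves up with probability 2/3 (two successors) and down with probability 1/3 (one ancestor). A quick simulation (a sketch with a fixed seed, our own illustration and not part of the text) shows the transience concretely: the walk drifts away from the root and visits it only finitely often.

```python
import random

def generation_walk(steps, seed=0):
    """Simulate the generation of the simple random walk on the rooted
    binary tree: up w.p. 2/3, down w.p. 1/3, forced up at the root."""
    rng = random.Random(seed)
    g = 0
    visits_to_root = 0
    for _ in range(steps):
        if g == 0:
            g = 1
            visits_to_root += 1
        else:
            g += 1 if rng.random() < 2 / 3 else -1
    return g, visits_to_root

g_final, returns = generation_walk(100_000)
print(g_final, returns)   # the mean drift 2/3 - 1/3 makes g grow like steps/3
```

The handful of root visits, against a final generation in the tens of thousands, is the numerical shadow of transience.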
Definition 4.78. A function f ∈ L0(X) is called coercive if, for any C > 0, the set {f ≤ C} is a finite subset of X. □
Proposition 4.79. Let (Xn)n≥0 be an irreducible HMC with state space X and transition matrix Q. Suppose that there exists a nonnegative coercive function f : X → [0, ∞) and a finite set A ⊂ X such that

∑_{y∈X} Qx,y f(y) ≤ f(x),   ∀x ∈ X \ A.   (4.3.9)

Then (Xn)n≥0 is recurrent.
Proof. We follow [56, Sec. 2.2]. The condition (4.3.9) is equivalent to

E[ f(Xn+1) − f(Xn) | Xn = x ] ≤ 0,   ∀x ∈ X \ A.

Denote by TA the time of first return to A. For x ∈ X \ A we denote by (Ynx) the process started at x and stopped upon entry in A, Ynx := X_{n∧TA}.

The sequence Fnx = f(Ynx) is a bounded below supermartingale adapted to σ(X0, . . . , Xn). From the submartingale convergence theorem we deduce that Fnx converges a.s. to F∞x. Moreover, Fatou's Lemma implies

Ex[ F∞x ] ≤ Ex[ F0x ] = f(x).

In particular, Px[ F∞x = ∞ ] = 0, ∀x ∈ X \ A. Hence

Px[ lim f(X_{n∧TA}) = ∞ ] = Px[ F∞x = ∞ ] = 0.

We can now argue by contradiction. Suppose that the chain (Xn) is transient. Then, with probability 1, the chain Xn will exit any finite set never to return; see Exercise 4.11. Since f is coercive, each sublevel set {f < N} is finite, so for x ∈ X \ A, with probability 1, the chain Xn exits {f < N} never to return, for every N. Hence

Px[ lim_{n→∞} f(Xn) = ∞ ] = 1.

We deduce that

Px[ TA < ∞ ] = 1,   ∀x ∈ X \ A.

Indeed, on the event {TA = ∞} we have f(Xn) = f(X_{n∧TA}), ∀n > 0, so if Px[TA = ∞] > 0, then

Px[ lim f(X_{n∧TA}) = ∞ ] > 0,

contradicting the equality above. On the other hand, since (Xn) is transient, with probability 1 it will exit the finite set A in finite time, never to return. This is impossible because we have just shown that, once outside A, the chain returns to A in finite time. □
Remark 4.80. We want to mention that the condition (4.3.9) is also necessary for recurrence. For a proof we refer to [56, Sec. 2.2]. □
Theorem 4.81 (Foster). Let (Xn)n≥0 be an irreducible HMC with state space X and transition matrix Q. Suppose that there exist a function f : X → [0, ∞), a finite set A ⊂ X and ε > 0 such that

∑_{y∈X} Qx,y f(y) ≤ f(x) − ε,   ∀x ∈ X \ A,   (4.3.10a)

∑_{y∈X} Qx,y f(y) < ∞,   ∀x ∈ A.   (4.3.10b)

Then (Xn)n≥0 is positively recurrent.
Proof. We follow [56, Sec. 2.2]. Denote by TA the time of first return to A and set Yn := X_{n∧TA}. Suppose X0 = x ∈ X \ A. Then (4.3.10a) reads

Ex[ f(Yn+1) − f(Yn) | Yn ] = Ex[ f(Yn+1) | Yn ] − f(Yn) ≤ −ε I_{ {TA > n} }.

Thus f(Yn) is a nonnegative supermartingale and

Ex[ f(Yn+1) ] − Ex[ f(Yn) ] ≤ −ε Px[ TA > n ].

Hence

Ex[ f(Yn+1) ] − f(x) = Ex[ f(Yn+1) ] − Ex[ f(Y0) ] ≤ −ε ∑_{k=0}^{n} Px[ TA > k ],

so that

∑_{k=0}^{n} Px[ TA > k ] ≤ (1/ε) f(x).

Letting n → ∞ we deduce

Ex[ TA ] ≤ (1/ε) f(x),   ∀x ∈ X \ A.   (4.3.11)

Now let a ∈ A. Then

Ea[ TA ] = ∑_{b∈A} Qa,b + ∑_{x∈X\A} Qa,x ( 1 + Ex[ TA ] )
  = 1 + ∑_{x∈X\A} Qa,x Ex[ TA ] ≤ 1 + (1/ε) ∑_{x∈X\A} Qa,x f(x) < ∞,

using (4.3.11) and then (4.3.10b). Thus Ea[TA] < ∞, ∀a ∈ A, and Proposition 4.57 implies that (Xn)n≥0 is positively recurrent. □
Remark 4.82. (a) Note that condition (4.3.10a) reads

∆f(x) ≥ ε,   ∀x ∈ X \ A.

Moreover, condition (4.3.10b) is automatically satisfied if Q is locally finite, i.e., on each row there are only finitely many nonzero entries.

(b) If (Xn)n≥0 is positively recurrent and x0 ∈ X, then the function f : X → [0, ∞),

f(x) = Ex[ Tx0 ] for x ≠ x0,   f(x0) = 0,

satisfies the conditions of Theorem 4.81 with A = {x0}, ε = 1. □
Example 4.83. Consider the biased random walk on N0 = {0, 1, . . . } with transition probabilities

Q0,1 = 1,   Qn,n+1 = pn,   Qn,n−1 = qn := 1 − pn,   ∀n ∈ N.

Above pn, qn > 0, ∀n ∈ N, so that the corresponding Markov chain is irreducible. Consider the coercive function

f : N0 → [0, ∞),   f(n) = n.

Then, ∀n ≥ 1, we have

∆f(n) = n − ( pn(n + 1) + qn(n − 1) ) = qn − pn.

Thus, if qn ≥ pn for all n ≥ 1, this random walk is recurrent by Proposition 4.79. Moreover, if

inf_{n∈N} (qn − pn) > 0,

then Foster's Theorem shows that this random walk is positively recurrent. □
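For constant pn = p < 1/2 the mean return time to 0 can be computed in closed form, E0[T0] = 1 + 1/(q − p), which makes this example a convenient test case for Foster's bound (4.3.11). A Monte Carlo sketch (our own illustration with a fixed seed, not from the text):

```python
import random

# Sketch: Monte Carlo estimate of the mean return time to 0 for the biased
# walk with constant p_n = 0.4, q_n = 0.6, so eps = q - p = 0.2.
# The exact value is E_0[T_0] = 1 + 1/(q - p) = 6.
def mean_return_time(p=0.4, trials=20_000, seed=1):
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        n, t = 1, 1                  # the first step 0 -> 1 is forced
        while n > 0:
            n += 1 if rng.random() < p else -1
            t += 1
        total += t
    return total / trials

m = mean_return_time()
print(m)                             # close to 6
```

The estimate is finite and stable, in agreement with positive recurrence; making p approach 1/2 inflates both the exact value 1 + 1/(q − p) and the simulation time.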
4.4 Electric networks

4.4.1 Reversible Markov chains as electric networks
Suppose that (Xn)n≥1 is an irreducible, reversible, locally finite HMC with state space X and transition matrix Q. We recall that local finiteness means that

∀x ∈ X,   #{ y ∈ X ; Qx,y ≠ 0 } < ∞.

Reversibility means that there exists a function c : X → (0, ∞) such that

c(y)Qy,x = c(x)Qx,y,   ∀x, y ∈ X.   (4.4.1)

Note that any positive multiple of c also satisfies (4.4.1). We set

c(x, y) := c(x)Qx,y,   ∀x, y ∈ X.

The detailed balance condition (4.4.1) shows that c(x, y) = c(y, x) and c(x, y) ≠ 0 iff Qx,y > 0. Note that

Qx,y = c(x, y)/c(x),   c(x) = ∑_{y∼x} c(x, y).   (4.4.2)
It is convenient to visualize this Markov chain as a random walk on an undirected graph with vertex set X and weighted edges. Two vertices x, y are connected by an edge if and only if Qx,y > 0. Since the Markov chain is irreducible, this graph is connected.

We use the notation x ∼ y to indicate that the vertices/nodes x, y are connected by an edge. We say that two vertices x, y are neighbors if x ∼ y. For x ∈ X we denote by N(x) the set of neighbors of x. If y ∈ N(x), then we weigh the connecting edge with the weight c(x, y) = c(y, x).

We will assume c(x, x) = 0, ∀x ∈ X, i.e., the associated graph has no loops.

The Markov chain dynamics has the following equivalent description: if the system is at the state/vertex x, it will transition to a neighbor y with probability proportional to the weight c(x, y). The weights (c(x))x∈X define a Q-invariant measure µc on X,

µc[S] = ∑_{s∈S} c(s).   (4.4.3)
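The correspondence (4.4.1)-(4.4.3) is easy to verify numerically. The following sketch (the 4-node conductance matrix is a made-up example, not from the text) builds Q from the conductances and checks detailed balance and Q-invariance of µc:

```python
import numpy as np

# Sketch on a hypothetical 4-node network: conductances c(x, y) determine
# the walk via Q[x, y] = c(x, y)/c(x); the weights c(x) satisfy detailed
# balance (4.4.1) and define the Q-invariant measure of (4.4.3).
C = np.array([[0, 2, 1, 0],
              [2, 0, 3, 1],
              [1, 3, 0, 1],
              [0, 1, 1, 0]], dtype=float)   # symmetric, zero diagonal

c = C.sum(axis=1)                  # c(x) = sum_y c(x, y)
Q = C / c[:, None]                 # stochastic transition matrix

assert np.allclose(Q.sum(axis=1), 1.0)
# detailed balance: c(x) Q[x, y] = c(y) Q[y, x] (both equal c(x, y))
assert np.allclose(c[:, None] * Q, (c[:, None] * Q).T)
# invariance: mu_c Q = mu_c, since sum_x c(x) Q[x, y] = sum_x c(x, y) = c(y)
assert np.allclose(c @ Q, c)
print("detailed balance and invariance hold")
```

Any symmetric nonnegative matrix with zero diagonal and positive row sums would do here; the asserts only rely on the identities in the text.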
Formally, an electric network is a triplet (X, E, c), where (X, E) is a locally finite, connected, unoriented graph and c : X × X → [0, ∞). The set of vertices X is assumed to be at most countable. We regard the set of edges E as a symmetric subset of X × X, i.e., (x, y) ∈ E ⟺ (y, x) ∈ E. We assume there are no loops, i.e., ∀(x, y) ∈ E, x ≠ y. We will frequently use the notation x ∼ y to indicate that (x, y) ∈ E.

The function c : X × X → [0, ∞) satisfies

• c(x, y) > 0 ⟺ (x, y) ∈ E;
• c(x, y) = c(y, x), ∀(x, y) ∈ E.
We have seen that a reversible Markov chain determines an electric network. Conversely, an electric network (X, E, c) determines a reversible Markov chain with state space X and transition matrix Q : X × X → [0, 1],

Qx,y = c(x, y)/c(x),   c(x) = ∑_{y∈N(x)} c(x, y).

An electric network corresponds to a physical electric network in which an edge between two nodes x, y corresponds to a resistor between these two nodes, with resistance

r(x, y) = 1/c(x, y).

The quantity c(x, y) is called the conductance.
4.4.2 Sources, currents and chains
The connection between electric networks and the dynamics of the associated Markov chain is through the classical physical laws of Kirchhoff and Ohm. As shown in the pioneering work of Nash-Williams [119], this point of view can shed remarkable insight into the behavior of Markov chains. In the remainder of this section we will highlight some of this fruitful interplay between probability and physics. For more about this we refer to [48; 74; 109; 142], which served as our sources of inspiration.

First, a matter of notation. For every pair of elements s, s′ of a set S we denote by δs,s′ the Kronecker symbol

δs,s′ = 1 if s = s′,   δs,s′ = 0 if s ≠ s′.
As observed by R. Bott and by H. Weyl, see e.g. [16], the physical laws of electric
networks have simple geometric interpretations, best expressed in the language of
Hodge theory.
The main objects in Hodge theory are the chain/cochain complexes. To define them we need to make some choices.

Consider a locally finite graph (X, E). An orientation of the graph is a subset E+ ⊂ E such that for any edge (x, y) ∈ E either (x, y) ∈ E+ or (y, x) ∈ E+, but not both.

One can obtain such an E+ by assigning orientations (arrows) along the edges and defining E+ as the collection of positively oriented edges. More precisely, (x, y) ∈ E+ if and only if the arrow of the oriented edge goes from x to y.

The vector space of 0-chains, denoted by C0, consists of formal sums of the type

j := ∑_{x∈X} j(x)[x],   j(x) ∈ R, ∀x ∈ X.

Equivalently, C0 = R^X. In physics a 0-chain is known as a source (of current); typically j(x) = 0 for all but finitely many x.
The vector space C1 of 1-chains consists of skew-symmetric functions

i : E → R,   (x, y) ↦ i(x, y).

For any (x, y) ∈ E, define [x] ↦ [y] : E → R by setting

([x] ↦ [y])(e) = 1 if e = (x, y),   −1 if e = (y, x),   0 otherwise.

If we fix an orientation E+, we will identify an oriented edge e = (x, y) ∈ E+ with the current [x] ↦ [y] and we write ie := [x] ↦ [y].

Once we fix an orientation E+ we can describe each current as a formal sum of the type

i = ∑_{(x,y)∈E+} i(x, y) [x] ↦ [y] = (1/2) ∑_{(x,y)∈E} i(x, y) [x] ↦ [y].

In physics, 1-chains are known as currents. One should think of [x] ↦ [y] as representing the edge (x, y) oriented from x to y. A current can then be visualized as an assignment of arrows and weights on edges, with the understanding that we get the same current if we reverse any of the arrows and change its weight to the opposite.
A 0-chain j is called compactly supported if j(x) = 0 for all but finitely many x. Similarly, a 1-chain i is compactly supported if i(x, y) = 0 for all but finitely many edges (x, y) ∈ E. For k = 0, 1 we denote by Ck^cpt the space of compactly supported k-chains. There are boundary operators

∂ : C1 → C0,   ∂ : C0^cpt → R,

defined as follows.

• If i ∈ C1, then

∂i := ∑_{x∈X} w(x)[x],   w(x) = ∑_{y∈N(x)} i(y, x) = −∑_{y∈N(x)} i(x, y),   ∀x ∈ X.

In particular, for x0, x1 ∈ X,

∂( [x0] ↦ [x1] ) = [x1] − [x0].

• If j ∈ C0^cpt, then

∂j = ∑_{x∈X} j(x) ∈ R.
Let us observe that for any compactly supported current i we have ∂²i = 0. Indeed,

∂(∂i) = ∑_{x∈X} ∑_{y∈N(x)} i(y, x) = ∑_{(x,y)∈E} i(y, x) = 0,

since i(x, y) + i(y, x) = 0 whenever x ∼ y.

Remark 4.84. If X is infinite, then there could exist 1-chains i such that ∂i ∈ C0^cpt yet ∂²i ≠ 0. □
The (finite) paths in the graph are special examples of compactly supported 1-chains. By a path of length n we understand a sequence of neighbors

x0, x1, . . . , xn,   x_{k−1} ∼ x_k,   x_{k−1} ≠ x_k,   ∀k = 1, . . . , n.

The associated 1-chain i = i_{x0,x1,...,xn} is

i_{x0,x1,...,xn} = ∑_{k=1}^{n} [x_{k−1}] ↦ [x_k].

Note that

∂ i_{x0,x1,...,xn} = [xn] − [x0].

The path is closed if x0 = xn or, equivalently, ∂ i_{x0,x1,...,xn} = 0.
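These notions are easy to experiment with numerically. In the sketch below (our own illustration, not from the text), a 1-chain on a graph with vertices 0, ..., n−1 is stored as an antisymmetric matrix, the boundary is computed from the convention ∂i(x) = ∑_y i(y, x), and we check that a path chain has boundary [xn] − [x0] while a closed path has boundary 0:

```python
import numpy as np

# Sketch: 1-chains stored as antisymmetric matrices i[x, y] = i(x, y);
# the boundary is the 0-chain (∂i)(x) = sum_y i(y, x).
def boundary(i):
    return i.sum(axis=0)          # column sums: (∂i)(x) = Σ_y i(y, x)

def path_chain(n, path):
    """1-chain of a path, i.e. the sum of [x_{k-1}] -> [x_k]."""
    i = np.zeros((n, n))
    for a, b in zip(path, path[1:]):
        i[a, b] += 1.0
        i[b, a] -= 1.0            # skew-symmetry
    return i

i = path_chain(5, [0, 1, 2, 3])
print(boundary(i))        # [-1, 0, 0, 1, 0], i.e. [x3] - [x0]
loop = path_chain(5, [0, 1, 2, 0])
print(boundary(loop))     # all zeros: a closed path has no boundary
```

The same representation makes ∂²i = 0 for compactly supported currents a one-line check: the entries of an antisymmetric matrix sum to zero.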
4.4.3 Kirchhoff's laws and Hodge theory
The actual sources and currents in a real electric network are governed by Kirchhoff's laws. We refer to [8, Chap. 12] for a more detailed description of the physical aspects. Fix an electric network (X, E, c).

Kirchhoff's first law states that the source of a (physical) current i ∈ C1 is the 0-chain j = −∂i. More explicitly, this means that

j(x) = ∑_{y∈N(x)} i(x, y),   ∀x ∈ X,   (4.4.4)

i.e., the source at x equals the net current flowing out of x. This is a purely topological condition, in the sense that it is independent of the choice of conductance function.
The physics/geometry enters the scene through the conductance function. More precisely, in physics each current i in an electric network has finite energy³ defined by

Er[i] := (1/2) ∑_{(x,y)∈E} r(x, y) i(x, y)².   (4.4.5)

If we fix an orientation E+ of the edges, we obtain the equivalent description

Er[i] = ∑_{(x,y)∈E+} r(x, y) i(x, y)² = ∑_{(x,y)∈E+} i(x, y)²/c(x, y).   (4.4.6)

We denote by C1^∞ the space of finite energy 1-chains. The space C1^∞ is endowed with a (resistor) inner product

⟨i1, i2⟩r := ∑_{(x,y)∈E+} r(x, y) i1(x, y) i2(x, y).   (4.4.7)

Thus, a physical current is an element of C1^∞.

³The physical units of the expression in (4.4.6) are indeed the units for energy, Joules.
To formulate Kirchhoff's second law we need to introduce the concept of cochain. The cochains are objects dual to chains. The space of 0-cochains (resp. 1-cochains) is the dual vector space of C0 (resp. C1),

C⁰ := C0* = Hom(C0, R),   C¹ := C1* = Hom(C1, R).

One can think of a 0-cochain as a function u : X → R. Physicists call such functions potentials. For each x ∈ X denote by δx ∈ C⁰ the elementary 0-cochain defined by

δx([y]) = δ_{x,y},   ∀y ∈ X.

A 0-cochain is then a formal sum

u = ∑_{x∈X} u(x) δx.

For each (x, y) ∈ E denote by dx ↦ dy : E → R the elementary 1-cochain defined by

(dx ↦ dy)(x′, y′) = 1 if (x′, y′) = (x, y),   −1 if (x′, y′) = (y, x),   0 otherwise.

A 1-cochain should be viewed as a formal sum

v = ∑_{(x,y)∈E+} v(x, y) dx ↦ dy = (1/2) ∑_{(x,y)∈E} v(x, y) dx ↦ dy,   v(x, y) = −v(y, x).

More concretely, we identify a 1-cochain with a skew-symmetric function v : E → R. In physics such a function is called voltage and it is measured in Volts.

We define the “integral” of a 1-cochain v along a path γ = x0, x1, . . . , xn to be the real number

∫_γ v := ∑_{k=1}^{n} v(x_{k−1}, x_k).
There exists a coboundary operator d : C⁰ → C¹ that associates to each function u : X → R its “differential”

du = ∑_{(x,y)∈E+} ( u(y) − u(x) ) dx ↦ dy.   (4.4.8)

A 1-cochain v is called exact if it is the differential of a 0-cochain. The following fact is left to the reader as an exercise.

Lemma 4.85. A 1-cochain v is exact if and only if its integral along any closed path is 0. Equivalently, this means that the integral along a path depends only on the endpoints of the path. □
The energy of a 1-cochain

v = ∑_{(x,y)∈E+} v(x, y) dx ↦ dy

is

Ec[v] := ∑_{(x,y)∈E+} c(x, y) v(x, y)² = (1/2) ∑_{(x,y)∈E} c(x, y) v(x, y)².

We denote by C¹_∞ the space of finite energy 1-cochains. It is a Hilbert space with (conductance) inner product

⟨v1, v2⟩c := ∑_{(x,y)∈E+} c(x, y) v1(x, y) v2(x, y).   (4.4.9)

Hence Ec[v] = ⟨v, v⟩c.
We have a “resistor” duality map R : C1^∞ → C¹, C1^∞ ∋ i ↦ Ri = i*,

R( ∑_{(x,y)∈E+} i(x, y) [x] ↦ [y] ) = ∑_{(x,y)∈E+} r(x, y) i(x, y) dx ↦ dy.

Note that, since r(x, y) = 1/c(x, y), we have

Ec[Ri] = ∑_{(x,y)∈E+} c(x, y) r(x, y)² i(x, y)² = ∑_{(x,y)∈E+} r(x, y) i(x, y)² = Er[i],

so that R induces a (bijective) isometry of Hilbert spaces C1^∞ → C¹_∞. In fact, C¹_∞ can be identified with the topological dual of C1^∞, with the induced inner product and norm. For this reason we will refer to Ri as the dual of i and, when no confusion is possible, we will write i* := R(i).

Ohm's law states that for any current i in an electric network there is a difference of potential/voltage u(x, y) between any two neighbors x ∼ y, related to i(x, y) via the equality

u(x, y) = r(x, y) i(x, y).   (4.4.10)

In other words, the collection of voltages associated to the current i is the dual 1-cochain Ri.

Kirchhoff's second law states that a finite energy current i generated by a source j = −∂i has a special property: the dual 1-cochain of voltages is exact. In other words, there exists a function u : X → R such that du = −i* = −Ri, i.e.,

c(x, y)( u(x) − u(y) ) = (1/r(x, y)) ( u(x) − u(y) ) = i(x, y),   ∀(x, y) ∈ E+.

Note that

Ec[du] := ⟨du, du⟩c = ⟨i, i⟩r = Er[i] < ∞.   (4.4.11)
Definition 4.86. A Kirchhoff current is a finite energy current i such that its dual i* = Ri is exact. A function u ∈ C⁰ such that i* = −du is called a potential of the Kirchhoff current. □
Suppose i is a Kirchhoff current and u is a potential of i. If the graph is connected, then any other potential of i differs from u by an additive constant. The source j = −∂i of i can be described explicitly in terms of a potential u of i. The equality ∂i = −j reads

∑_{y∈N(x)} c(x, y)( u(y) − u(x) ) = −j(x),   ∀x ∈ X.

Since c(x, y) = c(x)Qx,y we deduce

∑_{y∈N(x)} Qx,y ( u(y) − u(x) ) = −(1/c(x)) j(x),   ∀x ∈ X.

Equivalently, this means that

∆u(x) = (1/c(x)) j(x),   ∀x ∈ X,   (4.4.12)

where ∆ = 1 − Q is the Laplacian of the HMC with transition matrix Q; see Definition 4.72.
Denote by C⁰_∞ the space of finite energy 0-chains, i.e., 0-chains u satisfying

∑_{x∈X} c(x) u(x)² < ∞.

This defines an inner product on C⁰_∞,

⟨u1, u2⟩c = ∑_{x∈X} c(x) u1(x) u2(x).   (4.4.13)

As such, C⁰_∞ can be identified with the Hilbert space L²(X, µc), where µc is the Q-invariant measure on X determined by the detailed balance equations; see (4.4.3).

We denote by C⁰_cpt the space of functions u : X → R vanishing outside a finite set. Let us observe that if α1, α2 are k-cochains and at least one of them is compactly supported, then we can define ⟨α1, α2⟩c using the same expressions (4.4.9), (4.4.13) as above.
Proposition 4.87 (Discrete integration by parts). For any u ∈ C⁰ and any v ∈ C⁰_cpt we have

⟨∆u, v⟩c = ⟨du, dv⟩c = ⟨u, ∆v⟩c.   (4.4.14)

Proof. We have

⟨∆u, v⟩c = ∑_{x∈X} c(x) ∆u(x) v(x) = ∑_{x∈X} ( ∑_{y∈N(x)} c(x) Qx,y ( u(x) − u(y) ) ) v(x)
  = ∑_{(x,y)∈E} c(x, y)( u(x) − u(y) ) v(x)
  = ∑_{(x,y)∈E+} c(x, y)( u(x) − u(y) ) v(x) + ∑_{(y,x)∈E+} c(x, y)( u(x) − u(y) ) v(x)

(change variables x ↔ y in the second sum)

  = ∑_{(x,y)∈E+} c(x, y)( u(x) − u(y) ) v(x) + ∑_{(x,y)∈E+} c(x, y)( u(y) − u(x) ) v(y)
  = ∑_{(x,y)∈E+} c(x, y)( u(x) − u(y) )( v(x) − v(y) ) = ⟨du, dv⟩c.

The same argument, with the roles of u and v reversed, shows that ⟨du, dv⟩c = ⟨u, ∆v⟩c. The above expressions are well defined since both dv and ∆v are compactly supported, because the graph is locally finite. □
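The identity (4.4.14) is also easy to test numerically on a finite graph, where every function is compactly supported. A sketch (a random weighted complete graph of our own choosing, not from the text):

```python
import numpy as np

# Sketch: numerical check of discrete integration by parts (4.4.14),
# <Δu, v>_c = <du, dv>_c, on a random weighted graph.
rng = np.random.default_rng(0)
n = 6
C = rng.random((n, n)); C = np.triu(C, 1); C = C + C.T   # conductances
c = C.sum(axis=1)
Q = C / c[:, None]

u = rng.standard_normal(n)
v = rng.standard_normal(n)

lap = lambda f: f - Q @ f                                # Δ = 1 - Q

lhs = np.sum(c * lap(u) * v)                             # <Δu, v>_c
# <du, dv>_c = (1/2) Σ_{x,y} c(x,y) (u(y)-u(x)) (v(y)-v(x))
du = u[None, :] - u[:, None]                             # du[x, y] = u(y) - u(x)
dv = v[None, :] - v[:, None]
rhs = 0.5 * np.sum(C * du * dv)

print(abs(lhs - rhs))   # agreement up to floating-point rounding
```

By symmetry of the expression in u and v, the same computation also confirms ⟨du, dv⟩c = ⟨u, ∆v⟩c.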
Here is a simple consequence. We say that a set S ⊂ X is cofinite if X \ S is finite.

Corollary 4.88. Suppose that S is a nonempty cofinite set and u : X → R is a solution of the boundary value problem

∆u(x) = 0, ∀x ∈ X \ S,   u(s) = 0, ∀s ∈ S.

Then u(x) = 0, ∀x ∈ X.

Proof. We have 0 = ⟨∆u, u⟩c = ⟨du, du⟩c. Hence du = 0 and, since X is connected, we deduce that u is constant. Since S ≠ ∅, we deduce that u is identically zero. □
4.4.4 A probabilistic perspective on Kirchhoff's laws
Denote by (Xn)n≥0 the random walk on the weighted graph defined by the electric network (X, E, c).

Let S ⊂ X be a nonempty subset. Recall that we denote by HS, respectively TS, the hitting time, respectively the return time, to S. Fix a bounded function ϕ : S → R and define

u = uϕ : X → R,   u(x) = Ex[ ϕ(X_{HS}) ].

Conditioning on the first step we deduce that u is harmonic on X \ S and u = ϕ on S. Corollary 4.88 shows that if S is cofinite, then u is the unique function on X that is harmonic on X \ S and equal to ϕ on S.

Let us investigate a special case of this construction. Consider a cofinite set S− and x+ ∈ X \ S−. Set S := {x+} ∪ S−. If ϕ = I_{ {x+} } : S → R, then the computation in Example 4.73 shows that uϕ is

u(x) := Px[ HS = H_{x+} ] = Px[ H_{x+} < H_{S−} ].   (4.4.15)

Thus u(x) is the probability that the random walk started at x reaches x+ before S−. Clearly this function has finite energy since it has compact support. To this function we associate the current i defined by Ri = −du. More precisely,

i = ∑_{(x,y)} c(x, y)( u(x) − u(y) ) [x] ↦ [y],
and its source is

j = j^{x+,S−} : X → R,   j(x) = c(x)∆u(x),

by (4.4.12). The current i has compact support, contained in the finite set of edges with one end in the finite set X \ S−. Hence ∂²i = 0, so that

0 = ∑_{x∈X} j(x) = ∑_{x∈X} c(x)∆u(x).   (4.4.16)

The energy of u is

⟨du, du⟩c = ⟨u, ∆u⟩c = ∑_{x∈X} c(x) u(x) ∆u(x) = c(x+) u(x+) ∆u(x+) = u(x+) j(x+),   (4.4.17)

since ∆u vanishes outside {x+} ∪ S− and u vanishes on S−. Now observe that u(x+) = 1, so that

∆u(x+) = 1 − ∑_{x∈N(x+)} Q_{x+,x} u(x) = 1 − ∑_{x∈N(x+)} Q_{x+,x} Px[ H_{x+} < H_{S−} ] = P_{x+}[ T_{x+} > H_{S−} ].

Hence

Ec[du] = c(x+)∆u(x+) = j^{x+,S−}(x+) = c(x+) P_{x+}[ T_{x+} > H_{S−} ] =: κ(x+, S−).   (4.4.18)

The quantity κ(x+, S−) is called the effective conductance from x+ to S−. Its inverse is called the effective resistance between x+ and S−, denoted by Reff(x+, S−). Thus

Reff(x+, S−) = 1 / ( c(x+) P_{x+}[ T_{x+} > H_{S−} ] ).

We set

u_{x+,S−} := (1/κ(x+, S−)) u,   u_{x+,S−}(x) = (1/κ(x+, S−)) Px[ H_{x+} < H_{S−} ],

where u is the function defined in (4.4.15). This is the potential of the compactly supported Kirchhoff current i_{x+,S−} such that

R i_{x+,S−} = −d u_{x+,S−},   (4.4.19)

with source

j_{x+,S−} := (1/κ(x+, S−)) j^{x+,S−},   j_{x+,S−}(x) = c(x)∆u(x) / ( c(x+) P_{x+}[ T_{x+} > H_{S−} ] ).   (4.4.20)

Note that j_{x+,S−}(x+) = 1.
Its energy is

E^{x+,S−} := Ec[ d u_{x+,S−} ] = (1/κ(x+, S−)²) Ec[du] = 1/κ(x+, S−) = u_{x+,S−}(x+).   (4.4.21)

Since

u(x) = Px[ H_{x+} < H_{S−} ] ≤ 1 = u(x+),   ∀x ∈ X,

we deduce

0 ≤ u_{x+,S−}(x) ≤ u_{x+,S−}(x+) = E^{x+,S−},   ∀x ∈ X.   (4.4.22)

Let us observe that if X is finite and S− = {x−}, then the equality (4.4.16) shows that

0 = ∑_{x∈X} j_{x+,x−}(x) = j_{x+,x−}(x+) + j_{x+,x−}(x−),

and thus

j_{x+,x−}(x) = ±1 for x = x±,   j_{x+,x−}(x) = 0 for x ∉ {x+, x−}.
Definition 4.89. A flow from x+ to S− on the electric network is a finite energy current i such that ∂i = −j_{x+,S−}. The source j_{x+,S−} defined in (4.4.20) is called the unit dipole with source x+ and sink S−. □

A flow from x+ to S− satisfies the second Kirchhoff law if and only if it has finite energy and i* is the differential of a function u : X → R. We will refer to such flows as Kirchhoff flows.
Lemma 4.90. Suppose that i is a compactly supported current such that ∂i = 0. Then for any u : X → R we have ⟨i*, du⟩c = 0.

Proof. We have

∑_{y∈N(x)} i(x, y) = 0,   ∀x ∈ X.

We recall that i(x, y) = −i(y, x), ∀(x, y) ∈ E. We have

⟨i*, du⟩c = ∑_{(x,y)∈E+} r(x, y) c(x, y) i(x, y)( u(y) − u(x) )
  = ∑_{(x,y)∈E+} i(x, y)( u(y) − u(x) ) = (1/2) ∑_{(x,y)∈E} i(x, y)( u(y) − u(x) )
  = −(1/2) ∑_{x∈X} u(x) ∑_{y∈N(x)} i(x, y) + (1/2) ∑_{y∈X} u(y) ∑_{x∈N(y)} i(x, y) = 0.

All the above sums involve only finitely many terms since i is compactly supported. □
Theorem 4.91. Suppose S− is cofinite and x+ ∈ X \ S−. Then the following hold.

(i) The current i0 := i_{x+,S−} defined by (4.4.19) is the unique compactly supported Kirchhoff current with source the dipole j0 = j_{x+,S−}. In particular, it is a Kirchhoff flow from x+ to S−.

(ii) The voltage function u = u_{x+,S−} that determines i0 is the unique compactly supported solution of the boundary value problem

∆v(x) = 0, ∀x ∈ X \ ( S− ∪ {x+} ),   v(s) = 0, ∀s ∈ S−,   c(x+)∆v(x+) = 1.   (4.4.23)

(iii) The energy of i0 is

E[i0] = E^{x+,S−} = u_{x+,S−}(x+) = 1 / ( c(x+) P_{x+}[ T_{x+} > H_{S−} ] ) = Reff(x+, S−).

(iv) If i1 is another compactly supported flow from x+ to S−, then E[i1] ≥ E[i_{x+,S−}].
Proof. (i) Set u0 := u_{x+,S−}. Recall that u0 has compact support. Suppose that i1 is another compactly supported Kirchhoff flow from x+ to S−. Then there exists a function u1 : X → R such that i1* = −du1. We deduce from (4.4.12) that the functions uk, k = 0, 1, are solutions of the same equation

∆uk(x) = (1/c(x)) j0(x),   ∀x ∈ X.

If we write u = u1 − u0, then ∆u = 0 on X. The function u may not have compact support, but du does. We have

⟨du, du⟩c = (1/2) ∑_{(x,y)∈E} c(x, y)( u(x) − u(y) )( u(x) − u(y) )
  = (1/2) ∑_{(x,y)∈E} c(x, y)( u(x) − u(y) ) u(x) − (1/2) ∑_{(x,y)∈E} c(x, y)( u(x) − u(y) ) u(y)
  = (1/2) ∑_{x∈X} u(x) ∑_{y∈N(x)} c(x, y)( u(x) − u(y) ) + (1/2) ∑_{y∈X} u(y) ∑_{x∈N(y)} c(y, x)( u(y) − u(x) ) = 0,

since the inner sums equal c(x)∆u(x) = 0 and c(y)∆u(y) = 0, respectively. Hence du = 0, so that i0 = i1.

(ii) If v1, v2 are two compactly supported solutions of (4.4.23), then the argument above applied to v := v1 − v2 shows that ⟨dv, dv⟩c = 0 and, since v is compactly supported, we deduce that v = 0.

The equality (iii) follows from (4.4.21).
(iv) Set i := i1 − i0. Then

E[i1] = E[i0 + i] = ⟨i0*, i0*⟩c + 2⟨i0*, i*⟩c + ⟨i*, i*⟩c ≥ E[i0] + 2⟨i0*, i*⟩c,

since ⟨i*, i*⟩c ≥ 0. Since i0* = −du0, Lemma 4.90 shows that ⟨i0*, i*⟩c = −⟨du0, i*⟩c = 0, because i has compact support and ∂i = ∂i1 − ∂i0 = 0. □
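The formula (4.4.18) for the effective conductance can be checked on a toy network. In the sketch below (a hypothetical 3-node series network, not from the text) we solve for the hitting probability u, read off the escape probability, and recover the familiar series rule for resistors:

```python
import numpy as np

# Sketch: effective resistance via the escape probability, as in (4.4.18):
# R_eff(x+, S-) = 1 / (c(x+) * P_{x+}[T_{x+} > H_{S-}]).
# Hypothetical network: nodes 0-1-2 in series, c(0,1) = 1, c(1,2) = 2,
# with x+ = 0 and S- = {2}.
C = np.array([[0, 1, 0],
              [1, 0, 2],
              [0, 2, 0]], dtype=float)
c = C.sum(axis=1)
Q = C / c[:, None]

# Solve u = Qu at the interior node, u(0) = 1, u(2) = 0, as in (4.4.15).
A = np.eye(3) - Q
b = np.zeros(3)
A[0] = [1, 0, 0]; b[0] = 1.0
A[2] = [0, 0, 1]; b[2] = 0.0
u = np.linalg.solve(A, b)

escape = 1.0 - Q[0] @ u          # P_0[T_0 > H_{S-}] = Δu(0), since u(0) = 1
R_eff = 1.0 / (c[0] * escape)
print(R_eff)                     # series resistors: 1/1 + 1/2 = 1.5
```

The same routine works for any finite network; only the conductance matrix and the choice of source and sink change.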
Remark 4.92. (a) Part (iv) of the theorem is known as the Thomson or Dirichlet Principle. It classically states that the Kirchhoff flow is the least energy compactly supported flow sourced by the dipole j_{x+,S−}. Observe that the energy of the Kirchhoff flow carries information about the dynamics of the Markov chain associated to the electric network.

(b) The Kirchhoff flow from x+ to S− is the unique compactly supported current i such that

• ∂i(x+) = −1;
• there exists a function u : X → R, identically zero on S−, such that i* = −du.

□
4.4.5 Degenerations

To proceed further we perform a reduction to a finite network. We set

S+ := X \ S−,   ∂S+ := { s ∈ S− ; N(s) ∩ S+ ≠ ∅ }.
For simplicity we assume that x+ does not have any neighbor in S−. We obtain a new finite electric network X/S−, described as follows.

• Its vertex set is S+ ∪ {x−}: think of all the vertices in S− as identified to a single point x−.
• The conductances c*(x, y) of X/S− are defined according to the rule

c*(x, y) = c(x, y) if x, y ∈ S+,   c*(x, x−) = ∑_{s∈∂S+} c(x, s),   c*(x−, y) = ∑_{s∈∂S+} c(s, y).
Note that c*(x) = c(x), ∀x ∈ S+. We denote by ∆* the Laplacian determined by these conductances. The function u = u_{x+,S−} is identically zero on S− and thus descends to a function u* on X/S− such that u*(x−) = 0. The set S+ is also a subset of X/S−, and

∆* u* = ∆u on S+.
Moreover,

c*(x±) ∆* u*(x±) = ±1.

Thus u* is the potential of the Kirchhoff flow on X/S− from x+ to x−. We denote by E^{x+,x−} its energy.

Note that the induced Kirchhoff flow on X/S− has the same energy as the original Kirchhoff flow on X, i.e.,

E^{x+,x−} = E^{x+,S−}.   (4.4.24)

On the finite graph X/S− the flows from x+ to x− can be thought of as built from paths from x+ to x−. They all have finite energy, and the Kirchhoff flow is the flow with minimal energy from x+ to x−.

In view of this reduction to finite graphs we concentrate on finite electric networks. Suppose (X, E, c) is such a network and x+, x− ∈ X, x+ ≠ x−. For finite graphs the finite energy condition is automatically satisfied and a flow from x+ to x− is simply a 1-chain i such that

∂i = [x−] − [x+].

The source [x+] − [x−] is called a dipole with source x+ and sink x−.
The flow condition involves only the topology of the graph and is independent of the physics/geometry of the network encoded by the conductance function. However, the Kirchhoff flow depends on the physics/geometry of the network.
Denote by i = i_{x+,x−} the Kirchhoff flow with source x+ and sink x−. Its potential grounded at x− is the function u = u_{x+,x−} : X → R uniquely determined by the equations

∆u(x) = 0, ∀x ∈ X \ {x+, x−},   u(x−) = 0,   c(x+)∆u(x+) = 1.   (4.4.25)

Then i_{x+,x−} = −R^{−1} d u_{x+,x−}. The energy of this flow is

E^{x+,x−} = u_{x+,x−}(x+) = 1 / ( c(x+) P_{x+}[ T_{x+} > T_{x−} ] ).   (4.4.26)

This quantity is an invariant of the quadruplet (X, c, x+, x−).
Clearly, if we vary the conductance function the energy changes, and a flow that is minimal for one choice of conductance may fail to be so for another choice. In particular, a flow that has minimal energy with respect to a conductance function may not have this property if we change the conductance or, equivalently, the resistance function r(x, y) = 1/c(x, y) ∈ (0, ∞]. We will indicate the dependence of E^{x+,x−} on r using the notation E^{x+,x−}(r).
Suppose we change the conductance function to a new function c′ that is bigger or, equivalently, such that r′(x, y) ≤ r(x, y). Then for any current i we have

Er[i] = (1/2) ∑_{(x,y)∈E} r(x, y) i(x, y)² ≥ (1/2) ∑_{(x,y)∈E} r′(x, y) i(x, y)² = Er′[i].

This implies the following result, known as the Rayleigh Principle.
Theorem 4.93 (Rayleigh). The energy of the Kirchhoff flow with given source and sink increases when the resistance function increases or, equivalently, when the conductance function decreases.

Proof. Suppose that r′ ≤ r. Denote by i(r) the Kirchhoff flow with source x+, sink x− and choice of resistance r. Define i(r′) in a similar fashion. We have

E^{x+,x−}(r) = Er[ i(r) ] ≥ Er′[ i(r) ] ≥ Er′[ i(r′) ] = E^{x+,x−}(r′),

where the last inequality follows from the minimality of the Kirchhoff flow i(r′). □
We can use this principle to produce estimates for E^{x+,x−}(r) in terms of E^{x+,x−}(r′) if r′ is chosen wisely, making E^{x+,x−}(r′) easier to compute. One way to simplify the computation of E^{x+,x−}(r′) is to modify the topology of the graph. We can achieve this by pushing r to extreme values. Let us describe two such degenerations.
Suppose y0 , y1 ∈ X \ {x+ , x− } are two nodes connected by an edge. Upon
rescaling c we can assume that c(y0 , y1 ) = 1 = r(y0 , y1 ). We have a family of
deformed resistances
rt : E → (0, ∞), t > 0,  rt(x, x0) = t if (x, x0) = (y0, y1) or (y1, y0), and rt(x, x0) = r(x, x0) otherwise.

We denote by it the Kirchhoff flow with source x+, sink x− and resistances rt, by
Et its energy, Et = E x+,x− (rt), and by ut its potential grounded at x− defined by
(4.4.25).
The Rayleigh Principle shows that Et is an increasing function of t. We want to
describe what happens with Et and ut as t → 0 and as t → ∞.
Cutting. The behavior as t → ∞ is described by the electric network
(X ∞ , c∞ , E ∞ ) obtained by cutting the edge (y0 , y1 ). More precisely
X∞ = X, E∞ = E \ {(y0, y1), (y1, y0)},
c∞(x, x0) = lim_{t→∞} ct(x, x0) = 0 if (x, x0) = (y0, y1) or (y1, y0), and c∞(x, x0) = c(x, x0) otherwise.
Shorting. The behavior as t → 0 is described by the network (X0, E0, c0) obtained
by shorting the edge (y0, y1). Intuitively, the shorted network is obtained by
collapsing the vertices y0, y1 to a single point ∗; see Figure 4.4. More precisely

• X0 = ( X \ {y0, y1} ) ∪ {∗}.


• If x, x0 ∈ X \ {y0, y1}, then c0(x, x0) = c(x, x0), so that
(x, x0) ∈ E ⇐⇒ (x, x0) ∈ E0.
• If x ∈ X \ {y0, y1}, then c0(x, ∗) = c(x, y0) + c(x, y1), so that (x, ∗) ∈ E0 if and
only if (x, y0) ∈ E or (x, y1) ∈ E.
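The shorting construction above is purely combinatorial and can be sketched directly; the dictionary representation and node names below are assumptions of this sketch, not the book's notation. Parallel edges created by the identification merge by adding conductances, exactly as in c0(x, ∗) = c(x, y0) + c(x, y1).

```python
def short_edge(cond, y0, y1, star="*"):
    """Collapse vertices y0, y1 of an undirected conductance dict
    {frozenset({u, v}): c} into a single vertex `star`."""
    out = {}
    for edge, c in cond.items():
        u, v = tuple(edge)
        u = star if u in (y0, y1) else u
        v = star if v in (y0, y1) else v
        if u == v:          # the shorted edge itself disappears
            continue
        key = frozenset({u, v})
        out[key] = out.get(key, 0) + c   # parallel edges merge additively
    return out

c = {frozenset({"a", "y0"}): 2, frozenset({"a", "y1"}): 3,
     frozenset({"y0", "y1"}): 1, frozenset({"y1", "b"}): 5}
c0 = short_edge(c, "y0", "y1")
assert c0[frozenset({"a", "*"})] == 5   # c(a, y0) + c(a, y1) = 2 + 3
assert c0[frozenset({"b", "*"})] == 5
```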

Note that we have a natural projection p : X → X0,
p(x) = x if x ≠ y0, y1, and p(x) = ∗ if x = y0 or y1.

Fig. 4.4 Shorting an electric network along the edge (y0, y1).

For ε ∈ {0, ∞} denote by uε (i.e., u0, resp. u∞) the potential grounded at x− of the Kirchhoff
flow in (Xε, Eε, cε) with source x+ and sink x−. Denote by Eε the energy of uε,
Eε = (1/2) Σ_{(x,y)∈Eε} cε(x, y)( uε(x) − uε(y) )².

Theorem 4.94 (Maxwell-Rayleigh). Suppose that y0, y1 ∈ X \ {x+, x−} have
the property that the removal of the edge connecting them does not disconnect the
graph (X, E). Then
lim_{t→∞} ut(x) = u∞(x), ∀x ∈ X,
lim_{t→0} ut(x) = u0( p(x) ), ∀x ∈ X,
and
lim_{t→ε} Et = Eε, ε = 0, ∞.
In particular, E0 ≤ Et ≤ E∞, ∀t > 0. Thus the energy of the Kirchhoff flow from
x+ to x− is increased by cutting and decreased by shorting.

Proof. We will carry out the proof in several steps. We set r = r1.
1. Compactness. Fix a path γ in X from x+ to x− that avoids the edge (y0, y1),
γ = x+ = x0, x1, . . . , xn = x−.
The rt-energy of this path is
Ert[ γ ] = Σ_{k=1}^n r(x_{k−1}, x_k) = Er[ γ ].
It is independent of t since the path avoids the only edge whose resistance depends
on t. We deduce from Thompson's principle that
Et ≤ Er[ γ ], ∀t > 0.
The local estimate (4.4.22) implies that
0 ≤ ut(x) ≤ Et ≤ Er[ γ ], ∀t > 0, ∀x ∈ X.
This shows that the family of functions ut : X → [0, ∞) is relatively compact with
respect to the usual topology of the finite dimensional vector space RX.
2. t → ∞. In this case observe that
lim_{t→∞} ct(x, y) = c∞(x, y), ∀x, y ∈ X.

We will show that as t → ∞ the family ut has only one limit point. Suppose that
for a sequence tn → ∞ the functions utn converge to a function v. The function
utn satisfies the equation
Σ_{y∈X} ctn(x, y)( utn(x) − utn(y) ) = 0 for x ≠ x±, = ±1 for x = x±, with utn(x−) = 0.
Letting n → ∞ we deduce that v satisfies
Σ_{y∈X} c∞(x, y)( v(x) − v(y) ) = 0 for x ≠ x±, = ±1 for x = x±, with v(x−) = 0.
According to Theorem 4.91(ii) the above equation has a unique solution, the potential
u∞ of the Kirchhoff flow from x+ to x− grounded at x− in (X∞, c∞), proving
that
lim_{t→∞} ut = u∞.
The equality lim_{t→∞} Et = E∞ is obvious.
3. t → 0. The above argument fails in this case because ct(y0, y1) = 1/t. Pick a
sequence tn ↓ 0 such that utn has a limit u0 as n → ∞. To simplify the presentation
we will write ut instead of utn. We will show that
u0(y0) = u0(y1) (4.4.27)

and the induced function ū0 on X0,
ū0(x) = u0(x) for x ≠ ∗, ū0(∗) = u0(y0) = u0(y1),
satisfies
ū0(x−) = 0, (4.4.28a)
Σ_{y∈N0(x)} c0(x, y)( ū0(x) − ū0(y) ) = 0 for x ∈ X0 \ {x+, x−}, = 1 for x = x+. (4.4.28b)

We set
N∗(y0) := N(y0) \ {y1}, N∗(y1) := N(y1) \ {y0},
c∗(yi) := Σ_{y∈N∗(yi)} c(yi, y), i = 0, 1.
Denote by N0(∗) the set of neighbors of ∗ in the graph (X0, E0). Note that
N0(∗) = N∗(y0) ∪ N∗(y1), c0(∗) = c∗(y0) + c∗(y1).
Since ∆ct ut(yi) = 0, i = 0, 1, we deduce
(1/t)( ut(y0) − ut(y1) ) + Σ_{y∈N∗(y0)} c(y0, y)( ut(y0) − ut(y) ) = 0,
so that
( 1 + tc∗(y0) )ut(y0) − ut(y1) = t Σ_{y∈N∗(y0)} c(y0, y)ut(y).

A similar computation shows that
−ut(y0) + ( 1 + tc∗(y1) )ut(y1) = t Σ_{y∈N∗(y1)} c(y1, y)ut(y).

Thus ( ut(y0), ut(y1) ) is the solution of the 2 × 2 non-homogeneous linear system
A(t) · ( ut(y0), ut(y1) )⊤ = t · ( c0(t), c1(t) )⊤, A(t) := [ a0(t), −1; −1, a1(t) ],
where
ai(t) = 1 + tc∗(yi), ci(t) = Σ_{y∈N∗(yi)} c(yi, y)ut(y), i = 0, 1.

Note that
det A(t) = a0(t)a1(t) − 1 = t( c∗(y0) + c∗(y1) ) + O(t²) = tc0(∗) + O(t²).

Set
A0(t) = [ c0(t), −1; c1(t), a1(t) ], A1(t) = [ a0(t), c0(t); −1, c1(t) ].
Using Cramer's rule we deduce
ut(y0) = t det A0(t)/det A(t) = ( a1(t)c0(t) + c1(t) )/( c0(∗) + O(t) ),
ut(y1) = t det A1(t)/det A(t) = ( a0(t)c1(t) + c0(t) )/( c0(∗) + O(t) ).
Now observe that
lim_{t→0} ai(t) = 1
and, since N0(∗) = N∗(y0) ∪ N∗(y1),
lim_{t→0}( c0(t) + c1(t) ) = Σ_{y∈N0(∗)} c0(∗, y)u0(y).

Hence
u0(y0) = u0(y1) = ū0(∗) := ( Σ_{y∈N0(∗)} c0(∗, y)u0(y) )/c0(∗).
This proves (4.4.27). The equality (4.4.28a) is obvious. Observe that
ū0(∗) Σ_{y∈N0(∗)} c0(∗, y) = Σ_{y∈N0(∗)} c0(∗, y)ū0(y),
i.e.,
Σ_{y∈N0(∗)} c0(∗, y)( ū0(∗) − ū0(y) ) = 0.

This proves (4.4.28b) for x = ∗.
If x ∈ X \ {∗, x−}, then
Σ_{y∈N(x)} ct(x, y)( ut(x) − ut(y) ) = 1 for x = x+, = 0 for x ≠ x+.

The equality (4.4.28b) for x ≠ ∗, x− follows by letting t → 0 above and observing
that
lim_{t→0} ut(yi) = ū0(∗), i = 0, 1, lim_{t→0}( ct(x, y0) + ct(x, y1) ) = c0(x, ∗)
and
N0(x) = ( N(x) \ {y0, y1} ) ∪ {∗}.
This proves the equality (4.4.28b). This determines ū0 uniquely and shows that
lim_{t→0} ut(x) = ū0( p(x) ).

It remains to verify only the claim
lim_{t→0} Et = E0.
Note that
Et = (1/2) Σ_{(x,y)∈E} ct(x, y)( ut(x) − ut(y) )².

There are two problematic terms in the above sum, corresponding to (x, y) = (y0, y1)
or (y1, y0), and their contribution to the energy is
(1/t)( ut(y0) − ut(y1) )².
Now observe that
ut(y0) − ut(y1) = ( c0(t)( a1(t) − 1 ) − c1(t)( a0(t) − 1 ) )/( c0(∗) + O(t) ) = t( c∗(y1)c0(t) − c∗(y0)c1(t) )/( c0(∗) + O(t) ).
Hence
(1/t)( ut(y0) − ut(y1) )² = O(t) as t → 0,
so
lim_{t→0} Et = lim_{t→0} (1/2) Σ_{(x,y)∈E\{(y0,y1),(y1,y0)}} ct(x, y)( ut(x) − ut(y) )²
= (1/2) Σ_{(x,y)∈E0} c0(x, y)( u0(x) − u0(y) )² = E0. ⊓⊔

Remark 4.95. (i) Let us explain what happens if the edge (y0, y1) disconnects
the graph but x+, x− lie in the same connected component of the resulting graph.
Denote by (X0, E0) the connected component containing x+, x− and by (X∗, E∗)
the other component. The compactness part of the argument still works since the
energy of ut is bounded by the energy of a path in (X0, E0) connecting x+ to x−.
Denote by ut0 the restriction of ut to X0 and by ut∗ its restriction to X∗.
Then
E[ dut ] = (1/2) Σ_{(x,y)∈E0} c(x, y)( ut(x) − ut(y) )² + t( ut(y0) − ut(y1) )² + (1/2) Σ_{(x,y)∈E∗} c(x, y)( ut(x) − ut(y) )²,
where Et0 denotes the first sum and Et∗ the last.

Note that
lim_{t→0} t( ut(y0) − ut(y1) )² = 0.

Arguing exactly as in Step 2 of the proof of Theorem 4.94 one can show that ut0
converges to u0x+,x−, the potential grounded at x− of the Kirchhoff flow in X0 from
x+ to x−. If u∗ is any limit point of ut∗, then u∗ satisfies ∆∗u∗ = 0, so
⟨du∗, du∗⟩∗ = ⟨∆∗u∗, u∗⟩∗ = 0,
so 0 is the only limit point of Et∗ as t → 0. Hence
Ec[ duc ] ≤ lim_{t→0} Ect[ dut ] = Ec0[ du0 ],
so the energy of the Kirchhoff flow from x+ to x− in X is not greater than the energy
of the similar flow in X0.
(ii) To understand why shorting is tricky recall that X is finite, so the Markov chain
defined by the conductance ct has an invariant probability measure given by
πt(x) = ct(x)/Zt, Zt = Σ_{x∈X} ct(x).

If we let t = ct(y0, y1) → ∞ and leave the other conductances unchanged, then
πt(x) → 0, ∀x ≠ y0, y1, πt(x) → 1/2, x = y0, y1.
(iii) In view of the conservation of energy equality (4.4.24), the cutting and shorting
procedures can be used in infinite graphs to estimate the energy E x+,S− by reducing
them to cutting/shorting procedures on the collapsed graph X/S−. Cutting has to be
performed with care so that while cutting edges we do not disconnect x+ from S−.
⊓⊔

4.4.6 Applications
We want to illustrate the usefulness of the above results on some concrete examples.
When the graph (X, E) is finite and all the edges have the same conductance,
the Kirchhoff flow from x+ to x− can be described explicitly in terms of certain counts
of spanning trees, [74, Thm. 1.16]. In particular, its energy K(x+, x−) is a topological
invariant of the quadruplet (X, E, x+, x−) described explicitly in terms of
spanning trees.
If we now assign conductances c to the edges, the energy E x+,x− (c) of the
Kirchhoff flow from x+ to x− satisfies
( 1/sup c(x, y) ) K(x+, x−) ≤ E x+,x− (c) ≤ ( 1/inf c(x, y) ) K(x+, x−).
The computation of K(x+ , x− ) is impractical for complicated graphs, but the above
rather rough estimate expresses in a simple fashion the fact that E x+ ,x− (c) depends
on both the topology and the geometry of the electrical network.

Example 4.96. Suppose that (X , E, c) is a finite electric network such that the
underlying graph is a tree. Then for any pair of points x+ , x− there exists a unique
1-chain i such that
∂i = [x− ] − [x+ ].
It is described by a minimal path
x+ = x0 , x1 , . . . , xn = x− .
This is the Kirchhoff flow from x+ to x− and its energy is
E x+,x− = Σ_{i=1}^n r(x_{i−1}, x_i) = Σ_{i=1}^n 1/c(x_{i−1}, x_i).
As a special case of this, consider the Ehrenfest urn model. Recall that the state
space is the set X := {0, 1, . . . , B}, B ∈ N, and the transition matrix Q is given by
Qk,k−1 = k/B, ∀k ≥ 1, Qj,j+1 = (B − j)/B, ∀j < B.
As explained in Example 4.47, this can be described as an electric network whose
underlying graph is a path
0 → 1 → · · · → B,
and conductances
c(j, j + 1) = (B choose j)·(B − j)/B = (B−1 choose j).
In particular,
c(j) = (B−1 choose j) + (B−1 choose j−1) = (B choose j).
If B is even, B = 2N, then
E 0,N = E N,0 = Σ_{j=0}^{N−1} 1/(2N−1 choose j).
Thus
PN[ TN > T0 ] = 1/( c(N)E N,0 ), P0[ T0 > TN ] = 1/( c(0)E N,0 ).
Hence
P0[ T0 > TN ]/PN[ TN > T0 ] = c(N)/c(0) = (2N choose N) ∼ 4^N/√(πN).
In particular, this shows that PN[ TN > T0 ] is extremely small for large N. Thus
if initially the two chambers contain equal numbers of balls, the probability that
during the random transfers of balls between them one of the chambers will continuously
have less than half the balls until it empties is extremely small. In fact,
the expected time of emptying the left chamber while starting with equal numbers
of balls in both is (see [87, Sec. VII.3, p. 175] with s = 2N)
EN[ T0 ] ∼ 4^N( 1 + A/N ) as N → ∞, 1 ≤ A ≤ 2. (4.4.29)
This example is historically important because it was used to explain an apparent
contradiction between Boltzmann's kinetic theory of gases and classical thermodynamics.
We refer to [13; 84] for more details. ⊓⊔
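The closed formulas in the example are easy to check numerically; the choice N = 50 below is an arbitrary test size, not from the text.

```python
import math

N = 50   # half the number of balls
# Conductances of the Ehrenfest network: c(j) = C(2N, j), c(j, j+1) = C(2N-1, j).
ratio = math.comb(2 * N, N)                 # = c(N)/c(0), since c(0) = 1
stirling = 4**N / math.sqrt(math.pi * N)
assert abs(ratio / stirling - 1) < 0.01     # the asymptotic C(2N,N) ~ 4^N/sqrt(pi N)

E_N0 = sum(1 / math.comb(2 * N - 1, j) for j in range(N))
p_escape = 1 / (math.comb(2 * N, N) * E_N0)  # = P_N[T_N > T_0]
assert p_escape < 1e-25                       # astronomically small, as claimed
```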

Remark 4.97. There is a discrepancy between the estimate (4.4.29) proved in [87]
and the estimate for EN[ T0 ] proved in [9, Sec. III.5] which states that
EN[ T0 ] = (4^N/N)( 1 + O(1/N) ) as N → ∞. (4.4.30)
The estimate (4.4.30) also contradicts the estimates [88, Eq. (4.27)] and [125,
Eq. (7)]. ⊓⊔

Example 4.98 (Random walks on infinite graphs). Let us investigate the
standard random walk on an infinite, locally finite graph (X, E, c). Thus we think
of an electric network in which all edges have the same conductance 1. For x, y ∈ X
define dist(x, y) as the minimal length of a path joining x and y. Fix x+ ∈ X and set
Bn := { x ∈ X ; dist(x+, x) ≤ n },
Σn := { x ∈ X ; dist(x+, x) = n } = Bn \ Bn−1, Sn− := X \ Bn.
Note that the balls Bn are finite. For n ∈ N we denote by C(n) the total number
of edges connecting a point in Σn−1 to a point in Σn .

Fig. 4.5 Shorting an infinite electric network inside spheres.

Form the collapsed electric network (Xn, En, cn), Xn := X/Sn−. The set Sn−
corresponds to a unique vertex x−n in Xn; see the top of Figure 4.5. Denote by
E x+,x−n
the energy of the Kirchhoff flow in Xn from x+ to x−n.

As we have seen,
1/( c(x+)Px+[ Tx+ > HSn− ] ) = E x+,x−n.

Observe that the collapsed network X/Sn− is obtained from the collapsed network
X/S−n+1 by first shorting the edges in Σn ⊂ X/S−n+1 and then shorting the edge
(x−n, x−n+1). Hence
E x+,x−n ≤ E x+,x−n+1.

We set
E x+,∞ := lim_{n→∞} E x+,x−n = lim_{n→∞} 1/( c(x+)Px+[ Tx+ > HSn− ] ).
Thus
lim_{n→∞} Px+[ Tx+ > HSn− ] = 1/( c(x+)E x+,∞ ).
We deduce that the associated Markov chain is recurrent if and only if E x+ ,∞ = ∞
and transient otherwise.
To estimate E x+,x−n from below we short edges in X/Sn−. First we short the
edges between points in Σk, k = 1, . . . , n − 1. We obtain the electric network X∗n
at the bottom of Figure 4.5. As explained in Example 4.96, the energy of the Kirchhoff
flow in X∗n from x+ to x−n is
En = Σ_{k=1}^n 1/C(k) ≤ E x+,x−n.
Hence
E x+,∞ ≥ Σ_{k=1}^∞ 1/C(k).
We deduce that if
Σ_{k=1}^∞ 1/C(k) = ∞,
then the corresponding Markov chain is recurrent.
To estimate E x+,∞ from above we use the cutting trick. We gradually remove
edges such that the component containing x+ has infinitely many vertices. Restricting
to the component containing x+ we obtain an electric network with bigger
E x+,∞ according to Theorem 4.94 and Remark 4.95(iii).
Thus if the graph (X, E) contains a connected subgraph (X0, E0) such that the
random walk on X0 is transient, then the random walk on (X, E) is also transient.
⊓⊔

Example 4.99 (Random walk on Z2). Suppose that (X, E, c) corresponds to
the standard random walk on Z2. Observe that the sphere Σn−1, n − 1 > 0, is the
square
Σn−1 = { (x, y) ∈ Z2 ; |x| + |y| = n − 1 }.

Each of the four vertices of this square is connected to Σn through 3 edges. The
interior of each of the four sides contains (n − 2) lattice points and each of them is
connected to Σn through 2 edges. Thus
C(n) = 12 + 8(n − 2) = 8n − 4, ∀n ∈ N.
Since
Σ_{n≥1} 1/(8n − 4) = ∞,
we deduce again that the random walk on Z2 is recurrent. ⊓⊔
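The count C(n) = 8n − 4 can be verified by brute force; the small range used below is an arbitrary choice of this sketch.

```python
def sphere(n):
    # lattice points at l^1-distance n from the origin in Z^2
    return {(x, y) for x in range(-n, n + 1) for y in range(-n, n + 1)
            if abs(x) + abs(y) == n}

def C(n):
    """Number of edges joining the sphere of radius n-1 to the sphere of radius n."""
    inner, outer = sphere(n - 1), sphere(n)
    steps = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    return sum((p[0] + dx, p[1] + dy) in outer for p in inner for dx, dy in steps)

assert all(C(n) == 8 * n - 4 for n in range(1, 30))
```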

Example 4.100 (Random walks on symmetric trees). Consider the unbiased
random walk on an infinite locally finite tree (X, E). Fix x+ ∈ X and
think of x+ as the root of the tree. As such, every vertex has a unique predecessor
and a number s(x) of successors, so the degree is
d(x) = s(x) + 1 for x ≠ x+, and d(x+) = s(x+).

Define Bn, Σn, Sn− as in the previous example. We assume that the tree is radially
symmetric about the root, i.e., for any n ∈ N the vertices on the sphere Σn have the
same number sn of successors. Set
σk := |Σk|.
Note that for any k ≥ 0 we have
σk+1 = s0 s1 · · · sk.
One can think of σk as the "volume" of the sphere Σk.
We want to investigate the unbiased random walk on this tree. Equivalently,
this means assigning conductance 1 to every edge. We want to solve the equation
∆u(x) = 0 for x ∈ Bn \ {x+}, ∆u(x+) = 1/d(x+),
subject to the boundary condition
u(x) = 0, ∀x ∈ Sn− := X \ Bn.
We know that this equation has a unique solution. We can invoke the symmetry of
the graph and show that this solution must be constant along the spheres Σn, but
we do not really need to do this. If we can find a solution with this property, then
it has to be the solution. So we make use of this Ansatz and seek a solution that is
constant on the spheres.
Denote by uk the value of u on Σk. We set u0 := u(x+) and
∆k = uk − uk+1, ∀k ≥ 0.

Note that ∆n = un. For k ∈ {1, . . . , n} we have
uk = ( sk uk+1 + uk−1 )/( sk + 1 ),
so that
( sk + 1 )uk = sk uk+1 + uk−1 ⇐⇒ ∆k−1 = sk ∆k.
Iterating we deduce
∆k−1 = sk+1 · · · sn ∆n = ( s0 s1 · · · sn )/( s0 · · · sk ) ∆n = ( σn/σk )∆n = ( σn/σk )un.
Hence
u0 = u0 − un+1 = Σ_{k=0}^n ∆k = σn un ( Σ_{k=0}^n 1/σk ).
The equation
∆u(x+) = 1/s0
is equivalent to ∆0 = 1/s0, so that
1/s0 = ( σn/s0 )un, un = 1/σn, E x+,Sn− = u0 = Σ_{k=0}^n 1/σk.
k=0
Hence,

X 1
E x+ ,∞ = .
σk
k=0
This shows that if the number of vertices on Σn grows fast, the random walk is
transient, and if it grows slowly, the walk is recurrent. Intuitively, the more vertices
far away, the more opportunities to get lost. As an example fix d ∈ N, d ≥ 2. We
denote by Td the rooted radially symmetric tree with successor sequence (sn) given
by
sn = d if n = 2^k − 1, k ≥ 0, and sn = 1 otherwise.
Thus
σn = d^{k+1}, 2^k ≤ n < 2^{k+1},
and
Σ_{n=0}^∞ 1/σn = 1/d + Σ_{n=2}^3 1/d² + Σ_{n=4}^7 1/d³ + · · · = (1/d) Σ_{k=0}^∞ (2/d)^k = 1/(d − 2) if d ≥ 3, and = ∞ if d = 2.
Thus, the random walk on Td is transient if d ≥ 3 and recurrent if d = 2.
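A floating-point check of the geometric series computation for Td; the truncation at 2^16 terms is an arbitrary choice of this sketch.

```python
def sigma(n, d):
    # sigma_n = d^(k+1) for 2^k <= n < 2^(k+1)
    k = n.bit_length() - 1          # largest k with 2^k <= n
    return d ** (k + 1)

tail3 = sum(1.0 / sigma(n, 3) for n in range(1, 2 ** 16))
assert abs(tail3 - 1.0) < 0.005     # converges to 1/(d - 2) = 1: transient for d = 3

tail2 = sum(1.0 / sigma(n, 2) for n in range(1, 2 ** 16))
assert tail2 > 7.9                  # each dyadic block contributes 1/2: divergent, recurrent
```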
We can obtain a more striking example of a recurrent random walk by choosing
the successor sequence to be
sn = k if n = k!, k ≥ 2, and sn = 1 otherwise.
For more information about random walks on trees we refer to the very comprehensive
monograph [110]. ⊓⊔

4.5 Finite Markov chains

For HMC-s with finite state space the theory simplifies somewhat and new techniques are available.

4.5.1 The Perron-Frobenius theory


Consider a homogeneous Markov chain with finite state space
X = Im := {1, 2, . . . , m}.
In this case the transition matrix Q is an m × m stochastic matrix, i.e., a matrix
with nonnegative entries such that the sum of the entries on each row is 1. If we
set
e := (1, . . . , 1)⊤ ∈ Rm,
then we see that an m × m matrix Q with nonnegative entries is stochastic iff
Qe = e.
We view measures on X as row vectors µ = (µ1, . . . , µm).

For convenience we will denote by Rm the space of row vectors and by Cm the
space of column vectors. We will denote the row vectors using Greek letters and
we will think of them as signed measures on X . The matrix Q acts on row vectors
by right multiplication µ ↦ µ · Q, and on column vectors by left multiplication,
v ↦ Q · v.
A signed measure µ ∈ Rm is a probability measure if
µk ≥ 0, ∀k ∈ Im, µ · e = 1.
Let Probm ⊂ Rm denote the space of probability measures on Im. We equip Rm
with the variation norm
‖α‖v := Σ_{k=1}^m |αk|.
Observe that if µ, ν ∈ Probm, then
dv(µ, ν) = (1/2)‖µ − ν‖v.
Note that a column vector
z = (z1, . . . , zm)⊤ ∈ Cm
is a (left) eigenvector of Q⊤ corresponding to an eigenvalue λ ∈ C if and only if the
row vector z⊤ is a (right) eigenvector of Q, since
z⊤ · Q = λz⊤.

The matrix Q and its transpose Q> have the same eigenvalues.4 The vector e is
a (left) eigenvector of Q corresponding to the eigenvalue 1. We deduce that there
exists a row vector α ∈ Rm such that
α · Q = α.
If α had nonnegative entries, then it would be an invariant measure for the HMC
defined by Q. The classical Perron-Frobenius theory explains when this is the case
and much more.
Observe that the HMC defined by Q is irreducible if and only if
∀i, j ∈ X, ∃n > 0 such that Q^n_{i,j} > 0.
Additionally, it is aperiodic if and only if Q is primitive, i.e., there exists n0 ∈ N
such that
Q^n_{i,j} > 0, ∀n ≥ n0, ∀1 ≤ i, j ≤ m.
For a proof of the following result we refer to [65, Chap. XIII] or [138, Chap. 8].

Theorem 4.101 (Perron-Frobenius). Suppose that Q is a stochastic m × m matrix.
Then the following hold.
(i) All the eigenvalues of Q⊤ are contained in the unit disk.
(ii) If Q is irreducible, then there exists p ∈ N such that
λ ∈ Spec(Q) and |λ| = 1 ⇐⇒ λ^p = 1.
Moreover, every eigenvalue on the unit circle is simple.
(iii) The matrix Q is primitive if and only if p = 1. In this case
ρ := max{ |λ|; λ ∈ Spec(Q), λ ≠ 1 } < 1. ⊓⊔

Suppose that Q is primitive and denote by π the unique invariant probability
distribution of Q, i.e., the unique row vector
π = (π1, . . . , πm)
such that
π · Q = π, πk > 0, ∀k, π1 + · · · + πm = 1.
Denote by ∆(λ) the characteristic polynomial of Q, ∆(λ) = det(λ1 − Q). Set
B(λ) = ∆(λ)/(λ − 1).
Since 1 is a simple eigenvalue of Q the polynomials λ − 1 and B(λ) have no common
divisor and thus we have a decomposition of the space Rm (see [99, Thm. XI.4.1])
as a direct sum of (right) Q-invariant subspaces
Rm = kerr(1 − Q) ⊕ kerr B(Q),

4 det(λ1 − Q) = det((λ1 − Q)⊤).



where
kerr(1 − Q) = { α ∈ Rm ; α · (1 − Q) = 0 } = span(π),
kerr B(Q) := { α ∈ Rm ; α · B(Q) = 0 }.
Thus any α ∈ Rm admits a unique decomposition
α = α0 + α⊥, α0 ∈ kerr(1 − Q), α⊥ ∈ kerr B(Q).
More explicitly, choose polynomials u(λ), v(λ) such that
u(λ)(λ − 1) + v(λ)B(λ) = 1.
Then
α⊥ = α · u(Q)(Q − 1) ∈ kerr B(Q), α0 = α · v(Q)B(Q).
Note that
α⊥ · e = α · u(Q)(Q − 1) · e = 0.
If µ ∈ Rm is a probability measure, then it has a canonical decomposition
µ = cπ + µ⊥, µ⊥ ∈ kerr B(Q).
Since µ · e = 1 and µ⊥ · e = 0 we deduce c = 1, so µ = π + µ⊥ and thus
µ · Q^n = π + µ⊥ · Q^n,
i.e.,
µ · Q^n − π = µ⊥ · Q^n.
Since kerr B(Q) is Q-invariant, we deduce from Theorem 4.101 that there exist
C > 0, r ∈ (0, 1) such that
‖α · Q^n‖v ≤ Cr^n ‖α‖v, ∀α ∈ kerr B(Q), n ∈ N.
Hence
‖µ · Q^n − π‖v = ‖µ⊥ · Q^n‖v ≤ Cr^n ‖µ⊥‖v ≤ 2Cr^n, ∀µ ∈ Probm.
In particular, if we choose µ to be the Dirac measure concentrated at k ∈ Im, then
δk · Q^n is the k-th row of the matrix Q^n and we deduce
Σ_{`=1}^m |Q^n_{k,`} − π`| ≤ 2Cr^n, ∀k ∈ Im, n ∈ N.
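For a 2 × 2 chain everything is explicit and the geometric decay can be checked directly; the transition rates a, b below are hypothetical test values.

```python
def mat_vec(mu, Q):
    # row vector times matrix: (mu Q)_j = sum_i mu_i Q[i][j]
    return [sum(mu[i] * Q[i][j] for i in range(len(mu))) for j in range(len(Q[0]))]

a, b = 0.3, 0.5                      # hypothetical transition rates
Q = [[1 - a, a], [b, 1 - b]]         # a primitive stochastic matrix
pi = [b / (a + b), a / (a + b)]      # its invariant distribution
rho = 1 - a - b                      # the SLE; here mu.Q^n - pi = rho^n (mu - pi)

mu = [1.0, 0.0]
for n in range(1, 30):
    mu = mat_vec(mu, Q)
    err = abs(mu[0] - pi[0]) + abs(mu[1] - pi[1])
    assert err <= 2 * abs(rho) ** n + 1e-12   # geometric convergence in variation norm
```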

Theorem 4.101 allows us to sharpen the above estimate. If
∆(λ) = det(λ1 − Q) = λ^m + Σ_{j=0}^{m−1} aj λ^j

denotes the characteristic polynomial of Q, then the Cayley-Hamilton theorem implies
that the sequence of matrices (Q^n)_{n∈N0} satisfies the linear recurrence relation
Q^{n+m} + Σ_{j=0}^{m−1} aj Q^{n+j} = 0, ∀n ∈ N0.

Let 1, λ2, . . . , λs be the eigenvalues of Q,
1 > ρ = |λ2| ≥ · · · ≥ |λs|.
The eigenvalue λ2 is usually referred to as the second largest eigenvalue (or SLE)
of the transition matrix.
Denote by mi the size of the largest Jordan cell corresponding to the eigenvalue
λi. We assume that m2 is the largest Jordan cell size among the eigenvalues of norm ρ. The
above recurrence relation shows that, for any 1 ≤ i, j ≤ m, the sequence (Q^n_{i,j})_{n≥0}
admits a description of the form
Q^n_{i,j} = cij + Σ_{k=2}^s C^k_{i,j}(n)λ_k^n,
where C^k_{i,j}(z) is a complex polynomial of degree ≤ mk − 1. We deduce that
Q^n_{i,j} − cij = O( n^{m2−1} ρ^n ).

We conclude that cij = πj and thus
Q^n_{i,j} − πj = O( n^{m2−1} ρ^n ). (4.5.1)
If the Markov chain is reversible, i.e.,
πi Qij = πj Qji, ∀i, j ∈ Im,
then the operator Q : Cm → Cm is symmetric with respect to the L2(π)-inner
product ⟨−, −⟩π on Cm = RX,
⟨x, y⟩π = Σ_{i=1}^m xi yi πi, ∀x, y ∈ Cm.

Indeed,
⟨Qx, y⟩π = Σi Σj Qij xj yi πi = Σj Σi πj Qji xj yi = Σj ( Σi Qji yi ) πj xj = ⟨x, Qy⟩π.

In this case all the eigenvalues are real, the operator Q is diagonalizable, and
(4.5.1) improves to
Q^n_{i,j} − πj = O( ρ^n ). (4.5.2)
In general finding or estimating the SLE can be a daunting task. If some symmetry is present this is sometimes manageable.

Example 4.102 (Random walks on groups). Suppose that G is a finite group
and H ⊂ G is a set of generators. The set H determines a random walk on G:
from g one can transition to h · g, h ∈ H, with probability 1/|H|.
A frequently encountered case is when H is symmetric, i.e.,
x ∈ H ⇐⇒ x−1 ∈ H.
The directed graph corresponding to this random walk is symmetric, i.e., there is a
directed edge from g to g 0 if and only if there is a directed edge from g 0 to g. The
resulting undirected graph is called the Cayley graph determined by the symmetric
set of generators. The random walk on the groups is then the standard walk on the
Cayley graph. The group structure behind the Cayley graph adds a lot of symmetry
that we can use to our advantage. For a detailed presentation of this technique and
many interesting applications we refer to the beautiful monograph [41].
We want to illustrate this principle in a simpler situation. Suppose that G is the
discrete torus
G := (Z/nZ)^d.
We will denote by x = (x1, . . . , xd) the elements of G, xk ∈ Z/nZ. As generators
we choose
±ek mod nZ, k = 1, . . . , d,
where
e1 = (1, 0, . . . , 0), . . . , ed = (0, . . . , 0, 1).
For d = 2 this random walk can be visualised as a random walk on the vertices of
the square grid Sn = [0, n]² ∩ Z² where the opposite edges are identified. Thus from
(0, y) we can transition to (0, y ± 1 mod n) or (±1 mod n, y) with equal probabilities.
Note that when n is odd, the random walk is irreducible and aperiodic.
When n = 2 this becomes a random walk on the set of vertices of the hypercube
[0, 1]d or, equivalently, on the set of subsets of {1, . . . , d}.
The invariant probability measure π is, up to a multiplicative constant, the
uniform counting measure. We write
L²(G) = L²(G, π), ‖f‖ = ‖f‖L² = ( (1/|G|) Σ_{x∈G} |f(x)|² )^{1/2}.

Here we work with complex valued functions so the inner product is
⟨f, g⟩ = (1/|G|) Σ_{x∈G} f(x) ḡ(x).

If Q denotes the transition matrix of this Markov chain, then for any f ∈ L²(G) we
have
Qf(x) = Σ_{x′∈G} Q_{x,x′} f(x′) = (1/d) Σ_{k=1}^d ( f(x + ek) + f(x − ek) )/2, (4.5.3a)

∆f(x) = f(x) − Qf(x) = −(1/d) Σ_{k=1}^d ( f(x + ek) − 2f(x) + f(x − ek) )/2. (4.5.3b)

One can verify that the induced operator Q : L2 (G) → L2 (G) is symmetric since Q
is reversible but we will not rely on this fact in this example.
To compute the eigenvalues of Q : L2 (G) → L2 (G) we use Fourier analysis. This
requires a little bit of representation theory and we will refer to [149] for the proofs
of all the claims below.
A character of G is a group morphism
χ : G → S¹ := { z ∈ C; |z| = 1 }.
The set Ĝ of characters is a group itself with respect to the pointwise multiplication
of characters. It is called the dual group.
Denote by Rn the group of n-th roots of unity
Rn := { z ∈ C∗ ; z^n = 1 }.
Observe that for any character χ, the complex numbers χ(ek) are n-th roots of 1.
In fact, the map
ρ : Ĝ → R^d_n, Ĝ ∋ χ ↦ (ρ1, . . . , ρd) = ( χ(e1), . . . , χ(ed) ) ∈ R^d_n,
is a group isomorphism. The collection of functions
{ χ : G → C, χ ∈ Ĝ }
is an orthonormal basis of L²(G) and thus, for any f ∈ L² we have an orthogonal
decomposition
f = Σ_{χ∈Ĝ} ⟨f, χ⟩χ. (4.5.4)

The function
Ĝ ∋ χ ↦ f̂(χ) := ⟨f, χ⟩ ∈ C
is called the Fourier transform of f. More explicitly,
f̂(χ) = (1/|G|) Σ_{x∈G} f(x)χ̄(x).

The equality (4.5.4) can be rewritten as
f(x) = Σ_{χ∈Ĝ} f̂(χ)χ(x), ∀x ∈ G, (4.5.5)
and, as such, it is known as the Fourier inversion formula.
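For G = Z/nZ the characters are x ↦ e^{2πimx/n}, and the orthonormality underlying (4.5.4)-(4.5.5) is a one-liner to confirm; the value n = 6 is an arbitrary choice of this sketch.

```python
import cmath, math

n = 6   # an arbitrary cyclic group Z/nZ
chars = [[cmath.exp(2 * math.pi * 1j * m * x / n) for x in range(n)]
         for m in range(n)]

def inner(f, g):
    # the L2(G) inner product with the uniform probability measure
    return sum(u * v.conjugate() for u, v in zip(f, g)) / n

for p in range(n):
    for q in range(n):
        expected = 1.0 if p == q else 0.0
        assert abs(inner(chars[p], chars[q]) - expected) < 1e-12
```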


If we identify χ ∈ Ĝ with ρ(χ) = (ρ1, . . . , ρd) ∈ R^d_n, then we can view the Fourier
transform f̂ as a function on R^d_n,
f̂(ρ1, . . . , ρd) = (1/|G|) Σ_{x∈G} f(x)ρ1^{−x1} · · · ρd^{−xd},

and the Fourier inversion formula reads
f(x) = Σ_{ρk^n = 1, k=1,...,d} f̂(ρ1, . . . , ρd)ρ1^{x1} · · · ρd^{xd}.

Using (4.5.3a) and (4.5.5) we deduce
Qf(x) = Σ_χ f̂(χ) m(χ) χ(x), m(χ) := (1/2d) Σ_{k=1}^d ( χ(ek) + χ(−ek) ).

Thus
Qf = Q( Σ_χ f̂(χ)χ ) = Σ_χ m(χ)f̂(χ)χ.

In other words, the orthonormal basis { χ; χ ∈ Ĝ } diagonalizes Q and
Spec Q = { m(χ), χ ∈ Ĝ }.
If we write
χ(ek) = ρk = cos θk + i sin θk ∈ Rn,
then χ(ek) + χ(−ek) = ρk + ρ̄k = 2 cos θk and
m(χ) = (1/d) Σ_{k=1}^d cos θk, θk ∈ { 0, 2π/n, . . . , 2π(n − 1)/n }.

Thus Spec Q ⊂ [−1, 1] and 1 ∈ Spec Q. The SLE is
λ2 = λ2(d, n) = ( d − 1 + cos(2π/n) )/d = 1 − ( 2 sin²(π/n) )/d.
Note that
λ2(d, n) ∼ 1 − 2π²/(dn²) as n → ∞. (4.5.6)
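The eigenvalue formula m(χ) can be confirmed for d = 1 directly from (4.5.3a); the values n = 7, m = 2 below are hypothetical test choices.

```python
import cmath, math

n, m = 7, 2   # group Z/7Z, character chi(x) = exp(2*pi*i*m*x/n)
chi = [cmath.exp(2 * math.pi * 1j * m * x / n) for x in range(n)]

# One step of the walk (d = 1 in (4.5.3a)) averages the two neighbors.
Qchi = [(chi[(x + 1) % n] + chi[(x - 1) % n]) / 2 for x in range(n)]

lam = math.cos(2 * math.pi * m / n)   # the predicted eigenvalue m(chi)
assert all(abs(Qchi[x] - lam * chi[x]) < 1e-12 for x in range(n))
```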
If n = 2 all the characters/eigenfunctions are real valued. More precisely, for every
~ε = (ε1, . . . , εd) ∈ {−1, 1}^d
we have an eigenfunction χ~ε given by
χ~ε(x) = Π_{k=1}^d εk^{xk}, ∀x = (x1, . . . , xd) ∈ {0, 1}^d. (4.5.7)
The corresponding eigenvalue is
λ~ε = (1/d)( ε1 + · · · + εd ), εk = ±1.
Hence
Spec(Q) = { −1 + 2k/d , k = 0, 1, . . . , d }.

In this case the SLE is
λ2(d, 2) = 1 − 2/d. (4.5.8)
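The eigenfunctions (4.5.7) and the spectrum {−1 + 2k/d} can be verified exhaustively for a small dimension; d = 4 is an arbitrary choice of this sketch.

```python
from itertools import product

d = 4
V = list(product([0, 1], repeat=d))

def walk_step(f):
    # One step of the walk on {0,1}^d: flip a uniformly chosen coordinate.
    return {x: sum(f[x[:k] + (1 - x[k],) + x[k + 1:]] for k in range(d)) / d
            for x in V}

for eps in product([-1, 1], repeat=d):
    chi = {x: 1.0 for x in V}
    for x in V:
        for e, xk in zip(eps, x):
            chi[x] *= e ** xk                 # the eigenfunction (4.5.7)
    lam = sum(eps) / d                        # its eigenvalue (eps_1+...+eps_d)/d
    Qchi = walk_step(chi)
    assert all(abs(Qchi[x] - lam * chi[x]) < 1e-12 for x in V)
```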
A probability measure µ on G can be identified with a continuous linear functional
on L²(G, π) and, as such, can be identified with a function µ∗ ∈ L²(G, π),
µ∗ = Σ_χ µ[χ] χ, µ[χ] = Σ_{x∈G} µ(x)χ(x) = ∫_G χ dµ.
Then
µ · Q^n = Σ_χ m(χ)^n µ[χ] χ. ⊓⊔

Example 4.103 (The Ehrenfest urn revisited). The random walk (Xn)n≥0
on
Vd := {0, 1}^d,
the set of vertices of the hypercube [0, 1]^d, is intimately related to the Ehrenfest urn; see
Example 4.7.
To see this, consider the states
sk := { x = (x1, . . . , xd) ∈ Vd ; |x| := x1 + · · · + xd = k }, k = 0, 1, . . . , d.

If we think of the vertices x ∈ Vd as vectors of bits 0/1, then the random walk has
a simple description: if located at x ∈ Vd, pick a random component of x and flip
it to the opposite bit. Note that
P[ Xn+1 ∈ sk+1 | Xn ∈ sk ] = (d − k)/d, P[ Xn+1 ∈ sk−1 | Xn ∈ sk ] = k/d.
We recognize here the transition rules for the Ehrenfest urn model with d particles/balls.
Thus, if on our walk along the vertices of the hypercube we only keep
track of the state we are in, we obtain the Markov chain defined by Ehrenfest's urn
model.
For concrete computations it is convenient to have an alternate description of
this phenomenon. Denote by Sd the group of permutations of {1, . . . , d}. There is
an obvious left action of Sd on Vd,
ϕ · (x1, . . . , xd) = ( xϕ(1), . . . , xϕ(d) ), ∀ϕ ∈ Sd, (x1, . . . , xd) ∈ {0, 1}^d.

On the other hand, Vd is equipped with a metric, the so called Hamming distance,
δ(x, y) = Σ_{i=1}^d |xi − yi|, x, y ∈ Vd.
Two vertices x, y ∈ Vd are neighbors (connected by an edge of the cube) iff
δ(x, y) = 1. Since the above action of Sd preserves the Hamming distance we
deduce that Sd is a group of graph isomorphisms, i.e.,
∀x, y ∈ Vd , ϕ ∈ Sd : x ∼ y ⇐⇒ ϕ · x ∼ ϕ · y.

Observe also that the states sk, k = 0, 1, . . . , d, are the orbits of the above action
of Sd. Thus, the state space of the Ehrenfest urn model can be identified with
V̄d := Sd\Vd, the space of orbits of the above left action. Denote by π the invariant
probability measure of the random walk on Vd and by π̄ the invariant measure of the
Ehrenfest urn model,
π̄[k] = (1/2^d)( d choose k ).
If Proj : Vd → Sd\Vd is the natural projection, then
Proj# π = π̄.
The left action of Sd on Vd induces a right action on the space L²(Vd, π),
(f · ϕ)(x) = f( ϕ · x ), ∀f : Vd → R, x ∈ Vd, ϕ ∈ Sd.
We denote by L²(Vd, π)^{Sd} the subspace consisting of invariant functions, i.e., functions
constant along the orbits of Sd. The pullback
Proj∗ : L²( Sd\Vd, π̄ ) → L²( Vd, π ), f ↦ f ◦ Proj,
is an isometry onto L²(Vd, π)^{Sd}.


Let us observe that the induced linear operator

Q : L²( Vd, π ) → L²( Vd, π )

is Sd-equivariant, i.e., for any f ∈ L²( Vd, π ), ϕ ∈ Sd,

Q( f · ϕ ) = (Qf) · ϕ. (4.5.9)

In particular, this shows that

Q( L²(Vd, π)^Sd ) ⊂ L²(Vd, π)^Sd.

If Q̄ denotes the transition matrix of the Ehrenfest model, then the diagram

L²(Vd, π)^Sd --Q--> L²(Vd, π)^Sd
     ↑                    ↑
   Proj*                Proj*
     |                    |
L²(V̄d, π̄)  --Q̄-->  L²(V̄d, π̄)

is commutative, i.e., Q ∘ Proj* = Proj* ∘ Q̄.

If λ ∈ Spec Q and χ ∈ ker( λ − Q ) is an eigenfunction of Q, then (4.5.9) implies that χ · ϕ ∈ ker( λ − Q ), ∀ϕ ∈ Sd.
For every ε = (ε1, . . . , εd) ∈ {−1, 1}^d we set

w(ε) = #{ j ; εj = −1 }.

Note that

Σ_j εj = d − 2w(ε),  λ(ε) = 1 − 2w(ε)/d.

If λj = 1 − 2j/d, then

ker( λj − Q ) = span{ χ_ε ; w(ε) = j }.

The orthogonal projection Π onto L²(Vd, π)^Sd is the symmetrization operator

L²(Vd) ∋ f ↦ Πf = (1/d!) Σ_{ϕ∈Sd} f · ϕ ∈ L²(Vd, π)^Sd.

The above description shows that

Π( ker(λ − Q) ) ⊂ ker( λ − Q ), ∀λ ∈ Spec Q,

so that

Spec Q̄ ⊂ Spec Q.

Since

χ_{ϕ·ε}(x) = χ_ε( ϕ⁻¹ · x ), ∀ϕ ∈ Sd, x ∈ {0, 1}^d,

we deduce

Πχ_ε = Πχ_{ϕ·ε}, ∀ϕ ∈ Sd.

Thus Πχ_ε depends only on w(ε). We set

Ψj := Πχ_ε, w(ε) = j.

Note that

Ψj = (1/binom(d,j)) Σ_{w(ε)=j} χ_ε. (4.5.10)

Since the eigenfunctions χ_ε with fixed weight w(ε) = j span the eigenspace of Q corresponding to the eigenvalue λj, we deduce that

ker( λj − Q̄ ) = span{ Ψj },

so dim ker( λ − Q̄ ) ≤ 1, ∀λ ∈ Spec Q̄ ⊂ Spec Q. Since Q̄ is diagonalizable, this forces

#Spec Q̄ = #V̄d = d + 1 = #Spec Q,

and thus

Spec Q̄ = Spec Q and dim ker( λ − Q̄ ) = 1, ∀λ ∈ Spec Q̄.

Define K : Vd × C → C,

K(x, z) := Π_{i=1}^d ( 1 + (−1)^{xi} z ) = (1 − z)^{|x|} (1 + z)^{d−|x|}, (4.5.11)

where

|x| = Σ_i xi = #{ i ; xi = 1 }.

Observe that

K(x, z) = Σ_{j=0}^d ( Σ_{w(ε)=j} χ_ε(x) ) z^j = Σ_{j=0}^d binom(d,j) Ψj(x) z^j,

by (4.5.10). Thus

(1 − z)^{|x|} (1 + z)^{d−|x|} = Σ_j binom(d,j) Ψj(x) z^j. (4.5.12)

Integrating the equality K(x, z)² = (1 − z)^{2|x|} (1 + z)^{2(d−|x|)} over Vd with respect to the uniform probability measure π we deduce

∫_{Vd} K(x, z)² π[dx] = (1/2^d) Σ_{k=0}^d binom(d,k) (1 − z)^{2k} (1 + z)^{2(d−k)}
= (1/2^d) ( (1 − z)² + (1 + z)² )^d = (1 + z²)^d.

Since the Ψj lie in distinct eigenspaces of Q, they are mutually orthogonal, and comparing the coefficients of z^{2j} on both sides shows that

‖Ψj‖²_{L²(π)} = 1/binom(d,j).

Identify L²( V̄d, π̄ ) with the space R^{d+1},

L²( V̄d, π̄ ) ∋ f ↦ ( f(0), . . . , f(d) )^T ∈ R^{1+d},

with the inner product

⟨u, v⟩_π̄ := (1/2^d) ( Bu, v ),

where (−, −) denotes the canonical inner product on R^{d+1},

( u, v ) = Σ_{i=0}^d ui vi,

and B is the diagonal matrix

B = Diag( binom(d,0), . . . , binom(d,d) ).

We denote by ckj the coefficient of z^j in (1 − z)^k (1 + z)^{d−k}. If we think of the invariant eigenfunction Ψj as a function on V̄d, Ψj(k) := Ψj(x), |x| = k, then (4.5.12) yields

binom(d,j) Ψj(k) = ckj, i.e., binom(d,j) Ψj = ( c0j, . . . , cdj )^T =: Cj.

Denote by C the (d + 1) × (d + 1) matrix with columns Cj and by Λ the diagonal matrix

Λ = Diag( λ0, λ1, . . . , λd ).

If we regard the columns Cj as functions in L²( V̄d, π̄ ), then each is a multiple of an eigenfunction Ψj of Q̄, so that

Q̄ Cj = λj Cj, ∀j = 0, 1, . . . , d.

Hence Q̄C = CΛ, so that C diagonalizes Q̄,

C⁻¹ Q̄ C = Λ, i.e., Q̄ = C Λ C⁻¹.

Remarkably, the inverse of C can be described explicitly. From the equalities

binom(d,j) Ψj = Cj,  ‖Ψj‖²_{L²(π)} = 1/binom(d,j)

we deduce

⟨Ci, Cj⟩_π̄ = δij binom(d,i), ∀i, j = 0, 1, . . . , d.

In other words,

(1/2^d) ( BCx, Cy ) = ( Bx, y ), ∀x, y ∈ R^{d+1}.

Hence

C^T B C = 2^d B. (4.5.13)
The matrix C has another miraculous symmetry. To prove it we need to go back to the definition of the entries ckj,

(1 − z)^k (1 + z)^{d−k} = Σ_j ckj z^j.

Consider the function

F(u, z) = Σ_k binom(d,k) u^k (1 − z)^k (1 + z)^{d−k} = ( (1 + z) + u(1 − z) )^d = ( (1 + u) + (1 − u)z )^d.

On one hand, we have

F(u, z) = Σ_k binom(d,k) u^k Σ_j ckj z^j = Σ_{k,j} binom(d,k) ckj z^j u^k.

On the other hand, the binomial formula yields

F(u, z) = ( (1 + u) + (1 − u)z )^d = Σ_j binom(d,j) z^j (1 − u)^j (1 + u)^{d−j}

= Σ_j binom(d,j) z^j Σ_k cjk u^k = Σ_{k,j} binom(d,j) cjk z^j u^k.

Hence

binom(d,k) ckj = binom(d,j) cjk, ∀j, k.

This can be written in the more compact form

(BC)kj = (BC)jk ⇐⇒ BC = (BC)^T = C^T B.

Using this in (4.5.13) we deduce BC² = 2^d B, so that

C⁻¹ = (1/2^d) C.

Hence

Q̄^n = C Λ^n C⁻¹ = (1/2^d) C Λ^n C, ∀n ≥ 0. (4.5.14)
2d
The above formula was first obtained by M. Kac [84]. Since then, many different proofs have been offered [87; 88; 147]. For more about the rich history and the ubiquity of the Ehrenfest urn we refer to [13; 147]. As a curiosity, we want to mention that the spectrum of Q̄ was known to J. J. Sylvester in the 19th century.

One can use (4.5.14) to obtain important information about the dynamics of the Ehrenfest urn, such as the return or first-passage times Ti, i = 0, 1, . . . , d. We refer to [13; 84; 87; 88] for more details.

The above "miraculous" properties of the matrix C are manifestations of the remarkable symmetries of the Krawtchouk polynomials. We refer to [43; 44] for more about these polynomials and their applications in probability. □
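Kac's formula (4.5.14) is easy to test numerically. The sketch below (Python/NumPy, ours for illustration; here Q denotes the Ehrenfest matrix called Q̄ in the text) builds C from the coefficients ckj of (1 − z)^k (1 + z)^{d−k}, and checks C² = 2^d·1, the symmetry (4.5.13), and the diagonalization for a small d:

```python
from math import comb
import numpy as np

def krawtchouk_C(d):
    """C[k, j] = c_kj = coefficient of z^j in (1 - z)^k (1 + z)^(d - k)."""
    C = np.zeros((d + 1, d + 1))
    for k in range(d + 1):
        p = np.array([1.0])                    # coefficients, increasing degree
        for _ in range(k):
            p = np.convolve(p, [1.0, -1.0])    # multiply by (1 - z)
        for _ in range(d - k):
            p = np.convolve(p, [1.0, 1.0])     # multiply by (1 + z)
        C[k, :] = p
    return C

def ehrenfest_Q(d):
    """Ehrenfest transition matrix on the states 0, 1, ..., d."""
    Q = np.zeros((d + 1, d + 1))
    for k in range(d + 1):
        if k < d:
            Q[k, k + 1] = (d - k) / d
        if k > 0:
            Q[k, k - 1] = k / d
    return Q

d = 5
C = krawtchouk_C(d)
Q = ehrenfest_Q(d)
Lam = np.diag([1 - 2 * j / d for j in range(d + 1)])
B = np.diag([comb(d, k) for k in range(d + 1)])
print(np.allclose(C @ C, 2**d * np.eye(d + 1)))   # C^{-1} = C / 2^d
print(np.allclose(C.T @ B @ C, 2**d * B))         # (4.5.13)
print(np.allclose(Q, C @ Lam @ C / 2**d))         # Kac's formula (4.5.14), n = 1
```
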

4.5.2 Variational methods


Consider a reversible, irreducible Markov chain with finite state space X and transition matrix Q. Set N := |X|. Denote by π the invariant probability distribution. We have seen that Q is symmetric as a linear operator

L²(X, π) → L²(X, π).

We denote by ⟨−, −⟩_π the inner product in L²(X, π) and by ‖ − ‖_π the associated norm. We identify L²(X, π) with R^N equipped with the inner product

⟨u, v⟩_π = Σ_{i=1}^N ui vi πi.

The eigenvalues have variational characterizations. We order the eigenvalues of Q decreasingly,

1 = λ1 > λ2 ≥ λ3 ≥ · · · ≥ λN ≥ −1.

Above, each eigenvalue of Q appears as often as its multiplicity. The eigenspace corresponding to the eigenvalue 1 is spanned by the constant function e = 1 or, equivalently, the column vector e ∈ R^N with all coordinates equal to 1.

As we have seen, the second largest eigenvalue (or SLE) λ2 controls the rate of convergence of the Markov chain. It has the variational description

λ2 = sup{ ⟨Qu, u⟩_π / ‖u‖²_π ; u ∈ R^N \ {0}, ⟨u, e⟩_π = 0 }.

We will use this variational characterization to provide upper estimates for λ2.

It is more convenient to work with the Laplacian ∆ := 1 − Q. Note that ker ∆ = span{e}. Its eigenvalues are µk = 1 − λk,

0 = µ1 < µ2 ≤ µ3 ≤ · · · ≤ µN ≤ 2.
Note that lower estimates for µ2 are equivalent to upper estimates for λ2. The first positive eigenvalue µ2 has a variational characterization in terms of the Dirichlet form

E(−, −) : L²(X, π) × L²(X, π) → R, E(u, v) = ⟨∆u, v⟩_π.

Lemma 4.104.

E(u, v) = (1/2) Σ_{x,y∈X} πx Qx,y ( u(x) − u(y) )( v(x) − v(y) ).

Proof. We have

Σ_{x,y∈X} πx Qx,y ( u(x) − u(y) )( v(x) − v(y) )
= Σ_{x,y∈X} Qx,y ( u(x) − u(y) ) v(x) πx − Σ_{x,y∈X} πx Qx,y ( u(x) − u(y) ) v(y) =: A − B.

Note that

A = Σ_{x∈X} Σ_{y∈X} Qx,y ( u(x) − u(y) ) v(x) πx = Σ_{x∈X} ( u(x) − (Qu)(x) ) v(x) πx = ⟨∆u, v⟩_π.

Using the detailed balance equations πx Qx,y = πy Qy,x we deduce

B = Σ_{y∈X} ( Σ_{x∈X} Qy,x ( u(x) − u(y) ) ) v(y) πy = Σ_{y∈X} ( (Qu)(y) − u(y) ) v(y) πy = −⟨∆u, v⟩_π.

Hence the sum equals A − B = 2⟨∆u, v⟩_π, which proves the claimed formula. □

Let us observe that the reversible Markov chain is defined by an electric network with conductances c(x, y), where

c(x, y) := πx Qx,y.

Then, ∀u, v ∈ L²(X, π),

E( u, v ) = (1/2) Σ_{x,y∈X} c(x, y)( u(x) − u(y) )( v(x) − v(y) ) = ⟨du, dv⟩_c,

where ⟨−, −⟩_c is the inner product (4.4.9) on 1-cochains and d is the coboundary operator (4.4.8).

The classical Rayleigh-Ritz description of the eigenvalues of a symmetric operator shows that

µ2 = inf{ E(u, u) ; ‖u‖_π = 1, ⟨u, e⟩_π = 0 }.

Note that for any λ in R we have

E( u + λ, u + λ ) = E(u, u).

If we think of u ∈ L²(X, π) as a random variable defined on the probability space (X, π), then the above characterization of µ2 can be rewritten as

µ2 = inf{ E(u, u)/Var[u] ; Eπ[u] = 0, u ≠ 0 }.
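For a small chain both sides of this characterization can be computed directly. In the sketch below (Python/NumPy, our illustration; the chain and the test function u are made up) the eigenvalues of a reversible Q are obtained from the symmetric conjugate D^{1/2} Q D^{−1/2}, D = Diag(π), and the ratio E(u, u)/Var[u] for a centered u is compared with µ2:

```python
import numpy as np

def spectrum_reversible(Q, pi):
    """Eigenvalues of a reversible Q via the symmetric matrix D^{1/2} Q D^{-1/2}."""
    s = np.sqrt(pi)
    return np.sort(np.linalg.eigvalsh((s[:, None] * Q) / s[None, :]))[::-1]

def dirichlet(Q, pi, u):
    """E(u, u) = (1/2) sum_{x,y} pi_x Q_{x,y} (u(x) - u(y))^2."""
    diff = u[:, None] - u[None, :]
    return 0.5 * np.sum(pi[:, None] * Q * diff**2)

# A small reversible birth-death chain (toy example).
Q = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
pi = np.array([0.25, 0.5, 0.25])       # detailed balance: pi_x Q_xy = pi_y Q_yx
lam = spectrum_reversible(Q, pi)
mu2 = 1 - lam[1]
u = np.array([1.0, 0.0, -1.0])
u = u - np.dot(pi, u)                  # center, so that E_pi[u] = 0
ratio = dirichlet(Q, pi, u) / np.dot(pi, u**2)
print(mu2 <= ratio + 1e-12)            # E(u,u)/Var[u] >= mu_2 for every such u
```

Here u happens to be an eigenfunction of ∆ for µ2 = 1/2, so the infimum is attained and the ratio equals µ2 exactly.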

Lower bounds for µ2 are classically known as Poincaré inequalities. Thus, a lower bound µ2 ≥ m > 0 is equivalent to a statement of the form

(1/2) Σ_{x,y} πx Qx,y ( u(x) − u(y) )² ≥ m Σ_{x∈X} πx u(x)²  whenever Σ_{x∈X} u(x) πx = 0,

or, equivalently,

(1/2) Σ_{x,y} c(x, y)( u(x) − u(y) )² ≥ m Σ_{x∈X} c(x) u(x)²  whenever Σ_{x∈X} u(x) c(x) = 0.

To state our first Poincaré-type inequality we need a few geometric preliminaries. To our reversible Markov chain we associate a graph G with vertex set X. Two vertices x, y are connected by an edge iff Qx,y ≠ 0. We write x ∼ y if x and y are connected by an edge in G. This graph could have loops. It is connected since the Markov chain is irreducible. We set

Ê := { (x, y) ∈ X × X ; x ∼ y }. (4.5.15)

We think of the elements of Ê as edges of G equipped with an orientation. For any u : X → R and e = (x′, x″) ∈ Ê we set

δe u := u(x″) − u(x′).

We can speak of the conductance c(e) of any oriented edge e = (x, y),

c(e) := c(x, y) = πx Qx,y.

Note that

E(u, u) = (1/2) Σ_{e∈Ê} c(e)(δe u)². (4.5.16)

A path in G between two vertices x, y is a succession of vertices

γ : x = x0 ∼ x1 ∼ · · · ∼ x_{ℓ−1} ∼ x_ℓ = y,

where we do not allow repeated edges. The number ℓ is called the length of γ and it is denoted by ℓ(γ). The path γ determines a collection of oriented edges

ei = (x_{i−1}, xi), i = 1, . . . , ℓ.

We will use the notation e ∈ γ to indicate that e is one of the oriented edges determined by γ.

We denote by Γ the collection of paths in G. It comes with an obvious equivalence relation: two paths are equivalent if they have the same initial and final points. Fix a collection C of representatives of this equivalence relation. Thus, C contains exactly one path γx,y for every pair (x, y) of vertices, and this path connects x to y. Following [46] we set

K(C) := sup_{e∈Ê} K(C, e),  K(C, e) := (1/c(e)) Σ_{C∋γx,y∋e} ℓ(γx,y) πx πy. (4.5.17)

If an oriented edge e is not contained in any path γ ∈ C we set K(C, e) = 0.

Theorem 4.105 (Diaconis-Stroock). For any u ∈ L²(X, π) we have

Var[u] ≤ K(C) E(u, u). (4.5.18)

Thus µ2 ≥ 1/K(C), so that

λ2(Q) ≤ 1 − 1/K(C).

Proof. We follow the approach in the proof of [46, Proposition 1]. Set K = K(C). Let u ∈ L²(X, π). For any x, y ∈ X we have the telescoping equality

u(y) − u(x) = Σ_{e∈γx,y} δe u.

Using the Cauchy-Schwarz inequality we deduce

( u(y) − u(x) )² = ( Σ_{e∈γx,y} δe u )² ≤ ℓ(γx,y) Σ_{e∈γx,y} (δe u)².

Now observe that

Var[u] = (1/2) Σ_{x,y} ( u(y) − u(x) )² πx πy ≤ (1/2) Σ_{x,y} ℓ(γx,y) πx πy Σ_{e∈γx,y} (δe u)²

= (1/2) Σ_{e∈Ê} (δe u)² Σ_{γx,y∋e} ℓ(γx,y) πx πy = (1/2) Σ_{e∈Ê} c(e)(δe u)² · (1/c(e)) Σ_{γx,y∋e} ℓ(γx,y) πx πy

≤ (K/2) Σ_{e∈Ê} c(e)(δe u)² = K E(u, u),

where in the last step we used (4.5.16). This proves (4.5.18). □

Example 4.106. Suppose that our Markov chain corresponds to the random walk on the Cayley graph of the cyclic group Z/nZ, n odd; see Example 4.102. Equivalently, it is the random walk on the set

X = { xi }_{i∈Z/nZ}

of vertices of a regular n-gon, where at each vertex we are equally likely to move to one of its two neighbors. In this case we have

πx = 1/n,  Q_{xi,xi+1} = Q_{xi,xi−1} = 1/2, ∀i ∈ Z/nZ,

c(xi, xj) = 1/(2n) if j = i ± 1, and c(xi, xj) = 0 otherwise.

As collection C we choose geodesics (shortest paths) connecting the pairs of vertices. Since n is odd, for every x, y ∈ X there exists a unique such geodesic γx,y, and it has length < n/2. Due to the symmetry of the graph, the quantity

K(C, e) = (1/c(e)) Σ_{γx,y∋e} ℓ(γx,y) πx πy = (2/n) Σ_{γx,y∋e} ℓ(γx,y)

is independent of e, so

K(C) = K(C, e), ∀e ∈ E.

Averaging over the n edges of the graph we deduce

K(C) = (1/n) Σ_e K(C, e) = (2/n²) Σ_e Σ_{γx,y∋e} ℓ(γx,y) = (2/n²) Σ_{x,y} Σ_{e∈γx,y} ℓ(γx,y) = (2/n²) Σ_{x,y} ℓ(γx,y)²

(n = 2m + 1)

= (2/n) Σ_{i=1}^n ℓ(γ_{x1,xi})² = (4/n) Σ_{i=1}^m i² = n²/6 + O(n), as n → ∞.

Hence

λ2 ≤ 1 − 6/n² + O(n⁻³), as n → ∞.

Thus, for large n this upper estimate is of the same order as the precise estimate (4.5.6) with d = 1. □
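The computation above can be cross-checked by brute force. The sketch below (Python/NumPy, ours) enumerates, for a fixed edge of the n-cycle, all geodesics through it, recovers K(C) ≈ n²/6, and confirms that 1 − 1/K(C) dominates the exact second eigenvalue cos(2π/n):

```python
import numpy as np

def K_edge(n):
    """K(C, e) for the edge {0, 1} of the n-cycle (n odd), geodesic paths,
    counting geodesics that traverse the edge in either direction."""
    c = 1 / (2 * n)                         # c(e) = pi_x Q_{x,y} = (1/n)(1/2)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            cw, ccw = (j - i) % n, (i - j) % n
            if cw < ccw:                    # geodesic runs clockwise, length cw
                if (0 - i) % n < cw:        # it traverses 0 -> 1
                    total += cw / n**2      # l(gamma) * pi_x * pi_y
            else:                           # counterclockwise, length ccw
                if (i - 1) % n < ccw:       # it traverses 1 -> 0
                    total += ccw / n**2
    return total / c

n = 15
K = K_edge(n)                               # same for every edge, by symmetry
print(round(K, 4), n**2 / 6)                # 37.3333 vs 37.5
print(1 - 1 / K >= np.cos(2 * np.pi / n))   # the bound dominates the exact lambda_2
```
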

We want to describe another geometric estimate for µ2, of the type first described in Riemannian geometry by J. Cheeger [27]. The volume of a set S ⊂ X is computed using the stationary measure π,

V(S, Q) := π[S] = Σ_{s∈S} πs.

The “boundary” of the set S is the collection of oriented edges

∂S := { (s, s′) ∈ Ê ; s ∈ S, s′ ∈ S^c }, S^c = X \ S.

The “area” of the boundary of S ⊂ X is

A(∂S, Q) := Σ_{e∈∂S} c(e) = Σ_{(s,s′)∈S×S^c} πs Qs,s′.

Note that A(∂S, Q) = A(∂S^c, Q). The ratio

h(S, Q) = A(∂S, Q) / V(S, Q)

is the conditional probability that the Markov chain will transition from a state in S to a state in S^c, given that the initial distribution is the equilibrium distribution.

Remark 4.107. If Q is associated to an electric network with arbitrary conductances c̃(x, y), then there exists Z > 0 such that

c̃(x) = Σ_y c̃(x, y) = Z πx, ∀x ∈ X.

Note that if we define

Ṽ(S, Q) := Σ_{s∈S} c̃(s),  Ã(∂S, Q) := Σ_{e∈∂S} c̃(e),

then

Ã(∂S, Q) / Ṽ(S, Q) = A(∂S, Q) / V(S, Q). □

Now define the Cheeger isoperimetric constant, or the conductance, of (X, Q) to be

h(Q) := inf{ h(S, Q) ; 0 < π[S] ≤ 1/2 } = inf{ max( h(S, Q), h(S^c, Q) ) ; ∅ ≠ S ⊊ X }. (4.5.19)

To get a feeling for the meaning of h(Q), suppose that Q corresponds to the unbiased random walk on a connected graph G with vertex set X. For any S ⊂ X, the area A(∂S) is, up to a multiplicative constant, the number of edges connecting a vertex in S with a vertex outside S. The volume V(S) is, up to a multiplicative constant, the sum of the degrees of the vertices in S; equivalently, V(S) − A(∂S) is twice the number of edges with both endpoints in S. Thus, a "large" h(Q) signifies that, for any subset S of X, a large fraction of the edges with at least one endpoint in S have the other endpoint outside S.

As an example of a graph with small h, think of a "bottleneck", i.e., a graph obtained by connecting with a single edge two disjoint copies of a complete graph.

Various versions of Cheeger's isoperimetric constant of a (connected) graph play a key role in the definition of expander families of graphs, [96; 108]. It was in

that context that the connection with random walks on graphs was discovered. For general reversible Markov chains we have the following result due to Jerrum and Sinclair [83].

Theorem 4.108. Let Q denote the transition matrix of a reversible Markov chain with finite state space X. Then

µ2 ≥ h(Q)²/2.

In particular,

λ2 ≤ 1 − h(Q)²/2.
Proof. We follow the presentation in [46]. Let u ∈ L²(X, π) and set u+ := max(u, 0). We set

Su := { u > 0 } ⊂ X,  h(u) := inf_{S⊂Su} A(∂S, Q) / V(S, Q).

Lemma 4.109. If u ∈ L²(X, π) and u+ ≠ 0, then

E( u+, u+ ) ≥ (h(u)²/2) ‖u+‖²_π. (4.5.20)
2
Proof. We can assume without any loss of generality that u = u+. Then

2 Σ_{u(x)<u(y)} ( u(y)² − u(x)² ) c(x, y) = Σ_{x,y} | u(x)² − u(y)² | c(x, y)

≤ ( Σ_{x,y} ( u(x) − u(y) )² c(x, y) )^{1/2} ( Σ_{x,y} ( u(x) + u(y) )² c(x, y) )^{1/2},

by the Cauchy-Schwarz inequality. The first factor equals ( 2E(u, u) )^{1/2}, while, since ( u(x) + u(y) )² ≤ 2( u(x)² + u(y)² ) and Σ_y c(x, y) = πx, the second factor is at most ( 4‖u‖²_π )^{1/2}. Hence

2 Σ_{u(x)<u(y)} ( u(y)² − u(x)² ) c(x, y) ≤ 2^{3/2} E(u, u)^{1/2} ‖u‖_π.

On the other hand, writing u(y)² − u(x)² = 2 ∫_{u(x)}^{u(y)} t dt, we deduce

2 Σ_{u(x)<u(y)} ( u(y)² − u(x)² ) c(x, y) = 4 Σ_{u(x)<u(y)} ( ∫_{u(x)}^{u(y)} t dt ) c(x, y) = 4 ∫_0^∞ t ( Σ_{u(x)≤t<u(y)} c(x, y) ) dt.

Write St := { u > t } ⊂ Su and observe that

Σ_{u(x)≤t<u(y)} c(x, y) = A(∂St, Q) ≥ h(u) π[St].

We deduce

∫_0^∞ t ( Σ_{u(x)≤t<u(y)} c(x, y) ) dt ≥ h(u) ∫_0^∞ t π[u > t] dt = (h(u)/2) ‖u‖²_π,

where the last equality follows from (1.3.43). Combining the two estimates we obtain

2^{3/2} E(u, u)^{1/2} ‖u‖_π ≥ 4 · (h(u)/2) ‖u‖²_π = 2 h(u) ‖u‖²_π,

which yields (4.5.20). □

Observe now that for any x, y ∈ X we have

( u+(x) − u+(y) )( u(x) − u(y) ) ≥ ( u+(x) − u+(y) )².

To see this, note first that we have equality if u(x) and u(y) are both nonnegative or both nonpositive. We have strict inequality if one is positive and the other negative, say u(x) > 0 > u(y). Indeed,

( u+(x) − u+(y) )( u(x) − u(y) ) = u(x)( u(x) − u(y) ) > u(x)² = ( u+(x) − u+(y) )².

In particular, we deduce that

E( u+, u ) ≥ E( u+, u+ ),

and thus,

µ > 0, ∆u ≤ µu on {u > 0} ⇒ µ ‖u+‖²_π ≥ E( u+, u+ ). (4.5.21)

Indeed,

µ ‖u+‖²_π = µ ⟨u+, u⟩_π ≥ ⟨u+, ∆u⟩_π = E( u+, u ) ≥ E( u+, u+ ).

Combining (4.5.20) and (4.5.21) we deduce that µ ≥ h(u)²/2 whenever ∆u ≤ µu on {u > 0} ≠ ∅.

Suppose now that u is a nontrivial eigenfunction corresponding to the eigenvalue µ2 of ∆. Since

Σ_x u(x) πx = 0,

we deduce that {u > 0} ≠ ∅ and, replacing u by −u if necessary, we may assume π[Su] ≤ 1/2, so that h(u) ≥ h(Q). We conclude that

µ2 ≥ h(u)²/2 ≥ h(Q)²/2,

as claimed. □
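For chains with a handful of states, h(Q) can be computed by exhaustive search over subsets and the inequality µ2 ≥ h(Q)²/2 verified directly. A sketch (Python/NumPy, ours; the chain is a toy example):

```python
import itertools
import numpy as np

def cheeger_and_mu2(Q, pi):
    """Brute-force Cheeger constant h(Q) and spectral gap mu_2 = 1 - lambda_2
    of a small reversible chain."""
    N = len(pi)
    C = pi[:, None] * Q                     # conductances c(x, y)
    h = np.inf
    for r in range(1, N):
        for S in itertools.combinations(range(N), r):
            S = list(S)
            if pi[S].sum() <= 0.5:          # only subsets with pi(S) <= 1/2
                Sc = [x for x in range(N) if x not in S]
                h = min(h, C[np.ix_(S, Sc)].sum() / pi[S].sum())
    s = np.sqrt(pi)
    lam = np.sort(np.linalg.eigvalsh((s[:, None] * Q) / s[None, :]))[::-1]
    return h, 1 - lam[1]

# Lazy random walk on the path 0 - 1 - 2 (reversible).
Q = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
pi = np.array([0.25, 0.5, 0.25])
h, mu2 = cheeger_and_mu2(Q, pi)
print(round(h, 6), round(mu2, 6))   # 0.5 0.5 for this chain
print(mu2 >= h**2 / 2)              # Jerrum-Sinclair: mu_2 >= h(Q)^2 / 2
```
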

The quantity h(Q) is rather difficult to compute, but lower estimates are easier to obtain. Consider a collection C of paths in G as in the definition (4.5.17). We set

κ(C) := sup_{e∈Ê} κ(C, e),  κ(C, e) := (1/c(e)) Σ_{C∋γx,y∋e} πx πy.

If an oriented edge e is not contained in any path γ ∈ C we set κ(C, e) = 0. We have the following result, [46; 141].

Proposition 4.110. We have

h(Q) ≥ 1/(2κ(C)),

so that

λ2 ≤ 1 − 1/(8κ(C)²), ∀C.

Proof. Let S ⊂ X be a set of vertices with V(S) = π[S] ≤ 1/2. We set

W(S) := Σ_{γx,y∈C, x∈S, y∈S^c} πx πy.

Since C contains exactly one path for each pair, clearly

W(S) = π[S] π[S^c] ≥ (1/2) π[S] = (1/2) V(S).

On the other hand, every path from S to S^c must use an edge of ∂S, so

W(S) ≤ Σ_{e∈∂S} Σ_{γx,y∋e} πx πy = Σ_{e∈∂S} c(e) κ(C, e) ≤ κ(C) Σ_{e∈∂S} c(e) = κ(C) A(∂S),

and we deduce

κ(C) A(∂S) ≥ (1/2) V(S). □

4.5.3 Markov Chain Monte Carlo


Since this is only an invitation to this subject, we do not attempt to formulate the most general situation or technique. Suppose that we want to sample a very large but finite set X according to a probability measure on it. The information we have about the set and the given distribution is not complete but "obtainable".

The probability measure π is known only up to a multiplicative constant. More precisely, we know only a weight w : X → (0, ∞) that is proportional to π, i.e.,

π[x] = w(x)/Z,  Z = Σ_{x∈X} w(x).

For all intents and purposes, the normalizing constant Z is not effectively available to us. Still, we would like to produce an X-valued random variable with distribution π.

The theory of Markov chains will allow us to produce, for any given ε > 0, an X-valued random variable with distribution ν within ε (in total variation distance) of the desired but unknowable distribution π.

The Metropolis algorithm will allow us to achieve this. The input of the algorithm is a pair (G, w), where G is a graph with vertex set X and w is a weight on its set of vertices, i.e., a function w : X → (0, ∞) such that

Σ_{x∈X} w(x) < ∞.

The graph G is called the candidate graph. Often the candidate graph is suggested by the problem at hand.

A good example to have in mind is the set X of Internet nodes, where we want to sample the set of nodes uniformly. In this case the weight w is a constant function. To simplify the presentation we assume that the graph is connected and the standard random walk on it is primitive.

The output of the algorithm is the transition matrix Q of a reversible, irreducible and aperiodic Markov chain with state space X and whose equilibrium probability π is proportional to w. We will refer to this Markov chain as the Metropolis chain with candidate graph G and equilibrium distribution π. If we run this Markov chain starting from an initial vertex x0 ∈ X, then for n sufficiently large, the state Xn reached after n steps will have a distribution close to π.
The transitions of this Markov chain are described by an acceptance-rejection strategy based on the standard random walk on the graph G. More precisely, the transitions from a vertex x to one of its neighbors follow these rules.

(i) Pick one of the neighbors y of x, equally likely among its d(x) neighbors. (This is what we would do if we were to perform a standard random walk on G.) This is the candidate part.
(ii) The transition to y is decided by a comparison between the weight w(y) at y and the weight w(x) at x. More precisely, we accept the move to y with probability min( 1, (w(y)/d(y)) / (w(x)/d(x)) ). Otherwise we reject the move and stay put at x. This is the acceptance-rejection part.

In other words, the transition matrix Q of this Markov chain is given by

Qx,y = 0, if y ∉ N(x) ∪ {x},

Qx,y = (1/d(x)) min( 1, (w(y)/d(y)) / (w(x)/d(x)) ), if y ∈ N(x),

Qx,x = 1 − (1/d(x)) Σ_{x′∈N(x)} min( 1, (w(x′)/d(x′)) / (w(x)/d(x)) ).

Above, N(x) denotes the set of neighbors of x in the candidate graph. Let us show that

w(x) Qx,y = w(y) Qy,x, ∀x, y ∈ X,

so that Q is reversible and its equilibrium distribution is proportional to w.


Indeed, for x ≠ y, we have

w(x) Qx,y = (w(x)/d(x)) min( 1, (w(y)/d(y)) / (w(x)/d(x)) )

= min( w(x)/d(x), w(y)/d(y) )

= (w(y)/d(y)) min( 1, (w(x)/d(x)) / (w(y)/d(y)) ) = w(y) Qy,x.

If the random walk on the candidate graph G is primitive, then so is the Metropolis chain. If not, we replace the Metropolis chain with its lazy version; see Remark 4.69.
We refer to [77; 141] for applications of this algorithm to combinatorics. In general, it is difficult to estimate the SLE or the rate of convergence of the Metropolis chain, but in practice it works well.
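The construction above is a few lines of code. The sketch below (Python/NumPy; the helper name metropolis_Q is ours) assembles the Metropolis transition matrix from a candidate graph and a weight w, then verifies detailed balance and the stationarity of π ∝ w; the normalizing constant Z is used only for the check, never by the chain itself:

```python
import numpy as np

def metropolis_Q(adj, w):
    """Metropolis transition matrix for a candidate graph with 0/1 adjacency
    matrix adj and positive vertex weights w."""
    n = len(w)
    d = adj.sum(axis=1)                        # vertex degrees d(x)
    Q = np.zeros((n, n))
    for x in range(n):
        for y in range(n):
            if x != y and adj[x, y]:
                Q[x, y] = min(1.0, (w[y] / d[y]) / (w[x] / d[x])) / d[x]
        Q[x, x] = 1.0 - Q[x].sum()             # rejected proposals stay put
    return Q

# Candidate graph: a 4-cycle; target distribution proportional to w.
adj = np.array([[0, 1, 0, 1],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [1, 0, 1, 0]])
w = np.array([1.0, 2.0, 3.0, 4.0])
Q = metropolis_Q(adj, w)
pi = w / w.sum()                               # Z = w.sum(), used only to verify
print(np.allclose(pi @ Q, pi))                             # pi is stationary
print(np.allclose(pi[:, None] * Q, (pi[:, None] * Q).T))   # detailed balance
```
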

Example 4.111. A few years ago (2016-17) I asked Mike McCaffrey, at that time a student writing his senior thesis under my supervision, to read Diaconis' excellent survey [42] and then to try to implement numerically the decryption strategy described in that paper, based on the Metropolis algorithm. I want to report some of McCaffrey's nice findings. For more details I refer to his senior thesis [116].

Let me first outline the encryption problem and the decryption strategy proposed in [42]. The encryption method is a simple substitution cipher. Scramble the 26 letters of the English alphabet E. The encryption is captured by a permutation ϕ of the set E, or equivalently, an element ϕ of the symmetric group S26.

The decryption problem asks to determine the decoding permutation ϕ⁻¹ given a text encoded by the (unknown) permutation ϕ. Thus, we need to find one element in a set of 26! elements. To appreciate how large 26! is, it helps to have in mind that a pile of 26! grains of sand would cover the continental United States with a layer of sand 0.6 miles (approx. 1 kilometer) thick. We are supposed to find a single grain of sand in this huge pile. Needle in a haystack sounds optimistic!

The strategy outlined in [42] goes as follows. There are 26² pairs of letters in the English alphabet E, and they appear as adjacent letters in English texts with a certain frequency. E.g., one would encounter quite frequently the pair "th", less so pairs such as "tt" or "tw". We denote by f(s1, s2) the frequency of the pair of letters (s1, s2). More precisely, f(s1, s2) is the conditional probability that in an English text the letter s1 is followed by s2. To any text of length n, viewed as a string of n letters, x = x1 . . . xn, we associate the weight

w[x] := Π_{i=2}^n f(x_{i−1}, xi).

We can use a given encrypted text x to define a weight on S26,

w(ϕ) := w[ ϕ(x1) . . . ϕ(xn) ].

If x is obtained from a genuine English text a1 . . . an via a permutation ϕ0, xi = ϕ0⁻¹(ai), then ϕ0 is the decoder, xi ↦ ϕ0(xi) = ai, and

w(ϕ0) = w[ a1 . . . an ].

The hope is that permutations with higher weight are closer to the decoding permutation, since they mimic closely the frequencies of adjacent pairs of letters in written English. In other words, we expect that

ϕ0 = argmax_{σ∈S26} w(σ).

The weight function w defines a probability measure on S26 highly concentrated around the decoding permutation. If we sample this probability measure, there is a high probability that we will land near the decoding permutation.

To sample this probability measure we rely on the Metropolis algorithm. The symmetric group is generated by its binom(26,2) transpositions, and as candidate graph we take the associated Cayley graph defined by this set of generators. As initial state we take the identity permutation.
The question is how well does this work in practice. First, one needs to find the
relative frequencies of adjacent pairs in English texts. One can do this by analyzing
a large text. In [42] Diaconis suggested using “War and Peace”. Mike McCaffrey
used “Moby Dick” for this purpose. The table in Figure 4.6 (borrowed from [116])
depicts these relative frequencies.5
He then proceeded to test6 this method using first a shorter text
THE PROBABILITY THAT WE MAY FAIL IN THE STRUGGLE OUGHT
NOT TO DETER US FROM THE SUPPORT OF A CAUSE WE BELIEVE
TO BE JUST
The scrambled version looked like
OVB CTEAJADKDOM OVJO SB RJM HJDK DN OVB WOTYXXKB
EYXVO NEO OE ZBOBT YW HTER OVB WYCCETO EH J QJYWB SB
ABKDBLB OE AB UYWO

We expect the weight of the decoded text to be a lot higher than the weight of the encoded text. In the above example, the weight of the original text is 2.6 × 10^115 times higher than that of the cyphered text!!!
After 3, 000 steps in the random walk governed by the above Metropolis algo-
rithm, the output was close to the original text:
THE PROLALINITY THAT WE MAY FAIN ID THE STRUGGNE OUGHT
DOT TO KETER US FROM THE SUPPORT OF A JAUSE WE LENIEVE
TO LE BUST
5 He actually used an alphabet consisting of 27 symbols, the 26 letters of the alphabet and a 27-th representing any symbol that is not a letter.


6 An R-code implementing this algorithm was and is publicly available.

Fig. 4.6 Moby Dick Transition Matrix.

Mike then tested this algorithm on a bigger text. He chose the easily recognizable
Gettysburg address by Abraham Lincoln.
The most vivid confirmation of the power of this method came when he presented
his results to a mixed group of students in the College of Science of the University
of Notre Dame. He began his presentation by projecting the ciphered Gettysburg
address, but the audience was left in the dark about the nature of original text.
While Mike was describing the problem and the decoding strategy, his laptop was
running the algorithm in the background and every few seconds the text on the
screen would scramble revealing a new text resembling more and more an English
text. Ten minutes or so into his presentation the audience was able to recognize
without difficulty the Gettysburg address. It took about 120 steps in the Metropolis
random walk to reach an easily recognizable albeit misspelled text! □
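A miniature version of this attack fits in a few lines. In the sketch below (Python; the alphabet, text, and bigram table are all made up for illustration, not McCaffrey's actual data) the candidate graph is the Cayley graph of transpositions; since all its vertices have equal degree, the acceptance probability reduces to min(1, w(ψ)/w(ϕ)), computed in log scale to avoid underflow:

```python
import math
import random

ALPHA = "abcd"   # toy alphabet; a real attack uses 26+ symbols

# Made-up bigram log-frequencies; a real experiment estimates them
# from a long reference text, as was done here with "Moby Dick".
logf = {(a, b): math.log(0.1 + (a == "a") + (b == "b"))
        for a in ALPHA for b in ALPHA}

def log_weight(text, phi):
    """log w(phi) = sum of log-frequencies of adjacent decoded pairs."""
    dec = [phi[c] for c in text]
    return sum(logf[(dec[i - 1], dec[i])] for i in range(1, len(dec)))

def metropolis_step(phi, text, rng):
    """Propose swapping the images of two letters; accept w.p. min(1, w'/w)."""
    a, b = rng.sample(ALPHA, 2)
    psi = dict(phi)
    psi[a], psi[b] = psi[b], psi[a]
    if math.log(rng.random()) < log_weight(text, psi) - log_weight(text, phi):
        return psi
    return phi

rng = random.Random(0)
text = "abab" * 10                       # "ciphertext" to decode
phi = {c: c for c in ALPHA}              # initial state: identity permutation
for _ in range(200):
    phi = metropolis_step(phi, text, rng)
print(sorted(phi.values()) == list(ALPHA))   # phi remains a permutation
```
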

4.6 Exercises

Exercise 4.1. Consider the construction in Remark 4.5 of an HMC with initial distribution µ and transition matrix Q as a sequence of random variables defined on [0, 1) equipped with the Lebesgue measure λ. For every t ∈ [0, 1) there exists x = x(t) ∈ X^{N0} uniquely determined by

t ∈ ∩_{n≥0} I_{x0,...,xn}.

(i) Prove that the resulting map Ψ : [0, 1) → X^{N0} given by t ↦ x(t) is measurable and Ψ#λ = Pµ. Hint. Use the π-λ theorem.
(ii) Prove that the map Ψ is injective and its image is shift-invariant and has Pµ-negligible complement.
(iii) Describe the map t ↦ x(t) when X = {0, 1}, µ[0] = µ[1] = 1/2 and

Q = [ 1/2 1/2 ; 1/2 1/2 ].

Describe explicitly the random variables

Xn : [0, 1) → R, Xn(t) = xn(t), where x(t) = ( x0(t), x1(t), . . . ) ∈ {0, 1}^{N0}. □

Exercise 4.2. Two people A, B play the following game. Two dice are tossed. If
the sum of the numbers showing is less than 7, A collects a dollar from B. If the
total is greater than 7, then B collects a dollar from A. If a 7 appears, then the
person with the fewest dollars collects a dollar from the other. If the persons have
the same amount, then no dollars are exchanged. The game continues until one
person runs out of dollars. Let A’s number of dollars represent the states. We know
that each person starts with 3 dollars.

(i) Show that the evolution of A is governed by a Markov chain. Describe its
transition matrix.
(ii) If A reaches 0 or 6, then he stays there with probability 1. What is the
probability that B loses in 3 tosses of the dice?
(iii) What is the probability that A loses in 5 or fewer tosses?

□
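The chain in this exercise is small enough to set up numerically; the sketch below (Python, ours; a cross-check, not a substitute for the requested proofs) encodes the rules P[sum < 7] = P[sum > 7] = 15/36 and P[sum = 7] = 6/36, with the dollar on a 7 going to the poorer player:

```python
from fractions import Fraction as F
import numpy as np

p_lt, p_gt, p_eq = F(15, 36), F(15, 36), F(6, 36)   # sum < 7, > 7, = 7

# States 0..6 = A's dollars; 0 and 6 are absorbing.
Q = [[F(0)] * 7 for _ in range(7)]
Q[0][0] = Q[6][6] = F(1)
for i in range(1, 6):
    up, down, stay = p_lt, p_gt, F(0)
    if i < 3:
        up += p_eq        # A has fewer dollars, so A also collects on a 7
    elif i > 3:
        down += p_eq      # B has fewer dollars, so B collects on a 7
    else:
        stay = p_eq       # equal fortunes: no exchange on a 7
    Q[i][i + 1], Q[i][i - 1], Q[i][i] = up, down, stay

M = np.array([[float(q) for q in row] for row in Q])
dist3 = np.linalg.matrix_power(M, 3)[3]   # both players start with 3 dollars
print(round(dist3[6], 4))                 # P(B broke within 3 tosses) = 0.0723
```

The only way B can be broke after three tosses is the path 3 → 4 → 5 → 6, so the printed value is (15/36)³.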

Exercise 4.3. Prove that (4.1.4) is equivalent to (4.1.5). □

Exercise 4.4. Let X be a finite or countable set. Construct a Markov chain with state space X such that any subset of X is a closed set of this Markov chain. □

Exercise 4.5. Suppose that (Yn)n≥0 is a sequence of i.i.d., N0-valued random variables with common probability generating function

G(s) = Σ_{k≥0} pk s^k,  pk := P[ Yn = k ], ∀k, n ∈ N0.

Let Xn be the amount of water in a reservoir at noon on day n. During the 24-hour period beginning at this hour a quantity Yn flows into the reservoir, and just before noon a quantity of one unit of water is removed, if this amount can be found. The maximum capacity of the reservoir is K; excessive inflows are spilled and lost. Show that (Xn)n≥0 is an HMC, and describe the transition matrix and its stationary distribution in terms of G. □

Exercise 4.6. Denote by Xn the capital of a gambler at the end of the n-th game. He relies on the following gambling strategy. If his fortune is ≥ $4 he gambles $2, expecting to win $4, $3, $2 with respective probabilities 0.25, 0.30, 0.45. If his capital is 1, 2 or 3 dollars he bets $1, expecting to earn $2 and $0 with probabilities 0.45 and 0.55 respectively. When his fortune is 0 he stops gambling.

(i) Show that (Xn)n≥0 is a homogeneous Markov chain, compute its transition probabilities and classify its states.
(ii) Set T := inf{ n ∈ N ; Xn = 0 }. Show that P[ T < ∞ ] = 1.
(iii) Compute E[ T ]. □

Exercise 4.7. Suppose that (Xn)n≥1 is a sequence of nonnegative i.i.d., continuously distributed random variables. Consider the sequence of records (Rn)n∈N defined inductively by the rule

R1 = 1,  Rn = inf{ n > 1 ; Xn > max( X1, . . . , Xn−1 ) }.

Show that the sequence (Rn) is a Markov chain with state space N and then compute its transition probabilities. Is this a homogeneous chain? □

Exercise 4.8. At an office served by a single clerk arrives a Poisson stream of clients. More precisely, the n-th client arrives at time Tn = S1 + · · · + Sn, where (Sn)n∈N is a sequence of i.i.d. random variables, Sn ∼ Exp(λ). The time to process the n-th client is Zn, where (Zn)n≥1 is a sequence of i.i.d. nonnegative random variables with common distribution PZ. We assume that the random variables Zn are independent of the arrival times Tm. For n ≥ 0 we denote by Xn the number of customers waiting in line immediately after the n-th arrived customer was served.

(i) Show that (Xn)n≥0 is a homogeneous Markov chain with transition probabilities

P[ Xn+1 = k | Xn = j ] = q_{k−j} if k ≥ j, and = 0 if k < j,

where

qj = ∫_0^∞ e^{−λz} ( (λz)^j / j! ) PZ[dz].

Hint. Use Exercise 1.47.
(ii) Set µ := E[Z], r := λµ. Prove that the above chain is positively recurrent if and only if r < 1.
(iii) Assume that r < 1 and c2 := E[Z²] < ∞. Prove that

lim_{n→∞} E[ Xn ] = r + λ²c2 / (2(1 − r)). □

Exercise 4.9. Suppose that (Xn)n∈N0 is an irreducible HMC with state space X and transition matrix Q. Prove that the following statements are equivalent.

(i) The chain is recurrent.
(ii) There exist x, y ∈ X such that Σ_{n∈N} Q^n_{x,y} = ∞.
(iii) For any x, y ∈ X we have Σ_{n∈N} Q^n_{x,y} = ∞. □

Exercise 4.10. Suppose that (Xn)n≥0 is an irreducible Markov chain with state
space X, transition probability matrix Q and x0 ∈ X.

(i) For n ∈ N set

    τx(n) = Px[T_{x0} > n],   τx = lim_{n→∞} τx(n).

Prove that

    τx = Σ_{y ≠ x0} Q_{x,y} τy,  ∀x ∈ X \ {x0}.   (4.6.1)

(ii) Show that if x0 is transient, then there exists x ∈ X \ {x0} such that τx ≠ 0.
(iii) Suppose there exists a function α : X \ {x0} → [−1, 1], not identically zero,
satisfying (4.6.1). Prove that x0 is transient.

Exercise 4.11. Suppose that (Xn)n≥0 is a transient irreducible Markov chain with
state space X. Prove that, with probability 1, the chain will exit any finite subset
F ⊂ X, never to return, i.e.,

    P[ lim_{n→∞} I_F(Xn) = 0 ] = 1.

□

Exercise 4.12. Bobby’s business fluctuates in successive years between three sates
between three states: 0 = bankruptcy, 1 = verge of bankruptcy, 2 = solvency. The
transition matrix giving the probability of evolving from state to state is
 
1 0 0
Q =  0.5 0.25 0.25  .
0.5 0.25 0.25

(i) What is the expected number of years until Bobby's business goes bankrupt,
assuming it starts in solvency?
(ii) Bobby's rich father decides that it is bad for the family name if his son goes
bankrupt. Thus, when state 0 is entered, his father infuses Bobby's business
with cash, returning it to solvency with probability 1. Thus the transition
matrix for this Markov chain is
 
0 0 1
P =  0.5 0.25 0.25  .
0.5 0.25 0.25
Show that this Markov chain is irreducible and aperiodic, and find the expected
number of years between cash infusions from his father. □
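Part (i) can be cross-checked numerically by solving the standard absorption-time linear system by first-step analysis (a sketch; the restriction to the transient states {1, 2} is the usual fundamental-matrix computation, not a method prescribed by the exercise):

```python
import numpy as np

# Transition matrix from part (i); state 0 (bankruptcy) is absorbing.
Q = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.25, 0.25],
              [0.5, 0.25, 0.25]])

# If k[x] is the expected number of years until absorption from the
# transient state x, first-step analysis gives (I - S) k = 1, where S
# is the submatrix of Q on the transient states {1, 2}.
S = Q[1:, 1:]
k = np.linalg.solve(np.eye(2) - S, np.ones(2))
print(k)  # expected years until bankruptcy starting from states 1 and 2
```

Starting in solvency (state 2), the computation suggests an expected time of 2 years.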

Exercise 4.13. Let Q be a stochastic n × n matrix and denote by C the n × n
matrix with entries C_{i,j} = 1/n, ∀i, j.

(i) Prove that for any r ∈ (0, 1) the Markov chain defined by the stochastic matrix
Q(r) = (1 − r)Q + rC is irreducible and aperiodic. Denote by πr the unique
stationary probability measure.
(ii) Prove that πr converges as r → 0 to a stationary probability measure π0 of the
HMC defined by Q.
(iii) Describe π0 in the special case when the HMC determined by Q consists of
exactly two communication classes C1 and C2 and there exist xi ∈ Ci , i = 1, 2
such that Qx1 ,x2 > 0.

□

Exercise 4.14. The random walk of a chess piece on a chessboard is governed by the
rule: the feasible moves are equally likely. Suppose that a rook and a bishop start
at the same corner of a 4 × 4 chessboard and perform these random walks. Denote
by T the time they meet again at the same corner. Find E[T]. □

Exercise 4.15. Consider the HMC with state space X = {0, 1, 2, . . . } and transition matrix Q defined by

    Q_{n,n+k} = (n choose k) / 2^{n+1},  ∀0 ≤ k ≤ n, n ≥ 1,
    Q_{0,1} = 1,   Q_{n,0} = 1/2,  ∀n ≥ 1.

Prove that the chain is irreducible, positively recurrent and aperiodic, and find
E0[T0]. □

Exercise 4.16. Let Kn+1 denote the complete graph with n + 1 vertices
v0, v1, . . . , vn. Denote by (Xn)n≥0 the random walk on Kn+1 with transition rules

    Q_{vi,vj} = 1/n, ∀i > 0, j ≠ i,   Q_{v0,vi} = 0,   Q_{v0,v0} = 1.

Thus the vertex v0 is absorbing. For i > 0 we denote by Hi the time to reach the
vertex v0 starting at vi,

    Hi := min{ j ≥ 0 : X0 = vi, Xj = v0 }.

Prove that E[Hi] = n, ∀i > 0. □

Exercise 4.17. A particle performs a random walk on the nonnegative integers


with transition probabilities
p0,0 = q, pi,i+1 = p, pj,j−1 = q, i ≥ 0, j > 0,
where p ∈ (0, 1) and q = 1 − p.
Prove that the random walk is transient if p > q, null recurrent if p = q,
and positively recurrent if p < q. In the last case determine the unique invariant
probability distribution. □

Exercise 4.18. We generate a sequence Bn of bits, i.e., 0’s and 1’s, as follows. The
first two bits are chosen randomly and independently with equal probabilities. (Flip
a fair 0/1 coin twice and record the results.) If B1 , . . . , Bn are generated, then we
generate Bn+1 according to the rules
  1  
P Bn+1 = 0 k Bn = Bn−1 = 0 = = P Bn+1 = 0 k Bn = 0, Bn−1 = 1
2
  1  
P Bn+1 = 0 k Bn = 1, Bn−1 = 0 = = P Bn+1 = 0 k Bn = Bn−1 = 1 .
4
What is the proportion of 0’s in the long run? t
u
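Although (Bn) itself is not a Markov chain, the pairs (Bn−1, Bn) form an HMC, and the long-run proportion of 0's can be probed numerically. A minimal sketch (the state ordering and the power-method iteration count are our choices):

```python
import numpy as np

# P[B_{n+1} = 0 | (B_{n-1}, B_n) = (a, b)], keyed by the pair (a, b).
p0 = {(0, 0): 0.5, (1, 0): 0.5, (0, 1): 0.25, (1, 1): 0.25}
states = [(0, 0), (0, 1), (1, 0), (1, 1)]

# Transition matrix of the pair chain (B_{n-1}, B_n) -> (B_n, B_{n+1}).
Q = np.zeros((4, 4))
for i, (a, b) in enumerate(states):
    for j, (c, d) in enumerate(states):
        if c == b:
            Q[i, j] = p0[(a, b)] if d == 0 else 1 - p0[(a, b)]

pi = np.full(4, 0.25)
for _ in range(200):      # power method: converge to the stationary law
    pi = pi @ Q
print(pi[0] + pi[2])      # proportion of 0's: states whose current bit is 0
```

The computation is consistent with a long-run proportion of 1/3.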

Exercise 4.19. Consider the Markov chain with state space X = N0 and transition
probabilities
    Q_{n,n−1} = 1, ∀n ∈ N,
    Q_{0,n} = pn, ∀n ∈ N0,   Σ_{n≥0} pn = 1.

Find a necessary and sufficient condition on the distribution (pn)n≥0 guaranteeing
that the above HMC is positively recurrent. □

Exercise 4.20. Suppose that a gambler plays a fair game with winning probability
p = 1/2. He starts with an initial fortune X0 = 1 dollar. His goal is to reach a
fortune of g dollars, g ∈ N. He stops if he reaches this fortune or goes broke, and he
employs a bold strategy: at every game he stakes the largest amount of money that will
get him closest to, but not above, g. He cannot bet a sum greater than his fortune
at that moment. Denote by Xn his fortune after the n-th game.

(i) Prove that (Xn)n≥0 is an HMC. Describe its state space and its transition
matrix.
(ii) Prove that the player reaches his goal with probability 1/g and goes broke with
probability (g − 1)/g.

□

Exercise 4.21. Consider the standard random walk in Z² started at the origin.
For each m ∈ Z we denote by Tm the first moment the random walk reaches the
line x + y = m, and we denote by (Um, Vm) the point where this walk intersects the
above line. Find the probability distributions of Tm, Um and Vm. □

Exercise 4.22. A top-to-bottom shuffle of a deck of k cards consists of choosing
a position j uniformly at random in {1, . . . , k} and moving the top card to occupy
position j in the deck, counted from top to bottom. This defines a random walk
(Φn)n≥0 on the symmetric group Sk. Assume that Φ0 is the identity permutation.
Denote by T the first moment when the card that used to be at the bottom of the
deck reaches the top, i.e.,

    T = min{ n ≥ 1; Φn(1) = k }.
 
(i) Compute E[T].
(ii) Denote by ΨT the random permutation in Sk−1 given by ΨT(j) = ΦT(j + 1),
j = 1, . . . , k − 1. In other words, ΨT is the permutation of the cards below the
top card at time T. Prove that ΨT is uniformly distributed on Sk−1.
(iii) Prove that ΦT+1 is uniformly distributed on Sk.

□
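As a sanity check for part (i), one can track just the position of the original bottom card: it moves up one slot exactly when the top card is inserted below it. The Monte Carlo sketch below (the helper name and the parameters are our choices) is consistent with the guess E[T] = k(1 + 1/2 + · · · + 1/(k−1)):

```python
import random

def time_to_top(k, rng):
    # Position of the card initially at the bottom (1 = top of the deck).
    pos, t = k, 0
    while pos > 1:
        t += 1
        j = rng.randint(1, k)   # uniform target position for the top card
        if j >= pos:            # inserted below our card: it moves up
            pos -= 1
    return t

rng = random.Random(0)
k, trials = 6, 50_000
est = sum(time_to_top(k, rng) for _ in range(trials)) / trials
exact = k * sum(1 / j for j in range(1, k))   # conjectured k * H_{k-1}
print(est, exact)
```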

Exercise 4.23. Suppose that X is an at most countable set equipped with the
discrete topology, µ ∈ Prob(X) and Q : X × X → [0, 1] is a stochastic matrix. Let
Xn : (Ω, S, P) → X be a sequence of measurable maps.

(i) Prove that (Xn)n≥0 ∈ Markov(µ, Q) if and only if

    E[f(Xn+1) | Xn, . . . , X0] = Q∗f(Xn)

for any bounded function f : X → R.



(ii) (Lévy) Prove that (Xn)n≥0 ∈ Markov(µ, Q) if and only if, for any
f ∈ L∞(X, µ), the sequence

    Y0 = f(X0),   Yn = f(Xn) − Σ_{k=0}^{n−1} ( Qf(Xk) − f(Xk) )

is a martingale with respect to the filtration Fn = σ(X0, X1, . . . , Xn).

□

Exercise 4.24. Consider an irreducible HMC with finite state space X and transition matrix Q. We denote by π the invariant probability distribution. For every
x ∈ X we denote by Hx the hitting time of x,

    Hx := min{ n ≥ 0; Xn = x }.

(i) Show that

    τ(x) := Σ_{y∈X} Ex[Hy] π[y]

is independent of x.
(ii) Prove that

    τ(x) = Σ_{y∈X} Eπ[Hy] π[y].

□

Exercise 4.25. Suppose that (Xn)n∈N0 is an HMC with state space X and transition matrix Q. Suppose that B ⊂ X and HB is the hitting time of B,

    HB := min{ n ≥ 0; Xn ∈ B }.

We define

    hB : X → [0, 1],   hB(x) = Px[HB < ∞] = P[HB < ∞ | X0 = x],
    kB : X → [0, ∞],   kB(x) = Ex[HB].

(i) Show that hB satisfies the linear system

    hB(x) = 1, ∀x ∈ B,
    hB(x) = Σ_{y∈X} Q_{x,y} hB(y), x ∈ X \ B.        (4.6.2)

(ii) Show that if h : X → [0, ∞) is a solution of (4.6.2), then

    hB(x) ≤ h(x), ∀x ∈ X.

(iii) Show that kB satisfies the linear system

    kB(x) = 0, ∀x ∈ B,
    kB(x) = 1 + Σ_{y∈X} Q_{x,y} kB(y), x ∈ X \ B.    (4.6.3)

(iv) Show that if k : X → [0, ∞] satisfies (4.6.3), then kB(x) ≤ k(x), ∀x ∈ X.

□
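For a finite chain, (4.6.2) and (4.6.3) are ordinary linear systems. As an illustration (our choice of example, not part of the exercise), here is the system (4.6.3) for the simple symmetric walk on {0, . . . , N} with B = {0, N}; the numerical solution matches the classical formula kB(x) = x(N − x):

```python
import numpy as np

# Simple symmetric random walk on {0, ..., N}, absorbed on B = {0, N}.
# Solve the system (4.6.3): k_B = 0 on B, k_B(x) = 1 + sum_y Q_{x,y} k_B(y).
N = 10
A = np.zeros((N + 1, N + 1))
b = np.zeros(N + 1)
for x in range(N + 1):
    A[x, x] = 1.0
    if x in (0, N):
        continue               # k_B(x) = 0 on B
    A[x, x - 1] -= 0.5         # move the Q-terms to the left-hand side
    A[x, x + 1] -= 0.5
    b[x] = 1.0
k = np.linalg.solve(A, b)
print(k)  # expected hitting times of {0, N}
```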

Exercise 4.26. Suppose that (Xn)n∈N0 is an HMC with state space X and transition matrix Q. For x ∈ X we denote by Tx the return time to x,

    Tx := min{ n ≥ 1; Xn = x }.

We set

    f_{x,y}(n) := Px[Ty = n],
    F_{x,y}(s) := Σ_{n≥0} f_{x,y}(n) s^n,   P_{x,y}(s) := Σ_{n≥0} Q^n_{x,y} s^n,
    f_{x,y} := F_{x,y}(1) = Σ_{n≥0} f_{x,y}(n) = Px[Ty < ∞].

(i) Prove that

    P_{x,y}(s) = δ_{x,y} + F_{x,y}(s) P_{y,y}(s),  ∀x, y ∈ X,

where δ_{x,y} = 1 if x = y and δ_{x,y} = 0 if x ≠ y.
(ii) Deduce from (i) that

    Σ_{n≥0} Q^n_{x,x} < ∞ ⟺ Px[Tx < ∞] < 1.

(iii) Set Ty^{(1)} := Ty and define inductively Ty^{(k)} := min{ n > Ty^{(k−1)}; Xn = y },
k > 1. Prove that

    Px[Ty^{(k)} < ∞] = f_{x,y} (f_{y,y})^{k−1}.

□

Exercise 4.27. Suppose that {Xn : (Ω, S, P) → X}n≥0 is an irreducible, recurrent
HMC with state space X and transition matrix Q, defined on a probability space
(Ω, S, P). Fix x ∈ X, assume P[X0 = x] = 1 and denote by Tk the time of the k-th
return to x. More precisely,

    T0 = 0,   T1 := min{ n > 0; Xn = x },   Tk+1 := min{ n > Tk; Xn = x }.

We set

    Yk = ( X_{T_k}, X_{T_k + 1}, . . . , X_{T_{k+1} − 1} ).

(i) Realize the quantities Yk as random maps Yk : Ω → Y, where Y is a countable
set equipped with the sigma-algebra 2^Y.

(ii) Show that the resulting random maps are i.i.d.

□

Exercise 4.28. Consider a positively recurrent HMC (Xn)n≥0 with state space X
and transition matrix Q. Denote by π the stationary distribution. For x ∈ X we
denote by Tx the first return time to x, and for y ∈ X we set

    N_{x,y} := #{ n ∈ N; n ≤ Tx, Xn = y },   G(x, y) := Ex[N_{x,y}].

In other words, G(x, y) is the expected number of visits to y before returning to x.

(i) Prove that G(x, y) = π[y] Ex[Tx].
(ii) Prove that

    Px[Ty < Tx] = 1 / ( π[y] ( Ex[Ty] + Ey[Tx] ) ).

□

Exercise 4.29. Consider a positively recurrent HMC (Xn)n≥0 with state space X,
transition matrix Q and stationary distribution π. Suppose that T is a stopping
time adapted to (Xn)n≥0 and let x ∈ X be such that Ex[T] < ∞. We let GT(x, y)
denote the expected number of visits to y before T, when started at x, i.e.,

    GT(x, y) = Ex[N^T_{x,y}],   N^T_{x,y} = #{ n ≥ 0; X0 = x, Xn = y, n ≤ T }.

Prove that GT(x, y) = π[y] Ex[T]. □

Exercise 4.30 (LeCam). Suppose that (Xr)1≤r≤n is a family of independent
Bernoulli random variables with success probabilities pr. Set

    λ := p1 + · · · + pn,   S = X1 + · · · + Xn.

(i) Show that the measure µr on N0 × N0 given by

    µr(m, k) = 1 − pr                 if m = k = 0,
             = e^{−pr} − 1 + pr       if m = 1, k = 0,
             = (pr^k / k!) e^{−pr}    if m = 1, k ≥ 1,
             = 0                      elsewhere,

is a coupling of Bin(pr) with Poi(pr) and conclude that

    dv( Bin(pr), Poi(pr) ) ≤ pr².

(ii) Denote by PS the probability distribution of S. Prove that

    dv( PS, Poi(λ) ) ≤ Σ_{r=1}^{n} pr².

□
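The bound in (ii) is easy to probe numerically: convolve the Bernoulli factors to obtain PS, then compare with Poi(λ) in total variation. A sketch (the particular values pr are arbitrary choices of ours):

```python
import math

p = [0.1, 0.05, 0.2, 0.15]
lam = sum(p)

# Distribution of S = X_1 + ... + X_n by convolving the Bernoulli factors.
ps = [1.0]
for pr in p:
    nxt = [0.0] * (len(ps) + 1)
    for j, a in enumerate(ps):
        nxt[j] += (1 - pr) * a
        nxt[j + 1] += pr * a
    ps = nxt

poi = [math.exp(-lam) * lam ** j / math.factorial(j) for j in range(len(ps))]
tail = 1.0 - sum(poi)  # Poisson mass beyond n contributes fully to the distance
tv = 0.5 * (sum(abs(a - b) for a, b in zip(ps, poi)) + tail)
print(tv, sum(pr ** 2 for pr in p))  # the first number should not exceed the second
```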

Exercise 4.31. Let (Xn)n≥0 be an irreducible Markov chain with finite state space
X, transition matrix Q and invariant probability measure µ ∈ Prob(X). Assume
that the initial distribution is also µ, i.e., PX0 = µ. For n ∈ N0 we set (see
Exercise 2.50 for notation)

    Hn = (1/(n+1)) Ent2[ X0, X1, . . . , Xn ],   Ln = Ent2[ Xn | Xn−1, . . . , X0 ].

(i) Prove that the sequence (Ln)n≥0 is non-increasing and nonnegative. Denote
by L its limit.
(ii) Prove that

    Hn = (1/(n+1)) Σ_{k=0}^{n} Lk.

(iii) Prove that the sequence (Hn) is convergent and its limit is L.
(iv) Prove that

    L = Σ_{x∈X} µ[x] Ent2[Q_{x,−}] = − Σ_{x,y∈X} µ[x] Q_{x,y} log2 Q_{x,y}.

The number L is called the entropy rate of the irreducible Markov chain. We
denote it by Ent2[X, Q]. □
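For a two-state chain the entropy rate in (iv) is a one-line computation. A sketch (the chain below is our example; `ent2` denotes base-2 entropy, matching the formula in (iv)):

```python
import math

# Two-state chain: Q = [[1-p, p], [q, 1-q]] has stationary measure
# mu = (q/(p+q), p/(p+q)).
p, q = 0.3, 0.1
mu = [q / (p + q), p / (p + q)]
Q = [[1 - p, p], [q, 1 - q]]

def ent2(row):
    # Base-2 entropy of a probability vector.
    return -sum(t * math.log2(t) for t in row if t > 0)

L = sum(mu[x] * ent2(Q[x]) for x in range(2))
print(L)  # entropy rate Ent2[X, Q], in bits per step
```

For p = q = 1/2 the same computation gives 1 bit per step, as expected for fair coin flips.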

Exercise 4.32. Let Q denote the n × n transition matrix describing the random
walk on a complete graph with n vertices. Find the spectrum of Q. □
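Before computing the spectrum by hand, it can be explored numerically (a sketch; n = 6 is an arbitrary choice of ours):

```python
import numpy as np

# Random walk on the complete graph with n vertices: from any vertex,
# jump uniformly to one of the other n - 1 vertices.
n = 6
Q = (np.ones((n, n)) - np.eye(n)) / (n - 1)
eig = np.sort(np.linalg.eigvals(Q).real)
print(eig)  # one eigenvalue equal to 1; the rest cluster at -1/(n-1)
```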

Exercise 4.33 (Doeblin). Suppose that (Xn)n≥0 is an HMC with state space X,
initial distribution µ and transition matrix Q satisfying the Doeblin condition

    ∃ε > 0, ∃x0 ∈ X : Q_{x,x0} > ε, ∀x ∈ X.

Denote by M the space of finite signed measures ρ on X. For ρ ∈ M we set

    ‖ρ‖1 := Σ_{x∈X} |ρx| < ∞,   ρx := ρ[{x}].

(i) Prove that for any ρ ∈ M we have ρQ ∈ M and

    Σ_{x∈X} ρx = Σ_{y∈X} (ρQ)y.

If ρ ∈ M and Σ_{x∈X} ρx = 0, then

    ‖ρQ‖1 ≤ (1 − ε)‖ρ‖1.

(ii) Set µn := µ · Q^n. Prove that

    ‖µn − µm‖1 ≤ 2(1 − ε)^m,  ∀n ≥ m ≥ 1.

(iii) Prove that the HMC is irreducible, positively recurrent, and the unique invariant probability measure π satisfies

    ‖µn − π‖1 ≤ 2(1 − ε)^n,  ∀n ∈ N.

□
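The geometric bound of (iii) can be verified on a concrete chain. A sketch (the matrix and ε below are our choices; the first column is bounded below by ε, so the Doeblin condition holds with x0 = 0):

```python
import numpy as np

Q = np.array([[0.5, 0.3, 0.2],
              [0.4, 0.4, 0.2],
              [0.6, 0.1, 0.3]])
eps = 0.35                      # Q_{x,0} > eps for every state x
mu = np.array([1.0, 0.0, 0.0])  # initial distribution

pi = np.linalg.matrix_power(Q, 200)[0]   # invariant measure, numerically
mun = mu.copy()
for n in range(1, 15):
    mun = mun @ Q
    assert np.abs(mun - pi).sum() <= 2 * (1 - eps) ** n
print("bound ||mu_n - pi||_1 <= 2(1 - eps)^n verified for n = 1..14")
```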

Exercise 4.34. Suppose that (Xn)n≥0 is an HMC with state space X, initial distribution µ and transition matrix Q. For each n ∈ N we set

    An := (1/(n+1)) Σ_{k=0}^{n} Q^k.

Suppose that there exist N ∈ N, x0 ∈ X and ε > 0 such that

    (AN)_{x,x0} > ε,  ∀x ∈ X.

Prove that the HMC is irreducible, positively recurrent, and the unique invariant
probability measure π satisfies

    ‖µAn − π‖1 ≤ N / ((n + 1)ε),  ∀n ∈ N.

□

Exercise 4.35. Prove Lemma 4.85. □

Exercise 4.36. Suppose that (X, E, c) is a finite connected electric network and
x+, x− are distinct vertices. The commute time between x+ and x− is the quantity

    K_{x+,x−} = E_{x+}[T_{x−}] + E_{x−}[T_{x+}].

Set

    C(X) := Σ_{e∈E} c(e).

(i) Prove that

    K_{x+,x−} = 2 C(X) E_{x+,x−},

where E_{x+,x−} denotes the energy of the Kirchhoff flow with source x+ and
sink x−; see (4.4.26). Hint. Use Exercise 4.28.
(ii) Consider the Ehrenfest urn with B balls defined in Example 4.7. Use (i) to
compute E0[TB]. In other words, given that initially all the B balls were in
the right chamber, find the expected time until all of them move to the left
chamber.

□
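Part (ii) can be cross-checked without the network formula by solving the hitting-time system of Exercise 4.25 directly (a sketch; B = 10 is an arbitrary choice, and the birth-death rates below are those of the Ehrenfest urn):

```python
import numpy as np

# Ehrenfest urn with B balls: the state is the number of balls in the left
# chamber; at each step a uniformly chosen ball switches chambers.
# Solve (4.6.3) for k(x) = E_x[T_B], with k(B) = 0.
B = 10
A = np.eye(B)                    # unknowns k(0), ..., k(B-1)
b = np.ones(B)
for x in range(B):
    if x > 0:
        A[x, x - 1] -= x / B         # a left ball moves right
    if x + 1 < B:
        A[x, x + 1] -= (B - x) / B   # a right ball moves left
k = np.linalg.solve(A, b)
print(k[0])  # expected time to move all B balls into the left chamber
```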

Chapter 5

Elements of Ergodic Theory

Ergodic theory is a rather eclectic subject with applications in many areas of math-
ematics, including probability. The ergodicity feature first appeared in the works
of L. Boltzmann on statistical mechanics, [23]. The modern formulation of this hy-
pothesis, due to Y. Sinai, came much later, in 1963, and it took a few more decades
to be adjudicated mathematically.
Our rather modest goal in this chapter is to describe enough of the fundamentals
of this theory so we can shed new light on some of the fundamental limit theorems
we have proved in the previous chapters. For more details we refer to [3; 11; 36; 93;
131; 154] that served as our main sources of inspiration.

5.1 The ergodic theorem

5.1.1 Measure preserving maps and invariant sets


Suppose that (Ω, S, P) is a probability space. A measurable map T : (Ω, S) → (Ω, S)
is said to be measure preserving if T#P = P, i.e.,

    P[T^{−1}(S)] = P[S],  ∀S ∈ S.   (5.1.1)
The measure preserving map T is called an automorphism of the probability space
if it is bijective, and its inverse is also measure preserving.
Proposition 1.29 shows that (5.1.1) is satisfied if and only if there exists a π-system C that generates S such that

    P[T^{−1}(C)] = P[C],  ∀C ∈ C.   (5.1.2)
Example 5.1. (a) Let P_{S^1} denote the Euclidean probability measure on S^1, the
unit circle in R², i.e. (see Example 1.87)

    P_{S^1}[dθ] = (1/2π) dθ.

We denote by Rϕ the counterclockwise rotation by an angle ϕ about the center of
this circle. Then Rϕ is measure preserving. If we think of S 1 as the set of complex
numbers of norm 1,
    S^1 := { z ∈ C; |z| = 1 },

then Rϕ(z) = e^{iϕ} z.


(b) Consider the n-dimensional torus Tn := Rn/Zn. Set I = [0, 1] and observe that
the natural projection π : I^n → Tn is Borel measurable. We denote by PTn the
push-forward by π of the Lebesgue measure on I^n. Let us observe that the resulting
probability space is isomorphic to the product of n copies of (S^1, B_{S^1}, P_{S^1}). Suppose
that A ∈ SLn(Z), i.e., A is an n × n matrix with integer coefficients and determinant
1. Then A(Zn) = Zn and thus we have a well defined induced map

    TA : Rn/Zn → Rn/Zn.

This map is clearly bijective and Borel measurable. It is also measure preserving
since det A = 1.

Fig. 5.1 Arnold’s cat map.

In [3] Arnold and Avez memorably depicted the action of the map TA for

    A = [ 1 1 ]
        [ 1 2 ]  ∈ SL2(Z),

as in Figure 5.1. This map is popularly known as Arnold's cat map.
(c) In the previous examples, the maps were automorphisms of the corresponding
probability spaces. Here is an example of a measure preserving map that is not
bijective. More precisely, define

    Q : S^1 → S^1,  Q(z) = z².

Then the Lebesgue measure (1/2π) dθ is Q-invariant. If we identify S^1 with R mod Z,
then we can describe Q as the map Q : [0, 1) → [0, 1) given by

    Q(x) = 2x mod 1.

If x ∈ [0, 1) has binary expansion

    x = Σ_{n=1}^∞ εn / 2^n,  εn ∈ {0, 1},

then

    Q(x) = Σ_{n=1}^∞ ε_{n+1} / 2^n.

(d) Consider the tent map T : [0, 1] → [0, 1], T (x) = min(2x, 2 − 2x). Equivalently,
this is the unique continuous map such that T (0) = T (1) = 0, T (1/2) = 1 and it is
linear on each of the intervals [0, 1/2] and [1/2, 1]. Its graph looks like a tent with
vertices (0, 0), (1/2, 1) and (1, 0).
This map preserves the Lebesgue measure. Indeed, if I ⊂ [0, 1] is a compact
interval then T −1 (I) consists of two intervals I± , symmetrically located with respect
to the midpoint 1/2 of [0, 1], and each having half the size of I.
(e) Suppose that X is a compact metric space and T : X → X is a continuous
map. Denote by Prob(X) the set of Borel probability measures on X. Then the map T
induces a push-forward map T# : Prob(X) → Prob(X). The T -invariant measures
are precisely the fixed points of T# . One can show (see Exercise 5.2) that the set
ProbT (X) of T -invariant measures is nonempty, convex and closed with respect to
weak convergence. □

Example 5.2 (Stationary sequences). Let (X, F) be a measurable space and


suppose that Xn : Ω → X, n ∈ N, is a sequence of random maps defined on the
same probability space (Ω, S, P). The sequence is said to be stationary if for any
m, k ∈ N the random vectors
    (X1, . . . , Xm) : Ω → X^m   and   (Xk+1, . . . , Xk+m) : Ω → X^m

have the same distribution.

(i) For example, a sequence of i.i.d. random variables is stationary. More generally,
an exchangeable sequence of random variables is stationary.
(ii) Suppose that (Xn )n≥0 is an HMC with state space X and transition matrix Q
and initial distribution µ. The sequence (Xn )n≥0 is stationary if and only if µ
is an invariant distribution, i.e., µ = µ · Q.

To any stationary sequence Xn : Ω → X we can canonically associate a measure
preserving map as follows. Consider the path space U = UX := X^N. It consists of
sequences of points in X, u = (u1, u2, . . .). We have natural coordinate maps

    Un : U → X,  Un(u) = un.

For n ∈ N denote by Un the sigma-algebra generated by U1, . . . , Un, and set

    U = ⋁_{n=1}^∞ Un.

Note that we have a U-measurable shift map

    Θ : U → U,  Θ(u1, u2, . . .) = (u2, u3, . . .).

A sequence of random variables Xn : (Ω, S, P) → (X, F), n ∈ N, defines a measurable
map

    X⃗ : (Ω, S) → (U, U),  ω ↦ (X1(ω), X2(ω), . . .).

The distribution of this sequence is the push-forward probability measure
PX⃗ := X⃗#P. Note that

    Xn = Un ∘ X⃗   and   U_{k+1} = U1 ∘ Θ^k,  ∀n, k ∈ N.
Since the measure PX~ is uniquely determined by its restrictions to the sigma-
subalgebras Un we deduce that the sequence (Xn )n∈N is stationary iff the shift
Θ preserves the distribution PX~ on the path space.
When X is finite or countable, and the sequence (Xn )n∈N is i.i.d., the resulting
shift is known as Bernoulli shift.
Conversely, if T is a measurable map T : (Ω, S, P) → (Ω, S, P), then it is measure
preserving if and only if, for any measurable function f : Ω → R the sequence
f, f ◦ T, f ◦ T 2 , . . .
is stationary. □

Definition 5.3. Suppose that (Ω, S) is a measurable space and T : (Ω, S) → (Ω, S)
is a measurable map.

(i) A measurable function f : Ω → R is called T -invariant if f ◦ T = f .


(ii) A measurable set S ∈ S is called T -invariant if its indicator I S is an invariant
function.
(iii) We denote by I = IT the collection of invariant sets. □

Remark 5.4. (a) Note that the definition of IT involves no measure on S.


(b) Note that if S ∈ S, then I S ◦ T = I T −1 (S) so the set S is T -invariant iff
S = T −1 (S). Observe that
S ⊂ T −1 (S) ⇐⇒ T (S) ⊂ S, (5.1.3a)
−1
T (S) ⊂ S ⇐⇒ ∀ω ∈ Ω, T (ω) ∈ S ⇒ ω ∈ S. (5.1.3b)
We can give a dynamic description of invariance. For ω ∈ Ω we denote by OT (ω)
the orbit of ω with respect to the action of T
    OT(ω) := { ω, T(ω), T²(ω), . . . }.


A set S is invariant if and only if


ω ∈ S ⇒ OT (ω) ⊂ S and ω ∈ Ω \ S ⇒ OT (ω) ⊂ Ω \ S.
In the universal case (UX , U), a subset S ∈ UX is Θ-invariant if
s = (s1 , s2 , . . . ) ∈ S ⇒ (s2 , s3 , . . . ) ∈ S,
(s2 , s3 , . . . ) ∈ S ⇒ ∀s1 ∈ X : (s1 , s2 , s3 , . . . ) ∈ S.
Note that if T is an automorphism, then a set S is invariant iff T(S) = S. □

Proposition 5.5. Suppose that (Ω, S) is a measurable space and T : (Ω, S) → (Ω, S)
is a measurable map. Then the following hold.

(i) The collection I = IT of T -invariant measurable sets is a sigma-subalgebra


of S.
(ii) An S-measurable function f : Ω → R is T -invariant if and only if it is IT -
measurable.

Proof. (i) This follows from the fact that S ∈ IT if and only if S = T^{−1}(S).
(ii) Suppose that f is T -invariant. Then for any x ∈ R the set S = {f ≤ x} is
T -invariant since I S ◦ T = I {f ◦T ≤x} = I S .
Conversely, if f is IT -measurable, then f −1 ({y}) ∈ IT , ∀y ∈ R and
(f ◦ T )−1 ({y}) = T −1 f −1 ({y}) = f −1 ({y}).


If f(x) = y, then x ∈ f^{−1}({y}) = (f ∘ T)^{−1}({y}), so that f ∘ T(x) = y = f(x). □

Remark 5.6. Consider the path space UX = X^N. We have the tail sigma-subalgebra

    T∞ = ⋂_{m≥1} Tm,   Tm = σ( Um, Um+1, . . . ).

Note that
S ∈ Tm+1 ⇐⇒ Θm S ∈ T1 = U, ∀m ≥ 0.
The shift map Θ is surjective and if S is Θ-invariant, then (5.1.3a) and (5.1.3b)
imply that ΘS = S. In particular, Θm S = S ∈ T1 , so S ∈ Tm , ∀m. Hence, in the
universal case I = IΘ ⊂ T∞ .
Observe that the sigma-algebras IΘ and T∞ do not depend on any choice of probability measure on UX. □

Definition 5.7. Suppose that T : (Ω, S, P) → (Ω, S, P) is a measure preserving


map. A measurable function f : (Ω, S) → R is said to be quasi-invariant if
f = f ◦ T P − a.s.
A subset S ∈ S is said to be quasi-invariant if I S is quasi-invariant, i.e.,
    P[ S ∆ T^{−1}(S) ] = 0,
 

where A∆B := (A \ B) ∪ (B \ A) is the symmetric difference of two sets. We denote


by J = JT the collection of T-quasi-invariant sets. □

Proposition 5.8. Suppose that T : (Ω, S, P) → (Ω, S, P) is a measure preserving


map. Then the following hold.

(i) The collection JT of quasi-invariant sets is a sigma-algebra.
(ii) The P-completions of I and J coincide, i.e., for any S ∈ J there exists S′ ∈ I
such that P[S ∆ S′] = 0.
(iii) A measurable function is T-quasi-invariant if and only if it is JT-measurable.

Proof. (i) The fact that JT is a sigma-algebra follows immediately from the definition of quasi-invariance.
(ii) Denote by Ī the P-completion of I. Let S̄ ∈ Ī. There exists S ∈ I such that
P[S̄ ∆ S] = 0. Since T is measure preserving we deduce

    0 = P[T^{−1}(S̄ ∆ S)] = P[T^{−1}(S̄) ∆ S]

and thus

    P[T^{−1}(S̄) ∆ S̄] = E[ |I_{T^{−1}(S̄)} − I_{S̄}| ]
        ≤ E[ |I_{T^{−1}(S̄)} − I_S| ] + E[ |I_S − I_{S̄}| ] = 0.
Conversely, if S ∈ J define

    S̄ := ⋂_{n∈N} Sn,   Sn := ⋃_{k≥n} T^{−k}(S).

Note that S̄ = T^{−1}(S̄), so that S̄ is invariant. Since S1 ⊃ S2 ⊃ · · · , we have

    I_{S̄} = lim_{n→∞} I_{Sn}.

On the other hand,

    I_{Sn} = sup_{k≥n} I_{T^{−k}(S)} = sup_{k≥n} I_S ∘ T^k = I_S a.s.,

since S is quasi-invariant and thus I_S ∘ T^k = I_S a.s. Hence I_{S̄} = I_S a.s., so that
S ∈ Ī.
(iii) Clearly, if f is quasi-invariant, then so are the sublevel sets {f ≤ x}, ∀x ∈ R
and thus f is JT -measurable.
Conversely if f is JT -measurable, then so are f ± and it suffices to show that
if f ≥ 0 is J-measurable, then f is quasi-invariant. Clearly any J-measurable ele-
mentary function is quasi-invariant. Since f is an increasing limit of J-measurable
elementary functions, it is therefore an increasing limit of quasi-invariant elementary
functions and thus it is quasi-invariant. □

Definition 5.9. Let (Ω, S, P) be probability space. A measure preserving map


T : Ω → Ω is said to be ergodic if any T -invariant set has measure 0 or 1, i.e., the
sigma-algebra of invariant sets is a zero-one algebra. □

From Proposition 5.8 we deduce the following equivalent characterization of


ergodicity.

Proposition 5.10. The map T is ergodic if and only if any P-quasi-invariant set
is a zero-one event, i.e., has measure 0 or 1. □

Remark 5.11. Suppose that T is an ergodic automorphism of the probability space


(Ω, S, P). If S ∈ S, then the set

    Ŝ := ⋃_{n≥0} T^n(S)

is quasi-invariant: indeed, T(Ŝ) ⊂ Ŝ and P[T(Ŝ)] = P[Ŝ]. If
P[S] > 0, then P[Ŝ] > 0 and the ergodicity of T implies that

    P[Ω \ Ŝ] = 0.

The set Ŝ is a union of orbits OT(ω) = {T^n(ω)}n≥0 of the dynamical system on Ω
determined by the iterates of T . The ergodicity shows that the orbits originating
in a set S of positive measure reach almost any point in Ω; the unreachable ones
form a negligible set. This shows that the dynamics of an ergodic automorphism is
quite chaotic: orbits want to fill the space. □

Definition 5.12. Suppose that Xn : (Ω, S, P) → (X, F), n ∈ N, is a sequence of
measurable maps. We say that (Xn)n∈N is a Kolmogorov sequence if its tail algebra

    T∞ := ⋂_{m∈N} Tm,   Tm := σ( Xn, n ≥ m )

is a zero-one algebra. □

As shown in Remark 5.6 the sigma-algebra I of Θ-invariant sets is contained in


the tail algebra. Hence, if (Xn )n∈N is a stationary Kolmogorov sequence, then the
shift map Θ on the associated path space XN is ergodic. In particular, if (Xn )n∈N
is a sequence of i.i.d. random variables, then Kolmogorov’s 0-1 theorem shows that
the shift map on the path space is ergodic.

Example 5.13. Consider the map Q : [0, 1) → [0, 1) discussed in Example 5.1(c).
The interval [0, 1) embeds in {0, 1}^N via

    [0, 1) ∋ x = Σ_{n=1}^∞ εn / 2^n  ↦  (ε1, ε2, . . .) ∈ {0, 1}^N.
The image of the map is a shift-invariant subset of {0, 1}N . Its complement is neg-
ligible with respect to the product measure on {0, 1}N and the restriction of the
product measure on the image of this embedding coincides with the Lebesgue mea-
sure; see Exercise 1.3(vii). The space {0, 1}N equipped with the product measure
is the path space corresponding to an i.i.d. sequence of Bernoulli random variables
with success probability 1/2. Hence the shift map is ergodic, proving that the map
Q is also ergodic. □

We have the following characterization of Kolmogorov sequences due to Black-


well and Freedman [14].

Theorem 5.14. Suppose that Xn : (Ω, S, P) → (X, F), n ∈ N is a sequence of


measurable maps. The following are equivalent.

(i) The sequence is a Kolmogorov sequence.


(ii) For any A ∈ S,

    lim_{m→∞} sup_{B∈Tm} | P[A ∩ B] − P[A] P[B] | = 0,   (5.1.4)

where we recall that Tm = σ( Xn, n ≥ m ).




Proof. (i) ⇒ (ii) For any B ∈ Tm and A ∈ S we have

    | P[A ∩ B] − P[A] P[B] | = | E[I_A I_B] − E[I_A] E[I_B] |

    = | ∫_B ( E[I_A | Tm] − P[A] ) dP | ≤ ∫_Ω | E[I_A | Tm] − P[A] | dP.

Hence

    sup_{B∈Tm} | P[A ∩ B] − P[A] P[B] | ≤ ∫_Ω | E[I_A | Tm] − P[A] | dP.   (5.1.5)

The Backwards Martingale Convergence Theorem implies that

    E[I_A | Tm] → E[I_A | T∞]  a.s. and in L1.

Since T∞ is a zero-one algebra, we deduce that

    E[I_A | T∞] = E[I_A] = P[A]  a.s.

Using this in (5.1.5) we obtain (5.1.4).


(ii) ⇒ (i) Let A ∈ T∞. Then A ∈ Tm for all m, and thus

    0 ≤ P[A] − P[A]² = P[A ∩ A] − P[A] P[A]
      ≤ sup_{B∈Tm} | P[A ∩ B] − P[A] P[B] | → 0.

Hence P[A] = P[A]², so that P[A] ∈ {0, 1} and T∞ is a zero-one algebra. □

5.1.2 Ergodic theorems


Let (Ω, S, P) be a probability space and suppose that T : Ω → Ω is a measure preserving
map. For any measurable function f : (Ω, S, P) → R we denote by T̂f its pullback
by T,

    T̂f := f ∘ T.

This is a measurable function and, since T is measure preserving, we deduce from
the change-of-variables formula (Theorem 1.75) that

    ∫_Ω T̂f dP = ∫_Ω f d(T#P) = ∫_Ω f dP,  ∀f ∈ L1(Ω, S, P).

Note also that

    (T̂f)^p = T̂(f^p),  ∀f ∈ L0+(Ω, S), p ≥ 1.

We will denote by ‖·‖p the norm of Lp(Ω, S, P). We deduce that T̂ defines
isometries

    T̂ : Lp(Ω, S, P) → Lp(Ω, S, P),  ∀p ≥ 1.

The operator T̂ is referred to as the Koopman operator.
We denote by J = JT the sigma-subalgebra of quasi-invariant measurable subsets.
Thus S ∈ J if and only if T̂ I_S = I_S, P-a.s.
More generally, if f ∈ L1(Ω, S, P) and S ∈ J, then

    T̂ I_S ∈ L1(Ω, S, P)  and  T̂(f I_S) = (T̂f) · (T̂ I_S) = (T̂f) · I_S,

so that

    ∫_Ω f I_S dP = ∫_Ω T̂(f I_S) dP = ∫_Ω (T̂f) · I_S dP,  ∀S ∈ J, f ∈ L1(Ω, S, P).   (5.1.6)

For each p ≥ 1 we set

    QT,p := { f ∈ Lp(Ω, S, P); T̂f = f }.   (5.1.7)

In other words, QT,p consists of the quasi-invariant Lp-functions, i.e.,

    QT,p = Lp(Ω, J, P) = Lp(Ω, I, P).

We set

    QT := QT,2 = L2(Ω, J, P)   (5.1.8)

and we denote by PT the orthogonal projection onto QT. In the proof of Theorem 1.162 we have shown that

    PT f = E[f | J].

The space QT contains the constant functions, so dim QT ≥ 1.

Proposition 5.15. Suppose that T : (Ω, S, P) → (Ω, S, P) is a measure preserving
map. Then the following statements are equivalent.

(i) The map T is ergodic.
(ii) For any p ≥ 1, dim QT,p = 1.
(iii) There exists p ≥ 1 such that dim QT,p = 1.

Proof. (i) ⇒ (ii) Assume that T is ergodic, so J is a zero-one sigma-subalgebra.
Then any J-measurable elementary function is a.s. constant, and thus any function
in QT,p = Lp(Ω, J, P) must be a.s. constant, as a limit of elementary functions.
Clearly (ii) ⇒ (iii). To prove the implication (iii) ⇒ (i), note that for any S ∈ J
the indicator I_S belongs to QT,p and thus must be a.s. constant, i.e., P[S] ∈ {0, 1}. □

To summarize,

    T is ergodic ⟺ dim QT = 1.   (5.1.9)

For each n we denote by An the n-th temporal average/mean operator

    f ↦ An f = (1/n) ( 1 + T̂ + T̂² + · · · + T̂^{n−1} ) f.

Note that An defines linear operators

    An : Lp(Ω, S, P) → Lp(Ω, S, P),  p ≥ 1,

satisfying

    ‖An f‖p ≤ ‖f‖p,  ∀f ∈ Lp.   (5.1.10)

Remark 5.16. Let me briefly explain the intuition of the temporal averages An (f ).
Think of Ω as the space of states of a physical system that evolves in discrete time.
Thus, if the system was initially in the state ω, it will be in the state T n (ω) after
n units of time.
A function f : Ω → R can be viewed as a macroscopic quantity that associates
to each state ω a measurable numerical quantity f (ω). Note that for each n ∈ N
and each ω ∈ Ω we have

    (An+1 f)(ω) = ( f(ω) + f(Tω) + · · · + f(T^n ω) ) / (n + 1),

which is the average value of the macroscopic quantity f as the system evolves for n units
of time. □
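This intuition can be tested numerically on the circle rotation of Example 5.1(a): for an irrational angle the temporal averages of a nice function settle down to its spatial average (a sketch; the specific α, f and starting point are our choices, and the ergodicity of irrational rotations is taken on faith here):

```python
import math

alpha = math.sqrt(2) - 1            # irrational rotation angle (mod 1)
f = lambda x: math.cos(2 * math.pi * x)

x, total, n = 0.1, 0.0, 100_000
for _ in range(n):
    total += f(x)
    x = (x + alpha) % 1.0           # one step of the rotation
print(total / n)  # temporal average; the spatial average of f is 0
```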

We have the following mean ergodic theorem due to John von Neumann.

Theorem 5.17 (L2 -Mean ergodic theorem). Suppose that (Ω, S, P) is a proba-
bility space and T : Ω → Ω is a measure preserving map. Then, ∀f ∈ L2 (Ω, S, P),
the temporal averages An f converge in L2 to the orthogonal projection of f onto
the space QT of quasi-invariant functions, i.e.,
(1/n)(1 + Tb + Tb2 + · · · + Tbn−1 )f → PT f = E[f ∥ J].
In particular, if T is ergodic we have
(1/n)(1 + Tb + Tb2 + · · · + Tbn−1 )f → E[f ] I Ω in L2 .
Proof. Denote by X2 the collection of functions f ∈ L2 (Ω, S, P) such that An f
converges in L2 to some function A∞ f . Clearly X2 is a vector space. We will
gradually show that X2 = L2 (Ω, S, P) and A∞ = PT .
1. QT ⊂ X2 and A∞ f = f , ∀f ∈ QT .
Indeed An f = f , ∀f ∈ QT .
2. ∀f ∈ X2 , we have Tbf ∈ X2 and A∞ f ∈ QT .

Let f ∈ X2 . Note first that Tb commutes with An . Since Tb is continuous we deduce
lim_{n→∞} An Tbf = lim_{n→∞} TbAn f = TbA∞ f,
i.e., Tbf ∈ X2 and A∞ Tbf = TbA∞ f .
On the other hand,
n An Tbf = (n + 1)An+1 f − f =⇒ An Tbf = ((n + 1)/n) An+1 f − (1/n) f,
so that
TbA∞ f = A∞ Tbf = lim_{n→∞} [ ((n + 1)/n) An+1 f − (1/n) f ] = A∞ f.
Hence A∞ f ∈ QT .
3. Tbf − f ∈ X2 , ∀f ∈ L2 (Ω, S, P).
Indeed
An (Tbf − f ) = (1/n)(Tbn f − f )
and, since Tb is unitary, we deduce that
‖An (Tbf − f )‖2 ≤ (1/n)( ‖Tbn f ‖2 + ‖f ‖2 ) = (2/n)‖f ‖2 → 0.
4. ∀f ∈ L2 (Ω, S, P), ∀k ∈ N we have Tbk f − f ∈ Q⊥T .
We first prove the claim for k = 1. Indeed, for any g ∈ QT we have Tbg = g and
(Tbf − f, g) = (Tbf, g) − (f, g) = (Tbf, Tbg) − (f, g) = 0,
where at the last step we used the fact that Tb is unitary. In general
Tbk f − f = Σ_{j=1}^{k} (Tbj f − Tbj−1 f ) = Σ_{j=1}^{k} (Tbfj − fj ), fj := Tbj−1 f,
and Tbfj − fj ∈ Q⊥T , ∀j.
5. A∞ f = PT f , ∀f ∈ X2 .
We have
f − An f = (1/n) Σ_{k=1}^{n−1} (f − Tbk f ) ∈ Q⊥T ,
by Step 4. Letting n → ∞ and using the fact that Q⊥T is a closed subspace of L2 we deduce f − A∞ f ∈ Q⊥T . Since A∞ f ∈ QT by Step 2, it follows that A∞ f = PT f .

6. X2 is closed.
Let (fk )k∈N be a sequence in X2 that converges in L2 to f . To show that f ∈ X2 we will show that the sequence An f is Cauchy. Fix ε > 0. We have
‖An f − Am f ‖2 ≤ ‖An f − An fk ‖2 + ‖An fk − Am fk ‖2 + ‖Am fk − Am f ‖2
(use (5.1.10), i.e., ‖An ‖ ≤ 1 as an operator L2 → L2 )
≤ ‖f − fk ‖2 + ‖An fk − Am fk ‖2 + ‖f − fk ‖2 .
Hence
‖An f − Am f ‖2 ≤ 2‖f − fk ‖2 + ‖An fk − Am fk ‖2 , ∀k, m, n.
Fix k such that
‖f − fk ‖2 < ε/3.
The sequence (An fk )n∈N is convergent since fk ∈ X2 . It is thus Cauchy, so there exists N = N (ε, k) such that ∀m, n > N
‖An fk − Am fk ‖2 < ε/3.
Hence
‖An f − Am f ‖2 ≤ ε, ∀m, n > N (ε, k).

7. X2 = L2 (Ω, S, P).
We know that QT ⊂ X2 and
Range(Tb − 1) ⊂ Q⊥T ∩ X2 .
At this point we invoke a classical result of functional analysis: if S : H → H is a bounded linear operator on a Hilbert space, then the closure of the range of S is (ker S ∗ )⊥ ; see e.g. [22, Cor. 2.18].
The operator Tb is unitary, so that Tb∗ = Tb−1 . Hence if we let S = Tb − 1, then
S ∗ = Tb∗ − 1 = Tb−1 − 1.
We deduce
closure( Range(Tb − 1) ) = ( ker(Tb−1 − 1) )⊥ = Q⊥T .
Since X2 is closed we deduce Q⊥T ⊂ X2 , so that X2 ⊃ QT + Q⊥T = L2 . This completes the proof of Theorem 5.17. t
u

Corollary 5.18. Suppose that (Ω, S, P) is a probability space and T : Ω → Ω is a measure preserving map. Then, ∀f ∈ L1 (Ω, S, P), the temporal averages An f converge in L1 to E[f ∥ J], i.e.,
(1/n)(1 + Tb + Tb2 + · · · + Tbn−1 )f → E[f ∥ J] in L1 as n → ∞.
Proof. Denote by X1 the collection of functions f ∈ L1 (Ω, S, P) such that An f
converges in L1 to some function A∞ f . Since ‖ − ‖1 ≤ ‖ − ‖2 we deduce that X2 ⊂ X1 . The argument in Step 6 in the proof of Theorem 5.17 extends without change to L1 since, according to (5.1.10), the operators An are contractions

An : L1 → L1 . This proves that X1 is a closed subspace of L1 that contains L2 and


thus X1 = L1 .
From (5.1.6) we deduce that for any f ∈ L1 (Ω, S, P) and any S ∈ J we have
∫Ω f I S dP = ∫Ω (An f ) I S dP.
Letting n → ∞ we deduce
∫Ω f I S dP = ∫Ω (A∞ f ) I S dP, ∀S ∈ J.
Hence (see Definition 1.157) A∞ f = E[f ∥ J]. t
u

We can now formulate and prove Birkhoff ’s ergodic theorem.

Theorem 5.19 (Birkhoff ’s ergodic theorem). Let (Ω, S, P) be a probability space and suppose that T : Ω → Ω is a measure preserving map. If f ∈ L1 (Ω, S, P), then the temporal averages
An+1 (f ) = (1/(n + 1))( f + f ◦ T + · · · + f ◦ T n ) = (1/(n + 1))( f + Tbf + · · · + Tbn f )
converge a.s. to E[f ∥ J].

Proof. Denote by X0 the set of functions f ∈ L1 (Ω, S, P) such that An f con-


verges a.s. to a function A∞ f ∈ L1 . Corollary 5.18 shows that in this case
A∞ f = E f k J . Clearly X0 is a vector subspace of L1 (Ω, S, P).
We will show that X0 = L1 (Ω, S, P) in two steps.

(i) The set X0 is a closed subspace of L1 (Ω, S, P).


(ii) Tbf − f ∈ X0 , ∀f ∈ L1 (Ω, S, P).

The claim (i) is the difficult one. Temporarily assuming its validity we will show
how it implies (ii) and the conclusion of the theorem.
Proof of (ii) assuming (i). Observe that for any f ∈ L1 we have
An (Tbf − f ) = (1/n)(Tbn f − f ).
In particular, if f ∈ L∞ we deduce
‖An (Tbf − f )‖∞ ≤ (2/n)‖f ‖∞ ,
so An (Tbf − f ) → 0 a.s. and thus Tbf − f ∈ X0 for f ∈ L∞ .

b b
Suppose now that f ∈ L1 . Then f = f+ − f− and (Tbf )± = Tbf± . Thus it suffices
to show that Tbf − f ∈ X0 if f ∈ L1 and f ≥ 0 a.s.
In this case we can find a sequence of elementary functions fn such that fn % f .
Hence
Tbfn − fn → Tbf − f in L1 .

Since the functions fn are bounded, so are the functions Tbfn − fn , and we deduce that Tbfn − fn ∈ X0 . We know from (i) that X0 is L1 -closed. This proves (ii).
From (ii) we deduce that Tbf − f ∈ X0 , ∀f ∈ L2 ⊂ L1 . Since X0 is closed in L1
we deduce from the proof of Theorem 5.17 that
closureL2 range(Tb − 1) ⊂ closureL1 range(Tb − 1) ⊂ X0 .
 

On the other hand, QT,2 ⊂ X0 , so that
L2 = X2 = QT,2 + Q⊥T,2 = QT,2 + closureL2 ( range(Tb − 1) ) ⊂ X0 .
Since L2 is dense in L1 , and X0 is closed in L1 , we conclude that X0 = L1 .


Proof of (i). The proof of this result is based on a technical inequality similar in spirit to Doob’s maximal inequality (3.2.31). For f ∈ L1 (Ω, S, P) we define M f ∈ L0 (Ω, S),
M f (ω) := sup_{n≥1} An f (ω) = sup_{n≥1} (1/n)[ f (ω) + f (T ω) + · · · + f (T n−1 ω) ].

Lemma 5.20 (Maximal Ergodic Lemma). ∀λ > 0, ∀f ∈ L1 (Ω, S, P):
λ P[ M |f | > λ ] ≤ ‖f ‖1 . (5.1.11)
t
u

Let us first explain why the Maximal Ergodic Lemma implies the claim (i).
Suppose that the sequence (fk ) in X0 converges in L1 to a function f . We want
to show that the sequence An (f ) is a.s. Cauchy, i.e., for every ε > 0, the set
⋃_N XN (f, ε), XN (f, ε) := ⋂_{m,n>N} { |An (f ) − Am (f )| < ε },
has measure 1. Since XN (f, ε) ⊂ XN ′ (f, ε) for N < N ′ this is equivalent to
lim_{N →∞} P[ XN (f, ε) ] = 1. (5.1.12)

Fix ε > 0. Note that
|An (f ) − Am (f )| ≤ |An (f ) − An (fk )| + |An (fk ) − Am (fk )| + |Am (fk ) − Am (f )|
≤ An (|f − fk |) + |An (fk ) − Am (fk )| + Am (|f − fk |)
≤ 2M[ |fk − f | ] + |An (fk ) − Am (fk )|.
We deduce
XN (f, ε) ⊃ { 2M[ |fk − f | ] < ε/2 } ∩ XN (fk , ε/2), ∀N, k.

Letting N → ∞ we deduce
lim_{N →∞} P[ XN (f, ε) ] ≥ lim_{N →∞} P[ { 2M[ |fk − f | ] < ε/2 } ∩ XN (fk , ε/2) ].
From the inclusion-exclusion principle we deduce that
P[ { 2M[ |fk − f | ] < ε/2 } ∩ XN (fk , ε/2) ] = P[ 2M[ |fk − f | ] < ε/2 ] + P[ XN (fk , ε/2) ] − P[ { 2M[ |fk − f | ] < ε/2 } ∪ XN (fk , ε/2) ].
Since fk ∈ X0 , the sequence (An (fk ))n≥1 is a.s. Cauchy so, for any k,
lim_{N →∞} P[ { 2M[ |fk − f | ] < ε/2 } ∪ XN (fk , ε/2) ] = lim_{N →∞} P[ XN (fk , ε/2) ] = 1.
Hence, ∀k,
lim_{N →∞} P[ { 2M[ |fk − f | ] < ε/2 } ∩ XN (fk , ε/2) ] = P[ 2M[ |fk − f | ] < ε/2 ].
We deduce that
lim_{N →∞} P[ XN (f, ε) ] ≥ P[ 2M[ |fk − f | ] < ε/2 ] ≥ 1 − (4/ε)‖f − fk ‖1 , ∀k,
where at the last step we used (5.1.11) with λ = ε/4.
Letting k → ∞ we obtain (5.1.12).
Proof of the Maximal Lemma. Let us observe that the inequality (5.1.11) follows from
∫_{{M[g]>0}} g dP ≥ 0, ∀g ∈ L1 (Ω, S, P). (5.1.13)
Indeed, if in (5.1.13) we let g = f − λ, λ > 0, then { M[g] > 0 } = { M[f ] > λ } and
‖f ‖1 ≥ ∫_{{M[f ]>λ}} f dP ≥ λ P[ M[f ] > λ ].

We will present two proofs of (5.1.13). The first proof, due to F. Riesz, is a bit
longer but a bit more intuitive. The second proof, due to A. Garsia [67] is a lot
shorter but less intuitive.
Set
X := { M[g] > 0 } ⊂ Ω.
Define
Sn (g) := Σ_{j=0}^{n−1} g ◦ T j , Mn [g] := max_{1≤k≤n} Sk (g), Xk := { Mk [g] > 0 }. (5.1.14)

First proof of (5.1.13). Note that Xk ⊂ Xk+1 and
∫_X g dP = lim_{n→∞} ∫_{Xn} g dP = lim_{n→∞} (1/n) Σ_{k=1}^{n} ∫_{Xk} g dP.

At the last step we used the fact that the Cesàro means of a convergent sequence have the same limit as the sequence; see Exercise 2.3 with pn,k = 1/n. Thus, it suffices to show that
Σ_{k=1}^{n} ∫_{Xk} g dP ≥ 0, ∀n. (5.1.15)

Fix n. We have
Σ_{k=1}^{n} ∫_{Xk} g dP = Σ_{j=0}^{n−1} ∫_{Xn−j} g dP = Σ_{j=0}^{n−1} ∫_{T −j (Xn−j )} g ◦ T j dP,
where at the last step we used the change-in-variables formula (1.2.21) and the fact that T is measure preserving. We set Yj := T −j (Xn−j ). Hence
Σ_{k=1}^{n} ∫_{Xk} g dP = ∫Ω ( Σ_{j=0}^{n−1} g(T j ω) I Yj (ω) ) P[dω].

We will prove the stronger fact
h(ω) := Σ_{j=0}^{n−1} g(T j ω) I Yj (ω) ≥ 0, ∀ω ∈ Ω. (5.1.16)

Let ω ∈ Ω. Set xj = xj (ω) := g(T j ω). Note that ω ∈ Yj if and only if


T j (ω) ∈ Xn−j , i.e., at least one of the numbers
xj , xj + xj+1 , . . . , xj + xj+1 + · · · + xn−1
is positive. The inequality (5.1.16) is a special case of the following cute combina-
torial lemma of F. Riesz [133].

Lemma 5.21. Suppose we are given a finite sequence of real numbers
x := (x0 , . . . , xn−1 ).
We say that xj is a leading term of x if there exists ℓ ≥ j such that xj + · · · + xℓ > 0. Then the sum of the leading terms is ≥ 0.

Proof. The lemma is easily proved by induction on n. For n = 1 this is obviously


true. Assume that it is true for any m < n and any sequence of m real numbers.
Denote by L the set of indices j = 0, 1, . . . , n − 1 such that xj is a leading terms. If
L = ∅ the conclusion is trivially true.
Suppose L ≠ ∅, set j0 := min L and denote by ℓ0 the smallest ℓ ≥ j0 such that
xj0 + · · · + xℓ > 0.
If ℓ0 = j0 , then xj0 > 0. Suppose that ℓ0 > j0 . The minimality of ℓ0 implies that for any j such that j0 ≤ j < ℓ0 we have xj0 + · · · + xj ≤ 0, so that, for j0 < k ≤ ℓ0 , we have
xk + · · · + xℓ0 > 0.

This proves that each of the terms xj0 , xj0 +1 , . . . , xℓ0 is a leading term. Their sum is obviously nonnegative.
Consider now the (shorter) sequence
y : y0 := xℓ0 +1 , . . . , ym−1 := xn−1 , m := n − 1 − ℓ0 < n.
The induction assumption implies that the sum of the leading terms of y is ≥ 0. The minimality of j0 implies that the leading terms of x are xj0 , . . . , xℓ0 together with the leading terms of y. This proves Lemma 5.21 and completes the proof of
Theorem 5.19. t
u

Second proof of (5.1.13). We continue using the notations (5.1.14). Set Gn := Mn [g].
Since Xn % X, it suffices to show that
Z
gdP ≥ 0, ∀n.
Xn

The operator f 7→ Tbf is monotone, i.e., f0 ≤ f1 ⇒ Tbf0 ≤ Tbf1 , and we deduce that for 1 ≤ k ≤ m we have
Sk−1 (g) ≤ max_{1≤j≤m−1} Sj (g) = Gm−1 ≤ G+m−1
and
Sk (g) = g + TbSk−1 (g) ≤ g + TbGm−1 ≤ g + TbG+m−1 ,
so that
Gm−1 ≤ Gm ≤ g + TbG+m−1 , ∀m ∈ N,
or equivalently
g ≥ Gn − TbG+n , ∀n.

We deduce
∫_{Xn} g ≥ ∫_{Xn} Gn − ∫_{Xn} TbG+n
(TbG+n ≥ 0 on Ω, Gn = G+n on Xn , G+n = 0 on Ω \ Xn )
≥ ∫_{Xn} G+n − ∫Ω TbG+n = ∫Ω G+n − ∫Ω G+n = 0,
where, at the last step, the equality ∫Ω TbG+n = ∫Ω G+n is due to the fact that T is measure preserving. t
u

Remark 5.22. In Remark 5.11 we suggested that the ergodicity condition points
to a chaotic behavior of the dynamics of the iterates of T . The ergodic theorem
makes this much more precise.

Suppose that T is a measure preserving self-map of the probability space


(Ω, S, P). If T is ergodic, then for any subset S ∈ S there exists a negligible subset
N ∈ S such that
∀ω ∈ Ω \ N : lim_{n→∞} (1/n) #{ k ; T k ω ∈ S, 0 ≤ k < n } = P[S]. (5.1.17)
Indeed, the left-hand-side of (5.1.17) is the limit of the temporal averages An [I S ](ω), so (5.1.17) follows from the Ergodic Theorem 5.19. Observe that (5.1.17) states that for most
ω, the orbit OT (ω) spends equal amounts of time in sets of equal measures. In
other words, most orbits are equidistributed. The equidistribution phenomenon
characterizes ergodicity.
Let us observe that, conversely, if a measure preserving map T satisfies the above
equidistribution property, then it has to be ergodic. Indeed, suppose that S is a T -
invariant set. Let N be a negligible set as in (5.1.17). Then for any ω ∈ (Ω \ S) \ N
the orbit OT (ω) does not intersect S since  S is invariant. In this case the left-
hand-side of (5.1.17) is equal to zero so P S = 0. Thus if S is invariant and its
complement Ω \ S is not negligible, then S must be so. Hence T is ergodic.
If we partition Ω into a finite number of measurable sets S1 , . . . , SN with pk = P[Sk ] > 0, ∀k, then there exists a negligible set N ⊂ Ω so that for any
ω ∈ Ω \ N the orbit OT (ω) will be located at each moment of time in one of the
chambers Sk of this partition. Moreover, it spends a fraction pk of the time in the
chamber Sk . From this point of view, we can regard the dynamics as hopscotching
randomly from one chamber to another, and each chamber is frequented as often
as its size. We want to warn that this hopscotching need not have a Markovian
nature. t
u
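The occupation-time statement (5.1.17) is easy to illustrate numerically. In the sketch below (our example; the map, the seed point and the sample size are illustrative choices) we use the logistic map L(x) = 4x(1 − x), which — see Exercise 5.10 at the end of this chapter — preserves the arcsine measure µ[dx] = dx/(π√(x(1 − x))) and is ergodic for it; the fraction of time an orbit spends in [0, 1/4] should approach µ([0, 1/4]) = (2/π) arcsin(1/2) = 1/3:

```python
import math

def time_fraction(T, x0, in_set, n):
    """Fraction of the first n iterates T^k(x0), 0 <= k < n, that land in the set."""
    x, hits = x0, 0
    for _ in range(n):
        hits += in_set(x)
        x = T(x)
    return hits / n

L = lambda x: 4.0 * x * (1.0 - x)             # logistic map
frac = time_fraction(L, x0=0.123456789, in_set=lambda x: x <= 0.25, n=500_000)
expected = (2.0 / math.pi) * math.asin(0.5)   # mu([0, 1/4]) = 1/3
```

Note that the limiting fraction is the µ-measure of the set, not its Lebesgue measure: the orbit spends more time near the endpoints, where the arcsine density blows up.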

5.2 Applications

Ergodicity is the unifying principle behind some of the limit theorems we have
discussed in the previous chapters and it is the source of many interesting non-
probabilistic results.

5.2.1 Limit theorems


The Strong Law of Large Numbers is a consequence of the Ergodic Theorem.

Example 5.23 (I.i.d. random variables). Suppose that (Xn )n∈N is a sequence
of i.i.d. integrable random variables defined on the same probability space (Ω, S, P).
Kolmogorov’s 0-1 theorem shows that this is a Kolmogorov family, thus ergodic.
Consider the coordinate maps on the path space
Un : RN → R, Un (u1 , u2 , . . . ) = un .
The Ergodic Theorem implies
(1/n)( U1 + · · · + Un ) → E[U1 ] µ-a.s.
Observing that Xn = Un ◦ X ~ we deduce the Strong Law of Large Numbers. t
u
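A quick simulation in the spirit of Example 5.23; the distribution and sample size are arbitrary choices of ours. Along a single path, the ergodic average of the first coordinate is just the sample mean, and it settles at the expectation:

```python
import random

random.seed(3)
# i.i.d. Exp(2) variables; the shift on the path space is ergodic, so the
# Birkhoff average of U_1, i.e. the sample mean, converges a.s. to E[X] = 1/2.
n = 100_000
xs = [random.expovariate(2.0) for _ in range(n)]
sample_mean = sum(xs) / n
```

This is, of course, exactly the Strong Law of Large Numbers restated in ergodic language.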

Example 5.24 (Markov chains). Consider a HMC (Xn )n≥0 with state space X ,
transition matrix Q and initial distribution µ. The path space of this Markov chain
(see Theorem 4.3) is the probability space
U = Uµ = X N0 , E, Pµ ,


where

Un u0 , u1 , u2 , . . . = un .
Recall that for any x ∈ X we set Px = Pδx , where δx is the Dirac measure on X concentrated at x. Then
Pµ = Σ_{x∈X} µx Px , µx = µ[{x}]. (5.2.1)

Denote by I ⊂ E the sigma-algebra of Θ-invariant sets, where Θ denotes
the shift on U. Fix x ∈ X and let A ∈ Jx .
The Markov property (4.1.16) implies that
Ex I A ◦ Θn k En = EXn I A , En = σ(X0 , . . . , Xn ).
   

Since A is invariant we deduce I A = I A ◦ Θn a.s. so


Ex I A k En = EXn I A .
   

Lévy’s 0-1 theorem implies that


Ex I A k En → I A a.s.
 

Hence
PXn [A] = EXn [I A ] =: fn → I A a.s.
 
On the other hand, Px [Xn = x i.o.] = 1, since the chain is recurrent. Thus
P[ fn = Px [A] i.o. ] = 1.
Hence I A = Px [A] a.s., so Px [A] ∈ {0, 1}. Using (5.2.1) we deduce that Pµ [A] ∈ {0, 1} for any initial distribution µ.
If the chain is positively recurrent and π∞ is the invariant distribution, then Θ
is measure preserving and we deduce that J is a zero-one algebra so Θ is ergodic.
We see that the Ergodic Theorem for Markov chains (Corollary 4.60) is a special case of Birkhoff’s Ergodic Theorem because any f ∈ L1 (X , π∞ ) induces a function f̄ = f ◦ U0 ∈ L1 (X N0 , E, Pπ∞ ).


The fact that the shift map is ergodic allows us to state results stronger
than Corollary 4.60.  For any finite set B ⊂ X × X we obtain a function
FB ∈ L1 X N0 E, Pπ∞
FB (u0 , u1 , u2 , . . . ) = I B (u0 , u1 )

and a corresponding Law of Large Numbers
(1/n) Σ_{k=0}^{n−1} I B (Xk , Xk+1 ) → Eπ∞ [FB ] = Σ_{(x0 ,x1 )∈B} π∞ [x0 ] Qx0 ,x1 . (5.2.2)

One should think of B as a collection of directed edges, a “bridge”. In the left-


hand-side we have the fraction of time a path of the Markov chain “crosses the
bridge B”.
Here is an amusing simple illustration of this result. Consider the graph G obtained by connecting two disjoint connected graphs G0 , G1 with a single edge from a vertex u0 in G0 to a vertex u1 in G1 . For a vertex vi of Gi we denote by degi (vi ) its degree in Gi . We denote by Ei the number of edges of Gi .
Let B be the set consisting of the single oriented edge (u0 , u1 ). In this case
Qu0 ,u1 = 1/(deg0 (u0 ) + 1), π∞ [u0 ] = (deg0 (u0 ) + 1)/(2E0 + 2E1 + 2).
Formula (5.2.2) shows that the standard random walk on G crosses the bridge from u0 to u1 roughly a fraction 1/(2E0 + 2E1 + 2) of the time. t
u
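A simulation supports the bridge-crossing frequency. In this hypothetical instance (the graphs are our choice, not the book's) G0 is a triangle and G1 is a single edge, so E0 = 3, E1 = 1, and the predicted frequency of the directed crossing u0 → u1 is 1/(2·3 + 2·1 + 2) = 1/10:

```python
import random

random.seed(1)
# Vertices 0,1,2 form the triangle G0; vertices 3,4 form the single-edge G1;
# the bridge is the edge (0, 3), so u0 = 0 and u1 = 3.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0, 4], 4: [3]}

steps, crossings, x = 400_000, 0, 0
for _ in range(steps):
    y = random.choice(adj[x])
    crossings += (x, y) == (0, 3)
    x = y
freq = crossings / steps   # Birkhoff: freq -> pi(0) * Q(0,3) = (3/10)*(1/3) = 1/10
```

Here π∞[0] = deg(0)/(2|E|) = 3/10 and Q(0, 3) = 1/3, matching the formula in the example.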

Example 5.25 (Weyl’s equidistribution theorem). Fix ϕ ∈ (0, 2π) and denote by Rϕ the planar counterclockwise rotation of angle ϕ about the origin. This induces a transformation of the unit circle
S 1 := z ∈ C; |z| = 1 .


This preserves the canonical probability measure µ on S 1 ,
µ[dθ] = dθ/(2π).

As in the previous section this induces a unitary operator
Rbϕ : L2 (S 1 , µ) → L2 (S 1 , µ), Rbϕ f (θ) = f (θ + ϕ).
Above the functions in L2 (S 1 , µ) are complex valued. For n ∈ Z we set
en (θ) = einθ ∈ L2 (S 1 , µ).
Note that Rbϕ en = einϕ en . Since the collection (en )n∈Z is a complete orthonormal system we deduce that the eigenspace corresponding to the eigenvalue 1 of Rbϕ is
ker(1 − Rbϕ ) = span{ en ; nϕ ∈ 2πZ }.

We deduce that ker(1 − Rbϕ ) is 1-dimensional iff ϕ/(2π) is irrational. In this case Rϕ is ergodic and we deduce from (5.1.17) that if A ⊂ S 1 is an arc with endpoints θ0 , θ1 , then for almost any θ ∈ S 1 we have the asymptotic equidistribution equality
lim_{n→∞} (1/n) Σ_{k=0}^{n−1} I A (θ + kϕ) = (θ1 − θ0 )/(2π), a.s. (5.2.3)
With a little bit more work one can show that (5.2.3) holds for any θ. This is
Weyl’s equidistribution theorem, [157]. The reader interested in more details on the
equidistribution problem can consult [97]. t
u
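Weyl's theorem is easy to test numerically; in the sketch below the rotation angle and the arc are arbitrary illustrative choices of ours:

```python
import math

phi = 2.0 * math.pi * (math.sqrt(5.0) - 1.0) / 2.0   # 2*pi times an irrational number
a, b = 1.0, 2.5                                      # the arc [a, b) on [0, 2*pi)

n, hits, theta = 100_000, 0, 0.0
for _ in range(n):
    hits += a <= theta < b
    theta = (theta + phi) % (2.0 * math.pi)
frac = hits / n
expected = (b - a) / (2.0 * math.pi)   # arc length / total length
```

For this particular angle (related to the golden ratio) the empirical discrepancy decays essentially like (log n)/n, much faster than the 1/√n rate of i.i.d. sampling.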

5.2.2 Mixing
Suppose that T is a measure preserving transformation of a probability space (Ω, S, P). Note that if T is ergodic, then the L2 ergodic theorem implies that
(1/n) Σ_{k=0}^{n−1} f ◦ T k → E[f ] I Ω in L2 , ∀f ∈ L2 (Ω, S, P).

If we take the inner product with g ∈ L2 of both sides in the above equality we deduce
(1/n) Σ_{k=0}^{n−1} ∫Ω (f ◦ T k ) g dP → E[f ] E[g], ∀f, g ∈ L2 (Ω, S, P). (5.2.4)

In particular, if we let f = I A , g = I B , A, B ∈ S, we deduce
lim_{n→∞} (1/n) Σ_{k=0}^{n−1} P[ T −k (A) ∩ B ] = P[A] P[B]. (5.2.5)

Let us observe that the above condition is equivalent with ergodicity. Indeed, if we let A be quasi-invariant and B = Ω \ A, then P[T −k (A) ∩ B] = 0, ∀k, and we deduce
P[A]( 1 − P[A] ) = 0,
so any quasi-invariant set has measure 0 or 1.


Since convergent sequences are also Cesàro convergent we deduce that condition (5.2.5) follows from the stronger requirement
lim_{n→∞} P[ T −n (A) ∩ B ] = P[A] P[B], ∀A, B ∈ S. (5.2.6)

A measure preserving map T satisfying this condition is said to be mixing.


When T is an automorphism one can give a more visual interpretation of the
mixing condition. In this case mixing is also equivalent to the condition

lim_{n→∞} P[ A ∩ T n (B) ] = P[A] P[B], ∀A, B ∈ S. (5.2.7)

Assume that the region B is occupied by molecules of black ink in a glass of crystalline water. These molecules occupy a fraction P[B] of the entire space. Flow the black region B using T . Thus, T n (B) represents the location of the black region after n units of time.
The mixing condition shows that, after a while, the fraction P[A ∩ T n (B)]/P[A] of a region A occupied by these moving molecules of black ink is approximately P[B].
Thus in the long run, all the regions will have the same fraction of black ink. To
use a very apt analogy in Arnold and Avez [3], this is what happens when we mix
well a cocktail.
The mixing condition (5.2.6) can be rewritten as
lim_{n→∞} ( Tbn I A , I B ) = lim_{n→∞} E[ (I A ◦ T n )·I B ] = E[I A ]·E[I B ], ∀A, B ∈ S. (5.2.8)

This implies that for any elementary functions f, g ∈ Elem(Ω, S) we have
lim_{n→∞} ∫Ω (f ◦ T n ) g dP = ( ∫Ω f dP )( ∫Ω g dP ).
Since Elem(Ω, S) is dense in L2 (Ω, S, P) we deduce that if T is mixing, then
∀f, g ∈ L2 (Ω, S, P) : lim_{n→∞} ∫Ω (f ◦ T n ) g dP = ( ∫Ω f dP )( ∫Ω g dP ). (5.2.9)

Clearly, if a measure preserving map satisfies (5.2.9), then it is mixing. The above
argument has the following immediate generalization.

Proposition 5.26. Suppose that T : (Ω, S, P) → (Ω, S, P) is a measure preserving


map and C ⊂ L2 (Ω, S, P) is a collection of functions such that span(C) is dense in
L2 (Ω, S, P). Then the following are equivalent.
L2 Ω, S, P). Then the following are equivalent.

(i) The map T is mixing.


(ii) For every f, g ∈ C,
lim_{n→∞} ( Tbn f, g )L2 (Ω) = ( ∫Ω f dP )( ∫Ω g dP ). (5.2.10)

t
u

Let us give a few examples of mixing maps.

Proposition 5.27. Suppose that Xn : (Ω, S, P) → (X, F), n ∈ N, is a Kolmogorov


stationary sequence of measurable maps. Then the shift map on the path space is
mixing.

Proof. We will show that the shift map Θ satisfies (5.2.6). Denote by Un the
coordinate maps on the path space Un : XN → X, and set
Tn := σ Un , Un+1 , . . . .


For B ∈ F and m ∈ N we set
εm [B] := sup_{S∈Tm} | P[S ∩ B] − P[S]·P[B] |.

Since (Xn )n∈N is a Kolmogorov sequence we deduce from Theorem 5.14 that
εm [B] → 0 as m → ∞.
Observe that if A ∈ F, then Θ−m (A) ∈ Tm so that
| P[ Θ−m (A) ∩ B ] − P[ Θ−m (A) ]·P[B] | ≤ εm [B] → 0.
     

This implies (5.2.6) since Θ is measure preserving, so P[ Θ−m (A) ] = P[A].


   
t
u

Proposition 5.28. Suppose that (Xn )n≥0 is an irreducible, positively recurrent


HMC with state space X , transition matrix Q and stationary distribution π. Then
the following are equivalent.

(i) The HMC is aperiodic.


(ii) The shift map on the path space X N0 , E, Pπ is mixing.


Proof. We follow the approach in [21, Sec. 16.1.2].


(i) ⇒ (ii) Suppose that our HMC is aperiodic. Consider the path space U = X N0 .
Denote by C the collection of cylindrical subsets of U of the form
Cxi1 ,...,xik := { u ∈ U; uij = xij ∈ X , ∀1 ≤ j ≤ k }, 0 ≤ i1 < · · · < ik , k ∈ N.


In view of Proposition 5.26 it suffices to show that (5.2.6) is satisfied for any
A, B ∈ C. Suppose that
A = Cxi1 ,...,xik , B = Cxj1 ,...,xjm .
For n > jm we have
Θ−n (A) ∩ B = { u ∈ U ; uj1 = xj1 , . . . , ujm = xjm , ui1 +n = xi1 , . . . , uik +n = xik },
and
Pπ [ Θ−n (A) ∩ B ] = π[xj1 ] Q^{j2 −j1 }_{xj1 ,xj2 } · · · Q^{jm −jm−1 }_{xjm−1 ,xjm } Q^{n+i1 −jm }_{xjm ,xi1 } × Q^{i2 −i1 }_{xi1 ,xi2 } · · · Q^{ik −ik−1 }_{xik−1 ,xik }. (5.2.11)

Since the HMC is aperiodic, we deduce from (4.3.7) that
lim_{n→∞} Q^{n+i1 −jm }_{xjm ,xi1 } = π[xi1 ].
Using this in (5.2.11) we deduce that
lim_{n→∞} Pπ [ Θ−n (A) ∩ B ] = ( π[xj1 ] Q^{j2 −j1 }_{xj1 ,xj2 } · · · Q^{jm −jm−1 }_{xjm−1 ,xjm } ) × ( π[xi1 ] Q^{i2 −i1 }_{xi1 ,xi2 } · · · Q^{ik −ik−1 }_{xik−1 ,xik } ) = Pπ [B] Pπ [A].

(ii) ⇒ (i) Suppose that the shift map is mixing. To prove that the chain is aperiodic we argue by contradiction and assume that the period d is bigger than 1. As in Proposition 4.27, consider the communication classes of Qd ,
C1 , C2 , . . . , Cd ⊂ X .
Hence
 
P Xn+1 ∈ Ci+1 mod d k Xn ∈ Ci mod d = 1, ∀n ≥ 0, i = 1, . . . , d.
Consider the sets
Ai = { u ∈ X N0 ; u0 ∈ Ci }, i = 1, 2, . . . , d.
Then Θ−n (Ai ) = Ai+n mod d , Ai ∩ Aj = ∅ if i ≢ j mod d. We deduce that for any n ∈ N we have
Pπ [ Θ−nd (A0 ) ∩ A1 ] = 0, Pπ [ Θ−nd−1 (A0 ) ∩ A1 ] = Pπ [A1 ] = π[A1 ] ≠ 0.
       

This contradicts the fact that Θ is mixing. t


u
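The failure of mixing for a periodic chain is transparent in the simplest case. A toy illustration of ours (not from the book): the deterministic two-state chain with Q = [[0, 1], [1, 0]] and stationary distribution π = (1/2, 1/2). For A = B = { u ; u0 = 0 } the correlations Pπ[Θ−n(A) ∩ B] oscillate and never converge, yet their Cesàro means converge to Pπ[A] Pπ[B] = 1/4 — so (5.2.5) holds while (5.2.6) fails:

```python
# P_pi[Theta^{-n}(A) /\ B] = P_pi[X_n = 0, X_0 = 0]; under the period-2 chain
# X_n = X_0 iff n is even, and X_0 = 0 with probability 1/2 under pi.
def corr(n):
    return 0.5 if n % 2 == 0 else 0.0

vals = [corr(n) for n in range(1, 101)]   # oscillates 0, 1/2, 0, 1/2, ...
cesaro = sum(vals) / len(vals)            # Cesaro mean = 1/4 = P[A]*P[B]
```

So the chain is ergodic (the Cesàro averages behave), but the shift is not mixing.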

Remark 5.29. Suppose that (Xn )n≥0 is an HMC as in the above proposition. We know that if the sequence (Xn )n≥0 is Kolmogorov, then it is mixing. A theorem of Blackwell and Freedman [14] shows that the converse is also true, so (Xn )n≥0 is mixing if and only if it is Kolmogorov. In fact, the following properties are equivalent.

(i) The HMC (Xn )n≥0 is aperiodic.


(ii) For any probability measures µ, ν ∈ Prob(X ),
lim_{n→∞} dv ( µQn , νQn ) = 0.

(iii) The HMC (Xn )n≥0 is mixing.


(iv) The HMC (Xn )n≥0 is Kolmogorov.

For a proof we refer to [85, Thm. 26.10]. t


u

Example 5.30 (The tent map). Consider the tent map T : [0, 1] → [0, 1] intro-
duced in Example 5.1(d). Recall that T is the continuous map [0, 1] → [0, 1] such
that T (0) = 0 = T (1), T (1/2) = 1 and T is linear on each of the intervals [0, 1/2]
and [1/2, 1]. We want to show that T is mixing.
Consider the Haar basis of L2 [0, 1] . Recall its definition. It consists of the
Haar functions

H0 = 1, H1 = H0,0 = I [0,1/2] − I [1/2,1] ,
Hn,k (x) = 2^{n/2} H0,0 (2n x − k) = 2^{n/2} I [ k/2n , k/2n + 1/2^{n+1} ] − 2^{n/2} I [ k/2n + 1/2^{n+1} , (k+1)/2n ] , 0 ≤ k < 2n .

Define
H−1 := span{ I [0,1] }, Hn := span{ Hn,k ; 0 ≤ k < 2n }, n ≥ 0,
H := { H0 } ∪ { Hn,k ; n ≥ 0, 0 ≤ k < 2n }.
The subspaces H n are mutually orthogonal and the collection H spans a dense
subspace of L2 [0, 1] ; see [24, Sec. 9.2]. Moreover

TbH n ⊂ H n+1 , ∀n ≥ 0.

Thus if m, n ≥ 0, 0 ≤ j < 2m , 0 ≤ k < 2n , we have
( Tbℓ Hm,j , Hn,k )L2 = 0 = ( ∫_0^1 Hm,j (x)dx )( ∫_0^1 Hn,k (x)dx ), ∀ℓ > n − m.

Clearly TbH0 = H0 . Proposition 5.26 applied to the collection H implies that T is


mixing, hence ergodic. t
u
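The decay of correlations for the tent map can be watched via a Monte Carlo estimate of E[(f ◦ T n) f] for the Haar function f = H0,0. This is an illustrative sketch of ours (sample size and seed are arbitrary); we average over an ensemble of random starting points rather than one long orbit, since in floating point a single long tent-map orbit is quickly truncated to 0:

```python
import random

random.seed(2)
T = lambda x: 2.0 * x if x < 0.5 else 2.0 - 2.0 * x   # tent map
f = lambda x: 1.0 if x < 0.5 else -1.0                # Haar function H_{0,0}; mean 0

N = 200_000
xs = [random.random() for _ in range(N)]

def corr(n):
    """Monte Carlo estimate of E[(f o T^n) * f] with respect to Lebesgue measure."""
    total = 0.0
    for x in xs:
        y = x
        for _ in range(n):
            y = T(y)
        total += f(y) * f(x)
    return total / N

c = [corr(n) for n in range(4)]   # c[0] = E[f^2] = 1, while c[n] ~ E[f]^2 = 0 for n >= 1
```

The exact correlations vanish for every n ≥ 1 by the Haar orthogonality used in the example; the simulation sees them only up to Monte Carlo noise of order 1/√N.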

Example 5.31 (Arnold’s cat). We consider a slightly more general situation.


Let d > 1 and denote by Td the d-dimensional torus
Td = S 1 × · · · × S 1 (d factors),

equipped with the invariant probability measure
µ[dθ] = dθ1 · · · dθd /(2π)d ,
where θi are the standard angular coordinates on the torus. Suppose that
A ∈ SLd (Z), i.e., A is a d × d matrix with integral entries and determinant 1.
Since AZd = Zd we deduce that A defines a measure preserving map of Td ,
θ = (θ1 , . . . , θd )⊤ 7→ TA θ := A · θ mod (2πZ)d .
Denote by ⟨−, −⟩ the canonical inner product in Rd and by A∗ the transpose of A. Clearly
A∗ ∈ SLd (Z) and A∗ (Zd ) = Zd .
For each m⃗ ∈ Zd we denote by Om⃗ the orbit of the action of A∗ on Zd , i.e., the set
Om⃗ = { (A∗ )n m⃗ ; n ≥ 0 }.

For any m⃗ ∈ Zd we define the character1 χm⃗ ∈ L2 (Td , µ),
χm⃗ (θ) = e^{i⟨m⃗ ,θ⟩} = e^{i(m1 θ1 +···+md θd )} , i := √−1.
The set of characters
Cd := { χm⃗ ; m⃗ ∈ Zd } ⊂ L2 (Td , µ) (5.2.12)
is an orthonormal family that spans a vector subspace dense in L2 (Td , µ).
 
The unitary operator TbA : L2 (Td , µ) → L2 (Td , µ) has the explicit description
TbA f (θ) = f (Aθ).
In particular,
TbA χm⃗ (θ) = e^{i⟨m⃗ ,Aθ⟩} = e^{i⟨A∗ m⃗ ,θ⟩} = χA∗ m⃗ (θ).
~ (θ).

We have the following result.

Theorem 5.32. Let A ∈ SLd (Z), d > 1. The following are equivalent.

(i) The map A : Td → Td is ergodic.


(ii) For any m⃗ ∈ Zd \ {0} the orbit Om⃗ is infinite.
d d
(iii) The map A : T → T is mixing.
1 Any continuous group morphism χ : Td → S 1 has the form χm⃗ for some m⃗ ∈ Zd .

Proof. We follow the approach in [36, Sec. 4.3]. We only need to prove (i) ⇒ (ii)
⇒ (iii).
(i) ⇒ (ii) We argue by contradiction. Suppose there exists m⃗ ∈ Zd \ {0} such that Om⃗ is finite. Denote by n the smallest n ∈ N such that (A∗ )n m⃗ = m⃗ . Then the function
f = χm⃗ + χA∗ m⃗ + · · · + χ(A∗ )n−1 m⃗
is TbA -invariant and nonconstant, since the functions 1, χm⃗ , . . . , χ(A∗ )n−1 m⃗ are linearly independent. Hence A is not ergodic.
(ii) ⇒ (iii) We apply Proposition 5.26 to the set of characters Cd in (5.2.12). Note that
∫_{Td} χm⃗ dµ = 1 if m⃗ = 0, and ∫_{Td} χm⃗ dµ = 0 if m⃗ ≠ 0.
Clearly if f = g = 1, then (5.2.10) holds trivially. Suppose f ≠ 1. Then
( ∫Ω f dP )( ∫Ω g dP ) = 0.
Assumption (ii) implies that TbAn f is a character different from g for all n sufficiently large, and thus
( TbAn f, g )L2 = 0, ∀n ≫ 0.
We deduce that A is mixing. t
u

The condition (ii) above holds if and only if none of the eigenvalues of A are roots of 1. Observe that if one eigenvalue of A ∈ SL2 (Z) is a root of 1 then all eigenvalues are roots of 1, and this happens precisely when | tr A| ≤ 2. The Arnold cat map matrix has trace 3, so none of its eigenvalues is a root of 1. In particular, this shows that Arnold’s cat map is mixing. t
u
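Condition (ii) of Theorem 5.32 can be watched directly for the cat map matrix. A short sketch (the iteration count is our arbitrary choice) iterating A∗ on a nonzero lattice vector; the iterates stay distinct and their entries grow geometrically, so the orbit is infinite:

```python
# Arnold's cat map matrix; it is symmetric, so A* = A, and tr A = 3 > 2,
# hence no eigenvalue is a root of 1.
A = ((2, 1), (1, 1))

def apply(M, v):
    return (M[0][0] * v[0] + M[0][1] * v[1], M[1][0] * v[0] + M[1][1] * v[1])

m, orbit = (1, 0), []
for _ in range(10):
    orbit.append(m)
    m = apply(A, m)
# orbit = (1,0), (2,1), (5,3), (13,8), ... -- consecutive Fibonacci pairs,
# all distinct, so O_m is infinite and T_A is mixing.
```

The entries are Fibonacci numbers, reflecting the eigenvalue (3 + √5)/2, the square of the golden ratio.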

Remark 5.33. There is another condition that intermediates between mixing and
ergodicity. More precisely, a measure preserving self-map of a probability space
(Ω, S, P) is called weakly mixing if
lim_{n→∞} (1/n) Σ_{k=0}^{n−1} | P[ T −k (A) ∩ B ] − P[A] P[B] | = 0 (5.2.13)
for any A, B ∈ S. Clearly (5.2.13) implies (5.2.5), so weakly mixing maps are ergodic.
Since convergent sequences are Cesàro convergent, we deduce that (5.2.6) implies (5.2.13), so mixing maps are weakly mixing.
It turns out that most weakly mixing automorphisms of a probability space (Ω, S, P) are not mixing. More precisely, the mixing automorphisms form a meagre (first Baire category) subset of the set of weakly mixing automorphisms [78, p. 77].

5.3 Exercises

Exercise 5.1. Suppose that (Ω, S) is a measurable space and T : (Ω, S) → (Ω, S) a
measurable map. Denote by ProbT (Ω, S) the set of T -invariant probability measures
P : S → [0, 1].

(i) Prove that ProbT (Ω, S) is a convex subset of the space of finite measures on S.
(ii) Prove that T is ergodic with respect to a probability measure P if and only if P is an extremal point of ProbT (Ω, S), i.e., P cannot be written as a convex combination P = (1 − t)P0 + tP1 , t ∈ (0, 1), with P0 , P1 ∈ ProbT (Ω, S), P0 , P1 ≠ P.

t
u

Exercise 5.2. Suppose that (X, d) is a compact metric space and T : X → X is a continuous map. Let P be a Borel probability measure on X. For n ∈ N we set
Pn := (1/n) Σ_{k=0}^{n−1} T#k P.

(i) Prove that the sequence (Pn )n∈N contains a subsequence (Pnk ) that converges weakly to a Borel probability measure P∗ on X, i.e.,
lim_{k→∞} ∫X f (x) Pnk [dx] = ∫X f (x) P∗ [dx], ∀f ∈ C(X).
Hint. Use the Banach–Alaoglu compactness theorem.
(ii) Prove that P∗ is T -invariant.
(iii) Prove that the set ProbT (X) of T -invariant Borel probability measures on X
is convex and closed with respect to the weak convergence.

t
u

Exercise 5.3. Let (Ω, S, P) be a probability space and T : Ω → Ω a measure


preserving map. We say that T is quasi-mixing if there exist c1 , c2 > 0 such that
∀A, B ∈ S

c1 P A P B ≤ P T −1 (A) ∩ B ≤ c2 P A P B .
         
(5.3.1)

(i) Suppose that A ⊂ S is a collection of measurable subsets that generates S,


σ(A) = S. Show that T is quasi-mixing if (5.3.1) holds for all A, B ∈ A.
(ii) Prove that if T is quasi-mixing, then it is ergodic.

t
u

Exercise 5.4. Let (Ω, S, P) be a probability space and T : Ω → Ω a measure


preserving map. Suppose that (Fn )n≥1 is a filtration of sigma-subalgebras with the
following properties.

(i)
_
Fn = S.
n≥1

(ii) T −1 (Fn ) ⊂ Fn , ∀n ∈ N.
(iii) For any k ∈ N the intersection
⋂_{n≥1} (T k )−1 (Fn )
is a 0-1 sigma-subalgebra.

Prove that T is mixing. t


u

Exercise 5.5. Let (Ω, S, P) be a probability space, T : Ω → Ω a measure preserving map and g ∈ L1 (Ω, S, P). Prove that the following are equivalent.

(i) The function g is T -invariant, i.e., g ◦ T = g a.s.
(ii) For any f ∈ L∞ (Ω, S, P), E[gf ] = E[g(f ◦ T )]. t
u

Exercise 5.6 (Poincaré). Suppose that (Ω, S, P) is a probability space and
T : Ω → Ω is a measure preserving measurable map. Prove that for any S ∈ S
such that P[S] > 0 we have

    P[{ω ∈ Ω; T^n ω ∈ S i.o.}] = 1.

Exercise 5.7 (Kac). Suppose that (Ω, S, P) is a probability space and T : Ω → Ω
is a measure preserving measurable map. For S ∈ S such that P[S] > 0 we define
the first return time

    T_S : Ω → N ∪ {∞}, T_S(ω) = min{n ∈ N : T^n ω ∈ S}.

Set

    Ω_S := {ω ∈ Ω \ S; T^n ω ∉ S, ∀n ≥ 1}.

(i) Prove that

    ∫_S T_S(ω) P[dω] = 1 − P[Ω_S].

(ii) Prove that if T is ergodic then P[Ω_S] = 0.

Exercise 5.8. Consider an irreducible HMC (X_n)n≥0 with finite state space X,
transition matrix Q and whose initial distribution is the stationary distribution µ.
The path space of this Markov chain is (see Theorem 4.3)

    (U_µ, E, P_µ),  U_µ = X^{N₀}.

For n ∈ N₀ we denote by U_n the n-th coordinate map, U_n(u0, u1, . . .) = u_n. Let

    f : U_µ → R,  f(u0, u1, . . .) = − log₂ Q_{u0,u1}.

(i) Prove that

    ∫_{U_µ} f(u) P_µ[du] = Ent₂(X, Q),

where Ent₂(X, Q) denotes the entropy rate of the Markov chain described in
Exercise 4.31.
(ii) Prove that

    − lim_{n→∞} (1/n) log₂ (Q_{U0,U1} ⋯ Q_{Un−1,Un}) = Ent₂(X, Q) a.s.

Exercise 5.9 (Kac). Consider the map Q : [0, 1) → [0, 1), x ↦ Qx := 2x mod 1.
Show that Q is mixing with respect to the Lebesgue measure. Hint. See Example 5.13.

Exercise 5.10. Consider the tent map T : [0, 1] → [0, 1], T (x) = min(2x, 2 − 2x)
and the logistic map L : [0, 1] → [0, 1], L(x) = 4x(1 − x).

(i) Prove that the map Φ : [0, 1] → [0, 1], Φ(x) = (1 − cos(πx))/2, is a homeomor-
phism and L ◦ Φ = Φ ◦ T.
(ii) Describe the measure µ := Φ# λ, where λ is the Lebesgue measure on [0, 1].
(iii) Prove that the logistic map preserves µ and is mixing with respect to this
measure.

Exercise 5.11. Fix m ∈ N, m ≥ 2. For any ~ε = (ε1, . . . , εm) ∈ {0, 1}^m define
F_{~ε} : [0, 1] → [0, 1] by

    F_{~ε}(x) = (−1)^{ε_k} m(x − (k−1)/m) + ε_k,   (k−1)/m ≤ x < k/m,   k = 1, . . . , m,

and F_{~ε}(1) = 0. Prove that F_{~ε} is mixing for any ~ε ∈ {0, 1}^m.

Exercise 5.12. Consider the Haar functions Hn,k used in Example 5.30. We define
the Rademacher functions,
    R_n : [0, 1] → R,  R_n = 2^{−n/2} ∑_{0≤k<2^n} H_{n,k},  n ≥ 0.

(i) Prove that

    ∑_{n=0}^∞ (1/2^{n+1}) R_n(x) = 1 − 2x, ∀x ∈ [0, 1].

(ii) Prove that the functions (Rn )n≥0 , viewed as random variables defined on the
probability space ([0, 1], λ), are i.i.d.
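A quick numerical sanity check of (i) is possible in R, assuming the convention that R_n(x) = 1 − 2b_n(x), where b_n(x) ∈ {0, 1} is the binary digit of x of weight 2^{−(n+1)}; the series is truncated at n = 40, which is far below the tolerance used.

```r
# hypothetical check of the identity sum_n R_n(x)/2^(n+1) = 1 - 2x,
# with R_n computed from the binary digits of x
R <- function(n, x) 1 - 2*(floor(2^(n+1)*x) %% 2)
x <- 0.37
s <- sum(sapply(0:40, function(n) R(n, x)/2^(n+1)))
stopifnot(abs(s - (1 - 2*x)) < 1e-10)
```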

Exercise 5.13. Suppose that X is a finite set and π is a probability measure on
X given by the weights π_x := π[{x}] > 0, ∀x ∈ X. Consider the Cartesian
product U_π := X^Z equipped with the product sigma-algebra and the product measure
π^∞ := π^{⊗Z}. The elements of U_π are functions u : Z → X. Consider the shift

    Θ : U_π → U_π,  Θu(n) = u(n + 1).

(i) Prove that Θ is mixing with respect to the measure π^∞.
(ii) For S ⊂ X denote by N_π^S the subset of U_π consisting of functions u : Z → X
such that lim_{n→∞} u(n) exists. Prove that π^∞[N_π^S] = 0 and that the
complement U_π^S is Θ-invariant.

Exercise 5.14. Consider the baker's transform B : [0, 1]² → [0, 1]²,

    B(x, y) = (q(2x), q(y/2))        if x ≤ 1/2,
    B(x, y) = (q(2x), q((y + 1)/2))  if x > 1/2,

where q(t) denotes the fractional part of the real number t, q(t) = t − ⌊t⌋.
Consider the map Φ : {0, 1}^Z → [0, 1]² given by Φ(u) = (x(u), y(u)),

    x(u) = ∑_{n=0}^∞ u(−n)/2^{n+1},   y(u) = ∑_{n=1}^∞ u(n)/2^n.

Denote by π the uniform measure on {0, 1} and by π^∞ the induced product measure
on {0, 1}^Z.

(i) Prove that Φ# π^∞ = λ, where λ is the Lebesgue measure on the square [0, 1]².
(ii) Show that B ◦ Φ = Φ ◦ Θ, where Θ is the shift defined in Exercise 5.13.
(iii) Prove that the baker's transform is mixing with respect to the Lebesgue
measure.

Exercise 5.15 (Gauss). Consider the map G : [0, 1] → [0, 1] given by

    G(0) = 0,   G(x) = 1/x − ⌊1/x⌋, x ∈ (0, 1].

For k ∈ N we set I_k := (1/(k + 1), 1/k). Any x ∈ (0, 1] has a continued fraction
decomposition

    x = [0 : a1 : a2 : ⋯] := 0 + 1/(a1 + 1/(a2 + ⋯)),   a_n = a_n(x) ∈ N₀, ∀n ∈ N.

(The number x is rational if and only if a_n = 0 for all n sufficiently large.) Set
[0, 1]∗ := [0, 1] \ Q.

(i) Let x = [0 : a1 : a2 : ⋯] ∈ [0, 1]∗. Prove that G(x) = [0 : a2 : a3 : ⋯] and,
for any n ∈ N, we have

    x = [0 : a1 : ⋯ : a_{n−1} : a_n + G^n(x)].

(ii) Let x ∈ [0, 1]∗. Prove that a_n(x) = k iff G^{n−1}(x) ∈ I_k, ∀k, n ∈ N.
(iii) For each a ∈ R we set

    T_a = ( a 1 ; 1 0 ),

the 2 × 2 matrix with rows (a, 1) and (1, 0). Prove that

    [0 : a1 : ⋯ : a_n] = p_n/q_n,   ( p_n p_{n−1} ; q_n q_{n−1} ) = T_0 · T_{a1} ⋯ T_{an}.   (5.3.2)
(iv) Let x := [0 : a1 : a2 : ⋯] ∈ [0, 1]∗. Prove that for any n ∈ N we have

    x = (p_n(x) + p_{n−1}(x) G^n(x)) / (q_n(x) + q_{n−1}(x) G^n(x)),

where p_n(x), q_n(x) are defined in terms of the a_n(x)'s by (5.3.2).
(v) Prove that q_n(x) ≥ 2^{(n−2)/2}, ∀x ∈ [0, 1]∗, n ∈ N.
(vi) Prove that the restriction of G to Ik is a diffeomorphism onto (0, 1).
(vii) Fix c > 0 and set ρ : [0, 1] → [0, ∞),

    ρ(x) = c/(x + 1).

Prove that for any x ∈ [0, 1]∗ we have

    ρ(x) = ∑_{G(y)=x} ρ(y)/|G′(y)|.
(viii) Prove that the probability measure µ on [0, 1] defined by

    µ[dx] = (1/log 2) · 1/(x + 1) λ[dx]

is G-invariant.
(ix) Prove that for any n ∈ N the map A_n : [0, 1]∗ → N, x ↦ a_n(x) is measurable
and the sigma-algebra generated by these random variables coincides with the
Borel sigma-algebra. Hint. Show that the set I_{a1,...,am} := {A_k = a_k, 1 ≤ k ≤ m} is an
interval with endpoints expressible in terms of the fractions p_k/q_k defined as in (5.3.2).
(x) Show that G is quasi-mixing (see Exercise 5.3), hence ergodic.
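Parts of this exercise are easy to test numerically. The R sketch below (not part of the exercise) verifies that the density in (viii) integrates to 1, and tests the G-invariance of µ on an interval [0, t], using the fact that G⁻¹([0, t]) is the union over k ∈ N of the intervals (1/(k + t), 1/k]; the infinite union is truncated at k = 10⁶.

```r
# the Gauss measure mu[dx] = dx/(log(2)*(1+x)) is a probability measure
stopifnot(abs(integrate(function(x) 1/(log(2)*(1+x)), 0, 1)$value - 1) < 1e-8)

# G-invariance on [0, t]: mu(G^{-1}([0,t])) = mu([0,t]); the preimage is the
# union of the intervals (1/(k+t), 1/k], so we sum a (telescoping) series
Fmu <- function(x) log2(1 + x)    # cdf of mu
t <- 0.3
k <- 1:1e6
stopifnot(abs(sum(Fmu(1/k) - Fmu(1/(k + t))) - Fmu(t)) < 1e-6)
```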

Exercise 5.16. Suppose that T is an automorphism of a probability space (Ω, S, P).
Define

    T^{×2} : Ω × Ω → Ω × Ω,  T^{×2}(ω1, ω2) = (Tω1, Tω2).

Prove that T^{×2} is ergodic (with respect to P^{⊗2}) if and only if T is weakly mixing,
i.e., satisfies (5.2.13).

Appendix A

A few useful facts

A.1 The Gamma function

Definition A.1 (Gamma and Beta functions). The Gamma function is the
function
    Γ : (0, ∞) → R,  Γ(x) = ∫₀^∞ t^{x−1} e^{−t} dt.   (A.1.1)

The Beta function is the function of two positive variables

    B(x, y) := Γ(x)Γ(y)/Γ(x + y),  x, y > 0.   (A.1.2)

We gather here a few basic facts about the Gamma and Beta functions used in
the text. For proofs we refer to [100, Chap. 1] or [158, Chap. 12].

Proposition A.2. The following hold.

(i) Γ(1) = 1.
(ii) Γ(x + 1) = xΓ(x), ∀x > 0.
(iii) For any n = 1, 2, . . . we have

    Γ(n) = (n − 1)!.   (A.1.3)

(iv) Γ(1/2) = √π.
(v) For any x, y > 0 we have Euler's formula

    B(x, y) = ∫₀¹ s^{x−1} (1 − s)^{y−1} ds = ∫₀^∞ u^{x−1}/(1 + u)^{x+y} du.   (A.1.4)

The equality (iv) above reads

    √π = Γ(1/2) = ∫₀^∞ e^{−t} t^{−1/2} dt

July 19, 2022 15:9 ws-book961x669 An Introduction to Probability 12800-main page 508

508 An Introduction to Probability

(t = x², t^{−1/2} = x^{−1}, dt = 2x dx)

    = 2∫₀^∞ e^{−x²} dx = ∫_{−∞}^0 e^{−x²} dx + ∫₀^∞ e^{−x²} dx = ∫_{−∞}^∞ e^{−x²} dx.

If we make the change in variables x = s/√2, so that x² = s²/2 and dx = (1/√2) ds, then
we deduce

    √π = (1/√2) ∫_{−∞}^∞ e^{−s²/2} ds.

From this we obtain the fundamental equality

    (1/√(2π)) ∫_{−∞}^∞ e^{−x²/2} dx = 1.   (A.1.5)
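Both identities are easy to confirm numerically; the quick R sanity check below uses only base R.

```r
# Gamma(1/2) = sqrt(pi)
stopifnot(abs(gamma(0.5) - sqrt(pi)) < 1e-12)

# (A.1.5): the standard normal density integrates to 1
I <- integrate(function(x) exp(-x^2/2)/sqrt(2*pi), -Inf, Inf)
stopifnot(abs(I$value - 1) < 1e-6)
```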

The function Γ(x) grows very fast as x → ∞. Its asymptotics is governed by
Stirling's formula

    xΓ(x) ∼ √(2πx) (x/e)^x as x → ∞.   (A.1.6)

Note that for n ∈ N the above estimate reads

    n! ∼ √(2πn) (n/e)^n as n → ∞.   (A.1.7)
There are very sharp estimates for the ratio

    q_n = n! / ( √(2πn) (n/e)^n ).

More precisely we have (see [58, II.9])

    1/(12n + 1) < log q_n < 1/(12n).   (A.1.8)
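The bounds (A.1.8) can be checked numerically for small n in R:

```r
# check 1/(12n+1) < log(q_n) < 1/(12n) for n = 1, ..., 20
n <- 1:20
qn <- factorial(n) / (sqrt(2*pi*n) * (n/exp(1))^n)
stopifnot(all(log(qn) > 1/(12*n + 1)), all(log(qn) < 1/(12*n)))
```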
We denote by ω_n the volume of the n-dimensional Euclidean unit ball

    B^n := { x ∈ R^n ; ‖x‖ ≤ 1 },  ‖x‖ = √(x₁² + ⋯ + x_n²),

and by σ_{n−1} the "area" of the unit sphere in R^n

    S^{n−1} = { x ∈ R^n ; ‖x‖ = 1 }.

Then

    σ_{n−1} = 2Γ(1/2)^n / Γ(n/2),   ω_n = (1/n) σ_{n−1} = Γ(1/2)^n / Γ((n + 2)/2).   (A.1.9)
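Formula (A.1.9) is consistent with the familiar low-dimensional values σ_1 = 2π (the circumference of the unit circle) and ω_3 = 4π/3; a quick R check:

```r
sigma <- function(n) 2*gamma(1/2)^n / gamma(n/2)    # "area" of S^{n-1}
omega <- function(n) gamma(1/2)^n / gamma(n/2 + 1)  # volume of B^n
stopifnot(abs(sigma(2) - 2*pi) < 1e-12,
          abs(omega(3) - 4*pi/3) < 1e-12)
```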

A.2 Basic invariants of frequently used probability distributions

 
X ∼ Bin(n, p) ⇐⇒ P[X = k] = C(n, k) p^k q^{n−k},  k = 0, 1, . . . , n,  q := 1 − p.

Ber(p) ∼ Bin(1, p).

X ∼ NegBin(k, p) ⇐⇒ P[X = n] = C(n − 1, k − 1) p^k q^{n−k},  n = k, k + 1, . . .

Geom(p) ∼ NegBin(1, p).

X ∼ HGeom(w, b, n) ⇐⇒ P[X = k] = C(w, k) C(b, n − k) / C(w + b, n),  k = 0, 1, . . . , w.

X ∼ Poi(λ), λ > 0 ⇐⇒ P[X = n] = e^{−λ} λ^n / n!,  n = 0, 1, . . .

X ∼ Unif(a, b) ⇐⇒ P_X = (1/(b − a)) I_{[a,b]} dx.

X ∼ Exp(λ), λ > 0 ⇐⇒ P_X = λ e^{−λx} I_{[0,∞)} dx.

X ∼ N(µ, σ²), µ ∈ R, σ > 0 ⇐⇒ P_X = (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)} dx,  x ∈ R.

X ∼ Gamma(ν, λ) ⇐⇒ P_X = (λ^ν / Γ(ν)) x^{ν−1} e^{−λx} I_{[0,∞)} dx.

X ∼ Beta(a, b) ⇐⇒ P_X = (1/B(a, b)) x^{a−1} (1 − x)^{b−1} I_{(0,1)} dx.

X ∼ Stud_p ⇐⇒ P_X = ( Γ((p+1)/2) / (√(pπ) Γ(p/2)) ) (1 + x²/p)^{−(p+1)/2} dx,  x ∈ R.

(Here C(n, k) denotes the binomial coefficient n!/(k!(n − k)!).)

Name              Mean         Variance                   pgf                    mgf

Ber(p)            p            pq                         q + ps                 q + pe^t
Bin(n, p)         np           npq                        (q + ps)^n             (q + pe^t)^n
Geom(p)           1/p          q/p²                       ps/(1 − qs)            pe^t/(1 − qe^t)
NegBin(k, p)      k/p          kq/p²                      (ps/(1 − qs))^k        (pe^t/(1 − qe^t))^k
Poi(λ)            λ            λ                          e^{λ(s−1)}             e^{λ(e^t−1)}
HGeom(w, b, n)    nw/(w + b)   *                          *                      *
Unif(a, b)        (a + b)/2    (b − a)²/12                NA                     (e^{tb} − e^{ta})/(t(b − a))
Exp(λ)            λ⁻¹          λ⁻²                        NA                     λ/(λ − t)
N(µ, σ²)          µ            σ²                         NA                     exp(σ²t²/2 + µt)
Gamma(ν, λ)       ν/λ          ν/λ²                       NA                     (λ/(λ − t))^ν
Beta(a, b)        a/(a + b)    ab/((a + b)²(a + b + 1))   NA                     *
Stud_p            0, p > 1     p/(p − 2), p > 2           NA                     NA
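The entries of this table are easy to spot-check by simulation; for instance, Bin(10, 0.3) has mean np = 3 and variance npq = 2.1:

```r
set.seed(1)
x <- rbinom(10^6, size = 10, prob = 0.3)
mean(x)   # should be close to np = 3
var(x)    # should be close to npq = 2.1
```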

A.3 A glimpse at R

This section is merely an invitation to programming in R. It is not meant as a
serious guide to learning R. It mainly lists a few basic tricks that will get the curious
reader started and covers many of the simple probability simulations I used in my
classes.
First, here is how you install R on your computers.
For Mac users
https://cran.r-project.org/bin/macosx/
For Windows users
https://cran.r-project.org/bin/windows/base/
Next, install R Studio (the Desktop version). This is a very convenient interface for
using R.
https://www.rstudio.com/products/RStudio/
(Install first R and then R Studio.) You can also access RStudio and R in the cloud
https://www.rollapp.com/app/rstudio
The site
http://www.people.carleton.edu/~rdobrow/Probability/
has a repository of many simple R programs (or R scripts) that you can use as
models.
The reader familiar with the basics of programming will have no problems learning
the basics of R. This section is addressed to such a reader. We list some of the
commands and objects most frequently used in probability and we have included
several examples to help the reader get started. R-Studio comes with links to various
freely available web sources for R-programming. A commercial source that I
find very useful is "The Book of R", [38]. Often I ask Google how to do this or
that in R and I receive many satisfactory solutions.

Example A.3 (Operations with vectors). The workhorse of R is the object


called vector. An n-dimensional vector is essentially an element in Rn . An n-
dimensional vector in R can be more general in the sense that its entries need not
be just numbers.
To generate in R the vector (1, 2, 4.5) and name it x use the command

x<-c(1,2,4.5)

To see what the vector x is, type

x

and then hit RETURN/ENTER.²


To see what the k-th entry of x is use the command

x[k]

The command

x[j:k]

will generate all the entries of x from the j-th to the k-th. If you want to add an
entry to x, say you want to generate the longer vector (1, 2, 4.5, 7), use the command

c(x,7)

For long vectors this approach can be time consuming. The process of describing
vectors can be accelerated if the entries of the vector x are subject to patterns. For
example, the vector of length 22 with all entries equal to the same number, say 1.5,
can be generated using the command

rep(1.5, 22)

To generate the vector listing in increasing order all the integers between −2
and 10 (included) use the command

(-2):10

To generate the vector named x consisting of 25 equidistant numbers starting at


1 and ending at 7 use the command

x<-seq(from=1, to=7, length.out=25)


2 You have to do this after every command so I will omit this.
July 19, 2022 15:9 ws-book961x669 An Introduction to Probability 12800-main page 512

512 An Introduction to Probability

To add all the entries of a vector x = (x1 , . . . , xn ) use the command

sum(x)

To add all the natural numbers from 50 to 200 use the command
sum(50:200)
The result is 18,875.
You can sort the entries of a vector, if they are numerical. For example

> z<-c(1,4,3)
> sort(z)
[1] 1 3 4

A very convenient feature of working with vectors in R is that the basic algebraic
operations involving numbers extend to vectors, component-wise. For example, if
z is the above vector, and y = (1, 8, 9), then the command y/z returns
(1/1, 8/4, 9/3) = (1, 2, 3), while z^2 returns (1, 16, 9).

Example A.4 (Logical operators). These are operators whose output is a


TRUE or FALSE or a vector whose entries are TRUE/FALSE.
For example, the command 2 < 5 returns TRUE. On the other hand if x is the
vector (2, 3, 7, 8), then the command x < 5 returns

TRUE, TRUE, FALSE, FALSE.
In R the logicals TRUE/FALSE also have arithmetic meaning,
TRUE = 1, FALSE = 0.
The output of x < 5 is a vector whose entries are TRUE/FALSE. To see how many
of the entries of x are < 5 use the command

sum(x<5)

Above x < 5 is interpreted as a vector with 0/1-entries. When we add them we


count how many are equal to 1 or, equivalently, how many of the entries of x are
< 5.
The R language also has two very convenient logical operators any and all. When
we apply any to a vector v with TRUE/FALSE entries it returns TRUE if at least
one of the entries of v is TRUE and returns FALSE otherwise. When we apply all
to a vector v with TRUE/FALSE entries it returns TRUE if all of the entries of v
are TRUE and returns FALSE otherwise.

Example A.5 (Functions in R). One can define and work with functions in R.
For example, to define the function

    f(q) = (1 + 4q + 10q²)(1 − q)⁴
July 19, 2022 15:9 ws-book961x669 An Introduction to Probability 12800-main page 513

A few useful facts 513

use the command


f<-function(q) (1+4*q+10*q^2)*(1-q)^4
To find the value of f at q = 0.73 use the command
f(0.73)
To display the values of f at all the points
0, 0.01, 0.02, 0.03, . . . , 0.15, 0.16
use the command
x<-seq(from=0, to=0.16, by=0.01)
f(x)
To plot the values of f over 100 equidistant points in the interval [2, 7] use the
command
x<-seq(from=2, to=7, length.out=100)
y<-f(x)
plot(x,y, type="l")
Equivalently, there is the simple command curve(-) that allows drawing multiple
graphs in the same coordinate system.

function1<-function(x){x^2}
function2<-function(x){1-cos(x)}
curve(function1, col=1)
curve(function2, col=2, add=TRUE)

Above col stands for “color”. When this option is used different graphs are
depicted in different colors.
Here is how we define in R the indicator function of the unit disc in the plane

    I_D(x, y) = 1 if x² + y² ≤ 1,   I_D(x, y) = 0 if x² + y² > 1.

indicator<-function(x,y) if(1 >= x^2+y^2) 1 else 0

Another possible code that generates this indicator function is

indicator<-function(x,y) as.integer(x^2+y^2<= 1)

Above, the command as.integer converts TRUE/FALSE to 1/0.

Example A.6 (Samples with replacement). For example, to sample with re-
placement 7 balls from a bin containing balls labeled 1 through 23 use the R com-
mand

sample(1:23,7, replace=TRUE)
July 19, 2022 15:9 ws-book961x669 An Introduction to Probability 12800-main page 514

514 An Introduction to Probability

The result is a 7-dimensional vector whose entries consist of 7 numbers sampled
with replacement from the set {1, . . . , 23}. Similarly, to simulate rolling a fair die
137 times use the command

sample(1:6,137, replace=TRUE)

Example A.7 (Rolling a die). Let us show how to simulate rolling a die a num-
ber n of times and then count how many times we get 6. Suppose n = 20. We
indicate this using the command

n<-20

We now roll the die n times and store the results in a vector x

x<-sample(1:6, n, replace=TRUE)

Next we test which of the entries of x are equal to 6 and store the results of these
20 tests in a vector y

y<-x==6

The entries of y are T rue or F alse, depending on whether the corresponding entry
of x was equal to 6 or not. To find how many entries of y are T use the command

sum(y)

The result is equal to the number of 6s we got during the string of 20 rolls of a fair
die.
We can visualize data. Suppose we roll a die a large number N = 12000 of times.
For each 1 ≤ k ≤ N we denote by z(k) the fraction of the first k rolls when we
rolled a 6. For k → ∞ the Law of Large Numbers states that this frequency should
approach 1/6. The vector z can be generated in R using the commands

N<-12000
x<-sample(1:6, N, replace=TRUE)
z<-cumsum(x==6)/(1:N)

Above, cumsum stands for "cumulative sum". The input of this operator is a
numerical vector x = (x₁, . . . , x_n). The output is a numerical vector s of the same
dimension, with s_k = x₁ + ⋯ + x_k. We can visualize the fluctuations of z(k) around
the expected value 1/6 using the R code

plot(1:N, z, type="l", xlab="# of rolls",


ylab="average number 6-s")
abline(h=1/6,col="red")

Figure A.1 depicts the output.

Fig. A.1 Rolling a die.

Example A.8 (Samples without replacement). To sample without replace-


ment 7 balls from an urn containing balls labeled 1 through 23 use the R command
sample(1:23, 7)
The number of possible samples above is (27)7 and to compute it use the R command
prod(21:27)
t
u

Example A.9 (Permutations). To sample a random permutation of 7 objects


use the R command
sample(1:7,7)
To sample 10 random permutations of 7 objects use the R command
for (i in 1:10 ) print(sample(1:7,7))
To compute 7! in R use the command
factorial(7)

Example A.10 (Combinations). Sampling random m-element subsets out of an
n-element set is possible in R. For example, to sample 4 random subsets
with 2 elements out of a 7-element set use the following command

replicate(4, sort( sample(1:7, 2) ))

The sampled sets will appear as columns. To compute the binomial coefficient
C(52, 5) in R use the command

choose(52,5)

Example A.11 (Custom discrete distribution). We can produce custom dis-


crete random variables in R.
Suppose that we want to simulate a discrete random variable X whose values,
sorted in increasing order, are
x1 = 0.1, x2 = 0.2, x3 = 0.3, x4 = 0.7.
The corresponding probabilities are
p1 = 1/3, p2 = 1/6, p3 = 1/4, p4 = 1/4.
The R-commands below describe how to compute the mean and the variance of X
and how to sample X.

X<-c(0.1,0.2,0.3,0.7) # stores the values of X in


increasing order.

prob<-c(1/3,1/6,1/4,1/4) # stores the probabilities.


sum(prob) # This is a test. If this is 1 prob is a pmf.
# Otherwise check prob.

m<-sum(X*prob) # computes the mean of X and stores in m.


v<-sum((X^2)*prob) -m^2# computes the variance of X
# and stores it in v.

m # produces the value of the mean.

v # produces the variance of X.

sample(X,15, replace=TRUE, prob) # produces 15 random


#samples of X.

cumsum(prob) # computes the values of the cdf of X at


# x_1,x_2,...

In R the symbol # indicates a comment. It is only for the programmer's/user's
benefit. Anything following a # is not treated by R as a command.

Example A.12 (Useful discrete distributions). The standard discrete
distributions are implemented in R.

The distribution The R command


The binomial distribution Bin(n, p) binom(n,p)
The geometric distribution Geom(p) geom(p)
The negative binomial distribution NegBin(k, p) nbinom(k,p)
The Poisson distribution Poi(λ) pois(lambda)

The R library, however, uses slightly different conventions.

(i) The geometric distribution in R is slightly different from the one described in
this book. In R, the range of a Geom(p) variable T is {0, 1, . . .} and its pmf is
P[T = n] = p(1 − p)^n. In this book, a geometric random variable has range
{1, 2, . . .} and its pmf is P[T = n] = p(1 − p)^{n−1}; see Example A.14.
(ii) In R the equality nbinom(k, p) = n represents the number of failures until we
register the k-th success; see Example A.15.

The above commands by themselves mean nothing if they are not accompanied
by one of the prefixes

• d produces the density or pmf.
• p produces the cdf.
• r produces random samples.
• q produces quantiles.

You can learn more details using R’s help function. The examples below describe
some concrete situations.

Example A.13 (Binomial). For example, suppose that X ∼ Bin(10, 0.2), i.e., X
is the number of successes in a sequence of 10 independent Bernoulli trials with
success probability 0.2.
To find the probability P(X = 3) use the R command
dbinom(3,10,0.2)
If FX (x) = P(X ≤ x) is the cdf of X, then you can compute FX (4) using the R
command
pbinom(4,10,0.2)
To generate 253 random samples of X use the command
rbinom(253,10,0.2)
To find the 0.8-quantile of X use the R command
qbinom(0.8,10,0.2)

Example A.14 (Geometric). Suppose now that T ∼ Geom(0.2) is the waiting


time until the first success in a sequence of independent Bernoulli trials with success
probability p = 0.2.
To find the probability P(T = 3) use the command
dgeom(3-1,0.2)

To find the probability P(T ≤ 4) use the command


pgeom(4-1,0.2)
To generate 253 random samples of T use the command
1+rgeom(253,0.2)
To find the 0.8-quantile of T use the R command
qgeom(0.8,0.2)+1

Example A.15 (Negative Binomial). Suppose that T ∼ NegBin(8, 0.2) is the
waiting time for the first 8 successes in a string of Bernoulli trials with success
probability p = 0.2.
To find the probability P(T = 12) use the R command
dnbinom(12-8,8,0.2)
You can compute P(T ≤ 14) using the R command
pnbinom(14-8,8,0.2)
To generate 253 random samples of T use the command
8+rnbinom(253,8,0.2)
To find the 0.8-quantile of T use the R command
8+qnbinom(0.8,8,0.2)

Example A.16 (Poisson). Suppose that X ∼ Poi(0.2) is a Poisson random vari-


able with parameter λ = 0.2.
To find the probability P(X = 3) use the command
dpois(3,0.2)
To find the probability P(X ≤ 4) use the command
ppois(4,0.2)
To generate 253 random samples of X use the command
rpois(253,0.2)
To find the 0.8-quantile of X use the R command
qpois(0.8,0.2)

Example A.17 (Continuous distributions in R). The continuous distributions
Unif(a, b), Exp(λ) and N(µ, σ²) can be simulated in R by invoking

unif(min=a, max=b)

exp(rate=lambda)

norm(mean=mu, sd=sigma)

where sd := standard deviation.


To invoke the standard normal random variable you could use the shorter
command

norm


As in the case of discrete distributions, we utilize these commands with the
prefixes d-, p-, q- and r-, which have the same meaning as in Example A.12.
Thus d- will generate the pdf, p- the cdf, r- generates a random sample, and q-
produces quantiles.

Example A.18. Here are some concrete examples. To find the probability density
of Exp(3) at x = 1.7 use the command

dexp(1.7, 3)

To find the probability density of N (µ = 5, σ 2 = 7) at x = 2.6 use the command


dnorm(2.6,5, sqrt(7))
To produce 1000 samples from Unif(3, 13) use the command
runif(1000,3,13)

Example A.19 (Gambler’s ruin). Consider two players the first with fortune
$a, and the second with fortune $b. Set N := a + b. They flip a fair coin. Heads,
player 1 gets a dollar from player 2, Tails, player 1 gives a dollar to player 2. The
game ends when one of them is ruined. One can simulate this in R using the code

r<-function(a,N){
t<-0
x<-a
v<-c(0,N)
while(all(v!=x)){
f<-sample(0:1,1, replace=TRUE)

x<-x+(2*f-1)
t<-t+1
}
y<-c(x,t)
y
}

The output is a two-dimensional vector. Its first entry is the fortune of the first
player at the end of the game, while the second entry is the duration of the game,
i.e., the number of coin flips until one of them is ruined.
To compute the winning probability of the first player and the expected duration
of a game we can use the Law of Large Numbers and run a large number G of games

empiric_r<-function(G,a,N){
P<-c()
T<-c()
for(i in 1:G){
g<-r(a,N) # play one game and record both its outcome and its duration
P<-c(P,g[1])
T<-c(T,g[2])
}
c(sum(P==N)/G,sum(T)/G)
}

For example, to run G = 1200 games with the first player's initial fortune a = 8
and the combined fortune of the two players N = 15, use the command

empiric_r(1200,8,15)

The output is a two-dimensional vector. Its first entry describes the fraction of
the G games won by the first player, and the second entry is the average duration
of these G games.
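For the fair coin, the classical gambler's ruin formulas give P(player 1 wins) = a/N and an expected duration of a(N − a) flips, so the simulated output can be compared with the exact values:

```r
# exact answers for the fair game with a = 8, N = 15
a <- 8; N <- 15
c(a/N, a*(N - a))   # 0.5333... and 56
```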
One can also visualize a game. The code below produces a vector whose entries
describe the evolution of the fortune of the first player.

rgr<-function(a,N){
x<-a
z<-c(a)
v<-c(0,N)
while(all(v!=x)){
f<-sample(0:1,1,replace=TRUE)
x<-x+(2*f-1)
z<-c(z,x)
}

Fig. A.2 The ruin problem.

z
}

For given values of N and a say, N = 25, a = 12, one can visualize the evolution
of the fortune of the first player using the code below. Its output is a graph similar
to the one in Figure A.2.

N<-25
a<-12
u<-rgr(a,N)
l<-length(u)-1
plot(0:l, u,type="l", xlab="# of flips",
ylab="the fortune of the first player",ylim=c(0,N))
abline(h=c(0,N),col=c("red","red") )


Example A.20 (Buffon’s needle problem). The R program below uses the
Buffon needle problem (see Exercise 1.22) to find an approximation of π.

L<-0.7 # L is the length of the needle. It is <1.


N<-1000000 # N is the number of times we throw the needle.
f<-0
#the next loop simulates the tossing of
#N random needles and computes
# the number f of times they intersect a line

for (i in 1:N){
y<-runif(1, min=-1/2,max=1/2) #this locates
# the center of the needle
t<-runif(1, min=-pi/2,max=pi/2)#this determines

#the inclination of the needle


if ( abs(y)< 0.5*L*cos(t) ) f<-f+1 }
#f/N is the empirical frequency
"the approximate value of pi is"; (N/f)*2*L


Example A.21 (Monte Carlo). The R-command lines below implement the
Monte Carlo strategy for computing a double integral over the unit square

# Monte Carlo integration of the function f(x,y)


#over the rectangle [a,b] x[c,d]
# First we describe the function
f<- function(x,y) sin(x*y)
# Next, we describe the region of integration [a,b]x[c,d]
a=0
b=1
c=0
d=1
# Finally, we decide the number N of sample points in
# the region of integration
N=100000
#S will store the integral
S=0
for (i in 1:N){
x<- runif(1,a,b) #we sample a point uniformly in [a,b]
y<- runif(1,c,d) #we sample a point uniformly in [c,d]
S<-S+f(x[1],y[1])
}
’the integral is’; (b-a)*(d-c)*S/N

The next code describes a Monte-Carlo computation of the area of the unit
circle.

nsim<-1000000#nsim is the number of simulations


x<-runif(nsim,-1,1)#we choose nsim uniform samples
#in the interval (-1,1) on the x axis
y<-runif(nsim,-1,1)#we choose nsim uniform samples
#in the interval (-1,1) on the y axis
area<-4*sum(x^2+y^2<1)/nsim
"the area of the unit circle is very likely"; area

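Because R operations are vectorized, the loop in the first program above can be avoided altogether; the sketch below computes the same integral and also reports the Monte Carlo standard error:

```r
set.seed(1)
N <- 100000
x <- runif(N, 0, 1)    # sample N points uniformly in [0,1]x[0,1]
y <- runif(N, 0, 1)
v <- sin(x*y)
mean(v)                # Monte Carlo estimate of the integral
sd(v)/sqrt(N)          # its standard error
```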

Example A.22. Suppose that we have a probability distribution prob on the al-
phabet {1, 2, . . . , L}. One experiment consists of sampling the alphabet according to
the distribution prob until we first observe the given word (or pattern) patt. The
following R-routine performs m such experiments and returns an m-dimensional
vector f whose components are the cumulative means of the waiting times
    f_k = (1/k) ∑_{j=1}^k T_j,   k = 1, . . . , m,

where Tj is the time to observe the pattern in the j-th experiment.

Tpattern<-function(patt, prob, m, L){


k<-length(patt)
T<-c()
for (i in 1:m){
x<-sample(1:L,k,replace=TRUE, prob)
n<-k
while ( all(x[(n-k+1):n]==patt)==0 ){
x<-c(x, sample(1:L,1,replace=TRUE, prob) )
n<-n+1
}
T<-c(T,n)
}
f<-cumsum(T)/(1:m)
f
}

If prob is the uniform distribution use the faster routine

Tpatt_unif<-function(patt, m, L){
k<-length(patt)
T<-c()
for (i in 1:m){
x<-sample(1:L,k,replace=TRUE)
n<-k
while ( all(x[(n-k+1):n]==patt)==0 ){
x<-c(x, sample(1:L,1,replace=TRUE) )
n<-n+1
}
T<-c(T,n)
}
f<-cumsum(T)/(1:m)
f
}

In the uniform case, the expected waiting time to observe the pattern patt can
be determined using the routine below, which relies on the identity (3.1.11) in
Example 3.31.

tau<-function(patt,L){
n<-length(patt)
m<-n-1
t<-L^n   # the full pattern always overlaps itself
for (i in 1:m){
j<-n-i
k<-i+1
# add L^(n-i) when the length-j prefix equals the length-j suffix
t<-t+ all(patt[1:j]==patt[k:n])*L^(n-i)
}
t
}
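As a sanity check, for a fair coin (L = 2) the expected waiting time for the pattern HH is 6, while for HT it is 4. The self-contained simulation below (not part of the routines above) estimates the first value directly:

```r
# simulate the waiting time for two consecutive 1s in fair coin flips
set.seed(1)
waitHH <- function() {
  prev <- sample(0:1, 1)
  n <- 1
  repeat {
    cur <- sample(0:1, 1)
    n <- n + 1
    if (prev == 1 && cur == 1) return(n)
    prev <- cur
  }
}
mean(replicate(10^4, waitHH()))   # should be close to 6
```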

Bibliography

D. Aldous: Exchangeability and related topics, École d'été de Probabilités de Saint-Flour
XIII — 1983, pp. 2–199, Lect. Notes in Math., vol. 1117, Springer Verlag, 1985.
D. Aldous: Probability Theory, Course notes, Spring 2017.
https://www.stat.berkeley.edu/~aldous/205B/chewi_notes.pdf
V. I. Arnold, A. Avez: Ergodic Problems of Classical Mechanics, Addison Wesley, 1968.
R. B. Ash: Probability and Measure Theory, (with contributions from C. Doléans-Dade),
2nd Edition, Academic Press, 2000.
S. Asmussen: Applied Probability and Queues, 2nd Edition, Stoch. Modelling and Appl.
Probab., vol. 51, Springer Verlag, 2003.
K. B. Athreya, P. E. Ney: Branching Processes, Springer Verlag, 1972.
S. Banach: Über die Baire'sche Kategorie gewisser Funktionenmengen, Studia Mathematica,
3(1931), 174–179.
P. Bamberg, S. Sternberg: A Course in Mathematics for Students of Physics, vol. 2,
Cambridge University Press, 1990.
R. N. Bhattacharya, E. C. Waymire: Stochastic Processes with Applications, SIAM, 2009.
R. N. Bhattacharya, E. C. Waymire: A Basic Course in Probability Theory, 2nd Edition,
Springer Verlag, 2016.
P. Billingsley: Ergodic Theory and Information, John Wiley & Sons, 1965.
P. Billingsley: Convergence of Probability Measures, 2nd Edition, John Wiley & Sons,
1999.
N. H. Bingham: Fluctuation theory for the Ehrenfest Urn, Adv. Appl. Prob. 23(1991),
598–611.
D. Blackwell, D. Freedman: The tail σ-field of a Markov chain and a theorem of Orey,
Ann. Math. Statist., 35(1964), 1291–1295.
V. I. Bogachev: Measure Theory. Vol. 1, Springer Verlag, 2007.
R. Bott: On induced representations, in volume The Mathematical Heritage of Hermann
Weyl, Proc. Symp. Pure Math., vol. 48, Amer. Math. Soc., 1988
S. Boucheron, G. Lugosi, P. Massart: Concentration Inequalities. A Nonasymptotic Theory
of Independence, Oxford University Press, 2013.
N. Bourbaki: General Topology, Part 2, Hermann, 1966.
L. Breiman: Probability, SIAM, 1992.
P. Brémaud: Markov Chains, Gibbs Fields, Monte Carlo Simulations and Queues, Springer
Verlag, 1999.

525
P. Brémaud: Probability Theory and Stochastic Processes, Springer Verlag, 2020.


H. Brezis: Functional Analysis, Sobolev Spaces and Partial Differential Equations, Uni-
versitext, Springer Verlag, 2011.
J. Bricmont: Making Sense of Statistical Mechanics, Springer Verlag, 2022.
S. A. Broughton, K. Bryan: Discrete Fourier Analysis and Wavelets. Applications to Signal
and Image Processing, Second Edition, John Wiley & Sons, 2018.
J. C. Butcher: Numerical Methods for Ordinary Differential Equations, 3rd Edition, John
Wiley & Sons, 2016.
I. Chavel: Eigenvalues in Riemannian Geometry, Academic Press, 1984.
J. Cheeger: A lower bound for the smallest eigenvalue of the Laplacian, in Gunning (ed.)
Problems in Analysis, pp. 199–205, Princeton University Press, 1970.
Y. S. Chow, H. Robbins, D. Siegmund: Great Expectations: The Theory of Optimal Stopping,
Houghton Mifflin Co., 1971.
Y. S. Chow, H. Teicher: Probability Theory. Independence, Interchangeability, Martin-
gales, 3rd Edition, Springer Verlag, 1997.
K. L. Chung, J. L. Doob: Fields, Optionality and Measurability, Amer. J. Math., 87(1965),
397–424.
K. L. Chung, F. AitSahlia: Elementary Probability Theory: With Stochastic Processes and
an Introduction to Mathematical Finance, Springer Verlag, 2003.
V. Chvátal, D. Sankoff: Longest common subsequences of two random sequences, J. Appl.
Prob. 12(1975), 306–315.
E. Çinlar: Probability and Stochastics, Graduate Texts in Math., vol. 261 Springer Verlag,
2011.
E. G. Coffman Jr., G. S. Lueker: Probabilistic Analysis of Packing and Partitioning Algo-
rithms, John Wiley & Sons, 1991.
D. L. Cohn: Measure Theory, 2nd Edition, Birkhäuser, 2013.
I. P. Cornfeld, S. V. Fomin, Ya. G. Sinai: Ergodic Theory, Springer Verlag, 1982.
T. M. Cover, J. A. Thomas: Elements of Information Theory, Wiley-Interscience, 2006.
T. M. Davies: The Book of R, No Starch Press, 2015,
https://nostarch.com/bookofr
C. Dellacherie, P.-A. Meyer: Probabilities and Potential, vol. A, North Holland Mathe-
matical Studies, vol. 29, Hermann Paris, 1978.
C. Dellacherie, P.-A. Meyer: Probabilities and Potential, vol. C, North Holland, 1988.
P. Diaconis: Group Representations in Probability and Statistics, Institute of Mathemat-
ical Statistics, 1988.
P. Diaconis: The Markov chain Monte Carlo revolution, Bull. Amer. Math. Soc., 46(2009),
179–205.
P. Diaconis, R. Griffiths: Exchangeable pairs of Bernoulli random variables, Krawtchouk
polynomials, and Ehrenfest urns, Aust. N. Z. J. Stat. 1(2012), 81–101.
P. Diaconis, R. Griffiths: An Introduction to multivariate Krawtchouk polynomials and
their applications, arXiv: 1309.0112, J. Stat. Plann. Inference, 154(2014), 39–53.
P. Diaconis, B. Skyrms: Ten Great Ideas About Chance, Princeton University Press, 2018.
P. Diaconis, D. Stroock: Geometric bounds for eigenvalues of Markov chains, Ann. Appl.
Prob., 1(1991), 31–61.
J. L. Doob: Stochastic Processes, John Wiley & Sons, 1953.
P. G. Doyle, J. L. Snell: Random Walks and Electrical Networks, MAA, 1984.


L. E. Dubins, L. J. Savage: How to Gamble if You Must. Inequalities for Stochastic Processes.
R. M. Dudley: Real Analysis and Probability, Cambridge University Press, 2004.
R. M. Dudley: Uniform Central Limit theorems, Cambridge University Press, 1999.
N. Dunford, J. T. Schwartz: Linear Operators. Part I: General Theory, John Wiley &
Sons, 1957.
R. Durrett: Probability. Theory and Examples, 5th Edition, Cambridge University Press,
2019.
A. Dvoretzky, P. Erdös, S. Kakutani: Nonincrease everywhere of the Brownian motion
process, 1961 Proc. 4th Berkeley Sympos. Math. Statist. and Prob., Vol. II pp. 103–
116 Univ. California Press, Berkeley, Calif.
P. Erdös, A. Rényi: On a classical problem of probability theory, Magyar Tudományos
Akadémia Matematikai Kutató Intézetének Közleményei, 6(1961), 215–220.
G. Fayolle, V. A. Malyshev, M. V. Menshikov: Topics in the Constructive Theory of
Countable Markov Chains, Cambridge University Press, 1995.
W. Feller: The Kolmogorov-Smirnov theorems for empirical distributions, Ann. Math.
Statistics, 19(1948), 177–189.
W. Feller: An Introduction to Probability Theory and its Applications, Volume 1, 3rd
Edition, John Wiley & Sons, 1970.
W. Feller: An Introduction to Probability Theory and its Applications, Volume 2, 2nd
Edition, John Wiley & Sons, 1970.
L. Floridi: Information: A Very Short Introduction, Oxford University Press, 2010.
D. Foata, A. Fuchs: Processus Stochastiques. Processus de Poisson, Chaînes de Markov et
Martingales, 2nd edition, Dunod, 1998.
T. Frankel: The Geometry of Physics, 3rd Edition, Cambridge University Press, 2011.
D. Freedman: Brownian Motion and Diffusion, Springer Verlag, 1983.
B. Fristedt, L. Gray: A Modern Approach to Probability Theory, Birkhäuser, 1997.
F. R. Gantmacher: Theory of Matrices, vol. 2, AMS Chelsea Publishing, 2000.
M. Gardner: Time Travel and Other Mathematical Bewilderments, W. H. Freeman & Co.,
1988.
A. Garsia: A simple proof of E. Hopf’s maximal ergodic theorem, J. Math. Mech.,
14(1965), 381–382.
I. M. Gelfand, N. Ya. Vilenkin: Generalized Functions. Volume 4. Applications of Harmonic
Analysis, Academic Press, 1964.
J. P. Gilbert, F. Mosteller: Recognizing the maximum of a sequence, J. Amer. Stat. Assoc.,
61(1966), 35–73.
E. Giné, R. Nickl: Mathematical Foundations of Infinite Dimensional Statistical Models,
Cambridge University Press, 2016.
J. Gleick: The Information. A History. A Theory. A Flood, Pantheon Books, 2011.
B. V. Gnedenko, A. N. Kolmogorov: Limit Distributions for Sums of Independent Random
Variables, Addison Wesley, 1968.
A. Grigoryan: Introduction to Analysis on Graphs, University Lect. Series, Amer. Math.
Soc., 2018.
G. R. Grimmett: Probability on Graphs. Random Processes on Graphs and Lattices,
Cambridge University Press, 2011.
G. R. Grimmett, D. R. Stirzaker: Probability and Stochastic Processes, 4th Edition,


Oxford University Press, 2020.
L. J. Guibas, A. M. Odlyzko: String overlaps, pattern matching and nontransitive games,
J. Comb. Th. Series A, 30(1981), 183–208.
O. Häggström: Finite Markov Chains and Algorithmic Applications, Cambridge University
Press, 2002.
P. R. Halmos: Lectures on Ergodic Theory, Dover, 2017.
G. H. Hardy: Divergent Series, Oxford University Press, 1949.
T. E. Harris: The Theory of Branching Processes, Springer Verlag, 1963.
J. Hawkins: Ergodic Dynamics. From Basic Theory to Applications, Springer Verlag, 2021.
B. Hayes: The first links in a Markov chain, American Scientist, 101(2013), No. 2, p. 92.
https://www.americanscientist.org/article/first-links-in-the-markov-chain
M. Jerrum, A. Sinclair: Approximate Counting, Uniform Generation and Rapidly Mixing
Markov Chains, Information and Computation, 82(1989), 93–133.
M. Kac: Random walk and the theory of Brownian motion, Amer. Math. Monthly,
54(1947), 369–391.
O. Kallenberg: Foundations of Modern Probability, 3rd Edition, Springer Verlag, 2021.
S. Karlin, H. M. Taylor: A First Course in Stochastic Processes, 2nd Edition, Academic
Press, 1975.
J. G. Kemeny, J. L. Snell: Finite Markov Chains, with a new appendix “Generalization of
a fundamental matrix”, Springer Verlag, 1983.
J. H. B. Kemperman: The Passage Problem for a Stationary Markov Chain, The University
of Chicago Press, 1961.
H. Kesten, P. Ney, F. Spitzer: The Galton-Watson process with mean one and finite
variance, Th. Prob. Appl., 11(1966), 513–540.
J. M. Keynes: A Treatise on Probability, MacMillan and Co. London, 1921.
Available at Project Guttenberg http://www.gutenberg.org/ebooks/32625.
J. F. C. Kingman: Uses of exchangeability, Ann. Prob., 6(1978), 183–197.
A. Klenke: Probability Theory. A Comprehensive Course, Universitext, Springer Verlag,
2008.
U. Krengel: Ergodic Theorems. With a supplement by Antoine Brunel, Walter de Gruyter,
1985.
A. N. Kolmogorov: Grundbegriffe der Wahrscheinlichkeitsrechnung, Springer 1933. English
translation Foundations of the Theory of Probability, Chelsea 1950.
A. N. Kolmogorov: The theory of transmission of information, the volume Selected Works
of A. N. Kolmogorov. Volume III. Information Theory and the Theory of Algorithms,
pp. 6–33, Kluwer Academic Publishers, 1993.
E. Kowalski: An Introduction to Expander Graphs, Société Mathématiques de France,
2019.
L. Kuipers, H. Niederreiter: Uniform Distribution of Sequences, John Wiley & Sons, 1974.
Dover reprint 2006.
K. Kuratowski, A. Mostowski: Set Theory. With an Introduction to Descriptive Set Theory,
North Holland Publishing Co., 1976.
S. Lang: Linear Algebra, 3rd Edition, Springer Verlag, 1987.
N. N. Lebedev: Special Functions and Their Applications, Dover, 1972.
M. Ledoux, M. Talagrand: Probability in Banach Spaces, Springer Verlag, 1991.
J.-F. Le Gall: Intégration, Probabilités et Processus Aléatoires,


https://www.math.u-psud.fr/~jflegall/IPPA2.pdf
J.-F. Le Gall: Brownian Motion, Martingales, and Stochastic Calculus, Graduate Texts in
Math., vol. 274, Springer Verlag 2016.
D. A. Levin, Y. Peres, E. L. Wilmer: Markov Chains and Mixing Times, Amer. Math.
Soc., 2009.
P. Lévy: Théorie de l’Addition des Variables Aléatoires, Gauthier-Villars, 1937.
P. Lévy: Processus Stochastiques et Mouvement Brownien, Gauthier Villars, 1965.
S.-Y. R. Li: A martingale approach to the study of occurrence of sequence patterns in
repeated experiments, Ann. Prob. 8(1980), 1171–1176.
A. Lubotzky: Expander graphs in pure and applied mathematics, Bull. A.M.S., 49(2012),
pp. 113–162.
T. Lyons: A simple criterion of transience of a reversible Markov chain, Ann. Prob.,
11(1983), 393–402.
R. Lyons, Y. Peres: Probability on Trees and Networks, Cambridge University Press, 2017.
M. Loève: Probability Theory, vol. I, 4th Edition, Graduate Texts in Math. no. 45,
Springer Verlag, 1977.
D. J. MacKay: Information Theory, Inference and Learning Algorithms, Cambridge Uni-
versity Press, 18th printing, 2017.
R. Mansuy: The origins of the word "martingale", Electronic Journal for History of Probability and Statistics, vol. 5, Fasc. 1, (2009), 1–9.
http://www.jehps.net/juin2009.html
J. Matoušek: Lectures on Discrete Geometry, Graduate Texts in Math. no. 212, Springer
Verlag, 2002.
S. Mazurkiewicz: Sur les fonctions non dérivables, Studia Mathematica, 3(1931), 92–94.
M. McCaffrey: Markov Chains: A Random Walk Through Particles, Cryptography, Web-
sites and Card Shuffling, Senior Thesis, University of Notre Dame, 2017,
https://www3.nd.edu/~lnicolae/Thesis_v3.pdf.
M. Mitzenmacher, E. Upfal: Probability and Computing. Randomized Algorithms and
Probability Analysis, 7th printing, Cambridge University Press, 2013.
P. Mörters, Y. Peres: Brownian Motion, Cambridge University Press, 2010.
C. St. J. A. Nash-Williams: Random walks and electric currents in networks, Proc. Cambridge Phil. Soc., 55(1959), 181–194.
D. J. Newman, L. Shepp: The double dixie-cup problem, Amer. Math. Monthly,
67(1960), 58–61.
L. I. Nicolaescu: Introduction to Real Analysis, World Scientific, 2020.
L. I. Nicolaescu: Lectures on the Geometry of Manifolds, 3rd Edition, World Scientific,
2021.
J. R. Norris: Markov Chains, Cambridge University Press, 1997.
R. E. A. C. Paley, N. Wiener, A. Zygmund: Note on random functions, Math. Z., 1933.
J. L. Palacios: Fluctuation theory for the Ehrenfest urn via electric networks, Adv. Appl.
Prob., 25(1993), 472–476.
K. R. Parthasarathy: Probability Measures on Metric Spaces, Academic Press, 1967.
V. V. Petrov: Limit Theorems of Probability Theory. Sequences of Independent Random
Variables, Oxford University Press, 1995.
I. Pinelis: Martingales converging in probability and not a.s., MathOverflow,


https://mathoverflow.net/a/410350/20302
D. Pollard: Convergence of Stochastic Processes, Springer Verlag, 1984.
D. Pollard: Empirical Processes: Theory and Applications, NSF-CBMS Regional Confer-
ence Series in Probability and Statistics, vol. 2, 1990.
M. Pollicott, M. Yuri: Dynamical Systems and Ergodic Theory, Cambridge University
Press, 1998.
S. L. Resnick: Adventures in Stochastic Processes, Birkhäuser, 2002.
F. Riesz: Sur la théorie ergodique, Comment. Math. Helv. 17(1944), 221–239.
R. T. Rockafellar: Convex Analysis, Princeton University Press, 1997.
L. C. G. Rogers, D. Williams: Diffusions, Markov Processes and Martingales. Volume 1.
Foundations, Cambridge University Press, 2000.
R. L. Schilling, L. Partzsch: Brownian Motion. An Introduction to Stochastic Processes,
DeGruyter, 2012.
K. Schmüdgen: The Moment Problem, Springer Verlag, 2017.
D. Serre: Matrices. Theory and Applications, 2nd Edition, Grad. Texts Math., vol. 216,
Springer Verlag, 2010.
S. Shalev-Shwartz, S. Ben-David: Understanding Machine Learning. From Theory to Al-
gorithms, Cambridge University Press, 2014.
A. N. Shiryaev: Probability, 2nd Edition, Springer Verlag, 1996.
A. Sinclair: Algorithms for Random Generation and Counting: A Markov Chain Approach,
Progress in Theoretical Com. Sci., Springer Verlag, 1993.
P. M. Soardi: Potential Theory of Infinite Networks, Lect. Notes. Math., vol. 1590, Springer
Verlag, 1994.
R. Stanley: Enumerative Combinatorics. vol. 1, 2nd Edition, Cambridge University Press,
2012.
J. M. Steele: Probability Theory and Combinatorial Optimization, CBMS-NSF Regional
Conf. Series in Appl. Math., SIAM, 1997.
J. M. Steele: Stochastic Calculus and Financial Applications, Springer Verlag, 2001.
J. M. Stoyanov: Counterexamples in Probability, 3rd Edition, Dover, 2013.
L. Takács: On an urn problem of Paul and Tatiana Ehrenfest, Math. Proc. Camb. Phil.
Soc., 86(1979), 127–130.
M. Taylor: Measure Theory and Integration, Grad. Studies in Math., Amer. Math. Soc.,
2006.
C. B. Thomas: Representation Theory of Finite and Lie Groups, World Scientific, 2004.
A. W. van der Vaart, J. A. Wellner: Weak Convergence and Empirical Processes. With
Applications to Statistics, Springer Verlag, 1996.
V. N. Vapnik, A. Ya. Chervonenkis: On the uniform convergence of relative frequencies to
their probabilities, Theor. Probab. Appl., 16(1971), 264–280.
V. N. Vapnik: The Nature of Statistical Learning Theory, 2nd Edition, Springer Verlag,
2000.
R. S. Varadhan: Probability, Courant Lect. Notes in Math., Amer. Math. Soc., 2001.
M. Viana, K. Oliveira: Foundations of Ergodic Theory, Cambridge University Press, 2016.
R. von Mises: Probability, Statistics and Truth, 2nd Edition, Dover, 1981.
M. J. Wainwright: High Dimensional Statistics. A Non-Asymptotic Point of View, Cambridge University Press, 2019.
H. Weyl: Über die Gleichverteilung von Zahlen mod Eins, Math. Ann., 77(1916), 313–352.
E. T. Whittaker, G. N. Watson: A Course of Modern Analysis, 4th Edition, Cambridge
University Press, 1950.
H. S. Wilf: Generatingfunctionology, Academic Press, 1994.
D. Williams: Probability with Martingales, Cambridge University Press, 1991.

Index

Ac , xi Markov(X , µ, Q), 369


B(a, b), 143 Meas(X), 170
B(x, y), 507 NegBin(k, p), 56
Ba,b (x), 143 Φ(x), 66
DΓ , 335 ΦX , 179
Dn [f ], 10 Poi(λ), 59
F# µ, 15 Prob(X), 170
HΓ , 336 Prob(Ω, S), 13
Hx , 385 Proc(F• ), 333
L-function, 333 Proc(Fprog ), 333
Lp (Ω, S, µ), 37 Procprog , 333
N (µ, σ 2 ), 65 Reff (x+ , S− ), 421
Nx , 387 Studp , 144
P GX , 142 Unif(a, b), 64
R-function, 333 kf kp , 38
Txk , 387 a.s., 14
TA , 384, 385 E[X], 42 
E X k F , 94

Tx , 384
X ∼ Y , 42 E Y k X = x , 94
X ∼ µ, 42 E Y k X , 94
d
X = Y , 42 EP X , 42
T
X• , 270 G(σ 2 ), 194
XT , 269, 337 Γµ,σ2 , 65
[f ]n , 10 Hϕ , 48
#S, xi In , xi
Ber(p), 15 MX (t), 51
P F k S , 114

Bin(n, p), 16, 54
Couple(µ, ν), 404 P S k F , 98
Elem(Ω, S), 9, 31 T-filtration
Ent2 , 164 see filtration, 260
Gamma(ν, λ), 67 βa,b (x), 68
Γ(x), 507 γ µ,σ2 , 65
i∈I Si , 3
W
Geom(p), 55
HGeom(w, b, n), 58 λn , xi
Λ(C), 5 ω n , xi, 508


C(X• ), 266 trace of a, 4


σ n−1 , xi, 508 σ(Xi , i ∈ I), 11
χ(r), X ), 293 σ[X], 49
a.s.
χ2 (n),
 67  → , 83
d
Cov X, Y , 50 →, 170
p
dens(F), 210 →, 85
dimV C (F), 210 2X , xi
BT , xi 2T0 , 127
BX , 4 2X0 , xi
CT , 128 Var[X], 49
F⊥ ⊥ G H, 110 |S|, xi
FT , 271, 334 ϕ-entropy, 48
Fprog , 333 {F ∈ S}, 3
Ft+ , 335 ]]S, T ]], 270
L0 (Ω, S), 8 dv , 403
L0 (Ω, S)∗ , 69 f +, 9
L0 (S), 8 f −, 9
L1 (Ω, S, µ), 32 gν (x; λ), 66
L∞ (Ω, S), 8 u ∼ v, 410
L0+ (Ω, S), 8 x0 ↔ x1 , 378
L1+ (Ω, S, µ), 32 x0 → x1 , 378
Nµ , 16
OT , 478 absolutely continuous, 37
S|X , 4 absorbing state, 380
Sµ , 16 acceptance-rejection, 237, 459
S1 ∨ S2 , 3 accessible states, 378
S∞ , 326 AEP, 166, 167
ess sup, 38 algebra, 2


, 37 Bernoulli, 3
Sn , xi alphabet, 163
X̂T , 301 entropy of the, 164
i.i.d., 132 André’s reflection trick, 29, 136, 344, 356
i.o., 83 aperiodic, 382
λ-system, 5 arcsine distribution, 69
hXi, 267 Arnold’s cat map, 476, 500
dxe, xi
bxc, xi baker’s transform, 504

⊥ Y , 21 ballot problem, 27, 274
µ ∗ ν, 77 Banach space, 38
λ
µ ! ν, 404 Bernoulli
µ0 ⊗ µ1 , 70 algebra, 3
ν  µ, 37 random variable, 54
π-system, 5 trial, 54
πx , 396  Bernoulli shift, 478
ρ X, Y , 50 Beta distribution, 68, 120, 144, 332
σ-additive, 13 Beta function, 68, 332, 507
σ-algebra, 2 incomplete, 143
Borel, 4 bin packing, 283
complete, 16 binomial distribution, 16
completion of a, 16 bit, 164
Borel in distribution, 170


σ-algebra, 4 in law, 170
set, 4 in probability, 85, 296
subset, 4 convex function, 47, 190
Borel-Cantelli, see lemma conjugate of, 191
boundary operator, 415 convolution, 77
bounded difference, 284 correlation, 50
branching process, 263, 300, 321, 377 coupling, 404
Brownian event, 222 time, 405
Brownian motion, 225, 229 coupon collector problem, 57, 244, 308
quadratic variation, 233 covariance, 50
started at 0, 231 covariance form, 218
strong Markov property, 341 covariance kernel, 220
Brownian motion reflection principle, 343 cumulant, 189
cumulative distribution function, 19
Cesàro convergent, 158, 495 current, 415
Cesàro means, 158, 490 cutting
Cayley graph, 442 see electric network, 426
cdf, 19, 20, 41 cylinder, 128
chains, 414
chamber, 3, 92 début time, 335
character, 443, 499 derangement, 63
characteristic function, 179 derangements problem, 63
Cheeger constant, 455 detailed balance equations, 393, 451
Chernoff dipole, 425
bound, 189 Dirac measure, 15
method, 189, 280 Dirichlet form, 451
closed set, 380 disintegration, 120
irreducible, 380 kernel, 120
coboundary operator, 417 distribution
cochain, 417 Beta, 68, 120, 144, 332
code binomial, 16
binary, 168 chi-squared, 67
instantaneous, 168 compound Poisson, 249
Shannon, 168 Erlang, 67
uniquely decodable, 168 exponential, 74, 79, 243
coercive, 294 Gamma, 66, 144
communication class, 379 geometric, 55, 73, 243
compensator, 266 hypergeometric, 58
conditional negative binomial, 55, 56
expectation, 93, 94, 136 of stochastic process, 125, 217
independence, 110 Poisson, 80
index, 135 Student, 144
probability, 25, 98, 114 distribution Bernoulli, 15
conditional distribution, 115 distribution function, 19
conductance, 414, 455 Doeblin condition, 472
convergence Doob
Lp , 86 conditions, 271, 278
almost sure, 83 Doob decomposition, 266, 358
in p-mean, 86 downcrossing, 286
effective function
conductance, 421 Beta, 507
resistance, 421 convex, 47
Ehrenfest urn, 375, 383, 394, 433, 445, elementary, 9
450, 473 Gamma, 507
electric network, 414 strictly convex, 47
cutting, 426
shorting, 426 Galton-Watson process, 263, 290, 300
empirical gambler’s ruin, 307, 375, 390, 519
distribution, 200 Gamma function, 66, 507
process, 204 Gaussian
empirical gap, 119 Hilbert space, 224
entropy, 48, 164, 252, 472 measure, 65, 218, 254
information, 164, 193 covariance form, 218
rate, 472 process, 220
relative, 252 centered, 220
Shannon, 164, 193 random function, 222
equidistribution, 492 random variables, 65
ergodic map, 480 regression, 256
ess sup, 38 vector, 218, 254, 255
Euler means, 406 white noise, 223, 364
event, 14 gaussian
almost sure, 14 measure
exchangeable, 326 centered, 219
improbable, 14 graph, 265
permutable, 326 locally finite, 265
exchangeable, 325 random walk on, 265
sequence, 325, 331, 477
expectation, 42 Haar
exponential martingale, 346 basis, 498
extinction functions, 498, 503
event, 291, 300 Hamming distance, 285, 445
probability, 291 harmonic function, 265, 408, 420
Hermite polynomials, 139, 251, 346
Fatou’s lemma, 35 hitting time, 269, 336, 385, 408, 420
filtration, 260 HMC, 369
complete, 335 Laplacian of, 408
right-continuous, 335 reversible, 393
usual, 335 time reversed, 393
flow hypothesis class, 212
Kirchhoff, 422 PAC learnable, 213
formula
Bayes’, 30 independence
Fourier inversion, 443 conditional, 110
Stirling, 390, 392 independency, 21
Stirling’s, 508 independent
Viète, 246 events, 21
Wald, 137, 305, 306 families, 21
formula Stirling, 244 random variables, 21
Fourier transform, 179, 443 indicator function, xi
inequality law
Azuma, 209, 280, 282, 284 of rare events, 63
Bonferroni, 60 of total probability, 26
motivic, 60 law of total probability, 122
Cauchy-Schwarz, 90
Chebyshev, 49 Lebesgue
Doob’s Lp , 318, 348 measurable, 72
Doob’s maximal, 317, 347 measure, 72
Doob’s upcrossing, 287, 289, 349 Lebesgue integral, 32
Gibbs, 165, 193 Lebesgue measure, 19
Hölder, 38, 191 lemma
Hoeffding, 195, 208, 280, 285 ‘sooner-rather-than-later’, 276, 307
Jensen, 47, 164 Borel-Cantelli, 83–85, 157, 227, 290, 338
Kolmogorov’s maximal, 153 first, 84
Markov, 34 second, 84, 356, 358
McDiarmid, 285 Fatou, 36, 178, 179, 289, 296, 303, 323,
Mills ratio, 66, 140, 231 349, 411
Minkowski, 38 Fekete, 89, 283
infinitely divisible Hoeffding, 196, 281
distribution, 249 Kronecker, 358
random variable, 249 Kronecker’s, 158
integrable, 32 maximal, 488
invariance principle, 232 Sauer, 210
invariant Scheffé’s, 298
distribution, 392, 397 likelihood, 30
function, 478 likelihood ratio, 320
measure, 392 log-normal distribution, 142
set, 478 logistic map, 503
irreducible longest common subsequence problem, 88
HMC, 380 Lusin space, see space
set, 380 Lyapunov function, 407
coercive, 410
joint probability distribution, 76
map
kernel, 111 measurable, 6
disintegration, 120 Markob property
Markovian, 112 strong, 389
probability, 112 Markov
pullback, 112 chain, 368
push-forward by, 112 aperiodic, 383, 406
Kirchhoff current, 418 irreducible, 380
potential of, 418 null recurrent, 397
Kirchhoff’s laws, 416, 418 positively recurrent, 397, 400, 411
Kolmogorov sequence, 481 recurrent, 388, 410, 435
Koopman operator, 483 reversible, 393, 413, 441
Kullback-Leibler divergence, 192, 253 transient, 388, 409, 435
path space, 371
L-process, 333 Markov property, 368, 373, 385, 395
Lévy’s martingale, 409 strong, 385, 387–389, 400, 401
Laplacian, 408, 451 martingale, 260–264, 266, 268, 270, 272,
275, 277, 280, 284, 290, 306, 307, 309, motion


319, 345, 408, 409 Brownian, 229
Lp , 319 Brownian standard, 225
Lp -bounded, 319 pre-Brownian, 217
backwards, 322
bounded Lp , 319 negligible, 16
closed, 261, 282 noisy dynamical system, 378
component, 266 null recurrent, 397
De Moivre, 262, 308
discrete time, 260 Ohm’s law, 418
Doob, 261, 282 optimal gambling strategy, 311
exponential, 346 optimal stopping, 103
quadratic variation, 267 orbit, 478, 492
matrix order statistics, 146, 247
matrix, 438 Orlicz function, 253
primitive, 439
mean, 42 paradox
measurable waiting time, 80
map, 6 partition, 3
set, 2 chamber of, 3
space, 2 path space, 477
isomorphism, 6 period, 382
measure, 13 persistent state, 386
σ-finite, 13 pgf, 44, 53, 142, 361
Borel, 16 pmf, 42
Dirac, 15 Poincaré phenomenon, 198, 253
finite, 13 Poisson approximation, 63
inner regular, 129 Poisson process, 80
Lebesgue, 19 poissonization, 63
Lebesgue-Stieltjes, 74 Polya’s urn, 137, 263, 331
outer regular, 129 positively recurrent, 397
probability, 13 posterior, 30
pushforward of a, 15 predictable, 266
Radon, 130, 176 predictor, 98
regular, 129 premeasure, 18
signed, 404 σ-finite, 18
uniform, 15, 16 principle
measure preserving, 475 Dirichlet, 424
measured space, 14 inclusion-exclusion, 59
memoryless property, 74 Raleigh, 425, 426
Metropolis reflection, 344
algorithm, 459 Thompson, 424
chain, 459 prior, 30
mgf, 51 probability
Mills ratio, 66, 140, 231 generating function, 44, 53, 142
mixture, 77, 113, 121, 325 measure, 13
mixtures, 149 Euclidean, 40
moment generating function, 51 space, 14
monotone class, 10 probability distribution
Monte-Carlo method, 162 continuous, 64
joint, 76 function, 503


problem random variable, 27, 154, 206, 214
ballot, 27, 274 symmetrization, 206
Banach’s matchbox, 137 random
bin packing, 283 measure, 112, 325
birthday, 138, 244 variable, 14
Buffon, 138, 521 walk, 27, 261, 262, 265, 292, 357, 360,
coupon collector, 57, 135, 244, 308 369, 376, 383, 390, 393
derangements, 63 on groups, 442
gambler’s ruin, see gambler’s ruin standard, 27
longest common subsequence, 88, 281 random variable, 14
occupancy, 243 Bernoulli, 54
Polya’s urn, 137, 263, 331 binomial, 54
secretary, 106 cdf of, 41
process discrete, 42
branching, 263, 300, 321 probability mass function, 42
empirical, 204 distribution of, 16, 41
exchangeable, 325 expectation, 42
Galton-Watson, see Galton-Watson exponential, 67, 74
process finite, 14
independent increments, 345 Gamma, 67
L-, 333 Gaussian, 65
measurable, 333 geometric, 55
Poisson, 80, 147, 148, 249, 306, 346, hypergeometric, 58
363, 464 law of, 41
predictable, 266, 268, 358 mean, 42
progressive, 333 negative binomial, 56
R-, 333 normal, 65
renewal, 82 Poisson, 59
separable, 347, 349 probability distribution of, 41
stochastic, 123 Raleigh, 244
distribution, 125, 217 standard normal, 66
path, 223 subgaussian, 194
stopped, 270 uniform, 64
product formula, 25 random vector, 76
projective family, 128 probability distribution of, 76
pushforward, 15, 16 recurrence class, 388
recurrent state, 386
quadratic variation, 233, 267, 363 Reflection Principle, 343, 344
optional, 354 regular version, 114, 115
predictable, 354 return time, 384, 420
quantile, 20, 201 reversible, 393, 413, 450
quasi-invariant
function, 479 sample range, 119
set, 479 sample space, 14
quasi-mixing, 501 secretary problem, 106
separating collection, 170
R-process, 333 shift, 478
Rademacher Bernoulli, 478
complexity, 209 shorting, see electric network
sieve, 62, 63 survival function, 74


sigma-algebra, 2, 3
SLE, 441, 444, 451 tail
SLLN, 151, 156, 161, 200, 203, 204, 321, algebra, 24, 326, 479
330, 338, 401 events, 24
space tail-algebra, 481
Lusin, 115, 116, 128 temporal average, 484
measurable, 2 tent map, 477, 498, 503
measured, 14 theorem
Polish, 116 Lp -martingale convergence, 319
standard measurable, 116 asymptotic equipartition property, 165
stable sequence, 241 Backwards Martingale Convergence,
standard deviation, 49 325, 482
state Birkhoff’s ergodic, 487
accessible, 378 Blumenthal’s 0-1 law, 339
aperiodic, 382 Bochner, 183
period of, 382 Bounded Convergence, 91, 297
persistent, 386 Carathéodory Extension, 18, 129
recurrent, 386 Cayley-Hamilton, 441
transient, 386 central limit, 186
stationary début, 336
distribution, 392 de Finetti, 326, 330
measure, 392 Dominated Convergence, 35, 39, 86, 90,
stationary sequence, 477 91, 94, 171, 174, 179, 184, 293,
Stieltjes 297, 313, 320, 342, 352
measure, 20 Donsker, 232
Stirling’s formula, 390, 392, 508 Doob’s regularization, 347
stochastic integral Dynkin, 11, 94
discrete, 268, 358 Dynkin’s π − λ, 5, 7, 23
stochastic matrix, 369 ergodic, 402, 491
stochastic process, 123, 260 Fubini-Tonelli, 70, 295
indistinguishable, 226 Glivenko-Cantelli, 202
stochastically equivalent, 226 Hewitt-Savage 0-1 law, 330
version of, 226 Ionescu-Tulcea, 132
stopped process, 270 Kolmogorov existence, 54, 128, 370
stopping time, 103, 269, 384 Kolmogorov one series, 152, 223, 320
optimal, 104 Kolmogorov’s 0-1, 24, 84, 152, 299, 492
strictly convex function, 47 Kolmogorov-Smirnov, 203
strong Markov property, 341 Lévy’s 0-1, 299, 493
subadditivity, 89 Lévy’s continuity, 183
subgaussian, see random variable Lévy’s equivalence, 236
submartingale, 260, 266, 268, 270, 272, Lévy’s forgery, 366
287, 345 Lindeberg, 188
backwards, 322 Lindenstrauss-Johnson, 199
discrete time, 260 mapping, 172
superadditivity, 89 mean ergodic, 484
superharmonic function, 408 Monotone Class, 10, 70, 71, 94, 113,
supermartingale, 260, 312, 345, 355, 407 114, 372
backwards, 322 Monotone Convergence, 32, 35, 36, 46,
discrete time, 260 71, 83, 97, 113, 320, 349, 352
Optional Sampling, 272, 275, 278, 290, UI, 293, 294, 296, 298, 302–305, 307, 311,
302–305, 307, 310, 317 319
Optional Stopping, 270, 292, 302 uniform integrability, 293
Perron-Frobenius, 439 unimodality, 62
portmanteau, 172 union bound, 62
Radon–Nicodym, 98 upcrossing, 286
Radon–Nikodym, 37 number, 286
Raleigh, 426 usual conditions, 335, 338, 347
Riesz Representation, 40, 117
Slutsky, 175, 216, 252 vague convergence, 170
Strong Law of Large Numbers, 157, Vapnik-Chervonenkis, see VC
321, 330, 492 variance, 49
submartingale convergence, 289, 292, variation distance, 403
409, 411 VC
Tikhonov’s compactness, 131 dimension, 210
Wald’s formula, 305 family, 210
weak law of large numbers, 160
Weyl’s equidistribution, 494 waiting time, 80
theorem Kolmogorov continuity, 226 walk, 379
tight family, 248 weak convergence, 170
time weakly mixing, 500
hitting, 269 weight function, 15
optional, 334 Wiener
stopping, 269, 334 integral, 225, 257, 364
transience class, 388 measure, 232
transient state, 386 process, 225
transition matrix, 369 WLLN, 160
n-th step, 370
locally finite, 407 zero-one
tree, 436 algebra, 25, 480, 481, 493
radially symmetric, 436 event, 25, 480

You might also like