
Lecture Notes for Students of

Probability and Statistics


in a Mathematical Sciences Dept.

Code = Math254
Spring 2024

Quick navigation
Part I
Part II
Part III
Part IV
Appendices

Typeset in LaTeX, with vim + the vimtex plugin + the zathura viewer, on Linux


Distribution of these notes, in any form, is strictly prohibited ©Takis Konstantopoulos
Internal hyperreferences appear in this color. External hyperlinks appear in this color.

This version: 21st Jan, 2024


Parts of these notes

“To be, or not to be: that is the question.”


– William Shakespeare

These lecture notes contain five parts:

Part I: What is probability and how to teach and learn it


Part II: Introduction for university students
Part III: Elementary probability and prerequisite material
Part IV: Main topics for this module
Appendices

Only Part IV contains what the syllabus specifies. The lectures will be from Part IV.

Everything contained in the pink pages is material that is logically needed for a proper
understanding of Part IV. I know of no way to teach the topics of the syllabus, that is, Part IV,
without assuming that the student has an understanding of the pink pages. I could have chosen
not to include the pink pages. I did¹ only because I wanted to give the reader the opportunity
to find this material in a single document.

I repeat: the rule is simple and I have color-coded it:

If pink then it is not part of the lectures.

If white then it is part of the lectures.

This does not mean that pink=useless. It is very useful. But it’s not part of the syllabus or
the lectures during the term.
A chapter or section preceded by  is optional.

¹Hamlet’s dilemma in the epigraph above expresses my dilemma: should I add the pink pages or not?


Contents

1 Introduction 1
1.1 Guide: please read carefully . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

I WHAT IS PROBABILITY AND HOW TO TEACH AND LEARN IT 9

2 Notes for the instructor of probability in maths 10


2.1 The syllabus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Writing these notes and teaching from them . . . . . . . . . . . . . . . . . . . . 11
2.3 A list of misconceptions and bad practices . . . . . . . . . . . . . . . . . . . . 12

3 Notes for the student of probability in maths 15


3.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2 What does “know”, “understand”, “learn” mean . . . . . . . . . . . . . . . . 16
3.3 Advice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

4 Notes for anyone who speaks a human language 21

II INTRODUCTION FOR UNIVERSITY STUDENTS 26

5 Events, sets, logic and the language you speak 27

6 “The only two things you need to know” 31

7 Elementary probability properties 35

III ELEMENTARY PROBABILITY AND PREREQUISITE MATERIAL 39

8 Probability on discrete sets 40


8.1 Probability on finite sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
8.2 Uniform probability measure and counting . . . . . . . . . . . . . . . . . . . 42
8.3 Probability from data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

9 Random variables 49
9.1 Random variables are functions . . . . . . . . . . . . . . . . . . . . . . . . . . 49
9.2 The distribution of a random variable under a probability measure . . . . . 50

9.3 Expectation and moments of a discrete random variable. . . . . . . . . . . . 52


9.4 Random variables and probability from data . . . . . . . . . . . . . . . . . . 55
9.5 Indicator functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
9.6 Cheating undone. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

10 Classical problems of elementary nature 62

11 Conditional probability and independence 77


11.1 Motivation and properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
11.2 Using conditional probability . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
11.3 Independence between events . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
11.4 Independence between random variables . . . . . . . . . . . . . . . . . . . . 92
11.5 Uncorrelated random variables . . . . . . . . . . . . . . . . . . . . . . . . . . 95
11.6 Conditional expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
11.7 Some wisdom to keep in mind . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

IV MAIN TOPICS FOR THIS MODULE 98

12 Remembrances and foresights 99


12.1 Remembrances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
12.2 Foresights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

13 Bernoulli trials ad infinitum 102


13.1 Bernoulli random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
13.2 Finitely many Bernoulli trials . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
13.3 Binomial random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
13.4 Poisson random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
13.5 Infinitely many Bernoulli trials . . . . . . . . . . . . . . . . . . . . . . . . . . 107
13.6 Geometric random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
13.7 Limits of geometric random variables . . . . . . . . . . . . . . . . . . . . . . 109

14 Densities per se 112


14.1 Mass and probability density functions . . . . . . . . . . . . . . . . . . . . . 112
14.2 Distribution functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
14.3 Some common laws . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
14.3.1 The uniform distribution on a bounded interval . . . . . . . . . . . . . 116
14.3.2 The exponential distribution . . . . . . . . . . . . . . . . . . . . . . . . 118
14.3.3 The normal law . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
14.4 Functions of random variables with densities . . . . . . . . . . . . . . . . . . 123
14.5 Densities in higher dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . 126
14.5.1 Probability distribution function of a random vector with density . . . 127
14.5.2 Marginal densities; conditional densities . . . . . . . . . . . . . . . . . 129
14.5.3 Independence of random variables . . . . . . . . . . . . . . . . . . . . . 130
14.5.4 Some special densities on R2 . . . . . . . . . . . . . . . . . . . . . . . . . 131
14.5.4.1 Uniform density on a finite area planar set . . . . . . . . . . . . 131
14.5.4.2 The standard normal density on the plane . . . . . . . . . . . . 131

14.5.5 Change of variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132

15 Probability laws on bigger spaces 146


15.1 Recapitulation and motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
15.2 The real line: distribution functions and densities . . . . . . . . . . . . . . . 147
15.3 Classification of (distributions of) random variables . . . . . . . . . . . . . . 153
15.4 Pause for recollection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
15.5 Random vectors and their laws . . . . . . . . . . . . . . . . . . . . . . . . . . 156
15.6 Beyond Rn . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161

16 Expectation, unadulterated 163


16.1 Expectation via approximation . . . . . . . . . . . . . . . . . . . . . . . . . . 164
16.2 The law of the unconscious statistician . . . . . . . . . . . . . . . . . . . . . . 167
16.3 Independence, revamped . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
16.4 Convex functions and moments . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
16.4.1 Convex functions of random variables . . . . . . . . . . . . . . . . . . . . 171
16.4.2 Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
16.5 Variance and covariance and correlation and Cauchy-Schwarz . . . . . . . . 172
16.6 Markov and Chebyshev inequalities . . . . . . . . . . . . . . . . . . . . . . . 174
16.7 Expectation of special functionals . . . . . . . . . . . . . . . . . . . . . . . . . 174
16.8 The probability generating function of an integer-valued random variable . 175
16.8.1  Bonus section . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
16.9 The moment generating function of a real-valued random variable . . . . . 180
16.9.1 Expectation and covariance of random vectors . . . . . . . . . . . . . . 184
16.9.2 Moment generating function of random vectors . . . . . . . . . . . . . 185

17 The fundamental theorem of probability 190


17.1 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
17.2 The statement of the strong law of large numbers . . . . . . . . . . . . . . . . 192
17.3 The explanation of the strong law of large numbers in a simpler case . . . . 193
17.4  Laws of Large Numbers in Mathematics, Physics and Statistics . . . . . . 197
17.5 The weak law of large numbers . . . . . . . . . . . . . . . . . . . . . . . . . . 199

18 Normality, normally and smoothly 201


18.1 Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
18.2 The unavoidability of the normal law . . . . . . . . . . . . . . . . . . . . . . 202
18.3 Normal(µ, σ2 ) distribution, reprise . . . . . . . . . . . . . . . . . . . . . . . . . 204
18.4 Normal law in higher dimensions . . . . . . . . . . . . . . . . . . . . . . . . . 205
18.5 Deriving the density for normal law on Rd . . . . . . . . . . . . . . . . . . . 207
18.6  Whitening . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
18.7 Normal distribution and the circle . . . . . . . . . . . . . . . . . . . . . . . . 210
18.8 Conditional distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212

19 Conditionally 214
19.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
19.2 Euclidean projections, platonically . . . . . . . . . . . . . . . . . . . . . . . . . 215

19.3 Euclidean projections, linearly . . . . . . . . . . . . . . . . . . . . . . . . . . . 217


19.4 Conditional expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
19.4.1 Properties of conditional expectation . . . . . . . . . . . . . . . . . . . 228
19.4.2 Conditional variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
19.5 Conditional probability measures . . . . . . . . . . . . . . . . . . . . . . . . . 229
19.6 Convolutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
19.7 Conditioning under normality . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
19.7.1 Conditional expectation for two normal r.v.s . . . . . . . . . . . . . . . 235
19.7.2 Conditional expectation for many normal r.v.s . . . . . . . . . . . . . . 236
19.7.3 Conditional variance under normality . . . . . . . . . . . . . . . . . . . 238
19.7.4 Conditional probability distribution under normality . . . . . . . . . . 239
19.7.5 Conditional probability distribution under normality, II . . . . . . . . . 241

20 The central limit theorem 242


20.1 Rate of convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
20.2 Limits of sequences of random variables . . . . . . . . . . . . . . . . . . . . . 243
20.3 Rate of convergence of the law of large numbers . . . . . . . . . . . . . . . . 245
20.4 The classical central limit theorem . . . . . . . . . . . . . . . . . . . . . . . . 247
20.5 Confidence intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248

21 Special distributions used in statistics 253


21.1 The gamma(λ, α) distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
21.1.1 Further properties of the Γ function . . . . . . . . . . . . . . . . . . . . 255
21.2 The χ2 (d) distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
21.3 Degrees of freedom . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
21.4 The F(m, n) distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
21.5 Decoupling of sample mean and sample variance . . . . . . . . . . . . . . . 262
21.6 The t(n) distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264

22  Random objects 267

23  Bernoulli trials and the Poisson point process 268


23.1 Bernoulli trials on a general index set . . . . . . . . . . . . . . . . . . . . . . . 268
23.2 The Poisson construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270

A Counting 273

B Calculus 283

Index 288
List of PROBLEMS

3.1 PROBLEM (−1 times −1 equals +1) . . . . . . . . . . . . . . . . . . . . . . . . 18


3.2 PROBLEM (no need for formulas) . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.3 PROBLEM (half-knowledge is often worse than no knowledge) . . . . . . . 18
3.4 PROBLEM (human vs machine) . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.5 PROBLEM (poetic answers may be beautiful but seldom precise) . . . . . . 19
3.6 PROBLEM (knowing what a limit is is not the same as knowing how to apply
computational tricks) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.1 PROBLEM (are you surprised that you got 5 heads in a row?) . . . . . . . . 22
4.2 PROBLEM (is your coin fair?) . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.3 PROBLEM (probability of 2 consecutive heads) . . . . . . . . . . . . . . . . . 22
4.4 PROBLEM (in how many coin arrangements do you have at most 4 consec-
utive heads?) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5.1 PROBLEM (sentences and sets) . . . . . . . . . . . . . . . . . . . . . . . . . . 28
5.2 PROBLEM (describe your sets logically) . . . . . . . . . . . . . . . . . . . . . 28
6.1 PROBLEM (select 2 numbers out of a 100; what’s the sample space?) . . . . 31
6.2 PROBLEM (The cardinality function) . . . . . . . . . . . . . . . . . . . . . . . 33
6.3 PROBLEM (The area function) . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
6.4 PROBLEM (two extreme event collections) . . . . . . . . . . . . . . . . . . . . 34
6.5 PROBLEM (events A, B generate more events) . . . . . . . . . . . . . . . . . . 34
7.1 PROBLEM (the conjunction fallacy) . . . . . . . . . . . . . . . . . . . . . . . . 37
7.2 PROBLEM (probabilities must add up to what?) . . . . . . . . . . . . . . . . 38
7.3 PROBLEM (estimating chance of winning in the lottery) . . . . . . . . . . . 38
7.4 PROBLEM (deadly sins) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
7.5 PROBLEM (rich or famous) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
8.1 ?PROBLEM (probability measures on finite sets) . . . . . . . . . . . . . . . . 40
8.2 PROBLEM (a biased one-pound coin) . . . . . . . . . . . . . . . . . . . . . . . 41
8.3 ?PROBLEM (the product rule produces a new probability measure) . . . . 41
8.4 PROBLEM (a pair of coins) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
8.5 PROBLEM (a small chessboard) . . . . . . . . . . . . . . . . . . . . . . . . . . 41
8.6 PROBLEM (chance that a number is even) . . . . . . . . . . . . . . . . . . . . 42
8.7 PROBLEM (chance that a number is divisible by k) . . . . . . . . . . . . . . . 43
8.8 PROBLEM (roll three dice, get at least one 6) . . . . . . . . . . . . . . . . . . 43
8.9 PROBLEM (a tourist in London) . . . . . . . . . . . . . . . . . . . . . . . . . . 43
8.10 ?PROBLEM (select a set at random) . . . . . . . . . . . . . . . . . . . . . . . . 43
8.11 ?PROBLEM (derive the formula for mn ) . . . . . . . . . . . . . . . . . . . . . 44
8.12 ?PROBLEM (number of one-to-one functions between finite sets) . . . . . 45


8.13 ?PROBLEM (labeled balls in labeled boxes) . . . . . . . . . . . . . . . . . . 45


8.14 ?PROBLEM (birthday coincidences) . . . . . . . . . . . . . . . . . . . . . . . 45
8.15 PROBLEM (data from 10 coin tosses) . . . . . . . . . . . . . . . . . . . . . . . 47
8.16 PROBLEM (experimental problem) . . . . . . . . . . . . . . . . . . . . . . . . 47
9.1 PROBLEM (3 coins and two random variables) . . . . . . . . . . . . . . . . . 49
9.2 PROBLEM (3 coins and the laws of two random variables) . . . . . . . . . . 50
9.3 ?PROBLEM (the law of the unconscious statistician) . . . . . . . . . . . . . 52
9.4 ?PROBLEM (monotonicity of expectation) . . . . . . . . . . . . . . . . . . . 52
9.5 PROBLEM (3 coins and the expectation of two random variables) . . . . . . 52
9.6 PROBLEM (3 coins and a non-uniform probability measure) . . . . . . . . . 53
9.7 ?PROBLEM (the expectation may be infinite or may not exist!) . . . . . . . 53
9.8 PROBLEM (Markov’s inequality) . . . . . . . . . . . . . . . . . . . . . . . . . 55
9.9 PROBLEM (data and the empirical probability space) . . . . . . . . . . . . . 56
9.10 PROBLEM (continuation of Problem 9.9) . . . . . . . . . . . . . . . . . . . . . 57
9.11 ?PROBLEM (properties of indicator functions) . . . . . . . . . . . . . . . . . 57
9.12 ?PROBLEM (indicator of union: inclusion-exclusion) . . . . . . . . . . . . . 59
9.13 PROBLEM (a very useful, albeit trivial, identity) . . . . . . . . . . . . . . . . 59
9.14 PROBLEM (summing the tail gives the expectation) . . . . . . . . . . . . . . 59
10.1 PROBLEM (shuffling the letters of a word) . . . . . . . . . . . . . . . . . . . . 62
10.2 ?PROBLEM (tossing k dice n times) . . . . . . . . . . . . . . . . . . . . . . . . 63
10.3 PROBLEM (sum of dice rolls) . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
10.4 ?PROBLEM (permutations in a row and on a circle) . . . . . . . . . . . . . . 64
10.5 ?PROBLEM (dancing pairs) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
10.6 ?PROBLEM (choosing a k-member committee from a set of n people–see
Problem 8.10) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
10.7 PROBLEM (probability that a random committee has k members) . . . . . . 65
10.8 ?PROBLEM (Maxwell-Boltzmann model) . . . . . . . . . . . . . . . . . . . . 66
10.9 ?PROBLEM (Bose-Einstein model) . . . . . . . . . . . . . . . . . . . . . . . . 66
10.10 ?PROBLEM (Fermi-Dirac model) . . . . . . . . . . . . . . . . . . . . . . . . . 67
10.11 PROBLEM (Maxwell-Boltzmann, Fermi-Dirac and Bose-Einstein) . . . . . . 67
10.12 ?PROBLEM (distinguishable balls in distinct boxes) . . . . . . . . . . . . . 67
10.13 PROBLEM (expected number of particles at a given state according to Bose-Einstein) . . 68
10.14 PROBLEM (seating with avoidance) . . . . . . . . . . . . . . . . . . . . . . . 69
10.15 PROBLEM (sampling without replacement) . . . . . . . . . . . . . . . . . . . 69
10.16 ?PROBLEM (sampling without replacement, general case) . . . . . . . . . . 70
10.17 ?PROBLEM (sampling without replacement, another view) . . . . . . . . . 70
10.18 PROBLEM (sampling without replacement, comparison) . . . . . . . . . . . 71
10.19 PROBLEM (matching socks) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
10.20 PROBLEM (sample constituency) . . . . . . . . . . . . . . . . . . . . . . . . . 71
10.21 PROBLEM (poker probabilities) . . . . . . . . . . . . . . . . . . . . . . . . . . 72
10.22 PROBLEM (temperature affects energy levels) . . . . . . . . . . . . . . . . . 73
10.23 ?PROBLEM (matching problem–see Problem 10.5) . . . . . . . . . . . . . . . 74
10.24 ?PROBLEM (number of matchings–see Problems 10.5 and 10.23) . . . . . . 75
11.1 PROBLEM (motivating conditional probability) . . . . . . . . . . . . . . . . 77
11.2 PROBLEM (a simple problem on conditioning) . . . . . . . . . . . . . . . . . 78

11.3 PROBLEM (false positive and false negative errors: observations affect de-
cisions) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
11.4 ?PROBLEM (birthday coincidences–Problem 8.14 revisited) . . . . . . . . . 81
11.5 PROBLEM (a coin whose probability of heads is random!) . . . . . . . . . . 82
11.6 PROBLEM (strategy for getting the gift: every bit of information counts) . 83
11.7 PROBLEM (prisoner’s dilemma) . . . . . . . . . . . . . . . . . . . . . . . . . . 84
11.8 PROBLEM (where is the coin?) . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
11.9 PROBLEM (application in business) . . . . . . . . . . . . . . . . . . . . . . . 85
11.10 PROBLEM (uniform distribution on several coin tosses begets independence) 87
11.11 ?PROBLEM (product probability measure on S1 × S2 begets independence) 88
11.12 ?PROBLEM (product probability measure on the product of n sets) . . . . 88
11.13 PROBLEM (a not-so obvious independence) . . . . . . . . . . . . . . . . . . 89
11.14 PROBLEM (pairwise independence but not independence) . . . . . . . . . . 89
11.15 PROBLEM (pairwise independence but not independence, again) . . . . . . 90
11.16 PROBLEM (uniform distribution on many coin tosses begets independence, again) . . 91
11.17 PROBLEM (symmetry begets independence) . . . . . . . . . . . . . . . . . . 92
11.18 ?PROBLEM (criterion for independence) . . . . . . . . . . . . . . . . . . . . 92
11.19 ?PROBLEM (independence of many implies independence of fewer) . . . . 93
11.20 ?PROBLEM (independence of events and their indicator random variables) 93
11.21 ?PROBLEM (toss k dice n times, as in Problem 10.2, again) . . . . . . . . . . 93
11.22 ?PROBLEM (independence of disjoint sets of r.v.s) . . . . . . . . . . . . . . 93
11.23 PROBLEM (independence in presence of common variable) . . . . . . . . . 94
11.24 PROBLEM (symmetry begets independence of sorts) . . . . . . . . . . . . . 94
11.25 ?PROBLEM (variance of sum of uncorrelated random variables) . . . . . . 95
11.26 ?PROBLEM (the expectation of the conditional expectation) . . . . . . . . . 96
12.1 ?PROBLEM (“let there be finitely many i.i.d. r.v.s” can always be said) . . . 100
13.1 PROBLEM (joint distribution of n i.i.d. Bernoulli trials) . . . . . . . . . . . . 103
13.2 ?PROBLEM (formula for the binomial distribution) . . . . . . . . . . . . . . 103
13.3 ?PROBLEM (matching n men to n women) . . . . . . . . . . . . . . . . . . . 105
13.4 PROBLEM (estimation of the p of bin(n, p)) . . . . . . . . . . . . . . . . . . . 105
13.5 PROBLEM (distribution of infrequent errors) . . . . . . . . . . . . . . . . . . 106
13.6 ?PROBLEM (a sure event) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
13.7 PROBLEM (upper integer part) . . . . . . . . . . . . . . . . . . . . . . . . . . 110
13.8 ?PROBLEM (a sparse geometric r.v. assumes no specific value in the limit) 110
13.9 ?PROBLEM (the union of uncountably many events begets monsters) . . . 111
14.1 PROBLEM (an unbounded density) . . . . . . . . . . . . . . . . . . . . . . . . 113
14.2 PROBLEM (some properties of densities) . . . . . . . . . . . . . . . . . . . . 113
14.3 PROBLEM (examples and counterexamples of densities) . . . . . . . . . . . 113
14.4 ?PROBLEM (zero-length sets) . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
14.5 PROBLEM (semicircle density) . . . . . . . . . . . . . . . . . . . . . . . . . . 115
14.6 PROBLEM (triangular density and its distribution function) . . . . . . . . . 116
14.7 PROBLEM (distribution function, expectation and variance of unif([a, b])) . 117
14.8 PROBLEM (unif([a, b]) from unif([0, 1])) . . . . . . . . . . . . . . . . . . . . . 117
14.9 ?PROBLEM (we can’t choose uniformly at random from the set of real
numbers) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118

14.10 PROBLEM (distribution function and moments of expon(λ)) . . . . . . . . . 118
14.11 PROBLEM (scaling of an expon(λ) random variable) . . . . . . . . . . . . . 119
14.12 PROBLEM (discretizing an expon(λ) r.v. gives a geometric r.v.) . . . . . . . 119
14.13 ?PROBLEM (the memoryless property of an expon(λ) random variable) . . 120
14.14 ?PROBLEM (expectation and variance of the standard normal law) . . . . . 121
14.15 ?PROBLEM (density of the non-standard normal law) . . . . . . . . . . . . 122
14.16 PROBLEM (using the table of the normal distribution) . . . . . . . . . . . . 122
14.17 PROBLEM (using the table of the normal distribution) . . . . . . . . . . . . 123
14.18 PROBLEM (affine transformation) . . . . . . . . . . . . . . . . . . . . . . . . 124
14.19 PROBLEM (log-normal density) . . . . . . . . . . . . . . . . . . . . . . . . . . 124
14.20 PROBLEM (density of the square of a r.v.) . . . . . . . . . . . . . . . . . . . . 125
14.21 ?PROBLEM (a r.v. with density that has no expectation) . . . . . . . . . . . 125
14.22 PROBLEM (dimensions are not necessarily physical dimensions) . . . . . . 126
14.23 ?PROBLEM (help me give definitions) . . . . . . . . . . . . . . . . . . . . . . 128
14.24 ?PROBLEM (marginal densities) . . . . . . . . . . . . . . . . . . . . . . . . . 129
14.25 ?PROBLEM (marginal and conditional densities) . . . . . . . . . . . . . . . 129
14.26 ?PROBLEM (expectation of a function of many r.v.s) . . . . . . . . . . . . . 130
14.27 ?PROBLEM (expectation of product under independence) . . . . . . . . . . 131
14.28 ?PROBLEM (uniform law on a rectangle begets independence) . . . . . . . 131
14.29 PROBLEM (a probability measure on R2 , linearly transformed) . . . . . . . 136
14.30 PROBLEM (the probability that an equation has real roots) . . . . . . . . . . 137
14.31 PROBLEM (a circle in a square) . . . . . . . . . . . . . . . . . . . . . . . . . . 138
14.32 PROBLEM (uniform law on a disc begets independence) . . . . . . . . . . . 138
14.33 PROBLEM (two explosions) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
14.34 PROBLEM (a random determinant) . . . . . . . . . . . . . . . . . . . . . . . . 140
14.35 ?PROBLEM (the minimum of two independent exponential random variables) . . 140
14.36 PROBLEM (the maximum and the minimum, together) . . . . . . . . . . . . 141
14.37 PROBLEM (a dangerous particle hits the Earth) . . . . . . . . . . . . . . . . . 143
14.38 PROBLEM (choose a small square at random) . . . . . . . . . . . . . . . . . . 144
15.1 ?PROBLEM (adding uncountably many positive numbers always gives ∞) 147
15.2 ?PROBLEM (it satisfies the defining properties of a distribution function) 149
15.3 ?PROBLEM (uniform probability measure on a bounded interval) . . . . . 149
15.4 ?PROBLEM (continuous r.v.) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
15.5 PROBLEM (tossing a pencil) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
15.6 ?PROBLEM (an uncountable zero-length set) . . . . . . . . . . . . . . . . . . 151
15.7 ?PROBLEM (fundamental theorems of Calculus and the folklore above) . 153
15.8 ?PROBLEM (mixture is a relation between laws or between random variables) . . 153
15.9 PROBLEM (examples of the three basic types) . . . . . . . . . . . . . . . . . 154
15.10 ?PROBLEM (it’s useless to differentiate a singular function) . . . . . . . . . 155
15.11 PROBLEM (continuous random variables) . . . . . . . . . . . . . . . . . . . . 155
15.12 PROBLEM (an essential property of a 2-dimensional distribution function) . 157
15.13 ?PROBLEM (a continuous but not absolutely continuous random vector) . 159
15.14 PROBLEM (continuation of Problem 15.13) . . . . . . . . . . . . . . . . . . . 160
15.15 PROBLEM (use the normal table) . . . . . . . . . . . . . . . . . . . . . . . . . 161
15.16 ?PROBLEM (a non-trivial “infinite” event) . . . . . . . . . . . . . . . . . . . 162

16.1 PROBLEM (expectation under P and under Q) . . . . . . . . . . . . . . . . . 163


16.2 PROBLEM (scaling of exponential r.v.) . . . . . . . . . . . . . . . . . . . . . . 166
16.3 ?PROBLEM (integrating the tail gives the expectation) . . . . . . . . . . . . 166
16.4 PROBLEM (the law of the unconscious statistician, discretely) . . . . . . . . 167
16.5 PROBLEM (if you’re independent of yourself then you’re not random) . . . 171
16.6 ?PROBLEM (a zero variance r.v. is trivial) . . . . . . . . . . . . . . . . . . . . 174
16.7 PROBLEM (probability generating function of geo(p)) . . . . . . . . . . . . . 176
16.8 ?PROBLEM (differentiation of power series) . . . . . . . . . . . . . . . . . . 176
16.9 ?PROBLEM (probability generating functions of some common discrete r.v.s) . . 177
16.10 PROBLEM (probability generating function of the sum of independent Poisson r.v.s) . . 178
16.11 ?PROBLEM (thinning) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
16.12 ?PROBLEM (a useless moment generating function) . . . . . . . . . . . . . 182
16.13 ?PROBLEM (moment generating functions of some common r.v.s) . . . . . 182
16.14 PROBLEM (linear combination of independent normal r.v.s) . . . . . . . . . 183
16.15 PROBLEM (expectation of a simple random vector on the plane) . . . . . . 184
16.16 PROBLEM (expectations of certain uniform random vectors) . . . . . . . . . 185
16.17 ?PROBLEM (independence inferred from the moment generating function) 186
16.18 PROBLEM (sum and difference of exponentials) . . . . . . . . . . . . . . . . 186
16.19 ?PROBLEM (more thinning) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
16.20 ?PROBLEM (continuation of Problem 16.19) . . . . . . . . . . . . . . . . . . 188
16.21 ?PROBLEM (independent normals) . . . . . . . . . . . . . . . . . . . . . . . 188
17.1 PROBLEM (SLLN implies convergence of frequencies) . . . . . . . . . . . . 192
17.2 PROBLEM (SLLN with nonzero mean) . . . . . . . . . . . . . . . . . . . . . . 195
17.3 PROBLEM (SLLN for Bernoulli trials) . . . . . . . . . . . . . . . . . . . . . . 195
17.4 PROBLEM (computing the length of some set via the SLLN) . . . . . . . . . 196
17.5 PROBLEM (SLLN for functions of i.i.d. r.v.s) . . . . . . . . . . . . . . . . . . 196
17.6 ?PROBLEM (strong law implies weak law) . . . . . . . . . . . . . . . . . . . 199
18.1 PROBLEM (because of the Pythagorean theorem) . . . . . . . . . . . . . . . 201
18.2 PROBLEM (linear combination of i.i.d. standard normals) . . . . . . . . . . 204
18.3 PROBLEM (linear combination of independent centered normals) . . . . . 204
18.4 ?PROBLEM (linear combination of independent normals) . . . . . . . . . . 205
18.5 ?PROBLEM (uncorrelatedness ⇒ independence under normality) . . . . . 207
18.6 PROBLEM (whitening example) . . . . . . . . . . . . . . . . . . . . . . . . . . 210
18.7 PROBLEM (an open problem) . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
18.8 PROBLEM (normal or not?) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
19.1 PROBLEM (the three perpendiculars property) . . . . . . . . . . . . . . . . . 216
19.2 PROBLEM (when do projections add?) . . . . . . . . . . . . . . . . . . . . . . 217
19.3 PROBLEM (standard inner product on Rd ) . . . . . . . . . . . . . . . . . . . . 218
19.4 PROBLEM (inner product on Rd ) . . . . . . . . . . . . . . . . . . . . . . . . . 218
19.5 PROBLEM (linearity of projection) . . . . . . . . . . . . . . . . . . . . . . . . 220
19.6 ?PROBLEM (projections of projections) . . . . . . . . . . . . . . . . . . . . . 220
19.7 ?PROBLEM (computation of a projection) . . . . . . . . . . . . . . . . . . . . 220
19.8 PROBLEM (projection of a random variable onto two subspaces) . . . . . . 221
19.9 PROBLEM (projection in Bernoulli trials) . . . . . . . . . . . . . . . . . . . . 222
19.10 ?PROBLEM (projection when densities exist) . . . . . . . . . . . . . . . . . 224
LIST OF PROBLEMS xi

19.11 PROBLEM (conditional expectation with respect to a discrete random variable) . . . 225


19.12 ?PROBLEM (conditional expectation when densities exist) . . . . . . . . . 226
19.13 PROBLEM (an arbitrary example) . . . . . . . . . . . . . . . . . . . . . . . 226
19.14 ?PROBLEM (many properties of the conditional expectation) . . . . . . . 228
19.15 ?PROBLEM (expectation of conditional variance and variance of conditional expectation) . . . 229
19.16 PROBLEM (tossing i.i.d. coins with random probability of heads) . . . . . 230
19.17 ?PROBLEM (convolutions of densities) . . . . . . . . . . . . . . . . . . . . 233
19.18 PROBLEM (sum of two independent exponential r.v.s) . . . . . . . . . . . 234
19.19 PROBLEM (sum of independent uniform r.v.s) . . . . . . . . . . . . . . . . 234
19.20 PROBLEM (computation of a conditional expectation under normality) . . 237
19.21 PROBLEM (computation of another conditional expectation under normality) . . . 237
19.22 PROBLEM (conditional variance computation under normality) . . . . . . 238
19.23 PROBLEM (computation of a conditional density under normality) . . . . 239
19.24 PROBLEM (verification of conditional distribution under normality) . . . 240
20.1 PROBLEM (rate of convergence of approximations to e) . . . . . . . . . . . . 242
20.2 PROBLEM (example of strong convergence) . . . . . . . . . . . . . . . . . . . 243
20.3 PROBLEM (example of convergence in probability) . . . . . . . . . . . . . . 244
20.4 ?PROBLEM (strong convergence implies convergence in probability) . . . 244
20.5 PROBLEM (comparing distribution functions) . . . . . . . . . . . . . . . . . 245
20.6 PROBLEM (empirical standard deviation) . . . . . . . . . . . . . . . . . . . . 250
20.7 PROBLEM (confidence interval for the parameter p of a geo(p) distribution) 251
20.8 PROBLEM (continuation of Problem 20.7) . . . . . . . . . . . . . . . . . . . . 251
20.9 PROBLEM (a useful approximation for the normal distribution function) . 252
21.1 PROBLEM (correctness of the gamma(1, n) density) . . . . . . . . . . . . . . . 254
21.2 PROBLEM (domain of the gamma function) . . . . . . . . . . . . . . . . . . . 254
21.3 PROBLEM (gamma reproduction rule) . . . . . . . . . . . . . . . . . . . . . . 255
21.4 PROBLEM (gamma asymptotics) . . . . . . . . . . . . . . . . . . . . . . . . . 256
21.5 ?PROBLEM (density for the χ2 (d) distribution when d is even) . . . . . . . 257
21.6 ?PROBLEM (density for the χ2 (d) distribution for general d) . . . . . . . . . 257
21.7 PROBLEM (density of χ2 (4; a, a, b, b)) . . . . . . . . . . . . . . . . . . . . . . . 258
21.8 PROBLEM (moments for F(m, n)) . . . . . . . . . . . . . . . . . . . . . . . . . 259
21.9 ?PROBLEM (limit of F(m, n) when n → ∞) . . . . . . . . . . . . . . . . . . . . 260
21.10 ?PROBLEM (limit of F(m, n) when m → ∞) . . . . . . . . . . . . . . . . . 261
21.11 ?PROBLEM (the t(n) density) . . . . . . . . . . . . . . . . . . . . . . . . . 266
21.12 ?PROBLEM (t(1) = standard Cauchy) . . . . . . . . . . . . . . . . . . . . . 266
21.13 ?PROBLEM (t(∞) = N(0, 1)) . . . . . . . . . . . . . . . . . . . . . . . . . . 266
21.14 PROBLEM (moments of t(n)) . . . . . . . . . . . . . . . . . . . . . . . . . 266
23.1 PROBLEM (Bernoulli trials with general index set) . . . . . . . . . . . . . . 268
List of Theorems

12.1 Theorem (a sequence of i.i.d. random variables exists) . . . . . . . . . . . . . 100


12.2 Theorem (the area function exists) . . . . . . . . . . . . . . . . . . . . . . . . . 101
12.3 Theorem (equivalence of Theorems 12.1 and 12.2) . . . . . . . . . . . . . . . . 101
14.1 Theorem (a probability density defines a unique probability measure) . . . 114
15.1 Theorem (a distribution function defines a unique probability measure) . . 148
15.2 Theorem (an advanced version of the fundamental theorem of Calculus) . . 152
15.3 Theorem (decomposition of the law of any random variable) . . . . . . . . . 155
15.4 Theorem (multidimensional analog of Theorem 15.1) . . . . . . . . . . . . . 159
16.1 Theorem (expectation is unambiguously defined) . . . . . . . . . . . . . . . . 165
16.2 Theorem (interchanging limit and expectation) . . . . . . . . . . . . . . . . . 165
16.3 Theorem (the law of the unconscious statistician) . . . . . . . . . . . . . . . . 168
16.4 Theorem (moments define a unique probability law) . . . . . . . . . . . . . . 172
16.5 Theorem (Cramér-Wold theorem) . . . . . . . . . . . . . . . . . . . . . . . . . 186
17.1 Theorem (the fundamental theorem of Probability) . . . . . . . . . . . . . . . 192
17.2 Theorem (the fundamental theorem of Statistics) . . . . . . . . . . . . . . . . 199
18.1 Theorem (the normal law is unavoidable) . . . . . . . . . . . . . . . . . . . . 202
18.2 Theorem (existence of density of normal law on Rd ) . . . . . . . . . . . . . . 207
19.1 Theorem (existence of conditional expectation) . . . . . . . . . . . . . . . . . 225
20.1 Theorem (convergence of moment generating functions implies convergence
in distribution) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
20.2 Theorem (the classical central limit theorem) . . . . . . . . . . . . . . . . . . 247

List of hyperlinks

1. William Shakespeare, dilemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . i


2. David Mumford, on Platonism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
3. Hilbert’s 6th Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
4. Richard Feynman . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
5. Feynman: why we need a knowledge basis before demanding explanation . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
6. Feynman: Lectures on Physics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
7. Alexandrov, on morality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
8. Alexander Danilovich Alexandrov . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
9. Conjunction fallacy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
10. Online coin flipper . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
11. Jacob Bernoulli . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
12. Thales’ theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .118
13. Thales of Miletus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
14. Eratosthenes of Cyrene . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
15. Law of cosines for a spherical triangle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
16. Folland: Real Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148, 152
17. Euclid: The Elements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
18. Hilbert: The Foundations of Geometry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156, 215
19. Coxeter: Introduction to Geometry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
20. Descartes’ enigma . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
21. On Descartes’ death . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
22. The Guardian: Descartes was poisoned by a priest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
23. Centroid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
24. Various centroids . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
25. Liouville’s theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
26. Look and say sequence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
27. White noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
28. Andrey Nikolaevich Kolmogorov . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
29. Kolmogorov: Foundations of the Theory of Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
30. EYKΛEI∆EIOΣ ΓEΩMETPIA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
31. epi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .216
32. sexagesimal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
33. Emil Artin: The Gamma Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
34. Williams: Weighing the Odds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260

Chapter 1

Introduction

“I am a Platonist”
– David Mumford

These notes are tailored for a course titled ”Probability and Statistics II,” designed for
students who have completed two antecedent courses, namely Probability I and Statistics I, or
their equivalents at other academic institutions. Additionally, participants are expected to
have a foundation in university-level mathematics.
The notes are as elementary as possible. Nothing advanced is used herein and no theorem is actually
proved!
Arguably, no branch of mathematics rivals probability in combining profound interest
with a thoroughly contemporary character, marked by robust connections to theory and a
myriad of practical applications. Paradoxically, despite its inherent mathematical rigor, the
subject is frequently presented in an antiquated and sometimes erroneous manner, as if we
were still in the 19th century, a period when probability theory remained somewhat of a
mystery. Hilbert identified the need for a solid mathematical basis in his 6th Problem. Indeed,
at the time (early 20th c.) it was not even understood that probability theory was part of
mathematics or what kind of mathematics it should be based upon. Statistics had been dealt
with earlier, rather haphazardly, and its connection to probability was not fully understood.
Over a century later, probability is firmly established as a pivotal subject in mathematics,
wielding substantial influence and making invaluable contributions to various branches. We
employ the term ”STOCHASTICS” as a comprehensive umbrella, encompassing probability
and its myriad offshoots, including statistics. This umbrella extends over diverse areas such
as Insurance Mathematics, Financial Mathematics, Stochastic Control, Filtering, Information
Theory, Stochastic Differential Geometry, Stochastic Calculus, Statistical Physics, and more.
A solid understanding of probability is essential for working with its applications and for
having a novel point of view within mathematics itself. The 20th century could be hailed as
the era of the Stochastics Revolution, 1 underscoring the profound impact of probabilistic
thinking across disciplines.
1. But not only. Many other scientific and mathematical revolutions occurred.

CHAPTER 1. INTRODUCTION 2

To illustrate the transition from a pre- to post-revolutionary period in a scientific discipline,
consider the analogy of geometry’s evolution. In ancient times, geometry was developed
empirically and experimentally to meet specific needs—such as the construction of Sumerian
cities and Egyptian pyramids. However, it was Euclid who systematized geometry, recognizing
its potential for development through pure thought.
Over the subsequent 2000 years, geometry has ascended to the core of mathematics,
seamlessly integrating with algebra, differential equations, mechanics, and even probability.
In comparison, we’ve only recently recognized the valuable contributions probability
offers to mathematics. It’s crucial to note that statistics, a subset of probability, grapples
with challenging decisions, often involving the optimal selection from a class of probability
measures.
When crafting these notes, I took into account that we now live in the 21st century, marked
by a century of swift and profound developments in probability. It’s essential to convey this
rapid progress early in students’ studies. Moreover, we understand things very differently
from four decades ago, so there is no point in teaching as if these decades have not existed. The
subject has progressed by leaps and bounds and, having now placed it at the very core of all
mathematical sciences, books written 20 years ago seem to be obsolete.
I probably would not have written these notes 10 years ago because I explain too much. But,
having taught this particular course once last year, I realized that students need to be taught
explicitly because the curricula in their schools or learning institutions of their country/place
of origin have been trivialized and a lot of topics that fall under the title “mathematics” are
being taught, at best, at a vocational level. At the same time, I know of no way to teach other
than explain what I teach. I amplify this next.
From my days as a student, I harbored a strong aversion to instruction lacking explanation.
My approach involved challenging instructors until they either clarified the topic or conceded
their inability to do so. Understanding was a prerequisite for effective learning, not just for
me but for most of my classmates as well. When others struggled, it often fell upon me to
persuade them that true comprehension is the key to learning, especially in the realm of
mathematics.
Additionally, I adhere to a writing approach that prevents the creation of notes I would be
ashamed of or would loathe to read. I strive to produce content that, if I were a student, I
would find enjoyable to read and engage with. This involves a thorough explanation of every
detail without omission. Specifically, I steer clear of clichés and deceitful practices, aiming
to make the notes engaging. In a subject as dynamic as probability, replete with surprises,
there’s no justification for dull and uninteresting materials.
This year, a notable change is the explicit indication of the logically necessary material
for the topics outlined in the syllabus, presented in the pink pages . Whether the student
chooses to engage with this material depends on his or her familiarity with more elementary
probability concepts.
The challenge in writing these notes was immense, as I endeavored to reconcile several
seemingly incompatible objectives:
1. Explain without delving too deeply into complex details.
2. Maintain readability throughout the notes.

3. Engage the reader, provided he or she is willing to actively participate by reading and
solving problems thoroughly.

4. Avoid any form of cheating or, at the very least, provide comprehensive explanations
when shortcuts are taken.2

5. Refrain from presenting formulas without adequate explanation.3

6. Ensure logical consistency in the presentation.

7. Emphasize to the student that the provided material offers an incomplete understanding,
encouraging further learning, possibly at a higher academic level in subsequent classes.

8. Illuminate, through concrete problems, the senselessness and ineffectiveness of rote


memorization.

9. Make the notes valuable for both probability and statistics, with a particular focus on
applications.

There is a belief among some academic people that every subject should have “theory”
and “tutorials”. This is close to being absurd. There are topics that are best learned by
experience rather than “studying theory”. Elementary probability is one of them. (I am talking
about understanding and using discrete probability. The richness of the subject can greatly
be appreciated by solving lots of problems of various kinds.) For some other topics, it is
pointless to have tutorials: theory must be developed for a while before problems/exercises are
attempted. Thus, I have to struggle to differentiate between “theory” and “problems/exercises”.
For me, in particular, it is very hard because I do not see the difference. I actually never in my
life saw any difference even when I was a child in school. I solved problems and understood
theory. I studied theory and understood how to solve problems without solving them. Much
in the same way that I do not see the difference between pure and applied mathematics. (I
merely differentiate between bad and good mathematics; between interesting and boring
mathematics; between honest and dishonest mathematics.)
Here is a brief outline of the topics that will be taught.
Teaching starts with Part IV, page 99; therefore I will only outline the 10 chapters contained
within Part IV.
Chapter 12 is very brief. It simply reminds the reader of some of the basic concepts and, in
particular, stresses the concept of a random variable and its role: to transform a probability
measure into another. It also gives a glimpse of the mathematical difficulties that occur when
non-discrete random variables are considered. We point out that these difficulties are precisely
2. Once, in a book on probability, I read the following “theorem”: Let S, T be subsets of Rn , g : S → T a bijection
and X a random vector with distribution having a density fX supported on S. Then the random vector Y = g(X)
has density fY given by the formula fY (y) = fX (g−1 (y)) |det J|, where J is the Jacobian. This has problems at many
levels. For example, S cannot be an arbitrary set. Despite this not being a multivariate calculus course, we cannot
disregard the fact that conditions on S and g must be placed. If we aim to provide practical help to students, it is
imperative not to label something as a theorem if we only pretend to be giving a proof without, in reality, giving a
correct one. Such a practice is disingenuous and one that I colloquially refer to as cheating.
3. E.g., the binomial coefficient (n choose m) should be defined as the size of the collection of subsets of size m
of a set of size n and then the formula n!/(m!(n − m)!) be derived.

the same as the ones that (ought to) concern us when we state that we can toss a fair coin an
infinite number of times. The funny thing is that everyone will agree that the latter needs no
maths. (And yet it does.)
Chapter 13 talks about a sequence of independent coin tosses (which we accept exist,
but I already warned the reader in Chapter 12 that this is not something that he/she should
do light-heartedly). Through such a sequence we derive various interesting random
variables, their distributions and their relations. In a sense, this is also background: the
student has heard of these distributions before. We end up by taking a limit of a geometric
random variable when the probability of heads tends to 0 and find a random variable such
that the probability that it assumes any given value is zero.
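This limit can also be seen numerically. Below is a toy Monte Carlo sketch (mine, not part of the notes; the choice p = 0.001 and the sample size are arbitrary) illustrating that if T is geometric with success probability p, then pT is approximately exponential with mean 1 as p → 0, so P(pT = x) → 0 for every fixed x:

```python
import math
import random

# Toy illustration (my own, not from the notes): if T ~ geometric(p),
# the number of tosses until the first head, then p*T converges in
# distribution to an exponential(1) random variable as p -> 0.
random.seed(0)

def geometric(p):
    """Toss a p-coin until the first head; return the number of tosses."""
    n = 1
    while random.random() >= p:
        n += 1
    return n

p = 0.001
samples = [p * geometric(p) for _ in range(10000)]

# For the exponential(1) limit, P(pT > 1) should be close to exp(-1) ~ 0.368.
tail = sum(1 for x in samples if x > 1.0) / len(samples)
print(round(tail, 2))
```

The empirical tail probability lands close to e⁻¹, as the exponential limit predicts.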
Chapter 14 talks about “mass density” as being a positive function that has finite integral.
Through it, we can define probabilities of various simple sets (events) and I point out that
we can extend to classes of events, but, of course, I do not prove anything. I merely give
examples of interesting densities in one and higher dimensions. I also explain how densities
are transformed by smooth bijections. This has nothing to do with probability. It’s just
multivariate calculus (the same thing one does when one changes coordinates or when changing
variables in a “volume” integral).
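As a concrete instance of this change-of-variables recipe (my own example, not from the notes): take X standard normal and the smooth bijection g(x) = eˣ from R onto (0, ∞); then Y = g(X) should have density f_Y(y) = φ(log y)/y, the lognormal density, which a small Monte Carlo check confirms:

```python
import math
import random

# Sketch (my own example): X ~ N(0,1), g(x) = exp(x), a smooth bijection
# from R onto (0, infinity). The transformed density is
#   f_Y(y) = f_X(g^{-1}(y)) * |d g^{-1}(y)/dy| = phi(log y) / y,  y > 0,
# i.e. the lognormal density. We check it at y = 1 by Monte Carlo.
random.seed(1)
phi = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
f_Y = lambda y: phi(math.log(y)) / y

samples = [math.exp(random.gauss(0.0, 1.0)) for _ in range(200000)]

y0, h = 1.0, 0.05   # estimate the density as (fraction in a window) / width
estimate = sum(1 for y in samples if abs(y - y0) < h) / (len(samples) * 2 * h)
print(round(estimate, 2), round(f_Y(y0), 2))
```

The windowed estimate near y = 1 agrees with φ(0) ≈ 0.399, as the formula requires.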
Chapter 15 serves as a way to explain some of the things from the previous chapter. In
particular, I explain, as much as possible, that not every continuous random variable has
density. (And thus, the term “continuous”, that the student probably heard in previous classes,
is different from “having density”.) It is not hard to do that.
In Chapter 16 I attempt to stress the importance of the concept of expectation, in general, and
devote a little space to the law of the unconscious statistician. I also define independence, as
extending the notions that students already (should) know from elementary probability and
statistics. Some simple inequalities are needed in order that we later talk about the central
limit theorem and conditional distributions. Certain functions of random variables, via their
expectation, lead to the concepts of probability and moment generating function. We pass on
to random vectors and talk about their covariance matrix and its properties. We need this,
e.g., when we later discuss normal random vectors.
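To make the law of the unconscious statistician concrete, here is a toy check of my own (not from the notes): E[g(X)] for a fair die X and g(x) = x² can be computed directly from the pmf of X, without ever finding the distribution of X²; a simulation agrees:

```python
import random

# Law of the unconscious statistician (toy check of my own): for a fair
# die X and g(x) = x^2, E[g(X)] = sum over x of g(x) * P(X = x); there is
# no need to work out the distribution of X^2 itself.
pmf = {x: 1 / 6 for x in range(1, 7)}
exact = sum(x * x * p for x, p in pmf.items())        # 91/6 ~ 15.17

random.seed(6)
approx = sum(random.randint(1, 6) ** 2 for _ in range(200000)) / 200000
print(round(exact, 2), round(approx, 1))
```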
Chapter 17 is the fundamental theorem of probability (the strong law of large numbers)
explained in a simple case. It is needed for an understanding of the central limit theorem but
also for the whole subject of statistics. Without a good understanding of the strong law of
large numbers there’s no way to ever get beyond the belief that statistics is something magic
or (in the words of an algebraist friend of mine) that probability is a “voodoo science”.
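A minimal simulation of the strong law for fair-coin tosses (my own toy example; the sample size is arbitrary) shows the frequency of heads settling at 1/2:

```python
import random

# Toy illustration of the strong law of large numbers: the running
# frequency of heads in fair-coin tossing settles at 1/2.
random.seed(2)
n = 100000
heads = sum(random.random() < 0.5 for _ in range(n))
running_mean = heads / n
print(round(running_mean, 3))
```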
Chapter 18 treats normal (Gaussian) random variables and vectors and explains how they
go hand-in-hand with linearity. Pay attention. Understanding this will save you a lot of time
when you later learn things in statistics. There are whole specialized courses in statistics about
linear systems with random inputs (linear regression, linear time series, and so on, people use
all kinds of names), but always normal. Why? Because the normal law is the only distribution
with finite variance that is “preserved under linearity”. If you think of a linear system as
a black box then normal input implies normal output. If you understand this then whole
courses in statistics become trivial.
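The “normal in, normal out” point can be checked on the simplest linear system, a linear combination of independent standard normals. The sketch below (the constants a = 3, b = 4 are my own) verifies that Z = aX + bY behaves like N(0, a² + b²):

```python
import math
import random

# Sketch (my own constants a = 3, b = 4): if X, Y are independent N(0,1),
# then Z = aX + bY is again normal, N(0, a^2 + b^2): normal input implies
# normal output for a linear black box.
random.seed(3)
a, b = 3.0, 4.0
sigma = math.hypot(a, b)                      # sqrt(a^2 + b^2) = 5
zs = [a * random.gauss(0, 1) + b * random.gauss(0, 1) for _ in range(100000)]

var = sum(z * z for z in zs) / len(zs)        # should be near sigma^2 = 25
inside = sum(1 for z in zs if abs(z) < sigma) / len(zs)   # near 0.6827
print(round(var, 1), round(inside, 2))
```

Both the empirical variance and the one-sigma probability match the N(0, 25) predictions.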
Chapter 19 defines conditional expectation via geometry because that’s the only way to
understand it without having to resort to special cases and ugly formulas whose meaning

isn’t clear. And I do so because we need to talk about conditional probability but also make
computations easy. One reason that conditional expectation must be introduced is, that, without it,
dealing with conditional distributions of normal random variables can be a mess. A second reason is
that it is often impossible to compute the expectation of a function of several random variables without
conditioning on some of them. See the epigraph on Chapter 19!
Chapter 20 explains the central limit theorem in a simple case. I am constrained to
use nothing but the moment generating function, and this makes it hard to prove much
in generality. The central limit theorem is the oldest “universality” result. It is very robust
and works (again because normality and linearity go hand in hand!). We exemplify this via
confidence intervals.
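As an illustration of this use of the central limit theorem (a sketch with parameters of my own choosing, not taken from the notes): for i.i.d. uniform(0, 1) samples, with mean 1/2 and variance 1/12, the interval X̄ ± 1.96·√((1/12)/n) should cover the true mean about 95% of the time:

```python
import math
import random

# CLT sketch (parameters of my own choosing): X_i i.i.d. uniform(0,1).
# The approximate 95% confidence interval
#   xbar +/- 1.96 * sqrt((1/12)/n)
# should cover the true mean 1/2 in about 95% of repeated experiments.
random.seed(4)
n, trials = 400, 2000
half_width = 1.96 * math.sqrt((1 / 12) / n)

covered = 0
for _ in range(trials):
    xbar = sum(random.random() for _ in range(n)) / n
    covered += abs(xbar - 0.5) < half_width
coverage = covered / trials
print(round(coverage, 2))
```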
Chapter 21 talks about a class of distributions on R with densities, as required by the
syllabus. But I don’t just give formulas. I explain their properties and their use. And I never
ever define something by saying “here’s a formula for a density, learn it by heart”. Because
this is boring, silly, and totally uninformative. Rather, I derive formulas. In particular, we
show that the sample mean and sample variance, for i.i.d. normal random variables, are
independent; we explain this; we don’t merely say “it is, believe me because some authority
says so”.
Chapters 22 and 23 are optional.
Various devices I used in these notes are explained below under the title “Guide”.
– I have added lots of “problems (=exercises) that you must do while you’re reading the notes.
Exams will be based on them. I have added a list of all the problems in the beginning.
– I have made the notes interactive: you can click and be transferred internally or externally. A
list of all external links is in the beginning.
– Bits and pieces that are to be emphasized appear in blue.
– I use the word “theorem” only for serious things, none of which is proved herein. I have
added a list of all the theorems in the beginning.
– There is an index at the end that is also interactive.
– The table of contents is also interactive: it transfers you to the place you wish to be transferred
to.
– There are historical notes and anecdotes, optional of course, but my goal is to show you that
humans have known some things for a long time (100 years, 300 years, 2000 years, ...). So if a
human could work out some simple problem 2000 years ago, how is it possible that we can
ignore this nowadays?
Basically, I tried hard to make these notes independent of the system used, called “canvas”,
because, in my experience, canvas is rather useless when it comes to mathematics. It is
designed for other subjects.

1.1 Guide: please read carefully


Emphasis Any material that needs to be emphasized appears in blue.

Problems The notes are interspersed with problems that form an integral part of the material
taught. In a previous version I called many of them examples. But by calling them problems I

wish to stress the fact that the student must do them on his/her own. I provide answers to all of
them.
Actually, “problems” are often part of “theory”.4 In that sense, omitting them implies failure
to understand what’s going on. In particular there are ?starred problems. The presence of a
? does not mean they are harder. It just means that they must be done, or at least read, for
logical continuity.
Here is what you SHOULD DO when you encounter an item called “Problem”.

1) Read the statement carefully and make sure you understand it. If you don’t understand
it then one or more of the following may be true: (a) you haven’t understood the lecture
notes prior to the problem statement (b) you have difficulty with the English language
(c) you haven’t understood concepts in elementary probability (e.g., discrete random
variables) or elementary maths (e.g., calculus).

2) Then hide the answer to the problem; discipline yourself and do not look at it. Solve the
problem by yourself. By solving the problem we mean provide an answer that is often
expressed in English language. The answers are always short. Most of the time, a few
lines suffice.

3) Then read the answer I provide and compare it with yours.

4) Keep a notebook (electronic or on paper, whatever you like) recording all the problems
you have solved.

5) When you ask questions, the first thing I will ask you is to show me your homework–your
notebook with the problems you have solved. If you have solved a few or none, then
chances are that you won’t be able to catch up later.

The end of (the answer of) a problem is indicated by the symbol 

Interactiveness The notes are interactive:


1) there are links to documents on the Internet (hyperlinks) and appear in this color.
2) There are internal links (hyperreferences) and appear in this color.

Cheating Teaching at this elementary level must necessarily involve some mathematical
cheating. I will avoid it as much as possible, but I will tell you where I cheat so that, later, you
may have a chance to learn more properly, if you ever get the chance because, nowadays,
cheating is almost a norm in mathematics teaching worldwide.

Terminology I use the word “probability measure” in order to indicate the assignment
of probabilities to events. There is no measure theory in these notes, nor any advanced
mathematical analysis. I am just using some words to reflect modern usage. Terminology,
in the subject of Probability and Statistics is largely anachronistic, that is, old-fashioned, and
often misleading. I do not offer any apologies for using more appropriate, for the 21st century,
4. I don’t understand the difference between “theory”, “applications”, “examples”, “problems”, “tutorials”, etc.
It’s all part of one and the same thing.

terminology. For example, lots of people say “continuous” random variable when they mean
“absolutely continuous”. I shall use the latter.

Parentheses and brackets I detest using parentheses (especially when they go deeper than
level 1) because I can’t parse them. So I often simply omit them. If (a, b) is an interval and P a
probability measure I sometimes write P(a, b) instead of P((a, b)). If {x} is a set containing one
point I write P{x} instead of P({x}). The use of parentheses is often ambiguous. For example
(a, b) may mean the set {x ∈ R : a < x < b}, as above, or it may mean an element of R2 . But
everything should be clear from the context. I also write EX instead of E(X) and E ∫₀¹ X(t) dt
instead of E(∫₀¹ X(t) dt). So instead of writing E(XY) = (E(X))(E(Y)), and getting swamped by the
parentheses, I simply write E(XY) = (EX)(EY), and it’s clear what I mean. But I do write
P(A) and almost never write PA (although, perhaps, I should). Events are sets so I use curly
brackets: {...}. Curly brackets indicate that we are dealing with an unordered collection. If
{X ∈ B} is an event then I should write P({X ∈ B}), but I don’t: I write P(X ∈ B). Apologies.

Other symbols I don’t bother to denote probability measures with any special font. I use P,
or I use Q or other letters. I write expectation as E or, when I need to emphasize the probability
measure, say Q, I write EQ .

Theorems I am being told that nowadays students in mathematical sciences departments
do not like the words theorems and proofs any more and that I should avoid using them. But
whether they like them or not, they’re there, because you should bear in mind that every single
thought is a proof of a theorem and avoiding theorems and proofs is like avoiding teaching
anything. It is like saying that students of music do not like to listen to music. Nevertheless,
in compliance with the status quo, I do avoid the word “theorem” as much as I can and I use
the word only for big things, which won’t be proved at all but motivated, a wee bit at least.

Numbering All items are numbered chapter-wise. So Problems in Chapter 13 appear as


Problem 13.1, Problem 13.2, etc. Similarly, Definition 13.1, Definition 13.2, etc.

Method used in writing these notes; interlard and repetition These notes have been written as
a textbook, that is, they are supposed to be read. In particular, skipping the problems instead of
solving/answering them often means failing to ?understand and thus failing to ?know and
?learn. The reason that problems are included, and that I insist you do them, is that, unlike
in the past, students nowadays need to be told what to do in order that they ?learn. In the
past, a student was well aware that he/she had to solve problems and spend hours/days/weeks
making sure that they ?understand. Since, as I am being told, this is no longer the case, it behooves
me to interlard the notes with details, such as trite problems, and with repetition of concepts
and facts. I mentioned that the pink pages, that is, PARTS I, II and III, refer to material that is
NOT in the syllabus. However, the subject that the syllabus asks me to teach cannot be
?understood without these parts. A teacher can, of course, cheat and ask students to “memorize
the formula (a + x)⁴ = a⁴ + 4a³x + 6a²x² + 4ax³ + x⁴”; then, when told “apply it when
a = 1 and x = e⁻ᵗ”, you can do it. But that is cheating on the part of the teacher (because

he/she is asking you to behave like a machine) and that I will not do, not only because I don’t
want to, but also because I can’t: my memory is extremely weak, I remember hardly anything,
I can only handle logic and thus I must explain everything, even to myself, at all times.
Part I

WHAT IS PROBABILITY AND HOW


TO ?TEACH AND ?LEARN IT

Chapter 2

Notes for the instructor of probability


in a mathematical sciences department

In which I present the syllabus and a list of misconceptions


and bad practices in teaching probability. I have spent my
entire life thinking about how to teach mathematics and,
in particular, probability, so I do understand the way it
should be done, at a very elementary level, in a university.
Words preceded by a ? are defined in Def. 3.1 below.

2.1 The syllabus

The syllabus¹ specifies:

Random variables: Distribution functions, probability density


functions, discrete and continuous random variables. Some commonly
used distributions: Binomial, Geometric, Poisson, Normal,
Exponential distributions etc.
Multivariate random variables: Joint, marginal and conditional
distribution functions and density functions. Covariance and
correlation of random variables. Bivariate Normal distributions.
Composition of function and variables; Transformations of continuous
random variables/vectors.
Generating functions: Moment and probability generating functions.
Central limit Theorem.
Special distributions: t, χ2 and F distributions as illustrations
of transformations and motivated by practical examples.
¹ As decided by a committee

CHAPTER 2. NOTES FOR THE INSTRUCTOR OF PROBABILITY IN MATHS 11

2.2 Writing these notes and teaching from them


All the topics in the syllabus are covered in these notes in logical order and everything
is explained, to the extent possible. Everything will be covered from a modern point of
view. Thus, for example, in order to teach marginal densities under normality, we must
explain conditional expectation and conditional variance. These are best explained via
geometry. Conditional expectation is totally unavoidable because of what I call “the vital
formula of probability”. For, otherwise, how should one compute, say, ES(N) where S(N) is
the sum of a random number N of random variables? The mantra that some unfortunate
students learn, namely, that there is a formula that says “multiply probabilities times values and
sum up”, fails. It fails in myriads of cases that are of paramount importance, even if one
wears the hat of “fancy” applications, such as “big data”, “artificial intelligence”, “machine
learning”, “biostatistics”, “actuarial mathematics”, “financial mathematics”, “communications
engineering”, “autonomous systems”, “intelligent smart networks”, and so on.
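For readers who like to experiment, the point can be made concrete with a small simulation. The sketch below is my own illustration, not part of the syllabus: it estimates ES(N) for a sum S(N) of N fair-die rolls, where N is itself random and independent of the rolls; the particular choices of N and of the die are arbitrary. The empirical mean agrees with (EN)(EX), which is exactly what the conditional-expectation machinery (Wald's identity) predicts.

```python
import random

random.seed(1)

def sample_S():
    # one realization of S(N): a sum of N fair-die rolls, where N itself
    # is random (uniform on {1, 2, 3}, so EN = 2) and independent of the
    # rolls (each roll has EX = 3.5)
    N = random.randint(1, 3)
    return sum(random.randint(1, 6) for _ in range(N))

trials = 200_000
empirical_mean = sum(sample_S() for _ in range(trials)) / trials

# conditional expectation (Wald's identity) predicts ES(N) = (EN)(EX) = 2 * 3.5 = 7
print(empirical_mean)   # close to 7
```

The “multiply probabilities times values and sum up” mantra gives no direct way to reach the number 7 here; conditioning on N does.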
The notes are written so that the student must read them. They contain more than 250
PROBLEMS. These problems range from being very essential for the understanding of the
“theory” to trite numerical ones. The essential ones are called ?PROBLEMS. The student must
do them, else he/she won’t understand properly. I could have chosen to call the essential
problems something else, but I haven’t because I want to force the student to do them. Of course,
I provide answers to everything, so nothing is lost. Even those who can’t do them can
read the answers.
There is a small set of elementary mathematics that the student must know. It’s as follows.

a) CALCULUS: That is, they know what sequences and their limits are, what continuity
and differentiability mean, what an integral is (the Riemann integral that we teach in
Calculus is enough).

b) SETS and FUNCTIONS: At a rudimentary level; knowledge and understanding of


operations on sets and functions.

Perhaps the most important assumption I will make is that students know what a
function is!

c) LINEAR ALGEBRA: At a very basic level; including the understanding of linear


independence; the concept of determinant; how to change basis so that linear equations
become simple.

d) ELEMENTARY PROBABILITY: Especially at the level of solving problems that require


very little mathematics and only a bit of logical thinking.

e) COUNTING: Finding the size of a finite set. Probability, at its very elementary level, is
all about counting and then about integration. In fact, integration is nothing else but
“counting continuously”.

Whether the students actually know these topics or not is clearly not dependent upon the
present notes. I am merely stating that there is no way for a student, especially in mathematical

sciences, to understand the topic of this course without familiarity and working knowledge of
the above.
Students are often being (or have been) exposed to misconceptions. We should avoid them
here as much as humanly possible.

2.3 A list of misconceptions and bad practices


Thousands of people have written probability/statistics books and notes (because the subject
is important both for pure mathematics and applied science), but a large fraction of them
are books I would not like to open as they are laden with misconceptions and mathematical
imprecision. The reasons are multiple. I shall not discuss them here, but I mention only two:
(a) There is a tendency to think of probability and statistics as something mysterious that is
not part of mathematics. And yet it is. We’re in the 21st century, not in the 18th. (b) Often,
common language is used without explanation: what exactly do we mean when we say “take
a sample from a normal distribution” or “event so and so occurs” or “according to the law
of large numbers” or “the density” or “a continuous random variable”? I do not propose to
make a rigorous course at this level but I propose to do what I’ve been doing throughout my
career: AVOID CHEATING. And if cheat we must, it behooves us to state so.

(A) Random variables: Some people think that the term “random variable” has something
to do with randomness (it does not) and that it is a “variable” (it is not, because the term
“variable” is quite ambiguous). We should emphasize that random variable is a function
and that’s that. Such an object’s role is principally to transform a probability measure P
on its domain into another probability measure Q on its range. More often than not, Q
is given and an X and a P must be found.
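This view of a random variable as a measure-transforming function can be spelled out in a few lines of code. In the sketch below (my own illustration; the fair die and the particular X are arbitrary choices), X is literally a function on the sample space, and its distribution Q is computed by pushing P forward through X:

```python
from fractions import Fraction

# Sample space of a fair die, with probability measure P on its domain
omega = [1, 2, 3, 4, 5, 6]
P = {w: Fraction(1, 6) for w in omega}

# A random variable is just a function on the sample space
def X(w):
    return w % 2          # 1 for odd outcomes, 0 for even ones

# X transforms P into a measure Q on its range (the "distribution" of X)
Q = {}
for w, p in P.items():
    Q[X(w)] = Q.get(X(w), Fraction(0)) + p

print(Q)   # each of the two values 0 and 1 gets mass 1/2
```

Nothing about X itself is “random”: all the randomness sits in the measure P, and X merely transports it.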

(B) Frequencies: Some people think that probability can be defined as a limit of the
frequency of occurrence of an event. Lots of people tried to create a foundation of
probability theory in this way, but they failed. In fact, it has been proven that such a
definition is impossible.

(C) Additivity, only, is wrong: Some people teach that probability is only finitely additive,
namely that, for disjoint events A₁ and A₂ we have P(A₁ ∪ A₂) = P(A₁) + P(A₂), but
they say nothing about countable additivity. If this is the only property they introduce
then those people must be very careful not to start cheating. For example, if the
syllabus specifies that densities should be taught, then an instructor who only uses finite
additivity is cheating. Since our syllabus does specify that densities must be taught, the
instructor has no right to omit countable additivity.

(D) Concepts exist independently of one’s ability to calculate: Some people teach students
concepts, such as, the expectation of a random variable, in a way that students do not
understand the concept independently of their ability to calculate it. This is, e.g., a common
practice in teaching Calculus. The result is that when you ask students “what is the
integral?” they typically reply by a question: “of which function?” And if you insist “I
am asking you to tell me what the concept of the integral is”, then you may get the silly
reply “the area under the curve”, at which point you ask them to “define the concept of

the area” and you have a vicious circle. The vicious circle won’t be resolved in these
notes but the students must be made aware of its existence.

(E) Don’t always use a single letter for all probabilities: Since probability is a function
from a set of events to real numbers between 0 and 1, students should learn that it is not
necessary to always use the same letter, typically P or Pr or Prob for it. Many times we
need to consider different probability functions (Statistics, for example, is all about sets
of probabilities!) Using the same letter always is as funny as using the word fun for
every function you encounter!

(F) Probability is a trivial axiomatic system: Some people do not emphasize this fact.
Probability is merely a function P from a set of sets (called events) into the nonnegative
real line such that

(A1) P assigns value 1 to the largest event


(A2) P is countably additive: if Aₙ, n ∈ N, are pairwise disjoint events then the value
that P assigns to their union equals the sum of the values that P assigns to each
event.

Students must be told that everything in Probability, Statistics, and any topic following
from them, merely needs these two axioms (and, of course, whatever can be introduced,
via definitions from them, and whatever can be proved from them).
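As an illustration of how little is needed, here is a sketch (my own, not part of the lectures) of a countably additive probability measure on the positive integers, namely P{n} = 2⁻ⁿ. A finite partial sum can only illustrate, not prove, countable additivity, but it shows (A1) and (A2) at work: the whole space gets mass 1, and the masses of the even and odd integers (exact values 1/3 and 2/3) add up to it.

```python
def P(event, terms=200):
    # approximate P(event) = sum of 2**-n over n in the event,
    # using a long partial sum (event is a predicate on n)
    return sum(2.0 ** -n for n in range(1, terms + 1) if event(n))

total = P(lambda n: True)        # (A1): mass of the whole space is 1
evens = P(lambda n: n % 2 == 0)  # exact value 1/3
odds  = P(lambda n: n % 2 == 1)  # exact value 2/3
print(total, evens, odds)        # (A2): evens + odds recovers total
```

No finitely additive reasoning alone can deliver the value P(evens) = 1/3; one genuinely needs the countable union of the singletons {2}, {4}, {6}, ...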

(G) Confusion between a number and a function: This is happening all the time. There is
a confusion between a function, f , say, and its value b = f (a) at the point a of its domain.
This appears at least twice in the teaching of probability/statistics:

(i) Random variables, say X, are being treated as numbers. Students must be told they
have to understand the difference between X, as a function, and X(ω), its value at the point
ω. Random variables are functions and that’s that. Granted, they are functions for
which their range is of more importance and a probability measure (the so-called
distribution) sitting on the range is even more important, but that’s part of the
experience, it’s not part of the definition.
(ii) Probabilities are being treated as numbers. Students must be told they have to
understand the difference between P, as a function, and P(A), as the value of P at
the point A (a set/an event) of its domain. Since, for statistics, it is very important to
deal with a variety of functions P, the point is important. I use the word “probability
measure” when I’m referring to P as a function and the word “probability” for
numbers between 0 and 1.

(H) The law of large numbers: Here are some of the false beliefs that some people have
about the law of large numbers:

False belief 1. It is a law, and as such, we don’t have to do anything about


it other than write it in a book and then recite it.

False belief 2. It is an experimental result.

False belief 3. It is a law of nature, such as, say, Newton’s law F = GM₁M₂r⁻²,
that can’t be proved (within Classical Mechanics) but it is
consistent with any observation.

False belief 4. It is a definition. It is the definition of probability itself.

False belief 5. It only applies to large numbers.

False belief 6. It is useless. (So useless, that it is omitted from introductory


classes.)

So if the students aren’t taught that the law of large numbers has nothing to do with the
word “law” (and nothing to do with large numbers either) they will carry these silly
misconceptions/beliefs. To get rid of those, students should understand the law of large
numbers as a result within the probability theory framework: The law of large numbers
is a theorem, the fundamental theorem of probability. In fact, it is silly to teach the
central limit theorem without the law of large numbers. It is as silly as teaching how to
expand a function f in Taylor series up to order 2, but only explain to the students how
to calculate the second order term (that is, f′′(x₀)/2) and not the first order one (f′(x₀)).
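Since the law of large numbers is a theorem and not an empirical observation, a simulation cannot establish it; but it can illustrate what the theorem asserts. A tiny sketch of my own:

```python
import random

random.seed(0)

# Running averages of fair coin tosses (1 = heads): the law of large
# numbers asserts that these approach the expectation 1/2.
tosses = [random.randint(0, 1) for _ in range(100_000)]
for n in (100, 1_000, 10_000, 100_000):
    print(n, sum(tosses[:n]) / n)   # drifts towards 0.5 as n grows
```

The theorem is what licenses reading such output as evidence of anything; the output by itself proves nothing.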
Chapter 3

Notes for the student of probability in


a mathematical sciences department

“Tis the good reader that makes the good book; a good
head cannot read amiss: in every book he finds passages
which seem confidences or asides hidden from all else
and unmistakably meant for his ear.”
– Ralph Waldo Emerson

3.1 Background
Listen carefully to what Feynman said.

Scene

Reporter walks into Feynman’s¹ room and demands that Feynman explain what the force between
two magnets is:

Reporter: If you get hold of two magnets and you push them you can feel this pushing
between them, turn them on the other way and they slam together now; what is it the
feeling between those two magnets?

Feynman replies by explaining to the reporter that before asking a question he must have a
basis for asking it. Anyone can ask any question, but without any prior knowledge there is no
way for the responder to give an honest answer. After a few minutes of brilliant reply, that
you can watch and listen to here, Feynman concludes by telling the reporter:
¹ Richard Feynman (1918–1988), American theoretical physicist. You can learn basic physics from his lecture
notes.

CHAPTER 3. NOTES FOR THE STUDENT OF PROBABILITY IN MATHS 16

Feynman: I really can’t do a good job, any job, of explaining magnetic force in terms of
something else that you’re more familiar with; because I don’t understand it in terms of
anything else that you’re more familiar with.

In other words, Feynman tells the reporter that it’s fine to ask a question but the answer
he will get depends on what the reporter’s prior knowledge is (things he is familiar with).
If he has no basis for understanding, if he is not familiar with any elementary physics and
elementary mathematics, if the things he is familiar with are insufficient, then Feynman cannot
explain what a magnetic force is.
For example, the reporter may have gone to a high school where they do not teach any
physics or any mathematics. Or maybe they teach but they do a very poor job because the
teachers themselves do not know. The reporter is then in an unfortunate situation in that he
may never be able to understand the magnetic force. Or, to turn that positively, the reporter
may understand that he will never understand and turn to something else, e.g., interviewing a
businessman.

What you should already know


Although this is a rather complete set of notes for elementary probability theory, with a few
elements of statistics, they cannot be independent of previous knowledge. It is therefore
assumed that you should know:

a) CALCULUS: Sequences and their limits, continuity and differentiability, integration (the
Riemann integral that we teach in Calculus is enough).

b) SETS and FUNCTIONS: At a rudimentary level; knowledge and understanding of


operations on sets and functions.

Perhaps the most important assumption I will make is that you know what a function
is!

c) LINEAR ALGEBRA: At a very basic level; including the understanding of linear


independence; the concept of determinant; how to change basis so that linear equations
become simple.

d) ELEMENTARY PROBABILITY: Especially at the level of solving problems that require


very little mathematics and only a bit of logical thinking.

e) COUNTING: Finding the size of a finite set. Probability, at its very elementary level, is
all about counting and then about integration. In fact, integration is nothing else but
“counting continuously”.

3.2 What does “know”, “understand”, “learn” mean


Over the last couple of decades or so, these words have been hijacked by various groups of
people who, for one reason or another, have altered their meaning. Here is an example of

such a group. When I was in Sweden, I encountered a group of people who call themselves
“pedagoger”² and promote memorization and vocational training as a means for learning.
Maybe this is the case in some disciplines, but not in Mathematics and therefore, not in
Probability or Statistics.
Mathematics is hard and requires true teaching³. It is not easy for those who do not grasp
it from the start, but it becomes easy once you understand how to learn it. In particular,
learning it requires never cutting corners and never letting logic slip away in the slightest.
Once you get used to learning mathematics, it becomes a game. You can continue learning it
effortlessly. 10 pages of mathematics will then be equivalent to 1. You need to think about
how you learn and learn how to know. And this cannot be achieved by taking everything
you read in mathematics books at face value. Only when you reach a stage where everything
appears trivial or obvious will you have truly learned something.
We live in an era where cheating and dishonesty are rampant and propagate at the speed
of light (the speed at which packets of information are transferred over fiber optics cables).
In the era of “surveillance capitalism”, that is an economy based on making money out of
nothing or from collecting and selling people’s personal data, beliefs, opinions, etc., it is very
easy to get scammed by, say, Internet sites who promote teaching in the same way that the
aforementioned Swedish group does. These Internet sites are not interested in true teaching
but in increasing their profit, either directly (by asking you to pay) or indirectly (by collecting
your views and data); they achieve this by convincing you, via a variety of methods (“awards”
of sorts is one of them) that you have learned a topic and they have many ways of achieving
this. But of all disciplines, mathematics has the characteristic of forcing its adherents to be
honest4 (a necessary, but not sufficient, condition for true engagement with mathematics). It
is out of this property that I need to point out the pitfalls I mentioned and the hijacking of
important words such as “understand”, “know” and “learn”.
Let me then attempt to define them. Since I don’t want to invent new words, I will add a
star as a prefix to each of the hijacked words in order to indicate that I am talking about their
original meaning, rather than the one attached to them by various groups of people.

Definition 3.1 (?know, ?understand and ?learn, for mathematics). In mathematics, the verbs
?know and ?understand are equivalent

?know ≡ ?understand
² They have even hijacked this perfectly legitimate word and altered its meaning.
³ Mathematics comes from the Greek word “mathema” (μάθημα) that means “lesson”, both in ancient and
modern Greek. Anatolius of Laodicea (ca. 200–283 AD, also known as Anatolius of Alexandria) wrote:
“Why is mathematics so named?” The Peripatetics say that rhetoric and poetry and the whole
of popular music can be understood without any course of instruction, but no one can acquire
knowledge of the subjects called by the special name mathematics unless he has first gone through a
course of instruction in them; and for this reason the study of these subjects was called mathematics.

⁴ Some will even go further and claim that morality is strongly associated to mathematics: “I’m not interested in
[mathematics], I’m interested in morality”, replied the famous mathematician Alexander Danilovich Alexandrov,
when asked what attracted him to the subject. Anatolii Vershik correctly stated that mathematicians “have a
very clear sense of right and wrong”. This statement is equivalent to “if one has not a clear sense of right and
wrong then one is not a mathematician”. And so the appearance of the word “cheating” (or, rather, the need that
we should reject it), mentioned above and sporadically in these notes, is fully justified. (I thank Professor Aram
Karakhanyan of the University of Edinburgh for pointing out Alexandrov’s statement.)

and their main characteristics are:

I. to fully comprehend the why and the how;


II. to be able to explain to yourself and others;
III. to never simply memorize things;
IV. to be able to differentiate between human conventions and universal truths;
V. to never consider mathematics as a bunch of formulas.

To ?learn is the process you go through in order that you ?understand (and hence ?know).

PROBLEM 3.1 (−1 times −1 equals +1). Fact: (−1)(−1) = 1. You were told this at school. So
you claim you know it. But you do not ?know it if you cannot explain it in any other way
than saying “the teacher told me so”. So why is (−1)(−1) = 1?
Answer. I trust you ?know why. But let me spell it out. The symbol −a is the “additive inverse”
of the number a. That is, to every (real) number a there corresponds a unique number denoted
by a′ such that a + a′ = 0. (I temporarily denote −a as a′.) So 1 + 1′ = 0. Hence 1′ · (1 + 1′) = 1′ · 0.
But 1′ · (1 + 1′) = (1′ · 1) + (1′ · 1′) and 1′ · 0 = 0. We thus have (1′ · 1) + (1′ · 1′) = 0. But 1 is the
neutral element of multiplication: multiplying a number by 1 leaves the number unaltered. So
1′ · 1 = 1′. So we have 1′ + (1′ · 1′) = 0. And so 1 + (1′ + (1′ · 1′)) = 1 + 0 = 1 or (1 + 1′) + (1′ · 1′) = 1.
But 1 + 1′ = 0 so 0 + (1′ · 1′) = 1, which means 1′ · 1′ = 1. And this is why (−1) · (−1) = 1. 

PROBLEM 3.2 (no need for formulas). You probably know that the quadratic equation
ax² + bx + c = 0 is solved by the formula x = (−b ± √(b² − 4ac))/2a. Memorizing the formula and
simply applying it in special cases means that you do not ?know it. But if you can explain
the formula then you ?know it and you need not remember it. E.g., how do you solve the
equation x² + 6x − 1 = 0 without a formula?
Answer. You simply observe that the second coefficient is twice 3, so you add and subtract 3²
to obtain the equivalent equation x² + 6x + 3² − 3² − 1 = 0 and then you see that the first three
terms equal (x + 3)², so you equivalently get (x + 3)² = 3² + 1 = 10. If you know the square of a
number then the number is the square root of its square or minus that, so x + 3 = ±√10 and so
x = −3 ± √10. 
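If you want to check the arithmetic by machine, a tiny verification (an illustration only; the point of the problem is precisely that no machinery is needed) confirms that both numbers −3 ± √10 satisfy the equation:

```python
from math import sqrt, isclose

# the two roots obtained by completing the square: x = -3 ± sqrt(10)
roots = (-3 + sqrt(10), -3 - sqrt(10))
for x in roots:
    # plug each root back into x**2 + 6x - 1; the result should vanish
    print(x, isclose(x**2 + 6*x - 1, 0, abs_tol=1e-9))
```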

PROBLEM 3.3 (half-knowledge is often worse than no knowledge). If you have been told
that independent events are “events that do not influence one another” then you have been
told nothing of much value because you may know how to tell someone what you have been
told, but you still do not ?know the concept.
Answer. See Sections 11.3 and 11.4 for the hopefully already known to you cases of events
and simple random variables. For other cases, wait until you ?learn it in this course. It will
be covered in Sections 14.5.3 and 16.3. 

PROBLEM 3.4 (human vs machine). Your boss tells you that your task is to decide whether
the average annual temperature in a certain region is 10°C, and be 99% confident you are
right. He sends you to a training course where you are being taught the steps: (a) collect
data, (b) compute the value of a certain function of the data and (c) compare this value to
a number someone wrote in a table. You memorize the steps and apply them to complete
your task. Whereas you may feel you have learned statistics, in reality you have ?learned

nothing. You just understood how to blindly apply what someone else told you. Indeed, you
cannot explain to me or even to yourself why these steps achieve the goal. A robot can be
taught to do the same thing but cannot explain anything. So what is the difference between a
robot/machine and a human being, especially in our information/data–dominated era?
Answer. Machines can’t think but, thanks to semiconductor electronics technology, we have
built computers that have immense processing speeds, immense storage, are linked via optical
cables or wirelessly, and can perform amazing tasks. But any machine, including those for
which the terms “smart” or “intelligent” are used, cannot ?understand what they’re doing.
That’s left for humans. In fact, machines have freed humans from the burden of performing
endless tasks, so humans, especially those who choose to study mathematical sciences have
now, more than ever in the past, the luxury to think and program the machines to perform
incredible tasks. 
PROBLEM 3.5 (poetic answers may be beautiful but seldom precise). If I ask you whether
you know what a normal distribution is and you reply “it’s a bell-shaped curve” then
you don’t know it. If you reply it’s one whose density is given by the formula f(x) =
(2πσ²)^{−1/2} exp(−(x − µ)²/(2σ²)) then you know the formula because you can memorize it, but you
still do not ?know what it means. What is, then, a non-poetic answer to the question

What is a normal distribution?


Answer. One way to answer this is to give the formula and then proceed to understand
various properties of it. This will be done in Section 14.3.3. Another way is given in Chapter
18 where we will learn that the normal distribution is the only finite-variance distribution
that is preserved by linear operations of independent random variables with that distribution.
In a sense, if you want a distribution that has finite variance and is “preserved” by linear
operations then you have no other choice than to choose the normal distribution. That’s why it
is important and so unique. 
PROBLEM 3.6 (knowing what a limit is is not the same as knowing how to apply computational
tricks). Taking another example, your ability to compute the limit of a certain sequence,
say lim_{n→∞}(1/n), does not mean that you ?know what the concept of a limit is. If you are
able to ?understand the definition, namely that lim aₙ = a if, for every positive number ε, we have
|aₙ − a| ≤ ε for all except finitely many positive integers n, and are able to apply it, then you do ?know
what a limit is. Why is lim_{n→∞}(1/n) = 0?
Answer. lim_{n→∞}(1/n) = 0 because, for any ε > 0, all n > 1/ε satisfy 1/n < ε and there are only finitely
many n that are ≤ 1/ε. 
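The definition can even be exercised mechanically. The sketch below (my own illustration; the function name is an invention for this example) lists, for a given ε, exactly those n for which 1/n ≤ ε fails; the list is always finite, which is precisely why lim_{n→∞}(1/n) = 0:

```python
from math import floor

def exceptions(eps):
    # the positive integers n with 1/n > eps; only n <= 1/eps can qualify,
    # so the list is finite no matter how small eps is
    return [n for n in range(1, floor(1 / eps) + 1) if 1 / n > eps]

for eps in (0.5, 0.1, 0.01):
    print(eps, len(exceptions(eps)))   # finitely many exceptions for each eps
```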

3.3 Advice
The pink pages and the appendices As stated at the very beginning, the pink pages and
the appendices contain material that is logically needed for the proper understanding of the
topics of this course. The latter are contained in Part IV–the white pages. I know of no way
to teach the topics of the syllabus, that is, Part IV, without assuming that the student has an
understanding of the pink pages. I could have chosen not to include the pink pages. I did
it only because I wanted to give the reader the opportunity to find this material in a single
document.

It makes no sense to ?learn statistics without ?learning probability. In fact, it is rather futile
to claim that you have ?learned statistics without probability. Hence I will not assume any
knowledge of statistics. You can ?learn it afterwards.

Badly learned things. Based on questions that students frequently ask me, I realize that you
may have ?learned things incorrectly. If this is the case, then you should forget badly-learned
things and start anew. See quote (Q1) above.
When you ?learn something properly, you are doing yourself a favor: you save yourself from
a lot of hard and boring work.

Read the notes not in the same way you read a novel. Regarding these notes, like anything
that concerns mathematics, one should read them in a very different manner than reading
prose. You may, for example, wish to scan a chapter first to get an idea, then move to another,
and then go back to the first one when you need to. Mathematics is not read sequentially. If you
don’t understand something at first reading, make sure you go back in order to understand.
Different people read in different ways and you must assume responsibility for the reading
you do. See quote (Q2) above.

Ability to think. The best tool for reading these notes is having familiarity with thinking.
If you have difficulty with the thinking faculty and with reasoning then mathematics, and
especially probability, is not a good subject for you.

You need to be able to understand English. Thinking cannot be done without very good
familiarity with a natural human language; which, by convention, is English in our case: The
language of instruction, as well as the language used in these notes is English. Therefore,
knowledge of English is a sine qua non. In fact, these notes have been written in a way that
mathematical logic is expressed via English. This means that more often than not the verbal
passages of the notes convey as much important information as the formulas. See Item V.
above. Hence, if you fail to read and understand the non-formula part of these notes, you
may fail to understand the mathematics as well.

Mathematics is not a collection of formulas. Last but not least, I try as much as I can
to convince you that formulas are not, per se, of primary importance. Formulas appear as
an expression of our thoughts. Therefore, when someone asks me questions like “which
formula should I use to solve the problem” or “which distribution should I pick to find this
probability”, then I already ?know that this person has not ?learned and needs to try again.
Formulas do not appear ex nihilo.

Freedom of the individual. I’ve always made the assumption that individuals are free at
all stages of their lives. But freedom comes with responsibility. The individual person is
a free and responsible agent determining their own development through acts of the will.
In particular, I will try to teach you what I’ve taught for several decades, but if you are not
interested in ?learning, then it’s OK with me. After all, it is your decision to be a student in a
university. I will help you with answers to your questions, but I will assume that you have a
basis for understanding. Lack of such a basis requires strength of character, that is, realization
of this lack, and steps towards obtaining it.
Chapter 4

Notes for anyone who speaks a natural


language (for this course, the language
shall be English, by law)

We wonder what probability (as understood by a lay


person) might mean but quickly necessitate the move to
mathematics. We link statements and events and point
out that probability is, in a very good sense, sitting on top
of logic.

Probability is a measure of certainty or uncertainty of events. Events are sentences or


statements expressible in one’s natural language, usually following its rules and the rules of
logic. Here are some examples of sentences whose probability one might wish to know:

I “The Earth’s average temperature exceeds 16°C in the year 2040.”

II “The number of insects in the world exceeds the number of grains of sand.”

III “Investing £1000 results in a profit of at least £200.”

IV “Of all the 100 people in the theater tonight, the one who sits closest to the stage is the
tallest.”

V “I will toss a fair coin 100 times and will obtain at least 5 consecutive heads.”

Figuring out the probability of Sentence I is a very complex task. It requires understanding
of physics and mathematics describing the climate as well as historical data. Currently, the
average temperature is 15°C. Some people place the probability of statement I at 90% or
higher.
Sentence II clearly has probability 0% or 100%. But deciding which requires counting all insects and all
grains of sand. Instead we could count carefully selected samples and infer the probability
from this data. The probability will include the uncertainty introduced by the sampling.

CHAPTER 4. NOTES FOR ANYONE WHO SPEAKS A HUMAN LANGUAGE 22

Sentence III requires a model for the place we invest at and a lot of information. If we have
little information, the model may not be good and the probability we derive from this model
might not be accurate.
For IV, if we assume that all people have different heights and if we assume that they sit at
random, then it is rather intuitive that the answer will be 1%; and this you should already
know.
For the last sentence V, assume that all possible patterns of coin tosses are equally likely.
By “pattern” we mean a sequence of 100 heads or tails, e.g.,

HTHHHHTTTTHTHTTHTHTHTHTHT · · · HHTTHTTTHTHTHTTTTT

We have

no. of all possible patterns = 2^100 = 1,267,650,600,228,229,401,496,703,205,376.

Then count those patterns that contain at most 4 consecutive heads; call this number a100.
(If you want to understand how to count them, see Problem 4.4.) Divide a100 by 2^100 to find the
probability that V is not true and then subtract from 1 to find that

V has probability = 1 − a100/2^100 ≈ 81%. (4.1)
PROBLEM 4.1 (are you surprised that you got 5 heads in a row?). Are you surprised by
the fact that in 100 fair coin tosses the chance that you will have at least 5 consecutive heads is
so high?
Answer. Yes, I didn’t expect that. 
PROBLEM 4.2 (is your coin fair?). If someone tosses a coin and gets fewer than 5 consecutive
heads, what will you say about the fairness of the coin? (This is what statistics is about.)
Answer. I’m more confident than not that the coin is not fair. 
I now wish to explain how I obtained (4.1). Since the problem seems difficult, I simplify it:
PROBLEM 4.3 (at most 2 consecutive heads). Explain how to count the number of arrangements
of 100 coins in a row so that there are at most 2 consecutive heads.

Answer. To solve this, replace 100 by the symbol n (see footnote 1), representing a positive integer, and let an
be the number of arrangements. Consider the sentences

Hk =“there is a head at the k-th position”


1 And this is something we learn in mathematics: many problems, including numerical ones, cannot be
answered for the particular number we are interested in (the number 100 in this problem). Rather, it is easier
to find the answer simultaneously for all possible numbers. We thus pass from a number to a function and a
computation involving numbers becomes an equation for functions.

and
Tk =“there is a tail at the k-th position”.
Clearly,
Tk is the negation of Hk .
Either Tn holds or Hn holds. (By “holds” we mean “is true”.) If Tn holds then the number of
arrangements is an−1. If Hn holds then either Tn−1 holds or Hn−1 holds. If Hn &Tn−1 holds then
the number of arrangements is an−2. If Hn &Hn−1 holds then, necessarily, we must have that Tn−2
holds and so the number of arrangements is an−3. Since

Tn or (Hn &Tn−1 ) or (Hn &Hn−1 ) is a true sentence

while
the conjunction of any two of Tn , (Hn &Tn−1 ), (Hn &Hn−1 ) is a false sentence,
we have
an = an−1 + an−2 + an−3 .
We clearly have a1 = 2, a2 = 4 and a3 = 7 (of the 8 arrangements of 3 coins, only HHH is excluded).
The recurrence then gives a4 = 7 + 4 + 2 = 13, a5 = 24, and so on, up to a100 ≈ 3.32 × 10^26, a 27-digit number.2 
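The recurrence is easy to check by computer. Here is a sketch in Python (the function names are mine, not from the notes; I seed the recurrence with a1 = 2, a2 = 4, a3 = 7, the values obtained by direct enumeration):

```python
from itertools import product

def count_no_3_heads(n):
    """Number of H/T strings of length n with at most 2 consecutive heads,
    via the recurrence a_n = a_{n-1} + a_{n-2} + a_{n-3} of Problem 4.3,
    with seeds a_1 = 2, a_2 = 4, a_3 = 7."""
    seeds = [2, 4, 7]
    if n <= 3:
        return seeds[n - 1]
    a, b, c = seeds          # a_{n-3}, a_{n-2}, a_{n-1}
    for _ in range(n - 3):
        a, b, c = b, c, a + b + c
    return c

def brute_force(n):
    """Direct enumeration of all 2^n strings (feasible only for small n)."""
    return sum(1 for s in product("HT", repeat=n) if "HHH" not in "".join(s))

# The recurrence agrees with brute force wherever brute force is feasible.
for n in range(1, 13):
    assert count_no_3_heads(n) == brute_force(n)

print(count_no_3_heads(100))   # the 27-digit count of Problem 4.3
```

Python’s arbitrary-precision integers make the 100-step recurrence exact; no special big-number library is needed.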
Having solved the simplified version, we pass on to the actual question related to (V).

PROBLEM 4.4 (in how many coin arrangements do you have at most 4 consecutive heads?).
Explain how to compute the number of arrangements of 100 coins in a row so that there are at
most 4 consecutive heads.
Answer. Just as before, we should consider statements involving the last 5 coins. Replacing
100 by n, and with the same notation as above, we have that

“either Tn or (Hn &Tn−1 ) or (Hn &Hn−1 &Tn−2 ) or (Hn &Hn−1 &Hn−2 &Tn−3 )
or (Hn &Hn−1 &Hn−2 &Hn−3 )” is a true statement.

(By saying either “A” or “B” or “C” is true, we mean, as is standard in the English language,
that one and only one of them is true. For example, “I’ll either go to school or I’ll go to the
movies” means that I won’t do both.) Considering each of these 5 cases, we get

an = an−1 + an−2 + an−3 + an−4 + an−5 ,

and you can check by hand that a1 = 2, a2 = 4, a3 = 8, a4 = 16, a5 = 31. And you can continue
the additions until you find a100 and hence, via (4.1), the probability ≈ 81%. 
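The same kind of computation settles Problem 4.4. A sketch in Python (names mine; the seeds a1, . . . , a5 = 2, 4, 8, 16, 31 come from direct enumeration):

```python
from itertools import product

def count_no_5_heads(n):
    """Strings of n coin tosses with at most 4 consecutive heads, via the
    recurrence a_n = a_{n-1} + ... + a_{n-5} of Problem 4.4,
    with seeds a_1, ..., a_5 = 2, 4, 8, 16, 31."""
    window = [2, 4, 8, 16, 31]
    if n <= 5:
        return window[n - 1]
    for _ in range(n - 5):
        window = window[1:] + [sum(window)]
    return window[-1]

# Sanity check against direct enumeration for small n.
for n in range(1, 13):
    direct = sum(1 for s in product("HT", repeat=n)
                 if "HHHHH" not in "".join(s))
    assert count_no_5_heads(n) == direct

# Probability (4.1): at least 5 consecutive heads in 100 fair tosses.
p = 1 - count_no_5_heads(100) / 2**100
print(round(p, 4))
```

Running this gives the exact 30-digit value of a100 and the probability in (4.1).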
The sentences I spoke about above can be called events. I temporarily state (but we need to
wait for mathematics to kick in to make things precise) that:

An event is a well-formulated sentence (or a clause or a phrase)


2 It is not hard to see that for general n we have that an ≈ C t^n, where t = (c^{1/3} + 4c^{−1/3} + 1)/3 ≈ 1.8393 is the
real root of t^3 = t^2 + t + 1, with c = 19 + 3√33, and C ≈ 1.137.

Since there is no such thing as the totality of questions we can ask or statements we can
make about everything in the world, past, present or future, we content ourselves with events
whose totality is well-described. On purpose, I am not yet telling you what “well-described”
means. However, I’m sure you’ve heard of the term propositional logic (or sentential logic
or statement logic), the branch of logic that studies how entire propositions or statements are
joined to form more complex ones, how they relate to one another, and whether they are true or false. Actually,
if your mental faculties are those of the majority of human beings then you speak a
natural language (e.g., Nahuatl; see footnote 3) and are thus able to form sentences such as “if the Earth is
flat then Newton’s law doesn’t hold” (see footnote 4).
Let us, for now, use the letter P for denoting Probability.

We should think of P as some kind of function that assigns nonnegative numbers


to sentences (or statements or events) that, for now, we are going to denote by
letters of the Greek alphabet: α, β, . . .

Probability obeys certain rules, rules that respect the rules of logic. Here are a few.

1. We express the values of P by a percentage, that is, a real number u between 0 and 100 and
write P(α) = u%. Of course, this means nothing else but P(α) = u/100. So P(α) is taken to
be a real number between 0 and 1.

2. If α implies β then P(α) ≤ P(β).

3. If α = “either α1 or α2 ” then P(α) = P(α1 ) + P(α2 ).

4. If α is impossible (e.g., “there is a human with 30 fingers”) then we set P(α) = 0.

5. The “or” in the sentence “α or β” is taken to mean α or β or both. (In real life, “or” often
means exclusive or, but not here.) The “and” in the sentence “α and β” means what it means
in real life, that is, “both α and β”. We have the rule P(α or β) + P(α and β) = P(α) + P(β).

6. If not(α) is the negation of α then P(not(α)) = 1 − P(α).

7. P(not(α and β)) = P(not(α)) + P(not(β)) − P(not(α) and not(β)).

8. The sentence α =⇒ β is a composite sentence which means “if α then β”. That is, “α
implies β”. We have the rule P(α =⇒ β) = P(not(α)) + P(β) − P(not(α) and β).
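Rules like these can be checked on a small finite example. In the Python sketch below (entirely my own toy model, not from the notes), statements are modeled as subsets of a finite set of “worlds”, P is the uniform probability, and rules 5 to 8 are verified directly:

```python
from fractions import Fraction

worlds = set(range(12))                     # a small finite "universe"

def P(event):
    """Uniform probability of a subset of worlds."""
    return Fraction(len(event), len(worlds))

# Two statements, modeled as the sets of worlds where they hold.
alpha = {w for w in worlds if w % 2 == 0}   # "w is even"
beta  = {w for w in worlds if w < 5}        # "w is less than 5"

not_alpha = worlds - alpha
not_beta  = worlds - beta

# Rule 5: P(alpha or beta) + P(alpha and beta) = P(alpha) + P(beta)
assert P(alpha | beta) + P(alpha & beta) == P(alpha) + P(beta)

# Rule 6: P(not(alpha)) = 1 - P(alpha)
assert P(not_alpha) == 1 - P(alpha)

# Rule 7: P(not(alpha and beta)) = P(not alpha) + P(not beta)
#                                  - P(not alpha and not beta)
assert P(worlds - (alpha & beta)) == P(not_alpha) + P(not_beta) - P(not_alpha & not_beta)

# Rule 8: alpha => beta is "(not alpha) or beta" (material implication)
implies = not_alpha | beta
assert P(implies) == P(not_alpha) + P(beta) - P(not_alpha & beta)

print("all rules check out")
```

Using exact fractions avoids any floating-point rounding, so the equalities hold exactly.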

We can keep going on and create more rules. If you study the rules above you will realize
two things.

1. First realization: the rules are not independent. Some follow from others.

2. Second realization: there are more rules you can make; more complicated ones (how
many? is there a limit?)
3 There are about 1.7 million speakers of this language nowadays.
4 About 1 to 2% of Americans, that is, 3 to 5 million people, believe that the Earth is flat, but beliefs are irrelevant,
especially when they lead to contradictions.

Two questions therefore arise.

1. First question: what is the minimal set of rules you can have?

2. Second question: what is the totality of things that we can put as an argument in P? To put it
otherwise: when we write P(α) for a statement or sentence α, what things of the type α are we
allowed to put inside P?

The first question is not difficult to answer. The second question requires practice.
Fortunately, people have been practicing with this sort of thing for about 500 years. They
seriously started asking these questions though in the 18th c. And really more seriously in
the 20th c. And humans, that is, us, came up with the answers that (no surprise here) require
mathematics.
If you only use rules such as the above, you can still figure out lots of probabilities of
lots of events. People do so in practice, e.g., lawyers and businessmen, but, unless they’re
mathematically trained, they work in a kind of cloud. Within this cloud they can correctly
answer some simple questions (but may be unable to tell whether their answer is true or not)
but they cannot envision what this thing (called Probability) they work with really is, neither
can they envision the kinds of unimaginably complex and beautiful problems they can deal
with. In all fairness, they don’t have to. A lawyer, for instance, is only interested in figuring
out the chance that his client has committed a crime. If he can reasonably convince himself
that the chance is less than 0.1% then he will be defending him enthusiastically. But if the
chance is more than 70% then he’s only doing his job in defending him.
Before we go on, you need to understand that Probability Theory (and several other
domains that depend on it, e.g., Statistics, Random Processes, Stochastic Geometry, Filtering,
Control of Industrial Processes, Decision Theory, Game Theory, and others) is a branch of
Mathematics, as rigorous as, say, Mathematical Analysis or Algebra. In fact, Probability
Theory has progressed so much that we can now use its tools in other areas of Mathematics
too. For example, we can show that every polynomial of degree d has exactly d complex roots
(this is called the Fundamental Theorem of Algebra), using Probability Theory and then go on
to compute them using Probability Theory as well.
We summarize:

PROBABILITY assigns values P(α) to statements α such that rules such as the
above hold. In particular, the probabilities P(α), P(β), etc., assigned to statements
α, β, etc., must respect the rules of logic, that is, the ways that the statements relate
to one another.

All that was fine, but to make things precise and to be able to deal with probability properly
we must let mathematics enter. And this we can do, because this is not a course “for anyone
who speaks a human language”, as the title of this chapter states, but a course for university
students in mathematical sciences.
Part II

INTRODUCTION FOR UNIVERSITY


STUDENTS

Chapter 5

Events, sets, logic and the language


you speak

Mathematically, events are sets. Events are often described by sentences


in English. Handling sentences in English is the process we perform
when we speak, listen, read and understand English. To save space and
to be precise in what we mean, we often use the language of sets if we
handle events. Probability assigns numbers to events consistently. If the
way we handle events (that is, the way we speak and understand English)
is wrong, then the probabilities of these events will probably be wrong.
Consistency in the calculus of probabilities requires consistency in the
calculus of events, which is nothing else but the way we express and
understand English–or whichever language you’re most familiar with.

Scientists have realized, by experience, that objects called sets are described by sentences.
There is no way to define the concept of a set, but we know what it is by its use. We can say
that

a set is a collection of things we call elements.

(The elements can be sets themselves.) But once you accept the concept of a set then, using it,
you can do an awful lot of mathematics and physics and biology and even be able to be a
rational politician.
We say that an element a belongs to a set A (or not) and we write this as

a∈A

(or a ∉ A). A set A is a subset of a set B (and we write this as A ⊂ B) if whenever x ∈ A then
x ∈ B. I repeat:

A ⊂ B (read: set A is included in set B) if x ∈ A ⇒ x ∈ B.

In words, all elements of A are elements of B. You see, I used the notation:

CHAPTER 5. EVENTS, SETS, LOGIC AND THE LANGUAGE YOU SPEAK 28

⇒ is a symbol for the word “implies”.

If α and β are two sentences (propositions) then

α⇒β

is a new sentence, the sentence “α implies β” or “if α then β”.


Two sets A, B are equal, and we write A = B, if they have exactly the same elements

A = B is equivalent to, by definition, for all x, x ∈ A ⇐⇒ x ∈ B.

PROBLEM 5.1 (sentences and sets). Consider the sentence “if you’re human then your body
contains a molecule called DNA”. Express this as a relation between two sets.
Answer. Let H be the set of humans in the Universe. Let D be the set of objects in the Universe
that contain DNA. We just said that H ⊂ D. 

PROBLEM 5.2 (describe your sets logically). Let N be the set of positive integers. The
set A of all prime numbers is an example of a subset of N. You see, describing A as a list,
2, 3, 5, 7, 11, 13, 17, 19, 23, . . . is impossible. Why? How can you describe A by logic?
Answer. It is impossible because we don’t even have a clue what large prime numbers are
(there is no formula). But if we define the sentence

σ(n) = “n > 1 and n is not divisible by any positive integer other than 1 and n”

then we have defined A by


A = {n ∈ N : σ(n)}.

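The same description-by-a-sentence works on a computer, at least for an initial segment of N. A sketch in Python (names mine): encode σ(n) as a boolean function and form the set by comprehension.

```python
def sigma(n):
    """The sentence "n > 1 and n is not divisible by any positive
    integer other than 1 and n", i.e., "n is prime"."""
    return n > 1 and all(n % d != 0 for d in range(2, n))

# The set {n in N : sigma(n)}, restricted here to n <= 30.
A = {n for n in range(1, 31) if sigma(n)}
print(sorted(A))   # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```

The restriction to n ≤ 30 is the price of finiteness: a computer can test the defining sentence element by element, but it cannot hold the whole infinite set A.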
You need here to recall your elementary understanding of sets. If you have never seen logic
or syntax (of your natural language, that is) or if you have never seen (and used) sets then
you must stop and do that because you can never fully appreciate the subject of Probability
without these elementary concepts.
Here’s a brief summary of some of the things you need to have mastered, both in school
and in real life. There is no way to go through life other than understanding the rules of
speech (think how you talk)–also called syntax–and without understanding what is known as
naı̈ve set theory. You will have certainly observed that, in Internet-talk, there is a confusion
between the clause “you’re” (meaning “you are”) and the word “your”. The first is a clause
consisting of the personal pronoun “you” followed by the verb “are”. The second is not
a clause but a single word, the word “your” which is a possessive pronoun. This is not a
mistake in English only, it is a mistake in logic, therefore it is a mistake in mathematics.
I continue with the summary of things you need to know. You need to know that

A ∪ B corresponds to the conjunction “or” because x ∈ A ∪ B if x ∈ A or x ∈ B (or


both);

and that

A ∩ B corresponds to the conjunction “and” because x ∈ A ∩ B if x ∈ A and x ∈ B.

Hence you immediately have that A ∩ B ⊂ A ∪ B. If A1 , . . . , An are sets then we write A1 ∩ · · · ∩ An
without parentheses because we know that, just like in English, the conjunction “and” is
associative: “(she’s smart and she’s tall) and she’s from Mars” is the same sentence as “she’s
smart and (she’s tall and she’s from Mars)”. We also write this as ∩_{i=1}^n Ai . But then

x ∈ ∩_{i=1}^n Ai iff x ∈ A1 and · · · and x ∈ An , which means that for all i we have x ∈ Ai .

What happens if we have infinitely many sets, say Ai , where i ∈ I and I is another set? Well,
we define ∩_{i∈I} Ai as the set containing all elements x for which, for all i ∈ I, we have x ∈ Ai .
Similarly, we define ∪_{i∈I} Ai by replacing the quantifier “for all” by the quantifier “there exists”;
so

∪_{i∈I} Ai is the set of all x such that there exists i ∈ I for which x ∈ Ai .

You then need to remember the de Morgan law, which rests on the logical fact that if it is
not the case that (this and that) then it is (not this) or (not that). The complement of a set A is denoted by
Ac and consists of all x ∈ Ω such that x ∉ A. The complement of A with respect to B is all
x ∈ B such that x ∉ A; it is denoted as B \ A. Hence Ac = Ω \ A. Observe that B \ A = B ∩ Ac .
The de Morgan law gives

(∪_{i∈I} Ai)^c = ∩_{i∈I} Ai^c .

Remember too that ∩ is distributive over ∪ and ∪ is distributive over ∩.
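Both the de Morgan law and distributivity are easy to check on small examples. Here is a Python sketch (my own toy sets, not from the notes), using Python’s built-in set operations:

```python
Omega = set(range(10))
I = [0, 1, 2]                                   # an index set
A = {0: {1, 2, 3}, 1: {2, 3, 4}, 2: {3, 4, 5}}  # a family A_i, i in I

union = set().union(*(A[i] for i in I))

# De Morgan: the complement of the union is the intersection of complements.
assert Omega - union == Omega.intersection(*(Omega - A[i] for i in I))

# Distributivity of ∩ over ∪ and of ∪ over ∩.
B, C, D = {1, 2, 3, 4}, {2, 5}, {3, 6}
assert B & (C | D) == (B & C) | (B & D)
assert B | (C & D) == (B | C) & (B | D)

print("de Morgan and distributivity verified on this example")
```

Of course, a check on one example is not a proof; the proofs are the one-line quantifier arguments given above.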


The product A × B of two sets A and B is the set of ordered pairs (a, b) with a ∈ A and b ∈ B.
We can similarly define the product of 3 sets, 4 sets, ... , n sets. The product A × · · · × A (n
times) is denoted by An .
The set of all functions from A to B (that is, with domain A and co-domain B) is denoted as
B^A. If A = {1, . . . , n}, the set of the first n positive integers, then a function f : {1, . . . , n} → B
can be represented as an element of B^n and so B^{1,...,n} = B^n.
We let N denote the set of all positive integers. A function f : N → B is called a sequence
of elements of B. The set of all such sequences is denoted by B^N.
If B = R, the set of real numbers, then R^n is the set of all (x1 , . . . , xn ) where all xi are real
numbers. The set R^n is identified with the n-dimensional Euclidean space because R^n can be
thought of as a set of Cartesian coordinates of this space.
A function f : A → B such that if a1 ≠ a2 then f(a1) ≠ f(a2) is called a one-to-one function
or injection.
A function f : A → B such that for every b ∈ B there is an a ∈ A such that f (a) = b is called
an onto function or surjection.
A function is called a bijection if it is both an injection and a surjection.
Depending on the particular flavor of a function, human beings may use different names.
Here is a partial list of names used: map, mapping, correspondence, functional, operation,
operator, etc.
Similarly, the word set may be called collection, totality, class, etc.

The number of elements of a set A is called cardinality or size of the set and is denoted by
|A| or by #A. A set is called finite if it has a finite number of elements. Otherwise it is called
infinite. A set is called countable if there is a bijection between N and the set. An infinite set
that is not countable is called uncountable. The set of real numbers is uncountable. The set Z
of all integers is countable. The set Q of all rational numbers is also countable.
We can take the product of a sequence A1 , A2 , . . . of sets and call it A1 × A2 × · · · If the
sets A1 , A2 , . . . are all finite with size at least 2 each then their product is infinite and, in fact,
uncountable.

If A is a set, we let P(A) be the set of all the subsets of A.


If A has size n then P(A) has size 2^n and so P(P(A)) has size 2^(2^n).
See Appendix A for ways to count the sizes of some common sets.

Example 5.1. What is the cardinality of the set of all injections from {1, 2, 3} into {1, 2, 3, 4}?
Answer. A function f : {1, 2, 3} → {1, 2, 3, 4} can be represented by the triple ( f (1), f (2), f (3)).
An injection must have f(1) ≠ f(2), f(2) ≠ f(3) and f(3) ≠ f(1). f(1) can be 1 or 2 or 3 or 4. Once f(1) has
been given a value, there are 3 values remaining for f(2) to take. So (f(1), f(2)) can take 4 · 3 = 12
values. There remains to assign a value to f(3); there are only 2 numbers remaining, so f(3)
can take 2 values. And so (f(1), f(2), f(3)) can take 4 · 3 · 2 = 24 values. This is the cardinality
of the set of injections from {1, 2, 3} into {1, 2, 3, 4}. 
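The count can be confirmed by brute force. A Python sketch (names mine): represent each function by the triple (f(1), f(2), f(3)) and filter out the non-injective ones.

```python
from itertools import product

domain, codomain = (1, 2, 3), (1, 2, 3, 4)

# Represent each function f by the triple (f(1), f(2), f(3)).
functions  = list(product(codomain, repeat=len(domain)))
injections = [f for f in functions if len(set(f)) == len(f)]

print(len(functions))    # 4^3 = 64 functions in total
print(len(injections))   # 4 * 3 * 2 = 24 injections
```

The test `len(set(f)) == len(f)` says exactly that the triple has no repeated value, i.e., that f is one-to-one.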
The most important set in the universe is the set that contains nothing. It is called the
empty set and is denoted by ∅. There is one and only one empty set.
Sets A, B are called disjoint if A ∩ B = ∅.
Sets A1 , . . . , An are called pairwise disjoint or mutually disjoint if Ai ∩ Aj = ∅ for all i ≠ j.
In this case, we also have Ai ∩ A j ∩ Ak = ∅ if i, j, k are distinct. And so on.
Two sets are called distinct if they are not equal.
Sets of sets are sometimes denoted by calligraphic letters and are often called collections
rather than sets of sets. If A is a collection of sets then we call it pairwise disjoint collection
if A ∩ B = ∅ for every two distinct sets A, B ∈ A .
Chapter 6

“The only two things you need to


know”

Every mathematical system has a number of unprovable axioms that do


not contradict one another. Probability theory is particularly simple in
that it has only two axioms: The first says that the probability of the
universe set is 1. The second says that probability is countably additive
over mutually exclusive events. It took people a long time and much experience
to understand this. In a sense, everything we learn in probability
depends only on these two axioms.

Disclaimer. When we say “the only two things you need to know” we mean that everything
in probability theory follows from the two axioms (AXIOM ONE) and (AXIOM TWO) below.
Of course, you need to learn the consequences of these two axioms that follow logically
from them, together with a number of definitions and special cases of particular importance
in the theory.
We explained that we identify events with sets. Probability theory is concerned with certain
functions, called probability measures, that assign to an event A a number P(A).
We shall let Ω be a certain set that includes all events we may be interested in in a particular
situation or problem. Such a set may be called sample space or configuration space or
ambient space. The elements of Ω may be called outcomes or configurations. Sometimes we
may have to change notation and use another letter in place of Ω. We will, for now, use the
letter E for the collection of all events.

PROBLEM 6.1 (select 2 numbers out of 100; what’s the sample space?). Consider the
selection of 2 numbers, not necessarily distinct, out of 100. What sample space Ω should we
choose and what is E ?
Answer. We choose Ω = {1, 2, . . . , 100} × {1, 2, . . . , 100}. We choose E = P(Ω). 
It is typical for E to be equal to the set of all subsets of Ω, but this may not be the case
for reasons that are quite deep. For now, just accept the fact that the function P is defined on a
collection of sets E .

CHAPTER 6. “THE ONLY TWO THINGS YOU NEED TO KNOW” 32

If Ω is a set, then a probability measure on Ω is a function P : E → [0, ∞) such that

1. AXIOM ONE
P(Ω) = 1; (AXIOM ONE)

2. AXIOM TWO (countable additivity)


∞  ∞
[  X
P  Ai  = P(Ai ) if Ai ∩ A j = ∅ whenever i , j. (AXIOM TWO)
i=1 i=1

In writing the above, we must make assumptions that all things appearing in the argument
of P must be events, that is, elements of E . We will assume that

1.
Ω ∈ E. (EV1)

2.

If A1 , A2 , . . . ∈ E then ∪_{i=1}^∞ Ai ∈ E . (EV2)

3.
If A ∈ E then Ac ∈ E . (EV3)

The first two assumptions are necessitated by (AXIOM ONE) and (AXIOM TWO). The
third assumption stems from the fact that we want to have P defined on an event and on its
complement (because if we know the probability of something happening we should also
know the probability of something not happening).
Let us repeat.

Definition 6.1. A collection E of subsets of a set Ω is called a class of events if (EV1), (EV2)
and (EV3) are satisfied. Another name for such an E is σ-field.

Definition 6.2. If E is a class of events then a function P : E → [0, ∞) is called a probability


measure if (AXIOM ONE) and (AXIOM TWO) hold.

Subtle point 1. You should pay attention to the term probability measure. Some people
may simply call it probability. So the word “probability” may have two meanings: as the
function P or as a value of P on a particular event A. Another word for “probability measure”
is “probability distribution” or “distribution”. Another word is “probability law” or just
“law”.

Subtle point 2. The natural question then is: can all subsets of Ω be events? Put it otherwise,
can the set “events in Ω” consist of the set of all subsets of Ω? The answer is: not always.

Subtle point 2, amplified. Let’s try to think of the experiment of tossing a coin infinitely many
times. A result of this experiment is a sequence of heads and tails, such as T, H, T, H, T, T, . . .
The set Ω is the set of all these sequences:

Ω = {H, T}^N.

Let us try to define a probability measure P that conforms to our intuition. Namely, the
probability of tossing a head should be 1/2 (fair coin); the probability of two heads in a row
should be 1/4; the probability of specifying the coin faces in the first n tosses should be 1/2^n.
Let P denote such a probability measure. What is the domain of P? Are all subsets of Ω events?
The answer is no. There are subsets of Ω that are not in the domain of P. These subsets are
not events and so they have no probability at all. This means that I have no right to ask for
the value of P on sets that are not events. Why this is the case I cannot tell you in this course
because it is beyond the syllabus and because it requires a little bit of mathematics that can’t
be explained at this level.

Subtle point 3. There are familiar functions that satisfy (AXIOM TWO) but not necessarily
(AXIOM ONE). Here are two examples.

PROBLEM 6.2 (The cardinality function). Consider a finite set Ω and let C be the function
that assigns to each subset A of Ω its cardinality (that is, its number of elements). Then,
certainly, if A1 , . . . , An are disjoint subsets of Ω then C(A1 ∪ · · · ∪ An ) = C(A1 ) + · · · + C(An ).

PROBLEM 6.3 (The area function). Let Ω be the plane, identified as the set R2 of pairs of real
numbers. Let λ be the function commonly known as “area”. This function assigns value ab
to a rectangle of side lengths a and b. It assigns value πr^2 to a circle of radius r. It assigns
value 1 to the set of points (x, y) ∈ R^2 such that 0 ≤ y ≤ e^{−x}, x ≥ 0. You have learned how to
compute areas of various subsets of R2 and you have learned to do so by frequently applying
(AXIOM TWO). Frequently, you need to apply this rule to infinite sequences in order to obtain
the area of complicated sets from simpler ones, e.g., the area of the circle from an infinite
sequence of rectangles. But do you really know the function λ? If you do, then you must be
able to describe its domain. It turns out that λ, that is, the area function, cannot have the set of
all subsets of R^2 as its domain. Only in the 20th century did we understand what the area function is
and manage to describe its domain.

Figure 6.1: A circle of radius r. You can find its area by embedding an infinite sequence of
disjoint rectangles: summing up their areas you find πr^2.

Definition 6.3 (Event occurs). We say that

Event A occurs for the outcome ω iff ω ∈ A.

If the outcome ω is implicit then we simply say

event A occurs.

Also, “A occurs and B occurs” means that the element ω is in both A and B, that is, A ∩ B
occurs; and so on. If I pick an ω1 ∈ Ω and you pick another ω2 ∈ Ω then A may occur for me
but not for you.

PROBLEM 6.4 (two extreme event collections). If Ω is a set then P(Ω) is a σ-field and {Ω, ∅}
is another σ-field.
Answer. Notice that (EV1), (EV2), (EV3) are satisfied in both cases. 

PROBLEM 6.5 (events A, B generate more events). If Ω is a set and A, B ⊂ Ω, try to build a
σ-field that contains A and B. How can you do that?
Answer. Throw in all sets that are derived from A and B by set operations: E = {A, B, A ∩
B, A ∪ B, Ac , Bc , A ∪ Bc , . . .}. To do that systematically, draw a Venn diagram and notice that Ω
can be partitioned in 4 sets: A ∩ B, A ∩ Bc , Ac ∩ B, Ac ∩ Bc . Then consider all possible ways of
selecting some of these sets and take the union of each selection:
A∩B A ∩ Bc Ac ∩ B Ac ∩ Bc derived set
1) 1 1 1 1 Ω
2) 1 1 1 0 (A ∩ B) ∪ (A ∩ Bc ) ∪ (Ac ∩ B) = A ∪ B
3) 1 1 0 1 ···
4) 1 1 0 0 ···
.. .. .. .. .. ..
. . . . . .
15) 1 0 0 0 A∩B
16) 0 0 0 0 ∅
Go ahead and fill in each of the 16 rows of this table. 
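The table can also be generated mechanically. A Python sketch (my own finite example): form the four atoms, then take the union of every sub-collection of atoms; the result contains A and B and satisfies (EV1)-(EV3) (here Ω is finite, so (EV2) reduces to finite unions).

```python
from itertools import combinations

Omega = set(range(8))
A = {0, 1, 2, 3}
B = {2, 3, 4, 5}

atoms = [A & B, A - B, B - A, Omega - (A | B)]   # the 4 cells of the Venn diagram

# All unions of sub-collections of atoms: at most 2^4 = 16 derived sets.
sigma_field = set()
for k in range(len(atoms) + 1):
    for cells in combinations(atoms, k):
        sigma_field.add(frozenset(set().union(*cells)))

print(len(sigma_field))   # 16, since all four atoms are nonempty here

# Check the defining properties on this finite example.
assert frozenset(Omega) in sigma_field                                       # (EV1)
assert all(frozenset(Omega - set(E)) in sigma_field for E in sigma_field)    # (EV3)
assert all(frozenset(set(E1) | set(E2)) in sigma_field
           for E1 in sigma_field for E2 in sigma_field)                      # (EV2), finite case
```

If some atom happens to be empty, several rows of the table coincide and the σ-field has fewer than 16 sets; the construction still works.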

Definition 6.4 (Probability space.). A probability space is a triple (Ω, E , P) consisting of a set
Ω, a σ-field E of subsets of Ω and a probability measure P, that is, a function P : E → [0, ∞),
satisfying (AXIOM ONE) and (AXIOM TWO).

Often people use the term mutually exclusive for a family of events and this means that
the events in the family are mutually disjoint. We can state (AXIOM TWO) in words by
saying that a probability measure P is countably additive over any countable collection of
mutually exclusive events.
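Here is a minimal finite probability space in Python (my own toy example, not from the notes): Ω models a fair six-sided die, E is the set of all subsets, and P is checked against (AXIOM ONE) and finite additivity, the finite special case of (AXIOM TWO).

```python
from fractions import Fraction
from itertools import chain, combinations

Omega = {1, 2, 3, 4, 5, 6}                  # a fair six-sided die

def P(event):
    return Fraction(len(event), len(Omega))

# E = all subsets of Omega (possible here because Omega is finite).
events = [set(s) for s in chain.from_iterable(
    combinations(sorted(Omega), k) for k in range(len(Omega) + 1))]

assert P(Omega) == 1                        # (AXIOM ONE)

# Finite additivity over mutually exclusive events:
for A in events:
    for B in events:
        if A & B == set():                  # A and B mutually exclusive
            assert P(A | B) == P(A) + P(B)

print("axioms verified on", len(events), "events")
```

For a finite Ω, taking E = P(Ω) is always legitimate; the subtle points above explain why this can fail for uncountable Ω.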
Chapter 7

Elementary probability properties

This very brief chapter summarizes some very basic


properties of any probability, such as the total probability
formula, the probability of the union of events and simple
inequalities.

Any collection of pairwise disjoint events whose union is Ω


is called a partition of Ω.

1. Total probability formula.


If A1 , . . . , An form a partition of Ω then P(B) = Σ_{i=1}^n P(B ∩ Ai ),

for any event B.


Here is why: The events A1 , . . . , An are pairwise disjoint. Hence the events B ∩ A1 , . . . , B ∩ An
are pairwise disjoint and their union is B. We now apply (AXIOM TWO) from the definition
of P.
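A quick numerical check of the total probability formula (my own example, with the uniform probability on a finite Ω):

```python
from fractions import Fraction

Omega = set(range(12))

def P(event):
    return Fraction(len(event), len(Omega))

# A partition of Omega into three pairwise disjoint events.
partition = [{0, 1, 2, 3}, {4, 5, 6, 7}, {8, 9, 10, 11}]
B = {1, 3, 5, 7, 9}

# Total probability formula: P(B) = sum of P(B ∩ A_i).
assert P(B) == sum(P(B & A_i) for A_i in partition)
print(P(B))   # 5/12
```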

2. Probability of proper difference.

If A ⊂ B then P(B \ A) = P(B) − P(A).

Here is why: The events A and B \ A are disjoint and have union equal to B. By
(AXIOM TWO), P(B) = P(A) + P(B \ A).

3. Monotonicity.
If A ⊂ B then P(A) ≤ P(B).
Here is why: If A ⊂ B then P(B) − P(A) = P(B \ A) ≥ 0.

CHAPTER 7. ELEMENTARY PROBABILITY PROPERTIES 36

4. Probability is at most 1. Always,

0 ≤ P(A) ≤ 1.

Here is why: A is always a subset of Ω. Hence, by monotonicity, P(A) ≤ P(Ω). But P(Ω) = 1,
by (AXIOM ONE).

5. Probability of negation.
P(Ac ) = 1 − P(A).
Here is why: Since A and Ac are disjoint with union equal to Ω, we have P(A) + P(Ac ) = P(Ω)
by (AXIOM TWO). But P(Ω) = 1 by (AXIOM ONE).

6. Probability of the empty set.


P(∅) = 0.
Here is why: Since Ω and ∅ are disjoint with union equal to Ω we have P(Ω) + P(∅) = P(Ω)
by (AXIOM TWO). Canceling P(Ω), we get P(∅) = 0.

7. Probability of the union of two sets (inclusion-exclusion formula).

P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

Here is why: The events A \ (A ∩ B), B \ (A ∩ B) and A ∩ B are pairwise disjoint with union
A ∪ B. 1 By (AXIOM TWO) then,

P(A ∪ B) = P(A \ (A ∩ B)) + P(B \ (A ∩ B)) + P(A ∩ B)


= P(A) − P(A ∩ B) + P(B) − P(A ∩ B) + P(A ∩ B)
= P(A) + P(B) − P(A ∩ B).

8. Boole’s inequality.

P(∪_{i=1}^n Ai) ≤ Σ_{i=1}^n P(Ai).

Here is why. If n = 2 then P(A1 ∪ A2 ) = P(A1 ) + P(A2 ) − P(A1 ∩ A2 ) ≤ P(A1 ) + P(A2 ). We
proceed by induction. If the statement holds for some n ≥ 2 then it also holds for n + 1
because ∪_{i=1}^{n+1} Ai = (∪_{i=1}^n Ai) ∪ An+1 .

9. Probability of the union of many sets (general inclusion-exclusion formula).

P(∪_{i=1}^n Ai) =(a) Σ_{∅≠I⊂{1,...,n}} (−1)^{|I|−1} P(∩_{i∈I} Ai)    (7.1)
               =(b) Σ_{k=1}^n (−1)^{k−1} Σ_{1≤i1<···<ik≤n} P(Ai1 ∩ · · · ∩ Aik ).

1 Indeed the first and the second are disjoint from the third, while the first two are disjoint because the first
set equals A ∩ (A ∩ B)c , the second set equals B ∩ (A ∩ B)c and so their intersection equals (A ∩ B) ∩ (A ∩ B)c = ∅.
Furthermore, (A ∩ (A ∩ B)c ) ∪ (B ∩ (A ∩ B)c ) ∪ (A ∩ B) = A ∪ B because of the distributive property of ∪ over ∩.

Let’s understand how to read the formula before explaining it.

Equality (a) involves a sum over all nonempty subsets I of {1, . . . , n}; the empty set is
omitted (by convention, an intersection over an empty index set is the whole space Ω, and
its term would spoil the formula).2 For each I, we compute the probability of the intersection
of the sets Ai , i ∈ I, and multiply it by −1 if the number of elements |I| of I is even.
Equality (b) is a rewriting of (a) obtained by grouping the sets I according to their sizes.
So, for example, with n = 3, the nonempty sets I are {1}, {2}, {3}, {1, 2}, {1, 3}, {2, 3} and
{1, 2, 3}. So we have

P(A1 ∪A2 ∪A3 ) = P(A1 )+P(A2 )+P(A3 )−P(A1 ∩A2 )−P(A1 ∩A3 )−P(A2 ∩A3 )+P(A1 ∩A2 ∩A3 ).

– One way to show the veracity of the formula is to use the already known n = 2 case and
induction.
– For another way, look at Problem 9.12.

10. Bonferroni inequalities. Look at the last side of the inclusion-exclusion formula and
consider not the whole sum but the sum of the first m terms. Define

      Bm := Σ_{k=1}^{m} (−1)^{k−1} Σ_{1≤i1<···<ik≤n} P(Ai1 ∩ · · · ∩ Aik ).

Then
      B2 ≤ B4 ≤ · · · ≤ P(A1 ∪ · · · ∪ An ) ≤ · · · ≤ B3 ≤ B1 .

In other words, the odd-indexed Bm provide better and better upper bounds for the probability
of the union, while the even-indexed Bm provide better and better lower bounds. To
understand this, take a look at Appendix A on Counting.

PROBLEM 7.1 (the conjunction fallacy). Linda is 31 years old, single, outspoken, and very
bright. She majored in philosophy. As a student, she was deeply concerned with issues of
discrimination and social justice, and also participated in anti-nuclear demonstrations. Which
is more probable?
(a) Linda is a bank teller.
(b) Linda is a bank teller and is active in the feminist movement.
Answer. The answer is (a). The reason is that

(b) ⇒ (a) is a true statement,

so we apply Property 3, that is, monotonicity. Tversky and Kahneman observed that most
people say that (b) is more probable, which is wrong. This is called the “conjunction fallacy”.
But most people are clueless about elementary mathematics, or even elementary logic, so this
comes as no surprise. What is surprising is that this obvious thing has given rise to many
publications. 
2
Pay attention to logic: the sentence “for all x ∈ A blablabla” is vacuously true if A = ∅ because, by definition, ∅
contains nothing that could violate it. For this reason the intersection over an empty index set is conventionally
the whole space Ω, and the I = ∅ term is simply left out of the sum.

PROBLEM 7.2 (probabilities must add up to what?). Someone claims that a biased cube
lands on face k with probability 1/(k + 1), k = 1, 2, 3, 4, 5, 6. Why is he wrong?
Answer. (1/2) + (1/3) + (1/4) + (1/5) + (1/6) + (1/7) = 223/140 > 1, so (AXIOM ONE) is violated.
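The arithmetic can be verified exactly with Python's fractions module (an aside, not part of the notes):

```python
from fractions import Fraction

# Exact check that the claimed face probabilities cannot form a probability measure.
total = sum(Fraction(1, k + 1) for k in range(1, 7))
assert total == Fraction(223, 140)   # ≈ 1.593, not 1
assert total != 1                    # violates (AXIOM ONE): P(Ω) must equal 1
```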

PROBLEM 7.3 (estimating chance of winning in the lottery). The probability of winning in
a certain lottery when you buy 1 ticket is 0.0001 = 10^−4 . When you buy 2 tickets the probability
of both of them winning is 0.0000001 = 10^−7 . You buy 100 tickets. Estimate from above and
from below the probability that at least one of them wins.
Answer. If Ai is the event that the i-th ticket wins, i = 1, . . . , n = 100, then the event that at least
one of them wins is A1 ∪ · · · ∪ An . By Boole’s inequality, we have

      P(A1 ∪ · · · ∪ An ) ≤ Σ_{i=1}^{n} P(Ai ) = n · 0.0001 = 0.01 = 1%.

By Bonferroni’s inequality (the case m = 2),

      P(A1 ∪ · · · ∪ An ) ≥ Σ_{i=1}^{n} P(Ai ) − Σ_{i<j} P(Ai ∩ A j ) = 0.01 − (n(n − 1)/2) · 10^−7
                         = 0.01 − 0.000495 = 0.009505 = 0.9505%.

So the probability that at least one of your tickets wins is at most 1% but not much less than
that, because it is at least 0.9505%. 

PROBLEM 7.4 (deadly sins). A Zoroastrian priest found out that a man has probability at
least 0.01 of committing one of a finite number of deadly sins during his life, and if this happens
then he will go to an unpleasant place called Duzakh. What can you say about the probability
that he will not go to Duzakh?
Answer. If S is the event that a man will commit one of the sins and D the event that he
will end up in Duzakh then we know that S ⊂ D. We also know that P(S) ≥ 0.01. Hence
0.01 ≤ P(S) ≤ P(D). Therefore P(Dc ) = 1 − P(D) ≤ 1 − 0.01 = 0.99. So we learn that the
probability he won’t end up in Duzakh is at most 0.99. 

PROBLEM 7.5 (rich or famous). To understand the inclusion-exclusion formula in a slightly
different context, consider a finite set of people. You observe that 10% of them are famous, 5%
are rich and 3% are rich and famous. What is the percentage of people that are rich or famous?
Answer. Let n be the size of the set. Let F, R be the set of people who are famous, rich,
respectively. We are given that |F|/n = 10%, |R|/n = 5% and |F ∩ R|/n = 3%. Since

|F ∪ R| = |F| + |R| − |F ∩ R|

we have that |F ∪ R|/n = 10% + 5% − 3% = 12%. 


Part III

ELEMENTARY PROBABILITY AND


PREREQUISITE MATERIAL

Chapter 8

Probability on discrete sets

The simplest probability is the uniform probability which


means “equally likely” in the discrete case and something
like this, but more advanced in more general cases that
will be dealt with in a later chapter. Uniformity
necessitates counting because the uniform probability of
an event is proportional to the number of outcomes
(elements) of the event. You can immediately consult
Chapter 10 containing many problems under uniform
probability. By finite probability we mean probability on
finite sets.

A set (in particular, a sample space Ω) is called discrete if it is finite or countably infinite.

8.1 Probability on finite sets


If Ω is a finite set then any function p : Ω → [0, 1] such that Σ_{ω∈Ω} p(ω) = 1 defines a
probability measure P with domain E = P(Ω) via the formula

      P(A) = Σ_{ω∈A} p(ω),   A ⊂ Ω.

Conversely, for any probability measure P on P(Ω) there corresponds a function p : Ω → [0, 1]
such that the last display holds.

?PROBLEM 8.1 (probability measures on finite sets). Explain the claim just made.
Answer. If A, B are disjoint subsets of Ω then ω ∈ A ∪ B if and only if either ω ∈ A or ω ∈ B, and so
P(A ∪ B) = Σ_{ω∈A∪B} p(ω) = Σ_{ω∈A} p(ω) + Σ_{ω∈B} p(ω) = P(A) + P(B). This shows that (AXIOM TWO)
holds. On the other hand, P(Ω) = Σ_{ω∈Ω} p(ω) = 1, so (AXIOM ONE) holds.
To see that the converse is true, let P be a probability measure on P(Ω) and define p(ω) :=
P({ω}). Then, because P satisfies (AXIOM TWO), and because A = ∪_{ω∈A} {ω}, we have that
P(A) = Σ_{ω∈A} p(ω) holds.
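The construction can be sketched in a few lines of Python (an illustration, not part of the notes; the numbers anticipate the biased one-pound coin of Problem 8.2 below):

```python
# A probability measure on a finite Ω built from a probability mass function p,
# via P(A) = Σ_{ω∈A} p(ω).
p = {"H": 0.4994, "T": 0.4996, "S": 0.001}   # the biased one-pound coin below
assert abs(sum(p.values()) - 1.0) < 1e-12    # p must sum to 1

def P(A):
    """P(A) = Σ_{ω∈A} p(ω) for any event A ⊂ Ω."""
    return sum(p[w] for w in A)

assert abs(P({"H", "T"}) - 0.999) < 1e-12    # coin lands heads or tails
assert P(set()) == 0                          # P(∅) = 0
```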

CHAPTER 8. PROBABILITY ON DISCRETE SETS 41

PROBLEM 8.2 (a biased one-pound coin). Consider a one-pound coin and the experiment
of tossing it. The outcomes are three: H (heads), T (tails), S (standing on its side). Experience
shows that the probabilities of these outcomes are 0.4994, 0.4996, 0.001, respectively. Why
does this define a probability measure? What is the probability of the event that the coin lands
heads or tails?
Answer. It defines a probability measure because

0.4994 + 0.4996 + 0.001 = 1.

The probability of the event that the coin lands heads or tails is 0.4994 + 0.4996 = 0.999. 

Product rule. If Ω1 , Ω2 are finite sets and we have probability measures P1 , P2 defined on
them, respectively, then we can define a probability P on Ω = Ω1 × Ω2 by the product rule

      P(ω1 , ω2 ) := P1 (ω1 )P2 (ω2 ),   ω1 ∈ Ω1 , ω2 ∈ Ω2 .

?PROBLEM 8.3 (the product rule produces a new probability measure). Explain that the
product rule produces a probability measure.
Answer. All we have to do is check that the sum of P(ω1 , ω2 ) over all possible values of its
arguments equals 1. Indeed,

      Σ_{ω1 ∈Ω1 , ω2 ∈Ω2 } P1 (ω1 )P2 (ω2 ) = Σ_{ω1 ∈Ω1 } P1 (ω1 ) Σ_{ω2 ∈Ω2 } P2 (ω2 ) = 1 · 1 = 1.

PROBLEM 8.4 (a pair of coins). Consider the experiment of throwing a pair of identical
coins. We take as Ω the set of all 9 pairs (H,H), (H,T), (H,S), (T,H), (T,T), (T,S), (S,H), (S,T), (S,S).
Suppose that the two coins are biased exactly as above. Assign P according to the product
rule and compute the probability that heads show at least once.
Answer. The event “heads show at least once” is the set

      A = {(H,H), (H,T), (H,S), (T,H), (S,H)}.

Its probability, that is, P(A), is given by the sum of the probabilities of each of its 5 elements.
Hence P(A) = PH PH + PH PT + PH PS + PT PH + PS PH = PH PH + 2PH PT + 2PH PS = 0.24940036 +
2 × 0.24950024 + 2 × 0.0004994 = 0.74939964.
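A sketch of the product-rule computation in Python (not part of the notes), checking the answer against the complement 1 − (1 − PH)²:

```python
from itertools import product

# Product-rule measure for the pair of biased coins; P(heads at least once) two ways.
p = {"H": 0.4994, "T": 0.4996, "S": 0.001}
P2 = {(a, b): p[a] * p[b] for a, b in product(p, p)}   # measure on Ω1 × Ω2
assert abs(sum(P2.values()) - 1.0) < 1e-12

at_least_one_head = sum(pr for (a, b), pr in P2.items() if "H" in (a, b))
assert abs(at_least_one_head - 0.74939964) < 1e-12
assert abs(at_least_one_head - (1 - (1 - p["H"]) ** 2)) < 1e-12  # complement check
```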

PROBLEM 8.5 (a small chessboard). Consider a “chessboard” of size 3 × 3 consisting of


9 cells. We place two identical balls at random on the board following the rules that the
probability p1 that the two balls fall in the same cell is the same for all cells, and the probability
p2 that they fall in a given pair of distinct cells is also the same for all pairs of cells and is twice
p1 . What is the probability of each configuration? What is the probability that the balls lie in a
cell on the main diagonal?

Answer. The set of configurations (sample space) Ω contains 9 configurations with two balls
in the same cell and 9 × 8/2 = 36 with two balls in different cells. We have

1 = 9p1 + 36p2 = 9p1 + 72p1 = 81p1 .

So there are 9 configurations with probability p1 = 1/81 each and 36 with probability p2 = 2/81
each. The event that the balls lie on the main diagonal contains 3 configurations of probability
p1 and 3 of probability p2 . Hence the probability that the balls lie on the main diagonal equals
3(p1 + p2 ) = 1/9. 

8.2 Uniform probability measure and counting


When Ω is a finite set and p(ω) is a constant function then the derived probability measure is
called uniform. We also say that we have equipped Ω with the uniform probability. But if
p(ω) is constant, then it must be
      p(ω) = 1/|Ω|,
so that its sum equals 1. In this case, we have
      P(A) = |A|/|Ω|,
for any A ⊂ Ω. Another expression that people use for this situation is:

Select ω uniformly at random from Ω and compute the probability of the event A.

This is just a colloquial way of saying

Let P be the uniform probability measure on Ω and compute P(A).

PROBLEM 8.6 (chance that a number is even). What is the probability that an integer selected
at random between 1 and 100 is even?
Answer. We have Ω = {1, 2, . . . , 100}. Let A be the event containing all even integers. Clearly,
|A| = 50. Hence P(A) = 50/100 = 1/2. 
If x is a real number then let

      ⌊x⌋ = max{k ∈ Z : k ≤ x} = integer part of x.

For example, ⌊5.391⌋ = 5, ⌊−7.2⌋ = −8. The number ⌊x⌋ is the unique integer that satisfies the
inequalities
      x − 1 < ⌊x⌋ ≤ x.
This is because any interval of the form (x − 1, x] must contain an integer (for if it didn’t,
there would be consecutive integers m, m + 1 with m ≤ x − 1 < x < m + 1, whence
1 ≤ x − m < 1, which is impossible) and this integer is unique (because two distinct integers
differ by at least 1, while any two points of (x − 1, x] differ by less than 1).

PROBLEM 8.7 (chance that a number is divisible by k). Let n, k be positive integers. What
is the probability that a randomly selected integer between 1 and n is divisible by k? What
is this number when n = 2384 and k = 57? If p, q are two distinct prime numbers then show
that the probability that a randomly selected integer between 1 and n is divisible by p and q
converges to 1/pq when n → ∞.
Answer. Let Ω = {1, . . . , n}. The event “ω is divisible by k” is the set Ak of all ω ∈ Ω such that
ω = km for some positive integer m. The cardinality of Ak equals max{m ∈ Z : mk ≤ n} =
⌊n/k⌋ and so P(Ak ) = ⌊n/k⌋/n. For the specific values, we have P(Ak ) = ⌊2384/57⌋/2384 =
⌊41.824 · · · ⌋/2384 = 41/2384 ≈ 0.017198. Notice that the latter number is approximately
equal to 1/57 ≈ 0.017544. For the last question, the event of interest is Ap ∩ Aq . Notice that a number is
divisible by p and q if and only if it is divisible by pq, because p and q have no common factor.
That is, Ap ∩ Aq = Apq , and so P(Apq ) = ⌊n/pq⌋/n → 1/pq as n → ∞.
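A direct count confirms the formula and the specific values (Python sketch, not part of the notes):

```python
# The number of integers in {1, ..., n} divisible by k is ⌊n/k⌋; specific case n = 2384, k = 57.
def p_divisible(n, k):
    count = sum(1 for w in range(1, n + 1) if w % k == 0)   # direct count
    assert count == n // k                                   # ⌊n/k⌋, as claimed
    return count / n

assert 2384 // 57 == 41
assert abs(p_divisible(2384, 57) - 41 / 2384) < 1e-15
```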

PROBLEM 8.8 (roll three dice, get at least one 6). Roll three dice and compute the probability
that we get at least one 6.
Answer. An outcome here is a triple ω = (ω1 , ω2 , ω3 ), where ωi is the result of die i, i = 1, 2, 3.
The set of outcomes has 6^3 = 216 elements. The event “we get at least one 6” is the negation
of the event “we get no 6”. The latter event is the set of outcomes ω such that ωi ≠ 6 for all
i = 1, 2, 3; this means that each ωi takes values 1, 2, 3, 4 or 5. Hence the event “we get no 6” has
5^3 = 125 outcomes. The answer is therefore

      P(get at least one 6) = 1 − 125/216 = 91/216 ≈ 0.42.


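Enumerating all 216 outcomes confirms the count (Python sketch, not part of the notes):

```python
from itertools import product

# Enumeration check of P(at least one 6) for three dice.
outcomes = list(product(range(1, 7), repeat=3))
assert len(outcomes) == 216
hits = sum(1 for w in outcomes if 6 in w)
assert hits == 216 - 125          # complement of "we get no 6"
assert hits / 216 == 91 / 216
```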
PROBLEM 8.9 (a tourist in London). A tourist in London is given a list of the 10 most
important monuments of the city along with the recommendation to visit the Tower of London,
the Buckingham Palace and the Westminster cathedral. Because of lack of time, she decides to
visit only 3 of them at random. What is the probability that she visits the recommended ones
and in order of increasing distance from her hotel?
Answer. Take as outcome ω = (ω1 , ω2 , ω3 ), a triple of distinct monuments arranged in order
of increasing distance from the tourist’s hotel. The set Ω of outcomes has cardinality
10 × 9 × 8 = 720 and, by assumption, the probability measure considered is uniform on Ω.
There is only one outcome contained in the event under consideration. Hence the answer is
1/720. 

?PROBLEM 8.10 (select a set at random). Consider the set S = {1, 2, . . . , d}. What is the
probability that a randomly selected subset of S contains the elements 1 and 2?
Answer. The universe here is the set

      Ω = P(S) := set of all subsets of S.

There are 2^d subsets. So an element ω of Ω is a set itself, a subset of S, i.e., ω ⊂ S. Again, we
assign probability 1/2^d to each element ω of Ω. The event A we are interested in is the
set of all sets ω ⊂ S such that {1, 2} ⊂ ω. How many elements does A contain? Well, every
element ω ∈ A is a set that contains 1 and 2. In symbols,

      A = {ω ∈ Ω : {1, 2} ⊂ ω}.

Since the set ω \ {1, 2} is a subset of S \ {1, 2} and since there are 2^{d−2} subsets of S \ {1, 2}, we
have
      |A| = 2^{d−2} .
Hence
      P(A) = 2^{d−2} /2^d = 1/4.

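For a small d one can enumerate all subsets and confirm the answer 1/4 (Python sketch, not part of the notes):

```python
from itertools import combinations

# A uniformly random subset of S = {1, ..., d} contains both 1 and 2
# with probability exactly 1/4.
d = 6
S = range(1, d + 1)
subsets = [set(c) for k in range(d + 1) for c in combinations(S, k)]
assert len(subsets) == 2 ** d
good = sum(1 for w in subsets if {1, 2} <= w)
assert good == 2 ** (d - 2)
assert good / len(subsets) == 0.25
```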
To do the next problem, we consider the following definition.

      If S is a set of size n then we define (n choose k) as the number of subsets of S of size k.

For example, (n choose 1) = n because there are exactly n sets of size 1.

?PROBLEM 8.11 (derive the formula for (n choose m)). Show that

      (n choose m) = n(n − 1)(n − 2) · · · (n − m + 1) / (m(m − 1)(m − 2) · · · 1).

Answer. We take a set with n elements, say the first n positive integers. Suppose we have m
boxes in a row, labeled 1, 2, . . . , m. Put one of the integers in box 1 (there are n choices). Put
a different integer in box 2 (there are n − 1 choices). Put an integer, different from the ones
placed in the first two boxes, in box 3 (there are n − 2 choices). And so on. Hence there are

      n(n − 1)(n − 2) · · · (n − m + 1)

arrangements. But a set is an unordered collection. There are

      m(m − 1) · · · 1

ways to scramble the boxes. A moment of reflection then shows that we must divide the two
displays in order to get a formula for the number of subsets of size m (unordered m-tuples).

We now use a couple of useful abbreviations:
The m-falling factorial of n is the number obtained by multiplying, starting from n, the first
m integers that are less than or equal to n. We denote this by (n)m . Thus,

      (n)m := n(n − 1)(n − 2) · · · (n − m + 1) = Π_{j=0}^{m−1} (n − j).          (8.1)

The n-falling factorial of n, that is, the number (n)n , is also called n-factorial and is denoted by

      n! := (n)n = n(n − 1) · · · 1 = Π_{j=0}^{n−1} (n − j) = Π_{k=1}^{n} k.

Note that
      (n)m = n! / (n − m)!.

We can thus write
      (n choose m) = (n)m / m! = n! / (m!(n − m)!),

and so we have
      (n choose m) = (n choose n − m).
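These identities are easy to check with Python's math module (a sketch, not part of the notes):

```python
from math import comb, factorial, prod

# (n)_m = n!/(n−m)! and (n choose m) = (n)_m / m! = n!/(m!(n−m)!).
def falling(n, m):
    return prod(n - j for j in range(m))        # n(n−1)···(n−m+1)

n, m = 10, 4
assert falling(n, m) == factorial(n) // factorial(n - m)
assert comb(n, m) == falling(n, m) // factorial(m)
assert comb(n, m) == comb(n, n - m)             # symmetry
```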
?PROBLEM 8.12 (number of one-to-one functions between finite sets). In choosing a
function uniformly at random from the set {1, . . . , m} into the set {1, . . . , n}, what is the
probability that the function is an injection?
Answer. Let Ω be the set of all functions from the set {1, . . . , m} into the set {1, . . . , n}. We have
|Ω| = n^m . Let I be the set of injections. We have |I| = (n)m . Hence P(I) = (n)m /n^m . Note that
P(I) = 0 if m > n – as it should.

?PROBLEM 8.13 (labeled balls in labeled boxes). We put m balls (labeled by the numbers
1, . . . , m) in n boxes (labeled by the numbers 1, . . . , n) at random. What is the probability that
box 1 contains k balls?
Answer. Each ball goes to a unique box. Let us use some notation for this. If the ball has label
i we let ω(i) be the label of the box it goes into. We realize then that an outcome is simply a
function ω from the set of balls to the set of boxes:

ω : set of balls → set of boxes.

We denote this function by ω = (ω(1), . . . , ω(m)) and we can think of it as an ordered m-tuple.
The set of outcomes Ω is the set of all such m-tuples. Since ω(1) can take n values, and ω(2)
can also take n values, and so on, we see that there are n^m outcomes. Hence |Ω| = n^m . We are
interested in the event A of all m-tuples ω such that exactly k of the entries of ω are equal to 1.
We look at the set I of indices i such that ω(i) = 1. This set is required to have size k. Hence it
can be selected in (m choose k) ways. The entries of ω with indices not in I can each take any
of the other n − 1 values. Hence |A| = (m choose k)(n − 1)^{m−k} and so

      P(A) = |A|/n^m = (m choose k)(n − 1)^{m−k} / n^m .          (8.2)
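Formula (8.2) can be confirmed by brute-force enumeration for small m and n (Python sketch, not part of the notes):

```python
from itertools import product
from math import comb

# P(box 1 gets exactly k of the m labeled balls) = C(m, k)(n−1)^{m−k}/n^m.
m, n, k = 4, 3, 2
outcomes = list(product(range(1, n + 1), repeat=m))   # all n^m assignments
assert len(outcomes) == n ** m
hits = sum(1 for w in outcomes if w.count(1) == k)
assert hits == comb(m, k) * (n - 1) ** (m - k)
```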


?PROBLEM 8.14 (birthday coincidences). In a certain room there are n people. What is the
chance that at least two of them have a common birthday? Assume that every year has d = 365
days.
Answer. If n > d = 365 then, certainly, at least two people have the same birthday. So assume
that n ≤ d. To answer the problem we need a mathematical model. We choose an outcome
to be an assignment of a day to each of the n people. Let ωi be the day (his/her birthday)
assigned to person i. Hence Ω is the set of all such assignments ω = (ω1 , . . . , ωn ). Equivalently,
Ω is the set of all functions from {1, . . . , n} to {1, . . . , d} (where 1 means the 1st of January, 2
means the second, and so on, until d, the last day of the year). There are

      |Ω| = d^n

assignments. We are interested in the event

      B = {ω ∈ Ω : there are distinct people i and j with ωi = ω j }.          (8.3)

The negation of this event corresponds to the set

      B^c = Ω \ B = {ω ∈ Ω : all ωi are distinct}.

But ω has distinct values if and only if it is one-to-one. The number of such functions is

      |B^c | = (d)n .

But what is P? In the absence of information, we shall assume that P is uniform, that is, we let
the P of a single assignment {ω} be equal to 1/d^n . Therefore,

      P(B^c ) = (d)n /d^n ,    P(B) = 1 − (d)n /d^n .

Check now that P(B) ≈ 50.7% when n = 23. Hence there is more than a fair chance of a birthday
coincidence in a room with 23 or more people. In fact, check that P(B) ≈ 94.1% when n = 45, so
if you’re in a room with more than 40 people then bet with someone that there is a birthday
coincidence. Chances you win are high. (Chances are that, unless the other person knows a bit
about probability, he or she will accept the bet! So, go for it.)
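The probability P(B) = 1 − (d)n /d^n is easy to compute numerically; a few lines of Python (a sketch, not part of the notes) reproduce the values above:

```python
# Birthday probability P(B) = 1 − (d)_n / d^n.
def p_coincidence(n, d=365):
    p_all_distinct = 1.0
    for j in range(n):                  # (d)_n / d^n as a product of factors (d−j)/d
        p_all_distinct *= (d - j) / d
    return 1 - p_all_distinct

assert abs(p_coincidence(23) - 0.507) < 1e-3
assert p_coincidence(45) > 0.94
assert p_coincidence(366) == 1.0       # pigeonhole: more people than days
```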

Figure 8.1: The probability that in a group of n people at least two are born on the same day as a
function of n. Notice that with about 25 people the probability is about 0.5 and with about 50
people the probability is close to 1.

8.3 Probability from data


Data is a finite collection of numbers or, more generally, elements from some set. Data can be
considered as ordered or unordered. More generally, it can be considered as indexed by sets
other than integers, for example, pairs of integers.
Given an ordered collection of numerical data, for example, a1 , . . . , an , we define a
probability on R via the formula

      P̂(B) := (1/n) Σ_{j=1}^{n} δ_{a_j}(B),    B ⊂ R,          (8.4)

where δa (B) is defined as

      δa (B) = 1 if a ∈ B, and 0 otherwise.          (8.5)

We refer to δa as the Dirac probability measure. Hence P̂(B) gives the fraction of the data
in the set B. This probability measure is called empirical probability measure or empirical
distribution.
PROBLEM 8.15 (data from 10 coin tosses). Let us consider the outcomes of d = 10 coin tosses
and mark 1 for heads and 0 for tails. Suppose that the outcomes are

      1, 0, 0, 0, 1, 1, 0, 1, 0, 0.

Compute P̂{0} and P̂{1} by first setting B to {0} and then to {1} in formula (8.4). Then find the
values of the probability measure P̂ on all subsets of {0, 1}.
Answer. For the given data, formula (8.4) says

      P̂(B) = (1/10)(δ1 (B) + δ0 (B) + δ0 (B) + δ0 (B) + δ1 (B) + δ1 (B) + δ0 (B) + δ1 (B) + δ0 (B) + δ0 (B)).

Setting B = {0} we find
      P̂{0} = 6/10,
and setting B = {1} we find
      P̂{1} = 4/10.
OK, we don’t have to do all that. Just count the number of 1s in the data (4 of them) and the
number of 0s (6 of them). Thus the probability measure P̂ assigns value 6/10 to {0}, value 4/10 to
{1}, value 1 to {0, 1} and value 0 to ∅.
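Formula (8.4) in code (Python sketch, not part of the notes):

```python
# The empirical measure of the ten recorded coin tosses.
data = [1, 0, 0, 0, 1, 1, 0, 1, 0, 0]

def P_hat(B):
    """Fraction of the data lying in the set B."""
    return sum(1 for a in data if a in B) / len(data)

assert P_hat({0}) == 0.6
assert P_hat({1}) == 0.4
assert P_hat({0, 1}) == 1.0
assert P_hat(set()) == 0.0
```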
PROBLEM 8.16 (experimental problem). Pick up four identical coins, toss them, and record
the number of heads. Repeat this n = 100 times and record the outcomes in a table as below.
You are supposed to fill in the right column.

      trial    no. of heads
      1        2
      2        1
      3        1
      4        4
      5        3
      6        0
      7        2
      ..       ..
      100      3

Let P̂ be the empirical probability measure. Compute P̂(B) for B equal to {0}, {1}, {2}, {3}, {4}
and for B = {odd number of heads}. N.B. You can use the online coin flipper or write a simple
code in Maple:

for i to 100 do
A:=Array([rand(0..1)(),rand(0..1)(),rand(0..1)(),rand(0..1)()]);
A[1]+A[2]+A[3]+A[4];
end do

Answer. First count the number of trials that result in j heads, j = 0, 1, 2, 3, 4 and form a table.
Make sure that the sum over j equals n = 100.
j no. of trials resulting in j heads
0 7
1 21
2 39
3 24
4 9
sum 100
Then compute P̂(B) for the various instances of B asked for:

      P̂{0} = 7/100
      P̂{1} = 21/100
      P̂{2} = 39/100
      P̂{3} = 24/100
      P̂{4} = 9/100
      P̂(odd number of heads) = P̂{1, 3} = P̂{1} + P̂{3} = 21/100 + 24/100 = 45/100.

Chapter 9

Random variables

“Random variables are neither random nor variables and I bear


no responsibility if you’ve been told they are.”
– Original

9.1 Random variables are functions


If Ω is a sample space then a random variable is simply a function from Ω into the real
numbers. That is, a random variable is a concrete number associated to each outcome ω ∈ Ω.
If X denotes a random variable then the notation
X:Ω→R
simply means that X is a function. The set of values of X is another name for the image X(Ω)
of the function X.
PROBLEM 9.1 (3 coins and two random variables). Let Ω be the sample space corresponding
to tossing a coin 3 times. Describe the space and define a couple of random variables, one that
represents the total number of heads and another that indicates whether the total number of
heads is odd or even.
Answer. If we indicate heads by 1 and tails by 0 then Ω contains ω = (ω1 , ω2 , ω3 ) where each
ωi is 0 or 1. Explicitly,
Ω = {(0, 0, 0), (0, 0, 1), (0, 1, 0), (1, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0), (1, 1, 1)}.
Define
X(ω) = X(ω1 , ω2 , ω3 ) = ω1 + ω2 + ω3 .
So X(ω) is the number of heads in the outcome ω = (ω1 , ω2 , ω3 ). Next define

      Y(ω) = 1 if X(ω) is odd, and Y(ω) = 0 if X(ω) is even.

Notice that Y(ω) is not just a function of ω but also a function of X(ω).

CHAPTER 9. RANDOM VARIABLES 50

Caution! The definition of a random variable does not involve the probability measure
that sits on Ω.
Caution! The word “variable” in the terminology “random variable” is wrong. It should
be function.

9.2 The distribution of a random variable under a probability measure
The distribution of a random variable. If P is a probability measure on Ω and X : Ω → R
then we define the distribution of X to be the probability measure

PX (B) = P(X−1 (B)),

when B is a subset of R, where X−1 (B) is the inverse image of the set B by the function X, that
is,
X−1 (B) = {ω ∈ Ω : X(ω) ∈ B}.
We will come back to this notion in Chapters 12 and 16. I now want to point out that we
frequently use another notation for X−1 (B): we write {X ∈ B}. So

{X ∈ B} = X−1 (B).

The distribution of a discrete random variable. A random variable X is discrete if its set of
values is discrete. If X is a discrete random variable then its distribution is determined by the
probability mass function x 7→ P(X = x), that is, by the numbers

p(x) = P(X = x) = P{ω ∈ Ω : X(ω) = x}, x ∈ X(Ω).

Note that, by (AXIOM ONE) and (AXIOM TWO), p(x) ≥ 0 for all x and
      Σ_{x∈X(Ω)} p(x) = 1.

Once a random variable X is given and Ω is equipped with a probability measure P, we can
define the probabilities of events of the form

{X = x} = {ω ∈ Ω : X(ω) = x}.

The collection of these probabilities is a probability measure on R and it is called the distribution
of X.

PROBLEM 9.2 (3 coins and the laws of two random variables). In the Problem 9.1, let P be
the uniform probability measure on Ω, that is,

ω P{ω}
(0,0,0) 1/8
(0,0,1) 1/8
(0,1,0) 1/8
(1,0,0) 1/8
(0,1,1) 1/8
(1,1,0) 1/8
(1,0,1) 1/8
(1,1,1) 1/8

Determine the distributions of X and Y.

Answer. We have

      P(X = 1) = P{ω ∈ Ω : exactly one ωi equals 1} = P{(0, 0, 1), (0, 1, 0), (1, 0, 0)} = 3/8.

Indeed,
      {X = 1} = {(0, 0, 1), (0, 1, 0), (1, 0, 0)}, etc.

So the distribution of X is the collection of numbers

      P(X = 0) = 1/8,  P(X = 1) = 3/8,  P(X = 2) = 3/8,  P(X = 3) = 1/8.

On the other hand,

      P(Y = 0) = P(Y = 1) = 1/2.


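The two distributions can be read off by enumerating Ω (Python sketch, not part of the notes):

```python
from itertools import product
from collections import Counter

# Distributions of X and Y for three fair coin tosses.
omega = list(product([0, 1], repeat=3))          # the 8 equally likely outcomes
X = lambda w: sum(w)                              # number of heads
Y = lambda w: X(w) % 2                            # parity of the number of heads

dist_X = Counter(X(w) for w in omega)             # counts out of 8
assert dist_X == {0: 1, 1: 3, 2: 3, 3: 1}
dist_Y = Counter(Y(w) for w in omega)
assert dist_Y == {0: 4, 1: 4}                     # P(Y = 0) = P(Y = 1) = 1/2
```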
The direct problem. We repeat: A random variable X transfers a probability measure P on


Ω to a probability measure Q = PX on R. Think of it schematically as a black box, that is, like
a machine that takes P as input and spits Q = PX as output:

P→ X →Q
That is, we solved the problem
P → X →?
where the question mark is replaced by Q.

The converse problem. Suppose we want to solve the problem

?→ ? →Q
This means that we are given a probability measure Q on R and we want to find some random
variable X and some probability measure P so that X transfers P to Q. This is the problem
we are faced at all the time in practice. The answers are many. Among these answers people
search for clever ones–something that you may not appreciate without further familiarity and
experience. Here is an answer, an obvious one. Take Ω = R, take P = Q and take X : Ω → R
to be the identity function, namely, X(ω) = ω for all ω ∈ Ω.

9.3 Expectation and moments of a discrete random variable.


Let X : Ω → R, where Ω is equipped with a probability measure P. Then we define the expectation
or mean of X with respect to the probability measure P by

      E(X) = Σ_x x P(X = x),          (9.1)

provided that the sum makes sense.


Caution! Even though the concept of a random variable does not depend on the probability
measure P sitting on its domain Ω, its expectation very much depends. So if we change P
then the expectation changes.
The most basic property of the expectation is that it is linear. That is, for any c ∈ R and any
random variables X, Y we have

E(cX) = cE(X), E(X + Y) = E(X) + E(Y).

Indeed, assuming that c ≠ 0 (the case c = 0 being trivial), we have P(cX = cx) = P(X = x), so
E(cX) = Σ_x (cx)P(X = x) = cE(X). Further, the function X + Y takes values x + y, where x ranges
in the set of values of X and y in the set of values of Y. So E(X + Y) = Σ_{x,y} (x + y)P(X = x, Y = y)
= Σ_{x,y} xP(X = x, Y = y) + Σ_{x,y} yP(X = x, Y = y). But Σ_{x,y} xP(X = x, Y = y) = Σ_x x Σ_y P(X =
x, Y = y) = Σ_x xP(X = x) = E(X), where we used (AXIOM TWO); similarly for the term involving y.
?PROBLEM 9.3 (the law of the unconscious statistician). Explain why

      E(X) = Σ_{ω∈Ω} X(ω)P{ω}.          (9.2)

Answer. Group together the set of ω ∈ Ω that have the same value under X, that is, consider
the sets {X = x} for all x ∈ X(Ω). Then

      Σ_{ω∈Ω} X(ω)P{ω} = Σ_{x∈X(Ω)} Σ_{ω∈{X=x}} X(ω)P{ω} = Σ_{x∈X(Ω)} Σ_{ω∈{X=x}} xP{ω}
                       = Σ_{x∈X(Ω)} x Σ_{ω∈{X=x}} P{ω} = Σ_{x∈X(Ω)} xP(X = x) = E(X).


?PROBLEM 9.4 (monotonicity of expectation). Explain why if X, Y are two random variables
then
      E(X) ≤ E(Y) if X ≤ Y,
where X ≤ Y means X(ω) ≤ Y(ω) for all ω ∈ Ω.
Answer. If, in the expression E(X) = Σ_{ω∈Ω} X(ω)P{ω}, we replace X(ω) by the bigger numbers
Y(ω) we get something bigger. This something is E(Y).
PROBLEM 9.5 (3 coins and the expectation of two random variables). In Problem 9.1 with
P uniform, compute the expectation of X in 3 different ways. Then compute the expectation
of Y.
Answer. Method 1. Use the formula E(X) = Σ_{ω∈Ω} X(ω)P{ω} to get

      E(X) = X(0, 0, 0)P{(0, 0, 0)} + X(0, 0, 1)P{(0, 0, 1)} + · · · + X(1, 1, 1)P{(1, 1, 1)}
           = (0 + 1 + 1 + 1 + 2 + 2 + 2 + 3) · (1/8) = 12/8 = 3/2.

Method 2. Use the formula E(X) = Σ_x xP(X = x) to get

      E(X) = 0 · (1/8) + 1 · (3/8) + 2 · (3/8) + 3 · (1/8) = 12/8 = 3/2.

Method 3. Define Xi (ω) = 1 if ωi = 1 and Xi (ω) = 0 if ωi = 0. That is, Xi is a random variable
that indicates if there is a head at the i-th position, i = 1, 2, 3. Observe that X = X1 + X2 + X3 .
By linearity,
      E(X) = E(X1 ) + E(X2 ) + E(X3 ).
We now have
      E(X1 ) = 1 · (1/2) + 0 · (1/2) = 1/2 = E(X2 ) = E(X3 ).
Hence E(X) = 3/2. As for Y, we have

      E(Y) = 1 · (1/2) + 0 · (1/2) = 1/2.

PROBLEM 9.6 (3 coins and a non-uniform probability measure). Suppose we equip the Ω
in the previous problem with a P defined by

ω P{ω}
(0,0,0) 1/20
(0,0,1) 2/20
(0,1,0) 3/20
(1,0,0) 4/20
(0,1,1) 4/20
(1,1,0) 3/20
(1,0,1) 2/20
(1,1,1) 1/20
What is E(X) now?
Answer. Grouping outcomes that share the same probability, E(X) = (0 + 3) · (1/20) + (1 + 2) · (2/20) +
(1 + 2) · (3/20) + (1 + 2) · (4/20) = 30/20 = 3/2.
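A quick check of this value (Python sketch, not part of the notes):

```python
# E(X) under the non-uniform measure of Problem 9.6, via E(X) = Σ X(ω)P{ω}.
P = {(0,0,0): 1, (0,0,1): 2, (0,1,0): 3, (1,0,0): 4,
     (0,1,1): 4, (1,1,0): 3, (1,0,1): 2, (1,1,1): 1}   # weights out of 20
assert sum(P.values()) == 20
EX = sum(sum(w) * p for w, p in P.items()) / 20         # X(ω) = number of heads
assert EX == 1.5
```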

?PROBLEM 9.7 (the expectation may be infinite or may not exist!). Recall that1 c = Σ_{n=1}^∞ 1/n^2 < ∞.
Consider the random variables X, Y whose laws are defined by

      P(X = n) = 1/(cn^2 ),   n ∈ N,
      P(Y = n) = 1/(2cn^2 ),  n ∈ Z \ {0}.

Explain why E(X) = ∞ and why E(Y) does not exist.
1
Here is why: We have c = 1 + Σ_{n=2}^∞ 1/n^2 ≤ 1 + Σ_{n=2}^∞ 1/(n(n − 1)) = 1 + Σ_{n=2}^∞ (1/(n − 1) − 1/n)
= 1 + lim_{N→∞} Σ_{n=2}^N (1/(n − 1) − 1/n) = 1 + lim_{N→∞} (1 − 1/N) = 1 + 1 − 0 = 2 < ∞.

Answer. We have

      E(X) = Σ_{n=1}^∞ n P(X = n) = Σ_{n=1}^∞ n · 1/(cn^2 ) = (1/c) Σ_{n=1}^∞ 1/n = ∞.

The reason that the latter is infinity is because

      Σ_{n=1}^∞ 1/n ≥ ∫_2^∞ dy/y = lim_{x→∞} ∫_2^x dy/y = lim_{x→∞} (log x − log 2) = ∞.

You have learned these things in Calculus.

As for Y, we have

      E(Y) = Σ_{n∈Z\{0}} n · 1/(2cn^2 ) = (1/2c) Σ_{n∈Z\{0}} 1/n = (1/2c) (Σ_{n=1}^∞ 1/n + Σ_{n=−∞}^{−1} 1/n)
           = (1/2c) (Σ_{n=1}^∞ 1/n − Σ_{n=1}^∞ 1/n) = (1/2c)(∞ − ∞) = undefined!


Functions of random variables and moments. Consider the following situation:

      Ω −X→ R −g→ R,

meaning that X is a random variable and g a function. Then g(X) = g ◦ X is another random
variable which is a function of X. We are in particular interested in the case where g(x) = x^k
for some positive integer k. We define

      µk = k-th moment of X = E(X^k ).

We also define the variance of X by

      var(X) = σ^2 = E[(X − µ)^2 ],   µ = E(X).

The number
      σ = √var(X)
is called the standard deviation of X under the probability measure P. Notice that

      µ2 ≥ µ^2 , that is, E[X^2 ] ≥ (EX)^2 ,

and this is because 0 ≤ (X − µ)^2 = X^2 − 2µX + µ^2 , so,

      0 ≤ E[X^2 − 2µX + µ^2 ] = E[X^2 ] − 2µE[X] + µ^2 = E[X^2 ] − 2µ^2 + µ^2 = E[X^2 ] − µ^2 .

Note also that

      var(X) = E(X^2 ) − (EX)^2 .
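The identity var(X) = E(X^2) − (EX)^2 and the inequality E(X^2) ≥ (EX)^2 can be checked on the 3-coin example (Python sketch, not part of the notes):

```python
from itertools import product

# X = number of heads in three fair coin tosses, uniform P.
omega = list(product([0, 1], repeat=3))
xs = [sum(w) for w in omega]
EX = sum(xs) / len(xs)
EX2 = sum(x * x for x in xs) / len(xs)
var = sum((x - EX) ** 2 for x in xs) / len(xs)
assert abs(var - (EX2 - EX ** 2)) < 1e-12   # var(X) = E(X²) − (EX)²
assert EX2 >= EX ** 2                        # second moment ≥ mean squared
```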

To compute the expectation of g(X) we have two ways. First, we can think of g(X) as a
random variable per se and so

      E[g(X)] = Σ_y y P(g(X) = y),

where the sum ranges over all the values of g(X). This leaves us with the problem of computing
the probability of the event {g(X) = y} = {X ∈ g^{−1} {y}}. Second, we can apply the law of the
unconscious statistician and write

      E[g(X)] = Σ_x g(x)P(X = x).

PROBLEM 9.8 (Markov’s inequality). Let X be a positive discrete random variable. Explain
why, for all t > 0,
      P(X > t) ≤ E(X)/t.

Answer. Let S be the set of values of X. Then, as seen in (9.9), X = Σ_{x∈S} x 1_{X=x} . Let

      St = {x ∈ S : x > t}.

Since X is positive, all x in the sum are positive numbers, and so, summing over the smaller
set St will give a smaller number:

      X ≥ Σ_{x∈St} x 1_{X=x} ≥ t Σ_{x∈St} 1_{X=x} = t 1_{X∈St} ,

where the second inequality is due to the fact that every x in St is at least t (by definition),
and the last equality is from (AXIOM TWO). Now take expectations of both sides; using the
monotonicity of E, we have
      E(X) ≥ t P(X ∈ St ) = t P(X > t).


9.4 Random variables and probability from data


Suppose that we have a bunch of data, represented by a finite sequence

    a_1, …, a_n

of real numbers. Note that the data may not be distinct. Let P̂ be the empirical probability measure (see (8.4)) of the data. Consider the set

    Ω = {a_1, …, a_n}

and let

    m = |Ω|

be the size of Ω. We work with the triple (Ω, P(Ω), P̂). The middle symbol, P(Ω), denotes the set of all subsets of Ω, so P̂ is defined on all subsets of Ω. In other words, any subset of Ω is considered to be an event.

PROBLEM 9.9 (data and the empirical probability space). Suppose that the data are

    3, 1, 5, 3, 4, 1, 2, 5, 3, 4, 1, 1, 2, 2, 4.

Then the triple (Ω, P(Ω), P̂) is the empirical probability space corresponding to the given data. What is Ω, and what is m = |Ω|?
Answer. Ω is the set of data

    Ω = {3, 1, 5, 3, 4, 1, 2, 5, 3, 4, 1, 1, 2, 2, 4} = {1, 2, 3, 4, 5}

(remember that a set does not care whether its elements are repeated!) and m = 5.
Define now the random variable

    J(ω) = ω,   ω ∈ Ω.

In other words, J is the identity function. Then

    {J = a_i} = {ω ∈ Ω : J(ω) = a_i},

so the P̂–probability of the event {J = a_i} is

    P̂(J = a_i) = (# occurrences of a_i in the data sequence) / (# data points).

If we thus let x_1, …, x_m be a listing of the distinct elements of the data sequence, so that

    Ω = {x_1, …, x_m},

and set

    N(x) = # occurrences of x in the data sequence,

then

    P̂(J = x) = N(x)/n,   x ∈ Ω.

The empirical mean or sample mean of J with respect to P̂ is defined to be the expectation Ê(J) under P̂. So we have

    Ê(J) = Σ_{x∈Ω} x P̂(J = x) = Σ_{j=1}^m x_j N(x_j)/n.

Note that

    µ̂ = Ê(J) = (1/n) Σ_{i=1}^n a_i,

and this is because we can express a_1 + ⋯ + a_n by summing over the distinct data values, taking into account their multiplicities:

    Σ_{i=1}^n a_i = Σ_{j=1}^m x_j N(x_j).

Similarly, the empirical variance or sample variance of J is given by

    var̂(J) = σ̂² = Σ_{x∈Ω} (x − µ̂)² P̂(J = x) = Σ_{j=1}^m (x_j − µ̂)² N(x_j)/n = (1/n) Σ_{i=1}^n (a_i − µ̂)².

PROBLEM 9.10 (continuation of Problem 9.9). What is the sample mean and sample standard deviation?
Answer. We have two ways to compute these quantities: by summing over data points or by summing over elements of Ω, taking into account multiplicities. We choose the latter method. We have n = 15 data points. Since x = 1 appears 4 times in the data sequence we have

    P̂{1} = 4/15.

Similarly,

    P̂{2} = 3/15,   P̂{3} = 3/15,   P̂{4} = 3/15,   P̂{5} = 2/15.

So

    µ̂ = 1(4/15) + 2(3/15) + 3(3/15) + 4(3/15) + 5(2/15) = 41/15 ≈ 2.733,

    σ̂² = Ê(J²) − µ̂² = 1²(4/15) + 2²(3/15) + 3²(3/15) + 4²(3/15) + 5²(2/15) − µ̂² = 9.4 − (41/15)² ≈ 1.929,

and the sample standard deviation is σ̂ ≈ 1.389.
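The two ways of computing can be checked in a few lines of Python on the data of Problem 9.9 (a sketch, not part of the notes):

```python
from collections import Counter

# Sample mean and variance for the data of Problem 9.10, computed both
# by summing over data points and by summing over distinct values.
data = [3, 1, 5, 3, 4, 1, 2, 5, 3, 4, 1, 1, 2, 2, 4]
n = len(data)

mean = sum(data) / n                          # sum over data points
var = sum((a - mean) ** 2 for a in data) / n  # empirical variance

counts = Counter(data)                        # multiplicities N(x)
mean2 = sum(x * c / n for x, c in counts.items())
var2 = sum((x - mean2) ** 2 * c / n for x, c in counts.items())

print(round(mean, 3), round(var, 3))  # 2.733 1.929
assert abs(mean - mean2) < 1e-12 and abs(var - var2) < 1e-12
```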

9.5 Indicator functions


If Ω is a set and A ⊂ Ω then we define a random variable that takes value 1 on A and 0 on its complement A^c = Ω \ A. We denote this function by 1_A. So

    1_A(ω) := 1, if ω ∈ A;   0, if ω ∉ A.

We call 1_A the indicator function (or indicator random variable) of the set (event) A.
We first study some algebraic properties.

1A = 1 − 1Ac (9.3)
1A∩B = 1A · 1B (9.4)
1A∪B = 1A + 1B − 1A∩B (9.5)
1A∩B = min(1A , 1B ) (9.6)
1A∪B = max(1A , 1B ) (9.7)

?PROBLEM 9.11 (properties of indicator functions). Explain why (9.3), (9.4), (9.5), (9.6),
(9.7) hold.
Answer. Since 1A (ω) assigns a unique value to each outcome ω ∈ Ω, it is a random variable.
Note that
{ω ∈ Ω : 1A (ω) = 1} = A.

Property (9.3). We must show that the formula 1_A(ω) = 1 − 1_{A^c}(ω) holds for all ω ∈ Ω. There are two cases: either ω ∈ A or ω ∈ A^c. In the first case, 1_A(ω) = 1, 1_{A^c}(ω) = 0, so the formula reads 1 = 1 − 0, while in the second, 1_A(ω) = 0, 1_{A^c}(ω) = 1, so the formula reads 0 = 1 − 1; hence it is correct in both cases.
Property (9.4). Notice that ω ∈ A ∩ B ⟺ ω ∈ A and ω ∈ B ⟺ 1_A(ω) = 1 and 1_B(ω) = 1 ⟺ 1_A(ω)·1_B(ω) = 1 (because, for all nonnegative integers x, y, we have the equivalence xy = 1 ⟺ x = 1 and y = 1).
Property (9.5). Do some trivial algebra:

    (1 − x)(1 − y) = 1 − (x + y) + xy.

Now replace x by 1_A and y by 1_B and use (9.3) to get 1 − x = 1_{A^c}, 1 − y = 1_{B^c}, and then (9.4) to get (1 − x)(1 − y) = 1_{A^c}·1_{B^c} = 1_{A^c ∩ B^c}. Using de Morgan's law, A^c ∩ B^c = (A ∪ B)^c, so (1 − x)(1 − y) = 1_{(A∪B)^c} = 1 − 1_{A∪B}. Substituting into the last display we obtain

    1 − 1_{A∪B} = 1 − (1_A + 1_B) + 1_A·1_B,

and, canceling terms, we arrive at (9.5).


Property (9.6). We argue as in Property (9.4), but now we use the fact that, for x, y ∈ {0, 1}, we have min(x, y) = 1 ⟺ x = y = 1.
Property (9.7). Use the fact that, for x, y ∈ {0, 1}, we have max(x, y) = 1 ⟺ x = 1 or y = 1 (or both).
The expectation of the indicator random variable of the event A equals the probability of A:

E(1A ) = P(A).

Indeed,
E(1A ) = 1 · P(1A = 1) + 0 · P(1A = 0) = P(A) + 0 = P(A).

Therefore, if we take expectations of both sides of (9.5) we obtain

    P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

This is the inclusion-exclusion formula for 2 events. We generalize this for three events by using the identity

    (1 − x)(1 − y)(1 − z) = 1 − (x + y + z) + (xy + xz + yz) − xyz,

or, for n events, by using the identity

    Π_{i=1}^n (1 − x_i) = 1 − Σ_i x_i + Σ_{i<j} x_i x_j − Σ_{i<j<k} x_i x_j x_k + ⋯ + (−1)^n x_1 x_2 ⋯ x_n.   (9.8)

We can write this more systematically as

    Π_{i=1}^n (1 − x_i) = 1 + Σ_I (−1)^{|I|} Π_{i∈I} x_i,

where the sum ranges over all nonempty subsets I of {1, …, n}.

?PROBLEM 9.12 (indicator of union: inclusion-exclusion). Use this last identity to gener-
alize the formula 1A∪B = 1A + 1B − 1A∩B to n events. Then take expectation of both sides to
derive the inclusion-exclusion formula (7.1).
Answer. Let x_i = 1_{A_i}. Then

    Π_{i=1}^n x_i = Π_{i=1}^n 1_{A_i} = 1_{∩_{i=1}^n A_i},
    Π_{i=1}^n (1 − x_i) = Π_{i=1}^n (1 − 1_{A_i}) = Π_{i=1}^n 1_{A_i^c} = 1_{∩_{i=1}^n A_i^c} = 1_{(∪_{i=1}^n A_i)^c} = 1 − 1_{∪_{i=1}^n A_i}.

Substituting in (9.8), we obtain

    1_{∪_{i=1}^n A_i} = Σ_{∅≠I⊂{1,…,n}} (−1)^{|I|−1} 1_{∩_{i∈I} A_i}.

Formula (7.1) follows immediately since E[1_{∪_{i=1}^n A_i}] = P(∪_{i=1}^n A_i) and E[1_{∩_{i∈I} A_i}] = P(∩_{i∈I} A_i).
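The identity behind inclusion-exclusion is easy to test on concrete sets; the three sets below are our own toy example:

```python
from itertools import combinations

# Checking |A_1 ∪ A_2 ∪ A_3| = sum over nonempty I of (-1)^(|I|-1) |∩_{i∈I} A_i|.
A = [set(range(0, 12)), set(range(6, 16)), set(range(4, 20, 2))]
n = len(A)

lhs = len(A[0] | A[1] | A[2])
rhs = 0
for r in range(1, n + 1):
    for I in combinations(range(n), r):
        inter = set(A[I[0]])
        for i in I:
            inter &= A[i]
        rhs += (-1) ** (r - 1) * len(inter)

print(lhs, rhs)  # 18 18
assert lhs == rhs
```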

PROBLEM 9.13 (a very useful, albeit trivial, identity). Let X be a discrete random variable, that is, a function X : Ω → S, where S is a discrete set. Explain why

    X = Σ_{x∈S} x 1_{X=x}.   (9.9)

Answer. We will show that the right-hand side equals X. We clearly have x 1_{X=x} = X 1_{X=x}. Hence

    Σ_{x∈S} x 1_{X=x} = X Σ_{x∈S} 1_{X=x}.

But the sets {X = x}, x ∈ S, are pairwise disjoint and

    ∪_{x∈S} {X = x} = {X = x for some x ∈ S} = X⁻¹(S) = Ω,

so

    Σ_{x∈S} 1_{X=x} = 1_Ω = 1.
PROBLEM 9.14 (summing the tail gives the expectation). Let X be a random variable with values in Z₊ = {0, 1, …}. Explain why

    E(X) = Σ_{n=0}^∞ P(X > n).   (9.10)

Answer. Notice that

    Σ_{n=0}^∞ 1_{X>n} = Σ_{n=0}^{X−1} 1 = X.

Hence

    E(X) = E[ Σ_{n=0}^∞ 1_{X>n} ] = Σ_{n=0}^∞ P(X > n).

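Formula (9.10) can be checked mechanically; the pmf below is an illustrative choice of ours:

```python
from fractions import Fraction

# Checking the tail-sum formula E(X) = sum_{n>=0} P(X > n)
# for an illustrative pmf on {0, 1, 2, 3}.
pmf = {0: Fraction(1, 3), 1: Fraction(1, 6), 2: Fraction(1, 4), 3: Fraction(1, 4)}
assert sum(pmf.values()) == 1

mean = sum(x * p for x, p in pmf.items())
tail_sum = sum(sum(p for x, p in pmf.items() if x > n) for n in range(max(pmf)))

print(mean, tail_sum)  # 17/12 17/12
assert mean == tail_sum
```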

How to use indicator functions to simplify life  Suppose you have to perform the integral

    ∫_{−∞}^∞ f(x)g(x) dx

where f, g are defined “piecewise”, say

    f(x) = 0 for x ≤ 0;  x for 0 < x ≤ 2;  0 for x > 2,
    g(x) = 0 for x ≤ 0;  1 − x for 0 < x ≤ 1;  e^{−x} for x > 1.

Many students resort to drawing pictures to perform this integral. But you don’t have to do that if you write

    f(x) = x 1_{0<x≤2},   g(x) = (1 − x) 1_{0<x≤1} + e^{−x} 1_{x>1},   x ∈ R,

expressions that hold for all −∞ < x < ∞. Check that these expressions are really equivalent to the ones above. Now multiply the two to get

    f(x)g(x) = x(1 − x) 1_{0<x≤2} 1_{0<x≤1} + x e^{−x} 1_{0<x≤2} 1_{x>1}
             = x(1 − x) 1_{0<x≤1} + x e^{−x} 1_{1<x≤2},

because

    {0 < x ≤ 2} ∩ {0 < x ≤ 1} = {0 < x ≤ 1},   {0 < x ≤ 2} ∩ {x > 1} = {1 < x ≤ 2}.

Hence

    ∫_{−∞}^∞ f(x)g(x) dx = ∫_{−∞}^∞ x(1 − x) 1_{0<x≤1} dx + ∫_{−∞}^∞ x e^{−x} 1_{1<x≤2} dx
                         = ∫_0^1 x(1 − x) dx + ∫_1^2 x e^{−x} dx,

and these are integrals that you can easily do (use integration by parts for the second) to get 1/6 for the first and 2e^{−1} − 3e^{−2} for the second.
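The computation can be confirmed numerically; the sketch below implements the indicator expressions directly and compares a midpoint-rule approximation with the exact value:

```python
import math

# The product f(x)*g(x), written with indicator functions as in the text.
def fg(x):
    ind1 = 1.0 if 0 < x <= 1 else 0.0   # indicator of (0, 1]
    ind2 = 1.0 if 1 < x <= 2 else 0.0   # indicator of (1, 2]
    return x * (1 - x) * ind1 + x * math.exp(-x) * ind2

# Midpoint rule on [0, 2]; the integrand vanishes outside this interval.
N = 200_000
h = 2.0 / N
approx = sum(fg((k + 0.5) * h) * h for k in range(N))

exact = 1 / 6 + 2 * math.exp(-1) - 3 * math.exp(-2)
print(round(approx, 5), round(exact, 5))  # 0.49642 0.49642
assert abs(approx - exact) < 1e-6
```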

9.6 Cheating undone

In defining the distribution of a random variable X under a probability measure P we considered sets of the form {X ∈ B}, where B ⊂ R. But in order to use these sets as arguments of P we must ensure that these sets are events, that is, elements of E. This we cannot know a priori, and it is not true for arbitrary functions X : Ω → R.
So we cheated.
To be honest again, we must find a way to undo the cheating. This forces us to revise our definition of a random variable and state that not every function X is a random variable, but only a function such that {X ∈ B} ∈ E whenever B is a “nice” subset of the real line. We then define:

Definition 9.1. A function X : Ω → R, where Ω is equipped with a set of events (a σ-field) E, is called a random variable if

    {X ∈ B} ∈ E whenever B is an interval.

This property is called measurability.

This is something that you will only learn in a slightly more advanced course in probability, if you have the chance to take it. If not, you will never know.
Chapter 10

Classical problems of elementary nature

We collect here a number of classical problems that can be solved by pure logic. I make sure to explain, in each case, what the sample space is and what the probability is. In most cases the probability is uniform. Probability problems sometimes tend to be stated loosely, and it is up to the solver to put them properly in a mathematical setup, which may not be unique. The phrase “at random” often means that one should consider the uniform probability.

Learning probability and statistics at an elementary level cannot be achieved without working through a number of elementary problems and understanding them. These problems come from real life and can often be solved in many ways. One can think “combinatorially” (see Appendix A) or “probabilistically”, that is, using the axioms and properties of probability and its various consequences.
It should be noted that nowhere in this chapter do we make use of independence or of conditioning, at least not explicitly. Although avoiding these two concepts may make one’s life more difficult, it is quite instructive to do so. These two concepts are developed in Chapter 11.

PROBLEM 10.1 (shuffling the letters of a word). In how many ways can you rearrange the word BOOKKEEPER? More generally, consider a language whose alphabet contains d letters, say L_1, …, L_d. Pick a word in this language with length n letters, such that L_i appears n_i times, i = 1, …, d. In how many ways can you rearrange the word? Show that n_1 + ⋯ + n_d = n. Deduce that if n_1, …, n_d are d nonnegative integers with n_1 + ⋯ + n_d = n, then the product n_1!n_2!⋯n_d! divides n!
Answer. The word has length n = 10 letters. If the letters were distinct then the number of
rearrangements is n! = 10!. For example, if we let E1 , E2 , E3 be the three E’s then one of these n!
rearrangements is K2 O1 BO1 E2 PE1 RE3 K1 and another is K1 O1 BO1 E3 PE2 RE1 K2 . However, if we
do not distinguish between identical letters then both these arrangements are KOBOEPEREK.


Thus we are free to permute the 2 K’s between themselves, the 2 O’s between themselves and the 3 E’s between themselves. Hence BOOKKEEPER can be arranged in 10!/(2!2!3!) = 151,200 ways.
For the more general case, we use the same logic: were the letters of the word distinct, we would have n! arrangements. However, given an arrangement and a letter L_i in it, we can permute the occurrences of L_i between themselves (which can be done in n_i! ways) and obtain an identical arrangement. Hence the total number of arrangements is n!/(n_1!n_2!⋯n_d!). Since each letter L_i appears n_i times in the word (and this includes the possibility that n_i = 0) we must have n_1 + ⋯ + n_d = n. The reason that n_1!n_2!⋯n_d! divides n! is that the ratio n!/(n_1!n_2!⋯n_d!) is the number of arrangements, which must be an integer.
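The counting argument is easy to test: brute force on short words, the multinomial formula for BOOKKEEPER (a sketch; the helper name `arrangements` is ours):

```python
from itertools import permutations
from math import factorial

# Number of distinct rearrangements of a word: n!/(n_1! ... n_d!).
def arrangements(word):
    counts = {}
    for ch in word:
        counts[ch] = counts.get(ch, 0) + 1
    out = factorial(len(word))
    for c in counts.values():
        out //= factorial(c)
    return out

# Brute-force agreement on short words:
assert arrangements("BOOK") == len(set(permutations("BOOK")))      # 12
assert arrangements("PEPPER") == len(set(permutations("PEPPER")))  # 60
print(arrangements("BOOKKEEPER"))  # 151200
```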


?PROBLEM 10.2 (tossing k dice n times). Pick k identical dice and roll them once. Then
repeat n times. What is the set of outcomes? What is the probability of the event that you
will get k sixes in some roll? Give the probability as a decimal number when k = 1, n = 4 and
when k = 2, n = 24.
Answer. A die has 6 faces with numbers 1 through 6 on them. Rolling k dice at once results in an outcome that can be described by x = (x_1, …, x_k), where x_i is what die i shows. Hence a full outcome for the n rolls is

    (x(1), x(2), …, x(n)) = (x_1(1), …, x_k(1); x_1(2), …, x_k(2); …; x_1(n), …, x_k(n)).

The set of outcomes is Ω = ({1, …, 6}^k)^n. To find the probability of the event we first need to define the probability P. In the absence of further information, we assume that the probability is uniform. Thus each outcome has probability 1/6^{kn}. Call the event under consideration A. Then A^c can be described as the set of all outcomes such that x(i) ≠ (6, …, 6) for all i = 1, …, n. The cardinality of A^c is (6^k − 1)^n. Hence

    P(A) = 1 − (6^k − 1)^n / 6^{kn}.

If k = 1, n = 4, then P(A) = 1 − 5⁴/6⁴ ≈ 0.518. If k = 2, n = 24, then P(A) = 1 − 35²⁴/36²⁴ ≈ 0.491.
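These are the classical de Méré numbers, easy to reproduce (the function name is ours):

```python
# Probability of getting "all k dice show six" at least once in n rolls:
# P = 1 - (6^k - 1)^n / 6^(k*n).
def p_all_sixes_at_least_once(k, n):
    return 1 - (6**k - 1) ** n / 6 ** (k * n)

print(round(p_all_sixes_at_least_once(1, 4), 3))   # 0.518
print(round(p_all_sixes_at_least_once(2, 24), 3))  # 0.491
```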

PROBLEM 10.3 (sum of dice rolls). Roll 3 dice and let S be the sum of the numbers observed.
Explain why S is a random variable and compute the probability that S = 9 and the probability
that S = 10.
Answer. An outcome is (x_1, x_2, x_3), where x_i is what die i shows. S assigns value x_1 + x_2 + x_3 to this outcome. Hence it is a function from the set Ω = {1, 2, 3, 4, 5, 6}³ into R, and so S is a random variable. To compute P(S = 9) we need to define P. We will assume that P is uniform, so the probability of each outcome (x_1, x_2, x_3) is 1/6³. Now let us see which outcomes belong to the event {S = 9}. The partitions of 9 into 3 parts (each between 1 and 6) are

    6+2+1   5+3+1   4+4+1   5+2+2   4+3+2   3+3+3
      6       6       3       3       6       1

and the number below each partition is the number of outcomes corresponding to that partition. (Recall that a partition of a number is a way to represent the number as a sum without paying attention to the order; but an outcome has order.) Hence there are 6+6+3+3+6+1 = 25 outcomes in the event {S = 9}, so P(S = 9) = 25/6³ ≈ 0.1157. Similarly, the partitions of 10 into 3 parts are 6+3+1, 6+2+2, 5+4+1, 5+3+2, 4+4+2, 4+3+3, contributing 6+3+6+6+3+3 = 27 outcomes, so P(S = 10) = 27/6³ = 0.125.
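A brute-force check of both probabilities over all 6³ equally likely outcomes:

```python
from itertools import product

# Enumerate all outcomes of rolling 3 dice and count sums 9 and 10.
outcomes = list(product(range(1, 7), repeat=3))
p9 = sum(1 for o in outcomes if sum(o) == 9) / len(outcomes)
p10 = sum(1 for o in outcomes if sum(o) == 10) / len(outcomes)

print(p9, p10)  # 25/216 ≈ 0.1157 and 27/216 = 0.125
assert round(p9, 4) == 0.1157 and p10 == 0.125
```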

?PROBLEM 10.4 (permutations in a row and on a circle). n ≥ 3 people are seated in a row.
What is the probability that Mr X sits next to Mrs Y? n ≥ 3 people are seated at a round table.
What is the probability that Mr X sits next to Mrs Y?
Answer. Let 1, 2, …, n be the names of the people, with X = 1 and Y = 2. An outcome ω = (ω_1, …, ω_n) in the first case is a permutation of the n people. There are n! outcomes. In the absence of any further information, we assume that P is uniform on the set of outcomes; that is, P assigns value 1/n! to each outcome. The event A = “1 sits next to 2” is the set of outcomes ω such that ω_i = 1, ω_{i+1} = 2 or ω_i = 2, ω_{i+1} = 1 for some i = 1, …, n − 1. To count the number |A| of outcomes in this event, we think of 1 as glued to 2 (which can be done in 2 ways), so that we now have n − 1 objects (which can be permuted in (n − 1)! ways). Hence |A| = 2(n − 1)!, and so

    P(A) = 2(n − 1)!/n! = 2/n.
If the people are sitting at a round table, then certain permutations must be identified. For example, if n = 4, one permutation is 2, 4, 1, 3. However, this must be identified with 4, 1, 3, 2 and with 1, 3, 2, 4 and with 3, 2, 4, 1. The number of permutations identified with any given one is n. Hence the set of all outcomes has cardinality n!/n = (n − 1)!. If B is the event that “1 sits next to 2” then |B| = 2(n − 2)!. Hence

    P(B) = 2(n − 2)!/(n − 1)! = 2/(n − 1).

Check: if n = 3 then P(A) = 2/3, P(B) = 1. The veracity of these two probabilities can be verified by inspection.

?PROBLEM 10.5 (dancing pairs). An even number n of people are dancing in pairs. Each of the people has a unique partner. What is the probability that everybody dances with his/her partner?
Answer. Here it is a little difficult to imagine what the set of outcomes is. One way to do this is to consider all n! permutations of the people: the people arrive in the room in a random order, and the doorman pairs each arriving person with the person arriving immediately after him/her. For concreteness, let n = 4, with partner pairs {1, 2} and {3, 4}. The set of all n! = 24 possible arrival patterns is

    1234  2134  3124  4123
    1243  2143  3142  4132
    1324  2314  3214  4213
    1342  2341  3241  4231
    1423  2413  3412  4312
    1432  2431  3421  4321
Consider the arrival pattern 1234. This results in two pairs, 12 and 34. But the same pairs are produced by the arrival patterns 1243, 2134, 2143, 3412, 3421, 4312, 4321. We glue these patterns together, identify them, and consider them as a single outcome. An outcome, then, is a collection of 8 arrival patterns. In total there are 3 outcomes:

    outcome 1 (pairs 12, 34):  1234, 1243, 2134, 2143, 3412, 3421, 4312, 4321
    outcome 2 (pairs 13, 24):  1324, 1342, 3124, 3142, 2413, 2431, 4213, 4231
    outcome 3 (pairs 14, 23):  1423, 1432, 4123, 4132, 2314, 2341, 3214, 3241
The event that “each person dances with his/her partner” contains only one outcome. Hence
its probability is 1/3. With an even number n of people, we have n! arrival patterns. Each arrival pattern is identified with 2^{n/2}(n/2)! arrival patterns (including itself). Indeed, we split each arrival pattern into n/2 boxes of size 2 each. Boxes can be permuted in (n/2)! ways, and changing the order of the 2 contents of a box does not change anything; this can be done in 2 ways for each of the n/2 boxes, which gives the extra factor 2^{n/2}. Hence

    number of outcomes = n!/(2^{n/2}(n/2)!).

Since the event of interest consists of one outcome, we have

    P(each person dances with his/her partner) = 2^{n/2}(n/2)!/n!.
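The identification argument can be checked by brute force (a sketch; we assume, as above, that the partner pairs are {1,2}, {3,4}, … and that the doorman pairs consecutive arrivals):

```python
from itertools import permutations
from math import factorial

# Count arrival patterns in which every consecutive pair of arrivals
# is a partner pair.
def count_all_matched(n):
    partners = [{2 * j + 1, 2 * j + 2} for j in range(n // 2)]
    hits = total = 0
    for arr in permutations(range(1, n + 1)):
        total += 1
        hits += all({arr[i], arr[i + 1]} in partners for i in range(0, n, 2))
    return hits, total

for n in (4, 6):
    hits, total = count_all_matched(n)
    pairings = factorial(n) // (2 ** (n // 2) * factorial(n // 2))
    assert hits * pairings == total  # probability = 1 / (number of pairings)
    print(n, hits, total)  # 8/24 = 1/3 for n = 4; 48/720 = 1/15 for n = 6
```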


?PROBLEM 10.6 (choosing a k-member committee from a set of n people–see Problem


8.10). Form a committee by picking any number of members from a set of n people. What is
the probability that the committee will have k members? (The possibility k = 0 is included.)
Do this problem in two ways.
Answer. The set of outcomes is the set of all subsets of {1, 2, …, n}. There are 2^n subsets. In other words, Ω = P({1, 2, …, n}), as in Example 8.10. The event “committee has k members” is the set of all subsets of {1, 2, …, n} of size k, so the event has size C(n, k), the binomial coefficient. Assuming uniform probability on Ω, we have

    P(committee has k members) = C(n, k)/2^n.

A second way to do this problem is by assigning label Y to a person if he/she is selected, or N if not. The possible assignments of Y/N labels are 2^n in number. The probability that the first k people have label Y and the rest N is 1/2^n. Similarly, the probability that any given k people have label Y and the rest N is 1/2^n. Since we can arrange k Y’s and n − k N’s in C(n, k) ways, the answer is the same.

PROBLEM 10.7 (probability that a random committee has k members). Form a committee by picking any number of members from a set of n people and by designating a person as head of the committee. What is the probability that the committee will have k members? Show that the identity

    Σ_{k=1}^n k C(n, k) = n 2^{n−1}

holds.
Answer. An outcome is a subset of {1, 2, …, n} together with a distinguished member designated as head. Thus there are n 2^{n−1} outcomes. The number of outcomes with k members is k C(n, k). Hence

    P(committee has k members) = k C(n, k)/(n 2^{n−1}).

If A_k is the event “committee has k members” then A_k ∩ A_ℓ = ∅ for k ≠ ℓ and Ω = ∪_{k=1}^n A_k. Hence, by (AXIOM TWO), we have 1 = P(Ω) = Σ_{k=1}^n P(A_k), and substituting the value of P(A_k) from the previous display gives the identity.
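The identity itself is easy to verify by direct computation (a sketch):

```python
from math import comb

# Verifying the committee identity sum_{k=1}^n k*C(n, k) = n*2^(n-1).
for n in range(1, 12):
    lhs = sum(k * comb(n, k) for k in range(1, n + 1))
    assert lhs == n * 2 ** (n - 1)
print("identity verified for n = 1, ..., 11")
```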

?PROBLEM 10.8 (Maxwell-Boltzmann model). In the so-called Maxwell-Boltzmann model¹ we have n particles that can be in any of m states. By state we mean, e.g., an energy level. The particles are distinct and any number of particles can be in the same state (such particles are known as Bosons). What is the probability that we have n_i particles at state i, for i = 1, …, m?
Answer. An outcome here is an assignment of a unique state to each particle. Thus the set of outcomes is

    Ω = {1, …, m}^n.

It has cardinality |Ω| = m^n. Exactly as in Problem 10.1 above, we see that the event “n_i particles at state i, for i = 1, …, m” has cardinality n!/(n_1!⋯n_m!). Hence, assuming that P is uniform on Ω,

    P(n_i particles at state i, for i = 1, …, m) = n!/(n_1!⋯n_m!) · 1/m^n,   n_1, …, n_m ≥ 0, n_1 + ⋯ + n_m = n.

?PROBLEM 10.9 (Bose-Einstein model). According to the Bose-Einstein model we have n indistinguishable particles that can be in any of m distinct states. What is the probability that we have n_i particles at state i, for i = 1, …, m?
Answer. Since the particles are indistinguishable, all that matters is how many particles we have in each state. We take
Answer. Since the particles are indistinguishable all that matters is how many particles we
have in each state. We take

Ω = {(n1 , n2 , . . . , nm ) : n1 , . . . , nm ∈ Z+ , n1 + · · · + nm = n}.
¹ In Physics, they say “Maxwell-Boltzmann statistics”. But the term “statistics” is wrong. More generally, there is an area of Physics called “Statistical Physics” but, again, the adjective “statistical” is wrong. It should be called “probabilistic physics” or, better yet, “stochastic physics”. But, at the time this terminology was coined, people hadn’t quite realized the difference between statistics and probability and hadn’t quite understood the mathematics of the latter or its foundations. Some people used to think that what we now know as theorems (meaning statements that follow from pure thought) were experimental results. In fact, four thousand years ago, what we now know as the “Pythagorean theorem” was taken as an experimental result and not as a fact that can be proven.
Recall that Z₊ = {0, 1, 2, 3, 4, 5, …}. We also have that Ω has cardinality C(n + m − 1, m − 1); see Item 7 in Appendix A. Assuming uniform probability on Ω we have

    P(n_i particles at state i, for i = 1, …, m) = 1/C(n + m − 1, m − 1),   n_1, …, n_m ≥ 0, n_1 + ⋯ + n_m = n.   (10.1)

?PROBLEM 10.10 (Fermi-Dirac model). According to the Fermi-Dirac model we have n indistinguishable particles that can be in any of m distinct states, with the caveat that we cannot have two particles at the same state (such particles are called Fermions; the electron is such a particle). What is the probability that we have n_i particles at state i, for i = 1, …, m?
Answer. It goes without saying that we can’t have more particles than states, so n ≤ m. In this case, an outcome is ω = (ω_1, …, ω_m) where ω_i = 1 or 0, depending on whether a particle can be found at state i or not. Hence

    Ω = {(ω_1, …, ω_m) : ω_i ∈ {0, 1} for i = 1, …, m, ω_1 + ⋯ + ω_m = n}.

The cardinality of Ω is clearly the same as the number of subsets of {1, …, m} of size n, that is, C(m, n). Assuming uniform probability on Ω we have

    P(n_i particles at state i, for i = 1, …, m) = 1/C(m, n),   n_1, …, n_m ∈ {0, 1}, n_1 + ⋯ + n_m = n.

PROBLEM 10.11 (Maxwell-Boltzmann, Fermi-Dirac and Bose-Einstein). Put the Maxwell-


Boltzmann, Fermi-Dirac and Bose-Einstein models on the same probability space.
Answer. Discuss in class. 
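To get a feel for Problem 10.11, one can tabulate the probabilities that the Maxwell-Boltzmann and Bose-Einstein models assign to the same occupancy vector; the tiny example below (n = 2 particles, m = 3 states) is our own illustration:

```python
from itertools import product
from math import comb, factorial

n, m = 2, 3  # toy example

def maxwell_boltzmann(occ):
    # distinct particles, uniform over the m^n state assignments
    k = factorial(n)
    for c in occ:
        k //= factorial(c)
    return k / m**n

def bose_einstein(occ):
    # indistinguishable particles, uniform over occupancy vectors
    return 1 / comb(n + m - 1, m - 1)

occs = [o for o in product(range(n + 1), repeat=m) if sum(o) == n]
for occ in occs:
    print(occ, maxwell_boltzmann(occ), bose_einstein(occ))

# Both are probability distributions, but they differ: Maxwell-Boltzmann
# gives (1,1,0) probability 2/9 and (2,0,0) probability 1/9, whereas
# Bose-Einstein gives every occupancy vector probability 1/6.
assert abs(sum(map(maxwell_boltzmann, occs)) - 1) < 1e-12
assert abs(sum(map(bose_einstein, occs)) - 1) < 1e-12
```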

?PROBLEM 10.12 (distinguishable balls in distinct boxes). Put n distinguishable balls into m distinct boxes at random. What is the probability that the first box contains k balls? In other words, let X be the random variable indicating the number of balls in the first box; find the distribution of X, that is, the probabilities P(X = k) for k = 0, 1, …. Also find the expectation of X.
Answer. The set of outcomes is Ω = {1, …, m}^n, that is, all functions from the set {1, …, n} of balls into the set {1, …, m} of boxes. As seen before, |Ω| = m^n. We are interested in the event

    A = “number of balls in the first box is k”.

If I ⊂ {1, …, n} has size k then we define

    A_I = {ω ∈ Ω : ω_i = 1 for i ∈ I and ω_j ≠ 1 for j ∉ I}.

Clearly, A_I ⊂ A. Moreover, A_I ∩ A_J = ∅ if I, J are two different sets of size k, and

    A = ∪_{I : |I|=k} A_I.

By (AXIOM TWO), P(A) = Σ_{I : |I|=k} P(A_I). But P(A_I) is the same for all I with |I| = k (it does not matter which balls go into box 1). Assuming that P is uniform on Ω, we have, with I = {1, …, k},

    P(A_I) = |A_I|/m^n = (m − 1)^{n−k}/m^n.

Indeed, A_I is the event that ω_1 = 1, …, ω_k = 1, whereas each of the other ω_j, j = k + 1, …, n, can have any of the m − 1 values 2, 3, …, m. Therefore,

    P(A) = Σ_{I : |I|=k} (m − 1)^{n−k}/m^n = C(n, k) (m − 1)^{n−k}/m^n,

because all terms in the sum are equal and because the number of subsets I of {1, …, n} with size k is C(n, k).

The random variable X is given by

    X = Σ_{i=1}^n 1_{A_i},   (10.2)

where 1_{A_i} is the indicator (random variable) of the event A_i = {ω ∈ Ω : ω_i = 1}. Since P(ω_i = 1) = 1/m, we have

    E(1_{A_i}) = 1/m.

By the linearity of the expectation,

    E(X) = Σ_{i=1}^n E(1_{A_i}) = n/m.

Note: We will later show how to compute the same probability using the concept of independence. Indeed, we will show that ω_1, …, ω_n are independent random variables. The fact that ω_i is a random variable is obvious from the fact that it represents the box assigned to the i-th ball, that is, it represents the function

    (ω_1, …, ω_n) ↦ ω_i.

The fact that these random variables are independent is a consequence of the fact that P is uniform on Ω. To understand this note, look forward to the chapter on independence.
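A quick check that the distribution just derived is a genuine pmf with mean n/m (the values n = 5, m = 3 are illustrative):

```python
from math import comb

# Balls-in-boxes distribution: P(X = k) = C(n, k) (m-1)^(n-k) / m^n.
n, m = 5, 3
pmf = [comb(n, k) * (m - 1) ** (n - k) / m**n for k in range(n + 1)]
mean = sum(k * p for k, p in enumerate(pmf))

assert abs(sum(pmf) - 1) < 1e-12   # probabilities sum to 1
assert abs(mean - n / m) < 1e-12   # E(X) = n/m
print([round(p, 4) for p in pmf])
```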
PROBLEM 10.13 (expected number of particles at a given state according to Bose-Einstein). According to the Bose-Einstein model (where particles are indistinguishable), what is the probability that there are k particles at state 1? Let X be the random variable indicating the number of particles at state 1. Find the expectation of X.
Answer. As explained above, we let P be the uniform probability measure on

    Ω = {(n_1, n_2, …, n_m) : n_1, …, n_m ∈ Z₊, n_1 + ⋯ + n_m = n}.   (10.3)

Since Ω contains C(n + m − 1, m − 1) elements, P assigns value 1/C(n + m − 1, m − 1) to each element. We are interested in the event

    A = “number of particles in state 1 is k” = {(n_1, n_2, …, n_m) ∈ Ω : n_1 = k}
      = {(k, n_2, …, n_m) : n_2, …, n_m ∈ Z₊, n_2 + ⋯ + n_m = n − k}.

Using the logic we used to compute the cardinality of Ω (see Problem 10.9 above and Item 7 in Appendix A), we have |A| = C(n − k + m − 2, m − 2): simply replace n by n − k and m by m − 1. Hence

    P(X = k) = P(“number of particles in state 1 is k”) = C(n − k + m − 2, m − 2)/C(n + m − 1, m − 1).

If you use the definition of E(X) to compute it, you have to evaluate the rather hard sum Σ_{k=0}^n k P(X = k). Instead, take into account the symmetry in the problem. Namely, the formula (10.1) for the probability does not change if we permute (n_1, …, n_m). Hence, if we let

    X_i(n_1, …, n_m) = n_i,   (10.4)

the number of particles at state i (so that X_1 = X), we have

    E(X_1) = ⋯ = E(X_m).

On the other hand,

    X_1 + ⋯ + X_m = n.

Hence n = E(X_1) + ⋯ + E(X_m) = m E(X), so

    E(X) = n/m.
PROBLEM 10.14 (seating with avoidance). Show that the number of ways that n people can sit in a row of K seats, if no two people are allowed to sit next to each other, is C(K − n + 1, n). We are looking for distinct patterns; that is, we don’t care about the identities of the people or the chairs.
Answer. To understand things, let’s take n = 2 people and K = 4 chairs. If ■ represents an occupied seat and □ an unoccupied one, the patterns we get are

    ■□■□   ■□□■   □■□■

In general, place the K − n empty seats in a row; they create K − n + 1 gaps (including the two ends), and a valid pattern is obtained by choosing n of these gaps and putting one occupied seat in each. Hence there are C(K − n + 1, n) patterns; in the example, C(3, 2) = 3.
PROBLEM 10.15 (sampling without replacement). An urn contains 12 red and 8 blue balls. Balls of the same color are identical. We pick 4 balls at random. What is the probability that we get 2 balls of each color? What is the probability that we get 1 red and 3 blue? What is the probability that we get 3 red and 1 blue?
Answer. Call the balls 1, 2, …, 20. Think of the first 12 as being red and of the last 8 as being blue, so R = {1, …, 12}, B = {13, …, 20}. Take as Ω the set of all subsets ω of {1, 2, …, 20} of size 4. Then |Ω| = C(20, 4). The event A = “get 2 balls of each color” is the set of all ω ∈ Ω such that |ω ∩ R| = 2 and |ω ∩ B| = 2. We can pick 2 elements of R in C(12, 2) ways and 2 of B in C(8, 2) ways. Hence, assuming that P is uniform on Ω,

    P(get 2 balls of each color) = C(12, 2) C(8, 2)/C(20, 4) ≈ 0.38.

Similarly,

    P(get 1 red and 3 blue) = C(12, 1) C(8, 3)/C(20, 4) ≈ 0.14,   P(get 3 red and 1 blue) = C(12, 3) C(8, 1)/C(20, 4) ≈ 0.36.
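These hypergeometric probabilities are easy to reproduce (a sketch; the helper `p` is ours):

```python
from math import comb

# Hypergeometric probabilities for Problem 10.15: 12 red, 8 blue, sample of 4.
def p(red, blue):
    return comb(12, red) * comb(8, blue) / comb(20, 4)

print(round(p(2, 2), 4), round(p(1, 3), 4), round(p(3, 1), 4))
# 0.3814 0.1387 0.3633
assert abs(sum(p(r, 4 - r) for r in range(5)) - 1) < 1e-12  # pmf sums to 1
```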

?PROBLEM 10.16 (sampling without replacement, general case). An urn contains N_i balls of color i, i = 1, …, c (c is the number of colors). Balls of the same color are identical. We pick n balls at random. What is the probability that we get n_1 balls of color 1, n_2 of color 2, etc.?
Answer. Let N = N_1 + ⋯ + N_c be the total number of balls. Consider {1, 2, …, N} as the set of balls. Let R_1 = {1, …, N_1}, R_2 = {N_1 + 1, …, N_1 + N_2}, etc. Thus we split the set of balls into a set R_1 of balls of color 1, a set R_2 of balls of color 2, etc. A sample (an outcome) is a set ω of balls of size n. Let Ω be the set of subsets of balls of size n. The event of interest is thus

    A = {ω ∈ Ω : |ω ∩ R_i| = n_i, i = 1, …, c}.

Assuming P to be the uniform probability measure on Ω, we have that P assigns value P{ω} = 1/|Ω| = 1/C(N, n) to each ω. Therefore,

    P(A) = Σ_{ω∈A} P{ω} = |A|/|Ω| = C(N_1, n_1) C(N_2, n_2) ⋯ C(N_c, n_c)/C(N, n).   (10.5)


?PROBLEM 10.17 (sampling without replacement, another view). Recall the multinomial coefficient from Appendix B, 3(d); write C(n; n_1, …, n_c) = n!/(n_1!⋯n_c!). Show that the probability of (10.5) can be written as

    C(N_1, n_1) C(N_2, n_2) ⋯ C(N_c, n_c)/C(N, n) = C(n; n_1, …, n_c) C(N − n; N_1 − n_1, …, N_c − n_c)/C(N; N_1, …, N_c),

first by trivial algebra and second by setting Problem 10.16 on a different sample space.
Answer. We have

    C(N_1, n_1) ⋯ C(N_c, n_c)/C(N, n)
      = [N_1!/(n_1!(N_1 − n_1)!)] ⋯ [N_c!/(n_c!(N_c − n_c)!)] / [N!/(n!(N − n)!)]
      = [n!/(n_1!⋯n_c!)] · [(N − n)!/((N_1 − n_1)!⋯(N_c − n_c)!)] / [N!/(N_1!⋯N_c!)]
      = C(n; n_1, …, n_c) C(N − n; N_1 − n_1, …, N_c − n_c)/C(N; N_1, …, N_c).

This suggests that we can use a sample space Ω′ of size equal to the denominator of the last fraction. By Appendix B, 3(d), we can take as Ω′ the set of arrangements of N objects when N_1 are identical, i.e. of color 1, N_2 of color 2, etc. Let {1, …, N} be the set of objects (balls, in our case). An arrangement ω = (ω_1, …, ω_N) is an assignment of a color ω_i ∈ {1, …, c} to ball i. Let S = {1, …, n} be the set of balls we select and S^c those we do not select. Then the event of interest, say A′, is the set of all ω such that, for each i = 1, …, c, the number of balls j ∈ S with ω_j = i equals n_i. We can easily see that this event has size equal to the numerator of the last fraction in the display above. Assuming that P′ is uniform on Ω′, we have that

    P′(A′) = |A′|/|Ω′| = C(n; n_1, …, n_c) C(N − n; N_1 − n_1, …, N_c − n_c)/C(N; N_1, …, N_c).



PROBLEM 10.18 (sampling without replacement, comparison). Expand on the differences


between the probability spaces (Ω, E , P) and (Ω0 , E 0 , P0 ) of Problems 10.16 and 10.17, where E
is the set of all subsets of Ω and E 0 the set of all subsets of Ω0 , and what it really means that
we obtained the same answer on two different spaces. 

PROBLEM 10.19 (matching socks). You have 8 red socks, 7 blue and 5 yellow. Pick 2 socks at random. What is the probability that they match in color?
Answer. If A_r, A_b, A_y are the events that the two selected socks are red, blue, yellow, respectively, then, by (AXIOM TWO),

    P(match in color) = P(A_r) + P(A_b) + P(A_y).

From Problem 10.16, we have

    P(A_r) = C(8, 2)/C(20, 2),   P(A_b) = C(7, 2)/C(20, 2),   P(A_y) = C(5, 2)/C(20, 2).

So P(match in color) = (28 + 21 + 10)/190 = 59/190 ≈ 0.31.
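A brute-force check by listing all C(20, 2) pairs of socks:

```python
from itertools import combinations
from math import comb

# 8 red, 7 blue, 5 yellow socks; count matching pairs among all pairs.
socks = ["r"] * 8 + ["b"] * 7 + ["y"] * 5
pairs = list(combinations(range(20), 2))
matches = sum(1 for i, j in pairs if socks[i] == socks[j])

print(matches, len(pairs))  # 59 190
assert matches == comb(8, 2) + comb(7, 2) + comb(5, 2)
assert len(pairs) == comb(20, 2)
```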

PROBLEM 10.20 (sample constituency). An urn contains 12 balls: 3 red, 5 blue and 4 green. We select 6 balls at random in sequence. What is the probability that the first two are red, the next three blue and the sixth green?
Answer. The probability of the event A that the sample contains 2 red, 3 blue and 1 green is, as seen in Problem 10.17,

    P(A) = C(6; 2, 3, 1) C(12 − 6; 3 − 2, 5 − 3, 4 − 1)/C(12; 3, 5, 4) = 60 · 60/27,720 = 10/77 ≈ 0.12987.

But we are interested in the probability of the event

    A_{r,r,b,b,b,g} = “first two are red, the next three blue and the sixth green”.

Note that the events

    A_{r,r,b,b,g,b},  A_{b,b,b,g,r,r},  …,   (10.6)

each with an obvious meaning, all have the same probability (symmetry!). On the other hand, these events are pairwise disjoint and their union is A. Hence, by (AXIOM TWO),

    P(A) = ℓ P(A_{r,r,b,b,b,g}),

where ℓ is the number of events in (10.6). But ℓ is the number of arrangements of 6 objects in a row of which 2 are red, 3 are blue and 1 is green, so

    ℓ = C(6; 2, 3, 1) = 60.

Hence

    P(A_{r,r,b,b,b,g}) = (1/60) P(A) = 60/27,720 ≈ 0.00216.

Note: When we learn about conditional probability (see the relevant chapter) we shall have another
way to compute P(A_{r,r,b,b,b,g}). The rationale goes like this: the first ball is red with probability
3/12; the second is red with probability 2/11; the third is blue with probability 5/10, etc.
Multiplying these numbers gives
\[
\frac{3}{12}\cdot\frac{2}{11}\cdot\frac{5}{10}\cdot\frac{4}{9}\cdot\frac{3}{8}\cdot\frac{4}{7} = 0.00216.
\]
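Both routes to the answer, the multinomial count and the sequential product previewed in the note, can be checked numerically. The following Python sketch is an addition of mine; exact arithmetic via `fractions.Fraction` removes any rounding doubt.

```python
from fractions import Fraction
from math import comb, prod

def multinomial(n, ks):
    # n! / (k_1! ... k_c!) as an iterated product of binomial coefficients
    out, rest = 1, n
    for k in ks:
        out *= comb(rest, k)
        rest -= k
    return out

# P(A): a sample of 6 from 3 red, 5 blue, 4 green has 2 red, 3 blue, 1 green
p_A = Fraction(multinomial(6, [2, 3, 1]) * multinomial(6, [1, 2, 3]),
               multinomial(12, [3, 5, 4]))

# ordered event r,r,b,b,b,g via the sequential (conditional) product
num = (3, 2, 5, 4, 3, 4)     # favourable balls before each draw
den = (12, 11, 10, 9, 8, 7)  # balls left in the urn
p_ordered = prod(Fraction(a, b) for a, b in zip(num, den))
```

That `p_A` equals 60 times `p_ordered` is exactly the symmetry argument used in the text.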


PROBLEM 10.21 (poker probabilities). A deck of playing cards is a collection of N = 52


cards, each of which is labeled with a pair (v, s) where the value v is an element of the set
V = {A, 2, 3, 4, 5, 6, 7, 8, 9, 10, J, Q, K}² and the suit s is an element of the set F = {S, C, D, H}
(standing for Spades, Clubs, Diamonds and Hearts, respectively). A unique element of V × F
is assigned to each card. Check: 13 × 4 = 52. We select n = 5 cards at random and ask for the
probabilities of various events determined by what values appear in the sample and how
many cards have the same value. Each of these events corresponds to a partition of 5 (see
Appendix B, Section 7). In reading the table below, interpret the v, v′, . . . appearing on the
same line as “some v, some v′, . . ., all distinct”.
partition of 5   meaning                                   colloquial name   event name
5                all cards of same value                   cheating          A_5
4+1              4 of value v and 1 of value v′            four of a kind    A_{4,1}
3+2              3 of value v and 2 of value v′            full house        A_{3,2}
3+1+1            3 of value v, 1 of v′ and 1 of v″         three of a kind   A_{3,1,1}
2+2+1            2 of value v, 2 of v′, and 1 of v″        two pairs         A_{2,2,1}
2+1+1+1          2 of value v, 1 of v′, 1 of v″, 1 of v‴   one pair          A_{2,1,1,1}
1+1+1+1+1        all distinct values                       nil               A_{1,1,1,1,1}
Determine the probabilities of all the above events and verify that they add up to 1. Why is
that?
Answer. Let Π = {1, . . . , 52} be the set of cards and think of g = (v, s) as a function on Π, for
example, g(1) = (A, S), g(2) = (2, S), g(3) = (3, S), etc., so that g : Π → V × F is a bijection.
Equivalently, you can imagine the cards laid out in a 4 × 13 array, one row per suit, with
columns labeled A, 2, 3, 4, 5, 6, 7, 8, 9, 10, J, Q, K.

Our sample space is taken to be Ω = P(Π), the set of all subsets of Π. Assuming P to be
uniform on Ω we compute the probability of full house A3,2 by counting the number of its
² A is called ace and stands for 1, J is jack and stands for 11, Q is queen and stands for 12 and K is king and
stands for 13. The card with label (8, S) is called “8 of spades”; the card with label (A, D) is called “ace of diamonds”;
and so on.

elements. We wish the sample to contain 2 different values. The first value corresponds to
any of the 13 columns, the second to any of the remaining 12. We thus can pick two distinct
columns in 13 × 12 ways. Having selected the first value, we assign suits to the three cards of
that value in $\binom{4}{3}$ ways and then we assign suits to the two cards of the other value
in $\binom{4}{2}$ ways. Hence
\[
|A_{3,2}| = 13 \cdot 12 \cdot \binom{4}{3}\binom{4}{2} = 3744.
\]
Since
\[
|\Omega| = \binom{52}{5} = 2{,}598{,}960,
\]
we find
\[
P(A_{3,2}) = \frac{3744}{2{,}598{,}960}.
\]
Arguing in a similar manner, we find
\[
P(A_5) = 0,
\]
\[
P(A_{4,1}) = \frac{13 \cdot 12 \cdot \binom{4}{4}\binom{4}{1}}{|\Omega|} = \frac{624}{2{,}598{,}960},
\]
\[
P(A_{3,2}) = \frac{13 \cdot 12 \cdot \binom{4}{3}\binom{4}{2}}{|\Omega|} = \frac{3{,}744}{2{,}598{,}960},
\]
\[
P(A_{3,1,1}) = \frac{13 \cdot \frac{12\cdot 11}{2} \cdot \binom{4}{3}\binom{4}{1}\binom{4}{1}}{|\Omega|} = \frac{54{,}912}{2{,}598{,}960},
\]
\[
P(A_{2,2,1}) = \frac{\frac{13\cdot 12}{2} \cdot 11 \cdot \binom{4}{2}\binom{4}{2}\binom{4}{1}}{|\Omega|} = \frac{123{,}552}{2{,}598{,}960},
\]
\[
P(A_{2,1,1,1}) = \frac{13 \cdot \frac{12\cdot 11\cdot 10}{3!} \cdot \binom{4}{2}\binom{4}{1}\binom{4}{1}\binom{4}{1}}{|\Omega|} = \frac{1{,}098{,}240}{2{,}598{,}960},
\]
\[
P(A_{1,1,1,1,1}) = \frac{\frac{13\cdot 12\cdot 11\cdot 10\cdot 9}{5!} \cdot 4^5}{|\Omega|} = \frac{1{,}317{,}888}{2{,}598{,}960}.
\]
Note that
\[
624 + 3{,}744 + 54{,}912 + 123{,}552 + 1{,}098{,}240 + 1{,}317{,}888 = 2{,}598{,}960,
\]

so the sum of the probabilities above equals 1. This must be so because the 7 events above
are pairwise disjoint and their union is Ω (AXIOM TWO). □
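The seven counts can be re-derived in a few lines of Python. This check is my own addition, not part of the notes; the dictionary entries simply repeat the counting arguments above, and the colloquial names are used only as labels.

```python
from math import comb

W = comb(52, 5)  # number of 5-card hands: 2,598,960

counts = {
    "four of a kind":  13 * 12 * comb(4, 4) * comb(4, 1),
    "full house":      13 * 12 * comb(4, 3) * comb(4, 2),
    "three of a kind": 13 * comb(12, 2) * comb(4, 3) * comb(4, 1) ** 2,
    "two pairs":       comb(13, 2) * 11 * comb(4, 2) ** 2 * comb(4, 1),
    "one pair":        13 * comb(12, 3) * comb(4, 2) * comb(4, 1) ** 3,
    "nil":             comb(13, 5) * 4 ** 5,
}
# "cheating" (all five cards of the same value) is impossible: 0 hands.
probs = {name: c / W for name, c in counts.items()}
```

The counts sum to C(52, 5), which is the check that the seven disjoint events cover Ω.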
PROBLEM 10.22 (temperature affects energy levels). Let S = {x1 , . . . , xm } be a finite set of
“states” of a physical system, where the xi are distinct real numbers, representing, say, energy
levels with x₁ < x₂ < · · · < x_m. The uniform probability assigns value 1/m to each state. Here
is another way to assign a (non-uniform) probability. Let β ≥ 0 and let the probability assigned
to state x_i be p_i = p_i(β) ∝ e^{−βx_i}. Let Z(β) := Σ_{i=1}^m e^{−βx_i}. Let P_β be the
probability measure defined on (S, P(S)) determined by these probabilities. Thus P₀ is the
uniform probability and
\[
P_\beta(A) = \sum_{i:\, x_i \in A} e^{-\beta x_i}/Z(\beta), \qquad A \in \mathcal{P}(S).
\]

If X is a random variable on (S, P(S), P_β), let E_β(X) denote its expectation under P_β. Define
the energy random variable W : S → R by W(x_i) = x_i. Show that E_β(W) = −(d/dβ) log Z(β). Also
show that P_β converges to P₀ as β → 0 and that P_β converges to δ_{x₁} as β → ∞, where δ_{x₁} is the
probability measure assigning value 1 to x₁ and 0 to all other states (recall definition (8.5)). In
physics, the reciprocal T = 1/β of β is called temperature. If the temperature goes to infinity then
all states are equally likely, while at 0 temperature only the minimal energy state is possible.
Answer. We have
\[
E_\beta(W) = \sum_{i=1}^m x_i\, p_i(\beta) = \sum_{i=1}^m x_i \frac{e^{-\beta x_i}}{Z(\beta)}.
\]
But (d/dβ) e^{−βx} = −x e^{−βx} for all x ∈ R. So
\[
\frac{d}{d\beta} \log Z(\beta) = \frac{1}{Z(\beta)} \frac{d}{d\beta} Z(\beta)
= \frac{1}{Z(\beta)} \sum_i (-x_i) e^{-\beta x_i} = -E_\beta(W).
\]
Since p_i(β) is a continuous function of β ∈ [0, ∞), we have that lim_{β→0} p_i(β) = p_i(0) = 1/Z(0) = 1/m,
which is indeed the uniform probability measure P₀. To find the limit as β → ∞, write
\[
p_1(\beta) = \frac{1}{\sum_{i=1}^m e^{-\beta(x_i - x_1)}} = \frac{1}{1 + \sum_{i=2}^m e^{-\beta(x_i - x_1)}}
\]
and, since x_i − x₁ > 0 for i = 2, . . . , m, we have lim_{β→∞} e^{−β(x_i−x₁)} = 0 for all i = 2, . . . , m, so
\[
\lim_{\beta\to\infty} p_1(\beta) = 1.
\]
Since p₁(β) + p₂(β) + · · · + p_m(β) = 1 and m is a finite number that does not depend on β, we
have lim_{β→∞} p_i(β) = 0 for i = 2, . . . , m. Hence
\[
P_\beta(A) = \sum_{i:\, x_i \in A} p_i(\beta) \to
\begin{cases} 1 & \text{if } x_1 \in A, \\ 0 & \text{otherwise.} \end{cases}
\]
The right-hand side is, precisely, equal to δ_{x₁}(A). □
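The two limits, and the identity E_β(W) = −(d/dβ) log Z(β), can be illustrated numerically. The sketch below is mine (the three energy levels are arbitrary toy values, and the derivative is approximated by a central difference, so this is an illustration rather than a proof).

```python
import math

def boltzmann(xs, beta):
    # p_i(beta) proportional to exp(-beta * x_i), normalised by Z(beta)
    w = [math.exp(-beta * x) for x in xs]
    Z = sum(w)
    return [wi / Z for wi in w]

xs = [1.0, 2.0, 5.0]            # toy energy levels x1 < x2 < x3
p_hot = boltzmann(xs, 1e-9)     # beta -> 0: tends to the uniform measure
p_cold = boltzmann(xs, 50.0)    # beta -> infinity: concentrates on x1

# numerical check of E_beta(W) = -(d/dbeta) log Z(beta)
beta, h = 0.7, 1e-6
E = sum(x * p for x, p in zip(xs, boltzmann(xs, beta)))
logZ = lambda b: math.log(sum(math.exp(-b * x) for x in xs))
dlogZ = (logZ(beta + h) - logZ(beta - h)) / (2 * h)   # central difference
```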

?PROBLEM 10.23 (matching problem–see Problem 10.5). In a ballroom, there are n couples,
n men and n women. Each man is paired with a woman at random. What is the probability
that some man dances with his own wife?
Answer. The sample space Ω here is the set of pairings between men and women. We can
assume that the set of men is {1, . . . , n} and the set of women is {1, . . . , n} and that woman i is
man i's wife. We can describe a pairing w by
\[
w = \begin{pmatrix} 1 & 2 & 3 & \cdots & n \\ w_1 & w_2 & w_3 & \cdots & w_n \end{pmatrix},
\]

where the top row corresponds to men and the bottom to women. Clearly, each woman
appears only once in the second row. Hence Ω has n! elements. The phrase “each man is

paired with a woman at random” means that each element of Ω has probability 1/n!. Let Fi be
the event that man i dances with his own wife. We are interested in the event
\[
\bigcup_{i=1}^n F_i = \text{some man dances with his own wife}.
\]
Note that F_i is the set of pairings such that w_i = i. But then F_i has (n − 1)! elements, so
\[
P(F_i) = \frac{(n-1)!}{n!} = \frac{1}{n}.
\]
Now take two different men i and j. The event F_i ∩ F_j has (n − 2)! elements,³ hence
\[
P(F_i \cap F_j) = \frac{(n-2)!}{n!} = \frac{1}{n(n-1)}.
\]

The pattern is clear. We now use the inclusion-exclusion formula to calculate the probability
of the event ∪_{i=1}^n F_i that at least one man dances with his own wife. If I ⊂ {1, . . . , n} then
\[
P\Big(\bigcap_{i \in I} F_i\Big) = \frac{1}{(n)_{|I|}}.
\]
So
\[
P\Big(\bigcup_{i=1}^n F_i\Big) = \sum_{\emptyset \neq I \subset \{1,\dots,n\}} (-1)^{|I|-1} \frac{1}{(n)_{|I|}}.
\]
Since there are $\binom{n}{k}$ subsets I of {1, . . . , n} of size k, we further have
\[
P\Big(\bigcup_{i=1}^n F_i\Big) = \sum_{k=1}^n \binom{n}{k} (-1)^{k-1} \frac{1}{(n)_k}
= \sum_{k=1}^n (-1)^{k-1} \frac{1}{k!}.
\]

This is the answer. We can go a bit further. Since lim_{n→∞} Σ_{k=0}^n x^k/k! = e^x, taking x = −1 we have
\[
P(\text{some man dances with his own wife}) \approx \lim_{n\to\infty} \sum_{k=1}^n (-1)^{k-1}\frac{1}{k!}
= 1 - e^{-1} \approx 0.632.
\]

?PROBLEM 10.24 (number of matchings–see Problems 10.5 and 10.23). In the previous
problem find the probability of the event

exactly m pairs are formed.

³ Another way to describe an outcome is to say that it is a one-to-one function from the set {1, . . . , n} into itself.
Hence Ω is the set of all bijections on {1, . . . , n}. The event F_i is the set of all bijections that have i as a fixed point.

Answer. We already know that⁴
\[
p_n(0) = P(\text{no pairs are formed}) = 1 - P\Big(\bigcup_{i=1}^n F_i\Big)
= 1 - \sum_{k=1}^n \frac{(-1)^{k-1}}{k!} = \sum_{k=0}^n \frac{(-1)^k}{k!}.
\]

To find the probability that exactly m pairs are formed, we first give a name to this event, say

Gn,m = exactly m pairs are formed ⊂ Ω,

and then compute its cardinality. Since we can write
\[
G_{n,m} = \bigcup_{|I|=m} G_{n,m,I},
\]

where, given a set of men I of size m,

Gn,m,I = the men whose names are in I dance with their wives but the rest do not,

and since G_{n,m,I} ∩ G_{n,m,J} = ∅ if |I| = |J| = m but I ≠ J,⁵ we have
\[
|G_{n,m}| = \sum_{|I|=m} |G_{n,m,I}|.
\]
But all terms in the sum are equal, by symmetry. There are $\binom{n}{m}$ terms in the sum, hence
\[
|G_{n,m}| = \binom{n}{m} |G_{n,m,I}|, \quad \text{with } I = \{1, \dots, m\}.
\]

How many pairings w do we have so that w₁ = 1, . . . , w_m = m, and w_j ≠ j for j = m + 1, . . . , n?
We have exactly as many pairings of the set of men {m + 1, . . . , n} with their wives such that
none of these men dances with his wife. Therefore, with I = {1, . . . , m},
\[
|G_{n,m,I}| = |G_{n-m,0}| = (n-m)!\, p_{n-m}(0),
\]
and so we finally have
\[
P(G_{n,m}) = \frac{1}{n!}|G_{n,m}| = \frac{1}{n!}\binom{n}{m}(n-m)!\, p_{n-m}(0)
= \frac{1}{m!}\, p_{n-m}(0) = \frac{1}{m!}\sum_{j=0}^{n-m}\frac{(-1)^j}{j!}. \tag{10.7}
\]
Notice that
\[
\lim_{n\to\infty} P(G_{n,m}) = \frac{1}{m!}\, e^{-1}, \qquad m = 0, 1, 2, \dots \tag{10.8}
\]

⁴ Because we speak English and therefore understand that the negation of the sentence “no man dances with
his own wife” is the sentence “some man dances with his own wife”.
⁵ If we take two different sets of men of size m each, then there is a man that belongs to one set but not to the
other; hence this man dances with his wife and does not dance with his wife. This is an absurd sentence, and this
explains why we obtain ∅.
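Formula (10.7) can be verified by brute force for a small n, by listing all n! pairings and counting fixed points. The short Python check below is my addition (n = 7 keeps the enumeration fast; the variable names are arbitrary):

```python
import math
from itertools import permutations

n = 7
# empirical distribution of the number of fixed points over all n! pairings
fixed = [0] * (n + 1)
for w in permutations(range(n)):
    fixed[sum(i == wi for i, wi in enumerate(w))] += 1

def p_exact(n, m):
    # formula (10.7): P(exactly m pairs) = (1/m!) * sum_{j=0}^{n-m} (-1)^j / j!
    return sum((-1) ** j / math.factorial(j) for j in range(n - m + 1)) / math.factorial(m)
```

The enumeration agrees with (10.7) for every m, and the probabilities sum to 1, as they must.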
Chapter 11

Conditional probability and


independence

Conditioning on an event means to know that the event


occurs (has occurred, will occur–we attach no time to this).
How does the probability change if we have some
knowledge? That’s what we explain. We also explain the
celebrated Bayes’ theorem which is a total triviality: it
requires 10 seconds to prove it. But it is very useful. If
knowledge of an event does not influence another then
the two events are independent. We make this precise
because we must. I am aware that you may have learned this
concept incorrectly in a previous module, so make sure to
learn it well now.

11.1 Motivation and properties


If P is a probability on a set Ω then we define the conditional probability of an event A given
another event B as the probability that A occurs given that B occurs. We denote this by the
symbol
P(A|B).
In fact, the phrase “given that” is often referred to as “conditional on” which means “under
the condition that”. To find out the correct formula of this conditional probability, we solve an
exercise.

PROBLEM 11.1 (motivating conditional probability). In tossing a fair coin three times
uniformly at random, find the probability that we obtain two or more heads given that the
first toss lands tails.
Answer. The set of configurations here is

Ω = {HHH, HHT, HTH, THH, HTT, THT, TTH, TTT}.


where, e.g., HTH means that we first toss a head (H) then a tail (T) and then a head. There are
8 configurations. The assumption tells us to assign probability 1/8 to each configuration, that
is, P is the uniform probability measure on Ω. Let A be the event that we have two or more
heads. Explicitly,
A = {HHH, HHT, HTH, THH}.
So P(A) = 4/8 = 1/2. But if we know that the first toss lands tails then we can exclude the first
three configurations in A and are left with only one: THH. The probability of this should be 1/4
(one out of four configurations). Let us call B the event that the first toss lands tails. We have
shown that

P(A|B) = probability that we obtain two or more heads given that the first toss
lands tails = 1/4.

We now find another way to obtain this number. You see, when we excluded the configurations
in A that are not in B we really considered the set A ∩ B which has only 1 element and thus
P(A ∩ B) = 1/8.
Since B = {THH, THT, TTH, TTT},
P(B) = 4/8.
Hence
\[
\frac{P(A \cap B)}{P(B)} = \frac{1/8}{4/8} = \frac{1}{4}.
\]

Motivated by this problem (as well as a couple of centuries worth of experience), we
define the conditional probability of event A given event B (or conditional on event B) by
the formula
\[
P(A|B) := \frac{P(A \cap B)}{P(B)},
\]
in general, not just for uniform probability measures. Experience shows that this definition
works.
PROBLEM 11.2 (a simple problem on conditioning). Let P be the uniform probability on
the set of pairs (i, j) of positive integers between 1 and n. Let B be the event that j > i. Let
A_k be the event that the largest of i, j is equal to k. Compute the probability P(A_k|B).
Answer. The sample space Ω is the set of all pairs (i, j). There are n² such pairs. Hence the
probability of every pair is 1/n² (because we used the adjective “uniform”). The event B
contains all pairs (i, j) with j > i. If we subtract from n² the n pairs (i, i) we get n² − n. Halve
this to get that the cardinality of B is (n² − n)/2. We now need to compute the cardinality of
the event A_k ∩ B. This contains all pairs (i, j) with i < j = k, that is, all pairs (i, k) where i < k,
that is, A_k ∩ B = {(1, k), (2, k), . . . , (k − 1, k)}. So it has cardinality k − 1. Therefore
\[
P(A_k|B) = \frac{P(A_k \cap B)}{P(B)} = \frac{|A_k \cap B|}{|B|} = \frac{2(k-1)}{n^2 - n}.
\]
Note that P(A₁|B) = 0 (why?) and P(A_n|B) = 2/n. □
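The counting argument can be confirmed by enumerating the n² pairs directly. The small Python sketch below is my addition (n = 10 is an arbitrary choice); it computes P(A_k|B) by counting, exactly as in the answer.

```python
from fractions import Fraction

n = 10
pairs = [(i, j) for i in range(1, n + 1) for j in range(1, n + 1)]
B = [(i, j) for (i, j) in pairs if j > i]   # the event j > i

def p_cond(k):
    # P(A_k | B) = |A_k ∩ B| / |B| under the uniform measure
    hits = sum(1 for (i, j) in B if max(i, j) == k)
    return Fraction(hits, len(B))
```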
Just as we have some properties of (unconditional) probability, we now look at some
properties for conditional probability.

Properties of conditional probability


1. The denominator cannot be zero. P(A|B) is only defined when P(B) ≠ 0.

2. Conditional probability is a probability measure. Fix an event B. Then the function that
assigns probability P(A|B) to an arbitrary event A is itself a probability measure.
Here is why: Fix event B with P(B) > 0 and define the function

PB (A) = P(A|B), A ∈ E, (11.1)

that is, P_B takes as its argument all events. First, we have P_B(Ω) = P(Ω ∩ B)/P(B) =
P(B)/P(B) = 1. So (AXIOM ONE) is satisfied. Next, let A₁, A₂, . . . be a sequence of events.
Since
\[
B \cap \bigcup_{n=1}^\infty A_n = \bigcup_{n=1}^\infty (B \cap A_n),
\]
we have
\[
P_B\Big(\bigcup_{n=1}^\infty A_n\Big) = \frac{1}{P(B)}\, P\Big(\bigcup_{n=1}^\infty (B \cap A_n)\Big).
\]

Assume now that the events A₁, A₂, . . . are pairwise disjoint. Then the events B ∩ A₁, B ∩ A₂, . . .
are pairwise disjoint. By (AXIOM TWO) for P we have
\[
P\Big(\bigcup_{n=1}^\infty (B \cap A_n)\Big) = \sum_{n=1}^\infty P(B \cap A_n).
\]
Therefore
\[
P_B\Big(\bigcup_{n=1}^\infty A_n\Big) = \frac{1}{P(B)} \sum_{n=1}^\infty P(B \cap A_n) = \sum_{n=1}^\infty P_B(A_n),
\]

so (AXIOM TWO) holds for PB . Since both (AXIOM ONE) and (AXIOM TWO) hold for
PB , it follows that PB is, as well, a probability (measure).

3. Summation formula (also known as total probability formula). If A₁, . . . , A_n form a
partition of Ω then, for any event B,
\[
P(B) = \sum_{i=1}^n P(B|A_i)P(A_i).
\]
Actually, the formula holds even when n = ∞.
Here is why: We have that A₁, A₂, . . . are pairwise disjoint with union equal to Ω. Hence
\[
B = B \cap \Omega = B \cap \bigcup_{n=1}^\infty A_n = \bigcup_{n=1}^\infty (B \cap A_n),
\]
and the events B ∩ A_n are pairwise disjoint. By (AXIOM TWO), P(B) = Σ_{n=1}^∞ P(B ∩ A_n).
But, from the definition, P(B ∩ A_n) = P(B|A_n)P(A_n), and we're done.

4. Multiplication formula (or chain rule). If A₁, A₂, . . . , A_n are events then
\[
P(A_1 \cap \cdots \cap A_n) = P(A_1)P(A_2|A_1)P(A_3|A_2 \cap A_1) \cdots P(A_n|A_{n-1} \cap \cdots \cap A_1).
\]
Here is why: We have
\[
P(A_1 \cap A_2) = P(A_2|A_1)P(A_1),
\]
by definition. Let's try this for n = 3. We have P(A₁ ∩ A₂ ∩ A₃) = P(A₃|A₁ ∩ A₂)P(A₁ ∩ A₂)
and, putting things together, this is further equal to P(A₃|A₁ ∩ A₂)P(A₂|A₁)P(A₁). So
\[
P(A_1 \cap A_2 \cap A_3) = P(A_1)P(A_2|A_1)P(A_3|A_1 \cap A_2),
\]
and thus the formula holds for n = 3. We can use induction to show (in one line) that the
formula holds for all n.

5. Bayes' rule. How do P(A|B) and P(B|A) relate?
\[
P(B|A) = \frac{P(A|B)P(B)}{P(A)}.
\]
Here is why: In the probability P(A ∩ B) the order of A and B can be changed because
A ∩ B = B ∩ A. Now use the definition of conditional probability twice. First, P(A ∩ B) =
P(A)P(B|A). And then, P(B ∩ A) = P(B)P(A|B). That's it.

6. Bayes' rule, second form. If H₁, . . . , H_n form a partition of Ω, namely, they are pairwise
disjoint events with union equal to Ω, then
\[
P(H_i|A) = \frac{P(A|H_i)P(H_i)}{\sum_{k=1}^n P(A|H_k)P(H_k)}, \qquad i = 1, \dots, n.
\]
Here is why: apply the previous formula to get P(H_i|A) = P(A|H_i)P(H_i)/P(A) and then apply the
total probability formula: P(A) = Σ_{k=1}^n P(A|H_k)P(H_k).
Interpretation: If H₁, . . . , H_n are interpreted as alternative hypotheses and have prior
probabilities P(H₁), . . . , P(H_n), then observation of an event A will change these probabilities
to the posterior probabilities P(H₁|A), . . . , P(H_n|A).

11.2 Using conditional probability


We explain the usefulness of the concept of conditional probability through a number of
interesting problems.

PROBLEM 11.3 (false positive and false negative errors: observations affect decisions). In
a population, 1% of people have a certain disease. A test is devised to detect the disease. If
the patient has the disease then the test is positive (detects the disease) with probability 96%.
If the patient does not have the disease the test may erroneously indicate that the patient does
have the disease with probability 2%. Is the test a reliable indicator of the disease?

Figure 11.1: In a population with a pandemic, 1% of people have the disease. A test is devised that
is shown to have 2% false positive and 4% false negative probability. Is the test reliable?

Answer. We let D be the event that a patient has the disease. Since 1% of people have the
disease, we let P(D) = 0.01. Hence P(Dc ) = 0.99. We let T+ be the event that the test is positive
and T− the event that it is negative. We are given that

P(T+ |D) = 0.96, P(T+ |Dc ) = 0.02.

We now compute P(D|T₊). We have
\[
P(D|T_+) = \frac{P(T_+|D)P(D)}{P(T_+)}.
\]
The numerator is P(T₊|D)P(D) = 0.96 × 0.01 = 0.0096. The denominator is, by the summation
formula,
\[
P(T_+) = P(T_+|D)P(D) + P(T_+|D^c)P(D^c) = 0.0096 + 0.02 \times 0.99 = 0.0096 + 0.0198 = 0.0294.
\]
Hence
\[
P(D|T_+) = \frac{0.0096}{0.0294} = \frac{96}{294} \approx 32\%.
\]
The test is totally unreliable.
Discussion: The probability P(T+ |Dc ) is called “false positive probability” (or type I error). The
probability P(T− |D) is called “false negative probability” (or type II error). Which of the two
probabilities should be reduced in order that there be a significant change in the usefulness of
the test? 
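The computation is a one-liner to automate, which makes it easy to experiment with the question just posed by varying the two error probabilities. The helper below is my own sketch, not part of the notes:

```python
def posterior_positive(prevalence, sensitivity, false_positive):
    # P(D | T+) via Bayes' rule and the total probability formula
    p_pos = sensitivity * prevalence + false_positive * (1 - prevalence)
    return sensitivity * prevalence / p_pos

p = posterior_positive(0.01, 0.96, 0.02)
print(round(p, 4))  # ≈ 0.3265, i.e. about 32%
```

For instance, `posterior_positive(0.01, 0.96, 0.001)` is already about 0.91, which suggests that in this regime it is the false positive probability whose reduction matters most.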
?PROBLEM 11.4 (birthday coincidences–Problem 8.14 revisited). Recall the birthday coin-
cidence problem 8.14 among n people on a planet whose year contains exactly d days. Again
assume that P is uniform. We wish to compute the probability P(B) of the event B that at least
two people have the same birthday by first computing P(Bc ) using the multiplication formula;
see Item 4. of Section 11.1.
Answer. Let us apply the following algorithm for checking birthday coincidence. Pick people
one at a time, in any order you like. Let 1, 2, . . . , n be the names of the people you pick. If
person 2 has the same birthday as person 1 (call this event R₂) then stop. Else, if person 3 has
the same birthday as person 1 or 2 (call this event R₃) then stop. And so on. Then
\[
B = R_2 \cup R_3 \cup \cdots \cup R_n.
\]

We now have B^c = R₂^c ∩ · · · ∩ R_n^c so, by the multiplication formula,
\[
P(B^c) = P(R_2^c)P(R_3^c|R_2^c)P(R_4^c|R_3^c \cap R_2^c) \cdots P(R_n^c|R_{n-1}^c \cap \cdots \cap R_2^c).
\]
Recall that these events are subsets of Ω = {1, . . . , d}ⁿ. We can interpret an ω = (ω₁, . . . , ω_n)
as making n selections with replacement from an urn containing d distinct items. Using this
interpretation we get, for 2 ≤ k ≤ n,
\[
P(R_k^c|R_{k-1}^c \cap \cdots \cap R_2^c) = \frac{d-k+1}{d};
\]
the reason is that if we select k − 1 items and observe they are distinct then the probability
that the next item is distinct from the previous ones is (d − k + 1)/d because there are d − k + 1
remaining values; the denominator is d because sampling is with replacement: we put them
back in the urn after we look at them. (Pay attention to the k = 2 term in the last display; it
reads P(R₂^c) = (d − 1)/d.) Therefore,
\[
P(B^c) = \frac{d-1}{d}\cdot\frac{d-2}{d}\cdots\frac{d-n+1}{d}
= \frac{d}{d}\cdot\frac{d-1}{d}\cdot\frac{d-2}{d}\cdots\frac{d-n+1}{d} = \frac{(d)_n}{d^n},
\]
exactly as in Problem 8.14. □
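The product formula is easy to evaluate exactly. Here is a small Python sketch (my addition), using exact fractions; it reproduces the well-known fact that on Earth (d = 365) a group of 23 people already makes a coincidence more likely than not.

```python
from fractions import Fraction

def p_coincidence(n, d=365):
    # P(B) = 1 - (d)_n / d^n, built from the conditional probabilities above
    p_no = Fraction(1)
    for k in range(n):
        p_no *= Fraction(d - k, d)
    return 1 - p_no
```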

PROBLEM 11.5 (a coin whose probability of heads is random!). When we toss a coin n
times the outcome is an n-tuple (x1 , . . . , xn ) where xi = 1 if heads show up at the i-th toss
or x_i = 0 if tails show up. Thus, (x₁, . . . , x_n) ∈ {0, 1}ⁿ. We define a probability measure P
by letting the probability of the outcome (x₁, . . . , x_n) be equal to p^{s(x)}(1 − p)^{n−s(x)}, where
s(x) = x₁ + · · · + x_n. In other words,
\[
P\{(x_1, \dots, x_n)\} = p^{s(x)}(1-p)^{n-s(x)} = p^{\sum_{i=1}^n x_i}(1-p)^{\sum_{i=1}^n (1-x_i)}. \tag{11.2}
\]
Note that these numbers add up to 1, as they should, because
\[
\sum_{(x_1,\dots,x_n)\in\{0,1\}^n} p^{s(x)}(1-p)^{n-s(x)}
= \sum_{x_1=0}^1 p^{x_1}(1-p)^{1-x_1} \cdots \sum_{x_n=0}^1 p^{x_n}(1-p)^{1-x_n}
= (p+1-p)\cdots(p+1-p) = 1.
\]

The number p can be interpreted as the probability of heads in one coin toss. But suppose we
do not know what p is. We know that there are two possibilities: either p = 3/7 or p = 2/3, and
that each of these possibilities has probability 1/2. To model this situation, we introduce, in
addition to (x₁, . . . , x_n), the variable p. Our outcome then is (p, x₁, . . . , x_n) ∈ {3/7, 2/3} × {0, 1}ⁿ.
Let A_{3/7} be the event that p = 3/7, namely, the set of outcomes of the form (3/7, x₁, . . . , x_n), and
let A_{2/3} be the set of outcomes (2/3, x₁, . . . , x_n). Let
\[
R_x \equiv R_{x_1,\dots,x_n} := \{3/7, 2/3\} \times \{(x_1, \dots, x_n)\},
\]
representing the event that we see the outcomes x = (x₁, . . . , x_n) but we do not see the value of
p. We are given that
\[
P(R_x|A_p) = p^{s(x)}(1-p)^{n-s(x)}.
\]
This is a rewriting of the display (11.2). We have two values of p, so we have two formulas.
By Bayes' formula,
\[
P(A_p|R_x) = \frac{P(R_x|A_p)P(A_p)}{P(R_x)} = \frac{\frac{1}{2}\, p^{s(x)}(1-p)^{n-s(x)}}{P(R_x)},
\qquad p \in \{3/7, 2/3\}.
\]
This allows us to make a guess of what the actual p is, based on the values of x = (x₁, . . . , x_n).
To be concrete, suppose that n = 10 and we observe s(x) = 2 heads and n − s(x) = 8 tails. Then
\[
p^2(1-p)^8 = \begin{cases} (3/7)^2(4/7)^8, & p = 3/7 \\ (2/3)^2(1/3)^8, & p = 2/3 \end{cases}
\]
so
\[
P(A_p|R_x) = \begin{cases}
\dfrac{(3/7)^2(4/7)^8}{(3/7)^2(4/7)^8 + (2/3)^2(1/3)^8} = 0.97, & p = 3/7 \\[2ex]
\dfrac{(2/3)^2(1/3)^8}{(3/7)^2(4/7)^8 + (2/3)^2(1/3)^8} = 0.03, & p = 2/3.
\end{cases}
\]
Hence we should guess that p = 3/7, as this has much higher probability than the alternative.
It's a guess, but it's justifiably a good guess. □
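The posterior computation generalizes immediately to any finite list of candidate values of p. The helper below is my own sketch (the names are arbitrary); the prior factor 1/2 is kept explicit even though it cancels here.

```python
def posterior(p_values, prior, heads, tails):
    # P(A_p | R_x) proportional to p^heads * (1-p)^tails * P(A_p)
    likes = [p ** heads * (1 - p) ** tails * pr for p, pr in zip(p_values, prior)]
    total = sum(likes)
    return [l / total for l in likes]

post = posterior([3 / 7, 2 / 3], [0.5, 0.5], heads=2, tails=8)
print([round(q, 2) for q in post])  # [0.97, 0.03]
```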

PROBLEM 11.6 (strategy for getting the gift: every bit of information counts). There are
3 identical sealed boxes, numbered 1, 2, 3. Two of them are empty but the nonempty box
contains £1000. You don’t know where the money is. You are given the option to open a box
at random. If the money is in the box you open then you get it. Clearly, the probability you
find the money is 1/3.
But we make another deal. I ask you to point to a box; say you point to box number X. Then I
open a box, say box number Y, such that Y , X and such that Y does not contain the money.
I then ask you a question: do you want to stick with your original choice or switch? What
should you do?
Answer. Without loss of generality, let us say that box number 3 contains the money. You
don’t know that. So if {X = i} represents the event that you pick box i, we have P(X = i) = 1/3,
i = 1, 2, 3. If X = 1 then, knowing that I should open a different box Y that does not contain
the money, I must open box 2, that is,
P(Y = 2|X = 1) = 1.
Similarly,
\[
P(Y = 1|X = 2) = 1.
\]
If you pick X = 3, then I pick either Y = 1 or Y = 2 at random. So
\[
P(Y = 1|X = 3) = P(Y = 2|X = 3) = 1/2.
\]
Let us find what your probability of winning is if you do switch. Since the money is in box 3,
“win without switching” = {X = 3}.
But I have already opened Y. So we must compute P(X = 3|Y = 2) and P(X = 3|Y = 1)
(remember that Y ≠ 3). We have, by Bayes' rule,
\[
P(X = 3|Y = 1)
= \frac{P(Y=1|X=3)P(X=3)}{P(Y=1|X=3)P(X=3) + P(Y=1|X=2)P(X=2)}
= \frac{\frac12\cdot\frac13}{\frac12\cdot\frac13 + 1\cdot\frac13} = \frac{1}{3},
\]
\[
P(X = 3|Y = 2)
= \frac{P(Y=2|X=3)P(X=3)}{P(Y=2|X=3)P(X=3) + P(Y=2|X=1)P(X=1)}
= \frac{\frac12\cdot\frac13}{\frac12\cdot\frac13 + 1\cdot\frac13} = \frac{1}{3}.
\]
Thus, regardless of the value y in the event {Y = y}, we have found that
\[
P(\text{win without switching}|Y = y) = \frac{1}{3}.
\]
This means that if you switch you double the probability of winning. □
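Readers who distrust the conditional-probability argument can simulate the deal. The sketch below is mine, not part of the notes; the host's tie-break when two doors are available is taken deterministically (smallest eligible door), which does not affect the overall winning rates.

```python
import random

def win_rate(switch, trials=100_000, seed=0):
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        prize = rng.randrange(3)
        choice = rng.randrange(3)
        # host opens a door that is neither chosen nor hides the prize
        opened = next(d for d in range(3) if d != choice and d != prize)
        if switch:
            choice = next(d for d in range(3) if d != choice and d != opened)
        wins += (choice == prize)
    return wins / trials
```

With 100,000 plays, switching wins about 2/3 of the time and sticking about 1/3.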

PROBLEM 11.7 (prisoner’s dilemma). Three prisoners are in the death row in a US prison.
It is midnight. Two of them are going to be executed at dawn. There is a guard keeping an
eye on them. The following dialogue takes place between a specific prisoner and the guard:

Prisoner: Guard, can I ask you a question?


Guard: No, I am not allowed to answer a question that will give you
any information other than what you already know.
Prisoner: I’m not going to ask you any such question.
Guard: Go ahead.
Prisoner: Are two of us going to survive in the morning?
Guard: Yes, you know that.
Prisoner: Ok, just checking. So one of the other guys will be executed.
Guard: Yes, obviously.
Prisoner: Can you tell me the name of a prisoner who will die?
Guard: (Thinking hard.) Sure, knowing his name won’t change anything.
His name is Billy.
Prisoner: So Billy will be executed.
Guard: I told you so. And, by the way, there is no chance in a million
that you can tell him that. You’re tied on the floor and no sound can
penetrate the 2 meter thick walls.
Prisoner: Thank you guard for the information, I’m now going to sleep
better now.
Guard: How so?
Prisoner: Well, before I had chance 1/3 of surviving. Now, since Billy
will be executed, it is between me and the third guy, so I have chance
1/2 of surviving. That's better. I
tricked you. You did give me some information.

Question: Is the prisoner right?


Answer. Reduce to Problem 11.6. □

PROBLEM 11.8 (where is the coin?). A coin is in one of n boxes. The probability that it is
in box i is p_i. Even if the coin is in box i, a search of box i may fail to find it, with
probability ε_i. Find the probability that the coin is in box j, given that you have searched
box i and not found the coin.

Answer. Let B_i be the event that the coin is in box i. We are given that P(B_i) = p_i. Let E be the
event that we make an error. We are given that P(E|B_i) = ε_i. Let A_i be the event that we searched
box i and did not find the coin. Then
\[
A_i = (B_i \cap E) \cup B_i^c.
\]

We want to compute the conditional probability of the event B_j given the event A_i. We have:
\[
P(B_j|A_i) = \frac{P(B_j \cap A_i)}{P(A_i)}.
\]
We have
\[
P(A_i) = P(B_i \cap E) + P(B_i^c) = P(E|B_i)P(B_i) + (1 - p_i) = \varepsilon_i p_i + 1 - p_i.
\]
We now distinguish two cases. If j = i,
\[
B_i \cap A_i = B_i \cap E, \qquad P(B_i \cap A_i) = P(E|B_i)P(B_i) = \varepsilon_i p_i.
\]
If j ≠ i,
\[
B_j \cap A_i = B_j \cap B_i^c = B_j, \qquad P(B_j \cap A_i) = p_j.
\]
Hence
\[
P(B_j|A_i) = \begin{cases}
\dfrac{\varepsilon_i p_i}{1-(1-\varepsilon_i)p_i}, & \text{if } j = i, \\[2ex]
\dfrac{p_j}{1-(1-\varepsilon_i)p_i}, & \text{if } j \neq i.
\end{cases}
\]
PROBLEM 11.9 (application in business). There are two big shops, Aloobubba and Boorbari
selling expensive colored plastic balls of diameter 3 inches and weight 125 lbs each, which
people rush to buy, placing orders on the Internet through a company that works with both
shops. Aloobubba has m storage locations numbered 1 through m. Boorbari has n storage
locations, numbered m + 1 through m + n. Storage location i = 1, . . . , m + n contains r_i red and
g_i green balls. The shops just started their business and you're the first customer worldwide
to place an order. You've been up all night in order to do that, both for telling your friends
you've never met on Fuzzybook and because you'd get a whopping 2.95% discount if you
didn't specify the color of the ball or the shop. The ball is delivered to your house in a nice
package, and when you open the package you find out that the ball is red. What is the probability
that the ball came from Boorbari?
Answer. Consider the event A, representing the event that the ball came from Aloobubba, and
B for Boorbari. Let S_i be the event that the ball was selected from storage location i. Let R be
the event that the received ball is red and G that it is green. In the absence of any information
on which shop the Internet company uses more frequently, we assume that
\[
P(A) = P(B) = 1/2.
\]
Similarly, given that there is no preference of one storage location over another, we must let
\[
P(S_i|A) = \begin{cases} \frac{1}{m}, & 1 \le i \le m \\ 0, & i > m \end{cases}
\qquad
P(S_i|B) = \begin{cases} \frac{1}{n}, & m+1 \le i \le m+n \\ 0, & i \le m. \end{cases}
\]


Since balls are identical except for their color, we must define
\[
P(R|S_i) = \frac{r_i}{r_i + g_i}, \qquad 1 \le i \le m+n.
\]
By Bayes' formula,
\[
P(A|R) = \frac{P(R|A)P(A)}{P(R)} = \frac{P(R|A)P(A)}{P(R|A)P(A) + P(R|B)P(B)}. \tag{11.3}
\]

We now have, by the definition of conditional probability and (AXIOM TWO),
\[
P(R|A) = \sum_{i=1}^{n+m} P(R \cap S_i|A) = \sum_{i=1}^{m} P(R \cap S_i|A),
\]
where the second equality is due to P(R ∩ S_i|A) ≤ P(S_i|A) = 0 if i > m. Using the definition of
conditional probability once more (or the chain rule, Property 4 above), we have
\[
P(R \cap S_i|A) = P(R|S_i \cap A)P(S_i|A).
\]
But if i ≤ m then S_i ⊂ A (storage locations i ≤ m are owned by Aloobubba), so
\[
P(R|S_i \cap A) = P(R|S_i) = \frac{r_i}{r_i + g_i}.
\]

Hence
\[
P(R|A) = \frac{1}{m}\sum_{i=1}^m \frac{r_i}{r_i+g_i}.
\]
Similarly,
\[
P(R|B) = \frac{1}{n}\sum_{i=m+1}^{m+n} \frac{r_i}{r_i+g_i}.
\]
Substituting into (11.3) gives
\[
P(A|R) = \frac{\frac{1}{m}\sum_{i=1}^m \frac{r_i}{r_i+g_i}}
{\frac{1}{m}\sum_{i=1}^m \frac{r_i}{r_i+g_i} + \frac{1}{n}\sum_{i=m+1}^{m+n} \frac{r_i}{r_i+g_i}}.
\]

To make the problem even more applied, suppose that Aloobubba has m = 2 storage locations
and that Boorbari has n = 5, with r₁ = 100, g₁ = 500, r₂ = 300, g₂ = 1600, r₃ = r₄ = r₅ = 1000,
g₃ = g₄ = g₅ = 10,000, r₆ = r₇ = 2000, g₆ = g₇ = 17,000. I put the numbers in the formula and
found P(A|R) = 0.63, so the probability that the ball came from Boorbari is P(B|R) = 1 − 0.63 = 0.37.
Note that all numbers in the formula for P(A|R) are positive integers. You can cook up these
numbers in a way that you can achieve P(A|R) to be approximately any number between 0 and 1,
and so arrive at “surprises” much in the same way that Problem 11.3 was surprising. □
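Here is a small Python sketch (my addition, with arbitrary names) evaluating the final formula with exact fractions; it reproduces the quoted 0.63.

```python
from fractions import Fraction

def p_aloobubba_given_red(locations_A, locations_B):
    # P(A|R) from the formula derived above; inputs are lists of (r_i, g_i)
    pRA = sum(Fraction(r, r + g) for r, g in locations_A) / len(locations_A)
    pRB = sum(Fraction(r, r + g) for r, g in locations_B) / len(locations_B)
    return pRA / (pRA + pRB)  # the uniform prior P(A) = P(B) = 1/2 cancels

A = [(100, 500), (300, 1600)]
B = [(1000, 10_000)] * 3 + [(2000, 17_000)] * 2
p = p_aloobubba_given_red(A, B)
print(float(p))  # ≈ 0.627, which rounds to the 0.63 quoted in the text
```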

11.3 Independence between events

Independence between two events


When does event B not influence event A? This vague question cannot be answered without
knowing P, that is, without knowing which probability measure P we are using. (Recall that
P is a function that assigns numbers to events.) Well, we have explained why
\[
P(A|B) = \frac{P(A \cap B)}{P(B)}
\]
is the probability of A given that B occurs. So it makes sense to declare:

I say that B does not influence A (under P) if P(A|B) = P(A).

Now notice immediately that


P(A|B) = P(A) ⇐⇒ P(A ∩ B) = P(A)P(B)
and so

B does not influence A (under P)


⇐⇒ A does not influence B (under P)
⇐⇒ P(A ∩ B) = P(A)P(B).

Since the phrase “B does not influence A” is symmetric in A and B, we use a phrase in English
that is syntactically symmetric as well. So it makes sense to define:

We say that events A and B are independent (under P) if P(A ∩ B) = P(A)P(B).

We immediately notice that

if A and B are independent then so are A and B^c.   (11.4)

Indeed,
\[
P(A \cap B^c) = P(A) - P(A \cap B) = P(A) - P(A)P(B) = P(A)(1 - P(B)) = P(A)P(B^c).
\]
PROBLEM 11.10 (uniform distribution on a pair of dice begets independence). We
throw a pair of dice. Explain why, under the uniform probability measure on the set of
outcomes, the events
A = “first die shows 5”, B = “second die shows 6”
are independent.
Answer. Here is why. The set of outcomes Ω is the set of all ordered pairs (i, j), where
1 ≤ i, j ≤ 6. Uniform probability measure means that the event {(i, j)}, being the set containing
just one outcome, the outcome (i, j), has probability 1/36. Now that we understand what
P is, we answer the question. We have A ∩ B = {(5, 6)}. Hence P(A ∩ B) = 1/36. But
A = ∪_{j=1}^{6} {(5, j)} = {(5, 1), (5, 2), (5, 3), (5, 4), (5, 5), (5, 6)}. Hence P(A) = 1/36 + · · · + 1/36 = 6/36 = 1/6.
Similarly, P(B) = 1/6. So P(A ∩ B) = P(A)P(B).
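Checks of this kind can be done by brute-force enumeration. Here is a minimal Python sketch (the function and variable names are ours, for illustration only):

```python
from fractions import Fraction
from itertools import product

# Sample space: ordered pairs (i, j) for two dice; uniform measure, 1/36 per outcome.
omega = set(product(range(1, 7), repeat=2))

def prob(event):
    # Probability of an event (a subset of omega) under the uniform measure.
    return Fraction(len(event), len(omega))

A = {w for w in omega if w[0] == 5}  # first die shows 5
B = {w for w in omega if w[1] == 6}  # second die shows 6

print(prob(A), prob(B), prob(A & B))  # 1/6 1/6 1/36
assert prob(A & B) == prob(A) * prob(B)
```

Using exact fractions avoids any floating-point doubt about whether the product rule holds exactly.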



We generalize this considerably:


?PROBLEM 11.11 (product probability measure on S1 × S2 begets independence). Let S1 ,
S2 be two discrete sets. Let P1 be a probability measure on the events (subsets) of S1 and P2 be
a probability measure on events (subsets) of S2 . This means that both P1 and P2 satisfy rules
(i) and (ii). We now let
Ω = S1 × S2 ,
where × means Cartesian product, which is nothing else but the set of ordered pairs (i, j),
where i ranges in S1 and j in S2 . Define now

P{(i, j)} = P1 {i} P2 {j}.

(You should check that P is a probability measure on the events of S1 × S2 .) We then have that
the events
Ai = ∪_{j∈S2} {(i, j)},    Bj = ∪_{i∈S1} {(i, j)}

are independent (under the P we defined). Indeed, Ai ∩ B j contains only one element, the
element (i, j), hence,
P(Ai ∩ B j ) = P1 {i}P2 { j}.
On the other hand, by rule (ii),
P(Ai) = Σ_{j∈S2} P{(i, j)}

and, because P{(i, j)} = P1 {i} P2 { j},


P(Ai) = Σ_{j∈S2} P1{i} P2{j} = P1{i} Σ_{j∈S2} P2{j},
but, from rule (i),
Σ_{j∈S2} P2{j} = 1.

Hence
P(Ai ) = P1 {i}.
Similarly,
P(B j ) = P2 {j}.
Hence
P(Ai ∩ B j ) = P(Ai )P(B j ),
so the two events are independent. 

Independence was rather obvious in the above problem because P was defined by a product
rule. But independence, whenever discovered, gives us powerful tools. After the following
generalization, Problem 11.13 gives a case where independence is not obvious.
?PROBLEM 11.12 (product probability measure on the product of n sets). Generalize the
previous problem to define a probability measure P on the set S1 × S2 × S3 × · · · × Sn (product
of finitely many discrete sets), once you have defined a probability measure Pi on Si for all i.

Answer. We assume that Pk is defined on the discrete set Sk by assigning value Pk {x} for
each x ∈ Sk . An element of S1 × S2 × S3 × · · · × Sn is an ordered n-tuple (x1 , . . . , xn ), where
x1 ∈ S1 , . . . , xn ∈ Sn . To this element we assign the number

P{(x1 , . . . , xn )} = P1 {x1 } · · · Pn {xn }.

To show that P defines a probability measure on S1 × S2 × S3 × · · · × Sn all we have to show is


that
Σ_{(x1,...,xn)∈S1×···×Sn} P{(x1 , . . . , xn )} = 1.

But the sum on the left is actually a product of n sums, because the summand is a product of functions of the separate variables:
Σ_{(x1,...,xn)∈S1×···×Sn} P{(x1 , . . . , xn )} = (Σ_{x1∈S1} P1{x1}) · · · (Σ_{xn∈Sn} Pn{xn}),

and each sum equals 1 because each Pi is a probability measure. 


PROBLEM 11.13 (a not-so-obvious independence). Toss a coin 3 times. Then, under the
uniform probability measure, the events

A=“get exactly one head in the first two tosses”


B=“get exactly one head in the last two tosses”

are independent.
Here is why: The set of outcomes is Ω = {HHH, HHT, HTH, THH, HTT, THT, TTH, TTT}.
Each outcome (element) of Ω gets probability 1/8. Event A is the set {HTH, THH, HTT, THT}
and B = {HHT, HTH, THT, TTH}, so P(A) = P(B) = 4/8 = 1/2. Moreover, A ∩ B = {HTH, THT},
so P(A ∩ B) = 2/8 = 1/4 = P(A)P(B).

Independence between three events


We say that events A, B and C are independent (under P) if
P(A ∩ B) = P(A)P(B)
and P(A ∩ C) = P(A)P(C)
and P(B ∩ C) = P(B)P(C)
and P(A ∩ B ∩ C) = P(A)P(B)P(C).

But are we saying too much? Could it be that the four equalities above are not all necessary?
Unfortunately, they are. Here is an example.
PROBLEM 11.14 (pairwise independence but not independence). Put the uniform proba-
bility measure on the set Ω = {1, 2, 3, 4}. Consider the events A = {1, 4}, B = {2, 4}, C = {3, 4}.
Show that every two of these events are independent but that the three together are not independent.
Answer. We have
P(A) = P(B) = P(C) = 2/4 = 1/2.
Let us compute probabilities of intersections. Notice that A ∩ B = {4}, so P(A ∩ B) = 1/4. By
symmetry,
P(A ∩ B) = P(B ∩ C) = P(A ∩ C) = 1/4.

And, indeed,

P(A ∩ B) = P(A)P(B), P(B ∩ C) = P(B)P(C), P(A ∩ C) = P(A)P(C).

So A, B are independent; and B, C are independent; and A, C are also independent. But
A ∩ B ∩ C = {4}, so
P(A ∩ B ∩ C) = 1/4 ≠ 1/8 = P(A)P(B)P(C).
Hence the events A, B, C are not independent. 
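A quick enumeration confirms the computation above; here is a small Python sketch, with names chosen for illustration:

```python
from fractions import Fraction

omega = {1, 2, 3, 4}  # uniform measure: each outcome has probability 1/4

def prob(event):
    return Fraction(len(event), len(omega))

A, B, C = {1, 4}, {2, 4}, {3, 4}

# Every pair satisfies the product rule:
assert prob(A & B) == prob(A) * prob(B)
assert prob(B & C) == prob(B) * prob(C)
assert prob(A & C) == prob(A) * prob(C)
# ...but the triple does not: P(A ∩ B ∩ C) = 1/4 while the product is 1/8.
assert prob(A & B & C) != prob(A) * prob(B) * prob(C)
```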
We notice next that

if A, B, C are independent then so are A, B, Cc and so are A, Bc , Cc , etc.

Let us, for example, show that A, B, Cc are independent. We immediately have,
due to (11.4), that every two of them are independent, so we just need to check that
P(A ∩ B ∩ Cc) = P(A)P(B)P(Cc). Indeed:

P(A ∩ B ∩ Cc ) = P(A ∩ B) − P(A ∩ B ∩ C) = P(A)P(B) − P(A)P(B)P(C) = P(A)P(B)P(Cc ).

Let us also check that Ac , Bc , Cc are independent. We have (again by (AXIOM TWO))

P(A ∩ Bc ∩ Cc ) + P(A ∩ B ∩ Cc ) + P(A ∩ Bc ∩ C) + P(A ∩ B ∩ C) = P(A).

But as we showed just above, the last three terms of the left-hand side become products. To
save some space, let a, b, c stand for P(A), P(B), P(C), respectively. We then have

P(A ∩ Bc ∩ Cc ) + ab(1 − c) + a(1 − b)c + abc = a.

Rearranging terms and doing some trivial algebra,

P(A ∩ Bc ∩ Cc) = a[1 − b(1 − c) − (1 − b)c − bc] = a(1 − b − c + bc) = a(1 − b)(1 − c) = P(A)P(Bc)P(Cc).

And this shows that the claim was true.

PROBLEM 11.15 (pairwise independence but not independence, again). Take a canonical
tetrahedron (a perfect die with 4 faces) and write the letter a on one face, the letter b on another,
the letter c on another and write abc on the fourth face. Toss the tetrahedron and see which
face it lands on. The events “this face has the letter a on it”, “this face has the letter b on it”,
and “this face has the letter c on it” are pairwise independent but not independent.
Answer. This is essentially the same as Problem 11.14. Indeed if we interpret 1 in that problem
as “face 1” of the tetrahedron, and so on, then we have 4 equally likely faces. And if we
interpret event A in Example 11.14 as event “this face has the letter a on it” of the current
problem, etc., then we can clearly see that we reduce to the previous case. 

Independence of an arbitrary collection of events.


We say that a collection of events is an independent collection if any finite
subcollection is a collection of independent events.

Figure 11.2: In tossing a 4-sided die, labeled as shown, the events “this face has the letter a on it”,
“this face has the letter b on it”, and “this face has the letter c on it” are pairwise independent but
not independent.

The definition applies both to finite and infinite collections of events. So if you have n events
then you need to verify 2^n − n − 1 equalities to prove they are independent. Indeed, a collection
of size n has 2^n subsets. You do not need to consider the empty subset, and you do not need
to consider subsets containing one event. So reduce 2^n by n + 1 to get the number of equalities
that need to be checked.
PROBLEM 11.16 (uniform distribution on many coin tosses begets independence, again).
Toss a fair coin at random n times. Let Hi be the event that heads show up at the i-th toss.
Explain why H1 , . . . , Hn are independent events.
Answer. First of all, our sample space here is Ω = {0, 1}^n. The poetic expression “toss a fair
coin at random n times” means that we put the uniform probability measure P on Ω. Since
|Ω| = 2^n this means that each outcome receives probability 1/2^n and that P(A) = |A|/2^n for any
A ⊂ Ω. (Hence P has been defined on E = P(Ω).) We now have
Hi = {ω ∈ Ω : ωi = 1}.
Note that
|Hi| = 2^{n−1},
because, having specified that a 1 (= heads) is in the i-th position, we have 2 choices (either 0 or
1) for each of the remaining n − 1 positions. Similarly, if i ≠ j,
|Hi ∩ Hj| = 2^{n−2}.
Therefore,
P(Hi ∩ Hj) = 2^{n−2}/2^n = 1/4 = P(Hi)P(Hj).
In general,
|⋂_{i∈I} Hi| = 2^{n−|I|}
for any I ⊂ {1, 2, . . . , n}. Therefore,
P(⋂_{i∈I} Hi) = 2^{n−|I|}/2^n = 1/2^{|I|} = ∏_{i∈I} P(Hi).

So the events are indeed independent. 


PROBLEM 11.17 (symmetry begets independence). We combine Problem 10.22 and Problem
11.16 and define a probability measure Pβ on Ω = {0, 1}^n by letting the Pβ–probability assigned
to the single outcome ω = (ω1 , . . . , ωn ) be given by

Pβ{(ω1 , . . . , ωn)} = e^{−β(ω1+···+ωn)} / Zn,    (11.5)
where 1/β is interpreted as temperature. The constant Zn is chosen so that the sum of all
probabilities on single outcomes equals one.
Question: Are events H1 , . . . , Hn still independent under Pβ? We first find Zn:
Zn = Σ_{ω∈{0,1}^n} e^{−β(ω1+···+ωn)} = Σ_{k=0}^{n} (n choose k) e^{−βk} = (1 + e^{−β})^n.    (11.6)

Hence Zn = Z1^n, where Z1 = 1 + e^{−β}. Choose any collection of events from H1 , . . . , Hn . Suppose, without loss
of generality, we pick the first i of them. We wish to examine whether P(H1 ∩ · · · ∩ Hi) =
P(H1) · · · P(Hi). We have
H1 ∩ · · · ∩ Hi = {ω ∈ {0, 1}^n : ω1 = · · · = ωi = 1}.
Hence
P(H1 ∩ · · · ∩ Hi) = Σ_{ωi+1,...,ωn} e^{−β(i+ωi+1+···+ωn)} / Z1^n = (e^{−βi}/Z1^i) Σ_{ωi+1,...,ωn} e^{−β(ωi+1+···+ωn)} / Z1^{n−i}.
But the last sum equals one because of (11.6) applied when we replace n by n − i. Letting
now i = 1 in the last display gives P(H1) = e^{−β}/Z1 . Similarly for P(Hj) for all j. Hence, indeed,
P(H1 ∩ · · · ∩ Hi) = P(H1) · · · P(Hi). So the events are independent.
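Since the claim may look surprising, here is a numerical sanity check of both (11.6) and the independence, sketched in Python (β and n are arbitrary illustrative choices):

```python
import math
from itertools import product

n, beta = 5, 0.7  # illustrative values; 1/beta plays the role of temperature

outcomes = list(product((0, 1), repeat=n))
weight = {w: math.exp(-beta * sum(w)) for w in outcomes}
Z = sum(weight.values())
assert math.isclose(Z, (1 + math.exp(-beta)) ** n)  # (11.6): Z_n = (1 + e^{-beta})^n

def prob(event):
    return sum(weight[w] for w in event) / Z

H = [{w for w in outcomes if w[i] == 1} for i in range(n)]  # H_i = heads at toss i

# Check the product rule for, e.g., the first three events:
lhs = prob(H[0] & H[1] & H[2])
rhs = prob(H[0]) * prob(H[1]) * prob(H[2])
assert math.isclose(lhs, rhs)  # independence survives under P_beta
```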

11.4 Independence between random variables


Loosely speaking, two random variables X, Y are independent if any statement we can make
for one is independent of any statement we can make about the other.
Definition 11.1. Two discrete random variables X, Y are said to be independent under a
probability measure P if any event of the form {X = x} is independent of any event of the form
{Y = y} under P. Equivalently,

P(X = x, Y = y) = P(X = x)P(Y = y) for all x and y.

We generalize this to n discrete random variables X1 , . . . , Xn and declare that they are
independent under a probability measure P if the events of the form {X1 = x1 }, . . . , {Xn = xn }
are independent under P for all choices of the values x1 , . . . , xn .
?PROBLEM 11.18 (criterion for independence). Show that X, Y are independent if and only
if
P(X ∈ A, Y ∈ B) = P(X ∈ A)P(Y ∈ B)
whenever A, B are sets in the codomains of X, Y, respectively.

Answer. If the last statement holds then take A = {x}, B = {y} to arrive at the definition of
independence. Conversely, if the definition of independence holds and if A, B are sets in the
codomains of X, Y, respectively, then
P(X ∈ A, Y ∈ B) = Σ_{x∈A} Σ_{y∈B} P(X = x, Y = y) = Σ_{x∈A} Σ_{y∈B} P(X = x)P(Y = y)
= (Σ_{x∈A} P(X = x)) (Σ_{y∈B} P(Y = y)) = P(X ∈ A)P(Y ∈ B).


?PROBLEM 11.19 (independence of many implies independence of fewer). Show that if
X, Y, Z are independent then X, Y are independent.
Answer. We have that P(X = x, Y = y, Z = z) = P(X = x)P(Y = y)P(Z = z) for all x, y, z. Sum
both sides over z. Then the left-hand side gives P(X = x, Y = y) and the right-hand side gives
P(X = x)P(Y = y).
?PROBLEM 11.20 (independence of events and their indicator random variables). Explain
the following equivalence:

events A1 , . . . , An are independent ⇐⇒ random variables 1A1 , . . . 1An are independent.


(11.7)

Answer. If we assume the latter then we get the former easily because Ai = {1Ai = 1}. Now assume
the former and show the latter. Let’s do it for n = 2. We assume that A1 , A2 are independent.
This implies (why?) that A1 , Ac2 are independent and Ac1 , A2 are independent and Ac1 , Ac2 are
independent. We consider P(1A1 = x1 , 1A2 = x2 ) for x1 , x2 ∈ {0, 1}. We have four cases. If
x1 = 0, x2 = 0 we have P(1A1 = 0, 1A2 = 0) = P(Ac1 ∩ Ac2 ) = P(Ac1 )P(Ac2 ) = P(1A1 = 0)P(1A2 = 0).
Considering the other three cases, we conclude that P(1A1 = x1 , 1A2 = x2) = P(1A1 = x1)P(1A2 = x2)
for all x1 , x2 . And hence 1A1 , 1A2 are independent. I leave it for you to generalize this for
arbitrary n.
?PROBLEM 11.21 (toss a coin n times, as in Problem 10.2, again). Toss a coin n times; that is,
put the uniform probability measure on Ω = {0, 1}^n. Let Xi be the
random variable that takes value 1 if the i-th coordinate of the outcome is heads or 0 if it is
tails. Show that X1 , . . . , Xn are independent.
Answer. We have, for all x1 , . . . , xn ranging in {0, 1},
P(X1 = x1 , . . . , Xn = xn) = 1/2^n,
because P is the uniform probability measure and because Ω has 2^n elements. But the event
{Xi = xi} has 2^{n−1} elements and so P(Xi = xi) = 2^{n−1}/2^n = 1/2. So P(X1 = x1) · · · P(Xn = xn) =
(1/2)^n = 1/2^n.
?PROBLEM 11.22 (independence of disjoint sets of r.v.s). Let X1 , . . . , Xn be independent. Let
I1 , . . . , Im be pairwise disjoint subsets of {1, . . . , n}. Define random variables Yi = gi (Xk , k ∈ Ii ),
i = 1, . . . , m, where gi are given functions. Explain why Y1 , . . . , Ym are also independent
random variables.

Answer. It’s a matter of following the definitions! 


For example:
If the random variables X1 , X2 , X3 , X4 , X5 are independent then the random variables Y1 = g1 (X1 , X2 ),
Y2 = g2 (X3 , X4 , X5 ) are also independent. The point is that Y1 , Y2 have no
common Xi in their expressions as functions.
But the converse is not true!
PROBLEM 11.23 (independence in presence of common variable). Put the uniform proba-
bility measure on 3 coin tosses. Let Xi be 1 or 0 if the i-th coin is heads or tails respectively.
Show that Y = 1{X1+X2=1}, Z = 1{X2+X3=1} are independent despite the fact that they both have X2 in
their formulas.
Answer. We must show that P(Y = y, Z = z) = P(Y = y)P(Z = z) for all values of y, z. Take
y = z = 1. Then
{Y = 1, Z = 1} = {X1 = 1, X2 = 0, X3 = 1} ∪ {X1 = 0, X2 = 1, X3 = 0}
Hence P(Y = 1, Z = 1) = P(X1 = 1, X2 = 0, X3 = 1) + P(X1 = 0, X2 = 1, X3 = 0) = (1/8) + (1/8) =
1/4. On the other hand,
{Y = 1} = {X1 = 1, X2 = 0} ∪ {X1 = 0, X2 = 1},
so P(Y = 1) = (1/4) + (1/4) = 1/2. Similarly, P(Z = 1) = 1/2. And so we have P(Y = 1, Z = 1) =
P(Y = 1)P(Z = 1). Now check that the same thing holds for all other values of (y, z) or simply
observe that Y and Z are indicator random variables and apply (11.7). 
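Rather than checking the remaining three cases by hand, one can enumerate all eight outcomes; a minimal Python sketch (names are ours):

```python
from fractions import Fraction
from itertools import product

omega = list(product((0, 1), repeat=3))  # three coin tosses, uniform measure

def prob(event):
    return Fraction(len(event), len(omega))

def Y(w): return 1 if w[0] + w[1] == 1 else 0  # exactly one head in tosses 1, 2
def Z(w): return 1 if w[1] + w[2] == 1 else 0  # exactly one head in tosses 2, 3

for y, z in product((0, 1), repeat=2):
    joint = {w for w in omega if Y(w) == y and Z(w) == z}
    Yev = {w for w in omega if Y(w) == y}
    Zev = {w for w in omega if Z(w) == z}
    assert prob(joint) == prob(Yev) * prob(Zev)  # product rule holds in all four cases
```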
PROBLEM 11.24 (symmetry begets independence of sorts). Let Ω = {−2, −1, 1, 2} and P the
uniform probability measure on Ω. Define X(ω) = |ω| and Y(ω) = sign(ω) (that is +1 if ω > 0
and −1 if ω < 0). Show that X and Y are independent.
Answer. If we know the sign of ω and its absolute value then we know ω itself. So P(X =
x, Y = s) = 1/4. On the other hand, X(ω) = x is equivalent to ω = x or ω = −x. Hence
P(X = x) = 2/4 = 1/2. Also Y(ω) = +1 is equivalent to ω > 0, so P(Y = +1) = 1/2. Similarly,
P(Y = −1) = 1/2. And so we have P(X = x, Y = s) = P(X = x)P(Y = s) for all choices of x and s.

We now have the important observation:
If X, Y are independent then E(XY) = (EX)(EY).
To understand this, write
X = Σ_{x∈X(Ω)} x 1{X=x},    Y = Σ_{y∈Y(Ω)} y 1{Y=y},
and multiply the two to get:
XY = Σ_{x∈X(Ω)} Σ_{y∈Y(Ω)} xy 1{X=x} 1{Y=y}.
Note that 1{X=x} 1{Y=y} = 1{X=x, Y=y}. Take expectations of both sides. The right side will involve a
double sum of the quantities xy P(X = x, Y = y) = x P(X = x) y P(Y = y). Hence
E(XY) = (Σ_{x∈X(Ω)} x P(X = x)) (Σ_{y∈Y(Ω)} y P(Y = y)) = (EX)(EY).

11.5 Uncorrelated random variables


Two random variables are called uncorrelated if E(XY) = (EX)(EY). A number of random
variables are uncorrelated if every two of them are uncorrelated.
If the random variables are independent then, obviously, they are uncorrelated. But the
converse is not true.
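A standard example (ours, not from the notes) of uncorrelated but dependent random variables: take X uniform on {−1, 0, 1} and Y = X². A short Python check:

```python
from fractions import Fraction

values = [-1, 0, 1]   # X is uniform on these three points
p = Fraction(1, 3)

EX  = sum(p * x for x in values)          # E(X)   = 0
EY  = sum(p * x * x for x in values)      # E(X^2) = 2/3
EXY = sum(p * x * x * x for x in values)  # E(XY) = E(X^3) = 0

assert EXY == EX * EY  # uncorrelated: E(XY) = (EX)(EY) = 0

# Yet X and Y = X^2 are not independent:
PXY = Fraction(0)   # P(X = 1, Y = 0) = 0, since Y = 0 forces X = 0
PX, PY0 = p, p      # P(X = 1) = 1/3 and P(Y = 0) = P(X = 0) = 1/3
assert PXY != PX * PY0
```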

?PROBLEM 11.25 (variance of sum of uncorrelated random variables). Explain why


var(Σ_{i=1}^{n} Xi) = Σ_{i=1}^{n} var(Xi),    if X1 , . . . , Xn are uncorrelated.

Answer. Let S = Σ_{i=1}^{n} Xi . We have var(S) = E[(S − ES)^2]. But ES = Σ_{i=1}^{n} EXi = Σ_{i=1}^{n} µi , where
µi = EXi . So
S − ES = Σ_{i=1}^{n} (Xi − µi).

The square of this is


(S − ES)^2 = Σ_{i=1}^{n} (Xi − µi)^2 + Σ_{i≠j} (Xi − µi)(Xj − µj).

But notice that


E[(Xi − µi )(X j − µ j )] = E(Xi X j ) − µi EX j − µ j EXi + µi µ j .
Since the random variables are uncorrelated, E(Xi X j ) = (EXi )(EX j ) = µi µ j , so

E[(Xi − µi)(Xj − µj)] = 0,    i ≠ j.

Taking expectations in the expression for (S − ES)^2 we find
E[(S − ES)^2] = Σ_{i=1}^{n} E[(Xi − µi)^2] + 0 = Σ_{i=1}^{n} var Xi .

11.6 Conditional expectation


If X : Ω → R is a random variable and B ⊂ Ω an event with P(B) > 0, recall from (11.1), that

PB (A) = P(A|B),

is a new probability measure on Ω. We then define the conditional expectation of the random
variable X given, or conditional on, the event B by
E(X|B) = E_{PB}(X) = Σ_x x PB(X = x).

The sum is taken over the values of X. Of course,


PB(X = x) = P(X = x, B) / P(B).

Here, the comma between {X = x} and B means intersection of the two events. We use comma,
just as we would use it as a symbol of conjunction (and) in English.
If X, Y : Ω → R are two discrete random variables then we can take B = {Y = y} and define
E(X|Y = y) as above, namely,
E(X|Y = y) = Σ_x x P(X = x|Y = y).

We can further define the random variable E(X|Y) as follows:


1) Let
m(y) := E(X|Y = y).
2) Set
E(X|Y) = m(Y).

?PROBLEM 11.26 (the expectation of the conditional expectation). Explain why

E(E(X|Y)) = E(X)

Answer. We have
E(E(X|Y)) = E(m(Y)) = Σ_y m(y) P(Y = y) = Σ_y (Σ_x x P(X = x|Y = y)) P(Y = y)
= Σ_x x Σ_y P(X = x|Y = y) P(Y = y).
But P(X = x|Y = y)P(Y = y) = P(X = x, Y = y). By (AXIOM TWO), Σ_y P(X = x, Y = y) = P(X = x). Hence
E(E(X|Y)) = Σ_x x P(X = x) = E(X).
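The identity can be verified numerically on any small example; here is a Python sketch with X the sum of two fair dice and Y the first die (our choice of example):

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))  # two fair dice, uniform measure
p = Fraction(1, 36)

X = lambda w: w[0] + w[1]  # the sum of the two dice
Y = lambda w: w[0]         # the first die

def E(f):
    # Expectation of a random variable f under the uniform measure.
    return sum(p * f(w) for w in omega)

def m(y):
    # m(y) = E(X | Y = y): average of X over the conditional measure on B = {Y = y}.
    B = [w for w in omega if Y(w) == y]
    return Fraction(sum(X(w) for w in B), len(B))

assert E(lambda w: m(Y(w))) == E(X) == 7  # E(E(X|Y)) = E(X)
```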


11.7 Some wisdom to keep in mind


Constructions using independence. We can use independence to construct new random
objects. For example, if X and Y are independent random variables with given distributions
then immediately we know the distribution of (X, Y).

Discovery of independence. The power of independence is in its discovery. Whenever


independence is discovered this is great news, for it makes our life easy. In a world of
randomness, the search for independence is often a non-trivial matter, but a very rewarding
indeed.

Symmetry. Often, symmetry of sorts implies some kind of independence. (After all,
whenever symmetry is discovered, it must be used because it often leads to solvability. For
example, to solve the set of two equations 2x + y = 1, 2y + x = 1, we merely observe that the set
remains the same if we interchange x and y; so x = y and so 3x = 1, i.e., x = 1/3.) For example,
the uniform probability measure (which has an obvious symmetry) on the product of two
finite sets implies independence of the coordinates. This was the case in Problem 11.16. Please
revise the problem and think of the symmetry. Symmetry was also present in Problem 11.17.
Indeed, the probability measure in this problem, (11.5), depends on the sum ω1 + · · · + ωn , and
the sum is a symmetric function: if we permute the ωi ’s the value of the sum does not change.

Definitions through independence. We shall later see that independence (especially between
random variables) can be used to define new random objects. For example, if H1 , . . . , Hn
are the (now known to be independent) events associated with a fair coin tossed at random
n times, we can define random variables X0 , X1 , . . . , Xn by letting X0 = 0 and, for i ≥ 1,
Xi = Xi−1 + 1 on Hi and Xi = Xi−1 − 1 on Hi^c.

Do not confuse independence and disjointness. A most silly mistake beginners (and not
only beginners) make is to say “aha, events A and B are disjoint so they’re independent”. But think
again: if A and B are disjoint then occurrence of A implies non-occurrence of B and vice versa.
So they’re extremely dependent! Mathematically: suppose that events A, B are disjoint and
have strictly positive probabilities; then A ∩ B = ∅ so P(A ∩ B) = 0. But P(A)P(B) > 0, so
P(A ∩ B) ≠ P(A)P(B).
Part IV

MAIN TOPICS FOR THIS MODULE

98
Chapter 12

Remembrances and foresights

A quick look at things we should already ?know (or just know)
and a glimpse at things we will ?learn (or merely learn)

12.1 Remembrances
A probability measure is a function whose domain is a collection of events and whose range
is the set of real numbers [0, 1]. The domain of a probability measure must satisfy certain
logical properties: (a) the empty set is an event; (b) if A is an event then Ac is an event; (c)
if A1 , A2 , . . . are events then so is their union. The term that mathematicians use for these
properties is: the collection of events forms a σ-field. The probability measure itself must
satisfy (AXIOM ONE) and (AXIOM TWO).
People use sloppy language. For example, when people say “what is the probability that
when tossing a pair of dice I will get a sum of at most 4?” they mean “compute the probability
of the corresponding event under the uniform probability measure”.
The class of events is a collection of subsets of an ambient space that is often referred to
as the sample space. In answering a probability question, there is often a lot of freedom in
choosing the sample space. One must choose it judiciously so that the computations become as
simple as possible.
A random variable is a function from the sample space into the set of real numbers.
Technically, it must be a measurable function as explained in Section 9.6.
A discrete set is a countable set. A discrete random variable is one whose set of values is
discrete.
Random variables have a purpose in life: to transform a probability measure, say P, on
their domain to a probability measure, say Q, on their range. This is indicated as follows:

P → X → Q

CHAPTER 12. REMEMBRANCES AND FORESIGHTS 100

This Q is called distribution of X or law of X. In practical applications (and not only) a deeper
purpose is to do this: Given a desired Q, design a random variable whose law is Q.

The expectation of a discrete random variable X is defined by the expression E(X) =
Σ_x x P(X = x), where the sum ranges over all values of X, and you must have seen that
the same thing is given by E(X) = Σ_ω X(ω) P{ω}. This is called the law of the unconscious
statistician (LOTUS). Quantities related to this are the variance and moments. Sometimes
random variables do not have expectations. We use LOTUS to compute E[g(X)] in two ways.
Independence is a notion of paramount importance. In elementary classes, we learn what
independence between events is: the probability of the intersection of any finite number of
events is equal to the product of the probabilities of the events. And then we learn what
independence between random variables is.
Conditioning is also important and you must have understood why and you must have
also used conditioning in a variety of problems ranging from decision making to ways of
facilitating computations.

12.2 Foresights
The issue next is to define random variables (and their distributions) on spaces that are
uncountable.

Definition 12.1 (i.i.d.). We say that a collection of random variables is i.i.d. = independent
and identically distributed if any finitely many random variables from the collection are
independent and if each random variable has the same distribution.

Suppose that S is a finite set and let Q be a probability measure on it. Then it is obvious that
we can
let X1 , . . . , Xn be i.i.d. with common law Q. (12.1)

?PROBLEM 12.1 (“let there be finitely many i.i.d. r.v.s” can always be said). Explain why
statement (12.1) is not vacuous.
Answer. To each element (x1 , . . . , xn ) of S^n assign probability according to the product rule,
as explained in Section 8.1, “Probability on finite sets”. That is, assign probability
Q{x1 } · · · Q{xn }. Then define Xi (x1 , . . . , xn ) = xi , for all i = 1, . . . , n, and check that each Xi has
law Q and that X1 , . . . , Xn are independent.
But the issue is: can we also

let X1 , X2 , . . . be an infinite sequence of i.i.d. random variables with common law Q? (12.2)
This is not obvious at all, so we state it as one of the main theorems of the whole of probability
theory:

Theorem 12.1 (a sequence of i.i.d. random variables exists). If Q is a probability measure



on a finite set S of size at least 2 then there exists an infinite sequence (X1 , X2 , . . .) of i.i.d. random
variables with law Q each, all defined on the same probability space.

For most statisticians this is obvious. After all, they claim, we can always toss a coin
independently and an unlimited number of times. But it is a fact that the theorem above is
equivalent to the following theorem:

Theorem 12.2 (the area function exists). Let R2 represent the Euclidean plane. Then there is
a large class B 2 of subsets of R2 that contains all rectangles and a unique function

area : B 2 → [0, ∞),

where B 2 is given in Definition 12.2, such that

(1) area(∅) = 0;
(2) area(∪_{n=1}^{∞} Bn) = Σ_{n=1}^{∞} area(Bn), whenever B1 , B2 , . . . are pairwise disjoint;
(3) area(R) = ab if R is a rectangle with side lengths a, b.

Of course, because we cheat when we teach Calculus, we make the student believe they know
what this function is: the usual “area”. Indeed, it is. But we never proved that this function
can be defined on “very complicated sets”.

Theorem 12.3 (equivalence of Theorems 12.1 and 12.2). The statements of Theorems 12.1
and 12.2 are equivalent.

I do not expect you to ?understand these theorems. But I expect, and hope, you understand
them and know what they say.
Armed with the above results (that we will never explain in these notes), we have the right
to assert the veracity of statement (12.2).
We will proceed by defining random variables with values in big sets, such as R, R^2 and
R^d. We will also consider sequences of random variables.

Definition 12.2. A rectangle in Rn is any set of the form I1 × · · · × In , where I1 , . . . , In are


intervals in R. Let I^n be the collection of all rectangles. We say that B^n is the Borel class in
R^n if it is the smallest σ-field that includes I^n. When n = 1, we write I instead of I^1 and B
instead of B^1.
Chapter 13

Bernoulli trials, some basic random variables and their limits

In which chapter we toss a coin finitely many times and, through this, we discover Bernoulli,
binomial and Poisson random variables. We touch upon tossing a coin infinitely many times,
explain memorylessness and discover what looks like a probability measure on the real line
that assigns probability 0 to every real number.

Notations. When we write {0, 1} (with curly brackets) we mean a set with two elements. Do
not confuse this with [0, 1] which denotes the subset of all real numbers x with 0 ≤ x ≤ 1. The
set {0, 1}^2 contains 4 elements: (0, 0), (0, 1), (1, 0), (1, 1). Similarly, the set {0, 1}^n contains 2^n
elements.

13.1 Bernoulli random variables


A Bernoulli random variable ξ is a random variable that takes two values: 0 or 1. The
distribution of such a random variable is denoted by Ber(p) where

p = P(ξ = 1).

Hence P(ξ = 0) = 1 − p. We can think of ξ as encoding the outcome of a coin toss, 1 for,
say, heads, and 0 for tails. The name Bernoulli is that of the mathematician Jacob Bernoulli
(1655-1705). As a probability measure, Ber(p) is unique. It can be written as

Ber(p) = pδ1 + (1 − p)δ0 ,

where δa is the Dirac probability measure; see (8.5). The expression “ξ is a Ber(p) random
variable” means that the law of ξ is Ber(p). We use definite article (the) when we refer to
the law Ber(p); but we use indefinite article (a) when we refer to a Ber(p) random variable,
because the latter is by no means unique. Observe that any indicator random variable is a
Bernoulli random variable: indeed, if A is an event, the law of 1A is Ber(P(A)).

CHAPTER 13. BERNOULLI TRIALS AD INFINITUM 103

It is easy to see that


Eξ = p, var ξ = p(1 − p).

13.2 Finitely many Bernoulli trials


Consider now ξ1 , . . . , ξn to be i.i.d. Ber(p) random variables. We refer to this situation as
Bernoulli trials. We can talk about the random vector

ξ = (ξ1 , . . . , ξn )

that takes values in the set {0, 1}n . The distribution of ξ is completely determined by the
probabilities
P(ξ = (x1 , . . . , xn )), xi ∈ {0, 1}, i = 1, . . . , n.

PROBLEM 13.1 (joint distribution of n i.i.d. Bernoulli trials). Show that


P(ξ = (x1 , . . . , xn )) = p^{Σ_{i=1}^{n} xi} (1 − p)^{n − Σ_{i=1}^{n} xi}    (13.1)
Answer. Since ξ1 , . . . , ξn are independent we have

P(ξ = (x1 , . . . , xn )) = P(ξ1 = x1 ) · · · P(ξn = xn ).

But
P(ξi = x) = p^x (1 − p)^{1−x},    x = 0, 1.
Multiply these numbers, bring the p-terms together and the (1 − p)-terms together.

13.3 Binomial random variables


We continue as above and define
Sn = Σ_{i=1}^{n} ξi .    (13.2)
If we think of tossing n coins then Sn represents the number of tosses that resulted in heads
(and so n − Sn is the number of tails). Notice that

Sn ∈ {0, 1, 2, . . . , n}.

We define
bin(n, p) = law of Sn .
To compute this law we need to compute P(Sn = k), k = 0, 1, . . . , n.
?PROBLEM 13.2 (formula for the binomial distribution). Show that
P(Sn = k) = (n choose k) p^k (1 − p)^{n−k}.    (13.3)

Answer. Notice that
{Sn = k} = {there is a set I ⊂ {1, . . . , n} with |I| = k such that ξi = 1 for all i ∈ I and ξi = 0 for all i ∉ I}
= ∪_{I⊂{1,...,n}, |I|=k} {ξi = 1 for all i ∈ I and ξi = 0 for all i ∉ I} ≡ ∪_{I⊂{1,...,n}, |I|=k} AI .
But AI ∩ AJ = ∅ if I ≠ J. Hence, by (AXIOM TWO),
P(Sn = k) = Σ_{I⊂{1,...,n}, |I|=k} P(AI).
Since P(AI) = p^k (1 − p)^{n−k} is the same for all I with |I| = k, the sum equals this number times
the number of summands; the latter is (n choose k) because this is (by definition) the number of
sets I with |I| = k.
We next compute the expectation and variance of a bin(n, p) random variable. We have
E(Sn) = np.
This is obvious: the expectation of a sum is the sum of the expectations:
E(Sn) = Σ_{i=1}^{n} E(ξi) = Σ_{i=1}^{n} p = np.
But the random variables ξ1 , . . . , ξn are independent and hence uncorrelated. Therefore the
variance of Sn is the sum of the variances of the ξi . Each ξi has the same variance, p(1 − p),
therefore
var Sn = np(1 − p).
It is remarkable that both the expectation and the variance of the sum are linear functions of n.
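Both formulas can be checked directly against the pmf (13.3); here is a small Python sketch with illustrative n and p:

```python
import math
from fractions import Fraction

n, p = 10, Fraction(2, 5)  # illustrative values

# The bin(n, p) pmf, computed exactly with fractions.
pmf = [math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
assert sum(pmf) == 1  # the probabilities sum to one

mean = sum(k * pmf[k] for k in range(n + 1))
var = sum(k**2 * pmf[k] for k in range(n + 1)) - mean**2

assert mean == n * p            # E(S_n) = np
assert var == n * p * (1 - p)   # var(S_n) = np(1 - p)
```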

13.4 Poisson random variables


Let us consider the situation where p is small and n is large so that their product is neither
large nor small. To express this mathematically, we assume that p = pn , a function of n such
that
lim_{n→∞} n pn = λ,    (13.4)

where λ is a positive constant.

Then
lim_{n→∞} P(Sn = k) = (λ^k/k!) e^{−λ},    k = 0, 1, 2, . . .    (13.5)
Note that Sn depends on n both explicitly, through the subscript n, and implicitly, through the numbers
pn . To ?understand this, assume for simplicity that

p = λ/n

and start from (13.3):
P(Sn = k) = (n choose k) (λ/n)^k (1 − λ/n)^{n−k} = [n! / ((n − k)! n^k)] (λ^k/k!) (1 − λ/n)^{n−k}.
The middle fraction on the right, n!/((n − k)! n^k), is a sequence with limit 1 because
n!/((n − k)! n^k) = (n/n) ((n − 1)/n) · · · ((n − k + 1)/n)
and each of the k terms converges to 1. Hence
lim_{n→∞} P(Sn = k) = (λ^k/k!) lim_{n→∞} (1 − λ/n)^{n−k} = (λ^k/k!) e^{−λ},
because, as we know from Calculus, lim_{n→∞} (1 + x/n)^n = e^x for every real x. It is rather remarkable that the
numbers
Πλ(k) = e^{−λ} λ^k/k!,    k = 0, 1, . . .
sum up to 1, namely,
Σ_{k=0}^{∞} Πλ(k) = 1,
and this is because
e^λ = Σ_{k=0}^{∞} λ^k/k!
is the Taylor expansion of the exponential function (and valid for all λ). Therefore,

The numbers Πλ(k), k ∈ Z+ = {0, 1, . . .}, define a probability measure Pλ on
(Z+, P(Z+)), by Pλ(A) = Σ_{k∈A} Πλ(k), A ∈ P(Z+). This probability measure is called
the Poisson distribution with parameter λ and is denoted by Poi(λ).

We can express (13.5) by


bin(n, λ/n) → Poi(λ) as n → ∞.
Since Poi(λ) is a probability measure, there is a random variable whose distribution is Poi(λ).
Such a random variable is called a Poi(λ) random variable.
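These notes contain no code, but the convergence bin(n, λ/n) → Poi(λ) is easy to check numerically. Here is a small Python sketch (not part of the syllabus; the function names are my own, not from any library) comparing the two probability mass functions:

```python
from math import comb, exp, factorial

def binom_pmf(n, p, k):
    # P(S_n = k) for S_n with law bin(n, p), as in (13.3)
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(lam, k):
    # Pi_lambda(k) = e^{-lambda} lambda^k / k!
    return exp(-lam) * lam**k / factorial(k)

lam = 2.0
for n in (10, 100, 10_000):
    # largest discrepancy over k = 0, ..., 9
    gap = max(abs(binom_pmf(n, lam / n, k) - poisson_pmf(lam, k)) for k in range(10))
    print(f"n = {n:6d}   max gap = {gap:.2e}")
```

The gap shrinks as n grows, which is exactly what (13.5) asserts.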
?PROBLEM 13.3 (matching n men to n women). In a ballroom, there are n men and n
women. Each man is married to one woman. The master of dancing ceremonies pairs men
and women at random. Let Z_n be the number of dance pairs that are actually married. Show that the distribution of Z_n converges, as n → ∞, to the Poi(1) distribution.
Answer. See Problem 10.24 for the solution. 
PROBLEM 13.4 (estimation of the p of bin(n, p)). We consider tossing a coin n times. We assume that the tosses behave as Ber(p) trials. But we do not know p. Is there a way to make a good guess based on the observations? Here is how. Count the number of successes. Say that you obtain the number k. The probability of this is

    (n choose k) p^k (1 − p)^{n−k}.

A good guess is to choose the p̂ that maximizes this probability. We thus must choose p̂ such that

    p̂^k (1 − p̂)^{n−k} ≥ p^k (1 − p)^{n−k},  for all 0 ≤ p ≤ 1.

Let L(p) = p^k (1 − p)^{n−k}. We can easily show that L is maximized at p̂ = k/n. This is then a reasonable estimate of the true p. We can go further and ask: What is the probability that the estimated p̂ is far from the true p? We defer this for later. 
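As a quick sanity check on the claim that L(p) = p^k(1 − p)^{n−k} is maximized at p̂ = k/n, we can compare L(p̂) against L on a fine grid. A Python sketch (the function name and the grid are my own choices):

```python
def likelihood(p, n, k):
    # L(p) = p^k (1 - p)^(n - k), the p-dependent factor of the bin(n, p) pmf
    return p**k * (1 - p)**(n - k)

n, k = 50, 13
p_hat = k / n                      # the claimed maximizer
grid = [i / 1000 for i in range(1001)]
best_on_grid = max(grid, key=lambda p: likelihood(p, n, k))
print(p_hat, best_on_grid)         # the grid maximizer coincides with k/n = 0.26
```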
PROBLEM 13.5 (distribution of infrequent errors). A book has 400 pages and about 1000
characters per page. We found 200 characters typed wrong in the whole book. What is the
probability that a given page contains at least 2 erroneous characters? Make appropriate
assumptions in order to answer this and do so in two different ways.
Answer. We assume that each character is erroneously typed with probability p. Since we found 200 wrong characters among 400,000, it is reasonable to take

    p = 200/400,000 = 1/2000.

Assume that characters are erroneously typed independently. Then the total number of erroneous characters on a given page is a bin(1000, 1/2000) random variable. This is approximately a Poi(1/2) random variable. The probability that there are 0 or 1 wrong characters on the given page is e^{−1/2} + (1/2)e^{−1/2}, hence the probability that we have at least 2 errors is 1 − (3/2)e^{−1/2} ≈ 0.09.

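The "two different ways" asked for can be taken to be the exact bin(1000, 1/2000) computation and its Poi(1/2) approximation. A small Python sketch comparing them (variable names are my own):

```python
from math import exp

n, p = 1000, 1 / 2000          # characters per page, per-character error probability
lam = n * p                    # = 1/2, the Poisson parameter

# first way: exact P(at least 2 errors) under bin(n, p)
exact = 1 - (1 - p)**n - n * p * (1 - p)**(n - 1)

# second way: P(at least 2) under the Poi(1/2) approximation
approx = 1 - exp(-lam) - lam * exp(-lam)

print(exact, approx)           # both are about 0.09
```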
Observe that
if X, Y are independent with laws bin(n, p), bin(m, p) then X + Y has law bin(n + m, p).
And here is why. Consider n + m Bernoulli trials
ξ1 , · · · , ξn , ξn+1 , . . . , ξn+m .
Clearly:
1) The sum of all of them has law bin(n + m, p).
2) The sum of the first n has law bin(n, p). The sum of the last m has law bin(m, p).
3) The first n are independent of the last m.
4) Hence the sum of the first n is independent of the sum of the last m.
5) So we can let X be the sum of the first n and Y the sum of the last m. Then X + Y has law
bin(n + m, p).
Something similar happens with independent Poisson random variables:
if X, Y are independent with laws Poi(λ), Poi(µ) then X + Y has law Poi(λ + µ).
Here is why. For each n let X_n, Y_n be two independent random variables with binomial distributions. Assume that X_n has law bin(n, λ/n) and that Y_n has law bin(⌊nµ/λ⌋, λ/n), where ⌊·⌋ denotes integer part. We know then that X_n + Y_n has law bin(n + ⌊nµ/λ⌋, λ/n). We have, as n → ∞,

    bin(n, λ/n) → Poi(λ),
    bin(⌊nµ/λ⌋, λ/n) → Poi(µ),
    bin(n + ⌊nµ/λ⌋, λ/n) → Poi(λ + µ).

Letting X be a random variable with law the limit of the law of X_n, and similarly for Y, and since independence is preserved in the limit, we have that X + Y has law Poi(λ + µ).
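The additivity statement can also be verified directly from the pmfs: by the binomial theorem, the convolution ∑_i Π_λ(i) Π_µ(k − i) equals Π_{λ+µ}(k). A numerical Python sketch of this identity (function name is mine):

```python
from math import exp, factorial

def poisson_pmf(lam, k):
    # Pi_lambda(k) = e^{-lambda} lambda^k / k!
    return exp(-lam) * lam**k / factorial(k)

lam, mu = 1.3, 2.2
for k in range(8):
    # P(X + Y = k), conditioning on X = i and using independence
    conv = sum(poisson_pmf(lam, i) * poisson_pmf(mu, k - i) for i in range(k + 1))
    assert abs(conv - poisson_pmf(lam + mu, k)) < 1e-12
print("convolution of Poi(1.3) and Poi(2.2) agrees with Poi(3.5)")
```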

13.5 Infinitely many Bernoulli trials


By Theorem 12.1, we can have an infinite collection of i.i.d. Ber(p) random variables. We denote them by
ξ1 , ξ2 , . . .
This sequence is referred to as Bernoulli trials (infinitely many of them). If we denote by ξ this
sequence, namely,
ξ = (ξ1 , ξ2 , . . .)
we see that ξ takes values in the set

    {0, 1}^N,

the set of infinite sequences of 0s and 1s, which is not only an infinite set but also uncountable. If we let Ω be the underlying sample space then each ξ_i is a function from Ω into {0, 1}. Also, ξ is a function from Ω into {0, 1}^N. Since {0, 1}^N is a space of sequences, we may call ξ a random sequence. Theorem 12.1 asserts that ξ has a well-defined distribution, a probability measure on the uncountable set {0, 1}^N.
We can visualize these infinitely many Bernoulli trials as tossing a coin ad infinitum, and
doing so independently. The probability of heads is p and of tails 1 − p.
?PROBLEM 13.6 (a sure event). Assume that ξ1 , ξ2 , . . . is an infinite collection of i.i.d. Ber(p)
random variables with p > 0 (strictly positive). Consider the event
A = “infinitely many of the ξi are equal to 1”
and express it using set-theoretic symbols (like ∪). Then compute P(A) by first computing
P(Ac ).
Answer. Observe that "infinitely many of the ξ_i are equal to 1" if and only if "for every m there is a further n ≥ m such that ξ_n = 1". But then it is obvious that

    A = ⋂_{m=1}^∞ ⋃_{n=m}^∞ {ξ_n = 1}.

Hence

    A^c = ⋃_{m=1}^∞ ⋂_{n=m}^∞ {ξ_n = 0} =: ⋃_{m=1}^∞ T_m.

But the event

    T_m = ⋂_{n=m}^∞ {ξ_n = 0}

is the event that ξ_m = ξ_{m+1} = · · · = 0 and to find the probability of this we must multiply (1 − p) by itself infinitely many times. Since p > 0 we have 1 − p < 1. When we multiply a number with absolute value strictly less than 1 infinitely many times by itself we get 0. (All I'm saying is that lim_{n→∞} a^n = 0 if |a| < 1.) So P(T_m) = 0 for all m. But then

    P(A^c) ≤ ∑_{m=1}^∞ P(T_m) = 0.

Hence P(A) = 1. 

13.6 Geometric random variables


Define
ν := min{n ∈ N : ξn = 1}.
In words, ν is the first toss that will give heads. We have

{ν > n} = {ξ1 = · · · = ξn = 0}.

Using independence,

    P(ν > n) = P(ξ_1 = 0) · · · P(ξ_n = 0) = (1 − p)^n,  n = 0, 1, . . .     (13.6)

Since
{ν = n} = {ν > n − 1} \ {ν > n}, {ν > n} ⊂ {ν > n − 1},
we have

    P(ν = n) = P(ν > n − 1) − P(ν > n) = (1 − p)^{n−1} − (1 − p)^n = (1 − p)^{n−1} p.

We refer to the law of ν as geometric with parameter p and denote it by geo(p). A random
variable with this law is called a geo(p) random variable. We can compute the expectation of
ν by using the trivial identity
X∞
ν= 1ν>n
n=0

(the summand on the right side is 1 if and only if n < ν, that is, if and only if n = 0, 1, 2, . . . , ν − 1–
these are ν integers; hence the sum is the sum of ν ones, so it is ν). Taking expectations, we
have (see also (9.10))
∞ ∞
X X 1
E(ν) = P(ν > n) = (1 − p)n = .
p
n=0 n=0

The most important property of a geometric random variable is that it is memoryless.


Mathematically, this is a statement about conditional probabilities:

P(ν − m > n|ν > m) = P(ν > n), for all m, n = 0, 1, 2, . . .

In words:

Given that no heads have shown up in the first m trials, the time ν − m until the
next head shows up has the same distribution as ν.

This is now obvious from the fact that the coin tosses are i.i.d. But we can also see that the above statement holds by a simple computation:

    P(ν − m > n | ν > m) = P(ν > m + n, ν > m)/P(ν > m) = P(ν > m + n)/P(ν > m) = (1 − p)^{n+m}/(1 − p)^m = (1 − p)^n = P(ν > n).

In particular, the conditional law of ν given ν > 1 is the law of 1 + ν. Since P(ν > 1) = 1 − p, we can express this by writing

    ν =_d { 1, with probability p;  1 + ν, with probability 1 − p },

where =_d denotes equality of the distribution of the left-hand side to the distribution of the right-hand side. From this, we have

    g(ν) =_d { g(1), with probability p;  g(1 + ν), with probability 1 − p },

for any function g. Letting g(ν) = ν² and taking expectations we obtain

    E(ν²) = p + (1 − p) E[(1 + ν)²].

Expanding the square and solving for E(ν²) we find

    E(ν²) = (2 − p)/p².

Hence

    var(ν) = E(ν²) − (Eν)² = (1 − p)/p².
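The formulas E(ν) = 1/p and var(ν) = (1 − p)/p² can be checked against truncated sums of the geo(p) pmf. A Python sketch (the truncation point N is my own choice; the neglected tail is geometrically small):

```python
def geo_moments(p, N=10_000):
    # truncated sums of n * pmf(n) and n^2 * pmf(n), with pmf(n) = (1-p)^(n-1) * p
    mean = sum(n * (1 - p)**(n - 1) * p for n in range(1, N))
    second = sum(n**2 * (1 - p)**(n - 1) * p for n in range(1, N))
    return mean, second - mean**2

p = 0.3
mean, var = geo_moments(p)
print(mean, var)   # close to 1/p = 3.333... and (1-p)/p^2 = 7.777...
```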

13.7 Limits of geometric random variables


Consider again infinitely many Bernoulli trials, and let the probability of heads p = p_n depend on n as in (13.4), that is, we assume lim_{n→∞} n p_n = λ. We let

    ν_n := min{k ∈ N : ξ_k = 1},

where the index n denotes the implicit dependence on p_n. We claim that

    lim_{n→∞} P((1/n) ν_n > t) = e^{−λt},  t ≥ 0.     (13.7)

Here t is any nonnegative real number. To understand this, let us first pretend that nt is an integer. If so then, letting p = p_n = λ/n, the event {ν_n > nt} has probability given by (13.6):

    P((1/n) ν_n > t) = (1 − λ/n)^{nt} → e^{−λt}, as n → ∞.

But "to pretend that nt is an integer" is a statement that depends on n itself, so we can't really pretend. To overcome this slight problem, we first show that

    lim_{n→∞} P((1/n) ν_n ≥ t) = e^{−λt},  t ≥ 0.     (13.8)
Let’s do a little exercise.

PROBLEM 13.7 (upper integer part). Define the upper integer part of the real number x by

    ⌈x⌉ := min{n ∈ Z : n ≥ x}.     (13.9)

For example, ⌈2.6⌉ = 3. Explain why the statement

    For all N ∈ Z and all x ∈ R we have N ≥ x ⇐⇒ N ≥ ⌈x⌉

is always true.
Answer. Consider the set U(x) := {n ∈ Z : n ≥ x}. (We have U(x) ≠ ∅ because there is always an integer above any given real number.) Let N be an integer. Suppose first that N ≥ x. Then N ∈ U(x), by the definition of U(x). Hence N ≥ min U(x) = ⌈x⌉. Suppose next that N ≥ ⌈x⌉. But ⌈x⌉ ≥ x always, so N ≥ x. Hence the equivalence is always true. 
We then have

    ν_n ≥ nt ⇐⇒ ν_n ≥ ⌈nt⌉

and so, for t > 0, by (13.6),

    P((1/n) ν_n ≥ t) = P(ν_n > ⌈nt⌉ − 1) = (1 − λ/n)^{⌈nt⌉−1}.

Since

    lim_{n→∞} ⌈nt⌉/n = t,

we again have that lim_{n→∞} (1 − λ/n)^{⌈nt⌉−1} = e^{−λt}. To conclude we observe the following:
?PROBLEM 13.8 (a sparse geometric r.v. assumes no specific value in the limit). Explain why

    lim_{n→∞} P((1/n) ν_n = t) = 0.

Answer. We have

    P(ν_n = nt) = (1 − λ/n)^{nt−1} (λ/n) 1_{nt∈Z}.

This actually says that the probability is 0 if nt ∉ Z. We have lim_{n→∞} (λ/n) = 0 while the other two factors in the expression are ≤ 1. Hence the limit is 0. 
We thus have that:
Both (13.7) and (13.8) hold true.
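The convergence (13.8) is easy to see numerically: with p_n = λ/n, the tail probability of the rescaled geometric random variable approaches e^{−λt}. A Python sketch (the chosen λ, t and the values of n are my own):

```python
from math import ceil, exp

lam, t = 1.5, 0.8
for n in (10, 100, 100_000):
    # P(nu_n / n >= t) = P(nu_n >= ceil(n t)) for the geo(lam/n) variable nu_n
    tail = (1 - lam / n) ** (ceil(n * t) - 1)
    print(f"n = {n:6d}   tail = {tail:.6f}   limit = {exp(-lam * t):.6f}")
```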
The big question now is:
Is there a random variable, say τ, such that
    P(τ > t) = P(τ ≥ t) = e^{−λt},  t ≥ 0?     (13.10)
Such a random variable must necessarily satisfy
P(τ = t) = 0, for all t,
the reason being obvious:
    P(τ = t) = P(τ ≥ t) − P(τ > t) = e^{−λt} − e^{−λt} = 0.
Equivalently, we are really asking for the existence of a probability measure on (certain subsets
of) [0, ∞) that assigns value e−λt to sets of the form [t, ∞) and (t, ∞).

?PROBLEM 13.9 (the union of uncountably many events begets monsters). A student
argues as follows.
We showed that the random variable τ is such that the probability of the event {τ = t} is 0 for all t. But

τ ≥ 0 ⇐⇒ τ = t for some t ≥ 0

is a true statement. We write this statement in the equivalent form


    {τ ≥ 0} = ⋃_{t≥0} {τ = t}.

We also notice that the events {τ = t}, t ≥ 0, are pairwise disjoint because

    if t ≠ s then {τ = t} ∩ {τ = s} = {τ = t = s} = ∅.

Therefore, by (AXIOM TWO), we have

    P(τ ≥ 0) = ∑_{t≥0} P(τ = t).     (13.11)

But P(τ ≥ 0) = 1 and P(τ = t) = 0 for all t and so

1 = 0.

Where did the student make a mistake?


Answer. AXIOM TWO says that if a countable collection of events A_1, A_2, . . . is pairwise disjoint then P(⋃_n A_n) = ∑_n P(A_n). But the collection of events {τ = t}, t ≥ 0, is uncountable, so we can't apply AXIOM TWO. Everything in the student's syllogism is correct except (13.11).
Chapter 14

Densities per se

We accept the notion of “density” as a positive (measurable) function


with finite integral (integral 1, for probability density) and, through it,
we formally define what looks like a probability measure. The point of
view here is more or less that of a physicist. We don’t explain things
rigorously. In particular, the definition of expectation will be on a
case-by-case basis and thus sloppy. This will somehow be remedied in
Chapter 16. We also explain how smooth functions transform a given
density into another (smooth change of variables).

14.1 Mass and probability density functions


Definition 14.1 (mass density). We say that f(x), where x ranges in R, is a mass density if f is a measurable function such that f(x) ≥ 0 for all x and such that

    ∫_a^b f(x) dx < ∞

for all real numbers a, b. The quantity ∫_{−∞}^∞ f(x) dx is called the total mass and may be 0 or ∞.
The point here is: what does the integral mean? It is not the usual Riemann integral of Calculus
but, rather, the so-called Lebesgue integral. However, in practice, if we assume that f is
piecewise continuous then we can think of the integral as a Riemann integral and hence be
able to apply the usual Calculus theorems (e.g., the first and second fundamental theorems
and integration by parts).
Let’s attach a physical meaning to this definition. We think of mass being “continuously
distributed” on an infinite rod such that the mass of the segment between a and b is given by
the integral above.
Definition 14.2 (probability density). We say that f(x) is a probability density function or, simply, density, if it is a mass density such that the total mass is one:

    ∫_{−∞}^∞ f(x) dx = 1.


PROBLEM 14.1 (an unbounded density). For which c is

    f(x) = { c/√|x|, if 0 < |x| ≤ 1;  0, otherwise }

a probability density?
Answer. We have

    ∫_{−1}^1 c/√|x| dx = 4c,

and since this must be equal to 1 we have c = 1/4. 

PROBLEM 14.2 (some properties of densities). Can a probability density function f
(1) be unbounded?
(2) satisfy lim_{x→∞} f(x) = ∞?
(3) satisfy lim sup_{x→∞} f(x) = ∞?
(4) be discontinuous?
Answer. (1) Yes. The f of Problem 14.1 is unbounded.
(2) No, because if it did then for any K > 0 there would be an x_0 > 0 such that f(x) ≥ K for all x ≥ x_0. But then ∫_{−∞}^∞ f(x) dx ≥ ∫_{x_0}^∞ K dx = ∞, and this is impossible.
(3) Yes. For example, consider "thin" rectangles at positive integers n with heights tending to infinity as n → ∞. Choose them so thin that the total area is 1. E.g., let

    f(x) = c ∑_{n=1}^∞ n 1_{|x−n|≤3^{−n}},

for some c > 0. Thus the set of points on the plane below the graph of f is a collection of rectangles, where the n-th rectangle has base length 2·3^{−n} and height n, so its area is 2n·3^{−n}. The sum of these areas is

    ∑_{n=1}^∞ 2n·3^{−n} = 3/2.

So if we choose c = 2/3 we make ∫_{−∞}^∞ f(x) dx = 1, as it should. So f is a density. But lim sup_{x→∞} f(x) ≥ lim_{n→∞} f(n) = lim_{n→∞} cn = ∞.
(4) Yes. For example, the density 1_{a<x<b}/(b − a) appearing below is discontinuous at a and b. 

PROBLEM 14.3 (examples and counterexamples of densities). Which of the following functions are mass densities and which probability densities? (1) f(x) = x, (2) f(x) = x², (3) f(x) = e^{−x}, (4) f(x) = log x · 1_{x≥1}, (5) f(x) = e^{−x} 1_{x>0}, (6) f(x) = 1_{a<x<b}, (7) f(x) = (1/(b−a)) 1_{a<x<b}, (8) f(x) = (1/2) sin(x) 1_{0<x<π}, (9) f(x) = 1/|x|, (10) f(x) = 1_{x∈Z}.
Answer.

    f(x)                        mass density?   probability density?
    x                           no              no
    x²                          yes             no
    e^{−x}                      yes             no
    log x · 1_{x≥1}             yes             no
    e^{−x} 1_{x>0}              yes             yes
    1_{a<x<b}                   yes             no, unless b − a = 1
    (1/(b−a)) 1_{a<x<b}         yes             yes
    (1/2) sin(x) 1_{0<x<π}      yes             yes
    1/|x|                       no              no
    1_{x∈Z}                     yes             no

Here is the reasoning. (1) is not a mass density because it takes negative values. (9) is not a mass density because its integral from −1 to 1 is ∞. (2), (3) and (4) are mass densities because they are piecewise continuous and so have finite integrals over bounded intervals. (5) is a probability density because ∫_0^∞ e^{−x} dx = 1. (6) has ∫_{−∞}^∞ 1_{a<x<b} dx = b − a, so it is a probability density iff b − a = 1. (7) is a probability density. (8) is a probability density because ∫_{−∞}^∞ (1/2) sin(x) 1_{0<x<π} dx = ∫_0^π (1/2) sin(x) dx = (1/2)(cos(0) − cos(π)) = 1. (10) is a mass density but not a probability density because ∫_{−∞}^∞ 1_{x∈Z} dx = 0. 
A mass density like (10) in this problem is a trivial mass density because the total mass is
zero! The reason for this is that

the integral of a mass density that is zero except on a countable set is zero.

This statement is actually true for the Lebesgue integral but not for the Riemann integral. It
is true for both if the set of nonzero values of f is not just countable but it also has no limit
points.

Theorem 14.1 (a probability density defines a unique probability measure). A probability density f on R defines a probability measure Q on B (defined in Definition 12.2) by the formula

    Q(B) = ∫_B f(x) dx := ∫_R 1_B(x) f(x) dx,  B ∈ B.

We cannot prove this here because we really need the definition of Lebesgue integral.
However, when the set B is simple enough, e.g. an interval, we can, in principle, compute the
integral as a Riemann integral. Of course, the existence of a probability measure Q implies the
existence of some random variable X whose distribution is Q. If Q has a name then X has also
the same name.

Definition 14.3. If Q is a probability measure on B such that

    Q(B) = ∫_B f(x) dx,  B ∈ B,

for some probability density f, then we call f the (probability) density of Q.



Definition 14.4. A subset N of R is said to be of zero-length if for any ε > 0 there is a sequence of intervals I_1, I_2, . . . such that

    N ⊂ ⋃_{n=1}^∞ I_n  and  ∑_{n=1}^∞ length(I_n) < ε.

?PROBLEM 14.4 (zero-length sets). Explain why a countable set is of zero-length.
Answer. If N is countable, we can enumerate its elements: a_1, a_2, . . . Fix ε > 0. Consider the intervals I_n = (a_n − ε2^{−(n+1)}, a_n + ε2^{−(n+1)}). Obviously, a_n ∈ I_n for all n and so N is included in the union of these intervals. But length(I_n) = 2·ε2^{−(n+1)} = ε2^{−n} and ∑_{n=1}^∞ length(I_n) = ε ∑_{n=1}^∞ 2^{−n} = ε. 
The importance of the concept of zero-length sets is this:

If f is a probability density for the probability measure Q then any function g that differs from f on a zero-length set is also a probability density for the same probability measure Q.

Expectation and variance of a random variable with density. We next define the expectation of a random variable X with density f by the formula

    E(X) = ∫_{−∞}^∞ x f(x) dx,

whenever this integral makes sense. If Y = g(X) is a function of X we define

    E g(X) = ∫_{−∞}^∞ g(x) f(x) dx,     (14.1)

whenever this integral makes sense. Just as in the discrete case we define

    var(X) = E(X²) − (EX)².

For now, accept these definitions as done "by analogy" to the discrete case.

PROBLEM 14.5 (semicircle density). Let f(x) = c√(1 − x²) 1_{−1<x<1}. What should c be so that f is a probability density function? The probability measure defined by this f is called the "semicircle law" for obvious reasons. Let X be a random variable with this law. Find E(X) and var X.
Answer. We need

    1 = ∫_{−∞}^∞ f(x) dx = c ∫_{−1}^1 √(1 − x²) dx = cπ/2.

(I made the change of variables x = cos θ to perform this integral.) So

    c = 2/π.

We have

    E(X) = ∫_{−1}^1 x f(x) dx = ∫_{−1}^0 x f(x) dx + ∫_0^1 x f(x) dx = −∫_0^1 x f(x) dx + ∫_0^1 x f(x) dx = 0.

I actually didn't have to compute anything. I just observed that f(x) = f(−x). Further,

    var X = E(X²) = (2/π) ∫_{−1}^1 x² √(1 − x²) dx = 1/4. 

14.2 Distribution functions


If f is a probability density function then the function

    F(x) := ∫_{−∞}^x f(y) dy

is called the distribution function. If X is a random variable with density f then

    F(x) = P(X ≤ x).

Therefore,

    P(a < X ≤ b) = F(b) − F(a).

If f is piecewise continuous then we can easily see from the first fundamental theorem of Calculus that

    f(x) = F′(x) = (d/dx) F(x)  for every x at which f is continuous.

Hence F is an antiderivative of f.

PROBLEM 14.6 (triangular density and its distribution function). Consider the triangular density defined by f(x) = 1 − x when 0 ≤ x ≤ 1 and f(x) = 1 + x for −1 ≤ x ≤ 0. Below −1 and above +1 we let f(x) = 0. Show that f is a probability density function and compute F(x).
Answer. The graph of f consists of two triangles, each of area 1/2. Hence ∫_{−∞}^∞ f(x) dx = ∫_{−1}^1 f(x) dx = 1. So it is a probability density function. We obviously have F(x) = 0 if x < −1 and F(x) = 1 if x > 1. For x between −1 and 0 we have F(x) = ∫_{−1}^x (1 + y) dy = (1/2)(1 + x)². For x between 0 and 1 we have F(x) = F(0) + ∫_0^x (1 − y) dy = 1/2 + (1/2)(1 − (1 − x)²). 

14.3 Some common laws

14.3.1 The uniform distribution on a bounded interval


Let I be a bounded interval with endpoints a, b, a < b. Define

    f(x) = (1/(b−a)) 1_{a<x<b}.

The probability measure Q having density f is called the uniform probability measure on [a, b]; its density is called the uniform density on [a, b]. Of course, by what we said above, we can change f on any zero-length set and we still have a density for the same measure. Thus, Q is unique but its density is not unique. Neither is any random variable whose distribution is Q. For example, the function

    f̃(x) = (1/(b−a)) 1_{a≤x≤b}

is also a density for the same Q because ∫_B f(x) dx = ∫_B f̃(x) dx for all B. We denote Q by the symbol unif([a, b]).

PROBLEM 14.7 (distribution function, expectation and variance of unif([a, b])). Let U be a unif([a, b]) random variable. Explain why

    P(U ≤ x) = { 0, if x ≤ a;  (x − a)/(b − a), if a < x < b;  1, if x ≥ b }

and why

    P(U = x) = 0, for all x ∈ R,
    E(U) = (a + b)/2,  var(U) = (a − b)²/12.

Answer. It is clear that P(a ≤ U ≤ b) = 1, so P(U ≤ x) = 0 if x ≤ a and P(U ≤ x) = 1 if x ≥ b. For a < x < b we have P(U ≤ x) = ∫_a^x (1/(b−a)) dy = (x − a)/(b − a). Also, P(U = x) = 0 for every x because x ↦ P(U ≤ x) is a continuous function. For the expectation, we have

    E(U) = (1/(b−a)) ∫_a^b x dx = (1/(b−a))(b²/2 − a²/2) = (a + b)/2.

Also,

    E(U²) = (1/(b−a)) ∫_a^b x² dx = (1/(b−a))(b³/3 − a³/3).

Since b³ − a³ = (b − a)(b² + ab + a²), we have

    E(U²) = (b² + ab + a²)/3.

Therefore,

    var(U) = E(U²) − (EU)² = (b² + ab + a²)/3 − (a² + 2ab + b²)/4 = (a − b)²/12.

PROBLEM 14.8 (unif([a, b]) from unif([0, 1])). We can transform a unif([0, 1]) random variable to obtain a unif([a, b]) random variable by a very simple function. Find this function.

Figure 14.1: We use Thales' theorem (which has been known for at least 2500 years) for the two triangles with parallel sides, one whose side is from a to x and the other whose side is from a to b, to obtain (14.2).

Answer. Just map the interval [a, b] onto the interval [0, 1] by a straight line. Observe that

    (y − 0)/(x − a) = (1 − 0)/(b − a).     (14.2)

Hence the function is

    y = (x − a)/(b − a).

Thus if X is unif([a, b]) then

    Y = (X − a)/(b − a)

is unif([0, 1]). It is actually the converse we are being asked, so we solve for X to get X = (b − a)Y + a. 
Since Y is unif([0, 1]) it is easier to compute things for Y (we got rid of the annoying
constants) first.
?PROBLEM 14.9 (we can't choose uniformly at random from the set of real numbers). Explain why there is no uniform probability distribution on the whole of R.
Answer. Because, if there were one, it would have to have constant density: f(x) = c for all x. Since f should be a probability density function, ∫_{−∞}^∞ f(x) dx = 1. If c > 0 then the left-hand side is ∞, so we get ∞ = 1. Impossible. If c = 0 then the left-hand side is 0, so 0 = 1. Again impossible. 

14.3.2 The exponential distribution


Let λ > 0 and define the function

    f_λ(x) = λ e^{−λx} 1_{x>0}.

Note that f_λ(x) ≥ 0 for all x and ∫_{−∞}^∞ f_λ(x) dx = 1 because ∫_0^∞ e^{−λx} dx = 1/λ. Hence f_λ is a density for some probability measure P_λ that we denote by expon(λ). Let τ be a random variable with law P_λ. We call τ an expon(λ) random variable.
PROBLEM 14.10 (distribution function and moments of expon(λ)). Show that

    P(τ > t) = P(τ ≥ t) = e^{−λt},  t ≥ 0.

This answers the question (13.10). Also show that

    E(τ) = 1/λ,  var(τ) = 1/λ².

Answer. We have, by the change of variable y = λx,

    P(τ > t) = ∫_t^∞ λe^{−λx} dx = ∫_{λt}^∞ e^{−y} dy.

Since (d/dy)e^{−y} = −e^{−y} we can replace e^{−y} by −(d/dy)e^{−y} and then use the second fundamental theorem of calculus to get ∫_{λt}^∞ e^{−y} dy = e^{−λt}, as needed (and the same value is obtained for P(τ ≥ t), since the endpoint contributes nothing to the integral). We can compute E(τ) in many ways. But we will do a trick, since the integral depends on the parameter λ. Since

    ∫_0^∞ e^{−λt} dt = 1/λ

for all λ, we can differentiate both sides with respect to λ. We can justify that we can interchange the derivative and the integral to get

    ∫_0^∞ (d/dλ) e^{−λt} dt = (d/dλ)(1/λ).

But (d/dλ)e^{−λt} = −t e^{−λt} and (d/dλ)(1/λ) = −1/λ². So

    ∫_0^∞ t e^{−λt} dt = 1/λ².     (14.3)

Multiply both sides by λ to get

    ∫_0^∞ t λe^{−λt} dt = 1/λ.

But λe^{−λt} = f_λ(t), so the left-hand side equals E(τ). Now differentiate (14.3) once more (and again multiply by λ) to obtain

    E(τ²) = 2/λ².

Hence

    var(τ) = 2/λ² − (1/λ)² = 1/λ².

The constant λ is called the rate of τ.
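The moment formulas E(τ) = 1/λ and var(τ) = 1/λ² can be checked with a crude Riemann sum for ∫_0^T t^k λe^{−λt} dt. A Python sketch (the step size h and truncation T are my own choices; the neglected tail is of order e^{−λT}):

```python
from math import exp

def expon_moment(lam, k, h=1e-3, T=20.0):
    # midpoint Riemann sum for the k-th moment integral of expon(lam)
    n = int(T / h)
    return sum(((i + 0.5) * h) ** k * lam * exp(-lam * (i + 0.5) * h) * h
               for i in range(n))

lam = 2.0
m1 = expon_moment(lam, 1)
m2 = expon_moment(lam, 2)
print(m1, m2 - m1**2)   # close to 1/lam = 0.5 and 1/lam^2 = 0.25
```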

PROBLEM 14.11 (scaling of an expon(λ) random variable). Explain why if σ is expon(1) then
σ/λ is expon(λ).
Answer. We have P(σ > x) = e^{−x}, x ≥ 0. Hence P(σ/λ > t) = P(σ > λt) = e^{−λt}, t ≥ 0. 
Recall that ⌈·⌉ denotes the upper integer part; see (13.9).

PROBLEM 14.12 (discretizing an expon(λ) r.v. gives a geometric r.v.). Let τ be an expon(λ) random variable. Explain why

    ⌈τ⌉ is a geo(p) random variable

and express p in terms of λ.
Answer. ⌈τ⌉ takes values 1, 2, . . . Let n be a nonnegative integer. Observe that

    ⌈τ⌉ > n ⇐⇒ τ > n.

Therefore

    P(⌈τ⌉ > n) = P(τ > n) = e^{−λn} = (e^{−λ})^n.

Comparing with the expression (13.6) we see that indeed ν = ⌈τ⌉ is a geometric random variable. Writing e^{−λ} = 1 − p we find

    p = 1 − e^{−λ}.
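We can also check the identity ⌈τ⌉ ~ geo(1 − e^{−λ}) at the level of pmfs: P(⌈τ⌉ = n) = P(n − 1 < τ ≤ n) = e^{−λ(n−1)} − e^{−λn}, which should equal the geo(p) pmf with p = 1 − e^{−λ}. A Python sketch of this comparison (λ is my own choice):

```python
from math import exp

lam = 0.7
p = 1 - exp(-lam)                # the geometric parameter from Problem 14.12
for n in range(1, 6):
    ceil_tau_pmf = exp(-lam * (n - 1)) - exp(-lam * n)   # P(n-1 < tau <= n)
    geo_pmf = (1 - p) ** (n - 1) * p                      # geo(p) pmf at n
    assert abs(ceil_tau_pmf - geo_pmf) < 1e-12
print("ceil(tau) has the geo(1 - e^{-lam}) pmf")
```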
?PROBLEM 14.13 (the memoryless property of an expon(λ) random variable). Explain why
an expon(λ) random variable has the memoryless property:

P(τ − t > s|τ > t) = P(τ > s), t, s ≥ 0.

Answer. The conditional probability on the left equals the joint probability P(τ − t > s, τ > t) divided by P(τ > t). But P(τ − t > s, τ > t) = P(τ > t + s) = e^{−λ(t+s)} and P(τ > t) = e^{−λt}. Dividing the two we get e^{−λs}. 

14.3.3 The normal law


Consider the function

    f(x) = C e^{−x²/2},

where C is a positive constant, chosen so that f is a probability density function. We thus must have

    1/C = ∫_{−∞}^∞ e^{−x²/2} dx.

Equivalently,

    1/C² = (∫_{−∞}^∞ e^{−x²/2} dx)² = (∫_{−∞}^∞ e^{−x²/2} dx)(∫_{−∞}^∞ e^{−y²/2} dy).
Note that the function
    (x, y) ↦ x² + y²
is invariant under rotations around (0, 0) and you don’t have to do any algebra to verify this;
just remember that x2 + y2 is the square of the distance of the point (x, y) from (0, 0) and you
“know” that distances of points from (0, 0) do not change when we rotate. (Please feel free to
ask yourselves how you know that!) Hence we change coordinates from Cartesian to polar.
That is, we set
x = r cos θ, y = r sin θ,
and (remember your Calculus!) so

    1/C² = ∫_0^{2π} ∫_0^∞ e^{−r²/2} r dr dθ = (∫_0^∞ e^{−r²/2} r dr)(∫_0^{2π} dθ).

To do the first integral change variable: s = r²/2. Then the integral becomes ∫_0^∞ e^{−r²/2} r dr = ∫_0^∞ e^{−s} ds = [−e^{−s}]_0^∞ = 1. The second integral is trivial: ∫_0^{2π} dθ = 2π. Hence 1/C² = 2π and so C = 1/√(2π). Thus

    f(x) = (1/√(2π)) e^{−x²/2}
is a probability density. The corresponding probability measure is called standard normal
and is denoted by N(0, 1). Any random variable with this distribution is called a N(0, 1) random
variable.
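The value C = 1/√(2π), i.e. ∫_{−∞}^∞ e^{−x²/2} dx = √(2π), is easy to confirm numerically. A Python sketch using a plain midpoint Riemann sum (the truncation at ±10 is my own choice; the tails are negligible):

```python
from math import exp, pi, sqrt

h = 1e-3
# midpoint Riemann sum of e^{-x^2/2} over [-10, 10]
total = sum(exp(-(-10 + (i + 0.5) * h) ** 2 / 2) * h for i in range(int(20 / h)))
print(total, sqrt(2 * pi))   # both about 2.5066
```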

?PROBLEM 14.14 (expectation and variance of the standard normal law). Show that if X is N(0, 1) then

    E(X) = 0,  var(X) = 1.

Answer. We have that

    E(X) = ∫_{−∞}^∞ x f(x) dx = ∫_0^∞ x f(x) dx + ∫_{−∞}^0 x f(x) dx.

Since e^{−x²/2} ≤ e^{−x} for x ≥ 2, it follows that I = ∫_0^∞ x f(x) dx < ∞. Since ∫_{−∞}^0 x f(x) dx = −I, it follows that E(X) = I − I = 0. We have v := var(X) = E(X²) = ∫_{−∞}^∞ x² f(x) dx. We will simply write ∫ instead of ∫_{−∞}^∞. Since ∫ f(y) dy = 1 we have

    v = ∫ x² f(x) dx ∫ f(y) dy = ∬ x² f(x) f(y) dx dy.

Interchanging the dummy variables x and y we also have

    v = ∬ y² f(x) f(y) dx dy.

Hence

    2v = ∬ (x² + y²) f(x) f(y) dx dy = (1/2π) ∬ (x² + y²) e^{−(x²+y²)/2} dx dy.

Passing on to polar coordinates,

    2v = (1/2π) ∫_0^{2π} ∫_0^∞ r² e^{−r²/2} r dr dθ.

Integrating with respect to θ first gives 2π, which cancels with the constant in front. Changing variables by r²/2 = s we have

    2v = ∫_0^∞ r² e^{−r²/2} r dr = ∫_0^∞ 2s e^{−s} ds = 2,

whence v = 1. 
The meaning of the 0 and 1 in the symbol N(0, 1) is that a standard normal random variable
has expectation 0 and variance 1.

Figure 14.2: Various normal densities. The blue, red and yellow curves have mean 0 but different
variances. The larger the variance the flatter the curve. The green curve has negative mean. © in
public domain

Definition 14.5 (N(µ, σ²)). We define the N(µ, σ²) distribution to be the distribution of σX + µ where X is a N(0, 1) random variable.

See Figure 14.2 for plots of various normal densities.


?PROBLEM 14.15 (density of the non-standard normal law). Show that the density of N(µ, σ²) is

    f_{µ,σ²}(x) = (1/√(2πσ²)) e^{−(x−µ)²/2σ²}.

Answer. See Problem 14.18 below. 


The distribution function of the standard normal,

    F(x) = ∫_{−∞}^x (1/√(2π)) e^{−y²/2} dy,

does not have a "closed form". However, we can numerically evaluate it. Here are some numbers (Table 14.1).
PROBLEM 14.16 (using the table of the normal distribution). What is the probability that a N(0, 1) random variable X takes values in the interval [−2, 2]?
Answer. Let X be N(0, 1). Then

    P(−2 ≤ X ≤ 2) = P(X ≤ 2) − P(X < −2).

But X =_d −X, so

    P(X < −2) = P(−X < −2) = P(X > 2) = 1 − P(X ≤ 2).

Hence

    P(−2 ≤ X ≤ 2) = 2P(X ≤ 2) − 1 = 2F(2) − 1 ≈ 2 × 0.977 − 1 ≈ 0.954.

    x     F(x)       x     F(x)
    0     0.5        1.0   0.8413
    0.1   0.5398     1.1   0.8643
    0.2   0.5793     1.2   0.8849
    0.3   0.6179     1.3   0.9032
    0.4   0.6554     1.4   0.9192
    0.5   0.6915     1.5   0.9332
    0.6   0.7257     1.6   0.9452
    0.7   0.7580     1.7   0.9554
    0.8   0.7881     1.8   0.9641
    0.9   0.8159     1.9   0.9713
    1.0   0.8413     2.0   0.9772

Table 14.1: Values of the distribution function for N(0, 1)

PROBLEM 14.17 (using the table of the normal distribution). What is the probability that a N(5, 16) random variable takes values in the interval [1, 13]?
Answer. If X is N(5, 16) then Z = (X − 5)/4 is N(0, 1), hence

    P(1 ≤ X ≤ 13) = P((1 − 5)/4 ≤ (X − 5)/4 ≤ (13 − 5)/4) = P(−1 ≤ Z ≤ 2) = F(2) + F(1) − 1 ≈ 0.818.
4 4 4 4


14.4 Functions of random variables with densities


The question addressed here is: If X has density f and Y = H(X), for a “nice” function H, what
is the density of Y?
I cannot answer this question unless I delve into deeper maths. So I’m only going to do a
trivial case and then provide some meaningful recipes.
Let H : R → R be a strictly increasing and differentiable function. Suppose that X is a
random variable with density fX . What is the density of Y = H(X)?
To answer this, notice that

    P(Y ≤ y) = P(X ≤ H^{−1}(y)) = ∫_{−∞}^{H^{−1}(y)} f_X(x) dx.

By the inverse function theorem, the derivative of the inverse function H^{−1} exists and

    (d/dy) H^{−1}(y) = 1/H′(H^{−1}(y)),

where H′ is the derivative function of H, which is strictly positive, by assumption. Hence the derivative of P(Y ≤ y) with respect to y exists for all y and equals (by composite differentiation)

    f_Y(y) = (d/dy) P(Y ≤ y) = f_X(H^{−1}(y))/H′(H^{−1}(y)).

If that’s too complicated to remember, recall that Calculus gives us good mnemonic rules.
Here we go.
Imagine that fX (x) is a mass distribution on R. Then the function H, being strictly increasing
and smooth, does not change the order of points and is merely smoothly transforming them.
Fix a point x and consider the “little” interval Ix with endpoints x and x + dx. Then x maps
to y = H(x) and x + dx maps to H(x + dx) ≈ H(x) + H0 (x)dx = y + dy. So Ix is approximately
mapped to J y , an interval with endpoints y and y + dy. Since order is preserved, the mass
contained in Ix is transferred to J y without losses and without additions. But the mass in Ix
is approximately fX (x)|dx| and the mass in J y is approximately fY (y)|dy|. We thus have the
preservation of mass formula
fY (y)|dy| = fX (x)|dx|,
and I boldly replaced approximate equality by equality. This “means” that

fX (x)
fY (y) = .
|dy|
|dx|

If we now replace x by H^{−1}(y), we see that we again arrive at the earlier formula. This paragraph contains nonsense, but it is nonsense that we have explained as being correct and, therefore, we can treat it as meaningful nonsense. We need an example.

Figure 14.3: How a function transforms a density.

PROBLEM 14.18 (affine transformation). Let X be N(0, 1) and let Y = 3X + 5. Then

    f_Y(y)|dy| = f_X(x)|dx|.

Since dy/dx = 3 we have 3 f_Y(y) = f_X(x). Since x = (y − 5)/3 we have

    f_Y(y) = (1/3) f_X((y − 5)/3) = (1/(3√(2π))) e^{−(y−5)²/18}. 

The recipe works in more general cases.

PROBLEM 14.19 (log-normal density). Let X be N(0, 1) and let Y = eX . Find the density of Y.
(We say that Y has a lognormal law.)

Answer. Then

    f_Y(y)|dy| = f_X(x)|dx|.

Since dy/dx = e^x we have f_Y(y) e^x = f_X(x). So f_Y(y) = e^{−x} f_X(x) and, since x = log y, we get

    f_Y(y) = (1/(y√(2π))) e^{−(log y)²/2},  y > 0.


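The lognormal density can be double-checked against the distribution function: P(Y ≤ y) = P(X ≤ log y) = Φ(log y), so f_Y(y) should equal the derivative of Φ(log y). A Python sketch comparing the formula with a numerical derivative (Φ written via math.erf; the test points are my own):

```python
from math import erf, exp, log, pi, sqrt

def Phi(x):
    # standard normal distribution function
    return 0.5 * (1 + erf(x / sqrt(2)))

def f_Y(y):
    # lognormal density derived in Problem 14.19
    return exp(-(log(y)) ** 2 / 2) / (y * sqrt(2 * pi))

h = 1e-6
for y in (0.5, 1.0, 3.0):
    numeric = (Phi(log(y + h)) - Phi(log(y - h))) / (2 * h)  # d/dy P(Y <= y)
    assert abs(numeric - f_Y(y)) < 1e-4
print("change-of-variables density matches the differentiated CDF")
```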
The recipe works if the function is smooth but not invertible.

PROBLEM 14.20 (density of the square of a r.v.). Let X have density fX (x) such that fX (x) > 0
for all x ∈ R, and let Y = X2 . Find the density of Y.

Answer. The function x ↦ y := x² is not invertible. But it has two branches: x₁ = +√y and
x₂ = −√y. Then
fY(y)|dy| = fX(x₁)|dx₁| + fX(x₂)|dx₂|,
because in order to obtain the mass in the interval with endpoints y and y + dy we have to
add the masses in the intervals with endpoints xᵢ, xᵢ + dxᵢ, i = 1, 2. We have dy/dx = 2x. Hence
|dx₁/dy| = 1/2|x₁|, |dx₂/dy| = 1/2|x₂|, and if we let x₁ = x, we have |x₂| = x. Hence

fY(y) = (fX(x) + fX(−x))/2x = (fX(√y) + fX(−√y))/(2√y),  y > 0.
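For X standard normal the two-branch formula gives the chi-square(1) density e^(−y/2)/√(2πy), whose CDF is P(|X| ≤ √y) = erf(√(y/2)). A simulation (my own illustrative check, with arbitrary seed and sample size) agrees:

```python
import math
import random

random.seed(2)

# Y = X^2 with X ~ N(0,1): the two-branch recipe predicts the chi-square(1)
# law, with CDF P(Y <= y) = P(|X| <= sqrt(y)) = erf(sqrt(y/2)).
n = 200_000
samples = [random.gauss(0.0, 1.0) ** 2 for _ in range(n)]

for y in (0.5, 1.0, 2.0):
    empirical = sum(s <= y for s in samples) / n
    assert abs(empirical - math.erf(math.sqrt(y / 2))) < 0.01
```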

?PROBLEM 14.21 (a r.v. with density that has no expectation). Someone is standing at
distance ℓ from an infinite wall and holds a gun. He takes a swing and fires. What is the
distribution of the place where the bullet lands and its expectation? To answer this, let 0
denote the closest point on the wall to the gun holder, let Θ be the angle between
the line from the gun holder to the location X of the bullet and the line from the gun holder to
0. Assume that Θ has constant density between −π/2 and π/2. Also assume that X is signed:
positive if X is to the right of 0, negative if it is to the left of it.
Answer. We have
X = ℓ tan Θ.
We have been told that Θ has density f(θ) = (1/π) 1_{−π/2<θ<π/2}. If g(x) is the density of X then

g(x)dx = f(θ)dθ,

where x = ℓ tan θ. Hence dx = (ℓ/cos²θ)dθ and so

g(x) = (cos²θ/ℓ) f(θ).

Using the relation x = ℓ tan θ between the two variables, which gives cos²θ = ℓ²/(ℓ² + x²), we find

g(x) = ℓ / (π(ℓ² + x²)).

There is no restriction on x because −π/2 < θ < π/2 if and only if −∞ < x < ∞. To compute
E(X) we write

E(X) = ∫₋∞^∞ x g(x) dx = ∫₋∞^0 x g(x) dx + ∫₀^∞ x g(x) dx = ∫₀^∞ x g(x) dx − ∫₀^∞ x g(x) dx

because g(x) = g(−x). But the last two integrals in the last display do not cancel out, even
though they are identical, because

∫₀^∞ x g(x) dx = ∫₀^∞ ℓx/(π(ℓ² + x²)) dx = ∞,

because the change of variables ℓ² + x² = y², y dy = x dx, transforms this integral to
(ℓ/π) ∫_ℓ^∞ dy/y = ∞. So the answer to what E(X) is, is “it does not exist”. 
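The law of X = ℓ tan Θ is the Cauchy law, with CDF 1/2 + arctan(x/ℓ)/π. A simulation checks this (the value ℓ = 2 here, the seed and the sample size are arbitrary choices of mine):

```python
import math
import random

random.seed(3)

# X = ell * tan(Theta) with Theta uniform on (-pi/2, pi/2): the Cauchy law.
ell = 2.0
n = 200_000
samples = [ell * math.tan(random.uniform(-math.pi / 2, math.pi / 2))
           for _ in range(n)]

def cauchy_cdf(x):
    return 0.5 + math.atan(x / ell) / math.pi

for x in (-3.0, 0.0, 5.0):
    empirical = sum(s <= x for s in samples) / n
    assert abs(empirical - cauchy_cdf(x)) < 0.01
```

The CDF matches even though the expectation does not exist: sample means of Cauchy variables do not settle down no matter how large n is.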

14.5 Densities in higher dimensions


Why do we need many dimensions? Here is one reason:

PROBLEM 14.22 (dimensions are not necessarily physical dimensions). 1 A certain com-
pany, abiding by surveillance capitalism principles, collects 5 kinds of data for each customer:
the time x1 that the measurement is taken, their height x2 at this time, their location on the
planet (two numbers, x3, x4, at this time: longitude and latitude) and their wealth x5 at this
time (negative if they are in debt). What is an appropriate sample space? The company is
interested in the event that, before the end of the year 2024, the height of the customer is larger
than 167 cm, the customer is located in the capital of Zimbabwe, and the customer’s wealth is
negative. Which set is this event?
Answer. The time x1 can, for instance, be measured in hours, starting from the year 2000 until
2100. There are 876,600 hours in 100 years, so 0 ≤ x1 ≤ 876, 600. The height can be measured
in cm, and a baby is at least 40 cm, while the tallest person is at most 280 cm, so 40 ≤ x2 ≤ 280.
Longitude is an angle, −180 ≤ x3 ≤ 180, if measured in Sumerian units (degrees). Latitude is
also an angle, −90 ≤ x4 ≤ 90. The richest person on earth is worth 250 billion dollars. Let’s say
that the maximum debt of someone is at most 100 billion dollars. So −100 ≤ x5 ≤ 250. We
thus take

Ω = {(x1, x2, x3, x4, x5) : 0 ≤ x1 ≤ 876,600, 40 ≤ x2 ≤ 280,


− 180 ≤ x3 ≤ 180, −90 ≤ x4 ≤ 90, −100 ≤ x5 ≤ 250}

as a sample space. The given event is

A = {(x1, x2, x3, x4, x5) ∈ Ω : x1 ≤ 219,150, x2 > 167, x3 = 31.05, x4 = −17.8, x5 < 0}.


1
It is really awful when I ask students in Calculus to compute the area of, say, the set {(x, y) ∈ R2 : y2 ≤ x2 e−x }
and they correctly answer 1, but then they attach a unit and say 1m2 (square meter). Who said that x = 1 means
1m? This is not just bad taste, but also bad for business.

The point of this problem is that Rd does not have to be a space of physical dimensions.
The concept of mass densities and probability densities of one variable, introduced earlier,
generalizes tit for tat for many variables.
We quickly give the definitions for 2 variables which can easily be generalized to many
variables.

We say that f(x, y), (x, y) ∈ R², is a mass density if f is a measurable function such
that f(x, y) ≥ 0 everywhere and ∬_I f(x, y) dxdy < ∞ for every bounded rectangle I.
It is a probability density if, in addition, ∬_{R²} f(x, y) dxdy = 1.

If f (x, y) is a probability density function then we can define a probability measure via
"
Q(B) = f (x, y)dxdy, B ∈ B 2 ,
B

where B 2 is defined in Definition 12.2.


A random vector in R2 is a function from Ω → R2 . Hence it is represented as (X, Y) where
X : Ω → R, Y : Ω → R are random variables. Given a probability density function f (x, y) we
always have a random vector (X, Y) whose distribution is the probability measure Q defined
by f .
A subset N of R² is said to be of zero area if for any ε > 0 there is a sequence of rectangles
I₁, I₂, . . . such that

N ⊂ ⋃_{n=1}^∞ Iₙ  and  ∑_{n=1}^∞ area(Iₙ) < ε.

Any countable subset of R2 has zero area. Any line in R2 has zero area and, more generally,
any piecewise smooth curve has zero area.
We can modify a probability density function on a zero-area set and get another density
function for the same probability measure.

14.5.1 Probability distribution function of a random vector with density


The probability distribution function corresponding to a probability density function f(x, y) is
given by

F(x, y) = ∫₋∞^x ∫₋∞^y f(u, v) du dv.
If f(x, y) is “nice enough” then

f(x, y) = ∂²F(x, y)/∂x∂y.                                    (14.4)

If (X, Y) is a random vector with distribution f (x, y) then

F(x, y) = P(X ≤ x, Y ≤ y). (14.5)

?PROBLEM 14.23 (help me give definitions). Consider the random vector (X, Y, Z), that is,
3 measurable functions with domain a common Ω and codomain R each, and
(1) define the concept of probability density function f (x, y, z) for (X, Y, Z);
(2) define the concept of probability distribution function F(x, y, z) for (X, Y, Z);
(3) give the relations between f and F;
(4) generalize to n random variables X1 , . . . , Xn .
Answer. (1) We say that (X, Y, Z) has probability density function f(x, y, z) if

P((X, Y, Z) ∈ B) = ∭_B f(x, y, z) dxdydz,

for B ⊂ R³.
(2) The distribution function F(x, y, z) is defined by

F(x, y, z) = P(X ≤ x, Y ≤ y, Z ≤ z)

(3) We derive F from f via

F(x, y, z) = ∫₋∞^x ∫₋∞^y ∫₋∞^z f(u, v, w) du dv dw.

We derive f from F (when F is smooth enough) via

f(x, y, z) = ∂³F(x, y, z)/∂x∂y∂z.

(4) We say that (X1, . . . , Xn) has probability density function f(x1, . . . , xn) if

P((X1, . . . , Xn) ∈ B) = ∫ · · · ∫_B f(x1, . . . , xn) dx1 · · · dxn,  B ⊂ Rⁿ.

It has distribution function F(x1, . . . , xn) defined by

F(x1, . . . , xn) = ∫₋∞^{x1} · · · ∫₋∞^{xn} f(u1, . . . , un) du1 · · · dun

and then

f(x1, . . . , xn) = ∂ⁿF(x1, . . . , xn)/∂x1 · · · ∂xn.


14.5.2 Marginal densities; conditional densities

?PROBLEM 14.24 (marginal densities). If (X, Y) is a random vector in R2 with density f (x, y)
find the density of X and the density of Y.
Answer. Let us use the letter f1 for the density of X, and f2 for the density of Y. We have

P(X ≤ x) = P(X ≤ x, Y < ∞) = ∫₋∞^x (∫₋∞^∞ f(u, v) dv) du.

Hence

f1(x) = (d/dx) ∫₋∞^x (∫₋∞^∞ f(u, v) dv) du = ∫₋∞^∞ f(x, v) dv.

Similarly,

f2(y) = ∫₋∞^∞ f(u, y) du.

We call f1 (x) the first marginal of f (x, y) and f2 (y) the second marginal.
In analogy to discrete probability we can define the conditional density of X given Y = y
by the formula

f1|2(x|y) = f(x, y)/f2(y).                                   (14.6)
We note that f1|2(x|y), as a function of x, is a probability density function because

∫₋∞^∞ f1|2(x|y) dx = (1/f2(y)) ∫₋∞^∞ f(x, y) dx = f2(y)/f2(y) = 1, for all y.

We can compute conditional probabilities of X if we know the value y of Y by the formula

P(X ∈ B|Y = y) = ∫_B f1|2(x|y) dx.
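The marginal and conditional formulas can be tried on a concrete joint density. The density f(x, y) = x + y on [0, 1]², the grid size, and the test points below are my own illustrative choices, not from the text; for it, f1(x) = x + 1/2 and f1|2(x|y) = (x + y)/(y + 1/2).

```python
# Joint density f(x, y) = x + y on [0,1]^2 (an illustrative choice).
# Check numerically that the marginal is f1(x) = x + 1/2 and that the
# conditional density of X given Y = y integrates to 1.
N = 10_000
dx = 1.0 / N

def f(x, y):
    return x + y if 0 <= x <= 1 and 0 <= y <= 1 else 0.0

def f1(x):  # marginal of X: integrate out y by a midpoint Riemann sum
    return sum(f(x, (j + 0.5) * dx) for j in range(N)) * dx

assert abs(f1(0.3) - (0.3 + 0.5)) < 1e-6

# the conditional density f_{1|2}(x | y) integrates to 1 in x, for any y
y = 0.7
f2y = y + 0.5
total = sum(f((i + 0.5) * dx, y) / f2y for i in range(N)) * dx
assert abs(total - 1.0) < 1e-6
```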

?PROBLEM 14.25 (marginal and conditional densities). Let f (x1 , . . . , xn ) be a probability


density function for the random vector (X1 , . . . , Xn ). Derive a formula for
(1) the density of X1
(2) the density of (X1 , X2 )
(3) the conditional density of (X1 , X2 , X3 ) given X4 = x4 , X5 = x5
Answer. (1)

f1(x1) = ∫₋∞^∞ · · · ∫₋∞^∞ f(x1, x2, . . . , xn) dx2 · · · dxn.

(2)

f1,2(x1, x2) = ∫₋∞^∞ · · · ∫₋∞^∞ f(x1, x2, . . . , xn) dx3 · · · dxn.

(3)

f1,2,3|4,5(x1, x2, x3|x4, x5) = f1,2,3,4,5(x1, x2, x3, x4, x5) / f4,5(x4, x5).


Expectation of a function of a random vector with density


If g(X, Y) is a given function of (X, Y) then we can compute its expectation via

E[g(X, Y)] = ∬_{R²} g(x, y) f(x, y) dxdy.

?PROBLEM 14.26 (expectation of a function of many r.v.s). Let f (x1 , . . . , xn ) be a probability


density function for the random vector (X1 , . . . , Xn ). Let g : Rn → R be a function. Write a
formula for E[g(X1 , . . . , Xn )].
Answer.

E[g(X1, . . . , Xn)] = ∫₋∞^∞ · · · ∫₋∞^∞ g(x1, x2, . . . , xn) f(x1, x2, . . . , xn) dx1 · · · dxn.

14.5.3 Independence of random variables


If the density f(x, y) of (X, Y) is of product form, that is,

f(x, y) = f1(x) f2(y)                                        (14.7)

then X, Y are independent random variables, a notion that we have encountered in elementary
probability and that we shall encounter again.
Notice that if f, g are probability density functions on R then the formula f(x)g(y) gives a
probability density function on R². This corresponds to a pair (X, Y) of independent random
variables.
If X, Y are independent then

P(X ∈ A, Y ∈ B) = P(X ∈ A)P(Y ∈ B)

because

P(X ∈ A, Y ∈ B) = ∫_A ∫_B f(x, y) dxdy
               = ∫_A ∫_B f1(x) f2(y) dxdy
               = ∫₋∞^∞ ∫₋∞^∞ 1_{x∈A} f1(x) 1_{y∈B} f2(y) dxdy
               = (∫₋∞^∞ 1_{x∈A} f1(x) dx)(∫₋∞^∞ 1_{y∈B} f2(y) dy)
               = P(X ∈ A)P(Y ∈ B).
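The factorization is easy to see numerically. Below is a check of my own (two independent expon(1) variables, with arbitrary events A = [0, 1] and B = [2, ∞), seed, and sample size):

```python
import math
import random

random.seed(4)

# When f(x,y) = f1(x) f2(y), probabilities of events about X and Y factorize.
n = 200_000
pairs = [(random.expovariate(1.0), random.expovariate(1.0)) for _ in range(n)]

p_joint = sum(x <= 1 and y >= 2 for x, y in pairs) / n
p_A = 1 - math.exp(-1)          # P(X <= 1) for expon(1)
p_B = math.exp(-2)              # P(Y >= 2) for expon(1)
assert abs(p_joint - p_A * p_B) < 0.005
```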

We can generalize to n random variables X1 , . . . , Xn and say they are independent if the
density of (X1 , . . . , Xn ) is the product of the n marginal densities:

f (x1 , . . . , xn ) = f1 (x1 ) · · · fn (xn )


CHAPTER 14. DENSITIES PER SE 131

?PROBLEM 14.27 (expectation of product under independence). Assume that X1, . . . , Xn
are independent and that n functions are given: gi : R → R, i = 1, . . . , n. Explain why

E[g1(X1) · · · gn(Xn)] = [Eg1(X1)] · · · [Egn(Xn)].

Answer. We omit the limits of the integrals below because they’re all over the whole of R.

E[g1(X1) · · · gn(Xn)] = ∫ · · · ∫ g1(x1) · · · gn(xn) f(x1, . . . , xn) dx1 · · · dxn
                      = ∫ · · · ∫ g1(x1) · · · gn(xn) f1(x1) · · · fn(xn) dx1 · · · dxn
                      = [∫ g1(x1) f1(x1) dx1] · · · [∫ gn(xn) fn(xn) dxn]
                      = [Eg1(X1)] · · · [Egn(Xn)].

14.5.4 Some special densities on R2

14.5.4.1 Uniform density on a finite area planar set

The uniform density on a set S ⊂ R² with finite area is denoted by unif(S) and is defined by

f(x, y) = (1/area(S)) 1_{(x,y)∈S}.

In other words, f is constant on S and 0 outside S. The constant must necessarily be the
reciprocal of the area of S so that the total integral of f equals 1.

14.5.4.2 The standard normal density on the plane

The standard normal density on R² is obtained by multiplying together the N(0, 1) density at
x times the N(0, 1) density at y. This gives

f(x, y) = (1/2π) e^(−(x²+y²)/2).

?PROBLEM 14.28 (uniform law on a rectangle begets independence). Let (X, Y) be a
random vector with unif([a1, b1] × [a2, b2]) distribution. Write a formula for a density for (X, Y)
and explain why X and Y are independent.
and explain why X and Y are independent.
Answer. The area of S = [a1, b1] × [a2, b2] is (b1 − a1)(b2 − a2). Hence a density for the
unif([a1, b1] × [a2, b2]) distribution is

f(x, y) = (1/((b1 − a1)(b2 − a2))) 1_{(x,y)∈[a1,b1]×[a2,b2]}.

But
(x, y) ∈ [a1, b1] × [a2, b2] ⇐⇒ a1 ≤ x ≤ b1 and a2 ≤ y ≤ b2,

so

f(x, y) = [(1/(b1 − a1)) 1_{a1≤x≤b1}] · [(1/(b2 − a2)) 1_{a2≤y≤b2}],

where the first bracket is a function of x only and the second a function of y only.
Since the variables “separate”, the random variables X, Y are independent. 

14.5.5 Change of variables


This extends the topic of Section 14.4 to higher dimensions.
Let
X = (X1 , . . . , Xn )
be a random vector in Rⁿ, where n is a positive integer. This means that each Xi is a measurable
function on some sample space Ω, where Ω is the same for all of them. If Ω is equipped with
a probability measure P then we denote by Q the law of X, that is, the image of P under X.
Assume that Q is given by a probability density function denoted by f(x) = f(x1, . . . , xn). That
is,

Q(B) = P(X ∈ B) = ∫_B f(x1, . . . , xn) dx1 · · · dxn,  B ∈ Bⁿ.

We use a shorthand ∫_B rather than ∫ · · · ∫. In fact, we use another shorthand: writing
x = (x1, . . . , xn), the above becomes

Q(B) = P(X ∈ B) = ∫_B f(x) dx.

Further assume that there is an open set U ⊂ Rn and a function

H = (H1 , . . . , Hn ) : Rn → Rn ,

such that
V = H(U) is an open set
and
H:U→V
is a bijection that is continuously differentiable with inverse function

H−1 : V → U

being also continuously differentiable. Sometimes U and V are Rn itself, but they don’t have
to be. One reason that we want U (and, similarly, V) to be open is so that we have some space
to move around a given point x ∈ U which is needed in order to differentiate. Remember that
derivative is a limit and to be able to talk about a limit when x0 → x, say, we need x0 to freely
move in a small neighborhood around x. A set U is open if any point x in it contains a small
neighborhood that is included in U.

Think of the n variables x1 , . . . , xn as being changed to the new variables

y1 = H1 (x1 , . . . , xn )
.. ..
. .
yn = Hn (x1 , . . . , xn )

We define the new random variables Y1 , . . . , Yn by

Y1 = H1 (X1 , . . . , Xn )
.. ..
. .
Yn = Hn (X1 , . . . , Xn )

Obviously, Y1 , . . . , Yn are also random variables on the same Ω. After all, the last display
really means

Y1 (ω) = H1 (X1 (ω), . . . , Xn (ω))


.. ..
. .
Yn (ω) = Hn (X1 (ω), . . . , Xn (ω))

for all ω ∈ Ω. Under the conditions stated via the underlined words, the new random vector

Y = (Y1 , . . . , Yn )

has a distribution that is also given by a probability density function g(y) = g(y1, . . . , yn), that
is,

P(Y ∈ B) = ∫_B g(y) dy.

The problem is to determine g(y) from f (x) and H(x).

In order to write down the formula, we recall certain notions from Calculus. The derivative
of y = H(x) is represented as an n × n matrix whose (i, j) entry is ∂yi/∂xj:

[ ∂y1/∂x1 · · · ∂y1/∂xn ]
[    ⋮            ⋮     ]
[ ∂yn/∂x1 · · · ∂yn/∂xn ]

Of course, ∂yi/∂xj stands for (∂/∂xj)Hi(x1, . . . , xn). There are many notations for this matrix and many
names for it. We will write either

∂y/∂x = ∂(y1, . . . , yn)/∂(x1, . . . , xn)

or

H′(x)

for it, and simply call it the derivative, understanding that it is always (represented as) a matrix.² All
that is different notations for the same thing.
The determinant of this matrix is called the Jacobian of H:

Jacobian of H = det H′(x) = det(∂y/∂x).

Our underlined assumptions imply that both the derivative of H : U → V and the derivative
of the inverse function H⁻¹ : V → U exist. The inverse function theorem says that the
derivatives are invertible matrices and one is the inverse of the other:

(H⁻¹)′(y) = H′(x)⁻¹ at x = H⁻¹(y).

This can also be written as

∂x/∂y = (∂y/∂x)⁻¹.
The formula we are looking for is

If f is the density of X then the density g of Y = H(X) is

g(y) = f(x) |det(∂x/∂y)| at x = H⁻¹(y).                      (14.8)

We need to take the absolute value of the Jacobian (the Jacobian can be negative but, after all,
a density is always nonnegative).
So, here, I explained a recipe, and you have learned it. I could explain this recipe to a slave
(=computer) and instruct him/her/it (=write a program) to apply the formula. The fact that
the slave can execute my instructions does not mean that the slave has ?learned the formula.
I don’t have the time in this course, or space herein, to really ?teach it, so I will resort to some
kind of motivation.
Suppose that n = 2. (For n = 1 the situation has been discussed and is, after all, rather
trivial.)
Consider a “small” rectangle, located at x = (x1, x2) and with sides parallel to the axes
having oriented lengths dx1, dx2. By “oriented” we mean that the two vectors have the same
orientation as the standard basis of R². The probability that X = (X1, X2) is in this small
²Denote ‖x‖ := √(x1² + · · · + xn²). Then H′(x) is the unique matrix such that

lim_{h→0} ‖H(x + h) − H(x) − H′(x)h‖ / ‖h‖ = 0.

rectangle is given by the integral of its density over this rectangle. Since the rectangle is small
this is approximately equal to
f (x1 , x2 ) |dx1 dx2 |.
When (x1 , x2 ) is mapped to (y1 , y2 ) = H(x1 , x2 ), the rectangle is mapped to another small set,
but not necessarily a rectangle because the 90° angle may have been distorted by H. But it is a
small quadrilateral with sides denoted by dy1 , dy2 . The probability that Y = (Y1 , Y2 ) is in this
small quadrilateral is
g(y1 , y2 ) |dy1 dy2 |,
approximately. But the two probabilities must be the same:
g(y1 , y2 ) |dy1 dy2 | = f (x1 , x2 ) |dx1 dx2 |. (14.9)
To find g(y1 , y2 ) we must express the elementary area dx1 dx2 in terms of the elementary area
dy1 dy2 . Remember that y1 = H1 (x1 , x2 ), y2 = H2 (x1 , x2 ). The function H = (H1 , H2 ) has an
inverse. Let us denote its inverse by K = (K1 , K2 ) rather than the clumsier symbol H−1 . We
thus have
x1 = K1 (y1 , y2 )
x2 = K2 (y1 , y2 )
Taking differentials of these expressions we obtain

dx1 = (∂x1/∂y1) dy1 + (∂x1/∂y2) dy2
dx2 = (∂x2/∂y1) dy1 + (∂x2/∂y2) dy2

Hence

dx1 dx2 = (∂x1/∂y1)(∂x2/∂y1)(dy1)(dy1) + (∂x1/∂y2)(∂x2/∂y2)(dy2)(dy2)
        + (∂x1/∂y1)(∂x2/∂y2)(dy1)(dy2) + (∂x1/∂y2)(∂x2/∂y1)(dy2)(dy1).

Since the area of a small rectangle of zero width is zero, we set (dy1)(dy1) = 0 = (dy2)(dy2).
Since (dy1)(dy2) is the same as (dy2)(dy1) in magnitude but of different sign, we have (dy2)(dy1) =
−(dy1)(dy2). We thus obtain

dx1 dx2 = (∂x1/∂y1 · ∂x2/∂y2 − ∂x1/∂y2 · ∂x2/∂y1)(dy1)(dy2).

But

∂x1/∂y1 · ∂x2/∂y2 − ∂x1/∂y2 · ∂x2/∂y1 = det(∂x/∂y),

so

|dx1 dx2| = |det(∂x/∂y)| |dy1 dy2|.
Substituting this into (14.9) gives (14.8).
Of course, all that was sheer skulduggery and not a rigorous explanation. Nevertheless,
I hope you have ?learned something about the gist of all that, rather than simply learning
formula (14.8). We can generalize these skulduggerous arguments to general n.

PROBLEM 14.29 (a probability measure on R², linearly transformed). (1) Let A be an n × n
matrix with det(A) ≠ 0. Let X have density f on Rⁿ. Find the density g of Y = AX.
(2) Let next n = 2 and A = [a b; c d]. Write g explicitly in this case.
(3) What happens in the n = 2 case when ad = bc?
Answer. (1) The mapping we have is

H(x) = Ax,

where we think of x as a column and Ax is the result of matrix multiplication. We have

H0 (x) = A.

Thus A is the Jacobian matrix of H at every point. Hence the Jacobian matrix of H⁻¹ is A⁻¹. So (14.8) becomes

g(y) = f(x) |det A⁻¹| at x = H⁻¹(y) = A⁻¹y.

Thus

g(y) = f(A⁻¹y) |det A⁻¹|.

Since

|det A⁻¹| = 1/|det A|,

we can also write

g(y) = f(A⁻¹y) / |det A|.
(2) We have

[a b; c d]⁻¹ = (1/(ad − bc)) [d −b; −c a].

So

A⁻¹y = (1/(ad − bc)) [d −b; −c a][y1; y2] = (1/(ad − bc)) [dy1 − by2; −cy1 + ay2]

and so

g(y1, y2) = (1/|ad − bc|) f((dy1 − by2)/(ad − bc), (−cy1 + ay2)/(ad − bc)).
(3) If ad = bc then
cY1 = aY2.
But the set
L = {(y1, y2) ∈ R² : cy1 = ay2} ⊂ R²
is a straight line and so it has zero area. The equation cY1 = aY2 can be written as Y ∈ L. Thus the
probability distribution of Y is zero outside L. Since L is a straight line, Y cannot have a density.
(For if it did have a density g(y) we would have 1 = ∫_{R²} g(y)dy = ∫_{R²} g(y)1_L(y)dy = ∫_L g(y)dy = 0.
And we do know that 1 ≠ 0.) 
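The linear-transformation formula g(y) = f(A⁻¹y)/|det A| can be tested numerically. Everything below (the matrix A = [[2, 1], [0, 1]], standard normal input, seed, sample size, grid) is my own illustrative choice: a Monte Carlo estimate of P(Y ∈ [0, 1]²) is compared with a numerical integral of g over the same square.

```python
import math
import random

random.seed(5)

def f(x1, x2):  # standard normal density on the plane
    return math.exp(-(x1 * x1 + x2 * x2) / 2) / (2 * math.pi)

def g(y1, y2):  # g(y) = f(A^{-1} y) / |det A|, with A^{-1} = [[1/2,-1/2],[0,1]]
    return f(0.5 * y1 - 0.5 * y2, y2) / 2.0

# Monte Carlo estimate of P(Y in [0,1]^2) with Y = A X
n = 400_000
hits = 0
for _ in range(n):
    x1, x2 = random.gauss(0, 1), random.gauss(0, 1)
    y1, y2 = 2 * x1 + x2, x2          # Y = A X with A = [[2,1],[0,1]]
    hits += (0 <= y1 <= 1) and (0 <= y2 <= 1)
mc = hits / n

# midpoint-rule integral of g over the same square
m = 200
h = 1.0 / m
quad = sum(g((i + 0.5) * h, (j + 0.5) * h)
           for i in range(m) for j in range(m)) * h * h
assert abs(mc - quad) < 0.005
```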

PROBLEM 14.30 (the probability that an equation has real roots). Consider the equation

x2 + Ax + B = 0,

where A, B are random real numbers between −1 and 1. What is the probability that the
equation has real roots? To be precise, let us assume that (A, B) is a random vector with
distribution unif([−1, 1] × [−1, 1]).
Answer. The equation can be written as

0 = x² + Ax + B = x² + 2(A/2)x + (A/2)² − (A/2)² + B = (x + A/2)² − A²/4 + B,

that is,

(x + A/2)² = A²/4 − B.
For this to have a real solution we must have the right hand side nonnegative so that we can
take its square root. So

“the equation has real roots” = {B ≤ A2 /4}.

The density for (A, B) is

f(a, b) = (1/4) 1_{−1≤a≤1} 1_{−1≤b≤1}.
So

P(the equation has real roots) = ∬_{(a,b): a²−4b≥0} f(a, b) da db = (1/4) ∬ 1_{−1≤a≤1} 1_{−1≤b≤1} 1_{b≤a²/4} da db.

The last integral is over the whole of R² since the restrictions in the density and in the set over
which we integrate have been expressed via indicator functions. But the product of the last 2
indicators is
1_{−1≤b≤1} 1_{b≤a²/4} = 1_{−1≤b≤a²/4}
(because −1 ≤ a ≤ 1, so a²/4 ≤ 1/4 ≤ 1). We can choose to do the integration in any order we like. Choosing
to integrate over b first and over a next, the above integral is equal to

(1/4) ∫ 1_{−1≤a≤1} (∫ 1_{−1≤b≤a²/4} db) da.

The integral in the parenthesis is equal to a²/4 + 1. Hence the last display becomes

(1/4) ∫ 1_{−1≤a≤1} (a²/4 + 1) da = (1/4) ∫₋₁¹ (a²/4 + 1) da = (1/4)((1/4)(2/3) + 2) = 13/24 ≈ 0.542.
4 4 4 −1 4 4 4 3 26

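Since x² + Ax + B has real roots exactly when B ≤ A²/4, the probability 13/24 ≈ 0.542 is easy to check by simulation (seed and sample size below are arbitrary choices of mine):

```python
import random

random.seed(6)

# (A, B) uniform on [-1,1]^2; real roots iff B <= A^2 / 4.
n = 300_000
hits = sum(random.uniform(-1, 1) <= random.uniform(-1, 1) ** 2 / 4
           for _ in range(n))
assert abs(hits / n - 13 / 24) < 0.005
```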

If S is a finite set then the expression “choose an element of S uniformly at random”
is equivalent, by definition, to the sentence “consider a random variable whose
distribution is uniform on S”. If S is a subset of R² then the expression “choose a
point from S uniformly at random” is equivalent, by definition, to the sentence
“consider a random variable whose distribution is unif(S)”.

Figure 14.4: The ratio of the areas of the disc and the square equals the probability that a random
point chosen uniformly at random in the square actually lies in the circle. This probability is about
78.5%.

PROBLEM 14.31 (a circle in a square). What is the probability that a point chosen uniformly
at random from a square of side length ` lies in the largest inscribed disc? See Figure 14.4.
Answer. Let (X, Y) have distribution unif(S), where S is a square of side length ℓ. Hence

f(x, y) = (1/ℓ²) 1_{(x,y)∈S}

is a density for (X, Y). Let D be the largest inscribed disc. We have

P((X, Y) ∈ D) = ∬_D f(x, y) dxdy = (1/ℓ²) ∬ 1_{(x,y)∈S} 1_{(x,y)∈D} dxdy = (1/ℓ²) ∬ 1_{(x,y)∈D} dxdy = area(D)/ℓ².

Since D has radius ℓ/2, its area is πℓ²/4. So

P((X, Y) ∈ D) = π/4 ≈ 0.785.

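The ratio π/4 ≈ 0.785 of the areas is easily confirmed by throwing uniform points at the unit square (seed and sample size are my arbitrary choices):

```python
import math
import random

random.seed(7)

# A uniform point in the unit square lies in the inscribed disc
# (centre (1/2, 1/2), radius 1/2) with probability pi/4.
n = 300_000
hits = sum((random.random() - 0.5) ** 2 + (random.random() - 0.5) ** 2 <= 0.25
           for _ in range(n))
assert abs(hits / n - math.pi / 4) < 0.005
```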
PROBLEM 14.32 (uniform law on a disc begets independence). Let (X, Y) be a random
vector in R2 whose distribution is unif(D), with D be the disc centered at the origin and
having radius 1.
(1) Are X, Y independent?
(2) Express (X, Y) in polar coordinates, that is, define random variables (R, Θ) via
X = R cos Θ
Y = R sin Θ
Are (R, Θ) independent?
(3) What is the density of R? What is the density of θ?
Answer. (1) A density for (X, Y) is the function

f(x, y) = (1/π) 1_{x²+y²<1}.

This cannot be written as the product of a function of x only and a function of y only, so X, Y
are not independent.
(2) Let us find a density, say g(r, θ), for (R, Θ) via the formula

g(r, θ) = f(x, y) |det(∂(x, y)/∂(r, θ))|.

The derivative matrix of the map

x = r cos θ
y = r sin θ

is

∂(x, y)/∂(r, θ) = [∂x/∂r ∂x/∂θ; ∂y/∂r ∂y/∂θ] = [cos θ, −r sin θ; sin θ, r cos θ].

The Jacobian of the map is the determinant of this matrix. We have

det[cos θ, −r sin θ; sin θ, r cos θ] = r cos²θ + r sin²θ = r.

We thus have

g(r, θ) = (1/π) 1_{x²+y²<1} · r = (r/π) 1_{r<1}.
We substituted x, y by their expressions in terms of r, θ so 1x2 +y2 <1 = 1r2 <1 = 1r<1 , and that’s
how we arrived at the last formula. Obviously this is a function of r times function of θ (the
constant function!), and hence (R, Θ) are independent random variables.
(3) Since g(r, θ) is a constant function of θ, we have that Θ is a uniform random variable.
Uniform where? Well, Θ, being an angle, ranges between 0 and 2π. So its density is

fΘ(θ) = (1/2π) 1_{0≤θ≤2π}.

(It does not matter if I write ≤ or < in the indicator because changing a density on a zero-length
set leaves it a density for the same probability measure.) We then have

g(r, θ) = (r/π) 1_{r<1} = (1/2π) · 2r 1_{r<1}.
Hence the density for R is
fR (r) = 2r1r<1 .

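A simulation of my own (rejection sampling from the square, arbitrary seed and sample size) confirms that for a uniform point on the unit disc, R = √(X² + Y²) has CDF P(R ≤ r) = r², i.e. density 2r:

```python
import math
import random

random.seed(8)

def disc_point():
    # sample uniformly on the unit disc by rejection from [-1,1]^2
    while True:
        x, y = random.uniform(-1, 1), random.uniform(-1, 1)
        if x * x + y * y < 1:
            return x, y

n = 200_000
radii = [math.hypot(*disc_point()) for _ in range(n)]
for r in (0.3, 0.5, 0.9):
    empirical = sum(s <= r for s in radii) / n
    assert abs(empirical - r * r) < 0.01   # CDF of density 2r is r^2
```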

PROBLEM 14.33 (two explosions). Two explosions occur at two time points between 0 and
1, uniformly at random and independently. Find the probability that the explosions take place
on a time interval of length t, for 0 < t < 1. You should model this by letting X1 , X2 denote the
times of the two explosions and by letting the distribution of (X1 , X2 ) be unif([0, 1] × [0, 1]).
Answer. The event of interest is

{|X1 − X2| ≤ t}.

Since (X1, X2) has the unif([0, 1] × [0, 1]) law, if we let

A := {(x1, x2) ∈ [0, 1] × [0, 1] : |x1 − x2| ≤ t}

we have

P(|X1 − X2| ≤ t) = P((X1, X2) ∈ A) = area(A) / area([0, 1] × [0, 1]).

The area in the denominator is 1. The area in the numerator is 1 − (1 − t)2 = t(2 − t). (Draw a
figure!) Hence
P(|X1 − X1 | ≤ t) = t(2 − t).
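The answer t(2 − t) is quickly confirmed by simulating pairs of independent uniforms (seed, sample size and the two values of t are my arbitrary choices):

```python
import random

random.seed(9)

# |X1 - X2| <= t for independent uniforms on [0,1]: probability t(2 - t).
n = 300_000
for t in (0.25, 0.5):
    hits = sum(abs(random.random() - random.random()) <= t for _ in range(n))
    assert abs(hits / n - t * (2 - t)) < 0.005
```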

PROBLEM 14.34 (a random determinant). Assume that X, Y, Z, W are i.i.d. N(0, 1) variables.
What is the variance of the determinant of the matrix [X Y; Z W]?
Answer. The determinant is XW − YZ. By independence, E(XW − YZ) = (EX)(EW) − (EY)(EZ) = 0.
Hence var(XW − YZ) = E[(XW − YZ)²] = E[X²W² + Y²Z² − 2XWYZ] = (EX²)(EW²) + (EY²)(EZ²) −
0 = 1 + 1 = 2. 
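A sanity check by simulation (seed and sample size are my arbitrary choices): the sample variance of XW − YZ should be close to 2.

```python
import random

random.seed(10)

# var(XW - YZ) for i.i.d. N(0,1) entries should equal 2.
n = 200_000
vals = [random.gauss(0, 1) * random.gauss(0, 1)
        - random.gauss(0, 1) * random.gauss(0, 1) for _ in range(n)]
mean = sum(vals) / n
var = sum((v - mean) ** 2 for v in vals) / n
assert abs(var - 2.0) < 0.05
```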
?PROBLEM 14.35 (the minimum of two independent exponential random variables). Let
τ, σ be two independent random variables with distributions expon(λ), expon(µ), respectively;
that is, their joint distribution has a density given by f (t, s) = λe−λt 1t>0 µe−µs 1s>0 . Define the
function
X = min(τ, σ).
Determine the distribution function of X and then its density. What is the distribution of X
called? Then consider the events {X = τ} and {X = σ}. Do you expect their probabilities to add
up to 1? Answer this, and then compute the probabilities explicitly and add them up to verify
your expectation.
Answer. We have X > x if and only if τ > x and σ > x. Hence, for x > 0,

P(X > x) = P(τ > x)P(σ > x) = e−λx e−µx = e−(λ+µ)x .

Therefore, the distribution function of X is

F(x) = P(X ≤ x) = (1 − e−(λ+µ)x ) 1x>0 .

Its density is then given by

f (x) = F0 (x) = (λ + µ)e−(λ+µ)x 1x>0 .

Hence X has expon(λ + µ) distribution.


We have {X = τ} ∪ {X = σ} = Ω and {X = τ} ∩ {X = σ} = {τ = σ}. But P(τ = σ) = 0. Hence
P(X = τ) + P(X = σ) = 1.
We compute the probabilities now. We have

P(X = τ) = P(τ ≤ σ) = ∬_{(t,s): t≤s} λe^{−λt} 1_{t>0} µe^{−µs} 1_{s>0} dt ds
         = ∫ λe^{−λt} 1_{t>0} (∫ µe^{−µs} 1_{s>0, s≥t} ds) dt
         = ∫₀^∞ λe^{−λt} (∫_t^∞ µe^{−µs} ds) dt
         = ∫₀^∞ λe^{−λt} e^{−µt} dt = λ ∫₀^∞ e^{−(λ+µ)t} dt = λ/(λ + µ).

For exactly the same reason, P(X = σ) = µ/(λ + µ). Indeed, λ/(λ+µ) + µ/(λ+µ) = 1. 
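Both conclusions, that min(τ, σ) is expon(λ + µ) and that P(X = τ) = λ/(λ + µ), can be checked by simulation (the rates λ = 2, µ = 3, the seed and the sample size are my arbitrary choices):

```python
import random

random.seed(11)

# tau ~ expon(2), sigma ~ expon(3): min should be expon(5) (mean 1/5),
# and P(min = tau) should be 2/5.
lam, mu = 2.0, 3.0
n = 200_000
wins = 0
mins = []
for _ in range(n):
    t, s = random.expovariate(lam), random.expovariate(mu)
    wins += t <= s
    mins.append(min(t, s))

assert abs(wins / n - lam / (lam + mu)) < 0.005
mean_min = sum(mins) / n
assert abs(mean_min - 1 / (lam + mu)) < 0.005
```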

PROBLEM 14.36 (the maximum and the minimum, together). Let τ1 , τ2 be two independent
random variables, each with expon(1) law. Set

X = min(τ1 , τ2 )
Y = max(τ1 , τ2 )

Find (i) the density of the random vector (X, Y), (ii) the density of Y, (iii) the density of X, (iv)
the conditional density of X given Y = y, (v) the conditional density of Y given X = x. And
explain why (vi) the law of X is the law of an expon(2) random variable, (vii) the law Y is the
law of the sum of two independent random variables, one expon(2) and the other expon(1).
Answer. (i) The function H = (H1, H2), where H1(t1, t2) = min(t1, t2), H2(t1, t2) = max(t1, t2), is
not invertible³ and not differentiable at every point of the form (t, t). Hence the technique we
learned does not apply. But we can compute the distribution function

F(x, y) = P(X ≤ x, Y ≤ y),

see (14.5) and then differentiate as (14.4) to obtain the density f (x, y) of (X, Y). Since Y is a
maximum we have Y ≤ y ⇐⇒ τ1 ≤ y, τ2 ≤ y, so, by the assumed independence,

P(Y ≤ y) = P(τ1 ≤ y)P(τ2 ≤ y) = (1 − e−y )2 .

Using (AXIOM TWO) we have

P(X ≤ x, Y ≤ y) = P(Y ≤ y) − P(X > x, Y ≤ y)

Noticing that X > x, Y ≤ y ⇐⇒ x < τ1 ≤ y, x < τ2 ≤ y we have

P(X > x, Y ≤ y) = P(x < τ1 ≤ y)P(x < τ2 ≤ y) = (e−x − e−y )2 ,

provided that x < y, else it is zero. So we have

F(x, y) = (1 − e−y )2 − (e−x − e−y )2 .

Differentiating this with respect to x and to y we get

f(x, y) = ∂²F(x, y)/∂x∂y = 2e^{−x} e^{−y} 1_{0<x<y}.

Note that the indicator is important because the density is 0 if x > y.


(ii) We can find the density f2 (y) of Y either by integrating out x,
Z ∞ Z ∞ Z y
f2 (y) = f (x, y)dx = 2e e 10<x<y dx = 2e
−x −y −y
e−x dx 1 y>0 = 2e−y (1 − e−y ) 1 y>0 ,
−∞ −∞ 0

or directly:
d
f2 (y) = P(Y ≤ y) = 2e−y (1 − e−y ) 1 y>0 . (14.10)
dy
³However, given X and Y we have P(τ1 = X, τ2 = Y or τ1 = Y, τ2 = X) = 1 and this can be used.

(iii) Similarly, the density f1(x) of X can be found in 2 ways:

f1(x) = 2e^{−2x} 1_{x>0}.


(iv) and (v) are by the definitions. But do not forget the restrictions, viz., the indicators.

f2|1(y|x) = f(x, y)/f1(x) = e^{−(y−x)} 1_{y>x},
f1|2(x|y) = f(x, y)/f2(y) = (e^{−x}/(1 − e^{−y})) 1_{x<y}.
(vi) The formula for f1 (x) is the formula for the density of expon(2).
(vii) We are being asked to verify that

Y =ᵈ σ2 + σ1,

where σ2, σ1 are independent and expon(2), expon(1), respectively. One way to do that is by
computing the density of σ2 + σ1 and showing that it equals f2(y). We consider the random
variables defined by

U = σ1
V = σ2 + σ1

Since (U, V) = H(σ1, σ2) for a linear map H : R² → R², we can apply the “Jacobian” method to
first figure out the density of (U, V) and then, by integrating out the first variable, the density
of V, and it is this that we want. Let g(s1, s2) be the density of (σ1, σ2) and h(u, v) the density of
(U, V) (I’m running out of letters...). Since σ1, σ2 are independent, we have

g(s1, s2) = e^{−s1} 1_{s1>0} · 2e^{−2s2} 1_{s2>0}.

We then have

h(u, v) = g(s1, s2) |det(∂(s1, s2)/∂(u, v))|.

Since u = s1, v = s1 + s2, we have s1 = u, s2 = v − u. So

∂(s1, s2)/∂(u, v) = [1, 0; −1, 1],

a matrix whose determinant is 1, so

h(u, v) = g(s1, s2) = 2e^{−s1} e^{−2s2} 1_{s1>0, s2>0},

but we should not forget to replace s1, s2 by their expressions in terms of u, v. Doing this we
obtain

h(u, v) = 2e^{−u} e^{−2(v−u)} 1_{u>0, v−u>0} = 2e^{−2v} e^{u} 1_{0<u<v}.
Integrating out the variable u we obtain the density of V = σ1 + σ2. We find

∫₋∞^∞ h(u, v) du = 2e^{−2v} ∫₀^v e^{u} du 1_{v>0} = 2e^{−2v}(e^{v} − 1) 1_{v>0} = 2e^{−v}(1 − e^{−v}) 1_{v>0}.

This is equal to f2(v), where f2 is the density of Y, as we derived in (14.10). So the density of
V = σ2 + σ1 is the density of Y. Hence Y =ᵈ σ2 + σ1, as we were asked to show. 
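The distributional identity can also be seen empirically (my own check, with arbitrary seed and sample size): the empirical CDFs of max(τ1, τ2) and of an independent expon(2) + expon(1) sum both track the exact CDF (1 − e^{−y})².

```python
import math
import random

random.seed(12)

# max of two i.i.d. expon(1) variables vs. expon(2) + expon(1):
# both should have CDF (1 - e^{-y})^2.
n = 200_000
maxima = [max(random.expovariate(1), random.expovariate(1)) for _ in range(n)]
sums = [random.expovariate(2) + random.expovariate(1) for _ in range(n)]

for y in (0.5, 1.0, 2.0):
    exact = (1 - math.exp(-y)) ** 2
    p1 = sum(v <= y for v in maxima) / n
    p2 = sum(v <= y for v in sums) / n
    assert abs(p1 - exact) < 0.01 and abs(p2 - exact) < 0.01
```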

PROBLEM 14.37 (a dangerous particle hits the Earth). A particle coming from very far away
is going to hit the Earth at a random point at some time in the distant future and cause damage
because it carries a lot of energy. The Empire of Africa, extending between Ethiopia, Niger
and Angola, has become the center of the world and people are interested in the chance that
the particle will hit it. Let A, B, C, be the capitals of these three countries (A=Addis Ababa,
B=Niamey, C= Luanda, respectively). The distances between them (as measured by a
plane flying on the least distance path between any two cities) are c := d(A, B) = 4023 km,
b := d(A, C) = 4038 km, a := d(B, C) = 2774 km. Calculate the probability that the particle will
land in the triangle defined by these three capitals. Note that the circumference4 of the Earth
is 40, 075 km. Hint: You can use the law of cosines for a spherical triangle 5 that relates its side
lengths a, b, c, to the angles θA , θB , θC at its vertices.

cos(a/R) = cos(b/R) cos(c/R) + sin(b/R) sin(c/R) cos(θA ),

and symmetrically for the other two angles. After computing the angles, use the inclusion-
exclusion formula.
Answer. We can use a calculator, or, better yet, Maple:

R:=40075/(2*Pi):                    # Earth radius, from the circumference given above
a:=2774.0: b:=4038.0: c:=4023.0:    # the three side lengths in km
g:=(x, y, z) -> arccos((cos(x/R)-cos(y/R)*cos(z/R))/(sin(y/R)*sin(z/R)))
angleA:=g(a,b,c); angleB:=g(b,c,a); angleC:=g(c,a,b);

This gives:

θA ≈ 0.7478, θB ≈ 1.2693, θC ≈ 1.2591.

If T denotes the triangle defined by the three capitals, we have, assuming that the point where
the particle lands has the uniform distribution on the Earth,

p = P(particle hits the triangle) = area(T)/ area(Earth).

Each side, say AB, of T is an arc of a great circle CAB (a circle passing through the center
of the Earth). Focus on a particular vertex, A, say. Consider the region LA on the Earth
between CAB and CAC that contains the triangle T. This region is a “double lune”, whose area
S(θA ) = area(LA ) is clearly a linear function of θA . Hence S(θA )/S(π) = θA /π. We know that the area of
the Earth is 4πR², and S(π) covers the whole Earth. This gives S(θA ) = 4θA R². Similarly for the other two lunes, LB , LC . By
the inclusion-exclusion formula,

P(LA ∪ LB ∪ LC ) = P(LA ) + P(LB ) + P(LC ) − P(LA ∩ LB ) − P(LB ∩ LC ) − P(LC ∩ LA ) + P(LA ∩ LB ∩ LC ).

But LA ∪ LB ∪ LC = Earth, and LA ∩ LB ∩ LC = T ∪ T′ (each pair of lunes meets twice), where T′
is the antipodal of T. We also have LA ∩ LB = T ∪ T′, etc., and P(LA ) = 4θA R²/4πR² = θA /π. So
1 = θA /π + θB /π + θC /π − 6p + 2p,

and so

p = (θA + θB + θC − π)/(4π) = 0.0107.

4 First computed, to an astonishing accuracy of less than 1%, by Eratosthenes, 2200 years ago.
5 See, e.g., here.
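For readers without Maple, here is the same computation redone as a Python sketch (the value of R is derived from the circumference quoted in the problem; Girard's theorem for the area of a spherical triangle is used to get the probability):

```python
from math import acos, cos, sin, pi

R = 40_075 / (2 * pi)              # Earth's radius (km) from its circumference
a, b, c = 2774.0, 4038.0, 4023.0   # d(B,C), d(A,C), d(A,B) in km

def vertex_angle(x, y, z):
    """Angle opposite side x, by the spherical law of cosines."""
    return acos((cos(x / R) - cos(y / R) * cos(z / R)) / (sin(y / R) * sin(z / R)))

theta_A = vertex_angle(a, b, c)
theta_B = vertex_angle(b, c, a)
theta_C = vertex_angle(c, a, b)

# Girard: area(T) = (theta_A + theta_B + theta_C - pi) * R^2, area(Earth) = 4*pi*R^2
p = (theta_A + theta_B + theta_C - pi) / (4 * pi)
print(round(p, 4))  # ≈ 0.0107, the probability computed above
```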

PROBLEM 14.38 (choose a small square at random). How would you model the selection
of a small square, at a random location and with random side length, contained inside a big
fixed square and having sides parallel to the big one? On the basis of your model, compute
the probability of the event L that the small square is entirely contained within the left half of
the big one; do the same for the event R that the small square is entirely contained within the
right half. And what is the probability that the small square intersects the segment separating
the big square in half?
Answer. There are many ways to do this. Here is one. Suppose the big square has side length
1. Let (x, y) be the coordinates of the upper right vertex of the small square and let ℓ be its side
length. Since we want the small square to be inside the big one we must have ℓ ≤ x and ℓ ≤ y.
The small square is thus specified by a point (x, y, ℓ) ∈ R3 contained in the set

Ω = {(x, y, ℓ) ∈ R3 : 0 ≤ ℓ ≤ x ≤ 1, 0 ≤ ℓ ≤ y ≤ 1}.

I therefore must define a probability measure P on Ω. I can do anything I like, but, perhaps,
the most natural way is to do it via the uniform density

f(x, y, ℓ) = c if (x, y, ℓ) ∈ Ω, and f(x, y, ℓ) = 0 if not.

Since

1 = ∫_{R3} f(x, y, ℓ) dx dy dℓ = c ∫_Ω dx dy dℓ = c vol(Ω),
we have that c = 1/vol(Ω), so it remains to compute the volume vol(Ω) of Ω. We do that by
carrying out the last integration explicitly. Let x ∧ y = min(x, y). The third coordinate ℓ must
be below x and below y, i.e. below x ∧ y. So

vol(Ω) = ∫_{x=0}^{1} ∫_{y=0}^{1} ∫_{ℓ=0}^{x∧y} dℓ dy dx = ∫_{x=0}^{1} ∫_{y=0}^{1} (x ∧ y) dy dx = ∫_{x=0}^{1} ∫_{y=0}^{x} y dy dx + ∫_{y=0}^{1} ∫_{x=0}^{y} x dx dy.

The first integral is

∫_{x=0}^{1} (x²/2) dx = 1/6.

By symmetry, so is the second integral. Therefore, vol(Ω) = 1/3. Hence we model the selection
by the density f(x, y, ℓ) = 3 when (x, y, ℓ) ∈ Ω or 0 otherwise.
We next have

L = {(x, y, ℓ) ∈ Ω : x < 1/2}, R = {(x, y, ℓ) ∈ Ω : x − ℓ > 1/2},

and so we need to compute

P(L) = ∫_L f dx dy dℓ = ∫_L (1/vol(Ω)) dx dy dℓ = vol(L)/vol(Ω),

and similarly for P(R). Before computing anything let’s see the logic of L. We have
(x, y, ℓ) ∈ L ⇐⇒ 0 ≤ x < 1/2, 0 ≤ y ≤ 1, ℓ ≤ x ∧ y. Hence

vol(L) = ∫_{x=0}^{1/2} ∫_{y=0}^{1} ∫_{ℓ=0}^{x∧y} dℓ dy dx = ∫_{x=0}^{1/2} ∫_{y=0}^{1} (x ∧ y) dy dx = 1/48 + 1/12 = 5/48,

where the term 1/48 resulted by performing the inner integral over 0 ≤ y ≤ x and the term
1/12 by doing it over x < y ≤ 1. For R we have

(x, y, ℓ) ∈ R ⇐⇒ 1/2 ≤ x < 1, 0 ≤ y ≤ 1, ℓ ≤ x ∧ y ∧ (x − 1/2).

Indeed, the requirement that x − ℓ > 1/2 is equivalent to ℓ < x − 1/2, and so ℓ must be below x,
below y and below x − 1/2. But x ≥ x − 1/2, so

x ∧ y ∧ (x − 1/2) = y ∧ (x − 1/2).

Hence

vol(R) = ∫_{x=1/2}^{1} ∫_{y=0}^{1} ∫_{ℓ=0}^{y∧(x−1/2)} dℓ dy dx = ∫_{x=1/2}^{1} ∫_{y=0}^{1} [y ∧ (x − 1/2)] dy dx = 1/12 + 1/48 = 5/48.

We thus have

P(L) = P(R) = (5/48)/(1/3) = 5/16.

So the probability that the small random square intersects the line x = 1/2 is 1 − P(L) − P(R) =
1 − 10/16 = 6/16 = 3/8. 
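The answer is easy to check by simulation. The following Python sketch rejection-samples (x, y, ℓ) uniformly from Ω and estimates the three probabilities:

```python
import random

random.seed(0)

def sample_square():
    """Rejection sampling: draw (x, y, l) uniformly from the unit cube and
    keep it only if l <= min(x, y), i.e. if the point lies in Omega."""
    while True:
        x, y, l = random.random(), random.random(), random.random()
        if l <= min(x, y):
            return x, y, l

N = 150_000
left = right = 0
for _ in range(N):
    x, y, l = sample_square()
    if x < 0.5:           # small square entirely in the left half
        left += 1
    elif x - l > 0.5:     # small square entirely in the right half
        right += 1

# Both frequencies should be near 5/16 = 0.3125, the remainder near 3/8 = 0.375
print(left / N, right / N, 1 - (left + right) / N)
```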
Chapter 15

Probability laws on bigger spaces


(R, Rn and beyond)

I call “big space” a set whose elements cannot be counted. I will


try to explain to you why putting a probability on a big space
isn’t a trivial matter because, in most interesting cases, you
cannot assign probabilities to individual outcomes but to events.
But then things get a bit fishy because (AXIOM ONE) and
(AXIOM TWO) must be respected.

Terminology. An interval I is a subset of R such that if x < y are two elements of I


then every z with x < z < y is also an element of I. Examples of intervals are the sets
(a, b) = {x ∈ R : a < x < b}, [a, b] = {x ∈ R : a ≤ x ≤ b}, [a, b) = {x ∈ R : a ≤ x < b},
(a, b] = {x ∈ R : a < x ≤ b}, as well as the sets (a, ∞), [a, ∞), (−∞, b), (−∞, b] and (−∞, ∞) = R.

15.1 Recapitulation and motivation

Review
A sample space Ω is finite when it has finitely many elements (outcomes). A sample space Ω
is countable if there is an injection from Ω into N. A finite Ω is always countable. An infinite
Ω is not always countable. Examples of countable sets are the sets Z of integers and Q of
rational numbers. Instead of saying “countable”, some people say “discrete”.
You know from previous courses how to put a probability measure P on a countable (finite
or infinite) Ω: simply consider a function p : Ω → [0, 1] such that Σ_{ω∈Ω} p(ω) = 1. Any such
function defines a probability measure P on E = P(Ω) via P(A) = Σ_{ω∈A} p(ω).
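This recipe is completely concrete. As an example (a fair die is my own illustrative choice):

```python
from fractions import Fraction

# p : Omega -> [0, 1] with total mass 1; here Omega is a fair six-sided die
Omega = range(1, 7)
p = {w: Fraction(1, 6) for w in Omega}
assert sum(p.values()) == 1

def P(A):
    """The induced probability measure: P(A) = sum of p(w) over w in A."""
    return sum(p[w] for w in A if w in p)

print(P({2, 4, 6}))  # → 1/2
```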
We also saw, in Chapter 14, through many examples and problems that are important for
all kinds of applications, that we do need to have random variables X in bigger than countable
spaces, such as R and Rn , such that P(X = x) = 0 for all x.

146

Motivation
If Ω is an uncountable set, then it is not easy to put interesting probability measures on
events of Ω. The reasons are two. The first one we can explain. The second is deeper and we
shall not explain it in this course. But it has to do with Theorems 12.1, 12.2 and 12.3, which I ask
you to read again now.

Difficulty 1. If Ω is uncountable and if we decide to define a probability measure P by defining P on


each singleton (=one-element set) {ω} then the set

{ω ∈ Ω : P{ω} > 0} is countable. (15.1)

And this means that we must give preference to a special countable subset of the uncountable set Ω,
which is ugly, inconvenient, impractical and stupid.

Difficulty 2. Again, let Ω be uncountable. The above difficulty is overcome by defining P on sets
rather than on singletons. But this must be done in a way that (AXIOM TWO) be satisfied. So we
must define P on a set E of events, a set that is a σ-field. Naturally, we would like to have E as large as
possible. The largest E is P(Ω) and it would be great to have every probability measure P defined on
P(Ω). But, alas, for most interesting probability measures we cannot choose P(Ω) as their domain.
And this is a fundamental restriction, one that cannot be explained here. I ask you to trust me on this
and I refer you to future courses to learn this, or to any good book on more advanced probability.
Let us explain Difficulty 1.

?PROBLEM 15.1 (adding uncountably many positive numbers always gives ∞). Explain
statement (15.1).
Answer. Denote by A+ the set {ω ∈ Ω : P{ω} > 0} and let An := {ω ∈ Ω : P{ω} ≥ 1/n}, n ∈ N.
Since
x > 0 ⇐⇒ x ≥ 1/n for some n ∈ N
we have

A+ = ∪_{n=1}^{∞} An .

Now observe that An can have at most n elements. (If it had a > n elements, we would find
that P(An ) ≥ a/n > 1, and this is impossible.) Since each An is finite, and A+ is their union, it
follows that A+ is countable. 
Therefore, the need to have random variables that take each specific value with probability
zero, together with the associated difficulties, motivates us to explain things carefully, and
this is the purpose of the rest of this chapter.

15.2 The real line: distribution functions and densities


We shall here consider one particular, but important case:

Ω = R = the set of real numbers.



I will assume that you know basic things: (a) the definition of R and its characterization; (b)
its completeness; in particular the notions of sup (least upper bound) and inf (greatest lower
bound) of a set; (c) sequence of real numbers and the notion of a limit. (d) open subsets of the
real line.
The question is: how do we define interesting probability measures on R and what is their
domain?
Consider an increasing function F : R → R, that is, a function F such that

x1 < x2 ⇒ F(x1 ) ≤ F(x2 ).

We use the word “increasing” in the sense of “non-decreasing”. If F(x1 ) < F(x2 ) whenever
x1 < x2 we say that F is strictly increasing. Any such function has left and right limits: for
any x, the numbers F(x+) = limε↓0 F(x + ε) and F(x−) = limε↓0 F(x − ε) exist. The reason is that
monotone bounded sequences converge in R. The notation ε ↓ 0 means that we let ε converge
to 0 from above.
The width of an increasing function is defined by

width(F) = sup_{x∈R} F(x) − inf_{x∈R} F(x).

We let E be any σ-field that includes the collection of all intervals and define B to be the
smallest such σ-field. This B is called the Borel σ-field (see Def. 12.2) and it is on B that
interesting nontrivial probability measures are defined.
The following is a fundamental result in mathematics:

Theorem 15.1 (a distribution function defines a unique probability measure). To each


increasing function F with width equal to 1 there corresponds a unique probability measure
Q : B → [0, 1] such that
Q((a, b]) = F(b+) − F(a+), (15.2)
for all real numbers a < b.

To understand this, you need to read, e.g., G.B. Folland, Real Analysis, Section 1.5 . The
theorem says that to every increasing unit-width F there corresponds a unique probability
measure Q with domain B:
F 7→ Q = QF .
The converse is also true,
Q 7→ F = FQ ,
and it is a mere exercise. See Problem 15.2.
Conventions: Now, if we make the convention that F be right-continuous which means
that F(x+) = F(x) for all x, and supx∈R F(x) = 1, infx∈R F(x) = 0, then we call it a probability
distribution function. I say these things are “conventions” because they’re not really needed.
Taking into account the above we declare:

Definition 15.1 (distribution function on R). We say that F : R → R is a distribution function


if it has the following properties:
1. Essential property. F is increasing and width(F) = 1.
2. Conventional property. F is right-continuous and limx→−∞ F(x) = 0

?PROBLEM 15.2 (it satisfies the defining properties of a distribution function). Let Q be
a probability measure defined on B. Show that

F(x) := Q((−∞, x])

is a distribution function such that (15.2) holds.


Answer. If x1 < x2 then (−∞, x1 ] ⊂ (−∞, x2 ], so F is increasing. We now have (a, b] = (−∞, b] \
(−∞, a], so Q((a, b]) = F(b) − F(a). Since R = (−∞, 0] ∪ ∪_{k=1}^{∞} (k − 1, k], we have, by (AXIOM TWO),
1 = Q(R) = F(0) + Σ_{k=1}^{∞} (F(k) − F(k − 1)). But an infinite sum is a limit: Σ_{k=1}^{∞} (F(k) − F(k − 1)) =
lim_{n→∞} Σ_{k=1}^{n} (F(k) − F(k − 1)) = lim_{n→∞} (F(n) − F(0)). Therefore, 1 = F(0) + lim_{n→∞} (F(n) − F(0)),
whence lim_{n→∞} F(n) = 1, and hence sup_{x∈R} F(x) = 1. Similarly, inf_{x∈R} F(x) = 0. Hence F is a
distribution function. 
If F is a distribution function we have

Q((a, b]) = F(b) − F(a), Q([a, b]) = F(b) − F(a−), Q((−∞, b)) = F(b−), etc.
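These identities make interval probabilities a matter of evaluating F. As an illustration (I choose the expon(λ) law as an example; since its F is continuous, F(a−) = F(a) and all four kinds of intervals get the same probability):

```python
from math import exp

lam = 2.0  # rate of an expon(lam) law (an example choice)

def F(x):
    """Distribution function of expon(lam): increasing, right-continuous, width 1."""
    return 1.0 - exp(-lam * x) if x > 0 else 0.0

def Q_interval(a, b):
    """Q((a, b]) = F(b) - F(a)."""
    return F(b) - F(a)

print(round(Q_interval(0.0, 1.0), 4))  # F(1) - F(0) = 1 - e^{-2} ≈ 0.8647
```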

?PROBLEM 15.3 (uniform probability measure on a bounded interval). Let I be a bounded
interval of length L. Such an interval has endpoints a and a + L (and may or may not include
the endpoints). Define the function F(x) that is 0 if x < a, 1 if x ≥ a + L, and equal to (x − a)/L when
a ≤ x < a + L. The corresponding probability measure Q is called uniform on I. Show that

Q(J) = length(J)/length(I)

when J ⊂ I is an interval.

In fact, if Q is unif(I) then, not just for any interval, but also for any Borel set B ⊂ I
we have

Q(B) = P(X ∈ B) = length(B)/length(I),    (15.3)

where X is any unif(I) random variable.

For every distribution function F we can always construct a random variable on some
probability space (Ω, E , P) such that the law of X is the probability measure defined by F. In
this case, we say that F is the distribution function of X. For example, we can take Ω = R,
E = B and X(ω) = ω. In this case, we can trivially write

P(X ≤ x) = F(x), x ∈ R.

People, being people, like to give names to things that they use often. If Q has a name name
then it is customary to call any random variable X with law Q also name. So, for example, if Q
is unif(I) then any random variable with law Q is called a unif(I) random variable.

Figure 15.1: zero-length set in R, zero-area set in R2 , zero-volume set in R3

Although Q is unique, the corresponding random variable with law Q is not unique.

Definition 15.2. We say that the point x is an atom of the probability measure Q if Q{x} > 0. A
probability measure Q with no atoms is called non-atomic or continuous. A random variable
X whose law is a continuous probability measure Q is called a continuous random variable.

?PROBLEM 15.4 (continuous r.v.). Let X be a random variable with distribution function F.
Explain why the following are equivalent:

(a) X is continuous

(b) F is a continuous function

(c) P(X = x) = 0, for all x ∈ R.

Answer. We have Q{x} = P(X = x) = F(x) − F(x−). Hence Q{x} = 0 for all x if and only if
P(X = x) = 0 for all x. Hence (a) is equivalent to (c). On the other hand, (b) is equivalent to
F(x) − F(x−) = 0 for all x. Hence (b) is equivalent to (c). 

Now recall what zero-length set means. See Definition 14.4. And then study again
Problem 14.4 that explains that any countable set is a zero-length set. Finally, trust
me that there are zero-length sets that are uncountable, it’s just hard for you to
imagine. But see Problem 15.6 below. Also see Figure 15.1 (left). This figure also
depicts a zero-area set in R2 and a zero-volume set in R3 .

PROBLEM 15.5 (tossing a pencil). A pencil1 usually has 6 sides. But let’s consider a pencil
with 10 sides, labeled 0, 1, 2, 3, 4, 5, 6, 7, 8, 9.
1
I thank Professor Venkat Anantharam of the University of California, Berkeley, who told me that I can toss a
pencil

Toss this pencil and record the face on which it lands. Assume the pencil is perfect, so assign
probability 1/10 to each side. Repeat this infinitely many times, independently. Consider the
event
A = {1, 2, 3, 4, 5, 6, 7, 8 will show up infinitely many times each} (15.4)
and show that P(A) = 1.
Answer. In Problem 13.6 we showed the same thing except that we were tossing a coin (two
outcomes). Now we toss a pencil (10 outcomes). If we repeat the argument we again obtain
that the event A above also has P(A) = 1. 
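A simulation makes the statement plausible (a sketch only: infinitely many tosses are replaced by a large finite number, so we can merely observe that every face keeps showing up with frequency near 1/10):

```python
import random
from collections import Counter

random.seed(0)

# Toss a fair 10-sided "pencil" many times and count how often each face shows
tosses = [random.randrange(10) for _ in range(100_000)]
counts = Counter(tosses)

# Every face 0..9 appears, each with frequency close to 1/10
print(sorted(counts.keys()))
print(min(counts.values()), max(counts.values()))
```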

?PROBLEM 15.6 (an uncountable zero-length set). Consider the set N of all real numbers x
between 0 and 1 whose decimal representation only uses digits 0 and 9. For example, this set
contains the numbers

0.09090909090909090909090909 · · ·
0.09900990099009900990099 · · ·
0.909009000900009000009000000900000009000000009 · · ·
0.0090909090009000909090090909999090909000009 · · ·

Explain why this set is uncountable and has zero-length.


Answer. Any number in the set N is specified by a sequence of 0s and 9s. So the set N is
represented as {0, 9}^N, the set of all sequences of 0s and 9s. This is exactly like the space of
outcomes when tossing a coin infinitely many times. So it is uncountable.
To see that it has zero length, we show that the points that are not in N, that is, the set
Nc = [0, 1] \ N, form a set of unit length.
Consider a random variable X with distribution unif([0, 1]). Apply formula (15.3) to get

P(X ∈ N) = length(N). (15.5)

But

Nc = {x ∈ [0, 1] : some digit xi in its decimal representation equals 1 or 2 or 3 or 4 or 5 or 6 or 7 or 8}.

But then
P(X ∈ Nc ) ≥ P(A),
where A is the event (15.4) considered in Problem 15.5 where it was shown that P(A) = 1.
Hence P(X ∈ Nc ) ≥ 1, so P(X ∈ Nc ) = 1, by (AXIOM ONE). This means that P(X ∈ N) = 0 and
so, by (15.5), length(N) = 0. 
We now define the notion of an absolutely continuous function.

Definition 15.3 (absolutely continuous function). We say that F : R → R is absolutely
continuous if for any ε > 0 there exists δ > 0 such that if (a1 , b1 ), . . . , (am , bm ) are disjoint
intervals with Σ_{i=1}^{m} (bi − ai ) ≤ δ then Σ_{i=1}^{m} |F(bi ) − F(ai )| ≤ ε.

Recall that the derivative of a function F at a point x is the number

F′(x) = lim_{h→0} (F(x + h) − F(x))/h,

whenever the limit exists.

Theorem 15.2 (an advanced version of the fundamental theorem of Calculus). If F :
R → R is absolutely continuous then the set

N = {x ∈ R : F′(x) does not exist} is a zero-length set.

Moreover,

∫_a^b F′(x) dx = F(b) − F(a),

where the integral here is a so-called Lebesgue integral.

We won’t explain these things but we will again refer to G.B. Folland, Real Analysis . The
notion of a Lebesgue integral is different than that of a Riemann integral, the one you learned
in Calculus. However, for all practical purposes, they are equal in most cases. For example, if
a function is piecewise continuous the two integrals are the same.
Let us apply this to a probability measure Q. We say that Q is absolutely continuous if its
distribution function F is absolutely continuous. Hence, by Theorem 15.2:

If F is absolutely continuous then
(1) f(x) = F′(x) exists for all x outside a zero-length set.
(2) f(x) ≥ 0 for all x. This is because F is increasing, and because we can define f to
be anything we like on the zero-length set on which the derivative does not exist, we
choose it to be 0 on this set.
(3) F(b) − F(a) = ∫_a^b f(u) du.
(4) In particular, F(x) = ∫_{−∞}^{x} f(u) du.
(5) ∫_{−∞}^{∞} f(u) du = 1.
Hence f is a probability density function as those studied in Chapter 14.

We can summarize this in words as follows.

Folklorically, a probability distribution function F is absolutely continuous if it has
a derivative (except on a zero-length set) AND if its derivative can be integrated
so that F be recovered.

?PROBLEM 15.7 (fundamental theorems of Calculus and the folklore above). How do the
fundamental theorems of Calculus compare to the above folklore, viz., to Theorem 15.2?
Answer. Let F be a continuous distribution function. Assume that it is piecewise differentiable,
that is, differentiable at all points except at a discrete set of points N. Denote by F′ its derivative
function, arbitrarily defined on N. According to the second fundamental theorem of Calculus,

∫_a^b F′(x) dx = F(b) − F(a), for all −∞ < a < b < ∞. (15.6)

So we can recover the function F from its derivative F′.

If we assume that F is a continuous distribution function that is differentiable except on
a zero-length set N (that is not necessarily discrete), then Theorem 15.2 says that (15.6) holds
provided that F is in addition absolutely continuous. That is, without the absolute continuity,
we may not be able to recover the function F from its derivative F′. 

15.3 Classification of (distributions) of random variables

Mixtures
First, a definition.

Definition 15.4. We say that (the law QX ) of a random variable X is the mixture of (the laws
QY1 , QY2 and QY3 ) of the random variables Y1 , Y2 and Y3 if

QX = p1 QY1 + p2 QY2 + p3 QY3 , (15.7)

for some nonnegative numbers p1 , p2 , p3 such that p1 + p2 + p3 = 1. This can be stated for any
number of Yi ’s, even infinitely many.

?PROBLEM 15.8 (mixture is a relation between laws or between random variables). Let ξ
be a random variable with values in the set {1, 2, 3} and distribution P(ξ = i) = pi , i = 1, 2, 3.
Show that (15.7) is equivalent to

X = 1ξ=1 Y1 + 1ξ=2 Y2 + 1ξ=3 Y3 , (15.8)

where Y1 , Y2 , Y3 , ξ are independent.


Answer. Suppose (15.8) holds. Then, by (AXIOM TWO),

P(X ≤ x) = Σ_{i=1}^{3} P(ξ = i, X ≤ x).

But ξ = i implies that X = Yi , so P(ξ = i, X ≤ x) = P(ξ = i, Yi ≤ x). But Yi and ξ are
independent, so this is further equal to P(ξ = i) P(Yi ≤ x) = pi QYi ((−∞, x]). Hence

QX ((−∞, x]) = Σ_{i=1}^{3} pi QYi ((−∞, x]).

By Theorem 15.1, this implies that

QX (B) = Σ_{i=1}^{3} pi QYi (B),

not only for B of the form (−∞, x], but also for any (Borel) set B ⊂ R. 
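Equation (15.8) is also a recipe for simulation: first draw ξ, then output the corresponding Yi. A Python sketch, where the three components are my own choice (uniforms on disjoint intervals, so the mixture weights are easy to read off):

```python
import random

random.seed(0)

p = [0.5, 0.3, 0.2]  # mixing probabilities p1, p2, p3

def sample_mixture():
    """Draw xi with P(xi = i) = p_i, then return a draw of Y_xi."""
    u = random.random()
    if u < p[0]:
        return random.uniform(0.0, 1.0)   # Y1 ~ unif(0,1)
    elif u < p[0] + p[1]:
        return random.uniform(1.0, 2.0)   # Y2 ~ unif(1,2)
    else:
        return random.uniform(2.0, 3.0)   # Y3 ~ unif(2,3)

N = 100_000
xs = [sample_mixture() for _ in range(N)]
# Since the supports are disjoint, Q_X([0,1]) should be close to p1 = 0.5
print(round(sum(x <= 1.0 for x in xs) / N, 2))
```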

Classification
There are 3 basic types of random variables.

1. Discrete random variable. The (law of the) random variable X is called discrete if there is
a countable set C such that P(X ∈ C) = 1. In this case, the law of X is completely specified
by the numbers p(x) = P(X = x), x ∈ C, because P(X ∈ B) = Σ_{x∈B∩C} p(x).

2. Absolutely continuous random variable. The (law of the) random variable X is called
absolutely continuous if it has a probability density function f(x), x ∈ R. In this case, the
law of X is completely specified by P(X ∈ B) = ∫_B f(x) dx = ∫_R 1_{x∈B} f(x) dx. In particular,
P(X = x) = 0 for all x ∈ R.

3. Singularly continuous random variable. The (law of the) random variable X is called
singularly continuous if P(X = x) = 0 for all x ∈ R but there is a zero-length set N ⊂ R such
that P(X ∈ N) = 1.

PROBLEM 15.9 (examples of the three basic types). Give an example of a random variable
that is (1) discrete, (2) absolutely continuous, (3) singularly continuous. Use Problems 15.5
and 15.6 to answer (3).
Answer. (1) Take, e.g., a bin(n, p) random variable.
(2) Take, e.g., an expon(λ) random variable.
(3) Let X1 , X2 , . . . be i.i.d. random variables, each with law unif({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}). Clearly,
the Xi represent the outcomes of tossing a 10-sided pencil as in Problem 15.5. Form the real
number whose decimal expansion has digits X1 , X2 , . . ., that is,

X = 0.X1 X2 X3 · · · in high school notation,

or

X = Σ_{i=1}^{∞} Xi /10^i in calculus notation.

(That is, instead of writing 0.571 we write 5/10 + 7/100 + 1/1000.) Now define

Z = Σ_{i=1}^{∞} Xi 1_{Xi ∈{0,9}} / 10^i .

Since the right-hand side of this is the decimal expansion of a number that uses digits 0 or 9
only, we have

P(Z ∈ N) = 1,

where N is the set defined in Problem 15.6. But we showed that N is a zero-length set. Also, if
x = 0.x1 x2 · · · is a given number, 0 < x < 1, then

P(Z = x) = Π_{i=1}^{∞} P(Xi 1_{Xi ∈{0,9}} = xi ) = 0,

since each factor in the product is at most 9/10. So Z is singularly continuous. 
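One can even simulate draws of a singular random variable of this kind. A sketch (with two simplifications: the infinitely many digits are truncated to finitely many, and I take the digits 0 and 9 equally likely, which is a different digit law than that of Z but is also supported on the set N; the number is returned as a decimal string to avoid floating-point rounding):

```python
import random

random.seed(0)

def singular_sample(ndigits=30):
    """Approximate draw of a singularly continuous random variable:
    i.i.d. digits, each 0 or 9, returned as a decimal string."""
    return "0." + "".join(random.choice("09") for _ in range(ndigits))

z = singular_sample()
print(z)
# every digit lies in {0, 9}, so the sample falls in the zero-length set N
assert set(z[2:]) <= {"0", "9"}
```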

Theorem 15.3 (decomposition of the law of any random variable). (The law of a) random
variable X in R is the unique mixture of 3 random variables: a discrete one, an absolutely
continuous one and a singularly continuous one.

This is, as usual, not proved here. Although it is obvious that a mixture of the three types
gives a random variable that is not necessarily one of the three types, the converse is not
obvious.
?PROBLEM 15.10 (it’s useless to differentiate a singular function). Explain why we can
define the derivative of the distribution function of a singularly continuous random variable
but that this derivative is not a probability density function, and hence useless.
Answer. Let Z be a singularly continuous random variable. Then there is a zero-length set N such
that P(Z ∈ N) = 1. Let F(z) = P(Z ≤ z) be the distribution function of Z. To differentiate F(z),
we will consider only those z that are not in N. (We can ignore, as explained, any zero-length
set.) If z ∈ Nc , let (we can do that) (z − ε, z + ε) ⊂ Nc for small enough ε > 0. Since P(Z ∈ Nc ) = 0,
we have F(z + h) = F(z) for all h ∈ (−ε, ε). And so F′(z) = 0. We thus have f(z) = F′(z) = 0
for z ∈ Nc and we can define it however we like for z ∈ N. But then

∫_a^b f(z) dz = 0

for all a, b, whereas P(Z ∈ (a, b)) > 0 if a and b are far apart. This is why F′(z) is useless. 
PROBLEM 15.11 (continuous random variables). In Definition 15.2 we declared that a
random variable is continuous (it has non-atomic distribution) if P(X = x) = 0 for all x ∈ R.
Using Theorem 15.3 give an explicit expression for the class of continuous random variables
in terms of the three basic types.
Answer. If a random variable X is absolutely continuous or singularly continuous then
P(X = x) = 0 for all x ∈ R, so it is continuous. If a random variable is a mixture of an
absolutely continuous random variable and a singularly continuous one then, we again have
P(X = x) = 0. If X is a general random variable then, according to Theorem 15.3, it is a mixture
of the 3 basic types. But if the discrete part is present in the mixture then, clearly, P(X = x) > 0
for some x ∈ R, so the discrete part shouldn’t be there if we want X to be continuous. We conclude:
A random variable is continuous ⇐⇒ it is a mixture of an absolutely continuous
random variable and a singularly continuous one.

15.4 Pause for recollection


In Chapter 14 we defined and used the concept of probability density function of a random
variable and then of a random vector. We also discussed a number of special cases appearing
in applications, but which are also fundamental for the subject.

We also defined, naively, the notion of independence and expectation and derived some
calculus rules in order to compute expectations and distributions of functions of random
variables.
In this chapter, we explained what we mean by random variables with general distribution.
We explained why their probability distribution cannot, in general, be described by probabilities
on individual points (see Section 15.1, Difficulty 1–explained in Problem 15.1, and Difficulty
1).
We then defined the concept of distribution function on R–see Section 15.2 and, in particular,
Theorem 15.1 that states that a distribution function uniquely defines a probability measure
and hence a random variable.
In the same section, Section 15.2, we also stated a generalization of the second fundamental
theorem of Calculus, that is, Theorem 15.2, that tells us when we can differentiate a distribution
function and obtain a useful derivative that can be integrated so that the distribution function
be recovered. We didn’t prove this theorem, but we have understood its statement. And,
indeed, having done so, we could easily solve Problem 15.7, which requires knowledge of the
second fundamental theorem of Calculus, which you should know from Calculus.
In Section 15.3 we explained the three basic types of random variables and stated, as
Theorem 15.3, that a general random variable is a mixture of those three types.

15.5 Random vectors and their laws


We now pass on to
Ω = Rn .
This is the set of all n-tuples x = (x1 , . . . , xn ) of real numbers. I call Rn a Euclidean space
because I identify the set of ordered n-tuples of real numbers with a geometric Euclidean space,
a space that contains points, lines, planes, etc.2 , just like R2 is identified with the Euclidean
plane–thanks to Descartes.3 That is, we consider random vectors

(X1 , . . . , Xn ) : Ω → Rn

and ways to describe their laws (=distributions). Let’s quickly recall that by saying “law” or
“distribution” of (X1 , . . . , Xn ) we mean the probability measure

Q(B) := P((X1 , . . . , Xn ) ∈ B),

for B (Borel) subsets of Rn .


2
First systematized by Euclid, Elements (13 volumes), c. 300 BCE, eventually (2200 years later) axiomatized by
D. Hilbert, The Foundations of Geometry , 1902; one of the modern accounts being that of H.S.M. Coxeter, Introduction
to Geometry , 1969.
3
Cartesius in his Latinized name (most rich and/or famous people at the time, 17th c. CE, wanted to have
Latin names as this was posh; just as, nowadays, people attach letters to their names if they are rich or famous).
Descartes went from France to Sweden to educate the then Swedish queen but he was poisoned there by a priest
and died. See Theodor Ebert, L’énigme de la mort de Descartes, Hermann Philosophie, 2011 (transl. from the
German original, Der rätselhafte Tod des René Descartes, Alibri Verlag, 2009). See also articles here and here.

If we set B = (−∞, t1 ] × · · · × (−∞, tn ], which is an unbounded rectangle, we obtain a function
of (t1 , . . . , tn ), denoted by F(t1 , . . . , tn ). That is,

F(t1 , . . . , tn ) = P(X1 ≤ t1 , . . . , Xn ≤ tn ).

It is easy to see that this function is increasing in each ti when the others are kept fixed. Note
that if we know this function then we can easily compute the probability that X is in a bounded
rectangle. We exemplify this when n = 2.

PROBLEM 15.12 (an essential property of 2-dimensional distribution function). Let (X1 , X2 )
be a random vector in R2 with distribution function F(x1 , x2 ). Consider the bounded rectangle

R = (a1 , b1 ] × (a2 , b2 ],

and show that

P((X1 , X2 ) ∈ R) = F(b1 , b2 ) − F(a1 , b2 ) − F(b1 , a2 ) + F(a1 , a2 ), (15.9)

using the inclusion-exclusion formula or using the algebra of indicator functions.


Answer. Let us draw the rectangle R and give names A, B, C, D to its 4 vertices as in Figure
15.2. Let RA = (−∞, b1 ] × (−∞, b2 ], the unbounded rectangle with top-right corner A. Similarly,

Figure 15.2: The rectangle R = (a1 , b1 ] × (a2 , b2 ] and its 4 vertices A, B, C, D.

define RB , RC , RD . Consider the function

h := 1RA + 1RB .

This assigns values 1 or 2 in the regions shown on the left of Figure 15.3. Similarly, consider
the function
g := 1RC + 1RD ,
that takes values 0, 1, 2, as shown on the right of Figure 15.3. Hence it is obvious that

h − g = 1R .

Hence
1(X1 ,X2 )∈R = 1(X1 ,X2 )∈RA + 1(X1 ,X2 )∈RB − 1(X1 ,X2 )∈RC − 1(X1 ,X2 )∈RD ,
and now take expectations. We have E[1_{(X1 ,X2 )∈R} ] = P((X1 , X2 ) ∈ R), E[1_{(X1 ,X2 )∈RA} ] =
P((X1 , X2 ) ∈ RA ) = F(b1 , b2 ), etc. Hence (15.9) holds. 

Figure 15.3: The functions h = 1RA + 1RB and g = 1RC + 1RD .

Since the left-hand side of (15.9) is non-negative, we must have

F(b1 , b2 ) − F(a1 , b2 ) − F(b1 , a2 ) + F(a1 , a2 ) ≥ 0, (15.10)

for all a1 < b1 , a2 < b2 . This is an essential property.
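Formula (15.9) can be checked numerically for a concrete F. As an example choice, take two independent unif(0, 1) coordinates, so that F(x1 , x2 ) = x1 x2 on the unit square; then the signed sum over the four vertices must reproduce the product of the side lengths of the rectangle:

```python
def F(x1, x2):
    """Joint distribution function of two independent unif(0,1) coordinates."""
    clamp = lambda t: max(0.0, min(1.0, t))
    return clamp(x1) * clamp(x2)

def P_rect(a1, b1, a2, b2):
    """P((X1,X2) in (a1,b1] x (a2,b2]) via the signed sum of F at the 4 vertices."""
    return F(b1, b2) - F(a1, b2) - F(b1, a2) + F(a1, a2)

# Independence makes the answer a product of side lengths: 0.3 * 0.3
print(round(P_rect(0.2, 0.5, 0.1, 0.4), 2))  # → 0.09
```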


Let us explain how to write this in Rn . We let

R = (a1 , b1 ] × · · · × (an , bn ]

be a bounded rectangle in Rn and consider its 2^n vertices:

vertices(R) = {a1 , b1 } × · · · × {an , bn }.

(The sets {a1 , b1 }, . . . , {an , bn } have size 2 each, so their product has size 2 · · · 2 = 2^n.) We define
the sign of a vertex v ∈ vertices(R) by

sgn(v) = +1 if the number of ai in v is even, and sgn(v) = −1 if the number of ai in v is odd.

For example, with n = 2, we have the vertices A, B, C, D, whose signs are:

vertices: (b1 , b2 )   (a1 , b2 )   (b1 , a2 )   (a1 , a2 )
signs:        +            −            −            +

Condition (15.10) is written as

Σ_{v∈vertices(R)} sgn(v) F(v) ≥ 0.

We are now ready to define the concept of a distribution function without reference to a
random vector, just as we did on R in Definition 15.1.

Definition 15.5 (distribution function on Rn – the analog of Def. 15.1). We say that F : Rn → R
is a distribution function if it has the following properties:
1. First essential property. F(x1 , . . . , xn ) is increasing in each xi when the other arguments are
kept fixed and width(F) = 1.
2. Second essential property. For any bounded rectangle R = (a1 , b1 ] × · · · × (an , bn ] we have
Σ_{v∈vertices(R)} sgn(v) F(v) ≥ 0.
3. Conventional property. F is right-continuous in the sense that for all x and all ε > 0 there
exists δ > 0 such that F(x1 + δ, . . . , xn + δ) ≤ F(x1 , . . . , xn ) + ε, and F(x1 , . . . , xn ) → 0 when some
xi → −∞.

We now state the equivalent to Theorem 15.1 which says that if we are given a distribution
function then we have a unique probability measure with the given distribution function and
a random variable with the given distribution function.

Theorem 15.4 (multidimensional analog of Theorem 15.1). To each distribution function F
on Rn there corresponds a unique probability measure Q : B^n → [0, 1] such that

Q((−∞, t1 ] × · · · × (−∞, tn ]) = F(t1 , . . . , tn ),

for all (t1 , . . . , tn ) ∈ Rn , and, therefore, some random variable X = (X1 , . . . , Xn ) with law Q.

The things we discussed about random variables carry over to random vectors. But we
won’t expand on them. Rather, we just give a summary:

Summary
• Random vectors (X1 , . . . , Xn ) can be continuous (meaning that their law is nonatomic:
P((X1 , . . . , Xn ) = (x1 , . . . , xn )) = 0 for all (x1 , . . . , xn ) ∈ Rn ) and some continuous random vectors
have a probability density. These are called absolutely continuous random vectors.
• The concept of zero-length set in R and zero-area set in R2 generalizes to Rn . Rather
than saying “volume” we say n-volume. So 1-volume is length, 2-volume is area, 3-volume is
volume. We don’t need to define n-volume in general. We just need to define the n-volume of
a rectangle R = J1 × · · · × Jn and the concept of a zero-n-volume set. We set

n-vol(J1 × · · · × Jn ) := length(J1 ) · · · length(Jn ).

This is obvious: what else can we do other than multiply the lengths of the sides? We also say
that the set N ⊂ Rn is a zero-n-volume set if for all ε > 0 we can find a sequence of rectangles
such that N is included in their union and the sum of the n-volumes of these rectangles is at
most ε.
• If (X1 , . . . , Xn ) has density f (x1 , . . . , xn ) and we change this density on a zero-n-volume set,
then the changed function is also a density for (X1 , . . . , Xn ).
• There are continuous random vectors that are not absolutely continuous. These random
vectors are easier to come about when n ≥ 2 because there are plenty of zero-n-volume sets in
dimension n ≥ 2. See Figure 15.1.

?PROBLEM 15.13 (a continuous but not absolutely continuous random vector). Give an
example of a random vector (X1 , X2 ) that is continuous but not absolutely continuous.
Answer. Let Z be an absolutely continuous random variable (in R), e.g., let Z be N(0, 1). Then
define
(X1 , X2 ) = (Z, Z).

We then have: for all (x1 , x2 ) ∈ R2 ,

P((X1 , X2 ) = (x1 , x2 )) = P(Z = x1 , Z = x2 ) = 0,

in both cases: if x1 ≠ x2 (because Z = x1 = x2 is absurd when x1 ≠ x2 ), and if x1 = x2 (because
P(Z = x1 ) = 0 since Z is absolutely continuous).
Look at Figure 15.4.

Figure 15.4: Left: density of Z. Right: (Z, Z) is not absolutely continuous, so it does not have
density on R2 .

PROBLEM 15.14 (continuation of Problem 15.13). Let Z be a unif([0, 1]) random variable. Then
(X1 , X2 ) := (Z, Z) is not absolutely continuous. In fact, P((Z, Z) ∈ L) = 1, where L is the diagonal
line, so it is a zero-area set. But (Z, Z) does have a distribution function. Attempt to sketch the
distribution function of (Z, Z).
Answer.

P(X1 ≤ x1 , X2 ≤ x2 ) = P(Z ≤ x1 , Z ≤ x2 ) = P(Z ≤ min(x1 , x2 )) = min(x1 , x2 ),

for 0 ≤ x1 , x2 ≤ 1. Here is the sketch:

Figure 15.5: If we now take Z to be uniform on [0, 1], then (Z, Z) is continuous but, as
explained above, it has no density, so it is not absolutely continuous. Of course, its distribution
function F(x1 , x2 ) is continuous and its plot is very easy. It is a pyramid with base the square
[0, 1] × [0, 1] and apex the point (1, 1, 1).

15.6 Beyond Rn
We know that we need to study (and we have), not only about finitely many random variables
X1 , . . . , Xn , that is, a random vector X = (X1 , . . . , Xn ) in Rn , but also about infinitely many
random variables X1 , X2 , . . ., that is, an infinite-dimensional random vector X = (X1 , X2 , . . .) in
R∞ ≡ RN (= the set of sequences of real numbers).
We stated, in Theorem 12.1, and its equivalent form of Theorem 12.2, that such infinite-
dimensional random vectors, under the i.i.d. assumption, do exist.
And we understood that such things are important, else we cannot talk about tossing a
coin infinitely many times, something that we should absolutely be able to do, else we can’t do
probability or statistics.
We will also understand in Chapter 17 that we need to consider calculating probabilities of
events such as “a random sequence converges” because this is precisely what the Strong Law
of Large Numbers is about.
But we can’t approach the study of infinite-dimensional random vectors X = (X1 , X2 , . . .)
by things like densities. The reason being that, whereas R has a function called “length”,
enabling us to define a one-variable density f (x), and whereas R2 has a function called “area”,
enabling us to define a two-variable density f (x1 , x2 ), and whereas Rn has a function called
“n-volume”, enabling us to define an n-variable density f (x1 , x2 , . . . , xn ), the space R∞ does
not have an ∞-volume. We can (and often do) talk about the density of X = (X1 , X2 , . . .) but we
have to specify “with respect to what”. This is important, but we won’t learn this here. We
will simply trust Theorem 12.1, accept that i.i.d. sequences do exist, and move on.
The problem that a novice has is that he or she approaches the subject of probability/statistics
computationally: Compute the density of (X1 , X2 ), compute the expectation of g(X1 , X2 , X3 ),
etc. So the novice (and often his/her teachers) thinks that if something can’t be computed
then it either doesn’t exist or is useless. This (almost religious) belief shoves away all
interesting things and prevents the novice from ever obtaining the skills necessary to do the
job properly.

PROBLEM 15.15 (use the normal table). Let X1 , X2 , . . . be i.i.d. random variables with
common N(0, 1) law. Use Table 14.1 to calculate the probability

P(X1 ≤ 2, X2 ≤ 2, . . .).

Let B be any interval of R other than R itself. What is

P(X1 ∈ B, X2 ∈ B, . . .)?

Answer. By independence and Table 14.1,

P(X1 ≤ 2, X2 ≤ 2, . . .) = F(2) · F(2) · · · = 0.9772 · 0.9772 · · · = 0.

If B is any interval other than R then P(Xi ∈ B) < 1 (strictly). Multiplying a positive number
that is strictly smaller than 1 by itself infinitely many times gives zero. So P(X1 ∈ B, X2 ∈ B, . . .) = 0
as well. 

?PROBLEM 15.16 (a non-trivial “infinite” event). Let X1 , X2 , . . . be i.i.d. expon(1) random
variables. Compute

P(X2 ≤ 2 log 2, X3 ≤ 2 log 3, X4 ≤ 2 log 4, . . .).

Answer. By independence, the above probability is equal to the product

∏_{n=2}^∞ P(Xn ≤ 2 log n).

But

P(Xn ≤ 2 log n) = P(X1 ≤ 2 log n) = 1 − e^{−2 log n} = 1 − 1/n² = (n − 1)(n + 1)/n².

Hence the product equals

∏_{n=2}^∞ (n − 1)(n + 1)/n² = lim_{N→∞} ∏_{n=2}^N (n − 1)(n + 1)/n².

But this is simple because

∏_{n=2}^N (n − 1)(n + 1)/n² = ∏_{n=2}^N (n − 1)/n · (n + 1)/n
  = (1/2 · 3/2) · (2/3 · 4/3) · (3/4 · 5/4) · · · ((N − 1)/N · (N + 1)/N) = 1/2 · (N + 1)/N.

The last fraction converges to 1, so the product converges to 1/2, that is,

P(X2 ≤ 2 log 2, X3 ≤ 2 log 3, X4 ≤ 2 log 4, . . .) = 1/2.
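The telescoping can also be watched numerically; here is a small sketch of ours computing the partial products, which equal 1/2 · (N + 1)/N and tend to 1/2:

```python
def partial_product(N):
    """Partial product of (n-1)(n+1)/n^2 over n = 2, ..., N."""
    p = 1.0
    for n in range(2, N + 1):
        p *= (n - 1) * (n + 1) / n**2
    return p
```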

Chapter 16

Expectation, unadulterated

Expectation is now seen as a primary object, just as


fundamental as a probability measure. We define it
globally, without reference to continuity properties of the
probability measure. And we discuss, once more,
independence. A generating function is the expectation of
a certain class of functions of a given random variable or
random vector.

Let X : Ω → R be a random variable with real values. Recall that this means that there is a
class E of events such that X respects these events in the sense that sets of the form {X ≤ x} are
events.
If P is a probability measure on the events, then we have talked about the expectation E(X)
in two cases: when X is discrete, in which case we set E(X) = Σ_{x∈X(Ω)} x P(X = x), and when X is
absolutely continuous with density f , in which case we set E(X) = ∫_R x f (x) dx. When we want to emphasize the
role of P we write EP (X).
PROBLEM 16.1 (expectation under P and under Q). Let Ω = {H, T} and X(H) = 1, X(T) = −1.
We take as E all 4 subsets of Ω. Take two probability measures, P and Q, defined by P{H} = p,
P{T} = 1 − p, and Q{H} = q, Q{T} = 1 − q. What is EP (X)? What is EQ (X)?
Answer.
EP (X) = 1 · P(X = 1) + (−1) · P(X = −1) = p − (1 − p) = 2p − 1
EQ (X) = 1 · Q(X = 1) + (−1) · Q(X = −1) = q − (1 − q) = 2q − 1

But what do we do in general and why should we care?
First, let me explain why we should care.

1) A first great reason is that it is simpler to think in general than to consider all possible
special cases.
2) A second reason is that we don’t always know if X is discrete or if it is absolutely
continuous or neither.


3) A third reason is that Probability/Statistics is all about approximations. This in particular


means that we often need to do this: limn E(Xn ) = E(limn Xn ). When and why can we
do so?

4) A fourth reason is that E(X) is really an integral of a function on an abstract space Ω,


e.g., Ω could be a space of functions (that’s so, say, in Quantum Physics), but then what
do we mean by integral of a function on a space of functions?

5) A fifth reason is that we’re in the 21st century. This means that we can’t be doing the
same things they were doing in the 19th, can we? If you make the analogy: you now
have “smart” phones but they only had the telegraph, if at all. Why shouldn’t the same
thing apply in maths and stats?

16.1 Expectation via approximation


If x is a real number then, as usual, ⌊x⌋ is its integer part, the largest integer n ≤ x. In other
words, ⌊x⌋ is the unique integer satisfying

x − 1 < ⌊x⌋ ≤ x.

⌊x⌋ is the best integer approximation to x from below. But let’s say that we want a better
approximation: if ε > 0 is small we might want the best approximation from below
by an integer multiple of ε. We then define ⌊x⌋_ε to be this number:

x − ε < ⌊x⌋_ε ≤ x,   ⌊x⌋_ε ∈ {· · · , −ε, 0, ε, 2ε, 3ε, . . .}.

For example ⌊π⌋ = 3, ⌊π⌋_0.01 = 3.14. In fact, the two are related by

⌊x⌋_ε = ε⌊x/ε⌋.

Since x − ε < ⌊x⌋_ε ≤ x for all ε > 0 we immediately have that

lim_{ε→0} ⌊x⌋_ε = x.
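The approximation ⌊x⌋_ε = ε⌊x/ε⌋ is one line of code. A small sketch (the name floor_eps is ours):

```python
import math

def floor_eps(x, eps):
    """Best approximation of x from below by an integer multiple of eps:
    floor_eps(x, eps) = eps * floor(x / eps), so x - eps < floor_eps(x, eps) <= x."""
    return eps * math.floor(x / eps)
```

As eps shrinks, floor_eps(x, eps) converges to x, exactly as in the display above.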

If X is a general random variable, ⌊X⌋_ε is a discrete random variable. Hence we know what
its expectation is. We then attempt to define

E(X) := lim_{ε→0} E(⌊X⌋_ε) ?

This works; well, almost. Sometimes even the expectation of a discrete random variable may
not exist: see Problem 9.7. To avoid this problem we may assert that

if X > 0,   E(X) := lim_{ε→0} E(⌊X⌋_ε) ?

Yes, that will work, but since the expectation of a positive random variable may be infinity,
and we wish to deal with finite numbers, we simply truncate below a large number, say 1/ε,
before taking expectation. So we adopt the following definition.

Definition 16.1. If X is a positive random variable define

E(X) = lim_{ε→0} E(min(⌊X⌋_ε , 1/ε)).

If X is a general random variable then, since X = max(X, 0) − max(−X, 0), we define

E(X) = E(max(X, 0)) − E(max(−X, 0)), so long as not both numbers are +∞.
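Definition 16.1 can be watched in action. A sketch of ours for X ~ expon(1) (so E(X) = 1), using the exact probabilities P(kε ≤ X < (k + 1)ε) = e^{−kε} − e^{−(k+1)ε} rather than simulation:

```python
import math

def approx_mean_expon(eps):
    """E[min(floor_eps(X), 1/eps)] for X ~ expon(1), where floor_eps denotes
    the discretization of Definition 16.1 at level eps, truncated at 1/eps."""
    cap = 1.0 / eps
    total = 0.0
    k = 0
    while k * eps < cap:
        total += (k * eps) * (math.exp(-k * eps) - math.exp(-(k + 1) * eps))
        k += 1
    total += cap * math.exp(-k * eps)  # mass where the truncation at 1/eps bites
    return total
```

As eps decreases, the discretized, truncated expectation approaches E(X) = 1.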

But ⌊X⌋_ε → X is one way to approximate X by discrete random variables. What tells us
that if we have another approximation, say Xε → X, we will not get a different limit for E(Xε )?
This is the same dilemma that Archimedes could have faced in deriving that the area of the
circle of radius r is πr2 . That is, Archimedes probably wondered if the specific approximation
of a circle, e.g., by rectangles, as in Figure 14.4, or by regular polygons, gave exactly the
same limit, namely πr2 . He probably convinced himself that the limit is independent of the
approximation. He couldn’t, however, have proved that because he did not have the means
to do so. It took another 2 thousand years to be able to justify Archimedes’ hunch. Indeed, it
works, and the same thing holds about the definition above. The definition is good because
(note that we replace ε by 1/n where n are integers):

Theorem 16.1 (expectation is unambiguously defined). If Xn , n = 1, 2, . . . is a sequence of


finitely-valued random variables such that Xn ≤ Xn+1 for all n and Xn → X as n → ∞ then
limn→∞ E(Xn ) exists and equals the number defined in Definition 16.1.

We will not prove this theorem here, just as we proved no theorems in this course, but at
least we can understand what its statement is.
What is important is the following set of consequences of Definition 16.1 and Theorem 16.1.

PROPERTIES OF EXPECTATION
Linearity. If a, b are real numbers and X, Y random variables then
E(aX + bY) = aE(X) + bE(Y).
The reason for this is that linearity holds for discrete random variables. So it holds in the limit.
Monotonicity. If X, Y are random variables such that X ≤ Y then
E(X) ≤ E(Y).
The reason for this is that P never takes negative values!
Another thing we get out of this is that what we wondered above (see reason 3) is
true under some assumptions.

Theorem 16.2 (interchanging limit and expectation). If Xn is a sequence of random variables


such that limn→∞ Xn = X then

E( lim_{n→∞} Xn ) = lim_{n→∞} E(Xn )

provided that one of the following holds:

(i) For all n, 0 ≤ Xn ≤ Xn+1 ; or

(ii) there exists a positive random variable Z with E(Z) < ∞ such that, for all n, |Xn | ≤ Z.

PROBLEM 16.2 (scaling of exponential r.v.). Let τ(λ) be an expon(λ) random variable.
Explain why, for λ > 0,
(d/dλ) E[cos τ(λ)] = (1/λ) E[τ(λ) sin(τ(λ))].

Hint: Recall that if σ is expon(1) then τ(λ) has the same law as σ/λ; see Problem 14.11.
Answer. Since τ(λ) has the same law as σ/λ we can replace the former by the latter when
considering the expectation of a function of it. Observe that the derivative of the function

g(λ) = cos(σ/λ)

is

g′(λ) = (d/dλ) cos(σ/λ) = (σ/λ²) sin(σ/λ).

But a derivative is a limit:

(d/dλ) cos(σ/λ) = lim_{h→0} [g(λ + h) − g(λ)]/h.

By the mean value theorem,

[g(λ + h) − g(λ)]/h = g′(θ(h))

for some θ(h) between λ and λ + h. Hence

(d/dλ) cos(σ/λ) = lim_{h→0} g′(θ(h)).

Now let X_h := g′(θ(h)). The X_h play the same role as the X_n in Theorem 16.2. But

|X_h | = |g′(θ(h))| ≤ σ/θ(h)².

Now θ(h) is between λ − |h| and λ + |h|. Since λ > 0, if we let |h| < λ/2 we have λ − |h| > λ/2, so
θ(h) > λ/2 for |h| < λ/2, and so

|X_h | ≤ σ/(λ/2)² =: Z,   for all |h| < λ/2.

Obviously, E(Z) = 4/λ² < ∞. So, by (ii) of Theorem 16.2, we have

E[ lim_{h→0} (g(λ + h) − g(λ))/h ] = lim_{h→0} E[ (g(λ + h) − g(λ))/h ].

The left-hand side of this is E[g′(λ)] = E[(σ/λ²) sin(σ/λ)] = (1/λ) E[τ(λ) sin τ(λ)]. The right-hand
side is lim_{h→0} (E[g(λ + h)] − E[g(λ)])/h = (d/dλ) E[g(λ)] = (d/dλ) E[cos τ(λ)]. 
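The identity can be checked numerically (a sketch of ours, not part of the notes): write both expectations as integrals against the expon(1) density e^{−x}, approximate them by a midpoint rule, and compare a finite-difference derivative with the right-hand side. The grid sizes below are arbitrary choices.

```python
import math

def mean_of(f, steps=100000, upper=40.0):
    """Midpoint-rule approximation of E[f(sigma)] = int_0^inf f(x) e^{-x} dx."""
    h = upper / steps
    return sum(f((i + 0.5) * h) * math.exp(-(i + 0.5) * h) for i in range(steps)) * h

def lhs(lam, dh=1e-5):
    """Finite-difference derivative of lambda -> E[cos(sigma/lambda)]."""
    return (mean_of(lambda x: math.cos(x / (lam + dh)))
            - mean_of(lambda x: math.cos(x / (lam - dh)))) / (2 * dh)

def rhs(lam):
    """(1/lambda) E[(sigma/lambda) sin(sigma/lambda)]."""
    return mean_of(lambda x: (x / lam) * math.sin(x / lam)) / lam
```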
?PROBLEM 16.3 (integrating the tail gives the expectation). (See Problem 9.14 also.) Let
X be a positive random variable. Explain why

E(X) = ∫_0^∞ P(X > t) dt.

Answer. Consider the trivial integral

∫_0^x dt = x,

for any positive x. Write this as

x = ∫_0^∞ 1_{x>t} dt.

Hence

X = ∫_0^∞ 1_{X>t} dt,

and so

E(X) = ∫_0^∞ E(1_{X>t}) dt = ∫_0^∞ P(X > t) dt.
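The formula invites a numeric check (a sketch of ours): approximate the tail integral by a Riemann sum and compare with the known expectation for two familiar laws.

```python
def expectation_via_tail(survival, upper, steps=100000):
    """Midpoint Riemann-sum approximation of int_0^upper P(X > t) dt."""
    h = upper / steps
    return sum(survival((i + 0.5) * h) for i in range(steps)) * h
```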

16.2 The law of the unconscious statistician


The law of the unconscious statistician for discrete random variables states that if X : Ω → R
is a discrete random variable then its expectation can be computed in two ways:

EP (X) = Σ_{ω∈Ω} X(ω) P{ω} = Σ_{x∈X(Ω)} x P(X = x).    (16.1)

Students in very elementary probability are asked to explain this. In these notes, it appears as
the equality between (9.2) and (9.1). See Problem 9.3.

PROBLEM 16.4 (the law of the unconscious statistician, discretely). Let Ω be the set of all
the 2^d subsets of {1, . . . , d}. (This sample space was considered in Problem 8.10 of elementary
probability.) Give Ω the uniform probability P, that is, assign probability P{ω} = 1/2^d to each
ω ∈ Ω. Consider now the random variable

X : Ω → R,   X(ω) = |ω| = no. of elements of ω.

(1) Write down what (16.1) says.


(2) Explain why (16.1) holds.
(3) Compute the expectation of X.
Answer. (1) The left-hand side of (16.1) is

Σ_{ω∈Ω} X(ω) P{ω} = (1/2^d) Σ_{ω∈Ω} |ω|.

This is the sum of the sizes of all subsets divided by their total number. The right-hand side of
(16.1) is

Σ_{x∈X(Ω)} x P(X = x) = (1/2^d) Σ_{k=0}^d k C(d, k).

This is because the image of Ω under X is X(Ω) = {0, 1, . . . , d} and because the law of X (under
the uniform probability measure P on Ω) is bin(d, 1/2). Canceling the 1/2^d factor, (16.1) says

Σ_{ω∈Ω} |ω| = Σ_{k=0}^d k C(d, k).

(2) This holds because both sides count the sum of the sizes of all subsets: the left-hand side
directly, the right-hand side by classifying subsets according to their sizes.
(3) The expectation of a bin(d, 1/2) random variable is d/2. 
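The counting identity in (2) can be verified by brute force for a small d (a sketch of ours):

```python
from itertools import combinations
from math import comb

d = 4
# all 2^d subsets of {1, ..., d}
subsets = [set(c) for k in range(d + 1) for c in combinations(range(1, d + 1), k)]
lhs = sum(len(w) for w in subsets)                # sum of |w| over all subsets
rhs = sum(k * comb(d, k) for k in range(d + 1))   # sum of k * C(d, k)
expectation = lhs / 2**d                          # E(X) under the uniform P
```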

The general scheme


Pay attention now. Consider a series of sets and functions:
Ω0 −G1→ Ω1 −G2→ Ω2 −G3→ Ω3

(We can go on and make a longer chain if we wish.) Consider a set of events Ei on each Ωi . If
these functions “respect the events” (this is called measurability) then we can think of them as
random variables. That is G1 is a random variable on Ω0 , G2 is a random variable on Ω1 , and
so on. In addition, we can compose these functions, and have, for example,

G3 (G2 (G1 )) is a r.v. on Ω0 , G3 (G2 ) is a r.v. on Ω1 , G3 is a r.v. on Ω2 .

Now pick a probability measure P0 on Ω0 and consider


Ω0 −G1→ Ω1 −G2→ Ω2 −G3→ Ω3,   with measures P0 , P1 , P2 , P3 on Ω0 , Ω1 , Ω2 , Ω3
P1 is the law of G1 under P0
P2 is the law of G2 under P1
P3 is the law of G3 under P2

Theorem 16.3 (the law of the unconscious statistician). In the above situation, if Ω1 =
Ω2 = Ω3 = R, we have

EP0 [G3 (G2 (G1 ))] = EP1 [G3 (G2 ))] = EP2 [G3 ]

This is really a theorem about “change of variables”. Proving it, we really have to prove it
for two stages only. Consider
Ω −X→ R −g→ R,   with P on Ω and Q on R.

OK, I changed letters. I used Ω for Ω0 , I used X for G1 and g for G2 , and I set Ω1 and Ω2 equal
to R, as we are required to do. The law of the unconscious statistician states that

EP [g(X)] = EQ [g].

But we know that it holds for discrete random variables. In particular, for each ε > 0 we have
that ⌊g(X)⌋_ε and ⌊g⌋_ε are discrete random variables. We then have (this is an application of
(16.1))

EP [⌊g(X)⌋_ε ] = EQ [⌊g⌋_ε ].

The discussion at the first part of Section 16.1 aims to convince you that taking limits as ε → 0
in the last display gives the display above it: EP [g(X)] = EQ [g]. It is, really, a very simple
matter.

New notation

Instead of writing EP (X) many people write ∫_Ω X(ω) P(dω) or simply ∫_Ω X dP.
If Q is a probability measure on R that is given by a density f (x),

Q(B) = ∫_B f (x) dx,

and if id : R → R is the identity function (defined by id(x) = x), then

EQ [id] = ∫_R x f (x) dx.

Hence the expression

EP (X) = ∫_R x f (x) dx
for a continuous random variable X is not a definition of its expectation but, rather, a
consequence of the law of the unconscious statistician:
Ω −X→ R −id→ R,   with P on Ω and Q on R,

that says
EP [X] = EQ [id].

16.3 Independence, revamped


• In elementary probability one learns what is meant by independence between events
(Section 11.3) and independence between discrete random variables (Section 11.4).
• In this course, we defined independence between components of an absolutely continuous
random vector (X1 , . . . , Xn ) (Section 14.5.3) in a rather unpleasant way, given by (14.7), which
says that the joint density factorizes as a product of one-variable functions.

Remark 16.1. Let X1 , . . . , Xn be finitely many random variables, Xi : Ω → R, i = 1, . . . , n.


(a) If each Xi is discrete then (X1 , . . . , Xn ) is discrete. Indeed, Xi being discrete means that the
set Si = Xi (Ω) of its values is countable. Therefore, S1 × · · · × Sn is also countable.
(b) If each Xi is absolutely continuous, then (X1 , . . . , Xn ) is not necessarily absolutely continuous:
see Problem 15.13. This is why I say that (14.7) defines independence between components of
an absolutely continuous random vector. I did not write “independence between absolutely
continuous random variables” because that would be wrong.

So what do we do? This is rather ugly. Independence should not rely on our ability to
know whether, jointly, a bunch of random variables have density or not.
• We also hinted that we can talk about the independence of infinitely many random variables.
And also, in our FORESIGHTS section we stated, as Theorem 12.1, that if we are given a
probability distribution Q then we can find an infinite sequence of independent random
variables with common law Q, and we stated that this is deep, but I am sure you don’t yet see
why it is deep.
So, how do we revamp the notion of independence and define it once and for all?

Definition 16.2. Let X = {Xt , t ∈ T} be a collection of random vectors, indexed by a set T. We
say that they are independent under a given P, if

P(X_{t1} ∈ B1 , . . . , X_{tn} ∈ Bn ) = P(X_{t1} ∈ B1 ) · · · P(X_{tn} ∈ Bn ),

for any integer n ≥ 2, for any finite set {t1 , . . . , tn } ⊂ T of size n, and for any n sets B1 , . . . , Bn .

Some facts.
• If {Xt , t ∈ T} are independent then {Xt , t ∈ S} are independent for any S ⊂ T.
• If {Xt , t ∈ T} are independent and S1 , . . . , Sk are finitely many disjoint finite subsets of T
then the random vectors (Xt , t ∈ S1 ), . . . , (Xt , t ∈ Sk ) are independent. Moreover, for given
R-valued functions g1 , . . . , gk , the random variables G1 = g1 (Xt , t ∈ S1 ), . . . , Gk = gk (Xt , t ∈ Sk )
are independent. And then

E[G1 · · · Gk ] = (EG1 ) · · · (EGk ).

• The following are equivalent


(1) X1 , . . . , Xn are independent random variables
(2) P(X1 ∈ I1 , . . . , Xn ∈ In ) = P(X1 ∈ I1 ) . . . P(Xn ∈ In ), for all intervals I1 , . . . , In .
(3) P(X1 ≤ x1 , . . . , Xn ≤ xn ) = P(X1 ≤ x1 ) · · · P(Xn ≤ xn ), for all numbers x1 , . . . , xn : the joint
distribution function is the product of individual distribution functions.
(4) P(X1 > x1 , . . . , Xn > xn ) = P(X1 > x1 ) · · · P(Xn > xn ) for all numbers x1 , . . . , xn .
(5) in case that X1 , . . . , Xn are discrete, P(X1 = x1 , . . . , Xn = xn ) = P(X1 = x1 ) · · · P(Xn = xn ), for
all x1 , . . . , xn .
(6) in case that (X1 , . . . , Xn ) has density f (x1 , . . . , xn ) on Rn , f (x1 , . . . , xn ) = f1 (x1 ) · · · fn (xn ), for
all x1 , . . . , xn , where fi is the density of Xi .

PROBLEM 16.5 (if you’re independent of yourself then you’re not random). Let X be a real
random variable that is independent of itself. Show that it is a constant, i.e., that there is a real
number c such that P(X = c) = 1.
Answer. If X is independent of X then P(X ≤ x, X ≤ x) = P(X ≤ x)P(X ≤ x). But the left-hand
side is P(X ≤ x). Hence P(X ≤ x) = P(X ≤ x)². The only two real numbers that are equal to their
own square are 0 and 1. So P(X ≤ x) is equal to 0 or 1 for all x. Since P(X ≤ x) → 1 as x → ∞ and
→ 0 as x → −∞, we can define c to be the largest x such that P(X < x) = 0. Then P(X < c) = 0
and P(X ≤ c) = 1. Hence P(X = c) = 1 − 0 = 1. 

16.4 Convex functions and moments

16.4.1 Convex functions of random variables


You need to learn about moments (next section). Moments are expectations of integer powers.
Integer powers are convex functions. This is why you need to learn about expectations of
convex functions. Convex functions are generalizations of functions whose graphs are straight
lines, called affine functions.
If we take an affine function, that is

g(x) = ax + b, x∈R

(I don’t call this linear because, in general b is not 0), whose graph is a straight line, then

E[g(X)] = g(EX).

If we take two affine functions, gi (x) = ai x + bi , i = 1, 2, then, since max(g1 (X), g2 (X)) ≥ gi (X),
we have E[max(g1 (X), g2 (X))] ≥ E[gi (X)], for i = 1, 2, and so

E[max(g1 (X), g2 (X))] ≥ max(E[g1 (X)], E[g2 (X)]).

This is true for any number of affine functions, even for uncountably many ones. A function
of the form
g = sup_{t∈T} gt ,

where each gt is affine and T is any set is called convex. Examples of convex functions are
g(x) = x2 , g(x) = ex , g(x) = e−x , g(x) = − log x, g(x) = |x|. Applying the observation above we
find
Jensen’s inequality: E[g(X)] ≥ g(EX), for any convex function g.
This applies to random vectors too. If g(x1 , . . . , xn ) is a convex function of n variables and
(X1 , . . . , Xn ) is a random vector then

E[g(X1 , . . . , Xn )] ≥ g(EX1 , . . . , EXn ), for any convex function g : Rn → R.
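Jensen's inequality can be observed on any finite sample, because a sample average is itself an expectation (under the empirical measure). A sketch of ours:

```python
def empirical_jensen(g, xs):
    """Return (E[g(X)], g(E[X])) under the empirical distribution of the sample xs."""
    mean = sum(xs) / len(xs)
    return sum(map(g, xs)) / len(xs), g(mean)
```

For any convex g (e.g. x², |x|, e^x) the first returned number is at least the second.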



16.4.2 Moments
If X is a random variable and k a positive integer, we define

mk = k-moment of X := E(Xk ),

provided it exists. Of course, m0 = 1. Even moments are nonnegative, odd moments are
signed. Moments are sometimes important for several reasons, one of which being that
sometimes, moments determine the distribution.

Theorem 16.4 (moments define a unique probability law). If m1 , m2 , . . . are such that the
series

Σ_{k=0}^∞ mk z^k

converges on |z| < ρ for some ρ > 0, then there is only one law whose moments are m1 , m2 , . . ..

We can define moments for real exponents as well, but then we must be careful about the
sign of the random variable. So we deal with absolute moments if the exponent is real, namely,

µp := E(|X|^p ).

Note that this could be finite or infinity. We have the moments inequality

µp^{1/p} ≤ µq^{1/q}   if 0 < p < q.

And here is why. If 0 < p < q then the function g(x) = xq/p , x ≥ 0, is convex. Hence
E[g(Z)] ≥ g(EZ) for any random variable Z. Let Z = |X|p . Then g(Z) = (|X|p )q/p = |X|q and
g(EZ) = (E[|X|p ])q/p . So E[|X|q ] ≥ (E[|X|p ])q/p . Raise this to the power 1/q to get the inequality
above.
Of special importance is the second moment:

m2 = µ2 = E(X2 ).

Jensen’s inequality gives


E(X2 ) ≥ (EX)2 ,
and the absolute moments inequality gives

(E(X²))^{1/2} ≥ E|X|,

which is really the same as the above with X replaced by |X|.

16.5 Variance and covariance and correlation and Cauchy-Schwarz


The variance of X is given by

var X = E(X²) − (EX)² = E[(X − EX)²],

where the equality follows from expanding the square on the right. The standard deviation
of X is given by

stdev(X) := √(var X).

We also define the inner product or correlation between two random variables X, Y by

⟨X, Y⟩ = E(XY),    (16.2)

and their covariance by

cov(X, Y) = ⟨X − EX, Y − EY⟩ = E((X − EX)(Y − EY)) = E(XY) − (EX)(EY),

where the second equality follows by expanding the product on the left.
The correlation coefficient1 between X and Y is the number

corr(X, Y) = cov(X, Y) / (stdev(X) stdev(Y)).
We have

−1 ≤ corr(X, Y) ≤ 1    (16.3)

and this is the Cauchy-Schwarz inequality, stated as follows:

(E(UV))² ≤ E(U²) E(V²),    (16.4)

for any two random variables U, V. To see this, first take the obviously true statement:

0 ≤ (tU + V)² = U²t² + 2UVt + V².

So

0 ≤ E[(tU + V)²] = E(U²)t² + 2E(UV)t + E(V²).

Multiply both sides by E(U²) and add and subtract (E(UV))² to get

0 ≤ (E(U²))²t² + 2E(UV)E(U²)t + (E(UV))² − (E(UV))² + E(U²)E(V²)
  = [E(U²)t + E(UV)]² − [(E(UV))² − E(U²)E(V²)].

This is true for all t ∈ R. Assuming that E(U²) > 0, we can choose t so that
E(U²)t + E(UV) = 0, and then

0 ≤ 0 − [(E(UV))² − E(U²)E(V²)],

which is (16.4). But if E(U²) = 0 then P(U = 0) = 1, so the inequality is trivial: 0 ≤ 0. To get
(16.3), set U = X − EX and V = Y − EY.
Since the correlation coefficient is always between −1 and 1, there is an angle θ such that
corr(X, Y) = cos(θ). So we define the angle between X − EX and Y − EY by

θ = arccos corr(X, Y).

We can agree that −π ≤ θ(X, Y) < π.
We say that X, Y are uncorrelated if corr(X, Y) = 0. This means that θ(X, Y) = ±π/2
and so we can say that the angle between X − EX and Y − EY is ±π/2. We summarize
1
Some people use the term “correlation” for what I call “correlation coefficient” and use another name for
what I call inner product. But names, just as influencers, are a dime a dozen, they come and go. Please use my
terminology.

X, Y are uncorrelated ⇐⇒ corr(X, Y) = 0 ⇐⇒ hX − EX, Y − EYi = 0 ⇐⇒


E(XY) = (EX)(EY) ⇐⇒ angle(X − EX, Y − EY) = ±π/2.
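Here is the sample analogue of these definitions, together with a check that the correlation coefficient indeed lands in [−1, 1] (a sketch of ours; the function name corr is our choice):

```python
import math

def corr(xs, ys):
    """Sample correlation coefficient: cov(X, Y) / (stdev(X) stdev(Y))."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    vx = sum((x - mx) ** 2 for x in xs) / n
    vy = sum((y - my) ** 2 for y in ys) / n
    return cov / math.sqrt(vx * vy)
```

A perfectly linear relation with positive slope gives corr = 1, with negative slope corr = −1; this is the equality case of Cauchy-Schwarz.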

16.6 Markov and Chebyshev inequalities


We learn Markov’s inequality in elementary classes. See Problem 9.8.
Let now Z be a general positive random variable. Markov’s inequality states

P(Z > t) ≤ EZ / t.    (16.5)

Here is why. We have, since Z > 0,

Z ≥ Z 1_{Z>t} ≥ t 1_{Z>t}.

Hence

E(Z) ≥ E(t 1_{Z>t}) = t P(Z > t).

Assume that X is an arbitrary random variable. Chebyshev’s inequality states

P(|X − EX| > t) ≤ var(X) / t².

Here is why. We have P(|X − EX| > t) = P((X − EX)² > t²) ≤ E[(X − EX)²]/t², by Markov’s
inequality.
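Both inequalities can be watched on simulated data; in fact they hold exactly for the empirical distribution of any sample, with the sample mean and variance in place of EZ and var(X). A sketch of ours (seed and sample size are arbitrary):

```python
import random

random.seed(42)
sample = [random.expovariate(1.0) for _ in range(100000)]  # Z ~ expon(1)
mean = sum(sample) / len(sample)
var = sum((z - mean) ** 2 for z in sample) / len(sample)

def markov_gap(t):
    """(empirical P(Z > t), Markov bound E(Z)/t)."""
    return sum(z > t for z in sample) / len(sample), mean / t

def chebyshev_gap(t):
    """(empirical P(|Z - EZ| > t), Chebyshev bound var(Z)/t^2)."""
    return sum(abs(z - mean) > t for z in sample) / len(sample), var / t**2
```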
?PROBLEM 16.6 (a zero variance r.v. is trivial). Show that if var(X) = 0 then P(X = EX) = 1.
Answer. By Chebyshev’s inequality, P(|X − EX| > t) = 0 for all t > 0. Taking the limit as t → 0
we find P(|X − EX| > 0) = 0, that is P(|X − EX| = 0) = 1. 

16.7 Expectation of special functionals


We saw in Section 15 that the distribution of a random variable X or a random vector
(X1 , . . . , Xn ) is specified completely once we know its distribution function F(x) or F(x1 , . . . , xn )
(Theorems 15.1 and 15.4). But

1 − F(x) = E[gx (X)],   where gx (X) = 1_{X∈(x,∞)}.

Read this as follows: if we know the expectation of gx (X) for all x then we know F(x) and so
we know the distribution of X.
So, it is conceivable that

there are families, say G, of functions such that if we know the expectation of g(X)
for each g ∈ G then we know the distribution of X.
CHAPTER 16. EXPECTATION, UNADULTERATED 175

Example 1: The class G = {1_{(x,∞)} : x ∈ R} specifies the distribution of any random variable.
Example 2: Set pk (u) = u^k. The class of functions P = {pk : k = 0, 1, . . .} specifies the
distribution of a random variable under the condition of Theorem 16.4.
We will consider two special families of functions.
• Set hs (x) = s^x and consider the class H = {hs : s ∈ [0, r)} for some r > 0. This will lead us
to the concept of probability generating functions.

Definition 16.3 (probability generating function). Let X be a random variable with values
in Z+ = {0, 1, 2 . . .}, the set of nonnegative integers. Define its probability generating function
by

G(s) = E[s^X] = Σ_{k=0}^∞ pk s^k,   where pk = P(X = k), k = 0, 1, . . .

• Set wt (x) = e^{tx} and consider the class W = {wt : −a < t < b} for some −a < 0 < b. This
will lead us to the concept of moment generating functions.

Definition 16.4 (moment generating function). Let X be a random variable with values in R.
Define its moment generating function by

M(t) = E e^{tX},   t ∈ F,

where F = {t ∈ R : E e^{tX} < ∞}.

Note that M(t) cannot, in general, be expressed as a sum or as an integral against a density,
because a general random variable X is not necessarily discrete nor absolutely continuous.
Nevertheless, M(t) always exists for t ∈ F. We will see that, in some cases, F = {0}, in which
case M(t) is useless (being finite only at t = 0, where M(0) = 1, and ∞ for t ≠ 0).

16.8 The probability generating function of an integer-valued random variable
Suppose that the random variable X takes values in the set of integers Z+ = {0, 1, 2, . . .}.
Why should we do that? There are two good reasons:
Reason 1: Often, such random variables appear when counting random discrete sets. Counts
are, of course, Z+ –valued things.
Reason 2: See Bonus Section 16.8.1 below.
When X is Z+ –valued its law is determined by its “probability mass function”

pn = P(X = n),   n = 0, 1, . . .

We view this as a sequence of positive real numbers and consider the power series

Σ_{k=0}^∞ pk s^k = p0 + p1 s + p2 s² + · · ·

This series may or may not converge (remember criteria for convergence from your Calculus)
depending on s. When it does, it defines a function

G(s) = Σ_{k=0}^∞ pk s^k = E s^X,    (16.6)

for those s for which the series converges, and G is called the probability generating function.
Note that if |s| < 1 then the series certainly converges, and converges absolutely, because

Σ_{k=0}^∞ |pk s^k| = Σ_{k=0}^∞ pk |s|^k ≤ Σ_{k=0}^∞ pk = 1.

PROBLEM 16.7 (probability generating function of geo(p)). Consider a geo(p) random
variable X and find an explicit formula for its probability generating function. Explain what
the radius of convergence of the series means and find it.

Answer. We have p_m = P(X = m) = (1 − p)^{m−1} p, m = 1, 2, . . . Hence

∑_{m=1}^∞ p_m s^m = ∑_{m=1}^∞ (1 − p)^{m−1} p s^m = ps ∑_{m=1}^∞ ((1 − p)s)^{m−1} = ps / (1 − (1 − p)s).

The radius of convergence r is the maximum r for which the series converges on |s| < r. We can
find r either by the root test or by looking at the formula above, whose denominator becomes
0 when s = 1/(1 − p), meaning that at this point G has a pole (it becomes ∞). Hence the radius
of convergence is r = 1/(1 − p). (All that is standard Calculus material.) □
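As a quick numerical sanity check (my own illustration, not part of the notes; the values p = 0.3 and s = 0.8 are arbitrary choices inside the radius of convergence), one can truncate the series and compare it with the closed form ps/(1 − (1 − p)s):

```python
# Truncated version of sum_{m>=1} (1-p)^(m-1) p s^m versus the closed form.
p = 0.3
s = 0.8  # |s| < 1/(1-p) = 1/0.7, so the series converges here

series = sum((1 - p) ** (m - 1) * p * s ** m for m in range(1, 200))
closed_form = p * s / (1 - (1 - p) * s)
print(series, closed_form)  # the two numbers agree to many decimal places
```

The remaining tail after 200 terms is of order ((1 − p)s)^{200}, which is negligible here.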
Remark 16.2. Even though the series ∑_{k=0}^∞ p_k s^k has a radius of convergence, the function
obtained can be extended to larger domains. For example, in the problem above, the series
converges on |s| < 1/(1 − p), but the resulting formula, after performing the summation, namely
ps/(1 − (1 − p)s), is defined for all s except s = 1/(1 − p); that is, it is an extension. At this special
point we may define it to be ∞. When we speak of a probability generating function we mean
the extension of the function.
?PROBLEM 16.8 (differentiation of power series). If G(s) is the function defined by the
series (16.6), explain why

d^m/ds^m G(s) = ∑_{k=0}^∞ p_k d^m/ds^m s^k.

Answer. The series ∑_{k=0}^∞ p_k s^k converges uniformly with respect to s on every set {s : |s| ≤ r}
with r < 1. In this case, as we learned in Calculus, we can differentiate the power series, as
many times as we like, term by term, and we will be getting the derivatives of G(s). □
Recall the notion of the m-falling factorial–see (8.1)–of a number k, where m is a positive integer:
(k)_m = k(k − 1) · · · (k − m + 1) (we need to set (k)_0 = 1). When m = k we have another notation:
(m)_m = m!. These numbers appear when we differentiate m times the monomial s^k:

d^m/ds^m s^k = k(k − 1) · · · (k − m + 1) s^{k−m} = (k)_m s^{k−m},  if k ≥ m,
             = 0,  otherwise.

Using this and Problem 16.8 we get

d^m/ds^m G(s) = ∑_{k=m}^∞ (k)_m p_k s^{k−m}.

The first term of the series above, corresponding to k = m, does not depend on s. Let us rewrite

d^m/ds^m G(s) = m! p_m + ∑_{k=m+1}^∞ (k)_m p_k s^{k−m}.

If we therefore set s = 0 we obtain

P(X = m) = p_m = (1/m!) (d^m G/ds^m)(0).     (16.7)
This is why G is called probability generating function.
A first property of the probability generating function is that it fits like a glove with
independence:

PROPERTY A. Let X_1, . . . , X_n be independent Z+–valued random variables with
probability generating functions G_i(s) = E s^{X_i}, i = 1, . . . , n. Letting G(s) = E s^{X_1+···+X_n}
be the probability generating function of X_1 + · · · + X_n, we have

G(s) = E s^{X_1+···+X_n} = (E s^{X_1}) · · · (E s^{X_n}) = G_1(s) · · · G_n(s).

The reason is simple: s^{X_1+···+X_n} = (s^{X_1}) · · · (s^{X_n}). Since X_1, . . . , X_n are independent, so are
s^{X_1}, . . . , s^{X_n}. The expectation of the product of independent random variables is the product of
their expectations. See facts on the expectation above.
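A small computer illustration of Property A (my own check, not part of the notes): for two independent fair dice, the pgf of the sum, computed directly from the convolved pmf, equals the product of the two individual pgfs, exactly, if we use rational arithmetic.

```python
from fractions import Fraction

# Illustrative check of Property A with two independent fair dice X1, X2.
die = {k: Fraction(1, 6) for k in range(1, 7)}

# pmf of X1 + X2 by direct convolution
sum_pmf = {}
for a, pa in die.items():
    for b, pb in die.items():
        sum_pmf[a + b] = sum_pmf.get(a + b, Fraction(0)) + pa * pb

s = Fraction(1, 2)
G1 = sum(prob * s ** k for k, prob in die.items())         # E s^X1
G_sum = sum(prob * s ** k for k, prob in sum_pmf.items())  # E s^(X1+X2)
print(G_sum == G1 * G1)  # exact equality, thanks to Fraction arithmetic
```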
?PROBLEM 16.9 (probability generating functions of some common discrete r.v.s). Find
expressions for the probability generating function of the r.v. X when X is
(0) unif({k_1, . . . , k_n})
(1) Ber(p)
(2) bin(n, p)
(3) Poi(λ)
What is the radius of convergence of each power series?

Answer. (0) We have p_{k_1} = p_{k_2} = · · · = p_{k_n} = 1/n, so

E s^X = (1/n)(s^{k_1} + · · · + s^{k_n}).

(1) We have p_1 = p, p_0 = 1 − p, so

E s^X = ps + (1 − p).

(2) We have p_m = (n choose m) p^m (1 − p)^{n−m}, for m = 0, 1, . . . , n. So

E s^X = ∑_{m=0}^n (n choose m) s^m p^m (1 − p)^{n−m} = ∑_{m=0}^n (n choose m) (sp)^m (1 − p)^{n−m} = (sp + (1 − p))^n.

(3) We have p_m = (λ^m/m!) e^{−λ}, for m = 0, 1, . . ., so

E s^X = ∑_{m=0}^∞ s^m (λ^m/m!) e^{−λ} = e^{−λ} ∑_{m=0}^∞ (λs)^m/m! = e^{sλ} e^{−λ} = e^{−λ(1−s)}.

All radii of convergence are ∞. This is obvious for (0), (1), (2), because they're all finitely-valued
random variables. For (3) it follows from the fact that ∑_{m=0}^∞ z^m/m! converges for all z (easily
shown by convergence criteria for series–intuitive too because m! grows very fast and it is in
the denominator). □
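The closed forms in parts (2) and (3) can also be checked numerically (an illustrative sketch of mine; the parameters n = 10, p = 0.4, λ = 2.5 and the point s = 0.7 are arbitrary):

```python
import math

# Compare direct (truncated) sums for E s^X with the closed forms above.
s = 0.7

# bin(n, p): finite sum versus (sp + (1-p))^n
n, p = 10, 0.4
bin_series = sum(math.comb(n, m) * p**m * (1 - p)**(n - m) * s**m
                 for m in range(n + 1))
bin_closed = (s * p + (1 - p)) ** n

# Poi(lam): truncated series versus e^{-lam(1-s)}
lam = 2.5
poi_series = sum(s**m * lam**m / math.factorial(m) * math.exp(-lam)
                 for m in range(100))
poi_closed = math.exp(-lam * (1 - s))
print(bin_series, bin_closed, poi_series, poi_closed)
```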

PROBLEM 16.10 (probability generating function of the sum of independent Poisson r.v.s).
Let N_1, N_2 be two independent Poisson r.v.s with laws Poi(λ_1), Poi(λ_2), respectively. Determine
the probability generating function of their sum. What do you observe?

Answer.

E s^{N_1+N_2} = (E s^{N_1})(E s^{N_2}) = e^{−λ_1(1−s)} e^{−λ_2(1−s)} = e^{−(λ_1+λ_2)(1−s)}.

We observe that the probability generating function of N_1 + N_2 is that of a Poi(λ_1 + λ_2) random
variable. □
Question: Can we conclude that, in the above problem, N_1 + N_2 is itself a Poi(λ_1 + λ_2)
random variable? Indeed we can because:

PROPERTY B. If two Z+–valued random variables X, Y have the same probability
generating function then X =_d Y (they have the same distribution):

E s^X = E s^Y for all s  ⇒  X =_d Y.

The reason is easy. We know that G(s) = E s^X is defined and analytic on |s| < 1. This means
that G(s) is given by an infinite Taylor series (expansion around 0):

G(s) = ∑_{m=0}^∞ (G^{(m)}(0)/m!) s^m,

where G^{(m)}(0) is the m-th derivative of G(s) at s = 0. Let G_1(s) = E s^X, G_2(s) = E s^Y and assume
they are equal: G_1(s) = G_2(s). Then G_1^{(m)}(0) = G_2^{(m)}(0) for all m and, by (16.7), we have
P(X = m) = P(Y = m) for all m. That is, X =_d Y.
So, in relation to Problem 16.10, we have this very special, very important and very much
treasured property:

If N_1, . . . , N_n are independent random variables with laws Poi(λ_1), . . . , Poi(λ_n),
respectively, then their sum N_1 + · · · + N_n is a Poi(λ_1 + · · · + λ_n) random variable.

This generalizes to infinitely many Poisson random variables:

If N_1, N_2, . . . are independent random variables with laws Poi(λ_1), Poi(λ_2), . . .,
respectively, such that ∑_{j=1}^∞ λ_j < ∞, then their sum N = ∑_{j=1}^∞ N_j is a Poi(∑_{j=1}^∞ λ_j)
random variable.

?PROBLEM 16.11 (thinning). Let N, ξ_1, ξ_2, . . . be independent random variables such that N
is Poi(λ) and ξ_1, ξ_2, . . . are Ber(p) each. Let S_0 = 0 and

S_n = ∑_{i=1}^n ξ_i,  n ≥ 1.

Determine the distribution of the random variable S_N.

Answer. By a trivial property of indicator functions (see (9.9)) we have

s^{S_N} = ∑_{n=0}^∞ s^{S_n} 1_{N=n}.

This says that s^{S_N} = s^{S_n} if N = n. Duh! Using independence between S_n and N we have

E s^{S_N} = ∑_{n=0}^∞ (E s^{S_n}) P(N = n).

But S_n is a bin(n, p) r.v. By Problem 16.9, E s^{S_n} = (sp + (1 − p))^n. So

E s^{S_N} = ∑_{n=0}^∞ (sp + (1 − p))^n P(N = n) = E[(sp + (1 − p))^N].

By Problem 16.9 again, E z^N = e^{−λ(1−z)}. Setting z = sp + (1 − p) we have

E[(sp + (1 − p))^N] = e^{−λ(1−sp−(1−p))} = e^{−λp(1−s)}.

We thus found

E s^{S_N} = e^{−λp(1−s)}.

But this is the probability generating function of a Poi(λp) random variable. Hence the
distribution of S_N is Poi(λp). □
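A simulation sketch of this thinning result (my own illustration; the Poisson sampler and the parameters λ = 6, p = 0.25 are choices of mine, not from the notes): each of N ~ Poi(λ) items is kept independently with probability p, and the kept count S_N should behave like Poi(λp), so its sample mean and sample variance should both be near λp = 1.5.

```python
import random

random.seed(0)

def poisson(lam):
    # Sample Poi(lam) as the number of arrivals of a rate-1 Poisson
    # process before time lam (exponential inter-arrival times).
    n, t = 0, random.expovariate(1.0)
    while t < lam:
        n += 1
        t += random.expovariate(1.0)
    return n

lam, p, trials = 6.0, 0.25, 50_000
kept = []
for _ in range(trials):
    N = poisson(lam)
    kept.append(sum(1 for _ in range(N) if random.random() < p))

mean = sum(kept) / trials
var = sum((k - mean) ** 2 for k in kept) / trials
print(mean, var)  # both should be close to lam * p = 1.5
```

That the mean and the variance agree is itself a Poisson fingerprint.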

16.8.1 Bonus section

We spoke about the radius of convergence, but you may say it's useless. Not quite. Its
usefulness really shows up when the random variable X takes values in the whole set of
integers Z, not just Z+. We may still define

G(s) = ∑_{k=−∞}^∞ p_k s^k,

and understand this sum as the sum of two series:

G_+(s) = ∑_{k=0}^∞ p_k s^k,    G_−(s) = ∑_{k=−∞}^{−1} p_k s^k = ∑_{k=1}^∞ p_{−k} (1/s)^k.

The first series converges when |s| < 1 and the second when |1/s| < 1. But the set of s satisfying
both |s| < 1 and |1/s| < 1 is empty. So we really need to have established convergence of both series
on bigger domains. Luckily, we know from Calculus the concept of radius of convergence. If
the first series converges for |s| < R and the second for |1/s| < r, and if rR > 1, then both series
converge for 1/r < |s| < R. The point of this discussion is that

If X takes values in Z then G(s) = ∑_{k=−∞}^∞ p_k s^k may not exist for any
s ≠ 0. Careful consideration of the radii of convergence for the series
corresponding to the sequences p_0, p_1, p_2, . . . and p_{−1}, p_{−2}, . . . is needed.

16.9 The moment generating function of a real-valued random variable

Let X be a random variable. People define the function

M(t) = E e^{tX},  for those t for which the expectation is finite,

and call it moment generating function. But just like in §16.8.1, this may be infinity. The only
t for which M(t) is for sure defined is t = 0. Indeed,

M(0) = E e^0 = 1.

This is useless information. It can be shown that the set of real t's for which M(t) is finite is an
interval. Here is why. Write

E e^{tX} = E[e^{tX} 1_{X≥0}] + E[e^{tX} 1_{X<0}].

If the first term on the right is finite for some t then it is finite for all smaller t, because, when
X ≥ 0, e^{tX} decreases when t decreases. Hence the first term is finite on an interval −∞ < t ≤ b
for some b ≥ 0. Similarly, the second term is finite on an interval a ≤ t < ∞ for some a ≤ 0.
Putting both together, we have that E e^{tX} < ∞ on an interval a ≤ t ≤ b.

Definition 16.5 (useless moment generating function). We say that the random variable X
has a useless moment generating function if E e^{tX} < ∞ only when t = 0.

If M(t) = E e^{tX} is finite on some interval that includes 0 in its interior then E|X|^k < ∞ for all
k = 1, 2, . . . and

M(t) = ∑_{k=0}^∞ (t^k/k!) E X^k

for all t on this interval. Moreover, we can recover the moments of X via

M^{(k)}(0) = E X^k,  k = 1, 2, . . . .

And this is why M is called moment generating function.



To understand why this is true, we remember that e^{tx} is the limit of the polynomials
p_n(tx) = ∑_{k=0}^n (tx)^k/k!, and so

E e^{tX} = E lim_{n→∞} p_n(tX).

It is easy to see that the |p_n(tX)| are upper bounded by a positive random variable with finite
expectation: |p_n(tX)| ≤ e^{tX} + e^{−tX}, and E(e^{tX} + e^{−tX}) < ∞. This is enough (but I'm not telling
you why) to ensure that

E lim_{n→∞} p_n(tX) = lim_{n→∞} E p_n(tX).

Using the formula for p_n(tX) we have

E p_n(tX) = E ∑_{k=0}^n (t^k X^k)/k! = ∑_{k=0}^n (t^k E(X^k))/k!.

Putting these things together we have the formula M(t) = ∑_{k=0}^∞ (t^k/k!) E X^k.
Next observe that the right side of this last formula is a power series. You know, from
Calculus, that a power series can be differentiated term-by-term, that is,

d^ℓ/dt^ℓ M(t) = ∑_{k=0}^∞ E X^k d^ℓ/dt^ℓ (t^k/k!).

But

d^ℓ/dt^ℓ t^k = k(k − 1) · · · (k − ℓ + 1) t^{k−ℓ},  if k ≥ ℓ,
             = 0,  otherwise.

Set now t = 0 to get

(d^ℓ/dt^ℓ t^k)|_{t=0} = k(k − 1) · · · 1 = ℓ!,  if k = ℓ,
                      = 0,  otherwise.

Hence, in the infinite sum above, setting t = 0 causes all terms to vanish except the term k = ℓ,
which gives the formula (d^ℓ M/dt^ℓ)(0) = E X^ℓ.

Property A and Property B of probability generating functions have exact analogs for
moment generating functions.
The first one concerns the moment generating function of the sum of n independent random
variables.

PROPERTY A. Let X_1, . . . , X_n be independent random variables with moment
generating functions M_i(t) = E e^{tX_i}, i = 1, . . . , n. Letting M(t) = E e^{t(X_1+···+X_n)} be the
moment generating function of X_1 + · · · + X_n, we have

M(t) = E e^{t(X_1+···+X_n)} = (E e^{tX_1}) · · · (E e^{tX_n}) = M_1(t) · · · M_n(t).

The second one says that knowledge of the moment generating function implies knowledge
of the law.

PROPERTY B. If two random variables X, Y have the same moment generating
function then X =_d Y. More precisely: if there are numbers a ≤ 0 ≤ b, not both
zero, such that

E e^{tX} = E e^{tY} < ∞ for all t ∈ [a, b]  ⇒  X =_d Y.

Property B is not obvious here. But we will not spend time proving it.
?PROBLEM 16.12 (a useless moment generating function). Consider a random variable X
with density

f(x) = 1/(2x^2),  if |x| ≥ 1,
f(x) = 0,  if −1 < x < 1,

and find its moment generating function. Explain why it is useless.

Answer. We have

E e^{tX} = ∫_1^∞ e^{tx} (1/(2x^2)) dx + ∫_{−∞}^{−1} e^{tx} (1/(2x^2)) dx.

If t > 0 then the first integral equals +∞. If t < 0 then the second integral equals +∞. Hence

E e^{tX} = +∞, if t ≠ 0;  E e^{tX} = 1, if t = 0.

This is useless. □
?PROBLEM 16.13 (moment generating functions of some common r.v.s). Find expressions
for the moment generating function of the r.v. X when X is
(0) δ_a (the Dirac law at a–(8.5))
(1) unif([a, b]) (what happens in the limit b → a and why?)
(2) expon(λ)
(3) N(µ, σ^2)
(4) Cauchy(a, b), defined as the law of a + bZ, a ∈ R, b > 0, where Z has density f(x) = (1/π)/(1 + x^2).
Is any of them useless?

Answer. (0) Since X has law δ_a we have P(X = a) = 1. So

E e^{tX} = e^{ta}.

(1) Since X has density (1/(b − a)) 1_{a≤x≤b},

E e^{tX} = ∫_a^b e^{tx} dx/(b − a) = (e^{tb} − e^{ta}) / ((b − a)t).

We have

lim_{b→a} (e^{tb} − e^{ta}) / ((b − a)t) = (1/t) (d/dx e^{tx})|_{x=a} = e^{ta}.

But this is the moment generating function of a random variable with law δ_a. This is reasonable
because if we let U be a unif([0, 1]) r.v., then (b − a)U + a has unif([a, b]) law. But then

lim_{b→a} ((b − a)U + a) = a,

so it is to be expected that the law of the random variable on the left converges to the law of
the random variable on the right.

(2) An expon(λ) r.v. has density λ e^{−λx} 1_{x≥0}, so

E e^{tX} = ∫_0^∞ e^{tx} λ e^{−λx} dx = λ ∫_0^∞ e^{−(λ−t)x} dx = λ/(λ − t),

provided that t < λ. Of course the formula λ/(λ − t) makes sense for all t ≠ λ, but if t > λ we get
a negative value for this fraction, so the equality does not hold because E e^{tX} is positive. This
is what we mean when we say that the right-hand side, considered as a function over all t, is an
extension of the left-hand side.

(3) If X is N(µ, σ^2), we write X = µ + σZ where Z is N(0, 1). For Z we have

E e^{tZ} = ∫_{−∞}^∞ e^{tx} (e^{−x^2/2}/√(2π)) dx = (1/√(2π)) ∫_{−∞}^∞ e^{−(x^2−2tx)/2} dx.

It is helpful to write

x^2 − 2tx = x^2 − 2tx + t^2 − t^2 = (x − t)^2 − t^2,

so

E e^{tZ} = (1/√(2π)) ∫_{−∞}^∞ e^{−(x−t)^2/2} e^{t^2/2} dx  =(a)  (e^{t^2/2}/√(2π)) ∫_{−∞}^∞ e^{−y^2/2} dy  =(b)  e^{t^2/2},

where equality (a) follows by a simple change of variable, y = x − t, and where (b) follows from
the fact that e^{−y^2/2}/√(2π) is a probability density function, so it integrates to 1. For X = µ + σZ
we obviously have

E e^{tX} = E e^{t(µ+σZ)} = e^{µt} e^{(σt)^2/2} = e^{σ^2 t^2/2 + µt}.     (16.8)

We can write this as

E e^{tX} = e^{(var X) t^2/2 + (E X) t}.     (16.9)

(4) Letting X = a + bZ, with Z having the given density, we find

E e^{tX} = e^{ta} E e^{tbZ} = e^{ta} (1/π) ∫_{−∞}^∞ e^{tbx}/(1 + x^2) dx.

If t > 0 we have ∫_0^∞ e^{tbx}/(1 + x^2) dx = ∞ because, intuitively, e^{tbx} goes to infinity as x → ∞
much faster than 1 + x^2. If t < 0 we similarly have ∫_{−∞}^0 e^{tbx}/(1 + x^2) dx = ∞. So the moment
generating function is a useless one. In all previous cases, the moment generating functions
were not useless. □
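The closed forms in parts (2) and (3) can be checked by Monte Carlo (an illustrative sketch of mine, not from the notes; the parameters λ = 2, µ = 1, σ = 0.7 and the point t = 0.5 are arbitrary): average e^{tX} over many samples and compare with λ/(λ − t) and e^{σ²t²/2 + µt}.

```python
import math
import random

random.seed(1)
n = 200_000
t = 0.5

# expon(lam): sample mean of e^{tX} versus lam / (lam - t)
lam = 2.0
exp_est = sum(math.exp(t * random.expovariate(lam)) for _ in range(n)) / n
exp_closed = lam / (lam - t)

# N(mu, sigma^2): sample mean of e^{tX} versus exp(sigma^2 t^2 / 2 + mu t)
mu, sigma = 1.0, 0.7
norm_est = sum(math.exp(t * random.gauss(mu, sigma)) for _ in range(n)) / n
norm_closed = math.exp(sigma**2 * t**2 / 2 + mu * t)

print(exp_est, exp_closed, norm_est, norm_closed)
```

Note that t must stay below λ in the exponential case, exactly as the computation above requires.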

PROBLEM 16.14 (linear combination of independent normal r.v.s). Let X_1, X_2 be independent
random variables with laws N(µ_1, σ_1^2), N(µ_2, σ_2^2), respectively. Determine the law of

X = a_1 X_1 + a_2 X_2,

where a_1, a_2 are arbitrary real numbers.

Answer. Let us compute the moment generating function of X:

E e^{tX} = E e^{t(a_1 X_1 + a_2 X_2)} = E[e^{t a_1 X_1} · e^{t a_2 X_2}] = E e^{t a_1 X_1} · E e^{t a_2 X_2}
        = e^{σ_1^2 (t a_1)^2/2 + µ_1 (t a_1)} · e^{σ_2^2 (t a_2)^2/2 + µ_2 (t a_2)} = e^{(a_1^2 σ_1^2 + a_2^2 σ_2^2) t^2/2 + (a_1 µ_1 + a_2 µ_2) t},

where we used the expression from (16.8). But observe that

var(X) = a_1^2 σ_1^2 + a_2^2 σ_2^2,    E X = a_1 µ_1 + a_2 µ_2,

so we can write the above as

E e^{tX} = e^{(var X) t^2/2 + (E X) t}.

We immediately recognize this from (16.9): it is the moment generating function of a
normal random variable. By Property B of moment generating functions, we conclude that
the law of X is N(a_1 µ_1 + a_2 µ_2, a_1^2 σ_1^2 + a_2^2 σ_2^2). □

So, in relation to Problem 16.14, we have this very special, very important and very much
treasured property:

If X_1, . . . , X_n are independent random variables with laws N(µ_1, σ_1^2), . . . , N(µ_n, σ_n^2),
respectively, then their sum X_1 + · · · + X_n is a normal random variable. It is
immediate that this should actually be a N(µ, σ^2) random variable with µ = ∑_{i=1}^n µ_i
and σ^2 = ∑_{i=1}^n σ_i^2.
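A simulation sketch of Problem 16.14 (illustrative; the constants a_1 = 2, a_2 = −1, µ_1 = 1, σ_1 = 1.5, µ_2 = 3, σ_2 = 0.5 are my own choices): samples of a_1X_1 + a_2X_2 should show mean a_1µ_1 + a_2µ_2 and variance a_1²σ_1² + a_2²σ_2².

```python
import random

random.seed(2)
a1, a2 = 2.0, -1.0
mu1, s1 = 1.0, 1.5
mu2, s2 = 3.0, 0.5

n = 200_000
xs = [a1 * random.gauss(mu1, s1) + a2 * random.gauss(mu2, s2) for _ in range(n)]
mean = sum(xs) / n
var = sum((x - mean) ** 2 for x in xs) / n

# Targets from the computation above: a1*mu1 + a2*mu2 = -1,
# a1^2 s1^2 + a2^2 s2^2 = 9.25.
print(mean, var)
```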

16.9.1 Expectation and covariance of random vectors

Now let X = (X1 , . . . , Xd ) be a random vector in Rd . We define the expectation of the random
vector X to be the vector of the expectations of the individual random variables:
EX = (EX1 , . . . , EXd )

PROBLEM 16.15 (expectation of a simple random vector on the plane). Let A, B, C be three
points on the Euclidean plane that we identify, as monsieur Descartes taught us, with their
coordinates, say (a1 , a2 ), (b1 , b2 ), (c1 , c2 ). Let X be a random vector with uniform distribution
on the set {A, B, C}. Explain why EX is the point of intersection of the three medians of the
triangle with vertices A, B, C, also explaining that these three segments pass through the same
point. Note: The median of a triangle is the straight segment that joins a vertex with the
midpoint of the opposite side.
Answer. We have

E X = (1/3) A + (1/3) B + (1/3) C,

where I'm thinking of A, B, C as represented by their coordinates. But then

E X = (2/3) · (A + B)/2 + (1/3) C.

But the point M_{AB} = (A + B)/2 is the midpoint of the segment AB. Since E X is a convex combination
of M_{AB} and C, E X lies on the straight line joining M_{AB} and C, that is, the median from C. By
exactly the same argument, E X lies on the median from A and on the median from B. Hence
the three medians meet at the same point and this point is E X. □
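A quick simulation of this (illustrative; the triangle below is my own choice): averaging many uniform picks among the three vertices approximates the centroid, i.e. the common point of the medians.

```python
import random

random.seed(3)
A, B, C = (0.0, 0.0), (6.0, 0.0), (0.0, 3.0)
centroid = ((A[0] + B[0] + C[0]) / 3, (A[1] + B[1] + C[1]) / 3)

# X is uniform on the three vertices; the sample mean estimates E X.
n = 90_000
picks = [random.choice([A, B, C]) for _ in range(n)]
mean = (sum(p[0] for p in picks) / n, sum(p[1] for p in picks) / n)

print(mean, centroid)  # mean should be close to the centroid (2.0, 1.0)
```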

This problem teaches us that E X can be thought of as a geometric object. If we assume that
X has law unif(A) where A is a “geometric” subset of the plane, then E X is called the centroid
of A. Physically, it is the center of mass of A if the mass of A is uniformly distributed on A
(mass density is constant). And if A has certain symmetries, then it becomes easy to find
the centroid.
PROBLEM 16.16 (expectations of certain uniform random vectors). Explain what E X is
when X has law
(1) unif(D) where D is a disc;
(2) unif(S) where S is a rectangle.
Answer. (1) E X is the center of the disc.
(2) E X is the point where the two diagonals meet. □
Since these things are important in Civil Engineering (and not only), people have created
tables of expectations of uniform random vectors.
Similarly, we can have a geometric interpretation of EX when X is a random vector in Rd ,
for any positive integer d.
Passing on to the covariance of the random vector X, we must define it as a matrix, the
so-called covariance matrix of the random vector (X_1, . . . , X_d), defined by

cov(X) = [ var(X_1)        cov(X_1, X_2)  · · ·  cov(X_1, X_d)
           cov(X_2, X_1)   var(X_2)       · · ·  cov(X_2, X_d)
           · · ·
           cov(X_d, X_1)   cov(X_d, X_2)  · · ·  var(X_d) ]     (16.10)

We can write this as

cov(X)_{i,j} = cov(X_i, X_j).

That is, the element of the matrix above sitting in row i and column j is cov(X_i, X_j). If i = j
then cov(X_i, X_i) = var(X_i).
Symmetry: Since cov(X_i, X_j) = cov(X_j, X_i), the matrix cov(X) is symmetric (with respect to
the diagonal: if we flip the triangular array above the diagonal we get the triangular array
below it).
Positive definiteness: This means that for all real numbers t_1, . . . , t_d we have

∑_{j=1}^d ∑_{k=1}^d t_j t_k cov(X_j, X_k) ≥ 0.

Indeed,

∑_{j=1}^d ∑_{k=1}^d t_j t_k cov(X_j, X_k) = E[ (∑_{j=1}^d t_j (X_j − E X_j)) (∑_{k=1}^d t_k (X_k − E X_k)) ] = E[ (∑_{j=1}^d t_j (X_j − E X_j))^2 ] ≥ 0.
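The same argument works for empirical covariances, and can be illustrated on a computer (a sketch of mine, not from the notes): for synthetic data, the quadratic form built from the sample covariance matrix is nonnegative for every weight vector t, because it equals the average of a square.

```python
import random

random.seed(4)
d, n = 3, 1000
data = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]

means = [sum(row[j] for row in data) / n for j in range(d)]

def cov(j, k):
    # empirical covariance between coordinates j and k
    return sum((row[j] - means[j]) * (row[k] - means[k]) for row in data) / n

# Try several random weight vectors t and record the quadratic forms.
qs = []
for _ in range(20):
    t = [random.uniform(-1, 1) for _ in range(d)]
    qs.append(sum(t[j] * t[k] * cov(j, k) for j in range(d) for k in range(d)))

min_q = min(qs)
print(min_q)  # nonnegative, up to floating-point rounding
```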

16.9.2 Moment generating function of random vectors


To understand what goes on regarding the concept of moment generating function of a
random vector, we need to state one very important theorem.

Theorem 16.5 (Cramér-Wold theorem). The distribution of the d-dimensional random vector
X = (X_1, . . . , X_d) is completely determined by the distributions of all 1-dimensional random
variables

t_1 X_1 + · · · + t_d X_d,

where t_1, . . . , t_d range in R.

If we believe in this (and we should, because its proof is easy once you know what a Fourier
transform is), then the concept of the moment generating function of X = (X_1, . . . , X_d) reduces
to dimension 1: the moment generating function of t_1 X_1 + · · · + t_d X_d for all t_1, . . . , t_d.

Definition 16.6. The moment generating function of X = (X_1, . . . , X_d) is the function

M(t_1, . . . , t_d) = E e^{t_1 X_1 + ··· + t_d X_d},

provided that the set of (t_1, . . . , t_d) for which M(t_1, . . . , t_d) < ∞ contains an open set. Otherwise,
we say that the moment generating function is useless.

?PROBLEM 16.17 (independence inferred from the moment generating function). Let
X_1, . . . , X_d be random variables such that

M(t_1, . . . , t_d) = M_1(t_1) · · · M_d(t_d),     (16.11)

where M is the moment generating function of X and M_i the moment generating function of
X_i, i = 1, . . . , d. Assume that none of the M_i is useless. Explain why the X_1, . . . , X_d must be
independent.

Answer. Let Y_1, . . . , Y_d be independent random variables such that Y_i =_d X_i for all i. Then

E e^{t_1 Y_1 + ··· + t_d Y_d} = M_1(t_1) · · · M_d(t_d).

Therefore

E e^{t_1 Y_1 + ··· + t_d Y_d} = E e^{t_1 X_1 + ··· + t_d X_d}.

By Property B of the moment generating function of a one-dimensional random variable, we
have that

t_1 Y_1 + · · · + t_d Y_d =_d t_1 X_1 + · · · + t_d X_d,

for all (t_1, . . . , t_d) and so, by the Cramér-Wold theorem,

(X_1, . . . , X_d) =_d (Y_1, . . . , Y_d).

Since the Y_1, . . . , Y_d are independent, so are the X_1, . . . , X_d. □

PROBLEM 16.18 (sum and difference of exponentials). Let X_1, X_2 be independent expon(1)
random variables. Let Y_1 = X_1 + X_2, Y_2 = X_1 − X_2. Determine the joint moment generating
function of (Y_1, Y_2). Are Y_1, Y_2 independent?

Answer.

E e^{t_1 Y_1 + t_2 Y_2} = E e^{(t_1+t_2) X_1 + (t_1−t_2) X_2} = E e^{(t_1+t_2) X_1} E e^{(t_1−t_2) X_2} = 1/(1 − t_1 − t_2) · 1/(1 − t_1 + t_2).

This cannot be written in product form. So Y_1, Y_2 are not independent. □
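The dependence can be seen in a simulation as well (an illustrative sketch of mine, not from the notes). Y_1 and Y_2 happen to be uncorrelated, since E[Y_1 Y_2] = E[X_1² − X_2²] = 0, so we look one moment higher: using E X = 1, E X² = 2, E X⁴ = 24 for expon(1), one computes E[Y_1²Y_2²] = E[(X_1² − X_2²)²] = 40 while E[Y_1²] E[Y_2²] = 6 · 2 = 12, so cov(Y_1², Y_2²) = 28 ≠ 0, which independence would forbid.

```python
import random

random.seed(5)
n = 300_000
pairs = [(random.expovariate(1.0), random.expovariate(1.0)) for _ in range(n)]
y = [(x1 + x2, x1 - x2) for x1, x2 in pairs]

m1 = sum(a * a for a, b in y) / n           # estimates E[Y1^2] = 6
m2 = sum(b * b for a, b in y) / n           # estimates E[Y2^2] = 2
m12 = sum(a * a * b * b for a, b in y) / n  # estimates E[Y1^2 Y2^2] = 40
cov_sq = m12 - m1 * m2

print(cov_sq)  # clearly far from 0, so Y1, Y2 cannot be independent
```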
Remark 16.3. If the random variables X_1, . . . , X_d take values in Z+, we may work with the
joint probability generating function

G(s_1, . . . , s_d) = E[s_1^{X_1} · · · s_d^{X_d}].

This always exists at least if |s_1| < 1, . . . , |s_d| < 1. As above, we have that

E[s_1^{X_1} · · · s_d^{X_d}] = E[s_1^{X_1}] · · · E[s_d^{X_d}] for all s_1, . . . , s_d  ⇐⇒  X_1, . . . , X_d are independent.

?PROBLEM 16.19 (more thinning). Let N, ξ_1, ξ_2, . . . be independent random variables such
that N is Poi(λ) and ξ_1, ξ_2, . . . are Ber(p) each. Assume 0 < p < 1. Let

S_n = ∑_{i=1}^n ξ_i,    R_n = ∑_{i=1}^n (1 − ξ_i).

(1) Are the random variables S_n, R_n independent?
(2) Are the random variables S_N, R_N independent?
(3) Determine the distribution of the random vector (S_N, R_N) by writing down a formula for
P(S_N = k, R_N = ℓ).

Answer. (1) If they were independent we would have P(S_n = k, R_n = ℓ) = P(S_n = k) P(R_n = ℓ)
for all k and ℓ in the range 0, 1, . . . , n. Take k = ℓ = 0. We have S_n + R_n = n. Hence
P(S_n = 0, R_n = 0) = 0. But P(S_n = 0) P(R_n = 0) = (1 − p)^n p^n, which is nonzero since 0 < p < 1.
Hence S_n, R_n are categorically not independent.

(2) We shall compute the probability generating function of (S_N, R_N), namely, the function
G(x, y) = E[x^{S_N} y^{R_N}], and see what happens.

x^{S_N} y^{R_N} = x^{∑_{i=1}^N ξ_i} y^{∑_{i=1}^N (1−ξ_i)} = ∏_{i=1}^N x^{ξ_i} y^{1−ξ_i} = ∑_{n=0}^∞ (∏_{i=1}^n x^{ξ_i} y^{1−ξ_i}) 1_{N=n}.

Hence

E[x^{S_N} y^{R_N}] = ∑_{n=0}^∞ E[ (∏_{i=1}^n x^{ξ_i} y^{1−ξ_i}) 1_{N=n} ] = ∑_{n=0}^∞ (∏_{i=1}^n E[x^{ξ_i} y^{1−ξ_i}]) P(N = n).

But P(ξ_i = 1) = p, P(ξ_i = 0) = 1 − p, so, by the law of the unconscious statistician,

E[x^{ξ_i} y^{1−ξ_i}] = x^1 y^{1−1} p + x^0 y^{1−0} (1 − p) = xp + y(1 − p).

Hence

E[x^{S_N} y^{R_N}] = ∑_{n=0}^∞ (xp + y(1 − p))^n P(N = n) = e^{−λ(1−xp−y(1−p))}.

Write

1 − xp − y(1 − p) = p + (1 − p) − xp − y(1 − p) = p(1 − x) + (1 − p)(1 − y),

so

E[x^{S_N} y^{R_N}] = e^{−λp(1−x)} e^{−λ(1−p)(1−y)}.

Setting y = 1 we find E[x^{S_N}] = e^{−λp(1−x)} (which we knew from Problem 16.11) and setting x = 1
we find E[y^{R_N}] = e^{−λ(1−p)(1−y)}. Hence S_N is Poi(λp) and R_N is Poi(λ(1 − p)). The last display can
be written as

E[x^{S_N} y^{R_N}] = E[x^{S_N}] E[y^{R_N}],

and this implies that S_N, R_N are independent.

(3) Since S_N, R_N are independent, our task is trivial:

P(S_N = k, R_N = ℓ) = P(S_N = k) P(R_N = ℓ) = ((λp)^k/k!) e^{−λp} · ((λ(1 − p))^ℓ/ℓ!) e^{−λ(1−p)}.     (16.12)
?PROBLEM 16.20 (continuation of Problem 16.19). Can you compute P(S_N = k, R_N = ℓ)
directly and show thus that the two random variables are independent?

Answer. Let's try. Since S_N + R_N = N, if S_N = k, R_N = ℓ then N = k + ℓ. So

P(S_N = k, R_N = ℓ) = P(S_N = k, R_N = ℓ, N = k + ℓ) = P(S_{k+ℓ} = k, R_{k+ℓ} = ℓ, N = k + ℓ).

Now, if S_{k+ℓ} = k then certainly R_{k+ℓ} = ℓ (because their sum is k + ℓ), so we can omit this event.
Hence

P(S_N = k, R_N = ℓ) = P(S_{k+ℓ} = k, N = k + ℓ).

But S_{k+ℓ} is a function of ξ_1, . . . , ξ_{k+ℓ} and so it is independent of N. Hence

P(S_N = k, R_N = ℓ) = P(S_{k+ℓ} = k) P(N = k + ℓ) = (k+ℓ choose k) p^k (1 − p)^ℓ (λ^{k+ℓ}/(k + ℓ)!) e^{−λ}.     (16.13)

But a little light algebra shows that the expressions on the right of (16.12) and (16.13) are the
same. □
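A simulation sketch of this (illustrative; the parameters λ = 3, p = 0.4 and the checked point (k, ℓ) = (1, 2) are my own choices): the empirical joint pmf of (S_N, R_N) should match the product formula (16.12), i.e. independent Poi(λp) and Poi(λ(1 − p)) coordinates.

```python
import math
import random

random.seed(6)

def poisson(lam):
    # Sample Poi(lam) via exponential inter-arrival times.
    n, t = 0, random.expovariate(1.0)
    while t < lam:
        n += 1
        t += random.expovariate(1.0)
    return n

lam, p, trials = 3.0, 0.4, 100_000
counts = {}
for _ in range(trials):
    N = poisson(lam)
    s = sum(1 for _ in range(N) if random.random() < p)
    counts[(s, N - s)] = counts.get((s, N - s), 0) + 1

def formula(k, l):
    # the product formula (16.12)
    return ((lam * p) ** k * math.exp(-lam * p) / math.factorial(k)
            * (lam * (1 - p)) ** l * math.exp(-lam * (1 - p)) / math.factorial(l))

emp = counts.get((1, 2), 0) / trials
print(emp, formula(1, 2))  # the two probabilities should be close
```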
?PROBLEM 16.21 (independent normals). Let X, Y be i.i.d. N(0, 1) each. Define

W = aX + bY,    Z = cX + dY,

where a, b, c, d are real numbers. Find a relation between these numbers so that W, Z be
independent as well.

Answer. Well, W, Z are independent iff the moment generating function of (W, Z) is the product
of the individual moment generating functions, that is, iff

E[e^{sW + tZ}] = (E e^{sW})(E e^{tZ}).

We compute all terms separately. We have

E e^{sW} = E e^{s(aX + bY)} = E[e^{saX} · e^{sbY}].

But X and Y are independent by assumption, so this is further equal to

(E e^{saX})(E e^{sbY}).

By (16.9),

E e^{saX} = e^{var(saX)/2} = e^{a^2 s^2/2}.

For the same reasons,

E e^{sbY} = e^{b^2 s^2/2},

and so

E e^{sW} = e^{(a^2 + b^2) s^2/2}.

Similarly,

E e^{tZ} = e^{(c^2 + d^2) t^2/2}.

On the other hand,

sW + tZ = s(aX + bY) + t(cX + dY) = (as + ct)X + (bs + dt)Y,

so

E[e^{sW + tZ}] = (E e^{(as+ct)X})(E e^{(bs+dt)Y}) = e^{(as+ct)^2/2} e^{(bs+dt)^2/2}.

Hence

W, Z are independent  ⇐⇒  e^{(as+ct)^2/2} e^{(bs+dt)^2/2} = e^{(a^2+b^2)s^2/2} e^{(c^2+d^2)t^2/2} for all s, t.

Equating the exponents gives

(as + ct)^2 + (bs + dt)^2 = (a^2 + b^2)s^2 + (c^2 + d^2)t^2.

Expanding the squares on the left and canceling terms we're left with

2(ac + bd)st = 0,

and since this is true for all s, t we obtain

W, Z are independent  ⇐⇒  ac + bd = 0.

We can write this condition as

(a, b) is orthogonal to (c, d). □
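A simulation sketch of the orthogonality criterion (illustrative; the coefficient choices below are mine): with i.i.d. N(0, 1) inputs, the empirical correlation of W = aX + bY and Z = cX + dY is near 0 when (a, b) ⊥ (c, d) and visibly nonzero otherwise.

```python
import random

random.seed(7)
n = 100_000
samples = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(n)]

def corr(a, b, c, d):
    # empirical correlation of W = aX + bY and Z = cX + dY
    ws = [a * x + b * y for x, y in samples]
    zs = [c * x + d * y for x, y in samples]
    mw, mz = sum(ws) / n, sum(zs) / n
    cov = sum((w - mw) * (z - mz) for w, z in zip(ws, zs)) / n
    vw = sum((w - mw) ** 2 for w in ws) / n
    vz = sum((z - mz) ** 2 for z in zs) / n
    return cov / (vw * vz) ** 0.5

orth = corr(1.0, 2.0, -2.0, 1.0)  # (1,2) . (-2,1) = 0: should be near 0
skew = corr(1.0, 2.0, 1.0, 0.0)   # (1,2) . (1,0) = 1: should be near 1/sqrt(5)
print(orth, skew)
```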


Remark 16.4. In the above problem, noticing that

WZ = ac X^2 + bd Y^2 + (ad + bc) XY,

we have

E(WZ) = ac + bd + 0 = ac + bd.

Hence the condition ac + bd = 0 is equivalent to E(WZ) = 0. Since E W = E Z = 0, this means that
W, Z are uncorrelated. Hence

W, Z are independent  ⇐⇒  W, Z are uncorrelated.

This is a special property of normality. It requires that (W, Z) be normal.


Chapter 17

The fundamental theorem of probability = Strong Law of Large Numbers

The “law of large numbers” is not a law. It is a theorem. It is called law for
historical reasons: before people understood the mathematics of probability
they had no clue what this thing was. Some took it as a definition. Others
believed it as a law of physics. And some thought it is an experimental result.
They were all wrong.
When I ask students to calculate the probability that in 10 thousand fair coin
tosses we get half heads and half tails, half of them reply 1, the other half reply
1/2; but all of them agree it’s due to the law of large numbers. And they’re all
wrong.

I use the term “fundamental theorem of probability” for the strong law of large numbers.
As such, it is totally unacceptable not to know what it says and not to understand why it's
true. There is no point going around declaring that “the sample mean converges to the true
mean” without understanding what this means. In this chapter, you will; so long as you're
willing to study it.

17.1 Discussion
Long time ago, people didn’t know how to define probability. They thought that they could
use a concept called “frequency” to define it. But no matter how hard they tried, they failed to
define a mathematically consistent theory. I’ll explain how they tried to think with a gedanken
experiment1 . Suppose you want to know the probability that a message transmitted over
the Internet contains a virus. Put probes (software) in various computers across the Internet
that detects malicious messages. Do this for a period of time (say, a year) during which 1010
messages have been sent. Count how many of them are malicious, say 104 , and divide the
1
thought experiment

190
CHAPTER 17. THE FUNDAMENTAL THEOREM OF PROBABILITY 191

two numbers: 104 /1010 = 1/106 (1 per million). Now use this number as the probability of the
event you are interested in. Which seems reasonable. But, to build a theory of probability,
you need to consider the totality of the events (those you are interested in and those you are
not–because, a theory doesn’t care about what YOU are interested in) and assign a probability
to each one, using frequency approach. And people proved that such an approach fails.
The Strong Law of Large Numbers does, indeed, talk about frequencies. But not as a
definition of probability; rather, as a result in probability, one that can be proved, provided that
some assumptions are made.
I’m not telling you yet what the Law of Large Numbers says, but I’m going to tell you a
consequence of it.

Let A_1, A_2, . . . be events that are independent and all have the same probability p. Then,
for each positive integer n, the number

f_n := (1/n)(1_{A_1} + · · · + 1_{A_n})     (17.1)

is the “proportion of events that occur”. The Law of Large Numbers says that

P( lim_{n→∞} f_n = p ) = 1.     (17.2)

That is, for sure, the limit of the random sequence f_n, n ∈ N, exists, for sure it is
deterministic, and for sure it equals p (where p = P(A_j) for any j, because we assumed
that all events have the same probability).

Let’s discuss more. We can think of each A j above as an “independent replica” of some
“idealized event” A. Then the number fn expresses the “frequency of A”. Of course, the
events A1 , A2 , . . . are never equal to A. For example, if we wish to consider A as the idealized
event “malicious message” then A j is the event that “malicious message appears at the jth
measurement”. If we assume (or can prove that) the events A j have the same probability
(why should they?) and that they are independent (are they?) then we can talk about the
proportion fn of the malicious ones in the first n measurements and then the Law of Large
Numbers will guarantee that the sequence f1 , f2 , . . . has a limit and that this limit is p.
Let's discuss even more. Do not forget that we are working on a set Ω with a probability
measure P defined on events of Ω. An element ω of Ω, called a “configuration” or “elementary
outcome” (these are just silly words), is an element of an event A_j or is not. For example, in
the gedanken experiment above, ω may be taken as an infinite sequence (ω_1, ω_2, . . .) where ω_j
denotes the state of the Internet at the jth measurement. (Of course, we cannot know ω_j, but
our knowledge does not affect our ability to talk about it.) Surely then we can tell if ω ∈ A_j
because knowing the state of the whole Internet at the jth measurement tells us if a malicious
message was sent (and much much more). Recall that the phrase “A_j occurs” (or “A_j occurs
on ω”) means “ω ∈ A_j”, and this means that “1_{A_j}(ω) = 1”, by definition of the indicator
function. Hence f_100(ω) = 0.75, say, means that ω belongs to 75 of the events A_1, A_2, . . . , A_100.
The Law of Large Numbers will be explained next, but only in a special case. It should be
noted that the independence assumption can be dropped.
CHAPTER 17. THE FUNDAMENTAL THEOREM OF PROBABILITY 192

17.2 The statement of the strong law of large numbers

Theorem 17.1 (the fundamental theorem of Probability). Consider an infinite sequence
X_1, X_2, . . . of i.i.d. random variables such that

µ = E X_1

exists and is a real number. Define

S_n = ∑_{i=1}^n X_i.

Consider the event

C := {the sequence S_n/n converges as n → ∞ and its limit is µ}.     (17.3)

Then

P(C) = 1.     (17.4)

Note that S_n/n is the sample mean of the first n random variables, whereas µ is the
expectation of X_1. Since the random variables X_i have the same law, we obviously have
E X_i = µ for all i.
In words, one can state the SLLN as:

The probability that the sample mean of an i.i.d. sequence of random variables
with common expectation µ converges to µ is equal to 1.

Let us consider the event

B := {the sequence Sn /n converges as n → ∞}.

Of course,
C ⊂ B,
so showing that P(B) = 1 does not mean that P(C) = 1. We really have to understand why
P(C) = 1.

PROBLEM 17.1 (SLLN implies convergence of frequencies). Let A1 , A2 , . . . be independent events


with probability p each. Define

fn := proportion of the first n events that occur.

Namely, (17.1) Explain why the SLLN implies that (17.2) holds.
Answer. Simply let Xi = 1Ai , i = 1, 2, . . .. These are i.i.d. random variables with

µ = EX1 = p.
Let

$S_n = \sum_{i=1}^{n} X_i = \sum_{i=1}^{n} 1_{A_i}.$

Note that

$f_n = \frac{S_n}{n}.$
The SLLN states that

P(the sequence Sn /n converges as n → ∞ and its limit is µ) = 1,

which can immediately be written as

P(the sequence fn converges as n → ∞ and its limit is p) = 1.

This is the same as (17.2).
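The convergence of frequencies is easy to watch numerically. Below is a minimal simulation sketch (ours, not part of the notes): it generates independent events of probability p = 0.3 and prints the running proportion fn; the function name is our own choice.

```python
import random

def running_frequency(p, n, seed=0):
    """Simulate n independent events, each of probability p, and
    return the proportion f_n of them that occur."""
    rng = random.Random(seed)
    count = 0
    for _ in range(n):
        if rng.random() < p:  # the event A_i occurs
            count += 1
    return count / n

# f_n should approach p = 0.3 as n grows (convergence of frequencies)
for n in (10, 1000, 100000):
    print(n, running_frequency(0.3, n))
```

Running this for larger and larger n shows the proportions settling near 0.3, exactly as (17.2) predicts.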

17.3 The explanation of the strong law of large numbers in a simpler case
I am going to explain why the SLLN is true, but in a simpler case. I will assume that

$EX_1^4 < \infty.$

This assumption is made for convenience. Therefore, $EX_1, EX_1^2, EX_1^3$ are all finite as well.
To make life simple, I will also assume that

$EX_1 = 0,$   (17.5)

because, if not, we can simply subtract it and reduce to this case! If $EX_1^2 = 0$ we immediately
get that X1 = 0 (Problem 16.6) and so Xi = 0 for all i, and so there is nothing to explain here:
everything is zero! So we assume
$EX_1^2 > 0.$

Step 1: Compute the 4th power of the sum. We now consider the 4th power of the sum
of the first n random variables and expand it:

$S_n^4 = \Big(\sum_{i=1}^{n} X_i\Big)^4 = \sum_{i} X_i^4 + \sum_{i \neq j} X_i^2 X_j^2 + \sum_{i \neq j} X_i^3 X_j + \sum_{i,j,k\ \mathrm{distinct}} X_i^2 X_j X_k + \sum_{i,j,k,\ell\ \mathrm{distinct}} X_i X_j X_k X_\ell.$   (17.6)

What I have done here is collect together similar terms because raising a sum
to the 4th power is the same as multiplying 4 identical sums in parentheses:

(X1 + X2 + · · · + Xn ) · (X1 + X2 + · · · + Xn ) · (X1 + X2 + · · · + Xn ) · (X1 + X2 + · · · + Xn )


To perform the product and find all n4 terms, I must select exactly one variable from each
parenthesis. If I select the same variable from all parentheses then I get a term of the form Xi4
and this gives the first term in (17.6). If I select the same variable from two parentheses and a
different variable from the other two, I get the second term. And so on. Now take expectation
of the expression above. By independence,

$E(S_n^4) = \sum_{i} EX_i^4 + \sum_{i \neq j} EX_i^2\, EX_j^2 + \sum_{i \neq j} EX_i^3\, EX_j + \sum_{i,j,k\ \mathrm{distinct}} EX_i^2\, EX_j\, EX_k + \sum_{i,j,k,\ell\ \mathrm{distinct}} EX_i\, EX_j\, EX_k\, EX_\ell.$

The last three sums are equal to zero because we assumed (17.5), which implies that every
Xi has zero expectation. We therefore have

$E(S_n^4) = n\,EX_1^4 + 3n(n-1)\,(EX_1^2)^2.$
Indeed, the sum $\sum_i EX_i^4$ has n terms, all equal to $EX_1^4$, and the sum $\sum_{i \neq j} EX_i^2\, EX_j^2$ contributes
$6\binom{n}{2} = 3n(n-1)$ terms, because I can choose two unordered distinct i, j from the n indices in
$\binom{n}{2}$ ways and, once chosen, I can place {i, j} in 6 ways in the 4 parentheses: Xi from the first 2,
Xj from the last 2; or Xi from the first and third parentheses and Xj from the others, etc.
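The moment identity just derived can be checked by brute force for the simplest centered random variables, Xi = ±1 with probability 1/2 each, for which EX1⁴ = EX1² = 1 and the formula predicts E(Sn⁴) = n + 3n(n − 1) = 3n² − 2n. The script below is our own, purely illustrative:

```python
from itertools import product

def exact_fourth_moment(n):
    """E(S_n^4) for i.i.d. X_i uniform on {-1, +1}, computed exactly by
    enumerating all 2^n equally likely sign sequences."""
    total = sum(sum(signs) ** 4 for signs in product((-1, 1), repeat=n))
    return total / 2 ** n

# the formula predicts n*EX^4 + 3n(n-1)*(EX^2)^2 = 3n^2 - 2n here
for n in range(1, 9):
    assert exact_fourth_moment(n) == 3 * n ** 2 - 2 * n
print("E(S_n^4) = 3n^2 - 2n verified for n = 1, ..., 8")
```

Exhaustive enumeration (rather than simulation) makes the check exact, though it is only feasible for small n.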

Step 2: Use Markov’s inequality. We are interested to show that Sn /n converges to zero.
Fix ε > 0 and observe:
1 ES4n
P(|Sn /n| > ε) = P(S4n > n4 ε4 ) ≤ ,
ε4 n4
where we used (16.5), explained in Section 16.6. But we have an expression for E(S4n ), from
which we get
E(S4n ) ≤ cn2 ,
where c is a positive constant. (In fact, c can be taken to be 3(EX12 )2 .) Hence
c/ε4
.
P(|Sn /n| > ε) ≤
n2
Let us define the number of times n such that the sample mean Sn /n is outside [−ε, ε]:

$N_\varepsilon := \sum_{n=1}^{\infty} 1_{|S_n/n| > \varepsilon}.$   (17.7)
The expectation of this random variable is

$E N_\varepsilon = \sum_{n=1}^{\infty} P(|S_n/n| > \varepsilon) \le \frac{c}{\varepsilon^4} \sum_{n=1}^{\infty} \frac{1}{n^2} < \infty.$   (17.8)

Thus the random variable Nε has finite expectation, so it cannot take the value ∞ with positive
probability. Therefore,
P(Nε < ∞) = 1,
for all ε > 0, and hence for all rational ε > 0, so
P( for all rational ε > 0 Nε < ∞) = 1,
and so
P( for all ε > 0 Nε < ∞) = 1,
because Nε increases as ε decreases.

We’re done! This is just ordinary logic. Since all terms in (17.7) are 1 or 0, the statement
Nε < ∞ is equivalent to all but finitely many terms in the sum are equal to 0:

Nε < ∞ ⇐⇒ 1|Sn /n|>ε = 0 all but finitely many n ⇐⇒ |Sn /n| ≤ ε all but finitely many n

Therefore

for all ε > 0 Nε < ∞ ⇐⇒ for all ε > 0 |Sn /n| ≤ ε all but finitely many n
⇐⇒ the sequence Sn /n converges to 0.

And so
P(the sequence Sn /n converges to 0) = 1.

PROBLEM 17.2 (SLLN with nonzero mean). We explained the SLLN under the assumption
that EX1 = 0. How do you explain the more general case when µ = EX1 exists and is finite but
not necessarily zero?
Answer. Simply note that

$\frac{1}{n}\sum_{i=1}^{n} X_i \text{ converges to } \mu \iff \frac{1}{n}\sum_{i=1}^{n} (X_i - \mu) \text{ converges to } 0,$

and we explained why the latter event has probability 1. 


Remark 17.1. Can we only assume that E|X1 | < ∞? Yes, we can. This is the most important
case and you will learn this in another class.
PROBLEM 17.3 (SLLN for Bernoulli trials). Let X1 , X2 , . . . be i.i.d. Ber(p) random variables.
(1) State the SLLN in this case.
(2) For p = 1/2 and ε = 0.01, how can you give an upper bound for $P(N_\varepsilon > 10^9)$?
(3) If p is unknown but you are able to observe the values of X1 , X2 , . . ., how do you find p?
(4) If p is unknown but you are able to observe the values of X1 , . . . , Xn , how do you estimate
p and what can you say about the error(s) of your estimation?
(5) If you have the freedom to choose the "number of trials" n, how many trials do you need
so that the sample mean differs from p by at most 0.01 with probability at least 95%?
Answer. (1) We have
EX1 = p.
With
Sn = X1 + · · · + Xn ,
the SLLN states that

$P\Big(\lim_{n\to\infty} \frac{S_n}{n} = p\Big) = 1.$

(2) From Markov’s inequality,


ENε
P(Nε > 109 ) ≤ .
109
We obtained a bound for Nε in (17.8). This bound involves the given value of ε, the constant
c = 3(EX12 )2 . But X1 ∈ {0, 1}, so X12 = X1 , so c = 3(EX1 )2 = 2p2 (1 − p)2 , and the value
CHAPTER 17. THE FUNDAMENTAL THEOREM OF PROBABILITY 196

of the sum ∞ n=1 n2 . You can either use the well-known expression π /6 for the last sum,
1 2
P
P∞ 1
or compute it numerically. We get n=1 n2 ≈ 1.7 For p = 1/2, we get 3/16 = 0.1875. So
EN0.01 ≤ 0.1875×1.7
0.014
≈ 3 · 107 and so P(N0.01 > 109 ) ≈ 0.03.
(3) If the values of the sequence of functions X1 , X2 , . . . are known then you can find p by the
formula

$p = \lim_{n\to\infty} \frac{X_1 + \cdots + X_n}{n}.$
(4) If only the values of the first n functions X1 , . . . , Xn are known then we can, e.g., use
Chebyshev's inequality. Fix ε > 0 and write

$P(|S_n/n - p| > \varepsilon) = P(|S_n/n - p|^2 > \varepsilon^2) \le \frac{E(|S_n/n - p|^2)}{\varepsilon^2} = \frac{E[(S_n - np)^2]}{n^2 \varepsilon^2} = \frac{np(1-p)}{n^2 \varepsilon^2} \le \frac{1}{4n\varepsilon^2}.$

Since Sn = X1 + · · · + Xn is known, but p is unknown, it is better to write this inequality as

$P\Big(\frac{S_n}{n} - \varepsilon \le p \le \frac{S_n}{n} + \varepsilon\Big) \ge 1 - \frac{1}{4n\varepsilon^2}.$

So your estimate of p is Sn /n, but you can quantify the error. If you want p to be within ε of
Sn /n, for your choice of ε, then the probability that this is so is at least $1 - \frac{1}{4n\varepsilon^2}$. The smaller
the ε, the less sure you are.
(5) We need to choose n so that $1 - \frac{1}{4n\varepsilon^2} \ge 0.95$, where ε = 0.01. This gives n ≥ 50,000. 
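The arithmetic of parts (4) and (5) can be wrapped in a tiny helper (ours, for illustration): solving 1 − 1/(4nε²) ≥ confidence for n gives n ≥ 1/(4ε²(1 − confidence)).

```python
def chebyshev_sample_size(eps, confidence):
    """Lower bound on n from 1 - 1/(4 n eps^2) >= confidence,
    i.e. n >= 1 / (4 eps^2 (1 - confidence))."""
    return 1 / (4 * eps ** 2 * (1 - confidence))

# part (5): eps = 0.01 and 95% confidence give n >= 50,000
print(round(chebyshev_sample_size(0.01, 0.95)))
```

Note how quickly the required n grows as ε shrinks: it scales like 1/ε².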
In the last example, we saw a way to estimate something. You see, even when p is unknown,
the law of large numbers holds (because it does not care what p is). But Sn /n, the sequence of
sample means, does not depend on p. Since the sequence converges, for sure (!), to p, we can
use it to estimate p.
PROBLEM 17.4 (computing the length of some set via the SLLN). Explain, using the SLLN,
why the set S of real numbers 0 ≤ x ≤ 1 such that (with xi being the ith digit of x in its decimal
expansion) $\frac{1}{n}\sum_{i=1}^{n} x_i^2$ converges to 28.5 has length 1.

Answer. Let X be a unif([0, 1]) random variable. Then

P(X ∈ S) = length(S).

Write X = 0.X1 X2 X3 · · · in decimal. Then X1 , X2 , . . . are i.i.d., unif({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}) each.
By the SLLN,

$P(X \in S) = P\Big(\lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} X_i^2 = EX_1^2\Big) = 1, \qquad \text{where } EX_1^2 = \frac{1}{10}\sum_{i=0}^{9} i^2 = 28.5.$

Since P(X ∈ S) = length(S) we have length(S) = 1. 
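A numerical sanity check (our own script): the digits of a uniform number in [0, 1] are i.i.d. uniform on {0, 1, . . . , 9}, so we can simply generate i.i.d. digits and average their squares.

```python
import random

def mean_squared_digits(n, seed=1):
    """Average of the squares of n i.i.d. uniform decimal digits,
    i.e. the first n digits of a uniformly chosen number in [0, 1]."""
    rng = random.Random(seed)
    return sum(rng.randrange(10) ** 2 for _ in range(n)) / n

# by the SLLN this approaches (0^2 + 1^2 + ... + 9^2)/10 = 28.5
for n in (100, 100000):
    print(n, mean_squared_digits(n))
```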


PROBLEM 17.5 (SLLN for functions of i.i.d. r.v.s). Let N1 , N2 , . . . be i.i.d. Poi(λ) random
variables. Define the random variable X by setting

$X := \lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} (N_{2i-1} + N_{2i})^2,$

if the limit exists. If the limit does not exist set X = 0. Compute the distribution function of X.
Answer. The random variables Zi := (N2i−1 + N2i )2 , i = 1, 2, . . ., are i.i.d. Note that

$EZ_1 = E(N_1^2 + N_2^2 + 2N_1 N_2) = (\lambda + \lambda^2) + (\lambda + \lambda^2) + 2\lambda^2 = 2\lambda + 4\lambda^2,$

since $EN_1^2 = \operatorname{var}(N_1) + (EN_1)^2 = \lambda + \lambda^2$ and, by independence, $E(N_1 N_2) = \lambda^2$. By the SLLN,

$P\Big(\lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} Z_i = EZ_1\Big) = 1.$

Hence
$P(X = 2\lambda + 4\lambda^2) = 1.$
This means that the distribution function F(x) of X is given by

$F(x) = P(X \le x) = \begin{cases} 1, & x \ge 2\lambda + 4\lambda^2, \\ 0, & \text{otherwise.} \end{cases}$

17.4 Laws of Large Numbers in Mathematics, Physics and Statistics
It should also be noted that the Law of Large Numbers appears in other areas of mathematics
and science.

Mechanics For example, it appears in a mathematical system called Mechanics which deals
with the motion of particles according to Newton’s law of motion. The latter states that when
a particle moves in space then the second derivative of its position vector is proportional
to a quantity known as force. If you take a large number of particles moving according to
Newton’s laws but without affecting one another then you can define the density of the system
of particles by calculating the proportion of particles and their velocities that lie in a subset of
the position-velocity space. Liouville’s theorem states that this density is preserved by the
motion. This is a Law of Large Numbers in disguise. You can read more about this here. You
see, a (good) course in Mechanics helps one understand probability, and a (good) course in
the latter helps the former too.

Dynamical systems It also appears in another area of mathematics called Dynamical Systems.
A dynamical system is, for example, the equations of motion by Newton. But it’s something
much more general. A dynamical system is, roughly speaking, something that depends on
time in such a way that the future after time t depends on the past before t only through the
present at time t. For example, consider the sequence

t C(t)
1 1
2 11
3 21
4 1211
5 111221
6 312211
7 13112221
··· ···

There is a rule producing this sequence. Can you figure it out? The existence of the rule tells
us that to figure out what the future value C(t + k) is we only need to know the present value
C(t) (and not the past ones). If you cannot figure out the rule, look here. I called the sequence
C because it was invented by the Liverpudlian John H. Conway (who died of covid in 2020).
Dynamical Systems often satisfy "laws of large numbers". For
example, for the sequence above, one may ask: if L(t) is the length of the sequence at time t and
L j (t) is the number of occurrences of the digit j in C(t), does

$\lim_{t\to\infty} \frac{L_j(t)}{L(t)} \text{ exist}, \quad j = 1, 2, 3\,?$

Nobody knows the answer. However, Conway proved that $\lim_{t\to\infty} \frac{L(t+1)}{L(t)} = 1.303577269034\cdots$.
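For the curious, the rule is: read the previous term aloud, run by run ("look and say"). It fits in a few lines; this script (ours) regenerates the table and prints the ratios L(t + 1)/L(t):

```python
from itertools import groupby

def look_and_say(s):
    """Next term of Conway's sequence: read runs of equal digits aloud,
    e.g. '1211' reads 'one 1, one 2, two 1s', giving '111221'."""
    return "".join(str(len(list(g))) + d for d, g in groupby(s))

term = "1"
for _ in range(6):
    nxt = look_and_say(term)
    print(term, "->", nxt, "  L(t+1)/L(t) =", len(nxt) / len(term))
    term = nxt
```

Iterating further, the length ratios slowly approach Conway's constant 1.3035…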

Statistics In Statistics, we are interested in figuring out the distribution function F(x) of a
random variable X. We do the following. Let X1 , X2 , . . . be i.i.d. copies of X, that is, a sequence
of independent random variables such that

P(X j ≤ x) = F(x), x ∈ R, j = 1, 2, . . .

Now consider the random function

$F_n(x) = \frac{1}{n}\sum_{j=1}^{n} 1_{X_j \le x}.$

Look at Section 8.3 to realize that Fn defines a new probability from the “data” (X1 , . . . , Xn ). In
fact, x 7→ Fn (x) is a distribution function, for each n. It is called empirical distribution function.
Notice that, for each x, the random variables

1X j ≤x , j = 1, 2, . . .

take values 0 or 1 only and they are independent. Moreover,

P(1X j ≤x = 1) = E1X j ≤x = P(X j ≤ x) = F(x), P(1X j ≤x = 0) = 1 − F(x).

Hence the Law of Large Numbers we proved in the previous section says that

$P\big(\lim_{n\to\infty} F_n(x) = F(x)\big) = 1, \quad \text{for all } x \in \mathbb{R}.$

Does this allow us to estimate the FUNCTION F? No, because the rate of convergence depends
on x.
However, something stronger is true:

Theorem 17.2 (the fundamental theorem of Statistics).

$P\big(\lim_{n\to\infty} F_n(x) = F(x) \text{ uniformly in } x \in \mathbb{R}\big) = 1.$

This is also known as the Glivenko-Cantelli theorem.
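To watch the Glivenko-Cantelli theorem numerically one can compute sup_x |Fn(x) − F(x)| for samples from the uniform law on [0, 1], where F(x) = x. For a sorted sample the supremum is attained at the jumps of Fn, which gives the one-line formula below; the script is our own illustration.

```python
import random

def ks_distance_uniform(n, seed=3):
    """sup over x of |F_n(x) - F(x)| for n i.i.d. unif(0,1) samples,
    where F(x) = x; the sup is attained at the jump points of F_n."""
    rng = random.Random(seed)
    xs = sorted(rng.random() for _ in range(n))
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))

for n in (100, 10000):
    print(n, ks_distance_uniform(n))  # shrinks as n grows
```

The printed distances decrease roughly like 1/√n, in line with the uniform convergence the theorem asserts.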

17.5 The weak law of large numbers

Let X1 , X2 , . . . be a sequence of independent random variables with common expectation µ.
Let $S_n := \sum_{i=1}^{n} X_i$. The weak law of large numbers states that, under some conditions,

$\text{for all } \varepsilon > 0, \quad P\Big(\Big|\frac{S_n}{n} - \mu\Big| > \varepsilon\Big) \to 0, \text{ as } n \to \infty.$

?PROBLEM 17.6 (strong law implies weak law). Explain why, under the conditions of
Theorem 17.1, namely that the X1 , X2 , . . . be i.i.d. and have common expectation µ, the weak
law of large numbers holds.
Answer. The strong law of large numbers states that

$P\Big(\frac{S_n}{n} \text{ converges to } \mu\Big) = 1.$

See (17.3)+(17.4). Hence the complement of the event $\big\{\frac{S_n}{n} \text{ converges to } \mu\big\}$ has probability 0:

$P\Big(\frac{S_n}{n} \text{ does not converge to } \mu\Big) = 0.$
But

$\Big\{\frac{S_n}{n} \text{ does not converge to } \mu\Big\} = \Big\{\text{there is } \varepsilon > 0 \text{ such that for all } N \text{ there is } n \ge N \text{ with } \Big|\frac{S_n}{n} - \mu\Big| > \varepsilon\Big\}$
$= \bigcup_{\varepsilon > 0} \Big\{\text{for all } N \text{ there is } n \ge N \text{ with } \Big|\frac{S_n}{n} - \mu\Big| > \varepsilon\Big\}.$

If the union of countably many events has probability 0 then each event has probability 0.
Since we can consider ε to be rational, this simple observation applies. So

$P\Big(\text{for all } N \text{ there is } n \ge N \text{ with } \Big|\frac{S_n}{n} - \mu\Big| > \varepsilon\Big) = 0.$
n
The events

$I_N = \Big\{\text{there is } n \ge N \text{ with } \Big|\frac{S_n}{n} - \mu\Big| > \varepsilon\Big\}$

satisfy
I1 ⊃ I2 ⊃ I3 ⊃ · · ·
In this case, by (AXIOM TWO),

$P\Big(\text{for all } N \text{ there is } n \ge N \text{ with } \Big|\frac{S_n}{n} - \mu\Big| > \varepsilon\Big) = \lim_{N\to\infty} P(I_N).$

So
$\lim_{N\to\infty} P(I_N) = 0.$
But

$\Big\{\Big|\frac{S_N}{N} - \mu\Big| > \varepsilon\Big\} \subset I_N,$

and so

$P\Big(\Big|\frac{S_N}{N} - \mu\Big| > \varepsilon\Big) \le P(I_N).$

The latter has limit 0. And so does the former. 
Why is the weak law of large numbers useful? One answer is because it may hold under a
different set of conditions from those of the strong law. But this is rather subtle to appreciate.
Chapter 18

Normality, normally and smoothly

“The normal, or Gaussian, law is as important as the concept of a


straight line and the Pythagorean theorem in Geometry.”
– Original

18.1 Review
1. In Section 14.3.3 we defined the standard normal law on R as a probability measure such
that it had density proportional to $e^{-x^2/2}$. We computed the constant in front of it and found it
to be $1/\sqrt{2\pi}$. In order to do so, we passed on to 2 dimensions and discovered a circle.

2. In Section 14.5.4.2 we defined the standard normal law on R2 as a probability measure
that is obtained by the product rule, or, equivalently, the distribution of (X, Y) where X, Y
are i.i.d. random variables with standard normal distribution each. Since the density of the
standard normal law on R2 is proportional to $e^{-(x^2+y^2)/2}$, and since

the function x2 + y2 is invariant under rotations (hint: the Pythagorean theorem is
a geometric theorem)

it follows that

the standard normal law on R2 is also invariant under rotations.

PROBLEM 18.1 (because of the Pythagorean theorem). Show that x2 + y2 is invariant under
rotations, first by using Cartesian coordinates and then without.
Answer. If we rotate (x, y) by an angle θ we obtain a new point (x′ , y′ ) with coordinates

x′ = x cos θ − y sin θ
y′ = x sin θ + y cos θ

But then

$x'^2 + y'^2 = (x\cos\theta - y\sin\theta)^2 + (x\sin\theta + y\cos\theta)^2 = x^2 + y^2$

because sin2 θ + cos2 θ = 1.
Alternatively, $\sqrt{x^2 + y^2}$ is the length of the hypotenuse of a right triangle with vertices (0, 0),
(x, 0), (x, y). Length does not change if we rotate. 
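The computation above is trivial to machine-check on random points and angles (script ours):

```python
import math
import random

def rotate(x, y, theta):
    """Rotate the point (x, y) by angle theta about the origin."""
    return (x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta))

rng = random.Random(7)
for _ in range(1000):
    x, y = rng.uniform(-5, 5), rng.uniform(-5, 5)
    xr, yr = rotate(x, y, rng.uniform(0, 2 * math.pi))
    # x^2 + y^2 is preserved, up to floating-point roundoff
    assert abs((xr ** 2 + yr ** 2) - (x ** 2 + y ** 2)) < 1e-9
print("x^2 + y^2 is invariant under rotations")
```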

3. Then look at Problem 16.14 where we discovered, using the machinery of moment
generating functions, the law of
aX + bY
when X, Y are independent normals. We read a special case of this problem:

If X, Y are i.i.d. standard normals then

aX + bY is N(0, a2 + b2 ).

This can be written as

If X, Y are i.i.d. standard normals then

(L) $aX + bY \overset{d}{=} cZ$, where Z is standard normal and $c^2 = a^2 + b^2$.

4. So we may now wonder:


a) Are there other laws that are invariant under rotations?
b) Are there other laws that satisfy property (L) above, perhaps with a different c?

18.2 The unavoidability of the normal law

Theorem 18.1 (the normal law is unavoidable). Let Q be a probability law on R such that:
(i) If Z has law Q then EZ = 0 and EZ2 = 1.
(ii) If X, Y are independent with common law Q then, for all a, b ∈ R there is c ∈ R such that

$aX + bY \overset{d}{=} cZ.$   (18.1)

Then Q = N(0, 1).

I will explain this under the additional assumption

$M(t) = Ee^{tZ} < \infty$

for some t ≠ 0. I don't have to assume this. I am just doing so to avoid using complex
numbers.
We assume that (18.1) holds.

Step 1: Take squares and then expectations. We have $(aX + bY)^2 = a^2X^2 + b^2Y^2 + 2abXY$.
Since E(XY) = (EX)(EY) = 0, we have $E[(aX + bY)^2] = a^2 E(X^2) + b^2 E(Y^2) = a^2 + b^2$, because we
also assumed E(X2 ) = 1, and so E(Y2 ) = 1. On the other hand, since (18.1) holds, we have
$E[(aX + bY)^2] = E[(cZ)^2] = c^2$, therefore

$a^2 + b^2 = c^2.$   (18.2)

Hence c is uniquely specified, up to a sign.

Step 2: Take exponentials and then expectations. Since (18.1) holds and since we made the
assumption that M(t) < ∞ for some t ≠ 0, we have
$Ee^{t(aX+bY)} = Ee^{tcZ} = M(ct).$
By independence,
$Ee^{t(aX+bY)} = Ee^{taX} \cdot Ee^{tbY} = M(at)M(bt),$
and so
$M(at)M(bt) = M(ct).$   (18.3)

Step 3: Solving the equations. We now have two identities, an algebraic one (18.2), and a
functional one (18.3). We call the latter functional because our assumption that M(t) be finite
for some t ≠ 0 implies that it is finite on the nontrivial interval with endpoints t and 0, so
there is room to move around. In order to transform multiplication in (18.3) into addition and
in order to get rid of the squares of (18.2), we define

$L(u) = \log M(\sqrt{u}),$

so

$M(t) = e^{L(t^2)}.$

Substituting this into (18.3) and using (18.2) leads to
$L(a^2 t^2) + L(b^2 t^2) = L(a^2 t^2 + b^2 t^2).$
Since a, b are arbitrary, we can rewrite this as
$L(u_1) + L(u_2) = L(u_1 + u_2).$
This identity is true for small u1 , u2 and since the only continuous function that preserves
addition is linear, we obtain that there is a constant C such that
$L(u) = Cu,$
which gives
$M(t) = e^{Ct^2}.$
We differentiate this function twice:
$M'(t) = 2Ct\,M(t), \qquad M''(t) = 2C\,M(t) + 2Ct\,M'(t).$
But $M''(0) = EZ^2 = 1$. Thus 2C = 1, or C = 1/2. We have thus found

$M(t) = e^{\frac{1}{2}t^2}.$

From (16.9), we see that this is the moment generating function for the N(0, 1). Therefore, Q = N(0, 1).

PROBLEM 18.2 (linear combination of i.i.d. standard normals). Let X, Y, W be i.i.d. N(0, 1)
and let a, b, c be real numbers. What is the law of aX + bY + cW?
Answer. We have

$aX + bY \overset{d}{=} \sqrt{a^2 + b^2}\, Z,$

where Z is N(0, 1). Similarly,

$\sqrt{a^2 + b^2}\, Z + cW \overset{d}{=} \sqrt{\big(\sqrt{a^2 + b^2}\big)^2 + c^2}\, Z' = \sqrt{a^2 + b^2 + c^2}\, Z',$

where Z′ is N(0, 1). Therefore, the law of aX + bY + cW is N(0, a2 + b2 + c2 ). 


PROBLEM 18.3 (linear combination of independent centered normals). Let X, Y, W be
independent centered normals. Show that the law of aX + bY + cW is N(0, var(aX + bY + cW)).
Answer. Let α2 , β2 , γ2 be the variances of X, Y, W, respectively. Then

$aX + bY + cW = a\alpha\,\frac{X}{\alpha} + b\beta\,\frac{Y}{\beta} + c\gamma\,\frac{W}{\gamma},$

and X/α, Y/β, W/γ are i.i.d. standard normals. By Problem 18.2, the right-hand side of the above
display has law $N(0, (a\alpha)^2 + (b\beta)^2 + (c\gamma)^2) = N(0, \operatorname{var}(aX + bY + cW))$.

18.3 Normal(µ, σ2 ) distribution, reprise

Recall a few things:

If X has density f (x) then

X + a has density f (x − a)

and, for b > 0,

$bX \text{ has density } \frac{1}{b}\, f\Big(\frac{x}{b}\Big).$

Therefore,

$bX + a \text{ has density } \frac{1}{b}\, f\Big(\frac{x-a}{b}\Big).$

We also have

$X \text{ is } N(\mu, \sigma^2) \iff \frac{X - \mu}{\sigma} \text{ is } N(0, 1)$

and so

X has normal law

$\iff \text{it has density } \frac{1}{\sqrt{2\pi \operatorname{var} X}} \exp\Big(-\frac{(x - EX)^2}{2 \operatorname{var} X}\Big)$   (18.4)

$\iff Ee^{tX} = \exp\Big(\frac{1}{2}\operatorname{var}(tX) + E(tX)\Big).$   (18.5)

?PROBLEM 18.4 (linear combination of independent normals). Let X1 , X2 , . . . , Xn be inde-
pendent normals and a1 , a2 , . . . , an real numbers. Show that the law of $\sum_{i=1}^{n} a_i X_i$ is normal.
Answer. Let µi = EXi . We have

$\sum_{i=1}^{n} a_i X_i = \sum_{i=1}^{n} a_i \mu_i + \sum_{i=1}^{n} a_i (X_i - \mu_i).$

But X1 − µ1 , . . . , Xn − µn are independent centered normals. By Problem 18.3, $\sum_{i=1}^{n} a_i(X_i - \mu_i)$
is normal, and adding the constant $\sum_{i=1}^{n} a_i\mu_i$ gives another normal with that constant as its
expectation.

18.4 Normal law in higher dimensions

We need to define a law on Rd that we can confidently call normal. By the Cramér-Wold
theorem 16.5, it suffices to

know the law of t1 X1 + · · · + td Xd for all (t1 , . . . , td ) ∈ Rd .

So why don’t we just define this law to be normal?


Definition 18.1 (normal law on Rd ). We say that the random vector (X1 , . . . , Xd ) has a normal
law on Rd if t1 X1 + · · · + td Xd has a normal law on R for all (t1 , . . . , td ) ∈ Rd .

Directly from this definition and formula (16.9) for the moment generating function of a
single normal random variable we find:

If X = (X1 , . . . , Xd ) is normal on Rd then its moment generating function is given by

$Ee^{t_1 X_1 + \cdots + t_d X_d} = \exp\Big(\sum_{i=1}^{d} t_i\, EX_i + \frac{1}{2}\sum_{i=1}^{d}\sum_{j=1}^{d} t_i \operatorname{cov}(X_i, X_j)\, t_j\Big)$   (18.6)

Now recall the definition (16.10) of the covariance matrix of a random vector, and its properties.
This was the topic of Section 16.9.1. The formalism of Linear Algebra here comes to a rescue,
not only because we can write things more succinctly but also because Linear Algebra helps
us establish the converse of Problem 18.3, namely,

Whitening: any linear combination of a normal random vector (X1 , . . . , Xd ) is
a (different) linear combination of i.i.d. normal random variables.   (18.7)

See Figure 18.1 for a physical explanation of the term whitening.


Note that the moment generating function (18.6) depends only on the expectation EX of
the random vector (X1 , . . . , Xd ) and on its covariance matrix cov(X) that I recall here:

$EX = (EX_1, \ldots, EX_d), \qquad \operatorname{cov}(X) = \begin{pmatrix} \operatorname{var}(X_1) & \operatorname{cov}(X_1, X_2) & \cdots & \operatorname{cov}(X_1, X_d) \\ \operatorname{cov}(X_2, X_1) & \operatorname{var}(X_2) & \cdots & \operatorname{cov}(X_2, X_d) \\ \cdots \\ \operatorname{cov}(X_d, X_1) & \operatorname{cov}(X_d, X_2) & \cdots & \operatorname{var}(X_d) \end{pmatrix}$

Figure 18.1: When an electric current whose variation in time is described by a sequence of i.i.d.
normal random variables is passed through a speaker, it is transformed into an air-pressure function
of time that a human ear hears as noise, called white noise.

Definition 18.2 (symbol for normal law on Rd ). If (X1 , . . . , Xd ) is a normal random vector
with expectation vector µ and covariance matrix R then we use the symbol N(µ, R) for its law.

If we consider the elements of Rd as column vectors, e.g.,

$X = \begin{pmatrix} X_1 \\ \vdots \\ X_d \end{pmatrix}, \quad \mu = \begin{pmatrix} \mu_1 \\ \vdots \\ \mu_d \end{pmatrix}, \quad t = \begin{pmatrix} t_1 \\ \vdots \\ t_d \end{pmatrix},$

and write their transposes as

$X^T = (X_1, \ldots, X_d), \quad \mu^T = (\mu_1, \ldots, \mu_d), \quad t^T = (t_1, \ldots, t_d),$

then we can succinctly write

$t_1 X_1 + \cdots + t_d X_d = t^T X,$
$\sum_{i=1}^{d} t_i\, EX_i = \sum_{i=1}^{d} t_i \mu_i = t^T \mu,$
$\sum_{i=1}^{d}\sum_{j=1}^{d} t_i \operatorname{cov}(X_i, X_j)\, t_j = \sum_{i=1}^{d}\sum_{j=1}^{d} t_i R_{i,j}\, t_j = t^T R t,$

and so (18.6) can be written as

$Ee^{t^T X} = \exp\Big(t^T \mu + \frac{1}{2}\, t^T R t\Big).$   (18.8)

Remark 18.1. Note that we can also write the covariance matrix as

R = E((X − µ)(X − µ)T )

simply because when we multiply the column vector X − µ by the row vector (X − µ)T we
obtain a square matrix whose elements are (Xi − µi )(X j − µ j ); the expectation of which is
cov(Xi , X j ).

?PROBLEM 18.5 (uncorrelatedness ⇒ independence under normality). Let (X1 , . . . , Xd ) be
normal in Rd . Explain why

cov(Xi , X j ) = 0 for all i ≠ j ⇒ X1 , . . . , Xd are independent.

Answer. If cov(Xi , X j ) = 0 for all i ≠ j we have, from (18.6), with µi = EXi and σi2 = var(Xi ),

$Ee^{t_1 X_1 + \cdots + t_d X_d} = \exp\Big(\sum_{i=1}^{d}\big(t_i \mu_i + \tfrac{1}{2} t_i^2 \sigma_i^2\big)\Big) = \prod_{i=1}^{d} e^{t_i \mu_i + \frac{1}{2} t_i^2 \sigma_i^2} = \prod_{i=1}^{d} Ee^{t_i X_i},$

so we have exactly the situation studied in Problem 16.17, that is, (16.11) holds: the moment
generating function of (X1 , . . . , Xd ) becomes a product of individual moment generating
functions. So the X1 , . . . , Xd are independent. 

18.5 Deriving the density for the normal law on Rd

When does X = (X1 , . . . , Xd ) have a density on Rd ? Remember that if d = 1 then N(µ, σ2 ) always
has a density unless σ2 = 0. The equivalent criterion in d dimensions is:

Theorem 18.2 (existence of density of normal law on Rd ).

N(µ, R) has density on Rd ⇐⇒ det(R) ≠ 0.

We will not fully explain this (as usual, we do not prove theorems in this course), but
we will make a couple of remarks and then find a formula for the density when it exists.

Remark 18.2. If det(R) = 0 then the columns of R are linearly dependent. This linear
dependence translates into the fact that there is, with probability 1, some linear dependence
among the random variables X1 , . . . , Xd themselves. And this immediately implies that (X1 , . . . , Xd )
belongs to a (d − 1)-dimensional subspace, say, of Rd that necessarily has zero d-volume and
so the density does not exist. See Section 15.5 for these notions. In particular, recall the notion
of zero d-volume presented in the Summary of page 159.
Remark 18.3. If det(R) = 0 then one can show that if we let

V = {Rx : x ∈ Rd },

then the dimension m of V is < d and

P(X ∈ V) = 1.

If we then choose some basis of V and express X in this basis then we will obtain an
m-dimensional normal random vector that has a density. This is just a matter of linear algebra.
Remark 18.4. If det(R) ≠ 0 then we simply show that the density exists by deriving a formula.
This is done next.

Let X = (X1 , . . . , Xd ) be a N(µ, R) random vector on Rd such that det(R) ≠ 0. Then
X has density given by

$f(x) = \frac{1}{\sqrt{(2\pi)^d \det(R)}} \exp\Big(-\frac{1}{2}(x - \mu)^T R^{-1} (x - \mu)\Big).$   (18.9)

Explanation

To understand (18.9), we will simply verify that it works, namely, that

$\int_{\mathbb{R}^d} e^{t^T x} f(x)\, dx = \exp\Big(t^T \mu + \frac{1}{2}\, t^T R t\Big).$

It suffices to do this for µ = 0. With µ = 0 the above display is written as

$\int_{\mathbb{R}^d} \exp\Big(-\frac{1}{2}\big(x^T R^{-1} x - 2t^T x\big)\Big)\, dx = \sqrt{(2\pi)^d \det(R)}\, \exp\Big(\frac{1}{2}\, t^T R t\Big).$

Step 1. Realizing that we must complete the square. We therefore need to find a way to
perform the integral on the left. In Problem 3.2 we completed the square for a quadratic
polynomial of one variable. We are faced with the same problem here: complete the square
for a quadratic polynomial of d variables:

Q(x) = xT R−1 x − 2tT x.

We need to take the square root of the matrix R.

Step 2. Finding the square root. How can we define the square root of the matrix R?
Since R is symmetric and positive definite (see Section 16.9.1, equation (16.10) infra) it has
d nonnegative eigenvalues λ1 , . . . , λd . Moreover, one can choose the eigenvectors u1 , . . . , ud ,
corresponding to these eigenvalues, so that they are pairwise orthogonal and have length 1.
So if we define the matrix U whose columns are the eigenvectors, we will have $U^T = U^{-1}$.
Letting Λ be the d × d diagonal matrix whose diagonals are the eigenvalues, we have

RU = UΛ,

which is a succinct way of writing

$Ru_j = \lambda_j u_j, \quad j = 1, \ldots, d,$

the relations obeyed by the eigenvectors. Hence

$R = U\Lambda U^{-1} = U\Lambda U^T.$

Thinking geometrically, this says that Λ is the matrix of the linear function x ↦ Rx in the basis
of the eigenvectors. Since Λ is so simple (diagonal: it means that the mapping scales along
the straight lines defined by the eigenvectors), it makes perfect sense to define $\sqrt{\Lambda}$ as the
diagonal matrix whose diagonal elements are the $\sqrt{\lambda_j}$. We then have

$R = U\sqrt{\Lambda}\sqrt{\Lambda}\, U^T = \big(U\sqrt{\Lambda}\big)\big(U\sqrt{\Lambda}\big)^T.$

We have not quite written R as the square of another matrix but we have written it as

$R = SS^T; \qquad S := U\sqrt{\Lambda}.$   (18.10)

(We can't expect square root to mean the same as for real numbers. After all, multiplication
of real numbers is commutative: ab = ba, but multiplication of matrices is not: AB ≠ BA, in
general.) The only knowledge we need is that R = SST , for some S (that we constructed).

Step 3: Completing the square. Replace R by SST in Q(x) and write

$Q(x) = x^T (SS^T)^{-1} x - 2t^T x.$

Since det(R) = det(S) det(ST ) = det(S)2 , it follows that det(S) ≠ 0 as well, so S−1 and
(ST )−1 = (S−1 )T exist. We then write

$Q(x) = \big(S^{-1}x\big)^T \big(S^{-1}x\big) - 2t^T S\big(S^{-1}x\big) \equiv y^T y - 2u^T y$
$= y^T y - 2u^T y + u^T u - u^T u$
$= (y - u)^T (y - u) - u^T u,$

where, for brevity, we defined

$y := S^{-1}x, \qquad u := S^T t.$   (18.11)

The square has been completed.

Step 4: Performing the integral. With this new expression for Q(x) we have

$\int_{\mathbb{R}^d} e^{-\frac{1}{2}Q(x)}\, dx = e^{\frac{1}{2}u^T u} \int_{\mathbb{R}^d} e^{-\frac{1}{2}(y-u)^T(y-u)}\, d(Sy).$

The determinant of the Jacobian of the mapping y ↦ Sy is $\det(S) = \sqrt{\det(R)}$ and so the above
is further equal to

$e^{\frac{1}{2}u^T u}\, \sqrt{\det(R)} \int_{\mathbb{R}^d} e^{-\frac{1}{2}(y-u)^T(y-u)}\, dy = e^{\frac{1}{2}u^T u}\, \sqrt{\det(R)} \int_{\mathbb{R}^d} e^{-\frac{1}{2}z^T z}\, dz,$

by the simple change of variables z = y − u. But

$\int_{\mathbb{R}^d} e^{-\frac{1}{2}z^T z}\, dz = \int_{\mathbb{R}^d} e^{-\frac{1}{2}z_1^2} \cdots e^{-\frac{1}{2}z_d^2}\, dz_1 \cdots dz_d = \Big(\int_{\mathbb{R}} e^{-\frac{1}{2}z_1^2}\, dz_1\Big) \cdots \Big(\int_{\mathbb{R}} e^{-\frac{1}{2}z_d^2}\, dz_d\Big) = (\sqrt{2\pi})^d = (2\pi)^{d/2}.$

Putting things together, we have found that

$\int_{\mathbb{R}^d} e^{-\frac{1}{2}Q(x)}\, dx = e^{\frac{1}{2}u^T u}\, \sqrt{\det(R)}\, (2\pi)^{d/2}.$

Recalling that u = ST t, we have $u^T u = t^T S S^T t = t^T R t$, and so

$\int_{\mathbb{R}^d} e^{-\frac{1}{2}Q(x)}\, dx = e^{\frac{1}{2}t^T R t}\, \sqrt{(2\pi)^d \det(R)}.$

This is exactly what we wanted to show. And therefore we know that (18.9) is, indeed, a
density for N(µ, R) when det(R) ≠ 0.
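In dimension d = 2 the inverse and determinant in (18.9) are explicit, so the density can be evaluated directly. The sketch below (ours, not from the notes) checks one consistency point: for R = I and µ = 0 the density factors into two standard normal densities.

```python
import math

def normal_density_2d(x, mu, R):
    """Evaluate (18.9) for d = 2, with R = [[a, b], [b, c]] and det(R) != 0."""
    (a, b), (_, c) = R
    det = a * c - b * b
    v0, v1 = x[0] - mu[0], x[1] - mu[1]
    # quadratic form v^T R^{-1} v via the explicit 2x2 inverse of R
    q = (c * v0 ** 2 - 2 * b * v0 * v1 + a * v1 ** 2) / det
    return math.exp(-q / 2) / (2 * math.pi * math.sqrt(det))

# with R = I and mu = 0 the density equals the product of two
# standard normal densities, as it must (Problem 18.5 in reverse)
phi = lambda t: math.exp(-t * t / 2) / math.sqrt(2 * math.pi)
x = (0.3, -1.2)
assert abs(normal_density_2d(x, (0, 0), [[1, 0], [0, 1]]) - phi(x[0]) * phi(x[1])) < 1e-12
print("(18.9) agrees with the i.i.d. case when R = I")
```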

18.6 Whitening

Let X = (X1 , . . . , Xd ) be N(0, R). How do we represent the Xi as linear combinations of
independent normals? We will only explain how when det(R) ≠ 0.
We have actually already done it. Look at (18.11) and define

$Y = S^{-1} X.$

We think of Y and X as columns and we recall that S is the square root of R, namely, R = SST .
We now have

$\operatorname{cov}(Y) = E(YY^T) = E\big(S^{-1} X X^T (S^T)^{-1}\big) = S^{-1} E(XX^T)(S^T)^{-1} = S^{-1} S S^T (S^T)^{-1} = I,$

where I is the identity matrix. This means that

cov(Yi , Y j ) = 0 if i ≠ j.

Since (Y1 , . . . , Yd ) is normal on Rd , it follows that the random variables Y1 , . . . , Yd are indepen-
dent. In fact, they are all N(0, 1).
Therefore, our representation is

X = SY.

Or, explicitly, $X_i = \sum_{j=1}^{d} S_{i,j} Y_j$, i = 1, . . . , d.
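A concrete sketch of whitening in d = 2 (ours, not from the notes). Any factor S with SS^T = R can play the role of the square root here, as remarked at the end of Step 2, so instead of U√Λ we use the lower-triangular (Cholesky) factor, which has a closed form in dimension 2. We color i.i.d. normals via X = SY, then whiten with S^{-1} and check that the recovered coordinates are uncorrelated.

```python
import random

def cholesky_2x2(R):
    """Lower-triangular S with S S^T = R, for positive definite
    R = [[a, b], [b, c]]."""
    a, b, c = R[0][0], R[0][1], R[1][1]
    s11 = a ** 0.5
    s21 = b / s11
    s22 = (c - s21 ** 2) ** 0.5
    return [[s11, 0.0], [s21, s22]]

R = [[2.0, 1.2], [1.2, 1.0]]
S = cholesky_2x2(R)

rng = random.Random(5)
prods = []
for _ in range(50000):
    # color: X = S Y with Y i.i.d. standard normal
    y1, y2 = rng.gauss(0, 1), rng.gauss(0, 1)
    x1, x2 = S[0][0] * y1, S[1][0] * y1 + S[1][1] * y2
    # whiten: Y = S^{-1} X (S is triangular, so solve directly)
    w1 = x1 / S[0][0]
    w2 = (x2 - S[1][0] * w1) / S[1][1]
    prods.append(w1 * w2)

cov12 = sum(prods) / len(prods)
print(round(cov12, 3))  # near 0: the whitened coordinates are uncorrelated
```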

PROBLEM 18.6 (whitening example). Let (X1 , X2 ) be N(µ, R), with your favorite R, and write
it as a linear function of i.i.d. standard normals.

18.7 Normal distribution and the circle


What is most important when dealing with a normal distribution is not one normal random
variable but two!

The normal distribution is better understood on the plane rather than on the line.

Let’s explain this better. Let (X, Y) be a pair of i.i.d. standard normal random variables.
Consider a rotation by angle θ about the origin of the plane and. Let (X̂, Ŷ) be the rotated
(X, Y). We have

X̂ = X cos θ − Y sin θ
Ŷ = X sin θ + Y cos θ

We then have EX̂ = EŶ = 0, and, since EX2 = EY2 = 1 and EXY = 0, we also have

EX̂2 = (EX2 ) cos2 θ + (EY2 ) sin2 θ = cos2 θ + sin2 θ = 1, EŶ2 = cos2 θ + sin2 θ = 1,

by the Pythagorean theorem, and


EX̂Ŷ = 0.
We now let R be the distance of (X, Y) from the origin of the plane and Θ be the angle between
the line defined by the origin and (X, Y) and the defined by the origin and (1, 0) (the latter line
is known as “x-axis”). Hence,

R := X2 + Y2 , tan Θ = Y/X, 0 < Θ < 2π,

(where we actually ignored the possibility that Y = 0 because the probability of this event is
zero). Since
tan Θ = Y/X = Ŷ/X̂,
we have that the distribution of Θ is uniform:
β−α
P(α < Θ < β) = , 0 < α < β < 2π.

On the other hand, we have that

$P(R^2 > t) = \iint_{x^2+y^2 > t} \frac{e^{-(x^2+y^2)/2}}{2\pi}\, dx\, dy = \iint_{r > \sqrt{t},\ 0 < \theta < 2\pi} \frac{e^{-r^2/2}}{2\pi}\, r\, dr\, d\theta = \int_{\sqrt{t}}^{\infty} e^{-r^2/2}\, r\, dr,$

and so R2 has density given by

$\frac{d}{dt} P(R^2 \le t) = -\frac{d}{dt} \int_{\sqrt{t}}^{\infty} e^{-r^2/2}\, r\, dr = \frac{1}{2}\, e^{-t/2}.$

Hence R2 has the expon(1/2) density. If we now rotate (X, Y) by a fixed angle θ about the origin,
we see that its distance from the origin does not change, whereas Θ changes by adding θ. This
implies that P(R ≤ t, Θ ≤ β) depends linearly on β. Hence P(R ≤ t, Θ ≤ β) = F1 (t)β, for some
function F1 . It follows that R and Θ are independent. We summarize:

If (X, Y) are i.i.d. standard normal random variables then R2 = X2 + Y2 and


Θ = arctan(Y/X) are independent. Moreover, R2 has expon(1/2) density and Θ has
uniform distribution on (0, 2π).

PROBLEM 18.7 (an open problem). Show that T = X² + Y² satisfies the memoryless property

P(T > t + s|T > t) = P(T > s),

for all s, t > 0, without using the above analytical computation.


Answer. I do not have an answer because I do not know how to do this. Maybe you, the
student, can. □
Practical rule. Suppose that we wish to generate d i.i.d. standard normal random variables
X1, X2, . . . from i.i.d. expon(1/2) random variables H1, H2, . . . and independent i.i.d. uniform
(on (0, 2π)) random variables Θ1, Θ2, . . .. We then do the following:

X1 = √H1 cos Θ1,  X2 = √H1 sin Θ1,  X3 = √H2 cos Θ2,  X4 = √H2 sin Θ2,  . . .
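This practical rule is a form of the Box-Muller method. It can be sketched as follows; a minimal sketch assuming NumPy is available, with “expon(1/2)” meaning rate 1/2, i.e. mean 2 (the distribution of R² above):

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs = 100_000

# H ~ expon(1/2): rate 1/2, i.e. mean 2 (the distribution of R^2 above)
H = rng.exponential(scale=2.0, size=n_pairs)
# Theta ~ uniform on (0, 2*pi), independent of H
Theta = rng.uniform(0.0, 2 * np.pi, size=n_pairs)

# Each (H_i, Theta_i) pair yields two independent N(0,1) variables
X = np.sqrt(H) * np.cos(Theta)
Y = np.sqrt(H) * np.sin(Theta)
Z = np.concatenate([X, Y])

print(Z.mean(), Z.var())  # both should be near 0 and 1, respectively
```

The sample mean and variance, as well as the empirical correlation between the two coordinates, should be close to their theoretical values 0, 1 and 0.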

PROBLEM 18.8 (normal or not?). Let X1, X2, . . . , X2n be an even number of i.i.d. random
variables with common law Q. We plot the data points (X2i−1, X2i), i = 1, . . . , n, in two cases:
• Q = Cauchy(0, 1) (which has density (1/π)/(1 + x²)).
• Q = N(0, 1).
Here are the two plots (done for n = 5000):

Figure 18.2: normal or Cauchy?

(1) Which is which? (2) Any observations?


Answer. (1) The one on the left corresponds to N(0, 1) because N(0, 1) × N(0, 1) has rotational
symmetry. On the other hand, Cauchy(0, 1) × Cauchy(0, 1) does not have rotational symmetry.
(2) It is unlikely that the coordinates of the data points on the left exceed the value 3 (indeed,
none was observed). However, many points on the right have very high values, including one
that has value 1500. Another remark is that if one coordinate on the right plot has a high value,
the other has a small value (despite independence): it is unlikely that both coordinates become
simultaneously big. □
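The qualitative difference between the two clouds of points can be reproduced numerically; a minimal sketch (assuming NumPy), comparing the largest absolute coordinate in each sample:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

normal_pts = rng.standard_normal((n, 2))   # pairs of N(0,1) coordinates
cauchy_pts = rng.standard_cauchy((n, 2))   # pairs of Cauchy(0,1) coordinates

# Normal coordinates essentially never exceed 5 or so in 10^4 draws,
# while Cauchy samples routinely produce enormous outliers.
print(np.abs(normal_pts).max(), np.abs(cauchy_pts).max())
```

The heavy 1/x² tail of the Cauchy density makes huge coordinates not just possible but routine, which is exactly what the right-hand plot shows.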

18.8 Conditional distributions

Suppose that (X1 , . . . , Xm , Y1 , . . . , Yd ) is a normal random vector in Rm+d . How do we compute

P((X1 , . . . , Xm ) ∈ B|Y1 , . . . , Yd ) ? (18.12)



You may only be interested in the case m = d = 1. But this case is not much simpler than
the general one. To understand it we need the notion of projection, which translates into the
concept of conditional expectation. This will be developed in the first few sections of
Chapter 19, and then, in Section 19.7, we will see how to compute the conditional probability
above.
But we don’t develop the concepts of projection and conditional expectation only because
we wish to perform the computation above. It turns out that these concepts are central to the
whole subject itself! This was first realized by Andrey Nikolaevich Kolmogorov, less than
100 years ago,¹ and one could argue that it was then that the very foundations of probability
were laid. The vital formula of probability, formula (19.9), will appear in Chapter 19, and it
is something that we use on a daily basis when we deal with the topics specified by the
syllabus of this module, in order to solve applied problems of great significance to society.
We will see an answer to (18.12) in Section 19.7.4 when m = 1 and in Section 19.7.5 when m
is any positive integer.

¹ The small monograph containing ideas around conditional expectation was published as Grundbegriffe der
Wahrscheinlichkeitsrechnung, Julius Springer, Berlin, in 1933, and later, in 1956, translated as Foundations of the
Theory of Probability, Chelsea, New York.
Chapter 19

Conditionally

Every bit of information counts and it’s up to us to


determine how to take advantage of it.
– The applied probabilist’s manifesto [so to speak]

19.1 Motivation
How do I model a sequence of tosses of a coin for which I have absolutely no idea about the
probability of heads? Suppose I toss such a coin 100 times and observe the number of heads.
Do I still have no idea about the probability of heads?
A bus arrives at random during a period of random duration. The duration is an exponential
random variable with a random rate. The rate is a uniform random variable that is uniformly
distributed on a random interval of random length. What does this mean? If we observe
arrivals of buses, what can we say about the random length in the last sentence?
We model randomness by using a probability measure. A probability measure is, so far,
a non-random object. Can I model it as a random object, in which case I can talk about the
probability distribution of this probability measure? And what if this probability distribution
is still unknown? Can I talk about the probability law of the probability distribution of the
probability measure?
Laplace said that if we toss a fair coin twice then the probability of getting 2 heads is 1/3.
Under which model was he right?
If we have a density for a random vector (X, Y), we defined the conditional density of
X given Y in a kind of ad hoc way. Is this compatible with the definition of conditional
probability?
Can we ever define conditional probability given an event of probability zero?
What do we mean by the sentence “conditionally on the value x of a random variable X” if
X has density and so the probability of the event {X = x} is zero?
What exactly does likelihood ratio (a term used by statisticians) mean?


If we pick a point at random in a cube (so its law is proportional to the volume function),
can we condition that it lie on a sphere inside the cube? Does this mean that its law will
be proportional to the sphere area function and thus we have discovered how to define the
second function through the first?
If we have an infinite-dimensional vector X = (X1 , X2 , . . .) of i.i.d. normal random variables
can we talk about the density of X?
If two densities differ at a set of length zero and thus define the same probability measure,
does the value of a density at a given point have any meaning?

19.2 Euclidean projections, platonically


Consider a plane Π in the Euclidean space, a point A outside it, and pose the problem

(P) Which points, if any, on Π, are closest to A?

Euclidean geometry is a mathematical system consisting of points, lines and planes, together
with various properties such that every two points X, Y, lie on a unique line. The part of the
line between these points is a segment, denoted by XY, that has a certain length |XY|.
Problem (P) above asks to find points B on Π such that |AB| ≤ |AΓ| for any point Γ on Π.
Euclid teaches us that there is a point B on Π such that AB is perpendicular to Π. This
means that if Γ is any other point on Π then AB is perpendicular to BΓ and so the triangle ABΓ
has a right angle at B.
By the Pythagorean theorem, which states that |AB|² = |AΓ|² − |BΓ|², we have |AB|² ≤ |AΓ|²,
so |AB| ≤ |AΓ|, and so

If we drop a perpendicular line to Π from A then the point B where this line meets Π is
such that |AB| ≤ |AΓ| for all points Γ on Π.

Since we cannot have two perpendiculars from A to Π we have also showed that

There can be no two points on Π that are closest to A.

We have thus solved (P) and showed that the solution satisfies

(P1) B∈Π

(P2) AB ⊥ Π

In fact, we have shown that


(P) ⇐⇒ (P1) + (P2)
The symbol ⊥ means “is perpendicular to”. The point B defined uniquely by (P1)+(P2) is
called the projection of A onto the plane Π. See Figure 19.1.¹ Let us use the symbol

B = E(A|Π),

¹ Figures 19.1 and 19.2 are taken from my textbook from when I was in secondary school: ΣΠ.Γ. KANEΛΛOΣ,
EYKΛEI∆EIOΣ ΓEΩMETPIA, OE∆B, Athens, 1977.

Figure 19.1: The point B solves the problem (P), it is uniquely defined by (P1)+(P2), it is called
the projection of the point A onto the plane Π, and it is denoted by B = E(A|Π).

since E is the first letter of the Greek word επι² which means onto, as in “project point A onto
(= επι) the plane Π”.
Just as we can define the projection of a point to any plane, we can also define the projection
E(A|γ) of a point A to any line γ.

PROBLEM 19.1 (the three perpendiculars property). Explain why if γ is a line in the plane
Π then
E(E(A|Π)|γ) = E(A|γ). (19.1)
See Figure 19.2.
Answer. Let B = E(A|Π) and Γ = E(B|γ). We will show that Γ = E(A|γ). Pick points ∆ and E on
either side of Γ so that |∆Γ| = |EΓ|. Then |B∆| = |BE| by the Pythagorean theorem applied to
triangles BΓ∆ and BΓE. Applying the Pythagorean theorem once more to triangles AB∆ and
ABE, we find |A∆| = |AE|. Therefore the triangles AΓ∆ and AΓE are equal, in the sense that all
sides and angles are the same. Hence the angles ∠AΓ∆ and ∠AΓE are the same. Since their
sum is π (180° in sexagesimal Sumerian units), each one must be equal to π/2 (90°). In other
words AΓ ⊥ γ. Since, also, Γ ∈ γ (by definition), (P1)+(P2) hold and so Γ = E(A|γ). 

Figure 19.2: If B = E(A|Π) and Γ = E(B|γ) then Γ = E(A|γ).

The identity (19.1) is called the three perpendiculars identity because you can clearly see 3
right angles in Figure 19.2.
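The three perpendiculars identity (19.1) can also be checked numerically with orthogonal projections in R³; a minimal sketch assuming NumPy, with a plane and a line through the origin (the specific vectors are arbitrary choices):

```python
import numpy as np

def proj(a, basis):
    """Orthogonal projection of a onto span(basis), via least squares."""
    B = np.column_stack(basis)
    coeffs, *_ = np.linalg.lstsq(B, a, rcond=None)
    return B @ coeffs

a = np.array([1.0, 2.0, 3.0])                            # the point A
plane = [np.array([1.0, 0, 0]), np.array([1.0, 1, 0])]   # spans the plane Pi
line = [np.array([2.0, 1, 0])]                           # a line gamma inside Pi

b = proj(a, plane)          # B = E(A|Pi)
via_plane = proj(b, line)   # E(E(A|Pi)|gamma)
direct = proj(a, line)      # E(A|gamma)
print(np.allclose(via_plane, direct))  # → True
```

Projecting first onto the plane and then onto the line inside it gives the same point as projecting directly onto the line, which is exactly (19.1).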
If γ1 , γ2 are two lines on a common plane, define

E(A|γ1 , γ2 ) := E(A|plane defined by γ1 , γ2 ).


² as in epigram, epilepsy, epitaxy, epithet, epigraph, etc.; there are 786 words in English that start with “epi”

PROBLEM 19.2 (when do projections add?). Let γ1, γ2 be two distinct lines that meet at a
point O. Given a point A in space, consider its projections

B = E(A|γ1 , γ2 ), B1 = E(A|γ1 ), B2 = E(A|γ2 ),

on the plane defined by γ1 , γ2 , on γ1 and on γ2 , respectively and explain why


γ1 ⊥ γ2 ⇐⇒ →OB = →OB1 + →OB2 (where →PQ denotes the vector from P to Q).

(This means that the quadrilateral with vertices O, B1 , B2 , B is a parallelogram.)


Answer. By the three perpendiculars property,

Bi = E(A|γi ) = E(E(A|γ1 , γ2 )|γi ) = E(B|γi ), i = 1, 2.

But then BB1 is perpendicular to γ1 and BB2 is perpendicular to γ2 . So

∠OB1 B = ∠OB2 B = π/2.

– Assume first that γ1 ⊥ γ2 . Then ∠B1 OB2 = π/2. So the quadrilateral OB1 B2 B has 3 right
angles. Since the angles of any planar quadrilateral add up to 2π, it follows that the fourth
angle ∠B1 BB2 = π/2. Hence the quadrilateral is a rectangle and hence a parallelogram.
– Conversely, assume that OB1 B2 B is a parallelogram. Then ϕ := ∠B1 OB2 = ∠B1 BB2 (opposite
angles are equal in a parallelogram). Call this angle ϕ. Hence the 4 angles of the parallelogram
are π/2, π/2, ϕ, ϕ. They must add up to 2π. Hence ϕ = π/2, so γ1 ⊥ γ2 . 
Remark 19.1. Everything we said here works in higher-dimensional Euclidean space V if we
replace Π by any lower-dimensional space included in V. The identity (19.1) also holds if we
replace γ by any lower-dimensional space included in Π.

19.3 Euclidean projections, linearly


When we describe space using Cartesian coordinates, the system introduced by Descartes, we
describe space by R3, the set of all triples x = (x1, x2, x3) of real numbers. To make things a bit
more general, we consider instead the set V = Rd, the set of all d-tuples x = (x1, . . . , xd) of real
numbers. We have been referring to these as vectors, thinking of x as a vector with endpoints
0 = (0, . . . , 0) and x, oriented from 0 to x.
If we look at the geometrical arguments of Section 19.2 we realize that the only element
needed there was a way to measure angles and lengths. These are both achieved via an inner
product function, ⟨x, y⟩:
Definition 19.1 (inner product, norm, distance, angle, perpendicularity). A function ⟨x, y⟩,
x, y ∈ V, is called an inner product if it is symmetric and bilinear, that is, linear in each argument
when the other is kept fixed, and strictly positive definite, that is, ⟨x, x⟩ is 0 when x = 0 (the
origin) and strictly positive for all other x ∈ V. The distance d(x, y) between x and y is defined
by d(x, y) := √⟨x − y, x − y⟩. The norm ‖x‖ of x is defined as the distance of x from 0, that is,
‖x‖ := √⟨x, x⟩. The angle θ(x, y) between x and y is defined by cos(θ(x, y)) = ⟨x, y⟩/(‖x‖ ‖y‖),
modulo 2π. We say that x, y are perpendicular or orthogonal if |θ(x, y)| = π/2, that is,
⟨x, y⟩ = 0.

We can easily show that

|⟨x, y⟩| ≤ ‖x‖ ‖y‖,

which justifies considering the ratio of the left-hand side to the right-hand side as the cosine
of an angle, and that

‖x + y‖ ≤ ‖x‖ + ‖y‖,

which is called the triangle inequality, because it is equivalent to the inequality d(x, y) = ‖x − y‖ =
‖(x − z) + (z − y)‖ ≤ ‖x − z‖ + ‖z − y‖ = d(x, z) + d(z, y).

PROBLEM 19.3 (standard inner product on Rd). Explain why ⟨x, y⟩ := Σ_{i=1}^d xi yi is an inner
product.
Answer. It is clearly bilinear and symmetric. Moreover, ⟨x, x⟩ = Σ_{i=1}^d xi² is always nonnegative
and if it is zero then all the xi are zero. □
PROBLEM 19.4 (inner product on Rd). Let R be a d × d covariance matrix such that det(R) ≠ 0.
Define ⟨x, y⟩ = xᵀRy. Explain why this is an inner product.
Answer. Just look at the properties of a covariance matrix: R is symmetric, so ⟨x, y⟩ is symmetric
and bilinear, and positive semidefiniteness implies xᵀRx ≥ 0 for all x. If det(R) ≠ 0 then, letting
S be its square root, that is, R = SSᵀ, we have det R = (det S)², so det S ≠ 0. But then
xᵀRx = xᵀSSᵀx = (Sᵀx)ᵀ(Sᵀx) = ‖Sᵀx‖², and if this is 0 we get Sᵀx = 0. Since det S ≠ 0, we
immediately have x = 0. So xᵀRy is strictly positive definite. □
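A quick numerical illustration of this argument (a sketch assuming NumPy; the matrix R below is an arbitrary positive definite choice, and S is its Cholesky square root):

```python
import numpy as np

# An arbitrary symmetric positive definite ("covariance") matrix R
R = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.3],
              [0.0, 0.3, 1.5]])
S = np.linalg.cholesky(R)   # a square root: R = S S^T

rng = np.random.default_rng(2)
for _ in range(100):
    x = rng.standard_normal(3)
    # x^T R x is the squared norm of S^T x, hence positive for x != 0
    assert np.isclose(x @ R @ x, np.linalg.norm(S.T @ x) ** 2)
    assert x @ R @ x > 0
```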
We now wish to state and solve Problem (P). To do this, observe that a plane Π in R3
is described by one equation of the form a1x1 + a2x2 + a3x3 = c, and a line by two equations
of this form: a11x1 + a12x2 + a13x3 = c1, a21x1 + a22x2 + a23x3 = c2. When we are in Rd we can
have k equations of such a form, in which case they describe a flat (d − k)-dimensional set.
By setting all coefficients cj on the right-hand side of these equations equal
to 0 (and this is no loss of generality) we make sure that the origin is in this flat set. Such an
object is a linear subspace of Rd. Hence we need to replace Π by a linear subspace of Rd.
So problem (P) has a version stated as follows.

(PL) Let U be a linear subspace of V = Rd , x ∈ V. Which points, if any, on U, are closest to x?

Just as before, we show that the solution is unique and is characterized by conditions
analogous to (P1)+(P2). To do this, let

∆ := inf_{u∈U} ‖x − u‖.

The infimum of a set is its greatest lower bound. So we can find points in the set that are
as close as we like to ∆. So, given any n ∈ N there is un ∈ U such that ‖x − un‖ ≤ ∆ + 1/n. We
apply the parallelogram identity to x − un and x − um:

‖un − um‖² + ‖2x − (un + um)‖² = 2‖x − un‖² + 2‖x − um‖².

Since ½(un + um) ∈ U we have ‖x − ½(un + um)‖ ≥ ∆ (because ∆ is a lower bound). Hence
‖2x − (un + um)‖ ≥ 2∆. If we replace the second term in the last display by (2∆)² we get a
smaller quantity. On the other hand, the right-hand side is at most 2(∆ + 1/n)² + 2(∆ + 1/m)².
Hence

‖un − um‖² + (2∆)² ≤ 2(∆ + 1/n)² + 2(∆ + 1/m)².

Simplifying,

‖un − um‖² ≤ (4∆ + 1)(1/n + 1/m).

This implies that lim_{m,n→∞} ‖un − um‖ = 0. Since un belongs to U, which has finite dimension, it
follows that

there is u ∈ U such that lim_{n→∞} ‖un − u‖ = 0. (19.2)

But then

∆ ≤ ‖x − u‖ ≤ ‖x − un‖ + ‖un − u‖ ≤ ∆ + 1/n + ‖un − u‖.

Since the right-hand side converges to ∆ as n → ∞, we have

‖x − u‖ = ∆.

We have thus solved (PL). But we do not know if the solution is unique. So suppose
‖x − u′‖ = ∆ = ‖x − u″‖, for u′, u″ ∈ U. Again by the parallelogram identity,

‖u′ − u″‖² + ‖2x − (u′ + u″)‖² = 2‖x − u′‖² + 2‖x − u″‖² = 4∆².

But ‖2x − (u′ + u″)‖² ≥ 4∆², so

‖u′ − u″‖² ≤ 4∆² − 4∆² = 0, and hence u′ = u″.

Hence (PL) has a unique solution u, which we call the projection of x onto U and write as

u = E(x|U).

Since ‖x − u‖ = ∆, we take any v ∈ U and any t ∈ R, notice that u + tv ∈ U (because U is a
linear space), and so

‖x − u‖² ≤ ‖x − u − tv‖², for all v ∈ U and t ∈ R.

But

‖x − u − tv‖² = ‖x − u‖² − 2t⟨x − u, v⟩ + t²‖v‖².

Canceling the term ‖x − u‖² we obtain

t⟨x − u, v⟩ ≤ ½ t²‖v‖², for all t ∈ R.

Suppose t > 0. Then, dividing both sides by t, we have

⟨x − u, v⟩ ≤ ½ t‖v‖², for all t > 0.

Suppose next t < 0. Write t = −s, s > 0. Then, dividing both sides by s,

−⟨x − u, v⟩ ≤ ½ s‖v‖², for all s > 0.

Putting the inequalities together, we have −½ s‖v‖² ≤ ⟨x − u, v⟩ ≤ ½ t‖v‖² for all t > 0, s > 0.
Letting s and t tend to 0, we conclude

⟨x − u, v⟩ = 0, for all v ∈ U,

something that we write as x − u ⊥ U.


We have thus shown that Problem (PL) is solved by a unique u ∈ U such that

(PL1) u∈U

(PL2) x−u⊥U

In fact, it is really easy to see that

(PL) ⇐⇒ (PL1) + (PL2).

PROBLEM 19.5 (linearity of projection). Explain why, for any x1, x2 ∈ V and c1, c2 ∈ R,

E(c1x1 + c2x2|U) = c1E(x1|U) + c2E(x2|U).

Answer. We have xi − E(xi|U) ⊥ U, i = 1, 2. This means that ⟨xi − E(xi|U), u⟩ = 0, for all u ∈ U,
i = 1, 2. Hence ⟨c1(x1 − E(x1|U)) + c2(x2 − E(x2|U)), u⟩ = 0, for all u ∈ U, which means that
c1x1 + c2x2 − (c1E(x1|U) + c2E(x2|U)) ⊥ U. Thus condition (PL2) holds with x replaced by
c1x1 + c2x2 and u replaced by c1E(x1|U) + c2E(x2|U). Obviously, (PL1) holds as well. □

?PROBLEM 19.6 (projections of projections). Let V = Rd , U a linear subspace of V and W


a linear subspace of U. So
W ⊂ U ⊂ V.
Explain why, for all x ∈ V,
E(E(x|U)|W) = E(x|W).

Answer. In order for E(x|W) to be the projection of E(x|U) onto W we need (PL1)+(PL2) to
hold, that is, E(x|W) ∈ W (this is obviously true) and E(x|U) − E(x|W) ⊥ W. It is this condition
that we need to verify. We have x − E(x|U) ⊥ U and since W ⊂ U we have x − E(x|U) ⊥ W. On
the other hand, x − E(x|W) ⊥ W. Hence the difference (x − E(x|U)) − (x − E(x|W)) is also ⊥ W.
But this difference is E(x|U) − E(x|W). So we’re done. □

?PROBLEM 19.7 (computation of a projection). Consider 3 linearly independent vectors


u1 , u2 , u3 in Rd , let U be the set defined by

U = {a1 u1 + a2 u2 + a3 u3 : a1 , a2 , a3 ∈ R}

and do the following:


(1) write down the equations needed in order to compute the projection E(x|U) of a given x
onto U;
(2) explain what happens when the u1 , u2 , u3 are pairwise orthogonal?
(3) if U were defined by k linearly independent vectors instead of 3, what would you have
instead?

Answer. (1) If u denotes the projection of x then u must belong to U, so u = a1u1 + a2u2 + a3u3,
and x − u must be orthogonal to U, that is, ⟨x − u, uj⟩ = 0, j = 1, 2, 3. We therefore have three
equations with three unknowns, a1, a2, a3:

⟨u1, u1⟩a1 + ⟨u2, u1⟩a2 + ⟨u3, u1⟩a3 = ⟨x, u1⟩
⟨u1, u2⟩a1 + ⟨u2, u2⟩a2 + ⟨u3, u2⟩a3 = ⟨x, u2⟩
⟨u1, u3⟩a1 + ⟨u2, u3⟩a2 + ⟨u3, u3⟩a3 = ⟨x, u3⟩

(2) We would have ⟨ui, uj⟩ = 0 for all i ≠ j, and so the equations would be immediately
solvable:

a1 = ⟨x, u1⟩/⟨u1, u1⟩,  a2 = ⟨x, u2⟩/⟨u2, u2⟩,  a3 = ⟨x, u3⟩/⟨u3, u3⟩.

(3) We would write u = a1u1 + · · · + akuk and we would have k equations in k unknowns,
a1, . . . , ak. The equations would be linearly independent because, by assumption, u1, . . . , uk
are linearly independent, and so we would be able to solve uniquely for a1, . . . , ak. □
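Part (1) amounts to solving the Gram system ⟨ui, uj⟩; a minimal NumPy sketch with arbitrarily chosen vectors in R⁴:

```python
import numpy as np

# Three linearly independent vectors in R^4 (arbitrary choices)
u1 = np.array([1.0, 0.0, 0.0, 1.0])
u2 = np.array([0.0, 1.0, 0.0, 1.0])
u3 = np.array([0.0, 0.0, 1.0, 1.0])
U = np.column_stack([u1, u2, u3])
x = np.array([1.0, 2.0, 3.0, 4.0])

G = U.T @ U            # Gram matrix, entries <u_i, u_j>
rhs = U.T @ x          # right-hand sides <x, u_j>
a = np.linalg.solve(G, rhs)
proj = U @ a           # E(x|U) = a1*u1 + a2*u2 + a3*u3

# The residual x - E(x|U) is orthogonal to each u_j:
print(U.T @ (x - proj))  # → a vector of (numerical) zeros
```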
PROBLEM 19.8 (projection of a random variable onto two subspaces). Let Ω be a finite
sample space with d outcomes, e.g., Ω = {ω1, . . . , ωd}. Then every function Ω → R is a random
variable. The set of all random variables is RΩ, which can be thought of as Rd since Ω has d
elements. Now let P be a probability measure on Ω. Hence P is defined by

pj = P{ωj}, j = 1, . . . , d.

Of course, by (AXIOM ONE)+(AXIOM TWO), p1 + · · · + pd = 1. But we also assume that
pj > 0 for all j. Define the inner product between two random variables X, Y by

⟨X, Y⟩ := Σ_{j=1}^d pj X(ωj)Y(ωj) = E(XY).

(If we had not assumed that pj > 0 for all j then ⟨X, Y⟩ would fail to be strictly positive definite.)
Let X, Y be two random variables. Compute E(X|U) in the following two cases:
(1) U = L(Y), the space of all linear functions of Y. Assume that P(Y ≠ 0) > 0.
(2) U = F(Y), the space of all functions of Y.
Answer. (1) The only linear function of Y is of the form aY for some real number a. So we write

E(X|L(Y)) = aY,

and we seek to find a. The problem is solved by (PL1)+(PL2). Obviously, aY ∈ L(Y), so (PL1)
holds. The second requirement, (PL2), is written in any of the equivalent ways below:

X − aY ⊥ Y,  ⟨X − aY, Y⟩ = 0,  E[(X − aY)Y] = 0,  aE(Y²) = E(XY).

Since P(Y ≠ 0) > 0 we have E(Y²) > 0, and so we can divide and get a = E(XY)/E(Y²). The
answer therefore is

E(X|L(Y)) = (E(XY)/E(Y²)) Y.
(2) (PL1) says that
E(X|F (Y)) = h(Y),

for some function h : Y(Ω) → R. This, together with (PL2), gives

X − h(Y) ⊥ g(Y) for all g : Y(Ω) → R.

We can write X − h(Y) ⊥ g(Y) in one of the following equivalent ways:

⟨X − h(Y), g(Y)⟩ = 0,  E[(X − h(Y))g(Y)] = 0,  E[h(Y)g(Y)] = E[Xg(Y)].

Writing X = Σ_x x 1_{X=x} we have

E[h(Y)g(Y)] = E[Xg(Y)] = Σ_x x E[1_{X=x} g(Y)].

This is true for all g(Y). We make the following choices:

g(Y) = 1_{Y=y}, y ∈ Y(Ω);

we have one function for each value of Y. We obtain

E[h(Y)1_{Y=y}] = Σ_x x E[1_{X=x} 1_{Y=y}].

But E[h(Y)1_{Y=y}] = h(y)E[1_{Y=y}] = h(y)P(Y = y), and E[1_{X=x} 1_{Y=y}] = P(X = x, Y = y). Therefore,

h(y)P(Y = y) = Σ_x x P(X = x, Y = y), y ∈ Y(Ω).

Since P assigns positive probability to each ω ∈ Ω, it follows that P(Y = y) > 0 for all y ∈ Y(Ω),
and so we can divide and get

h(y) = Σ_x x P(X = x, Y = y) / P(Y = y). (19.3)

The answer to (2) therefore is

E(X|Y) = h(Y), where h is the function defined by (19.3). (19.4)
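Formula (19.3) and the projection property it came from can be checked on a small finite Ω; a minimal sketch with hypothetical numbers:

```python
import numpy as np

# A 4-point sample space with P{omega_j} = p[j] > 0 (hypothetical numbers)
p = np.array([0.1, 0.2, 0.3, 0.4])
X = np.array([1.0, 2.0, 3.0, 4.0])   # X(omega_j)
Y = np.array([0.0, 1.0, 0.0, 1.0])   # Y(omega_j)

def h(y):
    """h(y) = E[X 1_{Y=y}] / P(Y=y), as in (19.3)."""
    mask = (Y == y)
    return np.sum(p[mask] * X[mask]) / np.sum(p[mask])

EXgivenY = np.array([h(y) for y in Y])   # the random variable h(Y) = E(X|F(Y))

# Projection property: X - h(Y) is orthogonal to every g(Y)
for g in (lambda y: y, lambda y: y**2 + 1.0, lambda y: np.cos(y)):
    assert abs(np.sum(p * (X - EXgivenY) * g(Y))) < 1e-12
```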

PROBLEM 19.9 (projection in Bernoulli trials). Let X1, . . . , Xn be n i.i.d. Ber(p) random
variables (toss a coin with probability p of heads independently n times). Assume 0 < p < 1.
Let Sn = X1 + · · · + Xn. Compute E(X1|F(Sn)) in two ways:
(1) By reducing to Problem 19.8.
(2) By using the fact that Sn is a symmetric function of (X1, . . . , Xn).
Notice that the answer does not depend on p.
Answer. (1) We have

E(X1|F(Sn)) = h(Sn),

where

h(s) = Σ_{x∈{0,1}} x P(X1 = x, Sn = s)/P(Sn = s) = P(X1 = 1, Sn = s)/P(Sn = s) = p P(X2 + · · · + Xn = s − 1)/P(Sn = s).

But Sn is bin(n, p), while X2 + · · · + Xn is bin(n − 1, p). Hence

h(s) = p (n−1 choose s−1) p^{s−1} (1 − p)^{n−s} / [ (n choose s) p^s (1 − p)^{n−s} ] = (n−1 choose s−1)/(n choose s) = s/n.

Hence the answer is

E(X1|F(Sn)) = Sn/n.

(2) We obviously have

E(Sn|F(Sn)) = Sn.

But Sn = X1 + · · · + Xn. By linearity,

E(Sn|F(Sn)) = E(X1|F(Sn)) + · · · + E(Xn|F(Sn)).

By symmetry,

E(X1|F(Sn)) = · · · = E(Xn|F(Sn)).

Combining the last three displays yields

Sn = E(Sn|F(Sn)) = n E(X1|F(Sn)),

whence we obtain the same answer. □
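The conclusion E(X1|F(Sn)) = Sn/n can be verified by exact enumeration of all 2ⁿ outcomes; a minimal sketch (n and p are arbitrary choices):

```python
from itertools import product

n, p = 6, 0.3   # arbitrary choices
num = {}        # num[s] accumulates E[X1 * 1_{Sn=s}]
den = {}        # den[s] accumulates P(Sn = s)
for omega in product([0, 1], repeat=n):
    s = sum(omega)
    prob = p**s * (1 - p)**(n - s)
    num[s] = num.get(s, 0.0) + omega[0] * prob
    den[s] = den.get(s, 0.0) + prob

for s in range(n + 1):
    # E(X1 | Sn = s) = E[X1 1_{Sn=s}] / P(Sn = s) should equal s/n
    assert abs(num[s] / den[s] - s / n) < 1e-12
print("E(X1 | Sn = s) = s/n for every s, independently of p")
```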

19.4 Conditional expectation


Consider now an arbitrary sample space Ω and a probability measure P defined on some set
of events E . We define

V := {all random variables X on (Ω, E , P)}

We wish to work with all random variables that have finite variance. We call this space

V2 := {all random variables X on (Ω, E, P) with E_P(X²) < ∞}.

Since the sum of two random variables with finite variance has finite variance, V2 is a linear
space. Consider next a collection of random variables

Y = (Y1 , Y2 , . . .)

which could be finite or countable or even uncountable–we don’t care, and let
F (Y) := {all random variables that can be written as functions of random variables from Y}
F2 (Y) := {all random variables that can be written as functions of random variables from Y
and have finite variance}
For example, Y1 ∈ F2(Y), Y1² cos(Y2 + Y3) ∈ F2(Y), lim_{n→∞} Yn ∈ F2(Y), etc. (provided these
have finite variance). Since the sum of two random variables that are functions of members of Y
is also a function of members of Y, the set F2(Y) is a linear space. In fact,

F2 (Y) ⊂ V2 .

We define a distance on V2 by first defining an inner product. But this we have already done
in Section 16.5, Equation (16.2). We repeat it here:

⟨X, Y⟩ := E(XY).

Another name for this inner product is correlation. Again, see Section 16.5. Through this
inner product we define a distance function:

d(X, Y) := √⟨X − Y, X − Y⟩ = √(E(X − Y)²).

In particular,

‖X‖ = d(X, 0) = √(EX²).

This quantity has three names: (a) norm, (b) distance from the origin, (c) square root of the
second moment.
We consider the problem

(PV) Which random variables, if any, from F2(Y) are closest to a given random variable X ∈ V2?

In other words, we wish to solve the problem

∆ := inf_{Z∈F2(Y)} d(X, Z).

It would be no surprise then if I tell you that Problem (PV) has a unique solution, denoted
by E(X|F(Y)), or, simply, by E(X|Y), and which satisfies the following:

(PV1) E(X|Y) ∈ F2(Y)

(PV2) X − E(X|Y) ⊥ F2(Y)

In fact,

(PV) ⇐⇒ (PV1) + (PV2).

Remark 19.2. I am not going to prove this, but I ask you to accept that the explanation is very
similar to the one in Section 19.3. If V is finite-dimensional (which is the case when Ω is a
finite sample space), then we reduce (PV) to (PL).
If V is not finite-dimensional then the only place one has to be careful at is (19.2). This is
actually true (and comes under the name “completeness of the L2 space”).
?PROBLEM 19.10 (projection when densities exist). Let (X, Y) be a random vector in R2
such that both X and Y have finite variance (they are thus elements of V2). Assume also
that (X, Y) has a probability density function f(x, y). Consider F(Y), the collection of all
functions of Y with finite variance. Let f2(y) be the density of Y. Define the function (see eq.
(14.6))

f1|2(x|y) := f(x, y) / f2(y),

interpreting 0/0 as 0 if it occurs. Explain why

E(X|Y) := E(X|F(Y)) = ∫ x f1|2(x|Y) dx.

Answer. We have to show that

X − ∫ x f1|2(x|Y) dx ⊥ g(Y), for any g(Y) ∈ F(Y).

Equivalently,

⟨X, g(Y)⟩ = ⟨∫ x f1|2(x|Y) dx, g(Y)⟩, for any g(Y) ∈ F(Y).

But the right-hand side is

E[ g(Y) ∫ x f1|2(x|Y) dx ] = ∫ g(y) ( ∫ x f1|2(x|y) dx ) f2(y) dy = ∬ x g(y) f(x, y) dx dy = E[Xg(Y)]. □
We are now ready to define conditional expectation.
Definition 19.2 (conditional expectation under finite variance). If X, Y1, Y2, . . . are random
variables with finite variance then define the conditional expectation of X given Y1, Y2, . . . by

E(X|Y1, Y2, . . .) := E(X | F(Y1, Y2, . . .)).

The condition of finite variance is too restrictive. But we can generalize. A simple
approximation theorem ensures that the conditional expectation can be defined for random
variables X that may have infinite variance, so long as E|X| < ∞.

Theorem 19.1 (existence of conditional expectation). Let X, Y1, Y2, . . . be random variables
such that E|X| < ∞. Denote by Y = (Y1, Y2, . . .), and let F(Y) be the collection of random
variables that are functions of Y. Then there is a random variable

X̂ = E(X|Y) ≡ E(X|Y1, Y2, . . .)

such that
(1) X̂ ∈ F(Y) and E|X̂| < ∞;
(2) the projection property holds:

E[XZ] = E[X̂Z] for any bounded random variable Z ∈ F(Y). (19.5)

Moreover, X̂ is unique in the sense that if there is a second one, say X̃, then
P(X̂ = X̃) = 1.

PROBLEM 19.11 (conditional expectation with respect to a discrete random variable). Let
X be a random variable with E|X| < ∞ and let Y be a discrete random variable. Explain why

E(X|Y) = Σ_y ( E(X 1_{Y=y}) / P(Y = y) ) 1_{Y=y}. (19.6)

(If Ω is finite then this formula is the same as E(X|Y) computed in (19.4)+(19.3).)

Answer. We simply have to verify that the right-hand side of (19.6) satisfies conditions (1)+(2)
of Theorem 19.1. It is clear it satisfies (1). To check (2) we need to check that

E[Xg(Y)] = E[ g(Y) Σ_y ( E(X 1_{Y=y}) / P(Y = y) ) 1_{Y=y} ],

where g(Y) ∈ F(Y) is a function of Y that is also bounded. We start with the right-hand side
and verify, step-by-step, that it equals the left-hand side:

E[ g(Y) Σ_y ( E(X 1_{Y=y}) / P(Y = y) ) 1_{Y=y} ] = Σ_y ( E(X 1_{Y=y}) / P(Y = y) ) E[g(Y) 1_{Y=y}] = Σ_y ( E(X 1_{Y=y}) / P(Y = y) ) g(y) E[1_{Y=y}]
= Σ_y ( E(X 1_{Y=y}) / P(Y = y) ) g(y) P(Y = y) = Σ_y E(X 1_{Y=y}) g(y) = E[ X Σ_y 1_{Y=y} g(y) ] = E[Xg(Y)].

?PROBLEM 19.12 (conditional expectation when densities exist). Let (X, Y1, . . . , Yk) be a
random vector such that E|X| < ∞. Assume that it has density f(x, y1, . . . , yk). Denote by
f2(y1, . . . , yk) the density of (Y1, . . . , Yk) and let

f1|2(x|y1, . . . , yk) := f(x, y1, . . . , yk) / f2(y1, . . . , yk).

Explain why

E(X|Y1, . . . , Yk) = ∫_{−∞}^{∞} x f1|2(x|Y1, . . . , Yk) dx.
Answer. Since (1) of Theorem 19.1 is obvious, we only need to check (2):

E[Xg(Y1, . . . , Yk)] = E[ g(Y1, . . . , Yk) ∫_{−∞}^{∞} x f1|2(x|Y1, . . . , Yk) dx ].

We have that the right-hand side equals

∫_{Rk} g(y1, . . . , yk) ( ∫_{−∞}^{∞} x f1|2(x|y1, . . . , yk) dx ) f2(y1, . . . , yk) dy1 · · · dyk
= ∫ x g(y1, . . . , yk) f(x, y1, . . . , yk) dx dy1 · · · dyk = E[Xg(Y1, . . . , Yk)]. □

PROBLEM 19.13 (an arbitrary example). Let (X, Y) have density

f(x, y) = ( c / (x²(x + y)²) ) 1_{x≥1,y≥1},

for some c > 0. Compute E(X|Y).



Answer. We have

f2(y) = ∫_R f(x, y) dx = c [ (y + 2)/(y²(y + 1)) − 2 log(y + 1)/y³ ] 1_{y≥1}.

We thus have

f1|2(x|y) = f(x, y)/f2(y) = y³(y + 1) / ( [y² + 2y − 2(y + 1) log(y + 1)] x²(x + y)² ) · 1_{x≥1,y≥1}.

And so³

E(X|Y) = ∫ x f1|2(x|Y) dx = ( Y[(Y + 1) log(Y + 1) − Y] / [Y² + 2Y − 2(Y + 1) log(Y + 1)] ) 1_{Y≥1}. □
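These closed forms can be sanity-checked against direct numerical integration at one (arbitrary) value y = 2; a sketch assuming SciPy is available (the constant c cancels in f1|2 and in E(X|Y), so it is omitted):

```python
import numpy as np
from scipy.integrate import quad

y = 2.0   # check the formulas at one (arbitrary) value of y; c cancels everywhere

f2_formula = (y + 2) / (y**2 * (y + 1)) - 2 * np.log(y + 1) / y**3
f2_numeric, _ = quad(lambda x: 1 / (x**2 * (x + y)**2), 1, np.inf)
assert abs(f2_formula - f2_numeric) < 1e-10

D = y**2 + 2 * y - 2 * (y + 1) * np.log(y + 1)
cond_exp_formula = y * ((y + 1) * np.log(y + 1) - y) / D
xf_numeric, _ = quad(lambda x: x / (x**2 * (x + y)**2), 1, np.inf)
assert abs(cond_exp_formula - xf_numeric / f2_numeric) < 1e-10
```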
We now continue with a number of important properties of the conditional expectation.
These are so important for everything in probability and statistics, even for understanding
your favorite distributions, that I’m going to devote a separate page for them. So please flip
over and pay attention.

³ To be honest, I didn’t do the computations myself. I hate computations and I always make mistakes anyway.
So I asked my slave to do them. My slave can do lots of procedural things, like complicated integrals, but he/she/it
cannot think. I, on the other hand, like most human beings, prefer to think and think how to think. My slave’s
name is Maple.

19.4.1 Properties of conditional expectation


Suppose that E|X| < ∞ and let Y be a collection of random variables. We let E(X|Y) be the
conditional expectation of X given Y.

(I) Linearity. For all a1 , a2 ∈ R we have

E(a1 X1 + a2 X2 |Y) = a1 E(X1 |Y) + a2 E(X2 |Y).

(II) There is a lot of freedom in changing the conditioning information.

E(X|Y) = E(X|Y′) if there is a bijection g such that Y′ = g(Y).

(III) Expectation of conditional expectation.

E[E(X|Y)] = E(X).

(IV) Conditional expectation with respect to independent r.v.s. If X, Y are independent


(meaning that X, together with any finite collection of random variables from Y are
independent) then
E(X|Y) = E(X)

(V) Conditional expectation under full knowledge. If X ∈ F (Y) then

E(X|Y) = X.

(VI) Factoring out known things. If X ∈ F (Y) then

E(XZ|Y) = XE(Z|Y).

(VII) Monotonicity of conditional expectation. If X1 ≤ X2 then

E(X1 |Y) ≤ E(X2 |Y).

(VIII) Tower property. If Z ⊂ Y then

E(E(X|Y)|Z) = E(X|Z).

?PROBLEM 19.14 (many properties of the conditional expectation). Explain all the properties above.
Answer. Simply verify (1)+(2) of Theorem 19.1.
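On a finite sample space these properties become checkable by direct computation; a minimal sketch with hypothetical numbers, verifying the tower property (VIII), with Z a function of Y, and Property (III):

```python
import numpy as np

p = np.full(8, 1 / 8)                      # uniform P on 8 outcomes
X = np.array([3.0, 1, 4, 1, 5, 9, 2, 6])   # X(omega), hypothetical values
Y = np.array([0, 0, 1, 1, 2, 2, 3, 3])     # Y(omega)
Z = Y % 2                                  # Z = g(Y), so F(Z) is contained in F(Y)

def cond_exp(X, W):
    """E(X|W) on a finite space: weighted average of X over each level set of W."""
    out = np.empty_like(X, dtype=float)
    for w in np.unique(W):
        m = (W == w)
        out[m] = np.sum(p[m] * X[m]) / np.sum(p[m])
    return out

lhs = cond_exp(cond_exp(X, Y), Z)   # E(E(X|Y)|Z)
rhs = cond_exp(X, Z)                # E(X|Z)
assert np.allclose(lhs, rhs)        # tower property (VIII)
assert np.isclose(np.sum(p * cond_exp(X, Y)), np.sum(p * X))  # Property (III)
```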

19.4.2 Conditional variance


Recall that E(X|Y) is a random variable (a function of Y). As such it has a variance,
denoted, as usual, by var E(X|Y). This should not be confused with the conditional variance
that we define next, which is a random variable.
The conditional variance of a random variable X given a collection Y of random variables
is defined by

var(X|Y) := E[ (X − E(X|Y))² | Y ].

If we expand the square we obtain a second expression for var(X|Y). We have

E[X² + E(X|Y)² − 2XE(X|Y) | Y] = E(X²|Y) + E[E(X|Y)²|Y] − 2E[XE(X|Y)|Y]
= E(X²|Y) + E(X|Y)² − 2E(X|Y)²
= E(X²|Y) − E(X|Y)².

The first equality is due to Property (I), linearity. Then we used Property (VI), factoring
out known things: since E(X|Y)² is a function of Y we have E[E(X|Y)²|Y] = E(X|Y)², and
E[XE(X|Y)|Y] = E(X|Y)E(X|Y) = E(X|Y)². So the alternative expression is

var(X|Y) = E(X²|Y) − E(X|Y)². (19.7)

?PROBLEM 19.15 (expectation of conditional variance and variance of conditional expec-


tation). Explain why
E[var E(X|Y)] + var(E(X|Y)) = var X.

Answer. Take expectations in (19.7):

E[var(X|Y)] = E[E(X²|Y)] − E[E(X|Y)²] = E(X²) − E[E(X|Y)²],

where we used Property (III) for the first term. Next, by the expression of the variance of a
random variable,

var(E(X|Y)) = E[E(X|Y)²] − E[E(X|Y)]² = E[E(X|Y)²] − (EX)²,

where we used Property (III) for the second term. Adding the last two displays together, the
term E[E(X|Y)²] cancels, whence the desired formula ensues. 
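The identity can be checked numerically. Below is a minimal simulation sketch; the joint model for (X, Y) is our own choice, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (our own choice, not from the text): Y is uniform on {0,1,2}
# and, given Y, X is normal with mean Y and variance 1 + Y.
n = 100_000
Y = rng.integers(0, 3, size=n)
X = Y + np.sqrt(1.0 + Y) * rng.standard_normal(n)

# Estimate E(X|Y) and var(X|Y) by grouping on the value of Y.
p = np.array([(Y == y).mean() for y in range(3)])
cond_mean = np.array([X[Y == y].mean() for y in range(3)])
cond_var = np.array([X[Y == y].var() for y in range(3)])

# E[var(X|Y)] + var(E(X|Y)) should equal var(X); for the empirical
# distribution the decomposition in fact holds exactly.
total = (p * cond_var).sum() + (p * (cond_mean - (p * cond_mean).sum()) ** 2).sum()
print(total, X.var())
```

The two printed numbers agree up to rounding, since the decomposition is an algebraic identity for the empirical distribution as well.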

19.5 Conditional probability measures


Recall that
P(X ∈ B) = E(1X∈B ).
Motivated by this, and since we have a meaningful/rigorous way to define conditional
expectations, we may attempt to define conditional probability distributions similarly.

Let (X1 , . . . , Xn ) be a random vector in Rn and let Y be another collection of random


variables. Given B ⊂ Rn (Borel set), the random variable

1(X1 ,...,Xn )∈B

takes values 0 or 1 and so we can define its conditional expectation

E[1(X1 ,...,Xn )∈B |Y].

We use a special symbol for this:

P((X1 , . . . , Xn ) ∈ B|Y) := E[1(X1 ,...,Xn )∈B |Y]. (19.8)

This can be made meaningful, that is, one can arrange it so that

1) As a function of B, P((X1 , . . . , Xn ) ∈ B|Y) is a probability measure;

2) P((X1 , . . . , Xn ) ∈ B|Y) is also a random variable since it belongs to F (Y).

Hence P((X1 , . . . , Xn ) ∈ B|Y) is a random probability measure and is called the regular con-
ditional distribution or, for brevity, conditional distribution of (X1 , . . . , Xn ) given Y or
conditionally on Y.
We now have the following very important formula that we call vital formula.

The vital formula of probability:


P((X1 , . . . , Xn ) ∈ B, Y ∈ C) = E[P((X1 , . . . , Xn ) ∈ B|Y) 1Y∈C ]. (19.9)

Here is the reason: First, we have

P((X1 , . . . , Xn ) ∈ B, Y ∈ C) = E[1(X1 ,...,Xn )∈B 1Y∈C ].

Then, from the projection property (2) of Theorem 19.1

E[1(X1 ,...,Xn )∈B 1Y∈C ] = E[E(1(X1 ,...,Xn )∈B |Y) 1Y∈C ]

But from (19.8),


E(1(X1 ,...,Xn )∈B |Y) = P((X1 , . . . , Xn ) ∈ B|Y)
Substituting backwards, we obtain (19.9).

PROBLEM 19.16 (tossing i.i.d. coins with random probability of heads). Let Θ be a
unif([0, 1]) random variable. Let ξ = (ξ1 , . . . , ξn ) be a random vector whose distribution,
conditional on Θ, is that of a sequence of i.i.d. Ber(Θ) random variables. Therefore–see (13.1)–the
conditional distribution of ξ given Θ is given by
P(ξ1 = x1 , . . . , ξn = xn |Θ) = Θ^(x1+···+xn) (1 − Θ)^((1−x1)+···+(1−xn)), x1 , . . . , xn ∈ {0, 1}.

You are asked to determine the density f (θ|ξ) of the conditional distribution of Θ given ξ. Your
answer should be a function of
Sn = ξ1 + · · · + ξn = the total number of heads.

You may use the formula


∫_0^1 t^M (1 − t)^N dt = M! N!/(M + N + 1)!,
when and if you need it.
Answer. We determine the conditional distribution function of Θ given ξ. From the definition
of elementary conditional probability,

P(Θ ≤ θ|ξ1 = x1 , . . . , ξn = xn ) = P(Θ ≤ θ, ξ1 = x1 , . . . , ξn = xn ) / P(ξ1 = x1 , . . . , ξn = xn ).

The denominator is obtained by setting θ = 1 in the numerator. So we only deal with the
latter. From the vital formula of probability (19.9), and our assumption,
P(Θ ≤ θ, ξ1 = x1 , . . . , ξn = xn ) = E[1Θ≤θ P(ξ1 = x1 , . . . , ξn = xn |Θ)] = E[1Θ≤θ Θ^(Σxi) (1 − Θ)^(Σ(1−xi))] (writing Σxi for x1 + · · · + xn),

which is easy since we know the distribution of Θ:

E[1Θ≤θ Θ^(Σxi) (1 − Θ)^(Σ(1−xi))] = ∫_0^1 1{t≤θ} t^(Σxi) (1 − t)^(Σ(1−xi)) dt = ∫_0^θ t^(Σxi) (1 − t)^(Σ(1−xi)) dt.

Hence

P(Θ ≤ θ|ξ1 = x1 , . . . , ξn = xn ) = [∫_0^θ t^(Σxi) (1 − t)^(Σ(1−xi)) dt] / [∫_0^1 t^(Σxi) (1 − t)^(Σ(1−xi)) dt]
and so, from the fundamental theorem of Calculus,
f(θ|x1 , . . . , xn ) := (d/dθ) P(Θ ≤ θ|ξ1 = x1 , . . . , ξn = xn ) = θ^(Σxi) (1 − θ)^(Σ(1−xi)) / ∫_0^1 t^(Σxi) (1 − t)^(Σ(1−xi)) dt.

Hence
f(θ|ξ1 , . . . , ξn ) = (d/dθ) P(Θ ≤ θ|ξ1 , . . . , ξn ) = θ^(Σξi) (1 − θ)^(Σ(1−ξi)) / ∫_0^1 t^(Σξi) (1 − t)^(Σ(1−ξi)) dt
                    = ((n + 1)!/(Sn ! (n − Sn )!)) θ^Sn (1 − θ)^(n−Sn), 0 ≤ θ ≤ 1.

This f(θ|ξ1 , . . . , ξn ) is a random variable (because it is a function of ξ = (ξ1 , . . . , ξn )), and a
probability density function (because ∫_0^1 f(θ|ξ1 , . . . , ξn ) dθ = 1).
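Here is a quick numerical sanity check of this conditional density; the values of n, k and the number of trials below are hypothetical choices of ours. The density ((n+1)!/(Sn!(n−Sn)!)) θ^Sn (1−θ)^(n−Sn) has mean (Sn + 1)/(n + 2), which we compare against a two-stage simulation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate the two-stage experiment: Theta ~ unif[0,1], then n coin
# tosses with P(heads) = Theta.  (n, k and trials are our own values.)
n, k, trials = 10, 7, 1_000_000
theta = rng.uniform(size=trials)
S = rng.binomial(n, theta)

# Mean of the conditional density with S_n = k is (k+1)/(n+2);
# compare with the empirical mean of Theta over trials with S_n = k.
emp = theta[S == k].mean()
print(emp, (k + 1) / (n + 2))
```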

19.6 Convolutions
The convolution is an operation between probability measures such that if we take independent
random variables whose laws are the given probability measures, then their convolution is
the law of the sum of the random variables.
If B ⊂ R and t a real number, let

B + t := {x + t : x ∈ B}.

Thus, for example,

(a, b] + t = (a + t, b + t], (−∞, b] − t = (−∞, b − t], R + t = R.

Let X1 , X2 be independent random variables with distributions Q1 , Q2 respectively. Let S


be their sum. We have

P(S ∈ B) = E[P(X1 + X2 ∈ B|X2 )] = E[P(X1 ∈ B − X2 |X2 )]

Since X1 is independent of X2 we have

P(X1 ∈ B − X2 |X2 ) = Q1 (B − X2 ).

We therefore have

P(S ∈ B) = E[Q1 (B − X2 )] = E[Q2 (B − X1 )] =: (Q1 ∗ Q2 )(B).

where the second equality is obtained by interchanging the roles of X1 and X2 . The calculation
above depends only on Q1 , Q2 . The last equality defines a symbol for this operation, whose
name is the convolution of the probability measures Q1 and Q2 .
Here are some properties:

1.
Q1 ∗ Q2 = Q2 ∗ Q1

2.
Q1 ∗ (Q2 ∗ Q3 ) = (Q1 ∗ Q2 ) ∗ Q3 .

3.
Q ∗ δ0 = Q

We explain: The first and second ones are due to the facts that X1 + X2 = X2 + X1 and
X1 + (X2 + X3 ) = (X1 + X2 ) + X3 . For the last one, recall that δ0 is the law of a random variable
that takes value 0 only. But then X + 0 = X. Due to the second property, we can omit the
parentheses:
Q1 ∗ Q2 ∗ Q3 = Q1 ∗ (Q2 ∗ Q3 ) = (Q1 ∗ Q2 ) ∗ Q3 .
And we can do this for any finite number of probability measures. In particular, we can write

Q∗n := Q1 ∗ Q2 ∗ · · · ∗ Qn , when Qi = Q for all i = 1, . . . , n.



We also write
Q∗0 = δ0 ,
because then we can write

Q∗m ∗ Q∗n = Q∗(m+n) , for all m, n ≥ 0.

?PROBLEM 19.17 (convolutions of densities). Let Qi have density fi . Derive the density f
of Q1 ∗ Q2 .
Answer. The distribution function of Q1 ∗ Q2 is (Q1 ∗ Q2 )(−∞, x]. Hence its density is its
derivative:
f(x) = (d/dx) (Q1 ∗ Q2 )(−∞, x].

But (Q1 ∗ Q2 )(−∞, x] = E{Q1 (−∞, x − X2 ]} and f1 (t) = (d/dt) Q1 (−∞, t]. Hence

f(x) = E{ f1 (x − X2 )}.
The right-hand side is merely the expectation of a function of X2 and, by the law of the
unconscious statistician,

f(x) = ∫_{−∞}^∞ f1 (x − y) f2 (y) dy.

We give a symbol to this:
(f1 ∗ f2 )(x) := ∫_{−∞}^∞ f1 (x − y) f2 (y) dy : the density of the sum of the two independent r.v.s.

We have

1.
f1 ∗ f2 = f2 ∗ f1

2.
f1 ∗ ( f2 ∗ f3 ) = ( f1 ∗ f2 ) ∗ f3 .

In the particular case where the random variables are positive, we have
(f1 ∗ f2 )(x) = ∫_0^x f1 (x − y) f2 (y) dy, if X1 , X2 ≥ 0.

Indeed,

(f1 ∗ f2 )(x) = ∫_{−∞}^∞ f1 (x − y)1{x−y≥0} f2 (y)1{y≥0} dy = ∫_{−∞}^∞ 1{0≤y≤x} f1 (x − y) f2 (y) dy = ∫_0^x f1 (x − y) f2 (y) dy.

PROBLEM 19.18 (sum of two independent exponential r.v.s). Let X, Y be expon(λ), expon(µ),
respectively, independent, with λ ≠ µ. Determine the density of X + Y.
Answer. The densities of X, Y are f(x) = λe^{−λx} 1x>0 and g(x) = µe^{−µx} 1x>0 . The density of X + Y is
f ∗ g:

(f ∗ g)(x) = ∫_0^x λe^{−λ(x−y)} µe^{−µy} dy = λµ (e^{−λx} − e^{−µx})/(µ − λ), x > 0.

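The closed form can be checked against a discretized version of the convolution integral; the values of λ and µ below are our own example choices:

```python
import numpy as np

# Numerical check of the exponential-sum density via a discretized
# convolution (lam and mu are our own example values).
lam, mu = 1.0, 2.0
dx = 0.001
x = np.arange(0, 10, dx)
f = lam * np.exp(-lam * x)
g = mu * np.exp(-mu * x)

# (f*g)(x) = integral_0^x f(x-y) g(y) dy, approximated by a Riemann sum.
conv = np.convolve(f, g)[: len(x)] * dx
closed = lam * mu * (np.exp(-lam * x) - np.exp(-mu * x)) / (mu - lam)
print(np.max(np.abs(conv - closed)))  # small discretization error
```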

PROBLEM 19.19 (sum of independent uniform r.v.s). Let X, Y be independent unif([0, 1]),
unif([1/2, 3/2]), respectively. Determine the density of X + Y and sketch it.
Answer. The densities of the two random variables are 1{0<x<1}, 1{1/2<x<3/2}, respectively. Hence the
density of X + Y is

(f ∗ g)(x) = ∫_{−∞}^∞ f(x − y)g(y) dy = ∫_{−∞}^∞ 1{0<x−y<1} 1{1/2<y<3/2} dy
= ∫_{−∞}^∞ 1{x−1<y<x} 1{1/2<y<3/2} dy = ∫_{−∞}^∞ 1{x−1<y, 1/2<y, y<x, y<3/2} dy = ∫_{−∞}^∞ 1{max(x−1,1/2)<y<min(x,3/2)} dy.

The reason for the last equality is that x − 1 < y, 1/2 < y ⇐⇒ max(x − 1, 1/2) < y; and also that
y < x, y < 3/2 ⇐⇒ y < min(x, 3/2). This is a really trivial integral because it is of the form
∫_{−∞}^∞ 1{a<y<b} dy with a = a(x) = max(x − 1, 1/2), b = b(x) = min(x, 3/2). This integral equals b − a,
provided a < b, or zero if not. We can write this as

∫_{−∞}^∞ 1{a<y<b} dy = max[b − a, 0].

Hence

h(x) := (f ∗ g)(x) = max[b(x) − a(x), 0] = max[min(x, 3/2) − max(x − 1, 1/2), 0], x ∈ R,

is the density of X + Y. But we can write it in terms of its various branches.


If x ≤ 1/2 then b(x) = x, a(x) = 1/2, b(x) − a(x) = x − 1/2 ≤ 0, so h(x) = 0.
If 1/2 ≤ x ≤ 3/2 then b(x) = x, a(x) = 1/2, b(x) − a(x) = x − 1/2 ≥ 0, so h(x) = x − 1/2.
If 3/2 ≤ x ≤ 5/2 then b(x) = 3/2, a(x) = x − 1, b(x) − a(x) = 5/2 − x ≥ 0, so h(x) = 5/2 − x.
If x ≥ 5/2 then b(x) = 3/2, a(x) = x − 1, b(x) − a(x) = 5/2 − x ≤ 0, so h(x) = 0. And so

h(x) = x − 1/2, if 1/2 ≤ x ≤ 3/2,
     = 5/2 − x, if 3/2 ≤ x ≤ 5/2,
     = 0,       otherwise.


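A quick Monte Carlo check of this triangular density (sketch code, with our own choice of sample size):

```python
import numpy as np

rng = np.random.default_rng(2)

# Monte Carlo check of the triangular density h of X + Y, where
# X ~ unif[0,1] and Y ~ unif[1/2, 3/2] are independent.
n = 1_000_000
s = rng.uniform(0, 1, n) + rng.uniform(0.5, 1.5, n)

def h(x):
    # h(x) = max[min(x, 3/2) - max(x - 1, 1/2), 0], as derived above
    return np.maximum(np.minimum(x, 1.5) - np.maximum(x - 1.0, 0.5), 0.0)

# Compare P(X + Y <= 2) with the integral of h over [0, 2].
emp = (s <= 2.0).mean()
xs = np.linspace(0.0, 2.0, 200_001)
num = h(xs).mean() * 2.0      # Riemann approximation of the integral
print(emp, num)
```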

19.7 Conditioning under normality


We now address the problem of finding conditional expectations under normality assumptions.

19.7.1 Conditional expectation for two normal r.v.s

Let (X, Y) be a normal random vector in R2 . We wish to compute E(X|Y). By the linearity
property–Property (I)–we have

E(X|Y) = E(X − EX|Y) + EX.

By Property (II), we can replace Y by any g(Y) such that g is a bijection. We choose g(Y) = Y−EY:

E(X|Y) = E(X − EX|Y − EY) + EX.

So far, we have not used normality. The above holds for any random variables. But now, since
we have a hunch that “normality” and “linearity” somehow go hand in hand, we speculate
that
E(X − EX|Y − EY) = a · (Y − EY).
If we manage to show that (1) and (2) of Theorem 19.1 hold, then we're done. Write, for
brevity, X̃ = X − EX, Ỹ = Y − EY. Obviously, Y − EY is a function of Y. So (1) holds. To show
(2), we need to show that

E[X̃ · Z] = E[a · Ỹ · Z] (19.10)

for any Z that is a function of Ỹ (and hence of Y). We first take Z = Ỹ. We then have

E[X̃ · Ỹ] = a E[Ỹ²].

This gives

a = E[X̃ · Ỹ]/E[Ỹ²] = cov(X, Y)/var(Y),

provided that E[Ỹ²] = var(Y) ≠ 0. (We'll see what happens when var(Y) = 0 later.) With this
choice of a we have

E[(X̃ − aỸ) · Ỹ] = 0 = E[X̃ − aỸ] · E[Ỹ].

This means that

X̃ − aỸ and Ỹ are uncorrelated.

Since

(X̃ − aỸ, Ỹ) is normal in R²,

we immediately have

X̃ − aỸ and Ỹ are independent.

Hence, for any Z = g(Y) we have

X̃ − aỸ and Z = g(Y) are independent.

Hence

E[(X̃ − aỸ) · Z] = E[X̃ − aỸ] · E[Z] = 0.

But this is precisely (19.10). We're done:

E(X|Y) = a(Y − EY) + EX, with a = E[X̃ · Ỹ]/E[Ỹ²] if var(Y) ≠ 0.
It remains to examine what happens when var(Y) = 0. But when var(Y) = 0 then P(Y = EY) = 1
because (reminder!) P(|Y − EY| > ε) ≤ var(Y)/ε² = 0, so P(|Y − EY| > ε) = 0 for all ε > 0 and so
P(Y = EY) = 1. So Y is almost surely a constant, and a constant is independent of X; hence,
from Property (IV),

E(X|Y) = EX if var(Y) = 0.
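A small simulation sketch of the formula E(X|Y) = a(Y − EY) + EX with a = cov(X, Y)/var(Y); the means, variances and the conditioning point below are our own example values:

```python
import numpy as np

rng = np.random.default_rng(3)

# For (X, Y) jointly normal, E(X|Y) = EX + a (Y - EY) with
# a = cov(X,Y)/var(Y).  The constants here are our own choices.
n = 1_000_000
Y = 1.0 + 2.0 * rng.standard_normal(n)
X = 0.5 + 0.75 * (Y - 1.0) + rng.standard_normal(n)  # true a = 0.75

a = np.cov(X, Y)[0, 1] / Y.var()
# Empirical E(X | Y near y0) versus the linear formula at y0 = 2:
y0 = 2.0
band = np.abs(Y - y0) < 0.05
print(a, X[band].mean(), 0.5 + 0.75 * (y0 - 1.0))
```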

19.7.2 Conditional expectation for many normal r.v.s

Let (X, Y1 , . . . , Yd ) be a normal random vector in R1+d . We wish to compute E(X|Y) when
Y = (Y1 , . . . , Yd ).
Assume that
R = cov(Y) is invertible.
Recall that cov(Y) is the matrix with entries

Ri, j = cov(Yi , Y j ), i, j = 1, . . . , d. (19.11)

As before, observe that


E(X|Y) = E(X − EX|Y − EY) + EX.
We proceed as before:
Step 1. Make the assumption that E(X − EX|Y − EY) is a linear function of Y − EY and
compute its coefficients.
Step 2. Show that this is correct.
Define X̃ = X − EX, Ỹ = Y − EY = (Y1 − EY1 , . . . , Yd − EYd ) = (Ỹ1 , . . . , Ỹd ), to save some
ink. The assumption of Step 1 is

E(X̃|Ỹ) = Σ_{j=1}^d aj Ỹj .

By (2) of Theorem 19.1 we must have


 
E[X̃Z] = E[Z Σ_{j=1}^d aj Ỹj ],

for any Z that is a function of Ỹ and hence of Y. By choosing Z = Ỹi we obtain

E[X̃Ỹi ] = Σ_{j=1}^d aj E[Ỹi Ỹj ].
ej ].
j=1

If we let

a = (a1 , . . . , ad )ᵀ, B = (E[X̃Ỹ1 ], . . . , E[X̃Ỹd ])ᵀ = (cov(X, Y1 ), . . . , cov(X, Yd ))ᵀ, (19.12)
the equation above becomes
Ra = B,
whose solution is
a = R−1 B, (19.13)
because we assumed that R is invertible. Our answer then, in this case, is
E(X|Y1 , . . . , Yd ) = Σ_{j=1}^d aj (Yj − EYj ) + EX, where a = R⁻¹B with R as in (19.11) and B as in (19.12).

We omit step 2 because it is identical in spirit to the one in the simple (d = 1) case.
PROBLEM 19.20 (computation of a conditional expectation under normality). Let (X, Y1 , Y2 )
be normal in R3 such that EX = EY1 = EY2 = 0 and

EY1² = 13, EY2² = 2, EY1 Y2 = 1, EXY1 = 11, EXY2 = −3.

Compute E(X|Y1 , Y2 ).
Answer. We have
E(X|Y1 , Y2 ) = a1 Y1 + a2 Y2 ,
and we need to determine a1 , a2 . By (2) of Theorem 19.1 we must have

EXZ = E[Z(a1 Y1 + a2 Y2 )]

for all functions Z of (Y1 , Y2 ). We simply choose Z = Y1 first and then Z = Y2 to obtain

EXY1 = a1 EY1² + a2 EY1 Y2 and EXY2 = a1 EY1 Y2 + a2 EY2², that is,

11 = 13a1 + a2 , −3 = a1 + 2a2 ,

that can be easily solved: a1 = 1, a2 = −2. Hence

E(X|Y1 , Y2 ) = Y1 − 2Y2 .


PROBLEM 19.21 (computation of another conditional expectation under normality). Let
(Z1 , Z2 ) be i.i.d. N(0, 1). Let X = Z1 + 4Z2 , Y1 = 3Z1 + 2Z2 , Y2 = Z1 − Z2 . Compute E(X|Y1 , Y2 ).
Answer. Since (X, Y1 , Y2 ) is a linear function of (Z1 , Z2 ) we have that (X, Y1 , Y2 ) is normal in
R3 . Therefore,
E(X|Y1 , Y2 ) = a1 Y1 + a2 Y2 ,
and we need to determine a1 , a2 . We have

EXY1 = a1 EY12 + a2 EY1 Y2


EXY2 = a1 EY1 Y2 + a2 EY22

and we need to determine the covariances:

EXY1 = E(Z1 + 4Z2 )(3Z1 + 2Z2 ) = 3E(Z1²) + 8E(Z2²) + 14E(Z1 Z2 ) = 3 + 8 + 0 = 11.

We similarly compute the others and find

EY1² = 13, EY2² = 2, EY1 Y2 = 1, EXY1 = 11, EXY2 = −3.

Hence the system of equations becomes

11 = 13a1 + a2
−3 = a1 + 2a2

which yields a1 = 1, a2 = −2 and so

E(X|Y1 , Y2 ) = Y1 − 2Y2 .
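The linear system Ra = B of this problem can also be solved numerically (a sketch; the numbers are those computed above):

```python
import numpy as np

# Problem 19.21 numerically: X = Z1 + 4 Z2, Y1 = 3 Z1 + 2 Z2, Y2 = Z1 - Z2.
# Solve R a = B for the coefficients of E(X|Y1,Y2) = a1 Y1 + a2 Y2.
R = np.array([[13.0, 1.0],
              [1.0, 2.0]])        # covariance matrix of (Y1, Y2)
B = np.array([11.0, -3.0])        # (cov(X,Y1), cov(X,Y2))
a1, a2 = np.linalg.solve(R, B)
print(a1, a2)

# Here (Z1, Z2) can be recovered from (Y1, Y2), so X is a function of
# (Y1, Y2) and E(X|Y1,Y2) = X by Property (V); indeed, identically,
# Z1 + 4 Z2 = (3 Z1 + 2 Z2) - 2 (Z1 - Z2).
rng = np.random.default_rng(4)
Z1, Z2 = rng.standard_normal(5), rng.standard_normal(5)
print(np.allclose(Z1 + 4*Z2, (3*Z1 + 2*Z2) - 2*(Z1 - Z2)))  # True
```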

19.7.3 Conditional variance under normality


We wish to compute var(X|Y) when (X, Y) is a normal random vector. Recall

var(X|Y) = E[(X − E(X|Y))² |Y].

By (2) of Theorem 19.1 we have that X − E(X|Y) and Y are uncorrelated, and hence (by
normality) independent. By Property (IV), E[(X − E(X|Y))² |Y] = E[(X − E(X|Y))²]. And so

var(X|Y) = E[(X − E(X|Y))²].

That is, var(X|Y), when (X, Y) is a normal random vector, is a constant!

PROBLEM 19.22 (conditional variance computation under normality). Let (X, Y) be normal
in R² with var(Y) ≠ 0. Compute var(X|Y) in terms of σ1² = var(X), σ2² = var(Y) and
σ1,2 = cov(X, Y).
Answer. We have E(X|Y) = a(Y − EY) + EX, where a = σ1,2 /σ2². So

var(X|Y) = E[(X − E(X|Y))²] = E[(X − (a(Y − EY) + EX))²] = E[((X − EX) − a(Y − EY))²]
         = var(X) + a² var(Y) − 2a cov(X, Y)
         = σ1² + (σ1,2²/σ2⁴) σ2² − 2 (σ1,2 /σ2²) σ1,2 = (σ1² σ2² − σ1,2²)/σ2².


We can similarly compute var(X|Y) when (X, Y) is normal in R^{1+d}, but we will skip the
computation, mentioning only that, again, var(X|Y) is a constant.

19.7.4 Conditional probability distribution under normality

Let (X, Y) be normal in R1+d . Then

The regular conditional distribution of X given Y is N(E(X|Y), var(X|Y)). (19.14)

From this, we can easily obtain the conditional density, in the case that cov(Y) is invertible.
We claim that all we have to do is remember that

density of N(µ, σ²) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²))

and replace µ by E(X|Y) and σ2 by var(X|Y) (which is a constant). We therefore have that

f(x|Y1 , . . . , Yd ) = (1/√(2π var(X|Y1 , . . . , Yd ))) exp(−(x − E(X|Y1 , . . . , Yd ))²/(2 var(X|Y1 , . . . , Yd ))) (19.15)

is a density for the regular conditional distribution of X given Y = (Y1 , . . . , Yd ). This means
that

P(X ∈ B|Y1 , . . . , Yd ) = ∫_B f(x|Y1 , . . . , Yd ) dx.

PROBLEM 19.23 (computation of a conditional density under normality). Let Z1 , Z2 be i.i.d.
standard normal random variables. Let X = Z1 + 2Z2 , Y = 3Z1 − Z2 . Compute f(x|Y).
Answer. We have EXY = 3 − 2 = 1, EY² = 9 + 1 = 10, EX² = 1 + 4 = 5, so

E(X|Y) = (EXY/EY²) Y = (1/10) Y,

and

var(X|Y) = E[(X − E(X|Y))²] = E[(X − (1/10)Y)²] = 5 + (1/10)² · 10 − 2 · (1/10) · 1 = 4.9.

Hence

f(x|Y) = (1/√(9.8π)) e^{−(x − Y/10)²/9.8}
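A Monte Carlo sketch checking the two numbers just computed:

```python
import numpy as np

rng = np.random.default_rng(5)

# Monte Carlo check for Problem 19.23: X = Z1 + 2 Z2, Y = 3 Z1 - Z2.
n = 1_000_000
Z1, Z2 = rng.standard_normal(n), rng.standard_normal(n)
X, Y = Z1 + 2*Z2, 3*Z1 - Z2

res = X - Y / 10          # X - E(X|Y)
print(np.mean(res * Y))   # ≈ 0: the residual is uncorrelated with Y
print(np.var(res))        # ≈ 4.9 = var(X|Y)
```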


But we need to explain our claim! That is, we need to justify why, for all B,
P(X ∈ B|Y) = ∫_B f(x|Y) dx,

where f (x|Y) is given by (19.15). Look at PROPERTY B, on page 182, for a moment generating
function. It says that the distribution of a random variable is determined by its moment
generating function. The same is true for conditional distributions. So we need to justify that,
for all t,

E[e^{tX} |Y] = ∫_{−∞}^∞ e^{tx} f(x|Y) dx.

Since f (x|Y) is a density for N(E(X|Y), var(X|Y)), we have, by the formula 16.9 for the moment
generating function of a normal random variable,
∫_{−∞}^∞ e^{tx} f(x|Y) dx = e^{tE(X|Y) + (1/2)t² var(X|Y)}.

Our claim thus becomes


E[e^{tX} |Y] = e^{tE(X|Y) + (1/2)t² var(X|Y)}.

By (2) of Theorem 19.1 all we have to do is show that, for all t,

E[e^{tX} g(Y)] = E[e^{tE(X|Y) + (1/2)t² var(X|Y)} g(Y)].

By the Cramér-Wold theorem (Theorem 16.5), the distribution of (X, Y) = (X, Y1 , . . . , Yd ) is
determined by the distribution of

tX + s1 Y1 + · · · + sd Yd = tX + sᵀY,

for all t, s1 , . . . , sd . So it is enough to show that


E[e^{tX} e^{sᵀY}] = E[e^{tE(X|Y) + (1/2)t² var(X|Y)} e^{sᵀY}].

But var(X|Y) is a constant, so it moves outside the expectation. On the other hand E(X|Y) =
aᵀY (all means being zero here, for simplicity), that is, a linear function of Y; this was shown
in 19.7.2; the coefficients a are given by (19.13). Hence it is enough to show

E[e^{tX} e^{sᵀY}] = e^{(1/2)t² var(X|Y)} E[e^{(ta+s)ᵀY}]. (19.16)

To do this, we simply notice that the left-hand side is the moment generating function of
the normal random vector (X, Y) = (X, Y1 , . . . , Yd ) in R1+d , while the right-hand side is the
moment generating function of the normal random vector Y = (Y1 , . . . , Yd ) in Rd , so we can
compute them both. This is because we know exactly what the moment generating function
of a multidimensional random vector is. It is given by formula (18.6) or, equivalently, by the
same formula written using matrix notation: (18.8).

PROBLEM 19.24 (verification of conditional distribution under normality). Establish that


(19.16) holds when d = 1, that is, when (X, Y) is normal in R². You may assume that
EX = EY = 0, var(X) = σ1², var(Y) = σ2², cov(X, Y) = σ1,2 . Recall that, in this case,

E(X|Y) = aY, var(X|Y) = σ1² − a²σ2², a = σ1,2 /σ2².

Answer. (19.16) is written as

E e^{tX+sY} = e^{(1/2)t² var(X|Y)} E e^{(ta+s)Y}. (19.17)

We must show that it holds. We have

E e^{tX+sY} = e^{(1/2) var(tX+sY)} = e^{(1/2)(σ1² t² + σ2² s² + 2σ1,2 st)},

E e^{(ta+s)Y} = e^{(1/2) var((ta+s)Y)} = e^{(1/2)(a² t² σ2² + s² σ2² + 2astσ2²)},

e^{(1/2)t² var(X|Y)} = e^{(1/2)t²(σ1² − a² σ2²)} = e^{(1/2)(σ1² t² − a² σ2² t²)},
and so (19.17) is equivalent to (equate the exponents)

σ1² t² + σ2² s² + 2σ1,2 st = σ1² t² − a² σ2² t² + a² t² σ2² + s² σ2² + 2astσ2²,

and immediately we see that several terms cancel. Equivalently then,

2σ1,2 st = 2astσ2²,

which is obviously true, since a = σ1,2 /σ2².

Remark 19.3. Verifying (19.16) in the general case, that is when d > 1, is conceptually similar
to the d = 1 case of Problem 19.24. The only additional difficulty is finding out how to write
things compactly using matrix notation.

Remark 19.4. (19.14) provides an answer to the question (18.12) when m = 1.

19.7.5 Conditional probability distribution under normality, II

Let (X, Y) = (X1 , . . . , Xm , Y1 , . . . , Yd ) be normal in R^{m+d}. It will now come as no surprise
if I tell you that

The regular conditional distribution of X given Y is

N(E(X|Y), cov(X|Y))

on R^m, where cov(X|Y) is the m × m random matrix defined by

cov(X|Y) := E[(X − E(X|Y))(X − E(X|Y))ᵀ | Y]. (19.18)

Thus, its (i, j) entry is

cov(X|Y)i,j = E[(Xi − E(Xi |Y))(Xj − E(Xj |Y)) | Y].

Remark 19.5. (19.18) provides an answer to the question (18.12) for general m.
Chapter 20

The central limit theorem

The central limit theorem gives a kind of correction to the


weak law of large numbers. In the language of Gauss, the
cumulative sum of independent errors has a normal
distribution in the limit. We prove this.

20.1 Rate of convergence


In applications of mathematics, whenever we have a sequence an such that
lim_{n→∞} an = a

we often wonder how fast the convergence is. To answer this, we look for a sequence λ(n)
such that

lim_{n→∞} λ(n)(an − a) = c ≠ 0. (20.1)

Necessarily,

lim_{n→∞} λ(n) = ∞.

Of course, there can be many such sequences but we are looking for the “simplest” one. We
then say that¹

an converges to a at rate 1/λ(n).
For example, if an = (n + 2)/(n + 1) then an → 1 as n → ∞. But an − 1 = 1/(n + 1), so if we
choose λ(n) = n we have λ(n)(an − 1) = n/(n + 1) → 1 as n → ∞, and so we say that
(n + 2)/(n + 1) converges to 1 at rate 1/n.
n n+2

PROBLEM 20.1 (rate of convergence of approximations to e). Consider the following se-
quences:
an = (1 + 1/n)^n, bn = Σ_{k=0}^n 1/k!.
1
Or, more generally, if all limit points of the sequence λ(n)(an − a) are contained in a bounded interval that does
not contain 0.


Both sequences converge to the same limit:

lim an = lim bn = e
n→∞ n→∞

At what rates?
Answer. By Taylor's theorem,

lim_{n→∞} n[(1 + 1/n)^n − e] = −e/2.

By direct computation,

lim_{n→∞} (n + 1)!(e − bn ) = 1.

So the rate of the convergence an → e is 1/n, while the rate of the convergence bn → e is
1/(n + 1)!. Since (n + 1)! is much, much larger than n, the second convergence is much faster.
Indeed, if you were to approximate e = 2.718281828459 · · · numerically, please use the second
approximation. Voilà:
n      an             bn
1      2.             2.
2      2.250000000    2.5
3      2.370370370    2.666666667
4      2.441406250    2.708333333
5      2.488320000    2.716666667
6      2.521626372    2.718055556
7      2.546499697    2.718253968
8      2.565784514    2.718278770
9      2.581174792    2.718281526
10     2.593742460    2.718281801
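The two scaled errors can be recomputed in a few lines (a sketch using Python's standard library):

```python
from math import e, factorial

# Recompute the two approximations of e and the scaled errors:
# n (a_n - e) should approach -e/2 and (n+1)!(e - b_n) should approach 1.
for n in (1, 5, 10):
    a_n = (1 + 1/n) ** n
    b_n = sum(1/factorial(k) for k in range(n + 1))
    print(n, n * (a_n - e), factorial(n + 1) * (e - b_n))
```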

When we have sequences of random variables there are two subtleties: First, what do we
mean by limit? Second, what do we mean by rate of convergence?
These questions are of paramount importance because probability and statistics is, in a
sense, all about approximations and, therefore, all about limits and their convergence rates.
So we must understand, at least a little bit, what these concepts mean.

20.2 Limits of sequences of random variables


Let X1 , X2 , . . . be a sequence of random variables. Recall that they are sequences of functions
on some sample space Ω. There is also a probability measure P defined on a class of events E .

Definition 20.1 (strong limit). A sequence of random variables X1 , X2 , . . . converges strongly


to a random variable X if
P( lim Xn = X) = 1.
n→∞

PROBLEM 20.2 (example of strong convergence). Give an example of strong convergence.



Answer. Let X1 , X2 , . . . be i.i.d. random variables with common mean µ. Set Sn = X1 + · · · + Xn .


Then, by the strong law of large numbers

P( lim Sn /n = µ) = 1.
n→∞

Hence the sequence S1 , S2 /2, S3 /3, . . . converges to µ strongly. 


Definition 20.2 (limit in probability). A sequence of random variables X1 , X2 , . . . converges
in probability to a random variable X if

for all ε > 0, lim P(|Xn − X| > ε) = 0


n→∞

PROBLEM 20.3 (example of convergence in probability). Give an example of convergence


in probability.
Answer. Let X1 , X2 , . . . be i.i.d. random variables with common mean µ. Set Sn = X1 + · · · + Xn .
Then, by the weak law of large numbers

for all ε > 0, lim P(|Sn /n − µ| > ε) = 0.


n→∞

Hence the sequence S1 , S2 /2, S3 /3, . . . converges to µ in probability. 


?PROBLEM 20.4 (strong convergence implies convergence in probability). Explain why
if Xn converges to X strongly then Xn converges to X in probability.
Answer. Just follow the steps of Problem 17.6. 
And now we pass to an even weaker concept, that of convergence in distribution. We need
to understand that because this is what is needed for the central limit theorem.
Definition 20.3 (limit in distribution). A sequence of random variables X1 , X2 , . . . converges
in distribution to a random variable X if

lim P(Xn ≤ x) = P(X ≤ x)


n→∞

for all x for which P(X = x) = 0.

In other words Xn converges in distribution to X if the distribution function of Xn converges


to the distribution function of X, but we have to exclude the points where the latter distribution
function is discontinuous. If X is a continuous random variable then we don’t have to exclude
anything.
We can immediately deduce that if Xn converges in distribution to X then

lim_{n→∞} P(Xn ∈ I) = P(X ∈ I)

for all intervals I for which the probability that X equals one of the endpoints of I is zero.
We have
Xn converges to X strongly ⇒ Xn converges to X in probability ⇒ Xn converges to X in
(20.2)
distribution.
The first implication was dealt with in Problem 20.4. The second implication is a little bit
more subtle and we explain it here, by first asking you to do a simple problem that only
requires elementary concepts.

PROBLEM 20.5 (comparing distribution functions). Suppose that X and Y are random
variables with distribution function F(x), G(x), respectively. Assume that

P(|X − Y| ≥ ε) ≤ δ.

Show that, for all x ∈ R,


F(x − ε) − δ ≤ G(x) ≤ F(x + ε) + δ.
Answer. We have

F(x − ε) = P(X ≤ x − ε) = P(|X − Y| ≥ ε, X ≤ x − ε) + P(|X − Y| < ε, X ≤ x − ε).

Look at the first term:


P(|X − Y| ≥ ε, X ≤ x − ε) ≤ P(|X − Y| ≥ ε) ≤ δ.
Look at the second term:
P(|X − Y| < ε, X ≤ x − ε) = P(X − ε < Y < X + ε ≤ x) ≤ P(Y ≤ x) = G(x).
Adding up, we obtain F(x − ε) ≤ G(x) + δ. That’s the first inequality. Similarly for the second.

We can now justify the second ⇒ in (20.2). Suppose that Xn converges to X in probability
and let Fn and F be their distribution functions. Then, for all ε, δ > 0, we have P(|Xn −X| > ε) ≤ δ
for all large n. Then we have

F(x − ε) − δ ≤ Fn (x) ≤ F(x + ε) + δ,

for all large n. Letting ε and δ converge to 0 we have that F(x + ε) + δ → F(x) but F(x − ε) − δ → F(x−).
Hence if F is continuous at x then F(x) = F(x−) and so Fn (x) converges to F(x).
We finally state a theorem without proof.

Theorem 20.1 (convergence of moment generating functions implies convergence in


distribution). If Xn is a sequence of random variables such that Xn has a non-useless moment
generating function Mn (t) and if
lim Mn (t) = M(t),
n→∞

where M(t) is the moment generating function of a random variable X, then Xn converges to X in
distribution.

20.3 Rate of convergence of the law of large numbers

Once more, the strong law of large numbers states that if X1 , X2 , . . . are i.i.d. with common
expectation µ, and Sn = X1 + · · · + Xn , then

Sn /n → µ strongly.
To find a rate of convergence we need to look for a sequence λ(n) such that

λ(n)(Sn /n − µ) → Z, in some sense (20.3)

where Z , 0. We are thus stating (20.1) with Z replacing c. In (20.1) the sequence an was not
random and hence c was not random. Here, the sequence Sn /n is random and so Z is random.
Let us try to see what Z could be.
Since Sn /n − µ = (Sn − E(Sn ))/n, we might as well

B assume that µ = 0 in order to make life simple.

So we are looking for some λ(n) such that


λ(n) Sn /n → Z, in some sense

and it is reasonable to speculate that EZ = 0. Setting

Λ(n) = λ(n)/n, (20.4)

we are now looking for some Λ(n) such that

Λ(n)Sn → Z, in some sense. (20.5)
Let a, b > 0. We have²

Λ(an + bn)San+bn → Z, in some sense. (20.6)

But

San+bn = San + (San+bn − San ),

observing that S̃bn := San+bn − San has the same distribution as Sbn , and that San , San+bn − San
are independent. But then

Λ(an)San → Z′, Λ(bn)S̃bn → Z″, (20.7)

where Z′, Z″ each have the same distribution as Z and Z′, Z″ are independent. We now rewrite (20.6) as

(Λ(an + bn)/Λ(an)) Λ(an)San + (Λ(an + bn)/Λ(bn)) Λ(bn)S̃bn → Z.

Since (20.7) must also hold, we can “replace”, in the limit, Λ(an)San by Z′ and Λ(bn)S̃bn by Z″,
and this forces us to assume that the two ratios must also converge. Since a, b are arbitrary,
this leads us to stipulate that

lim_{n→∞} Λ(an + bn)/Λ(an) exists, for all a, b > 0.
The simplest function for which this holds is a power function. So, let's assume that

Λ(n) = n^p.

Assuming that the sense in which (20.5) holds is convergence in distribution, we obtain that

((a + b)/a)^p Z′ + ((a + b)/b)^p Z″ has the same distribution as Z.
This is essentially (18.1). So if we also
2
In this section, when I write an, where a is real and n is an integer, I mean the integer part of an, but then
I'd have to carry too much notation, so I omit it.

B assume that var(X1 ) < ∞

we immediately obtain that

Z is normal.

Since, also (see (18.2)),


((a + b)/a)^{2p} + ((a + b)/b)^{2p} = 1,

and this should be true for all a, b > 0, we see that the only p that makes it work is p = −1/2.
So Λ(n) = n^{−1/2}, hence–see (20.4)–λ(n) = √n. Thus,

The rate of convergence of Sn /n → µ is 1/√n, which is rather slow. That is,

√n (Sn /n − µ) → Z,
the convergence should be convergence in distribution, and Z must be normal
with EZ = 0.

20.4 The classical central limit theorem

Theorem 20.2 (the classical central limit theorem). Let X1 , X2 , . . . be a sequence of i.i.d.
random variables with
µ = EX1 , σ2 = var X1 < ∞.
Then
(Sn − ESn )/√(var Sn ) = (Sn − nµ)/(σ√n) → Z, in distribution,
where Z is a N(0, 1) random variable.

We will explain this in a very special case: Assume that X1 has a non-useless moment
generating function. Define
m(t) := E e^{t(X1 −µ)/σ},

defined for at least one t ≠ 0. Then (Sn − nµ)/(σ√n) has a non-useless moment generating
function,

E exp(t (Sn − nµ)/(σ√n)) = m(t/√n)^n.

The random variable Z, being N(0, 1), has moment generating function given by

E e^{tZ} = e^{(1/2)t²}.

So the only thing we have to show, according to Theorem 20.1, is:

m(t/√n)^n → e^{t²/2}, or, equivalently, (n log m(t/√n))/t² → 1/2.

But

m′(0) = E((X1 − µ)/σ) = 0, m″(0) = E(((X1 − µ)/σ)²) = 1.

We have

lim_{n→∞} (n log m(t/√n))/t² = lim_{n→∞} (log m(t/√n))/(t/√n)² =(a) lim_{s→0} (log m(s))/s²
=(b) lim_{s→0} m′(s)/(2s m(s)) = lim_{s→0} m′(s)/(2s) =(c) lim_{s→0} m″(s)/2 = m″(0)/2 = 1/2.

I explain: to get =(a) I set s = t/√n, so s → 0 iff n → ∞; to get =(b) and =(c) I used L'Hôpital's rule
twice, using also that m(s) → m(0) = 1.
Another way to state the central limit theorem (CLT) is:
lim_{n→∞} P((Sn − nµ)/(σ√n) ∈ I) = ∫_I (e^{−x²/2}/√(2π)) dx.
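A simulation sketch of the CLT, using uniform summands (our own choice) so that µ and σ² are known:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(6)

# CLT demo: standardized sums of i.i.d. unif[0,1] variables
# (mu = 1/2, sigma^2 = 1/12) compared with the standard normal.
n, trials = 100, 50_000
S = rng.uniform(size=(trials, n)).sum(axis=1)
Zn = (S - n * 0.5) / np.sqrt(n / 12)

# Compare P(Zn <= 1) with Phi(1).
Phi1 = 0.5 * (1 + erf(1 / sqrt(2)))
print((Zn <= 1).mean(), Phi1)
```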

20.5 Confidence intervals


Throughout this section, let
ϕ(x) := (1/√(2π)) e^{−x²/2},

be the density of a standard normal N(0, 1) random variable, and let

Φ(x) := ∫_{−∞}^x ϕ(y) dy

be its distribution function.


Let Pθ be a family of probability measures (depending on a parameter θ that ranges on
some set Θ) on some set Ω. We say that a random variable X is a statistic if X does not depend
on θ. 3
A random interval I = [A, B] is an interval whose endpoints A, B are random variables. We
say that I is a p-confidence interval of θ if A, B are statistics and 4

Pθ (A ≤ θ ≤ B) ≥ p.

Typically, p is a large probability (e.g., 0.95). The idea here is that θ is unknown and we wish to
estimate it. We need to find random variables A and B that are statistics (we cannot make them
depend on the unknown θ; that's why they should be statistics) such that the above holds,
meaning that we are pretty confident that θ will lie in [A, B]. This problem may or may not
have a solution. Below is a case where it does.
3
This means that we should think of X as a function X : Θ × Ω → S with the property that X(θ, ω) = X(θ0 , ω)
for all ω ∈ Ω and all θ, θ0 ∈ Θ.
4
Statistics has, grammatically, a twofold meaning. First, it refers to the “subject of statistics”; second, it is “the
plural of the noun ‘statistic’ ”.

Let Pµ be the probability distribution of a random variable with mean µ and finite variance
σ². Let X1 , X2 , . . . be i.i.d. random variables with common distribution Pµ . The central limit
theorem, proved above, tells us that, for a > 0,

Pµ(−a ≤ (√n/σ)(Sn /n − µ) ≤ a) → Φ(a) − Φ(−a) = 2Φ(a) − 1, as n → ∞,
where Φ is the cumulative distribution function of a standard normal. Since −a ≤ (√n/σ)(Sn /n − µ) ≤ a
⇐⇒ Sn /n − aσ/√n ≤ µ ≤ Sn /n + aσ/√n, we rewrite the above limit as

Pµ(Sn /n − aσ/√n ≤ µ ≤ Sn /n + aσ/√n) → 2Φ(a) − 1, as n → ∞.
We now let a = a(p) be defined as the unique solution of

    2Φ(a) − 1 = p

and make an approximation (which may be nonsense in practice if, say, n is not big enough):

    P_µ( S_n/n − aσ/√n ≤ µ ≤ S_n/n + aσ/√n ) ≈ 2Φ(a) − 1 = p.
We then have good reasons to declare that:

    If a = a(p) is given by Φ(a(p)) = (p + 1)/2, then the interval

        [ S_n/n − aσ/√n , S_n/n + aσ/√n ]

    is (approximately) a p-confidence interval for µ.
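As a numerical aside (not part of the original notes), the defining equation 2Φ(a) − 1 = p can be solved for a = a(p) using the standard library's error function and bisection; the helper names `Phi` and `a_of_p` below are ours:

```python
import math

def Phi(x):
    # Standard normal distribution function via the error function:
    # Phi(x) = (1 + erf(x / sqrt(2))) / 2.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def a_of_p(p, tol=1e-10):
    # Solve 2*Phi(a) - 1 = p, i.e. Phi(a) = (p + 1)/2, by bisection;
    # Phi is strictly increasing, so the solution is unique.
    lo, hi = 0.0, 10.0
    target = (p + 1.0) / 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if Phi(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(round(a_of_p(0.95), 2))  # the familiar 1.96
print(round(a_of_p(0.99), 2))  # the 2.58 used later in these notes
```
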

But there is a catch. Although S_n = X_1 + · · · + X_n is a statistic, the endpoints of the interval above are not statistics if σ² is unknown.⁵ We must then do something else.
We assume that we work with the family Pµ,σ2 which is the distribution of a random
variable with unknown mean µ and unknown variance σ2 . We write P instead of Pµ,σ2 to save
some ink.
Consider the random variables (X_1 − µ)², (X_2 − µ)², . . .. They are i.i.d. with common expectation E(X_1 − µ)² = σ². Hence, by the strong law of large numbers (= the fundamental theorem of Probability, proved in Chapter 17),

    P( lim_{n→∞} (1/n) Σ_{j=1}^n (X_j − µ)² = σ² ) = 1.          (L1)

⁵ What is the mathematical meaning of the phrase "to know"?

So we think that we might get a good approximation for the unknown σ² if we replace it by (1/n) Σ_{j=1}^n (X_j − µ)² for some large fixed n. But hold on! This is not a statistic because it depends on the unknown µ.
So, make yet another approximation and replace µ by S_n/n. After all,

    P( lim_{n→∞} S_n/n = µ ) = 1.          (L2)

So let us define the random variable

    s_n := √( (1/n) Σ_{j=1}^n (X_j − S_n/n)² ).

We now check that

    P( lim_{n→∞} s_n = σ ) = 1.          (L3)
Here is why. Simple algebra shows that

    s_n² = (1/n) Σ_{j=1}^n (X_j − S_n/n)² = (1/n) Σ_{j=1}^n ( X_j − µ − (S_n − nµ)/n )² = (1/n) Σ_{j=1}^n (X_j − µ)² − ( (S_n − nµ)/n )².

By (L1) and (L2) we have

    P( lim_{n→∞} (1/n) Σ_{j=1}^n (X_j − µ)² = σ², lim_{n→∞} (S_n − nµ)/n = 0 ) = 1

(the intersection of two events of probability 1 each has probability 1) and, by the little algebra above, (L3) holds.
Since Sn and sn are statistics, we can now state:

    If a = a(p) is given by Φ(a(p)) = (p + 1)/2, then the interval

        [ S_n/n − a s_n/√n , S_n/n + a s_n/√n ]

    is (approximately) a p-confidence interval for µ.

Of course, the above involved heuristics that can, with a bit of effort, be justified rigorously.
PROBLEM 20.6 (empirical standard deviation). Justify the formula

    s_n² = (1/n) Σ_{j=1}^n X_j² − (S_n/n)²

using the concept of empirical probability measure (see Section 8.3), thereby justifying the "little algebra" above.

Answer. If x_1, . . . , x_n are real numbers then the empirical probability measure assigns probability 1/n to each x_i, so, as we know from Section 8.3, we can write it as

    P̂_n = (1/n) Σ_{j=1}^n δ_{x_j},

where δ_x is a probability measure on R such that δ_x(B) = 1_{x∈B}, for all B ⊂ R. If we define a random variable Y on {1, . . . , n} by Y(i) = x_i and let P be the uniform probability measure on {1, . . . , n}, we have that the distribution of Y under P is P̂_n because

    P(Y ∈ B) = P{i : Y(i) ∈ B} = Σ_{x∈B} P{i : Y(i) = x} = Σ_{x∈B} (1/n) Σ_{i=1}^n 1_{x_i = x} = (1/n) Σ_{i=1}^n 1_{x_i ∈ B} = P̂_n(B).

Hence E(Y) = (1/n) Σ_{j=1}^n x_j is the sample mean of (x_1, . . . , x_n). Similarly, E(Y²) = (1/n) Σ_{j=1}^n x_j² is the sample second moment of (x_1, . . . , x_n) and var(Y) = (1/n) Σ_{j=1}^n (x_j − E(Y))² is the sample variance of (x_1, . . . , x_n). But

    var(Y) = E(Y²) − (E(Y))².

This explains the formula. □
PROBLEM 20.7 (confidence interval for the parameter p of a geo(p) distribution). Let X_1, . . . , X_n be i.i.d. random variables with common distribution geo(p), where p is unknown. Devise an experiment in order to estimate p and give me a 0.99-confidence interval for it.
Answer. Since the number of tosses until the first success, when tossing a coin whose probability of heads (= success) is p, has geo(p) distribution, do the following, independently: pick a coin and toss it until you first get heads. Let X_1 be the number of tosses required. Then toss again and let X_2 be the number of tosses required until the next success. Do this n times, for, say, n = 100 (or more, if you have the stamina) and compute S_n and s_n. For confidence level 0.99 we need Φ(a) = (0.99 + 1)/2 = 0.995 which, using a computer or a table, is solved by a = 2.58. Hence the mean µ = E(X_1) = 1/p lies in the interval [S_n/n − 2.58 s_n/√n, S_n/n + 2.58 s_n/√n] with probability (approximately) 0.99; inverting the endpoints then gives a confidence interval for p. □
PROBLEM 20.8 (continuation of Problem 20.7). I sat down last night and tossed a coin lots and lots of times and stopped when I got tired. I observed that

    k    1   2   3   4   5   6   7   8   9   15
    n_k  31  20  19  8   10  2   2   3   4   1

meaning that there were exactly 31 i's for which X_i = 1, 20 i's for which X_i = 2, etc. What is the 0.99-confidence interval of p?
Answer. We compute the quantities needed, by first noticing that n = Σ_k n_k = 100:

    S_n/n = (1/n) Σ_{i=1}^n X_i = (1/100) Σ_k k n_k = 311/100 = 3.11,
    s_n² = (1/n) Σ_{i=1}^n X_i² − 3.11² = (1/100) Σ_k k² n_k − 3.11² = 15.71 − (3.11)² = 6.038.

Hence

    S_n/n ± 2.58 s_n/√n = 3.11 ± 0.63, that is, the interval [2.48, 3.74].

Hence the mean 1/p lies between 2.48 and 3.74 with probability (approximately) 0.99, so p lies between 1/3.74 ≈ 0.27 and 1/2.48 ≈ 0.40. □
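The arithmetic above can be reproduced in a few lines of Python (an illustration of ours, not part of the notes); the dictionary `counts` encodes the frequency table:

```python
# counts[k] = number of observations with X_i = k, from the table above.
counts = {1: 31, 2: 20, 3: 19, 4: 8, 5: 10, 6: 2, 7: 2, 8: 3, 9: 4, 15: 1}

n = sum(counts.values())                                    # 100 observations
mean = sum(k * nk for k, nk in counts.items()) / n          # S_n / n
second = sum(k * k * nk for k, nk in counts.items()) / n    # (1/n) sum X_i^2
s2 = second - mean ** 2                                     # s_n^2
half_width = 2.58 * s2 ** 0.5 / n ** 0.5                    # a * s_n / sqrt(n)

print(n, mean, round(s2, 3), round(half_width, 2))          # 100 3.11 6.038 0.63
# the (approximate) 0.99-confidence interval for the mean 1/p:
print(round(mean - half_width, 2), round(mean + half_width, 2))  # 2.48 3.74
```
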

PROBLEM 20.9 (a useful approximation for the normal distribution function). Show that if ϕ(x) is the density of a standard normal random variable Z, then

    1 − Φ(x) = P(Z > x) ≤ ϕ(x)/x.

Answer. We have

    P(Z > x) = ∫_x^∞ ϕ(y) dy.

The integral is over all y > x; that is, y/x ≥ 1. Hence

    P(Z > x) ≤ ∫_x^∞ (y/x) ϕ(y) dy = (1/x) ∫_x^∞ y ϕ(y) dy.

Since ϕ(y) = C e^{−y²/2} we have ϕ′(y) = −y ϕ(y). Hence

    ∫_x^∞ y ϕ(y) dy = −∫_x^∞ ϕ′(y) dy = −lim_{b→∞} ∫_x^b ϕ′(y) dy = −lim_{b→∞} [ϕ(b) − ϕ(x)] = ϕ(x).

Hence P(Z > x) ≤ (1/x) ϕ(x), as required. □
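A quick numerical sanity check of the bound (our own sketch, not part of the notes); `erfc` from the standard library gives the tail 1 − Φ(x) directly:

```python
import math

def phi(x):
    # Standard normal density.
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def upper_tail(x):
    # P(Z > x) = 1 - Phi(x), computed stably via the complementary error function.
    return 0.5 * math.erfc(x / math.sqrt(2.0))

for x in [0.5, 1.0, 2.0, 4.0]:
    bound = phi(x) / x
    tail = upper_tail(x)
    assert tail <= bound
    print(x, tail, bound)  # the bound gets relatively sharper as x grows
```
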


Chapter 21

Special distributions used in statistics

This chapter discusses some distributions that are important in statistics. I will try to explain why they are important.

21.1 The gamma(λ, α) distribution


    We define the distribution gamma(λ, n), for a positive real number λ and a positive integer n, as the distribution of X_1 + · · · + X_n when X_1, . . . , X_n are i.i.d. random variables with common distribution expon(λ).

    We define the distribution gamma(λ, α), for positive real numbers λ and α, as the distribution whose density is an analytic extension of the density of gamma(λ, n).

We will derive formulas for the densities of both distributions. But first, let us mention that
the word “analytic” means something much stronger than the phrase “it possesses derivatives
of all orders”.
How do we know that these two distributions possess a density? Well, we know that the sum of independent random variables with densities has a density that is the convolution of the individual densities. To make life simple, we first assume that λ = 1. Then f(x) = e^{−x}, x > 0, is a density for each of the variables X_1, X_2, . . . Hence the density of X_1 + X_2 is

    f^{∗2}(x) = (f ∗ f)(x) = ∫_0^x e^{−(x−y)} e^{−y} dy = x e^{−x}, x > 0.

The density of X_1 + X_2 + X_3 is

    f^{∗3}(x) = (f ∗ f^{∗2})(x) = ∫_0^x e^{−(x−y)} y e^{−y} dy = e^{−x} ∫_0^x y dy = (1/2) x² e^{−x}.

CHAPTER 21. SPECIAL DISTRIBUTIONS USED IN STATISTICS 254

The pattern soon becomes clear, and we therefore guess that X_1 + · · · + X_n has density

    f^{∗n}(x) = (1/(n − 1)!) x^{n−1} e^{−x}, x > 0.          (gamma(1, n) density)

PROBLEM 21.1 (correctness of the gamma(1, n) density). Use induction to show that the formula for the gamma(1, n) density is correct.
Answer. The formula is correct for n = 1. Suppose it is correct up to n − 1. We then just need to check that f ∗ f^{∗(n−1)} = f^{∗n}. We have

    (f ∗ f^{∗(n−1)})(x) = ∫_0^x e^{−(x−y)} (y^{n−2} e^{−y}/(n − 2)!) dy = (e^{−x}/(n − 2)!) ∫_0^x y^{n−2} dy = (e^{−x}/(n − 2)!) · x^{n−1}/(n − 1) = x^{n−1} e^{−x}/(n − 1)!.

And, indeed, the right side equals f^{∗n}. We're done. □


Now, since f^{∗n} is a probability density function, we have

    ∫_0^∞ f^{∗n}(x) dx = 1.

Substituting the formula from above, we learn that

    (n − 1)! = ∫_0^∞ x^{n−1} e^{−x} dx.

We then take a bold step and replace n in the exponent inside the integral by a real number α and make a definition.

Definition 21.1 (the gamma function). The gamma function is defined by

    Γ(α) := ∫_0^∞ x^{α−1} e^{−x} dx,

for those α ∈ R for which the integral is finite.

PROBLEM 21.2 (domain of the gamma function). Show that Γ(α) is defined for α > 0 and that we can differentiate it with respect to α as many times as we like.
Answer. Note that

    x^{α−1} e^{−x} ≤ x^{α−1} for all x > 0, while x^{α−1} e^{−x} ≤ e^{−x} for all x > 1.

Hence, for α > 0,

    ∫_0^∞ x^{α−1} e^{−x} dx ≤ ∫_0^1 x^{α−1} dx + ∫_1^∞ e^{−x} dx < ∞.

So Γ(α) < ∞ for all α > 0. (But Γ(0) = ∞.) Moreover, since x^{α−1} is a smooth function of α, so is Γ(α). □

Since we have

    1 = ∫_0^∞ (1/Γ(α)) x^{α−1} e^{−x} dx,

the function inside the integral is positive and has integral 1. Hence it is a probability density function. We therefore define

    f_α(x) := (1/Γ(α)) x^{α−1} e^{−x}, x > 0,          (gamma(1, α) density)

and call it the density of the gamma(1, α) distribution.


Suppose now that we take a general λ > 0, not just λ = 1. Recall that if X has the expon(1) distribution then X/λ has the expon(λ) distribution. Hence the gamma(λ, n) density is the density of (X_1 + · · · + X_n)/λ, which equals

    λ f^{∗n}(λx) = (λ^n/(n − 1)!) x^{n−1} e^{−λx}, x > 0.          (gamma(λ, n) density)

By analogy, the gamma(λ, α) density is

    λ f_α(λx) = (λ^α/Γ(α)) x^{α−1} e^{−λx}, x > 0.          (gamma(λ, α) density)

Since the latter is a density, its integral over the whole space must be 1, and so

    ∫_0^∞ x^{α−1} e^{−λx} dx = Γ(α)/λ^α, α > 0, λ > 0.          (21.1)

21.1.1 Further properties of the Γ function

1. Γ(n) = (n − 1)! if n is a positive integer.

2. Using integration by parts we can easily show that

    Γ(α) = (α − 1)Γ(α − 1).
PROBLEM 21.3 (gamma reproduction rule). Explain why Γ(α) = (α − 1)Γ(α − 1) when α > 1.
Answer. By Definition 21.1, and the fact that (d/dx) e^{−x} = −e^{−x}, we have

    Γ(α) = −∫_0^∞ x^{α−1} ((d/dx) e^{−x}) dx.

The integration by parts formula says that ∫_0^∞ f g′ dx = −∫_0^∞ f′ g dx if f(x)g(x) has value 0 (interpreted as a limit) at x = 0 and x = ∞. We apply this to f(x) = x^{α−1} and g(x) = e^{−x}. Since the exponential function drops to 0 faster than any power, we have f(x)g(x)|_{x=∞} = lim_{x→∞} x^{α−1} e^{−x} = 0. Since α − 1 > 0, we have f(x)g(x)|_{x=0} = 0^{α−1} e^{−0} = 0 · 1 = 0. Therefore,

    Γ(α) = ∫_0^∞ ((d/dx) x^{α−1}) e^{−x} dx = (α − 1) ∫_0^∞ x^{α−2} e^{−x} dx = (α − 1)Γ(α − 1). □

Since Γ(n) = (n − 1)! if n is a positive integer, the previous display is merely an extension of this property of the factorial function.

3. Another important property is that

    log Γ(α) is a convex function of α > 0.

See Emil Artin, The Gamma Function, 1964, Theorems 1.8 and 1.9.

4. We also have

    lim_{t→∞} Γ(a + t)/(Γ(t) t^a) = 1.

We will explain this in only one special case.


PROBLEM 21.4 (gamma asymptotics). Show that lim_{t→∞} Γ(a + t)/(Γ(t) t^a) = 1 when a is a positive integer.
Answer. We have

    Γ(a + t) = (a − 1 + t)Γ(a − 1 + t) = (a − 1 + t)(a − 2 + t)Γ(a − 2 + t) = · · · = (a − 1 + t)(a − 2 + t) · · · t · Γ(t).

So

    Γ(a + t)/(Γ(t) t^a) = ((a − 1 + t)/t) · ((a − 2 + t)/t) · · · ((1 + t)/t) · (t/t) → 1 as t → ∞,

because each of the a fractions converges to 1 as t → ∞. □
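Property 4 can also be observed numerically (our sketch, not part of the notes); log-gamma keeps the computation from overflowing for large t:

```python
import math

def ratio(a, t):
    # Gamma(a + t) / (Gamma(t) * t**a), computed through log-gamma so that
    # it does not overflow even when t is very large.
    return math.exp(math.lgamma(a + t) - math.lgamma(t) - a * math.log(t))

for t in [10.0, 100.0, 10000.0, 1e6]:
    print(t, ratio(3, t), ratio(0.5, t))  # both columns approach 1
```
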

21.2 The χ²(d) distribution

    We define the distribution χ²(d) as the distribution of Z_1² + · · · + Z_d², where Z_1, . . . , Z_d are i.i.d. standard normals.

We claim that the distribution χ²(d) has a density given by the formula

    f_d(x) = (1/(2^{d/2} Γ(d/2))) x^{(d/2)−1} e^{−x/2}.          (χ²(d) density)

This is quite easy to see when d is even.



?PROBLEM 21.5 (density for the χ²(d) distribution when d is even). Derive the density for the χ²(d) distribution when d is even.
Answer. From Section 18.7, we know that the d/2 random variables

    Z_1² + Z_2², Z_3² + Z_4², . . . , Z_{d−1}² + Z_d²

are i.i.d., each one expon(1/2). Their sum is therefore gamma(1/2, d/2). So if we plug λ = 1/2 and n = d/2 into the formula for the gamma(λ, n) density we find the χ²(d) density announced above. □
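A numerical illustration of ours (not in the notes): for even d the gamma(1/2, d/2) distribution function has a closed Erlang form, and a seeded Monte Carlo estimate of P(Z_1² + · · · + Z_d² ≤ x) agrees with it:

```python
import math
import random

def chi2_cdf_even(d, x):
    # For even d, chi^2(d) = gamma(1/2, d/2) is an Erlang distribution:
    # P(X <= x) = 1 - exp(-x/2) * sum_{j < d/2} (x/2)^j / j!
    s = sum((x / 2.0) ** j / math.factorial(j) for j in range(d // 2))
    return 1.0 - math.exp(-x / 2.0) * s

rng = random.Random(0)
d, x, trials = 4, 3.0, 200000
hits = 0
for _ in range(trials):
    if sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d)) <= x:
        hits += 1
print(hits / trials, chi2_cdf_even(d, x))  # the two numbers agree closely
```
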

?PROBLEM 21.6 (density for the χ²(d) distribution for general d). Derive the formula for the density of the χ²(d) probability measure.
Answer. If d is odd then d − 1 is even. So

    Z_1² + · · · + Z_d² = (Z_1² + · · · + Z_{d−1}²) + Z_d²

has the distribution of the sum of two independent random variables, one with gamma(1/2, (d − 1)/2) distribution and one distributed as the square of a N(0, 1) random variable. To find the density of the sum, we perform a convolution, as in Section 19.6, and arrive again at the formula for χ²(d) announced above. □

21.3 Degrees of freedom

The number d in χ2 (d) is called “degrees of freedom”. This is a word that means “dimension”.
And there is more to it than meets the eye. Suppose we have X = (X1 , . . . , Xd ) that is normal
in Rd . Remember the definition of the covariance matrix

R = cov(X),

which, along with the expectations of X1 , . . . , Xd , which will be assumed to be zero, determines
the distribution of X. The matrix R is symmetric but, in general, it is not invertible. But Linear
Algebra tells us that there is a matrix S of dimension d × r, where

r = rank(R)

is the rank of R, such that

    R = SS^T.
If we then let Z = (Z1 , . . . , Zr ) be r i.i.d. standard normal random variables, we see that

X = SZ

in the sense that X and SZ have the same distribution. Indeed,

    cov(SZ) = E[(SZ)(SZ)^T] = S E[ZZ^T] S^T = SS^T = R.

We then say that X has r degrees of freedom because X takes values in a linear space of
dimension r. This space is precisely the set

{Ru : u ∈ Rd }.

We then have

    X_1² + · · · + X_d² = Σ_{j=1}^r λ_j Z_j²,          (eig)

where λ_1, . . . , λ_r are the nonzero eigenvalues of R.

    Let Z_1, . . . , Z_r be i.i.d. N(0, 1) r.v.s and λ_1, . . . , λ_r positive numbers. The distribution of Σ_{j=1}^r λ_j Z_j² is called χ²(r; λ_1, . . . , λ_r).

Of course, χ²(d; 1, . . . , 1) ≡ χ²(d).

PROBLEM 21.7 (density of χ²(4; a, a, b, b)). Derive the density of χ²(4; a, a, b, b).
Answer. The idea is this. χ²(4; a, a, b, b) is the distribution of X = a(Z_1² + Z_2²) + b(Z_3² + Z_4²), where Z_1, Z_2, Z_3, Z_4 are i.i.d. N(0, 1) r.v.s. But S = Z_1² + Z_2² and T = Z_3² + Z_4² are i.i.d. expon(1/2). Hence aS is expon(1/2a) and bT is expon(1/2b). So the density of aS is (1/2a)e^{−x/2a} 1_{x>0} and the density of bT is (1/2b)e^{−x/2b} 1_{x>0}. Thus the density of X = aS + bT is a convolution:

    f(x) = ∫_0^x (1/2a)e^{−(x−y)/2a} (1/2b)e^{−y/2b} dy.

We derive

    f(x) = (1/(2a − 2b)) ( e^{−x/2a} − e^{−x/2b} ), if a ≠ b,
    f(x) = (x/4a²) e^{−x/2a}, if a = b. □
21.4 The F(m, n) distribution


Definition 21.2 (scaling parameter). We say that θ ∈ R is a scaling parameter of a probability
measure Qθ on R if there is a random variable X such that θX has law Qθ .

Observe that the parameter d in the χ²(d) distribution is not a scaling parameter. (In fact, the dependence on d of the density f_d(x) of the χ²(d) distribution is not simple.) So if X is a random variable with χ²(d) distribution then, even though E(X/d) = 1, the distribution of X/d still depends on d. Nevertheless, we do like X̂ := X/d better than X (because its expectation becomes independent of d).
In statistics, people use ratios of independent random variables, each being a normalized chi-squared variable. We therefore give a special name to such a ratio.

    Let U_m, V_n be two independent random variables, with densities χ²(m), χ²(n), respectively. The F(m, n) distribution is the distribution of the random variable

        W_{m,n} := ((1/m)U_m) / ((1/n)V_n) = Û_m/V̂_n.          (21.2)

To compute the density of W_{m,n}, first recall the density f_d(x) of the χ²(d) distribution:

    f_d(x) = C_d x^{d/2−1} e^{−x/2}, where C_d = (2^{d/2} Γ(d/2))^{−1}.

We therefore have

    P(W_{m,n} ≤ x) = P( (1/m)U_m ≤ x (1/n)V_n ) = E[ P( U_m ≤ (m/n) x V_n | V_n ) ],

and so the density of W_{m,n} is

    f_{m,n}(x) = (d/dx) P(W_{m,n} ≤ x) = E[ (m/n) V_n f_m((m/n) x V_n) ] = ∫_0^∞ (m/n) y f_m((m/n) xy) f_n(y) dy.

We can make this a bit tidier by replacing y by ny in the integral (change of variables):

    f_{m,n}(x) = mn ∫_0^∞ y f_m(mxy) f_n(ny) dy.

And now substitute the expressions for f_m, f_n and do some algebra:

    f_{m,n}(x) = mn C_m C_n ∫_0^∞ y (mxy)^{m/2−1} (ny)^{n/2−1} e^{−mxy/2} e^{−ny/2} dy
               = mn C_m C_n m^{m/2−1} n^{n/2−1} x^{m/2−1} ∫_0^∞ y^{(m+n)/2−1} e^{−((mx+n)/2) y} dy.

But we have already computed the last integral! It is given by (21.1):

    ∫_0^∞ y^{(m+n)/2−1} e^{−((mx+n)/2) y} dy = Γ((m+n)/2) / ((mx+n)/2)^{(m+n)/2}.

We now substitute this and the expressions for C_m, C_n to get

    f_{m,n}(x) = mn · (1/(2^{m/2} Γ(m/2))) · (1/(2^{n/2} Γ(n/2))) · m^{m/2−1} n^{n/2−1} x^{m/2−1} · Γ((m+n)/2)/((mx+n)/2)^{(m+n)/2}
               = (Γ((m+n)/2)/(Γ(m/2)Γ(n/2))) m^{m/2} n^{n/2} x^{m/2−1} (mx + n)^{−(m+n)/2}.

We can shuffle the terms around until we reach a more symmetric form, just so it is more pleasing to the eye:

    f_{m,n}(x) = (Γ((m+n)/2) m^{m/2} n^{n/2} / (Γ(m/2)Γ(n/2))) · (m + n x^{−1})^{−m/2} (mx + n)^{−n/2} / x, x > 0.          (F(m, n) density)
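As a sanity check (ours, not in the notes), the F(m, n) density in its symmetric form should integrate to 1; a crude midpoint rule confirms this:

```python
import math

def f_F(x, m, n):
    # The F(m, n) density in the symmetric form derived above.
    c = (math.gamma((m + n) / 2.0) * m ** (m / 2.0) * n ** (n / 2.0)
         / (math.gamma(m / 2.0) * math.gamma(n / 2.0)))
    return c * (m + n / x) ** (-m / 2.0) * (m * x + n) ** (-n / 2.0) / x

m, n = 5, 7
h, upper = 0.001, 400.0
mass = h * sum(f_F((i + 0.5) * h, m, n) for i in range(int(upper / h)))
print(mass)  # very close to 1
```
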

PROBLEM 21.8 (moments for F(m, n)). Let W_{m,n} be a random variable whose law is F(m, n).
(1) Let k be a positive integer. For which values of m, n does W_{m,n} have finite k-th moment?
(2) For which values of m, n does W_{m,n} have a non-useless moment generating function?
Answer. (1) Notice that for large x we can ignore x^{−1} from the term (m + nx^{−1})^{−m/2} and write mx + n ≈ mx, so

    f_{m,n}(x) ≈ const. x^{−n/2−1}, when x is large.

This is sloppy, but we can easily see that

    lim_{x→∞} f_{m,n}(x) x^{n/2+1} > 0,

so the rate of convergence (see Section 20.1) of lim_{x→∞} f_{m,n}(x) = 0 is x^{−n/2−1}. On the other hand,

    f_{m,n}(x) ≈ const. x^{m/2−1}, when x is small.

We now have

    E[W_{m,n}^k] < ∞ ⟺ ∫_0^1 x^k f_{m,n}(x) dx < ∞ and ∫_1^∞ x^k f_{m,n}(x) dx < ∞.

The first integral is always finite. But

    ∫_1^∞ x^k f_{m,n}(x) dx < ∞ ⟺ ∫_1^∞ x^k x^{−n/2−1} dx < ∞.

We know that ∫_1^∞ x^{−p} dx < ∞ ⟺ p > 1 so, letting p = n/2 + 1 − k, we have p > 1 ⟺ k < n/2. We thus have

    E[W_{m,n}^k] < ∞ ⟺ k < n/2.

There is no restriction on m.
(2) Since, for every m and n, there is always a k for which E[W_{m,n}^k] = ∞, the moment generating function is useless for all m and n. □

Remark 21.1. The random variable Û_m = (1/m)U_m = (1/m) Σ_{j=1}^m Z_j² can be interpreted as the sample mean of Z_1², . . . , Z_m², where the Z_j are i.i.d. N(0, 1). Since EZ_j = 0, we can also interpret Û_m as the sample variance of Z_1, . . . , Z_m. Similarly for V̂_n. Hence W_{m,n} is the ratio of two independent sample variances.

Remark 21.2. For uses of the F(m, n) distribution in statistics see David Williams, Weighing
the Odds, 2012; page 301, “the classical F-test” .

?PROBLEM 21.9 (limit of F(m, n) when n → ∞). (1) Explain why

    P( lim_{n→∞} W_{m,n} = (1/m)U_m ) = 1.

(2) Use this, or otherwise, to show that

    lim_{n→∞} f_{m,n}(x) = ((m/2)^{m/2}/Γ(m/2)) x^{m/2−1} e^{−mx/2}.

Answer. (1) We look at the definition (21.2). We apply the strong law of large numbers to the denominator:

    P( lim_{n→∞} (1/n)V_n = 1 ) = 1.

This is because (1/n)V_n = (1/n)(Z_1² + · · · + Z_n²), where Z_1, . . . , Z_n are i.i.d. N(0, 1), and so the strong law of large numbers says that (1/n)(Z_1² + · · · + Z_n²) converges to EZ_1² = 1 with probability 1. Since, with probability 1, the denominator of (21.2) converges to 1, and since the numerator does not depend on n, we have that W_{m,n} converges to (1/m)U_m with probability 1.
(2) Since strong convergence implies convergence in distribution, see (20.2), we have that

    W_{m,n} → (1/m)U_m as n → ∞ in distribution.

Therefore the probability distribution function of W_{m,n} converges to the probability distribution function of (1/m)U_m. We actually have that the density function of W_{m,n} converges to the density function of (1/m)U_m. But U_m has χ²(m) density:

    f_m(x) = (1/(2^{m/2}Γ(m/2))) x^{m/2−1} e^{−x/2}.

We copied this formula from (χ²(d) density). But then (1/m)U_m has density m f_m(mx). So

    lim_{n→∞} f_{m,n}(x) = m f_m(mx) = ((m/2)^{m/2}/Γ(m/2)) x^{m/2−1} e^{−mx/2}. □
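The limit in Problem 21.9 can be observed numerically (our sketch, not part of the notes); the density is evaluated through log-gamma so that large n does not overflow:

```python
import math

def log_f_F(x, m, n):
    # Logarithm of the F(m, n) density, via lgamma to avoid overflow for large n.
    return (math.lgamma((m + n) / 2.0) - math.lgamma(m / 2.0) - math.lgamma(n / 2.0)
            + (m / 2.0) * math.log(m) + (n / 2.0) * math.log(n)
            - (m / 2.0) * math.log(m + n / x)
            - (n / 2.0) * math.log(m * x + n) - math.log(x))

def f_limit(x, m):
    # The claimed n -> infinity limit: the density of (1/m) chi^2(m).
    return ((m / 2.0) ** (m / 2.0) / math.gamma(m / 2.0)
            * x ** (m / 2.0 - 1.0) * math.exp(-m * x / 2.0))

m, x = 4, 1.3
for n in [10, 100, 1000]:
    print(n, math.exp(log_f_F(x, m, n)), f_limit(x, m))  # first column approaches second
```
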


?PROBLEM 21.10 (limit of F(m, n) when m → ∞). Compute the limit of f_{m,n}(x) as m → ∞.
Answer. From the strong law of large numbers we have

    P( lim_{m→∞} W_{m,n} = 1/(V_n/n) ) = 1.

Therefore,

    W_{m,n} → n/V_n as m → ∞ in distribution.

So

    lim_{m→∞} P(W_{m,n} ≤ x) = P(V_n ≥ n/x) = 1 − P(V_n ≤ n/x).

But V_n has χ²(n) density:

    f_n(x) = (1/(2^{n/2}Γ(n/2))) x^{n/2−1} e^{−x/2}.

We copied this formula from (χ²(d) density). Differentiating with respect to x,

    lim_{m→∞} f_{m,n}(x) = (n/x²) f_n(n/x) = ((n/2)^{n/2}/Γ(n/2)) x^{−n/2−1} e^{−n/(2x)}. □

21.5 Decoupling of sample mean and sample variance

Let X_1, X_2, . . . be i.i.d. random variables with µ = EX_1 and σ² = var(X_1). Define

    M_n := (1/n) Σ_{i=1}^n X_i,   V_n := (1/n) Σ_{i=1}^n (X_i − M_n)².

From the strong law of large numbers we have that

    P( lim_{n→∞} (1/n) Σ_{i=1}^n X_i exists and is equal to µ ) = 1.

From the strong law of large numbers again we have that

    P( lim_{n→∞} (1/n) Σ_{i=1}^n (X_i − µ)² exists and is equal to σ² ) = 1.

Since the probability of the intersection of two events that have probability 1 also has probability 1, we have

    P( lim_{n→∞} M_n = µ, lim_{n→∞} (1/n) Σ_{i=1}^n (X_i − µ)² = σ² ) = 1.          (21.3)

Let us rewrite V_n as

    V_n = (1/n) Σ_{i=1}^n (X_i − µ − (M_n − µ))² = (1/n) Σ_{i=1}^n (X_i − µ)² − (M_n − µ)² = (1/n) Σ_{i=1}^n (X_i − µ)² − ( (1/n) Σ_{i=1}^n (X_i − µ) )².

Therefore,

    P( lim_{n→∞} V_n = σ² ) = 1.

Notice that

    EV_n = (1/n) Σ_{i=1}^n E[(X_i − µ)²] − (1/n²) E[ ( Σ_{i=1}^n (X_i − µ) )² ] = (1/n) · nσ² − (1/n²) · nσ² = ((n − 1)/n) σ².

Hence if we let

    Ṽ_n := (1/(n − 1)) Σ_{i=1}^n (X_i − M_n)²,

we have

    P( lim_{n→∞} Ṽ_n = σ² ) = 1,   EṼ_n = σ².

    If X_1, X_2, . . . , X_n are n real-valued random variables, the quantity M_n := (1/n) Σ_{i=1}^n X_i is called the sample mean, and the quantity Ṽ_n := (1/(n − 1)) Σ_{i=1}^n (X_i − M_n)² is called the unbiased sample variance.

We now have:

    If X_1, X_2, . . . , X_n are i.i.d. N(µ, σ²) random variables then
    1) M_n and Ṽ_n are independent;
    2) M_n is N(µ, σ²/n);
    3) ((n − 1)/σ²) Ṽ_n is χ²(n − 1).

To see 2) is very easy: indeed, M_n is a linear combination of independent normals; hence it is normal; it has EM_n = µ and var M_n = var( (1/n) Σ_{j=1}^n X_j ) = (1/n²) Σ_{j=1}^n σ² = σ²/n.
To see 1) we let

    Y = (X_1 − M_n, . . . , X_n − M_n),

and observe that Y is normal in R^n since it has been obtained by linear operations from (X_1, . . . , X_n), which is already normal in R^n by definition. We also observe that

    Σ_{i=1}^n (X_i − M_n)² = Σ_{j=1}^n Y_j²,

hence the unbiased sample variance Ṽ_n is a function of Y. If we can show that

    claim: Y and M_n are independent,

then we will have shown that Ṽ_n and M_n are independent. To show the claim, it is enough to show that Y and M_n are uncorrelated (because they are jointly normal). But this is easy:

    cov(Y_j, M_n) = cov(X_j − M_n, M_n) = cov(X_j, M_n) − cov(M_n, M_n) = E[(X_j − µ)(M_n − µ)] − E[(M_n − µ)²].

But Mn − µ = n1 nk=1 (Xk − µ). So if we multiply this by X j − µ and then take expectation, we
P
see that all terms in the sum except the one corresponding to k = j have expectation zero.
Hence
1 σ2
E(X j − µ)(Mn − µ) = E(X j − µ)2 = .
n n
Similarly, in taking the square of Mn we obtain n12 times the sum of the squares of all terms
plus cross-products; the latter have expectation zero. So
n
1 X 1 σ2
E(Mn − µ) = 2 2
E(X j − µ)2 = 2 nσ2 = .
n n n
k=1

Hence, indeed, cov(Y_j, M_n) = 0 for all j, and therefore the claim that Y and M_n are independent is true, and therefore Ṽ_n and M_n are independent.
To see 3) we observe that the components of Y add up to zero. Hence if we let

    W := { y = (y_1, . . . , y_n) ∈ R^n : y_1 + · · · + y_n = 0 },

then we have

    P(Y ∈ W) = 1.

On the other hand, W is a linear space of dimension n − 1. Now remember formula (eig) of Section 21.3. Apply it with d = n and r = n − 1 to get

    Y_1² + · · · + Y_n² = Σ_{j=1}^{n−1} λ_j Z_j²,

where Z_1, . . . , Z_{n−1} are i.i.d. standard normal random variables and λ_1, . . . , λ_{n−1} are positive numbers, being the nonzero eigenvalues of the matrix cov(Y). But observe that all off-diagonal entries of this matrix are the same and all diagonal entries are the same. Hence all nonzero eigenvalues must be the same. So

    Y_1² + · · · + Y_n² = λ Σ_{j=1}^{n−1} Z_j².

To find λ we take expectation of both sides. Since the left-hand side equals (n − 1)Ṽ_n and since we already know that EṼ_n = σ², we have E(Y_1² + · · · + Y_n²) = (n − 1)σ². On the other hand, the expectation of the right-hand side equals λ(n − 1). Hence

    λ = σ²

and so

    ((n − 1)/σ²) Ṽ_n = (Y_1² + · · · + Y_n²)/σ² = Σ_{j=1}^{n−1} Z_j².

We're done.

21.6 The t(n) distribution

We next define:

    If X_1, X_2, . . . , X_n are i.i.d. N(µ, σ²) random variables, then the t(n) distribution is the distribution of the ratio

        T_n := ( (√n/σ)(M_n − µ) ) / √( Ṽ_n/σ² ) = ( (1/√n) Σ_{j=1}^n (X_j − µ) ) / √( (1/(n−1)) Σ_{i=1}^n (X_i − M_n)² ) ≡ U/V.          (21.4)

Preliminary observations and computations.

1) Our probability space here can be taken to be R^n and our probability measure is P_σ (the product of n copies of N(µ, σ²)). Moreover, if ω = (ω_1, . . . , ω_n) ∈ R^n, we have chosen X_j(ω) = ω_j for all j. Then ω ↦ T_n(ω) is a random variable that does not depend on σ. Hence P_{σ_1}(T_n ≤ t) = P_{σ_2}(T_n ≤ t) for all t and all σ_1, σ_2 > 0. Hence T_n is a statistic for σ (or for P_σ), a concept defined in Section 20.5.

2) The numerator in the formula for T_n is independent of the denominator, and the latter is positive.

3) The numerator

    U := (√n/σ)(M_n − µ) has N(0, 1) distribution.

In other words,

    U has density f_U(x) = (2π)^{−1/2} e^{−x²/2}, x ∈ R.

4) The denominator

    V := (1/σ)√Ṽ_n = √( (1/(n−1)) · (n − 1)Ṽ_n/σ² ) =: √( W/(n − 1) ) has easily found distribution,
    because W := (n − 1)Ṽ_n/σ² has χ²(n − 1) distribution.

So W has density x ↦ f_{n−1}(x), given by the formula (χ²(d) density) for f_d with d = n − 1 in Section 21.2.
Therefore W/(n − 1) has density (n − 1) f_{n−1}((n − 1)x).
Since V is the image of the random variable W/(n − 1) under the map x ↦ √x = v, we can easily find the density of V:

    V has density f_V(x) = (n − 1) f_{n−1}((n − 1)x²) 2x.

Finishing up. We have

    T_n = U/V,

where U and V are independent, P(V > 0) = 1, and both have known densities. So we can proceed exactly as in Section 21.4 to see that the density of T_n is

    f_{t(n)}(x) = E[V f_U(Vx)] = ∫_0^∞ v f_U(vx) f_V(v) dv.

?PROBLEM 21.11 (the t(n) density). Compute the last integral to show that

    f_{t(n)}(x) = (Γ((n+1)/2) / (√(πn) Γ(n/2))) (1 + x²/n)^{−(n+1)/2}, x ∈ R.          (t(n) density)

?PROBLEM 21.12 (t(1) = standard Cauchy). Explain why t(1) is the density of the standard Cauchy distribution.
Answer. Setting n = 1 in f_{t(n)}(x) we obtain

    f_{t(1)}(x) = (1/π) · 1/(1 + x²), x ∈ R. □
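A quick check of ours (not in the notes) that the t(n) density formula is consistent with Problems 21.12 and 21.13: at n = 1 it coincides with the standard Cauchy density, and for large n it approaches the standard normal density at 0:

```python
import math

def f_t(x, n):
    # The t(n) density from Problem 21.11.
    return (math.gamma((n + 1) / 2.0)
            / (math.sqrt(math.pi * n) * math.gamma(n / 2.0))
            * (1.0 + x * x / n) ** (-(n + 1) / 2.0))

def cauchy(x):
    # Standard Cauchy density.
    return 1.0 / (math.pi * (1.0 + x * x))

for x in [0.0, 1.0, 2.5]:
    print(x, f_t(x, 1), cauchy(x))                        # the two columns coincide
print(f_t(0.0, 200), 1.0 / math.sqrt(2.0 * math.pi))      # t(n) -> N(0, 1) as n grows
```
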
?PROBLEM 21.13 (t(∞) = N(0, 1)). Explain why

    lim_{n→∞} f_{t(n)}(x) = (2π)^{−1/2} e^{−x²/2}.

Answer. From (21.4), the denominator in the first expression for T_n converges to 1 as n → ∞, with probability 1. The numerator is a N(0, 1) random variable; let's call it Z. So T_n =^d Z/D_n, where P(D_n → 1) = 1. Hence

    T_n → Z as n → ∞ in distribution.

Hence the density of T_n converges to the density of Z. □

Figure 21.1: ©in public domain; graphic art done by the author of these notes, just to make the
notes appear friendly and easy.

The distribution is known as the "Student" distribution because its inventor, William Sealy Gosset, was modest and referred to himself as "Student". Gosset invented the t(n) distribution in trying to address a beer problem in his workplace, the Guinness brewery in Dublin, in 1908. Back then, Dublin was part of the United Kingdom. Ireland became independent 14 years later. Guinness is a great stout. But I don't like it warm. I prefer it extra cold.

PROBLEM 21.14 (moments of t(n)). Let T_n have t(n) distribution. Explain when E[T_n^k] < ∞.
Answer. Notice that the density is a symmetric function. So all odd moments are zero, provided they exist. If n > 1, then E[T_n^k] < ∞ ⟺ k < n.
Chapter 22

 Random objects

Chapter 23

Bernoulli trials and the Poisson point process

23.1 Bernoulli trials on a general index set


Let T be a discrete set (finite or countable). Consider a collection (Xt , t ∈ T) of i.i.d. random
variables such that Xt is Ber(p) for all t ∈ T. The distribution of this collection will be called

BernoulliTrials(T, p)

PROBLEM 23.1 (Bernoulli trials with general index set). If T = {1, . . . , k} then this is the BernoulliTrials(k, p) distribution. If T = N then it is the BernoulliTrials(∞, p) distribution. We talked about these two extensively. We could take T = Z as well. Or we could choose T = Z × Z. □

In this chapter we shall, for each positive integer n, consider the set

    T_n = (1/n)Z = { . . . , −2/n, −1/n, 0, 1/n, 2/n, 3/n, . . . }

of all rational numbers with denominator n, let λ be a positive real number, and let p = λ/n. We will focus on the distributions

    Q_n = BernoulliTrials((1/n)Z, λ/n),

one for each n. We can think of (1/n)Z as an infinite set of boxes, each box containing a Ber(p) random variable, such that the collection of them is an independent collection. If you think of R as time, then you can think of Q_n as modeling a transmitter that attempts to transmit every 1/n time units. The transmission is successful at the times i/n when X_{i/n} = 1. Think of a successful transmission as light: you see an instantaneous light whenever the transmission is successful, and darkness otherwise. This could look like the figure below. The line on both figures is a copy of the real line. The figure on top represents Q_{10} and the one on the bottom Q_{50}. At the top line the transmitter transmits 10 times per second and at the bottom at a rate of 50 times a second. Light is represented by red, darkness by black. In both lines we see roughly the same
CHAPTER 23.  BERNOULLI TRIALS AND THE POISSON POINT PROCESS 269

Figure 23.1: Both lines represent time. The transmitter of the bottom line transmits 5 times faster
than the transmitter of the top line. The probability that the bottom transmits red light is 1/5 the
probability that the top transmitter does so. Therefore the average number of red lines on a given
interval of time is the same for both lines.

number of successful transmissions. This is because the probability of successful transmission at the bottom line is 5 times smaller than at the top line.
We can verify the intuition above by taking an interval I ⊂ R and considering the set of times at which there was a successful transmission:

    R_n(I) := { k/n ∈ I : X_{k/n} = 1 }.

Let S_n(I) be the size of this set. For concreteness, take I = (a, b], where a < b are both real numbers:

    S_n(I) = S_n((a, b]) = #{ k/n ∈ I : X_{k/n} = 1 }.

But

    k/n ∈ (a, b] ⟺ na < k ≤ nb ⟺ [na] < k ≤ [nb],

because k is an integer. Hence

    S_n((a, b]) has bin([nb] − [na], λ/n) distribution.

This means that

    E(S_n((a, b])) = ([nb] − [na]) λ/n → λ(b − a), as n → ∞.
This justifies the picture and intuition above. Performing a computation we obtain

    lim_{n→∞} P(S_n((a, b]) = k) = ((λ(b − a))^k / k!) e^{−λ(b−a)}.
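The binomial-to-Poisson limit above can be watched numerically (our illustration, not part of the notes); the interval length is taken to be 1 so that λ|I| = λ:

```python
import math

def binom_pmf(k, n, p):
    # P(bin(n, p) = k)
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)

def pois_pmf(k, lam):
    # P(Poi(lam) = k)
    return lam ** k * math.exp(-lam) / math.factorial(k)

lam = 2.0  # so an interval of length 1 receives, in the limit, Poi(2) points
for n in [10, 100, 10000]:
    p = lam / n
    gap = max(abs(binom_pmf(k, n, p) - pois_pmf(k, lam)) for k in range(8))
    print(n, gap)  # the discrepancy shrinks as n grows
```
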
Suppose now that I_1, I_2, . . . , I_m are m disjoint bounded intervals. Then R_n(I_1), R_n(I_2), . . . , R_n(I_m) are independent and so S_n(I_1), S_n(I_2), . . . , S_n(I_m) are also independent. Hence,

    lim_{n→∞} P(S_n(I_1) = k_1, . . . , S_n(I_m) = k_m) = lim_{n→∞} P(S_n(I_1) = k_1) · · · lim_{n→∞} P(S_n(I_m) = k_m)
        = ((λ|I_1|)^{k_1}/k_1!) e^{−λ|I_1|} · · · ((λ|I_m|)^{k_m}/k_m!) e^{−λ|I_m|}.

So if N(I_1), . . . , N(I_m) are independent such that N(I_j) is Poi(λ|I_j|) for j = 1, 2, . . . , m, the above says

    lim_{n→∞} P(S_n(I_1) = k_1, . . . , S_n(I_m) = k_m) = P(N(I_1) = k_1, . . . , N(I_m) = k_m).

We next look at the actual times at which we have successful transmissions (the red points
of the picture). They are the same, roughly, over a fixed period of time, regardless if n. What
happens then is that the probability of successful transmission is so small when n is large, and
this cancels out the large rate of transmission. For instance, let Gn (1) be the first successful
transmission after time 0:
1 1
Γn (1) = min{kn > 0 : Xk/n = 1} = min{k ≥ 1 : Xk/n = 1} = Gn (1),
n n
where Gn (1) is a geo(λ)n random variable. But we showed in Section 13.4 that

P(Γn (1) > t) = P( n1 Gn (1) > t) → e−λt , as n → ∞.

Let Γn ( j), j ∈ Z be the locations of the j such that X j/n = 1. It’s obvious that Γn (1), Γn (2)−Γn (1), . . .
are independent random variables, and the distribution of each one converges to that of an
expon(λ) random variable. So, then it is reasonable to expect that, in the limit as n → ∞, the
successful transmissions are located at points 0 < T1 < T2 < · · · where T1 , T2 − T1 , . . . are i.i.d.
expon(λ) random variables. The intuition is correct and we will partially verify it.
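The geometric-to-exponential convergence used here is easy to see numerically. A small sketch (the values λ = 1.5, t = 2 are arbitrary): P(Γn (1) > t) = (1 − λ/n)^⌊nt⌋, which should approach e^{−λt} as n grows.

```python
import math

lam, t = 1.5, 2.0
# P(Gamma_n(1) > t) = P(G_n(1) > nt) = (1 - lam/n)^{floor(nt)}
vals = [(1 - lam / n) ** math.floor(n * t) for n in (10, 100, 10_000)]
limit = math.exp(-lam * t)
```

The successive values get closer and closer to e^{−3} ≈ 0.0498.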

23.2 The Poisson construction


First of all, let us “construct” the limit. Let

τ j , j ∈ Z, be a collection of i.i.d. expon(λ) random variables.

Define

Tk := Σ_{j=1}^{k} τ j ,    if k = 1, 2, . . . ,
Tk := −Σ_{j=k}^{0} τ j ,   if k = 0, −1, −2, . . .                (23.1)

Figure 23.2: This figure is the limiting situation of what we saw in the previous figure, when the
transmission rate goes to infinity but the probability that light is transmitted goes to 0 inversely
proportional to the transmission rate. To construct this limit, we simply construct the points of
time Tk , k ∈ Z, at which light is transmitted. These points are given by formula (23.1).

This is supposed to represent the limit. A transmitter transmits successfully at the times Tk only.
Let N(I) be the number of successful transmissions on the bounded interval I. We expect that
if I1 , . . . , Im are disjoint bounded intervals then N(I1 ), . . . , N(Im ) should be independent and
such that N(I j ) is Poi(λ|I j |). This is actually true. One might exclaim: “but you proved it”. That
is almost right: with a bit more advanced mathematics, the limits calculated above would
constitute a proof. Let us, however, prove it directly by proving that N(I) is
Poi(λ|I|). I will take I = (0, t] to begin. Then, for t > 0 and k = 1, . . ., we have

N((0, t]) = k ⇐⇒ Tk ≤ t < Tk+1 = Tk + τk+1 . (23.2)



(This is true even for k = 0 because T0 ≤ 0, by definition.) Hence we have

P(Tk ≤ t < Tk + τk+1 |Tk = s) = P(τk+1 > t − s)1_{t≥s} = e^{−λ(t−s)} 1_{t≥s} ,

and so

P(Tk ≤ t < Tk+1 = Tk + τk+1 ) = ∫_0^t e^{−λ(t−s)} fTk (s)ds.
Since, for k ≥ 1, Tk is the sum of k independent expon(λ) random variables, Tk is a gamma(λ, k)
random variable whose density was discovered in Section 21.1:

fTk (s) = (λ^k /(k − 1)!) s^{k−1} e^{−λs} .

Substituting above gives, for k ≥ 1,


P(N((0, t]) = k) = e^{−λt} (λ^k /(k − 1)!) ∫_0^t s^{k−1} ds = e^{−λt} (λ^k /(k − 1)!) (t^k /k) = e^{−λt} (λt)^k /k!

(and hence for k = 0 as well) which means that

N((0, t]) is a Poi(λt) random variable,

as claimed.
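The integral computed above can also be checked numerically. A minimal sketch (parameter values λ = 2, t = 1.5, k = 4 are arbitrary choices): a midpoint rule applied to ∫_0^t e^{−λ(t−s)} fTk (s) ds should reproduce the Poi(λt) probability of k.

```python
import math

lam, t, k = 2.0, 1.5, 4

def integrand(s):
    # e^{-lam(t-s)} times the gamma(lam, k) density of T_k at s
    f_Tk = lam ** k / math.factorial(k - 1) * s ** (k - 1) * math.exp(-lam * s)
    return math.exp(-lam * (t - s)) * f_Tk

N = 50_000                      # midpoint rule with N subintervals
h = t / N
integral = h * sum(integrand((i + 0.5) * h) for i in range(N))
poisson_pmf = (lam * t) ** k * math.exp(-lam * t) / math.factorial(k)
```

The two numbers agree to many decimal places, in accordance with the computation in the text.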
Let us now explain that N(I) has Poi(λ|I|) distribution for any interval I. Take I = (t, t + ℓ],
an interval with length ℓ. We shall reduce the computation to the previous one by looking
closely at the points Tk that are in this interval. Consider the smallest of these points that
exceeds t. If we set, for brevity,
Nt := N((0, t]),
we have, from (23.2)
TNt ≤ t < TNt +1 , t > 0.
(Simply let k = Nt in (23.2). Since the left side is a tautology, it follows that the right side is
true for all t > 0.) Therefore: for t > 0, TNt +1 is the smallest of the points Tk that exceeds t.

Figure 23.3: For t > 0, TNt +1 is the smallest of the points Tk that exceeds t.

We just need to show that

(TNt +1 − t, TNt +2 − t, . . .) has the same distribution as (T1 , T2 , . . .)


and is independent of (TNt , TNt −1 , . . .).

I will just show that TNt +1 − t has the same distribution as T1 = τ1 and is independent of TNt .
We have

P(TNt +1 − t > u|TNt ) = Σ_{k=0}^{∞} P(TNt +1 − t > u|TNt , Nt = k) P(Nt = k|TNt )
                      = Σ_{k=0}^{∞} P(Tk+1 − t > u|Tk , Nt = k) P(Nt = k|TNt )
                      = Σ_{k=0}^{∞} P(Tk+1 − t > u|Tk , Tk+1 − Tk > t − Tk ≥ 0) P(Nt = k|TNt ).

Recall that Tk+1 − Tk = τk+1 is expon(λ) and independent of Tk . Hence

P(Tk+1 − t > u|Tk , Tk+1 − Tk > t − Tk ≥ 0) = P(τk+1 > (t − Tk ) + u|Tk , τk+1 > t − Tk ≥ 0) = e^{−λu} ,

by the memoryless property of the expon(λ) random variable. Hence

P(TNt +1 − t > u|TNt ) = e^{−λu} Σ_{k=0}^{∞} P(Nt = k|TNt ) = e^{−λu} ,

because the last sum equals 1, and this is because the events {Nt = k}, k = 0, 1, . . ., form a
partition of Ω.
Another property of the construction is that

if we know that there are m points on an interval I then the unordered
collection of these m points is a collection of m i.i.d. unif(I)
random variables.

The reason is this. Take, without loss of generality, I = (0, t]. We assume that we know that
Nt = N((0, t]) = m. Therefore we know that

0 < T1 < T2 < . . . < Tm ≤ t < Tm+1 .

But these m points are ordered. To render them unordered, let σ be a random permutation of
{1, . . . , m}, with uniform distribution over the set of all m! permutations. Then Tσ(1) , . . . , Tσ(m)
are unordered. One can then perform the following computation:

P(Tσ(1) ≤ x1 , . . . , Tσ(m) ≤ xm |Nt = m) = (x1 /t) · · · (xm /t),

for all 0 ≤ x1 , . . . , xm ≤ t.
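This conditional uniformity can be illustrated by simulation. A sketch, not part of the notes: generate Poisson process paths on (0, t] by summing expon(λ) inter-arrival times, keep only the runs with exactly m points (rejection sampling), and check that the kept points look like i.i.d. unif(0, t] variables. The parameters λ = t = 1, m = 2 and the seed are arbitrary choices.

```python
import random

random.seed(0)
lam, t, m = 1.0, 1.0, 2
points = []
while len(points) < 20_000:
    s, pts = 0.0, []
    while True:
        s += random.expovariate(lam)  # i.i.d. expon(lam) inter-arrival times
        if s > t:
            break
        pts.append(s)
    if len(pts) == m:                 # condition on exactly m points in (0, t]
        points.extend(pts)

mean = sum(points) / len(points)
frac_below_half = sum(p < t / 2 for p in points) / len(points)
```

Both the sample mean and the fraction of points below t/2 come out close to 1/2, as they should for unif(0, 1] points.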
Appendix A

Counting

Counting is the method for assigning a positive integer to a finite set, denoting its number of
elements. If the set is infinite, then counting means to produce a one-to-one correspondence
of the set with a given “well-understood” set, such as the integers or the set of real numbers.
We say that the set we are counting has the cardinality of the concrete set. For example, the set
of all finite sequences of coin tosses has the cardinality of the integers, but the set of all infinite
coin tosses has the cardinality of the real numbers. (We also know that the real numbers do
not have the cardinality of the integers!)

Typically, we count sets that arise from other sets. We write #S or, sometimes, |S| for the
cardinality of the set S.

1. Binary sequences of length n. Consider the set {0, 1}. OK, it has cardinality 2. The set
{0, 1}2 contains 00, 01, 10, 11, that is, 4 elements. You can guess what the cardinality of {0, 1}n
(the set of sequences of 0’s and 1’s of length n) is. If you can’t, let it be equal to cn . List all the
elements of {0, 1}n . Any sequence of length n + 1 is a sequence of length n followed by a 0 or a
1. So
cn+1 = 2cn ,
and, since c1 = 2, you see that
cn = #{0, 1}n = 2n .

2. Subsets of a set. Suppose that A is a set with n elements. Let P(A) be the set containing
all subsets of A. What is the cardinality of P(A)?

To solve this problem, put the elements of A in a row. For example, with n = 6,
a1 a2 a3 a4 a5 a6

To determine a subset, all we have to do is decide whether we include an element or not. If


we include it, write 1 below the element; if not, write a 0. So
a1 a2 a3 a4 a5 a6
0 1 0 1 1 0


This means that we consider the subset {a2 , a4 , a5 }. We can do this with each subset. To put
it otherwise, for each subset of A we have a binary sequence of length n, and vice versa.
Therefore the cardinality of P(A) is the cardinality of {0, 1}n which is 2n . Hence
|P(A)| = 2|A| .
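The coding of subsets by binary sequences is easy to carry out on a computer. A small sketch (the 6-element set below is illustrative): enumerate all binary sequences of length |A| and translate each into a subset.

```python
from itertools import product

A = ["a1", "a2", "a3", "a4", "a5", "a6"]
# subset <-> binary sequence of length |A|: keep x when its bit is 1
subsets = [[x for x, bit in zip(A, bits) if bit]
           for bits in product((0, 1), repeat=len(A))]
n_subsets = len(subsets)
```

As expected, exactly 2^6 = 64 subsets are produced, and the subset {a2, a4, a5} from the example appears among them.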

3. Arrangements of objects. Suppose we have n objects and we put them in a row. How
many rows can we form? This depends on whether the objects are distinguishable from one
another or not.

(a) All objects are indistinguishable.


The answer is 1. Indeed, if the objects are identical (in color, shape, weight, smell, temperature,
etc.) then swapping positions results in nothing different.

(b) All objects are distinct.


Swapping positions will give a different row. If pn is the number we are looking for then
consider which object goes into position 1; there are n choices for this object. Pick a particular
object. Then for the remaining n − 1 objects the number is pn−1 . So

pn = npn−1 .
But p1 = 1, so p2 = 2p1 = 1 · 2, p3 = 3p2 = 1 · 2 · 3, and so on; we discover that
pn = #arrangements = 1 · 2 · 3 · · · n.
Since this is a very important number, we give it a notation known by many:
n! = 1 · 2 · 3 · · · n.

(c) k out of n objects are red and the remaining n − k are blue.
Two objects of the same color are supposed to be indistinguishable. Let cn,k be the number of
arrangements. Consider a particular arrangement of the n objects. Now permute the red objects
between themselves and the blue between themselves only. For example, if we have n = 5 objects
of which k = 3 are red and we denote the objects as r1 , r2 , r3 , b1 , b2 then the arrangements

b1 r2 b2 r1 r3
b2 r1 b1 r3 r2

are indistinguishable. Thus, each of the cn,k distinguishable arrangements corresponds to k!(n − k)!
arrangements of the n distinct objects. This means that

n! = cn,k k!(n − k)!
and this gives
and this gives

cn,k = #arrangements = n!/(k!(n − k)!).

This is another important number, the binomial coefficient, so we give it a notation known by
many:

\binom{n}{k} = n!/(k!(n − k)!).

(d) There are d colors, such that n j objects are of color j, j = 1, . . . , d.


Since we have n objects in total,
n1 + · · · + nd = n.
Arguing in a similar manner as in the case of 2 colors, we find that

n!
#arrangements = ,
n1 !n2 ! · · · nd !

another important number denoted by

\binom{n}{n1 , . . . , nd} = n!/(n1 !n2 ! · · · nd !),    n1 , . . . , nd ≥ 0, n1 + · · · + nd = n,

referred to as multinomial coefficient. As an application in algebra, consider the expression

(x1 + x2 + · · · + xd )^n .

We wish to expand this into a sum of terms of the form x1^{n1} · · · xd^{nd} . We can write

x1^{n1} x2^{n2} · · · xd^{nd} = (x1 · · · x1 ) (x2 · · · x2 ) · · · (xd · · · xd ),
                               (n1 times)   (n2 times)       (nd times)

If we change the order on the right-hand side, we do not obtain a different term so long as the
number of times that each variable x j appears is equal to n j . So if we think of “variable” as
“color”, the number of arrangements is precisely \binom{n}{n1 ,...,nd} ; in other words, the term
x1^{n1} · · · xd^{nd} will appear \binom{n}{n1 ,...,nd} times in the expansion. Therefore we have
discovered the multinomial formula

(x1 + x2 + · · · + xd )^n = Σ_{n1 ,...,nd ≥0, n1 +···+nd =n} \binom{n}{n1 , . . . , nd} x1^{n1} · · · xd^{nd} . (MF)
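The multinomial formula (MF) can be verified numerically by summing the right-hand side term by term. A short sketch (the values n = 5 and x = (2, 3, 0.5) are arbitrary illustrative choices):

```python
from itertools import product
from math import factorial

def multinomial(n, ks):
    # n! / (k1! k2! ... kd!)
    out = factorial(n)
    for k in ks:
        out //= factorial(k)
    return out

n, xs = 5, (2.0, 3.0, 0.5)
lhs = sum(xs) ** n
rhs = sum(
    multinomial(n, ks) * xs[0] ** ks[0] * xs[1] ** ks[1] * xs[2] ** ks[2]
    for ks in product(range(n + 1), repeat=len(xs))
    if sum(ks) == n
)
```

Both sides come out equal (up to floating-point rounding), as (MF) asserts.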

4. Fixed-size subsets of a set. How many sets of size k does a set of size n have?

To solve the problem, consider a set of size n, say the set {1, . . . , n} of the first n positive
integers. Following the coding of subsets by binary sequences of length n, if we are interested
only in subsets of size k then we need to count the number of binary sequences such that 1
appears exactly k times. But if we think of 1 as red and of 0 as blue this is the same problem as
the number of arrangements of n objects in a row such that k of them are red and n − k blue.
The answer then is

#{subsets of {1, . . . , n} of size k} = \binom{n}{k} .

This is why the symbol \binom{n}{k} is pronounced “n choose k”. Since the total number of subsets is 2^n ,
we have proved that

Σ_{k=0}^{n} \binom{n}{k} = 2^n .

Here is an obvious property:

\binom{n}{k} = \binom{n}{n − k} .

And here is another one. Suppose that a fixed element, say n, of the set {1, . . . , n} is the most
important one. In considering a subset of size k we can either include n in the subset or not. If
we include it, then we need to make k − 1 other choices from the remaining n − 1 elements.
If we do not include it, then we need to make k choices from the remaining n − 1 elements.
Hence

\binom{n}{k} = \binom{n − 1}{k − 1} + \binom{n − 1}{k} .
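All three identities above (the row sum, the symmetry, and Pascal's rule) can be checked mechanically. A small sketch for one row of Pascal's triangle (n = 10, an arbitrary choice):

```python
from math import comb

n = 10
row_sum = sum(comb(n, k) for k in range(n + 1))
symmetry = all(comb(n, k) == comb(n, n - k) for k in range(n + 1))
pascal = all(comb(n, k) == comb(n - 1, k - 1) + comb(n - 1, k)
             for k in range(1, n))
```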

5. Finite binary sequences. The set of finite binary sequences is the set

⋃_{n=1}^{∞} {0, 1}^n

because a finite binary sequence has length n for some positive integer n. But this set is in
one-to-one correspondence with the integers. Here is one way to do this:

      0    1
     −1    1
     00   01   10   11
     −3   −2    2    3
    000  001  010  011  100  101  110  111
     −7   −6   −5   −4    4    5    6    7
    ...

That is, if a binary sequence starts with a 1 assign to it the positive integer whose binary
representation is given by the binary sequence. If a binary sequence starts with 0 then flip the
0’s and 1’s and then add a negative sign. The process is reversible. Hence

# ⋃_{n=1}^{∞} {0, 1}^n = #Z.

6. Infinite binary sequences. The set of infinite binary sequences is the set

{0, 1}^N

because an infinite binary sequence is a map from N into {0, 1}. We claim that

#{0, 1}^N = #{x ∈ R : 0 ≤ x ≤ 1}.

Indeed, if a = (a1 , a2 , . . .) is an infinite binary sequence then the number

x = Σ_{n=1}^{∞} an /2^n

is a real number, x ≥ 0, and x ≤ Σ_{n=1}^{∞} 1/2^n = 1. To make the operation invertible, we remark that
any real number 0 ≤ x ≤ 1 has exactly one binary representation if x is not a binary rational. A
binary rational number has 2 representations, of which we select the one that has eventually
only 1’s. (For example, the number 1/2 is written as 0.100000 · · · or as 0.0111111 · · · ; we
choose, by convention, the latter.)

7. Allocation of balls in boxes. There are n boxes and m balls. In how many ways can we
place the balls in the boxes? We consider two cases.

(a) Balls are distinguishable and so are the boxes.


Let the balls be labeled as b1 , . . . , bm and let the boxes be labeled as c1 , . . . , cn . (The letter c
stands for “cell”.) We consider each ball, one by one, and assign to it a box. Ball b1 can go into
any of the n boxes. Ball b2 can also go into any of the n boxes. This means that we multiply n
times n to figure out the possible arrangements of the first two balls. Continuing in this way
we have
#allocations = nm .

(b) Balls are identical but boxes are distinguishable.


Since the balls are identical, a ball in box c1 and another in box c2 is a single allocation. (If the
balls were distinguishable then this would correspond to 2 allocations.) Let us consider an
example, with m = 2 identical balls and n = 3 distinguishable boxes. On the left below
you see the possible allocations.

[••][  ][  ]        •• | |
[• ][• ][  ]        • | • |
[• ][  ][• ]        • | | •
[  ][••][  ]        | •• |
[  ][• ][• ]        | • | •
[  ][  ][••]        | | ••

On the right, we have removed the boxes and only left the internal walls. Each allocation is
represented by the m balls and the n − 1 internal walls. In total, we have m + n − 1 objects, m
of which are balls (think of them as “red” objects) and n − 1 are internal walls (think of them
as “blue” objects). From case (c) in paragraph 3 above we have

#allocations = (m + n − 1)!/(m!(n − 1)!) = \binom{m + n − 1}{n − 1} .
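The stars-and-bars count can be confirmed by brute-force enumeration. A small sketch (m = 4 balls, n = 3 boxes, arbitrary choices): list all tuples of box counts summing to m and compare with the formula.

```python
from itertools import product
from math import comb

m, n = 4, 3  # m identical balls, n distinguishable boxes
# an allocation is a tuple (k1, ..., kn) of box counts summing to m
allocations = [ks for ks in product(range(m + 1), repeat=n) if sum(ks) == m]
count = len(allocations)
stars_and_bars = comb(m + n - 1, n - 1)
```

Both counts come out to 15 here, in agreement with \binom{6}{2}.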

As an application, look again at the multinomial formula (MF) above. It involves a big sum
over the set of d-tuples (n1 , . . . , nd ) of nonnegative integers whose sum is n. We can think of
each such d-tuple as an allocation of balls in d boxes. Thus, (n1 , . . . , nd ) means that we put n1
balls in box 1, n2 balls in box 2 and so on. We therefore have that

Cn (d) := {(n1 , . . . , nd ) : n1 , . . . , nd ≥ 0, n1 + · · · + nd = n}
        = {all allocations of n identical balls in d distinguishable boxes}.

Hence the number of terms in the right-hand side of (MF) equals

\binom{n + d − 1}{d − 1} .

8.  Partitions. A partition of a positive integer n is a way to write it as a sum of positive


integers where the order of the summands does not matter. For example,

6=6
=5+1
=4+2
=3+3
=4+1+1
=3+2+1
=2+2+2
=3+1+1+1
=2+2+1+1
=2+1+1+1+1
=1+1+1+1+1+1

Note that we can think of a partition as a finite sequence of positive integers in nonincreasing
order. For example, writing 6 = 3 + 1 + 1 + 1 corresponds to (3, 1, 1, 1). Alternatively, we can
think of this as using 1 three times and 3 once. Using the second way of thinking, each partition
of n is simply

1 (k1 times) , 2 (k2 times) , 3 (k3 times) , · · ·
i.e., as a sequence (k1 , k2 , . . .) of nonnegative integers such that

k1 + 2k2 + 3k3 · · · = n.

The positive terms of such a sequence are finitely many. That is, after some index, all the
terms of this sequence are equal to 0. Call them “eventually zero sequences”.

The number of partitions of 6 is p(6) = 11. We do not have a closed formula for p(n), the
number of partitions of n. Define the function

G(x) := Σ_{n=0}^{∞} p(n)x^n .

Note that p(0) = 1: the only way to write 0 as a sum of positive integers is the empty sum, and
there is exactly one empty sum. We now write

p(n) = Σ_{k1 ,k2 ,...} 1_{k1 +2k2 +3k3 +···=n} .

The sum is taken over all eventually zero sequences. Hence

G(x) = Σ_{n=0}^{∞} x^n Σ_{k1 ,k2 ,...} 1_{k1 +2k2 +3k3 +···=n}
     = Σ_{k1 ,k2 ,...} ( Σ_{n=0}^{∞} x^{k1 +2k2 +3k3 +···} 1_{k1 +2k2 +3k3 +···=n} )
     = Σ_{k1 ,k2 ,...} x^{k1} x^{2k2} x^{3k3} · · · ( Σ_{n=0}^{∞} 1_{k1 +2k2 +3k3 +···=n} ) .

But notice that the last sum equals 1 because k1 , k2 , . . . are fixed and so the only term in the
sum that gives 1 is the term for which k1 + 2k2 + 3k3 + · · · = n; all other terms give 0. Hence

G(x) = Σ_{k1 ,k2 ,...} x^{k1} x^{2k2} x^{3k3} · · · = Σ_{k1 =0}^{∞} x^{k1} Σ_{k2 =0}^{∞} x^{2k2} Σ_{k3 =0}^{∞} x^{3k3} · · ·
     = (1 + x + x^2 + · · · )(1 + x^2 + x^4 + · · · )(1 + x^3 + x^6 + · · · ) · · ·
     = (1/(1 − x)) (1/(1 − x^2 )) (1/(1 − x^3 )) · · ·
This function contains all information about p(n) because the coefficient of xn is p(n) and it can
be recovered by hand for small n.
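The product formula for G(x) is also a practical algorithm: multiplying by 1/(1 − x^j) one factor at a time is the standard dynamic-programming way to tabulate p(n). The sketch below (the cutoff N = 10 is an arbitrary choice) is an illustration, not taken from the notes.

```python
N = 10
# coefficients of prod_{j=1}^{N} 1/(1 - x^j), truncated at degree N
p = [1] + [0] * N          # start with the constant polynomial 1
for j in range(1, N + 1):  # multiplying by 1/(1 - x^j) means p[n] += p[n - j]
    for n in range(j, N + 1):
        p[n] += p[n - j]
```

Reading off the coefficients recovers p(0) = 1, p(6) = 11 (as in the table above), p(10) = 42, and so on.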

As an application, let us look again at the multinomial formula (MF). As explained above,
the sum on the right-hand side of (MF) has \binom{n+d−1}{d−1} terms. Each term corresponds to an
element n = (n1 , . . . , nd ) of the set Cn (d). Let us use the abbreviation

\binom{n}{n} := \binom{n}{n1 , . . . , nd} , where n = (n1 , . . . , nd ).

Looking again at the right-hand side of (MF), we wish to group together terms with the same
multinomial coefficient.

Note that if n and n′ are two elements of Cn (d) such that one is obtained by a permutation
of the other then \binom{n}{n} = \binom{n}{n′} .

We can make that clearer by saying that n is equivalent to n0 if they are permutations
of one another. An equivalence class π is a subset of Cn (d) such that all elements of π are
equivalent to one another. Let then

Π(n, d) := {π ⊂ Cn (d) : π is an equivalence class}.


By definition, all n ∈ π have the same multinomial coefficient \binom{n}{n} . We can then define

\binom{n}{π} := \binom{n}{n} , for some (and hence any) n ∈ π, π ∈ Π(n, d).

We therefore obtain

(x1 + x2 + · · · + xd )^n = Σ_{π∈Π(n,d)} \binom{n}{π} Σ_{(n1 ,...,nd )∈π} x1^{n1} · · · xd^{nd} . (MF2)

Formula (MF2) is a rewriting of (MF), where we grouped terms with the same multinomial
coefficient together. We can now easily see that

each π in Π(n, d) corresponds to a partition of n in at most d parts.

Indeed, in figuring out all d-tuples obtained by permuting a particular d-tuple (n1 , . . . , nd ) we
may as well put this in nonincreasing order. Since the sum of the ni equals n, we have a
partition of n in at most d parts (we say “at most” because some elements ni may be equal to
0). How many terms does the second sum in (MF2) have? It has as many terms as the number
of elements of π. Pick an element (n1 , . . . , nd ) of π. In figuring out how many other d-tuples
are equivalent to (n1 , . . . , nd ) the only thing that matters is how many elements of it are equal
to one number, how many to another number, and so on. So if we let

κπ (i) := number of terms of π that are equal to i, i ≥ 1,

we have, on one hand,

κπ (1) + 2κπ (2) + 3κπ (3) + · · · = n,

while, on the other hand,

κπ (1) + κπ (2) + κπ (3) + · · · ≤ d.

Letting κπ (0) := d − (κπ (1) + κπ (2) + · · · ), which can be thought of as the number of parts of π
that are equal to 0 (equivalently, if π is a partition of n in a number of parts strictly smaller
than d then complement it by zeros), we have

|π| = d!/(κπ (0)!κπ (1)! · · · κπ (d)!) =: \binom{d}{κπ} .

We therefore obtain the identity

d^n = Σ_{π∈Π(n,d)} \binom{n}{π} \binom{d}{κπ} .

Let us check this with d = 4 and n = 6:

      π          \binom{n}{π}            κπ             \binom{d}{κπ}    \binom{n}{π}\binom{d}{κπ}
  6+0+0+0        6!/6! = 1      (3, 0, 0, 0, 0, 0, 1)     4!/3! = 4              4
  5+1+0+0        6!/5! = 6      (2, 1, 0, 0, 0, 1, 0)     4!/2! = 12            72
  4+2+0+0     6!/(4!2!) = 15    (2, 0, 1, 0, 1, 0, 0)     4!/2! = 12           180
  3+3+0+0     6!/(3!3!) = 20    (2, 0, 0, 2, 0, 0, 0)    4!/(2!2!) = 6         120
  4+1+1+0       6!/4! = 30      (1, 2, 0, 0, 1, 0, 0)     4!/2! = 12           360
  3+2+1+0     6!/(3!2!) = 60    (1, 1, 1, 1, 0, 0, 0)      4! = 24            1440
  2+2+2+0    6!/(2!2!2!) = 90   (1, 0, 3, 0, 0, 0, 0)     4!/3! = 4            360
  3+1+1+1      6!/3! = 120      (0, 3, 0, 1, 0, 0, 0)     4!/3! = 4            480
  2+2+1+1    6!/(2!2!) = 180    (0, 2, 2, 0, 0, 0, 0)    4!/(2!2!) = 6        1080
                                                                 total:  4096 = d^n

(Here κπ is written as the vector (κπ (0), κπ (1), . . . , κπ (6)).)
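The table can be regenerated by a short computation. A sketch (illustrative, not from the notes): enumerate C_6(4), group the tuples into equivalence classes by sorting, verify that each class has size \binom{d}{κπ}, and total up the identity.

```python
from itertools import product
from math import factorial

n, d = 6, 4

def multinomial(n, ks):
    out = factorial(n)
    for k in ks:
        out //= factorial(k)
    return out

# group the tuples of C_n(d) into equivalence classes (= partitions of n in <= d parts)
classes = {}
for ks in product(range(n + 1), repeat=d):
    if sum(ks) == n:
        classes.setdefault(tuple(sorted(ks, reverse=True)), []).append(ks)

total = 0
class_sizes_match = True
for rep, members in classes.items():
    kappa = [rep.count(i) for i in range(n + 1)]  # kappa(0), ..., kappa(n)
    size = factorial(d)
    for c in kappa:
        size //= factorial(c)                     # d!/(kappa(0)! kappa(1)! ...)
    class_sizes_match = class_sizes_match and (size == len(members))
    total += multinomial(n, rep) * len(members)
```

The run finds the 9 classes of the table, confirms each class size, and reproduces the total 4096 = 4^6.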

9. Selections. Frequently, we have to count the number of selections of objects. Think of
a computer screen containing n buttons (objects). You are to make r clicks (selections). You
may be allowed to click the same button several times, or not. We refer to these situations as
“selections with replacement” or “selections without replacement”. Also, you may care about
the order of clicks or not. We thus have four cases. The answers are indicated in the table
below. Let us explain them.

                          #r-selections from n objects

                                  ordered selections    unordered selections

selections with replacement              n^r            \binom{r + n − 1}{n − 1}
selections without replacement          (n)r            \binom{n}{r}

• Make r ordered selections with replacement: First selection can be done in n ways, second
in n ways, and so on; the r-th selection can also be done in n ways. The answer is n^r .

• Make r ordered selections without replacement: As above, but now the second selection
can be done in n − 1 ways, the third in n − 2 and so on. Hence the answer is n(n − 1)(n −
2) · · · (n − r + 1) = (n)r .

• Make r unordered selections without replacement: We have (n)r ways to select if the
order matters. But since the order does not matter, we divide by r! and hence the answer is
(n)r /r! = \binom{n}{r} .

• Make r unordered selections with replacement: To count this we map the problem into a
situation we have already considered. Think of the objects as boxes. We are to select r of them
but we do not care about the order of the selected boxes. To do this, we place a ball in each
box that we wish to select. Since the order does not matter the balls must be indistinguishable.
Since we can select with replacement, we are allowed to put many balls in each box. Hence the
number of selections in this case is the same as the number of allocations of r indistinguishable
balls into n distinguishable boxes. The answer was found in case (b) of paragraph 7 and is
\binom{r + n − 1}{n − 1} .
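All four entries of the table can be checked by direct enumeration; Python's itertools module happens to provide one generator per case. A sketch with the arbitrary choice n = 5, r = 3:

```python
from itertools import (combinations, combinations_with_replacement,
                       permutations, product)
from math import comb, perm

n, r = 5, 3
buttons = range(n)
counts = {
    "ordered, with replacement":      len(list(product(buttons, repeat=r))),
    "ordered, without replacement":   len(list(permutations(buttons, r))),
    "unordered, without replacement": len(list(combinations(buttons, r))),
    "unordered, with replacement":    len(list(combinations_with_replacement(buttons, r))),
}
```

The counts 125, 60, 10, 35 match n^r, (n)r, \binom{n}{r} and \binom{r+n−1}{n−1} respectively.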

10. Explaining the Bonferroni inequalities. Let A1 , . . . , An be events in a sample space Ω
and P a probability measure on all events of Ω. Define

Bm := Σ_{k=1}^{m} (−1)^{k−1} Σ_{1≤i1 <···<ik ≤n} P(Ai1 ∩ · · · ∩ Aik ), m = 1, 2, . . . , n.

Then

B2 ≤ B4 ≤ · · · ≤ P( ⋃_{i=1}^{n} Ai ) ≤ · · · ≤ B3 ≤ B1 .

We will show something much stronger which is pure logic and counting and which does not
involve probability. Let δi := 1_{Ai} and set

Sm := Σ_{k=1}^{m} (−1)^{k−1} Nk ,  where  Nk := Σ_{1≤i1 <···<ik ≤n} δi1 · · · δik ,  m = 1, 2, . . . , n.

We will show

S2 ≤ S4 ≤ · · · ≤ 1_{⋃_{i=1}^{n} Ai} ≤ · · · ≤ S3 ≤ S1 .

This will be enough because

ESm = Bm ,   E 1_{⋃_{i=1}^{n} Ai} = P( ⋃_{i=1}^{n} Ai ).
Pn
The key is understanding what Nk is. If ω ∈ Ω then N1 (ω) = Σ_{i=1}^{n} 1_{ω∈Ai} , that is, it is the number
of events that contain ω. Also, N2 (ω) = Σ_{i<j} 1_{ω∈Ai ∩A j} is the number of pairs of events that
contain ω. And so on.1 If N1 (ω) = ℓ, say, then N2 (ω) = \binom{ℓ}{2} . Indeed, if, say, ω belongs to the sets
A1 , . . . , Aℓ then the only indices i, j for which Ai ∩ A j contains ω must be chosen among 1, . . . , ℓ.
Since the order is irrelevant, there are \binom{ℓ}{2} ways to choose these indices. Similarly, Nk (ω) = \binom{ℓ}{k} .
Hence

If N1 = ℓ then Sm = Σ_{k=1}^{m} (−1)^{k−1} \binom{ℓ}{k} .

We consider two cases. First, ℓ = 0. But then N1 (ω) = 0 means that ω does not belong to any
of the events. Hence 1_{⋃_{i=1}^{n} Ai} (ω) = 0. Moreover, Sm (ω) = 0 for all m. Hence the inequalities
hold trivially because 0 ≤ 0 ≤ 0. Second, assume ℓ ≥ 1. Then N1 (ω) ≥ 1, so ω belongs to some
event, so 1_{⋃_{i=1}^{n} Ai} (ω) = 1. Hence the inequalities become

ℓ − \binom{ℓ}{2} ≤ ℓ − \binom{ℓ}{2} + \binom{ℓ}{3} − \binom{ℓ}{4} ≤ · · · ≤ 1 ≤ · · · ≤ ℓ − \binom{ℓ}{2} + \binom{ℓ}{3} ≤ ℓ,

and we need to show that these are true for ℓ ≥ 1. These follow from the identity

Σ_{i=0}^{m} (−1)^i \binom{ℓ}{i} = (−1)^m \binom{ℓ − 1}{m} ,

which is left as an exercise.
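The identity left as an exercise is easy to test exhaustively for small values. A sketch (the bounds ℓ < 12 are an arbitrary choice; math.comb returns 0 when the lower index exceeds the upper, which handles the boundary cases):

```python
from math import comb

# Check: sum_{i=0}^m (-1)^i C(l, i) = (-1)^m C(l-1, m) for small l, m.
def alternating_sum(l, m):
    return sum((-1) ** i * comb(l, i) for i in range(m + 1))

identity_holds = all(
    alternating_sum(l, m) == (-1) ** m * comb(l - 1, m)
    for l in range(1, 12)
    for m in range(l + 1)
)
```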

1
In probability parlance, Nk is the number of unordered k-tuples of events that simultaneously occur. I refer
you to the end of §5 where the expression “occurs” is defined.
Appendix B

Calculus

1. Neighborhood. The ε-neighborhood of a real number a is the interval (a − ε, a + ε). For a


real number M, the M-neighborhood of +∞ is an interval (M, +∞) and the M-neighborhood
of −∞ is an interval (−∞, M).

2. Limit of a sequence (= a function on the integers). A sequence a1 , a2 , . . . of real numbers


is said to converge to a if any neighborhood of a contains all the terms of the sequence
eventually (that is, it contains all except, possibly, finitely many terms).
Notation: limn→∞ an = a. We can also write an → a as n → ∞.
Useful limits:

lim_{n→∞} ρ^n = 0 if |ρ| < 1        lim_{n→∞} n^{1/n} = 1        lim_{n→∞} (1 + 1/n + an )^n = e if lim_{n→∞} n an = 0

lim_{n→∞} (1 + 1/2 + 1/3 + · · · + 1/n − log n) = γ ≈ 0.577        lim_{n→∞} n!/(n^n e^{−n} √n) = √(2π)

3. Asymptotic equivalence. We write an ∼ bn as n → ∞ to mean that an /bn → 1 as n → ∞.


We then say that the two sequences are asymptotically equivalent.

4. Limit of a function (of a real variable). A function f (x) of a real variable x is said to
converge to y0 as x → x0 if any neighborhood of y0 contains all numbers f (x) for all x
sufficiently close to x0 . (Note that f does not have to be defined at x0 .)
Notation: limx→x0 f (x) = y0 . We can also write f (x) → y0 when x → x0 .
Useful limits:

lim_{x→0} (sin x)/x = 1        lim_{x→0+} x log x = 0        lim_{x→0} (1 + x + g(x))^{1/x} = e if lim_{x→0} g(x)/x = 0

lim_{x→−∞} e^x = 0        lim_{x→∞} e^x = ∞        lim_{x→∞} x^c = ∞ if c > 0, and = 0 if c < 0

5. Continuity. We say that f is continuous at x0 if limx→x0 f (x) = f (x0 ). We say that f is


continuous if it is continuous at every x0 .

6. Slope. The slope function of a function f between two points x1 , x2 is the quantity
( f (x2 ) − f (x1 ))/(x2 − x1 ), defined when x1 , x2 are distinct real numbers.

7. Differentiability. We say that f is differentiable at x0 if lim_{x→x0} ( f (x) − f (x0 ))/(x − x0 ) exists. The
limit is denoted by f ′ (x0 ) or as (d f /dx)(x0 ). We can rewrite the definition of derivative f ′ (x0 ) as: there is
a function R(x) such that R(x)/x → 0 as x → 0 and

f (x) = f (x0 ) + f ′ (x0 )(x − x0 ) + R(x − x0 ).

If a function is differentiable at every x then x ↦ f ′ (x) is its derivative function.

8. Differentiability and continuity.


– If a function is differentiable at x0 then it is continuous at x0 .
– If a function is differentiable everywhere then its derivative function may be discontinuous.
If the derivative function is discontinuous then it has oscillatory discontinuities.
– There are plenty of continuous functions whose derivatives exist nowhere. The graph of any
such function has infinite length between any two points no matter how close they are.

9. Some common derivatives.

(d/dx) x^a = ax^{a−1}        (d/dx) e^{cx} = ce^{cx}        (d/dx) log x = 1/x

(d/dx) sin x = cos x        (d/dx) arctan x = 1/(1 + x^2 )        (d/dx) arcsin x = 1/√(1 − x^2 )

10. The product rule.

(d/dx)( f (x)g(x)) = f ′ (x)g(x) + f (x)g′ (x).

11. The chain rule (= composition rule). The chain rule for the composition of functions says

( f ◦ g)′ (x) = f ′ (g(x)) g′ (x).

12. Subdivision of an interval. Let I be a bounded interval with endpoints a, b, where a < b.
A subdivision of I is a finite sequence a = x0 < x1 < · · · < xN = b, starting from a and ending at
b. The intervals [x j−1 , x j ], j = 1, . . . , N, are the intervals of the subdivision. The subdivision is
tagged if it is accompanied by a selection of a point in each interval, that is, we have the N
points u j ∈ [x j−1 , x j ], j = 1, . . . , N. Let us use the letter Π to denote some tagged subdivision.
The mesh ‖Π‖ of Π is simply the length of the maximum interval of Π.

13. Integration à la Riemann. If f is a function on I and Π a tagged subdivision of I, we let

S( f, Π) := Σ_{j=1}^{N} f (u j )(x j − x j−1 ).

We say that f is Riemann-integrable on I with integral S( f ) if for any ε > 0 there is a δ > 0
such that, if Π is any tagged subdivision with ‖Π‖ < δ, then |S( f, Π) − S( f )| < ε. It is customary
to write

S( f ) = ∫_I f (x)dx or ∫_a^b f (x)dx.

We can compute integrals as limits, by taking subdivisions Πn , n = 1, 2, . . ., such that ‖Πn ‖ → 0
and then use

∫_I f (x)dx = lim_{n→∞} S( f, Πn ).

Any continuous function on [a, b] is Riemann-integrable on [a, b].

14. Indefinite integral. If f is integrable on every bounded interval then ∫_a^x f (u)du, as a
function of x, is called an indefinite integral. We often use the geometric convention ∫_a^b = −∫_b^a .

15. Antiderivative. We say that F is an anti-derivative of f if F is differentiable everywhere
and F′ = f .

16. Some common integrals.

∫_1^∞ x^c dx = 1/|c + 1| if c < −1, and = ∞ if c ≥ −1

∫_0^1 x^c dx = 1/(c + 1) if c > −1, and = ∞ if c ≤ −1

∫_{−∞}^{∞} e^{−x²} dx = √π

∫_0^∞ x^n e^{−x} dx = n!        ∫_0^1 x^m (1 − x)^n dx = m!n!/(m + n + 1)!
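The last integral (the Beta integral) can be verified exactly, without numerical quadrature, by expanding (1 − x)^n with the binomial theorem and integrating term by term. A sketch (the values m = 3, n = 4 are arbitrary):

```python
from fractions import Fraction
from math import comb, factorial

def beta_integral(m, n):
    # integrate x^m (1-x)^n on [0,1] exactly, via the binomial expansion of (1-x)^n:
    # sum_k C(n,k) (-1)^k / (m+k+1)
    return sum(Fraction((-1) ** k * comb(n, k), m + k + 1) for k in range(n + 1))

m, n = 3, 4
exact = beta_integral(m, n)
formula = Fraction(factorial(m) * factorial(n), factorial(m + n + 1))
```

Both sides equal 1/280 here, confirming the factorial formula.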

17. Some common indefinite integrals.

∫ e^{ax} dx = e^{ax}/a        ∫ (1/x) dx = log x

∫ x^c dx = x^{c+1}/(c + 1)        ∫ log x dx = x log x − x

18. The first fundamental theorem of calculus. It says that if we differentiate the indefinite
integral of a function then we obtain the function again:

(d/dx) ∫_a^x f (u)du = f (x).

To be more precise, we need to assume that f can be integrated. One condition for this is that
f be continuous. A better condition is that f be piecewise continuous. Let us assume the latter.
Then the first fundamental theorem of calculus says that
Z x
F(x) := f (u)du
a

is differentiable at all points at which f is continuous and that F′ (x) = f (x) at these points. If
we take the function f (x) = 1_{x≥c} , for some c > 0, we compute its indefinite integral and
find F(x) = 0 if x ≤ c and F(x) = x − c if x ≥ c. Here F′ (x) exists and equals f (x) everywhere
except at x = c.

19. The second fundamental theorem of calculus. It says that if we integrate the derivative
of a function then we obtain the function again, in the sense that

∫_a^b (d/dx) F(x) dx = F(b) − F(a).

More precisely, if F : [a, b] → R has derivative F′ (x) at all points x ∈ [a, b] such that F′ is
Riemann integrable then ∫_a^b F′ (x)dx = F(b) − F(a).

20. The integration-by-parts formula. It says

∫_a^b f ′ (x)g(x)dx + ∫_a^b f (x)g′ (x)dx = f (b)g(b) − f (a)g(a),

and follows from the product rule and the second fundamental theorem of calculus.

21. The change-of-variable formula. It says

∫_{g(a)}^{g(b)} f (x)dx = ∫_a^b f (g(u))g′ (u)du.

Indeed, if we let F(x) be an anti-derivative of f , then (d/du) F(g(u)) = F′ (g(u))g′ (u) = f (g(u))g′ (u),
so the right-hand side integral equals F(g(b)) − F(g(a)). The left-hand side integral is
∫_{g(a)}^{g(b)} F′ (x)dx = F(g(b)) − F(g(a)). This can be extended when a and/or b are at −∞/∞,
provided we use limits.

22. Sum of a finite number of numbers. The sum of n numbers a1 , . . . , an is denoted by
a1 + · · · + an or by Σ_{j=1}^{n} a j , where j is a dummy variable, so the same sum can be written as
Σ_{k=1}^{n} ak . The order of summation is irrelevant, hence Σ_{j=1}^{n} a j = Σ_{j=0}^{n−1} a_{n−j} . Observe

Σ_{j=1}^{n} 1 = n

because adding 1 to itself n times gives n. Also,

Σ_{j=1}^{n} j = n(n + 1)/2 .

Indeed, if S denotes this sum then 2S = S + S = (1 + 2 + · · · + n) + (n + (n − 1) + · · · + 1) =
(1 + n) + (2 + (n − 1)) + · · · + (n + 1), that is, the number n + 1 added to itself n times, hence
2S = n(n + 1). Also, for ρ ≠ 1,

Σ_{j=0}^{n} ρ^j = (ρ^{n+1} − 1)/(ρ − 1) .

(If ρ = 1 the sum is trivially equal to n + 1.) Let Sn denote this sum. You can actually discover
this formula by observing that S1 = 1 + ρ = (ρ^2 − 1)/(ρ − 1), as claimed, and that Sn = 1 + ρSn−1 .

23. Sum of an infinite number of numbers (= infinite sum). An infinite sum Σ_{j=1}^{∞} a j is a
symbol for the limit of the sequence Σ_{j=1}^{n} a j , n = 1, 2, . . ., provided that the limit exists. For
example,

Σ_{j=0}^{∞} ρ^j = 1/(1 − ρ) if |ρ| < 1,

because ρ^{n+1} → 0 as n → ∞. You can actually discover this formula, when you forget it,
because if S is the sum then, trivially, S = 1 + ρS, so S = 1/(1 − ρ). We know that

Σ_{n=1}^{∞} 1/n^p < ∞ ⇐⇒ p > 1.

In fact,

Σ_{n=1}^{∞} 1/n^2 = π²/6 .

24. Some standard expansions.

e^x = Σ_{n=0}^{∞} x^n /n! ,    x ∈ R

log(1 − x) = −Σ_{n=1}^{∞} x^n /n ,    −1 < x < 1
Index

(. . .), ordered list of objects
(a, b], an interval containing its right but not its left endpoint
(n)_m, see falling factorial
∗, convolution operation
A \ B, difference of sets
A × B, (Cartesian) product of two sets
A^T, transpose of matrix A
A^c, complement of set A, negation
A^{-1}, inverse of matrix A
B^A, set of functions from A to B
E, expectation
E_Q, expectation with respect to the probability measure Q
Q_1 ∗ Q_2, see convolution of probability measures
X =^d Y, X has the same distribution as Y
X^{-1}, inverse function of function (random variable) X
[a, b], an interval containing its endpoints
1, see indicator
Cauchy(·), see Cauchy distribution
Γ, the gamma function
N, the set of positive integers
Poi(λ), see Poisson distribution
Q, the set of rational numbers
R, the set of real numbers
R+, the set of nonnegative real numbers
⇒, implies
Z, the set of all integers
Z+, the set of all nonnegative integers
\binom{n}{m}, see binomial coefficient
\binom{n}{n_1,...,n_d}, multinomial coefficient
bin(n, p), see binomial distribution
∩, intersection, and
χ²(d), see chi-squared distribution
◦, composition operation
cov, see covariance
∪, union, or
δ_0, see Dirac probability measure
det A, determinant of matrix A
F(m, n), see F distribution
expon(λ), see exponential distribution
gamma(λ, α), see gamma distribution
geo(p), see geometric distribution
⇐⇒, equivalent, if and only if
∈, belongs to, membership
inf, greatest lower bound
∫, integral
∫_B, integral over a set B
⟨·, ·⟩, see inner product
⌈·⌉, see upper integer part
⌊·⌋, see lower integer part
log, natural logarithm
lognormal, see lognormal distribution
↦, maps to
P(·), see set of subsets of a set
N(µ, R), see normal distribution, many dimensions
N(µ, σ²), see normal distribution
⊥, perpendicular, orthogonal
π, the area of a circle of unit radius
∏, product
∏_S, product over a finite set S
∏_{i=1}^∞, product over positive integers
E, see projection
∑, sum


∑_S, sum over a discrete set S
∑_{i=1}^∞, sum over positive integers
sup, least upper bound
t(d), see T distribution
unif(·), see uniform distribution
var, see variance
∅, see empty set
|A|, cardinality of set A
{0, 1}^N, set of infinite binary sequences
{H, T}^N, set of infinite coin tosses
{. . .}, unordered list of objects (set)
e, base of natural logarithms
e^x, exp(x), exponential function
f′(x), df/dx, derivative
f ∗ g, see convolution of functions
f ◦ g, is the function x ↦ f(g(x))
n!, see factorial
|, conditional on, given that

absolutely continuous function, 151
affine, 124
area function, 101
    existence of, 101
atom of a probability measure, 150

Bayes' rule, 80
    a coin with a random probability of heads, 82, 230
    false negative probability, 80
    false positive probability, 80
    find the gift, 83
    prisoner's dilemma, 84
Bernoulli
    random variables, 102
Bernoulli trials
    finitely many, 103
    infinitely many, 107
    on a general index set, 268
    sparse limit, 270
binomial coefficient, 44
binomial distribution, 103
birthday coincidences, 45
    via conditioning, 81
Bonferroni inequalities, 37, 281
Boole's inequality, 36
Bose-Einstein model, 66

cardinality, 30, 33, 273
Cauchy distribution, 182
Cauchy-Schwarz inequality, 173
ceiling function, see upper integer part
center of mass, see centroid
central limit theorem, 247
centroid, 185
Chebyshev's inequality, 174
chi-squared distribution, 256
class of events, see sigma-field
completing the square, 18
    many variables, 208, 209
conditional density, 129
conditional distribution, see conditional probability measure
    regular, 229
    under normality, 239, 241
conditional expectation, 95, 223
    existence, 225
    for discrete r.v.s, 95
    for general r.v.s, 223
    under normality, 235, 236
conditional probability, 78
conditional probability measure, 229
conditional variance
    under normality, 238
confidence interval, 248
continuous random variable, 150
convergence in distribution, 244
convergence in probability, 244
convolution, 232
    of functions, 233
    of probability measures, 232
correlation, see inner product
correlation coefficient, 173
counting, 42
covariance, 173
covariance matrix, 185
    square root of, 209
Cramér-Wold theorem, 185

dancing pairs, see matching problem
density transformation, 123, 132
    in many dimensions, 132
    in one dimension, 123
Dirac probability measure, 47

disintegration, see vital formula of probability
distinguishable balls in distinct boxes, 67
distribution, see probability measure
distribution function, 116, 127, 148
    in general, 148
    of an absolutely continuous r.v., 116
    of an absolutely continuous random vector, 127
    probability measure defined by, 148
    multidimensional, 159
distribution of a r.v., 50, 51, 99
distribution of a random vector, 156

empirical mean, 56, 262
empirical probability measure, 47
empirical variance, 57, 262
empty set, 30
event, 21, 24
    class of, 32, 147, 163
    indicator of, 57
    occurrence, 34
expectation, 52
    of a discrete r.v., 52
    of a general r.v., 164
    of an absolutely continuous r.v., 115
exponential distribution, 118
    as limit of geometric, 110
    memoryless property, 120

F distribution, 258
factorial, 44
falling factorial, 44
Fermi-Dirac model, 67
floor function, see lower integer part
fundamental theorem of calculus
    advanced version, 152
fundamental theorem of probability, see strong law of large numbers
fundamental theorem of statistics, see Glivenko-Cantelli theorem

gamma distribution, 253
Gaussian distribution, see normal distribution
geometric distribution, 108
    limit of, 109
    memoryless property, 108
Glivenko-Cantelli theorem, 199

i.i.d., 100
i.i.d. infinite sequence exists, 100
iff, if and only if, equivalent
    in matching problem, 105
inclusion-exclusion formula, 36
independence, 87, 92
    between events, 87
    between random variables, 92
    in general, 169
independent and identically distributed, see i.i.d.
indicator, 57
indicator function, see indicator
indicator r.v., see indicator
inequality
    Bonferroni, 37, 281
    Boole, 36
    Cauchy-Schwarz, 173
    Chebyshev, 174
    Jensen, 171
    Markov, 55, 174
    moments, 172
    triangle, 218
inner product, 173

Jacobian, 134
Jensen's inequality, 171

law, see probability measure
law of a r.v., see distribution of a r.v.
law of a random vector, see distribution of a random vector
law of the unconscious statistician, 52, 167
    general scheme, 168
lognormal distribution, 124
LOTUS, see law of the unconscious statistician
lower integer part, 42

marginal density, 129
Markov's inequality, 55, 174
mass density, 112
matching problem, 64, 74, 75
Maxwell-Boltzmann model, 66

mixture, 153
moment, 54, 172
moment generating function, 175, 180
    of a random vector, 185

normal distribution, 120
    characterization, 202
    explanation of appearance of π, 121
    many dimensions, 205
        density, 208
        existence of density, 207
        moment generating function, 205, 206
    on the plane, 131

occurrence of an event, 34

Poisson distribution, 104
    in matching problem, 75, 105
    matching problem, 105
Poisson process, 270
poker, 72
probability density, 112
    higher dimensions, 126
    probability measure defined by, 114
probability generating function, 175
probability measure, 32
    axiom one, 32
    axiom two, 32
probability space, 34
product of probability measures, 41, 100
product rule, 131
product rule, see product of probability measures
projection
    in a space of r.v.s, see conditional expectation
    in a vector space, 217
    in ordinary Euclidean space, 215
Pythagorean theorem, 66, 120, 201, 202, 211, 215, 216

r.v., see random variable
random variable, 49
    absolutely continuous, 154
    continuous, 150
    decomposition of the law of, 155
    discrete, 154
    measurability, 60
    singularly continuous, 154
    transfers probability measure, 51
random vector, 156
rate of convergence, 242

sample mean, see empirical mean
sample variance, see empirical variance
sampling without replacement
    coincidence of methods, 71
    first method, 70
    second method, 70
semicircle distribution, 115
set of subsets of a set, 30
sigma-algebra, see sigma-field
sigma-field, 32
    axioms, 32
SLLN, see strong law of large numbers
standard deviation, 54, 173
statistic, 248
strong convergence, 243
strong law of large numbers, 192

T distribution, 264
temperature, 73
Thales' theorem, 118
triangle inequality, 218
type I error, see Bayes' rule, false positive probability
type II error, see Bayes' rule, false negative probability

uncorrelated, 95, 173, 189
uniform distribution, see uniform probability measure
uniform probability measure
    and counting, 42, 62
    on a finite set, 42, 62
    on a planar set, 131
upper integer part, 110

variance, 54, 172
vital formula of probability, 213, 230, 231

weak law of large numbers, 199
white noise representation, see whitening

whitening, 210

zero-area, 127
zero-length, 114
zero-n-volume, 159
zero-volume, 150
