Math 254
Code = Math254
Spring 2024
Quick navigation
Part I
Part II
Part III
Part IV
Appendices
Only Part IV contains what the syllabus specifies. The lectures will be from Part IV.
Everything contained in the pink pages is material that is logically needed for the proper understanding of Part IV. I know of no way to teach the topics of the syllabus, that is, Part IV, without assuming that the student has an understanding of the pink pages. I could have chosen not to include the pink pages. I did it¹ only because I wanted to give the reader the opportunity to find this material in a single document.
This does not mean that pink=useless. It is very useful. But it’s not part of the syllabus or
the lectures during the term.
A chapter or section preceded by is optional.
¹ Hamlet’s dilemma in the epigraph above expresses my dilemma: should I add the pink pages or not?
Contents
1 Introduction 1
1.1 Guide: please read carefully . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
9 Random variables 49
9.1 Random variables are functions . . . . . . . . . . . . . . . . . . . . . . . . . . 49
9.2 The distribution of a random variable under a probability measure . . . . . 50
19 Conditionally 214
19.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
19.2 Euclidean projections, platonically . . . . . . . . . . . . . . . . . . . . . . . . . 215
A Counting 273
B Calculus 283
Index 288
List of PROBLEMS
11.3 PROBLEM (false positive and false negative errors: observations affect decisions) . . . . . 80
11.4 ★PROBLEM (birthday coincidences–Problem 8.14 revisited) . . . . . 81
11.5 PROBLEM (a coin whose probability of heads is random!) . . . . . 82
11.6 PROBLEM (strategy for getting the gift: every bit of information counts) . . . . . 83
11.7 PROBLEM (prisoner’s dilemma) . . . . . 84
11.8 PROBLEM (where is the coin?) . . . . . 84
11.9 PROBLEM (application in business) . . . . . 85
11.10 PROBLEM (uniform distribution on several coin tosses begets independence) . . . . . 87
11.11 ★PROBLEM (product probability measure on S₁ × S₂ begets independence) . . . . . 88
11.12 ★PROBLEM (product probability measure on the product of n sets) . . . . . 88
11.13 PROBLEM (a not-so-obvious independence) . . . . . 89
11.14 PROBLEM (pairwise independence but not independence) . . . . . 89
11.15 PROBLEM (pairwise independence but not independence, again) . . . . . 90
11.16 PROBLEM (uniform distribution on many coin tosses begets independence, again) . . . . . 91
11.17 PROBLEM (symmetry begets independence) . . . . . 92
11.18 ★PROBLEM (criterion for independence) . . . . . 92
11.19 ★PROBLEM (independence of many implies independence of fewer) . . . . . 93
11.20 ★PROBLEM (independence of events and their indicator random variables) . . . . . 93
11.21 ★PROBLEM (toss k dice n times, as in Problem 10.2, again) . . . . . 93
11.22 ★PROBLEM (independence of disjoint sets of r.v.s) . . . . . 93
11.23 PROBLEM (independence in presence of common variable) . . . . . 94
11.24 PROBLEM (symmetry begets independence of sorts) . . . . . 94
11.25 ★PROBLEM (variance of sum of uncorrelated random variables) . . . . . 95
11.26 ★PROBLEM (the expectation of the conditional expectation) . . . . . 96
12.1 ★PROBLEM (“let there be finitely many i.i.d. r.v.s” can always be said) . . . . . 100
13.1 PROBLEM (joint distribution of n i.i.d. Bernoulli trials) . . . . . 103
13.2 ★PROBLEM (formula for the binomial distribution) . . . . . 103
13.3 ★PROBLEM (matching n men to n women) . . . . . 105
13.4 PROBLEM (estimation of the p of bin(n, p)) . . . . . 105
13.5 PROBLEM (distribution of infrequent errors) . . . . . 106
13.6 ★PROBLEM (a sure event) . . . . . 107
13.7 PROBLEM (upper integer part) . . . . . 110
13.8 ★PROBLEM (a sparse geometric r.v. assumes no specific value in the limit) . . . . . 110
13.9 ★PROBLEM (the union of uncountably many events begets monsters) . . . . . 111
14.1 PROBLEM (an unbounded density) . . . . . 113
14.2 PROBLEM (some properties of densities) . . . . . 113
14.3 PROBLEM (examples and counterexamples of densities) . . . . . 113
14.4 ★PROBLEM (zero-length sets) . . . . . 115
14.5 PROBLEM (semicircle density) . . . . . 115
14.6 PROBLEM (triangular density and its distribution function) . . . . . 116
14.7 PROBLEM (distribution function, expectation and variance of unif([a, b])) . . . . . 117
14.8 PROBLEM (unif([a, b]) from unif([0, 1])) . . . . . 117
14.9 ★PROBLEM (we can’t choose uniformly at random from the set of real numbers) . . . . . 118
List of hyperlinks
Chapter 1
Introduction
“I am a Platonist”
– David Mumford
These notes are tailored for a course titled “Probability and Statistics II”, designed for
students who have completed two antecedent courses, namely Probability I and Statistics I, or
their equivalents at other academic institutions. Additionally, participants are expected to
have a foundation in university-level mathematics.
The notes are as elementary as possible. Nothing advanced is used herein and no theorem is actually
proved!
Arguably, no other branch of mathematics combines, as probability does, profound interest with a thoroughly contemporary character: robust connections to theory and a myriad of practical applications. Paradoxically, despite its inherent mathematical rigor, this subject is frequently presented in an antiquated and sometimes erroneous manner, as if we were still in the 19th century, a period when probability theory remained somewhat of a mystery. Hilbert identified the need for a solid mathematical basis in his 6th Problem. Indeed, at the time (early 20th c.) it was not even understood that probability theory was part of mathematics or what kind of mathematics it should be based upon. Statistics had been dealt with earlier, rather haphazardly, and its connection to probability was not fully understood.
Over a century later, probability is firmly established as a pivotal subject in mathematics,
wielding substantial influence and making invaluable contributions to various branches. We
employ the term “STOCHASTICS” as a comprehensive umbrella, encompassing probability
and its myriad offshoots, including statistics. This umbrella extends over diverse areas such
as Insurance Mathematics, Financial Mathematics, Stochastic Control, Filtering, Information
Theory, Stochastic Differential Geometry, Stochastic Calculus, Statistical Physics, and more.
A solid understanding of probability is essential for working with its applications and for
having a novel point of view within mathematics itself. The 20th century could be hailed as
the era of the Stochastics Revolution,¹ underscoring the profound impact of probabilistic
thinking across disciplines.
¹ But not only. Many other scientific and mathematical revolutions occurred.
3. Engage the reader, provided he or she is willing to actively participate by reading and
solving problems thoroughly.
4. Avoid any form of cheating or, at the very least, provide comprehensive explanations
when shortcuts are taken.²
7. Emphasize to the student that the provided material offers an incomplete understanding, encouraging further learning, possibly at a higher academic level in subsequent classes.
9. Make the notes valuable for both probability and statistics, with a particular focus on
applications.
There is a belief, among some academic people, that every subject should have “theory” and “tutorials”. This is close to being absurd. There are topics that are best learned by experience rather than by “studying theory”. Elementary probability is one of them. (I am talking
about understanding and using discrete probability. The richness of the subject can greatly
be appreciated by solving lots of problems of various kinds.) For some other topics, it is
pointless to have tutorials: theory must be developed for a while before problems/exercises are
attempted. Thus, I have to struggle to differentiate between “theory” and “problems/exercises”.
For me, in particular, it is very hard because I do not see the difference. I actually never in my
life saw any difference even when I was a child in school. I solved problems and understood
theory. I studied theory and understood how to solve problems without solving them. Much
in the same way that I do not see the difference between pure and applied mathematics. (I
merely differentiate between bad and good mathematics; between interesting and boring
mathematics; between honest and dishonest mathematics.)
Here is a brief outline of the topics that will be taught.
Teaching starts with Part IV, page 99; therefore I will only outline the 10 chapters contained within Part IV.
Chapter 12 is very brief. It simply reminds the reader of some of the basic concepts and, in
particular, stresses the concept of a random variable and its role: to transform a probability
measure into another. It also gives a glimpse of the mathematical difficulties that occur when
non-discrete random variables are considered. We point out that these difficulties are precisely
² Once, in a book on probability, I read the following “theorem”: Let S, T be subsets of Rⁿ, g : S → T a bijection and X a random vector with distribution having a density fX supported on S. Then the random vector Y = g(X) has density fY given by the formula fY(y) = fX(g⁻¹(y)) det(J), where J is the Jacobian. This has problems at many levels. For example, S cannot be an arbitrary set. Despite this not being a multivariate calculus course, we cannot disregard the fact that conditions on S and g must be placed. If we aim to provide practical help to students, it is imperative not to label something as a theorem if we only pretend to be giving a proof without, in reality, giving a correct one. Such a practice is disingenuous and one that I colloquially refer to as cheating.
³ E.g., the binomial coefficient “n choose m” should be defined as the size of the collection of subsets of size m of a set of size n, and then the formula n!/(m!(n − m)!) be derived.
the same as the ones that (ought to) concern us when we state that we can toss a fair coin an
infinite number of times. The funny thing is that everyone will agree that the latter needs no
maths. (And yet it does.)
Chapter 13 talks about a sequence of independent coin tosses (which we accept exists, though I have already warned the reader in Chapter 12 that this acceptance is not something to be done light-heartedly). Through such a sequence we derive various interesting random
variables, their distributions and their relations. In a sense, this is also background: the
student has heard of these distributions before. We end up by taking a limit of a geometric
random variable when the probability of heads tends to 0 and find a random variable such
that the probability that it assumes any given value is zero.
Chapter 14 talks about “mass density” as being a positive function that has finite integral.
Through it, we can define probabilities of various simple sets (events) and I point out that
we can extend to classes of events, but, of course, I do not prove anything. I merely give
examples of interesting densities in one and higher dimensions. I also explain how densities
are transformed by smooth bijections. This has nothing to do with probability. It’s just
multivariate calculus (the same thing one does when one changes coordinates or when changing
variables in a “volume” integral).
Chapter 15 serves as a way to explain some of the things from the previous chapter. In
particular, I explain, as much as possible, that not every continuous random variable has a density. (And thus, the term “continuous”, which the student probably heard in previous classes, is different from “having a density”.) It is not hard to do that.
In Chapter 16 I attempt to stress the importance of the concept of expectation, in general, and
devote a little space to the law of the unconscious statistician. I also define independence, as
extending the notions that students already (should) know from elementary probability and
statistics. Some simple inequalities are needed in order that we later talk about the central
limit theorem and conditional distributions. Certain functions of random variables, via their
expectation, lead to the concepts of probability and moment generating function. We pass on
to random vectors and talk about their covariance matrix and its properties. We need this,
e.g., when we later discuss normal random vectors.
Chapter 17 is the fundamental theorem of probability (the strong law of large numbers)
explained in a simple case. It is needed for an understanding of the central limit theorem but
also for the whole subject of statistics. Without a good understanding of the strong law of
large numbers there’s no way to ever get beyond the belief that statistics is something magic
or (in the words of an algebraist friend of mine) that probability is a “voodoo science”.
Chapter 18 treats normal (Gaussian) random variables and vectors and explains how they
go hand-in-hand with linearity. Pay attention. Understanding this will save you a lot of time
when you later learn things in statistics. There are whole specialized courses in statistics about
linear systems with random inputs (linear regression, linear time series, and so on, people use
all kinds of names), but always normal. Why? Because the normal law is the only distribution
with finite variance that is “preserved under linearity”. If you think of a linear system as
a black box then normal input implies normal output. If you understand this then whole
courses in statistics become trivial.
Chapter 19 defines conditional expectation via geometry because that’s the only way to
understand it without having to resort to special cases and ugly formulas whose meaning
isn’t clear. And I do so because we need to talk about conditional probability but also make
computations easy. One reason that conditional expectation must be introduced is that, without it,
dealing with conditional distributions of normal random variables can be a mess. A second reason is
that it is often impossible to compute the expectation of a function of several random variables without
conditioning on some of them. See the epigraph on Chapter 19!
Chapter 20 explains the central limit theorem in a simple case. I am constrained not to
use anything else but a moment generating function, and this makes it hard to prove much
in generality. The central limit theorem is the oldest “universality” result. It is very robust
and works (again because normality and linearity go hand in hand!). We exemplify this via
confidence intervals.
Chapter 21 talks about a class of distributions on R with densities, as required by the
syllabus. But I don’t just give formulas. I explain their properties and their use. And I never
ever define something by saying “here’s a formula for a density, learn it by heart”. Because
this is boring, silly, and totally uninformative. Rather, I derive formulas. In particular, we
show that the sample mean and sample variance, for i.i.d. normal random variables, are
independent; we explain this; we don’t merely say “it is, believe me because some authority
says so”.
Chapters 22 and 23 are optional.
Various devices I used in these notes are explained below under the title “Guide”.
– I have added lots of “problems” (= exercises) that you must do while you’re reading the notes.
Exams will be based on them. I have added a list of all the problems in the beginning.
– I have made the notes interactive: you can click and be transferred internally or externally. A
list of all external links is in the beginning.
– Bits and pieces that are to be emphasized appear in blue.
– I use the word “theorem” only for serious things, none of which is proved herein. I have
added a list of all the theorems in the beginning.
– There is an index at the end that is also interactive.
– The table of contents is also interactive: it transfers you to the place you wish to be transferred
to.
– There are historical notes and anecdotes, optional of course, but my goal is to show you that
humans have known some things for a long time (100 years, 300 years, 2000 years, . . .). So if a
human could work out some simple problem 2000 years ago, how is it possible that we can
ignore this nowadays?
Basically, I tried hard to make these notes independent of the system used, called “canvas”,
because, in my experience, canvas is rather useless when it comes to mathematics. It is
designed for other subjects.
Problems The notes are interspersed with problems that form an integral part of the material
taught. In a previous version I called many of them examples. But by calling them problems I
wish to stress the fact that the student must do them on his/her own. I provide answers to all of
them.
Actually, “problems” are often part of “theory”.⁴ In that sense, omitting them implies failure to understand what’s going on. In particular there are ★-starred problems. The presence of a ★ does not mean they are harder. It just means that they must be done, or at least read, for
logical continuity.
Here is what you SHOULD DO when you encounter an item called “Problem”.
1) Read the statement carefully and make sure you understand it. If you don’t understand
it then one or more of the following may be true: (a) you haven’t understood the lecture notes prior to the problem statement, (b) you have difficulty with the English language,
(c) you haven’t understood concepts in elementary probability (e.g., discrete random
variables) or elementary maths (e.g., calculus).
2) Then hide the answer to the problem; discipline yourself and do not look at it. Solve the
problem by yourself. By solving the problem we mean providing an answer that is often expressed in the English language. The answers are always short. Most of the time, a few
lines suffice.
4) Keep a notebook (electronic or on paper, whatever you like) recording all the problems
you have solved.
5) When you ask questions, the first thing I will ask you is to show me your homework–your
notebook with the problems you have solved. If you have solved a few or none, then
chances are that you won’t be able to catch up later.
Cheating Teaching at this elementary level must necessarily involve some mathematical
cheating. I will avoid it as much as possible, but I will tell you where I cheat so that, later, you
may have a chance to learn more properly, if you ever get the chance because, nowadays,
cheating is almost a norm in mathematics teaching worldwide.
Terminology I use the word “probability measure” in order to indicate the assignment
of probabilities to events. There is no measure theory in these notes, nor any advanced
mathematical analysis. I am just using some words to reflect modern usage. Terminology,
in the subject of Probability and Statistics is largely anachronistic, that is, old-fashioned, and
often misleading. I do not offer any apologies for using more appropriate, for the 21st century,
⁴ I don’t understand the difference between “theory”, “applications”, “examples”, “problems”, “tutorials”, etc. It’s all part of one and the same thing.
terminology. For example, lots of people say “continuous” random variable when they mean
“absolutely continuous”. I shall use the latter.
Parentheses and brackets I detest using parentheses (especially when they go deeper than
level 1) because I can’t parse them. So I often simply omit them. If (a, b) is an interval and P a
probability measure I sometimes write P(a, b) instead of P((a, b)). If {x} is a set containing one
point I write P{x} instead of P({x}). The use of parentheses is often ambiguous. For example
(a, b) may mean the set {x ∈ R : a < x < b}, as above, or it may mean an element of R². But everything should be clear from the context. I also write EX instead of E(X) and E∫₀¹ X(t) dt instead of E(∫₀¹ X(t) dt). So instead of writing E(XY) = (E(X))(E(Y)), and getting swamped by the parentheses, I simply write E(XY) = (EX)(EY), and it’s clear what I mean. But I do write P(A) and almost never write PA (although, perhaps, I should). Events are sets so I use curly brackets: {...}. Curly brackets indicate that we are dealing with an unordered collection. If {X ∈ B} is an event then I should write P({X ∈ B}), but I don’t: I write P(X ∈ B). Apologies.
Other symbols I don’t bother to denote probability measures with any special font. I use P,
or I use Q or other letters. I write expectation as E or, when I need to emphasize the probability
measure, say Q, I write EQ .
Method used in writing these notes; interlard and repetition These notes have been written as a textbook, that is, they are supposed to be read. In particular, skipping reading problems and solving/answering them often means failing to ★understand and thus failing to ★know and ★learn. The reason that problems are included and that I insist you do them is that, unlike in the past, students nowadays need to be told what to do in order that they ★learn. In the past, a student was well aware that he/she had to solve problems and spend hours/days/weeks making sure that they ★understand. Since, as I am being told, this is not the case, it behooves me to interlard the notes with details, such as trite problems, and with repetition of concepts
and facts. I mentioned that the pink pages, that is, PARTS I, II and III, contain material that is NOT in the syllabus. However, the subject that the syllabus asks me to teach cannot be ★understood without these parts. A teacher can, of course, cheat and ask students to “memorize the formula (a + x)⁴ = a⁴ + 4a³x + 6a²x² + 4ax³ + x⁴” and then, when I tell you “apply it when a = 1 and x = e⁻ᵗ”, you can do it; but that is cheating on the part of the teacher (because he/she is asking you to behave like a machine) and that I will not do, not only because I don’t want to, but also because I can’t: my memory is extremely weak, I remember hardly anything, I can only handle logic, and this is why I must explain everything, even to myself, at all times.
Part I
Chapter 2
Notes for the instructor of probability in maths
a) CALCULUS: That is, they know what sequences and their limits are, what continuity
and differentiability mean, what an integral is (the Riemann integral that we teach in
Calculus is enough).
Perhaps the most important assumption I will make is that students know what a
function is!
e) COUNTING: Finding the size of a finite set. Probability, at its very elementary level, is
all about counting and then about integration. In fact, integration is nothing else but
“counting continuously”.
Whether the students actually know these topics or not is clearly not dependent upon the
present notes. I am merely stating that there is no way for a student, especially in mathematical
sciences, to understand the topic of this course without familiarity and working knowledge of
the above.
Students are often being (or have been) exposed to misconceptions. We should avoid them
here as much as humanly possible.
(A) Random variables: Some people think that the term “random variable” has something
to do with randomness (it does not) and that it is a “variable” (it is not, because the term
“variable” is quite ambiguous). We should emphasize that a random variable is a function
and that’s that. Such an object’s role is principally to transform a probability measure P
on its domain into another probability measure Q on its range. More often than not, Q
is given and an X and a P must be found.
(B) Frequencies: Some people think that probability can be defined as a limit of the
frequency of occurrence of an event. Lots of people tried to create a foundation of
probability theory in this way, but they failed. In fact, it has been proven that such a
definition is impossible.
(C) Finite additivity alone is wrong: Some people teach that probability is only finitely additive, namely that, for disjoint events A₁ and A₂ we have P(A₁ ∪ A₂) = P(A₁) + P(A₂), but they say nothing about countable additivity. If this is the only property they introduce then those people must be very careful not to start cheating. For example, if the syllabus specifies that densities should be taught then the instructor who only uses finite additivity is cheating. Since the syllabus specifies that densities must be taught, the instructor has no right to omit countable additivity.
(D) Concepts exist independently of one’s ability to calculate: Some people teach students
concepts, such as the expectation of a random variable, in a way that students do not
understand the concept independently of their ability to calculate it. This is, e.g., a common
practice in teaching Calculus. The result is that when you ask students “what is the
integral?” they typically reply by a question: “of which function?” And if you insist “I
am asking you to tell me what the concept of the integral is”, then you may get the silly
reply “the area under the curve”, at which point you ask them to “define the concept of
the area” and you have a vicious circle. The vicious circle won’t be resolved in these
notes but the students must be made aware of its existence.
(E) Don’t always use a single letter for all probabilities: Since probability is a function
from a set of events to real numbers between 0 and 1, students should learn that it is not
necessary to always use the same letter, typically P or Pr or Prob for it. Many times we
need to consider different probability functions (Statistics, for example, is all about sets
of probabilities!) Using the same letter always is as funny as using the word fun for
every function you encounter!
(F) Probability is a trivial axiomatic system: Some people do not emphasize this fact.
Probability is merely a function P from a set of sets (called events) into the nonnegative
real line such that
P(Ω) = 1, where Ω is the set of all possible outcomes, and
P(A₁ ∪ A₂ ∪ · · ·) = P(A₁) + P(A₂) + · · · whenever A₁, A₂, . . . are pairwise disjoint events.
Students must be told that everything in Probability, Statistics, and any topic following
from them, merely needs these two axioms (and, of course, whatever can be introduced,
via definitions from them, and whatever can be proved from them).
(G) Confusion between a number and a function: This is happening all the time. There is
a confusion between a function, f , say, and its value b = f (a) at the point a of its domain.
This appears at least twice in the teaching of probability/statistics:
(i) Random variables, say X, are being treated as numbers. Students must be told they
have to understand the difference between X, as a function, and X(ω) at the point
ω. Random variables are functions and that’s that. Granted, they are functions whose range is of more importance, and a probability measure (the so-called
distribution) sitting on the range is even more important, but that’s part of the
experience, it’s not part of the definition.
(ii) Probabilities are being treated as numbers. Students must be told they have to
understand the difference between P, as a function, and P(A), as the value of P at
the point A (a set/an event) of its domain. Since, for statistics, it is very important to
deal with a variety of functions P, the point is important. I use the word “probability
measure” when I’m referring to P as a function and the word “probability” for
numbers between 0 and 1.
(H) The law of large numbers: Here are some of the false beliefs that some people have about the law of large numbers:
False belief 3. It is a law of nature, such as, say, Newton’s law F = GM₁M₂/r², that can’t be proved (within Classical Mechanics) but is consistent with any observation.
So if the students aren’t taught that the law of large numbers has nothing to do with the
word “law” (and nothing to do with large numbers either) they will carry these silly
misconceptions/beliefs. To get rid of those, students should understand the law of large
numbers as a result within the probability theory framework: The law of large numbers
is a theorem, the fundamental theorem of probability. In fact, it is silly to teach the
central limit theorem without the law of large numbers. It is as silly as teaching how to
expand a function f in a Taylor series up to order 2, but only explaining to the students how to calculate the second-order term (that is, f″(x₀)/2) and not the first-order one (f′(x₀)).
Chapter 3
Notes for the student of probability in maths
“’Tis the good reader that makes the good book; a good
head cannot read amiss: in every book he finds passages
which seem confidences or asides hidden from all else
and unmistakably meant for his ear.”
– Ralph Waldo Emerson
3.1 Background
Listen carefully to what Feynman said.
Scene
Reporter walks into Feynman’s¹ room and demands that Feynman explain what the force between two magnets is:
Reporter: If you get hold of two magnets and you push them you can feel this pushing
between them, turn them on the other way and they slam together now; what is it the
feeling between those two magnets?
Feynman replies by explaining to the reporter that before asking a question he must have a
basis for asking it. Anyone can ask any question, but without any prior knowledge there is no
way for the responder to give an honest answer. After a few minutes of brilliant reply, that
you can watch and listen to here, Feynman concludes by telling the reporter:
¹ Richard Feynman (1918–1988), American theoretical physicist. You can learn basic physics from his lecture notes.
Feynman: I really can’t do a good job, any job, of explaining magnetic force in terms of
something else that you’re more familiar with; because I don’t understand it in terms of
anything else that you’re more familiar with.
In other words, Feynman tells the reporter that it’s fine to ask a question but the answer
he will get depends on what the reporter’s prior knowledge is (things he is familiar with).
If he has no basis for understanding, if he is not familiar with any elementary physics and
elementary mathematics, if the things he is familiar with are insufficient, then Feynman cannot
explain what a magnetic force is.
For example, the reporter may have gone to a high school where they do not teach any
physics or any mathematics. Or maybe they teach but they do a very poor job because the
teachers themselves do not know. The reporter is then in an unfortunate situation in that he
may never be able to understand the magnetic force. Or, to turn that positively, the reporter may understand that he will never understand and turn to something else, e.g., interview a businessman.
a) CALCULUS: Sequences and their limits, continuity and differentiability, integration (the
Riemann integral that we teach in Calculus is enough).
Perhaps the most important assumption I will make is that you know what a function
is!
e) COUNTING: Finding the size of a finite set. Probability, at its very elementary level, is
all about counting and then about integration. In fact, integration is nothing else but
“counting continuously”.
such a group. When I was in Sweden, I encountered a group of people who call themselves
“pedagoger”² and promote memorization and vocational training as a means for learning.
Maybe this is the case in some disciplines, but not in Mathematics and therefore, not in
Probability or Statistics.
Mathematics is hard and requires true teaching.³ It is not easy for those who do not grasp it from the start, but it becomes easy once you understand how to learn it. In particular, learning it requires never cutting corners and never letting logic slip away in the slightest. Once you get used to learning mathematics, it becomes a game. You can continue learning it effortlessly. 10 pages of mathematics will then be equivalent to 1. You need to think about how you learn and learn how to know. And this cannot be achieved by taking everything you read in mathematics books at face value. Only when you reach a stage where everything appears trivial or obvious will you have truly learned something.
We live in an era where cheating and dishonesty are rampant and propagate at the speed of light (the speed at which packets of information are transferred over fiber optic cables). In the era of “surveillance capitalism”, that is, an economy based on making money out of nothing or from collecting and selling people’s personal data, beliefs, opinions, etc., it is very easy to get scammed by, say, Internet sites that promote teaching in the same way that the aforementioned Swedish group does. These Internet sites are not interested in true teaching but in increasing their profit, either directly (by asking you to pay) or indirectly (by collecting your views and data); they achieve this by convincing you, via a variety of methods (“awards” of sorts is one of them), that you have learned a topic. But of all disciplines, mathematics has the characteristic of forcing its adherents to be honest⁴ (a necessary, but not sufficient, condition for true engagement with mathematics). It is because of this property that I need to point out the pitfalls I mentioned and the hijacking of important words such as “understand”, “know” and “learn”.
Let me then attempt to define them. Since I don’t want to invent new words, I will add a
star as a prefix to each of the hijacked words in order to indicate that I am talking about their
original meaning, rather than the one attached to them by various groups of people.
Definition 3.1 (★know, ★understand and ★learn, for mathematics). In mathematics, the verbs ★know and ★understand are equivalent:
★know ≡ ★understand
² They have even hijacked this perfectly legitimate word and altered its meaning.
³ Mathematics comes from the Greek word “mathema” (μάθημα), which means “lesson”, both in ancient and modern Greek. Anatolius of Laodicea (ca. 200–283 AD, also known as Anatolius of Alexandria) wrote:
“Why is mathematics so named?” The Peripatetics say that rhetoric and poetry and the whole of popular music can be understood without any course of instruction, but no one can acquire knowledge of the subjects called by the special name mathematics unless he has first gone through a course of instruction in them; and for this reason the study of these subjects was called mathematics.
⁴ Some will even go further and claim that morality is strongly associated with mathematics: “I’m not interested in [mathematics], I’m interested in morality”, replied the famous mathematician Alexander Danilovich Alexandrov when asked what attracted him to the subject. Anatolii Vershik correctly stated that mathematicians “have a very clear sense of right and wrong”. This statement is equivalent to “if one does not have a clear sense of right and wrong then one is not a mathematician”. And so the appearance of the word “cheating” (or, rather, the need that we should reject it), mentioned above and sporadically in these notes, is fully justified. (I thank Professor Aram Karakhanyan of the University of Edinburgh for pointing out Alexandrov’s statement.)
To ★learn is the process you go through in order that you ★understand (and hence ★know).
PROBLEM 3.1 (−1 times −1 equals +1). Fact: (−1)(−1) = 1. You were told this at school. So you claim you know it. But you do not ★know it if you cannot explain it in any other way than saying “the teacher told me so”. So why is (−1)(−1) = 1?
Answer. I trust you ★know why. But let me spell it out. The symbol −a is the “additive inverse” of the number a. That is, to every (real) number a there corresponds a unique number denoted by a′ such that a + a′ = 0. (I temporarily denote −a as a′.) So 1 + 1′ = 0. Hence 1′ · (1 + 1′) = 1′ · 0. But 1′ · (1 + 1′) = (1′ · 1) + (1′ · 1′) and 1′ · 0 = 0. We thus have (1′ · 1) + (1′ · 1′) = 0. But 1 is the neutral element of multiplication: multiplying a number by 1 leaves the number unaltered. So 1′ · 1 = 1′. So we have 1′ + (1′ · 1′) = 0. And so 1 + (1′ + (1′ · 1′)) = 1 + 0 = 1 or (1 + 1′) + (1′ · 1′) = 1. But 1 + 1′ = 0 so 0 + (1′ · 1′) = 1, which means 1′ · 1′ = 1. And this is why (−1) · (−1) = 1.
PROBLEM 3.2 (no need for formulas). You probably know that the quadratic equation ax² + bx + c = 0 is solved by the formula x = (−b ± √(b² − 4ac))/2a. Memorizing the formula and simply applying it in special cases means that you do not ★know it. But if you can explain the formula then you ★know it and you need not remember it. E.g., how do you solve the equation x² + 6x − 1 = 0 without a formula?
Answer. You simply observe that the second coefficient is twice 3, so you add and subtract 3² to obtain the equivalent equation x² + 6x + 3² − 3² − 1 = 0 and then you see that the first three terms equal (x + 3)², so you equivalently get (x + 3)² = 3² + 1 = 10. If you know the square of a number then the number is the square root of its square or minus that, so x + 3 = ±√10 and so x = −3 ± √10.
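If you want to see the two numbers at work, here is a tiny numerical check of mine (it is not part of the notes): it simply verifies that x = −3 ± √10, obtained above by completing the square, really are roots of x² + 6x − 1.

# A quick numerical check (illustration only) that x = -3 ± sqrt(10)
# are the roots of x^2 + 6x - 1 = 0.
import math

for x in (-3 + math.sqrt(10), -3 - math.sqrt(10)):
    print(x, x**2 + 6*x - 1)   # the second printed number should be (numerically) 0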
PROBLEM 3.3 (half-knowledge is often worse than no knowledge). If you have been told that independent events are “events that do not influence one another” then you have been told nothing of much value, because you may know how to tell someone what you have been told, but you still do not ★know the concept.
Answer. See Sections 11.3 and 11.4 for the cases, hopefully already known to you, of events and simple random variables. For other cases, wait until you ★learn it in this course. It will be covered in Sections 14.5.3 and 16.3.
PROBLEM 3.4 (human vs machine). Your boss tells you that your task is to decide whether the average annual temperature in a certain region is 10°C, and be 99% confident you are right. He sends you to a training course where you are taught the steps: (a) collect data, (b) compute the value of a certain function of the data and (c) compare this value to a number someone wrote in a table. You memorize the steps and apply them to complete your task. Whereas you may feel you have learned statistics, in reality you have ★learned nothing. You just understood how to blindly apply what someone else told you. Indeed, you cannot explain to me or even to yourself why these steps achieve the goal. A robot can be taught to do the same thing but cannot explain anything. So what is the difference between a robot/machine and a human being, especially in our information/data–dominated era?
Answer. Machines can’t think but, thanks to semiconductor electronics technology, we have
built computers that have immense processing speeds, immense storage, are linked via optical
cables or wirelessly, and can perform amazing tasks. But any machine, including those for
which the terms “smart” or “intelligent” are used, cannot ★understand what they’re doing.
That’s left for humans. In fact, machines have freed humans from the burden of performing
endless tasks, so humans, especially those who choose to study mathematical sciences have
now, more than ever in the past, the luxury to think and program the machines to perform
incredible tasks.
PROBLEM 3.5 (poetic answers may be beautiful but seldom precise). If I ask you whether you know what a normal distribution is and you reply “it’s a bell-shaped curve” then you don’t know it. If you reply it’s one whose density is given by the formula f(x) = (1/√(2πσ²)) exp(−(x − µ)²/2σ²) then you know the formula because you can memorize it, but you still do not ★know what it means. What is, then, a non-poetic answer to the question?
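A side remark of mine (not part of the notes): once the density is written down correctly, you can at least check numerically that it is a genuine density, i.e., that it integrates to 1. The values of µ and σ below are arbitrary choices.

# Numerically integrate the normal density over a wide interval around mu;
# the result should be very close to 1.  mu and sigma are arbitrary sample values.
import math

mu, sigma = 1.5, 2.0

def f(x):
    return math.exp(-(x - mu)**2 / (2 * sigma**2)) / math.sqrt(2 * math.pi * sigma**2)

dx = 0.001
total = sum(f(mu - 10*sigma + k*dx) for k in range(int(20 * sigma / dx))) * dx
print(round(total, 6))   # ~1.0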
3.3 Advice
The pink pages and the appendices As stated at the very beginning, the pink pages and
the appendices contain material that is logically needed for the proper understanding of the
topics of this course. The latter are contained in Part IV–the white pages. I know of no way
to teach the topics of the syllabus, that is, Part IV, without assuming that the student has an
understanding of the pink pages. I could have chosen not to include the pink pages. I did
it only because I wanted to give the reader the opportunity to find this material in a single
document.
It makes no sense to ★learn statistics without ★learning probability. In fact, it is rather futile to claim that you have ★learned statistics without probability. Hence I will not assume any knowledge of statistics. You can ★learn it afterwards.
Badly learned things. Based on questions that students frequently ask me, I realize that you may have ★learned things incorrectly. If this is the case, then you should forget badly-learned things and start anew. See quote (Q1) above.
When you ★learn something properly, you are doing yourself a favor: you save yourself from a lot of hard and boring work.
Read the notes not in the same way you read a novel. Regarding these notes, like anything
that concerns mathematics, one should read them in a very different manner than reading
prose. You may, for example, wish to scan a chapter first to get an idea, then move to another,
and then go back to the first one when you need to. Mathematics is not read sequentially. If you
don’t understand something at first reading, make sure you go back in order to understand.
Different people read in different ways and you must assume responsibility for the reading
you do. See quote (Q2) above.
Ability to think. The best tool for reading these notes is having familiarity with thinking.
If you have difficulty with the thinking faculty and with reasoning then mathematics, and
especially probability, is not a good subject for you.
You need to be able to understand English. Thinking cannot be done without very good
familiarity with a natural human language; which, by convention, is English in our case: The
language of instruction, as well as the language used in these notes is English. Therefore,
knowledge of English is a sine qua non. In fact, these notes have been written in a way that
mathematical logic is expressed via English. This means that more often than not the verbal
passages of the notes convey as much important information as the formulas. See Item V.
above. Hence, if you fail to read and understand the non-formula part of these notes, you
may fail to understand the mathematics as well.
Mathematics is not a collection of formulas. Last but not least, I try as much as I can
to convince you that formulas are not, per se, of primary importance. Formulas appear as
an expression of our thoughts. Therefore, when someone asks me questions like “which
formula should I use to solve the problem” or “which distribution should I pick to find this
probability”, then I already ★know that this person has not ★learned and needs to try again.
Formulas do not appear ex nihilo.
Freedom of the individual. I’ve always made the assumption that individuals are free at
all stages of their lives. But freedom comes with responsibility. The individual person is
a free and responsible agent determining their own development through acts of the will.
In particular, I will try to teach you what I’ve taught for several decades, but if you are not
interested in ★learning, then it’s OK with me. After all, it is your decision to be a student in a university. I will help you with answers to your questions, but I will assume that you have a
basis for understanding. Lack of such a basis requires strength of character, that is, realization
of this lack, and steps towards obtaining it.
Chapter 4
Notes for anyone who speaks a human language
II “The number of insects in the world exceeds the number of grains of sand.”
IV “Of all the 100 people in the theater tonight, the one who sits closest to the stage is the
tallest.”
V “I will toss a fair coin 100 times and will obtain at least 5 consecutive heads.”
Figuring out the probability of Sentence I is a very complex task. It requires understanding
of physics and mathematics describing the climate as well as historical data. Currently, the
average temperature is 15°C. Some people place the probability of statement I at 90% or
higher.
Sentence II clearly has probability 0 or 100%. But this requires counting all insects and all
grains of sand. Instead we could count carefully selected samples and infer the probability
from this data. The probability will include the uncertainty introduced by the sampling.
Sentence III requires a model for the place we invest at and a lot of information. If we have
little information, the model may not be good and the probability we derive from this model
might not be accurate.
For IV, if we assume that all people have different heights and if we assume that they sit at
random, then it is rather intuitive that the answer will be 1%; and this you should already
★know.
For the last sentence V, assume that all possible patterns of coin tosses are equally likely.
By “pattern” we mean a sequence of 100 heads or tails, e.g.,
HTHHHHTTTTHTHTTHTHTHTHTHT · · · HHTTHTTTHTHTHTTTTT
We have
no. of all possible patterns = 2¹⁰⁰ = 1,267,650,600,228,229,401,496,703,205,376.
Then count those patterns that contain at most 4 consecutive heads. I counted them and found 156,242,900,686,472,853,807,378,029,578 (if you want to understand how I counted them, see Problem 4.4). Divide the two to find the probability that V is not true and then subtract from 1 to find that

V has probability = 1 − 156,242,900,686,472,853,807,378,029,578 / 1,267,650,600,228,229,401,496,703,205,376 ≈ 88%.    (4.1)
PROBLEM 4.1 (are you surprised that you got 5 heads in a row?). Are you surprised by
the fact that in 100 fair coin tosses the chance that you will have at least 5 consecutive heads is
so high?
Answer. Yes, I didn’t expect that.
PROBLEM 4.2 (is your coin fair?). If someone tosses a coin and gets fewer than 5 consecutive
heads, what will you say about the fairness of the coin? (This is what statistics is about.)
Answer. I’m more confident than not that the coin is not fair.
I now wish to explain how I obtained (4.1). Since the problem seems difficult, I simplify it:
PROBLEM 4.3 (probability of 2 consecutive heads). Explain why the number of arrangements of 100 coins in a row so that there are at most 2 consecutive heads equals 249,483,823,285,270,218,137,723,786.
Answer. To solve this replace 100 by the symbol¹ n, representing a positive integer, and let aₙ be the number of arrangements. Consider the sentences
Hₖ = “there is a head at the k-th position”
and
Tₖ = “there is a tail at the k-th position”.
Clearly,
Tₖ is the negation of Hₖ.
Either Tₙ holds or Hₙ holds. (By “holds” we mean “is true”.) If Tₙ holds then the number of arrangements is aₙ₋₁. If Hₙ holds then either Tₙ₋₁ holds or Hₙ₋₁ holds. If Hₙ & Tₙ₋₁ holds then the number of arrangements is aₙ₋₂. If Hₙ & Hₙ₋₁ holds then, necessarily, we must have that Tₙ₋₂ holds and so the number of arrangements is aₙ₋₃. Since
“either Tₙ or (Hₙ & Tₙ₋₁) or (Hₙ & Hₙ₋₁)” is a true sentence
while
“Tₙ & (Hₙ & Tₙ₋₁) & (Hₙ & Hₙ₋₁)” is a false sentence
we have
aₙ = aₙ₋₁ + aₙ₋₂ + aₙ₋₃.
We clearly have a₁ = 2 and a₂ = 3. Adding these gives a₃ = 5 and so a₄ = 3 + 5 = 8 and so on, a₁₀₀ = 249,483,823,285,270,218,137,723,786.²
Having solved the simplified version, we pass on to the actual question related to (V).
PROBLEM 4.4 (in how many coin arrangements do you have at most 4 consecutive heads?). Explain how to compute the number of arrangements of 100 coins in a row so that there are at most 4 consecutive heads.
Answer. Just as before, we should consider statements involving the last 5 coins. Replacing 100 by n, and with the same notation as above, we have that
“either Tₙ or (Hₙ & Tₙ₋₁) or (Hₙ & Hₙ₋₁ & Tₙ₋₂) or (Hₙ & Hₙ₋₁ & Hₙ₋₂ & Tₙ₋₃) or (Hₙ & Hₙ₋₁ & Hₙ₋₂ & Hₙ₋₃)” is a true statement.
(By saying either “A” or “B” or “C” is true, we mean, as is standard in the English language, that one and only one of them is true. For example, “I’ll either go to school or I’ll go to the movies” means that I won’t do both.) Considering each of these 5 cases, we get
aₙ = aₙ₋₁ + aₙ₋₂ + aₙ₋₃ + aₙ₋₄ + aₙ₋₅,
and you can check by hand that a₁ = 2, a₂ = 3, a₃ = 5, a₄ = 10, a₅ = 20. And you can continue the additions until you find that a₁₀₀ = 156,242,900,686,472,853,807,378,029,578.
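If you like to check such computations on a machine, here is a sketch of mine (not part of the notes): it counts the arrangements by brute force for small n and then extends the count with a recurrence of the kind described above. The seed values depend on how the small cases are counted, so treat the printed numbers as illustrative rather than as a restatement of the text.

# Count length-n head/tail strings whose longest run of heads is at most max_run:
# brute force for small n, then the (max_run + 1)-term recurrence.
from itertools import product

def count_brute(n, max_run):
    """Enumerate all 2**n strings and keep those whose longest head-run is <= max_run."""
    ok = 0
    for s in product("HT", repeat=n):
        run = longest = 0
        for c in s:
            run = run + 1 if c == "H" else 0
            longest = max(longest, run)
        ok += longest <= max_run
    return ok

def count_by_recurrence(n, max_run):
    """Seed the first max_run + 1 values by brute force, then sum the last max_run + 1 terms."""
    k = max_run + 1
    a = [count_brute(m, max_run) for m in range(1, k + 1)]
    while len(a) < n:
        a.append(sum(a[-k:]))
    return a[n - 1]

good = count_by_recurrence(100, 4)   # arrangements with at most 4 consecutive heads
print(good, 1 - good / 2**100)       # and the probability of at least 5 heads in a row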
The sentences I spoke about above can be called events. I temporarily state (but we need to wait for mathematics to kick in to make things precise) that:
events are identified with sets.
Since there is no such thing as the totality of questions we can ask or statements we can
make about everything in the world, past, present or future, we content ourselves to events
whose totality is well-described. On purpose, I am not yet telling you what “well-described”
means. However, I’m sure you’ve heard of the term propositional logic (or sentential logic
or statement logic) which is the branch of logic formed by joining entire propositions or
statements to form more complex ones, their relationships, and their veracity or not. Actually,
if your mental faculties are that of the majority of human beings then you’re speaking in a
natural language (e.g., Nahuatl³) and are thus able to form sentences such as “if the Earth is flat then Newton’s law doesn’t hold”.⁴
Let us, for now, use the letter P for denoting Probability.
Probability obeys certain rules, rules that respect the rules of logic. Here are a few.
1. We express the values of P by a percentage, that is, a real number u between 0 and 100 and
write P(α) = u%. Of course, this means nothing else but P(α) = u/100. So P(α) is taken to
be a real number between 0 and 1.
5. The “or” in the sentence “α or β” is taken to mean α or β or both. (In real life, “or” often means exclusive or, but not here.) The “and” in the sentence “α and β” means what it means in real life, that is, “both α and β”. We have the rule P(α or β) + P(α and β) = P(α) + P(β).
8. The sentence α =⇒ β is a composite sentence which means “if α then β”. That is, “α
implies β”. We have the rule P(α =⇒ β) = P(not(α)) + P(β) − P(not(α) and β).
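These rules are easy to test on a small example. Here is a sanity check of my own (not part of the notes): take a ten-point sample space with the uniform probability measure, interpret statements as subsets, and verify the two displayed rules.

# Verify the quoted rules on a toy uniform probability space.
from fractions import Fraction

omega = set(range(1, 11))                    # ten equally likely outcomes

def P(event):
    return Fraction(len(event), len(omega))  # the uniform probability measure

A = {w for w in omega if w % 2 == 0}         # statement alpha: "the outcome is even"
B = {w for w in omega if w > 6}              # statement beta:  "the outcome exceeds 6"
notA = omega - A

# P(alpha or beta) + P(alpha and beta) = P(alpha) + P(beta)
assert P(A | B) + P(A & B) == P(A) + P(B)

# P(alpha => beta) = P(not alpha) + P(beta) - P(not alpha and beta),
# reading "alpha => beta" as the statement "(not alpha) or beta".
assert P(notA | B) == P(notA) + P(B) - P(notA & B)

print(P(A | B), P(notA | B))                 # 7/10 and 7/10 in this example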
We can keep going on and create more rules. If you study the rules above you will realize
two things.
1. First realization: the rules are not independent. Some follow from others.
2. Second realization: there are more rules you can make; more complicated ones (how
many? is there a limit?)
³ There are about 1.7 million speakers of this language nowadays.
⁴ About 1 to 2% of Americans, that is, 3 to 5 million people, believe that the Earth is flat, but beliefs are irrelevant, especially when they lead to contradictions.
1. First question: what is the minimal set of rules you can have?
2. Second question: what is the totality of things that I can put as an argument in P? To put it
otherwise, when we write P(α) for a statement or sentence α, what things of the type α am
I allowed to put inside P?
The first question is not difficult to answer. The second question requires practice.
Fortunately, people have been practicing with this sort of thing for about 500 years. They
seriously started asking these questions though in the 18th c. And really more seriously in
the 20th c. And humans, that is, us, came up with the answers that (no surprise here) require
mathematics.
If you only use rules such as the above, you can still figure out lots of probabilities of
lots of events. People do so in practice, e.g., lawyers and businessmen, but, unless they’re
mathematically trained, they work in a kind of cloud. Within this cloud they can correctly
answer some simple questions (but may be unable to tell whether their answer is true or not)
but they cannot envision what this thing (called Probability) they work with really is, neither
can they envision the kinds of unimaginably complex and beautiful problems they can deal
with. In all fairness, they don’t have to. A lawyer, for instance, is only interested in figuring
out the chance that his client has committed a crime. If he can reasonably convince himself
that the chance is less than 0.1% then he will be defending him enthusiastically. But if the
chance is more than 70% then he’s only doing his job in defending him.
Before we go on, you need to understand that Probability Theory (and several other
domains that depend on it, e.g., Statistics, Random Processes, Stochastic Geometry, Filtering,
Control of Industrial Processes, Decision Theory, Game Theory, and others) is a branch of
Mathematics, as rigorous as, say, Mathematical Analysis or Algebra. In fact, Probability
Theory has progressed so much that we can now use its tools in other areas of Mathematics
too. For example, we can show that every polynomial of degree d ≥ 1 has exactly d complex roots, counted with multiplicity
(this is called the Fundamental Theorem of Algebra), using Probability Theory and then go on
to compute them using Probability Theory as well.
We summarize:
PROBABILITY assigns values P(α) to statements α such that rules such as the
above hold. In particular, the probabilities P(α), P(β), etc., assigned to statements
α, β, etc., must respect the rules of logic, that is, the ways that the statements relate
to one another.
All that was fine, but to make things precise and to be able to deal with probability properly
we must let mathematics enter. And this we can do, because this is not a course “for anyone
who speaks a human language”, as the title of this chapter states, but a course for university
students in mathematical sciences.
Part II
Chapter 5
Events, sets, logic and the language you speak
Scientists have realized, by experience, that objects called sets are described by sentences.
There is no way to define the concept of a set, but we know what it is by its use. We can say
that
a set is a collection of objects, called its elements.
(The elements can be sets themselves.) But once you accept the concept of a set then, using it,
you can do an awful lot of mathematics and physics and biology and even be able to be a
rational politician.
We say that an element a belongs to a set A (or not) and we write this as
a ∈ A
(or a ∉ A). A set A is a subset of a set B (and we write this as A ⊂ B) if whenever x ∈ A then x ∈ B. I repeat:
x ∈ A ⇒ x ∈ B.
In words, all elements of A are elements of B. You see, I used the notation:
α⇒β
PROBLEM 5.1 (sentences and sets). Consider the sentence “if you’re human then your body
contains a molecule called DNA”. Express this as a relation between two sets.
Answer. Let H be the set of humans in the Universe. Let D be the set of objects in the Universe
that contain DNA. We just said that H ⊂ D.
PROBLEM 5.2 (describe your sets logically). Let N be the set of positive integers. The
set A of all prime numbers is an example of a subset of N. You see, describing A as a list,
2, 3, 5, 7, 11, 13, 17, 19, 23, . . . is impossible. Why? How can you describe A by logic?
Answer. It is impossible because we don’t even have a clue what large prime numbers are
(there is no formula). But if we define the sentence
and that
Hence you immediately have that A ∩ B ⊂ A ∪ B. If A₁, . . . , Aₙ are sets then we write A₁ ∩ · · · ∩ Aₙ without parentheses because we know that, just like in English, the conjunction “and” is associative: “(she’s smart and she’s tall) and she’s from Mars” is the same sentence as “she’s smart and (she’s tall and she’s from Mars)”. We also write this as ⋂_{i=1}^n Aᵢ. But then
x ∈ ⋂_{i=1}^n Aᵢ iff x ∈ A₁ and · · · and x ∈ Aₙ, which means that for all i we have x ∈ Aᵢ.
What happens if we have infinitely many sets, say Aᵢ, where i ∈ I and I is another set? Well, we define ⋂_{i∈I} Aᵢ as the set containing all elements x for which, for all i ∈ I, we have x ∈ Aᵢ. Similarly, we define ⋃_{i∈I} Aᵢ by replacing the quantifier “for all” by the quantifier “there exists”; so
⋃_{i∈I} Aᵢ is the set of all x such that there exists i ∈ I for which x ∈ Aᵢ.
You then need to remember the de Morgan law, which rests on the logical fact that “not (this or that)” is the same as “(not this) and (not that)”. The complement of a set A is denoted by Aᶜ and is the set of all x ∈ Ω such that x ∉ A. The complement of A with respect to B is the set of all x ∈ B such that x ∉ A; it is denoted by B \ A. Hence Aᶜ = Ω \ A. Observe that B \ A = B ∩ Aᶜ. The de Morgan law gives
(⋃_{i∈I} Aᵢ)ᶜ = ⋂_{i∈I} Aᵢᶜ.
The number of elements of a set A is called cardinality or size of the set and is denoted by
|A| or by #A. A set is called finite if it has a finite number of elements. Otherwise it is called
infinite. A set is called countable if there is a bijection between N and the set. An infinite set
that is not countable is called uncountable. The set of real numbers is uncountable. The set Z
of all integers is countable. The set Q of all rational numbers is also countable.
We can take the product of a sequence A1 , A2 , . . . of sets and call it A1 × A2 × · · · If the
sets A1 , A2 , . . . are all finite with size at least 2 each then their product is infinite and, in fact,
uncountable.
Example 5.1. What is the cardinality of the set of all injections from {1, 2, 3} into {1, 2, 3, 4}?
Answer. A function f : {1, 2, 3} → {1, 2, 3, 4} can be represented by the triple (f(1), f(2), f(3)). An injection must have f(1) ≠ f(2) ≠ f(3) ≠ f(1). f(1) can be 1 or 2 or 3 or 4. Once f(1) has been given a value, there are 3 values remaining for f(2) to take. So (f(1), f(2)) can take 4 · 3 = 12 values. It remains to assign a value to f(3); there are only 2 numbers remaining, so f(3) can take 2 values. And so (f(1), f(2), f(3)) can take 4 · 3 · 2 = 24 values. This is the cardinality of the set of injections from {1, 2, 3} into {1, 2, 3, 4}.
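To illustrate, here is a minimal Python sketch (our own, not part of the text) that counts these injections by brute force; all names below are ours, chosen for the example.

from itertools import product

domain = [1, 2, 3]
codomain = [1, 2, 3, 4]

# Enumerate all functions f : {1,2,3} -> {1,2,3,4} as triples (f(1), f(2), f(3))
all_functions = list(product(codomain, repeat=len(domain)))

# Keep the injective ones: all three values distinct
injections = [f for f in all_functions if len(set(f)) == len(domain)]

print(len(all_functions))  # 4**3 = 64 functions in total
print(len(injections))     # 4*3*2 = 24 injections, as computed above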
The most important set in the universe is the set that contains nothing. It is called the
empty set and is denoted by ∅. There is one and only one empty set.
Sets A, B are called disjoint if A ∩ B = ∅.
Sets A1, . . . , An are called pairwise disjoint or mutually disjoint if Ai ∩ Aj = ∅ for all i ≠ j. In this case, we also have Ai ∩ Aj ∩ Ak = ∅ if i, j, k are distinct. And so on.
Two sets are called distinct if they are not equal.
Sets of sets are sometimes denoted by calligraphic letters and are often called collections rather than sets of sets. If 𝒜 is a collection of sets then we call it a pairwise disjoint collection if A ∩ B = ∅ for every two distinct sets A, B ∈ 𝒜.
Chapter 6
“The only two things you need to know”
Disclaimer. When we say "the only two things you need to know" we mean that everything in probability theory follows from the two axioms (AXIOM ONE) and (AXIOM TWO) below. Of course, you need to learn the consequences of these two axioms that follow logically from them, together with a number of definitions and special cases of particular importance in the theory.
We explained that we identify events with sets. Probability theory is concerned with certain
functions, called probability measures, that assign to an event A a number P(A).
We shall let Ω be a certain set that includes all events we may be interested in in a particular
situation or problem. Such a set may be called sample space or configuration space or
ambient space. The elements of Ω may be called outcomes or configurations. Sometimes we
may have to change notation and use another letter in place of Ω. We will, for now, use the
letter E for the collection of all events.
PROBLEM 6.1 (select 2 numbers out of 100; what's the sample space?). Consider the selection of 2 numbers, not necessarily distinct, out of 100. What sample space Ω should we choose and what is E?
Answer. We choose Ω = {1, 2, . . . , 100} × {1, 2, . . . , 100}. We choose E = P(Ω).
It is typical to have E to be equal to the set of all subsets of Ω, but this may not be the case
for reasons that are quite deep. For now, just accept the fact that the function P is defined on a
collection of sets E .
1. AXIOM ONE
P(Ω) = 1. (AXIOM ONE)
2. AXIOM TWO
If A1, A2, . . . are pairwise disjoint (mutually exclusive) events then
P(⋃_{i=1}^∞ Ai) = Σ_{i=1}^∞ P(Ai). (AXIOM TWO)
In writing the above, we must make assumptions that all things appearing in the argument
of P must be events, that is, elements of E . We will assume that
1. Ω ∈ E. (EV1)
2. If A1, A2, . . . ∈ E then ⋃_{i=1}^∞ Ai ∈ E. (EV2)
3. If A ∈ E then A^c ∈ E. (EV3)
The first two assumptions are necessitated by (AXIOM ONE) and (AXIOM TWO). The third assumption stems from the fact that we want to have P defined on an event and on its complement (because if we know the probability of something happening we should also know the probability of it not happening).
Let us repeat.
Definition 6.1. A collection E of subsets of a set Ω is called a class of events if (EV1), (EV2)
and (EV3) are satisfied. Another name for such an E is σ-field.
Subtle point 1. You should pay attention to the term probability measure. Some people
may simply call it probability. So the word “probability” may have two meanings: as the
function P or as a value of P on a particular event A. Another word for “probability measure”
is “probability distribution” or “distribution”. Another word is “probability law” or just
“law”.
Subtle point 2. The natural question then is: can all subsets of Ω be events? Put otherwise, can the collection of "events in Ω" be the set of all subsets of Ω? The answer is: not always.
Subtle point 2, amplified. Let’s try to think of the experiment of tossing a coin infinitely many
times. A result of this experiment is a sequence of heads and tails, such as T, H, T, H, T, T, . . .
The set Ω is the set of all these sequences:
Ω = {H, T}^N.
Let us try to define a probability measure P that conforms to our intuition. Namely, the probability of tossing a head should be 1/2 (fair coin); the probability of two heads in a row should be 1/4; the probability of specifying the coin faces in the first n tosses should be 1/2^n.
Let P denote such a probability measure. What is the domain of P? Are all subsets of Ω events?
The answer is no. There are subsets of Ω that are not in the domain of P. These subsets are
not events and so they have no probability at all. This means that I have no right to ask for
the value of P on sets that are not events. Why this is the case I cannot tell you in this course
because it is beyond the syllabus and because it requires a little bit of mathematics that can’t
be explained at this level.
Subtle point 3. There are familiar functions that satisfy (AXIOM TWO) but not necessarily
(AXIOM ONE). Here are two examples.
PROBLEM 6.2 (The cardinality function). Consider a finite set Ω and let N be the function that assigns to each subset A of Ω its cardinality (that is, its number of elements). Then, certainly, if A1, . . . , An are disjoint subsets of Ω then N(A1 ∪ · · · ∪ An) = N(A1) + · · · + N(An), so N satisfies (AXIOM TWO); but N(Ω) = |Ω|, so (AXIOM ONE) fails unless Ω has exactly one element.
PROBLEM 6.3 (The area function). Let Ω be the plane, identified as the set R^2 of pairs of real numbers. Let λ be the function commonly known as "area". This function assigns value ab to a rectangle of side lengths a and b. It assigns value πr^2 to a circle of radius r. It assigns value 1 to the set of points (x, y) ∈ R^2 such that 0 ≤ y ≤ e^{−x}, x ≥ 0. You have learned how to compute areas of various subsets of R^2 and you have learned to do so by frequently applying (AXIOM TWO). Frequently, you need to apply this rule to infinite sequences in order to obtain the area of complicated sets from simpler ones, e.g., the area of the circle from an infinite sequence of rectangles. But do you really know the function λ? If you do, then you must be able to describe its domain. It turns out that λ, that is, the area function, cannot have the set of all subsets as its domain. Only in the 20th century did we understand what the area function is and manage to describe its domain.
Figure 6.1: This is a circle of radius r. You can find its area by embedding an infinite sequence of disjoint rectangles: summing up their areas you find πr^2.
If the outcome ω of the experiment belongs to A, we say that the event A occurs.
Also, “A occurs and B occurs” means that the element ω is in both A and B, that is, A ∩ B
occurs; and so on. If I pick an ω1 ∈ Ω and you pick another ω2 ∈ Ω then A may occur for me
but not for you.
PROBLEM 6.4 (two extreme event collections). If Ω is a set then P(Ω) is a σ-field and {Ω, ∅}
is another σ-field.
Answer. Notice that (EV1), (EV2), (EV3) are satisfied in both cases.
PROBLEM 6.5 (events A, B generate more events). If Ω is a set and A, B ⊂ Ω, try to build a
σ-field that contains A and B. How can you do that?
Answer. Throw in all sets that are derived from A and B by set operations: E = {A, B, A ∩ B, A ∪ B, A^c, B^c, A ∪ B^c, . . .}. To do this systematically, draw a Venn diagram and notice that Ω can be partitioned into 4 sets: A ∩ B, A ∩ B^c, A^c ∩ B, A^c ∩ B^c. Then consider all possible ways of selecting some of these sets and take the union of each selection:
A∩B A ∩ Bc Ac ∩ B Ac ∩ Bc derived set
1) 1 1 1 1 Ω
2) 1 1 1 0 (A ∩ B) ∪ (A ∩ Bc ) ∪ (Ac ∩ B) = A ∪ B
3) 1 1 0 1 ···
4) 1 1 0 0 ···
.. .. .. .. .. ..
. . . . . .
15) 1 0 0 0 A∩B
16) 0 0 0 0 ∅
Go ahead and fill in each of the 16 rows of this table.
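As a complement, here is a small Python sketch (our own, not from the text) that carries out exactly this construction: it forms the 4 atoms for concrete sets A, B inside a small Ω and lists all 16 unions of selections of atoms.

from itertools import combinations

Omega = set(range(8))
A = {0, 1, 2, 3}
B = {2, 3, 4, 5}

# The 4 atoms of the partition generated by A and B
atoms = [A & B, A - B, B - A, Omega - (A | B)]

# Every member of the sigma-field is a union of a selection of atoms
sigma_field = []
for r in range(len(atoms) + 1):
    for selection in combinations(atoms, r):
        member = set().union(*selection)
        if member not in sigma_field:
            sigma_field.append(member)

print(len(sigma_field))                    # 16 sets, as in the table above
print(sorted(map(sorted, sigma_field)))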
Definition 6.4 (Probability space.). A probability space is a triple (Ω, E , P) consisting of a set
Ω, a σ-field E of subsets of Ω and a probability measure P, that is, a function P : E → [0, ∞),
satisfying (AXIOM ONE) and (AXIOM TWO).
Often people use the term mutually exclusive for a family of events and this means that
the events in the family are mutually disjoint. We can state (AXIOM TWO) in words by
saying that a probability measure P is countably additive over any countable collection of
mutually exclusive events.
Chapter 7
Elementary probability properties
Here is why: The events A and B \ A are disjoint and have union equal to B. By
(AXIOM TWO), P(B) = P(A) + P(B \ A). Since P(B \ A) is nonnegative, if we omit this term
from the sum, we get something smaller.
3. Monotonicity.
If A ⊂ B then P(A) ≤ P(B).
Here is why: If A ⊂ B then P(B) − P(A) = P(B \ A) ≥ 0.
4. For any event A, 0 ≤ P(A) ≤ 1.
Here is why: A is always a subset of Ω. Hence, by monotonicity, P(A) ≤ P(Ω). But P(Ω) = 1,
by (AXIOM ONE).
5. Probability of negation.
P(Ac ) = 1 − P(A).
Here is why: Since A and Ac are disjoint with union equal to Ω, we have P(A) + P(Ac ) = P(Ω)
by (AXIOM TWO). But P(Ω) = 1 by (AXIOM ONE).
Here is why: The events A \ (A ∩ B), B \ (A ∩ B) and A ∩ B are pairwise disjoint with union A ∪ B.¹ By (AXIOM TWO) then,
P(A ∪ B) = P(A \ (A ∩ B)) + P(B \ (A ∩ B)) + P(A ∩ B) = P(A) + P(B) − P(A ∩ B).
8. Boole's inequality.
P(⋃_{i=1}^n Ai) ≤ Σ_{i=1}^n P(Ai).
¹ Indeed, the first and the second are disjoint from the third, while the first two are disjoint from each other because the first set equals A ∩ (A ∩ B)^c, the second set equals B ∩ (A ∩ B)^c and so their intersection equals (A ∩ B) ∩ (A ∩ B)^c = ∅. Furthermore, (A ∩ (A ∩ B)^c) ∪ (B ∩ (A ∩ B)^c) ∪ (A ∩ B) = A ∪ B because of the distributive property of ∪ over ∩.
P(A1 ∪A2 ∪A3 ) = P(A1 )+P(A2 )+P(A3 )−P(A1 ∩A2 )−P(A1 ∩A3 )−P(A2 ∩A3 )+P(A1 ∩A2 ∩A3 ).
– One way to show the veracity of the formula is to use the already known n = 2 case and
induction. The second equality is just a rewriting of the first where we considered all sets I
according to their sizes.
– To see another way, look at Problem 9.12.
10. Bonferroni inequalities. Look at the last side of the inclusion-exclusion formula and consider not the whole sum but the sum of the first m terms. Define
Bm := Σ_{k=1}^m (−1)^{k−1} Σ_{1≤i_1<···<i_k≤n} P(A_{i_1} ∩ · · · ∩ A_{i_k}).
Then
B2 ≤ B4 ≤ · · · ≤ P(⋃_{i=1}^n Ai) ≤ · · · ≤ B3 ≤ B1.
In other words, the odd-indexed Bm provide better and better upper bounds for the probability of the union, while the even-indexed Bm provide better and better lower bounds. To
understand this, take a look at Appendix A on Counting.
PROBLEM 7.1 (the conjunction fallacy). Linda is 31 years old, single, outspoken, and very
bright. She majored in philosophy. As a student, she was deeply concerned with issues of
discrimination and social justice, and also participated in anti-nuclear demonstrations. Which
is more probable?
(a) Linda is a bank teller.
(b) Linda is a bank teller and is active in the feminist movement.
Answer. The answer is (a). The reason is that the event in (b) is a subset of the event in (a) (every feminist bank teller is, in particular, a bank teller), so we apply Property 3., that is, monotonicity. Tversky and Kahneman observed that most
people say that (b) is more probable. Which is wrong. This is called the “conjunction fallacy”.
But most people are clueless about elementary mathematics, or even elementary logic, so this
comes as no surprise. What is surprising is that this obvious thing has given rise to many
publications.
² Pay attention to logic: the sentence "for all x ∈ A blablabla" is always (vacuously) true if A = ∅ because, by definition, ∅ contains nothing that could violate it.
PROBLEM 7.2 (probabilities must add up to what?). Someone claims that a biased cube lands on face k with probability 1/(k + 1), k = 1, 2, 3, 4, 5, 6. Why is he wrong?
Answer. (1/2) + (1/3) + (1/4) + (1/5) + (1/6) + (1/7) = 223/140 ≈ 1.59 ≠ 1, so (AXIOM ONE) is violated.
PROBLEM 7.3 (estimating chance of winning in the lottery). The probability of winning in a certain lottery when you buy 1 ticket is 0.0001 = 10^{−4}. When you buy 2 tickets the probability of both of them winning is 0.0000001 = 10^{−7}. You buy 100 tickets. Estimate from above and from below the probability that at least one of them wins.
Answer. If Ai is the event that the i-th ticket wins, i = 1, . . . , n = 100, then the event that at least one of them wins is ⋃_{i=1}^n Ai. By Boole's inequality, we have
P(⋃_{i=1}^n Ai) ≤ Σ_{i=1}^n P(Ai) = n · 0.0001 = 0.01 = 1%.
By Bonferroni's inequality,
P(⋃_{i=1}^n Ai) ≥ Σ_{i=1}^n P(Ai) − Σ_{i<j} P(Ai ∩ Aj) = 0.01 − \binom{n}{2} · 10^{−7} = 0.01 − 4950 · 10^{−7} = 0.009505 = 0.9505%.
So the probability that at least one of your tickets wins is at most 1% but not much less than
that, because it is at least 0.9505%.
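A quick Python sketch (ours, using the numbers assumed above) that evaluates both bounds:

from math import comb

n = 100
p_single = 1e-4   # probability that one given ticket wins
p_pair = 1e-7     # assumed probability that two given tickets both win

upper = n * p_single                         # Boole's inequality (B1)
lower = n * p_single - comb(n, 2) * p_pair   # Bonferroni lower bound (B2)

print(f"upper bound: {upper:.6f}")   # 0.010000
print(f"lower bound: {lower:.6f}")   # 0.009505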
PROBLEM 7.4 (deadly sins). A Zoroastrian priest found out that a man has probability at least 0.01 of committing at least one of a finite number of deadly sins during his life, and if this happens then he will go to an unpleasant place called Duzakh. What can you say about the probability that he will not go to Duzakh?
Answer. If S is the event that a man will commit one of the sins and D the event that he
will end up in Duzakh then we know that S ⊂ D. We also know that P(S) ≥ 0.01. Hence
0.01 ≤ P(S) ≤ P(D). Therefore P(Dc ) = 1 − P(D) ≤ 1 − 0.01 = 0.99. So we learn that the
probability he won’t end up in Duzakh is at most 0.99.
|F ∪ R| = |F| + |R| − |F ∩ R|
Chapter 8
Probability on discrete sets
A set (in particular, a sample space Ω) is called discrete if it is finite or countably infinite. If Ω is discrete and p : Ω → [0, 1] is a function with Σ_{ω∈Ω} p(ω) = 1, then
P(A) := Σ_{ω∈A} p(ω),   A ⊂ Ω,
defines a probability measure P on P(Ω). Conversely, for any probability measure P on P(Ω) there corresponds a function p : Ω → [0, 1] such that the last display holds.
?PROBLEM 8.1 (probability measures on finite sets). Explain the claim just made.
Answer. If A, B are disjoint subsets of Ω then ω ∈ A ∪ B iff either ω ∈ A or ω ∈ B, and so P(A ∪ B) = Σ_{ω∈A∪B} p(ω) = Σ_{ω∈A} p(ω) + Σ_{ω∈B} p(ω) = P(A) + P(B). This shows that (AXIOM TWO) holds. On the other hand, P(Ω) = Σ_{ω∈Ω} p(ω) = 1, so (AXIOM ONE) holds. (Note also that P(∅) = 0 because the sum over the empty set is 0.)
To see that the converse is true, let P be a probability measure on P(Ω) and define p(ω) := P({ω}). Then, because P satisfies (AXIOM TWO), and because A = ⋃_{ω∈A} {ω}, we have that P(A) = Σ_{ω∈A} p(ω) holds.
PROBLEM 8.2 (a biased one-pound coin). Consider a one-pound coin and the experiment
of tossing it. The outcomes are three: H (heads), T (tails), S (standing on its side). Experience
shows that the probabilities of these outcomes are 0.4994, 0.4996, 0.001, respectively. Why
does this define a probability measure? What is the probability of the event that the coin lands
heads or tails?
Answer. It defines a probability measure because the three numbers are nonnegative and 0.4994 + 0.4996 + 0.001 = 1. The probability of the event that the coin lands heads or tails is 0.4994 + 0.4996 = 0.999.
Product rule. If Ω1, Ω2 are finite sets and we have probability measures P1, P2 defined on them, respectively, then we can define a probability P on Ω = Ω1 × Ω2 by the product rule
P(ω1, ω2) = P1(ω1)P2(ω2),   ω1 ∈ Ω1, ω2 ∈ Ω2.
?PROBLEM 8.3 (the product rule produces a new probability measure). Explain that the
product rule produces a probability measure.
Answer. All we have to do is check that the sum of P(ω1, ω2) over all possible values of its arguments equals 1. Indeed,
Σ_{ω1∈Ω1} Σ_{ω2∈Ω2} P1(ω1)P2(ω2) = (Σ_{ω1∈Ω1} P1(ω1)) · (Σ_{ω2∈Ω2} P2(ω2)) = 1 · 1 = 1.
PROBLEM 8.4 (a pair of coins). Consider the experiment of throwing a pair of identical
coins. We take as Ω the set of all 9 pairs (H,H), (H,T), (H,S), (T,H), (T,T), (T,S), (S,H), (S,T), (S,S).
Suppose that the two coins are biased exactly as above. Assign P according to the product
rule and compute the probability that heads show at least once.
Answer. The event "heads show at least once" is the set
A = {(H,H), (H,T), (H,S), (T,H), (S,H)}.
Its probability, that is, P(A), is given by the sum of the probabilities of each of its 5 elements. Hence P(A) = PH PH + PH PT + PH PS + PT PH + PS PH = PH PH + 2 PH PT + 2 PH PS = 0.24940036 + 2 × 0.24950024 + 2 × 0.0004994 = 0.74939964.
Answer. The set of configurations (sample space) Ω contains 9 configurations with two balls
in the same cell and 9 × 8/2 = 36 with two balls in different cells. We have
So there are 9 configurations with probability p1 = 1/81 each and 36 with probability p2 = 2/81
each. The event that the balls lie on the main diagonal contains 3 configurations of probability
p1 and 3 of probability p2 . Hence the probability that the balls lie on the main diagonal equals
3(p1 + p2 ) = 1/9.
P(A) = |A| / |Ω|,
for any A ⊂ Ω. Another expression that people use for this situation is:
Select ω uniformly at random from Ω and compute the probability of the event A.
PROBLEM 8.6 (chance that a number is even). What is the probability that an integer selected
at random between 1 and 100 is even?
Answer. We have Ω = {1, 2, . . . , 100}. Let A be the event containing all even integers. Clearly,
|A| = 50. Hence P(A) = 50/100 = 1/2.
If x is a real number then let
⌊x⌋ := the largest integer that is less than or equal to x.
For example, ⌊5.391⌋ = 5, ⌊−7.2⌋ = −8. The number ⌊x⌋ is the unique integer that satisfies the inequalities
x − 1 < ⌊x⌋ ≤ x.
This is because any interval of the form (x − 1, x] must contain an integer (for if it didn't, then for some integer m we would have m ≤ x − 1 and x < m + 1, hence x < m + 1 ≤ x, which is impossible), and this integer is unique (because two distinct integers differ by at least 1, while any two points of (x − 1, x] differ by less than 1).
PROBLEM 8.7 (chance that a number is divisible by k). Let n, k be positive integers. What
is the probability that a randomly selected integer between 1 and n is divisible by k? What
is this number when n = 2384 and k = 57? If p, q are two distinct prime numbers then show
that the probability that a randomly selected integer between 1 and n is divisible by p and q
converges to 1/pq when n → ∞.
Answer. Let Ω = {1, . . . , n}. The event "ω is divisible by k" is the set Ak of all ω ∈ Ω such that ω = km for some positive integer m. The cardinality of Ak equals max{m ∈ Z : mk ≤ n} = ⌊n/k⌋ and so P(Ak) = ⌊n/k⌋/n. For the specific values, we have P(Ak) = ⌊2384/57⌋/2384 = ⌊41.824 · · ·⌋/2384 = 41/2384 ≈ 0.017198. Notice that the latter number is approximately equal to 1/57. For the last question, the event of interest is Ap ∩ Aq. Notice that a number is divisible by p and q if and only if it is divisible by pq, because p and q have no common factor. That is, Ap ∩ Aq = Apq, and so P(Apq) = ⌊n/pq⌋/n → 1/pq as n → ∞.
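A short Python check of the specific numbers (our own sketch), counting the divisible integers directly:

n, k = 2384, 57

# Count the integers in {1, ..., n} divisible by k, and compare with floor(n/k)/n
count = sum(1 for omega in range(1, n + 1) if omega % k == 0)
print(count, n // k)       # both are 41
print(count / n, 1 / k)    # 0.01719..., 0.01754...: close, as claimed

# The limit statement for two distinct primes p, q
p, q = 3, 7
for n in (10**3, 10**5, 10**7):
    prob = (n // (p * q)) / n
    print(n, prob, 1 / (p * q))   # prob approaches 1/21 as n grows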
PROBLEM 8.8 (roll three dice, get at least one 6). Roll three dice and compute the probability
that we get at least one 6.
Answer. An outcome here is a triple ω = (ω1, ω2, ω3), where ωi is the result of die i, i = 1, 2, 3. The set of outcomes has 6^3 = 216 elements. The event "we get at least one 6" is the negation of the event "we get no 6". The latter event is the set of outcomes ω such that ωi ≠ 6 for all i = 1, 2, 3; this means that each ωi takes values 1, 2, 3, 4 or 5. Hence the event "we get no 6" has 5^3 = 125 outcomes. The answer is therefore
P(get at least one 6) = 1 − 125/216 = 91/216 ≈ 0.42.
PROBLEM 8.9 (a tourist in London). A tourist in London is given a list of the 10 most important monuments of the city along with the recommendation to visit the Tower of London, Buckingham Palace and Westminster Cathedral. Because of lack of time, she decides to visit only 3 of them at random. What is the probability that she visits the recommended ones and in order of increasing distance from her hotel?
Answer. Take as outcome ω = (ω1 , ω2 , ω3 ), a triple of distinct monuments arranged in order
of increasing distance from the tourist’s hotel. The set Ω of outcomes has cardinality
10 × 9 × 8 = 720 and, by assumption, the probability measure considered is uniform on Ω.
There is only one outcome contained in the event under consideration. Hence the answer is
1/720.
?PROBLEM 8.10 (select a set at random). Consider the set S = {1, 2, . . . , d}. What is the
probability that a randomly selected subset of S contains the elements 1 and 2?
Answer. The universe here is the set Ω of all subsets of S, so |Ω| = 2^d, and the event of interest is
A = {ω ∈ Ω : {1, 2} ⊂ ω}.
Since the set ω \ {1, 2} is a subset of S \ {1, 2} and since there are 2^{d−2} subsets of S \ {1, 2}, we have
|A| = 2^{d−2}.
Hence
P(A) = 2^{d−2}/2^d = 1/4.
To do the next problem, we consider the following definition.
If S is a set of size n then we define \binom{n}{k} as the number of subsets of S of size k.
For example, \binom{n}{1} = n because there are exactly n sets of size 1.
\binom{n}{m} = n(n − 1)(n − 2) · · · (n − m + 1) / (m(m − 1)(m − 2) · · · 1).
Answer. We take a set with n elements, say the first n positive integers. Suppose we have m boxes in a row, labeled 1, 2, . . . , m. Put one of the integers in box 1 (there are n choices). Put a different integer in box 2 (there are n − 1 choices). Put an integer, different from the ones placed in the first two boxes, in box 3 (there are n − 2 choices). And so on. Hence there are
n(n − 1)(n − 2) · · · (n − m + 1)
ways to fill the boxes. On the other hand, there are
m(m − 1) · · · 1
ways to scramble the boxes. A moment of reflection then shows that we must divide the two displays in order to get a formula for the number of subsets of size m (unordered m-tuples).
We now use a couple of useful abbreviations:
The m-falling factorial of n is the number obtained by multiplying, starting from n, the first m integers that are less than or equal to n. We denote this by (n)m. Thus,
(n)m := n(n − 1)(n − 2) · · · (n − m + 1) = Π_{j=0}^{m−1} (n − j).   (8.1)
The n-falling factorial of n, that is, the number (n)n, is also called n-factorial and is denoted by n!:
n! := (n)n = n(n − 1) · · · 1 = Π_{j=0}^{n−1} (n − j) = Π_{k=1}^{n} k.
Note that
(n)m = n!/(n − m)!.
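The following small Python sketch (ours) implements these abbreviations and checks the relations above for a few values:

from math import comb, factorial

def falling(n, m):
    """The m-falling factorial (n)_m = n(n-1)...(n-m+1)."""
    result = 1
    for j in range(m):
        result *= n - j
    return result

for n, m in [(10, 3), (8, 5), (6, 6)]:
    assert falling(n, m) == factorial(n) // factorial(n - m)    # (n)_m = n!/(n-m)!
    assert comb(n, m) == falling(n, m) // factorial(m)          # binom(n,m) = (n)_m / m!

print(falling(10, 3), comb(10, 3))   # 720 120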
?PROBLEM 8.13 (labeled balls in labeled boxes). We put m balls (labeled by the numbers
1, . . . , m) in n boxes (labeled by the numbers 1, . . . , n) at random. What is the probability that
box 1 contains k balls?
Answer. Each ball goes to a unique box. Let us use some notation for this. If the ball has label i we let ω(i) be the label of the box it goes into. We realize then that an outcome is simply a function ω from the set of balls to the set of boxes:
ω : {1, . . . , m} → {1, . . . , n}.
We denote this function by ω = (ω(1), . . . , ω(m)) and we can think of it as an ordered m-tuple. The set of outcomes Ω is the set of all such m-tuples. Since ω(1) can take n values, and ω(2) can also take n values, and so on, we see that there are n^m outcomes. Hence |Ω| = n^m. We are interested in the event A of all m-tuples ω such that exactly k of the entries of ω are equal to 1. We look at the set I of indices i such that ω(i) = 1. This set is required to have size k. Hence it can be selected in \binom{m}{k} ways. The entries of ω with indices not in I can each be selected in n − 1 ways (any box other than box 1), so |A| = \binom{m}{k}(n − 1)^{m−k}. Assuming P to be uniform on Ω,
P(A) = |A|/n^m = \binom{m}{k}(n − 1)^{m−k} / n^m.   (8.2)
?PROBLEM 8.14 (birthday coincidences). In a certain room there are n people. What is the chance that at least two of them have a common birthday? Assume that every year has d = 365 days.
Answer. If n > d = 365 then, certainly, at least two people have the same birthday. So assume that n ≤ d. To answer the problem we need a mathematical model. We choose an outcome to be an assignment of a day to each of the n people. Let ωi be the day (his/her birthday) assigned to person i. Hence Ω is the set of all such assignments ω = (ω1, . . . , ωn). Equivalently, Ω is the set of all functions from {1, . . . , n} to {1, . . . , d} (where 1 means the 1st of January, 2 means the second, and so on, until d, the last day of the year). There are |Ω| = d^n outcomes in all.
Figure 8.1: The probability that in a group of n people at least two are born on the same day as a
function of n. Notice that with about 25 people the probability is about 0.5 and with about 50
people the probability is close to 1.
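The curve in Figure 8.1 can be recomputed with the standard formula P(at least two share a birthday) = 1 − (d)_n / d^n; here is a minimal Python sketch (ours, using that standard formula rather than anything specific to the text):

def p_shared_birthday(n, d=365):
    """Probability that, among n people, at least two share a birthday."""
    if n > d:
        return 1.0
    p_all_distinct = 1.0
    for j in range(n):
        p_all_distinct *= (d - j) / d   # (d)_n / d^n, factor by factor
    return 1.0 - p_all_distinct

for n in (10, 23, 25, 50):
    print(n, round(p_shared_birthday(n), 3))
# 10 -> 0.117, 23 -> 0.507, 25 -> 0.569, 50 -> 0.97, matching the shape of Figure 8.1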
We refer to δa as the Dirac probability measure. Hence P̂(B) gives the fraction of the data in the set B. This probability measure is called empirical probability measure or empirical distribution.
PROBLEM 8.15 (data from 10 coin tosses). Let us consider the outcomes of d = 10 coin tosses
and mark 1 for heads and 0 for tails. Suppose that the outcomes are
1, 0, 0, 0, 1, 1, 0, 1, 0, 0.
trial i   no. of heads
1         2
2         1
3         1
4         4
5         3
6         0
7         2
...       ...
100       3
# Maple: simulate 100 trials, each of 4 fair coin tosses (1 = heads, 0 = tails),
# and display the number of heads in each trial.
for i to 100 do
  A := Array([rand(0..1)(), rand(0..1)(), rand(0..1)(), rand(0..1)()]);
  A[1] + A[2] + A[3] + A[4];
end do;
Answer. First count the number of trials that result in j heads, j = 0, 1, 2, 3, 4 and form a table.
Make sure that the sum over j equals n = 100.
j no. of trials resulting in j heads
0 7
1 21
2 39
3 24
4 9
sum 100
Then compute P̂(B) for the various instances of B asked for:
P̂{0} = 7/100, P̂{1} = 21/100, P̂{2} = 39/100, P̂{3} = 24/100, P̂{4} = 9/100,
P̂(odd number of heads) = P̂{1, 3} = P̂{1} + P̂{3} = 21/100 + 24/100 = 45/100.
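A rough Python counterpart of the Maple snippet above (our own sketch) simulates the 100 trials and tabulates the empirical distribution of the number of heads; the counts will of course vary from run to run.

import random
from collections import Counter

random.seed(0)   # fix the seed so the run is reproducible

n_trials, n_tosses = 100, 4
heads_per_trial = [sum(random.randint(0, 1) for _ in range(n_tosses))
                   for _ in range(n_trials)]

counts = Counter(heads_per_trial)
for j in range(n_tosses + 1):
    print(j, counts[j], counts[j] / n_trials)   # j, N(j), empirical probability

print(sum(counts.values()))   # 100: the counts add up to the number of trials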
Chapter 9
Random variables
Notice that Y(ω) is not just a function of ω but also a function of X(ω).
Caution! The definition of a random variable does not involve the probability measure
that sits on Ω.
Caution! The word “variable” in the terminology “random variable” is wrong. It should
be function.
when B is a subset of R, where X^{−1}(B) is the inverse image of the set B by the function X, that is,
X^{−1}(B) = {ω ∈ Ω : X(ω) ∈ B}.
We will come back to this notion in Chapters 12 and 16. I now want to point out that we frequently use another notation for X^{−1}(B): we write {X ∈ B}. So
{X ∈ B} = X^{−1}(B).
The distribution of a discrete random variable. A random variable X is discrete if its set of values is discrete. If X is a discrete random variable then its distribution is determined by the probability mass function x ↦ P(X = x), that is, by the numbers
p(x) := P(X = x),   x ∈ X(Ω).
Note that, by (AXIOM ONE) and (AXIOM TWO), p(x) ≥ 0 for all x and
Σ_{x∈X(Ω)} p(x) = 1.
Once a random variable X is given and Ω is equipped with a probability measure P, we can compute the probabilities of events of the form
{X = x} = {ω ∈ Ω : X(ω) = x}.
The collection of these probabilities is a probability measure on R and it is called the distribution
of X.
PROBLEM 9.2 (3 coins and the laws of two random variables). In the Problem 9.1, let P be
the uniform probability measure on Ω, that is,
ω P{ω}
(0,0,0) 1/8
(0,0,1) 1/8
(0,1,0) 1/8
(1,0,0) 1/8
(0,1,1) 1/8
(1,1,0) 1/8
(1,0,1) 1/8
(1,1,1) 1/8
P(X = 1) = P{ω ∈ Ω : exactly one of ω1, ω2, ω3 equals 1} = P{(0, 0, 1), (0, 1, 0), (1, 0, 0)} = 3/8.
Indeed,
{X = 1} = {(0, 0, 1), (0, 1, 0), (1, 0, 0)}, etc.
So the distribution of X is the collection of numbers P(X = 0) = 1/8, P(X = 1) = 3/8, P(X = 2) = 3/8, P(X = 3) = 1/8, which is a probability measure Q on R. Schematically, X transfers P to Q:
P → X → Q.
That is, we solved the problem
P → X → ?,
where the question mark is replaced by Q. Now consider instead the problem
? → ? → Q.
This means that we are given a probability measure Q on R and we want to find some random
variable X and some probability measure P so that X transfers P to Q. This is the problem
we are faced with all the time in practice. The answers are many. Among these answers people
search for clever ones–something that you may not appreciate without further familiarity and
experience. Here is an answer, an obvious one. Take Ω = R, take P = Q and take X : Ω → R
to be the identity function, namely, X(ω) = ω for all ω ∈ Ω.
Indeed, assuming that c ≠ 0 (the case c = 0 being trivial), we have P(cX = cx) = P(X = x), so E(cX) = Σ_x (cx)P(X = x) = cE(X). Further, the function X + Y takes values x + y, where x ranges in the set of values of X and y in the set of values of Y. So E(X + Y) = Σ_{x,y} (x + y)P(X = x, Y = y) = Σ_{x,y} xP(X = x, Y = y) + Σ_{x,y} yP(X = x, Y = y). But Σ_{x,y} xP(X = x, Y = y) = Σ_x x Σ_y P(X = x, Y = y) = Σ_x xP(X = x) = E(X), where we used (AXIOM TWO).
?PROBLEM 9.3 (the law of the unconscious statistician). Explain why
E(X) = Σ_{ω∈Ω} X(ω)P{ω}.   (9.2)
Answer. Group together the set of ω ∈ Ω that have the same value under X, that is, consider the sets {X = x} for all x ∈ X(Ω). Then
Σ_{ω∈Ω} X(ω)P{ω} = Σ_{x∈X(Ω)} Σ_{ω∈{X=x}} X(ω)P{ω} = Σ_{x∈X(Ω)} Σ_{ω∈{X=x}} xP{ω} = Σ_{x∈X(Ω)} x Σ_{ω∈{X=x}} P{ω} = Σ_{x∈X(Ω)} xP(X = x) = E(X).
?PROBLEM 9.4 (monotonicity of expectation). Explain why if X, Y are two random variables
then
E(X) ≤ E(Y) if X ≤ Y,
where X ≤ Y means X(ω) ≤ Y(ω) for all ω ∈ Ω.
Answer. If, in the expression E(X) = Σ_{ω∈Ω} X(ω)P{ω}, we replace X(ω) by the bigger numbers Y(ω) we get something bigger. This something is E(Y).
PROBLEM 9.5 (3 coins and the expectation of two random variables). In Problem 9.1 with P uniform compute the expectation of X in 3 different ways. Then compute the expectation of Y.
E(X) = 0 · (1/8) + 1 · (3/8) + 2 · (3/8) + 3 · (1/8) = 12/8 = 3/2.
Method 3. Define Xi (ω) = 1 if ωi = 1 and Xi (ω) = 0 if ωi = 0. That is, Xi is a random variable
that indicates if there is a head at the i-th position, i = 1, 2, 3. Observe that X = X1 + X2 + X3 .
By linearity,
E(X) = E(X1 ) + E(X2 ) + E(X3 ).
We now have
E(X1) = 1 · (1/2) + 0 · (1/2) = 1/2 = E(X2) = E(X3).
Hence E(X) = 3/2. As for Y, we have
E(Y) = 1 · (1/2) + 0 · (1/2) = 1/2.
PROBLEM 9.6 (3 coins and a non-uniform probability measure). Suppose we equip the Ω
in the previous problem by a P defined by
ω P{ω}
(0,0,0) 1/20
(0,0,1) 2/20
(0,1,0) 3/20
(1,0,0) 4/20
(0,1,1) 4/20
(1,1,0) 3/20
(1,0,1) 2/20
(1,1,1) 1/20
What is E(X) now?
Answer. E(X) = (0 + 3) · (1/20) + (1 + 2) · (2/20) + (1 + 2) · (3/20) + (1 + 2) · (4/20) = 30/20 = 3/2.
?PROBLEM 9.7 (the expectation may be infinite or may not exist!). Recall that¹ c = Σ_{n=1}^∞ 1/n² < ∞. Consider the random variables X, Y whose laws are defined by
P(X = n) = 1/(cn²),   n ∈ N,
P(Y = n) = 1/(2cn²),   n ∈ Z \ {0}.
Explain why E(X) = ∞ and why E(Y) does not exist.
¹ Here is why. We have c = 1 + Σ_{n=2}^∞ 1/n² ≤ 1 + Σ_{n=2}^∞ 1/(n(n − 1)) = 1 + Σ_{n=2}^∞ (1/(n − 1) − 1/n) = 1 + lim_{N→∞} Σ_{n=2}^N (1/(n − 1) − 1/n) = 1 + lim_{N→∞} (1 − 1/N) = 1 + 1 − 0 = 2 < ∞.
Answer. We have
E(X) = Σ_{n=1}^∞ nP(X = n) = Σ_{n=1}^∞ n · 1/(cn²) = (1/c) Σ_{n=1}^∞ 1/n = ∞.
E(Y) = Σ_{n∈Z\{0}} n · 1/(2cn²) = (1/2c) Σ_{n∈Z\{0}} 1/n = (1/2c) (Σ_{n=1}^∞ 1/n + Σ_{n=−∞}^{−1} 1/n) = (1/2c) (Σ_{n=1}^∞ 1/n − Σ_{n=1}^∞ 1/n) = (1/2c)(∞ − ∞) = undefined!
meaning that X is a random variable and g a function. Then g(X) = g ∘ X is another random variable which is a function of X. We are in particular interested in the case where g(x) = x^k for some positive integer k. We define
μk = k-th moment of X = E(X^k).
The number
σ = √var(X)
is called the standard deviation of X under the probability measure P. Notice that σ² = E(X²) − (E X)² ≤ μ2.
To compute the expectation of g(X) we have two ways. First we can think of g(X) as a random variable per se and so
E[g(X)] = Σ_y yP(g(X) = y),
where the sum ranges over all the values of g(X). This leaves us with the problem of computing the probability of the event {g(X) = y} = {X ∈ g^{−1}{y}}. Second, we can apply the law of the unconscious statistician and write
E[g(X)] = Σ_x g(x)P(X = x).
PROBLEM 9.8 (Markov's inequality). Let X be a positive discrete random variable. Explain why, for all t > 0,
P(X > t) ≤ E(X)/t.
Answer. Let S be the set of values of X. Then, as seen in (9.9), X = Σ_{x∈S} x·1_{X=x}. Let
St = {x ∈ S : x > t}.
Since X is positive, all x in the sum are positive numbers, and so summing over the smaller set St will give a smaller number:
X ≥ Σ_{x∈St} x·1_{X=x} ≥ t Σ_{x∈St} 1_{X=x} = t·1_{X∈St},
where the second inequality is due to the fact that every x in St is at least t (by definition), and the last equality is from (AXIOM TWO). Now take expectations of both sides; using the monotonicity of E, we have
E(X) ≥ tP(X ∈ St) = tP(X > t).
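A small numerical illustration in Python (our own example distribution, not from the text): we compare P(X > t) with the Markov bound E(X)/t for a simple positive random variable.

# X takes the values 1, 2, ..., 6 with equal probability 1/6 (a fair die).
values = [1, 2, 3, 4, 5, 6]
pmf = {x: 1 / 6 for x in values}

EX = sum(x * p for x, p in pmf.items())   # E(X) = 3.5

for t in (2, 3, 5):
    tail = sum(p for x, p in pmf.items() if x > t)   # P(X > t)
    print(t, tail, EX / t, tail <= EX / t)
# The bound E(X)/t is never smaller than P(X > t), though it can be crude.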
PROBLEM 9.9 (data and the empirical probability space). Suppose that the data are
3, 1, 5, 3, 4, 1, 2, 5, 3, 4, 1, 1, 2, 2, 4.
(Remember that a set does not care if elements are repeated!) and m = 5.
Define now the random variable
J(ω) = ω, ω ∈ Ω.
{J = ai} = {ω ∈ Ω : J(ω) = ai},
If we thus let x1 , . . . , xm be a listing of the distinct elements of the data sequence, so that
Ω = {x1 , . . . , xm },
and set
N(x) = # occurrences of x in the data sequence
then
P̂(J = x) = N(x)/n,   x ∈ Ω.
The empirical mean or sample mean of J with respect to P̂ is defined to be the expectation Ê(J) of J under P̂. So we have
Ê(J) = Σ_{x∈Ω} x P̂(J = x) = Σ_{j=1}^m xj N(xj)/n.
Note that
μ̂ = Ê(J) = (1/n) Σ_{i=1}^n ai,
and this is because we can express a1 + · · · + an by summing over the distinct data values taking into account their multiplicities:
Σ_{i=1}^n ai = Σ_{j=1}^m xj N(xj).
PROBLEM 9.10 (continuation of Problem 9.9). What is the sample mean and sample standard deviation?
Answer. We have two ways to compute these quantities: by summing over data points or by summing over elements of Ω and taking into account multiplicities. We choose the latter method. We have n = 15 data points. Since x = 1 appears 4 times in the data sequence we have P̂{1} = 4/15. Similarly, P̂{2} = 3/15, P̂{3} = 3/15, P̂{4} = 3/15, P̂{5} = 2/15. So
μ̂ = 1·(4/15) + 2·(3/15) + 3·(3/15) + 4·(3/15) + 5·(2/15) = 41/15 ≈ 2.733.
Similarly, Ê(J²) = 1·(4/15) + 4·(3/15) + 9·(3/15) + 16·(3/15) + 25·(2/15) = 141/15 = 9.4, so the variance under P̂ is σ̂² = Ê(J²) − μ̂² = 9.4 − (41/15)² ≈ 1.929, and the standard deviation under P̂ is σ̂ ≈ 1.389.
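A quick Python check (ours) of these numbers directly from the data sequence of Problem 9.9:

from collections import Counter
from math import sqrt

data = [3, 1, 5, 3, 4, 1, 2, 5, 3, 4, 1, 1, 2, 2, 4]
n = len(data)

counts = Counter(data)                     # N(x) for each distinct value x
mean = sum(x * c / n for x, c in counts.items())
second_moment = sum(x * x * c / n for x, c in counts.items())
std = sqrt(second_moment - mean ** 2)      # standard deviation under the empirical measure

print(mean, std)   # approximately 2.733 and 1.389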
We call 1A indicator function (or indicator random variable) of the set (event) A.
We first study some algebraic properties.
1A = 1 − 1Ac (9.3)
1A∩B = 1A · 1B (9.4)
1A∪B = 1A + 1B − 1A∩B (9.5)
1A∩B = min(1A , 1B ) (9.6)
1A∪B = max(1A , 1B ) (9.7)
?PROBLEM 9.11 (properties of indicator functions). Explain why (9.3), (9.4), (9.5), (9.6),
(9.7) hold.
Answer. Since 1A (ω) assigns a unique value to each outcome ω ∈ Ω, it is a random variable.
Note that
{ω ∈ Ω : 1A (ω) = 1} = A.
CHAPTER 9. RANDOM VARIABLES 58
Property (9.3). We must show that the formula 1A(ω) = 1 − 1Ac(ω) holds for all ω ∈ Ω. There are two cases: either ω ∈ A or ω ∈ Ac. In the first case, 1A(ω) = 1, 1Ac(ω) = 0, so the formula reads
1 = 1 − 0, while in the second, 1A (ω) = 0, 1Ac (ω) = 1, so the formula reads 0 = 1 − 1; hence
correct in both cases.
Property (9.4). Notice that ω ∈ A ∩ B ⇐⇒ ω ∈ A and ω ∈ B ⇐⇒ 1A (ω) = 1 and
1B (ω) = 1 ⇐⇒ 1A (ω) · 1B (ω) = 1 (because, for all nonnegative integers x, y, we have the
equivalence xy = 1 ⇐⇒ x = 1 and y = 1).
Property (9.5). Do some trivial algebra:
(1 − x)(1 − y) = 1 − (x + y) + xy.
Now replace x by 1A and y by 1B and use Property 1 to get 1 − x = 1Ac , 1 − y = 1Bc and then
Property 2 to get (1 − x)(1 − y) = 1Ac · 1Bc = 1Ac ∩Bc . Using de Morgan’s law, Ac ∩ Bc = (A ∩ B)c ,
so (1 − x)(1 − y) = 1(A∩B)c = 1 − 1A∪B . Substituting into the last display we obtain
1 − 1A∪B = 1 − (1A + 1B) + 1A · 1B,
E(1A ) = P(A).
Indeed,
E(1A ) = 1 · P(1A = 1) + 0 · P(1A = 0) = P(A) + 0 = P(A).
Taking expectations of both sides of (9.5) gives P(A ∪ B) = P(A) + P(B) − P(A ∩ B). This is the inclusion-exclusion formula for 2 events. We generalize this to n events by using the identity
1 − (1 − x1)(1 − x2) · · · (1 − xn) = Σ_I (−1)^{|I|−1} Π_{i∈I} xi,   (9.8)
where the sum ranges over all nonempty subsets I of {1, . . . , n}.
?PROBLEM 9.12 (indicator of union: inclusion-exclusion). Use this last identity to gener-
alize the formula 1A∪B = 1A + 1B − 1A∩B to n events. Then take expectation of both sides to
derive the inclusion-exclusion formula (7.1).
Answer. Let xi = 1_{Ai}. Then
Π_{i=1}^n xi = Π_{i=1}^n 1_{Ai} = 1_{⋂_{i=1}^n Ai},
Π_{i=1}^n (1 − xi) = Π_{i=1}^n (1 − 1_{Ai}) = Π_{i=1}^n 1_{Ai^c} = 1_{⋂_{i=1}^n Ai^c} = 1_{(⋃_{i=1}^n Ai)^c} = 1 − 1_{⋃_{i=1}^n Ai}.
Substituting in (9.8), we obtain
1_{⋃_{i=1}^n Ai} = Σ_{∅≠I⊂{1,...,n}} (−1)^{|I|−1} 1_{⋂_{i∈I} Ai}.
Formula (7.1) follows immediately since E[1_{⋃_{i=1}^n Ai}] = P(⋃_{i=1}^n Ai) and E[1_{⋂_{i∈I} Ai}] = P(⋂_{i∈I} Ai).
PROBLEM 9.13 (a very useful, albeit trivial, identity). Let X be a discrete random variable,
that is, a function X : Ω → S, where S is a discrete set. Explain why
X = Σ_{x∈S} x·1_{X=x}.   (9.9)
Answer. We will show that the right-hand side equals X. We clearly have x·1_{X=x} = X·1_{X=x}. Hence
Σ_{x∈S} x·1_{X=x} = X Σ_{x∈S} 1_{X=x}.
But the sets {X = x}, x ∈ S, are pairwise disjoint and
⋃_{x∈S} {X = x} = {X = x for some x ∈ S} = X^{−1}(S) = Ω,
so
Σ_{x∈S} 1_{X=x} = 1_Ω = 1.
PROBLEM 9.14 (summing the tail gives the expectation). Let X be a random variable with
values in Z+ = {0, 1, . . .}. Explain why
E(X) = Σ_{n=0}^∞ P(X > n).   (9.10)
How to use indicator functions to simplify life. Suppose you have to perform the integral
∫_{−∞}^{∞} f(x)g(x) dx.
Many students resort to drawing pictures to perform this integral. But you don’t have to do
that if you write
expressions that hold for all −∞ < x < ∞. Check that these expressions are really equivalent
to the ones above. Now multiply the two to get
because
Hence
∫_{−∞}^{∞} f(x)g(x) dx = ∫_{−∞}^{∞} x(1 − x)·1_{0<x≤1} dx + ∫_{−∞}^{∞} x e^{−x}·1_{1<x≤2} dx = ∫_0^1 x(1 − x) dx + ∫_1^2 x e^{−x} dx,
and these are integrals that you can easily do (use integration by parts for the second) to get 1/6 for the first and 2e^{−1} − 3e^{−2} for the second.
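As a sanity check, a small Python sketch (ours) evaluates both pieces with a crude midpoint rule and compares them with the exact values quoted above:

from math import exp

def midpoint_integral(func, a, b, steps=10000):
    """Approximate the integral of func over [a, b] with the midpoint rule."""
    h = (b - a) / steps
    return sum(func(a + (i + 0.5) * h) for i in range(steps)) * h

first = midpoint_integral(lambda x: x * (1 - x), 0.0, 1.0)
second = midpoint_integral(lambda x: x * exp(-x), 1.0, 2.0)

print(first, 1 / 6)                          # ~0.16667 vs 1/6
print(second, 2 * exp(-1) - 3 * exp(-2))     # ~0.32975 vs 2e^-1 - 3e^-2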
{X ∈ B} ∈ E whenever B is an interval.
This is something that you will only learn in a slightly more advanced course in probability, if you have the chance to take it. If not, you will never know.
Chapter 10
Classical problems of elementary nature
PROBLEM 10.1 (shuffling the letters of a word). In how many ways can you rearrange the word BOOKKEEPER? More generally, consider a language whose alphabet contains d letters, say L1, . . . , Ld. Pick a word in this language with length n letters, such that Li appears ni times, i = 1, . . . , d. In how many ways can you rearrange the word? Show that n1 + · · · + nd = n. Deduce that if n1, . . . , nd are nonnegative integers with n1 + · · · + nd = n then the product n1!n2! · · · nd! divides n!.
Answer. The word has length n = 10 letters. If the letters were distinct then the number of rearrangements would be n! = 10!. For example, if we let E1, E2, E3 be the three E's (and similarly label the two O's and the two K's), then one of these n! rearrangements is K2O1BO2E2PE1RE3K1 and another is K1O1BO2E3PE2RE1K2. However, if we do not distinguish between identical letters then both these arrangements are KOBOEPEREK.
Thus we are free to permute the 2 K's between themselves, the 2 O's between themselves and the 3 E's between themselves. Hence BOOKKEEPER can be arranged in 10!/(2! 2! 3!) = 151,200 ways.
For the more general case, we use the same logic: Were the letters of the word distinct we would have n! arrangements. However, given an arrangement and a letter Li in the arrangement, we can permute the occurrences of Li between themselves–which can be done in ni! ways–and obtain an identical arrangement. Hence the total number of arrangements is n!/(n1!n2! · · · nd!). Since each letter Li appears ni times in the word (and this includes the possibility that ni = 0) we must have n1 + · · · + nd = n. The reason that n1!n2! · · · nd! divides n! is that the ratio n!/(n1!n2! · · · nd!) is the number of arrangements, which must be an integer.
?PROBLEM 10.2 (tossing k dice n times). Pick k identical dice and roll them once. Then
repeat n times. What is the set of outcomes? What is the probability of the event that you
will get k sixes in some roll? Give the probability as a decimal number when k = 1, n = 4 and
when k = 2, n = 24.
Answer. A die has 6 faces with numbers 1 through 6 on them. The result of rolling k dice at once can be described by x = (x1, . . . , xk), where xi is what die i shows. Hence a full outcome for the n rolls is
(x(1), x(2), . . . , x(n)) = (x1(1), . . . , xk(1); x1(2), . . . , xk(2); . . . ; x1(n), . . . , xk(n)).
The set of outcomes is Ω = ({1, . . . , 6}^k)^n. To find the probability of the event we first need to define the probability P. In the absence of further information, we assume that the probability is uniform. Thus each outcome has probability 1/6^{kn}. Call the event under consideration A. Then A^c can be described as the set of all outcomes such that x(i) ≠ (6, . . . , 6) for all i = 1, . . . , n. The cardinality of A^c is (6^k − 1)^n. Hence
P(A) = 1 − (6^k − 1)^n / 6^{kn}.
If k = 1, n = 4, P(A) = 1 − 5^4/6^4 ≈ 0.518. If k = 2, n = 24, P(A) = 1 − 35^{24}/36^{24} ≈ 0.491.
PROBLEM 10.3 (sum of dice rolls). Roll 3 dice and let S be the sum of the numbers observed.
Explain why S is a random variable and compute the probability that S = 9 and the probability
that S = 10.
Answer. An outcome is (x1, x2, x3), where xi is what die i shows. S assigns value x1 + x2 + x3 to this outcome. Hence it is a function from the set Ω = {1, 2, 3, 4, 5, 6}^3 into R. Hence S is a random variable. To compute P(S = 9) we need to define P. We will assume that P is uniform. Hence the probability of each outcome (x1, x2, x3) is 1/6^3. Now let us see which outcomes belong to the event {S = 9}. The partitions of 9 into 3 parts are
6+2+1   5+3+1   4+4+1   5+2+2   4+3+2   3+3+3
  6       6       3       3       6       1
and the number below each partition is the number of outcomes corresponding to that partition. (Recall that a partition of a number is a way to represent the number as a sum without paying attention to the order; but an outcome has order.) Hence there are 6+6+3+3+6+1 = 25 outcomes in the event {S = 9}, so P(S = 9) = 25/6^3 = 25/216 ≈ 0.1157.
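The same count can be checked (and P(S = 10) obtained the same way) by brute-force enumeration; a minimal Python sketch (ours):

from itertools import product
from fractions import Fraction

outcomes = list(product(range(1, 7), repeat=3))   # all 6^3 = 216 ordered rolls

for s in (9, 10):
    count = sum(1 for roll in outcomes if sum(roll) == s)
    print(s, count, Fraction(count, len(outcomes)), count / len(outcomes))
# S = 9 : 25/216 ~ 0.1157,   S = 10 : 27/216 = 1/8 = 0.125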
?PROBLEM 10.4 (permutations in a row and on a circle). n ≥ 3 people are seated in a row.
What is the probability that Mr X sits next to Mrs Y? n ≥ 3 people are seated at a round table.
What is the probability that Mr X sits next to Mrs Y?
Answer. Let 1, 2, . . . , n be the names of the people, with X = 1 and Y = 2. An outcome ω = (ω1, . . . , ωn) in the first case is a permutation of the n people. There are n! outcomes. In the absence of any further information, we assume that P is uniform on the set of outcomes; that is, P assigns value 1/n! to each outcome. The event A = "1 sits next to 2" is the set of outcomes ω such that ωi = 1, ωi+1 = 2 or ωi = 2, ωi+1 = 1 for some i = 1, . . . , n − 1. To count the number |A| of outcomes in this event, we think of 1 as glued to 2 (which can be done in 2 ways), so that we have n − 1 objects now (which can be permuted in (n − 1)! ways). Hence |A| = 2(n − 1)!. Hence
P(A) = 2(n − 1)!/n! = 2/n.
If the people are sitting at a round table, then certain permutations must be identified. For example, if n = 4, one permutation is 2, 4, 1, 3. However this must be identified with 4, 1, 3, 2 and with 1, 3, 2, 4 and with 3, 2, 4, 1. The number of permutations identified with any given one is n. Hence the set of all outcomes has cardinality (n − 1)!. If B is the event that "1 sits next to 2" then |B| = 2(n − 2)!. Hence
P(B) = 2(n − 2)!/(n − 1)! = 2/(n − 1).
Check: if n = 3 then P(A) = 2/3, P(B) = 1. The veracity of these two probabilities can be verified
by inspection.
?PROBLEM 10.5 (dancing pairs). An even number n of people are dancing in pairs. Each of
the people has a unique partner. What is the probability that a person dances with his/her
partner?
Answer. Here it is a little difficult to imagine what the set of outcomes is. One way to do this is to consider all n! permutations of the n people: the people arrive in the room in a random order. The doorman picks a person and assigns him/her to the person arriving immediately after him/her. For concreteness, let n = 4. The set of all possible n! arrival patterns then splits into 3 distinct pairings.
The event that "each person dances with his/her partner" contains only one outcome. Hence its probability is 1/3. With an even number n of people, we have n! arrival patterns. Each arrival pattern is identified with 2^{n/2}(n/2)! arrival patterns (including itself). Indeed, we split each arrival pattern into n/2 boxes of size 2 each. Boxes can be permuted in (n/2)! ways. Changing the order of the 2 contents of a box does not change anything. This can be done in 2 ways per each of the n/2 boxes, which gives the extra factor 2^{n/2}. Hence
number of outcomes = n!/(2^{n/2}(n/2)!),
and
P(each person dances with his/her partner) = 2^{n/2}(n/2)!/n!.
PROBLEM 10.7 (probability that a random committee has k members). Form a committee
by picking any number of members from a set of n people and by designating a person as
head of the committee. What is the probability that the committee will have k members?
Show that the identity
Σ_{k=1}^n k \binom{n}{k} = n·2^{n−1}
holds.
Answer. An outcome is a subset of {1, 2, . . . , n} together with a distinguished member designated as head. Thus there are n·2^{n−1} outcomes. The number of outcomes with k members is k·\binom{n}{k}. Hence
P(committee has k members) = k·\binom{n}{k} / (n·2^{n−1}).
If Ak is the event "committee has k members" then Ak ∩ Aℓ = ∅ for k ≠ ℓ and Ω = ⋃_{k=1}^n Ak. Hence, by (AXIOM TWO), we have 1 = P(Ω) = Σ_{k=1}^n P(Ak), and substituting the value of P(Ak) from the previous display gives the identity.
Ω = {(n1 , n2 , . . . , nm ) : n1 , . . . , nm ∈ Z+ , n1 + · · · + nm = n}.
¹ In Physics, they say "Maxwell-Boltzmann statistics". But the term "statistics" is wrong. More generally, there is an area of Physics called "Statistical Physics" but, again, the adjective "statistical" is wrong. It should be called "probabilistic physics" or, better yet, "stochastic physics". But, at the time that this terminology was coined, people hadn't quite realized the difference between statistics and probability and hadn't quite understood the mathematics of the latter or its foundations. Some people used to think that what we now know as theorems (meaning statements that follow from pure thought) were experimental results. In fact, 4 thousand years ago, what we now know as the "Pythagorean theorem" was taken as an experimental result and not as a fact that can be proven.
Recall that Z+ = {0, 1, 2, 3, 4, 5, . . .}. We also have that Ω has cardinality \binom{n+m−1}{m−1}; see Item 7 in Appendix A. Assuming uniform probability on Ω we have
P(ni particles at state i, for i = 1, . . . , m) = 1/\binom{n+m−1}{m−1},   n1, . . . , nm ≥ 0, n1 + · · · + nm = n.   (10.1)
The cardinality of Ω is clearly the same as the number of subsets of {1, . . . , m} of size n, that is, \binom{m}{n}. Assuming uniform probability on Ω we have
P(ni particles at state i, for i = 1, . . . , m) = 1/\binom{m}{n},   n1, . . . , nm ∈ {0, 1}, n1 + · · · + nm = n.
?PROBLEM 10.12 (distinguishable balls in distinct boxes). Put n distinguishable balls into
m distinct boxes at random. What is the probability that the first box contains k balls? In
other words, let X be the random variable indicating the number of balls in the first box;
find the distribution of X, that is, the probabilities P(X = k) for k = 0, 1, . . .. Also find the
expectation of X.
Answer. The set of outcomes is Ω = {1, . . . , m}^n, that is, all functions from the set {1, . . . , n} of balls into the set {1, . . . , m} of boxes. As seen before, |Ω| = m^n. We are interested in the event A = "box 1 contains exactly k balls". For each set I ⊂ {1, . . . , n} of size k, let
AI = {ω ∈ Ω : ωi = 1 for i ∈ I and ωi ≠ 1 for i ∉ I}.
By (AXIOM TWO), P(A) = Σ_{I:|I|=k} P(AI). But P(AI) is the same for all I with |I| = k (it does not matter which balls go into box 1). Assuming that P is uniform on Ω, we have, with I = {1, . . . , k},
P(AI) = |AI|/m^n = (m − 1)^{n−k}/m^n.
Indeed, the event AI is the event that ω1 = 1, . . . , ωk = 1, whereas the other ωj, j = k + 1, . . . , n, can have any of the m − 1 values 2, 3, . . . , m. Therefore,
P(A) = Σ_{I:|I|=k} (m − 1)^{n−k}/m^n = \binom{n}{k} (m − 1)^{n−k}/m^n,
because all terms in the sum are equal and because the number of I, subsets of {1, . . . , n}, with size k is \binom{n}{k}.
The random variable X is given by
X = Σ_{i=1}^n 1_{Ai},   (10.2)
where 1_{Ai} is the indicator (random variable) of the event Ai = {ω ∈ Ω : ωi = 1}. Since P(ωi = 1) = 1/m, we have
E(1_{Ai}) = 1/m.
By the linearity of the expectation,
E(X) = Σ_{i=1}^n E(1_{Ai}) = n/m.
Note: We will later show how to compute the same probability using the concept of indepen-
dence. Indeed, we will show that ω1 , . . . , ωn are independent random variables. The fact that
ωi is a random variable is obvious from the fact that it represents the box assigned to the i-th
ball, that is, it represents the function
(ω1 , . . . , ωn ) 7→ ωi .
The fact that the random variables are independent is a consequence of the fact that P is
uniform on Ω. To understand this note, look forward at the chapter on independence.
PROBLEM 10.13 (expected number of particles at a given state according to Bose-Einstein).
According to the Bose-Einstein model (where particles are indistinguishable), what is the
probability that there are k particles at state 1? Let X be the random variable indicating the
number of balls in the first box. Find the expectation of X.
Answer. As explained above, we let P be the uniform probability measure on
Ω = {(n1 , n2 , . . . , nm ) : n1 , . . . , nm ∈ Z+ , n1 + · · · + nm = n}. (10.3)
Since Ω contains \binom{n+m−1}{m−1} elements, P assigns value \binom{n+m−1}{m−1}^{−1} to each element. We are interested in the event
A = {(n1, . . . , nm) ∈ Ω : n1 = k}.
Using the logic we used to compute the cardinality of Ω (see Problem 10.9 above and see Item 7 in Appendix A), we have |A| = \binom{n−k+m−2}{m−2}: simply replace n by n − k and m by m − 1. Hence
P("number of particles in state 1 is k") = \binom{n−k+m−2}{m−2} / \binom{n+m−1}{m−1} = P(X = k).
If you use the definition of E(X) to compute it, you have to evaluate the rather hard sum Σ_{k=0}^n kP(X = k). Instead, take into account the symmetry in the problem. Namely, the formula for the probability (10.1) does not change if we permute (n1, . . . , nm). Hence, if we let
Xi(n1, . . . , nm) = ni,   (10.4)
then
E(X1) = · · · = E(Xm).
Since X1 + · · · + Xm = n, linearity of expectation gives E(X) = E(X1) = n/m.
PROBLEM 10.15 (sampling without replacement). An urn contains 12 red and 8 blue balls.
Balls of the same color are identical. We pick 4 balls at random. What is the probability that
we get 2 balls of each color? What is the probability that we get 1 red and 3 blue? What is the
probability that we get 3 red and 1 blue?
Answer. Call the balls 1, 2, . . . , 20. Think of the first 12 as being red and of the last 8 as being blue, so R = {1, . . . , 12}, B = {13, . . . , 20}. Take as Ω the set of all subsets ω of {1, 2, . . . , 20} of size 4. Then |Ω| = \binom{20}{4}. The event A = "get 2 balls of each color" is the set of all ω ∈ Ω such that |ω ∩ R| = 2 and |ω ∩ B| = 2. We can pick 2 elements of R in \binom{12}{2} ways and 2 of B in \binom{8}{2} ways. Hence, assuming that P is uniform on Ω,
P(get 2 balls of each color) = \binom{12}{2}\binom{8}{2} / \binom{20}{4} ≈ 0.38.
Similarly,
P(get 1 red and 3 blue) = \binom{12}{1}\binom{8}{3} / \binom{20}{4} ≈ 0.14,   P(get 3 red and 1 blue) = \binom{12}{3}\binom{8}{1} / \binom{20}{4} ≈ 0.36.
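These hypergeometric probabilities are easy to check in Python (our sketch):

from math import comb

total = comb(20, 4)   # number of ways to pick 4 balls out of 20

def prob(red, blue):
    """P(sample of 4 contains `red` red balls and `blue` blue balls)."""
    return comb(12, red) * comb(8, blue) / total

print(prob(2, 2))   # ~0.3814
print(prob(1, 3))   # ~0.1387
print(prob(3, 1))   # ~0.3632

# The counts (4,0), (3,1), (2,2), (1,3), (0,4) exhaust all cases:
print(sum(prob(r, 4 - r) for r in range(5)))   # 1.0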
?PROBLEM 10.16 (sampling without replacement, general case). An urn contains Ni balls
of color i, i = 1, . . . , c. (c is the number of colors). Balls of the same color are identical. We pick
n balls at random. What is the probability that we get n1 balls of color 1, n2 of color 2, etc?
Answer. Let N = N1 + · · · + Nc be the number of balls. Consider {1, 2, . . . , N} as the set of balls. Let R1 = {1, . . . , N1}, R2 = {N1 + 1, . . . , N1 + N2}, etc. Thus we split the set of balls into a set R1 of balls of color 1, a set R2 of balls of color 2, etc. A sample (an outcome) is a set ω of balls of size n. Let Ω be the set of subsets of balls of size n. The event of interest is thus
A = {ω ∈ Ω : |ω ∩ Ri| = ni, i = 1, . . . , c}.
Assuming P to be the uniform probability measure on Ω, we have that P assigns value P{ω} = 1/|Ω| = 1/\binom{N}{n} to each ω. Therefore,
P(A) = Σ_{ω∈A} P{ω} = Σ_{ω∈A} 1/|Ω| = |A|/|Ω| = \binom{N1}{n1}\binom{N2}{n2} · · · \binom{Nc}{nc} / \binom{N}{n}.   (10.5)
?PROBLEM 10.17 (sampling without replacement, another view). Recall the multinomial
coefficient from Appendix B, 3(d). Show that the probability of (10.5) can be written as
\binom{N1}{n1}\binom{N2}{n2} · · · \binom{Nc}{nc} / \binom{N}{n} = \binom{n}{n1,...,nc}\binom{N−n}{N1−n1,...,Nc−nc} / \binom{N}{N1,...,Nc},
first by trivial algebra and second by setting Problem 10.16 on a different sample space.
Answer. We have
\binom{N1}{n1} · · · \binom{Nc}{nc} / \binom{N}{n} = [N1!/(n1!(N1−n1)!)] · · · [Nc!/(nc!(Nc−nc)!)] / [N!/(n!(N−n)!)]
= [n!/(n1! · · · nc!)] · [(N−n)!/((N1−n1)! · · · (Nc−nc)!)] / [N!/(N1! · · · Nc!)] = \binom{n}{n1,...,nc}\binom{N−n}{N1−n1,...,Nc−nc} / \binom{N}{N1,...,Nc}.
This suggests that we can use a sample space Ω′ of size equal to the denominator of the last fraction. By Appendix B, 3(d), we can take as Ω′ the set of arrangements of N objects when N1 are identical, i.e. of color 1, N2 of color 2, etc. Let {1, . . . , N} be the set of objects (= balls, in our case). An arrangement ω = (ω1, . . . , ωN) is an assignment of a color ωi ∈ {1, . . . , c} to ball i. Let S = {1, . . . , n} be the set we select and S^c those we do not select. Then the event, say A′, of interest is the set of all ω such that the number of balls j ∈ S with ωj = i equals ni, for i = 1, . . . , c. We can easily see that this event has size equal to the numerator of the last fraction in the display above. Assuming that P′ is uniform on Ω′, we have that
P′(A′) = |A′|/|Ω′| = \binom{n}{n1,...,nc}\binom{N−n}{N1−n1,...,Nc−nc} / \binom{N}{N1,...,Nc}.
PROBLEM 10.19 (matching socks). You have 8 red socks, 7 blue and 5 yellow. Pick 2 socks
at random. What is the probability that they match in color?
Answer. If Ar, Ab, Ay are the events that the two selected socks are red, blue, yellow, respectively, then, by (AXIOM TWO),
P(the socks match) = P(Ar) + P(Ab) + P(Ay) = [\binom{8}{2} + \binom{7}{2} + \binom{5}{2}] / \binom{20}{2} = (28 + 21 + 10)/190 = 59/190 ≈ 0.31.
PROBLEM 10.20 (sample constituency). An urn contains 12 balls, 3 red, 5 blue and 4 green. We select 6 balls at random in sequence. What is the probability that the first two are red, the next three blue and the sixth green?
Answer. The probability of the event A that the sample contains 2 red, 3 blue and 1 green is, as seen in Problem 10.17,
P(A) = \binom{6}{2,3,1}\binom{12−6}{3−2,5−3,4−1} / \binom{12}{3,5,4} = (60 · 60)/27,720 = 10/77 ≈ 0.12987.
On the other hand, A is the union of pairwise disjoint events of the form
A_{r,r,b,b,b,g} = "first two are red, the next three blue and the sixth green",   (10.6)
one event for each arrangement of two r's, three b's and one g, and all these events have the same probability. Hence
P(A) = ℓ P(A_{r,r,b,b,b,g}),
where ℓ is the number of events of the form (10.6). But ℓ is the number of arrangements of 6 objects in a row of which 2 are red, 3 are blue and 1 is green, so
ℓ = \binom{6}{2, 3, 1} = 60.
Hence
P(A_{r,r,b,b,b,g}) = P(A)/60 = 60/27,720 ≈ 0.00216.
Note: When we learn about conditional probability–see the relevant chapter–we shall have another way to compute P(A_{r,r,b,b,b,g}). The rationale goes like this: The first ball is red with probability 3/12; the second is red with probability 2/11; the third is blue with probability 5/10; and so on. Multiplying these numbers gives
(3/12) · (2/11) · (5/10) · (4/9) · (3/8) · (4/7) ≈ 0.00216.
(standing for Spades, Clubs, Diamonds and Hearts, respectively). A unique element of V × F
is assigned to each card. Check: 13 × 4 = 52. We select n = 5 cards at random and ask the
probabilities of various events determined by what values appear in the sample and how
many cards have the same value. Each of these events corresponds to a partition of 5 (see
Appendix B, Section 7). In reading the table below, interpret the v, v′, . . . appearing on the same line as "some v, some v′, . . ., all distinct".
partition of 5   meaning                                     colloquial name   event name
5                all cards of same value                     cheating          A_5
4+1              4 of value v and 1 of value v′              four of a kind    A_{4,1}
3+2              3 of value v and 2 of value v′              full house        A_{3,2}
3+1+1            3 of value v, 1 of v′ and 1 of v″           three of a kind   A_{3,1,1}
2+2+1            2 of value v, 2 of v′, and 1 of v″          two pairs         A_{2,2,1}
2+1+1+1          2 of value v, 1 of v′, 1 of v″, 1 of v‴     one pair          A_{2,1,1,1}
1+1+1+1+1        all distinct values                         nil               A_{1,1,1,1,1}
Determine the probabilities of all the above events and verify that they add up to 1. Why is
that?
Answer. Let Π = {1, . . . , 52} be the set of cards and think of g = (v, s) as a function on Π, for
example, g(1) = (A, S), g(2) = (2, S), g(3) = (3, S), etc., so that g : Π → V × F is a bijection.
Equivalently, you can imagine the cards laid out in a particular order:
A 2 3 4 5 6 7 8 9 10 J Q K
Our sample space is taken to be Ω, the set of all subsets of Π of size 5 (the possible 5-card hands). Assuming P to be uniform on Ω, we compute the probability of the full house event A_{3,2} by counting the number of its
² A is called ace and stands for 1, J is jack and stands for 11, Q is queen and stands for 12 and K is king and stands for 13. The card with label (8, S) is called "8 of spades"; the card with label (A, D) is called "ace of diamonds", and so on.
elements. We wish the sample to contain 2 different values. The first value corresponds to any of the 13 columns, the second to any of the remaining 12. We thus can pick two distinct (ordered) columns in 13 × 12 ways. Having selected the first value, we assign suits to the three cards of that value in \binom{4}{3} ways and then we assign suits to the two cards of the other value in \binom{4}{2} ways. Hence
|A_{3,2}| = 13 · 12 · \binom{4}{3} · \binom{4}{2} = 3744.
Since
|Ω| = \binom{52}{5} = 2,598,960,
we find
P(A_{3,2}) = 3744/2,598,960.
Arguing in a similar manner, we find
P(A_5) = 0,
P(A_{4,1}) = 13 · 12 · \binom{4}{4}\binom{4}{1} / |Ω| = 624/2,598,960,
P(A_{3,2}) = 13 · 12 · \binom{4}{3}\binom{4}{2} / |Ω| = 3,744/2,598,960,
P(A_{3,1,1}) = 13 · (12·11/2) · \binom{4}{3}\binom{4}{1}\binom{4}{1} / |Ω| = 54,912/2,598,960,
P(A_{2,2,1}) = (13·12/2) · 11 · \binom{4}{2}\binom{4}{2}\binom{4}{1} / |Ω| = 123,552/2,598,960,
P(A_{2,1,1,1}) = 13 · (12·11·10/3!) · \binom{4}{2}\binom{4}{1}\binom{4}{1}\binom{4}{1} / |Ω| = 1,098,240/2,598,960,
P(A_{1,1,1,1,1}) = (13·12·11·10·9/5!) · 4^5 / |Ω| = 1,317,888/2,598,960.
Note that
624 + 3,744 + 54,912 + 123,552 + 1,098,240 + 1,317,888 = 2,598,960,
so the sum of the probabilities above equals 1. This must be so because the 7 events above
are pairwise disjoint and their union is Ω (AXIOM TWO).
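As a sanity check, here is a short Python sketch (an illustration, not part of the notes) that recomputes the seven counts and verifies that they add up to \binom{52}{5}.

# Illustrative check: the seven counts add up to C(52,5).
from math import comb

counts = {
    "A5":          0,                                              # impossible: only 4 suits
    "A4,1":        13 * 12 * comb(4, 4) * comb(4, 1),              # 624
    "A3,2":        13 * 12 * comb(4, 3) * comb(4, 2),              # 3,744
    "A3,1,1":      13 * comb(12, 2) * comb(4, 3) * 4 * 4,          # 54,912
    "A2,2,1":      comb(13, 2) * 11 * comb(4, 2)**2 * 4,           # 123,552
    "A2,1,1,1":    13 * comb(4, 2) * comb(12, 3) * 4**3,           # 1,098,240
    "A1,1,1,1,1":  comb(13, 5) * 4**5,                             # 1,317,888
}
total = sum(counts.values())
print(total, comb(52, 5), total == comb(52, 5))   # 2,598,960 twice, True
for name, c in counts.items():
    print(name, c / comb(52, 5))                  # the seven probabilities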
PROBLEM 10.22 (temperature affects energy levels). Let S = {x1, . . . , xm} be a finite set of “states” of a physical system, where the xi are distinct real numbers, representing, say, energy levels with x1 < x2 < · · · < xm. The uniform probability assigns value 1/m to each state. Here is another way to assign a (non-uniform) probability. Let β ≥ 0 and let the probability assigned to state xi be pi = pi(β) ∝ e^{−βxi}. Let Z(β) := Σ_{i=1}^m e^{−βxi} and let Pβ be the probability measure defined by Pβ{xi} = e^{−βxi}/Z(β).
If X is a random variable on (S, P(S), Pβ) let Eβ(X) denote its expectation under Pβ. Define the energy random variable W : S → R by W(xi) = xi. Show that Eβ(W) = −(d/dβ) log Z(β). Also show that Pβ converges to P0 as β → 0 and that Pβ converges to δ_{x1} as β → ∞, where δ_{x1} is the probability measure assigning value 1 to x1 and 0 to all other states–recall definition (8.5). In physics, the reciprocal T = 1/β of β is called temperature. When the temperature goes to infinity all states become equally likely, while at zero temperature only the minimal energy state is possible.
Answer. We have
Eβ(W) = Σ_{i=1}^m xi pi(β) = Σ_{i=1}^m xi e^{−βxi}/Z(β).
But (d/dβ) e^{−βx} = −x e^{−βx} for all x ∈ R. So
(d/dβ) log Z(β) = (1/Z(β)) (d/dβ)Z(β) = (1/Z(β)) Σ_{i=1}^m (−xi) e^{−βxi} = −Eβ(W).
Since pi(β) is a continuous function of β ∈ [0, ∞), we have that limβ→0 pi(β) = pi(0) = 1/Z(0) = 1/m,
which is indeed the uniform probability measure P0. To find the limit as β → ∞, write
p1(β) = 1 / Σ_{i=1}^m e^{−β(xi−x1)} = 1 / (1 + Σ_{i=2}^m e^{−β(xi−x1)})
and, since xi − x1 > 0 for i = 2, . . . , m, we have limβ→∞ e^{−β(xi−x1)} = 0 for all i = 2, . . . , m, so
limβ→∞ p1(β) = 1.
Since p1(β) + p2(β) + · · · + pm(β) = 1 and m is a finite number that does not depend on β, we have limβ→∞ pi(β) = 0 for i = 2, . . . , m. Hence
Pβ(A) = Σ_{i: xi∈A} pi(β) → 1 if x1 ∈ A, and → 0 otherwise.
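A small numerical sketch (not from the notes; the energy levels below are made-up values) illustrates the two limits: uniform weights as β → 0 and all mass on the minimal energy as β → ∞.

# Boltzmann weights p_i(beta) on a few hypothetical energy levels.
import math

x = [1.0, 2.0, 3.5, 7.0]          # assumed energy levels x1 < x2 < ... < xm

def p(beta):
    weights = [math.exp(-beta * xi) for xi in x]
    Z = sum(weights)                           # the normalizing constant Z(beta)
    return [w / Z for w in weights]

for beta in [0.0, 0.5, 2.0, 10.0, 50.0]:
    print(beta, [round(q, 4) for q in p(beta)])
# beta = 0  gives [0.25, 0.25, 0.25, 0.25]   (the uniform measure P0)
# beta = 50 gives essentially [1, 0, 0, 0]   (all mass on the minimal energy x1)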
?PROBLEM 10.23 (matching problem–see Problem 10.5). In a ballroom, there are n couples,
n men and n women. Each man is paired with a woman at random. What is the probability
that some man dances with his own wife?
Answer. The sample space Ω here is the set of pairings between men and women. We can assume that the set of men is {1, . . . , n} and the set of women is {1, . . . , n} and that woman i is man i's wife. We can describe a pairing w by
w = \begin{pmatrix} 1 & 2 & 3 & \cdots & n \\ w_1 & w_2 & w_3 & \cdots & w_n \end{pmatrix},
where the top row corresponds to men and the bottom to women. Clearly, each woman
appears only once in the second row. Hence Ω has n! elements. The phrase “each man is
paired with a woman at random” means that each element of Ω has probability 1/n!. Let Fi be
the event that man i dances with his own wife. We are interested in the event
∪_{i=1}^n Fi = “some man dances with his own wife”.
Note that Fi is the set of pairings such that wi = i. But then Fi has (n − 1)! elements, so
P(Fi) = (n − 1)!/n! = 1/n.³
Now take two different men i and j. The event Fi ∩ Fj has (n − 2)! elements, hence
P(Fi ∩ Fj) = (n − 2)!/n! = 1/(n(n − 1)).
The pattern is clear. We now use the inclusion–exclusion formula to calculate the probability of the event ∪_{i=1}^n Fi that at least one man dances with his own wife. If I ⊂ {1, . . . , n} then
P(∩_{i∈I} Fi) = 1/(n)_{|I|}.
So
P(∪_{i=1}^n Fi) = Σ_{∅≠I⊂{1,...,n}} (−1)^{|I|−1} 1/(n)_{|I|}.
Since there are \binom{n}{k} subsets I of {1, . . . , n} of size k, we further have
P(∪_{i=1}^n Fi) = Σ_{k=1}^n (−1)^{k−1} \binom{n}{k} 1/(n)_k = Σ_{k=1}^n (−1)^{k−1} 1/k!.
This is the answer. We can go a bit further. Since limn→∞ Σ_{k=0}^n x^k/k! = e^x, we have
P(some man dances with his own wife) ≈ limn→∞ Σ_{k=1}^n (−1)^{k−1}/k! = 1 − e^{−1} ≈ 0.632.
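A quick Monte Carlo sketch (illustrative only, not part of the notes) confirms this: a uniformly random pairing has a fixed point with probability close to 1 − 1/e already for moderate n.

# Monte Carlo estimate of P(some man dances with his own wife), compared with 1 - 1/e.
import math
import random

def has_fixed_point(n):
    perm = list(range(n))
    random.shuffle(perm)                      # a uniformly random pairing
    return any(perm[i] == i for i in range(n))

n, trials = 20, 200_000
estimate = sum(has_fixed_point(n) for _ in range(trials)) / trials
exact = sum((-1)**(k - 1) / math.factorial(k) for k in range(1, n + 1))
print(estimate, exact, 1 - math.exp(-1))      # all three close to 0.632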
?PROBLEM 10.24 (number of matchings–see Problems 10.5 and 10.23). In the previous problem find the probability of the event that exactly m men dance with their own wives, m = 0, 1, . . . , n.
³ Another way to describe an outcome is to say that it is a one-to-one function from the set {1, . . . , n} into itself. Hence Ω is the set of all bijections on {1, . . . , n}. The event Fi is the set of all bijections that have i as a fixed point.
To find the probability that exactly m pairs are formed, we first give a name to this event, say Gn,m, and, for each I ⊂ {1, . . . , n} with |I| = m, let
Gn,m,I = “the men whose names are in I dance with their wives but the rest do not”,
so that Gn,m is the disjoint⁵ union of the Gn,m,I over all such I, and |Gn,m| is the sum of the |Gn,m,I|. But all terms in the sum are equal, by symmetry. There are \binom{n}{m} terms in the sum, hence
|Gn,m| = \binom{n}{m} |Gn,m,I|, with I = {1, . . . , m}.
Notice that
limn→∞ P(Gn,m) = (1/m!) e^{−1}, m = 0, 1, 2, . . . (10.8)
⁴ Because we speak English and therefore understand that the negation of the sentence “no man dances with his own wife” is the sentence “some man dances with his own wife”.
⁵ If we take two different sets of men of size m each then there is a man that belongs to one set but not to the other, hence this man dances with his wife and does not dance with his wife. This is an absurd sentence and this explains why we obtain ∅.
Chapter 11

Conditional probability and independence

PROBLEM 11.1 (motivating conditional probability). In tossing a fair coin three times uniformly at random, find the probability that we obtain two or more heads given that the first toss lands tails.
Answer. The set of configurations here is
Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT},
where, e.g., HTH means that we first toss a head (H) then a tail (T) and then a head. There are
8 configurations. The assumption tells us to assign probability 1/8 to each configuration, that
is, P is the uniform probability measure on Ω. Let A be the event that we have two or more
heads. Explicitly,
A = {HHH, HHT, HTH, THH}.
So P(A) = 4/8 = 1/2. But if we know that the first toss lands tails then we can exclude the first three configurations in A and are left with only one: THH. The probability of this should be 1/4
(one out of four configurations). Let us call B the event that the first toss lands tails. We have
shown that the
P(A|B) = probability that we obtain two or more heads given that the first toss
lands tails = 1/4.
We now find another way to obtain this number. You see, when we excluded the configurations
in A that are not in B we really considered the set A ∩ B which has only 1 element and thus
P(A ∩ B) = 1/8.
Since B = {THH, THT, TTH, TTT},
P(B) = 4/8.
Hence
P(A ∩ B)/P(B) = (1/8)/(4/8) = 1/4.
Motivated by this problem (as well as a couple of centuries worth of experience), we
define the conditional probability of event A given event B (or conditional on event B) by
the formula
P(A|B) := P(A ∩ B)/P(B),
in general, not just for uniform probability measures. Experience shows that this definition
works.
PROBLEM 11.2 (a simple problem on conditioning). Let P be the uniform probability on
the set of pairs (i, j) of positive integers between 1 and n. Let B be the event that j > i. Let
Ak be the event that the largest of the i, j is equal to k. Compute the probability P(Ak |B).
Answer. The sample space Ω is the set of all pairs (i, j). There are n2 such pairs. Hence the
probability of every pair is 1/n2 (because we used the adjective “uniform”). The event B
contains all pairs (i, j) with j > i. If we subtract from n2 the n pairs (i, i) we get n2 − n. Halve
this to get that the cardinality of B is (n2 − n)/2. We now need to compute the cardinality of
the event Ak ∩ B. This contains all pairs (i, j) with i < j = k, that is, all pairs (i, k) where i < k,
that is, Ak ∩ B = {(1, k), (2, k), . . . , (k − 1, k)}. So it has cardinality k − 1. Therefore
P(Ak|B) = P(Ak ∩ B)/P(B) = |Ak ∩ B|/|B| = 2(k − 1)/(n² − n).
Note that P(A1 |B) = 0 (why?) and P(An |B) = 2/n.
Just as we have some properties of (unconditional) probability, we now look at some
properties for conditional probability.
2. Conditional probability is a probability measure. Fix an event B. Then the function that
assigns probability P(A|B) to an arbitrary event A is itself a probability measure.
Here is why: Fix an event B with P(B) > 0 and define the function
PB(A) := P(A ∩ B)/P(B), A an arbitrary event;
that is, PB takes as its argument all events. First, we have PB(Ω) = P(Ω ∩ B)/P(B) =
P(B)/P(B) = 1. So (AXIOM ONE) is satisfied. Next, let A1 , A2 , . . . be a sequence of events.
Since
[∞ [∞
B∩ An = (B ∩ An ),
n=1 n=1
we have ∞ ∞
[ 1 [
PB An =
P (B ∩ An ) .
P(B)
n=1 n=1
Assume now that the events A1 , A2 , . . . are pairwise disjoint. Then the events B∩A1 , B∩A2 , . . .
are pairwise disjoint. By (AXIOM TWO) for P we have
P(∪_{n=1}^∞ (B ∩ An)) = Σ_{n=1}^∞ P(B ∩ An).
Therefore
PB(∪_{n=1}^∞ An) = (1/P(B)) Σ_{n=1}^∞ P(B ∩ An) = Σ_{n=1}^∞ PB(An),
so (AXIOM TWO) holds for PB. Since both (AXIOM ONE) and (AXIOM TWO) hold for PB, it follows that PB is, as well, a probability (measure).
and thus the formula holds for n = 3. We can use induction to show (in one line) that the
formula holds for all n.
5. Bayes' rule, first form.
P(B|A) = P(A|B)P(B)/P(A).
Here is why: In the probability P(A ∩ B) the order of A and B can be changed because
A ∩ B = B ∩ A. Now use the definition of conditional probability twice. First, P(A ∩ B) =
P(A)P(B|A). And then, P(B ∩ A) = P(B)P(A|B). That’s it.
6. Bayes’ rule, second form. If H1 , . . . , Hn form a partition of Ω, namely, they are pairwise
disjoint events with union equal to Ω, then
P(Hi|A) = P(A|Hi)P(Hi) / Σ_{k=1}^n P(A|Hk)P(Hk), i = 1, . . . , n.
Here is why: apply the previous formula to get P(Hi|A) = P(A|Hi)P(Hi)/P(A) and then apply the total probability formula: P(A) = Σ_{k=1}^n P(A|Hk)P(Hk).
Interpretation: If H1 , . . . , Hn are interpreted as alternative hypotheses and have prior
probabilities P(H1 ), . . . , P(Hn ), then observation of an event A will change these probabilities
to the posterior probabilities P(H1 |A), . . . , P(Hn |A).
PROBLEM 11.3 (false positive and false negative errors: observations affect decisions). In
a population, 1% of people have a certain disease. A test is devised to detect the disease. If
the patient has the disease then the test is positive (detects the disease) with probability 96%.
If the patient does not have the disease the test may erroneously indicate that the patient does
have the disease with probability 2%. Is the test a reliable indicator of the disease?
Figure 11.1: In a population with a pandemic, 1% of people have the disease. A test is devised that is shown to have a 2% false positive and a 4% false negative probability. Is the test reliable?
Answer. We let D be the event that a patient has the disease. Since 1% of people have the disease, we let P(D) = 0.01. Hence P(Dc) = 0.99. We let T+ be the event that the test is positive and T− the event that it is negative. We are given that P(T+|D) = 0.96 and P(T+|Dc) = 0.02. By Bayes' rule,
P(D|T+) = P(T+|D)P(D)/P(T+).
The numerator is P(T+|D)P(D) = 0.96 × 0.01 = 0.0096. The denominator is, by the summation formula,
P(T+) = P(T+|D)P(D) + P(T+|Dc)P(Dc) = 0.0096 + 0.02 × 0.99 = 0.0096 + 0.0198 = 0.0294.
Hence
P(D|T+) = 0.0096/0.0294 = 96/294 ≈ 32%.
The test is totally unreliable.
Discussion: The probability P(T+ |Dc ) is called “false positive probability” (or type I error). The
probability P(T− |D) is called “false negative probability” (or type II error). Which of the two
probabilities should be reduced in order that there be a significant change in the usefulness of
the test?
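A small sketch (not part of the notes) makes it easy to experiment with the two error probabilities; the default numbers are the ones of this problem, and the two extra calls suggest an answer to the question just asked.

# Posterior probability of disease given a positive test, as a function of the error rates.
def posterior_given_positive(prior=0.01, false_neg=0.04, false_pos=0.02):
    p_pos_given_d = 1 - false_neg                      # P(T+ | D)
    numerator = p_pos_given_d * prior                  # P(T+ | D) P(D)
    denominator = numerator + false_pos * (1 - prior)  # P(T+)
    return numerator / denominator

print(posterior_given_positive())                   # about 0.327
print(posterior_given_positive(false_pos=0.001))    # shrinking the false positive rate helps a lot
print(posterior_given_positive(false_neg=0.001))    # shrinking the false negative rate barely matters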
?PROBLEM 11.4 (birthday coincidences–Problem 8.14 revisited). Recall the birthday coin-
cidence problem 8.14 among n people on a planet whose year contains exactly d days. Again
assume that P is uniform. We wish to compute the probability P(B) of the event B that at least
two people have the same birthday by first computing P(Bc ) using the multiplication formula;
see Item 4. of Section 11.1.
Answer. Let us apply the following algorithm for checking birthday coincidence. Pick people one at a time, in any order you like. Let 1, 2, . . . , n be the names of the people you pick. If person 2 has the same birthday as person 1 (call this event R2) then stop. Else, if person 3 has the same birthday as person 1 or person 2 (call this event R3) then stop. And so on. Then
B = R2 ∪ R3 ∪ · · · ∪ Rn.
P(B^c) = P(R_2^c) P(R_3^c|R_2^c) P(R_4^c|R_3^c ∩ R_2^c) · · · P(R_n^c|R_{n−1}^c ∩ · · · ∩ R_2^c).
Recall that these events are subsets of Ω = {1, . . . , d}^n. We can interpret an ω = (ω1, . . . , ωn) as making n selections with replacement from an urn containing d distinct items. Using this interpretation we get, for 2 ≤ k ≤ n,
P(R_k^c|R_{k−1}^c ∩ · · · ∩ R_2^c) = (d − k + 1)/d;
the reason is that if we select k − 1 items and observe they are distinct then the probability that the next item is distinct from the previous ones is (d − k + 1)/d because there are d − k + 1 remaining values; the denominator is d because sampling is with replacement: we put them back in the urn after we look at them. (Pay attention to the k = 2 term in the last display; it reads P(R_2^c) = (d − 1)/d.) Therefore,
P(B^c) = ∏_{k=2}^n (d − k + 1)/d, and so P(B) = 1 − ∏_{k=2}^n (d − k + 1)/d.
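For a concrete feel, here is a short sketch (not from the notes) that evaluates this product for an ordinary year, d = 365.

# P(B) = probability of at least one birthday coincidence among n people, via the product above.
def prob_no_coincidence(n, d=365):
    p = 1.0
    for k in range(2, n + 1):
        p *= (d - k + 1) / d
    return p

for n in (10, 23, 50):
    print(n, 1 - prob_no_coincidence(n))   # about 0.117, 0.507, 0.970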
PROBLEM 11.5 (a coin whose probability of heads is random!). When we toss a coin n
times the outcome is an n-tuple (x1 , . . . , xn ) where xi = 1 if heads show up at the i-th toss
or xi = 0 if tails show up. Thus, (x1 , . . . , xn ) ∈ {0, 1}n . We define a probability measure P
by defining the probability of the outcome (x1 , . . . , xn ) be equal to ps(x) (1 − p)n−s(x) , where
s(x) = x1 + · · · + xn . In other words,
P{(x1, . . . , xn)} = p^{s(x)}(1 − p)^{n−s(x)} = p^{Σ_{i=1}^n xi}(1 − p)^{Σ_{i=1}^n (1−xi)}. (11.2)
This indeed defines a probability measure, because
Σ_{(x1,...,xn)∈{0,1}^n} p^{s(x)}(1 − p)^{n−s(x)} = Σ_{x1=0}^1 p^{x1}(1 − p)^{1−x1} · · · Σ_{xn=0}^1 p^{xn}(1 − p)^{1−xn} = (p + 1 − p) · · · (p + 1 − p) = 1.
The number p can be interpreted as the probability of heads in one coin toss. But suppose we do not know what p is. We know that there are two possibilities: either p = 3/7 or p = 2/3, and that each of these possibilities has probability 1/2. To model this situation, we introduce, in addition to (x1, . . . , xn), the variable p. Our outcome then is (p, x1, . . . , xn) ∈ {3/7, 2/3} × {0, 1}^n. Let A3/7 be the event that p = 3/7, namely, the set of outcomes of the form (3/7, x1, . . . , xn), and let A2/3 be the set of outcomes (2/3, x1, . . . , xn). Let
Rx = {(3/7, x1, . . . , xn), (2/3, x1, . . . , xn)},
representing the event that we see the outcomes x = (x1, . . . , xn) but we do not see the value of p. We are given that
P(Rx|Ap) = p^{s(x)}(1 − p)^{n−s(x)}, p ∈ {3/7, 2/3}.
This is a rewriting of the display (11.2). We have two values of p, so we have two formulas. By Bayes' formula,
P(Ap|Rx) = P(Rx|Ap)P(Ap)/P(Rx) = (1/2) p^{s(x)}(1 − p)^{n−s(x)}/P(Rx), p ∈ {3/7, 2/3}.
This allows us to make a guess of what the actual p is, based on the values of x = (x1, . . . , xn). To be concrete, suppose that n = 10 and we observe s(x) = 2 heads and n − s(x) = 8 tails. Then
p²(1 − p)⁸ = (3/7)²(4/7)⁸ if p = 3/7, and p²(1 − p)⁸ = (2/3)²(1/3)⁸ if p = 2/3,
so
P(A3/7|Rx) = (3/7)²(4/7)⁸ / [(3/7)²(4/7)⁸ + (2/3)²(1/3)⁸] ≈ 0.97,
P(A2/3|Rx) = (2/3)²(1/3)⁸ / [(3/7)²(4/7)⁸ + (2/3)²(1/3)⁸] ≈ 0.03.
Hence we should guess that p = 3/7, as this has much higher probability than the alternative. It's a guess, but it's justifiably a good guess.
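The computation is easy to reproduce; the sketch below (not part of the notes) also lets you vary the number of observed heads.

# Posterior probabilities of the two candidate values of p after observing s heads in n tosses,
# with equal prior weight 1/2 on each candidate.
def posterior(s, n, candidates=(3/7, 2/3), prior=0.5):
    likelihoods = {p: p**s * (1 - p)**(n - s) for p in candidates}
    total = sum(prior * L for L in likelihoods.values())
    return {p: prior * L / total for p, L in likelihoods.items()}

print(posterior(2, 10))    # about {3/7: 0.97, 2/3: 0.03}
print(posterior(7, 10))    # with 7 heads the guess flips towards 2/3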
PROBLEM 11.6 (strategy for getting the gift: every bit of information counts). There are
3 identical sealed boxes, numbered 1, 2, 3. Two of them are empty but the nonempty box
contains £1000. You don’t know where the money is. You are given the option to open a box
at random. If the money is in the box you open then you get it. Clearly, the probability you
find the money is 1/3.
But we make another deal. I ask you to point to a box; say you point to box number X. Then I open a box, say box number Y, such that Y ≠ X and such that Y does not contain the money. I then ask you a question: do you want to stick with your original choice or switch? What should you do?
Answer. Without loss of generality, let us say that box number 3 contains the money. You
don’t know that. So if {X = i} represents the event that you pick box i, we have P(X = i) = 1/3,
i = 1, 2, 3. If X = 1 then, knowing that I should open a different box Y that does not contain
the money, I must open box 2, that is,
P(Y = 2|X = 1) = 1.
Similarly,
P(Y = 1|X = 2) = 1.
If you pick X = 3, then I pick either Y = 1 or Y = 2 at random. So
P(Y = 1|X = 3) = P(Y = 2|X = 3) = 1/2.
Let us find what your probability of winning is if you do switch. Since the money is in box 3,
“win without switching” = {X = 3}.
But I have already opened Y. So we must compute P(X = 3|Y = 2) and P(X = 3|Y = 1) (remember that Y ≠ 3). We have, by Bayes' rule,
P(X = 3|Y = 1) = P(Y = 1|X = 3)P(X = 3) / [P(Y = 1|X = 3)P(X = 3) + P(Y = 1|X = 2)P(X = 2)] = (1/2 · 1/3) / (1/2 · 1/3 + 1 · 1/3) = 1/3,
P(X = 3|Y = 2) = P(Y = 2|X = 3)P(X = 3) / [P(Y = 2|X = 3)P(X = 3) + P(Y = 2|X = 1)P(X = 1)] = (1/2 · 1/3) / (1/2 · 1/3 + 1 · 1/3) = 1/3.
Thus, regardless of the value y in the event {Y = y}, we have found that
P(win without switching|Y = y) = 1/3.
This means that if you switch you double the probability of winning.
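The conclusion can also be checked by simulation. Here is a small sketch (illustrative only, not from the notes) of the game.

# Monte Carlo check: sticking wins about 1/3 of the time, switching about 2/3.
import random

def play(switch):
    prize = random.randrange(3)                                   # box with the money
    choice = random.randrange(3)                                  # your pointed box X
    opened = random.choice([b for b in range(3) if b != choice and b != prize])   # my box Y
    if switch:
        choice = next(b for b in range(3) if b != choice and b != opened)
    return choice == prize

trials = 100_000
print(sum(play(False) for _ in range(trials)) / trials)   # ~ 1/3
print(sum(play(True) for _ in range(trials)) / trials)    # ~ 2/3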
PROBLEM 11.7 (prisoner's dilemma). Three prisoners are on death row in a US prison.
It is midnight. Two of them are going to be executed at dawn. There is a guard keeping an
eye on them. The following dialogue takes place between a specific prisoner and the guard:
PROBLEM 11.8 (where is the coin?). A coin is in one of n boxes. The probability that it is
in box i is pi . If you search box i you may not find the coin with probability εi . Find the
probability that the coin is in box j, given that you have searched box i and not found the coin.
Answer. Let Bi be the event that the coin is in box i. We are given that P(Bi) = pi. Let E be the event that we make an error. We are given that P(E|Bi) = εi. Let Ai be the event that we searched box i and did not find the coin. Then
Ai = (Bi ∩ E) ∪ Bci .
We want to compute the conditional probability of the event B j given the event Ai . We have:
P(B j ∩ Ai )
P(B j |Ai ) = .
P(Ai )
We have
P(Ai ) = P(Bi ∩ E) + P(Bci ) = P(E|Bi )P(Bi ) + (1 − pi ) = εi pi + 1 − pi
We now distinguish two cases. If j = i,
Bi ∩ Ai = Bi ∩ E, so P(Bi ∩ Ai) = P(E|Bi)P(Bi) = εi pi.
If j ≠ i,
B j ∩ Ai = B j ∩ Bci = B j , P(B j ∩ Ai ) = p j .
Hence
P(Bj|Ai) = εi pi / (1 − (1 − εi)pi) if j = i,   and   P(Bj|Ai) = pj / (1 − (1 − εi)pi) if j ≠ i.
PROBLEM 11.9 (application in business). There are two big shops, Aloobubba and Boorbari, selling expensive colored plastic balls of diameter 3 inches and weight 125 lbs each, that people rush to buy, placing orders on the Internet through a company that works with both shops. Aloobubba has m storage locations, numbered 1 through m. Boorbari has n storage locations, numbered m + 1 through m + n. Storage location i = 1, . . . , m + n contains ri red and gi green balls. The shops just started their business and you're the first customer worldwide to place an order. You've been up all night in order to do that, both for telling your friends you've never met on Fuzzybook and because you'd get a whopping 2.95% discount if you didn't specify the color of the ball or the shop. The ball is delivered to your house in a nice package and, when you open the package, you find out that the ball is red. What is the probability that the ball came from Boorbari?
Answer. Consider the event A, representing the event that the ball came from Aloobubba, and B for Boorbari. Let Si be the event that the ball was selected from storage location i. Let R be the event that the received ball is red and G that it is green. In the absence of any information on which shop the Internet company uses more frequently, we assume that
P(A) = P(B) = 1/2.
Similarly, given that there is no preference of a storage location over another, we must let
P(Si|A) = 1/m if 1 ≤ i ≤ m, and 0 if i > m;   P(Si|B) = 1/n if m + 1 ≤ i ≤ m + n, and 0 if i ≤ m.
By Bayes' formula,
P(A|R) = P(R|A)P(A)/P(R) = P(R|A)P(A) / [P(R|A)P(A) + P(R|B)P(B)]. (11.3)
Next,
P(R|A) = Σ_{i=1}^{m+n} P(R ∩ Si|A) = Σ_{i=1}^{m} P(R ∩ Si|A),
where the second equality is due to P(R ∩ Si|A) ≤ P(Si|A) = 0 if i > m. Using the definition of conditional probability once more (or the chain rule, Property 4. above), we have
P(R ∩ Si|A) = P(R|Si ∩ A)P(Si|A) = (ri/(ri + gi)) · (1/m), 1 ≤ i ≤ m.
Hence
P(R|A) = (1/m) Σ_{i=1}^{m} ri/(ri + gi).
Similarly,
P(R|B) = (1/n) Σ_{i=m+1}^{m+n} ri/(ri + gi).
To make the problem even more applied, suppose that Aloobubba has m = 2 storage locations and that Boorbari has n = 5, and r1 = 100, g1 = 500, r2 = 300, g2 = 1600, r3 = r4 = r5 = 1000, g3 = g4 = g5 = 10,000, r6 = r7 = 2000, g6 = g7 = 17,000. I put the numbers in the formula and found P(A|R) ≈ 0.63. Note that the ri and gi entering the formula for P(A|R) can be arbitrary positive integers. You can cook up these numbers in a way that P(A|R) is approximately any number between 0 and 1 and so arrive at “surprises” much in the same way that Problem 11.3 was surprising.
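Here is a sketch (not from the notes) that plugs the numbers of the example into the formula for P(A|R); changing the ri and gi is an easy way to manufacture the “surprises” just mentioned.

# P(A|R) for the numerical example above.
r = [100, 300, 1000, 1000, 1000, 2000, 2000]
g = [500, 1600, 10_000, 10_000, 10_000, 17_000, 17_000]
m, n = 2, 5

p_R_given_A = sum(r[i] / (r[i] + g[i]) for i in range(m)) / m
p_R_given_B = sum(r[i] / (r[i] + g[i]) for i in range(m, m + n)) / n
p_A_given_R = (p_R_given_A * 0.5) / (p_R_given_A * 0.5 + p_R_given_B * 0.5)
print(p_A_given_R)   # about 0.63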
Since the phrase “B does not influence A” is symmetric in A and B, we use a phrase in English that is syntactically symmetric as well: we say that the events A and B are independent. So it makes sense to define A, B to be independent if
P(A ∩ B) = P(A)P(B).
(You should check that P is a probability measure on the events of S1 × S2.) We then have that the events
Ai = ∪_{j∈S2} {(i, j)},   Bj = ∪_{i∈S1} {(i, j)}
are independent (under the P we defined). Indeed, Ai ∩ Bj contains only one element, the element (i, j), hence
P(Ai ∩ Bj) = P1{i}P2{j}.
On the other hand, by rule (ii),
P(Ai) = Σ_{j∈S2} P{(i, j)} = Σ_{j∈S2} P1{i}P2{j} = P1{i}.
Hence
P(Ai) = P1{i}.
Similarly,
P(Bj) = P2{j}.
Hence
P(Ai ∩ Bj) = P(Ai)P(Bj),
so the two events are independent.
Independence was rather obvious in the above problem because P was defined by a product
rule. But independence, whenever discovered, gives us powerful tools. Here is a problem
where independence is not obvious.
?PROBLEM 11.12 (product probability measure on the product of n sets). Generalize the previous problem to define a probability measure P on the set S1 × S2 × S3 × · · · × Sn (product of finitely many discrete sets), once you have defined a probability measure Pi on Si for all i.
Answer. We assume that Pk is defined on the discrete set Sk by assigning value Pk{x} to each x ∈ Sk. An element of S1 × S2 × S3 × · · · × Sn is an ordered n-tuple (x1, . . . , xn), where x1 ∈ S1, . . . , xn ∈ Sn. To this element we assign the number
P{(x1, . . . , xn)} := P1{x1} · · · Pn{xn}.
To check that this defines a probability measure we must check that the sum of these numbers over all n-tuples equals 1. But the sum on the left is actually n sums of a function with separable variables:
Σ_{(x1,...,xn)∈S1×···×Sn} P{(x1, . . . , xn)} = (Σ_{x1∈S1} P1{x1}) · · · (Σ_{xn∈Sn} Pn{xn}) = 1.
are independent.
Here is why: The set of outcomes is Ω = {HHH, HHT, HTH, THH, HTT, THT, TTH, TTT}.
Each outcome (element) of Ω gets probability 1/8. Event A is the set {HTH, THH, HTT, THT}
and B = {HHT, HTH, THT, TTH}.
But are we saying too much? Could it be that the four equalities above are not all necessary?
Unfortunately, they are. Here is an example.
PROBLEM 11.14 (pairwise independence but not independence). Put the uniform proba-
bility measure on the set Ω = {1, 2, 3, 4}. Consider the events A = {1, 4}, B = {2, 4}, C = {3, 4}.
Show that every two of these events are independent but that the three events A, B, C are not independent.
Answer. We have
P(A) = P(B) = P(C) = 2/4 = 1/2.
Let us compute probabilities of intersections. Notice that A ∩ B = {4}, so P(A ∩ B) = 1/4. By
symmetry,
P(A ∩ B) = P(B ∩ C) = P(A ∩ C) = 1/4.
And, indeed, 1/4 = (1/2) · (1/2), so P(A ∩ B) = P(A)P(B), and similarly for the other two pairs.
So A, B are independent; and B, C are independent; and A, C are also independent. But
A ∩ B ∩ C = {4}, so
P(A ∩ B ∩ C) = 1/4 , P(A)P(B)P(C).
Hence the events A, B, C are not independent.
We notice next that
Let us, for example, show that A, B, Cc are independent. We immediately have–due to (11.4)–that every two of them are independent, so we just need to check that P(A ∩ B ∩ Cc) = P(A)P(B)P(Cc). Indeed:
Let us also check that Ac, Bc, Cc are independent. We have (again by (AXIOM TWO))
But, as we showed just above, the last three terms of the left-hand side become products. To save some space, let a, b, c stand for P(A), P(B), P(C), respectively. We then have
PROBLEM 11.15 (pairwise independence but not independence, again). Take a canonical
tetrahedron (a perfect die with 4 faces) and write the letter a on one face, the letter b on another,
the letter c on another and write abc on the fourth face. Toss the tetrahedron and see which
face it lands on. The events “this face has the letter a on it”, “this face has the letter b on it”, and “this face has the letter c on it” are pairwise independent but not independent.
Answer. This is essentially the same as Problem 11.14. Indeed if we interpret 1 in that problem
as “face 1” of the tetrahedron, and so on, then we have 4 equally likely faces. And if we
interpret event A in Example 11.14 as event “this face has the letter a on it” of the current
problem, etc., then we can clearly see that we reduce to the previous case.
Figure 11.2: In tossing a 4-sided die, labeled as shown, the events “this face has the letter a on it”,
“this face has the letter b on it”, and “this face has the letter c on it” are pairwise independent but
not independent.
The definition applies both to finite and infinite collections of events. So if you have n events then you need to satisfy 2^n − n − 1 equalities to prove they are independent. Indeed, a collection of size n has 2^n subsets. You do not need to consider the empty subset, and you do not need to consider subsets containing one event. So reduce 2^n by n + 1 to get the number of equalities that need to be checked.
PROBLEM 11.16 (uniform distribution on many coin tosses begets independence, again).
Toss a fair coin at random n times. Let Hi be the event that heads show up at the i-th toss.
Explain why H1 , . . . , Hn are independent events.
Answer. First of all, our sample space here is Ω = {0, 1}n . The poetic expression “toss a fair
coin at random n times” means that we put the uniform probability measure P on Ω. Since
|Ω| = 2n this means that each outcome receives probability 1/2n and that P(A) = |A|/2n for any
A ⊂ Ω. (Hence P has been defined on E = P(Ω).) We now have
Hi = {ω ∈ Ω : ωi = 1}.
Note that
|Hi| = 2^{n−1},
because, having specified that a 1 (=heads) is in the i-th position, we have 2 choices (either 0 or 1) for each of the remaining n − 1 positions. Similarly, if i ≠ j,
|Hi ∩ Hj| = 2^{n−2}.
Therefore,
P(Hi ∩ Hj) = 2^{n−2}/2^n = 1/4 = P(Hi)P(Hj).
In general,
|∩_{i∈I} Hi| = 2^{n−|I|}
for any I ⊂ {1, 2, . . . , n}. Therefore,
P(∩_{i∈I} Hi) = 2^{n−|I|}/2^n = 1/2^{|I|} = ∏_{i∈I} P(Hi).
Now consider, instead of the uniform measure, the probability measure on Ω = {0, 1}^n defined by
Pβ{(ω1, . . . , ωn)} = e^{−β(ω1+···+ωn)}/Zn, (11.5)
where 1/β is interpreted as temperature. The constant Zn is chosen so that the sum of all
probabilities on single outcomes equals one. So
Question: are the events H1, . . . , Hn still independent under Pβ? We first find Zn:
Zn = Σ_{ω∈{0,1}^n} e^{−β(ω1+···+ωn)} = Σ_{k=0}^n \binom{n}{k} e^{−βk} = (1 + e^{−β})^n. (11.6)
Hence Zn = Z1^n. Choose any collection of events from H1, . . . , Hn. Suppose, without loss of generality, we pick the first i of them. We wish to examine whether P(H1 ∩ · · · ∩ Hi) = P(H1) · · · P(Hi). We have
Hence
P(H1 ∩ · · · ∩ Hi) = Σ_{ωi+1,...,ωn} e^{−β(i+ωi+1+···+ωn)}/Z1^n = (e^{−βi}/Z1^i) Σ_{ωi+1,...,ωn} e^{−β(ωi+1+···+ωn)}/Z1^{n−i}.
But the last sum equals one because of (11.6) applied with n replaced by n − i. Letting now i = 1 in the last display gives P(H1) = e^{−β}/Z1. Similarly for P(Hj) for all j. Hence, indeed, P(H1 ∩ · · · ∩ Hi) = P(H1) · · · P(Hi). So the events are independent.
We generalize this to n discrete random variables X1, . . . , Xn and declare that they are independent under a probability measure P if the events {X1 = x1}, . . . , {Xn = xn} are independent under P for all choices of the values x1, . . . , xn.
?PROBLEM 11.18 (criterion for independence). Show that X, Y are independent if and only
if
P(X ∈ A, Y ∈ B) = P(X ∈ A)P(Y ∈ B)
whenever A, B are sets in the codomains of X, Y, respectively.
Answer. If the last statement holds then take A = {x}, B = {y} to arrive at the definition of
independence. Conversely, if the definition of independence holds and if A, B are sets in the
codomains of X, Y, respectively, then
P(X ∈ A, Y ∈ B) = Σ_{x∈A} Σ_{y∈B} P(X = x, Y = y) = Σ_{x∈A} Σ_{y∈B} P(X = x)P(Y = y)
= (Σ_{x∈A} P(X = x)) (Σ_{y∈B} P(Y = y)) = P(X ∈ A)P(Y ∈ B).
?PROBLEM 11.19 (independence of many implies independence of fewer). Show that if
X, Y, Z are independent then X, Y are independent.
Answer. We have that P(X = x, Y = y, Z = z) = P(X = x)P(Y = y)P(Z = z) for all x, y, z. Sum
both sides over x. Then the left-hand side gives P(X = x, Y = y) and the right-hand side gives
P(X = x)P(Y = y).
?PROBLEM 11.20 (independence of events and their indicator random variables). Explain the following equivalence:
the events A1, . . . , An are independent ⇐⇒ the random variables 1A1, . . . , 1An are independent.
Answer. If we assume the latter then we get the former easily because Ai = {1Ai = 1}. Now assume
the former and show the latter. Let’s do it for n = 2. We assume that A1 , A2 are independent.
This implies (why?) that A1 , Ac2 are independent and Ac1 , A2 are independent and Ac1 , Ac2 are
independent. We consider P(1A1 = x1 , 1A2 = x2 ) for x1 , x2 ∈ {0, 1}. We have four cases. If
x1 = 0, x2 = 0 we have P(1A1 = 0, 1A2 = 0) = P(Ac1 ∩ Ac2 ) = P(Ac1 )P(Ac2 ) = P(1A1 = 0)P(1A2 = 0).
Considering the other three cases, we conclude that P(1A1 = x1 , 1A2 = x2 ) = P(1A1 = x1 )P(1A2 =
x2 ) for all x1 , x2 . And hence 1A1 , 1A2 are independent. I leave it for you to generalize this for
arbitrary n.
?PROBLEM 11.21 (toss k dice n times, as in Problem 10.2, again). Toss k dice n times, see
Problem 10.2. That is, put the uniform probability measure on Ω = {1, . . . , k}n . Let Xi be the
random variable that takes value 1 if the i-th coordinate of the outcome is heads or 0 if it is
tails. Show that X1 , . . . , Xn are independent.
Answer. We have, for all x1, . . . , xn ranging in {0, 1},
P(X1 = x1, . . . , Xn = xn) = 1/2^n,
because P is the uniform probability measure and because Ω has 2^n elements. But the event {Xi = xi} has 2^{n−1} elements and so P(Xi = xi) = 2^{n−1}/2^n = 1/2. So P(X1 = x1) · · · P(Xn = xn) = (1/2)^n = 1/2^n.
?PROBLEM 11.22 (independence of disjoint sets of r.v.s). Let X1, . . . , Xn be independent. Let I1, . . . , Im be pairwise disjoint subsets of {1, . . . , n}. Define random variables Yi = gi(Xk, k ∈ Ii), i = 1, . . . , m, where the gi are given functions. Explain why Y1, . . . , Ym are also independent random variables.
Note that 1_{X=x} 1_{Y=y} = 1_{X=x, Y=y}. Take expectations of both sides. The right side will involve a double sum of the quantities xyP(X = x, Y = y) = xP(X = x) · yP(Y = y). Hence
E(XY) = (Σ_{x∈X(Ω)} xP(X = x)) (Σ_{y∈Y(Ω)} yP(Y = y)) = (EX)(EY).
Answer. Let S = Σ_{i=1}^n Xi. We have var(S) = E[(S − ES)²]. But ES = Σ_{i=1}^n EXi = Σ_{i=1}^n μi, where μi = EXi. So
S − ES = Σ_{i=1}^n (Xi − μi).
Expanding the square of this sum and taking expectations, the cross terms vanish because
E[(Xi − μi)(Xj − μj)] = 0, i ≠ j,
so var(S) = Σ_{i=1}^n E[(Xi − μi)²] = Σ_{i=1}^n var(Xi).
Recall that, for a fixed event B with P(B) > 0, the function
PB(A) = P(A|B)
is a new probability measure on Ω. We then define the conditional expectation of the random variable X given, or conditional on, the event B by
E(X|B) = E_{PB}(X) = Σ_x x PB(X = x).
Equivalently,
E(X|B) = Σ_x x P(X = x, B)/P(B).
Here, the comma between {X = x} and B means intersection of the two events. We use the comma just as we would use it as a symbol of conjunction (and) in English.
If X, Y : Ω → R are two discrete random variables then we can take B = {Y = y} and define
E(X|Y = y) as above, namely,
X
E(X|Y = y) = xP(X = x|Y = y).
x
Write m(y) := E(X|Y = y) and define the random variable E(X|Y) := m(Y). A very useful property is that
E(E(X|Y)) = E(X).
Answer. We have
E(E(X|Y)) = E(m(Y)) = Σ_y m(y)P(Y = y) = Σ_y (Σ_x xP(X = x|Y = y)) P(Y = y)
= Σ_x x Σ_y P(X = x|Y = y)P(Y = y) = Σ_x x P(X = x) = E(X).
Symmetry. Often, symmetry of sorts implies some kind of independence. (After all,
whenever symmetry is discovered, it must be used because it often leads to solvability. For
example, to solve the set of two equations 2x + y = 1, 2y + x = 1, we merely observe that the set
remains the same if we interchange x and y; so x = y and so 3x = 1, i.e., x = 1/3.) For example,
the uniform probability measure (which has an obvious symmetry) on the product of two
finite sets implies independence of the coordinates. This was the case in Problem 11.16. Please
revise the problem and think of the symmetry. Symmetry was also present in Problem 11.17.
Indeed, the probability measure in this problem, (11.5), depends on the sum ω1 + · · · + ωn , and
the sum is a symmetric function: if we permute the ωi ’s the value of the sum does not change.
Definitions through independence. We shall later see that independence (especially between random variables) can be used to define new random objects. For example, if H1, . . . , Hn are the (now known to be independent) events associated with tossing a fair coin at random n times, we can define random variables X0, X1, . . . , Xn by letting X0 = 0 and, for i ≥ 1, Xi = Xi−1 + 1 on Hi and Xi = Xi−1 − 1 on Hic.
Do not confuse independence and disjointness. A most silly mistake beginners (and not only beginners) make is to say “aha, events A and B are disjoint so they're independent”. But think again: if A and B are disjoint then occurrence of A implies non-occurrence of B and vice versa. So they're extremely dependent! Mathematically: suppose that events A, B are disjoint and have strictly positive probabilities; then A ∩ B = ∅ so P(A ∩ B) = 0. But P(A)P(B) > 0, so P(A ∩ B) ≠ P(A)P(B).
Part IV
Chapter 12

Remembrances and foresights
12.1 Remembrances
A probability measure is a function whose domain is a collection of events and whose range
is the set of real numbers [0, 1]. The domain of a probability measure must satisfy certain
logical properties: (a) the empty set is an event; (b) if A is an event then Ac is an event; (c)
if A1 , A2 , . . . are events then so is their union. The term that mathematicians use for these
properties is: the collection of events forms a σ-field. The probability measure itself must
satisfy (AXIOM ONE) and (AXIOM TWO).
People use sloppy language. For example, when people say “what is the probability that when tossing a pair of dice I will get a sum of at most 4?” they mean “compute the probability of this event under the uniform probability measure”.
The class of events is a collection of subsets of an ambient space that is often referred to
as the sample space. In answering a probability question, there is often a lot of freedom in
choosing the sample space. One must choose it judiciously so that the computation becomes as simple as possible.
A random variable is a function from the sample space into the set of real numbers.
Technically, it must be a measurable function as explained in Section 9.6.
A discrete set is a countable set. A discrete random variable is one whose set of values is
discrete.
Random variables have a purpose in life: to transform a probability measure, say P, on
their domain to a probability measure, say Q, on their range. This is indicated as follows:
P→ X →Q
This Q is called distribution of X or law of X. In practical applications (and not only) a deeper
purpose is to do this: Given a desired Q, design a random variable whose law is Q.
12.2 Foresights
The issue next is to define random variables (and their distributions) on spaces that are
uncountable.
Definition 12.1 (i.i.d.). We say that a collection of random variables is i.i.d. = independent and identically distributed if any finitely many random variables from the collection are independent and if each random variable has the same distribution.
Suppose that S is a finite set and let Q be a probability measure on it. Then it is obvious that
we can
let X1 , . . . , Xn be i.i.d. with common law Q. (12.1)
?PROBLEM 12.1 (“let there be finitely many i.i.d. r.v.s” can always be said). Explain why
statement (12.1) is not vacuous.
Answer. To each element (x1 , . . . , xn ) of Sn assign probability according to the product rule,
as explained in the Section 8.1, “Probability on finite sets”. That is, assign probability
Q{x1 } · · · Q{xn }. Then define Xi (x1 , . . . , xn ) = xi , for all i = 1, . . . , n, and check that each Xi has
law Q and that the X1 , . . . , Xn are independent.
But the issue is: can we also
let X1, X2, . . . be an infinite sequence of i.i.d. r.v.s with common law Q? (12.2)
The answer is yes:
Theorem 12.1. If Q is a probability measure on a finite set S of size at least 2 then there exists an infinite sequence (X1, X2, . . .) of i.i.d. random variables with law Q each, all defined on some common probability space.
For most statisticians this is obvious. After all, they claim, we can always toss a coin independently and an unlimited number of times. But it is a fact that the theorem above is equivalent to the following theorem.
Theorem 12.2 (the area function exists). Let R² represent the Euclidean plane. Then there is a large class B² of subsets of R² that contains all rectangles and a unique function area : B² → [0, ∞] such that:
(1) area(∅) = 0;
(2) area(∪_{n=1}^∞ Bn) = Σ_{n=1}^∞ area(Bn), whenever B1, B2, . . . ∈ B² are pairwise disjoint;
(3) area(R) = ab if R is a rectangle with side lengths a, b.
Of course, because we cheat when we teach Calculus, we make the student believe they know
what this function is: the usual “area”. Indeed, it is. But we never proved that this function
can be defined on “very complicated sets”.
Theorem 12.3 (equivalence of Theorems 12.1 and 12.2). The statements of Theorems 12.1
and 12.2 are equivalent.
I do not expect you to be able to prove these theorems. But I expect, and hope, that you understand them and know what they say.
Armed with the above results (that we will never explain in these notes), we have the right
to assert the veracity of statement (12.2).
We will proceed by defining random variables with values in big sets, such as R, R² and R^d. We will also consider sequences of random variables.
Notations. When we write {0, 1} (with curly brackets) we mean a set with two elements. Do
not confuse this with [0, 1] which denotes the subset of all real numbers x with 0 ≤ x ≤ 1. The
set {0, 1}² contains 4 elements: (0, 0), (0, 1), (1, 0), (1, 1). Similarly, the set {0, 1}^n contains 2^n
elements.
A random variable ξ taking values in the set {0, 1} is called a Bernoulli random variable; its law is denoted Ber(p), where
p = P(ξ = 1).
Hence P(ξ = 0) = 1 − p. We can think of ξ as encoding the outcome of a coin toss, 1 for, say, heads, and 0 for tails. The name Bernoulli is that of the mathematician Jacob Bernoulli (1655–1705). As a probability measure, Ber(p) is unique. It can be written as
Ber(p) = p δ1 + (1 − p) δ0,
where δa is the Dirac probability measure; see (8.5). The expression “ξ is a Ber(p) random variable” means that the law of ξ is Ber(p). We use the definite article (the) when we refer to the law Ber(p); but we use the indefinite article (a) when we refer to a Ber(p) random variable, because the latter is by no means unique. Observe that any indicator random variable is a Bernoulli random variable. Indeed, if A is an event, 1A is Ber(P(A)).
Consider now n independent Ber(p) random variables ξ1, . . . , ξn and form the random vector
ξ = (ξ1, . . . , ξn)
that takes values in the set {0, 1}^n. The distribution of ξ is completely determined by the probabilities
P(ξ = (x1, . . . , xn)), xi ∈ {0, 1}, i = 1, . . . , n.
But, by independence, P(ξ = (x1, . . . , xn)) = P(ξ1 = x1) · · · P(ξn = xn), and
P(ξi = x) = p^x(1 − p)^{1−x}, x = 0, 1.
Multiply these numbers, bring the p-terms together and the (1 − p)-terms also together, to get P(ξ = (x1, . . . , xn)) = p^{s(x)}(1 − p)^{n−s(x)}, where s(x) = x1 + · · · + xn. Let Sn = ξ1 + · · · + ξn, so that
Sn ∈ {0, 1, 2, . . . , n}.
We define
bin(n, p) = law of Sn.
To compute this law we need to compute P(Sn = k), k = 0, 1, . . . , n.
?PROBLEM 13.2 (formula for the binomial distribution). Show that
!
n k
P(Sn = k) = p (1 − p)n−k . (13.3)
k
Answer. We have
{Sn = k} = {there is a set I ⊂ {1, . . . , n} with |I| = k such that ξi = 1 for all i ∈ I and ξi = 0 for all i ∉ I}
= ∪_{I⊂{1,...,n}, |I|=k} {ξi = 1 for all i ∈ I and ξi = 0 for all i ∉ I} ≡ ∪_{I⊂{1,...,n}, |I|=k} AI.
The events AI are pairwise disjoint, so P(Sn = k) = Σ_{I: |I|=k} P(AI). Since P(AI) = p^k(1 − p)^{n−k} is the same for all I with |I| = k, the sum equals this number times the number of summands; the latter is \binom{n}{k} because this is (by definition) the number of I with |I| = k.
We next compute the expectation and variance of a bin(n, p) random variable. We have
E(Sn ) = np.
But the random variables ξ1 , . . . , ξn are independent and hence uncorrelated. Therefore the
variance of Sn is the sum of the variances of the ξi . Each ξi has the same variance, p(1 − p),
therefore
var Sn = np(1 − p).
It is remarkable that both the expectation and the variance of the sum are linear functions of n.
Suppose now that p = pn depends on n in such a way that npn → λ as n → ∞, for some constant λ > 0. Then
limn→∞ P(Sn = k) = (λ^k/k!) e^{−λ}, k = 0, 1, 2, . . . (13.5)
Note that Sn depends on n both explicitly, through the subscript n, and implicitly, through the numbers pn. To understand this, assume for simplicity that
p = λ/n.
Then
limn→∞ P(Sn = k) = limn→∞ \binom{n}{k}(λ/n)^k (1 − λ/n)^{n−k} = (λ^k/k!) limn→∞ (1 − λ/n)^{n−k} = (λ^k/k!) e^{−λ},
because, as we know from Calculus, limn→∞ (1 − λ/n)^n = e^{−λ}. It is rather remarkable that the numbers
Πλ(k) = e^{−λ} λ^k/k!, k = 0, 1, . . .
sum up to 1, namely,
Σ_{k=0}^∞ Πλ(k) = 1,
and this is because
e^λ = Σ_{k=0}^∞ λ^k/k!
is the Taylor expansion of the exponential function (valid for all λ). Therefore, Πλ defines a probability measure on {0, 1, 2, . . .}; it is called the Poisson distribution with parameter λ and is denoted Poi(λ).
A good guess is to choose the p̂ that maximizes this probability. We thus must choose p̂ such that
p̂^k(1 − p̂)^{n−k} ≥ p^k(1 − p)^{n−k}, for all 0 ≤ p ≤ 1.
Let L(p) = p^k(1 − p)^{n−k}. We can easily show that L is maximized at p̂ = k/n. This is then a reasonable estimate of the true p. We can go further and ask: what is the probability that the estimated p̂ is far from the true p? We defer this for later.
PROBLEM 13.5 (distribution of infrequent errors). A book has 400 pages and about 1000
characters per page. We found 200 characters typed wrong in the whole book. What is the
probability that a given page contains at least 2 erroneous characters? Make appropriate
assumptions in order to answer this and do so in two different ways.
Answer. We assume that each character is erroneously typed with probability p. Since we found 200 wrong characters among 400,000 it is reasonable to assume that
p = 200/400,000 = 1/2000.
Assume that characters are erroneously typed independently. Then the total number of erroneous characters on a given page is a bin(1000, 1/2000) random variable. This is approximately a Poi(1/2) random variable. The probability that there are 0 or 1 wrong characters on the given page is e^{−1/2} + (1/2)e^{−1/2}, hence the probability that we have at least 2 errors is 1 − (3/2)e^{−1/2} ≈ 0.09.
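The two ways of answering (exact binomial and Poisson approximation) agree closely; here is a short sketch (not part of the notes) that compares them.

# Exact binomial versus the Poisson(1/2) approximation for "at least 2 errors on a page".
from math import exp

n, p = 1000, 1 / 2000
binom_at_most_1 = (1 - p)**n + n * p * (1 - p)**(n - 1)
poisson_at_most_1 = exp(-0.5) * (1 + 0.5)
print(1 - binom_at_most_1)     # about 0.0902  (exact binomial)
print(1 - poisson_at_most_1)   # about 0.0902  (Poisson approximation)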
Observe that
if X, Y are independent with laws bin(n, p), bin(m, p) then X + Y has law bin(n + m, p).
And here is why. Consider n + m Bernoulli trials
ξ1 , · · · , ξn , ξn+1 , . . . , ξn+m .
Clearly:
1) The sum of all of them has law bin(n + m, p).
2) The sum of the first n has law bin(n, p). The sum of the last m has law bin(m, p).
3) The first n are independent of the last m.
4) Hence the sum of the first n is independent of the sum of the last m.
5) So we can let X be the sum of the first n and Y the sum of the last m. Then X + Y has law
bin(n + m, p).
Something similar happens with independent Poisson random variables:
if X, Y are independent with laws Poi(λ), Poi(µ) then X + Y has law Poi(λ + µ).
Here is why. For each n let Xn , Yn be two independent random variables with binomial
distributions. Assume that Xn has law bin(n, λ/n) and that Yn has law bin(bnµ/λc, λ/n),
where b·c denotes integer part. We know then that Xn + Yn has law bin(n + bnµ/λc, λ/n). We
have
bin(n, λ/n) → Poi(λ)
bin(bnµ/λc, λ/n) → Poi(µ)
bin(n + bnµ/λc, λ/n) → Poi(λ + µ)
Letting X be a random variable with law the limit of the law of Xn , and similarly for Y, and
since independence is preserved in the limit, we have that X + Y has law Poi(λ + µ).
Hence
Ac = ∪_{m=1}^∞ ∩_{n=m}^∞ {ξn = 0} =: ∪_{m=1}^∞ Tm.
But the event
Tm = ∩_{n=m}^∞ {ξn = 0}
is the event that ξm = ξm+1 = · · · = 0 and to find the probability of this we must multiply (1 − p) by itself infinitely many times. Since p > 0 we have 1 − p < 1. When we multiply a number with absolute value strictly less than 1 infinitely many times by itself we get 0. (All I'm saying is that limn→∞ a^n = 0 if |a| < 1.) So P(Tm) = 0 for all m. But then
P(Ac) ≤ Σ_{m=1}^∞ P(Tm) = 0.
Hence P(A) = 1.
Using independence,
P(ν > n) = (1 − p)^n, n = 0, 1, 2, . . . (13.6)
Since
{ν = n} = {ν > n − 1} \ {ν > n}, {ν > n} ⊂ {ν > n − 1},
we have
P(ν = n) = P(ν > n − 1) − P(ν > n) = (1 − p)^{n−1} − (1 − p)^n = (1 − p)^{n−1} p, n = 1, 2, . . .
We refer to the law of ν as geometric with parameter p and denote it by geo(p). A random variable with this law is called a geo(p) random variable. We can compute the expectation of ν by using the trivial identity
ν = Σ_{n=0}^∞ 1_{ν>n}
(the summand on the right side is 1 if and only if n < ν, that is, if and only if n = 0, 1, 2, . . . , ν − 1–these are ν integers; hence the sum is the sum of ν ones, so it is ν). Taking expectations, we have (see also (9.10))
E(ν) = Σ_{n=0}^∞ P(ν > n) = Σ_{n=0}^∞ (1 − p)^n = 1/p.
In words:
Given that no heads have shown up in the first m trials, the time ν − m until the
next head shows up has the same distribution as ν.
This is now obvious from the fact that the coin tosses are i.i.d. But we can also see the above
statement holds by a simple computation:
In particular, we have that if ν > 1 then the law of ν is the law of 1 + ν. Since P(ν > 1) = 1 − p, we can express this by writing
ν =d 1 with probability p,   ν =d 1 + ν with probability 1 − p,
where =d denotes equality of the distribution of the left-hand side to the distribution of the right-hand side. From this, we have
g(ν) =d g(1) with probability p,   g(ν) =d g(1 + ν) with probability 1 − p,
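The tail formula, the expectation and the memoryless property are easy to confirm by simulation. Here is an illustrative sketch (not part of the notes).

# geo(p): empirical mean close to 1/p, and P(nu - m > n | nu > m) close to (1-p)^n.
import random

def sample_nu(p):
    n = 1
    while random.random() >= p:    # keep tossing until the first head
        n += 1
    return n

p, trials = 0.3, 200_000
samples = [sample_nu(p) for _ in range(trials)]
print(sum(samples) / trials, 1 / p)                          # both about 3.33
tail = [s - 2 for s in samples if s > 2]                     # condition on nu > 2
print(sum(t > 3 for t in tail) / len(tail), (1 - p)**3)      # both about 0.343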
νn := min{n ∈ N : ξn = 1},
PROBLEM 13.7 (upper integer part). Define the upper integer part of the real number x by
dxe := min{n ∈ Z : n ≥ x}. (13.9)
For example, ⌈2.6⌉ = 3. Explain why the statement
“for all N ∈ Z and all x ∈ R we have N ≥ x ⇐⇒ N ≥ ⌈x⌉”
is always true.
Answer. Consider the set U(x) := {n ∈ Z : n ≥ x}. (We have U(x) ≠ ∅ because there is always an integer above any given real number.) Let N be an integer. Suppose first that N ≥ x. Then N ∈ U(x), by the definition of U(x). Hence N ≥ min U(x) = ⌈x⌉. Suppose next that N ≥ ⌈x⌉. But ⌈x⌉ ≥ x always, so N ≥ x. Hence the equivalence is always true.
We then have
νn ≥ nt ⇐⇒ νn ≥ ⌈nt⌉
and so
P((1/n)νn ≥ t) = (1 − λ/n)^{⌈nt⌉}.
Since
limn→∞ ⌈nt⌉/n = t,
we again have that limn→∞ (1 − λ/n)^{⌈nt⌉} = e^{−λt}. To conclude we observe the following:
?PROBLEM 13.8 (a sparse geometric r.v. assumes no specific value in the limit). Explain why
limn→∞ P((1/n)νn = t) = 0.
Answer. We have
P(νn = nt) = (1 − λ/n)^{nt−1} (λ/n) 1_{nt∈Z}.
This actually says that the probability is 0 if nt ∉ Z. We have limn→∞ (λ/n) = 0 while the other two terms in the expression are ≤ 1. Hence the limit is 0.
We thus have that:
Both (13.7) and (13.8) hold true.
The big question now is:
Is there a random variable, say τ, such that
P(τ > t) = P(τ ≥ t) = e−λt , t ≥ 0? (13.10)
Such a random variable must necessarily satisfy
P(τ = t) = 0, for all t,
the reason being obvious:
P(τ = t) = P(τ ≥ t) − P(τ > t) = e−λt − e−λt = 0.
Equivalently, we are really asking for the existence of a probability measure on (certain subsets
of) [0, ∞) that assigns value e−λt to sets of the form [t, ∞) and (t, ∞).
?PROBLEM 13.9 (the union of uncountably many events begets monsters). A student argues as follows.
We showed that the random variable τ is such that the probability of the event {τ = t} is 0 for all t. But
{τ ≥ 0} = ∪_{t≥0} {τ = t}.
We also notice that the events {τ = t}, t ≥ 0, are pairwise disjoint because
if t ≠ s then {τ = t} ∩ {τ = s} = {τ = t = s} = ∅.
Hence, adding up the probabilities of the events {τ = t} over t ≥ 0, the student concludes that
1 = P(τ ≥ 0) = 0,
that is, 1 = 0. Where is the mistake?
Densities per se
For which value of the constant c is the function
f(x) = (c/√|x|) 1_{−1<x<1}
a probability density?
Answer. We have
∫_{−1}^{1} (c/√|x|) dx = 4c,
and since this must be equal to 1 we have c = 1/4.
for some c > 0. Thus the set of points on the plane below the graph of f is a collection of rectangles, where the n-th rectangle has base length 2 · 3^{−n} and height cn, so its area is 2cn3^{−n}. The sum of these areas is
c Σ_{n=1}^∞ 2n 3^{−n} = (3/2)c.
So if we choose c = 2/3 we make ∫_{−∞}^{∞} f(x)dx = 1, as it should be. So f is a density, even though it is unbounded: sup_x f(x) ≥ limn→∞ cn = ∞.
Answer. It is a probability density iff b − a = 1. (7) is a probability density. (8) is a probability density because
∫_{−∞}^{∞} (1/2) sin(x) 1_{0<x<π} dx = ∫_0^π (1/2) sin(x) dx = (1/2)(cos(0) − cos(π)) = 1.
(10) is a mass density but not a probability density because ∫_{−∞}^{∞} 1_{x∈Z} dx = 0.
A mass density like (10) in this problem is a trivial mass density because the total mass is
zero! The reason for this is that
the integral of a mass density that is zero except on a countable set is zero.
This statement is actually true for the Lebesgue integral but not for the Riemann integral. It
is true for both if the set of nonzero values of f is not just countable but it also has no limit
points.
We cannot prove this here because we really need the definition of Lebesgue integral.
However, when the set B is simple enough, e.g. an interval we can, in principle, compute the
integral as a Riemann integral. Of course, the existence of a probability measure Q implies the
existence of some random variable X whose distribution is Q. If Q has a name then X has also
the same name.
Definition 14.4. A subset N of R is said to be of zero length if for any ε > 0 there is a sequence of intervals I1, I2, . . . such that
N ⊂ ∪_{n=1}^∞ In
and
Σ_{n=1}^∞ length(In) < ε.
Expectation and variance of a random variable with density. We next define the expectation of a random variable X with density f by the formula
E(X) = ∫_{−∞}^{∞} x f(x) dx,
whenever this integral makes sense. Just as in the discrete case we define
var(X) = E[(X − EX)²] = E(X²) − (EX)².
For now, accept these definitions as done “by analogy” to the discrete case.
PROBLEM 14.5 (semicircle density). Let f(x) = c√(1 − x²) 1_{−1<x<1}. What should c be so that f is a probability density function? The probability measure defined by this f is called the “semicircle law” for obvious reasons. Let X be a random variable with this law. Find E(X) and var X.
Answer. We need
1 = ∫_{−∞}^{∞} f(x)dx = c ∫_{−1}^{1} √(1 − x²) dx = cπ/2.
I made the change of variables x = cos θ to perform this integral. So
c = 2/π.
We have
E(X) = ∫_{−1}^{1} x f(x)dx = ∫_0^1 x f(x)dx + ∫_{−1}^0 x f(x)dx = ∫_0^1 x f(x)dx − ∫_0^1 x f(x)dx = 0.
I actually didn't have to compute anything. I just observed that f(x) = f(−x). Further,
var X = E(X²) = (2/π) ∫_{−1}^{1} x²√(1 − x²) dx = 1/4.
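The normalizing constant, the mean and the variance can all be checked with a crude Riemann sum; here is a short sketch (not part of the notes).

# Numerical check of the semicircle density: total mass 1, mean 0, variance 1/4.
from math import sqrt, pi

N = 100_000
h = 2 / N
xs = [-1 + (k + 0.5) * h for k in range(N)]     # midpoints of [-1, 1]
c = 2 / pi
total = sum(c * sqrt(1 - x * x) * h for x in xs)
mean = sum(x * c * sqrt(1 - x * x) * h for x in xs)
second = sum(x * x * c * sqrt(1 - x * x) * h for x in xs)
print(total, mean, second)   # approximately 1, 0, 0.25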
Let F(x) := P(X ≤ x) = ∫_{−∞}^x f(y) dy be the distribution function of X. Then
P(a < X ≤ b) = F(b) − F(a).
If f is piecewise continuous then we can easily see from the first fundamental theorem of
Calculus that
f(x) = F′(x) = (d/dx) F(x) for every x at which f is continuous.
Hence F is an antiderivative of f .
PROBLEM 14.6 (triangular density and its distribution function). Consider the triangular density defined by f(x) = 1 − x when 0 ≤ x ≤ 1 and f(x) = 1 + x for −1 ≤ x ≤ 0. Below −1 and above +1 we let f(x) = 0. Show that f is a probability density function and compute F(x).
Answer. The graph of f consists of two triangles, each of area 1/2. Hence ∫_{−∞}^{∞} f(x)dx = ∫_{−1}^{1} f(x)dx = 1. So it is a probability density function. We obviously have F(x) = 0 if x < −1 and F(x) = 1 if x > 1. For x between −1 and 0 we have F(x) = ∫_{−1}^x (1 + y)dy = (1/2)(1 + x)². For x between 0 and 1 we have F(x) = F(0) + ∫_0^x (1 − y)dy = 1/2 + (1/2)(1 − (1 − x)²).
Fix a < b and consider the density
f(x) = (1/(b − a)) 1_{a<x<b}.
The probability measure Q having density f is called uniform probability measure on [a, b]; its
density is called uniform density on [a, b]. Of course, by what we said above, we can change
f on any zero-length set and we still have a density for the same measure. Thus, Q is unique
but its density is not unique. Neither is any random variable whose distribution is Q. For
example, the function
fe(x) = (1/(b − a)) 1_{a≤x≤b}
is also a density for the same Q because ∫_B f(x)dx = ∫_B fe(x)dx for every set B. We denote Q by the symbol unif([a, b]).
PROBLEM 14.7 (distribution function, expectation and variance of unif([a, b])). Let U be a unif([a, b]) random variable. Explain why
P(U ≤ x) = 0 if x ≤ a,   (x − a)/(b − a) if a < x < b,   1 if x ≥ b,
and why
P(U = x) = 0, for all x ∈ R,   E(U) = (a + b)/2,   var(U) = (b − a)²/12.
Answer. It is clear that P(a ≤ U ≤ b) = 1, so P(U ≤ x) = 0 if x ≤ a and P(U ≤ x) = 1 if x ≥ b. For a < x < b we have P(U ≤ x) = ∫_a^x (1/(b − a)) dy = (x − a)/(b − a). For the expectation, we have
E(U) = (1/(b − a)) ∫_a^b x dx = (1/(b − a))(b²/2 − a²/2) = (a + b)/2.
Also,
E(U²) = (1/(b − a)) ∫_a^b x² dx = (1/(b − a))(b³/3 − a³/3).
Since b³ − a³ = (b − a)(b² + ab + a²), we have
E(U²) = (b² + ab + a²)/3.
Therefore,
var(U) = E(U²) − (EU)² = (b² + ab + a²)/3 − (a + b)²/4 = (b − a)²/12.
Some computations are simplified if we observe the following.
PROBLEM 14.8 (unif([a, b]) from unif([0, 1])). We can transform a unif([0, 1]) random variable to obtain a unif([a, b]) random variable by a very simple function. Find this function.
Figure 14.1: We use Thales’ theorem (which has been known for at least 2500 years ) for the two
triangles with parallel sides, one whose side is from a to x and the other whose side is from a to b,
to obtain (14.2).
Answer. Just map the interval [a, b] onto the interval [0, 1] by a straight line. Observe that
(y − 0)/(x − a) = (1 − 0)/(b − a). (14.2)
Hence the function is
y = (x − a)/(b − a).
Thus if X is unif([a, b]) then
Y = (X − a)/(b − a)
is unif([0, 1]). It is actually the converse we are being asked. So we solve for X to get
X = (b − a)Y + a.
Since Y is unif([0, 1]) it is easier to compute things for Y (we got rid of the annoying
constants) first.
?PROBLEM 14.9 (we can't choose uniformly at random from the set of real numbers). Explain why there is no uniform probability distribution on the whole of R.
Answer. Because, if there were one, it would have to have constant density: f(x) = c for all x. Since f should be a probability density function, ∫_{−∞}^{∞} f(x)dx = 1. If c > 0 then the left-hand side is ∞, so we get ∞ = 1. Impossible. If c = 0 then the left-hand side is 0, so 0 = 1. Again impossible.
Let τ be an expon(λ) random variable, that is, P(τ > t) = e^{−λt}, t ≥ 0, with density f(t) = λe^{−λt}, t ≥ 0. To check that this density gives the stated tail probability, note that ∫_t^∞ λe^{−λs} ds becomes, after the change of variables y = λs, ∫_{λt}^∞ e^{−y} dy. Since (d/dy)e^{−y} = −e^{−y} we can replace e^{−y} by −(d/dy)e^{−y} and then use the second fundamental theorem of calculus to get ∫_{λt}^∞ e^{−y} dy = e^{−λt}, as needed. We can compute E(τ) in many ways. But we will do a trick since the integral depends on the parameter λ. Since
∫_0^∞ e^{−λt} dt = 1/λ
for all λ > 0, we can differentiate both sides with respect to λ. We can justify that we can interchange the derivative and the integral to get
∫_0^∞ (d/dλ) e^{−λt} dt = (d/dλ)(1/λ).
But (d/dλ)e^{−λt} = −te^{−λt} and (d/dλ)(1/λ) = −1/λ². So
∫_0^∞ te^{−λt} dt = 1/λ². (14.3)
Multiplying the left-hand side by λ gives E(τ) = ∫_0^∞ t λe^{−λt} dt = 1/λ. Differentiating (14.3) once more gives ∫_0^∞ t²e^{−λt} dt = 2/λ³, hence
E(τ²) = 2/λ².
Hence
var(τ) = E(τ²) − (Eτ)² = 2/λ² − (1/λ)² = 1/λ².
The constant λ is called the rate of τ.
PROBLEM 14.11 (scaling of an expon(λ) random variable). Explain why if σ is expon(1) then
σ/λ is expon(λ).
Answer. We have P(σ > x) = e−x , x ≥ 0. Hence P(σ/λ > t) = P(σ > λt) = e−λt , t ≥ 0.
Recall that ⌈·⌉ denotes the upper integer part; see (13.9).
PROBLEM 14.12 (discretizing an expon(λ) r.v. gives a geometric r.v.). Let τ be an expon(λ) random variable. Explain why ν := ⌈τ⌉ is a geometric random variable.
Answer. ⌈τ⌉ takes values 1, 2, . . . . Let n be a nonnegative integer. Observe that, by Problem 13.7, ⌈τ⌉ > n ⇐⇒ τ > n. Therefore
P(⌈τ⌉ > n) = P(τ > n) = e^{−λn} = (e^{−λ})^n.
Comparing with the expression (13.6) we see that indeed ν = ⌈τ⌉ is a geometric random variable. Writing e^{−λ} = 1 − p we find
p = 1 − e^{−λ}.
?PROBLEM 14.13 (the memoryless property of an expon(λ) random variable). Explain why an expon(λ) random variable has the memoryless property:
P(τ − t > s | τ > t) = P(τ > s) = e^{−λs}, for all s, t ≥ 0.
Answer. The conditional probability on the left equals the joint probability P(τ − t > s, τ > t) divided by P(τ > t). But P(τ − t > s, τ > t) = P(τ > t + s) = e^{−λ(t+s)} and P(τ > t) = e^{−λt}. Dividing the two we get e^{−λs}.
The function
f(x) = (1/√(2π)) e^{−x²/2}, x ∈ R,
is a probability density. The corresponding probability measure is called standard normal and is denoted by N(0, 1). Any random variable with this distribution is called a N(0, 1) random variable.
?PROBLEM 14.14 (expectation and variance of the standard normal law). Show that if X is N(0, 1) then
E(X) = 0, var(X) = 1.
Answer. Since e^{−x²/2} ≤ e^{−x/2} for x > 1, it follows that I := ∫_0^∞ x f(x)dx < ∞. Since ∫_{−∞}^0 x f(x)dx = −I, it follows that E(X) = I − I = 0. We have v := var(X) = E(X²) = ∫_{−∞}^{∞} x² f(x)dx. We will simply write ∫ instead of ∫_{−∞}^{∞}. Since ∫ f(y)dy = 1 we have
v = ∫ x² f(x)dx ∫ f(y)dy = ∬ x² f(x) f(y) dx dy.
By symmetry, 2v = ∬ (x² + y²) f(x) f(y) dx dy, and we pass to polar coordinates x = r cos θ, y = r sin θ. Integrating with respect to θ first gives 2π, which cancels with the constant in front. Changing variables by r²/2 = s we have
2v = ∫_0^∞ r² e^{−r²/2} r dr = ∫_0^∞ 2s e^{−s} ds = 2,
whence v = 1.
The meaning of the 0 and 1 in the symbol N(0, 1) is that a standard normal random variable
has expectation 0 and variance 1.
Figure 14.2: Various normal densities. The blue, red and yellow curves have mean 0 but different
variances. The larger the variance the flatter the curve. The green curve has negative mean. © in
public domain
x F(x) x F(x)
0 0.5 1.0 0.8413
0.1 0.5398 1.1 0.8643
0.2 0.5793 1.2 0.8849
0.3 0.6179 1.3 0.9032
0.4 0.6554 1.4 0.9192
0.5 0.6915 1.5 0.9332
0.6 0.7257 1.6 0.9452
0.7 0.7580 1.7 0.9554
0.8 0.7881 1.8 0.9641
0.9 0.8159 1.9 0.9713
1.0 0.8413 2.0 0.9772
PROBLEM 14.17 (using the table of the normal distribution). What is the probability that
a N(5, 16) random variable takes values in the interval [1, 13]?
Answer. If X is N(5, 16) then Z = (X − 5)/4 is N(0, 1), hence

    P(1 ≤ X ≤ 13) = P( (1 − 5)/4 ≤ (X − 5)/4 ≤ (13 − 5)/4 ) = P(−1 ≤ Z ≤ 2) = F(2) − F(−1) = F(2) + F(1) − 1 ≈ 0.818.
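The same lookup can be done with software instead of the table; here is a small sketch (assuming scipy is installed; the function calls are scipy's, the rest is illustrative):

    from scipy.stats import norm

    # P(1 <= X <= 13) for X ~ N(mean 5, variance 16), i.e. standard deviation 4.
    p = norm.cdf(13, loc=5, scale=4) - norm.cdf(1, loc=5, scale=4)
    # Equivalently, standardize: F(2) + F(1) - 1 with the N(0,1) distribution function F.
    p_std = norm.cdf(2) + norm.cdf(1) - 1
    print(p, p_std)   # both approximately 0.819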
By the inverse function theorem, the derivative of the inverse function H^{−1} exists and

    (d/dy) H^{−1}(y) = 1 / H′(H^{−1}(y)),

where H′ is the derivative function of H, which is strictly positive, by assumption. Hence the derivative of P(Y ≤ y) with respect to y exists for all y and equals (by composite differentiation)

    f_Y(y) = (d/dy) P(Y ≤ y) = f_X(H^{−1}(y)) / H′(H^{−1}(y)).
If that’s too complicated to remember, recall that Calculus gives us good mnemonic rules.
Here we go.
Imagine that fX (x) is a mass distribution on R. Then the function H, being strictly increasing
and smooth, does not change the order of points and is merely smoothly transforming them.
Fix a point x and consider the “little” interval Ix with endpoints x and x + dx. Then x maps
to y = H(x) and x + dx maps to H(x + dx) ≈ H(x) + H0 (x)dx = y + dy. So Ix is approximately
mapped to J y , an interval with endpoints y and y + dy. Since order is preserved, the mass
contained in Ix is transferred to J y without losses and without additions. But the mass in Ix
is approximately fX (x)|dx| and the mass in J y is approximately fY (y)|dy|. We thus have the
preservation of mass formula
fY (y)|dy| = fX (x)|dx|,
and I boldly replaced approximate equality by equality. This "means" that

    f_Y(y) = f_X(x) / |dy/dx|.

If we now replace x by H^{−1}(y), we see that we again arrive at the earlier formula. This paragraph contains nonsense, but it is nonsense that we have explained as being correct and, therefore, we can treat it as meaningful nonsense. We need an example.
fY (y)|dy| = fX (x)|dx|
PROBLEM 14.19 (log-normal density). Let X be N(0, 1) and let Y = eX . Find the density of Y.
(We say that Y has a lognormal law.)
Answer. We have

    f_Y(y)|dy| = f_X(x)|dx|.

Since dy/dx = e^x we have f_Y(y) e^x = f_X(x). So f_Y(y) = e^{−x} f_X(x) and, since x = log y, we get

    f_Y(y) = (1/(y√(2π))) e^{−(log y)²/2},   y > 0.
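A quick way to convince yourself of the formula is to compare it with a histogram of simulated values of Y = e^X; a minimal sketch, assuming numpy (the grid points and window width are arbitrary):

    import numpy as np

    # Compare f_Y(y) = exp(-(log y)^2/2) / (y*sqrt(2*pi)) with simulated Y = exp(X), X ~ N(0,1).
    rng = np.random.default_rng(1)
    y = np.exp(rng.standard_normal(10**6))

    grid = np.array([0.5, 1.0, 2.0, 4.0])
    f_formula = np.exp(-np.log(grid)**2 / 2) / (grid * np.sqrt(2 * np.pi))

    h = 0.02   # crude empirical density: fraction of samples within a small window
    f_empirical = np.array([np.mean(np.abs(y - g) < h) / (2 * h) for g in grid])
    print(np.round(f_formula, 3))
    print(np.round(f_empirical, 3))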
The recipe also works if the function is smooth but not invertible.
PROBLEM 14.20 (density of the square of a r.v.). Let X have density fX (x) such that fX (x) > 0
for all x ∈ R, and let Y = X2 . Find the density of Y.
Answer. The function x ↦ y := x² is not invertible. But it has two branches: x₁ = +√y and x₂ = −√y. Then

    f_Y(y)|dy| = f_X(x₁)|dx₁| + f_X(x₂)|dx₂|,

because in order to obtain the mass in the interval with endpoints y and y + dy we have to add the masses in the intervals with endpoints x_i, x_i + dx_i, i = 1, 2. We have dy/dx = 2x. Hence |dx₁/dy| = 1/(2|x₁|), |dx₂/dy| = 1/(2|x₂|), and if we let x₁ = x, we have |x₂| = x. Hence

    f_Y(y) = (f_X(x) + f_X(−x)) / (2x) = (f_X(√y) + f_X(−√y)) / (2√y),   y > 0.
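Again, a small simulation can check the formula, e.g. for X standard normal (a sketch assuming numpy; the grid points are arbitrary):

    import numpy as np

    # Check f_Y(y) = (f_X(sqrt(y)) + f_X(-sqrt(y))) / (2*sqrt(y)) for Y = X^2, X ~ N(0,1).
    rng = np.random.default_rng(2)
    x = rng.standard_normal(10**6)
    y = x**2

    def f_X(t):
        return np.exp(-t**2 / 2) / np.sqrt(2 * np.pi)

    grid = np.array([0.2, 0.5, 1.0, 2.0])
    f_formula = (f_X(np.sqrt(grid)) + f_X(-np.sqrt(grid))) / (2 * np.sqrt(grid))

    h = 0.01
    f_empirical = np.array([np.mean(np.abs(y - g) < h) / (2 * h) for g in grid])
    print(np.round(f_formula, 3), np.round(f_empirical, 3))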
?PROBLEM 14.21 (a r.v. with density that has no expectation). Someone is standing at
distance ` from an infinite wall and holds a gun. He takes a swing and fires. What is the
distribution of the place where the bullet landed and its expectation? To answer this, let 0
denote the closest point from the gun holder to the wall, let Θ be the angle formed between the line from the gun holder to the location X of the bullet and the line from the gun holder to
0. Assume that Θ has constant density between −π/2 and π/2. Also assume that X is signed:
positive if X is to the right of 0, negative if it is to the left of it.
Answer. We have

    X = ℓ tan Θ.

We are told that Θ has density f(θ) = (1/π) 1_{−π/2<θ<π/2}. If g(x) is the density of X then

    g(x)|dx| = f(θ)|dθ|.

Since θ = arctan(x/ℓ), we have dθ/dx = ℓ/(ℓ² + x²), and so

    g(x) = (1/π) · ℓ/(ℓ² + x²).
There are no restrictions on x because −π/2 < θ < π/2 if and only if −∞ < x < ∞. To compute E(X) we write

    E(X) = ∫_{−∞}^{∞} x g(x) dx = ∫_{−∞}^{0} x g(x) dx + ∫_0^{∞} x g(x) dx = −∫_0^{∞} x g(x) dx + ∫_0^{∞} x g(x) dx,

because g(x) = g(−x). But the last two integrals do not cancel out, even though they are identical, because

    ∫_0^∞ x g(x) dx = (ℓ/π) ∫_0^∞ x/(ℓ² + x²) dx = ∞,

since the change of variables u = ℓ² + x² transforms this integral to (ℓ/2π) ∫_{ℓ²}^{∞} du/u = ∞. So the answer to what E(X) is is "it does not exist".
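The lack of expectation shows up in simulations: running averages of samples of X never settle down. A minimal sketch, assuming numpy (ℓ = 1 is an arbitrary choice):

    import numpy as np

    # X = l*tan(Theta), Theta uniform on (-pi/2, pi/2): the running sample mean keeps jumping.
    rng = np.random.default_rng(3)
    l = 1.0
    theta = rng.uniform(-np.pi / 2, np.pi / 2, size=10**6)
    x = l * np.tan(theta)

    running_mean = np.cumsum(x) / np.arange(1, x.size + 1)
    for n in [10**2, 10**3, 10**4, 10**5, 10**6]:
        print(n, running_mean[n - 1])   # does not converge to any number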
PROBLEM 14.22 (dimensions are not necessarily physical dimensions).¹ A certain company, abiding by surveillance capitalism principles, collects 5 kinds of data for each customer: the time x₁ that the measurement is taken, their height x₂ at this time, their location on the planet at this time (two numbers, x₃, x₄: longitude and latitude), and their wealth x₅ at this time (negative if they are in debt). What is an appropriate sample space? The company is interested in the event that, before the end of the year 2024, the height of the customer is larger than 167 cm, the customer is located in the capital of Zimbabwe, and the customer's wealth is negative. Which set is this event?
Answer. The time x₁ can, for instance, be measured in hours, starting from the year 2000 until 2100. There are 876,600 hours in 100 years, so 0 ≤ x₁ ≤ 876,600. The height can be measured in cm, and a baby is at least 40 cm, while the tallest person is at most 280 cm, so 40 ≤ x₂ ≤ 280. Longitude is an angle, −180 ≤ x₃ ≤ 180, if measured in Sumerian units (degrees). Latitude is also an angle, −90 ≤ x₄ ≤ 90. The richest person on earth is worth 250 billion dollars. Let's say that the maximum debt of someone is at most 100 billion dollars. So, measuring wealth in billions of dollars, −100 ≤ x₅ ≤ 250. We thus take

    Ω = [0, 876600] × [40, 280] × [−180, 180] × [−90, 90] × [−100, 250] ⊂ R⁵.
¹ It is really awful when I ask students in Calculus to compute the area of, say, the set {(x, y) ∈ R² : y² ≤ x² e^{−x}} and they correctly answer 1, but then they attach a unit and say 1 m² (square meter). Who said that x = 1 means 1 m? This is not just bad taste, but also bad for business.
The point of this problem is that R^d does not have to be a space of physical dimensions.
The concept of mass densities and probability densities of one variable, introduced earlier, generalizes almost verbatim to many variables. We quickly give the definitions for two variables; they can easily be generalized to many variables.
We say that f(x, y), (x, y) ∈ R², is a mass density if f is a measurable function such that f(x, y) ≥ 0 everywhere and ∬_I f(x, y) dx dy < ∞ for every bounded rectangle I. It is a probability density if, in addition, ∬_{R²} f(x, y) dx dy = 1.
If f(x, y) is a probability density function then we can define a probability measure via

    Q(B) = ∬_B f(x, y) dx dy,   B ∈ B².

We say that a set N ⊂ R² is a zero-area set if, for every ε > 0, we can find bounded rectangles I₁, I₂, . . . whose union contains N and such that

    Σ_{n=1}^{∞} area(Iₙ) < ε.
Any countable subset of R2 has zero area. Any line in R2 has zero area and, more generally,
any piecewise smooth curve has zero area.
We can modify a probability density function on a zero-area set and get another density
function for the same probability measure.
?PROBLEM 14.23 (help me give definitions). Consider the random vector (X, Y, Z), that is,
3 measurable functions with domain a common Ω and codomain R each, and
(1) define the concept of probability density function f (x, y, z) for (X, Y, Z);
(2) define the concept of probability distribution function F(x, y, z) for (X, Y, Z);
(3) give the relations between f and F;
(4) generalize to n random variables X1 , . . . , Xn .
Answer. (1) We say that (X, Y, Z) has probability density function f(x, y, z) if

    P((X, Y, Z) ∈ B) = ∭_B f(x, y, z) dx dy dz

for (Borel) sets B ⊂ R³.
(2) The distribution function F(x, y, z) is defined by

    F(x, y, z) = P(X ≤ x, Y ≤ y, Z ≤ z).

(3) The two are related by F(x, y, z) = ∫_{−∞}^{x} ∫_{−∞}^{y} ∫_{−∞}^{z} f(u, v, w) dw dv du and, conversely, f(x, y, z) = ∂³F(x, y, z)/∂x∂y∂z.
(4) For n random variables X₁, . . . , Xₙ the definitions are entirely analogous; in particular,

    f(x₁, . . . , xₙ) = ∂ⁿ F(x₁, . . . , xₙ) / (∂x₁ · · · ∂xₙ).
?PROBLEM 14.24 (marginal densities). If (X, Y) is a random vector in R2 with density f (x, y)
find the density of X and the density of Y.
Answer. Let us use the letter f1 for the density of X, and f2 for the density of Y. We have
    P(X ≤ x) = P(X ≤ x, Y < ∞) = ∫_{−∞}^{x} ∫_{−∞}^{∞} f(u, v) dv du.

Hence

    f₁(x) = (d/dx) ∫_{−∞}^{x} ∫_{−∞}^{∞} f(u, v) dv du = ∫_{−∞}^{∞} f(x, v) dv.

Similarly,

    f₂(y) = ∫_{−∞}^{∞} f(u, y) du.
We call f1 (x) the first marginal of f (x, y) and f2 (y) the second marginal.
In analogy to discrete probability we can define the conditional density of X given Y = y
by the formula
    f_{1|2}(x|y) = f(x, y) / f₂(y).                              (14.6)
We note that f1|2 (x|y), as a function of x, is a probability density function because
    ∫_{−∞}^{∞} f_{1|2}(x|y) dx = 1,   for all y.
(2)

    f_{1,2}(x₁, x₂) = ∫_{x₃=−∞}^{∞} · · · ∫_{xₙ=−∞}^{∞} f(x₁, x₂, . . . , xₙ) dx₃ · · · dxₙ.

(3)

    f_{1,2,3|4,5}(x₁, x₂, x₃ | x₄, x₅) = f_{1,2,3,4,5}(x₁, x₂, x₃, x₄, x₅) / f_{4,5}(x₄, x₅).
If the density factorizes as f(x, y) = f₁(x) f₂(y) for all (x, y), then X, Y are independent random variables, a notion that we have encountered in elementary probability and that we shall encounter again.
Notice that if f, g are probability density functions on R then the formula f(x)g(y) gives a probability density function on R². This corresponds to a pair (X, Y) of independent random variables.
If X, Y are independent then

    P(X ∈ A, Y ∈ B) = P(X ∈ A) P(Y ∈ B),

because

    P(X ∈ A, Y ∈ B) = ∫_A ∫_B f(x, y) dx dy
                    = ∫_A ∫_B f₁(x) f₂(y) dx dy
                    = ∫_{−∞}^{∞} ∫_{−∞}^{∞} 1_{x∈A} f₁(x) 1_{y∈B} f₂(y) dx dy
                    = (∫_{−∞}^{∞} 1_{x∈A} f₁(x) dx) (∫_{−∞}^{∞} 1_{y∈B} f₂(y) dy)
                    = P(X ∈ A) P(Y ∈ B).
We can generalize to n random variables X₁, . . . , Xₙ and say they are independent if the density of (X₁, . . . , Xₙ) is the product of the n marginal densities:

    f(x₁, . . . , xₙ) = f₁(x₁) · · · fₙ(xₙ).

Answer. We omit the limits of the integrals below because they are all over the whole of R. We have

    E[g(X₁) · · · g(Xₙ)] = ∫ · · · ∫ g(x₁) · · · g(xₙ) f(x₁, . . . , xₙ) dx₁ · · · dxₙ
                        = ∫ · · · ∫ g(x₁) · · · g(xₙ) f₁(x₁) · · · fₙ(xₙ) dx₁ · · · dxₙ
                        = [∫ g(x₁) f₁(x₁) dx₁] · · · [∫ g(xₙ) fₙ(xₙ) dxₙ]
                        = E[g(X₁)] · · · E[g(Xₙ)].
The uniform density on a set S ⊂ R² with finite area is denoted by unif(S) and is defined by

    f(x, y) = (1/area(S)) 1_{(x,y)∈S}.

In other words, f is constant on S and 0 outside S. The constant must necessarily be 1/area(S), so that the total integral of f equals 1.
The standard normal density on R² is obtained by multiplying the N(0, 1) density at x by the N(0, 1) density at y. This gives

    f(x, y) = (1/2π) e^{−(x²+y²)/2}.
For instance, the unif([a₁, b₁] × [a₂, b₂]) density factorizes as

    f(x, y) = (1/(b₁ − a₁)) 1_{a₁≤x≤b₁} · (1/(b₂ − a₂)) 1_{a₂≤y≤b₂},

a function of x times a function of y, so the two coordinates of such a random vector are independent.
Let U ⊂ Rⁿ be an open set and consider a function

    H = (H₁, . . . , Hₙ) : Rⁿ → Rⁿ

such that

    V := H(U) is an open set

and

    H : U → V

is a bijection that is continuously differentiable, with inverse function

    H^{−1} : V → U

also continuously differentiable. Sometimes U and V are Rⁿ itself, but they don't have to be. One reason that we want U (and, similarly, V) to be open is so that we have some room to move around a given point x ∈ U, which is needed in order to differentiate. Remember that a derivative is a limit and, to be able to talk about a limit as x′ → x, say, we need x′ to move freely in a small neighborhood of x. A set U is open if every point x in it has a small neighborhood included in U.
    y₁ = H₁(x₁, . . . , xₙ)
    ⋮
    yₙ = Hₙ(x₁, . . . , xₙ)

and, applying H to the random vector (X₁, . . . , Xₙ),

    Y₁ = H₁(X₁, . . . , Xₙ)
    ⋮
    Yₙ = Hₙ(X₁, . . . , Xₙ)
Obviously, Y₁, . . . , Yₙ are also random variables on the same Ω. After all, the last display really means

    Y_i(ω) = H_i(X₁(ω), . . . , Xₙ(ω)),   i = 1, . . . , n,

for all ω ∈ Ω. Under the conditions stated via the underlined words, the new random vector

    Y = (Y₁, . . . , Yₙ)

has a distribution that is also given by a probability density function g(y) = g(y₁, . . . , yₙ), that is,

    P(Y ∈ B) = ∫_B g(y) dy.
In order to write down the formula, we recall certain notions from Calculus. The derivative of y = H(x) is represented as an n × n matrix:

    ( ∂y₁/∂x₁  · · ·  ∂y₁/∂xₙ )
    (    ⋮                ⋮   )
    ( ∂yₙ/∂x₁  · · ·  ∂yₙ/∂xₙ )

Of course, ∂y_i/∂x_j stands for (∂/∂x_j) H_i(x₁, . . . , xₙ). There are many notations for this matrix and many names for it. We will write either

    ∂y/∂x ≡ ∂(y₁, . . . , yₙ)/∂(x₁, . . . , xₙ)
or

    H′(x) = ( ∂y_i/∂x_j )_{i,j = 1,…,n}

and simply call it the derivative, understanding that it is always (represented as) a matrix.² All of these are different notations for the same thing.
The determinant of this matrix is called the Jacobian of H:

    Jacobian of H = det H′(x) = det(∂y/∂x).
Our underlined assumptions imply that both the derivative of H : U → V and the derivative of the inverse function H^{−1} : V → U exist. The inverse function theorem says that these derivatives are invertible matrices and that one is the inverse of the other:

    (H^{−1})′(y) = [H′(H^{−1}(y))]^{−1},   y ∈ V.

The density of Y = H(X) is then

    g(y) = f(H^{−1}(y)) |det (H^{−1})′(y)| = f(H^{−1}(y)) / |det H′(H^{−1}(y))|,   y ∈ V.          (14.8)

We need to take the absolute value of the Jacobian (the Jacobian can be negative but, after all, a density is always nonnegative).
So, here, I explained a recipe, and you have learned it. I could explain this recipe to a slave
(=computer) and instruct him/her/it (=write a program) to apply the formula. The fact that
the slave can execute my instructions does not mean that the slave has ?learned the formula.
I don’t have the time in this course, or space herein, to really ?teach it, so I will resort to some
kind of motivation.
Suppose that n = 2. (For n = 1 the situation has been discussed and is, after all, rather
trivial.)
Consider a “small” rectangle, located at x = (x1 , x2 ) and with sides parallel to the axes
having oriented lengths dx₁, dx₂. By "oriented" we mean that the two vectors have the same
orientation as the standard basis of R2 . The probability that X = (X1 , X2 ) is in this small
² Denote ‖x‖ := √(x₁² + · · · + xₙ²). Then H′(x) is the unique matrix such that ‖H(x + h) − H(x) − H′(x)h‖ / ‖h‖ → 0 as ‖h‖ → 0.
rectangle is given by the integral of its density over this rectangle. Since the rectangle is small
this is approximately equal to
f (x1 , x2 ) |dx1 dx2 |.
When (x1 , x2 ) is mapped to (y1 , y2 ) = H(x1 , x2 ), the rectangle is mapped to another small set,
but not necessarily a rectangle because the 90° angle may have been distorted by H. But it is a
small quadrilateral with sides denoted by dy1 , dy2 . The probability that Y = (Y1 , Y2 ) is in this
small quadrilateral is
g(y1 , y2 ) |dy1 dy2 |,
approximately. But the two probabilities must be the same:
g(y1 , y2 ) |dy1 dy2 | = f (x1 , x2 ) |dx1 dx2 |. (14.9)
To find g(y1 , y2 ) we must express the elementary area dx1 dx2 in terms of the elementary area
dy1 dy2 . Remember that y1 = H1 (x1 , x2 ), y2 = H2 (x1 , x2 ). The function H = (H1 , H2 ) has an
inverse. Let us denote its inverse by K = (K1 , K2 ) rather than the clumsier symbol H−1 . We
thus have
x1 = K1 (y1 , y2 )
x2 = K2 (y1 , y2 )
Taking differentials of these expressions we obtain

    dx₁ = (∂x₁/∂y₁) dy₁ + (∂x₁/∂y₂) dy₂
    dx₂ = (∂x₂/∂y₁) dy₁ + (∂x₂/∂y₂) dy₂

Hence

    dx₁ dx₂ = (∂x₁/∂y₁)(∂x₂/∂y₁)(dy₁)(dy₁) + (∂x₁/∂y₂)(∂x₂/∂y₂)(dy₂)(dy₂)
              + (∂x₁/∂y₁)(∂x₂/∂y₂)(dy₁)(dy₂) + (∂x₁/∂y₂)(∂x₂/∂y₁)(dy₂)(dy₁).

Since the area of a small rectangle of zero width is zero, we set (dy₁)(dy₁) = 0 = (dy₂)(dy₂). Since (dy₁)(dy₂) is the same as (dy₂)(dy₁) in magnitude but of different sign we have (dy₂)(dy₁) = −(dy₁)(dy₂). We thus obtain

    dx₁ dx₂ = ( (∂x₁/∂y₁)(∂x₂/∂y₂) − (∂x₁/∂y₂)(∂x₂/∂y₁) ) (dy₁)(dy₂).
But

    (∂x₁/∂y₁)(∂x₂/∂y₂) − (∂x₁/∂y₂)(∂x₂/∂y₁) = det(∂x/∂y),

so

    |dx₁ dx₂| = |det(∂x/∂y)| |dy₁ dy₂|.
Substituting this into (14.9) gives (14.8).
Of course, all that was sheer skulduggery and not a rigorous explanation. Nevertheless, I hope you have ?learned something about the gist of all that, rather than simply learning formula (14.8). We can generalize this skulduggerous argument to general n.
(2) Let next n = 2 and let A be the 2 × 2 matrix with rows (a, b) and (c, d). Write g explicitly in this case.
(3) What happens in the n = 2 case when ad = bc?
Answer. (1) The mapping we have is

    H(x) = Ax,   so that   H′(x) = A.

Thus the derivative of H is A at every point, and the derivative of H^{−1} is A^{−1}. So (14.8) becomes

    g(y) = f(A^{−1}y) |det A^{−1}|.

Since

    |det A^{−1}| = 1/|det A|,

we can also write

    g(y) = f(A^{−1}y) / |det A|.
(2) We have

    ( a b ; c d )^{−1} = (1/(ad − bc)) ( d −b ; −c a ).

So

    A^{−1} y = (1/(ad − bc)) ( d −b ; −c a ) ( y₁ ; y₂ ) = (1/(ad − bc)) ( dy₁ − by₂ ; −cy₁ + ay₂ ),

and so

    g(y₁, y₂) = (1/|ad − bc|) f( (dy₁ − by₂)/(ad − bc), (−cy₁ + ay₂)/(ad − bc) ).
(3) If ad = bc then

    dY₁ = bY₂

(indeed, dY₁ − bY₂ = d(aX₁ + bX₂) − b(cX₁ + dX₂) = (ad − bc)X₁ = 0). But the set

    L = {(y₁, y₂) ∈ R² : dy₁ = by₂} ⊂ R²

is a straight line and so it has zero area. The equation dY₁ = bY₂ can be written as Y ∈ L. Thus the probability distribution of Y is zero outside L. Since L is a straight line, Y cannot have a density. (For if it did have a density g(y) we would have 1 = ∫_{R²} g(y) dy = ∫_{R²} g(y) 1_L(y) dy = ∫_L g(y) dy = 0. And we do know that 1 ≠ 0.)
PROBLEM 14.30 (the probability that an equation has real roots). Consider the equation
x2 + Ax + B = 0,
where A, B are random real numbers between −1 and 1. What is the probability that the
equation has real roots? To be precise, let us assume that (A, B) is a random vector with
distribution unif([−1, 1] × [−1, 1]).
Answer. The equation can be written as

    0 = x² + Ax + B = x² + 2(A/2)x + (A/2)² − (A/2)² + B = (x + A/2)² − A²/4 + B,

that is,

    (x + A/2)² = A²/4 − B.

For this to have a real solution we must have the right-hand side nonnegative, so that we can take its square root. So the probability we are after is

    P(B ≤ A²/4) = ∬_{R²} (1/4) 1_{−1≤a≤1} 1_{−1≤b≤1} 1_{b≤a²/4} da db,

since (A, B) has density (1/4) 1_{−1≤a≤1} 1_{−1≤b≤1}.
The last integral is over the whole of R² since the restrictions in the density and in the set over which we integrate have been expressed via indicator functions. But the product of the last two indicators is

    1_{−1≤b≤1} 1_{b≤a²/4} = 1_{−1≤b≤a²/4}

(because a² ≤ 1, so a²/4 ≤ 1). We can choose to do the integration in any order we like. Choosing to integrate over b first and over a next, the above integral is equal to

    ∫ (1/4) 1_{−1≤a≤1} ( ∫ 1_{−1≤b≤a²/4} db ) da.

The integral in the parenthesis is equal to a²/4 + 1. Hence the last display becomes

    ∫ (1/4) 1_{−1≤a≤1} (a²/4 + 1) da = (1/4) ∫_{−1}^{1} (a²/4 + 1) da = (1/4) ( (1/4)(2/3) + 2 ) = 13/24 ≈ 0.542.
Figure 14.4: The ratio of the areas of the disc and the square equals the probability that a random
point chosen uniformly at random in the square actually lies in the circle. This probability is about
78.5%.
PROBLEM 14.31 (a circle in a square). What is the probability that a point chosen uniformly
at random from a square of side length ` lies in the largest inscribed disc? See Figure 14.4.
Answer. Let (X, Y) have distribution unif(S), where S is a square of side length ℓ. Hence

    f(x, y) = (1/ℓ²) 1_{(x,y)∈S}

is a density for (X, Y). Let D be the largest inscribed disc. We have

    P((X, Y) ∈ D) = ∬_D f(x, y) dx dy = (1/ℓ²) ∬ 1_{(x,y)∈S} 1_{(x,y)∈D} dx dy = (1/ℓ²) ∬ 1_{(x,y)∈D} dx dy = (1/ℓ²) area(D).

Since D has radius ℓ/2 its area is πℓ²/4. So

    P((X, Y) ∈ D) = π/4 ≈ 0.785.
PROBLEM 14.32 (uniform law on a disc begets independence). Let (X, Y) be a random
vector in R2 whose distribution is unif(D), with D be the disc centered at the origin and
having radius 1.
(1) Are X, Y independent?
(2) Express (X, Y) in polar coordinates, that is, define random variables (R, Θ) via
X = R cos Θ
Y = R sin Θ
Are (R, Θ) independent?
(3) What is the density of R? What is the density of Θ?
Answer. (1) A density for (X, Y) is the function

    f(x, y) = (1/π) 1_{x²+y²<1}.

This cannot be written as the product of a function of x only and a function of y only, so X, Y are not independent.
(2) Let us find a density, say g(r, θ), for (R, Θ) via the formula

    g(r, θ) = f(x, y) |det ∂(x, y)/∂(r, θ)|.
The derivative matrix of the map

    x = r cos θ,   y = r sin θ

is

    ∂(x, y)/∂(r, θ) = ( ∂x/∂r  ∂x/∂θ ; ∂y/∂r  ∂y/∂θ ) = ( cos θ  −r sin θ ; sin θ  r cos θ ).

The Jacobian of the map is the determinant of this matrix. We have

    det ( cos θ  −r sin θ ; sin θ  r cos θ ) = r cos²θ + r sin²θ = r.
We thus have

    g(r, θ) = (1/π) 1_{x²+y²<1} · r = (r/π) 1_{r<1}.

We substituted x, y by their expressions in terms of r, θ, so 1_{x²+y²<1} = 1_{r²<1} = 1_{r<1}; that's how we arrived at the last formula. Obviously this is a function of r times a function of θ (the constant function!), and hence (R, Θ) are independent random variables.
(3) Since g(r, θ) is a constant function of θ, we have that Θ is a uniform random variable. Uniform where? Well, Θ, being an angle, ranges between 0 and 2π. So its density is

    f_Θ(θ) = (1/2π) 1_{0≤θ≤2π}.

(It does not matter if I write ≤ or < in the indicator because changing a density on a zero-length set gives a density for the same probability measure.) We then have

    g(r, θ) = (r/π) 1_{r<1} = (1/2π) · 2r 1_{r<1}.

Hence the density for R is

    f_R(r) = 2r 1_{r<1}.
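A simulation sketch (assuming numpy) that generates uniform points on the disc by rejection from the square and checks the conclusions about R and Θ:

    import numpy as np

    # Uniform points on the unit disc: R should have density 2r (so E(R) = 2/3),
    # Theta should be uniform on [0, 2*pi) (so E(Theta) = pi), and R, Theta uncorrelated.
    rng = np.random.default_rng(5)
    pts = rng.uniform(-1, 1, size=(4 * 10**6, 2))
    pts = pts[pts[:, 0]**2 + pts[:, 1]**2 < 1]            # keep points inside the disc

    r = np.sqrt(pts[:, 0]**2 + pts[:, 1]**2)
    theta = np.mod(np.arctan2(pts[:, 1], pts[:, 0]), 2 * np.pi)

    print("E(R)     :", r.mean(), " expected 2/3")
    print("E(Theta) :", theta.mean(), " expected pi")
    print("corr     :", np.corrcoef(r, theta)[0, 1], " expected ~ 0")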
PROBLEM 14.33 (two explosions). Two explosions occur at two time points between 0 and
1, uniformly at random and independently. Find the probability that the two explosions take place within time t of each other, for 0 < t < 1. You should model this by letting X₁, X₂ denote the times of the two explosions and by letting the distribution of (X₁, X₂) be unif([0, 1] × [0, 1]).
Answer. We need the probability of the event

    {|X₁ − X₂| ≤ t}.

Since (X₁, X₂) has the unif([0, 1] × [0, 1]) distribution, if we let

    A = {(x₁, x₂) ∈ [0, 1] × [0, 1] : |x₁ − x₂| ≤ t},

we have

    P(|X₁ − X₂| ≤ t) = P((X₁, X₂) ∈ A) = area(A) / area([0, 1] × [0, 1]).
The area in the denominator is 1. The area in the numerator is 1 − (1 − t)² = t(2 − t). (Draw a figure!) Hence

    P(|X₁ − X₂| ≤ t) = t(2 − t).
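A quick Monte Carlo check of the formula t(2 − t), assuming numpy:

    import numpy as np

    # P(|X1 - X2| <= t) for (X1, X2) uniform on the unit square should be t*(2 - t).
    rng = np.random.default_rng(6)
    x1 = rng.uniform(size=10**6)
    x2 = rng.uniform(size=10**6)
    for t in [0.1, 0.25, 0.5]:
        print(t, np.mean(np.abs(x1 - x2) <= t), t * (2 - t))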
PROBLEM 14.34 (a random determinant). Assume that X, Y, Z, W are i.i.d. N(0, 1) variables. What is the variance of the determinant of the matrix ( X Y ; Z W )?
Answer. The determinant is XW − YZ. By independence, E(XW − YZ) = (EX)(EW) − (EY)(EZ) = 0. Hence var(XW − YZ) = E[(XW − YZ)²] = E[X²W² + Y²Z² − 2XWYZ] = (EX²)(EW²) + (EY²)(EZ²) − 2(EX)(EW)(EY)(EZ) = 1 + 1 − 0 = 2.
?PROBLEM 14.35 (the minimum of two independent exponential random variables). Let
τ, σ be two independent random variables with distributions expon(λ), expon(µ), respectively;
that is, their joint distribution has a density given by f (t, s) = λe−λt 1t>0 µe−µs 1s>0 . Define the
random variable

    X = min(τ, σ).
Determine the distribution function of X and then its density. What is the distribution of X
called? Then consider the events {X = τ} and {X = σ}. Do you expect their probabilities to add
up to 1? Answer this, and then compute the probabilities explicitly and add them up to verify
your expectation.
Answer. We have X > x if and only if τ > x and σ > x. Hence, for x > 0,

    P(X > x) = P(τ > x, σ > x) = P(τ > x) P(σ > x) = e^{−λx} e^{−µx} = e^{−(λ+µ)x},

so X is an expon(λ + µ) random variable, with distribution function 1 − e^{−(λ+µ)x} and density (λ + µ) e^{−(λ+µ)x} 1_{x>0}. Since, with probability one, τ ≠ σ, the events {X = τ} and {X = σ} should together have probability 1. Indeed,

    P(X = τ) = P(τ < σ) = ∫_0^∞ λe^{−λt} ( ∫_t^∞ µe^{−µs} ds ) dt
             = ∫_0^∞ λe^{−λt} e^{−µt} dt = λ ∫_0^∞ e^{−(λ+µ)t} dt = λ/(λ + µ).

For exactly the same reason, P(X = σ) = µ/(λ + µ). Indeed, λ/(λ + µ) + µ/(λ + µ) = 1.
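A simulation sketch (assuming numpy; λ = 2, µ = 3 are arbitrary choices) checking both conclusions:

    import numpy as np

    # X = min(tau, sigma) should be expon(lambda + mu), and P(X = tau) = lambda/(lambda + mu).
    rng = np.random.default_rng(7)
    lam, mu, n = 2.0, 3.0, 10**6
    tau = rng.exponential(1 / lam, n)
    sigma = rng.exponential(1 / mu, n)
    x = np.minimum(tau, sigma)

    print("E(min)       :", x.mean(), " expected:", 1 / (lam + mu))
    print("P(min = tau) :", np.mean(tau < sigma), " expected:", lam / (lam + mu))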
PROBLEM 14.36 (the maximum and the minimum, together). Let τ1 , τ2 be two independent
random variables, each with expon(1) law. Set
X = min(τ1 , τ2 )
Y = max(τ1 , τ2 )
Find (i) the density of the random vector (X, Y), (ii) the density of Y, (iii) the density of X, (iv)
the conditional density of X given Y = y, (v) the conditional density of Y given X = x. And
explain why (vi) the law of X is the law of an expon(2) random variable, (vii) the law Y is the
law of the sum of two independent random variables, one expon(2) and the other expon(1).
Answer. (i) The function H = (H₁, H₂), where H₁(t₁, t₂) = min(t₁, t₂), H₂(t₁, t₂) = max(t₁, t₂), is not invertible³ and not differentiable at points of the form (t, t). Hence the technique we learned does not apply. But we can compute the distribution function F(x, y) = P(X ≤ x, Y ≤ y), see (14.5), and then differentiate, as in (14.4), to obtain the density f(x, y) of (X, Y). Since Y is a maximum we have Y ≤ y ⟺ τ₁ ≤ y, τ₂ ≤ y, so, by the assumed independence, the computation gives

    f(x, y) = ∂²F(x, y)/∂x∂y = 2 e^{−x} e^{−y} 1_{0<x<y},
or directly:

    f₂(y) = (d/dy) P(Y ≤ y) = (d/dy) (1 − e^{−y})² = 2 e^{−y} (1 − e^{−y}) 1_{y>0}.          (14.10)
³ However, given X and Y we have P(τ₁ = X, τ₂ = Y or τ₁ = Y, τ₂ = X) = 1 and this can be used.
This is equal to f₂(u), where f₂ is the density of Y, as we derived in (14.10). So the density of V = σ₂ + σ₁ is the density of Y. Hence Y has the same law as σ₂ + σ₁, as we were asked to show.
PROBLEM 14.37 (a dangerous particle hits the Earth). A particle coming from very far away is going to hit the Earth at a random point at some time in the distant future and cause damage because it carries a lot of energy. The Empire of Africa, extending between Ethiopia, Niger and Angola, has become the center of the world and people are interested in the chance that the particle will hit it. Let A, B, C be the capitals of these three countries (A = Addis Ababa, B = Niamey, C = Luanda, respectively). The distances between them (as measured by a plane flying on the least-distance path between any two cities) are c := d(A, B) = 4023 km, b := d(A, C) = 4038 km, a := d(B, C) = 2774 km. Calculate the probability that the particle will land in the triangle defined by these three capitals. Note that the circumference of the Earth is 40,075 km. Hint: You can use the law of cosines for a spherical triangle, which relates its side lengths a, b, c to the angles θ_A, θ_B, θ_C at its vertices:

    cos(a/R) = cos(b/R) cos(c/R) + sin(b/R) sin(c/R) cos θ_A,

where R is the radius of the Earth, and symmetrically for the other two angles. After computing the angles, use the inclusion-exclusion formula.
Answer. We can use a calculator or, better yet, Maple, to solve the law of cosines for the three angles θ_A, θ_B, θ_C. If T denotes the triangle defined by the three capitals, we have, assuming that the point where the particle lands on the Earth has uniform distribution,

    P(particle lands in T) = area(T) / (4πR²).

Each side, say AB, of T is an arc of a great circle C_{AB} (a circle passing through the center of the Earth). Focus on a particular vertex, A, say. Consider the region L_A on the Earth between C_{AB} and C_{AC} that contains the triangle T. This region is a "double lune" whose area S(θ_A) := area(L_A) is clearly a linear function of θ_A, with S(π) = 4πR², the area of the whole Earth. Hence S(θ_A)/S(π) = θ_A/π, and this gives S(θ_A) = 4θ_A R². Similarly for the other two lunes, L_B, L_C. By the inclusion-exclusion formula (the three double lunes cover the whole Earth, covering T and its antipodal copy three times each and everything else once),

    4θ_A R² + 4θ_B R² + 4θ_C R² = 4πR² + 4 area(T),

that is, area(T) = R² (θ_A + θ_B + θ_C − π), and hence

    P(particle lands in T) = (θ_A + θ_B + θ_C − π) / (4π).
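The angle computation that the answer delegates to Maple can be reproduced in any language; here is a sketch in Python (assuming numpy, a perfectly spherical Earth, and the law of cosines from the hint). Under these assumptions the probability comes out to roughly 0.011, i.e. about 1%.

    import numpy as np

    R = 40075 / (2 * np.pi)                  # Earth radius in km, from the given circumference
    a, b, c = 2774 / R, 4038 / R, 4023 / R   # side lengths of the triangle, in radians

    def angle(opposite, s1, s2):
        # angle at the vertex opposite the side `opposite`, by the spherical law of cosines
        return np.arccos((np.cos(opposite) - np.cos(s1) * np.cos(s2)) / (np.sin(s1) * np.sin(s2)))

    tA, tB, tC = angle(a, b, c), angle(b, c, a), angle(c, a, b)
    prob = (tA + tB + tC - np.pi) / (4 * np.pi)   # area(T) / (4*pi*R^2)
    print(tA, tB, tC, prob)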
PROBLEM 14.38 (choose a small square at random). How would you model the selection
of a small square, at a random location and with random side length, contained inside a big
fixed square and having sides parallel to the big one? On the basis of your model, compute
the probability of the event L that the small square is entirely contained within the left half of
the big one; do the same for the event R that the small square is entirely contained within the
right half. And what is the probability that the small square intersects the segment separating
the big square in half?
Answer. There are many ways to do this. Here is one. Suppose the big square has side length
1. Let (x, y) be the coordinates of the upper right vertex of the small square and let ` be its side
length. Since we want the small square to be inside the big one we must have ` ≤ x and ` ≤ y.
The small square is thus specified by a point (x, y, `) ∈ R3 contained in the set
Ω = {(x, y, `) ∈ R3 : 0 ≤ ` ≤ x ≤ 1, 0 ≤ ` ≤ y ≤ 1}.
I therefore must define a probability measure P on Ω. I can do anything I like but, perhaps, the most natural way is to do it via the uniform density

    f(x, y, ℓ) = c if (x, y, ℓ) ∈ Ω,   and 0 if not.
Since

    1 = ∫_{R³} f(x, y, ℓ) dx dy dℓ = c ∫_Ω dx dy dℓ = c vol(Ω),
we have that c = 1/ vol(Ω), so it remains to compute the volume vol(Ω) of Ω. We do that by
carrying out the last integration explicitly. Let x ∧ y = min(x, y). The third coordinate ℓ must be below x and below y, i.e. below x ∧ y. So

    vol(Ω) = ∫_{x=0}^{1} ∫_{y=0}^{1} ∫_{ℓ=0}^{x∧y} dℓ dy dx = ∫_{x=0}^{1} ∫_{y=0}^{1} (x ∧ y) dy dx
           = ∫_{x=0}^{1} ( ∫_{y=0}^{x} y dy + ∫_{y=x}^{1} x dy ) dx = ∫_0^1 ( x²/2 + x(1 − x) ) dx = 1/3,

so c = 3. For the event L, the small square lies entirely in the left half of the big square exactly when x ≤ 1/2 (the right side of the small square is at x). Hence

    vol(L) = ∫_{x=0}^{1/2} ∫_{y=0}^{1} (x ∧ y) dy dx = 1/48 + 1/12 = 5/48,

where the term 1/48 resulted by performing the inner integral over 0 ≤ y ≤ x and the term 1/12 by doing it over x < y ≤ 1, and so P(L) = c vol(L) = 3 · 5/48 = 5/16. For R, the small square must lie in the right half, which means that its left side, located at x − ℓ, must be to the right of 1/2. Indeed, the requirement that x − ℓ > 1/2 is equivalent to ℓ < x − 1/2, and so ℓ must be below x, below y and below x − 1/2. But x ≥ x − 1/2, so

    x ∧ y ∧ (x − 1/2) = y ∧ (x − 1/2).
Hence

    vol(R) = ∫_{x=1/2}^{1} ∫_{y=0}^{1} ∫_{ℓ=0}^{y∧(x−1/2)} dℓ dy dx = ∫_{x=1/2}^{1} ∫_{y=0}^{1} [y ∧ (x − 1/2)] dy dx = 1/12 + 1/48 = 5/48.
We thus have

    P(L) = P(R) = 5/16.

So the probability that the small random square intersects the line x = 1/2 is 1 − P(L) − P(R) = 6/16 = 3/8.
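A Monte Carlo check of P(L) = P(R) = 5/16 under this model (a sketch assuming numpy; Ω is sampled by rejection from the unit cube):

    import numpy as np

    # Sample (x, y, l) uniformly on Omega = {0 <= l <= x <= 1, 0 <= l <= y <= 1} by rejection.
    rng = np.random.default_rng(8)
    pts = rng.uniform(size=(4 * 10**6, 3))
    x, y, l = pts[:, 0], pts[:, 1], pts[:, 2]
    keep = l <= np.minimum(x, y)
    x, y, l = x[keep], y[keep], l[keep]

    p_left  = np.mean(x <= 0.5)        # small square entirely in the left half
    p_right = np.mean(x - l >= 0.5)    # small square entirely in the right half
    print(p_left, p_right, 5 / 16)
    print("crosses the middle line:", 1 - p_left - p_right, 3 / 8)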
Chapter 15
Probability laws on bigger spaces
Review
A sample space Ω is finite when it has finitely many elements (outcomes). A sample space Ω
is countable if there is an injection from Ω into N. A finite Ω is always countable. An infinite
Ω is not always countable. Examples of countable sets are the sets Z of integers and Q of
rational numbers. Instead of saying “countable”, some people say “discrete”.
You know from previous courses how to put a probability measure P on a countable (finite
or infinite) Ω: simply consider a function p : Ω → [0, 1] such that Σ_{ω∈Ω} p(ω) = 1. Any such function defines a probability measure P on E = P(Ω) via P(A) = Σ_{ω∈A} p(ω).
We also saw, in Chapter 14, through many examples and problems that are important for
all kinds of applications, that we do need to have random variables X taking values in bigger-than-countable spaces, such as R and Rⁿ, such that P(X = x) = 0 for all x.
Motivation
If Ω is an uncountable set, then it is not easy to put interesting probability measures on events of Ω. There are two reasons. The first one we can explain. The second is deeper and we shall not explain it in this course; it has to do with Theorems 12.1, 12.2 and 12.3, which I ask you to read again now.
Difficulty 1. Whatever probability measure P we put on Ω, at most countably many outcomes ω can have P{ω} > 0.          (15.1)
And this means that we must give preference to a special countable subset of the uncountable set Ω,
which is ugly, inconvenient, impractical and stupid.
Difficulty 2. Again, let Ω be uncountable. The above difficulty is overcome by defining P on sets
rather than on singletons. But this must be done in a way that (AXIOM TWO) be satisfied. So we
must define P on a set E of events, a set that is a σ-field. Naturally, we would like to have E as large as
possible. The largest E is P(Ω) and it would be great to have every probability measure P defined on
P(Ω). But, alas, for most interesting probability measures we cannot choose P(Ω) as their domain.
And this is a fundamental restriction, one that cannot be explained here. I ask you to trust me on this
and I refer you to future courses to learn this, or to any good book on more advanced probability.
Let us explain Difficulty 1.
?PROBLEM 15.1 (adding uncountably many positive numbers always gives ∞). Explain
statement (15.1).
Answer. Denote by A⁺ the set {ω ∈ Ω : P{ω} > 0} and let Aₙ := {ω ∈ Ω : P{ω} ≥ 1/n}, n ∈ N. Since

    x > 0 ⟺ x ≥ 1/n for some n ∈ N,

we have

    A⁺ = ⋃_{n=1}^{∞} Aₙ.

Now observe that Aₙ can have at most n elements. (If it had a > n elements, we would find that P(Aₙ) ≥ a/n > 1, and this is impossible.) Since each Aₙ is finite, and A⁺ is their union, it follows that A⁺ is countable.
Therefore, the need to have random variables that have probability zero to take any specific
value, together with the associated difficulties, motivates us to explain things carefully, and
this is the purpose of the rest of this chapter.
I will assume that you know basic things: (a) the definition of R and its characterization; (b)
its completeness; in particular the notions of sup (least upper bound) and inf (greatest lower
bound) of a set; (c) sequence of real numbers and the notion of a limit. (d) open subsets of the
real line.
The question is: how do we define interesting probability measures on R and what is their
domain?
Consider an increasing function F : R → R, that is, a function F such that

    F(x₁) ≤ F(x₂) whenever x₁ ≤ x₂.
We use the word “increasing” in the sense of “non-decreasing”. If F(x1 ) < F(x2 ) whenever
x1 < x2 we say that F is strictly increasing. Any such function has left and right limits: for
any x, the numbers F(x+) = limε↓0 F(x + ε) and F(x−) = limε↓0 F(x − ε) exist. The reason is that
monotone bounded sequences converge in R. The notation ε ↓ 0 means that we let ε converge
to 0 from above.
The width of an increasing function is defined by

    width(F) := lim_{x→+∞} F(x) − lim_{x→−∞} F(x).
We let E be any σ-field that includes the collection of all intervals and define B to be the
smallest such σ-field. This B is called the Borel σ-field (see Def. 12.2) and it is on B that
interesting nontrivial probability measures are defined.
The following is a fundamental result in mathematics:
To understand this, you need to read, e.g., G.B. Folland, Real Analysis, Section 1.5 . The
theorem says that to every increasing unit-width F there corresponds a unique probability
measure Q with domain B:
F 7→ Q = QF .
The converse is also true,
Q 7→ F = FQ ,
and it is a mere exercise. See Problem 15.2.
Conventions: Now, if we make the convention that F be right-continuous which means
that F(x+) = F(x) for all x, and supx∈R F(x) = 1, infx∈R F(x) = 0, then we call it a probability
distribution function. I say these things are “conventions” because they’re not really needed.
Taking into account the above we declare:
?PROBLEM 15.2 (it satisfies the defining properties of a distribution function). Let Q be a probability measure defined on B. Show that F(x) := Q((−∞, x]) satisfies the defining properties of a distribution function. Note that then

    Q((a, b]) = F(b) − F(a),   Q([a, b]) = F(b) − F(a−),   Q((−∞, b)) = F(b−),   etc.
The uniform probability measure unif(I) on a bounded interval I is defined by

    Q(J) = length(J)/length(I)

when J ⊂ I is an interval.
In fact, if Q is unif(I) then, not just for any interval, but also for any Borel set B ⊂ I
we have
    Q(B) = P(X ∈ B) = length(B)/length(I),          (15.3)
where X is any unif(I) random variable.
For every distribution function F we can always construct a random variable on some
probability space (Ω, E , P) such that the law of X is the probability measure defined by F. In
this case, we say that F is the distribution function of X. For example, we can take Ω = R,
E = B and X(ω) = ω. In this case, we can trivially write
P(X ≤ x) = F(x), x ∈ R.
People, being people, like to give names to things that they use often. If Q has a name name
then it is customary to call any random variable X with law Q also name. So, for example, if Q
is unif(I) then any random variable with law Q is called a unif(I) random variable.
Definition 15.2. We say that the point x is an atom of the probability measure Q if Q{x} > 0. A
probability measure Q with no atoms is called non-atomic or continuous. A random variable
X whose law is a continuous probability measure Q is called a continuous random variable.
?PROBLEM 15.4 (continuous r.v.). Let X be a random variable with law Q and distribution function F. Explain why the following are equivalent:
(a) X is continuous;
(b) F is a continuous function;
(c) P(X = x) = 0 for all x ∈ R.
Answer. We have Q{x} = P(X = x) = F(x) − F(x−). Hence Q{x} = 0 for all x if and only if
P(X = x) = 0 for all x. Hence (a) is equivalent to (c). On the other hand, (b) is equivalent to
F(x) − F(x−) = 0 for all x. Hence (b) is equivalent to (c).
Now recall what zero-length set means. See Definition 14.4. And then study again
Problem 14.4 that explains that any countable set is a zero-length set. Finally, trust
me that there are zero-length sets that are uncountable, it’s just hard for you to
imagine. But see Problem 15.6 below. Also see Figure 15.1 (left). This figure also
depicts a zero-area set in R2 and a zero-volume set in R3 .
PROBLEM 15.5 (tossing a pencil). A pencil1 usually has 6 sides. But let’s consider a pencil
with 10 sides, labeled 0, 1, 2, 3, 4, 5, 6, 7, 8, 9.
¹ I thank Professor Venkat Anantharam of the University of California, Berkeley, who told me that I can toss a pencil.
Toss this pencil and record the face on which it lands. Assume the pencil is perfect, so assign
probability 1/10 to each side. Repeat this infinitely many times, independently. Consider the
event
A = {1, 2, 3, 4, 5, 6, 7, 8 will show up infinitely many times each} (15.4)
and show that P(A) = 1.
Answer. In Problem 13.6 we showed the same thing except that we were tossing a coin (two
outcomes). Now we toss a pencil (10 outcomes). If we repeat the argument we again obtain
that the event A above also has P(A) = 1.
?PROBLEM 15.6 (an uncountable zero-length set). Consider the set N of all real numbers x
between 0 and 1 whose decimal representation only uses digits 0 and 9. For example, this set
contains the numbers
0.09090909090909090909090909 · · ·
0.09900990099009900990099 · · ·
0.909009000900009000009000000900000009000000009 · · ·
0.0090909090009000909090090909999090909000009 · · ·
But
But then
P(X ∈ Nc ) ≥ P(A),
where A is the event (15.4) considered in Problem 15.5 where it was shown that P(A) = 1.
Hence P(X ∈ Nc ) ≥ 1, so P(X ∈ Nc ) = 1, by (AXIOM ONE). This means that P(X ∈ N) = 0 and
so, by (15.5), length(N) = 0.
We now define the notion of an absolutely continuous function.
    F′(x) = lim_{h→0} [F(x + h) − F(x)] / h,

whenever the limit exists. Moreover,

    ∫_a^b F′(x) dx = F(b) − F(a),

where the integral here is a so-called Lebesgue integral.
We won’t explain these things but we will again refer to G.B. Folland, Real Analysis . The
notion of a Lebesgue integral is different than that of a Riemann integral, the one you learned
in Calculus. However, for all practical purposes, they are equal in most cases. For example, if
a function is piecewise continuous the two integrals are the same.
Let us apply this to a probability measure Q. We say that Q is absolutely continuous if its
distribution function F is absolutely continuous. Hence, by Theorem 15.2:
?PROBLEM 15.7 (fundamental theorems of Calculus and the folklore above). How do the
fundamental theorems of Calculus compare to the above folklore, viz., to Theorem 15.2?
Answer. Let F be a continuous distribution function. Assume that it is piecewise differentiable,
that is, differentiable at all points except at a discrete set of points N. Denote by F′ its derivative function, arbitrarily defined on N. According to the second fundamental theorem of Calculus,

    ∫_a^b F′(x) dx = F(b) − F(a),   for all −∞ < a < b < ∞.          (15.6)
Mixtures
First, a definition.
Definition 15.4. We say that (the law Q_X of) a random variable X is the mixture of (the laws Q_{Y₁}, Q_{Y₂} and Q_{Y₃} of) the random variables Y₁, Y₂ and Y₃ if

    P(X ≤ x) = p₁ P(Y₁ ≤ x) + p₂ P(Y₂ ≤ x) + p₃ P(Y₃ ≤ x),   x ∈ R,          (15.7)

for some nonnegative numbers p₁, p₂, p₃ such that p₁ + p₂ + p₃ = 1. This can be stated for any number of Y_i's, even infinitely many.
?PROBLEM 15.8 (mixture is a relation between laws or between random variables). Let ξ
be a random variable with values in the set {1, 2, 3} and distribution P(ξ = i) = pi , i = 1, 2, 3.
Show that (15.7) is equivalent to
not only for B of the form (−∞, x], but also for any (Borel) set B ⊂ R.
Classification
There are 3 basic types of random variables.
1. Discrete random variable. The (law of the) random variable X is called discrete if there is a countable set C such that P(X ∈ C) = 1. In this case, the law of X is completely specified by the numbers p(x) = P(X = x), x ∈ C, because P(X ∈ B) = Σ_{x∈B∩C} p(x).
2. Absolutely continuous random variable. The (law of the) random variable X is called absolutely continuous if it has a probability density function f(x), x ∈ R. In this case, the law of X is completely specified by P(X ∈ B) = ∫_B f(x) dx = ∫_R 1_{x∈B} f(x) dx. In particular, P(X = x) = 0 for all x ∈ R.
3. Singularly continuous random variable The (law of the) random variable X is called
singularly continuous if P(X = x) = 0 for all x ∈ R but there is a zero-length set N ⊂ R such
that P(X ∈ N) = 1.
PROBLEM 15.9 (examples of the three basic types). Give an example of a random variable
that is (1) discrete, (2) absolutely continuous, (3) singularly continuous. Use Problems 15.5
and 15.6 to answer (3).
Answer. (1) Take, e.g., a bin(n, p) random variable.
(2) Take, e.g., an expon(λ) random variable.
(3) Let X1 , X2 , . . . be i.i.d. random variables, each with law unif({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}). Clearly,
the Xi represent the outcomes of tossing a 10-sided pencil as in Problem 15.5. Form the real
number whose decimal expansion has digits X₁, X₂, . . ., that is,

    X = 0.X₁X₂X₃ · · ·,   or   X = Σ_{i=1}^{∞} X_i / 10^i in calculus notation.
Since the right-hand side of this is the decimal expansion of a number that uses digits 0 or 9
only, we have
P(Z ∈ N) = 1,
where N is the set defined in Problem 15.6. But we showed that N is a zero-length set. Also, if x = 0.x₁x₂ · · · is a given number, 0 < x < 1, then

    P(Z = x) = P(X₁ = x₁) P(X₂ = x₂) · · · = (1/10)^∞ = 0.

So Z is singularly continuous.
Theorem 15.3 (decomposition of the law of any random variable). (The law of a) random
variable X in R is the unique mixture of 3 random variables: a discrete one, an absolutely
continuous one and a singularly continuous one.
This is, as usual, not proved here. Although it is obvious that a mixture of the three types
gives a random variable that is not necessarily one of the three types, the converse is not
obvious.
?PROBLEM 15.10 (it’s useless to differentiate a singular function). Explain why we can
define the derivative of the distribution function of a singularly continuous random variable
but that this derivative is not a distribution function, and hence useless.
Answer. Let Z be a singularly continuous random variable. Then there is a zero-length set N such that P(Z ∈ N) = 1. Let F(z) = P(Z ≤ z) be the distribution function of Z. To differentiate F(z), we will consider only those z that are not in N. (We can ignore, as explained, any zero-length set.) If z ∈ Nᶜ, let (we can do that) (z − ε, z + ε) ⊂ Nᶜ for small enough ε > 0. Since P(Z ∈ Nᶜ) = 0, we have F(z + h) = F(z) for all h with |h| < ε. And so F′(z) = 0. We thus have f(z) = F′(z) = 0 for z ∈ Nᶜ, and we can define it however we like for z ∈ N. But then

    ∫_a^b f(z) dz = 0

for all a, b, whereas P(Z ∈ (a, b)) > 0 if a and b are far apart. This is why F′(z) is useless.
PROBLEM 15.11 (continuous random variables). In Definition 15.2 we declared that a
random variable is continuous (it has non-atomic distribution) if P(X = x) = 0 for all x ∈ R.
Using Theorem 15.3 give an explicit expression for the class of continuous random variables
in terms of the three basic types.
Answer. If a random variable X is absolutely continuous or singularly continuous then
P(X = x) = 0 for all x ∈ R, so it is continuous. If a random variable is a mixture of an
absolutely continuous random variable and a singularly continuous one then, we again have
P(X = x) = 0. If X is a general random variable then, according to Theorem 15.3 it is a mixture
of the 3 basic types. But if the discrete part is present in the mixture then, clearly, P(X = x) > 0
for some x ∈ R, so the discrete part shouldn't be there if we want X to be continuous. We conclude:
A random variable is continuous ⇐⇒ it is a mixture of an absolutely continuous
random variable and a singularly continuous one.
We also defined, naively, the notion of independence and expectation and derived some
calculus rules in order to compute expectations and distributions of functions of random
variables.
In this chapter, we explained what we mean by random variables with general distribution.
We explained why their probability distribution cannot, in general, be described by probabilities
on individual points (see Section 15.1: Difficulty 1, explained in Problem 15.1, and Difficulty 2).
We then defined the concept of distribution function on R–see Section 15.2 and, in particular,
Theorem 15.1 that states that a distribution function uniquely defines a probability measure
and hence a random variable.
In the same section, Section 15.2, we also stated a generalization of the second fundamental
theorem of Calculus, that is, Theorem 15.2, that tells us when we can differentiate a distribution
function and obtain a useful derivative that can be integrated so that the distribution function
be recovered. We didn’t prove this theorem, but we have ?understood its statement. And,
indeed, having done so, we could easily solve Problem 15.7 that requires knowledge of the
second fundamental theorem of Calculus that you should know from Calculus.
In Section 15.3 we explained the three basic types of random variables and stated, as
Theorem 15.3, that a general random variable is a mixture of those three types.
(X1 , . . . , Xn ) : Ω → Rn
and ways to describe their laws (=distributions). Let’s quickly recall that by saying “law” or
"distribution" of (X₁, . . . , Xₙ) we mean the probability measure B ↦ P((X₁, . . . , Xₙ) ∈ B) on the Borel sets of Rⁿ. A convenient way to describe it is via the joint distribution function

    F(t₁, . . . , tₙ) = P(X₁ ≤ t₁, . . . , Xₙ ≤ tₙ).
It is easy to see that this function is increasing in each ti when the others are kept fixed. Note
that if we know this function then we can easily compute the probability that X is in a bounded
rectangle. We exemplify this when n = 2.
PROBLEM 15.12 (an essential property of 2-dimensional distribution function). Let (X1 , X2 )
be a random vector in R2 with distribution function F(x1 , x2 ). Consider the bounded rectangle
R = (a₁, b₁] × (a₂, b₂],

and show that

    P((X₁, X₂) ∈ R) = F(b₁, b₂) − F(a₁, b₂) − F(b₁, a₂) + F(a₁, a₂).          (15.9)
h := 1RA + 1RB .
This assigns values 1 or 2 in the regions shown on the left of Figure 15.3. Similarly, consider
the function
g := 1RC + 1RD ,
that takes values 0, 1, 2, as shown on the right of Figure 15.3. Hence it is obvious that
h − g = 1R .
Hence
1(X1 ,X2 )∈R = 1(X1 ,X2 )∈RA + 1(X1 ,X2 )∈RB − 1(X1 ,X2 )∈RC − 1(X1 ,X2 )∈RD ,
and now take expectations. We have E[1_{(X₁,X₂)∈R}] = P((X₁, X₂) ∈ R), E[1_{(X₁,X₂)∈R_A}] = P((X₁, X₂) ∈ R_A) = F(b₁, b₂), etc. Hence (15.9) holds.
Consider a bounded rectangle

    R = (a₁, b₁] × · · · × (aₙ, bₙ]

and let vertices(R) := {a₁, b₁} × · · · × {aₙ, bₙ} be its set of vertices. (The sets {a₁, b₁}, . . . , {aₙ, bₙ} have size 2 each, so their product has size 2 · · · 2 = 2ⁿ.) We define the sign of a vertex v ∈ vertices(R) by

    sgn(v) = +1, if the number of a_i appearing in v is even,
             −1, if the number of a_i appearing in v is odd.
We are now ready to define the concept of a distribution function without reference to a
random vector, just as we did on R in Definition 15.1.
Definition 15.5 (distribution function on Rn –the analog of Def. 15.1). We say that F : Rn → R
is a distribution function if it has the following properties:
1. First essential property. F(x1 , . . . , xn ) is increasing in each xi when the other arguments are
kept fixed and width(F) = 1.
2. Second essential property. For any bounded rectangle R = (a₁, b₁] × · · · × (aₙ, bₙ] we have

    Σ_{v∈vertices(R)} sgn(v) F(v) ≥ 0.

3. Conventional property. F is right-continuous, in the sense that for all x and all ε > 0 there exists a δ₀ > 0 such that F(x₁ + δ, . . . , xₙ + δ) ≤ F(x₁, . . . , xₙ) + ε for all 0 ≤ δ ≤ δ₀, and F(x₁, . . . , xₙ) → 0 when some x_i → −∞.
We now state the analog of Theorem 15.1, which says that if we are given a distribution function F on Rⁿ then there is a unique probability measure Q on the Borel sets of Rⁿ such that

    Q((−∞, t₁] × · · · × (−∞, tₙ]) = F(t₁, . . . , tₙ)

for all (t₁, . . . , tₙ) ∈ Rⁿ, and, therefore, some random variable X = (X₁, . . . , Xₙ) with law Q.
The things we discussed about random variables carry over to random vectors. But we
won’t expand on them. Rather, we just give a summary:
Summary
• Random vectors (X1 , . . . , Xn ) can be continuous (meaning that their law is nonatomic:
P((X1 , . . . , Xn ) = (x1 , . . . , xn )) = 0 for all (x1 , . . . , xn ) ∈ Rn ) and some continuous random vectors
have a probability density. These are called absolutely continuous random vectors.
• The concept of zero-length set in R and zero-area set in R2 generalizes to Rn . Rather
than saying “volume” we say n-volume. So 1-volume is length, 2-volume is area, 3-volume is
volume. We don’t need to define n-volume in general. We just need to define the n-volume of
a rectangle R = J₁ × · · · × Jₙ and the concept of a zero-n-volume set. We set

    n-vol(J₁ × · · · × Jₙ) := length(J₁) · · · length(Jₙ).

This is obvious: what else can we do other than multiply the lengths of the sides? We also say
that the set N ⊂ Rn is a zero-n-volume set if for all ε > 0 we can find a sequence of rectangles
such that N is included in their union and the sum of the n-volumes of these rectangles is at
most ε.
• If (X₁, . . . , Xₙ) has density f(x₁, . . . , xₙ) and we change this density on a zero-n-volume set, then the changed function is also a density for (X₁, . . . , Xₙ).
• There are continuous random vectors that are not absolutely continuous. These random
vectors are easier to come by when n ≥ 2 because there are plenty of zero-n-volume sets in
dimension n ≥ 2. See Figure 15.1.
?PROBLEM 15.13 (a continuous but not absolutely continuous random vector). Give an
example of a random vector (X1 , X2 ) that is continuous but not absolutely continuous.
Answer. Let Z be an absolutely continuous random variable (in R), e.g., let Z be N(0, 1). Then
define
(X1 , X2 ) = (Z, Z).
Look at Figure 15.4.
Figure 15.4: Left: density of Z. Right: (Z, Z) is not absolutely continuous, so it does not have
density on R2 .
PROBLEM 15.14 (continuation of Problem 15.13). Let Z be a unif([0, 1]) random variable. Then (X₁, X₂) := (Z, Z) is not absolutely continuous. In fact, P((Z, Z) ∈ L) = 1, where L is the diagonal line, which is a zero-area set. But (Z, Z) does have a distribution function. Attempt to sketch the distribution function of (Z, Z).
Answer.
Figure 15.5: If we now take Z to be uniform on [0, 1], then (Z, Z) is continuous but, as explained above, it has no density, so it is not absolutely continuous. Of course, its distribution function F(x₁, x₂) is continuous and its plot is very easy: it is a pyramid with base the square [0, 1] × [0, 1] and apex the point (1, 1, 1).
15.6 Beyond Rn
We know that we need to study (and we have), not only about finitely many random variables
X1 , . . . , Xn , that is, a random vector X = (X1 , . . . , Xn ) in Rn , but also about infinitely many
random variables X1 , X2 , . . ., that is, an infinite-dimensional random vector X = (X1 , X2 , . . .) in
R∞ ≡ RN (= the set of sequences of real numbers).
We stated, in Theorem 12.1, and its equivalent form of Theorem 12.2, that such infinite-
dimensional random vectors, under the i.i.d. assumption, do exist.
And we understood that such things are important, else we cannot talk about tossing a
coin infinitely many times, something that we should absolutely be able to, else we can’t do
probability or statistics.
We will also understand in Chapter 17 that we need to consider calculating probabilities of
events such as “a random sequence converges” because this is precisely what the Strong Law
of Large Numbers is about.
But we can’t approach the study of infinite-dimensional random vectors X = (X1 , X2 , . . .)
by things like densities. The reason being that, whereas R has a function called “length”,
enabling us to define a one-variable density f (x), and whereas R2 has a function called “area”,
enabling us to define a two-variable density f (x1 , x2 ), and whereas Rn has a function called
“n-volume”, enabling us to define an n-variable density f (x1 , x2 , . . . , xn ), the space R∞ does
not have an ∞-volume. We can (and often do) talk about density of X = (X1 , X2 , . . .) but we
have to specify “with respect to what”. This is important, but we won’t learn this here. We
will simply trust Theorem 12.1 and accept that i.i.d. sequences do exist and move on.
The problem that a novice has is that he or she approaches the subject of probability/statistics
computationally: Compute the density of (X1 , X2 ), compute the expectation of g(X1 , X2 , X3 ),
etc. So the novice (and often his/her teachers) thinks that if something can’t be computed
then it either doesn’t exist or that it’s useless. This (almost religious) belief shoves away all
interesting things and prevents the novice from ever obtaining the skills necessary to do the
job properly.
PROBLEM 15.15 (use the normal table). Let X1 , X2 , . . . be i.i.d. random variables with
common N(0, 1) law. Use Table 14.1 to calculate the probability

    P(X₁ ≤ 2, X₂ ≤ 2, . . .).

More generally, what is

    P(X₁ ∈ B, X₂ ∈ B, . . .)?

Answer. By independence, P(X₁ ≤ 2, . . . , Xₙ ≤ 2) = F(2)ⁿ = (0.9772)ⁿ, which tends to 0 as n → ∞, so P(X₁ ≤ 2, X₂ ≤ 2, . . .) = 0. If B is any interval other than R then P(X_i ∈ B) < 1 (strictly). The product of a positive number
that is strictly smaller than 1 by itself infinitely many times is zero. So P(X1 ∈ B, X2 ∈ B, . . .) = 0
as well.
But

    P(Xₙ ≤ 2 log n) = P(X₁ ≤ 2 log n) = 1 − e^{−2 log n} = 1 − 1/n² = (n − 1)(n + 1)/n².

Hence the product equals

    ∏_{n=2}^{∞} (n − 1)(n + 1)/n² = lim_{N→∞} ∏_{n=2}^{N} (n − 1)(n + 1)/n²
    = lim_{N→∞} ( ∏_{n=2}^{N} (n − 1)/n ) ( ∏_{n=2}^{N} (n + 1)/n ) = lim_{N→∞} (1/N) · (N + 1)/2 = lim_{N→∞} (1/2) · (N + 1)/N.

The last fraction converges to 1, so the product converges to 1/2, that is,

    P(X₂ ≤ 2 log 2, X₃ ≤ 2 log 3, X₄ ≤ 2 log 4, . . .) = 1/2.
Chapter 16
Expectation, unadulterated
Let X : Ω → R be a random variable with real values. Recall that this means that there is a
class E of events such that X respects these events in the sense that sets of the form {X ≤ x} are
events.
If P is a probability measure on the events, then we have talked about the expectation E(X)
in two cases: when X is discrete, in which case we set E(X) = Σ_{x∈X(Ω)} x P(X = x), and when X is absolutely continuous with density f, in which case we set E(X) = ∫_R x f(x) dx. When we want to emphasize the role of P we write E_P(X).
PROBLEM 16.1 (expectation under P and under Q). Let Ω = {H, T} and X(H) = 1, X(T) = −1.
We take as E all 4 subsets of Ω. Take two probability measures, P and Q, defined by P{H} = p,
P{T} = 1 − p, and Q{H} = q, Q{T} = 1 − q. What is EP (X)? What is EQ (X)?
Answer.
EP (X) = 1 · P(X = 1) + (−1) · P(X = −1) = p − (1 − p) = 2p − 1
EQ (X) = 1 · Q(X = 1) + (−1) · Q(X = −1) = q − (1 − q) = 2q − 1
But what do we do in general and why should we care?
First, let me explain why we should care.
1) A first great reason is that it is simpler to think in general than consider all possible
special cases.
2) A second reason is that we don’t always know if X is discrete or if it is absolutely
continuous or neither.
5) A fifth reason is that we’re in the 21st century. This means that we can’t be doing the
same things they were doing in the 19th, can we? If you make the analogy, you now have "smart" phones but they only had the telegraph, if at all. Why shouldn't the same
thing apply in maths and stats?
    x − 1 < ⌊x⌋ ≤ x.

⌊x⌋ is the best integer approximation to x from below. But let's say that we want a better approximation; that is, if ε > 0 is small we might want the best approximation from below by an integer multiple of ε. We then define ⌊x⌋_ε to be this number. For example, ⌊π⌋ = 3, ⌊π⌋_{0.01} = 3.14. In fact, the two are related by

    ⌊x⌋_ε = ε ⌊x/ε⌋,

and

    lim_{ε→0} ⌊x⌋_ε = x.

If X is a general random variable, ⌊X⌋_ε is a discrete random variable. Hence we know what its expectation is. We then attempt to define

    E(X) := lim_{ε→0} E(⌊X⌋_ε).

This works; well, almost. Sometimes even the expectation of a discrete random variable may not exist: see Problem 9.7. To avoid this problem we may assert that
Yes, that will work, but since the expectation of a positive random variable may be infinity,
and we wish to deal with finite numbers, we simply truncate below a large number, say 1/ε,
before taking expectation. So we adopt the following definition.
But ⌊X⌋_ε → X is one way to approximate X by discrete random variables. What tells us
that if we have another approximation, say Xε → X, we will not get a different limit for E(Xε )?
This is the same dilemma that Archimedes could have faced in deriving that the area of the
circle of radius r is πr2 . That is, Archimedes probably wondered if the specific approximation
of a circle, e.g., by rectangles, as in Figure 14.4, or by regular polygons, gave exactly the
same limit, namely πr2 . He probably convinced himself that the limit is independent of the
approximation. He couldn’t, however, have proved that because he did not have the means
to do so. It took another 2 thousand years to be able to justify Archimedes’ hunch. Indeed, it
works, and the same thing holds about the definition above. The definition is good because
(note that we replace ε by 1/n where n are integers):
We will not prove this theorem here, just as we proved no theorems in this course, but at
least we can understand what its statement is.
What is important is the following set of consequences of Definition 16.1 and Theorem 16.1.
PROPERTIES OF EXPECTATION
Linearity. If a, b are real numbers and X, Y random variables then
E(aX + bY) = aE(X) + bE(Y).
The reason for this is that linearity holds for discrete random variables. So it holds in the limit.
Monotonicity. If X, Y are random variables such that X ≤ Y then
E(X) ≤ E(Y).
The reason for this is that P never takes negative values!
Another thing we get out of this is that what we wondered above (see reason 3) is
true under some assumptions.
provided that
PROBLEM 16.2 (scaling of exponential r.v.). Let τ(λ) be an expon(λ) random variable. Explain why, for λ > 0,

    (d/dλ) E[cos τ(λ)] = (1/λ) E[τ(λ) sin(τ(λ))].

Hint: Recall that if σ is expon(1) then τ(λ) has the same law as σ/λ; see Problem 14.11.
Answer. Since τ(λ) has the same law as σ/λ, we can replace the former by the latter when considering the expectation of a function of it. Observe that the derivative of the function

    g(λ) = cos(σ/λ)

is

    g′(λ) = (d/dλ) cos(σ/λ) = (σ/λ²) sin(σ/λ).
But a derivative is a limit:

    (d/dλ) cos(σ/λ) = lim_{h→0} [g(λ + h) − g(λ)] / h.

By the mean value theorem,

    [g(λ + h) − g(λ)] / h = g′(θ(h))

for some θ(h) between λ and λ + h. Hence

    (d/dλ) cos(σ/λ) = lim_{h→0} g′(θ(h)).
Now let X_h := g′(θ(h)). The X_h play the same role as the Xₙ in Theorem 16.2. But

    |X_h| = |g′(θ(h))| ≤ σ/θ(h)².

Now θ(h) is between λ − |h| and λ + |h|. Since λ > 0, if we let |h| < λ/2 we have λ − |h| > λ/2, so θ(h) > λ/2 for |h| < λ/2, and so

    |X_h| ≤ σ/(λ/2)² =: Z,   for all |h| < λ/2.
Obviously, E(Z) < ∞. So, by (ii) of Theorem 16.2 we have
    E[ lim_{h→0} (g(λ + h) − g(λ))/h ] = lim_{h→0} E[ (g(λ + h) − g(λ))/h ].

The left-hand side of this is E[g′(λ)] = E[(σ/λ²) sin(σ/λ)] = (1/λ) E[τ(λ) sin τ(λ)]. The right-hand side is lim_{h→0} (E[g(λ + h)] − E[g(λ)])/h = (d/dλ) E[g(λ)].
?PROBLEM 16.3 (integrating the tail gives the expectation). (See Problem (9.14) also.) Let
X be a positive random variable. Explain why
E(X) = ∫_0^∞ P(X > t) dt.
Students in very elementary probability are asked to explain this. In these notes, it appears as
the equality between (9.2) and (9.1). See Problem 9.3.
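For a concrete check (a sketch only, assuming NumPy, and using an expon(λ) variable for which both sides equal 1/λ):

import numpy as np

rng = np.random.default_rng(1)
lam = 3.0
x = rng.exponential(1 / lam, size=10**5)          # X ~ expon(lam), so E(X) = 1/lam

t = np.linspace(0, 10 / lam, 501)                 # grid on which to estimate the tail P(X > t)
tail = np.array([(x > s).mean() for s in t])      # empirical tail probabilities
integral = np.trapz(tail, t)                      # numerical version of the integral of P(X > t) dt

print(x.mean(), integral, 1 / lam)                # all three numbers should be close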
PROBLEM 16.4 (the law of the unconscious statistician, discretely). Let Ω be the set of all
the 2^d subsets of {1, . . . , d}. (This sample space was considered in Problem 8.10 of elementary
probability.) Give Ω the uniform probability P, that is, assign probability P{ω} = 1/2^d to each
ω ∈ Ω. Consider now the random variable X(ω) := |ω|, the number of elements of ω. The left-hand
side of (16.1) is then
E(X) = (1/2^d) ∑_{ω∈Ω} |ω|.
This is the sum of the sizes of all subsets divided by their total number. The right-hand side of
(16.1) is
∑_{x∈X(Ω)} x P(X = x) = (1/2^d) ∑_{k=0}^{d} k C(d, k),
where C(d, k) denotes the binomial coefficient.
This is because the image of Ω under X is X(Ω) = {0, 1, . . . , d} and because the law of X (under
the uniform probability measure P on Ω) is a bin(d, 1/2). Canceling the 1/2^d factor, (16.1) says
∑_{ω∈Ω} |ω| = ∑_{k=0}^{d} k C(d, k).
(2) This holds because it counts the sum of sizes of all subsets (left-hand side) by classifying
subsets according to their sizes (right-hand side).
(3) The expectation of a bin(d, 1/2) random variable is d/2.
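A brute-force check of (1)–(3) for a small d (a sketch; only itertools and math.comb are needed):

from itertools import combinations
from math import comb

d = 6
subsets = [s for k in range(d + 1) for s in combinations(range(1, d + 1), k)]

lhs = sum(len(s) for s in subsets)                 # sum of the sizes of all subsets
rhs = sum(k * comb(d, k) for k in range(d + 1))    # classify the subsets by their size
print(lhs, rhs)                                    # equal, both = d * 2**(d-1)

print(lhs / 2**d, d / 2)                           # E(X) under the uniform measure = d/2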
(We can go on and make a longer chain if we wish.) Consider a set of events Ei on each Ωi . If
these functions “respect the events” (this is called measurability) then we can think of them as
random variables. That is G1 is a random variable on Ω0 , G2 is a random variable on Ω1 , and
so on. In addition, we can compose these functions, and have, for example,
Theorem 16.3 (the law of the unconscious statistician). In the above situation, if Ω1 =
Ω2 = Ω3 = R, we have
E_{P_0}[G_3(G_2(G_1))] = E_{P_1}[G_3(G_2)] = E_{P_2}[G_3].
This is really a theorem about “change of variables”. Proving it, we really have to prove it
for two stages only. Consider
Ω --X--> R --g--> R,
where P is the probability measure on Ω and Q is the law of X on R.
OK, I changed letters. I used Ω for Ω0 , I used X for G1 and g for G2 , and I set Ω1 and Ω2 equal
to R, as we are required to do. The law of the unconscious statistician states that
E_P{g(X)} = E_Q{g}.
But we know that it holds for discrete random variables. In particular, for each ε > 0 we have
that ⌊g(X)⌋_ε and ⌊g⌋_ε are discrete random variables. We then have (this is an application of
(16.1))
E_P{⌊g(X)⌋_ε} = E_Q{⌊g⌋_ε}.
The discussion at the first part of Section 16.1 aims to convince you that taking limits as ε → 0
in the last display gives the display above it: E_P{g(X)} = E_Q{g}. It is, really, a very simple
matter.
New notation
Instead of writing E_P(X) many people write ∫_Ω X(ω) P(dω) or simply ∫_Ω X dP.
If Q is a probability measure on R that is given by a density f(x),
Q(B) = ∫_B f(x) dx,
that says
E_P[X] = E_Q[id].
So what do we do? This is rather ugly. Independence should not rely on our ability to
know whether, jointly, a bunch of random variables have density or not.
I We also hinted that we can talk about the independence of infinitely many random variables.
And also, in our FORESIGHTS section we stated, as Theorem 12.1, that if we are given a
probability distribution Q then we can find an infinite sequence of independent random
variables with common law Q and we stated that this is deep, but I am sure you don’t see
what it’s deep.
So, how do we revamp the notion of independence and define it once and for all?
for any integer n ≥ 2, for any finite set {t1 , . . . , tn } ⊂ T of size n, and for any n sets B1 , . . . , Bn .
Some facts.
I If {Xt , t ∈ T} are independent then {Xt , t ∈ S} are independent for any S ⊂ T.
I If {Xt , t ∈ T} are independent and S1 , . . . , Sk are finitely many disjoint finite subsets of T
then the random vectors (Xt , t ∈ S1 ), . . . , (Xt , t ∈ Sk ) are independent. Moreover, for given
R-valued functions g1 , . . . , gk , the random variables G1 = g1 (Xt , t ∈ S1 ), . . . , Gk = gk (Xt , t ∈ Sk )
are independent. And then
PROBLEM 16.5 (if you’re independent of yourself then you’re not random). Let X be a real
random variable that is independent of itself. Show that it is a constant, i.e., that there is a real
number c such that P(X = c) = 1.
Answer. If X is independent of X then P(X ≤ x, X ≤ x) = P(X ≤ x)P(X ≤ x). But the left-hand
side is P(X ≤ x). Hence P(X ≤ x) = P(X ≤ x)². The only real numbers that are equal to their own
square are 0 and 1. So P(X ≤ x) is equal to 0 or 1 for all x. Since P(X ≤ x) → 1 as x → ∞ and
→ 0 as x → −∞, we can define c to be the largest x such that P(X < x) = 0. Then P(X < c) = 0
and P(X ≤ c) = 1. Hence P(X = c) = 1 − 0 = 1.
@@ Find a way to present Fubini’s theorem
@@ State and explain E(X) = ∫_0^∞ P(X > x) dx if X > 0.
g(x) = ax + b, x ∈ R
(I don't call this linear because, in general, b is not 0), whose graph is a straight line, then
E[g(X)] = g(EX).
If we take two affine functions, g_i(x) = a_i x + b_i, i = 1, 2, then, since max(g_1(X), g_2(X)) ≥ g_i(X),
we have E[max(g_1(X), g_2(X))] ≥ E[g_i(X)], for i = 1, 2, and so
This is true for any number of affine functions, even for uncountably many ones. A function
of the form
g = sup gt ,
t∈T
where each gt is affine and T is any set is called convex. Examples of convex functions are
g(x) = x2 , g(x) = ex , g(x) = e−x , g(x) = − log x, g(x) = |x|. Applying the observation above we
find
Jensen’s inequality: E[g(X)] ≥ g(EX), for any convex function g.
This applies to random vectors too. If g(x1 , . . . , xn ) is a convex function of n variables and
(X1 , . . . , Xn ) is a random vector then
16.4.2 Moments
If X is a random variable and k a positive integer, we define
m_k = k-th moment of X := E(X^k),
provided it exists. Of course, m0 = 1. Even moments are nonnegative, odd moments are
signed. Moments are sometimes important for several reasons, one of which being that
sometimes, moments determine the distribution.
Theorem 16.4 (moments define a unique probability law). If m1 , m2 , . . . are such that the
series
∑_{k=0}^∞ m_k z^k
converges on |z| < ρ for some ρ > 0, then there is only one law whose moments are m_1, m_2, . . ..
We can define moments for real exponents as well, but then we must be careful about the
sign of the random variable. So we deal with absolute moments if the exponent is real, namely,
μ_p := E(|X|^p).
Note that this could be finite or infinite. We have the moments inequality
μ_p^{1/p} ≤ μ_q^{1/q} if 0 < p < q.
And here is why. If 0 < p < q then the function g(x) = xq/p , x ≥ 0, is convex. Hence
E[g(Z)] ≥ g(EZ) for any random variable Z. Let Z = |X|p . Then g(Z) = (|X|p )q/p = |X|q and
g(EZ) = (E[|X|p ])q/p . So E[|X|q ] ≥ (E[|X|p ])q/p . Raise this to the power 1/q to get the inequality
above.
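One can see the inequality in action numerically (a sketch; any random sample will do, here a lognormal one, with NumPy assumed):

import numpy as np

rng = np.random.default_rng(2)
x = rng.lognormal(0.0, 1.0, size=10**6)                     # a positive random sample

ps = [0.5, 1, 2, 3, 4]
norms = [np.mean(np.abs(x) ** p) ** (1 / p) for p in ps]    # mu_p^(1/p) for each p
print(norms)                                                # nondecreasing in p, as the inequality says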
Of special importance is the second moment
m_2 = μ_2 = E(X²),
and the variance
var(X) := E[(X − EX)²] = E(X²) − (EX)²,
where the equality follows from expanding the square on the right. The standard deviation
of X is given by
stdev(X) := √(var X).
We also define the inner product or correlation between two random variables X, Y by
hX, Yi = E(XY), (16.2)
and their covariance by
cov(X, Y) = hX − EX, Y − EYi = E((X − EX)(Y − EY)) = E(XY) − (EX)(EY),
where the second equality follows by expanding the product on the left.
The correlation coefficient1 between X and Y is the number
corr(X, Y) = cov(X, Y) / (stdev(X) stdev(Y)).
We have
− 1 ≤ corr(X, Y) ≤ 1 (16.3)
and this is the Cauchy-Schwarz inequality, stated as follows.
(E(UV))² ≤ E(U²)E(V²),     (16.4)
for any two random variables U, V. To see this, first take the obvious true statement:
0 ≤ (tU + V)² = U²t² + 2UVt + V².
So
0 ≤ E[(tU + V)²] = E(U²)t² + 2E(UV)t + E(V²).
Multiply both sides by E(U²) and add and subtract (E(UV))² to get
0 ≤ E(U²)²t² + 2E(UV)E(U²)t + (E(UV))² − (E(UV))² + E(U²)E(V²)
  = [E(U²)t + E(UV)]² − [(E(UV))² − E(U²)E(V²)].
This is true for all t ∈ R. Assuming that E(U²) > 0, we can choose t so that
E(U²)t + E(UV) = 0, so
0 ≤ −[(E(UV))² − E(U²)E(V²)],
which is (16.4). But if E(U²) = 0 then P(U = 0) = 1, so the inequality is trivial: 0 ≤ 0. To get
(16.3), set U = X − EX and V = Y − EY.
Since the correlation coefficient is always between −1 and 1 there is an angle θ such that
corr(X, Y) = cos(θ). So we define the angle between X − EX and Y − EY by
θ = arccos corr(X, Y).
We can agree that −π ≤ θ(X, Y) < π.
We say that X, Y are uncorrelated if corr(X, Y) = 0. This means that θ(X, Y) = ±π/2
and so we can say that the angle between X − EX and Y − EY is ±π/2. We summarize
1
Some people use the term “correlation” for what I call “correlation coefficient” and use another name for
what I call inner product. But names, just as influencers, are a dime a dozen, they come and go. Please use my
terminology.
Z ≥ Z 1_{Z>t} ≥ t 1_{Z>t}.
Hence
E(Z) ≥ E(t 1_{Z>t}) = t P(Z > t).

P(|X − EX| > t) ≤ var(X)/t².
Here is why. We have P(|X − EX| > t) = P((X − EX)² > t²) ≤ E[(X − EX)²]/t², by Markov's
inequality.
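A small empirical illustration of Chebyshev's bound (a sketch, assuming NumPy; the bound is usually far from tight):

import numpy as np

rng = np.random.default_rng(3)
x = rng.exponential(1.0, size=10**6)      # EX = 1, var(X) = 1

for t in [1.0, 2.0, 3.0]:
    empirical = (np.abs(x - 1.0) > t).mean()
    bound = 1.0 / t**2                    # var(X)/t^2
    print(t, empirical, bound)            # the empirical probability never exceeds the bound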
?PROBLEM 16.6 (a zero variance r.v. is trivial). Show that if var(X) = 0 then P(X = EX) = 1.
Answer. By Chebyshev’s inequality, P(|X − EX| > t) = 0 for all t > 0. Taking the limit as t → 0
we find P(|X − EX| > 0) = 0, that is P(|X − EX| = 0) = 1.
Read this as follows: If we know the expectation of gx (X) for all x then we know F(x) and so
we know the distribution of X.
So, it is conceivable that
there are families, say G, of functions such that if we know the expectation of g(X)
for each g ∈ G then we know the distribution of X.
Example 1: The class G = {1(x,∞) : x ∈ R} specifies the distribution of any random variable.
Example 2: Set p_k(u) = u^k. The class of functions P = {p_k : k = 0, 1, . . .} specifies the
distribution of a random variable under the conditions of Theorem 16.4.
We will consider two special families of functions.
• Set hs (x) = sx and consider the class H = {hs : s ∈ [0, r)} for some r > 0. This will lead us
to the concept of probability generating functions.
Definition 16.3 (probability generating function). Let X be a random variable with values
in Z+ = {0, 1, 2 . . .}, the set of nonnegative integers. Define its probability generating function
by
G(s) = E[s^X] = ∑_{k=0}^∞ p_k s^k, where p_k = P(X = k), k = 0, 1, . . .
• Set wt (x) = etx and consider the class W = {wt : −a < t < b} for some −a < 0 < b. This
will lead us to the concept of moment generating functions.
Definition 16.4 (moment generating function). Let X be a random variable with values in R.
Define its moment generating function by
Note that M(t) cannot, in general, be expressed as a sum or as an integral against a density
because a general random variable X is not necessarily discrete nor absolutely continuous.
Nevertheless, M(t) always exists for t ∈ F. We will see that, in some cases, F = {0}, in which
case M(t) is useless (being the function that is 1 at t = 0 and ∞ for t ≠ 0).
pn = P(X = n), n = 0, 1, . . .
This series may or may not converge (remember criteria for convergence from your Calculus)
depending on s. When it does, it defines a function
G(s) = ∑_{k=0}^∞ p_k s^k = E s^X,     (16.6)
for those s for which the series converges, and G is called the probability generating function.
Note that if |s| < 1 then certainly the series converges and converges absolutely because
∑_{k=0}^∞ |p_k s^k| = ∑_{k=0}^∞ p_k |s|^k ≤ ∑_{k=0}^∞ p_k = 1.
The radius of convergence r is the maximum r for which the series converges on |s| < r. We can
find r either by the root test or by looking at the formula above whose denominator becomes
0 when s = 1/(1 − p), meaning that at this point G has a pole (it becomes ∞). Hence the radius
of convergence is r = 1/(1 − p). (All that is standard Calculus material.)
Remark 16.2. Even though the series ∑_{k=0}^∞ p_k s^k has a radius of convergence, the function
obtained can be extended to a larger domain. For example, in the problem above, the series
converges on |s| < 1/(1 − p); the resulting formula, after performing the summation, namely
ps/(1 − (1 − p)s), is defined for all s except s = 1/(1 − p), that is, it is an extension. At this special
point we may define it to be ∞. When we speak of a probability generating function we mean
the extension of the function.
?PROBLEM 16.8 (differentiation of power series). If G(s) is the function defined by the
series (16.6), explain why
d^m/ds^m G(s) = ∑_{k=0}^∞ p_k (d^m/ds^m) s^k.
Answer. The series ∑_{k=0}^∞ p_k s^k converges uniformly with respect to s over the set {s : |s| < 1}. In
this case, as we learned in Calculus, we can differentiate the power series, as many times as
we like, term by term, and we will be getting the derivatives of G(s).
Recall the notion of m-falling factorial–see (8.1)–of a number k, where m is a positive integer:
(k)_m = k(k − 1) · · · (k − m + 1) (we need to set (k)_0 = 1). When m = k we have another notation:
(m)_m = m!. These numbers appear when we differentiate the monomial s^k m times:
d^m/ds^m s^k = k(k − 1) · · · (k − m + 1) s^{k−m} = (k)_m s^{k−m}, if k ≥ m,
and
d^m/ds^m s^k = 0, otherwise.
The first term of the series above, corresponding to k = m, does not depend on s. Let us rewrite
d^m/ds^m G(s) = m! p_m + ∑_{k=m+1}^∞ (k)_m p_k s^{k−m}.
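Setting s = 0 kills the second term and gives G^(m)(0) = m! p_m, so the probabilities can be read off the derivatives of G at 0. A symbolic illustration (a sketch using SymPy; the pgf ps/(1 − (1 − p)s) from the geometric-type problem mentioned above is assumed):

import sympy as sp

s, p = sp.symbols('s p', positive=True)
m = 3
G = p * s / (1 - (1 - p) * s)                         # pgf of a geometric(p) r.v. on {1, 2, ...}

p_m = sp.diff(G, s, m).subs(s, 0) / sp.factorial(m)   # G^(m)(0) / m!
print(sp.simplify(p_m))                               # p*(1 - p)**2, i.e. P(X = 3)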
The reason is simple: s^{X_1+···+X_n} = (s^{X_1}) · · · (s^{X_n}). Since X_1, . . . , X_n are independent, so are
s^{X_1}, . . . , s^{X_n}. The expectation of the product of independent random variables is the product of
their expectations. See facts on the expectation above.
?PROBLEM 16.9 (probability generating functions of some common discrete r.v.s). Find
expressions for the probability generating function of the r.v. X when X is
(0) unif({k1 , . . . , kn })
(1) Ber(p)
(2) bin(n, p)
(3) Poi(λ)
What is the radius of convergence of each power series?
Answer. (0) We have p_{k_1} = p_{k_2} = · · · = p_{k_n} = 1/n, so
E s^X = (1/n)(s^{k_1} + · · · + s^{k_n}).
(1) We have p_1 = p, p_0 = 1 − p, so
E s^X = ps + (1 − p).
(2) We have p_m = C(n, m) p^m (1 − p)^{n−m}, for m = 0, 1, . . . , n. So
E s^X = ∑_{m=0}^n C(n, m) s^m p^m (1 − p)^{n−m} = ∑_{m=0}^n C(n, m) (sp)^m (1 − p)^{n−m} = (sp + (1 − p))^n.
(3) We have p_m = (λ^m/m!) e^{−λ}, for m = 0, 1, . . ., so
E s^X = ∑_{m=0}^∞ s^m (λ^m/m!) e^{−λ} = e^{−λ} ∑_{m=0}^∞ (λs)^m/m! = e^{sλ} e^{−λ} = e^{−λ(1−s)}.
All radii of convergence are ∞. This is obvious for (0), (1), (2), because they're all finitely-valued
random variables. For (3) it follows from the fact that ∑_{m=0}^∞ z^m/m! converges for all z (easily
shown by convergence criteria for series–intuitive too because m! grows very fast and it is in
the denominator).
PROBLEM 16.10 (probability generating function of the sum of independent Poisson r.v.s).
Let N1 , N2 be two independent Poisson r.v.s with laws Poi(λ1 ), Poi(λ2 ), respectively. Determine
the probability generating function of their sum. What do you observe?
Answer.
E s^{N_1+N_2} = (E s^{N_1})(E s^{N_2}) = e^{−λ_1(1−s)} e^{−λ_2(1−s)} = e^{−(λ_1+λ_2)(1−s)}.
We observe that the probability generating function of N1 + N2 is that of a Poi(λ1 + λ2 ) random
variable.
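A simulation supporting this observation (a sketch; NumPy assumed):

import numpy as np
from math import exp, factorial

rng = np.random.default_rng(4)
lam1, lam2, n = 1.5, 2.5, 10**6
s = rng.poisson(lam1, n) + rng.poisson(lam2, n)     # N1 + N2, with N1, N2 independent Poissons

for k in range(5):
    empirical = (s == k).mean()
    poisson_pk = exp(-(lam1 + lam2)) * (lam1 + lam2)**k / factorial(k)
    print(k, empirical, poisson_pk)                 # the two columns should agree closely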
Question: Can we conclude that, in the above problem, N1 + N2 is itself a Poi(λ1 + λ2 )
random variable? Indeed we can because:
The reason is easy. We know that G(s) = EsX is defined and analytic on |s| < 1. This means
that G(s) is given by an infinite Taylor series (expansion around 0):
G(s) = ∑_{m=0}^∞ (G^{(m)}(0)/m!) s^m,
where G^{(m)}(0) is the m-th derivative of G(s) at s = 0. Let G_1(s) = E s^X, G_2(s) = E s^Y and assume
they are equal: G_1(s) = G_2(s). Then G_1^{(m)}(0) = G_2^{(m)}(0) for all m and, by (16.7), we have
P(X = m) = P(Y = m) for all m. That is, X =_d Y.
So, in relation to Problem 16.10, we have this very special, very important and very much
treasured property:
This says that s^{S_N} = s^{S_n} if N = n. Duh! Using independence between S_n and N we have
E s^{S_N} = ∑_{n=0}^∞ (E s^{S_n}) P(N = n).
But S_n is a bin(n, p) r.v. By Problem 16.9, E s^{S_n} = (sp + (1 − p))^n. So
E s^{S_N} = ∑_{n=0}^∞ (sp + (1 − p))^n P(N = n) = E[(sp + (1 − p))^N].
Since N is Poi(λ), E[z^N] = e^{−λ(1−z)}; taking z = sp + (1 − p) we have 1 − z = p(1 − s). We thus found
E s^{S_N} = e^{−λp(1−s)}.
But this is the probability generating function of a Poi(λp) random variable. Hence the
distribution of S_N is Poi(λp).
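The same conclusion can be checked by simulation (a sketch, assuming NumPy; N is taken Poi(λ) and, given N = n, S_N is the number of heads in n tosses of a p-coin, as in the derivation):

import numpy as np
from math import exp, factorial

rng = np.random.default_rng(5)
lam, p, trials = 4.0, 0.3, 10**6

N = rng.poisson(lam, trials)                     # number of tosses in each trial
SN = rng.binomial(N, p)                          # given N = n, S_N is bin(n, p)

for k in range(5):
    print(k, (SN == k).mean(), exp(-lam * p) * (lam * p)**k / factorial(k))   # ~ Poi(lam*p) pmf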
The first sum converges when |s| < 1 and the second sum converges when |1/s| < 1. But no s
satisfies both |s| < 1 and |1/s| < 1. So we really need to have established convergence of both series
on bigger domains. Luckily, we know from Calculus the concept of radius of convergence. If
the first series converges for |s| < R and the second for |1/s| < r and if rR > 1 then both series
converge for 1/r < |s| < R. The point of this discussion is that
and call it moment generating function. But just like in §16.8.1, this may be infinity. The only
t for which M(t) is for sure defined is t = 0. Indeed,
M(0) = Ee0 = 1.
This is useless information. It can be shown that the set of real t’s for which M(t) is finite is an
interval. Here is why. Write
If the first term on the right is finite for some t then it is finite for all smaller t, because, when
X ≥ 0, etX decreases when t decreases. Hence the first term is finite on an interval −∞ < t ≤ b
for some b ≥ 0. Similarly, the second term is finite on an interval a ≤ t < ∞ for some a ≤ 0.
Putting both together, we have that EetX < ∞ on an interval a ≤ t ≤ b.
Definition 16.5 (useless moment generating function). We say that the random variable X
has a useless moment generating function if EetX < ∞ only when t = 0.
If M(t) = EetX is finite for some interval that includes 0 in its interior then E|X|k < ∞ for all
k = 1, 2, . . . and
M(t) = ∑_{k=0}^∞ (t^k/k!) E(X^k)
for all t on this interval. Moreover, we can recover the moments of X via
To understand why this is true, we remember that e^{tx} is the limit of the polynomials
p_n(tx) = ∑_{k=0}^n (tx)^k/k!, and so e^{tX} = lim_{n→∞} p_n(tX).
It is easy to see that the |p_n(tX)| are upper bounded by a positive random variable with finite
expectation: |p_n(tX)| ≤ e^{tX} + e^{−tX}, and E(e^{tX} + e^{−tX}) < ∞. This is enough (but I'm not telling
you why) to ensure that
E[ lim_{n→∞} p_n(tX) ] = lim_{n→∞} E[p_n(tX)].
Putting these things together we have the formula M(t) = ∑_{k=0}^∞ (t^k/k!) E(X^k).
Next observe that the right side of this last formula is a power series. You know, from
Calculus, that a power series can be differentiated term-by-term, that is,
d^ℓ/dt^ℓ M(t) = ∑_{k=0}^∞ (d^ℓ/dt^ℓ)(t^k/k!) E(X^k).
But
d^ℓ/dt^ℓ t^k = k(k − 1) · · · (k − ℓ + 1) t^{k−ℓ}, if k ≥ ℓ, and 0 otherwise.
Set now t = 0 to get
d^ℓ/dt^ℓ t^k |_{t=0} = k(k − 1) · · · 1 = ℓ!, if k = ℓ, and 0 otherwise.
Hence, in the infinite sum above, setting t = 0 causes all terms to vanish except the term k = ℓ,
which gives the formula d^ℓM/dt^ℓ (0) = (ℓ!/ℓ!) E(X^ℓ) = E(X^ℓ).
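An illustration of M^(ℓ)(0) = E(X^ℓ) (a sketch with SymPy, using the expon(λ) moment generating function λ/(λ − t), which is computed below in Problem 16.13):

import sympy as sp

t, lam = sp.symbols('t lam', positive=True)
M = lam / (lam - t)                              # mgf of an expon(lam) r.v. (see Problem 16.13)

for ell in range(1, 5):
    moment = sp.diff(M, t, ell).subs(t, 0)       # M^(ell)(0)
    print(ell, sp.simplify(moment))              # equals ell! / lam**ell, the ell-th moment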
Property A and Property B of probability generating function have exact analogs for
moment generating functions.
The first one concerns the moment generating function of the sum of n independent random
variables.
The first one says that knowledge of the moment generating function implies knowledge
of the law.
Property B is not obvious here. But we will not spend time proving it.
?PROBLEM 16.12 (a useless moment generating function). Consider a random variable X
with density
f(x) = 1/(2x²), if |x| ≥ 1, and f(x) = 0, if −1 < x < 1,
and find its moment generating function. Explain why it is useless.
Answer. We have
E e^{tX} = ∫_1^∞ e^{tx} (1/(2x²)) dx + ∫_{−∞}^{−1} e^{tx} (1/(2x²)) dx.
If t > 0 then the first integral equals +∞. If t < 0 then the second integral equals +∞. Hence
E e^{tX} = +∞ if t ≠ 0, and E e^{tX} = 1 if t = 0.
This is useless.
?PROBLEM 16.13 (moment generating functions of some common r.v.s). Find expressions
for the moment generating function of the r.v. X when X is
(0) δa (the Dirac law at a–(8.5))
(1) unif([a, b]) (what happens in the limit b → a and why?)
(2) expon(λ)
(3) N(µ, σ²)
(4) Cauchy(a, b), defined as the law of a + bZ, a ∈ R, b > 0, where Z has density f(x) = (1/π)/(1 + x²).
Is any of them useless?
Answer. (0) Since X has law δ_a we have P(X = a) = 1. So
E e^{tX} = e^{ta}.
(1) Since X has density (1/(b − a)) 1_{a≤x≤b},
E e^{tX} = ∫_a^b e^{tx} dx/(b − a) = (e^{tb} − e^{ta})/((b − a)t).
We have
lim_{b→a} (e^{tb} − e^{ta})/((b − a)t) = (1/t) d/db (e^{tb}) |_{b=a} = e^{ta}.
But this is the moment generating function of a random variable with law δ_a. This is reasonable
because if we let U be a unif([0, 1]) r.v., then (b − a)U + a has unif([a, b]) law. But then
lim_{b→a} ((b − a)U + a) = a,
so it is to be expected that the law of the random variable on the left converges to the law of
the random variable on the right.
(2) An expon(λ) r.v. has density λe^{−λx} 1_{x≥0}, so
E e^{tX} = ∫_0^∞ e^{tx} λ e^{−λx} dx = λ ∫_0^∞ e^{−(λ−t)x} dx = λ/(λ − t),
provided that t < λ. Of course the formula λ/(λ − t) makes sense for all t ≠ λ, but if t > λ we get a negative
value for this fraction, so the equality does not hold because E e^{tX} is positive. This is what we
mean when we say that the right-hand side, considered as a function over all t, is an extension
of the left-hand side.
(3) If X is N(µ, σ²), we write X = µ + σZ where Z is N(0, 1). For Z we have
E e^{tZ} = ∫_{−∞}^∞ e^{tx} (e^{−x²/2}/√(2π)) dx = (1/√(2π)) ∫_{−∞}^∞ e^{−(x² − 2tx)/2} dx.
It is helpful to write
x² − 2tx = x² − 2tx + t² − t² = (x − t)² − t²,
so
E e^{tZ} = (1/√(2π)) ∫_{−∞}^∞ e^{−(x−t)²/2} e^{t²/2} dx  (a)=  (e^{t²/2}/√(2π)) ∫_{−∞}^∞ e^{−y²/2} dy  (b)=  e^{t²/2},
where equality (a) follows by a simple change of variable, y = x − t, and where (b) follows from
the fact that e^{−y²/2}/√(2π) is a probability density function, so it integrates to 1. For X = µ + σZ
we obviously have
E e^{tX} = E e^{t(µ+σZ)} = e^{µt} e^{(σt)²/2} = e^{σ²t²/2 + µt}.     (16.8)
We can write this as
E e^{tX} = e^{(var X) t²/2 + (EX) t}.     (16.9)
(4) Letting X = a + bZ, with Z having the given density, we find
E e^{tX} = e^{ta} E e^{tbZ} = e^{ta} (1/π) ∫_{−∞}^∞ e^{tbx}/(1 + x²) dx.
If t > 0 we have ∫_0^∞ e^{tbx}/(1 + x²) dx = ∞ because, intuitively, e^{tbx} goes to infinity as x → ∞ much faster
than 1 + x² does. If t < 0 we similarly have ∫_{−∞}^0 e^{tbx}/(1 + x²) dx = ∞. So the moment generating function is a
useless one. In all previous cases, the moment generating functions were not useless.
X = a1 X1 + a2 X2
Now let X = (X1 , . . . , Xd ) be a random vector in Rd . We define the expectation of the random
vector X to be the vector of the expectations of the individual random variables:
EX = (EX1 , . . . , EXd )
PROBLEM 16.15 (expectation of a simple random vector on the plane). Let A, B, C be three
points on the Euclidean plane that we identify, as monsieur Descartes taught us, with their
coordinates, say (a1 , a2 ), (b1 , b2 ), (c1 , c2 ). Let X be a random vector with uniform distribution
on the set {A, B, C}. Explain why EX is the point of intersection of the three medians of the
triangle with vertices A, B, C, also explaining that these three segments pass through the same
point. Note: The median of a triangle is the straight segment that joins a vertex with the
midpoint of the opposite side.
Answer. We have
EX = (1/3)A + (1/3)B + (1/3)C,
where I'm thinking of A, B, C as represented by their coordinates. But then
EX = (2/3) · (A + B)/2 + (1/3) C.
But the point M_AB = (A + B)/2 is the midpoint of the segment AB. Since EX is a convex combination
of M_AB and C, EX lies on the straight line joining M_AB and C, that is, the median from C. By
exactly the same argument, EX lies on the median from A and on the median from B. Hence
the three medians meet at the same point and this point is EX.
This problem teaches us that EX can be thought of as a geometric object. If we assume that
X has law unif(A) where A is a "geometric" subset of the plane, then EX is called the centroid
of A. Physically, it is the center of mass of A if the mass of A is uniformly distributed on A
(mass density is constant). But then, if A has certain symmetries, it becomes easy to find
the centroid.
PROBLEM 16.16 (expectations of certain uniform random vectors). Explain what EX is
when X has law
(1) unif(D) where D is a disc;
(2) unif(S) where S is a rectangle;
Answer. (1) EX is the center of the disc.
(2) EX is the point where the two diagonals meet.
Since these things are important in Civil Engineering (and not only), people have created
tables of expectations of uniform random vectors .
Similarly, we can have a geometric interpretation of EX when X is a random vector in Rd ,
for any positive integer d.
Passing on to the covariance of the random vector X, we must define it as a matrix, the
so-called covariance matrix of the random vector (X1 , . . . , Xd ), defined by
Indeed,
∑_{j=1}^d ∑_{k=1}^d t_j t_k cov(X_j, X_k) = E[ (∑_{j=1}^d t_j (X_j − EX_j)) (∑_{k=1}^d t_k (X_k − EX_k)) ] = E[ (∑_{j=1}^d t_j (X_j − EX_j))² ] ≥ 0.
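In matrix form this says t^T R t ≥ 0 for every vector t, i.e. the covariance matrix is nonnegative definite. A quick empirical check (a sketch; NumPy assumed, any data matrix would do):

import numpy as np

rng = np.random.default_rng(6)
mix = np.array([[1.0, 0.5, 0.0],
                [0.0, 1.0, 0.3],
                [0.0, 0.0, 1.0]])
data = rng.standard_normal((10**4, 3)) @ mix     # three correlated columns of "data"
R = np.cov(data, rowvar=False)                   # sample covariance matrix of the columns

print(np.linalg.eigvalsh(R))                     # all eigenvalues are >= 0 (up to rounding)
t = rng.standard_normal(3)
print(t @ R @ t)                                 # the quadratic form is nonnegative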
Theorem 16.5 (Cramér-Wold theorem). The distribution of the d-dimensional random vector
X = (X1 , . . . , Xd ) is completely determined by the distributions of all 1-dimensional random
variables
t_1 X_1 + · · · + t_d X_d,
where t1 , . . . , td range in R.
If we believe in this (and we should because its proof is easy once you know what a Fourier
transform is), then the concept of the moment generating function of X = (X1 , . . . , Xd ) reduces
to dimension 1: the moment generating function of t_1 X_1 + · · · + t_d X_d for all t_1, . . . , t_d.
provided that the set of (t1 , . . . , td ) for which M(t1 , . . . , td ) < ∞ contains an open set. Otherwise,
we say that the moment generating function is useless.
?PROBLEM 16.17 (independence inferred from the moment generating function). Let
X_1, . . . , X_d be random variables such that
M(t_1, . . . , t_d) = M_1(t_1) · · · M_d(t_d) for all (t_1, . . . , t_d),     (16.11)
where M is the moment generating function of X and Mi the moment generating function of
Xi , i = 1, . . . , d. Assume that none of the Mi is useless. Explain why the X1 , . . . , Xd must be
independent.
Answer. Let Y_1, . . . , Y_d be independent random variables such that Y_i =_d X_i for all i. Then, by
independence, E e^{t_1 Y_1 + ··· + t_d Y_d} = M_1(t_1) · · · M_d(t_d) = M(t_1, . . . , t_d). Therefore
E e^{t_1 Y_1 + ··· + t_d Y_d} = E e^{t_1 X_1 + ··· + t_d X_d}.
By Property B of the moment generating function of a one-dimensional random variable, we
have that
t_1 Y_1 + · · · + t_d Y_d =_d t_1 X_1 + · · · + t_d X_d,
for all (t_1, . . . , t_d) and so, by the Cramér–Wold theorem,
(X_1, . . . , X_d) =_d (Y_1, . . . , Y_d).
Answer.
E e^{t_1 Y_1 + t_2 Y_2} = E e^{(t_1+t_2)X_1 + (t_1−t_2)X_2} = E e^{(t_1+t_2)X_1} E e^{(t_1−t_2)X_2} = 1/((1 − t_1 − t_2)(1 − t_1 + t_2)).
This cannot be written in product form. So Y_1, Y_2 are not independent.
Remark 16.3. If the random variables X1 , . . . , Xd take values in Z+ , we may work with the
joint probability generating function
G(s_1, . . . , s_d) = E[s_1^{X_1} · · · s_d^{X_d}].
This always exists at least if |s_1| < 1, . . . , |s_d| < 1. As above, we have that
E[s_1^{X_1} · · · s_d^{X_d}] = E[s_1^{X_1}] · · · E[s_d^{X_d}] for all s_1, . . . , s_d  ⇐⇒  X_1, . . . , X_d are independent.
x^{S_N} y^{R_N} = x^{∑_{i=1}^N ξ_i} y^{∑_{i=1}^N (1−ξ_i)} = ∏_{i=1}^N x^{ξ_i} y^{1−ξ_i} = ∑_{n=0}^∞ ( ∏_{i=1}^n x^{ξ_i} y^{1−ξ_i} ) 1_{N=n}.
Hence
E[x^{S_N} y^{R_N}] = ∑_{n=0}^∞ E[ ( ∏_{i=1}^n x^{ξ_i} y^{1−ξ_i} ) 1_{N=n} ] = ∑_{n=0}^∞ ( ∏_{i=1}^n E[x^{ξ_i} y^{1−ξ_i}] ) P(N = n).
Hence
E[x^{S_N} y^{R_N}] = ∑_{n=0}^∞ (xp + y(1 − p))^n P(N = n) = e^{−λ(1 − xp − y(1−p))}.
Write
1 − xp − y(1 − p) = p + (1 − p) − xp − y(1 − p) = p(1 − x) + (1 − p)(1 − y),
so
E[x^{S_N} y^{R_N}] = e^{−λp(1−x)} e^{−λ(1−p)(1−y)}.
Setting y = 1 we find E[x^{S_N}] = e^{−λp(1−x)} (which we knew from Problem 16.11) and setting x = 1
we find E[y^{R_N}] = e^{−λ(1−p)(1−y)}. Hence S_N is Poi(λp) and R_N is Poi(λ(1 − p)). The last display can
be written as
E[x^{S_N} y^{R_N}] = E[x^{S_N}] E[y^{R_N}],
and this implies that S_N, R_N are independent.
(3) Since SN , RN are independent, our task is trivial:
Now, if S_{k+ℓ} = k then certainly R_{k+ℓ} = ℓ (because their sum is k + ℓ), so we can omit this event.
Hence
P(S_N = k, R_N = ℓ) = P(S_{k+ℓ} = k, N = k + ℓ).
But S_{k+ℓ} is a function of ξ_1, . . . , ξ_{k+ℓ} and so it is independent of N. Hence
P(S_N = k, R_N = ℓ) = P(S_{k+ℓ} = k) P(N = k + ℓ) = C(k+ℓ, k) p^k (1 − p)^ℓ (λ^{k+ℓ}/(k + ℓ)!) e^{−λ}.     (16.13)
But a little light algebra shows that the expressions on the right of (16.12) and (16.13) are the
same.
?PROBLEM 16.21 (independent normals). Let X, Y be i.i.d. N(0, 1) each. Define
W = aX + bY, Z = cX + dY,
where a, b, c, d are real numbers. Find a relation between these numbers so that W, Z be
independent as well.
Answer. Well, W, Z are independent iff the moment generating function of (W, Z) is the product
of the individual moment generating functions, that is, iff
E[e^{sW + tZ}] = (E e^{sW})(E e^{tZ}) for all s, t.
By (16.9),
E e^{saX} = e^{var(saX)/2} = e^{a²s²/2}.
For the same reasons,
E e^{sbY} = e^{b²s²/2},
and so
E e^{sW} = (E e^{saX})(E e^{sbY}) = e^{(a²+b²)s²/2}.
Similarly,
E e^{tZ} = e^{(c²+d²)t²/2}.
On the other hand, sW + tZ = (as + ct)X + (bs + dt)Y, so
E[e^{sW + tZ}] = (E e^{(as+ct)X})(E e^{(bs+dt)Y}) = e^{(as+ct)²/2} e^{(bs+dt)²/2}.
Hence
W, Z are independent ⇐⇒ e^{(as+ct)²/2} e^{(bs+dt)²/2} = e^{(a²+b²)s²/2} e^{(c²+d²)t²/2} for all s, t.
Expanding the squares on the left and canceling terms we're left with
2(ac + bd)st = 0 for all s, t, that is,
W, Z are independent ⇐⇒ ac + bd = 0.
Remark 16.4. In the above problem, noticing that
WZ = (aX + bY)(cX + dY) = acX² + bdY² + (ad + bc)XY,
we have
E(WZ) = ac + bd + 0 = ac + bd.
Hence the condition ac + bd = 0 is equivalent to E(WZ) = 0. Since EW = EZ = 0, this means that
W, Z are uncorrelated. Hence, for W and Z as above, being uncorrelated is the same as being
independent.
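A small simulation illustrating the criterion (a sketch; NumPy assumed, with (a, b, c, d) chosen so that ac + bd = 0 in one case and ≠ 0 in the other):

import numpy as np

rng = np.random.default_rng(7)
X, Y = rng.standard_normal((2, 10**6))

for a, b, c, d in [(1.0, 2.0, -2.0, 1.0),    # ac + bd = 0: W, Z independent
                   (1.0, 2.0, 1.0, 1.0)]:    # ac + bd = 3: W, Z dependent
    W, Z = a * X + b * Y, c * X + d * Y
    print(a * c + b * d, np.corrcoef(W, Z)[0, 1])   # correlation is ~0 exactly when ac + bd = 0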
The “law of large numbers” is not a law. It is a theorem. It is called law for
historical reasons: before people understood the mathematics of probability
they had no clue what this thing was. Some took it as a definition. Others
believed it as a law of physics. And some thought it is an experimental result.
They were all wrong.
When I ask students to calculate the probability that in 10 thousand fair coin
tosses we get half heads and half tails, half of them reply 1, the other half reply
1/2; but all of them agree it’s due to the law of large numbers. And they’re all
wrong.
I use the term "fundamental theorem of probability" for the strong law of large numbers.
As such, it is totally unacceptable not to know what it says and not to understand why it's
true. There is no point going around declaring that “the sample mean converges to the true
mean” without understanding what this means. In this chapter, you will; so long as you’re
willing to study it.
17.1 Discussion
A long time ago, people didn't know how to define probability. They thought that they could
use a concept called “frequency” to define it. But no matter how hard they tried, they failed to
define a mathematically consistent theory. I’ll explain how they tried to think with a gedanken
experiment1 . Suppose you want to know the probability that a message transmitted over
the Internet contains a virus. Put probes (software) in various computers across the Internet
that detect malicious messages. Do this for a period of time (say, a year) during which 10^10
messages have been sent. Count how many of them are malicious, say 10^4, and divide the
1
thought experiment
two numbers: 10^4/10^10 = 1/10^6 (1 per million). Now use this number as the probability of the
event you are interested in. Which seems reasonable. But, to build a theory of probability,
you need to consider the totality of the events (those you are interested in and those you are
not–because, a theory doesn’t care about what YOU are interested in) and assign a probability
to each one, using frequency approach. And people proved that such an approach fails.
The Strong Law of Large Numbers does, indeed, talk about frequencies. But not as a
definition of probability; rather, as a result in probability, one that can be proved, provided that
some assumptions are made.
I’m not telling you yet what the Law of Large Numbers says, but I’m going to tell you a
consequence of it.
Let A1 , A2 , . . . be events that are independent and all have the same probability p. Then,
for each positive integer n, the number
f_n := (1/n)(1_{A_1} + · · · + 1_{A_n})     (17.1)
is the “proportion of events that occur”. The Law of Large Numbers says that
P( lim_{n→∞} f_n = p ) = 1.     (17.2)
That is, for sure, the limit of the random sequence fn , n ∈ N, exists, for sure it is
deterministic, and for sure it equals to p (where p = P(A j ) for any j, because we assumed
that all events have the same probability.)
Let’s discuss more. We can think of each A j above as an “independent replica” of some
“idealized event” A. Then the number fn expresses the “frequency of A”. Of course, the
events A1 , A2 , . . . are never equal to A. For example, if we wish to consider A as the idealized
event “malicious message” then A j is the event that “malicious message appears at the jth
measurement”. If we assume (or can prove that) the events A j have the same probability
(why should they?) and that they are independent (are they?) then we can talk about the
proportion fn of the malicious ones in the first n measurements and then the Law of Large
Numbers will guarantee that the sequence f1 , f2 , . . . has a limit and that this limit is p.
Let’s discuss even more. Do not forget that we are working on a set Ω with a probability
measure P defined on events of Ω. An element ω of Ω, called a "configuration" or "elementary
outcome” (these are just silly words), is an element of an event A j or is not. For example, in
the gedanken experiment above, ω may be taken as an infinite sequence (ω1 , ω2 , . . .) where ω j
denotes the state of the Internet at the jth measurement. (Of course, we cannot know ω j , but
our knowledge does not affect our ability to talk about it.) Surely then we can tell if ω ∈ A j
because, knowing the state of the whole Internet at the jth measurement tells us if a malicious
message was sent (and much much more). Recall that the phrase “A j occurs” means “ω ∈ A j ”
and this means that "1_{A_j}(ω) = 1", by definition of the indicator function. Hence f_100(ω) = 0.75,
say, means that ω belongs to 75 of the events A_1, A_2, . . . , A_100; we also use the phrase "A_j occurs on ω".
The Law of Large Numbers will be explained next, but only in a special case. It should be
noted that the independence assumption can be dropped.
µ = EX1
Then
P(C) = 1. (17.4)
Note that Sn /n is the sample mean of the n first random variables; whereas µ is the
expectation of X1 . Since the random variables Xi have the same law, we obviously have
EXi = µ for all i.
In words, one can state the SLLN as:
The probability that the sample mean of an i.i.d. sequence of random variables
with common expectation µ converges to µ is equal to 1.
Of course,
C ⊂ B,
so showing that P(B) = 1 does not mean that P(C) = 1. We really have to understand why
P(C) = 1.
PROBLEM 17.1. Explain why the SLLN implies that (17.2) holds.
Answer. Simply let Xi = 1Ai , i = 1, 2, . . .. These are i.i.d. random variables with
µ = EX1 = p.
Let
S_n = ∑_{i=1}^n X_i = ∑_{i=1}^n 1_{A_i}.
Note that
f_n = S_n/n.
The SLLN states that S_n/n converges to µ = p with probability 1, which is exactly (17.2).
E(X_1^4) < ∞.
This assumption is made for convenience. Therefore, E(X_1), E(X_1²), E(X_1³) are all finite as well.
To make life simple, I will also assume that
E(X_1) = 0,     (17.5)
because, if not, we can simply subtract it and reduce to this case! If E(X_1²) = 0 we immediately
get that X_1 = 0 (Problem 16.6) and so X_i = 0 for all i, and so there is nothing to explain here:
everything is zero! So we assume
E(X_1²) > 0.
Step 1: Compute the 4-th power of the sum. We now consider the 4-th power of the sum
of the first n random variables and expand it:
S_n^4 = ( ∑_{i=1}^n X_i )^4
      = ∑_i X_i^4 + ∑_{i≠j} X_i² X_j² + ∑_{i≠j} X_i³ X_j + ∑_{i,j,k distinct} X_i² X_j X_k + ∑_{i,j,k,ℓ distinct} X_i X_j X_k X_ℓ.     (17.6)
What I have done here is that I collected together similar terms because, when raising a sum
to the 4th power, I do the same as multiplying 4 identical sums in parentheses:
S_n^4 = (X_1 + · · · + X_n)(X_1 + · · · + X_n)(X_1 + · · · + X_n)(X_1 + · · · + X_n).
To perform the product and find all n^4 terms, I must select exactly one variable from each
parenthesis. If I select the same variable from all parentheses then I get a term of the form X_i^4
and this gives the first term in (17.6). If I select the same variable from two parentheses and a
different variable from the other two, I get the second term. And so on. Now take expectation
of the expression above. By independence,
E(S_n^4) = ∑_i E(X_i^4) + ∑_{i≠j} E(X_i²) E(X_j²) + ∑_{i≠j} E(X_i³) E(X_j)
          + ∑_{i,j,k distinct} E(X_i²) E(X_j) E(X_k) + ∑_{i,j,k,ℓ distinct} E(X_i) E(X_j) E(X_k) E(X_ℓ).
The last three terms are equal to zero because we assumed (17.5), which implies that each
X_i has zero expectation. We therefore have
E(S_n^4) = n E(X_1^4) + 3n(n − 1)(E(X_1²))².
Indeed, the sum ∑_i E(X_i^4) has n terms, all equal to E(X_1^4), and the sum ∑_{i≠j} E(X_i²) E(X_j²) has
6 C(n, 2) = 3n(n − 1) terms, because I can choose an unordered pair of distinct indices i, j from the n indices
in C(n, 2) ways and, once chosen, I can place {i, j} in the 4 parentheses in 6 ways: X_i from the first 2, X_j from
the last 2; or X_i from the first and third parentheses and X_j from the others, etc.
Step 2: Use Markov's inequality. We are interested to show that S_n/n converges to zero.
Fix ε > 0 and observe:
P(|S_n/n| > ε) = P(S_n^4 > n^4 ε^4) ≤ (1/ε^4) E(S_n^4)/n^4,
where we used (16.5), explained in Section 16.6. But we have an expression for E(S_n^4), from
which we get
E(S_n^4) ≤ c n²,
where c is a positive constant. (In fact, c can be taken to be E(X_1^4) + 3(E(X_1²))².) Hence
P(|S_n/n| > ε) ≤ (c/ε^4)/n².
Let us define the number of times n such that the sample mean S_n/n is outside [−ε, ε]:
N_ε := ∑_{n=1}^∞ 1_{|S_n/n| > ε}.     (17.7)
The expectation of this random variable is
E(N_ε) = ∑_{n=1}^∞ P(|S_n/n| > ε) ≤ (c/ε^4) ∑_{n=1}^∞ 1/n² < ∞.     (17.8)
Thus the random variable N_ε has finite expectation, so it cannot take value ∞ with positive
probability. Therefore,
P(Nε < ∞) = 1,
for all ε > 0, and hence for all rational ε > 0, so
P( for all rational ε > 0 Nε < ∞) = 1,
and so
P( for all ε > 0 Nε < ∞) = 1,
because Nε increases as ε decreases.
We’re done! This is just ordinary logic. Since all terms in (17.7) are 1 or 0, the statement
Nε < ∞ is equivalent to all but finitely many terms in the sum are equal to 0:
N_ε < ∞ ⇐⇒ 1_{|S_n/n| > ε} = 0 for all but finitely many n ⇐⇒ |S_n/n| ≤ ε for all but finitely many n.
Therefore
for all ε > 0 Nε < ∞ ⇐⇒ for all ε > 0 |Sn /n| ≤ ε all but finitely many n
⇐⇒ the sequence Sn /n converges to 0.
And so
P(the sequence Sn /n converges to 0) = 1.
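The statement can also be "seen" in a simulation (a sketch only, assuming NumPy; watching S_n/n settle down is of course not a proof):

import numpy as np

rng = np.random.default_rng(8)
x = rng.uniform(-1.0, 1.0, size=10**6)        # i.i.d., E(X_1) = 0, E(X_1^4) < infinity

running_mean = np.cumsum(x) / np.arange(1, x.size + 1)   # S_n / n for n = 1, ..., 10^6
for n in [10, 10**2, 10**3, 10**4, 10**5, 10**6]:
    print(n, running_mean[n - 1])             # drifts towards 0 as n grows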
PROBLEM 17.2 (SLLN with nonzero mean). We explained the SLLN under the assumption
that EX_1 = 0. How do you explain the more general case when µ = EX_1 exists and is finite but
is not necessarily zero?
Answer. Simply note that
(1/n) ∑_{i=1}^n X_i converges to µ ⇐⇒ (1/n) ∑_{i=1}^n (X_i − µ) converges to 0,
of the sum ∑_{n=1}^∞ 1/n². You can either use the well-known expression π²/6 for the last sum,
or compute it numerically. We get ∑_{n=1}^∞ 1/n² ≈ 1.7. For p = 1/2, we get 3/16 = 0.1875. So
E(N_{0.01}) ≤ 0.1875 × 1.7 / 0.01^4 ≈ 3 · 10^7 and so P(N_{0.01} > 10^9) ≈ 0.03.
(3) If the values of the sequence of functions X1 , X2 , . . . are known then you can find p by the
formula
p = lim_{n→∞} (X_1 + · · · + X_n)/n.
(4) If only the values of the first n functions X1 , . . . , Xn are known then we can, e.g., use
Chebyshev’s inequality. Fix ε > 0 and write
P(X ∈ S) = length(S).
if the limit exists. If the limit does not exist set X = 0. Compute the distribution function of X.
Answer. The random variables Z_i := (N_{2i−1} + N_{2i})², i = 1, 2, . . ., are i.i.d. Note that E(Z_1) = 2λ² + 2λ.
By the SLLN,
P( lim_{n→∞} (1/n) ∑_{i=1}^n Z_i = E(Z_1) ) = 1.
Hence
P(X = 2λ² + 2λ) = 1.
This means that the distribution function F(x) of X is given by
F(x) = P(X ≤ x) = 1, if x ≥ 2λ² + 2λ, and F(x) = 0, otherwise.
Mechanics For example, it appears in a mathematical system called Mechanics which deals
with the motion of particles according to Newton’s law of motion. The latter states that when
a particle moves in space then the second derivative of its position vector is proportional
to a quantity known as force. If you take a large number of particles moving according to
Newton’s laws but without affecting one another then you can define the density of the system
of particles by calculating the proportion of particles and their velocities that lie in a subset of
the position-velocity space. Liouville’s theorem states that this density is preserved by the
motion. This is a Law of Large Numbers in disguise. You can read more about this here . You
see, a (good) course in Mechanics helps one understand probability, and a (good) course
in the latter helps the former too.
Dynamical systems It also appears in another area of mathematics called Dynamical Systems.
A dynamical system is, for example, the equations of motion by Newton. But it’s something
much more general. A dynamical system is, roughly speaking, something that depends on
time in such a way that the future after time t depends on the past before t only through the
present at time t. For example, consider the sequence
CHAPTER 17. THE FUNDAMENTAL THEOREM OF PROBABILITY 198
t C(t)
1 1
2 11
3 21
4 1211
5 111221
6 312211
7 13112221
··· ···
There is a rule producing this sequence. Can you figure it out? The existence of the rule tells
us that to figure out what the future value C(t + k) is we only need to know the present value
C(t) (and not the past ones). If you cannot figure out the rule look here . I called the sequence
C because it was invented by the Liverpudlian John H. Conway (who died in 2020 because
he was infected by covid). Dynamical Systems often satisfy “laws of large numbers”. For
example, for the sequence above, I claim that if L(t) is the length of the sequence at time t and
if L j (t) is the number of occurrences of the digit j in C(t) then
lim_{t→∞} L_j(t)/L(t) exists, j = 1, 2, 3.
What are these limits? Nobody knows the answer. However, Conway proved that
lim_{t→∞} L(t+1)/L(t) = 1.303577269034 · · · .
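For the curious, here is a short program generating the sequence C(t) (a sketch; it also prints the length ratios L(t+1)/L(t), which approach Conway's constant 1.3035...):

from itertools import groupby

def next_term(c):
    # read the term aloud: "one 1, one 2, two 1s, ..." -> concatenate count + digit
    return ''.join(str(len(list(g))) + digit for digit, g in groupby(c))

c = '1'
for t in range(1, 31):
    nxt = next_term(c)
    if t >= 25:
        print(t, len(nxt) / len(c))    # ratio of successive lengths L(t+1)/L(t)
    c = nxt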
Statistics In Statistics, we are interested in figuring out the distribution function F(x) of a
random variable X. We do the following. Let X1 , X2 , . . . be i.i.d. copies of X, that is, a sequence
of independent random variables such that
P(X j ≤ x) = F(x), x ∈ R, j = 1, 2, . . .
Define
F_n(x) := (1/n) ∑_{j=1}^n 1_{X_j ≤ x}, x ∈ R.
Look at Section 8.3 to realize that F_n defines a new probability from the "data" (X_1, . . . , X_n). In
fact, x ↦ F_n(x) is a distribution function, for each n. It is called the empirical distribution function.
Notice that, for each x, the random variables
1_{X_j ≤ x}, j = 1, 2, . . .
are i.i.d. with common expectation F(x). Hence the Law of Large Numbers we proved in the previous
section says that, for each fixed x, F_n(x) converges to F(x) with probability 1.
Does this allow us to estimate the FUNCTION F? No, because the rate of convergence depends
on x.
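A sketch of the empirical distribution function in code (NumPy assumed; F_n at x is just the proportion of data points not exceeding x):

import numpy as np

rng = np.random.default_rng(9)
data = rng.standard_normal(200)                  # X_1, ..., X_n, here n = 200 and F is the N(0,1) cdf

def F_n(x, sample=data):
    # empirical distribution function at x
    return np.mean(sample <= x)

for x in [-1.0, 0.0, 1.0]:
    print(x, F_n(x))                             # compare with the true F(x): about 0.159, 0.5, 0.841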
However, something stronger is true:
For all ε > 0, P(|S_n/n − µ| > ε) → 0, as n → ∞.
?PROBLEM 17.6 (strong law implies weak law). Explain why, under the conditions of
Theorem 17.1, namely that the X1 , X2 , . . . be i.i.d. and have common expectation µ, the weak
law of large numbers holds.
Answer. The strong law of large numbers states that
P( S_n/n converges to µ ) = 1.
See (17.3)+(17.4). Hence the complement of the event {S_n/n converges to µ} has probability 0:
P( S_n/n does not converge to µ ) = 0.
But
{ S_n/n does not converge to µ }
= { there is ε > 0 such that for all N there is n ≥ N with |S_n/n − µ| > ε }
= ∪_{ε>0} { for all N there is n ≥ N with |S_n/n − µ| > ε }.
If the union of countably many events has probability 0 then each event has probability 0.
Since we can consider ε to be rational, this simple observation applies. So
P( for all N there is n ≥ N with |S_n/n − µ| > ε ) = 0.
The events
I_N = { there is n ≥ N with |S_n/n − µ| > ε }
satisfy
I_1 ⊃ I_2 ⊃ I_3 ⊃ · · · .
In this case, by (AXIOM TWO),
P( for all N there is n ≥ N with |S_n/n − µ| > ε ) = lim_{N→∞} P(I_N).
So
lim_{N→∞} P(I_N) = 0.
But
{ |S_N/N − µ| > ε } ⊂ I_N,
and so
P( |S_N/N − µ| > ε ) ≤ P(I_N).
The latter has limit 0. And so does the former.
Why is the weak law of large numbers useful? One answer is because it may hold under a
different set of conditions from those of the strong law. But this is rather subtle to appreciate.
Chapter 18
18.1 Review
1. In Section 14.3.3 we defined the standard normal law on R as a probability measure such
that it had density proportional to e^{−x²/2}. We computed the constant in front of it and found it
to be 1/√(2π). In order to do so, we passed on to 2 dimensions and discovered a circle.
it follows that
PROBLEM 18.1 (because of the Pythagorean theorem). Show that x2 + y2 is invariant under
rotations, first by using Cartesian coordinates and then without.
Answer. If we rotate (x, y) by an angle θ we obtain a new point (x', y') with coordinates
x' = x cos θ − y sin θ,
y' = x sin θ + y cos θ.
But then
x'² + y'² = (x cos θ − y sin θ)² + (x sin θ + y cos θ)² = x² + y²,
because sin²θ + cos²θ = 1.
Alternatively, √(x² + y²) is the length of the hypotenuse of a right triangle with vertices (0, 0),
(x, 0), (x, y). Length does not change if we rotate.
3. Then look at Problem 16.14 where we discovered, using the machinery of moment
generating functions, the law of
aX + bY
when X, Y are independent normals. We read a special case of this problem:
aX + bY is N(0, a² + b²).
(L) aX + bY =_d cZ, where Z is standard normal and c² = a² + b².
(what’s this little triangle doing here?)
Theorem 18.1 (the normal law is unavoidable). Let Q be a probability law on R such that:
(i) If Z has law Q then EZ = 0 and EZ2 = 1.
(ii) If X, Y are independent with common law Q then, for all a, b ∈ R there is c ∈ R such that
aX + bY =_d cZ.     (18.1)
for some t ≠ 0. I don't have to assume this. I am just doing so to avoid using complex
numbers.
We assume that (18.1) holds.
Step 1: Take squares and then expectations. We have (aX + bY)² = a²X² + b²Y² + 2abXY.
Since E(XY) = (EX)(EY) = 0, we have E[(aX + bY)²] = a²E(X²) + b²E(Y²) = a² + b², because we
also assumed E(X²) = 1, and so E(Y²) = 1. On the other hand, since (18.1) holds, we have
E[(aX + bY)²] = E[(cZ)²] = c², therefore
a² + b² = c².     (18.2)
Hence c is uniquely specified, up to a sign.
Step 2: Take exponentials and then expectations. Since (18.1) holds and since we made the
assumption that M(t) < ∞ for some t ≠ 0, we have
E e^{t(aX+bY)} = E e^{tcZ} = M(ct).
By independence,
E e^{t(aX+bY)} = E e^{taX} · E e^{tbY} = M(at)M(bt),
and so
M(at)M(bt) = M(ct).     (18.3)
Step 3: Solving the equations. We now have two identities, an algebraic one (18.2), and a
functional one (18.3). We call the latter functional because our assumption that M(t) be finite
for some t ≠ 0, implies that it is finite on the nontrivial interval with endpoints t and 0, so
there is room to move around. In order to transform multiplication in (18.3) into addition and
in order to get rid of the squares of (18.2), we define
L(u) = log M(√u),
so
M(t) = e^{L(t²)}.
Substituting this into (18.3) and using (18.2) leads to
L(a²t²) + L(b²t²) = L(a²t² + b²t²).
Since a, b are arbitrary, we can rewrite this as
L(u1 ) + L(u2 ) = L(u1 + u2 ).
This identity is true for small u1 , u2 and since the only continuous function that preserves
addition is linear, we obtain that there is a constant C such that
L(u) = Cu,
which gives
M(t) = e^{Ct²}.
We differentiate this function twice:
M'(t) = 2Ct M(t), M''(t) = 2C M(t) + 2Ct M'(t).
But M''(0) = EZ² = 1. Thus 2C = 1, or C = 1/2. We have thus found
M(t) = e^{t²/2}.
From (16.9), we see that this is the moment generating function of the N(0, 1). Therefore, Q = N(0, 1).
PROBLEM 18.2 (linear combination of i.i.d. standard normals). Let X, Y, W be i.i.d. N(0, 1)
and let a, b, c be real numbers. What is the law of aX + bY + cW?
Answer. We have
aX + bY =_d √(a² + b²) Z,
where Z is N(0, 1). Similarly,
√(a² + b²) Z + cW =_d √( (√(a² + b²))² + c² ) Z' = √(a² + b² + c²) Z',
where Z' is N(0, 1). Hence aX + bY + cW is N(0, a² + b² + c²).
where X/α, Y/β, W/γ are i.i.d. standard normals. By Problem 18.2, the right-hand side of the above
display has law N(0, (aα)² + (bβ)² + (cγ)²) = N(0, var(aX + bY + cW)).
X + a has density f(x − a)
and
bX has density (1/b) f(x/b).
Therefore,
bX + a has density (1/b) f((x − a)/b).
We also have
X is N(µ, σ²) ⇐⇒ (X − µ)/σ is N(0, 1),
and so
We need to define a law on Rd that we can confidently call normal. By the Cramér-Wold
theorem 16.5, it suffices to
Directly from this definition and formula (16.9) for the moment generating function of a
single normal random variable we find:
Now recall the definition (16.10) of the covariance matrix of a random vector, and its properties.
This was the topic of Section 16.9.1. The formalism of Linear Algebra here comes to the rescue,
not only because we can write things more succinctly but also because Linear Algebra helps
us establish the converse of Problem 18.3, namely,
Figure 18.1: When we pass an electric current, whose value as a function of time is described by a
sequence of i.i.d. normal random variables, through a speaker, it is transformed into an air-pressure
function of time that can be heard by a human ear as noise, called white noise.
Definition 18.2 (symbol for normal law on Rd ). If (X1 , . . . , Xd ) is a normal random vector
with expectation vector µ and covariance matrix R then we use the symbol N(µ, R) for its law.
E e^{t^T X} = exp( t^T µ + (1/2) t^T R t ).     (18.8)
Remark 18.1. Note that we can also write the covariance matrix as
R = E[(X − µ)(X − µ)^T],
simply because when we multiply the column vector X − µ by the row vector (X − µ)^T we
obtain a square matrix whose elements are (Xi − µi )(X j − µ j ); the expectation of which is
cov(Xi , X j ).
so we have exactly the situation studied in Problem 16.17, that is, (16.11) holds: the moment
generating function of (X1 , . . . , Xd ) becomes a product of individual moment generating
functions. So the X1 , . . . , Xd are independent.
When does X = (X1 , . . . , Xd ) have a density on Rd ? Remember that if d = 1 then N(µ, σ2 ) always
has density unless σ2 = 0. The equivalent criterion in d dimensions is:
We will not fully explain this (as usual, we do not proved theorems in this course), but
we will make a couple of remarks and then find a formula for the density when it exists.
Remark 18.2. If det(R) = 0 then the columns of R are linearly dependent. This linear
dependence translates into the fact that there is, with probability 1, some linear dependence for
the random variables X1 , . . . , Xd themselves. And this immediately implies that (X1 , . . . , Xd )
belongs to a (d − 1)–dimensional subspace say, of Rd that necessarily has zero d-volume and
so the density does not exist. See Section 15.5 for these notions. In particular, recall the notion
of zero d-volume presented in the Summary of page 159.
Remark 18.3. If det(R) = 0 then one can show that if we let
V = {Rx : x ∈ R^d},
then
P(X ∈ V) = 1.
If we then choose some basis of V and express X in this basis then we will obtain an
m-dimensional normal random vector that has a density. This is just a matter of linear algebra.
Remark 18.4. If det(R) , 0 then we simply show that the density exists by deriving a formula.
This is done next.
Explanation
Step 1. Realizing that we must complete the square. We therefore need to find a way to
perform the integral on the left. In Problem 3.2 we completed the square for a quadratic
polynomial of one variable. We are faced with the same problem here: complete the square
for a quadratic polynomial of d variables:
Step 2. Finding the square root. How can we define the square root of the matrix R?
Since R is symmetric and positive definite (see Section 16.9.1, equation (16.10) infra) it has
d nonnegative eigenvalues λ1 , . . . , λd . Moreover, one can choose the eigenvectors u1 , . . . , ud ,
corresponding to these eigenvalues, so that they are pairwise orthogonal and have length
1. So if we define the matrix U whose columns are the eigenvectors, we will have U^T = U^{−1}.
Letting Λ be the d × d diagonal matrix whose diagonals are the eigenvalues, we have
RU = UΛ,
Ru j = λ j u j , j = 1, . . . , d,
R = UΛU−1 = UΛUT .
Thinking geometrically, this says that Λ is the matrix of the linear function x 7→ Rx in the basis
of the eigenvectors. Since Λ is so simple (diagonal: it means that the mapping scales along
CHAPTER 18. NORMALITY, NORMALLY AND SMOOTHLY 209
√
the straight lines defined by the eigenvectors), it p
makes perfect sense to define Λ as the
diagonal matrix whose diagonal elements are the λ j . We then have
√ √ √ √ T
R = U Λ ΛUT = U Λ U Λ .
We have not quite written R as the square of another matrix but we have written it as
R = SS^T, where S := U √Λ.     (18.10)
(We can't expect square root to mean the same as for real numbers. After all, multiplication
of real numbers is commutative: ab = ba, but multiplication of matrices is not: AB ≠ BA, in
general.) The only knowledge we need is that R = SS^T, for some S (that we constructed).
Since det(R) = det(S) det(S^T) = det(S)², it follows that det(S) ≠ 0 as well, so S^{−1} and
(S^T)^{−1} = (S^{−1})^T exist. We then write
Q(x) = (S^{−1}x)^T (S^{−1}x) − 2 t^T S (S^{−1}x) ≡ y^T y − 2u^T y
     = y^T y − 2u^T y + u^T u − u^T u
     = (y − u)^T (y − u) − u^T u.
Step 4: Performing the integral. With this new expression for Q(x) we have
∫_{R^d} e^{−Q(x)/2} dx = e^{u^T u/2} ∫_{R^d} e^{−(y−u)^T(y−u)/2} dx = e^{u^T u/2} ∫_{R^d} e^{−(y−u)^T(y−u)/2} d(Sy).
The determinant of the Jacobian of the mapping y ↦ Sy is det(S) = √det(R), and so the above
is further equal to
e^{u^T u/2} √det(R) ∫_{R^d} e^{−(y−u)^T(y−u)/2} dy = e^{u^T u/2} √det(R) ∫_{R^d} e^{−z^T z/2} dz,
This is exactly what we wanted to show. And therefore we know that (18.9) is, indeed, a
density for N(µ, R) when det(R) ≠ 0.
18.6 Whitening
Let X = (X1 , . . . , Xd ) be N(0, R). How do we represent the Xi as linear combination of
independent normals? We will only explain how when det(R) , 0.
We have actually already done it. Look at (18.11) and define
Y = S−1 X.
We think of Y and X as columns and we recall that S is the square root of R, namely, R = SST .
We now have
cov(Y) = E(YY^T) = E(S^{−1}XX^T(S^T)^{−1}) = S^{−1}E(XX^T)(S^T)^{−1} = S^{−1}SS^T(S^T)^{−1} = I,
and, in particular, cov(Y_i, Y_j) = 0 if i ≠ j.
Since (Y1 , . . . , Yd ) is normal on Rd , it follows that the random variables Y1 , . . . , Yd are indepen-
dent. In fact, they are all N(0, 1).
Therefore, our representation is
X = SY.
Or, explicitly, X_i = ∑_{j=1}^d S_{i,j} Y_j, i = 1, . . . , d.
PROBLEM 18.6 (whitening example). Let (X1 , X2 ) be N(µ, R), with your favorite R, and write
it as a linear function of i.i.d. standard normals.
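A sketch of such a computation (NumPy assumed; the "favorite R" below is a hypothetical choice, and the square root S is obtained from the eigendecomposition as in (18.10)):

import numpy as np

R = np.array([[2.0, 0.8],
              [0.8, 1.0]])                       # a hypothetical covariance matrix with det(R) != 0
mu = np.array([1.0, -1.0])

lam, U = np.linalg.eigh(R)                       # R = U diag(lam) U^T
S = U @ np.diag(np.sqrt(lam))                    # S with S S^T = R, as in (18.10)

rng = np.random.default_rng(10)
Y = rng.standard_normal((2, 10**5))              # i.i.d. standard normals Y_1, Y_2
X = mu[:, None] + S @ Y                          # X = mu + S Y has law N(mu, R)

print(np.cov(X))                                 # close to R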
The normal distribution is better understood on the plane rather than on the line.
Let’s explain this better. Let (X, Y) be a pair of i.i.d. standard normal random variables.
Consider a rotation by angle θ about the origin of the plane, and let (X̂, Ŷ) be the rotated
(X, Y). We have
X̂ = X cos θ − Y sin θ
Ŷ = X sin θ + Y cos θ
We then have EX̂ = EŶ = 0, and, since EX² = EY² = 1 and E(XY) = 0, we also have
EX̂² = (EX²) cos²θ + (EY²) sin²θ = cos²θ + sin²θ = 1, EŶ² = cos²θ + sin²θ = 1,
(where we actually ignored the possibility that Y = 0 because the probability of this event is
zero). Since
tan Θ = Y/X = Ŷ/X̂,
we have that the distribution of Θ is uniform:
P(α < Θ < β) = (β − α)/(2π), 0 < α < β < 2π.
On the other hand, we have that
"
e−(x +y )/2
2 2
P(R > t) =
2
dxdy
x2 +y2 >t 2π
"
e−r /2
2 Z ∞ Z 2π Z ∞
−r2 /2
dθ = √ e−r /2 rdr,
2
= √ r drdθ = √ e r dr
r> t 2π t 0 t
0<θ<2π
Hence R2 has expon(1/2) density. If we now rotate (X, Y) by a fixed angle θ about the origin,
we see that its distance from the origin does not change, whereas Θ changes by adding θ. This
implies that P(R ≤ t, Θ ≤ β) depends linearly on β. Hence P(R ≤ t, Θ ≤ β) = F1 (t)β, for some
function F_1. It follows that R and Θ are independent. We summarize: if X, Y are i.i.d. N(0, 1) and
we write (X, Y) in polar coordinates (R, Θ), then R and Θ are independent, Θ is uniform on (0, 2π),
and R² has the expon(1/2) law.
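This summary is the basis of a classical way to simulate normal random variables: draw R² as expon(1/2), draw Θ uniformly and independently, and convert back to Cartesian coordinates (a sketch, assuming NumPy; this is essentially the Box–Muller idea):

import numpy as np

rng = np.random.default_rng(11)
n = 10**6
R2 = rng.exponential(2.0, n)                     # R^2 ~ expon(1/2), i.e. mean 2
Theta = rng.uniform(0.0, 2 * np.pi, n)           # Theta ~ unif(0, 2*pi), independent of R

X = np.sqrt(R2) * np.cos(Theta)
Y = np.sqrt(R2) * np.sin(Theta)
print(X.mean(), X.var(), Y.var(), np.corrcoef(X, Y)[0, 1])   # ~0, ~1, ~1, ~0: i.i.d. N(0,1) behaviour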
PROBLEM 18.7 (an open problem). Show that T = X² + Y² satisfies the memoryless property
PROBLEM 18.8 (normal or not?). Let X_1, X_2, . . . , X_{2n} be an even number of i.i.d. random
variables with common law Q. We plot the data points (X_{2i−1}, X_{2i}), i = 1, . . . , n, in two cases:
• Q = Cauchy(0, 1) (which has density (1/π)/(1 + x²)).
• Q = N(0, 1)
• Q = N(0, 1)
Here are the two plots (done for n = 5000):
You may only be interested in the case m = d = 1. But this case is not much simpler than
the general one. To understand it we need the notion of projection, which translates into the
concept of conditional expectation. This will be developed first in the first few Sections of
Chapter 19 and then, in Section 19.7 we will see how to compute the conditional probability
above.
But we don’t develop the concepts of projection and conditional expectation only because
we wish to perform the computation above. It turns out that the concepts are central to the
whole subject itself! This was first realized by Andrey Nikolaevich Kolmogorov , less than
100 years ago,1 and one could argue that it was then that the very foundations of probability
were laid. The vital formula of probability, formula (19.9), will appear in Chapter 19 and
this is something that we use on a daily basis when we deal with the topics specified by the
syllabus of this module, in order to solve applied problems of great significance to society.
We will see an answer to (18.12) in Section 19.7.4 when m = 1 and in Section 19.7.5 when m
is any positive integer.
1
The small monograph containing ideas around conditional expectation was published as Grundbegriffe der
Wahrscheinlichkeitsrechnung, Julius Springer, Berlin, in 1933, and later, in 1956, translated as Foundations of the
Theory of Probability , Chelsea, New York.
Chapter 19
Conditionally
19.1 Motivation
How do I model a sequence of tosses of a coin for which I have absolutely no idea about the
probability of heads? Suppose I toss such a coin 100 times and observe the number of heads.
Do I still have no idea about the probability of heads?
A bus arrives at random during a period of random duration. The duration is an exponential
random variable with a random rate. The rate is a uniform random variable that is uniformly
distributed on a random interval of random length. What does this mean? If we observe
arrivals of buses, what can we say about the random length in the last sentence?
We model randomness by using a probability measure. A probability measure is, so far,
a non-random object. Can I model it as a random object, in which case I can talk about the
probability distribution of this probability measure? And what if this probability distribution
is still unknown? Can I talk about the probability law of the probability distribution of the
probability measure?
Laplace said that if we toss a fair coin twice then the probability of getting 2 heads is 1/3.
Under which model was he right?
If we have a density for a random vector (X, Y), we defined the conditional density of
X given Y in a kind of ad hoc way. Is this compatible with the definition of conditional
probability?
Can we ever define conditional probability given an event of probability zero?
What do we mean by the sentence “conditionally on the value x of a random variable X” if
X has density and so the probability of the event {X = x} is zero?
What exactly does likelihood ratio (a term used by statisticians) mean?
If we pick a point at random in a cube (so its law is proportional to the volume function),
can we condition that it lie on a sphere inside the cube? Does this mean that its law will
be proportional to the sphere area function and thus we have discovered how to define the
second function through the first?
If we have an infinite-dimensional vector X = (X1 , X2 , . . .) of i.i.d. normal random variables
can we talk about the density of X?
If two densities differ at a set of length zero and thus define the same probability measure,
does the value of a density at a given point have any meaning?
Euclidean geometry is a mathematical system consisting of points, lines and planes, together
with various properties such that every two points X, Y, lie on a unique line. The part of the
line between these points is a segment, denoted by XY, that has a certain length |XY|.
Problem (P) above asks to find points B on Π such that |AB| ≤ |AΓ| for any point Γ on Π.
Euclid teaches us that there is a point B on Π such that AB is perpendicular to Π. This
means that if Γ is any other point on Π then AB is perpendicular to BΓ and so the triangle ABΓ
has a right angle at B.
By the Pythagorean theorem, which gives |AΓ|² = |AB|² + |BΓ|², we have |AB|² ≤ |AΓ|²,
so |AB| ≤ |AΓ| and so
If we drop a perpendicular line to Π from A then the point B where this line meets Π is
such that |AB| ≤ |AΓ| for all points Γ on Π.
Since we cannot have two perpendiculars from A to Π we have also shown that
We have thus solved (P) and shown that the solution satisfies
(P1) B∈Π
(P2) AB ⊥ Π
B = E(A|Π),
1
Figures 19.1 and 19.2 are taken from my textbook when I was in secondary school: ΣΠ.Γ. KANEΛΛOΣ,
EYKΛEI∆EIOΣ ΓEΩMETPIA , OE∆B, Athens, 1977.
Figure 19.1: The point B solves the problem (P), it is uniquely defined by (P1)+(P2), it is called
the projection of the point A onto the plane Π, and it is denoted by B = E(A|Π).
since E is the first letter of the Greek word επι, which means onto, as in "project point A onto
the plane Π".
Just as we can define the projection of a point to any plane, we can also define the projection
E(A|γ) of a point A to any line γ.
PROBLEM 19.1 (the three perpendiculars property). Explain why if γ is a line in the plane
Π then
E(E(A|Π)|γ) = E(A|γ). (19.1)
See Figure 19.2.
Answer. Let B = E(A|Π) and Γ = E(B|γ). We will show that Γ = E(A|γ). Pick points ∆ and E on
either side of Γ so that |∆Γ| = |EΓ|. Then |B∆| = |BE| by the Pythagorean theorem applied to
triangles BΓ∆ and BΓE. Applying the Pythagorean theorem once more to triangles AB∆ and
ABE, we find |A∆| = |AE|. Therefore the triangles AΓ∆ and AΓE are equal, in the sense that all
sides and angles are the same. Hence the angles ∠AΓ∆ and ∠AΓE are the same. Since their
sum is π (180o in sexagesimal Sumerian units ), each one must be equal to π/2 (90o ). In other
words AΓ ⊥ γ. Since, also, Γ ∈ γ (by definition), (P1)+(P2) hold and so Γ = E(A|γ).
The identity (19.1) is called the three perpendiculars identity because you can clearly see 3
right angles in Figure 19.2.
If γ1 , γ2 are two lines on a common plane, define
PROBLEM 19.2 (when do projections add?). Let γ1 , γ2 be two distinct lines that meet at a
point O Given a point A in space, consider its projections
– Assume first that γ1 ⊥ γ2 . Then ∠B1 OB2 = π/2. So the quadrilateral OB1 B2 B has 3 right
angles. Since the angles of any planar quadrilateral add up to 2π, it follows that the fourth
angle ∠B1 BB2 = π/2. Hence the quadrilateral is a rectangle and hence a parallelogram.
– Conversely, assume that OB1B2B is a parallelogram. Then ϕ := ∠B1OB2 = ∠B1BB2 (opposite
angles are equal in a parallelogram). Hence the 4 angles of the parallelogram
are π/2, π/2, ϕ, ϕ. They must add up to 2π. Hence ϕ = π/2, so γ1 ⊥ γ2.
Remark 19.1. Everything we said here works in a higher-dimensional Euclidean space V if we
replace Π by any lower-dimensional subspace of V. The identity (19.1) also holds if we
replace γ by any lower-dimensional subspace of Π.
modulo 2π. We say that x, y are perpendicular or orthogonal if |θ(x, y)| = π/2, that is,
hx, yi = 0.
PROBLEM 19.3 (standard inner product on Rd). Explain why ⟨x, y⟩ := Σ_{i=1}^d x_i y_i is an inner product.
Answer. It is clearly bilinear and symmetric. Moreover, ⟨x, x⟩ = Σ_{i=1}^d x_i² is always nonnegative,
and if it is zero then all the x_i are zero.
PROBLEM 19.4 (inner product on Rd). Let R be a d × d covariance matrix such that det(R) ≠ 0.
Define ⟨x, y⟩ := xᵀRy. Explain why this is an inner product.
Answer. Just look at the properties of a covariance matrix: it is symmetric, and positive semidefiniteness implies
xᵀRx ≥ 0 for all x. If det(R) ≠ 0 then, letting S be its square root, that is, R = SSᵀ, we have
det R = (det S)², so det S ≠ 0. But then xᵀRx = xᵀSSᵀx = (Sᵀx)ᵀ(Sᵀx) = ‖Sᵀx‖², and if this is 0 we get Sᵀx = 0.
Since det S ≠ 0, we immediately have x = 0. So ⟨x, y⟩ = xᵀRy is strictly positive definite, and hence an inner product.
We now wish to state and solve Problem (P). To do this, observe that a plane Π in R3
is described by one equation of the form a1 x1 + a2 x2 + a3 x3 = c and a line by two equations
of this form: a11 x1 + a12 x2 + a13 x3 = c1 , a21 x1 + a22 x2 + a23 x3 = c2 . When we are in Rd we can
have k equations of such a form, in which case we describe a flat set in a (d − k)-dimensional
Euclidean space. By setting all coefficients c j on the right-hand side of these equations equal
to 0 (and this is no loss of generality) we make sure that the origin is in this flat set. Such an
object is a linear subspace of Rd . Hence we need to replace Π by a linear subspace of Rd .
So problem (P) has a version stated as follows.
Just as before, we show that the solution is unique and is characterized by conditions
analogous to (P1)+(P2). To do this, let
∆ := inf_{u∈U} ‖x − u‖.
The infimum of a set is its greatest lower bound, so we can find points in the set that are
as close as we like to ∆. So, given any n ∈ N there is u_n ∈ U such that ‖x − u_n‖ ≤ ∆ + 1/n. We
apply the parallelogram identity to x − u_n and x − u_m:
‖u_n − u_m‖² + ‖2x − (u_n + u_m)‖² = 2‖x − u_n‖² + 2‖x − u_m‖².
Since ½(u_n + u_m) ∈ U we have ‖x − ½(u_n + u_m)‖ ≥ ∆ (because ∆ is a lower bound). Hence ‖2x − (u_n + u_m)‖ ≥ 2∆.
If we replace the second term in the last display by (2∆)² we get a smaller quantity. On the
other hand, the right-hand side is at most 2(∆ + 1/n)² + 2(∆ + 1/m)². Hence, simplifying,
‖u_n − u_m‖² ≤ (4∆ + 1)(1/n + 1/m).
This implies that limm,n→∞ kun − um k = 0. Since un belongs to U which has finite dimension, it
follows that
there is u ∈ U such that lim kun − uk = 0. (19.2)
n→∞
But then
∆ ≤ ‖x − u‖ ≤ ‖x − u_n‖ + ‖u_n − u‖ ≤ ∆ + 1/n + ‖u_n − u‖.
Since the right-hand side converges to ∆ as n → ∞, we have
kx − uk = ∆.
We have thus solved (PL). But we do not know if the solution is unique. So suppose
‖x − u′‖ = ∆ = ‖x − u″‖, for u′, u″ ∈ U. Again by the parallelogram identity,
‖u′ − u″‖² + ‖2x − (u′ + u″)‖² = 2‖x − u′‖² + 2‖x − u″‖² = 4∆².
Since ½(u′ + u″) ∈ U, we have ‖2x − (u′ + u″)‖ ≥ 2∆, hence ‖u′ − u″‖² ≤ 4∆² − 4∆² = 0, that is, u′ = u″.
Hence (PL) has a unique solution u, that we call the projection of x onto U and write it as
u = E(x|U).
Let v ∈ U and t ∈ R. Since u + tv ∈ U, we have ‖x − u − tv‖ ≥ ∆ = ‖x − u‖. But
‖x − u − tv‖² = ‖x − u‖² − 2t⟨x − u, v⟩ + t²‖v‖².
Canceling the term ‖x − u‖² we obtain
t⟨x − u, v⟩ ≤ ½ t²‖v‖², for all t ∈ R.
Suppose t > 0. Then, dividing by t,
⟨x − u, v⟩ ≤ ½ t‖v‖², for all t > 0.
Suppose next t < 0. Write t = −s, s > 0. Then, dividing both sides by s,
−⟨x − u, v⟩ ≤ ½ s‖v‖², for all s > 0.
Putting the inequalities together, we have −½ s‖v‖² ≤ ⟨x − u, v⟩ ≤ ½ t‖v‖² for all t > 0, s > 0. Hence
⟨x − u, v⟩ = 0, for all v ∈ U,
(PL1) u∈U
(PL2) x−u⊥U
Answer. We have x_i − E(x_i|U) ⊥ U, i = 1, 2. This means that ⟨x_i − E(x_i|U), u⟩ = 0, for all u ∈ U,
i = 1, 2. Hence ⟨c1(x1 − E(x1|U)) + c2(x2 − E(x2|U)), u⟩ = 0, for all u ∈ U, which means that
c1x1 + c2x2 − (c1E(x1|U) + c2E(x2|U)) ⊥ U. Thus condition (PL2) holds with x replaced by
c1x1 + c2x2 and u replaced by c1E(x1|U) + c2E(x2|U). Obviously, (PL1) holds as well.
Answer. In order for E(x|W) to be the projection of E(x|U) onto W we need (PL1)+(PL2) to
hold, that is, E(x|W) ∈ W (this is obviously true) and E(x|U) − E(x|W) ⊥ W. It is this condition
that we need to verify. We have x − E(x|U) ⊥ U and since W ⊂ U we have x − E(x|U) ⊥ W. On
the other hand, x − E(x|W) ⊥ W. Hence the difference (x − E(x|U)) − (x − E(x|W)) is also ⊥ W.
But the difference is E(x|U) − E(x|W). So we're done.
U = {a1 u1 + a2 u2 + a3 u3 : a1 , a2 , a3 ∈ R}
(2) We would have ⟨uⁱ, uʲ⟩ = 0 for all i ≠ j, and so the equations would be immediately
solvable:
a1 = ⟨x, u¹⟩/⟨u¹, u¹⟩,  a2 = ⟨x, u²⟩/⟨u², u²⟩,  a3 = ⟨x, u³⟩/⟨u³, u³⟩.
(3) We would write u = a1 u1 + · · · + ak uk and we would have k equations in k unknowns,
a1 , . . . , ak . The equations would be linearly independent because, by assumption, u1 , . . . , uk
are linearly independent, and so we would be able to solve uniquely for a1 , . . . , ak .
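A numerical illustration of this recipe (a sketch only, assuming numpy; the subspace U below is the span of three arbitrary vectors in R⁵): we solve the Gram system and then check (PL2).

import numpy as np

rng = np.random.default_rng(1)
U = rng.standard_normal((5, 3))          # columns u1, u2, u3 (almost surely linearly independent)
x = rng.standard_normal(5)
G = U.T @ U                              # Gram matrix with entries <u_i, u_j>
b = U.T @ x                              # right-hand sides <x, u_i>
a = np.linalg.solve(G, b)                # coefficients a1, a2, a3
proj = U @ a                             # the projection E(x|U)
print(np.round(U.T @ (x - proj), 10))    # (PL2): residual orthogonal to every u_i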
PROBLEM 19.8 (projection of a random variable onto two subspaces). Let Ω be a finite
sample space with d outcomes, e.g., Ω = {ω1, . . . , ωd}. Then every function Ω → R is a random
variable. The set of all random variables is R^Ω, which can be thought of as R^d since Ω has d
elements. Now let P be a probability measure on Ω. Hence P is defined by
p_j = P{ω_j}, j = 1, . . . , d,
and we equip R^Ω with the inner product ⟨X, Y⟩ := E(XY) = Σ_{j=1}^d p_j X(ω_j)Y(ω_j), where we assume p_j > 0 for all j.
(If we had not assumed that p_j > 0 for all j then ⟨X, Y⟩ would fail to be strictly positive definite.)
Let X, Y be two random variables. Compute E(X|U) in the following two cases:
(1) U = L(Y), the space of all linear functions of Y. Assume that P(Y ≠ 0) > 0.
(2) U = F(Y), the space of all functions of Y.
Answer. (1) The only linear function of Y is of the form aY for some real number a. So we write
E(X|L(Y)) = aY,
and we seek to find a. The problem is solved by (PL1)+(PL2). Obviously, aY ∈ L(Y), so (PL1)
holds. The second requirement, (PL2), says that X − aY ⊥ Y, that is, E[(X − aY)Y] = 0, that is, E(XY) = aE(Y²).
Since P(Y ≠ 0) > 0 we have E(Y²) > 0, and so we can divide and get a = E(XY)/E(Y²). The
answer therefore is
E(X|L(Y)) = (E(XY)/E(Y²)) Y.
(2) (PL1) says that
E(X|F(Y)) = h(Y),
for some function h, and (PL2) says that, for every g(Y) ∈ F(Y),
E[h(Y)g(Y)] = E[Xg(Y)] = Σ_x x E[1_{X=x} g(Y)].
Taking g(Y) = 1_{Y=y} we get E[h(Y)1_{Y=y}] = Σ_x x E[1_{X=x} 1_{Y=y}]. But E[h(Y)1_{Y=y}] = h(y)E[1_{Y=y}] = h(y)P(Y = y), and E[1_{X=x}1_{Y=y}] = P(X = x, Y = y). Therefore,
h(y)P(Y = y) = Σ_x x P(X = x, Y = y).
Since P assigns positive probability to each ω ∈ Ω, it follows that P(Y = y) > 0 for all y ∈ Y(Ω),
and so we can divide and get
h(y) = Σ_x x P(X = x, Y = y)/P(Y = y).   (19.3)
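A small simulation contrasting the two projections (a sketch only, assuming numpy; the toy pair (X, Y) below is an arbitrary choice):

import numpy as np

rng = np.random.default_rng(2)
N = 200_000
Y = rng.integers(1, 3, size=N)               # Y takes the values 1 or 2
X = Y + rng.integers(0, 2, size=N)           # X = Y or Y + 1, each with probability 1/2
a = np.mean(X * Y) / np.mean(Y ** 2)         # best multiple of Y: E(X|L(Y)) = aY
h = {y: X[Y == y].mean() for y in (1, 2)}    # full conditional expectation h(y), as in (19.3)
print(round(a, 3), {y: round(v, 3) for y, v in h.items()})   # h(y) ~ y + 0.5, a is a single number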
By symmetry,
E(X1|L(Sn)) = · · · = E(Xn|L(Sn)).
Combining the last three displays yields E(Xi|L(Sn)) = Sn/n for each i.
We wish to work with all random variables that have finite variance. We call this space V2.
Since the sum of two random variables with finite variance has finite variance, V2 is a linear
space. Consider next a collection of random variables
Y = (Y1 , Y2 , . . .)
which could be finite or countable or even uncountable–we don’t care, and let
F (Y) := {all random variables that can be written as functions of random variables from Y}
F2 (Y) := {all random variables that can be written as functions of random variables from Y
and have finite variance}
For example, Y1 ∈ F2 (Y), Y12 cos(Y2 + Y3 ) ∈ F2 (Y), limn→∞ Yn ∈ F2 (Y), etc. Since the sum of
two random variables that are functions of members of Y is also a function of members of Y,
the set F2 (Y) is a linear space. In fact,
F2 (Y) ⊂ V2 .
We define a distance on V2 by first defining an inner product. But this we have already done
in Section 16.5, Equation (16.2). We repeat it here:
⟨X, Y⟩ := E(XY).
Another name for this inner product is correlation. Again, see Section 16.5. Through this
inner product we define a distance function:
d(X, Y) := √⟨X − Y, X − Y⟩ = √(E(X − Y)²).
In particular,
‖X‖ = d(X, 0) = √(EX²).
This quantity has three names: (a) norm, (b) distance from the origin, (c) square root of second
moment.
We consider the problem
(PV) Which random variables, if any, from F2 (Y) are closest to a given random variable X ∈ V2 ?
In other words, we wish to solve the problem
∆ := inf{ d(X, Z) : Z ∈ F2(Y) }.
It would be no surprise then if I tell you that Problem (PV) has a unique solution, denoted
by E(X|F(Y)), or, simply, by E(X|Y), and which satisfies the following:
(PV1) E(X|Y) ∈ F2(Y);
(PV2) X − E(X|Y) ⊥ F2(Y), that is, ⟨X − E(X|Y), Z⟩ = 0 for every Z ∈ F2(Y).
Remark 19.2. I am not going to prove this, but I ask you to accept that the explanation is very
similar to the one in Section 19.3. If V is finite-dimensional (which is the case when Ω is a
finite sample space), then we reduce (PV) to (PL).
If V is not finite-dimensional then the only place one has to be careful at is (19.2). This is
actually true (and comes under the name “completeness of the L2 space”).
?PROBLEM 19.10 (projection when densities exist). Let (X, Y) be a random vector in R²
such that both X and Y have finite variance (they are thus elements of V2). Assume also
that (X, Y) has a probability density function f(x, y). Consider F2(Y), the collection of all
functions of Y with finite variance. Let f2(y) be the density of Y. Define the function–see eq. (14.6)–
f_{1|2}(x|y) := f(x, y)/f2(y),
interpreting 0/0 as 0 if it occurs. Explain why
E(X|Y) := E(X|F2(Y)) = ∫ x f_{1|2}(x|Y) dx.
Equivalently,
⟨X, g(Y)⟩ = ⟨∫ x f_{1|2}(x|Y) dx, g(Y)⟩, for any g(Y) ∈ F2(Y).
We are now ready to define conditional expectation.
Definition 19.2 (conditional expectation under finite variance). If X, Y1 , Y2 , . . . are random
variables with finite variance then define the conditional expectation of X given Y1 , Y2 , . . . by
E(X|Y1 , Y2 , . . .) := E X|F (Y1 , Y2 , . . .) .
The condition of finite variance is too restrictive. But we can generalize. A simple
approximation theorem ensures that the conditional expectation can be defined for random
variables X that may have infinite variance, so long as E|X| < ∞.
Theorem 19.1 states that, when E|X| < ∞, there is an (essentially unique) random variable
X̂ = E(X|Y) ≡ E(X|Y1, Y2, . . .)
such that
(1) X̂ ∈ F(Y) and E|X̂| < ∞;
(2) E[XZ] = E[X̂Z] for any bounded random variable Z ∈ F(Y).   (19.5)
PROBLEM 19.11 (conditional expectation with respect to a discrete random variable). Let
X be a random variable with E|X| < ∞ and let Y be a discrete random variable. Explain why
E(X|Y) = Σ_y ( E(X 1_{Y=y}) / P(Y = y) ) 1_{Y=y}.   (19.6)
(If Ω is finite then this formula is the same as E(X|Y) computed in (19.4)+(19.3).)
Answer. We simply have to verify that the right-hand side of (19.6) satisfies conditions (1)+(2)
of Theorem 19.1. It is clear it satisfies (1). To check (2) we need to check that
E[Xg(Y)] = E[ g(Y) Σ_y ( E(X1_{Y=y}) / P(Y = y) ) 1_{Y=y} ],
where g(Y) ∈ F(Y), a function of Y, that is also bounded. We start with the right-hand side
and verify, step-by-step, that it equals the left-hand side:
E[ g(Y) Σ_y ( E(X1_{Y=y}) / P(Y = y) ) 1_{Y=y} ] = Σ_y ( E(X1_{Y=y}) / P(Y = y) ) E[g(Y)1_{Y=y}] = Σ_y ( E(X1_{Y=y}) / P(Y = y) ) g(y) E[1_{Y=y}]
= Σ_y ( E(X1_{Y=y}) / P(Y = y) ) g(y) P(Y = y) = Σ_y E(X1_{Y=y}) g(y) = E[ X Σ_y 1_{Y=y} g(y) ] = E[Xg(Y)].
f_{1|2}(x|y1, . . . , yk) := f(x, y1, . . . , yk) / f2(y1, . . . , yk).
Explain why
E(X|Y1, . . . , Yk) = ∫_{−∞}^{∞} x f_{1|2}(x|Y1, . . . , Yk) dx.
Answer. Since (1) of Theorem 19.1 is obvious, we only need to check (2):
E[Xg(Y1, . . . , Yk)] = E[ g(Y1, . . . , Yk) ∫_{−∞}^{∞} x f_{1|2}(x|Y1, . . . , Yk) dx ].
Answer. (In this problem the joint density is f(x, y) = c x⁻²(x + y)⁻², x ≥ 1, y ≥ 1.) We have
f2(y) = ∫_{−∞}^{∞} f(x, y) dx = c [ (y + 2)/(y²(y + 1)) − 2 log(y + 1)/y³ ] 1_{y≥1}.
We thus have
f_{1|2}(x|y) = f(x, y)/f2(y) = y³(y + 1) / ( [y² + 2y − 2(y + 1) log(y + 1)] x²(x + y)² ) · 1_{x≥1, y≥1}.
And so³
E(X|Y) = ∫_{−∞}^{∞} x f_{1|2}(x|Y) dx = [ Y(Y + 1) log(Y + 1) − Y² ] / [ Y² + 2Y − 2(Y + 1) log(Y + 1) ] · 1_{Y≥1}.
We now continue with a number of important properties of the conditional expectation.
These are so important for everything in probability and statistics, even for understanding
your favorite distributions, that I’m going to devote a separate page for them. So please flip
over and pay attention.
3
To be honest, I didn’t do the computations myself. I hate computations and I always make mistakes anyway.
So I asked my slave to do them. My slave can do lots of procedural things, like complicated integrals, but he/she/it
cannot think. I, on the other hand, like most human beings, prefer to think and think how to think. My slave’s
name is Maple.
E[E(X|Y)] = E(X).
E(X|Y) = X, if X is a function of Y.
E(XZ|Y) = XE(Z|Y), if X is a function of Y.
E(E(X|Y)|Z) = E(X|Z), if Z is a function of Y.
?PROBLEM 19.14 (many properties of the conditional expectation). Explain all the prop-
erties above.
Answer. Simply verify (1)+(2) of Theorem 19.1.
The first equality is due to Property (I)–linearity. Then we used Property (V)–factoring
out known things. Since E(X|Y)² is a function of Y we have E[E(X|Y)²|Y] = E(X|Y)² and
E[XE(X|Y)|Y] = E(X|Y) · E(X|Y). So the alternative expression is
where we used Property (III) for the first term. Next, by the expression of the variance of a
random variable,
where we used Property (III) for the second term. Adding the last two displays together, the
term E[E(X|Y)²] cancels, whence the desired formula (the law of total variance) ensues.
This can be made meaningful, that is, one can arrange it so that, for (almost) every outcome,
the map B ↦ P((X1, . . . , Xn) ∈ B|Y) is a probability measure.
Hence P((X1, . . . , Xn) ∈ B|Y) is a random probability measure and is called the regular con-
ditional distribution or, for brevity, conditional distribution of (X1, . . . , Xn) given Y or
conditionally on Y.
We now have the following very important formula, that we call the vital formula:
P((X1, . . . , Xn) ∈ B, Y ∈ C) = E[ 1_{Y∈C} P((X1, . . . , Xn) ∈ B|Y) ].   (19.9)
PROBLEM 19.16 (tossing i.i.d. coins with random probability of heads). Let Θ be a
unif([0, 1]) random variable. Let ξ = (ξ1 , . . . , ξn ) be a random vector whose distribution,
conditional on Θ, is that of a sequence of i.i.d. Ber(Θ) random variables. Therefore–see (13.1)–the
conditional distribution of ξ given Θ is given by
P(ξ1 = x1, . . . , ξn = xn|Θ) = Θ^{Σᵢ xᵢ} (1 − Θ)^{Σᵢ (1−xᵢ)},  x1, . . . , xn ∈ {0, 1}.
You are asked to determine the density f (θ|ξ) of the conditional distribution of Θ given ξ. Your
answer should be a function of
Sn = Σᵢ₌₁ⁿ ξᵢ = the total number of heads.
Answer. We have
P(Θ ≤ θ|ξ1 = x1, . . . , ξn = xn) = P(Θ ≤ θ, ξ1 = x1, . . . , ξn = xn) / P(ξ1 = x1, . . . , ξn = xn).
The denominator is obtained by setting θ = 1 in the numerator. So we only deal with the
latter. From the vital formula of probability (19.9), and our assumption,
P(Θ ≤ θ, ξ1 = x1, . . . , ξn = xn) = E[ 1_{Θ≤θ} P(ξ1 = x1, . . . , ξn = xn|Θ) ] = E[ 1_{Θ≤θ} Θ^{Σ xᵢ}(1 − Θ)^{Σ(1−xᵢ)} ],
and, since Θ is uniform on [0, 1],
E[ 1_{Θ≤θ} Θ^{Σ xᵢ}(1 − Θ)^{Σ(1−xᵢ)} ] = ∫_0^1 1_{t≤θ} t^{Σ xᵢ}(1 − t)^{Σ(1−xᵢ)} dt = ∫_0^θ t^{Σ xᵢ}(1 − t)^{Σ(1−xᵢ)} dt.
Hence
P(Θ ≤ θ|ξ1 = x1, . . . , ξn = xn) = ∫_0^θ t^{Σ xᵢ}(1 − t)^{Σ(1−xᵢ)} dt / ∫_0^1 t^{Σ xᵢ}(1 − t)^{Σ(1−xᵢ)} dt,
and so, from the fundamental theorem of calculus,
f(θ|x1, . . . , xn) := d/dθ P(Θ ≤ θ|ξ1 = x1, . . . , ξn = xn) = θ^{Σ xᵢ}(1 − θ)^{Σ(1−xᵢ)} / ∫_0^1 t^{Σ xᵢ}(1 − t)^{Σ(1−xᵢ)} dt.
Hence
f(θ|ξ1, . . . , ξn) = d/dθ P(Θ ≤ θ|ξ1, . . . , ξn) = θ^{Σ ξᵢ}(1 − θ)^{Σ(1−ξᵢ)} / ∫_0^1 t^{Σ ξᵢ}(1 − t)^{Σ(1−ξᵢ)} dt
= ( (n + 1)! / (Sn!(n − Sn)!) ) θ^{Sn}(1 − θ)^{n−Sn},  0 ≤ θ ≤ 1,
since ∫_0^1 t^{Sn}(1 − t)^{n−Sn} dt = Sn!(n − Sn)!/(n + 1)!.
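A quick check of this posterior (a sketch only, assuming numpy and scipy are available): the density above is the Beta(Sn + 1, n − Sn + 1) density, which scipy knows.

import numpy as np
from math import comb
from scipy import stats

rng = np.random.default_rng(3)
n = 50
theta = rng.uniform()                         # random probability of heads
xi = rng.random(n) < theta                    # n i.i.d. Ber(theta) tosses, given theta
Sn = int(xi.sum())
theta0 = 0.3                                  # an arbitrary point at which to evaluate the density
manual = (n + 1) * comb(n, Sn) * theta0**Sn * (1 - theta0)**(n - Sn)   # formula above
print(manual, stats.beta.pdf(theta0, Sn + 1, n - Sn + 1))              # the two numbers agree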
19.6 Convolutions
The convolution is an operation between probability measures such that if we take independent
random variables whose laws are the given probability measures, then their convolution is
the law of the sum of the random variables.
If B ⊂ R and t a real number, let
B + t := {x + t : x ∈ B}.
Let X1, X2 be independent random variables with laws Q1, Q2, respectively. Then, conditionally on X2,
P(X1 + X2 ∈ B|X2) = P(X1 ∈ B − X2|X2) = Q1(B − X2).
We therefore have
P(X1 + X2 ∈ B) = E[Q1(B − X2)] = E[Q2(B − X1)] =: (Q1 ∗ Q2)(B),
where the second equality is obtained by interchanging the roles of X1 and X2. The calculation
above depends only on Q1, Q2. The last equality is a symbol for this operation. Its name is the
convolution of the probability measures Q1 and Q2.
Here are some properties:
1.
Q1 ∗ Q2 = Q2 ∗ Q1
2.
Q1 ∗ (Q2 ∗ Q3 ) = (Q1 ∗ Q2 ) ∗ Q3 .
3.
Q ∗ δ0 = Q
We explain: The first and second ones are due to the facts that X1 + X2 = X2 + X1 and
X1 + (X2 + X3 ) = (X1 + X2 ) + X3 . For the last one, recall that δ0 is the law of a random variable
that takes value 0 only. But then X + 0 = X. Due to the second property, we can omit the
parentheses:
Q1 ∗ Q2 ∗ Q3 = Q1 ∗ (Q2 ∗ Q3 ) = (Q1 ∗ Q2 ) ∗ Q3 .
And we can do this for any finite number of probability measures. In particular, we can write
Q^{∗n} := Q ∗ · · · ∗ Q (n times). We also write
Q^{∗0} := δ0,
because then we can write Q^{∗n} = Q^{∗(n−1)} ∗ Q for every n ≥ 1.
?PROBLEM 19.17 (convolutions of densities). Let Qi have density fi . Derive the density f
of Q1 ∗ Q2 .
Answer. The distribution function of Q1 ∗ Q2 is (Q1 ∗ Q2)(−∞, x]. Hence its density is its derivative:
f(x) = d/dx (Q1 ∗ Q2)(−∞, x].
But
(Q1 ∗ Q2 )(−∞, x] = E{Q1 (−∞, x − X2 ]}.
But
f1(t) = d/dt Q1(−∞, t].
Hence
f (x) = E{ f1 (x − X2 )}.
The right-hand side is merely the expectation of a function of X2 and, by the law of the
unconscious statistician,
f(x) = ∫_{−∞}^{∞} f1(x − y) f2(y) dy.
We give a symbol to this:
(f1 ∗ f2)(x) := ∫_{−∞}^{∞} f1(x − y) f2(y) dy  — the density of the sum of the two independent r.v.s.
We have
1.
f1 ∗ f2 = f2 ∗ f1
2.
f1 ∗ ( f2 ∗ f3 ) = ( f1 ∗ f2 ) ∗ f3 .
In the particular case where the random variables are positive, we have
(f1 ∗ f2)(x) = ∫_0^x f1(x − y) f2(y) dy,  if X1, X2 ≥ 0.
Indeed,
(f1 ∗ f2)(x) = ∫_{−∞}^{∞} f1(x − y)1_{x−y≥0} f2(y)1_{y≥0} dy = ∫_{−∞}^{∞} 1_{0≤y≤x} f1(x − y) f2(y) dy = ∫_0^x f1(x − y) f2(y) dy.
PROBLEM 19.18 (sum of two independent exponential r.v.s). Let X, Y be independent, expon(λ), expon(µ),
respectively, with λ ≠ µ. Determine the density of X + Y.
Answer. The densities of X, Y are f(x) = λe^{−λx}1_{x>0} and g(x) = µe^{−µx}1_{x>0}. The density of X + Y is f ∗ g:
(f ∗ g)(x) = ∫_0^x λe^{−λ(x−y)} µe^{−µy} dy = λµ (e^{−µx} − e^{−λx}) / (λ − µ),  x > 0.
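A Monte Carlo sanity check of this convolution formula (a sketch only, assuming numpy; the values λ = 2, µ = 5 and x = 0.4 are arbitrary):

import numpy as np

lam, mu = 2.0, 5.0
rng = np.random.default_rng(4)
s = rng.exponential(1 / lam, 10**6) + rng.exponential(1 / mu, 10**6)   # samples of X + Y
x, h = 0.4, 0.01
formula = lam * mu * (np.exp(-mu * x) - np.exp(-lam * x)) / (lam - mu)
estimate = np.mean(np.abs(s - x) < h) / (2 * h)                        # histogram-style density estimate
print(round(formula, 4), round(estimate, 4))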
PROBLEM 19.19 (sum of independent uniform r.v.s). Let X, Y be independent unif([0, 1]),
unif([1/2, 3/2]), respectively. Determine the density of X + Y and sketch it.
Answer. The densities of the two random variables are 1_{0<x<1} and 1_{1/2<x<3/2}, respectively. Hence the
density of X + Y is
(f ∗ g)(x) = ∫_{−∞}^{∞} f(x − y)g(y) dy = ∫_{−∞}^{∞} 1_{0<x−y<1} 1_{1/2<y<3/2} dy
= ∫_{−∞}^{∞} 1_{x−1<y<x} 1_{1/2<y<3/2} dy = ∫_{−∞}^{∞} 1_{x−1<y, 1/2<y, y<x, y<3/2} dy = ∫_{−∞}^{∞} 1_{max(x−1, 1/2)<y<min(x, 3/2)} dy.
The reason for the last equality is that x − 1 < y, 1/2 < y ⇐⇒ max(x − 1, 1/2) < y; and also that
y < x, y < 3/2 ⇐⇒ y < min(x, 3/2). This is a really trivial integral because it is of the form
∫_{−∞}^{∞} 1_{a<y<b} dy with a = a(x) = max(x − 1, 1/2), b = b(x) = min(x, 3/2). This integral equals b − a,
provided a < b, or zero if not. We can write this as
∫_{−∞}^{∞} 1_{a<y<b} dy = max[b − a, 0].
Hence
h(x) = x − 1/2, if 1/2 ≤ x ≤ 3/2;
h(x) = 5/2 − x, if 3/2 ≤ x ≤ 5/2;
h(x) = 0, otherwise.
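A sketch (assuming numpy) that compares this piecewise formula with a histogram estimate of the density of X + Y:

import numpy as np

rng = np.random.default_rng(5)
s = rng.uniform(0, 1, 10**6) + rng.uniform(0.5, 1.5, 10**6)
def h(x):
    # max(min(x, 3/2) - max(x - 1, 1/2), 0), as derived above
    return max(min(x, 1.5) - max(x - 1, 0.5), 0.0)
for x in (0.7, 1.5, 2.2):
    est = np.mean(np.abs(s - x) < 0.01) / 0.02
    print(x, h(x), round(est, 3))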
Let (X, Y) be a normal random vector in R². We wish to compute E(X|Y). By the linearity
property–Property (I)–we have E(X|Y) = EX + E(X − EX|Y).
By Property (II), we can replace Y by any g(Y) such that g is a bijection. We choose g(Y) = Y − EY:
E(X|Y) = EX + E(X − EX|Y − EY).
So far, we have not used normality. The above holds for any random variables. But now, since
we have a hunch that “normality” and “linearity” somehow go hand in hand, we speculate
that
E(X − EX|Y − EY) = a · (Y − EY).
If we manage to show that (1) and (2) of Theorem 19.1 hold, then we're done. Write, for
brevity, X̃ = X − EX, Ỹ = Y − EY. Obviously, a·Ỹ is a function of Y. So (1) holds. To show
(2), we need to show that
E[X̃ · Z] = E[a · Ỹ · Z]   (19.10)
for any Z that is a function of Ỹ (and hence of Y). We first take Z = Ỹ. We then have
E[X̃ · Ỹ] = a E[Ỹ²].
This gives
a = E[X̃ · Ỹ] / E[Ỹ²] = cov(X, Y) / var(Y),
provided that E[Ỹ²] = var(Y) ≠ 0. (We'll see what happens when var(Y) = 0 later.) With this
choice of a we have
E[(X̃ − aỸ) · Ỹ] = 0 = E[X̃ − aỸ] · E[Ỹ].
Since
(X̃ − aỸ, Ỹ) is normal in R²,
we immediately have
X̃ − aỸ and Ỹ are independent.
Hence, for any function g,
X̃ − aỸ and Z = g(Ỹ) are independent.
Hence
E[(X̃ − aỸ) · Z] = E[X̃ − aỸ] · E[Z] = 0.
But this is precisely (19.10). We're done:
E(X|Y) = a(Y − EY) + EX, with a = E[X̃ · Ỹ]/E[Ỹ²], if var(Y) ≠ 0.
It remains to examine what happens when var(Y) = 0. But when var(Y) = 0 then P(Y = EY) = 1
because (reminder!) P(|Y − EY| > ε) ≤ var(Y)/ε² = 0 for all ε > 0. Since X and a constant are
independent, we have, from Property (V),
E(X|Y) = EX if var(Y) = 0.
Let (X, Y1 , . . . , Yd ) be a normal random vector in R1+d . We wish to compute E(X|Y) when
Y = (Y1 , . . . , Yd ).
Assume that
R = cov(Y) is invertible.
Recall that cov(Y) is the matrix with entries cov(Yi, Yj) = E[Ỹi Ỹj]. Writing E(X|Y) = EX + Σ_{j=1}^d a_j Ỹj
and testing requirement (2) against Z = Ỹi, we get
E[X̃ Ỹi] = Σ_{j=1}^d a_j E[Ỹi Ỹj],  i = 1, . . . , d.
If we let a = (a1, . . . , ad)ᵀ and b = (cov(X, Y1), . . . , cov(X, Yd))ᵀ = (E[X̃Ỹ1], . . . , E[X̃Ỹd])ᵀ, the system above
reads R a = b, and so a = R⁻¹ b.   (19.13)
We omit step 2 because it is identical in spirit to the one in the simple (d = 1) case.
PROBLEM 19.20 (computation of a conditional expectation under normality). Let (X, Y1 , Y2 )
be normal in R3 such that EX = EY1 = EY2 = 0 and
Compute E(X|Y1 , Y2 ).
Answer. We have
E(X|Y1 , Y2 ) = a1 Y1 + a2 Y2 ,
and we need to determine a1 , a2 . By (2) of Theorem 19.1 we must have
EXZ = E[Z(a1 Y1 + a2 Y2 )]
for all functions Z of (Y1 , Y2 ). We simply choose Z = Y1 first and then Z = Y2 to obtain
E(X|Y1 , Y2 ) = Y1 − 2Y2 .
PROBLEM 19.21 (computation of another conditional expectation under normality). Let
(Z1 , Z2 ) be i.i.d. N(0, 1). Let X = Z1 + 4Z2 , Y1 = 3Z1 + 2Z2 , Y2 = Z1 − Z2 . Compute E(X|Y1 , Y2 ).
Answer. Since (X, Y1 , Y2 ) is a linear function of (Z1 , Z2 ) we have that (X, Y1 , Y2 ) is normal in
R3 . Therefore,
E(X|Y1 , Y2 ) = a1 Y1 + a2 Y2 ,
and we need to determine a1, a2. We have E[XYi] = a1E[Y1Yi] + a2E[Y2Yi], i = 1, 2, that is,
11 = 13a1 + a2
−3 = a1 + 2a2.
Solving, a1 = 1, a2 = −2, and so
E(X|Y1, Y2) = Y1 − 2Y2.
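The same computation done numerically (a sketch only, assuming numpy): we solve the 2 × 2 system and check that the residual is (empirically) uncorrelated with Y1, Y2.

import numpy as np

rng = np.random.default_rng(6)
Z = rng.standard_normal((2, 10**6))
X  = Z[0] + 4 * Z[1]
Y1 = 3 * Z[0] + 2 * Z[1]
Y2 = Z[0] - Z[1]
R = np.array([[13.0, 1.0], [1.0, 2.0]])   # E[Y1^2], E[Y1 Y2]; E[Y2 Y1], E[Y2^2]
b = np.array([11.0, -3.0])                # E[X Y1], E[X Y2]
print(np.linalg.solve(R, b))              # -> [ 1. -2.], i.e. E(X|Y1,Y2) = Y1 - 2 Y2
res = X - (Y1 - 2 * Y2)
print(round(np.mean(res * Y1), 3), round(np.mean(res * Y2), 3))   # ~ 0, 0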
By (2) of Theorem 19.1 we have that X − E(X|Y) and Y are uncorrelated, and hence (by
normality) independent. By Property (V), E[(X − E(X|Y))²|Y] = E[(X − E(X|Y))²]. And so
PROBLEM 19.22 (conditional variance computation under normality). Let (X, Y) be normal
in R2 with var(Y) , 0. Compute var(X|Y) in terms of σ21 = var(X), σ22 = var(Y) and
σ1,2 = cov(X, Y).
Answer. We have E(X|Y) = a(Y − EY) + EX, where a = σ1,2/σ2². So
var(X|Y) = E[(X − E(X|Y))²] = E[(X − (a(Y − EY) + EX))²] = E[((X − EX) − a(Y − EY))²]
= var(X) + a² var(Y) − 2a cov(X, Y)
= σ1² + (σ1,2²/σ2⁴) σ2² − 2 (σ1,2/σ2²) σ1,2 = σ1² − σ1,2²/σ2² = (σ1²σ2² − σ1,2²)/σ2².
We can similarly compute var(X|Y) when (X, Y) is normal in R^{1+d}, but we will skip the
computation, only mentioning that, as in the d = 1 case, it is a constant (it does not depend on Y).
From this, we can easily obtain the conditional density, in the case that cov(Y) is invertible.
We claim that all we have to do is remember that
density of N(µ, σ²) = (1/√(2πσ²)) exp( −(x − µ)²/(2σ²) ),
and replace µ by E(X|Y) and σ2 by var(X|Y) (which is a constant). We therefore have that
f(x|Y1, . . . , Yd) = (1/√(2π var(X|Y1, . . . , Yd))) exp( −(x − E(X|Y1, . . . , Yd))² / (2 var(X|Y1, . . . , Yd)) )   (19.15)
is a density for the regular conditional distribution of X given Y = (Y1, . . . , Yd). This means that
P(X ∈ B|Y1, . . . , Yd) = ∫_B f(x|Y1, . . . , Yd) dx.
But we need to explain our claim! That is, we need to justify why, for all B,
P(X ∈ B|Y) = ∫_B f(x|Y) dx,
where f (x|Y) is given by (19.15). Look at PROPERTY B, on page 182, for a moment generating
function. It says that the distribution of a random variable is determined by its moment
generating function. The same is true for conditional distributions. So we need to justify that,
for all t,
E[e^{tX}|Y] = ∫_{−∞}^{∞} e^{tx} f(x|Y) dx.
Since f(x|Y) is a density for N(E(X|Y), var(X|Y)), we have, by the formula 16.9 for the moment
generating function of a normal random variable,
∫_{−∞}^{∞} e^{tx} f(x|Y) dx = e^{tE(X|Y) + ½t² var(X|Y)}.
To verify the claim we test both sides against e^{s1Y1+···+sdYd}, s = (s1, . . . , sd) ∈ Rᵈ; the exponent appearing on the left is then
tX + s1Y1 + · · · + sdYd = tX + sᵀY.
But var(X|Y) is a constant, so it moves outside the expectation. On the other hand E(X|Y) =
aᵀY, that is, a linear function of Y; this was shown in 19.7.2; the coefficients a are given by
(19.13). Hence it is enough to show that, for all t ∈ R and s ∈ Rᵈ,
E[e^{tX + sᵀY}] = E[e^{t aᵀY + ½t² var(X|Y) + sᵀY}].   (19.16)
To do this, we simply notice that the left-hand side is the moment generating function of
the normal random vector (X, Y) = (X, Y1 , . . . , Yd ) in R1+d , while the right-hand side is the
moment generating function of the normal random vector Y = (Y1 , . . . , Yd ) in Rd , so we can
compute them both. This is because we know exactly what the moment generating function
of a multidimensional random vector is. It is given by formula (18.6) or, equivalently, by the
same formula written using matrix notation: (18.8).
Ee^{(ta+s)Y} = e^{½ var((ta+s)Y)} = e^{½(a²t²σ2² + s²σ2² + 2astσ2²)},
e^{½t² var(X|Y)} = e^{½t²(σ1² − a²σ2²)} = e^{½(σ1²t² − a²σ2²t²)},
and so (19.17) is equivalent to (equate the exponents)
σ1²t² + σ2²s² + 2σ1,2st = (σ1²t² − a²σ2²t²) + (a²t²σ2² + s²σ2² + 2astσ2²),
and immediately we see that several terms cancel. Equivalently then,
2σ1,2st = 2astσ2²,
which is true because a = σ1,2/σ2².
Remark 19.3. Verifying (19.16) in the general case, that is when d > 1, is conceptually similar
to the d = 1 case of Problem 19.24. The only additional difficulty is finding out how to write
things compactly using matrix notation.
Remark 19.4. (19.14) provides an answer to the question (18.12) when m = 1.
More generally, when X is a random vector in Rᵐ, the regular conditional distribution of X given Y is
N(E(X|Y), cov(X|Y)).   (19.18)
Remark 19.5. (19.18) provides an answer to the question (18.12) for general m.
Chapter 20
The central limit theorem
When a sequence of numbers an converges to a limit a, we often wonder how fast the convergence is. To answer this, we look for a sequence λ(n)
such that
lim_{n→∞} λ(n)(an − a) = c ≠ 0.   (20.1)
Necessarily,
lim_{n→∞} λ(n) = ∞.
Of course, there can be many such sequences but we are looking for the "simplest" one. We
then say that¹
an converges to a at rate 1/λ(n).
For example, if an = (n + 2)/(n + 1) then an → 1 as n → ∞. But an − 1 = 1/(n + 1), so if we choose λ(n) = n we
have λ(n)(an − 1) = n/(n + 1) → 1 as n → ∞, and so we say that (n + 2)/(n + 1) converges to 1 at rate 1/n.
PROBLEM 20.1 (rate of convergence of approximations to e). Consider the following sequences:
an = (1 + 1/n)ⁿ,  bn = Σ_{k=0}^n 1/k!.
1
Or, more generally, if all limit points of the sequence λ(n)(an − a) are contained in a bounded interval that does
not contain 0.
lim_{n→∞} an = lim_{n→∞} bn = e.
At what rates?
Answer. By Taylor's theorem,
lim_{n→∞} n[(1 + 1/n)ⁿ − e] = −e/2.
By direct computation,
lim n!(e − bn ) = 1.
n→∞
So the rate of the convergence an → e is 1/n, while the rate of the convergence bn → e is 1/n!.
Since n! is much much larger than n, the second convergence is much faster. Indeed, if you
were to approximate e = 2.718281828459 · · · numerically, please use the second approximation.
Voilà:
n    an            bn
1    2.            1.
2    2.250000000   2.
3    2.370370370   2.5
4    2.441406250   2.666666667
5    2.488320000   2.708333333
6    2.521626372   2.716666667
7    2.546499697   2.718055556
8    2.565784514   2.718253968
9    2.581174792   2.718278770
10   2.593742460   2.718281526
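A sketch (plain Python) that recomputes the two sequences and the two rates; note that the bn column in the table above appears to be shifted by one index relative to the definition bn = Σ_{k=0}^n 1/k!.

from math import e, factorial

for n in range(1, 11):
    a = (1 + 1 / n) ** n
    b = sum(1 / factorial(k) for k in range(n + 1))
    # last two columns illustrate n(a_n - e) -> -e/2 and n!(e - b_n) -> 1
    print(n, round(a, 9), round(b, 9), round(n * (a - e), 4), round(factorial(n) * (e - b), 4))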
When we have sequences of random variables there are two subtleties: First, what do we
mean by limit? Second, what do we mean by rate of convergence?
These questions are of paramount importance because probability and statistics is, in a
sense, all about approximations and, therefore, all about limits and their convergence rates.
So we must understand, at least a little bit, what these concepts mean.
P( lim_{n→∞} Sn/n = µ ) = 1.
Xn converges to X in distribution if P(Xn ∈ I) → P(X ∈ I) for all intervals I for which the probability that X equals one of the endpoints of I is zero.
We have
Xn converges to X strongly ⇒ Xn converges to X in probability ⇒ Xn converges to X in distribution.   (20.2)
The first implication was dealt with in Problem 20.4. The second implication is a little bit
more subtle and we explain it here, by first asking you to do a simple problem that only
requires elementary concepts.
PROBLEM 20.5 (comparing distribution functions). Suppose that X and Y are random
variables with distribution functions F(x), G(x), respectively. Assume that
P(|X − Y| ≥ ε) ≤ δ.
Show that G(x − ε) − δ ≤ F(x) ≤ G(x + ε) + δ for all x.
Applying this with X = Xn (distribution function Fn) and Y = X (distribution function F), where convergence in probability gives P(|Xn − X| ≥ ε) ≤ δ
for all large n, we obtain F(x − ε) − δ ≤ Fn(x) ≤ F(x + ε) + δ for all large n. Letting ε and δ converge to 0 we have that F(x + ε) + δ → F(x) while F(x − ε) − δ → F(x−).
Hence if F is continuous at x then F(x) = F(x−) and so Fn(x) converges to F(x).
We finally state a theorem without proof: if the moment generating functions of the Xn converge, for every t in a neighborhood of 0, to M(t),
where M(t) is the moment generating function of a random variable X, then Xn converges to X in
distribution.
Sn/n → µ strongly.
To find a rate of convergence we need to look for a sequence λ(n) such that
λ(n)(Sn/n − µ) → Z, in some sense,   (20.3)
where Z ≠ 0. We are thus stating (20.1) with Z replacing c. In (20.1) the sequence an was not
random and hence c was not random. Here, the sequence Sn/n is random and so Z is random.
Let us try to see what Z could be.
Since Sn/n − µ = (Sn − E(Sn))/n, we are looking at normalized, centered sums of i.i.d. random variables, and it is reasonable to guess that
Z is normal.
Theorem 20.2 (the classical central limit theorem). Let X1 , X2 , . . . be a sequence of i.i.d.
random variables with
µ = EX1 , σ2 = var X1 < ∞.
Then
(Sn − ESn)/√(var Sn) = (Sn − nµ)/(σ√n) → Z, in distribution,
where Z is a N(0, 1) random variable.
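A simulation sketch of the theorem (assuming numpy; the exponential distribution and the sample size n = 400 are arbitrary choices):

import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(7)
n, reps = 400, 20_000
X = rng.exponential(1.0, (reps, n))                 # i.i.d. with mu = 1, sigma = 1
T = (X.sum(axis=1) - n * 1.0) / (1.0 * sqrt(n))     # (S_n - n mu)/(sigma sqrt(n))
Phi = lambda t: 0.5 * (1 + erf(t / sqrt(2)))        # standard normal cdf
for t in (-1.0, 0.0, 1.0, 2.0):
    print(t, round(np.mean(T <= t), 3), round(Phi(t), 3))   # empirical vs Phi(t)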
We will explain this in a very special case: Assume that X1 has a non-useless moment
generating function. Define
m(t) := Eet(X1 −µ)/σ
defined for at least one t ≠ 0. Then (Sn − nµ)/(σ√n) has a non-useless moment generating function,
E exp( t (Sn − nµ)/(σ√n) ) = m(t/√n)ⁿ.
The random variable Z, being N(0, 1), has moment generating function given by
Ee^{tZ} = e^{½t²}.
But
m′(0) = E((X1 − µ)/σ) = 0,  m″(0) = E(((X1 − µ)/σ)²) = 1.
We have
lim_{n→∞} n log m(t/√n)/t² = lim_{n→∞} log m(t/√n)/(t/√n)² = lim_{s→0} log m(s)/s² = lim_{s→0} m′(s)/(2s m(s)) = lim_{s→0} m″(s)/2 = m″(0)/2 = 1/2,
where the last two limits are by l'Hôpital's rule (and m(s) → m(0) = 1). Hence m(t/√n)ⁿ → e^{t²/2} = Ee^{tZ}, and so (Sn − nµ)/(σ√n) → Z in distribution.
Let
ϕ(x) := (1/√(2π)) e^{−x²/2}
be the density of a standard normal N(0, 1) random variable, and let
Φ(x) := ∫_{−∞}^x ϕ(y) dy
A p-confidence interval for θ is a pair of statistics A ≤ B such that
Pθ(A ≤ θ ≤ B) ≥ p.
Typically, p is a large probability (e.g., 0.95). The idea here is that θ is unknown and we wish to
find it. We need to find random variables A and B that are statistics (we cannot make them
depend on the unknown θ, that's why they should be statistics) such that the above holds,
meaning that we are pretty confident that θ will lie in [A, B]. This problem may or may not
have a solution. Below is a case where it does.
3
This means that we should think of X as a function X : Θ × Ω → S with the property that X(θ, ω) = X(θ0 , ω)
for all ω ∈ Ω and all θ, θ0 ∈ Θ.
4
Statistics has, grammatically, a twofold meaning. First, it refers to the “subject of statistics”; second, it is “the
plural of the noun ‘statistic’ ”.
Let Pµ be the probability distribution of a random variable with mean µ and finite variance
σ2 .
Let X1, X2, . . . be i.i.d. random variables with common distribution Pµ. The central limit
theorem, proved above, tells us that, for a > 0,
Pµ( −a ≤ (√n/σ)(Sn/n − µ) ≤ a ) → Φ(a) − Φ(−a) = 2Φ(a) − 1, as n → ∞,
where Φ is the cumulative distribution function of a standard normal. Since
−a ≤ (√n/σ)(Sn/n − µ) ≤ a ⇐⇒ Sn/n − aσ/√n ≤ µ ≤ Sn/n + aσ/√n,
we rewrite the above limit as
Pµ( Sn/n − aσ/√n ≤ µ ≤ Sn/n + aσ/√n ) → 2Φ(a) − 1, as n → ∞.
We now let a = a(p) be defined as the unique solution of
2Φ(a) − 1 = p
and make an approximation (which may be nonsense in practice if, say, n is not big enough):
Pµ( Sn/n − aσ/√n ≤ µ ≤ Sn/n + aσ/√n ) ≈ 2Φ(a) − 1 = p.
We then have good reasons to declare that,
if a = a(p) is given by Φ(a(p)) = (p + 1)/2, then the interval [Sn/n − aσ/√n, Sn/n + aσ/√n] is (approximately) a p-confidence interval for µ.
5
What is the mathematical meaning of the phrase “to know”?
So we think that we might get a good approximation for the unknown σ² if we replace it by
(1/n) Σ_{j=1}^n (Xj − µ)² for some large fixed n. But hold on! This is not a statistic because it depends
on the unknown µ.
So, make yet another approximation and replace µ by Sn /n. After all,
P( lim_{n→∞} Sn/n = µ ) = 1.   (L2)
(the intersection of two events of probability 1 each has probability 1) and, by the little algebra
above, (L3) holds.
Since Sn and sn are statistics, we can now state:
If a = a(p) is given by Φ(a(p)) = (p + 1)/2, then the interval [Sn/n − a sn/√n, Sn/n + a sn/√n] is (approximately) a p-confidence interval for µ.
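A sketch of the recipe in the box (assuming numpy; the data below are simulated, and X.std() computes exactly the empirical standard deviation sn defined via Problem 20.6):

import numpy as np
from math import sqrt

rng = np.random.default_rng(8)
n = 200
X = rng.exponential(3.0, n)        # data with (unknown-to-us) mean 3
Sn, s_n = X.sum(), X.std()         # X.std() = sqrt((1/n) sum X_j^2 - (S_n/n)^2)
a = 2.58                           # Phi(2.58) ~ 0.995, i.e. p ~ 0.99
print(Sn / n - a * s_n / sqrt(n), Sn / n + a * s_n / sqrt(n))   # approximate 0.99-confidence interval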
Of course, the above involved heuristics that can, with a bit of effort, be justified rigorously.
PROBLEM 20.6 (empirical standard deviation). Justify the formula
s²n = (1/n) Σ_{j=1}^n Xj² − (Sn/n)²
using the concept of empirical probability measure (See Section 8.3), thereby justifying the
“little algebra” above.
Answer. If x1, . . . , xn are real numbers then the empirical probability measure assigns probability 1/n to each xi, so, as we know from Section 8.3, we can write it as
P̂n = (1/n) Σ_{j=1}^n δ_{x_j},
where δx is a probability measure on R such that δx(B) = 1_{x∈B}, for all B ⊂ R. If we define a
random variable Y on {1, . . . , n} by Y(i) = xi and let P be the uniform probability measure on
{1, . . . , n}, we have that the distribution of Y under P is P̂n because
P(Y ∈ B) = P{i : Y(i) ∈ B} = Σ_{x∈B} P{i : Y(i) = x} = Σ_{x∈B} (1/n) Σ_{i=1}^n 1_{x_i = x} = (1/n) Σ_{i=1}^n 1_{x_i ∈ B} = P̂n(B).
Hence E(Y) = (1/n) Σ_{j=1}^n x_j is the sample mean of (x1, . . . , xn). Similarly, E(Y²) = (1/n) Σ_{j=1}^n x_j² is the
sample second moment of (x1, . . . , xn) and var(Y) = (1/n) Σ_{j=1}^n (x_j − E(Y))² is the sample variance
of (x1, . . . , xn). But
var(Y) = E(Y²) − (E(Y))².
This explains the formula.
PROBLEM 20.7 (confidence interval for the parameter p of a geo(p) distribution). Let
X1, . . . , Xn be i.i.d. random variables with common distribution geo(p), where p is unknown.
Devise an experiment in order to estimate p and give me a 0.99-confidence interval for it.
Answer. Since the number of tosses until the first heads, when tossing a coin whose probability of
heads (= success) is p, has the geo(p) distribution, do the following: Pick a coin and toss it until you first get
heads. Let X1 be the number of tosses required. Then toss again and let X2 be the number of
tosses required until the next success. Do this n times, for, say, n = 100 (or more, if you have
the stamina) and compute Sn and sn. With confidence level 0.99 we have (0.99 + 1)/2 = 0.995 and we find,
using a computer or a table, that Φ(a) = 0.995 is solved by a = 2.58. We conclude that the mean µ = EX1 = 1/p lies in the
interval [Sn/n − 2.58 sn/√n, Sn/n + 2.58 sn/√n] with probability approximately 0.99; inverting, p lies between the reciprocals of the endpoints.
PROBLEM 20.8 (continuation of Problem 20.7). I sat down last night and tossed a coin lots
and lots of times and stopped when I got tired. I observed that
k 1 2 3 4 5 6 7 8 9 15
nk 31 20 19 8 10 2 2 3 4 1
meaning that there were exactly 31 i’s for which Xi = 1, and 20 i’s for which Xi = 2, etc.
What is the 0.99-confidence interval of p?
Answer. We compute the quantities needed, by first noticing that n = Σ_k n_k = 100.
Sn/n = X̄n = (1/n) Σ_{i=1}^n Xi = (1/100) Σ_k k n_k = 311/100 = 3.11,
s²n = (1/n) Σ_{i=1}^n Xi² − 3.11² = (1/100) Σ_k k² n_k − 3.11² = 15.71 − (3.11)² = 6.038.
Hence
Sn/n ± 2.58 sn/√n = 3.11 ± 0.63, that is, the interval [2.48, 3.74].
Hence the mean 1/p lies between 2.48 and 3.74 with probability approximately 0.99; equivalently, p lies between 1/3.74 ≈ 0.27 and 1/2.48 ≈ 0.40.
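The same arithmetic, done by machine (a sketch, plain Python/numpy), including the inversion of the endpoints to bracket p itself:

import numpy as np
from math import sqrt

k  = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 15])
nk = np.array([31, 20, 19, 8, 10, 2, 2, 3, 4, 1])
n = nk.sum()                                   # 100
mean = (k * nk).sum() / n                      # 3.11
s2 = (k**2 * nk).sum() / n - mean**2           # 6.0379
half = 2.58 * sqrt(s2) / sqrt(n)               # half-width ~ 0.63
print(mean - half, mean + half)                # interval for 1/p
print(1 / (mean + half), 1 / (mean - half))    # corresponding interval for p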
PROBLEM 20.9 (a useful approximation for the normal distribution function). Show that
if ϕ(x) is the density of a standard normal random variable Z, then
1 − Φ(x) = P(Z > x) ≤ ϕ(x)/x.
Answer. We have
P(Z > x) = ∫_x^∞ ϕ(y) dy.
The integral is over all y > x. That is, y/x ≥ 1. Hence
P(Z > x) ≤ ∫_x^∞ (y/x) ϕ(y) dy = (1/x) ∫_x^∞ y ϕ(y) dy.
Since ϕ(y) = Ce^{−y²/2} we have ϕ′(y) = −yϕ(y). Hence
∫_x^∞ yϕ(y) dy = −∫_x^∞ ϕ′(y) dy = −lim_{b→∞} ∫_x^b ϕ′(y) dy = −lim_{b→∞} [ϕ(b) − ϕ(x)] = ϕ(x),
and so P(Z > x) ≤ ϕ(x)/x.
Chapter 21
Special distributions used in statistics
We define the distribution gamma(λ, n), for a positive integer n, as the law of the sum of n i.i.d.
expon(λ) random variables. We define the distribution gamma(λ, α), for positive real numbers λ and α, as the
distribution whose density is an analytic extension of the density of gamma(λ, n).
We will derive formulas for the densities of both distributions. But first, let us mention that
the word “analytic” means something much stronger than the phrase “it possesses derivatives
of all orders”.
How do we know that these two distributions possess a density? Well, we know that the
sum of independent random variables with densities has a density that is the convolution
of the individual densities. To make life simple, we first assume that λ = 1. Then f (x) = e−x ,
x > 0, is a density for each of the variables X1 , X2 , . . . Hence the density of X1 + X2 is
f^{∗2}(x) = (f ∗ f)(x) = ∫_0^x e^{−(x−y)} e^{−y} dy = xe^{−x},  x > 0.
The density of X1 + X2 + X3 is
f^{∗3}(x) = (f ∗ f^{∗2})(x) = ∫_0^x e^{−(x−y)} y e^{−y} dy = e^{−x} ∫_0^x y dy = ½ x² e^{−x}.
The pattern soon becomes clear, and we therefore guess that X1 + · · · + Xn has density
f^{∗n}(x) = x^{n−1} e^{−x}/(n − 1)!,  x > 0.   (gamma(1, n) density)
PROBLEM 21.1 (correctness of the gamma(1, n) density). Use induction to show that the
formula for the gamma(1, n) density is correct.
Answer. The formula is correct for n = 1. Suppose it is correct up to n − 1. We then just need
to check that f ∗ f^{∗(n−1)} = f^{∗n}. We have
(f ∗ f^{∗(n−1)})(x) = ∫_0^x e^{−(x−y)} (y^{n−2} e^{−y}/(n − 2)!) dy = (e^{−x}/(n − 2)!) ∫_0^x y^{n−2} dy = (e^{−x}/(n − 2)!) · x^{n−1}/(n − 1) = x^{n−1} e^{−x}/(n − 1)!.
We then take a bold step and replace n in the exponent inside the integral by a real number α
and make a definition.
Definition 21.1. For α > 0, let Γ(α) := ∫_0^∞ x^{α−1} e^{−x} dx.
PROBLEM 21.2 (domain of the gamma function). Show that Γ(α) is defined for α > 0 and
that we can differentiate it with respect to α as many times as we like.
Answer. Note that
x^{α−1} e^{−x} ≤ x^{α−1}, for all x > 0,
while, since x^{α−1} e^{−x/2} is bounded on [1, ∞), there is a constant Cα such that
x^{α−1} e^{−x} ≤ Cα e^{−x/2}, for all x ≥ 1.
Hence
∫_0^∞ x^{α−1} e^{−x} dx ≤ ∫_0^1 x^{α−1} dx + Cα ∫_1^∞ e^{−x/2} dx < ∞.
So Γ(α) < ∞ for all α > 0. (But Γ(0) = ∞.) Moreover, since x^{α−1} is a smooth function of α, so is Γ(α).
Since we have
1 = ∫_0^∞ (1/Γ(α)) x^{α−1} e^{−x} dx,
the function inside the integral is positive and has integral 1. Hence it is a probability density
function. We therefore define
fα(x) := (1/Γ(α)) x^{α−1} e^{−x},  x > 0.   (gamma(1, α) density)
λ f^{∗n}(λx) = (λⁿ/(n − 1)!) x^{n−1} e^{−λx},  x > 0,   (gamma(λ, n) density)
λ fα(λx) := (λ^α/Γ(α)) x^{α−1} e^{−λx},  x > 0.   (gamma(λ, α) density)
Since the latter is a density, its integral over the whole space must be 1, and so
∫_0^∞ x^{α−1} e^{−λx} dx = Γ(α)/λ^α,  α > 0, λ > 0.   (21.1)
PROBLEM 21.3 (gamma reproduction rule). Explain why Γ(α) = (α − 1)Γ(α − 1) when α > 1.
Answer. By Definition 21.1, and the fact that (d/dx) e^{−x} = −e^{−x}, we have
Γ(α) = −∫_0^∞ x^{α−1} (d/dx) e^{−x} dx.
The integration by parts formula says that ∫_0^∞ f g′ dx = −∫_0^∞ f′ g dx if f(x)g(x) has value 0
(interpreted as a limit) at x = 0 and x = ∞. We apply this to f(x) = x^{α−1} and g(x) = e^{−x}.
Since the exponential function drops to 0 faster than any power, we have f(x)g(x)|_{x=∞} =
lim_{x→∞} x^{α−1}e^{−x} = 0. Since α − 1 > 0, we have f(x)g(x)|_{x=0} = 0^{α−1} e^{−0} = 0 · 1 = 0. Therefore,
Γ(α) = ∫_0^∞ ( (d/dx) x^{α−1} ) e^{−x} dx = ∫_0^∞ (α − 1) x^{α−2} e^{−x} dx = (α − 1)Γ(α − 1).
Since
Γ(n) = (n − 1)!, if n is a positive integer
the previous display is merely an extension of this property of the factorial function.
4. We also have
lim_{t→∞} Γ(a + t)/(Γ(t) t^a) = 1.
For a positive integer a this follows from the reproduction rule:
Γ(a + t)/(Γ(t) t^a) = ((a − 1 + t)/t)((a − 2 + t)/t) · · · ((1 + t)/t)(t/t) → 1 as t → ∞,
because each of the a fractions converges to 1 as t → ∞.
We claim that the distribution χ²(d) has a density given by the formula
f_d(x) = (1/(2^{d/2} Γ(d/2))) x^{(d/2)−1} e^{−x/2}.   (χ²(d) density)
?PROBLEM 21.5 (density for the χ2 (d) distribution when d is even). Derive the density for
the χ2 (d) distribution when d is even.
Answer. From Section 18.7, we know that the d/2 random variables
Z1² + Z2², Z3² + Z4², . . . , Z_{d−1}² + Z_d²
are i.i.d. and each one expon(1/2). Their sum is therefore gamma(1/2, d/2). So if we plug in
λ = 1/2 and n = d/2 in the formula for the gamma(λ, n) density we find the χ²(d) density announced above.
?PROBLEM 21.6 (density for the χ2 (d) distribution for general d). Derive the formula for
the density of the χ2 (d) probability measure.
Answer. If d is odd then d − 1 is even. So
Z1² + · · · + Z_d² = (Z1² + · · · + Z_{d−1}²) + Z_d²
has the distribution of the sum of two independent random variables, one with gamma(1/2, (d − 1)/2)
distribution and one distributed as the square of a N(0, 1) random variable. To find the density of the sum, we perform a convolution, as in Section
19.6, and arrive again at the formula for χ²(d) announced above.
The number d in χ2 (d) is called “degrees of freedom”. This is a word that means “dimension”.
And there is more to it than meets the eye. Suppose we have X = (X1 , . . . , Xd ) that is normal
in Rd . Remember the definition of the covariance matrix
R = cov(X),
which, along with the expectations of X1 , . . . , Xd , which will be assumed to be zero, determines
the distribution of X. The matrix R is symmetric but, in general, it is not invertible. But Linear
Algebra tells us that there is a matrix S of dimension d × r, where
r = rank(R),
such that R = SSᵀ, and then X has the same law as
X = SZ,
where Z = (Z1, . . . , Zr) is a vector of i.i.d. standard normal random variables.
We then say that X has r degrees of freedom because X takes values in a linear space of
dimension r. This space is precisely the set
{Ru : u ∈ Rd }.
We then have
X1² + · · · + Xd² = Σ_{j=1}^r λj Zj²,   (eig)
where λ1, . . . , λr are the nonzero eigenvalues of R.
PROBLEM 21.7 (density of χ2 (4; a, a, b, b)). Derive the density of χ2 (4; a, a, b, b).
Answer. The idea is this. χ2 (4; a, a, b, b) is the density of X = a(Z21 + Z22 ) + b(Z23 + Z24 ), where
Z1 , Z2 , Z3 , Z4 are i.i.d. N(0, 1) r.v.s. But S = Z21 + Z22 , T = Z23 + Z24 are i.i.d. expon(1/2). Hence aS
is expon(1/2a) and bT is expon(1/2b). So the density of aS is (1/2a)e−x/2a 1x>0 and the density
of bT is (1/2b)e−x/2b 1x>0 . Thus the density of X = aS + bT is a convolution:
f(x) = ∫_0^x (1/2a) e^{−(x−y)/2a} (1/2b) e^{−y/2b} dy.
We derive
f(x) = (1/(2a − 2b)) ( e^{−x/2a} − e^{−x/2b} ),  a ≠ b,
f(x) = (x/(4a²)) e^{−x/2a},  a = b.
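A Monte Carlo check of these two formulas (a sketch only, assuming numpy; a = 1, b = 3 are arbitrary choices):

import numpy as np

rng = np.random.default_rng(9)
a, b = 1.0, 3.0
Z = rng.standard_normal((4, 10**6))
X = a * (Z[0]**2 + Z[1]**2) + b * (Z[2]**2 + Z[3]**2)     # chi^2(4; a, a, b, b)
def f(x):  # the derived density for a != b
    return (np.exp(-x / (2 * a)) - np.exp(-x / (2 * b))) / (2 * a - 2 * b)
for x in (2.0, 6.0, 12.0):
    est = np.mean(np.abs(X - x) < 0.05) / 0.1             # histogram-style density estimate
    print(x, round(float(f(x)), 4), round(est, 4))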
Observe that the parameter d in the χ2 (d) distribution is not a scaling parameter. (In fact,
the dependence on d of the density fd (x) of the χ2 (d) distribution is not simple.) So if X is a
random variable with χ2 (d) distribution then, even though E(X/d) = 1, the distribution of X/d
still depends on d. Nevertheless, we do like X b := X/d better than X (because its expectation
becomes independent of d).
In statistics, people use ratios of independent random variables, each being a normalized
chi-squared variable. We therefore give a special name to such a ratio: if Um, Vn are independent, Um with χ²(m) and Vn with χ²(n) distribution, the law of
Wm,n := (Um/m)/(Vn/n)   (21.2)
is called F(m, n).
To compute the density of Wm,n, first recall the density fd(x) of the χ²(d) distribution:
fd(x) = Cd x^{d/2 − 1} e^{−x/2}, where Cd = (2^{d/2} Γ(d/2))^{−1}.
We therefore have
P(Wm,n ≤ x) = P( (1/m)Um ≤ x (1/n)Vn ) = E[ P( Um ≤ (m/n) x Vn | Vn ) ],
and so the density of Wm,n is
fm,n(x) = d/dx P(Wm,n ≤ x) = E[ (m/n) Vn fm((m/n) x Vn) ] = ∫_0^∞ (m/n) y fm((m/n) x y) fn(y) dy.
We can make this a bit tidier by replacing y by ny in the integral (change of variables):
fm,n(x) = mn ∫_0^∞ y fm(mxy) fn(ny) dy.
Plugging in the densities and using (21.1),
fm,n(x) = mn Cm Cn (mx)^{m/2 − 1} n^{n/2 − 1} ∫_0^∞ y^{(m+n)/2 − 1} e^{−(mx+n)y/2} dy
= mn Cm Cn (mx)^{m/2 − 1} n^{n/2 − 1} Γ((m + n)/2) / ((mx + n)/2)^{(m+n)/2}
= ( Γ((m + n)/2) / (Γ(m/2)Γ(n/2)) ) m^{m/2} n^{n/2} x^{m/2 − 1} (mx + n)^{−(m+n)/2}.
We can shuffle the terms around until we reach a more symmetric form, just so it is more
pleasing to the eye:
fm,n(x) = ( Γ((m + n)/2) m^{m/2} n^{n/2} / (Γ(m/2)Γ(n/2)) ) · (m + n x^{−1})^{−m/2} (mx + n)^{−n/2} / x,  x > 0.   (F(m, n) density)
PROBLEM 21.8 (moments for F(m, n)). Let Wm,n be a random variable whose law is F(m, n).
(1) Let k be a positive integer. For which values of m, n does Wm,n have a finite k-th moment?
(2) For which values of m, n does Wm,n have a non-useless moment generating function?
Answer. (1) Notice that for large x we can ignore x^{−1} from the term (m + nx^{−1})^{−m/2} and write
mx + n ≈ mx, so
fm,n(x) ≈ const. x^{−n/2 − 1}, when x is large.
This is sloppy, but we can easily see that
lim_{x→∞} x^{n/2 + 1} fm,n(x) = const. ≠ 0,
so the rate of convergence (see Section 20.1) of lim_{x→∞} fm,n(x) = 0 is x^{−n/2 − 1}. On the other hand,
fm,n(x) ≈ const. x^{m/2 − 1}, when x is small.
We now have
E Wm,n^k < ∞ ⇐⇒ ∫_0^1 x^k fm,n(x) dx < ∞ and ∫_1^∞ x^k fm,n(x) dx < ∞.
The first integral is always finite; the second is finite if and only if k − n/2 − 1 < −1, that is, k < n/2.
(2) Since the density has a polynomial tail, E e^{tWm,n} = ∞ for every t > 0, so the moment generating function is non-useless for no values of m and n.
Remark 21.2. For uses of the F(m, n) distribution in statistics see David Williams, Weighing
the Odds, 2012; page 301, “the classical F-test” .
(1) Show that
P( lim_{n→∞} Wm,n = (1/m) Um ) = 1.
(2) Use this, or otherwise, to show that
lim_{n→∞} fm,n(x) = ( (m/2)^{m/2} / Γ(m/2) ) x^{m/2 − 1} e^{−mx/2}.
Answer. (1) We look at the definition (21.2). We apply the strong law of large numbers to the
denominator:
P( lim_{n→∞} (1/n) Vn = 1 ) = 1.
This is because n1 Vn = n1 (Z21 + · · · + Z2n ), where Z1 , . . . , Zn are i.i.d. N(0, 1) and so the strong law
of large numbers says that n1 (Z21 + · · · + Z2n ) converges to EZ21 = 1 with probability 1. Since,
with probability 1, the denominator of (21.2) converges to 1, and since the numerator does not
depend on n, we have that Wm,n converges to m1 Um with probability 1.
(2) Since strong convergence implies convergence in distribution, see (20.2), we have that
Wm,n → (1/m) Um as n → ∞, in distribution.
Therefore the probability distribution function of Wm,n converges to the probability distribution
function of (1/m)Um. We actually have that the density function of Wm,n converges to the density
function of (1/m)Um. But Um has χ²(m) density:
fm(x) = (1/(2^{m/2} Γ(m/2))) x^{m/2 − 1} e^{−x/2}.
We copied this formula from (χ²(d) density). But then (1/m)Um has density m fm(mx). So
lim_{n→∞} fm,n(x) = m fm(mx) = ( (m/2)^{m/2} / Γ(m/2) ) x^{m/2 − 1} e^{−mx/2}.
?PROBLEM 21.10 (limit of F(m, n) when m → ∞). Compute the limit of fm,n (x) as m → ∞.
Answer. From the strong law of large numbers we have
P( lim_{m→∞} Wm,n = 1/(Vn/n) ) = 1.
Therefore,
Wm,n → n/Vn as m → ∞, in distribution.
So
lim_{m→∞} P(Wm,n ≤ x) = P(n/Vn ≤ x) = P(Vn ≥ n/x) = 1 − P(Vn ≤ n/x),
and, differentiating with respect to x,
lim_{m→∞} fm,n(x) = (n/x²) fn(n/x) = ( (n/2)^{n/2} / Γ(n/2) ) x^{−n/2 − 1} e^{−n/(2x)}.
Since the probability of the intersection of two events that have probability 1 also has
probability 1, we have
P( lim_{n→∞} Mn = µ, lim_{n→∞} (1/n) Σ_{i=1}^n (Xi − µ)² = σ² ) = 1.   (21.3)
Let us rewrite Vn as
Vn = (1/n) Σ_{i=1}^n (Xi − µ − Mn + µ)² = (1/n) Σ_{i=1}^n (Xi − µ)² − (Mn − µ)² = (1/n) Σ_{i=1}^n (Xi − µ)² − ( (1/n) Σ_{i=1}^n (Xi − µ) )².
Therefore,
P( lim_{n→∞} Vn = σ² ) = 1.
Notice that
EVn = (1/n) Σ_{i=1}^n E[(Xi − µ)²] − (1/n²) E[ ( Σ_{i=1}^n (Xi − µ) )² ] = (1/n) · nσ² − (1/n²) · nσ² = ((n − 1)/n) σ².
Hence if we let
Ṽn := (1/(n − 1)) Σ_{i=1}^n (Xi − Mn)²,
we have
EṼn = σ²,  P( lim_{n→∞} Ṽn = σ² ) = 1.
n→∞
We now have:
1) Mn and Ṽn are independent;
2) Mn is N(µ, σ²/n);
3) ((n − 1)/σ²) Ṽn is χ²(n − 1).
But Mn − µ = (1/n) Σ_{k=1}^n (Xk − µ). So if we multiply this by Xj − µ and then take expectation, we
see that all terms in the sum except the one corresponding to k = j have expectation zero.
Hence
E(Xj − µ)(Mn − µ) = (1/n) E(Xj − µ)² = σ²/n.
Similarly, in taking the square of Mn − µ we obtain 1/n² times the sum of the squares of all terms
plus cross-products; the latter have expectation zero. So
E(Mn − µ)² = (1/n²) Σ_{k=1}^n E(Xk − µ)² = (1/n²) · nσ² = σ²/n.
Hence, indeed, cov(Yj, Mn) = 0 for all j, and therefore the claim that Y and Mn are independent
is true, and therefore Ṽn and Mn are independent.
To see 3) we observe that the components of Y add up to zero. Hence if we let
W := {y = (y1, . . . , yn) ∈ Rⁿ : y1 + · · · + yn = 0}
then we have
P(Y ∈ W) = 1.
On the other hand, W is a linear space of dimension n − 1. Now remember formula (eig) of
Section 21.3. Apply it with d = n and r = n − 1 to get
Y1² + · · · + Yn² = Σ_{j=1}^{n−1} λj Zj²,
where Z1, . . . , Z_{n−1} are i.i.d. standard normal random variables and λ1, . . . , λ_{n−1} are positive
numbers, being the nonzero eigenvalues of the matrix cov(Y). But observe that all off-diagonal
entries of this matrix are the same and all diagonal entries are the same. Hence all nonzero
eigenvalues must be the same. So
Y1² + · · · + Yn² = λ Σ_{j=1}^{n−1} Zj².
To find λ we take expectation of both sides. Since the left-hand side equals (n − 1)Ṽn and since
we already know that EṼn = σ², we have E(Y1² + · · · + Yn²) = (n − 1)σ². On the other hand, the
expectation of the right-hand side equals λ(n − 1). Hence
λ = σ²
and so
((n − 1)/σ²) Ṽn = (Y1² + · · · + Yn²)/σ² = Σ_{j=1}^{n−1} Zj².
We’re done.
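A simulation sketch of facts 1)–3) (assuming numpy and scipy; the values of µ, σ and n below are arbitrary choices):

import numpy as np
from scipy import stats

rng = np.random.default_rng(10)
mu, sigma, n, reps = 2.0, 1.5, 5, 200_000
X = rng.normal(mu, sigma, (reps, n))
M = X.mean(axis=1)                              # M_n
V = X.var(axis=1, ddof=1)                       # \tilde V_n (divides by n - 1)
W = (n - 1) * V / sigma**2
print(round(np.corrcoef(M, V)[0, 1], 3))        # ~ 0: uncorrelated (in fact independent)
for w in (1.0, 3.0, 6.0):                       # compare P(W <= w) with the chi^2(n-1) cdf
    print(w, round(np.mean(W <= w), 3), round(stats.chi2.cdf(w, n - 1), 3))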
1) Our probability space here can be taken to be Rⁿ and our PROBABILITY is Pσ (the
product of n copies of N(µ, σ²)). Moreover, if ω = (ω1, . . . , ωn) ∈ Rⁿ, we have chosen Xj(ω) = ωj
for all j. The statistic in question is Tn := √n (Mn − µ)/√Ṽn. Then ω ↦ Tn(ω) is a random variable that does not depend on σ. Hence
Pσ1(Tn ≤ t) = Pσ2(Tn ≤ t) for all t and all σ1, σ2 > 0. Hence Tn is a statistic for σ (or for
Pσ), a concept defined in Section 20.5.
2) The numerator in the formula for Tn is independent of the denominator and the latter is
positive.
3) The numerator:
U := (√n/σ)(Mn − µ) has N(0, 1) distribution.
In other words,
U has density fU(x) = (2π)^{−1/2} e^{−x²/2}, x ∈ R.
4) The denominator:
V := √( (1/σ²) Ṽn ) = √( ((n − 1)Ṽn/σ²) / (n − 1) ) =: √( W/(n − 1) ) has easily found distribution,
because W := (n − 1)Ṽn/σ² has χ²(n − 1) distribution.
So W has density x ↦ f_{n−1}(x), given by the formula (χ²(d) density) for fd with d = n − 1 in Section 21.2.
Therefore W/(n − 1) has density (n − 1) f_{n−1}((n − 1)x).
Since V = W/(n − 1) is the image of the random variable W/(n − 1) under the map
p
√
x 7→ x = v,
?PROBLEM 21.11 (the t(n) density). Compute the last integral to show that
f_{t(n)}(x) = ( Γ((n + 1)/2) / (√(nπ) Γ(n/2)) ) (1 + x²/n)^{−(n+1)/2},  x ∈ R.   (t(n) density)
?PROBLEM 21.12 (t(1)=standard Cauchy). Explain why t(1) is the density for the standard
Cauchy distribution.
Answer. Setting n = 1 in f_{t(n)}(x) we obtain
f_{t(1)}(x) = (1/π) · 1/(1 + x²),  x ∈ R.
?PROBLEM 21.13 (t(∞) = N(0, 1)). Explain why
lim_{n→∞} f_{t(n)}(x) = (2π)^{−1/2} e^{−x²/2}.
Answer. From (21.4), the denominator in the second expression for Tn converges to 1 as n → ∞,
with probability 1. The numerator is a N(0, 1) random variable, let's call it Z. So Tn has the same distribution as Z/Dn where
P(Dn → 1) = 1. Hence
Tn → Z as n → ∞ in distribution.
Hence the density of Tn converges to the density of Z.
Figure 21.1: ©in public domain; graphic art done by the author of these notes, just to make the
notes appear friendly and easy.
The distribution is known as the "Student" distribution because its inventor, William Sealy
Gosset, used to be modest and referred to himself as "Student". Gosset invented the t(d)
distribution in trying to address a beer problem in his workplace, the Guinness brewery in
Dublin, in 1908. Back then, Dublin was part of the United Kingdom; Ireland became independent in 1922.
Guinness is a great stout. But I don't like it warm. I prefer it extra cold.
PROBLEM 21.14 (moments of t(n)). Let Tn have t(n) distribution. Explain when ETnk < ∞.
Answer. Notice that the density is a symmetric function. So all odd moments are zero,
provided they exist. If n > 1 then ETnk < ∞ ⇐⇒ k < n.
Chapter 22
Random objects
Chapter 23
Bernoulli trials and the Poisson point process
BernoulliTrials(T, p)
PROBLEM 23.1 (Bernoulli trials with general index set). If T = {1, . . . , k} then this is the
BernoulliTrials(k, p) distribution. If T = N then it is the BernoulliTrials(∞, p) distribution
We talked about these two extensively. We could take T = Z as well. Or we could choose
T = Z × Z.
In this chapter, we shall, for each positive integer n, consider the set
Tn = (1/n)Z = { . . . , −2/n, −1/n, 0, 1/n, 2/n, 3/n, . . . }
of all rational numbers with denominator n, let λ be a positive real number, and let p = λ/n.
We will focus on the distributions
Qn = BernoulliTrials((1/n)Z, λ/n),
one for each n. We can think of n1 Z as an infinite set of boxes and each box contains a Ber(p)
random variable such that the collection of them is an independent collection. If you think
of R as time, then you can think of Qn as modeling a transmitter that attempts to transmit
every 1/n time units. The transmission is successful at the times i/n when Xi/n = 1. Think of a
successful transmission as light: you see an instantaneous light whenever the transmission
is successful or darkness otherwise. This could look like this: The line on both figures is a
copy of the real line. The figure on top represents Q10 and the one on the bottom Q50 . At the
top line the transmitter transmits 10 times per second and at the bottom at rate 50 times a
second. Light is represented by red, darkness by black. In both lines we see, roughly, the same average number of red points per unit of time.
Figure 23.1: Both lines represent time. The transmitter of the bottom line transmits 5 times faster
than the transmitter of the top line. The probability that the bottom transmits red light is 1/5 the
probability that the top transmitter does so. Therefore the average number of red lines on a given
interval of time is the same for both lines.
Let $S_n(I)$ be the number of successful transmission times in a bounded interval $I$. For concreteness, take $I = (a, b]$ where $a < b$ are real numbers:
$$S_n(I) = S_n((a, b]) = \#\Big\{\frac{k}{n} \in I : X_{k/n} = 1\Big\}.$$
But
$$\frac{k}{n} \in (a, b] \iff na < k \le nb \iff [na] < k \le [nb],$$
because $k$ is an integer. Hence
$$\lim_{n\to\infty} P(S_n(I_1) = k_1, \dots, S_n(I_m) = k_m) = \lim_{n\to\infty} P(S_n(I_1) = k_1) \cdots \lim_{n\to\infty} P(S_n(I_m) = k_m)$$
$$= e^{-\lambda|I_1|}\,\frac{(\lambda|I_1|)^{k_1}}{k_1!} \cdots e^{-\lambda|I_m|}\,\frac{(\lambda|I_m|)^{k_m}}{k_m!}, \quad \text{as } n \to \infty.$$
So if N(I1 ), . . . , N(Im ) are independent such that N(I j ) is Poi(λ|I j |) for j = 1, 2, . . . , m, the above
says
$$\lim_{n\to\infty} P(S_n(I_1) = k_1, \dots, S_n(I_m) = k_m) = P(N(I_1) = k_1, \dots, N(I_m) = k_m).$$
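Here is a quick simulation sketch of this sparse limit (Python, for illustration only; the variable names are mine). For one interval $I = (0, t]$, the count $S_n(I)$ is Binomial with $nt$ trials and success probability $\lambda/n$, and its distribution is compared with Poi($\lambda t$).

```python
import numpy as np
from math import exp, factorial

lam, t, n, trials = 2.0, 3.0, 1000, 100000
rng = np.random.default_rng(0)

# S_n((0, t]) counts successes among the n*t grid points in (0, t], each Ber(lam/n);
# it is Binomial(n*t, lam/n), compared here with Poisson(lam*t).
counts = rng.binomial(int(n * t), lam / n, size=trials)

mean = lam * t
for k in range(10):
    empirical = float(np.mean(counts == k))
    poisson = exp(-mean) * mean**k / factorial(k)
    print(k, round(empirical, 4), round(poisson, 4))
```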
We next look at the actual times at which we have successful transmissions (the red points of the picture). They are roughly the same over a fixed period of time, regardless of $n$. What happens is that the probability of a successful transmission is very small when $n$ is large, and this cancels out the large rate of transmission attempts. For instance, let $\Gamma_n(1)$ be the time of the first successful transmission after time 0:
$$\Gamma_n(1) = \min\Big\{\frac{k}{n} > 0 : X_{k/n} = 1\Big\} = \frac{1}{n}\min\{k \ge 1 : X_{k/n} = 1\} = \frac{1}{n}\, G_n(1),$$
where $G_n(1)$ is a geo($\lambda/n$) random variable. But we showed in Section 13.4 that $\frac{1}{n}G_n(1)$ converges in distribution to an expon($\lambda$) random variable.
Let $\Gamma_n(j)$, $j \in \mathbb{Z}$, be the successive locations of the points $k/n$ at which $X_{k/n} = 1$. It is obvious that $\Gamma_n(1), \Gamma_n(2) - \Gamma_n(1), \dots$ are independent random variables, and the distribution of each one converges to that of an expon($\lambda$) random variable. It is then reasonable to expect that, in the limit as $n \to \infty$, the successful transmissions are located at points $0 < T_1 < T_2 < \cdots$ where $T_1, T_2 - T_1, \dots$ are i.i.d. expon($\lambda$) random variables. The intuition is correct and we will partially verify it.
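A simulation sketch of the previous paragraph (again Python, illustration only): the first success time $\Gamma_n(1) = G_n(1)/n$, with $G_n(1)$ geometric with success probability $\lambda/n$, has tail probabilities close to those of an expon($\lambda$) random variable.

```python
import numpy as np

lam, n, trials = 2.0, 1000, 100000
rng = np.random.default_rng(1)

# Gamma_n(1) = G_n(1)/n, where G_n(1) is geometric with success probability lam/n.
first_success = rng.geometric(lam / n, size=trials) / n

# Compare P(Gamma_n(1) > u) with the expon(lam) tail e^{-lam*u}.
for u in (0.25, 0.5, 1.0, 2.0):
    print(u, round(float(np.mean(first_success > u)), 4), round(float(np.exp(-lam * u)), 4))
```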
Figure 23.2: This figure is the limiting situation of what we saw in the previous figure, when the
transmission rate goes to infinity but the probability that light is transmitted goes to 0 inversely
proportional to the transmission rate. To construct this limit, we simply construct the points of
time Tk , k ∈ Z, at which light is transmitted. These points are given by formula (23.1).
Define
$$T_k := \begin{cases} \displaystyle\sum_{j=1}^{k} \tau_j, & \text{if } k = 1, 2, \dots, \\[8pt] -\displaystyle\sum_{j=k}^{0} \tau_j, & \text{if } k = 0, -1, -2, \dots, \end{cases} \tag{23.1}$$
where $\tau_j$, $j \in \mathbb{Z}$, are i.i.d. expon($\lambda$) random variables.
This is supposed to represent the limit. A transmitter transmits successfully at the times $T_k$ only. Let $N(I)$ be the number of successful transmissions in the bounded interval $I$. We expect that if $I_1, \dots, I_m$ are disjoint bounded intervals then $N(I_1), \dots, N(I_m)$ should be independent and such that $N(I_j)$ is Poi($\lambda|I_j|$). This is actually true. One might exclaim: "but you proved it above". That is almost right: if I add a bit more advanced mathematics, then the limits I calculated above do prove it. Let us, however, prove directly that $N(I)$ is Poi($\lambda|I|$). I will take $I = (0, t]$ to begin. Then, for $t > 0$ and $k = 1, 2, \dots$, we have
$$N((0, t]) = k \iff T_k \le t < T_{k+1}, \tag{23.2}$$
and so
$$P(N((0, t]) = k) = P(T_k \le t < T_k + \tau_{k+1}) = \int_0^t e^{-\lambda(t-s)} f_{T_k}(s)\,ds.$$
Since, for $k \ge 1$, $T_k$ is the sum of $k$ independent expon($\lambda$) random variables, $T_k$ is a gamma($\lambda$, $k$) random variable whose density was found in Section 21.1:
$$f_{T_k}(s) = \frac{\lambda^k}{(k-1)!}\, s^{k-1} e^{-\lambda s}.$$
Substituting this into the integral gives
$$P(N((0, t]) = k) = \int_0^t e^{-\lambda(t-s)}\, \frac{\lambda^k}{(k-1)!}\, s^{k-1} e^{-\lambda s}\,ds = e^{-\lambda t}\, \frac{\lambda^k}{(k-1)!} \int_0^t s^{k-1}\,ds = e^{-\lambda t}\, \frac{(\lambda t)^k}{k!},$$
as claimed.
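One can also check the last computation numerically (a sketch; the function names are mine): a simple Riemann sum for $\int_0^t e^{-\lambda(t-s)} f_{T_k}(s)\,ds$ reproduces the Poisson probabilities $e^{-\lambda t}(\lambda t)^k/k!$.

```python
import math

def gamma_density(s, lam, k):
    """Density of T_k = sum of k independent expon(lam) random variables."""
    return lam**k / math.factorial(k - 1) * s**(k - 1) * math.exp(-lam * s)

def prob_N_equals_k(lam, t, k, steps=100000):
    """Midpoint approximation of the integral over (0, t]."""
    h = t / steps
    return sum(math.exp(-lam * (t - s)) * gamma_density(s, lam, k) * h
               for s in (h * (i + 0.5) for i in range(steps)))

lam, t = 1.5, 2.0
for k in range(1, 6):
    exact = math.exp(-lam * t) * (lam * t)**k / math.factorial(k)
    print(k, round(prob_N_equals_k(lam, t, k), 6), round(exact, 6))
```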
Let us now explain that $N(I)$ has the Poi($\lambda|I|$) distribution for any interval $I$. Take $I = (t, t+\ell]$, an interval of length $\ell$. We shall reduce the computation to the previous one by looking closely at the points $T_k$ that are in this interval. Consider the smallest of the points $T_k$ that exceeds $t$. If we set, for brevity,
$$N_t := N((0, t]),$$
we have, from (23.2),
$$T_{N_t} \le t < T_{N_t+1}, \qquad t > 0.$$
(Simply let k = Nt in (23.2). Since the left side is a tautology, it follows that the right side is
true for all t > 0.) Therefore: for t > 0, TNt +1 is the smallest of the points Tk that exceeds t.
Figure 23.3: For t > 0, TNt +1 is the smallest of the points Tk that exceeds t.
I will just show that $T_{N_t+1} - t$ has the same distribution as $T_1 = \tau_1$ and is independent of $T_{N_t}$. We have
$$P(T_{N_t+1} - t > u \mid T_{N_t}) = \sum_{k=0}^{\infty} P(T_{N_t+1} - t > u \mid T_{N_t}, N_t = k)\, P(N_t = k \mid T_{N_t})$$
$$= \sum_{k=0}^{\infty} P(T_{k+1} - t > u \mid T_k, N_t = k)\, P(N_t = k \mid T_{N_t})$$
$$= \sum_{k=0}^{\infty} P(T_{k+1} - t > u \mid T_k,\, T_{k+1} - T_k > t - T_k \ge 0)\, P(N_t = k \mid T_{N_t}).$$
By the memoryless property of the exponential distribution,
$$P(T_{k+1} - t > u \mid T_k,\, T_{k+1} - T_k > t - T_k \ge 0) = P(\tau_{k+1} > (t - T_k) + u \mid T_k,\, \tau_{k+1} > t - T_k \ge 0) = e^{-\lambda u},$$
and therefore $P(T_{N_t+1} - t > u \mid T_{N_t}) = e^{-\lambda u}$, because the last sum equals 1; this is because the events $\{N_t = k\}$, $k = 0, 1, \dots$, form a partition of $\Omega$.
Another property of the construction is that, conditionally on the number of points in a bounded interval $I$, these points are distributed like independent uniform random variables on $I$, arranged in increasing order. The reason is this. Take, without loss of generality, $I = (0, t]$. We assume that we know that $N_t = N((0, t]) = m$. Therefore we know that exactly the points $T_1 < T_2 < \cdots < T_m$ fall in $(0, t]$. But these $m$ points are ordered. To render them unordered, let $\sigma$ be a random permutation of $\{1, \dots, m\}$, with uniform distribution over the set of all $m!$ permutations. Then $T_{\sigma(1)}, \dots, T_{\sigma(m)}$ are unordered. And then perform the following computation:
$$P(T_{\sigma(1)} \le x_1, \dots, T_{\sigma(m)} \le x_m \mid N_t = m) = \frac{x_1}{t} \cdots \frac{x_m}{t},$$
for all $0 \le x_1, \dots, x_m \le t$.
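A short simulation sketch of this conditional uniformity (illustration only; the variable names are mine): build the points from i.i.d. exponential gaps, keep the runs with exactly $m$ points in $(0, t]$, and check that the pooled points look uniform on $(0, t]$.

```python
import numpy as np

lam, t, m, trials = 1.0, 5.0, 4, 50000
rng = np.random.default_rng(2)

kept = []
for _ in range(trials):
    # 40 exponential gaps is far more than enough for the partial sums to pass t
    times = np.cumsum(rng.exponential(1 / lam, size=40))
    pts = times[times <= t]
    if len(pts) == m:
        kept.append(pts)

pooled = np.concatenate(kept)
print("conditional sample size:", len(pooled))
print("empirical mean:", round(float(pooled.mean()), 3), "(uniform on (0, t] has mean", t / 2, ")")
print("empirical P(point <= t/4):", round(float(np.mean(pooled <= t / 4)), 3), "(should be about 0.25)")
```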
Appendix A
Counting
Counting is the method for assigning a nonnegative integer to a finite set, denoting its number of elements. If the set is infinite, then counting means producing a one-to-one correspondence of the set with a given "well-understood" set, such as the integers or the set of real numbers. We say that the set we are counting has the cardinality of the concrete set. For example, the set of all finite sequences of coin tosses has the cardinality of the integers, but the set of all infinite coin tosses has the cardinality of the real numbers. (We also know that the real numbers do not have the cardinality of the integers!)
Typically, we count sets that arise from other sets. We write #S or, sometimes, |S| for the
cardinality of the set S.
1. Binary sequences of length n. Consider the set $\{0, 1\}$. OK, it has cardinality 2. The set $\{0, 1\}^2$ contains 00, 01, 10, 11, that is, 4 elements. You can guess what the cardinality of $\{0, 1\}^n$ (the set of sequences of 0's and 1's of length $n$) is. If you can't, let it be equal to $c_n$. List all the elements of $\{0, 1\}^n$. Any sequence of length $n + 1$ is a sequence of length $n$ followed by a 0 or a 1. So
$$c_{n+1} = 2 c_n,$$
and, since $c_1 = 2$, you see that
$$c_n = \#\{0, 1\}^n = 2^n.$$
2. Subsets of a set. Suppose that A is a set with n elements. Let P(A) be the set containing
all subsets of A. What is the cardinality of P(A)?
To solve this problem, put the elements of $A$ in a row. For example, with $n = 6$,
$$a_1 \quad a_2 \quad a_3 \quad a_4 \quad a_5 \quad a_6,$$
and under each element write 1 if we want to include it in the subset and 0 otherwise; for example,
$$0 \quad 1 \quad 0 \quad 1 \quad 1 \quad 0.$$
This means that we consider the subset $\{a_2, a_4, a_5\}$. We can do this with each subset. Put it otherwise, for each subset of $A$ we have a binary sequence of length $n$, and vice versa. Therefore the cardinality of $P(A)$ is the cardinality of $\{0, 1\}^n$, which is $2^n$. Hence
$$|P(A)| = 2^{|A|}.$$
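The coding of subsets by binary sequences is exactly how one enumerates subsets on a computer. A tiny Python illustration (the names A and subsets are mine):

```python
from itertools import product

A = ['a1', 'a2', 'a3', 'a4', 'a5', 'a6']
n = len(A)

# Each binary sequence (b1, ..., bn) encodes the subset {a_i : b_i = 1}.
subsets = [{a for a, b in zip(A, bits) if b == 1} for bits in product([0, 1], repeat=n)]

print(len(subsets), 2**n)              # both are 64
print({'a2', 'a4', 'a5'} in subsets)   # the subset encoded by 0 1 0 1 1 0
```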
3. Arrangements of objects. Suppose we have n objects and we put them in a row. How
many rows can we form? This depends on whether the objects are distinguishable from one
another or not.
(c) k out of n objects are red and the remaining n − k are blue.
Two objects of the same color are supposed to be indistinguishable. Let cn,k be the number of
arrangements. Consider a particular arrangement of the n objects. Now permute red objects
between themselves and blue between themselves only. For example, if we have n = 5 objects
of which $k = 3$ are red and denote the objects as $r_1, r_2, r_3, b_1, b_2$, then the arrangements
$$b_1\; r_2\; b_2\; r_1\; r_3$$
$$b_2\; r_1\; b_1\; r_3\; r_2$$
are indistinguishable. Thus, for each of the $n!$ arrangements of the objects there are $k!(n-k)!$ indistinguishable arrangements. This means that
$$n! = c_{n,k}\, k!\,(n-k)!$$
and this gives
$$c_{n,k} = \#\text{arrangements} = \frac{n!}{k!\,(n-k)!}.$$
This is another important number, the binomial coefficient, so we give it a notation known by many:
$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}.$$
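A brute-force enumeration check of this count (Python, illustration only): list all distinct color patterns of $k$ indistinguishable red and $n-k$ indistinguishable blue objects and compare with $n!/(k!(n-k)!)$.

```python
from itertools import permutations
from math import factorial

n, k = 5, 3
objects = ['r'] * k + ['b'] * (n - k)

# Distinct arrangements of k indistinguishable red and n-k indistinguishable blue objects.
distinct = set(permutations(objects))
print(len(distinct), factorial(n) // (factorial(k) * factorial(n - k)))  # 10 10
```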
$$\#\text{arrangements} = \frac{n!}{n_1!\, n_2! \cdots n_d!}.$$
Consider now the expression
$$(x_1 + x_2 + \cdots + x_d)^n.$$
We wish to expand this into a sum of terms of the form $x_1^{n_1} \cdots x_d^{n_d}$. We can write
$$x_1^{n_1} x_2^{n_2} \cdots x_d^{n_d} = \underbrace{(x_1 \cdots x_1)}_{n_1 \text{ times}}\; \underbrace{(x_2 \cdots x_2)}_{n_2 \text{ times}} \cdots \underbrace{(x_d \cdots x_d)}_{n_d \text{ times}}.$$
If we change the order on the right-hand side, we do not obtain a different term so long as the number of times that each variable $x_j$ appears is equal to $n_j$. So if we think of "variable" as "color", the number of arrangements is precisely $\binom{n}{n_1, \dots, n_d}$; in other words, the term $x_1^{n_1} \cdots x_d^{n_d}$ will appear $\binom{n}{n_1, \dots, n_d}$ times in the expansion. Therefore we have discovered the multinomial formula
$$(x_1 + x_2 + \cdots + x_d)^n = \sum_{\substack{n_1, \dots, n_d \ge 0 \\ n_1 + \cdots + n_d = n}} \binom{n}{n_1, \dots, n_d}\, x_1^{n_1} \cdots x_d^{n_d}. \tag{MF}$$
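A quick numerical check of (MF) for small $n$ and $d$ (Python, illustration only; the helper name multinomial is mine):

```python
from math import factorial, prod
from itertools import product

def multinomial(n, ks):
    return factorial(n) // prod(factorial(k) for k in ks)

# Check (MF) for d = 3 variables and n = 4.
x = (1.3, -0.7, 2.1)
n, d = 4, 3

lhs = sum(x)**n
rhs = sum(multinomial(n, ks) * x[0]**ks[0] * x[1]**ks[1] * x[2]**ks[2]
          for ks in product(range(n + 1), repeat=d) if sum(ks) == n)
print(lhs, rhs)   # both equal (1.3 - 0.7 + 2.1)^4
```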
4. Fixed-size subsets of a set. How many sets of size k does a set of size n have?
To solve the problem, consider a set of size n, say the set {1, . . . , n} of the first n positive
integers. Following the coding of subsets by binary sequences of length n, if we are interested
only in subsets of size k then we need to count the number of binary sequences such that 1
appears exactly k times. But if we think of 1 as red and of 0 as blue this is the same problem as
the number of arrangements of n objects in a row such that k of them are red and n − k blue.
The answer then is
$$\#\{\text{subsets of } \{1, \dots, n\} \text{ of size } k\} = \binom{n}{k}.$$
This is why the symbol $\binom{n}{k}$ is pronounced "n choose k". Since the total number of subsets is $2^n$, we have proved that
$$\sum_{k=0}^{n} \binom{n}{k} = 2^n.$$
5. Finite binary sequences. The set of finite binary sequences is the set
$$\bigcup_{n=1}^{\infty} \{0, 1\}^n,$$
because a finite binary sequence has length $n$ for some positive integer $n$. But this set is in one-to-one correspondence with the integers. Here is one way to do this:
0     1
−1    1
00    01    10    11
−3    −2    2     3
000   001   010   011   100   101   110   111
−7    −6    −5    −4    4     5     6     7
...
That is, if a binary sequence starts with a 1 assign to it the positive integer whose binary
representation is given by the binary sequence. If a binary sequence starts with 0 then flip the
0’s and 1’s and then add a negative sign. The process is reversible. Hence
$$\#\bigcup_{n=1}^{\infty} \{0, 1\}^n = \#\mathbb{Z}.$$
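The rule just described is easy to program; here is a hedged Python sketch (the function names are mine). Note that with this rule the finite binary sequences are matched with the nonzero integers, which of course have the same cardinality as $\mathbb{Z}$.

```python
from itertools import product

def seq_to_int(bits):
    """Sequence starting with 1 -> the positive integer with that binary representation;
       sequence starting with 0 -> flip every bit, then put a minus sign."""
    if bits[0] == '1':
        return int(bits, 2)
    return -int(''.join('1' if b == '0' else '0' for b in bits), 2)

def int_to_seq(m):
    """Inverse map, defined for every nonzero integer m."""
    if m > 0:
        return format(m, 'b')
    return ''.join('1' if b == '0' else '0' for b in format(-m, 'b'))

# Round trips on all sequences of length <= 3.
for L in (1, 2, 3):
    for bits in product('01', repeat=L):
        s = ''.join(bits)
        assert int_to_seq(seq_to_int(s)) == s
print(sorted(seq_to_int(''.join(b)) for b in product('01', repeat=2)))  # [-3, -2, 2, 3]
```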
6. Infinite binary sequences. The set of infinite binary sequences is the set
$$\{0, 1\}^{\mathbb{N}},$$
because an infinite binary sequence is a map from $\mathbb{N}$ into $\{0, 1\}$. We claim that this set has the cardinality of the real numbers. Indeed, to an infinite binary sequence $(b_1, b_2, \dots)$ associate
$$x = \sum_{n=1}^{\infty} \frac{b_n}{2^n};$$
this is a real number, $x \ge 0$, and $x \le \sum_{n=1}^{\infty} \frac{1}{2^n} = 1$. To make the operation invertible, we remark that any real number $0 \le x \le 1$ has exactly one binary representation if $x$ is not a binary rational. A binary rational number has 2 representations, of which we select the one that has eventually only 1's. (For example, the number 1/2 is written as 0.100000··· or as 0.0111111···, and we choose, by convention, the latter.)
7. Allocation of balls in boxes. There are n boxes and m balls. In how many ways can we
place the balls in the boxes? We consider two cases.
Each allocation can be drawn by first placing the balls inside the $n$ boxes and then erasing the boxes, keeping only the balls and the $n - 1$ internal walls. For example, with $m = 4$ balls and $n = 3$ boxes, some allocations look like this:
•• •• | |
• • • | • |
• • • | | •
•• | •• |
• • | • | •
•• | | ••
Each allocation is thus represented by the $m$ balls and the $n - 1$ internal walls. In total, we have $m + n - 1$ objects, $m$ of which are balls (think of them as "red" objects) and $n - 1$ are internal walls (think of them as "blue" objects). From case (d) in paragraph 3 above we have
$$\#\text{allocations} = \frac{(m+n-1)!}{m!\,(n-1)!} = \binom{m+n-1}{n-1}.$$
As an application, look again at the multinomial formula (MF) above. It involves a big sum
over the set of d-tuples (n1 , . . . , nd ) of nonnegative integers whose sum is n. We can think of
each such d-tuple as an allocation of balls in d boxes. Thus, (n1 , . . . , nd ) means that we put n1
balls in box 1, n2 balls in box 2 and so on. We therefore have that
$$C_n(d) := \{(n_1, \dots, n_d) : n_1, \dots, n_d \ge 0,\ n_1 + \cdots + n_d = n\} = \{\text{all allocations of } n \text{ identical balls in } d \text{ distinguishable boxes}\}.$$
Hence
$$\#C_n(d) = \binom{n+d-1}{d-1}.$$
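An enumeration check of this "balls in boxes" count (Python, illustration only):

```python
from itertools import product
from math import comb

n, d = 6, 4
# All d-tuples of nonnegative integers summing to n, i.e. the elements of C_n(d).
allocations = [ks for ks in product(range(n + 1), repeat=d) if sum(ks) == n]
print(len(allocations), comb(n + d - 1, d - 1))   # 84 84
```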
Partitions of an integer. A partition of a positive integer $n$ is a way of writing $n$ as a sum of positive integers, where the order of the summands does not matter. For example, the partitions of 6 are:
6=6
=5+1
=4+2
=3+3
=4+1+1
=3+2+1
=2+2+2
=3+1+1+1
=2+2+1+1
=2+1+1+1+1
=1+1+1+1+1+1
Note that we can think of a partition as a finite sequence of positive integers in nonincreasing
order. For example, writing 6 = 3 + 1 + 1 + 1 corresponds to (3, 1, 1, 1). Alternatively, we can
think of this as using 1 three times and 3 once. Using the second way of thinking, each partition of $n$ is simply
$$1\ (k_1 \text{ times}),\quad 2\ (k_2 \text{ times}),\quad 3\ (k_3 \text{ times}),\ \dots$$
i.e., a sequence $(k_1, k_2, \dots)$ of nonnegative integers such that
$$k_1 + 2k_2 + 3k_3 + \cdots = n.$$
The positive terms of such a sequence are finitely many. That is, after some index, all the
terms of this sequence are equal to 0. Call them “eventually zero sequences”.
The number of partitions of 6 is p(6) = 11. We do not have a closed formula for p(n), the
number of partitions of n. Define the function
$$G(x) := \sum_{n=0}^{\infty} p(n)\, x^n.$$
Note that $p(0) = 1$; not because $0 = 0$ is a valid formula, but because the only way to write 0 as a sum of positive integers is the empty sum, and there is exactly one such way. We now write
$$p(n) = \sum_{k_1, k_2, \dots} \mathbb{1}_{k_1 + 2k_2 + 3k_3 + \cdots = n},$$
where the sum is over all eventually zero sequences $(k_1, k_2, \dots)$ of nonnegative integers.
But notice that, for fixed $k_1, k_2, \dots$, the sum $\sum_n \mathbb{1}_{k_1 + 2k_2 + 3k_3 + \cdots = n}\, x^n$ has exactly one nonzero term, namely the one with $n = k_1 + 2k_2 + 3k_3 + \cdots$; all other terms give 0. Hence
$$G(x) = \sum_{k_1, k_2, \dots} x^{k_1}\, x^{2k_2}\, x^{3k_3} \cdots = \sum_{k_1=0}^{\infty} x^{k_1} \sum_{k_2=0}^{\infty} x^{2k_2} \sum_{k_3=0}^{\infty} x^{3k_3} \cdots$$
$$= (1 + x + x^2 + \cdots)(1 + x^2 + x^4 + \cdots)(1 + x^3 + x^6 + \cdots)\cdots = \frac{1}{1-x}\cdot\frac{1}{1-x^2}\cdot\frac{1}{1-x^3}\cdots$$
This function contains all the information about $p(n)$ because the coefficient of $x^n$ is $p(n)$, and it can be recovered by hand for small $n$.
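Recovering the coefficients of $G(x)$ is also easy to do by machine; here is a short Python sketch that multiplies the factors $1/(1-x^k)$ as truncated power series and, in particular, confirms $p(6) = 11$ (the variable names are mine).

```python
N = 20
# Coefficients of the truncated power series; start from the series 1.
p = [1] + [0] * N
# Multiplying by 1/(1 - x^k) = 1 + x^k + x^{2k} + ... amounts to p[j] += p[j - k].
for k in range(1, N + 1):
    for j in range(k, N + 1):
        p[j] += p[j - k]
print(p[:11])   # [1, 1, 2, 3, 5, 7, 11, 15, 22, 30, 42]; in particular p(6) = 11
```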
As an application, let us look again at the multinomial formula (MF). As explained above,
the sum on the right-hand side of (MF) has $\binom{n+d-1}{d-1}$ terms. Each term corresponds to an element $\mathbf{n} = (n_1, \dots, n_d)$ of the set $C_n(d)$. Let us use the abbreviation
$$\binom{n}{\mathbf{n}} := \binom{n}{n_1, \dots, n_d}.$$
Looking again at the right-hand side of (MF), we wish to group together terms with the same
multinomial coefficient.
Note that if $\mathbf{n}$ and $\mathbf{n}'$ are two elements of $C_n(d)$ such that one is obtained by a permutation of the other, then $\binom{n}{\mathbf{n}} = \binom{n}{\mathbf{n}'}$. We can make that clearer by saying that $\mathbf{n}$ is equivalent to $\mathbf{n}'$ if they are permutations of one another. An equivalence class $\pi$ is a subset of $C_n(d)$ such that all elements of $\pi$ are
equivalent to one another. Let then $\Pi(n, d)$ be the set of all equivalence classes. We therefore obtain
$$(x_1 + x_2 + \cdots + x_d)^n = \sum_{\pi \in \Pi(n,d)} \binom{n}{\pi} \sum_{(n_1, \dots, n_d) \in \pi} x_1^{n_1} \cdots x_d^{n_d}. \tag{MF2}$$
Formula (MF2) is a rewriting of (MF), where we grouped terms with the same multinomial
coefficient together. We can now easily see that $\Pi(n, d)$ can be identified with the set of partitions of $n$ into at most $d$ parts.
Indeed, in figuring out all d-tuples obtained by permuting a particular d-tuple (n1 , . . . , nd ) we
may as well put this in nondecreasing order. Since the sum of the ni equals n, we have a
partition of n in at most d parts (we say “at most” because some elements ni may be equal to
0). How many terms does the second sum in (MF2) have? It has as many terms as the number
of elements of $\pi$. Pick an element $(n_1, \dots, n_d)$ of $\pi$. In figuring out how many other $d$-tuples are equivalent to $(n_1, \dots, n_d)$, the only thing that matters is how many elements of it are equal to one number, how many to another number, and so on. So if we let $\kappa_\pi(j)$, $j = 1, 2, \dots$, be the number of indices $i$ such that $n_i = j$, then only these counts matter.
Letting κπ (0) := d − (κπ (1) + κπ (2) + · · · ), which can be thought of as the number of parts of π
that are equal to 0 (equivalently, if π is a partition of n in a number of parts strictly smaller
than d then complement it by zeros), we have
$$|\pi| = \frac{d!}{\kappa_\pi(0)!\,\kappa_\pi(1)! \cdots \kappa_\pi(d)!} =: \binom{d}{\kappa_\pi}.$$
• Make r ordered selections without replacement: As above, but now the second selection
can be done in n − 1 ways, the third in n − 2 and so on. Hence the answer is n(n − 1)(n −
2) · · · (n − r + 1) = (n)r .
• Make r unordered selections without replacement: We have $(n)_r$ ways to select if the order matters. But since the order does not matter, we divide by $r!$ and hence the answer is
$$\frac{(n)_r}{r!} = \binom{n}{r}.$$
• Make r unordered selections with replacement: To count this we map the problem into a
situation we have already considered. Think of the objects as boxes. We are to select r of them
but we do not care about the order of the selected boxes. To do this, we place a ball in each
box that we wish to select. Since the order does not matter the balls must be indistinguishable.
Since we can select with replacement, we are allowed to put many balls in each box. Hence the
number of selections in this case is the same as the number of allocations of r indistinguishable
balls into n distinguishable boxes. The answer was found in case (b) of paragraph 7 and is
$$\binom{r+n-1}{n-1}.$$
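The four kinds of selections are easy to verify by enumeration for small numbers. A Python sketch (illustration only), using the standard itertools iterators:

```python
from itertools import product, permutations, combinations, combinations_with_replacement
from math import comb, factorial, perm

n, r = 4, 2
objects = range(n)

print(len(list(product(objects, repeat=r))), n**r)                    # ordered, with replacement
print(len(list(permutations(objects, r))), perm(n, r))                # ordered, without replacement
print(len(list(combinations(objects, r))), comb(n, r))                # unordered, without replacement
print(len(list(combinations_with_replacement(objects, r))),
      comb(r + n - 1, n - 1))                                         # unordered, with replacement
```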
Then
$$B_2 \le B_4 \le \cdots \le P\Big(\bigcup_{i=1}^{n} A_i\Big) \le \cdots \le B_3 \le B_1.$$
We will show something much stronger, which is pure logic and counting and does not involve probability. Let $\delta_i := \mathbb{1}_{A_i}$ and set
$$S_m := \sum_{k=1}^{m} (-1)^{k-1} \underbrace{\sum_{1 \le i_1 < \cdots < i_k \le n} \delta_{i_1} \cdots \delta_{i_k}}_{=:N_k}, \qquad m = 1, 2, \dots, n.$$
We will show
$$S_2 \le S_4 \le \cdots \le \mathbb{1}_{\bigcup_{i=1}^{n} A_i} \le \cdots \le S_3 \le S_1.$$
This will be enough because
$$E S_m = B_m, \qquad E\,\mathbb{1}_{\bigcup_{i=1}^{n} A_i} = P\Big(\bigcup_{i=1}^{n} A_i\Big).$$
The key is understanding what $N_k$ is. If $\omega \in \Omega$ then $N_1(\omega) = \sum_{i=1}^{n} \mathbb{1}_{\omega \in A_i}$, that is, it is the number of events that contain $\omega$. Also, $N_2(\omega) = \sum_{i<j} \mathbb{1}_{\omega \in A_i \cap A_j}$ is the number of pairs of events that contain $\omega$. And so on.¹ If $N_1(\omega) = \ell$, say, then $N_2(\omega) = \binom{\ell}{2}$. Indeed, if, say, $\omega$ belongs to the sets $A_1, \dots, A_\ell$ then the only indices $i, j$ for which $A_i \cap A_j$ contains $\omega$ must be chosen among $1, \dots, \ell$. Since the order is irrelevant, there are $\binom{\ell}{2}$ ways to choose these indices. Similarly, $N_k(\omega) = \binom{\ell}{k}$. Hence
$$\text{if } N_1 = \ell \text{ then } S_m = \sum_{k=1}^{m} (-1)^{k-1} \binom{\ell}{k}.$$
We consider two cases. First, $\ell = 0$. But then $N_1(\omega) = 0$ means that $\omega$ does not belong to any of the events. Hence $\mathbb{1}_{\bigcup_{i=1}^{n} A_i}(\omega) = 0$. Moreover, $S_m(\omega) = 0$ for all $m$. Hence the inequalities hold trivially because $0 \le 0 \le 0$. Second, assume $\ell \ge 1$. Then $N_1(\omega) \ge 1$, so $\omega$ belongs to some event, so $\mathbb{1}_{\bigcup_{i=1}^{n} A_i}(\omega) = 1$. Hence the inequalities become
$$\ell - \binom{\ell}{2} \le \ell - \binom{\ell}{2} + \binom{\ell}{3} - \binom{\ell}{4} \le \cdots \le 1 \le \cdots \le \ell - \binom{\ell}{2} + \binom{\ell}{3} \le \ell,$$
and we need to show that these are true for $\ell \ge 1$. These follow from the identity
$$\sum_{i=0}^{m} (-1)^i \binom{\ell}{i} = (-1)^m \binom{\ell-1}{m},$$
¹In probability parlance, $N_k$ is the number of unordered $k$-tuples of events that simultaneously occur. I refer you to the end of §5 where the expression "occurs" is defined.
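For a small example, the sandwich property can be checked by brute force at every sample point of a toy $\Omega$ with $n = 5$ random events: even truncations $S_m$ undershoot the indicator, odd truncations overshoot it, and the full sum $S_n$ equals it. A Python sketch (the names Omega, A, S are mine):

```python
from itertools import combinations
import random

random.seed(0)
Omega = range(30)
n = 5
A = [{w for w in Omega if random.random() < 0.3} for _ in range(n)]

def S(m, w):
    """Truncated inclusion-exclusion sum S_m evaluated at the sample point w."""
    total = 0
    for k in range(1, m + 1):
        Nk = sum(1 for idx in combinations(range(n), k) if all(w in A[i] for i in idx))
        total += (-1) ** (k - 1) * Nk
    return total

for w in Omega:
    ind = 1 if any(w in Ai for Ai in A) else 0
    assert all(S(m, w) <= ind for m in range(2, n + 1, 2))   # even truncations undershoot
    assert all(S(m, w) >= ind for m in range(1, n + 1, 2))   # odd truncations overshoot
    assert S(n, w) == ind                                    # full inclusion-exclusion is exact
print("sandwich property checked at every sample point")
```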
Appendix B
Calculus
4. Limit of a function (of a real variable). A function f (x) of a real variable x is said to
converge to y0 as x → x0 if any neighborhood of y0 contains all numbers f (x) for all x
sufficiently close to x0 . (Note that f does not have to be defined at x0 .)
Notation: limx→x0 f (x) = y0 . We can also write f (x) → y0 when x → x0 .
Useful limits:
$$\lim_{x\to 0}\frac{\sin x}{x} = 1, \qquad \lim_{x\to 0^+} x\log x = 0, \qquad \lim_{x\to 0}\big(1 + x + g(x)\big)^{1/x} = e \ \text{ if } \lim_{x\to 0}\frac{g(x)}{x} = 0,$$
$$\lim_{x\to -\infty} e^x = 0, \qquad \lim_{x\to\infty} e^x = \infty, \qquad \lim_{x\to\infty} x^c = \begin{cases}\infty, & c > 0,\\ 0, & c < 0.\end{cases}$$
6. Slope. The slope of a function $f$ between two points $x_1, x_2$ is the quantity
$$\frac{f(x_2) - f(x_1)}{x_2 - x_1},$$
defined when $x_1, x_2$ are distinct real numbers.
7. Differentiability. We say that $f$ is differentiable at $x_0$ if $\lim_{x\to x_0} \frac{f(x)-f(x_0)}{x-x_0}$ exists. The limit is denoted by $f'(x_0)$ or by $\frac{df}{dx}(x_0)$. We can rewrite the definition of the derivative $f'(x_0)$ as: there is a function $R(x)$ such that $R(x)/x \to 0$ as $x \to 0$ and
$$f(x) = f(x_0) + f'(x_0)(x - x_0) + R(x - x_0).$$
11. The chain rule (=composition rule). The chain rule for the composition of functions says that
$$(f \circ g)'(x) = f'(g(x))\, g'(x).$$
12. Subdivision of an interval. Let I be a bounded interval with endpoints a, b, where a < b.
A subdivision of I is a finite sequence a = x0 < x1 < · · · < xN = b, starting from a and ending at
b. The intervals [x j−1 , x j ], j = 1, . . . , N, are the intervals of the subdivision. The subdivision is
tagged if it is accompanied by a selection of a point in each interval, that is, we have the N
points u j ∈ [x j−1 , x j ], j = 1, . . . , N. Let us use the letter Π to denote some tagged subdivision.
The mesh $\|\Pi\|$ of $\Pi$ is simply the length of the longest interval of $\Pi$.
13. The Riemann integral. Given a tagged subdivision $\Pi$ of $I$, form the Riemann sum $S(f, \Pi) := \sum_{j=1}^{N} f(u_j)(x_j - x_{j-1})$. We say that $f$ is Riemann-integrable on $I$ with integral $S(f)$ if for any $\varepsilon > 0$ there is a $\delta > 0$ such that, if $\Pi$ is any tagged subdivision with $\|\Pi\| < \delta$, then $|S(f, \Pi) - S(f)| < \varepsilon$. It is customary to write
$$S(f) = \int_I f(x)\,dx \quad\text{or}\quad \int_a^b f(x)\,dx.$$
We can compute integrals as limits, by taking subdivisions $\Pi_n$, $n = 1, 2, \dots$, such that $\|\Pi_n\| \to 0$, and then use
$$\int_I f(x)\,dx = \lim_{n\to\infty} S(f, \Pi_n).$$
Rx
14. Inefinite integral. If f is integrable on every bounded interval then a
f (u)du, as a
Rb Ra
function of x, is called an indefinite integral. We often use the geometric convention a = − b .
Some useful integrals:
$$\int_1^{\infty} x^c\,dx = \begin{cases}\dfrac{1}{|c+1|}, & c < -1,\\[4pt] \infty, & c \ge -1,\end{cases} \qquad \int_0^1 x^c\,dx = \begin{cases}\dfrac{1}{c+1}, & c > -1,\\[4pt] \infty, & c \le -1,\end{cases} \qquad \int_{-\infty}^{\infty} e^{-x^2}\,dx = \sqrt{\pi},$$
$$\int_0^{\infty} x^n e^{-x}\,dx = n!, \qquad \int_0^1 x^m (1-x)^n\,dx = \frac{m!\,n!}{(m+n+1)!},$$
$$\int e^{ax}\,dx = \frac{e^{ax}}{a}, \qquad \int \frac{1}{x}\,dx = \log x, \qquad \int x^c\,dx = \frac{x^{c+1}}{c+1}, \qquad \int \log x\,dx = x\log x - x.$$
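Two of these can be spot-checked numerically with a crude midpoint rule; a Python sketch (illustration only; the helper name integral is mine):

```python
import math

def integral(f, a, b, steps=200000):
    """Simple midpoint rule on [a, b]."""
    h = (b - a) / steps
    return sum(f(a + (i + 0.5) * h) for i in range(steps)) * h

# integral_0^infinity x^n e^{-x} dx = n!   (truncated at a large upper limit)
n = 5
print(round(integral(lambda x: x**n * math.exp(-x), 0, 60), 4), math.factorial(n))

# integral_0^1 x^m (1-x)^k dx = m! k! / (m+k+1)!
m, k = 3, 4
print(round(integral(lambda x: x**m * (1 - x)**k, 0, 1), 6),
      round(math.factorial(m) * math.factorial(k) / math.factorial(m + k + 1), 6))
```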
18. The first fundamental theorem of calculus. It says that if we differentiate the indefinite
integral of a function then we obtain the function again:
$$\frac{d}{dx}\int_a^x f(u)\,du = f(x).$$
To be more precise, we need to assume that f can be integrated. One condition for this is that
f be continuous. A better condition is that f be piecewise continuous. Let us assume the latter.
Then the first fundamental theorem of calculus says that
Z x
F(x) := f (u)du
a
is differentiable at all points at which f is continuous and that F0 (x) = f (x) at these points. If
we take the function $f(x) = \mathbb{1}_{x \ge c}$, for some $c > 0$, and compute its indefinite integral starting from $a = 0$, we find $F(x) = 0$ if $x \le c$ and $F(x) = x - c$ if $x \ge c$. Then $F'(x)$ exists and equals $f(x)$ everywhere except at $x = c$, where $F$ is not differentiable.
19. The second fundamental theorem of calculus. It says that if we integrate the derivative
of a function then we obtain the function again, in the sense that
$$\int_a^b \frac{d}{dx}F(x)\,dx = F(b) - F(a).$$
More precisely, if $F : [a, b] \to \mathbb{R}$ has derivative $F'(x)$ at all points $x \in [a, b]$ and $F'$ is Riemann-integrable, then $\int_a^b F'(x)\,dx = F(b) - F(a)$.
Integration by parts. The formula
$$\int_a^b f(x)\,g'(x)\,dx = f(b)g(b) - f(a)g(a) - \int_a^b f'(x)\,g(x)\,dx$$
follows from the product rule and the second fundamental theorem of calculus.
Some useful sums:
$$\sum_{j=1}^{n} 1 = n.$$
For $\rho \ne 1$,
$$S_n := \sum_{j=0}^{n} \rho^j = \frac{\rho^{n+1} - 1}{\rho - 1}.$$
(If $\rho = 1$ the sum is trivially equal to $n + 1$, the number of its terms.) You can actually discover this formula by observing that $S_1 = 1 + \rho = \frac{\rho^2 - 1}{\rho - 1}$, as claimed, and that $S_n = 1 + \rho S_{n-1}$. If $|\rho| < 1$ we may let $n \to \infty$ and obtain
$$\sum_{j=0}^{\infty} \rho^j = \frac{1}{1 - \rho},$$
because $\rho^{n+1} \to 0$ as $n \to \infty$. You can actually discover this formula, when you forget it, because if $S$ is the sum then, trivially, $S = 1 + \rho S$, so $S = 1/(1 - \rho)$. We know that
$$\sum_{n=1}^{\infty} \frac{1}{n^p} < \infty \iff p > 1.$$
In fact,
$$\sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6}.$$
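A one-line numerical illustration of the last identity (Python, for what it is worth): the partial sums approach $\pi^2/6 \approx 1.644934$, with an error of roughly $1/N$ after $N$ terms.

```python
import math

partial = sum(1 / n**2 for n in range(1, 200001))
print(round(partial, 6), round(math.pi**2 / 6, 6))   # 1.644929 vs 1.644934
```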
Index

∑_S, sum over a discrete set S
∑_{i=1}^∞, sum over positive integers
sup, least upper bound
t(d), see T distribution
unif(·), see uniform distribution
var, see variance
∅, see empty set
|A|, cardinality of set A
{0, 1}^N, set of infinite binary sequences
{H, T}^N, set of infinite coin tosses
{. . .}, unordered list of objects (set)
e, base of natural logarithms
e^x, exp(x), exponential function
f'(x), df/dx, derivative
f ∗ g, see convolution of functions
f ◦ g, is the function x ↦ f(g(x))
n!, see factorial
|, conditional on, given that

absolutely continuous function, 151
affine, 124
area function, 101
    existence of, 101
atom of a probability measure, 150

Bayes' rule, 80
    a coin with a random probability of heads, 82, 230
    false negative probability, 80
    false positive probability, 80
    find the gift, 83
    prisoner's dilemma, 84
Bernoulli
    random variables, 102
Bernoulli trials
    finitely many, 103
    infinitely many, 107
    on a general index set, 268
    sparse limit, 270
binomial coefficient, 44
binomial distribution, 103
birthday coincidences, 45
    via conditioning, 81
Bonferroni inequalities, 37, 281
Boole's inequality, 36
Bose-Einstein model, 66

cardinality, 30, 33, 273
Cauchy distribution, 182
Cauchy-Schwarz inequality, 173
ceiling function, see upper integer part
center of mass, see centroid
central limit theorem, 247
centroid, 185
Chebyshev's inequality, 174
chi-squared distribution, 256
class of events, see sigma-field
completing the square, 18
    many variables, 208, 209
conditional density, 129
conditional distribution, see conditional probability measure
    regular, 229
    under normality, 239, 241
conditional expectation, 95, 223
    existence, 225
    for discrete r.v.s, 95
    for general r.v.s, 223
    under normality, 235, 236
conditional probability, 78
conditional probability measure, 229
conditional variance
    under normality, 238
confidence interval, 248
continuous random variable, 150
convergence in distribution, 244
convergence in probability, 244
convolution, 232
    of functions, 233
    of probability measures, 232
correlation, see inner product
correlation coefficient, 173
counting, 42
covariance, 173
covariance matrix, 185
    square root of, 209
Cramér-Wold theorem, 185

dancing pairs, see matching problem
density transformation, 123, 132
    in many dimensions, 132
    in one dimension, 123
Dirac probability measure, 47