Lectures
Introduction to Probability
Based on D. Stirzaker’s book and Dover Publications
Davar Khoshnevisan
University of Utah
Firas Rassoul-Agha
University of Utah
Example 1.1 (The Monty Hall problem, Steve Selvin, 1975). Three doors:
behind one is a nice prize; behind the other two lie goats. You choose a
door. The host (Monty Hall) knows where the prize is and opens a door
that has a goat. He gives you the option of changing your choice to the
remaining unopened door. Should you take his offer?
The answer is “yes.” Indeed, if you do not change your mind, then
to win you must choose the prize right from the start. This is 1 in 3. If
you do change your mind, then you win if you choose a goat right from
the start (for then the host opens the other door with the goat and when
you switch you have the prize). This is 2 in 3. Your chances double if you
switch.
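The 1/3 versus 2/3 split can also be checked empirically. The following is a minimal Monte Carlo sketch, not part of the original notes; the function name monty_hall and the trial count are our own choices.

    import random

    def monty_hall(switch, trials=100_000):
        # estimate the winning probability with or without switching doors
        wins = 0
        for _ in range(trials):
            prize = random.randrange(3)          # door hiding the prize
            choice = random.randrange(3)         # contestant's first pick
            opened = next(d for d in range(3) if d != prize and d != choice)
            if switch:
                choice = next(d for d in range(3) if d != choice and d != opened)
            wins += (choice == prize)
        return wins / trials

    print(monty_hall(switch=False))   # close to 1/3
    print(monty_hall(switch=True))    # close to 2/3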
model, which is to predict the behavior without having to toss the coin
that many times. What we should do is first define a model, then draw
from it the prediction about the frequency of heads.
Here is how we will do things. First, we define a state space (or sample
space) that we will denote by Ω. We think of the elements of Ω as outcomes
of the experiment.
Then we specify a collection F of subsets of Ω. Each of these subsets
is called an event. We will “only be allowed” to talk about the probability
of these events; i.e. it shall be illegal to ask about the probability of an
event not in F .
When Ω is finite, F can be taken to be the collection of all its subsets.
In this case, we are allowed to talk about the probability of any event.
Thus the next step is to assign a probability P(A) to every A ∈ F . We
will talk about this after the following examples.
For instance, to model a single coin toss, we can take
Ω = {H, T } and F = {∅, {H}, {T }, {H, T }}.
Note that the event {H, T } reads: “we tossed the coin and got heads or
tails” NOT “heads and tails”!
Example 1.3. Roll a six-sided die; what is the probability of rolling a six?
First, write a sample space. Here is a natural one:
Ω = {1, 2, 3, 4, 5, 6}.
For example, the event that we rolled and got an even outcome is the event
{2, 4, 6}. The event that we got an odd number or a 6 is {1, 3, 5, 6}. And so
on.
Example 1.4. Toss two coins; what is the probability that we get two
heads? A natural sample space is
Ω = {(H, H), (H, T ), (T , H), (T , T )}.
2. Rules of probability
Rule 1. 0 ≤ P(A) ≤ 1 for every event A.
Rule 2. P(Ω) = 1. “Something will happen with probability one.”
Rule 3 (Addition rule). If A and B are disjoint events [i.e., A ∩ B = ∅],
then the probability that at least one of the two occurs is the sum of the
individual probabilities. More precisely put,
P(A ∪ B) = P(A) + P(B).
Note that Ω ∩ ∅ = ∅ and hence these two events are disjoint. Further-
more, Ω ∪ ∅ = Ω. So Rule 3, when applied to the two disjoint events Ω
and ∅, implies the following:
P(Ω) = P(Ω) + P(∅).
Canceling P(Ω) on both sides gives that P(∅) = 0. This makes sense: the
probability that nothing happens is zero.
Example 1.5 (Coin toss model). We have seen that to model a coin toss
we set Ω = {H, T } and let F be all subsets of Ω. Now, we can assign
probabilities to the events in F . We know P{H, T } = 1 by Rule 2. Also, we
have just seen that P(∅) = 0. To complete the model we just need to assign
a probability to each of {H} and {T }. The numbers have to be between 0
and 1 by Rule 1. So pick p ∈ [0, 1] and let P{H} = p. By Rule 3 we have
P{H} + P{T } = P{H, T } = 1.
Thus, P{T } = 1 − p.
This is our first probability model. It models a coin with heads loaded
to come out with probability p and tails with probability 1 − p.
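As a quick sanity check of this model (our own sketch, not from the notes), one can simulate n tosses of a p-coin and compare the observed frequency of heads with p.

    import random

    def head_frequency(p, n=100_000):
        # toss a p-coin n times and return the observed fraction of heads
        heads = sum(random.random() < p for _ in range(n))
        return heads / n

    print(head_frequency(0.3))   # should be close to 0.3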
3. Algebra of events
Given two sets A and B that are subsets of some bigger set Ω:
• A ∪ B is the “union” of the two and consists of elements belonging
to either set; i.e. x ∈ A ∪ B is equivalent to x ∈ A or x ∈ B.
• A ∩ B is the “intersection” of the two and consists of elements
shared by the two sets; i.e. x ∈ A ∩ B is equivalent to x ∈ A and
x ∈ B.
• Ac is the “complement” of A and consists of elements in Ω that
are not in A.
We write A r B for A ∩ Bc ; i.e. elements in A but not in B.
Clearly, A ∪ B = B ∪ A and A ∩ B = B ∩ A. Also, A ∪ (B ∪ C) = (A ∪ B) ∪ C,
which we simply write as A ∪ B ∪ C. Thus, it is clear what is meant by
A1 ∪ · · · ∪ An . Similarly for intersections.
We write A ⊆ B when A is inside B; i.e. x ∈ A implies x ∈ B. It is clear
that if A ⊆ B, then A ∩ B = A and A ∪ B = B. Thus, if A1 ⊆ A2 ⊆ · · · ⊆ An ,
then ∩_{i=1}^{n} Ai = A1 and ∪_{i=1}^{n} Ai = An .
It is clear that A ∩ Ac = ∅ and A ∪ Ac = Ω. It is also not very hard to
see that (A ∪ B)c = Ac ∩ Bc . (Not being in A or B is the same thing as not
being in A and not being in B.) Similarly, (A ∩ B)c = Ac ∪ Bc .
Homework Problems
Exercise 1.1. You ask a friend to choose an integer N between 0 and 9. Let
A = {N ≤ 5}, B = {3 ≤ N ≤ 7} and C = {N is even and > 0}. List the
points that belong to the following events:
(a) A ∩ B ∩ C
(b) A ∪ (B ∩ Cc )
(c) (A ∪ B) ∩ Cc
(d) (A ∩ B) ∩ ((A ∪ C)c )
Exercise 1.2. Let A, B and C be events in a sample space Ω. Prove the
following identities:
(a) (A ∪ B) ∪ C = A ∪ (B ∪ C)
(b) A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)
(c) (A ∪ B)c = Ac ∩ Bc
(d) (A ∩ B)c = Ac ∪ Bc
Exercise 1.3. Let A, B and C be arbitrary events in a sample space Ω.
Express each of the following events in terms of A, B and C using inter-
sections, unions and complements.
(a) A and B occur, but not C;
(b) A is the only one to occur;
(c) at least two of the events A, B, C occur;
(d) at least one of the events A, B, C occurs;
(e) exactly two of the events A, B, C occur;
(f) exactly one of the events A, B, C occurs;
(g) not more than one of the events A, B, C occur.
Exercise 1.4. Two sets are disjoint if their intersection is empty. If A and B
are disjoint events in a sample space Ω, are Ac and Bc disjoint? Are A ∩ C
and B ∩ C disjoint? What about A ∪ C and B ∪ C?
Exercise 1.5. We roll a die 3 times. Give a sample space Ω and a set of
events F for this experiment.
Exercise 1.6. An urn contains three chips: one black, one green, and one
red. We draw one chip at random. Give a sample space Ω and a collection
of events F for this experiment.
Exercise 1.7. If An ⊂ An−1 ⊂ · · · ⊂ A1 , show that ∩_{i=1}^{n} Ai = An and
∪_{i=1}^{n} Ai = A1 .
Lecture 2
Example 2.1. The sets {1, 2}, {2, 3}, and {1, 3} are disjoint but not pair-wise
disjoint.
Example 2.2. If A and B are disjoint, then A ∪ C and B ∪ C are disjoint only
when C = ∅. To see this, we write (A∪C)∩(B∪C) = (A∩B)∪C = ∅∪C = C.
On the other hand, A ∩ C and B ∩ C are obviously disjoint.
Example 2.3. If A, B, C, and D are some events, then the event “B and
at least A or C, but not D” is written as B ∩ (A ∪ C) r D or, equivalently,
B ∩ (A ∪ C) ∩ Dc . Similarly, the event “A but not B, or C and D” is written
(A ∩ Bc ) ∪ (C ∩ D).
Example 2.4. Now, to be more concrete, let A = {1, 3, 7, 13}, B = {2, 3, 4, 13, 15},
C = {1, 2, 3, 4, 17}, D = {13, 17, 30}. Then, A ∪ C = {1, 2, 3, 4, 7, 13, 17},
B ∩ (A ∪ C) = {2, 3, 4, 13}, and B ∩ (A ∪ C) r D = {2, 3, 4}. Similarly,
A ∩ Bc = {1, 7}, C ∩ D = {17}, and (A ∩ Bc ) ∪ (C ∩ D) = {1, 7, 17}.
Example 2.5. We want to write the solutions to |x − 5| + |x − 3| ≥ |x| as a union
of disjoint intervals. For this, we first need to figure out what the absolute
values are equal to. There are four cases. If x ≤ 0, then the inequality
becomes 5 − x + 3 − x ≥ −x, that is 8 ≥ x, which is always satisfied (when
x ≤ 0). Next is the case 0 ≤ x ≤ 3, where we have 5 − x + 3 − x ≥ x,
which means 8 ≥ 3x, and so 8/3 < x ≤ 3 is not allowed. The next case is
3 ≤ x ≤ 5, which gives 5 − x + x − 3 ≥ x and thus 2 ≥ x, which cannot
happen (when 3 ≤ x ≤ 5). Finally, x ≥ 5 implies x − 5 + x − 3 ≥ x and x ≥ 8,
which rules out 5 ≤ x < 8. In short, the solutions to the above inequality
are the whole real line except the three intervals (8/3, 3], [3, 5], and [5, 8).
This is really the whole real line except the one interval (8/3, 8). In other
words, the solutions are the points in (−∞, 8/3] ∪ [8, ∞).
Homework Problems
if A, B ∈ F , then Ac ∈ F and A ∪ B ∈ F .
Example 3.1. We can take F = {∅, Ω}. This is the smallest possible F . In
this case, we are only allowed to ask about the probability of something
happening and that of nothing happening.
Figure 3.1. Félix Édouard Justin Émile Borel (Jan 7, 1871 – Feb 3, 1956, France)
Rules 1–3 suffice if we want to study only finite sample spaces. But
infinite sample spaces are also interesting. This happens, for example, if
we want to write a model that answers, “what is the probability that we
toss a coin 12 times before we toss heads?” This leads us to the next, and
final, rule of probability.
Rule 4 (Extended addition rule). If A1 , A2 , . . . are (countably-many) pair-
wise disjoint events, then
P(∪_{i=1}^{∞} Ai ) = Σ_{i=1}^{∞} P(Ai ).
4. Properties of probability
Rules 1–4 have other consequences as well.
Example 3.7. Since P(B r A) ≥ 0, the above also shows another physically-
appealing property:
A ⊆ B =⇒ P(A) ≤ P(B).
5. Equally-likely outcomes
Suppose Ω = {ω1 , . . . , ωN } has N distinct elements (“N distinct outcomes
of the experiment”). One way of assigning probabilities to every subset of
Ω is to just let
P(A) = |A| / |Ω| = |A| / N,
where |E| denotes the number of elements of E. Let us check that this prob-
ability assignment satisfies Rules 1–4. Rules 1 and 2 are easy to verify, and
Rule 4 holds vacuously because Ω does not have infinitely-many disjoint
subsets. It remains to verify Rule 3. If A and B are disjoint subsets of Ω,
then |A ∪ B| = |A| + |B|. Rule 3 follows from this. In this example, each
outcome ωi has probability 1/N. Thus, this is the special case of “equally
likely outcomes.”
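As an illustration (a sketch we add here; the helper prob is hypothetical), the formula P(A) = |A|/|Ω| amounts to plain counting.

    # Hypothetical helper (ours): P(A) = |A| / |Omega| for equally likely outcomes.
    def prob(event, omega):
        return len(event & omega) / len(omega)

    omega = {1, 2, 3, 4, 5, 6}        # a fair die
    print(prob({2, 4, 6}, omega))     # even outcome: 0.5
    print(prob({1, 3, 5, 6}, omega))  # "odd or a six": 2/3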
Homework Problems
Exercise 3.1. We toss a coin twice. We consider three steps in this experi-
ment: 1. before the first toss; 2. after the first toss, but before the second
toss; 3. after the two tosses.
(a) Give a sample space Ω for this experiment.
(b) Give the collection F3 of observable events at step 3.
(c) Give the collection F2 of observable events at step 2.
(d) Give the collection F1 of observable events at step 1.
Exercise 3.2. Aaron and Bill toss a coin one after the other until one of
them gets a head. Aaron starts and the first one to get a head wins.
(a) Give a sample space for this experiment.
(b) Describe the events that correspond to “Aaron wins”, “Bill wins”
and “no one wins”.
Exercise 3.3. Give an example to show that P(A\B) does not need to equal
P(A) − P(B).
Lecture 4
Figure 4.1. Brown eyes and hair, split into “not a math major” (0.25 − 0.1 = 0.15) and “math major” (0.1).
Lemma 4.2. If Ai , i ≥ 1, are [countably many] events (not necessarily disjoint),
then
P(∪_{i≥1} Ai ) ≤ Σ_{i≥1} P(Ai ).
For finitely many such events, A1 , · · · , An , the proof of the lemma goes
by induction using the previous lemma. The proof of the general case of
infinitely many events uses rule 4 and is omitted.
Example 4.3. The probability a student has brown hair is 0.6, the probabil-
ity a student has brown eyes is 0.45, the probability a student has brown
hair and eyes and is a math major is 0.1, and the probability a student
has brown eyes or brown hair is 0.8. What is the probability of a student
having brown eyes and hair, but not being a math major? We know that
P{brown eyes or hair}
= P{brown eyes} + P{brown hair} − P{brown eyes and hair}.
Thus, the probability of having brown eyes and hair is 0.45 + 0.6 − 0.8 =
0.25. But then,
P{brown eyes and hair} = P{brown eyes and hair and math major}
+ P{brown eyes and hair and not math major}.
Therefore, the probability we are seeking equals 0.25 − 0.1 = 0.15. See
Figure 4.1.
For example,
P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(A ∩ B) − P(A ∩ C) − P(B ∩ C) + P(A ∩ B ∩ C).
(4.5)
Proving the inclusion-exclusion formula is deferred to Exercise 33.2.
2. Word of caution
One has to be careful when working out the state space. Consider, for
example, tossing two identical fair coins and asking about the probability
of the two coins landing with different faces; i.e. one heads and one tails.
Since the two coins are identical and one cannot tell which is which, the
state space can be taken as
Ω = {“two heads”,“two tails”,“one heads and one tails”}.
A common mistake, however, is to assume these outcomes to be equally
likely. This would be a perfectly fine mathematical model. But it would
not be modeling the toss of two identical fair coins. For example, if we do
the tossing a large number of times and observe the fraction of time we
got two different faces, this fraction will not be close to 1/3. It will in fact
be close to 1/2.
To resolve the issue, let us paint one coin in red. Then, we can tell
which coin is which and a natural state space is
Ω = {(H1 , H2 ), (T1 , T2 ), (H1 , T2 ), (T1 , H2 )}.
Now, these outcomes are equally likely. Since coins do not behave differ-
ently when they are painted, the probabilities assigned to the state space
in the previous case of identical coins must be
P{two heads} = P{two tails} = 1/4 and P{one heads and one tails} = 1/2.
This matches what an empirical experiment would give, and hence is the
more accurate model of a toss of two fair coins.
3. Rolling dice
Roll two fair dice fairly; all possible outcomes are equally likely.
3.3. What are the chances that we roll a total of five pips? Let
A = {(1, 4) , (2, 3) , (3, 2) , (4, 1)}.
We need to find P(A) = |A|/36 = 4/36 = 1/9.
3.4. What is the probability that we roll somewhere between two and five
pips (inclusive)? Let
A = {(1, 1), (1, 2), (2, 1), (1, 3), (2, 2), (3, 1), (1, 4), (4, 1), (2, 3), (3, 2)},
where the pairs are grouped according to sums 2, 3, 4, and 5. Thus, P(A) = |A|/36 = 10/36.
3.5. What are the odds that the product of the numbers of pips thus rolled
is an odd number? The event in question is
A := {(1, 1), (1, 3), (1, 5), (3, 1), (3, 3), (3, 5), (5, 1), (5, 3), (5, 5)}.
And P(A) = 9/36 = 1/4.
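All three dice computations above can be verified by enumerating the 36 equally likely outcomes; this short Python sketch (ours, not in the notes) does exactly that.

    from itertools import product
    from fractions import Fraction

    omega = list(product(range(1, 7), repeat=2))   # the 36 equally likely outcomes

    def prob(event):
        return Fraction(sum(event(a, b) for a, b in omega), len(omega))

    print(prob(lambda a, b: a + b == 5))           # 1/9
    print(prob(lambda a, b: 2 <= a + b <= 5))      # 10/36 = 5/18
    print(prob(lambda a, b: (a * b) % 2 == 1))     # 1/4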
4. Easy cards
There are 52 cards in a deck. You deal two cards, all pairs equally likely.
Math model: Ω is the collection of all pairs [drawn without replace-
ment from an ordinary deck]. What is |Ω|? To answer this note that 2|Ω|
is the number of all possible ways to give a pair out; i.e., 2|Ω| = 52 × 51,
by the principle of counting. Therefore,
|Ω| = (52 × 51) / 2 = 1326.
• The probability that exactly one card is an ace is 4 × 48 = 192
divided by 1326. This probability is ≈ 0.1448.
• The probability that both cards are aces is (4 × 3)/2 = 6 divided
by 1326, which is ≈ 0.0045.
• The probability that both cards have the same face value is P{ace and ace} +
· · · + P{king and king} = 13 × 6/1326 ≈ 0.0588.
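These three card probabilities are easy to reproduce with binomial coefficients; the following sketch (ours) uses math.comb.

    from math import comb

    pairs = comb(52, 2)                  # |Omega| = 1326
    print(4 * 48 / pairs)                # exactly one ace: about 0.1448
    print(comb(4, 2) / pairs)            # both aces: about 0.0045
    print(13 * comb(4, 2) / pairs)       # both cards share a face value: about 0.0588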
Homework Problems
Exercise 4.1. A fair die is rolled 5 times and the sequence of scores recorded.
(a) How many outcomes are there?
(b) Find the probability that first and last rolls are 6.
Exercise 4.2. If a 3-digit number (000 to 999) is chosen at random, find the
probability that exactly one digit will be larger than 5.
Exercise 4.3. A license plate is made of 3 numbers followed by 3 letters.
(a) What is the total number of possible license plates?
(b) What is the number of license plates with the alphabetical part
starting with an A?
Exercise 4.4. An urn contains 3 red, 8 yellow and 13 green balls; another
urn contains 5 red, 7 yellow and 6 green balls. We pick one ball from each
urn at random. Find the probability that both balls are of the same color.
Lecture 5
For example, check that p(10) ≈ 0.88 while p(50) ≈ 0.03. In fact, if n ≥ 23,
then p(n) < 0.5.
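Assuming, as the context suggests, that p(n) denotes the probability that n randomly chosen people all have distinct birthdays in a 365-day year, the quoted values can be checked with a few lines of Python (our sketch).

    def p(n):
        # probability that n people all have different birthdays (365-day year)
        result = 1.0
        for i in range(n):
            result *= (365 - i) / 365
        return result

    print(round(p(10), 2), round(p(50), 2), p(23) < 0.5)   # 0.88 0.03 True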
2. An urn problem
n purple and n orange balls are in an urn. You select two balls at ran-
dom [without replacement]. What are the chances that they have different
colors?
Let us number the purple balls 1 through n and the orange balls n + 1
through 2n. This is only for convenience, so that we can define a sample
space and compute probabilities. The balls of course do not know they are
numbered!
The sample space Ω is then the collection of all pairs of distinct num-
bers 1 through 2n. Note that |Ω| = 2n(2n − 1) [principle of counting].
P{two different colors} = 1 − P{the same color}.
Also,
P{the same color} = P(P1 ∩ P2 ) + P(O1 ∩ O2 ),
where Oj denotes the event that the jth ball is orange, and Pk the event
that the kth ball is purple. The number of elements of P1 ∩ P2 is n(n − 1);
the same holds for O1 ∩ O2 . Therefore,
P{different colors} = 1 − [ n(n − 1)/(2n(2n − 1)) + n(n − 1)/(2n(2n − 1)) ] = n/(2n − 1).
In particular, regardless of the value of n, we always have
P{different colors} > 1/2 .
Proof. To prove this think of the case k = 2. Let B be the set of balls.
Then, B² = B × B is the state space corresponding to picking two balls
with replacement. The second principle of counting says |B²| = |B|² = n².
More generally, when picking k balls we have |B^k | = |B|^k = n^k ways.
Note that the above theorem implies that the number of functions from
a set A to a set B is |B|^|A| . (Think of A = {1, . . . , k} and B being the set of
balls. Each function from A to B corresponds to exactly one way of picking
k balls from B, and vice-versa.)
Example 5.2. What is the probability that 10 people, picked at random,
are all born in May? Let us assume the year has 365 days and ignore leap
years. There are 31 days in May and thus 31^10 ways to pick 10 birthdays in
May. In total, there are 365^10 ways to pick 10 days. Thus, the probability
in question is 31^10 / 365^10 .
Example 5.5. We roll a fair die then toss a coin the number of times shown
on the die. What is the probability of the event A that all coin tosses result
in heads? One could use the state space
Ω = {(1, H), (1, T ), (2, H, H), (2, T , T ), (2, T , H), (2, H, T ), · · · }.
However, the outcomes are then not all equally likely. Instead, we continue
tossing the coin up to 6 times regardless of the outcome of the die. Now,
the state space is Ω = {1, · · · , 6} × {H, T }6 and the outcomes are equally
likely. Then, the event of interest is A = A1 ∪ A2 ∪ A3 ∪ A4 ∪ A5 ∪ A6 , where
Ai is the event that the die came up i and the first i tosses of the coin came
up heads. There is one way for the die to come up i, and 2^{6−i} outcomes of the
six coin tosses in which the first i tosses are heads. Then,
P(Ai ) = 2^{6−i} / (6 × 2^6 ) = 1 / (6 × 2^i ).
These events are clearly disjoint and
P(A) = (1/6)(1/2 + 1/4 + 1/8 + 1/16 + 1/32 + 1/64) = 21/128.
Homework Problems
Read the examples from the lecture that were not covered in class.
Exercise 5.1. Suppose that there are 5 duck hunters, each a perfect shot.
A flock of 10 ducks fly over, and each hunter selects one duck at random
and shoots. Find the probability that 5 ducks are killed.
Lecture 6
As a special case one concludes that there are n(n − 1) · · · (2)(1) ways
to put n objects in order. (This corresponds to picking n balls out of a bag
of n balls, without replacement.)
Definition 6.2. If n ≥ 1 is an integer, then we define “n factorial” as the
following integer:
n! = n · (n − 1) · (n − 2) · · · 2 · 1.
For consistency of future formulas, we define also
0! = 1.
Example 6.4. Five rolls of a fair die. What is P(A), where A is the event
that all five show different faces? Note that |A| is equal to 6 [which face is
left out] times 5!. Thus,
P(A) = (6 · 5!) / 6^5 = 6! / 6^5 .
Example 6.5. The number of permutations of cards in a regular 52-card
deck is 52! > 8 × 10^67. If each person on earth shuffles a deck per second
and even if each of the new shuffled decks gives a completely new per-
mutation, it would still require more than 3 × 10^50 years to see all possible
decks! The currently accepted theory says Earth is no more than 5 × 10^9
years old and our Sun will collapse in about 7 × 10^9 years. The Heat Death
theory places 3 × 10^50 years from now in the Black Hole era. The matter
that stars and life were built of will no longer exist.
Example 6.6. Eight persons, consisting of four couples are to be seated in
a row of eight chairs. What is the probability that significant others in each
couple sit together? Since we have 4 couples, there are 4! ways to arrange
them. Then, there are 2 ways to arrange each couple. Thus, there are 4! × 2^4
ways to seat couples together. The probability is thus (4! × 2^4 )/8! = 1/105.
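For a brute-force check of the 1/105 answer (our own sketch, not part of the notes), one can enumerate all 8! seatings and count those in which every couple sits together.

    from itertools import permutations
    from fractions import Fraction

    couples = [(0, 1), (2, 3), (4, 5), (6, 7)]     # label the 8 people 0..7

    def together(seating):
        # every couple occupies adjacent chairs
        return all(abs(seating.index(a) - seating.index(b)) == 1 for a, b in couples)

    count = sum(together(s) for s in permutations(range(8)))
    print(count, Fraction(count, 40320))           # 384, 1/105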
Before we give the proof, let us do an example that may shed a bit of
light on the situation.
Example 6.8. If there are n people in a room, then they can shake hands
in (n choose 2) many different ways. Indeed, the number of possible hand shakes
is the same as the number of ways we can list all pairs of people, which
is clearly (n choose 2). Here is another, equivalent, interpretation. If there are n
vertices in a “graph,” then there are (n choose 2) many different possible “edges”
that can be formed between distinct vertices. The reasoning is the same.
that can be formed between distinct vertices. The reasoning is the same.
Another way to reason is to say that there are n ways to pick the first
vertex of the edge and n − 1 ways to pick the second one. But then we
would count each edge twice (once from the point of view of each end of
the edge) and thus the number of edges is n(n − 1)/2 = (n choose 2).
Proof of Theorem 6.7. Let us first consider the case of n distinct balls.
Then, there is no difference between, on the one hand, ordered choices
of k1 balls, k2 balls, etc, and on the other hand, putting n balls in order.
There are n! ways to do so. Now, each choice of k1 balls out of n identical
balls corresponds to k1 ! possible choices of k1 balls out of n distinct balls.
Hence, if the number of ways of choosing k1 balls, marking them 1, then
k2 balls, marking them 2, etc, out of n identical balls is N, we can write
k1 ! · · · kr !N = n!. Solve to finish.
Example 6.9. Roll 4 dice; let A denote the event that all faces are different.
Then,
|A| = (6 choose 4) × 4! = 6!/2! = 6!/2 .
The 6-choose-4 is there because that is how many ways we can choose the
different faces. Note that another way to count is via permutations. We
are choosing 4 distinct faces out of 6. In any case,
P(A) = 6! / (2 × 6^4 ) .
Example 6.10. A poker hand consists of 5 cards dealt without replacement
and without regard to order from a standard 52-cards deck. There are
(52 choose 5) = 2,598,960
different standard poker hands possible.
Example 6.11. The number of different “pairs” {a, a, b, c, d} in a poker
hand is
13 × (4 choose 2) × (12 choose 3) × 4^3 ,
where the factors correspond to: choosing the face value a, dealing the two a’s,
choosing the face values b, c, and d, and dealing b, c, d. Then
P(pairs) = [ 13 × 4 × 3 × (5 choose 2) × 12 × 11 × 10 × 4^3 ] / (52 × 51 × 50 × 49 × 48).
Check this is exactly the same as the above answer.
Example 6.12. Let A denote the event that we get two pairs [a, a, b, b, c] in
a poker hand. Then,
|A| = (13 choose 2) × (4 choose 2)^2 × 11 × 4 ,
where the factors correspond to: choosing the paired face values a and b, dealing
the a’s and the b’s, choosing the face value c, and dealing c.
Another way to compute this (which some may find more intuitive) is as:
(13 choose 3) to pick the three face values, times 3 to pick which face value is the
single card, and then times (4 choose 2)^2 × 4 to deal the cards.
Check that this gives the same answer as above.
In any case,
P(two pairs) = [ (13 choose 2) × (4 choose 2)^2 × 11 × 4 ] / (52 choose 5) ≈ 0.048.
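The two-pairs probability can be recomputed directly (our sketch, using math.comb), which also pins down the numerical value of about 0.048.

    from math import comb

    hands = comb(52, 5)                                 # 2,598,960 poker hands
    two_pairs = comb(13, 2) * comb(4, 2)**2 * 11 * 4    # choose & deal the pairs, then the odd card
    print(two_pairs / hands)                            # about 0.0475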
Homework Problems
Exercise 6.7. Suppose that n people are to be seated at a round table. Show
that there are (n − 1)! distinct seating arrangements. Hint: the mathemat-
ical significance of a round table is that there is no dedicated first chair.
Exercise 6.8. An experiment consists of drawing 10 cards from an ordinary
52-card deck.
(a) If the drawing is made with replacement, find the probability that
no two cards have the same face value.
(b) If the drawing is made without replacement, find the probability
that at least 9 cards will have the same suit.
Exercise 6.9. An urn contains 10 balls numbered from 1 to 10. We draw
five balls from the urn, without replacement. Find the probability that the
second largest number drawn is 8.
Exercise 6.10. Eight cards are drawn without replacement from an ordi-
nary deck. Find the probability of obtaining exactly three aces or exactly
three kings (or both).
Exercise 6.11. How many possible ways are there to seat 8 people (A,B,C,D,E,F,G
and H) in a row, if:
(a) No restrictions are enforced;
(b) A and B want to be seated together;
(c) assuming there are four men and four women, men should be
only seated between women and the other way around;
(d) assuming there are five men, they must be seated together;
(e) assuming these people are four married couples, each couple has
to be seated together.
Exercise 6.12. John owns six discs: 3 of classical music, 2 of jazz and one
of rock (all of them different). How many possible ways does John have if
he wants to store these discs on a shelf, if:
(a) No restrictions are enforced;
(b) The classical discs and the jazz discs have to be stored together;
(c) The classical discs have to be stored together, but the jazz discs
have to be separated.
Exercise 6.13. How many (not necessarily meaningful) words can you form
by shuffling the letters of the following words: (a) bike; (b) paper; (c) letter;
(d) minimum.
Lecture 7
1. Properties of combinations
Clearly, n choose 0 and n choose n are both equal to 1. The following is
also clear from the definition: n choose k equals n choose n − k.
Recall that n choose k is the number of ways one can choose k elements
out of a set of n elements. Thus, the above formula is obvious: choosing
which k balls we remove from a bag is equivalent to choosing which n − k
balls we keep in the basket. This is called a combinatorial proof.
Proof. We leave the algebraic proof to the student and give instead the
combinatorial proof. Consider a set of n identical balls and mark one of
them, say with a different color. Any choice of k balls out of the n will
either include or exclude the marked ball. There are n − 1 choose k ways
to choose k elements that exclude the ball and n − 1 choose k − 1 ways
to choose k elements that include the ball. The formula now follows from
the first principle of counting.
Figure 7.1. Blaise Pascal (Jun 19, 1623 – Aug 19, 1662, France)
= Σ_{j=0}^{n−1} (n − 1 choose j) x^{j+1} y^{n−(j+1)} + Σ_{j=0}^{n−1} (n − 1 choose j) x^j y^{n−j} .
Change variables [k = j + 1 for the first sum, and k = j for the second] to
deduce that
(x + y)^n = Σ_{k=1}^{n} (n − 1 choose k − 1) x^k y^{n−k} + Σ_{k=0}^{n−1} (n − 1 choose k) x^k y^{n−k}
= Σ_{k=1}^{n−1} [ (n − 1 choose k − 1) + (n − 1 choose k) ] x^k y^{n−k} + x^n + y^n
= Σ_{k=1}^{n−1} (n choose k) x^k y^{n−k} + x^n + y^n .
The binomial theorem follows.
Remark 7.7. A combinatorial proof of the above theorem consists of writ-
ing
(x + y)^n = (x + y)(x + y) · · · (x + y)   (n times).
Then, one observes that to get the term x^k y^{n−k} one has to choose k of the
above n multiplicands and pick x from them, then pick y from the n − k
remaining multiplicands. There are n choose k ways to do that.
Example 7.8. The coefficient in front of x^3 y^4 in (2x − 4y)^7 is (7 choose 3) 2^3 (−4)^4 =
71680.
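For instance, the coefficient above can be checked in one line (our sketch).

    from math import comb

    print(comb(7, 3) * 2**3 * (-4)**4)   # coefficient of x^3 y^4 in (2x - 4y)^7: 71680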
One can similarly work out the coefficients in the multinomial theo-
rem.
Theorem 7.9 (The multinomial theorem). For all integers n ≥ 0 and r ≥ 2,
and all real numbers x1 , . . . , xr ,
(x1 + · · · + xr )^n = Σ_{0≤k1 ,...,kr ≤n, k1 +···+kr =n} (n choose k1 , · · · , kr ) x1^{k1} · · · xr^{kr} ,
where (n choose k1 , · · · , kr ) was defined in Theorem 6.7.
The sum in the above display is over r-tuples (k1 , . . . , kr ) such that each
ki is an integer between 0 and n, and the ki ’s add up to n. In the case
r = 2, these are simply (k, n − k) where k runs from 0 to n. So there are
n + 1 terms. The following theorem gives the number of terms for more
general r.
Theorem 7.10. The number of terms in the expansion of (x1 + · · · + xr )^n is
(n + r − 1 choose r − 1).
2. Conditional Probabilities
Example 7.11. There are 5 women and 10 men in a room. Three of the
women and 9 of the men are employed. You select a person at random
from the room, all people being equally likely to be chosen. Clearly, Ω is
the collection of all 15 people, and
P{male} = 2/3 ,  P{female} = 1/3 ,  P{employed} = 4/5 .
Also,
P{male and employed} = 9/15 ,  P{female and employed} = 1/5 .
15 5
Someone has looked at the result of the sample and tells us that the person
sampled is employed. Let P(female | employed) denote the conditional
probability of “female” given this piece of information. Then,
P(female | employed) = |female among employed| / |employed| = 3/12 = 1/4 .
Definition 7.12. If A and B are events and P(B) > 0, then the conditional
probability of A given B is
P(A | B) = P(A ∩ B) / P(B) .
For the previous example, this amounts to writing
P(female | employed) = [ |female and employed| / |Ω| ] / [ |employed| / |Ω| ] = 1/4 .
The above definition is consistent with the frequentist intuition about
probability. Indeed, if we run an experiment n times and observe that an
event B occurred nB times, then probabilistic intuition tells us that P(B) '
nB /n. If among these nB times an event A occurred nAB times, then
P(A | B) should be about nAB /nB . Dividing through by n one recovers the
above definition of conditional probability.
Example 7.13. If we deal two cards fairly from a standard deck, the prob-
ability of K1 ∩ K2 [Kj = {King on the jth draw}] is
P(K1 ∩ K2 ) = P(K1 )P(K2 | K1 ) = (4/52) × (3/51).
This agrees with direct counting: |K1 ∩ K2 | = 4 × 3, whereas |Ω| = 52 × 51.
Similarly,
P(K1 ∩ K2 ∩ K3 ) = P(K1 ) × [ P(K1 ∩ K2 ) / P(K1 ) ] × [ P(K3 ∩ K1 ∩ K2 ) / P(K1 ∩ K2 ) ]
= P(K1 )P(K2 | K1 )P(K3 | K1 ∩ K2 )
= (4/52) × (3/51) × (2/50).
Or for that matter,
P(K1 ∩ K2 ∩ K3 ∩ K4 ) = (4/52) × (3/51) × (2/50) × (1/49). (Check!)
Homework Problems
Example 8.2. Once again, we draw two cards from a standard deck. The
probability P(K2 ) (second draw is a king, regardless of the first) is best
computed by splitting it into the two disjoint cases: K1 ∩ K2 and Kc1 ∩ K2 .
Thus,
P(K2 ) = P(K2 ∩ K1 ) + P(K2 ∩ Kc1 ) = P(K1 )P(K2 | K1 ) + P(Kc1 )P(K2 | Kc1 )
= (4/52) × (3/51) + (48/52) × (4/51).
In the above theorem what mattered was that B and Bc partitioned the
space Ω into two disjoint parts. The same holds if we partition the space
into any other number of disjoint parts (even countably many).
Example 8.3. There are three types of people: 10% are poor (π), 30% have
middle-income (µ), and the rest are rich (ρ). 40% of all π, 45% of µ, and
60% of ρ are over 25 years old (Θ). Find P(Θ). The result of Theorem 8.1
gets replaced with
P(Θ) = P(Θ ∩ π) + P(Θ ∩ µ) + P(Θ ∩ ρ)
= P(Θ | π)P(π) + P(Θ | µ)P(µ) + P(Θ | ρ)P(ρ)
= 0.4P(π) + 0.45P(µ) + 0.6P(ρ).
We know that P(ρ) = 0.6 (why?), and thus
P(Θ) = (0.4 × 0.1) + (0.45 × 0.3) + (0.6 × 0.6) = 0.535.
Example 8.4. Let us recall the setting of Example 5.5. We can now use the
state space
{(D1, H1 ), (D1, T1 ), (D2, T1 , T2 ), (D2, T1 , H2 ), (D2, H1 , T2 ), (D2, H1 , H2 ), · · · },
even though we know the outcomes are not equally likely. We can then
compute
P(A) = P{(D1, H1 )} + P{(D2 , H1 , H2 )} + · · · + P{(D6, H1 , H2 , H3 , H4 , H5 , H6 )}
= P(D1)P(H1 | D1) + P(D2 )P{(H1 , H2 ) | D2 } + · · ·
= P(D1)P(H1 | D1) + P(D2)P(H1 | D2)P(H2 | D2, H1 ) + · · · .
We will finish this computation once we learn about independence in the
next lecture.
2. Bayes’ Theorem
The following question arises from time to time: Suppose A and B are two
events of positive probability. If we know P(B | A) but want P(A | B), then
Figure 8.2. Boldface arrows indicate paths giving heads. The path go-
ing to the boldface circle corresponds to choosing the first coin and get-
ting heads. Probabilities multiply along paths by Bayes’ formula.
Example 8.7. There are two coins on a table. The first tosses heads with
probability 1/2, whereas the second tosses heads with probability 1/3. You
select one at random (equally likely) and toss it. Say you got heads. What
are the odds that it was the first coin that was chosen?
Let C denote the event that you selected the first coin. Let H denote
the event that you tossed heads. We know: P(C) = 1/2, P(H | C) = 1/2,
and P(H | Cc ) = 1/3. By Bayes’ formula (see Figure 8.2),
P(C | H) = P(H | C)P(C) / [ P(H | C)P(C) + P(H | Cc )P(Cc ) ]
= (1/2 × 1/2) / (1/2 × 1/2 + 1/3 × 1/2) = 3/5.
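The same Bayes computation in code (our sketch; the variable names are ours).

    p_c = 0.5              # P(C): the first coin was selected
    p_h_given_c = 0.5      # P(H | C)
    p_h_given_not_c = 1/3  # P(H | C^c)

    p_h = p_h_given_c * p_c + p_h_given_not_c * (1 - p_c)   # law of total probability
    print(p_h_given_c * p_c / p_h)                          # 0.6, i.e. 3/5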
Remark 8.8. The denominator in Bayes’ rule simply computes P(B) using
the law of total probability. Sometimes, partitioning the space Ω into A
and Ac is not the best way to go (e.g. when the event Ac is complicated).
In that case, one can apply the law of total probability by partitioning the
space Ω into more than just two parts (as was done in Example 8.3 to
compute the probability P(Θ)). The corresponding diagram (analogous to
Figure 8.2) could then have more than two branches out of each node. But
the methodology is the same. See Exercise 8.10 for an example of this.
Homework Problems
(b) A new customer has an accident during the first year of his con-
tract. What is the probability that he belongs to the group likely
to have an accident?
Exercise 8.9. A transmitting system transmits 0’s and 1’s. The probability
of a correct transmission of a 0 is 0.8, and it is 0.9 for a 1. We know that
45% of the transmitted symbols are 0’s.
(a) What is the probability that the receiver gets a 0?
(b) If the receiver gets a 0, what is the probability that the transmitting
system actually sent a 0?
Exercise 8.10. 46% of the electors of a town consider themselves as inde-
pendent, whereas 30% consider themselves democrats and 24% republi-
cans. In a recent election, 35% of the independents, 62% of the democrats
and 58% of the republicans voted.
(a) What proportion of the total population actually voted?
(b) A random voter is picked. Given that he voted, what is the prob-
ability that he is independent? democrat? republican?
Exercise 8.11. To go to the office, John sometimes drives - and he gets late
once every other time - and sometimes takes the train - and he gets late
only once every four times. When he gets on time, he always keeps
the same transportation the day after, whereas he always changes when
he gets late. Let p be the probability that John drives on the first day.
(a) What is the probability that John drives on the nth day?
(b) What is the probability that John gets late on the nth day?
(c) Find the limit as n → ∞ of the results in (a) and (b).
Lecture 9
1. Independence
It is reasonable to say that A is independent of B if
i.e. ”knowledge of B tells us nothing new about A.” It turns out that the
first equality above implies the other three. (Check!) It also is equivalent
to the definition that we will actually use: A and B are independent if and
only if
P(A ∩ B) = P(A)P(B).
Note that this is now a symmetric formula and thus B is also independent
of A. Note also that the last definition makes sense even if P(B) = 0 or
P(A) = 0.
Example 9.3. Roll a die and let A be the event of an even outcome and B
that of an odd outcome. The two are obviously dependent. Mathemati-
cally, P(A ∩ B) = 0 while P(A) = P(B) = 1/2. On the other hand the two
are disjoint. Conversely, let C be the event of getting a number less than or
equal to 2. Then, P(A ∩ C) = P{2} = 1/6 and P(A)P(C) = 1/2 × 1/3 = 1/6.
So even though A and C are not disjoint, they are independent.
Also, it could happen that any two are independent but (9.1) does not
hold and hence A1 , A2 , and A3 are not independent.
Example 9.6. Roll two dice and let A be the event of getting a number less
than 4 on the first die, B the event of getting a number larger than 3 on the
second die, and C the event of the two faces adding up to 7. Then, each
two of these are independent (check), while
P(A ∩ B ∩ C) = P{(1, 6), (2, 5), (3, 4)} = 1/12,
but P(A)P(B)P(C) = 1/24.
Homework Problems
Figure 10.1. Christiaan Huygens (Apr 14, 1629 – Jul 8, 1695, Netherlands)
have for j = 1, . . . , k + K − 1
Pj = P(H | start with $j)
= P(H ∩ W | start with $j) + P(H ∩ L | start with $j)
= P(W | start with $j)P(H | W and start with $j)
+ P(L | start with $j)P(H | L and start with $j).
Since winning or losing $1 is independent of how much we start with, and
the probability of each is just 1/2, and since starting with $j and winning
$1 results in us having $(j + 1), and similarly for losing $1, we have
Pj = (1/2) P(H | start with $(j + 1)) + (1/2) P(H | start with $(j − 1))
= (1/2) Pj+1 + (1/2) Pj−1 .
In order to solve this, write Pj = (1/2) Pj + (1/2) Pj , so that
(1/2) Pj + (1/2) Pj = (1/2) Pj+1 + (1/2) Pj−1   for 0 < j < k + K.
Multiply both sides by two and solve:
Pj+1 − Pj = Pj − Pj−1 for 0 < j < k + K.
In other words,
Pj+1 − Pj = P1 for 0 < j < k + K.
This is the simplest of all possible “difference equations.” In order to solve
it you note that, since P0 = 0,
Pj+1 = (Pj+1 − Pj ) + (Pj − Pj−1 ) + · · · + (P1 − P0 ) for 0 < j < k + K
= (j + 1)P1 for 0 < j < k + K.
Apply this with j = k + K − 1 to find that
1 = Pk+K = (k + K)P1 , and hence P1 = 1/(k + K).
Therefore,
Pj+1 = (j + 1)/(k + K)   for 0 < j < k + K.
Set j = k − 1 to find the following:
Theorem 10.1 (Gambler’s ruin formula). If you start with k dollars, then the
probability that you end with k + K dollars before losing all of your initial fortune
is k/(k + K) for all k, K > 1.
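The gambler’s ruin formula can be checked by simulation; here is a minimal sketch (ours, not in the notes) that plays many independent games of fair $1 bets.

    import random

    def ruin_probability(k, K, trials=20_000):
        # estimate the chance of reaching k + K dollars before hitting 0, starting from k
        wins = 0
        for _ in range(trials):
            fortune = k
            while 0 < fortune < k + K:
                fortune += random.choice((-1, 1))   # fair $1 bets
            wins += (fortune == k + K)
        return wins / trials

    print(ruin_probability(3, 7), 3 / 10)           # simulation vs. exact k/(k+K)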
2. Random Variables
We often want to measure certain characteristics of an outcome of an ex-
periment; e.g. we pick a student at random and measure their height.
Assigning a value to each possible outcome is what a random variable
does.
Definition 10.2. A D-valued random variable is a function X from Ω to D.
The set D is usually [for us] a subset of the real line R, or d-dimensional
space R^d .
Now if, say, we picked John and he was 6 feet tall, then there is nothing
random about 6 feet! What is random is how we picked the student; i.e.
the procedure that led to the 6 feet. Picking a different student is likely
to lead to a different value for the height. This is modeled by giving a
probability P on the state space Ω.
Example 10.4. In the previous example assume the die is fair; i.e. all out-
comes are equally likely. This corresponds to the probability P on Ω that
gives each outcome a probability of 1/6. As a result, for all k = 1, . . . , 6,
P ({ω ∈ Ω : X(ω) = k}) = P({k}) = 1/6. (10.1)
This probability is zero for other values of k, since X does not take such
values. Usually, we write {X ∈ A} in place of the set {ω ∈ Ω : X(ω) ∈ A}.
In this notation, we have
P{X = k} = 1/6 if k = 1, . . . , 6, and P{X = k} = 0 otherwise. (10.2)
This is a math model for the result of rolling a fair die. Similarly,
P{Y = 5} = 1/3 ,  P{Y = 2} = 1/2 , and P{Y = −1} = 1/6 . (10.3)
This is a math model for the game mentioned in the previous example.
Ω = a, l, m, F, ˇ, ˜ .
Homework Problems
If we define PX (A) = P{X ∈ A}, then one can check that this collection
satisfies the rules of probability and is thus a probability on the set D. In
other words, the law of a D-valued random variable is itself a probability
on D. [In fact, this leads to a subtle point. Since we are now talking
about a probability on D, we need an events set, say G . But then we
need {ω : X(ω) ∈ A} to be in F , for all A ∈ G . This is actually another
condition that we should require when defining a random variable, but
we are overlooking this (important) technical point in this course.]
From now on we will focus on the study of two types of random vari-
ables: discrete random variables and continuous random variables.
Figure 11.1. Jacob Bernoulli (also known as James or Jacques) (Dec 27,
1654 – Aug 16, 1705, Switzerland)
is then called the mass function of X. The values x for which f(x) > 0 are
called the possible values of X.
Note that in this case knowledge of the mass function is sufficient to
determine the law of X. Indeed, for any subset A ⊂ D,
P{X ∈ A} = Σ_{x∈A} P{X = x}, (11.1)
A nice and useful way to rewrite this is as f(x) = p^x (1 − p)^{1−x} , if x ∈ {0, 1},
and f(x) = 0 otherwise.
Let X denote the number of smokers in the sample. Then X ∼ Binomial(n , p),
with p = 0.1 and n = 5 [“success” = “smoker”]. Therefore,
P{X ≥ 2} = 1 − P{X ≤ 1} = 1 − [f(0) + f(1)]
= 1 − [ (5 choose 0)(0.1)^0 (1 − 0.1)^{5−0} + (5 choose 1)(0.1)^1 (1 − 0.1)^{5−1} ]
= 1 − (0.9)^5 − 5(0.1)(0.9)^4 .
Alternatively, we can follow the longer route and write
P{X ≥ 2} = P ({X = 2} ∪ · · · ∪ {X = 5}) = f(2) + f(3) + f(4) + f(5).
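Both routes give the same number, which a few lines of Python confirm (our sketch, with f the Binomial(5, 0.1) mass function).

    from math import comb

    n, p = 5, 0.1
    def f(k):                                  # Binomial(5, 0.1) mass function
        return comb(n, k) * p**k * (1 - p)**(n - k)

    print(1 - f(0) - f(1))                     # P{X >= 2}
    print(sum(f(k) for k in range(2, 6)))      # same number, the long way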
Homework Problems
Example 12.1. A couple has children until their first son is born. Suppose
the genders of their children are independent from one another, and the
probability of a girl is 0.6 every time. Let X denote the number of their
children. Then X ∼ Geometric(0.4). In particular, with p = 0.4, the probability
that the couple has at most three children is
P{X ≤ 3} = p + p(1 − p) + p(1 − p)² = p(3 − 3p + p²) = 0.784.
Figure 12.1. Left: Siméon Denis Poisson (Jun 21, 1781 – Apr 25, 1840,
France). Right: Sir Brook Taylor (Aug 18, 1685 – Nov 30, 1731, England)
Poisson random variables are often used to model the length of a wait-
ing list or a queue (e.g. the number of people ahead of you when you stand
in line at the supermarket). The reason this makes a good model is made
clear in the following section.
integers k = 0 , . . . , n,
fX (k) = (n choose k) (λ/n)^k (1 − λ/n)^{n−k} . (12.2)
Poisson’s “law of rare events” states that if n is large, then the distribu-
tion of X is approximately Poisson(λ). This explains why Poisson random
variables make good models for queue lengths.
In order to deduce this we need two computational lemmas.
Lemma 12.3. For all z ∈ R,
lim_{n→∞} (1 + z/n)^n = e^z .
Proof. Because ex is continuous, it suffices to prove that
lim_{n→∞} n ln(1 + z/n) = z. (12.3)
By Taylor’s expansion,
ln(1 + z/n) = z/n − θ²/2,
where θ lies between 0 and z/n. Equivalently,
z/n − z²/(2n²) ≤ ln(1 + z/n) ≤ z/n + z²/(2n²).
Multiply all sides by n and take limits to find (12.3), and thence the lemma.
Alternatively, one can set h = z/n and write (12.3) as
z lim_{h→0} ln(1 + h)/h = z lim_{h→0} [ln(1 + h) − ln(1)]/h = z (ln x)′|_{x=1} = z.
Lemma 12.4. If k ≥ 0 is a fixed integer, then
(n choose k) ∼ n^k / k!   as n → ∞,
where an ∼ bn means that lim_{n→∞} (an /bn ) = 1.
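To see the law of rare events numerically (our own sketch; λ = 2 and n = 1000 are illustrative choices), one can compare the Binomial(n, λ/n) mass function (12.2) with the Poisson(λ) mass function.

    from math import comb, exp, factorial

    lam, n = 2.0, 1000
    for k in range(5):
        binom = comb(n, k) * (lam / n)**k * (1 - lam / n)**(n - k)
        poisson = exp(-lam) * lam**k / factorial(k)
        print(k, round(binom, 5), round(poisson, 5))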
Homework Problems
Exercise 12.2. Some day, 10,000 cars are travelling across a city ; one car
out of 5 is gray. Suppose that the probability that a car has an accident
this day is 0.002. Using the approximation of a binomial distribution by a
Poisson distribution, compute:
(a) the probability that exactly 15 cars have an accident this day;
(b) the probability that exactly 3 gray cars have an accident this day.
Lecture 13
Proof. Let us start with the first statement. The proof uses Rule 4 of prob-
ability. To do this we write
∪_{n≥1} An = A1 ∪ (A2 r A1 ) ∪ (A3 r A2 ) ∪ · · · .
In fact, the converse is also true. Any function F that satisfies the
above properties (a)-(c) is a distribution function of some random variable
X. This is because of the following property. If X has distribution function
F under P, then
(d) F(b) − F(a) = P{a < X 6 b} for a < b.
Now, say we have a function F satisfying (a)-(c) and we want to re-
verse engineer a random variable X with distribution function F. Let
Ω = (−∞, ∞) and for a < b define
Recall at this point the Borel sets from Example 3.4. It turns out that
properties (a)-(c) are exactly what is needed to be able to extend (13.1) to a
collection {P(B) : B is a Borel set} that satisfies the rules of probability. This
fact has a pretty sophisticated proof that we omit here. But then, consider
the random variable X(ω) = ω. Its distribution function under P is equal
to
P{X ≤ x} = P((−∞, x]) = P(∪_{n≥1} (−n, x]) = lim_{n→∞} (F(x) − F(−n)) = F(x).
Figure 13.1. Distribution functions of (a) X = a, (b) Bernoulli(0.5), (c) Uniform(1, 2, 3, 4), and (d) Geometric(0.25).
Example 13.3. Let X be Bernoulli with parameter p ∈ [0, 1]. Then (see
Figure 13.1(b)),
F(x) = 0 if x < 0, F(x) = 1 − p if 0 ≤ x < 1, and F(x) = 1 if x ≥ 1.
Example 13.4. Let Ω = {1, . . . , n} and let X(k) = k for all k ∈ Ω. Let P be the
probability on Ω corresponding to choosing an element, equally likely;
Homework Problems
for all x. Hence, f(x) is not a mass function. A good way to think about it
is as the “likelihood” of getting outcome x, instead of as the probability of
getting x.
If f is continuous at x, then by the fundamental theorem of calculus,
F′(x) = f(x).
And since F is non-decreasing, we have that f(x) ≥ 0, for all x where f is
continuous.
Conversely, any piecewise continuous f such that ∫_{−∞}^{∞} f(y) dy = 1 and
f(x) ≥ 0 corresponds to a continuous random variable X. Simply define
F(x) = ∫_{−∞}^{x} f(y) dy and check that properties (a)-(c) of distribution func-
tions are satisfied!
In fact, if X has pdf f, then
P{X ∈ A} = ∫_A f(x) dx.
In particular,
P{a ≤ X ≤ b} = P{a < X ≤ b} = P{a ≤ X < b} = P{a < X < b} = ∫_a^b f(x) dx.
Example 14.1. Say X is a continuous random variable with probability
density function
f(x) = 1/(4x²) if |x| > 1/2, and f(x) = 0 otherwise.
Then, to find P{X⁴ − 2X³ − X² + 2X > 0} we need to write the set in question
as a union of disjoint intervals and then integrate f over each interval and
add up the results. So we observe that
X⁴ − 2X³ − X² + 2X = (X + 1)X(X − 1)(X − 2)
and thus the region in question is (−∞, −1) ∪ (0, 1) ∪ (2, ∞). Note that X
is never in (0, 1/2), since fX vanishes there. The probability of X being in
this region is then
∫_{−∞}^{−1} 1/(4x²) dx + ∫_{1/2}^{1} 1/(4x²) dx + ∫_{2}^{∞} 1/(4x²) dx = 5/8 .
Example 14.2. Say X is a continuous random variable with probability
density function f(x) = (1/4) min(1, 1/x²). Then f(x) = 1/4 for −1 ≤ x ≤ 1 and
f(x) = 1/(4x²) for |x| > 1. Thus,
P{−2 ≤ X ≤ 4} = ∫_{−2}^{−1} 1/(4x²) dx + ∫_{−1}^{1} 1/4 dx + ∫_{1}^{4} 1/(4x²) dx = 13/16 .
Homework Problems
Exercise 14.1. Is the random variable X from Exercise 13.1 discrete, con-
tinuous, or neither?
Exercise 14.2. Let X be a random variable with probability density function
given by
f(x) = c(4 − x²) if −2 < x < 2, and f(x) = 0 otherwise.
(a) What is the value of c?
(b) Find the cumulative distribution function of X.
Exercise 14.3. Let X be a random variable with probability density function
given by
f(x) = c cos²(x) if 0 < x < π/2, and f(x) = 0 otherwise.
(a) What is the value of c?
(b) Find the cumulative distribution function of X.
Exercise 14.4. Let X be a random variable with probability density function
given by
f(x) = (1/2) exp(−|x|).
Compute the probabilities of the following events:
(a) {|X| 6 2},
(b) {|X| 6 2 or X > 0},
(c) {|X| 6 2 or X 6 −1},
(d) {|X| + |X − 3| 6 3},
(e) {X³ − X² − X − 2 > 0},
(f) {e^{sin(πX)} > 1},
(g) {X ∈ N}.
Exercise 14.5. Solve the following.
(a) Let f : R → R be defined by f(x) = c/√x if x ≥ 1, and f(x) = 0 if x < 1.
Does there exist a value of c such that f becomes a probability
density function?
Example 15.1 (Uniform density). If a < b are fixed, then the uniform
density on (a , b) is the function
f(x) = 1/(b − a) if a ≤ x ≤ b, and f(x) = 0 otherwise;
see Figure 15.1(a). In this case, we can compute the distribution function
as follows:
F(x) = 0 if x < a, F(x) = (x − a)/(b − a) if a ≤ x ≤ b, and F(x) = 1 if x > b.
A random variable with this density (X ∼ Uniform(a, b)) takes any value in
[a, b] “equally likely” and has 0 likelihood of taking values outside [a, b].
Note that if a < c < d < b, then
P{c ≤ X ≤ d} = F(d) − F(c) = (d − c)/(b − a).
This says that the probability we will pick a number in [c, d] is equal to
the ratio of d − c (“the number of desired outcomes”) to b − a (“the total
number of outcomes”).
Figure 15.1. Density functions of (a) Uniform(1, 2) and (b) Exponential(1).
Figure 16.1. Baron Augustin-Louis Cauchy (Aug 21, 1789 – May 23,
1857, France)
Figure 16.2. Density functions of (a) Cauchy, (b) Standard Normal, and (c) Normal(µ, σ²).
out that it is a good model of the distance for which a certain type of
squirrel carries a nut before burying it. The fat tails of the distribution
then explain the vast spread of certain types of trees in a relatively short
time period!
Example 16.2 (Standard normal density). I claim that
φ(x) = (1/√(2π)) e^{−x²/2}
defines a density function; see Figure 2(b). Clearly, φ(x) > 0 and is con-
tinuous at all points x. So it suffices to show that the area under φ is one.
Define
A = ∫_{−∞}^{∞} φ(x) dx.
Then,
A² = (1/(2π)) ∫_{−∞}^{∞} ∫_{−∞}^{∞} e^{−(x²+y²)/2} dx dy.
Changing to polar coordinates (x = r cos θ, y = r sin θ, which gives a Jacobian of
r) one has
A² = (1/(2π)) ∫_0^{2π} ∫_0^{∞} e^{−r²/2} r dr dθ.
Let s = r²/2 to find that the inner integral is ∫_0^{∞} e^{−s} ds = 1. Therefore,
A² = 1 and hence A = 1, as desired. [Why is A not −1?]
The distribution function of φ is
Φ(z) = (1/√(2π)) ∫_{−∞}^{z} e^{−x²/2} dx.
Of course, we know that Φ(z) → 0 as z → −∞ and Φ(z) → 1 as z → ∞.
Due to symmetry, we also know that Φ(0) = 1/2. (Check that!) Unfor-
tunately, a theorem of Liouville tells us that Φ(z) cannot be computed (in
terms of other “nice” functions). In other words, Φ(z) cannot be computed
Figure 16.3. Johann Carl Friedrich Gauss (Apr 30, 1777 – Feb 23, 1855, Germany)
exactly for any value of z other than z = 0, ±∞. Therefore, people have
approximated and tabulated Φ(z) for various choices of z, using standard
methods used for approximating integrals; see the table in Appendix C.
Here are some consequences of that table [check!!]:
Of course, nowadays one can also use software to compute Φ(z) very accu-
rately. For example, in Excel one can use the command NORMSDIST(0.09)
to compute Φ(0.09).
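Outside of Excel, Φ can also be evaluated through the error function; here is a small Python sketch (ours) using math.erf.

    from math import erf, sqrt

    def Phi(z):
        # standard normal CDF, written in terms of the error function
        return 0.5 * (1 + erf(z / sqrt(2)))

    print(Phi(0.0))     # 0.5
    print(Phi(0.09))    # about 0.5359, matching the table value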
f(x) = (1/(√(2π) σ)) e^{−(x−µ)²/(2σ²)}  for −∞ < x < ∞;
see Figure 16.2. Using a change of variables, one can relate this dis-
tribution to the standard normal one, denoted N(0,1). Indeed, for all
−∞ < a ≤ b < ∞,
∫_a^b f(x) dx = ∫_a^b (1/(√(2π) σ)) e^{−(x−µ)²/(2σ²)} dx
= ∫_{(a−µ)/σ}^{(b−µ)/σ} (1/√(2π)) e^{−z²/2} dz   [z = (x − µ)/σ]
= ∫_{(a−µ)/σ}^{(b−µ)/σ} φ(z) dz   (16.1)
= Φ((b − µ)/σ) − Φ((a − µ)/σ).
One can take a → −∞ or b → ∞ to compute, respectively,
∫_{−∞}^{b} f(x) dx = Φ((b − µ)/σ)  and  ∫_a^{∞} f(x) dx = 1 − Φ((a − µ)/σ).
Note at this point that taking both a → −∞ and b → ∞ proves that f
is indeed a density curve (i.e. has area 1 under it). The operation x 7→
z = (x − µ)/σ is called standardization. Thus, the above calculation shows
that the area between a and b under the Normal(µ,σ2 ) curve is the same
as the one between the standard scores of a and b but under the standard
Normal(0,1) curve. One can now use the standard normal table to estimate
these areas.
Lecture 17
Theorem 17.1 (Bernoulli’s golden theorem a.k.a. the law of large numbers,
1713). Suppose 0 ≤ p ≤ 1 is fixed. Then, with probability 1, as n → ∞,
(Number of heads)/n ≈ p.
In other words: in a large sample (n large), the probability is nearly one that
the percentage in the sample is quite close to the percentage in the population (p);
i.e. with high probability, random sampling works well for large sample sizes.
Next, a natural question comes to mind: for given a ≤ b with 0 ≤
a, b ≤ n, we know that
P {Number of heads is somewhere between a and b} = Σ_{j=a}^{b} (n choose j) p^j (1 − p)^{n−j} .
Figure 17.1. Left: Abraham de Moivre (May 26, 1667 – Nov 27, 1754,
France). Right: Pierre-Simon, marquis de Laplace (Mar 23, 1749 – Mar
5, 1827, France).
Example 17.4. The evening of a presidential election the ballots were opened
and it was revealed that the race was a tie between the democratic and the
republican candidates. In a random sample of 1963 voters what is the
chance that more than 1021 voted for the republican candidate?
The exact answer to this question is computed from a binomial distri-
bution with n = 1963 and p = 0.5. We are asked to compute
P {more than 1021 republican voters} = Σ_{j=1022}^{1963} (1963 choose j) (1/2)^j (1 − 1/2)^{1963−j} .
Because np = 981.5 and √(np(1 − p)) = 22.15, the normal approximation
(Theorem 17.2) yields the following, which turns out to be a quite good
approximation:
P {more than 1021 republican voters} ≈ 1 − Φ((1021 − 981.5)/22.15)
≈ 1 − Φ(1.78) ≈ 1 − 0.9625 = 0.0375.
In other words, the chances are approximately 3.75% that the number of
republican voters in the sample is more than 1021.
Example 17.5. A certain population is comprised of half men and half
women. In a random sample of 10,000 what is the chance that the percent-
age of the men in the sample is somewhere between 49% and 51%?
The exact answer to this question is computed from a binomial distri-
bution with n = 10, 000 and p = 0.5. We are asked to compute
P {between 4900 and 5100 men} = Σ_{j=4900}^{5100} (10000 choose j) (1/2)^j (1 − 1/2)^{10000−j} .
Because np = 5000 and √(np(1 − p)) = 50, the normal approximation (The-
orem 17.2) yields the following, which turns out to be a quite good approx-
imation:
P {between 4900 and 5100 men} ≈ Φ((5100 − 5000)/50) − Φ((4900 − 5000)/50)
= Φ(2) − Φ(−2) = Φ(2) − (1 − Φ(2))
= 2Φ(2) − 1
≈ (2 × 0.9772) − 1 = 0.9544.
In other words, the chances are approximately 95.44% that the percentage
of men in the sample is somewhere between 49% and 51%. This is con-
sistent with the law of large numbers: in a large sample, the probability is
nearly one that the percentage of the men in the sample is quite close to
the percentage of men in the population.
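Both normal-approximation examples can be reproduced with the standardization recipe (16.1); here is a short sketch (ours), reusing a Φ helper based on math.erf.

    from math import erf, sqrt

    def Phi(z):
        return 0.5 * (1 + erf(z / sqrt(2)))

    n, p = 10_000, 0.5
    mu, sigma = n * p, sqrt(n * p * (1 - p))
    print(Phi((5100 - mu) / sigma) - Phi((4900 - mu) / sigma))   # about 0.9545

    n, p = 1963, 0.5
    mu, sigma = n * p, sqrt(n * p * (1 - p))
    print(1 - Phi((1021 - mu) / sigma))                          # about 0.037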
Homework Problems
values of α > 0. The key to unraveling this remark is the following “re-
producing property”:
The first term is zero, and the second (the integral) is αΓ (α), as claimed.
Now, it is easy to see that Γ (1) = ∫_0^{∞} e^{−x} dx = 1. Therefore, Γ (2) = 1 × Γ (1) =
1, Γ (3) = 2 × Γ (2) = 2, . . . , and in general,
In other words, even though Γ (α) is usually hard to compute for a general
α, it is quite easy to compute for α’s that are half nonnegative integers.
Define a new random variable Y = 2X² + 1. Then, Y takes the values 1 and
3. The mass function of Y is
fY (y) = P{Y = y} = P{2X² + 1 = y} = P{X² = (y − 1)/2}
= P{X = √((y − 1)/2) or X = −√((y − 1)/2)}.
When y = 3 we have
fY (3) = P{X = 1 or X = −1} = fX (1) + fX (−1) = 1/6 + 1/2 = 2/3.
When y = 1 we get
fY (1) = P{X = 0 or X = 0} = fX (0) = 1/3.
The procedure of this example actually produces a theorem.
Theorem 18.3. Let X be a discrete random variable with the set of possible values
being D. If Y = g(X) for a function g, then the set of possible values of Y is g(D)
and
fY (y) = Σ_{x: g(x)=y} fX (x) if y ∈ g(D), and fY (y) = 0 otherwise.
When g is one-to-one and has inverse h (i.e. x = h(y)) then the formula
simplifies to
fY (y) = fX (h(y)). (18.2)
In the above example, solving for x in terms of y gives x = −1 or 1 if y = 3,
and x = 0 if y = 1. Thus,
fY (y) = fX (−1) + fX (1) if y = 3, and fY (y) = fX (0) if y = 1.
Lecture 19
[Compare the above formula with the one for the discrete case (18.2)!]
Example 19.5. Suppose X has density fX . Then let us find the density
function of Y = X². Again, we seek to first compute FY . Now, for all y > 0,
FY (y) = P{X² ≤ y} = P{−√y ≤ X ≤ √y} = FX (√y) − FX (−√y).
Differentiate [d/dy] to find that
fY (y) = [ fX (√y) + fX (−√y) ] / (2√y).
On the other hand, FY (y) = 0 if y 6 0 and so fY (y) = 0 as well.
For example, consider the case that X is standard normal. Then,
fX² (y) = e^{−y/2} / √(2πy) if y > 0, and fX² (y) = 0 if y ≤ 0.
Or if X is Cauchy, then
fX² (y) = 1/(π √y (1 + y)) if y > 0, and fX² (y) = 0 if y ≤ 0.
Homework Problems
Exercise 19.1. Let X be a uniform random variable on [−1, 1]. Let Y = e^{−X} .
What is the probability density function of Y ?
Exercise 19.2. Let X be an exponential random variable with parameter
λ > 0. What is the probability density function of Y = X² ?
Exercise 19.3. Solve the following.
(a) (Log-normal distribution) Let X be a standard normal random
variable. Find the probability density function of Y = e^X .
(b) Let X be a standard normal random variable and Z a random
variable solution of Z³ + Z + 1 = X. Find the probability density
function of Z.
Exercise 19.4. Solve the following.
(a) Let X be an exponential random variable with parameter λ > 0.
Find the probability density function of Y = ln(X).
(b) Let X be a standard normal random variable and Z a random
variable with values in (−π/2, π/2), solution of Z + tan(Z) = X. Find
the density function of Z.
Exercise 19.5. Let X be a continuous random variable with probability
density function given by fX (x) = 1/x² if x > 1 and 0 otherwise. A random
variable Y is given by
2X if X > 2,
Y=
X2 if X < 2.
Find the probability density function of Y.
Exercise 19.6. Solve the following.
(a) Let f be the probability density function of a continuous random
variable X. Find the probability density function of Y = X2 .
(b) Let X be a standard normal random variable. Show that Y = X2
has a Gamma distribution and find the parameters.
Exercise 19.7. We throw a ball from the origin with velocity v0 and an
angle θ with respect to the x-axis. We assume v0 is fixed and θ is uniformly
distributed on [0, π/2]. We denote by R the distance at which the object lands, i.e. hits the x-axis again. Find the probability density function of R. Hint: we remind you that the laws of mechanics show that the distance is given by R = v₀² sin(2θ)/g, where g is the gravity constant.
Lecture 20
and
fY(y) = (3e^{−3(1+√y)} + 3e^{−3(1−√y)}) / (2√y).
This formula cannot be true for y large. Indeed, e^{−3(1−√y)}/√y goes to ∞ as y → ∞, while fY integrates to 1.
In fact, if y > 1, then
FY(y) = ∫_0^{1+√y} 3e^{−3x} dx
and
fY(y) = 3e^{−3(1+√y)} / (2√y).
Another way to see the above is to write x = 1 − √y or 1 + √y if 0 < y < 1, and x = 1 + √y if y > 1.
Finish the computation and check you get the same answer as before.
Or if X is Cauchy, then
f_{|X|}(y) = (2/π) · 1/(1 + y²) if y > 0, and 0 otherwise.
Example 20.3. As you can see, it is best to try to work on these problems on
a case-by-case basis. Here is another example where you need to do that.
Let Θ be uniformly distributed between −π/2 and π/2. Let Y = tan Θ.
Geometrically, Y is obtained by picking a line, in the xy-plane, passing
through the origin so that the angle of this line with the x-axis is uniformly
distributed. The y-coordinate of the intersection between this line and
the line x = 1 is our random variable Y. What is the pdf of Y? The
transformation is y = tan θ and thus the pdf of Y is
fY(y) = fΘ(arctan(y)) |arctan′(y)| = 1 / (π(1 + y²)).
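A short simulation agrees with this Cauchy answer (our own sketch, not from the text): if Θ is uniform on (−π/2, π/2) then Y = tan Θ should satisfy P{Y ≤ 1} = 1/2 + arctan(1)/π = 0.75.

```python
import random, math

random.seed(1)
n = 200_000
count = 0
for _ in range(n):
    theta = random.uniform(-math.pi / 2, math.pi / 2)
    if math.tan(theta) <= 1.0:
        count += 1
print(count / n)  # close to 0.75, the Cauchy CDF evaluated at 1
```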
[Figure: a nondecreasing transformation G applied to values u₁, . . . , u₆ in (0, 1); a flat stretch of G makes G(u₃) = G(u₄) = G(u₅), so distinct inputs can be mapped to the same output.]
Notice, by the way, that the above shows that one can start with a con-
tinuous random variable and transform it into a discrete random variable!
Of course, the transformation G is not continuous.
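This is exactly how discrete random variables are produced from a Uniform(0, 1) variable in practice (compare Exercise 21.4 below). Here is a minimal Python sketch of such a transformation G for a finite mass function; the helper name G_from_mass is ours.

```python
import random

def G_from_mass(values, probs):
    # Build G so that G(U) has the given mass function when U ~ Uniform(0, 1):
    # G(u) = values[k] when u falls in the k-th subinterval of length probs[k].
    cutoffs, total = [], 0.0
    for p in probs:
        total += p
        cutoffs.append(total)
    def G(u):
        for v, c in zip(values, cutoffs):
            if u <= c:
                return v
        return values[-1]
    return G

random.seed(2)
G = G_from_mass([0, 1, 2], [0.5, 0.3, 0.2])
sample = [G(random.random()) for _ in range(100_000)]
print([sample.count(v) / len(sample) for v in (0, 1, 2)])  # near [0.5, 0.3, 0.2]
```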
2. Joint distributions
If X and Y are two discrete random variables, then their joint mass function
is
f(x , y) = P{X = x , Y = y}.
We might write fX,Y in place of f in order to emphasize the dependence
on the two random variables X and Y.
Here are some properties of fX,Y: it is nonnegative, its values add up to one (Σ_x Σ_y fX,Y(x, y) = 1), and summing over one variable gives the mass function of the other: fX(x) = Σ_y fX,Y(x, y) and fY(y) = Σ_x fX,Y(x, y).
Example 21.3. You roll two fair dice. Let X be the number of 2s shown,
and Y the number of 4s. Then X and Y are discrete random variables, and
f(x , y) = P{X = x , Y = y}
 = 1/36 if x = 2 and y = 0,
   1/36 if x = 0 and y = 2,
   2/36 if x = y = 1,
   8/36 if x = 0 and y = 1,
   8/36 if x = 1 and y = 0,
   16/36 if x = y = 0,
   0 otherwise.
Sometimes it helps to draw up a table of “joint probabilities”:
x\y 0 1 2
0 16/36 8/36 1/36
1 8/36 2/36 0
2 1/36 0 0
From this we can also calculate fX and fY . For instance,
fX(1) = P{X = 1} = f(1 , 0) + f(1 , 1) = 10/36.
In general, you compute the row sums (fX ) and put them in the margin;
you do the same with the column sums (fY ) and put them in the bottom
row. In this way, you obtain:
x\y 0 1 2 fX
0 16/36 8/36 1/36 25/36
1 8/36 2/36 0 10/36
2 1/36 0 0 1/36
fY 25/36 10/36 1/36 1
The “1” designates the right-most column sum (which should be one),
and/or the bottom-row sum (which should also be one). This is also the
sum of the elements of the table (which should also be one).
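The row/column-sum recipe is easy to automate. A small Python sketch (ours, not the text's) recovers fX and fY from the joint mass function of Example 21.3:

```python
from fractions import Fraction
from collections import defaultdict

f = {  # joint mass function f(x, y) from Example 21.3
    (0, 0): Fraction(16, 36), (0, 1): Fraction(8, 36), (0, 2): Fraction(1, 36),
    (1, 0): Fraction(8, 36),  (1, 1): Fraction(2, 36),
    (2, 0): Fraction(1, 36),
}

fX, fY = defaultdict(Fraction), defaultdict(Fraction)
for (x, y), p in f.items():
    fX[x] += p   # row sums
    fY[y] += p   # column sums

print(dict(fX))          # {0: 25/36, 1: 10/36, 2: 1/36}
print(dict(fY))          # {0: 25/36, 1: 10/36, 2: 1/36}
print(sum(f.values()))   # 1
```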
3. Independence
Definition 21.5. Let X and Y be discrete with joint mass function f. We say
that X and Y are independent if for all x, y,
fX,Y (x , y) = fX (x)fY (y).
• Suppose A and B are two sets, and X and Y are independent.
Then,
P{X ∈ A , Y ∈ B} = Σ_{x∈A} Σ_{y∈B} f(x , y) = (Σ_{x∈A} fX(x)) (Σ_{y∈B} fY(y)) = P{X ∈ A}P{Y ∈ B}.
• Similarly, if h and g are functions, then h(X) and g(Y) are inde-
pendent as well.
• All of this makes sense for more than 2 random variables as well.
Example 21.6 (Example 21.3, continued). Note that in this example, X and
Y are not independent. For instance,
f(1 , 2) = 0 ≠ fX(1)fY(2) = 10/36 × 1/36.
Example 21.7. Let X ∼ Geometric(p1 ) and Y ∼ Geometric(p2 ) be independent.
What is the mass function of Z = min(X , Y)?
Let q1 = 1 − p1 and q2 = 1 − p2 be the probabilities of failure. Recall from Lecture 10 that P{X ≥ n} = q1^{n−1} and P{Y ≥ n} = q2^{n−1} for all integers n ≥ 1. Therefore,
P{Z ≥ n} = P{X ≥ n , Y ≥ n} = P{X ≥ n}P{Y ≥ n} = (q1 q2)^{n−1},
as long as n ≥ 1 is an integer. Because P{Z ≥ n} = P{Z = n} + P{Z ≥ n + 1} for all integers n ≥ 1,
P{Z = n} = P{Z ≥ n} − P{Z ≥ n + 1} = (q1 q2)^{n−1} − (q1 q2)^n = (q1 q2)^{n−1} (1 − q1 q2).
Else, P{Z = n} = 0. Thus, Z ∼ Geometric(p), where p = 1 − q1 q2 .
3. Independence 105
This makes sense: at each step we flip two coins and wait until the
first time one of them comes up heads. In other words, we keep flipping
as long as both coins land tails. Thus, this is the same as flipping one coin
and waiting for the first time it comes up heads, as long as the probability
of tails in this third coin is equal to the probability of both of the original
coins coming up tails.
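The conclusion Z ∼ Geometric(1 − q1q2) is easy to test by simulation (a sketch, under the convention that a Geometric(p) counts the flips up to and including the first head; the helper geometric is ours):

```python
import random

def geometric(p):
    # Number of independent p-coin flips needed to see the first head.
    n = 1
    while random.random() >= p:
        n += 1
    return n

random.seed(3)
p1, p2 = 0.3, 0.5
q1q2 = (1 - p1) * (1 - p2)
trials = 100_000
mins = [min(geometric(p1), geometric(p2)) for _ in range(trials)]
print(sum(mins) / trials)    # empirical mean of Z = min(X, Y)
print(1 / (1 - q1q2))        # 1/p for p = 1 - q1*q2, about 1.54
```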
Homework Problems
Exercise 21.1. Let X and Y be two discrete random variables with joint
mass function f(x, y) given by
x|y 1 2
1 0.4 0.3
2 0.2 0.1
and f(x, y) = 0 otherwise.
(a) Determine if X and Y are independent.
(b) Compute P(XY 6 2).
Exercise 21.2. We roll two fair dice. Let X1 (resp. X2 ) be the smallest (resp.
largest) of the two outcomes.
(a) What is the joint mass function of (X1 , X2 )?
(b) What are the probability mass functions of X1 and X2 ?
(c) Are X1 and X2 independent?
Exercise 21.3. We draw two balls with replacement out of an urn in which
there are three balls numbered 2,3,4. Let X1 be the sum of the outcomes
and X2 be the product of the outcomes.
(a) What is the joint mass function of (X1 , X2 )?
(b) What are the probability mass functions of X1 and X2 ?
(c) Are X1 and X2 independent?
Exercise 21.4. Let F be the function defined by:
F(x) = 0 if x < 0; x²/3 if 0 ≤ x < 1; 1/3 if 1 ≤ x < 2; x/6 + 1/3 if 2 ≤ x < 4; and 1 if x ≥ 4.
Let U be a Uniform(0,1) random variable. Give a transformation G that
would make X = G(U) a random variable with CDF F.
Lecture 22
fX+Y(z) = Σ_{x=0}^{∞} fX(x) fY(z − x)
= Σ_{x=0}^{∞} (e^{−λ} λ^x / x!) fY(z − x)
= Σ_{x=0}^{z} (e^{−λ} λ^x / x!) (e^{−γ} γ^{z−x} / (z − x)!)
= (e^{−(λ+γ)} / z!) Σ_{x=0}^{z} \binom{z}{x} λ^x γ^{z−x}
= (e^{−(λ+γ)} / z!) (λ + γ)^z,
thanks to the binomial theorem. For other values of z, it is easy to see that
fX+Y (z) = 0. This computation shows us that X + Y ∼ Poisson(λ + γ). This
makes sense, doesn’t it?
Observe that the first line in the above computation is completely gen-
eral and in fact proves the following theorem.
Theorem 22.2. If X and Y are discrete and independent, then
fX+Y(z) = Σ_x fX(x) fY(z − x).
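Theorem 22.2 is a discrete convolution, which is straightforward to code. The sketch below (our own helper convolve_mass) convolves two truncated Poisson mass functions and compares the result with the Poisson(λ + γ) mass function.

```python
import math
from collections import defaultdict

def poisson_mass(lam, kmax):
    return {k: math.exp(-lam) * lam**k / math.factorial(k) for k in range(kmax + 1)}

def convolve_mass(fX, fY):
    # f_{X+Y}(z) = sum over x of fX(x) * fY(z - x)   (Theorem 22.2)
    fZ = defaultdict(float)
    for x, px in fX.items():
        for y, py in fY.items():
            fZ[x + y] += px * py
    return dict(fZ)

lam, gam = 1.5, 2.0
fZ = convolve_mass(poisson_mass(lam, 40), poisson_mass(gam, 40))
target = poisson_mass(lam + gam, 20)
print(max(abs(fZ[k] - target[k]) for k in range(21)))  # tiny (truncation error only)
```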
Example 22.3. Suppose X = ±1 with probability 1/2 each; and Y = ±2
with probability 1/2 each. Then,
fX+Y(z) = 1/4 if z = 3, −3, 1, or −1, and 0 otherwise.
Example 22.4. Let X and Y denote two independent Geometric(p) random
variables with the same parameter p ∈ (0 , 1). What is the mass function
of X + Y? If z = 2, 3, . . . , then
fX+Y(z) = Σ_x fX(x) fY(z − x) = Σ_{x=1}^{∞} p q^{x−1} fY(z − x)
= Σ_{x=1}^{z−1} p q^{x−1} · p q^{z−x−1} = p² Σ_{x=1}^{z−1} q^{z−2} = (z − 1) p² q^{z−2}.
Else, fX+Y (z) = 0. This shows that X + Y is a negative binomial. This again
makes sense, right?
Example 22.5. If X ∼ Binomial(n , p) and Y ∼ Binomial(m , p) for the
same parameter p ∈ (0 , 1), then what is the distribution of X + Y? If
z = 0, 1, . . . , n + m, then
fX+Y(z) = Σ_x fX(x) fY(z − x) = Σ_{x=0}^{n} \binom{n}{x} p^x q^{n−x} fY(z − x)
= Σ_{0≤x≤n, 0≤z−x≤m} \binom{n}{x} p^x q^{n−x} \binom{m}{z−x} p^{z−x} q^{m−(z−x)}
= p^z q^{m+n−z} Σ_{0≤x≤n, z−m≤x≤z} \binom{n}{x} \binom{m}{z−x}.
[The sum is over all integers x such that x is between 0 and n, and x is also between z − m and z.] For other values of z, fX+Y(z) = 0.
Equivalently, we can write for all z = 0, . . . , n + m,
fX+Y(z) = \binom{n+m}{z} p^z q^{m+n−z} Σ_{0≤x≤n, z−m≤x≤z} \binom{n}{x} \binom{m}{z−x} / \binom{n+m}{z}.
The last sum equals one, because its terms form the mass function of a hypergeometric distribution; therefore X + Y ∼ Binomial(n + m , p).
For any function f of two variables that satisfies these properties, one
can reverse engineer two random variables that will have f as their joint
density function.
We want to find P{|X+Y| 6 1/2}. In this case, the areas are easy to compute
geometrically; see Figure 23.2. The area of the square is 22 = 4. The
shaded area is the sum of the areas of two identical trapezoids and a
parallelogram. It is thus equal to 2 × (1/2) × (1 + 1/2)/2 + 1 × 1 = 7/4. Or, alternatively, the non-shaded area is that of two triangles. The shaded area is thus equal to 4 − 2 × (1/2) × (3/2) × (3/2) = 7/4. Then, P{|X + Y| ≤ 1/2} = 7/16.
We could have used the definition of joint density functions and written
P{|X + Y| ≤ 1/2} = ∫∫_{|x+y|≤1/2} fX,Y(x, y) dx dy
= ∫_{−1}^{−1/2} ∫_{−x−1/2}^{1} (1/4) dy dx + ∫_{−1/2}^{1/2} ∫_{−x−1/2}^{−x+1/2} (1/4) dy dx + ∫_{1/2}^{1} ∫_{−1}^{−x+1/2} (1/4) dy dx
= 7/16.
[Figure 23.2: the square [−1, 1]² with the region |x + y| ≤ 1/2 shaded (left) and the region xy ≤ 1/2 shaded (right).]
Next, we want to compute P{XY 6 1/2}. This area is not easy to com-
pute geometrically, in contrast to |x + y| 6 1/2; see Figure 23.2. Thus, we
need to compute it using the definition of joint density functions.
P{XY ≤ 1/2} = ∫∫_{xy≤1/2} fX,Y(x, y) dx dy
= ∫_{−1}^{−1/2} ∫_{1/(2x)}^{1} (1/4) dy dx + ∫_{−1/2}^{1/2} ∫_{−1}^{1} (1/4) dy dx + ∫_{1/2}^{1} ∫_{−1}^{1/(2x)} (1/4) dy dx
(the three inner integrals equal 1/4 − 1/(8x), 2/4, and 1/(8x) + 1/4, respectively)
= [x/4 − ln|x|/8]_{−1}^{−1/2} + 1/2 + [ln|x|/8 + x/4]_{1/2}^{1} = 3/4 + (ln 2)/4.
Note that we could have computed the middle term geometrically: the
area of the rectangle is 2 × 1 = 2 and thus the probability corresponding to
it is 2/4 = 1/2. An alternative way to compute the above probability is by
computing one minus the integral over the non-shaded region in the right
Figure 23.2. If, on top of that, one observes that both the pdf and the two
non-shaded parts are symmetric relative to exchanging x and y, one can
quickly compute
P{XY ≤ 1/2} = 1 − 2 ∫_{1/2}^{1} ∫_{1/(2x)}^{1} (1/4) dy dx = 1 − 2 ∫_{1/2}^{1} (1/4 − 1/(8x)) dx = 3/4 + (ln 2)/4.
[Figure: the unit square with the lines y = x and y = x/2 drawn for 0 ≤ x ≤ 1.]
Homework Problems
Exercise 23.1. Let X and Y be two continuous random variables with joint
density given by
f(x, y) = 1/4 if −1 ≤ x ≤ 1 and −1 ≤ y ≤ 1, and 0 otherwise.
Compute the following probabilities:
(a) P{X + Y ≤ 1/2},
(b) P{X − Y ≤ 1/2},
(c) P{XY ≤ 1/4},
(d) P{Y/X ≤ 1},
(e) P{Y/X ≤ 1/2},
(f) P{|X| + |Y| ≤ 1},
(g) P{|Y| ≤ e^X}.
Lecture 24
Similarly,
fY(b) = ∫_{−∞}^{∞} f(x , b) dx.
Then,
fX(x) = ∫_{−√(1−x²)}^{√(1−x²)} (1/π) dy if −1 < x < 1, and 0 otherwise
      = (2/π) √(1 − x²) if −1 < x < 1, and 0 otherwise.
By symmetry, fY is the same function.
2. Independence
Just as in the discrete case, two continuous random variables are said to be
independent if fX,Y (x, y) = fX (x)fY (y), for all x and y. As a consequence,
one has
P{X ∈ A, Y ∈ B} = ∫∫_{A×B} fX,Y(x, y) dx dy = ∫∫_{A×B} fX(x)fY(y) dx dy
= (∫_A fX(x) dx) (∫_B fY(y) dy) = P{X ∈ A}P{Y ∈ B}.
This actually implies that if X and Y are independent, then f(X) and g(Y)
are also independent, for any functions f and g. We omit the short proof.
Example 24.3. Let X ∼ Exponential(λ1) and Y ∼ Exponential(λ2) be independent. What is the distribution of Z = min(X, Y)?
Let us compute
FZ (z) = P{min(X, Y) 6 z} = 1 − P{X > z, Y > z}
= 1 − P{X > z}P{Y > z} = 1 − (1 − FX (z))(1 − FY (z))
= 1 − e−λ1 z e−λ2 z = 1 − e−(λ1 +λ2 )z .
Thus, Z ∼ Exponential(λ1 + λ2). This makes sense: say you have two stations working in parallel, the first serving about λ1 people per unit time and the second serving about λ2 people per unit time. Then, the time until the first of the two stations finishes is like being served by a single station that serves about λ1 + λ2 people per unit time.
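A quick simulation check of Example 24.3 (a sketch; random.expovariate(rate) draws an exponential variable with the given rate):

```python
import random

random.seed(4)
lam1, lam2 = 2.0, 3.0
n = 200_000
zs = [min(random.expovariate(lam1), random.expovariate(lam2)) for _ in range(n)]
print(sum(zs) / n)        # empirical mean of Z = min(X, Y)
print(1 / (lam1 + lam2))  # 0.2, the mean of an Exponential(lam1 + lam2)
```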
Homework Problems
Exercise 24.1. Let X and Y be two continuous random variables with joint
density given by
f(x, y) = c(x + y) if 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1, and 0 otherwise.
(a) Find c.
(b) Compute P{X < Y}.
(c) Find the marginal densities of X and Y.
(d) Compute P{X = Y}.
Exercise 24.2. Let X and Y be two continuous random variables with joint
density given by
f(x, y) = 4xy if 0 ≤ x ≤ 1, 0 ≤ y ≤ 1 and x > y; 6x² if 0 ≤ x ≤ 1, 0 ≤ y ≤ 1 and x < y; and 0 otherwise.
(a) Find the marginal densities of X and Y.
(b) Let A = {X ≤ 1/2} and B = {Y ≤ 1/2}. Find P(A ∪ B).
Exercise 24.3. Let X and Y be two continuous random variables with joint
density given by
f(x, y) = 2e^{−(x+y)} if 0 ≤ y ≤ x, and 0 otherwise.
Find the marginal densities of X and Y.
Exercise 24.4. Let (X, Y) be uniformly distributed over the parallelogram
with vertices (−1, 0), (1, 0), (2, 1), and (0, 1).
(a) Find and sketch the density functions of X and Y.
(b) A new random variable Z is defined by Z = X + Y. Show that Z is
a continuous random variable and find and sketch its probability
density function.
Exercise 24.5. Let (X, Y) be continuous random variables with joint density
f(x, y) = (x + y)/8, 0 6 x 6 2, 0 6 y 6 2; f(x, y) = 0 elsewhere.
(a) Find the probability that X2 + Y 6 1.
(b) Find the conditional probability that exactly one of the random
variables X and Y is 6 1, given that at least one of the random
variables is 6 1.
(c) Determine whether or not X and Y are independent.
Lecture 25
after the change of variables s = √(x² + y²) and ϕ = arctan(y/x). Therefore, for all r ∈ (0 , ρ) and θ ∈ (−π , π),
FR,Θ(r , θ) = r²(θ + π) / (2πρ²).
Since, by definition, FR,Θ(r, θ) = ∫_{−∞}^{r} ∫_{−∞}^{θ} fR,Θ(s, ϕ) dϕ ds, we see that
fR,Θ(r , θ) = ∂²FR,Θ/∂r∂θ (r , θ).
It is also clear that fR,Θ(r, θ) = 0 if r ∉ (0, ρ) or θ ∉ (−π, π). Therefore,
fR,Θ(r , θ) = r/(πρ²) if 0 < r < ρ and −π < θ < π, and 0 otherwise.
Observe that the above yields fΘ(θ) = 1/(2π) if −π < θ < π, which implies that Θ is Uniform(−π, π). On the other hand, fR(r) = 2r/ρ² if 0 < r < ρ, which implies that R is not Uniform(0, ρ). Indeed, it is more likely to pick a point with a larger radius (since there are more of them!).
Figure 25.1. Domains for pdfs in Example 25.4. Left: domain transformation. Right: integration area for CDF calculation.
You should check that this yields Example 25.1, for instance.
Example 25.4. Let fX,Y (x, y) = 2(x + y), if 0 < x < y < 1, and 0 other-
wise. We want to find fXY . In this case, we will first find fX,XY , and then
integrate the first coordinate out. This means we will use the transforma-
tion (u, v) = (x, xy). Solving for x and y we get (x, y) = (u, v/u), with 0 < v < u < √v < 1; see Figure 25.1. The Jacobian is then equal to
1 × (1/u) − 0 × (−v/u²) = 1/u.
As a result, fU,V(u, v) = 2(u + v/u)/u, with 0 < v < u < √v < 1, and
fV(v) = 2 ∫_v^{√v} (1 + v/u²) du = 2(1 − v), for 0 < v < 1.
Homework Problems
Exercise 25.3. Let X and Y be two independent random variables, both with distribution N(0, 1). Find the probability density function of Z = Y/X.
Exercise 25.4. A point-size worm is inside an apple in the form of the
sphere x2 + y2 + z2 = 4a2 . (Its position is uniformly distributed.) If the
apple is eaten down to a core determined by the intersection of the sphere
and the cylinder x2 + y2 = a2 , find the probability that the worm will be
eaten.
Exercise 25.5. A point (X, Y, Z) is uniformly distributed over the region
described by x2 + y2 6 4, 0 6 z 6 3x. Find the probability that Z 6 2X.
Exercise 25.6. Let T1 , . . . , Tn be the order statistics of X1 , . . . , Xn . That is, T1
is the smallest of the X’s, T2 is the second smallest, and so on. Tn is the
largest of the X’s. Assume X1 , . . . , Xn are independent, each with density
f. Show that the joint density of T1 , . . . , Tn is given by g(t1 , . . . , tn) = n! f(t1) · · · f(tn) if t1 < t2 < · · · < tn and 0 otherwise.
Hint: First find P{T1 6 t1 , . . . , Tn 6 tn , X1 < X2 < . . . < Xn }.
Exercise 25.7. Let X, Y and Z be three continuous independent random
variables with densities given by
f(x) = e^{−x} if x > 0, and 0 otherwise.
Compute P{X > 2Y > 3Z}.
Exercise 25.8. A man and a woman agree to meet at a certain place some
time between 10 am and 11 am. They agree that the first one to arrive will
wait 10 minutes for the other to arrive and then leave. If the arrival times
are independent and uniformly distributed, what is the probability that
they will meet?
Exercise 25.9. When commuting to work, John can take public transporta-
tion (first a bus and then a train) or walk. Buses ride every 20 minutes and
trains ride every 10 minutes. John arrives at the bus stop at 8 am precisely,
but he doesn’t know the exact schedule of buses, nor the exact schedule of
trains. The total travel time on foot (resp. by public transportation) is 27
minutes (resp. 12 minutes).
(a) What is the probability that taking public transportation will take
more time than walking?
(b) If buses are systematically 2 minutes late, how does it change the
probability in (a)?
Exercise 25.10. Let X and Y be two independent random variables, both
with distribution N(0, σ2 ) for some σ > 0. Let R and Θ be two random
variables defined by
X = R cos(Θ),
Y = R sin(Θ),
where R > 0. Prove that R and Θ are independent and find their density
functions.
Exercise 25.11. A chamber consists of the inside of the cylinder x2 + y2 =
1. A particle at the origin is given initial velocity components vx = U
and vy = V, where (U, V) are independent random variables, each with
standard normal density. There is no motion in the z-direction and no
force acting on the particle after the initial push at time t = 0. If T is
the time at which the particle strikes the wall of the chamber, find the
distribution and density functions of T .
Lecture 26
This in fact shows that U and V are independent, even though they are
both mixtures of both X and Y. It also shows that they are both normal
random variables with parameters (mean) 0 and (variance) 2; i.e. N(0, 2).
Now, we will start building up the necessary material to make the link
between the mathematical definition of probability (state space, function
on events, etc) and the intuitive one (relative frequency). The starting point
is the notion of mathematical expectation.
If X is a discrete random variable with mass function f, its mathematical expectation (or mean) is defined as E[X] = Σ_x x f(x), where the sum runs over all possible values x of X. When X has finitely many possible values the above sum is well defined. It corresponds to the physical notion of center of gravity of point
masses placed at positions x with weights f(x).
Example 26.3. We toss a fair coin and win $1 for heads and lose $1 for tails.
This is a fair game since the average winnings equal $0. Mathematically,
if X equals the amount we won, then E[X] = 1 × 1/2 + (−1) × 1/2 = 0.
Example 26.4. We roll a die that is loaded as follows: it comes up 6 with
probability 0.4, 1 with probability 0.2, and the rest of the outcomes come
up with probability 0.1 each. Say we lose $2 if the die shows a 2, 3, 4, or 5,
while we win $1 if it shows a 1 and $2 if it shows a 6. On average we win
−2 × 4 × .1 + 1 × 0.2 + 2 × 0.4 = 0.2;
that is we win 20 cents. In a simple case like this one, where X has a finite
amount of possible values, one can use a table:
x −2 1 2
f(x) = P{X = x} 4 × 0.1 0.2 0.4
xf(x) −0.8 0.2 0.8
E[X] = 0.2 is then the sum of the elements in the last row. Intuitively,
this means that if we play, say, 1000 times, we expect to win about $200.
Making this idea more precise is what we mean by “connecting the math-
ematical and the intuitive definitions of probability.” This also gives a fair
price to the game: 20 cents is a fair participation fee for each attempt.
Example 26.5. You roll a fair die and lose as many dollars as pips shown
on the die. Then, you fairly toss an independent fair coin a number of
times equal to the outcome of the die. Each head wins you $2 and each
tail loses you $1. Is this a winning or a losing game? Let X be the amount
of dollars you win after having played the game. Let us compute the
average winning. First, we make a table of all the outcomes.
Outcome 1H 1T 2H 1H1T 2T
x −1 + 2 −1 − 1 −2 + 4 −2 + 2 − 1 −2 − 2
1 1 1 1 1 1
f(x) 6 × 2 6 × 2 6 × 4 2 × 61 × 14 1
6 × 4
1
Then,
E[X] = Σ_x x f(x) = −7/4 = −1.75.
In conclusion, the game is a losing game. In fact, I would only play if they
pay me a dollar and 75 cents each time!
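Simulating the game confirms the −$1.75 average (a sketch using Python's standard library, not code from the text):

```python
import random

random.seed(5)

def play_once():
    die = random.randint(1, 6)          # lose as many dollars as the die shows
    winnings = -die
    for _ in range(die):                # then toss a fair coin `die` times
        winnings += 2 if random.random() < 0.5 else -1
    return winnings

n = 200_000
print(sum(play_once() for _ in range(n)) / n)  # close to -1.75
```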
If X ∼ Binomial(n , p), then E[X] = np. Here is why:
E[X] = Σ_{k=0}^{n} k \binom{n}{k} p^k (1 − p)^{n−k}
= Σ_{k=1}^{n} (n! / ((k − 1)!(n − k)!)) p^k (1 − p)^{n−k}
= np Σ_{k=1}^{n} \binom{n − 1}{k − 1} p^{k−1} (1 − p)^{(n−1)−(k−1)}
= np Σ_{j=0}^{n−1} \binom{n − 1}{j} p^j (1 − p)^{(n−1)−j} = np,
since the last sum adds the Binomial(n − 1 , p) mass function over all of its possible values and therefore equals one.
Similarly, if X ∼ Poisson(λ), then
E[X] = Σ_{k=0}^{∞} k e^{−λ} λ^k / k! = λ Σ_{k=1}^{∞} e^{−λ} λ^{k−1} / (k − 1)! = λ Σ_{j=0}^{∞} e^{−λ} λ^j / j! = λ,
because e^λ = Σ_{j=0}^{∞} λ^j / j!, thanks to Taylor's expansion. So when modeling
the length of a waiting line, the parameter λ is the average length of the
line.
If X is a negative binomial with parameters r and p, then
E[X] = Σ_{k=r}^{∞} k \binom{k − 1}{r − 1} p^r (1 − p)^{k−r}
= Σ_{k=r}^{∞} (k! / ((r − 1)!(k − r)!)) p^r (1 − p)^{k−r}
= r Σ_{k=r}^{∞} \binom{k}{r} p^r (1 − p)^{k−r}
= (r/p) Σ_{k=r}^{∞} \binom{k}{r} p^{r+1} (1 − p)^{(k+1)−(r+1)}
= (r/p) Σ_{j=r+1}^{∞} \binom{j − 1}{(r + 1) − 1} p^{r+1} (1 − p)^{j−(r+1)}
= r/p,
since each summand in the last sum is P{Negative Binomial(r + 1 , p) = j}, and these probabilities add up to one.
If X has infinitely-many possible values but can take both positive and negative values, then we have to be careful with the definition of the sum E[X] = Σ_x x f(x). We can always add the positive and negative parts separately. So, formally, we can write
E[X] = Σ_{x>0} x f(x) + Σ_{x<0} x f(x).
Now, we see that if one of these two sums is finite then, even if the other were infinite, E[X] would be well defined. Moreover, E[X] is finite if, and only if, both sums are finite; i.e. if
Σ_x |x| f(x) < ∞.
Example 27.4. Say X has the mass function f(2^n) = f(−2^n) = 1/2^n, for n ≥ 2. (Note that the probabilities do add up to one: 2 Σ_{n≥2} 1/2^n = 1.) Then, the positive part of the sum gives
Σ_{n≥2} 2^n × 1/2^n = ∞,
and the negative part similarly gives −∞. Hence, E[X] is not well defined.
Homework Problems
Remark: Try to redo the exercise with the assumption that you always
lose the $1 you bet. See how part (b) changes drastically, with just this
small change in the rules of the game!
Exercise 27.3. Let X be a Geometric random variable with parameter p ∈
[0, 1]. Compute E[X].
Lecture 28
E[X] = ∫_{−∞}^{∞} x f(x) dx.
The same issues as before arise: if ∫_{−∞}^{∞} |x| f(x) dx < ∞, then the above integral is well defined and finite. If, on the other hand, ∫_0^∞ x f(x) dx < ∞ but ∫_{−∞}^{0} x f(x) dx = −∞, then the integral is again defined but equals −∞. Conversely, if ∫_0^∞ x f(x) dx = ∞ but ∫_{−∞}^{0} x f(x) dx > −∞, then E[X] = ∞. Finally, if both integrals are infinite, then E[X] is not defined.
For example, if X is Uniform(a , b), then
E[X] = ∫_a^b x · 1/(b − a) dx = (1/2)(b² − a²)/(b − a) = (1/2)(b − a)(b + a)/(b − a) = (b + a)/2.
N.B.: The formula of the first example on page 303 of Stirzaker’s text is
wrong.
Example 28.2 (Gamma). If X is Gamma(α , λ), then for all positive values of x we have f(x) = (λ^α / Γ(α)) x^{α−1} e^{−λx}, and f(x) = 0 for x < 0. Therefore,
E[X] = (λ^α / Γ(α)) ∫_0^∞ x^α e^{−λx} dx
= (1 / (λ Γ(α))) ∫_0^∞ z^α e^{−z} dz    (z = λx)
= Γ(α + 1) / (λ Γ(α))
= α/λ.
In the special case that α = 1, 1/λ is the expectation of an exponential random variable with parameter λ. So when modeling a waiting time,
the parameter of the exponential is one over the average waiting time. The
parameter λ is thus equal to the serving rate: the number of people served
per unit time. Now, you should understand a bit better the derivation of
the exponential distribution that came after Exercise 15.2. Namely, if we
recall from Exercise 27.2 that the average of a geometric with parameter p
is 1/p, we see that if we use p = λ/n, the average will be n/λ. If each coin
flip takes 1/n seconds, then the average serving time is 1/λ, as desired.
Another observation is that E[Gamma(α, λ)] = α/λ the same way as
E[Negative Binomial(r, p)] = r/p. This is not a coincidence and one can
derive the Gamma distribution from the negative binomial similarly to
how the exponential was derived from a geometric.
Example 28.3 (Normal). Suppose X ∼ N(µ , σ²); i.e. f(x) = (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)}. Then,
E[X] = (1/(σ√(2π))) ∫_{−∞}^{∞} x exp(−(x − µ)²/(2σ²)) dx
= (1/√(2π)) ∫_{−∞}^{∞} (µ + σz) e^{−z²/2} dz    (z = (x − µ)/σ)
= µ ∫_{−∞}^{∞} e^{−z²/2}/√(2π) dz + (σ/√(2π)) ∫_{−∞}^{∞} z e^{−z²/2} dz
= µ × 1 + 0    (the second integral vanishes by symmetry)
= µ.
Example 28.4 (Cauchy). In this example, f(x) = π−1 (1 + x2 )−1 . Note that
the expectation is defined only if the following limit exists regardless of
how we let n and m tend to ∞:
(1/π) ∫_{−m}^{n} y/(1 + y²) dy.
Now I argue that the limit does not exist; I do so by showing two different
choices of (n , m) which give rise to different limiting “integrals.”
Suppose m = e^{−π a} n, for some fixed number a. Then,
(1/π) ∫_{−e^{−π a} n}^{n} y/(1 + y²) dy = (1/π) ∫_0^n y/(1 + y²) dy − (1/π) ∫_0^{e^{−π a} n} y/(1 + y²) dy
= (1/(2π)) ∫_1^{1+n²} dz/z − (1/(2π)) ∫_1^{1+e^{−2π a} n²} dz/z    (z = 1 + y²)
= (1/(2π)) ln[(1 + n²)/(1 + e^{−2π a} n²)]
→ (1/(2π)) ln e^{2π a} = a    as n → ∞.
Thus, we can make the limit converge to any number a we want. In fact,
taking m = n² and repeating the above calculation allows us to make the limit converge to −∞, while taking m = √n makes the limit equal to ∞. The upshot is that the Cauchy density does not have a well-defined
expectation. [That is not to say that the expectation is well defined, but
infinite.] In particular, we conclude that E[|X|] = ∞.
Theorem 28.5. If X is a positive random variable with density f, then
E[X] = ∫_0^∞ P{X > x} dx = ∫_0^∞ (1 − F(x)) dx.
Note that the above formula does not involve the density function f.
It turns out (and we omit the math) that we can define the expectation of
any positive random variable (discrete, continuous, or other) using that
formula. That is to say the notion of mathematical expectation (or average
value) is general and applies to any real-valued random variable.
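For instance, for an Exponential(λ) random variable P{X > x} = e^{−λx}, so the tail formula gives E[X] = ∫_0^∞ e^{−λx} dx = 1/λ. The numerical sketch below (ours, not the text's) checks this for λ = 2, comparing a Riemann sum of the tail with a Monte Carlo average.

```python
import random, math

lam = 2.0

# Tail-probability formula: integrate P{X > x} = exp(-lam*x) over (0, infinity).
dx, upper = 0.001, 20.0
tail_integral = sum(math.exp(-lam * (k + 0.5) * dx) * dx
                    for k in range(int(upper / dx)))

random.seed(6)
n = 200_000
mc_mean = sum(random.expovariate(lam) for _ in range(n)) / n

print(tail_integral, mc_mean, 1 / lam)  # all close to 0.5
```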
Lecture 29
as desired.
Theorem 29.3. If (X, Y) have joint density function f(x, y) and g(x, y) is some function, then
E[g(X, Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x, y) f(x, y) dx dy,
provided the integral is well defined. In particular, if X has density f(x) and g(x) is some function, then
E[g(X)] = ∫_{−∞}^{∞} g(x) f(x) dx,
provided, again, that the integral is well defined.
Proof. We show the proofs in the discrete case. The proofs in the contin-
uous case are similar, and the proofs in the general case are omitted. To
prove (1) let x1 , x2 , . . . be the possible values of X. Then, ax1 , ax2 , . . . are
the possible values of aX. Moreover,
E[aX] = Σ_i a x_i f(x_i) = a Σ_i x_i f(x_i) = aE[X].
Let us now prove (2). We will only treat the case where both variables are
nonnegative. Let x1 , x2 , . . . be the possible (nonnegative) values of X and
y1 , y2 , . . . the possible (nonnegative) values of Y. Then, the possible values
If at least one of the variables takes positive and negative values, then
one needs to use slightly more involved arguments requiring facts about
infinite series. We omit the proof in this case.
Next, we prove (3). The only value the random variable a takes is
a and it takes it with probability 1. Thus, its mathematical expectation
simply equals a itself. To prove (4) observe that Y − X is a nonnegative
random variable; i.e. its possible values are all nonnegative. Thus, it has
a nonnegative average. But by (1) and (2) we have 0 6 E[Y − X] = E[Y] +
E[−X] = E[Y] − E[X]. Property (5) is obvious since if there existed an x₀ > 0 for which f(x₀) > 0, then we would have had E[X] = Σ_x x f(x) ≥ x₀ f(x₀) > 0, since the sum is over x ≥ 0.
Finally, we prove (6). Again, let xi and yj be the possible values of X
and Y, respectively. Then, the possible values of XY are given by the set
{xi yj : i = 1, 2, . . . , j = 1, 2, . . . }. Thus,
E[XY] = Σ_{i,j} x_i y_j P{X = x_i , Y = y_j}
= Σ_{i,j} x_i y_j P{X = x_i} P{Y = y_j}    (by independence)
= (Σ_i x_i P{X = x_i}) (Σ_j y_j P{Y = y_j})
= E[X]E[Y].
The third equality was simply the result of summing over j first and then
over i. We can sum in any order because the terms are either of the same
sign, or are summable (if the expectations of X and Y are finite).
E[X2 ] = (1 − p) × 02 + p × 12 = p.
Two observations:
E[X²] = Σ_{k=0}^{n} k² \binom{n}{k} p^k (1 − p)^{n−k} = Σ_{k=1}^{n} k (n!/((k − 1)!(n − k)!)) p^k (1 − p)^{n−k}.
This sum is not easy to evaluate directly. It is easier to first compute E[X(X − 1)], a closely related problem.
E[X(X − 1)] = Σ_{k=0}^{n} k(k − 1) \binom{n}{k} p^k (1 − p)^{n−k}
= Σ_{k=2}^{n} k(k − 1) (n!/(k!(n − k)!)) p^k (1 − p)^{n−k}
= n(n − 1) Σ_{k=2}^{n} ((n − 2)! / ((k − 2)! ([n − 2] − [k − 2])!)) p^k (1 − p)^{n−k}
= n(n − 1) Σ_{k=2}^{n} \binom{n − 2}{k − 2} p^k (1 − p)^{n−k}
= n(n − 1) p² Σ_{k=2}^{n} \binom{n − 2}{k − 2} p^{k−2} (1 − p)^{[n−2]−[k−2]}
= n(n − 1) p² Σ_{ℓ=0}^{n−2} \binom{n − 2}{ℓ} p^ℓ (1 − p)^{[n−2]−ℓ}.
The summand is the probability that a Binomial(n − 2 , p) equals ℓ. Since that probability is added over all of its possible values, the sum is one.
Thus, we obtain E[X(X−1)] = n(n−1)p2 . But X(X−1) = X2 −X. Therefore,
we can apply Theorem 29.4 to find that
E[X2 ] = E[X(X − 1)] + E[X] = n(n − 1)p2 + np = (np)2 + np(1 − p).
Example 29.8. Suppose X ∼ Poisson(λ). We saw in Example 27.1 that
E[X] = λ. In order to compute E[X2 ], we first compute E[X(X − 1)] and find
that
E[X(X − 1)] = Σ_{k=0}^{∞} k(k − 1) e^{−λ} λ^k / k! = Σ_{k=2}^{∞} e^{−λ} λ^k / (k − 2)!
= λ² Σ_{k=2}^{∞} e^{−λ} λ^{k−2} / (k − 2)!.
The sum is equal to one; change variables (j = k − 2) and recognize the jth
term as the probability that Poisson(λ) = j. Therefore,
E[X(X − 1)] = λ2 .
Because X(X − 1) = X2 − X, the left-hand side is E[X2 ] − E[X] = E[X2 ] − λ.
Therefore,
E[X2 ] = λ2 + λ.
Homework Problems
The first term is 0 and the second is 1 (why?). Thus, E[X2 ] = 1. One can
similarly compute E[Xn ] for integers n > 3.
2. Variance
When E[X] is well-defined, the variance of X is defined as
Var(X) = E[(X − E[X])²].
If E[X] = ∞ or −∞ the above is just infinite and does not carry any
information. Thus, the variance is a useful notion when E[X] is finite. The
next theorem says that this is the same as asking for E[|X|] to be finite.
(Think of absolute summability or absolute integrability in calculus.)
Theorem 30.3 (Triangle inequality). E[X] is well defined and finite if, and only
if, E[|X|] < ∞. In that case,
|E[X]| 6 E[|X|].
Proof. Observe that −|X| 6 X 6 |X| and apply (4) of Theorem 29.4.
This of course makes sense: the average of X must be smaller than the
average of |X|, since there are no cancellations when averaging the latter.
Note that the triangle inequality that we know is a special case of the
above: |a + b| 6 |a| + |b|. Indeed, let X equal a or b, equally likely. Now
apply the above theorem and see what happens!
Thus, when E[|X|] is finite:
(1) We predict the as-yet-unseen value of X by the nonrandom num-
ber E[X] (its average value);
(2) Var(X) is the expected squared-error in this prediction. Note that
Var(X) is also a nonrandom number.
The proofs go by direct computation and are left to the student. Note
that (2) says that nonrandom quantities have no variation. (4) says that
shifting by a nonrandom amount does not change the amount of variation
in the random variable.
Let us now compute the variance of a few random variables. But first,
here is another useful way to write the variance
E[(X − E[X])²] = E[X² − 2XE[X] + (E[X])²] = E[X²] − 2E[X]E[X] + (E[X])² = E[X²] − (E[X])².
Example 30.6. We have seen in the previous lecture that if X ∼ Poisson(λ),
then E[X2 ] = λ2 + λ. We have also seen in Example 27.1 that E[X] = λ.
Thus, in this case, Var(X) = λ.
Example 30.7. Suppose X ∼ Bernoulli(p). Then, X2 = X and E[X2 ] = E[X] =
p. But then, Var(X) = p − p2 = p(1 − p).
Example 30.8. If X = Binomial(n , p), then what is Var(X)? We have seen
that E[X] = np and E[X2 ] = (np)2 + np(1 − p). Therefore, Var(X) = np(1 −
p).
Proof. Observe first that E[|XY|] = E[|X|]E[|Y|] < ∞. Thus by the triangle
inequality E[XY] is well defined and finite. Now, the proof of the theorem
follows by direct computation:
Var(X + Y) = E[(X + Y)2 ] − (E[X + Y])2
= E[X2 ] + 2E[XY] + E[Y 2 ] − (E[X])2 − 2E[X]E[Y] − (E[Y])2
= E[X2 ] + 2E[X]E[Y] + E[Y 2 ] − (E[X])2 − 2E[X]E[Y] − (E[Y])2
= E[X2 ] − (E[X]2 ) + E[Y 2 ] − (E[Y])2
= Var(X) + Var(Y).
In the third equality we used property (6) in Theorem 29.4.
Example 30.10. Since a Binomial(n, p) is the sum of n independent Bernoulli(p),
each of which has variance p(1−p), the variance of a Binomial(n, p) is sim-
ply np(1 − p), as already observed by direct computation.
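The additivity of variance for independent summands is easy to see numerically. The sketch below (ours) simulates a Binomial(n, p) as a sum of n independent Bernoulli(p) variables and compares the sample variance with np(1 − p):

```python
import random

random.seed(7)
n, p, trials = 30, 0.3, 100_000

def binomial_sample():
    # Sum of n independent Bernoulli(p) indicators.
    return sum(1 for _ in range(n) if random.random() < p)

xs = [binomial_sample() for _ in range(trials)]
mean = sum(xs) / trials
var = sum((x - mean) ** 2 for x in xs) / trials
print(mean, n * p)            # both about 9.0
print(var, n * p * (1 - p))   # both about 6.3
```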
1. Variance, continued
Example 31.1. Suppose X ∼ Geometric(p) distribution. We have seen al-
ready that E[X] = 1/p (Example 27.2). Let us find a new computation for
this fact, and then go on and find also the variance.
E[X] = Σ_{k=1}^{∞} k p (1 − p)^{k−1} = p Σ_{k=1}^{∞} k (1 − p)^{k−1}
= p (− d/dp) Σ_{k=0}^{∞} (1 − p)^k = p (− d/dp)(1/p) = p/p² = 1/p.
In the above computation, we used that the derivative of the sum is the
sum of the derivatives. This is OK when we have finitely many terms.
Since we have infinitely many terms, one does need a justification that
comes from facts in real analysis. We will overlook this issue...
Next we compute E[X²] by first finding
E[X(X − 1)] = Σ_{k=1}^{∞} k(k − 1) p (1 − p)^{k−1} = p(1 − p) Σ_{k=1}^{∞} k(k − 1)(1 − p)^{k−2}
= p(1 − p) (d²/dp²) Σ_{k=0}^{∞} (1 − p)^k = p(1 − p) (d²/dp²)(1/p)
= p(1 − p) (d/dp)(−1/p²) = p(1 − p)(2/p³) = 2(1 − p)/p².
Because E[X(X − 1)] = E[X2 ] − E[X] = E[X2 ] − (1/p), this proves that
E[X²] = 2(1 − p)/p² + 1/p = (2 − p)/p².
Consequently,
Var(X) = (2 − p)/p² − 1/p² = (1 − p)/p².
For a different solution, see Example (13) on page 124 of Stirzaker’s text.
As a consequence of Theorem 30.9 we have the following.
Example 31.2. Let X be a negative binomial with parameters n and p.
Then, we know that X is a sum of n independent Geometric(p) random
variables. We conclude that Var(X) = n(1 − p)/p2 . Can you do a direct
computation to verify this?
Example 31.3 (Variance of Uniform(a , b)). If X is Uniform(a , b), then E[X] = (a + b)/2 and
E[X²] = (1/(b − a)) ∫_a^b x² dx = (b² + ab + a²)/3.
In particular, Var(X) = (b − a)²/12.
Example 31.4 (Moments of N(0 , 1)). Compute E[Xn ], where X ∼ N(0 , 1)
and n > 1 is an integer:
E[X^n] = (1/√(2π)) ∫_{−∞}^{∞} x^n e^{−x²/2} dx = 0 if n is odd, by symmetry.
If n is even (or even when n is odd but we are computing E[|X|^n] instead of E[X^n]), then
E[X^n] = (2/√(2π)) ∫_0^∞ x^n e^{−x²/2} dx = √(2/π) ∫_0^∞ x^n e^{−x²/2} dx
= √(2/π) ∫_0^∞ (2z)^{n/2} e^{−z} (2z)^{−1/2} dz    (z = x²/2, i.e. x = √(2z) and dx = (2z)^{−1/2} dz)
= (2^{n/2}/√π) ∫_0^∞ z^{(n−1)/2} e^{−z} dz
= (2^{n/2}/√π) Γ(n/2 + 1/2)
= (2^{n/2}/√π) (n/2 − 1/2)(n/2 − 3/2) · · · (3/2)(1/2) Γ(1/2)    (Exercise 18.1)
= (n − 1)(n − 3) · · · (5)(3)(1).
1. Covariance
Theorem 32.1. If E[X2 ] < ∞ and E[Y 2 ] < ∞ then E[X], E[Y], and E[XY] are all
well-defined and finite.
Proof. First observe that if |X| > 1 then |X| 6 X2 and thus also |X| 6 X2 + 1.
If, on the other hand, |X| 6 1 then also |X| 6 X2 + 1. So in any case,
|X| 6 X2 + 1. This implies that E[|X|] < ∞ and by the triangle inequality
E[X] is well-defined and finite. The same reasoning goes for E[Y]. Lastly,
observe that (X + Y)2 > 0 and (X − Y)2 > 0 imply
−X2 − Y 2 6 2XY 6 X2 + Y 2
and thus |XY| 6 (X2 + Y 2 )/2 and E[XY] is well-defined and finite.
From the above theorem we see that if E[X2 ] and E[Y 2 ] are finite then
we can define the covariance between X and Y to be
Cov(X, Y) = E [(X − E[X])(Y − E[Y])] . (32.1)
Because (X − E[X])(Y − E[Y]) = XY − XE[Y] − YE[X] + E[X]E[Y], we obtain the
following, which is the computationally useful formula for covariance:
Cov(X, Y) = E[XY] − E[X]E[Y]. (32.2)
2. Correlation
The correlation between X and Y is the quantity,
ρ(X, Y) = Cov(X, Y) / √(Var(X) Var(Y)).    (32.3)
Example 32.4 (Example 21.3, continued). Note that
E[X²] = E[Y²] = 0² × 25/36 + 1² × 10/36 + 2² × 1/36 = 14/36.
Therefore, the correlation between X and Y is
ρ(X, Y) = −(1/18) / √((5/18) × (5/18)) = −1/5.
We say that X and Y are negatively correlated. But what does this mean?
The following few sections will help explain this.
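The number −1/5 can be reproduced mechanically from the joint table of Example 21.3 (a sketch using exact fractions; the helper E below is our own):

```python
from fractions import Fraction
import math

f = {(0, 0): Fraction(16, 36), (0, 1): Fraction(8, 36), (0, 2): Fraction(1, 36),
     (1, 0): Fraction(8, 36),  (1, 1): Fraction(2, 36),
     (2, 0): Fraction(1, 36)}

def E(g):
    # Expectation of g(X, Y) under the joint mass function f.
    return sum(p * g(x, y) for (x, y), p in f.items())

EX, EY = E(lambda x, y: x), E(lambda x, y: y)
cov = E(lambda x, y: x * y) - EX * EY
var_x = E(lambda x, y: x * x) - EX ** 2
var_y = E(lambda x, y: y * y) - EY ** 2
rho = float(cov) / math.sqrt(float(var_x * var_y))
print(cov, rho)   # -1/18 and -0.2
```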
Figure 32.1. Left: Karl Hermann Amandus Schwarz (Jan 25, 1843 – Nov
30, 1921, Hermsdorf, Silesia [now Jerzmanowa, Poland]). Right: Victor
Yakovlevich Bunyakovsky (Dec 16, 1804 – Dec 12, 1889, Bar, Ukraine,
Russian Empire)
which leads to
E[Y²] (E[X²]E[Y²] − (E[XY])²) ≥ 0.
Applying the above inequality to two special cases one deduces two
very useful inequalities in mathematical analysis.
Example 32.8. Let U ∼ Uniform(a, b) and let X = g(U) and Y = h(U) for
some functions g and h. Applying the above theorem we deduce that
(∫_a^b g(u)h(u) du)² ≤ (∫_a^b |g(u)|² du)(∫_a^b |h(u)|² du).
Thus, X and Y are uncorrelated. But they are not independent. Intuitively
speaking, this is clear because |X| = Y. Here is one way to logically justify
our claim:
P{X = 1 , Y = 2} = 0 ≠ 1/8 = P{X = 1}P{Y = 2}.
Theorem 32.11. Assume none of X and Y is constant (i.e. Var(X) > 0 and
Var(Y) > 0). If ρ(X, Y) = 1, then there exist constants b and a > 0 such that
P{Y = aX + b} = 1. Similarly, if ρ(X, Y) = −1, then there exist constants b and
a < 0 such that P{Y = aX + b} = 1.
Proof. Let a = Cov(X, Y)/Var(X). Note that a has the same sign as ρ(X, Y).
Recalling that ρ(X, Y) = ±1 means (Cov(X, Y))² = Var(X)Var(Y), we have
Var(Y − aX) = Var(Y) + Var(−aX) + 2Cov(−aX, Y)
= Var(Y) + a² Var(X) − 2a Cov(X, Y)
= Var(Y) − (Cov(X, Y))²/Var(X)
= 0.
By Theorem 30.4 this implies the existence of a constant b such that
P{Y − aX = b} = 1.
Homework Problems
Exercise 32.1. If X and Y are affinely dependent (i.e. there exist numbers a
and b such that Y = aX + b), show that |ρ(X, Y)| = 1.
Exercise 32.2. Show that equality occurs in the Cauchy-Bunyakovsky-Schwarz
inequality (i.e. E[XY]2 = E[X2 ]E[Y 2 ]) if and only if X and Y are linearly de-
pendent (i.e. there exists a number a such that Y = aX).
Exercise 32.3. Prove the following.
(a) For any real numbers a1 , . . . , an and b1 , . . . , bn ,
(Σ_{i=1}^{n} a_i b_i)² ≤ (Σ_{i=1}^{n} a_i²)(Σ_{i=1}^{n} b_i²).
(b) If ∫_a^b g²(x) dx and ∫_a^b h²(x) dx are finite, then so is ∫_a^b g(x)h(x) dx and furthermore
(∫_a^b g(x)h(x) dx)² ≤ (∫_a^b g²(x) dx)(∫_a^b h²(x) dx).
Figure 33.1. Left: Pafnuty Lvovich Chebyshev (May 16, 1821 – Dec 8,
1894, Kaluga, Russia). Right: Andrei Andreyevich Markov (Jun 14, 1856
– Jul 20, 1922, Ryazan, Russia)
1. Indicator functions
Let A be an event. The indicator function of A is the random variable
defined by
1I_A(x) = 1 if x ∈ A, and 0 otherwise.
It indicates whether x is in A or not!
Example 33.1. If A and B are two events, then 1IA∩B = 1IA 1IB . This is
because 1IA (x)1IB (x) equals 1 when both indicators are 1, and equals 0
otherwise. But both indicators equal 1 only when x is in both A and B, i.e.
when x ∈ A ∩ B.
Proof. The proof is simple. 1IA takes only two values: 0 and 1. Thus,
E[1IA ] = P(Ac ) × 0 + P(A) × 1 = P(A).
Proof. Let A be the event {x : h(x) ≥ λ}. Then, because h is nonnegative, h(X) ≥ λ 1I_A(X). Thus,
E[h(X)] ≥ λ E[1I_A(X)] = λ P(A) = λ P{h(X) ≥ λ}.
Divide by λ to finish.
P{|X| ≥ λ} ≤ E[|X|]/λ    (Markov's inequality)
P{|X − E[X]| ≥ λ} ≤ Var(X)/λ²,    (33.1)
P{|X − E[X]| ≥ λ} ≤ E[|X − E[X]|⁴]/λ⁴.    (33.2)
To get Markov’s inequality, apply Lemma 33.3 with h(x) = |x|. To get the
second inequality, first note that |X−E[X]| > λ if and only if |X−E[X]|2 > λ2 .
Then, apply Lemma 33.3 with h(x) = |x − E[X]|2 and with λ2 in place of λ.
The third inequality is similar: use h(x) = |x − E[X]|4 and λ4 in place of λ.
In words: if the variance of X is small, then X is unlikely to deviate much from its mean E[X].
We are now ready for the link between the intuitive understanding
of probability (relative frequency) and the mathematical one (state space,
probability of an event, random variable, expectation, etc).
Theorem 33.4 (Weak Law of Large Numbers). Suppose X1 , X2 , . . . are independent, all with the same (well defined) mean µ and with finite variance. Then, for every ε > 0,
lim_{n→∞} P{ |(X1 + · · · + Xn)/n − µ| > ε } = 0.    (33.3)
To see why this theorem is a step towards the connection with the
intuitive understanding of probability, think of the Xi ’s as being the results
of independent coin tosses: Xi = 1 if the i-th toss results in heads and
Xi = 0 otherwise. Then (X1 +· · ·+Xn )/n is precisely the relative frequency
of heads: the fraction of time we got heads, up to the n-th toss. On the
other hand, µ = E[X1 ] equals the probability of getting heads (since X1 is
really a Bernoulli random variable). Thus, the theorem says that if we toss
a coin a lot of times, the relative frequency of heads will, with very high
chance, be close to the probability the coin lands heads. If the coin is fair,
the relative frequency of heads will, with high probability, be close to 0.5.
The reason the theorem is called the weak law of large numbers is that
it does not say that the relative frequency will always converge, as n → ∞,
to the probability the coin lands heads. It only says that the odds the
relative frequency is far from the probability of getting heads (even by a
tiny, but fixed, amount) get smaller as n grows. We will later prove the
stronger version of this theorem, which then completes the link with the
intuitive understanding of a probability. But let us, for now, prove the
weak version.
E[(X1 + · · · + Xn)/n] = (1/n) E[X1 + · · · + Xn]
= (1/n) (E[X1 + · · · + Xn−1] + E[Xn])
= (1/n) (E[X1 + · · · + Xn−2] + E[Xn−1] + E[Xn])
= · · · = (1/n) (E[X1] + · · · + E[Xn])
= (1/n)(nµ) = µ.
Now, we will state and prove the stronger version of the law of large
numbers.
Theorem 33.5 (Strong Law of Large Numbers). Suppose X1 , X2 , . . . , Xn are
independent, all with the same (well defined) mean µ and finite fourth moment β₄ = E[X₁⁴] < ∞. Then,
P{ lim_{n→∞} (X1 + · · · + Xn)/n = µ } = 1.    (33.4)
This theorem implies that if we flip a fair coin a lot of times and keep
track of the relative frequency of heads, then it will converge, as the num-
ber of tosses grows, to 0.5, the probability of the coin landing heads.
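Here is a small simulation of that statement (our own sketch): the running relative frequency of heads in a long sequence of fair-coin tosses settles near 0.5.

```python
import random

random.seed(8)
heads = 0
checkpoints = (10, 100, 1_000, 10_000, 100_000)
results = {}
for n in range(1, checkpoints[-1] + 1):
    heads += random.random() < 0.5          # one fair toss
    if n in checkpoints:
        results[n] = heads / n              # relative frequency so far
print(results)  # frequencies drifting toward 0.5 as n grows
```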
There is a subtle difference between the statements of the two versions
of the law of large numbers. This has to do with the different definitions
of convergence for a sequence of random variables.
Definition 33.6. A sequence Yn of random variables is said to converge
in probability to a random variable Y if for any ε > 0 (however small) the
quantity P{|Yn − Y| > ε} converges to 0 as n → ∞.
Convergence in probability means that the probability that Yn is far
from Y by more than the fixed amount ε gets small as n gets large. In
other words, Yn is very likely to be close to Y for large n.
Definition 33.7. A sequence Yn of random variables is said to converge
almost surely to a random variable Y, if
P{ Yn → Y as n → ∞ } = 1.
Almost sure convergence means that the odds that Yn does not con-
verge to Y are nil. It is a fact that almost sure convergence implies con-
vergence in probability. We omit the proof. However, the converse is not
true, as the following example shows.
E[Xi − µ]E[(Xj − µ)3 ], and this equals 0 because E[Xi ] = µ. Also, there are
n terms of the form
E[(Xi − µ)4 ] = E[X4i ] − 3E[X3i ]µ + 3E[Xi ]µ2 + µ4 .
Observe that by the Cauchy-Schwarz inequality (Theorem 32.6),
E[X_i²] = E[1 × X_i²] ≤ √(1 × E[X_i⁴]) = β₂ < ∞
and
E[|X_i|³] = E[|X_i| × X_i²] ≤ √(E[X_i²] E[X_i⁴]) ≤ β₃ < ∞.
Thus, E[(Xi − µ)4 ] 6 β4 + 3|µ|β3 + 3|µ|3 + µ4 = γ < ∞. Similarly, we have
n(n − 1) terms of the form E[(X_i − µ)²(X_j − µ)²], with i ≠ j. Here too the Cauchy-Schwarz inequality gives
E[(X_i − µ)²(X_j − µ)²] ≤ √(E[(X_i − µ)⁴] E[(X_j − µ)⁴]) ≤ γ < ∞.
Next, observe that the sets B_N = ∪_{n≥N} A_n are decreasing; i.e. B_{N+1} ⊂ B_N. Thus, by Lemma 13.1,
P{∩_{N≥1} B_N} = lim_{N→∞} P(B_N) ≤ γ lim_{N→∞} Σ_{n≥N} 1/n^{3/2}.
Because the series with general term 1/n^{3/2} is summable, the right-most term above converges to 0 as N → ∞. Thus,
P{ ∩_{N≥1} ∪_{n≥N} { |(X1 + · · · + Xn)/n − µ| > 1/n^{1/8} } } = 0.
1. Conditioning
Say X and Y are two random variables. If they are independent, then
knowing something about Y does not say anything about X. So, for ex-
ample, if fX (x) were the pdf of X, then knowing that Y = 2 the pdf of X
is still fX (x). If, on the other hand, the two are dependent, then knowing
Y = 2 must change the pdf of X. For example, consider the case Y = |X|
and X ∼ N(0, 1). If we do not know anything about Y, then the pdf of X is
(1/√(2π)) e^{−x²/2}. However, if we know Y = 2, then X can only take the values 2
and −2 (with equal probability in this case). So knowing Y = 2 makes X a
discrete random variable with mass function f(2) = f(−2) = 1/2.
1.1. Conditional mass functions. We are given two discrete random vari-
ables X and Y with mass functions fX and fY , respectively. For all y, define
the conditional mass function of X given that Y = y as
fX|Y(x | y) = P{X = x | Y = y} = P{X = x , Y = y}/P{Y = y} = fX,Y(x , y)/fY(y),
provided that fY (y) > 0 (i.e. y is a possible value for Y).
As a function in x, fX|Y (x | y) is a probability mass function. That is:
(1) 0 ≤ fX|Y(x | y) ≤ 1;
(2) Σ_x fX|Y(x | y) = 1.
Example 34.1 (Example 21.3, continued). In this example, the joint mass
function of (X, Y), and the resulting marginal mass functions, were given
by the following:
x\y 0 1 2 fX
0 16/36 8/36 1/36 25/36
1 8/36 2/36 0 10/36
2 1/36 0 0 1/36
fY 25/36 10/36 1/36 1
Let us calculate the conditional mass function of X, given that Y = 0:
fX|Y(0 | 0) = fX,Y(0 , 0)/fY(0) = 16/25,    fX|Y(1 | 0) = fX,Y(1 , 0)/fY(0) = 8/25,
fX|Y(2 | 0) = fX,Y(2 , 0)/fY(0) = 1/25,    fX|Y(x | 0) = 0 for other values of x.
Similarly,
fX|Y(0 | 1) = 8/10,    fX|Y(1 | 1) = 2/10,    fX|Y(x | 1) = 0 for other values of x,
and
fX|Y (0 | 2) = 1, fX|Y (x | 2) = 0 for other values of x.
These conditional mass functions are really just the relative frequencies in
each column of the above table. Similarly, fY|X (y | x) would be the relative
frequencies in each row.
Observe that if we know fX|Y and fY , then fX,Y (x, y) = fX|Y (x | y)fY (y).
This is really Bayes’ formula. The upshot is that one way to describe how
two random variables interact is by giving their joint mass function, and
another way is by giving the mass function of one and then the conditional
mass function of the other (i.e. describing how the second random variable
behaves, when the value of the first variable is known).
Example 34.2. Let X ∼ Poisson(λ) and if X = x then let Y ∼ Binomial(x, p).
By the above observation, the mass function for Y is
fY(y) = Σ_x fX,Y(x, y) = Σ_x fX(x) fY|X(y | x)
= Σ_{x=y}^{∞} e^{−λ} (λ^x/x!) \binom{x}{y} p^y (1 − p)^{x−y}
= (p^y λ^y / y!) e^{−λ} Σ_{x=y}^{∞} (λ(1 − p))^{x−y} / (x − y)! = e^{−λp} (λp)^y / y!.
This shows that Y ∼ Poisson(λp). Intuitively: think of X as the number of people in line in front of you, and hand each of them a coin to flip. The coin gives heads with probability p. If it comes up heads, the person stays in line. But if it comes up tails, the person leaves the store!
Now, you still have a line of length Y in front of you. This is thus again
a Poisson random variable. Its average, though, is λp (since you had on
average λ people originally and then only a fraction of p of them stayed).
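The thinning story is easy to simulate (a sketch; poisson_sample below is a simple multiplicative sampler we wrote for the illustration, using Knuth's method):

```python
import random, math

def poisson_sample(lam):
    # Knuth's multiplication method for simulating a Poisson(lam) variable.
    L = math.exp(-lam)
    k, prod = 0, 1.0
    while True:
        prod *= random.random()
        if prod <= L:
            return k
        k += 1

random.seed(9)
lam, p, n = 6.0, 0.4, 100_000
thinned = []
for _ in range(n):
    x = poisson_sample(lam)                                   # original line length
    thinned.append(sum(1 for _ in range(x) if random.random() < p))  # survivors

mean = sum(thinned) / n
var = sum((y - mean) ** 2 for y in thinned) / n
print(mean, var)  # both close to lam * p = 2.4, as expected for a Poisson(lam * p)
```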
Example 34.4. Roll a fair die fairly n times. Let X be the number of 3’s and
Y the number of 6’s. We want to compute the conditional mass function
fX|Y (x | y). The possible values for Y are the integers from 0 to n. If we
know Y = y, for y = 0, . . . , n, then we know that the possible values for
X are the integers from 0 to n − y. If we know we got y 6’s, then the
probability of getting x 3’s is
fX|Y(x | y) = \binom{n − y}{x} (1/5)^x (4/5)^{n−y−x},
for y = 0, . . . , n and x = 0, . . . , n − y (and it is 0 otherwise). In other words,
given that Y = y, X is a Binomial(n − y, 1/5). This makes sense, doesn’t it?
(You can also compute fX|Y using the definition.) Now, the expected value
of X, given Y = y, is clear: E[X | Y = y] = (n − y)/5, for y = 0, . . . , n.
Example 34.5 (Example 26.5, continued). Last time we computed the av-
erage amount one wins by considering a long table of all possible out-
comes and their corresponding probabilities. Now, we can do things much
cleaner. If we know the outcome of the die was x (an integer between 1
and 6), we lose x dollars right away. Then, we toss a fair coin x times and
the expected amount we win at each toss is 2 × 1/2 − 1 × 1/2 = 1/2 dollars. So
after x tosses the expected amount we win is x/2. Subtracting the amount
we already lost we have that, given the die rolls an x, the expected amount
we win is x/2 − x = −x/2. The probability of the die rolling x is 1/6.
Hence, Bayes' formula gives that the expected amount we win is
E[W] = Σ_{x=1}^{6} E[W | X = x] P{X = x} = Σ_{x=1}^{6} (−x/2)(1/6) = −7/4,
as we found in the longer computation. Here, we wrote W for the amount
we win in this game and X for the outcome of the die.
1. Conditioning, continued
1.1. Conditional density functions. We are now given two continuous
random variables X and Y with density functions fX and fY , respectively.
For all y, define the conditional density function of X given that Y = y as
fX|Y(x | y) = f(x , y)/fY(y),    (35.1)
provided that fY (y) > 0.
As a function in x, fX|Y (x | y) is a probability density function. That is:
(1) fX|Y(x | y) ≥ 0;
(2) ∫_{−∞}^{∞} fX|Y(x | y) dx = 1.
Example 35.1. Let X and Y have joint density fX,Y (x, y) = e−y , 0 < x < y.
If we do not have any information about Y, the pdf of X is
fX(x) = ∫_x^∞ e^{−y} dy = e^{−x},    x > 0,
which means that X ∼ Exponential(1). But say we know that Y = y > 0.
We would like to find fX|Y (x | y). To this end, we first compute
fY(y) = ∫_0^y e^{−y} dx = y e^{−y},    y > 0.
Then,
fX|Y(x | y) = fX,Y(x, y)/fY(y) = 1/y,    0 < x < y.
This means that given Y = y > 0, X ∼ Uniform(0, y).
Example 35.2. Now, say X is a random variable with pdf fX (x) = xe−x ,
x > 0. Given X = x > 0, Y is a uniform random variable on (0, x). This
means that Y has the conditional pdf fY|X(y | x) = 1/x, 0 < y < x. Then,
fX,Y (x, y) = fX (x)fY|X (y | x) = e−x , 0 < y < x. This allows us to compute,
for example, the pdf of Y:
fY(y) = ∫_y^∞ e^{−x} dx = e^{−y},    y > 0.
So Y ∼ Exponential(1). We can also compute things like P{X + Y 6 2}.
First, we need to figure out the boundary of integration. We know that
0 < y < x and now we also have x + y 6 2. So x can go from 0 to 2
and then y can go from 0 to x or 2 − x, whichever is smaller. The switch
happens at x = 2 − x, and so at x = 1. Now we compute:
P{X + Y ≤ 2} = ∫_0^1 ∫_0^x e^{−x} dy dx + ∫_1^2 ∫_0^{2−x} e^{−x} dy dx
= ∫_0^1 x e^{−x} dx + ∫_1^2 (2 − x) e^{−x} dx
= [−(x + 1)e^{−x}]_0^1 − [2e^{−x}]_1^2 + [(x + 1)e^{−x}]_1^2 = 1 + e^{−2} − 2e^{−1}.
(To see the last equality, either use integration by parts, or the fact that this
is the second moment of an Exponential(1), which is equal to its variance
plus the square of its mean: 1 + 1² = 2.) On the other hand,
E[X] = ∫_{−∞}^{∞} E[X | Y = y] fY(y) dy = ∫_0^∞ (1 + y) e^{−y} dy = 2,
as it should be.
The proof of Bayes’ formula is similar to the discrete case. (Do it!)
2. Conditioning on events
So far, we have learned how to compute the conditional pdf and expecta-
tion of X given Y = y. But what about the same quantities, conditional on
knowing that Y 6 2, instead of a specific value for Y? This is quite simple
to answer in the discrete case. The mass function of X, given Y ∈ B, is:
fX|Y∈B(x) = P{X = x | Y ∈ B} = P{X = x, Y ∈ B}/P{Y ∈ B} = Σ_{y∈B} fX,Y(x, y) / Σ_{y∈B} fY(y).
The analogous formula in the continuous case is for the pdf of X, given
Y ∈ B:
fX|Y∈B(x) = ∫_B fX,Y(x, y) dy / ∫_B fY(y) dy.    (35.2)
Once we know the pdf (or mass function), formulas for expected val-
ues become clear:
E[X | Y ∈ B] = Σ_x Σ_{y∈B} x fX,Y(x, y) / Σ_{y∈B} fY(y),
Example 35.4. Let (X, Y) have joint density function fX,Y (x, y) = e−x , 0 <
y < x. We want to find the expected value of Y, conditioned on X 6 5.
First, we find the conditional pdf. One part we need to compute is P{X 6
5}. The pdf of X is
fX(x) = ∫_0^x e^{−x} dy = x e^{−x},    x > 0,
and, using integration by parts, we have
P{X ≤ 5} = ∫_0^5 x e^{−x} dx = 1 − 6e^{−5}.
Now, we can go ahead with computing the conditional pdf using (35.3). If
X 6 5, then also Y < 5 (since Y < X) and
fY|X≤5(y) = ∫_y^5 fX,Y(x, y) dx / (1 − 6e^{−5}) = ∫_y^5 e^{−x} dx / (1 − 6e^{−5}) = (e^{−y} − e^{−5}) / (1 − 6e^{−5}),    0 < y < 5.
(Check that this pdf integrates to 1!) Finally, using integration by parts,
we can compute:
E[Y | X ≤ 5] = ∫_{−∞}^{∞} y fY|X≤5(y) dy = ∫_0^5 y (e^{−y} − e^{−5}) dy / (1 − 6e^{−5}) = (1 − 18.5e^{−5}) / (1 − 6e^{−5}) ≈ 0.912.
Remark 35.5. We have fY(y) = ∫_y^∞ e^{−x} dx = e^{−y}, y > 0. Thus, Y ∼
Exponential(1) and E[Y] = 1. Note now that the probability that X 6 5
is 1 − 6e−5 ≈ 0.96, which is very close to 1. So knowing that X 6 5 gives
very little information. This explains why E[Y | X 6 5] is very close to E[Y].
Try to compute E[Y | X 6 1] and see how it is not that close to E[Y] any-
more. Try also to compute E[Y | X 6 10] and see how it is even closer to
E[Y] than E[Y | X 6 5].
still uniformly distributed but on the triangle {(x, y) ∈ [0, 1]2 : x + y 6 1}.
Consequently,
E[X | X + Y ≤ 1] = (∫_0^1 ∫_0^{1−x} x dy dx) / (1/2) = 2 ∫_0^1 x(1 − x) dx = 1/3.
Alternatively, let U = X + Y. Using the transformation method we find
that fX,U (x, u) = 1, 0 < x < 1 and x < u < x + 1. (Do it!) This implies
that fU(u) = ∫_0^u dx = u, for 0 < u < 1, and fU(u) = ∫_{u−1}^1 dx = 2 − u, for
1 < u < 2. (This clearly integrates to 1. Use geometry to see that, rather
than doing the (easy) computation!) We can readily see that E[U] = 1/2.
(Again, use geometry rather than the (easy) computation.) Furthermore,
fX|U (x | u) = 1/u, for 0 < x < u < 1, and fX|U (x | u) = 1/(2 − u), for
0 < u − 1 < x < 1. Thus,
E[X | U = u] = ∫_0^u x (1/u) dx = u/2 if 0 < u < 1, and
E[X | U = u] = ∫_{u−1}^1 x (1/(2 − u)) dx = (1 − (u − 1)²)/(2(2 − u)) = u/2 if 1 < u < 2.
In both cases, E[X | U = u] = u/2.
Finally, using (35.4),
E[X | X + Y ≤ 1] = E[X | U ≤ 1] = ∫_0^1 E[X | U = u] fU(u) du / P{U ≤ 1} = (∫_0^1 (u/2) u du) / (1/2) = 1/3.
If instead we wanted to use (35.3), then we first write
fX|U≤1(x) = ∫_{−∞}^{1} fX,U(x, u) du / P{U ≤ 1} = (∫_x^1 du) / (1/2) = 2(1 − x).
Then, applying (35.3),
E[X | X + Y ≤ 1] = E[X | U ≤ 1] = ∫_{−∞}^{∞} x fX|U≤1(x) dx = ∫_0^1 2x(1 − x) dx = 1/3.
The moment generating function (mgf) of a random variable X is
M(t) = E[e^{tX}] = Σ_x e^{tx} f(x), in the discrete setting, or ∫_{−∞}^{∞} e^{tx} f(x) dx, in the continuous setting,
provided that the sum (or integral) exists. This is precisely the Laplace transform of the mass function (or pdf).
Note that M(0) always equals 1 and M(t) is always nonnegative.
A related transformation is the characteristic function of a random vari-
able, given by
Φ(t) = E[e^{itX}] = Σ_x e^{itx} f(x), in the discrete setting, or ∫_{−∞}^{∞} e^{itx} f(x) dx, in the continuous setting.
While the moment generating function may be infinite at some (or even
all) nonzero values of t, the characteristic function is always defined and
finite. It is precisely the Fourier transform of the mass function (or the
pdf). In this course we will restrict attention to the moment generating
function. However, one can equally work with the characteristic function
instead, with the added advantage that it is always defined.
For example, if X ∼ Bernoulli(p), then M(t) = E[e^{tX}] = (1 − p) e^{0·t} + p e^{t} = 1 − p + pe^t.
We omit the proof. The theorem says that if we compute the mgf of
some random variable and recognize it to be the mgf of a distribution we
already knew, then that is precisely what the distribution of the random
variable is. In other words, there is only one distribution that corresponds
to any given mgf.
Example 36.5. If
M(t) = (1/2) e^t + (1/4) e^{−πt} + (1/4) e^{√2 t},
then M is the mgf of a random variable with mass function
f(x) = 1/2 if x = 1; 1/4 if x = −π or x = √2; and 0 otherwise.
The sum converges only when (1−p)et < 1, and thus when t < − ln(1−p).
For example, the mgf of a Geometric(1/2) is only defined on the interval
(−∞, ln 2). So for a Geometric(p), the mgf is
M(t) = p e^t / (1 − (1 − p) e^t),    for t < − ln(1 − p).
Similarly, for a negative binomial with parameters r and p,
M(t) = ( p e^t / (1 − (1 − p) e^t) )^r,    for t < − ln(1 − p).
If X ∼ Gamma(α , λ), then
M(t) = ∫_0^∞ e^{tx} (λ^α/Γ(α)) x^{α−1} e^{−λx} dx = (λ^α/Γ(α)) ∫_0^∞ x^{α−1} e^{−(λ−t)x} dx.
If t ≥ λ, then the integral is infinite. On the other hand, if t < λ, then
M(t) = (λ^α/Γ(α)) ∫_0^∞ (z/(λ − t))^{α−1} e^{−z} dz/(λ − t)    (z = (λ − t)x)
= (λ^α / (Γ(α) × (λ − t)^α)) ∫_0^∞ z^{α−1} e^{−z} dz = λ^α/(λ − t)^α,
since the last integral equals Γ(α). Thus,
M(t) = (λ/(λ − t))^α,    if t < λ.
In particular, for an Exponential(λ) random variable (the case α = 1),
M(t) = λ/(λ − t),    if t < λ.
= e^{µt + σ²t²/2} (1/√(2π)) ∫_{−∞}^{∞} e^{−u²/2} du    (u = (x − σ²t − µ)/σ)
= e^{µt + σ²t²/2}.
In particular, the mgf of a standard normal N(0, 1) is
M(t) = e^{t²/2}.
The same end-result holds if X is discrete with mass function f, but this
time,
M′(t) = Σ_x x e^{tx} f(x) = E[X e^{tX}].
Therefore, in any event:
M′(0) = E[X].
In general, this procedure yields M^{(n)}(t) = E[X^n e^{tX}]. Therefore,
M^{(n)}(0) = E[X^n].
standardization. That is, Zn = (Sn − E[Sn])/√(Var(Sn)). Alternatively,
Zn = (Sn − np) / √(np(1 − p)).
Recall that Sn is really the sum of n independent Bernoulli(p) random
variables and that the mean of a Bernoulli(p) is p and its variance is p(1 −
p). Thus, the question of what the asymptotic distribution of Zn looks like
is precisely what we have been asking about in this section.
We know that for all real numbers t, MSn(t) = (1 − p + pe^t)^n. We can use this to compute MZn as follows:
MZn(t) = E[ exp( t (Sn − np)/√(np(1 − p)) ) ]
= e^{−npt/√(np(1−p))} MSn( t/√(np(1 − p)) )
= e^{−t√(np/(1−p))} (1 − p + p e^{t/√(np(1−p))})^n
= ( (1 − p) e^{−t√(p/(n(1−p)))} + p e^{t√((1−p)/(np))} )^n.
If E[|X1 |3 ] < ∞, then the proof of the above theorem goes exactly the
same way as in the two examples above, i.e. through a Taylor–MacLaurin
expansion. We leave the details to the student.
A nice visualization of the Central Limit Theorem in action is done
using a Galton board. Look it up on Google and on YouTube.
One way to use this theorem is to approximately compute percentiles of the sample mean X̄ = (X1 + · · · + Xn)/n.
Example 38.4. The waiting time at a certain toll station is exponentially
distributed with an average waiting time of 30 seconds. If we use minutes
to compute things, then this average waiting time is µ = 0.5 a minute
and thus λ = 1/µ = 2. Consequently, the variance is σ2 = 1/λ2 = 1/4.
If 100 cars are in line, we know the average waiting time is 50 minutes.
This is only an estimate, however. So, for example, we may want to estimate the probability that they wait between 45 minutes and an hour in total. If Xi is
the waiting time of car number i, then we want to compute P{45 < X1 +
· · · + X100 < 60}. We can use the central limit theorem for this. The average
waiting time for the 100 cars is 50 minutes. The theorem tells us that the
distribution of
Z = (X1 + · · · + X100 − 50)/(0.5 √100)
is approximately standard normal. Thus,
P{45 < X1 + · · · + X100 < 60} = P{−5/5 < Z < 10/5} ≈ (1/√(2π)) ∫_{−1}^{2} e^{−z^2/2} dz,
which we can find using the tables for the so-called error function

ϕ(x) = (1/√(2π)) ∫_{−∞}^{x} e^{−z^2/2} dz
(which is simply the CDF of a standard normal). The tables give that
ϕ(2) ≈ 0.9772. Most tables do not give ϕ(x) for negative numbers x.
This is because symmetry implies that ϕ(−x) = 1 − ϕ(x). Thus, ϕ(−1) =
1 − ϕ(1) ≈ 1 − 0.8413 = 0.1587. Hence, the probability we are looking for
is approximately equal to 0.9772 − 0.1587 = 0.8185, i.e. about 82%.
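The same answer can be obtained directly from software; here is a minimal sketch (not part of the notes, assuming SciPy is available) using the standard normal CDF:

import numpy as np
from scipy.stats import norm

mu_total, sd_total = 50.0, 0.5 * np.sqrt(100)    # mean and sd of X1 + ... + X100
prob = norm.cdf((60 - mu_total) / sd_total) - norm.cdf((45 - mu_total) / sd_total)
print(prob)    # about 0.819, matching the table-based answer 0.8185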
Example 38.5. In the 2004 presidential elections the National Election Pool
ran an exit poll. At 7:32 PM it was reported that 1963 voters from Ohio
responded to the poll, of which 941 said they voted for President Bush
and 1022 for Senator Kerry. It is safe to assume the sampling procedure was done correctly, without any biases (e.g. nonresponse). We wonder whether this data provides significant evidence that President Bush lost the race in Ohio.
To answer this question, we assume the race resulted in a tie and com-
pute the odds that only 941 of the 1963 voters would vote for President
Bush. The de Moivre–Laplace central limit theorem tells us that
Z = (S_{1963} − 0.5 × 1963)/√(1963 × 0.5 × (1 − 0.5))
is approximately standard normal. Hence,
P{S_{1963} ≤ 941} = P{ Z ≤ (941 − 981.5)/√490.75 } ≈ ϕ(−1.8282) = 1 − ϕ(1.8282) ≈ 0.03376.
In other words, had the result been a tie, there would be a chance of only about 3.4% that no more than 941 of the 1963 respondents would have voted for President Bush.
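A sketch of the same computation in code (not part of the notes, assuming NumPy and SciPy):

import numpy as np
from scipy.stats import norm

n, k = 1963, 941
z = (k - 0.5 * n) / np.sqrt(n * 0.25)    # standardize under the "tie" hypothesis p = 0.5
print(z, norm.cdf(z))                    # about -1.83 and 0.034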
Homework Problems
Exercise 38.1. A carton contains 144 baseballs, each of which has a mean
weight of 5 ounces and a standard deviation of 2/5 ounces. (Standard
deviation is the square root of the variance.) Find an approximate value
for the probability that the total weight of the baseballs in the carton is no
more than 725 ounces.
Exercise 38.2. Let Xi ∼ Uniform(0, 1), where X1 , . . . , X20 are independent.
Find normal approximations for each of the following:
(a) P{ Σ_{i=1}^{20} X_i ≤ 12 }.
(b) The 90-th percentile of Σ_{i=1}^{20} X_i; i.e. the number a for which

P{ Σ_{i=1}^{20} X_i ≤ a } = 0.9.
Exercise 38.3. Let Xi be the weight of the i-th passenger’s luggage. As-
sume that the weights are independent, each with pdf
f(x) = 3x^2/80^3, for 0 < x < 80,

and 0 otherwise. Approximate P{ Σ_{i=1}^{100} X_i > 6025 }.
1. What next?
This course has covered some basics in probability theory. There are sev-
eral (not necessarily exclusive) directions to follow from here.
One direction is learning some statistics. For example, if you would
like to compute the average height of students at the university, one way
would be to run a census asking each student for their height. Thanks
to the law of large numbers, a more efficient way would be to collect
a sample and compute the average height in that sample. Natural ques-
tions arise: how many students should be in the sample? How to take
the sample? Is the average height in the sample a good estimate of the
average height of all university students? If it is, then how large an error
are we making? These are very important practical issues. Example 38.4
touched on this matter. The same kind of questions arise, for example,
when designing exit polls. The main question is really about estimating
parameters of the distribution of the data; e.g. the mean, the variance, etc.
This is the main topic of Statistical Inference I (Math 5080).
Another situation where statistics is helpful is, for example, when
someone claims the average student at the university is more than 6 feet
tall. How would you collect data and check this claim? Clearly, the first
step is to use a sample to estimate the average height. But that would be
just an estimate and includes an error that is due to the randomness of
the sample. So if you find in your sample an average height of 6.2 feet, is
this large enough to conclude the average height of all university students
is indeed larger than 6 feet? What if you find an average of 6.01 feet? Or 7.5 feet? Can one estimate the error due to random sampling
and thus guarantee that an average of 7.5 feet exceeds 6 feet not merely because of randomness but because the average height of all students really is more than 6 feet? In line with this, Example 38.5 shows how one can use
probability theory to check the validity of certain claims. These issues are
addressed in Statistical Inference II (Math 5090).
Another direction is learning more probability theory. Here is a se-
lected subset of topics you would learn in Math 6040. The notion of al-
gebra of events is established more seriously, a very important topic that
we touched upon very lightly and then brushed under the rug for the rest
of this course. The two main theorems are proved properly: the strong
law of large numbers (with just one finite moment, instead of four as we
did in this course) and the central limit theorem (with just two moments
instead of three). The revolutionary object “Brownian motion” is intro-
duced and explored. Markov chains may also be covered, depending on
the instructor and time. And more... We will talk a little bit about Brow-
nian motion in the next two sections. Brownian motion (and simulations)
is also explored in Stochastic Processes and Simulation I & II (Math 5040
and 5050).
1.1. History of Brownian motion, as quoted from the Wiki. The Roman
Lucretius’s scientific poem On the Nature of Things (c. 60 BC) has a re-
markable description of Brownian motion of dust particles. He uses this
as a proof of the existence of atoms:
”Observe what happens when sunbeams are admitted into a build-
ing and shed light on its shadowy places. You will see a multitude of
tiny particles mingling in a multitude of ways... their dancing is an actual
indication of underlying movements of matter that are hidden from our
sight... It originates with the atoms which move of themselves [i.e. spon-
taneously]. Then those small compound bodies that are least removed
from the impetus of the atoms are set in motion by the impact of their
invisible blows and in turn cannon against slightly larger bodies. So the
movement mounts up from the atoms and gradually emerges to the level
of our senses, so that those bodies are in motion that we see in sunbeams,
moved by blows that remain invisible.”
Although the mingling motion of dust particles is caused largely by air
currents, the glittering, tumbling motion of small dust particles is, indeed,
caused chiefly by true Brownian dynamics.
Jan Ingenhousz had described the irregular motion of coal dust par-
ticles on the surface of alcohol in 1785. Nevertheless Brownian motion
is traditionally regarded as discovered by the botanist Robert Brown in
1827. It is believed that Brown was studying pollen particles floating in
water under the microscope. He then observed minute particles within the
vacuoles of the pollen grains executing a jittery motion. By repeating the
experiment with particles of dust, he was able to rule out that the motion
was due to pollen particles being ’alive’, although the origin of the motion
was yet to be explained.
The first person to describe the mathematics behind Brownian mo-
tion was Thorvald N. Thiele in 1880 in a paper on the method of least
squares. This was followed independently by Louis Bachelier in 1900 in
his PhD thesis ”The theory of speculation”, in which he presented a sto-
chastic analysis of the stock and option markets. However, it was Albert
Einstein’s (in his 1905 paper) and Marian Smoluchowski’s (1906) indepen-
dent research of the problem that brought the solution to the attention of
physicists, and presented it as a way to indirectly confirm the existence of
atoms and molecules.
However, at first the predictions of Einstein’s formula were refuted by
a series of experiments, by Svedberg in 1906 and 1907, which gave dis-
placements of the particles as 4 to 6 times the predicted value, and by
Henri in 1908 who found displacements 3 times greater than Einstein’s
formula predicted. But Einstein’s predictions were finally confirmed in a
series of experiments carried out by Chaidesaigues in 1908 and Perrin in
1909. The confirmation of Einstein’s theory constituted empirical progress
for the kinetic theory of heat. In essence, Einstein showed that the motion
can be predicted directly from the kinetic model of thermal equilibrium.
The importance of the theory lay in the fact that it confirmed the kinetic
theory’s account of the second law of thermodynamics as being an essen-
tially statistical law.
but not too far, so that there is still a process going on and we do not just see a straight line, then a continuous curve emerges. This is the so-called
Brownian motion. This is not a trivial fact to work out mathematically and
is expressed in the following theorem. A picture explains it nicely, though;
see Figure 1.3.
Donsker’s Theorem. Let X1 , X2 , . . . denote independent, identically distributed
random variables with mean zero and variance one. The random walk is then
the random sequence Sn = X1 + · · · + Xn, and for all n large, the random graph (0, 0), (1, S1/√n), (2, S2/√n), . . . , (n, Sn/√n) (linearly interpolated in between the values), is close to the graph of Brownian motion run until time one.
Once it is shown that the polygonal graph does have a limit, Einstein’s
predicates (a)–(d) are natural ((b) being the result of the central limit the-
orem). (e) is not trivial at all and is a big part of the hard work.
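To see a picture like Figure 1.3 for yourself, one can simulate the rescaled random-walk graph; the sketch below (not part of the notes, assuming NumPy and Matplotlib are available) uses ±1 steps, which have mean zero and variance one, and rescales time to [0, 1]:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n = 10_000
steps = rng.choice([-1.0, 1.0], size=n)              # i.i.d. steps, mean 0, variance 1
S = np.concatenate(([0.0], np.cumsum(steps)))        # the random walk S_0, ..., S_n
plt.plot(np.arange(n + 1) / n, S / np.sqrt(n))       # Donsker rescaling of space and time
plt.xlabel("t")
plt.ylabel("rescaled walk")
plt.show()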
Appendix B
Solutions
Exercise 1.1
(a) {4}
(b) {0, 1, 2, 3, 4, 5, 7}
(c) {0, 1, 3, 5, 7}
(d) ∅
Exercise 1.2
(a) Let x ∈ (A ∪ B) ∪ C. Then we have the following equivalences:
x ∈ (A ∪ B) ∪ C ⇔ x ∈ A ∪ B or x ∈ C
⇔ x ∈ A or x ∈ B or x ∈ C
⇔ x ∈ A or x ∈ (B ∪ C)
⇔ x ∈ A ∪ (B ∪ C)
This proves the assertion.
(b) Let x ∈ A ∩ (B ∪ C). Then we have the following equivalences:
x ∈ A ∩ (B ∪ C) ⇔ x ∈ A and x ∈ B ∪ C
⇔ (x ∈ A and x ∈ B) or (x ∈ A and x ∈ C)
⇔ (x ∈ A ∩ B) or (x ∈ A ∩ C)
⇔ x ∈ (A ∩ B) ∪ (A ∩ C)
This proves the assertion.
(c) Let x ∈ (A ∪ B)c . Then we have the following equivalences:
x ∈ (A ∪ B)c ⇔ x 6∈ A ∪ B
⇔ (x 6∈ A and x 6∈ B)
⇔ x ∈ Ac and x ∈ Bc
⇔ x ∈ Ac ∩ Bc
This proves the assertion.
(d) Let x ∈ (A ∩ B)c . Then we have the following equivalences:
x ∈ (A ∩ B)c ⇔ x 6∈ A ∩ B
⇔ (x 6∈ A or x 6∈ B)
⇔ x ∈ Ac or x ∈ Bc
⇔ x ∈ Ac ∪ Bc
This proves the assertion.
Exercise 1.3
(a) A ∩ B ∩ Cc
(b) A ∩ Bc ∩ Cc
(c) (A ∩ B ∩ Cc ) ∪ (A ∩ Bc ∩ C) ∪ (Ac ∩ B ∩ C) ∪ (A ∩ B ∩ C) = (A ∩ B) ∪ (A ∩ C) ∪ (B ∩ C)
(d) A ∪ B ∪ C
(e) (A ∩ B ∩ Cc ) ∪ (A ∩ Bc ∩ C) ∪ (Ac ∩ B ∩ C)
(f) (A ∩ Bc ∩ Cc ) ∪ (Ac ∩ B ∩ Cc ) ∪ (Ac ∩ Bc ∩ C)
(g) (Ac ∩ Bc ∩ Cc ) ∪ (A ∩ Bc ∩ Cc ) ∪ (Ac ∩ B ∩ Cc ) ∪ (Ac ∩ Bc ∩ C)
Exercise 1.4 First of all, we can see that Ac and Bc are not disjoint (in general): any element that is in neither A nor B will be in Ac ∩ Bc . Then, A ∩ C and B ∩ C
are disjoint as (A ∩ C) ∩ (B ∩ C) = A ∩ B ∩ C = ∅ ∩ C = ∅. Finally, A ∪ C
and B ∪ C are not disjoint as they both contain the elements of C (if this
one is not empty).
Exercise 1.5 The standard sample space for this experiment is to consider
Ω = {1, 2, 3, 4, 5, 6}3 , i.e. the set of all sequences of 3 elements chosen from
the set {1, 2, 3, 4, 5, 6}. In other words,
Ω = {(1, 1, 1), (1, 1, 2), . . . , (6, 6, 6)}.
There are 6^3 = 216 elements in Ω. As Ω is finite we can choose F to be the
set of all possible subsets of Ω.
Exercise 1.6 We can choose Ω = {B, G, R} where B denotes the black chip,
G the green one and R the red one. As the set is finite and the informa-
tion complete, we can choose F to be the set of all possible subsets of Ω.
Namely,
F = {∅, {B}, {G}, {R}, {B, G}, {B, R}, {R, G}, Ω} .
Exercise 1.7 See Ash’s exercise 1.2.7.
Exercise 2.1 There are 150 people who favor the Health Care Bill, do not
approve of Obama’s performance and are not registered Democrats.
Exercise 2.2
(a) We have
(A ∩ B) \ (A ∩ C) = (A ∩ B) ∩ (A ∩ C)c = (A ∩ B) ∩ (Ac ∪ Cc )
= (A ∩ B ∩ Ac ) ∪ (A ∩ B ∩ Cc ) = A ∩ (B ∩ Cc ) = A ∩ (B \ C).
(b) We have
A \ (B ∪ C) = A ∩ (B ∪ C)c = A ∩ (Bc ∩ Cc ) = (A ∩ Bc ) ∩ Cc
= (A \ B) ∩ Cc = (A \ B) \ C.
(c) Let A = {1, 2, 3}, B = {2, 3, 4}, C = {2, 4, 6}. We have (A \ B) ∪ C =
{1, 2, 4, 6} and (A ∪ C) \ B = {1, 6}. Hence, the proposition is wrong.
Exercise 2.3 See Ash’s exercise 1.2.5.
Exercise 3.1
(a) We can choose Ω = {HH, HT , T H, T T }, where H stands for heads
and T for tails. The first letter is the outcome of the first toss and
the second letter, the outcome of the second toss.
(b) The sample space Ω is finite and, at step 3, all information is
available, so we can choose F3 = P(Ω), the set of possible subsets
of Ω.
(c) At step 2, we do not know the result of the second toss. Hence,
if an observable subset contains xH, it has to contain xT , because
we would have no way to distinguish both. Hence, we have to
choose
F2 = {∅, {HH, HT }, {T H, T T }, Ω} .
Other subsets, such as {HT } cannot be observed at this step. In-
deed, one would need to know the outcome of the second toss to
decide if {HT } happens or not. As for the sets in F2 above, you do
not need to know the second outcome to decide if they happen or
not.
(d) At step 1, we know neither of the outcomes, so we can just decide
about the probability of the trivial events and we have to pick
F1 = {∅, Ω}.
Exercise 3.2
(a) We can choose Ω to be the set of all sequences of outcomes that
are made of only tails and one head at the end. That is
Ω = {H, T H, T T H, T T T H, . . .}.
Notice that we can also consider Ω = N, where ω = n means that
the game ended at toss number n. We also would like to point out
that it is customary to add an outcome ∆ called the cemetery out-
come which corresponds to the case where the experiment never
ends.
(b) With the different sample spaces above, we can describe the events
below this way.
{Aaron wins} = {H, T T H, T T T T H, . . .}, {Bill wins} = {T H, T T T H, T T T T T H, . . .}
and
{no one wins} = ∅.
If we consider the case where Ω = N, we obtain
{Aaron wins} = {odd integers}, {Bill wins} = {even integers}
and
{no one wins} = ∅.
We notice that if you consider the cemetery outcome ∆, then {no
one wins} becomes {∆} rather than ∅. We will see later that with
a fair coin, the probability that no one wins is 0, hence modelling
this by ∅ is ok. To illustrate why ∆ could be useful, imagine
that the players play with a coin with two tails faces. Then the
only possible outcome is ∆ and this one can’t have probability 0.
Hence, modelling by ∅ would not be appropriate in this case.
Exercise 3.3 Let’s consider the case where we toss a coin twice and let
A = {we get H on the first toss} and B = {we get T on the second toss}.
Hence,
A = {HH, HT }, B = {HT , T T }, and A \ B = {HH}.
Hence,

P(A \ B) = P{HH} = 1/4,

but

P(A) − P(B) = 1/2 − 1/2 = 0.
Exercise 4.1 (a) There are 6^5 possible outcomes. (b) There are only 6^3 possible outcomes with the first and last rolls being 6. So the probability in question is 6^3/6^5.
Exercise 4.2 There are 10^3 = 1000 ways to choose a 3-digit number at
random. Now, there are 3 ways to choose the position of the single digit
larger than 5, 4 ways to choose this digit (6 to 9) and 6 · 6 ways to choose
the two other digits (0 to 5). Hence, there are 3 · 4 · 6 · 6 ways to choose
a 3-digit number with only one digit larger than 5. The probability then
becomes:
p = (3 · 4 · 6 · 6)/10^3 = 432/1000 = 43.2%.
Exercise 4.3
(a) We can apply the principles of counting and choose each symbol on the license plate in order. We obtain 26 × 26 × 26 × 10 × 10 × 10 = 17,576,000 different license plates.
(b) Similarly, we have 10^3 × 1 × 26 × 26 = 676,000 license plates with the alphabetical part starting with an A.
Exercise 4.4 Let A be the event that balls are of the same color, R, Y and G
the event that they are both red, yellow and green, respectively. Then, as
R, Y and G are disjoint,
P(A) = P(R ∪ Y ∪ G) = P(R) + P(Y) + P(G) = (3 × 5)/(24 × 18) + (8 × 7)/(24 × 18) + (13 × 6)/(24 × 18) = 149/432 ≈ 0.345.
Exercise 5.1 Each hunter has 10 choices: hunter 1 makes one of 10 choices, then hunter 2 makes one of 10 choices, etc. So overall there are 10^5 possible options. On the other hand, the number of ways to get 5 ducks shot is: 10 for hunter 1, then 9 for hunter 2, etc. So 10 × 9 × 8 × 7 × 6 ways. The answer thus is the ratio of the two numbers: (10 × 9 × 8 × 7 × 6)/10^5.
(b) We have C(6, 5) · C(48, 1) ways to choose a combination of 6 numbers that shares 5 numbers with the one played (5 numbers out of the 6 played and 1 out of the 48 not played). Hence, the probability to win the second prize is

p = C(6, 5) · C(48, 1)/C(54, 6) = (6 · 48)/25,827,165 = 288/25,827,165 ≈ 3/269,033.
Exercise 6.5
(a) There are C(50, 5) possible combinations of 5 numbers in the first list and C(9, 2) combinations of 2 numbers in the second list. That makes C(50, 5) · C(9, 2) possible results for this lottery. Only one will match the combination played. Hence, the probability to win the first prize is

p = 1/(C(50, 5) · C(9, 2)) = 1/76,275,360.
(b) Based only on the probability to win the first prize, you would
definitely choose the first one which has a larger probability of
winning.
Exercise 6.6 The number of possible poker hands is C(52, 5), as we have seen in class.
(a) We have 13 ways to choose the value for the four cards. The suits
are all taken. Then, there are 48 ways left to choose the fifth card.
Hence, the probability to get four of a kind is

P(four of a kind) = (13 · 48)/C(52, 5) ≈ 0.00024.
(b) We have 13 ways to choose the value for the three cards. Then, C(4, 3)
Exercise 6.8
(a) For the draw with replacement, there are 52^10 possible hands. If we want no two cards to have the same face value, we have 13 · 12 · · · 4 ways to pick different values and then 4^10 ways to choose the suits (4 for each card drawn). Hence, the probability becomes

p = (13 · 12 · · · 4 · 4^10)/52^10 ≈ 0.00753.

(b) In the case of the draw without replacement, we have C(52, 10) possible hands. A hand with at least 9 cards of the same suit has either 9 or 10 of them. The first case corresponds to 4 · C(13, 9) · 39 possibilities (4 possible suits, 9 cards out of this suit and 1 additional card from the 39 remaining) and the second case corresponds to 4 · C(13, 10) possibilities. Hence, the probability becomes

p = (4 · C(13, 9) · 39 + 4 · C(13, 10))/C(52, 10) ≈ 0.00000712.
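The binomial-coefficient arithmetic in Exercises 6.6 and 6.8 is easy to double-check in code; a minimal sketch (not part of the notes, assuming Python 3.8+ for math.comb):

import math

four_of_a_kind = 13 * 48 / math.comb(52, 5)
at_least_nine_same_suit = (4 * math.comb(13, 9) * 39 + 4 * math.comb(13, 10)) / math.comb(52, 10)
print(four_of_a_kind)            # about 0.00024
print(at_least_nine_same_suit)   # about 0.00000712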
(c) First of all, notice that there are two possible ways to seat men and women alternately, namely

wmwmwmwm or mwmwmwmw,

where w stands for a woman and m for a man. Then, for each of the two patterns above, we have to choose the positions of the women among themselves. There are 4! permutations. For each arrangement of the women, we need to choose the positions of the men. There are 4! permutations as well. Hence, there are 2 · 4! · 4! (= 1152) ways to seat 4 women and 4 men alternately.
(d) Similarly to (b), the 5 men form an entity that we will treat as a single person. Then, there are 4 entities (3 women and 1 group of men) to position. There are 4! ways to do it. For each of these ways, the 5 men can be seated differently on the 5 consecutive chairs they occupy. There are 5! ways to do it. Hence, there are 4! · 5! (= 2880) possible ways to seat those 8 people with the 5 men seated together.
(e) We consider that each married couple forms an entity that we will treat as a single person. There are then 4! ways to assign seats to the couples. For each of these arrangements, there are two ways to seat each person within the couple. Hence, there are 4! · 2 · 2 · 2 · 2 = 4! · 2^4 (= 384) possible ways to seat 4 couples.
Exercise 6.12
(a) There are 6 discs to store on the shelf. As they are all different,
there are 6! (= 720) ways to do it.
(b) Assume the classical discs, as well as the jazz discs, form two entities that we will each consider as a single disc. Then, there are 3 entities to store and 3! ways to do it. For each of these arrangements, the classical discs have 3! ways to be stored within their group and the jazz discs have 2 ways to be stored within theirs. Globally, there are 3! · 3! · 2 (= 72) ways to store the 6 discs respecting the styles.
(c) If only the classical discs have to be stored together, we have 4 entities (the classical group and the three other discs). We then have 4! ways to assign their positions. For each of these arrangements, we have 3! ways to store the classical discs within the group. Hence, we have 4! · 3! ways to store the discs with the classical ones together. Nevertheless, among those arrangements, some have the jazz discs together, which we do not want. Hence, we subtract from the number above the number of ways to store the discs according to the styles found in (b). Hence, there are (4! · 3!) − (3! · 3! · 2) (= 144 − 72 = 72) ways to store the discs with only the classical ones together.
Exercise 6.13
(a) The 5 letters of the word “bikes” being different, there are 5! (=
120) ways to form a word.
(b) Among the 5 letters of the word “paper”, there are two p’s. First choose their positions; we have C(5, 2) ways to do it. Then, there are 3! ways to position the other 3 different letters. Hence, we have C(5, 2) · 3! = 5!/2! = 60 possible words.
(c) First choose the positions of the e’s, then of the t’s and finally the ones of the other letters. Hence, we have C(6, 2) · C(4, 2) · 2! = 6!/(2!2!) = 180 possible words.
(d) Choose the positions of the three m’s, then the ones of the two i’s and finally the ones of the other different letters. Hence, we have C(7, 3) · C(4, 2) · 2! = 7!/(3!2!) = 420 possible words.
Exercise 7.1 In (a + b)^8 the coefficient of b^5 is C(8, 5) × a^3. Hence if a = 2 and b = 3x, the coefficient of x^5 is C(8, 5) × 2^3 × 3^5.
two spots among the four for the first-type digit. Secondly, if one of the pairs is a pair of 0’s, we have 9 · C(3, 2) possible numbers. Indeed, there are 9 ways to choose the second pair and we need to choose two spots among three for the 0’s (we cannot put a 0 up front). Finally, there are C(9, 2) · C(4, 2) + 9 · C(3, 2) = 243 4-digit numbers made of two pairs.
(e) In (a), there are 9 possible numbers, for any value of n. In (d), following the same argument as for n = 4, we notice that there are C(9, n) n-digit ordered numbers for 1 ≤ n ≤ 9, and none for n ≥ 10. In (c), for 2 ≤ n ≤ 9, we have C(9, n) · n! n-digit numbers with different digits without 0 and C(9, n − 1) · (n − 1) · (n − 1)! such numbers containing a 0. Hence, we have C(9, n) · n! + C(9, n − 1) · (n − 1) · (n − 1)! = 9 · 9!/(10 − n)! n-digit numbers with different digits. There are 9 · 9! for n = 10 and none for n > 10.
Exercise 8.4 We will use the Law of Total Probability with an infinite num-
ber of events. Indeed, for every n > 1, the events {I = n} are disjoint (we
can’t choose two different integers) and their union is Ω (one integer is
necessarily chosen). Hence, letting H denote the event that the outcome is heads, we have P(H | I = n) = e^{−n}. Then, by the Law of Total Probability, we have

P(H) = Σ_{n=1}^{∞} P(H | I = n) P{I = n} = Σ_{n=1}^{∞} e^{−n} 2^{−n} = Σ_{n=1}^{∞} (2e)^{−n} = 1/(1 − (2e)^{−1}) − 1 = 1/(2e − 1),

because Σ_{n=0}^{∞} x^n = 1/(1 − x) for |x| < 1.
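A quick numerical check of this series (not part of the notes, plain Python):

import math

partial = sum((2 * math.e) ** (-n) for n in range(1, 200))   # partial sum of (2e)^(-n)
closed = 1 / (2 * math.e - 1)                                # the closed form above
print(partial, closed)                                       # both are about 0.2254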
Exercise 8.5 See Ash’s exercise 1.6.5.
Exercise 8.6 See Ash’s exercise 1.6.6.
Exercise 8.7 Let D denote the event that a random person has the disease,
P the event that the test is positive and R the event that the person has the
rash. We want to find P(D|R). We know that
P(D) = 0.2, P(P|D) = 0.9, P(P|Dc ) = 0.3 and P(R|P) = 0.25.
Exercise 8.8 Let A denote the event “the customer has an accident within
one year” and let R denote the event “the customer is likely to have acci-
dents”.
(a) We want to find P(A). By the Law of Total Probability, we have
P(A) = P(A | R)P(R) + P(A | Rc )P(Rc ) = (0.4 × 0.3) + (0.2 × 0.7) = 0.26.
(b) We want to compute P(R | A). The definition of conditional probability leads to

P(R | A) = P(A | R)P(R)/P(A) = (0.4 × 0.3)/0.26 ≈ 0.46,
where we used the result in (a).
Exercise 8.9 Let Ri denote the event “the receiver gets an i” and Ei the
event “the transmitter sends an i” (i ∈ {0, 1}).
(a) We want to find P(R0 ). By the Law of Total Probability,
P(R0 ) = P(R0 | E0 )P(E0 )+P(R0 | E1 )P(E1 ) = (0.8×0.45)+(0.1×0.55) = 0.415,
as E0 = Ec1 .
(b) We want to compute P(E0 | R0 ). The definition of conditional
probability leads to
P(E0 | R0 ) = P(R0 | E0 )P(E0 )/P(R0 ) = (0.8 × 0.45)/0.415 ≈ 0.867,
where we used the result in (a).
Exercise 8.10 Let I,L and C be the events “the voter is independent, demo-
crat or republican”, respectively. Let V be the event “he actually voted in
the election”.
(a) By the Law of Total Probability, we have
P(V) = P(V|I)P(I) + P(V|L)P(L) + P(V|C)P(C) = 0.4862.
(b) We first compute P(I|V). By Bayes’ theorem, we have

P(I|V) = P(V|I)P(I)/P(V) = (0.35 · 0.46)/0.4862 ≈ 0.331.

Similarly, we have

P(L|V) = P(V|L)P(L)/P(V) = (0.62 · 0.30)/0.4862 ≈ 0.383,

and

P(C|V) = P(V|C)P(C)/P(V) = (0.58 · 0.24)/0.4862 ≈ 0.286.
Exercise 8.11 Let An be the event “John drives on the n-th day” and Rn be
the event “he is late on the n-th day”.
(a) Let’s compute P(An ). We have

P(An ) = P(An |An−1 )P(An−1 ) + P(An |A^c_{n−1})P(A^c_{n−1})
       = (1/2) P(An−1 ) + (1/4) (1 − P(An−1 ))
       = (1/4) P(An−1 ) + 1/4,

where the event A^c_n stands for “John takes the train on the n-th day.” Iterating this formula n − 1 times, we obtain

P(An ) = (1/4)^{n−1} P(A1 ) + Σ_{i=1}^{n−1} (1/4)^i = (1/4)^{n−1} p + (1/4) · (1 − (1/4)^{n−1})/(1 − 1/4)
       = (1/4)^{n−1} p + (1/3) (1 − (1/4)^{n−1}).
(b) By the Law of Total Probability, we have

P(Rn ) = P(Rn |An )P(An ) + P(Rn |A^c_n )P(A^c_n )
       = (1/2) P(An ) + (1/4) (1 − P(An ))
       = (1/4) P(An ) + 1/4 = P(An+1 ).

By (a), we then have

P(Rn ) = (1/4)^n p + (1/3) (1 − (1/4)^n).
(c) Let’s compute lim_{n→∞} P(An ). We know that lim_{n→∞} (1/4)^{n−1} = 0. Hence, lim_{n→∞} P(An ) = 1/3. Similarly, we have lim_{n→∞} P(Rn ) = lim_{n→∞} P(An+1 ) = 1/3.
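The recursion and its closed form are easy to check numerically; the sketch below (not part of the notes) uses an arbitrary illustrative value p = 0.9 for P(A1):

p = 0.9                                      # hypothetical starting value P(A_1)
a = p
for n in range(1, 31):
    closed = (1/4)**(n - 1) * p + (1/3) * (1 - (1/4)**(n - 1))
    assert abs(a - closed) < 1e-12           # recursion agrees with the closed form
    a = 0.25 * a + 0.25                      # P(A_{n+1}) = P(A_n)/4 + 1/4
print(a)                                     # approaches the limit 1/3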
Exercise 9.1
(a) The events “getting a spade” and “getting a heart” are disjoint but not independent.
(b) The events “getting a spade” and “getting a king” are independent (check the definition) and not disjoint: you can get the king of spades.
(c) The events “getting a king” and “getting a queen and a jack” are disjoint (obvious) and independent. As the probability of the second event is zero, this is easy to check.
(d) The events “getting a heart” and “getting a red king” are not disjoint and not independent.
Exercise 9.2 The number of ones (resp. twos) is between 0 and 6. Hence, we have the following possibilities: three ones and no two, or four ones and one two. (Other possibilities are not compatible with the experiment.) Hence, writing A for the event whose probability we want, we have

P(A) = P{three 1’s, no 2 (and three others)} + P{four 1’s, one 2 (and one other)}
     = C(6, 3) · 4^3/6^6 + C(6, 4) · C(2, 1) · 4/6^6.

Indeed, for the first probability we have to choose 3 positions among 6 for the ones, with four choices for each of the other values; for the second probability we have to choose 4 positions among 6 for the ones, one position among the two remaining for the two, and we have 4 choices for the last value. The total number of results is 6^6 (six possible values for each roll of a die).
Exercise 9.3
Ω = {(P, P, P), (P, P, F), (P, F, P), (P, F, F), (F, P, P), (F, P, F), (F, F, P), (F, F, F)}.
All outcomes are equally likely and then P{ω} = 1/8, for all ω ∈ Ω. Moreover, counting the favorable cases for each event, we see that

P(G1 ) = 4/8 = 1/2 = P(G2 ) = P(G3 ),
P(G1 ∩ G2 ) = 2/8 = 1/4 = P(G1 )P(G2 ).

Similarly, we find that P(G1 ∩ G3 ) = P(G1 )P(G3 ) and that P(G2 ∩ G3 ) = P(G2 )P(G3 ). The events G1 , G2 and G3 are pairwise independent. However,

P(G1 ∩ G2 ∩ G3 ) = 2/8 = 1/4 ≠ P(G1 ) · P(G2 ) · P(G3 ) = 1/8,

hence G1 , G2 and G3 are not independent. Actually, it is easy to see that if G1 and G2 occur, then G3 occurs as well, which explains the dependence.
Exercise 9.5 We consider that having 4 children is the result of 4 indepen-
dent trials, each one being a success (girl) with probability 0.48 or a failure
(boy) with probability 0.52. Let Ei be the event “the i-th child is a girl”.
(a) Having children all of the same gender corresponds to the event {4 successes or 0 successes}. Hence, P(“all children have the same gender”) = P(“4 successes”) + P(“0 successes”) = (0.48)^4 + (0.52)^4.
(b) The fact that the three oldest children are boys and the youngest is a girl corresponds to the event E1^c ∩ E2^c ∩ E3^c ∩ E4. Hence P(“three oldest are boys and the youngest is a girl”) = (0.52)^3 (0.48).
(c) Having exactly three boys amounts to having 1 success among the 4 trials. Hence, P(“exactly three boys”) = C(4, 3) (0.52)^3 (0.48).
(d) The two oldest are boys; the others do not matter. This amounts to having failures on the first two trials. Hence, P(“the two oldest are boys”) = (0.52)^2.
(e) Let’s first compute the probability that there is no girl. This equals the probability of no success, that is (0.52)^4. Hence, P(“at least one girl”) = 1 − P(“no girl”) = 1 − (0.52)^4.
Exercise 12.1
(a) The random variable X has a geometric distribution with parameter p, hence P{X = n} = p(1 − p)^{n−1}. Then,

Σ_{n=1}^{∞} P{X = n} = Σ_{n=1}^{∞} p(1 − p)^{n−1} = p Σ_{n=0}^{∞} (1 − p)^n = p · 1/(1 − (1 − p)) = 1,

by the standard formula for geometric series.
(b) The random variable Y has a Poisson distribution with parameter λ, hence P{Y = n} = e^{−λ} λ^n/n!. Then,

Σ_{n=0}^{∞} P{Y = n} = Σ_{n=0}^{∞} e^{−λ} λ^n/n! = e^{−λ} Σ_{n=0}^{∞} λ^n/n! = e^{−λ} e^{λ} = 1,

by the standard series expansion for the exponential function.
Exercise 12.2
(a) Let X be the r.v. counting the number of cars having an accident this day. The r.v. X has a binomial distribution with parameters n = 10,000 and p = 0.002. As p is small, n is large and np is neither too large nor too small, we can approximate X by a Poisson random variable with parameter λ = np = 20. Then, we have

P{X = 15} ≈ e^{−λ} λ^{15}/15! = e^{−20} 20^{15}/15! ≈ 5.16%.

We notice that the exact value of P{X = 15} would also be 5.16%.
(b) As above, let Y be the r.v. counting the number of gray cars having an accident this day. By a similar argument as in (a), and as one car out of 5 is gray, the r.v. Y has a binomial distribution with parameters n = 2,000 and p = 0.002. We can again approximate it by a Poisson distribution with parameter λ = np = 4. Then, we have

P{Y = 3} ≈ e^{−λ} λ^3/3! = e^{−4} 4^3/3! ≈ 19.54%.

The exact value would be P{Y = 3} = 19.55%.
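A sketch comparing the exact binomial probability with its Poisson approximation (not part of the notes, assuming SciPy is available):

from scipy.stats import binom, poisson

print(binom.pmf(15, 10_000, 0.002))   # exact binomial value in part (a)
print(poisson.pmf(15, 20))            # Poisson(20) approximation, about 0.0516
# the two agree to several decimal places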
Exercise 13.1
(a) We can easily check that F(x) is a non-decreasing function, that lim_{x→∞} F(x) = 1, that lim_{x→−∞} F(x) = 0 and that F is right-continuous. (A plot can help.) Hence, F is a cumulative distribution function.
(b) We will use the properties of CDFs to compute the probabilities. Namely, we have

P{X = 2} = F(2) − F(2−) = (1/6) · 2 + 1/3 − 1/3 = 1/3.

(c) P{X < 2} = F(2−) = lim_{x↑2} 1/3 = 1/3.
(d) As the two events are disjoint, we have

P{X = 2 or 1/2 ≤ X < 3/2} = P{X = 2} + P{1/2 ≤ X < 3/2}
                           = P{X = 2} + (F(3/2−) − F(1/2−))
                           = 1/3 + 1/3 − 1/12 = 7/12.

(e) As 2 is included in [1/2, 3], we have

P{X = 2 or 1/2 ≤ X ≤ 3} = P{1/2 ≤ X ≤ 3}
                        = F(3) − F(1/2−)
                        = 5/6 − 1/12 = 3/4.
Exercise 14.3
(a) We need to find c such that ∫_{−∞}^{+∞} f(x) dx = 1. In order for f to be a pdf, we need c > 0. Moreover,

∫_{−∞}^{+∞} f(x) dx = c ∫_0^{π/2} cos^2(x) dx = c ∫_0^{π/2} (1 + cos(2x))/2 dx = c [x/2 + sin(2x)/4]_0^{π/2} = πc/4.

Hence, c = 4/π.
(b) The cdf F of X is given by F(x) = ∫_{−∞}^{x} f(u) du. As x ≤ 0, the integral vanishes. Moreover, as x ≥ π/2, we have ∫_{−∞}^{x} f(u) du = ∫_{−∞}^{+∞} f(u) du = 1. Then, for 0 < x < π/2,

∫_{−∞}^{x} f(u) du = (4/π) ∫_0^{x} cos^2(u) du = (4/π) [u/2 + sin(2u)/4]_0^{x} = (2/π) (x + sin(2x)/2).

Finally,

F(x) = 0 if x ≤ 0,
F(x) = (2/π)(x + sin(2x)/2) if 0 < x < π/2,
F(x) = 1 if x ≥ π/2.
Exercise 14.4 For each question, we need to find the right set (union of intervals) and integrate f over it. The fact that, for a, b ≥ 0, ∫_a^b f(x) dx = (1/2)(e^{−a} − e^{−b}) is used throughout.
(a) By symmetry around 0, we have

P{|X| ≤ 2} = 2 · P{0 ≤ X ≤ 2} = 2 · (1/2)(1 − e^{−2}) = 1 − e^{−2}.

(b) We have {|X| ≤ 2 or X ≥ 0} ⇔ {X ≥ −2}. Hence,

P{|X| ≤ 2 or X ≥ 0} = ∫_{−2}^{∞} f(x) dx = ∫_{−2}^{0} (1/2) e^{x} dx + ∫_{0}^{∞} f(x) dx = (1/2)(1 − e^{−2}) + 1/2 = 1 − (1/2) e^{−2}.

(c) We have {|X| ≤ 2 or X ≤ −1} ⇔ {X ≤ 2}. Moreover, by symmetry, P{X ≤ 2} = P{X ≥ −2} = 1 − (1/2) e^{−2}, by the result in (b).
(d) The condition |X| + |X − 3| ≤ 3 corresponds to 0 ≤ X ≤ 3. Hence,

P{|X| + |X − 3| ≤ 3} = P{0 ≤ X ≤ 3} = (1/2)(1 − e^{−3}).
(e) We have X^3 − X^2 − X − 2 = (X − 2)(X^2 + X + 1), and X^2 + X + 1 > 0 always. Hence, X^3 − X^2 − X − 2 ≥ 0 if and only if X ≥ 2. Then, using the result in (c),

P{X^3 − X^2 − X − 2 ≥ 0} = P{X ≥ 2} = (1/2) e^{−2}.

(f) We have

e^{sin(πX)} ≥ 1 ⇔ sin(πX) ≥ 0 ⇔ X ∈ [2k, 2k + 1] for some k ∈ Z.

Now, by symmetry, P{−2k ≤ X ≤ −2k + 1} = P{2k − 1 ≤ X ≤ 2k}. Hence, P{X ∈ [2k, 2k + 1] for some k ∈ Z} = P{X ≥ 0} = 1/2.
Exercise 14.6
(a) In order for f to be a pdf, we need c > 0. Let’s compute ∫_{−∞}^{+∞} f(x) dx:

∫_{−∞}^{+∞} f(x) dx = c ∫_{0}^{+∞} 1/(1 + x^2) dx = c [arctan(x)]_0^{+∞} = πc/2.

Then, taking c = 2/π, f is a pdf. This is a Cauchy distribution.
and

lim_{x→−∞} F(x) = (1/2)(1 + lim_{x→−∞} x/√(1 + x^2)) = (1/2)(1 − lim_{x→+∞} x/√(1 + x^2)) = 0.

The function F is a cdf. The density function is given by f(x) = F'(x). We have

f(x) = d/dx [ (1/2)(1 + x/√(1 + x^2)) ] = (√(1 + x^2) − x · (2x/(2√(1 + x^2))))/(2(1 + x^2)) = (1 + x^2 − x^2)/(2(1 + x^2)^{3/2}) = 1/(2(1 + x^2)^{3/2}),

for all x ∈ R.
Exercise 17.1 We use the formula developed in class. We have n = 10,000, a = 7,940, b = 8,080, p = 0.8. Hence, np = 8,000, np(1 − p) = 1,600 and √(np(1 − p)) = 40. Now,

P{7940 ≤ X ≤ 8080} = Φ((8,080 − 8,000)/40) − Φ((7,940 − 8,000)/40) = Φ(2) − Φ(−1.5)
                   = Φ(2) − 1 + Φ(1.5) = 0.9772 + 0.9332 − 1 = 0.9104.

Hence, there is a 91.04% probability of finding between 7,940 and 8,080 successes.
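A numerical check of this normal approximation against the exact binomial value (not part of the notes, assuming SciPy is available):

from scipy.stats import binom, norm

approx = norm.cdf(2) - norm.cdf(-1.5)                                # the solution's approximation
exact = binom.cdf(8080, 10_000, 0.8) - binom.cdf(7939, 10_000, 0.8)  # exact P{7940 <= X <= 8080}
print(approx, exact)                                                 # approx 0.9104, close to the exact value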
Exercise 19.4
(a) We have Y = g(X), with g : (0, ∞) → R given by g(x) = log x. The
function g is one-to-one. Its inverse is the solution to y = log x
which is x = e^y. Hence:

fY (y) = fX (e^y) |(e^y)'|.

As X is an exponential r.v. with parameter λ, fX (x) = λe^{−λx} for x > 0 and fX (x) = 0 otherwise. Hence, for y ∈ R,

fY (y) = λ e^{−λe^y} |e^y| = λ e^{y − λe^y}.

(b) We have h(Z) = X, with h : (−π/2, π/2) → R given by h(z) = z +
Hence,

fY (y) = 1/(2y^{3/2}) if y ∈ [1, 4),
fY (y) = 2/y^2 if y ≥ 4,
fY (y) = 0 otherwise.
Exercise 19.6
(a) We have Y = g(X), with g : R → R given by g(x) = x^2. The function g is not one-to-one from R into R. First find the cumulative distribution function FY from the definition. For y ≤ 0, FY (y) = P{Y ≤ y} = P{X^2 ≤ y} = 0. For y > 0,

FY (y) = P{Y ≤ y} = P{X^2 ≤ y} = P{−√y ≤ X ≤ √y} = F(√y) − F(−√y).

Differentiating with respect to y gives, for y > 0,

fY (y) = (1/(2√y)) · (1/√(2π)) (e^{−y/2} + e^{−y/2}) = (1/√(2π)) y^{−1/2} e^{−y/2},

which corresponds to a Gamma density function with parameters α = 1/2 and λ = 1/2. Indeed, y^{−1/2} e^{−y/2} appears in the Gamma density and the constant is necessarily the right one, determined by the property ∫_0^{+∞} fY (y) dy = 1. In particular, Γ(1/2) = √π.
Exercise 19.7 Let g : [0, π/2] → [0, v_0^2/g] be defined by g(θ) = (v_0^2/g) sin(2θ). The function g is not one-to-one, as every possible sine value can be obtained from two different angles.

= (2g/π) · 1/√(v_0^4 − g^2 r^2).

Finally,

fR (r) = (2g/π)/√(v_0^4 − g^2 r^2) if 0 ≤ r ≤ v_0^2/g,
fR (r) = 0 otherwise.
Exercise 21.1
(a) No, X and Y are not independent. For instance, we have fX (1) = 0.4 + 0.3 = 0.7 and fY (2) = 0.3 + 0.1 = 0.4. Hence, fX (1) fY (2) = 0.7 · 0.4 = 0.28 ≠ 0.3 = f(1, 2).
(b) We have

P(XY ≤ 2) = 1 − P(XY > 2) = 1 − P(X = 2, Y = 2) = 1 − 0.1 = 0.9.
Exercise 21.2
(a) The set of possible values for X1 and X2 is {1, . . . , 6}. By definition, we always have X1 ≤ X2. We have to compute f(x1, x2) = P{X1 = x1, X2 = x2}. If x1 = x2, both outcomes have to be the same, equal to x1. There is only one possible roll for this, namely (x1, x1), and f(x1, x2) = 1/36. If x1 < x2, one die has to be x1, the other one x2. There are two possible rolls for this to happen, namely (x1, x2) and (x2, x1). We obtain f(x1, x2) = 1/18. Then, for x1, x2 ∈ {1, 2, 3, 4, 5, 6},

f(x1, x2) = 1/36 if x1 = x2,
f(x1, x2) = 1/18 if x1 < x2,
f(x1, x2) = 0 otherwise.
(b) In order to find the density of X1, we have to add all the probabilities for which X1 takes a precise value (i.e. fX1(x1) = Σ_{i=1}^{6} f(x1, i)). The following table sums up the results (as in the example in class).

x1\x2      1      2      3      4      5      6   | fX1(x1)
  1      1/36   1/18   1/18   1/18   1/18   1/18  | 11/36
  2       0     1/36   1/18   1/18   1/18   1/18  |  9/36
  3       0      0     1/36   1/18   1/18   1/18  |  7/36
  4       0      0      0     1/36   1/18   1/18  |  5/36
  5       0      0      0      0     1/36   1/18  |  3/36
  6       0      0      0      0      0     1/36  |  1/36
fX2(x2)  1/36   3/36   5/36   7/36   9/36  11/36  |
(c) They are not independent. Namely, f(x1, x2) ≠ fX1(x1) fX2(x2). For instance, f(6, 1) = 0 ≠ 1/36^2 = fX1(6) fX2(1).
Exercise 21.3
(a) The set of possible values for X1 is {4, 5, 6, 7, 8} and the set of pos-
sible values for X2 is {4, 6, 8, 9, 12, 16}. We can see that the values
fX2(x2) = 1/9, 2/9, 2/9, 1/9, 2/9, 1/9 for x2 = 4, 6, 8, 9, 12, 16, respectively.
(c) They are not independent. Namely, f(x1, x2) ≠ fX1(x1) fX2(x2). For instance, f(5, 4) = 0 ≠ 2/81 = fX1(5) fX2(4).
Exercise 21.4 By Theorem 21.1 the transformation is

G(u) = 0 if u < 0,
G(u) = √(3u) if 0 ≤ u ≤ 1/3,
G(u) = 2 if 1/3 < u ≤ 2/3,
G(u) = 6u − 2 if 2/3 < u < 1,
G(u) = 4 if u ≥ 1.
Note that technically Theorem 21.1 asks for the CDF to be strictly in-
creasing on (−∞, ∞). Here, the CDF of U is constant on (−∞, 0) and
then on (1, ∞). However, since U never takes values in these intervals, we
should still be able to apply the theorem. Check the proof of the theorem
and make sure you understand why it still works.
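A sketch of this transformation in code (not part of the notes, assuming NumPy), sampling G(U) for uniform U and checking two probabilities empirically:

import numpy as np

def G(u):
    # the transformation of Exercise 21.4, for 0 < u < 1
    if u <= 1/3:
        return np.sqrt(3 * u)
    elif u <= 2/3:
        return 2.0
    else:
        return 6 * u - 2

rng = np.random.default_rng(1)
samples = np.array([G(u) for u in rng.uniform(size=100_000)])
print((samples == 2.0).mean())   # the atom at 2 should have mass about 1/3
print((samples <= 1.0).mean())   # P{X <= 1} should also be about 1/3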
Exercise 23.1 We can see that the distribution of (X, Y) is uniform on the
square [−1, 1]2 . Hence, we can use a ratio of surfaces to compute the
probabilities. (In most cases a drawing of the domain can help.)
(a) We have P{X + Y ≤ 1/2} = P{Y ≤ 1/2 − X} = 1 − P{Y > 1/2 − X}. Now the surface corresponding to {Y > 1/2 − X} is a triangle and we have

P{X + Y ≤ 1/2} = 1 − ((1/2) · (3/2)^2)/4 = 23/32.

(b) The domain corresponding to {X − Y ≤ 1/2} has exactly the same
Exercise 24.1
(a) We must choose c such that

∫_{−∞}^{+∞} ∫_{−∞}^{+∞} f(x, y) dx dy = 1.

But,

∫_{−∞}^{+∞} ∫_{−∞}^{+∞} f(x, y) dx dy = c ∫_0^1 ∫_0^1 (x + y) dx dy = c ∫_0^1 [x^2/2 + xy]_{x=0}^{x=1} dy
  = c ∫_0^1 (1/2 + y) dy = c [y/2 + y^2/2]_0^1 = c (1/2 + 1/2) = c.

Hence, c = 1.
(b) Observe that

P{X < Y} = ∫∫_{{(x,y) : x<y}} f(x, y) dx dy = ∫_0^1 ∫_0^y (x + y) dx dy
  = ∫_0^1 [x^2/2 + xy]_{x=0}^{x=y} dy = ∫_0^1 (3/2) y^2 dy = (3/2) [y^3/3]_0^1 = 1/2.

(c) For x ∉ [0, 1], fX (x) = 0. For x ∈ [0, 1],

fX (x) = ∫_{−∞}^{+∞} f(x, y) dy = ∫_0^1 (x + y) dy = [xy + y^2/2]_0^1 = 1/2 + x.

By symmetry, fY (y) = fX (y) for all y ∈ R.
(d) We can write P{X = Y} as

P{X = Y} = ∫∫_{{(x,y) : x=y}} f(x, y) dx dy = ∫_{−∞}^{∞} ∫_y^y f(x, y) dx dy = 0,

since the inner integral runs over an interval of length zero.
(b) We have

P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
 = P{X ≤ 1/2} + P{Y ≤ 1/2} − P{X ≤ 1/2, Y ≤ 1/2}
 = ∫_0^{1/2} (6x^2 − 4x^3) dx + ∫_0^{1/2} 2y dy − ∫_0^{1/2} dx ( ∫_0^x 4xy dy + ∫_x^{1/2} 6x^2 dy )
 = [2x^3 − x^4]_0^{1/2} + [y^2]_0^{1/2} − ∫_0^{1/2} (3x^2 − 4x^3) dx
 = 1/4 − 1/16 + 1/4 − [x^3 − x^4]_0^{1/2}
 = 7/16 − (1/8 − 1/16) = 6/16 = 3/8.
Exercise 24.3 First of all, fX (x) = 0 if x < 0. Now, for x > 0,

fX (x) = ∫_R f(x, y) dy = 2 ∫_0^x e^{−(x+y)} dy = 2e^{−x} [−e^{−y}]_0^x = 2e^{−x} (1 − e^{−x}).
But when z > 1, the parabola intersects the square at its top side at x = 1/√z and

P{Y/X^2 ≤ z} = P{Y ≤ zX^2} = ∫_0^{1/√z} zx^2 dx + ∫_{1/√z}^{1} 1 dx = z/(3(√z)^3) + 1 − 1/√z = 1 − 2/(3√z).

Therefore, the pdf is 1/3 when z ∈ (0, 1] and 1/(3z^{3/2}) when z > 1.
Alternatively, one can use the pdf method: let W = X and Z = Y/X^2. Then X = W and Y = ZW^2. The Jacobian matrix is

( 1    0   )
( 2zw  w^2 )

and its determinant is w^2. So

fZ,W (z, w) = 1 × w^2 = w^2.

The crucial thing though is the domain! The above formula is valid if 0 < x < 1 and 0 < y < 1, which becomes 0 < w < 1 and 0 < zw^2 < 1. The pdf is 0 otherwise. So if 0 < z < 1 then 0 < w < 1, while if z > 1 we have 0 < w < 1/√z. Finally, the pdf of Z is

fZ (z) = ∫_0^1 w^2 dw = 1/3 if 0 < z < 1

and

fZ (z) = ∫_0^{1/√z} w^2 dw = 1/(3z^{3/2}) if z > 1.
Exercise 25.2 First of all, by independence,

fX,Y (x, y) = fX (x)fY (y) = e^{−(x+y)} if x ≥ 0, y ≥ 0, and 0 otherwise.

(a) We will use the transformation U = X, Z = X + Y. This transformation is bijective with inverse given by X = U, Y = Z − U. The Jacobian of this transformation is given by

J(u, z) = det [ ∂x/∂u  ∂x/∂z ; ∂y/∂u  ∂y/∂z ] = det [ 1  0 ; −1  1 ] = 1.

Now,

fU,Z (u, z) = fX,Y (x(u, z), y(u, z)) |J(u, z)| = e^{−(u+(z−u))} = e^{−z},

for u ≥ 0, z ≥ 0 and u ≤ z. The latter condition comes from y ≥ 0. Hence,

fU,Z (u, z) = e^{−z} if u ≥ 0, z ≥ 0 and u ≤ z, and 0 otherwise.

Finally, fZ (z) = 0 if z < 0 and, for z ≥ 0,

fZ (z) = ∫_R fU,Z (u, z) du = ∫_0^z e^{−z} du = z e^{−z}.

(b) Similarly as above, we will consider V = X, W = Y/X. This transformation is bijective with inverse X = V, Y = VW. The Jacobian of this transformation is given by

J(v, w) = det [ ∂x/∂v  ∂x/∂w ; ∂y/∂v  ∂y/∂w ] = det [ 1  0 ; w  v ] = v.

Now,

fV,W (v, w) = fX,Y (x(v, w), y(v, w)) |J(v, w)| = e^{−(v+vw)} · v = v e^{−(1+w)v},

for v ≥ 0, w ≥ 0. Hence,

fV,W (v, w) = v e^{−(1+w)v} if v ≥ 0, w ≥ 0, and 0 otherwise.

Finally, fW (w) = 0 if w < 0 and, for w ≥ 0,

fW (w) = ∫_R fV,W (v, w) dv = ∫_0^∞ v e^{−(1+w)v} dv = 1/(1 + w)^2,
where we used the properties of Gamma integrals on p.73 of the
Lecture Notes.
Exercise 25.3 First of all, by independence,

fX,Y (x, y) = fX (x)fY (y) = (1/(2π)) e^{−(x^2+y^2)/2}.

We will consider U = X, Z = Y/X. This transformation is bijective with inverse X = U, Y = UZ. The Jacobian of this transformation is given by

J(u, z) = det [ ∂x/∂u  ∂x/∂z ; ∂y/∂u  ∂y/∂z ] = det [ 1  0 ; z  u ] = u.

Now,

fU,Z (u, z) = fX,Y (x(u, z), y(u, z)) |J(u, z)| = (1/(2π)) e^{−(u^2+u^2 z^2)/2} · |u| = (1/(2π)) |u| e^{−(1+z^2) u^2/2},
Now,

fR,Θ (r, θ) = fX,Y (x(r, θ), y(r, θ)) |J(r, θ)| = (1/(2πσ^2)) e^{−(r^2 cos^2(θ) + r^2 sin^2(θ))/(2σ^2)} · |r| = (1/(2πσ^2)) r e^{−r^2/(2σ^2)},

for r ≥ 0 and 0 ≤ θ < 2π. We can immediately conclude that R and Θ are independent. Namely, it is easy to see that we can write fR,Θ (r, θ) = g(r)h(θ) for suitable functions g and h. Finally, a direct integration shows that

fR (r) = (r/σ^2) e^{−r^2/(2σ^2)} if r ≥ 0, and 0 otherwise,

and

fΘ (θ) = 1/(2π) if 0 ≤ θ < 2π, and 0 otherwise.
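A simulation sketch (not part of the notes, assuming NumPy) illustrating that R and Θ behave as claimed:

import numpy as np

rng = np.random.default_rng(2)
sigma = 1.5                                       # illustrative value
x = sigma * rng.standard_normal(200_000)
y = sigma * rng.standard_normal(200_000)
r, theta = np.hypot(x, y), np.arctan2(y, x) % (2 * np.pi)
print(np.corrcoef(r, theta)[0, 1])                # near 0, as independence requires
print(theta.mean() / np.pi)                       # near 1: uniform on [0, 2*pi) has mean pi
print(r.mean() / (sigma * np.sqrt(np.pi / 2)))    # near 1: the density of R has mean sigma*sqrt(pi/2)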
Exercise 25.11 See Ash’s exercise 2.8.16.
x            −1      40
P{Hm = x}   51/52   1/52     if m ∈ {0, 00},

where pm = P{ω = m}. Hence, E[Hm ] = m pm + (−1)(1 − pm ).
The numerical results are presented in the following table:

m        1       2       5       10      20      0       00
E[Hm ]  −8/52   −7/52   −10/52  −8/52   −10/52  −11/52  −11/52
Hence, betting on “0” or “00” gives the worst expectation and betting on “2” gives the best. We notice that the expected values are all negative and, hence, this game is always in favor of the organiser.
Exercise 27.3 We know that for a Geometric random variable, f(k) = P{X = k} = p(1 − p)^{k−1} for k ≥ 1. Hence, we have

E[X] = Σ_{k=1}^{∞} k p(1 − p)^{k−1} = p Σ_{k=1}^{∞} k q^{k−1},
E[X] = ∫_0^∞ x λ e^{−λx} dx
     = [−x e^{−λx}]_0^∞ + ∫_0^∞ e^{−λx} dx
     = [−e^{−λx}/λ]_0^∞
     = 1/λ.

Then, again using integration by parts and the results above, we have

E[X^2] = ∫_0^∞ x^2 λ e^{−λx} dx
       = [−x^2 e^{−λx}]_0^∞ + 2 ∫_0^∞ x e^{−λx} dx
       = 2/λ^2.
by the symmetry of the function x ↦ x^n e^{−x^2/2} (the function is odd). Moreover, if n = 2, we have seen in class that E[X^2] = 1. Let’s prove the result by induction. Assume the result is true for all even numbers up to n − 2 and let’s compute E[X^n]. Using an integration by parts, we have

E[X^n] = (1/√(2π)) ∫_{−∞}^{∞} x^n e^{−x^2/2} dx = (1/√(2π)) ∫_{−∞}^{∞} x^{n−1} · x e^{−x^2/2} dx
       = (1/√(2π)) [ −x^{n−1} e^{−x^2/2} ]_{−∞}^{∞} + ((n − 1)/√(2π)) ∫_{−∞}^{∞} x^{n−2} e^{−x^2/2} dx
       = (n − 1) E[X^{n−2}] = (n − 1) · (n − 3)(n − 5) · · · 1.
Exercise 30.3 This corresponds to Examples 29.7 on page 144 and 30.8 on
page 149 in the Lecture Notes.
Exercise 30.4 This corresponds to Example 31.1 on page 153 in the Lecture
Notes.
Exercise 30.5 First of all, let’s notice that Y^2 + Z^2 = cos^2(X) + sin^2(X) = 1 and that YZ = cos(X) sin(X) = sin(2X)/2. Hence, we have

E[YZ] = (1/2) E[sin(2X)] = (1/(4π)) ∫_0^{2π} sin(2x) dx = 0.

Moreover,

E[Y] = E[cos(X)] = (1/(2π)) ∫_0^{2π} cos(x) dx = 0.

Similarly, E[Z] = 0 and E[YZ] = E[Y]E[Z]. Then, as E[Y] = 0,

Var(Y) = E[Y^2] = E[cos^2(X)] = (1/(2π)) ∫_0^{2π} cos^2(x) dx = (1/(4π)) [x + sin(2x)/2]_0^{2π} = 1/2,

and

P{Y > 1/2, Z > 1/2} = P{π/6 < X < π/3} = 1/12 ≠ 1/9 = P{Y > 1/2} P{Z > 1/2},
which proves that Y and Z are not independent.
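A simulation sketch of this example (not part of the notes, assuming NumPy):

import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 2 * np.pi, size=500_000)
y, z = np.cos(x), np.sin(x)
print(np.mean(y * z))                                   # near 0: Y and Z are uncorrelated
print(np.mean((y > 0.5) & (z > 0.5)), 1/12, 1/9)        # joint probability near 1/12, not 1/9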
Exercise 30.6 See Ash’s exercise 3.2.8.
Exercise 30.7 This corresponds to Theorem 30.3 on page 148 in the Lecture
Notes.
= Σ_{i=1}^{n+1} Var(Xi ) + 2 Σ_{i=1}^{n} Σ_{j=1}^{i−1} Cov(Xi , Xj ) + 2 Σ_{j=1}^{n} Cov(Xn+1 , Xj )
= Σ_{i=1}^{n+1} Var(Xi ) + 2 Σ_{i=1}^{n+1} Σ_{j=1}^{i−1} Cov(Xi , Xj ).
Exercise 33.1 See Ash’s exercise 3.5.2. (Go to the link on Ash’s website
saying Solutions to Problems Not Solved in the Text.)
Exercise 33.2 Consider an element x. If it does not belong to any Ai ,
then all of the indicator functions in the formula take the value 0 and the
formula says 0 = 0, which is true.
If x does belong to at least one Ai , then consider all sets Ai to which
x belongs. There is no loss in generality when assuming these sets are
A1 , · · · , Ar for some r > 1. (Otherwise, simply rename the sets!) The left-
hand side of the formula is 1. So we need to show that the right-hand side
is also 1.
The indicator functions on the right-hand side take the value 0 unless all the indices j1, . . . , ji are among {1, . . . , r}, in which case the indicator function takes the value 1. Moreover, for a given i ≤ r the number of possible choices of distinct integers j1, . . . , ji from {1, . . . , r} is C(r, i). Hence,
Note: Some of the solutions are not in Ash’s book itself, but in a pdf
file on his site, under a link that says Solutions to Problems Not Solved in
the Text.