A pair of random variables
◮ Let X, Y be random variables on the same probability
space (Ω, F, P )
◮ Each of X, Y maps Ω to ℜ.
◮ We can think of the pair of random variables as a vector-valued function that maps Ω to ℜ2 .
(Figure: the pair (X, Y ), viewed as a vector-valued map from the sample space Ω to ℜ2 .)
P S Sastry, IISc, E1 222 Aug 2021 1/248
◮ Just as in the case of a single rv, we can think of the
induced probability space for the case of a pair of rv’s too.
◮ That is, by defining the pair of random variables, we
essentially create a new probability space with sample
space being ℜ2 .
◮ The events now would be the Borel subsets of ℜ2 .
◮ Recall that ℜ2 is cartesian product of ℜ with itself.
◮ So, we can create Borel subsets of ℜ2 by cartesian
product of Borel subsets of ℜ.
B 2 = σ ({B1 × B2 : B1 , B2 ∈ B})
where B is the Borel σ-algebra we considered earlier, and
B 2 is the set of Borel sets of ℜ2 .
P S Sastry, IISc, E1 222 Aug 2021 2/248
◮ Recall that B is the smallest σ-algebra containing all
intervals.
◮ Let I1 , I2 ⊂ ℜ be intervals. Then I1 × I2 ⊂ ℜ2 is known
as a cylindrical set.
(Figure: the rectangle [a, b] × [c, d] in the plane.)
◮ B 2 is the smallest σ-algebra containing all cylindrical sets.
◮ We saw that B is also the smallest σ-algebra containing
all intervals of the form (−∞, x].
◮ Similarly B 2 is the smallest σ-algebra containing
cylindrical sets of the form (−∞, x] × (−∞, y].
P S Sastry, IISc, E1 222 Aug 2021 3/248
◮ Let X, Y be random variables on the probability space
(Ω, F, P )
◮ This gives rise to a new probability space (ℜ2 , B 2 , PXY )
with PXY given by
PXY (B) = P [(X, Y ) ∈ B], ∀B ∈ B 2
= P ({ω : (X(ω), Y (ω)) ∈ B})
(Here, B ⊂ ℜ2 )
◮ Recall that for a single rv, the resulting probability space
is (ℜ, B, PX ) with
PX (B) = P [X ∈ B] = P ({ω : X(ω) ∈ B})
(Here, B ⊂ ℜ)
P S Sastry, IISc, E1 222 Aug 2021 4/248
◮ In the case of a single rv, we define a distribution function, FX , which essentially assigns probability to all intervals of the form (−∞, x].
◮ This FX uniquely determines PX (B) for all Borel sets, B.
◮ In a similar manner we define a joint distribution function FXY for a pair of random variables.
◮ FXY (x, y) would be PXY ((−∞, x] × (−∞, y]).
◮ FXY fixes the probability of all cylindrical sets of the form
(−∞, x] × (−∞, y] and hence uniquely determines the
probability of all Borel sets of ℜ2 .
P S Sastry, IISc, E1 222 Aug 2021 5/248
Joint distribution of a pair of random variables
◮ Let X, Y be random variables on the same probability
space (Ω, F, P )
◮ The joint distribution function of X, Y is FXY : ℜ2 → ℜ,
defined by
FXY (x, y) = P [X ≤ x, Y ≤ y]
= P ({ω : X(ω) ≤ x} ∩ {ω : Y (ω) ≤ y})
◮ The joint distribution function is the probability of the
intersection of the events [X ≤ x] and [Y ≤ y].
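As an aside (not part of the original slides), here is a minimal Python sketch of what FXY means operationally: estimate P [X ≤ x, Y ≤ y] by simulation. The particular pair (X, Y ), the sample size and the seed are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# An arbitrary example pair on the same sample space: U ~ U(0,1), X = U, Y = U**2.
n = 100_000
u = rng.uniform(size=n)
x_samples, y_samples = u, u**2

def joint_df(x, y):
    """Empirical estimate of F_XY(x, y) = P[X <= x, Y <= y]."""
    return np.mean((x_samples <= x) & (y_samples <= y))

print(joint_df(0.5, 0.3))   # estimate of P[X <= 0.5, Y <= 0.3]
```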
P S Sastry, IISc, E1 222 Aug 2021 6/248
Properties of Joint Distribution Function
◮ Joint distribution function:
FXY (x, y) = P [X ≤ x, Y ≤ y]
◮ FXY (−∞, y) = FXY (x, −∞) = 0, ∀x, y;
FXY (∞, ∞) = 1
(These are actually limits: limx→−∞ FXY (x, y) = 0, ∀y)
◮ FXY is non-decreasing in each of its arguments
◮ FXY is right continuous and has left-hand limits in each
of its arguments
◮ These are straight-forward extensions of single rv case
◮ But there is another crucial property satisfied by FXY .
P S Sastry, IISc, E1 222 Aug 2021 7/248
◮ Recall that, for the case of a single rv, given x1 < x2 , we
have
P [x1 < X ≤ x2 ] = FX (x2 ) − FX (x1 )
◮ The LHS above is a probability.
Hence the RHS should be non-negative
The RHS is non-negative because FX is non-decreasing.
◮ We will now derive a similar expression in the case of two
random variables.
◮ Here, the probability we want is that of the pair of rv’s
being in a cylindrical set.
P S Sastry, IISc, E1 222 Aug 2021 8/248
◮ Let x1 < x2 and y1 < y2 . We want
P [x1 < X ≤ x2 , y1 < Y ≤ y2 ].
◮ Consider the Borel set B = (−∞, x2 ] × (−∞, y2 ].
(Figure: the region B = (−∞, x2 ] × (−∞, y2 ] decomposed into the rectangle B1 and the regions B2 , B3 .)
B ≜ (−∞, x2 ] × (−∞, y2 ] = B1 + (B2 ∪ B3 )
B1 = (x1 , x2 ] × (y1 , y2 ]
B2 = (−∞, x2 ] × (−∞, y1 ]
B3 = (−∞, x1 ] × (−∞, y2 ]
B2 ∩ B3 = (−∞, x1 ] × (−∞, y1 ]
P S Sastry, IISc, E1 222 Aug 2021 9/248
FXY (x2 , y2 ) = P [X ≤ x2 , Y ≤ y2 ] = P [(X, Y ) ∈ B]
= P [(X, Y ) ∈ B1 + (B2 ∪ B3 )]
= P [(X, Y ) ∈ B1 ] + P [(X, Y ) ∈ (B2 ∪ B3 )]
P [(X, Y ) ∈ B2 ] = P [X ≤ x2 , Y ≤ y1 ] = FXY (x2 , y1 )
P [(X, Y ) ∈ B3 ] = P [X ≤ x1 , Y ≤ y2 ] = FXY (x1 , y2 )
P [(X, Y ) ∈ B2 ∩ B3 ] = P [X ≤ x1 , Y ≤ y1 ] = FXY (x1 , y1 )
P [(X, Y ) ∈ B1 ] = FXY (x2 , y2 ) − P [(X, Y ) ∈ (B2 ∪ B3 )]
= FXY (x2 , y2 ) − FXY (x2 , y1 ) − FXY (x1 , y2 ) + FXY (x1 , y1 )
P S Sastry, IISc, E1 222 Aug 2021 10/248
◮ What we showed is the following.
◮ For x1 < x2 and y1 < y2
P [x1 < X ≤ x2 , y1 < Y ≤ y2 ] = FXY (x2 , y2 ) − FXY (x2 , y1 )
−FXY (x1 , y2 ) + FXY (x1 , y1 )
◮ This means FXY should satisfy
FXY (x2 , y2 )−FXY (x2 , y1 )−FXY (x1 , y2 )+FXY (x1 , y1 ) ≥ 0
for all x1 < x2 and y1 < y2
◮ This is an additional condition that a function has to
satisfy to be the joint distribution function of a pair of
random variables
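As a quick numerical illustration (not in the original slides), the rectangle identity can be checked for independent U (0, 1) random variables, for which FXY (x, y) = xy on the unit square; the chosen corner points are arbitrary.

```python
import numpy as np

# For independent U(0,1) rv's, F_XY(x, y) = x*y on [0,1]^2.
def F(x, y):
    return min(max(x, 0.0), 1.0) * min(max(y, 0.0), 1.0)

x1, x2, y1, y2 = 0.2, 0.7, 0.1, 0.6
rect = F(x2, y2) - F(x2, y1) - F(x1, y2) + F(x1, y1)
print(rect)                      # 0.25, and it is non-negative
print((x2 - x1) * (y2 - y1))     # P[x1 < X <= x2, y1 < Y <= y2] = 0.25
```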
P S Sastry, IISc, E1 222 Aug 2021 11/248
Properties of Joint Distribution Function
◮ Joint distribution function: FXY : ℜ2 → ℜ
FXY (x, y) = P [X ≤ x, Y ≤ y]
◮ It satisfies
1. FXY (−∞, y) = FXY (x, −∞) = 0, ∀x, y;
FXY (∞, ∞) = 1
2. FXY is non-decreasing in each of its arguments
3. FXY is right continuous and has left-hand limits in each
of its arguments
4. For all x1 < x2 and y1 < y2
FXY (x2 , y2 )−FXY (x2 , y1 )−FXY (x1 , y2 )+FXY (x1 , y1 ) ≥ 0
◮ Any F : ℜ2 → ℜ satisfying the above would be a joint
distribution function.
P S Sastry, IISc, E1 222 Aug 2021 12/248
◮ Let X, Y be two discrete random variables (defined on
the same probability space).
◮ Let X ∈ {x1 , · · · xn } and Y ∈ {y1 , · · · , ym }.
◮ We define the joint probability mass function of X and Y
as
fXY (xi , yj ) = P [X = xi , Y = yj ]
(fXY (x, y) is zero for all other values of x, y)
◮ The fXY would satisfy
◮ fXY (x, y) ≥ 0, ∀x, y, and Σi Σj fXY (xi , yj ) = 1
◮ This is a straight-forward extension of the pmf of a single
discrete rv.
P S Sastry, IISc, E1 222 Aug 2021 13/248
Example
◮ Let Ω = (0, 1) with the ‘usual’ probability.
◮ So, each ω is a real number between 0 and 1
◮ Let X(ω) be the digit in the first decimal place in ω and
let Y (ω) be the digit in the second decimal place.
◮ If ω = 0.2576 then X(ω) = 2 and Y (ω) = 5
◮ Easy to see that X, Y ∈ {0, 1, · · · , 9}.
◮ We want to calculate the joint pmf of X and Y
P S Sastry, IISc, E1 222 Aug 2021 14/248
Example
◮ What is the event [X = 4]?
[X = 4] = {ω : X(ω) = 4} = [0.4, 0.5)
◮ What is the event [Y = 3]?
[Y = 3] = [0.03, 0.04) ∪ [0.13, 0.14) ∪ · · · ∪ [0.93, 0.94)
◮ What is the event [X = 4, Y = 3]?
It is the intersection of the above
[X = 4, Y = 3] = [0.43, 0.44)
◮ Hence the joint pmf of X and Y is
fXY (x, y) = P [X = x, Y = y] = 0.01, x, y ∈ {0, 1, · · · , 9}
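A short simulation (not from the original slides) agrees with this: draw ω uniformly on (0, 1), read off the first two decimal digits, and tabulate the empirical joint pmf; the sample size and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
omega = rng.uniform(size=1_000_000)     # omega ~ the 'usual' probability on (0,1)

X = (omega * 10).astype(int)            # first decimal digit
Y = (omega * 100).astype(int) % 10      # second decimal digit

# Empirical joint pmf; each of the 100 (x, y) pairs should be close to 0.01.
counts = np.zeros((10, 10))
np.add.at(counts, (X, Y), 1)
print((counts / len(omega)).round(3))
```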
P S Sastry, IISc, E1 222 Aug 2021 15/248
Example
◮ Consider the random experiment of rolling two dice.
Ω = {(ω1 , ω2 ) : ω1 , ω2 ∈ {1, 2, · · · , 6}}
◮ Let X be the maximum of the two numbers and let Y be
the sum of the two numbers.
◮ Easy to see X ∈ {1, 2, · · · , 6} and Y ∈ {2, 3, · · · , 12}
◮ What is the event [X = m, Y = n]? (We assume m, n
are in the correct range)
[X = m, Y = n] = {(ω1 , ω2 ) ∈ Ω : max(ω1 , ω2 ) = m, ω1 +ω2 = n}
◮ For this to be a non-empty set, we must have
m < n ≤ 2m
◮ Then [X = m, Y = n] = {(m, n − m), (n − m, m)}
◮ Is this always true? No! What if n = 2m?
[X = 3, Y = 6] = {(3, 3)},
[X = 4, Y = 6] = {(4, 2), (2, 4)}
◮ So, P [X = m, Y = n] is either 2/36 or 1/36 (assuming m, n satisfy the other requirements)
P S Sastry, IISc, E1 222 Aug 2021 16/248
Example
◮ We can now write the joint pmf.
◮ Assume 1 ≤ m ≤ 6 and 2 ≤ n ≤ 12. Then
fXY (m, n) = 2/36, if m < n < 2m;   = 1/36, if n = 2m
(fXY (m, n) is zero in all other cases)
◮ Does this satisfy requirements of joint pmf?
Σ_{m,n} fXY (m, n) = Σ_{m=1}^{6} Σ_{n=m+1}^{2m−1} 2/36 + Σ_{m=1}^{6} 1/36
= (2/36) Σ_{m=1}^{6} (m − 1) + 6 · (1/36)
= (2/36)(21 − 6) + 6/36 = 1
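The same check can be done by brute force (not in the original slides): enumerate the 36 equally likely outcomes and tabulate (max, sum). The exact fractions are used to avoid rounding.

```python
from collections import Counter
from fractions import Fraction

# Enumerate all 36 equally likely outcomes and tabulate (max, sum).
pmf = Counter()
for w1 in range(1, 7):
    for w2 in range(1, 7):
        pmf[(max(w1, w2), w1 + w2)] += Fraction(1, 36)

print(pmf[(3, 6)], pmf[(4, 6)])      # 1/36 and 2/36 (n = 2m vs m < n < 2m)
print(sum(pmf.values()))             # 1, so it is a valid joint pmf
```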
P S Sastry, IISc, E1 222 Aug 2021 17/248
Joint Probability mass function
◮ Let X ∈ {x1 , x2 , · · · } and Y ∈ {y1 , y2 , · · · } be discrete
random variables.
◮ The joint pmf: fXY (x, y) = P [X = x, Y = y].
◮ The joint pmf satisfies:
◮ fXY (x, y) ≥ 0, ∀x, y, and
◮ Σi Σj fXY (xi , yj ) = 1
◮ Given the joint pmf, we can get the joint df as
FXY (x, y) = Σ_{i: xi ≤x} Σ_{j: yj ≤y} fXY (xi , yj )
P S Sastry, IISc, E1 222 Aug 2021 18/248
◮ Consider countable sets {x1 , x2 , · · · } and {y1 , y2 , · · · }.
◮ Suppose fXY : ℜ2 → [0, 1] is such that
◮ fXY (x, y) = 0 unless x = xi for some i and y = yj for some j, and
◮ Σi Σj fXY (xi , yj ) = 1
◮ Then fXY is a joint pmf.
◮ This is because, if we define
FXY (x, y) = Σ_{i: xi ≤x} Σ_{j: yj ≤y} fXY (xi , yj )
then FXY satisfies all properties of a df.
◮ We normally specify a pair of discrete random variables by
giving the joint pmf
P S Sastry, IISc, E1 222 Aug 2021 19/248
◮ Given the joint pmf, we can (in principle) compute the
probability of any event involving the two discrete random
variables.
P [(X, Y ) ∈ B] = Σ_{i,j: (xi ,yj )∈B} fXY (xi , yj )
◮ Now, events can be specified in terms of relations
between the two rv’s too
[X < Y + 2] = {ω : X(ω) < Y (ω) + 2}
◮ Thus,
P [X < Y + 2] = Σ_{i,j: xi <yj +2} fXY (xi , yj )
P S Sastry, IISc, E1 222 Aug 2021 20/248
◮ Take the example: 2 dice, X is max and Y is sum
◮ fXY (m, n) = 0 unless m = 1, · · · , 6 and n = 2, · · · , 12.
For this range
fXY (m, n) = 2/36, if m < n < 2m;   = 1/36, if n = 2m
◮ Suppose we want P [Y = X + 2].
P [Y = X + 2] = Σ_{m,n: n=m+2} fXY (m, n) = Σ_{m=1}^{6} fXY (m, m + 2)
= Σ_{m=2}^{6} fXY (m, m + 2)   (since we need m + 2 ≤ 2m)
= 1/36 + 4 · (2/36) = 9/36
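A direct check of this computation (not from the original slides), using the pmf formula above as a Python function:

```python
from fractions import Fraction

def f_XY(m, n):
    """Joint pmf of (max, sum) for two fair dice, using the slide's formula."""
    if 1 <= m <= 6 and m < n < 2 * m:
        return Fraction(2, 36)
    if 1 <= m <= 6 and n == 2 * m:
        return Fraction(1, 36)
    return Fraction(0)

# P[Y = X + 2] = sum_m f_XY(m, m + 2)
print(sum(f_XY(m, m + 2) for m in range(1, 7)))   # 9/36 = 1/4
```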
P S Sastry, IISc, E1 222 Aug 2021 21/248
Joint density function
◮ Let X, Y be two continuous rv’s with df FXY .
◮ If there exists a function fXY that satisfies
FXY (x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} fXY (x′ , y′ ) dy′ dx′ ,   ∀x, y
then we say that X, Y have a joint probability density
function which is fXY
◮ Please note the difference in the definition of joint pmf
and joint pdf.
◮ When X, Y are discrete we defined a joint pmf
◮ We are not saying that if X, Y are continuous rv’s then a
joint density exists.
P S Sastry, IISc, E1 222 Aug 2021 22/248
properties of joint density
◮ The joint density (or joint pdf) of X, Y is fXY that
satisfies
FXY (x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} fXY (x′ , y′ ) dy′ dx′ ,   ∀x, y
◮ Since FXY is non-decreasing in each argument, we must
have fXY (x, y) ≥ 0.
◮ ∫_{−∞}^{∞} ∫_{−∞}^{∞} fXY (x′ , y′ ) dy′ dx′ = 1 is needed to ensure FXY (∞, ∞) = 1.
P S Sastry, IISc, E1 222 Aug 2021 23/248
properties of joint density
◮ The joint density fXY satisfies the following
1. fXY (x, y) ≥ 0, ∀x, y
2. ∫_{−∞}^{∞} ∫_{−∞}^{∞} fXY (x′ , y′ ) dy′ dx′ = 1
◮ These are very similar to the properties of the density of a
single rv
P S Sastry, IISc, E1 222 Aug 2021 24/248
Example: Joint Density
◮ Consider the function
f (x, y) = 2, 0 < x < y < 1 (f (x, y) = 0, otherwise)
◮ Let us show this is a density
∫_{−∞}^{∞} ∫_{−∞}^{∞} f (x, y) dx dy = ∫_{0}^{1} ∫_{0}^{y} 2 dx dy = ∫_{0}^{1} 2 x|_{0}^{y} dy = ∫_{0}^{1} 2y dy = 1
◮ We can say this density is uniform over the region {(x, y) : 0 < x < y < 1}.
(Figure: the triangular region 0 < x < y < 1 inside the unit square. The figure is not a plot of the density function!!)
P S Sastry, IISc, E1 222 Aug 2021 25/248
properties of joint density
◮ The joint density fXY satisfies the following
1. fXY (x, y) ≥ 0, ∀x, y
2. ∫_{−∞}^{∞} ∫_{−∞}^{∞} fXY (x′ , y′ ) dy′ dx′ = 1
◮ Any function fXY : ℜ2 → ℜ satisfying the above two is a
joint density function.
◮ Given fXY satisfying the above, define
FXY (x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} fXY (x′ , y′ ) dy′ dx′ ,   ∀x, y
◮ Then we can show FXY is a joint distribution.
P S Sastry, IISc, E1 222 Aug 2021 26/248
◮ fXY (x, y) ≥ 0 and ∫_{−∞}^{∞} ∫_{−∞}^{∞} fXY (x′ , y′ ) dy′ dx′ = 1
◮ Define
FXY (x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} fXY (x′ , y′ ) dy′ dx′ ,   ∀x, y
◮ Then, FXY (−∞, y) = FXY (x, −∞) = 0, ∀x, y and
FXY (∞, ∞) = 1
◮ Since fXY (x, y) ≥ 0, FXY is non-decreasing in each
argument.
◮ Since it is given as an integral, the above also shows that
FXY is continuous in each argument.
◮ The only property left is the special property of FXY we
mentioned earlier.
P S Sastry, IISc, E1 222 Aug 2021 27/248
∆ ≜ FXY (x2 , y2 ) − FXY (x1 , y2 ) − FXY (x2 , y1 ) + FXY (x1 , y1 ).
◮ We need to show ∆ ≥ 0 if x1 < x2 and y1 < y2 .
◮ We have
∆ = ∫_{−∞}^{x2} ∫_{−∞}^{y2} fXY dy dx − ∫_{−∞}^{x1} ∫_{−∞}^{y2} fXY dy dx − ∫_{−∞}^{x2} ∫_{−∞}^{y1} fXY dy dx + ∫_{−∞}^{x1} ∫_{−∞}^{y1} fXY dy dx
= ∫_{−∞}^{x2} ( ∫_{−∞}^{y2} fXY dy − ∫_{−∞}^{y1} fXY dy ) dx − ∫_{−∞}^{x1} ( ∫_{−∞}^{y2} fXY dy − ∫_{−∞}^{y1} fXY dy ) dx
P S Sastry, IISc, E1 222 Aug 2021 28/248
◮ Thus we have
∆ = ∫_{−∞}^{x2} ( ∫_{−∞}^{y2} fXY dy − ∫_{−∞}^{y1} fXY dy ) dx − ∫_{−∞}^{x1} ( ∫_{−∞}^{y2} fXY dy − ∫_{−∞}^{y1} fXY dy ) dx
= ∫_{−∞}^{x2} ∫_{y1}^{y2} fXY dy dx − ∫_{−∞}^{x1} ∫_{y1}^{y2} fXY dy dx
= ∫_{x1}^{x2} ∫_{y1}^{y2} fXY dy dx ≥ 0
◮ This actually shows
P [x1 ≤ X ≤ x2 , y1 ≤ Y ≤ y2 ] = ∫_{x1}^{x2} ∫_{y1}^{y2} fXY dy dx
P S Sastry, IISc, E1 222 Aug 2021 29/248
◮ What we showed is the following
◮ Any function fXY : ℜ2 → ℜ that satisfies
◮ fXY (x, y) ≥ 0, ∀x, y
◮ ∫_{−∞}^{∞} ∫_{−∞}^{∞} fXY (x, y) dx dy = 1
is a joint density function.
◮ This is because now
FXY (x, y) = ∫_{−∞}^{y} ∫_{−∞}^{x} fXY (x, y) dx dy
would satisfy all conditions for a df.
◮ Convenient to specify joint density (when it exists)
◮ We also showed
P [x1 ≤ X ≤ x2 , y1 ≤ Y ≤ y2 ] = ∫_{x1}^{x2} ∫_{y1}^{y2} fXY dy dx
◮ In general,
P [(X, Y ) ∈ B] = ∫_B fXY (x, y) dx dy,   ∀B ∈ B2
P S Sastry, IISc, E1 222 Aug 2021 30/248
◮ Let us consider the example
f (x, y) = 2, 0 < x < y < 1
◮ Suppose we want the probability of [Y > X + 0.5]
P [Y > X + 0.5] = P [(X, Y ) ∈ {(x, y) : y > x + 0.5}]
= ∫∫_{{(x,y): y>x+0.5}} fXY (x, y) dx dy
= ∫_{0.5}^{1} ∫_{0}^{y−0.5} 2 dx dy
= ∫_{0.5}^{1} 2(y − 0.5) dy
= 2 (y²/2) |_{0.5}^{1} − y |_{0.5}^{1} = 1 − 0.25 − 1 + 0.5 = 0.25
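A Monte Carlo sanity check (not in the original slides): the pair (min, max) of two iid U (0, 1) draws has exactly the joint density 2 on {0 < x < y < 1} (this fact reappears later in the slides), so it can be used to sample from this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# (X, Y) = (min, max) of two iid U(0,1) draws has joint density 2 on {0 < x < y < 1}.
u = rng.uniform(size=(1_000_000, 2))
x, y = u.min(axis=1), u.max(axis=1)

print(np.mean(y > x + 0.5))    # close to the exact value 0.25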
P S Sastry, IISc, E1 222 Aug 2021 31/248
◮ We can look at it geometrically
(Figure: the triangle 0 < x < y < 1, with the smaller triangle where y > x + 0.5 shaded.)
◮ The probability of the event we want is the area of the
small triangle divided by that of the big triangle.
P S Sastry, IISc, E1 222 Aug 2021 32/248
Marginal Distributions
◮ Let X, Y be random variables with joint distribution
function FXY .
◮ We know FXY (x, y) = P [X ≤ x, Y ≤ y].
◮ Hence
FXY (x, ∞) = P [X ≤ x, Y ≤ ∞] = P [X ≤ x] = FX (x)
◮ We define the marginal distribution functions of X, Y by
FX (x) = FXY (x, ∞); FY (y) = FXY (∞, y)
◮ These are simply distribution functions of X and Y
obtained from the joint distribution function.
P S Sastry, IISc, E1 222 Aug 2021 33/248
Marginal mass functions
◮ Let X ∈ {x1 , x2 , · · · } and Y ∈ {y1 , y2 , · · · }
◮ Let fXY be their joint mass function.
◮ Then
P [X = xi ] = Σj P [X = xi , Y = yj ] = Σj fXY (xi , yj )
(This is because the events [Y = yj ], j = 1, · · · , form a partition, and P (A) = Σi P (ABi ) when {Bi } is a partition)
◮ We define the marginal mass functions of X and Y as
fX (xi ) = Σj fXY (xi , yj );   fY (yj ) = Σi fXY (xi , yj )
◮ These are mass functions of X and Y obtained from the
joint mass function
P S Sastry, IISc, E1 222 Aug 2021 34/248
marginal density functions
◮ Let X, Y be continuous rv’s with joint density fXY .
◮ Then we know FXY (x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} fXY (x′ , y′ ) dy′ dx′
◮ Hence, we have
FX (x) = FXY (x, ∞) = ∫_{−∞}^{x} ∫_{−∞}^{∞} fXY (x′ , y′ ) dy′ dx′
= ∫_{−∞}^{x} ( ∫_{−∞}^{∞} fXY (x′ , y′ ) dy′ ) dx′
◮ Since X is a continuous rv, this means
fX (x) = ∫_{−∞}^{∞} fXY (x, y) dy
We call this the marginal density of X.
◮ Similarly, marginal density of Y is
fY (y) = ∫_{−∞}^{∞} fXY (x, y) dx
◮ These are pdf’s of X and Y obtained from the joint density.
P S Sastry, IISc, E1 222 Aug 2021 35/248
Example
◮ Rolling two dice, X is max, Y is sum
◮ We had, for 1 ≤ m ≤ 6 and 2 ≤ n ≤ 12,
fXY (m, n) = 2/36, if m < n < 2m;   = 1/36, if n = 2m
◮ We know fX (m) = Σn fXY (m, n), m = 1, · · · , 6.
◮ Given m, for what values of n, fXY (m, n) > 0 ?
We can only have n = m + 1, · · · , 2m.
◮ Hence we get
fX (m) = Σ_{n=m+1}^{2m} fXY (m, n) = Σ_{n=m+1}^{2m−1} (2/36) + 1/36 = (2/36)(m − 1) + 1/36 = (2m − 1)/36
P S Sastry, IISc, E1 222 Aug 2021 36/248
Example
◮ Consider the joint density
fXY (x, y) = 2, 0 < x < y < 1
◮ The marginal density of X is: for 0 < x < 1,
fX (x) = ∫_{−∞}^{∞} fXY (x, y) dy = ∫_{x}^{1} 2 dy = 2(1 − x)
Thus, fX (x) = 2(1 − x), 0 < x < 1
◮ We can easily verify this is a density
∫_{−∞}^{∞} fX (x) dx = ∫_{0}^{1} 2(1 − x) dx = (2x − x²) |_{0}^{1} = 1
P S Sastry, IISc, E1 222 Aug 2021 37/248
We have: fXY (x, y) = 2, 0 < x < y < 1
◮ We can similarly find density of Y .
◮ For 0 < y < 1,
fY (y) = ∫_{−∞}^{∞} fXY (x, y) dx = ∫_{0}^{y} 2 dx = 2y
◮ Thus, fY (y) = 2y, 0 < y < 1 and
∫_{0}^{1} 2y dy = 2 (y²/2) |_{0}^{1} = 1
P S Sastry, IISc, E1 222 Aug 2021 38/248
◮ If we are given the joint df or joint pmf/joint density of
X, Y , then the individual df or pmf/pdf are uniquely
determined.
◮ However, given individual pdf of X and Y , we cannot
determine the joint density. (same is true of pmf or df)
◮ There can be many different joint density functions all
having the same marginals
P S Sastry, IISc, E1 222 Aug 2021 39/248
Conditional distributions
◮ Let X, Y be rv’s on the same probability space
◮ We define the conditional distribution of X given Y by
FX|Y (x|y) = P [X ≤ x|Y = y]
(For now ignore the case of P [Y = y] = 0).
◮ Note that FX|Y : ℜ2 → ℜ
◮ FX|Y (x|y) is a notation. We could write FX|Y (x, y).
P S Sastry, IISc, E1 222 Aug 2021 40/248
◮ Conditional distribution of X given Y is
FX|Y (x|y) = P [X ≤ x|Y = y]
It is the conditional probability of [X ≤ x] given (or
conditioned on) [Y = y].
◮ Consider example: rolling 2 dice, X is max, Y is sum
P [X ≤ 4|Y = 3] = 1; P [X ≤ 4|Y = 9] = 0
◮ This is what conditional distribution captures.
◮ For every value of y, FX|Y (x|y) is a distribution function
in the variable x.
◮ It defines a new distribution for X based on knowing the
value of Y .
P S Sastry, IISc, E1 222 Aug 2021 41/248
◮ Let: X ∈ {x1 , x2 , · · · } and Y ∈ {y1 , y2 , · · · }. Then
FX|Y (x|yj ) = P [X ≤ x|Y = yj ] = P [X ≤ x, Y = yj ] / P [Y = yj ]
(We define FX|Y (x|y) only when y = yj for some j).
◮ For each yj , FX|Y (x|yj ) is a df of a discrete rv in x.
◮ Since X is a discrete rv, we can write the above as
FX|Y (x|yj ) = P [X ≤ x, Y = yj ] / P [Y = yj ] = ( Σ_{i: xi ≤x} P [X = xi , Y = yj ] ) / P [Y = yj ]
= Σ_{i: xi ≤x} P [X = xi , Y = yj ] / P [Y = yj ]
= Σ_{i: xi ≤x} fXY (xi , yj ) / fY (yj )
P S Sastry, IISc, E1 222 Aug 2021 42/248
Conditional mass function
◮ We got
FX|Y (x|yj ) = Σ_{i: xi ≤x} fXY (xi , yj ) / fY (yj )
◮ We define the conditional mass function of X given Y as
fX|Y (xi |yj ) = fXY (xi , yj ) / fY (yj ) = P [X = xi |Y = yj ]
◮ Note that
Σi fX|Y (xi |yj ) = 1, ∀yj ;   and   FX|Y (x|yj ) = Σ_{i: xi ≤x} fX|Y (xi |yj )
P S Sastry, IISc, E1 222 Aug 2021 43/248
Example: Conditional pmf
◮ Consider the random experiment of tossing a coin (with probability p of heads) n times.
◮ Let X denote the number of heads and let Y denote the
toss number on which the first head comes.
◮ For 1 ≤ k ≤ n,
fY |X (k|1) = P [Y = k|X = 1] = P [Y = k, X = 1] / P [X = 1]
= p(1 − p)^{n−1} / ( nC1 p(1 − p)^{n−1} )
= 1/n
◮ Given there is only one head, it is equally likely to occur
on any toss.
P S Sastry, IISc, E1 222 Aug 2021 44/248
◮ The conditional mass function is
fX|Y (xi |yj ) = P [X = xi |Y = yj ] = fXY (xi , yj ) / fY (yj )
◮ This gives us the useful identity
fXY (xi , yj ) = fX|Y (xi |yj )fY (yj )
( P [X = xi , Y = yj ] = P [X = xi |Y = yj ]P [Y = yj ])
◮ This gives us the total probability rule for discrete rv’s:
fX (xi ) = Σj fXY (xi , yj ) = Σj fX|Y (xi |yj )fY (yj )
◮ This is the same as
P [X = xi ] = Σj P [X = xi |Y = yj ]P [Y = yj ]
(P (A) = Σj P (A|Bj )P (Bj ) when B1 , · · · form a partition)
P S Sastry, IISc, E1 222 Aug 2021 45/248
Bayes Rule for discrete Random Variable
◮ We have
fXY (xi , yj ) = fX|Y (xi |yj )fY (yj ) = fY |X (yj |xi )fX (xi )
◮ This gives us Bayes rule for discrete rv’s
fX|Y (xi |yj ) = fY |X (yj |xi )fX (xi ) / fY (yj )
= fY |X (yj |xi )fX (xi ) / ( Σi fXY (xi , yj ) )
= fY |X (yj |xi )fX (xi ) / ( Σi fY |X (yj |xi )fX (xi ) )
P S Sastry, IISc, E1 222 Aug 2021 46/248
◮ Let X, Y be continuous rv’s with joint density, fXY .
◮ We once again want to define conditional df
FX|Y (x|y) = P [X ≤ x|Y = y]
◮ But the conditioning event, [Y = y] has zero probability.
◮ Hence we define conditional df as follows
FX|Y (x|y) = lim P [X ≤ x|Y ∈ [y, y + δ]]
δ↓0
◮ This is well defined if the limit exists.
◮ The limit exists for all y where fY (y) > 0 (and for all x)
P S Sastry, IISc, E1 222 Aug 2021 47/248
◮ The conditional df is given by (assuming fY (y) > 0)
FX|Y (x|y) = lim_{δ↓0} P [X ≤ x|Y ∈ [y, y + δ]]
= lim_{δ↓0} P [X ≤ x, Y ∈ [y, y + δ]] / P [Y ∈ [y, y + δ]]
= lim_{δ↓0} ( ∫_{−∞}^{x} ∫_{y}^{y+δ} fXY (x′ , y′ ) dy′ dx′ ) / ( ∫_{y}^{y+δ} fY (y′ ) dy′ )
= lim_{δ↓0} ( ∫_{−∞}^{x} fXY (x′ , y) δ dx′ + o(δ) ) / ( fY (y) δ + o(δ) )
= ∫_{−∞}^{x} ( fXY (x′ , y) / fY (y) ) dx′
◮ We define the conditional density of X given Y as
fX|Y (x|y) = fXY (x, y) / fY (y)
P S Sastry, IISc, E1 222 Aug 2021 48/248
◮ Let X, Y have joint density fXY .
◮ The conditional df of X given Y is
FX|Y (x|y) = lim_{δ↓0} P [X ≤ x|Y ∈ [y, y + δ]]
◮ This exists if fY (y) > 0 and then it has a density:
FX|Y (x|y) = ∫_{−∞}^{x} ( fXY (x′ , y) / fY (y) ) dx′ = ∫_{−∞}^{x} fX|Y (x′ |y) dx′
◮ This conditional density is given by
fX|Y (x|y) = fXY (x, y) / fY (y)
◮ We (once again) have the useful identity
fXY (x, y) = fX|Y (x|y) fY (y) = fY |X (y|x)fX (x)
P S Sastry, IISc, E1 222 Aug 2021 49/248
Example
fXY (x, y) = 2, 0 < x < y < 1
◮ We saw that the marginal densities are
fX (x) = 2(1 − x), 0 < x < 1; fY (y) = 2y, 0 < y < 1
◮ Hence the conditional densities are given by
fX|Y (x|y) = fXY (x, y) / fY (y) = 1/y,   0 < x < y < 1
fY |X (y|x) = fXY (x, y) / fX (x) = 1/(1 − x),   0 < x < y < 1
◮ We can see this intuitively
Conditioned on Y = y, X is uniform over (0, y).
Conditioned on X = x, Y is uniform over (x, 1).
P S Sastry, IISc, E1 222 Aug 2021 50/248
◮ The identity fXY (x, y) = fX|Y (x|y)fY (y) can be used to
specify the joint density of two continuous rv’s
◮ We can specify the marginal density of one and the
conditional density of the other given the first.
◮ This may actually be the model of how the rv’s are generated.
P S Sastry, IISc, E1 222 Aug 2021 51/248
Example
◮ Let X be uniform over (0, 1) and, given X, let Y be uniform over (0, X). Find the density of Y .
◮ What we are given is
fX (x) = 1, 0 < x < 1;   fY |X (y|x) = 1/x, 0 < y < x < 1
◮ Hence the joint density is: fXY (x, y) = 1/x, 0 < y < x < 1.
◮ Hence the density of Y is
fY (y) = ∫_{−∞}^{∞} fXY (x, y) dx = ∫_{y}^{1} (1/x) dx = − ln(y), 0 < y < 1
◮ We can verify it to be a density
∫_{0}^{1} − ln(y) dy = −y ln(y) |_{0}^{1} + ∫_{0}^{1} y (1/y) dy = 1
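A simulation of this two-stage model (not in the original slides) reproduces the − ln(y) density; the sample size, bin count and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generate (X, Y) exactly as modelled: X ~ U(0,1), then Y | X=x ~ U(0, x).
n = 1_000_000
x = rng.uniform(size=n)
y = rng.uniform(size=n) * x

# Histogram-based density estimate of Y versus the derived f_Y(y) = -ln(y).
hist, edges = np.histogram(y, bins=50, range=(0, 1), density=True)
mids = 0.5 * (edges[:-1] + edges[1:])
for i in [2, 10, 25, 40]:
    print(round(mids[i], 2), round(hist[i], 3), round(-np.log(mids[i]), 3))
```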
P S Sastry, IISc, E1 222 Aug 2021 52/248
◮ We have the identity
fXY (x, y) = fX|Y (x|y) fY (y)
◮ By integrating both sides
fX (x) = ∫_{−∞}^{∞} fXY (x, y) dy = ∫_{−∞}^{∞} fX|Y (x|y) fY (y) dy
◮ This is a continuous analogue of total probability rule.
◮ But note that, since X is continuous rv, fX (x) is NOT
P [X = x]
◮ In case of discrete rv, the mass function value fX (x) is
equal to P [X = x] and we had
fX (x) = Σy fX|Y (x|y)fY (y)
◮ It is as if one can simply replace pmf by pdf and
summation by integration!!
◮ While often that gives the right result, one needs to be
very careful
P S Sastry, IISc, E1 222 Aug 2021 53/248
◮ We have the identity
fXY (x, y) = fX|Y (x|y) fY (y) = fY |X (y|x)fX (x)
◮ This gives rise to Bayes rule for continuous rv
fX|Y (x|y) = fY |X (y|x)fX (x) / fY (y)
= fY |X (y|x)fX (x) / ( ∫_{−∞}^{∞} fY |X (y|x)fX (x) dx )
◮ This is essentially identical to Bayes rule for discrete rv’s.
We have essentially put the pdf wherever there was pmf
P S Sastry, IISc, E1 222 Aug 2021 54/248
◮ To recap, we started by defining conditional distribution
function.
FX|Y (x|y) = P [X ≤ x|Y = y]
◮ When X, Y are discrete, we define this only for y = yj .
That is, we define it only for all values that Y can take.
◮ When X, Y have joint density, we defined it by
FX|Y (x|y) = lim P [X ≤ x|Y ∈ [y, y + δ]]
δ↓0
This limit exists and FX|Y is well defined if fY (y) > 0.
That is, essentially again for all values that Y can take.
◮ In the discrete case, we define fX|Y as the pmf
corresponding to FX|Y . This conditional pmf can also be
defined as a conditional probability
◮ In the continuous case fX|Y is the density corresponding
to FX|Y .
◮ In both cases we have: fXY (x, y) = fX|Y (x|y)fY (y)
◮ This gives the total probability rule and Bayes rule for random variables.
P S Sastry, IISc, E1 222 Aug 2021 55/248
◮ Now, let X be a continuous rv and let Y be discrete rv.
◮ We can define FX|Y as
FX|Y (x|y) = P [X ≤ x|Y = y]
This is well defined for all values that y takes. (We
consider only those y)
◮ Since X is continuous rv, this df would have a density
FX|Y (x|y) = ∫_{−∞}^{x} fX|Y (x′ |y) dx′
◮ Hence we can write
P [X ≤ x, Y = y] = FX|Y (x|y)P [Y = y] = ∫_{−∞}^{x} fX|Y (x′ |y) fY (y) dx′
P S Sastry, IISc, E1 222 Aug 2021 56/248
◮ We now get
FX (x) = P [X ≤ x] = Σy P [X ≤ x, Y = y]
= Σy ∫_{−∞}^{x} fX|Y (x′ |y) fY (y) dx′
= ∫_{−∞}^{x} ( Σy fX|Y (x′ |y) fY (y) ) dx′
◮ This gives us
fX (x) = Σy fX|Y (x|y)fY (y)
◮ This is another version of total probability rule.
◮ Earlier we derived this when X, Y are discrete.
◮ The formula is true even when X is continuous
Only difference is we need to take fX as the density of X.
P S Sastry, IISc, E1 222 Aug 2021 57/248
◮ When X, Y are discrete we have
fX (x) = Σy fX|Y (x|y)fY (y)
◮ When X is continuous and Y is discrete, we defined
fX|Y (x|y) to be the density corresponding to
FX|Y (x|y) = P [X ≤ x|Y = y]
◮ Then we once again get
fX (x) = Σy fX|Y (x|y)fY (y)
However, now, fX is density (and not a mass function).
fX|Y is also a density now.
◮ Suppose Y ∈ {1, 2, 3} and fY (i) = λi .
Let fX|Y (x|i) = fi (x). Then
fX (x) = λ1 f1 (x) + λ2 f2 (x) + λ3 f3 (x)
Called a mixture density model
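The sketch below (not from the original slides) shows the standard way such a mixture is sampled: draw the discrete label first, then the continuous value given the label. The specific weights λi and the Gaussian component densities fi are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical instance: Y in {1,2,3} with pmf (0.2, 0.5, 0.3), and
# f_{X|Y}(x|i) taken to be N(mu_i, 1); f_X is then the 3-component mixture.
lam = np.array([0.2, 0.5, 0.3])
mu = np.array([-2.0, 0.0, 3.0])

n = 200_000
y = rng.choice([1, 2, 3], size=n, p=lam)     # draw the discrete label first
x = rng.normal(loc=mu[y - 1], scale=1.0)     # then X | Y=i ~ N(mu_i, 1)

# x now has density lam[0]*f1(x) + lam[1]*f2(x) + lam[2]*f3(x)
print(x.mean(), lam @ mu)                    # both approximate the mixture mean
```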
P S Sastry, IISc, E1 222 Aug 2021 58/248
◮ Continuing with X continuous rv and Y discrete. We
have
FX|Y (x|y) = P [X ≤ x|Y = y] = ∫_{−∞}^{x} fX|Y (x′ |y) dx′
◮ We also have
P [X ≤ x, Y = y] = ∫_{−∞}^{x} fX|Y (x′ |y) fY (y) dx′
◮ Hence we can define a ‘joint density’
fXY (x, y) = fX|Y (x|y)fY (y)
◮ This is a kind of mixed density and mass function.
◮ We will not be using such ‘joint densities’ here
P S Sastry, IISc, E1 222 Aug 2021 59/248
◮ Continuing with X continuous rv and Y discrete
◮ Can we define fY |X (y|x)?
◮ Since Y is discrete, this (conditional) mass function is
fY |X (y|x) = P [Y = y|X = x]
But the conditioning event has zero prob
We now know how to handle it
fY |X (y|x) = lim_{δ↓0} P [Y = y|X ∈ [x, x + δ]]
◮ For simplifying this we note the following:
P [X ≤ x, Y = y] = ∫_{−∞}^{x} fX|Y (x′ |y) fY (y) dx′
⇒ P [X ∈ [x, x + δ], Y = y] = ∫_{x}^{x+δ} fX|Y (x′ |y) fY (y) dx′
P S Sastry, IISc, E1 222 Aug 2021 60/248
◮ We have
fY |X (y|x) = lim_{δ↓0} P [Y = y|X ∈ [x, x + δ]]
= lim_{δ↓0} P [Y = y, X ∈ [x, x + δ]] / P [X ∈ [x, x + δ]]
= lim_{δ↓0} ( ∫_{x}^{x+δ} fX|Y (x′ |y) fY (y) dx′ ) / ( ∫_{x}^{x+δ} fX (x′ ) dx′ )
= fX|Y (x|y) fY (y) / fX (x)
⇒ fY |X (y|x)fX (x) = fX|Y (x|y) fY (y)
◮ This gives us further versions of total probability rule and
Bayes rule.
P S Sastry, IISc, E1 222 Aug 2021 61/248
◮ First let us look at the total probability rule possibilities
◮ When X is continuous rv and Y is discrete rv, we derived
fY |X (y|x)fX (x) = fX|Y (x|y) fY (y)
Note that fY is mass fn, fX is density and so on.
◮ Since fX|Y is a density (corresponding to FX|Y ),
∫_{−∞}^{∞} fX|Y (x|y) dx = 1
◮ Hence we get
fY (y) = ∫_{−∞}^{∞} fY |X (y|x)fX (x) dx
◮ Earlier we derived the same formula when X, Y have a
joint density.
P S Sastry, IISc, E1 222 Aug 2021 62/248
◮ Let us review all the total probability formulas
1. fX (x) = Σy fX|Y (x|y)fY (y)
◮ We first derived this when X, Y are discrete.
◮ But now we proved this holds when Y is discrete
If X is continuous the fX , fX|Y are densities; If X is also
discrete they are mass functions
2. fY (y) = ∫_{−∞}^{∞} fY |X (y|x)fX (x) dx
◮ We first proved it when X, Y have a joint density
We now know it holds also when X is cont and Y is
discrete. In that case fY is a mass function
P S Sastry, IISc, E1 222 Aug 2021 63/248
◮ When X is continuous rv and Y is discrete rv, we derived
fY |X (y|x)fX (x) = fX|Y (x|y) fY (y)
◮ This once again gives rise to Bayes rule:
fY |X (y|x) = fX|Y (x|y) fY (y) / fX (x);   fX|Y (x|y) = fY |X (y|x)fX (x) / fY (y)
◮ Earlier we showed this holds when X, Y are both discrete or both continuous.
◮ Thus Bayes rule holds in all four possible scenarios
◮ Only difference is we need to interpret fX or fX|Y as
mass functions when X is discrete and as densities when
X is a continuous rv
◮ In general, one refers to these always as densities since
the actual meaning would be clear from context.
P S Sastry, IISc, E1 222 Aug 2021 64/248
Example
◮ Consider a communication system. The transmitter puts out 0 or 5 volts for the bits 0 and 1, respectively, and the voltage measured by the receiver is the sent voltage plus noise added by the channel.
◮ We assume noise has Gaussian density with mean zero
and variance σ 2 .
◮ We want the probability that the sent bit is 1 when
measured voltage at the receiver is x (to decide what is
sent).
◮ Let X be the measured voltage and let Y be sent bit.
◮ We want to calculate fY |X (1|x).
◮ We want to use the Bayes rule to calculate this
P S Sastry, IISc, E1 222 Aug 2021 65/248
◮ We need fX|Y . What does our model say?
◮ fX|Y (x|1) is Gaussian with mean 5 and variance σ 2 and
fX|Y (x|0) is Gaussian with mean zero and variance σ 2
P [Y = 1|X = x] = fY |X (1|x) = fX|Y (x|1) fY (1) / fX (x)
◮ We need fY (1), fY (0). Let us take them to be same.
◮ In practice we only want to know whether
fY |X (1|x) > fY |X (0|x)
◮ Then we do not need to calculate fX (x).
We only need ratio of fY |X (1|x) and fY |X (0|x).
P S Sastry, IISc, E1 222 Aug 2021 66/248
◮ The ratio of the two probabilities is
fY |X (1|x) / fY |X (0|x) = ( fX|Y (x|1) fY (1) ) / ( fX|Y (x|0) fY (0) )
= ( (1/(σ√2π)) e^{−(x−5)²/(2σ²)} ) / ( (1/(σ√2π)) e^{−(x−0)²/(2σ²)} )
= e^{−0.5 σ^{−2} (x² − 10x + 25 − x²)}
= e^{0.5 σ^{−2} (10x − 25)}
◮ We are only interested in whether the above is greater
than 1 or not.
◮ The ratio is greater than 1 if 10x > 25 or x > 2.5
◮ So, if X > 2.5 we will conclude bit 1 is sent. Intuitively
obvious!
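A small Python sketch of this decision rule (not in the original slides): compute the posterior ratio via Bayes rule with equal priors, as on the slide; σ = 1 and the test points are arbitrary.

```python
import numpy as np

sigma = 1.0
p1 = p0 = 0.5          # prior fY(1) = fY(0), as assumed on the slide

def posterior_ratio(x):
    """fY|X(1|x) / fY|X(0|x) via Bayes rule; fX(x) cancels in the ratio."""
    like1 = np.exp(-(x - 5.0) ** 2 / (2 * sigma**2))
    like0 = np.exp(-(x - 0.0) ** 2 / (2 * sigma**2))
    return (like1 * p1) / (like0 * p0)

for x in [1.0, 2.4, 2.5, 2.6, 4.0]:
    print(x, posterior_ratio(x) > 1)   # decide '1' exactly when x > 2.5
```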
P S Sastry, IISc, E1 222 Aug 2021 67/248
◮ We did not calculate fX (x) in the above.
◮ We can calculate it if we want.
◮ Using total probability rule
fX (x) = Σy fX|Y (x|y)fY (y)
= fX|Y (x|1)fY (1) + fX|Y (x|0)fY (0)
= (1/2) (1/(σ√2π)) e^{−(x−5)²/(2σ²)} + (1/2) (1/(σ√2π)) e^{−x²/(2σ²)}
◮ It is a mixture density
P S Sastry, IISc, E1 222 Aug 2021 68/248
◮ As we saw, given the joint distribution we can calculate
all the marginals.
◮ However, there can be many joint distributions with the
same marginals.
◮ Let F1 , F2 be one dimensional df’s of continuous rv’s with
f1 , f2 being the corresponding densities.
Define a function f : ℜ2 → ℜ by
f (x, y) = f1 (x)f2 (y) [1 + α(2F1 (x) − 1)(2F2 (y) − 1)]
where α ∈ (−1, 1).
◮ First note that f (x, y) ≥ 0, ∀α ∈ (−1, 1).
For different α we get different functions.
◮ We first show that f (x, y) is a joint density.
◮ For this, we note the following
∫_{−∞}^{∞} f1 (x) F1 (x) dx = (F1 (x))²/2 |_{−∞}^{∞} = 1/2
P S Sastry, IISc, E1 222 Aug 2021 69/248
f (x, y) = f1 (x)f2 (y) [1 + α(2F1 (x) − 1)(2F2 (y) − 1)]
∫_{−∞}^{∞} ∫_{−∞}^{∞} f (x, y) dx dy = ∫_{−∞}^{∞} f1 (x) dx ∫_{−∞}^{∞} f2 (y) dy
+ α ∫_{−∞}^{∞} (2f1 (x)F1 (x) − f1 (x)) dx ∫_{−∞}^{∞} (2f2 (y)F2 (y) − f2 (y)) dy
= 1
because 2 ∫_{−∞}^{∞} f1 (x) F1 (x) dx = 1. This also shows
∫_{−∞}^{∞} f (x, y) dx = f2 (y);   ∫_{−∞}^{∞} f (x, y) dy = f1 (x)
P S Sastry, IISc, E1 222 Aug 2021 70/248
◮ Thus infinitely many joint distributions can all have the
same marginals.
◮ So, in general, the marginals cannot determine the joint
distribution.
◮ An important special case where this is possible is that of
independent random variables
P S Sastry, IISc, E1 222 Aug 2021 71/248
Independent Random Variables
◮ Two random variables X, Y are said to be independent if for all Borel sets B1 , B2 , the events [X ∈ B1 ] and [Y ∈ B2 ] are independent.
◮ If X, Y are independent then
P [X ∈ B1 , Y ∈ B2 ] = P [X ∈ B1 ] P [Y ∈ B2 ], ∀B1 , B2 ∈ B
◮ In particular
FXY (x, y) = P [X ≤ x, Y ≤ y] = P [X ≤ x]P [Y ≤ y] = FX (x) FY (y)
◮ Theorem: X, Y are independent if and only if
FXY (x, y) = FX (x)FY (y).
P S Sastry, IISc, E1 222 Aug 2021 72/248
◮ Suppose X, Y are independent discrete rv’s
fXY (x, y) = P [X = x, Y = y] = P [X = x]P [Y = y] = fX (x)fY (y)
The joint mass function is a product of marginals.
◮ Suppose fXY (x, y) = fX (x)fY (y). Then
FXY (x, y) = Σ_{xi ≤x, yj ≤y} fXY (xi , yj ) = Σ_{xi ≤x, yj ≤y} fX (xi )fY (yj )
= Σ_{xi ≤x} fX (xi ) Σ_{yj ≤y} fY (yj ) = FX (x)FY (y)
◮ So, X, Y are independent if and only if
fXY (x, y) = fX (x)fY (y)
P S Sastry, IISc, E1 222 Aug 2021 73/248
◮ Let X, Y be independent continuous rv
FXY (x, y) = FX (x)FY (y) = ∫_{−∞}^{x} fX (x′ ) dx′ ∫_{−∞}^{y} fY (y′ ) dy′
= ∫_{−∞}^{y} ∫_{−∞}^{x} ( fX (x′ )fY (y′ ) ) dx′ dy′
◮ This implies the joint density is the product of the marginals.
◮ Now, suppose fXY (x, y) = fX (x)fY (y). Then
FXY (x, y) = ∫_{−∞}^{y} ∫_{−∞}^{x} fXY (x′ , y′ ) dx′ dy′
= ∫_{−∞}^{y} ∫_{−∞}^{x} fX (x′ )fY (y′ ) dx′ dy′
= ∫_{−∞}^{x} fX (x′ ) dx′ ∫_{−∞}^{y} fY (y′ ) dy′ = FX (x)FY (y)
−∞ −∞
◮ So, X, Y are independent if and only if
fXY (x, y) = fX (x)fY (y)
P S Sastry, IISc, E1 222 Aug 2021 74/248
◮ Let X, Y be independent.
◮ Then P [X ∈ B1 |Y ∈ B2 ] = P [X ∈ B1 ].
◮ Hence, we get FX|Y (x|y) = FX (x).
◮ This also implies fX|Y (x|y) = fX (x).
◮ This is true for all the four possibilities of X, Y being
continuous/discrete.
P S Sastry, IISc, E1 222 Aug 2021 75/248
More than two rv
◮ Everything we have done so far is easily extended to
multiple random variables.
◮ Let X, Y, Z be rv on the same probability space.
◮ We define joint distribution function by
FXY Z (x, y, z) = P [X ≤ x, Y ≤ y, Z ≤ z]
◮ If all three are discrete then the joint mass function is
fXY Z (x, y, z) = P [X = x, Y = y, Z = z]
◮ If they are continuous, they have a joint density fXY Z if
FXY Z (x, y, z) = ∫_{−∞}^{z} ∫_{−∞}^{y} ∫_{−∞}^{x} fXY Z (x′ , y′ , z′ ) dx′ dy′ dz′
P S Sastry, IISc, E1 222 Aug 2021 76/248
◮ Easy to see that the joint mass function satisfies
1. fXY Z (x, y, z) ≥ 0 and is non-zero only for countably many tuples.
2. Σ_{x,y,z} fXY Z (x, y, z) = 1
◮ Similarly the joint density satisfies
1. fXY Z (x, y, z) ≥ 0
2. ∫_{−∞}^{∞} ∫_{−∞}^{∞} ∫_{−∞}^{∞} fXY Z (x, y, z) dx dy dz = 1
◮ These are straight-forward generalizations
◮ The properties of joint distribution function such as it
being non-decreasing in each argument etc are easily seen
to hold here too.
◮ Generalizing the special property of the df (relating to
probability of cylindrical sets) is a little more complicated.
◮ We specify multiple random variables either through joint
mass function or joint density function.
P S Sastry, IISc, E1 222 Aug 2021 77/248
◮ Now we get many different marginals:
FXY (x, y) = FXY Z (x, y, ∞); FZ (z) = FXY Z (∞, ∞, z) and so on
◮ Similarly we get
fY Z (y, z) = ∫_{−∞}^{∞} fXY Z (x, y, z) dx;   fX (x) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} fXY Z (x, y, z) dy dz
◮ Any marginal is a joint density of a subset of these rv’s
and we obtain it by integrating the (full) joint density
with respect to the remaining variables.
◮ We obtain the marginal mass functions for a subset of the
rv’s also similarly where we sum over the remaining
variables.
P S Sastry, IISc, E1 222 Aug 2021 78/248
◮ We have to be a little careful in dealing with these when
some random variables are discrete and others are
continuous.
◮ Suppose X is continuous and Y, Z are discrete. We do
not have any joint density or mass function as such.
◮ However, the joint df is always well defined.
◮ Suppose we want marginal joint distribution of X, Y . We
know how to get FXY by marginalization.
◮ Then we can get fX (a density), fY (a mass fn), fX|Y (a conditional density) and fY |X (a conditional mass fn)
◮ With these we can generally calculate most quantities of
interest.
P S Sastry, IISc, E1 222 Aug 2021 79/248
◮ Like in case of marginals, there are different types of
conditional distributions now.
◮ We can always define conditional distribution functions
like
FXY |Z (x, y|z) = P [X ≤ x, Y ≤ y|Z = z]
FX|Y Z (x|y, z) = P [X ≤ x|Y = y, Z = z]
◮ In all such cases, if the conditioning random variables are
continuous, we define the above as a limit.
◮ For example when Z is continuous
FXY |Z (x, y|z) = lim P [X ≤ x, Y ≤ y|Z ∈ [z, z + δ]]
δ↓0
P S Sastry, IISc, E1 222 Aug 2021 80/248
◮ If X, Y, Z are all discrete then, all conditional mass
functions are defined by appropriate conditional
probabilities. For example,
fX|Y Z (x|y, z) = P [X = x|Y = y, Z = z]
◮ Thus the following are obvious:
fXY |Z (x, y|z) = fXY Z (x, y, z) / fZ (z)
fX|Y Z (x|y, z) = fXY Z (x, y, z) / fY Z (y, z)
fXY Z (x, y, z) = fZ|Y X (z|y, x)fY |X (y|x)fX (x)
◮ For example, the first one above follows from
P [X = x, Y = y|Z = z] = P [X = x, Y = y, Z = z] / P [Z = z]
P S Sastry, IISc, E1 222 Aug 2021 81/248
◮ When X, Y, Z have joint density, all such relations hold
for the appropriate (conditional) densities. For example,
FZ|XY (z|x, y) = lim_{δ↓0} P [Z ≤ z, X ∈ [x, x + δ], Y ∈ [y, y + δ]] / P [X ∈ [x, x + δ], Y ∈ [y, y + δ]]
= lim_{δ↓0} ( ∫_{−∞}^{z} ∫_{x}^{x+δ} ∫_{y}^{y+δ} fXY Z (x′ , y′ , z′ ) dy′ dx′ dz′ ) / ( ∫_{x}^{x+δ} ∫_{y}^{y+δ} fXY (x′ , y′ ) dy′ dx′ )
= ∫_{−∞}^{z} ( fXY Z (x, y, z′ ) / fXY (x, y) ) dz′ = ∫_{−∞}^{z} fZ|XY (z′ |x, y) dz′
◮ Thus we get
fXY Z (x, y, z) = fZ|XY (z|x, y)fXY (x, y) = fZ|XY (z|x, y)fY |X (y|x)fX (x)
P S Sastry, IISc, E1 222 Aug 2021 82/248
◮ We can similarly talk about the joint distribution of any
finite number of rv’s
◮ Let X1 , X2 , · · · , Xn be rv’s on the same probability space.
◮ We denote it as a vector X or X. We can think of it as a
mapping, X : Ω → ℜn .
◮ We can write the joint distribution as
FX (x) = P [X ≤ x] = P [Xi ≤ xi , i = 1, · · · , n]
◮ We represent by fX (x) the joint density or mass function.
Sometimes we also write it as fX1 ···Xn (x1 , · · · , xn )
◮ We use similar notation for marginal and conditional
distributions
P S Sastry, IISc, E1 222 Aug 2021 83/248
Independence of multiple random variables
◮ Random variables X1 , X2 , · · · , Xn are said to be independent if the events [Xi ∈ Bi ], i = 1, · · · , n, are independent for all Borel sets Bi .
(Recall definition of independence of a set of events)
◮ Independence implies that the marginals would determine
the joint distribution.
◮ If X, Y, Z are independent then
fXY Z (x, y, z) = fX (x)fY (y)fZ (z)
◮ For independent random variables, the joint mass
function (or density function) is product of individual
mass functions (or density functions)
P S Sastry, IISc, E1 222 Aug 2021 84/248
Example
◮ Let a joint density be given by
fXY Z (x, y, z) = K, 0<z<y<x<1
First let us determine K.
∫_{−∞}^{∞} ∫_{−∞}^{∞} ∫_{−∞}^{∞} fXY Z (x, y, z) dz dy dx = ∫_{0}^{1} ∫_{0}^{x} ∫_{0}^{y} K dz dy dx
= K ∫_{0}^{1} ∫_{0}^{x} y dy dx
= K ∫_{0}^{1} (x²/2) dx
= K/6   ⇒ K = 6
P S Sastry, IISc, E1 222 Aug 2021 85/248
fXY Z (x, y, z) = 6, 0<z<y<x<1
◮ Suppose we want to find the (marginal) joint distribution
of X and Z.
fXZ (x, z) = ∫_{−∞}^{∞} fXY Z (x, y, z) dy = ∫_{z}^{x} 6 dy,   0 < z < x < 1
= 6(x − z),   0 < z < x < 1
P S Sastry, IISc, E1 222 Aug 2021 86/248
◮ We got the joint density as
fXZ (x, z) = 6(x − z), 0<z<x<1
◮ We can verify this is a joint density:
∫_{−∞}^{∞} ∫_{−∞}^{∞} fXZ (x, z) dz dx = ∫_{0}^{1} ∫_{0}^{x} 6(x − z) dz dx
= ∫_{0}^{1} ( 6x z|_{0}^{x} − 6 (z²/2)|_{0}^{x} ) dx
= ∫_{0}^{1} ( 6x² − 6 x²/2 ) dx
= 3 (x³/3) |_{0}^{1} = 1
P S Sastry, IISc, E1 222 Aug 2021 87/248
◮ The joint density of X, Y, Z is
fXY Z (x, y, z) = 6, 0<z<y<x<1
◮ The joint density of X, Z is
fXZ (x, z) = 6(x − z), 0<z<x<1
◮ Hence,
fY |XZ (y|x, z) = fXY Z (x, y, z) / fXZ (x, z) = 1/(x − z),   0 < z < y < x < 1
P S Sastry, IISc, E1 222 Aug 2021 88/248
Functions of multiple random variables
◮ Let X, Y be random variables on the same probability
space.
◮ Let g : ℜ2 → ℜ.
◮ Let Z = g(X, Y ). Then Z is a rv
◮ This is analogous to functions of a single rv
(Figure: the pair (X, Y ) maps the sample space Ω to ℜ2 and g maps ℜ2 to ℜ; Z = g(X, Y ) is the composition, with B ⊂ ℜ pulled back to B′ ⊂ ℜ2 .)
P S Sastry, IISc, E1 222 Aug 2021 89/248
◮ let Z = g(X, Y )
◮ We can determine distribution of Z from the joint
distribution of X, Y
FZ (z) = P [Z ≤ z] = P [g(X, Y ) ≤ z]
◮ For example, if X, Y are discrete, then
fZ (z) = P [Z = z] = P [g(X, Y ) = z] = Σ_{xi ,yj : g(xi ,yj )=z} fXY (xi , yj )
P S Sastry, IISc, E1 222 Aug 2021 90/248
◮ Let X, Y be discrete rv’s. Let Z = min(X, Y ).
fZ (z) = P [min(X, Y ) = z]
= P [X = z, Y > z] + P [Y = z, X > z] + P [X = Y = z]
= Σ_{y>z} P [X = z, Y = y] + Σ_{x>z} P [X = x, Y = z] + P [X = z, Y = z]
= Σ_{y>z} fXY (z, y) + Σ_{x>z} fXY (x, z) + fXY (z, z)
◮ Now suppose X, Y are independent and both of them
have geometric distribution with the same parameter, p.
◮ Such random variables are called independent and
identically distributed or iid random variables.
P S Sastry, IISc, E1 222 Aug 2021 91/248
◮ Now we can get the pmf of Z as (note Z ∈ {1, 2, · · · })
fZ (z) = P [X = z, Y > z] + P [Y = z, X > z] + P [X = Y = z]
= P [X = z]P [Y > z] + P [Y = z]P [X > z] + P [X = z]P [Y = z]
= 2 p(1 − p)^{z−1} (1 − p)^{z} + ( p(1 − p)^{z−1} )²
= 2p(1 − p)^{2z−1} + p²(1 − p)^{2z−2}
= p(1 − p)^{2z−2} (2(1 − p) + p)
= (2 − p)p(1 − p)^{2z−2}
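A quick simulation check of this pmf (not in the original slides); the value of p, the sample size and the seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 0.3, 1_000_000

# X, Y iid geometric on {1,2,...} with parameter p; Z = min(X, Y).
x = rng.geometric(p, size=n)
y = rng.geometric(p, size=n)
z = np.minimum(x, y)

for k in [1, 2, 3]:
    empirical = np.mean(z == k)
    formula = (2 - p) * p * (1 - p) ** (2 * k - 2)
    print(k, round(empirical, 4), round(formula, 4))
```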
P S Sastry, IISc, E1 222 Aug 2021 92/248
◮ We can show this is a pmf:
Σ_{z=1}^{∞} fZ (z) = Σ_{z=1}^{∞} (2 − p)p(1 − p)^{2z−2}
= (2 − p)p Σ_{z=1}^{∞} (1 − p)^{2z−2}
= (2 − p)p · 1/(1 − (1 − p)²)
= (2 − p)p · 1/(2p − p²) = 1
P S Sastry, IISc, E1 222 Aug 2021 93/248
◮ Let us consider the max and min functions, in general.
◮ Let Z = max(X, Y ). Then we have
FZ (z) = P [Z ≤ z] = P [max(X, Y ) ≤ z]
= P [X ≤ z, Y ≤ z]
= FXY (z, z)
= FX (z)FY (z), if X, Y are independent
= (FX (z))2 , if they are iid
◮ This is true of all random variables.
◮ Suppose X, Y are iid continuous rv. Then density of Z is
fZ (z) = 2FX (z)fX (z)
P S Sastry, IISc, E1 222 Aug 2021 94/248
◮ Suppose X, Y are iid uniform over (0, 1)
◮ Then we get df and pdf of Z = max(X, Y ) as
FZ (z) = z 2 , 0 < z < 1; and fZ (z) = 2z, 0 < z < 1
FZ (z) = 0 for z ≤ 0 and FZ (z) = 1 for z ≥ 1 and
fZ (z) = 0 outside (0, 1)
P S Sastry, IISc, E1 222 Aug 2021 95/248
◮ This is easily generalized to n random variables.
◮ Let Z = max(X1 , · · · , Xn )
FZ (z) = P [Z ≤ z] = P [max(X1 , X2 , · · · , Xn ) ≤ z]
= P [X1 ≤ z, X2 ≤ z, · · · , Xn ≤ z]
= FX1 ···Xn (z, · · · , z)
= FX1 (z) · · · FXn (z), if they are independent
= (FX (z))n , if they are iid
where we take FX as the common df
◮ For example if all Xi are uniform over (0, 1) and ind, then
FZ (z) = z n , 0 < z < 1
P S Sastry, IISc, E1 222 Aug 2021 96/248
◮ Consider Z = min(X, Y ) and X, Y independent
FZ (z) = P [Z ≤ z] = P [min(X, Y ) ≤ z]
◮ It is difficult to write this in terms of joint df of X, Y .
◮ So, we consider the following
P [Z > z] = P [min(X, Y ) > z]
= P [X > z, Y > z]
= P [X > z]P [Y > z], using independence
= (1 − FX (z))(1 − FY (z))
= (1 − FX (z))2 , if they are iid
Hence, FZ (z) = 1 − (1 − FX (z))(1 − FY (z))
◮ We can once again find density of Z if X, Y are
continuous
P S Sastry, IISc, E1 222 Aug 2021 97/248
◮ Suppose X, Y are iid uniform (0, 1).
◮ Z = min(X, Y )
FZ (z) = 1 − (1 − FX (z))2 = 1 − (1 − z)2 , 0 < z < 1
◮ We get the density of Z as
fZ (z) = 2(1 − z), 0 < z < 1
P S Sastry, IISc, E1 222 Aug 2021 98/248
◮ min fn is also easily generalized to n random variables
◮ Let Z = min(X1 , X2 , · · · , Xn )
P [Z > z] = P [min(X1 , X2 , · · · , Xn ) > z]
= P [X1 > z, · · · , Xn > z]
= P [X1 > z] · · · P [Xn > z], using independence
= (1 − FX1 (z)) · · · (1 − FXn (z))
= (1 − FX (z))n , if they are iid
◮ Hence, when Xi are iid, the df of Z is
FZ (z) = 1 − (1 − FX (z))n
where FX is the common df
P S Sastry, IISc, E1 222 Aug 2021 99/248
Joint distribution of max and min
◮ X, Y iid with df F and density f
Z = max(X, Y ) and W = min(X, Y ).
◮ We want joint distribution function of Z and W .
◮ We can use the following
P [Z ≤ z] = P [Z ≤ z, W ≤ w] + P [Z ≤ z, W > w]
P [Z ≤ z, W > w] = P [w < X, Y ≤ z] = (F (z) − F (w))2
P [Z ≤ z] = P [X ≤ z, Y ≤ z] = (F (z))2
◮ So, we get FZW as
FZW (z, w) = P [Z ≤ z, W ≤ w]
= P [Z ≤ z] − P [Z ≤ z, W > w]
= (F (z))2 − (F (z) − F (w))2
◮ Is this correct for all values of z, w?
P S Sastry, IISc, E1 222 Aug 2021 100/248
◮ We have P [w < X, Y ≤ z] = (F (z) − F (w))2 only when
w ≤ z.
◮ Otherwise it is zero.
◮ Hence we get FZW as
FZW (z, w) = (F (z))²,   if w > z
           = (F (z))² − (F (z) − F (w))²,   if w ≤ z
◮ We can get the joint density of Z, W as
fZW (z, w) = ∂²FZW (z, w) / ∂z ∂w = 2f (z)f (w),   w ≤ z
P S Sastry, IISc, E1 222 Aug 2021 101/248
◮ Let X, Y be iid uniform over (0, 1).
◮ Define Z = max(X, Y ) and W = min(X, Y ).
◮ Then the joint density of Z, W is
fZW (z, w) = 2f (z)f (w), w ≤ z
= 2, 0 < w ≤ z < 1
P S Sastry, IISc, E1 222 Aug 2021 102/248
Order Statistics
◮ Let X1 , · · · , Xn be iid with density f .
◮ Let X(k) denote the k th smallest of these.
◮ That is, X(k) = gk (X1 , · · · , Xn ) where gk : ℜn → ℜ and
the value of gk (x1 , · · · , xn ) is the k th smallest of the
numbers x1 , · · · , xn .
◮ X(1) = min(X1 , · · · , Xn ), X(n) = max(X1 , · · · , Xn )
◮ The joint distribution of X(1) , · · · X(n) is called the order
statistics.
◮ Earlier, we calculated the order statistics for the case
n = 2.
◮ It can be shown that
fX(1) ···X(n) (x1 , · · · , xn ) = n! Π_{i=1}^{n} f (xi ),   x1 < x2 < · · · < xn
P S Sastry, IISc, E1 222 Aug 2021 103/248
Marginal distributions of X(k)
◮ Let X1 , · · · , Xn be iid with df F and density f .
◮ Let X(k) denote the k th smallest of these.
◮ We want the distribution of X(k) .
◮ The event [X(k) ≤ y] is:
“at least k of these are less than or equal to y”
◮ We want probability of this event.
P S Sastry, IISc, E1 222 Aug 2021 104/248
Marginal distributions of X(k)
◮ X1 , · · · , Xn iid with df F and density f .
◮ P [Xi ≤ y] = F (y) for any i and y.
◮ Since they are independent, we have, e.g.,
P [X1 ≤ y, X2 > y, X3 ≤ y] = (F (y))2 (1 − F (y))
◮ Hence, probability that exactly k of these n random
variables are less than or equal to y is
nCk (F (y))^k (1 − F (y))^{n−k}
◮ Hence we get
FX(k) (y) = Σ_{j=k}^{n} nCj (F (y))^j (1 − F (y))^{n−j}
We can get the density by differentiating this.
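A numerical check of this formula (not in the original slides) for iid U (0, 1) samples, where F (y) = y; the values of n, k, y and the sample size are arbitrary choices, and scipy is used only for the binomial pmf terms.

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(0)
n, k, y = 5, 3, 0.4          # n iid U(0,1) rv's, k-th smallest, threshold y

# Empirical df of X_(k) at y ...
samples = np.sort(rng.uniform(size=(200_000, n)), axis=1)
empirical = np.mean(samples[:, k - 1] <= y)

# ... versus the formula: sum_{j>=k} C(n,j) F(y)^j (1-F(y))^(n-j), with F(y) = y.
formula = sum(binom.pmf(j, n, y) for j in range(k, n + 1))
print(empirical, formula)
```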
P S Sastry, IISc, E1 222 Aug 2021 105/248
Sum of two discrete rv’s
◮ Let X, Y ∈ {0, 1, · · · }
◮ Let Z = X + Y . Then we have
fZ (z) = P [X + Y = z] = Σ_{x,y: x+y=z} P [X = x, Y = y]
= Σ_{k=0}^{z} P [X = k, Y = z − k]
= Σ_{k=0}^{z} fXY (k, z − k)
◮ Now suppose X, Y are independent. Then
fZ (z) = Σ_{k=0}^{z} fX (k)fY (z − k)
P S Sastry, IISc, E1 222 Aug 2021 106/248
◮ Now suppose X, Y are independent Poisson with
parameters λ1 , λ2 . And, Z = X + Y .
fZ (z) = Σ_{k=0}^{z} fX (k)fY (z − k)
= Σ_{k=0}^{z} ( λ1^k e^{−λ1} / k! ) ( λ2^{z−k} e^{−λ2} / (z − k)! )
= e^{−(λ1 +λ2 )} (1/z!) Σ_{k=0}^{z} ( z! / (k! (z − k)!) ) λ1^k λ2^{z−k}
= e^{−(λ1 +λ2 )} (λ1 + λ2 )^z / z!
◮ Z is Poisson with parameter λ1 + λ2
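A simulation sanity check (not in the original slides): the empirical pmf of X + Y matches Poisson(λ1 + λ2 ); the parameter values, sample size and seed are arbitrary.

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(0)
lam1, lam2, n = 2.0, 3.0, 1_000_000

z = rng.poisson(lam1, n) + rng.poisson(lam2, n)   # X + Y with X, Y independent

# Compare a few empirical pmf values with Poisson(lam1 + lam2).
for k in [0, 3, 5, 8]:
    print(k, round(np.mean(z == k), 4), round(poisson.pmf(k, lam1 + lam2), 4))
```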
P S Sastry, IISc, E1 222 Aug 2021 107/248
Sum of two continuous rv
◮ Let X, Y have a joint density fXY . Let Z = X + Y
FZ (z) = P [Z ≤ z] = P [X + Y ≤ z]
= ∫∫_{{(x,y): x+y≤z}} fXY (x, y) dy dx
= ∫_{x=−∞}^{∞} ∫_{y=−∞}^{z−x} fXY (x, y) dy dx
(change variable y to t: t = x + y, dt = dy; y = z − x ⇒ t = z)
= ∫_{x=−∞}^{∞} ∫_{t=−∞}^{z} fXY (x, t − x) dt dx
= ∫_{−∞}^{z} ( ∫_{−∞}^{∞} fXY (x, t − x) dx ) dt
◮ This gives us the density of Z.
P S Sastry, IISc, E1 222 Aug 2021 108/248
◮ X, Y have joint density fXY and Z = X + Y . Then
fZ (z) = ∫_{−∞}^{∞} fXY (x, z − x) dx
◮ Now suppose X and Y are independent. Then
fZ (z) = ∫_{−∞}^{∞} fX (x) fY (z − x) dx
The density of a sum of independent random variables is the convolution of their densities:
fX+Y = fX ∗ fY (convolution)
P S Sastry, IISc, E1 222 Aug 2021 109/248
Distribution of sum of iid uniform rv’s
◮ Suppose X, Y are iid uniform over (−1, 1).
◮ Let Z = X + Y . We want fZ .
◮ The common density of X and Y is fX (x) = 0.5 for −1 < x < 1 (and zero outside).
◮ fZ is the convolution of this density with itself.
P S Sastry, IISc, E1 222 Aug 2021 110/248
◮ fX (x) = 0.5, −1 < x < 1; fY is the same.
◮ Note that Z takes values in [−2, 2].
fZ (z) = ∫_{−∞}^{∞} fX (t) fY (z − t) dt
◮ For the integrand to be non-zero we need:
◮ −1 < t < 1 ⇒ t < 1, t > −1
◮ −1 < z − t < 1 ⇒ t < z + 1, t > z − 1
◮ Hence we need: max(−1, z − 1) < t < min(1, z + 1)
◮ Hence, for z < 0 we need −1 < t < z + 1, and for z ≥ 0 we need z − 1 < t < 1
◮ Thus we get
fZ (z) = ∫_{−1}^{z+1} (1/4) dt = (z + 2)/4,   if −2 ≤ z < 0
      = ∫_{z−1}^{1} (1/4) dt = (2 − z)/4,   if 0 ≤ z ≤ 2
P S Sastry, IISc, E1 222 Aug 2021 111/248
◮ Thus, the density of the sum of two independent rv’s that are uniform over (−1, 1) is
fZ (z) = (z + 2)/4, if −2 < z < 0;   = (2 − z)/4, if 0 < z < 2
◮ This is a triangle with vertices (−2, 0), (0, 0.5), (2, 0).
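The convolution can also be done numerically (not part of the original slides): discretize the uniform density and convolve it with itself; the grid spacing is an arbitrary choice.

```python
import numpy as np

# Discretize f_X = f_Y = 0.5 on (-1, 1) and convolve numerically.
dx = 0.001
x = np.arange(-1, 1, dx)
f = np.full_like(x, 0.5)

fz = np.convolve(f, f) * dx                    # density of Z = X + Y
z = np.arange(len(fz)) * dx - 2                # support of Z is (-2, 2)

# Compare with the triangle (2 - |z|) / 4 at a few points.
for zz in [-1.0, 0.0, 0.5, 1.5]:
    i = np.argmin(np.abs(z - zz))
    print(zz, round(fz[i], 3), round((2 - abs(zz)) / 4, 3))
```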
P S Sastry, IISc, E1 222 Aug 2021 112/248
Independence of functions of random variable
◮ Suppose X and Y are independent.
◮ Then g(X) and h(Y ) are independent
◮ This is because [g(X) ∈ B1 ] = [X ∈ B̃1 ] for some Borel
set, B̃1 and similarly [h(Y ) ∈ B2 ] = [Y ∈ B̃2 ]
◮ Hence, [g(X) ∈ B1 ] and [h(Y ) ∈ B2 ] are independent.
P S Sastry, IISc, E1 222 Aug 2021 113/248
Independence of functions of random variable
◮ This is easily generalized to functions of multiple random
variables.
◮ If X, Y are vector random variables (or random vectors),
independence implies [X ∈ B1 ] is independent of
[Y ∈ B2 ] for all borel sets B1 , B2 (in appropriate spaces).
◮ Then g(X) would be independent of h(Y).
◮ That is, suppose X1 , · · · , Xm , Y1 , · · · , Yn are
independent.
◮ Then, g(X1 , · · · , Xm ) is independent of h(Y1 , · · · , Yn ).
P S Sastry, IISc, E1 222 Aug 2021 114/248
◮ Let X1 , X2 , X3 be independent continuous rv
◮ Z = X1 + X2 + X3 .
◮ Can we find density of Z?
◮ Let W = X1 + X2 . We know how to find its density
◮ Then Z = W + X3 and W and X3 are independent.
◮ So, density of Z is the convolution of the densities of W
and X3 .
P S Sastry, IISc, E1 222 Aug 2021 115/248
◮ Suppose X, Y are iid exponential rv’s:
fX (x) = λ e^{−λx} ,   x > 0
◮ Let Z = X + Y . Then the density of Z is
fZ (z) = ∫_{−∞}^{∞} fX (x) fY (z − x) dx
= ∫_{0}^{z} λ e^{−λx} λ e^{−λ(z−x)} dx
= λ² e^{−λz} ∫_{0}^{z} dx = λ² z e^{−λz}
◮ Thus, the sum of independent exponential random variables has a gamma distribution:
fZ (z) = λz λ e^{−λz} ,   z > 0
P S Sastry, IISc, E1 222 Aug 2021 116/248
Sum of independent gamma rv
◮ The gamma density with parameters α > 0 and λ > 0 is given by
f (x) = (1/Γ(α)) λ^α x^{α−1} e^{−λx} ,   x > 0
We will call this Gamma(α, λ).
◮ The α is called the shape parameter and λ is called the
rate parameter.
◮ For α = 1 this is the exponential density.
◮ Let X ∼ Gamma(α1 , λ), Y ∼ Gamma(α2 , λ).
Suppose X, Y are independent.
◮ Let Z = X + Y . Then Z ∼ Gamma(α1 + α2 , λ).
P S Sastry, IISc, E1 222 Aug 2021 117/248
fZ (z) = ∫_{−∞}^{∞} fX (x) fY (z − x) dx
= ∫_{0}^{z} (1/Γ(α1 )) λ^{α1} x^{α1 −1} e^{−λx} · (1/Γ(α2 )) λ^{α2} (z − x)^{α2 −1} e^{−λ(z−x)} dx
= ( λ^{α1 +α2} e^{−λz} / (Γ(α1 )Γ(α2 )) ) ∫_{0}^{z} z^{α1 −1} (x/z)^{α1 −1} z^{α2 −1} (1 − x/z)^{α2 −1} dx
(change the variable: t = x/z ⇒ z^{−1} dx = dt)
= ( λ^{α1 +α2} e^{−λz} / (Γ(α1 )Γ(α2 )) ) z^{α1 +α2 −1} ∫_{0}^{1} t^{α1 −1} (1 − t)^{α2 −1} dt
= (1/Γ(α1 + α2 )) λ^{α1 +α2} z^{α1 +α2 −1} e^{−λz}
Because
∫_{0}^{1} t^{α1 −1} (1 − t)^{α2 −1} dt = Γ(α1 )Γ(α2 ) / Γ(α1 + α2 )
P S Sastry, IISc, E1 222 Aug 2021 118/248
◮ If X, Y are independent gamma random variables then
X + Y also has gamma distribution.
◮ If X ∼ Gamma(α1 , λ), and Y ∼ Gamma(α2 , λ), then
X + Y ∼ Gamma(α1 + α2 , λ).
P S Sastry, IISc, E1 222 Aug 2021 119/248
Sum of independent Gaussians
◮ The sum of independent Gaussian random variables is a Gaussian rv.
◮ If X ∼ N (µ1 , σ12 ) and Y ∼ N (µ2 , σ22 ) and X, Y are
independent, then
X + Y ∼ N (µ1 + µ2 , σ12 + σ22 )
◮ We can show this.
◮ The algebra is a little involved.
◮ There is a calculation trick that is often useful with
Gaussian density
P S Sastry, IISc, E1 222 Aug 2021 120/248
A Calculation Trick
I = ∫_{−∞}^{∞} exp( −(x² − 2bx + c)/(2K) ) dx
= ∫_{−∞}^{∞} exp( −((x − b)² + c − b²)/(2K) ) dx
= ∫_{−∞}^{∞} exp( −(x − b)²/(2K) ) exp( −(c − b²)/(2K) ) dx
= exp( −(c − b²)/(2K) ) √(2πK)
because
(1/√(2πK)) ∫_{−∞}^{∞} exp( −(x − b)²/(2K) ) dx = 1
P S Sastry, IISc, E1 222 Aug 2021 121/248
◮ We next look at a general theorem that is quite useful in
dealing with functions of multiple random variables.
◮ This result is only for continuous random variables.
P S Sastry, IISc, E1 222 Aug 2021 122/248
◮ Let X1 , · · · , Xn be continuous random variables with
joint density fX1 ···Xn . We define Y1 , · · · Yn by
Y1 = g1 (X1 , · · · , Xn ) ··· Yn = gn (X1 , · · · , Xn )
We think of gi as components of g : ℜn → ℜn .
◮ We assume g is continuous with continuous first partials
and is invertible.
◮ Let h be the inverse of g. That is
X1 = h1 (Y1 , · · · , Yn ) ··· Xn = hn (Y1 , · · · , Yn )
◮ Each of gi , hi are ℜn → ℜ functions and we can write
them as
yi = gi (x1 , · · · , xn ); ··· xi = hi (y1 , · · · , yn )
We denote the partial derivatives of these functions by ∂xi /∂yj etc.
P S Sastry, IISc, E1 222 Aug 2021 123/248
◮ The Jacobian of the inverse transformation is
J = ∂(x1 , · · · , xn )/∂(y1 , · · · , yn ) = det [ ∂xi /∂yj ],
the determinant of the n × n matrix whose (i, j)-th entry is ∂xi /∂yj .
◮ We assume that J is non-zero in the range of the transformation.
◮ Theorem: Under the above conditions, we have
fY1 ···Yn (y1 , · · · , yn ) = |J| fX1 ···Xn (h1 (y1 , · · · , yn ), · · · , hn (y1 , · · · , yn ))
Or, more compactly, fY (y) = |J| fX (h(y))
P S Sastry, IISc, E1 222 Aug 2021 124/248
◮ Let X1 , X2 have a joint density, fX . Consider
Y1 = g1 (X1 , X2 ) = X1 + X2   (g1 (a, b) = a + b)
Y2 = g2 (X1 , X2 ) = X1 − X2   (g2 (a, b) = a − b)
This transformation is invertible:
X1 = h1 (Y1 , Y2 ) = (Y1 + Y2 )/2   (h1 (a, b) = (a + b)/2)
X2 = h2 (Y1 , Y2 ) = (Y1 − Y2 )/2   (h2 (a, b) = (a − b)/2)
The Jacobian is: det [ 0.5  0.5 ; 0.5  −0.5 ] = −0.5.
◮ This gives: fY1 Y2 (y1 , y2 ) = 0.5 fX1 X2 ( (y1 + y2 )/2 , (y1 − y2 )/2 )
P S Sastry, IISc, E1 222 Aug 2021 125/248
Proof of Theorem
◮ Let B = (−∞, y1 ] × · · · × (−∞, yn ] ⊂ ℜn . Then
FY (y) = FY1 ···Yn (y1 , · · · , yn ) = P [Yi ≤ yi , i = 1, · · · , n]
= ∫_B fY1 ···Yn (y1′ , · · · , yn′ ) dy1′ · · · dyn′
◮ Define
g^{−1}(B) = {(x1 , · · · , xn ) ∈ ℜn : g(x1 , · · · , xn ) ∈ B}
= {(x1 , · · · , xn ) ∈ ℜn : gi (x1 , · · · , xn ) ≤ yi , i = 1, · · · , n}
◮ Then we have
FY1 ···Yn (y1 , · · · , yn ) = P [gi (X1 , · · · , Xn ) ≤ yi , i = 1, · · · , n]
= ∫_{g^{−1}(B)} fX1 ···Xn (x1′ , · · · , xn′ ) dx1′ · · · dxn′
P S Sastry, IISc, E1 222 Aug 2021 126/248
Proof of Theorem
◮ B = (−∞, y1 ] × · · · × (−∞, yn ].
◮ g^{−1}(B) = {(x1 , · · · , xn ) ∈ ℜn : g(x1 , · · · , xn ) ∈ B}
FY (y1 , · · · , yn ) = P [gi (X1 , · · · , Xn ) ≤ yi , i = 1, · · · , n]
= ∫_{g^{−1}(B)} fX1 ···Xn (x1′ , · · · , xn′ ) dx1′ · · · dxn′
change variables: yi′ = gi (x1′ , · · · , xn′ ), i = 1, · · · , n
(x1′ , · · · , xn′ ) ∈ g^{−1}(B) ⇒ (y1′ , · · · , yn′ ) ∈ B
xi′ = hi (y1′ , · · · , yn′ ),   dx1′ · · · dxn′ = |J| dy1′ · · · dyn′
FY (y1 , · · · , yn ) = ∫_B fX1 ···Xn (h1 (y′ ), · · · , hn (y′ )) |J| dy1′ · · · dyn′
⇒ fY1 ···Yn (y1 , · · · , yn ) = fX1 ···Xn (h1 (y), · · · , hn (y)) |J|
P S Sastry, IISc, E1 222 Aug 2021 127/248
◮ X1 , · · · Xn are continuous rv with joint density
Y1 = g1 (X1 , · · · , Xn ) ··· Yn = gn (X1 , · · · , Xn )
◮ The transformation is continuous with continuous first
partials and is invertible and
X1 = h1 (Y1 , · · · , Yn ) ··· Xn = hn (Y1 , · · · , Yn )
◮ We assume the Jacobian of the inverse transform, J, is
non-zero
◮ Then the density of Y is
fY1 ···Yn (y1 , · · · , yn ) = |J|fX1 ···Xn (h1 (y1 , · · · , yn ), · · · , hn (y1 , · · · , yn ))
◮ Called multidimensional change of variable formula
P S Sastry, IISc, E1 222 Aug 2021 128/248
◮ Let X, Y have joint density fXY . Let Z = X + Y .
◮ We want to find fZ using the theorem.
◮ To use the theorem, we need an invertible transformation
of ℜ2 onto ℜ2 of which one component is x + y.
◮ Take Z = X + Y and W = X − Y . This is invertible.
◮ X = (Z + W )/2 and Y = (Z − W )/2. The Jacobian is
J = det [ 1/2  1/2 ; 1/2  −1/2 ] = −1/2
◮ Hence we get
fZW (z, w) = (1/2) fXY ( (z + w)/2 , (z − w)/2 )
◮ Now we get the density of Z as
fZ (z) = ∫_{−∞}^{∞} (1/2) fXY ( (z + w)/2 , (z − w)/2 ) dw
P S Sastry, IISc, E1 222 Aug 2021 129/248
◮ Let Z = X + Y and W = X − Y . Then
fZ (z) = ∫_{−∞}^{∞} (1/2) fXY ( (z + w)/2 , (z − w)/2 ) dw
(change the variable: t = (z + w)/2 ⇒ dt = (1/2) dw ⇒ w = 2t − z ⇒ z − w = 2z − 2t)
fZ (z) = ∫_{−∞}^{∞} fXY (t, z − t) dt = ∫_{−∞}^{∞} fXY (z − s, s) ds
◮ We get the same result as earlier. If X, Y are independent,
fZ (z) = ∫_{−∞}^{∞} fX (t) fY (z − t) dt
P S Sastry, IISc, E1 222 Aug 2021 130/248
◮ Let Z = X + Y and W = X − Y . We got
fZW (z, w) = (1/2) fXY ( (z + w)/2 , (z − w)/2 )
◮ Now we can calculate fW also:
fW (w) = ∫_{−∞}^{∞} (1/2) fXY ( (z + w)/2 , (z − w)/2 ) dz
(change the variable: t = (z + w)/2 ⇒ dt = (1/2) dz ⇒ z = 2t − w ⇒ z − w = 2t − 2w)
fW (w) = ∫_{−∞}^{∞} fXY (t, t − w) dt = ∫_{−∞}^{∞} fXY (s + w, s) ds
P S Sastry, IISc, E1 222 Aug 2021 131/248
Example
◮ Let X, Y be iid U [0, 1]. Let Z = X − Y .
fZ (z) = ∫_{−∞}^{∞} fX (t) fY (t − z) dt
◮ For the integrand to be non-zero:
◮ 0 ≤ t ≤ 1 ⇒ t ≥ 0, t ≤ 1
◮ 0 ≤ t − z ≤ 1 ⇒ t ≥ z, t ≤ 1 + z
◮ ⇒ max(0, z) ≤ t ≤ min(1, 1 + z)
◮ Thus, we get the density as (note Z ∈ (−1, 1))
fZ (z) = ∫_{0}^{1+z} 1 dt = 1 + z,   if −1 ≤ z ≤ 0
      = ∫_{z}^{1} 1 dt = 1 − z,   if 0 ≤ z ≤ 1
◮ Thus, when X, Y ∼ U (0, 1) iid,
fX−Y (z) = 1 − |z|,   −1 < z < 1
P S Sastry, IISc, E1 222 Aug 2021 132/248
◮ We showed that
fX+Y (z) = ∫_{−∞}^{∞} fXY (t, z − t) dt = ∫_{−∞}^{∞} fXY (z − t, t) dt
fX−Y (w) = ∫_{−∞}^{∞} fXY (t, t − w) dt = ∫_{−∞}^{∞} fXY (t + w, t) dt
◮ Suppose X, Y are discrete. Then we have
fX+Y (z) = P [X + Y = z] = Σk P [X = k, Y = z − k] = Σk fXY (k, z − k)
fX−Y (w) = P [X − Y = w] = Σk P [X = k, Y = k − w] = Σk fXY (k, k − w)
P S Sastry, IISc, E1 222 Aug 2021 133/248
Distribution of product of random variables
◮ We want the density of Z = XY .
◮ We need one more function to make an invertible transformation.
◮ A possible choice: Z = XY , W = Y
◮ This is invertible: X = Z/W , Y = W
J = det [ 1/w  −z/w² ; 0  1 ] = 1/w
◮ Hence we get
fZW (z, w) = (1/|w|) fXY (z/w, w)
◮ Thus we get the density of the product as
fZ (z) = ∫_{−∞}^{∞} (1/|w|) fXY (z/w, w) dw
P S Sastry, IISc, E1 222 Aug 2021 134/248
Density of XY
◮ Let X, Y have joint density fXY . Let Z = XY .
◮ We can find density of XY directly also (but it is more
complicated)
◮ Let Az = {(x, y) ∈ ℜ2 : xy ≤ z} ⊂ ℜ2 .
FZ (z) = P [XY ≤ z] = P [(X, Y ) ∈ Az ] = ∫∫_{Az} fXY (x, y) dy dx
◮ We need to find the limits for integrating over Az .
◮ If x > 0, then xy ≤ z ⇒ y ≤ z/x; if x < 0, then xy ≤ z ⇒ y ≥ z/x. Hence
FZ (z) = ∫_{−∞}^{0} ∫_{z/x}^{∞} fXY (x, y) dy dx + ∫_{0}^{∞} ∫_{−∞}^{z/x} fXY (x, y) dy dx
P S Sastry, IISc, E1 222 Aug 2021 135/248
Z 0 Z ∞ Z ∞ Z z/x
FZ (z) = fXY (x, y) dy dx + fXY (x, y) dy dx
−∞ z/x 0 −∞
◮ Change variable from y to t using t = xy
y = t/x; dy = x1 dt; y = z/x ⇒ t = z
Z 0 Z −∞ Z ∞Z z
1 t 1 t
FZ (z) = fXY (x, ) dt dx + fXY (x, ) dt d
−∞ z x x 0 −∞ x x
Z 0 Z z Z ∞Z z
1 t 1 t
= fXY (x, ) dt dx + fXY (x, )
−∞ x x −∞ x x
Z−∞
∞ Z z 0
1 t
= fXY x, dt dx
−∞ −∞ x x
Z z Z ∞
1 t
= fXY x, dx dt
−∞ −∞ x x
R∞
This shows: fZ (z) = −∞ x1 fXY x, xz dx
P S Sastry, IISc, E1 222 Aug 2021 136/248
example
◮ Let X, Y be iid U (0, 1). Let Z = XY .
  fZ (z) = ∫_{−∞}^{∞} (1/|w|) fX (z/w) fY (w) dw
◮ We need 0 < w < 1 and 0 < z/w < 1, i.e., z < w. Hence
  fZ (z) = ∫_z^{1} (1/w) dw = − ln(z),   0 < z < 1
P S Sastry, IISc, E1 222 Aug 2021 137/248
◮ X, Y have joint density and Z = XY . Then
fZ (z) = ∫_{−∞}^{∞} (1/|w|) fXY ( z/w , w ) dw
Suppose X, Y are discrete and Z = XY
  fZ (0) = P [X = 0 or Y = 0] = Σ_x fXY (x, 0) + Σ_y fXY (0, y) − fXY (0, 0)
  fZ (k) = Σ_{y≠0} P [X = k/y, Y = y] = Σ_{y≠0} fXY (k/y, y),   k ≠ 0
◮ Note: formulas for densities cannot simply be converted into formulas for mass functions by replacing integrals with sums!
P S Sastry, IISc, E1 222 Aug 2021 138/248
◮ We wanted density of Z = XY .
◮ We used: Z = XY and W = Y .
◮ We could have used: Z = XY and W = X.
◮ This is invertible: X = W and Y = Z/W .
    J = |  0      1    |
        | 1/w   −z/w²  |  = −1/w
◮ This gives (using |J| = 1/|w|)
  fZW (z, w) = (1/|w|) fXY ( w , z/w )
  fZ (z) = ∫_{−∞}^{∞} (1/|w|) fXY ( w , z/w ) dw
◮ The density fZ should be the same in both cases.
P S Sastry, IISc, E1 222 Aug 2021 139/248
Distributions of quotients
◮ X, Y have joint density and Z = X/Y .
◮ We can take: Z = X/Y W =Y
◮ This is invertible: X = ZW Y = W
    J = | w   z |
        | 0   1 |  = w
◮ Hence we get
fZW (z, w) = |w| fXY (zw, w)
◮ Thus we get the density of quotient as
fZ (z) = ∫_{−∞}^{∞} |w| fXY (zw, w) dw
P S Sastry, IISc, E1 222 Aug 2021 140/248
example
◮ Let X, Y be iid U (0, 1). Let Z = X/Y .
Note Z ∈ (0, ∞)
fZ (z) = ∫_{−∞}^{∞} |w| fX (zw) fY (w) dw
◮ We need 0 < w < 1 and 0 < zw < 1 ⇒ w < 1/z.
◮ So, when z ≤ 1, w goes from 0 to 1; when z > 1, w goes
from 0 to 1/z.
◮ Hence we get the density as
  fZ (z) = { ∫_0^{1} w dw = 1/2,           if 0 < z ≤ 1
           { ∫_0^{1/z} w dw = 1/(2z²),     if 1 < z < ∞
P S Sastry, IISc, E1 222 Aug 2021 141/248
◮ X, Y have joint density and Z = X/Y
  fZ (z) = ∫_{−∞}^{∞} |w| fXY (zw, w) dw
◮ Suppose X, Y are discrete and Z = X/Y
  fZ (z) = P [Z = z] = P [X/Y = z] = Σ_y P [X = yz, Y = y] = Σ_y fXY (yz, y)
P S Sastry, IISc, E1 222 Aug 2021 142/248
◮ We chose: Z = X/Y and W = Y .
◮ We could have taken: Z = X/Y and W = X
◮ The inverse is: X = W and Y = W/Z
    J = |   0       1   |
        | −w/z²    1/z  |  = w/z²
◮ Thus, using |J| = |w|/z², we get the density of the quotient as
  fZ (z) = ∫_{−∞}^{∞} (|w|/z²) fXY ( w , w/z ) dw
  Put t = w/z ⇒ dt = dw/z, w = tz:
         = ∫_{−∞}^{∞} |t| fXY (tz, t) dt
◮ We can show that the density of the quotient is the same in both these approaches.
P S Sastry, IISc, E1 222 Aug 2021 143/248
Summary: Densities of standard functions of rv’s
◮ We derived densities of sum, difference, product and
quotient of random variables.
  fX+Y (z) = ∫_{−∞}^{∞} fXY (t, z − t) dt = ∫_{−∞}^{∞} fXY (z − t, t) dt
  fX−Y (z) = ∫_{−∞}^{∞} fXY (t, t − z) dt = ∫_{−∞}^{∞} fXY (t + z, t) dt
  fX∗Y (z) = ∫_{−∞}^{∞} (1/|t|) fXY ( z/t , t ) dt = ∫_{−∞}^{∞} (1/|t|) fXY ( t , z/t ) dt
  f(X/Y) (z) = ∫_{−∞}^{∞} |t| fXY (zt, t) dt = ∫_{−∞}^{∞} (|t|/z²) fXY ( t , t/z ) dt
P S Sastry, IISc, E1 222 Aug 2021 144/248
Exchangeable Random Variables
◮ X1 , X2 , · · · , Xn are said to be exchangeable if their joint
distribution is same as that of any permutation of them.
◮ Let (i1 , · · · , in ) be a permutation of (1, 2, · · · , n). Then the joint df of (Xi1 , · · · , Xin ) should be the same as that of (X1 , · · · , Xn )
◮ Take n = 3. Suppose FX1 X2 X3 (a, b, c) = g(a, b, c). If they
are exchangeable, then
FX2 X3 X1 (a, b, c) = P [X2 ≤ a, X3 ≤ b, X1 ≤ c]
= P [X1 ≤ c, X2 ≤ a, X3 ≤ b]
= g(c, a, b) = g(a, b, c)
◮ The df or density should be “symmetric” in its variables if
the random variables are exchangeable.
P S Sastry, IISc, E1 222 Aug 2021 145/248
◮ Consider the density of three random variables
f (x, y, z) = (2/3)(x + y + z),   0 < x, y, z < 1
◮ They are exchangeable (because f is symmetric in its arguments: f (x, y, z) = f (y, x, z) = f (x, z, y) = · · · )
◮ If random variables are exchangeable then they are
identically distributed.
FXY Z (a, ∞, ∞) = FXY Z (∞, ∞, a) ⇒ FX (a) = FZ (a)
◮ The above example shows that exchangeable random
variables need not be independent. The joint density is
not factorizable.
  ∫_0^{1} ∫_0^{1} (2/3)(x + y + z) dy dz = 2(x + 1)/3
◮ So, the joint density is not the product of marginals
P S Sastry, IISc, E1 222 Aug 2021 146/248
Expectation of functions of multiple rv
◮ Theorem: Let Z = g(X1 , · · · Xn ) = g(X). Then
  E[Z] = ∫_{ℜ^n} g(x) dFX (x)
◮ That is, if they have a joint density, then
  E[Z] = ∫_{ℜ^n} g(x) fX (x) dx
◮ Similarly, if all Xi are discrete,
  E[Z] = Σ_x g(x) fX (x)
P S Sastry, IISc, E1 222 Aug 2021 147/248
◮ Let Z = X + Y . Let X, Y have joint density fXY
  E[X + Y ] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (x + y) fXY (x, y) dx dy
            = ∫_{−∞}^{∞} x [ ∫_{−∞}^{∞} fXY (x, y) dy ] dx + ∫_{−∞}^{∞} y [ ∫_{−∞}^{∞} fXY (x, y) dx ] dy
            = ∫_{−∞}^{∞} x fX (x) dx + ∫_{−∞}^{∞} y fY (y) dy
            = E[X] + E[Y ]
◮ Expectation is a linear operator.
◮ This is true for all random variables.
P S Sastry, IISc, E1 222 Aug 2021 148/248
◮ We saw E[X + Y ] = E[X] + E[Y ].
◮ Let us calculate Var(X + Y ).
  Var(X + Y ) = E[ ((X + Y ) − E[X + Y ])² ]
              = E[ ((X − EX) + (Y − EY ))² ]
              = E[ (X − EX)² ] + E[ (Y − EY )² ] + 2 E[ (X − EX)(Y − EY ) ]
              = Var(X) + Var(Y ) + 2 Cov(X, Y )
where we define covariance between X, Y as
Cov(X, Y ) = E [(X − EX)(Y − EY )]
P S Sastry, IISc, E1 222 Aug 2021 149/248
◮ We define covariance between X and Y by
Cov(X, Y ) = E [(X − EX)(Y − EY )]
= E [XY − X(EY ) − Y (EX) + EX EY ]
= E[XY ] − EX EY
◮ Note that Cov(X, Y ) can be positive or negative
◮ X and Y are said to be uncorrelated if Cov(X, Y ) = 0
◮ If X and Y are uncorrelated then
Var(X + Y ) = Var(X) + Var(Y )
◮ Note that E[X + Y ] = E[X] + E[Y ] for all random
variables.
P S Sastry, IISc, E1 222 Aug 2021 150/248
Example
◮ Consider the joint density
fXY (x, y) = 2, 0 < x < y < 1
◮ We want to calculate Cov(X, Y )
  EX = ∫_0^{1} ∫_x^{1} x · 2 dy dx = 2 ∫_0^{1} x (1 − x) dx = 1/3
  EY = ∫_0^{1} ∫_0^{y} y · 2 dx dy = 2 ∫_0^{1} y² dy = 2/3
  E[XY ] = ∫_0^{1} ∫_0^{y} x y · 2 dx dy = 2 ∫_0^{1} y (y²/2) dy = 1/4
◮ Hence, Cov(X, Y ) = E[XY ] − EX EY = 1/4 − 2/9 = 1/36
P S Sastry, IISc, E1 222 Aug 2021 151/248
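These moments can be cross-checked by simulation. The sketch below relies on an auxiliary fact not stated in the slides: for two iid U(0, 1) variables, the pair (min, max) has exactly this joint density 2 on 0 < x < y < 1.

    # Monte Carlo check of EX = 1/3, EY = 2/3, E[XY] = 1/4, Cov(X,Y) = 1/36
    # for the density f(x,y) = 2 on 0 < x < y < 1.
    import numpy as np

    rng = np.random.default_rng(1)
    u = rng.uniform(0.0, 1.0, size=(1_000_000, 2))
    x = u.min(axis=1)   # (min, max) of two iid uniforms has joint density 2 on x < y
    y = u.max(axis=1)

    print("EX    ~", x.mean(), " (exact 1/3)")
    print("EY    ~", y.mean(), " (exact 2/3)")
    print("E[XY] ~", (x * y).mean(), " (exact 1/4)")
    print("Cov   ~", np.cov(x, y)[0, 1], " (exact 1/36 ≈ 0.0278)")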
Independent random variables are uncorrelated
◮ Suppose X, Y are independent. Then
  E[XY ] = ∫∫ x y fXY (x, y) dx dy
         = ∫∫ x y fX (x) fY (y) dx dy
         = [ ∫ x fX (x) dx ] [ ∫ y fY (y) dy ] = EX EY
◮ Then, Cov(X, Y ) = E[XY ] − EX EY = 0.
◮ X, Y independent ⇒ X, Y uncorrelated
P S Sastry, IISc, E1 222 Aug 2021 152/248
Uncorrelated random variables may not be
independent
◮ Suppose X ∼ N (0, 1). Then EX = E[X³] = 0
◮ Let Y = X². Then
  E[XY ] = E[X³] = 0 = EX EY
◮ Thus X, Y are uncorrelated.
◮ Are they independent? No.
  e.g., P [X > 2 | Y < 1] = 0 ≠ P [X > 2]
◮ That X, Y are uncorrelated does not imply they are independent.
P S Sastry, IISc, E1 222 Aug 2021 153/248
◮ We define the correlation coefficient of X, Y by
  ρXY = Cov(X, Y ) / √( Var(X) Var(Y ) )
◮ If X, Y are uncorrelated then ρXY = 0.
◮ We will show that |ρXY | ≤ 1
◮ Hence −1 ≤ ρXY ≤ 1, ∀X, Y
P S Sastry, IISc, E1 222 Aug 2021 154/248
◮ We have E [(αX + βY )2 ] ≥ 0, ∀α, β ∈ ℜ
α2 E[X 2 ] + β 2 E[Y 2 ] + 2αβE[XY ] ≥ 0, ∀α, β ∈ ℜ
  Take α = − E[XY ] / E[X²]:
  (E[XY ])² / E[X²] + β² E[Y²] − 2β (E[XY ])² / E[X²] ≥ 0, ∀β ∈ ℜ
  aβ² + bβ + c ≥ 0, ∀β ⇒ b² − 4ac ≤ 0
  ⇒ 4 ( (E[XY ])² / E[X²] )² − 4 E[Y²] (E[XY ])² / E[X²] ≤ 0
  ⇒ (E[XY ])⁴ / (E[X²])² ≤ E[Y²] (E[XY ])² / E[X²]
  ⇒ (E[XY ])² ≤ E[X²] E[Y²]
P S Sastry, IISc, E1 222 Aug 2021 155/248
◮ We showed that
(E[XY ])2 ≤ E[X 2 ]E[Y 2 ]
◮ Take X − EX in place of X and Y − EY in place of Y
in the above algebra.
◮ This gives us
(E[(X − EX)(Y − EY )])2 ≤ E[(X−EX)2 ]E[(Y −EY )2 ]
⇒ (Cov(X, Y ))2 ≤ Var(X)Var(Y )
◮ Hence we get
  ρXY² = ( Cov(X, Y ) / √( Var(X) Var(Y ) ) )² ≤ 1
◮ Equality holds here only if E[ (α(X − EX) + β(Y − EY ))² ] = 0 for some α, β (not both zero).
  Thus, |ρXY | = 1 only if α(X − EX) + β(Y − EY ) = 0
◮ The correlation coefficient of X, Y is ±1 only when Y is an affine (‘linear’) function of X
P S Sastry, IISc, E1 222 Aug 2021 156/248
Linear Least Squares Estimation
◮ Suppose we want to approximate Y as an affine function
of X.
◮ We want a, b to minimize E [(Y − (aX + b))2 ]
◮ For a fixed a, what is the b that minimizes
E [((Y − aX) − b)2 ] ?
◮ We know the best b here is:
b = E[Y − aX] = EY − aEX.
◮ So, we want to find the best a to minimize
J(a) = E [(Y − aX − (EY − aEX))2 ]
P S Sastry, IISc, E1 222 Aug 2021 157/248
◮ We want to find a to minimize
  J(a) = E[ (Y − aX − (EY − aEX))² ]
       = E[ ((Y − EY ) − a(X − EX))² ]
       = E[ (Y − EY )² ] + a² E[ (X − EX)² ] − 2a E[ (Y − EY )(X − EX) ]
       = Var(Y ) + a² Var(X) − 2a Cov(X, Y )
◮ So, setting dJ/da = 0, the optimal a satisfies
  2a Var(X) − 2 Cov(X, Y ) = 0 ⇒ a = Cov(X, Y ) / Var(X)
P S Sastry, IISc, E1 222 Aug 2021 158/248
◮ The final mean square error, say, J ∗ is
  J ∗ = Var(Y ) + a² Var(X) − 2a Cov(X, Y )
      = Var(Y ) + (Cov(X, Y )/Var(X))² Var(X) − 2 (Cov(X, Y )/Var(X)) Cov(X, Y )
      = Var(Y ) − (Cov(X, Y ))² / Var(X)
      = Var(Y ) [ 1 − (Cov(X, Y ))² / (Var(Y ) Var(X)) ]
      = Var(Y ) (1 − ρXY²)
P S Sastry, IISc, E1 222 Aug 2021 159/248
◮ The best mean-square approximation of Y as a ‘linear’
function of X is
  (Cov(X, Y )/Var(X)) X + ( EY − (Cov(X, Y )/Var(X)) EX )
◮ Called the line of regression of Y on X.
◮ If Cov(X, Y ) = 0 then this reduces to approximating Y by a constant, EY .
◮ The final mean square error is Var(Y )(1 − ρXY²)
◮ If ρXY = ±1 then the error is zero
◮ If ρXY = 0 the final error is Var(Y )
P S Sastry, IISc, E1 222 Aug 2021 160/248
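A minimal sketch of this recipe on synthetic data (the data-generating model and sample size are arbitrary illustrative choices, not from the slides): estimate a = Cov(X, Y)/Var(X) and b = EY − a EX from samples and compare with an off-the-shelf least-squares fit and with the predicted error Var(Y)(1 − ρ²).

    # Sketch: the minimum-MSE line a X + b has a = Cov(X,Y)/Var(X), b = EY - a EX.
    import numpy as np

    rng = np.random.default_rng(2)
    n = 200_000
    x = rng.normal(0.0, 2.0, n)
    y = 1.5 * x + 3.0 + rng.normal(0.0, 1.0, n)   # Y = 1.5 X + 3 + noise (illustrative model)

    a = np.cov(x, y, bias=True)[0, 1] / np.var(x)
    b = y.mean() - a * x.mean()
    print("a, b from Cov/Var formulas:", a, b)               # close to 1.5 and 3.0
    print("np.polyfit (least squares):", np.polyfit(x, y, 1))  # essentially the same line

    rho = np.corrcoef(x, y)[0, 1]
    print("residual MSE:", np.mean((y - (a * x + b))**2),
          "  Var(Y)(1 - rho^2):", np.var(y) * (1 - rho**2))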
◮ The covariance of X, Y is
Cov(X, Y ) = E[(X−EX) (Y −EY )] = E[XY ]−EX EY
Note that Cov(X, X) = Var(X)
◮ X, Y are called uncorrelated if Cov(X, Y ) = 0.
◮ X, Y independent ⇒ X, Y uncorrelated.
◮ Uncorrelated random variables need not necessarily be
independent
◮ Covariance plays an important role in linear least squares
estimation.
◮ Informally, covariance captures the ‘linear dependence’
between the two random variables.
P S Sastry, IISc, E1 222 Aug 2021 161/248
Covariance Matrix
◮ Let X1 , · · · , Xn be random variables (on the same
probability space)
◮ We represent them as a vector X.
◮ As a notation, all vectors are column vectors:
X = (X1 , · · · , Xn )T
◮ We denote E[X] = (EX1 , · · · , EXn )T
◮ The n × n matrix whose (i, j)th element is Cov(Xi , Xj ) is
called the covariance matrix (or variance-covariance
matrix) of X. It is denoted by ΣX .
         [ Cov(X1 , X1 )  Cov(X1 , X2 )  · · ·  Cov(X1 , Xn ) ]
  ΣX =   [ Cov(X2 , X1 )  Cov(X2 , X2 )  · · ·  Cov(X2 , Xn ) ]
         [      ⋮              ⋮          ⋱          ⋮        ]
         [ Cov(Xn , X1 )  Cov(Xn , X2 )  · · ·  Cov(Xn , Xn ) ]
P S Sastry, IISc, E1 222 Aug 2021 162/248
Covariance matrix
◮ If a = (a1 , · · · , an )T then a aT is an n × n matrix whose (i, j)th element is ai aj .
◮ Hence we get
  ΣX = E[ (X − EX) (X − EX)T ]
◮ This is because
  ( (X − EX) (X − EX)T )ij = (Xi − EXi )(Xj − EXj )
  and (ΣX )ij = E[(Xi − EXi )(Xj − EXj )]
P S Sastry, IISc, E1 222 Aug 2021 163/248
◮ Recall the following about vectors and matrices
◮ Let a, b ∈ ℜn be column vectors. Then
  (aT b)² = (aT b)(aT b) = (bT a)(aT b) = bT (a aT ) b
◮ Let A be an n × n matrix with elements aij . Then
  bT A b = Σ_{i,j=1}^{n} bi bj aij
where b = (b1 , · · · , bn )T
◮ A is said to be positive semidefinite if bT Ab ≥ 0, ∀b
P S Sastry, IISc, E1 222 Aug 2021 164/248
◮ ΣX is a real symmetric matrix
◮ It is positive semidefinite.
◮ Let a ∈ ℜn and let Y = aT X.
◮ Then, EY = aT EX. We get variance of Y as
  Var(Y ) = E[(Y − EY )²] = E[ (aT X − aT EX)² ]
          = E[ (aT (X − EX))² ]
          = E[ aT (X − EX) (X − EX)T a ]
          = aT E[ (X − EX) (X − EX)T ] a
          = aT ΣX a
◮ This gives aT ΣX a ≥ 0, ∀a
◮ This shows ΣX is positive semidefinite
P S Sastry, IISc, E1 222 Aug 2021 165/248
◮ Y = aT X = Σ_i ai Xi – a linear combination of the Xi ’s.
◮ We know how to find its mean and variance:
  EY = aT EX = Σ_i ai EXi ;     Var(Y ) = aT ΣX a = Σ_{i,j} ai aj Cov(Xi , Xj )
◮ Specifically, by taking all components of a to be 1, we get
  Var( Σ_{i=1}^{n} Xi ) = Σ_{i,j=1}^{n} Cov(Xi , Xj ) = Σ_{i=1}^{n} Var(Xi ) + Σ_{i=1}^{n} Σ_{j≠i} Cov(Xi , Xj )
◮ If Xi are independent, variance of sum is sum of
variances.
P S Sastry, IISc, E1 222 Aug 2021 166/248
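A small numerical illustration of Var(aᵀX) = aᵀΣX a and of the variance-of-sum identity, using an arbitrary positive definite Σ and an arbitrary vector a as a test case:

    # Sketch: Var(a^T X) = a^T Sigma a, and Var(sum X_i) = sum of all Cov(X_i, X_j).
    import numpy as np

    rng = np.random.default_rng(3)
    Sigma = np.array([[2.0, 0.8, 0.3],
                      [0.8, 1.0, 0.5],
                      [0.3, 0.5, 1.5]])
    X = rng.multivariate_normal(mean=np.zeros(3), cov=Sigma, size=500_000)

    a = np.array([1.0, -2.0, 0.5])
    print("Var(a^T X) empirically:", np.var(X @ a))
    print("a^T Sigma a           :", a @ Sigma @ a)

    ones = np.ones(3)
    print("Var(sum X_i) empirically:", np.var(X.sum(axis=1)))
    print("1^T Sigma 1             :", ones @ Sigma @ ones, "=", Sigma.sum())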
◮ The covariance matrix ΣX is positive semidefinite because
  aT ΣX a = Var(aT X) ≥ 0
◮ ΣX would be positive definite if aT ΣX a > 0, ∀a 6= 0
◮ It would fail to be positive definite if Var(aT X) = 0 for
some nonzero a.
◮ Var(Z) = E[(Z − EZ)2 ] = 0 implies Z = EZ, a
constant.
◮ Hence, ΣX fails to be positive definite only if there is a
non-zero linear combination of Xi ’s that is a constant.
P S Sastry, IISc, E1 222 Aug 2021 167/248
◮ Covariance matrix is a real symmetric positive
semidefinite matrix
◮ It has real and non-negative eigenvalues.
◮ It would have n linearly independent eigenvectors.
◮ These also have some interesting roles.
◮ We consider one simple example.
P S Sastry, IISc, E1 222 Aug 2021 168/248
◮ Let Y = aT X and assume ||a|| = 1
◮ Y is projection of X along the direction a.
◮ Suppose we want to find a direction along which variance
is maximized
◮ We want to maximize aT ΣX a subject to aT a = 1
◮ The Lagrangian is aT ΣX a + η(1 − aT a)
◮ Equating the gradient to zero, we get
  ΣX a = η a
◮ So, a should be an eigenvector (with eigenvalue η).
◮ Then the variance would be aT ΣX a = η aT a = η
◮ Hence the direction of maximum variance is the eigenvector corresponding to the largest eigenvalue.
P S Sastry, IISc, E1 222 Aug 2021 169/248
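The sketch below illustrates this with numpy's eigendecomposition on an arbitrary example covariance matrix: the variance along the top eigenvector equals the largest eigenvalue, and randomly sampled unit directions do no better.

    # Sketch: among unit vectors a, a^T Sigma a is maximized by the eigenvector
    # of Sigma with the largest eigenvalue. Sigma below is an arbitrary example.
    import numpy as np

    rng = np.random.default_rng(4)
    Sigma = np.array([[3.0, 1.0, 0.0],
                      [1.0, 2.0, 0.5],
                      [0.0, 0.5, 1.0]])

    eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigenvalues in ascending order
    a_star = eigvecs[:, -1]                    # direction of maximum variance
    print("largest eigenvalue:", eigvals[-1])
    print("a*^T Sigma a*     :", a_star @ Sigma @ a_star)   # equals that eigenvalue

    # No random unit direction does better (up to the sampling of directions).
    dirs = rng.normal(size=(10_000, 3))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    print("max over random unit a:", np.max(np.einsum('ij,jk,ik->i', dirs, Sigma, dirs)))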
Joint moments
◮ Given two random variables, X, Y
◮ The joint moment of order (i, j) is defined by
mij = E[X i Y j ]
m10 = EX, m01 = EY , m11 = E[XY ] and so on
◮ Similarly joint central moments of order (i, j) are defined
by
sij = E[ (X − EX)^i (Y − EY )^j ]
s10 = s01 = 0, s11 = Cov(X, Y ), s20 = Var(X) and so on
◮ We can similarly define joint moments of multiple random
variables
P S Sastry, IISc, E1 222 Aug 2021 170/248
◮ We can define the moment generating function of X, Y by
  MXY (s, t) = E[ e^{sX+tY} ],   s, t ∈ ℜ
◮ This is easily generalized to n random variables:
  MX (s) = E[ e^{sT X} ],   s ∈ ℜn
◮ Once again, we can get all the moments by differentiating the moment generating function:
  ∂MX (s)/∂si |_{s=0} = EXi
◮ More generally,
  ∂^{m+n} MX (s) / (∂si^n ∂sj^m) |_{s=0} = E[ Xi^n Xj^m ]
P S Sastry, IISc, E1 222 Aug 2021 171/248
Conditional Expectation
◮ Suppose X, Y have a joint density fXY
◮ Consider the conditional density fX|Y (x|y). This is a
density in x for every value of y.
◮ Since it is a density, we can use it in an expectation integral: ∫ g(x) fX|Y (x|y) dx
◮ This is like expectation of g(X) since fX|Y (x|y) is a
density in x.
◮ However, its value would be a function of y.
◮ That is, this is a kind of expectation that is a function of
Y (and hence is a random variable)
◮ It is called conditional expectation.
P S Sastry, IISc, E1 222 Aug 2021 172/248
◮ Let X, Y be discrete random variables (on the same
probability space).
◮ The conditional expectation of h(X) conditioned on Y is a function of Y , and is defined by E[h(X)|Y ] = g(Y ) where
  E[h(X)|Y = y] = g(y) = Σ_x h(x) fX|Y (x|y)
◮ Thus
  E[h(X)|Y = y] = Σ_x h(x) fX|Y (x|y) = Σ_x h(x) P [X = x|Y = y]
◮ Note that, E[h(X)|Y ] is a random variable
P S Sastry, IISc, E1 222 Aug 2021 173/248
◮ Let X, Y have joint density fXY .
◮ The conditional expectation of h(X) conditioned on Y is a function of Y , and its value for any y is defined by
  E[h(X)|Y = y] = ∫_{−∞}^{∞} h(x) fX|Y (x|y) dx
◮ Once again, what this means is that E[h(X)|Y ] = g(Y ) where
  g(y) = ∫_{−∞}^{∞} h(x) fX|Y (x|y) dx
P S Sastry, IISc, E1 222 Aug 2021 174/248
A simple example
◮ Consider the joint density
fXY (x, y) = 2, 0 < x < y < 1
◮ We calculated the conditional densities earlier
  fX|Y (x|y) = 1/y ,   fY |X (y|x) = 1/(1 − x),    0 < x < y < 1
◮ Now we can calculate the conditional expectation
  E[X|Y = y] = ∫_{−∞}^{∞} x fX|Y (x|y) dx = ∫_0^{y} x (1/y) dx = (1/y)(x²/2) |_0^{y} = y/2
◮ This gives: E[X|Y ] = Y /2
◮ We can show E[Y |X] = (1 + X)/2
P S Sastry, IISc, E1 222 Aug 2021 175/248
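One can check E[X|Y = y] = y/2 empirically by conditioning on a narrow band of Y values. As in the earlier covariance check, the sampler below uses the (assumed, standard) fact that (min, max) of two iid U(0, 1) variables has exactly this joint density; the band width is an arbitrary choice.

    # Sketch: check E[X | Y = y] = y/2 for f(x,y) = 2 on 0 < x < y < 1.
    import numpy as np

    rng = np.random.default_rng(5)
    u = rng.uniform(0.0, 1.0, size=(2_000_000, 2))
    x, y = u.min(axis=1), u.max(axis=1)

    for y0 in (0.2, 0.5, 0.9):
        sel = np.abs(y - y0) < 0.01            # condition on Y being near y0
        print(f"E[X | Y ~ {y0}] ~ {x[sel].mean():.4f}   (y0/2 = {y0/2:.4f})")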
◮ The conditional expectation is defined by
  E[h(X)|Y = y] = Σ_x h(x) fX|Y (x|y),    when X, Y are discrete
  E[h(X)|Y = y] = ∫_{−∞}^{∞} h(x) fX|Y (x|y) dx,    when X, Y have a joint density
◮ We can actually define E[h(X, Y )|Y ] also as above.
That is,
  E[h(X, Y )|Y = y] = ∫_{−∞}^{∞} h(x, y) fX|Y (x|y) dx
◮ It has all the properties of expectation:
1. E[a|Y ] = a where a is a constant
2. E[ah1 (X) + bh2 (X)|Y ] = aE[h1 (X)|Y ] + bE[h2 (X)|Y ]
3. h1 (X) ≥ h2 (X) ⇒ E[h1 (X)|Y ] ≥ E[h2 (X)|Y ]
P S Sastry, IISc, E1 222 Aug 2021 176/248
◮ Conditional expectation also has some extra properties
which are very important
◮ E [E[h(X)|Y ]] = E[h(X)]
◮ E[h1 (X)h2 (Y )|Y ] = h2 (Y )E[h1 (X)|Y ]
◮ E[h(X, Y )|Y = y] = E[h(X, y)|Y = y]
◮ We will justify each of these.
◮ The last property above follows directly from the
definition.
P S Sastry, IISc, E1 222 Aug 2021 177/248
◮ Expectation of a conditional expectation is the
unconditional expectation
E [ E[h(X)|Y ] ] = E[h(X)]
In the above, LHS is expectation of a function of Y .
◮ Let us denote g(Y ) = E[h(X)|Y ]. Then
  E[ E[h(X)|Y ] ] = E[g(Y )]
                  = ∫_{−∞}^{∞} g(y) fY (y) dy
                  = ∫_{−∞}^{∞} [ ∫_{−∞}^{∞} h(x) fX|Y (x|y) dx ] fY (y) dy
                  = ∫_{−∞}^{∞} ∫_{−∞}^{∞} h(x) fXY (x, y) dy dx
                  = ∫_{−∞}^{∞} h(x) fX (x) dx
                  = E[h(X)]
P S Sastry, IISc, E1 222 Aug 2021 178/248
◮ Any factor that depends only on the conditioning variable
behaves like a constant inside a conditional expectation
E[h1 (X) h2 (Y )|Y ] = h2 (Y )E[h1 (X)|Y ]
◮ Let us denote g(Y ) = E[h1 (X) h2 (Y )|Y ]
  g(y) = E[h1 (X) h2 (Y )|Y = y]
       = ∫_{−∞}^{∞} h1 (x) h2 (y) fX|Y (x|y) dx
       = h2 (y) ∫_{−∞}^{∞} h1 (x) fX|Y (x|y) dx
       = h2 (y) E[h1 (X)|Y = y]
  ⇒ E[h1 (X) h2 (Y )|Y ] = g(Y ) = h2 (Y ) E[h1 (X)|Y ]
P S Sastry, IISc, E1 222 Aug 2021 179/248
◮ A very useful property of conditional expectation is
E[ E[X|Y ] ] = E[X] (Assuming all expectations exist)
◮ We can see this in our earlier example.
fXY (x, y) = 2, 0 < x < y < 1
◮ We easily get: EX = 1/3 and EY = 2/3
◮ We also showed E[X|Y ] = Y /2 and E[Y |X] = (1 + X)/2
  E[ E[X|Y ] ] = E[Y /2] = (1/2)(2/3) = 1/3 = E[X]
◮ Similarly
  E[ E[Y |X] ] = E[(1 + X)/2] = (1 + 1/3)/2 = 2/3 = E[Y ]
P S Sastry, IISc, E1 222 Aug 2021 180/248
Example
◮ Let X, Y be random variables with joint density given by
fXY (x, y) = e−y , 0 < x < y < ∞
◮ The marginal densities are:
  fX (x) = ∫_{−∞}^{∞} fXY (x, y) dy = ∫_x^{∞} e^{−y} dy = e^{−x},   x > 0
  fY (y) = ∫_{−∞}^{∞} fXY (x, y) dx = ∫_0^{y} e^{−y} dx = y e^{−y},   y > 0
Thus, X is exponential and Y is gamma.
◮ Hence we have
EX = 1; Var(X) = 1; EY = 2; Var(Y ) = 2
P S Sastry, IISc, E1 222 Aug 2021 181/248
fXY (x, y) = e−y , 0 < x < y < ∞
◮ Let us calculate covariance of X and Y
  E[XY ] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x y fXY (x, y) dx dy = ∫_0^{∞} ∫_0^{y} x y e^{−y} dx dy = ∫_0^{∞} (1/2) y³ e^{−y} dy = 3
◮ Hence, Cov(X, Y ) = E[XY ] − EX EY = 3 − 2 = 1.
◮ ρXY = 1/√2
P S Sastry, IISc, E1 222 Aug 2021 182/248
◮ Recall the joint and marginal densities
fXY (x, y) = e−y , 0 < x < y < ∞
fX (x) = e−x , x > 0; fY (y) = ye−y , y > 0
◮ The conditional densities will be
  fX|Y (x|y) = fXY (x, y)/fY (y) = e^{−y}/(y e^{−y}) = 1/y,    0 < x < y < ∞
  fY |X (y|x) = fXY (x, y)/fX (x) = e^{−y}/e^{−x} = e^{−(y−x)},    0 < x < y < ∞
P S Sastry, IISc, E1 222 Aug 2021 183/248
◮ The conditional densities are
  fX|Y (x|y) = 1/y ;    fY |X (y|x) = e^{−(y−x)},    0 < x < y < ∞
◮ We can now calculate the conditional expectation
  E[X|Y = y] = ∫_0^{y} x fX|Y (x|y) dx = (1/y) ∫_0^{y} x dx = y/2
  Thus E[X|Y ] = Y /2
  E[Y |X = x] = ∫ y fY |X (y|x) dy = ∫_x^{∞} y e^{−(y−x)} dy
              = e^{x} [ (−y e^{−y}) |_x^{∞} + ∫_x^{∞} e^{−y} dy ]
              = e^{x} ( x e^{−x} + e^{−x} ) = 1 + x
  Thus, E[Y |X] = 1 + X
P S Sastry, IISc, E1 222 Aug 2021 184/248
◮ We got
  E[X|Y ] = Y /2 ;    E[Y |X] = 1 + X
◮ Using this we can verify:
  E[ E[X|Y ] ] = E[Y /2] = EY /2 = 2/2 = 1 = EX
  E[ E[Y |X] ] = E[1 + X] = 1 + 1 = 2 = EY
P S Sastry, IISc, E1 222 Aug 2021 185/248
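A simulation sketch for this example: since fY|X(y|x) = e^{−(y−x)}, one can sample X ∼ Exp(1) and then set Y = X plus an independent Exp(1) increment, and check the moments and conditional expectations derived above. Sample sizes and the conditioning bandwidth are arbitrary choices.

    # Sketch: sample from f(x,y) = e^{-y}, 0 < x < y, via X ~ Exp(1), Y = X + Exp(1),
    # and check EX = 1, EY = 2, Cov(X,Y) = 1 and E[Y | X = x] = 1 + x.
    import numpy as np

    rng = np.random.default_rng(6)
    n = 1_000_000
    x = rng.exponential(1.0, n)
    y = x + rng.exponential(1.0, n)    # f_{Y|X}(y|x) = e^{-(y-x)} for y > x

    print("EX  ~", x.mean(), "  EY ~", y.mean())
    print("Cov ~", np.cov(x, y, bias=True)[0, 1], "  (exact 1)")
    print("rho ~", np.corrcoef(x, y)[0, 1], "  (exact 1/sqrt(2) ≈ 0.707)")

    for x0 in (0.5, 2.0):
        sel = np.abs(x - x0) < 0.02
        print(f"E[Y | X ~ {x0}] ~ {y[sel].mean():.3f}   (1 + x0 = {1 + x0})")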
◮ A property of conditional expectation is
E[ E[X|Y ] ] = E[X]
◮ We assume that all three expectations exist.
◮ Very useful in calculating expectations:
  EX = E[ E[X|Y ] ] = Σ_y E[X|Y = y] fY (y)   or   ∫ E[X|Y = y] fY (y) dy
◮ Can be used to calculate probabilities of events too
P (A) = E[IA ] = E [ E [IA |Y ] ]
P S Sastry, IISc, E1 222 Aug 2021 186/248
◮ Let X be geometric and we want EX.
◮ X is number of tosses needed to get head
◮ Let Y ∈ {0, 1} be outcome of first toss. (1 for head)
E[X] = E[ E[X|Y ] ]
= E[X|Y = 1] P [Y = 1] + E[X|Y = 0] P [Y = 0]
= E[X|Y = 1] p + E[X|Y = 0] (1 − p)
         = 1 · p + (1 + EX)(1 − p)
  ⇒ EX (1 − (1 − p)) = p + (1 − p)
  ⇒ EX p = 1
  ⇒ EX = 1/p
P S Sastry, IISc, E1 222 Aug 2021 187/248
◮ P [X = k|Y = 1] = 1 if k = 1 (otherwise it is zero) and
hence E[X|Y = 1] = 1
  P [X = k|Y = 0] = { 0,                           if k = 1
                    { (1 − p)^{k−1} p / (1 − p),    if k ≥ 2
  Hence
  E[X|Y = 0] = Σ_{k=2}^{∞} k (1 − p)^{k−2} p
             = Σ_{k=2}^{∞} (k − 1) (1 − p)^{k−2} p + Σ_{k=2}^{∞} (1 − p)^{k−2} p
             = Σ_{k′=1}^{∞} k′ (1 − p)^{k′−1} p + Σ_{k′=1}^{∞} (1 − p)^{k′−1} p
             = EX + 1
P S Sastry, IISc, E1 222 Aug 2021 188/248
Another example
◮ Example: multiple rounds of the party game
◮ Let Rn denote number of rounds when you start with n
people.
◮ We want R̄n = E [Rn ].
◮ We want to use E [Rn ] = E[ E [Rn |Xn ] ]
◮ We need to think of a useful Xn .
◮ Let Xn be the number of people who got their own hat in
the first round with n people.
P S Sastry, IISc, E1 222 Aug 2021 189/248
◮ Rn – number of rounds when you start with n people.
◮ Xn – number of people who got their own hat in the first
round
  E[Rn ] = E[ E[Rn |Xn ] ]
         = Σ_{i=0}^{n} E[Rn |Xn = i] P [Xn = i]
         = Σ_{i=0}^{n} (1 + E[Rn−i ]) P [Xn = i]
         = Σ_{i=0}^{n} P [Xn = i] + Σ_{i=0}^{n} E[Rn−i ] P [Xn = i]
◮ If we can guess the value of E[Rn ] then we can prove it using mathematical induction.
P S Sastry, IISc, E1 222 Aug 2021 190/248
◮ What would be E[Xn ]?
◮ Let Yi ∈ {0, 1} denote whether or not ith person got his
own hat.
◮ We know
  E[Yi ] = P [Yi = 1] = (n − 1)!/n! = 1/n
◮ Now, Xn = Σ_{i=1}^{n} Yi and hence EXn = Σ_{i=1}^{n} E[Yi ] = 1
◮ Hence a good guess is E[Rn ] = n.
◮ We verify it using mathematical induction. We know
E[R1 ] = 1
P S Sastry, IISc, E1 222 Aug 2021 191/248
◮ Assume: E[Rk ] = k, 1 ≤ k ≤ n − 1
  E[Rn ] = Σ_{i=0}^{n} P [Xn = i] + Σ_{i=0}^{n} E[Rn−i ] P [Xn = i]
         = 1 + E[Rn ] P [Xn = 0] + Σ_{i=1}^{n} E[Rn−i ] P [Xn = i]
         = 1 + E[Rn ] P [Xn = 0] + Σ_{i=1}^{n} (n − i) P [Xn = i]
  E[Rn ] (1 − P [Xn = 0]) = 1 + n (1 − P [Xn = 0]) − Σ_{i=1}^{n} i P [Xn = i]
                          = 1 + n (1 − P [Xn = 0]) − E[Xn ]
                          = 1 + n (1 − P [Xn = 0]) − 1
  ⇒ E[Rn ] = n
P S Sastry, IISc, E1 222 Aug 2021 192/248
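The conclusion E[Rn] = n can also be checked by directly simulating the game. A minimal sketch (the number of simulation runs is an arbitrary choice):

    # Sketch of the party game: in each round the remaining people receive a random
    # permutation of their own hats; those who get their own hat leave. E[R_n] should be n.
    import numpy as np

    rng = np.random.default_rng(7)

    def rounds(n):
        remaining, r = n, 0
        while remaining > 0:
            r += 1
            perm = rng.permutation(remaining)
            fixed = np.count_nonzero(perm == np.arange(remaining))  # own-hat matches
            remaining -= fixed
        return r

    for n in (3, 5, 10):
        est = np.mean([rounds(n) for _ in range(20_000)])
        print(f"n = {n}: average rounds ~ {est:.3f}   (claim: {n})")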
Analysis of Quicksort
◮ Given n numbers we want to sort them. Many algorithms.
◮ Complexity – order of the number of comparisons needed
◮ Quicksort: Choose a pivot. Separate the numbers into two parts – less than and greater than the pivot – and recurse on each part.
◮ Separating into two parts takes n − 1 comparisons.
◮ Suppose the two parts contain m and n − m − 1 numbers. The comparisons needed to separate each of them into two parts depend on m.
◮ So, final number of comparisons depends on the ‘number
of rounds’
P S Sastry, IISc, E1 222 Aug 2021 193/248
quicksort details
◮ Given {x1 , · · · , xn }.
◮ Choose first as pivot
{xj1 , xj2 , · · · , xjm }x1 {xk1 , xk2 , · · · , xkn−1−m }
◮ Suppose rn is the number of comparisons. If we get (roughly) equal parts, then
  rn ≈ n + 2 r_{n/2} = n + 2 (n/2 + 2 r_{n/4}) = n + n + 4 r_{n/4} = · · · ≈ n log2 (n)
◮ If all the rest go into one part, then
  rn = n + rn−1 = n + (n − 1) + rn−2 = · · · = n(n + 1)/2
◮ If you are lucky, O(n log(n)) comparisons.
◮ If unlucky, in the worst case, O(n2 ) comparisons
◮ Question: ‘on the average’ how many comparisons?
P S Sastry, IISc, E1 222 Aug 2021 194/248
Average case complexity of quicksort
◮ Assume pivot is equally likely to be the smallest or second
smallest or mth smallest.
◮ Mn – number of comparisons.
◮ Define: X = j if pivot is j th smallest
◮ Given X = j we know Mn = (n − 1) + Mj−1 + Mn−j .
  E[Mn ] = E[ E[Mn |X] ] = Σ_{j=1}^{n} E[Mn |X = j] P [X = j]
         = Σ_{j=1}^{n} E[(n − 1) + Mj−1 + Mn−j ] (1/n)
         = (n − 1) + (2/n) Σ_{k=1}^{n−1} E[Mk ],    (taking M0 = 0)
◮ This is a recurrence relation. (A little complicated to
solve)
P S Sastry, IISc, E1 222 Aug 2021 195/248
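The recurrence is easy to evaluate numerically, and a direct simulation of quicksort's comparison count (with the pivot taken as the first element of a uniformly random arrangement, as assumed above) gives a cross-check. A sketch:

    # Sketch: evaluate E[M_n] = (n-1) + (2/n) * sum_{k=1}^{n-1} E[M_k] and compare
    # with a direct simulation of quicksort comparison counts.
    import random

    def expected_comparisons(nmax):
        m = [0.0] * (nmax + 1)           # m[0] = 0
        running_sum = 0.0                # sum of m[0..n-1]
        for n in range(1, nmax + 1):
            m[n] = (n - 1) + 2.0 * running_sum / n
            running_sum += m[n]
        return m

    def quicksort_comparisons(a):
        if len(a) <= 1:
            return 0
        pivot = a[0]
        left = [v for v in a[1:] if v < pivot]
        right = [v for v in a[1:] if v >= pivot]
        return (len(a) - 1) + quicksort_comparisons(left) + quicksort_comparisons(right)

    n = 100
    m = expected_comparisons(n)
    sim = sum(quicksort_comparisons(random.sample(range(10_000), n))
              for _ in range(2000)) / 2000
    print("recurrence E[M_100] =", round(m[n], 2), "  simulation ~", round(sim, 2))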
Least squares estimation
◮ We want to estimate Y as a function of X.
◮ We want an estimate with minimum mean square error.
◮ We want to solve (the min is over all functions g)
  min_g E[ (Y − g(X))² ]
◮ Earlier we considered only linear functions:
g(X) = aX + b
◮ Now we want the ‘best’ function (linear or nonlinear)
◮ The solution now turns out to be
g ∗ (X) = E[Y |X]
◮ Let us prove this.
P S Sastry, IISc, E1 222 Aug 2021 196/248
◮ We want to show that for all g
E[ (E[Y | X] − Y )² ] ≤ E[ (g(X) − Y )² ]
◮ We have
  (g(X) − Y )² = ( (g(X) − E[Y | X]) + (E[Y | X] − Y ) )²
               = (g(X) − E[Y | X])² + (E[Y | X] − Y )²
                 + 2 (g(X) − E[Y | X]) (E[Y | X] − Y )
◮ Now we can take expectation on both sides.
◮ We first show that expectation of last term on RHS
above is zero.
P S Sastry, IISc, E1 222 Aug 2021 197/248
First consider the last term:
  E[ (g(X) − E[Y | X]) (E[Y | X] − Y ) ]
   = E[ E[ (g(X) − E[Y | X]) (E[Y | X] − Y ) | X ] ]
       (because E[Z] = E[ E[Z|X] ])
   = E[ (g(X) − E[Y | X]) E[ (E[Y | X] − Y ) | X ] ]
       (because E[h1 (X) h2 (Z)|X] = h1 (X) E[h2 (Z)|X])
   = E[ (g(X) − E[Y | X]) ( E[Y | X] − E[Y | X] ) ]
   = 0
P S Sastry, IISc, E1 222 Aug 2021 198/248
◮ We earlier got
  (g(X) − Y )² = (g(X) − E[Y | X])² + (E[Y | X] − Y )² + 2 (g(X) − E[Y | X]) (E[Y | X] − Y )
◮ Hence we get
  E[ (g(X) − Y )² ] = E[ (g(X) − E[Y | X])² ] + E[ (E[Y | X] − Y )² ]
                    ≥ E[ (E[Y | X] − Y )² ]
◮ Since the above is true for all functions g, we get
g ∗ (X) = E [Y | X]
P S Sastry, IISc, E1 222 Aug 2021 199/248
Sum of random number of random variables
◮ Let X1 , X2 , · · · be iid rv on the same probability space.
Suppose EXi = µ < ∞, ∀i.
◮ Let N be a positive integer valued rv that is independent
of all Xi (EN < ∞)
◮ Let S = Σ_{i=1}^{N} Xi .
◮ We want to calculate ES.
◮ We can use
E[S] = E[ E[S|N ] ]
P S Sastry, IISc, E1 222 Aug 2021 200/248
◮ We have
" N
#
X
E[S|N = n] = E Xi | N = n
" i=1
n
#
X
= E Xi | N = n
i=1
since E[h(X, Y )|Y = y] = E[h(X, y)|Y = y]
n
X Xn
= E[Xi | N = n] = E[Xi ] = nµ
i=1 i=1
◮ Hence we get
E[S|N ] = N µ ⇒ E[S] = E[N ]E[X1 ]
P S Sastry, IISc, E1 222 Aug 2021 201/248
Wald’s formula
◮ We took S = Σ_{i=1}^{N} Xi with N independent of all Xi .
◮ With iid Xi , the formula ES = EN EX1 is valid even under some dependence between N and Xi .
◮ Here is one version of the assumptions needed:
  A1. E[|X1 |] < ∞ and EN < ∞ (Xi iid)
  A2. E[ Xn I[N ≥n] ] = E[Xn ] P [N ≥ n], ∀n
◮ Let SN = Σ_{i=1}^{N} Xi .
◮ Then, ESN = EX1 EN
◮ Suppose the event [N ≤ n − 1] depends only on
X1 , · · · , Xn−1 .
◮ Such an N is called a stopping time.
◮ Then the event [N ≤ n − 1] and hence its complement
[N ≥ n] is independent of Xn and hence A2 holds.
P S Sastry, IISc, E1 222 Aug 2021 202/248
Wald’s formula
◮ In the general case, we do not need Xi to be iid.
◮ Here is one version of this Wald’s formula. We assume
  1. E[|Xi |] < ∞, ∀i, and EN < ∞
  2. E[ Xn I[N ≥n] ] = E[Xn ] P [N ≥ n], ∀n
◮ Let SN = Σ_{i=1}^{N} Xi and let TN = Σ_{i=1}^{N} E[Xi ].
◮ Then, ESN = ETN .
If E[Xi ] is same for all i, ESN = EX1 EN .
P S Sastry, IISc, E1 222 Aug 2021 203/248
Variance of random sum
◮ S = Σ_{i=1}^{N} Xi , Xi iid, independent of N . Want Var(S).
  E[S²] = E[ ( Σ_{i=1}^{N} Xi )² ] = E[ E[ ( Σ_{i=1}^{N} Xi )² | N ] ]
◮ As earlier, we have
  E[ ( Σ_{i=1}^{N} Xi )² | N = n ] = E[ ( Σ_{i=1}^{n} Xi )² | N = n ] = E[ ( Σ_{i=1}^{n} Xi )² ]
P S Sastry, IISc, E1 222 Aug 2021 204/248
◮ Let Y = Σ_{i=1}^{n} Xi , with Xi iid.
◮ Then, Var(Y ) = n Var(X1 )
◮ Hence we have
E[Y 2 ] = Var(Y ) + (EY )2 = n Var(X1 ) + (nEX1 )2
◮ Using this
  E[ ( Σ_{i=1}^{N} Xi )² | N = n ] = E[ ( Σ_{i=1}^{n} Xi )² ] = n Var(X1 ) + (n EX1 )²
◮ Hence
  E[ ( Σ_{i=1}^{N} Xi )² | N ] = N Var(X1 ) + N² (EX1 )²
P S Sastry, IISc, E1 222 Aug 2021 205/248
◮ S = Σ_{i=1}^{N} Xi (Xi iid). We got
  E[S²] = E[ E[S²|N ] ] = EN Var(X1 ) + E[N²](EX1 )²
◮ Now we can calculate variance of S as
  Var(S) = E[S²] − (ES)²
         = EN Var(X1 ) + E[N²](EX1 )² − (EN EX1 )²
         = EN Var(X1 ) + (EX1 )² ( E[N²] − (EN )² )
         = EN Var(X1 ) + Var(N ) (EX1 )²
P S Sastry, IISc, E1 222 Aug 2021 206/248
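A quick check of both formulas, E[S] = EN EX1 and Var(S) = EN Var(X1) + Var(N)(EX1)², on an arbitrary illustrative choice of N ∼ Poisson(4) and Xi ∼ Exp with mean 2:

    # Sketch: random sum S = X_1 + ... + X_N with X_i iid, N independent of the X_i.
    import numpy as np

    rng = np.random.default_rng(8)
    trials = 200_000
    lam, mean_x = 4.0, 2.0               # EN = Var(N) = 4;  EX = 2, Var(X) = 4

    N = rng.poisson(lam, trials)
    S = np.array([rng.exponential(mean_x, n).sum() for n in N])

    print("E[S]   ~", S.mean(), "   formula:", lam * mean_x)                    # 8
    print("Var(S) ~", S.var(),  "   formula:", lam * 4.0 + lam * mean_x**2)     # 32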
Another Example
◮ We toss a (biased) coin till we get k consecutive heads.
Let Nk denote the number of tosses needed.
◮ N1 would be geometric.
◮ We want E[Nk ]. What rv should we condition on?
◮ Useful rv here is Nk−1
E[Nk | Nk−1 = n] = (n + 1)p + (1 − p)(n + 1 + E[Nk ])
◮ Thus we get the recurrence relation
E[Nk ] = E[ E[Nk | Nk−1 ] ]
= E [(Nk−1 + 1)p + (1 − p)(Nk−1 + 1 + E[Nk ])]
P S Sastry, IISc, E1 222 Aug 2021 207/248
◮ We have
E[Nk ] = E [(Nk−1 + 1)p + (1 − p)(Nk−1 + 1 + E[Nk ])]
◮ Denoting Mk = E[Nk ], we get
  Mk = p Mk−1 + p + (1 − p) Mk−1 + (1 − p) + (1 − p) Mk
  p Mk = Mk−1 + 1
  Mk = (1/p) Mk−1 + 1/p
     = (1/p) ( (1/p) Mk−2 + 1/p ) + 1/p = (1/p)² Mk−2 + (1/p)² + 1/p
     = (1/p)^{k−1} M1 + Σ_{j=1}^{k−1} (1/p)^j
     = (1 − p^k) / ( (1 − p) p^k )     (taking M1 = 1/p)
P S Sastry, IISc, E1 222 Aug 2021 208/248
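A simulation sketch comparing the formula Mk = (1 − p^k)/((1 − p)p^k) with the empirical average number of tosses (the values of p, k and the number of runs are arbitrary choices):

    # Sketch: simulate the number of tosses needed to get k consecutive heads.
    import numpy as np

    rng = np.random.default_rng(9)

    def tosses_until_k_heads(k, p):
        run, t = 0, 0
        while run < k:
            t += 1
            run = run + 1 if rng.random() < p else 0
        return t

    p, k = 0.5, 3
    est = np.mean([tosses_until_k_heads(k, p) for _ in range(100_000)])
    exact = (1 - p**k) / ((1 - p) * p**k)
    print(f"simulated E[N_{k}] ~ {est:.2f}   formula: {exact:.2f}")   # 14 for p=0.5, k=3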
◮ As mentioned earlier, we can use the conditional
expectation to calculate probabilities of events also.
P (A) = E[IA ] = E [ E [IA |Y ] ]
E[IA |Y = y] = P [IA = 1|Y = y] = P (A|Y = y)
◮ Thus, we get
  P (A) = E[IA ] = E[ E[IA |Y ] ]
        = Σ_y P (A|Y = y) P [Y = y],    when Y is discrete
        = ∫ P (A|Y = y) fY (y) dy,    when Y is continuous
P S Sastry, IISc, E1 222 Aug 2021 209/248
Example
◮ Let X, Y be independent continuous rv
◮ We want to calculate P [X ≤ Y ]
◮ We can calculate it by integrating joint density over
A = {(x, y) : x ≤ y}
  P [X ≤ Y ] = ∫∫_A fX (x) fY (y) dx dy
             = ∫_{−∞}^{∞} fY (y) [ ∫_{−∞}^{y} fX (x) dx ] dy
             = ∫_{−∞}^{∞} FX (y) fY (y) dy
◮ If X, Y are iid then P [X < Y ] = 0.5
P S Sastry, IISc, E1 222 Aug 2021 210/248
◮ We can also use the conditional expectation method here
  P [X ≤ Y ] = ∫_{−∞}^{∞} P [X ≤ Y | Y = y] fY (y) dy
             = ∫_{−∞}^{∞} P [X ≤ y | Y = y] fY (y) dy
             = ∫_{−∞}^{∞} P [X ≤ y] fY (y) dy
             = ∫_{−∞}^{∞} FX (y) fY (y) dy
P S Sastry, IISc, E1 222 Aug 2021 211/248
Another Example
◮ Consider a sequence of Bernoulli trials where p, the probability of success, is random.
◮ We first choose p uniformly over (0, 1) and then perform
n tosses.
◮ Let X be the number of heads.
◮ Conditioned on knowledge of p, we know distribution of
X
P [X = k | p] = nCk p^k (1 − p)^{n−k}
◮ Now we can calculate P [X = k] using the conditioning
argument.
P S Sastry, IISc, E1 222 Aug 2021 212/248
◮ Assuming p is chosen uniformly from (0, 1), we get
  P [X = k] = ∫ P [X = k | p] f (p) dp
            = ∫_0^{1} nCk p^k (1 − p)^{n−k} · 1 dp
            = nCk k!(n − k)!/(n + 1)!
              ( because ∫_0^{1} p^k (1 − p)^{n−k} dp = Γ(k + 1)Γ(n − k + 1)/Γ(n + 2) )
            = 1/(n + 1)
◮ So, we get: P [X = k] = 1/(n + 1),   k = 0, 1, · · · , n
P S Sastry, IISc, E1 222 Aug 2021 213/248
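This somewhat surprising uniform answer is easy to check by simulation: draw p uniformly, then draw a Binomial(n, p) count, and tabulate the outcomes. A sketch with n = 10 (an arbitrary choice):

    # Sketch: p ~ U(0,1), then X ~ Binomial(n, p); claim: P[X = k] = 1/(n+1).
    import numpy as np

    rng = np.random.default_rng(10)
    n, trials = 10, 1_000_000
    p = rng.uniform(0.0, 1.0, trials)
    x = rng.binomial(n, p)                 # one binomial draw per sampled p

    counts = np.bincount(x, minlength=n + 1) / trials
    print("empirical P[X = k]:", np.round(counts, 4))
    print("claimed value 1/(n+1) =", 1 / (n + 1))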
Tower property of Conditional Expectation
◮ Conditional expectation satisfies
E[ E[h(X)|Y, Z] | Y ] = E[h(X)|Y ]
Note that all these can be random vectors.
◮ Let
g1 (Y, Z) = E[h(X)|Y, Z]
g2 (Y ) = E[g1 (Y, Z)|Y ]
We want to show g2 (Y ) = E[h(X)|Y ]
P S Sastry, IISc, E1 222 Aug 2021 214/248
◮ Recall: g1 (Y, Z) = E[h(X)|Y, Z], g2 (Y ) = E[g1 (Y, Z)|Y ]
  g2 (y) = ∫ g1 (y, z) fZ|Y (z|y) dz
         = ∫ [ ∫ h(x) fX|Y Z (x|y, z) dx ] fZ|Y (z|y) dz
         = ∫ h(x) [ ∫ fX|Y Z (x|y, z) fZ|Y (z|y) dz ] dx
         = ∫ h(x) [ ∫ fXZ|Y (x, z|y) dz ] dx
         = ∫ h(x) fX|Y (x|y) dx
◮ Thus we get
E[ E[h(X)|Y, Z] | Y ] = E[h(X)|Y ]
P S Sastry, IISc, E1 222 Aug 2021 215/248
Gaussian or Normal distribution
◮ The Gaussian or normal density is given by
  f (x) = (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)},   −∞ < x < ∞
◮ If X has this density, we denote it as X ∼ N (µ, σ 2 ).
We showed EX = µ and Var(X) = σ 2
◮ The density is a ‘bell-shaped’ curve
P S Sastry, IISc, E1 222 Aug 2021 216/248
◮ Standard Normal rv — X ∼ N (0, 1)
◮ The distribution function of standard normal is
  Φ(x) = ∫_{−∞}^{x} (1/√(2π)) e^{−t²/2} dt
◮ Suppose X ∼ N (µ, σ 2 )
  P [a ≤ X ≤ b] = ∫_a^{b} (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)} dx
    take y = (x − µ)/σ ⇒ dy = dx/σ
                = ∫_{(a−µ)/σ}^{(b−µ)/σ} (1/√(2π)) e^{−y²/2} dy
                = Φ( (b − µ)/σ ) − Φ( (a − µ)/σ )
◮ We can express probabilities of events involving any Normal rv in terms of Φ.
P S Sastry, IISc, E1 222 Aug 2021 217/248
◮ X ∼ N (0, 1). Then its mgf is
  MX (t) = E[ e^{tX} ] = ∫_{−∞}^{∞} e^{tx} (1/√(2π)) e^{−x²/2} dx
         = (1/√(2π)) ∫_{−∞}^{∞} e^{−(x² − 2tx)/2} dx
         = (1/√(2π)) ∫_{−∞}^{∞} e^{−((x−t)² − t²)/2} dx
         = e^{t²/2} (1/√(2π)) ∫_{−∞}^{∞} e^{−(x−t)²/2} dx
         = e^{t²/2}
◮ Now let Y = σX + µ. Then Y ∼ N (µ, σ²).
  The mgf of Y is
  MY (t) = E[ e^{t(σX+µ)} ] = e^{tµ} E[ e^{(tσ)X} ] = e^{tµ} MX (tσ) = e^{µt + σ²t²/2}
P S Sastry, IISc, E1 222 Aug 2021 218/248
Multi-dimensional Gaussian Distribution
◮ The n-dimensional Gaussian density is given by
  fX (x) = (1/( (2π)^{n/2} |Σ|^{1/2} )) e^{−(1/2)(x−µ)T Σ^{−1} (x−µ)},   x ∈ ℜn
◮ µ ∈ ℜn and Σ ∈ ℜn×n are parameters of the density and
Σ is symmetric and positive definite.
◮ If X1 , · · · , Xn have the above joint density, they are said
to be jointly Gaussian.
◮ We denote this by X ∼ N (µ, Σ)
◮ We will now show that this is a joint density function.
P S Sastry, IISc, E1 222 Aug 2021 219/248
◮ We begin by showing the following is a density (when M
is symmetric +ve definite)
  fY (y) = C e^{−(1/2) yT M y}
◮ Let I = ∫_{ℜ^n} C e^{−(1/2) yT M y} dy
◮ Since M is real symmetric, there exists an orthogonal
transform, L with L−1 = LT , |L| = 1 and LT M L is
diagonal
◮ Let LT M L = diag(m1 , · · · , mn ).
◮ Then for any z ∈ ℜn ,
  zT LT M L z = Σ_i mi zi²
P S Sastry, IISc, E1 222 Aug 2021 220/248
◮ We now get
  I = C ∫_{ℜ^n} e^{−(1/2) yT M y} dy
  change variable: z = L^{−1} y = LT y ⇒ y = Lz
    = C ∫_{ℜ^n} e^{−(1/2) zT LT M L z} dz    (note that |L| = 1)
    = C ∫_{ℜ^n} e^{−(1/2) Σ_i mi zi²} dz
    = C Π_{i=1}^{n} ∫_ℜ e^{−(1/2) mi zi²} dzi
    = C Π_{i=1}^{n} √(2π/mi )
P S Sastry, IISc, E1 222 Aug 2021 221/248
◮ We will first relate m1 · · · mn to the matrix M .
◮ By definition, LT M L = diag(m1 , · · · , mn ). Hence
  diag(1/m1 , · · · , 1/mn ) = (LT M L)^{−1} = L^{−1} M^{−1} (LT )^{−1} = LT M^{−1} L
◮ Since |L| = 1, we get
  |LT M^{−1} L| = |M^{−1}| = 1/(m1 · · · mn )
◮ Putting all this together,
  ∫_{ℜ^n} C e^{−(1/2) yT M y} dy = C Π_{i=1}^{n} √(2π/mi ) = C (2π)^{n/2} |M^{−1}|^{1/2}
  ⇒ ∫_{ℜ^n} (1/( (2π)^{n/2} |M^{−1}|^{1/2} )) e^{−(1/2) yT M y} dy = 1
P S Sastry, IISc, E1 222 Aug 2021 222/248
◮ We showed the following is a density (taking M −1 = Σ)
  fY (y) = (1/( (2π)^{n/2} |Σ|^{1/2} )) e^{−(1/2) yT Σ^{−1} y},   y ∈ ℜn
◮ Let X = Y + µ. Then
  fX (x) = fY (x − µ) = (1/( (2π)^{n/2} |Σ|^{1/2} )) e^{−(1/2)(x−µ)T Σ^{−1} (x−µ)}
◮ This is the multidimensional Gaussian distribution
P S Sastry, IISc, E1 222 Aug 2021 223/248
◮ Consider Y with joint density
  fY (y) = (1/( (2π)^{n/2} |Σ|^{1/2} )) e^{−(1/2) yT Σ^{−1} y},   y ∈ ℜn
◮ As earlier let M = Σ^{−1}. Let LT M L = diag(m1 , · · · , mn )
◮ Define Z = (Z1 , · · · , Zn )T = LT Y. Then Y = LZ.
◮ Recall |L| = 1, |M^{−1}| = (m1 · · · mn )^{−1}
◮ Then the density of Z is
  fZ (z) = (1/( (2π)^{n/2} |M^{−1}|^{1/2} )) e^{−(1/2) zT LT M L z}
         = (1/(2π)^{n/2}) (m1 · · · mn )^{1/2} e^{−(1/2) Σ_i mi zi²}
         = Π_{i=1}^{n} (1/√(2π (1/mi ))) e^{−zi²/(2 (1/mi ))}
  This shows that Zi ∼ N (0, 1/mi ) and the Zi are independent.
P S Sastry, IISc, E1 222 Aug 2021 224/248
◮ If Y has density fY and Z = LT Y then Zi ∼ N (0, 1/mi ) and the Zi are independent. Hence,
  ΣZ = diag(1/m1 , · · · , 1/mn ) = LT M^{−1} L
◮ Also, since E[Zi ] = 0, ΣZ = E[ZZT ].
◮ Since Y = LZ, E[Y] = 0 and
  ΣY = E[YYT ] = E[LZZT LT ] = L E[ZZT ] LT = L(LT M^{−1} L)LT = M^{−1}
◮ Thus, if Y has density
  fY (y) = (1/( (2π)^{n/2} |Σ|^{1/2} )) e^{−(1/2) yT Σ^{−1} y},   y ∈ ℜn
  then EY = 0 and ΣY = M^{−1} = Σ
P S Sastry, IISc, E1 222 Aug 2021 225/248
◮ Let Y have density
  fY (y) = (1/( (2π)^{n/2} |Σ|^{1/2} )) e^{−(1/2) yT Σ^{−1} y},   y ∈ ℜn
◮ Let X = Y + µ. Then
  fX (x) = (1/( (2π)^{n/2} |Σ|^{1/2} )) e^{−(1/2)(x−µ)T Σ^{−1} (x−µ)}
◮ We have
  EX = E[Y + µ] = µ
  ΣX = E[(X − µ)(X − µ)T ] = E[YYT ] = Σ
P S Sastry, IISc, E1 222 Aug 2021 226/248
Multi-dimensional Gaussian density
◮ X = (X1 , · · · , Xn )T are said to be jointly Gaussian if
  fX (x) = (1/( (2π)^{n/2} |Σ|^{1/2} )) e^{−(1/2)(x−µ)T Σ^{−1} (x−µ)}
◮ EX = µ and ΣX = Σ.
◮ Suppose Cov(Xi , Xj ) = 0, ∀i ≠ j ⇒ Σij = 0, ∀i ≠ j.
◮ Then Σ is diagonal. Let Σ = diag(σ1², · · · , σn²).
  fX (x) = (1/( (2π)^{n/2} σ1 · · · σn )) e^{−(1/2) Σ_{i=1}^{n} ((xi −µi )/σi )²} = Π_{i=1}^{n} (1/(σi √(2π))) e^{−(1/2)((xi −µi )/σi )²}
◮ This implies the Xi are independent.
◮ If X1 , · · · , Xn are jointly Gaussian then uncorrelatedness
implies independence.
P S Sastry, IISc, E1 222 Aug 2021 227/248
◮ Let X = (X1 , · · · , Xn )T be jointly Gaussian:
  fX (x) = (1/( (2π)^{n/2} |Σ|^{1/2} )) e^{−(1/2)(x−µ)T Σ^{−1} (x−µ)}
◮ Let Y = X − µ.
◮ Let M = Σ−1 and L be such that
LT M L = diag(m1 , · · · , mn )
◮ Let Z = (Z1 , · · · , Zn )T = LT Y .
◮ Then we saw that Zi ∼ N (0, 1/mi ) and Zi are independent.
◮ If X1 , · · · , Xn are jointly Gaussian then there is a ‘linear’
transform that transforms them into independent random
variables.
P S Sastry, IISc, E1 222 Aug 2021 228/248
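A numerical illustration: diagonalizing ΣX with an orthogonal L (equivalently, diagonalizing M = ΣX^{−1}, since the same L works for both) and applying Z = Lᵀ(X − µ) to samples yields an (empirically) diagonal covariance. The particular µ and Σ below are arbitrary choices.

    # Sketch: Z = L^T (X - mu), with L the orthogonal eigenvector matrix of Sigma,
    # has (near-)diagonal covariance, i.e. the components are decorrelated.
    import numpy as np

    rng = np.random.default_rng(11)
    mu = np.array([1.0, -2.0, 0.5])
    Sigma = np.array([[2.0, 0.9, 0.4],
                      [0.9, 1.5, 0.3],
                      [0.4, 0.3, 1.0]])

    X = rng.multivariate_normal(mu, Sigma, size=400_000)
    eigvals, L = np.linalg.eigh(Sigma)        # Sigma = L diag(eigvals) L^T, L orthogonal
    Z = (X - mu) @ L                          # row-wise version of Z = L^T (X - mu)

    print("covariance of Z (should be ~diagonal):")
    print(np.round(np.cov(Z, rowvar=False), 3))
    print("eigenvalues of Sigma:", np.round(eigvals, 3))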
Moment generating function
◮ Let X = (X1 , · · · , Xn )T be jointly Gaussian
◮ Let Y = X − µ and Z = (Z1 , · · · , Zn )T = LT Y as earlier
◮ The moment generating function of X is given by
  MX (s) = E[ e^{sT X} ]
         = E[ e^{sT (Y+µ)} ] = e^{sT µ} E[ e^{sT Y} ]
         = e^{sT µ} E[ e^{sT LZ} ]
         = e^{sT µ} E[ e^{uT Z} ]    where u = LT s
         = e^{sT µ} MZ (u)
P S Sastry, IISc, E1 222 Aug 2021 229/248
◮ Since Zi are independent, easy to get MZ .
◮ We know Zi ∼ N (0, 1/mi ). Hence
  MZi (ui ) = e^{ui²/(2mi )}
  MZ (u) = E[ e^{uT Z} ] = Π_{i=1}^{n} E[ e^{ui Zi} ] = Π_{i=1}^{n} e^{ui²/(2mi )} = e^{Σ_i ui²/(2mi )}
◮ We derived earlier
  MX (s) = e^{sT µ} MZ (u),   where u = LT s
P S Sastry, IISc, E1 222 Aug 2021 230/248
◮ We got
  MX (s) = e^{sT µ} MZ (u);   u = LT s;   MZ (u) = e^{Σ_i ui²/(2mi )}
◮ Earlier we have shown LT M^{−1} L = diag(1/m1 , · · · , 1/mn ) where M^{−1} = Σ. Now we get
  (1/2) Σ_i ui²/mi = (1/2) uT (LT M^{−1} L) u = (1/2) sT M^{−1} s = (1/2) sT Σ s
◮ Hence we get
  MX (s) = e^{sT µ + (1/2) sT Σ s}
◮ This is the moment generating function of the multi-dimensional Normal density
P S Sastry, IISc, E1 222 Aug 2021 231/248
◮ Let X, Y be jointly Gaussian. For simplicity let EX = EY = 0.
◮ Let Var(X) = σx², Var(Y ) = σy²; let ρXY = ρ ⇒ Cov(X, Y ) = ρ σx σy .
◮ Now, the covariance matrix and its inverse are given by
  Σ = [ σx²       ρ σx σy ]      Σ^{−1} = (1/(σx² σy² (1 − ρ²))) [ σy²        −ρ σx σy ]
      [ ρ σx σy   σy²     ]                                       [ −ρ σx σy    σx²    ]
◮ The joint density of X, Y is given by
  fXY (x, y) = (1/(2π σx σy √(1 − ρ²))) exp( −(1/(2(1 − ρ²))) [ x²/σx² + y²/σy² − 2ρxy/(σx σy ) ] )
◮ This is the bivariate Gaussian density
P S Sastry, IISc, E1 222 Aug 2021 232/248
◮ Suppose X, Y are jointly Gaussian (with the density
above)
◮ Then, all the marginals and conditionals would be
Gaussian.
◮ X ∼ N (0, σx²), and Y ∼ N (0, σy²)
◮ fX|Y (x|y) would be a Gaussian density with mean ρ(σx /σy ) y and variance σx²(1 − ρ²).
P S Sastry, IISc, E1 222 Aug 2021 233/248
◮ Let X = (X1 , · · · , Xn )T be jointly Gaussian.
◮ Then we call X as a Gaussian vector.
◮ It is possible that Xi , i = 1, · · · , n are individually
Gaussian but X is not a Gaussian vector.
◮ For example, X, Y may be individually Gaussian but their
joint density is not the bivariate normal density.
◮ Gaussian vectors have some special properties. (E.g.,
uncorrelated implies independence)
◮ Important to note that ‘individually Gaussian’ does not
mean ‘jointly Gaussian’
P S Sastry, IISc, E1 222 Aug 2021 234/248
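One standard construction of such a pair (an illustration chosen here, not taken from the slides): let X ∼ N(0, 1) and Y = SX where S is an independent random sign. Each of X, Y is N(0, 1) and they are uncorrelated, but (X, Y) is not jointly Gaussian, since X + Y takes the value 0 with probability 1/2, and they are clearly not independent.

    # Sketch: X ~ N(0,1), Y = S*X with S = +/-1 a fair sign independent of X.
    import numpy as np

    rng = np.random.default_rng(12)
    n = 1_000_000
    x = rng.standard_normal(n)
    s = rng.choice([-1.0, 1.0], size=n)
    y = s * x

    print("mean/var of Y:", y.mean(), y.var())            # ~0 and ~1, so Y is N(0,1)
    print("Cov(X, Y) ~", np.cov(x, y, bias=True)[0, 1])   # ~0: uncorrelated
    print("P[X + Y = 0] ~", np.mean(x + y == 0.0))        # ~0.5: X + Y is not Gaussian
    print("P[|Y| = |X|] =", np.mean(np.abs(y) == np.abs(x)))  # 1: not independent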
◮ The multi-dimensional Gaussian density has some
important properties.
◮ We have seen some of them earlier.
◮ If X1 , · · · , Xn are jointly Gaussian then they are
independent if they are uncorrelated.
◮ Suppose X1 , · · · , Xn are jointly Gaussian and have zero means. Then there is an orthogonal transform Y = AX such that Y1 , · · · , Yn are jointly Gaussian and independent.
◮ Another important property is the following
◮ X1 , · · · , Xn are jointly Gaussian if and only if tT X is Gaussian for all non-zero t ∈ ℜn .
◮ We will prove this using moment generating functions
P S Sastry, IISc, E1 222 Aug 2021 235/248
◮ Suppose X = (X1 , · · · , Xn )T is jointly Gaussian and let W = tT X.
◮ Let µX and ΣX denote the mean vector and covariance matrix of X. Then
  µw ≜ EW = tT µX ;    σw² ≜ Var(W ) = tT ΣX t
◮ The mgf of W is given by
  MW (u) = E[ e^{uW} ] = E[ e^{u tT X} ]
         = MX (ut) = e^{u tT µX + (1/2) u² tT ΣX t}
         = e^{u µw + (1/2) u² σw²}
  showing that W is Gaussian.
◮ Shows density of Xi is Gaussian for each i. For example,
if we take t = (1, 0, 0, · · · , 0)T then tT X would be X1 .
P S Sastry, IISc, E1 222 Aug 2021 236/248
◮ Now suppose W = tT X is Gaussian for all t ≠ 0.
  MW (u) = e^{u µw + (1/2) u² σw²} = e^{u tT µX + (1/2) u² tT ΣX t}
◮ This implies
  E[ e^{u tT X} ] = e^{u tT µX + (1/2) u² tT ΣX t},   ∀u ∈ ℜ, ∀t ∈ ℜn , t ≠ 0
  E[ e^{tT X} ] = e^{tT µX + (1/2) tT ΣX t},   ∀t
  This implies X is jointly Gaussian.
◮ This is a defining property of multidimensional Gaussian
density
P S Sastry, IISc, E1 222 Aug 2021 237/248
◮ Let X = (X1 , · · · , Xn )T be jointly Gaussian.
◮ Let A be a k × n matrix with rank k.
◮ Then Y = AX is jointly Gaussian.
◮ We will once again show this using the moment
generating function.
◮ Let µx and Σx denote mean vector and covariance matrix
of X. Similarly µy and Σy for Y
◮ We have µy = Aµx and
  Σy = E[ (Y − µy )(Y − µy )T ]
     = E[ (A(X − µx ))(A(X − µx ))T ]
     = E[ A(X − µx )(X − µx )T AT ]
     = A E[ (X − µx )(X − µx )T ] AT = A Σx AT
P S Sastry, IISc, E1 222 Aug 2021 238/248
◮ The mgf of Y is
  MY (s) = E[ e^{sT Y} ]    (s ∈ ℜk )
         = E[ e^{sT AX} ]
         = MX (AT s)
           (Recall MX (t) = e^{tT µx + (1/2) tT Σx t})
         = e^{sT Aµx + (1/2) sT A Σx AT s}
         = e^{sT µy + (1/2) sT Σy s}
  This shows Y is jointly Gaussian
P S Sastry, IISc, E1 222 Aug 2021 239/248
◮ X is jointly Gaussian and A is a k × n matrix with rank k.
◮ Then Y = AX is jointly Gaussian.
◮ This shows all marginals of X are Gaussian
◮ For example, if you take A to be
  A = [ 1 0 0 · · · 0 ]
      [ 0 1 0 · · · 0 ]
  then Y = (X1 , X2 )T
P S Sastry, IISc, E1 222 Aug 2021 240/248
◮ Finding the distribution of a rv by calculating its mgf is
useful in many situations.
◮ Let X1 , X2 , · · · be iid with mgf MX (t).
◮ Let SN = Σ_{i=1}^{N} Xi where N is a positive integer valued rv which is independent of all Xi .
◮ We want to find out the distribution of SN .
◮ We can calculate mgf of SN in terms of MX and
distribution of N .
◮ We can use properties of conditional expectation for this
P S Sastry, IISc, E1 222 Aug 2021 241/248
◮ The mgf of SN is MSN (t) = E[ e^{tSN} ]
  E[ e^{tSN} | N = n ] = E[ e^{t Σ_{i=1}^{N} Xi} | N = n ]
                       = E[ e^{t Σ_{i=1}^{n} Xi} | N = n ]
                       = E[ e^{t Σ_{i=1}^{n} Xi} ] = Π_{i=1}^{n} E[ e^{tXi} ] = (MX (t))^n
◮ Hence we get
  E[ e^{tSN} | N ] = (MX (t))^N
P S Sastry, IISc, E1 222 Aug 2021 242/248
◮ We can now find mgf of SN as
  MSN (t) = E[ e^{tSN} ]
          = E[ E[ e^{tSN} | N ] ]
          = E[ (MX (t))^N ]
          = Σ_{n=1}^{∞} (MX (t))^n fN (n)
          = GN ( MX (t) )
  where GN (s) = E[s^N ] is the generating function of N
◮ This method is useful for finding distribution of SN when
we can recognize the distribution from its mgf
P S Sastry, IISc, E1 222 Aug 2021 243/248
◮ We can also find distribution function of SN directly using
the technique of conditional expectations.
◮ FSN (s) = P [SN ≤ s] and we know how to find
probabilities of events using conditional expectation.
" N # ∞
" N #
X X X
P Xi ≤ s = P Xi ≤ s | N = n P [N = n]
i=1 n=1
∞
" i=1
n
#
X X
= P Xi ≤ s P [N = n]
n=1 i=1
P S Sastry, IISc, E1 222 Aug 2021 244/248
Jensen’s Inequality
◮ Let g : ℜ → ℜ be a convex function. Then
g(EX) ≤ E[g(X)]
◮ For example, (EX)2 ≤ E [X 2 ]
◮ Function g is convex if (see figure on left)
g(αx+(1−α)y) ≤ αg(x)+(1−α)g(y), ∀x, y, ∀0 ≤ α ≤ 1
◮ If g is convex, then, given any x0 , there exists λ(x0 ) such that (see figure on right)
  g(x) ≥ g(x0 ) + λ(x0 )(x − x0 ), ∀x
P S Sastry, IISc, E1 222 Aug 2021 245/248
Jensen’s Inequality: Proof
◮ We have: ∀x0 , ∃λ(x0 ) such that
g(x) ≥ g(x0 ) + λ(x0 )(x − x0 ), ∀x
◮ Take x0 = EX and x = X(ω). Then
g(X(ω)) ≥ g(EX) + λ(EX)(X(ω) − EX), ∀ω
◮ Y (ω) ≥ Z(ω), ∀ω ⇒ Y ≥ Z ⇒ EY ≥ EZ
Hence we get
g(X) ≥ g(EX) + λ(EX)(X − EX)
⇒ E[g(X)] ≥ g(EX) + λ(EX) E[X − EX] = g(EX)
◮ This completes the proof
P S Sastry, IISc, E1 222 Aug 2021 246/248
◮ Consider the set of all mean-zero random variables.
◮ It is closed under addition and scalar (real number)
multiplication.
◮ Cov(X, Y ) = E[XY ] satisfies
1. Cov(X, Y ) = Cov(Y, X)
2. Cov(X, X) = Var(X) ≥ 0 and is zero only if X = 0
3. Cov(aX, Y ) = aCov(X, Y )
4. Cov(X1 + X2 , Y ) = Cov(X1 , Y ) + Cov(X2 , Y )
◮ Thus Cov(X, Y ) is an inner product here.
◮ The Cauchy-Schwartz inequality (|xT y| ≤ ||x|| ||y||)
gives
  |Cov(X, Y )| ≤ √(Cov(X, X)) √(Cov(Y, Y )) = √( Var(X) Var(Y ) )
◮ This is same as |ρXY | ≤ 1
◮ A generalization of Cauchy-Schwartz inequality is Holder
inequality
P S Sastry, IISc, E1 222 Aug 2021 247/248
Holder Inequality
◮ For all p, q with p, q > 1 and 1/p + 1/q = 1,
  E[|XY |] ≤ (E|X|^p )^{1/p} (E|Y |^q )^{1/q}
  (We assume all the expectations are finite)
◮ If we take p = q = 2,
  E[|XY |] ≤ √( E[X²] E[Y²] )
◮ This is the same as the Cauchy-Schwartz inequality. This implies |ρXY | ≤ 1:
  |Cov(X, Y )| = | E[(X − EX)(Y − EY )] |
               ≤ E[ |X − EX| |Y − EY | ]
               ≤ √( E[(X − EX)²] E[(Y − EY )²] )
               = √( Var(X) Var(Y ) )
P S Sastry, IISc, E1 222 Aug 2021 248/248