Ma 202
Class Notes
Amit Kumar
Department of Mathematical Sciences
Indian Institute of Technology (BHU) Varanasi
Varanasi – 221005, India.
Contents
1 Basic Probability
1.1 Basic Definitions
1.2 Definitions of Probability
1.3 Properties of Probability Measure
1.4 Conditional Probability
1.5 Independence of Events
1.6 Exercises
Basic Probability
Probability is all about gaining a degree of confidence in deciding on something where the process is
random, and the result is not precisely determinable a priori. The mathematical theory of probability,
although initiated for gambling, has a wide range of applications starting from game theory to physics
to finance and extends to almost all areas of science and engineering. For example, the IRCTC ticket
booking system shows a "CNF Probability" option while booking a waitlisted ticket, which helps
passengers judge the chance that the ticket will be confirmed.
When we perform a random experiment whose all possible outcomes are already known, but the result
of a specific experiment is not predictable, then probability comes into the picture. One needs to design
a suitable probability space depending on the outcome, which comprises a sample space, a σ-algebra,
and a probability measure. Probability space plays a fundamental role in the probabilistic analysis
of a model. There are two levels of probability theory; one is when the underlying sample space is
countable, and another is the case when this space is uncountable. In the case of a countable sample
space, we have discrete probability, whereas the definition of probability becomes more challenging
when the sample space is uncountable, referred to as continuous probability.
This chapter aims to develop a background in the basic concepts of probability, where we learn how
to build a probability space for a given probability model and study some important properties of
probability spaces. We also discuss the conditional probability and some consequent results.
Experiment
Definition 1.1.2 [Deterministic Experiment]
If an experiment is conducted under certain conditions and it results in a known outcome, then it is
called a deterministic experiment.
(ii) any performance of the experiment results in an outcome that is not known in advance, and
The sample space is denoted by Ω, U or S. Throughout these notes, we use Ω for the sample space.
Example 1.1.3. The sample spaces for the experiments in Example 1.1.2 are as follows:
Sample Space
Event
Note that φ and Ω are called impossible and sure events, respectively.
Types of Events: An event can take several forms, including unions, intersections and complements,
among many others. The significance of some of them is as follows:
1. Union of events:
(a) A ∪ B ≡ occurrence of at least one of A and B.
(b) ∪_{i=1}^{n} A_i ≡ occurrence of at least one of A_i, i = 1, 2, . . . , n.
(c) ∪_{i=1}^{∞} A_i ≡ occurrence of at least one of A_i, i = 1, 2, . . ..
2. Intersection of events: A ∩ B ≡ simultaneous occurrence of both A and B.
3. If A ∩ B = φ then A and B are called mutually exclusive events, that is, the happening of one of
them excludes the possibility of the happening of the other.
4. If Ω = ∪_{i=1}^{n} A_i, then A_1, A_2, . . . , A_n are called exhaustive events.
6. A^c ≡ non-occurrence of A.
7. A\B ≡ occurrence of A but not of B.
Example 1.2.1. Rolling a die, we have Ω = {1, 2, 3, 4, 5, 6}. Let A be the event that an even number
occurs. Then,
P(A) = #A/#Ω = 3/6 = 1/2.
Example 1.2.2. Tossing two coins, we have Ω = {HH, HT, TH, TT}. Let A be the event that both
coins show the same face. Then,
P(A) = #A/#Ω = 2/4 = 1/2.
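The classical definition lends itself to direct computation: list the sample space, count the favourable outcomes and divide. The short Python sketch below (the helper name is ours, not part of the notes) reproduces Examples 1.2.1 and 1.2.2 by enumeration.

from fractions import Fraction
from itertools import product

def classical_probability(sample_space, event):
    # P(A) = #A / #Omega, assuming all outcomes are equally likely
    favourable = [w for w in sample_space if event(w)]
    return Fraction(len(favourable), len(sample_space))

# Example 1.2.1: rolling a die, A = "an even number occurs"
die = range(1, 7)
print(classical_probability(die, lambda w: w % 2 == 0))      # 1/2

# Example 1.2.2: tossing two coins, A = "both coins show the same face"
coins = list(product("HT", repeat=2))
print(classical_probability(coins, lambda w: w[0] == w[1]))  # 1/2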
Drawbacks: The definition loses its significance in the following context of real-life situations.
(b) Events may not be always equally likely in real-life applications, for example, in climatol-
ogy, a rainy day and a dry day can not be equally likely in general.
Due to the limitations of the classical definition, another approach is needed to define probability. The
following definition was given by Von Mises.
Example 1.2.3. Consider the experiment of tossing a coin repeatedly and the output are in the following
form:
HHT HHT HHT HHT . . ..
Let A denote the event that H occurs, and let a_n be the number of heads in the first n tosses. Then
a_n/n = 1/1, 2/2, 2/3, 3/4, 4/5, 4/6, . . . ,
which, along the three subsequences, equals
(2n − 1)/(3n − 2), 2n/(3n − 1), 2n/(3n), for n = 1, 2, 3, . . . ,
and each of these converges to 2/3, so the relative frequency of A stabilizes at 2/3.
Drawbacks: The definition loses its significance in the following context of real-life situations.
(a) Actual observations of the experiment may sometimes not be possible, for example, the
probability of success in launching satellites.
(b) Note that, for small values of ε > 0, n^{1−ε} is close to n, while n^{1−ε}/n → 0 as n → ∞, which
seems to be an unexpected probability. On the other hand, (n − n^{1−ε})/n → 1 for such small ε.
The purpose of probability theory is to set up a general mathematical framework to quantify the chance
of occurrence of an event in a random experiment. So, an abstract notion of probability is desirable
to deal with a broad class of experiments. To achieve this, let us first define all the desirable features
needed to set up a probability model.
(i) φ ∈ F .
Important Observations:
(a) φ ∈ F =⇒ Ω = φ^c ∈ F.
(b) ∪_{i=1}^{n} A_i ∈ F, for every n ∈ N (substitute A_{n+1} = A_{n+2} = · · · = φ in (iii)).
(c) For A_1, A_2, . . . ∈ F, we have A_1^c, A_2^c, . . . ∈ F and therefore ∪_{i=1}^{∞} A_i^c ∈ F. Hence,
∩_{i=1}^{∞} A_i = (∪_{i=1}^{∞} A_i^c)^c ∈ F.
(d) A sigma-algebra is closed under complements, countable (hence finite) unions and countable
(hence finite) intersections.
(b) F_2 = {φ, Ω, {1, 3}, {2, 3}, {2}} is NOT a sigma-algebra (since {2, 3}^c = {1} ∉ F_2).
Example 1.2.6. For any subset A ⊆ Ω, F = {φ, A, A^c, Ω} is always a sigma-algebra (it is called the
sigma-algebra generated by A).
Example 1.2.7. Let Ω = {1, 2, 3, . . .} and
F = {A : A is finite or Ac is finite} .
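For a finite Ω the defining conditions of a σ-algebra can be checked mechanically (in the finite case, closure under complement and pairwise union is enough). The following Python sketch is only illustrative — the helper name is ours — and it tests the collection F_2 above with Ω = {1, 2, 3}, together with the σ-algebra generated by A = {2}.

from itertools import combinations

def is_sigma_algebra(omega, collection):
    # check: contains the empty set, closed under complement, closed under pairwise union
    sets = {frozenset(a) for a in collection}
    omega = frozenset(omega)
    if frozenset() not in sets:
        return False
    if any(omega - a not in sets for a in sets):
        return False
    if any(a | b not in sets for a, b in combinations(sets, 2)):
        return False
    return True

omega = {1, 2, 3}
F2 = [set(), {1, 2, 3}, {1, 3}, {2, 3}, {2}]
print(is_sigma_algebra(omega, F2))        # False: {2, 3}^c = {1} is missing
F_gen = [set(), {2}, {1, 3}, {1, 2, 3}]   # sigma-algebra generated by A = {2}
print(is_sigma_algebra(omega, F_gen))     # True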
Now, we are in a position to give an abstract definition of probability, called a probability measure, which
was given by Kolmogorov in 1933.
(b) P(Ω) = 1.
(c) For any sequence of pairwise disjoint events A_i ∈ F, that is, A_i ∩ A_j = φ for i ≠ j, we
have
P(∪_{i=1}^{∞} A_i) = Σ_{i=1}^{∞} P(A_i). [countable additivity property]
Question: We define the probability measure for pair-wise disjoint events only (Axiom (c)), why
not for all unions?
Answer: Define
B_i = A_i \ (∪_{j=1}^{i−1} A_j), for i = 1, 2, . . . .
The sets B_i are pairwise disjoint and ∪_{i=1}^{∞} B_i = ∪_{i=1}^{∞} A_i. Hence, we can convert any union into
a union of pairwise disjoint events, so it is enough to define the probability measure for pairwise
disjoint events. For example, the conversion of a union of 5 sets into a disjoint union of 5 sets can
be seen in the following figure.
Take F = P(Ω). Let A be the event that the sum showing on the two dice is equal to 11. Then
Property 1.3.1
P(φ) = 0.
Property 1.3.2
For any finite pairwise disjoint collection A_1, A_2, . . . , A_n ∈ F,
P(∪_{i=1}^{n} A_i) = Σ_{i=1}^{n} P(A_i).
Property 1.3.3
If A ⊂ B then P(B\A) = P(B) − P(A). Moreover, P(A) ≤ P(B), that is, P is monotone.
Proof. Since A ⊂ B, we can write B as the disjoint union
B = A ∪ (B\A)
=⇒ P(B) = P(A) + P(B\A)
=⇒ P(B\A) = P(B) − P(A).
Since P(B\A) ≥ 0, it also follows that P(A) ≤ P(B).
Property 1.3.4
For any A ∈ F , P(A) ≤ 1.
Property 1.3.5
Property 1.3.6
For any two events A, B ∈ F, P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
Proof. Write A ∪ B as the disjoint union
A ∪ B = A ∪ (B\(A ∩ B)).
Property 1.3.7
For any events A, B, C ∈ F,
P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(A ∩ B) − P(B ∩ C) − P(A ∩ C) + P(A ∩ B ∩ C).
Proof. Exercise.
A general form of the above result is called the inclusion-exclusion formula or general addition rule.
Theorem 1.3.1 [Inclusion-Exclusion Formula or General Addition Rule]
For n ≥ 2, let A_1, A_2, . . . , A_n be events. Then
P(∪_{i=1}^{n} A_i) = Σ_{i=1}^{n} P(A_i) − Σ_{1≤i<j≤n} P(A_i ∩ A_j)
+ Σ_{1≤i<j<k≤n} P(A_i ∩ A_j ∩ A_k) − · · · + (−1)^{n+1} P(∩_{i=1}^{n} A_i).
Now, consider
P(∪_{i=1}^{k+1} A_i) = P((∪_{i=1}^{k} A_i) ∪ A_{k+1})
= P(∪_{i=1}^{k} A_i) + P(A_{k+1}) − P((∪_{i=1}^{k} A_i) ∩ A_{k+1})
= P(∪_{i=1}^{k} A_i) + P(A_{k+1}) − P(∪_{i=1}^{k} (A_i ∩ A_{k+1})).
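On a finite, equally likely sample space the inclusion-exclusion formula can be verified numerically by computing both sides. A minimal sketch (the helper functions are ours, chosen for illustration):

from fractions import Fraction
from itertools import combinations

def prob(subset, omega):
    return Fraction(len(subset), len(omega))

def inclusion_exclusion(events, omega):
    # sum over non-empty index sets S of (-1)^(|S|+1) P(intersection of A_i, i in S)
    n, total = len(events), Fraction(0)
    for k in range(1, n + 1):
        for idx in combinations(range(n), k):
            inter = set.intersection(*(events[i] for i in idx))
            total += (-1) ** (k + 1) * prob(inter, omega)
    return total

omega = set(range(1, 13))
A = [{x for x in omega if x % 2 == 0},
     {x for x in omega if x % 3 == 0},
     {x for x in omega if x > 8}]
print(inclusion_exclusion(A, omega))    # 3/4, the right-hand side
print(prob(A[0] | A[1] | A[2], omega))  # 3/4, the left-hand side P(A1 ∪ A2 ∪ A3)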
Then
A^c = ∪_{i=1}^{4} B_i.
Note that
P(B_1) = P(none of the six cards is a spade) = (3/4)^6.
Similarly,
P(B_2) = P(B_3) = P(B_4) = (3/4)^6.
Next,
P(B_i ∩ B_j) = (2/4)^6 = (1/2)^6, for i ≠ j, i, j = 1, 2, 3, 4,
(ii) P(∩_{i=1}^{n} A_i) ≥ Σ_{i=1}^{n} P(A_i) − (n − 1).
(b) P(∩_{i=1}^{∞} A_i) ≥ 1 − Σ_{i=1}^{∞} P(A_i^c).
P(A | B) = P(A ∩ B)/P(B).
Lemma 1.4.1
P(· | B) is a valid probability function.
(a) If a family is chosen at random and is found to have a boy, what is the probability that the other
child is also a boy?
(b) If a child is chosen at random from these families and is found to be a boy, what is the probability
that the other child in that family is also a boy?
Solution.
(a) Note that Ω = {(b, b), (b, g), (g, b), (g, g)}. Let A be the event that the family has a boy. Then
P(A) = 3/4.
Let B be the event that both children are boys. Then
P(B | A) = P(A ∩ B)/P(A) = (1/4)/(3/4) = 1/3.
(b) Note that Ω = {(b, b), (b, g), (g, b), (g, g)}. Let A be the event that the chosen child is a boy. Then
P(A) = 1/2.
Let B be the event that the chosen child has a brother. Then
P(B | A) = P(A ∩ B)/P(A) = (1/4)/(1/2) = 1/2.
Notice the difference in (a) and (b). This is due to difference in selection policy.
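The difference between (a) and (b) comes entirely from the two sampling schemes, and it can be reproduced by listing the equally likely cases. A small enumeration sketch in Python:

from fractions import Fraction
from itertools import product

families = list(product("bg", repeat=2))   # (older child, younger child), all equally likely

# (a) pick a family at random and condition on "the family has a boy"
with_boy = [f for f in families if "b" in f]
both_boys = [f for f in with_boy if f == ("b", "b")]
print(Fraction(len(both_boys), len(with_boy)))      # 1/3

# (b) pick a child at random and condition on "the chosen child is a boy"
children = [(f, i) for f in families for i in range(2)]
boy_chosen = [(f, i) for f, i in children if f[i] == "b"]
has_brother = [(f, i) for f, i in boy_chosen if f[1 - i] == "b"]
print(Fraction(len(has_brother), len(boy_chosen)))  # 1/2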
Theorem 1.4.4 [Multiplication Rule]
Let P(A) > 0 and P(B) > 0. Then
P(A|B) = P(A ∩ B)/P(B) =⇒ P(A ∩ B) = P(A|B)P(B),
P(B|A) = P(A ∩ B)/P(A) =⇒ P(A ∩ B) = P(B|A)P(A).
In general, for events A_1, A_2, . . . , A_n,
P(∩_{i=1}^{n} A_i) = P(A_1) P(A_2|A_1) P(A_3|A_1 ∩ A_2) · · · P(A_n | ∩_{i=1}^{n−1} A_i).
Proof. We prove the result using induction on n. For n = 1, the statement clearly holds. Let the result
hold for n = k, that is,
P(∩_{i=1}^{k} A_i) = P(A_1) P(A_2|A_1) P(A_3|A_1 ∩ A_2) · · · P(A_k | ∩_{i=1}^{k−1} A_i).
P(A ∩ B) = Σ_{j=1}^{∞} P(A|B_j) P(B_j).
Moreover, if B = Ω then
P(A) = Σ_{j=1}^{∞} P(A|B_j) P(B_j).
This implies
P(A ∩ B) = P(∪_{i=1}^{∞} (A ∩ B_i)) = Σ_{i=1}^{∞} P(A ∩ B_i) = Σ_{i=1}^{∞} P(A|B_i) P(B_i).
P(B_i | A) = P(A|B_i) P(B_i) / Σ_{j=1}^{∞} P(A|B_j) P(B_j).
Indeed,
P(B_i | A) = P(A ∩ B_i)/P(A) = P(A|B_i) P(B_i) / Σ_{j=1}^{∞} P(A|B_j) P(B_j).
Similarly,
P(B_2 | A) = 15/49 and P(B_3 | A) = 30/49.
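Bayes' rule is straightforward to evaluate once the prior probabilities P(B_i) and the likelihoods P(A | B_i) are listed. The sketch below uses made-up numbers (three machines producing 50%, 30% and 20% of the output with defect rates 1%, 2% and 3%), not the data of the worked example above, purely to show the mechanics.

from fractions import Fraction

def bayes(priors, likelihoods):
    # posteriors P(B_i | A) = P(A|B_i) P(B_i) / sum_j P(A|B_j) P(B_j)
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)                      # total probability P(A)
    return [j / total for j in joint]

priors = [Fraction(5, 10), Fraction(3, 10), Fraction(2, 10)]
likelihoods = [Fraction(1, 100), Fraction(2, 100), Fraction(3, 100)]
print(bayes(priors, likelihoods))   # [5/17, 6/17, 6/17]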
P(A ∩ B) = P(A)P(B).
Three events A, B, C ∈ F are said to be mutually independent if
P(A ∩ B) = P(A)P(B),
P(A ∩ C) = P(A)P(C),
P(B ∩ C) = P(B)P(C),
P(A ∩ B ∩ C) = P(A)P(B)P(C).
Remark 1.5.1. The total number of conditions to determine the independence of n events is
C(n, 2) + C(n, 3) + · · · + C(n, n) = 2^n − n − 1.
If U contains only two events, then the concepts mutually independent and pairwise independent are
the same. However, if U contains more than two events, then they are different in the sense that
independence implies pairwise independence but not conversely.
Example 1.5.1. Tossing of two coins, we have Ω = {HH, HT, T H, T T }. Define the events
Therefore,
P(A) = P(B) = P(C) = 1/2,
P(A ∩ B) = P(B ∩ C) = P(A ∩ C) = 1/4,
P(A ∩ B ∩ C) = 1/4.
Note that
P(A ∩ B ∩ C) = 1/4 ≠ 1/8 = P(A)P(B)P(C).
Hence, A, B and C are pairwise independent but not mutually independent.
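The failure of mutual independence can be checked by brute force. In the sketch below we take A = "first coin shows H", B = "second coin shows H" and C = "both coins show the same face" — one standard choice of events consistent with the probabilities above, since the original definitions of A, B and C are not reproduced in this extract.

from fractions import Fraction
from itertools import product

omega = list(product("HT", repeat=2))
P = lambda E: Fraction(len(E), len(omega))

A = {w for w in omega if w[0] == "H"}    # first coin heads
B = {w for w in omega if w[1] == "H"}    # second coin heads
C = {w for w in omega if w[0] == w[1]}   # both coins show the same face

# pairwise independence holds
print(P(A & B) == P(A) * P(B), P(B & C) == P(B) * P(C), P(A & C) == P(A) * P(C))
# mutual independence fails: 1/4 != 1/8
print(P(A & B & C) == P(A) * P(B) * P(C))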
1.6 Exercises
1. Let Ω = {1, 2, 3, . . .} and
F = {A : A is countable or A^c is countable} .
2. If four married couples are arranged to be seated in a row, what is the probability that no husband
is seated next to his wife?
3. There are two kinds of tubes in an electronic gadget. It will cease to function if one of each kind is
defective. The probability that there is a defective tube of the first kind is 0.1; the probability that
there is a defective tube of the second kind is 0.2. It is known that two tubes are defective. What
is the probability that the gadget still works?
4. The figure (omitted here) shows a circuit whose five links have failure probabilities 1/5, 1/5, 1/3,
1/4 and 1/4. The numbers indicate the probabilities of failure for the various links, which are all
independent. What is the probability that the circuit is closed?
5. If two dice are thrown, what is the probability that the sum is (a) greater than 8 (b) neither 7 nor
11?
6. A card is drawn from a well-shuffled pack of playing cards. What is the probability that it is
either a spade or an ace?
7. A problem in mathematics is given to three students A, B and C whose chances of solving it
are 1/2, 3/4 and 1/4, respectively. What is the probability that the problem will be solved if all
of them try independently?
8. A consignment of 15 record players contains 4 defectives. The record players are selected at
random, one by one, and examined. Those examined are not put back. What is the probability
that the 9th one examined is the last defective?
then
σ(C) = ∩_{F ∈ I} F,
the intersection being taken over the collection I of all σ-algebras F containing C.
Chapter 2: Random Variable and its Distribution
Example 2.1.1. Consider Ω = [0, 1] and C = {[0, 0.3], [0.5, 1]} = {A_1, A_2}, say, and write
A_3 = (A_1 ∪ A_2)^c = (0.3, 0.5). Then
σ(C) = {φ, A_1, A_2, A_3, A_1 ∪ A_2, A_1 ∪ A_3, A_2 ∪ A_3, Ω} ,
C = {{x} : x ∈ Ω}
Remark 2.1.1. In general, it is not always possible to find explicit form of generated σ-algebra.
Theorem 2.1.1 [Equivalent Way to Define Borel σ-algebra]
The Borel σ-algebra on R is σ(C), the sigma algebra generated by each of the classes of sets C
described below:
f −1 (E) ∈ F1 .
Let X denote the number of tails. Then X : Ω → R can take the values 0, 1, 2, 3.
[Figure: the map X : Ω → R assigning to each outcome of three coin tosses its number of tails.]
The above definition can also be written in many equivalent forms by using the following theorem.
Theorem 2.2.3 [Equivalent Ways to Define a Random Variable]
Let X be defined on (Ω, F). Then X is a random variable if and only if any one of the following
conditions is satisfied.
Corollary 2.2.1
Let X be a random variable defined on a probability space (Ω, F ). Then,
X −1 (B) ∈ F .
Therefore, X −1 (B) is a measurable set and hence, we can use the axiomatic definition of proba-
bility to define the probability for a random variable X.
Axiometic definition
of probability
Ω
F
[ ]
0 1
[ | | | | ] | R
-2 -1 0 1 2
Theorem 2.2.4
A random variable X defined on a probability space (Ω, F, P) induces a probability space
(R, B_R, Q) by the correspondence
Ω
F
φ Ω
{T} {T} {H}
λ
| | ||| | R
-2 -1 0 1 2
In general,
φ
λ<0
{ω : x(ω) ≤ λ} = {T } 0 6 λ < 1
Ω λ>1
∈ F.
Let X denote the sum of the upward faces and F = P(Ω). Then X can take the values 2, 3, . . . , 12 and
{ω : X(ω) ≤ λ} = φ for λ < 2, = {(1, 1)} for 2 ≤ λ < 3, = {(1, 1), (1, 2), (2, 1)} for 3 ≤ λ < 4, . . . ,
and = Ω for λ ≥ 12, each of which belongs to F.
is called a distribution function of the random variable X (also called the cumulative distribution
function).
Property 2.3.1
lim_{x→−∞} F_X(x) = 0 and lim_{x→∞} F_X(x) = 1.
An = {ω : X(ω) ≤ xn } .
Then
lim An = φ.
n→∞
Therefore,
lim P (An ) = P lim An
n→∞ n→∞
=⇒ lim F (xn ) = P(φ) = 0
xn →∞
=⇒ lim F (x) = 0.
x→−∞
An = {ω : X(ω) ≥ xn } .
lim An = φ.
n→∞
Therefore,
lim P (An ) = P lim An
n→∞ n→∞
=⇒ lim (1 − F (xn )) = P(φ) = 0
xn →∞
=⇒ lim F (x) = 1.
x→∞
Property 2.3.2
If x1 < x2 then FX (x1 ) ≤ FX (x2 ) [FX is non-decreasing].
{ω : X(ω) ≤ x1 } ⊂ {ω : X(ω) ≤ x2 }
=⇒ P ({ω : X(ω) ≤ x1 }) ≤ P ({ω : X(ω) ≤ x2 })
=⇒ FX (x1 ) ≤ FX (x2 ) .
Property 2.3.3
limh→0 FX (x + h) = FX (x) [FX is right continuous].
An = {ω : x < X(ω) ≤ xn }
P(a ≤ X ≤ b) = P(a < X < b) = P(a ≤ X < b) = P(a < X ≤ b) = FX (b) − FX (a).
(ii) Any function F : R → R that satisfies Properties 2.3.1–2.3.3 is a CDF of some random variable.
For a given probability space and a random variable X associated with it, we now know how to define
the distribution function of X. The converse also holds. We state the theorem and omit the proof for
this course.
Theorem 2.3.5
For any given distribution function F , there exists a unique probability space and a random vari-
able X defined on the space such that F is a distribution function of X.
Example 2.3.1. Consider the experiment of rolling two dice, so that Ω = {(i, j) : 1 ≤ i, j ≤ 6}.
Let X denote the sum of the upward faces. Then
P(X = 2) = 1/36, P(X = 3) = 2/36, P(X = 4) = 3/36,
P(X = 5) = 4/36, P(X = 6) = 5/36, P(X = 7) = 6/36,
P(X = 8) = 5/36, P(X = 9) = 4/36, P(X = 10) = 3/36,
P(X = 11) = 2/36 and P(X = 12) = 1/36.
Therefore,
F_X(x) = P(X ≤ x) = 0 for x < 2, 1/36 for 2 ≤ x < 3, 3/36 for 3 ≤ x < 4, 6/36 for 4 ≤ x < 5, . . . ,
and 1 for x ≥ 12.
[Figure: the step-function graph of F_X(x).]
Note that F_X(x) is a valid distribution function as it satisfies all the necessary properties of a CDF.
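The pmf and the step-function cdf of Example 2.3.1 can be tabulated directly; a short sketch:

from fractions import Fraction
from itertools import product

# pmf of the sum of two fair dice
pmf = {}
for i, j in product(range(1, 7), repeat=2):
    pmf[i + j] = pmf.get(i + j, Fraction(0)) + Fraction(1, 36)

# cdf F_X(x) = P(X <= x), accumulated over the support
cdf, running = {}, Fraction(0)
for s in sorted(pmf):
    running += pmf[s]
    cdf[s] = running

print(pmf[7])    # 6/36
print(cdf[4])    # 6/36
print(cdf[12])   # 1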
FX (x)
0.5
x
-4 -3 -2 -1 0 1 2 3 4
Note that FX (·) is not a valid CDF as it is not right continuous at x = 1/2.
The probability mass function associated with a discrete random variable satisfies the following condi-
tions.
Remark 2.4.2. In practice, the above properties can be used to roughly check whether the answer
obtained by a student is correct or not.
If the pmf is known to us then the distribution function of a discrete random variable X can be obtained
in terms of the pmf and is given by
X
FX (x) = P(X ≤ x) = pX (y), x ∈ R,
y∈C
y≤x
pX (xi ) = P(X = xi )
= P(X ≤ xi ) − P(X < xi )
= P(X ≤ xi ) − P(X ≤ xi−1 )
= FX (xi ) − FX (xi−1 ).
Remark 2.4.3. (i) For a discrete random variable, the cdf is a step function having jump pX (xi ) at
i.
·
·
·
·
·
·
| | | | | x
x1 x2 x3 ... xixi+1. . .
pX (x)
0.5
x
0 1 2 3 4
Remark 2.4.4. In the above example, observe that if the cdf is known then the pmf can be calculated
as
21 21
pX (0) = FX (0) − FX (−1) = −0= ,
45 45
42 21 21
pX (1) = FX (1) − FX (0) = − = ,
45 45 45
42 3
pX (2) = FX (2) − FX (1) = 1 − = .
45 45
Moreover,
21
P(X < 1) = P(X = 0) = , P(1 < X < 2) = 0,
45
21 21 24
P(1 ≤ X < 2) = P(X = 1) = , P(X ≥ 1) = 1 − P(X < 1) = 1 − = ,
45 45 45
21 3 24
P(1 ≤ X ≤ 2) = P(X = 1) + P(X = 2) = + = .
45 45 45
The function fX (·) is called the probability density function (pdf) of the random variable X.
Remark 2.5.1. (i) For a continuous random variable, P(X = x) = 0, for all x ∈ R.
(ii) Using the pdf, we can calculate the probability for a continuous random variable and is given by
Z b
P(a < X ≤ b) = fX (t)dt = FX (b) − FX (a)
a
(iv) Note the difference between the pmf and the pdf. The value f_X(x) of a pdf is not itself a probability,
and it may exceed one; for example,
f_X(x) = 2 for 0 ≤ x ≤ 1/2, and 0 otherwise.
(b) Compute P X ∈ 12 , 34 .
Solution.
(b) Consider
1 3 1 3
P X∈ , =P <X<
2 4 2 4
Z 3/4
= fX (x)dx
x
Z 3/4
11
6 x − x2 dx = .
=
1/2 32
Note that
d d
3x2 − 2x3 = 6x − 6x2 = 6x(1 − x),
fX (x) = FX (x) = for 0 ≤ x ≤ 1.
dx dx
fX (x)
1
0.75
0.5
0.25
x
0 1
Note that the random variable X is neither discrete nor continuous; it is a mixture of both. Therefore,
the random variable X is of mixed type. Observe that
P(0 < X ≤ 1) = ∫_{0}^{1} (3/4) dx = 3/4 and P(X = 0) = 1/4 =⇒ P(0 ≤ X ≤ 1) = 1.
The cdf of X is given by
F_X(x) = 0 for x < 0, 1/4 for x = 0, 1/4 + (3/4)x for 0 < x ≤ 1, and 1 for x ≥ 1.
[Figure: graph of F_X(x), with a jump of size 1/4 at x = 0.]
Remark 2.7.1. In general, the expectation of any function g(X) is defined as
E(g(X)) = Σ_{i=1}^{∞} g(x_i) p_X(x_i), if X is a discrete random variable, and
E(g(X)) = ∫_{−∞}^{∞} g(x) f_X(x) dx, if X is a continuous random variable,
Properties of Expectation
(a) E(c) = c, for any constant c.
(c) E (c1 g1 (X) + c2 g2 (X) + · · · + ck gk (X)) = c1 E(g1 (X)) + c2 E(g2 (X)) + · · · + ck E (gk (X)).
(d) If g1 (x) ≤ g2 (x), for all x ∈ R then E (g1 (X)) ≤ E(g2 (X)).
Remark 2.7.2. (i) The mathematical expectation is also known as average or measure of central
tendency or measure of location for a distribution.
(ii) One of the best applications of mathematical expectation is to compute the CPI for B.Tech/IDD
students at IIT (BHU). The formula to compute the CPI for a semester with 5 courses is given by
credit of paper 1 credit of paper 5
CPI = (grade in paper 1) × + · · · + (grade in paper 5) × .
total number of credits total number of credits
(iii) Students are often confused about why we need E(|X|) < ∞ for the existence of E(X). It is expected
that the average should be unique for a random variable. In particular, in the discrete case we are
dealing with a series, and no order of summation is specified in the definition of the mathematical
expectation. We know that if the series is absolutely convergent then every rearrangement has the
same sum, but this can fail otherwise. For example, the alternating series 1 − 1/2 + 1/3 − 1/4 + · · ·
sums to a value less than 5/6. Consider a rearrangement of the above series where two positive
terms are followed by one negative term:
(1 + 1/3 − 1/2) + (1/5 + 1/7 − 1/4) + (1/9 + 1/11 − 1/6) + · · · ,
where the first group equals 5/6 and every subsequent group is positive. Since
1/(4k − 3) + 1/(4k − 1) − 1/(2k) > 0,
the rearranged series sums to a value greater than 5/6.
Similarly, for a continuous random variable, if the integral is absolutely convergent then its value
is unique; however, this may not be true for conditionally convergent integrals. For example,
consider the Cauchy distribution with pdf
f_X(x) = (1/π) · 1/(1 + x^2), for −∞ < x < ∞.
Note that the integral ∫_{−∞}^{∞} x/(x^2 + 1) dx can be interpreted as, for instance,
lim_{R→∞} ∫_{−R}^{kR} x/(x^2 + 1) dx = lim_{R→∞} (1/2) log((1 + k^2R^2)/(1 + R^2)) = log(k), where k > 0.
You could also take other functions of R such that the lower limit tends to negative infinity and the
upper limit tends to infinity as R → ∞ to get different answers.
So, ∫_{−∞}^{∞} x/(x^2 + 1) dx is not zero and in fact cannot be assigned any value unless you know how the
lower limit and upper limit approach ∞. This arises due to the fact that the integral does not converge
absolutely on (−∞, ∞), that is,
∫_{−∞}^{∞} |x| f(x) dx = ∫_{−∞}^{∞} (1/π) |x|/(1 + x^2) dx = ∞.
Therefore, ∫ x f(x) dx is well-defined and exists only when ∫ |x| f(x) dx < ∞.
Hence, we need absolute convergence for the mathematical expectation to exist.
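The practical consequence of a non-existent mean is easy to see in simulation: for the Cauchy distribution the running sample mean never settles down, no matter how many observations are averaged. A quick sketch (output varies from run to run):

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_cauchy(1_000_000)            # standard Cauchy samples
running_mean = np.cumsum(x) / np.arange(1, x.size + 1)

# the running mean keeps jumping even after a million observations
for n in (10**2, 10**3, 10**4, 10**5, 10**6):
    print(n, running_mean[n - 1])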
Example 2.7.1. Consider the experiment of rolling of two dice. Let X denote the absolute difference
of the upturned faces, find the expectation of X.
Solution. Given X(ω) = |i − j| if ω = (i, j), for i, j = 1, 2, . . . , 6. Therefore,
Example 2.7.2. If X has pdf fX (x) = e−x , x > 0. Find the mean of X.
Solution. Note that
Z ∞ Z ∞
µ = E(X) = xfX (x)dx = xe−x dx.
0 0
(−1)j+1 3j
2
P X= = j , j = 1, 2, 3, . . . .
j 3
Then,
∞ ∞ ∞
X (−1)j+1 3j 2 X 3j 2 X 1
E|X| = j
= · j
= 2 ,
j=1
j 3 j=1
j 3 j=1
j
Then,
Z ∞ Z ∞ ∞
1 2x 1
E(X) = xfX (x)dx = dx = log(x) ,
−∞ 2π −∞ 1 + x2 2π −∞
In particular,
(a) µ0k = E(X k ) is called the kth moment about origin or the kth non-central moment.
(b) µk = E(X − µ)k is called the kth moment about the mean or the kth central moment.
Remark 2.7.3. (i) We can always write central/non-central moments in terms of non-central/central
moments using binomial expansion.
(ii) The 1st non-central moments is the mean of X and the 1st central moment is zero.
(iii) Moments gives the information about the shape of the distribution.
= E(X 2 ) + µ2 − 2µE(X)
= E(X 2 ) + µ2 − 2µ2
= E X 2 − (E(X))2 .
1 1
0.5 0.5
0.15 0.15
0.1 0.1
x y
0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10
From the figure, it can be easily visualized that the variance of X is less than the variance of Y .
| | |
α−x α α+x
FX (α − x) = 1 − FX (α + x) + P(X = α + x).
That is, if a random variable is symmetric about 0 then its pdf should be an even function.
Next, we move to define skewness and kurtosis which are related the third and forth central moments.
Literally, skewness is a statistical number that tells us whether a distribution is symmetric or not.
(ii) If β1 > 0, then the right tail is longer than the left tail. In this case, it is called right-skewed or
positively skewed and
mean > median > mode.
mode
median
mean
(iii) If β1 < 0, then the left tail is longer than the right tail. In this case, it is called left-skewed or
negatively skewed and
mean < median < mode.
mode
median
mean
Remark 2.7.7. (i) β2 + 3 and β2 are called simple and excess kurtosis, respectively.
(ii) Kurtosis tells us about the peakedness of the distribution. In particular, if β2 > 0, β2 = 0 or
β2 < 0 then the distribution is called leptokurtic, mesokurtic (normal peak) or platykurtic, respectively.
is called the moment generating function (mgf) of a the random variable X, provided the series
and integral exist for some t 6= 0.
(a) MX (0) = 1.
(c) Moment generating function uniquely determines the distribution. In other words, if X and Y
are two random variables and
or equivalently X and Y have the same distribution. This statement is not equivalent to the
statement “if two distributions have the same moments, then they are identical at all points”.
This is because in some cases, the moments exist and yet the moment-generating function does
not, because the limit
n
X ti
lim E(X i )
n→∞
i=0
i!
(k) dk
E(X k ) = MX (t) = MX (t) .
t=0 dtk t=0
(k) t t2
MX (t) = E(X k ) + E(X k+1 ) + E(X k+2 ) + · · · .
1! 2!
Hence,
(k)
E(X k ) = MX (t) .
t=0
(b) ϕX (0) = 1.
(e) ϕX (t) = ϕY (t), for all t if and only if FX (x) = FY (x), for all x.
(k)
(g) If E|X|k < ∞, for k ≥ 1, then ϕX (t) exists, and
1 (k)
E(X k ) = ϕ (t) .
ik X t=0
(a) Note that the pgf always exists for |t| < 1, since,
X X
E(|tX |) = |tx |pX (x) ≤ pX (x) = 1.
x x
(b) GX (1) = 1.
Hence,
1 (k)
pX (k) = G (t) .
k! X t=0
P(X = x) = q x−1 p, x = 1, 2, 3, . . . ,
where p + q = 1. Find the moment generating function, and hence the mean and variance. Also, find
the characteristic function and the probability generating function of X.
Solution. Note that
∞ ∞
X pX
tx x−1
x
MX (t) = e q p= qet
x=1
q x=1
qet pet
p
= = , for qet < 1 or t < − ln(q).
q 1 − qet 1 − qet
Now,
pet 1
MX0 (t) = =⇒ E(X) = MX0 (t) = .
(1 − qet )2 t=0 p
Also,
and
∞ ∞
X pX
GX (t) = x x−1
t q p= (qt)x
x=1
q x=1
p qt pt
= = , for t < 1/q.
q 1 − qt 1 − qt
(b) 2, 7, 9, 15, 9, 3, 7, 5.
(c) 1, 2, 3, 4, 5.
(d) 4, 5, 6, 4, 5, 6.
2 2
1 1
0 1 2 3 4 5 6 7 8 9 101112131415 0 1 2 3 4 5 6 7 8 9 101112131415
(a) mode = 10 (b) mode = 7 and 9
1
1
0 1 2 3 4 5 0 1 2 3 4 5 6
(c) No mode possible (d) No mode possible
Remark 2.8.1. (i) If a set of observations has one mode, two modes or more than two modes then it is
called uni-modal, bimodal or multi-modal, respectively.
(ii) A frequency table can also be used to obtain the mode(s). For instance, Example 2.8.1(a) can also
be written as
No. 4 5 6 10 12 15
Frequency 1 1 2 3 1 1
0 1 2 3 4 5 6 7 8
Mode = 1
x x
fX (x) fX (x)
no mode
mode mode
x x
fX (x)
mode
0 (x) = 0
fX
fX (x)
0 (x) = 0
fX
| x
0 1 4/3 2
Also,
3 4
fX00 (x) = (4 − 6x) =⇒ fX00 (0) = 3 > 0 and fX00 = −3 < 0.
4 3
(≥ p) (≥ 1 − p) (≥ p) (≥ 1 − p)
Qp
Qp
(a) Discrete Random Variable (b) Continuous Random Variable
Qp
(c) The quantile may not be unique for a discrete random variable but it is unique for a continuous
random variable.
(d) If p = 12 then M = Q 1 is called the median for the random variable X. That is, M is called the
2
median of a random variable X if
1 1
P(X ≤ M ) ≥ and P(X ≥ M ) ≥ .
2 2
(≥ 1/2)
(≥ 1/2) (≥ 1/2) (≥ 1/2)
M
M
(a) Discrete Random Variable (b) Continuous Random Variable
1/2
(f) The numbers Q 1 , Q 1 and Q 3 are called quartiles of the random variable X. That is,
4 2 4
1 3 1
P X ≤ Q1 ≥ , P X ≥ Q1 ≥ , P X ≤ Q1 ≥ ,
4 4 4 4 2 2
1 3 1
P X ≥ Q1 ≥ , P X ≤ Q3 ≥ and P X ≥ Q 3 ≥ .
2 2 4 4 4 4
(≥ 1/4)
(≥ 3/4) (≥ 1/4) (≥ 3/4)
Q1 Q1
4 4
(a) Discrete Random Variable (b) Continuous Random Variable
(≥ 1/2)
(≥ 1/2) (≥ 1/2) (≥ 1/2)
Q1
Q1 2
2
(a) Discrete Random Variable (b) Continuous Random Variable
(≥ 3/4)
(≥ 1/4) (≥ 3/4) (≥ 1/4)
Q3 Q3
4 4
(a) Discrete Random Variable (b) Continuous Random Variable
Example 2.8.4. Find the median of the random variable having Pmf
x −2 0 1 2
1 1 1 1
pX (x) 4 4 3 6
pX (x)
(≥ 1/2) (≥ 1/2)
x
-2 0 Median 1 2
Now, consider
1
FX Q1 or Q 1 =
4 4
1 π 1
=⇒ tan−1 (Q1 ) + =
π 2 4
−1 π
=⇒ tan (Q1 ) = −
4
=⇒ Q1 = −1.
Next, consider
1
FX (M or Q 1 ) =
2 2
1 −1 π 1
=⇒ tan (M ) + =
π 2 2
−1
=⇒ tan M = 0
=⇒ M = 0.
2.9 Exercises
1. A coin is tossed three times. If X : Ω → R is such that X counts the number of heads. Show
that X is a random variable. [Assume F = P(Ω)]
x
2. A random variable X can take values 0, 1, 2, . . ., with probability, proportional to (x + 1) 15 .
Find P(X = 0).
3. If f1 (x) and f2 (x) are pdfs then show that (θ + 1)f1 (x) −θf2 (x), 0 < θ < 1, is a pdf.
fX (x) = |1 − x|, 0 ≤ x ≤ 2.
7. The experiment is to put two balls into five boxes in such a way that each ball is equally likely to
fall in any box. Let X denote the number of balls in the first box.
10. Find the quartiles for the random variable having pdf
λ
fX (x) = , −∞ < x < ∞.
π(λ2 + (x − µ)2 )
As it turns out, there are some specific distributions that are used over and over in practice, thus they
have been given special names. There is a random experiment behind each of these distributions. Since
these random experiments model a lot of real life phenomenon, these special distributions are used
frequently in different applications. That’s why they have been given a name and we devote a chapter
to study them. We will provide pmfs for all of these special random variables, but rather than trying
to memorize the pmf, you should understand the random experiment behind each of them. If you
understand the random experiments, you can simply derive the pmfs when you need them. Although it
might seem that there are a lot of formulas in this chapter, there are in fact very few new concepts. Do
not get intimidated by the large number of formulas, look at each distribution as a practice problem on
discrete random variables.
pX (x)
1
N−
| | | | x
x1 x2 ... xN −1 xN
Chapter 3: Special Discrete Distributions
Example 3.1.1. Tossing a coin, rolling a die and drawing a card from a well-shuffled deck are examples
of the discrete uniform distribution.
Theorem 3.1.1
If a random variable X follows discrete uniform distribution with range {1, 2, . . . , N } then
N +1 N2 − 1
E(X) = , Var(X) = and
( 2t N t 12
e (e −1)
MX (t) = N (et −1)
, t 6= 0
1 t = 0.
and
N N
2
X
2
X x2 1 N (N + 1)(2N + 1)
E X = x pX (x) = = · .
x=1 x=1
N N 6
Therefore,
N2 − 1
Var(X) = E X 2 − (E(X))2 =
.
12
The mgf of X is given by
N
X
tX
etx pX (x)
MX (t) = E e =
x=1
N t Nt
1 X e e − 1
= etx = , for t 6= 0.
N x=1
N (et − 1)
P (X = c) = 1.
| x
c
Example 3.2.1. Examples of degenerate distribution include a two-headed coin and rolling a die whose
all sides show the same number.
Remark 3.2.1. While the degenerate distribution does not appear random in the everyday sense of the
word, it does satisfy the definition of random variable.
Theorem 3.2.2
If a random variable X follows degenerate distribution then
E(X) = c × 1 = c
and
E X 2 = c2 × 1 = c2 .
Therefore,
Var(X) = E X 2 − (E(X))2 = 0.
pX (x)
1
p −
1−p
| | x
0 1
Remark 3.3.1. (i) The number p is called the parameter of Bernoulli distribution.
Therefore,
P(X = 0) = q × q × · · · × q = q^n,
P(X = 1) = C(n, 1) p q^{n−1},
P(X = 2) = C(n, 2) p^2 q^{n−2},
...
P(X = x) = C(n, x) p^x q^{n−x}, for x = 0, 1, . . . , n,
which is the pmf of binomial distribution. The formal definition can be given as follows:
Remark 3.4.1. (i) The numbers n and p are called the parameters of binomial distribution.
(ii) If X follows binomial distribution with parameters n and p then it is denoted by X ∼ B(n, p).
(iii) We know that
n n
n
X n x n−x X n n−x x
(a + b) = a b = a b ,
x=0
x x=0
x
It can be easily seen that the probabilities of binomial distribution are the terms of the binomial
expansion of (p + q)n . That is why, it is called the binomial distribution.
Theorem 3.4.4
If a random variable X follows binomial distribution with parameters n and p then
Therefore,
(d) Consider
n n
itX
X
itxn x n−x X
itx
ϕX (t) = E e = e pX (x) = e p q
x=0 x=0
x
n
X n x n
= peit q n−x = q + peit .
x=0
x
(e) Consider
n n
x n
X X
X x
px q n−x
GX (t) = E t = t pX (x) = t
x=0 x=0
x
n
X n
= (pt)x q n−x = (q + pt)n .
x=0
x
and
4
E (X 4 ) − 4µE (X 3 ) + 6µ2 E (X 2 ) − 3µ4
X −µ
β2 = E −3= − 3.
σ σ4
Form (a), (b) and (c), we have µ = np, σ 2 = npq and MX (t) = (q + pet )n . So,
n−1
MX0 (t) = npet q + pet
n−1 n−2
MX00 (t) = npet q + pet + n(n − 1)p2 e2t q + pet
n−2 n−3
MX000 (t) = MX00 (t) + 2n(n − 1)p2 e2t q + pet + n(n − 1)(n − 2)p3 e3t q + pet
Therefore,
= np(q + p)n−1 + n(n − 1)p2 (q + p)n−2 + 2n(n − 1)p2 + n(n − 1)(n − 2)p3
= np + 3n(n − 1)p2 + n(n − 1)(n − 2)p3
= np + 3n2 p2 − 3np2 + n3 p3 − 3n2 p3 + 2np3
and
Hence,
E (X 3 ) − 3µσ 2 − µ3
β1 =
σ3
np + 3n p − 3np2 + n3 p3 − 3n2 p3 + 2np3 − 3n2 p2 (1 − p) − n3 p3
2 2
=
(np(1 − p))3/2
np − 3np2 + 2np3 np(1 − p)(1 − 2p) 1 − 2p 1 − 2p
= 3/2
= 3/2
=p = √
(np(1 − p)) (np(1 − p)) np(1 − p) npq
> 0, if p < 21 (positively skewed))
= = 0, if p = 12 (symmentric)
< 0, if p > 12 (negatively skewed).
Let us observe some example by considering the different value of p in the following figures:
pX (x)
pX (x)
pX (x)
x x x
n = 30 and p = 0.2 n = 30 and p = 0.5 n = 30 and p = 0.8
1
= 2 2 np + 3n2 p2 − 3np2 + n3 p3 − 3n2 p3 + 2np3 + 4n2 p2 − 4np2 + 5n3 p3
n p (1 − p)2
− 15n2 p3 + 10np3 + n4 p4 − 6n3 p4 + 11n2 p4 − 6np4 − 4n2 p2 − 12n3 p3 + 12n2 p3
−4n4 p4 + 12n3 p4 − 8n2 p4 + 6n4 p4 + 6n3 p3 − 6n3 p4 − 3n4 p4 − 3n2 p2 + 6n2 p3 − 3n2 p4
Let us observe some example by considering the different value of p in the following figures:
pX (x)
pX (x)
pX (x)
x x x
n = 30 and p = 0.3 n = 30 and p = 0.211325 n = 30 and p = 0.17
P(X = x)
P(X = x − 1)
P(X = x + 1)
pX (x)
x
First consider
n x+1 n−x+1 n x x−x
p q ≤ p q
x+1 x
n! n!
=⇒ px−1 q n−x+1 ≤ px q nx
(n − x + 1)!(x − 1)! x!(n − x)!
q p
=⇒ ≤
n−x+1 x
=⇒ (1 − p)x ≤ p(n − x + 1)
=⇒ x − px ≤ np − px + p
=⇒ x ≤ (n + 1)p.
Similarly,
Therefore, we have.
(n + 1)p − 1 ≤ x ≤ (n + 1)p.
Case I: If (n + 1)p is an integer then x = (n + 1)p − 1 and x = (n + 1)p both are modes for binomial
distribution. That is, the distribution is bimodal.
Case II: If (n + 1)p is not an integer then the mode is the integral part of (n + 1)p.
Example 3.4.2. An experiment succeeds twice as often as it fails. Find the chance that in the next six
trials there will be at least four successes.
Solution. Let X denote the number of successes for the given experiment. Note that the success
probability is p = 2/3 and n = 6, and therefore X ∼ B(6, 2/3). Hence,
P(X = x) = C(6, x) (2/3)^x (1/3)^{6−x}, x = 0, 1, 2, 3, 4, 5, 6.
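The required probability is P(X ≥ 4) = Σ_{x=4}^{6} C(6, x)(2/3)^x(1/3)^{6−x} = 496/729 ≈ 0.68. A quick check with exact rational arithmetic in Python:

from fractions import Fraction
from math import comb

p, n = Fraction(2, 3), 6
pmf = lambda x: comb(n, x) * p**x * (1 - p)**(n - x)

prob = sum(pmf(x) for x in range(4, n + 1))
print(prob)          # 496/729
print(float(prob))   # about 0.6804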
(a) The number of outcomes/occurrence during disjoint time interval are independent.
(b) The probability of a single occurrence during a small time interval is proportional to the length
of the interval.
(c) The probability of more than one occurrence during a small time interval is negligible.
That is, if X(t) is the number of occurrences in (0, t] then, for very small δ,
P(X(δ) = k) ≈ 1 − λδ if k = 0, ≈ λδ if k = 1, and ≈ 0 if k > 1, (3.5.1)
where λ is the expected number of occurrences per unit time. In this case {X(t), t ≥ 0} is called a
Poisson process.
Theorem 3.5.5
Under the assumptions (a), (b) and (c), we have
e−λt (λt)n
Pn (t) := P(X(t) = n) = , n = 0, 1, 2, . . . .
n!
Proof. We use induction of n to prove the result. First, let n = 0 and consider
P0 (t + h) − P0 (t)
= −λP0 (t)
h
Taking limit h → 0, we get
P0 (t) = e−λt .
Hence, the result holds for n = 0. Suppose the result is true for n ≤ k. Consider
e−λt (λt)k
= Pk+1 (t)(1 − λh) + (λh) (using induction hypothesis and (3.5.1))
k!
This implies
0 λk+1 tk
Pk+1 (t) = −λPk+1 (t) + e−λt .
k!
This implies
(λt)k+1 e−λt
Pk+1 (t) = + c1 .
(k + 1)!
e−λt (λt)k+1
Pk+1 (t) = .
(k + 1)!
e−15 (15)2
P(X(3) = 2) = .
2!
Now, we are in a position to define the Poisson distribution. The Poisson process is a statistical process
with independent time increments, where the number of events occurring in a time interval is modelled
by a Poisson distribution. The formal definition can be given as follows:
P(X = x) = e^{−λ} λ^x / x!, for λ > 0 and x = 0, 1, . . ..
Remark 3.5.1. (i) The number λ is called the parameter of the Poisson distribution.
(a) E(X) = λ.
(b) Var(X) = λ.
t
(c) MX (t) = eλ(e −1) .
it −1)
(d) ϕX (t) = eλ(e .
(b) Consider
∞ ∞
X X e−λ λx
E (X(X − 1)) = x(x − 1)P(X = x) = x(x − 1)
x=2 x=2
x!
∞ ∞
2 −λ
X λx−2 2 −λ
X λj
=λ e =λ e (j = x − 2)
x=2
(x − 2)! j=0
j!
= λ2 e−λ × eλ = λ2 .
This implies
E(X 2 ) = λ2 + E(X) = λ2 + λ.
Therefore,
Var(X) = E X 2 − (E(X))2 = λ.
and
4
E (X 4 ) − 4µE (X 3 ) + 6µ2 E (X 2 ) − 3µ4
X −µ
β2 = E −3= − 3.
σ σ4
t
Form (a), (b) and (c), we have µ = λ, σ 2 = λ and MX (t) = eλ(e −1) . So,
MX0000 (t) = λ3 λet + 3 eλ(e +3t + 3λ2 λet + 2 eλ(e +2t + λ λet + 1 eλ(e
t−1) t−1) t−1)+t
= λ4 eλ(e −1)+4t + 6λ3 eλ(e −1)+3t + 7λ2 eλ(e −1)+2t + λeλ(e −1)+t .
t t t t
Therefore,
and
E X 4 = λ4 eλ(e −1)+0 + 6λ3 eλ(e −1)+0 + 7λ2 eλ(e −1)+0 + λeλ(e −1)+0
0 0 0 0
= λ4 + 6λ3 + 7λ2 + λ
Hence,
λ=1
λ=5
λ=9
pX (x)
Now, we will look at the Poisson distribution as a limiting case of the binomial distribution in the
following theorem.
Theorem 3.5.7
Let X ∼ B(n, p). If p → 0 and np → λ as n → ∞ then
P(X = x) → e^{−λ} λ^x / x!.
Proof. Consider
P(X = x) = C(n, x) p^x (1 − p)^{n−x}
≈ (n!/(x!(n − x)!)) (λ/n)^x (1 − λ/n)^{n−x}
= (n(n − 1) · · · (n − x + 1)/n^x) (λ^x/x!) (1 − λ/n)^n (1 − λ/n)^{−x}
= (n/n) ((n − 1)/n) · · · ((n − x + 1)/n) (λ^x/x!) (1 − λ/n)^n (1 − λ/n)^{−x}
→ e^{−λ} λ^x / x!  as n → ∞.
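The limit in Theorem 3.5.7 can be seen numerically: for large n and small p with np = λ fixed, the binomial pmf is already very close to the Poisson pmf. A short comparison sketch:

from math import comb, exp, factorial

def binom_pmf(x, n, p):
    return comb(n, x) * p**x * (1 - p)**(n - x)

def poisson_pmf(x, lam):
    return exp(-lam) * lam**x / factorial(x)

lam = 3.0
for n in (10, 100, 1000):
    p = lam / n
    # compare P(X = 2) under B(n, p) with the Poisson(lambda) value
    print(n, binom_pmf(2, n, p), poisson_pmf(2, lam))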
(b) There are only two possible outcomes for each trial, often designated success or failure.
In a sequence of Bernoulli trials, let X denote the number of trials needed to get the first success. Then X
can take the values {1, 2, . . .} and
P(X = 1) = p,
P(X = 2) = qp,
...
P(X = x) = q^{x−1} p, for x = 1, 2, . . .,
which is the pmf of the geometric distribution. The formal definition can be given as follows:
Remark 3.6.1. (i) The number p is called the parameter of geometric distribution.
E(X) = 1/p, Var(X) = q/p^2 and M_X(t) = p e^t/(1 − q e^t), for t < − ln(q).
and
E(X(X − 1)) = Σ_{x=2}^{∞} x(x − 1) q^{x−1} p = pq Σ_{x=2}^{∞} x(x − 1) q^{x−2}
= 2pq/(1 − q)^3 = 2q/p^2.
This implies
E(X^2) = 2q/p^2 + 1/p.
Hence,
Var(X) = E(X^2) − (E(X))^2 = 2q/p^2 + 1/p − 1/p^2 = q/p^2.
Next, consider
M_X(t) = E(e^{tX}) = Σ_{x=1}^{∞} e^{tx} q^{x−1} p = (p/q) Σ_{x=1}^{∞} (q e^t)^x = p e^t/(1 − q e^t), provided q e^t < 1, that is, t < − ln(q).
Example 3.6.2. Show that the geometric distribution satisfies the memoryless property.
Solution. Let X ∼ Geo(p). Then
P(X = x) = q^{x−1} p, x = 1, 2, . . . .
Note that
P(X > n) = Σ_{x=n+1}^{∞} P(X = x) = p Σ_{x=n+1}^{∞} q^{x−1} = pq^n (1 + q + q^2 + · · · ) = q^n.
Therefore,
P(X > n + m | X > m) = P(X > n + m, X > m)/P(X > m)
= P(X > n + m)/P(X > m) = q^{n+m}/q^m = q^n = P(X > n).
X
X|X > 4
pX (x)
P(Y = y) = q y p, y = 0, 1, 2, . . . . (3.6.3)
pY (y)
Theorem 3.6.9
If a random variable Y follows geometric distribution defined in (3.6.3) then
q q p
E(X) = , Var(X) = and MX (t) = , t < − ln(q).
p p2 1 − qet
and
∞ ∞
X
x 2
X 2q 2
E(X(X − 1)) = x(x − 1)q p = pq x(x − 1)q x−2 = .
x=2 x=2
p2
This implies
2q 2 2q 2 q
E(X 2 ) = + E(X) = + .
p2 p2 p
Therefore,
2q 2 q q 2 q
Var(X) = E(X 2 ) − (E(X))2 = + − = .
p2 p p2 p2
P(X = x) = C(x − 1, r − 1) p^r q^{x−r}, x = r, r + 1, . . . .
Remark 3.7.1. (i) The numbers r and p is called the parameters of negative binomial distribution.
(ii) If X follows negative binomial distribution with parameters r and p then it is denoted by X ∼
NB(r, p).
First, consider
∞ ∞
X X x − 1 r x−r
E(X) = xP(X = x) = x pq
x=r x=r
r−1
r X∞
p x − 1 x−1
=q x q
q x=r
r − 1
r r−1
p r q
=q × 2 (using (3.7.2))
q p p
r
= .
p
Now, consider
∞ ∞
X X x − 1 r x−r
E(X(X − 1)) = x(x − 1)P(X = x) = x(x − 1) pq
x=r x=r
r−1
r X ∞
2 p x − 1 x−2
=q x(x − 1) q
q x=r
r − 1
r r−2
2 p r(r + 2q − 1) q
=q × (using (3.7.3))
q p4 p
r(r + 2q − 1)
= .
p2
This implies
r(r + 2q − 1) r(r + 2q − 1) r
E(X 2 ) = 2
+ E(X) = + .
p p2 p
Theorem 3.7.11
If a random variable Y follows geometric distribution defined in (3.6.3) then
rq q
E(X) = , Var(X) =
p p2
and
r
p
MX (t) = , t < − ln(q).
1 − qet
Proof. Following the similar steps to the proof of Theorem 3.7.10, the results follow.
P(X = x) = C(k, x) C(N − k, n − x) / C(N, n), for x = 0, 1, . . . , n with max(0, n − N + k) ≤ x ≤ min(n, k).
Remark 3.8.1. (i) The numbers k, N and n is called the parameters of hypergeometric distribution.
Proof. Consider
n n k N −k
X X x n−n
E(X) = xP(X = x) = x N
x=1 x=1 n
k−1 N −1−(k−1)
n
kn X x−1 n−1−(x−1)
= N −1
N x=1 n−1
n−1 k−1 N −1−(k−1)
kn X l n−1−l
= N −1
(l = x − 1)
N l=0 n−1
kn
= .
N
Theorem 3.8.13
Let X ∼ Hypergeo(k, N, n). If k/N → p as k → ∞ and N → ∞, then
P(X = x) → C(n, x) p^x (1 − p)^{n−x}.
3.9 Exercises
1. If on an average 1 vessel in every 10 is wrecked, find the probability that out of 5 vessels expected
to arrive, at least 4 will arrive safely.
2. In a precision bombing attack, there is a 50% chance that a bomb will strike the target. Two
direct hits are required to destroy the target completely. How many bombs must be dropped to
give at least a 99% chance of completely destroying the target?
3. Suppose that the average number of telephone calls arriving at the switchboard of an office is 30
calls per hour.
(a) What is the probability that no calls arrive in a 3-minute period?
(b) What is the probability that more than five calls arrive in a 5-minute period?
6. A vaccine for desensitizing patients to bee stings is to be packed with 3 vials in each box. Each vial
is checked for strength before packing. The probability that a vial meets the specifications is 0.9.
Let X denote the number of vials that must be checked to fill a box. Find
8. Suppose that X has a binomial distribution with parameters n and p and Y has a negative bino-
mial distribution with parameters r and p. Show that
FX (r − 1) = 1 − FY (n − r).
A continuous random variable has uncountable set of possible values which is known as the range
of the random variable. A continuous distribution describes the probabilities of a continuous random
variable at its possible values. The mapping of time can be considered as an example of the continuous
probability distribution. It can be from one second to one billion seconds, and so on. The area under
the curve of the pdf of a continuous random variable is used to calculate its probability. As a result,
only value ranges can have a non-zero probability. The probability of a continuous random variable on
some value is always zero.
fX (x) fX (x) = k
| | x
a b
k, a ≤ x ≤ b,
fX (x) =
0, otherwise.
Therefore,
Z b
1
fX (x)dx = 1 =⇒ k = .
a b−a
Chapter 4: Special Continuous Distributions
The formal definition can be given as follows:
fX (x) 1
fX (x) = b−a
| | x
a b
Remark 4.1.1. (i) The numbers a and b are called the parameters of continuous uniform distribu-
tion.
(ii) If X follows continuous uniform distribution with parameters a and b then it is denoted by X ∼
U(a, b).
Theorem 4.1.1
If a random variable X follows continuous uniform distribution then
a+b (b − a)2
E(X) = , Var(X) = ,
2 12
0, x < a, (
etb −eta
x−a , t 6= 0
FX (x) = , a ≤ x < b, and MX (t) = t(b−a)
b−a 1 t = 0.
1, x≥b
(b − a)2
Var(X) = E X 2 − (E(X))2 =
.
12
The cdf of X is given by
x x
x−a
Z Z
1
FX (x) = fX (t)dt = dt = , a ≤ x < b.
a a b−a b−a
FX (x)
1−
| | x
a b
Therefore,
1 − e−λt , t > 0,
FT (t) = P(X ≤ t) = 1 − P(X > t) =
0, t ≤ 0.
λ=3
λ=5
λ=7
fX (x)
Remark 4.2.1. (i) The numbers λ is called the parameter of exponential distribution.
Proof. Consider
Z ∞ Z ∞
k
E(X ) = k
x fX (x)dx = λ xk e−λx dx
0 0
∞
tk −t dt
Z
=λ e (λx = t)
0 λk λ
Z ∞
1
= k e−t tk+1−1 dt
λ 0
Γ(k + 1) k!
= k
= k.
λ λ
Now, consider
Z ∞
tX
MX (t) = E(e ) = etx fX (x)dx
0
Z ∞
λ
=λ e−(λ−t)x dx = , λ > t.
0 λ−t
This proves the result.
(c) β1 = 2 and β2 = 6.
and
4
X −µ
β2 = E = 6.
σ
Remark 4.2.2. The exponential distribution is always positively skewed and lepto kurtic.
Example 4.2.1. Show that the exponential distribution satisfies the memoryless property.
Solution. Let X ∼ Exp(λ), so that f_X(x) = λe^{−λx}, x > 0.
Note that
P(X > n) = ∫_{n}^{∞} λe^{−λx} dx = [−e^{−λx}]_{n}^{∞} = e^{−λn}.
Therefore,
P(X > n + m | X > m) = P(X > n + m, X > m)/P(X > m) = P(X > n + m)/P(X > m)
= e^{−λ(n+m)}/e^{−λm} = e^{−λn} = P(X > n).
X X|X > 4
fX (x)
Therefore,
(
0, t ≤ 0,
FTr (t) = P(Tr ≤ t) = 1 − P(Tr > t) = e−λt (λt)j
1 − r−1
P
j=0 j!
, t > 0.
e−λt (λt)r−1
d d −λt −λt
fT (t) = FT (t) = − e + (λt)e + · · · +
dt dt (r − 1)!
λr tr−1 e−λt
= λe−λt − λe−λt + λ2 te−λt − λ2 te−λt + · · · +
(r − 1)!
r r−1 −λt r
λt e λ −λt r−1
= = e t , t > 0.
(r − 1)! Γ(r)
r = 1, λ = 2
r = 3, λ = 4
r = 4, λ = 2
fX (x)
(ii) If X follows gamma distribution with parameters r and λ then it is denoted by X ∼ G(r, λ).
Proof. Consider
∞ Z ∞
λr
Z
k
E(X ) = k
x fX (x)dx = e−λx xk+r−1 dx
0 Γ(r) 0
r Z ∞
λ
= k+r
e−t tk+r−1 dt (λx = t)
Γ(r)λ 0
Γ(k + r)
= .
Γ(r)λk
Now, consider
Z ∞
tX
MX (t) = E(e ) = etx fX (x)dx
Z ∞ 0
λr
= e−(λ−t)x xr−1 dx
Γ(r) 0
Z ∞
λr
= e−y y r−1 dx
Γ(r)(λ − t)r 0
r
λr Γ(r)
λ
= = , λ > t.
Γ(r)(λ − t)r λ−t
Corollary 4.3.2
If a random variable X follows gamma distribution then
and
4
X −µ 6
β2 = E = .
σ r
Remark 4.3.2. (i) The gamma distribution is always positively skewed and lepto kurtic.
fX (x) fX (x)
x x
(a) α = 1, β = 1 (b) α = β < 1
β>α>1
α>β>1
x x
(c) α = β > 1 (d) α < 1, β = 1 and α = 1, β < 1
fX (x) fX (x)
x x
(e) α > β > 1 (f ) β > α > 1
Here, B(α, β) denotes the beta function defined by
B(α, β) = ∫_{0}^{1} x^{α−1}(1 − x)^{β−1} dx.
It is related to the gamma function by
B(α, β) = Γ(α)Γ(β)/Γ(α + β).
Remark 4.4.1. (i) The numbers α and β are called the parameter of beta distribution.
(ii) If X follows beta distribution with parameters α and β then it is denoted by X ∼ B(α, β).
Theorem 4.4.4
If a random variable X follows beta distribution then
B(α + k, β)
E(X k ) = .
B(α, β)
Proof. Consider
Z 1
k 1
E(X ) = xα+k−1 (1 − x)β−1 dx
B(α, β) 0
B(α + k, β)
= .
B(α, β)
α α(α + 1)
E(X) = and E(X 2 ) = .
α+β (α + β)(α + β + 1)
Therefore,
αβ
Var(X) = E(X 2 ) − (E(X))2 = .
(α + β)2 (α + β + 1)
Remark 4.5.1. (i) The numbers µ and σ 2 are called the parameters of normal distribution.
(ii) If X follows normal distribution with parameters µ and σ 2 then it is denoted by X ∼ N (µ, σ 2 ).
|
µ
R∞
Example 4.5.1. Show that −∞
fX (x)dx = 1.
Solution. Consider
Z ∞ Z ∞
1 1 x−µ 2
fX (x)dx = √ e− 2 ( σ ) dx
−∞ −∞ 2πσ
Z ∞
1 −z 2 /2 x−µ
=√ e dz =z
2π −∞ σ
Z ∞
1 2
= 2√ e−z /2 dz
2π 0
Z ∞ 2
1 −t 12 −1 z
=√ e t dt =t
π 0 2
Γ 12
= √ = 1.
π
Theorem 4.5.5
If a random variable X follows normal distribution then
k 0, if k is odd,
E(X − µ) =
σ 2m (2m − 1)(2m − 3) . . . 5.3.1, if k = 2m, m = 1, 2, . . ..
Proof. Consider
k ∞ k
x−µ x−µ
Z
E = fX (x)dx
σ −∞ σ
Z ∞ k
1 x−µ 1 x−y 2
=√ e− 2 ( σ ) dx
2πσ −∞ σ
Z ∞
1 k −z 2 /2 x−µ
=√ z e dz, =z
2π −∞ σ
= 0,
Corollary 4.5.4
If a random variable X follows normal distribution then
(a) E(X) = µ.
(b) Var(X) = σ 2 .
(c) β1 = β2 = 0.
E(X − µ) = 0 =⇒ E(X) = µ.
and, for k = 2,
and
4
X −µ
β4 = E − 3 = 0.
σ
MX (t) = E etX
Z ∞
= etx fX (x)dx
−∞
Z ∞
1 1 x−µ 2
=√ etx e− 2 ( σ ) dx
2πσ −∞
Z ∞
1 t(µ+σz) −z 2 /2 x−µ
=√ e e dz =z
2π −∞ σ
Z ∞
1 z2
=√ et(µ+σz)− 2 dz
2π −∞
µt+ 21 σ 2 t2 Z ∞
e 1
e− 2 (z +σ t −2σtz) dz
2 2 2
= √
2π
−∞ Z ∞
1 2 2
µt+ 2 r t 1 − 12 (z−σt)2
=e √ e dz
2π −∞
1 2 t2
= eµt+ 2 σ
aX + b ∼ N aµ + b, a2 σ 2 ,
a 6= 0 and b ∈ R. Therefore,
X −µ
Z= ∼ N (0, 1).
σ
This is called standard normal random variable. The pdf of Z is given by
1 2
φ(z) = √ e−z /2 , −∞ < z < ∞.
2π
The cdf of Z is given by
Φ(z) = P(Z ≤ z)
Z z
1 2
= √ e−t /2 dt.
−∞ 2π
A table for different values of z is available to calculate the probabilities.
Φ(z) = 1 − Φ(−z).
and
P(X ≤ a) = P((X − μ)/σ ≤ (a − μ)/σ) = P(Z ≤ (a − μ)/σ) = Φ((a − μ)/σ).
Example 4.5.2. If X is normally distributed with mean 2 and variance 1, find P(|X − 2| < 1).
Solution. Note that
or
Example 4.5.3. If X is normally distributed with mean 11 and standard deviation 1.5, find the number
x_0 such that
P(X > x_0) = 0.3.
Solution. Consider
P(X > x_0) = 0.3
=⇒ 1 − P(X ≤ x_0) = 0.3
=⇒ P(X ≤ x_0) = 0.7
=⇒ P((X − 11)/1.5 ≤ (x_0 − 11)/1.5) = 0.7
=⇒ P(Z ≤ (x_0 − 11)/1.5) = 0.7.
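From the standard normal table, Φ(0.5244) ≈ 0.7, so (x_0 − 11)/1.5 ≈ 0.5244 and x_0 ≈ 11.79. The same value can be obtained without a table from the standard library's normal distribution:

from statistics import NormalDist

z = NormalDist().inv_cdf(0.7)        # standard normal quantile, about 0.5244
x0 = 11 + 1.5 * z
print(round(z, 4), round(x0, 4))     # 0.5244 11.7866

# or directly with the given mean and standard deviation
print(NormalDist(mu=11, sigma=1.5).inv_cdf(0.7))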
Remark 4.6.1. The numbers λ and μ are called the parameters of the Cauchy distribution.
Example 4.6.1. Find the cdf of Cauchy distribution.
Solution. The cdf of the Cauchy distribution is given by
Z x
FX (x) = fX (x)dx
−∞
Z x
1 dx
=
πλ −∞ 1 + x−µ 2
λ
Z x−µ
1 λ dt x−µ
= t=
π −∞ 1 + t2 λ
x−µ
1 λ
= tan−1 (t)
π −∞
1 −1 x−µ −1
= tan − tan (−∞)
π λ
1 −1 x−µ π
= tan +
π λ 2
1 1 x−µ
= + tan−1 .
2 π λ
(ii) It can be verified that for the Cauchy distribution, mean variance and higher moments do not
exist.
4.7 Exercises
1. If X is uniformly distributed over [1, 2], find z so that
1
P (X > z + µ) = .
4
2
2. If X ∼ U(−1, 3) and Y ∼ Exp(λ). Find λ such that σX = σY2 .
(a) How many observations may be expected to lie between 65 and 100?
(b) Find the value of the variate beyond which 10% of the items would lie.
7. In a distribution that is exactly normal, 7% of the items are under 35 and 89% are under 63. What
are the mean and the standard deviation of the distribution?
9. Show that for a normal distribution: mean, median and mode coincide.
Let X be a random variable, discrete or continuous, and let g : R → R, which we think of as a
transformation. For example, X could be a height of a randomly chosen person in a given population
in inches, and g could be a function which transforms inches to centimetres, that is, g(x) = 2.54 × x.
Then Y = g(X) is also a random variable, but its distribution (pmf or pdf), mean, variance, etc. will
differ from that of X. Transformations of random variables play a central role in statistics, and we will
learn how to work with them in this chapter.
Since g is a measurable function, the set g −1 (−∞, y] is measurable set. Now, since X is a random
variable, {ω : X(ω) ∈ g −1 (−∞, y]} is also measurable.
Remark 5.1.1. Let X be a random variable and g : R → R be a continuous function then g(X) is a
random variable (since every continuous function is measurable).
Theorem 5.1.2
Give a random variable with cdf FX (·), the distribution of the random variable Y = g(X), where
g is measurable, is determined.
Chapter 5: Function of Random Variables and Its Distribution
Since g is measurable, g −1 ((−∞, y]) is a measurable set. Now, since the distribution of x is well-
defined, GY (y) is also well-defined.
Example 5.1.1. Let Y = aX + b, a ≠ 0 and b ∈ R. Then
F_Y(y) = P(Y ≤ y) = P(aX + b ≤ y)
= P(X ≤ (y − b)/a) if a > 0, and P(X ≥ (y − b)/a) if a < 0
= F_X((y − b)/a), if a > 0,
= 1 − F_X((y − b)/a) + P(X = (y − b)/a), if a < 0.
2
0
1
0 1
−1
4
−2
Example 5.2.2. Let X ∼ U(−1, 1). Find the distribution (or pdf) of Y = |X|.
Solution. It is known that
1
2
, −1 ≤ x ≤ 1
fX (x) =
0 otherwise.
Consider
Therefore,
d 1, 0 ≤ y ≤ 1,
fY (y) = FY (y) =
dy 0, otherwise.
Consider
1 2 σ 2 )t2
= e(aµ+b)t+ 2 (a .
Hence, Y ∼ N (aµ + b, a2 σ 2 ).
Proof. Let g 0 (x) > 0, for all x. Then g is strictly increasing and so one-one and g −1 is strictly increas-
ing, that is,
f racddyg −1 (y) > 0. Therefore,
So the pdf of Y is
d −1
g (y) fX g −1 (y) .
fY (y) =
dy
In case g 0 (x) < 0, for all x then g is strictly decreasing and g −1 will also be strictly decreasing, that is,
d −1
dy
g (y) < 0. Therefore,
= 1 − FX g −1 (y) + P X = g −1 (y) .
d −1
g (y) fX g −1 (y) .
fY (y) =
dy
Note that
0≤x≤1
y
=⇒ 0≤ ≤1
1−y
=⇒ y ≥ 0 and y ≤ 1 − y
1
=⇒ 0≤y≤ .
2
Therefore,
1
fX (g −1 y) = 1, for 0 ≤ y ≤ .
2
Hence, by above theorem,
d −1
g (y) fX g −1 (y)
fY (y) =
dy
1
(1−y)2
, 0 ≤ y ≤ 1/2,
=
0, otherwise.
Further, if the function does not satisfy the condition of monotonically increasing or decreasing then the
following theorem will help to compute the pdf of the function of random variable for such situations.
(a) there exists a positive integer n = n(y) and real numbers (inverses) x1 (y), x2 (y), . . . , xn (y)
such that
(b) there does not exist any x such that g(x) = y, g 0 (x) 6= 0, in which case we write n(y) = 0.
Let
d −1 d 1
g1−1 (y) = −y =⇒ g1 (y) = −1 =⇒ g (y) = 1
dy dy 1
d −1 d −1
g2−1 (y) = +y =⇒ g2 (y) = 1 =⇒ g (y) = 1.
dy dy 2
It is known that
(
1
2
, −1 ≤ x ≤ 1,
fX (x) =
0, otherwise.
d −1 d −1
g1 (y) fX g1−1 (y) + g2 (y) fX g2−1 (y)
fY (y) =
dy dy
1 1
= 1 · fx (−y) + 1.fx (+y) = + = 1, 0 ≤ y ≤ 1.
2 2
Hence,
1, 0 ≤ y ≤ 1,
fY (y) =
0, otherwise.
Let
g_1^{−1}(y) = −√y =⇒ (d/dy) g_1^{−1}(y) = −1/(2√y) =⇒ |(d/dy) g_1^{−1}(y)| = 1/(2√y),
g_2^{−1}(y) = +√y =⇒ (d/dy) g_2^{−1}(y) = 1/(2√y) =⇒ |(d/dy) g_2^{−1}(y)| = 1/(2√y).
It is known that
f_X(x) = (1/√(2π)) e^{−x^2/2}, −∞ < x < ∞.
Therefore, we get
f_Y(y) = |(d/dy) g_1^{−1}(y)| f_X(g_1^{−1}(y)) + |(d/dy) g_2^{−1}(y)| f_X(g_2^{−1}(y))
= (1/(2√y)) f_X(−√y) + (1/(2√y)) f_X(+√y)
= (1/(√(2π)√y)) e^{−y/2}
= (1/(2^{1/2} Γ(1/2))) e^{−y/2} y^{1/2 − 1}, y > 0.
Hence, Y ∼ G(1/2, 1/2).
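The conclusion Y = X^2 ∼ G(1/2, 1/2) (the chi-square distribution with one degree of freedom) can be sanity-checked by simulation, comparing the sample mean and variance of X^2 with the gamma values E(Y) = r/λ = 1 and Var(Y) = r/λ^2 = 2:

import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(1_000_000)
y = x**2                 # Y = X^2 with X standard normal

print(y.mean())          # close to E(Y) = 1
print(y.var())           # close to Var(Y) = 2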
5.3 Exercises
1. Let X be a random variable having pmf
1 1 1
P(X = −1) = P(X = 0) = P(X = 1) = .
4 2 4
Find the distribution of Y = |X| and Y = X 2 .
2. If X is binomially distributed with parameters n and p, what is the distribution of Y = n − X?
3. Let X ∼ U(0, 1). Show that Y = − λ1 ln(FX (x)) ∼ Exp(λ).
4. Show that if X ∼ B(α, β) then 1 − X ∼ B(β, α).
5. Give an example when X and −X have same distribution.
6. If X ∼ N (2, 9), find the pdf of Y = 21 X − 1.
7. If X ∼ N (µ, σ 2 ), find the distribution of Y = eX .
In real life, we are often interested in several random variables that are related to each other. For
instance, suppose a student has five courses in one semester and
X_i = marks of the student in course i ∈ {0, 1, 2, . . . , 100}, for i = 1, 2, 3, 4, 5.
Then, the joint behaviour of the marks can be obtained using the joint distribution of (X_1, X_2, X_3, X_4, X_5).
In particular, suppose the passing mark is 30 in each course; then the probability that the student passes the
semester is P(X_1 ≥ 30, . . . , X_5 ≥ 30). Therefore, if we know the distribution of (X_1, X_2, X_3, X_4, X_5)
then we can easily compute such probabilities.
In this chapter, we will focus on two random variables, but once you understand the theory for two
random variables, the extension to n random variables is straightforward. We will first discuss joint
distributions of discrete random variables and then extend the results to continuous random variables.
X
Probability with respect to all
values of x and y in this area
Chapter 6: Random Vector and its Joint Distribution
Remark 6.1.1. Note that if we are dealing with joint distribution then comma means intersection.
Note that
P(a < X ≤ b, c < Y ≤ d) = F_{X,Y}(b, d) − F_{X,Y}(a, d) − F_{X,Y}(b, c) + F_{X,Y}(a, c).
5. lim FX,Y (x + h, y) = FX,Y (x, y) and limk→0 FX,Y (x, y + k) = FX,Y (x, y).
h→0
6. If x1 < x2 then
pX,Y (xi , yj ) = P (X = xi , Y = yj ) , i, j = 1, 2, . . . .
P(X = xi , Y = yj )
{X = xi , Y = yj }
yj
x1 x2
xi y2
y1
pX,Y (xi , yj )
pX|Y =yj (xi ) = , for all xi , provided pY (yj ) 6= 0.
pY (yj )
pX,Y (xi , yj )
pY |X=xi (yj ) = , for all yj , provided pX (xi ) 6= 0.
pX (xi )
Example 6.2.1. Suppose a car showroom has ten cars of a brand, out of which 5 are good (G), 2 have
defective transmission (DT), and 3 have defective steering (DS). Two cars are selected at random. Let
X denote the number of cars with DT and Y denote the number of cars with DS. Find
Solution. Given X is the number of cars with DT and therefore, it can take values {0, 1, 2} and Y is
the number of cars with DS and therefore, it can take values {0, 1, 2}.
Similarly, we can find the other probabilities; they are given in the following table:
           Y = 0    Y = 1    Y = 2
  X = 0    10/45    15/45    3/45
  X = 1    10/45     6/45    0
  X = 2     1/45     0       0
[Figure: bar plot of the joint pmf p_{X,Y}(x, y).]
(b) Consider the joint pmf together with the marginals:
           Y = 0    Y = 1    Y = 2    p_X(x)
  X = 0    10/45    15/45    3/45     28/45
  X = 1    10/45     6/45    0        16/45
  X = 2     1/45     0       0         1/45
  p_Y(y)   21/45    21/45    3/45
Then
p_{Y|X=0}(y) = p_{X,Y}(0, y)/p_X(0) = (45/28) p_{X,Y}(0, y), y = 0, 1, 2,
p_{X|Y=0}(x) = p_{X,Y}(x, 0)/p_Y(0) = (45/21) p_{X,Y}(x, 0), x = 0, 1, 2.
Therefore,
E(Y | X = 0) = Σ_{y=0}^{2} y p_{Y|X=0}(y) = (45/28) Σ_{y=0}^{2} y p_{X,Y}(0, y)
= (45/28) [0 · p_{X,Y}(0, 0) + 1 · p_{X,Y}(0, 1) + 2 · p_{X,Y}(0, 2)] = (45/28)(21/45) = 3/4,
and
E(X | Y = 0) = Σ_{x=0}^{2} x p_{X|Y=0}(x) = (45/21) Σ_{x=0}^{2} x p_{X,Y}(x, 0)
= (45/21) [0 · p_{X,Y}(0, 0) + 1 · p_{X,Y}(1, 0) + 2 · p_{X,Y}(2, 0)] = (45/21)(12/45) = 4/7.
Remark 6.2.1. From the joint pmf, we can compute the marginals and the conditional distributions.
Therefore, other characteristic such as mean, variance, mode, median etc. can also be calculated for
such distributions.
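The marginal and conditional computations of Example 6.2.1 are just row/column operations on the joint-pmf table, which makes them easy to check mechanically. A short sketch that stores the table and recomputes p_X, p_Y, E(Y | X = 0) and E(X | Y = 0):

from fractions import Fraction as F

# joint pmf p_{X,Y}(x, y); x = number of cars with DT, y = number with DS
joint = {(0, 0): F(10, 45), (0, 1): F(15, 45), (0, 2): F(3, 45),
         (1, 0): F(10, 45), (1, 1): F(6, 45),  (1, 2): F(0),
         (2, 0): F(1, 45),  (2, 1): F(0),      (2, 2): F(0)}

p_X = {x: sum(joint[x, y] for y in range(3)) for x in range(3)}
p_Y = {y: sum(joint[x, y] for x in range(3)) for y in range(3)}

E_Y_given_X0 = sum(y * joint[0, y] for y in range(3)) / p_X[0]
E_X_given_Y0 = sum(x * joint[x, 0] for x in range(3)) / p_Y[0]

print(p_X, p_Y)                      # marginals: 28/45, 16/45, 1/45 and 21/45, 21/45, 3/45
print(E_Y_given_X0, E_X_given_Y0)    # 3/4 and 4/7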
d
c
a
b
fX,Y (x, y)
fX|Y =y (x) = , for all x, provided fY (y) 6= 0.
fY (y)
fX,Y (x, y)
fY |X=x (yj ) = , for all y, provided fX (x) 6= 0.
fX (x)
Find
1
(a) P 0 < X + Y < 2
0<x<y<1
X
1 1
Z Z −x
1 4 2
P 0<X +Y < = fX,Y (x, y)dydx
2 0 x
1 1
Z
4
Z
2
−x
= 10 xy 2 dydx
0 x
11
= .
3072
(b) Consider
3
y= 4
1
y= 4
Z 3Z 1 Z 1Z 3
1 1 3 4 4 2 4
P 0<X< , <Y < = fX,Y (x, y)dxdy + fX,Y (x, y)dxdy
4 4 4 1
4
0 1
4
x
Z 3Z 1 Z 1Z 3
4 4 2 4
= 10 xy 2 dxdy + 10 xy 2 dxdy
1 1
4
0 4
x
15 343
= +
1536 3072
473
= .
3072
Therefore,
f_X(x) = (10/3) x (1 − x^3) for 0 < x < 1, and 0 otherwise.
Similarly,
f_Y(y) = 5y^4 for 0 < y < 1, and 0 otherwise,
and
P(Y > 3/4) = ∫_{3/4}^{1} 5y^4 dy = 1 − (3/4)^5 = 781/1024.
Therefore,
(
2x
y2
, 0 < x < y, 0 < y < 1,
fX|Y =y (x | y) =
0 otherwise.
Therefore,
3y 2
1−x3
0 < x < 1, x < y < 1,
fY |X=x (y | x) =
0 otherwise.
and
Z 1
1 1 2 64 1
P Y < |X= = y 2 dy = .
2 4 1
4
21 9
In particular, if X and Y are discrete random variables then X and Y are independent if
µ0r,s = E (X r Y s ) .
µr,s = E ((X − µX )r (Y − µY )s ) ,
Note that
µ01,0 = E(X) = µX
µ00,1 = E(Y ) = µY
XX
xypX,Y (x, y), if X and Y are discrete,
µ01,1 = E(X, Y ) = x y
Z ∞Z ∞
xyfX,Y (x, y)dxdy, if X and Y are continuous.
−∞ −∞
µ1,1 = E ((X − µX ) (Y − µY ))
= E (XY − µY X − µX Y + µX µY )
= E(XY ) − µY E(X) − µX E(Y ) + µX µY
= E(XY ) − E(X)E(Y ) − E(X)E(Y ) + E(X)E(Y )
= E(XY ) − E(X)E(Y ).
ρ_{X,Y} = Cov(X, Y)/(σ_X σ_Y),
where σ_X^2 = Var(X) and σ_Y^2 = Var(Y).
Theorem 6.5.1
The correlation coefficient between two random variables always lies between −1 and 1, that is,
−1 ≤ ρ_{X,Y} ≤ 1.
Then,
and
Therefore,
−1 ≤ E(U V ) ≤ 1. (6.5.1)
X X
(a) ρX,Y > 0 (b) ρX,Y < 0
Y Y
X X
(c) ρX,Y = 0 (d) ρX,Y = 0
X X
(e) ρX,Y = 1 (f) ρX,Y = −1
(ii) If X = aY + b, (a > 0), then X and Y are perfectly linear in positive direction and ρX,Y = 1.
Also, X = aY + b, (a < 0), then X and Y are perfectly linear in negative direction and
ρX,Y = −1.
(iii) If ρX,Y = 0 then we say that X and Y are uncorrelated.
Theorem 6.5.2
Let X and Y be independent random variables. Then
Proof. We prove the result for continuous random variables and following the similar steps, it can be
easily proved for discrete case. Let X and Y be continuous random variable, that is,
Consider
    E(g1(X) g2(Y)) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g1(x) g2(y) fX,Y(x, y) dx dy
                   = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g1(x) g2(y) fX(x) fY(y) dx dy
                   = (∫_{−∞}^{∞} g1(x) fX(x) dx)(∫_{−∞}^{∞} g2(y) fY(y) dy)
                   = E(g1(X)) E(g2(Y)).
E (X r Y s ) = E (X r ) E (Y s )
E ((X − µX )r (Y − µY )s ) = E ((X − µX )r ) E ((Y − µY )s ) .
E(XY ) = E(X)E(Y ).
Therefore,
This implies ρX,Y = 0. However, the converse is not true. For example, let X and Y have joint pmf
given by

              Y
  X       −1      0      1     pX(x)
  0        0     1/3     0      1/3
  1       1/3     0     1/3     2/3
  pY(y)   1/3    1/3    1/3      1
Then,
    E(X) = Σ_{x=0}^{1} x pX(x) = 0 · pX(0) + 1 · pX(1) = 0 × 1/3 + 1 × 2/3 = 2/3,
    E(Y) = Σ_{y=−1}^{1} y pY(y) = −1 · pY(−1) + 0 · pY(0) + 1 · pY(1) = −1/3 + 0 + 1/3 = 0,
    E(XY) = Σ_{x=0}^{1} Σ_{y=−1}^{1} xy pX,Y(x, y) = 1 · (−1) · 1/3 + 1 · 1 · 1/3 = 0.
Therefore,
    ρX,Y = Cov(X, Y)/(σX σY) = 0.
But X and Y are not independent; for example,
    pX,Y(0, 0) = 1/3 ≠ 1/9 = pX(0) pY(0).
Now, we move to define joint mgf that will generate the product moments for jointly distributed random
variables.
Remark 6.5.4. Note that the joint mgf generates the product moments as follows:
    E(X) = ∂/∂s MX,Y(s, t) |_{s=t=0},    E(Y) = ∂/∂t MX,Y(s, t) |_{s=t=0},
    E(X²) = ∂²/∂s² MX,Y(s, t) |_{s=t=0},  E(Y²) = ∂²/∂t² MX,Y(s, t) |_{s=t=0},
    E(XY) = ∂²/∂s∂t MX,Y(s, t) |_{s=t=0},
    E(X^m Y^n) = ∂^{m+n}/∂s^m ∂t^n MX,Y(s, t) |_{s=t=0}.
Theorem 6.5.3
If the random variables X and Y are independent then
Proof. We prove the result for continuous random variables; similar steps prove it for the discrete
case. Consider
    MX,Y(s, t) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} e^{sx+ty} fX,Y(x, y) dx dy
               = ∫_{−∞}^{∞} ∫_{−∞}^{∞} e^{sx+ty} fX(x) fY(y) dx dy
               = (∫_{−∞}^{∞} e^{sx} fX(x) dx)(∫_{−∞}^{∞} e^{ty} fY(y) dy)
               = MX(s) MY(t).
Theorem 6.5.4
If the random variables X and Y are independent then
Find the joint mgf of (X, Y) and hence find E(X), E(Y), E(XY) and ρX,Y.
Solution. The joint mgf of (X, Y) is given by
    MX,Y(s, t) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} e^{sx+ty} fX,Y(x, y) dx dy
               = λ² ∫_0^{∞} ∫_0^{∞} e^{sx+ty} e^{−λ(x+y)} dx dy
               = λ² (∫_0^{∞} e^{−(λ−s)x} dx)(∫_0^{∞} e^{−(λ−t)y} dy)
               = λ²/((λ − s)(λ − t)),  for λ > s and λ > t.
Then
    ∂/∂s MX,Y(s, t) = λ²/((λ − s)²(λ − t)),    ∂/∂t MX,Y(s, t) = λ²/((λ − s)(λ − t)²),
    ∂²/∂s² MX,Y(s, t) = 2λ²/((λ − s)³(λ − t)),  ∂²/∂t² MX,Y(s, t) = 2λ²/((λ − s)(λ − t)³),
    ∂²/∂s∂t MX,Y(s, t) = λ²/((λ − s)²(λ − t)²).
Therefore,
    E(X) = ∂/∂s MX,Y(s, t) |_{s=t=0} = 1/λ,
    E(Y) = ∂/∂t MX,Y(s, t) |_{s=t=0} = 1/λ,
    E(X²) = ∂²/∂s² MX,Y(s, t) |_{s=t=0} = 2/λ²,
    σX² = E(X²) − (E(X))² = 1/λ²,  σY² = E(Y²) − (E(Y))² = 1/λ²,
and, since E(XY) = ∂²/∂s∂t MX,Y(s, t) |_{s=t=0} = 1/λ² = E(X)E(Y),
    ρX,Y = Cov(X, Y)/(σX σY) = (E(XY) − E(X)E(Y))/(σX σY) = 0.
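The differentiation of this joint mgf can also be done symbolically. The sketch below uses sympy (the symbol names are mine, not from the notes) to recover the moments just computed.

# Symbolic check of the mgf calculations above
import sympy as sp

s, t, lam = sp.symbols('s t lambda', positive=True)
M = lam**2 / ((lam - s) * (lam - t))            # joint mgf, valid for lambda > s, lambda > t

EX  = sp.diff(M, s).subs({s: 0, t: 0})          # 1/lambda
EX2 = sp.diff(M, s, 2).subs({s: 0, t: 0})       # 2/lambda**2
EXY = sp.diff(M, s, t).subs({s: 0, t: 0})       # 1/lambda**2 = E(X)E(Y), hence rho = 0
print(sp.simplify(EX), sp.simplify(EX2), sp.simplify(EXY))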
(ii) If (X, Y) follows the bivariate normal distribution with parameters µ1, µ2, σ1², σ2² and ρ then it is
denoted as (X, Y) ∼ BVN(µ1, µ2, σ1², σ2², ρ).
Note that if ρ = 0 then fX,Y(x, y) = fX(x) fY(y),
where X ∼ N(µ1, σ1²) and Y ∼ N(µ2, σ2²). Therefore, the following theorem can be easily proved
for bivariate normal distribution.
Theorem 6.6.5
If (X, Y ) ∼ BVN(µ1 , µ2 , σ12 , σ22 , ρ) then
Theorem 6.6.6
If (X, Y ) ∼ BVN (µ1 , µ2 , σ12 , σ22 , ρ) then the marginals and conditional distribution of X and Y
are all univariate normal.
Proof. Consider
    fX,Y(x, y) = 1/(2πσ1σ2√(1 − ρ²)) exp{ −1/(2(1 − ρ²)) [ ((x − µ1)/σ1)² + ((y − µ2)/σ2)² − 2ρ((x − µ1)/σ1)((y − µ2)/σ2) ] }
              = 1/(2πσ1σ2√(1 − ρ²)) exp{ −1/(2(1 − ρ²)) [ ((x − µ1)/σ1)² + ((y − µ2)/σ2)² − 2ρ((x − µ1)/σ1)((y − µ2)/σ2) + ρ²((x − µ1)/σ1)² − ρ²((x − µ1)/σ1)² ] }
              = (1/(√(2π)σ1)) e^{−(1/2)((x − µ1)/σ1)²} × (1/(√(2π)σ2√(1 − ρ²))) exp{ −1/(2(1 − ρ²)) [ (y − µ2)/σ2 − ρ(x − µ1)/σ1 ]² }
              = (1/(√(2π)σ1)) e^{−(1/2)((x − µ1)/σ1)²} × (1/(√(2π)σ2√(1 − ρ²))) exp{ −1/(2σ2²(1 − ρ²)) [ y − (µ2 + ρσ2 (x − µ1)/σ1) ]² }.
Therefore,
    fX(x) = (1/(√(2π)σ1)) e^{−(1/2)((x − µ1)/σ1)²} ∫_{−∞}^{∞} (1/(√(2π)σ2√(1 − ρ²))) exp{ −1/(2σ2²(1 − ρ²)) [ y − (µ2 + ρσ2 (x − µ1)/σ1) ]² } dy
          = (1/(√(2π)σ1)) e^{−(1/2)((x − µ1)/σ1)²},
since the expression inside the integral is the pdf of N(µ2 + ρσ2 (x − µ1)/σ1, σ2²(1 − ρ²)). Hence, X ∼
N(µ1, σ1²). Similarly, adding and subtracting ρ²((y − µ2)/σ2)² and following the same steps as above, we
get Y ∼ N(µ2, σ2²).
Moreover, dividing fX,Y(x, y) by fX(x) in the factorization above gives
    Y | X = x ∼ N(µ2 + ρσ2 (x − µ1)/σ1, σ2²(1 − ρ²)).
Similarly,
    X | Y = y ∼ N(µ1 + ρσ1 (y − µ2)/σ2, σ1²(1 − ρ²)).
Theorem 6.6.7
Let X and Y be two random variables. Then
Proof. Consider
    E[g(X, Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x, y) fX,Y(x, y) dx dy = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x, y) fY|X=x(y|x) fX(x) dx dy
               = ∫_{−∞}^{∞} ( ∫_{−∞}^{∞} g(x, y) fY|X=x(y|x) dy ) fX(x) dx = ∫_{−∞}^{∞} E^{Y|X}[g(X, Y) | X = x] fX(x) dx
               = E^X [E^{Y|X} [g(X, Y) | X]].
Following similar steps, the result can be proved for discrete random variables.
Corollary 6.6.2
Let X and Y be two random variables. Then
ρX,Y = ρ.
    Cov(X, Y) = E((X − µ1)(Y − µ2))
              = E[(X − µ1) E(Y − µ2 | X)]
              = E[(X − µ1) × ρσ2 (X − µ1)/σ1]
              = (ρσ2/σ1) E(X − µ1)²
              = ρσ1σ2.
This implies
    ρX,Y = Cov(X, Y)/(σ1σ2) = ρσ1σ2/(σ1σ2) = ρ.
This proves the result.
Theorem 6.6.9
Let (X, Y) ∼ BVN(µ1, µ2, σ1², σ2², ρ). Then, the joint mgf of (X, Y) is
    MX,Y(s, t) = e^{µ1 s + µ2 t + (1/2)σ1² s² + (1/2)σ2² t² + ρσ1σ2 st}.
    MX,Y(s, t) = E(e^{sX+tY}) = E[e^{tY} E(e^{sX} | Y)]
               = e^{µ1 s − ρ(σ1/σ2)µ2 s + (1/2)σ1²(1 − ρ²)s²} E[e^{(t + ρ(σ1/σ2)s)Y}]
               = e^{µ1 s − ρ(σ1/σ2)µ2 s + (1/2)σ1²(1 − ρ²)s²} MY(t + ρ(σ1/σ2)s)
               = e^{µ1 s − ρ(σ1/σ2)µ2 s + (1/2)σ1²(1 − ρ²)s²} e^{µ2(t + ρ(σ1/σ2)s) + (1/2)σ2²(t + ρ(σ1/σ2)s)²}
               = e^{µ1 s + µ2 t + (1/2)σ1² s² + (1/2)σ2² t² + ρσ1σ2 st}.
Example 6.6.1. The amount of rainfall recorded at a US weather station in January is a random vari-
able X and the amount in February at the same station is a random variable Y. Suppose (X, Y) ∼
BVN(6, 4, 1, 0.25, 0.1). Find P(X ≤ 5) and P(Y ≤ 4 | X = 5).
Solution. Note that X ∼ N(6, 1). Therefore,
    P(X ≤ 5) = P((X − 6)/1 ≤ (5 − 6)/1) = P(Z ≤ −1) = 0.1587.
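The same example can be worked with scipy using the conditional distribution Y | X = x derived earlier. This is a sketch (not part of the notes), assuming the parameters BVN(6, 4, 1, 0.25, 0.1) in the (µ1, µ2, σ1², σ2², ρ) notation above; the second printed value finishes the part of the example not worked out in the text.

# Sketch of Example 6.6.1 with scipy
from math import sqrt
from scipy.stats import norm

mu1, mu2, s1, s2, rho = 6, 4, 1.0, 0.5, 0.1          # standard deviations, not variances

print(norm.cdf(5, loc=mu1, scale=s1))                 # P(X <= 5) ~ 0.1587

# Y | X = 5 ~ N(mu2 + rho*s2*(5 - mu1)/s1, s2^2*(1 - rho^2))
m = mu2 + rho * s2 * (5 - mu1) / s1
v = s2**2 * (1 - rho**2)
print(norm.cdf(4, loc=m, scale=sqrt(v)))              # P(Y <= 4 | X = 5)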
or equivalently,
(a) MX1,...,Xk−1(t1, . . . , tk−1) = (pk + p1e^{t1} + · · · + pk−1e^{tk−1})^n, for all (t1, . . . , tk−1) ∈ R^{k−1},
and pk = 1 − p1 − · · · − pk−1.
A special case of multinomial distribution is trinomial distribution when k = 3. The formal definition
can be given as follows:
Theorem 6.7.12
Let (X, Y ) follow trinomial distribution. Then
Theorem 6.7.13
Let (X, Y ) follow bivariate gamma distribution. Then
Theorem 6.7.14
Let (X, Y ) follow bivariate beta distribution. Then
(a) X ∼ Beta(p1 , p2 + p3 ).
(b) Y ∼ Beta(p2 , p1 + p3 ).
(c) Y/(1 − X) | X = x ∼ Beta(p2, p3).
(d) X/(1 − Y) | Y = y ∼ Beta(p1, p3).
Now, we consider the case when n = 2 and obtain the distribution of the transformation of the given
random vector. There are mainly three approaches to find the distribution of Y = g(X).
              X
  Y       −1      0      1
  −2      1/6    1/12   1/6
  1       1/6    1/12   1/6
  2       1/12    0     1/12

              U
  V        0      1
  1       1/12   1/3
  4       1/12   1/2

Therefore,
    fU,V(u, v) = 1/(4√(uv)),  0 < u < 1, 0 < v < 1,
                 0,           otherwise.
(b) Assume that the mapping and inverses are both continuous.
(c) Assume the partial derivatives ∂x/∂u, ∂x/∂v, ∂y/∂u, ∂y/∂v exist and are continuous.
Then, the random vector (U, V ) is continuous and its joint pdf is given by
iid
Example 6.8.4. Let X, Y ∼ U(0, 1) then find the distribution of U = X + Y .
Solution. It is known that
    fX,Y(x, y) = 1,  0 < x, y < 1,
                 0,  otherwise.
Given
U =X +Y and V = X − Y.
This implies
u+v u−v
x = h1 (u, v) = and y = h2 (u, v) = .
2 2
Therefore,
    ∂x/∂u = ∂x/∂v = 1/2,  ∂y/∂u = 1/2  and  ∂y/∂v = −1/2.
This implies
    J = | 1/2   1/2 |
        | 1/2  −1/2 |  = −1/2  =⇒  |J| = 1/2.
Note that
    0 < x < 1 =⇒ 0 < (u + v)/2 < 1 =⇒ 0 < u + v < 2,
    0 < y < 1 =⇒ 0 < (u − v)/2 < 1 =⇒ 0 < u − v < 2.
Therefore, the support of (U, V) is the square bounded by the lines u − v = 0, u − v = 2, u + v = 0
and u + v = 2. This implies
    fX+Y(u) = u,      0 < u ≤ 1,
              2 − u,  1 < u < 2,
              0,      otherwise.
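A quick Monte Carlo experiment makes this triangular shape visible. The following sketch (not part of the notes) simulates X + Y for iid U(0, 1) variables and compares a histogram with the density just derived.

# Monte Carlo check of the triangular density of U = X + Y
import numpy as np

rng = np.random.default_rng(0)
u = rng.uniform(size=100_000) + rng.uniform(size=100_000)

hist, edges = np.histogram(u, bins=20, range=(0, 2), density=True)
mid = 0.5 * (edges[:-1] + edges[1:])
f = np.where(mid <= 1, mid, 2 - mid)                  # f_{X+Y}(u) from the example
print(np.max(np.abs(hist - f)))                       # small for large sample sizes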
2. lim_{xi→∞} FX1,...,Xn(x1, . . . , xn) = FX1,...,Xi−1,Xi+1,...,Xn(x1, . . . , xi−1, xi+1, . . . , xn).
3. FX1,...,Xn(x1, . . . , xn) is continuous from the right in each of its arguments and is also non-decreasing.
× pX1,...,Xi−1,Xi+1,...,Xj−1,Xj+1,...,Xn(x1, . . . , xi−1, xi+1, . . . , xj−1, xj+1, . . . , xn),
provided pX1,...,Xi−1,Xi+1,...,Xj−1,Xj+1,...,Xn(x1, . . . , xi−1, xi+1, . . . , xj−1, xj+1, . . . , xn) ≠ 0.
Remark 6.9.2. Similar to above definition, the conditional distribution can be defined for any dimen-
sion random vector less than n.
× fX1 ,...,Xi−1 ,Xi+1 ,...,Xj−1 ,Xj+1 ,...,Xn (x1 , . . . , xi−1 , xi+1 , . . . , xj−1 , xj+1 , . . . , xn )
× dx1 . . . dxi−1 dxi+1 . . . dxj−1 dxj+1 . . . dxn .
provided fX1,...,Xi−1,Xi+1,...,Xj−1,Xj+1,...,Xn(x1, . . . , xi−1, xi+1, . . . , xj−1, xj+1, . . . , xn) ≠ 0.
Remark 6.9.4. Similar to above definition, the conditional distribution can be defined for any dimen-
sion random vector less than n.
(e) MX1+···+Xn(t) = Π_{i=1}^{n} MXi(t).
The following are the applications of the last result of the above theorem.
Corollary 6.9.4
Let X1, . . . , Xk ∼ B(ni, p) be independent. Then
    Σ_{i=1}^{k} Xi ∼ B(Σ_{i=1}^{k} ni, p).
That is, the sum of independent binomials with the same success probability is also binomial.
which is the mgf of B(Σ_{i=1}^{k} ni, p). This proves the result.
i=1
    MXi(t) = λ/(λ − t),  for i = 1, 2, . . . , n and λ > t.
Using Theorem 6.9.3(e), we have
    MX1+···+Xn(t) = Π_{i=1}^{n} MXi(t) = (λ/(λ − t))^n,  for λ > t,
Corollary 6.9.6
Let X1, . . . , Xk ∼ G(ri, λ) be independent. Then
    Σ_{i=1}^{k} Xi ∼ G(Σ_{i=1}^{k} ri, λ).
That is, the sum of independent gamma random variables with the same rate is also gamma distributed.
which is the mgf of G(Σ_{i=1}^{k} ri, λ). This proves the result.
i=1
6.10 Exercises
1. Toss three coins. Let X denotes the number of heads on the first two coins and Y denotes the
number of heads on the last two coins. Find
Find
Find ρX,Y .
If the filament diameter is 0.098, what is the probability that the tube will last 1950 hours?
We have studied random variables in one and two dimensions in detail, and we have seen the
generalization to n dimensions and results related to it. Now, we move on to study sequences of
random variables as n tends to infinity. This will help us understand the limiting behaviour of a
sequence of random variables. In this chapter, we consider several modes of convergence and
investigate their interrelationships.
Fn(x) → F(x) as n → ∞, or equivalently lim_{n→∞} Fn(x) = F(x),
at every point x at which F is continuous. Then, we say Fn converges in law (or weakly) to F.
Remark 7.1.1. If {Xn } converges to X in distribution then any one of the following notations can be
used:
    Xn →d X (Xn converges in distribution to X),
    Fn →d F (Fn converges in distribution to F),
    Xn →L X (Xn converges in law to X),
    Fn →w F (Fn converges weakly to F).
Chapter 7: Large Sample Theory
Example 7.1.1. Let {Fn } be a sequence of distribution functions defined by
    Fn(x) = 0,        x < 0,
            1 − 1/n,  0 ≤ x < n,
            1,        x ≥ n.
(Figure: graphs of F1, F2, F3, F4 and the limiting distribution function F.)
P (|Xn − X| > ε) → 0 as n → ∞.
or
P (|Xn − X| ≤ ε) → 1 as n → ∞.
or
    lim_{n→∞} P(|Xn − X| ≤ ε) = 1.
This implies
    P(|Xn − 0| > ε) = 1/n,  0 < ε < 1,
                      0,    ε ≥ 1.
Hence,
P (|Xn − 0| > ε) → 0 as n → ∞.
Remark 7.1.3. (i) If {Xn} converges almost surely to X then we write Xn →a.s. X.
Let us say this set is S. If P(S) = 1 then we say Xn →a.s. X.
Note that
    lim_{n→∞} Xn(ω) = lim_{n→∞} 1/n = 0, for ω ∈ (0, 1],
and
Therefore, we have
This implies
Hence,
    Xn →a.s. X.
5. |Xn − Yn| →p 0 and Yn →L Y =⇒ Xn →L Y.
6. (Slutsky's Theorem):
   (a) Xn →L X, Yn →p c =⇒ Xn + Yn →L X + c.
   (b) Xn →L X, Yn →p c =⇒ XnYn →L cX, if c ≠ 0, and XnYn →p 0, if c = 0.
   (c) Xn →L X, Yn →p c =⇒ Xn/Yn →L X/c, c ≠ 0.
7. Xn →p X ⇐⇒ Xn − X →p 0.
    P(|X − µ| ≥ k) ≤ σ²/k².
This proves the result for the continuous case. Following similar steps, it can be easily proved for the
discrete case.
    P(|X − µ| ≥ kσ) ≤ 1/k²  or  P(|X − µ| < kσ) ≥ 1 − 1/k²  or  P(|X − µ| < k) ≥ 1 − σ²/k².
Example 7.2.1. The number of customers who visit a store every day is a random variable X with
µ = 18 and σ = 2.5. With what probability can we assert that the number of customers will be
between 8 and 28?
Solution. Using Chebyshev's inequality with k = 10,
    P(8 < X < 28) = P(|X − 18| < 10) ≥ 1 − (2.5)²/10² = 0.9375.
    P(X ≥ k) ≤ E(X)/k,
provided E(X) exists.
Example 7.2.2. Consider a random variable X that takes the value 0 with probability 24/25 and 1 with
probability 1/25. Find a bound on the probability that X is at least 5.
Solution. Note that
    P(X = 0) = 24/25 and P(X = 1) = 1/25,
and
    E(X) = Σ_{x=0}^{1} x P(X = x) = 0 × 24/25 + 1 × 1/25 = 1/25.
Therefore, by Markov's inequality,
    P(X ≥ 5) ≤ E(X)/5 = 1/125.
Then,
    X̄n →p µ̄n,
provided
    lim_{n→∞} Bn/n² = 0, that is, Bn/n² → 0 as n → ∞.
Remark 7.2.2. (i) Note that the WLLN can be easily proved using Chebyshev's inequality, since
    Var(X̄n) = Var((X1 + · · · + Xn)/n) = (1/n²) Var(X1 + · · · + Xn) = Bn/n² → 0, as n → ∞.
Remark 7.2.3. (i) In the above figure, note how X̄n converges to µ as Var(X̄n) decreases.
(ii) The above theorem follows easily from Chebyshev's inequality, since
    Var(X̄n) = Var((1/n) Σ_{i=1}^{n} Xi) = (1/n²) Σ_{i=1}^{n} Var(Xi) = σ²/n → 0 as n → ∞.
(iii) If the Xi's are iid then the necessary condition for the LLN to hold is that E(Xi) exists, for i =
1, 2, . . . , n.
(iv) If the variables are uniformly bounded then the condition
    lim_{n→∞} Bn/n² = 0
is automatically satisfied.
    Sn/n →p p,
where Sn = Σ_{i=1}^{n} Xi.
Proof. Since Sn ∼ B(n, p), we have E(Sn) = np and Var(Sn) = np(1 − p). This implies
    E(Sn/n) = p
then
    X̄n →a.s. µn.
Now, we prove one of the most well-known and useful theorems in probability and statistics, the central
limit theorem.
Proof. Consider
    M_{√n(X̄n−µ)/σ}(t) = E[e^{(√n t/σ)(X̄n − µ)}]
                       = E[e^{(√n t/σ)((1/n)(X1 + · · · + Xn) − µ)}]
                       = E[e^{(t/(√n σ))((X1 − µ) + · · · + (Xn − µ))}]                         (7.3.1)
                       = E[e^{(t/(√n σ))(X1 − µ)} · · · e^{(t/(√n σ))(Xn − µ)}]
                       = E[e^{(t/(√n σ))(X1 − µ)}] · · · E[e^{(t/(√n σ))(Xn − µ)}]
                       = Π_{i=1}^{n} M_{Xi−µ}(t/(√n σ)).                                       (7.3.2)
Note that E(Xi − µ) = 0 and E(Xi − µ)² = σ², for all i = 1, 2, . . . , n. Now, consider
    M_{Xi−µ}(t/(√n σ)) = E[e^{(t/(√n σ))(Xi − µ)}]
                       = E[1 + (t/(√n σ))(Xi − µ) + (t²/(2! nσ²))(Xi − µ)² + (t³/(3! n√n σ³))(Xi − µ)³ + · · ·]
                       = 1 + (t/(√n σ)) E(Xi − µ) + (t²/(2nσ²)) E(Xi − µ)² + (t³/(6n√n σ³)) E(Xi − µ)³ + · · ·
                       = 1 + (t²/2)/n + O(n^{−3/2}).
(ii) In practice, n ≥ 30 is considered to be a large sample (NOT always). If the original distribution is
close to normal then, even for smaller n, the approximation may be good.
(iii) Note that the cdf of the sequence of random variables √n(X̄n − µ)/σ converges to the cdf of the
standard normal distribution, that is,
    Fn := F_{√n(X̄n−µ)/σ} → FZ, as n → ∞.
    (Sn − np)/√(npq) →d N(0, 1) as n → ∞.
Note that the approximate probability is computed with an extra area (shaded blue) while the area
outside the normal curve (shaded red) is left out. These two areas are approximately equal.
Example 7.3.1. Two fair dice are rolled 600 times. Let X denote the number of times a total of 7 occurs.
Use the CLT to find P(90 ≤ X ≤ 110).
Example 7.3.2. Let a random sample of size 54 be taken from a discrete distribution with pmf
    pX(x) = 1/3, x = 2, 4, 6.
Find the probability that the sample mean will lie between 4.1 and 4.4.
Solution. Note that µ = 4 and σ² = 8/3. Therefore, the required probability is
    P(4.1 ≤ X̄54 ≤ 4.4) ≈ P(√54(4.1 − 4)/√(8/3) ≤ Z ≤ √54(4.4 − 4)/√(8/3))
                        = P(0.45 ≤ Z ≤ 1.8)
                        = Φ(1.8) − Φ(0.45)
                        = 0.9641 − 0.6738
                        = 0.2905.
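The same normal approximation can be evaluated directly with scipy; the sketch below (not part of the notes) reproduces the number just computed.

# Normal approximation of Example 7.3.2
from math import sqrt
from scipy.stats import norm

mu, var, n = 4.0, 8.0 / 3.0, 54
se = sqrt(var / n)                                   # standard error of the sample mean
print(norm.cdf(4.4, mu, se) - norm.cdf(4.1, mu, se)) # ~ 0.29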
Now, we state other central limit theorems (for the non-iid case) that are useful in many applications
in the literature.
then
    (Sn − E(Sn))/sn → N(0, 1) as n → ∞.
then
    (Sn − E(Sn))/sn → N(0, 1) as n → ∞.
then
    Var(Sn)/n → σ² = E(X1²) + 2 Σ_{k=1}^{∞} E(X1 X_{1+k}),
    E(Yn | Fn−1) = 0.
Let
    σn² = E(Yn² | Fn−1)  and  νt = min{ n : Σ_{k=1}^{n} σk² > t }.
Assume that Σ_{n=1}^{∞} σn² = ∞ with probability 1. Then
    X_{νt}/√t → N(0, 1).
7.4 Exercises
1. Check the convergence in distribution for the following sequence of distribution functions:
    Fn(x) = 0, x < n,
            1, x ≥ n.
4. A coin is tossed 10 times. Find the probability of 3,4 or 5 heads using normal approximation.
5. If X ∼ B(25, 0.5), find the approximate value of P (Y ≤ 12).
6. A polling agency wishes to take a sample of voters in a given state large enough that the proba-
bility is only 0.01 that they will find the proportion favouring a certain candidate to be less than
50% when in fact it is 52%. How large a sample should be taken?
8. Two dice are thrown and X is the sum of the numbers showing up. Prove that P(|X − 7| ≥ 3) ≤
35/54. Compare this bound with the actual probability.
Sampling distribution in statistics refers to studying many random samples collected from a given pop-
ulation based on a specific attribute. The results obtained provide a clear picture of variations in the
probability of the outcomes derived. In this chapter, we first discuss elementary definitions in statis-
tics. Further, we study three most useful distributions (chi-square, t-distribution and F -distribution) in
statistics.
Population
Sample
Example 8.1.1. Suppose you are cooking rice in a pressure cooker. How will you check whether the
rice is properly cooked or not? We take 4–5 grains of rice and press them to check whether they are
properly cooked. If these grains are properly cooked, then we say the rice in the pressure cooker is
properly cooked. So, if the rice in the pressure cooker is the population and the 4–5 grains that we
have checked are the sample, then we are predicting about the population based on the sample. This
plays a crucial role in practice, as we make decisions about a population (for which we cannot examine
all elements) based on a sample. A similar kind of reasoning applies to other examples such as blood
tests, heights of people, and longevity of people.
T = T (X1 , X2 , . . . , Xn )
is called a statistic.
Remark 8.1.1. (i) The following statistics are useful in practice:
(iii) The quantities X̄n and s² are called sample parameters. Also, the parameters µ and σ² are called
population parameters.
Theorem 8.1.1
iid
Let X1 , . . . , Xn ∼ N (µ, σ 2 ), Ui = Xi − X̄n , for i = 1, 2, . . . , n, and U = (U1 , . . . , Un ). Then,
X̄n and U are independent.
Note that
    MU(t) = E[e^{Σ_{i=1}^{n} ti Ui}] = E[e^{Σ_{i=1}^{n} ti (Xi − X̄n)}]
          = E[e^{Σ_{i=1}^{n} ti Xi − n t̄ X̄n}] = E[e^{Σ_{i=1}^{n} Xi (ti − t̄)}]
          = Π_{i=1}^{n} E[e^{Xi (ti − t̄)}] = Π_{i=1}^{n} MXi(ti − t̄)
          = Π_{i=1}^{n} e^{µ(ti − t̄) + (1/2)σ²(ti − t̄)²} = e^{(1/2)σ² Σ_{i=1}^{n}(ti − t̄)²},
since Σ_{i=1}^{n}(ti − t̄) = 0, where t̄ = (1/n) Σ_{i=1}^{n} ti. Now, with Y = X̄n, consider
    MY,U(s, t) = E[e^{sY + Σ_{i=1}^{n} ti Ui}] = E[e^{(s/n) Σ_{i=1}^{n} Xi + Σ_{i=1}^{n} Xi (ti − t̄)}]
               = E[e^{Σ_{i=1}^{n} Xi (ti − t̄ + s/n)}] = Π_{i=1}^{n} E[e^{Xi (ti − t̄ + s/n)}]
               = Π_{i=1}^{n} MXi(ti − t̄ + s/n) = Π_{i=1}^{n} e^{µ(ti − t̄ + s/n) + (1/2)σ²(ti − t̄ + s/n)²}
               = e^{µs + (1/(2n))σ²s² + (1/2)σ² Σ_{i=1}^{n}(ti − t̄)²} = MY(s) MU(t).
Corollary 8.1.1
iid
Let X1 , . . . , Xn ∼ N (µ, σ 2 ). Then, X̄n and s2 are independent.
Remark 8.2.1. (i) From (8.2.1), it can be easily verified that W ∼ G(n/2, 1/2).
(Figure: pdf of the χ²n distribution for n = 2, 4, 6, 8.)
(a) E(W ) = n.
Proof. Substitute r = n/2 and λ = 1/2 in Theorem 4.3.3 and Corollary 4.3.2, the result follows.
Further, we prove some useful properties for χ2 −distribution.
Property 8.2.1
If Wi ∼ χ²_{ni}, for i = 1, 2, . . . , k, and the Wi are independent, then
    Σ_{i=1}^{k} Wi ∼ χ²_{Σ_{i=1}^{k} ni}.
Property 8.2.2
If Xi ∼ N(µi, σi²), for i = 1, 2, . . . , k, and the Xi are independent, then
    Σ_{i=1}^{k} ((Xi − µi)/σi)² ∼ χ²_k.
    (Xi − µi)/σi ∼ N(0, 1)
Property 8.2.3
If Xi ∼iid N(µ, σ²) then
    X̄n = (1/n) Σ_{i=1}^{n} Xi ∼ N(µ, σ²/n).
    MX̄n(t) = MX1(t/n) · · · MXn(t/n) = Π_{i=1}^{n} MXi(t/n)
            = [MX1(t/n)]^n = [e^{µt/n + (1/2)σ²t²/n²}]^n    [since MX1(t) = e^{µt + (1/2)σ²t²}]
            = e^{µt + (1/2)(σ²/n)t²}.
Property 8.2.4
If Xi ∼iid N(µ, σ²) then
    (n − 1)s²/σ² ∼ χ²_{n−1}.
Proof. Consider
    Σ_{i=1}^{n} (Xi − µ)² = Σ_{i=1}^{n} (Xi − X̄n + X̄n − µ)²
                          = Σ_{i=1}^{n} (Xi − X̄n)² + Σ_{i=1}^{n} (X̄n − µ)² + 2 Σ_{i=1}^{n} (Xi − X̄n)(X̄n − µ)
                          = Σ_{i=1}^{n} (Xi − X̄n)² + n(X̄n − µ)².
Let
    W = (1/σ²) Σ_{i=1}^{n} (Xi − µ)²,  W1 = (1/σ²) Σ_{i=1}^{n} (Xi − X̄n)²  and  W2 = n(X̄n − µ)²/σ².
Note that
    Xi ∼ N(µ, σ²) =⇒ (Xi − µ)/σ ∼ N(0, 1) =⇒ ((Xi − µ)/σ)² ∼ χ²_1 =⇒ Σ_{i=1}^{n} ((Xi − µ)/σ)² ∼ χ²_n.
Therefore,
    W ∼ χ²_n =⇒ MW(t) = (1 − 2t)^{−n/2}.
Also,
    X̄n ∼ N(µ, σ²/n) =⇒ (X̄n − µ)/(σ/√n) ∼ N(0, 1) =⇒ n(X̄n − µ)²/σ² ∼ χ²_1.
Therefore,
    W2 ∼ χ²_1 =⇒ MW2(t) = (1 − 2t)^{−1/2}.
From Corollary 8.1.1, it can be easily verified that W1 and W2 are independent. Therefore,
    MW(t) = MW1(t) MW2(t) =⇒ MW1(t) = (1 − 2t)^{−n/2}/(1 − 2t)^{−1/2} = (1 − 2t)^{−(n−1)/2}.
Therefore,
    W1 ∼ χ²_{n−1} =⇒ (1/σ²) Σ_{i=1}^{n} (Xi − X̄n)² ∼ χ²_{n−1} =⇒ (n − 1)s²/σ² ∼ χ²_{n−1}.
This proves the result.
The table is for P(χ²n > χ²n,α) = α, where χ²n and χ²n,α denote the chi-square distribution and the
corresponding point (see the figure above), respectively.
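Rather than consulting the printed table, these points can also be obtained from scipy; the sketch below (not part of the notes) reproduces two values used later in this chapter.

# Chi-square tail points: chi2_{n, alpha} with P(chi2_n > chi2_{n, alpha}) = alpha
from scipy.stats import chi2

n, alpha = 30, 0.01
print(chi2.ppf(1 - alpha, df=n))   # chi2_{30, 0.01} ~ 50.89
print(chi2.ppf(alpha, df=n))       # chi2_{30, 0.99} ~ 14.95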
Solution.
(a) It follows from Example 5.2.6.
(b) Using part (a) together with Property 8.2.1, it can be easily proved.
(ii) Student’s-distribution is symmetric about t = 0 but its peak and tail are higher than normal
distribution.
Theorem 8.3.3
Let X and Y be two independent random variables such that X ∼ N (0, 1) and Y ∼ χ2n . Then
    T = X/√(Y/n) = √n X/√Y ∼ tn.
Carrying out the integration (with a suitable substitution) gives the pdf of T:
    fT(t) = Γ((n + 1)/2)/(√(nπ) Γ(n/2)) (1 + t²/n)^{−(n+1)/2},  −∞ < t < ∞,
which is the pdf of the t-distribution with n degrees of freedom.
Theorem 8.3.4
If T ∼ tn then, for k < n,
    E(T^k) = 0, if k is odd,
           = n^{k/2} k! Γ((n − k)/2) / (2^k (k/2)! Γ(n/2)), if k is even.
follows t−distribution with n degrees of freedom. Since X and Y are independent, we have
We know that
    E(X^k) = 0, if k is odd,
           = 1 · 3 · · · (k − 1) = k!/((k/2)! 2^{k/2}), if k is even,
and
    E(Y^{−k/2}) = 2^{−k/2} Γ((n − k)/2)/Γ(n/2), if k (< n) is even.
Hence,
    E(T^k) = 0, if k is odd,
           = n^{k/2} k! Γ((n − k)/2) / (2^k (k/2)! Γ(n/2)), if k is even.
(a) E(T) = 0.
(b) Var(T) = n/(n − 2), for n > 2.
(c) β1 = 0.
(d) β2 = 6/(n − 4) > 0, for n > 4.
Theorem 8.3.5
iid
Let X1 , . . . , Xn ∼ N (µ, σ 2 ). Then
    √n(X̄n − µ)/s ∼ t_{n−1}.
and
    (n − 1)s²/σ² ∼ χ²_{n−1}.
Therefore,
    [√n(X̄n − µ)/σ] / √([(n − 1)s²/σ²]/(n − 1)) = √n(X̄n − µ)/s ∼ t_{n−1}.
Theorem 8.3.6
Let T ∼ tn . As n → ∞, the pdf of T converges to
    φ(t) = (1/√(2π)) e^{−t²/2}.
That is, the t-distribution converges to the standard normal distribution as n → ∞.
Therefore, using Stirling's approximation for the gamma function,
    1/(√n B(n/2, 1/2)) = Γ((n + 1)/2)/(√(nπ) Γ(n/2)) → 1/√(2π) as n → ∞.
The table is for P(tn > tn,α) = α, where tn and tn,α denote the t-distribution and the corresponding
point, respectively.
8.4 F −distribution
The F −distribution, also known as Snedecor’s F distribution, is a continuous probability distribution
that arises frequently as the null distribution of a test statistic, most notably in the Analysis of Variance
(ANOVA) and other F −tests. The formal definition can be given as follows:
Remark 8.4.1. If F follows the F-distribution with (m, n) degrees of freedom then it is denoted by
F ∼ Fm,n.
Most of the time, the following theorem is used to define the F-distribution.
Theorem 8.4.7
Let X and Y be independently distributed random variables such that X ∼ χ2m and Y ∼ χ2n .
Then
    F = (X/m)/(Y/n) = nX/(mY) ∼ Fm,n.
(Figure: pdf of the Fm,n distribution for (m, n) = (1, 1), (5, 2), (10, 10), (60, 50).)
    F = (X/m)/(Y/n),  W = Y,
Corollary 8.4.3
If U ∼ Fm,n then 1/U ∼ Fn,m.
Theorem 8.4.8
Let F ∼ Fm,n . Then
    E(F) = n/(n − 2), n > 2,
    Var(F) = 2n²(m + n − 2)/(m(n − 2)²(n − 4)), n > 4.
    Var(F) = E(F²) − (E(F))² = n²(m + 2)/(m(n − 2)(n − 4)) − n²/(n − 2)² = 2n²(m + n − 2)/(m(n − 2)²(n − 4)).
Corollary 8.4.4
Let µk = E(F^k). Then, for n > 2k,
    µk = (n/m)^k Γ(m/2 + k) Γ(n/2 − k) / (Γ(m/2) Γ(n/2)).
    (σ2²/σ1²)(s1²/s2²) ∼ F_{m−1,n−1},
where s1² = (1/(m − 1)) Σ_{i=1}^{m} (Xi − X̄m)², s2² = (1/(n − 1)) Σ_{j=1}^{n} (Yj − Ȳn)², X̄m = (1/m) Σ_{i=1}^{m} Xi and
Ȳn = (1/n) Σ_{j=1}^{n} Yj.
    (m − 1)s1²/σ1² ∼ χ²_{m−1}  and  (n − 1)s2²/σ2² ∼ χ²_{n−1}.
    f_{n,m,1−α} = 1/f_{m,n,α}.
Example 8.4.2. Two independent random samples of sizes n1 = 7 and n2 = 13 are taken from normal
populations with the same variance. What is the probability that the variance of the first sample will be
at least 3 times as large as that of the second sample?
Solution. Given σ1² = σ2² = σ², n1 = 7 and n2 = 13. Therefore,
    P(s1² ≥ 3s2²) = P(s1²/s2² ≥ 3) = P(F_{6,12} > 3) = 0.05.
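The tail probability of the F-distribution used in Example 8.4.2 is available directly in scipy; the following one-liner (a sketch, not part of the notes) confirms the value.

# P(F_{6,12} > 3) for Example 8.4.2
from scipy.stats import f

print(f.sf(3, dfn=6, dfd=12))    # ~ 0.05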
Estimation
In statistics, estimation refers to the process by which one makes inferences about a population, based
on information obtained from a sample. This is a process of guessing the underlying properties of
the population by observing the sample that has been taken from the population. The idea behind
this is to calculate and find out the approximate values of the population parameter on the basis of a
sample statistics. In order to determine the characteristic of data, we use point estimation or interval
estimation. A point estimate of a population parameter is a single value of a statistic. For example, the
sample mean X̄n is a point estimate of the population mean µ. Similarly, the sample variance s2 is a
point estimate of the population variance σ 2 . An interval estimate is defined by two numbers, between
which a population parameter is said to lie. For example, a < X̄n < b is an interval estimate of the
population mean µ. It indicates that the population mean is greater than a but less than b. In this chapter,
we will study these two types of estimation in detail.
Example 9.1.1. We can consider the process of checking the sugar level in the human body. Here, we
take a sample of 10 ml or 15 ml of blood. Based on this sample, we make a decision about the sugar
level in the body. This procedure of drawing conclusions is called statistical inference.
Example 9.1.2. Mean (µ), Variance (σ 2 ), mode, median, correlation coefficient etc. are examples of
parameters.
Definition 9.1.3 [Parameter Space]
The parameter space is the space of possible parameters values and it is denoted by Θ.
Remark 9.1.1. Now onwards, since we are dealing with parameters, and therefore, we will denote a
density function (pdf or pmf) by f (x; θ1 , θ2 , . . . , θk ) or f (x; θ), where θ = (θ1 , θ2 , . . . , θk ). Here, θ
denotes the set of parameters. For example, θ = (µ, σ 2 ) for normal population and θ = (n, p) for
binomial population.
Now, let us briefly outline the study of statistical inference in general.
Statistical Inference
In classical theory of estimation, we have density function f (x, θ), where x is the value of random
variable and θ is the set of unknown parameters. Here, we estimate the value of unknown parameters.
In Bayesian theory of estimation, we have density function f (x, θ), where x is the value of random
variable and θ is also a random variable. Here, we estimate the distribution of unknown parameters.
We will only work on some portion of classical theory of estimation in this course.
In the classical theory of estimation, we have several methods to find estimators, and we have
various properties to check whether an obtained estimator is good or not. The following figure shows
the methods of finding estimators and the properties of estimators.
In this course, we will study only two methods, namely, the method of moments and the method of
maximum likelihood. Also, we will study two properties of estimators, namely, unbiasedness and
consistency.
Remark 9.1.2. An estimator which satisfies all five properties is treated as a good estimator.
We first start with unbiased estimator in the following section.
Further, if E[T (X)] = g(θ) + b(θ), then we say T (X) is biased for g(θ) and b(θ) is called the bias
of T (X). Moreover, if E(T (X)) > g(θ) then it is called positive bias and if E(T (X)) < g(θ)
then it is called negative bias.
    σ² = Var(X) = np(1 − p) = n(p − p²)
       = n E[X/n − X(X − 1)/(n(n − 1))]
       = E[X(n − X)/(n − 1)].
This implies X(n − X)/(n − 1) is an unbiased estimator for σ².
Example 9.2.2. Let Xi ∼ N(µ, σ²), for i = 1, 2, . . . , n. Note that θ = (µ, σ²). We know that
    X̄n = (1/n) Σ_{i=1}^{n} Xi  and  s² = (1/(n − 1)) Σ_{i=1}^{n} (Xi − X̄n)²,
and
    X̄n ∼ N(µ, σ²/n)  and  (n − 1)s²/σ² ∼ χ²_{n−1}.
Note that
    E(X̄n) = (1/n) Σ_{i=1}^{n} E(Xi) = (1/n) · nµ = µ
and
    E((n − 1)s²/σ²) = n − 1 =⇒ E(s²) = σ².
This implies (−2)^X is an unbiased estimator for e^{−3λ}. But (−2)^X takes values 1, −2, 4, . . ., which
are not close to 0 < e^{−3λ} < 1. So, unbiasedness alone does not guarantee a good estimator.
(iv) Infinitely many unbiased estimator may exist. For example, let X1 , X2 , . . . be a sequence of iid
random variables with mean µ. Then,
If we have several unbiased estimators, how do we know which one is good? In this case, we compute
the variance of each statistic, and the estimator with the smaller variance is the better one. Such an
estimator is called an efficient estimator. It can be defined as follows.
If T1(X) and T2(X) are two unbiased estimators for g(θ) and Var(T1(X)) ≤ Var(T2(X)) for all θ,
then T1 is better than T2. In general, if the estimators are not unbiased then we compare the mean square
error, which can be defined as follows.
If T1(X) and T2(X) are two estimators for g(θ), the mean square error of Ti is MSE(Ti) = E[(Ti(X) − g(θ))²],
and T1 is preferred to T2 if MSE(T1) ≤ MSE(T2).
In other words,
p
Tn −→ g(θ).
Theorem 9.3.1
If E (Tn ) = γ (θn ) → γ(θ) and Var (Tn ) = σn2 → 0 as n → ∞ then Tn is consistent for γ(θ).
Therefore,
Note that
    E(X̄n) = µ
and
    Var(X̄n) = σ²/n → 0 as n → ∞.
Therefore, X̄n is a consistent estimator for µ. This can also be proved directly using the WLLN.
    (n − 1)s²/σ² ∼ χ²_{n−1},
and therefore, E(s²) = σ² and
    Var((n − 1)s²/σ²) = 2(n − 1) =⇒ Var(s²) = 2σ⁴/(n − 1) → 0 as n → ∞.
(ii) If Tn is a consistent estimator for γ(θ) and a, b are constants, then Tn′ = ((n − a)/(n − b)) Tn is also
consistent for γ(θ), since (n − a)/(n − b) = (1 − a/n)/(1 − b/n) → 1, so that Tn′ →p γ(θ) as n → ∞.
(iii) If Tn is consistent estimator for γ(θ) and g(γ(θ)) is continuous function then g (Tn ) is consistent
estimator of g(γ(θ)).
    µ′_1 = g1(θ1, θ2, . . . , θk),
    µ′_2 = g2(θ1, θ2, . . . , θk),
    ...
    µ′_k = gk(θ1, θ2, . . . , θk).
In the method of moments, we estimate θi by θ̂i = hi(α1, α2, . . . , αk), for i = 1, 2, . . . , k, where
αj = (1/n) Σ_{i=1}^{n} Xi^j, for j = 1, 2, . . . , k.
(iv) Two different distributions may have the same moments. Therefore, some parameters may have two
different estimators.
(v) The method of moment estimator is denoted by using a hat and MME in subscript on the symbol,
for example, the MME for sample mean µ is denoted by µ̂MME .
Example 9.4.1. Let Xi ∼iid P(λ), for i = 1, 2, . . . , n and λ > 0. Find the MME for λ.
Solution. Note that µ′_1 = E(X1) = λ and hence
    λ̂_MME = α1 = (1/n) Σ_{i=1}^{n} Xi = X̄n
is the MME for λ. Also, note that E(X̄n) = λ and Var(X̄n) = λ/n → 0 as n → ∞. Hence, X̄n is a consistent
estimator of λ.
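A short simulation illustrates the consistency of this estimator. The sketch below (not part of the notes) uses an arbitrary value λ = 2.5 chosen purely for illustration.

# Simulation of the MME (= sample mean) for a Poisson(lambda) sample
import numpy as np

rng = np.random.default_rng(1)
lam = 2.5
for n in (10, 100, 10_000):
    x = rng.poisson(lam, size=n)
    print(n, x.mean())            # lambda_hat_MME = X_bar_n, close to 2.5 for large n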
Example 9.4.2. Let Xi ∼iid N(µ, σ²), for i = 1, 2, . . . , n. Find the MMEs for µ and σ².
Solution. Note that µ′_1 = E(X1) = µ and µ′_2 = E(X1²) = σ² + µ². This implies
    µ̂_MME = α1 = X̄n
and
    σ̂²_MME = α2 − α1² = (1/n) Σ_{i=1}^{n} Xi² − (X̄n)² = (1/n) Σ_{i=1}^{n} (Xi − X̄n)² = ((n − 1)/n) s²,
respectively.
Take n = 3; we have

    x                      0       1       2       3
    P(X = x; p = 1/4)    27/64   27/64    9/64    1/64
    P(X = x; p = 3/4)     1/64    9/64   27/64   27/64
which is known as likelihood function. A statistic θ̂(x) is said to be the maximum likelihood
estimator of θ if
Remark 9.5.1. (ii) The estimator is generally denoted by using a hat on the symbol, for example,
the estimator for the sample mean µ is denoted by µ̂.
(ii) The maximum likelihood estimator is denoted by using a hat and MLE in subscript on the symbol,
for example, the MLE for sample mean µ is denoted by µ̂MLE .
Working Procedure for MLE. The following are the steps to compute MLEs.
(d) Check d²L/dθ² < 0.
Remark 9.5.2. (i) Note that steps (a)–(d) maximize the likelihood function. Moreover, if the
likelihood function is not differentiable or the likelihood equation is not solvable, then we have to use
some other approach to maximize the likelihood function.
(ii) Since L(x; θ) > 0 and the logarithm is an increasing function, L(x; θ) and ℓ(x; θ) = log L(x; θ)
attain their extreme values at the same point. It can also be verified as
    dL/dθ = 0 =⇒ (1/L) dL/dθ = 0 =⇒ dℓ/dθ = 0.
So, we can work with the log-likelihood function instead of the likelihood function to compute the critical
point. This is useful because the log-likelihood function is easier to handle than the likelihood
function.
Example 9.5.1. Let Xi ∼iid N(µ, σ²), for i = 1, 2, . . . , n. Find the MLEs of µ and σ².
Solution. The pdf of Xi is given by
    f(xi; µ, σ²) = (1/(√(2π)σ)) e^{−(1/2)((xi − µ)/σ)²},  −∞ < xi, µ < ∞ and σ > 0.
Therefore, the likelihood function is
    L(x; µ, σ²) = Π_{i=1}^{n} (1/(√(2π)σ)) e^{−(1/2)((xi − µ)/σ)²} = (2π)^{−n/2} σ^{−n} e^{−(1/2) Σ_{i=1}^{n} ((xi − µ)/σ)²}.
Hence, ((n − 1)/n) s² is the MLE for σ².
Remark 9.5.3. Note that the MLE may not be unbiased; however, under mild regularity conditions it is consistent.
Example 9.5.2. Let Xi ∼iid N(µ, σ²), for i = 1, 2, . . . , n, with µ > 0. Find the MLEs of µ and σ².
Solution. Note that we computed µ̂_MLE = X̄n and σ̂²_MLE = ((n − 1)/n) s² in the previous example. Now
it is given that µ > 0. Note that
    ∂ℓ/∂µ = (n/σ²)(x̄n − µ), which is > 0 if µ < x̄n and < 0 if µ > x̄n.
Therefore,
    µ̂_MLE = X̄n, if x̄n > 0,
            0,   if x̄n ≤ 0,
          = max(X̄n, 0).
Further, if dℓ/dθ = 0 is not solvable then how do we handle the problem? This mainly happens if the
density depends only on the parameters (and not on x inside its support), and the problem should then be
handled accordingly. Let us consider one example of such a case.
and
    ∂ℓ/∂a = n/(b − a) = 0 =⇒ b − a = ∞,
which is not solvable. This method is not useful in this case. Now, we apply the definition of the
MLE directly. Observe that L(x; a, b) is maximum when b − a is minimum. Let us define
    x(1) = min{x1, x2, . . . , xn},
    x(2) = second smallest of {x1, x2, . . . , xn},
    ...
    x(n) = max{x1, x2, . . . , xn}.
Then
Note that a ≤ x(1) and b ≥ x(n), so that
    L(x; a, b) = 1/(b − a)^n ≤ 1/(x(n) − x(1))^n,
and
If θ̂(x) is a maximum likelihood estimate for θ, then g(θ̂(x)) is a maximum likelihood estimate
for g(θ).
Suppose T1 (x) and T2 (x) are observed values of T1 (X) and T2 (X), respectively. Then, we say
(T1 (x), T2 (x)) is 100(1 − α)% confidence interval for g(θ).
Remark 9.6.1. The C.I. means that we have 100(1 − α)% confidence that the value of the function of
parameter g(θ) lies in (T1 (x), T2 (x)).
Confidence interval for µ: To compute the C.I. for µ of a normal population, we deal with two cases
when σ is known and unknown.
(a) When σ² is known: The statistic is
    Z = √n(X̄n − µ)/σ ∼ N(0, 1).
(b) When σ² is unknown: The statistic is
    T = √n(X̄n − µ)/s ∼ t_{n−1}.
Note that
    P(−t_{n−1,α/2} ≤ T ≤ t_{n−1,α/2}) = 1 − α
    =⇒ P(−t_{n−1,α/2} ≤ √n(X̄n − µ)/s ≤ t_{n−1,α/2}) = 1 − α
    =⇒ P(−(s/√n) t_{n−1,α/2} ≤ X̄n − µ ≤ (s/√n) t_{n−1,α/2}) = 1 − α
    =⇒ P(X̄n − (s/√n) t_{n−1,α/2} ≤ µ ≤ X̄n + (s/√n) t_{n−1,α/2}) = 1 − α.
We know that x̄n ± (s/√n) t_{n−1,α/2} is a 100(1 − α)% C.I. for µ. It can be easily verified that
t_{9,0.025} = 2.262. Therefore,
    x̄10 ± (s/√10) t_{9,0.025} = (0.0477, 0.0535)
is a 95% C.I. for µ.
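In practice such an interval is computed directly from the data. The sketch below (not part of the notes) shows the generic t-based interval x̄ ± (s/√n) t_{n−1,α/2}; the data values are made up for illustration and are not the data of the example above.

# Generic t-based confidence interval for the mean
import numpy as np
from scipy.stats import t

x = np.array([0.049, 0.052, 0.047, 0.051, 0.053, 0.050, 0.048, 0.052, 0.049, 0.055])
n, alpha = len(x), 0.05
xbar, s = x.mean(), x.std(ddof=1)                 # ddof=1 gives the sample variance s^2
half = s / np.sqrt(n) * t.ppf(1 - alpha / 2, df=n - 1)
print(xbar - half, xbar + half)                   # 95% C.I. for mu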
Confidence interval for σ 2 : To compute the C.I. for σ 2 of a normal population, we deal with two cases
when µ is known and unknown.
(a) When µ is known: The statistic is
    W = Σ_{i=1}^{n} ((Xi − µ)/σ)² ∼ χ²_n.
Note that
    P(χ²_{n,1−α/2} ≤ W ≤ χ²_{n,α/2}) = 1 − α
    =⇒ P(1/χ²_{n,α/2} ≤ σ²/Σ_{i=1}^{n}(Xi − µ)² ≤ 1/χ²_{n,1−α/2}) = 1 − α
    =⇒ P(Σ_{i=1}^{n}(Xi − µ)²/χ²_{n,α/2} ≤ σ² ≤ Σ_{i=1}^{n}(Xi − µ)²/χ²_{n,1−α/2}) = 1 − α.
(b) When µ is unknown: The statistic is
    W = (n − 1)s²/σ² ∼ χ²_{n−1}.
Note that
    P(χ²_{n−1,1−α/2} ≤ W ≤ χ²_{n−1,α/2}) = 1 − α
    =⇒ P(χ²_{n−1,1−α/2} ≤ (n − 1)s²/σ² ≤ χ²_{n−1,α/2}) = 1 − α.
Then,
    ((n − 1)s²/χ²_{n−1,α/2}, (n − 1)s²/χ²_{n−1,1−α/2})
From the chi-square table, it can be easily verified that χ²_{30,0.99} = 14.95 and χ²_{30,0.01} = 50.89. Therefore,
the 98% C.I. for σ is
    (√((n − 1)s²/χ²_{n−1,α/2}), √((n − 1)s²/χ²_{n−1,1−α/2})) = (√(30 × (0.83)²/50.89), √(30 × (0.83)²/14.95)) = (0.6373, 1.1756).
Confidence interval for µ1 − µ2 : To compute the C.I. for µ1 − µ2 of a normal population, we deal
with following three cases:
(a) When σ1² and σ2² are known: Note that
    X̄m ∼ N(µ1, σ1²/m)  and  Ȳn ∼ N(µ2, σ2²/n),
so that
    X̄m − Ȳn ∼ N(µ1 − µ2, σ1²/m + σ2²/n).
Hence,
    Z = (X̄m − Ȳn − (µ1 − µ2))/√(σ1²/m + σ2²/n) ∼ N(0, 1).
Note that
    P(−z_{α/2} ≤ Z ≤ z_{α/2}) = 1 − α
    =⇒ P(−z_{α/2} ≤ (X̄m − Ȳn − (µ1 − µ2))/√(σ1²/m + σ2²/n) ≤ z_{α/2}) = 1 − α
    =⇒ P(X̄m − Ȳn − √(σ1²/m + σ2²/n) z_{α/2} ≤ µ1 − µ2 ≤ X̄m − Ȳn + √(σ1²/m + σ2²/n) z_{α/2}) = 1 − α.
Let x̄m and ȳn be observed values of X̄m and Ȳn, respectively. Then,
    x̄m − ȳn ± √(σ1²/m + σ2²/n) z_{α/2}
(b) When σ1² = σ2² = σ² is unknown: Note that
    (m − 1)s1²/σ² ∼ χ²_{m−1}  and  (n − 1)s2²/σ² ∼ χ²_{n−1}.
Note that
    P(−t_{m+n−2,α/2} ≤ T ≤ t_{m+n−2,α/2}) = 1 − α
    =⇒ P(−t_{m+n−2,α/2} ≤ √(mn/(m + n)) (X̄m − Ȳn − (µ1 − µ2))/sp ≤ t_{m+n−2,α/2}) = 1 − α
    =⇒ P(−√((m + n)/(mn)) sp t_{m+n−2,α/2} ≤ X̄m − Ȳn − (µ1 − µ2) ≤ √((m + n)/(mn)) sp t_{m+n−2,α/2}) = 1 − α
    =⇒ P(X̄m − Ȳn − √((m + n)/(mn)) sp t_{m+n−2,α/2} ≤ µ1 − µ2 ≤ X̄m − Ȳn + √((m + n)/(mn)) sp t_{m+n−2,α/2}) = 1 − α.
Let x̄m and ȳn be observed values of X̄m and Ȳn, respectively. Then,
    x̄m − ȳn ± √((m + n)/(mn)) sp t_{m+n−2,α/2}
is a 100(1 − α)% C.I. for µ1 − µ2.
(c) When σ1² ≠ σ2² are unknown: In this case, we have an approximate C.I. which is based on the
statistic
    T* = (X̄m − Ȳn − (µ1 − µ2))/√(s1²/m + s2²/n) ∼ tν (approximately).
Note that
    P(−t_{ν,α/2} ≤ T* ≤ t_{ν,α/2}) = 1 − α
    =⇒ P(−t_{ν,α/2} ≤ (X̄m − Ȳn − (µ1 − µ2))/√(s1²/m + s2²/n) ≤ t_{ν,α/2}) = 1 − α
    =⇒ P(X̄m − Ȳn − √(s1²/m + s2²/n) t_{ν,α/2} ≤ µ1 − µ2 ≤ X̄m − Ȳn + √(s1²/m + s2²/n) t_{ν,α/2}) = 1 − α.
Let x̄m and ȳn be observed values of X̄m and Ȳn, respectively. Then,
    x̄m − ȳn ± √(s1²/m + s2²/n) t_{ν,α/2}
is a 100(1 − α)% C.I. for µ1 − µ2.
Example 9.6.3. Two machines are used to fill plastic bottles with dish washing detergent. The standard
deviations of the filling volume are known to be σ1 = 0.15 fluid ounces and σ2 = 0.12 fluid ounces for
the two machines. Random samples of n1 = 12 bottles from machine 1 and n2 = 10 bottles from
machine 2 are selected, and the observed sample means are x̄1 = 30.87 and x̄2 = 30.68. Find a 90%
C.I. for µ1 − µ2.
Solution. Note that z_{0.05} = 1.645 and therefore, the 90% C.I. for µ1 − µ2 is given by
    x̄1 − x̄2 ± √(σ1²/n1 + σ2²/n2) z_{α/2} = 30.87 − 30.68 ± √((0.15)²/12 + (0.12)²/10) × 1.645 = (0.095, 0.285).
Therefore,
    F = [ (1/m) Σ_{i=1}^{m} (Xi − µ1)²/σ1² ] / [ (1/n) Σ_{i=1}^{n} (Yi − µ2)²/σ2² ] = (n Σ_{i=1}^{m} (Xi − µ1)² σ2²)/(m Σ_{i=1}^{n} (Yi − µ2)² σ1²) ∼ Fm,n.
Note that
    P(f_{m,n,1−α/2} ≤ F ≤ f_{m,n,α/2}) = 1 − α
    =⇒ P(f_{m,n,1−α/2} ≤ (n Σ_{i=1}^{m}(Xi − µ1)² σ2²)/(m Σ_{i=1}^{n}(Yi − µ2)² σ1²) ≤ f_{m,n,α/2}) = 1 − α
    =⇒ P( (1/f_{m,n,α/2}) (n Σ_{i=1}^{m}(Xi − µ1)²)/(m Σ_{i=1}^{n}(Yi − µ2)²) ≤ σ1²/σ2² ≤ (1/f_{m,n,1−α/2}) (n Σ_{i=1}^{m}(Xi − µ1)²)/(m Σ_{i=1}^{n}(Yi − µ2)²) ) = 1 − α.
    (m − 1)s1²/σ1² ∼ χ²_{m−1}  and  (n − 1)s2²/σ2² ∼ χ²_{n−1},
and therefore,
    F* = [((m − 1)s1²/σ1²)/(m − 1)] / [((n − 1)s2²/σ2²)/(n − 1)] = (s1² σ2²)/(s2² σ1²) ∼ F_{m−1,n−1}.
Note that
    P(f_{m−1,n−1,1−α/2} ≤ F* ≤ f_{m−1,n−1,α/2}) = 1 − α
    =⇒ P(f_{m−1,n−1,1−α/2} ≤ (s1² σ2²)/(s2² σ1²) ≤ f_{m−1,n−1,α/2}) = 1 − α
    =⇒ P( (1/f_{m−1,n−1,α/2}) (s1²/s2²) ≤ σ1²/σ2² ≤ (1/f_{m−1,n−1,1−α/2}) (s1²/s2²) ) = 1 − α.
Then,
    ( (1/f_{m−1,n−1,α/2}) (s1²/s2²), (1/f_{m−1,n−1,1−α/2}) (s1²/s2²) )
Confidence Intervals for Proportions: Let X denote the number of successes in n observed Bernoulli
trials with unknown success probability p. Then, the sample proportion p̂ = X/n is the estimate of the
population proportion p. Using CLT, we have
    Z = (p − p̂)/√(p̂(1 − p̂)/n) ∼ N(0, 1) (approximately).
Therefore,
    P(−z_{α/2} ≤ Z ≤ z_{α/2}) = 1 − α
    =⇒ P(−z_{α/2} ≤ (p − p̂)/√(p̂(1 − p̂)/n) ≤ z_{α/2}) = 1 − α
    =⇒ P(p̂ − z_{α/2} √(p̂(1 − p̂)/n) ≤ p ≤ p̂ + z_{α/2} √(p̂(1 − p̂)/n)) = 1 − α.
Example 9.6.5. A survey conducted by an institute found that 323 students out of 1404 paid their
education fee by student loan. Find the 90% C.I. for the true proportion of students who paid their
education fee by student loans.
Solution. Given x = 323 and n = 1404. Therefore,
    p̂ = x/n = 323/1404 = 0.23.
Also, we have α = 0.1 =⇒ z_{α/2} = z_{0.05} = 1.645.
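The interval itself follows from the formula p̂ ± z_{α/2} √(p̂(1 − p̂)/n). The short sketch below finishes the computation of Example 9.6.5; the final numbers are my own evaluation of that formula, not values taken from the notes.

# Completing the 90% C.I. of Example 9.6.5
from math import sqrt
from scipy.stats import norm

x, n, alpha = 323, 1404, 0.10
p_hat = x / n
z = norm.ppf(1 - alpha / 2)                        # ~ 1.645
half = z * sqrt(p_hat * (1 - p_hat) / n)
print(p_hat - half, p_hat + half)                  # roughly (0.21, 0.25)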
    Z* = (p1 − p2 − (p̂1 − p̂2))/√(p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2) ∼ N(0, 1) (approximately).
Therefore,
    (p̂1 − p̂2 − z_{α/2} √(p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2), p̂1 − p̂2 + z_{α/2} √(p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2)),
or
    p̂1 − p̂2 ± z_{α/2} √(p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2),
is a 100(1 − α)% C.I. for p1 − p2.
Example 9.6.6. Suppose we want to estimate the difference between the proportion of residents who
support a certain law in county A and the proportion who support the law in county B. In Sample 1,
62 out of 100 residents support the law. In Sample 2, 46 out of 100 residents support the law. Find a
90% C.I. for the difference in population proportions.
Solution. Here, n1 = 100, p̂1 = 62/100 = 0.62 and n2 = 100, p̂2 = 46/100 = 0.46. Also, α =
0.1 =⇒ z_{α/2} = z_{0.05} = 1.645.
9.7 Exercises
1. Find an unbiased estimator for µ2 in Example 9.2.2. Also, using the definition of s2 , prove that
E (s2 ) = σ 2 .
2. Let Xi ∼ Exp(λ), for i = 1, 2, . . . , n and λ > 0. Show that (n − 1)/Y, where Y = Σ_{i=1}^{n} Xi, is an
unbiased estimator for λ.
3. Let Xi ∼iid U(0, θ), for i = 1, 2, . . . , n. Show that X(n) = max{X1, X2, . . . , Xn} is a consistent
estimator for θ.
4. Let Xi ∼iid Ber(p), for i = 1, 2, . . . , n and 0 ≤ p ≤ 1. Find the MLE of p.
5. Let Xi ∼iid P(λ), for i = 1, 2, . . . , n and λ > 0. Find the MLE of λ.
Find MLE of θ.
7. A random sample of size n = 100 is taken from a normal population with σ = 5.1. Given that
the sample mean x̄100 = 21.6, construct 95% C.I. for population mean µ.
8. A random sample of size n = 80 is taken from a normal population. Given that the sample
standard deviation s = 9.583, construct a 95% C.I. for the population standard deviation σ.
9. A survey of 1898 people found that 45% of the adults said that dandelions were the toughest
weeds to control in their yards. Find the 95% C.I. for the true proportion who said that dandelions
were the toughest weeds to control in their yards.
10. A market research company wants to estimate the proportion of households in the country with
digital televisions. A random sample of 80 households is selected and 46 of them have digital
televisions. Construct 95% and 98% confidence intervals for the proportion of households with
digital televisions.
Answers of Exercises
Chapter 1
2. 0.34
3. 5/9
4. 379/400
6. 4/13
7. 29/32
8. 8/195
Chapter 2
2. 16/25
5. False
    x         0       1       2
    pX(x)   16/25    8/25    1/25
(b) MX(t) = (1/25)(e^t + 4)², ϕX(t) = (1/25)(e^{it} + 4)² and GX(t) = (1/25)(t + 4)²
(c) E(X) = 2/5 and Var(X) = 8/25
10. Q_{1/4} = µ − λ, Q_{1/2} = µ and Q_{3/4} = µ + λ
Chapter 3
1. 0.91854
2. 11
7. No
Chapter 4
1. 1/4
2. 2/√3
3. (α + β − 1)/(α − 1), no
8. 13.01
Chapter 5
2. Y ∼ B(n, q)
5. Normal distribution
7. Log-normal distribution
Chapter 6
1. (a) & (b) The joint pmf and marginals are given by

              X
  Y        0      1      2     fY(y)
  0       1/8    1/8     0      1/4
  1       1/8    2/8    1/8     1/2
  2        0     1/8    1/8     1/4
  fX(x)   1/4    1/2    1/4      1

   (c) 1  (d) 1
2. (a) fX (x) = 2x, 0 < x < 1, fY (y) = 2(1 − y), 0 < y < 1 (b) 7/16 (c) 2/3 (d) 1/2
4. -1/11
5. 1/2
6. 0.9803
Chapter 7
1. Not convergent
4. 0.5684
5. 0.5
6. 3387
7. 0.0179
Chapter 9
1. X̄n² − s²/n
5. λ̂MLE = X̄n
6. X(n) − 1, X(1) + 1
7. (20.6, 22.6)
8. (8.29, 11.35)
9. (0.428, 0.472)