
MA-202: Probability & Statistics

Class Notes

Amit Kumar
Department of Mathematical Sciences
Indian Institute of Technology (BHU) Varanasi
Varanasi – 221005, India.
Contents

1 Basic Probability 1
1.1 Basic Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Definitions of Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Properties of Probability Measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4 Conditional Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.5 Independence of Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

2 Random Variable and its Distribution 22


2.1 Borel σ-algebra and Measurable Function . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2 Random Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3 Distribution Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.4 Discrete Random Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.5 Continuous Random Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.6 Mixed Random Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.7 Moments and Generating Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.8 Mode, Median and Quantiles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

3 Special Discrete Distributions 57


3.1 Discrete Uniform Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.2 Degenerate Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.3 Bernoulli Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.4 Binomial Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.5 Poisson Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.6 Geometric Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.7 Negative Binomial Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.8 Hypergeometric Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

4 Special Continuous Distributions 84


4.1 Continuous Uniform Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.2 Exponential Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.3 Gamma Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.4 Beta Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.5 Normal/Gaussian Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.6 Cauchy Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

4.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

5 Function of Random Variables and Its Distribution 103


5.1 Some Theoretical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.2 Approaches to Find the Distribution of Y = g(X) . . . . . . . . . . . . . . . . . . . 104
5.2.1 PMF Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.2.2 CDF Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.2.3 MGF Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.2.4 PDF Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

6 Random Vector and its Joint Distribution 110


6.1 Joint Cumulative Distribution Function . . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.2 Joint Discrete Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
6.3 Joint Continuous Random Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
6.4 Independence of Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
6.5 Expectation and Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.6 Bivariate Normal Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
6.7 Other Bivariate Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
6.8 Transformation of Random Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
6.8.1 PMF Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
6.8.2 CDF Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
6.8.3 MGF Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
6.8.4 PDF Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
6.9 n-dimensional Random Vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
6.9.1 Joint Discrete Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . 139
6.9.2 Joint Continuous Random Variables . . . . . . . . . . . . . . . . . . . . . . . 140
6.9.3 Some Important Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
6.10 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144

7 Large Sample Theory 146


7.1 Mode of Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
7.1.1 Convergence in Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
7.1.2 Convergence in Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
7.1.3 Convergence Almost Surely . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
7.2 Law of Large Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
7.3 Central Limit Theorems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
7.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160

8 Statistics and Sampling Distributions 162


8.1 Random Sample . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
8.2 Chi-square Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
8.3 Student t-distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
8.4 F-distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175

Amit Kumar 3 MA-202: Probability & Statistics


9 Estimation 185
9.1 Basic Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
9.2 Unbiased Estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
9.3 Consistent Estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
9.4 Method of Moment Estimator (MME) . . . . . . . . . . . . . . . . . . . . . . . . . . 191
9.5 Maximum Likelihood Estimator (MLE) . . . . . . . . . . . . . . . . . . . . . . . . . 193
9.6 Confidence Interval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
9.6.1 Confidence Intervals for One Normal Population . . . . . . . . . . . . . . . . . 197
9.6.2 Confidence Intervals for Two Normal Populations . . . . . . . . . . . . . . . . 201
9.6.3 Confidence Intervals for Proportion . . . . . . . . . . . . . . . . . . . . . . . 207
9.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210

10 Answers of Exercises 212


Chapter 1

Basic Probability

Probability is all about gaining a degree of confidence in deciding on something where the process is
random, and the result is not precisely determinable a priori. The mathematical theory of probability,
although initiated for gambling, has a wide range of applications starting from game theory to physics
to finance and extends to almost all areas of science and engineering. For example, a ticket booking
system of IRCTC has an option “CNF Probability” while booking a ticket on the waiting list. It helps
the people to identify the chance of confirmation of the ticket.
When we perform a random experiment whose all possible outcomes are already known, but the result
of a specific experiment is not predictable, then probability comes into the picture. One needs to design
a suitable probability space depending on the outcome, which comprises a sample space, a σ-algebra,
and a probability measure. Probability space plays a fundamental role in the probabilistic analysis
of a model. There are two levels of probability theory; one is when the underlying sample space is
countable, and another is the case when this space is uncountable. In the case of a countable sample
space, we have discrete probability, whereas the definition of probability becomes more challenging
when the sample space is uncountable, referred to as continuous probability.
This chapter aims to develop a background in the basic concepts of probability, where we learn how
to build a probability space for a given probability model and study some important properties of
probability spaces. We also discuss the conditional probability and some consequent results.

1.1 Basic Definitions


Definition 1.1.1 [Experiment]
An experiment is observing something happen or conducting something under certain conditions
which result in some outcomes.

Types of Experiment: There are two types of experiment.

Experiment

Deterministic Experiment Random Experiment

Definition 1.1.2 [Deterministic Experiment]
If an experiment is conducted under certain conditions and it results in a known outcome, then it is
called a deterministic experiment.

Example 1.1.1. (a) 2H2 + O2 = 2H2 O (water).


(b) The sum of 3 and 2 is equal to 5.

Definition 1.1.3 [Random Experiment]


An experiment is said to be a random experiment if the following conditions hold:

(i) all outcomes of the experiment are known in advance,

(ii) any performance of the experiment results in an outcome that is not known in advance, and

(iii) the experiment can be repeated under identical conditions.

Example 1.1.2. (a) Tossing of a coin.


(b) Rolling of a die.
(c) Drawing a card from a deck of cards.
(d) Waiting time for the next bus to arrive at a bus stop.
(e) Life of a bulb.

Definition 1.1.4 [Sample Space]


The set of all possible outcomes of a random experiment is called a sample space.

The sample space is denoted by Ω or U or S. Throughout these notes, we use Ω for the sample space.
Example 1.1.3. The sample spaces of the random experiments in Example 1.1.2 are as follows:

Random Experiments Sample Space


Tossing of a coin Ω = {H, T }
Rolling of a die Ω = {1, 2, 3, 4, 5, 6}
Drawing a card from a deck of cards Ω = {1, 2, . . . , 52}
Waiting time for the next bus to arrive at a bus stop (max. 15 minutes) Ω = (0, 15)
Life of a bulb Ω = (0, ∞)

Definition 1.1.5 [Event]


An event is any subset of the sample space.

Amit Kumar 2 MA-202: Probability & Statistics


Chapter 1: Basic Probability


Note that φ and Ω are called the impossible and sure events, respectively.
Types of Events: Events can take several forms, including unions, intersections and complements,
among many others. The significance of some of them is as follows:

1. Union of events:

(a) A ∪ B ≡ occurrence of at least one of A and B.


(b) ⋃_{i=1}^{n} Ai ≡ occurrence of at least one of Ai , i = 1, 2, . . . , n.

(c) ⋃_{i=1}^{∞} Ai ≡ occurrence of at least one of Ai , i = 1, 2, . . ..

2. Intersection of events:

(a) A ∩ B ≡ simultaneous occurrence of A and B.


Amit Kumar 3 MA-202: Probability & Statistics


Chapter 1: Basic Probability
(b) ⋂_{i=1}^{n} Ai ≡ simultaneous occurrence of A1 , A2 , . . . , An .

(c) ⋂_{i=1}^{∞} Ai ≡ simultaneous occurrence of A1 , A2 , . . ..

3. If A ∩ B = φ, then A and B are called mutually exclusive events; that is, the happening of one
of them excludes the possibility of the happening of the other.


4. If Ω = ⋃_{i=1}^{n} Ai , then A1 , A2 , . . . , An are called exhaustive events.

5. If A1 , A2 , . . . are such that Ai ∩ Aj = φ, i ≠ j, then A1 , A2 , . . . are called pairwise disjoint or
mutually exclusive events.

6. Ac ≡ Not happening of A.


7. A\B = A − B = A ∩ Bc ≡ happening of A but not of B.


1.2 Definitions of Probability


There are three approaches to defining probability. The first approach below was given by Laplace
in 1812 and is the simplest one, but it has many limitations.

Definition 1.2.6 [Classical or Mathematical or a Prior Definition of Probability]
Suppose a random experiment has N possible outcomes which are mutually exclusive, exhaustive
and equally likely. Let M of these outcomes be favourable to the happening of an event A. Then
the probability of A is defined by
M
P(A) = .
N

Example 1.2.1. In rolling of a die, we have Ω = {1, 2, 3, 4, 5, 6}. Let A be the event that an even
number occurs. Then,

P(A) = #A/#Ω = 3/6 = 1/2.

Example 1.2.2. In tossing of two coins, we have Ω = {HH, HT, TH, TT}. Let A be the event that
both faces are the same. Then,

P(A) = #A/#Ω = 2/4 = 1/2.
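The counting in Examples 1.2.1 and 1.2.2 can be checked mechanically. Below is a minimal Python sketch; the helper name classical_probability is ours, not from the notes, and Fraction is used so the probabilities stay exact.

```python
from fractions import Fraction

def classical_probability(favourable, sample_space):
    # Classical definition: P(A) = #A / #Omega, assuming finitely many
    # mutually exclusive, exhaustive, and equally likely outcomes.
    return Fraction(len(favourable), len(sample_space))

# Example 1.2.1: rolling a die, A = "an even number occurs"
omega_die = {1, 2, 3, 4, 5, 6}
A = {x for x in omega_die if x % 2 == 0}
p_even = classical_probability(A, omega_die)

# Example 1.2.2: tossing two coins, B = "both faces are the same"
omega_coins = {"HH", "HT", "TH", "TT"}
B = {s for s in omega_coins if s[0] == s[1]}
p_same = classical_probability(B, omega_coins)
```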

Drawbacks: The definition loses its significance in the following context of real-life situations.

(a) N need not be finite, for example, the life of a bulb.

(b) Events may not always be equally likely in real-life applications; for example, in climatol-
ogy, a rainy day and a dry day cannot be equally likely in general.

Due to the limitations of the classical definition, another approach is needed to define probability. The
following definition was given by Von Mises.

Definition 1.2.7 [Empirical or Statistical or Relative Frequency Definition of Probability]


Suppose a random experiment is conducted a large number of times independently under
identical conditions. Let an denote the number of times the event A occurs in n trials of the
experiment. Then, we define

P(A) = lim_{n→∞} an/n, provided the limit exists.

Example 1.2.3. Consider the experiment of tossing a coin repeatedly, where the outputs are of the
following form:
HHT HHT HHT HHT . . ..
Let A denote the event that H occurs. Then

an/n = 1/1, 2/2, 2/3, 3/4, 4/5, 4/6, . . . ,

that is, for m = 1, 2, 3, . . .,

an/n = (2m − 1)/(3m − 2), if n = 3m − 2,
an/n = 2m/(3m − 1), if n = 3m − 1,
an/n = 2m/(3m), if n = 3m.

Therefore, P(A) = lim_{n→∞} an/n = 2/3.
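The limit 2/3 can also be verified numerically by generating the deterministic toss sequence and tracking a_n/n; a small sketch (the function name is ours):

```python
def running_relative_frequency(pattern, n_max):
    """Compute a_n / n for a deterministic toss sequence repeating `pattern`."""
    heads, freqs = 0, []
    for n in range(1, n_max + 1):
        if pattern[(n - 1) % len(pattern)] == "H":
            heads += 1            # a_n counts occurrences of H in the first n tosses
        freqs.append(heads / n)
    return freqs

freqs = running_relative_frequency("HHT", 300_000)
# the running relative frequency settles near 2/3
```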

Drawbacks: The definition loses its significance in the following context of real-life situations.

(a) Actual observations of the experiment may sometimes not be possible, for example, the
probability of success in launching satellites.
(b) Note that, for small values of ε > 0, n^{1−ε} is close to n, while n^{1−ε}/n → 0, which
seems to be an unexpected probability. On the other hand, (n − n^{1−ε})/n → 1, for such
small ε.

The purpose of probability theory is to set up a general mathematical framework to quantify the chance
of occurrence of an event in a random experiment. So, an abstract notion of probability is desirable
to deal with a broad class of experiments. To achieve this, let us first define all the desirable features
needed to set up a probability model.

Definition 1.2.8 [Sigma-algebra or σ-algebra or Sigma-field or σ-field]


A class of subsets F of Ω is said to be a sigma-algebra if

(i) φ ∈ F .

(ii) For every A ∈ F , we have Ac ∈ F .

(iii) For A1 , A2 , . . . ∈ F , we have ⋃_{i=1}^{∞} Ai ∈ F .

Important Observations:

(a) φ ∈ F =⇒ Ω = φc ∈ F .
(b) ⋃_{i=1}^{n} Ai ∈ F , for every n ∈ N (substitute An+1 = An+2 = · · · = φ in (iii)).

(c) For A1 , A2 , . . . ∈ F , we have Ac1 , Ac2 , . . . ∈ F and therefore ⋃_{i=1}^{∞} Aci ∈ F . Hence,

⋂_{i=1}^{∞} Ai = (⋃_{i=1}^{∞} Aci)c ∈ F .

(d) A sigma-algebra is closed under complement, countable unions (hence finite unions), and
countable intersections (hence finite intersections).

(e) Intersection of sigma-algebras is also a sigma-algebra.

(f) Each element of F is called an event.

Example 1.2.4. For any set Ω, the set {φ, Ω} is the smallest sigma-algebra and P(Ω), the power set,
is the largest sigma-algebra.
Example 1.2.5. Let Ω = {1, 2, 3}. Then

(a) F1 = {φ, {1}, {2, 3}, Ω} is a sigma-algebra.

(b) F2 = {φ, Ω, {1, 3}, {2, 3}, {2}} is NOT a sigma-algebra (since {2, 3}c = {1} ∉ F2 ).

Example 1.2.6. For any set A ⊆ Ω, F = {φ, A, Ac , Ω} is always a sigma-algebra (it is called the
sigma-algebra generated by A).
Example 1.2.7. Let Ω = {1, 2, 3, . . .} and

F = {A : A is finite or Ac is finite} .

Then, F is not a sigma-field.


Proof. Suppose F is a sigma-field. We know that

{1}, {3}, {5}, . . . ∈ F =⇒ A := {1, 3, 5, . . .} ∈ F ,

which is a contradiction as A and Ac are not finite. Hence, F is not a sigma-field.
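Unlike the countable case above, for a finite sample space the sigma-algebra axioms can be checked exhaustively, since countable unions reduce to finite unions. A hypothetical checker along these lines (the function name is ours) confirms Example 1.2.5:

```python
from itertools import combinations

def is_sigma_algebra(omega, family):
    # Check the three axioms over a finite sample space by brute force.
    fam = {frozenset(a) for a in family}
    if frozenset() not in fam:                              # (i)  phi in F
        return False
    if any(frozenset(omega) - a not in fam for a in fam):   # (ii) closed under complement
        return False
    for r in range(2, len(fam) + 1):                        # (iii) closed under unions
        for sub in combinations(fam, r):
            if frozenset().union(*sub) not in fam:
                return False
    return True

omega = {1, 2, 3}
F1 = [set(), {1}, {2, 3}, omega]            # Example 1.2.5(a): a sigma-algebra
F2 = [set(), omega, {1, 3}, {2, 3}, {2}]    # Example 1.2.5(b): not one
```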

Definition 1.2.9 [Measurable Space]


Let Ω be a sample space and F be a sigma-algebra of subsets of Ω. Then, the pair (Ω, F ) is
called a measurable space.

Now, we are in a position to give an abstract definition of probability, called a probability measure,
which was given by Kolmogorov in 1933.

Definition 1.2.10 [Axiomatic Definition of Probability]


Let (Ω, F ) be a measurable space. A set function P : F → R is said to be a probability function
if it satisfies the following:

(a) P(A) ≥ 0, for all A ∈ F [non-negativity property]

(b) P(Ω) = 1

(c) For any sequence of pair-wise disjoint events Ai ∈ F , that is, Ai ∩ Aj = φ for i ≠ j, we
have

P(⋃_{i=1}^{∞} Ai) = Σ_{i=1}^{∞} P(Ai). [countable additivity property]

Remark 1.2.1. Developing a probability model for a random phenomenon comprises three steps:
identifying the sample space Ω; identifying all events of interest and forming a suitable sigma-algebra
F that includes these events; and finally, defining a suitable probability measure P on the
sigma-algebra.

Definition 1.2.11 [Probability Space]


Putting together the sample space Ω, a sigma-algebra F and a probability measure P, the triple
(Ω, F , P) is called a probability space.

Question: We define the probability measure for pair-wise disjoint events only (Axiom (c)); why
not for all unions?
Answer: Define

Bi = Ai \ (⋃_{j=1}^{i−1} Aj), for i = 1, 2, . . . .

Then, the Bi ’s are pair-wise disjoint, and

⋃_{i=1}^{∞} Bi = ⋃_{i=1}^{∞} Ai .

Hence, we can convert any union into a union of pair-wise disjoint elements, so it is enough
to define the probability measure for pair-wise disjoint elements.

Example 1.2.8. Consider the experiment of rolling of two dice. Then

Ω = {(1, 1), (1, 2), . . . , (6, 6)} = {(i, j) : 1 ≤ i, j ≤ 6}.

Take F = P(Ω). Let A be the event that the sum showing on the two dice is equal to 4. Then

A = {(1, 3), (2, 2), (3, 1)} ∈ F

and

P(A) = 3/36 = 1/12.
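Example 1.2.8 can be reproduced by enumerating Ω directly; a short sketch:

```python
from fractions import Fraction

# Rolling two dice with F = P(Omega): enumerate all 36 equally likely pairs.
omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]
A = [(i, j) for (i, j) in omega if i + j == 4]   # sum on the two dice equals 4

p_A = Fraction(len(A), len(omega))               # classical counting: #A / #Omega
```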

1.3 Properties of Probability Measure

Property 1.3.1
P(φ) = 0

Proof. Take A1 = Ω and A2 = A3 = A4 = · · · = φ in Axiom (c). Then

P(Ω) = P(Ω) + P(φ) + P(φ) + · · · .

Since P(Ω) = 1 and P(φ) ≥ 0, we have

P(φ) = 0.

Property 1.3.2
For any finite pair-wise disjoint collection A1 , A2 , . . . , An ∈ F ,

P(⋃_{i=1}^{n} Ai) = Σ_{i=1}^{n} P(Ai).

Proof. Substitute An+1 = An+2 = · · · = φ in Axiom (c). The result follows.

Property 1.3.3
If A ⊂ B, then P(B\A) = P(B) − P(A). Moreover, P(A) ≤ P(B); that is, P is monotone.

Proof. Since A ⊂ B, note that

B = A ∪ (B\A)
=⇒ P(B) = P(A) + P(B\A)
=⇒ P(B\A) = P(B) − P(A).

Further, since P(B\A) ≥ 0, we have

P(A) ≤ P(B).

Property 1.3.4
For any A ∈ F , P(A) ≤ 1.

Proof. Note that

A ⊂ Ω =⇒ P(A) ≤ P(Ω) = 1 (using Property 1.3.3).

Property 1.3.5
For any A ∈ F , P(Ac) = 1 − P(A).

Proof. Note that A ∪ Ac = Ω and A ∩ Ac = φ. Therefore, from Axiom (c), we have

P(A) + P(Ac) = P(Ω) = 1 =⇒ P(Ac) = 1 − P(A).

Property 1.3.6
For any two events A, B ∈ F ,

P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

Proof. Note that A and B\(A ∩ B) are disjoint and

A ∪ B = A ∪ (B\(A ∩ B)).

Therefore, from Axiom (c), we have

P(A ∪ B) = P(A) + P(B\(A ∩ B))
=⇒ P(A ∪ B) = P(A) + P(B) − P(A ∩ B) (∵ A ∩ B ⊂ B, using Property 1.3.3).

Property 1.3.7
For any events A, B, C ∈ F ,

P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(A ∩ B) − P(B ∩ C) − P(A ∩ C) + P(A ∩ B ∩ C).

Proof. Exercise
A general form of the above result is called the inclusion-exclusion formula or general addition rule.
Theorem 1.3.1 [Inclusion-Exclusion Formula or General Addition Rule]
For n ≥ 2, let A1 , A2 , . . . , An be events. Then

P(⋃_{i=1}^{n} Ai) = Σ_{i=1}^{n} P(Ai) − Σ_{1≤i<j≤n} P(Ai ∩ Aj)
+ Σ_{1≤i<j<k≤n} P(Ai ∩ Aj ∩ Ak) − · · · + (−1)^{n+1} P(⋂_{i=1}^{n} Ai).

Proof. For n = 2, we know that

P(A1 ∪ A2) = P(A1) + P(A2) − P(A1 ∩ A2).

Assume the formula holds for n = k, that is,

P(⋃_{i=1}^{k} Ai) = Σ_{i=1}^{k} P(Ai) − Σ_{1≤i<j≤k} P(Ai ∩ Aj) + · · · + (−1)^{k+1} P(⋂_{i=1}^{k} Ai).

Now, consider

P(⋃_{i=1}^{k+1} Ai) = P((⋃_{i=1}^{k} Ai) ∪ Ak+1)
= P(⋃_{i=1}^{k} Ai) + P(Ak+1) − P((⋃_{i=1}^{k} Ai) ∩ Ak+1)
= P(⋃_{i=1}^{k} Ai) + P(Ak+1) − P(⋃_{i=1}^{k} (Ai ∩ Ak+1))
= Σ_{i=1}^{k} P(Ai) − Σ_{1≤i<j≤k} P(Ai ∩ Aj) + · · · + (−1)^{k+1} P(⋂_{i=1}^{k} Ai) + P(Ak+1)
− [Σ_{i=1}^{k} P(Ai ∩ Ak+1) − Σ_{1≤i<j≤k} P((Ai ∩ Ak+1) ∩ (Aj ∩ Ak+1)) + · · · + (−1)^{k+1} P(⋂_{i=1}^{k+1} Ai)]
= Σ_{i=1}^{k+1} P(Ai) − Σ_{1≤i<j≤k+1} P(Ai ∩ Aj) + · · · + (−1)^{k+2} P(⋂_{i=1}^{k+1} Ai).

Hence, the result is true for all positive integral values of n.


Remark 1.3.1. In general, an “OR” condition leads to a sum of probabilities (via the addition rule),
while an “AND” condition leads to a product of probabilities (via the multiplication rule).
Example 1.3.1. Six cards are drawn with replacement from an ordinary deck of cards. What is the
probability that each of the four suits will be present at least once among the six cards?
Solution. Let

A = all four suits appear at least once

then

Ac = at least one suit does not appear.

If

B1 = spades do not appear,


B2 = hearts do not appear,
B3 = diamonds do not appear,
B4 = clubs do not appear.

then

Ac = ⋃_{i=1}^{4} Bi .

Note that

P(B1) = P(none of the six cards is a spade) = (3/4) × (3/4) × (3/4) × (3/4) × (3/4) × (3/4) = (3/4)^6 .

Similarly,

P(B2) = P(B3) = P(B4) = (3/4)^6 .

Next,

P(Bi ∩ Bj) = (2/4)^6 = (1/2)^6 , 1 ≤ i < j ≤ 4,

P(Bi ∩ Bj ∩ Bk) = (1/4)^6 , 1 ≤ i < j < k ≤ 4,

P(⋂_{i=1}^{4} Bi) = 0.

Using the inclusion-exclusion formula, we have

P(Ac) = P(⋃_{i=1}^{4} Bi)
= Σ_{i=1}^{4} P(Bi) − Σ_{1≤i<j≤4} P(Bi ∩ Bj) + Σ_{1≤i<j<k≤4} P(Bi ∩ Bj ∩ Bk) − P(⋂_{i=1}^{4} Bi)
= 4 (3/4)^6 − 6 (1/2)^6 + 4 (1/4)^6 = 317/512 ≈ 0.62.

Hence,

P(A) = 1 − P(Ac) = 1 − 317/512 = 195/512 ≈ 0.38.
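The value 195/512 from the inclusion-exclusion formula can be confirmed by enumerating all 4^6 equally likely suit patterns; a brute-force check (the suit labels are ours):

```python
from fractions import Fraction
from itertools import product

# Each of the six draws (with replacement) lands in one of four equally
# likely suits; count the outcomes in which all four suits appear.
suits = "SHDC"   # spades, hearts, diamonds, clubs
favourable = sum(1 for draw in product(suits, repeat=6)
                 if set(draw) == set(suits))
total = 4 ** 6

p_all_suits = Fraction(favourable, total)
```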

Theorem 1.3.2 [Bonferroni’s Inequality]
Given A1 , A2 , . . . , An ∈ F , show that

Σ_{i=1}^{n} P(Ai) − Σ_{1≤i<j≤n} P(Ai ∩ Aj) ≤ P(⋃_{i=1}^{n} Ai) ≤ Σ_{i=1}^{n} P(Ai).

Proof. Hint: Use induction on n.

Theorem 1.3.3 [Boole’s Inequality]
Let A1 , A2 , . . . , An ∈ F . Then

(i) P(⋃_{i=1}^{n} Ai) ≤ Σ_{i=1}^{n} P(Ai).

(ii) P(⋂_{i=1}^{n} Ai) ≥ Σ_{i=1}^{n} P(Ai) − (n − 1).

Proof. (i) Let

Bi = Ai \ (⋃_{j=1}^{i−1} Aj), for i = 1, 2, . . . , n.

Then, the Bi ’s are pair-wise disjoint and

⋃_{i=1}^{n} Ai = ⋃_{i=1}^{n} Bi .

Therefore,

P(⋃_{i=1}^{n} Ai) = P(⋃_{i=1}^{n} Bi) = Σ_{i=1}^{n} P(Bi) ≤ Σ_{i=1}^{n} P(Ai) (∵ Bi ⊂ Ai ).

(ii) Note that

P(⋂_{i=1}^{n} Ai) = 1 − P((⋂_{i=1}^{n} Ai)c) = 1 − P(⋃_{i=1}^{n} Aci)
≥ 1 − Σ_{i=1}^{n} P(Aci) (using (i))
= 1 − Σ_{i=1}^{n} (1 − P(Ai)) = Σ_{i=1}^{n} P(Ai) − (n − 1).

This proves the results.


Remark 1.3.2. In addition, the following results also hold for countable collections:

(a) P(⋃_{i=1}^{∞} Ai) ≤ Σ_{i=1}^{∞} P(Ai).

(b) P(⋂_{i=1}^{∞} Ai) ≥ 1 − Σ_{i=1}^{∞} P(Aci).

1.4 Conditional Probability


So far, we have introduced the concept of probability under the condition that no information is available
about the occurrence of an event. That is, we do not know what is going to be the outcome of a
particular experiment, although we know all possible outcomes. Now the question is: if one has
partial information about the outcome, then how does this information affect the probability of an event?
The probability of an event changes if one knows some partial information. Let us illustrate this with an
example.
Example 1.4.1. Consider an experiment of rolling of a die. The sample space is Ω = {1, 2, 3, 4, 5, 6}.
Let A denote the event that the number 2 occurs. Then

P(A) = 1/6.

Next, suppose an event B, that an even number occurs, comes into the picture. Then the question is: what
is the probability of A given that B has occurred? Clearly,

P(A|B) = 1/3.

Thus, the partial information has affected the probability of the event.

Definition 1.4.12 [Conditional Probability]
Let (Ω, F , P) be a probability space and let B ∈ F with P(B) > 0. Then, for any arbitrary
event A ∈ F , the conditional probability of A given that B has already occurred is defined as

P(A | B) = P(A ∩ B)/P(B).

If P(B) = 0, then the conditional probability of A given B is undefined.

Lemma 1.4.1
P(· | B) is a valid probability function.

Proof. Note that

(a) P(A|B) = P(A ∩ B)/P(B) ≥ 0, for all A ∈ F .

(b) P(Ω|B) = P(Ω ∩ B)/P(B) = P(B)/P(B) = 1.

(c) Let {Ai}i≥1 be a pair-wise disjoint sequence of events. Then,

P(⋃_{i=1}^{∞} Ai | B) = P((⋃_{i=1}^{∞} Ai) ∩ B)/P(B) = P(⋃_{i=1}^{∞} (Ai ∩ B))/P(B)
= (1/P(B)) Σ_{i=1}^{∞} P(Ai ∩ B) = Σ_{i=1}^{∞} P(Ai |B).

Thus, P(·|B) is a valid probability function.


Example 1.4.2. Consider all families with two children and assume that boys and girls are equally
likely.

(a) If a family is chosen at random and is found to have a boy, what is the probability that the other
child is also a boy?

(b) If a child is chosen at random from these families and is found to be a boy, what is the probability
that the other child in that family is also a boy?

Solution.

(a) Note that Ω = {(b, b), (b, g), (g, b), (g, g)}. Let A be the event that the family has a boy. Then

P(A) = 3/4.

Let B be the event that both children are boys. Then

P(B|A) = P(A ∩ B)/P(A) = (1/4)/(3/4) = 1/3.

(b) Note that Ω = {(b, b), (b, g), (g, b), (g, g)}. Let A be the event that the chosen child is a boy. Then

P(A) = 1/2.

Let B be the event that the chosen child has a brother. Then

P(B | A) = P(A ∩ B)/P(A) = (1/4)/(1/2) = 1/2.

Notice the difference between (a) and (b). This is due to the difference in selection policy.
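The two selection policies can be simulated to watch the answers 1/3 and 1/2 emerge; a Monte Carlo sketch (the sample size and seed are arbitrary choices of ours):

```python
import random

rng = random.Random(0)   # fixed seed so the run is reproducible
families = [(rng.choice("bg"), rng.choice("bg")) for _ in range(200_000)]

# (a) Condition on "the family has at least one boy"; ask if both are boys.
with_boy = [f for f in families if "b" in f]
p_a = sum(f == ("b", "b") for f in with_boy) / len(with_boy)

# (b) Pick one child uniformly; condition on "the picked child is a boy";
#     ask if the other child in that family is a boy.
picks = [(f, rng.randrange(2)) for f in families]
boy_picked = [(f, k) for (f, k) in picks if f[k] == "b"]
p_b = sum(f[1 - k] == "b" for (f, k) in boy_picked) / len(boy_picked)
```

The only difference between the two estimates is what gets conditioned on, which is exactly the difference in selection policy noted above.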
Theorem 1.4.4 [Multiplication Rule]
Let P(A) > 0 and P(B) > 0. Then

(a) P(A ∩ B) = P(A|B)P(B).

(b) P(A ∩ B) = P(B|A)P(A).

Proof. The proof can be seen as follows:

P(A|B) = P(A ∩ B)/P(B) =⇒ P(A ∩ B) = P(A|B)P(B),
P(B|A) = P(A ∩ B)/P(A) =⇒ P(A ∩ B) = P(B|A)P(A).

Theorem 1.4.5 [General Multiplication Rule]
Let A1 , A2 , . . . , An ∈ F with P(⋂_{i=1}^{n} Ai) > 0. Then

P(⋂_{i=1}^{n} Ai) = P(A1) P(A2 |A1) P(A3 |A1 ∩ A2) · · · P(An | ⋂_{i=1}^{n−1} Ai).

Proof. We prove the result using induction on n. For n = 1, the statement clearly holds. Let the result
hold for n = k, that is,

P(⋂_{i=1}^{k} Ai) = P(A1) P(A2 |A1) P(A3 |A1 ∩ A2) · · · P(Ak | ⋂_{i=1}^{k−1} Ai).

Now, consider

P(⋂_{i=1}^{k+1} Ai) = P((A1 ∩ A2) ∩ (⋂_{i=3}^{k+1} Ai))
= P(A1 ∩ A2) P(A3 | A1 ∩ A2) P(A4 | A1 ∩ A2 ∩ A3) · · · P(Ak+1 | ⋂_{i=1}^{k} Ai)
= P(A1) P(A2 | A1) P(A3 | A1 ∩ A2) · · · P(Ak+1 | ⋂_{i=1}^{k} Ai).

This proves the result.

Theorem 1.4.6 [Theorem of Total Probability or Total Probability Rule]
Let B1 , B2 , . . . be pair-wise disjoint events with P(Bj) > 0 for each j, and let B = ⋃_{i=1}^{∞} Bi .
Then, for any event A,

P(A ∩ B) = Σ_{j=1}^{∞} P(A|Bj) P(Bj).

Moreover, if B = Ω, then

P(A) = Σ_{j=1}^{∞} P(A|Bj) P(Bj).

Proof. Note that

A ∩ B = A ∩ (⋃_{i=1}^{∞} Bi) = ⋃_{i=1}^{∞} (A ∩ Bi).

This implies

P(A ∩ B) = P(⋃_{i=1}^{∞} (A ∩ Bi)) = Σ_{i=1}^{∞} P(A ∩ Bi) = Σ_{i=1}^{∞} P(A|Bi) P(Bi).

Theorem 1.4.7 [Bayes’ Theorem]
Let B1 , B2 , . . . be pair-wise disjoint events with Ω = ⋃_{i=1}^{∞} Bi , and suppose we are given the
prior probabilities P(Bi) > 0, i = 1, 2, . . .. Then, for any event A with P(A) > 0,

P(Bi | A) = P(A|Bi) P(Bi) / Σ_{j=1}^{∞} P(A|Bj) P(Bj).

Proof. Using the Theorem of Total Probability, we get

P(Bi |A) = P(A ∩ Bi)/P(A) = P(A|Bi) P(Bi) / Σ_{j=1}^{∞} P(A|Bj) P(Bj).

This proves the result.

Example 1.4.3. Suppose a calculator manufacturer purchases ICs from suppliers B1 , B2 and B3 , with
40% from B1 , 30% from B2 and 30% from B3 . Suppose 1% of the supply from B1 is defective, 5% from B2
and 10% from B3 .
(a) What is the probability that a randomly selected IC from the manufacturer’s stock is defective?
(b) Suppose a randomly selected IC is found to be defective. What is the probability that it was
supplied by B1 (B2 or B3 )?
Solution.
(a) Given P(B1) = 0.4, P(B2) = 0.3 and P(B3) = 0.3. Let A be the event that the IC is defective.
Then

P(A | B1) = 0.01, P(A | B2) = 0.05 and P(A | B3) = 0.1.

Therefore, by the theorem of total probability, we get

P(A) = Σ_{j=1}^{3} P(A|Bj) P(Bj)
= P(A|B1) P(B1) + P(A|B2) P(B2) + P(A|B3) P(B3)
= 0.01 × 0.4 + 0.05 × 0.3 + 0.1 × 0.3
= 0.049.

(b) Using Bayes’ Theorem, we get

P(B1 |A) = P(A|B1) P(B1)/P(A) = (0.01 × 0.4)/0.049 = 4/49.

Similarly,

P(B2 |A) = 15/49 and P(B3 |A) = 30/49.

1.5 Independence of Events


If the event B does not affect the probability of the event A, then we have

P(A|B) = P(A) =⇒ P(A ∩ B) = P(A)P(B).

Such events are called independent events.

Definition 1.5.13 [Independence of Two Events]
Two events A and B are called statistically independent (or simply independent) if

P(A ∩ B) = P(A)P(B).

Definition 1.5.14 [Independence of Three Events]
Three events A, B and C are said to be statistically independent if

P(A ∩ B) = P(A)P(B),
P(A ∩ C) = P(A)P(C),
P(B ∩ C) = P(B)P(C),
P(A ∩ B ∩ C) = P(A)P(B)P(C).

Definition 1.5.15 [Independence of n Events]
The events A1 , A2 , . . . , An are said to be statistically independent if

P(Ai ∩ Aj) = P(Ai) P(Aj), for all i < j,
P(Ai ∩ Aj ∩ Ak) = P(Ai) P(Aj) P(Ak), for all i < j < k,
...
P(⋂_{i=1}^{n} Ai) = Π_{i=1}^{n} P(Ai).

Remark 1.5.1. The total number of conditions to determine the independence of n events is

(n choose 2) + (n choose 3) + · · · + (n choose n) = 2^n − n − 1.

Definition 1.5.16 [Independent Events]
Let (Ω, F , P) be a given probability space. A collection of events U ⊂ F is said to be mutually
independent if and only if, for every finite sub-collection {Ai1 , Ai2 , . . . , Ain } of U, the following
condition holds:

P(⋂_{k=1}^{n} Aik) = Π_{k=1}^{n} P(Aik).

Definition 1.5.17 [Pairwise Independent Events]
The events A1 , A2 , . . . , An are pairwise independent if

P (Ai ∩ Aj ) = P (Ai ) P (Aj ) , for all i < j.

Moreover, a collection of events U ⊂ F is said to be pairwise independent if

P (Ai ∩ Aj ) = P (Ai ) P (Aj ) , for all Ai , Aj ∈ U, i 6= j.

If U contains only two events, then the concepts of mutual independence and pairwise independence
coincide. However, if U contains more than two events, then they differ: mutual independence implies
pairwise independence, but not conversely.
Example 1.5.1. In tossing two coins, we have Ω = {HH, HT, T H, T T }. Define the events

A = H on the first coin = {HH, HT }
B = H on the second coin = {HH, T H}
C = same face on both coins = {HH, T T }.

Therefore,

P(A) = P(B) = P(C) = 1/2,
P(A ∩ B) = P(B ∩ C) = P(A ∩ C) = 1/4,
P(A ∩ B ∩ C) = 1/4.

Note that

P(A ∩ B ∩ C) = 1/4 ≠ 1/8 = P(A)P(B)P(C).

Hence, A, B and C are pairwise independent but not statistically independent.
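The three-event example can be verified by direct enumeration over the sample space. A minimal sketch in Python (the event definitions mirror the example; the code itself is an illustration, not part of the notes):

```python
from fractions import Fraction
from itertools import product

# sample space for two coin tosses, all four outcomes equally likely
omega = [a + b for a, b in product("HT", repeat=2)]  # ['HH', 'HT', 'TH', 'TT']

def P(E):
    return Fraction(len(E), len(omega))

A = {w for w in omega if w[0] == "H"}  # H on the first coin
B = {w for w in omega if w[1] == "H"}  # H on the second coin
C = {"HH", "TT"}                       # same face on both coins

# pairwise independence holds ...
assert P(A & B) == P(A) * P(B)
assert P(B & C) == P(B) * P(C)
assert P(A & C) == P(A) * P(C)
# ... but the triple condition fails: 1/4 != 1/8
assert P(A & B & C) == Fraction(1, 4)
assert P(A) * P(B) * P(C) == Fraction(1, 8)
```

Using exact `Fraction` arithmetic avoids any floating-point doubt about equality of probabilities.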

1.6 Exercises
1. Let Ω = {1, 2, 3, . . .} and

F = {A : A is countable or Ac is countable} .

Show that F is a sigma-algebra.

2. Four married couples are to be seated in a row. What is the probability that no husband is seated
next to his wife?

3. There are two kinds of tubes in an electronic gadget. It will cease to function if one of each kind is
defective. The probability that there is a defective tube of the first kind is 0.1; the probability that
there is a defective tube of the second kind is 0.2. It is known that two tubes are defective. What
is the probability that the gadget still works?

4. An electric network looks as in the below figure.

[Figure: an electric network; the links have failure probabilities 1/5, 1/5 (upper branch), 1/3 (middle link), and 1/4, 1/4 (lower branch).]

The numbers indicate the probabilities of failure for the various links, which are all independent.
What is the probability that the circuit is closed?

5. If two dice are thrown, what is the probability that the sum is (a) greater than 8 (b) neither 7 nor
11?

6. A card is drawn from a well-shuffled pack of playing cards. What is the probability that it is
either a spade or an ace?

7. A problem in mathematics is given to three students A, B and C whose chances of solving it
are 1/2, 3/4, and 1/4, respectively. What is the probability that the problem will be solved if all
of them try independently?

8. A consignment of 15 record players contains 4 defectives. The record players are selected at
random, one by one, and examined. Those examined are not put back. What is the probability
that the 9th one examined is the last defective?



Chapter 2

Random Variable and its Distribution

In Chapter 1, we discussed probability spaces (Ω, F , P) as tools to measure uncertainty in random


experiments. Often, in reality, the outcomes of an experiment, i.e. the elements of Ω, are not represented
in terms of numerical values. For instance, in tossing a coin, the outcomes are head and tail. In fact, in
certain cases, interpreting outcomes in a form feasible for the study becomes challenging. To illustrate
this, let us consider a company looking to hire human resources from an educational institute. Of
course, they prefer to hire “good” students from the institute. So, they first collect all outgoing students
interested in joining the company and then look for the subset consisting of the so-called good students.
However, the question is how to decide whether a given student is good or not. It is difficult for the
company management to decide this from a qualitative point of view. Rather, it becomes easy for them
if the institute assigns to each student a number that cumulatively, in some sense, represents the level
of knowledge that the student acquired in the courses. Thus, it is meaningful to look for a procedure,
let us denote it by X, which assigns a unique real number to each student based on their performance.
From the mathematical perspective, it is equivalent to say that we look for a function X : Ω → R. It
is also appropriate to look for a real-valued function that facilitates the measurability (in some sense)
of all events in the probability model. Such a real-valued function is called a random variable (also
called a stochastic variable). Once we set a relevant random variable for a sample space, we can define
probability through a concept called probability distribution.
2.1 Borel σ-algebra and Measurable Function
In Chapter 1, we have seen the definition of σ-algebra and its consequences to the axiomatic definition
of probability. Now, we move to define σ-algebra generated by a class of subsets of Ω which play a
crucial role to define random variable in the next section.
Definition 2.1.1 [Sigma-algebra generated by a family of subsets]
Let C be a family of subsets of Ω. Then the σ-algebra generated by C, denoted by σ(C), is the
intersection of all σ-algebras containing C. That is, if

I = {F : F ⊆ P(Ω), F is a σ-algebra, and C ⊆ F }

then

σ(C) = ∩_{F ∈ I} F.

It is the smallest sigma algebra which contains all of the sets in C.

Example 2.1.1. Consider Ω = [0, 1] and C = {[0, 0.3], [0.5, 1]} = {A1 , A2 }, say. Then

σ(C) = {φ, A1 , A2 , A3 , A1 ∪ A2 , A1 ∪ A3 , A2 ∪ A3 , Ω} ,

where we define A3 = (0.3, 0.5).
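For a finite collection of atoms, the generated σ-algebra can be computed by brute force: start from C (plus φ and Ω) and repeatedly close the family under complements and pairwise unions until nothing new appears; on a finite Ω this closure is automatically a σ-algebra. A small sketch (the three-label Ω below just mirrors the atoms A1, A2, A3 of Example 2.1.1 and is our illustration, not part of the notes):

```python
from itertools import combinations

def generated_sigma_algebra(omega, C):
    """Close C (plus the empty set and omega) under complements and unions."""
    omega = frozenset(omega)
    fam = {frozenset(), omega} | {frozenset(A) for A in C}
    changed = True
    while changed:
        changed = False
        for A in list(fam):
            if omega - A not in fam:        # add the complement
                fam.add(omega - A)
                changed = True
        for A, B in combinations(list(fam), 2):
            if A | B not in fam:            # add the pairwise union
                fam.add(A | B)
                changed = True
    return fam

# mirror Example 2.1.1: treat the atoms A1, A2, A3 as labels
omega = {"A1", "A2", "A3"}
sigma = generated_sigma_algebra(omega, [{"A1"}, {"A2"}])
print(len(sigma))  # 8 sets, matching the eight listed in the example
```

The eight sets produced correspond exactly to φ, A1, A2, A3, A1 ∪ A2, A1 ∪ A3, A2 ∪ A3, Ω.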


Example 2.1.2. Let Ω be a non-empty set and

C = {{x} : x ∈ Ω}

Then the σ-algebra generated by C is

σ(C) = {E ⊆ Ω : either E or E c is countable } .

Definition 2.1.2 [Borel σ-algebra]


Let C = {(a, b) : a < b}. Then σ(C) = BR is the Borel σ-algebra generated by the family of all
open intervals C. The elements of BR are called Borel sets.

Remark 2.1.1. In general, it is not always possible to find explicit form of generated σ-algebra.
Theorem 2.1.1 [Equivalent Way to Define Borel σ-algebra]
The Borel σ-algebra on R is σ(C), the sigma algebra generated by each of the classes of sets C
described below:

1. C1 = {(a, b); a ≤ b}.

2. C2 = {(a, b]; a ≤ b}.

3. C3 = {[a, b); a ≤ b}.

4. C4 = {[a, b]; a ≤ b}.

Definition 2.1.3 [Measurable Function]


Let (X, F1 ) and (Y, F2 ) be two measurable spaces. A function f : X → Y is said to be measurable if for every E ∈ F2 ,

f −1 (E) ∈ F1 .

Theorem 2.1.2 [Equivalent Way to Define Measurable Function on R]


If (Ω, F ) is a measurable space, then f : Ω → R is a measurable function if and only if one of
the following conditions holds:

{ω ∈ Ω : f (ω) < λ} ∈ F , for all λ ∈ R.


{ω ∈ Ω : f (ω) ≤ λ} ∈ F , for all λ ∈ R.
{ω ∈ Ω : f (ω) > λ} ∈ F , for all λ ∈ R.
{ω ∈ Ω : f (ω) ≥ λ} ∈ F , for all λ ∈ R.

2.2 Random Variable
In general, a random Variable is a numerical version of sample space. It assign a real value to each and
every outcome of sample space, that is, mathematically, it is a function X : Ω → R.

[Figure: a function X : Ω → R mapping events E1 , E2 , E3 of Ω to points on the real line.]

Example 2.2.1. Let us consider an experiment of tossing of 3 coins. Then, we have

Ω = {HHH, HHT, HT H, T HH, T T H, T HT, HT T, T T T }.

Let X denote the number of tails. Then, X : Ω → R can take the values 0, 1, 2, 3.

[Figure: X maps each outcome of Ω = {HHH, . . . , T T T } to its number of tails on the real line.]

Question: Is every function X : Ω → R a random variable?

The answer is no. A random variable is nothing but a measurable function.

Definition 2.2.4 [Random Variable]


Let (Ω, F , P) be a probability space. A function X : Ω → R is called a random variable if it is
a measurable function. That is,

X −1 (B) ∈ F, for all B ∈ BR ,

where BR is a Borel σ-algebra.

Basically, the above definition needs several concepts from a higher level of mathematics, mainly from
measure theory, which is not studied at the B.Tech level. So, we define the random variable in a
simpler form in the following definition.

Definition 2.2.5 [Random Variable]


Let (Ω, F , P) be a probability space. A function X : Ω → R is called a random variable if

X −1 ((−∞, λ]) = {ω : X(ω) ≤ λ} ∈ F, for all λ ∈ R.



The above definition can also be written in many forms by using the following theorem.
Theorem 2.2.3 [Equivalent Way to Define Random Variable]
A function X defined on (Ω, F ) is a random variable if and only if any one of the following
conditions is satisfied.

1. {ω : X(ω) < λ} ∈ F , for all λ ∈ R.

2. {ω : X(ω) ≥ λ} ∈ F , for all λ ∈ R.

3. {ω : X(ω) > λ} ∈ F , for all λ ∈ R.

Corollary 2.2.1
Let X be a random variable defined on a probability space (Ω, F ). Then,

1. {ω : X(ω) = λ} ∈ F , for all λ ∈ R.

2. {ω : λ1 < X(ω) ≤ λ2 } ∈ F , for all λ1 , λ2 ∈ R.

Question : How to compute the probability for a random variable?
Note that if B ∈ BR then, from the definition of the random variable, say X, we have

X −1 (B) ∈ F .

Therefore, X −1 (B) is a measurable set and hence, we can use the axiomatic definition of proba-
bility to define the probability for a random variable X.


Theorem 2.2.4
A random variable X defined on a probability space (Ω, F , P) induces a probability space
(R, BR , Q) by the correspondence

Q(B) = P(X −1 (B)) = P(ω : X(ω) ∈ B), for all B ∈ BR .

Proof. Note that

1. Q(B) = P(ω : X(ω) ∈ B) ≥ 0, for all B ∈ BR .

2. Q(R) = P(ω : X(ω) ∈ R) = P(Ω) = 1.

3. Let B1 , B2 , . . . be a sequence of pairwise disjoint events. Then,

Q(∪_{i=1}^∞ Bi) = P(X −1 (∪_{i=1}^∞ Bi))
               = P(∪_{i=1}^∞ X −1 (Bi))
               = Σ_{i=1}^∞ P(X −1 (Bi))
               = Σ_{i=1}^∞ Q(Bi).

This proves that Q is a valid probability measure.
Example 2.2.2. Consider an experiment of tossing of a coin. Then, Ω = {H, T }. Let

F = {φ, {H}, {T }, Ω}.

Let X : Ω → R be the number of heads. Note that if λ = 0.5 then {ω : X(ω) ≤ 0.5} = {T }.


F
φ Ω
{T} {T} {H}

λ
| | ||| | R
-2 -1 0 1 2

In general,

                    φ,    λ < 0,
{ω : X(ω) ≤ λ} =    {T }, 0 ≤ λ < 1,
                    Ω,    λ ≥ 1,

which belongs to F in each case.

Hence, X is a random variable. Note that


P(X = 0) = P(ω : X(ω) ∈ {0}) = P({T }) = 1/2,
P(X = 1) = P(ω : X(ω) ∈ {1}) = P({H}) = 1/2.
Suppose B = {0, 1}. Then

P(X ∈ B) = P(X ∈ {0, 1}) = P(ω : X(ω) ∈ {0, 1}) = P(Ω) = 1.

Example 2.2.3. Consider the experiment of rolling of two dice, we have

Ω = {(1, 1), (1, 2), . . . , (6, 6)} = {(i, j) : 1 ≤ i, j ≤ 6}.

Let X denote the sum of the upward faces and F = P(Ω). Then, X can take the values 2, 3, . . . , 12 and

                    φ,                          λ < 2,
                    {(1, 1)},                   2 ≤ λ < 3,
{ω : X(ω) ≤ λ} =    {(1, 1), (1, 2), (2, 1)},   3 ≤ λ < 4,
                    ...
                    Ω,                          λ ≥ 12,

which belongs to F in each case.

Hence, X is a random variable. Note that

P(X = 2) = P(ω : X(ω) ∈ {2}) = P({(1, 1)}) = 1/36,
P(X ∈ {2, 3}) = P(ω : X(ω) ∈ {2, 3}) = P({(1, 1), (1, 2), (2, 1)}) = 3/36 = 1/12.
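These probabilities can be checked by enumerating all 36 outcomes of Ω. A small sketch in Python using exact arithmetic (the code is our illustration, not part of the notes):

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# pmf of X = sum of the two upward faces, by enumerating all of Omega
counts = Counter(i + j for i, j in product(range(1, 7), repeat=2))
pmf = {x: Fraction(c, 36) for x, c in counts.items()}

assert pmf[2] == Fraction(1, 36)
assert pmf[7] == Fraction(6, 36)
assert sum(pmf.values()) == 1
assert pmf[2] + pmf[3] == Fraction(1, 12)   # P(X in {2, 3})
```

The same enumeration also yields the full table of probabilities used later for the cdf of X.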

2.3 Distribution Function


Our interest in this section is to introduce the notion of distribution functions associated with a random
variable.

Definition 2.3.6 [Cumulative Distribution Function (CDF)]


Let (Ω, F , P) be a probability space and let X be a random variable with respect to F . Then the
function defined by

FX (x) = P({X ≤ x}), for all x ∈ R

is called a distribution function of the random variable X (also called the cumulative distribution
function).

Our next task is to prove the properties of distribution function.

Property 2.3.1
lim_{x→−∞} FX (x) = 0 and lim_{x→∞} FX (x) = 1.

Proof. Let {xn } be a decreasing sequence such that lim_{n→∞} xn = −∞ and

An = {ω : X(ω) ≤ xn } .

Then {An } is a decreasing sequence of events with

lim_{n→∞} An = φ.

Therefore, by the continuity of the probability measure,

lim_{n→∞} P(An ) = P(lim_{n→∞} An )
=⇒ lim_{n→∞} FX (xn ) = P(φ) = 0
=⇒ lim_{x→−∞} FX (x) = 0.

Next, let {xn } be an increasing sequence such that lim_{n→∞} xn = ∞ and

An = {ω : X(ω) > xn } .

Then {An } is a decreasing sequence of events with

lim_{n→∞} An = φ.

Therefore,

lim_{n→∞} P(An ) = P(lim_{n→∞} An )
=⇒ lim_{n→∞} (1 − FX (xn )) = P(φ) = 0
=⇒ lim_{x→∞} FX (x) = 1.

Property 2.3.2
If x1 < x2 then FX (x1 ) ≤ FX (x2 ) [FX is non-decreasing].

Proof. If x1 < x2 , then

{ω : X(ω) ≤ x1 } ⊂ {ω : X(ω) ≤ x2 }
=⇒ P ({ω : X(ω) ≤ x1 }) ≤ P ({ω : X(ω) ≤ x2 })
=⇒ FX (x1 ) ≤ FX (x2 ) .

Property 2.3.3
lim_{h→0+} FX (x + h) = FX (x) [FX is right continuous].

Proof. Let {xn } be a decreasing sequence such that lim_{n→∞} xn = x and

An = {ω : x < X(ω) ≤ xn } .

Then each An ∈ F , {An } is monotonically decreasing and lim_{n→∞} An = ∩_{n=1}^∞ An = φ. Therefore,

lim_{n→∞} P(An ) = P(lim_{n→∞} An )
=⇒ lim_{n→∞} [FX (xn ) − FX (x)] = 0
=⇒ lim_{n→∞} FX (xn ) = FX (x)
=⇒ lim_{h→0+} FX (x + h) = FX (x).

Property 2.3.4
For any a, b ∈ R with a < b, we have

1. P(a < X ≤ b) = FX (b) − FX (a).

2. P(a < X < b) = FX (b−) − FX (a).

3. P(a ≤ X < b) = FX (b−) − FX (a−).

4. P(a ≤ X ≤ b) = FX (b) − FX (a−).

5. P(X > b) = 1 − FX (b).

Remark 2.3.1. (i) If P(X = a) = P(X = b) = 0 then

P(a ≤ X ≤ b) = P(a < X < b) = P(a ≤ X < b) = P(a < X ≤ b) = FX (b) − FX (a).

(ii) Note that 0 ≤ FX (x) ≤ 1.

(iii) Any function F : R → R satisfying Properties 2.3.1–2.3.3 is the CDF of some random variable.
For a given probability space and a random variable X associated with it, we now know how to define
the distribution function of X. The converse also holds. We state the theorem and omit the proof for
this course.
Theorem 2.3.5
For any given distribution function F , there exists a unique probability space and a random vari-
able X defined on the space such that F is a distribution function of X.

Example 2.3.1. Consider the experiment of rolling of two dice, we have Ω = {(i, j) : 1 ≤ i, j ≤ 6}.
Let X denote the sum of the upward faces. Then

P(X = 2) = 1/36,   P(X = 3) = 2/36,   P(X = 4) = 3/36,
P(X = 5) = 4/36,   P(X = 6) = 5/36,   P(X = 7) = 6/36,
P(X = 8) = 5/36,   P(X = 9) = 4/36,   P(X = 10) = 3/36,
P(X = 11) = 2/36   and   P(X = 12) = 1/36.
Therefore,

                       0,      x < 2,
                       1/36,   2 ≤ x < 3,
FX (x) = P(X ≤ x) =    3/36,   3 ≤ x < 4,
                       6/36,   4 ≤ x < 5,
                       ...
                       1,      x ≥ 12.

[Figure: step plot of the cdf FX (x) with jumps at x = 2, 3, . . . , 12.]

Example 2.3.2. Let the distribution function of a random variable X be given by

           0,  x < 0,
FX (x) =   x,  0 ≤ x < 1/2,
           1,  x ≥ 1/2.

Note that FX (x) is a valid distribution function as it satisfies all the necessary properties of a CDF.
[Figure: plot of FX (x), rising linearly from 0 to 1/2 on [0, 1/2) and jumping to 1 at x = 1/2.]

Example 2.3.3. Let the distribution function of a random variable X be given by

           0,  x < 0,
FX (x) =   x,  0 ≤ x ≤ 1/2,
           1,  x > 1/2.

Note that FX (·) is not a valid CDF as it is not right continuous at x = 1/2.

2.4 Discrete Random Variable


In this section, we discuss those random variables that take at most a countable number of values.

Definition 2.4.7 [Discrete Random Variable]


Let (Ω, F , P) be a probability space and let X be a random variable with respect to F that takes
only a finite or countably infinite number of values x1 , x2 , . . .. Then X is called a discrete random
variable.

Remark 2.4.1. From the above definition, a random variable X is said to be discrete if there exists
a countable set C = {x1 , x2 , x3 , . . .} such that P(X ∈ C) = 1. That is, the range of X is at most
countable.
Example 2.4.1. (a) Number of customers arriving at a store.

(b) Number of accidents happening in a city.

(c) Number of children in a family.

(d) Number of defective light bulbs in a box.

Definition 2.4.8 [Probability Mass Function (PMF)]


Let (Ω, F , P) be a probability space and X be a discrete random variable with respect to F and
the range C = {x1 , x2 , . . .} ⊂ R. The real-valued function pX : R → [0, 1] defined by

pX (xi ) = P(X = xi ), for i = 1, 2, . . .

is called the probability mass function or the probability distribution of X.

The probability mass function associated with a discrete random variable satisfies the following conditions.

(a) pX (xi ) ≥ 0, for all i = 1, 2, . . ..

(b) Σ_{i=1}^∞ pX (xi ) = 1.
i=1

Remark 2.4.2. In practice, the above properties can be used to roughly check whether an answer
obtained by a student is correct or not.
If the pmf is known to us then the distribution function of a discrete random variable X can be obtained
in terms of the pmf and is given by

FX (x) = P(X ≤ x) = Σ_{y ∈ C, y ≤ x} pX (y), x ∈ R,

where X takes values in the countable set C.


On the other hand, if the cdf of a discrete random variable is known to us then the pmf can also be obtained
using the cdf (assuming x1 < x2 < · · · ) and is given by

pX (xi ) = P(X = xi )
= P(X ≤ xi ) − P(X < xi )
= P(X ≤ xi ) − P(X ≤ xi−1 )
= FX (xi ) − FX (xi−1 ).

Remark 2.4.3. (i) For a discrete random variable, the cdf is a step function having a jump of size pX (xi ) at xi .

[Figure: step plot of FX (x) with jumps of size pX (xi ) at x1 , x2 , x3 , . . ..]

(ii) Note that if xi = i, that is, C = {1, 2, . . .} then we have

(a) pX (x) ≥ 0, for all x.

(b) Σ_{x=1}^∞ pX (x) = 1.

(c) FX (x) = P(X ≤ x) = Σ_{y=1}^{⌊x⌋} pX (y), for all x ∈ R, where ⌊x⌋ denotes the greatest integer
function of x.

(d) pX (x) = FX (x) − FX (x − 1), for all x = 1, 2, . . ..
Example 2.4.2. A computer store contains 10 computers of which 3 are defective. A customer buys two
at random. Let X denote the number of defectives in the purchase. Then, X can take the values 0, 1, 2
and

P(X = 0) = pX (0) = C(7, 2)/C(10, 2) = 21/45,
P(X = 1) = pX (1) = C(3, 1) C(7, 1)/C(10, 2) = 21/45,
P(X = 2) = pX (2) = C(3, 2)/C(10, 2) = 3/45.

[Figure: bar plot of the pmf pX (x) at x = 0, 1, 2.]

Note that pX (x) ≥ 0, for all x = 0, 1, 2 and

Σ_{x=0}^{2} pX (x) = pX (0) + pX (1) + pX (2) = 21/45 + 21/45 + 3/45 = 45/45 = 1.

So, pX (·) is a valid pmf. The cdf of X is given by

                       0,       x < 0,
                       21/45,   0 ≤ x < 1,
FX (x) = P(X ≤ x) =    42/45,   1 ≤ x < 2,
                       1,       x ≥ 2.

Remark 2.4.4. In the above example, observe that if the cdf is known then the pmf can be calculated
as

pX (0) = FX (0) − FX (−1) = 21/45 − 0 = 21/45,
pX (1) = FX (1) − FX (0) = 42/45 − 21/45 = 21/45,
pX (2) = FX (2) − FX (1) = 1 − 42/45 = 3/45.

Moreover,

P(X < 1) = P(X = 0) = 21/45,
P(1 < X < 2) = 0,
P(1 ≤ X < 2) = P(X = 1) = 21/45,
P(X ≥ 1) = 1 − P(X < 1) = 1 − 21/45 = 24/45,
P(1 ≤ X ≤ 2) = P(X = 1) + P(X = 2) = 21/45 + 3/45 = 24/45.
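The pmf in Example 2.4.2 is hypergeometric, so its three values can be verified with binomial coefficients. A small sketch (the parameter names N, D, n are ours, chosen for illustration):

```python
from fractions import Fraction
from math import comb

def pmf(x, N=10, D=3, n=2):
    """P(X = x): x defectives when drawing n of N computers, D of which are defective."""
    return Fraction(comb(D, x) * comb(N - D, n - x), comb(N, n))

assert pmf(0) == Fraction(21, 45)
assert pmf(1) == Fraction(21, 45)
assert pmf(2) == Fraction(3, 45)
assert sum(pmf(x) for x in range(3)) == 1   # valid pmf
```

The final assertion is exactly the "roughly check your answer" test of Remark 2.4.2, done exactly.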

2.5 Continuous Random Variable


There are many situations when we have to work with random variables that are not discrete. Rather we
may have to work with random variables that can take all possible values between certain limits.
Such random variables cannot be discrete, and we call them continuous random variables. We give the
precise definition of continuous random variables and introduce their probability distributions, called
probability density functions (pdf).
Definition 2.5.9 [Continuous Random Variable]
Let X be a random variable defined on (Ω, F , P) with distribution function FX . Then X is
said to be of continuous type if FX is absolutely continuous, that is, if there exists a non-negative
function fX (·) such that for every real number x we have

FX (x) = ∫_{−∞}^{x} fX (t) dt.

The function fX (·) is called the probability density function (pdf) of the random variable X.

Example 2.5.1. (a) Life of a bulb.

(b) Weights of people.

(c) Volume of water in a bottle.


The pdf of a random variable X satisfies the following:

(a) fX (x) ≥ 0, for all x ∈ R.

(b) ∫_{−∞}^{∞} fX (t) dt = 1.

Remark 2.5.1. (i) For a continuous random variable, P(X = x) = 0, for all x ∈ R.

(ii) Using the pdf, we can calculate probabilities for a continuous random variable:

P(a < X ≤ b) = ∫_{a}^{b} fX (t) dt = FX (b) − FX (a).

(iii) Observe that, for a continuous random variable, we have

P(a < X ≤ b) = P(a ≤ X < b) = P(a ≤ X ≤ b) = P(a < X < b).

(iv) Note the difference between the pmf and the pdf. The value of a pdf is not a probability and may
exceed one; for example,

fX (x) = 2, 0 ≤ x ≤ 1/2,   and   fX (x) = 0, otherwise.

(v) If we know the cdf then we can compute the pdf (wherever FX is differentiable) as

fX (x) = (d/dx) FX (x).
Example 2.5.2. The diameter of an electric cable, say X, is assumed to be a continuous random variable
with pdf

fX (x) = 6x(1 − x), 0 ≤ x ≤ 1,   and   fX (x) = 0, otherwise.

(a) Check that fX (·) is a valid pdf.

(b) Compute P(X ∈ (1/2, 3/4)).

(c) Find the cdf of X.

Solution.

(a) Obviously, fX (x) ≥ 0 for all x, and

∫_{−∞}^{∞} fX (x) dx = ∫_{0}^{1} fX (x) dx
                     = ∫_{0}^{1} 6x(1 − x) dx
                     = 6 ∫_{0}^{1} (x − x²) dx
                     = 6 [x²/2 − x³/3]_{0}^{1}
                     = 6 (1/2 − 1/3) = 1.

Therefore, fX (x) is a valid pdf.

(b) Consider

P(X ∈ (1/2, 3/4)) = P(1/2 < X < 3/4)
                  = ∫_{1/2}^{3/4} fX (x) dx
                  = 6 ∫_{1/2}^{3/4} (x − x²) dx = 11/32.

(c) Note that, for 0 ≤ x ≤ 1,

FX (x) = P(X ≤ x) = ∫_{−∞}^{x} fX (t) dt = ∫_{0}^{x} fX (t) dt
       = ∫_{0}^{x} 6t(1 − t) dt = 6 [t²/2 − t³/3]_{0}^{x}
       = 6 (x²/2 − x³/3) = x²(3 − 2x).

Note that

fX (x) = (d/dx) FX (x) = (d/dx)(3x² − 2x³) = 6x − 6x² = 6x(1 − x), for 0 ≤ x ≤ 1.
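Both computations in Example 2.5.2 can be checked numerically. A rough sketch using a midpoint-rule integrator (the step count and helper names are arbitrary choices of ours):

```python
# pdf of Example 2.5.2
def f(x):
    return 6 * x * (1 - x) if 0 <= x <= 1 else 0.0

def integrate(g, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / n
    return sum(g(a + (k + 0.5) * h) for k in range(n)) * h

total = integrate(f, 0, 1)       # should be close to 1 (valid pdf)
p = integrate(f, 0.5, 0.75)      # should be close to 11/32 = 0.34375
print(total, p)
```

For a smooth integrand like this, the midpoint rule with 100,000 steps agrees with the exact answers to well within 1e-6.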

2.6 Mixed Random Variable


There are random variables that are neither discrete nor continuous, but are a mixture of both. In
particular, a mixed random variable has a discrete part and a continuous part.
Definition 2.6.10 [Mixed Random Variable]
A random variable is said to be a mixed random variable if it is neither discrete nor continuous, but
a mixture of both.

Let us illustrate an application of mixed random variable.

Example 2.6.1. A person travels to his office every day by car. There is a traffic signal on the
way to the office. Let X denote the waiting time at the traffic signal. Assume there is a green signal in
25% of the cases, that is,

P(X = 0) = 1/4.

If the signal is red then X becomes a continuous random variable. In this case, assume the pdf of X is

fX (x) = 3/4, 0 < x ≤ 1,   and   fX (x) = 0, otherwise.

[Figure: plot of fX (x), constant at 3/4 on (0, 1].]

Note that the random variable X is neither discrete nor continuous; it is a mixture of both. Therefore,
the random variable X is of mixed type. Observe that

P(0 < X ≤ 1) = ∫_{0}^{1} (3/4) dx = 3/4 and P(X = 0) = 1/4 =⇒ P(0 ≤ X ≤ 1) = 1.
The cdf of X is given by

           0,             x < 0,
FX (x) =   1/4 + 3x/4,    0 ≤ x < 1,
           1,             x ≥ 1.

[Figure: plot of FX (x), jumping to 1/4 at x = 0 and rising linearly to 1 at x = 1.]
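The mixed distribution above is easy to simulate: with probability 1/4 the wait is exactly 0 (the discrete part), and otherwise it is uniform on (0, 1) (one distributional reading of the constant density 3/4; the sampler below is our sketch, not from the notes). The empirical frequencies should match P(X = 0) = 1/4 and FX (0.5) = 1/4 + 3/8 = 5/8:

```python
import random

random.seed(0)  # reproducibility

def sample_wait():
    # green signal with probability 1/4 -> no wait (discrete part);
    # otherwise the waiting time is uniform on (0, 1) (continuous part)
    return 0.0 if random.random() < 0.25 else random.random()

xs = [sample_wait() for _ in range(200_000)]
p_zero = sum(x == 0.0 for x in xs) / len(xs)   # estimates P(X = 0) = 0.25
p_half = sum(x <= 0.5 for x in xs) / len(xs)   # estimates FX(0.5) = 0.625
print(p_zero, p_half)
```

With 200,000 samples the Monte Carlo error is of order 0.001, so both estimates land close to their targets.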

2.7 Moments and Generating Functions


In this section, we study several characteristics of a distribution such as mean, variance, skewness,
kurtosis and higher order moments.

Definition 2.7.11 [Mathematical Expectation]
Let X be a random variable defined on a probability space (Ω, F , P). Then, the mathematical
expectation or the mean of X is denoted by µ and is defined as

µ = E(X) = Σ_{i=1}^∞ xi pX (xi ),        if X is a discrete random variable,
µ = E(X) = ∫_{−∞}^{∞} x fX (x) dx,       if X is a continuous random variable,

provided E(|X|) exists, that is,

Σ_{i=1}^∞ |xi | pX (xi ) < ∞,            if X is a discrete random variable,
∫_{−∞}^{∞} |x| fX (x) dx < ∞,            if X is a continuous random variable.

Remark 2.7.1. In general, the expectation of any function g(X) is defined as

E(g(X)) = Σ_{i=1}^∞ g(xi ) pX (xi ),     if X is a discrete random variable,
E(g(X)) = ∫_{−∞}^{∞} g(x) fX (x) dx,     if X is a continuous random variable,

provided E(|g(X)|) exists.

Properties of Expectation
(a) E(c) = c, for any constant c.

(b) E(cg(X)) = cE(g(X)), where c is any constant.

(c) E (c1 g1 (X) + c2 g2 (X) + · · · + ck gk (X)) = c1 E(g1 (X)) + c2 E(g2 (X)) + · · · + ck E (gk (X)).

(d) If g1 (x) ≤ g2 (x), for all x ∈ R then E (g1 (X)) ≤ E(g2 (X)).
Remark 2.7.2. (i) The mathematical expectation is also known as average or measure of central
tendency or measure of location for a distribution.
(ii) One of the best applications of mathematical expectation is to compute the CPI for B.Tech/IDD
students at IIT (BHU). The formula to compute the CPI for a semester with 5 courses is given by
CPI = (grade in paper 1) × (credit of paper 1)/(total number of credits) + · · · + (grade in paper 5) × (credit of paper 5)/(total number of credits).

(iii) Most of the time, students are confused about why we need E(|X|) < ∞ for the existence of
E(X). It is expected that the average should be unique for a random variable. In particular, for
the discrete case, we are dealing with a series, and the order of summation is not specified while
defining the mathematical expectation. We know that if a series is absolutely convergent then
all rearrangements of the series converge to the same value; however, this is not true for
conditionally convergent series. For example, consider the series

1 − 1/2 + 1/3 − 1/4 + 1/5 − 1/6 + 1/7 − · · · .

Grouping it as (1 − 1/2 + 1/3) + (−1/4 + 1/5) + (−1/6 + 1/7) + · · · , the first group equals 5/6
and every later group is negative, so the series sums to a value less than 5/6. Now consider a
rearrangement of the above series where two positive terms are followed by one negative term:

1 + 1/3 − 1/2 + 1/5 + 1/7 − 1/4 + 1/9 + 1/11 − 1/6 + · · · .

Since

1/(4k − 3) + 1/(4k − 1) − 1/(2k) > 0,

the rearranged series sums to a value greater than 5/6.
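The rearrangement phenomenon can also be seen numerically: partial sums of the alternating harmonic series approach log 2 ≈ 0.693 < 5/6, while the "two positive, one negative" rearrangement approaches (3/2) log 2 ≈ 1.04 > 5/6 (a known limit for this particular rearrangement). A sketch:

```python
import math

def alternating(n):
    """Partial sum of 1 - 1/2 + 1/3 - 1/4 + ... (tends to log 2)."""
    return sum((-1) ** (k + 1) / k for k in range(1, n + 1))

def rearranged(groups):
    """Two positive terms, then one negative: 1 + 1/3 - 1/2 + 1/5 + 1/7 - 1/4 + ..."""
    return sum(1 / (4 * k - 3) + 1 / (4 * k - 1) - 1 / (2 * k)
               for k in range(1, groups + 1))

print(alternating(10**5))   # about 0.6931, i.e. log 2 < 5/6
print(rearranged(10**5))    # about 1.0397, i.e. (3/2) log 2 > 5/6
```

Same terms, different order, different sums — exactly why absolute convergence is demanded in the definition of E(X).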
Similarly, for a continuous random variable, if the integral is absolutely convergent then its value
is unique; however, this may fail for conditionally convergent integrals. For example, consider the
Cauchy distribution with pdf

fX (x) = 1/(π(1 + x²)), for −∞ < x < ∞.

Note that the integral ∫_{−∞}^{∞} x/(x² + 1) dx can be interpreted as, for instance,

lim_{R→∞} ∫_{−R}^{kR} x/(x² + 1) dx = lim_{R→∞} (1/2) log((1 + k²R²)/(1 + R²)) = log(k), where k > 0.

You could also take other functions of R for the lower and upper limits, tending to −∞ and ∞ as
R → ∞, to get different answers. So ∫_{−∞}^{∞} x/(x² + 1) dx is not zero and in fact cannot be
assigned any value unless you know how the lower and upper limits approach ∞. This arises because
the integral does not converge absolutely on (−∞, ∞), that is,

∫_{−∞}^{∞} |x| fX (x) dx = ∫_{−∞}^{∞} (1/π) |x|/(1 + x²) dx = ∞.

Therefore, ∫ x fX (x) dx is well-defined only when ∫ |x| fX (x) dx < ∞.
Hence, we need absolute convergence for the mathematical expectation to exist.
Example 2.7.1. Consider the experiment of rolling of two dice. Let X denote the absolute difference
of the upturned faces, find the expectation of X.
Solution. Given X(ω) = |i − j| if ω = (i, j), for i, j = 1, 2, . . . , 6. Therefore,



X(ω) = 0, if ω = (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6),
       1, if ω = (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 5), (5, 4), (4, 3), (3, 2), (2, 1),
       2, if ω = (1, 3), (2, 4), (3, 5), (4, 6), (6, 4), (5, 3), (4, 2), (3, 1),
       3, if ω = (1, 4), (2, 5), (3, 6), (6, 3), (5, 2), (4, 1),
       4, if ω = (1, 5), (2, 6), (6, 2), (5, 1),
       5, if ω = (1, 6), (6, 1).

Therefore, the probability function of X is

x        0      1       2      3      4      5
pX (x)   6/36   10/36   8/36   6/36   4/36   2/36
Hence, the mean of X is given by

E(X) = Σ_{x=0}^{5} x pX (x) = 0 × 6/36 + 1 × 10/36 + 2 × 8/36 + 3 × 6/36 + 4 × 4/36 + 5 × 2/36 = 35/18.
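The value E(X) = 35/18 can be confirmed by enumerating all 36 equally likely outcomes (a quick check, not part of the notes):

```python
from fractions import Fraction
from itertools import product

# X = |i - j| over all 36 equally likely outcomes of rolling two dice
diffs = [abs(i - j) for i, j in product(range(1, 7), repeat=2)]
EX = Fraction(sum(diffs), 36)
print(EX)  # 35/18
```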

Example 2.7.2. If X has pdf fX (x) = e^{−x} , x > 0, find the mean of X.
Solution. Note that

µ = E(X) = ∫_{0}^{∞} x fX (x) dx = ∫_{0}^{∞} x e^{−x} dx.

Integrating by parts, we obtain

E(X) = [−x e^{−x}]_{0}^{∞} + ∫_{0}^{∞} e^{−x} dx = 0 + [−e^{−x}]_{0}^{∞} = 1.

Question: Does the expectation always exist?

The answer is no. For example, define

P(X = (−1)^{j+1} 3^j / j) = 2/3^j , j = 1, 2, 3, . . . .

Then,

E|X| = Σ_{j=1}^∞ |(−1)^{j+1} 3^j / j| · (2/3^j) = Σ_{j=1}^∞ (3^j / j) · (2/3^j) = 2 Σ_{j=1}^∞ 1/j,

which is not convergent. So, E(X) does not exist.


In the continuous case, let a random variable have pdf

fX (x) = 1/(π(1 + x²)), −∞ < x < ∞.

Then,

E|X| = ∫_{−∞}^{∞} |x| fX (x) dx = (2/π) ∫_{0}^{∞} x/(1 + x²) dx = (1/π) [log(1 + x²)]_{0}^{∞},

which is not convergent. So, E(X) does not exist.

Definition 2.7.12 [Moments]
Let X be a random variable. The kth moment about the point A is defined as

E(X − A)^k = Σ_{i=1}^∞ (xi − A)^k pX (xi ),     if X is a discrete random variable,
E(X − A)^k = ∫_{−∞}^{∞} (x − A)^k fX (x) dx,    if X is a continuous random variable.

In particular,

(a) µ′_k = E(X^k) is called the kth moment about the origin or the kth non-central moment.

(b) µk = E(X − µ)k is called the kth moment about the mean or the kth central moment.

(c) E(|X|k ) is called the kth absolute moment about origin.

(d) E(|X − µ|k ) is called the kth absolute central moment.

(e) E(X(X − 1) . . . (X − k + 1)) is called the kth factorial moment.

Remark 2.7.3. (i) We can always write central/non-central moments in terms of non-central/central
moments using the binomial expansion.
(ii) The 1st non-central moment is the mean of X and the 1st central moment is zero.
(iii) Moments give information about the shape of the distribution.

Definition 2.7.13 [Variance]


The second central moment is called the variance of X. That is,

σ² = Var(X) = E(X − µ)².

Moreover, σ = √Var(X) is called the standard deviation of X.

Remark 2.7.4. (i) Note that

σ² = Var(X) = E(X − µ)²
            = E(X² + µ² − 2µX)
            = E(X²) + µ² − 2µ E(X)
            = E(X²) + µ² − 2µ²
            = E(X²) − (E(X))².

(ii) Observe that σ² ≥ 0 =⇒ E(X²) ≥ (E(X))².


(iii) Variance gives information about the variability of the random variable around the mean. For
example, consider the pmfs of two random variables X and Y whose graphs are shown in the
figures below.

[Figure 1: pmf of X with µ = 5.1 and σ² = 2.79. Figure 2: pmf of Y with µ = 5.1 and σ² = 13.09.]

From the figures, it can be easily visualized that the variance of X is less than the variance of Y .
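The identity Var(X) = E(X²) − (E(X))² from Remark 2.7.4 can be checked on a small example, say a fair die (the example is our choice):

```python
from fractions import Fraction

faces = range(1, 7)
p = Fraction(1, 6)  # fair die: each face has probability 1/6

mu = sum(x * p for x in faces)                    # E(X) = 7/2
var1 = sum((x - mu) ** 2 * p for x in faces)      # E(X - mu)^2
var2 = sum(x**2 * p for x in faces) - mu**2       # E(X^2) - (E X)^2
print(var1, var2)  # both 35/12
```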

Definition 2.7.14 [Symmetric Distribution]

A random variable X is said to be symmetric about a point α if

P(X ≤ α − x) = P(X ≥ α + x), for all x.

Remark 2.7.5. The above definition can also be written as

FX (α − x) = 1 − FX (α + x) + P(X = α + x).

Moreover, if α = 0 and X is a continuous random variable, we have

FX (−x) = 1 − FX (x) + P(X = x) = 1 − FX (x) =⇒ fX (−x) = fX (x).

That is, if a random variable is symmetric about 0 then its pdf should be an even function.
Next, we move on to define skewness and kurtosis, which are related to the third and fourth central
moments. Literally, skewness is a statistical number that tells us whether a distribution is symmetric or not.

Definition 2.7.15 [Skewness]
Skewness is a measure of the asymmetry of the probability distribution of a real-valued random
variable about its mean. Mathematically, it is defined by

β1 = E[((X − µ)/σ)³].

Remark 2.7.6. (i) If β1 = 0 then the distribution is symmetric and

mean = median = mode.

(ii) If β1 > 0, then the right tail is longer than the left tail. In this case, the distribution is called
right-skewed or positively skewed and

mean > median > mode.

(iii) If β1 < 0, then the left tail is longer than the right tail. In this case, the distribution is called
left-skewed or negatively skewed and

mean < median < mode.

Definition 2.7.16 [Kurtosis]
Kurtosis is a measure of the tailedness of the probability distribution of a real-valued random
variable. That is, its value describes the thickness of the distribution’s tails. Mathematically, it is
defined by

β2 = E[((X − µ)/σ)⁴] − 3.

Remark 2.7.7. (i) β2 + 3 and β2 are called simple and excess kurtosis, respectively.

(ii) Kurtosis also tells us about the peakedness of the distribution. In particular, if β2 > 0, β2 = 0 or
β2 < 0 then the distribution is called leptokurtic, mesokurtic (normal peak) or platykurtic, respectively.

[Figure: leptokurtic (β2 > 0), mesokurtic (β2 = 0) and platykurtic (β2 < 0) densities]
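As a small illustration (the pmf below is hypothetical, not from the notes), β1 and β2 can be computed directly from the definitions above:

```python
# Hypothetical pmf, symmetric about 2, so we expect beta1 = 0.
pmf = {0: 0.1, 1: 0.2, 2: 0.4, 3: 0.2, 4: 0.1}

mu = sum(x * p for x, p in pmf.items())                      # mean
var = sum((x - mu) ** 2 * p for x, p in pmf.items())         # variance
sigma = var ** 0.5

# skewness beta1 = E[((X - mu)/sigma)^3]
beta1 = sum(((x - mu) / sigma) ** 3 * p for x, p in pmf.items())
# excess kurtosis beta2 = E[((X - mu)/sigma)^4] - 3
beta2 = sum(((x - mu) / sigma) ** 4 * p for x, p in pmf.items()) - 3
```

For this pmf the symmetry forces β1 = 0, while β2 turns out negative (platykurtic).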

Definition 2.7.17 [Moment Generating Function]


Let X be a random variable. The function

    MX(t) = E(e^{tX}) = { Σ_x e^{tx} pX(x),            if X is a discrete random variable,
                        { ∫_{−∞}^{∞} e^{tx} fX(x) dx,  if X is a continuous random variable,

is called the moment generating function (mgf) of the random variable X, provided the series or
integral exists for some t ≠ 0.

Properties of Moment Generating Function:

(a) MX (0) = 1.

(b) MaX+b (t) = ebt MX (at).

(c) Moment generating function uniquely determines the distribution. In other words, if X and Y
are two random variables and

MX (t) = MY (t), for all t

then

FX (x) = FY (x), for all x

or equivalently X and Y have the same distribution. This statement is not equivalent to the
statement “if two distributions have the same moments, then they are identical at all points”.
This is because in some cases, the moments exist and yet the moment-generating function does
not, because the limit

    lim_{n→∞} Σ_{i=0}^{n} (t^i/i!) E(X^i)

may not exist. The log-normal distribution is an example of such a case.


Theorem 2.7.6
Suppose MX(t) is continuously differentiable for |t| < t0, for some t0 > 0. Then

    E(X^k) = M_X^{(k)}(0) = (d^k/dt^k) MX(t) |_{t=0}.

Proof. Observe that

    MX(t) = E(e^{tX}) = E( Σ_{i=0}^{∞} (tX)^i/i! ) = E(1 + tX/1! + t^2X^2/2! + ···)
          = 1 + (t/1!) E(X) + (t^2/2!) E(X^2) + ··· .

Differentiating k times, we get

    M_X^{(k)}(t) = E(X^k) + (t/1!) E(X^{k+1}) + (t^2/2!) E(X^{k+2}) + ··· .

Hence,

    E(X^k) = M_X^{(k)}(t) |_{t=0}.

This proves the result.
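A numerical sketch of the theorem: the first two derivatives of an mgf at t = 0 can be approximated by finite differences. Here we use a Bernoulli(p) variable, whose mgf q + pe^t is derived later in these notes; the values of p and the step h are illustrative choices.

```python
import math

p, q = 0.3, 0.7
M = lambda t: q + p * math.exp(t)   # mgf of a Bernoulli(p) random variable

h = 1e-4
# central first difference: approximates M'(0) = E(X) = p
M1 = (M(h) - M(-h)) / (2 * h)
# central second difference: approximates M''(0) = E(X^2) = p
M2 = (M(h) - 2 * M(0.0) + M(-h)) / h**2
```

Both approximations land on p = 0.3, in line with E(X) = E(X^2) = p for a Bernoulli variable.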

Definition 2.7.18 [Characteristic Function]


A characteristic function of a random variable X is defined by

    ϕX(t) = E(e^{itX}) = { Σ_x e^{itx} pX(x),            if X is a discrete random variable,
                         { ∫_{−∞}^{∞} e^{itx} fX(x) dx,  if X is a continuous random variable,

where i = √−1.

Properties of Characteristic Function:

(a) Note that ϕX(t) always exists, since

    |ϕX(t)| ≤ ∫_{−∞}^{∞} |e^{itx}| fX(x) dx = ∫_{−∞}^{∞} fX(x) dx = 1.

(b) ϕX (0) = 1.

(c) ϕaX+b (t) = eibt ϕX (at).

(d) ϕX(−t) is the complex conjugate of ϕX(t).

(e) ϕX (t) = ϕY (t), for all t if and only if FX (x) = FY (x), for all x.

(f) If X is continuous then

    fX(x) = (d/dx) FX(x) = (1/2π) ∫_{−∞}^{∞} e^{−itx} ϕX(t) dt.

(g) If E|X|^k < ∞, for k ≥ 1, then ϕ_X^{(k)}(t) exists, and

    E(X^k) = (1/i^k) ϕ_X^{(k)}(t) |_{t=0}.

Definition 2.7.19 [Probability Generating Function]


Let X be a discrete random variable. Then the probability generating function (pgf) of X is defined
by

    GX(t) = E(t^X) = Σ_x t^x pX(x), for |t| < t0, for some t0 > 0.

Properties of Probability Generating Function:

(a) Note that the pgf always exists for |t| < 1, since

    E(|t^X|) = Σ_x |t|^x pX(x) ≤ Σ_x pX(x) = 1.

(b) GX (1) = 1.

(c) Suppose X can take values {0, 1, 2, . . .}. Then

    pX(k) = (1/k!) G_X^{(k)}(t) |_{t=0}.

Proof. Note that

    GX(t) = Σ_{x=0}^{∞} t^x pX(x) = pX(0) + t pX(1) + t^2 pX(2) + t^3 pX(3) + ··· .

Differentiating k times, we get

    G_X^{(k)}(t) = k! pX(k) + (k + 1)! t pX(k + 1) + ··· .

Hence,

    pX(k) = (1/k!) G_X^{(k)}(t) |_{t=0}.

This proves the result.

(d) Note that

    E(X(X − 1) ··· (X − k + 1)) = G_X^{(k)}(t) |_{t=1},

which is called the kth factorial moment.


(e) GX(·) is absolutely and uniformly continuous within its interval of convergence.
(f) GX(·) can be differentiated term-by-term within its interval of convergence.
(g) GX (t) = GY (t), for all t if and only if FX (x) = FY (x), for all x.
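A minimal sketch of properties (c) and (d), using a hypothetical pmf on {0, 1, 2}: the pmf values are exactly the power-series coefficients of GX(t), and evaluating the second derivative at t = 1 gives the second factorial moment E[X(X − 1)].

```python
# hypothetical pmf p_X(0), p_X(1), p_X(2)
pmf = [0.5, 0.3, 0.2]

# G_X(t) = p0 + p1 t + p2 t^2 -- the pmf values are the series coefficients
G = lambda t: sum(p * t**k for k, p in enumerate(pmf))

# property (d): E[X(X-1)] = G''(1); here G''(t) = 2 * p_X(2) is constant
second_factorial_moment = 2 * pmf[2]
# direct computation from the pmf for comparison
direct = sum(x * (x - 1) * p for x, p in enumerate(pmf))
```

Property (b) is visible too: G(1) sums the pmf and equals 1.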
Example 2.7.3. Let X be a random variable with pmf

P(X = x) = q x−1 p, x = 1, 2, 3, . . . ,

where p + q = 1. Find the moment generating function, and hence the mean and variance. Also, find
the characteristic function and the probability generating function of X.
Solution. Note that

    MX(t) = Σ_{x=1}^{∞} e^{tx} q^{x−1} p = (p/q) Σ_{x=1}^{∞} (qe^t)^x
          = (p/q) · qe^t/(1 − qe^t) = pe^t/(1 − qe^t), for qe^t < 1, i.e., t < −ln(q).

Now,

    M′X(t) = pe^t/(1 − qe^t)^2  =⇒  E(X) = M′X(0) = 1/p.

Also,

    M″X(t) = pe^t(1 + qe^t)/(1 − qe^t)^3  =⇒  E(X^2) = M″X(0) = (1 + q)/p^2.

This implies

    Var(X) = E(X^2) − (E(X))^2 = q/p^2.

Next, observe that

    ϕX(t) = Σ_{x=1}^{∞} e^{itx} q^{x−1} p = (p/q) Σ_{x=1}^{∞} (qe^{it})^x
          = (p/q) · qe^{it}/(1 − qe^{it}) = pe^{it}/(1 − qe^{it})

and

    GX(t) = Σ_{x=1}^{∞} t^x q^{x−1} p = (p/q) Σ_{x=1}^{∞} (qt)^x
          = (p/q) · qt/(1 − qt) = pt/(1 − qt), for |t| < 1/q.

2.8 Mode, Median and Quantiles


In this section, we study several characteristics of probability distributions, such as the mode, median,
quantiles, deciles and percentiles.

Definition 2.8.20 [Mode]


Mode is the value which occurs most frequently in a set of observations.

Example 2.8.1. Consider the following sets of observations:

(a) 5, 6, 10, 12, 4, 6, 10, 15, 10.

(b) 2, 7, 9, 15, 9, 3, 7, 5.

(c) 1, 2, 3, 4, 5.

(d) 4, 5, 6, 4, 5, 6.

[Histograms of the observations: (a) mode = 10; (b) modes = 7 and 9]


[Histograms of the observations: (c) no mode possible; (d) no mode possible]

Remark 2.8.1. (i) If a set of observations has one mode, two modes or more than two modes then it is
called unimodal, bimodal or multimodal, respectively.
(ii) A frequency table can also be used to obtain the mode(s). For instance, Example 2.8.1(a) can also
be written as

    No.        4  5  6  10  12  15
    Frequency  1  1  2   3   1   1

It is clear that the mode is 10.


Now, let us connect the above definition of mode to the mode of discrete and continuous random
variables. Note that the observation 10 in Example 2.8.1(a) occurs with higher probability than the
other observations. Therefore, the probability at each point plays an important role in defining the
mode of a discrete random variable. The formal definition is as follows:

Definition 2.8.21 [Mode for Discrete Random Variable]


Let X be a discrete random variable. The mode of X is the value x at which the probability mass
function takes its maximum value.

Example 2.8.2. Let the probability distribution of a random variable X be given by


x 1 2 3 4 5 6 7 8
pX (x) 0.3 0.25 0.25 0.06 0.05 0.05 0.03 0.01
It is clear that the mode is 1.


Remark 2.8.2. The mode for discrete random variable may not be unique.

Definition 2.8.22 [Mode for Continuous Random Variable]


The mode of a continuous random variable is the value(s) x at which the probability density
function reaches a local maximum.

[Figures: densities with a single mode, with two modes, with no mode, and with a mode at a point
where f′X(x) = 0]
Example 2.8.3. Find the mode of a random variate X having pdf
    fX(x) = { (3/4) x^2(2 − x),  0 ≤ x ≤ 2,
            { 0,                 otherwise.

Solution. Note that

    f′X(x) = (3/4)(4x − 3x^2) = 0  =⇒  x(4 − 3x) = 0  =⇒  x = 0 or x = 4/3.

[Figure: graph of fX on [0, 2] with a horizontal tangent, f′X(x) = 0, at x = 4/3]

Also,

    f″X(x) = (3/4)(4 − 6x)  =⇒  f″X(0) = 3 > 0 and f″X(4/3) = −3 < 0.

Hence, x = 4/3 is the mode of X.
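The calculus above can be cross-checked with a crude grid search for the maximizer of fX over [0, 2] (the grid resolution is an arbitrary choice):

```python
# density of Example 2.8.3 on [0, 2]
f = lambda x: 0.75 * x**2 * (2 - x)

# evaluate on a fine grid and pick the argmax
xs = [2 * i / 100_000 for i in range(100_001)]
mode = max(xs, key=f)
# the calculus argument gives mode = 4/3
```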

Definition 2.8.23 [Quantiles]


A number Qp satisfying

    P(X ≤ Qp) ≥ p and P(X ≥ Qp) ≥ 1 − p, for 0 < p < 1,                    (2.8.1)

is called the pth quantile (or quantile of order p) for the distribution of X.

[Figures: the quantile Qp for (a) a discrete and (b) a continuous random variable]

Remark 2.8.3. (a) The quantiles are cut points dividing the range of a probability distribution into
continuous intervals with equal probability.
(b) Note that if FX(·) is an absolutely continuous cdf then (2.8.1) reduces to

    FX(Qp) = p, or ∫_{−∞}^{Qp} fX(x) dx = p.

(c) The quantile may not be unique for a discrete random variable but it is unique for a continuous
random variable.
(d) If p = 1/2 then M = Q_{1/2} is called the median of the random variable X. That is, M is called a
median of a random variable X if

    P(X ≤ M) ≥ 1/2 and P(X ≥ M) ≥ 1/2.

[Figures: the median M for (a) a discrete and (b) a continuous random variable]

In particular, if X is a continuous random variable then

    ∫_{−∞}^{M} fX(x) dx = 1/2 = ∫_{M}^{∞} fX(x) dx.

(e) The median of a random variable X divides the probability distribution in two equal parts.

(f) The numbers Q_{1/4}, Q_{1/2} and Q_{3/4} are called the quartiles of the random variable X. That is,

    P(X ≤ Q_{1/4}) ≥ 1/4,  P(X ≥ Q_{1/4}) ≥ 3/4,  P(X ≤ Q_{1/2}) ≥ 1/2,
    P(X ≥ Q_{1/2}) ≥ 1/2,  P(X ≤ Q_{3/4}) ≥ 3/4  and  P(X ≥ Q_{3/4}) ≥ 1/4.

[Figures: the quartiles Q_{1/4}, Q_{1/2} and Q_{3/4} for discrete and continuous random variables]

(g) The numbers Q_{1/10}, Q_{2/10}, . . . , Q_{9/10} are called deciles.

(h) The numbers Q_{1/100}, Q_{2/100}, . . . , Q_{99/100} are called percentiles.

Example 2.8.4. Find the median of the random variable having pmf

    x       −2    0    1    2
    pX(x)   1/4  1/4  1/3  1/6

Solution. Note that, for any M with 0 ≤ M ≤ 1, we have

    P(X ≤ M) ≥ 1/2 and P(X ≥ M) ≥ 1/2.

[Figure: bar plot of pX with the median interval [0, 1] marked]

Hence, every M ∈ [0, 1] is a median. For instance, M = 1/2 is a median.
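A sketch verifying the conclusion with exact rational arithmetic: every M in [0, 1] satisfies both median inequalities for this pmf, while a point outside the interval fails.

```python
from fractions import Fraction as F

# pmf of Example 2.8.4
pmf = {-2: F(1, 4), 0: F(1, 4), 1: F(1, 3), 2: F(1, 6)}

def is_median(m):
    # both conditions P(X <= m) >= 1/2 and P(X >= m) >= 1/2 must hold
    le = sum(p for x, p in pmf.items() if x <= m)
    ge = sum(p for x, p in pmf.items() if x >= m)
    return le >= F(1, 2) and ge >= F(1, 2)

ok = all(is_median(F(k, 10)) for k in range(0, 11))  # M = 0, 0.1, ..., 1
bad = is_median(F(3, 2))                             # M = 1.5 is not a median
```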


Example 2.8.5. Find the quartiles of a random variate having pdf

    fX(x) = 1/(π(1 + x^2)), −∞ < x < ∞.

Solution. Note that

    FX(x) = ∫_{−∞}^{x} fX(t) dt = (1/π)(tan^{−1}(x) + π/2), −∞ < x < ∞.

Now, consider

    FX(Q_{1/4}) = 1/4
    =⇒ (1/π)(tan^{−1}(Q_{1/4}) + π/2) = 1/4
    =⇒ tan^{−1}(Q_{1/4}) = −π/4
    =⇒ Q_{1/4} = −1.

Next, consider

    FX(M) = FX(Q_{1/2}) = 1/2
    =⇒ (1/π)(tan^{−1}(M) + π/2) = 1/2
    =⇒ tan^{−1}(M) = 0
    =⇒ M = 0.

Finally, note that

    FX(Q_{3/4}) = 3/4
    =⇒ (1/π)(tan^{−1}(Q_{3/4}) + π/2) = 3/4
    =⇒ tan^{−1}(Q_{3/4}) = π/4
    =⇒ Q_{3/4} = 1.

Hence, the quartiles are −1, 0 and 1.
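The same quartiles can be recovered numerically by bisecting the cdf instead of inverting it in closed form (the bracketing interval and iteration count are arbitrary choices):

```python
import math

# cdf of Example 2.8.5: F(x) = (1/pi)(arctan x + pi/2)
F = lambda x: (math.atan(x) + math.pi / 2) / math.pi

def quantile(p, lo=-1e6, hi=1e6, iters=200):
    # F is strictly increasing, so plain bisection converges to the quantile
    for _ in range(iters):
        mid = (lo + hi) / 2
        if F(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

q1, q2, q3 = quantile(0.25), quantile(0.5), quantile(0.75)
# closed form above: -1, 0 and 1
```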

2.9 Exercises
1. A coin is tossed three times. If X : Ω → R is such that X counts the number of heads. Show
that X is a random variable. [Assume F = P(Ω)]
2. A random variable X can take values 0, 1, 2, . . ., with probability proportional to (x + 1)(1/5)^x.
Find P(X = 0).

3. If f1 (x) and f2 (x) are pdfs then show that (θ + 1)f1 (x) −θf2 (x), 0 < θ < 1, is a pdf.

4. Let X be a random variable with pdf

fX (x) = |1 − x|, 0 ≤ x ≤ 2.

Find the mean and variance of X.

5. Prove or disprove: E(X) < ∞ =⇒ E (X 2 ) < ∞.

6. Let X be a random variable having pdf

    fX(x) = (1/β)(1 − |x − α|/β), α − β < x < α + β,

where −∞ < α < ∞ and β > 0.

where −∞ < α < ∞ and β > 0.

(a) Demonstrate that fX (·) is a pdf and sketch it.


(b) Find the cdf corresponding to fX (·).
(c) Find the mean and variance of X.

7. The experiment is to put two balls into five boxes in such a way that each ball is equally likely to
fall in any box. Let X denote the number of balls in the first box.

(a) What is the density function of X?


(b) Find the mgf, characteristic function and pgf of X.
(c) Find the mean and variance of X.

8. Find the mode(s) of the random variable having pmf


    (a)  x       0     1     2     3
         pX(x)   0.25  0.25  0.25  0.25

    (b)  x       1.2  3.1  4.7  5.8
         pX(x)   0.2  0.4  0.2  0.2

    (c)  x       1    2    3    4    9
         pX(x)   0.2  0.3  0.1  0.1  0.3
9. Find the mode of a random variable having pdf or cdf

    (a) fX(x) = { 4x(1 − x^2),  0 ≤ x ≤ 1,
                { 0,            otherwise.

    (b) FX(x) = { 1 − e^{−x/2},  x ≥ 0,
                { 0,             otherwise.

    (c) fX(x) = { (1/12) x e^{−x^2/12},  x > 0,
                { 0,                     otherwise.

10. Find the quartiles for the random variable having pdf
    fX(x) = λ/(π(λ^2 + (x − µ)^2)), −∞ < x < ∞.



Chapter 3

Special Discrete Distributions

As it turns out, there are some specific distributions that are used over and over in practice, and thus
they have been given special names. There is a random experiment behind each of these distributions.
Since these random experiments model many real-life phenomena, these special distributions are used
frequently in different applications. That is why they have been given names and we devote a chapter
to studying them. We will provide pmfs for all of these special random variables, but rather than trying
to memorize each pmf, you should understand the random experiment behind it. If you understand
the random experiments, you can simply derive the pmfs when you need them. Although it might
seem that there are a lot of formulas in this chapter, there are in fact very few new concepts. Do not
get intimidated by the large number of formulas; look at each distribution as a practice problem on
discrete random variables.

3.1 Discrete Uniform Distribution


The discrete uniform distribution is a symmetric probability distribution in which a finite number of
equally likely values are observed. The formal definition can be given as follows:
Definition 3.1.1 [Discrete Uniform Distribution]
A random variable X has discrete uniform distribution if each of N values in its range
{x1 , x2 , . . . , xN } has equal probability. That is,
    pX(x) = P(X = x) = 1/N, x = x1, x2, . . . , xN.

Moreover, if X can take values {1, 2, . . . , N} then pX(x) = 1/N, for x = 1, 2, . . . , N.

[Figure: pmf of the discrete uniform distribution, with height 1/N at each of x1, . . . , xN]

Example 3.1.1. Tossing a coin, rolling a die and drawing a card from a deck of cards are examples of
the discrete uniform distribution.
Theorem 3.1.1
If a random variable X follows the discrete uniform distribution with range {1, 2, . . . , N} then

    E(X) = (N + 1)/2,  Var(X) = (N^2 − 1)/12  and

    MX(t) = { e^t(e^{Nt} − 1)/(N(e^t − 1)),  t ≠ 0,
            { 1,                             t = 0.

Proof. Note that

    E(X) = Σ_{x=1}^{N} x pX(x) = Σ_{x=1}^{N} x/N = (N + 1)/2

and

    E(X^2) = Σ_{x=1}^{N} x^2 pX(x) = Σ_{x=1}^{N} x^2/N = (1/N) · N(N + 1)(2N + 1)/6.

Therefore,

    Var(X) = E(X^2) − (E(X))^2 = (N^2 − 1)/12.

The mgf of X is given by

    MX(t) = E(e^{tX}) = Σ_{x=1}^{N} e^{tx} pX(x) = (1/N) Σ_{x=1}^{N} e^{tx}
          = e^t(e^{Nt} − 1)/(N(e^t − 1)), for t ≠ 0.

Clearly, MX (0) = 1. This proves the result.
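An exact check of Theorem 3.1.1 for one value of N (N = 10 is an arbitrary choice), using rational arithmetic so the comparison with (N + 1)/2 and (N^2 − 1)/12 is exact:

```python
from fractions import Fraction as F

N = 10
# E(X) = sum x/N and E(X^2) = sum x^2/N over x = 1, ..., N
mean = sum(F(x, N) for x in range(1, N + 1))
second = sum(F(x * x, N) for x in range(1, N + 1))
var = second - mean**2
# theory: (N+1)/2 = 11/2 and (N^2 - 1)/12 = 99/12 = 33/4
```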

3.2 Degenerate Distribution


The degenerate distribution is the probability distribution of a discrete random variable whose support
consists of only one value. The formal definition can be given as follows:

Definition 3.2.2 [Degenerate Distribution (Constant Distribution)]


A random variable X is said to have a degenerate distribution if, for some constant c,

P (X = c) = 1.

[Figure: degenerate pmf, a single bar of height 1 at x = c]

Example 3.2.1. Examples of the degenerate distribution include tossing a two-headed coin and rolling
a die all of whose sides show the same number.
Remark 3.2.1. While the degenerate distribution does not appear random in the everyday sense of the
word, it does satisfy the definition of random variable.
Theorem 3.2.2
If a random variable X follows degenerate distribution then

E(X) = c, Var(X) = 0 and MX (t) = ect .

Proof. Note that

    E(X) = c × 1 = c and E(X^2) = c^2 × 1 = c^2.

Therefore,

    Var(X) = E(X^2) − (E(X))^2 = 0.

The mgf of X is given by

    MX(t) = E(e^{tX}) = e^{ct} × 1 = e^{ct}.

This proves the result.

3.3 Bernoulli Distribution


A Bernoulli trial is an experiment which can have only two possible outcomes, for example success
and failure, true and false, yes and no, or heads and tails. The Bernoulli distribution, named after the
Swiss mathematician Jacob Bernoulli, is the discrete probability distribution of a Bernoulli trial. The
formal definition can be given as follows:

Definition 3.3.3 [Bernoulli Distribution]
The distribution of a random variable X that can take two values, namely 0 and 1, with probabilities
1 − p and p, respectively, is called the Bernoulli distribution. That is, a random variable is said to
have a Bernoulli distribution if

    P(X = 0) = 1 − p and P(X = 1) = p.

[Figure: Bernoulli pmf, with bars of heights 1 − p at x = 0 and p at x = 1]

Remark 3.3.1. (i) The number p is called the parameter of Bernoulli distribution.

(ii) If X follows Bernoulli distribution with parameter p then it is denoted by X ∼ Ber(p).


Theorem 3.3.3
If a random variable X follows Bernoulli distribution and q = 1 − p then

E(X) = p, Var(X) = pq and MX (t) = q + pet .

Proof. Note that

    E(X) = Σ_{x=0}^{1} x pX(x) = p and E(X^2) = Σ_{x=0}^{1} x^2 pX(x) = p.

Therefore,

    Var(X) = E(X^2) − (E(X))^2 = p − p^2 = pq.

The mgf of X is given by

    MX(t) = E(e^{tX}) = Σ_{x=0}^{1} e^{tx} pX(x) = q + pe^t.

This proves the result.

3.4 Binomial Distribution
The binomial distribution plays an important role in many real-life applications. This is because at its
heart is a binary situation, one with two possible outcomes. It can be described as follows:
Suppose a Bernoulli trial, with two possible outcomes, namely success and failure, is repeated n (finite)
times under independent and identical conditions. Let the probabilities of success and failure be p and
q, respectively, where p + q = 1. Suppose X denotes the number of successes out of the n trials. Then
X can take values {0, 1, . . . , n} and

    P(X = 0) = q × q × ··· × q = q^n,
    P(X = 1) = C(n, 1) p q^{n−1},
    P(X = 2) = C(n, 2) p^2 q^{n−2},
    ...
    P(X = x) = C(n, x) p^x q^{n−x}, for x = 0, 1, . . . , n,

which is the pmf of the binomial distribution. Here C(n, x) = n!/(x!(n − x)!) denotes the binomial
coefficient. The formal definition can be given as follows:

Definition 3.4.4 [Binomial Distribution]


Suppose n Bernoulli trials are repeated under independent and identical conditions with success
and failure probabilities p and q, respectively, where p + q = 1. Let X denote the number of
successes in the n trials. Then the distribution of X is called the binomial distribution, with pmf

    P(X = x) = C(n, x) p^x q^{n−x}, for x = 0, 1, . . . , n.

Remark 3.4.1. (i) The numbers n and p are called the parameters of binomial distribution.
(ii) If X follows binomial distribution with parameters n and p then it is denoted by X ∼ B(n, p).
(iii) We know that

    (a + b)^n = Σ_{x=0}^{n} C(n, x) a^x b^{n−x} = Σ_{x=0}^{n} C(n, x) a^{n−x} b^x,

which is called the binomial expansion. Therefore,

    Σ_{x=0}^{n} pX(x) = Σ_{x=0}^{n} C(n, x) p^x q^{n−x} = (p + q)^n = 1^n = 1.

It can be easily seen that the probabilities of the binomial distribution are the terms of the binomial
expansion of (p + q)^n. That is why it is called the binomial distribution.
Theorem 3.4.4
If a random variable X follows binomial distribution with parameters n and p then

(a) E(X) = np.

(b) Var(X) = npq.

(c) MX (t) = (q + pet )n .

(d) ϕX (t) = (q + peit )n .

(e) GX (t) = (q + pt)n .


(f) β1 = (1 − 2p)/√(npq) and β2 = (1 − 6pq)/(npq).

Proof. (a) Consider

    E(X) = Σ_{x=0}^{n} x C(n, x) p^x q^{n−x} = Σ_{x=1}^{n} (n!/((x − 1)!(n − x)!)) p^x q^{n−x}
         = np Σ_{y=0}^{n−1} ((n − 1)!/(y!(n − 1 − y)!)) p^y q^{(n−1)−y}   (y = x − 1)
         = np(p + q)^{n−1} = np.

(b) Note that

    E(X(X − 1)) = Σ_{x=0}^{n} x(x − 1) pX(x) = Σ_{x=2}^{n} x(x − 1) C(n, x) p^x q^{n−x}
                = Σ_{x=2}^{n} (n!/((x − 2)!(n − x)!)) p^x q^{n−x}
                = n(n − 1)p^2 Σ_{y=0}^{n−2} ((n − 2)!/(y!(n − 2 − y)!)) p^y q^{n−2−y}   (y = x − 2)
                = n(n − 1)p^2 (p + q)^{n−2} = n(n − 1)p^2.

This implies

    E(X^2) − E(X) = n(n − 1)p^2  =⇒  E(X^2) = n(n − 1)p^2 + np.

Therefore,

    σ^2 = Var(X) = E(X^2) − (E(X))^2 = n(n − 1)p^2 + np − n^2p^2 = npq.

(c) Observe that

    MX(t) = E(e^{tX}) = Σ_{x=0}^{n} e^{tx} pX(x) = Σ_{x=0}^{n} C(n, x) (pe^t)^x q^{n−x} = (q + pe^t)^n.

(d) Consider

    ϕX(t) = E(e^{itX}) = Σ_{x=0}^{n} e^{itx} pX(x) = Σ_{x=0}^{n} C(n, x) (pe^{it})^x q^{n−x} = (q + pe^{it})^n.

(e) Consider

    GX(t) = E(t^X) = Σ_{x=0}^{n} t^x pX(x) = Σ_{x=0}^{n} C(n, x) (pt)^x q^{n−x} = (q + pt)^n.

(f) Note that

    β1 = E[((X − µ)/σ)^3] = (E(X^3) − 3µσ^2 − µ^3)/σ^3

and

    β2 = E[((X − µ)/σ)^4] − 3 = (E(X^4) − 4µE(X^3) + 6µ^2E(X^2) − 3µ^4)/σ^4 − 3.

From (a), (b) and (c), we have µ = np, σ^2 = npq and MX(t) = (q + pe^t)^n. So,

    M′X(t) = npe^t(q + pe^t)^{n−1},
    M″X(t) = npe^t(q + pe^t)^{n−1} + n(n − 1)p^2e^{2t}(q + pe^t)^{n−2},
    M‴X(t) = M″X(t) + 2n(n − 1)p^2e^{2t}(q + pe^t)^{n−2} + n(n − 1)(n − 2)p^3e^{3t}(q + pe^t)^{n−3},
    M_X^{(4)}(t) = M‴X(t) + 4n(n − 1)p^2e^{2t}(q + pe^t)^{n−2} + 5n(n − 1)(n − 2)p^3e^{3t}(q + pe^t)^{n−3}
                   + n(n − 1)(n − 2)(n − 3)p^4e^{4t}(q + pe^t)^{n−4}.

Therefore,

    E(X^3) = M‴X(0) = M″X(0) + 2n(n − 1)p^2 + n(n − 1)(n − 2)p^3
           = np + 3n(n − 1)p^2 + n(n − 1)(n − 2)p^3
           = np + 3n^2p^2 − 3np^2 + n^3p^3 − 3n^2p^3 + 2np^3

and

    E(X^4) = M_X^{(4)}(0) = M‴X(0) + 4n(n − 1)p^2 + 5n(n − 1)(n − 2)p^3 + n(n − 1)(n − 2)(n − 3)p^4
           = np + 3n^2p^2 − 3np^2 + n^3p^3 − 3n^2p^3 + 2np^3 + 4n^2p^2 − 4np^2 + 5n^3p^3 − 15n^2p^3
             + 10np^3 + n^4p^4 − 6n^3p^4 + 11n^2p^4 − 6np^4.

Hence,

    β1 = (E(X^3) − 3µσ^2 − µ^3)/σ^3
       = (np + 3n^2p^2 − 3np^2 + n^3p^3 − 3n^2p^3 + 2np^3 − 3n^2p^2(1 − p) − n^3p^3)/(np(1 − p))^{3/2}
       = (np − 3np^2 + 2np^3)/(np(1 − p))^{3/2}
       = np(1 − p)(1 − 2p)/(np(1 − p))^{3/2}
       = (1 − 2p)/√(np(1 − p)) = (1 − 2p)/√(npq)

      { > 0, if p < 1/2 (positively skewed),
      { = 0, if p = 1/2 (symmetric),
      { < 0, if p > 1/2 (negatively skewed).

Some examples for different values of p are shown in the following figure:

[Figure: binomial pmfs for n = 30 with p = 0.2, 0.5 and 0.8]

Also,

    β2 = [E(X^4) − 4µE(X^3) + 6µ^2E(X^2) − 3µ^4 − 3σ^4]/σ^4
       = [np + 3n^2p^2 − 3np^2 + n^3p^3 − 3n^2p^3 + 2np^3 + 4n^2p^2 − 4np^2 + 5n^3p^3 − 15n^2p^3 + 10np^3
          + n^4p^4 − 6n^3p^4 + 11n^2p^4 − 6np^4 − 4np(np + 3n^2p^2 − 3np^2 + n^3p^3 − 3n^2p^3 + 2np^3)
          + 6n^2p^2(n^2p^2 + np(1 − p)) − 3n^4p^4 − 3n^2p^2(1 − p)^2]/(n^2p^2(1 − p)^2)
       = (np − 7np^2 + 12np^3 − 6np^4)/(n^2p^2(1 − p)^2)
       = np(1 − p)(6p^2 − 6p + 1)/(n^2p^2(1 − p)^2)
       = (1 + 6p(p − 1))/(np(1 − p))
       = (1 − 6pq)/(npq)

      { < 0, if pq > 1/6 (platykurtic),
      { = 0, if pq = 1/6 (mesokurtic),
      { > 0, if pq < 1/6 (leptokurtic).

Some examples for different values of p are shown in the following figure:

[Figure: binomial pmfs for n = 30 with p = 0.3, 0.211325 and 0.17]

This proves the result.


Example 3.4.1. Find the mode(s) of binomial distribution with parameters n and p.
Solution. Let X ∼ B(n, p). We have to find the value of X for which the probability is maximum. If
x is the modal value (i.e., x is the most probable number of successes), then

P(X = x − 1) ≤ P(X = x) ≥ P(X = x + 1).


[Figure: pmf bars P(X = x − 1), P(X = x) and P(X = x + 1) near the modal value]
First consider P(X = x − 1) ≤ P(X = x):

    C(n, x − 1) p^{x−1} q^{n−x+1} ≤ C(n, x) p^x q^{n−x}
    =⇒ (n!/((n − x + 1)!(x − 1)!)) p^{x−1} q^{n−x+1} ≤ (n!/(x!(n − x)!)) p^x q^{n−x}
    =⇒ q/(n − x + 1) ≤ p/x
    =⇒ (1 − p)x ≤ p(n − x + 1)
    =⇒ x − px ≤ np − px + p
    =⇒ x ≤ (n + 1)p.

Similarly,

    P(X = x) ≥ P(X = x + 1)  =⇒  x ≥ (n + 1)p − 1.

Therefore, we have

    (n + 1)p − 1 ≤ x ≤ (n + 1)p.

Case I: If (n + 1)p is an integer then x = (n + 1)p − 1 and x = (n + 1)p both are modes for binomial
distribution. That is, the distribution is bimodal.
Case II: If (n + 1)p is not an integer then the mode is the integral part of (n + 1)p.
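A sketch of the two cases with illustrative parameters: for n = 9, p = 0.5 the quantity (n + 1)p = 5 is an integer and the pmf is bimodal at 4 and 5, while for n = 10, p = 0.3 the mode is the integral part of (n + 1)p = 3.3.

```python
from math import comb

def modes(n, p):
    # argmax (possibly tied) of the binomial pmf
    pmf = [comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)]
    m = max(pmf)
    return [x for x, v in enumerate(pmf) if abs(v - m) < 1e-12]

m1 = modes(9, 0.5)    # Case I: (n+1)p = 5 is an integer -> bimodal
m2 = modes(10, 0.3)   # Case II: (n+1)p = 3.3 -> single mode
```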
Example 3.4.2. An experiment succeeds twice as often as it fails. Find the chance that in the next six
trials there will be at least four successes.
Solution. Let X denote the number of successes for the given experiment. Note that the success
probability is p = 2/3 and n = 6; therefore, X ∼ B(6, 2/3). Hence,

    P(X = x) = C(6, x) (2/3)^x (1/3)^{6−x}, x = 0, 1, 2, 3, 4, 5, 6.

The required probability is given by

    P(X ≥ 4) = P(X = 4) + P(X = 5) + P(X = 6)
             = C(6, 4)(2/3)^4(1/3)^2 + C(6, 5)(2/3)^5(1/3) + C(6, 6)(2/3)^6 = 496/729.

3.5 Poisson Distribution


The Poisson distribution was named after French mathematician Siméon Denis Poisson. It is a discrete
probability distribution that gives the probability of an event happening a certain number of times within
a given interval of time and where the only information available is a measurement of its average value.
For example, if the average number of people who buy cheeseburgers from a fast-food chain on a Friday
night at a single restaurant location is 200, a Poisson distribution can answer questions such as, “what is
the probability that more than 300 people will buy burgers?” Let us first discuss the concept of Poisson
process to develop the Poisson distribution.
Assume the following conditions hold for occurrences over time (or area, or space).

(a) The numbers of occurrences during disjoint time intervals are independent.

(b) The probability of a single occurrence during a small time interval is proportional to the length
of the interval.

(c) The probability of more than one occurrence during a small time interval is negligible.

That is, if X(t) is the number of occurrences in (0, t] then, for very small δ,

    P(X(δ) = k) ≈ { 1 − λδ, if k = 0,
                  { λδ,     if k = 1,                                          (3.5.1)
                  { 0,      if k > 1,

where λ is the expected number of occurrences per unit time. In this case {X(t), t ≥ 0} is called a
Poisson process.
Theorem 3.5.5
Under the assumptions (a), (b) and (c), we have

    Pn(t) := P(X(t) = n) = e^{−λt}(λt)^n/n!, n = 0, 1, 2, . . . .

Proof. We use induction on n to prove the result. First, let n = 0 and consider

    P0(t + h) = P(no occurrence during (0, t + h])
              = P({no occurrence during (0, t]} ∩ {no occurrence during (t, t + h]})
              = P(no occurrence during (0, t]) × P(no occurrence during (t, t + h])
              = P0(t)P0(h)
              = P0(t)(1 − λh).

This implies

    (P0(t + h) − P0(t))/h = −λP0(t).

Taking the limit h → 0, we get

    P′0(t) = −λP0(t)  =⇒  P0(t) = ce^{−λt}.

Since P0(0) = 1, we get c = 1 and therefore

    P0(t) = e^{−λt}.

Hence, the result holds for n = 0. Suppose the result is true for all n ≤ k. Consider

    P_{k+1}(t + h) = P((k + 1) occurrences in (0, t + h])
                   = P({(k + 1) occurrences in (0, t]} ∩ {no occurrence in (t, t + h]})
                     + P({k occurrences in (0, t]} ∩ {one occurrence in (t, t + h]})
                     + Σ_{j=1}^{k} P({(k − j) occurrences in (0, t]} ∩ {(j + 1) occurrences in (t, t + h]})
                   = P_{k+1}(t)P0(h) + P_k(t)P1(h) + Σ_{j=1}^{k} P_{k−j}(t)P_{j+1}(h)
                   = P_{k+1}(t)(1 − λh) + (e^{−λt}(λt)^k/k!)(λh)   (using the induction hypothesis and (3.5.1)).

This implies

    (P_{k+1}(t + h) − P_{k+1}(t))/h = −λP_{k+1}(t) + e^{−λt} λ^{k+1}t^k/k!.

Taking the limit h → 0, we get

    P′_{k+1}(t) = −λP_{k+1}(t) + e^{−λt} λ^{k+1}t^k/k!.

This implies

    P_{k+1}(t) = e^{−λt}[(λt)^{k+1}/(k + 1)! + c1].

Since P_{k+1}(0) = 0, we get c1 = 0. Hence,

    P_{k+1}(t) = e^{−λt}(λt)^{k+1}/(k + 1)!.

This proves the result.


Example 3.5.1. Suppose customers arrive at a shopping mall at the rate of 5 per minute. What is the
probability that two customers arrive in a 3-minute period?
Solution. Given λ = 5, the required probability is

    P(X(3) = 2) = e^{−15}(15)^2/2!.
Now, we are in a position to define the Poisson distribution. The Poisson process is a statistical process
with independent time increments, where the number of events occurring in a time interval is modelled
by a Poisson distribution. The formal definition can be given as follows:

Definition 3.5.5 [Poisson Distribution]


The random variable X is said to have a Poisson distribution if its pmf can be written as

    P(X = x) = e^{−λ}λ^x/x!, for λ > 0 and x = 0, 1, . . . .

Remark 3.5.1. (i) The number λ is called the parameter of the Poisson distribution.

(ii) If X follows Poisson distribution with parameter λ then it is denoted by X ∼ P(λ).


Theorem 3.5.6
If a random variable X follows Poisson distribution with parameters λ then

(a) E(X) = λ.

(b) Var(X) = λ.
(c) MX(t) = e^{λ(e^t−1)}.

(d) ϕX(t) = e^{λ(e^{it}−1)}.

(e) GX(t) = e^{λ(t−1)}.

(f) β1 = 1/√λ and β2 = 1/λ.

Proof. (a) Consider

    E(X) = Σ_{x=1}^{∞} x P(X = x) = Σ_{x=1}^{∞} x e^{−λ}λ^x/x!
         = λe^{−λ} Σ_{x=1}^{∞} λ^{x−1}/(x − 1)! = λe^{−λ} Σ_{j=0}^{∞} λ^j/j!   (j = x − 1)
         = λe^{−λ} × e^λ = λ.

(b) Consider

    E(X(X − 1)) = Σ_{x=2}^{∞} x(x − 1) P(X = x) = Σ_{x=2}^{∞} x(x − 1) e^{−λ}λ^x/x!
                = λ^2e^{−λ} Σ_{x=2}^{∞} λ^{x−2}/(x − 2)! = λ^2e^{−λ} Σ_{j=0}^{∞} λ^j/j!   (j = x − 2)
                = λ^2e^{−λ} × e^λ = λ^2.

This implies

    E(X^2) = λ^2 + E(X) = λ^2 + λ.

Therefore,

    Var(X) = E(X^2) − (E(X))^2 = λ.


(c) The mgf of X is given by

    MX(t) = E(e^{tX}) = Σ_{x=0}^{∞} e^{tx}P(X = x) = e^{−λ} Σ_{x=0}^{∞} (λe^t)^x/x!
          = e^{−λ}e^{λe^t} = e^{λ(e^t−1)}.

(d) The characteristic function of X is given by

    ϕX(t) = E(e^{itX}) = Σ_{x=0}^{∞} e^{itx}P(X = x) = e^{−λ} Σ_{x=0}^{∞} (λe^{it})^x/x!
          = e^{−λ}e^{λe^{it}} = e^{λ(e^{it}−1)}.

(e) The pgf of X is given by

    GX(t) = E(t^X) = Σ_{x=0}^{∞} t^xP(X = x) = e^{−λ} Σ_{x=0}^{∞} (λt)^x/x!
          = e^{−λ}e^{λt} = e^{λ(t−1)}.

(f) Note that

    β1 = E[((X − µ)/σ)^3] = (E(X^3) − 3µσ^2 − µ^3)/σ^3

and

    β2 = E[((X − µ)/σ)^4] − 3 = (E(X^4) − 4µE(X^3) + 6µ^2E(X^2) − 3µ^4)/σ^4 − 3.

From (a), (b) and (c), we have µ = λ, σ^2 = λ and MX(t) = e^{λ(e^t−1)}. So,

    M‴X(t) = λ^3e^{λ(e^t−1)+3t} + 3λ^2e^{λ(e^t−1)+2t} + λe^{λ(e^t−1)+t},
    M_X^{(4)}(t) = λ^4e^{λ(e^t−1)+4t} + 6λ^3e^{λ(e^t−1)+3t} + 7λ^2e^{λ(e^t−1)+2t} + λe^{λ(e^t−1)+t}.

Therefore,

    E(X^3) = M‴X(0) = λ^3 + 3λ^2 + λ

and

    E(X^4) = M_X^{(4)}(0) = λ^4 + 6λ^3 + 7λ^2 + λ.

Hence,

    β1 = (E(X^3) − 3µσ^2 − µ^3)/σ^3 = (λ^3 + 3λ^2 + λ − 3λ^2 − λ^3)/λ^{3/2} = λ/λ^{3/2} = 1/√λ

and

    β2 = (E(X^4) − 4λE(X^3) + 6λ^2E(X^2) − 3λ^4)/λ^2 − 3
       = ((λ^4 + 6λ^3 + 7λ^2 + λ) − 4λ(λ^3 + 3λ^2 + λ) + 6λ^2(λ^2 + λ) − 3λ^4)/λ^2 − 3
       = ((1 − 4 + 6 − 3)λ^4 + (6 − 12 + 6)λ^3 + (7 − 4)λ^2 + λ)/λ^2 − 3
       = (3λ^2 + λ)/λ^2 − 3 = 1/λ.

So the Poisson distribution is always positively skewed and leptokurtic. Some examples for different
values of λ are shown in the following figure:


[Figure: Poisson pmfs for λ = 1, 5 and 9]

This proves the result.


Example 3.5.2. A telephone switchboard handles 600 calls, on average, during a rush hour. The
board can make a maximum of 20 connections per minute. Evaluate the probability that the board
will be overtaxed during any given minute.
Solution. Let X denote the number of calls in any given minute. Then λ = 600/60 = 10 and
X ∼ P(10). The required probability is given by

    P(X > 20) = 1 − P(X ≤ 20) = 1 − Σ_{i=0}^{20} e^{−10}(10)^i/i!.

Now, we will see in the following theorem that the Poisson distribution arises as a limiting case of the binomial distribution.
Theorem 3.5.7
Let X ∼ B(n, p). If p → 0 and np → λ as n → ∞, then

    P(X = x) → \frac{e^{−λ} λ^x}{x!}.

Proof. Consider

    P(X = x) = \binom{n}{x} p^x (1 − p)^{n−x}
             ≈ \frac{n!}{x!(n − x)!} \left(\frac{λ}{n}\right)^x \left(1 − \frac{λ}{n}\right)^{n−x}
             = \frac{n(n − 1) \cdots (n − x + 1)}{n^x} \cdot \frac{λ^x}{x!} \left(1 − \frac{λ}{n}\right)^n \left(1 − \frac{λ}{n}\right)^{−x}
             = \frac{n}{n} \cdot \frac{n − 1}{n} \cdots \frac{n − x + 1}{n} \cdot \frac{λ^x}{x!} \left(1 − \frac{λ}{n}\right)^n \left(1 − \frac{λ}{n}\right)^{−x}
             → \frac{e^{−λ} λ^x}{x!}  as n → ∞.

Alternate Proof. Consider

    M_X(t) = (q + pe^t)^n = (1 − p + pe^t)^n ≈ \left(1 + \frac{λ}{n}(e^t − 1)\right)^n \xrightarrow{n → ∞} e^{λ(e^t − 1)}.

This proves the result.


Remark 3.5.2. If n is large and p is small for a binomial distribution, then we can approximate it by a Poisson distribution with λ = np.
Example 3.5.3. A manufacturer of cotter pins knows that 5% of his product is defective. He sells cotter pins in boxes of 100 and guarantees that no more than 10 pins in a box will be defective. What is the approximate probability that a box will fail to meet the guaranteed quality?
Solution. Here, n = 100 and p = 0.05 (small). So, we can use a Poisson distribution with λ = np = 5 instead of B(100, 0.05). Therefore, the required probability is given by

    P(X > 10) = 1 − P(X ≤ 10) = 1 − \sum_{x=0}^{10} \frac{e^{−5} 5^x}{x!}.
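How good is this approximation? The following sketch (ours) compares the exact binomial tail with the Poisson tail for the numbers in this example; the variable names are our own.

```python
import math

# Compare exact B(100, 0.05) with the Poisson(5) approximation for P(X > 10).
n, p, lam = 100, 0.05, 5.0

binom_tail = 1 - sum(math.comb(n, x) * p**x * (1 - p)**(n - x) for x in range(11))
pois_tail = 1 - sum(math.exp(-lam) * lam**x / math.factorial(x) for x in range(11))

print(binom_tail, pois_tail)  # both small and close to each other
```

Both tails come out close to one percent, with the two answers agreeing to about the first significant digit, which is typical for n = 100 and p = 0.05.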

3.6 Geometric Distribution


The geometric distribution is an appropriate model if the following assumptions are true.

(a) The phenomenon being modelled is a sequence of independent trials.

(b) There are only two possible outcomes for each trial, often designated success or failure.

(c) The probability of success, p, is the same for every trial.

In a sequence of Bernoulli trials, let X denote the number of trials needed to get the first success. Then X can take values {1, 2, . . .} and

    P(X = 1) = p
    P(X = 2) = qp
    ⋮
    P(X = x) = q^{x−1} p, for x = 1, 2, . . .,

which is the pmf of the geometric distribution. The formal definition can be given as follows:

Definition 3.6.6 [Geometric Distribution]
Suppose Bernoulli trials are repeated under independent and identical conditions with success and failure probabilities p and q, respectively, where p + q = 1. Let X denote the number of trials needed for the first success. Then the distribution of X is called the geometric distribution with pmf

    P(X = x) = q^{x−1} p, for x = 1, 2, . . ..

[Figure: pmf p_X(x) of a geometric distribution.]

Remark 3.6.1. (i) The number p is called the parameter of geometric distribution.

(ii) If X follows geometric distribution with parameter p then it is denoted by X ∼ Geo(p).


Theorem 3.6.8
If a random variable X follows the geometric distribution with parameter p, then

    E(X) = \frac{1}{p},  Var(X) = \frac{q}{p^2}  and  M_X(t) = \frac{pe^t}{1 − qe^t}, t < − ln(q).

Proof. Note that

    \sum_{x=0}^{∞} q^x = 1 + q + q^2 + · · · = \frac{1}{1 − q}.

Differentiating with respect to q once and twice, we get

    \sum_{x=1}^{∞} x q^{x−1} = \frac{1}{(1 − q)^2}                      (3.6.1)

    \sum_{x=2}^{∞} x(x − 1) q^{x−2} = \frac{2}{(1 − q)^3}.              (3.6.2)

Therefore,

    E(X) = \sum_{x=1}^{∞} x q^{x−1} p = p \sum_{x=1}^{∞} x q^{x−1} = \frac{p}{(1 − q)^2} = \frac{1}{p}

and

    E(X(X − 1)) = \sum_{x=2}^{∞} x(x − 1) q^{x−1} p = pq \sum_{x=2}^{∞} x(x − 1) q^{x−2} = \frac{2pq}{(1 − q)^3} = \frac{2q}{p^2}.

This implies

    E(X^2) = \frac{2q}{p^2} + \frac{1}{p}.

Hence,

    Var(X) = E(X^2) − (E(X))^2 = \frac{2q}{p^2} + \frac{1}{p} − \frac{1}{p^2} = \frac{q}{p^2}.

Next, consider

    M_X(t) = E(e^{tX}) = \sum_{x=1}^{∞} e^{tx} q^{x−1} p = \frac{p}{q} \sum_{x=1}^{∞} (qe^t)^x = \frac{pe^t}{1 − qe^t}, provided qe^t < 1, i.e. t < − ln(q).

This proves the result.


Example 3.6.1. A die is rolled until a six appears. What is the probability that it must be rolled more than five times?
Solution. Let X denote the number of rolls of the die until a six appears. It can be easily seen that X ∼ Geo(1/6) and

    P(X = x) = \frac{1}{6} \left(\frac{5}{6}\right)^{x−1}, for x = 1, 2, . . ..

The required probability is given by

    P(X > 5) = 1 − P(X ≤ 5) = 1 − \sum_{x=1}^{5} \frac{1}{6} \left(\frac{5}{6}\right)^{x−1} = \left(\frac{5}{6}\right)^5.
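As a quick sanity check (ours, not part of the notes), summing the pmf directly agrees with the closed form (5/6)^5:

```python
# Check of Example 3.6.1: X ~ Geo(1/6), P(X > 5) = (5/6)**5.
p, q = 1 / 6, 5 / 6

tail = 1 - sum(p * q ** (x - 1) for x in range(1, 6))
print(tail)  # approximately 0.40188, the same as (5/6)**5
```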

Property 3.6.1 [The Memoryless Property]


A random variable X is called memoryless if, for any n, m ≥ 0,

P (X > n + m | X > m) = P (X > n).

Example 3.6.2. Show that the geometric distribution satisfies the memoryless property.
Solution. Let X ∼ Geo(p). Then

    P(X = x) = q^{x−1} p, x = 1, 2, . . . .

Note that

    P(X > n) = \sum_{x=n+1}^{∞} P(X = x) = p \sum_{x=n+1}^{∞} q^{x−1} = pq^n (1 + q + q^2 + · · ·) = \frac{pq^n}{1 − q} = q^n.

Therefore,

    P(X > n + m | X > m) = \frac{P(X > n + m, X > m)}{P(X > m)} = \frac{P(X > n + m)}{P(X > m)} = \frac{q^{n+m}}{q^m} = q^n = P(X > n).

[Figure: pmf p_X(x) of X and of the conditional random variable X | X > 4, illustrating the memoryless property.]
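The identity P(X > n + m | X > m) = P(X > n) can be illustrated numerically using the tail formula P(X > n) = q^n derived above. The sketch below (ours) checks a few pairs (n, m):

```python
# Numerical illustration of the memoryless property for Geo(p).
p = 0.3
q = 1 - p

def tail(n):
    return q ** n  # P(X > n) = q^n for the geometric distribution

for n, m in [(2, 3), (5, 1), (4, 4)]:
    cond = tail(n + m) / tail(m)  # P(X > n+m | X > m)
    assert abs(cond - tail(n)) < 1e-12
print("P(X > n+m | X > m) = P(X > n) for all tested n, m")
```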

Another Type of Geometric Distribution. In a sequence of Bernoulli trials, suppose Y denotes the number of failures before the first success. Then Y can take values {0, 1, 2, . . .} and Y = X − 1. The pmf of Y is given by

    P(Y = y) = q^y p, y = 0, 1, 2, . . . .                    (3.6.3)

Theorem 3.6.9
If a random variable Y follows the geometric distribution defined in (3.6.3), then

    E(Y) = \frac{q}{p},  Var(Y) = \frac{q}{p^2}  and  M_Y(t) = \frac{p}{1 − qe^t}, t < − ln(q).

Proof. Using (3.6.1) and (3.6.2), observe that

    E(Y) = \sum_{y=1}^{∞} y q^y p = pq \sum_{y=1}^{∞} y q^{y−1} = \frac{pq}{(1 − q)^2} = \frac{q}{p}

and

    E(Y(Y − 1)) = \sum_{y=2}^{∞} y(y − 1) q^y p = pq^2 \sum_{y=2}^{∞} y(y − 1) q^{y−2} = \frac{2pq^2}{(1 − q)^3} = \frac{2q^2}{p^2}.

This implies

    E(Y^2) = \frac{2q^2}{p^2} + E(Y) = \frac{2q^2}{p^2} + \frac{q}{p}.

Therefore,

    Var(Y) = E(Y^2) − (E(Y))^2 = \frac{2q^2}{p^2} + \frac{q}{p} − \frac{q^2}{p^2} = \frac{q}{p^2}.

Next, consider

    M_Y(t) = E(e^{tY}) = \sum_{y=0}^{∞} e^{ty} q^y p = p \sum_{y=0}^{∞} (qe^t)^y = \frac{p}{1 − qe^t}, provided qe^t < 1, i.e. t < − ln(q).

This proves the result.


Remark 3.6.2. The memoryless property for Y is given by

P(Y ≥ n + m|Y ≥ m) = P(Y ≥ n).

3.7 Negative Binomial Distribution


The negative binomial distribution is a generalization of the geometric distribution. In a sequence of independent Bernoulli trials, let the random variable X denote the number of trials needed for the rth success to occur, where r is a fixed positive integer. The first x − 1 trials must contain exactly r − 1 successes and the xth trial must be a success, so

    P(X = x) = \binom{x − 1}{r − 1} p^r q^{x−r}, x = r, r + 1, . . . .

The formal definition can be given as follows:

The formal definition can be given as follows:

Definition 3.7.7 [Negative Binomial Distribution]

Suppose independent and identical Bernoulli trials are conducted till the rth success is achieved. Let X denote the number of trials needed for the rth success to occur. Then the distribution of X is called the negative binomial distribution with pmf

    P(X = x) = \binom{x − 1}{r − 1} p^r q^{x−r}, x = r, r + 1, . . . .

Remark 3.7.1. (i) The numbers r and p are called the parameters of the negative binomial distribution.

(ii) If X follows the negative binomial distribution with parameters r and p, then it is denoted by X ∼ NB(r, p).

Theorem 3.7.10
If a random variable X follows the negative binomial distribution and q = 1 − p, then

    E(X) = \frac{r}{p},  Var(X) = \frac{rq}{p^2}  and  M_X(t) = \left(\frac{pe^t}{1 − qe^t}\right)^r, provided qe^t < 1, i.e. t < − ln(q).

Proof. Note that

    \sum_{x=r}^{∞} P(X = x) = 1
    =⇒ \sum_{x=r}^{∞} \binom{x − 1}{r − 1} q^x = \left(\frac{q}{p}\right)^r = \left(\frac{q}{1 − q}\right)^r.                    (3.7.1)

Differentiating (3.7.1) with respect to q once and twice, we get

    \sum_{x=r}^{∞} x \binom{x − 1}{r − 1} q^{x−1} = \frac{r}{p^2} \left(\frac{q}{p}\right)^{r−1}                                 (3.7.2)

    \sum_{x=r}^{∞} x(x − 1) \binom{x − 1}{r − 1} q^{x−2} = \frac{r(r + 2q − 1)}{p^4} \left(\frac{q}{p}\right)^{r−2}.            (3.7.3)

First, consider

    E(X) = \sum_{x=r}^{∞} x P(X = x) = \sum_{x=r}^{∞} x \binom{x − 1}{r − 1} p^r q^{x−r}
         = q \left(\frac{p}{q}\right)^r \sum_{x=r}^{∞} x \binom{x − 1}{r − 1} q^{x−1}
         = q \left(\frac{p}{q}\right)^r \cdot \frac{r}{p^2} \left(\frac{q}{p}\right)^{r−1}   (using (3.7.2))
         = \frac{r}{p}.

Now, consider

    E(X(X − 1)) = \sum_{x=r}^{∞} x(x − 1) P(X = x) = \sum_{x=r}^{∞} x(x − 1) \binom{x − 1}{r − 1} p^r q^{x−r}
                = q^2 \left(\frac{p}{q}\right)^r \sum_{x=r}^{∞} x(x − 1) \binom{x − 1}{r − 1} q^{x−2}
                = q^2 \left(\frac{p}{q}\right)^r \cdot \frac{r(r + 2q − 1)}{p^4} \left(\frac{q}{p}\right)^{r−2}   (using (3.7.3))
                = \frac{r(r + 2q − 1)}{p^2}.
This implies

    E(X^2) = \frac{r(r + 2q − 1)}{p^2} + E(X) = \frac{r(r + 2q − 1)}{p^2} + \frac{r}{p}.

Therefore,

    Var(X) = E(X^2) − (E(X))^2 = \frac{rq}{p^2}.

Next, consider

    M_X(t) = E(e^{tX}) = \sum_{x=r}^{∞} e^{tx} P(X = x) = \sum_{x=r}^{∞} e^{tx} \binom{x − 1}{r − 1} p^r q^{x−r}
           = \left(\frac{p}{q}\right)^r \sum_{x=r}^{∞} \binom{x − 1}{r − 1} (qe^t)^x
           = \left(\frac{p}{q}\right)^r \left(\frac{qe^t}{1 − qe^t}\right)^r   (using (3.7.1) with q replaced by qe^t)
           = \left(\frac{pe^t}{1 − qe^t}\right)^r, provided qe^t < 1, i.e. t < − ln(q).

This proves the result.
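The formulas E(X) = r/p and Var(X) = rq/p² can be confirmed with a truncated sum over the pmf. The following sketch is ours; the parameter values and the truncation point are arbitrary choices.

```python
import math

# Truncated-sum check of E(X) = r/p and Var(X) = r*q/p**2 for NB(r, p),
# using the pmf P(X = x) = C(x-1, r-1) p^r q^(x-r), x = r, r+1, ...
r, p = 3, 0.4
q = 1 - p

xs = list(range(r, 400))  # the tail beyond 400 is negligibly small here
pmf = [math.comb(x - 1, r - 1) * p**r * q ** (x - r) for x in xs]
mean = sum(x * w for x, w in zip(xs, pmf))
var = sum(x * x * w for x, w in zip(xs, pmf)) - mean**2

print(mean, var)  # approximately 7.5 and 11.25
```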


Another Type of Negative Binomial Distribution. In a sequence of Bernoulli trials, suppose Y denotes the number of failures before the rth success. Then Y can take values {0, 1, 2, . . .} and Y = X − r. The pmf of Y is given by

    P(Y = y) = \binom{r + y − 1}{y} p^r q^y, y = 0, 1, 2, . . . .

Theorem 3.7.11
If a random variable Y follows the negative binomial distribution defined above, then

    E(Y) = \frac{rq}{p},  Var(Y) = \frac{rq}{p^2}  and  M_Y(t) = \left(\frac{p}{1 − qe^t}\right)^r, t < − ln(q).

Proof. Following steps similar to the proof of Theorem 3.7.10, the results follow.

3.8 Hypergeometric Distribution


The hypergeometric distribution is a distribution in which selections are made from two groups without replacing members of the groups. The hypergeometric distribution differs from the binomial distribution in the lack of replacement. Let the size of the population selected from be N, with k elements of the population belonging to type I and N − k belonging to type II. Further, let the number of samples drawn from the population be n, such that 0 ≤ n ≤ N. Let X denote the number of items of type I in the selected sample. Then

[Diagram: N items split into k items of type I and N − k items of type II.]

    P(X = x) = \frac{\binom{k}{x} \binom{N − k}{n − x}}{\binom{N}{n}},  max(0, n − N + k) ≤ x ≤ min(n, k).

The formal definition can be given as follows:

Definition 3.8.8 [Hypergeometric Distribution]

A random variable X is said to have the hypergeometric distribution if its pmf can be written as

    P(X = x) = \frac{\binom{k}{x} \binom{N − k}{n − x}}{\binom{N}{n}},  max(0, n − N + k) ≤ x ≤ min(n, k).

Remark 3.8.1. (i) The numbers k, N and n are called the parameters of the hypergeometric distribution.

(ii) If X follows the hypergeometric distribution with parameters k, N and n, then it is denoted by X ∼ Hypergeo(k, N, n).
Theorem 3.8.12
If a random variable X follows the hypergeometric distribution, then

    E(X) = \frac{kn}{N}  and  Var(X) = \left(\frac{N − n}{N − 1}\right) \frac{kn}{N} \left(1 − \frac{k}{N}\right).

Proof. Consider

    E(X) = \sum_{x=1}^{n} x P(X = x) = \sum_{x=1}^{n} x \frac{\binom{k}{x} \binom{N − k}{n − x}}{\binom{N}{n}}
         = \frac{kn}{N} \sum_{x=1}^{n} \frac{\binom{k − 1}{x − 1} \binom{N − 1 − (k − 1)}{n − 1 − (x − 1)}}{\binom{N − 1}{n − 1}}
         = \frac{kn}{N} \sum_{l=0}^{n−1} \frac{\binom{k − 1}{l} \binom{N − 1 − (k − 1)}{n − 1 − l}}{\binom{N − 1}{n − 1}}   (l = x − 1)
         = \frac{kn}{N},

since the last sum adds the hypergeometric pmf with parameters k − 1, N − 1 and n − 1 over its full range.

Now, consider

    Var(X) = \sum_{x=0}^{n} \left(x − \frac{kn}{N}\right)^2 \frac{\binom{k}{x} \binom{N − k}{n − x}}{\binom{N}{n}}
           = \sum_{x=0}^{n} x^2 \frac{\binom{k}{x} \binom{N − k}{n − x}}{\binom{N}{n}} − \frac{2nk}{N} \sum_{x=0}^{n} x \frac{\binom{k}{x} \binom{N − k}{n − x}}{\binom{N}{n}} + \frac{n^2 k^2}{N^2}
           = − \frac{n^2 k^2}{N^2} + \sum_{x=0}^{n} x^2 \frac{\binom{k}{x} \binom{N − k}{n − x}}{\binom{N}{n}}   (using the E(X) formula)
           = − \frac{n^2 k^2}{N^2} + \frac{nk}{N} \sum_{x=1}^{n} (x − 1) \frac{\binom{k − 1}{x − 1} \binom{N − k}{n − x}}{\binom{N − 1}{n − 1}} + \frac{nk}{N} \sum_{x=1}^{n} \frac{\binom{k − 1}{x − 1} \binom{N − k}{n − x}}{\binom{N − 1}{n − 1}}
           = − \frac{n^2 k^2}{N^2} + \frac{nk(n − 1)(k − 1)}{N(N − 1)} + \frac{nk}{N}
           = \frac{−n^2 k^2 (N − 1) + N n(n − 1)k(k − 1) + knN(N − 1)}{N^2 (N − 1)}
           = \frac{nk (N^2 + (−k − n)N + nk)}{N^2 (N − 1)}
           = \frac{nk(N − k)(N − n)}{N^2 (N − 1)}
           = \left(\frac{N − n}{N − 1}\right) \frac{kn}{N} \left(1 − \frac{k}{N}\right).

This proves the result.
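Since the hypergeometric pmf has finite support, the theorem can be checked by direct enumeration. The following sketch (ours) does so for one choice of parameters:

```python
import math

# Enumeration check of Theorem 3.8.12 for (N, k, n) = (20, 8, 5).
N, k, n = 20, 8, 5
lo, hi = max(0, n - N + k), min(n, k)

pmf = {x: math.comb(k, x) * math.comb(N - k, n - x) / math.comb(N, n)
       for x in range(lo, hi + 1)}
mean = sum(x * w for x, w in pmf.items())
var = sum((x - mean) ** 2 * w for x, w in pmf.items())

print(mean, var)  # mean = kn/N = 2.0
```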

Theorem 3.8.13
Let X ∼ Hypergeo(k, N, n). If k/N → p as k → ∞ and N → ∞, then

    P(X = x) → \binom{n}{x} p^x (1 − p)^{n−x}.

3.9 Exercises
1. If on an average 1 vessel in every 10 is wrecked, find the probability that out of 5 vessels expected to arrive, at least 4 will arrive safely.
2. In a precision bombing attack, there is a 50% chance that a bomb will strike the target. Two direct hits are required to destroy the target completely. How many bombs must be dropped to give at least a 99% chance of completely destroying the target?
3. Suppose that the average number of telephone calls arriving at a switchboard is 30 calls per hour.
(a) What is the probability that no calls arrive in a 3-minute period?
(b) What is the probability that more than five calls arrive in a 5-minute period?

4. A printed page in a book contains 40 lines and each line has 75 positions, so each page has 3000 positions. A typist makes one error per 600 positions.

(a) What is the distribution of the number of errors per page?

(b) What is the probability that a page has no errors?
(c) What is the probability that a 16-page chapter has no errors?

5. If the mgf of a random variable X is (5 − 4e^t)^{−1}, find P(X = 5 or 6).

6. A vaccine for desensitizing patients to bee stings is to be packed with 3 vials in each box. Each vial is checked for strength before being packed. The probability that a vial meets the specifications is 0.9. Let X denote the number of vials that must be checked to fill a box. Find

(a) the pmf of X.

(b) the mean and variance of X.
(c) the probability that, out of 10 boxes to be filled, exactly three boxes need exactly 3 tests each.

7. If E(X) = 10 and σ_X = 3, can X have a negative binomial distribution?

8. Suppose that X has a binomial distribution with parameters n and p and Y has a negative binomial distribution with parameters r and p. Show that

    F_X(r − 1) = 1 − F_Y(n − r).



Chapter 4

Special Continuous Distributions

A continuous random variable has an uncountable set of possible values, which is known as the range of the random variable. A continuous distribution describes the probabilities of a continuous random variable over its possible values. The measurement of time can be considered as an example of a continuous probability distribution: it can be from one second to one billion seconds, and so on. The area under the curve of the pdf of a continuous random variable is used to calculate its probability. As a result, only ranges of values can have a non-zero probability; the probability that a continuous random variable takes any single value is always zero.

4.1 Continuous Uniform Distribution


The continuous uniform distributions are a family of symmetric probability distributions. Such a distribution describes an experiment where there is an arbitrary outcome that lies between certain bounds. The bounds are defined by the parameters a and b, which are the minimum and maximum values. For example, suppose a student arrives for a particular lecture anywhere between five minutes early and five minutes late. If X denotes the arrival time of the student over this ten-minute window, then X follows a continuous uniform distribution. Note that the density is constant between the parameters a and b, that is,

    f_X(x) = \begin{cases} k, & a ≤ x ≤ b, \\ 0, & otherwise. \end{cases}

Therefore,

    \int_a^b f_X(x) dx = 1 =⇒ k = \frac{1}{b − a}.

The formal definition can be given as follows:

Definition 4.1.1 [Continuous Uniform Distribution or Rectangular Distribution]

A random variable X is said to have the continuous uniform distribution if its pdf is given by

    f_X(x) = \begin{cases} \frac{1}{b − a}, & a ≤ x ≤ b, \\ 0, & otherwise. \end{cases}

[Figure: pdf f_X(x) = 1/(b − a), constant on [a, b].]

Remark 4.1.1. (i) The numbers a and b are called the parameters of the continuous uniform distribution.

(ii) If X follows the continuous uniform distribution with parameters a and b, then it is denoted by X ∼ U(a, b).
Theorem 4.1.1
If a random variable X follows the continuous uniform distribution, then

    E(X) = \frac{a + b}{2},  Var(X) = \frac{(b − a)^2}{12},

    F_X(x) = \begin{cases} 0, & x < a, \\ \frac{x − a}{b − a}, & a ≤ x < b, \\ 1, & x ≥ b \end{cases}   and   M_X(t) = \begin{cases} \frac{e^{tb} − e^{ta}}{t(b − a)}, & t ≠ 0, \\ 1, & t = 0. \end{cases}

Proof. Note that

    E(X) = \int_a^b x f_X(x) dx = \int_a^b \frac{x}{b − a} dx = \frac{a + b}{2}

and

    E(X^2) = \int_a^b x^2 f_X(x) dx = \int_a^b \frac{x^2}{b − a} dx = \frac{a^2 + ab + b^2}{3}.

Therefore,

    Var(X) = E(X^2) − (E(X))^2 = \frac{(b − a)^2}{12}.

The cdf of X is given by

    F_X(x) = \int_a^x f_X(t) dt = \int_a^x \frac{1}{b − a} dt = \frac{x − a}{b − a}, a ≤ x < b.

[Figure: cdf F_X(x), rising linearly from 0 at x = a to 1 at x = b.]

The mgf of X is given by

    M_X(t) = E(e^{tX}) = \int_a^b e^{tx} f_X(x) dx = \int_a^b \frac{e^{tx}}{b − a} dx = \frac{e^{tb} − e^{ta}}{t(b − a)}, for t ≠ 0.

Clearly, M_X(0) = 1. This proves the result.
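The mean and variance can be confirmed by numerical integration. The sketch below (ours) uses a simple midpoint rule; the endpoints a, b and the grid size are arbitrary choices.

```python
# Midpoint-rule check of E(X) = (a+b)/2 and Var(X) = (b-a)^2/12 for U(a, b).
a, b = 2.0, 5.0
N = 100_000
dx = (b - a) / N
f = 1 / (b - a)  # constant density on [a, b]

xs = [a + (i + 0.5) * dx for i in range(N)]  # midpoints of the grid cells
mean = sum(x * f * dx for x in xs)
var = sum((x - mean) ** 2 * f * dx for x in xs)

print(mean, var)  # approximately 3.5 and 0.75
```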

4.2 Exponential Distribution


The exponential distribution is a continuous probability distribution that often concerns the amount of time until a specific event happens. The exponential distribution has the key property of being memoryless. Let {X(t), t ≥ 0} be a Poisson process with rate λ > 0 and let T denote the time of the first occurrence. Then

    P(T > t) = P(X(t) = 0) = \begin{cases} e^{−λt}, & t > 0, \\ 1, & t ≤ 0. \end{cases}

Therefore,

    F_T(t) = P(T ≤ t) = 1 − P(T > t) = \begin{cases} 1 − e^{−λt}, & t > 0, \\ 0, & t ≤ 0. \end{cases}

Hence, the pdf of T is given by

    f_T(t) = λe^{−λt}, t > 0.
The formal definition can be given as follows:

Definition 4.2.2 [Exponential Distribution]
A random variable X is said to have the exponential distribution if its pdf is given by

    f_X(x) = λe^{−λx}, x > 0.

[Figure: pdf f_X(x) for λ = 3, 5 and 7.]

Remark 4.2.1. (i) The number λ is called the parameter of the exponential distribution.

(ii) If X follows the exponential distribution with parameter λ, then it is denoted by X ∼ Exp(λ).


Theorem 4.2.2
If a random variable X follows the exponential distribution, then

    E(X^k) = \frac{k!}{λ^k}  and  M_X(t) = \frac{λ}{λ − t}, t < λ.

Proof. Consider

    E(X^k) = \int_0^∞ x^k f_X(x) dx = λ \int_0^∞ x^k e^{−λx} dx
           = λ \int_0^∞ \frac{u^k}{λ^k} e^{−u} \frac{du}{λ}   (λx = u)
           = \frac{1}{λ^k} \int_0^∞ e^{−u} u^{k+1−1} du
           = \frac{Γ(k + 1)}{λ^k} = \frac{k!}{λ^k}.

Now, consider

    M_X(t) = E(e^{tX}) = \int_0^∞ e^{tx} f_X(x) dx = λ \int_0^∞ e^{−(λ−t)x} dx = \frac{λ}{λ − t}, t < λ.

This proves the result.

Corollary 4.2.1
If a random variable X follows the exponential distribution, then

(a) E(X) = 1/λ.

(b) Var(X) = 1/λ^2.

(c) β₁ = 2 and β₂ = 6.

Proof. From the above theorem, we have

    E(X) = \frac{1}{λ},  E(X^2) = \frac{2}{λ^2},  E(X^3) = \frac{6}{λ^3}  and  E(X^4) = \frac{24}{λ^4}.

Therefore,

    σ^2 = Var(X) = E(X^2) − (E(X))^2 = \frac{1}{λ^2},  β₁ = E\left[\left(\frac{X − µ}{σ}\right)^3\right] = 2

and

    β₂ = E\left[\left(\frac{X − µ}{σ}\right)^4\right] − 3 = 9 − 3 = 6.

Remark 4.2.2. The exponential distribution is always positively skewed and leptokurtic.
Example 4.2.1. Show that the exponential distribution satisfies the memoryless property.
Solution. Let X ∼ Exp(λ). Then

    f_X(x) = λe^{−λx}, x > 0.

Note that

    P(X > n) = \int_n^∞ λe^{−λx} dx = \left[−e^{−λx}\right]_n^∞ = e^{−λn}.

Therefore,

    P(X > n + m | X > m) = \frac{P(X > n + m, X > m)}{P(X > m)} = \frac{P(X > n + m)}{P(X > m)} = \frac{e^{−λ(n+m)}}{e^{−λm}} = e^{−λn} = P(X > n).

[Figure: pdf f_X(x) of X and of the conditional random variable X | X > 4, illustrating the memoryless property.]

4.3 Gamma Distribution


The gamma distribution is one of the distributions widely used in the fields of business, science and engineering, in order to model continuous variables that are positive and have a skewed distribution. It generalizes the exponential distribution by considering the rth occurrence instead of the first occurrence. Let {X(t), t ≥ 0} be a Poisson process with rate λ > 0 and let T_r denote the time of the rth occurrence. Then

    P(T_r > t) = \begin{cases} P(X(t) ≤ r − 1), & t > 0, \\ 1, & t ≤ 0 \end{cases} = \begin{cases} \sum_{j=0}^{r−1} \frac{e^{−λt} (λt)^j}{j!}, & t > 0, \\ 1, & t ≤ 0. \end{cases}

Therefore,

    F_{T_r}(t) = P(T_r ≤ t) = 1 − P(T_r > t) = \begin{cases} 0, & t ≤ 0, \\ 1 − \sum_{j=0}^{r−1} \frac{e^{−λt} (λt)^j}{j!}, & t > 0. \end{cases}

Hence, the pdf of T_r is given by

    f_{T_r}(t) = \frac{d}{dt} F_{T_r}(t) = − \frac{d}{dt}\left(e^{−λt} + (λt)e^{−λt} + · · · + \frac{e^{−λt} (λt)^{r−1}}{(r − 1)!}\right)
               = λe^{−λt} − λe^{−λt} + λ^2 t e^{−λt} − λ^2 t e^{−λt} + · · · + \frac{λ^r t^{r−1} e^{−λt}}{(r − 1)!}
               = \frac{λ^r t^{r−1} e^{−λt}}{(r − 1)!} = \frac{λ^r}{Γ(r)} e^{−λt} t^{r−1}, t > 0.

Here, Γ(r) denotes the gamma function, which is defined as

    Γ(r) = \int_0^∞ e^{−t} t^{r−1} dt.

It satisfies the following properties:

(a) Γ(r) = (r − 1)!, for a positive integer r;  (b) Γ(r + 1) = rΓ(r);  (c) Γ(1/2) = √π.

The formal definition can be given as follows:

Definition 4.3.3 [Gamma Distribution]

A random variable X is said to have the gamma distribution if its pdf is given by

    f_X(x) = \frac{λ^r}{Γ(r)} e^{−λx} x^{r−1}, x > 0, where r, λ > 0.

[Figure: pdf f_X(x) for (r, λ) = (1, 2), (3, 4) and (4, 2).]

Remark 4.3.1. (i) The numbers r and λ are called the parameters of the gamma distribution.

(ii) If X follows the gamma distribution with parameters r and λ, then it is denoted by X ∼ G(r, λ).

(iii) If λ = 1, then it is called the standard gamma distribution and is denoted by X ∼ Γ(r).


Theorem 4.3.3
If a random variable X follows the gamma distribution, then

    E(X^k) = \frac{Γ(k + r)}{Γ(r) λ^k}  and  M_X(t) = \left(\frac{λ}{λ − t}\right)^r, t < λ.

Proof. Consider

    E(X^k) = \int_0^∞ x^k f_X(x) dx = \frac{λ^r}{Γ(r)} \int_0^∞ e^{−λx} x^{k+r−1} dx
           = \frac{λ^r}{Γ(r) λ^{k+r}} \int_0^∞ e^{−u} u^{k+r−1} du   (λx = u)
           = \frac{Γ(k + r)}{Γ(r) λ^k}.

Now, consider

    M_X(t) = E(e^{tX}) = \int_0^∞ e^{tx} f_X(x) dx = \frac{λ^r}{Γ(r)} \int_0^∞ e^{−(λ−t)x} x^{r−1} dx
           = \frac{λ^r}{Γ(r)(λ − t)^r} \int_0^∞ e^{−y} y^{r−1} dy   ((λ − t)x = y)
           = \frac{λ^r Γ(r)}{Γ(r)(λ − t)^r} = \left(\frac{λ}{λ − t}\right)^r, t < λ.

This proves the result.

Corollary 4.3.2
If a random variable X follows the gamma distribution, then

(a) E(X) = r/λ.

(b) Var(X) = r/λ^2.

(c) β₁ = 2/√r and β₂ = 6/r.

Proof. From the above theorem, we have

    E(X) = \frac{r}{λ},  E(X^2) = \frac{r(r + 1)}{λ^2},  E(X^3) = \frac{r(r + 1)(r + 2)}{λ^3}

and

    E(X^4) = \frac{r(r + 1)(r + 2)(r + 3)}{λ^4}.

Therefore,

    σ^2 = Var(X) = E(X^2) − (E(X))^2 = \frac{r}{λ^2},  β₁ = E\left[\left(\frac{X − µ}{σ}\right)^3\right] = \frac{2}{\sqrt{r}}

and

    β₂ = E\left[\left(\frac{X − µ}{σ}\right)^4\right] − 3 = \frac{6}{r}.

Remark 4.3.2. (i) The gamma distribution is always positively skewed and leptokurtic.

(ii) If r is a positive integer, then it is called the Erlang distribution.

4.4 Beta Distribution


The beta distribution is a family of continuous probability distributions defined on the interval [0, 1], having two positive shape parameters, denoted by α and β. These two parameters appear as exponents of the random variable and control the shape of the distribution. The formal definition can be given as follows:

Definition 4.4.4 [Beta Distribution]

A random variable X is said to have the beta distribution if its pdf is given by

    f_X(x) = \frac{1}{B(α, β)} x^{α−1} (1 − x)^{β−1}, 0 < x < 1 and α, β > 0.

[Figure: pdfs f_X(x) for various shape parameters: (a) α = 1, β = 1; (b) α = β < 1; (c) α = β > 1; (d) α < 1, β = 1 and α = 1, β < 1; (e) α > β > 1; (f) β > α > 1.]
Here, B(α, β) denotes the beta function defined by

    B(α, β) = \int_0^1 x^{α−1} (1 − x)^{β−1} dx.

The following is the relation between the gamma and beta functions:

    B(α, β) = \frac{Γ(α)Γ(β)}{Γ(α + β)}.

Remark 4.4.1. (i) The numbers α and β are called the parameters of the beta distribution.

(ii) If X follows the beta distribution with parameters α and β, then it is denoted by X ∼ B(α, β).
Theorem 4.4.4
If a random variable X follows the beta distribution, then

    E(X^k) = \frac{B(α + k, β)}{B(α, β)}.

Proof. Consider

    E(X^k) = \frac{1}{B(α, β)} \int_0^1 x^{α+k−1} (1 − x)^{β−1} dx = \frac{B(α + k, β)}{B(α, β)}.

Corollary 4.4.3
If a random variable X follows the beta distribution, then

(a) E(X) = \frac{α}{α + β}.

(b) Var(X) = \frac{αβ}{(α + β)^2 (α + β + 1)}.

Proof. From the above theorem, we have

    E(X) = \frac{α}{α + β}  and  E(X^2) = \frac{α(α + 1)}{(α + β)(α + β + 1)}.

Therefore,

    Var(X) = E(X^2) − (E(X))^2 = \frac{αβ}{(α + β)^2 (α + β + 1)}.

This proves the result.

4.5 Normal/Gaussian Distribution


The Normal Distribution, also called the Gaussian Distribution, is the most significant continuous prob-
ability distribution. Sometimes it is also called a bell curve. A large number of random variables are
either nearly or exactly represented by the normal distribution, in every physical science and economics.
Furthermore, it can be used to approximate other probability distributions. The famous examples of the normal distribution are heights of people, weights of newborn babies and heights of watermelons, among many others. The formal definition can be given as follows:

Definition 4.5.5 [Normal Distribution]

A continuous random variable X is said to have a normal distribution with mean µ and variance σ^2 if it has pdf given by

    f_X(x) = \frac{1}{\sqrt{2π} σ} e^{−\frac{1}{2}\left(\frac{x − µ}{σ}\right)^2}, −∞ < x, µ < ∞ and σ > 0.

Remark 4.5.1. (i) The numbers µ and σ 2 are called the parameters of normal distribution.

(ii) If X follows normal distribution with parameters µ and σ 2 then it is denoted by X ∼ N (µ, σ 2 ).

(iii) The points µ ± σ are the point of inflection for fX (·).

(iv) The graph of fX (·) is bell-shaped and symmetric about x = µ.

[Figure: bell-shaped normal pdf, symmetric about x = µ.]
Example 4.5.1. Show that \int_{−∞}^{∞} f_X(x) dx = 1.
Solution. Consider

    \int_{−∞}^{∞} f_X(x) dx = \int_{−∞}^{∞} \frac{1}{\sqrt{2π} σ} e^{−\frac{1}{2}\left(\frac{x − µ}{σ}\right)^2} dx
        = \frac{1}{\sqrt{2π}} \int_{−∞}^{∞} e^{−z^2/2} dz   \left(\frac{x − µ}{σ} = z\right)
        = \frac{2}{\sqrt{2π}} \int_0^∞ e^{−z^2/2} dz
        = \frac{1}{\sqrt{π}} \int_0^∞ e^{−t} t^{\frac{1}{2} − 1} dt   \left(\frac{z^2}{2} = t\right)
        = \frac{Γ(1/2)}{\sqrt{π}} = 1.

Theorem 4.5.5
If a random variable X follows the normal distribution, then

    E(X − µ)^k = \begin{cases} 0, & if k is odd, \\ σ^{2m} (2m − 1)(2m − 3) \cdots 5 \cdot 3 \cdot 1, & if k = 2m, m = 1, 2, . . .. \end{cases}

Proof. Consider

    E\left[\left(\frac{X − µ}{σ}\right)^k\right] = \int_{−∞}^{∞} \left(\frac{x − µ}{σ}\right)^k f_X(x) dx
        = \frac{1}{\sqrt{2π} σ} \int_{−∞}^{∞} \left(\frac{x − µ}{σ}\right)^k e^{−\frac{1}{2}\left(\frac{x − µ}{σ}\right)^2} dx
        = \frac{1}{\sqrt{2π}} \int_{−∞}^{∞} z^k e^{−z^2/2} dz   \left(\frac{x − µ}{σ} = z\right)
        = 0,

if k is odd, since the integrand is an odd function.

Next, suppose k is even; then k = 2m, for m = 1, 2, . . .. This implies

    E\left[\left(\frac{X − µ}{σ}\right)^{2m}\right] = \frac{1}{\sqrt{2π}} \int_{−∞}^{∞} z^{2m} e^{−z^2/2} dz
        = \frac{2}{\sqrt{2π}} \int_0^∞ z^{2m} e^{−z^2/2} dz
        = \frac{2}{\sqrt{2π}} \int_0^∞ (2t)^m e^{−t} \frac{dt}{\sqrt{2t}}   \left(\frac{z^2}{2} = t\right)
        = \frac{2^m}{\sqrt{π}} \int_0^∞ t^{m − \frac{1}{2}} e^{−t} dt
        = \frac{2^m}{\sqrt{π}} Γ\left(m + \frac{1}{2}\right)
        = \frac{2^m}{\sqrt{π}} \left(m − \frac{1}{2}\right)\left(m − \frac{3}{2}\right) \cdots \frac{3}{2} \cdot \frac{1}{2} \cdot \sqrt{π}
        = (2m − 1)(2m − 3) \cdots 5 \cdot 3 \cdot 1.

Multiplying by σ^{2m} gives E(X − µ)^{2m} = σ^{2m} (2m − 1)(2m − 3) \cdots 5 \cdot 3 \cdot 1.

This proves the result.

Corollary 4.5.4
If a random variable X follows the normal distribution, then

(a) E(X) = µ.

(b) Var(X) = σ^2.

(c) β₁ = β₂ = 0.

Proof. From the above theorem at k = 1, we have

    E(X − µ) = 0 =⇒ E(X) = µ,

and, for k = 2,

    Var(X) = E(X − µ)^2 = σ^2.

Also, for k = 3 and k = 4, we have

    β₁ = E\left[\left(\frac{X − µ}{σ}\right)^3\right] = 0

and

    β₂ = E\left[\left(\frac{X − µ}{σ}\right)^4\right] − 3 = 3 − 3 = 0.

This proves the result.

Theorem 4.5.6
If a random variable X follows the normal distribution, then

    M_X(t) = e^{µt + \frac{1}{2} σ^2 t^2}.

Proof. Note that

    M_X(t) = E(e^{tX}) = \int_{−∞}^{∞} e^{tx} f_X(x) dx
        = \frac{1}{\sqrt{2π} σ} \int_{−∞}^{∞} e^{tx} e^{−\frac{1}{2}\left(\frac{x − µ}{σ}\right)^2} dx
        = \frac{1}{\sqrt{2π}} \int_{−∞}^{∞} e^{t(µ + σz)} e^{−z^2/2} dz   \left(\frac{x − µ}{σ} = z\right)
        = \frac{e^{µt + \frac{1}{2} σ^2 t^2}}{\sqrt{2π}} \int_{−∞}^{∞} e^{−\frac{1}{2}(z^2 + σ^2 t^2 − 2σtz)} dz
        = e^{µt + \frac{1}{2} σ^2 t^2} \cdot \frac{1}{\sqrt{2π}} \int_{−∞}^{∞} e^{−\frac{1}{2}(z − σt)^2} dz
        = e^{µt + \frac{1}{2} σ^2 t^2},

since the last integral equals 1 (it is the integral of the N(σt, 1) pdf).

It will be proved in the next chapter that if X ∼ N(µ, σ^2), then

    aX + b ∼ N(aµ + b, a^2 σ^2),

for a ≠ 0 and b ∈ R. Therefore,

    Z = \frac{X − µ}{σ} ∼ N(0, 1).

This is called the standard normal random variable. The pdf of Z is given by

    φ(z) = \frac{1}{\sqrt{2π}} e^{−z^2/2}, −∞ < z < ∞.

The cdf of Z is given by

    Φ(z) = P(Z ≤ z) = \int_{−∞}^{z} \frac{1}{\sqrt{2π}} e^{−t^2/2} dt.

A table of Φ(z) for different values of z is available to calculate the probabilities.

[Standard normal distribution table of Φ(z).]
Sometimes the table may be given for \int_0^z \frac{1}{\sqrt{2π}} e^{−t^2/2} dt. In this situation, the student must be careful and add 0.5, as the standard normal distribution is symmetric about 0. Also, observe that

    Φ(z) = 1 − Φ(−z).

A direct consequence of the above result is Φ(0) = 1/2.


In practice, if X ∼ N(µ, σ^2), then

    P(a < X ≤ b) = P(X ≤ b) − P(X ≤ a)
                 = P\left(\frac{X − µ}{σ} ≤ \frac{b − µ}{σ}\right) − P\left(\frac{X − µ}{σ} ≤ \frac{a − µ}{σ}\right)
                 = P\left(Z ≤ \frac{b − µ}{σ}\right) − P\left(Z ≤ \frac{a − µ}{σ}\right)
                 = Φ\left(\frac{b − µ}{σ}\right) − Φ\left(\frac{a − µ}{σ}\right)

and

    P(X ≤ a) = P\left(\frac{X − µ}{σ} ≤ \frac{a − µ}{σ}\right) = P\left(Z ≤ \frac{a − µ}{σ}\right) = Φ\left(\frac{a − µ}{σ}\right).

Example 4.5.2. If X is normally distributed with mean 2 and variance 1, find P(|X − 2| < 1).
Solution. Note that

    P(|X − 2| < 1) = P(1 < X < 3)
                   = P\left(\frac{1 − 2}{1} < \frac{X − 2}{1} < \frac{3 − 2}{1}\right)
                   = P(−1 < Z < 1)   (where Z ∼ N(0, 1))
                   = P(Z < 1) − P(Z ≤ −1)
                   = Φ(1) − Φ(−1)
                   = 0.84134 − 0.15866
                   = 0.68268

or

    P(|X − 2| < 1) = Φ(1) − Φ(−1) = Φ(1) − [1 − Φ(1)] = 2Φ(1) − 1 = 2 × 0.84134 − 1 = 0.68268.
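Instead of a printed table, Φ(z) can be computed from the error function, since Φ(z) = (1 + erf(z/√2))/2. The following sketch (ours) re-checks this example:

```python
import math

# Phi via the error function: Phi(z) = (1 + erf(z / sqrt(2))) / 2.
def Phi(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

prob = 2 * Phi(1) - 1  # P(|X - 2| < 1) for X ~ N(2, 1)
print(round(prob, 5))  # 0.68269
```

The answer agrees with the table-based value 0.68268 up to rounding.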

Example 4.5.3. If X is normally distributed with mean 11 and standard deviation 1.5, find the number x₀ such that

    P(X > x₀) = 0.3.

Solution. Consider

    P(X > x₀) = 0.3
    =⇒ 1 − P(X ≤ x₀) = 0.3
    =⇒ P(X ≤ x₀) = 0.7
    =⇒ P\left(\frac{X − 11}{1.5} ≤ \frac{x₀ − 11}{1.5}\right) = 0.7
    =⇒ P\left(Z ≤ \frac{x₀ − 11}{1.5}\right) = 0.7.

From the standard normal tables, we have

    \frac{x₀ − 11}{1.5} = 0.525 =⇒ x₀ ≈ 11.79.
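The table lookup can be replaced by numerically inverting Φ. The sketch below (ours) solves Φ(z) = 0.7 by bisection, using the error-function form of Φ:

```python
import math

# Solve Phi(z) = 0.7 by bisection, then un-standardize with mu = 11, sigma = 1.5.
def Phi(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

lo, hi = -10.0, 10.0
while hi - lo > 1e-10:       # Phi is increasing, so bisection applies
    mid = (lo + hi) / 2
    if Phi(mid) < 0.7:
        lo = mid
    else:
        hi = mid

z = (lo + hi) / 2            # approximately 0.5244
x0 = 11 + 1.5 * z
print(round(x0, 2))          # approximately 11.79
```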

4.6 Cauchy Distribution


The Cauchy distribution, also known as the Cauchy–Lorentz distribution, is a continuous distribution with two parameters. The formal definition can be given as follows:

Definition 4.6.6 [Cauchy Distribution]

A continuous random variable X is said to have a Cauchy distribution if it has pdf given by

    f_X(x) = \frac{λ}{π(λ^2 + (x − µ)^2)}, −∞ < x, µ < ∞ and λ > 0.

Remark 4.6.1. The numbers λ and µ are called the parameters of the Cauchy distribution.
Example 4.6.1. Find the cdf of the Cauchy distribution.
Solution. The cdf of the Cauchy distribution is given by

    F_X(x) = \int_{−∞}^{x} f_X(u) du = \frac{1}{πλ} \int_{−∞}^{x} \frac{du}{1 + \left(\frac{u − µ}{λ}\right)^2}
           = \frac{1}{π} \int_{−∞}^{\frac{x − µ}{λ}} \frac{dt}{1 + t^2}   \left(t = \frac{u − µ}{λ}\right)
           = \frac{1}{π} \left[\tan^{−1}(t)\right]_{−∞}^{\frac{x − µ}{λ}}
           = \frac{1}{π} \left(\tan^{−1}\left(\frac{x − µ}{λ}\right) + \frac{π}{2}\right)
           = \frac{1}{2} + \frac{1}{π} \tan^{−1}\left(\frac{x − µ}{λ}\right).

Remark 4.6.2. (i) Using the above example, the quantiles of the Cauchy distribution can be easily calculated.

(ii) It can be verified that for the Cauchy distribution, the mean, variance and higher moments do not exist.

4.7 Exercises
1. If X is uniformly distributed over [1, 2], find z so that

    P(X > z + µ) = \frac{1}{4}.

2. If X ∼ U(−1, 3) and Y ∼ Exp(λ), find λ such that σ_X^2 = σ_Y^2.

3. Let X ∼ B(α, β). Then find E(1/X). Can E(1/X) = 1?

4. Let X ∼ B(α, β) and γ = α + β. Show that if α > 2 and β > 0, then

    E\left(\frac{1}{X}\right) = \frac{γ − 1}{α − 1}  and  Var\left(\frac{1}{X}\right) = \frac{β(γ − 1)}{(α − 1)^2 (α − 2)}.

5. If X is a normal variable with mean 1 and standard deviation 3, find

(a) P(3.42 ≤ X ≤ 6.19).

(b) P(−1.43 ≤ X ≤ 6.19).

6. Assume a normal distribution with N = 1000, µ = 80 and σ = 15. Then

(a) How many observations may be expected to lie between 65 and 100?
(b) Find the value of the variate beyond which 10% of the items would lie.

7. In an exactly normal distribution, 7% of the items are under 35 and 89% are under 63. What are the mean and standard deviation of the distribution?

8. Find x₀ such that P(X > x₀) = 0.09, where X ∼ N(11, 1.5^2).

9. Show that for a normal distribution, the mean, median and mode coincide.



Chapter 5

Function of Random Variables and Its


Distribution

Let X be a random variable, discrete or continuous, and let g : R → R, which we think of as a transformation. For example, X could be the height of a randomly chosen person in a given population in inches, and g could be a function which transforms inches to centimetres, that is, g(x) = 2.54 × x. Then Y = g(X) is also a random variable, but its distribution (pmf or pdf), mean, variance, etc. will differ from those of X. Transformations of random variables play a central role in statistics, and we will learn how to work with them in this chapter.

5.1 Some Theoretical Results


We first answer the question “when a function of random variable is a random variable” in the following
theorem.
Theorem 5.1.1
Let X be a random variable defined on a probability space (Ω, F , P) and g : R → R be a
measurable function. Then g(X) is also a random variable.

Proof. Note that

    {ω : g(X(ω)) ≤ y} = {ω : X(ω) ∈ g^{−1}((−∞, y])}.

Since g is a measurable function, the set g^{−1}((−∞, y]) is a measurable (Borel) set. Now, since X is a random variable, {ω : X(ω) ∈ g^{−1}((−∞, y])} is also measurable.
Remark 5.1.1. Let X be a random variable and g : R → R be a continuous function; then g(X) is a random variable (since every continuous function is measurable).

Theorem 5.1.2
Given a random variable X with cdf F_X(·), the distribution of the random variable Y = g(X), where g is measurable, is determined.

Proof. The cdf of Y is given by

    G_Y(y) = P(Y ≤ y) = P(g(X) ≤ y) = P(X ∈ g^{−1}((−∞, y])).

Since g is measurable, g^{−1}((−∞, y]) is a measurable set. Now, since the distribution of X is well-defined, G_Y(y) is also well-defined.
Example 5.1.1. Let Y = aX + b, a ≠ 0 and b ∈ R. Then

    F_Y(y) = P(Y ≤ y) = P(aX + b ≤ y)
           = \begin{cases} P\left(X ≤ \frac{y − b}{a}\right), & if a > 0, \\ P\left(X ≥ \frac{y − b}{a}\right), & if a < 0 \end{cases}
           = \begin{cases} F_X\left(\frac{y − b}{a}\right), & if a > 0, \\ 1 − F_X\left(\frac{y − b}{a}\right) + P\left(X = \frac{y − b}{a}\right), & if a < 0. \end{cases}

5.2 Approaches to Find the Distribution of Y = g(X)


There are mainly three approaches to find the distribution of Y = g(X).
(a) PMF or PDF approach
(b) CDF approach
(c) MGF approach

5.2.1 PMF Approach


Let X be a discrete random variable and g : R → R be a function. Assume Y = g(X) is a random
variable. Then

P(Y = y) = P(g(X) = y) = P(X ∈ g⁻¹({y})) = Σ_{x : g(x) = y} P(X = x).

Example 5.2.1. Let X be a random variable with pmf

P(X = −2) = 1/5, P(X = −1) = 1/6, P(X = 0) = 1/5,
P(X = 1) = 1/15 and P(X = 2) = 11/30.

Find the distribution of Y = X².
Solution. Note that X can take values {−2, −1, 0, 1, 2} and therefore Y = X² can take values {0, 1, 4}.

Consider

P(Y = 0) = P(X² = 0) = P(X = 0) = 1/5,
P(Y = 1) = P(X² = 1) = P(X = −1) + P(X = 1) = 1/6 + 1/15 = 7/30,
P(Y = 4) = P(X² = 4) = P(X = −2) + P(X = 2) = 1/5 + 11/30 = 17/30.
Remark 5.2.1. Note that once we know the pmf of Y = g(X), we can find any characteristic of
Y , such as its mean, variance, moments, mode and median.
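The pmf approach above can be sketched in a few lines of code; the helper name `pmf_of_g` is ours, not from the notes, and the pmf is the one from Example 5.2.1.

```python
from collections import defaultdict
from fractions import Fraction

def pmf_of_g(pmf, g):
    """Push a discrete pmf through a map g: collect P(X = x) over {x : g(x) = y}."""
    out = defaultdict(Fraction)  # Fraction() is 0, so += accumulates mass per value of y
    for x, p in pmf.items():
        out[g(x)] += p
    return dict(out)

# pmf of X from Example 5.2.1
pmf_X = {-2: Fraction(1, 5), -1: Fraction(1, 6), 0: Fraction(1, 5),
         1: Fraction(1, 15), 2: Fraction(11, 30)}

pmf_Y = pmf_of_g(pmf_X, lambda x: x * x)  # distribution of Y = X²
```

Exact rational arithmetic reproduces P(Y = 0) = 1/5, P(Y = 1) = 7/30 and P(Y = 4) = 17/30.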

5.2.2 CDF Approach


The cdf approach is mainly useful for a continuous random variable. Let X be a continuous random
variable and g : R → R be a function. Assume Y = g(X) is a random variable. Then

P(Y ≤ y) = P(g(X) ≤ y) = P(X ≤ g⁻¹(y)) or P(X ≥ g⁻¹(y)),

depending on whether g is increasing or decreasing on the range of X.

Example 5.2.2. Let X ∼ U(−1, 1). Find the distribution (or pdf) of Y = |X|.
Solution. It is known that

fX (x) = { 1/2,  −1 ≤ x ≤ 1,
         { 0,    otherwise.

Consider

FY (y) = P(Y ≤ y) = P(|X| ≤ y)
       = P(−y ≤ X ≤ y) = ∫_{−y}^{y} fX (x) dx
       = ∫_{−y}^{y} (1/2) dx = [x/2]_{−y}^{y}
       = y, 0 ≤ y ≤ 1.

Therefore,

fY (y) = (d/dy) FY (y) = { 1,  0 ≤ y ≤ 1,
                         { 0,  otherwise.

Hence, Y ∼ U(0, 1).
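As a quick sanity check (ours, not part of the notes), a Monte Carlo simulation confirms that Y = |X| for X ∼ U(−1, 1) has cdf FY (y) = y on (0, 1):

```python
import random

random.seed(0)
n = 200_000
# Sample X ~ U(-1, 1) and transform to Y = |X|
ys = [abs(random.uniform(-1.0, 1.0)) for _ in range(n)]

# For Y ~ U(0, 1), the cdf is F(y) = y; compare with the empirical cdf.
for y in (0.25, 0.5, 0.75):
    emp = sum(v <= y for v in ys) / n
    assert abs(emp - y) < 0.01
```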

5.2.3 MGF Approach


It is known that the mgf uniquely determines the distribution of a random variable. Therefore, we
can also determine the distribution of a random variable via its mgf. However, this approach is useful
for special distributions, since we know the exact form of their mgfs.
Example 5.2.3. Let X ∼ N (µ, σ²). Find the distribution of Y = aX + b.

Solution. Given X ∼ N (µ, σ²), therefore,

MX (t) = e^{µt + σ²t²/2}.

Consider

MY (t) = E(e^{tY}) = E(e^{t(aX+b)})
       = e^{bt} E(e^{(at)X}) = e^{bt} MX (at)
       = e^{bt} e^{aµt + a²σ²t²/2}
       = e^{(aµ+b)t + (a²σ²)t²/2}.

Hence, Y ∼ N (aµ + b, a²σ²).
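The conclusion can be checked numerically with the standard library (the parameter values below are our own choices, not from the notes): if Y = aX + b with a > 0, the cdf of the claimed law N(aµ + b, a²σ²) must agree with P(X ≤ (y − b)/a) from Example 5.1.1.

```python
from statistics import NormalDist

mu, sigma, a, b = 2.0, 3.0, 0.5, -1.0   # assumed illustrative values
X = NormalDist(mu, sigma)
Y = NormalDist(a * mu + b, a * sigma)   # claimed law of Y = aX + b (a > 0, so sd = a·σ)

# For a > 0: P(Y ≤ y) = P(X ≤ (y − b)/a)
for y in (-2.0, 0.0, 1.5):
    assert abs(Y.cdf(y) - X.cdf((y - b) / a)) < 1e-12
```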

5.2.4 PDF Approach


We can also directly obtain the pdf for a continuous random variable using the transformation of random
variable. The following result is useful to find the pdf of transformation of random variable.
Theorem 5.2.3
Let X be a continuous random variable with pdf fX (·). Let y = g(x) be differentiable for all x
and either g′(x) > 0 for all x, or g′(x) < 0 for all x. Then Y = g(X) is a continuous random
variable with pdf

fY (y) = { |(d/dy) g⁻¹(y)| fX (g⁻¹(y)),  α < y < β,
         { 0,                            otherwise,

where α = min{g(−∞), g(∞)} and β = max{g(−∞), g(∞)}.

Proof. Let g′(x) > 0 for all x. Then g is strictly increasing, hence one-one, and g⁻¹ is strictly increasing,
that is, (d/dy) g⁻¹(y) > 0. Therefore,

FY (y) = P(Y ≤ y) = P(g(X) ≤ y) = P(X ≤ g⁻¹(y)) = FX (g⁻¹(y)).

So the pdf of Y is

fY (y) = (d/dy) g⁻¹(y) · fX (g⁻¹(y)).

In case g′(x) < 0 for all x, then g is strictly decreasing and g⁻¹ will also be strictly decreasing, that is,
(d/dy) g⁻¹(y) < 0. Therefore,

FY (y) = P(Y ≤ y) = P(g(X) ≤ y) = P(X ≥ g⁻¹(y))
       = 1 − FX (g⁻¹(y)) + P(X = g⁻¹(y)).

Since X is continuous, P(X = g⁻¹(y)) = 0, and differentiating gives

fY (y) = −fX (g⁻¹(y)) (d/dy) g⁻¹(y).

Since fX (·) ≥ 0, in both cases we have

fY (y) = |(d/dy) g⁻¹(y)| fX (g⁻¹(y)).

This proves the result.


Remark 5.2.2. Note that α and β are just written to represent the range of the random variable Y ;
however, they have to be carefully computed while dealing with examples.
Example 5.2.4. Let X ∼ U(0, 1). Find the pdf of Y = X/(1 + X).
Solution. Given

Y = g(X) = X/(1 + X).

Therefore, g′(x) = 1/(1 + x)² > 0. That is, g is a strictly increasing function. Also, note that

x = g⁻¹(y) = y/(1 − y)
=⇒ (d/dy) g⁻¹(y) = 1/(1 − y)².

Note that

0 ≤ x ≤ 1
=⇒ 0 ≤ y/(1 − y) ≤ 1
=⇒ y ≥ 0 and y ≤ 1 − y
=⇒ 0 ≤ y ≤ 1/2.

Therefore,

fX (g⁻¹(y)) = 1, for 0 ≤ y ≤ 1/2.

Hence, by the above theorem,

fY (y) = |(d/dy) g⁻¹(y)| fX (g⁻¹(y))
       = { 1/(1 − y)²,  0 ≤ y ≤ 1/2,
         { 0,           otherwise.
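The resulting density integrates to FY (y) = y/(1 − y) on (0, 1/2), which can be checked by simulation (our check, not from the notes):

```python
import random

random.seed(1)
n = 300_000
# Sample X ~ U(0, 1) and transform to Y = X / (1 + X)
ys = [x / (1.0 + x) for x in (random.random() for _ in range(n))]

# fY(y) = 1/(1−y)² on (0, 1/2) has cdf FY(y) = y/(1 − y); compare empirically.
for y in (0.1, 0.25, 0.4):
    emp = sum(v <= y for v in ys) / n
    assert abs(emp - y / (1.0 - y)) < 0.01
```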

Further, if the function does not satisfy the condition of being monotonically increasing or decreasing,
then the following theorem helps to compute the pdf of the function of a random variable in such situations.

Theorem 5.2.4
Let X be a continuous random variable with pdf fX (·). Let y = g(x) be differentiable for all x,
and assume that g′(x) is continuous and non-zero at all but a finite number of values of x. Then,
for every real number y, either

(a) there exist a positive integer n = n(y) and real numbers (inverses) x1 (y), x2 (y), . . . , xn (y)
such that

g(xk (y)) = y, g′(xk (y)) ≠ 0, k = 1, 2, . . . , n(y), or

(b) there does not exist any x such that g(x) = y, g′(x) ≠ 0, in which case we write n(y) = 0.

Then Y is a continuous random variable with pdf given by

fY (y) = { Σ_{k=1}^{n} fX (xk (y)) / |g′(xk (y))|,  if n > 0,
         { 0,                                      if n = 0.

Example 5.2.5. Let X ∼ U(−1, 1). Find the pdf of Y = |X|.

Solution. Note that

y = g(x) = |x| =⇒ x = g⁻¹(y) = ±y.

Let

g1⁻¹(y) = −y =⇒ (d/dy) g1⁻¹(y) = −1 =⇒ |(d/dy) g1⁻¹(y)| = 1,
g2⁻¹(y) = +y =⇒ (d/dy) g2⁻¹(y) = 1  =⇒ |(d/dy) g2⁻¹(y)| = 1.

It is known that

fX (x) = { 1/2,  −1 ≤ x ≤ 1,
         { 0,    otherwise.

Therefore, using the above theorem, we get

fY (y) = |(d/dy) g1⁻¹(y)| fX (g1⁻¹(y)) + |(d/dy) g2⁻¹(y)| fX (g2⁻¹(y))
       = 1 · fX (−y) + 1 · fX (+y) = 1/2 + 1/2 = 1, 0 ≤ y ≤ 1.

Hence,

fY (y) = { 1,  0 ≤ y ≤ 1,
         { 0,  otherwise.

Hence, Y ∼ U(0, 1).

Example 5.2.6. Let X ∼ N (0, 1). Show that Y = X² ∼ G(1/2, 1/2).

Solution. Note that

y = g(x) = x² =⇒ x = g⁻¹(y) = ±√y.

Let

g1⁻¹(y) = −√y =⇒ (d/dy) g1⁻¹(y) = −1/(2√y) =⇒ |(d/dy) g1⁻¹(y)| = 1/(2√y),
g2⁻¹(y) = +√y =⇒ (d/dy) g2⁻¹(y) = 1/(2√y)  =⇒ |(d/dy) g2⁻¹(y)| = 1/(2√y).

It is known that

fX (x) = (1/√(2π)) e^{−x²/2}, −∞ < x < ∞.

Therefore, we get

fY (y) = |(d/dy) g1⁻¹(y)| fX (g1⁻¹(y)) + |(d/dy) g2⁻¹(y)| fX (g2⁻¹(y))
       = (1/(2√y)) fX (−√y) + (1/(2√y)) fX (+√y)
       = (1/(2√y)) (1/√(2π)) e^{−y/2} + (1/(2√y)) (1/√(2π)) e^{−y/2}
       = (1/(2^{1/2} Γ(1/2))) e^{−y/2} y^{1/2 − 1}, y > 0.

Hence, Y ∼ G(1/2, 1/2).
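Since G(1/2, 1/2) is the chi-square distribution with one degree of freedom, whose cdf is erf(√(y/2)), the claim can be verified deterministically with the standard library (our check, not from the notes):

```python
from math import erf, sqrt
from statistics import NormalDist

X = NormalDist(0.0, 1.0)

# For Y = X²: FY(y) = P(−√y ≤ X ≤ √y) = 2Φ(√y) − 1,
# while the G(1/2, 1/2) (chi-square with 1 df) cdf is erf(√(y/2)).
for y in (0.5, 1.0, 2.0, 4.0):
    assert abs((2 * X.cdf(sqrt(y)) - 1) - erf(sqrt(y / 2))) < 1e-12
```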

5.3 Exercises
1. Let X be a random variable having pmf

P(X = −1) = 1/4, P(X = 0) = 1/2, P(X = 1) = 1/4.

Find the distribution of Y = |X| and Y = X².
2. If X is binomially distributed with parameters n and p, what is the distribution of Y = n − X?
3. Let X ∼ U(0, 1). Show that Y = −(1/λ) ln(FX (X)) ∼ Exp(λ).
4. Show that if X ∼ B(α, β) then 1 − X ∼ B(β, α).
5. Give an example when X and −X have same distribution.
6. If X ∼ N (2, 9), find the pdf of Y = (1/2)X − 1.
7. If X ∼ N (µ, σ 2 ), find the distribution of Y = eX .



Chapter 6

Random Vector and its Joint Distribution

In real life, we are often interested in several random variables that are related to each other. For
instance, suppose a student has five courses in one semester and let
Xi = marks of the student in course i ∈ {0, 1, 2, . . . , 100}, for i = 1, 2, 3, 4, 5.
Then, the joint behaviour of the marks can be described using the joint distribution of (X1 , X2 , X3 , X4 , X5 ).
In particular, suppose the passing mark is 30 in each course; then the probability that the student passes the
semester is P(X1 ≥ 30, . . . , X5 ≥ 30). Therefore, if we know the distribution of (X1 , X2 , X3 , X4 , X5 ),
then we can easily compute such probabilities.
In this chapter, we will focus on two random variables, but once you understand the theory for two
random variables, the extension to n random variables is straightforward. We will first discuss joint
distributions of discrete random variables and then extend the results to continuous random variables.

6.1 Joint Cumulative Distribution Function


The joint distribution function is a function that completely characterizes the probability distribution of
a random vector. The formal definition can be given as follows:
Definition 6.1.1 [Joint Cumulative Distribution Function]
Let X and Y be two random variables. Then the joint cdf of (X, Y ) is given by

FX,Y (x, y) = P(X ≤ x, Y ≤ y),

where x and y are two real numbers.

[Figure: FX,Y (x, y) accumulates the probability over the region {(r, s) : r ≤ x, s ≤ y}.]
Remark 6.1.1. Note that if we are dealing with joint distributions then a comma means intersection.
Note that

P(a < X ≤ b, c < Y ≤ d) = FX,Y (b, d) − FX,Y (a, d) − FX,Y (b, c) + FX,Y (a, c).

Properties of Joint cdf:

1. lim_{y→∞} FX,Y (x, y) = FX (x) (marginal cdf of X).

2. lim_{x→∞} FX,Y (x, y) = FY (y) (marginal cdf of Y ).

3. lim_{x→−∞} FX,Y (x, y) = 0 = lim_{y→−∞} FX,Y (x, y).

4. lim_{x→∞, y→∞} FX,Y (x, y) = 1.

5. lim_{h→0⁺} FX,Y (x + h, y) = FX,Y (x, y) and lim_{k→0⁺} FX,Y (x, y + k) = FX,Y (x, y).

6. If x1 < x2 then FX,Y (x1 , y) ≤ FX,Y (x2 , y). Also, if y1 < y2 then FX,Y (x, y1 ) ≤ FX,Y (x, y2 ).

6.2 Joint Discrete Random Variables


Let X and Y be two random variables. The probability distribution that defines their simultaneous
behavior is referred to as a joint probability distribution. The two random variables X and Y are then
called jointly distributed random variables. Here, we consider the case when both X and Y are discrete.

Definition 6.2.2 [Joint Probability Mass Function]
Let X and Y be two discrete random variables defined on the probability space (Ω, F , P) that take
values {x1 , x2 , . . .} and {y1 , y2 , . . .}, respectively. Then, the joint pmf of X and Y is given by

pX,Y (xi , yj ) = P(X = xi , Y = yj ), i, j = 1, 2, . . . .

[Figure: the joint pmf as a 3-D stem plot, with mass P(X = xi , Y = yj ) at each point (xi , yj ).]

The joint pmf of (X, Y ) satisfies

(a) pX,Y (xi , yj ) = P(X = xi , Y = yj ) ≥ 0, for all i, j = 1, 2, . . . .

(b) Σ_{i=1}^{∞} Σ_{j=1}^{∞} pX,Y (xi , yj ) = 1.

Definition 6.2.3 [Joint cdf in Discrete Case]

The joint cdf of (X, Y ) is given by

FX,Y (x, y) = P(X ≤ x, Y ≤ y) = Σ_{r ≤ x} Σ_{s ≤ y} pX,Y (r, s).

Definition 6.2.4 [Marginal Distributions]

The marginal pmf of X is defined by

pX (xi ) = Σ_{j} pX,Y (xi , yj )   (or simply pX (x) = Σ_{y} pX,Y (x, y)).

The marginal pmf of Y is defined by

pY (yj ) = Σ_{i} pX,Y (xi , yj )   (or simply pY (y) = Σ_{x} pX,Y (x, y)).

Definition 6.2.5 [Conditional Distributions]
The conditional pmf of X given Y = yj , that is, X | Y = yj , is

pX|Y =yj (xi ) = pX,Y (xi , yj ) / pY (yj ), for all xi , provided pY (yj ) ≠ 0.

The conditional pmf of Y given X = xi , that is, Y | X = xi , is

pY |X=xi (yj ) = pX,Y (xi , yj ) / pX (xi ), for all yj , provided pX (xi ) ≠ 0.

Example 6.2.1. Suppose a car showroom has ten cars of a brand, out of which 5 are good (G), 2 have
defective transmission (DT), and 3 have defective steering (DS). Two cars are selected at random. Let
X denote the number of selected cars with DT and Y denote the number with DS. Find

(a) the joint pmf of (X, Y )

(b) FX,Y (1, 1)

(c) the marginal pmf of X and Y

(d) E(Y | X = 0) and E(X | Y = 0).

Solution. Given that X is the number of cars with DT, it can take values {0, 1, 2}; similarly, Y is
the number of cars with DS and can also take values {0, 1, 2}.

(a) It can be easily verified that

pX,Y (0, 0) = P(X = 0, Y = 0) = C(5, 2)/C(10, 2) = 10/45 = 2/9,
pX,Y (0, 1) = P(X = 0, Y = 1) = C(5, 1)C(3, 1)/C(10, 2) = 15/45 = 1/3.

Similarly, we can find the other probabilities; they are given in the following table:

 X \ Y      0       1       2
   0      10/45   15/45   3/45
   1      10/45    6/45     0
   2       1/45      0      0

[Figure: 3-D bar plot of the joint pmf pX,Y (x, y).]

(b) Consider

FX,Y (1, 1) = P(X ≤ 1, Y ≤ 1)
            = P(X = 0, Y = 0) + P(X = 0, Y = 1) + P(X = 1, Y = 0) + P(X = 1, Y = 1)
            = 10/45 + 15/45 + 10/45 + 6/45
            = 41/45.

(c) Note that

pX (0) = P(X = 0) = Σ_{y=0}^{2} P(X = 0, Y = y)
       = P(X = 0, Y = 0) + P(X = 0, Y = 1) + P(X = 0, Y = 2)
       = 10/45 + 15/45 + 3/45 = 28/45.

Similarly, the other probabilities can be easily computed. We can also use the table to compute the
marginal distributions: the row sums of the probabilities give the pmf of X, and the column sums
give the pmf of Y . The following table gives the marginal distributions of X and Y .

 X \ Y      0       1       2     pX (x)
   0      10/45   15/45   3/45    28/45
   1      10/45    6/45     0     16/45
   2       1/45      0      0      1/45
 pY (y)   21/45   21/45   3/45      1

(d) We first compute the conditional pmf of Y given X = 0. Note that

pY |X=0 (y) = pX,Y (0, y) / pX (0) = (45/28) pX,Y (0, y), y = 0, 1, 2.

Now, we compute the conditional pmf of X given Y = 0. Note that

pX|Y =0 (x) = pX,Y (x, 0) / pY (0) = (45/21) pX,Y (x, 0), x = 0, 1, 2.

Therefore,

E(Y | X = 0) = Σ_{y=0}^{2} y pY |X=0 (y)
             = (45/28) Σ_{y=0}^{2} y pX,Y (0, y)
             = (45/28) [0 · pX,Y (0, 0) + 1 · pX,Y (0, 1) + 2 · pX,Y (0, 2)]
             = (45/28) (21/45) = 3/4

and

E(X | Y = 0) = Σ_{x=0}^{2} x pX|Y =0 (x)
             = (45/21) Σ_{x=0}^{2} x pX,Y (x, 0)
             = (45/21) [0 · pX,Y (0, 0) + 1 · pX,Y (1, 0) + 2 · pX,Y (2, 0)]
             = (45/21) (12/45) = 4/7.

Remark 6.2.1. From the joint pmf, we can compute the marginals and the conditional distributions.
Therefore, other characteristics such as the mean, variance, mode and median can also be calculated for
such distributions.

6.3 Joint Continuous Random Variable


Having considered the discrete case, we now look at joint distributions for continuous random variables.

Definition 6.3.6 [Joint Probability Density Function]


Let X and Y be two continuous random variables defined on the probability space (Ω, F , P). Then,
the joint pdf of X and Y is denoted by fX,Y (x, y) and satisfies the following properties:

(a) fX,Y (x, y) ≥ 0, for all (x, y) ∈ R².

(b) ∫_{−∞}^{∞} ∫_{−∞}^{∞} fX,Y (x, y) dx dy = 1.

In this case, the probability can be defined as follows:
P(a < X < b, c < Y < d) = ∫_a^b ∫_c^d fX,Y (x, y) dy dx,

which is the volume under the surface fX,Y (x, y).


Definition 6.3.7 [Joint cdf in Continuous Case]

The joint cdf of (X, Y ) is given by

FX,Y (x, y) = P(X ≤ x, Y ≤ y) = ∫_{−∞}^{x} ∫_{−∞}^{y} fX,Y (r, s) ds dr.

Definition 6.3.8 [Marginal Distributions]

The marginal pdf of X is defined by

fX (x) = ∫_{−∞}^{∞} fX,Y (x, y) dy.

The marginal pdf of Y is defined by

fY (y) = ∫_{−∞}^{∞} fX,Y (x, y) dx.

Definition 6.3.9 [Conditional Distributions]
The conditional pdf of X given Y = y, that is, X | Y = y, is

fX|Y =y (x) = fX,Y (x, y) / fY (y), for all x, provided fY (y) ≠ 0.

The conditional pdf of Y given X = x, that is, Y | X = x, is

fY |X=x (y) = fX,Y (x, y) / fX (x), for all y, provided fX (x) ≠ 0.

Example 6.3.1. Let the joint pdf of (X, Y ) be

fX,Y (x, y) = { 10xy²,  0 < x < y < 1,
             { 0,       otherwise.

Find

(a) P(0 < X + Y < 1/2)

(b) P(0 < X < 1/2, 1/4 < Y < 3/4)

(c) the marginal density of X and Y

(d) P(X > 1/4) and P(Y > 3/4)

(e) the conditional density of X | Y = y and Y | X = x

(f) P(X < 1/2 | Y = 3/4) and P(Y < 1/2 | X = 1/4).

Solution. Observe that the support of fX,Y (x, y) is the triangle {(x, y) : 0 < x < y < 1}.

(a) Note that

[Figure: the region {0 < x + y < 1/2} intersected with the support {0 < x < y < 1}; the required probability is the volume of fX,Y over this region.]

P(0 < X + Y < 1/2) = ∫_0^{1/4} ∫_x^{1/2 − x} fX,Y (x, y) dy dx
                   = 10 ∫_0^{1/4} ∫_x^{1/2 − x} x y² dy dx
                   = 11/3072.

(b) Splitting the region at x = 1/4 (since we also need x < y), consider

P(0 < X < 1/2, 1/4 < Y < 3/4)
    = ∫_{1/4}^{3/4} ∫_0^{1/4} fX,Y (x, y) dx dy + ∫_{1/4}^{1/2} ∫_x^{3/4} fX,Y (x, y) dy dx
    = 10 ∫_{1/4}^{3/4} ∫_0^{1/4} x y² dx dy + 10 ∫_{1/4}^{1/2} ∫_x^{3/4} x y² dy dx
    = 65/1536 + 343/3072
    = 473/3072.

(c) The marginal distribution of X is given by

fX (x) = ∫_x^1 fX,Y (x, y) dy = ∫_x^1 10xy² dy = [10xy³/3]_x^1 = (10/3) x (1 − x³).

Therefore,

fX (x) = { (10/3) x (1 − x³),  0 < x < 1,
         { 0,                  otherwise.

The marginal distribution of Y is given by

fY (y) = ∫_0^y fX,Y (x, y) dx = ∫_0^y 10xy² dx = [5x²y²]_0^y = 5y⁴.

Therefore,

fY (y) = { 5y⁴,  0 < y < 1,
         { 0,    otherwise.

(d) Note that

P(X > 1/4) = ∫_{1/4}^1 (10/3) x (1 − x³) dx = 459/512

and

P(Y > 3/4) = ∫_{3/4}^1 5y⁴ dy = 1 − (3/4)⁵ = 781/1024.

(e) The conditional distribution of X given Y = y is given by

fX|Y =y (x | y) = fX,Y (x, y) / fY (y) = 10xy² / (5y⁴) = 2x/y².

Therefore,

fX|Y =y (x | y) = { 2x/y²,  0 < x < y, 0 < y < 1,
                 { 0,       otherwise.

The conditional distribution of Y given X = x is given by

fY |X=x (y | x) = fX,Y (x, y) / fX (x) = 10xy² / ((10/3) x (1 − x³)) = 3y²/(1 − x³).

Therefore,

fY |X=x (y | x) = { 3y²/(1 − x³),  0 < x < 1, x < y < 1,
                 { 0,              otherwise.

(f) Note that

fX|Y =3/4 (x) = (32/9) x,   0 < x < 3/4,
fY |X=1/4 (y) = (64/21) y²,  1/4 < y < 1.

Therefore,

P(X < 1/2 | Y = 3/4) = ∫_0^{1/2} (32/9) x dx = 4/9

and

P(Y < 1/2 | X = 1/4) = ∫_{1/4}^{1/2} (64/21) y² dy = 1/9.
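Part (d) can be re-derived with exact antiderivatives (our sketch, not from the notes; the helper names are ours):

```python
from fractions import Fraction as F

# Marginals from Example 6.3.1: fX(x) = (10/3)x(1 − x³) and fY(y) = 5y⁴ on (0, 1).
def P_X_gt(a):
    G = lambda x: F(10, 3) * (x * x / 2 - x ** 5 / 5)  # antiderivative of (10/3)(x − x⁴)
    return G(F(1)) - G(a)

def P_Y_gt(b):
    return 1 - b ** 5                                   # ∫_b^1 5y⁴ dy = 1 − b⁵

p1 = P_X_gt(F(1, 4))   # P(X > 1/4)
p2 = P_Y_gt(F(3, 4))   # P(Y > 3/4)
```

This gives P(X > 1/4) = 459/512 and P(Y > 3/4) = 781/1024 exactly.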

6.4 Independence of Random Variables


The concept of independent random variables is very similar to independent events. Remember, two
events A and B are independent if we have P(A ∩ B) = P(A)P(B). Similarly, we have the following
definition for independent random variables.

Definition 6.4.10 [Independent Random Variables]


We say two random variables X and Y are independently distributed if

FX,Y (x, y) = FX (x)FY (y) for all (x, y) ∈ R2 .

In particular, if X and Y are discrete random variables then X and Y are independent if

pX,Y (x, y) = pX (x)pY (y) for all x, y.

and if X and Y are continuous then X and Y are independent if

fX,Y (x, y) = fX (x)fY (y) for all x, y.

Example 6.4.1. Let X and Y be discrete random variables with joint pmf given by

 Y \ X      0     1    pY (y)
   0       1/4   1/4    1/2
   1       1/4   1/4    1/2
 pX (x)    1/2   1/2     1
Note that

pX,Y (x, y) = pX (x)pY (y) for all x, y = 0, 1.

Hence, X and Y are independent.

6.5 Expectation and Moments
We now look at taking the expectation of jointly distributed random variables. Because expected values
are defined for a single quantity, we will actually define the expected value of a combination of the pair
of random variables, that is, we look at the expected value of a function applied to the random vector
(X, Y ).

Definition 6.5.11 [Expectation of (X, Y )]

Let g(X, Y ) be a random variable. Then the expectation of g(X, Y ) is given by

E(g(X, Y )) = { Σ_x Σ_y g(x, y) pX,Y (x, y),                       if X and Y are discrete,
              { ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x, y) fX,Y (x, y) dx dy,  if X and Y are continuous,

provided the series/integral converges absolutely.

Definition 6.5.12 [Non-central Product Moments]

The (r, s)th non-central product moment is defined as

µ′_{r,s} = E(X^r Y^s).

Definition 6.5.13 [Central Product Moments]

The (r, s)th central product moment is defined as

µ_{r,s} = E((X − µX)^r (Y − µY)^s),

where µX and µY denote the means of X and Y , respectively.

Note that

µ′_{1,0} = E(X) = µX,
µ′_{0,1} = E(Y ) = µY,
µ′_{1,1} = E(XY ) = { Σ_x Σ_y xy pX,Y (x, y),                       if X and Y are discrete,
                    { ∫_{−∞}^{∞} ∫_{−∞}^{∞} xy fX,Y (x, y) dx dy,  if X and Y are continuous.

Definition 6.5.14 [Covariance]
The (1, 1)th central product moment is called the covariance between X and Y , and is denoted
by Cov(X, Y ). That is,

Cov(X, Y ) = µ1,1 = E ((X − µX ) (Y − µY )) .

Remark 6.5.1. (i) Note that

µ1,1 = E ((X − µX ) (Y − µY ))
= E (XY − µY X − µX Y + µX µY )
= E(XY ) − µY E(X) − µX E(Y ) + µX µY
= E(XY ) − E(X)E(Y ) − E(X)E(Y ) + E(X)E(Y )
= E(XY ) − E(X)E(Y ).

(ii) Covariance is a measure of relationship between two random variables.


(iii) Cov(X, Y ) > 0 indicates that X and Y tend to move in the same direction. Similarly, Cov(X, Y )
< 0 indicates that X and Y tend to move in opposite directions.

Definition 6.5.15 [Correlation Coefficient]

The correlation coefficient between X and Y is given by

ρX,Y = Cov(X, Y ) / (σX σY),

where σX² = Var(X) and σY² = Var(Y ).

Theorem 6.5.1
The correlation coefficient between two random variables always lies between −1 and 1, that
is, −1 ≤ ρX,Y ≤ 1.

Proof. Consider two random variables U and V such that

E(U ) = 0, E(U ²) = 1, E(V ) = 0, and E(V ²) = 1.

Then,

E(U − V )² ≥ 0 =⇒ E(U ² + V ² − 2U V ) ≥ 0 =⇒ E(U V ) ≤ 1

and

E(U + V )² ≥ 0 =⇒ E(U ² + V ² + 2U V ) ≥ 0 =⇒ E(U V ) ≥ −1.

Therefore,

−1 ≤ E(U V ) ≤ 1.     (6.5.1)

Next, let

U = (X − µX)/σX and V = (Y − µY)/σY,

where µX = E(X), µY = E(Y ), σX² = Var(X) and σY² = Var(Y ). Therefore,

E(U ) = E((X − µX)/σX) = 0,   E(U ²) = (1/σX²) E(X − µX)² = 1,
E(V ) = E((Y − µY)/σY) = 0,   E(V ²) = (1/σY²) E(Y − µY)² = 1.

Hence, from (6.5.1), we have

−1 ≤ E((X − µX)/σX · (Y − µY)/σY) ≤ 1
=⇒ −1 ≤ E((X − µX)(Y − µY)) / (σX σY) ≤ 1
=⇒ −1 ≤ Cov(X, Y ) / (σX σY) ≤ 1.

This proves the result.
Remark 6.5.2. (i) The correlation coefficient is a measure of the linear relationship between two random
variables.

[Figure: scatter plots illustrating (a) ρX,Y > 0, (b) ρX,Y < 0, (c)–(d) ρX,Y = 0, (e) ρX,Y = 1, (f) ρX,Y = −1.]

(ii) If X = aY + b with a > 0, then X and Y are perfectly linearly related in the positive direction and
ρX,Y = 1. Also, if X = aY + b with a < 0, then X and Y are perfectly linearly related in the negative
direction and ρX,Y = −1.

(iii) If ρX,Y = 0 then we say that X and Y are uncorrelated.
Theorem 6.5.2
Let X and Y be independent random variables. Then

E(g1 (X) g2 (Y )) = E(g1 (X)) E(g2 (Y )),

provided the expectations exist.

Proof. We prove the result for continuous random variables; following similar steps, it can be
easily proved for the discrete case. Let X and Y be independent continuous random variables, so that

fX,Y (x, y) = fX (x)fY (y).

Consider

E(g1 (X) g2 (Y )) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g1 (x) g2 (y) fX,Y (x, y) dx dy
                 = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g1 (x) g2 (y) fX (x) fY (y) dx dy
                 = (∫_{−∞}^{∞} g1 (x) fX (x) dx) (∫_{−∞}^{∞} g2 (y) fY (y) dy)
                 = E(g1 (X)) E(g2 (Y )).

This proves the result.


Corollary 6.5.1
Let X and Y be independent random variables. Then

E(X^r Y^s) = E(X^r) E(Y^s),
E((X − µX)^r (Y − µY)^s) = E((X − µX)^r) E((Y − µY)^s).

Remark 6.5.3. In particular, if X and Y are independent then we have

E(XY ) = E(X)E(Y ).

Therefore,

Cov(X, Y ) = E(XY ) − E(X)E(Y ) = 0.

This implies ρX,Y = 0. However, the converse is not true. For example, let X and Y have joint pmf
given by

 X \ Y     −1     0     1    pX (x)
   0        0    1/3    0     1/3
   1       1/3    0    1/3    2/3
 pY (y)    1/3   1/3   1/3     1

Then,

E(X) = Σ_x x pX (x) = 0 · pX (0) + 1 · pX (1) = 0 × 1/3 + 1 × 2/3 = 2/3,
E(Y ) = Σ_y y pY (y) = −1 · pY (−1) + 0 · pY (0) + 1 · pY (1) = −1/3 + 0 + 1/3 = 0,
E(XY ) = Σ_{x=0}^{1} Σ_{y=−1}^{1} xy pX,Y (x, y)
       = (1)(−1) pX,Y (1, −1) + (1)(1) pX,Y (1, 1)
       = −1/3 + 1/3 = 0.

This implies

Cov(X, Y ) = E(XY ) − E(X)E(Y ) = 0 − (2/3) × 0 = 0,

and hence

ρX,Y = Cov(X, Y ) / (σX σY) = 0.

But X and Y are not independent; for example,

pX,Y (0, 0) = 1/3 ≠ 1/9 = pX (0) pY (0).

Now, we move on to define the joint mgf, which generates the product moments of jointly distributed
random variables.
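The counterexample above can be checked mechanically (our sketch, not from the notes; the dictionary `p` encodes the table):

```python
from fractions import Fraction as F

# Joint pmf from Remark 6.5.3: uncorrelated but dependent.
p = {(0, -1): F(0), (0, 0): F(1, 3), (0, 1): F(0),
     (1, -1): F(1, 3), (1, 0): F(0), (1, 1): F(1, 3)}

EX = sum(x * q for (x, y), q in p.items())        # 2/3
EY = sum(y * q for (x, y), q in p.items())        # 0
EXY = sum(x * y * q for (x, y), q in p.items())   # 0

cov = EXY - EX * EY                               # 0: uncorrelated
pX0 = sum(q for (x, y), q in p.items() if x == 0)
pY0 = sum(q for (x, y), q in p.items() if y == 0)
# yet dependent: p(0,0) = 1/3 differs from pX(0)·pY(0) = 1/9
```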

Definition 6.5.16 [Joint Moment Generating Function]
Let X and Y be two random variables. Then the joint mgf of X and Y is given by

MX,Y (s, t) = E(e^{sX+tY}) = { Σ_x Σ_y e^{sx+ty} pX,Y (x, y),                       if X and Y are discrete,
                             { ∫_{−∞}^{∞} ∫_{−∞}^{∞} e^{sx+ty} fX,Y (x, y) dx dy,  if X and Y are continuous,

provided the expectation exists in a neighbourhood of (0, 0).

Remark 6.5.4. Note that the joint mgf generates the product moments as follows:

E(X) = ∂/∂s MX,Y (s, t)|_{s=t=0} ,    E(Y ) = ∂/∂t MX,Y (s, t)|_{s=t=0} ,
E(X²) = ∂²/∂s² MX,Y (s, t)|_{s=t=0} ,  E(Y ²) = ∂²/∂t² MX,Y (s, t)|_{s=t=0} ,
E(XY ) = ∂²/∂s∂t MX,Y (s, t)|_{s=t=0} ,

and so on. In general,

E(X^m Y^n) = ∂^{m+n}/(∂s^m ∂t^n) MX,Y (s, t)|_{s=t=0} .

Theorem 6.5.3
If the random variables X and Y are independent, then

MX,Y (s, t) = MX (s)MY (t), for all s, t.
Proof. We will prove the result for continuous random variables; the discrete case follows by similar
steps. Consider

MX,Y (s, t) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} e^{sx+ty} fX,Y (x, y) dx dy
            = ∫_{−∞}^{∞} ∫_{−∞}^{∞} e^{sx+ty} fX (x) fY (y) dx dy
            = (∫_{−∞}^{∞} e^{sx} fX (x) dx) (∫_{−∞}^{∞} e^{ty} fY (y) dy)
            = MX (s) MY (t).

This proves the result.

Theorem 6.5.4
If the random variables X and Y are independent, then

MX+Y (t) = MX (t)MY (t), for all t.

Proof. We will prove the result for continuous random variables; the discrete case follows by similar
steps. Consider

MX+Y (t) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} e^{t(x+y)} fX,Y (x, y) dx dy
         = ∫_{−∞}^{∞} ∫_{−∞}^{∞} e^{tx} e^{ty} fX (x) fY (y) dx dy
         = (∫_{−∞}^{∞} e^{tx} fX (x) dx) (∫_{−∞}^{∞} e^{ty} fY (y) dy)
         = MX (t) MY (t).

This proves the result.


Example 6.5.1. Let X and Y have joint pdf given by

fX,Y (x, y) = { λ² e^{−λ(x+y)},  x, y > 0,
             { 0,               otherwise.

Find the joint mgf of (X, Y ) and hence find E(X), E(Y ), E(XY ) and ρX,Y .
Solution. The joint mgf of (X, Y ) is given by

MX,Y (s, t) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} e^{sx+ty} fX,Y (x, y) dx dy
            = λ² ∫_0^{∞} ∫_0^{∞} e^{sx+ty} e^{−λ(x+y)} dx dy
            = λ² (∫_0^{∞} e^{−(λ−s)x} dx) (∫_0^{∞} e^{−(λ−t)y} dy)
            = λ² / ((λ − s)(λ − t)), for λ > s and λ > t.

Now, note that

∂/∂s MX,Y (s, t) = λ² / ((λ − s)²(λ − t)),    ∂/∂t MX,Y (s, t) = λ² / ((λ − s)(λ − t)²),
∂²/∂s² MX,Y (s, t) = 2λ² / ((λ − s)³(λ − t)),  ∂²/∂t² MX,Y (s, t) = 2λ² / ((λ − s)(λ − t)³),
∂²/∂s∂t MX,Y (s, t) = λ² / ((λ − s)²(λ − t)²).

Therefore, evaluating at s = t = 0,

E(X) = 1/λ,  E(Y ) = 1/λ,  E(X²) = 2/λ²,  E(Y ²) = 2/λ²,  E(XY ) = 1/λ².

Hence,

σX² = E(X²) − (E(X))² = 1/λ²,  σY² = E(Y ²) − (E(Y ))² = 1/λ²

and

ρX,Y = Cov(X, Y ) / (σX σY) = (E(XY ) − E(X)E(Y )) / (σX σY) = 0.
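Since the joint pdf factors, X and Y are independent Exp(λ) variables, and a Monte Carlo simulation (our check, not from the notes; λ = 2 is an assumed value) agrees with the moments above:

```python
import random

random.seed(7)
lam, n = 2.0, 200_000
xs = [random.expovariate(lam) for _ in range(n)]
ys = [random.expovariate(lam) for _ in range(n)]

mx = sum(xs) / n   # ≈ E(X) = 1/λ
my = sum(ys) / n   # ≈ E(Y) = 1/λ
# sample covariance ≈ E(XY) − E(X)E(Y) = 0
cov = sum(x * y for x, y in zip(xs, ys)) / n - mx * my
assert abs(mx - 1 / lam) < 0.01 and abs(my - 1 / lam) < 0.01
assert abs(cov) < 0.01
```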

6.6 Bivariate Normal Distribution


Remember that the normal distribution is very important in probability theory and it shows up in many
different applications. We have discussed a single normal random variable previously; we will now talk
about two normal random variables.

Definition 6.6.17 [Bivariate Normal Distribution]

Continuous jointly distributed random variables X and Y are said to have the bivariate normal
distribution if their joint pdf is given by

fX,Y (x, y) = 1/(2πσ1σ2 √(1 − ρ²)) ·
              exp{−1/(2(1 − ρ²)) [((x − µ1)/σ1)² + ((y − µ2)/σ2)² − 2ρ((x − µ1)/σ1)((y − µ2)/σ2)]},

where x, y, µ1 , µ2 ∈ R, σ1 , σ2 > 0 and −1 < ρ < 1.

Remark 6.6.1. (i) The numbers µ1 , µ2 , σ1 , σ2 and ρ are called the parameters of bivariate normal
distribution.

(ii) If (X, Y ) follows the bivariate normal distribution with parameters µ1 , µ2 , σ1 , σ2 and ρ, then it is
denoted as (X, Y ) ∼ BVN(µ1 , µ2 , σ1², σ2², ρ).
Note that if ρ = 0 then

fX,Y (x, y) = fX (x)fY (y),

where X ∼ N (µ1 , σ1²) and Y ∼ N (µ2 , σ2²). Therefore, the following theorem can be easily proved
for the bivariate normal distribution.

Theorem 6.6.5
If (X, Y ) ∼ BVN(µ1 , µ2 , σ1², σ2², ρ) then

ρ = ρX,Y = 0 ⇐⇒ X and Y are independent.

Theorem 6.6.6
If (X, Y ) ∼ BVN(µ1 , µ2 , σ1², σ2², ρ) then the marginals and conditional distributions of X and Y
are all univariate normal.

Proof. Consider

fX,Y (x, y) = 1/(2πσ1σ2 √(1 − ρ²)) exp{−1/(2(1 − ρ²)) [((x − µ1)/σ1)² + ((y − µ2)/σ2)² − 2ρ((x − µ1)/σ1)((y − µ2)/σ2)]}.

Adding and subtracting ρ²((x − µ1)/σ1)² inside the bracket and completing the square in y, we can factor
the joint pdf as

fX,Y (x, y) = (1/(√(2π) σ1)) e^{−(1/2)((x − µ1)/σ1)²}
              × (1/(√(2π) σ2 √(1 − ρ²))) e^{−[y − (µ2 + ρσ2 (x − µ1)/σ1)]² / (2σ2²(1 − ρ²))}.

Therefore,

fX (x) = (1/(√(2π) σ1)) e^{−(1/2)((x − µ1)/σ1)²}
         × ∫_{−∞}^{∞} (1/(√(2π) σ2 √(1 − ρ²))) e^{−[y − (µ2 + ρσ2 (x − µ1)/σ1)]² / (2σ2²(1 − ρ²))} dy
       = (1/(√(2π) σ1)) e^{−(1/2)((x − µ1)/σ1)²},

since the expression inside the integral is the pdf of N(µ2 + ρσ2 (x − µ1)/σ1 , σ2²(1 − ρ²)). Hence,
X ∼ N (µ1 , σ1²). Similarly, we can add and subtract ρ²((y − µ2)/σ2)² and, following the same steps as
above, we get Y ∼ N (µ2 , σ2²).

Next, consider

fY |X=x (y|x) = fX,Y (x, y)/fX (x) = (1/(√(2π) σ2 √(1 − ρ²))) e^{−[y − (µ2 + ρσ2 (x − µ1)/σ1)]² / (2σ2²(1 − ρ²))}.

Comparing with the density of the normal distribution, we get

Y | X = x ∼ N(µ2 + ρσ2 (x − µ1)/σ1 , σ2²(1 − ρ²)).

Similarly,

X | Y = y ∼ N(µ1 + ρσ1 (y − µ2)/σ2 , σ1²(1 − ρ²)).

This proves the result.


Remark 6.6.2. The converse of the above theorem is true, that is, if the marginals and conditionals are
univariate normal distribution then the joint distribution will be bivariate normal distribution.

Theorem 6.6.7
Let X and Y be two random variables. Then

E[g(X, Y )] = E^Y [E^{X|Y} [g(X, Y ) | Y ]] = E^X [E^{Y|X} [g(X, Y ) | X]].

Proof. Consider

E[g(X, Y )] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x, y) fX,Y (x, y) dx dy = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x, y) fY |X=x (y|x) fX (x) dy dx
            = ∫_{−∞}^{∞} (∫_{−∞}^{∞} g(x, y) fY |X=x (y|x) dy) fX (x) dx = ∫_{−∞}^{∞} E^{Y|X} [g(X, Y ) | X = x] fX (x) dx
            = E^X [E^{Y|X} [g(X, Y ) | X]].

Similarly, it can be easily proved that

E[g(X, Y )] = E^Y [E^{X|Y} [g(X, Y ) | Y ]].

Also, following similar steps, the result can be proved for discrete random variables.

Corollary 6.6.2
Let X and Y be two random variables. Then

E(X) = E[E(X | Y )],

for any random variable Y .

Theorem 6.6.8
Let (X, Y ) ∼ BVN(µ1 , µ2 , σ1², σ2², ρ). Then

ρX,Y = ρ.

Proof. Note that

Cov(X, Y ) = E((X − µ1)(Y − µ2))
           = E[(X − µ1) E(Y − µ2 | X)]
           = E[(X − µ1) · ρσ2 (X − µ1)/σ1]
           = (ρσ2/σ1) E(X − µ1)²
           = ρσ1σ2.

This implies

ρX,Y = Cov(X, Y )/(σ1σ2) = ρσ1σ2/(σ1σ2) = ρ.

This proves the result.

Theorem 6.6.9
Let (X, Y ) ∼ BVN(µ1 , µ2 , σ1², σ2², ρ). Then, the joint mgf of (X, Y ) is

MX,Y (s, t) = exp{µ1 s + µ2 t + (1/2)σ1²s² + (1/2)σ2²t² + ρσ1σ2 st}.

Proof. Note that

MX,Y (s, t) = E(e^{sX+tY}) = E[E(e^{sX+tY} | Y )]
            = E[e^{tY} E(e^{sX} | Y )]
            = E[e^{tY} MX|Y (s)]
            = E[e^{tY} exp{(µ1 + ρσ1 (Y − µ2)/σ2)s + (1/2)σ1²(1 − ρ²)s²}]
            = e^{µ1 s − ρ(σ1µ2/σ2)s + (1/2)σ1²(1 − ρ²)s²} E[e^{Y (t + ρ(σ1/σ2)s)}]
            = e^{µ1 s − ρ(σ1µ2/σ2)s + (1/2)σ1²(1 − ρ²)s²} MY (t + ρ(σ1/σ2)s)
            = e^{µ1 s − ρ(σ1µ2/σ2)s + (1/2)σ1²(1 − ρ²)s²} e^{µ2 (t + ρ(σ1/σ2)s) + (1/2)σ2²(t + ρ(σ1/σ2)s)²}
            = exp{µ1 s + µ2 t + (1/2)σ1²s² + (1/2)σ2²t² + ρσ1σ2 st}.

This proves the result.

Theorem 6.6.10
(X, Y ) ∼ BVN(µ1 , µ2 , σ1², σ2², ρ) if and only if

aX + bY ∼ N(aµ1 + bµ2 , a²σ1² + b²σ2² + 2abρσ1σ2) for all a, b ∈ R,

where a and b are not simultaneously zero.

Example 6.6.1. The amount of rainfall recorded at a US weather station in January is a random variable X and the amount in February at the same station is a random variable Y. Suppose (X, Y) ∼ BVN(6, 4, 1, 0.25, 0.1). Find P(X ≤ 5) and P(Y ≤ 4 | X = 5).
Solution. Note that X ∼ N(6, 1). Therefore,
$$P(X \le 5) = P\left(\frac{X-6}{1} \le \frac{5-6}{1}\right) = P(Z \le -1) = 0.1587.$$
Also, Y | X = 5 ∼ N(µ2 + ρ(σ2/σ1)(5 − µ1), σ2²(1 − ρ²)) = N(3.95, 0.2475). Therefore,
$$P(Y \le 4 \mid X = 5) = P\left(Z \le \frac{4 - 3.95}{\sqrt{0.2475}}\right) = P(Z \le 0.10) = 0.5398.$$
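The numerical value P(Z ≤ −1) = 0.1587 used above can be reproduced with the standard normal cdf written in terms of the error function; this is a small sketch, not part of the original solution.

```python
import math

def phi(z):
    # Standard normal cdf expressed via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# X ~ N(6, 1), so P(X <= 5) = P(Z <= -1).
p = phi((5 - 6) / 1.0)
print(round(p, 4))  # 0.1587
```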

6.7 Other Bivariate Distributions


There are many special bivariate distributions, but we are not going to study them in detail. We present only their joint pmf or pdf; their other characteristics are left as exercises for the student.

Definition 6.7.18 [Multinomial Distribution]


Suppose a random experiment is conducted n times under identical conditions. Each trial may result in one of k mutually exclusive and exhaustive events A1, A2, . . . , Ak. Let pj denote the probability of outcome Aj, j = 1, . . . , k. Let Xi be the number of trials resulting in event Ai, i = 1, . . . , k. Then, (X1, . . . , Xk) with
$$P(X_1 = x_1, \ldots, X_k = x_k) = \begin{cases} \dfrac{n!}{x_1!\cdots x_k!}\, p_1^{x_1}\cdots p_k^{x_k}, & \text{if } \sum_{i=1}^{k} x_i = n,\\[4pt] 0, & \text{otherwise,} \end{cases}$$
or equivalently,
$$P(X_1 = x_1, \ldots, X_{k-1} = x_{k-1}) = \begin{cases} \dfrac{n!}{x_1!\cdots x_{k-1}!\,(n-x_1-\cdots-x_{k-1})!}\, p_1^{x_1}\cdots p_{k-1}^{x_{k-1}} & \\ \quad\times\, (1-p_1-\cdots-p_{k-1})^{\,n-x_1-\cdots-x_{k-1}}, & \text{if } \sum_{i=1}^{k-1} x_i \le n,\\[4pt] 0, & \text{otherwise,} \end{cases}$$
is said to have the multinomial distribution.
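The definition above translates directly into a small pmf sketch (the helper name `multinomial_pmf` is an assumption for illustration); it also confirms that the pmf sums to 1 over all admissible count vectors.

```python
import math
from itertools import product

def multinomial_pmf(xs, n, ps):
    # Joint pmf: n!/(x1!...xk!) * p1^x1 ... pk^xk when the counts sum to n.
    if sum(xs) != n or any(x < 0 for x in xs):
        return 0.0
    coef = math.factorial(n)
    for x in xs:
        coef //= math.factorial(x)
    prob = float(coef)
    for x, p in zip(xs, ps):
        prob *= p ** x
    return prob

n, ps = 4, (0.2, 0.3, 0.5)
total = sum(multinomial_pmf(xs, n, ps)
            for xs in product(range(n + 1), repeat=3))
print(total)  # 1.0 up to rounding: the pmf is proper
```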

Theorem 6.7.11
Let (X1, . . . , Xk−1) follow the multinomial distribution. Then

(a) $M_{X_1,\ldots,X_{k-1}}(t_1,\ldots,t_{k-1}) = \left(p_1 e^{t_1} + \cdots + p_{k-1}e^{t_{k-1}} + p_k\right)^n$, for all (t1, . . . , tk−1) ∈ R^{k−1}, where pk = 1 − p1 − · · · − pk−1.

(b) Xi ∼ B(n, pi), for all i = 1, 2, . . . , k.

(c) $\rho_{X_i,X_j} = -\left(\dfrac{p_ip_j}{q_iq_j}\right)^{1/2}$, for i ≠ j, where qi = 1 − pi.

A special case of multinomial distribution is trinomial distribution when k = 3. The formal definition
can be given as follows:

Definition 6.7.19 [Trinomial Distribution]


A joint discrete distribution (X, Y) is said to have the trinomial distribution if its joint pmf has the form
$$P(X = x, Y = y) = \begin{cases} \dfrac{n!}{x!\,y!\,(n-x-y)!}\, p_1^{x} p_2^{y} (1-p_1-p_2)^{n-x-y}, & x, y = 0, 1, \ldots, n,\ x + y \le n,\\[4pt] 0, & \text{otherwise,} \end{cases}$$
where p1, p2 > 0 and p1 + p2 ≤ 1.

Theorem 6.7.12
Let (X, Y) follow the trinomial distribution. Then

(a) X ∼ B(n, p1) and Y ∼ B(n, p2).

(b) X | Y = y ∼ B(n − y, p1/(1 − p2)) and Y | X = x ∼ B(n − x, p2/(1 − p1)).

Further, let us define bivariate discrete uniform distribution.

Definition 6.7.20 [Bivariate Uniform Distribution]


A joint discrete distribution (X, Y) is said to have the bivariate uniform distribution if its joint pmf has the form
$$p_{X,Y}(x,y) = \begin{cases} \dfrac{2}{k(k+1)}, & y = 1, 2, \ldots, x \text{ and } x = 1, 2, \ldots, k,\\[4pt] 0, & \text{otherwise.} \end{cases}$$

Next, let us define bivariate gamma distribution.

Definition 6.7.21 [Bivariate Gamma Distribution]
A joint continuous distribution (X, Y) is said to have the bivariate gamma distribution if its joint pdf has the form
$$f_{X,Y}(x,y) = \begin{cases} \dfrac{\beta^{\gamma+\alpha}}{\Gamma(\alpha)\Gamma(\gamma)}\, x^{\alpha-1}(y-x)^{\gamma-1} e^{-\beta y}, & 0 < x < y,\ \alpha, \beta, \gamma > 0,\\[4pt] 0, & \text{otherwise.} \end{cases}$$

Theorem 6.7.13
Let (X, Y ) follow bivariate gamma distribution. Then

(a) X ∼ G(α, β).

(b) Y ∼ G(α + γ, β).

(c) Y − X|X = x ∼ G(γ, β).

Finally, let us define bivariate beta distribution.

Definition 6.7.22 [Bivariate Beta Distribution]


A joint continuous distribution (X, Y) is said to have the bivariate beta distribution if its joint pdf has the form
$$f_{X,Y}(x,y) = \begin{cases} \dfrac{\Gamma(p_1+p_2+p_3)}{\Gamma(p_1)\Gamma(p_2)\Gamma(p_3)}\, x^{p_1-1} y^{p_2-1} (1-x-y)^{p_3-1}, & x, y \ge 0,\ x + y \le 1,\\[4pt] 0, & \text{otherwise,} \end{cases}$$
where p1, p2, p3 > 0.

Theorem 6.7.14
Let (X, Y ) follow bivariate beta distribution. Then

(a) X ∼ Beta(p1 , p2 + p3 ).

(b) Y ∼ Beta(p2 , p1 + p3 ).
Y

(c) 1−X |X = x ∼ Beta(p2 , p3 ).
X

(d) 1−Y |Y = y ∼ Beta(p1 , p3 ).
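Part (a) of the theorem can be spot-checked by simulation: a bivariate-beta pair can be generated by normalizing three independent gammas with shapes p1, p2, p3 (a Dirichlet construction). The shape values below are assumptions chosen for illustration.

```python
import random

random.seed(2)
p1, p2, p3 = 2.0, 3.0, 4.0
n = 100_000

# X = G1/(G1+G2+G3) with independent Gi ~ Gamma(pi, 1) follows Beta(p1, p2+p3).
sx = 0.0
for _ in range(n):
    g1 = random.gammavariate(p1, 1.0)
    g2 = random.gammavariate(p2, 1.0)
    g3 = random.gammavariate(p3, 1.0)
    sx += g1 / (g1 + g2 + g3)

mean_x = sx / n
print(mean_x)  # close to the Beta(2, 7) mean 2/9 ≈ 0.2222
```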

6.8 Transformation of Random Variable


Our goal in this section is to develop analytical results for the probability distribution of a transformed
random vector Y in R2 given that we know the distribution of X, the original random vector.

We first answer the question “when a function of random vector is a random vector” in the following
theorem.
Theorem 6.8.15
Let X = (X1 , X2 , . . . , Xn ) be a random vector and g : Rn → Rn be a measurable function. Then
g(X) is also a random vector.

Now, we consider the case when n = 2 and obtain the distribution of the transformation of the given
random vector. There are mainly three approaches to find the distribution of Y = g(X).

(a) PMF or PDF approach

(b) CDF approach

(c) MGF approach

6.8.1 PMF Approach


The pmf approach is the easiest approach and is useful for a discrete random vector. Here, we find the inverse images under the function and, using the given joint pmf, the joint pmf of the transformed random vector can easily be computed.
Example 6.8.1. Let the joint pmf of (X, Y ) be

        X = −1   X = 0   X = 1
Y = −2    1/6     1/12    1/6
Y =  1    1/6     1/12    1/6
Y =  2    1/12     0      1/12

Find the joint distribution of U = |X| and V = Y 2 .


Solution. Note that U can take values {0, 1} and V can take values {1, 4}. Consider

$$\begin{aligned}
P(U = 0, V = 1) &= P\left(|X| = 0, Y^2 = 1\right) = P(X = 0, Y = \pm 1)\\
&= P(X = 0, Y = -1) + P(X = 0, Y = 1) = 0 + \frac{1}{12} = \frac{1}{12},\\
P(U = 0, V = 4) &= P\left(|X| = 0, Y^2 = 4\right) = P(X = 0, Y = \pm 2)\\
&= P(X = 0, Y = -2) + P(X = 0, Y = 2) = \frac{1}{12} + 0 = \frac{1}{12}.
\end{aligned}$$
Similarly, we can find other probabilities and the joint pmf is given by

        U = 0   U = 1
V = 1    1/12    1/3
V = 4    1/12    1/2
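The full computation can be automated by pushing the joint pmf through the map (x, y) ↦ (|x|, y²); the sketch below uses exact fractions to reproduce the table.

```python
from fractions import Fraction as F
from collections import defaultdict

# Joint pmf of (X, Y) from the table in the example.
pmf = {(-1, -2): F(1, 6), (0, -2): F(1, 12), (1, -2): F(1, 6),
       (-1,  1): F(1, 6), (0,  1): F(1, 12), (1,  1): F(1, 6),
       (-1,  2): F(1, 12), (0,  2): F(0),    (1,  2): F(1, 12)}

# Accumulate probabilities of the images (U, V) = (|X|, Y^2).
uv = defaultdict(F)
for (x, y), p in pmf.items():
    uv[(abs(x), y * y)] += p

print(dict(uv))
```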

6.8.2 CDF Approach
The CDF approach is useful for a jointly continuous random vector in general. Here, we find the joint cdf of the given transformation of the random vector. Further, we take partial derivatives to get the joint pdf.
Example 6.8.2. Let (X, Y) have joint pdf given by
$$f_{X,Y}(x,y) = \begin{cases} \dfrac{1+xy}{4}, & |x| < 1,\ |y| < 1,\\[4pt] 0, & \text{otherwise.} \end{cases}$$
Find the joint pdf of U = X² and V = Y².


Solution. For 0 < u < 1 and 0 < v < 1, consider
$$\begin{aligned}
F_{U,V}(u,v) &= P(U \le u, V \le v) = P\left(X^2 \le u, Y^2 \le v\right)\\
&= P(-\sqrt{u} \le X \le \sqrt{u},\ -\sqrt{v} \le Y \le \sqrt{v})\\
&= \int_{-\sqrt{v}}^{\sqrt{v}}\int_{-\sqrt{u}}^{\sqrt{u}} \frac{1+xy}{4}\,dx\,dy\\
&= \sqrt{uv},
\end{aligned}$$
since the xy term integrates to zero by symmetry. Therefore,
$$f_{U,V}(u,v) = \frac{\partial^2 F_{U,V}(u,v)}{\partial u\,\partial v} = \begin{cases} \dfrac{1}{4\sqrt{uv}}, & 0 < u < 1,\ 0 < v < 1,\\[4pt] 0, & \text{otherwise.} \end{cases}$$
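The identity F_{U,V}(u, v) = √(uv) can be checked by Monte Carlo: the sketch below draws from f_{X,Y} by rejection sampling against the uniform density on (−1, 1)² (the test point (u, v) = (0.25, 0.49) is an arbitrary assumption).

```python
import random

random.seed(3)

def draw_xy():
    # Rejection sampling from f(x, y) = (1 + xy)/4 on (-1, 1)^2:
    # the ratio f/uniform equals 1 + xy <= 2, so accept w.p. (1 + xy)/2.
    while True:
        x = random.uniform(-1, 1)
        y = random.uniform(-1, 1)
        if random.random() < (1 + x * y) / 2:
            return x, y

n = 100_000
u0, v0 = 0.25, 0.49
hits = 0
for _ in range(n):
    x, y = draw_xy()
    if x * x <= u0 and y * y <= v0:
        hits += 1

est = hits / n
print(est)  # should approach sqrt(u0 * v0) = 0.35
```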

6.8.3 MGF Approach


As we have seen earlier, the MGF approach is mainly useful for the special distributions. Let us see a related example.
Example 6.8.3. Let X ∼ B(n1, p) and Y ∼ B(n2, p), where X and Y are independent. Find the distribution of X + Y.
Solution. Note that
$$M_X(t) = (q + pe^t)^{n_1} \quad \text{and} \quad M_Y(t) = (q + pe^t)^{n_2}.$$
Then, from Theorem 6.5, we have
$$M_{X+Y}(t) = M_X(t)M_Y(t) = (q + pe^t)^{n_1+n_2}.$$
Therefore, X + Y ∼ B(n1 + n2, p).
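The mgf conclusion can be double-checked by direct convolution of the two binomial pmfs (the values n1 = 3, n2 = 5, p = 0.4 are assumptions for illustration):

```python
import math

def binom_pmf(k, n, p):
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

n1, n2, p = 3, 5, 0.4

# pmf of X + Y by convolving the two binomial pmfs ...
conv = [sum(binom_pmf(j, n1, p) * binom_pmf(k - j, n2, p)
            for j in range(min(k, n1) + 1) if k - j <= n2)
        for k in range(n1 + n2 + 1)]

# ... matches the B(n1 + n2, p) pmf, as the mgf argument predicts.
target = [binom_pmf(k, n1 + n2, p) for k in range(n1 + n2 + 1)]
gap = max(abs(a - b) for a, b in zip(conv, target))
print(gap)  # essentially zero
```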

6.8.4 PDF Approach


We now generalize the PDF approach from a one-dimensional random variable to a two-dimensional random vector using the Jacobian.

Theorem 6.8.16
Let X and Y be continuous random variables with joint pdf fX,Y(·, ·). Let U = g(X, Y) and V = h(X, Y) be a one-one transformation from R² to R² such that

(a) x = h1(u, v) and y = h2(u, v) are defined over the range of the transformation.

(b) The mapping and its inverse are both continuous.

(c) The partial derivatives ∂x/∂u, ∂x/∂v, ∂y/∂u, ∂y/∂v exist and are continuous.

(d) The Jacobian J of the transformation satisfies
$$J = \begin{vmatrix} \dfrac{\partial x}{\partial u} & \dfrac{\partial x}{\partial v}\\[6pt] \dfrac{\partial y}{\partial u} & \dfrac{\partial y}{\partial v} \end{vmatrix} \ne 0$$
in the range of the transformation.

Then, the random vector (U, V ) is continuous and its joint pdf is given by

fU,V (u, v) = |J|fX,Y (h1 (u, v), h2 (u, v)) .

Example 6.8.4. Let X, Y $\overset{iid}{\sim}$ U(0, 1). Find the distribution of U = X + Y.
Solution. It is known that
$$f_{X,Y}(x,y) = \begin{cases} 1, & 0 < x, y < 1,\\ 0, & \text{otherwise.} \end{cases}$$

Introduce the auxiliary variable V, so that
$$U = X + Y \quad \text{and} \quad V = X - Y.$$
This implies
$$x = h_1(u,v) = \frac{u+v}{2} \quad \text{and} \quad y = h_2(u,v) = \frac{u-v}{2}.$$
Therefore,
$$\frac{\partial x}{\partial u} = \frac{\partial x}{\partial v} = \frac{1}{2}, \quad \frac{\partial y}{\partial u} = \frac{1}{2} \quad \text{and} \quad \frac{\partial y}{\partial v} = -\frac{1}{2}.$$
This implies
$$J = \begin{vmatrix} \frac{1}{2} & \frac{1}{2}\\[2pt] \frac{1}{2} & -\frac{1}{2} \end{vmatrix} = -\frac{1}{2} \implies |J| = \frac{1}{2}.$$

Note that
$$0 < x < 1 \implies 0 < \frac{u+v}{2} < 1 \implies 0 < u + v < 2,$$
$$0 < y < 1 \implies 0 < \frac{u-v}{2} < 1 \implies 0 < u - v < 2.$$


Therefore,
$$f_{U,V}(u,v) = |J|\, f_{X,Y}(h_1(u,v), h_2(u,v)) = \frac{1}{2}\, f_{X,Y}\!\left(\frac{u+v}{2}, \frac{u-v}{2}\right) = \begin{cases} \dfrac{1}{2}, & 0 < u + v < 2,\ 0 < u - v < 2,\\[4pt] 0, & \text{otherwise.} \end{cases}$$
(Figure: the support of (U, V) is the square in the (u, v)-plane bounded by the lines u + v = 0, u + v = 2, u − v = 0 and u − v = 2.)

Thus, the pdf of U = X + Y is given by
$$f_U(u) = \begin{cases} \displaystyle\int_{-u}^{u} f_{U,V}(u,t)\,dt, & \text{if } 0 < u \le 1,\\[8pt] \displaystyle\int_{u-2}^{2-u} f_{U,V}(u,t)\,dt, & \text{if } 1 < u < 2,\\[8pt] 0, & \text{otherwise.} \end{cases}$$
This implies
$$f_{X+Y}(u) = \begin{cases} u, & 0 < u \le 1,\\ 2-u, & 1 < u < 2,\\ 0, & \text{otherwise.} \end{cases}$$
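The triangular shape of f_{X+Y} can be confirmed by simulation; for the triangular density above, F_U(1) = 1/2, F_U(0.5) = 0.125 and E[U] = 1.

```python
import random

random.seed(4)
n = 200_000
u = [random.random() + random.random() for _ in range(n)]

# For the triangular density on (0, 2): F(u) = u^2/2 on (0, 1], mean 1.
p_half = sum(1 for t in u if t <= 1.0) / n
p_quarter = sum(1 for t in u if t <= 0.5) / n
mean_u = sum(u) / n
print(p_half, p_quarter, mean_u)  # near 0.5, 0.125, 1.0
```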

6.9 n-dimensional Random Vector


An n-dimensional random vector X = (X1, . . . , Xn) is an ordered set of n random variables, each of which describes some aspect of a statistical outcome. Since we have studied the theory for two-dimensional random vectors, the concepts generalize easily to n-dimensional random vectors.

Definition 6.9.23 [Random Vector]
A measurable function X = (X1 , . . . , Xn ) : Ω → Rn is called the random vector of n dimension.

Definition 6.9.24 [Joint Cumulative Distribution Function]


Let X = (X1 , . . . , Xn ) be a random vector then the joint cdf of X is given by

FX1 ,...,Xn (x1 , . . . , xn ) = P(X1 ≤ x1 , . . . , Xn ≤ xn ), for all x1 , . . . , xn ∈ R.

Properties of Joint CDF: The joint cdf of X satisfies


1. lim FX1 ,...Xn (x1 , . . . xn ) = 0, for all i = 1, 2, . . . , n.
xi →−∞

2. lim FX1 ,...,Xn (x1 , . . . , xn ) = FX1 ,...,Xi−1 ,Xi+1 ,...,Xn (x1 , . . . , xi−1 , xi+1 , . . . , xn ).
xi →∞

3. FX1,...,Xn(x1, . . . , xn) is continuous from the right in each of its arguments and is non-decreasing in each argument.

6.9.1 Joint Discrete Random Variables


In this section, we study the definitions related to joint discrete n-dimension random vector.
Definition 6.9.25 [Joint Probability Mass Function]
Let X be random vector. Then, the joint pmf of X is given by

pX1 ,...,Xn (x1 , . . . , xn ) = P (X1 = x1 , . . . Xn = xn ) .

The joint pmf of X satisfies


(a) pX1 ,...,Xn (x1 , . . . , xn ) ≥ 0, for all possible values of x1 , . . . , xn .
(b) $\sum_{x_1}\cdots\sum_{x_n} p_{X_1,\ldots,X_n}(x_1,\ldots,x_n) = 1$.

Definition 6.9.26 [Joint cdf in Discrete Case]


The joint cdf of X is given by
$$F_{X_1,\ldots,X_n}(x_1,\ldots,x_n) = P(X_1 \le x_1, \ldots, X_n \le x_n) = \sum_{r_1 \le x_1}\cdots\sum_{r_n \le x_n} p_{X_1,\ldots,X_n}(r_1,\ldots,r_n).$$

Definition 6.9.27 [Marginal Distributions]


The marginal pmf of (Xi, Xj), for i < j, is defined by
$$p_{X_i,X_j}(x_i,x_j) = \sum_{x_1}\cdots\sum_{x_{i-1}}\sum_{x_{i+1}}\cdots\sum_{x_{j-1}}\sum_{x_{j+1}}\cdots\sum_{x_n} p_{X_1,\ldots,X_n}(x_1,\ldots,x_n),$$
where the sum runs over all coordinates other than xi and xj.

Remark 6.9.1. Similar to above definition, the marginal distribution can be defined for any dimension
random vector less than n.
Definition 6.9.28 [Conditional Distributions]
The conditional pmf of (Xi , Xj ), for i < j, given X1 = x1 , . . . , Xi−1 = xi−1 , Xi+1 =
xi+1 , . . . , Xj−1 = xj−1 , Xj+1 = xj+1 , . . . , Xn = xn , that is, Xi , Xj | X1 = x1 , . . . , Xi−1 =
xi−1 , Xi+1 = xi+1 , . . . , Xj−1 = xj−1 , Xj+1 = xj+1 , . . . , Xn = xn , is defines as

$$p_{X_i,X_j\,\mid\,X_1=x_1,\ldots,X_{i-1}=x_{i-1},X_{i+1}=x_{i+1},\ldots,X_{j-1}=x_{j-1},X_{j+1}=x_{j+1},\ldots,X_n=x_n}\!\left(x_i, x_j \mid x_1,\ldots,x_{i-1},x_{i+1},\ldots,x_{j-1},x_{j+1},\ldots,x_n\right)$$
$$= \frac{p_{X_1,\ldots,X_n}(x_1,\ldots,x_n)}{p_{X_1,\ldots,X_{i-1},X_{i+1},\ldots,X_{j-1},X_{j+1},\ldots,X_n}\!\left(x_1,\ldots,x_{i-1},x_{i+1},\ldots,x_{j-1},x_{j+1},\ldots,x_n\right)},$$
provided $p_{X_1,\ldots,X_{i-1},X_{i+1},\ldots,X_{j-1},X_{j+1},\ldots,X_n}(x_1,\ldots,x_{i-1},x_{i+1},\ldots,x_{j-1},x_{j+1},\ldots,x_n) \ne 0$.

Remark 6.9.2. Similar to above definition, the conditional distribution can be defined for any dimen-
sion random vector less than n.

6.9.2 Joint Continuous Random Variables


In this section, we study the definitions related to joint continuous n-dimension random vector.
Definition 6.9.29 [Joint Probability Density Function]
Let X be a random vector. Then, the joint pdf of X, denoted fX1,...,Xn(x1, . . . , xn), satisfies the following conditions:

(a) fX1 ,...,Xn (x1 , . . . , xn ) ≥ 0, for all possible values of x1 , . . . , xn .


Z ∞ Z ∞
(b) ... fX1 ,...,Xn (x1 , . . . , xn ) dx1 . . . dxn = 1.
−∞ −∞

Definition 6.9.30 [Joint cdf in Continuous Case]


The joint cdf of X is given by

FX1 ,...,Xn (x1 , . . . , xn ) = P(X1 ≤ x1 , . . . , Xn ≤ xn )


Z x1 Z xn
= ... fX1 ,...,Xn (r1 , . . . , rn ) dr1 . . . drn .
−∞ −∞

Definition 6.9.31 [Marginal Distributions]


The marginal pdf of (Xi, Xj), for i < j, is defined by
$$f_{X_i,X_j}(x_i,x_j) = \underbrace{\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty}}_{n-2}\ f_{X_1,\ldots,X_n}(x_1,\ldots,x_n)\ dx_1\ldots dx_{i-1}\,dx_{i+1}\ldots dx_{j-1}\,dx_{j+1}\ldots dx_n.$$

Remark 6.9.3. Similar to above definition, the marginal distribution can be defined for any dimension
random vector less than n.

Definition 6.9.32 [Conditional Distributions]


The conditional pdf of (Xi, Xj), for i < j, given X1 = x1, . . . , Xi−1 = xi−1, Xi+1 = xi+1, . . . , Xj−1 = xj−1, Xj+1 = xj+1, . . . , Xn = xn, is defined as

$$f_{X_i,X_j\,\mid\,X_1=x_1,\ldots,X_{i-1}=x_{i-1},X_{i+1}=x_{i+1},\ldots,X_{j-1}=x_{j-1},X_{j+1}=x_{j+1},\ldots,X_n=x_n}\!\left(x_i, x_j \mid x_1,\ldots,x_{i-1},x_{i+1},\ldots,x_{j-1},x_{j+1},\ldots,x_n\right)$$
$$= \frac{f_{X_1,\ldots,X_n}(x_1,\ldots,x_n)}{f_{X_1,\ldots,X_{i-1},X_{i+1},\ldots,X_{j-1},X_{j+1},\ldots,X_n}\!\left(x_1,\ldots,x_{i-1},x_{i+1},\ldots,x_{j-1},x_{j+1},\ldots,x_n\right)},$$
provided $f_{X_1,\ldots,X_{i-1},X_{i+1},\ldots,X_{j-1},X_{j+1},\ldots,X_n}(x_1,\ldots,x_{i-1},x_{i+1},\ldots,x_{j-1},x_{j+1},\ldots,x_n) \ne 0$.

Definition 6.9.33 [Joint Moment Generating Function]


The joint mgf of X is defined as
$$M_{X_1,\ldots,X_n}(t_1,\ldots,t_n) = E\left(e^{t_1X_1+\cdots+t_nX_n}\right) = E\left(e^{\sum_{i=1}^{n} t_iX_i}\right).$$

Remark 6.9.4. Similar to the above definition, the joint mgf can be defined for any dimension random vector less than n.

6.9.3 Some Important Results


In this section, we study some important results related to the n-dimension random vector.
Theorem 6.9.17
Let X1, X2, . . . , Xn be independent random variables. Then

(a) FX1,...,Xn(x1, . . . , xn) = FX1(x1) FX2(x2) · · · FXn(xn).

(b) pX1,...,Xn(x1, . . . , xn) = pX1(x1) · · · pXn(xn).

(c) fX1,...,Xn(x1, . . . , xn) = fX1(x1) · · · fXn(xn).


(d) $M_{X_1,\ldots,X_n}(t_1,\ldots,t_n) = \prod_{i=1}^{n} M_{X_i}(t_i)$.

(e) $M_{X_1+\cdots+X_n}(t) = \prod_{i=1}^{n} M_{X_i}(t)$.

The following are the applications of the last result of the above theorem.

Corollary 6.9.3
Let X1, . . . , Xn $\overset{iid}{\sim}$ Ber(p). Then
$$\sum_{i=1}^{n} X_i \sim B(n, p).$$
That is, the sum of iid Bernoulli random variables is binomial.

Proof. Given X1, . . . , Xn ∼ Ber(p); therefore,
$$M_{X_i}(t) = q + pe^t, \quad \text{for } i = 1, 2, \ldots, n.$$
Using Theorem 6.9.17(e), we have
$$M_{X_1+\cdots+X_n}(t) = \prod_{i=1}^{n} M_{X_i}(t) = (q + pe^t)^n,$$
which is the mgf of B(n, p). This proves the result.

Corollary 6.9.4
Let Xi ∼ B(ni, p), i = 1, . . . , k, be independent. Then
$$\sum_{i=1}^{k} X_i \sim B\!\left(\sum_{i=1}^{k} n_i,\ p\right).$$
That is, the sum of independent binomials with the same success probability is also binomial.

Proof. Given Xi ∼ B(ni, p); therefore,
$$M_{X_i}(t) = (q + pe^t)^{n_i}, \quad \text{for } i = 1, 2, \ldots, k.$$
Using Theorem 6.9.17(e), we have
$$M_{X_1+\cdots+X_k}(t) = \prod_{i=1}^{k} M_{X_i}(t) = (q + pe^t)^{\sum_{i=1}^{k} n_i},$$
which is the mgf of $B\!\left(\sum_{i=1}^{k} n_i,\ p\right)$. This proves the result.

Corollary 6.9.5
Let X1, . . . , Xn $\overset{iid}{\sim}$ Exp(λ). Then
$$\sum_{i=1}^{n} X_i \sim G(n, \lambda).$$
That is, the sum of iid exponential random variables has a gamma distribution.

Proof. Given X1, . . . , Xn ∼ Exp(λ); therefore,
$$M_{X_i}(t) = \frac{\lambda}{\lambda-t}, \quad \text{for } i = 1, 2, \ldots, n \text{ and } t < \lambda.$$
Using Theorem 6.9.17(e), we have
$$M_{X_1+\cdots+X_n}(t) = \prod_{i=1}^{n} M_{X_i}(t) = \left(\frac{\lambda}{\lambda-t}\right)^{n}, \quad \text{for } t < \lambda,$$
which is the mgf of G(n, λ). This proves the result.

Corollary 6.9.6
Let Xi ∼ G(ri, λ), i = 1, . . . , k, be independent. Then
$$\sum_{i=1}^{k} X_i \sim G\!\left(\sum_{i=1}^{k} r_i,\ \lambda\right).$$
That is, the sum of independent gamma random variables with the same rate is also gamma.

Proof. Given Xi ∼ G(ri, λ); therefore,
$$M_{X_i}(t) = \left(\frac{\lambda}{\lambda-t}\right)^{r_i}, \quad \text{for } i = 1, 2, \ldots, k \text{ and } t < \lambda.$$
Using Theorem 6.9.17(e), we have
$$M_{X_1+\cdots+X_k}(t) = \prod_{i=1}^{k} M_{X_i}(t) = \left(\frac{\lambda}{\lambda-t}\right)^{\sum_{i=1}^{k} r_i}, \quad \text{for } t < \lambda,$$
which is the mgf of $G\!\left(\sum_{i=1}^{k} r_i,\ \lambda\right)$. This proves the result.
i=1

Theorem 6.9.18
Let X1, . . . , Xn be random variables. Then
$$\mathrm{Var}\left(\sum_{i=1}^{n} a_i X_i\right) = \sum_{i=1}^{n} a_i^2\, \mathrm{Var}(X_i) + 2\sum_{i<j} a_i a_j\, \mathrm{Cov}(X_i, X_j).$$
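Theorem 6.9.18 can be verified exactly on a small dependent example (the three-coin construction below is an assumption made up for illustration):

```python
from itertools import product

# Toss three fair coins; define dependent variables on the same sample space:
# X1 = first + second, X2 = second + third, X3 = first * third.
outcomes = list(product([0, 1], repeat=3))
prob = 1 / 8
x_funcs = [lambda w: w[0] + w[1], lambda w: w[1] + w[2], lambda w: w[0] * w[2]]
a = [2.0, -1.0, 3.0]

def ev(f):
    return sum(f(w) * prob for w in outcomes)

def var(f):
    return ev(lambda w: f(w) ** 2) - ev(f) ** 2

def cov(f, g):
    return ev(lambda w: f(w) * g(w)) - ev(f) * ev(g)

lhs = var(lambda w: sum(ai * f(w) for ai, f in zip(a, x_funcs)))
rhs = (sum(ai ** 2 * var(f) for ai, f in zip(a, x_funcs))
       + 2 * sum(a[i] * a[j] * cov(x_funcs[i], x_funcs[j])
                 for i in range(3) for j in range(i + 1, 3)))
print(lhs, rhs)  # the two sides agree
```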

6.10 Exercises
1. Toss three coins. Let X denote the number of heads on the first two coins and Y denote the number of heads on the last two coins. Find

(a) the joint PMF of (X, Y )


(b) the marginals of X and Y
(c) E(Y | X = 1)
(d) E(X | Y = 1).

2. Let the joint pdf of (X, Y ) be



$$f_{X,Y}(x,y) = \begin{cases} 2, & 0 < y < x < 1,\\ 0, & \text{otherwise.} \end{cases}$$

Find

(a) the marginals of X and Y

(b) P(Y < X − 1/2)

(c) P(X − Y > 1/4)

(d) P(Y > 1/4 | X = 3/4)

(e) P(Y < X/2).

3. Show that X and Y are independent if the joint pdf of (X, Y ) is



$$f_{X,Y}(x,y) = \begin{cases} 1, & 0 < x < 1,\ 0 < y < 1,\\ 0, & \text{otherwise.} \end{cases}$$

4. Let X and Y have joint pdf given by



$$f_{X,Y}(x,y) = \begin{cases} x + y, & 0 < x < 1,\ 0 < y < 1,\\ 0, & \text{otherwise.} \end{cases}$$

Find ρX,Y .

5. Let X and Y , have joint pdf given by


$$f_{X,Y}(x,y) = \begin{cases} 2, & 0 < y < x < 1,\\ 0, & \text{otherwise.} \end{cases}$$

Find ρX,Y .

6. Let X1 be the life of a tube and X2 be the filament diameter. Given

(X1 , X2 ) ∼ BVN(2000, 0.1, 2500, 0.01, 0.87).

If the filament diameter is 0.098, what is the probability that the tube will last at least 1950 hours?

7. Prove the following:

(a) The sum of Poisson is Poisson.


(b) The sum of iid geometric is negative binomial distribution.
(c) The sum of negative binomial with same success probability is also negative binomial.
(d) The sum of normal is normal distribution.



Chapter 7

Large Sample Theory

We have studied random variables in one and two dimensions in detail. We have also seen the generalization to n dimensions and related results. Now, we move on to study sequences of random variables as n tends to infinity. This will help us to recognize the limiting behaviour of a sequence of random variables. In this chapter, we consider several modes of convergence and investigate their interrelationships.

7.1 Mode of Convergence


We begin with the weakest mode of convergence.

7.1.1 Convergence in Distribution

Definition 7.1.1 [Convergence in Distribution]


Let {Fn} be a sequence of distribution functions corresponding to a sequence of random variables {Xn}. If there exists a distribution function F such that
$$F_n(x) \to F(x) \text{ as } n \to \infty, \quad \text{that is,} \quad \lim_{n\to\infty} F_n(x) = F(x),$$
at every point x at which F is continuous, then we say Fn converges in law (or weakly) to F.

Remark 7.1.1. If {Xn } converges to X in distribution then any one of the following notations can be
used:
Xn →d X (Xn converges in distribution to X)
Fn →d F (Fn converges in distribution to F)
Xn →L X (Xn converges in law to X)
Fn →ω F (Fn converges weakly to F).

Example 7.1.1. Let {Fn } be a sequence of distribution functions defined by

 0, x < 0,
Fn (x) = 1 − n1 , 0 ≤ x < n,
1, x ≥ n.

F (x)

F4 (x)

F3 (x)
F2 (x)

F1 (x)

Then, Fn → F, where F is the distribution function given by
$$F(x) = \begin{cases} 0, & x < 0,\\ 1, & x \ge 0. \end{cases}$$

7.1.2 Convergence in Probability


Definition 7.1.2 [Convergence in Probability]
Let {Xn } be a sequence of random variables defined on some probability space (Ω, F , P). We
say Xn converges in probability to a random variable X if, for all ε > 0,

$$P(|X_n - X| > \varepsilon) \to 0 \text{ as } n \to \infty, \quad \text{that is,} \quad \lim_{n\to\infty} P(|X_n - X| > \varepsilon) = 0.$$
That is, the sequence of probabilities P(|Xn − X| > ε) converges to 0.

Remark 7.1.2. (i) The above definition is equivalent to
$$P(|X_n - X| \le \varepsilon) \to 1 \text{ as } n \to \infty, \quad \text{that is,} \quad \lim_{n\to\infty} P(|X_n - X| \le \varepsilon) = 1.$$
That is, the sequence of probabilities P(|Xn − X| ≤ ε) converges to 1.

(ii) If {Xn} converges to X in probability, then we write Xn →p X.
Example 7.1.2. Let {Xn} be a sequence of random variables with pmf
$$P(X_n = 1) = \frac{1}{n} \quad \text{and} \quad P(X_n = 0) = 1 - \frac{1}{n}.$$
Then, Xn →p 0.
Solution. Note that
$$\{|X_n - 0| > \varepsilon\} = \begin{cases} \{X_n = 1\}, & 0 < \varepsilon < 1,\\ \emptyset, & \varepsilon \ge 1. \end{cases}$$
This implies
$$P(|X_n - 0| > \varepsilon) = \begin{cases} \frac{1}{n}, & 0 < \varepsilon < 1,\\ 0, & \varepsilon \ge 1. \end{cases}$$
Hence,
$$P(|X_n - 0| > \varepsilon) \to 0 \quad \text{as } n \to \infty.$$

This proves the result.

7.1.3 Convergence Almost Surely

Definition 7.1.3 [Convergence Almost Surely]


Let {Xn} be a sequence of random variables. We say that Xn converges almost surely (a.s.) to a random variable X if and only if
$$P\left(\left\{\omega : \lim_{n\to\infty} X_n(\omega) = X(\omega)\right\}\right) = 1.$$

Remark 7.1.3. (i) If {Xn} converges almost surely to X, then we write Xn →a.s. X.

(ii) Here, we collect all ω such that
$$\lim_{n\to\infty} X_n(\omega) = X(\omega).$$
Let us call this set S. If P(S) = 1, then we say Xn →a.s. X.

(Figure: each random variable X1, X2, X3, . . . , Xn, . . . maps Ω to R; for a fixed ω, the resulting real sequence {Xn(ω)} is examined for convergence.)

Example 7.1.3. On Ω = [0, 1], consider a sequence of random variables {Xn} defined as follows:
$$X_n(\omega) = \begin{cases} 1, & \text{if } \omega = 0,\\ \frac{1}{n}, & \text{if } \omega \ne 0. \end{cases}$$
Define a constant random variable X as follows:
$$X(\omega) = 0 \quad \text{for all } \omega \in [0, 1].$$
Note that
$$\lim_{n\to\infty} X_n(\omega) = \lim_{n\to\infty} \frac{1}{n} = 0, \quad \text{for } \omega \in (0, 1]$$
and
$$\lim_{n\to\infty} X_n(\omega) = \lim_{n\to\infty} 1 = 1, \quad \text{for } \omega = 0.$$

Therefore, we have

{ω ∈ Ω : {Xn (ω)} does not converge to X(ω)} = {0}.

This implies

P({ω ∈ Ω : {Xn(ω)} does not converge to X(ω)}) = P({0}) = P([0, 0]) = 0.

Hence,
Xn →a.s. X.

Some Important Results:
1. Xn →a.s. X ⟹ Xn →p X ⟹ Xn →L X.

2. The converses do not hold in general: Xn →L X ⇏ Xn →p X, and Xn →p X ⇏ Xn →a.s. X.

3. Xn →L k ⟹ Xn →p k, for a constant k.

4. If {Xn} is a strictly decreasing sequence of positive random variables, then Xn →p 0 ⟹ Xn →a.s. 0.

5. |Xn − Yn| →p 0 and Yn →L Y ⟹ Xn →L Y.

6. (Slutsky's Theorem): Let Xn →L X and Yn →p c. Then

   (a) Xn + Yn →L X + c.

   (b) XnYn →L cX if c ≠ 0, and XnYn →p 0 if c = 0.

   (c) Xn/Yn →L X/c, provided c ≠ 0.

7. Xn →p X ⟺ Xn − X →p 0.
p p
7. Xn ←→ X ⇐⇒ Xn − X ←→ 0.

7.2 Law of Large Numbers


The law of large numbers has a central role in probability and statistics. It states that if you repeat an experiment independently a large number of times and average the results, what you obtain should be close to the expected value. There are two main versions of the law of large numbers, called the weak and strong laws of large numbers. The difference between them is mostly theoretical. Let us begin with Chebyshev's inequality, which is useful for proving several results in this section.

Theorem 7.2.1 [Chebyshev’s Inequality]


Let X be a random variable with mean µ and variance σ². Then, for any k > 0,
$$P(|X - \mu| \ge k) \le \frac{\sigma^2}{k^2}.$$

Proof. Let X be a continuous random variable with pdf fX (x). Then

$$\sigma^2 = \mathrm{Var}(X) = E(X-\mu)^2 = \int_{-\infty}^{\infty} (x-\mu)^2 f_X(x)\,dx \ge \int_{|x-\mu|\ge k} (x-\mu)^2 f_X(x)\,dx \ge k^2 \int_{|x-\mu|\ge k} f_X(x)\,dx = k^2\, P(|X-\mu| \ge k).$$

This proves the result for the continuous case. Following similar steps, it can easily be proved for the discrete case.

Amit Kumar 150 MA-202: Probability & Statistics


Chapter 7: Large Sample Theory
Remark 7.2.1. Note that the following are also equivalent to Chebyshev's inequality:
$$P(|X-\mu| \ge k\sigma) \le \frac{1}{k^2} \quad \text{or} \quad P(|X-\mu| < k\sigma) \ge 1 - \frac{1}{k^2} \quad \text{or} \quad P(|X-\mu| < k) \ge 1 - \frac{\sigma^2}{k^2}.$$
Example 7.2.1. The number of customers who visit a store every day is a random variable X with µ = 18 and σ = 2.5. With what probability can we assert that the number of customers will be between 8 and 28?
Solution. Consider
$$P(8 \le X \le 28) = P(-10 \le X - 18 \le 10) = P(|X - 18| \le 10) = 1 - P(|X - 18| > 10) \ge 1 - \frac{(2.5)^2}{10^2} = 1 - \frac{1}{16} = \frac{15}{16}.$$

Theorem 7.2.2 [Markov’s Inequality]


Let X be a non-negative random variable. Then, for any k > 0,
$$P(X \ge k) \le \frac{E(X)}{k},$$
provided E(X) exists.

Example 7.2.2. Consider a random variable X that takes the value 0 with probability 24/25 and the value 1 with probability 1/25. Find a bound on the probability that X is at least 5.
Solution. Note that
$$P(X = 0) = \frac{24}{25}, \quad P(X = 1) = \frac{1}{25}$$
and
$$E(X) = \sum_{x=0}^{1} x\,P(X = x) = 0 \times \frac{24}{25} + 1 \times \frac{1}{25} = \frac{1}{25}.$$
Therefore, by Markov's inequality, we have
$$P(X \ge 5) \le \frac{E(X)}{5} = \frac{1}{125}.$$

Theorem 7.2.3 [Chebyshev’s Theorem]


Let {Xn} be a sequence of random variables such that E(Xi) = µi and Var(Xi) = σi², i = 1, 2, . . . , n. Also, let X̄n = (1/n) Σ_{i=1}^{n} Xi with Var(X̄n) → 0 as n → ∞. Then
$$\bar{X}_n \xrightarrow{p} \bar{\mu}_n, \quad \text{where } \bar{\mu}_n = \frac{1}{n}\sum_{i=1}^{n} \mu_i.$$


Proof. Using Chebyshev's inequality [note: E(X̄n) = µ̄n],
$$P\left(\left|\bar{X}_n - \bar{\mu}_n\right| > \varepsilon\right) \le \frac{\mathrm{Var}(\bar{X}_n)}{\varepsilon^2} \to 0 \quad \text{as } n \to \infty, \text{ for all } \varepsilon > 0.$$
Hence, X̄n →p µ̄n.

Theorem 7.2.4 [Weak Law of Large Numbers (WLLN)]


Let {Xn} be a sequence of random variables with E(Xi) = µi, i = 1, 2, . . . , n. Also, let
$$B_n = \mathrm{Var}(X_1 + X_2 + \cdots + X_n) < \infty.$$
Then,
$$\bar{X}_n \xrightarrow{p} \bar{\mu}_n,$$
provided
$$\lim_{n\to\infty} \frac{B_n}{n^2} = 0 \quad \left(\text{that is, } \frac{B_n}{n^2} \to 0 \text{ as } n \to \infty\right).$$

Remark 7.2.2. (i) Note that the WLLN can easily be proved using Chebyshev's inequality, as
$$\mathrm{Var}(\bar{X}_n) = \mathrm{Var}\left(\frac{1}{n}(X_1 + \cdots + X_n)\right) = \frac{1}{n^2}\mathrm{Var}(X_1 + \cdots + X_n) = \frac{B_n}{n^2} \to 0 \quad \text{as } n \to \infty.$$

(ii) For the existence of WLLN, we must have

(a) E (Xi ) exists, for all i.


(b) $B_n = \mathrm{Var}\left(\sum_{i=1}^{n} X_i\right)$ exists,

(c) $\dfrac{B_n}{n^2} \to 0$ as $n \to \infty$.

Theorem 7.2.5 [WLLN for iid Case]


If {Xn} is an iid sequence with E(Xi) = µ and Var(Xi) = σ² < ∞, then X̄n →p µ.


Remark 7.2.3. (i) In the above figure, note how X̄n converges to µ as Var(X̄n) decreases.
(ii) The above theorem follows easily from Chebyshev's inequality, as
$$\mathrm{Var}(\bar{X}_n) = \mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2}\sum_{i=1}^{n}\mathrm{Var}(X_i) = \frac{\sigma^2}{n} \to 0 \quad \text{as } n \to \infty.$$

(iii) If Xi ’s are iid then the necessary condition for LLN to hold is that E(Xi ) exists, for i =
1, 2, . . . , n.
(iv) If the variables are uniformly bounded, then the condition
Bn
lim =0
n→∞ n2

is necessary as well as sufficient for WLLN to hold.

Theorem 7.2.6 [Bernoulli’s Theorem (Bernoulli’s Law of Large number)]


Let {Xn} be a sequence of iid Bernoulli random variables with success probability p. Then, E(Xi) = p and Var(Xi) = p(1 − p), and we have
$$\frac{S_n}{n} \xrightarrow{p} p, \quad \text{where } S_n = \sum_{i=1}^{n} X_i.$$

Proof. As Sn ∼ B(n, p), we have E(Sn) = np and Var(Sn) = np(1 − p). This implies
$$E\left(\frac{S_n}{n}\right) = p$$
and
$$\mathrm{Var}\left(\frac{S_n}{n}\right) = \frac{1}{n^2}\mathrm{Var}(S_n) = \frac{p(1-p)}{n}.$$
By Chebyshev's inequality, we have
$$P\left(\left|\frac{S_n}{n} - p\right| > \varepsilon\right) \le \frac{\mathrm{Var}(S_n/n)}{\varepsilon^2} = \frac{p(1-p)}{n\varepsilon^2} \to 0 \quad \text{as } n \to \infty.$$
This proves the result.
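Bernoulli's theorem is easy to watch in action: the relative frequency Sn/n stabilizes around p as n grows (p = 0.3 is an assumption chosen for illustration).

```python
import random

random.seed(6)
p = 0.3
for n in (100, 10_000, 1_000_000):
    # S_n counts the successes in n Bernoulli(p) trials.
    s_n = sum(1 for _ in range(n) if random.random() < p)
    print(n, s_n / n)  # the relative frequency approaches p = 0.3
```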
Now, we move to present strong law of large numbers.

Theorem 7.2.7 [Strong Law of Large Numbers (SLLN)]


Let {Xn} be a sequence of independent random variables such that E(Xi) = µi and Var(Xi) = σi². If
$$\sum_{i=1}^{\infty} \frac{\sigma_i^2}{i^2} < \infty,$$
then
$$\bar{X}_n \xrightarrow{a.s.} \bar{\mu}_n.$$

Theorem 7.2.8 [SLLN for iid Case]


Let {Xn } be a sequence of iid random variables. Then
E(Xi) exists and E(Xi) = µ ⟺ X̄n →a.s. µ.

Theorem 7.2.9 [Khintchine's Theorem]

If Xi's are iid random variables, the only necessary condition for the LLN to hold is that E(Xi), i = 1, 2, . . . , should exist. That is, E(Xi) = µ < ∞ ⟹ X̄n →p µ.

7.3 Central Limit Theorems


If the sequence of mgfs corresponding to a sequence of random variables converges to the mgf of a random variable, can we say that convergence in distribution holds? The answer is affirmative, and the following theorem makes this precise.

Theorem 7.3.10 [Continuity Theorem]
Let {Fn} be a sequence of distribution functions corresponding to the sequence of random variables {Xn}. Suppose MXn(t) = E(e^{tXn}) exists for |t| < t0, for every n. Let F be a distribution function corresponding to a random variable X whose mgf MX(t) exists for |t| ≤ t1 < t0. If MXn(t) → MX(t) as n → ∞ for every |t| ≤ t1, then
$$X_n \xrightarrow{d} X.$$

Now, we prove one of the most well-known and useful theorems in probability and statistics, namely the central limit theorem.

Theorem 7.3.11 [Central Limit Theorem (CLT)]


Let X1, X2, . . . be a sequence of iid random variables with mean µ and variance σ² < ∞. Then the limiting distribution of $\dfrac{\bar{X}_n - \mu}{\sigma/\sqrt{n}}$ is N(0, 1). That is,
$$\frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \xrightarrow{d} N(0, 1) \quad \text{as } n \to \infty.$$

Proof. Consider
$$\begin{aligned}
M_{\frac{\sqrt{n}(\bar{X}_n-\mu)}{\sigma}}(t) &= E\left[e^{\frac{\sqrt{n}\,t(\bar{X}_n-\mu)}{\sigma}}\right] = E\left[e^{\frac{\sqrt{n}\,t}{\sigma}\left(\frac{1}{n}(X_1+\cdots+X_n)-\mu\right)}\right]\\
&= E\left[e^{\frac{t}{\sqrt{n}\,\sigma}\left((X_1-\mu)+\cdots+(X_n-\mu)\right)}\right] \qquad (7.3.1)\\
&= E\left[e^{\frac{t}{\sqrt{n}\,\sigma}(X_1-\mu)}\right]\cdots E\left[e^{\frac{t}{\sqrt{n}\,\sigma}(X_n-\mu)}\right]\\
&= \prod_{i=1}^{n} M_{X_i-\mu}\!\left(\frac{t}{\sqrt{n}\,\sigma}\right). \qquad (7.3.2)
\end{aligned}$$
Note that E(Xi − µ) = 0 and E(Xi − µ)² = σ², for all i = 1, 2, . . . , n. Now, consider
$$\begin{aligned}
M_{X_i-\mu}\!\left(\frac{t}{\sqrt{n}\,\sigma}\right) &= E\left[e^{\frac{t}{\sqrt{n}\,\sigma}(X_i-\mu)}\right]\\
&= E\left[1 + \frac{t}{\sqrt{n}\,\sigma}(X_i-\mu) + \frac{t^2}{2!\,n\sigma^2}(X_i-\mu)^2 + \frac{t^3}{3!\,n\sqrt{n}\,\sigma^3}(X_i-\mu)^3 + \cdots\right]\\
&= 1 + \frac{t}{\sqrt{n}\,\sigma}E(X_i-\mu) + \frac{t^2}{2n\sigma^2}E(X_i-\mu)^2 + \frac{t^3}{6n\sqrt{n}\,\sigma^3}E(X_i-\mu)^3 + \cdots\\
&= 1 + \frac{t^2/2}{n} + O\!\left(n^{-3/2}\right).
\end{aligned}$$
From (7.3.2), we get
$$M_{\frac{\sqrt{n}(\bar{X}_n-\mu)}{\sigma}}(t) = \left(1 + \frac{t^2/2}{n} + O\!\left(n^{-3/2}\right)\right)^{n} \to e^{t^2/2} \quad \text{as } n \to \infty.$$
Hence, by the continuity theorem, we get
$$\frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \xrightarrow{d} N(0, 1) \quad \text{as } n \to \infty.$$
This proves the result.
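A quick numerical illustration of the CLT (the Exp(2) population and the evaluation point z = 1 are assumptions chosen for illustration): the empirical cdf of √n(X̄n − µ)/σ is compared with the standard normal cdf.

```python
import math
import random

random.seed(7)
lam = 2.0                  # population: Exp(2), so mu = sigma = 1/2
mu = sigma = 1.0 / lam
n, reps = 50, 20_000

def phi(z):
    # Standard normal cdf via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

z = 1.0
count = 0
for _ in range(reps):
    xbar = sum(random.expovariate(lam) for _ in range(n)) / n
    if math.sqrt(n) * (xbar - mu) / sigma <= z:
        count += 1

print(count / reps, phi(z))  # both near 0.84
```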
Remark 7.3.1. (i) In the CLT, there is no restriction on the Xi's being discrete or continuous.

(ii) In practice, n ≥ 30 is considered a large sample (though NOT always). If the original distribution is close to normal then, even for smaller n, the approximation may be good.

(iii) Note that the cdf of the sequence of random variables $\frac{\sqrt{n}(\bar{X}_n-\mu)}{\sigma}$ converges to the cdf of the standard normal distribution, that is,
$$F_n := F_{\frac{\sqrt{n}(\bar{X}_n-\mu)}{\sigma}} \to F_Z \quad \text{as } n \to \infty.$$


(Figure: the cdfs F1, F2, F3, . . . approaching FZ; panel (a) for the discrete case, panel (b) for the continuous case.)

A special case of the CLT under Bernoulli trials is known as the De Moivre–Laplace theorem. Formally, it is given as follows:

Theorem 7.3.12 [De Moivre–Laplace Theorem]

Let X1, X2, . . . , Xn be a sequence of n independent Bernoulli random variables with success probability p. Also, let Sn = Σ_{i=1}^{n} Xi. Then
$$\frac{S_n - np}{\sqrt{npq}} \xrightarrow{d} N(0, 1) \quad \text{as } n \to \infty.$$

Remark 7.3.2. Note that Sn ∼ B(n, p) in the above theorem.


Continuity Correction. Suppose X is an integer-valued random variable (such as binomial or Poisson). Then the following replacements are used when approximating by a continuous (normal) distribution:

Discrete            Continuous
X = c               c − 0.5 < X < c + 0.5
X > c               X > c + 0.5
X ≥ c               X > c − 0.5
X < c               X < c − 0.5
X ≤ c               X < c + 0.5
a ≤ X ≤ b           a − 0.5 < X < b + 0.5

Note that the approximate probability is computed with an extra area (shaded in blue) while the area outside the normal curve (shaded in red) is left out. These two areas are approximately equal.
Example 7.3.1. Two fair dice are rolled 600 times. Let X denote the number of times a total of 7 occurs. Use the CLT to find P(90 ≤ X ≤ 110).

Solution. Note that X ∼ B(600, 1/6), µ = np = 100 and σ = √(npq) = √(500/6) ≈ 9.13. Using the continuity correction,
$$P(90 \le X \le 110) \approx P\left(\frac{89.5-100}{\sqrt{500/6}} < Z < \frac{110.5-100}{\sqrt{500/6}}\right) = P(-1.15 < Z < 1.15) = 0.8749 - 0.1251 = 0.7498.$$

Example 7.3.2. Let a random sample of size 54 be taken from a discrete distribution with pmf
$$p_X(x) = \frac{1}{3}, \quad x = 2, 4, 6.$$
Find the probability that the sample mean will lie between 4.1 and 4.4.
Solution. Note that µ = 4 and σ² = 8/3. Therefore, the required probability is
$$P\left(4.1 \le \bar{X}_{54} \le 4.4\right) \approx P\left(\frac{\sqrt{54}(4.1-4)}{\sqrt{8/3}} \le Z \le \frac{\sqrt{54}(4.4-4)}{\sqrt{8/3}}\right) = P(0.45 \le Z \le 1.8) = \Phi(1.8) - \Phi(0.45) = 0.9641 - 0.6736 = 0.2905.$$

Now, we present other central limit theorems (for the non-iid case) that are useful in many applications in the literature.

Theorem 7.3.13 [Lindeberg CLT]


Let X1, X2, . . . be a sequence of independent random variables with mean E(Xn) = µn and 0 < Var(Xn) = σn² < ∞. Let
$$S_n = \sum_{i=1}^{n} X_i \quad \text{and} \quad s_n^2 = \mathrm{Var}(S_n) = \sum_{i=1}^{n} \sigma_i^2.$$
If for all ε > 0,
$$\lim_{n\to\infty} \frac{1}{s_n^2} \sum_{i=1}^{n} E\left[(X_i-\mu_i)^2\, I\{|X_i-\mu_i| \ge \varepsilon s_n\}\right] = 0 \quad \text{(Lindeberg condition)}$$
then
$$\frac{S_n - E(S_n)}{s_n} \xrightarrow{d} N(0, 1) \quad \text{as } n \to \infty.$$

Theorem 7.3.14 [Lyapunov CLT]

Let X1, X2, . . . be a sequence of independent random variables with means E(Xn) = µn and variances 0 < Var(Xn) = σn² < ∞. Let

Sn = Σ_{i=1}^{n} Xi and sn² = Var(Sn) = Σ_{i=1}^{n} σi².

If there exists δ > 0 such that

lim_{n→∞} (1/sn^{2+δ}) Σ_{i=1}^{n} E(|Xi − µi|^{2+δ}) = 0   (Lyapunov condition)

then

(Sn − E(Sn))/sn → N(0, 1) in distribution as n → ∞.

Theorem 7.3.15 [CLT Under Weak Dependence]

Suppose that X1, X2, . . . is stationary and α-mixing with αn = O(n⁻⁵), and that E(Xn) = 0 and E(Xn¹²) < ∞. If Sn = Σ_{i=1}^{n} Xi, then

Var(Sn)/n → σ² = E(X1²) + 2 Σ_{k=1}^{∞} E(X1 X_{1+k}),

where the series converges absolutely. If σ > 0, then

Sn/(σ√n) → N(0, 1) in distribution as n → ∞.

Remark 7.3.3. Define

α(t1, t2) = sup{ |P(A ∩ B) − P(A)P(B)| : A ∈ σ(Xs, s ≤ t1) and B ∈ σ(Xs, s ≥ t2) }.

If the process is stationary, then α(t1, t2) = α(|t1 − t2|) = α(τ). If α(τ) → 0 as τ → ∞, then the process is called α-mixing.

Theorem 7.3.16 [Martingale CLT]
Suppose X1, X2, . . . is a martingale whose differences Yn = Xn − Xn−1 (with Y1 = X1) are uniformly bounded and satisfy

E(Yn | Fn−1) = 0.

Let

σn² = E(Yn² | Fn−1) and νt = min{ n : Σ_{k=1}^{n} σk² > t }.

Assume that Σ_{n=1}^{∞} σn² = ∞ with probability 1. Then

X_{νt}/√t → N(0, 1) in distribution as t → ∞.

Theorem 7.3.17 [Multidimensional CLT]

Let Xn = (Xn1, . . . , Xnk) be independent random vectors, all having the same distribution. Suppose that E(Xnu²) < ∞ for each component u; let the vector of means be µ = (µ1, . . . , µk), where µi = E(Xni), and let the covariance matrix be Σ = (σij), where σij = E[(Xni − µi)(Xnj − µj)]. If Sn = Σ_{i=1}^{n} Xi, then

(Sn − nµ)/√n → N(0, Σ) in distribution as n → ∞.

7.4 Exercises
1. Check the convergence in distribution for the following sequence of distribution functions:

Fn(x) = 0 for x < n, and Fn(x) = 1 for x ≥ n.

2. Show that WLLN holds for a sequence of independent N (µ, σ 2 ) variates.


3. Show that the following sequence obeys the WLLN:

P(Xk = k^{−1/2}) = 2/3 and P(Xk = −k^{−1/2}) = 1/3.

4. A coin is tossed 10 times. Find the probability of 3, 4 or 5 heads using the normal approximation.
5. If X ∼ B(25, 0.5), find the approximate value of P(X ≤ 12).
6. A polling agency wishes to take a sample of voters in a given state large enough that the probability is only 0.01 that they will find the proportion favouring a certain candidate to be less than 50% when in fact it is 52%. How large a sample should be taken?

7. Let a random sample of size 105 be taken from a distribution whose pmf is given by

pX(x) = 1/6, x = 1, 2, 3, 4, 5, 6.

Find the probability that the sample mean lies between 2.15 and 3.15.

8. Two dice are thrown. Let X be the sum of the numbers showing up. Prove that P(|X − 7| ≥ 3) ≤ 35/54. Compare this bound with the actual probability.



Chapter 8

Statistics and Sampling Distributions

Sampling distribution in statistics refers to studying many random samples collected from a given pop-
ulation based on a specific attribute. The results obtained provide a clear picture of variations in the
probability of the outcomes derived. In this chapter, we first discuss elementary definitions in statis-
tics. Further, we study three most useful distributions (chi-square, t-distribution and F -distribution) in
statistics.

8.1 Random Sample

Definition 8.1.1 [Population]


A population is a collection of measurements.

Definition 8.1.2 [Sample]


A subset of a population is called a sample.

[Figure: a sample shown as a subset of the population.]

Example 8.1.1. Suppose you are cooking rice in a pressure cooker. How will you check whether the rice is properly cooked or not? We take 4-5 grains of rice and press them to check whether they are properly cooked. If these grains are properly cooked, then we say the rice in the pressure cooker is properly cooked. So, if the rice in the pressure cooker is the population and the 4-5 grains that we have checked are the sample, then we are predicting about the population based on the sample. This plays a crucial role in practice, as we make decisions about a population (all of whose elements we cannot examine) based on a sample. Similar reasoning applies to other examples such as blood tests, heights of people, and longevity of people.

Definition 8.1.3 [Random Sample]


Let X1, X2, . . . , Xn be n independent and identically distributed (iid) random variables, each having the same distribution f(·). Then we say that (X1, X2, . . . , Xn) is a random sample from the population, with joint distribution

f(x1, x2, . . . , xn) = f(x1) f(x2) · · · f(xn) = Π_{i=1}^{n} f(xi).
i=1

Definition 8.1.4 [Statistic]


A function of random sample

T = T (X1 , X2 , . . . , Xn )

is called a statistic.

Definition 8.1.5 [Sample Mean]


The sample mean is a statistic given by

X̄n = (1/n) Σ_{i=1}^{n} Xi.

Definition 8.1.6 [Sample Variance]


The sample variance is a statistic given by

s² = (1/(n − 1)) Σ_{i=1}^{n} (Xi − X̄n)².

Remark 8.1.1. (i) The following statistics are also useful in practice:

(a) Sample range = X(n) − X(1), where

X(n) = max{X1, X2, . . . , Xn} and X(1) = min{X1, X2, . . . , Xn}.

(b) The sample median is defined as

Sample median = X((n+1)/2) if n is odd, and (X(n/2) + X(n/2+1))/2 if n is even.

(ii) The probability distribution of a statistic is called a sampling distribution.

(iii) The quantities X̄n and s² are called sample parameters. Also, µ and σ² are called population parameters.
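The statistics defined above are easy to compute directly; a small sketch on a hypothetical sample (the data values are made up for illustration):

```python
import statistics

x = [4.2, 3.9, 5.1, 4.8, 4.0, 5.5, 4.4]   # hypothetical sample, n = 7
n = len(x)

sample_mean = sum(x) / n
# Sample variance with the 1/(n-1) factor, as in Definition 8.1.6.
sample_var = sum((xi - sample_mean) ** 2 for xi in x) / (n - 1)
sample_range = max(x) - min(x)
sample_median = statistics.median(x)       # middle order statistic (n is odd)

print(sample_mean, sample_var, sample_range, sample_median)
```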
Theorem 8.1.1
iid
Let X1 , . . . , Xn ∼ N (µ, σ 2 ), Ui = Xi − X̄n , for i = 1, 2, . . . , n, and U = (U1 , . . . , Un ). Then,
X̄n and U are independent.

Proof. Let Y = X̄n. We will show that

M_{Y,U}(s, t) = M_Y(s) M_U(t) for all s and t = (t1, . . . , tn).

Define t̄ = (1/n) Σ_{i=1}^{n} ti. It is known that

M_Y(s) = exp(µs + σ²s²/(2n)).

Note that

M_U(t) = E[exp(Σ ti Ui)] = E[exp(Σ ti (Xi − X̄n))]
= E[exp(Σ ti Xi − n t̄ X̄n)] = E[exp(Σ Xi (ti − t̄))]
= Π_{i=1}^{n} E[exp(Xi (ti − t̄))] = Π_{i=1}^{n} M_{Xi}(ti − t̄)
= Π_{i=1}^{n} exp(µ(ti − t̄) + σ²(ti − t̄)²/2) = exp((σ²/2) Σ (ti − t̄)²),

where the µ-terms vanish because Σ (ti − t̄) = 0. Now consider

M_{Y,U}(s, t) = E[exp(sY + Σ ti Ui)] = E[exp((s/n) Σ Xi + Σ Xi (ti − t̄))]
= E[exp(Σ Xi (ti − t̄ + s/n))] = Π_{i=1}^{n} M_{Xi}(ti − t̄ + s/n)
= Π_{i=1}^{n} exp(µ(ti − t̄ + s/n) + σ²(ti − t̄ + s/n)²/2)
= exp(µs + σ²s²/(2n) + (σ²/2) Σ (ti − t̄)²) = M_Y(s) M_U(t).

Hence, Y = X̄n and U are independently distributed.

Corollary 8.1.1
iid
Let X1 , . . . , Xn ∼ N (µ, σ 2 ). Then, X̄n and s2 are independent.

8.2 Chi-square Distribution


It was initially discovered by Helmert in 1875 and was again defined independently in 1900 by Karl Pearson, who gave the definition in terms of the normal distribution. The formal definition can be given as follows:

Definition 8.2.7 [Chi-square Distribution]

A continuous random variable W is said to have a chi-square (χ²) distribution with n degrees of freedom if it has pdf given by

fW(w) = (1/(2^{n/2} Γ(n/2))) e^{−w/2} w^{n/2 − 1}, w > 0, n > 0,   (8.2.1)

or, equivalently, the distribution of the sum of squares of n independent standard normal variables is known as the χ²-distribution.

Remark 8.2.1. (i) From (8.2.1), it can be easily verified that W ∼ G(n/2, 1/2).

(ii) If W has a chi-square distribution with n degrees of freedom, then we write W ∼ χ²n.

[Figure: chi-square pdfs for n = 2, 4, 6, 8.]

Theorem 8.2.2
Let W ∼ χ²n. Then

(a) E(W) = n.

(b) Var(W) = 2n.

(c) MW(t) = (1 − 2t)^{−n/2}, for t < 1/2.

(d) β1 = √(8/n) > 0 (positively skewed).

(e) β2 = 12/n > 0 (leptokurtic).

Proof. Substitute r = n/2 and λ = 1/2 in Theorem 4.3.3 and Corollary 4.3.2; the result follows.
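Parts (a) and (b), together with the sum-of-squares characterization, can be checked by simulation; a sketch (replication count and seed are arbitrary):

```python
import random
import statistics

random.seed(0)
n = 6
# W = sum of n squared N(0,1) variables, i.e. a chi-square(n) draw.
samples = [sum(random.gauss(0, 1) ** 2 for _ in range(n)) for _ in range(50000)]

print(statistics.mean(samples))      # should be near E(W) = n = 6
print(statistics.variance(samples))  # should be near Var(W) = 2n = 12
```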
Further, we prove some useful properties for χ2 −distribution.

Property 8.2.1
If Wi ∼ χ²_{ni}, for i = 1, 2, . . . , k, and W1, . . . , Wk are independent, then

Σ_{i=1}^{k} Wi ∼ χ² with Σ_{i=1}^{k} ni degrees of freedom.

Proof. We know that Wi ∼ χ²_{ni}. This implies

M_{Wi}(t) = (1 − 2t)^{−ni/2}, for i = 1, 2, . . . , k.

Since the Wi are independent,

M_{ΣWi}(t) = Π_{i=1}^{k} M_{Wi}(t) = Π_{i=1}^{k} (1 − 2t)^{−ni/2} = (1 − 2t)^{−Σni/2}.

This proves the result.

Property 8.2.2
If Xi ∼ N(µi, σi²), for i = 1, 2, . . . , k, and the Xi are independent, then

Σ_{i=1}^{k} ((Xi − µi)/σi)² ∼ χ²k.

Proof. Given Xi ∼ N(µi, σi²), for i = 1, 2, . . . , k, we have

(Xi − µi)/σi ∼ N(0, 1)
⇒ ((Xi − µi)/σi)² ∼ χ²1 (using Example 5.2.6)
⇒ Σ_{i=1}^{k} ((Xi − µi)/σi)² ∼ χ²k (using Property 8.2.1).

Property 8.2.3
If X1, . . . , Xn are iid N(µ, σ²), then

X̄n = (1/n) Σ_{i=1}^{n} Xi ∼ N(µ, σ²/n).

Proof. Note that

M_{X̄n}(t) = E[e^{tX̄n}] = E[e^{(t/n) Σ Xi}]
= E[e^{(t/n)X1} e^{(t/n)X2} · · · e^{(t/n)Xn}]
= Π_{i=1}^{n} M_{Xi}(t/n) = [M_{X1}(t/n)]^{n}
= [exp(µt/n + σ²t²/(2n²))]^{n}   (since M_{X1}(t) = exp(µt + σ²t²/2))
= exp(µt + (σ²/n) t²/2).

This proves the result.

Property 8.2.4
If X1, . . . , Xn are iid N(µ, σ²), then

(n − 1)s²/σ² ∼ χ²_{n−1}.

Proof. Consider

Σ_{i=1}^{n} (Xi − µ)² = Σ_{i=1}^{n} (Xi − X̄n + X̄n − µ)²
= Σ (Xi − X̄n)² + n(X̄n − µ)² + 2(X̄n − µ) Σ (Xi − X̄n)
= Σ (Xi − X̄n)² + n(X̄n − µ)²,

since Σ (Xi − X̄n) = 0. This implies

(1/σ²) Σ (Xi − µ)² = (1/σ²) Σ (Xi − X̄n)² + n(X̄n − µ)²/σ².

Let

W = (1/σ²) Σ (Xi − µ)², W1 = (1/σ²) Σ (Xi − X̄n)² and W2 = n(X̄n − µ)²/σ².

Note that Xi ∼ N(µ, σ²) implies (Xi − µ)/σ ∼ N(0, 1), so ((Xi − µ)/σ)² ∼ χ²1 and hence

Σ_{i=1}^{n} ((Xi − µ)/σ)² ∼ χ²n.

Therefore,

W ∼ χ²n ⇒ MW(t) = (1 − 2t)^{−n/2}.

Further, from Property 8.2.3, we have

X̄n ∼ N(µ, σ²/n) ⇒ (X̄n − µ)/(σ/√n) ∼ N(0, 1) ⇒ n(X̄n − µ)²/σ² ∼ χ²1.

Therefore,

W2 ∼ χ²1 ⇒ M_{W2}(t) = (1 − 2t)^{−1/2}.

From Corollary 8.1.1, it can be easily verified that W1 and W2 are independent. Therefore,

MW(t) = M_{W1}(t) M_{W2}(t) ⇒ M_{W1}(t) = (1 − 2t)^{−n/2}/(1 − 2t)^{−1/2} = (1 − 2t)^{−(n−1)/2}.

Therefore,

W1 ∼ χ²_{n−1}
⇒ ((n − 1)/(σ²(n − 1))) Σ_{i=1}^{n} (Xi − X̄n)² ∼ χ²_{n−1}
⇒ (n − 1)s²/σ² ∼ χ²_{n−1}.

This proves the result.
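This property can be illustrated by simulation; a sketch (parameters and seed are arbitrary) checks that (n − 1)s²/σ² has the χ²_{n−1} mean and variance:

```python
import random
import statistics

random.seed(2)
n, mu, sigma = 8, 5.0, 3.0
w = []
for _ in range(40000):
    xs = [random.gauss(mu, sigma) for _ in range(n)]
    s2 = statistics.variance(xs)          # the 1/(n-1) sample variance
    w.append((n - 1) * s2 / sigma ** 2)

print(statistics.mean(w))      # should be near n - 1 = 7
print(statistics.variance(w))  # should be near 2(n - 1) = 14
```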

[Figure: chi-square density with the upper-tail point χ²_{n,α} marked.]

The table is for P(χ²n > χ²_{n,α}) = α, where χ²n and χ²_{n,α} denote the chi-square variable and the corresponding upper-tail point (see the figure above), respectively.
Example 8.2.1. Let us find some probabilities related to the χ²-distribution.
(a) P(χ²6 > 1.635) = 0.95 (check the table).
(b) P(χ²14 < 4.660) = 1 − P(χ²14 ≥ 4.660) = 1 − 0.99 = 0.01.
(c) P(χ²n ≥ 2.088) = 0.990 ⇒ n = 9.
Example 8.2.2. Let Z1, Z2, . . . , Zn be independent standard normal random variables. Show that
(a) Zi² ∼ χ²1, for all i = 1, 2, . . . , n.
(b) Σ_{i=1}^{n} Zi² ∼ χ²n.

Solution.
(a) It follows from Example 5.2.6.
(b) Using part (a) together with Property 8.2.1, it can be easily proved.

8.3 Student t-distribution


The distribution was defined by William Sealy Gosset in 1908 under the pseudonym "Student". Gosset's employer preferred staff to use pen names when publishing scientific papers instead of their real names, so he used the name "Student" to hide his identity. Gosset worked at the Guinness Brewery in Dublin, Ireland, and was interested in the problems of small samples; for example, the chemical properties of barley, where sample sizes might be as few as 3. Gosset's paper refers to the distribution as the "frequency distribution of standard deviations of samples drawn from a normal population". It became well known through the work of Ronald Fisher, who called it "Student's distribution" and represented the test value with the letter t. The formal definition can be given as follows:

Definition 8.3.8 [Student t-distribution]

A continuous random variable T is said to have a Student t-distribution with n degrees of freedom if it has pdf given by

fT(t) = (1/(√n B(n/2, 1/2))) (1 + t²/n)^{−(n+1)/2}, −∞ < t < ∞.

Remark 8.3.1. (i) If T has a Student t-distribution with n degrees of freedom, then we write T ∼ tn.

(ii) Student's t-distribution is symmetric about t = 0, but its tails are heavier than those of the normal distribution.

Theorem 8.3.3
Let X and Y be two independent random variables such that X ∼ N(0, 1) and Y ∼ χ²n. Then

T = X/√(Y/n) = √n X/√Y ∼ tn.

Proof. The pdfs of X and Y are given by

fX(x) = (1/√(2π)) e^{−x²/2} and fY(y) = (1/(2^{n/2} Γ(n/2))) e^{−y/2} y^{n/2 − 1},

respectively. Therefore, the joint pdf of X and Y is

f_{X,Y}(x, y) = fX(x) fY(y) = (1/(2^{n/2} √(2π) Γ(n/2))) e^{−x²/2} e^{−y/2} y^{n/2 − 1}.

Consider the transformation T = √n X/√Y and T′ = Y. The inverse transformation is

x = t√(t′)/√n and y = t′.

So, the Jacobian is

J = det[ ∂x/∂t, ∂x/∂t′ ; ∂y/∂t, ∂y/∂t′ ] = det[ √(t′/n), t/(2√(nt′)) ; 0, 1 ] = √(t′/n).

The joint pdf of T and T′ is given by

f_{T,T′}(t, t′) = (1/(2^{(n+1)/2} √(nπ) Γ(n/2))) e^{−(t′/2)(1 + t²/n)} t′^{(n+1)/2 − 1}, t′ > 0, t ∈ R.

Therefore, the pdf of T is

fT(t) = ∫₀^∞ f_{T,T′}(t, t′) dt′
= (1/(2^{(n+1)/2} √(nπ) Γ(n/2))) ∫₀^∞ e^{−(t′/2)(1 + t²/n)} t′^{(n+1)/2 − 1} dt′.

Substituting y = (t′/2)(1 + t²/n), this becomes

fT(t) = (1/(√(nπ) Γ(n/2))) (1 + t²/n)^{−(n+1)/2} ∫₀^∞ e^{−y} y^{(n+1)/2 − 1} dy
= (Γ((n+1)/2)/(√(nπ) Γ(n/2))) (1 + t²/n)^{−(n+1)/2}
= (1/(√n B(n/2, 1/2))) (1 + t²/n)^{−(n+1)/2}, −∞ < t < ∞.

This proves the result.

Theorem 8.3.4
If T ∼ tn then, for k < n,

E(T^k) = 0 if k is odd, and E(T^k) = n^{k/2} k! Γ((n − k)/2) / (2^k (k/2)! Γ(n/2)) if k is even.

Proof. Note that if X ∼ N(0, 1) and Y ∼ χ²n are independent, then

T = X/√(Y/n) = √n X/√Y

follows a t-distribution with n degrees of freedom. Since X and Y are independent, we have

E(T^k) = n^{k/2} E(X^k) E(Y^{−k/2}).

We know that

E(X^k) = 0 if k is odd, and E(X^k) = 1 · 3 · · · (k − 1) = k!/((k/2)! 2^{k/2}) if k is even,

and, for even k < n,

E(Y^{−k/2}) = 2^{−k/2} Γ((n − k)/2)/Γ(n/2).

Combining these gives the stated formula. This proves the result.

Corollary 8.3.2
Let T ∼ tn. Then

(a) E(T) = 0.

(b) Var(T) = n/(n − 2), for n > 2.

(c) β1 = 0.

(d) β2 = 6/(n − 4) > 0, for n > 4.

Theorem 8.3.5
Let X1, . . . , Xn be iid N(µ, σ²). Then

√n(X̄n − µ)/s ∼ t_{n−1}.

Proof. We know that

X̄n ∼ N(µ, σ²/n) ⇒ √n(X̄n − µ)/σ ∼ N(0, 1)

and

(n − 1)s²/σ² ∼ χ²_{n−1}.

Since X̄n and s² are independent (Corollary 8.1.1), Theorem 8.3.3 gives

(√n(X̄n − µ)/σ) / √( ((n − 1)s²/σ²)/(n − 1) ) = √n(X̄n − µ)/s ∼ t_{n−1}.

This proves the result.

Theorem 8.3.6
Let T ∼ tn. As n → ∞, the pdf of T converges to

φ(t) = (1/√(2π)) e^{−t²/2}.

That is, the t-distribution converges to the standard normal distribution as n → ∞.

Proof. It is known that

fT(t) = (1/(√n B(n/2, 1/2))) (1 + t²/n)^{−(n+1)/2}, −∞ < t < ∞.

Note that

(1 + t²/n)^{−(n+1)/2} → e^{−t²/2} as n → ∞.

For the constant, recall Stirling's formula

Γ(p + 1) ≈ √(2π) e^{−p} p^{p + 1/2}, for large p.

Writing 1/(√n B(n/2, 1/2)) = Γ((n + 1)/2)/(√(nπ) Γ(n/2)) and applying Stirling's formula to Γ((n + 1)/2) = Γ((n − 1)/2 + 1) and Γ(n/2) = Γ((n − 2)/2 + 1), the ratio simplifies and tends to 1/√(2π) as n → ∞.

This proves the result.
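The convergence of the normalizing constant can be seen numerically by evaluating the tn density at t = 0; a sketch (log-gamma is used to avoid overflow for large n):

```python
import math

def t_pdf_at_zero(n):
    """f_T(0) = Gamma((n+1)/2) / (sqrt(n*pi) * Gamma(n/2)), via lgamma."""
    log_val = (math.lgamma((n + 1) / 2) - math.lgamma(n / 2)
               - 0.5 * math.log(n * math.pi))
    return math.exp(log_val)

for n in (1, 5, 30, 1000):
    print(n, round(t_pdf_at_zero(n), 5))

# Limit: phi(0) = 1/sqrt(2*pi), about 0.39894.
print(round(1 / math.sqrt(2 * math.pi), 5))
```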

The table is for P(tn > t_{n,α}) = α, where tn and t_{n,α} denote the t-variable and the corresponding upper-tail point (see the figure below), respectively.

[Figure: t density with the upper-tail point t_{n,α} marked.]

Example 8.3.1. Let us find some probabilities related to t−distribution.

(a) P (t10 > 2.764) = 0.01 (check the table).

(b) P (t14 ≤ 2.264) = 1 − P(t14 > 2.264) = 1 − 0.02 = 0.98.

(c) P (tn ≥ 1.753) = 0.05 =⇒ n = 15.

8.4 F-distribution
The F-distribution, also known as Snedecor's F-distribution, is a continuous probability distribution that arises frequently as the null distribution of a test statistic, most notably in the Analysis of Variance (ANOVA) and other F-tests. The formal definition can be given as follows:

Definition 8.4.9 [F-distribution]

A continuous random variable F is said to have an F-distribution with (m, n) degrees of freedom if it has pdf given by

fF(u) = ((m/n)^{m/2} u^{m/2 − 1}) / (B(m/2, n/2) (1 + (m/n)u)^{(m+n)/2}), u > 0.

Remark 8.4.1. If F follows an F-distribution with (m, n) degrees of freedom, then we write F ∼ F_{m,n}.
Most of the time, the following theorem is used to define the F-distribution.

Theorem 8.4.7
Let X and Y be independently distributed random variables such that X ∼ χ²m and Y ∼ χ²n. Then

F = (X/m)/(Y/n) = nX/(mY) ∼ F_{m,n}.


[Figure: F-distribution pdfs for (m, n) = (1, 1), (5, 2), (10, 10), (60, 50).]

Proof. Define the random variables F and W as functions of X and Y:

F = (X/m)/(Y/n) and W = Y,

so that the inverse functions, expressing X and Y in terms of F and W, are

x = (m/n) f w and y = w.

The Jacobian is

J = det[ ∂x/∂f, ∂x/∂w ; ∂y/∂f, ∂y/∂w ] = det[ (m/n)w, (m/n)f ; 0, 1 ] = (m/n)w.

Since X and Y are independent, their joint density is

f_{X,Y}(x, y) = fX(x) fY(y) = (1/(2^{(m+n)/2} Γ(m/2) Γ(n/2))) e^{−(x+y)/2} x^{m/2 − 1} y^{n/2 − 1}, x, y > 0.

The joint density of F and W is therefore

f_{F,W}(f, w) = |J| f_{X,Y}((m/n)fw, w)
= ((m/n)w/(2^{(m+n)/2} Γ(m/2) Γ(n/2))) e^{−((m/n)f + 1)w/2} ((m/n)fw)^{m/2 − 1} w^{n/2 − 1}
= ((m/n)^{m/2}/(2^{(m+n)/2} Γ(m/2) Γ(n/2))) e^{−((m/n)f + 1)w/2} f^{m/2 − 1} w^{(m+n)/2 − 1}.

The marginal density of F is obtained by integrating out w. Substituting y = ((m/n)f + 1)w/2,

fF(f) = ∫₀^∞ f_{F,W}(f, w) dw
= ((m/n)^{m/2} f^{m/2 − 1}/(Γ(m/2) Γ(n/2))) · (1 + (m/n)f)^{−(m+n)/2} ∫₀^∞ e^{−y} y^{(m+n)/2 − 1} dy
= ((m/n)^{m/2} Γ((m+n)/2)/(Γ(m/2) Γ(n/2))) · f^{m/2 − 1}/(1 + (m/n)f)^{(m+n)/2}
= ((m/n)^{m/2} f^{m/2 − 1}) / (B(m/2, n/2) (1 + (m/n)f)^{(m+n)/2}).

This proves the result.

Corollary 8.4.3
If U ∼ F_{m,n}, then 1/U ∼ F_{n,m}.

Theorem 8.4.8
Let F ∼ F_{m,n}. Then

E(F) = n/(n − 2), for n > 2, and Var(F) = 2n²(m + n − 2)/(m(n − 2)²(n − 4)), for n > 4.

Proof. Note that

E(F) = ∫₀^∞ u fF(u) du = ((m/n)^{m/2}/B(m/2, n/2)) ∫₀^∞ u^{m/2} (1 + (m/n)u)^{−(m+n)/2} du
= ((n/m)/B(m/2, n/2)) ∫₀^∞ t^{m/2} (1 + t)^{−(m+n)/2} dt   (by the change of variable t = (m/n)u)
= ((n/m)/B(m/2, n/2)) ∫₀^1 y^{n/2 − 1 − 1} (1 − y)^{m/2 + 1 − 1} dy   (by the change of variable t = (1 − y)/y)
= ((n/m)/B(m/2, n/2)) B(n/2 − 1, m/2 + 1)
= (n/m) · (Γ((m+n)/2) Γ(m/2 + 1) Γ(n/2 − 1)) / (Γ(m/2) Γ(n/2) Γ((m+n)/2))
= (n/m) · (m/2) · (2/(n − 2))
= n/(n − 2).

Next, observe that

E(F²) = ∫₀^∞ u² fF(u) du = ((m/n)^{m/2}/B(m/2, n/2)) ∫₀^∞ u^{m/2 + 1} (1 + (m/n)u)^{−(m+n)/2} du
= ((n/m)²/B(m/2, n/2)) ∫₀^∞ t^{m/2 + 1} (1 + t)^{−(m+n)/2} dt   (t = (m/n)u)
= ((n/m)²/B(m/2, n/2)) ∫₀^1 y^{n/2 − 2 − 1} (1 − y)^{m/2 + 2 − 1} dy   (t = (1 − y)/y)
= ((n/m)²/B(m/2, n/2)) B(n/2 − 2, m/2 + 2)
= (n/m)² · (Γ(m/2 + 2) Γ(n/2 − 2)) / (Γ(m/2) Γ(n/2))
= (n/m)² · (m(m + 2)/4) · (4/((n − 2)(n − 4)))
= n²(m + 2)/(m(n − 2)(n − 4)).

Hence,

Var(F) = E(F²) − [E(F)]² = n²(m + 2)/(m(n − 2)(n − 4)) − n²/(n − 2)² = 2n²(m + n − 2)/(m(n − 2)²(n − 4)).

This proves the result.


Following steps similar to the proof of the above theorem, the following can be easily proved.

Corollary 8.4.4
Let µ′k = E(F^k). Then, for n > 2k,

µ′k = (n/m)^k · (Γ(m/2 + k) Γ(n/2 − k)) / (Γ(m/2) Γ(n/2)).
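The mean formula can be spot-checked by simulating F as a ratio of independent chi-square variables, built here from squared normals (degrees of freedom, replication count and seed are arbitrary):

```python
import random
import statistics

random.seed(3)
m, n = 5, 10

def chi2(k):
    """One chi-square(k) draw as a sum of k squared N(0,1) variables."""
    return sum(random.gauss(0, 1) ** 2 for _ in range(k))

f_samples = [(chi2(m) / m) / (chi2(n) / n) for _ in range(60000)]
print(statistics.mean(f_samples))   # should be near n/(n-2) = 1.25
```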

Theorem 8.4.9
Let Xi ∼ N(µ1, σ1²) iid, for i = 1, 2, . . . , m, and Yj ∼ N(µ2, σ2²) iid, for j = 1, 2, . . . , n, with the two samples independent. Then

(σ2²/σ1²) · (s1²/s2²) ∼ F_{m−1, n−1},

where s1² = (1/(m−1)) Σ_{i=1}^{m} (Xi − X̄m)², s2² = (1/(n−1)) Σ_{j=1}^{n} (Yj − Ȳn)², X̄m = (1/m) Σ_{i=1}^{m} Xi and Ȳn = (1/n) Σ_{j=1}^{n} Yj.

Proof. We know that

(m − 1)s1²/σ1² ∼ χ²_{m−1} and (n − 1)s2²/σ2² ∼ χ²_{n−1}.

Therefore, from Theorem 8.4.7, we have

[ ((m − 1)s1²/σ1²)/(m − 1) ] / [ ((n − 1)s2²/σ2²)/(n − 1) ] = (σ2²/σ1²) · (s1²/s2²) ∼ F_{m−1, n−1}.

This proves the result.


For the F-distribution, the table is given for

P(F_{m,n} > f_{m,n,α}) = α.

[Figure: F density with the upper-tail point f_{m,n,α} marked.]
Observe that

α = P(F_{m,n} ≥ f_{m,n,α}) = P(1/F_{m,n} ≤ 1/f_{m,n,α}) = P(F_{n,m} ≤ 1/f_{m,n,α})
⇒ P(F_{n,m} > 1/f_{m,n,α}) = 1 − α.

We know that P(F_{n,m} > f_{n,m,1−α}) = 1 − α. Therefore,

f_{n,m,1−α} = 1/f_{m,n,α}.

Example 8.4.1. Let us find some probabilities related to F −distribution.

(a) P (F9,11 > 2.90) = 0.05 (check the table).

(b) P (F4,12 ≤ 4.12) = 1 − P (F4,12 > 4.12) = 1 − 0.025 = 0.975.

(c) P (Fm,n > 999.33) = 0.001 =⇒ m = 7, n = 2.

Example 8.4.2. Two independent random samples of sizes n1 = 7 and n2 = 13 are taken from normal populations with the same variance. What is the probability that the variance of the first sample will be at least 3 times as large as that of the second sample?
Solution. Given σ1² = σ2² = σ², n1 = 7 and n2 = 13. Therefore,

(σ2²/σ1²) · (s1²/s2²) = s1²/s2² ∼ F_{6,12}.

The required probability is

P(s1² ≥ 3s2²) = P(s1²/s2² ≥ 3) = P(F_{6,12} > 3) = 0.05.
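This answer can be cross-checked by direct simulation of the two sample variances (seed and replication count are arbitrary):

```python
import random
import statistics

random.seed(4)
reps, hits = 40000, 0
for _ in range(reps):
    x = [random.gauss(0.0, 1.0) for _ in range(7)]    # first sample, n1 = 7
    y = [random.gauss(0.0, 1.0) for _ in range(13)]   # second sample, n2 = 13
    if statistics.variance(x) >= 3 * statistics.variance(y):
        hits += 1

print(hits / reps)   # should be near P(F_{6,12} > 3) = 0.05
```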



Chapter 9

Estimation

In statistics, estimation refers to the process by which one makes inferences about a population, based
on information obtained from a sample. This is a process of guessing the underlying properties of
the population by observing the sample that has been taken from the population. The idea behind
this is to calculate and find out the approximate values of the population parameter on the basis of a
sample statistics. In order to determine the characteristic of data, we use point estimation or interval
estimation. A point estimate of a population parameter is a single value of a statistic. For example, the
sample mean X̄n is a point estimate of the population mean µ. Similarly, the sample variance s2 is a
point estimate of the population variance σ 2 . An interval estimate is defined by two numbers, between
which a population parameter is said to lie. For example, a < X̄n < b is an interval estimate of the
population mean µ. It indicates that the population mean is greater than a but less than b. In this chapter,
we will study these two types of estimation in detail.

9.1 Basic Definitions


We begin with the definition of statistical inference.

Definition 9.1.1 [Statistical Inference]


Statistical inference is the process of drawing conclusions about an underlying population based on a sample or subset of data.

Example 9.1.1. Consider the process of checking the sugar level in the human body. Here, we take a sample of 10 ml or 15 ml of blood. Based on this sample, we make a decision about the sugar level in the body. This process of drawing conclusions is called statistical inference.

Definition 9.1.2 [Parameters]


Parameters are functions of population observations.

Example 9.1.2. Mean (µ), Variance (σ 2 ), mode, median, correlation coefficient etc. are examples of
parameters.

Definition 9.1.3 [Parameter Space]
The parameter space is the space of possible parameters values and it is denoted by Θ.

Example 9.1.3. For a normal population,

Θ = {(µ, σ²) : −∞ < µ < ∞, σ > 0} ⊆ R × R⁺,

where R⁺ denotes the set of all positive real numbers.


Example 9.1.4. For a binomial population, the parameter space is given by

Θ = {(n, p) : 0 ≤ p ≤ 1, n ∈ N} ⊆ N × [0, 1].

Remark 9.1.1. Now onwards, since we are dealing with parameters, and therefore, we will denote a
density function (pdf or pmf) by f (x; θ1 , θ2 , . . . , θk ) or f (x; θ), where θ = (θ1 , θ2 , . . . , θk ). Here, θ
denotes the set of parameters. For example, θ = (µ, σ 2 ) for normal population and θ = (n, p) for
binomial population.
Now, let us brief the study of statistical inference in general.

Statistical inference divides into the theory of estimation and the testing of hypotheses; the theory of estimation further divides into the classical theory of estimation and the Bayesian theory of estimation.

In classical theory of estimation, we have density function f (x, θ), where x is the value of random
variable and θ is the set of unknown parameters. Here, we estimate the value of unknown parameters.
In Bayesian theory of estimation, we have density function f (x, θ), where x is the value of random
variable and θ is also a random variable. Here, we estimate the distribution of unknown parameters.
We will only work on some portion of classical theory of estimation in this course.

Definition 9.1.4 [Estimator]


An estimator is a statistic that estimates the population parameters.

In the classical theory of estimation, we have several methods to find estimators, and various properties to check whether an obtained estimator is good or not. The following summary lists methods of finding estimators and properties of estimators.

Amit Kumar 186 MA-202: Probability & Statistics


Chapter 9: Estimation
Classical Theory of Estimation

Methods to Find an Estimator:
(i) Method of Moments (MOM)
(ii) Method of Maximum Likelihood Estimation
(iii) Method of Minimum Variance
(iv) Method of Least Squares (Regression)
(v) Method of Minimum Chi-square (Discrete Population)

Properties of an Estimator:
(i) Unbiasedness
(ii) Consistency
(iii) Efficiency
(iv) Sufficiency
(v) Completeness

In this course, we will study only two methods, namely, the method of moments and the method of maximum likelihood estimation. Also, we will study two properties of estimators, namely, unbiasedness and consistency.
Remark 9.1.2. An estimator which satisfies all five properties is treated as a good estimator.
We first start with unbiased estimator in the following section.

9.2 Unbiased Estimator


An estimator of a given parameter is said to be unbiased if its expected value is equal to the true value
of the parameter. The formal definition can be given as follows.

Definition 9.2.5 [Unbiased Estimator]


A statistic T (X) is said to be an unbiased estimator of g(θ) if

E[T (X)] = g(θ).

Further, if E[T (X)] = g(θ) + b(θ), then we say T (X) is biased for g(θ) and b(θ) is called the bias
of T (X). Moreover, if E(T (X)) > g(θ) then it is called positive bias and if E(T (X)) < g(θ)
then it is called negative bias.

Example 9.2.1. Let X ∼ B(n, p), where n is known. Define

T(x) = x/n.

Then

E[T(X)] = E(X/n) = (1/n) E(X) = (1/n) · np = p.

This implies X/n is an unbiased estimator for p. Also, note that

E[X(X − 1)] = n(n − 1)p² ⇒ E[ X(X − 1)/(n(n − 1)) ] = p².

This implies X(X − 1)/(n(n − 1)) is an unbiased estimator for p². Moreover, observe that

σ² = Var(X) = np(1 − p) = n(p − p²)
= n E[ X/n − X(X − 1)/(n(n − 1)) ]
= E[ X(n − X)/(n − 1) ].

This implies X(n − X)/(n − 1) is an unbiased estimator for σ².
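These unbiasedness claims are exact identities and can be verified by summing over the binomial pmf; a sketch with hypothetical values n = 10, p = 0.3:

```python
from math import comb

n, p = 10, 0.3
pmf = [comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(n + 1)]

# E[X/n] should equal p, and E[X(X-1)/(n(n-1))] should equal p^2.
e_t1 = sum((x / n) * pmf[x] for x in range(n + 1))
e_t2 = sum((x * (x - 1) / (n * (n - 1))) * pmf[x] for x in range(n + 1))

print(e_t1, e_t2)
```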
Example 9.2.2. Let Xi ∼ N(µ, σ²), for i = 1, 2, . . . , n. Note that θ = (µ, σ²). We know that

X̄n = (1/n) Σ_{i=1}^{n} Xi and s² = (1/(n − 1)) Σ_{i=1}^{n} (Xi − X̄n)²,

and

X̄n ∼ N(µ, σ²/n) and (n − 1)s²/σ² ∼ χ²_{n−1}.

Note that

E(X̄n) = (1/n) Σ E(Xi) = (1/n) · nµ = µ

and

E[ (n − 1)s²/σ² ] = n − 1 ⇒ E(s²) = σ².

Hence, X̄n and s² are unbiased estimators for µ and σ², respectively.


Remark 9.2.1. (i) Unbiasedness means that the mean of all sample means equals the population mean. For instance, suppose we have four coins of weights 1, 2, 3, 4 grams; then the population mean is

(1 + 2 + 3 + 4)/4 = 2.5.

Now, suppose we select two coins by simple random sampling without replacement. Then we have C(4, 2) = 6 possible samples, and the mean of the sample means equals the population mean, as the following table shows.

Sample Values   Sample Mean
(1,2)           1.5
(1,3)           2
(1,4)           2.5
(2,3)           2.5
(2,4)           3
(3,4)           3.5
Mean            2.5
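The table above can be reproduced exactly with a few lines:

```python
from itertools import combinations

population = [1, 2, 3, 4]                     # coin weights in grams
pop_mean = sum(population) / len(population)  # 2.5

# All size-2 samples without replacement and their means.
sample_means = [sum(s) / 2 for s in combinations(population, 2)]
print(sample_means)
print(sum(sample_means) / len(sample_means))  # equals the population mean
```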

(ii) Sometimes an unbiased estimator may be absurd. For example, let X ∼ P(λ), g(λ) = e^{−3λ} and T(X) = (−2)^X. Then

E[T(X)] = Σ_{x=0}^{∞} T(x) P(X = x) = Σ_{x=0}^{∞} (−2)^x e^{−λ} λ^x/x!
= e^{−λ} Σ_{x=0}^{∞} (−2λ)^x/x! = e^{−λ} e^{−2λ} = e^{−3λ}.

This implies (−2)^X is an unbiased estimator for e^{−3λ}. But (−2)^X takes values 1, −2, 4, . . ., which are not close to 0 < e^{−3λ} < 1. So unbiasedness alone does not guarantee a good estimator.

(iii) Sometimes an unbiased estimator does not exist.

(iv) Infinitely many unbiased estimators may exist. For example, let X1, X2, . . . be a sequence of iid random variables with mean µ. Then each of

X1, (X1 + X2)/2, (X1 + X2 + X3)/3, . . . , (X1 + X2 + · · · + Xn)/n, . . .

has expectation µ and is therefore an unbiased estimator for µ.

If we have several unbiased estimators, how do we know which one is good? In this case, we compute the variance of each statistic, and the estimator with the smallest variance is the best among them; such an estimator is called an efficient estimator. Formally, if T1(X) and T2(X) are two unbiased estimators for g(θ) and

Var(T1) < Var(T2),

then T1 is better than T2. In general, if the estimators are not unbiased, then we compare the mean square error (MSE), defined for an estimator Ti of g(θ) as

MSE(Ti) = E[(Ti − g(θ))²], for i = 1, 2.

We say that T1 is better than T2 if

MSE(T1) ≤ MSE(T2).

Note that if Ti, i = 1, 2, are unbiased, then MSE(Ti) reduces to the variance of Ti.

9.3 Consistent Estimator
An estimator of a given parameter is said to be consistent if it converges in probability to the true value
of the parameter as the sample size tends to infinity. The formal definition can be given as follows.

Definition 9.3.6 [Consistent Estimator]


An estimator Tn = T (X) = T (X1 , . . . , Xn ) is said to be consistent for g(θ) if, for every ε > 0,

P (|Tn − g(θ)| > ε) → 0 as n → ∞.

In other words,
p
Tn −→ g(θ).

Theorem 9.3.1
If E(Tn) = γ(θn) → γ(θ) and Var(Tn) = σn² → 0 as n → ∞, then Tn is consistent for γ(θ).

Proof. By the triangle inequality,

|Tn − γ(θ)| ≤ |Tn − γ(θn)| + |γ(θn) − γ(θ)|.

Therefore, for n large enough that |γ(θn) − γ(θ)| < ε,

P(|Tn − γ(θ)| > ε) ≤ P(|Tn − γ(θn)| > ε − |γ(θn) − γ(θ)|)
≤ σn²/(ε − |γ(θn) − γ(θ)|)² → 0 as n → ∞,

where the last inequality is Chebyshev's inequality. This proves the result.


Example 9.3.1. Let Xi ∼ N(µ, σ²) iid, for i = 1, 2, . . . , n. Then

X̄n = (1/n) Σ_{i=1}^{n} Xi ∼ N(µ, σ²/n).

Note that

E(X̄n) = µ and Var(X̄n) = σ²/n → 0 as n → ∞.

Therefore, X̄n is a consistent estimator for µ. This can also be proved directly using the WLLN.

Next, we know that

(n − 1)s²/σ² ∼ χ²_{n−1},

and therefore E(s²) = σ² and

Var( (n − 1)s²/σ² ) = 2(n − 1) ⇒ Var(s²) = 2σ⁴/(n − 1) → 0 as n → ∞.

Hence, s² is a consistent estimator for σ².


Remark 9.3.1. (i) If Tn is a consistent estimator, an → 1 and bn → 0, then an Tn + bn is also consistent.
(ii) If Tn is a consistent estimator for γ(θ), then there are infinitely many consistent estimators; for
example,

Tn0 = ((n − a)/(n − b)) Tn = ((1 − a/n)/(1 − b/n)) Tn ,

p
which satisfies Tn0 −→ γ(θ) as n → ∞.

(iii) If Tn is a consistent estimator for γ(θ) and g is a continuous function, then g (Tn ) is a consistent
estimator of g(γ(θ)).

9.4 Method of Moment Estimator (MME)


The method of moments involves equating sample moments with theoretical moments. It can be defined
formally as follows.

Definition 9.4.7 [Method of Moment Estimator]


Let X1 , X2 , . . . , Xn be a random sample that follow a distribution which has k parameters, say,
θ = (θ1 , θ2 , . . . , θk ). Let µ0j denote jth non-central moment of X1 , that is, µ0j = E(X1j ). We write

µ01 = g1 (θ1 , θ2 , . . . , θk )
µ02 = g2 (θ1 , θ2 , . . . , θk )
..
.
0
µk = gk (θ1 , θ2 , . . . , θk ).

Suppose that the above system of equations has the solution

θ1 = h1 (µ01 , µ02 , . . . , µ0k )


θ2 = h2 (µ01 , µ02 , . . . , µ0k )
..
.
θk = hk (µ01 , µ02 , . . . , µ0k )

In the method of moments, we estimate θi by θ̂i = hi (α1 , α2 , . . . , αk ), for i = 1, 2, . . . , k, where
αj = (1/n) Σni=1 Xij , for j = 1, 2, . . . , k.

Remark 9.4.1. (i) Note that in the method of moments we set the sample moments equal to the
population moments, that is,

αj = (1/n) Σni=1 Xij = E(X1j ) = µ0j ,

or equivalently, the sample moments are unbiased estimators of the corresponding population moments.

(ii) In general, we cannot say that MMEs are unbiased estimators.

p
(iii) If αj −→ µ0j and the functions h1 , h2 , . . . , hk are continuous, then the θ̂i ’s are consistent estimators
for the θi ’s.

(iv) Two different distributions may have the same moments. Therefore, some parameters may have two
different estimators.

(v) The method of moment estimator is denoted by using a hat and MME in the subscript on the symbol;
for example, the MME for the population mean µ is denoted by µ̂MME .
iid
Example 9.4.1. Let Xi ∼ P(λ), for i = 1, 2, . . . , n and λ > 0. Find the MME for λ.
Solution. Note that µ01 = E(X1 ) = λ and hence

λ̂MME = α1 = (1/n) Σni=1 Xi = X̄n

is the MME for λ. Also, note that E(X̄n ) = λ and Var(X̄n ) = λ/n → 0 as n → ∞. Hence, X̄n is a
consistent estimator of λ.
iid
Example 9.4.2. Let Xi ∼ N (µ, σ 2 ), for i = 1, 2, . . . , n. Find the MME for µ and σ 2 .
Solution. Note that

µ01 = µ and µ02 = µ2 + σ 2 .

This implies

µ = µ01 and σ 2 = µ02 − (µ01 )2 .

Therefore, the MME for µ and σ 2 are given by

µ̂MME = α1 = (1/n) Σni=1 Xi = X̄n

and

σ̂ 2MME = α2 − α12 = (1/n) Σni=1 Xi2 − (X̄n )2 = (1/n) Σni=1 (Xi − X̄n )2 = ((n − 1)/n) s2 ,

respectively.
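The two formulas above translate directly into a computation on data; a minimal sketch (the simulated sample and the parameter values 5.0 and 2.0 are illustrative assumptions):

```python
import random

def mme_normal(xs):
    """MME for N(mu, sigma^2): mu_hat = alpha_1, sigma2_hat = alpha_2 - alpha_1^2."""
    n = len(xs)
    a1 = sum(xs) / n
    a2 = sum(x * x for x in xs) / n
    return a1, a2 - a1 * a1

rng = random.Random(42)
xs = [rng.gauss(5.0, 2.0) for _ in range(100_000)]
mu_hat, sigma2_hat = mme_normal(xs)
```

Note that sigma2_hat equals (1/n) Σ(xi − x̄)², i.e. ((n − 1)/n)s², so the MME of σ² is slightly biased but still consistent, in line with Remark 9.4.1.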

9.5 Maximum Likelihood Estimator (MLE)
Maximum likelihood estimation (MLE) is a method of estimating the parameters of an assumed
probability distribution, given some observed data. This is achieved by maximizing a likelihood func-
tion so that, under the assumed statistical model, the observed data is most probable. For example,
consider an urn which contains a certain number of black and red balls. It is known that the numbers
of black and red balls are in the proportion 3 : 1 or 1 : 3. Let p be the proportion of black balls. Then

p = 1/4 or 3/4.

Let a random sample of size n be taken from the urn and let X denote the number of black balls in the
sample. Then

P (X = x) = C(n, x) px (1 − p)n−x , x = 0, 1, . . . , n and p = 1/4 or 3/4.

Taking n = 3, we have

x                        0       1       2       3
P (X = x; p = 1/4)     27/64   27/64    9/64    1/64
P (X = x; p = 3/4)      1/64    9/64   27/64   27/64

Note that when x = 0, 1, then p̂ = 1/4 is more likely, and when x = 2, 3, then p̂ = 3/4 is more likely.
The formal definition of MLE can be given as follows.

Definition 9.5.8 [Maximum Likelihood Estimator]


Let X1 , X2 , . . . , Xn be a random sample from a population with density function f (x; θ). Then,
the joint density of (X1 , X2 , . . . , Xn ) is given by
n
Y
f (x; θ) = f (xi ; θ) = L(x; θ), x = (x1 , . . . , xn ) ,
i=1

which is known as the likelihood function. A statistic θ̂(x) is said to be a maximum likelihood
estimator of θ if

L(x; θ̂) ≥ L(x; θ) for all θ ∈ Θ.

In other words, an MLE maximizes the likelihood function.

Remark 9.5.1. (i) An estimator is generally denoted by putting a hat on the symbol; for example,
an estimator for the population mean µ is denoted by µ̂.

(ii) The maximum likelihood estimator is denoted by a hat together with MLE in the subscript;
for example, the MLE for the population mean µ is denoted by µ̂MLE .

(iii) In general, MLE is not unique.

Working Procedure for MLE. The following are the steps to compute MLEs.

(a) Find the likelihood function L(x; θ).

(b) Find the log-likelihood function `(x; θ) = ln(L(x; θ)).

(c) Solve the likelihood equation

d`/dθ = 0

for θ.

(d) Check that d2 `/dθ2 < 0 at the solution.

Remark 9.5.2. (i) Note that the steps (a)-(d) maximize the likelihood function. Moreover, if the
likelihood function is not differentiable or the likelihood equation is not solvable, then we have to use
some other approach to maximize the likelihood function.

(ii) Since L(x; θ) > 0 and ln is a non-decreasing function, L(x; θ) and `(x; θ) attain their extreme
values at the same point. It can also be verified as

dL/dθ = 0 =⇒ (1/L) dL/dθ = 0 =⇒ d`/dθ = 0.

So, we can work with the log-likelihood function instead of the likelihood function to compute the
critical point. This is useful because the log-likelihood function is easier to handle compared to the
likelihood function.
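When no closed form is wanted, the maximizer of the log-likelihood can also be located numerically; the following sketch does this for a Poisson sample over a grid (the observed counts and the grid are hypothetical choices for illustration):

```python
import math

def poisson_loglik(lam, xs):
    """Poisson log-likelihood; the constant -sum(ln(x!)) is dropped
    since it does not affect the argmax."""
    return sum(x * math.log(lam) - lam for x in xs)

xs = [2, 3, 1, 4, 0, 2, 3, 5, 2, 3]          # hypothetical observed counts
grid = [k / 1000 for k in range(1, 10001)]   # search lambda over (0, 10]
lam_hat = max(grid, key=lambda lam: poisson_loglik(lam, xs))
```

Solving the likelihood equation d`/dλ = Σ xi /λ − n = 0 gives λ̂ = x̄, and the grid search lands on the same value, here 2.5.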
iid
Example 9.5.1. Let Xi ∼ N (µ, σ 2 ), for i = 1, 2, . . . , n. Find the MLE of µ and σ 2 .
Solution. The pdf of Xi is given by

f (xi ; µ, σ 2 ) = (1/(√(2π) σ)) exp(−(xi − µ)2 /(2σ 2 )), −∞ < xi , µ < ∞ and σ > 0.

Therefore, the likelihood function is

L(x; µ, σ 2 ) = Πni=1 (1/(√(2π) σ)) exp(−(xi − µ)2 /(2σ 2 )) = (2π)−n/2 σ −n exp(−(1/(2σ 2 )) Σni=1 (xi − µ)2 ).

This implies that the log-likelihood function is

`(x; µ, σ 2 ) = −(n/2) ln(2π) − (n/2) ln σ 2 − (1/(2σ 2 )) Σni=1 (xi − µ)2 .

Differentiating with respect to µ, we get

∂`/∂µ = (1/σ 2 ) Σni=1 (xi − µ) = 0
=⇒ Σni=1 (xi − µ) = 0
=⇒ µ̂MLE = (1/n) Σni=1 Xi = X̄n .

Therefore, X̄n is the MLE for µ. Next, differentiating with respect to σ 2 , we have

∂`/∂σ 2 = −n/(2σ 2 ) + (1/(2σ 4 )) Σni=1 (xi − µ)2 = 0
=⇒ σ̂ 2MLE = (1/n) Σni=1 (Xi − X̄n )2 = ((n − 1)/n) s2 .

Hence, ((n − 1)/n) s2 is the MLE for σ 2 .
Remark 9.5.3. Note that an MLE may not be unbiased; however, under mild regularity conditions, it is consistent.
iid
Example 9.5.2. Let Xi ∼ N (µ, σ 2 ), for i = 1, 2, . . . , n, and µ > 0. Find the MLE of µ and σ 2 .
Solution. Note that we computed µ̂MLE = X̄n and σ̂ 2MLE = ((n − 1)/n) s2 in the previous example,
ignoring the constraint. Now it is given that µ > 0. Note that

∂`/∂µ = (n/σ 2 )(x̄n − µ), which is > 0 if µ < x̄n and < 0 if µ > x̄n .

[Figure: ` increases up to x̄n and decreases after it; over µ > 0 the maximum is at x̄n when x̄n > 0,
and at 0 when x̄n ≤ 0.]

Therefore,

µ̂MLE = X̄n if x̄n > 0, and µ̂MLE = 0 if x̄n ≤ 0; that is, µ̂MLE = max(X̄n , 0).

Hence, µ̂MLE = max(X̄n , 0) is the MLE for µ > 0. Also,

σ̂ 2MLE = (1/n) Σni=1 (Xi − µ̂MLE )2 = (1/n) Σni=1 (Xi − X̄n )2 if X̄n > 0, and (1/n) Σni=1 Xi2 if X̄n ≤ 0.

Further, if the likelihood equation d`/dθ = 0 is not solvable, then how do we handle the problem?
This mainly happens when the density depends only on the parameters, with x entering only through
the support, and the problem should then be handled accordingly. Let us consider one example of
such a case.

Example 9.5.3. Find the MLE of a and b for the population having pdf

f (x; a, b) = 1/(b − a), a < x < b.

Solution. The likelihood function is

L(x; a, b) = 1/(b − a)n , a < xi < b for all i.

Therefore, the log-likelihood function is

`(x; a, b) = ln(L(x; a, b)) = −n ln(b − a),

and

∂`/∂a = n/(b − a) = 0 =⇒ b − a = ∞,

which is not solvable. So this method is not useful in this case, and we instead apply the definition of
MLE directly. Observe that L(x; a, b) is maximum when b − a is minimum. Let us define

x(1) = min{x1 , x2 , . . . , xn }
x(2) = second smallest of {x1 , x2 , . . . , xn }
..
.
x(n) = max{x1 , x2 , . . . , xn }.

Then

a < x(1) ≤ x(2) ≤ · · · ≤ x(n) < b.

Since b − a > x(n) − x(1) , we have

L(x; a, b) = 1/(b − a)n ≤ 1/(x(n) − x(1) )n ,

and the likelihood is maximized by pushing a up to x(1) and b down to x(n) . Hence,

âMLE = X(1) = min{X1 , X2 , . . . , Xn }

and

b̂MLE = X(n) = max{X1 , X2 , . . . , Xn }.
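The order-statistic MLEs are a one-liner to compute; a simulation sketch (the true endpoints 2.0 and 5.0 and the sample size are illustrative assumptions):

```python
import random

rng = random.Random(3)
a_true, b_true = 2.0, 5.0
xs = [rng.uniform(a_true, b_true) for _ in range(1000)]

a_hat, b_hat = min(xs), max(xs)   # MLE: X_(1) and X_(n)
```

Both estimates always lie inside the true interval, so the MLEs are biased inward; but the expected gaps shrink like (b − a)/(n + 1), so they are consistent, consistent with Exercise 3 below.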

Property 9.5.1 [Invariance Property of Maximum Likelihood Estimators]

If θ̂(x) is a maximum likelihood estimate for θ, then g(θ̂(x)) is a maximum likelihood estimate
for g(θ).

9.6 Confidence Interval
A confidence interval is an interval estimate, computed from the observed data, that is designed to
contain the true value of the unknown parameter. It is associated with a confidence level, which
quantifies how often, over repeated sampling, such intervals capture the fixed parameter.

Definition 9.6.9 [Confidence Interval]


Let X1 , X2 , . . . , Xn be a random sample, and T1 (X) and T2 (X) be two statistics such that

P (T1 (X) ≤ g(θ) ≤ T2 (X)) = 1 − α for all θ ∈ Θ.

Suppose T1 (x) and T2 (x) are observed values of T1 (X) and T2 (X), respectively. Then, we say
(T1 (x), T2 (x)) is 100(1 − α)% confidence interval for g(θ).

Remark 9.6.1. The C.I. means that we have 100(1 − α)% confidence that the value of the parametric
function g(θ) lies in (T1 (x), T2 (x)).

[Figure: the interval (T1 (x), T2 (x)), which covers g(θ) with probability 1 − α.]

9.6.1 Confidence Intervals for One Normal Population


iid
In this subsection, we consider Xi ∼ N (µ, σ 2 ), for i = 1, 2, . . . , n and obtain the confidence interval
(C.I.) for the parameters µ and σ 2 in different cases.

Confidence interval for µ: To compute the C.I. for µ of a normal population, we deal with two cases
when σ is known and unknown.
(a) When σ 2 is known: The test statistic is given by

Z = √n (X̄n − µ)/σ ∼ N (0, 1).

[Figure: standard normal curve with central area 1 − α between −zα/2 and zα/2 and area α/2 in each tail.]

Note that

P (−zα/2 ≤ Z ≤ zα/2 ) = 1 − α
=⇒ P (−zα/2 ≤ √n (X̄n − µ)/σ ≤ zα/2 ) = 1 − α
=⇒ P (−(σ/√n) zα/2 ≤ X̄n − µ ≤ (σ/√n) zα/2 ) = 1 − α
=⇒ P (X̄n − (σ/√n) zα/2 ≤ µ ≤ X̄n + (σ/√n) zα/2 ) = 1 − α.

Let x̄n be an observed value of X̄n . Then,

(x̄n − (σ/√n) zα/2 , x̄n + (σ/√n) zα/2 ) or x̄n ± (σ/√n) zα/2

is a 100(1 − α)% C.I. for µ.


(b) When σ 2 is unknown: The test statistic is given by

T = √n (X̄n − µ)/s ∼ tn−1 .

[Figure: tn−1 curve with central area 1 − α between −tn−1,α/2 and tn−1,α/2 and area α/2 in each tail.]

Note that

P (−tn−1,α/2 ≤ T ≤ tn−1,α/2 ) = 1 − α
=⇒ P (−tn−1,α/2 ≤ √n (X̄n − µ)/s ≤ tn−1,α/2 ) = 1 − α
=⇒ P (−(s/√n) tn−1,α/2 ≤ X̄n − µ ≤ (s/√n) tn−1,α/2 ) = 1 − α
=⇒ P (X̄n − (s/√n) tn−1,α/2 ≤ µ ≤ X̄n + (s/√n) tn−1,α/2 ) = 1 − α.

Let x̄n be an observed value of X̄n . Then,

(x̄n − (s/√n) tn−1,α/2 , x̄n + (s/√n) tn−1,α/2 ) or x̄n ± (s/√n) tn−1,α/2

is a 100(1 − α)% C.I. for µ.

Example 9.6.1. Ten bearings made by a certain process have a mean diameter of 0.0506 cm and a
standard deviation of 0.004 cm. Assuming that the data may be looked upon as a random sample from
a normal population, construct a 95% C.I. for the actual average diameter of bearings made by the
process.
iid
Solution. Given Xi ∼ N (µ, σ 2 ), for i = 1, 2, . . . , 10, x̄10 = 0.0506, s = 0.004 and α = 0.05.

[Figure: t9 curve with central area 0.95 between −t9,0.025 and t9,0.025 and area 0.025 in each tail.]

We know that x̄n ± (s/√n) tn−1,α/2 is a 100(1 − α)% C.I. for µ. It can be easily verified that

P (t9 > t9,0.025 ) = 0.025 =⇒ t9,0.025 = 2.262.

Therefore,

x̄10 ± (s/√10) t9,0.025 = (0.0477, 0.0535)

is a 95% C.I. for µ.
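The interval arithmetic of the bearing example can be checked in a few lines (t9,0.025 = 2.262 is the table value quoted above):

```python
import math

def t_interval(xbar, s, n, t_crit):
    """x_bar +/- (s/sqrt(n)) * t_{n-1,alpha/2}: C.I. for mu when sigma is unknown."""
    half = s / math.sqrt(n) * t_crit
    return xbar - half, xbar + half

lo, hi = t_interval(0.0506, 0.004, 10, 2.262)   # bearing data
```

This reproduces (0.0477, 0.0535) up to rounding.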

Confidence interval for σ 2 : To compute the C.I. for σ 2 of a normal population, we deal with two cases
when µ is known and unknown.
(a) When µ is known: The test statistic is given by

W = Σni=1 ((Xi − µ)/σ)2 ∼ χ2n .

[Figure: χ2n curve with central area 1 − α between χ2n,1−α/2 and χ2n,α/2 and area α/2 in each tail.]

Note that

P (χ2n,1−α/2 ≤ W ≤ χ2n,α/2 ) = 1 − α
=⇒ P (χ2n,1−α/2 ≤ Σni=1 ((Xi − µ)/σ)2 ≤ χ2n,α/2 ) = 1 − α
=⇒ P (1/χ2n,α/2 ≤ σ 2 / Σni=1 (Xi − µ)2 ≤ 1/χ2n,1−α/2 ) = 1 − α
=⇒ P (Σni=1 (Xi − µ)2 / χ2n,α/2 ≤ σ 2 ≤ Σni=1 (Xi − µ)2 / χ2n,1−α/2 ) = 1 − α.

Let xi be an observed value of Xi . Then,

( Σni=1 (xi − µ)2 / χ2n,α/2 , Σni=1 (xi − µ)2 / χ2n,1−α/2 )

is a 100(1 − α)% C.I. for σ 2 .


(b) When µ is unknown: The test statistic is given by

W = (n − 1)s2 /σ 2 ∼ χ2n−1 .

[Figure: χ2n−1 curve with central area 1 − α between χ2n−1,1−α/2 and χ2n−1,α/2 and area α/2 in each tail.]

Note that

P (χ2n−1,1−α/2 ≤ W ≤ χ2n−1,α/2 ) = 1 − α
=⇒ P (χ2n−1,1−α/2 ≤ (n − 1)s2 /σ 2 ≤ χ2n−1,α/2 ) = 1 − α
=⇒ P (1/χ2n−1,α/2 ≤ σ 2 /((n − 1)s2 ) ≤ 1/χ2n−1,1−α/2 ) = 1 − α
=⇒ P ((n − 1)s2 /χ2n−1,α/2 ≤ σ 2 ≤ (n − 1)s2 /χ2n−1,1−α/2 ) = 1 − α.

Then,

( (n − 1)s2 /χ2n−1,α/2 , (n − 1)s2 /χ2n−1,1−α/2 )

is a 100(1 − α)% C.I. for σ 2 .


Example 9.6.2. Thirty-one measurements of the boiling point of sulphur have a standard deviation
of s = 0.83◦ C. Assuming that the data may be looked upon as a random sample from a normal
population, construct a 98% C.I. for the standard deviation of such measurements.
iid
Solution. Given Xi ∼ N (µ, σ 2 ), for i = 1, 2, . . . , 31, s = 0.83 and α = 0.02.

[Figure: χ230 curve with central area 0.98 between χ230,0.99 and χ230,0.01 and area 0.01 in each tail.]

From the chi-square table, it can be easily verified that χ230,0.99 = 14.95 and χ230,0.01 = 50.89. Therefore,
the 98% C.I. for σ is

( √((n − 1)s2 /χ2n−1,α/2 ), √((n − 1)s2 /χ2n−1,1−α/2 ) ) = ( √(30 × (0.83)2 /50.89), √(30 × (0.83)2 /14.95) ) = (0.6373, 1.1758).
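The same computation, sketched in code (s = 0.83 and the chi-square quantiles are the numbers actually used in the arithmetic above; square roots convert the interval for σ² into one for σ):

```python
import math

def sigma_interval(s, n, chi2_upper, chi2_lower):
    """C.I. for sigma: square roots of ((n-1)s^2/chi2_upper, (n-1)s^2/chi2_lower)."""
    num = (n - 1) * s * s
    return math.sqrt(num / chi2_upper), math.sqrt(num / chi2_lower)

lo, hi = sigma_interval(0.83, 31, 50.89, 14.95)   # sulphur data
```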

9.6.2 Confidence Intervals for Two Normal Populations

iid                                                  iid
In this subsection, we consider Xi ∼ N (µ1 , σ12 ), for i = 1, 2, . . . , m and Yj ∼ N (µ2 , σ22 ), for j =
1, 2, . . . , n, where the two samples are independent. We obtain the confidence interval (C.I.) for the
difference of means, that is, for µ1 − µ2 , and for the ratio of variances, that is, for σ12 /σ22 , in different
cases.

Confidence interval for µ1 − µ2 : To compute the C.I. for µ1 − µ2 , we deal with the following three
cases:
(a) When σ12 and σ22 are known: Note that

X̄m ∼ N (µ1 , σ12 /m) and Ȳn ∼ N (µ2 , σ22 /n),

Amit Kumar 201 MA-202: Probability & Statistics


Chapter 9: Estimation
and X̄m and Ȳn are independent. Therefore,

X̄m − Ȳn ∼ N (µ1 − µ2 , σ12 /m + σ22 /n).

Hence,

Z = (X̄m − Ȳn − (µ1 − µ2 )) / √(σ12 /m + σ22 /n) ∼ N (0, 1).

[Figure: standard normal curve with central area 1 − α between −zα/2 and zα/2 and area α/2 in each tail.]

Note that

P (−zα/2 ≤ Z ≤ zα/2 ) = 1 − α
=⇒ P (−zα/2 ≤ (X̄m − Ȳn − (µ1 − µ2 )) / √(σ12 /m + σ22 /n) ≤ zα/2 ) = 1 − α
=⇒ P (−√(σ12 /m + σ22 /n) zα/2 ≤ X̄m − Ȳn − (µ1 − µ2 ) ≤ √(σ12 /m + σ22 /n) zα/2 ) = 1 − α
=⇒ P (X̄m − Ȳn − √(σ12 /m + σ22 /n) zα/2 ≤ µ1 − µ2 ≤ X̄m − Ȳn + √(σ12 /m + σ22 /n) zα/2 ) = 1 − α.

Let x̄m and ȳn be observed values of X̄m and Ȳn , respectively. Then,

( x̄m − ȳn − √(σ12 /m + σ22 /n) zα/2 , x̄m − ȳn + √(σ12 /m + σ22 /n) zα/2 ) or x̄m − ȳn ± √(σ12 /m + σ22 /n) zα/2

is a 100(1 − α)% C.I. for µ1 − µ2 .


(b) When σ12 = σ22 = σ 2 (unknown): It can be easily verified that

(X̄m − Ȳn − (µ1 − µ2 )) / √(σ 2 /m + σ 2 /n) = √(mn/(m + n)) (X̄m − Ȳn − (µ1 − µ2 ))/σ ∼ N (0, 1). (9.6.1)

Next, note that

(m − 1)s21 /σ 2 ∼ χ2m−1 and (n − 1)s22 /σ 2 ∼ χ2n−1 ,

and therefore,

(m + n − 2)s2p /σ 2 = ((m − 1)s21 + (n − 1)s22 )/σ 2 ∼ χ2m+n−2 , (9.6.2)

where s2p = ((m − 1)s21 + (n − 1)s22 )/(m + n − 2) is known as the pooled sample variance. Hence, from
(9.6.1) and (9.6.2), we have

T = [ √(mn/(m + n)) (X̄m − Ȳn − (µ1 − µ2 ))/σ ] / √[ ((m + n − 2)s2p /σ 2 )/(m + n − 2) ]
  = √(mn/(m + n)) (X̄m − Ȳn − (µ1 − µ2 ))/sp ∼ tm+n−2 .

[Figure: tm+n−2 curve with central area 1 − α between −tm+n−2,α/2 and tm+n−2,α/2 and area α/2 in each tail.]

Note that

P (−tm+n−2,α/2 ≤ T ≤ tm+n−2,α/2 ) = 1 − α
=⇒ P (−tm+n−2,α/2 ≤ √(mn/(m + n)) (X̄m − Ȳn − (µ1 − µ2 ))/sp ≤ tm+n−2,α/2 ) = 1 − α
=⇒ P (−√((m + n)/(mn)) sp tm+n−2,α/2 ≤ X̄m − Ȳn − (µ1 − µ2 ) ≤ √((m + n)/(mn)) sp tm+n−2,α/2 ) = 1 − α
=⇒ P (X̄m − Ȳn − √((m + n)/(mn)) sp tm+n−2,α/2 ≤ µ1 − µ2 ≤ X̄m − Ȳn + √((m + n)/(mn)) sp tm+n−2,α/2 ) = 1 − α.

Let x̄m and ȳn be observed values of X̄m and Ȳn , respectively. Then,

( x̄m − ȳn − √((m + n)/(mn)) sp tm+n−2,α/2 , x̄m − ȳn + √((m + n)/(mn)) sp tm+n−2,α/2 )

or

x̄m − ȳn ± √((m + n)/(mn)) sp tm+n−2,α/2

is a 100(1 − α)% C.I. for µ1 − µ2 .
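The pooled-variance interval above can be sketched as a small function; the numbers in the call are a hypothetical check, not data from the notes, and t8,0.025 = 2.306 is the usual table value:

```python
import math

def pooled_t_interval(xbar1, xbar2, s1_sq, s2_sq, m, n, t_crit):
    """C.I. for mu1 - mu2 assuming equal (unknown) variances."""
    sp = math.sqrt(((m - 1) * s1_sq + (n - 1) * s2_sq) / (m + n - 2))
    half = t_crit * sp * math.sqrt((m + n) / (m * n))
    d = xbar1 - xbar2
    return d - half, d + half

lo, hi = pooled_t_interval(10.0, 9.0, 1.0, 1.0, 5, 5, 2.306)
```

By construction the interval is centered at x̄1 − x̄2, here 1.0, with half-width t · sp · √((m + n)/(mn)).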
(c) When σ12 6= σ22 , both unknown: In this case, we have an approximate C.I., which is based on the
statistic

T ∗ = (X̄m − Ȳn − (µ1 − µ2 )) / √(s21 /m + s22 /n) ∼ tν (approximately),

where

ν = (s21 /m + s22 /n)2 / [ s41 /(m2 (m − 1)) + s42 /(n2 (n − 1)) ].

[Figure: tν curve with central area 1 − α between −tν,α/2 and tν,α/2 and area α/2 in each tail.]

Note that

P (−tν,α/2 ≤ T ∗ ≤ tν,α/2 ) = 1 − α
=⇒ P (−tν,α/2 ≤ (X̄m − Ȳn − (µ1 − µ2 )) / √(s21 /m + s22 /n) ≤ tν,α/2 ) = 1 − α
=⇒ P (−√(s21 /m + s22 /n) tν,α/2 ≤ X̄m − Ȳn − (µ1 − µ2 ) ≤ √(s21 /m + s22 /n) tν,α/2 ) = 1 − α
=⇒ P (X̄m − Ȳn − √(s21 /m + s22 /n) tν,α/2 ≤ µ1 − µ2 ≤ X̄m − Ȳn + √(s21 /m + s22 /n) tν,α/2 ) = 1 − α.

Let x̄m and ȳn be observed values of X̄m and Ȳn , respectively. Then,

( x̄m − ȳn − √(s21 /m + s22 /n) tν,α/2 , x̄m − ȳn + √(s21 /m + s22 /n) tν,α/2 )

or

x̄m − ȳn ± √(s21 /m + s22 /n) tν,α/2

is a 100(1 − α)% C.I. for µ1 − µ2 .
Example 9.6.3. Two machines are used to fill plastic bottles with dish washing detergent. The standard
deviations of filling volume are known to be σ1 = 0.15 fluid ounces and σ2 = 0.12 fluid ounces for the
two machines. Two random samples of n1 = 12 bottles from machine 1 and n2 = 10 bottles from
machine 2 are selected, and the observed sample means are x̄1 = 30.87 and x̄2 = 30.68. Find a 90%
C.I. for µ1 − µ2 .
Solution. Note that z0.05 = 1.645 and therefore, the 90% C.I. for µ1 − µ2 is given by

x̄1 − x̄2 ± √(σ12 /n1 + σ22 /n2 ) zα/2 = 30.87 − 30.68 ± √((0.15)2 /12 + (0.12)2 /10) × 1.645 = (0.095, 0.285).
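The arithmetic of this example in code (z0.05 = 1.645 as above):

```python
import math

def two_mean_z_interval(xbar1, xbar2, sigma1, sigma2, n1, n2, z):
    """C.I. for mu1 - mu2 when both variances are known."""
    half = z * math.sqrt(sigma1**2 / n1 + sigma2**2 / n2)
    d = xbar1 - xbar2
    return d - half, d + half

lo, hi = two_mean_z_interval(30.87, 30.68, 0.15, 0.12, 12, 10, 1.645)
```

This reproduces (0.095, 0.285) up to rounding.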

Confidence interval for σ12 /σ22 : To compute the C.I. for σ12 /σ22 of two normal populations, we deal with
the following two cases:
(a) When µ1 and µ2 are known: Note that

(1/σ12 ) Σmi=1 (Xi − µ1 )2 ∼ χ2m and (1/σ22 ) Σni=1 (Yi − µ2 )2 ∼ χ2n .

Therefore,

F = [ (1/σ12 ) Σmi=1 (Xi − µ1 )2 /m ] / [ (1/σ22 ) Σni=1 (Yi − µ2 )2 /n ]
  = ( n Σmi=1 (Xi − µ1 )2 σ22 ) / ( m Σni=1 (Yi − µ2 )2 σ12 ) ∼ Fm,n .

[Figure: Fm,n curve with central area 1 − α between fm,n,1−α/2 and fm,n,α/2 and area α/2 in each tail.]

Note that

P (fm,n,1−α/2 ≤ F ≤ fm,n,α/2 ) = 1 − α
=⇒ P (fm,n,1−α/2 ≤ ( n Σmi=1 (Xi − µ1 )2 σ22 ) / ( m Σni=1 (Yi − µ2 )2 σ12 ) ≤ fm,n,α/2 ) = 1 − α
=⇒ P ( (1/fm,n,α/2 ) ( n Σmi=1 (Xi − µ1 )2 ) / ( m Σni=1 (Yi − µ2 )2 ) ≤ σ12 /σ22 ≤ (1/fm,n,1−α/2 ) ( n Σmi=1 (Xi − µ1 )2 ) / ( m Σni=1 (Yi − µ2 )2 ) ) = 1 − α.

Let xi and yi be observed values of Xi and Yi , respectively. Then,

( (1/fm,n,α/2 ) ( n Σmi=1 (xi − µ1 )2 ) / ( m Σni=1 (yi − µ2 )2 ) , (1/fm,n,1−α/2 ) ( n Σmi=1 (xi − µ1 )2 ) / ( m Σni=1 (yi − µ2 )2 ) )

is a 100(1 − α)% C.I. for σ12 /σ22 .

(b) When µ1 and µ2 are unknown: Note that

(m − 1)s21 /σ12 ∼ χ2m−1 and (n − 1)s22 /σ22 ∼ χ2n−1 ,

and therefore,

F ∗ = [ ((m − 1)s21 /σ12 )/(m − 1) ] / [ ((n − 1)s22 /σ22 )/(n − 1) ] = (s21 σ22 )/(s22 σ12 ) ∼ Fm−1,n−1 .

[Figure: Fm−1,n−1 curve with central area 1 − α between fm−1,n−1,1−α/2 and fm−1,n−1,α/2 and area α/2 in each tail.]

Note that

P (fm−1,n−1,1−α/2 ≤ F ∗ ≤ fm−1,n−1,α/2 ) = 1 − α
=⇒ P (fm−1,n−1,1−α/2 ≤ (s21 σ22 )/(s22 σ12 ) ≤ fm−1,n−1,α/2 ) = 1 − α
=⇒ P ( (1/fm−1,n−1,α/2 ) (s21 /s22 ) ≤ σ12 /σ22 ≤ (1/fm−1,n−1,1−α/2 ) (s21 /s22 ) ) = 1 − α.

Then,

( (1/fm−1,n−1,α/2 ) (s21 /s22 ) , (1/fm−1,n−1,1−α/2 ) (s21 /s22 ) )

is a 100(1 − α)% C.I. for σ12 /σ22 .

Example 9.6.4. The viscosity of two brands of oil used in cars is measured, and the following data are
recorded:

Brand 1 10.62 10.58 10.33 10.72 10.44
Brand 2 10.50 10.52 10.62 10.53

Find a 90% C.I. for σ12 /σ22 .

Solution. From the data, it can be easily obtained that

s21 = 0.02362 and s22 = 0.002825.

Also, from the F-distribution table, we have

f4,3,0.05 = 9.1122 and f4,3,0.95 = 0.1517.

[Figure: F4,3 curve with central area 0.90 between f4,3,0.95 and f4,3,0.05 and area 0.05 in each tail.]

Hence, the 90% C.I. for σ12 /σ22 is

( (1/fm−1,n−1,α/2 ) (s21 /s22 ) , (1/fm−1,n−1,1−α/2 ) (s21 /s22 ) ) = ( (1/9.1122) × (0.02362/0.002825), (1/0.1517) × (0.02362/0.002825) ) = (0.917, 55.11).

9.6.3 Confidence Intervals for Proportions

In this subsection, we discuss the C.I. for a proportion. For instance, if we wish to estimate the proportion
of people in the electorate who will vote for a particular candidate, we could choose a sample of n = 400
people and count the number x of people in our sample who intend to vote for the candidate. Then the
sample proportion takes the value p̂ = x/n. Here, we discuss the C.I. for a proportion and for a
difference of proportions.

Confidence Intervals for Proportions: Let X denote the number of successes in n observed Bernoulli
trials with unknown success probability p. Then, the sample proportion p̂ = X/n is the estimate of the
population proportion p. Using the CLT, we have, approximately,

Z = (p̂ − p) / √(p̂(1 − p̂)/n) ∼ N (0, 1).


[Figure: standard normal curve with central area 1 − α between −zα/2 and zα/2 and area α/2 in each tail.]

Therefore,

P (−zα/2 ≤ Z ≤ zα/2 ) = 1 − α
=⇒ P (−zα/2 ≤ (p̂ − p) / √(p̂(1 − p̂)/n) ≤ zα/2 ) = 1 − α
=⇒ P (p̂ − √(p̂(1 − p̂)/n) zα/2 ≤ p ≤ p̂ + √(p̂(1 − p̂)/n) zα/2 ) = 1 − α.

Hence, a 100(1 − α)% confidence interval for p is

( p̂ − zα/2 √(p̂(1 − p̂)/n) , p̂ + zα/2 √(p̂(1 − p̂)/n) ) or p̂ ± zα/2 √(p̂(1 − p̂)/n).

Example 9.6.5. A survey conducted by an institute found that 323 students out of 1404 paid their
education fees by student loan. Find the 90% C.I. for the true proportion of students who paid their
education fees by student loans.
Solution. Given x = 323 and n = 1404. Therefore,

p̂ = x/n = 323/1404 = 0.23.

Also, we have α = 0.1 =⇒ zα/2 = z0.05 = 1.645.

[Figure: standard normal curve with central area 0.9 between −z0.05 and z0.05 and area 0.05 in each tail.]

Hence, the 90% C.I. for p is

( p̂ − z0.05 √(p̂(1 − p̂)/n) , p̂ + z0.05 √(p̂(1 − p̂)/n) )
= ( 0.23 − 1.645 × √(0.23 × 0.77/1404) , 0.23 + 1.645 × √(0.23 × 0.77/1404) ) = (0.211, 0.249).

Confidence Intervals for Difference of Proportions: A C.I. for a difference in proportions is a range
of values that is likely to contain the true difference between two population proportions with a certain
level of confidence. Suppose we have two samples of sizes n1 and n2 with sample proportions p̂1 and p̂2 .
Let p1 and p2 be the true proportions. Then, approximately,

Z ∗ = (p̂1 − p̂2 − (p1 − p2 )) / √(p̂1 (1 − p̂1 )/n1 + p̂2 (1 − p̂2 )/n2 ) ∼ N (0, 1).

[Figure: standard normal curve with central area 1 − α between −zα/2 and zα/2 and area α/2 in each tail.]

Therefore,

P (−zα/2 ≤ Z ∗ ≤ zα/2 ) = 1 − α
=⇒ P (−zα/2 ≤ (p̂1 − p̂2 − (p1 − p2 )) / √(p̂1 (1 − p̂1 )/n1 + p̂2 (1 − p̂2 )/n2 ) ≤ zα/2 ) = 1 − α
=⇒ P (p̂1 − p̂2 − zα/2 √(p̂1 (1 − p̂1 )/n1 + p̂2 (1 − p̂2 )/n2 ) ≤ p1 − p2 ≤ p̂1 − p̂2 + zα/2 √(p̂1 (1 − p̂1 )/n1 + p̂2 (1 − p̂2 )/n2 )) = 1 − α.

Hence, a 100(1 − α)% confidence interval for p1 − p2 is

( p̂1 − p̂2 − zα/2 √(p̂1 (1 − p̂1 )/n1 + p̂2 (1 − p̂2 )/n2 ) , p̂1 − p̂2 + zα/2 √(p̂1 (1 − p̂1 )/n1 + p̂2 (1 − p̂2 )/n2 ) )

or

p̂1 − p̂2 ± zα/2 √(p̂1 (1 − p̂1 )/n1 + p̂2 (1 − p̂2 )/n2 ).

Example 9.6.6. Suppose we want to estimate the difference in the proportion of residents who support
a certain law in county A compared to the proportion who support the law in county B. In Sample 1,
62 out of 100 residents support the law. In Sample 2, 46 out of 100 residents support the law. Find a
90% C.I. for the difference in population proportions.
Solution. Here, n1 = 100, p̂1 = 62/100 = 0.62 and n2 = 100, p̂2 = 46/100 = 0.46. Also, α =
0.1 =⇒ zα/2 = z0.05 = 1.645.

[Figure: standard normal curve with central area 0.9 between −z0.05 and z0.05 and area 0.05 in each tail.]

Hence, the 90% C.I. for the difference of proportions is

p̂1 − p̂2 ± zα/2 √(p̂1 (1 − p̂1 )/n1 + p̂2 (1 − p̂2 )/n2 ) = 0.62 − 0.46 ± 1.645 × √(0.62 × 0.38/100 + 0.46 × 0.54/100)
= (0.0456, 0.2744).
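And the same in code (z0.05 = 1.645 as above); since the resulting interval excludes 0, the two counties appear to differ:

```python
import math

def prop_diff_interval(x1, n1, x2, n2, z):
    """Large-sample C.I. for p1 - p2."""
    p1, p2 = x1 / n1, x2 / n2
    half = z * math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return p1 - p2 - half, p1 - p2 + half

lo, hi = prop_diff_interval(62, 100, 46, 100, 1.645)
```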

9.7 Exercises
1. Find an unbiased estimator for µ2 in Example 9.2.2. Also, using the definition of s2 , prove that
E (s2 ) = σ 2 .

2. Let Xi ∼ Exp(λ), for i = 1, 2, . . . , n and λ > 0. Show that (n − 1)/Y , where Y = Σni=1 Xi , is an
unbiased estimator for λ.
iid
3. Let Xi ∼ U(0, θ), for i = 1, 2, . . . , n. Show that X(n) = max{X1 , X2 , . . . , Xn } is a consistent
estimator for θ.
iid
4. Let Xi ∼ Ber(p), for i = 1, 2, . . . , n and 0 ≤ p ≤ 1. Find the MLE of p.
iid
5. Let Xi ∼ P(λ), for i = 1, 2, . . . , n and λ > 0. Find the MLE of λ.

6. Let a random sample X1 , X2 , . . . , Xn follow a distribution whose pdf is given by

f (x; θ) = 1/2, θ − 1 ≤ x ≤ θ + 1,
           0,   otherwise.

Find the MLE of θ.

7. A random sample of size n = 100 is taken from a normal population with σ = 5.1. Given that
the sample mean x̄100 = 21.6, construct 95% C.I. for population mean µ.

8. A random sample of size n = 80 is taken from a normal population. Given that the sample
standard deviation s = 9.583, construct a 95% C.I. for the population standard deviation σ.

9. A survey of 1898 people found that 45% of the adults said that dandelions were the toughest
weeds to control in their yards. Find the 95% C.I. for the true proportion who said that dandelions
were the toughest weeds to control in their yards.

10. A market research company wants to estimate the proportion of households in the country with
digital televisions. A random sample of 80 households is selected and 46 of them have digital
televisions. Construct 95% and 98% confidence intervals for the proportion of households with
digital televisions.

11. A toy company had a history of producing defective toys. When several toy stores complained,
the manager of the toy company claimed that he had found the source of the problem and cor-
rected it. However, the toy stores began noticing even more defective toys than before. In an
attempt to discover if the manager of the toy company was lying, the owner of a toy store decided
to sample toys from a delivery before the manager’s claim, and one after the manager’s claim.
He randomly sampled 1000 toys from a delivery before the manager said that the problem was
corrected. Of the 1000 toys, he found that 20 were defective. He then randomly sampled 1000
toys from a delivery after the manager claimed that the problem was corrected. From that sam-
ple, 42 were defective. Construct and interpret a 95% confidence interval for the difference in the
proportion of defective toys before and after the manager’s claim.



Chapter 10

Answers of Exercises

Chapter 1

2. 0.34

3. 5/9

4. 379/400

5. (a) 5/18 (b) 7/9

6. 4/13

7. 29/32

8. 8/195

Chapter 2

2. 16/25

4. E(X) = 1 and Var(X) = 1/2

5. False

6. (b) The cdf is given by

FX (x) = 0,                                                            x ≤ α − β,
         (1/β 2 )(β − α){x − (α − β)} + (1/(2β 2 )){x2 − (α − β)2 },   α − β < x ≤ α,
         1/2 + (1/β 2 )(β + α)(x − α) − (1/(2β 2 ))(x2 − α2 ),         α < x ≤ α + β,
         1,                                                            x ≥ α + β.

(c) E(X) = α and Var(X) = β 2 /6

7. (a) The pmf of X is given by

x         0      1      2
pX (x)  16/25   8/25   1/25

(b) MX (t) = (1/25)(et + 4)2 , ϕX (t) = (1/25)(eit + 4)2 and GX (t) = (1/25)(t + 4)2
(c) E(X) = 2/5 and Var(X) = 8/25

8. (a) no mode (b) 3.1 (c) 2 and 9


9. (a) 1/√3 (b) 1 (c) √6

10. Q1/4 = µ − λ, Q1/2 = µ and Q3/4 = µ + λ

Chapter 3

1. 0.91854

2. 11

3. (a) 0.42 (b) 0.6065

4. (a) X(1) (b) 0.6065 (c) 0.0003


5. (9 × 4^5 )/5^7

6. (a) NB(3,0.9) (b) 10/3 and 10/27 (c) 0.0049

7. No

Chapter 4

1. 1/4

2. 2/√3

3. (α + β − 1)/(α − 1), no

5. (a) 0.1672 (b) 0.7492

6. (a) 8 (b) 99.20

7. µ = 50.3 and σ = 10.33

8. 13.01

Chapter 5

1. P(Y = 0) = 1/2 and P(Y = 1) = 1/2

2. Y ∼ B(n, q)

5. Normal distribution

7. Log-normal distribution

Chapter 6

1. (a) & (b) The joint pmf and marginals are given by

Y \ X      0     1     2    fY (y)
0         1/8   1/8    0     1/4
1         1/8   2/8   1/8    1/2
2          0    1/8   1/8    1/4
fX (x)    1/4   1/2   1/4     1

(c) 1 (d) 1

2. (a) fX (x) = 2x, 0 < x < 1, fY (y) = 2(1 − y), 0 < y < 1 (b) 7/16 (c) 2/3 (d) 1/2

4. -1/11

5. 1/2

6. 0.9803

Chapter 7

1. Not convergent

4. 0.5684

5. 0.5

6. 3387

7. 0.0179

Chapter 9

1. (X̄n )2 − s2 /n

4. p̂MLE = X̄n

5. λ̂MLE = X̄n

6. Any value in [X(n) − 1, X(1) + 1]

7. (20.6, 22.6)

8. (8.29, 11.35)

9. (0.428, 0.472)

10. 95% is (0.467, 0.683) and 98% is (0.446, 0.704)

11. (−0.0372, −0.0068)
