
AU7022 Stochastic Methods in Systems & Control Xiang Yin

8 Parameter Estimation and Sufficient Statistics


A General Model for Statistics

▶ Many problems have the following common structure. A continuous signal {x(t) : t ∈ R}
is measured at times t1, . . . , tn, producing the vector x = (x1, . . . , xn), where xi = x(ti). The vector
x is a realization of a random vector or a random process X = (X1, . . . , Xn) with a joint
distribution that is of known form but depends on some unknown parameters
θ = (θ1, . . . , θp). Estimation theory aims to estimate these unknown parameters θ
based on the observed realization x.

▶ Formally, the above problem has the following ingredients:


– X = (X1, . . . , Xn) is a vector of random measurements or observations taken over
the course of the experiment
– X is the sample or measurement space of realizations x of X, e.g., X = R × · · · × R
– θ = (θ1, . . . , θp) is an unknown parameter vector of interest
– Θ is the parameter space for the experiment
– Pθ : B(Rn) → [0, 1] is a probability measure such that, for any Borel set or event
B ⊆ X, we have

    Pθ(B) = ∫_B f(x; θ) dx        if X is continuous,
    Pθ(B) = Σ_{x∈B} p(x; θ)       if X is discrete.

Such {Pθ }θ∈Θ is called the statistical model of the experiment.

The probability model also induces the joint C.D.F. associated with X

F (x; θ) = Pθ (X1 ≤ x1 , . . . , Xn ≤ xn ),

which is assumed to be known for each θ ∈ Θ. We denote by Eθ(X) the expectation of the
random variable X given θ ∈ Θ.

Parametric Statistics (Estimation Theory)

▶ The basic estimation problem is as follows. The observations X = (X1, . . . , Xn) are
generated by a true parameter θ0 ∈ Θ; in case the Xi are i.i.d., we have Xi ∼ Pθ0(·).
We then want to find an estimator θ̂ : X → Θ such that the estimate θ̂(X1, . . . , Xn)
approximates θ0 “optimally”.

▶ The question is how to assess whether θ̂ is a good estimator. Depending on
whether or not we have prior knowledge about the distribution of θ, we will discuss two
different approaches: Bayesian estimation and non-random estimation.


Definition of Sufficient Statistics

▶ Let us consider i.i.d. observations X = (X1, . . . , Xn) with distribution Pθ from the
family {Pθ : θ ∈ Θ}. Imagine that there are two people A and B, and that
– A observes the entire sample (X1, . . . , Xn);
– B observes only a smaller vector T = T(X1, . . . , Xn) which is a function of the
sample. In this case, the function T : Rn → Rm, m ≤ n, is called a statistic.

Clearly, A has more information about the distribution of the data and, in particular,
about the unknown parameter θ. However, in some cases, for some choices of function T
(called sufficient statistics) B will have as much information about θ as A has.

▶ To see this more clearly, for observations X = (X1, . . . , Xn) and statistic T(X), the
conditional probability

    fX|T(X)(x | t, θ) = Pθ(X1 = x1, . . . , Xn = xn | T(X) = t)

is, typically, a function of both t and θ. For some choices of statistic T, however,
fX|T(X)(x | t, θ) can be θ-independent.

▶ To see the above argument, let us consider the case X = (X1, . . . , Xn), a sequence
of n Bernoulli trials with success probability θ and the statistic T(X) = X1 +
· · · + Xn, the total number of successes. Then
    Pθ(X1 = x1, . . . , Xn = xn) = ∏_{i=1}^{n} θ^{xi}(1 − θ)^{1−xi} = θ^t(1 − θ)^{n−t},
where t = T(x1, . . . , xn) = x1 + · · · + xn. Therefore, if Σ_{i=1}^{n} xi ≠ t, then the
statistic is incompatible with the observation and the conditional probability is zero. Otherwise, we have
    fX|T(X)(x | t, θ) = fX(x | θ) / fT(X)(t | θ)
                      = Pθ(X1 = x1, . . . , Xn = xn) / Pθ(T(X) = t)
                      = θ^t(1 − θ)^{n−t} / [ (n choose t) θ^t(1 − θ)^{n−t} ]
                      = (n choose t)^{−1},

which does not depend on the parameter θ. This means that all information about θ in
X has been summarized by T (X). This motivates the following definition.

Definition: Sufficient Statistics

A statistic T = T (X) is said to be sufficient for parameter θ if

Pθ (X1 ≤ x1 , . . . , Xn ≤ xn | T (X) = t) = G(x, t)

where G(·, ·) is a function that does not depend on θ. Equivalently, we have

– p(x | t, θ) = Pθ (X = x | T (X) = t) = G(x, t) if X is discrete;

– f (x | t, θ) = G(x, t) if X is continuous.

▶ Thus, by the law of total probability,

    Pθ(X1 ≤ x1, . . . , Xn ≤ xn) = Σ_t P(X1 ≤ x1, . . . , Xn ≤ xn | T(X) = t) Pθ(T(X) = t),

where the conditional probabilities do not depend on θ; once we know the value of the sufficient
statistic, we cannot obtain any additional information about the value of θ from knowing the observed values.
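As a quick numerical illustration of this definition (a minimal Python sketch with illustrative values, not part of the original notes), we can enumerate all binary sequences for the Bernoulli example above and confirm that Pθ(X = x | T(X) = t) does not change as θ varies:

```python
# Sketch: check numerically that P_theta(X = x | T(X) = t) is theta-independent
# for i.i.d. Bernoulli(theta) trials with T(X) = X_1 + ... + X_n.
from itertools import product

def cond_prob(x, theta):
    """P_theta(X = x | T(X) = sum(x)), computed by brute-force enumeration."""
    n, t = len(x), sum(x)
    joint = theta**t * (1 - theta)**(n - t)                       # P_theta(X = x)
    p_t = sum(theta**sum(y) * (1 - theta)**(n - sum(y))
              for y in product([0, 1], repeat=n) if sum(y) == t)  # P_theta(T(X) = t)
    return joint / p_t

x = (1, 0, 1, 1, 0)
for theta in (0.2, 0.5, 0.8):
    print(theta, cond_prob(x, theta))   # prints 0.1 = (5 choose 3)^-1 for every theta
```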


Neyman-Fisher Factorization Theorem


▶ The above definition of sufficient statistics is often difficult to use since it involves derivation
of the conditional distribution of X given T. However, when the random variable X is
discrete or continuous, a simpler way to verify sufficiency is through the Neyman-Fisher
factorization criterion.
Theorem: Fisher Factorization Criterion

A statistic T = T (X) is sufficient for θ if and only if functions g and h can be found
such that
fX (x | θ) = g(T (x), θ)h(x)

We only prove the case of discrete random variables, i.e., when fX(x; θ) is the PMF.

▶ (⇒) Because T is a function of x, we have

    fX(x | θ) = fX,T(X)(x, T(x) | θ) = fX|T(X)(x | T(x), θ) fT(X)(T(x) | θ).

Since T is sufficient, fX|T(X)(x | T(x), θ) is not a function of θ and we can set it to
be h(x). The second term is a function of T(x) and θ, which we write as g(T(x), θ).

▶ (⇐) Suppose that we have the factorization. By the definition of conditional probability,

    fX|T(X)(x | t, θ) = fX,T(X)(x, t | θ) / fT(X)(t | θ).

For the numerator, we have

    fX,T(X)(x, t | θ) = 0                            if T(x) ≠ t,
    fX,T(X)(x, t | θ) = fX(x | θ) = g(t, θ)h(x)      otherwise.

Furthermore, for the denominator, we have

    fT(X)(t | θ) = Σ_{x̃ : T(x̃)=t} fX(x̃ | θ) = Σ_{x̃ : T(x̃)=t} g(t, θ)h(x̃).

Therefore, we have

    fX|T(X)(x | t, θ) = g(t, θ)h(x) / Σ_{x̃ : T(x̃)=t} g(t, θ)h(x̃) = h(x) / Σ_{x̃ : T(x̃)=t} h(x̃),

which is independent of θ and, therefore, T is sufficient.

▶ For example, in maximum likelihood estimation, we seek the estimate
θ ∈ Θ such that the likelihood function

L(θ | x) = fX (x | θ)

is maximized for the observed sample x = (x1 , . . . , xn ). For sufficient statistics, since
fX (x | θ) = g(T (x), θ)h(x), maximizing the likelihood is equivalent to maximizing
g(T (x), θ) and the maximum likelihood estimator θ̂(T (x)) is a function of the sufficient
statistic.
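As a minimal sketch of this point (with made-up data, not from the notes): in the Bernoulli model, two different samples sharing the same total T(x) have the same likelihood over a grid of θ values and hence the same maximizer, T(x)/n.

```python
# Sketch: the Bernoulli likelihood depends on x only through T(x) = sum(x),
# so the (grid-search) MLE is a function of the sufficient statistic alone.
import numpy as np

def likelihood(x, theta):
    t, n = sum(x), len(x)
    return theta**t * (1 - theta)**(n - t)   # equals g(T(x), theta); here h(x) = 1

thetas = np.linspace(0.01, 0.99, 99)
x1 = [1, 1, 0, 0, 1, 0]                      # T(x1) = 3
x2 = [0, 1, 1, 0, 0, 1]                      # a different sample with the same T = 3
mle1 = thetas[np.argmax([likelihood(x1, th) for th in thetas])]
mle2 = thetas[np.argmax([likelihood(x2, th) for th in thetas])]
print(mle1, mle2)                            # both equal 0.5 = T / n
```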


General Examples of Sufficient Statistics


▶ Example 1: Entire Sample
X = (X1 , . . . , Xn ) is clearly sufficient but not very interesting.

▶ Example 2: Rank Ordered Sample


X(1) , . . . , X(n) is sufficient when Xi are i.i.d. This is because, under the i.i.d. setting,

    f(x1, . . . , xn | θ) = ∏_{i=1}^{n} f(xi | θ) = ∏_{i=1}^{n} f(x(i) | θ)

▶ Example 3: Binary Likelihood Ratios


Suppose that θ only takes two possible values Θ = {θ0, θ1}, or simply θ ∈ {0, 1}. This
gives the binary decision problem: “decide between θ = 0 and θ = 1”. Then the
“likelihood ratio” (assumed to be finite)

    Λ(X) = f1(X) / f0(X) = f(X | 1) / f(X | 0)

is sufficient for θ, because we can write


 
 
    fθ(X) = θf1(X) + (1 − θ)f0(X) = [θΛ(X) + (1 − θ)] · f0(X),

where g(T, θ) = θT + (1 − θ) with T = Λ(X), and h(X) = f0(X).

▶ Example 4: Discrete Likelihood Ratios


Suppose that θ takes p possible values, i.e., Θ = {θ1, . . . , θp}. Then the vector of p − 1
likelihood ratios (assumed to be finite)

    Λ(X) = ( fθ1(X)/fθp(X), . . . , fθp−1(X)/fθp(X) ) = (Λ1(X), . . . , Λp−1(X))

is sufficient for θ. Try to prove this as a homework exercise.

▶ Example 5: Likelihood Ratio Trajectory


When Θ is a set of scalar parameters θ, the likelihood ratio trajectory over Θ,

    Λ(X) = ( fθ(X) / fθ0(X) )_{θ∈Θ},

is sufficient for θ. Here θ0 is an arbitrary reference point in Θ for which the trajectory is
finite for all X. When θ is not a scalar, this becomes a likelihood ratio surface, which is
also a sufficient statistic.

▶ We say Tmin is a minimal sufficient statistic if for any sufficient statistic T there
exists a function q such that Tmin = q(T). Finding a minimal sufficient statistic is in
general difficult; the following provides a sufficient condition for T(X) to be minimal:

∀x, x′ ∈ X : Λ(T (x)) = Λ(T (x′ )) ⇒ T (x) = T (x′ )

Note that Λ(t) is well-defined because Λ(x) = Λ(T (x)) for any sufficient statistic T as we
discussed above.


More Examples of Sufficient Statistics


▶ Example 1: Bernoulli Distribution
Suppose that X = (X1 , X2 , . . . , Xn ) is i.i.d. and each Xi satisfies the Bernoulli distribution
with unknown probability θ, i.e., Pθ(Xi = 1) = θ and Pθ(Xi = 0) = 1 − θ. Then we claim
that T(X) = Σ_{i=1}^{n} Xi is a sufficient statistic. To see this, we write the joint PMF as

    pX(X; θ) = ∏_{i=1}^{n} pXi(Xi; θ) = ∏_{i=1}^{n} θ^{Xi}(1 − θ)^{1−Xi} = (1 − θ)^n ( θ/(1 − θ) )^{T(X)} · 1,

where g(T(X), θ) = (1 − θ)^n ( θ/(1 − θ) )^{T(X)} and h(X) = 1.

Clearly, this sufficient statistic is minimal as it is already one-dimensional.

▶ Example 2: Uniform Distribution


Suppose that X = (X1 , X2 , . . . , Xn ) is i.i.d. and each Xi satisfies the uniform distribution
over [0, θ] with unknown length θ. Then we claim that T(X) = max_{1≤i≤n} Xi is a sufficient
statistic. To see this, we write
    fX(X; θ) = ∏_{i=1}^{n} fXi(Xi; θ) = ∏_{i=1}^{n} (1/θ) 1_{[0,θ]}(Xi) = (1/θ^n) ∏_{i=1}^{n} 1_{[Xi,∞)}(θ) = (1/θ^n) 1_{[T(X),∞)}(θ) · 1,

where g(T(X), θ) = (1/θ^n) 1_{[T(X),∞)}(θ) and h(X) = 1.

Note that the tricky step is 1_{[0,θ]}(Xi) = 1_{[Xi,∞)}(θ).
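A minimal numerical sketch of this factorization (illustrative values, not part of the notes): the product of the individual Uniform[0, θ] densities coincides with g(T(x), θ) = θ^{−n} 1_{[T(x),∞)}(θ), where T(x) = max_i xi.

```python
# Sketch: numerical check of the factorization for Uniform[0, theta] samples,
# f_X(x; theta) = theta**(-n) * 1[max(x) <= theta], with h(x) = 1.
import numpy as np

def joint_density(x, theta):
    return np.prod([1.0 / theta if 0 <= xi <= theta else 0.0 for xi in x])

def factorized(x, theta):
    return theta**(-len(x)) if max(x) <= theta else 0.0   # g(T(x), theta), T(x) = max(x)

x = [0.3, 1.7, 0.9, 2.4]
for theta in (2.0, 2.5, 3.0):
    print(theta, joint_density(x, theta), factorized(x, theta))   # last two columns agree
```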

▶ Example 3: Gaussian Distribution with Unknown Mean


Suppose that X = (X1 , X2 , . . . , Xn ) is i.i.d. and each Xi satisfies the Gaussian distribution
with unknown mean θ but known variance σ². Then we claim that T(X) = Σ_{i=1}^{n} Xi
is a sufficient statistic. To see this, we have

    fX(X; θ) = ∏_{i=1}^{n} fXi(Xi; θ) = ∏_{i=1}^{n} ( 1/(√(2π)σ) ) exp( −(Xi − θ)²/(2σ²) )
             = ( 1/(√(2π)σ) )^n exp( −Σ_{i=1}^{n} (Xi − θ)²/(2σ²) )
             = ( 1/(√(2π)σ) )^n exp( θT(X)/σ² ) exp( −nθ²/(2σ²) ) · exp( −Σ_{i=1}^{n} Xi²/(2σ²) ),

where g(T(X), θ) = ( 1/(√(2π)σ) )^n exp( θT(X)/σ² − nθ²/(2σ²) ) and h(X) = exp( −Σ_{i=1}^{n} Xi²/(2σ²) ).
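A minimal sketch (made-up data, not from the notes): two samples with the same sum have likelihood functions that differ only by the θ-independent factor h(X), so their ratio is constant in θ.

```python
# Sketch: for N(theta, sigma^2) data with known sigma, the likelihood depends on
# theta only through T(x) = sum(x); two samples with equal sums therefore give
# likelihoods whose ratio L1(theta)/L2(theta) does not vary with theta.
import numpy as np

sigma = 1.0

def likelihood(x, theta):
    x = np.asarray(x)
    return np.prod(np.exp(-(x - theta)**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma))

x1 = [0.0, 1.0, 2.0, 3.0]        # sum = 6
x2 = [1.5, 1.5, 1.5, 1.5]        # a different sample with the same sum = 6
ratios = [likelihood(x1, th) / likelihood(x2, th) for th in np.linspace(-1.0, 3.0, 9)]
print(np.allclose(ratios, ratios[0]))   # True: the ratio is constant in theta
```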

▶ Example 4: Gaussian Distribution with Unknown Mean and Variance


When the unknown mean is µ = θ1 and the unknown variance is σ² = θ2, then we claim
that T(X) = ( Σ_{i=1}^{n} Xi, Σ_{i=1}^{n} Xi² ) is a sufficient statistic. To see this, we have

    fX(X; θ) = ∏_{i=1}^{n} fXi(Xi; θ) = ∏_{i=1}^{n} ( 1/√(2πθ2) ) exp( −(Xi − θ1)²/(2θ2) )
             = ( 1/√(2πθ2) )^n exp( −Σ_{i=1}^{n} (Xi − θ1)²/(2θ2) )
             = ( 1/√(2πθ2) )^n exp( (θ1/θ2) T1(X) − (1/(2θ2)) T2(X) ) exp( −nθ1²/(2θ2) ) · 1,

where g(T(X), θ) = ( 1/√(2πθ2) )^n exp( (θ1/θ2) T1(X) − (1/(2θ2)) T2(X) − nθ1²/(2θ2) ) and h(X) = 1.
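A final minimal sketch (illustrative values, not from the notes): since h(X) = 1 here, two different samples that share both Σ Xi and Σ Xi² have identical likelihood functions over the whole (θ1, θ2) range.

```python
# Sketch: with both mean theta1 and variance theta2 unknown, the Gaussian likelihood
# is a function of (sum(x), sum(x**2)) only, so samples matching both sums are
# indistinguishable as far as (theta1, theta2) is concerned.
import numpy as np

def likelihood(x, theta1, theta2):
    x = np.asarray(x)
    return np.prod(np.exp(-(x - theta1)**2 / (2 * theta2)) / np.sqrt(2 * np.pi * theta2))

x1 = [0.0, 3.0, 3.0]     # sum = 6, sum of squares = 18
x2 = [1.0, 1.0, 4.0]     # a different sample with the same sum and sum of squares
same = all(np.isclose(likelihood(x1, m, v), likelihood(x2, m, v))
           for m in np.linspace(0.0, 4.0, 5) for v in np.linspace(0.5, 3.0, 6))
print(same)              # True
```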
