AU7022 Stochastic Methods in Systems & Control Xiang Yin
8 Parameter Estimation and Sufficient Statistics
A General Model for Statistics
▶ Many problems have the following common structure. A continuous signal {x(t) : t ∈ R}
is measured at t1 , . . . , tn producing vector x = (x1 , . . . , xn ), where xi = x(ti ). The vector
x is a realization of a random vector or a random process X = (X1 , . . . , Xn ) with a joint
distribution which is of known form but depends on some unknown parameters
θ = (θ1, . . . , θp). Estimation theory aims to estimate these unknown parameters θ
based on the observed realization x.
▶ Formally, the above problem has the following ingredients:
– X = (X1 , . . . , Xn ) is a vector of random measurements or observations taken over
the course of the experiment
– X is the sample or measurement space of realizations x of X, e.g., X = R × · · · × R
– θ = (θ1 , . . . , θp ) is an unknown parameter vector of interest
– Θ is the parameter space for the experiment
– Pθ : B(R^n) → [0, 1] is a probability measure such that, for any Borel set or event
B ⊆ X, we have

    Pθ(B) = probability of event B = ∫_B f(x; θ) dx       if X is continuous,
    Pθ(B) = probability of event B = Σ_{x∈B} p(x; θ)      if X is discrete.
Such {Pθ }θ∈Θ is called the statistical model of the experiment.
The probability model also induces the joint C.D.F. associated with X
F (x; θ) = Pθ (X1 ≤ x1 , . . . , Xn ≤ xn ),
which is assumed to be known for each θ ∈ Θ. We denote by Eθ (X) the expectation of
random variable X given θ ∈ Θ.
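These notes contain no code; as a purely illustrative sketch (Python assumed, with hypothetical helper names such as `joint_pmf` and `P_theta`), the following builds the discrete statistical model {Pθ}θ∈Θ for n i.i.d. Bernoulli trials and evaluates Pθ(B) for an event B by summing the joint PMF, matching the discrete case of the definition above.

```python
from itertools import product

def joint_pmf(x, theta):
    """Joint PMF p(x; theta) of n i.i.d. Bernoulli(theta) observations x in {0,1}^n."""
    p = 1.0
    for xi in x:
        p *= theta if xi == 1 else (1 - theta)
    return p

def P_theta(event, theta, n):
    """P_theta(B): sum the joint PMF over all outcomes in the event B (discrete case)."""
    return sum(joint_pmf(x, theta) for x in product([0, 1], repeat=n) if x in event)

# Event B = "at least two successes in n = 3 trials"
n = 3
B = {x for x in product([0, 1], repeat=n) if sum(x) >= 2}
print(P_theta(B, theta=0.4, n=n))   # 3 * 0.4^2 * 0.6 + 0.4^3 = 0.352
```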
Parametric Statistics (Estimation Theory)
▶ The basic estimation problem is as follows. The observations X = (X1, . . . , Xn) are
generated by a true parameter θ0 ∈ Θ. In case the Xi are i.i.d., we have Xi ∼ Pθ0(·).
Then we want to find an estimator θ̂ : X → Θ such that the estimate θ̂(X1 , . . . , Xn )
approximates θ0 “optimally”.
▶ The question is how to assess whether θ̂ is a good estimator. Depending on
whether or not we have prior knowledge about the distribution of θ, we will discuss two
different approaches: Bayesian estimation and non-random estimation.
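As a hedged sketch of this setup (Python assumed; the name `estimator` is mine, not from the notes), the following draws i.i.d. Bernoulli data from a true parameter θ0 and applies a simple estimator θ̂(X1, . . . , Xn), the empirical mean, which approximates θ0 for large n.

```python
import random

def estimator(x):
    """A simple estimator theta_hat(x_1, ..., x_n): the empirical mean of the sample."""
    return sum(x) / len(x)

random.seed(0)
theta0 = 0.3                                        # the (unknown) true parameter
x = [1 if random.random() < theta0 else 0 for _ in range(10_000)]
print(estimator(x))                                 # close to 0.3 for large n
```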
Definition of Sufficient Statistics
▶ Let us consider i.i.d. observations X = (X1, . . . , Xn) with distribution Pθ from the
family {Pθ : θ ∈ Θ}. Imagine that there are two people A and B, and that
– A observes the entire sample (X1 , . . . , Xn );
– B observes only a smaller vector T = T (X1 , . . . , Xn ) which is a function of the
sample. In this case, the function T : R^n → R^m, with m ≤ n, is called a statistic.
Clearly, A has more information about the distribution of the data and, in particular,
about the unknown parameter θ. However, for some choices of the function T (called
sufficient statistics), B will have as much information about θ as A has.
▶ To see this more clearly, for observations X = (X1 , . . . , Xn ) and statistic T (X), the
conditional probability
fX|T (X) (x | t, θ) = Pθ (X1 = x1 , . . . , Xn = xn | T (X) = t)
is, typically, a function of both t and θ. For some choices of statistic T , however,
fX|T (X) (x | t, θ) can be θ-independent.
▶ To see the above argument, let us consider the case X = (X1, . . . , Xn), a sequence
of n Bernoulli trials with success probability θ, and the statistic T(X) = X1 +
· · · + Xn, the total number of successes. Then
    Pθ(X1 = x1, . . . , Xn = xn) = ∏_{i=1}^n θ^{x_i} (1 − θ)^{1−x_i} = θ^t (1 − θ)^{n−t},

where t = T(x1, . . . , xn) = x1 + · · · + xn. Therefore, if Σ_{i=1}^n x_i ≠ t, then we know that the
statistic is incompatible with the observation. Otherwise, since Pθ(T(X) = t) = C(n, t) θ^t (1 − θ)^{n−t},
where C(n, t) = n!/(t!(n − t)!) is the binomial coefficient, we have

    fX|T(X)(x | t, θ) = fX(x | θ) / fT(X)(t | θ)
                      = Pθ(X1 = x1, . . . , Xn = xn) / Pθ(T(X) = t)
                      = θ^t (1 − θ)^{n−t} / [ C(n, t) θ^t (1 − θ)^{n−t} ]
                      = 1 / C(n, t),
which does not depend on the parameter θ. This means that all information about θ in
X has been summarized by T (X). This motivates the following definition.
Definition: Sufficient Statistics
A statistic T = T (X) is said to be sufficient for parameter θ if
Pθ (X1 ≤ x1 , . . . , Xn ≤ xn | T (X) = t) = G(x, t)
where G(·, ·) is a function that does not depend on θ. Equivalently, we have
– p(x | t, θ) = Pθ (X = x | T (X) = t) = G(x, t) if X is discrete;
– f (x | t, θ) = G(x, t) if X is continuous.
▶ Thus, by the law of total probability,

    Pθ(X1 ≤ x1, . . . , Xn ≤ xn) = Σ_t P(X1 ≤ x1, . . . , Xn ≤ xn | T(X) = t) Pθ(T(X) = t),

where the conditional probability does not depend on θ. Once we know the value of the
sufficient statistic, we cannot obtain any additional information about the value of θ from
the observed values themselves.
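Returning to the Bernoulli example above, a small numerical check (an illustrative Python sketch, not part of the notes) confirms that Pθ(X = x | T(X) = t) equals 1/C(n, t) for every value of θ, as the definition requires.

```python
from math import comb

def cond_prob(x, theta):
    """P_theta(X = x | T(X) = t) for i.i.d. Bernoulli(theta) trials, with t = sum(x)."""
    n, t = len(x), sum(x)
    joint = theta**t * (1 - theta)**(n - t)                  # P_theta(X1 = x1, ..., Xn = xn)
    marginal = comb(n, t) * theta**t * (1 - theta)**(n - t)  # P_theta(T(X) = t)
    return joint / marginal

x = (1, 0, 1, 1, 0)                        # observed sample with t = 3 successes
for theta in (0.2, 0.5, 0.9):
    print(cond_prob(x, theta))             # always 1 / C(5, 3) = 0.1, independent of theta
```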
Neyman-Fisher Factorization Theorem
▶ The above definition of sufficient statistics is often difficult to use since it involves deriva-
tion of the conditional distribution of X given T . However, when the random variable X is
discrete or continuous, a simpler way to verify sufficiency is through the Neyman-Fisher
factorization criterion.
Theorem: Fisher Factorization Criterion
A statistic T = T (X) is sufficient for θ if and only if functions g and h can be found
such that
fX (x | θ) = g(T (x), θ)h(x)
We only prove the case of discrete random variables, i.e., fX(x | θ) is the PMF.
▶ (⇒) Because T is a function of x, we have
fX (x | θ) = fX,T (X) (x, T (x) | θ) = fX|T (X) (x | T (x), θ)fT (X) (T (x) | θ)
Since T is sufficient, fX|T(X)(x | T(x), θ) is not a function of θ, and we set it to be
h(x). The second term is a function of T(x) and θ; we write it as g(T(x), θ).
▶ (⇐) Suppose that we have the factorization. By the definition of conditional probability,

    fX|T(X)(x | t, θ) = fX,T(X)(x, t | θ) / fT(X)(t | θ)
For the numerator, we have
    fX,T(X)(x, t | θ) = 0                              if T(x) ≠ t,
    fX,T(X)(x, t | θ) = fX(x | θ) = g(t, θ)h(x)        otherwise.
Furthermore, for the denominator, we have
    fT(X)(t | θ) = Σ_{x̃ : T(x̃)=t} fX(x̃ | θ) = Σ_{x̃ : T(x̃)=t} g(t, θ)h(x̃)
Therefore, we have
    fX|T(X)(x | t, θ) = g(t, θ)h(x) / Σ_{x̃ : T(x̃)=t} g(t, θ)h(x̃) = h(x) / Σ_{x̃ : T(x̃)=t} h(x̃),
which is independent of θ and, therefore, T is sufficient.
▶ For example, in maximum likelihood estimation, we seek an estimate θ̂ ∈ Θ at which
the likelihood function
L(θ | x) = fX (x | θ)
is maximized for the observed sample x = (x1 , . . . , xn ). For sufficient statistics, since
fX (x | θ) = g(T (x), θ)h(x), maximizing the likelihood is equivalent to maximizing
g(T (x), θ) and the maximum likelihood estimator θ̂(T (x)) is a function of the sufficient
statistic.
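To illustrate this point, here is a minimal sketch (Python assumed; the Bernoulli model from earlier, with hypothetical names): maximizing g(T(x), θ) over a grid of θ values returns θ̂ ≈ t/n, which depends on the data only through the sufficient statistic T(x).

```python
def g(t, n, theta):
    """g(T(x), theta) from the Bernoulli factorization (h(x) = 1 in that example)."""
    return theta**t * (1 - theta)**(n - t)

n, t = 20, 7                                        # observed T(x) = 7 successes in 20 trials
grid = [k / 1000 for k in range(1, 1000)]           # crude grid search over (0, 1)
theta_hat = max(grid, key=lambda th: g(t, n, th))
print(theta_hat)                                    # 0.35 = t/n, a function of T(x) only
```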
General Examples of Sufficient Statistics
▶ Example 1: Entire Sample
X = (X1 , . . . , Xn ) is clearly sufficient but not very interesting.
▶ Example 2: Rank Ordered Sample
X(1), . . . , X(n) is sufficient when the Xi are i.i.d. This is because, under the i.i.d. setting,

    f(x1, . . . , xn | θ) = ∏_{i=1}^n f(xi | θ) = ∏_{i=1}^n f(x(i) | θ)
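A quick numerical illustration (Python sketch, assuming an i.i.d. Gaussian sample with known unit variance; not from the notes): sorting the sample, i.e., keeping only the order statistics, leaves the likelihood unchanged.

```python
from math import exp, pi, sqrt

def likelihood(x, theta, sigma=1.0):
    """Joint density of i.i.d. N(theta, sigma^2) observations."""
    L = 1.0
    for xi in x:
        L *= exp(-(xi - theta)**2 / (2 * sigma**2)) / (sqrt(2 * pi) * sigma)
    return L

x = [0.3, -1.2, 2.1, 0.7]
print(likelihood(x, theta=0.5))
print(likelihood(sorted(x), theta=0.5))   # identical: the order statistics suffice
```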
▶ Example 3: Binary Likelihood Ratios
Suppose that θ only takes two possible values Θ = {θ0, θ1}, or simply θ ∈ {0, 1}. This
gives the binary decision problem: “decide θ = 0 versus θ = 1”. Then the “likelihood
ratio” (assumed finite)

    Λ(X) = f1(X) / f0(X) = f(X | 1) / f(X | 0)

is sufficient for θ, because we can write

    fθ(X) = θ f1(X) + (1 − θ) f0(X) = [ θ Λ(X) + (1 − θ) ] · f0(X),

where the bracketed factor plays the role of g(T, θ) with T = Λ(X), and h(X) = f0(X).
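The decomposition can be checked numerically. Below is an illustrative Python sketch (assuming two unit-variance Gaussian hypotheses, f0 = N(0, 1) and f1 = N(1, 1); the names are mine): fθ(x) computed directly agrees with [θΛ(x) + (1 − θ)] f0(x) for both θ = 0 and θ = 1.

```python
from math import exp, pi, sqrt

def f(x, mean):
    """Density of a single N(mean, 1) observation (f0: mean 0, f1: mean 1)."""
    return exp(-(x - mean)**2 / 2) / sqrt(2 * pi)

def Lambda(x):
    """Likelihood ratio f1(x) / f0(x)."""
    return f(x, 1.0) / f(x, 0.0)

x = 0.4
for theta in (0, 1):
    direct = f(x, float(theta))                                # f_theta(x)
    factored = (theta * Lambda(x) + (1 - theta)) * f(x, 0.0)   # g(Lambda(x), theta) * h(x)
    print(direct, factored)                                    # equal for both theta
```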
▶ Example 4: Discrete Likelihood Ratios
Suppose that θ takes p possible values, i.e., Θ = {θ1, . . . , θp}. Then the vector of p − 1
likelihood ratios (assumed finite)

    Λ(X) = ( fθ1(X)/fθp(X), . . . , fθp−1(X)/fθp(X) ) = (Λ1(X), . . . , Λp−1(X))

is sufficient for θ. Try to prove this as homework.
▶ Example 5: Likelihood Ratio Trajectory
When Θ is a set of scalar parameters θ, the likelihood ratio trajectory over Θ,

    Λ(X) = ( fθ(X) / fθ0(X) )_{θ∈Θ} ,
is sufficient for θ. Here θ0 is an arbitrary reference point in Θ for which the trajectory is
finite for all X. When θ is not a scalar, this becomes a likelihood ratio surface, which is
also a sufficient statistic.
▶ We say Tmin is a minimal sufficient statistic if for any sufficient statistic T there
exists a function q such that Tmin = q(T). Finding a minimal sufficient statistic is in
general difficult; the following provides a sufficient condition for a sufficient statistic T(X)
to be minimal:

    ∀x, x′ ∈ X : Λ(T(x)) = Λ(T(x′)) ⇒ T(x) = T(x′).

Note that Λ(t) is well-defined because Λ(x) = Λ(T(x)) for any sufficient statistic T, as we
discussed above.
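As a sanity check of this condition (an illustrative Python sketch under the setting of Example 4, with a small finite parameter set chosen by me; not from the notes), one can enumerate all Bernoulli samples of length 3 and verify that equal likelihood-ratio vectors force equal values of T(x) = ΣXi, so T is minimal for this model.

```python
from itertools import product

thetas = [0.2, 0.5, 0.8]                      # a finite parameter set, as in Example 4

def pmf(x, theta):
    """Joint PMF of i.i.d. Bernoulli(theta) trials."""
    return theta**sum(x) * (1 - theta)**(len(x) - sum(x))

def Lambda(x):
    """Vector of likelihood ratios against the last parameter (Example 4)."""
    return tuple(round(pmf(x, th) / pmf(x, thetas[-1]), 12) for th in thetas[:-1])

T = sum                                       # candidate minimal sufficient statistic
ok = all(T(x) == T(xp)
         for x in product([0, 1], repeat=3)
         for xp in product([0, 1], repeat=3)
         if Lambda(x) == Lambda(xp))
print(ok)   # True: equal likelihood-ratio vectors imply equal T, so T is minimal here
```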
More Examples of Sufficient Statistics
▶ Example 1: Bernoulli Distribution
Suppose that X = (X1 , X2 , . . . , Xn ) is i.i.d. and each Xi satisfies the Bernoulli distribution
with unknown probability, i.e., Pθ(Xi = 1) = θ and Pθ(Xi = 0) = 1 − θ. Then we claim
that T(X) = Σ_{i=1}^n Xi is a sufficient statistic. To see this, we write the joint PMF as

    pX(X; θ) = ∏_{i=1}^n pXi(Xi; θ) = ∏_{i=1}^n θ^{Xi} (1 − θ)^{1−Xi} = ∏_{i=1}^n (1 − θ) ( θ/(1 − θ) )^{Xi}
             = (1 − θ)^n ( θ/(1 − θ) )^{T(X)} · 1,

where g(T(X), θ) = (1 − θ)^n (θ/(1 − θ))^{T(X)} and h(X) = 1.
Clearly, this sufficient statistic is minimal as it is already one-dimensional.
▶ Example 2: Uniform Distribution
Suppose that X = (X1 , X2 , . . . , Xn ) is i.i.d. and each Xi satisfies the uniform distribution
over [0, θ] with unknown length θ. Then we claim that T(X) = max_{1≤i≤n} Xi is a sufficient
statistic. To see this, we write

    fX(X; θ) = ∏_{i=1}^n fXi(Xi; θ) = ∏_{i=1}^n (1/θ) 1_{[0,θ]}(Xi) = (1/θ^n) ∏_{i=1}^n 1_{[Xi,∞)}(θ)
             = (1/θ^n) 1_{[T(X),∞)}(θ) · 1,

where g(T(X), θ) = (1/θ^n) 1_{[T(X),∞)}(θ) and h(X) = 1.
Note that the tricky part is the identity 1_{[0,θ]}(Xi) = 1_{[Xi,∞)}(θ): for Xi ≥ 0, the condition Xi ≤ θ is the same as θ ≥ Xi.
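A short numerical check (Python sketch, hypothetical names): two samples with the same maximum have identical likelihood functions θ ↦ fX(x; θ), reflecting that h(X) = 1 and that the θ-dependence enters only through T(X) = max Xi.

```python
def likelihood(x, theta):
    """Joint density of i.i.d. Uniform[0, theta] observations."""
    if max(x) > theta:
        return 0.0
    return theta ** (-len(x))

x1 = [0.2, 1.7, 0.9]          # maximum 1.7 ...
x2 = [1.7, 1.1, 0.4]          # ... same maximum, different sample
for theta in (1.5, 2.0, 5.0):
    print(likelihood(x1, theta), likelihood(x2, theta))   # identical for every theta
```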
▶ Example 3: Gaussian Distribution with Unknown Mean
Suppose that X = (X1 , X2 , . . . , Xn ) is i.i.d. and each Xi satisfies the Gaussian distribution
with unknown mean θ but known variance σ². Then we claim that T(X) = Σ_{i=1}^n Xi
is a sufficient statistic. To see this, we have

    fX(X; θ) = ∏_{i=1}^n fXi(Xi; θ) = ∏_{i=1}^n (1/(√(2π) σ)) exp( −(Xi − θ)²/(2σ²) )
             = (1/(√(2π) σ))^n exp( −Σ_{i=1}^n (Xi − θ)²/(2σ²) )
             = (1/(√(2π) σ))^n exp( θ T(X)/σ² ) exp( −nθ²/(2σ²) ) · exp( −Σ_{i=1}^n Xi²/(2σ²) ),

where g(T(X), θ) = (1/(√(2π) σ))^n exp( θ T(X)/σ² − nθ²/(2σ²) ) and h(X) = exp( −Σ_{i=1}^n Xi²/(2σ²) ).
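To see this factorization at work numerically, here is an illustrative Python sketch (σ = 1 assumed, names are mine): two samples with the same sum have likelihood functions that differ only by the θ-independent factor h(X), so their ratio is constant in θ.

```python
from math import exp, pi, sqrt

def likelihood(x, theta, sigma=1.0):
    """Joint density of i.i.d. N(theta, sigma^2) observations."""
    L = 1.0
    for xi in x:
        L *= exp(-(xi - theta)**2 / (2 * sigma**2)) / (sqrt(2 * pi) * sigma)
    return L

x1 = [0.0, 1.0, 2.0]          # sum = 3
x2 = [0.5, 1.0, 1.5]          # same sum, different sample
for theta in (-1.0, 0.0, 2.5):
    print(likelihood(x1, theta) / likelihood(x2, theta))   # same ratio for every theta
```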
▶ Example 4: Gaussian Distribution with Unknown Mean and Variance
When the unknown mean is µ = θ1 and the unknown variance is σ² = θ2, then we claim that
T(X) = (T1(X), T2(X)) = ( Σ_{i=1}^n Xi , Σ_{i=1}^n Xi² ) is a sufficient statistic. To see this, we have

    fX(X; θ) = ∏_{i=1}^n fXi(Xi; θ) = ∏_{i=1}^n (1/√(2πθ2)) exp( −(Xi − θ1)²/(2θ2) )
             = (1/√(2πθ2))^n exp( −Σ_{i=1}^n (Xi − θ1)²/(2θ2) )
             = (1/√(2πθ2))^n exp( (θ1/θ2) T1(X) − T2(X)/(2θ2) ) exp( −nθ1²/(2θ2) ) · 1,

where g(T(X), θ) = (1/√(2πθ2))^n exp( (θ1/θ2) T1(X) − T2(X)/(2θ2) − nθ1²/(2θ2) ) and h(X) = 1.
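As a final numerical illustration (Python sketch, not part of the notes): two different samples sharing the same pair (ΣXi, ΣXi²) have identical likelihoods for every (θ1, θ2), consistent with h(X) = 1 in the factorization above.

```python
from math import exp, pi, sqrt

def likelihood(x, mean, var):
    """Joint density of i.i.d. N(mean, var) observations."""
    L = 1.0
    for xi in x:
        L *= exp(-(xi - mean)**2 / (2 * var)) / sqrt(2 * pi * var)
    return L

x1 = [0.0, 3.0, 3.0]          # sum = 6, sum of squares = 18
x2 = [1.0, 1.0, 4.0]          # same (sum, sum of squares), different sample
for mean, var in [(0.0, 1.0), (2.0, 0.5), (5.0, 4.0)]:
    print(likelihood(x1, mean, var), likelihood(x2, mean, var))   # identical values
```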