Lecture 2: Bayesian Hypothesis Testing
The Bayesian formulation of the hypothesis testing problem specifies a priori probabilities
$$p_H(H_m), \qquad m = 0, 1, \ldots, M-1,$$
together with a characterization of the observed data under each hypothesis, which takes the form of the conditional probability distributions
$$p_{y|H}(y|H_m), \qquad m = 0, 1, \ldots, M-1.$$
While the belief, i.e., the collection of a posteriori probabilities pH|y(Hm|y), is a complete characterization of our knowledge of the true hypothesis, in applications one must often go further and make a decision (i.e., an intelligent guess) based on this information. To make a good decision we need some measure of goodness, appropriately chosen for the application of interest. In the sequel, we develop a framework for such decision-making, restricting our attention to the binary (M = 2) case to simplify the exposition.
In this binary case, the observed data are characterized under the two hypotheses by
$$H_0: \; p_{y|H}(y|H_0), \qquad H_1: \; p_{y|H}(y|H_1). \tag{4}$$
The development is essentially the same whether the observations are discrete or
continuous. We arbitrarily use the continuous case in our development. The discrete
case differs only in that integrals are replaced by summations.
We begin with a simple example to which we will return later.
Example 1. As a simple concrete scenario, suppose that a bit m ∈ {0, 1} is communicated by transmitting one of two known signal values s0 or s1, so that under hypothesis Hm the observation takes the form
$$y = s_m + w,$$
where w is a zero-mean Gaussian random variable with variance σ², independent of which hypothesis is in effect. In this case, we can directly compute the distribution of the observation under each of the hypotheses, obtaining:
$$p_{y|H}(y|H_0) = N(y; s_0, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(y-s_0)^2/(2\sigma^2)},$$
$$p_{y|H}(y|H_1) = N(y; s_1, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(y-s_1)^2/(2\sigma^2)}. \tag{5}$$
In addition, if 0’s and 1’s are equally likely to be transmitted we would set the a
priori probabilities to
P0 = P1 = 1/2.
Returning to the general binary problem, note that any decision rule Ĥ(·) maps each observation y ∈ Y to one of the two hypotheses, and thereby partitions the observation space Y into the decision regions
$$\mathcal{Y}_0 = \{y \in \mathcal{Y} : \hat{H}(y) = H_0\}, \qquad \mathcal{Y}_1 = \{y \in \mathcal{Y} : \hat{H}(y) = H_1\}. \tag{6}$$
We use Cij to denote the “cost” of deciding that the hypothesis is Ĥ = Hi when the correct hypothesis is H = Hj, and we write C̃(Hj, Hi) ≜ Cij for the corresponding cost function. Then the optimum decision rule takes the form
$$\hat{H}(\cdot) = \mathop{\arg\min}_{f(\cdot)}\, \varphi(f), \tag{8}$$

where the Bayes risk is

$$\varphi(f) \triangleq E\bigl[\tilde{C}(H, f(y))\bigr], \tag{9}$$
and where the expectation in (9) is over both y and H, and where f (·) is a decision
rule.
Generally, the application dictates an appropriate choice of the costs Cij . For
example, a symmetric cost function of the form Cij = 1 − 1{i=j} (with 1{·} the indicator function), i.e.,
$$C_{00} = C_{11} = 0, \qquad C_{01} = C_{10} = 1, \tag{10}$$
corresponds to seeking a decision rule that minimizes the probability of a decision
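To see this explicitly, note that with the costs (10) we have C̃(H, f(y)) = 1{f(y) ≠ H}, so the Bayes risk (9) reduces to the probability of a decision error:
$$\varphi(f) = E\bigl[\mathbb{1}_{\{f(y) \neq H\}}\bigr] = \Pr[f(y) \neq H].$$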
error. However, there are many applications for which such symmetric cost functions
are not well-matched. For example, in a medical diagnosis problem where H0 denotes the hypothesis that the patient does not have a particular disease and H1 the hypothesis that the patient does, we would typically want to select cost assignments such that C01 ≫ C10.
Definition 1. A set of costs {Cij } is valid if the cost of a correct decision is lower
than the cost of an incorrect decision, i.e., Cjj < Cij whenever i ≠ j.
Theorem 1. Given a priori probabilities P0 , P1 , data y, observation models py|H (·|H0 ),
py|H (·|H1 ), and valid costs C00 , C01 , C10 , C11 , the optimum Bayes’ decision rule takes
the form:
$$L(y) \triangleq \frac{p_{y|H}(y|H_1)}{p_{y|H}(y|H_0)} \;\underset{\hat{H}(y)=H_0}{\overset{\hat{H}(y)=H_1}{\gtrless}}\; \frac{P_0\,(C_{10} - C_{00})}{P_1\,(C_{01} - C_{11})} \triangleq \eta, \tag{11}$$
i.e., the decision is H1 when L(y) > η, the decision is H0 when L(y) < η, and the
decision can be made arbitrarily when L(y) = η.
Before establishing this result, we make a few remarks. First, the left-hand side
of (11) is referred to as the likelihood ratio, and thus (11) is typically referred to as
a likelihood ratio test (LRT). Note too that the likelihood ratio—which we denote
using L(y)—is constructed from the observation model and the data. Meanwhile,
the right-hand side of (11)—which we denote using η—is a precomputable threshold
that is determined from the a priori probabilities and costs.
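To make these remarks concrete, here is a minimal Python sketch (not part of the original notes; the function name bayes_lrt and the numerical values below are illustrative assumptions) that implements the test (11): the threshold η is precomputed from the priors and costs, while the likelihood ratio is evaluated from the observation models and the data.

from scipy.stats import norm

def bayes_lrt(y, lik0, lik1, P0, P1, C00, C01, C10, C11):
    """Binary Bayes likelihood ratio test (11): decide H1 iff L(y) > eta.

    lik0 and lik1 are callables returning p_{y|H}(y|H0) and p_{y|H}(y|H1).
    Ties (L(y) == eta) are broken arbitrarily here in favor of H0.
    """
    eta = (P0 * (C10 - C00)) / (P1 * (C01 - C11))  # precomputable threshold
    L = lik1(y) / lik0(y)                          # likelihood ratio from the data
    return 1 if L > eta else 0

# Illustrative use with the Gaussian models of Example 1 (assumed values):
# s0 = -1, s1 = +1, sigma = 1, equal priors, and the symmetric costs (10), so eta = 1.
s0, s1, sigma = -1.0, 1.0, 1.0
decision = bayes_lrt(0.3,
                     lambda y: norm.pdf(y, loc=s0, scale=sigma),
                     lambda y: norm.pdf(y, loc=s1, scale=sigma),
                     P0=0.5, P1=0.5, C00=0.0, C01=1.0, C10=1.0, C11=0.0)
print("decide H%d" % decision)  # y = 0.3 exceeds (s0 + s1)/2 = 0, so this prints "decide H1"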
Proof. Consider an arbitrary but fixed decision rule f (·). In terms of this generic
f (·), the Bayes risk can be expanded in the form
$$\varphi(f) = E\bigl[\tilde{C}(H, f(y))\bigr]
= E\Bigl[\,E\bigl[\tilde{C}(H, f(y)) \bigm| y = y\bigr]\Bigr]
= \int \tilde{\varphi}(f(y), y)\, p_y(y)\, dy, \tag{12}$$

with

$$\tilde{\varphi}(\hat{H}, y) = E\bigl[\tilde{C}(H, \hat{H}) \bigm| y = y\bigr], \tag{13}$$
and where to obtain the second equality in (12) we have used iterated expectation.
Note from (12) that since py (y) is nonnegative, it is clear that we minimize ϕ if
we minimize ϕ̃(f (y), y) for each particular value of y. Hence, we can determine the
optimum decision rule Ĥ(·) on a point-by-point basis, i.e., Ĥ(y) for each y.
Let’s consider a particular (observation) point y = y∗ . For this point, if we choose
the assignment
Ĥ(y∗ ) = H0 ,
then our conditional expectation (13) takes the value
$$\tilde{\varphi}(H_0, y_*) = C_{00}\, p_{H|y}(H_0|y_*) + C_{01}\, p_{H|y}(H_1|y_*). \tag{14}$$
If we instead choose the assignment
$$\hat{H}(y_*) = H_1,$$
then (13) takes the value
$$\tilde{\varphi}(H_1, y_*) = C_{10}\, p_{H|y}(H_0|y_*) + C_{11}\, p_{H|y}(H_1|y_*). \tag{15}$$
Hence, the optimum assignment for the value y∗ is simply the choice corresponding
to the smaller of (14) and (15). It is convenient to express this optimum decision
rule using the following notation (now replacing our particular observation y∗ with a
generic observation y):
$$C_{00}\, p_{H|y}(H_0|y) + C_{01}\, p_{H|y}(H_1|y) \;\underset{\hat{H}(y)=H_0}{\overset{\hat{H}(y)=H_1}{\gtrless}}\; C_{10}\, p_{H|y}(H_0|y) + C_{11}\, p_{H|y}(H_1|y). \tag{16}$$
Note that when the two sides of (16) are equal, then either assignment is equally
good—both have the same effect on the objective function (12).
A minor rearrangement of the terms in (16) results in
$$(C_{01} - C_{11})\, p_{H|y}(H_1|y) \;\underset{\hat{H}(y)=H_0}{\overset{\hat{H}(y)=H_1}{\gtrless}}\; (C_{10} - C_{00})\, p_{H|y}(H_0|y). \tag{17}$$
Since for any valid choice of costs the terms in parentheses in (17) are both positive,
we can equivalently write (17) in the form
$$\frac{p_{H|y}(H_1|y)}{p_{H|y}(H_0|y)} \;\underset{\hat{H}(y)=H_0}{\overset{\hat{H}(y)=H_1}{\gtrless}}\; \frac{C_{10} - C_{00}}{C_{01} - C_{11}}. \tag{18}$$
When we then substitute (19) into (18) and multiply both sides by P0 /P1 , we
obtain the decision rule in its final form (11), directly in terms of the measurement
densities.
As a final remark, observe that, not surprisingly, the optimum decision produced
by (17) is a particular function of our beliefs, i.e., the a posteriori probabilities
$$p_{H|y}(H_m|y) = \frac{p_{y|H}(y|H_m)\, P_m}{p_{y|H}(y|H_0)\, P_0 + p_{y|H}(y|H_1)\, P_1}. \tag{19}$$
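It is worth noting (a direct rearrangement of (19), dividing numerator and denominator by p_{y|H}(y|H_0) P_0) that
$$p_{H|y}(H_1|y) = \frac{L(y)\,P_1/P_0}{1 + L(y)\,P_1/P_0}, \qquad p_{H|y}(H_0|y) = \frac{1}{1 + L(y)\,P_1/P_0},$$
so the beliefs depend on the data y only through the value of the likelihood ratio L(y).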
We will develop the notion of a sufficient statistic more precisely and in greater
generality in a subsequent section of the notes; however, at this point it suffices to
make two observations with respect to our hypothesis testing problem. First, (11) gives an explicit construction of a scalar sufficient statistic for the Bayesian binary hypothesis testing problem, namely the likelihood ratio L(y). Second, sufficient statistics are not unique. For example,
any invertible function of L(y) is also a sufficient statistic. In fact, for the purposes of
implementation or analysis it is often more convenient to rewrite the likelihood ratio
test in the form
$$L'(y) = g(L(y)) \;\underset{\hat{H}(y)=H_0}{\overset{\hat{H}(y)=H_1}{\gtrless}}\; g(\eta), \tag{20}$$
for a suitably chosen, monotonically increasing function g(·).
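A common choice, and the one used in Example 2 below, is g(·) = ln(·), which turns (11) into the log-likelihood ratio test
$$\ln L(y) \;\underset{\hat{H}(y)=H_0}{\overset{\hat{H}(y)=H_1}{\gtrless}}\; \ln\eta.$$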
The threshold η in (11) captures how the prior probabilities and the costs bias the decision. For example, an increase in C01 means that deciding H0 when H1 is true is more costly, so η is decreased to appropriately bias the test toward deciding H1 to offset this risk. Similarly, an increase in C10 means that deciding H1 when H0 is true is more costly, so η is increased to appropriately bias the test toward deciding H0 to offset this risk.
Finally, note that adding a constant to the cost function (i.e., to all Cij ) has, as we
would anticipate, no effect on the threshold. Hence, without loss of generality we
may set at least one of the correct decision costs—i.e., C00 or C11 —to zero.
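This invariance is immediate from (11): replacing each Cij by Cij + c leaves the threshold unchanged, since
$$\frac{P_0\,[(C_{10}+c) - (C_{00}+c)]}{P_1\,[(C_{01}+c) - (C_{11}+c)]} = \frac{P_0\,(C_{10} - C_{00})}{P_1\,(C_{01} - C_{11})} = \eta.$$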
Finally, it is important to emphasize that the likelihood ratio test (11) indirectly
determines the decision regions (6). In particular, we have
$$\mathcal{Y}_1 = \{y \in \mathcal{Y} : L(y) > \eta\}, \qquad \mathcal{Y}_0 = \{y \in \mathcal{Y} : L(y) < \eta\}, \tag{21}$$
with observations for which L(y) = η assigned arbitrarily. Note that neither Y0 nor Y1 need even be connected in general; Example 3 below provides an illustration.

Consider now the minimum probability-of-error criterion corresponding to the symmetric cost assignment (10). The corresponding decision rule in this case can be obtained as a special case of (11).

Corollary 1. The minimum probability-of-error decision rule takes the form
$$\hat{H}(y) = \mathop{\arg\max}_{H \in \{H_0, H_1\}}\, p_{H|y}(H|y). \tag{22}$$
The rule (22), in which one chooses the hypothesis for which our belief is largest, is referred to as the maximum a posteriori (MAP) decision rule.
Proof. Instead of specializing (11), we specialize the equivalent test (17), from which
we obtain a form of the minimum probability-of-error test expressed in terms of the
a posteriori probabilities for the problem, viz.,
$$p_{H|y}(H_1|y) \;\underset{\hat{H}(y)=H_0}{\overset{\hat{H}(y)=H_1}{\gtrless}}\; p_{H|y}(H_0|y). \tag{23}$$
From (23) we see that the desired decision rule can be expressed in the form (22).
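For completeness, the same conclusion follows by specializing (11) directly: with the costs (10), the threshold is η = P0/P1, so the test reads
$$p_{y|H}(y|H_1)\,P_1 \;\underset{\hat{H}(y)=H_0}{\overset{\hat{H}(y)=H_1}{\gtrless}}\; p_{y|H}(y|H_0)\,P_0,$$
which, by Bayes' rule (19), is again a comparison of the a posteriori probabilities as in (22).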
Still further simplification is possible when the hypotheses are equally likely (P0 =
P1 = 1/2). In this case, we have the following.
Corollary 2. When the hypotheses are equally likely, the minimum probability-of-error decision rule takes the form
$$\hat{H}(y) = \mathop{\arg\max}_{H \in \{H_0, H_1\}}\, p_{y|H}(y|H). \tag{24}$$
The rule (24), which is referred to as the maximum likelihood (ML) decision rule, chooses the hypothesis for which the corresponding likelihood function is largest.
Proof. Specializing (11) we obtain
$$\frac{p_{y|H}(y|H_1)}{p_{y|H}(y|H_0)} \;\underset{\hat{H}(y)=H_0}{\overset{\hat{H}(y)=H_1}{\gtrless}}\; 1, \tag{25}$$
or, equivalently,
$$p_{y|H}(y|H_1) \;\underset{\hat{H}(y)=H_0}{\overset{\hat{H}(y)=H_1}{\gtrless}}\; p_{y|H}(y|H_0),$$
whence (24) follows.
Example 2. Continuing with Example 1, we obtain from (5) that the likelihood ratio
test for this problem takes the form
$$L(y) = \frac{\dfrac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(y-s_1)^2/(2\sigma^2)}}{\dfrac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(y-s_0)^2/(2\sigma^2)}} \;\underset{\hat{H}(y)=H_0}{\overset{\hat{H}(y)=H_1}{\gtrless}}\; \eta. \tag{26}$$
As (26) suggests—and as is generally the case in Gaussian problems—the natural
logarithm of the likelihood ratio is a more convenient sufficient statistic to work with
in this example. In this case, taking logarithms of both sides of (26) yields
$$L'(y) = \frac{1}{2\sigma^2}\Bigl[(y - s_0)^2 - (y - s_1)^2\Bigr] \;\underset{\hat{H}(y)=H_0}{\overset{\hat{H}(y)=H_1}{\gtrless}}\; \ln\eta. \tag{27}$$
Expanding the quadratics and cancelling terms in (27) we obtain the test in its
simplest form, which for s1 > s0 is given by
$$y \;\underset{\hat{H}(y)=H_0}{\overset{\hat{H}(y)=H_1}{\gtrless}}\; \frac{s_1 + s_0}{2} + \frac{\sigma^2 \ln\eta}{s_1 - s_0} \;\triangleq\; \gamma. \tag{28}$$
In this form, the resulting error probability is easily obtained, and is naturally expressed in terms of Q-function notation.
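Concretely, since y is Gaussian with mean sm and variance σ² under Hm, and since H1 is declared exactly when y > γ, a routine calculation gives (with Q(x) ≜ ∫_x^∞ (2π)^{-1/2} e^{-t²/2} dt)
$$\Pr[\text{error}] = P_0 \Pr[y > \gamma \mid H_0] + P_1 \Pr[y < \gamma \mid H_1] = P_0\, Q\!\left(\frac{\gamma - s_0}{\sigma}\right) + P_1\, Q\!\left(\frac{s_1 - \gamma}{\sigma}\right).$$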
We also remark that with a minimum probability-of-error criterion, if P0 = P1
then ln η = 0 and we see immediately from (27) that the optimum test takes the form
$$|y - s_0| \;\underset{\hat{H}(y)=H_0}{\overset{\hat{H}(y)=H_1}{\gtrless}}\; |y - s_1|,$$
i.e., we decide in favor of whichever signal value sm is closer to the observed y.
In terms of the decision regions (6), the threshold test (28) corresponds to the intervals
$$\mathcal{Y}_0 = \{y \in \mathbb{R} : y < \gamma\}, \qquad \mathcal{Y}_1 = \{y \in \mathbb{R} : y > \gamma\}. \tag{29}$$
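As a quick numerical sanity check (our own illustration, with assumed values s0 = -1, s1 = +1, σ = 1, equal priors, and minimum probability-of-error costs, so that γ = 0), the following Python sketch simulates the threshold test (28); the empirical error rate should be close to Q(1) ≈ 0.159.

import numpy as np

# Monte Carlo check of the threshold test (28) for Example 2 (assumed values):
# s0 = -1, s1 = +1, sigma = 1, P0 = P1 = 1/2, min-probability-of-error costs,
# so eta = 1, ln(eta) = 0, and gamma = (s0 + s1)/2 = 0.
rng = np.random.default_rng(0)
s0, s1, sigma, gamma = -1.0, 1.0, 1.0, 0.0
n = 100_000
H = rng.integers(0, 2, size=n)                        # true hypotheses, equally likely
y = np.where(H == 1, s1, s0) + sigma * rng.standard_normal(n)
H_hat = (y > gamma).astype(int)                       # decide H1 iff y > gamma, per (28)
print("empirical error probability:", np.mean(H_hat != H))   # close to Q(1) ~ 0.159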
Example 3. Suppose that the observation y is a zero-mean Gaussian random variable whose variance is σ0² under H0 and σ1² under H1, where σ1² > σ0². Let the costs and prior probabilities be arbitrary. Then the likelihood ratio test for this problem takes the form
$$L(y) = \frac{\dfrac{1}{\sqrt{2\pi\sigma_1^2}}\, e^{-y^2/(2\sigma_1^2)}}{\dfrac{1}{\sqrt{2\pi\sigma_0^2}}\, e^{-y^2/(2\sigma_0^2)}} \;\underset{\hat{H}(y)=H_0}{\overset{\hat{H}(y)=H_1}{\gtrless}}\; \eta.$$
In this problem, it is a straightforward exercise to show that the test simplifies to one
of the form
$$|y| \;\underset{\hat{H}(y)=H_0}{\overset{\hat{H}(y)=H_1}{\gtrless}}\; \sqrt{\frac{2\,\sigma_0^2\,\sigma_1^2}{\sigma_1^2 - \sigma_0^2}\,\ln\!\Bigl(\frac{\sigma_1}{\sigma_0}\,\eta\Bigr)} \;\triangleq\; \gamma.$$
Hence, the decision region Y1 is the union of two disconnected regions in this case,
i.e.,
Y1 = {y ∈ R : y > γ} ∪ {y ∈ R : y < −γ}.
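For a concrete sense of scale, suppose for illustration that σ0 = 1, σ1 = 2, and η = 1 (as would arise, e.g., with equally likely hypotheses and the costs (10)). Then
$$\gamma = \sqrt{\frac{2 \cdot 1 \cdot 4}{4 - 1}\,\ln\!\left(\frac{2}{1}\cdot 1\right)} = \sqrt{\tfrac{8}{3}\,\ln 2} \approx 1.36,$$
so that Y1 = {y : |y| > 1.36} indeed consists of two disjoint intervals.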