
Massachusetts Institute of Technology

Department of Electrical Engineering and Computer Science


6.437 Inference and Information
Spring 2015

2 Bayesian Hypothesis Testing


In a wide range of applications, one must make decisions based on a set of observa-
tions. Examples include medical diagnosis, voice and face recognition, DNA sequence
analysis, air traffic control, and digital communication. In general, the observations
are noisy, incomplete, or otherwise imperfect, and thus the decisions produced will
not always be correct. However, we would like to use a decision process that is as
good as possible in an appropriate sense.
Addressing such problems is the aim of decision theory, and a natural framework
for setting up such problems is in terms of a hypothesis test. In this framework, each
of the possible scenarios corresponds to a hypothesis. When there are M hypotheses,
we denote the set of possible hypotheses using H = {H0 , H1 , . . . , HM −1 }.1 For each
of the possible hypotheses, there is a different model for the observed data, and this
is what we will exploit to distinguish among the hypotheses.
In our formulation, the observed collection of data is represented as a random
vector y, which may be discrete- or continuous-valued. There are a variety of ways to
model the hypotheses. In this section, we follow what is referred to as the Bayesian
approach, and model the valid hypothesis as a (discrete-valued) random variable, and
thus we denote it using H.
In a Bayesian hypothesis testing problem, the complete model therefore consists
of the a priori probabilities

pH (Hm ), m = 0, 1, . . . , M − 1,

together with a characterization of the observed data under each hypothesis, which
takes the form of the conditional probability distributions2

py|H (·|Hm ), m = 0, 1, . . . , M − 1. (1)

Of course, a complete characterization of our knowledge of the correct hypothesis


based on our observations is the set of a posteriori probabilities

pH|y (Hm |y), m = 0, 1, . . . , M − 1. (2)

The distribution of possible values of H is often referred to as our belief about


the hypothesis. From this perspective, we can view the a priori probabilities as our
prior belief, and view (2) as the revision of our belief based on having observed the
1. Note that H0 is sometimes referred to as the "null" hypothesis, particularly in asymmetric problems where it has special significance.
2. As related terminology, the function py|H(y|·), where y is the actual observed data, is referred to as the likelihood function.
data y. The belief update is, of course, computed from the particular data y based
on the model via Bayes’ Rule:3

pH|y(Hm|y) = py|H(y|Hm) pH(Hm) / Σ_{m′} py|H(y|Hm′) pH(Hm′).

While the belief is a complete characterization of our knowledge of the true hy-
pothesis, in applications one must often go further and make a decision (i.e., an
intelligent guess) based on this information. To make a good decision we need some
measure of goodness, appropriately chosen for the application of interest. In the se-
quel, we develop a framework for such decision-making, restricting our attention to
the binary (M = 2) case to simplify the exposition.
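As a concrete sketch of the belief update via Bayes' Rule, the following computes the a posteriori probabilities (2) from the priors and the likelihood values at an observed y. The numerical values are illustrative assumptions, not taken from the notes.

```python
# Belief update via Bayes' Rule for an M-ary Bayesian hypothesis test.
# The priors and likelihood values below are illustrative assumptions.

def posterior(priors, likelihoods):
    """Return p(H_m | y) for each m, given p(H_m) and p(y | H_m)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)            # the denominator: sum over m' of p(y|H_m') p(H_m')
    return [j / total for j in joint]

priors = [0.5, 0.3, 0.2]          # p_H(H_m), m = 0, 1, 2 (assumed)
likelihoods = [0.1, 0.4, 0.4]     # p_{y|H}(y | H_m) evaluated at the observed y (assumed)
beliefs = posterior(priors, likelihoods)
# The posterior sums to 1; here the data shifts belief away from H_0.
```

Note that the denominator is the same for every m, so it acts purely as a normalization; this is why it cancels in the ratio tests developed below.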

2.1 Binary Hypothesis Testing


Specializing to the binary case, our model consists of two components. One is the set
of prior probabilities
P0 = pH (H0 )
(3)
P1 = pH (H1 ) = 1 − P0 .
The second is the observation model, corresponding to the likelihood functions

H0 : py|H (y|H0 )
(4)
H1 : py|H (y|H1 ).

The development is essentially the same whether the observations are discrete or
continuous. We arbitrarily use the continuous case in our development. The discrete
case differs only in that integrals are replaced by summations.
We begin with a simple example to which we will return later.

Example 1. As a highly simplified scenario, suppose a single bit of information


m ∈ {0, 1} is encoded into a codeword sm and sent over a communication channel,
where s0 and s1 are both deterministic, known quantities. Let’s further suppose that
the channel is noisy; specifically, what is received is

y = sm + w ,

where w is a zero-mean Gaussian random variable with variance σ 2 and independent


of H. From this information, we can readily construct the probability density for the
3. In applications where further data is obtained, beliefs can be further revised, again using Bayes' Rule for the computation. This updating is a simple form of what is referred to as belief propagation.

observation under each of the hypotheses, obtaining:
py|H(y|H0) = N(y; s0, σ²) = (1/√(2πσ²)) e^{−(y−s0)²/(2σ²)}
(5)
py|H(y|H1) = N(y; s1, σ²) = (1/√(2πσ²)) e^{−(y−s1)²/(2σ²)}.
In addition, if 0’s and 1’s are equally likely to be transmitted we would set the a
priori probabilities to
P0 = P1 = 1/2.
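A minimal numerical sketch of Example 1: the two likelihoods in (5) can be evaluated directly at a received value y. The codeword values s0, s1, the noise variance, and the received value are assumed for illustration.

```python
import math

# Sketch of the scalar Gaussian channel of Example 1: y = s_m + w,
# with w ~ N(0, sigma^2). All numerical values are illustrative assumptions.

def gaussian_pdf(y, mean, var):
    """N(y; mean, var), as in (5)."""
    return math.exp(-(y - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

s0, s1, sigma2 = -1.0, 1.0, 0.25   # assumed codewords and noise variance

y = 0.8                            # a hypothetical received value
p_y_given_H0 = gaussian_pdf(y, s0, sigma2)   # p_{y|H}(y | H0)
p_y_given_H1 = gaussian_pdf(y, s1, sigma2)   # p_{y|H}(y | H1)
# Since y is close to s1, the H1 likelihood dominates here.
```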

2.1.1 Optimum Decision Rules: The Likelihood Ratio Test


The solution to a hypothesis test is specified in terms of a decision rule. We focus for
the time being on deterministic decision rules. Mathematically, such a decision rule
is a function Ĥ(·) that uniquely maps every possible observation y ∈ Y to one of the
two hypotheses, i.e., Ĥ : Y → H, where H = {H0 , H1 }. From this perspective, we see
that choosing the function Ĥ(·) is equivalent to partitioning the observation space Y
into two disjoint “decision” regions, corresponding to the values of y for which each
of the two possible decisions are made. Specifically, we use Ym to denote those values
of y ∈ Y for which our rule decides Hm , i.e.,

Y0 = {y ∈ Y : Ĥ(y) = H0 }
(6)
Y1 = {y ∈ Y : Ĥ(y) = H1 }.

These regions are depicted schematically in Fig. 1.


Our goal, then, is to design this bi-valued function (equivalently the associated
decision regions Y0 and Y1 ) in such a way that the best possible performance is
obtained. In order to do this, we need to be able to quantify the notion of “best.” This
requires that we have a well-defined objective function corresponding to a suitable
measure of goodness. In the Bayesian approach, we use an objective function taking
the form of an expected cost function. Specifically, we use

C̃(Hj, Hi) ≜ Cij (7)

to denote the “cost” of deciding that the hypothesis is Ĥ = Hi when the correct
hypothesis is H = Hj . Then the optimum decision rule takes the form

Ĥ(·) = arg min_{f(·)} ϕ(f), (8)

where the average cost, which is referred to as the "Bayes risk," is

ϕ(f) = E[ C̃(H, f(y)) ], (9)

Figure 1: The regions Y0 and Y1 as defined in (6) corresponding to an example decision rule Ĥ(·), where Y is the observation alphabet.

and where the expectation in (9) is over both y and H, and where f (·) is a decision
rule.
Generally, the application dictates an appropriate choice of the costs Cij . For
example, a symmetric cost function of the form Cij = 1 − 1_{i=j}, i.e.,

C00 = C11 = 0
C01 = C10 = 1, (10)
corresponds to seeking a decision rule that minimizes the probability of a decision
error. However, there are many applications for which such symmetric cost functions
are not well-matched. For example, in a medical diagnosis problem where H0 denotes
the hypotheses that the patient does not have a particular disease and H1 that he
does, we would typically want to select cost assignments such that C01 ≫ C10.
Definition 1. A set of costs {Cij} is valid if the cost of a correct decision is lower than the cost of an incorrect decision, i.e., Cjj < Cij whenever i ≠ j.
Theorem 1. Given a priori probabilities P0 , P1 , data y, observation models py|H (·|H0 ),
py|H (·|H1 ), and valid costs C00 , C01 , C10 , C11 , the optimum Bayes’ decision rule takes
the form:
L(y) ≜ py|H(y|H1) / py|H(y|H0)  ≷_{H0}^{H1}  P0(C10 − C00) / (P1(C01 − C11)) ≜ η, (11)

i.e., the decision is H1 when L(y) > η, the decision is H0 when L(y) < η, and the
decision can be made arbitrarily when L(y) = η.

Before establishing this result, we make a few remarks. First, the left-hand side
of (11) is referred to as the likelihood ratio, and thus (11) is typically referred to as
a likelihood ratio test (LRT). Note too that the likelihood ratio—which we denote
using L(y)—is constructed from the observations model and the data. Meanwhile,
the right-hand side of (11)—which we denote using η—is a precomputable threshold
that is determined from the a priori probabilities and costs.
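As a minimal sketch of the test in (11): the threshold η can be precomputed from the priors and costs, while L(y) is formed from the two likelihood values at the observed y. All numerical values below are illustrative assumptions.

```python
# Minimal sketch of the likelihood ratio test (11).
# Priors, costs, and likelihood values are illustrative assumptions.

def lrt(p1, p0, P0, P1, C00, C01, C10, C11):
    """Return 'H1' or 'H0' given likelihood values p1 = p(y|H1), p0 = p(y|H0)."""
    L = p1 / p0                                      # likelihood ratio L(y)
    eta = (P0 * (C10 - C00)) / (P1 * (C01 - C11))    # precomputable threshold
    return "H1" if L > eta else "H0"                 # L == eta: tie broken as H0 here

# Equal priors with the symmetric costs of (10) give eta = 1, so the
# rule reduces to picking the hypothesis with the larger likelihood:
decision = lrt(p1=0.4, p0=0.1, P0=0.5, P1=0.5, C00=0, C01=1, C10=1, C11=0)
```

Note how a skewed prior raises or lowers η without touching L(y): with P0 = 0.9 the same likelihood values can flip the decision to H0.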
Proof. Consider an arbitrary but fixed decision rule f (·). In terms of this generic
f (·), the Bayes risk can be expanded in the form
ϕ(f) = E[ C̃(H, f(y)) ]
     = E[ E[ C̃(H, f(y)) | y = y ] ]
     = ∫ ϕ̃(f(y), y) py(y) dy, (12)

with
ϕ̃(H′, y) = E[ C̃(H, H′) | y = y ], (13)

and where to obtain the second equality in (12) we have used iterated expectation.
Note from (12) that since py (y) is nonnegative, it is clear that we minimize ϕ if
we minimize ϕ̃(f (y), y) for each particular value of y. Hence, we can determine the
optimum decision rule Ĥ(·) on a point-by-point basis, i.e., Ĥ(y) for each y.
Let’s consider a particular (observation) point y = y∗ . For this point, if we choose
the assignment
Ĥ(y∗ ) = H0 ,
then our conditional expectation (13) takes the value

ϕ̃(H0 , y∗ ) = C00 pH|y (H0 |y∗ ) + C01 pH|y (H1 |y∗ ). (14)

Alternatively, if we choose the assignment

Ĥ(y∗ ) = H1 ,

then our conditional expectation (13) takes the value

ϕ̃(H1 , y∗ ) = C10 pH|y (H0 |y∗ ) + C11 pH|y (H1 |y∗ ). (15)

Hence, the optimum assignment for the value y∗ is simply the choice corresponding
to the smaller of (14) and (15). It is convenient to express this optimum decision

rule using the following notation (now replacing our particular observation y∗ with a
generic observation y):

C00 pH|y(H0|y) + C01 pH|y(H1|y)  ≷_{H0}^{H1}  C10 pH|y(H0|y) + C11 pH|y(H1|y). (16)

Note that when the two sides of (16) are equal, then either assignment is equally
good—both have the same effect on the objective function (12).
A minor rearrangement of the terms in (16) results in

(C01 − C11) pH|y(H1|y)  ≷_{H0}^{H1}  (C10 − C00) pH|y(H0|y). (17)

Since for any valid choice of costs the terms in parentheses in (17) are both positive,
we can equivalently write (17) in the form4

pH|y(H1|y) / pH|y(H0|y)  ≷_{H0}^{H1}  (C10 − C00) / (C01 − C11). (18)

When we then substitute (19) into (18) and multiply both sides by P0 /P1 , we
obtain the decision rule in its final form (11), directly in terms of the measurement
densities.
As a final remark, observe that, not surprisingly, the optimum decision produced
by (17) is a particular function of our beliefs, i.e., the a posteriori probabilities

pH|y(Hm|y) = py|H(y|Hm) Pm / ( py|H(y|H0) P0 + py|H(y|H1) P1 ). (19)

2.1.2 Properties of the Likelihood Ratio Test


Several observations lend insight into the optimum decision rule (11). First, note
that the likelihood ratio L(·) is a scalar-valued function, i.e., L : Y → R, regardless
of the dimension or alphabet of the data. In fact, L(y) is an example of what is
referred to as a sufficient statistic for the problem: it summarizes everything we need
to know about the observation vector in order to make a decision. Phrased differently,
in terms of our ability to make the optimum decision (in the Bayesian sense in this
case), knowledge of L(y) is as good as knowledge of the full data vector y itself.
4. Technically, we have to be careful about dividing by zero here if pH|y(H0|y) = 0. To simplify our exposition, however, as we discuss in Section 2.1.2, we will generally restrict our attention to the case where this does not happen.

We will develop the notion of a sufficient statistic more precisely and in greater
generality in a subsequent section of the notes; however, at this point it suffices to
make two observations with respect to our hypothesis testing problem. First, (11)
tells us an explicit construction for a scalar sufficient statistic for the Bayesian binary
hypothesis testing problem. Second, sufficient statistics are not unique. For example,
any invertible function of L(y) is also a sufficient statistic. In fact, for the purposes of
implementation or analysis it is often more convenient to rewrite the likelihood ratio
test in the form
L′(y) = g(L(y))  ≷_{H0}^{H1}  g(η), (20)

where g(·) is some suitably chosen, monotonically increasing function. An important


example is the case corresponding to g(·) = ln(·), which simplifies many tests involving
densities with exponential factors, such as Gaussians.5
It is also important to emphasize that L = L(y) is a random variable—i.e., it takes
on a different value in each experiment. As such, we will frequently be interested in
its probability density function—or at least moments such as its mean and variance—
under each of H0 and H1 . Such densities can be derived using the usual method of
events, and are often used in calculating performance of the decision rule.
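A quick Monte Carlo sketch of this point, using the scalar Gaussian model of Example 1 (with parameter values assumed for illustration): the log-likelihood ratio, computed from randomly generated observations, has a clearly different mean under each hypothesis.

```python
import math
import random

# Sketch: L(y) (here its logarithm, for numerical convenience) is itself a
# random variable whose distribution depends on which hypothesis generated y.
# Parameter values are illustrative assumptions.

random.seed(0)
s0, s1, sigma = 0.0, 2.0, 1.0   # assumed codewords and noise std. deviation

def log_L(y):
    """ln L(y) for the Gaussian model of Example 1."""
    return ((y - s0) ** 2 - (y - s1) ** 2) / (2 * sigma ** 2)

# Draw observations under each hypothesis and look at the statistic's mean.
samples_H0 = [log_L(random.gauss(s0, sigma)) for _ in range(20000)]
samples_H1 = [log_L(random.gauss(s1, sigma)) for _ in range(20000)]

mean_H0 = sum(samples_H0) / len(samples_H0)   # near -(s1 - s0)^2 / (2 sigma^2)
mean_H1 = sum(samples_H1) / len(samples_H1)   # near +(s1 - s0)^2 / (2 sigma^2)
```

For this model ln L(y) is a linear function of y, hence Gaussian under each hypothesis, which is exactly what makes performance calculations tractable.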
It follows immediately from the definition in (11) that the likelihood ratio is a
nonnegative quantity. Furthermore, depending on the problem, some values of y may
lead to L(y) being zero or infinite. In particular, the former occurs when py|H (y|H1 ) =
0 but py|H (y|H0 ) > 0, which is an indication that values in a neighborhood of y
effectively cannot occur under H1 but can under H0 . In this case, there will be values
of y for which we’ll effectively know with certainty that the correct hypothesis is
H0. When the likelihood ratio is infinite, corresponding to a division-by-zero scenario,
an analogous situation exists, but with the roles of H0 and H1 reversed. These cases
where such perfect decisions are possible are referred to as singular decision scenarios.
In some practical problems, these scenarios do in fact occur. However, in other cases
they suggest a potential lack of robustness in the data modeling, i.e., that some source
of inherent uncertainty may be missing from the model. In any event, to simplify our
development for the remainder of the topic we will largely restrict our attention to
the case where 0 < L(y) < ∞ for all y.
While the likelihood ratio focuses the observed data into a single scalar for the
purpose of making an optimum decision, the threshold η for the test plays a com-
plementary role. In particular, from (11) we see that η focuses the relevant features
of the cost function and a priori probabilities into a single scalar. Furthermore, this
information is combined in a manner that is intuitively satisfying. For example, as
(11) also reflects, an increase in P0 means that H0 is more likely, so that η is increased
to appropriately bias the test toward deciding H0 for any particular observation. Sim-
5. We will discuss an important such family of distributions—exponential families—in detail in a subsequent section of the notes.

ilarly, an increase in C10 means that deciding H1 when H0 is true is more costly, so
η is increased to appropriately bias the test toward deciding H0 to offset this risk.
Finally, note that adding a constant to the cost function (i.e., to all Cij ) has, as we
would anticipate, no effect on the threshold. Hence, without loss of generality we
may set at least one of the correct decision costs—i.e., C00 or C11 —to zero.
Finally, it is important to emphasize that the likelihood ratio test (11) indirectly
determines the decision regions (6). In particular, we have

Y0 = {y ∈ Y : Ĥ(y) = H0} = {y ∈ Y : L(y) < η}
Y1 = {y ∈ Y : Ĥ(y) = H1} = {y ∈ Y : L(y) > η}. (21)
As Fig. 1 suggests, while a decision rule expressed in the measurement data space
Y can be complicated,6 (11) tells us that the observations can be transformed into
a one-dimensional space defined via L = L(y) where the decision regions have a
particularly simple form: the decision Ĥ(L) = H0 is made whenever L lies to the left
of some point η on the line, and Ĥ(L) = H1 whenever L lies to the right.

2.1.3 Maximum A Posteriori and Maximum Likelihood Decision Rules


An important cost assignment for many problems is that given by (10), which as we
recall corresponds to a minimum probability-of-error criterion. Indeed, in this case,
we have
ϕ(Ĥ) = P( Ĥ(y) = H0, H = H1 ) + P( Ĥ(y) = H1, H = H0 ).

The corresponding decision rule in this case can be obtained as a special case of
(11).
Corollary 1. The minimum probability-of-error decision rule takes the form

Ĥ(y) = arg max_{H∈{H0,H1}} pH|y(H|y). (22)

The rule (22), in which one chooses the hypothesis for which our belief is largest, is
referred to as the maximum a posteriori (MAP) decision rule.
Proof. Instead of specializing (11), we specialize the equivalent test (17), from which
we obtain a form of the minimum probability-of-error test expressed in terms of the
a posteriori probabilities for the problem, viz.,
pH|y(H1|y)  ≷_{H0}^{H1}  pH|y(H0|y). (23)

From (23) we see that the desired decision rule can be expressed in the form (22).
6. Indeed, neither of the respective sets Y0 and Y1 is even connected in general.

Still further simplification is possible when the hypotheses are equally likely (P0 =
P1 = 1/2). In this case, we have the following.

Corollary 2. When the hypotheses are equally likely, the minimum probability of
error decision rule takes the form

Ĥ(y) = arg max_{H∈{H0,H1}} py|H(y|H). (24)

The rule (24), which is referred to as the maximum likelihood (ML) decision rule,
chooses the hypothesis for which the corresponding likelihood function is largest.
Proof. Specializing (11) we obtain

py|H(y|H1) / py|H(y|H0)  ≷_{H0}^{H1}  1, (25)

or, equivalently,

py|H(y|H1)  ≷_{H0}^{H1}  py|H(y|H0),

whence (24).
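As a minimal sketch of the two rules just derived (with illustrative likelihood and prior values, not from the notes): the MAP rule (22) weighs the likelihoods by the priors, while the ML rule (24) compares the likelihoods alone, so a skewed prior can make the two disagree.

```python
# Sketch of the MAP rule (22) and the ML rule (24) for a binary test.
# Likelihood and prior values are illustrative assumptions.

def map_rule(p0, p1, P0, P1):
    """Pick the larger posterior; the common denominator of (19) cancels."""
    return "H0" if p0 * P0 > p1 * P1 else "H1"

def ml_rule(p0, p1):
    """Pick the hypothesis with the larger likelihood."""
    return "H0" if p0 > p1 else "H1"

p0, p1 = 0.3, 0.4   # p(y|H0), p(y|H1) at the observed y (assumed)

ml_decision = ml_rule(p0, p1)                     # H1: larger likelihood
map_decision = map_rule(p0, p1, P0=0.8, P1=0.2)   # H0: the prior dominates
```

With equal priors (P0 = P1 = 1/2) the two rules always agree, which is exactly the content of Corollary 2.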

Example 2. Continuing with Example 1, we obtain from (5) that the likelihood ratio
test for this problem takes the form
L(y) = [ (2πσ²)^{−1/2} e^{−(y−s1)²/(2σ²)} ] / [ (2πσ²)^{−1/2} e^{−(y−s0)²/(2σ²)} ]  ≷_{H0}^{H1}  η. (26)
As (26) suggests—and as is generally the case in Gaussian problems—the natural
logarithm of the likelihood ratio is a more convenient sufficient statistic to work with
in this example. In this case, taking logarithms of both sides of (26) yields

L′(y) = (1/(2σ²)) [ (y − s0)² − (y − s1)² ]  ≷_{H0}^{H1}  ln η. (27)

Expanding the quadratics and cancelling terms in (27) we obtain the test in its
simplest form, which for s1 > s0 is given by

y  ≷_{H0}^{H1}  (s1 + s0)/2 + σ² ln η / (s1 − s0) ≜ γ. (28)

In this form, the resulting error probability is easily obtained, and is naturally ex-
pressed in terms of Q-function notation.
We also remark that with a minimum probability-of-error criterion, if P0 = P1
then ln η = 0 and we see immediately from (27) that the optimum test takes the form

|y − s0|  ≷_{H0}^{H1}  |y − s1|,

which corresponds to a “minimum-distance” decision rule, i.e.,

Ĥ(y) = Hm̂ ,   m̂ = arg min_{m∈{0,1}} |y − sm|.

This minimum-distance property turns out to hold in multidimensional Gaussian


problems as well, and leads to convenient analysis in terms of Euclidean geometry.
Note too that in this problem the decision regions on the y-axis have a particularly
simple form; for example, for s1 > s0 we obtain

Y0 = {y ∈ R : y < γ}
(29)
Y1 = {y ∈ R : y > γ}.

In other problems—even Gaussian ones—the decision regions can be more compli-


cated, as our next example illustrates.

Example 3. Suppose that a zero-mean Gaussian random variable has one of two
possible variances, σ12 or σ02 , where σ12 > σ02 . Let the costs and prior probabilities be
arbitrary. Then the likelihood ratio test for this problem takes the form
L(y) = [ (2πσ1²)^{−1/2} e^{−y²/(2σ1²)} ] / [ (2πσ0²)^{−1/2} e^{−y²/(2σ0²)} ]  ≷_{H0}^{H1}  η.

In this problem, it is a straightforward exercise to show that the test simplifies to one
of the form

|y|  ≷_{H0}^{H1}  [ (2σ0²σ1² / (σ1² − σ0²)) ln( (σ1/σ0) η ) ]^{1/2} ≜ γ.

Hence, the decision region Y1 is the union of two disconnected regions in this case,
i.e.,
Y1 = {y ∈ R : y > γ} ∪ {y ∈ R : y < −γ}.
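A minimal sketch of Example 3's two-sided test, with variance values and threshold η assumed for illustration: the decision region Y1 consists of the two tails |y| > γ.

```python
import math

# Sketch of the variance-discrimination test of Example 3: decide H1
# when |y| exceeds gamma. Variances and eta are illustrative assumptions.

sigma0_sq, sigma1_sq = 1.0, 4.0   # assumed variances, sigma1^2 > sigma0^2
eta = 1.0                          # assumed threshold from priors and costs
sigma0, sigma1 = math.sqrt(sigma0_sq), math.sqrt(sigma1_sq)

gamma = math.sqrt(
    2 * sigma0_sq * sigma1_sq / (sigma1_sq - sigma0_sq)
    * math.log((sigma1 / sigma0) * eta)
)

def decide(y):
    # Y1 is the union of two disconnected regions: y > gamma or y < -gamma.
    return "H1" if abs(y) > gamma else "H0"
```

Intuitively, large-magnitude observations are better explained by the larger variance σ1², so both tails of the real line are assigned to H1.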

