
Some of Bayesian Statistics: The Essential Parts

Rebecca C. Steorts

February 21, 2016


Contents

1 Introduction
1.1 Advantages of Bayesian Methods
1.2 de Finetti's Theorem

2 Introduction to Bayesian Methods
2.1 Decision Theory
2.2 Frequentist Risk
2.3 Motivation for Bayes
2.4 Bayesian Decision Theory
◯ Frequentist Interpretation: Risk
◯ Bayesian Interpretation: Posterior Risk
◯ Hybrid Ideas
2.5 Bayesian Parametric Models
2.6 How to Choose Priors
2.7 Hierarchical Bayesian Models
2.8 Empirical Bayesian Models
2.9 Posterior Predictive Distributions

3 Being Objective
◯ Meaning Of Flat
◯ Objective Priors in More Detail
3.1 Reference Priors
◯ Laplace Approximation
◯ Some Probability Theory
◯ Shrinkage Argument of J.K. Ghosh
◯ Reference Priors
3.2 Final Thoughts on Being Objective

4 Evaluating Bayesian Procedures
4.1 Confidence Intervals versus Credible Intervals
4.2 Credible Sets or Intervals
4.3 Bayesian Hypothesis Testing
◯ Lavine and Schervish (The American Statistician, 1999): Bayes Factors: What They Are and What They Are Not
4.4 Bayesian p-values
◯ Prior Predictive p-value
◯ Other Bayesian p-values
4.5 Appendix to Chapter 4 (Done by Rafael Stern)

5 Monte Carlo Methods
5.1 A Quick Review of Monte Carlo Methods
◯ Classical Monte Carlo Integration
◯ Importance Sampling
◯ Importance Sampling with unknown normalizing constant
◯ Rejection Sampling
5.2 Introduction to Gibbs and MCMC
◯ Markov Chains and Gibbs Samplers
◯ The Two-Stage Gibbs Sampler
◯ The Multistage Gibbs Sampler
◯ Application of the GS to latent variable models
5.3 MCMC Diagnostics
5.4 Theory and Application Based Example
◯ PlA2 Example
5.5 Metropolis and Metropolis-Hastings
◯ Metropolis-Hastings Algorithm
◯ Metropolis and Gibbs Combined
5.6 Introduction to Nonparametric Bayes
◯ Motivations
◯ The Dirichlet Process
◯ Polya Urn Scheme on Urn With Finitely Many Colors
◯ Polya Urn Scheme in General
◯ De Finetti and Exchangeability
◯ Chinese Restaurant Process
◯ Clustering: How to choose K?
Chapter 1

Introduction

There are three kinds of lies: lies, damned lies and statistics.
—Mark Twain

The word “Bayesian” traces its origin to the 18th century and English Rev-
erend Thomas Bayes, who along with Pierre-Simon Laplace was among the
first thinkers to consider the laws of chance and randomness in a quantita-
tive, scientific way. Both Bayes and Laplace were aware of a relation that is
now known as Bayes' Theorem:

p(θ∣x) = p(x∣θ)p(θ) / p(x) ∝ p(x∣θ)p(θ).    (1.1)

The proportionality ∝ in Eq. (1.1) signifies that the 1/p(x) factor is con-
stant and may be ignored when viewing p(θ∣x) as a function of θ. We can
decompose Bayes’ Theorem into three principal terms:

p(θ∣x) posterior
p(x∣θ) likelihood
p(θ) prior

In effect, Bayes’ Theorem provides a general recipe for updating prior beliefs
about an unknown parameter θ based on observing some data x.

However, the notion of having prior beliefs about a parameter that is os-
tensibly “unknown” did not sit well with many people who considered the
problem in the 19th and early 20th centuries. The resulting search for a way to practice statistics without priors led to the development of frequentist statistics by such eminent figures as Sir Ronald Fisher, Karl Pearson, Jerzy Neyman, Abraham Wald, and many others.

The frequentist way of thinking came to dominate statistical theory and


practice in the 20th century, to the point that most students who take only
introductory statistics courses are never even aware of the existence of an
alternative paradigm. However, recent decades have seen a resurgence of
Bayesian statistics (partially due to advances in computing power), and
an increasing number of statisticians subscribe to the Bayesian school of
thought. Perhaps most encouragingly, both frequentists and Bayesians have
become more willing to recognize the strengths of the opposite approach
and the weaknesses of their own, and it is now common for open-minded
statisticians to freely use techniques from both sides when appropriate.

1.1 Advantages of Bayesian Methods

The basic philosophical difference between the frequentist and Bayesian


paradigms is that Bayesians treat an unknown parameter θ as random and
use probability to quantify their uncertainty about it. In contrast, frequen-
tists treat θ as unknown but fixed, and they therefore believe that proba-
bility statements about θ are useless. This fundamental disagreement leads
to entirely different ways to handle statistical problems, even problems that
might at first seem very basic.

To motivate the Bayesian approach, we now discuss two simple examples


in which the frequentist way of thinking leads to answers that might be
considered awkward, or even nonsensical.
Example 1.1: Let θ be the probability of a particular coin landing on
heads, and suppose we want to test the hypotheses
H0 ∶ θ = 1/2, H1 ∶ θ > 1/2
at a significance level of α = 0.05. Now suppose we observe the following
sequence of flips:
heads, heads, heads, heads, heads, tails (5 heads, 1 tail)
To perform a frequentist hypothesis test, we must define a random variable
to describe the data. The proper way to do this depends on exactly which
of the following two experiments was actually performed:

• Suppose that the experiment was “Flip six times and record the re-
sults.” In this case, the random variable X counts the number of
heads, and X ∼ Binomial(6, θ). The observed data was x = 5, and the
p-value of our hypothesis test is

p-value = P_{θ=1/2}(X ≥ 5) = P_{θ=1/2}(X = 5) + P_{θ=1/2}(X = 6) = 6/64 + 1/64 = 7/64 = 0.109375 > 0.05.
So we fail to reject H0 at α = 0.05.

• Suppose instead that the experiment was “Flip until we get tails.” In
this case, the random variable X counts the number of the flip on
which the first tails occurs, and X ∼ Geometric(1 − θ). The observed
data was x = 6, and the p-value of our hypothesis test is

p-value = P_{θ=1/2}(X ≥ 6) = 1 − P_{θ=1/2}(X < 6) = 1 − ∑_{x=1}^{5} P_{θ=1/2}(X = x) = 1 − (1/2 + 1/4 + 1/8 + 1/16 + 1/32) = 1/32 = 0.03125 < 0.05.
So we reject H0 at α = 0.05.

The conclusions differ, which seems absurd. Moreover the p-values aren’t
even close—one is 3.5 times as large as the other. Essentially, the result of
our hypothesis test depends on whether we would have stopped flipping if we
had gotten a tails sooner. In other words, the frequentist approach requires
us to specify what we would have done had the data been something that
we already know it wasn’t.

Note that despite the different results, the likelihood for the actual value of
x that was observed is the same for both experiments (up to a constant):

p(x∣θ) ∝ θ^5 (1 − θ).

A Bayesian approach would take the data into account only through this
likelihood and would therefore be guaranteed to provide the same answers
regardless of which experiment was being performed.
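As a quick numerical check (a sketch, not part of the original notes), both p-values can be computed in R; dbinom is the Binomial pmf, and pgeom uses R's convention that a Geometric variable counts failures before the first success:

# Binomial experiment: flip six times; observed x = 5 heads.
sum(dbinom(5:6, size = 6, prob = 0.5))   # 0.109375, so fail to reject
# Geometric experiment: flip until the first tails; X >= 6 means the
# first five flips were heads, i.e., at least 5 failures before a success.
1 - pgeom(4, prob = 0.5)                 # 0.03125, so reject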

Example 1.2: Suppose we want to test whether the voltage θ across some
electrical component differs from 9 V, based on noisy readings of this voltage
from a voltmeter. Suppose the data is as follows:

9.7, 9.4, 9.8, 8.7, 8.6

A frequentist might assume that the voltage readings Xi are iid from some
N (θ, σ 2 ) distribution, which would lead to a basic one-sample t-test.

However, the frequentist is then presented with an additional piece of information: The voltmeter used for the experiment only went up to 10 V, and
any readings that might have otherwise been higher are instead truncated
to that value. Notice that none of the voltages in the data are 10 V. In other
words, we already know that the 10 V limit was completely irrelevant for
the data we actually observed.

Nevertheless, a frequentist must now redo the analysis and could perhaps
obtain a different conclusion, because the 10 V limit changes the distribution
of the observations under the null hypothesis. Like in the last example, the
frequentist results change based on what would have happened had the data
been something that we already know it wasn’t.

The problems in Examples 1.1 and 1.2 arise from the way the frequentist
paradigm forces itself to interpret probability. Another familiar aspect of
this problem is the awkward definition of “confidence” in frequentist confi-
dence intervals. The most natural interpretation of a 95% confidence inter-
val (L, U )—that there is a 95% chance that the parameter is between L and
U —is dead wrong from the frequentist point of view. Instead, the notion
of “confidence” must be interpreted in terms of repeating the experiment a
large number of times (in principle, an infinite number), and no probabilistic
statement can be made about this particular confidence interval computed
from the data we actually observed.

1.2 de Finetti’s Theorem

In this section, we'll motivate the use of priors on parameters and indeed motivate the very use of parameters. We begin with a definition.
Definition 1.1: (Infinite exchangeability). We say that (x1 , x2 , . . . ) is an
infinitely exchangeable sequence of random variables if, for any n, the joint
probability p(x1, x2, ..., xn) is invariant to permutation of the indices. That is, for any permutation π,

p(x1, x2, ..., xn) = p(xπ(1), xπ(2), ..., xπ(n)).

A key assumption of many statistical analyses is that the random variables


being studied are independent and identically distributed (iid). Note that
iid random variables are always infinitely exchangeable. However, infinite
exchangeability is a much broader concept than being iid; an infinitely ex-
changeable sequence is not necessarily iid. For example, let (x1 , x2 , . . . ) be
iid, and let x0 be a non-trivial random variable independent of the rest. Then
(x0 + x1 , x0 + x2 , . . .) is infinitely exchangeable but not iid. The usefulness
of infinite exchangeability lies in the following theorem.
Theorem 1.1. (De Finetti). A sequence of random variables (x1 , x2 , . . . )
is infinitely exchangeable iff, for all n,
p(x1, x2, ..., xn) = ∫ ∏_{i=1}^{n} p(xi∣θ) P(dθ),

for some measure P on θ.

If the distribution on θ has a density, we can replace P (dθ) with p(θ) dθ,
but the theorem applies to a much broader class of cases than just those
with a density for θ.

Clearly, since ∏_{i=1}^{n} p(xi∣θ) is invariant to reordering, we have that any sequence of distributions that can be written as

∫ ∏_{i=1}^{n} p(xi∣θ) p(θ) dθ,

for all n must be infinitely exchangeable. The other direction, though, is


much deeper. It says that if we have exchangeable data, then:

• There must exist a parameter θ.

• There must exist a likelihood p(x∣θ).

• There must exist a distribution P on θ.

• The above quantities must exist so as to render the data (x1 , . . . , xn )


conditionally independent.

Thus, the theorem provides an answer to the questions of why we should


use parameters and why we should put priors on parameters.
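As a small illustration of the forward direction (a sketch with an assumed Beta mixing distribution, not an example from the notes): for Bernoulli(θ) data with θ ∼ Beta(a, b), the marginal probability of a binary sequence is B(a + s, b + n − s)/B(a, b), where s is the number of ones, so any permutation of the sequence has the same probability:

# Marginal probability of a binary sequence under a Beta(a, b) mixture
# of iid Bernoulli(theta) trials; it depends only on the number of ones.
a <- 2; b <- 3
seq.prob <- function(x) beta(a + sum(x), b + length(x) - sum(x)) / beta(a, b)
seq.prob(c(1, 0, 1))   # same value as
seq.prob(c(1, 1, 0))   # any reordering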

Example 1.3: (Document processing and information retrieval). To high-


light the difference between iid and infinitely exchangeable sequences, con-
sider that search engines have historically used “bag-of-words” models to
model documents. That is, for the moment, pretend that the order of words
in a document does not matter. Even so, the words are definitely not iid.
If we see one word and it is a French word, we then expect that the rest of
the document is likely to be in French. If we see the French words voyage
(travel), passeport (passport), and douane (customs), we expect the rest of
the document to be both in French and on the subject of travel. Since we
are assuming infinite exchangeability, there is some θ governing these intu-
itions. Thus, we see that θ can be very rich, and it seems implausible that θ
might always be finite-dimensional in Theorem 1.1. In fact, it is the case that θ can be infinite-dimensional in Theorem 1.1. For example, in nonparametric
Bayesian work, θ can be a stochastic process.
Chapter 2

Introduction to Bayesian
Methods

Every time I think I know what’s going on, suddenly there’s another layer
of complications. I just want this damned thing solved.
—John Scalzi, The Last Colony

We introduce Bayesian methods first through motivations from decision theory, bringing in the ideas of loss functions as well as many others. Advanced topics in decision theory will be covered much later in the course. We will also cover the following topics in this chapter:

• hierarchical and empirical Bayesian methods

• the difference between subjective and objective priors

• posterior predictive distributions.

2.1 Decision Theory

Another motivation for the Bayesian approach is decision theory. Its origins
go back to Von Neumann and Morgenstern’s game theory, but the main
character was Wald. In statistical decision theory, we formalize good and
bad results with a loss function.


A loss function L(θ, δ(x)) is a function of a parameter or index θ ∈ Θ and a decision δ(x) based on the data x ∈ X. For example, δ(x) = (1/n) ∑_{i=1}^{n} xi
might be the sample mean, and θ might be the true mean. The loss function
determines the penalty for deciding δ(x) if θ is the true parameter. To give
some intuition, in the discrete case, we might use a 0–1 loss, which assigns


⎪0 if δ(x) = θ,
L(θ, δ(x)) = ⎨

⎪ if δ(x) ≠ θ,
⎩1
or in the continuous case, we might use the squared error loss L(θ, δ(x)) =
(θ − δ(x))2 . Notice that in general, δ(x) does not necessarily have to be an
estimate of θ. Loss functions provide a very good foundation for statistical
decision theory. They are simply a function of the state of nature (θ) and a
decision function (δ(⋅)). In order to compare procedures we need to calculate
which procedure is best even though we cannot observe the true nature of
the parameter space Θ and data X. This is the main challenge of decision
theory and the break between frequentists and Bayesians.

2.2 Frequentist Risk

Definition 2.1: The frequentist risk is

R(θ, δ(x)) = Eθ[L(θ, δ(x))] = ∫_X L(θ, δ(x)) f(x∣θ) dx,

where θ is held fixed and the expectation is taken over X.

Thus, the risk measures the long-term average loss resulting from using δ.

Figure 2.1 shows the risk of three different decisions as a function of θ ∈ Θ.

Often one decision does not dominate the other everywhere, as is the case with decisions δ1 and δ3. The challenge is in saying whether, for example, δ1 or δ3 is better. In other words, how should we aggregate over Θ?

Frequentists have a few answers for deciding which is better:

1. Admissibility. A decision which is inadmissible is one that is dominated everywhere. For example, in Figure 2.1, δ2 dominates δ1 for all values of θ. It would be easy to compare decisions if all but one were inadmissible. But usually the risk functions overlap, so this criterion fails.

Figure 2.1: Frequentist risk, showing the risk curves R(θ, δ1), R(θ, δ2), and R(θ, δ3) as functions of θ.

2. Restricted classes of procedure. We say that an estimator θ̂ is


an unbiased estimator of θ if Eθ [θ̂] = θ for all θ. If we restrict our
attention to only unbiased estimators then we can often reduce the
situation to only risk curves like δ1 and δ2 in Figure 2.1, eliminat-
ing overlapping curves like δ3 . The existence of an optimal unbiased
procedure is a nice frequentist theory, but many good procedures are
biased—for example Bayesian procedures are typically biased. More
surprisingly, some unbiased procedures are actually inadmissible. For
example, James and Stein showed that the sample mean is an inadmis-
sible estimate of the mean of a multivariate Gaussian in three or more
dimensions. There are also some problems where no unbiased estimator
exists—for example, when p is a binomial proportion and we wish to
estimate 1/p (see Example 2.1.2 on page 83 of Lehmann and Casella).
If we restrict our class of procedures to those which are equivariant,
we also get nice properties. We do not go into detail here, but these
are procedures with the same group theoretic properties as the data.

3. Minimax. In this approach we get around the problem by just looking


at sup_Θ R(θ, δ(x)), where R(θ, δ(x)) = Eθ[L(θ, δ(x))]. For example, in Figure 2.2, δ2 would be chosen over δ1 because its worst-case risk (the grey dotted line) is lower.

Figure 2.2: Minimax frequentist risk, showing the risk curves R(θ, δ1) and R(θ, δ2) and their worst-case (maximum) risks.

A Bayesian answer is to introduce a weighting function p(θ) to tell which


part of Θ is important and integrate with respect to p(θ). In some sense the
frequentist approach is the opposite of the Bayesian approach. However,
sometimes an equivalent Bayesian procedure can be derived using a certain
prior. Before moving on, again note that R(θ, δ(x)) = Eθ [L(θ, δ(x))] is
an expectation on X, assuming fixed θ. A Bayesian would only look at x,
the data you observed, not all possible X. An alternative definition of a
frequentist is, “Someone who is happy to look at other data they could have
gotten but didn’t.”
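To make risk curves like those in Figures 2.1 and 2.2 concrete, here is a small Monte Carlo sketch (not from the notes; all settings are illustrative) that approximates R(θ, δ) under squared error loss for two decisions, the sample mean and the sample median, with N(θ, 1) data:

# Monte Carlo approximation of R(theta, delta) = E_theta[(theta - delta(X))^2]
set.seed(1)
thetas <- seq(-3, 3, by = 0.25)
n <- 10; reps <- 2000
risk.mean <- risk.median <- numeric(length(thetas))
for (j in seq_along(thetas)) {
  x <- matrix(rnorm(n * reps, mean = thetas[j]), nrow = reps)
  risk.mean[j]   <- mean((rowMeans(x) - thetas[j])^2)
  risk.median[j] <- mean((apply(x, 1, median) - thetas[j])^2)
}
plot(thetas, risk.median, type = "l", lty = 2, xlab = expression(theta), ylab = "Risk")
lines(thetas, risk.mean, lty = 1)
legend("topright", c("sample mean", "sample median"), lty = 1:2)

Here neither estimated curve crosses the other, which matches the theory: for normal data both estimators are unbiased and the sample mean has uniformly smaller variance, hence uniformly smaller risk.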

2.3 Motivation for Bayes

The Bayesian approach can also be motivated by a set of principles. Some


books and classes start with a long list of axioms and principles conceived
in the 1950s and 1960s. However, we will focus on three main principles.


1. Conditionality Principle: The idea here for a Bayesian is that we


condition on the data x.

• Suppose we have an experiment concerning inference about θ that


is chosen from a collection of possible experiments independently.
• Then any experiment not chosen is irrelevant to the inference
(this is the opposite of what we do in frequentist inference).

Example 2.1: For example, two different labs estimate the potency
of drugs. Both have some error or noise in their measurements which
can be accurately estimated from past tests. Now we introduce a new
drug. Then we test its potency at a randomly chosen lab. Suppose
the sample sizes differ dramatically.

• Suppose the sample size of the first experiment (lab 1) is 1 and


the sample size of the second experiment (lab 2) is 100.
• What happens if we’re doing a frequentist experiment in terms
of the variance? Since this is a randomized experiment, we need
to take into account all of the data. In essence, the variance will
do some sort of averaging to take into account the sample sizes
of each.
• However, taking a Bayesian approach, we just care about the data
that we see. Thus, the variance calculation will only come from
the actual data at the randomly chosen lab.

Thus, the question we ask is: should we use the noise level from the lab where the drug was tested, or average over both? Intuitively, we use the
noise level from the lab where it was tested, but in some frequentist
approaches, it is not always so straightforward.

2. Likelihood Principle: The relevant information in any inference


about θ after observing x is contained entirely in the likelihood func-
tion. Remember the likelihood function p(x∣θ) for fixed x is viewed as
a function of θ, not x. For example in Bayesian approaches, p(θ∣x) ∝
p(x∣θ)p(θ), so clearly inference about θ is based on the likelihood.
Another approach based on the likelihood principle is Fisher’s max-
imum likelihood estimation. This approach can also be justified by
asymptotics. In case this principle seems too indisputable, here is an
example using hypothesis testing in coin tossing that shows how some
reasonable procedures may not follow it.
Example 2.2: Let θ be the probability of a particular coin landing
on heads and let
H0 ∶ θ = 1/2, H1 ∶ θ > 1/2.
Now suppose we observe the following sequence of flips:
H,H,T,H,T,H,H,H,H,H,H,T (9 heads, 3 tails)
Then the likelihood is simply
p(x∣θ) ∝ θ^9 (1 − θ)^3.
Many non-Bayesian analyses would pick an experimental design that
is reflected in p(x∣θ), for example binomial (toss a coin 12 times) or
negative binomial (toss a coin until you get 3 tails). However the two
lead to different probabilities over the sample space X. This results in
different assumed tail probabilities and p-values.

We repeat a few definitions from mathematical statistics that will be


used in the course at one point or another. For examples of sufficient
statistics or distributions that fall in exponential families, we refer the
reader to Theory of Point Estimation (TPE), Chapter 1.
Definition 2.2: Sufficiency
Recall that for a data set x = (x1 , . . . , xn ), a sufficient statistic T (x)
is a function such that the likelihood p(x∣θ) = p(x1 , . . . , xn ∣θ) depends
on x1 , . . . , xn only through T (x). Then the likelihood p(x∣θ) may be
written as p(x∣θ) = g(θ, T (x)) h(x) for some functions g and h.
Definition 2.3: Exponential Families
A family {Pθ } of distributions is said to form an s-dimensional expo-
nential family if the distributions of Pθ have densities of the form
pθ(x) = exp[ ∑_{i=1}^{s} ηi(θ) Ti(x) − B(θ) ] h(x).

3. Sufficiency Principle: The sufficiency principle states that if two


different observations x and y have the same sufficient statistic T (x) =
T (y), then inference based on x and y should be the same. The suffi-
ciency principle is the least controversial principle.

Theorem 2.1. The posterior distribution p(θ∣y) depends on the data only through the sufficient statistic T(y).

Proof. By the factorization theorem, if T (y) is sufficient,

f (y∣θ) = g(θ, T (y)) h(y).

Then we know the posterior can be written

p(θ∣y) = f(y∣θ)π(θ) / ∫ f(y∣θ)π(θ) dθ
       = g(θ, T(y)) h(y) π(θ) / ∫ g(θ, T(y)) h(y) π(θ) dθ
       = g(θ, T(y)) π(θ) / ∫ g(θ, T(y)) π(θ) dθ
       ∝ g(θ, T(y)) π(θ),

which only depends on y through T (y).

Example 2.3: Sufficiency


Let y := ∑_i yi. Consider

y1, . . . , yn ∣ θ ∼ iid Bin(1, θ).

Then p(y∣θ) = (n choose y) θ^y (1 − θ)^{n−y}. Let p(θ) represent a general prior. Then

p(θ∣y) ∝ θ^y (1 − θ)^{n−y} p(θ),

which only depends on the data through the sufficient statistic y.

Theorem 2.2. (Birnbaum).


Sufficiency Principle + Conditionality Principle = Likelihood Princi-
ple.

So if we assume the sufficiency principle, then the conditionality and


likelihood principles are equivalent. The Bayesian approach satisfies
all of these principles.

2.4 Bayesian Decision Theory

Earlier we discussed the frequentist approach to statistical decision theory.


Now we discuss the Bayesian approach in which we condition on x and inte-
grate over Θ (remember it was the other way around in the frequentist ap-
proach). The posterior risk is defined as ρ(π, δ(x)) = ∫Θ L(θ, δ(x))π(θ∣x) dθ.

The Bayes action δ ∗ (x) for any fixed x is the decision δ(x) that minimizes
the posterior risk. If the problem at hand is to estimate some unknown
parameter θ, then we typically call this the Bayes estimator instead.

Theorem 2.3. Under squared error loss, the decision δ(x) that minimizes
the posterior risk is the posterior mean.

Proof. Suppose that L(θ, δ(x)) = (θ − δ(x))^2. Now note that

ρ(π, δ(x)) = ∫ (θ − δ(x))^2 π(θ∣x) dθ
           = ∫ θ^2 π(θ∣x) dθ + δ(x)^2 ∫ π(θ∣x) dθ − 2δ(x) ∫ θ π(θ∣x) dθ.

Then

∂ρ(π, δ(x))/∂δ(x) = 2δ(x) − 2 ∫ θ π(θ∣x) dθ = 0 ⟺ δ(x) = E[θ∣x],

and ∂^2 ρ(π, δ(x))/∂δ(x)^2 = 2 > 0, so δ(x) = E[θ∣x] is the minimizer.

Recall that decision theory provides a quantification of what it means for


a procedure to be ‘good.’ This quantification comes from the loss function
L(θ, δ(x)). Frequentists and Bayesians use the loss function differently.

◯ Frequentist Interpretation: Risk

In frequentist usage, the parameter θ is fixed, and thus it is the sample space
over which averages are taken. Letting R(θ, δ(x)) denote the frequentist
risk, recall that R(θ, δ(x)) = Eθ [L(θ, δ(x))]. This expectation is taken over
the data X, with the parameter θ held fixed. Note that the data, X, is
capitalized, emphasizing that it is a random variable.

Example 2.4: (Squared error loss). Let the loss function be squared error.
In this case, the risk is

R(θ, δ(x)) = Eθ[(θ − δ(x))^2]
           = Eθ[{θ − Eθ[δ(x)] + Eθ[δ(x)] − δ(x)}^2]
           = {θ − Eθ[δ(x)]}^2 + Eθ[{δ(x) − Eθ[δ(x)]}^2]
           = Bias^2 + Variance,

where the cross term vanishes because Eθ[δ(x) − Eθ[δ(x)]] = 0.

This result allows a frequentist to analyze the variance and bias of an estima-
tor separately, and can be used to motivate frequentist ideas, e.g. minimum
variance unbiased estimators (MVUEs).

◯ Bayesian Interpretation: Posterior Risk

Bayesians do not find the previous idea compelling because it doesn’t adhere
to the conditionality principle since it averages over all possible data sets.
Hence, in a Bayesian framework, we define the posterior risk ρ(π, δ(x)) based
on the data x and a prior π, where

ρ(π, δ(x)) = ∫Θ L(θ, δ(x))π(θ∣x) dθ.

Note that the prior enters the equation when calculating the posterior den-
sity. Using the posterior risk, we can define a bit of jargon. Recall that the
Bayes action δ ∗ (x) is the value of δ(x) that minimizes the posterior risk.
We already showed that the Bayes action under squared error loss is the
posterior mean.

◯ Hybrid Ideas

Despite the tensions between frequentists and Bayesians, they occasionally


steal ideas from each other.

Definition 2.4: The Bayes risk is denoted by r(π, δ(x)). While the Bayes
risk is a frequentist concept since it averages over X, the expression can also be interpreted differently. Consider

r(π, δ(x)) = ∫ ∫ L(θ, δ(x)) f(x∣θ) π(θ) dx dθ
           = ∫ ∫ L(θ, δ(x)) π(θ∣x) π(x) dx dθ
           = ∫ ρ(π, δ(x)) π(x) dx.

Note that the last equation is the posterior risk averaged over the marginal
distribution of x. Another connection with frequentist theory includes that
finding a Bayes rule against the “worst possible prior” gives you a minimax
estimator. While a Bayesian might not find this particularly interesting, it
is useful from a frequentist perspective because it provides a way to compute
the minimax estimator.

We will come back to decision theory in a later chapter on advanced decision theory, where we will cover topics such as minimaxity, admissibility, and James-Stein estimators.

2.5 Bayesian Parametric Models

For now we will consider parametric models, which means that the param-
eter θ is a fixed-dimensional vector of numbers. Let x ∈ X be the observed
data and θ ∈ Θ be the parameter. Note that X may be called the sample
space, while Θ may be called the parameter space. Now we define some
notation that we will reuse throughout the course:

p(x∣θ)  likelihood
π(θ)  prior
p(x) = ∫ p(x∣θ)π(θ) dθ  marginal likelihood
p(θ∣x) = p(x∣θ)π(θ)/p(x)  posterior probability
p(xnew∣x) = ∫ p(xnew∣θ)π(θ∣x) dθ  predictive probability

Most of Bayesian analysis is calculating these quantities in some way or


another. Note that the definition of the predictive probability assumes ex-
changeability, but it can easily be modified if the data are not exchangeable.

As a helpful hint, note that for the posterior distribution,

p(θ∣x) = p(x∣θ)π(θ)/p(x) ∝ p(x∣θ)π(θ),

and oftentimes it’s best to not calculate the normalizing constant p(x) be-
cause you can recognize the form of p(x∣θ)π(θ) as a probability distribution
you know. So don’t normalize until the end!

Remark: Note that the prior distribution that we take on θ doesn’t have
to be a proper distribution, however, the posterior is always required to be
proper for valid inference. By proper, I mean that the distribution must
integrate to 1.

Two questions we still need to address are

• How do we choose priors?

• How do we compute the aforementioned quantities, such as posterior


distributions?

We’ll focus on choosing priors for now.

2.6 How to Choose Priors

We will discuss objective and subjective priors. Objective priors may be


obtained from the likelihood or through some type of invariance argument.
Subjective priors are typically arrived at by a process involving interviews
with domain experts and thinking really hard; in fact, there is arguably more
philosophy and psychology in the study of subjective priors than mathemat-
ics. We start with conjugate priors. The main justification for the use of
conjugate priors is that they are computationally convenient and they have
asymptotically desirable properties.

Choosing prior probabilities: Subjective or Objective.

Subjective
A prior probability could be subjective based on the information a person
might have due to past experience, scientific considerations, or simple com-


mon sense. For example, suppose we wish to estimate the probability that a
randomly selected woman has breast cancer. A simple prior could be formu-
lated based on the national or worldwide incidence of breast cancer. A more
sophisticated approach might take into account the woman’s age, ethnicity,
and family history. Neither approach could necessarily be classified as right
or wrong—again, it’s subjective.

As another example, say a doctor administers a treatment to patients and


finds 48 out of 50 are cured. If the same doctor later wishes to investigate the
cure probability of a similar but slightly different treatment, he might expect
that its cure probability will also be around 48/50 = 0.96. However a different
doctor may have only had 8/10 patients cured by the first treatment and
might therefore specify a prior suggesting a cure rate of around 0.8 for the
new treatment. For convenience, subjective priors are often chosen
to take the form of common distributions, such as the normal, gamma, or
beta distribution.

Objective
An objective prior (also called default, vague, noninformative) can also be
used in a given situation even in the absence of enough information. Exam-
ples of objective priors are flat priors such as Laplace’s, Haldane’s, Jeffreys’,
and Bernardo's reference priors. These priors will be discussed later.

2.7 Hierarchical Bayesian Models

In a hierarchical Bayesian model, rather than specifying the prior distribu-


tion as a single function, we specify it as a hierarchy. Thus, on the unknown
parameter of interest, say θ, we put a prior. On any other unknown hyper-
parameters of the model, we also specify priors for these. We
write

X∣θ ∼ f (x∣θ)
Θ∣γ ∼ π(θ∣γ)
Γ ∼ φ(γ),
where we assume that φ(γ) is known and does not depend on any other unknown hyperparameters (which, as we have already said, is what the parameters of the prior are often called). Note that we can continue this hierarchical mod-
eling and add more stages to the model, however note that doing so adds
more complexity to the model (and possibly as we will see may result in a
posterior that we cannot compute without the aid of numerical integration
or MCMC, which we will cover in detail in a later chapter).

Definition 2.5: (Conjugate Distributions). Let F be the class of sampling


distributions p(y∣θ). Then let P denote the class of prior distributions on θ.
Then P is said to be conjugate to F if for every p(θ) ∈ P and p(y∣θ) ∈ F,
p(y∣θ) ∈ P. Simple definition: A family of priors such that, upon being
multiplied by the likelihood, yields a posterior in the same family.

Example 2.5: (Beta-Binomial) If X∣θ is distributed as binomial(n, θ), then


a conjugate prior is the beta family of distributions, where we can show that
the posterior is

π(θ∣x) ∝ p(x∣θ)p(θ)
       ∝ (n choose x) θ^x (1 − θ)^{n−x} [Γ(a + b)/(Γ(a)Γ(b))] θ^{a−1} (1 − θ)^{b−1}
       ∝ θ^x (1 − θ)^{n−x} θ^{a−1} (1 − θ)^{b−1}
       ∝ θ^{x+a−1} (1 − θ)^{n−x+b−1},

which implies θ∣x ∼ Beta(x + a, n − x + b).

Let’s apply this to a real example! We’re interested in the proportion of


people that approve of President Obama in PA.

• We take a random sample of 10 people in PA and find that 6 approve


of President Obama.

• The national approval rating (Zogby poll) of President Obama in mid-


December was 45%. We'll assume that in PA his approval rating is
approximately 50%.

• Based on this prior information, we’ll use a Beta prior for θ and we’ll
choose a and b. (Won’t get into this here).

• We can plot the prior and likelihood distributions in R and then see
how the two mix to form the posterior distribution.
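Below is a sketch of that R code. The prior parameters a = b = 5 are an illustrative choice (the notes do not fix a and b here); they center the prior near 0.5.

# Beta-Binomial: prior, scaled likelihood, and posterior for the PA example.
a <- 5; b <- 5          # assumed Beta prior parameters
n <- 10; x <- 6         # 6 of 10 sampled people approve
theta <- seq(0, 1, by = 0.001)
prior <- dbeta(theta, a, b)
# The likelihood theta^x (1 - theta)^(n - x), scaled to integrate to 1,
# is the Beta(x + 1, n - x + 1) density.
likelihood <- dbeta(theta, x + 1, n - x + 1)
posterior <- dbeta(theta, x + a, n - x + b)   # Beta(x + a, n - x + b)
plot(theta, posterior, type = "l", xlab = expression(theta), ylab = "Density")
lines(theta, prior, lty = 2)
lines(theta, likelihood, lty = 3)
legend("topleft", c("Posterior", "Prior", "Likelihood"), lty = 1:3)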
(Three plots, each showing Density against θ on (0, 1): first the prior alone; then the prior and likelihood; then the prior, likelihood, and posterior.)

Example 2.6: (Normal-Uniform Prior)


X1, . . . , Xn ∣ θ ∼ iid Normal(θ, σ^2), σ^2 known
θ ∼ Uniform(−∞, ∞),

where θ ∼ Uniform(−∞, ∞) means that p(θ) ∝ 1.

Calculate the posterior distribution of θ given the data:

p(θ∣x) ∝ ∏_{i=1}^{n} (2πσ^2)^{−1/2} exp{ −(xi − θ)^2/(2σ^2) }
       = (2πσ^2)^{−n/2} exp{ −∑_{i=1}^{n} (xi − θ)^2/(2σ^2) }
       ∝ exp{ −∑_i (xi − θ)^2/(2σ^2) }.
Note that ∑_i (xi − θ)^2 = ∑_i (xi − x̄)^2 + n(x̄ − θ)^2. Then

p(θ∣x) ∝ exp{ −∑_i (xi − x̄)^2/(2σ^2) } exp{ −n(x̄ − θ)^2/(2σ^2) }
       ∝ exp{ −n(x̄ − θ)^2/(2σ^2) }
       = exp{ −n(θ − x̄)^2/(2σ^2) }.

Thus,

θ∣x1, . . . , xn ∼ Normal(x̄, σ^2/n).

Example 2.7: Normal-Normal


X1, . . . , Xn ∣ θ ∼ iid N(θ, σ^2)
θ ∼ N(µ, τ^2),

where σ^2 is known. Calculate the distribution of θ∣x1, . . . , xn:

p(θ∣x1, . . . , xn) ∝ ∏_{i=1}^{n} (2πσ^2)^{−1/2} exp{ −(xi − θ)^2/(2σ^2) } × (2πτ^2)^{−1/2} exp{ −(θ − µ)^2/(2τ^2) }
                  ∝ exp{ −∑_i (xi − θ)^2/(2σ^2) } exp{ −(θ − µ)^2/(2τ^2) }.

Consider

∑_i (xi − θ)^2 = ∑_i (xi − x̄ + x̄ − θ)^2 = ∑_i (xi − x̄)^2 + n(x̄ − θ)^2.
Then

p(θ∣x1, . . . , xn) ∝ exp{ −∑_i (xi − x̄)^2/(2σ^2) } × exp{ −n(x̄ − θ)^2/(2σ^2) } × exp{ −(θ − µ)^2/(2τ^2) }
    ∝ exp{ −n(x̄ − θ)^2/(2σ^2) } exp{ −(θ − µ)^2/(2τ^2) }
    = exp{ −(1/2)[ (n/σ^2)(x̄^2 − 2x̄θ + θ^2) + (1/τ^2)(θ^2 − 2θµ + µ^2) ] }
    = exp{ −(1/2)[ (n/σ^2 + 1/τ^2)θ^2 − 2θ(nx̄/σ^2 + µ/τ^2) + nx̄^2/σ^2 + µ^2/τ^2 ] }
    ∝ exp{ −(1/2)(n/σ^2 + 1/τ^2)[ θ^2 − 2θ(nx̄/σ^2 + µ/τ^2)/(n/σ^2 + 1/τ^2) ] }
    ∝ exp{ −(1/2)(n/σ^2 + 1/τ^2)[ θ − (nx̄/σ^2 + µ/τ^2)/(n/σ^2 + 1/τ^2) ]^2 }.

Recall what it means to complete the square as we did above.¹ Thus,

θ∣x1, . . . , xn ∼ N( (nx̄/σ^2 + µ/τ^2)/(n/σ^2 + 1/τ^2), 1/(n/σ^2 + 1/τ^2) )
                = N( (nx̄τ^2 + µσ^2)/(nτ^2 + σ^2), σ^2τ^2/(nτ^2 + σ^2) ).

Definition 2.6: The reciprocal of the variance is referred to as the precision. That is,

Precision = 1/Variance.
Theorem 2.4. Let δn be a sequence of estimators of g(θ) with mean squared error E(δn − g(θ))^2.

(i) If E[δn − g(θ)]^2 → 0 then δn is consistent for g(θ).

(ii) Equivalently, δn is consistent if the bias bn(θ) → 0 and Var(δn) → 0 for all θ.
¹Recall from algebra that (x − b)^2 = x^2 − 2bx + b^2. We want to complete something that resembles x^2 − 2bx = x^2 − 2bx + (2b/2)^2 − (2b/2)^2 = (x − b)^2 − b^2.
(iii) In particular (and most useful), δn is consistent if it is unbiased for each n and if Var(δn) → 0 for all θ.

We omit the proof since it requires Chebyshev's Inequality along with a bit
of probability theory. See Problem 1.8.1 in TPE for the exercise of proving
this.
Example 2.8: (Normal-Normal Revisited) Recall Example 2.7. We write
the posterior mean as E(θ∣x). Let’s write the posterior mean in this example
as
E(θ∣x) = (nx̄/σ^2 + µ/τ^2) / (n/σ^2 + 1/τ^2)
       = (nx̄/σ^2)/(n/σ^2 + 1/τ^2) + (µ/τ^2)/(n/σ^2 + 1/τ^2).

We also write the posterior variance as

V(θ∣x) = 1/(n/σ^2 + 1/τ^2).
We can see that the posterior mean is a weighted average of the sample
mean and the prior mean. The weights are proportional to the reciprocal of
the respective variances (precision). In this case,
1
Posterior Precision =
Posterior Variance
= (n/σ 2 ) + (1/τ 2 )
= Sample Precision + Prior Precision.
The posterior precision is larger than either the sample precision or the prior
precision. Equivalently, the posterior variance, denoted by V (θ∣x), is smaller
than either the sample variance or the prior variance.
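These formulas are easy to check numerically; the following sketch uses invented data and illustrative values of µ, τ^2, and the known σ^2:

# Posterior mean and variance for the Normal-Normal model, computed
# via the precision-weighted average derived above.
x <- c(1.2, 0.8, 1.5, 0.9, 1.1)    # invented data
sigma2 <- 1; mu <- 0; tau2 <- 4    # illustrative known variance and prior
n <- length(x); xbar <- mean(x)
post.prec <- n / sigma2 + 1 / tau2                 # sample + prior precision
post.mean <- (n * xbar / sigma2 + mu / tau2) / post.prec
post.var  <- 1 / post.prec
c(post.mean, post.var)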

What happens as n → ∞?

Divide the posterior mean (numerator and denominator) by n. Now take


n → ∞. Then
E(θ∣x) = [ (1/n)(nx̄/σ^2) + (1/n)(µ/τ^2) ] / [ (1/n)(n/σ^2) + (1/n)(1/τ^2) ] → (x̄/σ^2)/(1/σ^2) = x̄ as n → ∞.

In the case of the posterior variance, divide the numerator and denominator by n. Then

V(θ∣x) = (1/n) / [ (1/n)(n/σ^2) + (1/n)(1/τ^2) ] ≈ σ^2/n → 0 as n → ∞.

Since the posterior mean is unbiased and the posterior variance goes to 0,
the posterior mean is consistent by Theorem 2.4.

Example 2.9:

X∣α, β ∼ Gamma(α, β), α known, β unknown


β ∼ IG(a, b).

Calculate the posterior distribution of β∣x.


p(β∣x) ∝ [1/(Γ(α)β^α)] x^{α−1} e^{−x/β} × [b^a/Γ(a)] β^{−a−1} e^{−b/β}
        ∝ (1/β^α) e^{−x/β} β^{−a−1} e^{−b/β}
        = β^{−α−a−1} e^{−(x+b)/β}.

Notice that this looks like an Inverse Gamma distribution with parameters
α + a and x + b. Thus,
β∣x ∼ IG(α + a, x + b).
Example 2.10: (Bayesian versus frequentist)
Suppose a child is given an IQ test and his score is X. We assume that

X∣θ ∼ Normal(θ, 100)


θ ∼ Normal(100, 225)

From previous calculations, we know that the posterior is


400 + 9x 900
θ∣x ∼ Normal ( , ).
13 13

Here the posterior mean is (400 + 9x)/13. Suppose x = 115. Then the pos-
terior mean becomes 110.4. Contrasting this, we know that the frequentist
estimate is the mle, which is x = 115 in this example.

The posterior variance is 900/13 = 69.23, whereas the variance of the data
is σ 2 = 100.

Now suppose we take the Uniform(−∞, ∞) prior on θ. From an earlier ex-


ample, we found that the posterior is

θ∣x ∼ Normal (115, 100) .

Notice that the posterior mean and mle are both 115 and the posterior
variance and variance of the data are both 100.

When we put little/no prior information on θ, the data washes away most/all
of the prior information (and the results of frequentist and Bayesian estima-
tion are similar or equivalent in this case).
Example 2.11: (Normal Example with Unknown Variance)
Consider
X1, . . . , Xn ∣ θ ∼ iid Normal(θ, σ^2), θ known, σ^2 unknown
p(σ^2) ∝ (σ^2)^{−1}.

Calculate p(σ^2∣x1, . . . , xn).

p(σ^2∣x1, . . . , xn) ∝ (2πσ^2)^{−n/2} exp{ −∑_i (xi − θ)^2/(2σ^2) } (σ^2)^{−1}
                     ∝ (σ^2)^{−n/2−1} exp{ −∑_i (xi − θ)^2/(2σ^2) }.

Recall, if Y ∼ IG(a, b), then f(y) = [b^a/Γ(a)] y^{−a−1} e^{−b/y}. Thus,

σ^2∣x1, . . . , xn ∼ IG( n/2, ∑_{i=1}^{n} (xi − θ)^2 / 2 ).

Example 2.12: (Football Data) Gelman et al. (2003) consider the problem
of estimating an unknown variance using American football scores. The
focus is on the difference d between a game outcome (winning score minus
losing score) and a published point spread.

• We observe d1 , . . . , dn , the observed differences between game outcomes


and point spreads for n = 2240 football games.

• We assume these differences are a random sample from a Normal dis-


tribution with mean 0 and unknown variance σ 2 .

• Our goal is to make inference on the unknown parameter σ 2 , which


represents the variability in the game outcomes and point spreads.

We can refer to Example 2.11, since the setup here is the same. Hence the
posterior becomes

σ^2∣d1, . . . , dn ∼ IG( n/2, ∑_i di^2 / 2 ).

The next logical step would be plotting the posterior distribution in R. As far
as I can tell, there is not a built-in function predefined in R for the Inverse
Gamma density. However, someone saw the need for it and built one into the pscl package.

Proceeding below, we try to calculate the posterior using the function
densigamma, which corresponds to the Inverse Gamma density. However,
running this line in the code gives the following error:

Warning message:
In densigamma(sigmas, n/2, sum(d^2)/2) : value out of range in ’gammafn’

What's the problem? Think about what the posterior looks like. Recall that

p(σ^2∣d) = [ (∑_i di^2/2)^{n/2} / Γ(n/2) ] (σ^2)^{−n/2−1} e^{−(∑_i di^2)/(2σ^2)}.

In the calculation R is doing, it’s dividing by Γ(1120), which is a very large


factorial. This is too large for even R to compute, so we’re out of luck here.
So, what can we do to analyze the data?

setwd("~/Desktop/sta4930/football")
data = read.table("football.txt",header=T)
names(data)
attach(data)
score = favorite-underdog
d = score-spread
n = length(d)
hist(d)
install.packages("pscl",repos="http://cran.opensourceresources.org")
library(pscl)
?densigamma
sigmas = seq(10,20,by=0.1)
post = densigamma(sigmas,n/2,sum(d^2)/2)
v = sum(d^2)

We know we can’t use the Inverse Gamma density (because of the function
in R), but we do know a relationship regarding the Inverse Gamma and
Gamma distributions. So, let’s apply this fact.

You may be thinking, we’re going to run into the same problem because we’ll
still be dividing by Γ(1120). This is true, except the Gamma density function
dgamma was built into R by the original writers. The dgamma function is able
to do some internal tricks that let it calculate the gamma density even
though the individual piece Γ(n/2) by itself is too large for R to handle. So,
moving forward, we will apply the following fact that we already learned:
If X ∼ IG(a, b), then 1/X ∼ Gamma(a, 1/b).

Since

σ^2∣d1, . . . , dn ∼ IG( n/2, ∑_i di^2 / 2 ),

we know that

(1/σ^2)∣d1, . . . , dn ∼ Gamma(n/2, 2/v), where v = ∑_i di^2.

Now we can plot this posterior distribution in R, however in terms of making


inference about σ 2 , this isn’t going to be very useful.

In the code below, we plot the posterior of (1/σ^2)∣d. In order to do so, we must create a new sequence of x-values since the mean of our gamma will be at n/v ≈ 0.0053.

xnew = seq(0.004,0.007,.000001)
pdf("football_sigmainv.pdf", width = 5, height = 4.5)
post.d = dgamma(xnew,n/2,scale = 2/v)
plot(xnew,post.d, type= "l", xlab = expression(1/sigma^2), ylab= "density")
dev.off()

As we can see from the plot below, viewing the posterior of (1/σ^2)∣d isn't very useful. We would like to get the parameter in terms of σ^2, so that we could plot the posterior distribution of interest as well as calculate the posterior mean and variance.

To recap, we know

(1/σ^2)∣d1, . . . , dn ∼ Gamma(n/2, 2/v), where v = ∑_i di^2.

Let u = 1/σ^2. We are going to make a transformation of variables now to write the density in terms of σ^2.
Figure 2.3: Posterior distribution p((1/σ^2)∣d1, . . . , dn), plotted as density against 1/σ^2.

Since u = 1/σ^2, this implies σ^2 = 1/u. Then ∣∂u/∂σ^2∣ = 1/σ^4. Now applying the transformation of variables we find that

f(σ^2∣d1, . . . , dn) = [1/(Γ(n/2)(2/v)^{n/2})] (1/σ^2)^{n/2−1} e^{−v/(2σ^2)} (1/σ^4).

Thus, the density of σ^2∣d is the Gamma(n/2, 2/v) density evaluated at 1/σ^2, times the Jacobian factor 1/σ^4.

Now, we know the density of σ^2∣d in a form we can calculate in R.

x.s = seq(150,250,1)
pdf("football_sigma.pdf", height = 5, width = 4.5)
post.s = dgamma(1/x.s,n/2, scale = 2/v)*(1/x.s^2)
plot(x.s,post.s, type="l", xlab = expression(sigma^2), ylab="density")
dev.off()
detach(data)
Figure 2.4: Posterior distribution p(σ^2∣d1, . . . , dn), plotted as density against σ^2.



From the posterior plot in Figure 2.4 we can see that the posterior mean
is around 185. This means that the variability of the actual game result
around the point spread has a standard deviation around 14 points. If you
wanted to actually calculate the posterior mean and variance, you could do
this using a numerical method in R.
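For instance, here is one small sketch of such a numerical method, reusing the grid x.s and the density values post.s computed above (the grid spacing is 1, so normalizing by the sum approximates the integral):

# Grid approximation of the posterior mean and variance of sigma^2.
w <- post.s / sum(post.s)       # normalized grid weights
m <- sum(x.s * w)               # posterior mean of sigma^2
v2 <- sum((x.s - m)^2 * w)      # posterior variance of sigma^2
c(m, v2, sqrt(m))               # mean, variance, and implied sd of about 14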

What’s interesting about this example is that there is a lot more variability
in football games than the average person would most likely think.

• Assume that (1) the standard deviation actually is 14 points, and (2)
game result is normally distributed (which it’s not, exactly, but this is
a reasonable approximation).

• Things with a normal distribution fall two or more standard deviations


from their mean about 5% of the time, so this means that, roughly
speaking, about 5% of football games end up 28 or more points away
from their spread.

Example 2.13:
Y1, . . . , Yn ∣ µ, σ^2 ∼ iid Normal(µ, σ^2),
µ∣σ^2 ∼ Normal(µ0, σ^2/κ0),
σ^2 ∼ IG(ν0/2, σ0^2/2),

where µ0, κ0, ν0, σ0^2 are constants.

Find p(µ, σ^2∣y1, . . . , yn). Notice that

p(µ, σ^2∣y1, . . . , yn) = p(µ, σ^2, y1, . . . , yn)/p(y1, . . . , yn)
                        ∝ p(y1, . . . , yn∣µ, σ^2) p(µ, σ^2)
                        = p(y1, . . . , yn∣µ, σ^2) p(µ∣σ^2) p(σ^2).

Then

p(µ, σ^2∣y1, . . . , yn) ∝ (σ^2)^{−n/2} exp{ −∑_{i=1}^{n} (yi − µ)^2/(2σ^2) } (σ^2)^{−1/2} exp{ −κ0(µ − µ0)^2/(2σ^2) } × (σ^2)^{−ν0/2−1} exp{ −σ0^2/(2σ^2) }.

Consider ∑_i (yi − µ)^2 = ∑_i (yi − ȳ)^2 + n(ȳ − µ)^2.

Then

n(ȳ − µ)^2 + κ0(µ − µ0)^2 = nȳ^2 − 2nȳµ + nµ^2 + κ0µ^2 − 2κ0µµ0 + κ0µ0^2
    = (n + κ0)µ^2 − 2(nȳ + κ0µ0)µ + nȳ^2 + κ0µ0^2
    = (n + κ0)[µ − (nȳ + κ0µ0)/(n + κ0)]^2 − (nȳ + κ0µ0)^2/(n + κ0) + nȳ^2 + κ0µ0^2.
Now consider

nȳ^2 + κ0µ0^2 − (nȳ + κ0µ0)^2/(n + κ0)
    = nȳ^2 + κ0µ0^2 + (−n^2ȳ^2 − 2nκ0µ0ȳ − κ0^2µ0^2)/(n + κ0)
    = (n^2ȳ^2 + nκ0µ0^2 + nκ0ȳ^2 + κ0^2µ0^2 − n^2ȳ^2 − 2nκ0µ0ȳ − κ0^2µ0^2)/(n + κ0)
    = (nκ0µ0^2 + nκ0ȳ^2 − 2nκ0µ0ȳ)/(n + κ0)
    = nκ0(µ0^2 − 2µ0ȳ + ȳ^2)/(n + κ0)
    = nκ0(µ0 − ȳ)^2/(n + κ0).

Putting this all together, we find

p(µ, σ^2∣y1, . . . , yn) ∝ exp{ −n(ȳ − µ)^2/(2σ^2) } exp{ −∑_i (yi − ȳ)^2/(2σ^2) } exp{ −κ0(µ − µ0)^2/(2σ^2) } × (σ^2)^{−n/2−1/2} (σ^2)^{−ν0/2−1} exp{ −σ0^2/(2σ^2) }
    = exp{ −nκ0(µ0 − ȳ)^2/(2σ^2(n + κ0)) } exp{ −∑_i (yi − ȳ)^2/(2σ^2) } exp{ −(n + κ0)[µ − (nȳ + κ0µ0)/(n + κ0)]^2/(2σ^2) } × (σ^2)^{−n/2−1/2} (σ^2)^{−ν0/2−1} exp{ −σ0^2/(2σ^2) }
    = exp{ −[ ∑_i (yi − ȳ)^2 + nκ0(µ0 − ȳ)^2/(n + κ0) + σ0^2 ]/(2σ^2) } (σ^2)^{−(n+ν0)/2−1} × exp{ −(n + κ0)[µ − (nȳ + κ0µ0)/(n + κ0)]^2/(2σ^2) } (σ^2)^{−1/2}.

Since the posterior above factors, we find

µ∣σ^2, y ∼ Normal( (nȳ + κ0µ0)/(n + κ0), σ^2/(n + κ0) ),

σ^2∣y ∼ IG( (n + ν0)/2, (1/2)[ ∑_i (yi − ȳ)^2 + nκ0(µ0 − ȳ)^2/(n + κ0) + σ0^2 ] ).
Example 2.14: Suppose we calculate E[θ∣y] where y = x(n), the sample maximum. Let

Xi ∣ θ ∼ Uniform(0, θ)
θ ∼ IG(a, 1/b),

so that π(θ) = θ^{−a−1} e^{−1/(θb)}/(Γ(a)b^a). Show that

E[θ∣x] = [1/(b(n + a − 1))] · P(χ²_{2(n+a−1)} < 2/(by)) / P(χ²_{2(n+a)} < 2/(by)).

Proof. Recall that the posterior depends on the data only through the sufficient statistic y. Consider that P(Y ≤ y) = P(X1 ≤ y)^n = (y/θ)^n, which implies f_Y(y) = (n/θ)(y/θ)^{n−1} = n y^{n−1}/θ^n. Then

E[θ∣x] = ∫ θ f(y∣θ)π(θ) dθ / ∫ f(y∣θ)π(θ) dθ
       = [ ∫_y^∞ θ (n y^{n−1}/θ^n) θ^{−a−1} e^{−1/(θb)}/(Γ(a)b^a) dθ ] / [ ∫_y^∞ (n y^{n−1}/θ^n) θ^{−a−1} e^{−1/(θb)}/(Γ(a)b^a) dθ ]
       = ∫_y^∞ θ^{−n−a} e^{−1/(θb)} dθ / ∫_y^∞ θ^{−n−a−1} e^{−1/(θb)} dθ.

Let θ = 2/(xb), so that dθ = −2/(bx^2) dx. Recall that a Gamma(v/2, 2) distribution is a χ²_v. Then

E[θ∣x] = [ ∫_0^{2/(by)} (2/(xb))^{−n−a} e^{−x/2} (2/(bx^2)) dx ] / [ ∫_0^{2/(by)} (2/(xb))^{−n−a−1} e^{−x/2} (2/(bx^2)) dx ]
       = [ b^{n+a−1} Γ(n + a − 1) P(χ²_{2(n+a−1)} < 2/(by)) ] / [ b^{n+a} Γ(n + a) P(χ²_{2(n+a)} < 2/(by)) ]
       = [1/(b(n + a − 1))] · P(χ²_{2(n+a−1)} < 2/(by)) / P(χ²_{2(n+a)} < 2/(by)).

2.8 Empirical Bayesian Models

Another generalization of Bayes estimation is called empirical Bayes (EB)


estimation, which most consider to fall outside of the Bayesian paradigm (in
the sense that it's not fully Bayesian). However, it has proved to be a
technique of constructing estimators that perform well under both Bayesian
and frequentist criteria. One reason for this is that EB estimators tend to
be more robust against model misspecification of the prior distribution.

We start again with an HB model, however this time we assume that γ is


unknown and must be estimated. We begin with the Bayes model

Xi∣θ ∼ f(x∣θ), i = 1, . . . , p
Θ∣γ ∼ π(θ∣γ).

We then calculate the marginal distribution of X with density

m(x∣γ) = ∫ ∏_i f(xi∣θ) π(θ∣γ) dθ.

Based on m(x∣γ), we obtain an estimate γ̂(x) of γ. It's most common to


find the estimate using maximum likelihood estimation (MLE), but method
of moments could be used as well (or other methods). We now substitute
γ̂(x) for γ in π(θ∣γ) and determine the estimator that minimizes the empir-
ical posterior loss
∫ L(θ, δ)π(θ∣γ̂(x)) dθ.

Remark: An alternative definition is obtained by substituting γ̂(x) for γ


in the Bayes estimator. (Showing the equivalence is left as a homework exercise, 4.6.1 in TPE).

Example 2.15: Empirical Bayes Binomial


Suppose there are K different groups of patients where each group has n
patients. Each group is given a different treatment for the same illness and in
the kth group, we count Xk , k = 1, . . . , K, which is the number of successful
treatments our of n.

Since the groups receive different treatments, we expect different success


rates, however, since we are treating the same illness, these rates should be
related to each other. These considerations suggest the following model:

Xk ∼ Bin(n, pk ),
pk ∼ Beta(a, b),
where the K groups are tied together by the common prior distribution.

It is easy to show that the Bayes estimator of pk under squared error loss is

E(pk∣xk, a, b) = (a + xk)/(a + b + n).

Suppose now that we are told that a, b are unknown and we wish to estimate
them using EB. We first calculate
m(x∣a, b) = ∫_{(0,1)} ⋯ ∫_{(0,1)} ∏_{k=1}^{K} (n choose xk) pk^{xk} (1 − pk)^{n−xk} × [Γ(a + b)/(Γ(a)Γ(b))] pk^{a−1} (1 − pk)^{b−1} dp1 ⋯ dpK
          = ∫_{(0,1)} ⋯ ∫_{(0,1)} ∏_{k=1}^{K} (n choose xk) [Γ(a + b)/(Γ(a)Γ(b))] pk^{xk+a−1} (1 − pk)^{n−xk+b−1} dp1 ⋯ dpK
          = ∏_{k=1}^{K} (n choose xk) [Γ(a + b)Γ(a + xk)Γ(n − xk + b)] / [Γ(a)Γ(b)Γ(a + b + n)],
which is a product of beta-binomials. Although the MLEs of a and b aren’t
expressible in closed form, they can be calculated numerically to construct
the EB estimator

δ̂^{EB}(x) = (â + xk)/(â + b̂ + n).
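The numerical step can be sketched with optim, maximizing the log marginal likelihood over (a, b) on the log scale (the counts xk and the group size n below are invented for illustration):

# Empirical Bayes for the beta-binomial: MLEs of (a, b) from m(x | a, b),
# plugged into the Bayes estimator (a + x_k) / (a + b + n).
xk <- c(7, 9, 6, 8, 10, 5)     # invented success counts, K = 6 groups
n <- 12                        # patients per group
negloglik <- function(par) {
  a <- exp(par[1]); b <- exp(par[2])      # log scale keeps a, b > 0
  -sum(lchoose(n, xk) + lbeta(a + xk, b + n - xk) - lbeta(a, b))
}
fit <- optim(c(0, 0), negloglik)
a.hat <- exp(fit$par[1]); b.hat <- exp(fit$par[2])
(a.hat + xk) / (a.hat + b.hat + n)        # EB estimates of the p_k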

2.9 Posterior Predictive Distributions

We have just gone through many examples illustrating how to calculate


many simple posterior distributions. This is the main goal of a Bayesian
analysis. Another goal might be prediction. That is given some data y
and a new observation ỹ, we may wish to find the conditional distribution
of ỹ given y. This distribution is referred to as the posterior predictive
distribution. That is, our goal is to find p(ỹ∣y).

We’ll derive the posterior predictive distribution for the discrete case (θ is
discrete). It’s the same for the continuous case, with the sums replaced with
integrals.

Consider

p(ỹ∣y) = p(ỹ, y)/p(y)
       = ∫_θ p(ỹ, y, θ) dθ / p(y)
       = ∫_θ p(ỹ∣y, θ) p(y, θ) dθ / p(y)
       = ∫_θ p(ỹ∣y, θ) p(θ∣y) dθ.
In most contexts, if θ is given, then ỹ∣θ is independent of y, i.e., the value of
θ determines the distribution of ỹ, without needing to also know y. When
this is the case, we say that ỹ and y are conditionally independent given θ.
Then the above becomes

p(ỹ∣y) = ∫_θ p(ỹ∣θ) p(θ∣y) dθ.
Theorem 2.5. If θ is discrete and ỹ and y are conditionally independent
given θ, then the posterior predictive distribution is

p(ỹ∣y) = ∑_θ p(ỹ∣θ) p(θ∣y).

If θ is continuous and ỹ and y are conditionally independent given θ, then


the posterior predictive distribution is

p(ỹ∣y) = ∫_θ p(ỹ∣θ) p(θ∣y) dθ.

Theorem 2.6. Suppose p(x) is a pdf that looks like p(x) = cf (x), where c
is a constant and f is a continuous function of x. Since

∫_x p(x) dx = ∫_x c f(x) dx = 1,

then

∫_x f(x) dx = 1/c.

Note: No calculus is needed to compute ∫_x f(x) dx if f(x) looks like a known pdf.

Example 2.16: Human males have one X-chromosome and one Y-chromosome,
whereas females have two X-chromosomes, each chromosome being inherited
from one parent. Hemophilia is a disease that exhibits X-chromosome-linked
recessive inheritance, meaning that a male who inherits the gene that causes
the disease on the X-chromosome is affected, whereas a female carrying the
gene on only one of her X-chromosomes is not affected. The disease is gener-
ally fatal for women who inherit two such genes, and this is very rare, since
the frequency of occurrence of the gene is very low in human populations.

Consider a woman who has an affected brother (xY), which implies that
her mother must be a carrier of the hemophilia gene (xX). We are also told
that her father is not affected (XY), thus the woman herself has a fifty-fifty
chance of having the gene.

Let θ denote the state of the woman. It can take two values: the woman is
a carrier (θ = 1) or not (θ = 0). Based on this, the prior can be written as

P (θ = 1) = P (θ = 0) = 1/2.

Suppose the woman has a son who does not have hemophilia (S1 = 0). Now
suppose the woman has another son. Calculate the probability that this
second son also will not have hemophilia (S2 = 0), given that the first son
does not have hemophilia. Assume son one and son two are conditionally
independent given θ.

Solution:

p(S2 = 0∣S1 = 0) = ∑_θ p(S2 = 0∣θ) p(θ∣S1 = 0).
2.9 Posterior Predictive Distributions 42

First compute

p(θ∣S1 = 0) = p(S1 = 0∣θ) p(θ) / [p(S1 = 0∣θ = 0) p(θ = 0) + p(S1 = 0∣θ = 1) p(θ = 1)]

            = (1)(1/2) / [(1)(1/2) + (1/2)(1/2)] = 2/3    if θ = 0,
            = (1/2)(1/2) / [(1)(1/2) + (1/2)(1/2)] = 1/3   if θ = 1.
Then

p(S2 = 0∣S1 = 0) = p(S2 = 0∣θ = 0)p(θ = 0∣S1 = 0) + p(S2 = 0∣θ = 1)p(θ = 1∣S1 = 0)
= (1)(2/3) + (1/2)(1/3) = 5/6.
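A quick R check of this arithmetic, with the values taken straight from the example:

prior   = c(1/2, 1/2)                        # P(theta = 0), P(theta = 1)
like.S1 = c(1, 1/2)                          # P(S1 = 0 | theta = 0), P(S1 = 0 | theta = 1)
post    = prior*like.S1/sum(prior*like.S1)   # posterior: (2/3, 1/3)
sum(c(1, 1/2)*post)                          # P(S2 = 0 | S1 = 0) = 5/6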

Negative Binomial Distribution


Before doing the next example, we will introduce the Negative Binomial
distribution. The binomial distribution counts the numbers of successes in
a fixed number of iid Bernoulli trials. Recall, a Bernoulli trial has a fixed
success probability p.

Suppose instead that we count the number of Bernoulli trials required to


get a fixed number of successes. This formulation leads to the Negative
Binomial distribution.

In a sequence of independent Bernoulli(p) trials, let X denote the trial at


which the rth success occurs, where r is a fixed integer.

Then

f(x) = (x−1 choose r−1) p^r (1−p)^{x−r},   x = r, r+1, . . . ,
and we say X ∼ Negative Binom(r, p).

There is another useful formulation of the Negative Binomial distribution. In many cases, it is defined as Y = the number of failures before the rth success. This formulation is statistically equivalent to the one given above in terms of X = the trial at which the rth success occurs, since Y = X − r. Then

f(y) = (r+y−1 choose y) p^r (1−p)^y,   y = 0, 1, 2, . . . ,

and we say Y ∼ Negative Binom(r, p).

When we refer to the Negative Binomial distribution in this class, we will refer to the second formulation unless we indicate otherwise.
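Conveniently, R’s dnbinom uses this same failures-before-the-rth-success formulation, with prob being the per-trial success probability p. A quick check of the pmf formula against it:

r = 3; p = 0.4; y = 0:5
pmf = choose(r + y - 1, y) * p^r * (1 - p)^y
all.equal(pmf, dnbinom(y, size = r, prob = p))   # TRUE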

Example 2.17: (Poisson-Gamma)

X∣λ ∼ Poisson(λ)
λ ∼ Gamma(a, b)

Assume that X̃∣λ ∼ Poisson(λ) is independent of X. Assume we have a new


observation x̃. Find the posterior predictive distribution, p(x̃∣x). Assume
that a is an integer.

Solution:

First, we must find p(λ∣x).

Recall

p(λ∣x) ∝ p(x∣λ) p(λ)
       ∝ e^{−λ} λ^x λ^{a−1} e^{−λ/b}
       = λ^{x+a−1} e^{−λ(1+1/b)}.

Thus, λ∣x ∼ Gamma(x + a, 1/(1 + 1/b)), i.e., λ∣x ∼ Gamma(x + a, b/(b+1)) (scale parametrization).

It then follows that

p(x̃∣x) = ∫_λ p(x̃∣λ) p(λ∣x) dλ

        = ∫_λ [e^{−λ} λ^{x̃} / x̃!] · [1 / (Γ(x+a) (b/(b+1))^{x+a})] λ^{x+a−1} e^{−λ(b+1)/b} dλ

        = [1 / (x̃! Γ(x+a) (b/(b+1))^{x+a})] ∫_λ λ^{x̃+x+a−1} e^{−λ(2b+1)/b} dλ

        = Γ(x̃+x+a) (b/(2b+1))^{x̃+x+a} / [x̃! Γ(x+a) (b/(b+1))^{x+a}]

        = [Γ(x̃+x+a) / (x̃! Γ(x+a))] · [b^{x̃+x+a} (b+1)^{x+a}] / [b^{x+a} (2b+1)^{x̃+x+a}]

        = [(x̃+x+a−1)! / ((x+a−1)! x̃!)] · b^{x̃} (b+1)^{x+a} / (2b+1)^{x̃+x+a}

        = (x̃+x+a−1 choose x̃) (b/(2b+1))^{x̃} ((b+1)/(2b+1))^{x+a}.
Let p = b/(2b+1), which implies 1 − p = (b+1)/(2b+1). Then

p(x̃∣x) = (x̃+x+a−1 choose x̃) p^{x̃} (1−p)^{x+a}.

Thus,

x̃∣x ∼ Negative Binom(x + a, b/(2b+1)),

where the second parameter is the p that appears as p^{x̃} above (the convention used by Wikipedia, not the one defined earlier in these notes). Note for later: in R’s dnbinom, the prob argument is the success probability, which here is 1 − p = (b+1)/(2b+1).
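A Monte Carlo sanity check of this derivation, using made-up values of x, a, b: draw λ from its Gamma posterior, draw x̃ from Poisson(λ), and compare the simulated predictive probabilities with dnbinom.

set.seed(42)
x = 4; a = 2; b = 3
lam  = rgamma(1e5, shape = x + a, scale = b/(b + 1))   # lambda | x
xnew = rpois(1e5, lam)                                 # xtilde | lambda
mean(xnew == 2)                                        # simulated P(xtilde = 2 | x)
dnbinom(2, size = x + a, prob = (b + 1)/(2*b + 1))     # derived predictive pmf at 2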

Example 2.18: Suppose that X is the number of pregnant women arriving


at a particular hospital to deliver their babies during a given month. The
discrete count nature of the data plus its natural interpretation as an arrival
rate suggest modeling it with a Poisson likelihood.

To use a Bayesian analysis, we require a prior distribution for λ having support on the positive real line. A convenient choice is given by the Gamma distribution, since it’s conjugate for the Poisson likelihood.

The model is given by

X∣λ ∼ Poisson(λ)
λ ∼ Gamma(a, b).

We are also told 42 moms are observed arriving at the particular hospital
during December 2007. Using prior study information given, we are told
a = 5 and b = 6. (We found a, b by working backwards from a prior mean
of 30 and prior variance of 180).

We would like to find several things in this example:

1. Plot the likelihood, prior, and posterior distributions as functions of λ


in R.

2. Plot the posterior predictive distribution where the number of preg-


nant women arriving falls between [0,100], integer valued.

3. Find the posterior predictive probability that the number of pregnant


women arrive is between 40 and 45 (inclusive).

Solution: The first thing we need to know to do this problem are p(λ∣x) and
p(x̃∣x). We found these in Example 2.17. So,

λ∣x ∼ Gamma(x + a, b/(b+1)),

and

x̃∣x ∼ Negative Binom(x + a, b/(2b+1)).

Next, we can move right into R for our analysis.

setwd("~/Desktop/sta4930/ch3")
lam = seq(0,100, length=500)
x = 42
a = 5
b = 6
like = dgamma(lam,x+1,scale=1)
prior = dgamma(lam,5,scale=6)
post = dgamma(lam,x+a,scale=b/(b+1))
pdf("preg.pdf", width = 5, height = 4.5)
plot(lam, post, xlab = expression(lambda), ylab= "Density", lty=2, lwd=3, type="l")
lines(lam,like, lty=1,lwd=3)
lines(lam,prior, lty=3,lwd=3)
legend(70,.06,c("Prior", "Likelihood","Posterior"), lty = c(2,1,3),
lwd=c(3,3,3))
dev.off()

## posterior predictive distribution
xnew = seq(0,100) ## will all be ints
## R's prob argument is the success probability (b+1)/(2b+1)
post_pred_values = dnbinom(xnew, x+a, (b+1)/(2*b+1))
plot(xnew, post_pred_values, type="h", xlab = "x",
     ylab = "Posterior Predictive Distribution")

## posterior predictive probability that the number of pregnant
## women arriving is between 40 and 45 (inclusive)
(ans = sum(post_pred_values[41:46])) ## recall we included 0

In the first part of the code, we plot the prior, likelihood, and posterior. This should be self-explanatory since we have already done a similar example.

When we find our posterior predictive distribution, we must create a se-


quence of integers from 0 to 100 (inclusive) using the seq command. Then
we find the posterior predictive values using the function dnbinom. Then
we simply plot the sequence of xnew on the x-axis and the corresponding
posterior predictive values on the y-axis. We set type="h" so that our plot
will appear somewhat like a smooth histogram.

Finally, to calculate the posterior predictive probability that the number of pregnant women who arrive is between 40 and 45, we simply add up the posterior predictive probabilities corresponding to these values. This gives a posterior predictive probability of roughly 0.25 that the number of pregnant women who arrive is between 40 and 45.
Chapter 3

Being Objective

No, it does not make sense for me to be an ‘Objective Bayesian’ !


—Stephen E. Fienberg

Thus far in this course, we have mostly considered informative or subjective


priors. Ideally, we want to choose a prior reflecting our beliefs about the un-
known parameter of interest. This is a subjective choice. All Bayesians agree
that wherever prior information is available, one should try to incorporate
a prior reflecting this information as much as possible. We have mentioned
how incorporation of a prior expert opinion would strengthen purely data-
based analysis in real-life decision problems. Using prior information can
also be useful in problems of statistical inference when your sample size is
small or you have a high- or infinite-dimensional parameter space.

However, in dealing with real-life problems you may run into problems such
as

• not having past historical data,

• not having an expert opinion to base your prior knowledge on (perhaps


your research is cutting-edge and new), or

• as your model becomes more complicated, it becomes hard to know


what priors to put on each unknown parameter.

The problems we have dealt with all semester have been very simple in na-
ture. We have only had one parameter to estimate (except for one example).

47
48

Think about a more complex problem such as the following (we looked at
this problem in Chapter 1):

X∣θ ∼ N (θ, σ 2 )
θ∣σ 2 ∼ N (µ, τ 2 )
σ 2 ∼ IG(a, b)

where now θ and σ² are both unknown and we must find the posterior distributions of θ∣X, σ² and of σ²∣X. For this slightly more complex problem, it is much harder to think about what values µ, τ², a, b should take for a particular problem. What should we do in these types of situations?

Often no reliable prior information concerning θ exists, or inference based


completely on the data is desired. It might appear that inference in such
settings would be impossible, but reaching this conclusion is too hasty.

Suppose we could find a distribution p(θ) that contained no or little infor-


mation about θ in the sense that it didn’t favor one value of θ over another
(provided this is possible). Then it would be natural to refer to such a dis-
tribution as a noninformative prior. We could also argue that all or most of
the information contained in the posterior distribution, p(θ∣x), came from
the data. Thus, all resulting inferences were objective and not subjective.
Definition 3.1: Informative/subjective priors represent our prior beliefs
about parameter values before collecting any data. For example, in reality,
if statisticians are unsure about specifying the prior, they will turn to the
experts in the field or experimenters to look at past data to help fix the
prior.
Example 3.1: (Pregnant Mothers) Suppose that X is the number of preg-
nant mothers arriving at a hospital to deliver their babies during a given
month. The discrete count nature of the data as well as its natural inter-
pretation leads to adopting a Poisson likelihood,
p(x∣θ) = e^{−θ} θ^x / x!,   x ∈ {0, 1, 2, . . .},   θ > 0.
A convenient choice for the prior distribution here is a Gamma(a, b) since
it is conjugate for the Poisson likelihood. To illustrate the example further,
suppose that 42 moms deliver babies during the month of December. Sup-
pose from past data at this hospital, we assume a prior of Gamma(5, 6).
From this, we can easily calculate the posterior distribution, posterior mean
and variance, and do various calculations of interest in R.
49

Definition 3.2: Noninformative/objective priors contain little or no infor-


mation about θ in the sense that they do not favor one value of θ over
another. Therefore, when we calculate the posterior distribution, most if
not all of the inference will arise from the likelihood. Inferences in this case
are objective and not subjective. Let’s look at the following example to see
why we might consider such priors.

Example 3.2: (Pregnant Mothers Continued) Recall Example 3.1. As we noted earlier, it would be natural to take the prior on θ as Gamma(a, b), since it is the conjugate prior for the Poisson likelihood. However, suppose that for this data set we do not have any information on the number of pregnant mothers arriving at the hospital, so there is no basis for using a Gamma prior or any other informative prior. In this situation, we could take some noninformative prior.

Comment: Since many of the objective priors are improper, we must check that the posterior is proper.

Theorem 3.1. Propriety of the Posterior

• If the prior is proper, then the posterior will always be proper.

• If the prior is improper, you must check that the posterior is proper.

◯ Meaning Of Flat

What does a “flat prior” really mean? People really abuse the word flat and
interchange it for noninformative. Let’s talk about what people really mean
when they use the term “flat,” since it can have different meanings.

Example 3.3: Often statisticians will refer to a prior as being flat, when
a plot of its density actually looks flat, i.e., uniform. An example of this
would be taking such a prior to be

θ ∼ Unif(0, 1).

We can plot the density of this prior to see that the density is flat.

What happens, though, if we consider the transformation to 1/θ? Is the induced prior still flat?
Figure 3.1: Unif(0,1) prior — a flat density on (0, 1).
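A small R sketch answering this question by simulation: the implied prior on φ = 1/θ has density 1/φ² on (1, ∞), which is far from flat.

set.seed(1)
theta = runif(1e5)                      # draws from the flat prior
phi   = 1/theta                         # transformed draws
hist(phi[phi < 10], breaks = 100, freq = FALSE,
     main = "Implied prior on 1/theta", xlab = "phi")
curve(1/(0.9*x^2), 1, 10, add = TRUE, lwd = 2)  # 1/phi^2, rescaled for the truncation at 10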

Example 3.4: Now suppose we consider Jeffreys’ prior, pJ (θ), where


X ∼ Bin(n, θ).

We calculate Jeffreys’ prior by finding the Fisher information. The Fisher


information tells us how much information the data gives us for certain
parameter values.

In this example, it can be shown that pJ (θ) ∝ Beta(1/2, 1/2). Let’s consider
the plot of this prior. Flat here is a purely abstract idea. In order to achieve
objective inference, we need to compensate more for values on the boundary
than values in the middle.
Figure 3.2: Jeffreys’ prior for the Binomial likelihood.

Example 3.5: Finally, we consider the following prior on θ ∶

θ ∼ N (0, 1000).

What happens in this situation? We look at two plots in Figure 3.3 to


consider the behavior of this prior.

◯ Objective Priors in More Detail

Uniform Prior of Bayes and Laplace

Example 3.6: (Thomas Bayes) In 1763, Thomas Bayes considered the


question of what prior to use when estimating a binomial success proba-
bility p. He described the problem quite differently back then by considering
throwing balls onto a billiard table. He separated the billiard table into
many different intervals and considered different events. By doing so (and
not going into the details of this), he argued that a Uniform(0,1) prior was
appropriate for p.

Figure 3.3: Normal priors — the N(0, 1000) density over (−1000, 1000), and the same density restricted to [−10, 10], where it appears locally flat.

Example 3.7: (Laplace) In 1814, Pierre-Simon Laplace wanted to know

the probability that the sun will rise tomorrow. He answered this question
using the following Bayesian analysis:

• Let X represent the number of days the sun rises. Let p be the prob-
ability the sun will rise tomorrow.

• Let X∣p ∼ Bin(n, p).

• Suppose p ∼ Uniform(0, 1).

• Based on reading the Bible, Laplace computed the total number of


days n in recorded history, and the number of days x on which the sun
rose. Clearly, x = n.

Then

π(p∣x) ∝ (n choose x) p^x (1−p)^{n−x} · 1
       ∝ p^{x+1−1} (1−p)^{n−x+1−1}.

This implies

p∣x ∼ Beta(x + 1, n − x + 1).

Then, since x = n,

p̂ = E[p∣x] = (x+1)/(x+1+n−x+1) = (x+1)/(n+2) = (n+1)/(n+2).
Thus, Laplace’s estimate for the probability that the sun rises tomorrow is
(n + 1)/(n + 2), where n is the total number of days recorded in history.
For instance, if so far we have encountered 100 days in the history of our
universe, this would say that the probability the sun will rise tomorrow
is 101/102 ≈ 0.9902. However, we know that this calculation is ridiculous.
Here, we have extremely strong subjective information (the laws of physics)
that says it is extremely likely that the sun will rise tomorrow. Thus, ob-
jective Bayesian methods shouldn’t be recklessly applied to every problem
we study—especially when subjective information this strong is available.

Criticism of the Uniform Prior

The Uniform prior of Bayes and Laplace has been criticized for many different reasons. We will discuss one important criticism and not go into the others, since they go beyond the scope of this course.

In statistics, it is often a good property when a rule for choosing a prior


is invariant under what are called one-to-one transformations. Invariant
basically means unchanging in some sense. The invariance principle means
that a rule for choosing a prior should provide equivalent beliefs even if we
consider a transformed version of our parameter, like p2 or log p instead of
p.

Jeffreys’ Prior

One prior that is invariant under one-to-one transformations is Jeffreys’


prior.

What does the invariance principle mean? Suppose our prior parameter is
θ, however we would like to transform to φ.

Define φ = f (θ), where f is a one-to-one function.

Jeffreys’ prior says that if θ has the distribution specified by Jeffreys’ prior
for θ, then f (θ) will have the distribution specified by Jeffreys’ prior for φ.
We will clarify by going over two examples to illustrate this idea.

Note, for example, that if θ has a Uniform prior, then one can show that φ = f(θ) will not have a Uniform prior (unless f is the identity function).

Aside from the invariance property of Jeffreys’ prior, in the univariate case,
Jeffreys’ prior satisfies many optimality criteria that statisticians are inter-
ested in.

Definition 3.3: Define

I(θ) = −E[∂² log p(y∣θ) / ∂θ²],

where I(θ) is called the Fisher information. Then Jeffreys’ prior is defined to be

p_J(θ) = √I(θ).

Example 3.8: (Uniform Prior is Not Invariant to Transformation)


Let θ ∼ Uniform(0, 1). Suppose now we would like to transform from θ to θ².

Let φ = θ². Then θ = √φ. It follows that

∂θ/∂φ = 1/(2√φ).

Thus, p(φ) = 1/(2√φ), 0 < φ < 1, which shows that φ is not Uniform on (0, 1). Hence, the uniform prior is not invariant under this transformation. Criticism such as this led to consideration of Jeffreys’ prior.
Example 3.9: (Jeffreys’ Prior Invariance Example)

Suppose

X∣θ ∼ Exp(θ).

One can show using calculus that I(θ) = 1/θ², so p_J(θ) = 1/θ. Suppose that φ = θ². It follows that

∂θ/∂φ = 1/(2√φ).

Then

p_J(φ) = p_J(√φ) ∣∂θ/∂φ∣ = (1/√φ) · (1/(2√φ)) ∝ 1/φ.

Hence, we have shown for this example that Jeffreys’ prior is invariant under the transformation φ = θ²: applying Jeffreys’ rule directly in the φ parametrization gives I(φ) = I(θ)(∂θ/∂φ)² = 1/(4φ²), and hence the same prior p_J(φ) ∝ 1/φ.
Example 3.10: (Jeffreys’ prior) Suppose

X∣θ ∼ Binomial(n, θ).

Let’s calculate the posterior using Jeffreys’ prior. To do so we need to calculate I(θ). Ignoring terms that don’t depend on θ, we find

log p(x∣θ) = x log(θ) + (n−x) log(1−θ),

∂ log p(x∣θ)/∂θ = x/θ − (n−x)/(1−θ),

∂² log p(x∣θ)/∂θ² = −x/θ² − (n−x)/(1−θ)².

Since E(X) = nθ,

I(θ) = −E[−X/θ² − (n−X)/(1−θ)²] = nθ/θ² + (n−nθ)/(1−θ)² = n/θ + n/(1−θ) = n/(θ(1−θ)).

This implies that

p_J(θ) = √(n/(θ(1−θ))) ∝ θ^{1/2−1} (1−θ)^{1/2−1} ∝ Beta(1/2, 1/2).

Figure 3.4: Jeffreys’ prior and flat prior densities (Beta(1/2,1/2) vs. Beta(1,1)).

Figure 3.4 compares the prior density πJ (θ) with that for a flat prior, which
is equivalent to a Beta(1,1) distribution.
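A few lines of R reproduce Figure 3.4:

theta = seq(0.001, 0.999, length = 500)
plot(theta, dbeta(theta, 0.5, 0.5), type = "l", lwd = 3, ylim = c(0, 2.5),
     xlab = expression(theta), ylab = expression(p(theta)))
lines(theta, dbeta(theta, 1, 1), lty = 2, lwd = 3)
legend("top", c("Beta(1/2,1/2)", "Beta(1,1)"), lty = c(1, 2), lwd = 3)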

Note that in this case the prior is inversely proportional to the standard
deviation. Why does this make sense?

We see that the data has the least effect on the posterior when the true θ = 0.5, and has the greatest effect near the extremes, θ = 0 or 1. Jeffreys’ prior compensates for this by placing more mass near the extremes of the range, where the data has the strongest effect. We could get the same effect by (for example) letting the prior be π(θ) ∝ 1/Var_θ instead of π(θ) ∝ 1/[Var_θ]^{1/2}.

However, the former prior is not invariant under reparameterization, as we


would prefer.

We then find that

p(θ∣x) ∝ θ^x (1−θ)^{n−x} θ^{1/2−1} (1−θ)^{1/2−1}
       = θ^{x−1/2} (1−θ)^{n−x−1/2}
       = θ^{(x+1/2)−1} (1−θ)^{(n−x+1/2)−1}.

Thus, θ∣x ∼ Beta(x + 1/2, n − x + 1/2), which is a proper posterior since the
prior is proper.

Note: Remember that it is important to check that the posterior is proper.

Jeffreys’ and Conjugacy


Jeffreys priors are widely used in Bayesian analysis. In general, they are not conjugate priors; the fact that we ended up with a conjugate Beta prior for the binomial example above is just a lucky coincidence. For example, with a Gaussian model X ∼ N(µ, σ²), it can be shown that π_J(µ) = 1 and π_J(σ) = 1/σ, which do not look anything like a Gaussian or an inverse gamma, respectively. However, it can be shown that Jeffreys priors are limits of conjugate prior densities. For example, a Gaussian density N(µ_o, σ_o²) approaches a flat prior as σ_o² → ∞, while the inverse gamma density σ^{−(a+1)} e^{−b/σ} → σ^{−1} as a, b → 0.

Limitations of Jeffreys’
Jeffreys’ priors work well for single-parameter models, but not for models
with multidimensional parameters. By analogy with the one-dimensional
case, one might construct a naive Jeffreys prior as the joint density:

πJ (θ) = ∣I(θ)∣1/2 ,

where ∣ ⋅ ∣ denotes the determinant and the (i, j)th element of the Fisher
information matrix is given by

∂ 2 log p(X∣θ)
I(θ)ij = −E [ ].
∂θi ∂θj

Let’s see what happens when we apply a Jeffreys’ prior for θ to a multivariate
Gaussian location model. Suppose X ∼ Np (θ, I), and we are interested in
performing inference on ∣∣θ∣∣2 . In this case the Jeffreys’ prior for θ is flat. It
turns out that the posterior has the form of a non-central χ2 distribution
with p degrees of freedom. The posterior mean given one observation of X
is E(∣∣θ∣∣2 ∣X) = ∣∣X∣∣2 + p. This is not a good estimate because it adds p to
the square of the norm of X, whereas we might normally want to shrink
our estimate towards zero. By contrast, the minimum variance frequentist
estimate of ∣∣θ∣∣2 is ∣∣X∣∣2 − p.

Intuitively, a multidimensional flat prior carries a lot of information about


the expected value of a parameter. Since most of the mass of a flat prior
distribution is in a shell at infinite distance, it says that we expect the
value of θ to lie at some extreme distance from the origin, which causes our
estimate of the norm to be pushed further away from zero.

Haldane’s Prior

In 1963, Haldane introduced the following improper prior for a binomial


proportion:
p(θ) ∝ θ−1 (1 − θ)−1 .
It can be shown to be improper using simple calculus, which we will not go
into. However, the posterior is proper under certain conditions. Let
Y ∣θ ∼ Bin(n, θ).
Calculate p(θ∣y) and show that it is improper when y = 0 or y = n.

Remark: Recall that for a Binomial distribution, Y can take values


y = 0, 1, 2, . . . , n.

We will first calculate p(θ∣y):

p(θ∣y) ∝ (n choose y) θ^y (1−θ)^{n−y} / [θ(1−θ)]
       ∝ θ^{y−1} (1−θ)^{n−y−1}
       = θ^{y−1} (1−θ)^{(n−y)−1}.

The density of a Beta(a, b) is the following:

f(θ) = [Γ(a+b)/(Γ(a)Γ(b))] θ^{a−1} (1−θ)^{b−1},   0 < θ < 1.

This implies that θ∣Y ∼ Beta(y, n − y).

Finally, we need to check that our posterior is proper. Recall that the
parameters of the Beta need to be positive. Thus, y > 0 and n − y > 0. This
means that y ≠ 0 and y ≠ n in order for the posterior to be proper.

Remark: Recall that the Beta density must integrate to 1 whenever


the parameter values are positive. Hence, when they are not positive,
the density does not integrate to 1 and integrates to ∞. Thus, for the
problem above, when y = 0 or y = n the posterior density is improper.

There are many other objective priors that are used in Bayesian inference,
however, this is the level of exposure that we will cover in this course. If
you’re interested in learning more about objective priors (g-prior, probability
matching priors), see me and I can give you some references.

3.1 Reference Priors

Reference priors were proposed by Jose Bernardo in a 1979 paper, and fur-
ther developed by Jim Berger and others from the 1980s through the present.
They are credited with bringing about an objective Bayesian renaissance;
an annual conference is now devoted to the objective Bayesian approach.

The idea behind reference priors is to formalize what exactly we mean by an


uninformative prior: it is a function that maximizes some measure of distance
or divergence between the posterior and prior, as data observations are made.
Any of several possible divergence measures can be chosen, for example the
Kullback-Leibler divergence or the Hellinger distance. By maximizing the
divergence, we allow the data to have the maximum effect on the posterior
estimates.

For one-dimensional parameters, it will turn out that reference priors and
Jeffreys’ priors are equivalent. For multidimensional parameters, they dif-
fer. One might ask, how can we choose a prior to maximize the divergence
between the posterior and prior, without having seen the data first? Refer-
ence priors handle this by taking the expectation of the divergence, given a
model distribution for the data. This sounds superficially like a frequentist
approach—basing inference on imagined data. But once the prior is chosen
based on some model, inference proceeds in a standard Bayesian fashion.

(This contrasts with the frequentist approach, which continues to deal with
imagined data even after seeing the real data!)

◯ Laplace Approximation

Before deriving reference priors in some detail, we go through the Laplace


approximation which is very useful in Bayesian analysis since we often need
to evaluate integrals of the form

∫ g(θ)f (x∣θ)π(θ) dθ.

For example, when g(θ) = 1, the integral reduces to the marginal likelihood
of x. The posterior mean requires evaluation of two integrals ∫ θf (x∣θ)π(θ) dθ
and ∫ f (x∣θ)π(θ) dθ. Laplace’s method is a technique for approximating in-
tegrals when the integrand has a sharp maximum.

Remark: There is a nice refinement of the Laplace approximation due to


Tierney, Kass, and Kadane (JASA, 1989). Due to time constraints, we
won’t go into this, but if you’re looking to apply this in research, this is
something you should look up in the literature and use when needed.

Theorem 3.2. Laplace Approximation

Let I = ∫ q(θ) exp{n h(θ)} dθ. Assume that θ̂ maximizes h and that h has a sharp maximum at θ̂. Let c = −h″(θ̂) > 0. Then

I = q(θ̂) exp{n h(θ̂)} √(2π/(nc)) (1 + O(n^{−1})) ≈ q(θ̂) exp{n h(θ̂)} √(2π/(nc)).

Proof. Apply a Taylor expansion about θ̂, using h′(θ̂) = 0 and h″(θ̂) = −c:

I ≈ ∫_{θ̂−δ}^{θ̂+δ} [q(θ̂) + (θ−θ̂) q′(θ̂) + ½ (θ−θ̂)² q″(θ̂)] × exp{n h(θ̂) + n (θ−θ̂) h′(θ̂) + (n/2)(θ−θ̂)² h″(θ̂)} dθ + ⋯

  ≈ q(θ̂) e^{n h(θ̂)} ∫ [1 + (θ−θ̂) q′(θ̂)/q(θ̂) + ½ (θ−θ̂)² q″(θ̂)/q(θ̂)] exp{−(nc/2)(θ−θ̂)²} dθ + ⋯ .

Now let t = √(nc) (θ − θ̂), so that dθ = dt/√(nc). Hence,

I ≈ [q(θ̂) e^{n h(θ̂)} / √(nc)] ∫_{−δ√(nc)}^{δ√(nc)} [1 + t q′(θ̂)/(√(nc) q(θ̂)) + t² q″(θ̂)/(2nc q(θ̂))] e^{−t²/2} dt

  ≈ [q(θ̂) e^{n h(θ̂)} / √(nc)] √(2π) [1 + 0 + q″(θ̂)/(2nc q(θ̂))]

  ≈ [q(θ̂) e^{n h(θ̂)} / √(nc)] √(2π) [1 + O(1/n)] ≈ q(θ̂) e^{n h(θ̂)} √(2π/(nc)).
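As a concrete check, here is an R sketch (my own toy setup, not from the notes) comparing the Laplace approximation to numerical quadrature for I = ∫₀¹ exp{n h(θ)} dθ with h(θ) = x̄ log θ + (1 − x̄) log(1 − θ) and q ≡ 1, whose exact value is the Beta function B(nx̄ + 1, n − nx̄ + 1):

n = 30; xbar = 0.4
h       = function(th) xbar*log(th) + (1 - xbar)*log(1 - th)
th.hat  = xbar                          # maximizer of h
c.hat   = 1/(xbar*(1 - xbar))           # c = -h''(th.hat)
laplace = exp(n*h(th.hat))*sqrt(2*pi/(n*c.hat))
exact   = integrate(function(th) exp(n*h(th)), 0, 1)$value
c(laplace = laplace, exact = exact)     # agree to a few percent for moderate n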

◯ Some Probability Theory

First, we give a few definitions from probability theory (you may have seen
these before) and we will be informal about these.

• If Xn is O(n−1 ) then Xn “goes to 0 at least as fast as 1/n.”

• If Xn is o(n−1 ) then Xn “goes to 0 faster than 1/n.”


Definition 3.4: Formally, writing

Xn = o(rn ) as n → ∞

means that
Xn
→ 0.
rn
Similarly,
Xn = O(rn ) as n → ∞
means that
Xn
is bounded.
rn

This shouldn’t be confused with the definition below:


Definition 3.5: Formally, let Xn, n ≥ 1 be random vectors and Rn, n ≥ 1 be positive random variables. Then

Xn = o_p(Rn) if Xn/Rn → 0 in probability,

and

Xn = O_p(Rn) if Xn/Rn is bounded in probability.

Recall that Xn is bounded in probability if {Pn, n ≥ 1} is uniformly tight, where Pn(A) = Pr(Xn ∈ A) for Borel sets A ⊆ R^k, i.e., given any ε > 0 there exists an M such that Pr(∣∣Xn∣∣ ≤ M) ≥ 1 − ε for all n ≥ 1. For full details and examples, see Billingsley or van der Vaart.

◯ Shrinkage Argument of J.K. Ghosh

This argument given by J.K. Ghosh will be used to derive reference priors. It
can be used in many other theoretical proofs in Bayesian theory. If interested
in seeing these, please refer to his book for details as listed on the syllabus.
Please note that below I am hand waving over some of the details regarding
analysis that are important but not completely necessary to grasp the basic
concept here.

We consider a possibly vector-valued r.v. X with pdf f(⋅∣θ). Our goal is to find an expression for E_θ[q(X, θ)] for some function q(X, θ), where the integral ∫ q(x, θ) f(x∣θ) dx is too difficult to calculate directly. There are three steps to go through to find the desired quantity. The steps are outlined without proof.

Step 1: Consider a proper prior π̄(⋅) for θ such that the support of π̄(⋅)
is a compact rectangle in the parameter space and π̄(⋅) vanishes on the
boundary of the support, while remaining positive on the interior. Consider
the posterior of θ under π̄(⋅) and hence obtain E π̄ [q(X, θ)∣x].

Step 2: Find Eθ E π̄ [q(x, θ)∣x] = λ(θ) for θ in the interior of the support
of π̄(⋅).

Step 3: Integrate λ(⋅) with respect to π̄(⋅) and then allow π̄(⋅) to converge
to the degenerate prior at the true value of θ (say θ0 ) supposing that the
true θ is an interior point of the support of π̄(⋅). This yields Eθ [q(X, θ)].

◯ Reference Priors

Bernardo (1979) suggested choosing the prior to maximize the expected Kullback-Leibler divergence between the posterior and prior,

E[log(π(θ∣x)/π(θ))],

where expectation is taken over the joint distribution of X and θ. It is shown


in Berger and Bernardo (1989) that if one does this maximization for fixed n,
it may lead to a discrete prior with finitely many jumps—a far cry from a
diffuse prior. Instead, the maximization must be done asymptotically, i.e.,
as n → ∞. This is achieved as follows:

First write

E[log(π(θ∣x)/π(θ))] = ∫∫ [log(π(θ∣x)/π(θ))] π(θ∣x) m(x) dx dθ
                    = ∫∫ [log(π(θ∣x)/π(θ))] f(x∣θ) π(θ) dx dθ
                    = ∫ π(θ) E[log(π(θ∣x)/π(θ)) ∣ θ] dθ.

Consider E[log(π(θ∣x)/π(θ)) ∣ θ] = E[log π(θ∣x) ∣ θ] − log π(θ). Then by iterated expectation,

E[log(π(θ∣x)/π(θ))] = ∫ π(θ) {E[log π(θ∣x) ∣ θ] − log π(θ)} dθ
                    = ∫ E[log π(θ∣x) ∣ θ] π(θ) dθ − ∫ log π(θ) π(θ) dθ.   (3.1)

We cannot calculate E[log π(θ∣x) ∣ θ] directly, so we set q(X, θ) ∶= log π(θ∣x) and use the shrinkage argument of J.K. Ghosh to find E^π̄[q(X, θ)∣x], beginning with Step 1.

Step 1: Find E π̄ [log π(θ∣x)∣x] = ∫ log π(θ∣x)π̄(θ∣x) dθ.



Let Ln(θ) denote the log-likelihood, so that f(x∣θ) = exp{Ln(θ)}. Then

π(θ∣x) = exp{Ln(θ)} π(θ) / ∫ exp{Ln(θ)} π(θ) dθ
       = exp{Ln(θ) − Ln(θ̂n)} π(θ) / ∫ exp{Ln(θ) − Ln(θ̂n)} π(θ) dθ,

where θ̂n denotes the maximum likelihood estimator. Let t = √n (θ − θ̂n), so that θ = θ̂n + n^{−1/2} t and dθ = n^{−1/2} dt. We now substitute in for θ and then perform a Taylor expansion. Recall that in general f(x+h) = f(x) + h f′(x) + ½ h² f″(x) + ⋯ . Then

π(θ∣x) = exp{Ln(θ̂n + n^{−1/2}t) − Ln(θ̂n)} π(θ̂n + n^{−1/2}t) / ∫ exp{Ln(θ̂n + n^{−1/2}t) − Ln(θ̂n)} π(θ̂n + n^{−1/2}t) n^{−1/2} dt,

and by the Taylor expansion,

Ln(θ̂n + n^{−1/2}t) − Ln(θ̂n) = n^{−1/2} t L′n(θ̂n) + ½ n^{−1} t² L″n(θ̂n) + ⋯ .

Since θ̂n is the maximum likelihood estimate, L′n(θ̂n) = 0. Now define the scaled observed information

Iˆn ∶= Iˆn(θ̂n) = −(1/n) [∂² log f(x∣θ)/∂θ²]∣_{θ=θ̂n} = −n^{−1} L″n(θ̂n),

so that ½ n^{−1} t² L″n(θ̂n) = −½ t² Iˆn.

Also, under mild regularity conditions,

Iˆn(θ̂n)^{1/2} √n (θ̂n − θ) →d N(0, 1).

Then we have that

π(θ∣x) = exp{−½ t² Iˆn + ⋯} π(θ̂n + n^{−1/2}t) / ∫ exp{−½ t² Iˆn + ⋯} π(θ̂n + n^{−1/2}t) n^{−1/2} dt

       = exp{−½ t² Iˆn + O(n^{−1/2})} [π(θ̂n) + O(n^{−1/2})] / ∫ exp{−½ t² Iˆn + O(n^{−1/2})} [π(θ̂n) + O(n^{−1/2})] n^{−1/2} dt

       = √n exp{−½ t² Iˆn} π(θ̂n) [1 + O(n^{−1/2})] / ( √(2π) Iˆn^{−1/2} π(θ̂n) [1 + O(n^{−1/2})] ),

noting that the denominator takes the form of a constant times the integral of a normal density with variance Iˆn^{−1}. Hence,

π(θ∣x) = [√n Iˆn^{1/2} / √(2π)] exp(−½ t² Iˆn) [1 + O(n^{−1/2})].   (3.2)

Then

log π(θ∣x) = ½ log n − log √(2π) − ½ t² Iˆn + ½ log Iˆn + log[1 + O(n^{−1/2})]
           = ½ log n − log √(2π) − ½ t² Iˆn + ½ log Iˆn + O(n^{−1/2}).

Now consider

E^π̄[log π(θ∣x)] = ½ log n − log √(2π) − E^π̄[½ t² Iˆn] + ½ log Iˆn + O(n^{−1/2}).

To evaluate E^π̄[½ t² Iˆn], note that (3.2) states that, up to order n^{−1/2}, π(t∣x) is approximately normal with mean zero and variance Iˆn^{−1}. Since this does not depend on the form of the prior π, it follows that π̄(t∣x) is also approximately normal with mean zero and variance Iˆn^{−1}, again up to order n^{−1/2}. Then E^π̄[½ t² Iˆn] = ½, which implies that

E^π̄[log π(θ∣x)] = ½ log n − log √(2π) − ½ + log Iˆn^{1/2} + O(n^{−1/2})
                = ½ log n − log √(2πe) + log Iˆn^{1/2} + O(n^{−1/2}).

Step 2: Calculate λ(θ) = ∫ E^π̄[log π(θ∣x)] f(x∣θ) dx. This is simply

λ(θ) = ½ log n − log √(2πe) + log [I(θ)]^{1/2} + O(n^{−1/2}).

Step 3: Since λ(θ) is continuous, the process of calculating ∫ λ(θ) π̄(θ) dθ and allowing π̄(⋅) to converge to degeneracy at θ simply yields λ(θ) again. Thus,

E[log π(θ∣x) ∣ θ] = ½ log n − log √(2πe) + log [I(θ)]^{1/2} + O(n^{−1/2}).

Thus, returning to (3.1), the quantity we need to maximize is

½ log n − log √(2πe) + ∫ log{ [I(θ)]^{1/2} / π(θ) } π(θ) dθ + O(n^{−1/2}).

The integral is non-positive (once I^{1/2} is normalized to a density on the compact set under consideration) and is maximized, at 0, when π(θ) = I^{1/2}(θ), i.e., Jeffreys’ prior.

Take away: If there are no nuisance parameters, Jeffreys’ prior is the refer-
ence prior.

Multiparameter generalization

In the absence of nuisance parameters, the expected K-L divergence simplifies to

E[log(π(θ∣x)/π(θ))] = (p/2) log n − (p/2) log(2πe) + ∫ log( ∣I(θ)∣^{1/2} / π(θ) ) π(θ) dθ + O(n^{−1/2}).

Note that this is maximized when π(θ) = ∣I(θ)∣1/2 , meaning that Jeffreys’
prior is the maximizer in distance between the prior and posterior. In the
presence of nuisance parameters, things change considerably.

Example 3.11: Bernardo’s reference prior, 1979, JASA


Let θ = (θ1, θ2), where θ1 is p1 × 1 and θ2 is p2 × 1. We define p = p1 + p2. Let

I(θ) = I(θ1, θ2) = ( I11(θ)   I12(θ)
                     I21(θ)   I22(θ) ).

Suppose that θ1 is the parameter of interest and θ2 is a nuisance parameter


(meaning that it’s not really of interest to us in the model).

Begin with π(θ2∣θ1) = ∣I22(θ)∣^{1/2} c(θ1), where c(θ1) is the constant that makes this conditional distribution a proper density. Now try to maximize

E[log(π(θ1∣x)/π(θ1))]

to find the marginal prior π(θ1). We write

log(π(θ1∣x)/π(θ1)) = log{ [π(θ1, θ2∣x)/π(θ2∣θ1, x)] / [π(θ1, θ2)/π(θ2∣θ1)] }
                   = log(π(θ∣x)/π(θ)) − log(π(θ2∣θ1, x)/π(θ2∣θ1)).   (3.3)
Arguing as before,

E[log(π(θ∣x)/π(θ))] = (p/2) log n − (p/2) log 2πe + ∫ π(θ) log( ∣I(θ)∣^{1/2} / π(θ) ) dθ + O(n^{−1/2}).   (3.4)

Similarly,

E[log(π(θ2∣θ1, x)/π(θ2∣θ1))] = (p2/2) log n − (p2/2) log 2πe + ∫ π(θ) log( ∣I22(θ)∣^{1/2} / π(θ2∣θ1) ) dθ + O(n^{−1/2}).   (3.5)

From (3.3)–(3.5), we find

E[log(π(θ1∣x)/π(θ1))] = (p1/2) log n − (p1/2) log 2πe + ∫ π(θ) log( ∣I11.2(θ)∣^{1/2} / π(θ1) ) dθ + O(n^{−1/2}),   (3.6)

where I11.2(θ) = I11(θ) − I12(θ) I22^{−1}(θ) I21(θ), and ∣I(θ)∣ = ∣I22∣ ∣I11 − I12 I22^{−1} I21∣ = ∣I22∣ ∣I11.2∣. These identities can be derived from Searle’s book on matrix algebra.

We now break up the integral in (3.6). Define

log ψ(θ1) = ∫ π(θ2∣θ1) log ∣I11.2(θ)∣^{1/2} dθ2.

We find that

E[log(π(θ1∣x)/π(θ1))] = (p1/2) log n − (p1/2) log 2πe + ∫ π(θ) log ∣I11.2(θ)∣^{1/2} dθ − ∫ π(θ) log π(θ1) dθ + O(n^{−1/2})

  = (p1/2) log n − (p1/2) log 2πe + ∫ π(θ1) [∫ π(θ2∣θ1) log ∣I11.2(θ)∣^{1/2} dθ2] dθ1 − ∫ π(θ1) log π(θ1) dθ1 + O(n^{−1/2})

  = (p1/2) log n − (p1/2) log 2πe + ∫ π(θ1) log( ψ(θ1) / π(θ1) ) dθ1 + O(n^{−1/2}).
To maximize the integral above, we choose π(θ1) ∝ ψ(θ1). Note that, by the standard block-inverse formula, I11.2^{−1}(θ) is exactly the (1,1) block of the inverse information matrix:

I^{−1}(θ) = ( I11.2^{−1}(θ)                        −I11.2^{−1}(θ) I12(θ) I22^{−1}(θ)
              −I22^{−1}(θ) I21(θ) I11.2^{−1}(θ)    I22.1^{−1}(θ) ),

where I22.1(θ) = I22(θ) − I21(θ) I11^{−1}(θ) I12(θ).

Writing out our prior, we find that

π(θ1) ∝ exp{∫ π(θ2∣θ1) log ∣I11.2(θ)∣^{1/2} dθ2} = exp{∫ c(θ1) ∣I22(θ)∣^{1/2} log ∣I11.2(θ)∣^{1/2} dθ2}.

Remark: An important point that should be highlighted is that all these calculations (especially evaluations of all integrals) are carried out over an increasing sequence of compact sets K whose union is the parameter space. For example, if the parameter space is R × R⁺, take the increasing sequence of compact rectangles [−i, i] × [i^{−1}, i] and then eventually take i → ∞. Also, the proofs are carried out by considering a sequence of priors πi with support Ki, and we eventually take i → ∞. This fact should be taken into account when doing the examples and calculations of these types of problems.
Example 3.12: Let X1, . . . , Xn ∣ µ, σ² ~iid N(µ, σ²), where σ² is a nuisance parameter. Consider the sequence of priors πi with support [−i, i] × [i^{−1}, i], i = 1, 2, . . . . The Fisher information in the (µ, σ) parametrization is

I(µ, σ) = ( 1/σ²   0
            0      2/σ² ).

Then π(σ∣µ) = √2 c_i / σ for i^{−1} ≤ σ ≤ i, and the normalization ∫_{i^{−1}}^{i} (√2 c_i / σ) dσ = 1 implies c_i = 1/(2√2 ln i). Thus,

π(σ∣µ) = 1/(2 σ ln i),   i^{−1} ≤ σ ≤ i.

Now find π(µ). Observe that

π(µ) = exp{∫ π(σ∣µ) log ∣I11.2(θ)∣^{1/2} dσ}.

Recall that I11.2 = I11 − I12 I22^{−1} I21 = 1/σ². Thus,

π(µ) = exp{∫_{i^{−1}}^{i} [1/(2σ ln i)] log(1/σ) dσ} = exp{0} = constant = c,

since ∫_{i^{−1}}^{i} (log σ)/σ dσ = [½ (log σ)²]_{i^{−1}}^{i} = 0. We want to find π(µ, σ). We know that π(σ∣µ) = π(µ, σ)/π(µ), which implies

π(µ, σ) = π(µ) π(σ∣µ) = c/(2 σ ln i) ∝ 1/σ.

Problems with Reference Priors: see page 128 of the little green book and the discussion in Bernardo and Berger (1992).

3.2 Final Thoughts on Being Objective

Some Thoughts from a Review paper by Stephen Fienberg

We have spent just a short amount of time covering objective Bayesian pro-
cedures, but already we have seen how each is flawed in some sense. As
Fienberg (2009) points out in a review article (see the webpage), there are
two main parts of being Bayesian: the prior and the likelihood. What is
Fienberg’s point? He proposes that robustness should be carried through at both levels of the model. That is, we should care about subjectivity of the likelihood as well as of the prior.

What about pragmatism (what works best in terms of implementation)? It


really comes down to the quality of the data that we’re analyzing. Fienberg
looks at two examples. The first involves the NBC Election Night model
of the 1960s and 1970s (which used a fully HB model). Here considering
different priors is important. For this illustration, there were multiple priors
based on past elections to choose from in real time, and the choice of prior
was often crucial in close elections. However, in other examples, such as Mosteller and Wallace’s (1964) analysis of the Federalist papers, the likelihood mattered more. In this case, the posterior odds for several papers shifted
when Mosteller and Wallace used a negative binomial versus a Poisson for
word counts.

My favorite part that Fienberg illustrates in this paper is the view that “ob-
jective Bayes is like the search for the Holy Grail.” He mentions that Good
(1972) once wrote that there are “46,656 Varieties of Bayesians,” which was
a number that he admitted exceeded the number of professional statisticians
during that time. Today? There seem to be at least as many varieties of objective Bayesians, each trying to arrive at the perfect choice of an objective prior, and each attempt seems to fail on some foundational principle. For example,
Eaton and Freedman (2004) criticize why you shouldn’t use Jeffreys’ prior
for the normal covariance matrix. We didn’t look at intrinsic priors, but they
have been criticized by Fienberg for contingency tables because of their de-
pendence on the likelihood function and because of bizarre properties when
extended to deal with large sparse tables.

Fienberg’s conclusion: “No, it does not make sense for me to be an ‘Objective


Bayesian’!” Read the other papers on the web when you have time, and you can make your own decision.
Chapter 4

Evaluating Bayesian Procedures

They say statistics are for losers, but losers are usually the ones saying that.
—Urban Meyer

In this chapter, we give a brief overview of how to evaluate Bayesian proce-


dures by looking at how frequentist confidence intervals differ from Bayesian
credible intervals. We also introduce Bayesian hypothesis testing and Bayesian
p-values. Again, we emphasize that this is a rough overview and for more
details, one should look in Gelman et al. (2004) or Carlin and Louis (2009).

4.1 Confidence Intervals versus Credible Intervals

One major difference between Bayesians and frequentists is how they inter-
pret intervals. Let’s quickly review what a frequentist confidence interval is
and how to interpret one.

Frequentist Confidence Intervals

A confidence interval for an unknown (fixed) parameter θ is an interval of


numbers that we believe is likely to contain the true value of θ. Intervals
are important because they provide us with an idea of how well we can
estimate θ.


Definition 4.1: A confidence interval is constructed to contain θ a per-


centage of the time, say 95%. Suppose our confidence level is 95% and our
interval is (L, U ). Then we are 95% confident that the true value of θ is
contained in (L, U ) in the long run. In the long run means that this would
occur nearly 95% of the time if we repeated our study millions and millions
of times.

Common Misconceptions in Statistical Inference

• A confidence interval is a statement about θ (a population parameter).


It is not a statement about the sample.

• Remember that a confidence interval is not a statement about indi-


vidual subjects in the population. As an example, suppose that I tell
you that a 95% confidence interval for the average amount of televi-
sion watched by Americans is (2.69, 6.04) hours. This doesn’t mean
we can say that 95% of all Americans watch between 2.69 and 6.04
hours of television. We also cannot say that 95% of Americans in the
sample watch between 2.69 and 6.04 hours of television. Beware that
statements such as these are false. However, we can say that we are
95 percent confident that the average amount of television watched by
Americans is between 2.69 and 6.04 hours.

Bayesian Credible Intervals

Recall that frequentists treat θ as fixed, but Bayesians treat θ as a random


variable. The main difference between frequentist confidence intervals and
Bayesian credible intervals is the following:

• Frequentists invoke the concept of probability before observing the


data. For any fixed value of θ, a frequentist confidence interval will
contain the true parameter θ with some probability, e.g., 0.95.

• Bayesians invoke the concept of probability after observing the data.


For some particular set of data X = x, the random variable θ lies in a
Bayesian credible interval with some probability, e.g., 0.95.

Assumptions

In lower-level classes, you wrote down assumptions whenever you did con-
fidence intervals. This is redundant for any problem we construct in this
course since we always know the data is randomly distributed and we assume
it comes from some underlying distribution, say Normal, Gamma, etc. We
also always assume our observations are i.i.d. (independent and identically distributed), meaning that the observations are independent and all follow the same distribution. Thus, when working a particular problem, we will
assume these assumptions are satisfied given the proposed model holds.

Definition 4.2: A Bayesian credible interval of size 1 − α is an interval (a, b) such that

P(a ≤ θ ≤ b∣x) = 1 − α,   i.e.,   ∫_a^b p(θ∣x) dθ = 1 − α.

Remark: When you’re calculating credible intervals, you’ll find the


values of a and b by several means. You could be asked do the following:

• Find the a, b using means of calculus to determine the credible


interval or set.
• Use a Z-table when appropriate.
• Use R to approximate the values of a and b.
• You could be given R code/output and asked to find the values
of a and b.

Important Point

Our definition for the credible interval could lead to many choices of (a, b)
for particular problems.

Suppose that we required our credible interval to have equal probability α/2
in each tail. That is, we will assume
P (θ < a∣x) = α/2
and
P (θ > b∣x) = α/2.
Is the credible interval still unique? No. Consider
π(θ∣x) = I(0 < θ < 0.025) + I(1 < θ < 1.95) + I(3 < θ < 3.025)
so that the density has three separate plateaus. Now notice that any (a, b)
such that 0.025 < a < 1 and 1.95 < b < 3 satisfies the proposed definition of
an ostensibly “unique” credible interval. To fix this, we can simply require
that
{θ ∶ π(θ∣x) is positive}
(i.e., the support of the posterior) must be an interval.

Bayesian interval estimates for θ are similar to confidence intervals of clas-


sical inference. They are called credible intervals or sets. Bayesian credible
intervals have a nice interpretation as we will soon see.

To see this more clearly, see Figure 4.1.


Definition 4.3: A Bayesian credible set C of level or size 1 − α is a set C
such that 1 − α ≤ P (C∣y) = ∫C p(θ∣y) dθ. (Of course in discrete settings, the
integral is simply replaced by summation).

Note: We use ≤ instead of = to include discrete settings since obtaining exact


coverage in a discrete setting may not be possible.

This definition enables direct probability statements about the likelihood


of θ falling in C. That is,
“The probability that θ lies in C given the observed data y is at least (1 − α).”

This greatly contrasts with the usual frequentist CI, for which the corre-
sponding statement is something like “If we could recompute C for a large
number of datasets collected in the same way as ours, about (1 − α) × 100%


of them would contain the true value θ. ”

This classical statement is not one of comfort. We may not be able to


repeat our experiment a large number of times (suppose we have an
interval estimate for the 1993 U.S. unemployment rate). If we are in physical
possession of just one dataset, our computed C will either contain θ or it
won’t, so the actual coverage probability will be 0 or 1. For the frequentist,
the confidence level (1 − α) is only a “tag” that indicates the quality of the
procedure. But for a Bayesian, the credible set provides an actual probability
statement based only on the observed data and whatever prior information
we add.

Figure 4.1: Illustration of a 95% credible interval — probability 0.95 between a and b, with 0.025 in each tail of the posterior of θ∣x.

Interpretation

We interpret Bayesian credible intervals as follows: There is a 95% proba-


bility that the true value of θ is in the interval (a, b), given the data.

Comparisons

• Conceptually, probability comes into play in a frequentist confidence


interval before collecting the data, i.e., there is a 95% probability that
we will collect data that produces an interval that contains the true
parameter value. However, this is awkward, because we would like to
make statements about the probability that the interval contains the
true parameter value given the data that we actually observed.

• Meanwhile, probability comes into play in a Bayesian credible interval


after collecting the data, i.e., based on the data, we now think there
is a 95% probability that the true parameter value is in the interval.
This is more natural because we want to make a probability statement
regarding that data after we have observed it.

Example 4.1: Suppose

X1 . . . , Xn ∣θ ∼ N (θ, σ 2 )
θ ∼ N (µ, τ 2 ),

where µ, σ 2 , and τ 2 are known. Calculate a 95% credible interval for θ.

Recall

θ∣x1, . . . , xn ∼ N( (nx̄τ² + µσ²)/(nτ² + σ²), (σ²τ²)/(nτ² + σ²) ).

Let

µ* = (nx̄τ² + µσ²)/(nτ² + σ²),
σ*² = (σ²τ²)/(nτ² + σ²).

We want to calculate a and b such that P (θ < a∣x1 , . . . , xn ) = 0.05/2 = 0.025


and P (θ > b∣x1 , . . . , xn ) = 0.05/2 = 0.025. So,

0.025 = P(θ < a∣x1, . . . , xn)
      = P( (θ − µ*)/σ* < (a − µ*)/σ* ∣ x1, . . . , xn )
      = P( Z < (a − µ*)/σ* ∣ x1, . . . , xn ),   where Z ∼ N(0, 1).
Thus, we now must find an a such that P(Z < (a − µ*)/σ* ∣ x1, . . . , xn) = 0.025. From a Z-table, we know that

(a − µ*)/σ* = −1.96.
This tells us that a = µ∗ − 1.96σ ∗ . Similarly, b = µ∗ + 1.96σ ∗ . (Work this part
out on your own at home). Therefore, a 95% credible interval is

µ∗ ± 1.96σ ∗ .

Example 4.2: We’re interested in knowing the true average number of


ornaments on a Christmas tree. Call this θ. We take a random sample of
n Christmas trees, count the ornaments on each one, and call the results
X1 , . . . , Xn . Let the prior on θ be Normal(75, 225).

Using data (trees.txt) we have, we will calculate the 95% credible interval
and confidence interval for θ. In R we first read in the data file trees.txt.
We then set the initial values for our known parameters, n, σ, µ, and τ.

Next, we refer to Example 4.1, and calculate the values of µ∗ and σ ∗ using
this example. Finally, again referring to Example 4.1, we recall that the
formula for a 95% credible interval here is

µ∗ ± 1.96σ ∗ .

On the other hand, recalling back to any basic statistics course, a 95%
confidence interval in this situation is

x̄ ± 1.96 σ/√n.

From the R code, we find that there is a 95% probability that the average
number of ornaments per tree is in (45.00, 57.13) given the data. We also
find that we are 95% confident that the average number of ornaments per
tree is contained in (43.80, 56.20). If we compare the width of each interval,
we see that the credible interval is slightly narrower. It is also shifted towards
slightly higher values than the confidence interval for this data, which makes
sense because the prior mean was higher than the sample mean. What would
happen to the width of the intervals if we increased n? Does this make sense?

x = read.table("trees.txt",header=T)
attach(x)

n = 10
sigma = 10
mu = 75
tau = 15

mu.star = (n*mean(orn)*tau^2+mu*sigma^2)/(n*tau^2+sigma^2)
sigma.star = sqrt((sigma^2*tau^2)/(n*tau^2+sigma^2))

(cred.i = mu.star+c(-1,1)*qnorm(0.975)*sigma.star)
(conf.i = mean(orn)+c(-1,1)*qnorm(0.975)*sigma/sqrt(n))

diff(cred.i)
diff(conf.i)
detach(x)

Example 4.3: (Sleep Example)


Recall the Beta-Binomial. Suppose we are interested in the proportion θ of American college students that sleep at least eight hours each night.

Suppose we take a random sample of 27 students from UF, of whom 11 recorded that they slept at least eight hours each night. So, we assume the data are distributed as Binomial(27, θ).

Suppose that the prior on θ was Beta(3.3,7.2). Thus, the posterior distribu-
tion is

θ∣11 ∼ Beta(11 + 3.3, 27 − 11 + 7.2), i.e.,


θ∣11 ∼ Beta(14.3, 23.2).

Suppose now we would like to find a 90% credible interval for θ. We can-
not compute this in closed form since computing probabilities for Beta dis-
tributions involves messy integrals that we do not know how to compute.
However, we can use R to find the interval.

We need to solve
P (θ < c∣x) = 0.05
and
P (θ > d∣x) = 0.05 for c and d.

The reason we cannot compute this in closed form is that we need to compute

∫₀^c Beta(14.3, 23.2) dθ = 0.05

and

∫_d^1 Beta(14.3, 23.2) dθ = 0.05.

Note that Beta(14.3, 23.2) represents

f(θ) = [Γ(37.5)/(Γ(14.3)Γ(23.2))] θ^{14.3−1} (1−θ)^{23.2−1},   0 < θ < 1.

The R code for this is very straightforward:

a = 3.3
b = 7.2
n = 27
x = 11
a.star = x+a
b.star = n-x+b

c = qbeta(0.05,a.star,b.star)
d = qbeta(1-0.05,a.star,b.star)

Running the code in R, we find that a 90% credible interval for θ is (0.256, 0.514),
meaning that there is a 90% probability that the proportion of UF students
who sleep eight or more hours per night is between 0.256 and 0.514 given
the data.

4.2 Credible Sets or Intervals

Definition 4.4: Suppose the posterior density for θ is unimodal. A highest posterior density (HPD) credible set of size 1 − α is a set C such that C = {θ ∶ p(θ∣Y = y) ≥ k_α}, where k_α is chosen so that P(θ ∈ C ∣ y) ≥ 1 − α.
Example 4.4: Normal HPD credible interval

Suppose that

y∣θ ∼ N(θ, σ²),   θ ∼ N(µ, τ²).

Then θ∣y ∼ N(µ1, τ1²), where we have derived this posterior before. The HPD credible interval for θ is simply µ1 ± z_{α/2} τ1. Note that the HPD credible interval is the same as the equal-tailed interval centered at the posterior mean.

Remark: Credible intervals are very easy to calculate unlike confidence


intervals, which require pivotal quantities or inversion of a family of tests.

In general, plot the posterior distribution and find the HPD credible set.
One important point is that the posterior must be unimodal in order to
guarantee that the HPD credible set is an interval. (Unimodality of the
posterior is a sufficient condition for the credible set to be an interval, but
it’s not necessary.)
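A grid-based R sketch of this, using the Beta(14.3, 23.2) posterior from the sleep example: order the grid points by posterior density and keep the highest-density points until they hold 90% of the mass.

a.star = 14.3; b.star = 23.2
theta  = seq(0.001, 0.999, length = 1e4)
dens   = dbeta(theta, a.star, b.star)
ord    = order(dens, decreasing = TRUE)            # highest density first
keep   = ord[cumsum(dens[ord])/sum(dens) <= 0.90]
range(theta[keep])   # HPD interval; compare with the equal-tail (0.256, 0.514)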

Example 4.5: Suppose

y1, . . . , yn ∣ σ² ~iid N(0, σ²),
p(σ²) ∝ (σ²)^{−α/2−1} e^{−β/(2σ²)}.

Let z = 1/σ². Then

p(z) ∝ z^{α/2+1} e^{−βz/2} ∣−1/z²∣ = z^{α/2−1} e^{−βz/2},

which is the kernel of a Gamma(α/2, rate β/2) density, so the prior on σ² is inverse-gamma.

Then

p(σ²∣y) ∝ (σ²)^{−(n+α)/2−1} e^{−(∑_i y_i² + β)/(2σ²)},

which implies that

σ²∣y ∼ IG( (n+α)/2, (∑_i y_i² + β)/2 ).

This posterior distribution is unimodal, but how do we know this? One way
of showing the posterior is unimodal is to show that it is increasing in σ 2
up to a point and then decreasing afterwards. The log of the posterior has
the same feature.

Then

log p(σ²∣y) = c1 − [(n+α)/2 + 1] log(σ²) − (∑_i y_i² + β)/(2σ²).

This implies that

∂ log p(σ²∣y)/∂σ² = −(n+α+2)/(2σ²) + (∑_i y_i² + β)/(2σ⁴)
                  = [ (∑_i y_i² + β) − (n+α+2) σ² ] / (2σ⁴),

which is positive, zero, or negative according as σ² is less than, equal to, or greater than (∑_i y_i² + β)/(n+α+2). The posterior is therefore increasing and then decreasing, hence unimodal, so we can get an HPD interval for σ².

4.3 Bayesian Hypothesis Testing

Let’s first review p-values and why they might not make sense in the grand
scheme of things. In classical statistics, the traditional approach proposed
by Fisher, Neyman, and Pearson is where we have a null hypothesis and
an alternative. After determining some test statistic T (y), we compute the
p-value, which is

p-value = P {T (Y ) is more extreme than T (yobs ) ∣ Ho } ,

where extremeness is in the direction of the alternative hypothesis. If the


p-value is less than some prespecified Type I error rate, we reject Ho, and
otherwise we don’t.

Clearly, classical statistics has deep roots and a long history. It’s popular
with practitioners, but does it make sense? The approach can be applied in a
straightforward manner only when the two hypotheses in question are nested
(meaning one within the other). This means that Ho must be a simplification
of Ha . Many practical testing problems involve a choice between two or
more models that aren’t nested (choosing between quadratic and exponential
growth models for example).

Another difficulty is that tests of this type can only offer evidence against
the null hypothesis. A small p-value indicates that the latter, alternative
model has significantly more explanatory power. But a large p-value does
not suggest that the two models are equivalent (only that we lack evidence
that they are not). This limitation/difficulty is often swept under the rug
and never dealt with. We simply say, “we fail to reject the null hypothesis”
and leave it at that.

Third, the p-value offers no direct interpretation as a “weight of evidence”


but only as a long-term probability of obtaining data at least as unusual as
what we observe. Unfortunately, the fact that small p-values imply rejection
of Ho causes many consumers of statistical analyses to assume that the p-
value is the probability that Ho is true, even though it’s nothing of the
sort.

Finally, one last criticism is that p-values depend not only on the observed
data but also on the total sampling probability of certain unobserved data
points, namely, the more extreme T (Y ) values. Because of this, two exper-
iments with identical likelihoods could result in different p-values if the two
experiments were designed differently. (This violates the Likelihood Prin-
ciple.) See Example 1.1 in Chapter 1 for an illustration of how this can
happen.

In classical settings, we talk about Type I and Type II errors. In Bayesian


hypothesis testing, we will consider the following scenarios:

Ho ∶ θ ∈ Θo Ha ∶ θ ∈ Θ1 .

Ho ∶ θ = θo Ha ∶ θ ≠ θ0 .
Ho ∶ θ ≤ θo Ha ∶ θ > θ0 .
A Bayesian talks about posterior odds and Bayes factors.

Definition 4.5: Prior odds

Let πo = P(θ ∈ Θo), π1 = P(θ ∈ Θ1), with πo + π1 = 1. Then the prior odds in favor of Ho are πo/π1.

Definition 4.6: Posterior odds

Let αo = P(θ ∈ Θo∣y) and α1 = P(θ ∈ Θ1∣y). Then the posterior odds are αo/α1.

Definition 4.7: Bayes Factor

The Bayes Factor (BF) = posterior odds / prior odds = (αo/α1) ÷ (πo/π1) = (αo π1)/(α1 πo).
Example 4.6: IQ Scores
Suppose we’re studying IQ scores and so we assume that the data follow the
model where

y∣θ ∼ N (θ, 102 )


θ ∼ N (100, 152 ).

We’d like to be able to say something about the mean of the IQ scores and whether it’s below or above 100. Then

Ho ∶ θ ≤ 100   Ha ∶ θ > 100.

The prior odds are then πo/π1 = P(θ ≤ 100)/P(θ > 100) = (1/2)/(1/2) = 1 by symmetry.

Suppose we find that y = 115. Then, by the usual conjugate normal update, θ∣y = 115 ∼ N(110.38, 69.23).

Then αo = P(θ ≤ 100∣y = 115) = 0.106 and α1 = P(θ > 100∣y = 115) = 0.894. Thus, αo/α1 ≈ 0.119, and hence BF ≈ 0.119 (since the prior odds are 1).
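The corresponding R computation (conjugate normal update, then pnorm):

sigma2 = 100; tau2 = 225; mu0 = 100; y = 115
mu1    = (sigma2*mu0 + tau2*y)/(sigma2 + tau2)   # posterior mean, 110.38
tau1   = sqrt(sigma2*tau2/(sigma2 + tau2))       # posterior sd, sqrt(69.23)
alpha0 = pnorm(100, mu1, tau1)                   # P(theta <= 100 | y) = 0.106
alpha1 = 1 - alpha0
(BF    = (alpha0/alpha1)/1)                      # prior odds = 1, so BF = 0.119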

◯ Lavine and Schervish (The American Statistician, 1999):


Bayes Factors: What They Are and What They Are Not

We present an example from the paper above to illustrate an important point


regarding Bayes Factors. Suppose a coin is known to be a 2-sided head, a
2-sided tail, or fair. Then let θ be the probability of a head ∈ {0, 1/2, 1}.
Suppose the data tell us that the coin was tossed 4 times and always landed
on heads.

Furthermore, suppose that

π({0}) = 0.01, π({1/2}) = 0.98, π({1}) = 0.01.

Consider

H1 ∶ θ = 1 versus H4 ∶ θ ≠ 1
H2 ∶ θ = 1/2 versus H5 ∶ θ ≠ 1/2
H3 ∶ θ = 0 versus H6 ∶ θ ≠ 0.

Then

f(x∣H1) = P(four heads∣θ = 1) = 1
f(x∣H2) = P(four heads∣θ = 1/2) = (1/2)⁴ = 1/16
f(x∣H3) = P(four heads∣θ = 0) = 0
f(x∣H4) = P(four heads∣θ ≠ 1) = [(1/16) × 0.98 + 0 × 0.01] / (0.98 + 0.01) = (1/16) × (98/99) = 0.0619
f(x∣H5) = P(four heads∣θ ≠ 1/2) = [0 × 0.01 + 1 × 0.01] / (0.01 + 0.01) = 0.5
f(x∣H6) = P(four heads∣θ ≠ 0) = [(1/16) × 0.98 + 1 × 0.01] / (0.98 + 0.01) = 0.072

We then find that

f (x∣H1 )/f (x∣H4 ) = 1/0.0619 ≈ 16.2 and f (x∣H2 )/f (x∣H5 ) = (1/16)/(1/2) = 0.125.

Let k ∈ (0.0619, 0.125), and suppose we reject a hypothesis whenever the
Bayes factor in its favor is less than k. The Bayes factor in favor of H4 is
0.0619 < k, so we reject H4 in favor of H1 ; the Bayes factor in favor of H2 is
0.125 > k, so we fail to reject H2 . But H2 implies H4 (if θ = 1/2, then surely
θ ≠ 1), so failing to reject H2 should imply failing to reject H4 , and evidence
in favor of H4 should be at least as strong as that in favor of H2 . The Bayes
factor violates this. Lavine and Schervish refer to this as a lack of coherence.
The problem does not occur with the posterior odds since if
P (Θo ∣x) < P (Θa ∣x)
holds, then
P (Θo ∣x)/[1 − P (Θo ∣x)] < P (Θa ∣x)/[1 − P (Θa ∣x)].
(This result can be generalized.)
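As a quick check, the marginal likelihoods and Bayes factors above can be
computed in a few lines of R (a direct transcription of the calculation, with
nothing assumed beyond the stated priors):

theta <- c(0, 1/2, 1); prior <- c(0.01, 0.98, 0.01)
lik <- theta^4                                    # P(four heads | theta)
f_not <- function(j) sum(lik[-j]*prior[-j])/sum(prior[-j])
lik[3]/f_not(3)   # BF in favor of H1 (theta = 1) versus H4: about 16.2
lik[2]/f_not(2)   # BF in favor of H2 (theta = 1/2) versus H5: 0.125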

• It is often claimed that Bayes factors are insensitive to the choice of
prior; this statement is misleading (Berger, 1995). We will see why in
Example 4.7.

• The BF measures the change from the prior odds to the posterior odds.

Example 4.7: Simple Null versus Simple Alternative

Ho ∶ θ = θo Ha ∶ θ = θ1 .
Then πo = P (θ = θo ) and π1 = P (θ = θ1 ), so πo + π1 = 1. Then

αo = P (θ = θo ∣y) = P (y∣θ = θo )πo / [P (y∣θ = θo )πo + P (y∣θ = θ1 )π1 ].

This implies that αo /α1 = (πo /π1 ) ⋅ [P (y∣θ = θo )/P (y∣θ = θ1 )], and hence
BF = P (y∣θ = θo )/P (y∣θ = θ1 ), which is the likelihood ratio. This does not
depend on the choice of the prior. However, in general the Bayes factor
depends on how the prior spreads mass over the null and alternative (so
Berger’s statement is misleading).
Example 4.8:
Ho ∶ θ ∈ Θo    Ha ∶ θ ∈ Θ1 .
Derive the BF. Let go (θ) and g1 (θ) be probability density functions such
that ∫Θo go (θ) dθ = 1 and ∫Θ1 g1 (θ) dθ = 1. Let

π(θ) = πo go (θ) if θ ∈ Θo , and π(θ) = π1 g1 (θ) if θ ∈ Θ1 .

Then ∫ π(θ) dθ = ∫Θo πo go (θ) dθ + ∫Θ1 π1 g1 (θ) dθ = πo + π1 = 1.

Now αo /α1 = ∫Θo π(θ∣y) dθ / ∫Θ1 π(θ∣y) dθ, and π(θ∣y) = p(y∣θ)π(θ)/m(y).
This implies that

αo /α1 = [∫Θo p(y∣θ)π(θ) dθ/m(y)] / [∫Θ1 p(y∣θ)π(θ) dθ/m(y)]
      = (πo /π1 ) ⋅ ∫Θo p(y∣θ)go (θ) dθ / ∫Θ1 p(y∣θ)g1 (θ) dθ.

Hence BF = ∫Θo p(y∣θ)go (θ) dθ / ∫Θ1 p(y∣θ)g1 (θ) dθ, which is the marginal
of y under Ho divided by the marginal of y under H1 .
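For a concrete (hypothetical) instance, suppose y counts successes in n
Bernoulli trials, Θo = (0, 0.5], Θ1 = (0.5, 1), and go and g1 are uniform
densities on those sets; the data values below are made up for illustration.
The two marginals, and hence the BF, can then be computed by numerical
integration in R:

y <- 7; n <- 10
g0 <- function(t) dunif(t, 0, 0.5)   # integrates to 1 over Theta_o
g1 <- function(t) dunif(t, 0.5, 1)   # integrates to 1 over Theta_1
m0 <- integrate(function(t) dbinom(y, n, t)*g0(t), 0, 0.5)$value
m1 <- integrate(function(t) dbinom(y, n, t)*g1(t), 0.5, 1)$value
m0/m1   # the Bayes factor: marginal of y under Ho over marginal under H1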

4.4 Bayesian p-values

Bayes factors are meant to compare two or more models. Often, however,
we are interested in the goodness of fit of a particular model rather than a
comparison of models. Bayesian p-values were proposed to address this
problem.

◯ Prior Predictive p-value

George Box proposed a prior predictive p-value. Suppose that T (x) is a test
statistic and π is some prior. Then we calculate the marginal distribution

m[T (x) ≥ T (xobs )∣Mo ],

where Mo is the null model under consideration.


Example 4.9: X1 , . . . , Xn ∣θ, σ² iid∼ N (θ, σ²). Let Mo ∶ θ = 0 and
T (x) = √n ∣X̄∣.

Suppose the prior π(σ²) is degenerate at σo² , i.e., π(σ² = σo²) = 1. Then,
marginally,

X̄ ∼ N (0, σo²/n)

under Mo . Also,

P (√n ∣X̄∣ ≥ √n ∣x̄obs ∣) = P (√n ∣X̄∣/σo ≥ √n ∣x̄obs ∣/σo ) = 2Φ(−√n ∣x̄obs ∣/σo ).

If the guessed σo is much smaller than the actual standard deviation, then
the p-value is small and the evidence against Mo is overestimated.

Remark: The takeaway message is that the prior predictive p-value is


heavily influenced by the prior.
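A two-line R illustration of this sensitivity (the sample size, observed mean,
and candidate values of σo below are made up):

n <- 25; xbar_obs <- 1.2
pval <- function(sigma_o) 2*pnorm(-sqrt(n)*abs(xbar_obs)/sigma_o)
pval(c(1, 3, 10))   # a too-small guess for sigma_o manufactures "evidence" against Mo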

◯ Other Bayesian p-values

Since then, the posterior predictive p-value (PPP) has been proposed by
Rubin (1984), Meng (1994), and Gelman et al. (1996). They propose looking
at the posterior predictive distribution of a future observation x under some
prior π. That is, we calculate

m(x∣xobs ) = ∫ f (x∣θ)π(θ∣xobs ) dθ.



Then PPP is defined to be

P ∗ = P (T (X) ≥ T (xobs )),

which is the conditional probability that for a future observation T (X) ≥


T (xobs ) given the predictive distribution of X under prior π and xobs .

Remark: For details, see the papers. A general criticism by Bayarri and
Berger points out that the procedure involves using the data twice. The
data is used in finding the posterior distribution of θ and also in finding the
posterior predictive p-value. As an alternative, they have suggested using
conditional predictive p-values (CPP). This involves splitting the data into
two parts, say T (X) and U (X). We use U (X) to find the posterior predictive
distribution and T (X) continues to be the test statistic.

Potential Fixes to the Prior Predictive p-value

We first consider the conditional and partial predictive p-values of Bayarri
and Berger, JASA (1999, 2000). They propose splitting the data into two
parts (T (X), U (X)), where T is the test statistic and the p-value is computed
from the posterior predictive distribution of a future T conditional on U. The
choice of U is unclear, and for complex problems it is nearly impossible to
find. We note that if U (X) is taken to be the entire data and T (X) is some
test statistic, then we get the PPP back.

Also, Robins, van der Vaart, and Ventura (JASA, 2000) investigate Bayarri
and Berger’s claims that, for a parametric model, their conditional and
partial predictive p-values are superior to the parametric bootstrap p-value
and to previously proposed p-values (the prior predictive p-value of Guttman,
1967, and Rubin, 1984, and the discrepancy p-value of Gelman et al. (1995,
1996) and Meng (1994)). Robins et al. note that Bayarri and Berger’s
claims of superiority are based on small-sample properties for specific exam-
ples. They investigate large-sample properties and conclude that asymptotic
results confirm the superiority of the conditional predictive and partial
posterior predictive p-values.

Robins et al. (2000) also explore corrections for when these p-values are
difficult to compute. In Section 4 of their paper, they discuss how to modify
the test statistic for the parametric bootstrap p-value, posterior predictive
p-values, and discrepancy p-values. Modifications are made such that these
are asymptotically uniform. They claim that their approach is successful for the

discrepancy p-value (and the authors derive a test based on this). Note: the
discrepancy p-value can be difficult to calculate for complex models.

4.5 Appendix to Chapter 4 (Done by Rafael Stern)

Added Example for Chapter 4 on March 21, 2013


The following example is an adaptation from Carlos Alberto de Bragança
Pereira (2006).

Consider that U1 , U2 , U3 , U4 are conditionally i.i.d. given θ and such that


U1 ∣θ has Uniform distribution on (θ − 0.5, θ + 0.5). Next, we construct a
confidence interval and a credible interval for θ.

Let’s start with a confidence interval. Let U(1) = min{U1 , U2 , U3 , U4 } and


U(4) = max{U1 , U2 , U3 , U4 }. Let’s prove that (U(1) , U(4) ) is a 87.5% confi-
dence interval for θ.

Consider
P (θ ∉ (U(1) , U(4) )∣θ) = P (U(1) > θ ∪ U(4) < θ∣θ)
= P (U(1) > θ∣θ) + P (U(4) < θ∣θ)
= P (Ui > θ, i = 1, 2, 3, 4∣θ) + P (Ui < θ, i = 1, 2, 3, 4∣θ)
= (0.5)^4 + (0.5)^4 = (0.5)^3 = 0.125.

Hence, P (θ ∈ (U(1) , U(4) )∣θ) = 0.875, which proves that (U(1) , U(4) ) is a
87.5% confidence interval for θ.

Consider that U(1) = 0.1 and that U(4) = 0.9. The 87.5% probability has
to do with the random interval (U(1) , U(4) ) and not with the particular ob-
served value of (0.1, 0.9).

Let’s do some investigative work! Observe that, for every ui , ui > θ − 0.5.
Hence, u(1) > θ − 0.5, that is, θ < u(1) + 0.5. Similarly, θ > u(4) − 0.5. Hence,
θ ∈ (u(4) − 0.5, u(1) + 0.5). Plugging in u(1) = 0.1 and u(4) = 0.9, obtain
θ ∈ (0.4, 0.6). That is, even though the observed 87.5% confidence interval
is (0.1, 0.9), we know that θ ∈ (0.4, 0.6) with certainty.

Let’s now compute a 87.5% centered credible interval. This depends on the
prior for θ. Consider the improper prior p(θ) = 1, θ ∈ R. Observe that:

P (θ∣u1 , u2 , u3 , u4 ) ∝ P (θ)P (u1 , u2 , u3 , u4 ∣θ)
= ∏_{i=1}^{4} I(ui )(θ−0.5,θ+0.5)
= I(θ)(u(4) −0.5,u(1) +0.5) .

That is, θ∣u has a Uniform distribution on (u(4) − 0.5, u(1) + 0.5). Let
a = u(4) − 0.5 and b = u(1) + 0.5. The centered 87.5% credible interval is
(l, u) such that ∫_a^l [1/(b − a)] dx = 2^{−4} and ∫_u^b [1/(b − a)] dx = 2^{−4} .
Hence, l = a + (b − a)/2^4 and u = b − (b − a)/2^4 . Observe that this interval
is always a subset of (a, b), which we know contains θ for sure.

Does (U(4) − 0.5 + [1 − (U(4) − U(1) )]/2^4 , U(1) + 0.5 − [1 − (U(4) − U(1) )]/2^4 )
have any confidence guarantees? Before getting into troublesome calculations,
we can check this through simulations.

The following R code generates a barplot for how often the credible interval
captures the correct parameter given different parameter values.

sim_capture_theta <- function(theta,nsim) {


samples <- runif(4*nsim,theta-0.5,theta+0.5)
dim(samples) <- c(nsim,4)

success <- function(sample)


{
aa <- min(sample)
bb <- max(sample)
return((theta > aa + (bb-aa)/16) && (theta < bb - (bb-aa)/16))
}

return(mean(apply(samples,1,success)))
}

capture_frequency <- sapply(0:1000, function(ii){sim_capture_theta(ii,1000)})


barplot(capture_frequency,
main="Coverage of credible interval for parameter in 0,1,...,1000")
abline(h=0.825,lty=2)
abline(h=0.85,lty=2)
abline(h=0.875,lty=2)

Figure 4.2: Coverage of the credible interval for parameters in 0, 1, . . . , 1000.

The result in Figure 4.2 shows that, in this case, the coverage of the credible
interval seems to be uniform on the parameter space. This is not guaranteed
to always happen! Also, although we constructed an 87.5% credible interval,
the picture suggests that (U(4) − 0.5 + [1 − (U(4) − U(1) )]/2^4 , U(1) + 0.5 −
[1 − (U(4) − U(1) )]/2^4 ) is somewhere near an 85% confidence interval.

Observe that the wider the gap between U(1) and U(4) , the smaller the
region in which θ can lie. In this sense, it would be nice if the interval were
smaller the larger this gap is. The example shows that there exist both
credible and confidence intervals with this property, but this property isn’t
achieved by guaranteeing confidence alone.
Chapter 5

Monte Carlo Methods

Every time I think I know what’s going on, suddenly there’s another layer
of complications. I just want this damned thing solved.
—John Scalzi, The Last Colony

5.1 A Quick Review of Monte Carlo Methods

One motivation for Monte Carlo methods is to approximate an integral


of the form ∫X h(x)f (x) dx that is intractable, where f is a probability
density. You might wonder why we wouldn’t just use numerical integration
techniques. There are a few reasons:

• The most serious problem is the so-called “curse of dimensionality.”
Suppose we have a p-dimensional integral. Numerical integration typ-
ically entails evaluating the integrand over some grid of points. How-
ever, if p is even moderately large, then any reasonably fine grid will
contain an impractically large number of points. For example, if p = 6,
then a grid with just ten points in each dimension—already too coarse
for any sensible amount of precision—will consist of 10^6 points. If
p = 50, then even an absurdly coarse grid with just two points in each
dimension will consist of 2^50 points (note that 2^50 > 10^15 ).

• There can still be problems even when the dimensionality is small.


There are functions in R called area (in the MASS package) and integrate;
however, area cannot deal with infinite bounds in the integral, and even
though integrate can handle infinite bounds, it is fragile and often produces
output that’s not trustworthy (Robert and Casella, 2010).

◯ Classical Monte Carlo Integration

The generic problem here is to evaluate Ef [h(x)] = ∫X h(x)f (x) dx. The
classical way to solve this is to generate a sample (X1 , . . . , Xn ) from f and
propose as an approximation the empirical average

h̄n = (1/n) ∑_{j=1}^{n} h(xj ).

Why? It can be shown that h̄n converges a.s. (i.e. for almost every generated
sequence) to Ef [h(X)] by the Strong Law of Large Numbers.

Also, under certain assumptions (which we won’t get into; see Casella and
Robert, page 65, for details), the asymptotic variance can be approximated
and then estimated from the sample (X1 , . . . , Xn ) by

vn = (1/n²) ∑_{j=1}^{n} [h(xj ) − h̄n ]² .

Finally, by the CLT (for large n),

(h̄n − Ef [h(X)]) / √vn ∼ N (0, 1), approximately.

There are examples in Casella and Robert (2010) along with R code for those
that haven’t seen these methods before or want to review them.
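For a self-contained illustration, here is a minimal sketch taking h(x) = x²
with f the standard normal density, so that the true value is Ef [h(X)] = 1:

set.seed(1)
n <- 1e5
x <- rnorm(n)                     # sample from f
h <- x^2
hbar <- mean(h)                   # the Monte Carlo estimate
vn <- sum((h - hbar)^2)/n^2       # the variance estimate v_n above
c(estimate = hbar, std.error = sqrt(vn))   # CLT: estimate +/- 2*std.error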

◯ Importance Sampling

Importance sampling involves generating random variables from a different
distribution and then reweighting the output. Its name comes from the fact
that the new distribution is chosen to give greater mass to regions where h
is large (the important part of the space).

Let g be an arbitrary density function; then we can write

I = Ef [h(x)] = ∫X h(x) [f (x)/g(x)] g(x) dx = Eg [h(x)f (x)/g(x)].    (5.1)
This is estimated by

Î = (1/n) ∑_{j=1}^{n} h(Xj ) f (Xj )/g(Xj ) Ð→ Ef [h(X)]    (5.2)

based on a sample generated from g (not f ). Since (5.1) can be written
as an expectation under g, (5.2) converges to (5.1) for the same reason the
Monte Carlo estimator h̄n converges.
Remark: Calculating the variance of Î, we find

Var(Î) = (1/n²) ∑i Var(h(Xi )f (Xi )/g(Xi )) = (1/n) Var(h(Xi )f (Xi )/g(Xi )),

which suggests the estimate

V̂ar(Î) = (1/n) V̂ar(h(Xi )f (Xi )/g(Xi )).
Example 5.1: Suppose we want to estimate P (X > 5), where X ∼ N (0, 1).

Naive method: Generate n iid standard normals and use the proportion p̂
that are larger than 5.

Importance sampling: We will sample from a distribution that gives high


probability to the “important region” (the set (5, ∞)) and then reweight.

Solution: Let φo and φθ be the densities of the N (0, 1) and N (θ, 1) distri-
butions (θ taken around 5 will work). We have

p = ∫ I(u > 5)φo (u) du = ∫ [I(u > 5) φo (u)/φθ (u)] φθ (u) du.

In other words, if

h(u) = I(u > 5) φo (u)/φθ (u),

then p = Eφθ [h(X)]. If X1 , . . . , Xn ∼ N (θ, 1), then an unbiased estimate is
p̂ = (1/n) ∑i h(Xi ).

We implement this in R as follows:

1 - pnorm(5) # gives 2.866516e-07

# Naive method
set.seed(1)
ss <- 100000
x <- rnorm(n=ss)
phat <- sum(x>5)/length(x)
sdphat <- sqrt(phat*(1-phat)/length(x)) # gives 0

# IS method

set.seed(1)
y <- rnorm(n=ss, mean=5)
h <- dnorm(y, mean=0)/dnorm(y, mean=5) * I(y>5)
mean(h) # gives 2.865596e-07
sd(h)/sqrt(length(h)) # gives 2.157211e-09

Example 5.2: Let f (x) be the pdf of a N (0, 1). Assume we want to
compute

a = ∫_{−1}^{1} f (x) dx.

We can use importance sampling to do this calculation. Let g(x) be an
arbitrary pdf, and write

a = ∫_{−1}^{1} [f (x)/g(x)] g(x) dx.

We want to be able to draw Y ∼ g easily. But how should we go about
choosing g?

• Note that if Y ∼ g, then a = E[I[−1,1] (Y ) f (Y )/g(Y )].

• The variance of I[−1,1] (Y ) f (Y )/g(Y ) is minimized by picking g ∝
I[−1,1] (x)f (x). Nevertheless, simulating from this g is usually expensive.

• Some g’s which are easy to simulate from are the pdfs of the Uniform(−1, 1),
the Normal(0, 1), and a Cauchy with location parameter 0.

• Below is code showing how to get a sample of I[−1,1] (Y ) f (Y )/g(Y ) for
these distributions:

uniformIS <- function(nn) {
  # weights f(Y)/g(Y) for Y ~ Uniform(-1,1)
  sapply(runif(nn,-1,1),
         function(xx) dnorm(xx,0,1)/dunif(xx,-1,1)) }

cauchyIS <- function(nn) {
  # Y ~ Cauchy(0,1), i.e. a t with 1 df; the indicator zeroes draws outside (-1,1)
  sapply(rt(nn,1),
         function(xx) (xx <= 1)*(xx >= -1)*dnorm(xx,0,1)/dt(xx,1)) }

gaussianIS <- function(nn) {
  # Y ~ N(0,1), so f/g = 1 and only the indicator remains
  sapply(rnorm(nn,0,1),
         function(xx) (xx <= 1)*(xx >= -1)) }

Figure 5.1 presents histograms for a sample of size 1000 from each of these
distributions. The sample variance of I[−1,1] (Y ) f (Y )/g(Y ) was, respectively,
0.009, 0.349, and 0.227 (for the uniform, the Cauchy, and the normal).

• Even though the shape of the uniform distribution is very different
from that of f (x), a standard normal, on (−1, 1), f (x) has a lot of mass
outside of (−1, 1).

• This is why the histograms for the Cauchy and the normal have big
bars at 0 and the variance obtained from the uniform distribution is
the lowest.

• How would these results change if we wanted to compute the integral


over the range (−3, 3) instead of (−1, 1)? This is left as a homework
exercise.

Figure 5.1: Histograms for samples of I[−1,1] (Y ) f (Y )/g(Y ) when g is, respec-
tively, a uniform, a Cauchy, and a normal pdf.

◯ Importance Sampling with unknown normalizing constant

Often we can sample from µ but know the ratio π(x)/µ(x) only up to a
multiplicative constant. A typical example is the Bayesian situation:

• π = νY = posterior density of θ given Y when the prior density is ν.

• µ = λY = posterior density of θ given Y when the prior density is λ.

We want to estimate

π(x)/µ(x) = [cν L(θ)ν(θ)] / [cλ L(θ)λ(θ)] = c ν(θ)/λ(θ) = c ℓ(x),

where ℓ(x) is known and c is unknown.
Remark: we get a ratio of priors.

Then if we’re estimating the expectation of h under π, we find

∫ h(x)π(x) dx = ∫ h(x) c ℓ(x)µ(x) dx
            = ∫ h(x) c ℓ(x)µ(x) dx / ∫ µ(x) dx
            = ∫ h(x) c ℓ(x)µ(x) dx / ∫ c ℓ(x)µ(x) dx
            = ∫ h(x) ℓ(x)µ(x) dx / ∫ ℓ(x)µ(x) dx,

using ∫ µ(x) dx = 1 and ∫ c ℓ(x)µ(x) dx = ∫ π(x) dx = 1. Generate
X1 , . . . , Xn ∼ µ and estimate via

∑i h(Xi ) ℓ(Xi ) / ∑i ℓ(Xi ) = ∑i h(Xi ) [ℓ(Xi ) / ∑j ℓ(Xj )] = ∑i wi h(Xi ),

where wi = ℓ(Xi ) / ∑j ℓ(Xj ) = [ν(θi )/λ(θi )] / ∑j [ν(θj )/λ(θj )].

Motivation
Why the choice above for ℓ(x)? We are just taking a ratio of priors. The
motivation is, for example, the following:

– Suppose our application is to Bayesian statistics, where θ1 , . . . , θn ∼
λY .

– The posterior corresponding to the conjugate prior λ is easy to
deal with.

– Think of π = ν as a complicated prior and µ = λ as a conjugate
prior.

– Then the weights are wi = [ν(θi )/λ(θi )] / ∑j [ν(θj )/λ(θj )].

1. If µ and π (i.e. λ and ν) differ greatly, most of the weight will be
taken up by a few observations, resulting in an unstable estimate.

2. We can get an estimate of the variance of ∑i h(Xi ) ℓ(Xi ) / ∑i ℓ(Xi ),
but we need to use theorems from advanced probability theory (the
Cramer-Wold device and the Multivariate Delta Method). We’ll
skip these details.
3. In the application of Bayesian statistics, the cancellation of a
potentially very complicated likelihood can lead to a great sim-
plification.
4. The original purpose of importance sampling was to sample more
heavily from regions that are important. So, we may do impor-
tance sampling using a density µ because it’s more convenient
than using a density π. (These could also be measures if the den-
sities don’t exist for those taking measure theory).
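Here is a minimal sketch of this prior-swapping trick, under assumed choices:
the complicated prior ν is a Cauchy(0, 1), the convenient conjugate prior λ is
N (0, 1), and the model is Xi ∣θ ∼ N (θ, 1). We draw θ’s from the conjugate
posterior λY and reweight by the prior ratio ν/λ:

set.seed(1)
x <- c(0.2, 1.1, -0.4, 0.8); n <- length(x)     # made-up data
# posterior under the N(0,1) prior is N(n*xbar/(n+1), 1/(n+1))
theta <- rnorm(1e5, n*mean(x)/(n + 1), sqrt(1/(n + 1)))
w <- dcauchy(theta)/dnorm(theta)    # nu(theta_i)/lambda(theta_i), unnormalized
w <- w/sum(w)                       # self-normalized weights w_i
sum(w*theta)                        # approx. posterior mean under the Cauchy prior

Note that the likelihood cancels from the weights, exactly as point 3 above
promises.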

◯ Rejection Sampling

Suppose π is a density on the reals and suppose π(x) = c l(x) where l is


known, c is not known. We are interested in case where π is complicated.
Want to generate X ∼ π.

Motivating idea: look at a very simple case of rejection sampling.

Suppose first that l is bounded and is zero outside of [0, 1]. Suppose also l
is constant on the intervals ((j − 1)/k, j/k), j = 1, . . . , k. Let M be such that
M ≥ l(x) for all x.

For very simple case, consider the following procedure.

1. Generate a point (U1 , U2 ) uniformly at random from the rectangle of


height M sitting on top of the interval [0, 1].

2. If the point is below the graph of the function l, retain U1 . Else, reject
the point and go back to (1).

Remark: This uses the Probability Integral Transformation in reverse: if
U ∼ Uniform(0, 1) and X = F −1 (U ), then X ∼ F .

Remark: Think about what this is doing: we’re generating many draws that
are rejected, which wastes time. Think also about the restriction to [0, 1]
and whether this makes sense.

General Case:

Suppose the density g is such that for some known constant M , M g(x) ≥ l(x)
for all x. Procedure:

1. Generate X ∼ g, and calculate r(X) = l(X) / [M g(X)].

2. Flip a coin with probability of success r(X). If we have a success,
retain X. Else return to (1).

To show that an accepted point has distribution π, let I be the indicator
that the point is accepted. Then

P (I = 1) = ∫ P (I = 1 ∣ X = x)g(x) dx = ∫ {[π(x)/c] / [M g(x)]} g(x) dx = 1/(cM ).

Thus, if gI is the conditional distribution of X given I, we have

gI (x∣I = 1) = {[π(x)/c] / [M g(x)]} g(x) / P (I = 1) = π(x).

Example 5.3: Suppose we want to generate random variables from the


Beta(5.5,5.5) distribution. Note: There are no direct methods for generating
from Beta(a,b) if a,b are not integers.

One possibility is to use a Uniform(0,1) as the trial distribution. A better


idea is to use an approximating normal distribution.

##simple rejection sampler for Beta(5.5,5.5), 3.26.13



a <- 5.5; b <- 5.5


m <- a/(a+b); s <- sqrt((a/(a+b))*(b/(a+b))/(a+b+1))
funct1 <- function(x) {dnorm(x, mean=m, sd=s)}
funct2 <- function(x) {dbeta(x, shape1=a, shape2=b)}

##plotting normal and beta densities


pdf(file = "beta1.pdf", height = 4.5, width = 5)
plot(funct1, from=0, to=1, col="blue", ylab="")
plot(funct2, from=0, to=1, col="red", add=T)
dev.off()

##M=1.3 (this is trial and error to get a good M)


funct1 <- function(x) {1.3*dnorm(x, mean=m, sd=s)}
funct2 <- function(x) {dbeta(x, shape1=a, shape2=b)}
pdf(file = "beta2.pdf", height = 4.5, width = 5)
plot(funct1, from=0, to=1, col="blue", ylab="")
plot(funct2, from=0, to=1, col="red", add=T)
dev.off()

##Doing accept-reject
##substance of code
set.seed(1); nsim <- 1e5
x <- rnorm(n=nsim, mean=m, sd=s)
u <- runif(n=nsim)
ratio <- dbeta(x, shape1=a, shape2=b) /
(1.3*dnorm(x, mean=m, sd=s))
ind <- I(u < ratio)
betas <- x[ind==1]
# as a check to make sure we have enough
length(betas) # gives 76836

funct2 <- function(x) {dbeta(x, shape1=a, shape2=b)}


pdf(file = "beta3.pdf", height = 4.5, width = 5)
plot(density(betas))
plot(funct2, from=0, to=1, col="red", lty=2, add=T)
dev.off()

Figure 5.2: Normal enveloping the Beta density.
Figure 5.3: Naive rejection sampling, M = 1.3.

Figure 5.4: Rejection sampler (density estimate of the accepted draws, N = 76836, bandwidth = 0.01372).



5.2 Introduction to Gibbs and MCMC

The main idea here involves iterative simulation. We sample values on a


random variable from a sequence of distributions that converge as iterations
continue to a target distribution. The simulated values are generated by a
Markov chain whose stationary distribution is the target distribution, i.e.,
the posterior distribution.

Geman and Geman (1984) introduced Gibbs sampling for simulating a mul-
tivariate probability distribution p(x) using a random walk on a vector x,
where p(x) is not necessarily a posterior density.

◯ Markov Chains and Gibbs Samplers

We have a probability distribution π on some space X and we are inter-


ested in estimating π or ∫ h(x)π(x)dx, where h is some function. We are
considering situation where π is analytically intractable.

The Basic idea of MCMC

• Construct a sequence of random variables X1 , X2 , . . . with the property


that the distribution of Xn converges to π as n → ∞.

• If n0 is large, then Xn0 , Xn0 +1 , . . . all have (approximately) the distri-
bution π, and these can be used to estimate π and ∫ h(x)π(x)dx.

Two problems:

1. The distribution of Xn0 , Xn0 +1 , . . . is only approximately π.

2. The random variables Xn0 , Xn0 +1 , . . . are NOT independent; they may
be correlated.

The MCMC Method Setup: We have a probability distribution π which


is analytically intractable. Want to estimate π or ∫ h(x)π(x)dx, where h is
some function.

The MCMC method consists of coming up with a transition probability
function P (x, A) with the property that it has stationary distribution π.

A Markov chain with Markov transition function P (⋅, ⋅) is a sequence of
random variables X1 , X2 , . . . on a measurable space such that:

1. P (Xn+1 ∈ A∣Xn = x) = P (x, A).

2. P (Xn+1 ∈ A∣X1 , X2 , . . . , Xn ) = P (Xn+1 ∈ A∣Xn ).

1.) says that P is a Markov transition function, and 2.) is the Markov
property, which says “where I’m going next only depends on where I am
right now.”

Coming back to the MCMC method, we fix a starting point xo , generate an
observation X1 from P (xo , ⋅), generate an observation X2 from P (X1 , ⋅), etc.
This generates the Markov chain xo = Xo , X1 , X2 , . . . .

If we can show that

sup_{C∈B} ∣P^n (x, C) − π(C)∣ → 0 for all x ∈ X ,

then by running the chain sufficiently long, we succeed in generating
an observation Xn with distribution approximately π.

What is a Markov chain?


Start with a sequence of dependent random variables, {X (t) }. That is we
have the sequence
X (0) , X (1) , . . . , X (t) , . . .
such that the probability distribution of X (t) given all the past variables
only depends on the very last one X (t−1) . This conditional probability is
called the transition kernel or Markov kernel K, i.e.,

X (t+1) ∣X (0) , X (1) , . . . , X (t) ∼ K(X (t) , X (t+1) ).

• For a given Markov kernel K, there may exist a distribution f such
that

∫X K(x, y)f (x) dx = f (y).

• If f satisfies this equation, we call f a stationary distribution of K.
What this means is that if X (t) ∼ f, then X (t+1) ∼ f as well.

The theory of Markov chains provides various results about the existence
and uniqueness of stationary distributions, but such results are beyond the
scope of this course. However, one specific result is that under fairly general
conditions that are typically satisfied in practice, if a stationary distribu-
tion f exists, then f is the limiting distribution of {X (t) } for almost
any initial value or distribution of X (0) . This property is called ergodicity.
From a simulation point of view, it means that if a given kernel K produces
an ergodic Markov chain with stationary distribution f , generating a chain
from this kernel will eventually produce simulations that are approximately
from f.

In particular, a very important result can be derived. For integrable func-
tions h, the standard average

(1/M ) ∑_{t=1}^{M} h(X (t) ) Ð→ Ef [h(X)].

This means that the LLN lies at the basis of Monte Carlo methods which
can be applied in MCMC settings. The result shown above is called the
Ergodic Theorem.

Of course, even in applied settings, it should always be confirmed that the


Markov chain in question behaves as desired before blindly using MCMC

to perform Bayesian calculations. Again, such theoretical verifications are


beyond the scope of this course. Practically speaking, however, the MCMC
methods we will discuss do indeed behave nicely in an extremely wide variety
of problems.

Now we turn to Gibbs. The name Gibbs sampling comes from a paper by
Geman and Geman (1984), which first applied a Gibbs sampler on a Gibbs
random field. The name stuck from there. It’s actually a special case of
something from Markov chain Monte Carlo (MCMC), and more specifically
a method called Metropolis-Hastings, which we will hopefully get to. We’ll
start by studying the simple case of the two-stage sampler and then look at
the multi-stage sampler.

◯ The Two-Stage Gibbs Sampler

The two-stage Gibbs sampler creates a Markov chain from a joint distribu-
tion. Suppose we have two random variables X and Y with joint density
f (x, y). They also have respective conditional densities fY ∣X and fX∣Y . The
two-stage sampler generates a Markov chain {(Xt , Yt )} according to the
following steps:

Algorithm 5.1: Two-stage Gibbs Sampler


Take X0 = x0 . Then for t = 1, 2, . . . , generate

1. Xt ∼ fX∣Y (⋅∣yt−1 )

2. Yt ∼ fY ∣X (⋅∣xt ).

As long as we can write down both conditionals (and simulate from them),
it is easy to implement the algorithm above.

Example 5.4: Bivariate Normal

Consider the bivariate normal model

(X, Y ) ∼ N2 (0, Σ) with Σ = (1 ρ; ρ 1).

Recall the following fact from Casella and Berger (2009): If

(X, Y ) ∼ N2 ((µX , µY ), (σX² ρσX σY ; ρσX σY σY²)),

then

Y ∣X = x ∼ N (µY + ρ (σY /σX )(x − µX ), σY² (1 − ρ²)).
Suppose we calculate the Gibbs sampler just given the starting point (x0 , y0 ).
Since this is a toy example, let’s suppose we only care about X. Note that
we don’t really need both components of the starting point, since if we pick
x0 , we can generate Y0 from fY ∣X (⋅∣x0 ).

We know that Y0 ∼ N (ρx0 , 1 − ρ²) and X1 ∣Y0 = y0 ∼ N (ρy0 , 1 − ρ²). Then

E[X1 ] = E[E[X1 ∣Y0 ]] = E[ρY0 ] = ρ²x0

and

Var[X1 ] = E[Var[X1 ∣Y0 ]] + Var[E[X1 ∣Y0 ]] = (1 − ρ²) + ρ²(1 − ρ²) = 1 − ρ⁴ .

Then

X1 ∼ N (ρ²x0 , 1 − ρ⁴ ).

We want the unconditional distribution of X2 eventually, so we need to
update (X2 , Y2 ), and for that we need Y1 , which we generate from the
conditional given X1 = x1 . Since we only care about X, we can use the
conditional distribution formula to find that Y1 ∣X1 = x1 ∼ N (ρx1 , 1 − ρ²).
Then using iterated expectation and iterated variance, we can show that

X2 ∼ N (ρ⁴ x0 , 1 − ρ⁸ ).

If we keep iterating, we find that

Xn ∼ N (ρ^{2n} x0 , 1 − ρ^{4n} ).

(To see this, iterate a few times and find the pattern.) What happens as
n → ∞? Approximately, Xn ∼ N (0, 1).
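A minimal R sketch of this two-stage sampler (ρ and the deliberately bad
starting point are arbitrary choices):

set.seed(1)
rho <- 0.8; nsim <- 5000
x <- y <- numeric(nsim)
x[1] <- 10                                   # start far from the stationary mean
y[1] <- rnorm(1, rho*x[1], sqrt(1 - rho^2))
for (t in 2:nsim) {
  x[t] <- rnorm(1, rho*y[t-1], sqrt(1 - rho^2))
  y[t] <- rnorm(1, rho*x[t],  sqrt(1 - rho^2))
}
mean(x[-(1:500)]); var(x[-(1:500)])          # roughly 0 and 1, as the theory predicts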

Example 5.5: Binomial-Beta


Suppose X∣θ ∼ Bin(n, θ) and θ ∼ Beta(a, b). Then the joint distribution is

f (x, θ) = (n choose x) [Γ(a + b)/(Γ(a)Γ(b))] θ^{x+a−1} (1 − θ)^{n−x+b−1} .

The distribution of X∣θ is given above, and θ∣X ∼ Beta(x + a, n − x + b).

We can implement the Gibbs sampler in R as



gibbs_beta_bin <- function(nsim, nn, aa, bb)


{
xx <- rep(NA,nsim)
tt <- rep(NA,nsim)

tt[1] <- rbeta(1,aa,bb)


xx[1] <- rbinom(1,nn,tt[1])

for(ii in 2:nsim)
{
tt[ii] <- rbeta(1,aa+xx[ii-1],bb+nn-xx[ii-1])
xx[ii] <- rbinom(1,nn,tt[ii])
}

return(list(beta_bin=xx,beta=tt))
}

Since X has a discrete distribution, we can use a rootogram to check if


the Gibbs sampler performed a good approximation. The rootogram plot
is implemented in the library vcd in R. The following are the commands to
generate this rootogram:

gibbs_sample <- gibbs_beta_bin(5000,15,3,7)

# Density of a beta-binomial distribution with parameters


# nn: sample size of the binomial
# aa: first parameter of the beta
# bb: second parameter of the beta
dbetabi <- function(xx, nn, aa, bb)
{
return(choose(nn,xx)*exp(lgamma(aa+xx)-lgamma(aa)+lgamma(nn-xx+bb)-
lgamma(bb)-lgamma(nn+aa+bb)+lgamma(aa+bb)))
}

#Rootogram for the marginal distribution of X.


library(vcd)
beta_bin_sample <- gibbs_sample$beta_bin
max_observed <- max(beta_bin_sample)
rootogram(table(beta_bin_sample),5000*dbetabi(0:max_observed,15,3,7),
          scale="raw",xlab="X",main="Rootogram for Beta Binomial sample")

Figure 5.5: Rootogram from a Beta-Binomial(15,3,7)

Figure 5.5 presents the rootogram for the Gibbs sample for the Beta-Binomial
distribution. Similarly, Figure 5.6 shows the same for the marginal distri-
bution of θ obtained through the following commands:

#Histogram for the marginal distribution of Theta.


beta_sample <- gibbs_sample$beta
hist(beta_sample,probability=TRUE,xlab=expression(theta),
ylab="Marginal Density", main="Histogram for Beta sample")
curve(dbeta(x,3,7),from=0,to=1,add=TRUE)

Figure 5.6: Histogram for a Beta(3,7)

Example 5.6: Consider the posterior on (θ, σ²) associated with the follow-
ing model:

Xi ∣θ ∼ N (θ, σ²), i = 1, . . . , n,
θ ∼ N (θo , τ²),
σ² ∼ InverseGamma(a, b),

where θo , τ², a, b are known. Recall that p(σ² = x) = [b^a /Γ(a)] e^{−b/x} /x^{a+1} .
The Gibbs sampler for these conditional distributions can be coded in R as
follows:

# gibbs_gaussian: Gibbs sampler for marginal of theta|X=xx and sigma2|X=xx


# when Theta ~ Normal(theta0,tau2) and Sigma2 ~ Inv-Gamma(aa,bb) and
# X|Theta=tt,Sigma2=ss ~ Normal(tt,ss)
#
# returns a list gibbs_sample
# gibbs_sample$theta : sample from the marginal distribution of Theta|X=xx
# gibbs_sample$sigma2: sample from the marginal distribution of Sigma2|X=xx

gibbs_gaussian <- function(nsim,xx,theta0,tau2,aa,bb)


{
nn <- length(xx)
xbar <- mean(xx)
RSS <- sum((xx-xbar)^2)

post_sigma_shape <- aa + nn/2

theta <- rep(NA,nsim)


sigma2 <- rep(NA,nsim)

sigma2[1] <- 1/rgamma(1,shape=aa,rate=bb)


ww <- sigma2[1]/(sigma2[1]+nn*tau2)
theta[1] <- rnorm(1,mean=ww*theta0+(1-ww)*xbar, sd=sqrt(tau2*ww))

for(ii in 2:nsim)
{
new_post_sigma_rate <- (1/2)*(RSS+ nn*(xbar-theta[ii-1])^2) + bb
sigma2[ii] <- 1/rgamma(1,shape=post_sigma_shape,
rate=new_post_sigma_rate)

new_ww <- sigma2[ii]/(sigma2[ii]+nn*tau2)


theta[ii] <- rnorm(1,mean=new_ww*theta0+(1-new_ww)*xbar,
sd=sqrt(tau2*new_ww))
}

return(list(theta=theta,sigma2=sigma2))
}

The histograms in Figure 5.7 for the posterior for θ and σ 2 are obtained as
follows:

library(mcsm)
data(Energy)
gibbs_sample <- gibbs_gaussian(5000,log(Energy[,1]),5,10,3,3)

par(mfrow=c(1,2))
hist(gibbs_sample$theta,xlab=expression(theta~"|X=x"),main="")
hist(sqrt(gibbs_sample$sigma2),xlab=expression(sigma~"|X=x"),main="")

Figure 5.7: Histograms for posterior mean and standard deviation.

◯ The Multistage Gibbs Sampler

There is a natural extension from the two-stage Gibbs sampler to the gen-
eral multistage Gibbs sampler. Suppose that for p > 1, we can write the

random variable X = (X1 , . . . , Xp ), where the Xi ’s are either unidimensional


or multidimensional components. Suppose that we can simulate from corre-
sponding conditional densities f1 , . . . , fp . That is, we can simulate

Xi ∣x1 , . . . , xi−1 , xi+1 , . . . , xp ∼ f (xi ∣x1 , . . . , xi−1 , xi+1 , . . . , xp )

for i = 1, . . . , p. The associated Gibbs sampling algorithm is given by the


following transition from X (t) to X (t+1) ∶
Algorithm 5.2: The Multistage Gibbs sampler

At iteration t = 1, 2, . . ., given x(t−1) = (x1(t−1) , . . . , xp(t−1) ), generate

1. X1(t) ∼ f (x1 ∣x2(t−1) , . . . , xp(t−1) ),

2. X2(t) ∼ f (x2 ∣x1(t) , x3(t−1) , . . . , xp(t−1) ),

⋮

p−1. Xp−1(t) ∼ f (xp−1 ∣x1(t) , . . . , xp−2(t) , xp(t−1) ),

p. Xp(t) ∼ f (xp ∣x1(t) , . . . , xp−1(t) ).

The densities f1 , . . . , fp are called the full conditionals, and a particular fea-
ture of the Gibbs sampler is that these are the only densities used for sim-
ulation. Hence, even for high-dimensional problems, all of the simulations
may be univariate, which is a major advantage.
Example 5.7: (Casella and Robert, p. 207) Consider the following model:

Xij ∣θi , σ² ∼ind N (θi , σ²), 1 ≤ i ≤ k, 1 ≤ j ≤ ni
θi ∣µ, τ² ∼iid N (µ, τ²)
µ∣σµ² ∼ N (µ0 , σµ²)
σ² ∼ IG(a1 , b1 )
τ² ∼ IG(a2 , b2 )
σµ² ∼ IG(a3 , b3 )

The conditional independencies in this example can be visualized by the
Bayesian Network in Figure 5.8. Using these conditional independencies, we
can compute the complete conditional distributions for each of the variables
as

θi ∼ N ( [σ²/(σ² + ni τ²)] µ + [ni τ²/(σ² + ni τ²)] X̄i , σ²τ²/(σ² + ni τ²) ),

µ ∼ N ( [τ²/(τ² + kσµ²)] µ0 + [kσµ²/(τ² + kσµ²)] θ̄, σµ²τ²/(τ² + kσµ²) ),

σ² ∼ IG ( ∑i ni /2 + a1 , (1/2) ∑i,j (Xi,j − θi )² + b1 ),

τ² ∼ IG ( k/2 + a2 , (1/2) ∑i (θi − µ)² + b2 ),

σµ² ∼ IG ( 1/2 + a3 , (1/2)(µ − µ0 )² + b3 ),

where θ̄ = ∑i ni θi / ∑i ni .

Running the chain with µ0 = 5 and a1 = a2 = a3 = b1 = b2 = b3 = 3 and chain


size 5000, we get the histograms in Figure 5.9.
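The following R sketch implements these full conditionals directly; the data
are simulated, and the dimensions (k and the ni ’s) are assumptions made
purely for illustration:

set.seed(1)
k <- 5; ni <- rep(10, k); mu0 <- 5
a1 <- a2 <- a3 <- b1 <- b2 <- b3 <- 3
X <- matrix(rnorm(sum(ni), mean = 7), nrow = k)   # fake data, one row per group
nsim <- 5000
theta <- matrix(NA, nsim, k)
mu <- s2 <- t2 <- s2mu <- numeric(nsim)
mu[1] <- mean(X); s2[1] <- t2[1] <- s2mu[1] <- 1; theta[1,] <- rowMeans(X)
for (g in 2:nsim) {
  # theta_i | rest
  v <- s2[g-1]*t2[g-1]/(s2[g-1] + ni*t2[g-1])
  m <- (s2[g-1]*mu[g-1] + ni*t2[g-1]*rowMeans(X))/(s2[g-1] + ni*t2[g-1])
  theta[g,] <- rnorm(k, m, sqrt(v))
  # mu | rest, with theta_bar as defined above
  tbar <- sum(ni*theta[g,])/sum(ni)
  vmu <- s2mu[g-1]*t2[g-1]/(t2[g-1] + k*s2mu[g-1])
  mmu <- (t2[g-1]*mu0 + k*s2mu[g-1]*tbar)/(t2[g-1] + k*s2mu[g-1])
  mu[g] <- rnorm(1, mmu, sqrt(vmu))
  # variance components | rest
  s2[g]   <- 1/rgamma(1, sum(ni)/2 + a1, rate = sum((X - theta[g,])^2)/2 + b1)
  t2[g]   <- 1/rgamma(1, k/2 + a2, rate = sum((theta[g,] - mu[g])^2)/2 + b2)
  s2mu[g] <- 1/rgamma(1, 1/2 + a3, rate = (mu[g] - mu0)^2/2 + b3)
}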

◯ Application of the GS to latent variable models

We give an example of Gibbs sampling applied to a data augmentation
problem arising in a genetic linkage analysis. This example is

Figure 5.8: Bayesian Network for Example 5.7.

given in Rao (1973, pp. 368-369), where it is analyzed in a frequentist setting;


it was re-analyzed in Dempster, Laird and Rubin (1977), and re-analyzed in
a Bayesian framework in Tanner and Wong (1987).

Example 5.8: A genetic model specifies that 197 animals are distributed
multinomially into four categories, with cell probabilities given by

π = (1/2 + θ/4, (1 − θ)/4, (1 − θ)/4, θ/4).    (∗)

The actual observations are y = (125, 18, 20, 34). We want to estimate θ.

Biological basis for model:

Suppose we have two factors, call them α and β (say eye color and leg
length).

• Each comes at two levels: α comes in levels A and a, and β comes in


levels B and b.

• Suppose A is dominant, a is recessive; also B dominant, b recessive.

• Suppose further that P (A) = 1/2 = P (a) [and similarly for the other
factor].
Figure 5.9: Histograms for posterior quantities.

• Now suppose that the two factors are related: P (B∣A) = 1 − η and
P (b∣A) = η.

• Similarly, P (B∣a) = η and P (b∣a) = 1 − η.

To calculate probability of the phenotypes AB, Ab, aB and ab in an offspring


(phenotype is what we actually see in the offspring), we suppose that mother
and father are chosen independently from the population, and make follow-
ing table, involving the genotypes (genotype is what is actually in the genes,
and this is not seen).

Then

P (Father is AB) = P (A)P (B∣A) = (1/2)(1 − η),
P (Mother is AB) = P (A)P (B∣A) = (1/2)(1 − η),
P (Father is AB and Mother is AB) = (1/4)(1 − η)² .

Note: η = 1/2 means no linkage and people like to estimate η.

Table 5.1: Genotype probabilities for the (Father, Mother) pairs

        AB             Ab            aB            ab
AB   (1/4)(1−η)²   (1/4)(1−η)η   (1/4)(1−η)η   (1/4)(1−η)²
Ab   (1/4)(1−η)η   (1/4)η²       (1/4)η²       (1/4)(1−η)η
aB   (1/4)(1−η)η   (1/4)η²       (1/4)η²       (1/4)(1−η)η
ab   (1/4)(1−η)²   (1/4)(1−η)η   (1/4)(1−η)η   (1/4)(1−η)²

There are 9 cases where we would see the phenotype AB, and adding up
their probabilities, we get (3 − 2η + η²)/4. You can find similar probabilities
for the other phenotypes. Writing

(3 − 2η + η²)/4 = 1/2 + (1 − 2η + η²)/4

and letting θ = (1 − η)² , we find the model specified in (∗).

What now?
Suppose we put the prior Beta(a, b) on θ. How do we get the posterior?

Here is one method, using the Gibbs sampler.

Split the first cell into two cells, one with probability 1/2, the other with
probability θ/4.

Augment the data into a 5-category multinomial, call it X = (X1 , . . . , X5 ),
where X1 counts the animals falling in the new probability-1/2 cell. We will
run a Gibbs sampler of length 2:

• The conditional distribution of X1 ∣ θ (and given the data) is
Bin(125, (1/2)/(1/2 + θ/4)).

• Given the data and conditional on X1 , the model is simply a binomial
with n = 197 − X1 and probability of success θ, with data consisting of
(125 − X1 + X5 ) successes and (X3 + X4 ) failures.

• Thus, the conditional distribution of θ ∣ X1 and the data is

Beta(a + 125 − X1 + X5 , b + X3 + X4 ).

R Code to implement G.S.:

set.seed(1)
a <- 1; b <- 1
z <- c(125,18,20,34)
x <- c(z[1]/2, z[1]/2, z[2:4])
nsim <- 50000 # runs in about 2 seconds on 3.8GHz P4
theta <- rep(a/(a+b), nsim)
for (j in 1:nsim)
{
theta[j] <- rbeta(n=1, shape1=a+125-x[1]+x[5],
shape2=b+x[3]+x[4])
x[1] <- rbinom(n=1, z[1], (2/(2+theta[j])))
}
mean(theta) # gives 0.623
pdf(file="post-dist-theta.pdf",
horiz=F, height=5.0, width=5.0)
plot(density(theta), xlab=expression(theta), ylab="",
main=expression(paste("Post Dist of ", theta)))
dev.off()
eta <- 1 - sqrt(theta) # Variable of actual interest
plot(density(eta))

sum(eta > .4)/nsim # gives 0

Figure 5.10: Posterior Distribution of θ for Genetic Linkage



5.3 MCMC Diagnostics

We will want to check any chain that we run to assess any lack of conver-
gence.

The adequate length of a run will depend on

• a burn-in period (debatable topic).

• mixing rate.

• variance of quantity we are monitoring.

Quick checks:

• trace plots: a time series plot of the parameters of interest; indicates
how quickly the chain is mixing or its failure to mix.

• Autocorrelations plots.

• Plots of log posterior densities – used mostly in high dimensional prob-


lems.

• Multiple starting points – diagnostic to attempt to handle problems


when we obtain different estimates when we start with multiple (dif-
ferent) starting values.

Definition: An autocorrelation plot graphically measures the correlation


between Xi and each Xk+i variable in the chain.

• The Lag-k correlation is the Corr(Xi , Xk+i ).

• By looking at autocorrelation plots of parameters that we are inter-


ested in, we can decide how much to thin or subsample our chain by.

• Then rerun Gibbs sampler using new thin value.

For a real data example that I’m working on:



Figure 5.11: Trace Plot for RL Example



Figure 5.12: Max Autocorrelation Plot for RL Example

Multiple Starting Points: Can help determine if burn-in is long enough.

• Basic idea: want to estimate the mean of a parameter θ.

• Run chain 1 starting at xo . Estimate the mean to be 10 ± 0.1.

• Run chain 2 starting at x1 . Estimate the mean to be 11 ± 0.1.

• Then we know that the effect of the starting point hasn’t been forgot-
ten.

• Maybe the chain hasn’t reached the area of high probability yet and
need to be run for longer?

• Try running multiple chains.

Gelman-Rubin

• Idea is that if we run several chains, the behavior of the chains should
be basically the same.

• Check informally using trace plots.

• Check using the Gelman-Rubin diagnostic – but can fail like any test.

• Suggestions – Geweke – more robust when normality fails.



5.4 Theory and Application Based Example

◯ PlA2 Example

Twelve studies run to investigate potential link between presence of a certain


genetic trait and risk of heart attack. Each was case-control, and considered
a group of individuals with coronary heart disease and another group with
no history of heart disease. For each study i (i = 1, . . . , 12) the proportion
having the genetic trait in each group was noted and a log odds ratio ψ̂i was
calculated, together with a standard error σi . Results are summarized in
table below (data from Burr et al. 2003).

i 1 2 3 4 5 6
ψ̂i 1.06 -0.10 0.62 0.02 1.07 -0.02
σi 0.37 0.11 0.22 0.11 0.12 0.12

i 7 8 9 10 11 12
ψ̂i -0.12 -0.38 0.51 0.00 0.38 0.40
σi 0.22 0.23 0.18 0.32 0.20 0.25

Setup:

• Twelve studies were run to investigate the potential link between pres-
ence of a certain genetic trait and risk of heart attack.

• Each study was case-control and considered a group of individuals with


coronary heart disease and another group with no history of coronary
heart disease.

• For each study i (i = 1, ⋯, 12) the proportion having the genetic trait
in each group was recorded.

• For each study, a log odds ratio, ψ̂i , and standard error, σi , were
calculated.

Let ψi represent the true log odds ratio for study i. Then a typical hierar-
chical model would look like:

ψ̂i ∣ ψi ∼ind N (ψi , σi²), i = 1, . . . , 12
ψi ∣ µ, τ ∼iid N (µ, τ²), i = 1, . . . , 12
(µ, τ ) ∼ ν.

From this, the likelihood is

L(µ, τ ) = ∫ ⋯ ∫ ∏_{i=1}^{12} Nψi ,σi (ψ̂i ) ∏_{i=1}^{12} Nµ,τ (ψi ) dψ1 . . . dψ12 .

The posterior can be written (as long as µ and τ have densities) as

π(µ, τ ∣ ψ̂) = c^{−1} [∫ ⋯ ∫ ∏_{i=1}^{12} Nψi ,σi (ψ̂i ) ∏_{i=1}^{12} Nµ,τ (ψi ) dψ1 . . . dψ12 ] p(µ, τ ).

Suppose we take ν = the “Normal/Inverse Gamma prior.” Then, conditional
on τ, µ ∼ N (c, dτ²) and γ = 1/τ² ∼ Gamma(a, b).

Remark: The reason for taking this prior is that it is conjugate for the
normal distribution with both mean and variance unknown (that is, it is
conjugate for the model in which the ψi ’s are observed).

We will use the notation NIG(a, b, c, d) to denote this prior. Taking a =


.1, b = .1, c = 0, and d = 1000 gives a flat prior.

• If we are frequentists, then we need to calculate the likelihood

L(µ, τ ) = ∫ ⋯ ∫ ∏_{i=1}^{12} Nψi ,σi (ψ̂i ) ∏_{i=1}^{12} Nµ,τ (ψi ) dψ1 . . . dψ12 .

• If we are Bayesians, we need to calculate the likelihood and, in addition,
the normalizing constant in order to find the posterior

π(µ, τ ∣ ψ̂) = L(µ, τ )p(µ, τ ) / ∫ L(µ, τ )p(µ, τ ) dµ dτ.

• Neither of the above is easy to do.

We have a choice:

• Select a model that doesn’t fit the data well but gives answers that
are easy to obtain, i.e. in closed form.

• Select a model that is appropriate for the data but is computationally
difficult to deal with.

MCMC methods often allow us (in many cases) to make the second choice.

Going back to the example and fitting a model:

Recall the general model:


ψ̂i ∣ ψi ∼ind N (ψi , σi²), i = 1, . . . , 12
ψi ∣ µ, τ ∼iid N (µ, τ²), i = 1, . . . , 12
(µ, τ ) ∼ NIG(a, b, c, d).

Then the posterior of (µ, τ ) is NIG(a′ , b′ , c′ , d′ ), with

a′ = a + n/2,   b′ = b + (1/2) ∑i (Xi − X̄)² + n(X̄ − c)² / [2(1 + nd)],

and

c′ = (c + ndX̄)/(nd + 1),   d′ = 1/(n + d^{−1} ).

This means that

µ ∣ τ², y ∼ N (c′ , d′ τ²)

and

τ² ∣ y ∼ InverseGamma(a′ , b′ ).

Implementing the Gibbs sampler:

Want the posterior distribution of (µ, τ ).

• In order to clarify what we are doing, we use the notation that sub-
scripting a distribution by a random variable denotes conditioning.

• Thus, if U and V are two random variables, L(U ∣V ) and LV (U ) will


both denote the conditional distribution of U given V.

We want to find Lψ̂ (µ, τ, ψ) . We’ll run a Gibbs sampler of length 2:



• Given (µ, τ ), the ψ’s are independent. The conditional distribution
of ψi given ψ̂ is the conditional distribution of ψi given only ψ̂i . This
conditional distribution is given by a standard result for the conjugate
normal/normal situation: it is N (µ′ , τ ′² ), where

µ′ = (σi² µ + τ² ψ̂i ) / (σi² + τ²),   τ ′² = σi² τ² / (σi² + τ²).

• Given the ψ’s, the data are superfluous, i.e. Lψ̂ (µ, τ ∣ ψ) = L(µ, τ ∣ ψ).
This conditional distribution is given by the conjugacy of the Normal/
Inverse Gamma prior: L(µ, τ ∣ ψ) = NIG(a′ , b′ , c′ , d′ ), where

a′ = a + n/2,   b′ = b + (1/2) ∑i (ψi − ψ̄)² + n(ψ̄ − c)² / [2(1 + nd)],

and

c′ = (c + ndψ̄)/(nd + 1),   d′ = 1/(n + d^{−1} ).

This gives us a sequence (µ(g) , τ (g) , ψ1(g) , . . . , ψn(g) ), g = 1, . . . , G, from
Lψ̂ (µ, τ, ψ). If we’re interested in, e.g., the posterior distribution of µ, we
just retain the first coordinate in the sequence.
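Before turning to the BUGS/JAGS implementation below, here is a minimal
R sketch of this two-block Gibbs sampler using the PlA2 data and the flat
NIG(0.1, 0.1, 0, 1000) prior (it is an illustration, not the JAGS run analyzed
later):

psihat <- c(1.055,-0.097,0.626,0.017,1.068,-0.025,
            -0.117,-0.381,0.507,0,0.385,0.405)
sigma  <- c(0.373,0.116,0.229,0.117,0.471,0.120,
            0.220,0.239,0.186,0.328,0.206,0.254)
a <- 0.1; b <- 0.1; cc <- 0; d <- 1000; n <- length(psihat)
G <- 10000; mu <- tau2 <- numeric(G); mu[1] <- 0; tau2[1] <- 1
set.seed(1)
for (g in 2:G) {
  # psi | mu, tau2, psihat (independent normals)
  v   <- sigma^2*tau2[g-1]/(sigma^2 + tau2[g-1])
  m   <- (sigma^2*mu[g-1] + tau2[g-1]*psihat)/(sigma^2 + tau2[g-1])
  psi <- rnorm(n, m, sqrt(v))
  # (mu, tau2) | psi ~ NIG(a', b', c', d')
  pbar <- mean(psi)
  ap <- a + n/2
  bp <- b + sum((psi - pbar)^2)/2 + n*(pbar - cc)^2/(2*(1 + n*d))
  cp <- (cc + n*d*pbar)/(n*d + 1)
  dp <- 1/(n + 1/d)
  tau2[g] <- 1/rgamma(1, ap, rate = bp)
  mu[g]   <- rnorm(1, cp, sqrt(dp*tau2[g]))
}
mean(mu[-(1:1000)])   # compare with the posterior summary reported below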

Specific Example for PlA2 data


Our proposed hierarchical model is
ψ̂i ∣ ψi ∼ind N (ψi , σi²), i = 1, ⋯, 12
ψi ∣ µ, τ² ∼iid N (µ, τ²), i = 1, ⋯, 12
µ ∣ τ² ∼ N (0, 1000τ²)
γ = 1/τ² ∼ Gamma(0.1, 0.1)
Why is a normal prior taken? It’s conjugate for the normal distribution
with the mean and variance known. The two priors above with the chosen
hyperparameters result in noninformative hyperpriors.


The file model.txt contains

model {
for (i in 1:N) {
psihat[i] ~ dnorm(psi[i],1/(sigma[i])^2)
psi[i] ~ dnorm(mu,1/tau^2)
}
mu ~ dnorm(0,1/(1000*tau^2))
tau <- 1/sqrt(gam)
gam ~ dgamma(0.1,0.1)
}

Note: In BUGS, use dnorm(mean,precision), where precision = 1/variance.


The file data.txt contains

"N" <- 12
"psihat" <- c(1.055, -0.097, 0.626, 0.017, 1.068,
-0.025, -0.117, -0.381, 0.507, 0, 0.385, 0.405)
"sigma" <- c(0.373, 0.116, 0.229, 0.117, 0.471,
0.120, 0.220, 0.239, 0.186, 0.328, 0.206, 0.254)

The file inits1.txt contains

".RNG.name" <- "base::Super-Duper"


".RNG.seed" <- 12
"psi" <- c(0,0,0,0,0,0,0,0,0,0,0,0)
"mu" <- 0
"gam" <- 1


The file script.txt contains


model clear
data clear
model in "model"
data in "data"
compile, nchains(2)
inits in "inits1", chain(1)
inits in "inits2", chain(2)
initialize
update 10000
monitor mu
monitor psi
monitor gam
update 100000
coda *, stem(CODA1)
coda *, stem(CODA2)

Now, we read in the coda files into R from the current directory and continue
our analysis. The first part of our analysis will consist of some diagnostic
procedures.

We will consider

• Autocorrelation Plots

• Trace Plots

• Gelman-Rubin Diagnostic

• Geweke Diagnostic

Definition: An autocorrelation plot graphically measures the correlation


between Xi and each Xk+i variable in the chain.

• The Lag-k correlation is the Corr(Xi , Xk+i ).



• By looking at autocorrelation plots of parameters that we are inter-


ested in, we can decide how much to thin or subsample our chain by.
• We can rerun our JAGS script using our thin value.

We take the thin value to be the first lag whose correlation ≤ 0.2. For this
plot, we take a thin of 2. We will go back and rerun our JAGS script and
skip every other value in each chain. After thinning, we will proceed with
other diagnostic procedures of interest.
(Autocorrelation plot for µ: Correlation against Lag, 0 to 50.)

The file script_thin.txt contains
\begin{verbatim}
model clear
data clear
model in "model"
data in "data"
compile, nchains(2)
inits in "inits1", chain(1)
inits in "inits2", chain(2)
initialize
update 10000
5.4 Theory and Application Based Example 131

monitor mu, thin(6)


monitor psi, thin(6)
monitor gam, thin(6)
update 100000
coda *, stem(CODA1_thin)
coda *, stem(CODA2_thin)

Definition: A trace plot is a time series plot of the parameter, say µ, that
we monitor as the Markov chain(s) proceed(s).
(Trace plot of µ against iteration, 1 to 2000.)

Definition: The Gelman-Rubin diagnostic tests that burn-in is adequate


and requires that multiple starting points be used.

To compute the G-R statistic, we must

• Run two chains in JAGS using two different sets of initial values (and
two different seeds).
• Load coda package in R and run gelman.diag(mcmc.list(chain1,chain2)).
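A hedged sketch of these two steps in R (the CODA file names below are
assumptions; check the stems your JAGS run actually wrote):

library(coda)
chain1 <- read.coda("CODA1chain1.txt", "CODA1index.txt")
chain2 <- read.coda("CODA2chain2.txt", "CODA2index.txt")
gelman.diag(mcmc.list(chain1, chain2))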

How do we interpret the Gelman-Rubin Diagnostic?

• If the chain has reached convergence, the G-R test statistic R ≈ 1. We


conclude that burn-in is adequate.

• Values above 1.05 indicate lack of convergence.

Warning: The distribution of R under the null hypothesis is essentially an


F distribution. Recall that the F-test for comparing two variances is not
robust to violations of normality. Thus, we want to be cautious in using the
G-R diagnostic.

Doing this in R for the PlA2 example, we find
Point est. 97.5% quantile
mu 1 1
psi[1] 1 1
psi[2] 1 1
...
psi[11] 1 1
psi[12] 1 1
gam 1 1

Since 1 lies in all of the 95% intervals, we conclude that there is no evidence
of a failure to converge.

Suppose µ is the parameter of interest.

Main Idea: If burn-in is adequate, the mean of the posterior distribution


of µ from the first half of the chain should equal the mean from the second
half of the chain.

To compute the Geweke statistic, we must

• Run a chain in JAGS along with a set of initial values.

• Load the coda package in R and run geweke.diag(mcmc.list(chain)).



• The Geweke statistic asymptotically has a standard normal distribu-


tion, so if the values from R are outside -2.5 or 2.5, this indicates
nonstationarity of chain and that burn-in is not sufficient.

• Using the Geweke diagnostic on the PlA2 data indicates that burn-in
of 10,000 is sufficient (the largest absolute Z-score is 1.75).

• Observe that the Geweke diagnostic does not require multiple starting
points as Gelman-Rubin does.

• The Geweke statistic (based on a T-test) is robust against violations


of normality so the Geweke test is preferred to Gelman-Rubin.

Using Gelman-Rubin and Geweke we have shown that burn-in is “sufficient.”

• We can look at summary statistics such as means, standard errors,


and credible intervals using the summary function.

• We can use kernel density functions in R to estimate posterior distri-


butions that we are interested in using the density function.

Post Mean Post SD Post Naive SE


µ 0.217272 0.127 0.0009834
ψ1 0.594141 0.2883 0.0022334
ψ2 -0.062498 0.1108 0.0008583
ψ3 0.490872 0.2012 0.0015588
ψ4 0.040284 0.1118 0.0008658
ψ5 0.51521 0.3157 0.0024453
ψ6 0.003678 0.114 0.0008831
ψ7 -0.015558 0.1883 0.0014586
ψ8 -0.175852 0.2064 0.0015988
ψ9 0.433525 0.1689 0.0013084
ψ10 0.101912 0.2423 0.0018769
ψ11 0.332775 0.1803 0.0013965
ψ12 0.331466 0.2107 0.0016318
γ 10.465411 6.6611 0.051596

The posterior of µ ∣ data is



Figure 5.13: Posterior of µ ∣ data

Alternatively, we can estimate the conditional distributions of the exp(ψi )’s
given the data. A few are shown below.

• Here we’re looking at the odds ratios comparing the odds of heart
disease for those who have the genetic trait to the odds for those who
don’t. Note that all estimates are pulled toward the mean, showing a
Bayesian Stein effect.

• This is the odds ratio of having a heart attack for those who have the
genetic trait versus those who don’t (looking at study i).

(Estimated posterior densities of exp(ψ1 ), exp(ψ2 ), exp(ψ3 ), and exp(ψ4 ).)

Moreover, we could just as easily have done this analysis in WinBUGS. Below
is the WinBUGS model file:
model{
for (i in 1:N) {
psihat[i] ~ dnorm(psi[i],rho[i])
psi[i] ~ dnorm(mu,gam)
rho[i] <- 1/pow(sigma[i],2)
}

mu ~ dnorm(0,gamt)
gam ~ dgamma(0.1,0.1)
gamt <- gam/1000
}

Finally, we can either run the analysis using WinBUGS or JAGS and R.
I will demonstrate how to do this using JAGS for this example. I have
included the basic code to run this on a Windows machine via WinBUGS.
Both methods yield essentially the same results.

To run WinBUGS within R, you need the following:

• Load the R2WinBUGS library.



• Read in data and format it as a list().

• Format initial values as a list().

• Format the unknown parameters using c().

• Run the bugs() command to open/run WinBUGS.

• Read in the G.S. values using read.coda().

setwd("C:/Documents and Settings/Tina Greenly/Desktop/beka_winbugs/novartis/pla2")
library(R2WinBUGS)
pla2 <- read.table("pla2_data.txt",header=T)
attach(pla2)
names(pla2)
N<-length(psihat)
data <- list("psihat", "sigma", "N")

inits1 <- list(psi = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), mu = 0, gam = 1)


inits2 <- list(psi = c(2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2), mu = 1, gam = 2)
inits = list(inits1, inits2)
parameters <- c("mu", "psi", "gam")
pla2.sim <- bugs(data, inits, parameters,
"pla2.bug", n.chains=2, n.iter = 110000,
codaPkg=T,debug=T,n.burnin = 10000,n.thin=1,bugs.seed=c(12,13),
working.directory="C:/Documents and Settings/Tina Greenly/Desktop/beka_winbugs/novartis/pla2")
detach(pla2)
coda1 = read.coda("coda1.txt","codaIndex.txt")
coda2 = read.coda("coda2.txt","codaIndex.txt")

5.5 Metropolis and Metropolis-Hastings

The Metropolis-Hastings algorithm is a general term for a family of Markov


chain simulation methods that are useful for drawing samples from Bayesian
posterior distributions. The Gibbs sampler can be viewed as a special case

of Metropolis-Hastings (as we will soon see). Here, we present the ba-
sic Metropolis algorithm and its generalization to the Metropolis-Hastings
algorithm, which is often useful in applications (and has many extensions).

Suppose we can sample directly from p(θ ∣ y). Then we could generate

θ(1) , . . . , θ(S) ∼ iid p(θ ∣ y)

and obtain Monte Carlo approximations of posterior quantities:

E[g(θ) ∣ y] ≈ (1/S) ∑_{i=1}^{S} g(θ(i) ).

But what if we cannot sample directly from p(θ ∣ y)? The important concept here is that we can still construct a large collection of θ values (rather than an iid sample, since independence will almost certainly not hold in realistic situations). Thus, for any two different θ values θa and θb , we need

(#θ's in the collection equal to θa ) / (#θ's in the collection equal to θb ) ≈ p(θa ∣ y) / p(θb ∣ y).

How might we intuitively construct such a collection?

• Suppose we have a working collection {θ(1) , . . . , θ(s) } and we want to


add a new value θ(s+1) .
• Consider adding a value θ∗ which is nearby θ(s) .
• Should we include θ∗ or not?
• If p(θ∗ ∣y) > p(θ(s) ∣y), then we want more θ∗ ’s in the set than θ(s) ’s.
• But if p(θ∗ ∣y) < p(θ(s) ∣y), we shouldn’t necessarily include θ∗ .

Based on the above, perhaps our decision to include θ∗ or not should be based upon a comparison of p(θ∗ ∣ y) and p(θ(s) ∣ y). We can do this by computing the ratio

r = p(θ∗ ∣ y) / p(θ(s) ∣ y) = [ p(y ∣ θ∗ ) p(θ∗ ) ] / [ p(y ∣ θ(s) ) p(θ(s) ) ].

Note that the marginal likelihood p(y) cancels in this ratio, so r can be computed even when p(θ ∣ y) is known only up to a normalizing constant.

Having computed r, what should we do next?



• If r > 1 (intuition): Since θ(s) is already in our set, we should include θ∗ as it has a higher posterior probability than θ(s) .

(procedure): Accept θ∗ into our set and let θ(s+1) = θ∗ .

• If r < 1 (intuition): The relative frequency of θ-values in our set equal to θ∗ compared to those equal to θ(s) should be

p(θ∗ ∣ y) / p(θ(s) ∣ y) = r.

This means that for every instance of θ(s) , we should only have a fraction of an instance of a θ∗ value.

(procedure): Set θ(s+1) equal to either θ∗ or θ(s) with probability r and 1 − r, respectively.

This is the basic intuition behind the Metropolis (1953) algorithm. More formally:

• It proceeds by sampling a proposal value θ∗ nearby the current value θ(s) using a symmetric proposal distribution J(θ∗ ∣ θ(s) ).

• What does symmetry mean here? It means that J(θa ∣ θb ) = J(θb ∣ θa ). That is, the probability of proposing θ∗ = θa given that θ(s) = θb is equal to the probability of proposing θ∗ = θb given that θ(s) = θa .

• Symmetric proposals include

J(θ∗ ∣ θ(s) ) = Uniform(θ(s) − δ, θ(s) + δ)

and

J(θ∗ ∣ θ(s) ) = Normal(θ(s) , δ²).

The Metropolis algorithm proceeds as follows:

1. Sample θ∗ ∼ J(θ ∣ θ(s) ).

2. Compute the acceptance ratio

r = p(θ∗ ∣ y) / p(θ(s) ∣ y) = [ p(y ∣ θ∗ ) p(θ∗ ) ] / [ p(y ∣ θ(s) ) p(θ(s) ) ].

3. Let θ(s+1) = θ∗ with probability min(r, 1), and θ(s+1) = θ(s) otherwise.

Remark: Step 3 can be accomplished by sampling u ∼ Uniform(0, 1) and


setting θ(s+1) = θ∗ if u < r and setting θ(s+1) = θ(s) otherwise.

Example 5.9: Metropolis for Normal-Normal


Let’s test out the Metropolis algorithm for the conjugate Normal-Normal
model with a known variance situation.

That is, let

X1 , . . . , Xn ∣ θ ∼ iid Normal(θ, σ²)
θ ∼ Normal(µ, τ²).

Recall that the posterior of θ is Normal(µn , τn²), where

µn = [ (n/σ²) / (n/σ² + 1/τ²) ] x̄ + [ (1/τ²) / (n/σ² + 1/τ²) ] µ

and

τn² = 1 / (n/σ² + 1/τ²).

Suppose (example taken from Hoff, 2009) that σ² = 1, τ² = 10, µ = 5, n = 5, and y = (9.37, 10.18, 9.16, 11.60, 10.33). For these data, µn = 10.03 and τn² = 0.20.
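
As a quick sanity check, these quantities are easy to compute directly in R from the values just stated:

## checking the stated posterior quantities
y  <- c(9.37, 10.18, 9.16, 11.60, 10.33)
s2 <- 1; t2 <- 10; mu <- 5; n <- length(y)
mu.n <- (mean(y)*n/s2 + mu/t2) / (n/s2 + 1/t2)   # 10.027, i.e., about 10.03
t2.n <- 1/(n/s2 + 1/t2)                          # 0.196, i.e., about 0.20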

Suppose that for some ridiculous reason we cannot come up with the posterior distribution and instead we need the Metropolis algorithm to approximate it (please note how incredibly silly this example is; it's just to illustrate the method).

Based on this model and prior, we need to compute the acceptance ratio

r = p(θ∗ ∣ x) / p(θ(s) ∣ x) = [ p(x ∣ θ∗ ) p(θ∗ ) ] / [ p(x ∣ θ(s) ) p(θ(s) ) ]
  = ( ∏i dnorm(xi , θ∗ , σ) / ∏i dnorm(xi , θ(s) , σ) ) × ( dnorm(θ∗ , µ, τ ) / dnorm(θ(s) , µ, τ ) ).

In many cases, computing the ratio r directly can be numerically unstable; this can be remedied by working with log r instead. This results in

log r = ∑i [ log dnorm(xi , θ∗ , σ) − log dnorm(xi , θ(s) , σ) ]
      + [ log dnorm(θ∗ , µ, τ ) − log dnorm(θ(s) , µ, τ ) ].

A proposal is then accepted if log u < log r, where u is sampled from the Uniform(0,1).

The R code below generates 10,000 iterations of the Metropolis algorithm, starting at θ(0) = 0 and using a normal proposal distribution,

θ∗ ∼ Normal(θ(s) , 2).

Below is R-code for running the above model. Figure 5.14 shows a trace plot
for this run as well as a histogram for the Metropolis algorithm compared
with a draw from the true normal density. From the trace plot, although
the value of θ does not start near the posterior mean of 10.03, it quickly
arrives there after just a few iterations. The second plot shows that the em-
pirical distribution of the simulated values is very close to the true posterior
distribution.
Figure 5.14: Results from the Metropolis sampler for the normal model: trace plot of θ by iteration, and histogram of the sampled θ values with the true posterior density overlaid.

## initialing values for normal-normal example and setting seed


# MH algorithm for one-sample normal problem with known variance

s2<-1
t2<-10 ; mu<-5; set.seed(1); n<-5; y<-round(rnorm(n,10,1),2)
mu.n<-( mean(y)*n/s2 + mu/t2 )/( n/s2+1/t2)
t2.n<-1/(n/s2+1/t2)

####metropolis part####
y<-c(9.37, 10.18, 9.16, 11.60, 10.33)
##S = total num of simulations
theta<-0 ; delta<-2 ; S<-10000 ; THETA<-NULL ; set.seed(1)

for(s in 1:S)
{

## simulating our proposal


theta.star<-rnorm(1,theta,sqrt(delta))

##taking the log of the ratio r


log.r<-( sum(dnorm(y,theta.star,sqrt(s2),log=TRUE)) +
dnorm(theta.star,mu,sqrt(t2),log=TRUE) ) -
( sum(dnorm(y,theta,sqrt(s2),log=TRUE)) +
dnorm(theta,mu,sqrt(t2),log=TRUE) )

if(log(runif(1))<log.r) { theta<-theta.star }

##updating THETA

THETA<-c(THETA,theta)
}

##two plots: trace of theta and comparing the empirical distribution


##of simulated values to the true posterior

pdf("metropolis_normal.pdf",family="Times",height=3.5,width=7)
par(mar=c(3,3,1,1),mgp=c(1.75,.75,0))
par(mfrow=c(1,2))

skeep<-seq(10,S,by=10)
plot(skeep,THETA[skeep],type="l",xlab="iteration",ylab=expression(theta))

hist(THETA[-(1:50)],prob=TRUE,main="",xlab=expression(theta),ylab="density")
th<-seq(min(THETA),max(THETA),length=100)
lines(th,dnorm(th,mu.n,sqrt(t2.n)) )
dev.off()

◯ Metropolis-Hastings Algorithm

Recall that a Markov chain is a sequentially generated sequence {x(1) , x(2) , . . .} such that the mechanism that generates x(s+1) can depend on the value of x(s) but not on anything that was in the sequence before it. A better way of putting this: for a Markov chain, the future depends on the present and not on the past.

The Gibbs sampler and the Metropolis algorithm are both ways of generating
Markov chains that approximate a target probability distribution.

We first consider a simple example where our target probability distribution


is po (u, v), a bivariate distribution for two random variables U and V. In the
one-sample normal problem, we would have U = θ, V = σ 2 and po (u, v) =
p(θ, σ 2 ∣y).

What does the Gibbs sampler have us do? It has us iteratively sample values
of U and V from their conditional distributions. That is,

1. update U ∶ sample u(s+1) ∼ po (u ∣ v (s) )

2. update V ∶ sample v (s+1) ∼ po (v ∣ u(s+1) ).

In contrast, Metropolis proposes changes to X = (U, V ) and then accepts or rejects those changes based on po . An alternative way to implement the Metropolis algorithm is to propose and then accept or reject a change to one element at a time:

1. update U ∶

(a) sample u∗ ∼ Ju (u ∣ u(s) )

(b) compute r = po (u∗ , v (s) ) / po (u(s) , v (s) )

(c) set u(s+1) equal to u∗ or u(s) with probability min(1, r) and max(0, 1 − r).

2. update V ∶

(a) sample v ∗ ∼ Jv (v ∣ v (s) )

(b) compute r = po (u(s+1) , v ∗ ) / po (u(s+1) , v (s) )

(c) set v (s+1) equal to v ∗ or v (s) with probability min(1, r) and max(0, 1 − r).

Here, Ju and Jv are separate symmetric proposal distributions for U and V.

• The Metropolis algorithm generates proposals from Ju and Jv

• It accepts them with some probability min(1,r).

• Similarly, each step of Gibbs can be seen as generating a proposal from


a full conditional and then accepting it with probability 1.

• The Metropolis-Hastings (MH) algorithm generalizes both of these


approaches by allowing arbitrary proposal distributions.

• The proposal distributions can be symmetric around the current val-


ues, full conditionals, or something else entirely. A MH algorithm for
approximating po (u, v) runs as follows:

1. update U ∶

(a) sample u∗ ∼ Ju (u ∣ u(s) , v (s) )

(b) compute

r = [ po (u∗ , v (s) ) / po (u(s) , v (s) ) ] × [ Ju (u(s) ∣ u∗ , v (s) ) / Ju (u∗ ∣ u(s) , v (s) ) ]

(c) set u(s+1) equal to u∗ or u(s) with probability min(1, r) and max(0, 1 − r).

2. update V ∶

(a) sample v ∗ ∼ Jv (v ∣ u(s+1) , v (s) )

(b) compute

r = [ po (u(s+1) , v ∗ ) / po (u(s+1) , v (s) ) ] × [ Jv (v (s) ∣ u(s+1) , v ∗ ) / Jv (v ∗ ∣ u(s+1) , v (s) ) ]

(c) set v (s+1) equal to v ∗ or v (s) with probability min(1, r) and max(0, 1 − r).

In the above algorithm, the proposal distributions Ju and Jv are not required to be symmetric. The only requirement is that they not depend on U or V values in our sequence earlier than the most current values. This requirement ensures that the sequence is a Markov chain.

Doesn’t the algorithm above look familiar? Yes, it looks a lot like Metropolis,
except the acceptance ratio r contains an extra factor:

• It contains the ratio of the prob of generating the current value from
the proposed to the prob of generating the proposed from the current.

• This can be viewed as a correction factor.

• If a value u∗ is much more likely to be proposed than the current value u(s) , then we must down-weight the probability of accepting u∗ .

• Otherwise, such a value u∗ will be overrepresented in the chain.
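
To make the correction factor concrete, here is a minimal sketch in R. The Gamma(2,1) target and the asymmetric log-normal random-walk proposal are made up purely for illustration; they are not from the examples in this text.

## MH with an asymmetric proposal: made-up Gamma(2,1) target,
## log-normal random-walk proposal J(theta* | theta)
set.seed(1)
S <- 10000; theta <- 1; THETA <- numeric(S)
for (s in 1:S) {
  theta.star <- rlnorm(1, meanlog = log(theta), sdlog = 0.5)
  log.r <- dgamma(theta.star, shape = 2, rate = 1, log = TRUE) -
           dgamma(theta,      shape = 2, rate = 1, log = TRUE) +
           ## the Hastings correction: log J(theta | theta*) - log J(theta* | theta)
           dlnorm(theta, meanlog = log(theta.star), sdlog = 0.5, log = TRUE) -
           dlnorm(theta.star, meanlog = log(theta), sdlog = 0.5, log = TRUE)
  if (log(runif(1)) < log.r) theta <- theta.star
  THETA[s] <- theta
}
mean(THETA)   # should be near E(theta) = 2 for the Gamma(2,1) target

Dropping the two dlnorm correction terms would leave a plain Metropolis acceptance ratio, which would be wrong for this asymmetric proposal.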

Exercise 1: Show that Metropolis is a special case of MH. Hint: Think about
the jumps J.

Exercise 2: Show that Gibbs is a special case of MH. Hint: Show that r =
1.

Note: The MH algorithm can easily be generalized.


Example 5.10: Poisson Regression We implement the Metropolis algo-
rithm for a Poisson regression model.

• We have a sample from a population of 52 song sparrows that was


studied over the course of a summer and their reproductive activities
were recorded.

• In particular, their age and number of new offspring were recorded for
each sparrow (Arcese et al., 1992).

• A simple probability model to fit the data would be a Poisson regression where Y = number of offspring, conditional on x = age.

Thus, we assume that

Y ∣ θx ∼ Poisson(θx ).

For stability of the model, we assume that the mean number of offspring θx is a smooth function of age. Thus, we express θx = β1 + β2 x + β3 x².

Remark: This parameterization allows some values of θx to be negative, so as an alternative we reparameterize and model the log-mean of Y , so that

log E(Y ∣ x) = log θx = β1 + β2 x + β3 x²,

which implies that

θx = exp(β1 + β2 x + β3 x²) = exp(βᵀx).

Now back to the problem of implementing Metropolis. For this problem, we will write

log E(Yi ∣ xi ) = β1 + β2 xi + β3 xi² = βᵀxi ,

where xi is the age of sparrow i. We will abuse notation slightly and write xi = (1, xi , xi²).

• We will assume the prior on the regression coefficients is iid Nor-


mal(0,100).

• Given a current value β (s) and a value β ∗ generated from J(β ∗ ∣ β (s) ), the acceptance ratio for the Metropolis algorithm is

r = p(β ∗ ∣ X, y) / p(β (s) ∣ X, y)
  = ( ∏_{i=1}^{n} dpois(yi , exp(xiᵀβ ∗ )) / ∏_{i=1}^{n} dpois(yi , exp(xiᵀβ (s) )) ) × ( ∏_{j=1}^{3} dnorm(βj∗ , 0, 10) / ∏_{j=1}^{3} dnorm(βj(s) , 0, 10) ).

• We just need to specify the proposal distribution for β ∗ .

• A convenient choice is a multivariate normal distribution with mean


β (s) .

• In many problems, the posterior variance can be an efficient choice of


a proposal variance. But we don’t know it here.

• However, it’s often sufficient to use a rough approximation. In a


normal regression problem, the posterior variance will be close to
σ 2 (X T X)−1 where σ 2 is the variance of Y.

In our problem, E log Y ≈ βᵀx, so we can try a proposal variance of σ̂²(X ᵀX)⁻¹, where σ̂² is the sample variance of log(y + 1/2).

Remark: Note we add 1/2 because otherwise log 0 is undefined. The code implementing the algorithm is included below.
Figure 5.15: Plot of the Markov chain in β3, along with autocorrelation functions of the full and thinned (every 10th value) chains.

###example 5.10 -- sparrow Poisson regression


yX.sparrow<-dget("http://www.stat.washington.edu/~hoff/Book/Data/data/yX.sparrow")

### sample from the multivariate normal distribution


rmvnorm<-function(n,mu,Sigma)
{
p<-length(mu)
res<-matrix(0,nrow=n,ncol=p)
if( n>0 & p>0 )
{
E<-matrix(rnorm(n*p),n,p)
res<-t( t(E%*%chol(Sigma)) +c(mu))
}

res
}

y<- yX.sparrow[,1]; X<- yX.sparrow[,-1]


n<-length(y) ; p<-dim(X)[2]

pmn.beta<-rep(0,p)
psd.beta<-rep(10,p)

var.prop<- var(log(y+1/2))*solve( t(X)%*%X )


beta<-rep(0,p)
S<-10000
BETA<-matrix(0,nrow=S,ncol=p)
ac<-0
set.seed(1)

for(s in 1:S) {

#propose a new beta

beta.p<- t(rmvnorm(1, beta, var.prop ))

lhr<- sum(dpois(y,exp(X%*%beta.p),log=T)) -
sum(dpois(y,exp(X%*%beta),log=T)) +
sum(dnorm(beta.p,pmn.beta,psd.beta,log=T)) -
sum(dnorm(beta,pmn.beta,psd.beta,log=T))

if( log(runif(1))< lhr ) { beta<-beta.p ; ac<-ac+1 }

BETA[s,]<-beta
}
cat(ac/S,"\n")

#######

library(coda)
apply(BETA,2,effectiveSize)

####
pdf("sparrow_plot1.pdf",family="Times",height=1.75,width=5)
par(mar=c(2.75,2.75,.5,.5),mgp=c(1.7,.7,0))
par(mfrow=c(1,3))
blabs<-c(expression(beta[1]),expression(beta[2]),expression(beta[3]))
thin<-c(1,(1:1000)*(S/1000))
j<-3
plot(thin,BETA[thin,j],type="l",xlab="iteration",ylab=blabs[j])
abline(h=mean(BETA[,j]) )

acf(BETA[,j],ci.col="gray",xlab="lag")
acf(BETA[thin,j],xlab="lag/10",ci.col="gray")
dev.off()
####

◯ Metropolis and Gibbs Combined

In complex models, it is often the case that the conditional distributions are
available for some parameters but not for others. What can we do then?
In these situations we can combine Gibbs and Metropolis-type proposal
distributions to generate a Markov chain to approximate the joint posterior
distribution of all the parameters.

• Here, we look at an example of estimating the parameters in a re-


gression model for time-series data, where the errors are temporally
correlated.

• The full conditionals are available for the regression parameters here,
but not the parameter describing the dependence among the observa-
tions.

Example 5.11: Historical CO2 and temperature data

Analyses of ice cores from East Antarctica have allowed scientists to de-
duce historical atmospheric conditions of law few hundred years (Petit et
al, 1999). Figure 5.18 plots time-series of temperature and carbon dioxide
concentration on a standardized scale (centered and called to have mean of
zero and variance of 1).
5.5 Metropolis and Metropolis-Hastings 150

• The data include 200 values of temperature measured at roughly equal


time intervals, with time between consecutive measurements being
around 2,000 years.

• For each value of temperature there is a CO2 concentration value that


corresponds to data that is about 1,000 years previous to the temper-
ature value (on average).

• Temperature is recorded in terms of its difference from the current temperature in degrees Celsius, and CO2 concentration is recorded in parts per million by volume.

Figure 5.16: Temperature and carbon dioxide data: standardized measurements of temperature and CO2 by year (left), and temperature difference (deg C) versus CO2 (ppmv) (right).

• The plot indicates the temporal histories of temperature and CO2 follow very similar patterns.

• The second plot in Figure 5.16 indicates that CO2 concentration at a given time is predictive of temperature following that time point.

• We can quantify this using a linear regression model for temperature


(Y ) as a function of (CO2 )(x).

• The validity of the standard error relies on the error terms in the regression model being iid, and standard confidence intervals further rely on the errors being normally distributed.

• These two assumptions are examined in the two residual diagnostic plots in Figure 5.17.

Figure 5.17: Residual diagnostics for the least squares fit: histogram of the residuals (left) and their autocorrelation function (right).

• The first plot shows a histogram of the residuals and indicates no serious deviation from normality.

• The second plot gives the autocorrelation function of the residuals,


indicating a nontrivial correlation of 0.52 between residuals at consec-
utive time points.

• Such a positive correlation generally implies there is less information in the data, and less evidence for a relationship between the two variables, than is assumed by the OLS regression analysis.

The ordinary regression model is

Y ∼ N (Xβ, σ 2 I).

The diagnostic plots suggest that a more appropriate model for the ice core
data is one in which the error terms are not independent, but temporally
correlated.

We will replace σ 2 I with a covariance matrix Σ that can represent the


positive correlation between sequential observations. One simple, popular
class of covariance matrices for temporally correlated data are those having
first order autoregressive structure:

Σ = σ² Cρ , where Cρ is the n × n matrix with (i, j) entry ρ^|i−j| :

⎛ 1      ρ      ρ²   ⋯  ρⁿ⁻¹ ⎞
⎜ ρ      1      ρ    ⋯  ρⁿ⁻² ⎟
⎜ ρ²     ρ      1    ⋯       ⎟
⎜ ⋮                  ⋱       ⎟
⎝ ρⁿ⁻¹   ρⁿ⁻²        ⋯   1   ⎠
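
As an aside, Cρ is easy to construct in R; this is exactly the construction used in the code at the end of this example:

## building C_rho via elementwise powers of |i - j|
n <- 5; rho <- 0.5
DY <- abs(outer(1:n, 1:n, "-"))   # matrix with (i,j) entry |i - j|
C.rho <- rho^DY                   # first-order autoregressive correlation matrix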

Under this covariance matrix, the variance of Yi ∣ β, xi is σ² but the correlation between Yi and Yi+t is ρᵗ. Using the multivariate normal and inverse gamma priors (it is left as an exercise to show that),

β ∣ X, y, σ², ρ ∼ N(βn , Σn ),
σ² ∣ X, y, β, ρ ∼ IG((νo + n)/2, [νo σo² + SSRρ ]/2),

where Σn = (X ᵀCρ⁻¹X/σ² + Σo⁻¹)⁻¹, βn = Σn (X ᵀCρ⁻¹y/σ² + Σo⁻¹βo ), and SSRρ = (y − Xβ)ᵀCρ⁻¹(y − Xβ).

• If Σo has large diagonal entries (a diffuse prior), then βn is very close to

(X ᵀCρ⁻¹X)⁻¹X ᵀCρ⁻¹y.

• If ρ were known, this would be the generalized least squares (GLS) estimate of β.

• This is a type of weighted LS estimate that is used when the error terms are not iid. In such situations, both OLS and GLS provide unbiased estimates of β, but GLS has lower variance.

• Bayesian analysis using a model that accounts for correlated errors provides parameter estimates that are similar to those of GLS, so for convenience we will refer to our analysis as “Bayesian GLS.”

If we knew the value of ρ, we could just implement Gibbs to approximate p(β, σ² ∣ X, y, ρ). However, ρ is unknown, and its full conditional distribution is nonstandard for most prior distributions, suggesting that the Gibbs sampler isn't directly applicable. What can we do instead?

We can use the generality of the MH algorithm. Recall that we are allowed to use different proposals at each step, so we can iteratively update β, σ², and ρ at different steps. That is:

• We will make proposals for β and σ² using the full conditionals (Gibbs proposals), and

• make a symmetric proposal for ρ.

• Following the rules of MH, we accept with probability 1 any proposal coming from a full conditional distribution, whereas we have to calculate an acceptance probability for proposals of ρ.

We run the following algorithm:

1. Update β: Sample β (s+1) ∼ N(βn , Σn ), where βn and Σn depend on σ 2(s) and ρ(s) .

2. Update σ²: Sample σ 2(s+1) ∼ IG((νo + n)/2, [νo σo² + SSRρ ]/2), where SSRρ depends on β (s+1) and ρ(s) .

3. Update ρ ∶

(a) Propose ρ∗ ∼ Uniform(ρ(s) − δ, ρ(s) + δ). If ρ∗ < 0, reassign it to be ∣ρ∗ ∣. If ρ∗ > 1, reassign it to be 2 − ρ∗ .

(b) Compute the acceptance ratio

r = [ p(y ∣ X, β (s+1) , σ 2(s+1) , ρ∗ ) / p(y ∣ X, β (s+1) , σ 2(s+1) , ρ(s) ) ] × [ p(ρ∗ ) / p(ρ(s) ) ]

and sample u ∼ Uniform(0, 1). If u < r, set ρ(s+1) = ρ∗ ; otherwise ρ(s+1) = ρ(s) .

The proposal used in Step 3(a) is called a reflecting random walk, which ensures that 0 < ρ < 1. Note that a sequence of MH steps in which each parameter is updated is often referred to as a scan of the algorithm.

Exercise: Show that the proposal is symmetric.

For convenience and ease, we're going to use diffuse priors for the parameters, with βo = 0, Σo = diag(1000), νo = 1, and σo² = 1. Our prior on ρ will be Uniform(0,1). We first run 1,000 iterations of the MH algorithm and show a trace plot of ρ as well as an autocorrelation plot (Figure 5.18).

Suppose now we want to generate 25,000 scans, for a total of 100,000 parameter values. The Markov chain is highly correlated, so we keep only every 25th value in the chain. This thinning reduces the autocorrelation (Figure 5.19).

The Monte Carlo approximation of the posterior density of β2 (the slope) appears in Figure 5.20. The posterior mean is 0.028, with a 95 percent posterior credible interval of (0.01, 0.05), indicating that the relationship between temperature and CO2 is positive. As indicated in the second plot, this relationship seems much weaker than suggested by the OLS estimate of 0.08. For the OLS estimation, the small number of data points with high y-values have a large influence on the estimate of β. On the other hand, the GLS model recognizes that many of these extreme points are highly correlated with one another and down-weights their influence.
Figure 5.18: The first 1,000 values of ρ generated from the Markov chain, along with their autocorrelation function.
Figure 5.19: Every 25th value of ρ generated from the Markov chain of length 25,000, along with the autocorrelation function of the thinned chain.

Remark: this weaker regression coefficient is a result of the temporally correlated data, and not of the particular prior distribution we used or of the Bayesian approach in general.

Exercise: Repeat the analysis with different prior distributions and perform
non-Bayesian GLS for comparison.
Figure 5.20: Posterior distribution of the slope parameter β2 and posterior mean regression line (after generating the Markov chain of length 25,000 with thinning of 25), with the GLS and OLS lines overlaid on the temperature versus CO2 data.

#####
##example 5.11 in notes
# MH and Gibbs problem
##temperature and co2 problem

source("http://www.stat.washington.edu/~hoff/Book/Data/data/chapter10.r")

### sample from the multivariate normal distribution


rmvnorm<-function(n,mu,Sigma)
{

p<-length(mu)
res<-matrix(0,nrow=n,ncol=p)
if( n>0 & p>0 )
{
E<-matrix(rnorm(n*p),n,p)
res<-t( t(E%*%chol(Sigma)) +c(mu))
}
res
}
###

##reading in the data and storing it


dtmp<-as.matrix(read.table("volstok.txt",header=F))
dco2<-as.matrix(read.table("co2.txt",header=F, sep = "\t"))
dtmp[,2]<- -dtmp[,2]
dco2[,2]<- -dco2[,2]
library(nlme)

#### get evenly spaced temperature points


ymin<-max( c(min(dtmp[,2]),min(dco2[,2])))
ymax<-min( c(max(dtmp[,2]),max(dco2[,2])))
n<-200
syear<-seq(ymin,ymax,length=n)
dat<-NULL
for(i in 1:n) {
tmp<-dtmp[ dtmp[,2]>=syear[i] ,]
dat<-rbind(dat, tmp[dim(tmp)[1],c(2,4)] )
}
dat<-as.matrix(dat)
####

####
dct<-NULL
for(i in 1:n) {
xc<-dco2[ dco2[,2] < dat[i,1] ,,drop=FALSE]
xc<-xc[ 1, ]
dct<-rbind(dct, c( xc[c(2,4)], dat[i,] ) )
}

mean( dct[,3]-dct[,1])

dct<-dct[,c(3,2,4)]
colnames(dct)<-c("year","co2","tmp")
rownames(dct)<-NULL
dct<-as.data.frame(dct)

##looking at temporal history of co2 and temperature


########
pdf("temp_co2.pdf",family="Times",height=1.75,width=5)
par(mar=c(2.75,2.75,.5,.5),mgp=c(1.7,.7,0))
layout(matrix( c(1,1,2),nrow=1,ncol=3) )

#plot(dct[,1],qnorm( rank(dct[,3])/(length(dct[,3])+1 )) ,
plot(dct[,1], (dct[,3]-mean(dct[,3]))/sd(dct[,3]) ,
type="l",col="black",
xlab="year",ylab="standardized measurement",ylim=c(-2.5,3))
legend(-115000,3.2,legend=c("temp",expression(CO[2])),bty="n",
lwd=c(2,2),col=c("black","gray"))
lines(dct[,1], (dct[,2]-mean(dct[,2]))/sd(dct[,2]),
#lines(dct[,1],qnorm( rank(dct[,2])/(length(dct[,2])+1 )),
type="l",col="gray")

plot(dct[,2], dct[,3],xlab=expression(paste(CO[2],"(ppmv)")),
ylab="temperature difference (deg C)")
dev.off()
########

##residual analysis for the least squares estimation


########
pdf("residual_analysis.pdf",family="Times",height=3.5,width=7)
par(mar=c(3,3,1,1),mgp=c(1.75,.75,0))
par(mfrow=c(1,2))

lmfit<-lm(dct$tmp~dct$co2)
hist(lmfit$res,main="",xlab="residual",ylab="frequency")
#plot(dct$year, lmfit$res,xlab="year",ylab="residual",type="l" ); abline(h=0)
acf(lmfit$res,ci.col="gray",xlab="lag")
dev.off()

########

##BEGINNING THE GIBBS WITHIN METROPOLIS

######## starting values (DIFFUSE)


n<-dim(dct)[1]
y<-dct[,3]
X<-cbind(rep(1,n),dct[,2])
DY<-abs(outer( (1:n),(1:n) ,"-"))

lmfit<-lm(y~-1+X)
fit.gls <- gls(y~X[,2], correlation=corARMA(p=1), method="ML")
beta<-lmfit$coef
s2<-summary(lmfit)$sigma^2
phi<-acf(lmfit$res,plot=FALSE)$acf[2]
nu0<-1 ; s20<-1 ; T0<-diag(1/1000,nrow=2)
###
set.seed(1)

###number of MH steps
S<-25000 ; odens<-S/1000
OUT<-NULL ; ac<-0 ; par(mfrow=c(1,2))
library(psych)
for(s in 1:S)
{

Cor<-phi^DY ; iCor<-solve(Cor)
V.beta<- solve( t(X)%*%iCor%*%X/s2 + T0)
E.beta<- V.beta%*%( t(X)%*%iCor%*%y/s2 )
beta<-t(rmvnorm(1,E.beta,V.beta) )

s2<-1/rgamma(1,(nu0+n)/2,(nu0*s20+t(y-X%*%beta)%*%iCor%*%(y-X%*%beta)) /2 )

phi.p<-abs(runif(1,phi-.1,phi+.1))
phi.p<- min( phi.p, 2-phi.p)
lr<- -.5*( determinant(phi.p^DY,log=TRUE)$mod -
determinant(phi^DY,log=TRUE)$mod +
tr( (y-X%*%beta)%*%t(y-X%*%beta)%*%(solve(phi.p^DY) -solve(phi^DY)) )/s2 )

if( log(runif(1)) < lr ) { phi<-phi.p ; ac<-ac+1 }



if(s%%odens==0)
{
cat(s,ac/s,beta,s2,phi,"\n") ; OUT<-rbind(OUT,c(beta,s2,phi))
# par(mfrow=c(2,2))
# plot(OUT[,1]) ; abline(h=fit.gls$coef[1])
# plot(OUT[,2]) ; abline(h=fit.gls$coef[2])
# plot(OUT[,3]) ; abline(h=fit.gls$sigma^2)
# plot(OUT[,4]) ; abline(h=.8284)

}
}
#####

OUT.25000<-OUT
library(coda)
apply(OUT,2,effectiveSize )

## load precomputed output (from Hoff's data files), if available
OUT.25000<-dget("data.f10_10.f10_11")
apply(OUT.25000,2,effectiveSize )

pdf("trace_auto_1000.pdf",family="Times",height=3.5,width=7)
par(mar=c(3,3,1,1),mgp=c(1.75,.75,0))
par(mfrow=c(1,2))
## OUT.1000 holds the analogous output from a preliminary run of 1,000 scans
plot(OUT.1000[,4],xlab="scan",ylab=expression(rho),type="l")
acf(OUT.1000[,4],ci.col="gray",xlab="lag")
dev.off()

pdf("trace_thin_25.pdf",family="Times",height=3.5,width=7)
par(mar=c(3,3,1,1),mgp=c(1.75,.75,0))
par(mfrow=c(1,2))
plot(OUT.25000[,4],xlab="scan/25",ylab=expression(rho),type="l")
acf(OUT.25000[,4],ci.col="gray",xlab="lag/25")
dev.off()

pdf("fig10_11.pdf",family="Times",height=3.5,width=7)
par(mar=c(3,3,1,1),mgp=c(1.75,.75,0))

par(mfrow=c(1,2))

plot(density(OUT.25000[,2],adj=2),xlab=expression(beta[2]),
ylab="posterior marginal density",main="")

plot(y~X[,2],xlab=expression(CO[2]),ylab="temperature")
abline(mean(OUT.25000[,1]),mean(OUT.25000[,2]),lwd=2)
abline(lmfit$coef,col="gray",lwd=2)
legend(180,2.5,legend=c("GLS estimate","OLS estimate"),bty="n",
lwd=c(2,2),col=c("black","gray"))
dev.off()

quantile(OUT.25000[,2],probs=c(.025,.975) )

plot(X[,2],y,type="l")
points(X[,2],y,cex=2,pch=19)
points(X[,2],y,cex=1.9,pch=19,col="white")
text(X[,2],y,1:n)

iC<-solve( mean(OUT[,4])^DY )
Lev.gls<-solve(t(X)%*%iC%*%X)%*%t(X)%*%iC
Lev.ols<-solve(t(X)%*%X)%*%t(X)

plot(y,Lev.ols[2,] )
plot(y,Lev.gls[2,] )

5.6 Introduction to Nonparametric Bayes

As we have seen, Bayesian parametric methods apply the classical machinery of prior and posterior distributions to models with a finite number of parameters. The number of parameters in such models is often kept low for reasons of computational complexity; however, in current research problems we deal with high-dimensional data and high-dimensional parameters. The origins of Bayesian methods go back to the mid-1700s, and the methods are still thriving today. The applicability of Bayesian parametric models remains, and has widened with the advancements made in modern computing and the growth of methods available in Markov chain Monte Carlo.

Frequentist nonparametrics covers a wide array of areas in statistics. The area is well known for testing procedures that are, or become, asymptotically distribution-free, which lead to nonparametric confidence intervals, bands, etc. (Hjort et al., 2010). Further information on these methods can be found in Wasserman (2006).

Nonparametric Bayesian methods are models and methods characterized generally by large parameter spaces, such as spaces of unknown density and regression functions, and by the construction of probability measures over these spaces. Typical examples seen in practice include density estimation, nonparametric regression with fixed error distributions, and hazard rate and survival function estimation. For a thorough introduction to this subject, see Hjort et al. (2010).

◯ Motivations

The motivation is the following:

• We have X1 , . . . , Xn ∼ iid F , F ∈ F. We usually assume that F is a parametric family.

• Then, putting a prior on F amounts to putting a prior on Rᵈ for some d.

• We would like to be able to put a prior on the set of all cdf's. And we would like the prior to have some basic features:

1. The prior should have large support.

2. The prior should give rise to posteriors which are analytically tractable or computationally manageable.

3. We should be able to center the prior around a given parametric family.

◯ The Dirichlet Process

Review of Finite Dimensional Dirichlet Distribution



• This is a distribution on the k-dimensional simplex.

• Let (α1 , . . . , αk ) be such that αj > 0 for all j.

• The Dirichlet distribution with parameter vector (α1 , . . . , αk ) has density

p(θ) = [ Γ(α1 + ⋯ + αk ) / ∏_{j=1}^{k} Γ(αj ) ] θ1^{α1 −1} ⋯ θk^{αk −1} .

• It is conjugate to the Multinomial distribution. That is, if the count vector (N1 , . . . , Nk ) ∼ Multinomial(n, θ) and θ ∼ Dir(α1 , . . . , αk ), then it can be shown that

θ ∣ N1 , . . . , Nk ∼ Dir(α1 + N1 , . . . , αk + Nk ).

• It can be shown that

E(θj ) = αj /α,

where α = ∑j αj . It can also be shown that

Var(θj ) = αj (α − αj ) / [ α²(α + 1) ].
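
A quick sketch of this conjugacy in R. No special package is needed: a Dirichlet draw can be generated by normalizing independent Gamma draws. The prior and count values below are made up for illustration.

## Dirichlet-multinomial conjugacy by simulation
rdirichlet <- function(n, alpha) {
  k <- length(alpha)
  g <- matrix(rgamma(n * k, shape = alpha), nrow = n, ncol = k, byrow = TRUE)
  g / rowSums(g)   # normalized Gammas are Dirichlet
}
alpha <- c(2, 3, 4)                    # prior parameters
N     <- c(10, 5, 1)                   # multinomial counts
post  <- rdirichlet(5000, alpha + N)   # draws from Dir(alpha + N)
colMeans(post)                         # compare with (alpha + N)/sum(alpha + N)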

Infinite Dimensional Dirichlet Distribution


Let α be a finite (non-null) measure on R (or, informally, think of a scaled probability distribution). Sometimes α will be called the concentration parameter (in scenarios where we hand-wave the measure theory, for example, or where it's not needed). You should think about the infinite dimensional Dirichlet distribution as a distribution over distributions, as we will soon see.

Definition 5.1: F has the Dirichlet distribution with parameter (mea-


sure) α if for every finite measurable partition A1 , . . . , Ak of R the
k-dimensional random vector (F ({A1 }), . . . , F ({Ak })) has the finite
k-dimensional Dirichlet distribution

Dir(α(A1 ), . . . , α(Ak )).

For more on this see: Freedman (1963), Ferguson (1973, 1974).


Intuition: Each F ({Ak }) ∈ [0, 1] since F is some cumulative distribution function. Also,

F ({A1 }) + ⋯ + F ({Ak }) = 1,

thus (F ({A1 }), . . . , F ({Ak })) lives on the k-dimensional simplex.


Remark (for those with measure theory): the parameter α can't be the zero measure. Note also that Lebesgue measure isn't finite on the reals, so it cannot serve as α.

We will construct the Dirichlet process so as to understand it intuitively, based on the "Polya Urn Scheme" of Blackwell and MacQueen (1973). This is one of the most intuitive approaches. Others in the literature include Ferguson (1973, 1974), which includes two constructions. One is an incorrect construction involving the Kolmogorov extension theorem (the problem is that the relevant sets aren't measurable, an important technical detail). The other is a correct construction based on something called the gamma process (this involves much overhead, including the existence of the gamma process).

◯ Polya Urn Scheme on Urn With Finitely Many Colors


– Consider an urn containing a finite number of balls of k different
colors.
– There are α1 , . . . , αk balls of colors 1 . . . , k, respectively.
– We pick a ball at random, look at its color, return it to the urn
together with another ball of the same color.
– We repeat this indefinitely.
– Let p1 (n), . . . pk (n) be the proportions of balls of colors 1 . . . , k
at time n.

Example 5.12: Polya Urn for Three Balls


Suppose we have balls of three colors in our urn. Let red correspond to color 1, blue to color 2, and green to color 3. Furthermore, suppose that P (red) = 2/9, P (blue) = 3/9, and P (green) = 4/9.

Let αo be the following probability measure (or rather, discrete probability distribution):

– αo (1) = 2/9.
– αo (2) = 3/9.
– αo (3) = 4/9.

Another way of writing this is to define αo = (2/9) δ1 + (3/9) δ2 + (4/9) δ3 , where

δ1 (A) = 1 if 1 ∈ A, and δ1 (A) = 0 otherwise.

◯ Polya Urn Scheme in General

Let α be a finite measure on a space X = R.

1. (Step 1) Suppose X1 ∼ αo , where αo = α/α(R) is the probability measure obtained by normalizing α.

2. (Step 2) Now create a new measure α + δX1 , where

δX1 (A) = 1 if X1 ∈ A, and δX1 (A) = 0 otherwise.

Then

X2 ∼ (α + δX1 ) / (α(R) + δX1 (R)) = (α + δX1 ) / (α(R) + 1).

Fact: δX1 (R) = 1. Think about why this is intuitively true.

What does the above equation really mean?

– α represents the original distribution of balls.
– δX1 represents the ball we just added.

Deeper understanding:

– Suppose the urn contained N total balls when we started.
– Then the probability that the second ball drawn, X2 , will be one of the original N balls is N /(N + 1). This represents the α part of the distribution of X2 .
– We want the probability of drawing a new ball to be 1/(N + 1). This goes with δX1 .
– When we write X2 ∼ (α + δX1 )/(norm. constant), we want N /(N + 1) of the probability to go to α/(norm. constant) and 1/(N + 1) to go to δX1 /(norm. constant).
How does this continue? Since we want

δX1 (R) / (α(R) + 1) = 1/(N + 1),

and δX1 (R) = 1, this implies that

1 / (α(R) + 1) = 1/(N + 1) ⟹ α(R) = N.

Hence, we take α(R) = N , and then we plug back in and find that αo = α/N ⟹ α = αo N . This implies that

X2 ∼ (αo N + δX1 ) / (N + 1),

which is now in terms of αo and N (which we know).

(Step 3) Continue forming new measures: α + δX1 + δX2 . Then

X3 ∼ (α + δX1 + δX2 ) / (α(R) + δX1 (R) + δX2 (R)) = (α + δX1 + δX2 ) / (α(R) + 2).

In general, it can be shown that

P (Xn+1 ∈ A ∣ X1 , . . . , Xn ) = [ α(A) + ∑_{i=1}^{n} δXi (A) ] / (α(R) + n) = [ αo (A)N + ∑_{i=1}^{n} δXi (A) ] / (N + n).

Polya Urn Scheme in General Case: Theorem

– Let α be a finite measure on a space X (this space can be very general, but we will assume it's the reals).

– Define a sequence {X1 , X2 , . . .} of random variables to be a Polya urn sequence with parameter measure α if

∗ P (X1 ∈ B) = α(B)/α(R), and

∗ for every n,

P (Xn+1 ∈ B ∣ X1 , . . . , Xn ) = [ α(B) + ∑i δXi (B) ] / (α(R) + n).

That is, X1 , X2 , . . . is PUS(α) if

P (X1 ∈ A) = α(A)/α(R) = αo (A)

for every measurable set A, and for every n,

P (Xn+1 ∈ A ∣ X1 , . . . , Xn ) = [ α(A) + ∑i δXi (A) ] / (α(R) + n)

for every measurable set A.
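
A minimal simulation sketch of a Polya urn sequence in R. The base distribution αo = N(0,1) and the total mass α(R) = N = 3 are made up for illustration.

## Polya urn sequence: base measure alpha = N * alpha_o, alpha_o = N(0,1)
set.seed(1)
N <- 3; n <- 1000; X <- numeric(n)
X[1] <- rnorm(1)                      # X1 ~ alpha_o
for (i in 2:n) {
  if (runif(1) < N / (N + i - 1)) {
    X[i] <- rnorm(1)                  # fresh draw from alpha_o, prob N/(N + i - 1)
  } else {
    X[i] <- X[sample(i - 1, 1)]       # repeat a uniformly chosen previous value
  }
}
length(unique(X))   # the number of distinct values grows slowly with n

Note the ties: the sequence repeats earlier values, which is exactly the discreteness of the Dirichlet process showing up.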

◯ De Finetti and Exchangeability

Recall what exchangeability means. Suppose that Y1 , Y2 , . . . , Yn is a sequence of random variables. This sequence is said to be exchangeable if

(Y1 , Y2 , . . . , Yn ) =ᵈ (Yπ(1) , Yπ(2) , . . . , Yπ(n) )

for every permutation π of 1, . . . , n.

Note: This means that we can permute the random variables and the distribution doesn't change.

An infinite sequence is said to be exchangeable if for every n, Y1 , Y2 , . . . , Yn is exchangeable. That is, we don't require exchangeability for infinite permutations, but it must be true for every "chunk" that we take of length n.

Definition 5.2: De Finetti's General Theorem

Let X1 , X2 , . . . be an infinite exchangeable sequence of random variables. Then there exists a probability measure π such that

X1 , X2 , . . . ∣ F ∼ iid F ,
F ∼ π.

Remark: Suppose in particular that X1 , X2 , . . . is an infinite exchangeable sequence of binary random variables. Then there exists a probability measure (distribution) µ on [0, 1] such that for every n and any x1 , . . . , xn ∈ {0, 1},

P (X1 = x1 , . . . , Xn = xn ) = ∫₀¹ p^{∑i xi} (1 − p)^{n − ∑i xi} µ(p) dp,

where µ(p) is the measure, or probability distribution, or prior that we take on p.

Theorem 5.1. A General Result (without proof)

Let X1 , X2 , . . . be PUS(α). Then this can be thought of as a two-stage process where

– F ∼ Dir(α), and

– X1 , X2 , . . . ∣ F ∼ iid F .

If we consider the process of the PUS consisting of X2 , X3 , . . . , then it's a PUS(α + δX1 ). That is, F ∣ X1 ∼ Dir(α + δX1 ). More generally, it can be shown that

F ∣ X1 , . . . , Xn ∼ Dir(α + ∑_{i=1}^{n} δXi ).

◯ Chinese Restaurant Process


– There are Bayesian NP approaches to many of the main issues in
statistics including:
∗ regression.
∗ classification.
∗ clustering.
∗ survival analysis.
∗ time series analysis.
∗ spatial data analysis.

– These generally involve assumptions of exchangeability or partial


exchangeability.
∗ and corresponding distributions on random objects of various
kinds (functions, partitions, measures, etc.)
– We look at the problem of clustering for concreteness.

◯ Clustering: How to choose K?

– Ad hoc approaches (hierarchical clustering)
∗ these methods do yield a data-driven choice of K
∗ there is little understanding of how good these choices are (meaning the checks are ad hoc, based on some criterion)
– Methods based on objective functions (M -estimators)
∗ K-means, spectral clustering
∗ they come with some frequentist guarantees
∗ it's often hard to turn these into data-driven choices of K
– Parametric likelihood-based methods
∗ finite mixture models, Bayesian variants
∗ various model choice methods: hypothesis testing, cross-validation, bootstrap, AIC, BIC, Laplace, reversible jump MCMC
∗ the assumptions underlying the method rarely apply to the setting at hand
– Something different: the Chinese restaurant process.

Basic idea: In many data analysis settings, we don't know the number of latent clusters and would like to learn it from the data. BNP clustering addresses this by assuming there is an infinite number of latent clusters, but that only a finite number of them is used to generate the observed data. Under these assumptions, the posterior yields a distribution over the number of clusters, the assignment of data to clusters, and the parameters associated with each cluster. In addition, the predictive distribution, i.e., the assignment of the next data point, allows new data to be assigned to a previously unseen cluster.

How does it work: The BNP approach finesses the problem of choosing the number of clusters by assuming it is infinite, while specifying the prior over the infinite groupings P (c) in such a way that it favors assigning data to a small number of groups, where c refers to the cluster assignments. The prior over groupings is a well-known object called the Chinese restaurant process (CRP), a distribution over infinite partitions of the integers (Aldous, 1985; Pitman, 2002).
Where does the name come from?

– Imagine that Sam and Mike own a restaurant with an infinite number of tables.
– Imagine a sequence of customers entering their restaurant and sitting down.
– The first customer (Liz) enters and sits at the first table.
– The second customer enters and sits at the first table with probability 1/(1 + α), and at a new table with probability α/(1 + α), where α is positive and real.
– Liz is friendly and people would want to sit and talk with her. So, we would assume that 1/(1 + α) is a high probability, meaning that α is a small number.
– What happens with the nth customer?
∗ He sits at each of the previously occupied tables with probability proportional to the number of previous customers sitting there.
∗ He sits at the next unoccupied table with probability proportional to α.

More formally, let cn be the table assignment of customer n. A draw from this distribution can be generated by sequentially assigning observations with probability

P (cn = k ∣ c1 , . . . , cn−1 ) = mk /(n − 1 + α) if k ≤ K+ (i.e., k is a previously occupied table), and α/(n − 1 + α) otherwise (i.e., k is the next unoccupied table),

where mk is the number of customers sitting at table k and K+ is the number of tables for which mk > 0. The parameter α is called the concentration parameter.
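
Here is a minimal sketch of drawing table assignments sequentially from the formula above (the function name crp_sample is ours):

## sequential CRP table assignments with concentration parameter alpha
crp_sample <- function(n, alpha) {
  z <- integer(n); z[1] <- 1; K <- 1          # first customer sits at table 1
  for (i in 2:n) {
    m <- tabulate(z[1:(i - 1)], nbins = K)    # table counts m_k
    p <- c(m, alpha) / (i - 1 + alpha)        # occupied tables, then a new table
    z[i] <- sample(K + 1, 1, prob = p)
    if (z[i] > K) K <- K + 1                  # a new table was opened
  }
  z
}
set.seed(1)
table(crp_sample(100, alpha = 1))   # typically a few large tables and many small ones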

The rich just get richer

– CRP rule: next customer sits at a table with prob. proportional


to number of customers already sitting at it (and sits at new table
with prob. proportional to α).
– Customers tend to sit at most popular tables.
– Most popular tables attract the most new customers, and become
even more popular.
– CRPs exhibit power law behavior, where a few tables attract the
bulk of the customers.
– The concentration parameter α determines how likely a customer
is to sit at a fresh table.

More formally stated:

– A larger value of α will produce more occupied tables (and fewer


customers per table).
– Thus, a small value of α produces more customers at each table.
– The CRP exhibits an important invariance property: the cluster
assignments under this distribution are exchangeable.
– This means that p(c) is unchanged if the order of customers is
shuffled (up to label changes). This may be counter-intuitive
since the process was just described sequentially.

The CRP and Clustering

– The data points refer to the customers, and the tables are the clusters.
∗ The CRP then defines a prior distribution on the partition of the data and on the number of tables.
– The prior can be completed with:
∗ A likelihood, meaning there needs to be a parameterized probability distribution that corresponds to each table.
∗ A prior for the parameters: the first customer to sit at table k chooses the parameter vector for that table (φk ) from the prior.
– Now we have a distribution for any quantity we care about in a clustering setting.

Now, let's think about how we would write this process down formally. We're writing out a mixture model with a component that's nonparametric. Let's define the following:

– yn are the observations at time n.

– cn are the latent cluster assignments that generate the yn .
– F is a parametric family of distributions for yn .
– θk represent the clustering parameters.
– Go represents a general prior for the clustering parameters (this is the nonparametric part).

We also assume that each observation is conditionally independent given its latent cluster assignment and its cluster parameters. Using the CRP, we can view the model as

yn ∣ cn , θ ∼ F (θcn )
cn ∼ p(c) (the CRP prior)
θk ∼ Go .

We want p(c ∣ y), which requires p(y ∣ c). By Bayes' rule,

p(c ∣ y) = p(y ∣ c) p(c) / ∑c p(y ∣ c) p(c),

where

p(y ∣ c) = ∫ [ ∏_{n=1}^{N} F (yn ∣ θcn ) ∏_{k=1}^{K} Go (θk ) ] dθ.

A Go that is conjugate allows this integral to be calculated analytically. For example, the Gaussian is the conjugate prior to a Gaussian with fixed variance (and thus a mixture of Gaussians model is computationally convenient). We illustrate this specific example below.

Example 5.13: Suppose

yn ∣ cn , θ ∼ N(θcn , 1)
cn ∼ Multinomial(1, p)
θk ∼ N(µ, τ²),

where p, µ, and τ² are known. Then

p(y ∣ c) = ∫ [ ∏_{n=1}^{N} Normal(θcn , 1)(yn ) × ∏_{k=1}^{K} Normal(µ, τ²)(θk ) ] dθ.

The integrand, as a function of θ, is just another normal kernel, so we can integrate θ out as we have in problems before. Once we calculate p(y ∣ c), we can simply plug this and p(c) into

p(c ∣ y) = p(y ∣ c) p(c) / ∑c p(y ∣ c) p(c).
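
For this example, integrating out θk within cluster k shows that the nk observations in that cluster are jointly N(µ1, I + τ²11ᵀ), so p(y ∣ c) factors over clusters. A minimal sketch in R, assuming the mvtnorm package is available; the data vector and partitions below are made up.

## log p(y | c) for the known-variance Gaussian mixture of Example 5.13
library(mvtnorm)
log_p_y_given_c <- function(y, c, mu, tau2) {
  lp <- 0
  for (k in unique(c)) {
    yk <- y[c == k]; nk <- length(yk)
    Sk <- diag(nk) + tau2 * matrix(1, nk, nk)   # I + tau^2 * 11^T
    lp <- lp + dmvnorm(yk, mean = rep(mu, nk), sigma = Sk, log = TRUE)
  }
  lp
}
y <- c(-2.1, -1.9, 3.0, 3.2)
log_p_y_given_c(y, c = c(1, 1, 2, 2), mu = 0, tau2 = 10)   # two-cluster partition
log_p_y_given_c(y, c = c(1, 1, 1, 1), mu = 0, tau2 = 10)   # one-cluster partition

Comparing such values across partitions (weighted by the prior p(c)) is exactly the computation behind p(c ∣ y).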
Example 5.14: Gaussian Mixture using R
Information on the R package profdpm: This package facilitates inference at the posterior mode in a class of conjugate product partition models (PPMs) by approximating the maximum a posteriori (MAP) partition. The class of PPMs is motivated by an augmented formulation of the Dirichlet process mixture, which is currently the only available member of this class. The profdpm package consists of two model fitting functions, profBinary and profLinear, their associated summary methods summary.profBinary and summary.profLinear, and a function (pci) that computes several metrics of agreement between two data partitions. However, the profdpm package was designed to be extensible to other types of product partition models. For more on this package, see help(profdpm) after installation.

– The following example simulates a dataset consisting of 99 longi-


tudinal measurements on 33 units of observation, or subjects.
– Each subject is measured at three times, drawn uniformly and
independently from the unit interval.
– Each of the three measurements per subject are drawn indepen-
dently from the normal distribution with one of three linear mean
functions of time, and with unit variance.
– The linear mean functions vary by intercept and slope. The lon-
gitudinal structure imposes a grouping among measurements on
a single subject.
– Observations grouped in this way should always cluster together.
A grouping structure is specified using the group parameter; a
factor that behaves similarly to the groups parameter of lattice
graphics functions.
– For the PPM of conjugate binary models, the grouping structure
is imposed by the model formula.
– Grouped observations correspond to rows of the model matrix,
resulting from a call to model.matrix on the formula passed to
profBinary. Hence, the profBinary function does not have a group
parameter in its prototype.
– The goal of the following example is to recover the simulated partition and to create simultaneous 95% credible bands for the mean within each cluster. The following R code block creates the simulated dataset.

set.seed(42)
sim <- function(multiplier = 1) {
x <- as.matrix(runif(99))
a <- multiplier * c(5,0,-5)
s <- multiplier * c(-10,0,10)
y <- c(a[1]+s[1]*x[1:33],
       a[2]+s[2]*x[34:66],
       a[3]+s[3]*x[67:99]) + rnorm(99)
group <- rep(1:33, rep(3,33))
return(data.frame(x=x,y=y,gr=group))
}
dat <- sim()
library("profdpm")
fitL <- profLinear(y ~ x, group=gr, data=dat)
sfitL <- summary(fitL)
# pdf("np_plot.pdf")
plot(fitL$x[,2], fitL$y, col=grey(0.9), xlab="x", ylab="y")
for(grp in unique(fitL$group)) {
ind <- which(fitL$group==grp)
ord <- order(fitL$x[ind,2])
lines(fitL$x[ind,2][ord],
fitL$y[ind][ord],
col=grey(0.9))
}
for(cls in 1:length(sfitL)) {
# The following implements the (3rd) method of
# Hanson & McMillan (2012) for simultaneous credible bands
# Generate coefficients from profile posterior
n <- 1e4
tau <- rgamma(n, shape=fitL$a[[cls]]/2, scale=2/fitL$b[[cls]])
muz <- matrix(rnorm(n*2, 0, 1),n,2)
mus <- (muz / sqrt(tau)) %*% chol(solve(fitL$s[[cls]]))
mu <- outer(rep(1,n), fitL$m[[cls]]) + mus

# Compute Mahalanobis distances


mhd <- rowSums(muz^2)

# Find the smallest 95% in terms of Mahalanobis distance


# I.e., a 95% credible region for mu
ord <- order(mhd, decreasing=TRUE)[-(1:floor(n*0.05))]
mu <- mu[ord,]
#Compute the 95% credible band
plotx <- seq(min(dat$x), max(dat$x), length.out=200)
ral <- apply(mu, 1, function(m) m[1] + m[2] * plotx)
rlo <- apply(ral, 1, min)
rhi <- apply(ral, 1, max)

rmd <- fitL$m[[cls]][1] + fitL$m[[cls]][2] * plotx

lines(plotx, rmd, col=cls, lty=2)


lines(plotx, rhi, col=cls)
lines(plotx, rlo, col=cls)
}
# dev.off()

Figure 5.21: Simulated data; 99 longitudinal measurements on 33 subjects, with simultaneous 95% credible bands for the mean within each of the three clusters.
