Bayes Manuscripts
Rebecca C. Steorts
Contents
1 Introduction
1.1 Advantages of Bayesian Methods
1.2 de Finetti’s Theorem
3 Being Objective
◯ Meaning Of Flat
◯ Objective Priors in More Detail
3.1 Reference Priors
◯ Laplace Approximation
◯ Some Probability Theory
◯ Shrinkage Argument of J.K. Ghosh
◯ Reference Priors
3.2 Final Thoughts on Being Objective
Introduction
There are three kinds of lies: lies, damned lies and statistics.
—Mark Twain
The word “Bayesian” traces its origin to the 18th century and English Rev-
erend Thomas Bayes, who along with Pierre-Simon Laplace was among the
first thinkers to consider the laws of chance and randomness in a quantita-
tive, scientific way. Both Bayes and Laplace were aware of a relation that is
now known as Bayes Theorem:
p(θ∣x) = p(x∣θ)p(θ)/p(x) ∝ p(x∣θ)p(θ). (1.1)
The proportionality ∝ in Eq. (1.1) signifies that the 1/p(x) factor is con-
stant and may be ignored when viewing p(θ∣x) as a function of θ. We can
decompose Bayes’ Theorem into three principal terms:
p(θ∣x) posterior
p(x∣θ) likelihood
p(θ) prior
In effect, Bayes’ Theorem provides a general recipe for updating prior beliefs
about an unknown parameter θ based on observing some data x.
However, the notion of having prior beliefs about a parameter that is os-
tensibly “unknown” did not sit well with many people who considered the
problem in the 19th and early 20th centuries. The resulting search for a
• Suppose that the experiment was “Flip six times and record the re-
sults.” In this case, the random variable X counts the number of
heads, and X ∼ Binomial(6, θ). The observed data was x = 5, and the
p-value of our hypothesis test is
p-value = P_{θ=1/2}(X ≥ 5)
= P_{θ=1/2}(X = 5) + P_{θ=1/2}(X = 6)
= 6/64 + 1/64 = 7/64 = 0.109375 > 0.05.
So we fail to reject H0 at α = 0.05.
• Suppose instead that the experiment was “Flip until we get tails.” In
this case, the random variable X counts the number of the flip on
which the first tails occurs, and X ∼ Geometric(1 − θ). The observed
data was x = 6, and the p-value of our hypothesis test is
p-value = P_{θ=1/2}(X ≥ 6)
= 1 − P_{θ=1/2}(X < 6)
= 1 − ∑_{x=1}^{5} P_{θ=1/2}(X = x)
= 1 − (1/2 + 1/4 + 1/8 + 1/16 + 1/32) = 1/32 = 0.03125 < 0.05.
So we reject H0 at α = 0.05.
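Both p-values are easy to verify numerically; a minimal R sketch (base R only):
# Binomial experiment: P(X >= 5) for X ~ Binomial(6, 1/2)
sum(dbinom(5:6, size = 6, prob = 0.5))   # 0.109375
# Geometric experiment: R's pgeom counts failures before the first success,
# so P(first tails on flip 6 or later) = 1 - pgeom(4, 0.5)
1 - pgeom(4, prob = 0.5)                 # 0.03125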
The conclusions differ, which seems absurd. Moreover the p-values aren’t
even close—one is 3.5 times as large as the other. Essentially, the result of
our hypothesis test depends on whether we would have stopped flipping if we
had gotten a tails sooner. In other words, the frequentist approach requires
us to specify what we would have done had the data been something that
we already know it wasn’t.
Note that despite the different results, the likelihood for the actual value of
x that was observed is the same for both experiments (up to a constant):
p(x∣θ) ∝ θ^5 (1 − θ).
A Bayesian approach would take the data into account only through this
likelihood and would therefore be guaranteed to provide the same answers
regardless of which experiment was being performed.
Example 1.2: Suppose we want to test whether the voltage θ across some
electrical component differs from 9 V, based on noisy readings of this voltage
from a voltmeter. Suppose the data is as follows:
A frequentist might assume that the voltage readings Xi are iid from some
N (θ, σ 2 ) distribution, which would lead to a basic one-sample t-test.
Nevertheless, a frequentist must now redo the analysis and could perhaps
obtain a different conclusion, because the 10 V limit changes the distribution
of the observations under the null hypothesis. Like in the last example, the
frequentist results change based on what would have happened had the data
been something that we already know it wasn’t.
The problems in Examples 1.1 and 1.2 arise from the way the frequentist
paradigm forces itself to interpret probability. Another familiar aspect of
this problem is the awkward definition of “confidence” in frequentist confi-
dence intervals. The most natural interpretation of a 95% confidence inter-
val (L, U )—that there is a 95% chance that the parameter is between L and
U —is dead wrong from the frequentist point of view. Instead, the notion
of “confidence” must be interpreted in terms of repeating the experiment a
large number of times (in principle, an infinite number), and no probabilistic
statement can be made about this particular confidence interval computed
from the data we actually observed.
In this section, we'll motivate the use of priors on parameters and indeed
motivate the very use of parameters. We begin with a definition.
Definition 1.1: (Infinite exchangeability). We say that (x1, x2, . . . ) is an
infinitely exchangeable sequence of random variables if, for any n, the joint
distribution p(x1, . . . , xn) is invariant under any permutation of the indices.
If the distribution on θ has a density, we can replace P (dθ) with p(θ) dθ,
but the theorem applies to a much broader class of cases than just those
with a density for θ.
Clearly, since ∏_{i=1}^{n} p(xi∣θ) is invariant to reordering, any sequence of
distributions that can be written as
∫ ∏_{i=1}^{n} p(xi∣θ) p(θ) dθ
is infinitely exchangeable.
Introduction to Bayesian
Methods
Every time I think I know what’s going on, suddenly there’s another layer
of complications. I just want this damned thing solved.
—John Scalzi, The Lost Colony
Another motivation for the Bayesian approach is decision theory. Its origins
go back to von Neumann and Morgenstern’s game theory, but the central
figure was Abraham Wald. In statistical decision theory, we formalize good
and bad results with a loss function.
Thus, the risk measures the long-term average loss resulting from using δ.
Often one decision does not dominate the other everywhere as is the case
with decisions δ1 , δ2 . The challenge is in saying whether, for example, δ1 or
δ3 is better. In other words, how should we aggregate over Θ?
[Figure: frequentist risk functions R(θ, δ1), R(θ, δ2), and R(θ, δ3) plotted against θ.]
Example 2.1: Suppose two different labs estimate the potency of drugs.
Both have some error or noise in their measurements, which can be
accurately estimated from past tests. Now we introduce a new drug and
test its potency at a randomly chosen lab. Suppose the sample sizes
matter dramatically.
Thus, the question that we ask is: should we use the noise level from
the lab where the drug is tested, or average over both? Intuitively, we use
the noise level from the lab where it was tested, but in some frequentist
approaches it is not always so straightforward.
p(θ∣y) = f(y∣θ)π(θ) / ∫ f(y∣θ)π(θ) dθ
= g(θ, T(y)) h(y) π(θ) / ∫ g(θ, T(y)) h(y) π(θ) dθ
= g(θ, T(y)) π(θ) / ∫ g(θ, T(y)) π(θ) dθ
∝ g(θ, T(y)) π(θ).
Then p(y∣θ) = \binom{n}{y} θ^y (1 − θ)^{n−y}. Let π(θ) represent a general prior.
The Bayes action δ ∗ (x) for any fixed x is the decision δ(x) that minimizes
the posterior risk. If the problem at hand is to estimate some unknown
parameter θ, then we typically call this the Bayes estimator instead.
Theorem 2.3. Under squared error loss, the decision δ(x) that minimizes
the posterior risk is the posterior mean.
Then
∂[ρ(π, δ(x))]/∂[δ(x)] = 2δ(x) − 2 ∫ θ π(θ∣x) dθ = 0 ⟺ δ(x) = E[θ∣x].
In frequentist usage, the parameter θ is fixed, and thus it is the sample space
over which averages are taken. Letting R(θ, δ(x)) denote the frequentist
risk, recall that R(θ, δ(x)) = Eθ [L(θ, δ(x))]. This expectation is taken over
the data X, with the parameter θ held fixed. Note that the data, X, is
capitalized, emphasizing that it is a random variable.
Example 2.4: (Squared error loss). Let the loss function be squared error.
In this case, the risk decomposes as
R(θ, δ(X)) = Varθ[δ(X)] + (Eθ[δ(X)] − θ)²,
that is, variance plus squared bias.
This result allows a frequentist to analyze the variance and bias of an estima-
tor separately, and can be used to motivate frequentist ideas, e.g. minimum
variance unbiased estimators (MVUEs).
Bayesians do not find the previous idea compelling because it doesn’t adhere
to the conditionality principle since it averages over all possible data sets.
Hence, in a Bayesian framework, we define the posterior risk ρ(π, δ(x)) based
on the data x and a prior π, where
ρ(π, δ(x)) = ∫ L(θ, δ(x)) π(θ∣x) dθ.
Note that the prior enters the equation when calculating the posterior den-
sity. Using the Bayes risk, we can define a bit of jargon. Recall that the
Bayes action δ ∗ (x) is the value of δ(x) that minimizes the posterior risk.
We already showed that the Bayes action under squared error loss is the
posterior mean.
◯ Hybrid Ideas
Definition 2.4: The Bayes risk is denoted by r(π, δ(x)). While the Bayes
risk is a frequentist concept since it averages over X, the expression can also
be written as
r(π, δ) = ∫ R(θ, δ) π(θ) dθ = ∫ ρ(π, δ(x)) p(x) dx.
distribution of x. Another connection with frequentist theory includes that
finding a Bayes rule against the “worst possible prior” gives you a minimax
estimator. While a Bayesian might not find this particularly interesting, it
is useful from a frequentist perspective because it provides a way to compute
the minimax estimator.
We will come back to decision theory in a later chapter on advanced
decision theory, where we will cover topics such as minimaxity,
admissibility, and James–Stein estimators.
For now we will consider parametric models, which means that the param-
eter θ is a fixed-dimensional vector of numbers. Let x ∈ X be the observed
data and θ ∈ Θ be the parameter. Note that X may be called the sample
space, while Θ may be called the parameter space. Now we define some
notation that we will reuse throughout the course:
p(x∣θ) likelihood
π(θ) prior
p(x) = ∫ p(x∣θ)π(θ) dθ marginal likelihood
p(θ∣x) = p(x∣θ)π(θ)/p(x) posterior probability
p(x_new∣x) = ∫ p(x_new∣θ)π(θ∣x) dθ predictive probability
Note that
p(θ∣x) = p(x∣θ)π(θ)/p(x) ∝ p(x∣θ)π(θ),
and oftentimes it’s best to not calculate the normalizing constant p(x) be-
cause you can recognize the form of p(x∣θ)π(θ) as a probability distribution
you know. So don’t normalize until the end!
Remark: Note that the prior distribution that we take on θ doesn't have
to be a proper distribution; however, the posterior is always required to be
proper for valid inference. By proper, I mean that the distribution must
integrate to 1.
Subjective
A prior probability could be subjective, based on the information a person
has available about the problem at hand (past experience, expert knowledge, and so on).
Objective
An objective prior (also called a default, vague, or noninformative prior) can
be used when little prior information is available. Examples of objective
priors are flat priors such as Laplace's, Haldane's, Jeffreys', and Bernardo's
reference priors. These priors will be discussed later.
X∣θ ∼ f (x∣θ)
Θ∣γ ∼ π(θ∣γ)
Γ ∼ φ(γ),
where we assume that φ(γ) is known and does not depend on any other
unknown hyperparameters (as we have already said, "hyperparameters" is
what the parameters of a prior are often called). Note that we can continue
this hierarchical modeling and add more stages to the model; however,
doing so adds more complexity to the model (and, as we will see, may
result in a posterior that we cannot compute without the aid of numerical
integration or MCMC, which we will cover in detail in a later chapter).
π(θ∣x) ∝ p(x∣θ)p(θ)
∝ \binom{n}{x} θ^x (1 − θ)^{n−x} · [Γ(a + b)/(Γ(a)Γ(b))] θ^{a−1}(1 − θ)^{b−1}
∝ θ^x (1 − θ)^{n−x} θ^{a−1}(1 − θ)^{b−1}
∝ θ^{x+a−1}(1 − θ)^{n−x+b−1} ⟹ θ∣x ∼ Beta(x + a, n − x + b).
• Based on this prior information, we’ll use a Beta prior for θ and we’ll
choose a and b. (Won’t get into this here).
• We can plot the prior and likelihood distributions in R and then see
how the two mix to form the posterior distribution.
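For instance, a minimal R sketch of such a plot (the values of a, b, x, and n below are illustrative assumptions, not taken from the notes):
a <- 2; b <- 2                    # Beta(a, b) prior (assumed)
x <- 7; n <- 10                   # observed successes out of n trials (assumed)
theta <- seq(0, 1, length = 500)
prior <- dbeta(theta, a, b)
like  <- dbeta(theta, x + 1, n - x + 1)  # likelihood rescaled to integrate to 1
post  <- dbeta(theta, x + a, n - x + b)
plot(theta, post, type = "l", lty = 3, lwd = 3,
     xlab = expression(theta), ylab = "Density")
lines(theta, like, lty = 1, lwd = 3)
lines(theta, prior, lty = 2, lwd = 3)
legend("topleft", c("Prior", "Likelihood", "Posterior"),
       lty = c(2, 1, 3), lwd = 3)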
[Figure: the Beta prior density plotted against θ.]
[Figure: the prior and likelihood plotted against θ.]
[Figure: the prior, likelihood, and posterior plotted against θ.]
p(θ∣x) ∝ exp{−(1/(2σ²)) ∑_i (x_i − x̄)²} exp{−(n/(2σ²))(x̄ − θ)²}
∝ exp{−(n/(2σ²))(x̄ − θ)²}
= exp{−(n/(2σ²))(θ − x̄)²}.
Thus,
θ∣x1, . . . , xn ∼ Normal(x̄, σ²/n).
p(θ∣x1, . . . , xn) ∝ ∏_{i=1}^{n} (1/√(2πσ²)) exp{−(1/(2σ²))(x_i − θ)²} × (1/√(2πτ²)) exp{−(1/(2τ²))(θ − µ)²}
∝ exp{−(1/(2σ²)) ∑_i (x_i − θ)²} exp{−(1/(2τ²))(θ − µ)²}.
Consider the identity
∑_i (x_i − θ)² = ∑_i (x_i − x̄)² + n(x̄ − θ)².
Then
p(θ∣x1, . . . , xn) ∝ exp{−(1/(2σ²)) ∑_i (x_i − x̄)²} × exp{−(n/(2σ²))(x̄ − θ)²} × exp{−(1/(2τ²))(θ − µ)²}
∝ exp{−(n/(2σ²))(x̄ − θ)²} exp{−(1/(2τ²))(θ − µ)²}
= exp{−(1/2)[(n/σ²)(x̄² − 2x̄θ + θ²) + (1/τ²)(θ² − 2θµ + µ²)]}
= exp{−(1/2)[(n/σ² + 1/τ²)θ² − 2θ(nx̄/σ² + µ/τ²) + nx̄²/σ² + µ²/τ²]}
∝ exp{−(1/2)(n/σ² + 1/τ²)[θ² − 2θ (nx̄/σ² + µ/τ²)/(n/σ² + 1/τ²)]}
∝ exp{−(1/2)(n/σ² + 1/τ²)[θ − (nx̄/σ² + µ/τ²)/(n/σ² + 1/τ²)]²}.
Thus,
θ∣x1, . . . , xn ∼ N((nx̄τ² + µσ²)/(nτ² + σ²), (σ²τ²)/(nτ² + σ²)).
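The conjugate formula can be checked numerically against a brute-force grid posterior; a quick R sketch, with all values below assumed for illustration:
set.seed(1)
n <- 10; sigma <- 2; mu <- 0; tau <- 1          # all values assumed
x <- rnorm(n, mean = 0.5, sd = sigma); xbar <- mean(x)
mu.star  <- (n * xbar * tau^2 + mu * sigma^2) / (n * tau^2 + sigma^2)
var.star <- (sigma^2 * tau^2) / (n * tau^2 + sigma^2)
theta <- seq(-3, 3, length = 1e4)
unnorm <- sapply(theta, function(t) prod(dnorm(x, t, sigma)) * dnorm(t, mu, tau))
post <- unnorm / sum(unnorm)                    # normalized grid posterior
c(mu.star, sum(theta * post))                   # the two means agree
c(var.star, sum((theta - mu.star)^2 * post))    # and the variances agree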
We omit the proof since it requires Chebychev’s Inequality along with a bit
of probability theory. See Problem 1.8.1 in TPE for the exercise of proving
this.
Example 2.8: (Normal-Normal Revisited) Recall Example 2.7. We write
the posterior mean as E(θ∣x). Let’s write the posterior mean in this example
as
E(θ∣x) = (nx̄/σ² + µ/τ²) / (n/σ² + 1/τ²)
= (nx̄/σ²)/(n/σ² + 1/τ²) + (µ/τ²)/(n/σ² + 1/τ²).
What happens as n → ∞?
In the case of the posterior variance, divide the numerator and denominator
by n. Then
V(θ∣x) = 1/(n/σ² + 1/τ²) = (1/n)/(1/σ² + 1/(nτ²)) ≈ σ²/n → 0 as n → ∞.
Since the posterior mean is unbiased and the posterior variance goes to 0,
the posterior mean is consistent by Theorem 2.4.
Example 2.9:
Notice that this looks like an Inverse Gamma distribution with parameters
α + a and x + b. Thus,
β∣x ∼ IG(α + a, x + b).
Example 2.10: (Bayesian versus frequentist)
Suppose a child is given an IQ test and his score is X. We assume that
Here the posterior mean is (400 + 9x)/13. Suppose x = 115. Then the pos-
terior mean becomes 110.4. Contrasting this, we know that the frequentist
estimate is the mle, which is x = 115 in this example.
The posterior variance is 900/13 = 69.23, whereas the variance of the data
is σ² = 100. If instead we had used a flat prior on θ, the posterior mean and
the MLE would both be 115, and the posterior variance and the variance of
the data would both be 100.
When we put little/no prior information on θ, the data washes away most/all
of the prior information (and the results of frequentist and Bayesian estima-
tion are similar or equivalent in this case).
Example 2.11: (Normal Example with Unknown Variance)
Consider
iid
X1 , . . . , Xn ∣θ ∼ Normal(θ, σ 2 ), θ known, σ 2 unknown
p(σ 2 ) ∝ (σ 2 )−1 .
p(σ²∣x1, . . . , xn) ∝ (2πσ²)^{−n/2} exp{−(1/(2σ²)) ∑_i (x_i − θ)²} (σ²)^{−1}
∝ (σ²)^{−n/2−1} exp{−(1/(2σ²)) ∑_i (x_i − θ)²}.
Recall, if Y ∼ IG(a, b), then f(y) = [b^a/Γ(a)] y^{−a−1} e^{−b/y}. Thus,
σ²∣x1, . . . , xn ∼ IG(n/2, ∑_i (x_i − θ)²/2).
Example 2.12: (Football Data) Gelman et al. (2003) consider the problem
of estimating an unknown variance using American football scores. The
focus is on the difference d between a game outcome (winning score minus
losing score) and a published point spread.
We can refer to Example 2.11, since the setup here is the same. Hence the
posterior becomes
The next logical step would be plotting the posterior distribution in R. As far
as I can tell, there is no built-in function in R for the Inverse Gamma
density. However, someone saw the need for it and provided one in the
pscl package.
Proceeding below, we try and calculate the posterior using the function
densigamma, which corresponds to the Inverse Gamma density. However,
running this line in the code gives the following error:
Warning message:
In densigamma(sigmas, n/2, sum(d^2)/2) : value out of range in ’gammafn’
What's the problem? Think about what the posterior looks like. Recall
that
p(σ²∣d) = [(∑_i d_i²/2)^{n/2} / Γ(n/2)] (σ²)^{−n/2−1} e^{−(∑_i d_i²)/(2σ²)}.
setwd("~/Desktop/sta4930/football")
data = read.table("football.txt",header=T)
names(data)
attach(data)
score = favorite-underdog
d = score-spread
n = length(d)
hist(d)
install.packages("pscl",repos="http://cran.opensourceresources.org")
library(pscl)
?densigamma
sigmas = seq(10,20,by=0.1)
post = densigamma(sigmas,n/2,sum(d^2)/2)
v = sum(d^2)
We know we can’t use the Inverse Gamma density (because of the function
in R), but we do know a relationship regarding the Inverse Gamma and
Gamma distributions. So, let’s apply this fact.
You may be thinking, we’re going to run into the same problem because we’ll
still be dividing by Γ(1120). This is true, except the Gamma density function
dgamma was built into R by the original writers. The dgamma function is able
to do some internal tricks that let it calculate the gamma density even
though the individual piece Γ(n/2) by itself is too large for R to handle. So,
moving forward, we will apply the following fact that we already learned:
If X ∼ IG(a, b), then 1/X ∼ Gamma(a, 1/b).
Since
σ²∣d1, . . . , dn ∼ IG(n/2, ∑_i d_i²/2),
we know that
1/σ² ∣ d1, . . . , dn ∼ Gamma(n/2, 2/v), where v = ∑_i d_i².
In the code below, we plot the posterior of 1/σ² ∣ d. In order to do so, we
must create a new sequence of x-values, since the mean of our Gamma will
be at n/v ≈ 0.0053.
xnew = seq(0.004,0.007,.000001)
pdf("football_sigmainv.pdf", width = 5, height = 4.5)
post.d = dgamma(xnew,n/2,scale = 2/v)
plot(xnew,post.d, type= "l", xlab = expression(1/sigma^2), ylab= "density")
dev.off()
To recap, we know
1/σ² ∣ d1, . . . , dn ∼ Gamma(n/2, 2/v), where v = ∑_i d_i².
Let u = 1/σ². We are going to make a transformation of variables now to
write the density in terms of σ².
[Figure: posterior density of 1/σ² plotted against 1/σ².]
Since u = 1/σ², this implies σ² = 1/u. Then
∣∂u/∂σ²∣ = 1/σ⁴.
Now applying the transformation of variables we find that
f(σ²∣d1, . . . , dn) = [1/(Γ(n/2)(2/v)^{n/2})] (1/σ²)^{n/2−1} e^{−v/(2σ²)} (1/σ⁴).
Thus, the posterior of σ²∣d is the Gamma(n/2, 2/v) density evaluated at
1/σ², multiplied by the Jacobian 1/σ⁴.
x.s = seq(150, 250, 1)
pdf("football_sigma.pdf", height = 5, width = 4.5)
post.s = dgamma(1/x.s, n/2, scale = 2/v)*(1/x.s^2)  # Gamma density at u = 1/sigma^2 times the Jacobian 1/(sigma^2)^2
plot(x.s, post.s, type="l", xlab = expression(sigma^2), ylab="density")
dev.off()
detach(data)
[Figure 2.4: posterior density of σ² plotted against σ².]
From the posterior plot in Figure 2.4 we can see that the posterior mean
is around 185. This means that the variability of the actual game result
around the point spread has a standard deviation around 14 points. If you
wanted to actually calculate the posterior mean and variance, you could do
this using a numerical method in R.
What’s interesting about this example is that there is a lot more variability
in football games than the average person would most likely think.
• Assume that (1) the standard deviation actually is 14 points, and (2)
game result is normally distributed (which it’s not, exactly, but this is
a reasonable approximation).
Example 2.13:
Y1, . . . , Yn ∣ µ, σ² ∼ iid Normal(µ, σ²),
µ ∣ σ² ∼ Normal(µ0, σ²/κ0),
σ² ∼ IG(ν0/2, σ0²/2),
where µ0 , κ0 , ν0 , σ02 are constant.
p(µ, σ² ∣ y1, . . . , yn) = p(µ, σ², y1, . . . , yn)/p(y1, . . . , yn)
∝ p(y1, . . . , yn ∣ µ, σ²) p(µ, σ²)
= p(y1, . . . , yn ∣ µ, σ²) p(µ ∣ σ²) p(σ²).
Now consider the following algebraic identity, which is needed to simplify
the exponent:
nȳ² + κ0µ0² − (nȳ + κ0µ0)²/(n + κ0)
= [n²ȳ² + nκ0µ0² + nκ0ȳ² + κ0²µ0² − n²ȳ² − 2nκ0µ0ȳ − κ0²µ0²]/(n + κ0)
= [nκ0µ0² + nκ0ȳ² − 2nκ0µ0ȳ]/(n + κ0)
= nκ0(µ0² − 2µ0ȳ + ȳ²)/(n + κ0)
= nκ0(µ0 − ȳ)²/(n + κ0).
σ² ∣ y ∼ IG( (n + ν0)/2 , (1/2)[ ∑_i (y_i − ȳ)² + nκ0(µ0 − ȳ)²/(n + κ0) + σ0² ] ).
Example 2.14: Suppose we calculate E[θ∣y], where y = x_{(n)} = max_i x_i. Let
X_i ∣ θ ∼ Uniform(0, θ),
θ ∼ IG(a, 1/b), i.e., π(θ) = θ^{−a−1} e^{−1/(θb)}/(Γ(a)b^a).
Show
E[θ∣x] = [1/(b(n + a − 1))] · P(χ²_{2(n+a−1)} < 2/(by)) / P(χ²_{2(n+a)} < 2/(by)).
Proof. Recall that the posterior depends on the data only through the
sufficient statistic y. Consider that P(Y ≤ y) = P(X1 ≤ y)^n = (y/θ)^n, which
implies
f_Y(y) = (n/θ)(y/θ)^{n−1} = n y^{n−1}/θ^n.
E[θ∣x] = ∫ θ f(y∣θ)π(θ) dθ / ∫ f(y∣θ)π(θ) dθ
= [∫_y^∞ θ · (n y^{n−1}/θ^n) · θ^{−a−1} e^{−1/(θb)}/(Γ(a)b^a) dθ] / [∫_y^∞ (n y^{n−1}/θ^n) · θ^{−a−1} e^{−1/(θb)}/(Γ(a)b^a) dθ]
= [∫_y^∞ θ^{−n−a} e^{−1/(θb)} dθ] / [∫_y^∞ θ^{−n−a−1} e^{−1/(θb)} dθ].
Substitute x = 2/(θb) in each integral, so that θ = 2/(xb), ∣dθ∣ = (2/(x²b)) dx,
and the range θ ∈ (y, ∞) becomes x ∈ (0, 2/(by)). Using
∫_0^c x^{k−1} e^{−x/2} dx = 2^k Γ(k) P(χ²_{2k} < c), this gives
= [(b/2)^{n+a−1} ∫_0^{2/(by)} x^{n+a−2} e^{−x/2} dx] / [(b/2)^{n+a} ∫_0^{2/(by)} x^{n+a−1} e^{−x/2} dx]
= [b^{n+a−1} Γ(n + a − 1) P(χ²_{2(n+a−1)} < 2/(by))] / [b^{n+a} Γ(n + a) P(χ²_{2(n+a)} < 2/(by))]
= [1/(b(n + a − 1))] · P(χ²_{2(n+a−1)} < 2/(by)) / P(χ²_{2(n+a)} < 2/(by)),
since Γ(n + a) = (n + a − 1)Γ(n + a − 1).
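This identity can be verified numerically with R's integrate; the values of n, a, b, y below are assumptions for illustration:
n <- 5; a <- 2; b <- 1; y <- 0.8
num <- integrate(function(th) th^(-n - a)     * exp(-1/(th*b)), y, Inf)$value
den <- integrate(function(th) th^(-n - a - 1) * exp(-1/(th*b)), y, Inf)$value
num / den
(1 / (b * (n + a - 1))) *
  pchisq(2/(b*y), df = 2*(n + a - 1)) / pchisq(2/(b*y), df = 2*(n + a))
# the two printed values agree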
X_i∣θ ∼ f(x∣θ), i = 1, . . . , p
Θ∣γ ∼ π(θ∣γ).
Xk ∼ Bin(n, pk ),
pk ∼ Beta(a, b),
where the K groups are tied together by the common prior distribution.
It is easy to show that the Bayes estimator of p_k under squared error loss is
E(p_k ∣ x_k, a, b) = (a + x_k)/(a + b + n).
Suppose now that we are told that a, b are unknown and we wish to estimate
them using EB. We first calculate
m(x∣a, b) = ∫_0^1 ⋯ ∫_0^1 ∏_{k=1}^{K} \binom{n}{x_k} p_k^{x_k}(1 − p_k)^{n−x_k} × [Γ(a + b)/(Γ(a)Γ(b))] p_k^{a−1}(1 − p_k)^{b−1} dp_k
= ∫_0^1 ⋯ ∫_0^1 ∏_{k=1}^{K} \binom{n}{x_k} [Γ(a + b)/(Γ(a)Γ(b))] p_k^{x_k+a−1}(1 − p_k)^{n−x_k+b−1} dp_k
= ∏_{k=1}^{K} \binom{n}{x_k} [Γ(a + b)Γ(a + x_k)Γ(n − x_k + b)] / [Γ(a)Γ(b)Γ(a + b + n)],
which is a product of beta-binomials. Although the MLEs of a and b aren't
expressible in closed form, they can be calculated numerically to construct
the EB estimator
δ̂^{EB}(x) = (â + x_k)/(â + b̂ + n).
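Since the notes don't include code for this step, here is a minimal R sketch of the numerical MLE computation; the data, K, and n below are assumptions for illustration:
set.seed(1)
K <- 20; n <- 25                                 # number of groups and trials (assumed)
x <- rbinom(K, size = n, prob = rbeta(K, 3, 5))  # simulated counts (assumed)
# log m(x | a, b): a sum of log beta-binomial terms, via base R's lchoose/lbeta
neg.log.m <- function(par) {
  a <- exp(par[1]); b <- exp(par[2])             # log scale keeps a, b > 0
  -sum(lchoose(n, x) + lbeta(a + x, b + n - x) - lbeta(a, b))
}
fit <- optim(c(0, 0), neg.log.m)
a.hat <- exp(fit$par[1]); b.hat <- exp(fit$par[2])
(a.hat + x) / (a.hat + b.hat + n)                # the EB estimates of the p_k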
We’ll derive the posterior predictive distribution for the discrete case (θ is
discrete). It’s the same for the continuous case, with the sums replaced with
integrals.
Consider
p(ỹ∣y) = p(ỹ, y)/p(y)
= ∑_θ p(ỹ, y, θ) / p(y)
= ∑_θ p(ỹ∣y, θ) p(y, θ) / p(y)
= ∑_θ p(ỹ∣y, θ) p(θ∣y).
In most contexts, if θ is given, then ỹ∣θ is independent of y, i.e., the value of
θ determines the distribution of ỹ, without needing to also know y. When
this is the case, we say that ỹ and y are conditionally independent given θ.
Then the above becomes
p(ỹ∣y) = ∑ p(ỹ∣θ)p(θ∣y).
θ
Theorem 2.6. Suppose p(x) is a pdf that looks like p(x) = c f(x), where c
is a constant and f is a continuous function of x. Since
∫ p(x) dx = ∫ c f(x) dx = 1,
then
∫ f(x) dx = 1/c.
Example 2.16: Human males have one X-chromosome and one Y-chromosome,
whereas females have two X-chromosomes, each chromosome being inherited
from one parent. Hemophilia is a disease that exhibits X-chromosome-linked
recessive inheritance, meaning that a male who inherits the gene that causes
the disease on the X-chromosome is affected, whereas a female carrying the
gene on only one of her X-chromosomes is not affected. The disease is gener-
ally fatal for women who inherit two such genes, and this is very rare, since
the frequency of occurrence of the gene is very low in human populations.
Consider a woman who has an affected brother (xY), which implies that
her mother must be a carrier of the hemophilia gene (xX). We are also told
that her father is not affected (XY), thus the woman herself has a fifty-fifty
chance of having the gene.
Let θ denote the state of the woman. It can take two values: the woman is
a carrier (θ = 1) or not (θ = 0). Based on this, the prior can be written as
P (θ = 1) = P (θ = 0) = 1/2.
Suppose the woman has a son who does not have hemophilia (S1 = 0). Now
suppose the woman has another son. Calculate the probability that this
second son also will not have hemophilia (S2 = 0), given that the first son
does not have hemophilia. Assume son one and son two are conditionally
independent given θ.
Solution:
First compute
p(θ∣S1 = 0) = p(S1 = 0∣θ)p(θ) / [p(S1 = 0∣θ = 0)p(θ = 0) + p(S1 = 0∣θ = 1)p(θ = 1)]
= (1)(1/2)/[(1)(1/2) + (1/2)(1/2)] = 2/3 if θ = 0,
= (1/2)(1/2)/[(1)(1/2) + (1/2)(1/2)] = 1/3 if θ = 1.
Then
p(S2 = 0∣S1 = 0) = p(S2 = 0∣θ = 0)p(θ = 0∣S1 = 0) + p(S2 = 0∣θ = 1)p(θ = 1∣S1 = 0)
= (1)(2/3) + (1/2)(1/3) = 5/6.
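The same calculation takes a few lines of R:
prior <- c(0.5, 0.5)                      # P(theta = 0), P(theta = 1)
like  <- c(1, 0.5)                        # P(S1 = 0 | theta = 0), P(S1 = 0 | theta = 1)
post  <- prior * like / sum(prior * like) # posterior (2/3, 1/3)
sum(c(1, 0.5) * post)                     # P(S2 = 0 | S1 = 0) = 5/6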
Then
f(x) = \binom{x − 1}{r − 1} p^r (1 − p)^{x−r}, x = r, r + 1, . . . ,
and we say X ∼ Negative Binom(r, p).
X∣λ ∼ Poisson(λ)
λ ∼ Gamma(a, b)
Solution: Recall
p(λ∣x) ∝ p(x∣λ) p(λ)
∝ e^{−λ} λ^x · λ^{a−1} e^{−λ/b}
= λ^{x+a−1} e^{−λ(1+1/b)},
so λ∣x ∼ Gamma(x + a, b/(b + 1)) in the scale parameterization.
p(x̃∣x) = ∫_λ p(x̃∣λ) p(λ∣x) dλ
= ∫_λ [e^{−λ} λ^{x̃}/x̃!] · [λ^{x+a−1} e^{−λ(b+1)/b} / (Γ(x + a)(b/(b+1))^{x+a})] dλ
= [1/(x̃! Γ(x + a)(b/(b+1))^{x+a})] ∫_λ λ^{x̃+x+a−1} e^{−λ(2b+1)/b} dλ
= [1/(x̃! Γ(x + a)(b/(b+1))^{x+a})] Γ(x̃ + x + a)(b/(2b + 1))^{x̃+x+a}.
Then
p(x̃∣x) = \binom{x̃ + x + a − 1}{x̃} p^{x̃} (1 − p)^{x+a}, with p = b/(2b + 1).
Thus,
x̃∣x ∼ Negative Binom(x + a, b/(2b + 1)),
where we are assuming the Negative Binomial distribution as defined in
Wikipedia (and not as defined earlier in the notes).
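As a sanity check, the negative binomial predictive can be compared against direct numerical integration; x, a, b, and the evaluation point below are assumed:
x <- 3; a <- 2; b <- 1.5; xt <- 4
# R's dnbinom counts failures with success probability prob, which matches
# the form above with size = x + a and prob = (b + 1)/(2b + 1)
dnbinom(xt, size = x + a, prob = (b + 1)/(2*b + 1))
integrate(function(l) dpois(xt, l) * dgamma(l, shape = x + a, scale = b/(b + 1)),
          0, Inf)$value   # agrees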
X∣λ ∼ Poisson(λ)
λ ∼ Gamma(a, b).
We are also told 42 moms are observed arriving at the particular hospital
during December 2007. Using prior study information given, we are told
a = 5 and b = 6. (We found a, b by working backwards from a prior mean
of 30 and prior variance of 180).
Solution: The first thing we need to know to do this problem are p(λ∣x) and
p(x̃∣x). We found these in Example 2.17. So,
λ∣x ∼ Gamma(x + a, b/(b + 1)),
and
x̃∣x ∼ Negative Binom(x + a, b/(2b + 1)).
setwd("~/Desktop/sta4930/ch3")
lam = seq(0,100, length=500)
x = 42
a = 5
b = 6
like = dgamma(lam,x+1,scale=1)
prior = dgamma(lam,5,scale=6)
post = dgamma(lam,x+a,scale=b/(b+1))
pdf("preg.pdf", width = 5, height = 4.5)
plot(lam, post, xlab = expression(lambda), ylab= "Density", lty=2, lwd=3, type="l")
lines(lam,like, lty=1,lwd=3)
lines(lam,prior, lty=3,lwd=3)
legend(70,.06,c("Prior", "Likelihood","Posterior"), lty = c(2,1,3),
lwd=c(3,3,3))
dev.off()
In the first part of the code, we plot the prior, likelihood, and posterior.
This should be self-explanatory since we have already done a similar example.
Being Objective
The problems we have dealt with all semester have been very simple in
nature: we have only had one parameter to estimate (except for one
example). However, in dealing with real-life problems you may run into
various difficulties.
Think about a more complex problem such as the following (we looked at
this problem in Chapter 1):
X∣θ ∼ N (θ, σ 2 )
θ∣σ 2 ∼ N (µ, τ 2 )
σ 2 ∼ IG(a, b)
where now θ and σ² are both unknown and we must find the posterior
distributions of θ∣X, σ² and σ²∣X. For this slightly more complex problem,
it is much harder to think about what values µ, τ², a, b should take for a
particular problem. What should we do in these types of situations?
• If the prior is improper, you must check that the posterior is proper.
◯ Meaning Of Flat
What does a “flat prior” really mean? People often abuse the word “flat,”
using it interchangeably with “noninformative.” Let's talk about what people
really mean when they use the term “flat,” since it can have different meanings.
Example 3.3: Often statisticians will refer to a prior as being flat, when
a plot of its density actually looks flat, i.e., uniform. An example of this
would be taking such a prior to be
θ ∼ Unif(0, 1).
We can plot the density of this prior to see that the density is flat.
[Figure: density of the Uniform(0,1) prior, which is flat at height 1.]
In this example, it can be shown that pJ(θ) ∝ Beta(1/2, 1/2). Let's consider
the plot of this prior. “Flat” here is a purely abstract idea: in order to achieve
objective inference, the prior needs to compensate more for values on the
boundary than for values in the middle.
θ ∼ N (0, 1000).
[Figure: density of the N(0, 1000) prior, which appears flat over a wide range of θ.]
Laplace famously considered the question of the probability that the sun
will rise tomorrow. He answered this question using the following Bayesian
analysis:
• Let X represent the number of days the sun rises. Let p be the prob-
ability the sun will rise tomorrow.
Then
π(p∣x) ∝ \binom{n}{x} p^x (1 − p)^{n−x} · 1
∝ p^{x+1−1}(1 − p)^{n−x+1−1}.
This implies
p∣x ∼ Beta(x + 1, n − x + 1).
Then
p̂ = E[p∣x] = (x + 1)/(x + 1 + n − x + 1) = (x + 1)/(n + 2).
Since the sun has risen every day in recorded history, x = n, and Laplace's
estimate for the probability that the sun rises tomorrow is (n + 1)/(n + 2),
where n is the total number of days recorded in history.
For instance, if so far we have encountered 100 days in the history of our
universe, this would say that the probability the sun will rise tomorrow
is 101/102 ≈ 0.9902. However, we know that this calculation is ridiculous.
Here, we have extremely strong subjective information (the laws of physics)
that says it is extremely likely that the sun will rise tomorrow. Thus, ob-
jective Bayesian methods shouldn’t be recklessly applied to every problem
we study—especially when subjective information this strong is available.
The Uniform prior of Bayes and Laplace has been criticized for many
different reasons. We will discuss one important reason for criticism and not
go into the others, since they go beyond the scope of this course.
Jeffreys’ Prior
What does the invariance principle mean? Suppose our prior parameter is
θ, but we would like to transform to φ = f(θ). Jeffreys' prior says that if θ
has the distribution specified by Jeffreys' prior for θ, then φ = f(θ) will have
the distribution specified by Jeffreys' prior for φ. We will clarify by going
over two examples to illustrate this idea.
Note, for example, that if θ has a Uniform prior, then one can show that
φ = f(θ) will not have a Uniform prior (unless f is the identity function).
Aside from the invariance property of Jeffreys’ prior, in the univariate case,
Jeffreys’ prior satisfies many optimality criteria that statisticians are inter-
ested in.
Here
I(θ) = −E[∂² log p(X∣θ)/∂θ²]
is called the Fisher information. Then Jeffreys' prior is defined to be
pJ(θ) = √(I(θ)).
Let φ = θ². Then θ = √φ. It follows that
∂θ/∂φ = 1/(2√φ).
Thus, p(φ) = 1/(2√φ), 0 < φ < 1, which shows that φ is not Uniform on (0, 1).
Hence, the Uniform prior is not invariant under this transformation.
Criticism such as this led to the consideration of Jeffreys' prior.
Example 3.9: (Jeffreys’ Prior Invariance Example)
Suppose
X∣θ ∼ Exp(θ).
One can show using calculus that I(θ) = 1/θ². Then pJ(θ) = 1/θ. Suppose
that φ = θ². It follows that
∂θ/∂φ = 1/(2√φ).
Then
pJ(φ) = pJ(√φ) ∣∂θ/∂φ∣ = (1/√φ)(1/(2√φ)) ∝ 1/φ.
Hence, we have shown for this example, that Jeffreys’ prior is invariant under
the transformation φ = θ2 .
Example 3.10: (Jeffreys' prior) Suppose X∣θ ∼ Binomial(n, θ), for which
pJ(θ) ∝ θ^{−1/2}(1 − θ)^{−1/2}, i.e., a Beta(1/2, 1/2) distribution.
[Figure 3.4: Jeffreys' Beta(1/2, 1/2) prior compared with the flat Beta(1, 1) prior.]
Figure 3.4 compares the prior density πJ (θ) with that for a flat prior, which
is equivalent to a Beta(1,1) distribution.
Note that in this case the prior is inversely proportional to the standard
deviation. Why does this make sense?
We see that the data has the least effect on the posterior when the true θ =
0.5, and has the greatest effect near the extremes, θ = 0 or 1. Jeffreys’ prior
compensates for this by placing more mass near the extremes of the range,
where the data has the strongest effect. We could get the same effect by
(for example) letting the prior be π(θ) ∝ 1/Varθ instead of π(θ) ∝ 1/[Varθ]^{1/2}.
Thus, θ∣x ∼ Beta(x + 1/2, n − x + 1/2), which is a proper posterior since the
prior is proper.
Limitations of Jeffreys’
Jeffreys’ priors work well for single-parameter models, but not for models
with multidimensional parameters. By analogy with the one-dimensional
case, one might construct a naive Jeffreys prior as the joint density:
πJ(θ) = ∣I(θ)∣^{1/2},
where ∣·∣ denotes the determinant and the (i, j)th element of the Fisher
information matrix is given by
I(θ)_{ij} = −E[∂² log p(X∣θ)/(∂θ_i ∂θ_j)].
Let’s see what happens when we apply a Jeffreys’ prior for θ to a multivariate
Gaussian location model. Suppose X ∼ Np (θ, I), and we are interested in
performing inference on ∣∣θ∣∣2 . In this case the Jeffreys’ prior for θ is flat. It
turns out that the posterior has the form of a non-central χ2 distribution
with p degrees of freedom. The posterior mean given one observation of X
is E(∣∣θ∣∣2 ∣X) = ∣∣X∣∣2 + p. This is not a good estimate because it adds p to
the square of the norm of X, whereas we might normally want to shrink
our estimate towards zero. By contrast, the minimum variance frequentist
estimate of ∣∣θ∣∣2 is ∣∣X∣∣2 − p.
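A quick simulation illustrates the first claim: under the flat prior, θ∣X ∼ Np(X, I), so the posterior mean of ∣∣θ∣∣² can be checked directly (a sketch; p and the data are assumed):
set.seed(1)
p <- 10
X <- rnorm(p)                                   # one observation from N_p(theta, I)
theta.draws <- matrix(rnorm(1e5 * p, mean = X), ncol = p, byrow = TRUE)
mean(rowSums(theta.draws^2))                    # approximately ||X||^2 + p
sum(X^2) + p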
Haldane’s Prior
Finally, we need to check that our posterior is proper. Recall that the
parameters of the Beta need to be positive. Thus, y > 0 and n − y > 0. This
means that y ≠ 0 and y ≠ n in order for the posterior to be proper.
There are many other objective priors that are used in Bayesian inference,
however, this is the level of exposure that we will cover in this course. If
you’re interested in learning more about objective priors (g-prior, probability
matching priors), see me and I can give you some references.
Reference priors were proposed by Jose Bernardo in a 1979 paper, and fur-
ther developed by Jim Berger and others from the 1980s through the present.
They are credited with bringing about an objective Bayesian renaissance;
an annual conference is now devoted to the objective Bayesian approach.
For one-dimensional parameters, it will turn out that reference priors and
Jeffreys’ priors are equivalent. For multidimensional parameters, they dif-
fer. One might ask, how can we choose a prior to maximize the divergence
between the posterior and prior, without having seen the data first? Refer-
ence priors handle this by taking the expectation of the divergence, given a
model distribution for the data. This sounds superficially like a frequentist
approach—basing inference on imagined data. But once the prior is chosen
based on some model, inference proceeds in a standard Bayesian fashion.
(This contrasts with the frequentist approach, which continues to deal with
imagined data even after seeing the real data!)
◯ Laplace Approximation
For example, when g(θ) = 1, the integral reduces to the marginal likelihood
of x. The posterior mean requires evaluation of two integrals ∫ θf (x∣θ)π(θ) dθ
and ∫ f (x∣θ)π(θ) dθ. Laplace’s method is a technique for approximating in-
tegrals when the integrand has a sharp maximum.
Now let t = √(nc)(θ − θ̂). This implies that dθ = dt/√(nc). Hence,
I ≈ [q(θ̂)e^{nh(θ̂)}/√(nc)] ∫_{−δ√(nc)}^{δ√(nc)} [1 + t q′(θ̂)/(√(nc) q(θ̂)) + t² q″(θ̂)/(2nc q(θ̂))] e^{−t²/2} dt
≈ [q(θ̂)e^{nh(θ̂)}/√(nc)] √(2π) [1 + 0 + q″(θ̂)/(2nc q(θ̂))]
≈ [q(θ̂)e^{nh(θ̂)}/√(nc)] √(2π) [1 + O(1/n)]
≈ [q(θ̂)e^{nh(θ̂)}/√(nc)] √(2π).
First, we give a few definitions from probability theory (you may have seen
these before) and we will be informal about these.
Xn = o(rn) as n → ∞
means that
Xn/rn → 0.
Similarly,
Xn = O(rn) as n → ∞
means that
Xn/rn is bounded.
The in-probability versions are analogous:
Xn = o_p(rn) if Xn/rn → 0 in probability,
and
Xn = O_p(rn) if Xn/rn is bounded in probability.
This argument given by J.K. Ghosh will be used to derive reference priors. It
can be used in many other theoretical proofs in Bayesian theory. If interested
in seeing these, please refer to his book for details as listed on the syllabus.
Please note that below I am hand waving over some of the details regarding
analysis that are important but not completely necessary to grasp the basic
concept here.
Step 1: Consider a proper prior π̄(⋅) for θ such that the support of π̄(⋅)
is a compact rectangle in the parameter space and π̄(⋅) vanishes on the
boundary of the support, while remaining positive on the interior. Consider
the posterior of θ under π̄(⋅) and hence obtain E π̄ [q(X, θ)∣x].
Step 2: Find Eθ E π̄ [q(x, θ)∣x] = λ(θ) for θ in the interior of the support
of π̄(⋅).
Step 3: Integrate λ(⋅) with respect to π̄(⋅) and then allow π̄(⋅) to converge
to the degenerate prior at the true value of θ (say θ0 ) supposing that the
true θ is an interior point of the support of π̄(⋅). This yields Eθ [q(X, θ)].
◯ Reference Priors
We seek the prior π(θ) that maximizes the expected divergence between
the posterior and the prior,
E[log(π(θ∣x)/π(θ))].
First write
E[log(π(θ∣x)/π(θ))] = ∫∫ [log(π(θ∣x)/π(θ))] π(θ∣x) m(x) dx dθ
= ∫∫ [log(π(θ∣x)/π(θ))] f(x∣θ) π(θ) dx dθ
= ∫ π(θ) E[log(π(θ∣x)/π(θ)) ∣ θ] dθ.
Consider E[log(π(θ∣x)/π(θ)) ∣ θ] = E[log π(θ∣x) ∣ θ] − log π(θ).
Then by iterated expectation,
E[log(π(θ∣x)/π(θ))] = ∫ π(θ) {E[log π(θ∣x) ∣ θ] − log π(θ)} dθ
= ∫ E[log π(θ∣x) ∣ θ] π(θ) dθ − ∫ log π(θ) π(θ) dθ. (3.1)
and we will use Step 1 of the Shrinkage argument of J.K. Ghosh to find
E π̄ [q(X, θ)∣x].
noting that the denominator takes the form of a constant times the integral
of a normal density with variance Î_n^{−1}. Hence,
π(θ∣x) = [√n Î_n^{1/2}/√(2π)] exp(−(1/2) t² Î_n) [1 + O(n^{−1/2})]. (3.2)
Then
log π(θ∣x) = (1/2) log n − log √(2π) − (1/2) t² Î_n + (1/2) log Î_n + log[1 + O(n^{−1/2})].
Now consider
E^π̄ log π(θ∣x) = (1/2) log n − log √(2π) − E^π̄[(1/2) t² Î_n] + (1/2) log Î_n + log[1 + O(n^{−1/2})].
To evaluate E^π̄[(1/2) t² Î_n], note that (3.2) states that, up to order n^{−1/2},
π(t∣x) is approximately normal with mean zero and variance Î_n^{−1}. Since this
does not depend on the form of the prior π, it follows that π̄(t∣x) is also
approximately normal with mean zero and variance Î_n^{−1}, again up to order
n^{−1/2}. Then E^π̄[(1/2) t² Î_n] = 1/2, which implies that
E^π̄ log π(θ∣x) = (1/2) log n − log √(2π) − 1/2 + log Î_n^{1/2} + log[1 + O(n^{−1/2})]
= (1/2) log n − log √(2πe) + log Î_n^{1/2} + O(n^{−1/2}).
Combining with (3.1), the expected divergence is
(1/2) log n − log √(2πe) + ∫ log{[I(θ)]^{1/2}/π(θ)} π(θ) dθ + O(n^{−1/2}).
The integral is non-positive and is maximized when it equals 0, i.e., when
π(θ) = I^{1/2}(θ): Jeffreys' prior.
Take away: If there are no nuisance parameters, Jeffreys’ prior is the refer-
ence prior.
Multiparameter generalization
E[log(π(θ∣x)/π(θ))] = (p/2) log n − (p/2) log(2πe) + ∫ log(∣I(θ)∣^{1/2}/π(θ)) π(θ) dθ + O(n^{−1/2}).
Note that this is maximized when π(θ) = ∣I(θ)∣^{1/2}, meaning that Jeffreys'
prior maximizes the expected divergence between the prior and posterior.
In the presence of nuisance parameters, things change considerably.
Begin with π(θ2∣θ1) = ∣I22(θ)∣^{1/2} c(θ1), where c(θ1) is the constant that makes
this distribution a proper density. Now try to maximize
E[log(π(θ1∣x)/π(θ1))],
where, as before,
E[log(π(θ∣x)/π(θ))] = (p/2) log n − (p/2) log 2πe + ∫ π(θ) log(∣I(θ)∣^{1/2}/π(θ)) dθ + O(n^{−1/2}). (3.4)
Similarly, we find that
E[log(π(θ1∣x)/π(θ1))] = (p1/2) log n − (p1/2) log 2πe + ∫ π(θ) log ∣I11.2(θ)∣^{1/2} dθ
− ∫ π(θ) log π(θ1) dθ + O(n^{−1/2})
= (p1/2) log n − (p1/2) log 2πe + ∫ π(θ1) [∫ π(θ2∣θ1) log ∣I11.2(θ)∣^{1/2} dθ2] dθ1
− ∫ π(θ1) log π(θ1) dθ1 + O(n^{−1/2})
= (p1/2) log n − (p1/2) log 2πe + ∫ π(θ1) log(ψ(θ1)/π(θ1)) dθ1 + O(n^{−1/2}),
where ψ(θ1) = exp{∫ π(θ2∣θ1) log ∣I11.2(θ)∣^{1/2} dθ2}.
To maximize the integral above, we choose π(θ1) = ψ(θ1). Note that
I11.2(θ) = I11(θ) − I12(θ) I22^{−1}(θ) I21(θ);
equivalently, [I11.2(θ)]^{−1} is the (1, 1) block of I^{−1}(θ).
π(θ1) = exp{∫ π(θ2∣θ1) log ∣I11.2(θ)∣^{1/2} dθ2} = exp{∫ c(θ1)∣I22(θ)∣^{1/2} log ∣I11.2(θ)∣^{1/2} dθ2}.
2ci2 −1 i 2ci2
Then π(σ∣µ) = , i ≤ σ ≤ i. Consider ∫i−1 = 1 Ô⇒ ci2 =
σ σ
1 1 1 −1
√ . Thus, π(σ∣µ) = , i ≤ σ ≤ i. Now find π(µ). Observe that
2 2 ln i 2 ln iσ
Recall that
√
−1 2i 1
I11.2 = I11 − I12 I22 I21 = 1/σ . Thus, π(µ) =
2
log( )} dσ +
exp{∫i−1 ci2
σ σ
π(µ, σ)
constant = c. We want to find π(µ, σ). We know that π(σ∣µ) = Ô⇒
π(µ)
c 1 1
π(µ, σ) = π(µ)π(σ∣µ) = ∝ .
2 ln i σ σ
Problems with Reference Priors See page 128 of the little green book.
Bernardo and Berger (1992) suggest
We have spent just a short amount of time covering objective Bayesian pro-
cedures, but already we have seen how each is flawed in some sense. As
Fienberg (2009) points out in a review article (see the webpage), there are
two main parts of being Bayesian: the prior and the likelihood. What is
Fienberg's point? His claim is that robustness should be carried through at
both levels of the model. That is, we should care about subjectivity of the
likelihood as well as the prior.
mattered more. In this case, the posterior odds for several papers shifted
when Mosteller and Wallace used a negative binomial versus a Poisson for
word counts.
My favorite part that Fienberg illustrates in this paper is the view that “ob-
jective Bayes is like the search for the Holy Grail.” He mentions that Good
(1972) once wrote that there are “46,656 Varieties of Bayesians,” which was
a number that he admitted exceeded the number of professional statisticians
during that time. Today? There seem to be just as many varieties of
objective Bayes, each trying to arrive at the perfect choice of an objective
prior. Each seems to fail on some foundational principle. For example,
Eaton and Freedman (2004) explain why you shouldn't use Jeffreys' prior
for the normal covariance matrix. We didn’t look at intrinsic priors, but they
have been criticized by Fienberg for contingency tables because of their de-
pendence on the likelihood function and because of bizarre properties when
extended to deal with large sparse tables.
Evaluating Bayesian
Procedures
They say statistics are for losers, but losers are usually the ones saying that.
—Urban Meyer
One major difference between Bayesians and frequentists is how they inter-
pret intervals. Let’s quickly review what a frequentist confidence interval is
and how to interpret one.
Assumptions
In lower-level classes, you wrote down assumptions whenever you did con-
fidence intervals. This is redundant for any problem we construct in this
course since we always know the data is randomly distributed and we assume
it comes from some underlying distribution, say Normal, Gamma, etc. We
also always assume our observations are i.i.d. (independent and identically
distributed), meaning that the observations are all independent and all
follow the same distribution. Thus, when working a particular problem, we
will assume these assumptions are satisfied given the proposed model holds.
Important Point
Our definition for the credible interval could lead to many choices of (a, b)
for particular problems.
Suppose that we required our credible interval to have equal probability α/2
in each tail. That is, we will assume
P (θ < a∣x) = α/2
and
P (θ > b∣x) = α/2.
Is the credible interval still unique? No. Consider
π(θ∣x) = I(0 < θ < 0.025) + I(1 < θ < 1.95) + I(3 < θ < 3.025)
so that the density has three separate plateaus. Now notice that any (a, b)
such that 0.025 < a < 1 and 1.95 < b < 3 satisfies the proposed definition of
an ostensibly “unique” credible interval. To fix this, we can simply require
that
{θ ∶ π(θ∣x) is positive}
(i.e., the support of the posterior) must be an interval.
This greatly contrasts with the usual frequentist CI, for which the
corresponding statement is something like “If we could recompute C for a
large number of hypothetical repetitions of the experiment, about 95% of the
computed intervals would contain θ.”
[Figure: posterior density of θ∣x, with probability 0.95 between a and b and
0.025 in each tail.]
Interpretation
Comparisons
X1, . . . , Xn ∣ θ ∼ N(θ, σ²)
θ ∼ N(µ, τ²),
Recall
θ∣x1, . . . , xn ∼ N((nx̄τ² + µσ²)/(nτ² + σ²), (σ²τ²)/(nτ² + σ²)).
Let
µ* = (nx̄τ² + µσ²)/(nτ² + σ²),
σ*² = (σ²τ²)/(nτ² + σ²).
Thus, we now must find an a such that P(Z < (a − µ*)/σ* ∣ x1, . . . , xn) = 0.025.
From a Z-table, we know that
(a − µ*)/σ* = −1.96.
This tells us that a = µ∗ − 1.96σ ∗ . Similarly, b = µ∗ + 1.96σ ∗ . (Work this part
out on your own at home). Therefore, a 95% credible interval is
µ∗ ± 1.96σ ∗ .
Using data (trees.txt) we have, we will calculate the 95% credible interval
and confidence interval for θ. In R we first read in the data file trees.txt.
We then set the initial values for our known parameters, n, σ, µ, and τ.
Next, we refer to Example 4.1, and calculate the values of µ∗ and σ ∗ using
this example. Finally, again referring to Example 4.1, we recall that the
formula for a 95% credible interval here is
µ∗ ± 1.96σ ∗ .
On the other hand, recalling back to any basic statistics course, a 95%
confidence interval in this situation is
√
x̄ ± 1.96σ/ n.
From the R code, we find that there is a 95% probability that the average
number of ornaments per tree is in (45.00, 57.13) given the data. We also
find that we are 95% confident that the average number of ornaments per
tree is contained in (43.80, 56.20). If we compare the width of each interval,
we see that the credible interval is slightly narrower. It is also shifted towards
slightly higher values than the confidence interval for this data, which makes
sense because the prior mean was higher than the sample mean. What would
happen to the width of the intervals if we increased n? Does this make sense?
x = read.table("trees.txt",header=T)
attach(x)
n = 10
sigma = 10
mu = 75
tau = 15
mu.star = (n*mean(orn)*tau^2+mu*sigma^2)/(n*tau^2+sigma^2)
sigma.star = sqrt((sigma^2*tau^2)/(n*tau^2+sigma^2))
(cred.i = mu.star+c(-1,1)*qnorm(0.975)*sigma.star)
(conf.i = mean(orn)+c(-1,1)*qnorm(0.975)*sigma/sqrt(n))
diff(cred.i)
diff(conf.i)
detach(x)
Suppose that the prior on θ was Beta(3.3, 7.2). Thus, the posterior
distribution is θ∣x ∼ Beta(x + 3.3, n − x + 7.2), where x is the number of
students (out of n) who sleep eight or more hours per night.
Suppose now we would like to find a 90% credible interval for θ. We cannot
compute this in closed form, since computing probabilities for Beta
distributions involves incomplete beta integrals with no simple closed form.
However, we can use R to find the interval.
We need to solve
P (θ < c∣x) = 0.05
and
P (θ > d∣x) = 0.05 for c and d.
a = 3.3
b = 7.2
n = 27
x = 11
a.star = x+a
b.star = n-x+b
c = qbeta(0.05,a.star,b.star)
d = qbeta(1-0.05,a.star,b.star)
Running the code in R, we find that a 90% credible interval for θ is (0.256, 0.514),
meaning that there is a 90% probability that the proportion of UF students
who sleep eight or more hours per night is between 0.256 and 0.514 given
the data.
Then θ∣y ∼ N(µ1, τ1²), where we have derived these before. The HPD credible
interval for θ is simply µ1 ± z_{α/2} τ1. Note that the HPD credible interval is
the same as the equal-tailed interval centered at the posterior mean.
In general, plot the posterior distribution and find the HPD credible set.
One important point is that the posterior must be unimodal in order to
guarantee that the HPD credible set is an interval. (Unimodality of the
posterior is a sufficient condition for the credible set to be an interval, but
it’s not necessary.)
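As an illustration, here is a simple grid-based R sketch for finding an HPD interval numerically, using the Beta posterior from the earlier sleep example (the parameters are taken from there):
a.star <- 14.3; b.star <- 23.2                      # posterior Beta(x + a, n - x + b) from before
theta <- seq(0.001, 0.999, length = 1e5)
dens <- dbeta(theta, a.star, b.star)
ord <- order(dens, decreasing = TRUE)
keep <- ord[cumsum(dens[ord]) / sum(dens) <= 0.90]  # highest-density points holding 90% mass
range(theta[keep])                                  # the HPD interval, since the posterior is unimodal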
This posterior distribution is unimodal, but how do we know this? One way
of showing the posterior is unimodal is to show that it is increasing in σ 2
up to a point and then decreasing afterwards. The log of the posterior has
the same feature.
Then
log p(σ²∣y) = c1 − [(n + α)/2 + 1] log(σ²) − (1/(2σ²))(∑_i y_i² + β).
Let’s first review p-values and why they might not make sense in the grand
scheme of things. In classical statistics, the traditional approach proposed
by Fisher, Neyman, and Pearson is where we have a null hypothesis and
an alternative. After determining some test statistic T(y), we compute the
p-value, which is
p-value = P(T(Y) ≥ T(y) ∣ Ho),
the probability, computed under the null hypothesis, of a test statistic at
least as extreme as the one observed.
Clearly, classical statistics has deep roots and a long history. It’s popular
with practitioners, but does it make sense? The approach can be applied in a
straightforward manner only when the two hypotheses in question are nested
(meaning one within the other). This means that Ho must be a simplification
of Ha. Many practical testing problems involve a choice between two or
more models that aren't nested (choosing between quadratic and exponential
growth models, for example).
Another difficulty is that tests of this type can only offer evidence against
the null hypothesis. A small p-value indicates that the latter (alternative)
model has significantly more explanatory power. But a large p-value does
not suggest that the two models are equivalent (only that we lack evidence
that they are not). This limitation/difficulty is often swept under the rug
and never dealt with. We simply say, “we fail to reject the null hypothesis”
and leave it at that.
Finally, one last criticism is that p-values depend not only on the observed
data but also on the total sampling probability of certain unobserved data
points, namely, the more extreme T (Y ) values. Because of this, two exper-
iments with identical likelihoods could result in different p-values if the two
experiments were designed differently. (This violates the Likelihood Prin-
ciple.) See Example 1.1 in Chapter 1 for an illustration of how this can
happen.
Ho: θ ∈ Θo versus Ha: θ ∈ Θ1.
Ho: θ = θo versus Ha: θ ≠ θo.
Ho: θ ≤ θo versus Ha: θ > θo.
A Bayesian talks about posterior odds and Bayes factors.
We’d like to be able to say something about the mean of the IQ scores and
whether it’s below or larger than 100. Then
The prior odds are then πo/π1 = P(θ ≤ 100)/P(θ > 100) = (1/2)/(1/2) = 1, by
symmetry.
Then αo = P(θ ≤ 100∣y = 115) = 0.106 and α1 = P(θ > 100∣y = 115) = 0.894.
Thus, the posterior odds are αo/α1 = 0.1185, and hence BF = posterior odds/prior odds = 0.1185.
Consider
H1 ∶ θ = 1 versus H4 ∶ θ ≠ 1
H2 ∶ θ = 1/2 versus H5 ∶ θ ≠ 1/2
H3 ∶ θ = 0 versus H6 ∶ θ ≠ 0.
Then
Let k ∈ (0.0619, 0.125) and reject if BF01 < k. Then we reject H4 in favor
of H1 . We fail to reject H2 . Thus, failing to reject H2 implies failing to
reject H4 . In this example, evidence in favor of H4 should be stronger than
that of H2 . But the Bayes Factor violates this. Lavine and Schervish refer
to this as a lack of coherence. The problem does not occur with the posterior
odds, since if
P(Θo∣x) < P(Θa∣x)
holds, then
P(Θo∣x)/(1 − P(Θo∣x)) < P(Θa∣x)/(1 − P(Θa∣x)).
(This result can be generalized.)
• Bayes factors are insensitive to the choice of prior, however, this state-
ment is misleading. (Berger, 1995) We will see why in Example 4.5.
Ho ∶ θ = θo Ha ∶ θ = θ1 .
Then πo = P(θ = θo) and π1 = P(θ = θ1), so πo + π1 = 1. Then
αo = P(θ = θo∣y) = P(y∣θ = θo)πo / [P(y∣θ = θo)πo + P(y∣θ = θ1)π1].
This implies that
αo/α1 = (πo/π1) · P(y∣θ = θo)/P(y∣θ = θ1),
and hence BF = P(y∣θ = θo)/P(y∣θ = θ1), which is the likelihood ratio. This
does not depend on the choice of the prior. However, in general the Bayes
factor depends on how the prior spreads mass over the null and alternative
(so Berger's statement is misleading).
Example 4.8:
Ho: θ ∈ Θo versus Ha: θ ∈ Θ1.
Derive the BF. Let go(θ) and g1(θ) be probability density functions such
that ∫_{Θo} go(θ) dθ = 1 and ∫_{Θ1} g1(θ) dθ = 1. Let
π(θ) = πo go(θ) if θ ∈ Θo, and π(θ) = π1 g1(θ) if θ ∈ Θ1.
Then ∫ π(θ) dθ = ∫_{Θo} πo go(θ) dθ + ∫_{Θ1} π1 g1(θ) dθ = πo + π1 = 1.
Bayes factors are meant to compare two or more models; however, often
we are interested in the goodness of fit of a particular model rather than a
comparison between models.
George Box proposed a prior predictive p-value. Suppose that T(x) is a test
statistic and π is some prior. Then we calculate the marginal distribution
m(x) = ∫ f(x∣θ)π(θ) dθ and compute the p-value of T under m.
Suppose the prior π(σ²) is degenerate at σo². Then π(σ² = σo²) = 1. Marginally,
X̄ ∼ N(0, σo²/n)
under Mo. Also,
P(√n∣X̄∣ ≥ √n∣x̄_obs∣) = P(√n∣X̄∣/σo ≥ √n∣x̄_obs∣/σo) = 2Φ(−√n∣x̄_obs∣/σo).
If the guessed σo is much smaller than the actual model variance, then the
p-value is small and the evidence against Mo is overestimated.
Since then, the posterior predictive p-value (PPP) has been proposed by
Rubin (1984), Meng (1994), and Gelman et al. (1996). They propose looking
at the posterior predictive distribution of a future observation x under some
prior π. That is, we calculate the p-value of T under the posterior predictive
distribution p(x̃∣x_obs) = ∫ f(x̃∣θ) π(θ∣x_obs) dθ.
Remark: For details, see the papers. A general criticism by Bayarri and
Berger points out that the procedure involves using the data twice. The
data is used in finding the posterior distribution of θ and also in finding the
posterior predictive p-value. As an alternative, they have suggested using
conditional predictive p-values (CPP). This involves splitting the data into
two parts, say T (X) and U (X). We use U (X) to find the posterior predictive
distribution and T (X) continues to be the test statistic.
Also, Robins, van der Vaart, and Ventura (JASA, 2000) investigate Bayarri
and Berger's claim that, for a parametric model, their conditional and
partial predictive p-values are superior to the parametric bootstrap p-value
and to previously proposed p-values (the prior predictive p-value of Guttman,
1967, and Rubin, 1984, and the discrepancy p-value of Gelman et al., 1995,
1996, and Meng, 1994). Robins et al. note that Bayarri and Berger's
claims of superiority are based on small-sample properties for specific
examples. They investigate large-sample properties and conclude that
asymptotic results confirm the superiority of the conditional predictive and
partial posterior predictive p-values.
Robins et al. (2000) also explore corrections for when these p-values are
difficult to compute. In Section 4 of their paper, they discuss how to modify
the test statistic for the parametric bootstrap p-value, posterior predictive
p-value, and discrepancy p-value. Modifications are made such that they are
asymptotically uniform. They claim that their approach is successful for the
discrepancy p-value (and the authors derive a test based on this). Note: the
discrepancy p-value can be difficult to calculate for complex models.
Consider
P(θ ∉ (U(1), U(4))∣θ) = P(U(1) > θ ∪ U(4) < θ ∣ θ)
= P(U(1) > θ∣θ) + P(U(4) < θ∣θ)
= P(Ui > θ, i = 1, 2, 3, 4∣θ) + P(Ui < θ, i = 1, 2, 3, 4∣θ)
= (0.5)⁴ + (0.5)⁴ = (0.5)³ = 0.125.
Hence, P (θ ∈ (U(1) , U(4) )∣θ) = 0.875, which proves that (U(1) , U(4) ) is a
87.5% confidence interval for θ.
Consider that U(1) = 0.1 and that U(4) = 0.9. The 87.5% probability has
to do with the random interval (U(1) , U(4) ) and not with the particular ob-
served value of (0.1, 0.9).
Let’s do some investigative work! Observe that, for every ui , ui > θ − 0.5.
Hence, u(1) > θ − 0.5, that is, θ < u(1) + 0.5. Similarly, θ > u(4) − 0.5. Hence,
θ ∈ (u(4) − 0.5, u(1) + 0.5). Plugging in u(1) = 0.1 and u(4) = 0.9, obtain
θ ∈ (0.4, 0.6). That is, even though the observed 87.5% confidence interval
is (0.1, 0.9), we know that θ ∈ (0.4, 0.6) with certainty.
Let's now compute an 87.5% centered credible interval. This depends on the
prior for θ. Consider the improper prior p(θ) = 1, θ ∈ ℝ. Observe that θ∣u
has a Uniform distribution on (u(4) − 0.5, u(1) + 0.5). Let a = u(4) − 0.5 and
b = u(1) + 0.5. The centered 87.5% credible interval is (l, u) such that
∫_a^l [1/(b − a)] dx = 2⁻⁴ and ∫_u^b [1/(b − a)] dx = 2⁻⁴.
Hence, l = a + (b − a)/2⁴ and u = b − (b − a)/2⁴. Observe that this interval
is always a subset of (a, b), which we know contains θ for sure.
The following R code generates a barplot for how often the credible interval
captures the correct parameter given different parameter values.
return(mean(apply(samples,1,success)))
}
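Most of this code did not survive in this copy of the notes. A minimal reconstruction consistent with the surviving fragment (the grid of θ values and number of replications are assumptions):
coverage <- function(theta, reps = 1e4) {
  success <- function(u) {
    a <- max(u) - 0.5; b <- min(u) + 0.5            # (a, b) contains theta for sure
    l <- a + (b - a) / 2^4; up <- b - (b - a) / 2^4 # centered 87.5% credible interval
    (theta > l) & (theta < up)
  }
  samples <- matrix(runif(reps * 4, theta - 0.5, theta + 0.5), nrow = reps)
  return(mean(apply(samples, 1, success)))
}
thetas <- seq(-2, 2, by = 0.5)                      # assumed grid of parameter values
barplot(sapply(thetas, coverage), names.arg = thetas)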
[Figure 4.2: barplot of the credible interval's coverage across parameter values.]
The result in Figure 4.2 shows that, in this case, the coverage of the credible
interval seems to be uniform on the parameter space. This is not guaranteed
to always happen! Also, although we constructed an 87.5% credible interval,
the picture suggests that
(U(4) − 0.5 + [1 − (U(4) − U(1))]/2⁴, U(1) + 0.5 − [1 − (U(4) − U(1))]/2⁴)
is somewhere near an 85% confidence interval.
Observe that the wider the gap between U(1) and U(4), the smaller the
region in which θ can lie. In this sense, it would be nice if the interval were
smaller the larger this gap is. The example shows that there exist both
credible and confidence intervals with this property, but this property isn't
achieved by guaranteeing confidence alone.
Chapter 5
Every time I think I know what’s going on, suddenly there’s another layer
of complications. I just want this damned thing solved.
—John Scalzi, The Lost Colony
5.1 A Quick Review of Monte Carlo Methods
cannot deal with infinite bounds in the integral, and even though in-
tegrate can handle infinite bounds, it is fragile and often produces
output that’s not trustworthy (Robert and Casella, 2010).
The generic problem here is to evaluate
E_f[h(X)] = ∫_X h(x) f(x) dx.
The classical way to solve this is to generate a sample (X1, . . . , Xn) from f
and propose as an approximation the empirical average
h̄_n = (1/n) ∑_{j=1}^{n} h(x_j).
Why? It can be shown that h̄n converges a.s. (i.e. for almost every generated
sequence) to Ef [h(X)] by the Strong Law of Large Numbers.
Also, under certain assumptions (which we won't get into; see Casella and
Robert, page 65, for details), the asymptotic variance can be approximated
and then estimated from the sample (X1, . . . , Xn) by
v_n = (1/n²) ∑_{j=1}^{n} [h(x_j) − h̄_n]².
There are examples in Casella and Robert (2010) along with R code for those
that haven’t seen these methods before or want to review them.
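As a minimal sketch, here is a case where the answer is known (h(x) = x², X ∼ N(0, 1), so E_f[h(X)] = 1):
set.seed(1)
m <- 1e4
x <- rnorm(m)
h <- x^2
hbar <- mean(h)                      # the empirical average
vn <- sum((h - hbar)^2) / m^2        # the variance estimate above
c(hbar, sqrt(vn))                    # estimate and its standard error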
◯ Importance Sampling
The variance of the importance sampling estimator Î is estimated by
V̂ar(Î) = (1/n) V̂ar(h(X_i) f(X_i)/g(X_i)).
Example 5.1: Suppose we want to estimate P (X > 5), where X ∼ N (0, 1).
Naive method: Generate n iid standard normals and use the proportion p̂
that are larger than 5.
Solution: Let φo and φθ be the densities of the N(0, 1) and N(θ, 1)
distributions (θ taken around 5 will work). We have
p = ∫ I(u > 5) φo(u) du = ∫ [I(u > 5) φo(u)/φθ(u)] φθ(u) du.
In other words, if
h(u) = I(u > 5) φo(u)/φθ(u),
then p = E_{φθ}[h(U)], which we can estimate by the average of h over a
sample from N(θ, 1).
5.1 A Quick Review of Monte Carlo Methods 95
# Naive method
set.seed(1)
ss <- 100000
x <- rnorm(n=ss)
phat <- sum(x>5)/length(x)                       # no draws exceed 5
sdphat <- sqrt(phat*(1-phat)/length(x))          # gives 0
# IS method
set.seed(1)
y <- rnorm(n=ss, mean=5)                         # draws from the importance density
h <- dnorm(y, mean=0)/dnorm(y, mean=5) * (y>5)   # weights times indicator
mean(h)              # gives 2.865596e-07
sd(h)/sqrt(length(h)) # gives 2.157211e-09
Example 5.2: Let f (x) be the pdf of a N (0, 1). Assume we want to
compute
a = ∫ from −1 to 1 of f (x) dx.
We can use importance sampling to do this calculation. Let g(x) be an
arbitrary pdf:
a = ∫ from −1 to 1 of [f (x)/g(x)] g(x) dx.
• Note that if Y ∼ g, then a = E[ I_{[−1,1]}(Y) f (Y)/g(Y) ].
• The variance of I_{[−1,1]}(Y) f (Y)/g(Y) is minimized by picking
g ∝ I_{[−1,1]}(x)f (x). Nevertheless, simulating from this g is usually
expensive.
5.1 A Quick Review of Monte Carlo Methods 96
• Some g's that are easy to simulate from are the pdfs of the Uniform(−1, 1),
the Normal(0, 1), and a Cauchy with location parameter 0.
• Below is code showing how to get a sample of I_{[−1,1]}(Y) f (Y)/g(Y) for
each of these choices of g.
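A minimal sketch of such code (sample size 1000, with f the N(0, 1) density;
the variable names are ours):
set.seed(1)
n <- 1000
f <- function(x) dnorm(x)                      # target: N(0,1) density
# uniform proposal on (-1, 1)
yu <- runif(n, -1, 1); hu <- (abs(yu) < 1) * f(yu) / dunif(yu, -1, 1)
# Cauchy proposal with location 0
yc <- rcauchy(n);      hc <- (abs(yc) < 1) * f(yc) / dcauchy(yc)
# normal proposal N(0, 1)
yn <- rnorm(n);        hn <- (abs(yn) < 1) * f(yn) / dnorm(yn)
c(var(hu), var(hc), var(hn))                   # compare sample variances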
Figure 5.1 presents histograms for a sample of size 1000 from each of these
distributions. The sample variance of I_{[−1,1]}(Y) f (Y)/g(Y) was, respectively,
0.009, 0.349, and 0.227 (for the uniform, the Cauchy, and the normal).
• The histograms for the Cauchy and the Normal have big bars at 0 because
many draws fall outside [−1, 1]; the uniform proposal never wastes draws,
which is why it attains the lowest variance.
Figure 5.1: Histograms for samples of I_{[−1,1]}(Y) f (Y)/g(Y) when g is, respec-
tively, a uniform, a Cauchy, and a normal pdf.
Often we can sample from µ, but know π(x) only up to a multiplicative
constant, say π(x) = c ℓ(x)µ(x). A typical example is the Bayesian situation,
with µ the prior, ℓ the likelihood, and c the unknown normalizing constant.
Since ∫ π(x) dx = 1, we may write
∫ h(x)π(x) dx = [∫ h(x) c ℓ(x)µ(x) dx] / [∫ c ℓ(x)µ(x) dx]
              = [∫ h(x) ℓ(x)µ(x) dx] / [∫ ℓ(x)µ(x) dx],
so the unknown constant c cancels, and both integrals are expectations under
µ that can be estimated from a sample drawn from µ.
Motivation
Why the choice above for ℓ(x)? In the Bayesian example it is just the
likelihood; more generally one can, for example, take ℓ to be a ratio of priors.
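For instance, here is a minimal self-normalized importance sampling sketch,
with µ a Uniform(0, 1) prior and an assumed likelihood ℓ(θ) = θ³(1 − θ)⁷, so
that π is Beta(4, 8) with mean 1/3:
set.seed(1)
theta <- runif(1e5)              # draws from the prior mu
w <- theta^3 * (1 - theta)^7     # unnormalized weights l(theta)
sum(w * theta) / sum(w)          # estimate of E_pi[theta]; approx 1/3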
◯ Rejection Sampling
Suppose first that l is bounded and is zero outside of [0, 1]. Suppose also l
is constant on the intervals ((j − 1)/k, j/k), j = 1, . . . , k. Let M be such that
M ≥ l(x) for all x. Procedure:
1. Generate a point (U1, U2) uniformly at random from the rectangle
[0, 1] × [0, M].
2. If the point is below the graph of the function l (that is, U2 ≤ l(U1)),
retain U1. Else, reject the point and go back to (1).
Remark: Think about what this is doing: we're generating many draws that
are wasted. Think also about the restriction to [0, 1] and whether this makes
sense.
General Case:
Suppose the density g is such that for some known constant M, M g(x) ≥ l(x)
for all x. Procedure:
1. Generate X ∼ g, and calculate r(X) = l(X) / (M g(X)).
2. Flip a coin with probability of success r(X). If we have a success,
retain X. Else return to (1).
To show that an accepted point has distribution π, write l(x) = π(x)/c and
let I be the indicator that the point is accepted. Then
P (I = 1) = ∫ P (I = 1 ∣ X = x) g(x) dx = ∫ [π(x)/c] / [M g(x)] · g(x) dx = 1/(cM),
and hence the density of an accepted point is
g(x ∣ I = 1) = g(x) · { [π(x)/c] / [M g(x)] } / P (I = 1) = π(x).
## Doing accept-reject: target Beta(a, b), candidate N(m, s)
## (a, b, m, s are assumed to have been defined earlier)
set.seed(1); nsim <- 1e5
x <- rnorm(n=nsim, mean=m, sd=s)   # draws from the candidate g
u <- runif(n=nsim)
## acceptance ratio l(x)/(M g(x)), with M = 1.3
ratio <- dbeta(x, shape1=a, shape2=b) /
  (1.3*dnorm(x, mean=m, sd=s))
ind <- (u < ratio)
betas <- x[ind]
# as a check to make sure we have enough
length(betas) # gives 76836
Figure: histograms of the draws and a kernel density estimate of the accepted
sample betas.
5.2 Introduction to Gibbs and MCMC
Geman and Geman (1984) introduced Gibbs sampling for simulating a mul-
tivariate probability distribution p(x) using a random walk on a vector x,
where p(x) is not necessarily a posterior density.
• If n0 is large, then Xn0 , Xn0+1 , . . . all have the distribution π and these
can be used to estimate π and ∫ h(x)π(x)dx.
Two problems:
1. How large must n0 be, i.e., how do we know the chain has (approximately)
reached π?
2. The random variables Xn0 , Xn0+1 , . . . are NOT independent; they may
be correlated.
Condition 1.) defines a Markov transition function, and condition 2.) is the
Markov property, which says "where I'm going next only depends on where I
am right now."
Coming back to the MCMC method, we fix a starting point x0 and generate
an observation X1 from P (x0 , ⋅), an observation X2 from P (X1 , ⋅), etc. This
generates the Markov chain x0 = X0 , X1 , X2 , . . . .
The theory of Markov chains provides various results about the existence
and uniqueness of stationary distributions, but such results are beyond the
scope of this course. However, one specific result is that under fairly general
conditions that are typically satisfied in practice, if a stationary distribu-
tion f exists, then f is the limiting distribution of {X (t)} for almost
any initial value or distribution of X (0). This property is called ergodicity.
From a simulation point of view, it means that if a given kernel K produces
an ergodic Markov chain with stationary distribution f , generating a chain
from this kernel will eventually produce simulations that are approximately
from f.
(1/M) ∑_{t=1}^{M} h(X^{(t)}) → Ef [h(X)] as M → ∞.
This means that the LLN lies at the basis of Monte Carlo methods which
can be applied in MCMC settings. The result shown above is called the
Ergodic Theorem.
Now we turn to Gibbs. The name Gibbs sampling comes from a paper by
Geman and Geman (1984), which first applied a Gibbs sampler on a Gibbs
random field; the name stuck from there. It's actually a special case of a
more general Markov chain Monte Carlo (MCMC) method called Metropolis-
Hastings, which we will hopefully get to. We'll start by studying the simple
case of the two-stage sampler and then look at the multi-stage sampler.
The two-stage Gibbs sampler creates a Markov chain from a joint distribu-
tion. Suppose we have two random variables X and Y with joint density
f (x, y). They also have respective conditional densities fY ∣X and fX∣Y . The
two-stage sampler generates a Markov chain {(Xt , Yt )} according to the
following steps:
1. Xt ∼ fX∣Y (⋅∣yt−1 )
2. Yt ∼ fY ∣X (⋅∣xt ).
As long as we can write down both conditionals (and simulate from them),
it is easy to implement the algorithm above.
Recall that if
(X, Y) ∼ N2( (µX , µY), [ σX²  ρσXσY ; ρσXσY  σY² ] ),
then
Y ∣ X = x ∼ N ( µY + ρ (σY/σX)(x − µX), σY²(1 − ρ²) ).
Suppose we calculate the Gibbs sampler just given the starting point (x0 , y0 ).
Since this is a toy example, let’s suppose we only care about X. Note that
we don’t really need both components of the starting point, since if we pick
x0 , we can generate Y0 from fY ∣X (⋅∣x0 ).
Then E[X1] = E[E[X1 ∣ Y0]] = E[ρY0] = ρ²x0 and
Var[X1] = E[Var[X1 ∣ Y0]] + Var[E[X1 ∣ Y0]] = 1 − ρ⁴.
Then
X1 ∼ N (ρ²x0 , 1 − ρ⁴).
We want the unconditional distribution of X2 eventually, so we need to update
(X2 , Y2). We need Y1 so we can generate X2 ; since we only care about X, we
can use the conditional distribution formula to find that
Y1 ∣ X1 = x1 ∼ N (ρx1 , 1 − ρ²). Then using iterated expectation and iterated
variance, we can show that
X2 ∼ N (ρ⁴x0 , 1 − ρ⁸),
and, in general,
Xn ∼ N (ρ^{2n}x0 , 1 − ρ^{4n}).
(To see this, iterate a few times and find the pattern.) What happens as
n → ∞? Approximately,
Xn ∼ N (0, 1).
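A minimal R sketch of this two-stage sampler (taking µX = µY = 0,
σX = σY = 1, and assumed values ρ = 0.9 and x0 = 10):
set.seed(1)
rho <- 0.9; nsim <- 5000
x <- y <- numeric(nsim)
x[1] <- 10                                        # starting point x0
y[1] <- rnorm(1, rho*x[1], sqrt(1 - rho^2))       # Y0 | X0 = x0
for (t in 2:nsim) {
  x[t] <- rnorm(1, rho*y[t-1], sqrt(1 - rho^2))   # X_t | Y_{t-1}
  y[t] <- rnorm(1, rho*x[t],  sqrt(1 - rho^2))    # Y_t | X_t
}
c(mean(x[-(1:100)]), var(x[-(1:100)]))            # approx 0 and 1 after burn-in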
For the Beta-Binomial model, the joint density of (X, θ) is
f (x, θ) = (n choose x) [Γ(a + b)/(Γ(a)Γ(b))] θ^{x+a−1} (1 − θ)^{n−x+b−1}.
gibbs_betabin <- function(nsim, nn, aa, bb)   # assumed signature
{
  tt <- xx <- numeric(nsim)
  tt[1] <- rbeta(1, aa, bb)                   # initialize from the prior
  xx[1] <- rbinom(1, nn, tt[1])
  for(ii in 2:nsim)
  {
    tt[ii] <- rbeta(1,aa+xx[ii-1],bb+nn-xx[ii-1])
    xx[ii] <- rbinom(1,nn,tt[ii])
  }
  return(list(beta_bin=xx,beta=tt))
}
Figure 5.5: rootogram for the Gibbs sample from the Beta-Binomial distribu-
tion (support 0–15).
Figure 5.5 presents the rootogram for the Gibbs sample from the Beta-Binomial
distribution. Similarly, Figure 5.6 shows the estimated marginal density of θ
from the same Gibbs output.
Example 5.6: Consider the posterior on (θ, σ 2 ) associated with the follow-
ing model:
Xi ∣θ ∼ N (θ, σ 2 ), i = 1, . . . , n,
θ ∼ N (θo , τ 2 )
σ 2 ∼ InverseGamma(a, b),
where θ0 , τ², a, b are known. Recall that p(σ²) = [bᵃ/Γ(a)] (σ²)^{−(a+1)} e^{−b/σ²}.
The full conditional distributions are
θ ∣ σ², x ∼ N ( (nτ²x̄ + σ²θ0)/(nτ² + σ²), σ²τ²/(nτ² + σ²) ),
σ² ∣ θ, x ∼ InverseGamma( n/2 + a, (1/2)∑i(xi − θ)² + b ).
The Gibbs sampler for these conditional distributions can be coded in R as
follows:
gibbs_gaussian <- function(nsim, x, theta0, tau2, aa, bb)  # assumed signature
{
  nn <- length(x); xbar <- mean(x); RSS <- sum((x - xbar)^2)
  post_sigma_shape <- nn/2 + aa
  theta <- sigma2 <- numeric(nsim)
  theta[1] <- xbar; sigma2[1] <- var(x)
  for(ii in 2:nsim)
  {
    new_post_sigma_rate <- (1/2)*(RSS+ nn*(xbar-theta[ii-1])^2) + bb
    sigma2[ii] <- 1/rgamma(1,shape=post_sigma_shape,
                           rate=new_post_sigma_rate)
    v <- 1/(nn/sigma2[ii] + 1/tau2)            # full conditional for theta
    theta[ii] <- rnorm(1, v*(nn*xbar/sigma2[ii] + theta0/tau2), sqrt(v))
  }
  return(list(theta=theta,sigma2=sigma2))
}
The histograms in Figure 5.7 for the posterior for θ and σ 2 are obtained as
follows:
library(mcsm)
data(Energy)
gibbs_sample <- gibbs_gaussian(5000,log(Energy[,1]),5,10,3,3)
par(mfrow=c(1,2))
hist(gibbs_sample$theta,xlab=expression(theta~"|X=x"),main="")
hist(sqrt(gibbs_sample$sigma2),xlab=expression(sigma~"|X=x"),main="")
There is a natural extension from the two-stage Gibbs sampler to the gen-
eral multistage Gibbs sampler. Suppose that for p > 1, we can write the
joint density as f (x1 , . . . , xp) and that we can simulate from the conditional
densities f1 , . . . , fp, that is,
Xi ∣ x1 , . . . , xi−1 , xi+1 , . . . , xp ∼ fi(xi ∣ x1 , . . . , xi−1 , xi+1 , . . . , xp)
for i = 1, . . . , p.
Figure 5.7: posterior histograms of θ ∣ X = x and σ ∣ X = x.
The densities f1 , . . . , fp are called the full conditionals, and a particular fea-
ture of the Gibbs sampler is that these are the only densities used for sim-
ulation. Hence, even for high-dimensional problems, all of the simulations
may be univariate, which is a major advantage.
Example 5.7: (Robert and Casella, p. 207) Consider the following model:
ind
Xij ∣θi , σ 2 ∼ N (θi , σ 2 ) 1 ≤ i ≤ k, 1 ≤ j ≤ ni
iid
θi ∣µ, τ 2 ∼ N (µ, τ 2 )
µ∣σµ2 ∼ N (µ0 , σµ2 )
σ 2 ∼ IG(a1 , b1 )
τ 2 ∼ IG(a2 , b2 )
σµ2 ∼ IG(a3 , b3 )
The conditional independencies in this example can be visualized by the
Bayesian Network in Figure 5.8. Using these conditional independencies, we
can compute the complete conditional distributions for each of the variables
as
θi ∼ N ( (σ²µ + ni τ² X̄i)/(σ² + ni τ²), σ²τ²/(σ² + ni τ²) ),
µ ∼ N ( (τ²µ0 + kσµ² θ̄)/(τ² + kσµ²), σµ²τ²/(τ² + kσµ²) ),
σ² ∼ IG( ∑i ni /2 + a1 , (1/2) ∑i,j (Xi,j − θi)² + b1 ),
with analogous inverse gamma updates for τ² and σµ².
Figure 5.8: Bayesian Network for Example 5.7.
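One scan of this multistage sampler can be sketched in R as follows (a sketch
only: the toy data, group structure, and hyperparameter values below are
assumptions for illustration; the τ² and σµ² updates are analogous inverse
gamma draws):
## toy setup (assumed values)
set.seed(1)
k <- 3; n_i <- c(5, 8, 6); grp <- rep(1:k, n_i)
x <- rnorm(sum(n_i), mean = rep(c(6.5, 7.0, 7.2), n_i))
xbar_i <- tapply(x, grp, mean)
mu0 <- 7; a1 <- b1 <- 2
theta <- xbar_i; mu <- mu0; sigma2 <- tau2 <- sigma2mu <- 1
## one scan of the multistage Gibbs sampler
for (i in 1:k) {                                   # theta_i | rest
  v <- sigma2 * tau2 / (sigma2 + n_i[i] * tau2)
  m <- (sigma2 * mu + n_i[i] * tau2 * xbar_i[i]) / (sigma2 + n_i[i] * tau2)
  theta[i] <- rnorm(1, m, sqrt(v))
}
v  <- sigma2mu * tau2 / (tau2 + k * sigma2mu)      # mu | rest
m  <- (tau2 * mu0 + k * sigma2mu * mean(theta)) / (tau2 + k * sigma2mu)
mu <- rnorm(1, m, sqrt(v))
sigma2 <- 1/rgamma(1, sum(n_i)/2 + a1,             # sigma^2 | rest
                   rate = sum((x - theta[grp])^2)/2 + b1)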
Example 5.8: A genetic model specifies that 197 animals are distributed
multinomially into four categories, with cell probabilities given by
( 1/2 + θ/4, (1 − θ)/4, (1 − θ)/4, θ/4 ).    (*)
The actual observations are y = (125, 18, 20, 34). We want to estimate θ.
Suppose we have two factors, call them A and B (say eye color and leg
length).
• Suppose further that P (A) = 1/2 = P (a), and similarly for the other
factor.
Figure: posterior histograms of µ, θ1 , θ2 , σµ² , τ², and σ² for Example 5.7.
• Now suppose that the two factors are related: P (B∣A) = 1 − η and
P (b∣A) = η (and, symmetrically for the other allele, P (b∣a) = 1 − η and
P (B∣a) = η).
Then
P (Father is AB) = P (A)P (B∣A) = (1/2)(1 − η),
P (Mother is AB) = P (A)P (B∣A) = (1/2)(1 − η),
P (O.S. is AB) = P (Father is AB) P (Mother is AB) = (1/4)(1 − η)².
Each parent passes one of four gamete types, with probabilities
P (AB) = (1/2)(1 − η), P (Ab) = (1/2)η, P (aB) = (1/2)η, P (ab) = (1/2)(1 − η),
giving the joint table (father's gamete by mother's gamete):
        AB           Ab           aB           ab
AB   ¼(1−η)²     ¼(1−η)η     ¼(1−η)η     ¼(1−η)²
Ab   ¼(1−η)η     ¼η²         ¼η²         ¼(1−η)η
aB   ¼(1−η)η     ¼η²         ¼η²         ¼(1−η)η
ab   ¼(1−η)²     ¼(1−η)η     ¼(1−η)η     ¼(1−η)²
There are 9 cases where we would see the phenotype AB, and adding up
their probabilities we get (3 − 2η + η²)/4. You can find similar probabilities
for the other phenotypes. Writing
(3 − 2η + η²)/4 = 1/2 + (1 − 2η + η²)/4
and letting θ = (1 − η)², we find the model specified in (*).
What now?
Suppose we put the prior Beta(a, b) on θ. How do we get the posterior?
Split the first cell into two cells, one with probability 1/2, the other with
probability θ/4; call the corresponding latent counts X1 and X2 = 125 − X1,
and write X3 = 18, X4 = 20, X5 = 34 for the remaining cells.
• The conditional distribution of X1 ∣ θ (and given the data) is
Bin(125, (1/2)/(1/2 + θ/4)) = Bin(125, 2/(2 + θ)).
• The conditional distribution of θ ∣ X1 (and the data) is
Beta(a + 125 − X1 + X5 , b + X3 + X4).
set.seed(1)
a <- 1; b <- 1
z <- c(125,18,20,34)
x <- c(z[1]/2, z[1]/2, z[2:4])
nsim <- 50000 # runs in about 2 seconds on 3.8GHz P4
theta <- rep(a/(a+b), nsim)
for (j in 1:nsim)
{
theta[j] <- rbeta(n=1, shape1=a+125-x[1]+x[5],
shape2=b+x[3]+x[4])
x[1] <- rbinom(n=1, z[1], (2/(2+theta[j])))
}
mean(theta) # gives 0.623
pdf(file="post-dist-theta.pdf", height=5.0, width=5.0)
plot(density(theta), xlab=expression(theta), ylab="",
main=expression(paste("Post Dist of ", theta)))
dev.off()
eta <- 1 - sqrt(theta) # Variable of actual interest
plot(density(eta))
Figure: posterior density of θ.
5.3 MCMC Diagnostics
We will want to check any chain that we run to assess any lack of conver-
gence, e.g., a poor mixing rate. Quick checks:
• Trace plots.
• Autocorrelation plots.
Figure: trace plots for four chains of λ, and an autocorrelation plot (correla-
tion against lag).
• Then we know that the effect of the starting point hasn't been forgot-
ten.
• Maybe the chain hasn't reached the area of high probability yet and
needs to be run for longer?
Gelman-Rubin
• Idea is that if we run several chains, the behavior of the chains should
be basically the same.
• Check using the Gelman-Rubin diagnostic – but can fail like any test.
◯ PlA2 Example
  i      1      2      3      4      5      6
  ψ̂i   1.06  −0.10   0.62   0.02   1.07  −0.02
  σi   0.37   0.11   0.22   0.11   0.47   0.12
  i      7      8      9     10     11     12
  ψ̂i  −0.12  −0.38   0.51   0.00   0.38   0.40
  σi   0.22   0.23   0.18   0.32   0.20   0.25
Setup:
• Twelve studies were run to investigate the potential link between pres-
ence of a certain genetic trait and risk of heart attack.
• For each study i (i = 1, ⋯, 12) the proportion having the genetic trait
in each group was recorded.
• For each study, a log odds ratio, ψ̂i , and standard error, σi , were
calculated.
Let ψi represent the true log odds ratio for study i. Then a typical hierar-
chical model would look like:
ind
ψ̂i ∣ ψi ∼ N (ψi , σi2 ) i = 1, . . . , 12
iid
ψi ∣ µ, τ ∼ N (µ, τ 2 ) i = 1, . . . , 12
(µ, τ ) ∼ ν.
L(µ, τ) = ∫ ⋯ ∫ ∏_{i=1}^{12} N_{ψi,σi}(ψ̂i) ∏_{i=1}^{12} N_{µ,τ}(ψi) dψ1 ⋯ dψ12.
The posterior can then be written (as long as µ and τ have densities) as
π(µ, τ ∣ ψ̂) = c⁻¹ [∫ ⋯ ∫ ∏_{i=1}^{12} N_{ψi,σi}(ψ̂i) ∏_{i=1}^{12} N_{µ,τ}(ψi) dψ1 ⋯ dψ12] p(µ, τ).
Remark: The reason for taking this prior is that it is conjugate for the
normal distribution with both mean and variance unknown (that is, it is
conjugate for the model in which the ψi ’s are observed).
We have a choice:
• Select a model that doesn't fit the data well but gives answers that
are easy to obtain, i.e. in closed form.
• Select a model that fits the data well but whose answers must be
computed numerically.
MCMC methods often allow us (in many cases) to make the second choice.
For a Normal–Inverse-Gamma prior NIG(a, b, c, d) on (µ, τ) and data
X1 , . . . , Xn, the posterior is NIG(a′, b′, c′, d′), where
a′ = a + n/2,    b′ = b + (1/2)∑i(Xi − X̄)² + n(X̄ − c)²/(2(1 + nd)),
and
c′ = (c + ndX̄)/(nd + 1),    d′ = 1/(n + d⁻¹).
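These updates are simple to code directly; a sketch (the function name is
ours):
nig_update <- function(x, a, b, c, d) {
  n <- length(x); xbar <- mean(x)
  list(a = a + n/2,
       b = b + sum((x - xbar)^2)/2 + n*(xbar - c)^2/(2*(1 + n*d)),
       c = (c + n*d*xbar)/(n*d + 1),
       d = 1/(n + 1/d))
}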
• In order to clarify what we are doing, we use the notation that sub-
scripting a distribution by a random variable denotes conditioning on it.
• Given the ψ's, the data ψ̂ are superfluous, i.e. L_{ψ̂}(µ, τ ∣ ψ) = L(µ, τ ∣ ψ).
This conditional distribution is given by the conjugacy of the Normal
/ Inverse gamma prior: L(µ, τ ∣ ψ) = NIG(a′, b′, c′, d′), where
a′ = a + n/2,    b′ = b + (1/2)∑i(ψi − ψ̄)² + n(ψ̄ − c)²/(2(1 + nd)),
and
c′ = (c + ndψ̄)/(nd + 1),    d′ = 1/(n + d⁻¹).
The model can be written in JAGS as follows:
model {
for (i in 1:N) {
psihat[i] ~ dnorm(psi[i],1/(sigma[i])^2)
psi[i] ~ dnorm(mu,1/tau^2)
}
mu ~ dnorm(0,1/(1000*tau^2))
tau <- 1/sqrt(gam)
gam ~ dgamma(0.1,0.1)
}
The data are:
"N" <- 12
"psihat" <- c(1.055, -0.097, 0.626, 0.017, 1.068,
-0.025, -0.117, -0.381, 0.507, 0, 0.385, 0.405)
"sigma" <- c(0.373, 0.116, 0.229, 0.117, 0.471,
0.120, 0.220, 0.239, 0.186, 0.328, 0.206, 0.254)
Now, we read in the coda files into R from the current directory and continue
our analysis. The first part of our analysis will consist of some diagnostic
procedures.
We will consider
• Autocorrelation Plots
• Trace Plots
• Gelman-Rubin Diagnostic
• Geweke Diagnostic
We take the thin value to be the first lag whose correlation ≤ 0.2. For this
plot, we take a thin of 2. We will go back and rerun our JAGS script and
skip every other value in each chain. After thinning, we will proceed with
other diagnostic procedures of interest.
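For instance (a sketch; the coda file names below assume the JAGS defaults):
library(coda)
chain1 <- read.coda("CODAchain1.txt", "CODAindex.txt")
autocorr.plot(chain1[, "mu"])  # choose the thin as the first lag with acf <= 0.2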
Figure: autocorrelation plot (correlation against lag).
Definition: A trace plot is a time series plot of the parameter, say µ, that
we monitor as the Markov chain(s) proceed(s).
Figure: trace plot of µ against iteration.
• Run two chains in JAGS using two different sets of initial values (and
two different seeds).
• Load coda package in R and run gelman.diag(mcmc.list(chain1,chain2)).
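In miniature, with placeholder chains standing in for the two JAGS chains:
library(coda)
set.seed(1)
chain1 <- mcmc(rnorm(1000))   # placeholders for real JAGS output
chain2 <- mcmc(rnorm(1000))
gelman.diag(mcmc.list(chain1, chain2))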
For the PlA2 example, typical gelman.diag output looks like:
Point est. 97.5% quantile
mu 1 1
psi[1] 1 1
psi[2] 1 1
...
psi[11] 1 1
psi[12] 1 1
gam 1 1
Since every point estimate and 97.5% quantile equals 1, we find no evidence
of a failure to converge.
• Using the Geweke diagnostic on the PlA2 data indicates that burn-in
of 10,000 is sufficient (the largest absolute Z-score is 1.75).
• Observe that the Geweke diagnostic does not require multiple starting
points as Gelman-Rubin does.
• So, here we're looking at the odds of getting heart disease given that
you have the genetic trait relative to the odds of getting heart disease
given that you don't. Note that all estimates are pulled toward the
mean, showing a Bayesian Stein (shrinkage) effect.
• exp(ψi) is the odds ratio of having a heart attack for those who have
the genetic trait versus those who don't (looking at study i).
Figure: posterior densities of exp(ψ1), exp(ψ2), exp(ψ3), and exp(ψ4).
Moreover, we could just as easily have done this analysis in WinBUGS. Below
is the corresponding model:
model{
for (i in 1:N) {
psihat[i] ~ dnorm(psi[i],rho[i])
psi[i] ~ dnorm(mu,gam)
rho[i] <- 1/pow(sigma[i],2)
}
mu ~ dnorm(0,gamt)
gam ~ dgamma(0.1,0.1)
gamt <- gam/1000
}
Finally, we can either run the analysis using WinBUGS or JAGS and R.
I will demonstrate how to do this using JAGS for this example. I have
included the basic code to run this on a Windows machine via WinBUGS.
Both methods yield essentially the same results.
setwd("C:/Documents and Settings/Tina Greenly/Desktop/beka_winbugs/novartis/pla2")
library(R2WinBUGS)
pla2 <- read.table("pla2_data.txt",header=T)
attach(pla2)
names(pla2)
N<-length(psihat)
data <- list("psihat", "sigma", "N")
5.5 Metropolis and Metropolis-Hastings
Gibbs sampling is in fact a special case of Metropolis-Hastings (as we will
soon see). Here, we present the basic Metropolis algorithm and its general-
ization to the Metropolis-Hastings algorithm, which is often useful in appli-
cations (and has many extensions).
But what if we cannot sample directly from p(θ∣y)? The important concept
here is that we are able to construct a large collection of θ values whose
empirical distribution approximates p(θ∣y) (rather than the values being iid
from it, which will most certainly not hold in realistic situations). Thus, for
any two different θ values θa and θb, we need
#{θ's in the collection equal to θa} / #{θ's in the collection equal to θb} ≈ p(θa∣y)/p(θb∣y).
This means that if p(θ*∣y) < p(θ(s)∣y), then for every instance of θ(s), we
should only have a fraction of an instance of a θ* value.
This is the basic intuition behind the Metropolis (1953) algorithm. More
formally, given the current value θ(s), it proceeds as follows:
1. Sample a proposal θ* ∼ J(θ* ∣ θ(s)), where the proposal distribution J
is symmetric.
2. Compute the acceptance ratio r = p(θ*∣y)/p(θ(s)∣y).
3. Let
θ(s+1) = θ* with prob min(r, 1), and θ(s+1) = θ(s) otherwise.
As a toy example, let
X1 , . . . , Xn ∣ θ ∼ iid Normal(θ, σ²)
θ ∼ Normal(µ, τ²).
Suppose that for some ridiculous reason we cannot come up with the poste-
rior distribution and instead we need the Metropolis algorithm to approx-
imate it (please note how incredibly silly this example is; it's just to
illustrate the method).
Based on this model and prior, we need to compute the acceptance ratio
r = p(θ*∣x)/p(θ(s)∣x) = [p(x∣θ*)p(θ*)] / [p(x∣θ(s))p(θ(s))]
  = ( ∏i dnorm(xi , θ*, σ) / ∏i dnorm(xi , θ(s), σ) ) × ( dnorm(θ*, µ, τ) / dnorm(θ(s), µ, τ) ).
Working on the log scale, this results in
log r = ∑i [log dnorm(xi , θ*, σ) − log dnorm(xi , θ(s), σ)]
        + log dnorm(θ*, µ, τ) − log dnorm(θ(s), µ, τ).
Then a proposal is accepted if log u < log r, where u is sampled from the
Uniform(0, 1).
Below is R-code for running the above model. Figure 5.14 shows a trace plot
for this run as well as a histogram for the Metropolis algorithm compared
with a draw from the true normal density. From the trace plot, although
the value of θ does not start near the posterior mean of 10.03, it quickly
arrives there after just a few iterations. The second plot shows that the em-
pirical distribution of the simulated values is very close to the true posterior
distribution.
Figure 5.14: trace plot of θ and histogram of the simulated values, with the
true posterior density overlaid.
s2<-1
t2<-10 ; mu<-5; set.seed(1); n<-5; y<-round(rnorm(n,10,1),2)
mu.n<-( mean(y)*n/s2 + mu/t2 )/( n/s2+1/t2)
t2.n<-1/(n/s2+1/t2)
####metropolis part####
y<-c(9.37, 10.18, 9.16, 11.60, 10.33)
##S = total num of simulations
theta<-0 ; delta<-2 ; S<-10000 ; THETA<-NULL ; set.seed(1)
for(s in 1:S)
{
  ## propose a new theta and compute the log acceptance ratio
  theta.star<-rnorm(1,theta,sqrt(delta))
  log.r<-( sum(dnorm(y,theta.star,sqrt(s2),log=TRUE)) +
             dnorm(theta.star,mu,sqrt(t2),log=TRUE) ) -
         ( sum(dnorm(y,theta,sqrt(s2),log=TRUE)) +
             dnorm(theta,mu,sqrt(t2),log=TRUE) )
  if(log(runif(1))<log.r) { theta<-theta.star }
  ##updating THETA
  THETA<-c(THETA,theta)
}
pdf("metropolis_normal.pdf",family="Times",height=3.5,width=7)
par(mar=c(3,3,1,1),mgp=c(1.75,.75,0))
par(mfrow=c(1,2))
5.5 Metropolis and Metropolis-Hastings 143
skeep<-seq(10,S,by=10)
plot(skeep,THETA[skeep],type="l",xlab="iteration",ylab=expression(theta))
hist(THETA[-(1:50)],prob=TRUE,main="",xlab=expression(theta),ylab="density")
th<-seq(min(THETA),max(THETA),length=100)
lines(th,dnorm(th,mu.n,sqrt(t2.n)) )
dev.off()
◯ Metropolis-Hastings Algorithm
The Gibbs sampler and the Metropolis algorithm are both ways of generating
Markov chains that approximate a target probability distribution.
What does the Gibbs sampler have us do? It has us iteratively sample values
of U and V from their conditional distributions. That is,
1. update U: sample u(s+1) ∼ po(u ∣ v(s)),
2. update V: sample v(s+1) ∼ po(v ∣ u(s+1)).
The Metropolis-Hastings algorithm replaces each conditional draw with a
proposal and an accept/reject step:
1. update U ∶
(a) generate u* ∼ Ju(u* ∣ u(s), v(s)),
(b) compute r = [po(u*, v(s)) / po(u(s), v(s))] × [Ju(u(s) ∣ u*, v(s)) / Ju(u* ∣ u(s), v(s))],
(c) set u(s+1) equal to u* with prob min(1, r) and to u(s) otherwise.
2. update V ∶
(a) generate v* ∼ Jv(v* ∣ u(s+1), v(s)),
(b) compute r = [po(u(s+1), v*) / po(u(s+1), v(s))] × [Jv(v(s) ∣ u(s+1), v*) / Jv(v* ∣ u(s+1), v(s))],
(c) set v(s+1) equal to v* with prob min(1, r) and to v(s) otherwise.
In the above algorithm, the proposal distributions Ju and Jv are not required
to be symmetric. The only requirement is that they not depend on U or V
values in our sequence previous to the most current values. This requirement
ensures that the sequence is a Markov chain.
Doesn’t the algorithm above look familiar? Yes, it looks a lot like Metropolis,
except the acceptance ratio r contains an extra factor:
• It contains the ratio of the prob of generating the current value from
the proposed to the prob of generating the proposed from the current.
Exercise 1: Show that Metropolis is a special case of MH. Hint: Think about
the jumps J.
Exercise 2: Show that Gibbs is a special case of MH. Hint: Show that r =
1.
• In particular, their age and number of new offspring were recorded for
each sparrow (Arcese et al., 1992).
• Given a current value β(s) and a value β* generated from J(β* ∣ β(s)),
the acceptance ratio for the Metropolis algorithm is
r = [∏i dpois(yi , exp(xiᵀβ*)) / ∏i dpois(yi , exp(xiᵀβ(s)))]
    × [∏j dnorm(βj*, 0, 10) / ∏j dnorm(βj(s), 0, 10)].
Remark: A reasonable proposal variance is based on var(log(y + 1/2)); note
we add 1/2 because otherwise log 0 is undefined. Code implementing the
algorithm is included below.
Figure: trace plot and autocorrelation plots (raw and thinned) for β3.
## helper to sample from a multivariate normal (used below)
rmvnorm<-function(n,mu,Sigma)
{
  p<-length(mu)
  res<-matrix(0,nrow=n,ncol=p)
  if( n>0 & p>0 )
  {
    E<-matrix(rnorm(n*p),n,p)
    res<-t( t(E%*%chol(Sigma)) +c(mu))
  }
  res
}
pmn.beta<-rep(0,p)                        # prior mean for beta
psd.beta<-rep(10,p)                       # prior sd for beta
## proposal variance, initial value, and storage (assumed setup)
var.prop<- var(log(y+1/2))*solve(t(X)%*%X)
S<-10000 ; beta<-rep(0,p) ; ac<-0
BETA<-matrix(0,nrow=S,ncol=p)
set.seed(1)
for(s in 1:S) {
  beta.p<- t(rmvnorm(1, beta, var.prop))  # propose
  lhr<- sum(dpois(y,exp(X%*%beta.p),log=T)) -
    sum(dpois(y,exp(X%*%beta),log=T)) +
    sum(dnorm(beta.p,pmn.beta,psd.beta,log=T)) -
    sum(dnorm(beta,pmn.beta,psd.beta,log=T))
  if( log(runif(1)) < lhr ) { beta<-beta.p ; ac<-ac+1 }  # accept/reject
  BETA[s,]<-beta
}
cat(ac/S,"\n")   # acceptance rate
#######
library(coda)
apply(BETA,2,effectiveSize)
####
pdf("sparrow_plot1.pdf",family="Times",height=1.75,width=5)
par(mar=c(2.75,2.75,.5,.5),mgp=c(1.7,.7,0))
par(mfrow=c(1,3))
blabs<-c(expression(beta[1]),expression(beta[2]),expression(beta[3]))
thin<-c(1,(1:1000)*(S/1000))
j<-3
plot(thin,BETA[thin,j],type="l",xlab="iteration",ylab=blabs[j])
abline(h=mean(BETA[,j]) )
acf(BETA[,j],ci.col="gray",xlab="lag")
acf(BETA[thin,j],xlab="lag/10",ci.col="gray")
dev.off()
####
In complex models, it is often the case that the conditional distributions are
available for some parameters but not for others. What can we do then?
In these situations we can combine Gibbs and Metropolis-type proposal
distributions to generate a Markov chain to approximate the joint posterior
distribution of all the parameters.
• The full conditionals are available for the regression parameters here,
but not the parameter describing the dependence among the observa-
tions.
Analyses of ice cores from East Antarctica have allowed scientists to de-
duce historical atmospheric conditions over the last several hundred thousand
years (Petit et al., 1999). Figure 5.18 plots time series of temperature and
carbon dioxide concentration on a standardized scale (centered and scaled to
have a mean of zero and a variance of 1).
Figure 5.18: standardized time series of temperature and CO2.
• The plot indicates that the temporal histories of temperature and CO2
follow very similar patterns.
• The validity of the standard error relies on the error terms in the
regression model being iid and standard confidence intervals further
rely on the errors being normally distributed.
Figure 5.17: Temperature and carbon dioxide data (residual histogram and
residual autocorrelation plot).
Y ∼ N (Xβ, σ 2 I).
The diagnostic plots suggest that a more appropriate model for the ice core
data is one in which the error terms are not independent, but temporally
correlated.
Σ = σ²Cρ, where Cρ has ones on the diagonal and (i, j) entry ρ^{|i−j|}:
Cρ = [ 1        ρ        ρ²     ⋯   ρ^{n−1}
       ρ        1        ρ      ⋯   ρ^{n−2}
       ρ²       ρ        1      ⋯
       ⋮                        ⋱
       ρ^{n−1}  ρ^{n−2}         ⋯   1       ].
Under this covariance matrix, the variance of Yi ∣ β, xi is σ², but the corre-
lation between Yi and Yi+t is ρᵗ. Using the multivariate normal and inverse
gamma prior, it is left as an exercise to show that
β ∣ X, y, σ², ρ ∼ N (βn , Σn),
σ² ∣ X, y, β, ρ ∼ IG((ν0 + n)/2, [ν0 σ0² + SSRρ]/2),
where βn = Σn(XᵀCρ⁻¹y/σ² + Σ0⁻¹β0), Σn = (XᵀCρ⁻¹X/σ² + Σ0⁻¹)⁻¹,
and SSRρ = (y − Xβ)ᵀCρ⁻¹(y − Xβ).
• We will make proposals for β and σ² using their full conditionals, and
a proposal for ρ using a random walk.
• Following the rules of MH, we accept with probability 1 any proposal
coming from a full conditional distribution, whereas we have to calculate
an acceptance probability for proposals of ρ.
The proposal used for ρ is called a reflecting random walk, which ensures
that 0 < ρ < 1. Note that a sequence of MH steps in which each parameter
is updated is often referred to as a scan of the algorithm.
For convenience and ease, we're going to use diffuse priors for the parameters,
with β0 = 0, Σ0 = diag(1000), ν0 = 1, and σ0² = 1. Our prior on ρ will be
Uniform(0,1). We first run 1000 iterations of the MH algorithm and show a
trace plot of ρ as well as an autocorrelation plot (Figure 5.20).
Suppose now we want to generate 25,000 scans for a total of 100,000 param-
eter values. The chain is highly correlated, so we will keep only every 25th
value in the chain. This reduces the autocorrelation.
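Thinning is just subsampling the stored scans; a sketch (with a placeholder
matrix OUT standing in for the stored output, ρ in the fourth column as in
the code at the end of this section):
OUT <- matrix(rnorm(25000*4), ncol = 4)  # placeholder for the stored scans
keep <- seq(25, nrow(OUT), by = 25)      # every 25th scan
OUT.thin <- OUT[keep, ]
acf(OUT.thin[, 4])                       # autocorrelation of the thinned rho draws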
Figure: trace plots and autocorrelation plots of ρ, for the initial run and for
the thinned chain.
Exercise: Repeat the analysis with different prior distributions and perform
non-Bayesian GLS for comparison.
Figure 5.20: Posterior distribution of the slope parameter β2 and the poste-
rior mean regression line (from the Markov chain of length 25,000, thinned
every 25 scans).
#####
##example 5.10 in notes
# MH and Gibbs problem
##temperature and co2 problem
source("http://www.stat.washington.edu/~hoff/Book/Data/data/chapter10.r")
rmvnorm<-function(n,mu,Sigma)   # sample n draws from N(mu, Sigma)
{
p<-length(mu)
res<-matrix(0,nrow=n,ncol=p)
if( n>0 & p>0 )
{
E<-matrix(rnorm(n*p),n,p)
res<-t( t(E%*%chol(Sigma)) +c(mu))
}
res
}
###
####
dct<-NULL
for(i in 1:n) {
xc<-dco2[ dco2[,2] < dat[i,1] ,,drop=FALSE]
xc<-xc[ 1, ]
dct<-rbind(dct, c( xc[c(2,4)], dat[i,] ) )
}
mean( dct[,3]-dct[,1])
dct<-dct[,c(3,2,4)]
colnames(dct)<-c("year","co2","tmp")
rownames(dct)<-NULL
dct<-as.data.frame(dct)
#plot(dct[,1],qnorm( rank(dct[,3])/(length(dct[,3])+1 )) ,
plot(dct[,1], (dct[,3]-mean(dct[,3]))/sd(dct[,3]) ,
type="l",col="black",
xlab="year",ylab="standardized measurement",ylim=c(-2.5,3))
legend(-115000,3.2,legend=c("temp",expression(CO[2])),bty="n",
lwd=c(2,2),col=c("black","gray"))
lines(dct[,1], (dct[,2]-mean(dct[,2]))/sd(dct[,2]),
#lines(dct[,1],qnorm( rank(dct[,2])/(length(dct[,2])+1 )),
type="l",col="gray")
plot(dct[,2], dct[,3],xlab=expression(paste(CO[2],"(ppmv)")),
ylab="temperature difference (deg C)")
dev.off()
########
lmfit<-lm(dct$tmp~dct$co2)
hist(lmfit$res,main="",xlab="residual",ylab="frequency")
#plot(dct$year, lmfit$res,xlab="year",ylab="residual",type="l" ); abline(h=0)
acf(lmfit$res,ci.col="gray",xlab="lag")
dev.off()
########
lmfit<-lm(y~-1+X)
library(nlme)   # provides gls()
fit.gls <- gls(y~X[,2], correlation=corARMA(p=1), method="ML")
beta<-lmfit$coef
s2<-summary(lmfit)$sigma^2
phi<-acf(lmfit$res,plot=FALSE)$acf[2]
nu0<-1 ; s20<-1 ; T0<-diag(1/1000,nrow=2)
###
set.seed(1)
###number of MH steps
S<-25000 ; odens<-S/1000
OUT<-NULL ; ac<-0 ; par(mfrow=c(1,2))
library(psych)
for(s in 1:S)
{
Cor<-phi^DY ; iCor<-solve(Cor)
V.beta<- solve( t(X)%*%iCor%*%X/s2 + T0)
E.beta<- V.beta%*%( t(X)%*%iCor%*%y/s2 )
beta<-t(rmvnorm(1,E.beta,V.beta) )
s2<-1/rgamma(1,(nu0+n)/2,(nu0*s20+t(y-X%*%beta)%*%iCor%*%(y-X%*%beta)) /2 )
phi.p<-abs(runif(1,phi-.1,phi+.1))
phi.p<- min( phi.p, 2-phi.p)
lr<- -.5*( determinant(phi.p^DY,log=TRUE)$mod -
determinant(phi^DY,log=TRUE)$mod +
tr( (y-X%*%beta)%*%t(y-X%*%beta)%*%(solve(phi.p^DY) -solve(phi^DY)) )/s2 )
if( log(runif(1)) < lr ) { phi<-phi.p ; ac<-ac+1 }   # MH accept/reject for rho
if(s%%odens==0)
{
cat(s,ac/s,beta,s2,phi,"\n") ; OUT<-rbind(OUT,c(beta,s2,phi))
# par(mfrow=c(2,2))
# plot(OUT[,1]) ; abline(h=fit.gls$coef[1])
# plot(OUT[,2]) ; abline(h=fit.gls$coef[2])
# plot(OUT[,3]) ; abline(h=fit.gls$sigma^2)
# plot(OUT[,4]) ; abline(h=.8284)
}
}
#####
OUT.25000<-OUT
library(coda)
apply(OUT,2,effectiveSize )
OUT.25000<-dget("data.f10_10.f10_11")
apply(OUT.25000,2,effectiveSize )
pdf("trace_auto_1000.pdf",family="Times",height=3.5,width=7)
par(mar=c(3,3,1,1),mgp=c(1.75,.75,0))
par(mfrow=c(1,2))
plot(OUT.1000[,4],xlab="scan",ylab=expression(rho),type="l")
acf(OUT.1000[,4],ci.col="gray",xlab="lag")
dev.off()
pdf("trace_thin_25.pdf",family="Times",height=3.5,width=7)
par(mar=c(3,3,1,1),mgp=c(1.75,.75,0))
par(mfrow=c(1,2))
plot(OUT.25000[,4],xlab="scan/25",ylab=expression(rho),type="l")
acf(OUT.25000[,4],ci.col="gray",xlab="lag/25")
dev.off()
pdf("fig10_11.pdf",family="Times",height=3.5,width=7)
par(mar=c(3,3,1,1),mgp=c(1.75,.75,0))
par(mfrow=c(1,2))
plot(density(OUT.25000[,2],adj=2),xlab=expression(beta[2]),
ylab="posterior marginal density",main="")
plot(y~X[,2],xlab=expression(CO[2]),ylab="temperature")
abline(mean(OUT.25000[,1]),mean(OUT.25000[,2]),lwd=2)
abline(lmfit$coef,col="gray",lwd=2)
legend(180,2.5,legend=c("GLS estimate","OLS estimate"),bty="n",
lwd=c(2,2),col=c("black","gray"))
dev.off()
quantile(OUT.25000[,2],probs=c(.025,.975) )
plot(X[,2],y,type="l")
points(X[,2],y,cex=2,pch=19)
points(X[,2],y,cex=1.9,pch=19,col="white")
text(X[,2],y,1:n)
iC<-solve( mean(OUT[,4])^DY )
Lev.gls<-solve(t(X)%*%iC%*%X)%*%t(X)%*%iC
Lev.ols<-solve(t(X)%*%X)%*%t(X)
plot(y,Lev.ols[2,] )
plot(y,Lev.gls[2,] )
5.6 Introduction to Nonparametric Bayes
◯ Motivations
• We have X1 , . . . , Xn ∼ iid F, F ∈ F. We usually assume that F is a para-
metric family.
• We would like to be able to put a prior on the set of all cdf's, and we
would like the prior to have some basic features:
1. The prior should have large support on the set of cdf's.
2. The prior should give rise to posteriors which are analytically tractable or
computationally manageable.
For the finite-dimensional Dirichlet prior θ ∼ Dir(α1 , . . . , αk) with multino-
mial counts y = (N1 , . . . , Nk), conjugacy gives
θ∣y ∼ Dir(α1 + N1 , . . . , αk + Nk),
and, writing α = ∑j αj,
Var(θj) = αj(α − αj) / (α²(α + 1)).
As a concrete example, suppose the normalized base measure αo on {1, 2, 3} is
– αo(1) = 2/9,
– αo(2) = 3/9,
– αo(3) = 4/9.
1. (Step 1) Suppose X1 ∼ αo.
2. (Step 2) Now create a new measure α + δ_{X1}, where
δ_{X1}(A) = 1 if X1 ∈ A, and 0 otherwise.
Then
X2 ∼ (α + δ_{X1}) / (α(𝒳) + δ_{X1}(𝒳)) = (α + δ_{X1}) / (α(𝒳) + 1).
Fact: δ_{X1}(𝒳) = 1. Think about why this is intuitively true.
Hence, we take α(𝒳) = N, and then plugging back in αo = α/N, i.e.
α = αo N, we find that
X2 ∼ (αo N + δ_{X1}) / (N + 1),
which is now in terms of αo and N (which we know).
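This predictive rule is easy to simulate; a minimal sketch for the three-point
example above (taking N = 9, so that the unnormalized measure is α = (2, 3, 4)):
set.seed(1)
alpha0 <- c(2, 3, 4)/9; N <- 9
draws <- integer(50)
for (n in 1:50) {
  counts <- tabulate(draws[seq_len(n - 1)], nbins = 3)  # point masses so far
  p <- (alpha0 * N + counts)/(N + n - 1)                # predictive probabilities
  draws[n] <- sample(3, 1, prob = p)
}
table(draws)   # the urn reinforces values it has already drawn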
Specifically, X1 , X2 , . . . is a Pólya urn sequence, PUS(α), if
P (X1 ∈ A) = α(A)/α(𝒳) = αo(A)
for every measurable set A, and, for every n,
P (Xn+1 ∈ B ∣ X1 , . . . , Xn) = [α(B) + ∑i δ_{Xi}(B)] / [α(𝒳) + n]
for every measurable set B.
Remark (de Finetti): Suppose that X1 , X2 , . . . is an infinite exchangeable
sequence of binary random variables. Then there exists a probability measure
(distribution) µ on [0, 1] such that for every n,
P (X1 = x1 , . . . , Xn = xn) = ∫ from 0 to 1 of p^{∑i xi}(1 − p)^{n−∑i xi} µ(p) dp
for any x1 , . . . , xn ∈ {0, 1}.
The analogous nonparametric model places a prior π on the distribution F
itself, F ∼ π; in particular:
– F ∼ Dir(α)
– X1 , X2 , . . . ∣ F ∼ iid F.
Basic idea: In many data analysis settings, we don't know the number
of latent clusters and would like to learn it from the data. BNP clus-
tering addresses this by assuming there is an infinite number of latent
clusters, but that only a finite number of them is used to generate
the observed data. Under these assumptions, the posterior yields a
distribution over the number of clusters, the assignment of data to clusters,
and the parameters associated with each cluster. In addition, the pre-
dictive distribution, i.e. the assignment of the next data point, allows
new data to be assigned to a previously unseen cluster.
How does it work: The BNP approach finesses the problem of choosing the
number of clusters by assuming it is infinite, while specifying a prior over
the infinite groupings P (c), where c refers to the cluster assignments, in
such a way that favors assigning data to a small number of groups. The
prior over groupings is the well-known Chinese restaurant process (CRP),
a distribution over infinite partitions of the integers (Aldous, 1985; Pitman,
2002).
Where does the name come from? Picture a restaurant with infinitely many
tables. Customers enter one at a time; customer n sits at an occupied table
with probability proportional to the number of customers already there, and
at a new table with probability proportional to α:
P (cn = k ∣ c1 , . . . , cn−1) = mk /(n − 1 + α) if k ≤ K+ (i.e. k is a previously
occupied table), and α/(n − 1 + α) otherwise (i.e. k is the next unoccupied
table), where mk is the number of customers sitting at table k and K+ is
the number of tables for which mk > 0. The parameter α is called the
concentration parameter.
– The data points refer to the customers and the tables are the
clusters.
∗ The CRP thus defines a prior distribution on the partition
of the data and on the number of tables.
– The prior can be completed with:
∗ A likelihood, meaning there needs to be a parameterized
probability distribution that corresponds to each table.
∗ A prior for the parameters: the first customer to sit at table
k chooses the parameter vector for that table (φk) from the
prior.
– Now we have a distribution for any quantity we care about
in a clustering setting; a minimal simulation of the CRP itself is
sketched below.
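A draw from the CRP is easy to simulate (a sketch; the function name is ours):
rcrp <- function(n, alpha = 1) {
  cc <- integer(n); cc[1] <- 1
  for (i in 2:n) {
    m <- tabulate(cc[1:(i - 1)])            # table counts m_k
    p <- c(m, alpha)/(i - 1 + alpha)        # occupied tables, then a new table
    cc[i] <- sample(length(p), 1, prob = p)
  }
  cc
}
set.seed(1)
table(rcrp(100, alpha = 1))   # cluster sizes for 100 customers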
Now, let's think about how we would write this process out formally.
We're writing out a mixture model with a component that's nonparametric.
Let's define the following:
yn ∣ cn , θ ∼ F (θ_{cn})
cn ∼ p(c)   (e.g. the CRP prior)
θk ∼ Go.
Then by Bayes' rule, p(c∣y) = p(y∣c)p(c) / ∑_c p(y∣c)p(c), where
p(y ∣ c) = ∫_θ [ ∏_{n=1}^{N} F (yn ∣θ_{cn}) ∏_{k=1}^{K} Go(θk) ] dθ.
For example, suppose
yn ∣ cn , θ ∼ N (θ_{cn} , 1)
cn ∼ Multinomial(1, p)
θk ∼ N (µ, τ²).
Then
p(y∣c) = ∫_θ [ ∏_{n=1}^{N} Normal(θ_{cn} , 1)(yn) × ∏_{k=1}^{K} Normal(µ, τ²)(θk) ] dθ.
The term above (inside the integral) is just another normal as a func-
tion of θ. Then we can integrate θ out as we have in problems before.
Once we calculate p(y∣c), we can simply plug this and p(c) into
p(c∣y) = p(y∣c)p(c) / ∑_c p(y∣c)p(c).
Example 5.14: Gaussian Mixture using R
Information on the R package profdpm: this package facilitates inference
at the posterior mode in a class of conjugate product partition models
(PPMs) by approximating the maximum a posteriori (MAP) partition.
The class of PPMs is motivated by an augmented formulation of the
Dirichlet process mixture, which is currently the ONLY available member
of this class. The profdpm package consists of two model fitting functions,
profBinary and profLinear, and their associated summary methods,
summary.profBinary and summary.profLinear.
set.seed(42)
sim <- function(multiplier = 1) {
  x <- as.matrix(runif(99))
  a <- multiplier * c(5,0,-5)
  s <- multiplier * c(-10,0,10)
  y <- c(a[1]+s[1]*x[1:33],
         a[2]+s[2]*x[34:66],
         a[3]+s[3]*x[67:99]) + rnorm(99)
  group <- rep(1:33, rep(3,33))
  return(data.frame(x=x,y=y,gr=group))
}
dat <- sim()
library("profdpm")
fitL <- profLinear(y ~ x, group=gr, data=dat)
sfitL <- summary(fitL)
#pdf("np_plot.pdf")
plot(fitL$x[,2], fitL$y, col=grey(0.9), xlab="x", ylab="y")
for(grp in unique(fitL$group)) {
ind <- which(fitL$group==grp)
ord <- order(fitL$x[ind,2])
lines(fitL$x[ind,2][ord],
fitL$y[ind][ord],
col=grey(0.9))
}
for(cls in 1:length(sfitL)) {
# The following implements the (3rd) method of
# Hanson & McMillan (2012) for simultaneous credible bands
# Generate coefficients from profile posterior
n <- 1e4
tau <- rgamma(n, shape=fitL$a[[cls]]/2, scale=2/fitL$b[[cls]])
muz <- matrix(rnorm(n*2, 0, 1),n,2)
mus <- (muz / sqrt(tau)) %*% chol(solve(fitL$s[[cls]]))
mu <- outer(rep(1,n), fitL$m[[cls]]) + mus