Introduction to Bayesian Methods in Ecology and Natural
Resources
Contents
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Bayesian and Non-Bayesian Inference . . . . . . . . . . . . . . . . . . . . 2
1.2 Bayes Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Bayesian Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Pros and Cons of Bayesian Inference . . . . . . . . . . . . . . . . . . . . . 6
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2 Probability Theory and Some Useful Probability Distributions . . . . . 11
2.1 Discrete and Continuous Random Variables . . . . . . . . . . . . . . . . 11
2.2 Expectation, Mean, Standard Deviation, and Variance . . . . . . . . . 14
2.3 Unconditional, Conditional, Marginal, and Joint Distributions . . . 15
2.4 Likelihood Functions and Random Samples . . . . . . . . . . . . . . . . 16
2.5 Some Useful Discrete Probability Distributions . . . . . . . . . . . . . . 17
2.5.1 Binomial Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.5.2 Multinomial Distribution . . . . . . . . . . . . . . . . . . . . . . . . 18
2.5.3 Poisson Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.6 Some Useful Continuous Probability Distributions . . . . . . . . . . . 21
2.6.1 Uniform Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.6.2 Normal Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.6.3 Multivariate Normal Distribution . . . . . . . . . . . . . . . . . . 23
2.6.4 t Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.6.5 Gamma Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.6.6 Wishart Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.6.7 Beta Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.6.8 Dirichlet Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3 Choice of Prior Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.1 Vague Prior Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2 Improper Prior Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3 Conjugate Prior Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.4 Prior Specification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.4.1 Vague Priors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.4.2 Informative Priors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4.3 Example: Poisson Sampling Model with Vague
and Informative Priors . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4 Elementary Bayesian Analyses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.1 Beta-Binomial Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2 Normal Model, Known Variance . . . . . . . . . . . . . . . . . . . . . . . . 43
4.3 Normal Model, Unknown Variance . . . . . . . . . . . . . . . . . . . . . . 46
4.4 Hierarchical Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.4.1 Random and Fixed Effects . . . . . . . . . . . . . . . . . . . . . . . 50
4.4.2 Exchangeability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.4.3 Number of Levels in Hierarchical Models . . . . . . . . . . . . 51
4.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5 Hypothesis Testing and Model Choice . . . . . . . . . . . . . . . . . . . . . . . 55
5.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.1.1 Deer Weights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.1.2 Fire Scar Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.2 Hypothesis Testing Terminology . . . . . . . . . . . . . . . . . . . . . . . . 58
5.3 Error Types and Acceptance/Rejection of Hypotheses . . . . . . . . . 58
5.4 Brief Philosophy of Hypothesis Testing . . . . . . . . . . . . . . . . . . . 59
5.5 Model Choice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.5.1 Within-Sample Versus Out-of-Sample Prediction . . . . . . . 60
5.6 Bayes Factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.7 Information Theoretic Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.7.1 AIC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.7.2 Bayesian Information Criterion . . . . . . . . . . . . . . . . . . . . 65
5.7.3 Deviance Information Criterion . . . . . . . . . . . . . . . . . . . . 65
5.7.4 Widely Applicable Information Criterion . . . . . . . . . . . . . 68
5.7.5 Leave-One-Out Criterion . . . . . . . . . . . . . . . . . . . . . . . . 70
5.8 Credible Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.8.1 Point Null for Normal Mean, Variance Known . . . . . . . . 77
5.8.2 Point Null for Normal Mean, Variance Unknown . . . . . . 78
5.8.3 Testing Equality of Two Normal Means, Variances
Unknown . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.9 Posterior Predictive Densities . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.9.1 Fire Scar Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.10 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6 Linear Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.1 Simple Linear Model: Trees Data . . . . . . . . . . . . . . . . . . . . . . . 92
6.1.1 Predicting a New Observation . . . . . . . . . . . . . . . . . . . . 101
6.2 Hierarchical Linear Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.2.1 Rat Growth Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.2.2 Diet 2 Rats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.2.3 Predicting a New Observation . . . . . . . . . . . . . . . . . . . . 113
6.2.4 Full Rat Data Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
6.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
7 General Linear Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
7.1 Poisson Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
7.1.1 Poisson Regression Example . . . . . . . . . . . . . . . . . . . . . 132
7.2 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
7.2.1 Bernoulli Logistic Example . . . . . . . . . . . . . . . . . . . . . . 136
7.2.2 Binomial Logistic Example . . . . . . . . . . . . . . . . . . . . . . 145
7.3 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
7.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
8 Spatial Linear Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
8.1 Point-Referenced Spatial Models . . . . . . . . . . . . . . . . . . . . . . . . 156
8.1.1 Space-Varying Coefficient Models . . . . . . . . . . . . . . . . . 160
8.1.2 Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
8.1.3 Tree Height-Diameter Data . . . . . . . . . . . . . . . . . . . . . . . 163
8.2 Models for Large Spatial Data . . . . . . . . . . . . . . . . . . . . . . . . . . 168
8.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
Appendix A: Some Common Conjugate Models . . . . . . . . . . . . . . . . . . . . 175
Appendix B: Markov Chain Monte Carlo Sampling . . . . . . . . . . . . . . . . 177
Appendix C: Short Tutorial on OpenBUGS . . . . . . . . . . . . . . . . . . . . . . . 179
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
List of Code Boxes
Code box 3.1 R code to plot gamma distribution¹ . . . . . . . . . . . . . . . . . 38
Code box 5.1 OpenBUGS code for fitting models to doe weight
data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Code box 5.2 R code for calculating WAIC and LOO for models a - d,
doe weight data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
Code box 5.3 OpenBUGS code to fit exponential and Weibull models
to fire scar data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
Code box 5.4 R code to compute LOO and WAIC for exponential
and Weibull models using fire scar data . . . . . . . . . . . . . 75
Code box 5.5 OpenBUGS code for 2008 adult doe weights . . . . . . . . 79
Code box 5.6 Computing posterior distribution of a contrast among
mean adult doe weights . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Code box 5.7 OpenBUGS code for generating values from a
3-parameter Weibull distribution, assuming the data
are in the vector y and the sample size is N . . . . . . . . . . 87
Code box 6.1 R code for exploring homogeneous and heterogeneous
variance for simple linear models . . . . . . . . . . . . . . . . . . 92
Code box 6.2 R code plotting a scattergram of V versus D²H
for the data in the R data set trees . . . . . . . . . . . . . . . . . . 93
Code box 6.3 R code for plotting two vague prior densities;
N(0, 1.0 × 10⁶) and Ga(0.001, 0.001) . . . . . . . . . . . . . . 96
Code box 6.4 OpenBUGS code for fitting simple linear model
yᵢ = β₀ + β₁xᵢ + εᵢ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
Code box 6.5 R code for producing posterior densities and kernel
densities boxplots for PPDs and observed data . . . . . . . . . 100
Code box 6.6 R code for producing posterior predictive distributions
for volume of a new tree . . . . . . . . . . . . . . . . . . . . . . . . . . 101
¹ https://github.com/prmr.
Code box 6.7 OpenBUGS code for fitting hierarchical linear model
to data for rats fed diet 2 . . . . . . . . . . . . . . . . . . . . . . . . . . 107
Code box 6.8 R code for generating posterior predictive distributions
for rats fed diet 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
Code box 6.9 R code for generating posterior predictive distributions
for rat 1 at age 70 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
Code box 6.10 R code for generating posterior predictive distributions
for an independent, randomly chosen rat at age 70 . . . . . . 116
Code box 6.11 OpenBUGS code for fitting hierarchical linear model
to complete rat data set . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
Code box 6.12 R code for generating posterior predictive distributions
the full rat data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
Code box 7.1 OpenBUGS code for fitting Poisson regression to avian
data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
Code box 7.2 OpenBUGS code for fitting logistic regression to spider
presence/absence data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
Code box 7.3 R code for computing WAIC and LOO for OpenBUGS
model in Box 7.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
Code box 7.4 OpenBUGS code for fitting Bernoulli model to spider
presence/absence data without covariate . . . . . . . . . . . . . . 139
Code box 7.5 R code for computing WAIC and LOO for OpenBUGS
model in Box 7.4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
Code box 7.6 R code for plotting fitted logistic regression against
spider presence/absence data . . . . . . . . . . . . . . . . . . . . . . . 142
Code box 7.7 R code for computing and plotting posterior probability
of correct classification for each beach in spider data . . . . 144
Code box 7.8 OpenBUGS code for fitting hierarchical
binomial-logistic model to beetle data . . . . . . . . . . . . . . . . 147
Code box 7.9 OpenBUGS code for fitting non-hierarchical
binomial-logistic model to beetle data . . . . . . . . . . . . . . . . 148
Code box 7.10 R code for computing WAIC and LOO for OpenBUGS
models in Boxes 7.8 and 7.9 . . . . . . . . . . . . . . . . . . . . . . . 148
Code box 8.1 R code to generate realizations from a spatial GP . . . . . . . 159
Code box 8.2 Abbreviated code for fitting the SVI model using
the spBayes R package . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
Chapter 1
Introduction
Bayesian inference in the sciences has become remarkably widespread in the wake of
the Markov chain Monte Carlo (MCMC) revolution of the 1990s. MCMC methods
permit solutions to Bayesian problems which had previously been mathematically
intractable. MCMC methods have simplified Bayesian inference to the point where
it is often arguably simpler than conventional statistical approaches. However, ease
of use is not and should not be a compelling reason on its own to justify a statistical
approach. Hence in this text we will endeavor to motivate Bayesian analyses in
ecological and/or natural resource management problems on the grounds that these
methods readily permit scientists to model phenomena of interest in realistic ways.
One does not need to accept the view that Bayesian methods are philosophically more
appealing than conventional methods in order to use them effectively. We suspect
that many scientists use Bayesian methods today only because they perform well
and allow the scientist to directly answer the specific question posed.
There has been a long and often contentious debate in the literature regarding the
relative merits of Bayesian and non-Bayesian statistical methods. To date, the debate
has not been conclusively settled and perhaps it never will be. Even the authors of
this text do not completely agree on some issues. However, given their increasingly
widespread use, it is apparent that the modern scientist must at least understand
Bayesian methods and preferably have them readily available in their toolbox. To that
end, our goal in writing this text is to present some common Bayesian data analysis
methods in a manner that will be understandable and readily available to students
and scientists in the various fields of Ecology and Natural Resource Management.
We assume that the reader is a typical graduate student in Ecology and/or Natural
Resource Management. In our experience, such a student has had some training in
Analytic Geometry and Calculus and one or two undergraduate courses in Statistics.
We attempt to explain concepts and ideas in a way that will be accessible to such
students. After reading this book, a student should be able to pursue Bayesian analyses
and read more sophisticated texts on the subject.
Modern Bayesian inference relies heavily on computer simulation. To relieve
scientists of the burden of writing new code for every problem, a number of Bayesian
computing packages have arisen. Chief among these is BUGS (Bayesian inference
Using Gibbs Sampling), and its successors, WinBUGS and OpenBUGS (Lunn
et al. 2000). These packages have the virtue of being freely available from the BUGS
project (links are provided in Appendix C). Other widely used packages include
JAGS (Just Another Gibbs Sampler; Plummer 2003), Stan (Carpenter et al. 2017),
and NIMBLE (de Valpine et al. 2017). All examples in this book (other than those in
Chap. 8: Spatial Models) are performed using OpenBUGS, and the code is provided
in text boxes. A short tutorial on OpenBUGS is presented in Appendix C.
© Springer Nature Switzerland AG 2020
E. J. Green et al., Introduction to Bayesian Methods in Ecology and Natural Resources,
https://doi.org/10.1007/978-3-030-60750-0_1
In Chap. 2 we cover various theoretical probability distributions and densities. For
that purpose, we employ the freely available statistical computing package R, and
we present the requisite R code. R is available from the Comprehensive R Archive
Network (CRAN, www.cran.r-project.org). Users may also find tutorials and links
to user’s guides at CRAN. We make no attempt to instruct the reader on the use of R.
We have included many code boxes with either R or OpenBUGS code to help
the reader perform analyses or produce graphs. Many of the R code boxes entail
reading in .txt files produced by the coda option in OpenBUGS. In such cases,
we indicate that the user must set the working directory to the folder containing the
.txt file. Other options (such as the size of the joint posterior sample generated in
OpenBUGS or the order of the variables in the output files) depend on the parameters
used in the OpenBUGS program. It is up to the user to ensure that these are specified
correctly.
All the code in the code boxes (and larger data sets referenced in some boxes and
exercises) is available online at https://github.com/finleya/GFS. Rather than
repeating this lengthy URL every time it is needed, we will indicate that a data set or piece
of code is available online.
1.1 Bayesian and Non-Bayesian Inference
This book is concerned with making inferences from data. Although the techniques
we describe are useful in many different disciplines, we focus on data which arise
in forestry, ecology, and wildlife biology. Even within that small slice of scientific
disciplines, the interests and orientations of scientists are remarkably varied; hence
we will refer to the reader generically as a “scientist.” In our view, a scientist is
interested in statistical procedures insofar as they permit legitimate inferences about
populations from sample data. These inferences may take on various forms, such
as hypothesis testing or model development. Most scientists are aware that, broadly
speaking, there are at least two classes of statistics: Bayesian and non-Bayesian. The
latter class is often referred to as “classical” or “frequentist,” but for our purposes
it can be considered to be any inferential system not based on Bayes
theorem. Many scientists have probably noted an increase in the use of Bayesian
procedures since the early 1990s and may wonder what caused this increase, and
what the big fuss is all about. In this chapter we introduce Bayesian inference,
briefly touch on its history and the controversy over its use, and conclude with a
short discussion of some of the reasons behind its increase in popularity.
1.2 Bayes Theorem
Let P(·) denote the probability of the quantity inside the parentheses. It may surprise
some scientists to learn that there is no universally accepted definition of probability.
Kolmogorov (1933) laid out three axioms that any coherent system of probability
should satisfy, but the axioms do not define a probability system. Broadly
speaking, there are two common notions of probability: frequentism and personal
probability. In frequentism, the probability of an event occurring is defined as the
proportion of times it occurs in a long series of trials. Unfortunately, the definition
does not specify how long “long” is. Also, it is not helpful in answering questions like
“What is the probability of life on Mars?”, where it is difficult to imagine repeated
trials. On the other hand, personal probability, or subjective probability, is what the
scientist believes in their mind and may vary among individuals, depending on that
individual’s background and knowledge (e.g., see de Finetti 1974). In general, the
authors favor the personal probability approach, yet as will become evident in the
subsequent chapters, like most modern Bayesians we make liberal use of vague or
noninformative priors to avoid the burden of constructing subjective prior probability
distributions for every problem.
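The frequentist definition can be made concrete with a small simulation. The following Python sketch (illustrative only, not one of the book's code boxes) tracks the relative frequency of an event over a growing number of independent trials. The event probability of 0.3, the number of trials, the seed, and the function name running_relative_frequency are arbitrary choices made for this illustration.

```python
import random


def running_relative_frequency(p_event, n_trials, seed=1):
    """Simulate independent trials and return the relative frequency
    of the event after each trial."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    hits = 0
    freqs = []
    for trial in range(1, n_trials + 1):
        if rng.random() < p_event:  # the event occurs on this trial
            hits += 1
        freqs.append(hits / trial)
    return freqs


freqs = running_relative_frequency(p_event=0.3, n_trials=200_000)
for n in (10, 100, 1_000, 10_000, 200_000):
    print(f"after {n:>7} trials: relative frequency = {freqs[n - 1]:.4f}")
```

The relative frequency typically wanders for small numbers of trials and settles only gradually, which is one way to see why the definition leaves open how long “long” must be.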
Suppose we are considering two arbitrary events: A and B. Consider the probability
that both A and B occur; this is the intersection of A and B, and in introductory
probability texts it is usually written as P(A ∩ B). In statistical work, it is customary
to suppress the intersection symbol, and write the probability of the intersection of
A and B as P(A, B). Now, suppose we wish to know the probability that event B
will occur, given that event A has already occurred. We write this as P(B | A). This
is the probability of B, given A, or the probability of B, conditional on A. If the
events A and B are independent, then P(B | A) = P(B), i.e., the outcome of A tells us
nothing about the probabilities of the outcomes of B.
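Independence can be checked by direct enumeration. The short Python sketch below (illustrative only, not one of the book's code boxes) uses two fair dice, an example that does not appear in the text: A is “the first die is even,” and B is either “the total is 7” or “the total is 8.” The helper names prob and cond_prob are hypothetical.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))


def prob(event):
    """P(event), by counting outcomes in which the event occurs."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))


def cond_prob(event_b, event_a):
    """P(B | A), by counting only the outcomes in which A occurred."""
    given_a = [o for o in outcomes if event_a(o)]
    return Fraction(sum(1 for o in given_a if event_b(o)), len(given_a))


def first_even(o):
    return o[0] % 2 == 0


def total_seven(o):
    return o[0] + o[1] == 7


def total_eight(o):
    return o[0] + o[1] == 8


print("P(total 7) =", prob(total_seven),
      " P(total 7 | first even) =", cond_prob(total_seven, first_even))
print("P(total 8) =", prob(total_eight),
      " P(total 8 | first even) =", cond_prob(total_eight, first_even))
```

Under these assumptions P(B | A) equals P(B) when B is “the total is 7,” so those two events are independent, whereas knowing that the first die is even changes the probability that the total is 8.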
The multiplicative rule of probability (see, e.g., Harris 1966, p. 11) states
$$P(A, B) = P(A)\,P(B \mid A). \tag{1.2.1}$$
As Berry (1997) reports, Eq. 1.2.1 is intuitive. Berry asks the reader to consider
the probability of observing two aces in two random drawings from a deck of cards,
assuming sampling without replacement, i.e., that the first card drawn is not returned
to the deck prior to drawing the second card. Most people would start by saying “first
we need to compute the probability of an ace on the first draw; since there are 52
cards and four aces, the probability of this is 4/52. Then, we need the probability
of an ace on the second draw; the probability of this is 3/51, because there are only
three aces among the remaining 51 cards. Hence the probability of observing two
aces on two draws is 4/52 times 3/51.” These people have just used Eq. 1.2.1.
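Berry's card example can be verified numerically. The Python sketch below (illustrative only, not one of the book's code boxes) computes the probability of two aces once with the multiplicative rule of Eq. 1.2.1 and once by brute-force enumeration of every ordered draw of two distinct cards. The variable names are hypothetical, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from itertools import permutations

# A 52-card deck reduced to what matters here: 4 aces and 48 other cards.
deck = ["ace"] * 4 + ["other"] * 48

# Multiplicative rule (Eq. 1.2.1): P(A, B) = P(A) P(B | A).
p_first_ace = Fraction(4, 52)               # P(A): ace on the first draw
p_second_ace_given_first = Fraction(3, 51)  # P(B | A): ace on the second draw
p_rule = p_first_ace * p_second_ace_given_first

# Brute-force check: every ordered pair of distinct card positions
# corresponds to one equally likely draw without replacement.
draws = list(permutations(range(52), 2))
both_aces = sum(1 for i, j in draws if deck[i] == "ace" and deck[j] == "ace")
p_enumerated = Fraction(both_aces, len(draws))

print("multiplicative rule:", p_rule)
print("enumeration:        ", p_enumerated)
```

Both routes give 1/221, so the intuitive two-step argument and the exhaustive count agree.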
Re-arranging terms in Eq. 1.2.1, we find
$$P(B \mid A) = \frac{P(A, B)}{P(A)}. \tag{1.2.2}$$
Observe that if P(A) = 0, then P(B | A) is undefined, as it should be since P(A) = 0
means the event A is impossible, and it is pointless to consider the probability of B
given an impossible event.
Now, since the events A and B were arbitrarily defined, we also have
$$P(A, B) = P(B)\,P(A \mid B). \tag{1.2.3}$$
Replacing the numerator on the right-hand side (RHS) of (1.2.2) with the RHS of
(1.2.3) yields
$$P(B \mid A) = \frac{P(B)\,P(A \mid B)}{P(A)}. \tag{1.2.4}$$
Equation 1.2.4 is the celebrated Bayes theorem. It reveals the proper, and only, way
to “reverse” the conditioning of a probability statement; on the RHS we have event
A conditioned on event B, while on the left-hand side (LHS) we have the reverse. It
is important to note that this form of Bayes theorem is non-controversial and is an
elementary result of the multiplicative rule of probability (Eqs. 1.2.1 and 1.2.3). It
has many straightforward uses in this form, such as in image classification (Green
et al. 1992; Richards and Jia 2006) or clinical diagnostic testing (Joseph et al. 1995;
Spiegelhalter et al. 1999). The fun begins when Bayes theorem is used as a basis for
scientific inference.¹
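As an illustration of the diagnostic-testing use of Eq. 1.2.4, consider the following Python sketch (illustrative only, not one of the book's code boxes). Let B be “the patient has the condition” and A be “the test is positive.” The prevalence, sensitivity, and specificity are hypothetical numbers chosen for this example, and P(A) is obtained from the law of total probability, a step that is not shown in the text.

```python
from fractions import Fraction

# Hypothetical inputs, chosen only for illustration.
prevalence = Fraction(1, 100)    # P(B): patient has the condition
sensitivity = Fraction(95, 100)  # P(A | B): positive test given the condition
specificity = Fraction(90, 100)  # P(not A | not B): negative test given no condition

# Law of total probability (an added step, not in the text):
# P(A) = P(A | B) P(B) + P(A | not B) P(not B)
p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)

# Bayes theorem, Eq. 1.2.4: P(B | A) = P(B) P(A | B) / P(A)
p_condition_given_positive = prevalence * sensitivity / p_positive

print("P(positive test)             =", p_positive)
print("P(condition | positive test) =", p_condition_given_positive,
      "=", round(float(p_condition_given_positive), 4))
```

With these made-up numbers the probability of the condition given a positive test is 19/217, or under 9%, even though the test is usually correct, because the condition itself is rare. Reversing the conditioning changes the answer substantially.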
¹ Bayes theorem is named after the 18th-century English cleric Thomas Bayes (c. 1702–1761). The
theorem was derived in an essay published posthumously by his friend, Richard Price. As noted
by Bernardo and Smith (1994), we don’t know how Rev. Bayes would feel about the system of
inference attributed to him.
1.3 Bayesian Inference
Suppose we collect sample data y on some variable, say Y. Further suppose that we
believe the sampling distribution of Y (i.e., the distribution from which the observations
on Y arise) is indexed, or governed, by some unknown parameter(s) θ. For
convenience, in this section we will discuss θ as if it were one-dimensional, i.e., a
scalar; however, the reader should be aware that θ is frequently multi-dimensional.
Assume we are interested in making inferences about the parameter θ based on
the sample data y. Now, since we know the sampling distribution, we can evaluate
P(y | θ), the probability of y conditioned on θ, for any value of θ. But that’s not what
we really want to know; we don’t know θ, we only know y. It seems self-evident to
seek the probability distribution of what we don’t know (θ), conditioned on what we
do know (y). Application of Bayes theorem yields
$$P(\theta \mid y) = \frac{P(\theta)\,P(y \mid \theta)}{P(y)}. \tag{1.3.1}$$
Equation 1.3.1 is the form of Bayes theorem used for scientific inference. Recall that
we are interested in making inferences about θ. Since the denominator on the RHS
of (1.3.1), P(y), is independent of θ, we can learn nothing about θ from this term.
Furthermore, once y is observed, P(y) is fully specified and has a fixed value, say
c. So, we can rewrite (1.3.1) as
$$P(\theta \mid y) = \frac{P(y \mid \theta)\,P(\theta)}{c}, \tag{1.3.2}$$
$$P(\theta \mid y) \propto P(y \mid \theta)\,P(\theta). \tag{1.3.3}$$
In Eq. 1.3.2, c⁻¹ is a normalizing constant; its function is to ensure that the total
probability sums (or integrates) to 1. The first term on the RHS of expression (1.3.3)
is the probability of y given the parameter θ. In non-Bayesian as well as Bayesian
statistics, following Fisher (1922) it has become usual to consider this as a function of
θ rather than of y. When viewed in this way, P(y | θ) is called the likelihood function
of θ given y, and is written as L(θ | y).² The value of θ which maximizes L(θ | y) is
called the maximum likelihood estimate (e.g., see Casella and Berger 2001). If we
adopt the likelihood notation, then we can re-write expression (1.3.3) as
$$P(\theta \mid y) \propto L(\theta \mid y)\,P(\theta). \tag{1.3.4}$$
Expression (1.3.4) is the form usually used in reports on Bayesian analyses. As
mentioned above, L(θ | y) is widely used in non-Bayesian as well as Bayesian inference
and is not controversial. In Bayesian inference, it is often instructive to think of this
as the sampling distribution for y, i.e., a mathematical description of the process in
Nature that generates the data we observe.
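To make the idea of a likelihood function concrete, the following Python sketch (illustrative only, not one of the book's code boxes) evaluates L(θ | y) on a grid of θ values and locates the maximum likelihood estimate. It assumes a binomial sampling distribution and hypothetical data of y = 7 successes in n = 20 trials; neither appears in the text, and the names likelihood, grid, and theta_hat are ours.

```python
from math import comb


def likelihood(theta, y, n):
    """Binomial likelihood L(theta | y) = C(n, y) theta^y (1 - theta)^(n - y)."""
    return comb(n, y) * theta**y * (1.0 - theta) ** (n - y)


y, n = 7, 20                            # hypothetical data
grid = [i / 1000 for i in range(1001)]  # candidate values of theta in [0, 1]
values = [likelihood(t, y, n) for t in grid]

# Maximum likelihood estimate: the grid value with the largest likelihood.
theta_hat = grid[values.index(max(values))]

# Area under L(theta | y) as a function of theta, by the trapezoid rule.
step = grid[1] - grid[0]
area = step * (sum(values) - 0.5 * (values[0] + values[-1]))

print(f"MLE = {theta_hat:.3f}, y/n = {y / n:.3f}, area under L = {area:.4f}")
```

For this model the grid maximum coincides with the sample proportion y/n. The area under the curve is close to 1/(n + 1) = 1/21 rather than 1, a reminder that L(θ | y), viewed as a function of θ, is not itself a probability distribution for θ.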
The second term on the RHS of (1.3.4), P(θ), is called the prior distribution of θ;
it represents what was known (or believed) about θ before the data y were observed.
The term on the LHS of (1.3.4), P(θ | y), is called the posterior distribution of θ; it
synthesizes all that was known about θ before the data were observed plus what was
learned about θ from the data. In the Bayesian paradigm, all inferences on θ derive
from the posterior distribution.
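Expressions (1.3.2) to (1.3.4) can be traced numerically on a grid of θ values. The Python sketch below (illustrative only, not one of the book's code boxes) multiplies a likelihood by a prior, computes the constant c, and normalizes to obtain a posterior. The binomial likelihood, the flat prior, and the data (7 successes in 20 trials) are assumptions made for this example.

```python
from math import comb


def likelihood(theta, y, n):
    """Binomial likelihood L(theta | y) for y successes in n trials."""
    return comb(n, y) * theta**y * (1.0 - theta) ** (n - y)


y, n = 7, 20                            # hypothetical data
step = 0.001
grid = [i / 1000 for i in range(1001)]  # values of theta in [0, 1]

prior = [1.0 for _ in grid]             # flat prior P(theta), an assumption

# Expression (1.3.4): posterior is proportional to likelihood times prior.
unnormalized = [likelihood(t, y, n) * p for t, p in zip(grid, prior)]

# c in Eq. 1.3.2, approximated by summing over the grid; dividing by it
# makes the posterior integrate to 1.
c = sum(unnormalized) * step
posterior = [u / c for u in unnormalized]

posterior_mean = sum(t * d for t, d in zip(grid, posterior)) * step
print(f"c = {c:.4f}, posterior mean = {posterior_mean:.4f}")
```

With a flat prior on a binomial proportion the posterior is known to be a Beta(y + 1, n − y + 1) distribution, whose mean (y + 1)/(n + 2) = 8/22 the grid result should match closely. The grid approach is practical only for one or two parameters, which is part of why the constant c was historically an obstacle.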
² Some authors use L(θ | y) to indicate the log of the likelihood function.
The controversy over Bayesian inference arises primarily over the term P(θ),
and basically boils down to whether or not it is admissible to place a distribution
on θ. Non-Bayesians make a distinction between random variables and population
parameters. In this view, the parameter θ is fixed; we just don’t know what it is. In
many situations we can “imagine” measuring all the individuals in a population and
then computing the exact value of θ, even though it might be prohibitively expensive
or inefficient to do so. If we knew the values of all the elements in the population,
we could calculate the true value of θ. Hence, since θ is fixed, it would be incorrect
to place a probability distribution on it. For example, suppose we know that the
mean height of a specific population of trees is exactly 14 m (for the moment let’s
not concern ourselves with how we could possibly know this). Then the probability
that the mean height is 14 m is 1.0 and the probability that it is any other value is
0. Thus this distribution has a mass of 1 at a single point: 14. Such a distribution is
degenerate, and not really a distribution at all. Hence in the non-Bayesian view it is
not permissible to construct probability distributions for parameters. They are fixed
constants; the probability that they equal their true value is 1.0 and the probability
that they assume any other value is 0.
Bayesians view the world differently. To them, the distinction between random
variables and parameters is artificial and largely irrelevant. Bayesians are also
interested in two classes of objects, but instead of random variables and parameters, they
are interested in what is known and what is unknown. In the Bayesian view,
probability distributions are used for expressing the state of our knowledge about any
unknown object. Hence, since θ is generally unknown, Bayesians find it perfectly
acceptable to place a probability distribution on it.
Interestingly, Bayesian texts often contain a summary of the differences between
Bayesian and non-Bayesian inference (and usually a vigorous defense of the Bayesian
view). For example, see Berger (1985), Robert (2001), Bernardo and Smith (1994),
Gelman et al. (2013), Carlin and Louis (2009), or the classic Jeffreys (1935). On the
other hand, texts on non-Bayesian inference often do not devote much space, if any,
to Bayesian inference. A good historical account of Bayesian inference is contained
in Stigler (1986), and an excellent non-mathematical treatment of the history may
be found in McGrayne (2012). Readers interested in the controversy between the
Bayesian and non-Bayesian views are referred to the above-mentioned texts, and to
the discussion following the classic papers by Lindley (1990), Lindley and Phillips
(1976), Lindley and Smith (1972), and the references contained therein. A vast
literature on the relative merits of Bayesian and non-Bayesian inference exists, and
we cannot reproduce all the arguments here. We do, however, feel that it is necessary
to cover the salient points so that the reader can make up their own mind. Bear in
mind that although we (the authors) endeavor to summarize the arguments fairly, we
do have reasonably firm opinions regarding the merits of Bayesian inference, so we
cannot be entirely objective.
1.4 Pros and Cons of Bayesian Inference
Prior to about 1990, there was a practical objection to the use of Bayesian inference;
in many cases it was impossible to solve for the constant c in (1.3.2), and hence it
was difficult or impossible to solve for the posterior distribution; i.e., the analysis
often could not be done, or if it could it required great skill in numerical analysis.
The MCMC revolution of the early 1990s largely eliminated this concern and has
made Bayesian analysis almost routine.
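The reason MCMC sidesteps the constant c can be seen in a few lines. The Python sketch below (illustrative only, not one of the book's code boxes) implements a random-walk Metropolis sampler, one simple member of the MCMC family and not necessarily the algorithm OpenBUGS applies to a given model. The binomial likelihood, flat prior, data (7 successes in 20 trials), proposal standard deviation, burn-in length, and seed are all assumptions made for this example.

```python
import math
import random

Y, N = 7, 20  # hypothetical data: 7 successes in 20 trials


def log_unnormalized_posterior(theta):
    """log of L(theta | y) P(theta) for a binomial likelihood and a flat
    prior on (0, 1), up to an additive constant; c is never computed."""
    if not 0.0 < theta < 1.0:
        return -math.inf  # zero prior probability outside (0, 1)
    return Y * math.log(theta) + (N - Y) * math.log(1.0 - theta)


def metropolis(n_draws, proposal_sd=0.25, start=0.5, seed=42):
    """Random-walk Metropolis sampler for theta."""
    rng = random.Random(seed)
    theta = start
    log_p = log_unnormalized_posterior(theta)
    draws = []
    for _ in range(n_draws):
        candidate = theta + rng.gauss(0.0, proposal_sd)
        log_p_candidate = log_unnormalized_posterior(candidate)
        # Accept with probability min(1, ratio of unnormalized posteriors);
        # the constant c would cancel in this ratio.
        accept_prob = math.exp(min(0.0, log_p_candidate - log_p))
        if rng.random() < accept_prob:
            theta, log_p = candidate, log_p_candidate
        draws.append(theta)
    return draws


draws = metropolis(50_000)[5_000:]  # discard an initial burn-in period
mcmc_mean = sum(draws) / len(draws)
print(f"MCMC estimate of the posterior mean = {mcmc_mean:.3f}")
```

Only ratios of the unnormalized posterior enter the acceptance step, so the sampler never needs c. For this simple model the exact posterior mean is (Y + 1)/(N + 2) = 8/22, which gives a check on the simulated value.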
Current objections to Bayesian inference continue to revolve around the specification
of prior distributions for population parameters. Since priors must be specified
before data are observed, they are clearly not dependent on the data and hence other
factors besides the sample data may influence the analysis. This is in stark contrast
with the usually stated goal of objectivity in scientific analyses. Furthermore, if two
scientists analyze the same data but employ different prior distributions, they may
or may not reach the same conclusion. Hence it might prove difficult for theories to
become accepted following repeated trials by different scientists. Finally, Bayesians
often use “noninformative” priors in order to “let the data speak for themselves,”
yet there is no unanimity among Bayesians regarding the proper or “correct”
noninformative prior distributions to use with common likelihoods, or even on the exact
definitions of noninformative and/or vague priors.
Although the concerns detailed in the preceding paragraph are serious, we do not
find them compelling and, when balanced against the objections to non-Bayesian
methods, we find that Bayesian methods are often preferable. We agree with the
oft-stated Bayesian position that, despite claims to the contrary, all scientific work is
subjective; just the act of choosing a likelihood function is subjective. To us, Bayesian
analysis is more honest because it forces scientists to confront their subjectivity at
the outset and not “sweep it under the rug.” We believe the possibility that different
scientists may reach differing conclusions based on the same data if they bring vastly
different prior opinions to the problem is not a disadvantage but rather a description
of the true state of the world. Furthermore, given that some basic care is taken not
to rule out certain outcomes a priori, then as evidence accumulates even scientists
with markedly differing initial priors will eventually reach the same conclusion (Box
and Tiao 1972). A Bayesian analysis is not complete unless the scientist specifies
the prior distribution that was used. If a reader does not accept the prior distribution,
then they are under no obligation to accept the results, just as, in a non-Bayesian
study, a reader who does not accept the choice of likelihood will not find the
evidence convincing. Finally, although it is true that there is no universal
agreement regarding noninformative prior distributions, a wise and/or pragmatic
scientist would repeat an analysis using several noninformative priors. If the analyses
agree, then this suggests that the choice of prior is unimportant. If they do not, then
the scientist has learned something and should perhaps think more deeply about the
parameter in question. We find this to be a good feature.
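Both points, the washing out of the prior as evidence accumulates and the value of repeating an analysis under several priors, can be illustrated with a small calculation. The sketch below assumes a binomial proportion with a conjugate Beta(a, b) prior, for which the posterior after y successes in n trials is Beta(a + y, b + n - y). The four priors and the data (30% successes in 10 and then in 10,000 trials) are hypothetical choices made for illustration.

```python
def posterior_mean(a, b, successes, trials):
    """Mean of the Beta(a + y, b + n - y) posterior for a binomial proportion."""
    return (a + successes) / (a + b + trials)

# Two commonly used "noninformative" priors and two strongly opinionated ones.
priors = {
    "uniform Beta(1, 1)": (1.0, 1.0),
    "Jeffreys Beta(0.5, 0.5)": (0.5, 0.5),
    "skeptic Beta(2, 18)": (2.0, 18.0),
    "enthusiast Beta(18, 2)": (18.0, 2.0),
}

def spread(successes, trials):
    """Largest minus smallest posterior mean across the priors above."""
    means = [posterior_mean(a, b, successes, trials) for a, b in priors.values()]
    return max(means) - min(means)

small = spread(successes=3, trials=10)        # little data
large = spread(successes=3000, trials=10000)  # same proportion, much more data

for name, (a, b) in priors.items():
    print(name,
          round(posterior_mean(a, b, 3, 10), 3),
          round(posterior_mean(a, b, 3000, 10000), 3))
print(round(small, 3), round(large, 4))
```

With 10 trials the posterior means range from about 0.17 to 0.70, so the prior matters a great deal. With 10,000 trials they all lie within about 0.002 of 0.30.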
In our view there are important advantages to Bayesian analysis. First and fore-
most, it makes sense. As detailed earlier in this chapter, Bayesians make inferences
about what they don’t know, based on what they observe during an experiment and
what they believed prior to observing data. To us, this is a perfect and natural analog
to the process of learning by experience. The Bayesian inference system is self-
contained and all inferences stem from the calculus of probability, e.g., see Box and
Tiao (1972). Unlike common statistical techniques such as acceptance/rejection of
hypotheses based on p-values or calculation of confidence intervals, Bayesian infer-
ence does not require scientists to imagine an infinite series of repeated trials of an
experiment under identical situations. A Bayesian analysis is conditioned on the data
you observed, not on data you might have observed, e.g., see Lindley (1990) or Carlin
and Louis (2009). In a classic series of papers, Berger and colleagues (Berger 1985;
Berger and Delampady 1987; Berger and Sellke 1987) have shown that the common
non-Bayesian practice of rejecting a hypothesis when a p-value takes on a value
smaller than some standard value (typically 0.05) may lead to rejection of hypothe-
ses when the posterior probability of the hypothesis being true, given the available
evidence, is actually greater than 0.5, i.e., the hypothesis is more likely to be true than
not; this is remarkable! In another classic paper, Lindley (1957) showed that a clas-
sical hypothesis testing procedure can reject a null hypothesis at the (α)100% level
while, incredibly, a Bayesian procedure might conclude that the posterior probability
that the null is true is (1-α)100%! This has come to be known as Lindley’s Paradox.
In our experience, the common Bayesian assertion that most scientists do not fully
understand p-values is correct. We believe it is true that many scientists mistakenly
regard a p-value as the probability that the null hypothesis is true,³ whereas even
a cursory look at the development of p-values shows that it is nothing of the sort.
Rather, it is the probability of observing data as extreme as, or more extreme than, the data
actually observed if the null hypothesis is true. The latter can be vastly different from
the former. We agree with Cohen (1994, p. 997), who said of a p-value, “. . . it does
not tell us what we want to know, and we so much want to know what we want to
know that, out of desperation, we believe that it does!” We find ourselves in accord
with Jeffreys (1980) p. 453: “I have always considered the arguments for the use
of P absurd. They amount to saying that a hypothesis that may or may not be true
is rejected because a greater departure from the trial value was improbable; that is,
that it has not predicted something that has not happened.” It is worthwhile to note
that in March 2016, the American Statistical Association issued a policy statement
on p-values which, among other things, argues against their routine use in science
(Wasserstein and Lazar 2016).⁴
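The gap between a p-value and the posterior probability of the null hypothesis can be made concrete with a standard textbook setup, sketched below. The assumptions are illustrative and are not taken from the papers cited above: the data are normal with known standard deviation 1, the null hypothesis is that the mean is exactly 0, the null and the alternative each receive prior probability 0.5, and under the alternative the mean has a normal prior with mean 0 and standard deviation tau = 1. The test statistic is held at z = 1.96, so the two-sided p-value is about 0.05 at every sample size.

```python
import math

def two_sided_p_value(z):
    """P(|Z| >= |z|) for a standard normal Z."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def posterior_prob_null(z, n, tau=1.0, prior_null=0.5):
    """Posterior probability of H0: mean = 0 against H1: mean ~ Normal(0, tau^2).

    Assumes n observations with known standard deviation 1, so z = sqrt(n) * xbar.
    """
    r = n * tau**2
    # Bayes factor in favor of H0: ratio of the marginal densities of the sample mean.
    bf01 = math.sqrt(1.0 + r) * math.exp(-0.5 * z**2 * r / (1.0 + r))
    prior_odds = prior_null / (1.0 - prior_null)
    posterior_odds = prior_odds * bf01
    return posterior_odds / (1.0 + posterior_odds)

z = 1.96
p_value = two_sided_p_value(z)
for n in (10, 100, 10000):
    print(n, round(p_value, 3), round(posterior_prob_null(z, n), 3))
```

Under these assumptions the posterior probability of the null is roughly 0.37 for n = 10, 0.60 for n = 100, and 0.94 for n = 10,000, even though the p-value stays near 0.05 throughout. The last case shows the pattern behind Lindley's Paradox: a result that is significant at the 5% level while the null is very probably true.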
Finally, in practical situations, non-Bayesian methods may require scientists to
violate their own assumptions in order to proceed. In an oft-cited example, Lindley
and Phillips (1976) show that in a common, unremarkable experiment involving a
series of identical trials, each resulting in one of two outcomes (say heads or tails),
a proper non-Bayesian analysis is impossible unless the number of trials is known
in advance. If the sample size is not known before the start of the experiment, then
it is impossible to clearly define the sample space of possible outcomes; this sample
space is required for the computation of p-values. However, it is common for studies
to be terminated for unanticipated reasons. As a consequence, non-Bayesians often
violate their own assumptions and proceed as if the sample size were known a priori.
This problem does not arise in Bayesian inference.
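The dependence of a p-value on the stopping rule can be seen in a small calculation. The sketch below uses illustrative numbers that are not taken from the text: 9 heads and 3 tails, a null hypothesis that the coin is fair, and a one-sided alternative that heads are more likely. It compares two designs that could have produced exactly the same sequence of flips, and then checks that the two likelihoods differ only by a constant factor, which is why the posterior under any given prior is the same for both designs.

```python
from math import comb

heads, tails = 9, 3
n = heads + tails

# Design A: the number of flips (12) was fixed in advance (binomial sampling).
# p-value = P(9 or more heads in 12 flips of a fair coin).
p_fixed_n = sum(comb(n, k) for k in range(heads, n + 1)) / 2**n

# Design B: flipping continued until the 3rd tail (negative binomial sampling).
# 9 or more heads before the 3rd tail means at most 2 tails in the first 11 flips.
p_fixed_tails = sum(comb(n - 1, k) for k in range(tails)) / 2**(n - 1)

def likelihood_fixed_n(theta):
    """Binomial probability of 9 heads in 12 flips, with theta = P(heads)."""
    return comb(n, heads) * theta**heads * (1 - theta)**tails

def likelihood_fixed_tails(theta):
    """Negative binomial probability of 9 heads before the 3rd tail."""
    return comb(n - 1, tails - 1) * theta**heads * (1 - theta)**tails

# The ratio of the two likelihoods does not depend on theta.
ratios = [likelihood_fixed_n(t) / likelihood_fixed_tails(t) for t in (0.2, 0.5, 0.8)]

print(round(p_fixed_n, 4), round(p_fixed_tails, 4), ratios)
```

The one-sided p-value is about 0.073 when the number of flips was fixed and about 0.033 when the number of tails was fixed, so the same data fall on opposite sides of the conventional 0.05 threshold depending on the experimenter's intentions.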
3 One of the authors actually witnessed a well-respected scientist instruct a candidate in this definition
during a graduate committee meeting!
4 Reliance on p-values is also at least partially responsible for the unsettling lack of reproducibility
in scientific studies (e.g., see Begley and Ioannidis 2015). As a partial solution to this, Benjamin et
al. (2018) have suggested that p-values between the conventional 0.05 and a much more stringent
0.005 be called suggestive and that significant findings should be restricted to those with a p-value
less than 0.005.
As a result of using a self-contained inference system, Bayesian scientists are
relieved of considering ad-hockeries in order to complete analyses. Each analysis
follows the same well-defined steps. Furthermore, complicated situations with high-
dimensional problems and/or more than one data set are naturally accommodated.
Hence rather than considering seemingly arcane mathematics, scientists are freed to
focus on choosing the appropriate likelihood and prior distributions—those which
most faithfully describe the phenomena under investigation. We believe that this is
where the scientist’s focus should be.
Given the preceding discussion, it may occur to some scientists to wonder “why
wasn’t I taught Bayesian statistics as an undergraduate?” While there are a number
of possible contributing reasons for this, we believe that a primary impediment to
the teaching of Bayes to non-statisticians is the fact that it would require students to
first acquire (typically via yet another course) some knowledge of probability and
probability distributions. Since science students already must take many courses in
their field of study, the prospect of taking a probability course in order to be able to
take a Bayesian course is unappealing. Hence students normally opt for one statistics
course covering classical methods. This is unfortunate. We regard Bayesian vs. non-Bayesian
statistics as roughly analogous to two types of amusement parks. In one,
there is a significant admission price, but afterwards all the rides are free. This is
analogous to Bayesian methods. Before entering, one has to pay the cost of learning
probability. Afterwards, the logic of Bayesian methods is empowering. On the other
hand, there are amusement parks with no admission cost but a fee for every ride.
This is analogous to non-Bayesian methods; it is easy enough to learn simple, basic
concepts like t-tests and/or least squares but whenever new situations occur, the
scientist must learn a new set of statistical methods.
References
Begley, C. G., & Ioannidis, J. P. A. (2015). Reproducibility in science: Improving the standard for
basic and preclinical research. Circulation Research, 116, 116–126.
Benjamin, D. J., Berger, J. O., Johannesson, M., Nosek, B., Wagenmakers, E. J., Berk, R. et al.
(2018). Redefine statistical significance. Nature Human Behaviour, 2, 6–10.
Berger, J. O. & Delampady, M. (1987). Testing precise hypotheses. Statistical Science, 2, 317–352
(with discussion).
Berger, J. O. & Sellke, T. (1987). Testing a point null hypothesis: The irreconcilability of significance
levels and evidence. Journal of the American Statistical Association, 82, 112–133 (with
discussion).
Berger, J. O. (1985). Statistical Decision Theory and Bayesian Analysis. New York, NY: Springer.
Bernardo, J. M., & Smith, A. F. M. (1994). Bayesian Theory. New
York, NY: Wiley.
Berry, D. A. (1997). Teaching elementary Bayesian statistics with real applications in science. The
American Statistician, 51(3), 241–246.
Box, G. E. P., & Tiao, G. C. (1972). Bayesian Inference in Statistical Analysis. Reading, MA:
Addison-Wesley.
Carlin, B. P., & Louis, T. A. (2009). Bayesian Methods for Data Analysis (3rd ed.). Boca Raton,
FL: Chapman & Hall/CRC.