\mu_0^{(i+1)} = \mu_n^{(i)}
\Lambda_0^{(i+1)} = \Lambda_n^{(i)}
a_0^{(i+1)} = a_n^{(i)}
b_0^{(i+1)} = b_n^{(i)} .    (7)
The posterior distribution for Bayesian linear regression when observing a single data set is given by
the following hyperparameter equations (→ III/1.6.2):
\mu_n = \Lambda_n^{-1} (X^T P y + \Lambda_0 \mu_0)
\Lambda_n = X^T P X + \Lambda_0
a_n = a_0 + \frac{n}{2}
b_n = b_0 + \frac{1}{2} \left( y^T P y + \mu_0^T \Lambda_0 \mu_0 - \mu_n^T \Lambda_n \mu_n \right) .    (8)
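The update in (8) is straightforward to evaluate numerically. The following is a minimal sketch, assuming NumPy; the function name ng_posterior and all variable names are illustrative choices and not part of the text:

import numpy as np

def ng_posterior(y, X, P, mu0, Lambda0, a0, b0):
    """Normal-gamma posterior hyperparameters of eq. (8)."""
    Lambda_n = X.T @ P @ X + Lambda0
    mu_n = np.linalg.solve(Lambda_n, X.T @ P @ y + Lambda0 @ mu0)
    a_n = a0 + y.shape[0] / 2
    b_n = b0 + 0.5 * (y @ P @ y + mu0 @ Lambda0 @ mu0 - mu_n @ Lambda_n @ mu_n)
    return mu_n, Lambda_n, a_n, b_n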
We can apply (8) to calculate the posterior hyperparameters after seeing the first data set:
\mu_n^{(1)} = \left( \Lambda_n^{(1)} \right)^{-1} \left( X_1^T P_1 y_1 + \Lambda_0^{(1)} \mu_0^{(1)} \right)
            = \left( \Lambda_n^{(1)} \right)^{-1} \left( X_1^T P_1 y_1 + \Lambda_0 \mu_0 \right)
\Lambda_n^{(1)} = X_1^T P_1 X_1 + \Lambda_0^{(1)}
                = X_1^T P_1 X_1 + \Lambda_0
a_n^{(1)} = a_0^{(1)} + \frac{n_1}{2}
          = a_0 + \frac{n_1}{2}
b_n^{(1)} = b_0^{(1)} + \frac{1}{2} \left( y_1^T P_1 y_1 + \mu_0^{(1)T} \Lambda_0^{(1)} \mu_0^{(1)} - \mu_n^{(1)T} \Lambda_n^{(1)} \mu_n^{(1)} \right)
          = b_0 + \frac{1}{2} \left( y_1^T P_1 y_1 + \mu_0^T \Lambda_0 \mu_0 - \mu_n^{(1)T} \Lambda_n^{(1)} \mu_n^{(1)} \right) .    (9)
These are the prior hyperparameters before seeing the second data set:
\mu_0^{(2)} = \mu_n^{(1)}
\Lambda_0^{(2)} = \Lambda_n^{(1)}
a_0^{(2)} = a_n^{(1)}
b_0^{(2)} = b_n^{(1)} .    (10)
Thus, we can again use (8) to calculate the posterior hyperparameters after seeing the second data
set:
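A minimal numerical sketch of this two-step procedure, assuming NumPy and reusing the ng_posterior function from the sketch after (8) (all data below are simulated purely for illustration), also checks that the sequential analysis agrees with a single application of (8) to both data sets combined:

import numpy as np

rng = np.random.default_rng(0)
n1, n2, p = 30, 20, 3
X1, X2 = rng.normal(size=(n1, p)), rng.normal(size=(n2, p))
beta_true = np.array([1.0, -0.5, 2.0])
y1 = X1 @ beta_true + rng.normal(size=n1)
y2 = X2 @ beta_true + rng.normal(size=n2)
P1, P2 = np.eye(n1), np.eye(n2)                  # V = I, hence P = V^{-1} = I
mu0, Lambda0, a0, b0 = np.zeros(p), np.eye(p), 1.0, 1.0

# sequential analysis: eq. (9), then eq. (10) as the new prior, then eq. (8) again
post1 = ng_posterior(y1, X1, P1, mu0, Lambda0, a0, b0)
post2 = ng_posterior(y2, X2, P2, *post1)

# joint analysis of the concatenated data with the original prior
X, y, P = np.vstack([X1, X2]), np.concatenate([y1, y2]), np.eye(n1 + n2)
joint = ng_posterior(y, X, P, mu0, Lambda0, a0, b0)

print(all(np.allclose(s, j) for s, j in zip(post2, joint)))   # True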
Completing the square over β, we finally have
p(y, \beta, \tau) = \sqrt{ \frac{\tau^{n+p}}{(2\pi)^{n+p}} |P| |\Lambda_0| } \cdot \frac{b_0^{a_0}}{\Gamma(a_0)} \, \tau^{a_0 - 1} \exp[-b_0 \tau] \cdot
\exp\!\left[ -\frac{\tau}{2} \left( (\beta - \mu_n)^T \Lambda_n (\beta - \mu_n) + \left( y^T P y + \mu_0^T \Lambda_0 \mu_0 - \mu_n^T \Lambda_n \mu_n \right) \right) \right]    (12)
with the posterior hyperparameters (→ I/5.1.7)
\mu_n = \Lambda_n^{-1} (X^T P y + \Lambda_0 \mu_0)
\Lambda_n = X^T P X + \Lambda_0 .    (13)
Ergo, the joint likelihood is proportional to
p(y, \beta, \tau) \propto \tau^{p/2} \cdot \exp\!\left[ -\frac{\tau}{2} (\beta - \mu_n)^T \Lambda_n (\beta - \mu_n) \right] \cdot \tau^{a_n - 1} \cdot \exp[-b_n \tau]    (14)
with the posterior hyperparameters (→ I/5.1.7)
a_n = a_0 + \frac{n}{2}
b_n = b_0 + \frac{1}{2} \left( y^T P y + \mu_0^T \Lambda_0 \mu_0 - \mu_n^T \Lambda_n \mu_n \right) .    (15)
From the β-dependent factors in (14), we can isolate the posterior distribution over β given τ:

p(\beta|\tau, y) = \mathcal{N}(\beta; \mu_n, (\tau \Lambda_n)^{-1}) .    (16)

From the remaining factors, we can isolate the posterior distribution over τ:

p(\tau|y) = \mathrm{Gam}(\tau; a_n, b_n) .    (17)
Together, (16) and (17) constitute the joint (→ I/1.3.2) posterior distribution (→ I/5.1.7) of β and
τ.
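To illustrate how (16) and (17) can be used in practice, the following sketch, assuming NumPy, draws samples from the joint posterior by first sampling τ and then β given τ; the hyperparameter values would come from (13) and (15), and the function name is an illustrative choice only:

import numpy as np

def sample_posterior(mu_n, Lambda_n, a_n, b_n, size=1000, seed=None):
    """Draw (beta, tau) samples from the joint posterior in eqs. (16) and (17)."""
    rng = np.random.default_rng(seed)
    tau = rng.gamma(shape=a_n, scale=1.0 / b_n, size=size)       # eq. (17)
    Sigma = np.linalg.inv(Lambda_n)                              # Lambda_n^{-1}
    beta = np.stack([rng.multivariate_normal(mu_n, Sigma / t)    # eq. (16)
                     for t in tau])
    return beta, tau

The posterior mean of β, for instance, can then be approximated by beta.mean(axis=0).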
■
Sources:
• Bishop CM (2006): “Bayesian linear regression”; in: Pattern Recognition and Machine Learning, pp. 152–161, ex. 3.12, eq. 3.113; URL: https://www.springer.com/gp/book/9780387310732.
1.6.3 Log model evidence
Theorem: Let
m: \; y = X\beta + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2 V)    (1)

be a linear regression model (→ III/1.5.1) with measured n × 1 data vector y, known n × p design matrix X, known n × n covariance structure V as well as unknown p × 1 regression coefficients β and unknown noise variance σ². Moreover, assume a normal-gamma prior distribution (→ III/1.6.1) over the model parameters β and τ = 1/σ²:
1.11.5 Variance of a constant
Theorem: The variance (→ I/1.11.1) of a constant (→ I/1.2.5) is zero
a = const. ⇒ Var(a) = 0 (1)
and if the variance (→ I/1.11.1) of X is zero, then X is a constant (→ I/1.2.5)
Var(X) = 0 ⇒ X = const. (2)
Proof:
1) A constant (→ I/1.2.5) is defined as a quantity that always has the same value. Thus, if understood
as a random variable (→ I/1.2.2), the expected value (→ I/1.10.1) of a constant is equal to itself:
E(a) = a . (3)
Plugged into the formula of the variance (→ I/1.11.1), we have
\mathrm{Var}(a) = \mathrm{E}\!\left[ (a - \mathrm{E}(a))^2 \right]
                = \mathrm{E}\!\left[ (a - a)^2 \right]    (4)
                = \mathrm{E}(0) .
Applied to the formula of the expected value (→ I/1.10.1), this gives
\mathrm{E}(0) = \sum_{x=0} x \cdot f_X(x) = 0 \cdot 1 = 0 .    (5)
Together, (4) and (5) imply (1).
2) The variance (→ I/1.11.1) is defined as
\mathrm{Var}(X) = \mathrm{E}\!\left[ (X - \mathrm{E}(X))^2 \right] .    (6)
Because (X − E(X))² is strictly non-negative (→ I/1.10.4), the only way for the variance to become zero is if the squared deviation is always zero:

(X - \mathrm{E}(X))^2 = 0 .    (7)
This, in turn, requires that X is equal to its expected value (→ I/1.10.1)
X = E(X) (8)
which can only be the case if X always has the same value (→ I/1.2.5):
X = const. (9)
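As a quick numerical illustration of (1), assuming NumPy:

import numpy as np
print(np.var(np.full(10, 3.0)))   # 0.0 -- the variance of a constant is zero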
1.3.6 Statistical independence
Definition: Generally speaking, random variables (→ I/1.2.2) are statistically independent, if their
joint probability (→ I/1.3.2) can be expressed in terms of their marginal probabilities (→ I/1.3.3).
1) A set of discrete random variables (→ I/1.2.2) X1, …, Xn with possible values 𝒳1, …, 𝒳n is called statistically independent, if

p(X_1 = x_1, \ldots, X_n = x_n) = \prod_{i=1}^{n} p(X_i = x_i) \quad \text{for all} \quad x_i \in \mathcal{X}_i, \; i = 1, \ldots, n    (1)
where p(x1 , . . . , xn ) are the joint probabilities (→ I/1.3.2) of X1 , . . . , Xn and p(xi ) are the marginal
probabilities (→ I/1.3.3) of Xi .
2) A set of continuous random variables (→ I/1.2.2) X1, …, Xn defined on the domains 𝒳1, …, 𝒳n is called statistically independent, if

F_{X_1, \ldots, X_n}(x_1, \ldots, x_n) = \prod_{i=1}^{n} F_{X_i}(x_i) \quad \text{for all} \quad x_i \in \mathcal{X}_i, \; i = 1, \ldots, n    (2)

or equivalently, if the probability densities (→ I/1.7.1) exist, if

f_{X_1, \ldots, X_n}(x_1, \ldots, x_n) = \prod_{i=1}^{n} f_{X_i}(x_i) \quad \text{for all} \quad x_i \in \mathcal{X}_i, \; i = 1, \ldots, n    (3)
where F are the joint (→ I/1.5.2) or marginal (→ I/1.5.3) cumulative distribution functions (→
I/1.8.1) and f are the respective probability density functions (→ I/1.7.1).
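As a small illustration of definition (1), assuming NumPy, consider two independent fair dice: every entry of the joint probability table equals the product of the corresponding marginal probabilities.

import numpy as np

p_joint = np.full((6, 6), 1 / 36)              # p(X1 = x1, X2 = x2) for two fair dice
p1 = p_joint.sum(axis=1)                       # marginal p(X1 = x1)
p2 = p_joint.sum(axis=0)                       # marginal p(X2 = x2)
print(np.allclose(p_joint, np.outer(p1, p2)))  # True: X1 and X2 are independent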
Sources:
• Wikipedia (2020): “Independence (probability theory)”; in: Wikipedia, the free encyclopedia, retrieved on 2020-06-06; URL: https://en.wikipedia.org/wiki/Independence_(probability_theory)#Definition.
1.3.7 Conditional independence
Definition: Generally speaking, random variables (→ I/1.2.2) are conditionally independent given
another random variable, if they are statistically independent (→ I/1.3.6) in their conditional prob-
ability distributions (→ I/1.5.4) given this random variable.
1) A set of discrete random variables (→ I/1.2.6) X1, …, Xn with possible values 𝒳1, …, 𝒳n is called conditionally independent given the random variable Y with possible values 𝒴, if
1) expressing the first k moments (→ I/1.18.1) of y in terms of θ
\mu_1 = f_1(\theta_1, \ldots, \theta_k)
\vdots    (1)
\mu_k = f_k(\theta_1, \ldots, \theta_k) ,
2) calculating the first k sample moments (→ I/1.18.1) from y
µ̂1 (y), . . . , µ̂k (y) (2)
3) and solving the system of k equations
\hat{\mu}_1(y) = f_1(\hat{\theta}_1, \ldots, \hat{\theta}_k)
\vdots    (3)
\hat{\mu}_k(y) = f_k(\hat{\theta}_1, \ldots, \hat{\theta}_k)
for θ̂1, …, θ̂k, which are subsequently referred to as “method-of-moments estimates”; a numerical sketch of these three steps is given below.
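As a concrete sketch of steps 1)–3), assuming NumPy, take a gamma distribution with shape k and scale s (this example is not from the text): its mean is ks and its variance is ks², so matching these to the first sample moment m1 and the second central sample moment m2 gives k̂ = m1²/m2 and ŝ = m2/m1.

import numpy as np

rng = np.random.default_rng(1)
y = rng.gamma(shape=2.0, scale=3.0, size=10_000)   # simulated data, true k = 2, s = 3

m1 = np.mean(y)                  # first sample moment
m2 = np.mean((y - m1) ** 2)      # second central sample moment
k_hat, s_hat = m1**2 / m2, m2 / m1
print(k_hat, s_hat)              # method-of-moments estimates, close to 2 and 3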
Sources:
• Wikipedia (2021): “Method of moments (statistics)”; in: Wikipedia, the free encyclopedia, retrieved
on 2021-04-29; URL: https://en.wikipedia.org/wiki/Method_of_moments_(statistics)#Method.
4.2 Statistical hypotheses
4.2.1 Statistical hypothesis
Definition: A statistical hypothesis is a statement about the parameters of a distribution describing
a population from which observations can be sampled as measured data.
More precisely, let m be a generative model (→ I/5.1.1) describing measured data y in terms of a
distribution D(θ) with model parameters θ ∈ Θ. Then, a statistical hypothesis is formally specified
as
H : θ ∈ Θ∗ where Θ∗ ⊂ Θ . (1)
Sources:
• Wikipedia (2021): “Statistical hypothesis testing”; in: Wikipedia, the free encyclopedia, retrieved on 2021-03-19; URL: https://en.wikipedia.org/wiki/Statistical_hypothesis_testing#Definition_of_terms.
4.2.2 Simple vs. composite
Definition: Let H be a statistical hypothesis (→ I/4.2.1). Then,