1
Introduction to Bioinformatics for
Computer Scientists
Lecture 12
2
Exam
● Exam days:
● Feb 10 only for those who can't make the dates in April!
● April 22, 23, and 24
● I just sent around a doodle
● Also register for exam via campus.kit.edu !
3
Plan for next lectures
● Today
● Bayesian statistics & (MC)MCMC methods
● Advanced MCMC
● Population genetics
● Course & Exam review
4
Outline for today
● Bayesian statistics
● Monte-Carlo simulations
● Markov-Chain Monte-Carlo (MCMC) methods
● Metropolis-coupled MCMC-methods
● Course beers tonight ! 19:00 at Vogelbräu Karlsruhe
5
Bayesian and Maximum Likelihood
Inference
● In phylogenetics Bayesian and ML (Maximum Likelihood)
methods have a lot in common
● Computationally, both approaches re-evaluate the phylogenetic
likelihood over and over and over again for different tree
topologies, branch lengths, and model parameters
● Bayesian and ML codes spend approx. 80-95% of their total run
time in likelihood calculations on trees
● Bayesian methods sample the posterior probability distribution
● ML methods strive to find a point estimate that maximizes the
likelihood
6
Bayesian Phylogenetic Methods
● The methods used perform stochastic searches, that is, they do
not strive to maximize the likelihood, but rather integrate over it
● Thus, no numerical optimization methods for model parameters
and branch lengths are needed; parameters are proposed at
random
● It is substantially easier to infer trees under complex models
using Bayesian statistics than using Maximum Likelihood
7
A Review of Probabilities
Eye color \ Hair color    brown    blonde   Σ
light                     5/40     15/40    20/40
dark                      15/40    5/40     20/40
Σ                         20/40    20/40    40/40
8
A Review of Probabilities
Joint probability: probability of observing both A and B: Pr(A,B)
For instance, Pr(brown, light) = 5/40 = 0.125
9
A Review of Probabilities
Marginal Probability: unconditional probability of an observation Pr(A)
For instance, Pr(dark) = Pr(dark,brown) + Pr(dark,blonde) = 15/40 + 5/40 = 20/40 = 0.5
Marginalize over hair color
10
A Review of Probabilities
Conditional probability: the probability of observing A given that B has occurred.
Pr(A|B) is the fraction of the cases in which B occurs, Pr(B), in which A also occurs, Pr(A,B):
Pr(A|B) = Pr(A,B) / Pr(B)
For instance, Pr(blonde|light) = Pr(blonde,light) / Pr(light) = (15/40) / (20/40) = 0.75
11
A Review of Probabilities
Statistical independence: two events A and B are independent
if their joint probability Pr(A,B) equals the product of their marginal probabilities Pr(A) Pr(B)
For instance, Pr(light,brown) = 5/40 = 0.125 ≠ Pr(light) Pr(brown) = 0.5 · 0.5 = 0.25, that is, the events are not independent!
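The joint, marginal, and conditional probabilities and the independence check from the hair/eye color table can be reproduced in a few lines. This is a minimal Python sketch; the dictionary layout and the function names are illustrative choices, not part of the lecture.

```python
from fractions import Fraction

# Joint probabilities Pr(eye, hair) from the hair/eye color table.
joint = {
    ("light", "brown"): Fraction(5, 40),
    ("light", "blonde"): Fraction(15, 40),
    ("dark", "brown"): Fraction(15, 40),
    ("dark", "blonde"): Fraction(5, 40),
}

def pr_eye(eye):
    """Marginal Pr(eye): sum the joint probabilities over all hair colors."""
    return sum(p for (e, _), p in joint.items() if e == eye)

def pr_hair(hair):
    """Marginal Pr(hair): sum the joint probabilities over all eye colors."""
    return sum(p for (_, h), p in joint.items() if h == hair)

def pr_hair_given_eye(hair, eye):
    """Conditional Pr(hair | eye) = Pr(eye, hair) / Pr(eye)."""
    return joint[(eye, hair)] / pr_eye(eye)

def independent(eye, hair):
    """True if Pr(eye, hair) equals Pr(eye) * Pr(hair)."""
    return joint[(eye, hair)] == pr_eye(eye) * pr_hair(hair)

print(pr_eye("dark"))                        # 1/2
print(pr_hair_given_eye("blonde", "light"))  # 3/4
print(independent("light", "brown"))         # False
```

Exact fractions are used so that the independence test is not affected by floating-point rounding.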
12
A Review of Probabilities
Conditional Probability:
Pr(A|B) = Pr(A,B) / Pr(B)
Joint Probability:
Pr(A,B) = Pr(A|B) Pr(B)
and
Pr(A,B) = Pr(B|A) Pr(A)
Problem:
If I can compute Pr(A|B) how can I get Pr(B|A)?
13
A Review of Probabilities
Bayes Theorem:
Pr(B|A) = Pr(A,B) / Pr(A)
14
A Review of Probabilities
Bayes Theorem:
Pr(B|A) = Pr(A|B) Pr(B) / Pr(A)
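As a quick numeric check of Bayes theorem, the hair/eye color numbers from the earlier slides can be plugged in, with A = blonde hair and B = light eyes. This is a sketch only; the function name `bayes` and the variable names are illustrative.

```python
from fractions import Fraction

pr_light = Fraction(20, 40)             # Pr(B): marginal probability of light eyes
pr_blonde = Fraction(20, 40)            # Pr(A): marginal probability of blonde hair
pr_blonde_given_light = Fraction(3, 4)  # Pr(A|B), computed on the earlier slide

def bayes(pr_a_given_b, pr_b, pr_a):
    """Pr(B|A) = Pr(A|B) Pr(B) / Pr(A)."""
    return pr_a_given_b * pr_b / pr_a

pr_light_given_blonde = bayes(pr_blonde_given_light, pr_light, pr_blonde)
print(pr_light_given_blonde)  # 3/4, the same as Pr(light,blonde) / Pr(blonde) = (15/40) / (20/40)
```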
15
Bayes Theorem
Pr(B|A) = Pr(A|B) Pr(B) / Pr(A)
A: observed outcome
B: unobserved outcome
16
Bayes Theorem
Pr(B|A) = Pr(A|B) Pr(B) / Pr(A)
Pr(B|A): posterior probability
Pr(A|B): likelihood
Pr(B): prior probability
Pr(A): marginal probability
17
Bayes Theorem: Phylogenetics
Pr(Tree,Params|Alignment) = Pr(Alignment|Tree, Params) Pr(Tree,Params) / Pr(Alignment)
Posterior probability, Pr(Tree,Params|Alignment): distribution over all possible trees and all model parameter values
Likelihood, Pr(Alignment|Tree,Params): how well does the alignment fit the tree and model parameters?
Prior probability, Pr(Tree,Params): introduces prior knowledge/assumptions about the probability distribution
of trees and model parameters (e.g., GTR rates, α shape parameter).
For instance, we typically assume that all possible tree topologies are equally probable
→ uniform prior
Marginal probability, Pr(Alignment): how do we obtain this?
18
Bayes Theorem: Phylogenetics
Pr(Tree|Alignment) = Pr(Alignment|Tree) Pr(Tree) / Pr(Alignment)
Pr(Tree|Alignment): posterior probability; Pr(Alignment|Tree): likelihood; Pr(Tree): prior probability; Pr(Alignment): marginal probability
Marginal probability: assume that our only model parameter is the tree. Marginalizing then
means summing the joint probabilities over all trees, thus
Pr(Alignment)
can be written as
Pr(Alignment) = Pr(Alignment, t0) + Pr(Alignment, t1) + … + Pr(Alignment, tn)
where n+1 is the number of possible trees!
19
Bayes Theorem: Phylogenetics
Marginal probability: the sum
Pr(Alignment) = Pr(Alignment, t0) + Pr(Alignment, t1) + … + Pr(Alignment, tn)
can be re-written as
Pr(Alignment) = Pr(Alignment|t0) Pr(t0) + Pr(Alignment|t1) Pr(t1) + … + Pr(Alignment|tn) Pr(tn)
20
Bayes Theorem: Phylogenetics
Marginal probability:
Pr(Alignment) = Pr(Alignment|t0) Pr(t0) + Pr(Alignment|t1) Pr(t1) + … + Pr(Alignment|tn) Pr(tn)
Each factor Pr(Alignment|ti) is a likelihood; each prior Pr(ti) := 1 / (n+1)
→ this is a uniform prior!
Now, we have all the ingredients for computing Pr(Tree|Alignment). However, computing
Pr(Alignment) is prohibitive due to the large number of trees!
With continuous parameters the above sum for obtaining the marginal probability becomes
an integral. Usually, all parameters we integrate over (tree topology, model parameters, etc.) are
lumped into a parameter vector denoted by θ
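When the number of trees is tiny, the sum for Pr(Alignment) can be evaluated directly. The sketch below computes the posterior of three trees under a uniform prior. The log likelihood values are made up for illustration, and the function name `tree_posteriors` is hypothetical.

```python
import math

def tree_posteriors(log_likelihoods):
    """Posterior Pr(ti | Alignment) under a uniform prior Pr(ti) = 1 / (n+1)."""
    prior = 1.0 / len(log_likelihoods)
    # Work relative to the maximum log likelihood: real tree log likelihoods are
    # large negative numbers whose exponentials would underflow to 0.0.
    m = max(log_likelihoods)
    weighted = [math.exp(ll - m) * prior for ll in log_likelihoods]
    marginal = sum(weighted)  # Pr(Alignment), up to the constant factor exp(m)
    return [w / marginal for w in weighted]

# Hypothetical log likelihoods for three trees (made-up numbers).
posteriors = tree_posteriors([-1000.0, -1001.0, -1003.0])
print([round(p, 3) for p in posteriors])
```

The constant factor exp(m) cancels between numerator and denominator, so the posteriors are unaffected by the rescaling.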
21
Bayes Theorem General Form
f(θ|A) = f(A|θ) f(θ) / ∫f(θ)f(A|θ)dθ
f(θ|A): posterior distribution (posterior probability)
f(A|θ): likelihood
f(θ): prior distribution (prior probability)
∫f(θ)f(A|θ)dθ: marginal likelihood (normalization constant)
We know how to compute f(A|θ) → the likelihood of the tree
Problems:
Problem 1: f(θ) is given a priori, but how do we choose an appropriate distribution?
→ biggest strength and weakness of Bayesian approaches
Problem 2: How can we calculate/approximate ∫f(θ)f(A|θ)dθ ?
→ to explain this we need to introduce additional machinery
However, let us first look at an example for f(θ|A) in phylogenetics
22
Bayes Theorem General Form
f(θ|A) = f(A|θ) f(θ) / ∫f(θ)f(A|θ)dθ
Note that in the continuous case, f() is called a probability density function
23
Probability Density Function
Properties:
1. f(x) ≥ 0 for all allowed values x
2. The area under f(x) is 1.0
3. The probability that x falls into an interval (e.g. 0.2 to 0.3) is given by the
integral of f(x) over this interval
24
An Example
25
An Example
[Figure: the data (observations → sequences) turn a prior distribution into a posterior distribution; both are plotted as probability (up to 1.0) over a parameter space of 3 distinct tree topologies]
26
An Example
[Figure: prior distribution with probability 1/3 for each of the 3 distinct tree topologies, next to the posterior probability]
Note that this is a discrete distribution, since we only consider
the trees as parameters!
27
An Example
[Figure: uniform prior (1/3, 1/3, 1/3) over the three tree topologies; the posterior probability is marked with a '?']
What happens to the posterior
probability if we don't have enough data,
e.g., an alignment with a single site?
28
An Example
Include additional model parameters such as branch lengths, GTR rates, and
the α shape parameter of the Γ distribution into the model:
θ = (tree, α, branch-lengths, GTR-rates)
[Figure: posterior probability f(θ|A) over the parameter space of θ, divided into regions for tree 1, tree 2, and tree 3]
29
An Example
Marginal probability distribution of trees
We can look at this distribution for any parameter of interest by marginalizing
(integrating out) all other parameters.
Here we focus on the tree topology.
[Figure: posterior probability f(θ|A) over the parameter space; marginal posterior probabilities: tree 1 20%, tree 2 48%, tree 3 32%]
30
An Example
Marginal probability distribution of trees
We obtain the probability of a tree by integrating f(θ|A) over the interval
of the parameter space that belongs to this tree!
31
Marginalization
Joint probabilities of the trees and three discrete values of the α-shape parameter:
              t1      t2      t3      Σ
α1 = 0.5      0.10    0.07    0.12    0.29
α2 = 1.0      0.05    0.22    0.06    0.33
α3 = 5.0      0.05    0.19    0.14    0.38
Σ             0.20    0.48    0.32    1.0
Bottom row: marginal probabilities of trees
Right column: marginal probabilities of α values
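Marginalizing this discrete joint distribution amounts to summing rows and columns. A minimal Python sketch of that bookkeeping follows; the variable names are illustrative.

```python
# Joint posterior probabilities Pr(α, tree) from the marginalization table.
trees = ["t1", "t2", "t3"]
alphas = [0.5, 1.0, 5.0]
joint = [
    [0.10, 0.07, 0.12],  # α = 0.5
    [0.05, 0.22, 0.06],  # α = 1.0
    [0.05, 0.19, 0.14],  # α = 5.0
]

# Marginalize over α: sum each column to get the probability of each tree.
tree_marginals = {t: round(sum(row[j] for row in joint), 2) for j, t in enumerate(trees)}
# Marginalize over trees: sum each row to get the probability of each α value.
alpha_marginals = {a: round(sum(row), 2) for a, row in zip(alphas, joint)}

print(tree_marginals)   # {'t1': 0.2, 't2': 0.48, 't3': 0.32}
print(alpha_marginals)  # {0.5: 0.29, 1.0: 0.33, 5.0: 0.38}
```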
32
An Example
Marginal probability distribution of α
We can look at this distribution for any parameter of interest by marginalizing
(integrating out) all other parameters.
Here we focus on the three discrete α values.
[Figure: marginal posterior probabilities: α = 0.5 29%, α = 1.0 33%, α = 5.0 38%]
33
Bayes versus Likelihood
ML: Joint estimation
Bayesian: Marginal estimation
See: Holder & Lewis, “Phylogeny Estimation: Traditional and Bayesian Approaches”
34
Outline for today
● Bayesian statistics
● Monte-Carlo simulation & integration
● Markov-Chain Monte-Carlo methods
● Metropolis-coupled MCMC-methods
35
Bayes Theorem General Form
f(θ|A) = (likelihood * prior) / ouch
Marginal likelihood
Normalization constant
→ difficult to calculate
We know how to compute f(A|θ) → the likelihood of the tree
Problems:
Problem 1: f(θ) is given a priori, but how do we choose an appropriate distribution?
→ biggest strength and weakness of Bayesian approaches
Problem 2: How can we calculate/approximate ∫f(θ)f(A|θ)dθ ?
→ to explain this we need to introduce additional machinery to design methods for
numerical integration
36
How can we compute this integral?
[Figure: posterior probability f(θ|A) over the parameter space of θ]
37
The Classic Example
● Calculating π (the geometric constant!) with Monte-Carlo
Procedure:
1. Randomly throw points onto the square n times
2. Count how many points fall into the circle: ni
3. Determine π as 4 · ni / n
→ the ratio ni / n approximates the ratio of the areas of the circle and the square,
which is π/4, assuming the circle is inscribed in the square
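The procedure can be sketched in Python as follows, assuming a unit circle inscribed in the square [-1, 1] × [-1, 1], so that the fraction of points inside the circle approximates π/4. The function name and the fixed seed are illustrative choices.

```python
import random

def estimate_pi(n, seed=0):
    """Monte-Carlo estimate of π from n random points in the square [-1, 1] x [-1, 1]."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:  # the point fell into the unit circle
            inside += 1
    # inside / n approximates area(circle) / area(square) = π / 4
    return 4.0 * inside / n

print(estimate_pi(200_000))
```

The error shrinks only with the square root of n, which is why many points are needed for a few correct digits.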
38
Monte Carlo Integration
● Method for numerical integration of m-dimensional integrals over R^m:
∫f(θ)dθ ≈ 1/N Σ f(θi)
where each θi is from the domain R^m
● More precisely, if the integral ∫ is defined over a domain/volume V
the equation becomes: V * 1/N * Σ f(θi)
● Key issues:
● Monte Carlo simulations draw the samples θi at which f() is evaluated
completely at random → random grid
● How many points do we need to sample for a 'good'
approximation?
● The domain R^m might be too large for random sampling!
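A minimal sketch of the estimator V * 1/N * Σ f(θi) over a box-shaped domain follows. The function `mc_integrate` and the test integrand are assumptions made for illustration only.

```python
import random

def mc_integrate(f, bounds, n, seed=0):
    """Estimate the integral of f over the box given by bounds = [(lo, hi), ...]."""
    rng = random.Random(seed)
    volume = 1.0
    for lo, hi in bounds:
        volume *= hi - lo
    total = 0.0
    for _ in range(n):
        # Draw one point θi uniformly at random from the box.
        theta = [rng.uniform(lo, hi) for lo, hi in bounds]
        total += f(theta)
    return volume * total / n

# Example: integrate f(x, y) = x * y over [0, 2] x [0, 1]; the exact value is 1.
estimate = mc_integrate(lambda t: t[0] * t[1], [(0.0, 2.0), (0.0, 1.0)], 100_000)
print(estimate)
```

Uniform sampling wastes most points when f is concentrated in a small part of the domain, which is the motivation for MCMC later in the lecture.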
39
Outline for today
● Bayesian statistics
● Monte-Carlo simulation & integration
● Markov-Chain Monte-Carlo methods
● Metropolis-coupled MCMC-methods
40
How can we compute this integral?
[Figure: posterior probability f(θ|A) over the parameter space of θ]
Monte-Carlo methods: randomly sample data points in this
huge parameter space to approximate the integral
41
How can we compute this integral?
[Figure: posterior probability f(θ|A) over the parameter space of θ]
In which parts of the distribution are we interested?
42
Distribution Landscape
In which parts of the distribution are we interested?
→ Areas of high posterior probability
43
Distribution Landscape
In which parts of the distribution are we interested?
How can we get a sample faster?
44
Distribution Landscape
In which parts of the distribution are we interested?
How can we get a sample faster? → Markov Chain Monte Carlo Methods
45
Distribution Landscape
In which parts of the distribution are we interested?
[Figure: higher sample density in the two areas of high posterior probability]
46
Distribution Landscape
In which parts of the distribution are we interested?
[Figure: higher sample density in the two areas of high posterior probability → fewer misses]
47
Markov-Chain Monte-Carlo
[Figure: posterior probability f(θ|A) over the parameter space of θ, with higher sample density in the areas of high posterior probability]
MCMC → biased random walks: the probability of evaluating a sample in a given area
of the parameter space is proportional to the posterior probability of that area
48
Markov-Chain Monte-Carlo
● Idea: Move the grid/samples into regions of high probability
● Construct a Markov Chain that generates samples such that
more time is spent (more samples are evaluated) in the most
interesting regions of the state space
● MCMC can also be used for hard CS optimization problems, for
instance, the knapsack problem
● Note that, MCMC is similar to Simulated Annealing → there's no
time to go into the details though here!
49
The Robot Metaphor
Crete, December 2019
50
The Robot Metaphor
● Drop a robot onto an unknown planet to explore its landscape
● Teaching idea and slides adapted from Paul O. Lewis
Uphill steps → always accepted
Small downhill steps → usually accepted
Huge downhill steps → almost never accepted
[Figure: the robot on a landscape; vertical axis: elevation]
51
How to accept/reject proposals
● Decision to accept/reject a proposal to go from
point 1 → point 2 is based on the ratio R of posterior densities
of the two points/samples
R = Pr(point2|data) / Pr(point1|data)
= (Pr(point2) Pr(data|point2) / Pr(data)) / (Pr(point1) Pr(data|point1) / Pr(data))
= (Pr(point2) Pr(data|point2)) / (Pr(point1) Pr(data|point1))
52
How to accept/reject proposals
The marginal probability of the data cancels out!
Phew, we don't need to compute it.
53
How to accept/reject proposals
R = (Pr(point2) Pr(data|point2)) / (Pr(point1) Pr(data|point1))
= (Pr(point2) / Pr(point1)) * (Pr(data|point2) / Pr(data|point1))
Prior ratio Pr(point2) / Pr(point1): for uniform priors this is 1 !
Likelihood ratio: Pr(data|point2) / Pr(data|point1)
54
The Robot Metaphor
● Drop a robot onto an unknown planet to explore its landscape
At 1 m, proposal to go up to 2 m
Ratio: 2/1 = 2 → always accept
At 10 m, proposal to go down to 9 m
Ratio: 9/10 = 0.9 → accept with
probability 90%
At 8 m, proposal to go down to 1 m
Ratio: 1/8 = 0.125 → accept
with probability 12.5%
[Figure: the robot on a landscape; vertical axis: elevation]
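The accept/reject rule of the robot can be written down directly. In this sketch, elevation stands in for the posterior density; the function names are illustrative.

```python
import random

def acceptance_probability(current, proposed):
    """Accept with probability min(1, R), where R = proposed / current."""
    return min(1.0, proposed / current)

def accept(current, proposed, rng=random):
    """Draw a uniform random number in [0, 1) and accept if it is below min(1, R)."""
    return rng.random() < acceptance_probability(current, proposed)

print(acceptance_probability(1.0, 2.0))   # 1.0
print(acceptance_probability(10.0, 9.0))  # 0.9
print(acceptance_probability(8.0, 1.0))   # 0.125
```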
55
Distributions
● The target distribution is the posterior distribution we are trying
to sample (integrate over)!
● The proposal distribution decides which point (how far/close) in
the landscape to randomly go to/try next:
→ The choice has an effect on the efficiency of the MCMC
algorithm, that is, how fast it will get to these interesting areas we
want to sample
56
The Robot Metaphor
[Figure: target distribution (posterior probability) with a proposal distribution drawn around the robot]
Proposal distribution: how far left or right will we usually go?
57
The Robot Metaphor
Proposal distribution with smaller variance → what happens?
Pros: seldom refuses a step
Cons: smaller steps, more steps required for exploration
58
The Robot Metaphor
Proposal distribution with larger variance → what happens?
59
The Robot Metaphor
Proposal distribution with larger variance → what happens?
Pros: can cover a large area quickly
Cons: lots of steps will be rejected
60
The Robot Metaphor
A proposal distribution that balances pros & cons yields 'good mixing'
61
Mixing
● A well-designed chain will require only a few steps to reach
convergence, that is, to approximate the underlying probability
density function 'well enough' from a random starting point
● Mixing is a somewhat fuzzy term; it refers to the proportion of accepted
proposals (acceptance ratio) generated by a proposal mechanism
→ should be neither too low, nor too high
● The real art in designing MCMC methods consists of
● building & tuning good proposal mechanisms
● selecting appropriate proposal distributions
● such that they quickly approximate the distribution we want to sample
from
62
The Robot Metaphor
Target distribution/
posterior probability
When the proposal distribution is
symmetric, that is, the probability
of moving left or right is the same,
we use the Metropolis algorithm
63
The Metropolis Algorithm
● Metropolis et al. 1953 http://www.aliquote.org/pub/metropolis-et-al-1953.pdf
● Initialization: Choose an arbitrary point θ0 as first sample
● Choose an arbitrary probability density Q(θi+1|θi) which suggests a candidate for the next
sample θi+1 given the previous sample θi.
● For the Metropolis algorithm, Q() must be symmetric:
it must satisfy Q(θi+1|θi) = Q(θi|θi+1)
● For each iteration i:
● Generate a candidate θ* for the next sample by picking from the distribution Q(θ*|θi)
● Calculate the acceptance ratio R = (Pr(θ*) Pr(data|θ*)) / (Pr(θi) Pr(data|θi))
– If R ≥ 1, then θ* is more likely than θi → automatically accept the candidate by setting θi+1 := θ*
– Otherwise, accept the candidate θ* with probability R → if the candidate is rejected: θi+1 := θi
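The steps above can be sketched for a one-dimensional toy problem. Everything specific here is an assumption made for illustration: a standard normal density stands in for Pr(θ) Pr(data|θ), the proposal Q is a Gaussian centered on the current sample (which is symmetric), and the names `log_target` and `metropolis` are hypothetical.

```python
import math
import random

def log_target(theta):
    """Unnormalized log posterior; a standard normal stands in for Pr(θ) Pr(data|θ)."""
    return -0.5 * theta * theta

def metropolis(n_iterations, step=1.0, theta0=5.0, seed=1):
    rng = random.Random(seed)
    theta = theta0
    samples = []
    for _ in range(n_iterations):
        candidate = theta + rng.gauss(0.0, step)  # symmetric proposal Q
        log_r = log_target(candidate) - log_target(theta)
        # Accept if R >= 1, otherwise with probability R (compared in log space).
        if log_r >= 0.0 or rng.random() < math.exp(log_r):
            theta = candidate
        samples.append(theta)  # after a rejection the old sample is repeated
    return samples

samples = metropolis(50_000)
kept = samples[5_000:]  # drop the early samples that still depend on θ0
mean = sum(kept) / len(kept)
var = sum((s - mean) ** 2 for s in kept) / len(kept)
print(mean, var)  # should be close to 0 and 1
```

Working with log densities avoids numerical underflow, which matters for phylogenetic likelihoods that are extremely small numbers.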
64
The Metropolis Algorithm
Conceptually this is the same Q
we saw for substitution models!
65
Phylogenetic Metropolis Algorithm
● Initialization: Choose a random tree with random branch lengths as first sample
● For each iteration i:
● Propose either
– a new tree topology
– a new branch length
and re-calculate the likelihood
● Calculate the acceptance ratio of the proposal
● Accept the new tree/branch length or reject it
● Print current tree with branch lengths to file only every k (e.g. 1000) iterations
→ to generate a sample from the chain
→ to avoid writing TBs of files
→ also known as thinning
● Summarize the sample using means, histograms, credible intervals, consensus trees,
etc.
66
Uncorrected Proposal Distribution
A Robot in 3D
Example: MCMC proposed moves to
the right 80% of the time without Hastings
correction for acceptance probability!
Peak area
67
Hastings Correction
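[Figure: target distribution (posterior probability); moves to the left are proposed with probability 1/3, moves to the right with probability 2/3]
We need to decrease the chances of moving to the right by a factor of 0.5 and
increase the chances of moving to the left by a factor of 2 to compensate for
the asymmetry!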
68
Hastings Correction
R = (Pr(point2)/Pr(point1)) * (Pr(data|point2)/Pr(data|point1)) * (Q(point1|point2) / Q(point2|point1))
Prior ratio Pr(point2)/Pr(point1): for uniform priors this is 1 !
Likelihood ratio: Pr(data|point2)/Pr(data|point1)
Hastings ratio Q(point1|point2) / Q(point2|point1): if Q is symmetric,
Q(point1|point2) = Q(point2|point1) and
the Hastings ratio is 1 → we obtain the
normal Metropolis algorithm
69
Hastings Correction
More formally:
R = (f(θ*)/f(θi)) * (f(data|θ*)/f(data|θi)) * (Q(θi|θ*) / Q(θ*|θi))
Prior ratio: f(θ*)/f(θi)
Likelihood ratio: f(data|θ*)/f(data|θi)
Hastings ratio: Q(θi|θ*) / Q(θ*|θi)
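To see the Hastings ratio at work, the sketch below samples a positive parameter with an asymmetric multiplier proposal, θ* = θ · m with m = exp(tuning · (u - 0.5)) and u uniform in [0, 1). For this form of proposal the Hastings ratio equals m. The Exp(1) target, the tuning value, and all names are assumptions made for illustration, not the lecture's setup.

```python
import math
import random

def log_target(theta):
    """Unnormalized log density of an Exp(1) distribution on θ > 0 (stand-in target)."""
    return -theta

def metropolis_hastings(n_iterations, tuning=2.0, theta0=1.0, seed=1):
    rng = random.Random(seed)
    theta = theta0
    samples = []
    for _ in range(n_iterations):
        # Multiplier proposal: scale the current value by a random factor m.
        m = math.exp(tuning * (rng.random() - 0.5))
        candidate = theta * m
        # For this proposal the Hastings ratio Q(θi|θ*) / Q(θ*|θi) equals m.
        log_r = log_target(candidate) - log_target(theta) + math.log(m)
        if log_r >= 0.0 or rng.random() < math.exp(log_r):
            theta = candidate
        samples.append(theta)
    return samples

samples = metropolis_hastings(100_000)
mean = sum(samples) / len(samples)
print(mean)  # should be close to 1, the mean of Exp(1)
```

Dropping the `math.log(m)` term would make the chain sample from the wrong distribution, which is the kind of error the correction prevents.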
70
Hastings Correction is not trivial
● Problem with the equation for the Hastings correction
● M. Holder, P. Lewis, D. Swofford, B. Larget. 2005.
Hastings Ratio of the LOCAL Proposal Used in Bayesian
Phylogenetics. Systematic Biology. 54:961-965.
http://sysbio.oxfordjournals.org/content/54/6/961.full
“As part of another study, we estimated the marginal likelihoods
of trees using different proposal algorithms and discovered
repeatable discrepancies that implied that the published
Hastings ratio for a proposal mechanism used in many
Bayesian phylogenetic analyses is incorrect.”
● Incorrect Hastings ratio used from 1999-2005
71
Back to Phylogenetics
[Figure: a sample of five trees on the taxa A, B, C, D, E]
What's the posterior probability of bipartition AB|CDE?
72
Back to Phylogenetics
[Figure: a sample of five trees on the taxa A, B, C, D, E]
What's the posterior probability of bipartition AB|CDE?
We just count from the sample generated by MCMC, here it's 3/5 → 0.6
This approximates the true proportion (posterior probability) of bipartition AB|CDE
if we have run the chain long enough and if it has converged
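Counting a bipartition in an MCMC sample is a simple frequency computation. In the sketch below each sampled tree is represented as the set of its non-trivial bipartitions; this representation and the five trees are hypothetical, chosen so that AB|CDE occurs in 3 of 5 trees as in the slide.

```python
def has_bipartition(tree_bipartitions, clade, taxa):
    """True if the tree contains the split clade | (taxa - clade)."""
    clade = frozenset(clade)
    complement = frozenset(taxa) - clade
    return clade in tree_bipartitions or complement in tree_bipartitions

taxa = "ABCDE"
# Hypothetical MCMC sample: each tree is stored as the set of its non-trivial
# bipartitions, each written as the smaller side of the split.
sample = [
    {frozenset("AB"), frozenset("DE")},
    {frozenset("AB"), frozenset("CD")},
    {frozenset("AC"), frozenset("DE")},
    {frozenset("AB"), frozenset("CE")},
    {frozenset("AE"), frozenset("CD")},
]

support = sum(has_bipartition(t, "AB", taxa) for t in sample) / len(sample)
print(support)  # 0.6
```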
73
MCMC in practice
[Figure: frequency of AB|CDE (vertical axis) over generations (horizontal axis); the chain starts from a random starting point and then reaches convergence]
Burn-in → discarded from our final sample
74
Convergence
● How many samples do we need to draw to obtain an accurate
approximation?
● When can we stop drawing samples?
● Methods for convergence diagnosis
→ we can never say that an MCMC chain has converged
→ we can only diagnose that it has not converged
→ there is a plethora of tools for convergence diagnostics for
phylogenetic MCMC
75
Convergence
[Figure: likelihood score output of the MCMC method over the entire landscape; zoom in on an area of apparent convergence]
76
Solution: Run Multiple Chains
Robot 1
Robot 2
77
Outline for today
● Bayesian statistics
● Monte-Carlo simulation & integration
● Markov-Chain Monte-Carlo methods
● Metropolis-coupled MCMC-methods
78
Heated versus Cold Chains
[Figure: Robot 1 and Robot 2 on the same landscape]
Cold chain: sees the landscape as is
Hot chain: sees a flatter version of the same landscape → moves more easily
between peaks
79
Known as MCMCMC
● Metropolis-Coupled Markov-Chain Monte Carlo
● Run several chains simultaneously
● 1 cold chain (the one that samples)
● Several heated chains
● Heated chain robots explore the parameter space in larger
steps
● To flatten the landscape the acceptance ratio R is modified as
follows: R^(1/(1+H)) where H is the so-called temperature
– For the cold chain H := 0.0
– Setting the temperature for the hot chains is a bit of voodoo
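The effect of heating on the acceptance ratio can be illustrated numerically. This sketch only applies the formula R^(1/(1+H)) to the 1/8 downhill step from the robot example; the function name and the temperature values are arbitrary illustrative choices.

```python
def heated_ratio(r, temperature):
    """Acceptance ratio of a heated chain: R^(1 / (1 + H))."""
    return r ** (1.0 / (1.0 + temperature))

r = 0.125  # the 8 m → 1 m downhill step from the robot example
for h in (0.0, 0.5, 1.0, 3.0):
    print(h, round(heated_ratio(r, h), 3))
```

The higher the temperature, the closer the ratio of a downhill step moves towards 1, so a hot chain accepts steep downhill moves that the cold chain would almost always reject.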
80
Heated versus Cold Chains
Robot 1: cold
Robot 2: hot
Exchange information every now and then
81
Heated versus Cold Chains
Robot 1: hot
Robot 2: cold
Swap cold/hot states to better sample
this nice peak here
82
Heated versus Cold Chains
Robot 1: hot
Robot 2: cold
Decision on when to swap is a bit more
complicated!
83
Heated versus Cold Chains
Robot 1: hot
Robot 2: cold
Only the cold robot actually emits states (writes samples to file)
84
A few words about priors
● Prior probabilities convey the scientist's beliefs, before having
seen the data
● Using uninformative prior probability distributions (e.g., uniform
priors, also called flat priors)
→ differences between prior and posterior distribution are
attributable to likelihood differences
● Priors can bias an analysis
● For instance, we could choose an arbitrary prior distribution for
branch lengths in the range [1.0,20.0]
→ what happens if branch lengths are much shorter?
85
Some Phylogenetic Proposal
Mechanisms
● Branch Lengths
● Sliding Window Proposal
● Multiplier Proposal
● Topologies
● Local Proposal (the one with the bug in the Hastings ratio)
● Extending TBR (Tree Bisection Reconnection) Proposal
● Remember: We need to design proposals for which
● We either don't need to calculate the Hastings ratio
● Or for which we can calculate it
● That have a 'good' acceptance rate
→ all sorts of tricks being used, e.g., parsimony-biased topological proposals

More Related Content

Similar to lecture12.pdf Introduction to bioinformatics (20)

PPT
Interpreting ‘tree space’ in the context of very large empirical datasets
Joe Parker
 
PPT
Data mining maximumlikelihood
Harry Potter
 
PPT
Data mining maximumlikelihood
Young Alista
 
PPT
Data mining maximumlikelihood
Luis Goldster
 
PPT
Data mining maximumlikelihood
Tony Nguyen
 
PPT
Data mining maximumlikelihood
James Wong
 
PPT
Data miningmaximumlikelihood
Fraboni Ec
 
PPT
Data mining maximumlikelihood
Hoang Nguyen
 
PDF
The bayesian revolution in genetics
Beat Winehouse
 
PPT
Bayes Theorem - Probability and Statistics
EMALLIKARJUNAREDDY
 
PDF
joaks-evolution-2014
Jamie Oaks
 
PDF
Bayesian Classics
Julyan Arbel
 
PDF
San Antonio short course, March 2010
Christian Robert
 
PDF
Course on Bayesian computational methods
Christian Robert
 
PDF
Probably, Definitely, Maybe
James McGivern
 
PDF
MrBayes_intro_big4ws_2016-10-10
FredrikRonquist
 
PDF
Approximate Bayesian model choice via random forests
Christian Robert
 
PPTX
Lec13_Bayes.pptx
KhushiDuttVatsa
 
ODP
Gentle Introduction: Bayesian Modelling and Probabilistic Programming in R
Marco Wirthlin
 
PPTX
Bayesian statistics for biologists and ecologists
Masahiro Ryo. Ph.D.
 
Interpreting ‘tree space’ in the context of very large empirical datasets
Joe Parker
 
Data mining maximumlikelihood
Harry Potter
 
Data mining maximumlikelihood
Young Alista
 
Data mining maximumlikelihood
Luis Goldster
 
Data mining maximumlikelihood
Tony Nguyen
 
Data mining maximumlikelihood
James Wong
 
Data miningmaximumlikelihood
Fraboni Ec
 
Data mining maximumlikelihood
Hoang Nguyen
 
The bayesian revolution in genetics
Beat Winehouse
 
Bayes Theorem - Probability and Statistics
EMALLIKARJUNAREDDY
 
joaks-evolution-2014
Jamie Oaks
 
Bayesian Classics
Julyan Arbel
 
San Antonio short course, March 2010
Christian Robert
 
Course on Bayesian computational methods
Christian Robert
 
Probably, Definitely, Maybe
James McGivern
 
MrBayes_intro_big4ws_2016-10-10
FredrikRonquist
 
Approximate Bayesian model choice via random forests
Christian Robert
 
Lec13_Bayes.pptx
KhushiDuttVatsa
 
Gentle Introduction: Bayesian Modelling and Probabilistic Programming in R
Marco Wirthlin
 
Bayesian statistics for biologists and ecologists
Masahiro Ryo. Ph.D.
 

Recently uploaded (20)

PPTX
apidays Munich 2025 - Agentic AI: A Friend or Foe?, Merja Kajava (Aavista Oy)
apidays
 
PPTX
Introduction to Artificial Intelligence.pptx
StarToon1
 
PDF
The X-Press God-WPS Office.pdf hdhdhdhdhd
ramifatoh4
 
PPT
dsaaaaaaaaaaaaaaaaaaaaaaaaaaaaaasassas2.ppt
UzairAfzal13
 
PDF
Dr. Robert Krug - Chief Data Scientist At DataInnovate Solutions
Dr. Robert Krug
 
PPTX
This PowerPoint presentation titled "Data Visualization: Turning Data into In...
HemaDivyaKantamaneni
 
PPTX
apidays Munich 2025 - GraphQL 101: I won't REST, until you GraphQL, Surbhi Si...
apidays
 
PDF
apidays Munich 2025 - Automating Operations Without Reinventing the Wheel, Ma...
apidays
 
PPTX
materials that are required to used.pptx
drkaran1421
 
PPTX
sampling-connect.MC Graw Hill- Chapter 6
nohabakr6
 
PPTX
Lecture_9_EPROM_Flash univeristy lecture fall 2022
ssuser5047c5
 
PDF
T2_01 Apuntes La Materia.pdfxxxxxxxxxxxxxxxxxxxxxxxxxxxxxskksk
mathiasdasilvabarcia
 
PDF
MusicVideoProjectRubric Animation production music video.pdf
ALBERTIANCASUGA
 
PPTX
UPS Case Study - Group 5 with example and implementation .pptx
yasserabdelwahab6
 
PDF
Responsibilities of a Certified Data Engineer | IABAC
Seenivasan
 
PPTX
apidays Munich 2025 - Federated API Management and Governance, Vince Baker (D...
apidays
 
PPTX
GEN CHEM ACCURACY AND PRECISION eme.pptx
yeagere932
 
PPTX
Data Analysis for Business - make informed decisions, optimize performance, a...
Slidescope
 
PPTX
Pre-Interrogation_Assessment_Presentation.pptx
anjukumari94314
 
PDF
How to Avoid 7 Costly Mainframe Migration Mistakes
JP Infra Pvt Ltd
 
apidays Munich 2025 - Agentic AI: A Friend or Foe?, Merja Kajava (Aavista Oy)
apidays
 
Introduction to Artificial Intelligence.pptx
StarToon1
 
The X-Press God-WPS Office.pdf hdhdhdhdhd
ramifatoh4
 
dsaaaaaaaaaaaaaaaaaaaaaaaaaaaaaasassas2.ppt
UzairAfzal13
 
Dr. Robert Krug - Chief Data Scientist At DataInnovate Solutions
Dr. Robert Krug
 
This PowerPoint presentation titled "Data Visualization: Turning Data into In...
HemaDivyaKantamaneni
 
apidays Munich 2025 - GraphQL 101: I won't REST, until you GraphQL, Surbhi Si...
apidays
 
apidays Munich 2025 - Automating Operations Without Reinventing the Wheel, Ma...
apidays
 
materials that are required to used.pptx
drkaran1421
 
sampling-connect.MC Graw Hill- Chapter 6
nohabakr6
 
Lecture_9_EPROM_Flash univeristy lecture fall 2022
ssuser5047c5
 
T2_01 Apuntes La Materia.pdfxxxxxxxxxxxxxxxxxxxxxxxxxxxxxskksk
mathiasdasilvabarcia
 
MusicVideoProjectRubric Animation production music video.pdf
ALBERTIANCASUGA
 
UPS Case Study - Group 5 with example and implementation .pptx
yasserabdelwahab6
 
Responsibilities of a Certified Data Engineer | IABAC
Seenivasan
 
apidays Munich 2025 - Federated API Management and Governance, Vince Baker (D...
apidays
 
GEN CHEM ACCURACY AND PRECISION eme.pptx
yeagere932
 
Data Analysis for Business - make informed decisions, optimize performance, a...
Slidescope
 
Pre-Interrogation_Assessment_Presentation.pptx
anjukumari94314
 
How to Avoid 7 Costly Mainframe Migration Mistakes
JP Infra Pvt Ltd
 
Ad

lecture12.pdf Introduction to bioinformatics

  • 1. 1 Introduction to Bioinformatics for Computer Scientists Lecture 12
  • 2. 2 Exam ● Exam days: ● Feb 10 only for those who can't make the dates in April! ● April 22, 23, and 24 ● I just sent around a doodle ● Also register for exam via campus.kit.edu !
  • 3. 3 Plan for next lectures ● Today ● Bayesian statistics & (MC)MCMC methods ● Advanced MCMC ● Population genetics ● Course & Exam review
  • 4. 4 Outline for today ● Bayesian statistics ● Monte-Carlo simulations ● Markov-Chain Monte-Carlo (MCMC) methods ● Metropolis-coupled MCMC-methods ● Course beers tonight ! 19:00 at Vogelbräu Karlsruhe
  • 5. 5 Bayesian and Maximum Likelihood Inference ● In phylogenetics Bayesian and ML (Maximum Likelihood) methods have a lot in common ● Computationally, both approaches re-evaluate the phylogenetic likelihood over and over and over again for different tree topologies, branch lengths, and model parameters ● Bayesian and ML codes spend approx. 80-95% of their total run time in likelihood calculations on trees ● Bayesian methods sample the posterior probability distribution ● ML methods strive to find a point estimate that maximizes the likelihood
  • 6. 6 Bayesian Phylogenetic Methods ● The methods used perform stochastic searches, that is, they do not strive to maximize the likelihood, but rather integrate over it ● Thus, no numerical optimization methods for model parameters and branch lengths are needed, parameters are proposed at random ● It is substantially easier to infer trees under complex models using Bayesian statistics than using Maximum Likelihood
  • 7. 7 A Review of Probabilities brown blonde Σ light 5/40 15/40 20/40 dark 15/40 5/40 20/40 Σ 20/40 20/40 40/40 Hair color Eye color
  • 8. 8 A Review of Probabilities brown blonde Σ light 5/40 15/40 20/40 dark 15/40 5/40 20/40 Σ 20/40 20/40 40/40 Hair color Eye color Joint probability: probability of observing both A and B: Pr(A,B) For instance, Pr(brown, light) = 5/40 = 0.125
  • 9. 9 A Review of Probabilities brown blonde Σ light 5/40 15/40 20/40 dark 15/40 5/40 20/40 Σ 20/40 20/40 40/40 Hair color Eye color Marginal Probability: unconditional probability of an observation Pr(A) For instance, Pr(dark) = Pr(dark,brown) + Pr(dark,blonde) = 15/40 + 5/40 = 20/40 = 0.5 Marginalize over hair color
  • 10. 10 A Review of Probabilities brown blonde Σ light 5/40 15/40 20/40 dark 15/40 5/40 20/40 Σ 20/40 20/40 40/40 Hair color Eye color Conditional Probability: The probability of observing A given that B has occurred: Pr(A|B) is the fraction of cases Pr(B) in which B occurs where A also occurs with Pr(AB) Pr(A|B) = Pr(AB) / Pr(B) For instance, Pr(blonde|light) = Pr(blonde,light) / Pr(light) = (15/40) / (20/40) = 0.75
  • 11. 11 A Review of Probabilities Statistical independence: two events A and B are independent if their joint probability Pr(A,B) equals the product of their marginal probabilities Pr(A) Pr(B) For instance, Pr(light,brown) = 5/40 ≠ Pr(light) Pr(brown) = 20/40 * 20/40 = 10/40, that is, the events are not independent!
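The joint, marginal, and conditional probabilities from the hair/eye color table can be recomputed in a few lines. This is a minimal sketch; the dictionary layout and the function names are illustrative assumptions, only the numbers come from the slides.

```python
from fractions import Fraction

# Joint probabilities Pr(eye, hair) from the 40-person table on the slides.
joint = {
    ("light", "brown"): Fraction(5, 40),
    ("light", "blonde"): Fraction(15, 40),
    ("dark", "brown"): Fraction(15, 40),
    ("dark", "blonde"): Fraction(5, 40),
}

def marginal_eye(eye):
    """Pr(eye): sum the joint probabilities over all hair colors."""
    return sum(p for (e, _), p in joint.items() if e == eye)

def marginal_hair(hair):
    """Pr(hair): sum the joint probabilities over all eye colors."""
    return sum(p for (_, h), p in joint.items() if h == hair)

def hair_given_eye(hair, eye):
    """Conditional probability Pr(hair | eye) = Pr(eye, hair) / Pr(eye)."""
    return joint[(eye, hair)] / marginal_eye(eye)

def independent(eye, hair):
    """True if Pr(eye, hair) equals Pr(eye) * Pr(hair)."""
    return joint[(eye, hair)] == marginal_eye(eye) * marginal_hair(hair)

print(marginal_eye("dark"))               # marginal probability of dark eyes
print(hair_given_eye("blonde", "light"))  # conditional probability
print(independent("light", "brown"))      # independence check
```

Exact fractions are used here so that the independence check is not disturbed by floating-point rounding.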
  • 12. 12 A Review of Probabilities Conditional Probability: Pr(A|B) = Pr(A,B) / Pr(B) Joint Probability: Pr(A,B) = Pr(A|B) Pr(B) and Pr(A,B) = Pr(B|A) Pr(A) Problem: If I can compute Pr(A|B) how can I get Pr(B|A)?
  • 13. 13 A Review of Probabilities Conditional Probability: Pr(A|B) = Pr(A,B) / Pr(B) Joint Probability: Pr(A,B) = Pr(A|B) Pr(B) and Pr(A,B) = Pr(B|A) Pr(A) Bayes Theorem: Pr(B|A) = Pr(A,B) / Pr(A)
  • 14. 14 A Review of Probabilities Conditional Probability: Pr(A|B) = Pr(A,B) / Pr(B) Joint Probability: Pr(A,B) = Pr(A|B) Pr(B) and Pr(A,B) = Pr(B|A) Pr(A) Bayes Theorem: Pr(B|A) = Pr(A|B) Pr(B) / Pr(A)
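As a worked check of Bayes' theorem on the hair/eye color table, one can recover Pr(light|blonde) from Pr(blonde|light). The choice A = blonde, B = light is only an illustration and is not taken from the slides.

```latex
\begin{align*}
\Pr(\text{light}\mid\text{blonde})
  &= \frac{\Pr(\text{blonde}\mid\text{light})\,\Pr(\text{light})}{\Pr(\text{blonde})}
   = \frac{0.75 \cdot 20/40}{20/40} = 0.75 \\
\text{direct check:}\quad
\frac{\Pr(\text{light},\text{blonde})}{\Pr(\text{blonde})}
  &= \frac{15/40}{20/40} = 0.75
\end{align*}
```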
  • 15. 15 Bayes Theorem Pr(B|A) = Pr(A|B) Pr(B) / Pr(A) where A is the observed outcome and B is the unobserved outcome
  • 16. 16 Bayes Theorem Pr(B|A) = Pr(A|B) Pr(B) / Pr(A) Posterior probability: Pr(B|A) Likelihood: Pr(A|B) Prior probability: Pr(B) Marginal probability: Pr(A)
  • 17. 17 Bayes Theorem: Phylogenetics Pr(Tree,Params|Alignment) = Pr(Alignment|Tree,Params) Pr(Tree,Params) / Pr(Alignment) Posterior probability Pr(Tree,Params|Alignment): distribution over all possible trees and all model parameter values Likelihood Pr(Alignment|Tree,Params): how well does the alignment fit the tree and model parameters? Prior probability Pr(Tree,Params): introduces prior knowledge/assumptions about the probability distribution of trees and model parameters (e.g., GTR rates, α shape parameter). For instance, we typically assume that all possible tree topologies are equally probable → uniform prior Marginal probability Pr(Alignment): how do we obtain this?
  • 18. 18 Bayes Theorem: Phylogenetics Pr(Tree|Alignment) = Pr(Alignment|Tree) Pr(Tree) / Pr(Alignment) Marginal probability: assume that our only model parameter is the tree. Marginalizing means summing the joint probabilities over all trees, thus Pr(Alignment) can be written as Pr(Alignment) = Pr(Alignment,t0) + Pr(Alignment,t1) + … + Pr(Alignment,tn) where n+1 is the number of possible trees!
  • 19. 19 Bayes Theorem: Phylogenetics Using Pr(A,B) = Pr(A|B) Pr(B), the marginal probability can be re-written as Pr(Alignment) = Pr(Alignment|t0) Pr(t0) + Pr(Alignment|t1) Pr(t1) + … + Pr(Alignment|tn) Pr(tn)
  • 20. 20 Bayes Theorem: Phylogenetics Pr(Tree|Alignment) = Pr(Alignment|Tree) Pr(Tree) / Pr(Alignment) Marginal probability: Pr(Alignment) = Pr(Alignment|t0) Pr(t0) + Pr(Alignment|t1) Pr(t1) + … + Pr(Alignment|tn) Pr(tn) Here each Pr(Alignment|ti) is a likelihood and each prior is Pr(ti) := 1 / (n+1) → this is a uniform prior! Now we have all the ingredients for computing Pr(Tree|Alignment); however, computing Pr(Alignment) is prohibitive due to the large number of trees! With continuous parameters, the above sum for the marginal probability becomes an integral. Usually, all parameters we integrate over (tree topology, model parameters, etc.) are lumped into a parameter vector denoted by θ
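For a handful of trees the marginal probability can still be summed explicitly. The following minimal sketch assumes a uniform prior over the trees; the three likelihood values are made up for illustration and do not come from the lecture.

```python
def posterior_over_trees(likelihoods):
    """Posterior Pr(t_i | Alignment) for a discrete set of trees, uniform prior."""
    prior = 1.0 / len(likelihoods)  # Pr(t_i) = 1 / (n + 1)
    # Marginal probability: sum of Pr(Alignment | t_i) * Pr(t_i) over all trees.
    marginal = sum(lik * prior for lik in likelihoods)
    return [lik * prior / marginal for lik in likelihoods]

# Hypothetical likelihoods Pr(Alignment | t_i) for three tree topologies.
likelihoods = [2e-5, 6e-5, 2e-5]
posterior = posterior_over_trees(likelihoods)
print(posterior)
```

With a uniform prior the posterior is simply the normalized likelihood; the explicit sum stops being feasible once the number of trees becomes large.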
  • 21. 21 Bayes Theorem General Form f(θ|A) = f(A|θ) f(θ) / ∫f(θ)f(A|θ)dθ Posterior distribution (posterior probability): f(θ|A) Likelihood: f(A|θ) Prior distribution (prior probability): f(θ) Marginal likelihood (normalization constant): ∫f(θ)f(A|θ)dθ We know how to compute f(A|θ) → the likelihood of the tree Problems: Problem 1: f(θ) is given a priori, but how do we choose an appropriate distribution? → biggest strength and weakness of Bayesian approaches Problem 2: How can we calculate/approximate ∫f(θ)f(A|θ)dθ? → to explain this we need to introduce additional machinery However, let us first look at an example for f(θ|A) in phylogenetics
  • 22. 22 Bayes Theorem General Form f(θ|A) = f(A|θ) f(θ) / ∫f(θ)f(A|θ)dθ Note that in the continuous case f() is called a probability density function
  • 23. 23 Probability Density Function Properties: 1. f(x) ≥ 0 for all allowed values x 2. The area under f(x) is 1.0 3. The probability that x falls into an interval (e.g. 0.2 – 0.3) is given by the integral of f(x) over this interval
  • 25. 25 An Example [Figure: prior distribution and posterior distribution over the parameter space (3 distinct tree topologies); y-axis: probability, up to 1.0; the data (observations → sequences) lead from the prior to the posterior]
  • 26. 26 An Example [Figure: prior distribution with probability 1/3 for each of the 3 distinct tree topologies, next to the posterior distribution] Note that this is a discrete distribution, since we only consider the trees as parameters!
  • 27. 27 An Example [Figure: prior distribution with probability 1/3 for each tree; the posterior distribution is marked "?"] What happens to the posterior probability if we don't have enough data, e.g., an alignment with a single site?
  • 28. 28 An Example Include additional model parameters such as branch lengths, GTR rates, and the α-shape parameter of the Γ distribution into the model: θ = (tree, α, branch-lengths, GTR-rates) [Figure: posterior density f(θ|A) over the parameter space of θ, with regions for tree 1, tree 2, and tree 3]
  • 29. 29 An Example Marginal probability distribution of trees We can look at this distribution for any parameter of interest by marginalizing (integrating out) all other parameters. Here we focus on the tree topology. [Figure: posterior density f(θ|A) with marginal posterior probabilities tree 1: 20%, tree 2: 48%, tree 3: 32%]
  • 30. 30 An Example Marginal probability distribution of trees We obtain the probability of each tree by integrating f(θ|A) over its interval of the parameter space!
  • 31. 31 Marginalization Joint probabilities of trees (columns) and three discrete values of the α-shape parameter (rows). The last column holds the marginal probabilities of the α values, the last row the marginal probabilities of the trees:
                  t1      t2      t3      Σ
      α1 = 0.5    0.10    0.07    0.12    0.29
      α2 = 1.0    0.05    0.22    0.06    0.33
      α3 = 5.0    0.05    0.19    0.14    0.38
      Σ           0.20    0.48    0.32    1.0
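Marginalizing the joint table amounts to summing its rows and columns. The sketch below reproduces the margins from the slide; the dictionary layout is an illustrative assumption.

```python
# Joint posterior probabilities from the slide: rows are α values,
# columns are the trees t1, t2, t3.
joint = {
    0.5: [0.10, 0.07, 0.12],
    1.0: [0.05, 0.22, 0.06],
    5.0: [0.05, 0.19, 0.14],
}

# Marginal probabilities of the trees: sum each column over all α values.
tree_marginals = [round(sum(column), 2) for column in zip(*joint.values())]

# Marginal probabilities of the α values: sum each row over all trees.
alpha_marginals = {alpha: round(sum(row), 2) for alpha, row in joint.items()}

print(tree_marginals)
print(alpha_marginals)
```

Rounding to two decimals only hides floating-point noise; the table entries themselves are given with two decimals.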
  • 32. 32 An Example Marginal probability distribution of α We can look at this distribution for any parameter of interest by marginalizing (integrating out) all other parameters. Here we focus on the three discrete α values. [Figure: marginal posterior probabilities α = 0.5: 29%, α = 1.0: 33%, α = 5.0: 38%]
  • 33. 33 Bayes versus Likelihood ML: Joint estimation Bayesian: Marginal estimation See: Holder & Lewis “Phylogeny Estimation: traditional & Bayesian Approaches”
  • 34. 34 Outline for today ● Bayesian statistics ● Monte-Carlo simulation & integration ● Markov-Chain Monte-Carlo methods ● Metropolis-coupled MCMC-methods
  • 35. 35 Bayes Theorem General Form f(θ|A) = (likelihood * prior) / ouch The denominator is the marginal likelihood (normalization constant) → difficult to calculate We know how to compute f(A|θ) → the likelihood of the tree Problems: Problem 1: f(θ) is given a priori, but how do we choose an appropriate distribution? → biggest strength and weakness of Bayesian approaches Problem 2: How can we calculate/approximate ∫f(θ)f(A|θ)dθ? → to explain this we need to introduce additional machinery to design methods for numerical integration
  • 36. 36 How can we compute this integral? Parameter space of θ f(θ|A)
  • 37. 37 The Classic Example ● Calculating π (the geometric constant!) with Monte-Carlo Procedure: 1. Randomly throw points onto the square n times 2. Count how many points fall into the inscribed circle: ni 3. Determine π as 4 * ni / n → ni / n approximates the ratio of the areas of the circle and the square, which is π/4
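The procedure can be sketched as follows, assuming a unit circle inscribed in the square [-1, 1] × [-1, 1] so that the fraction of hits approximates π/4. The function name and the fixed seed are illustrative choices.

```python
import random

def estimate_pi(n_points, seed=0):
    """Monte Carlo estimate of π from random points in the square [-1, 1]^2."""
    rng = random.Random(seed)  # fixed seed only to make the run repeatable
    inside = 0
    for _ in range(n_points):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:  # the point fell into the unit circle
            inside += 1
    # inside / n_points approximates (circle area) / (square area) = π / 4
    return 4.0 * inside / n_points

print(estimate_pi(100_000))
```

The error shrinks only with the square root of the number of points, which is why plain random sampling becomes expensive when high accuracy is needed.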
  • 38. 38 Monte Carlo Integration ● Method for numerical integration of m-dimensional integrals: ∫f(θ)dθ ≈ 1/N Σ f(θi) where the θi are sampled from the domain R^m ● More precisely, if the integral is defined over a domain of volume V, the approximation becomes: V * 1/N * Σ f(θi) ● Key issues: ● Monte Carlo simulations draw the samples θi at which f() is evaluated completely at random → random grid ● How many points do we need to sample for a 'good' approximation? ● Domain R^m might be too large for random sampling!
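A one-dimensional version of the estimator V * 1/N * Σ f(θi) can be sketched as below. The integrand x² on [0, 2] is only a hypothetical test case whose exact integral is 8/3.

```python
import random

def mc_integrate(f, lower, upper, n_samples, seed=0):
    """Approximate the integral of f over [lower, upper] by uniform sampling."""
    rng = random.Random(seed)
    volume = upper - lower  # V: the "volume" of the 1-D domain
    total = sum(f(rng.uniform(lower, upper)) for _ in range(n_samples))
    return volume * total / n_samples  # V * 1/N * Σ f(θ_i)

approximation = mc_integrate(lambda x: x * x, 0.0, 2.0, 200_000)
print(approximation)  # the exact value of the integral is 8/3
```

In m dimensions the same code would draw m coordinates per sample; the difficulty is that most uniformly drawn samples then land in regions where f contributes almost nothing.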
  • 39. 39 Outline for today ● Bayesian statistics ● Monte-Carlo simulation & integration ● Markov-Chain Monte-Carlo methods ● Metropolis-coupled MCMC-methods
  • 40. 40 How can we compute this integral? Parameter space of θ f(θ|A) Monte-Carlo Methods: randomly sample data-points in this huge parameter space to approximate the integral
  • 41. 41 How can we compute this integral? Parameter space of θ f(θ|A) In which parts of the distribution are we interested? Posterior probability
  • 42. 42 Distribution Landscape Parameter space of θ f(θ|A) Posterior probability In which parts of the distribution are we interested? Areas of high posterior probability
  • 43. 43 Distribution Landscape Parameter space of θ f(θ|A) Posterior probability In which parts of the distribution are we interested? How can we get a sample faster?
  • 44. 44 Distribution Landscape Parameter space of θ f(θ|A) Posterior probability In which parts of the distribution are we interested? How can we get a sample faster? → Markov Chain Monte Carlo Methods
  • 45. 45 Distribution Landscape Parameter space of θ f(θ|A) Posterior probability In which parts of the distribution are we interested? Higher sample density Higher sample density
  • 46. 46 Distribution Landscape Parameter space of θ f(θ|A) Posterior probability In which parts of the distribution are we interested? Higher sample density Higher sample density Fewer misses
  • 47. 47 Markov-Chain Monte-Carlo Parameter space of θ f(θ|A) Posterior probability Higher sample density Higher sample density MCMC → biased random walks: the probability to evaluate/find a sample in an area with high posterior probability is proportional to the posterior distribution
  • 48. 48 Markov-Chain Monte-Carlo ● Idea: Move the grid/samples into regions of high probability ● Construct a Markov Chain that generates samples such that more time is spent (more samples are evaluated) in the most interesting regions of the state space ● MCMC can also be used for hard CS optimization problems, for instance, the knapsack problem ● Note that, MCMC is similar to Simulated Annealing → there's no time to go into the details though here!
  • 50. 50 The Robot Metaphor ● Drop a robot onto an unknown planet to explore its landscape ● Teaching idea and slides adapted from Paul O. Lewis Uphill steps → always accepted Small downhill steps → usually accepted Huge downhill steps → almost never accepted elevation
  • 51. 51 How to accept/reject proposals ● Decision to accept/reject a proposal to go from Point 1 → Point 2 is based on the ratio R of posterior densities of the two points/samples R = Pr(point2|data) / Pr(point1|data) = (Pr(point2)Pr(data|point2) / Pr(data)) / (Pr(point1)Pr(data|point1) / Pr(data)) = (Pr(point2)Pr(data|point2)) / (Pr(point1)Pr(data|point1))
  • 52. 52 How to accept/reject proposals ● Decision to accept/reject a proposal to go from Point 1 → Point 2 is based on the ratio R of posterior densities of the two points/samples R = Pr(point2|data) / Pr(point1|data) = (Pr(point2)Pr(data|point2) / Pr(data)) / (Pr(point1)Pr(data|point1) / Pr(data)) = (Pr(point2)Pr(data|point2)) / (Pr(point1)Pr(data|point1)) The marginal probability of the data cancels out! Phew, we don't need to compute it.
  • 53. 53 How to accept/reject proposals ● Decision to accept/reject a proposal to go from Point 1 → Point 2 is based on the ratio R of posterior densities of the two points/samples R = Pr(point2|data) / Pr(point1|data) = (Pr(point2)Pr(data|point2) / Pr(data)) / (Pr(point1)Pr(data|point1) / Pr(data)) = (Pr(point2)/Pr(point1)) * (Pr(data|point2)/Pr(data|point1)) The first factor is the prior ratio: for uniform priors this is 1! The second factor is the likelihood ratio
  • 54. 54 The Robot Metaphor ● Drop a robot onto an unknown planet to explore its landscape ● At 1 m, proposed to go up to 2 m: ratio 2/1 = 2 → always accept ● At 10 m, proposed to go down to 9 m: ratio 9/10 = 0.9 → accept with probability 90% ● At 8 m, proposed to go down to 1 m: ratio 1/8 = 0.125 → accept with probability 12.5% (y-axis of the figure: elevation)
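The robot's decision rule can be written down directly, treating elevation as an unnormalized posterior density. The function names are illustrative assumptions.

```python
import random

def acceptance_probability(current_height, proposed_height):
    """Probability of accepting a step, given the ratio of the two heights."""
    ratio = proposed_height / current_height
    return min(1.0, ratio)  # uphill steps (ratio >= 1) are always accepted

def accept(current_height, proposed_height, rng):
    """Randomly accept or reject a proposed step."""
    return rng.random() < acceptance_probability(current_height, proposed_height)

rng = random.Random(0)
for current, proposed in [(1.0, 2.0), (10.0, 9.0), (8.0, 1.0)]:
    probability = acceptance_probability(current, proposed)
    print(current, "->", proposed, "accept with probability", probability)
print(accept(1.0, 2.0, rng))  # an uphill step is always accepted
```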
  • 55. 55 Distributions ● The target distribution is the posterior distribution we are trying to sample (integrate over)! ● The proposal distribution decides which point (how far/close) in the landscape to randomly go to/try next: → The choice has an effect on the efficiency of the MCMC algorithm, that is, how fast it will get to these interesting areas we want to sample
  • 56. 56 The Robot Metaphor Target distribution/ posterior probability Proposal distribution: how far Left or right will we usually go?
  • 57. 57 The Robot Metaphor Target distribution/ posterior probability Proposal distribution: with smaller variance → what happens? Pros: Seldom refuses a step Cons: smaller steps, more steps required for exploration
  • 58. 58 The Robot Metaphor Target distribution/ posterior probability Proposal distribution: with larger variance → What happens?
  • 59. 59 The Robot Metaphor Target distribution/ posterior probability Proposal distribution: with larger variance → what happens? Pros: can cover a large area quickly Cons: lots of steps will be rejected
  • 60. 60 The Robot Metaphor Target distribution/ posterior probability A proposal distribution that balances pros & cons yields 'good mixing'
  • 61. 61 Mixing ● A well-designed chain will require only a few steps until reaching convergence, that is, until it approximates the underlying probability density function 'well-enough' from a random starting point ● Mixing is a somewhat fuzzy term; it refers to the proportion of accepted proposals (acceptance ratio) generated by a proposal mechanism → should be neither too low, nor too high ● The real art in designing MCMC methods consists of ● building & tuning good proposal mechanisms ● selecting appropriate proposal distributions ● such that they quickly approximate the distribution we want to sample from
  • 62. 62 The Robot Metaphor Target distribution/ posterior probability When the proposal distribution is symmetric, that is, the probability of moving left or right is the same, we use the Metropolis algorithm
  • 63. 63 The Metropolis Algorithm ● Metropolis et al. 1953 http://www.aliquote.org/pub/metropolis-et-al-1953.pdf ● Initialization: Choose an arbitrary point θ0 as first sample ● Choose an arbitrary probability density Q(θi+1|θi) which suggests a candidate for the next sample θi+1 given the previous sample θi ● For the Metropolis algorithm, Q() must be symmetric: it must satisfy Q(θi+1|θi) = Q(θi|θi+1) ● For each iteration i: ● Generate a candidate θ* for the next sample by picking from the distribution Q(θ*|θi) ● Calculate the acceptance ratio R = (Pr(θ*)Pr(data|θ*)) / (Pr(θi)Pr(data|θi)) – If R ≥ 1, then θ* is more likely than θi → automatically accept the candidate by setting θi+1 := θ* – Otherwise, accept the candidate θ* with probability R → if the candidate is rejected: θi+1 := θi
  • 64. 64 The Metropolis Algorithm Regarding the proposal density Q(θi+1|θi): conceptually this is the same Q we saw for substitution models!
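The Metropolis steps can be sketched for a one-dimensional toy problem. The assumptions here are not from the lecture: the target is an unnormalized standard normal density standing in for prior times likelihood, the symmetric proposal is a uniform window around the current point, and all names are illustrative.

```python
import math
import random

def metropolis(unnormalized_density, start, n_steps, step_size=2.0, seed=0):
    """Sample from a 1-D density known only up to a normalization constant."""
    rng = random.Random(seed)
    current = start
    samples = []
    for _ in range(n_steps):
        # Symmetric proposal: uniform window centered on the current point.
        candidate = current + rng.uniform(-step_size, step_size)
        ratio = unnormalized_density(candidate) / unnormalized_density(current)
        # Accept if R >= 1, otherwise accept with probability R.
        if ratio >= 1.0 or rng.random() < ratio:
            current = candidate
        samples.append(current)  # a rejected candidate repeats the old point
    return samples

def target(x):
    """Unnormalized standard normal density (hypothetical target)."""
    return math.exp(-0.5 * x * x)

samples = metropolis(target, start=3.0, n_steps=50_000)
kept = samples[5_000:]  # discard the first samples as burn-in
mean = sum(kept) / len(kept)
variance = sum((x - mean) ** 2 for x in kept) / len(kept)
print(mean, variance)
```

Because only the ratio of densities enters the decision, the normalization constant never has to be computed, which is exactly the point made on the accept/reject slides.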
  • 65. 65 Phylogenetic Metropolis Algorithm ● Initialization: Choose a random tree with random branch lengths as first sample ● For each iteration i: ● Propose either – a new tree topology – a new branch length and re-calculate the likelihood ● Calculate the acceptance ratio of the proposal ● Accept the new tree/branch length or reject it ● Print current tree with branch lengths to file only every k (e.g. 1000) iterations → to generate a sample from the chain → to avoid writing TBs of files → also known as thinning ● Summarize the sample using means, histograms, credible intervals, consensus trees, etc.
  • 66. 66 Uncorrected Proposal Distribution A Robot in 3D Example: MCMC proposed moves to the right 80% of the time without Hastings correction for acceptance probability! Peak area
  • 67. 67 Hastings Correction Target distribution/ posterior probability Proposal probabilities: 1/3 for a move to the left, 2/3 for a move to the right We need to multiply the chance of accepting a move to the right by 0.5 and the chance of accepting a move to the left by 2 to compensate for the asymmetry!
  • 68. 68 Hastings Correction R = (Pr(point2)/Pr(point1)) * (Pr(data|point2)/Pr(data|point1)) * (Q(point1|point2) / Q(point2|point1)) The first factor is the prior ratio: for uniform priors this is 1! The second factor is the likelihood ratio The third factor is the Hastings ratio: if Q is symmetric, Q(point1|point2) = Q(point2|point1) and the Hastings ratio is 1 → we obtain the normal Metropolis algorithm
  • 69. 69 Hastings Correction more formally R = (f(θ*)/f(θi )) * (f(data|θ*)/f(data|θi )) * (Q(θi |θ*) / Q(θ*|θi )) Prior ratio Likelihood ratio Hastings ratio
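The effect of the Hastings ratio can be checked on a small discrete example. Everything in this sketch is a hypothetical setup: five states with made-up target weights, and a proposal that moves right 80% of the time.

```python
import random

TARGET = [1.0, 2.0, 4.0, 2.0, 1.0]  # hypothetical unnormalized target, states 0..4
P_RIGHT = 0.8                        # asymmetric proposal: move right 80% of the time

def target_density(state):
    """Target weight of a state; states outside 0..4 have density 0."""
    return TARGET[state] if 0 <= state < len(TARGET) else 0.0

def run_chain(n_steps, use_hastings, seed=0):
    """Return the visit frequency of each state after n_steps iterations."""
    rng = random.Random(seed)
    current = 2
    visits = [0] * len(TARGET)
    for _ in range(n_steps):
        move_right = rng.random() < P_RIGHT
        candidate = current + (1 if move_right else -1)
        ratio = target_density(candidate) / target_density(current)
        if use_hastings:
            # Hastings ratio Q(current | candidate) / Q(candidate | current)
            forward = P_RIGHT if move_right else 1.0 - P_RIGHT
            backward = 1.0 - forward
            ratio *= backward / forward
        if rng.random() < ratio:  # a ratio >= 1 is always accepted
            current = candidate
        visits[current] += 1
    return [count / n_steps for count in visits]

corrected = run_chain(400_000, use_hastings=True)
uncorrected = run_chain(400_000, use_hastings=False)
print(corrected)    # should be close to the normalized target 0.1, 0.2, 0.4, 0.2, 0.1
print(uncorrected)  # piles up on the right-hand states
```

Without the correction the chain still converges, but to the wrong distribution, which is why an incorrect Hastings ratio can go unnoticed for a long time.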
  • 70. 70 Hastings Correction is not trivial ● Problem with the equation for the Hastings correction ● M. Holder, P. Lewis, D. Swofford, B. Larget. 2005. Hastings Ratio of the LOCAL Proposal Used in Bayesian Phylogenetics. Systematic Biology. 54:961-965. http://sysbio.oxfordjournals.org/content/54/6/961.full “As part of another study, we estimated the marginal likelihoods of trees using different proposal algorithms and discovered repeatable discrepancies that implied that the published Hastings ratio for a proposal mechanism used in many Bayesian phylogenetic analyses is incorrect.” ● Incorrect Hastings ratio used from 1999-2005
  • 72. 72 Back to Phylogenetics [Figure: five sampled trees on the taxa A, B, C, D, E] What's the posterior probability of bipartition AB|CDE? We just count from the sample generated by MCMC, here it's 3/5 → 0.6 This approximates the true proportion (posterior probability) of bipartition AB|CDE if we have run the chain long enough and if it has converged
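Counting a bipartition in the MCMC sample can be sketched as follows. The five trees are hypothetical stand-ins, since the trees in the figure are not recoverable from the transcript; each tree is represented by the set of its non-trivial bipartitions, and each bipartition by its smaller side.

```python
def bipartition_support(sampled_trees, split):
    """Fraction of sampled trees that contain the given bipartition."""
    hits = sum(1 for tree in sampled_trees if split in tree)
    return hits / len(sampled_trees)

# Hypothetical MCMC sample of five 5-taxon trees; frozenset("AB") stands
# for the bipartition AB|CDE, frozenset("DE") for DE|ABC, and so on.
sample = [
    {frozenset("AB"), frozenset("DE")},
    {frozenset("AB"), frozenset("CD")},
    {frozenset("AC"), frozenset("DE")},
    {frozenset("AB"), frozenset("CE")},
    {frozenset("AE"), frozenset("CD")},
]

support = bipartition_support(sample, frozenset("AB"))
print(support)  # estimated posterior probability of AB|CDE
```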
  • 73. 73 MCMC in practice [Plot: frequency of AB|CDE (y-axis) versus generations (x-axis). The chain starts at a random starting point; the samples drawn before convergence form the burn-in → discarded from our final sample]
  • 74. 74 Convergence ● How many samples do we need to draw to obtain an accurate approximation? ● When can we stop drawing samples? ● Methods for convergence diagnosis → we can never say that an MCMC chain has converged → we can only diagnose that it has not converged → a plethora of tools for convergence diagnostics for phylogenetic MCMC
  • 76. 76 Solution: Run Multiple Chains Robot 1 Robot 2
  • 77. 77 Outline for today ● Bayesian statistics ● Monte-Carlo simulation & integration ● Markov-Chain Monte-Carlo methods ● Metropolis-coupled MCMC-methods
  • 78. 78 Heated versus Cold Chains Robot 1 Robot 2 Cold chain: sees the landscape as is Hot chain: sees a flatter version of the same landscape → moves more easily between peaks
  • 79. 79 Known as MCMCMC ● Metropolis-Coupled Markov-Chain Monte Carlo ● Run several chains simultaneously ● 1 cold chain (the one that samples) ● Several heated chains ● Heated chain robots explore the parameter space in larger steps ● To flatten the landscape the acceptance ratio R is modified as follows: R^(1/(1+H)) where H is the so-called temperature – For the cold chain H := 0.0 – Setting the temperature for the hot chains is a bit of voodoo
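The flattening effect of the temperature can be seen by raising an acceptance ratio to the power 1/(1+H). The function name and the temperature values below are illustrative; only the formula comes from the slide.

```python
def heated_ratio(ratio, temperature):
    """Acceptance ratio of a heated chain: R ** (1 / (1 + H))."""
    return ratio ** (1.0 / (1.0 + temperature))

downhill = 0.125  # the 8 m -> 1 m step of the robot example
for temperature in (0.0, 0.5, 2.0):
    print(temperature, heated_ratio(downhill, temperature))
```

For H = 0 the ratio is unchanged (cold chain); the larger H, the closer a downhill ratio moves towards 1, so a hot chain accepts big downhill steps far more often and crosses valleys between peaks more easily.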
  • 80. 80 Heated versus Cold Chains Robot 1: cold Robot 2: hot Exchange information every now and then
  • 81. 81 Heated versus Cold Chains Robot 1: hot Robot 2: cold Swap cold/hot states to better sample this nice peak here
  • 82. 82 Heated versus Cold Chains Robot 1: hot Robot 2: cold Decision on when to swap is a bit more complicated!
  • 83. 83 Heated versus Cold Chains Robot 1: hot Robot 2: cold Only the cold robot actually emits states (writes samples to file)
  • 84. 84 A few words about priors ● Prior probabilities convey the scientist's beliefs, before having seen the data ● When using uninformative prior probability distributions (e.g., uniform priors, also called flat priors), differences between prior and posterior distribution are attributable to likelihood differences ● Priors can bias an analysis ● For instance, we could choose an arbitrary prior distribution for branch lengths in the range [1.0,20.0] → what happens if branch lengths are much shorter?
  • 85. 85 Some Phylogenetic Proposal Mechanisms ● Branch Lengths ● Sliding Window Proposal ● Multiplier Proposal ● Topologies ● Local Proposal (the one with the bug in the Hastings ratio) ● Extending TBR (Tree Bisection Reconnection) Proposal ● Remember: We need to design proposals for which ● We either don't need to calculate the Hastings ratio ● Or for which we can calculate it ● That have a 'good' acceptance rate → all sorts of tricks being used, e.g., parsimony-biased topological proposals