Molecular Phylogeny
Universality Diversity
Darwin’s finches
Peter and Rosemary Grant
Natural selection in ground finches.
The processes of evolution
Anatomical homology of the forelimb in animals.
Genetic comparison between HIV-1 and Simian Immunodeficiency Virus (SIV)
How to read a tree?
Some terminologies
Figure: timeline from past to present in which mutations become substitutions. A phylogeny is built on substitutions, which are heritable and phylogenetically meaningful.
Figure: a rooted tree with tips A to E, labelled as follows.
• Tips/Taxa/Leaves: represent the TAXA (genes, populations, species, etc.) used to infer the phylogeny, i.e. YOUR SEQUENCES
• Branches or Lineages
• Nodes or Divergence Points: represent hypothetical ancestors of the taxa
• Most recent common ancestor (MRCA) or ROOT of the Tree
Similarity vs. relatedness
Sequence similarity and relatedness are not the same thing, even though
evolutionary relationships are based on certain types of similarity
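Figure: tree of Taxa A to D with branch lengths given as numbers of differences (one pair of taxa is separated by 3 differences, another by 7 differences). One axis of the drawing is labelled "This axis means nothing!".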
• Unrooted trees tell us the similarity among taxa, but neither the ancestry nor the origin.
• The number of unrooted trees increases in a greater than exponential manner with the number of taxa.

# Taxa (N)    # Unrooted trees
3             1
4             3
5             15
6             105
7             945
8             10,395
9             135,135
10            2,027,025
...           ...
30            ≈8.69 x 10^36
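The counts of unrooted trees can be checked with a few lines of Python. This is a minimal sketch that assumes strictly bifurcating trees with labelled tips, for which the count is the double factorial (2N - 5)!! = 1 x 3 x 5 x ... x (2N - 5). The function name `count_unrooted_trees` is made up for this illustration.

```python
def count_unrooted_trees(n_taxa: int) -> int:
    """Number of distinct unrooted, strictly bifurcating trees for n_taxa labelled tips."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # (2N - 5)!! is the product of the odd numbers 1, 3, 5, ..., 2N - 5.
    for odd in range(1, 2 * n_taxa - 4, 2):
        count *= odd
    return count


if __name__ == "__main__":
    for n in (3, 4, 5, 6, 7, 8, 9, 10, 30):
        print(n, count_unrooted_trees(n))
```

Each extra taxon multiplies the count by the next odd number, which is why the growth is faster than exponential.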
Finding a root
Inferring evolutionary relationship requires a rooted phylogeny
Figure: the same tree (taxa B and C labelled) drawn with the root in two different positions.
• Both trees are technically correct, but they tell different evolutionary stories
By midpoint or distance:
• Root the tree at the midway point between the two most distant taxa.
• OFTEN WRONG.
Figure: example tree with taxa B, C and D and numeric branch lengths.
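Midpoint rooting can be sketched as follows. This is an illustration only: the tree is given as a hypothetical edge list, the function names are invented here, and the branch lengths in the test are toy values rather than the ones in the slide figure.

```python
from itertools import combinations


def path_between(adj, start, goal):
    """Return the nodes on the unique path from start to goal in a tree."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == goal:
            return path
        for nxt in adj[node]:
            if nxt not in path:
                stack.append((nxt, path + [nxt]))
    raise ValueError("nodes are not connected")


def midpoint_root(edges):
    """edges: list of (node_u, node_v, length) describing an unrooted tree.

    Returns (u, v, offset): the root sits on edge u-v, `offset` away from u.
    """
    adj = {}
    for u, v, length in edges:
        adj.setdefault(u, {})[v] = length
        adj.setdefault(v, {})[u] = length
    tips = [node for node, nbrs in adj.items() if len(nbrs) == 1]

    # Find the two most distant tips and the path that joins them.
    best_path, best_len = None, -1.0
    for a, b in combinations(tips, 2):
        path = path_between(adj, a, b)
        total = sum(adj[x][y] for x, y in zip(path, path[1:]))
        if total > best_len:
            best_path, best_len = path, total

    # Walk along that path until half of its length has been covered.
    half, walked = best_len / 2, 0.0
    for x, y in zip(best_path, best_path[1:]):
        if walked + adj[x][y] >= half:
            return x, y, half - walked
        walked += adj[x][y]
```

The result is only as good as the assumption behind it, namely that the two most distant lineages evolved at similar rates, which is one reason this rooting is often wrong.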
How to build a tree?
3. Align sequences
• Do you have an appropriate outgroup?
6. Then run more trees, test them, run trees again, and carry out further analyses
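Our goal: to reconstruct evolution
What’s a phylogeny? An evolutionary hypothesis.
Figure: an ancestral sequence ACAGAT evolving along a tree, with time points t5 to t7 and the substitutions C(2)>T(2), G(4)>A(4), A(4)>G(4) and A(5)>T(5) marked on its branches.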
Multiple substitutions at a single site – hidden information.
Figure: two examples in which a single site (bases A, T and C) has changed three times.
• Counting 1 mutation when 3 have occurred.
• Counting 0 mutations when 3 have occurred.
The problem of multiple substitutions
A model includes:
• The frequencies of each base (A, T, C, G)
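Figure: substitution rate diagrams for two models. Jukes-Cantor uses a single rate (a) for every change between bases; General Time Reversible uses a separate rate (a to f) for each pair of bases.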
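As an illustration of how a model corrects for hidden substitutions, the sketch below applies the standard Jukes-Cantor distance correction, d = -3/4 ln(1 - 4p/3), where p is the observed proportion of differing sites. The function names and the two toy sequences are assumptions made for this example.

```python
import math


def p_distance(seq1: str, seq2: str) -> float:
    """Observed proportion of differing sites, ignoring columns with a gap."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def jc69_distance(p: float) -> float:
    """Jukes-Cantor estimate of substitutions per site from the observed p."""
    if not 0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1 - 4 * p / 3)


if __name__ == "__main__":
    p = p_distance("ACAGAT", "ATAGAT")  # toy sequences, 1 difference in 6 sites
    print(p, jc69_distance(p))
```

The corrected distance is always at least as large as the observed one, and the gap widens as sequences diverge, because more of the changes are hidden by later changes at the same site.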
Figure: two rate distributions, one showing frequent among-site rate variation and one showing little among-site rate variation.
Tree-Building Methods
Parsimony, Distance, Maximum likelihood and Bayesian
Maximum parsimony phylogenies
Maximum parsimony relies on finding the tree with the smallest number of character changes (substitutions)
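Counting the character changes a given tree requires can be sketched with Fitch's algorithm, a standard way to score one site on a fixed binary tree. The nested-tuple tree format, the function names and the toy alignment below are assumptions for this illustration; a real analysis would also search over many tree topologies.

```python
def fitch_score(tree, states):
    """Minimum number of changes at one site.

    tree: nested 2-tuples of tip names, e.g. (("A", "B"), ("C", "D")).
    states: dict mapping tip name -> base observed at this site.
    """
    def walk(node):
        if isinstance(node, str):  # a tip
            return {states[node]}, 0
        left_set, left_cost = walk(node[0])
        right_set, right_cost = walk(node[1])
        common = left_set & right_set
        if common:  # children can agree: no extra change needed
            return common, left_cost + right_cost
        return left_set | right_set, left_cost + right_cost + 1

    return walk(tree)[1]


def parsimony_length(tree, alignment):
    """Sum of Fitch scores over all sites of an alignment (dict tip -> sequence)."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_score(tree, {tip: seq[i] for tip, seq in alignment.items()})
        for i in range(n_sites)
    )


if __name__ == "__main__":
    toy = {"A": "AAG", "B": "AAG", "C": "ATC", "D": "ATC"}  # toy data
    print(parsimony_length((("A", "B"), ("C", "D")), toy))
    print(parsimony_length((("A", "C"), ("B", "D")), toy))
```

With this toy alignment the tree that groups A with B needs fewer changes than the one that groups A with C, so parsimony prefers it.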
Advantages
Limitations:
Rationale
3. Update D
Advantages
• Simple
• Flexible (many distance and clustering algorithms)
• Fast and scalable (to large datasets)
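A distance-based method can be sketched as a loop that repeatedly joins the closest pair of clusters and then updates the distance matrix D. The example below uses UPGMA, one common clustering algorithm; the slides do not say which algorithm they have in mind, and the function name, input format and toy distances are assumptions.

```python
from itertools import combinations


def upgma(names, dist):
    """Build a tree (nested tuples) by average-linkage clustering.

    names: list of taxon labels.
    dist: dict mapping a pair of labels, e.g. ("A", "B"), to their distance.
    """
    sizes = {name: 1 for name in names}  # cluster -> number of tips inside it
    d = {frozenset(pair): value for pair, value in dist.items()}
    while len(sizes) > 1:
        # 1. Find the closest pair of clusters.
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        merged = (a, b)
        # 2. Distance from the merged cluster to every other cluster is the
        #    size-weighted average of the two old distances.
        for other in sizes:
            if other not in (a, b):
                d[frozenset((merged, other))] = (
                    sizes[a] * d[frozenset((a, other))]
                    + sizes[b] * d[frozenset((b, other))]
                ) / (sizes[a] + sizes[b])
        # 3. Update D: replace the two old clusters by the merged one.
        sizes[merged] = sizes.pop(a) + sizes.pop(b)
    return next(iter(sizes))


if __name__ == "__main__":
    toy_dist = {  # toy pairwise distances
        ("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
        ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10,
    }
    print(upgma(["A", "B", "C", "D"], toy_dist))
```

The loop uses only the pairwise distances, never the sequences themselves, which is what makes this family of methods fast but also discards site-level information.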
Limitations
• Maximum Likelihood: find tree and evolutionary rates with highest likelihood
• Bayesian: find tree and evolutionary rates according to posterior probability.
Rationale:
• Evaluates many candidate trees iteratively, searching for the tree topology with the highest likelihood.
Hill Climbing
Figure: hill climbing towards ‘better’ trees.
• Local maxima are a problem for methods using hill
climbing algorithms to find the best tree
• One way to reduce the probability of being stuck in
a local maximum is to do repeat analyses from
different starting points
• I.e. beam in a number of robots to different starting
positions
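The effect of restarting from different positions can be shown on a toy one-dimensional landscape. This sketch is not a tree search: the landscape values and function names are invented, and real programs move between trees by topology rearrangements rather than by stepping left or right.

```python
def hill_climb(score, start, neighbours):
    """Move to the best neighbour until no neighbour scores higher."""
    current = start
    while True:
        best = max(neighbours(current), key=score)
        if score(best) <= score(current):
            return current  # a (possibly local) maximum
        current = best


def hill_climb_with_restarts(score, starts, neighbours):
    """Run hill climbing from several starting points and keep the best result."""
    return max((hill_climb(score, s, neighbours) for s in starts), key=score)


# Toy landscape: a local maximum at position 1, the global maximum at position 6.
LANDSCAPE = [1, 3, 2, 1, 4, 7, 9, 6, 2]


def toy_score(i):
    return LANDSCAPE[i]


def toy_neighbours(i):
    return [j for j in (i - 1, i + 1) if 0 <= j < len(LANDSCAPE)]


if __name__ == "__main__":
    print(hill_climb(toy_score, 0, toy_neighbours))
    print(hill_climb_with_restarts(toy_score, [0, 8], toy_neighbours))
```

Starting from position 0 the search stops on the small peak, while adding a second start at position 8 lets one of the runs reach the highest peak.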
Statistical phylogenetic reconstruction
Advantages
• very flexible
• consistent with an explicit model of evolution
• statistically consistent (allows for model comparison)
Limitations
• computer-intensive, complicated statistics & methodology
• (ML) no measure of uncertainty for the single tree obtained
• (Bayesian) not ideal for ‘beginners’
Phylogenetic reconstruction - summary
Method               Data used                  Tree search        Evolutionary model
Distance             Pairwise distance          Simple algorithm   Can be complex
Maximum likelihood   All sites                  Hill climbing      Can be complex
Bayesian methods     All sites (+ other info)   MCMC               Can be very complex
Maximum likelihood and Bayesian
methods provide more reliable trees.
Bootstrapping
How much do you trust the tree that you just built?
Bootstrapping: assess variability due to sampling and conflicting signals by analyzing resampled datasets.

Original alignment:
         Characters
Taxa     1 2 3 4 5 6 7 8 9
A        A C C T G A T G C
B        A G C T G G T T C
C        A G C A G A T G G
D        T C C T C G T G C
E        T C T T A A T G C

Resampling with replacement

Resampled alignment:
         Characters
Taxa     2 5 9 2 7 7 2 1 6
A        C G C C T T C A A
B        G G C G T T G A G
C        G G G G T T G A A
D        C C C C T T C T G
E        C A C C T T C T A
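Resampling alignment columns with replacement takes only a few lines. The sketch below uses the five-taxon character matrix from the slide; the function names and the random seed are assumptions made for the illustration.

```python
import random


def take_columns(alignment, picks):
    """Build a new alignment from the given column indices (0-based)."""
    return {taxon: "".join(seq[i] for i in picks) for taxon, seq in alignment.items()}


def bootstrap_replicate(alignment, rng):
    """Draw as many columns as the alignment has, with replacement."""
    n_sites = len(next(iter(alignment.values())))
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return picks, take_columns(alignment, picks)


ALIGNMENT = {
    "A": "ACCTGATGC",
    "B": "AGCTGGTTC",
    "C": "AGCAGATGG",
    "D": "TCCTCGTGC",
    "E": "TCTTAATGC",
}

if __name__ == "__main__":
    # Reproduce the slide's draw: characters 2 5 9 2 7 7 2 1 6 (1-based).
    slide_picks = [c - 1 for c in (2, 5, 9, 2, 7, 7, 2, 1, 6)]
    print(take_columns(ALIGNMENT, slide_picks))
    # A fresh random replicate; the seed is arbitrary.
    print(bootstrap_replicate(ALIGNMENT, random.Random(1)))
```

Each replicate keeps the number of characters fixed, so some columns appear several times and others not at all; a tree is then built from every replicate.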
Bootstrapping
Alignment:
Taxon A : ATG-CGA-GTT-TAG-CAG
Taxon B : ATG-CGA-GCT-TAA-CTG
Taxon C : ATA-CTA-GCT-TAG-CTG
Taxon D : ATG-CTA-TCT-TAG-GTG
Figure: the inferred “true” tree of taxa A to D is compared with trees built from resampled alignments (s2, s3, s4). The share of resampled trees that recover each node gives the statistical node support for the tree; the support values shown are 100, 100 and 75.
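Node support can be computed as the percentage of bootstrap trees that contain each clade of the inferred tree. In this sketch every tree is reduced to a set of clades, each clade being a frozenset of taxon names; that representation, the function name and the four replicate trees are hypothetical.

```python
def node_support(reference_clades, replicate_trees):
    """Percentage of replicate trees containing each clade of the reference tree.

    reference_clades: iterable of frozensets of taxon names.
    replicate_trees: list of sets of frozensets, one set per bootstrap tree.
    """
    if not replicate_trees:
        raise ValueError("need at least one replicate tree")
    return {
        clade: 100 * sum(clade in tree for tree in replicate_trees) / len(replicate_trees)
        for clade in reference_clades
    }


AB = frozenset("AB")
ABC = frozenset("ABC")
AC = frozenset("AC")

# Hypothetical bootstrap trees: all four group A with B, three of them also
# group A, B and C together.
REPLICATES = [{AB, ABC}, {AB, ABC}, {AB, ABC}, {AB, frozenset("ABD")}]

if __name__ == "__main__":
    for clade, support in node_support([AB, ABC, AC], REPLICATES).items():
        print(sorted(clade), support)
```

A clade found in every replicate gets 100, one found in three of four gets 75, and a grouping that never appears gets 0.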
Sampling:
• How do you access the samples? Ethical approval?
Figure: phylogenetic tree annotated with resistance to ceftriaxone, further marker columns, and country of isolation (Bhutan, Pakistan, Sri Lanka, Cambodia, Thailand, South Korea, Morocco, Egypt, Senegal, Madagascar, France, Brazil). Scale bar: 0.014 substitutions/site.
Chung The et al., 2015, MGen
FQ resistant S. sonnei around the globe share the same PFGE pattern.
Ireland 16: primarily patients with recent travel history. India (9), Germany (1), Morocco (1), No travel (5), Unknown (4)
plasmid
Clonal expansions in Vietnam