
Molecular Phylogeny

The document provides an overview of molecular phylogenetics, detailing the processes of evolution, the significance of phylogenetic trees, and methods for constructing these trees. It discusses the importance of genetic data in understanding evolutionary relationships and the various models and techniques used for phylogenetic reconstruction, including maximum likelihood and Bayesian methods. Additionally, it highlights the challenges and considerations in estimating genetic distances and the role of bootstrapping in assessing the reliability of phylogenetic trees.

Uploaded by

trangnt.m22bio

Molecular phylogenetics

Introduction and application

University of Science and Technology of Hanoi


MSc. class
Chung The Hao, PhD
Why evolution?

Universality Diversity
Darwin’s finches
Peter and Rosemary Grant
Natural selection in ground finches.
The processes of evolution

Mutation: Genetic variability that evolution will act upon.

Natural selection: Differential survival and reproduction of individuals due to differences in phenotype. Under natural selection, mutations could be beneficial, deleterious, or neutral.
What is phylogenetics?

A phylogeny is a diagram describing the ancestral relationships of organisms.

Phylogenetics looks for homology as evidence of common ancestry

Anatomical homology of
forelimb in animals.

Homology: similarity among organisms due to inheritance from a shared common ancestor.
Why phylogenetics?
• A phylogeny is a well-supported hypothesis/model of the evolutionary relationships among taxonomic groups.

• It provides an important evolutionary framework to understand biology, and to ask further questions about population structure, co-evolution and evolutionary rates.

Genomics in infectious diseases:

• Confidently differentiate organisms at strain level, without the need to refer to phenotypic data.
Evolutionary tree of Life
Why use molecular data?
• Similarity in nucleotide sequence almost always points to homology.

• For microbes, we can’t construct trees based on known relationships or morphology.

• Constructing trees from microbial sequence data lets us infer evolutionary relationships and understand epidemiology.

[Figure: Genetic comparison between HIV-1 and Simian Immunodeficiency Virus (SIV)]
How to read a tree?
Some terminologies

Mutation/Polymorphism: A mutant allele that arises in a population.

Substitution: A mutant allele that is fixed in the population (i.e. it is carried through into the following generations).

Phylogenetics is built on substitutions.

[Figure: lineages traced from past to present. Substitutions are heritable and phylogenetically meaningful; mutations that do not become fixed are not phylogenetically meaningful.]


Some terms

[Figure: a rooted tree with tips A, B, C, D and E, labelled with the terms below.]

• Tips/Taxa/Leaves: represent the TAXA (genes, populations, species, etc.) used to infer the phylogeny, i.e. YOUR SEQUENCES
• Branches or Lineages
• Nodes or Divergence Points: represent hypothetical ancestors of the taxa
• Most recent common ancestor (MRCA) or ROOT of the Tree
Similarity vs. relatedness
Sequence similarity and relatedness are not the same thing, even though
evolutionary relationships are based on certain types of similarity

Similarity: being similar in a measurable metric (substitution differences)

Relatedness: genetic/evolutionary connectedness (a historical fact)

Two taxa can be the most similar without being the most closely related.

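[Figure: tree of Taxon A, Taxon B, Taxon C and Taxon D with branch lengths (6, 1, 1, 3, 1, 3, 5), contrasting a pair of taxa separated by 7 differences with a pair separated by 3 differences. One axis means nothing; the other axis usually indicates genetic distance (occasionally time).]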


Unrooted trees
The possible unrooted trees of four taxa (A, B, C, D)

Tree 1: (A, B | C, D)
Tree 2: (A, C | B, D)
Tree 3: (A, D | B, C)
• Unrooted trees tell us the similarity among taxa, but not the ancestry nor the origin.

• The number of unrooted trees increases in a greater than exponential manner with the number of taxa.

# Taxa (N)    # Unrooted trees
3             1
4             3
5             15
6             105
7             945
8             10,395
9             135,135
10            2,027,025
...           ...
30            ≈8.69 x 10^36
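The growth in the number of unrooted trees can be checked with a short calculation. The sketch below assumes strictly bifurcating trees with labelled tips, for which the count is the double factorial (2N - 5)!!; the function name is illustrative only.

```python
def count_unrooted_trees(n_taxa: int) -> int:
    """Number of distinct unrooted bifurcating trees for n_taxa labelled tips."""
    if n_taxa < 3:
        raise ValueError("an unrooted tree needs at least 3 taxa")
    count = 1
    # Double factorial (2N - 5)!! = 1 * 3 * 5 * ... * (2N - 5)
    for odd in range(3, 2 * n_taxa - 4, 2):
        count *= odd
    return count


if __name__ == "__main__":
    for n in (3, 4, 5, 6, 7, 8, 9, 10, 30):
        print(n, count_unrooted_trees(n))
```

Each added taxon can be attached to any existing branch, which is why the count is multiplied by the next odd number at every step.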
Finding a root
Inferring evolutionary relationship requires a rooted phylogeny
[Figure: an unrooted tree of taxa A, B, C and D, and two rooted trees obtained by placing the root on different branches of it.]
• Both trees are technically correct, but they tell different evolutionary stories

• Finding a correct root is important and also difficult.


Finding a root
By outgroup:

• Use taxa (outgroup) that are known to fall outside of the group of interest

• Requires some prior knowledge about the relationship between the outgroup taxa and the group of interest

[Figure: a monophyletic clade with an outgroup/basal taxon.]

By midpoint or distance:

• Default in many software packages

• Root the tree at the midway point between the two most distant taxa.

• OFTEN WRONG.

[Figure: unrooted tree of A, B, C and D with branch lengths 10, 3, 2, 2 and 5. d(A,D) = 10 + 3 + 5 = 18; midpoint = 18 / 2 = 9.]
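Midpoint rooting can be sketched in a few lines. The example below is an illustration under assumptions: the topology (A and C attached to one internal node, B and D to the other) and the node names "x" and "y" are hypothetical, chosen so that the branch lengths reproduce d(A,D) = 10 + 3 + 5 = 18 from the slide.

```python
from itertools import combinations

# Hypothetical unrooted tree: tips A and C join internal node "x",
# tips B and D join internal node "y", and x-y is the internal branch.
EDGES = {
    ("A", "x"): 10, ("C", "x"): 2,
    ("x", "y"): 3,
    ("B", "y"): 2, ("D", "y"): 5,
}


def build_adjacency(edges):
    adj = {}
    for (u, v), length in edges.items():
        adj.setdefault(u, {})[v] = length
        adj.setdefault(v, {})[u] = length
    return adj


def find_path(adj, start, goal, visited=None):
    """Return the unique path between two nodes of a tree as a list of nodes."""
    visited = visited or {start}
    if start == goal:
        return [start]
    for nxt in adj[start]:
        if nxt not in visited:
            rest = find_path(adj, nxt, goal, visited | {nxt})
            if rest:
                return [start] + rest
    return None


def midpoint_root(edges, tips):
    """Return (tip1, tip2, distance, branch holding the root, offset along it)."""
    adj = build_adjacency(edges)

    def path_length(path):
        return sum(adj[a][b] for a, b in zip(path, path[1:]))

    # Path between the two most distant tips
    longest = max((find_path(adj, a, b) for a, b in combinations(tips, 2)),
                  key=path_length)
    total = path_length(longest)
    half = total / 2
    walked = 0
    # Walk along the path until the halfway point falls inside a branch
    for a, b in zip(longest, longest[1:]):
        if walked + adj[a][b] >= half:
            return longest[0], longest[-1], total, (a, b), half - walked
        walked += adj[a][b]


if __name__ == "__main__":
    print(midpoint_root(EDGES, ["A", "B", "C", "D"]))
```

With these lengths the root lands on the branch leading to A, 9 units from the tip. A long branch like this one attracts the midpoint, which is one reason the method is often wrong when rates differ between lineages.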
How to build a tree?

A very short introduction


From sequences to tree
1. Formulate a hypothesis!

2. Gather appropriate sequences


• From your samples
• From publicly available databases

3. Align sequences
• Do you have an appropriate outgroup?

4. Run preliminary trees

5. Determine your evolution model(s)

6. Then run more trees, test, rerun trees, and carry out further analyses
Our goal: to reconstruct evolution

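[Figure: an evolutionary hypothesis (tree) relating four observed sequences to the ancestral sequence ACAGAT. Internal nodes at t5, t6 and t7 are labelled with the substitutions C(2)>T(2), G(4)>A(4), A(4)>G(4) and A(5)>T(5).]

Observed data (t1 = t2 = t3 = t4 = 0): ACAGAT, ACAGTT, ATAGAT, ATAAAT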
DNA sequence alignment
• DNA sequence alignment (input) is the most important thing in
phylogenetic reconstruction.

• Purpose: pin-point homologous nucleotides

• Could be easy or difficult

Easy:

Taxon 1  GCGGCCCA TCAGGTAGTT GGTGG
Taxon 2  GCGGCCCA TCAGGTAGTT GGTGG
Taxon 3  GCGTTCCA TCAGCTGGTT GGTGG
Taxon 4  GCGTCCCA TCAGCTAGTT GGTGG
Taxon 5  GCGGCGCA TTAGCTAGTT GGTGA
         ******** ********** *****

Difficult due to insertions or deletions (indels):

         TTGACATG CCGGGG---A AACCG
         TTGACATG CCGGTG--GT AAGCC
         TTGACATG -CTAGG---A ACGCG
         TTGACATG -CTAGGGAAC ACGCG
         TTGACATC -CTCTG---A ACGCG
         ******** ?????????? *****

Always perform a manual check after alignment
Estimating genetic distances
Simplest way: Genetic distance is the observed mutational differences between taxa

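[Figure: DNA sequence alignment between HIV-1 and SIV.]

Multiple substitutions at a single site are hidden information.

[Figure: two examples of a site where three substitutions have occurred. In the first, 1 mutation is counted when 3 have occurred; in the second, 0 mutations are counted when 3 have occurred.]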
The problem of multiple substitutions

• When % divergence is low, the observed distance (p) is a good estimator of the genetic distance (d)

• When % divergence is high, p underestimates d and a correction is required (i.e. a model of DNA substitution).
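To make the gap between p and d concrete, the sketch below computes the observed distance and applies the Jukes-Cantor correction, d = -3/4 ln(1 - 4p/3). It assumes equal base frequencies and equal substitution rates, and the function names are illustrative only.

```python
import math


def p_distance(seq1: str, seq2: str) -> float:
    """Observed proportion of sites that differ between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(a != b for a, b in zip(seq1, seq2))
    return differences / len(seq1)


def jukes_cantor_distance(p: float) -> float:
    """Jukes-Cantor corrected distance in expected substitutions per site."""
    if p >= 0.75:
        raise ValueError("p >= 0.75: sequences are saturated under this model")
    return -0.75 * math.log(1 - 4 * p / 3)


if __name__ == "__main__":
    # Two of the six-base example sequences used earlier in these slides
    print(p_distance("ACAGAT", "ATAAAT"))
    for p in (0.01, 0.05, 0.20, 0.50):
        print(p, round(jukes_cantor_distance(p), 4))
```

At low divergence the two values are nearly identical; as p grows, the corrected distance pulls away from it, which is the hidden information that multiple substitutions remove from the raw count.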
DNA substitution model
Models of DNA sequence evolution are required to recover missing
information through correcting the problem of multiple substitutions.

A model includes:
• The frequencies of each base (A, T, C, G)

• The probability of substitution between bases (A to C, C to T, …)

• The probability of substitution along a sequence (different sites/regions evolve at different rates).

A good model = A good tree


DNA substitution model

From simplest to most complex:

1. Base frequencies are equal and all substitutions are equally likely (Jukes-Cantor)

2. Base frequencies are equal but transitions and transversions occur at different rates (Kimura 2-parameter)

3. Unequal base frequencies, and transitions and transversions occur at different rates (Hasegawa-Kishino-Yano)

4. Unequal base frequencies and all substitution types occur at different rates (General Time Reversible)

[Figure: substitution rates among A, C, G and T. Jukes-Cantor (simplest): a single rate a for all six substitution types. General Time Reversible (most complex): six distinct rates a to f.]
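As an illustration of how a richer model changes the distance estimate, the sketch below implements the Kimura 2-parameter distance, d = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q), where P and Q are the observed proportions of transitions and transversions. The function name and the test sequences are hypothetical.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1: str, seq2: str) -> float:
    """Kimura 2-parameter distance between two aligned DNA sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = 0
    for a, b in zip(seq1, seq2):
        if a == b:
            continue
        # Transition: purine <-> purine or pyrimidine <-> pyrimidine
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    n = len(seq1)
    P, Q = transitions / n, transversions / n
    return -0.5 * math.log(1 - 2 * P - Q) - 0.25 * math.log(1 - 2 * Q)


if __name__ == "__main__":
    # Hypothetical pair: one transition (A>G) and one transversion (A>C) in 10 sites
    print(k2p_distance("AAAAAAAAAA", "GCAAAAAAAA"))
```

The observed distance for this pair is 0.2, and the corrected value is larger, as expected from any model that accounts for multiple substitutions.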


Among-site rate variation

Biological sequences (genes) have conserved and non-conserved regions for optimization of functionality
→ Different rates of evolution. Different genetic distances

Using the Gamma distribution to model among-site rate variation:
• Large alpha: little variation
• Small alpha: high variation (often the case)

[Figure: gamma distributions illustrating frequent among-site rate variation and little among-site rate variation.]
Tree building methods
Methods for inferring phylogenies

Tree-Building Methods

• No explicit model of DNA evolution
  – Parsimony: application of the parsimony principle

• Explicit model of DNA evolution
  – Distance: pairwise comparison of sequences
  – Maximum likelihood and Bayesian: statistical approach
Maximum parsimony phylogenies
Relies on finding the tree with the smallest number of character changes (substitutions).

Advantages:

• Intuitive explanation: ‘simplest’ evolutionary scenario

Limitations:

• No measure of uncertainty for the tree obtained
• Computationally intensive
• Ignores different types of substitutions
• No explicit model of DNA substitution
• Evolution in real life is not necessarily parsimonious
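Counting character changes on a candidate tree can be sketched with Fitch's algorithm, one standard way to score a tree under parsimony. The example reuses the four six-base sequences from the "reconstruct evolution" slide; the taxon names seq1 to seq4 and the nested-tuple tree encoding are assumptions made for illustration.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, number of changes) for one site."""
    if isinstance(tree, str):  # a tip: its observed base
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    common = left_set & right_set
    if common:  # children agree: no extra change needed
        return common, left_cost + right_cost
    # children disagree: one substitution is required at this node
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Total number of changes over all sites of the alignment."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_site(tree, {taxon: seq[i] for taxon, seq in alignment.items()})[1]
        for i in range(n_sites)
    )


ALIGNMENT = {"seq1": "ACAGAT", "seq2": "ACAGTT", "seq3": "ATAGAT", "seq4": "ATAAAT"}

# The three possible groupings of four taxa
TREES = {
    "(seq1,seq2 | seq3,seq4)": (("seq1", "seq2"), ("seq3", "seq4")),
    "(seq1,seq3 | seq2,seq4)": (("seq1", "seq3"), ("seq2", "seq4")),
    "(seq1,seq4 | seq2,seq3)": (("seq1", "seq4"), ("seq2", "seq3")),
}

if __name__ == "__main__":
    for name, tree in TREES.items():
        print(name, parsimony_score(tree, ALIGNMENT))
```

The grouping of seq1 with seq2 needs 3 changes while the other two need 4, so it is the most parsimonious of the three. The score does not depend on where the root is placed, which is why a rooted encoding is harmless here.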
Distance-based phylogenetic reconstruction

Relies on agglomerative clustering algorithms (UPGMA, Neighbor-joining)

Rationale

1. Compute pairwise genetic distance (D)

2. Group closest sequences

3. Update D

4. Go back to (2) until all sequences are grouped
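The four steps above can be sketched with UPGMA, the simpler of the two clustering algorithms named. This is a minimal illustration that returns only the topology as nested tuples (no branch lengths); the distance values are hypothetical.

```python
def upgma(distances):
    """Cluster taxa by UPGMA.

    distances: dict mapping frozenset({a, b}) to the distance between a and b.
    Returns the tree topology as nested tuples.
    """
    dist = dict(distances)
    # Every taxon starts as its own cluster of size 1
    clusters = {name: 1 for pair in dist for name in pair}
    while len(clusters) > 1:
        # Step 2: group the closest pair of clusters
        pair = min(dist, key=dist.get)
        a, b = sorted(pair, key=str)
        merged = (a, b)
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        # Step 3: update D with the size-weighted average distance
        for other in clusters:
            d_a = dist.pop(frozenset({a, other}))
            d_b = dist.pop(frozenset({b, other}))
            dist[frozenset({merged, other})] = (
                (size_a * d_a + size_b * d_b) / (size_a + size_b)
            )
        del dist[pair]
        clusters[merged] = size_a + size_b
        # Step 4: loop until everything is grouped
    return next(iter(clusters))


# Step 1: pairwise genetic distances (hypothetical values)
D = {
    frozenset(pair): d
    for pair, d in {
        ("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
        ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10,
    }.items()
}

if __name__ == "__main__":
    print(upgma(D))
```

UPGMA assumes a constant rate of evolution across lineages; Neighbor-joining follows the same loop but picks the pair to join with a criterion that relaxes that assumption.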


Distance-based phylogenetic reconstruction

[Figure: worked example of distance-based tree building, with pairwise distances computed under the chosen evolution model.]

Advantages

• Simple
• Flexible (many distance and clustering algorithms)
• Fast and scalable (to large datasets)

Limitations

• Sensitive to distance/clustering choice
• Returns a single tree, with no measure of uncertainty for the tree built
• Oversimplifies most evolutionary relationships
• Rarely publishable.
Statistical phylogenetic reconstruction
Approaches relying on a model of sequence evolution:

• Maximum Likelihood: find tree and evolutionary rates with highest likelihood
• Bayesian: find tree and evolutionary rates according to posterior probability.

Rationale:

1. Start from a random or pre-defined tree (Neighbor-joining tree)

2. Compute initial likelihood/posterior

3. Permute branches, sample new parameters and compute the new likelihood/posterior

4. Accept or reject the new tree based on likelihood/posterior improvement.

5. Go back to (3) until convergence.


Maximum likelihood phylogenetics
• Likelihood is a quantity proportional to the probability of observing an outcome/data/event, X, given a hypothesis, H.
– P ( X | H ) or P ( X | p )
– P ( Data | model of evolution)

• ML evaluates the probability that a phylogenetic hypothesis (evolution model + unrooted tree) gives rise to the observed data.

• Performs many iterations of tree search, looking for the tree topology with the highest likelihood.

• Returns 1 tree with the highest likelihood
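The idea of P(Data | model) can be illustrated on the smallest possible case: two aligned sequences joined by a single branch under the Jukes-Cantor model. The sketch below is an assumption-laden toy, not how ML software is implemented: it scans a grid of branch lengths and keeps the one with the highest log-likelihood. The site counts are hypothetical.

```python
import math


def jc69_log_likelihood(d: float, n_same: int, n_diff: int) -> float:
    """Log-likelihood of two aligned sequences separated by distance d (JC69)."""
    decay = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * decay   # probability a site shows the same base
    p_diff = 0.25 - 0.25 * decay   # probability of one specific different base
    # 0.25 is the equal base frequency assumed by Jukes-Cantor
    return n_same * math.log(0.25 * p_same) + n_diff * math.log(0.25 * p_diff)


def ml_distance(n_same: int, n_diff: int, step: float = 0.001, max_d: float = 3.0) -> float:
    """Grid search for the branch length with the highest likelihood."""
    candidates = [step * k for k in range(1, int(max_d / step) + 1)]
    return max(candidates, key=lambda d: jc69_log_likelihood(d, n_same, n_diff))


if __name__ == "__main__":
    # Hypothetical data: 100 aligned sites, 10 of which differ
    best = ml_distance(n_same=90, n_diff=10)
    print(best, jc69_log_likelihood(best, 90, 10))
```

For this two-sequence case the maximum sits at the Jukes-Cantor corrected distance. Real analyses optimize the topology, all branch lengths and the model parameters together, which is why a heuristic tree search is needed.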


Hill Climbing

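• Imagine tree ‘space’ is a hill
• Better trees (measured by likelihood) are higher
• We can find the best tree using a robot with a simple program:
  – Accept uphill moves
  – Reject downhill moves

[Figure: a robot climbing a hill towards ‘better’ trees, accepting uphill moves and rejecting downhill ones.]

Hill Climbing

[Figure: the robot, frustrated, stuck on a lower peak of the hill.]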
Hill Climbing
• Local maxima are a problem for methods using hill climbing algorithms to find the best tree
• One way to reduce the probability of being stuck in a local maximum is to repeat the analysis from different starting points
• I.e. beam in a number of robots to different starting positions
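The robot analogy can be sketched on a toy one-dimensional landscape. This is an illustration only: real tree searches move between topologies through branch rearrangements, whereas here a position is simply an index into a hypothetical list of scores.

```python
def hill_climb(scores, start):
    """Move to the better neighbour until no neighbour is higher."""
    position = start
    while True:
        neighbours = [i for i in (position - 1, position + 1) if 0 <= i < len(scores)]
        best = max(neighbours, key=lambda i: scores[i])
        if scores[best] <= scores[position]:
            return position  # no uphill move left: a (possibly local) maximum
        position = best      # accept the uphill move


def hill_climb_with_restarts(scores, starts):
    """Run the climber from several starting points and keep the best peak."""
    peaks = [hill_climb(scores, start) for start in starts]
    return max(peaks, key=lambda i: scores[i])


# Hypothetical landscape with a local peak (index 2) and a global peak (index 7)
LANDSCAPE = [1, 3, 5, 4, 2, 6, 9, 12, 8, 3]

if __name__ == "__main__":
    print(hill_climb(LANDSCAPE, 0))                       # one robot
    print(hill_climb_with_restarts(LANDSCAPE, [0, 4, 9])) # several robots
```

A single robot starting at the left edge stops on the lower peak, while the restarts reach the higher one. Restarts reduce the risk of a local maximum but do not remove it.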
Statistical phylogenetic reconstruction

Advantages

• very flexible
• consistent with an explicit model of evolution
• statistically consistent (allows for model comparison)
• parameter estimation (evolutionary rate, transition rate)
• (Bayesian) 1000s of trees → provides measure of uncertainty
• (Bayesian) hypothesis testing, complex models, demography

Limitations
• computer-intensive, complicated statistics & methodology
• (ML) no measure of uncertainty for the single tree obtained
• (Bayesian) not ideal for ‘beginners’
Phylogenetic reconstruction - summary

Method               Data used                  Tree search        Evolutionary model
Distance             Pairwise distance          Simple algorithm   Can be complex
Parsimony            All sites                  Hill climbing      Simple
Maximum likelihood   All sites                  Hill climbing      Can be complex
Bayesian methods     All sites (+ other info)   MCMC               Can be very complex

Maximum likelihood and Bayesian methods provide more reliable trees.
Bootstrapping
How much do you trust the tree that you just built?

Bootstrapping: Assess variability due to sampling and conflicting signals, relying on analyzing resampled datasets.

Original alignment:

        Characters
Taxa    1 2 3 4 5 6 7 8 9
A       A C C T G A T G C
B       A G C T G G T T C
C       A G C A G A T G G
D       T C C T C G T G C
E       T C T T A A T G C

Resampling of characters with replacement:

        Characters
Taxa    2 5 9 2 7 7 2 1 6
A       C G C C T T C A A
B       G G C G T T G A G
C       G G G G T T G A A
D       C C C C T T C T G
E       C A C C T T C T A
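The resampling step can be sketched as follows, using the five-taxon, nine-character matrix from this slide. The function names are illustrative; columns are indexed from 0 in the code, whereas the slide numbers them from 1.

```python
import random

ALIGNMENT = {
    "A": "ACCTGATGC",
    "B": "AGCTGGTTC",
    "C": "AGCAGATGG",
    "D": "TCCTCGTGC",
    "E": "TCTTAATGC",
}


def take_columns(alignment, columns):
    """Build a new alignment from the given column indices (0-based)."""
    return {taxon: "".join(seq[i] for i in columns) for taxon, seq in alignment.items()}


def bootstrap_alignment(alignment, rng):
    """Draw as many columns as the alignment has, with replacement."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return columns, take_columns(alignment, columns)


if __name__ == "__main__":
    # The draw shown on the slide: characters 2 5 9 2 7 7 2 1 6 (1-based)
    slide_draw = [c - 1 for c in (2, 5, 9, 2, 7, 7, 2, 1, 6)]
    print(take_columns(ALIGNMENT, slide_draw))

    # A fresh pseudo-replicate; the seed is arbitrary
    columns, replicate = bootstrap_alignment(ALIGNMENT, random.Random(1))
    print(columns, replicate)
```

Whole columns are resampled, so every taxon keeps the same set of sites within a replicate. A tree is then built from each replicate with the same method used for the original alignment.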
Bootstrapping
Taxon A : ATG-CGA-GTT-TAG-CAG
Taxon B : ATG-CGA-GCT-TAA-CTG
Taxon C : ATA-CTA-GCT-TAG-CTG
Taxon D : ATG-CTA-TCT-TAG-GTG

[Figure: the inferred “true” tree of A, B, C and D, and the trees built from four resampled alignments (s1 to s4). Statistical support is written on the nodes of the inferred tree.]

Node support for trees:

A+B : 4/4 = 100%
C+D : 3/4 = 75%
A+B+C+D : 4/4 = 100%

A tree without measures of statistical support for the nodes (bootstraps or posteriors) is meaningless!
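Node support is the share of replicate trees that contain a given grouping. The sketch below represents each tree as a set of clades and reproduces the 4/4 and 3/4 counts from the slide; the topology of the one disagreeing replicate is not recoverable from the slide, so the one used here is a hypothetical stand-in.

```python
def clade_support(reference_clades, replicate_trees):
    """Percentage of replicate trees that contain each reference clade."""
    support = {}
    for clade in reference_clades:
        hits = sum(clade in tree for tree in replicate_trees)
        support[clade] = 100.0 * hits / len(replicate_trees)
    return support


AB = frozenset("AB")
CD = frozenset("CD")
ABC = frozenset("ABC")
ABCD = frozenset("ABCD")

# Clades of the inferred tree
REFERENCE = [AB, CD, ABCD]

# Clades found in the trees built from resampled alignments s1 to s4.
# Three replicates agree with the inferred tree; the fourth (hypothetical
# topology) groups C with A+B instead of with D.
REPLICATES = [
    {AB, CD, ABCD},
    {AB, CD, ABCD},
    {AB, CD, ABCD},
    {AB, ABC, ABCD},
]

if __name__ == "__main__":
    for clade, percent in clade_support(REFERENCE, REPLICATES).items():
        print("+".join(sorted(clade)), percent)
```

In practice hundreds or thousands of replicates are used, and the percentages are written on the corresponding nodes of the inferred tree.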
Questions?
Phylogenetic analysis of full-length genomes of 2019-nCoV and
representative viruses of the genus Betacoronavirus
Vibrio cholerae genomic epidemiology in Yemen
Case studies in molecular phylogeny
Study design
Always frame a hypothesis/research question first.

Which organisms/disease/serotypes are you interested in?

Sampling:
• How do you access the samples? Ethical approval?

• Is it the right sample set to answer your questions?

• Is the sampling representative?

• Is the sample size sufficient?

• When/where/how are the samples collected?

• Any other data associated with the samples?

The study’s results/interpretation are only as good as its sampling allows


Investigation
The global spread of fluoroquinolone resistant Shigella sonnei
The genus Shigella

• Top 4 bacterial pathogens for pediatric diarrheal disease.

• An Enterobacteriaceae with 4 species: dysenteriae, boydii, flexneri and sonnei.

• No licensed vaccine. The first-line recommended treatment is an antimicrobial: fluoroquinolone (FQ) or 3rd generation cephalosporin (WHO 2005).

• Both S. sonnei and S. flexneri are increasingly multi-drug resistant.


Shigella sonnei evolution

S. sonnei arose in Europe in the late 17th century, with lineage III the most widespread and drug resistant.

Feil E., 2012

S. sonnei genomic phylogeny


Holt et al., 2012, Nat. Gen.
Shigella sonnei in Vietnam
[Figure: phylogeny of S. sonnei in Vietnam, annotated with resistance to nalidixic acid and resistance to ceftriaxone. Holt et al., 2013]

S. sonnei (Global 3) was introduced into Vietnam in the 1980s and underwent a clonal expansion through fixation of a colicin plasmid, gyrA mutations and an ESBL plasmid.
How about Shigella in other Asian countries?
Fluoroquinolone resistant S. sonnei in Bhutan
Background: Diarrheal surveillance in Thimphu, Bhutan (2011-2013) reported the presence of FQ resistant S. sonnei (Ruekit et al., 2014)

Objective: Using genomics to understand the population of Bhutanese S. sonnei

Samples: 71 S. sonnei (Bhutan) + global collection


Phylogeny of S. sonnei in Bhutan

S. sonnei in Bhutan belongs to the Central Asia expansion of lineage III.


Phylogeny of S. sonnei in Bhutan

[Figure: phylogeny of S. sonnei annotated with resistance determinants, including gyrA and parC mutations, and with country of isolation: Bhutan, Pakistan, Sri Lanka, Cambodia, Thailand, South Korea, Morocco, Egypt, Senegal, Madagascar, France, Brazil. Scale bar: 0.014 substitutions/site.]
Chung The et al., 2015, MGen

• FQ resistance is conferred by triple mutations in gyrA and parC.


A global problem?

FQ resistant S. sonnei around the globe share the same PFGE pattern.

De Lappe et al., 2015 - Ireland

Ruekit et al., 2014 - Bhutan

Gaudreau et al., 2011 - Canada

Nandy et al., 2011 - India

→ Hypothesized that global FQ resistant S. sonnei is clonal


Global emergence of FQ resistant S. sonnei

Aims: To unravel the nature of the recent global surge in FQ resistant S. sonnei, through the use of genomics and molecular microbiology

Sample: Global representative collection of 70 FQ resistant S. sonnei


Country     Ciprofloxacin resistant isolates (N)   Patient group                                  Region of recent travel history (N)
Bhutan      12                                     Hospitalised children <5 years old             NA
Vietnam     11                                     Hospitalised children <5 years old             NA
Thailand    1                                      Hospitalised children <5 years old             NA
Cambodia    1                                      Hospitalised children <5 years old             NA
Ireland     16                                     Primarily patients with recent travel history  India (9), Germany (1), Morocco (1), No travel (5), Unknown (4)
Australia   19                                     Patients with recent travel history            India (15), Cambodia (3), Thailand (1), Southeast Asia (1)
USA         10                                     Unknown                                        Unknown


Evolution of FQ resistant S. sonnei

• All FQR S. sonnei belong to the Central Asia III clade

• FQR is due to triple mutations (gyrA, parC)
→ A global clonal emergence of FQR S. sonnei

Chung The et al., 2016, PLoS Med.


Global collection: 395 CenAsiaIII S. sonnei
Global phylogeny of FQ resistant S. sonnei

Chung The et al., 2019, Nat Comms.


• FQR S. sonnei arose in early 2007, most likely originating from South Asia.

• Evidence of clonal expansion and establishment in Southeast Asia and likely Europe.
Your turn now
Ancient tuberculosis in the Americas
Tuberculosis in Peruvian ancient skeletons
Tuberculosis in America today is dominated by the European-derived Mycobacterium tuberculosis lineages

→ Tuberculosis was introduced to America by European settlers.

How about pre-Columbian TB?

Three Peruvian skeletons (1028 – 1280 AD), with signs of active tuberculosis, preserved a sufficient amount of tuberculosis DNA

→ Bacterial DNA was sequenced and compared to modern isolates.
What are these ancient TB?
Singapore Zika study
Dengue and Zika virus

Flaviviridae family of viruses


Timeline of Zika pandemic

Petersen et al 2016, NEJM

• As of 6 April 2016, Zika virus transmission was documented in a total of 62 countries and territories.

• Caused ~711,381 infections with 18 deaths.

• There is no specific treatment or vaccine for Zika virus


An introduction?
Molecular epidemiology of global ZIKV

• Retrospective screening from 2014 to 2016: No confirmed Zika cases until Aug 2016.

• The Singapore outbreak is not linked to the South America epidemic.

• Clinical and mosquito ZIKV clustered in clade A

• The Singapore outbreak is caused by ZIKV clade A, which existed in May before the initial detection in August 2016.
Salmonella monophasic Typhimurium ST34 circulation in Vietnam
Salmonella enterica subsp. enterica

• Enterobacteriaceae. Extremely diverse (>2,500 serovars)

• Ancient human and animal pathogens, including host-generalist and human-adapted.

• Major burden of disease:
  – Typhoid (S. Typhi, S. Paratyphi)
  – Blood stream infections (S. Typhimurium, S. Enteritidis, etc.)
  – Gastroenteritis (multiple serovars)

• Estimated ~94 million gastroenteritis cases caused by nontyphoidal Salmonella (NTS)
Salmonella outbreak

• Primarily foodborne, with various sources

• Food production, distribution and consumption at global level
→ Potential for multi-country outbreaks.

• Sickened >450 patients in 17 countries


• Recall of ~3,000 tonnes of chocolate worldwide

• Outbreak caused by two strains of Salmonella monophasic Typhimurium ST34


• Linked to contamination in a dairy butter tank in Belgium.
Clonal expansions in Vietnam

• All clonal expansions emerged in Vietnam from 2003 to 2009

• SEA isolates carry more clinically important AMR genes (blaCTX-M-55, qnrS1, mcr3.1, mphA).

clone VBSI VN1 VN2 VN3 VN4


tMRCA 2003.16 2006.07 2007.39 2007.73 2008.64
AMR qnrS1 blaCTX-M-55, qnrS1, qnrS1, mphA qnrS1, mcr3.1,
mcr3.1 mphA
plasmid IncHI2 IncA/C IncHI2 IncA/C
