Does RNA avoidance dictate protein expression level?
Paul Gardner
Department of Biochemistry
University of Otago
Dunedin
New Zealand
The hard work of Sinan Umu, Ant Poole, Ren Dobson &
recently Chun Shen Lim
“Mycoplasma mycoides” watercolour by Prof. David S. Goodsell
mRNA levels are imperfectly correlated with protein levels
Lu et al. (2007) Nature biotechnology.
Determinants of protein concentration
Protein concentration depends on mRNA concentration, translation and
degradation rates
[Figure: DNA [D] → RNA [R] → Protein [P], with rate constants k_transcription and k_translation for synthesis, and k_mRNA degradation and k_protein degradation for decay]
de Sousa Abreu, Penalva, Marcotte & Vogel (2009) Global signatures of protein and mRNA expression levels. Molecular
BioSystems.
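The dependence of protein concentration on the four rate constants can be made concrete with a minimal sketch. It assumes simple first-order kinetics, d[R]/dt = k_transcription·[D] − k_mRNA_degradation·[R] and d[P]/dt = k_translation·[R] − k_protein_degradation·[P], which the diagram suggests but the slide does not state. The function name and all numerical values are illustrative only.

```python
def steady_state(d, k_transcription, k_translation, k_mrna_deg, k_protein_deg):
    """Steady-state [R] and [P] under assumed first-order kinetics."""
    # d[R]/dt = 0  ->  [R] = k_transcription * [D] / k_mRNA_degradation
    r = k_transcription * d / k_mrna_deg
    # d[P]/dt = 0  ->  [P] = k_translation * [R] / k_protein_degradation
    p = k_translation * r / k_protein_deg
    return r, p


# Illustrative (made-up) rate constants in arbitrary units.
r, p = steady_state(d=1.0, k_transcription=2.0, k_translation=10.0,
                    k_mrna_deg=0.5, k_protein_deg=0.1)

# Doubling the translation rate leaves [R] unchanged but doubles [P],
# one way that [P] and [R] can be imperfectly correlated across genes.
r2, p2 = steady_state(d=1.0, k_transcription=2.0, k_translation=20.0,
                      k_mrna_deg=0.5, k_protein_deg=0.1)
print(r, p, r2, p2)
```

Under this model the [protein]:[mRNA] ratio equals k_translation / k_protein_degradation, so gene-to-gene variation in translation rate shows up directly in that ratio.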
Two general models describe variation in translation rate
1. Codon usage (Ikemura, 1981)
Figure from: Tuller & Zur (2015) Nucl. Acids Res.
Two general models describe variation in translation rate
2. mRNA structure (Pelletier & Sonenberg, 1987)
Figure from: Tuller & Zur (2015) Nucl. Acids Res.
We think we have a third general model...
http://dx.doi.org/10.7554/eLife.13479
http://dx.doi.org/10.7554/eLife.20686
Non-coding RNAs are abundant
[Figure: log10(mean read depth) for core ncRNA genes and core protein-coding genes]
Lindgreen, Umu, et al. (2014) PLOS Computational Biology.
Core bacterial non-coding RNAs
Figure by Bethany Jose
Bacterial regulatory non-coding RNAs
[Figure: four sRNA regulatory mechanisms acting near the start codon (AUG):
Sequestration of ribosome binding site
Induction of mRNA decay
Anti-antisense mechanism
Selective mRNA stabilisation
Labels: Hfq, sRNA, ribosome, RNase E recruitment; SD = Shine-Dalgarno sequence]
Figure by Bethany Jose
Checking for mRNA:ncRNA interactions
Looking for regulatory interactions which are specific and small in
number, off-targets are non-specific and large in number
Compare 5′ ends of CDS & ncRNAs
Looking for a bump on the left...
[Figure: density of binding energy (kcal/mol), range −15 to 0]
Checking for mRNA:ncRNA interactions
[Figure: density of binding energy (kcal/mol) for native interactions compared with shuffled controls (P = 7.69 × 10^−52)]
Checking negative controls!
[Figure: density of binding energy (kcal/mol) for native interactions compared with each negative control]
Shuffled (P = 7.69 × 10^−52)
Different phylum (P = 0)
Downstream (P = 2.66 × 10^−124)
Rev. complement (P = 6.51 × 10^−57)
Intergenic (P = 6.16 × 10^−93)
Do ubiquitous and abundant RNAs influence translation?
Given that ncRNAs are among the most abundant RNAs in the cell
([ncRNA] >> [mRNA])
AND that RNAs frequently hybridise
THEN maybe stochastic interactions with mRNAs inhibit translation
Corley & Laederach (2016) Bioinformatics: Selecting against accidental RNA interactions. eLife.
How can this hypothesis be tested?
We predict that:
1. There is selection against mRNA:ncRNA interactions
2. That stochastic mRNA:ncRNA interactions influence [protein]:[mRNA]
ratios
For consistency:
focus on 6 ncRNA families & 114 mRNAs/proteins that are highly
conserved & expressed
first 21 nts of CDS
Tested 1,582 bacterial & 118 archaeal genomes
Use a nearest-neighbour model for predicting hybridisation and
unfolding energy
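The native-versus-shuffled comparison can be sketched as follows. This is a toy only: `toy_binding_score` uses made-up pairing weights on ungapped antiparallel alignments and is not a nearest-neighbour energy model, the mononucleotide shuffle is an assumption (the slides do not say which shuffle was used), and the sequences are hypothetical.

```python
import random

# Crude pairing weights (more negative = more stable). Illustrative
# stand-ins only, NOT nearest-neighbour energy parameters.
PAIR_SCORE = {("G", "C"): -3.0, ("C", "G"): -3.0,
              ("A", "U"): -2.0, ("U", "A"): -2.0,
              ("G", "U"): -1.0, ("U", "G"): -1.0}


def toy_binding_score(mrna, ncrna):
    """Best (most negative) contiguous run of pairs over all ungapped
    antiparallel alignments of the two RNAs."""
    rev = ncrna[::-1]  # reverse so the strands pair antiparallel
    best = 0.0
    for offset in range(-len(rev) + 1, len(mrna)):
        run = 0.0
        for i, base in enumerate(rev):
            j = offset + i
            if 0 <= j < len(mrna):
                score = PAIR_SCORE.get((mrna[j], base))
                if score is None:
                    run = 0.0  # a mismatch breaks the duplex
                else:
                    run += score
                    best = min(best, run)
    return best


def shuffled(seq, rng):
    """Mononucleotide shuffle: same composition, random order."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


def native_vs_shuffled(mrna_5prime, ncrnas, n_shuffles=100, seed=0):
    """Scores for the native mRNA and for shuffled versions of it."""
    rng = random.Random(seed)
    native = [toy_binding_score(mrna_5prime, nc) for nc in ncrnas]
    controls = [toy_binding_score(shuffled(mrna_5prime, rng), nc)
                for _ in range(n_shuffles) for nc in ncrnas]
    return native, controls


# Hypothetical inputs: a 21-nt CDS start and two made-up ncRNA fragments.
cds_start = "AUGAAACGCAUUAGCACCACC"
ncrnas = ["GGUGGUGCUAAUGCG", "CCUCCUUAGCAGUUC"]
native, controls = native_vs_shuffled(cds_start, ncrnas)
print(sum(native) / len(native), sum(controls) / len(controls))
```

Selection against interactions would appear as native scores that are, on average, less negative than those of the shuffled controls. A real analysis would compare the two distributions with a statistical test over many genomes.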
Are mRNA:ncRNA interactions selected against?
[Figure: difference in density between native and shuffled interactions against binding energy (kcal/mol), with more stable interactions to the left; inset bar chart of −log10 P per group]
Actinobacteria (n: 163) P = 9.8 × 10^−69
Bacteroidetes (n: 60) P = 8.7 × 10^−148
Chlamydiae (n: 38) P = 1.4 × 10^−193
Cyanobacteria (n: 40) P = 3.8 × 10^−11
Firmicutes (n: 378) P = 0
Proteobacteria (n: 756) P = 0
Spirochaetes (n: 38) P = 1.6 × 10^−98
Archaea (n: 118) P = 4.2 × 10^−177
Background (n: 100)
“Intrinsic” avoidance
[Figure: density of G+C content for ncRNA and mRNA (high and low); scatter plot of −log10 P(Avoidance) against −log10 P(Intrinsic), one point per genome]
Significant signals of intrinsic or extrinsic avoidance found in 97% of
bacteria and archaea
Do mRNA:ncRNA interactions influence protein
expression?
[Figure: log10(fluorescence) against avoidance (kcal/mol), Rs = 0.65]
Expression data from: Kudla et al. (2009) Science.
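The Rs values quoted on these slides are Spearman rank correlations. A minimal pure-Python sketch of that statistic is given below; the function names and the example numbers are made up for illustration and are not the Kudla et al. measurements.

```python
def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the block of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's Rs: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Made-up example: avoidance scores (kcal/mol) and log10(fluorescence).
avoidance = [-300.0, -250.0, -200.0, -150.0]
log_fluorescence = [2.0, 2.1, 3.5, 4.0]
print(spearman(avoidance, log_fluorescence))
```

Because only ranks are used, any monotonic relationship gives |Rs| = 1 whether or not it is linear, which suits comparing an energy scale with log-fluorescence.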
Do mRNA:ncRNA interactions influence protein
expression?
Testing the relationship between protein abundance estimates and
avoidance, mRNA secondary structure, codon usage and mRNA
abundance
GFP datasets:
E. coli (n=52) GFP/qPCR
E. coli (n=154) GFP/Northern
E. coli (n=14,234) mCherry/RNAseq
Mass-Spec datasets:
E. coli (n=389) MS/microarray
E. coli (n=3,301) MS/microarray
P. aeruginosa (n=5,479) MS/microarray
P. aeruginosa (n=1,148) MS/microarray
[Figure: correlation coefficient (−0.2 to 0.6) for each dataset against avoidance, secondary structure, codon usage and [mRNA]; * P < 0.05]
Testing the extremes of expression
[Figure, E. coli genes (n = 389). Panel A: frequency histogram of log10([Protein]/[mRNA]), with the low expression (n=10) and high expression (n=10) extremes marked. Panel B: Z-scores for avoidance, codon usage and secondary structure (Sec.Str.) relative to a null, for the low and high expression groups; two comparisons are starred.]
Designing mRNAs
239 aa GFP can be encoded by 7.62 × 10^111 synonymous mRNAs
Extremes of avoidance have a stronger effect than codon usage or
secondary structure
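The size of the synonymous design space follows from multiplying, over the protein, the number of codons available for each residue. The sketch below uses the standard genetic code; the peptides are toy examples, and whether the figure quoted on the slide also counts the stop codon or start-codon constraints is not stated, so this is not claimed to reproduce it.

```python
from math import prod

# Number of codons per amino acid in the standard genetic code
# (61 sense codons in total).
DEGENERACY = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}


def count_synonymous_mrnas(protein):
    """Number of distinct coding sequences for a protein (stop codon
    excluded, which is an assumption of this sketch)."""
    return prod(DEGENERACY[aa] for aa in protein)


# Toy peptides, not GFP.
print(count_synonymous_mrnas("MKL"))   # 1 * 2 * 6
print(count_synonymous_mrnas("MW"))    # both residues have one codon
```

The count grows exponentially with protein length, which is why exhaustive testing of synonymous variants is impossible and design has to rely on predictive scores.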
[Figure: three scatter plots of log10(fluorescence): against CAI (Rs = 0.29), against folding energy in kcal/mol (Rs = 0.34) and against binding energy in kcal/mol (Rs = 0.56). Legend: high and low extremes for Avoid, Fold and Codon, plus Optimal.]
Avoidance in 3D on the ribosome
Protein binds to regions with low avoidance (green) while exposed
regions are high avoidance (blue): P = 9.3 × 10^−15, Fisher’s exact test
Further Questions
Further work:
Write a nice software package for mRNA design! –funded by MBIE
Do mRNA:ncRNA interactions influence eukaryotic gene expression?
Number of possible interactions increases quadratically with number of
genes. May require spatial & temporal separation of genes
Does avoidance drive compartmentalisation and increases in nucleotide
binding proteins for larger genomes?
Check organelles & endosymbionts
Do mRNA:ncRNA interactions influence viral infection, hybridisation,
HGT & transformation expts?
Are protein, DNA and protein:nucleotide interactions also avoided?
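The quadratic growth mentioned in the questions above can be checked with a one-line count. It assumes that every unordered pair of distinct RNAs is a potential interaction, which is a simplification introduced here, and the gene counts are arbitrary.

```python
def possible_pairwise_interactions(n_rnas):
    """Unordered pairs of distinct RNAs: n * (n - 1) / 2."""
    return n_rnas * (n_rnas - 1) // 2


# Arbitrary gene counts: a tenfold increase in genes gives roughly a
# hundredfold increase in candidate pairs.
for n in (1_000, 10_000):
    print(n, possible_pairwise_interactions(n))
```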
My other project: What and who?
A new approach for estimating functionally significant genetic variation
Application to genome associations with bacterial invasiveness
How can I analyse my genome variation?
Genome-wide association studies (GWAS) with
SNPs
Gene gain/losses
Tajima’s D ≈ difference between an observed and expected measure of genetic diversity
Compare non-synonymous and synonymous mutation rates: dN/dS
Should every variant/SNP really be treated equally?
Is there an approach that works for any timescale and any genome variant?
Profile HMMs: a powerful homology search tool
Image provided by Sean Eddy.
Refactoring profile HMMs
[Figure: columns of an aligned protein family and the corresponding profile HMM sequence logo]
bitscore = log2( P(seq|model) / P(seq|null) )
A new scoring sequence for genome variation
[Figure: orthologous sequences from Species 1 and Species 2 scored against the same profile HMM]
∆bitscore = bitscore_seq1 − bitscore_seq2 = log2( P(seq1|model) / P(seq2|model) )
Databases of HMMs: Pfam, eggNOG, SMART, InterPro, ...
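The bitscore and ∆bitscore definitions can be illustrated with a deliberately simplified profile. The sketch below treats the model as independent position-specific residue frequencies with a pseudocount and uses a uniform null. A real profile HMM (for example one from Pfam) also has insert and delete states and a background-frequency null, so the toy alignment, the function names and the numbers here are all assumptions for illustration.

```python
from math import log2

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
NULL_P = 1.0 / len(AMINO_ACIDS)  # uniform null model (assumption)


def column_probs(column, pseudocount=1.0):
    """Residue probabilities for one alignment column."""
    total = len(column) + pseudocount * len(AMINO_ACIDS)
    return {aa: (column.count(aa) + pseudocount) / total
            for aa in AMINO_ACIDS}


def build_profile(alignment, pseudocount=1.0):
    """One probability table per column of an ungapped alignment."""
    return [column_probs(col, pseudocount) for col in zip(*alignment)]


def bitscore(seq, profile):
    """log2 of P(seq|model) / P(seq|null), summed over positions."""
    return sum(log2(col[aa] / NULL_P) for aa, col in zip(seq, profile))


def delta_bitscore(seq1, seq2, profile):
    """bitscore(seq1) - bitscore(seq2); the null model cancels."""
    return bitscore(seq1, profile) - bitscore(seq2, profile)


# Hypothetical ungapped family alignment and two orthologues.
family = ["FIRIQLPE", "FVRLQLAE", "YIKIQLPD", "FIRIQLPE"]
profile = build_profile(family)
species1 = "FIRIQLPE"
species2 = "FIRIQLWE"  # P -> W at a position where the family has P or A
print(delta_bitscore(species1, species2, profile))
```

A positive ∆bitscore here means the Species 2 sequence fits the family model less well, which is the sense in which a substitution at a conserved position counts for more than one at a variable position.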
A new scoring sequence for genome variation
Wheeler, Barquist, Kingsley, Gardner (2016) A profile-based method for identifying functional divergence of orthologous genes in
bacterial genomes. Bioinformatics.
Bacterial pathogen evolution
Gene gain and adaptive mutation help to invade a new niche
Subsequent adaptation to the niche may allow gene loss or inactivation
Merhej, Georgiades, Raoult (2013) Postgenomic analysis of bacterial pathogens repertoire reveals genome reduction rather than
virulence factors. Brief. Funct. Genomics.
Random Forests
[Figure: training genome set → variation scoring with DeltaBS → random forest of decision trees]
DeltaBS is the predicted impact of mutation; it incorporates the predicted impacts of all mutations in a gene
Each node identifies the most informative gene, and selects an optimal gene-wide DeltaBS value to split the data
Wheeler, Gardner, Barquist (2018) Machine learning identifies signatures of host adaptation in the bacterial pathogen Salmonella
enterica. To appear in PLoS Genetics.
Salmonella enterica
Feasey et al. (2012) The Lancet
Nuccio & Bäumler (2014) Comparative analysis of Salmonella genomes
identifies a metabolic network for escalating growth in the inflamed gut. MBio
Reproducible signatures of gene degradation/diversifying
selection
Out-of-bag votes: lineages are only voted on using decision trees that
weren’t trained on them
Each model iteration uses a smaller subset of genes that are most
informative of phenotype
Wheeler, Gardner, Barquist (2017) Machine learning identifies signatures of host adaptation in the bacterial pathogen Salmonella
enterica. To appear in PLoS Genetics. https://doi.org/10.1101/204669
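Two ideas from these slides, choosing the gene and DeltaBS threshold that best split the data and voting on each lineage only with trees that did not train on it, can be sketched with bagged one-split trees. This is a stand-in and not the published pipeline: a real random forest grows deeper trees and samples a subset of genes at each node, and the DeltaBS values below are invented.

```python
import random


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(rows, labels):
    """(impurity, gene, threshold) with the lowest weighted impurity."""
    best = None
    for gene in range(len(rows[0])):
        values = sorted({row[gene] for row in rows})
        for low, high in zip(values, values[1:]):
            t = (low + high) / 2
            left = [y for row, y in zip(rows, labels) if row[gene] <= t]
            right = [y for row, y in zip(rows, labels) if row[gene] > t]
            score = (len(left) * gini(left)
                     + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, gene, t)
    return best  # None when no gene varies in the sample


def fit_stump(rows, labels):
    """A one-split tree that predicts the majority label on each side."""
    split = best_split(rows, labels)
    if split is None:
        majority = int(2 * sum(labels) >= len(labels))
        return lambda row: majority
    _, gene, t = split
    left = [y for row, y in zip(rows, labels) if row[gene] <= t]
    right = [y for row, y in zip(rows, labels) if row[gene] > t]
    left_vote = int(2 * sum(left) >= len(left))
    right_vote = int(2 * sum(right) >= len(right))
    return lambda row: left_vote if row[gene] <= t else right_vote


def oob_votes(rows, labels, n_trees=200, seed=0):
    """Per-sample [votes for 0, votes for 1], counting only trees whose
    bootstrap sample did not contain that sample (out-of-bag)."""
    rng = random.Random(seed)
    n = len(rows)
    votes = [[0, 0] for _ in range(n)]
    for _ in range(n_trees):
        sample = [rng.randrange(n) for _ in range(n)]
        in_bag = set(sample)
        tree = fit_stump([rows[i] for i in sample],
                         [labels[i] for i in sample])
        for i in range(n):
            if i not in in_bag:
                votes[i][tree(rows[i])] += 1
    return votes


# Invented DeltaBS values for 3 genes in 8 lineages; gene 0 is strongly
# negative in the lineages labelled 0, and the other genes are noise.
rows = [[-5.0, 0.2, 1.0], [-4.0, -0.1, 0.3], [-6.0, 0.4, -0.2],
        [-5.5, 0.0, 0.8], [0.1, 0.3, 0.9], [0.0, -0.2, 0.1],
        [-0.2, 0.1, -0.3], [0.3, 0.5, 0.6]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
votes = oob_votes(rows, labels)
print(votes)
```

Because each lineage is scored only by trees that never saw it, the vote fractions act as a built-in cross-validation estimate rather than a measure of fit to the training data.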
Which genes are most informative of pathogenicity?
Wheeler, Gardner, Barquist (2017) Machine learning identifies signatures of host adaptation in the bacterial pathogen Salmonella
enterica. To appear in PLoS Genetics. https://doi.org/10.1101/204669
Global and African isolates
[Figure. Panels A and B: isolates grouped by clade (Global epidemic, Central/Eastern African, West Africa, Outlier, Other) with their invasiveness ranking (low to high), and the proportion of isolates carrying HDCs in African Enteritidis and Global Enteritidis for the genes tcuR, ybiU, rnfC, STM4519, STM1499, pocR, STM2529, priA, ydeE, phrB, STM1630, STM2245, STM2532, nrdF, rcnA, pqaA, STM1940, acrB, pepT, STM0019, pps, pgtA, yjeF, rna, pgl, mglA, STM3125, bcsA, slsA, STM0018, bcfC, STM0042, citF2, carB and caiC. Panel C: invasiveness index (0.20 to 0.28) by region, Africa and Global.]
Wheeler, Gardner, Barquist (2018) Machine learning identifies signatures of host adaptation in the bacterial pathogen Salmonella
enterica. To appear in PLoS Genetics.
Within-patient evolution
Chronic infection of an
immunocompromised patient
Constantly recolonised with a
hypermutator S. Enteritidis.
Klemm et al (2016) Emergence of host-adapted Salmonella Enteritidis through rapid evolution in an immunocompromised host.
Nat. Microbiol.
Wheeler, Gardner, Barquist (2017) Machine learning identifies signatures of host adaptation in the bacterial pathogen Salmonella
enterica. To appear in PLoS Genetics. https://doi.org/10.1101/204669
Summary
We have developed a simple & scalable approach for determining the
significance of variation
Working well in “the field” (applications to Salmonella, Campylobacter,
PSA, Ratites, ...)
Adapted to evaluate ncRNAs and conserved DNA elements, and
include more phylogeny (ASR, phylo-regression?)
Wheeler, Barquist, Kingsley & Gardner (2016) A profile-based method for identifying functional divergence of orthologous genes in
bacterial genomes. Bioinformatics.
Wheeler, Gardner, Barquist (2017) Machine learning identifies signatures of host adaptation in the bacterial pathogen Salmonella
enterica. To appear in PLoS Genetics. https://doi.org/10.1101/204669
Sackton et al. (2018) Convergent regulatory evolution and the origin of flightlessness in palaeognathous birds
https://doi.org/10.1101/262584
Thanks!
Avoidance: Sinan Umu, Anthony Poole & Renwick Dobson
Nicole Wheeler, Lars Barquist, Alexandra Gavryushkina
