Lecture2022 - 3 /!
Spring 2022
Part 2:
Allan Haldane
Ron Levy Group
Protein Sequencing History
1951 – Sanger obtains first protein amino-acid sequence (Insulin)
1953 – Watson/Crick/Wilkins/Franklin discovery of DNA structure & function
How much of protein sequence space has been explored by life on Earth? Dryden et al. J Royal Society Interface (2008)
Twilight zone of protein sequence alignments. Rost, Protein Eng Des Sel (1999)
Protein Sequence Search and Alignment
Why?
●
Homology Modelling (for computational chemistry,
Crystallography/Cryo-EM, etc).
●
Discover functionality of new sequences (new genes or proteins,
orthologs and paralogs, newly sequenced organisms)
●
Research into protein structure, function, evolution
●
Covariation Analysis (Active Research - AlphaFold)
How?
●
Identify sequence patterns which are statistically similar to your sequence
●
The databases contain huge
numbers of sequences, most of
which do not match your
sequence
●
Homologous Sequences may be
as low as 10% identical
(need statistical model)
●
Homologous sequence will often
have inserts, deletions, additional
domains, etc (need alignment)
Scoring Similarity
●
Matrices: PAM (Point Accepted Mutation) / BLOSUM (BLOcks SUbstitution Matrix)
●
Given an alignment, gives a score for each residue pair
●
PAM matrices are parameterized based on observed
mutations between closely-related sequences (good
for close homologs)
●
BLOSUM matrices are parameterized on large
alignments of distantly related homologs (pair
frequencies)
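To illustrate how pair frequencies can be turned into scores, here is a minimal sketch that computes log-odds scores s(a,b) = log2(q_ab / e_ab) from ungapped alignment columns, where q_ab is the observed pair frequency and e_ab the frequency expected by chance. The toy columns and the function name are invented for illustration. Real BLOSUM construction also clusters similar sequences and rounds to scaled integers, which this sketch skips.

```python
import math
from collections import Counter
from itertools import combinations

def log_odds_scores(columns):
    """Toy BLOSUM-style log-odds scores from ungapped alignment columns."""
    pair_counts = Counter()
    for col in columns:
        # Count every unordered residue pair seen within a column.
        for a, b in combinations(col, 2):
            pair_counts[tuple(sorted((a, b)))] += 1
    total_pairs = sum(pair_counts.values())
    q = {pair: n / total_pairs for pair, n in pair_counts.items()}

    # Background frequency of each residue implied by the pair frequencies.
    p = Counter()
    for (a, b), freq in q.items():
        p[a] += freq / 2
        p[b] += freq / 2

    scores = {}
    for (a, b), freq in q.items():
        # Expected frequency of the unordered pair if residues were independent.
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = math.log2(freq / expected)
    return scores

# Hypothetical three-column alignment of four sequences.
toy_scores = log_odds_scores(["AAAA", "AAAV", "VVVV"])
print(toy_scores)
```

Pairs seen more often than chance get positive scores and pairs seen less often get negative scores, which is the behaviour the alignment algorithms below rely on.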
Sequence Alignment
●
We need to align sequences before we can compute sequence identity
Illustration using “dot matrix” method for manual alignment: Put both sequences on
X and Y axes, and put a dot anywhere they are the same.
Note: The greater number of states in protein sequences (20 vs 4) makes alignment easier
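The dot-matrix idea can be sketched in a few lines. This is only an illustration: the function names are made up, and the two short nucleotide strings are arbitrary examples.

```python
def dot_matrix(seq_x, seq_y):
    """Return a grid with True wherever seq_x[i] == seq_y[j] (rows follow seq_y)."""
    return [[a == b for a in seq_x] for b in seq_y]

def render(seq_x, seq_y):
    """Draw the dot matrix as text: '*' for a match, '.' otherwise."""
    rows = ["  " + seq_x]
    for b, row in zip(seq_y, dot_matrix(seq_x, seq_y)):
        rows.append(b + " " + "".join("*" if hit else "." for hit in row))
    return "\n".join(rows)

print(render("ATCGAA", "CATAC"))
```

Aligned regions show up as diagonal runs of dots. With only 4 letters many dots are chance matches, which is why the 20-letter protein alphabet gives a cleaner picture.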
-AT-CGAA
CATAC---
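Once two sequences are aligned, identity is a simple count. The sketch below uses the alignment shown above; the function name is hypothetical, and note that the denominator is a convention (full alignment length vs. columns where both rows have a residue), so both are returned.

```python
def identity_counts(row_a, row_b, gap="-"):
    """Count identical columns, residue-residue columns, and alignment length."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    matches = sum(a == b and a != gap for a, b in zip(row_a, row_b))
    aligned = sum(a != gap and b != gap for a, b in zip(row_a, row_b))
    return matches, aligned, len(row_a)

matches, aligned, length = identity_counts("-AT-CGAA", "CATAC---")
print(matches / length)   # identity over the full alignment length
print(matches / aligned)  # identity over columns where both rows have a residue
```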
FASTP/FASTA (1985)
FASTP = FAST-Proteins, FASTA = FAST-All (protein+nucleotide)
●
Problem: Smith-Waterman search over large databases is
too slow, not feasible (in 1985, anyway)
●
Solution: Use a heuristic strategy to identify potential
matches. Later can do a Smith-Waterman alignment to get
a more accurate score.
●
speed-sensitivity trade-off
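For reference, here is a minimal sketch of the Smith-Waterman local alignment score. The match/mismatch/gap values are arbitrary placeholders (a real search would use a PAM or BLOSUM matrix and affine gap costs). The nested loop makes the cost per sequence pair proportional to the product of the two lengths, which is what makes a full database scan expensive.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score with a linear gap penalty (score only, no traceback)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                   # start a new local alignment here
                prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                prev[j] + gap,       # gap in b
                curr[j - 1] + gap,   # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best

print(smith_waterman_score("ACGT", "ACGT"))
```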
Heuristic Strategy:
A) First compare sequences using “words” (k-mers).
SRREADIAGRMGGDEFLLVLP
SRR
RRE
REA
EAD...
D) Do optimized/constrained Smith-Waterman
alignment (between dotted lines)
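Step A can be sketched as follows: split the query into overlapping words, index them, and count how many shared words fall on each diagonal of the dot matrix. This is an illustration of the idea only, not the FASTA implementation, and the function names are made up.

```python
from collections import defaultdict

def kmers(seq, k=3):
    """All overlapping words of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def shared_word_diagonals(query, target, k=3):
    """Count exact word matches per diagonal (query offset minus target offset)."""
    index = defaultdict(list)
    for i, word in enumerate(kmers(query, k)):
        index[word].append(i)
    diagonals = defaultdict(int)
    for j, word in enumerate(kmers(target, k)):
        for i in index.get(word, []):
            diagonals[i - j] += 1
    return dict(diagonals)

print(kmers("SRREADIAGRMGGDEFLLVLP")[:4])
print(shared_word_diagonals("SRREAD", "XXSRREAD"))
```

A diagonal with many word hits marks a candidate region, and only those regions need the more expensive alignment step.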
BLAST (1990)
●
inexact “word” matching
●
faster implementation
●
only care about “Maximal Segment Pair”
●
More complete analysis of statistical significance: the
“tractability to mathematical analysis is a crucial
feature of the BLAST algorithm”
Sources and additional information
●
Gapped extensions are computationally demanding.
●
Original BLAST only used gapless extension
●
E-Value: measures the expected number of “random” sequences with
a score S or better in a random sequence database of equal size.
●
Lower E-value = less likely this hit was a false-positive
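For ungapped local alignments, the Karlin-Altschul statistics behind BLAST give an expected count of the form E = K m n exp(-lambda S), with m the query length and n the database length. A minimal sketch, where the values of lambda and K are illustrative placeholders only (the real values depend on the scoring matrix and gap costs):

```python
import math

def expected_hits(score, query_len, db_len, lam=0.318, K=0.134):
    """Expected number of chance hits with at least this score (lam, K are placeholders)."""
    return K * query_len * db_len * math.exp(-lam * score)

for s in (40, 60, 80):
    print(s, expected_hits(s, query_len=300, db_len=1e9))
```

The E-value falls exponentially with the score and grows linearly with database size, so the same score is less significant in a larger database.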
BLAST updates
●
Gapped BLAST
– allows gaps.
– Fast implementation
– Now often used in preference to original BLAST
●
PSI-BLAST (Position-Specific Iterative)
– Instead of BLOSUM matrix, infers position-specific amino-
acid frequencies through iterative DB searches.
BLAST Website: https://blast.ncbi.nlm.nih.gov/Blast.cgi
query: FASTA format or accession
database
Entrez query
algorithm
parameters
Optional BLASTP search parameters
max sequences
short queries
expect threshold
word size
max matches
scoring matrix
gap costs
compositional adjustment
filter
mask
formatting options
Expect value
BLOSUM62 matrix
Threshold value T
Size of database
Graphic summary of the results shows the alignment scores (coded
by color) and the length of the alignment (given by the length of the
horizontal bars)
BLASTP output
HMMER (1995)
●
Sequence search/alignment tool based on Hidden-Markov Models
●
Models position-specific amino-acid frequencies
●
Mathematical Model more easily accounts for gaps/indels
●
Claims to find more distant homologs than BLAST
References:
●
Biological sequence analysis (Durbin/Eddy/Krogh/Mitchison)
●
HMMER User guide
●
Fundamentals of Molecular Evolution (Graur/Li)
●
The Neutral Theory of Molecular Evolution (Kimura)
Review: HMMs (Hidden Markov Models)
●
System moves through a series of “hidden” states with specified
transition probabilities
●
System “emits” observables, with different emission
probabilities in each hidden state.
Hidden states
Observables
Important Algorithms:
●
Forward algorithm: Compute probability of a sequence of observables,
summing over all possible paths through hidden states
●
Viterbi Algorithm: Compute the most likely path through hidden states, given a
sequence of observables
●
Baum–Welch algorithm: Given a set of observed sequences and a topology,
estimate emission and transition probabilities (build an HMM).
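The first two algorithms can be sketched compactly for a toy two-state HMM. Everything in the model here is an invented assumption: states "H" and "L", observables "h" (hydrophobic) and "p" (polar), and all probabilities. Plain probabilities are used for readability; a real implementation would work in log space or rescale to avoid underflow.

```python
STATES = ("H", "L")
START = {"H": 0.5, "L": 0.5}
TRANS = {"H": {"H": 0.9, "L": 0.1}, "L": {"H": 0.1, "L": 0.9}}
EMIT = {"H": {"h": 0.8, "p": 0.2}, "L": {"h": 0.3, "p": 0.7}}

def forward(obs, states, start, trans, emit):
    """Probability of the observations, summed over all hidden paths."""
    f = {s: start[s] * emit[s][obs[0]] for s in states}
    for x in obs[1:]:
        f = {s: emit[s][x] * sum(f[r] * trans[r][s] for r in states) for s in states}
    return sum(f.values())

def viterbi(obs, states, start, trans, emit):
    """Most likely hidden path and its joint probability with the observations."""
    v = {s: start[s] * emit[s][obs[0]] for s in states}
    back = []
    for x in obs[1:]:
        ptr, nxt = {}, {}
        for s in states:
            best = max(states, key=lambda r: v[r] * trans[r][s])
            ptr[s] = best
            nxt[s] = v[best] * trans[best][s] * emit[s][x]
        back.append(ptr)
        v = nxt
    last = max(states, key=lambda s: v[s])
    path = [last]
    for ptr in reversed(back):  # follow back-pointers from the end
        path.append(ptr[path[-1]])
    return path[::-1], v[last]

OBS = "hhhhpppp"
best_path, best_prob = viterbi(OBS, STATES, START, TRANS, EMIT)
print("".join(best_path), best_prob, forward(OBS, STATES, START, TRANS, EMIT))
```

The forward probability is always at least as large as the Viterbi probability, because it sums over every path rather than keeping only the best one.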
There are many possible choices of “topology” for HMMs,
depending on the application. We will focus on two types:
[Figure: example HMM diagram with state labels C, S, M, X, annotated with transition probabilities and emission probabilities]
Hidden Markov Models for Prediction of Protein Features. Christopher Bystroff and Anders Krogh. Methods in Molecular Biology, vol. 413
Simple Transmembrane Helix HMM
Want to model which parts of a protein sequence are transmembrane helices.
Simple Two-State Model
[Figure labels: mismatches, posterior probability]
TMHMM
“Real” high performance HMMs can be much more complicated.
Similar Performance
to PSIPRED
Insights into
secondary structure
patterns?
●
Topology of a profile HMM
●
Building the HMM
●
Using the HMM to align sequences
●
Using the HMM to search for sequences in a database
●
Tool: HMMER
●
Practical challenges to building a profile HMM given an MSA
HMMER vs BLAST
HMMER’s claim:
– As fast as BLAST
Useful for compiling MSAs of remote homologs, as used in the next lecture!
Profile HMMs (HMMER)
Review of Topology:
Aligned:
Example of a profile HMM with all values illustrated:
Weight of line = transition probability
Unaligned (insert)
What you can do with profile HMMs:
●
Null model (random aa): mean residue frequencies in Swiss-Prot 50.8
Scoring Sequences: Log Odds Ratio
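$$ \text{odds ratio} = \frac{P(\text{sequence} \mid \text{HMM})}{P(\text{sequence} \mid \text{null})}, \qquad \text{bit score} = \log_2(\text{odds ratio}) $$
Numerator: the probability of the sequence under the HMM, averaging over all paths (forward algorithm).
Denominator: the probability of the sequence in the null model (random amino acids).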
●
Standard way of testing model discriminatory power
●
Odds-ratio values >> 1 (log-odds >> 0) imply the sequence is more consistent with the HMM than with the null model
●
We take the log to avoid numerical difficulties (numbers can be very small)
●
When the log is base 2, the value has units called “bits”.
●
Log-odds ratio is also called the “sequence bit score”
●
Can be efficiently computed for HMMs using a modified Forward algorithm
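A minimal sketch of the bit score for the simplest possible case: a profile with match states only (no inserts or deletes), so there is a single path and the forward sum reduces to a product. The two-letter alphabet, the profile, and the background frequencies are all invented for illustration.

```python
import math

def bit_score(seq, match_emissions, background):
    """log2 [ P(seq | profile) / P(seq | null) ] for a gapless, match-only profile."""
    return sum(
        math.log2(column[x] / background[x])
        for x, column in zip(seq, match_emissions)
    )

# Hypothetical 2-column profile over a toy alphabet {A, G}.
PROFILE = [{"A": 0.9, "G": 0.1}, {"A": 0.1, "G": 0.9}]
NULL = {"A": 0.5, "G": 0.5}

print(bit_score("AG", PROFILE, NULL))  # positive: fits the profile better than the null
print(bit_score("GA", PROFILE, NULL))  # negative: fits the null better
```

Summing logs per position instead of multiplying probabilities is also what keeps the numbers in a safe numerical range for long sequences.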
Scoring Sequences: E-Values
The “Log odds ratio”
Problem: The log-odds ratio gives us a score for each individual sequence, but if you are
testing many, many non-matching sequences, some will still have high scores by
chance.
●
A log-odds ratio (bit score) for each sequence.
A bit score > 0 (odds ratio > 1) means it's more likely to be generated by the HMM than the null model
●
An E-Value: the expected number of non-matching sequences that reach this score or better by
chance, given the size of the database searched.
E-value < 1 means that on average we don't expect this strong a score by
chance given our dataset size, so it is likely a homolog.
Commonly E < 0.01 is used as a significance cutoff
By setting a threshold on the E-Value, we can filter the database to pick out
the homologous sequences.
Posterior Probabilities
●
Per-site posterior probability annotation.
Probability that emitted observable i came from state k, averaging over
all possible paths.
●
Easy implementation: Very similar to the Viterbi algorithm, except we substitute the
posterior probabilities instead of the log probabilities, plus minor tweaks.
●
Gives “more accurate” alignments than Viterbi. MAC is used by HMMER
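The per-site posterior probabilities themselves come from the forward and backward recursions. Below is a minimal sketch for a toy two-state HMM; the states, observables, and probabilities are invented, and this shows only the posterior annotation, not HMMER's alignment procedure.

```python
STATES_PD = ("H", "L")
START_PD = {"H": 0.5, "L": 0.5}
TRANS_PD = {"H": {"H": 0.9, "L": 0.1}, "L": {"H": 0.1, "L": 0.9}}
EMIT_PD = {"H": {"h": 0.8, "p": 0.2}, "L": {"h": 0.3, "p": 0.7}}

def posterior(obs, states, start, trans, emit):
    """P(state at position i = k | all observations), for every i and k."""
    # Forward pass: fwd[i][s] = P(obs[0..i], state_i = s)
    fwd = [{s: start[s] * emit[s][obs[0]] for s in states}]
    for x in obs[1:]:
        prev = fwd[-1]
        fwd.append({s: emit[s][x] * sum(prev[r] * trans[r][s] for r in states)
                    for s in states})
    # Backward pass: bwd[i][s] = P(obs[i+1..] | state_i = s)
    bwd = [{s: 1.0 for s in states}]
    for x in reversed(obs[1:]):
        nxt = bwd[0]
        bwd.insert(0, {s: sum(trans[s][r] * emit[r][x] * nxt[r] for r in states)
                       for s in states})
    total = sum(fwd[-1].values())
    return [{s: fwd[i][s] * bwd[i][s] / total for s in states}
            for i in range(len(obs))]

post = posterior("hhhhpppp", STATES_PD, START_PD, TRANS_PD, EMIT_PD)
for i, p in enumerate(post):
    print(i, round(p["H"], 3))
```

Positions near the switch between states get intermediate posteriors, which is the kind of per-residue confidence annotation described above.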
What you can do with profile HMMs:
Idea: Given a trial HMM, use forward/backward algorithms to average over all
possible alignments of all sequences in the MSA. Add up the observed
emissions and transitions seen in the data, weighted by these probabilities.
Idea: Given a seed aligned MSA, construct an initial HMM (see above). Use this
HMM to search the database for more homologous sequences, to get a bigger, more
diverse MSA. Construct a new HMM with this MSA. Repeat.
HMMER
●
Free Software for using/building profile HMMs
●
HMMER3 is highly optimized. Fast!
HMMER: Example Run
Example: Searching a sequence database
Example: Scoring and Aligning a sequence
Aligned Sequence
Posterior Probability
Profile HMMs: Practical issues
Template WIGDEVAVKAARHDDEDIS
Match WHGD-VAVKILKVVDPTPE
Template EVAVKAARHDDEDISvykig
Match WHGD-VAVKILKVVDPTPE
Note: HMMER 3+ ONLY does local alignment
[Figure labels: flanking model states, switching points]
2. Choosing Match States
When building a new HMM given a seed MSA, you have to make a decision: Which
columns will count as match states, and which as insert states? (model construction)
●
manual construction:
The user marks alignment columns by hand.
●
heuristic construction:
A rule is used to decide whether a column should be marked. For
instance, a column might be marked when the proportion of gap
symbols in it is below a certain threshold.
●
MAP construction:
A maximum a posteriori choice is determined by dynamic programming.
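The heuristic rule is easy to sketch: mark a column as a match state when its gap fraction is below a threshold. The function name, the toy alignment, and the default threshold of 0.5 are assumptions for illustration.

```python
def mark_match_columns(msa, max_gap_fraction=0.5, gap_chars="-."):
    """Return one boolean per column: True = match state, False = insert column."""
    n_seqs = len(msa)
    marks = []
    for column in zip(*msa):
        gap_fraction = sum(c in gap_chars for c in column) / n_seqs
        marks.append(gap_fraction < max_gap_fraction)
    return marks

toy_msa = ["AC-GT", "AC-GT", "ACTGT", "A--GT"]
print(mark_match_columns(toy_msa))
```

Here the third column is mostly gaps, so the one sequence with a residue there is treated as carrying an insertion relative to the model.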
3. Phylogenetic Weighting
"God has an inordinate fondness for beetles." - J.B.S. Haldane
(no relation)
Eukaryotic
https://evolution.berkeley.edu
If a closely related group of species appears often in the dataset, this introduces a
bias in our estimates of emission probabilities. Similar to overcounting. Will hurt us
for distant homology detection.
QIGSGGSSKVFQVLNEKKQIYAIKYVNLEEADNQTLDEIAY
QIGSAHSSKVFQVLNEKKQIYAIKYVNLEEADNQTLDEIAY
QIGSGGSSKVFQVLLEKTQIYAIKKENLEEADNQTLDEINY
QIGEGGSKKVFQVLNEKKQIYAIKYVNLEEADNQTLDEIAY    (first six rows: beetles with similar sequences)
QIGSGGSSKVFQVLNEKKQIYAIKYVNLEEADNQTLDEIAY
QIGSGGFSKVFQVLNEHKQIYAIKYVNLEEADNQTLDEIAY
ELGRGESGTVYKGVLEDDRHVAVKKLENVRQGKEVFQELSV
RLGAGGFGSVYKATYHGVP-VAIKQVNKCTKNRLASRELNV
KIGAGNSGTVVKALHVPDSKVAKKTIPVEQNNSTIINELSI
ILGEGAGGSVSKCKLKNGSKFALKVINTLNTDPEYQKELQF
HLGEGNGGAVSLVKHRNI--FMARKTVYVGSDSKLQKELGV
KIGQGASGGVYTAYEIGTNVVAIKQMNLEKQPKKELIEILV
VIGRGSYGVvykainkhTDQVAIKEVVYENDEELNDIEISL
LIGSGSFGQVYLGMNASSGEMAVKQVILDSVSESKDREIAL
IIGRGGFGEVYGCRKADTGKYAMKCLDKKRIKMKQGENERI
LLGKGTFGKVILVKEKATGRYAMKILKKEVIVAKDEVENRV
SLGSGSFGTAKLCRHRGSGLFCSKTLRRETIVHEKHKEINI
LLGQGDVGKVYLVRERDTNQFALKVLNKHEMIKRKKIEQEI
TLGIGGFGRVVKAHHQDRVDFALKCLKKRHIVDTKQEERHI
Many different strategies to “unweight” the dataset. One way (“Henikoff scheme”):
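A minimal sketch of position-based weights in the style of Henikoff and Henikoff, as commonly described: in each column, a sequence receives 1 / (k * n), where k is the number of distinct residue types in the column and n is the number of sequences sharing its residue. Skipping gap characters and normalizing the weights to sum to 1 are choices made here for illustration.

```python
from collections import Counter

def henikoff_weights(msa, gap="-"):
    """Position-based sequence weights, normalized to sum to 1."""
    weights = [0.0] * len(msa)
    for column in zip(*msa):
        counts = Counter(c for c in column if c != gap)
        k = len(counts)  # number of distinct residue types in this column
        for i, c in enumerate(column):
            if c != gap:
                weights[i] += 1.0 / (k * counts[c])
    total = sum(weights)
    return [w / total for w in weights]

# Three identical "beetle" sequences and one divergent sequence.
toy_weights = henikoff_weights(["AAA", "AAA", "AAA", "CCC"])
print(toy_weights)
```

The three identical sequences share the weight of one, so the divergent sequence counts as much as the whole redundant group.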
Without any corrections, our HMM would predict that these other residues could never
be output. We correct for this by adding pseudocounts:
Different ways of computing emission probabilities:
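The simplest option is add-one (Laplace) pseudocounts, sketched below. This is one illustrative scheme, not necessarily the one HMMER uses; the function name and the example column are made up.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def emission_probs(column, pseudocount=1.0):
    """Emission probabilities for one match column, with a flat pseudocount per residue."""
    counts = Counter(c for c in column if c in AMINO_ACIDS)  # ignores gaps
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS}

probs = emission_probs("GGGGA")
print(probs["G"], probs["A"], probs["W"])
```

Residues never observed in the column (such as W here) now get a small non-zero probability instead of zero, so a homolog carrying one of them is penalized rather than ruled out.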
●
16,306 curated protein families (folds)
●
23 million protein sequences
●
7.6 billion residues
●
A curated seed alignment containing a small set of representative members of the family
●
Profile HMMs built iteratively from the seed alignment
●
an automatically generated full alignment, which contains all detected sequences
belonging to the family
Seed Matches
●
All fibronectin sequences in
the alignment have the same
“beta sandwich” fold.
●
~65,000 different sequences
●
~100 residues long
●
Many possible sequences lead to the same fold
●
Proteins in the same family/fold accumulate
substitutions at a constant rate over time
Next time: How do proteins evolve? Modeling the Protein Evolutionary Process