
Lecture2022 - 3

This document provides an overview of protein sequence search and alignment tools, including BLAST and HMMER. It discusses the early history of protein sequencing and databases. It then describes the need for sequence alignment and search tools to identify homologous sequences. It reviews early global and local alignment algorithms and how FASTA and BLAST use heuristics like word matching to rapidly search databases while balancing speed and sensitivity. The document also discusses how BLAST and HMMER use statistical models to assign significance values to matches.


Temple University Structural Bioinformatics II

Spring 2022

Part 2:

Sequence Search & Alignment


with
BLAST and HMMER

Allan Haldane
Ron Levy Group
Protein Sequencing History
1951 – Sanger obtains first protein amino-acid sequence (Insulin)
1953 – Watson/Crick/Wilkins/Franklin discovery of DNA structure & function

1962 – Margoliash compares protein sequences across species (Cytochrome C)


1962 – Zuckerkandl/Pauling propose the “Molecular Clock” and explore it (Hemoglobin)

1968 – Kimura develops the Neutral Theory of Molecular Evolution


1970 – Needleman/Wunsch develop a global sequence alignment algorithm

1974 – Dayhoff’s Atlas of Protein Sequence and Structure

1981 – Smith/Waterman develop a local sequence alignment algorithm


1982 – Creation of GenBank (NIH/NCBI)
1985 – Pearson/Lipman develop the FASTA search/alignment program

1990 – Altschul et al. develop the BLAST sequence search/alignment program


1995 – Eddy et al develop HMMER sequence search/alignment program

2002 – Creation of Uniprot, a descendant of Dayhoff’s database. (EBI)


The Protein Universe

Typically ~30-40% sequence identity per fold, as low as 8-9%
(equal to random sequences)

The Protein sequence→ structure mapping is highly degenerate

Causes difficulty in the “twilight zone” of sequence similarity:


Below 20% identity, difficulty predicting whether sequences belong to a family.

How much of protein sequence space has been explored by life on Earth? Dryden et al. J Royal Society Interface (2008)
Twilight zone of protein sequence alignments. Rost, Protein Eng Des Sel (1999)
Protein Sequence Search and Alignment

Why?

Homology Modelling (for computational chemistry,
Crystallography/Cryo-EM, etc).


Discover functionality of new sequences (new genes or proteins,
orthologs and paralogs, newly sequenced organisms)


Research into protein structure, function, evolution


Covariation Analysis (Active Research - AlphaFold)

How?

Identify sequence patterns which are statistically similar to your sequence

(sequence similarity => Homology)


A Needle In A Haystack
Query Sequence: Database:


The databases contain huge
numbers of sequences, most of
which do not match your
sequence

Homologous Sequences may be
as low as 10% identical
(need statistical model)

Homologous sequence will often
have inserts, deletions, additional
domains, etc (need alignment)
Scoring Similarity

Matrices: PAM (Point Accepted Mutation) / BLOSUM (BLOcks SUbstitution Matrix)


Given an alignment, gives a score for each residue pair


PAM matrices are parameterized based on observed
mutations between closely-related sequences (good
for close homologs)

BLOSUM matrices are parameterized on large
alignments of distantly related homologs (pair
frequencies)
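
Substitution matrices of this kind are commonly built as log-odds scores: the observed frequency of an aligned residue pair is compared with the frequency expected if the two residues were paired by chance. A minimal sketch, assuming half-bit scaling and made-up frequencies (the function name and all numbers are illustrative, not taken from any real matrix):

```python
import math

def log_odds_score(p_ab, q_a, q_b, scale=2.0):
    """Log-odds score for aligning residues a and b.

    p_ab     : observed frequency of the pair (a, b) in trusted alignments
    q_a, q_b : background frequencies of a and b
    scale    : 2.0 gives scores in half-bits (an assumption of this sketch)
    """
    return round(scale * math.log2(p_ab / (q_a * q_b)))

# Hypothetical numbers: a pair seen more often than chance scores positive,
# a pair seen less often than chance scores negative.
print(log_odds_score(0.0100, 0.05, 0.05))
print(log_odds_score(0.000625, 0.05, 0.05))
```

Positive entries mark substitutions that are tolerated more often than chance would predict; negative entries mark substitutions that are avoided.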
Sequence Alignment

We need to align sequences before we can compute sequence identity

Illustration using “dot matrix” method for manual alignment: Put both sequences on
X and Y axes, and put a dot anywhere they are the same.

[Figures: example of the dot matrix method; dot matrix for a nucleotide sequence; dot matrix for the same sequence, translated to aa]

Note: The greater number of states in protein sequences (20 vs 4) makes alignment easier
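
The dot matrix idea can be sketched in a few lines. This is an illustrative toy (the function name and the example sequences are made up), which prints a `*` wherever the residue on the X axis equals the residue on the Y axis:

```python
def dot_matrix(seq_x, seq_y):
    """Return one text row per residue of seq_y; '*' marks identical residues."""
    return ["".join("*" if x == y else "." for x in seq_x) for y in seq_y]

# Similar regions show up as runs of '*' along a diagonal.
for row in dot_matrix("GATTACA", "GATCACA"):
    print(row)
```

With only 4 nucleotide states many off-diagonal dots appear by chance; with 20 amino-acid states the background is much sparser, which is the point of the note above.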
Sequence Alignment

We need to align sequences before we can compute sequence identity

Needleman-Wunsch Algorithm (Global Alignment)
Template WIGDEVAVKAARHDDEDIS
Match    WHGD-VAVKILKVVDPTPE

Smith-Waterman Algorithm (Local Alignment)
Template EVAVKAARHDDEDISvykig
Match    WHGD-VAVKILKVVDPTPE

-AT-CGAA
CATAC---
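
Both algorithms fill the same dynamic-programming table and differ only in the boundary conditions and in whether scores may drop below zero. A minimal sketch, assuming a simple match/mismatch score and a linear gap penalty instead of a PAM/BLOSUM matrix (all parameter values are illustrative):

```python
def align_score(a, b, match=2, mismatch=-1, gap=-2, local=False):
    """Best alignment score of a and b.

    local=False : Needleman-Wunsch (global), end gaps are penalized
    local=True  : Smith-Waterman (local), scores are floored at zero
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps cost something.
        for i in range(1, rows):
            H[i][0] = i * gap
        for j in range(1, cols):
            H[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            v = max(H[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                    H[i - 1][j] + gap,     # gap in b
                    H[i][j - 1] + gap)     # gap in a
            if local:
                v = max(v, 0)              # a local alignment may restart anywhere
            H[i][j] = v
            best = max(best, v)
    return best if local else H[-1][-1]

print(align_score("ACGT", "AGT"))                    # global
print(align_score("TTACGTT", "GGACGGG", local=True)) # local
```

The table has len(a) x len(b) cells, which is why running this against every database sequence becomes expensive.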
FASTP/FASTA (1985)
FASTP = FAST-Proteins, FASTA = FAST-All (protein+nucleotide)

Naive idea for sequence search:



Run Smith-Waterman between the query & every single DB sequence.

Score using PAM/BLOSUM matrix.


Problem: Smith-Waterman search over large databases is
too slow, not feasible (in 1985, anyway)

Solution: Use a heuristic strategy to identify potential
matches. Later can do a Smith-Waterman alignment to get
a more accurate score.

speed-sensitivity trade-off
FASTP/FASTA (1985)
FASTP = FAST-Proteins, FASTA = FAST-All (protein+nucleotide)

Heuristic Strategy:
A) First compare sequences using “words” (k-mers).

SRREADIAGRMGGDEFLLVLP
SRR
RRE
REA
EAD...

B) Find potential matches where many nearby words
match (diagonal on dotplot, within distance
cutoff). Extend potential matches in length and score
using PAM matrix (gives “initial matches”)

C) Join nearby initial matches together (accounts for


gaps/indels)

D) Do optimized/constrained Smith-Waterman
alignment (between dotted lines)
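
Steps A and B can be illustrated with a word index. The sketch below is a toy, not the FASTA implementation: it splits both sequences into k-mers and counts, for each diagonal (query offset minus target offset), how many words match exactly. The function names and the target sequence are hypothetical.

```python
from collections import Counter, defaultdict

def kmers(seq, k):
    """All overlapping words of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def diagonal_hits(query, target, k=3):
    """Count exact word matches per diagonal (i - j)."""
    positions = defaultdict(list)          # word -> offsets in target
    for j, word in enumerate(kmers(target, k)):
        positions[word].append(j)
    hits = Counter()
    for i, word in enumerate(kmers(query, k)):
        for j in positions.get(word, []):
            hits[i - j] += 1               # same diagonal = same relative offset
    return hits

print(kmers("SRREADIAGRMGGDEFLLVLP", 3)[:4])
print(diagonal_hits("SRREAD", "XXSRREAD"))
```

A diagonal with many hits is a candidate region; only those regions are passed on to the more expensive scoring and alignment steps.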
BLAST (1990)

Similar Idea to FASTP, but with some new ideas:


inexact “word” matching

faster implementation

only care about “Maximal Segment Pair”

More complete analysis of statistical significance: the
“tractability to mathematical analysis is a crucial
feature of the BLAST algorithm”
Sources and additional information

Images and other material in this presentation are taken from


Bioinformatics and Functional Genomics, third edition, by
Jonathan Pevsner, 2015 John Wiley & Sons, Inc.
(http://pevsnerlab.kennedykrieger.org/)

The lecture follows closely the contents of chapter 4 of
Pevsner's book, which contains an in-depth discussion of the
issues covered during the lecture. For additional material,
please go to the book website: http://www.bioinfbook.org
BLAST Phase 1: Word Scan
Choose word size W (eg, W=2 or 3 for proteins).

Too large W = false negatives

Too small W = false positives
BLAST Phase 2: Extension


Gapped extensions are computationally demanding.

Original BLAST only used gapless extension

Original BLAST only returned the single “Maximal Segment Pair”


(best scoring hit/extension, no gaps/indels)
Choosing a Threshold Value T

T = word similarity score threshold controls the number of hits/extensions:



too high may miss lower-similarity sequences (false negative)

too low will cause false positives/larger haystack
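
Inexact word matching means that each query word is expanded into a "neighborhood" of words scoring at least T against it. A minimal sketch, assuming a toy scoring rule on a reduced 7-letter alphabet rather than a real BLOSUM matrix (the similarity groups, scores, and function names are all invented for illustration):

```python
from itertools import product

# Toy similarity groups; NOT a real substitution matrix.
SIMILAR_GROUPS = [set("ILV"), set("KR"), set("DE")]

def toy_score(a, b):
    """+5 identical, +2 same group, -2 otherwise (illustrative values)."""
    if a == b:
        return 5
    if any(a in g and b in g for g in SIMILAR_GROUPS):
        return 2
    return -2

def neighborhood(word, T, alphabet="ILVKRDE"):
    """All words of the same length scoring >= T against `word`."""
    hits = []
    for cand in product(alphabet, repeat=len(word)):
        score = sum(toy_score(a, b) for a, b in zip(word, cand))
        if score >= T:
            hits.append("".join(cand))
    return hits

for T in (15, 12, 9):
    print(T, len(neighborhood("LKD", T)))
```

Lowering T grows the neighborhood, so more database words trigger an extension: more sensitivity, but more work and more chance hits.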
E-Values and p-values of Significance
Once we have a maximal segment pair we can compute a score S for the alignment, e.g.
using the BLOSUM matrix. (raw score = sum of BLOSUM scores; bit score = raw score normalized by the scoring-system parameters lambda and K)

But which scores S are significant?


Karlin-Altschul Statistics

The scores of non-homologous (random) sequences
follow an extreme-value distribution (green)

u depends on the sequence lengths m and n

Lambda and K are determined by the scoring
matrix and the background residue frequencies


E-Value: measures the expected number of “random” sequences with
a score S or better in a random sequence database of equal size.

Lower E-value = less likely this hit was a false-positive
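
The Karlin-Altschul relations are E = K·m·n·exp(-λS) for a raw score S, and S' = (λS - ln K) / ln 2 for the bit score, so that E = m·n·2^(-S'). A short sketch of these formulas; the λ, K, and length values below are placeholders chosen for illustration, not parameters read from any BLAST run:

```python
import math

def evalue_from_raw(S, m, n, lam, K):
    """E = K * m * n * exp(-lambda * S); m = query length, n = database length."""
    return K * m * n * math.exp(-lam * S)

def bit_score(S, lam, K):
    """Normalized score in bits: S' = (lambda * S - ln K) / ln 2."""
    return (lam * S - math.log(K)) / math.log(2)

def evalue_from_bits(bits, m, n):
    """E = m * n * 2^(-S'), the same E-value expressed via the bit score."""
    return m * n * 2.0 ** (-bits)

# Placeholder parameters (assumed for this example only).
lam, K, m, n = 0.318, 0.134, 250, 1_000_000
for S in (40, 80, 120):
    print(S, round(bit_score(S, lam, K), 1), evalue_from_raw(S, m, n, lam, K))
```

Because the bit score absorbs λ and K, bit scores from searches with different scoring matrices can be compared directly; the E-value then adds the dependence on search-space size m·n.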
BLAST updates

Gapped BLAST
– allows gaps.
– Fast implementation
– Now often used in preference to original BLAST


PSI-BLAST (Position-Specific Iterative)
– Instead of BLOSUM matrix, infers position-specific amino-
acid frequencies through iterative DB searches.
BLAST Website: https://blast.ncbi.nlm.nih.gov/Blast.cgi

query: FASTA
format or accession

database

Entrez query

algorithm

parameters
Optional BLASTP search parameters

max sequences
short queries
expect threshold
word size
max matches
scoring matrix
gap costs
compositional adjustment
filter
mask
formatting options

Expect value

BLOSUM62 matrix

Threshold value T

Size of database
Graphic summary of the results shows the alignment scores (coded
by color) and the length of the alignment (given by the length of the
horizontal bars)
BLASTP output
HMMER (1995)

Sequence search/alignment tool based on Hidden-Markov Models

Models position-specific amino-acid frequencies

Mathematical Model more easily accounts for gaps/indels

Claims to find more distant homologs than BLAST

References:

Biological sequence analysis

HMMER User guide

Fundamentals of Molecular Evolution (Graur/Li)

The Neutral Theory of Molecular Evolution (Kimura)
Review: HMMs (Hidden Markov Models)

System moves through a series of “hidden” states with specified
transition probabilities

System “emits” observables, with different emission
probabilities in each hidden state.

[Figures: the “occasionally dishonest casino” HMM; weather/behavior HMM (Wikipedia)]

Hidden states

Observables

Important Algorithms:

Forward algorithm: Compute probability of a sequence of observables,
summing over all possible paths through hidden states

Viterbi Algorithm: Compute the most likely path through hidden states, given a
sequence of observables

Baum–Welch algorithm: Given a set of observed sequences and a topology,
estimate emission and transition probabilities (build an HMM).
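
The forward and Viterbi algorithms can be sketched on the “occasionally dishonest casino”, with a fair die (F) and a loaded die (L) as hidden states. The probabilities below are illustrative values assumed for this example; Baum-Welch is not shown.

```python
import math

STATES = ("F", "L")                      # fair die, loaded die
START = {"F": 0.5, "L": 0.5}
TRANS = {"F": {"F": 0.95, "L": 0.05},
         "L": {"F": 0.10, "L": 0.90}}
EMIT = {"F": {r: 1 / 6 for r in range(1, 7)},
        "L": {**{r: 0.1 for r in range(1, 6)}, 6: 0.5}}

def forward(obs):
    """P(obs), summing over all hidden-state paths."""
    f = {s: START[s] * EMIT[s][obs[0]] for s in STATES}
    for o in obs[1:]:
        f = {s: EMIT[s][o] * sum(f[r] * TRANS[r][s] for r in STATES)
             for s in STATES}
    return sum(f.values())

def viterbi(obs):
    """Most likely hidden-state path, computed in log space."""
    v = {s: math.log(START[s] * EMIT[s][obs[0]]) for s in STATES}
    back = []                            # back-pointers, one dict per step
    for o in obs[1:]:
        ptr, nxt = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda r: v[r] + math.log(TRANS[r][s]))
            ptr[s] = prev
            nxt[s] = v[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][o])
        back.append(ptr)
        v = nxt
    state = max(STATES, key=lambda s: v[s])
    path = [state]
    for ptr in reversed(back):           # trace back from the last position
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path))

rolls = [1, 3, 6, 6, 6, 6, 6, 6, 2, 4]
print(forward(rolls))
print(viterbi(rolls))
```

The two functions differ only in replacing the sum over previous states by a max (plus back-pointers), which is the general relationship between forward and Viterbi.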
There are many possible choices of “topology” for HMMs,
depending on the application. We will focus on two types:

1. Secondary-Structure HMMs (SAM, OSS-HMM, TMHMM)


[Diagram: three-state topology with states H, C, S]

2. Profile HMMs (HMMER)


(for multiple sequence alignments)
Simple Transmembrane Helix HMM
Want to model which parts of a protein
sequence are transmembrane helices.

Simple Two-State Model: M = TM helix, X = Everything else

[Diagram: states M and X, with transition probabilities and emission probabilities]

Hidden Markov Models for Prediction of Protein Features. Christopher Bystroff and Anders Krogh. Methods in Molecular Biology, vol. 413
Simple Transmembrane Helix HMM
Want to model which parts of a protein
sequence are transmembrane helices.

Simple Two-State Model: M = TM helix, X = Everything else

[Figure: posterior probability along the sequence, with mismatches marked]

Can improve with more complex topologies

Hidden Markov Models for Prediction of Protein Features. Christopher Bystroff and Anders Krogh. Methods in Molecular Biology, vol. 413
TMHMM
“Real” high performance HMMs can be much more complicated.

“Correctly predicts 97-98 % of


transmembrane helices.”

Used to estimate that 20-30 %


of all genes are TM proteins

Posterior probabilities for a single sequence:

Predicting Transmembrane Protein Topology with a Hidden Markov
Model: Application to Complete Genomes. Krogh et al. JMB 2001
HMMS for Secondary Structure Prediction

Naive idea: 3 state topology: Helix (H), Coil (C), Strand (S)

Real World: 36 hidden states (OSS-HMM)

Topology not chosen by hand: An algorithm tries
topologies to find the one with best match with
DSSP assignments

Similar Performance
to PSIPRED (% correct predictions)

Insights into
secondary structure
patterns?

Analysis of an optimal hidden Markov model for secondary
structure prediction. Martin et al, BMC Structural Biology 2006
Profile HMMs (HMMER)
HMMs used to model and align Multiple Sequence Alignments

A Multiple Sequence Alignment:


Topology of a profile HMM

Building the HMM

Using the HMM to align sequences

Using the HMM to search for sequences in a database

Tool: HMMER

Practical challenges to building a profile HMM given an MSA
HMMER vs BLAST

HMMER’s claim:

– More sensitive detection of remote homologs

– Explicit probabilistic mathematical model (allows derivation


of exact model properties, eg expected distribution of bit
scores)

– As fast as BLAST

Useful for compiling MSAs of remote homologs, as used in the next lecture!
Profile HMMs (HMMER)
Review of Topology:

Delete states (gap)

Insert states (emit aa,


with unbiased probabilities)

Match states (emit aa,


with position-specific
probabilities)
1 2 3
Aligned position
Unaligned: VLSPADKTNVKAAWGKVGAHADEYHAEALERMFLSFPTTKTYFPHF

Aligned:
Profile HMMs (HMMER)
Example of a profile HMM with all values illustrated:

Weight of line
=
transition probability

Unaligned (insert)
What you can do with profile HMMs:

Basic (Given a Model):



Score Sequences:

Compute probability that a sequence was generated by the HMM

Search For Matching Sequences

Score all sequences in a database, pick only the high scoring ones

Compute Site Posterior Probability:

Reflects the degree of confidence in each individual aligned residue

Align Sequences to known model:

Find the most probable alignment of an unaligned sequence (Viterbi)

Higher-level (No initial model):



Build a model given an aligned MSA

simple counting

Build a model given unaligned sequences

Baum-Welch algorithm (iterates forward/backward algorithms)

Iterative model building/search

Iteratively run Baum-Welch, and collect new sequences from a
database using it.
Scoring Sequences
What is the probability that a given sequence was generated by the model?

We have two types of probabilities:

We can calculate the probability of a
sequence through the most likely path
through hidden states (Viterbi algorithm)

Or, we can calculate the probability
of a sequence summing over all
paths (forward algorithm)

[Figure: observed sequence aligned to the model, with the Viterbi path marked]

In practice we use the second: “To maximize power to discriminate true


homologs from nonhomologs in a database search, statistical inference
theory says you ought to be scoring sequences by integrating over
alignment uncertainty, not just scoring the single best alignment”
We can compare this to the probability that it would be generated by a “random” model


Null model (random aa)
mean residue frequencies
in Swiss-Prot 50.8
Scoring Sequences: Log Odds Ratio
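Numerator: the probability of the sequence summing over all paths (forward algorithm)
Denominator: the probability of the sequence in the Null model (random amino acids)

The “Log odds ratio” (“sequence bit score”):

$S = \log_2 \frac{P(x \mid \mathrm{HMM})}{P(x \mid \mathrm{null})}$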


Standard way of testing model discriminatory power

Values >> 1 imply the sequence is more consistent with the HMM

We take the log to avoid numerical difficulties (numbers can be very small)

When the log is base 2, the value has units called “bits”.

Log-odds ratio is also called the “sequence bit score”

Can be efficiently computed for HMMs using a modified Forward algorithm
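
The log-odds computation itself is short once both probabilities are available as logarithms. A minimal sketch, assuming the model log-probability has already been computed elsewhere (for example by a forward pass) and using a uniform 1/20 null model when no background frequencies are given; the function names are illustrative:

```python
import math

def null_log_prob(seq, background=None):
    """ln P(seq | null): residues drawn independently from background frequencies.

    Falls back to a uniform 1/20 background (an assumption of this sketch).
    """
    if background is None:
        return len(seq) * math.log(1 / 20)
    return sum(math.log(background[a]) for a in seq)

def bit_score(log_p_model, log_p_null):
    """Log-odds ratio in bits: log2( P(seq | HMM) / P(seq | null) )."""
    return (log_p_model - log_p_null) / math.log(2)

seq = "ACDEFGHIKL"
log_null = null_log_prob(seq)
# Hypothetical model log-probability: 2^5 times more likely than the null.
log_model = log_null + 5 * math.log(2)
print(bit_score(log_model, log_null))
```

Working with logarithms throughout avoids underflow: the raw probability of a 300-residue sequence is far below the smallest representable double.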
Scoring Sequences: E-Values
The “Log odds ratio”

Problem: The log-odds ratio gives us a score for each individual sequence, but if you are
testing many, many non-matching sequences, some will still have high scores by
chance.

E-values: Expected number of


sequences having this bit score or more,
when looking in a dataset of size N.

Theory: Can work out that the “null”


distribution of bit-scores is exponential

This is a way to estimate if a high score


appeared by chance due to having a large
dataset.

It is necessary to calibrate mu/tau


Scoring Sequences: Summary

Given an HMM, and a sequence database of size N to


search through, we can calculate:


A log-odds ratio (bit score) for each sequence.

>0 bits (odds ratio >1) means it's more likely to be generated by the HMM than the null model

An E-Value: Expected number of non-matching sequences that reach this score by
chance.

E-value < 1 means that on average we don’t expect this strong a score by
chance given our dataset size, so it is likely a homolog.
Commonly E < 0.01 is used as a significance cutoff
By setting a threshold on the E-Value, we can filter the database to pick out
the homologous sequences.
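
Filtering by E-value can be sketched as follows. This assumes an exponential null tail, P(score > t) = exp(-λ(t - τ)), with decay constant λ = ln 2 and a calibrated location τ; the parameter values, hit names, and scores below are placeholders for illustration, not HMMER output.

```python
import math

def evalue(bits, tau, n_seqs, lam=math.log(2)):
    """Expected number of null sequences scoring >= bits in a database of n_seqs."""
    p = min(1.0, math.exp(-lam * (bits - tau)))   # tail probability, capped at 1
    return n_seqs * p

# Hypothetical hits: (name, bit score). tau and database size are assumed.
hits = [("seqA", 45.2), ("seqB", 22.0), ("seqC", 9.5)]
tau, n_seqs = -3.0, 1_000_000

for name, bits in hits:
    e = evalue(bits, tau, n_seqs)
    print(name, bits, f"{e:.2e}", "keep" if e < 0.01 else "drop")
```

The same bit score gives a larger E-value in a larger database, which is why a threshold on E rather than on the raw score is used for filtering.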
Posterior Probabilities

Per-site posterior probability annotation.
Probability that emitted observable i came from state k, averaging over
all possible paths.

Computable from forward/backward equations.

Gives an estimate of how “well aligned” a position is.
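
Per-position posteriors combine a forward pass and a backward pass. The sketch below uses a generic two-state HMM (fair die F, loaded die L) with assumed probabilities rather than a profile HMM, to keep the example small:

```python
STATES = ("F", "L")
START = {"F": 0.5, "L": 0.5}
TRANS = {"F": {"F": 0.95, "L": 0.05},
         "L": {"F": 0.10, "L": 0.90}}
EMIT = {"F": {r: 1 / 6 for r in range(1, 7)},
        "L": {**{r: 0.1 for r in range(1, 6)}, 6: 0.5}}

def posteriors(obs):
    """P(state at position i = s | obs) for every i, via forward/backward."""
    # Forward: fwd[i][s] = P(obs[0..i], state_i = s)
    fwd = [{s: START[s] * EMIT[s][obs[0]] for s in STATES}]
    for o in obs[1:]:
        prev = fwd[-1]
        fwd.append({s: EMIT[s][o] * sum(prev[r] * TRANS[r][s] for r in STATES)
                    for s in STATES})
    # Backward: bwd[i][s] = P(obs[i+1..] | state_i = s)
    bwd = [{s: 1.0 for s in STATES}]
    for o in reversed(obs[1:]):
        nxt = bwd[0]
        bwd.insert(0, {s: sum(TRANS[s][r] * EMIT[r][o] * nxt[r] for r in STATES)
                       for s in STATES})
    total = sum(fwd[-1].values())        # P(obs)
    return [{s: fwd[i][s] * bwd[i][s] / total for s in STATES}
            for i in range(len(obs))]

for roll, post in zip([1, 6, 6, 6, 2], posteriors([1, 6, 6, 6, 2])):
    print(roll, round(post["L"], 3))
```

In a profile HMM the same quantity, evaluated for the match state assigned to a residue, is what gets reported as the per-residue alignment confidence.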


Aligning Sequences

Naive way: Viterbi Algorithm:


Computes the single most likely path through the HMM.
(maximizes sum of log probabilities)
Problem: The single most likely path might skip a node which almost all other likely paths go
through. i.e., it can go through “unlikely” nodes. It is an extremal path rather than an average.

Better way: MAC Algorithm: (Maximum ACcuracy)


Computes the path with the greatest total posterior probability through the HMM
(maximizes sum of posterior probabilities)
i.e., computes the path going through the most commonly visited nodes by
all likely paths, and so is a better representation of the “average path”.


Easy implementation: Very similar to the Viterbi algorithm, except we substitute the
posterior probabilities instead of the log probabilities, plus minor tweaks.

Gives “more accurate” alignments than Viterbi. MAC is used by HMMER
HMM building Algorithms

Building an HMM given an aligned MSA


Idea: Very simply, we count up what we see in the MSA, i.e., count up the
number of times each amino acid appears at each position, and normalize, to get the
emission probabilities, and similarly for transition probabilities.
Pseudocounts and weights may be used too (discussed later).

Building an HMM given unaligned sequences (The Baum-Welch Algorithm)

Idea: Given a trial HMM, use forward/backward algorithms to average over all
possible alignments of all sequences in the MSA. Add up the observed
emission and transitions seen in the data with these probabilities.

Building an HMM given a seed MSA and a database (JACKHMMER, HHBlits)

Idea: Given a seed aligned MSA, construct an initial HMM (see above). Use this
HMM to search the database for more homologous sequences, to get a bigger, more
diverse MSA. Construct a new HMM with this MSA. Repeat.
HMMER

Free Software for using/building profile HMMs

HMMER3 is highly optimized. Fast!
HMMER: Example Run
Example: Searching a sequence database
HMMER: Example Run
Example: Scoring and Aligning a sequence

HMM reference lines:


1. Annotated Secondary structure
2. Consensus sequence
3. Matches

Aligned Sequence
Posterior Probability
Profile HMMs: Practical issues

1)Local vs Global Alignments


2)Choosing Match states
3)Phylogenetic Weighting
4)Pseudocounts
5)Filtering Sequences
1. Local vs Global Alignments

HMM for global alignment


Try to match all residues in the
query sequence to the model

Template WIGDEVAVKAARHDDEDIS
Match WHGD-VAVKILKVVDPTPE

HMM for local alignment


Allows unmatched states at beginning
and end, so can find a match within a
longer sequence, or partial sequence.

Template EVAVKAARHDDEDISvykig
Match WHGD-VAVKILKVVDPTPE

Note: HMMER 3+ ONLY does local alignment

[Diagram labels: flanking model states, switching points]
2. Choosing Match States
When building a new HMM given a seed MSA, you have to make a decision: Which
columns will count as match states, and which as insert states? (model construction)


manual construction:
The user marks alignment columns by hand.

heuristic construction:
A rule is used to decide whether a column should be marked. For
instance, a column might be marked when the proportion of gap
symbols in it is below a certain threshold.

MAP construction:
A maximum a posteriori choice is determined by dynamic programming
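
The heuristic rule is the easiest of the three to show in code. A minimal sketch, assuming `-` marks a gap and that a column becomes a match state when its gap fraction is strictly below a threshold (the threshold value and the toy alignment are illustrative):

```python
def match_columns(msa, max_gap_fraction=0.5, gap="-"):
    """Return one bool per alignment column: True = match state, False = insert."""
    n_seqs = len(msa)
    marks = []
    for i in range(len(msa[0])):
        n_gaps = sum(1 for seq in msa if seq[i] == gap)
        marks.append(n_gaps / n_seqs < max_gap_fraction)
    return marks

msa = ["AC-G",
       "A--G",
       "ACTG",
       "A--G"]
print(match_columns(msa))
print(match_columns(msa, max_gap_fraction=0.6))
```

Raising the threshold turns more gappy columns into match states, giving a longer model; lowering it pushes those residues into insert states instead.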
2. Choosing Match States
3. Phylogenetic Weighting
"God has an inordinate fondness for beetles." - J.B.S. Haldane
(no relation)
Eukaryotic

https://evolution.berkeley.edu

If a closely related group of species appears often in the dataset, this introduces a
bias in our estimates of emission probabilities. Similar to overcounting. Will hurt us
for distant homology detection.
QIGSGGSSKVFQVLNEKKQIYAIKYVNLEEADNQTLDEIAY
QIGSAHSSKVFQVLNEKKQIYAIKYVNLEEADNQTLDEIAY
QIGSGGSSKVFQVLLEKTQIYAIKKENLEEADNQTLDEINY
QIGEGGSKKVFQVLNEKKQIYAIKYVNLEEADNQTLDEIAY Beetles with similar sequences
QIGSGGSSKVFQVLNEKKQIYAIKYVNLEEADNQTLDEIAY
QIGSGGFSKVFQVLNEHKQIYAIKYVNLEEADNQTLDEIAY
ELGRGESGTVYKGVLEDDRHVAVKKLENVRQGKEVFQELSV
RLGAGGFGSVYKATYHGVP-VAIKQVNKCTKNRLASRELNV
KIGAGNSGTVVKALHVPDSKVAKKTIPVEQNNSTIINELSI
ILGEGAGGSVSKCKLKNGSKFALKVINTLNTDPEYQKELQF
HLGEGNGGAVSLVKHRNI--FMARKTVYVGSDSKLQKELGV
KIGQGASGGVYTAYEIGTNVVAIKQMNLEKQPKKELIEILV
VIGRGSYGVvykainkhTDQVAIKEVVYENDEELNDIEISL
LIGSGSFGQVYLGMNASSGEMAVKQVILDSVSESKDREIAL
IIGRGGFGEVYGCRKADTGKYAMKCLDKKRIKMKQGENERI
LLGKGTFGKVILVKEKATGRYAMKILKKEVIVAKDEVENRV
SLGSGSFGTAKLCRHRGSGLFCSKTLRRETIVHEKHKEINI
LLGQGDVGKVYLVRERDTNQFALKVLNKHEMIKRKKIEQEI
TLGIGGFGRVVKAHHQDRVDFALKCLKKRHIVDTKQEERHI
3. Phylogenetic Weighting
Many different strategies to “unweight” the dataset. One way (“Henikoff scheme”):

Number of times “a” is seen at position i

Number of different residues ever seen at position i


Weight factor needed to get “equal” emission probabilities at position i

Average weight factor over all positions
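
The four quantities above fit together as position-based weights: for sequence s, w_s = (1/L) * sum over positions i of 1 / (k_i * n_i(a)), where a is the residue of s at position i, n_i(a) is the number of times a is seen at position i, k_i is the number of different residues seen at position i, and L is the alignment length. A sketch under the simplifying assumption that a gap character is counted like any other symbol (the notation and function name are choices made for this example):

```python
from collections import Counter

def henikoff_weights(msa):
    """Position-based sequence weights; the weights sum to 1."""
    n_cols = len(msa[0])
    weights = [0.0] * len(msa)
    for i in range(n_cols):
        column = [seq[i] for seq in msa]
        counts = Counter(column)          # n_i(a) for each residue a
        k = len(counts)                   # number of different residues in column i
        for s, a in enumerate(column):
            weights[s] += 1.0 / (k * counts[a])
    return [w / n_cols for w in weights]  # average over positions

# Two identical sequences share the weight that one distinct sequence gets alone.
print(henikoff_weights(["AAA", "AAA", "CCC"]))
```

With these weights each residue type seen in a column contributes equally to that column, however many near-identical sequences carry it.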


4. Pseudocounts
We build an HMM given a limited seed alignment. At some positions there may be
residues that we didn’t see in our dataset, but exist some fraction of the time in nature,
particularly when we have few sequences.

Without any corrections, our HMM would predict that these other residues could never
be output. We correct for this by adding pseudocounts:
Different ways of computing emission probabilities:

No correction

Simple pseudocount (A=20 works well)

Dirichlet pseudocount (assumes there are different
“types” of positions which
have different biases)
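
The first two options can be compared directly. A minimal sketch, assuming the simple pseudocount takes the form e(a) = (c(a) + A*q(a)) / (N + A) with a uniform background q(a) = 1/20; the Dirichlet mixture case is not shown, and the example column is invented:

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def emission_probs(column, A=0.0, alphabet=AMINO_ACIDS):
    """Emission probabilities for one match column.

    A = 0  : plain normalized counts (no correction)
    A > 0  : simple pseudocount with total weight A and uniform background
    Assumes the column has at least one residue when A = 0.
    """
    counts = Counter(c for c in column if c in alphabet)   # gaps are skipped
    total = sum(counts.values())
    q = 1.0 / len(alphabet)
    return {a: (counts[a] + A * q) / (total + A) for a in alphabet}

column = "AAAC"                       # residues observed at one position
raw = emission_probs(column)          # unseen residues get probability 0
smoothed = emission_probs(column, A=20)
print(raw["A"], raw["D"])
print(smoothed["A"], smoothed["D"])
```

With few sequences the pseudocount dominates and the column stays close to the background; as more sequences are added the observed counts take over.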
5. Filtering
When searching a database, there are choices about which sequences to
“keep”: E values, bit scores, etc.
Pfam

Pfam is a collection of HMMs and MSAs of many protein
families. Uses HMMs to detect homologous sequences

Underlying sequence database (Pfamseq) is based on UniProt

Intimately tied to HMMER

http://pfam.xfam.org/ (European Bioinformatics Institute)


16,306 curated protein families (folds)

23 million protein sequences

7.6 billion residues

collected from all branches of life


Uniprot database sequence statistics

All Domains Eukaryotes only


Pfam
Each Pfam family consists of:


A curated seed alignment containing a small set of representative members of the family

Profile HMMs built iteratively from the seed alignment

an automatically generated full alignment, which contains all detected sequences
belonging to the family

Pfam entries are classified in one of six ways:



Family: A collection of related protein regions

Domain: A structural unit

Repeat: A short unit which is unstable in isolation but forms a stable structure when multiple copies are
present

Motif: A short unit found outside globular domains

Coiled-Coil: Regions that predominantly contain coiled-coil motifs, regions that typically contain alpha-helices
that are coiled together in bundles of 2-7.

Disordered: Regions that are conserved, yet are either shown or predicted to contain bias sequence
composition and/or are intrinsically disordered (non-globular).

(From Pfam website)


Pfam website
example

Fibronectin type III

Seed Matches


All fibronectin sequences in
the alignment have the same
“beta sandwich” fold.

~65,000 different sequences

~100 residues long

How similar do you think the


sequences are (% identical aa)?
Summary


Many possible sequences lead to the same fold

Proteins in the same family/fold accumulate
substitutions at a constant rate over time

Next time: How do proteins evolve? Modeling the Protein Evolutionary Process
