Lecture2022 - 3 /!

This document provides an overview of protein sequence search and alignment tools, including BLAST and HMMER. It discusses the early history of protein sequencing and databases. It then describes the need for sequence alignment and search tools to identify homologous sequences. It reviews early global and local alignment algorithms and how FASTA and BLAST use heuristics like word matching to rapidly search databases while balancing speed and sensitivity. The document also discusses how BLAST and HMMER use statistical models to assign significance values to matches.


Temple University Structural Bioinformatics II

Spring 2022

Part 2:

Sequence Search & Alignment

with
BLAST and HMMER

Allan Haldane
Ron Levy Group
Protein Sequencing History
1951 – Sanger obtains first protein amino-acid sequence (Insulin)
1953 – Watson/Crick/Wilkins/Franklin discovery of DNA structure & function

1962 – Margoliash compares protein sequences across species (Cytochrome C)

1962 – Zuckerkandl/Pauling propose the “Molecular Clock” and explore it (Hemoglobin)

1968 – Kimura develops the Neutral Theory of Molecular Evolution

1970 – Needleman/Wunsch develop a global sequence alignment algorithm

1974 – Dayhoff’s Atlas of Protein Sequence and Structure

1981 – Smith/Waterman develop a local sequence alignment algorithm

1982 – Creation of GenBank (NIH/NCBI)
1985 – Pearson/Lipman develop the FASTA search/alignment program

1990 – Altschul et al develop BLAST sequence search/alignment program

1995 – Eddy et al develop HMMER sequence search/alignment program

2002 – Creation of UniProt, a descendant of Dayhoff’s database. (EBI)

The Protein Universe
●
Typically ~30-40% sequence identity per fold, as low as 8-9%
(equal to random sequences)

The Protein sequence→ structure mapping is highly degenerate

Causes difficulty in the “twilight zone” of sequence similarity:

Below 20% identity, difficulty predicting whether sequences belong to a family.

How much of protein sequence space has been explored by life on Earth? Dryden et al. J Royal Society Interface (2008)
Twilight zone of protein sequence alignments. Rost, Protein Eng Des Sel (1999)
Protein Sequence Search and Alignment

Why?
●
Homology Modelling (for computational chemistry,
Crystallography/Cryo-EM, etc).

●
Discover functionality of new sequences (new genes or proteins,
orthologs and paralogs, newly sequenced organisms)

●
Research into protein structure, function, evolution

●
Covariation Analysis (Active Research - AlphaFold)

How?
●
Identify sequence patterns which are statistically similar to your sequence

(sequence similarity => Homology)

A Needle In A Haystack
Query Sequence: Database:

●
The databases contain huge
numbers of sequences, most of
which do not match your
sequence
●
Homologous Sequences may be
as low as 10% identical
(need statistical model)
●
Homologous sequence will often
have inserts, deletions, additional
domains, etc (need alignment)
Scoring Similarity
●
Matrices: PAM (Point Accepted Mutation) / BLOSUM (BLOcks SUbstitution Matrix)

●
Given an alignment, gives a score for each residue pair

●
PAM matrices are parameterized based on observed
mutations between closely-related sequences (good
for close homologs)
●
BLOSUM matrices are parameterized on large
alignments of distantly related homologs (pair
frequencies)
Sequence Alignment
●
We need to align sequences before we can compute sequence identity

Illustration using “dot matrix” method for manual alignment: Put both sequences on
X and Y axes, and put a dot anywhere they are the same.

Figures: example of the dot matrix method; dot matrix for a nucleotide sequence; dot matrix for the same sequence, translated to amino acids.

Note: The greater number of states in protein sequences (20 vs 4) makes alignment easier
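
The dot matrix method described above can be sketched in a few lines. This is a minimal illustration, not a tool from the lecture; the function names `dot_matrix` and `render` and the example sequences are assumptions made for the sketch.

```python
def dot_matrix(seq_x, seq_y):
    """Return a grid with True wherever seq_y[i] == seq_x[j]."""
    return [[a == b for b in seq_x] for a in seq_y]


def render(seq_x, seq_y):
    """Draw the dot matrix as text: '*' for a match, '.' otherwise."""
    grid = dot_matrix(seq_x, seq_y)
    lines = ["  " + seq_x]
    for residue, row in zip(seq_y, grid):
        lines.append(residue + " " + "".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical example sequences; diagonals of '*' mark aligned stretches.
    print(render("GATTACA", "GATCACA"))
```

With a 4-letter alphabet many off-diagonal dots appear by chance, which is the point of the note about 20 vs 4 states.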
Sequence Alignment
●
We need to align sequences before we can compute sequence identity

Needleman-Wunsch Algorithm (Global Alignment)
Template WIGDEVAVKAARHDDEDIS
Match    WHGD-VAVKILKVVDPTPE

Smith-Waterman Algorithm (Local Alignment)
Template EVAVKAARHDDEDISvykig
Match    WHGD-VAVKILKVVDPTPE

-AT-CGAA
CATAC---
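
The difference between the two algorithms shows up directly in the dynamic-programming recurrences. The sketch below computes scores only (no traceback) and assumes a simple match/mismatch/linear-gap scheme rather than a PAM/BLOSUM matrix; the function names and default scores are illustrative assumptions.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch: best score aligning all of a with all of b."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman: best score over all pairs of substrings."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The extra 0 lets an alignment restart anywhere (local).
            cell = max(0, prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap)
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best
```

The only changes from global to local are the zero floor in each cell, the zero boundary, and taking the best cell anywhere instead of the last one.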
FASTP/FASTA (1985)
FASTP = FAST-Proteins, FASTA = FAST-All (protein+nucleotide)

Naive idea for sequence search:

●
Run Smith-Waterman between the query & every single DB sequence.
●
Score using PAM/BLOSUM matrix.

●
Problem: Smith-Waterman search over large databases is
too slow, not feasible (in 1985, anyway)
●
Solution: Use a heuristic strategy to identify potential
matches. Later can do a Smith-Waterman alignment to get
a more accurate score.
●
speed-sensitivity trade-off
FASTP/FASTA (1985)
FASTP = FAST-Proteins, FASTA = FAST-All (protein+nucleotide)

Heuristic Strategy:
A) First compare sequences using “words” (k-mers).

SRREADIAGRMGGDEFLLVLP
SRR
RRE
REA
EAD...

B) Identify potential matches where many nearby words

match (diagonal on dotplot, within distance
cutoff). Extend potential matches in length and score
using PAM matrix (gives “initial matches”)

C) Join nearby initial matches together (accounts for

gaps/indels)

D) Do optimized/constrained Smith-Waterman
alignment (between dotted lines)
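
Steps A and B can be sketched as a word index plus a count of shared words per dotplot diagonal. This covers only the exact-word part of the heuristic, under the assumption that a diagonal is identified by (target position - query position); the function names are hypothetical and this is not the FASTA source code.

```python
from collections import Counter, defaultdict


def kmer_index(seq, k):
    """Map each k-mer ("word") to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def diagonal_hits(query, target, k=3):
    """Count exact shared words per diagonal (target position - query position)."""
    index = kmer_index(query, k)
    counts = Counter()
    for j in range(len(target) - k + 1):
        for i in index.get(target[j:j + k], ()):
            counts[j - i] += 1
    return counts
```

Diagonals with many hits are the candidates that would then be extended, joined, and passed to a constrained Smith-Waterman alignment.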
BLAST (1990)

Similar Idea to FASTP, but with some new ideas:

●
inexact “word” matching
●
faster implementation
●
only care about “Maximal Segment Pair”
●
More complete analysis of statistical significance: the
“tractability to mathematical analysis is a crucial
feature of the BLAST algorithm”
Sources and additional information

Images and other material in this presentation are taken from

Bioinformatics and Functional Genomics third edition by
Jonathan Pevsner, 2015 John Wiley & Sons, Inc. (
http://pevsnerlab.kennedykrieger.org/)

The lecture follows closely the contents of chapter 4 of
the Pevsner book, which contains an in-depth discussion of the
issues covered during the lecture. For additional material,
please go to the book website: http://www.bioinfbook.org
BLAST Phase 1: Word Scan
Choose word size W (eg, W=2 or 3 for proteins).
●
Too large W = false negatives
●
Too small W = false positives
BLAST Phase 2: Extension

●
Gapped extensions are computationally demanding.
●
Original BLAST only used gapless extension

Original BLAST only returned the single “Maximal Segment Pair”

(best scoring hit/extension, no gaps/indels)
Choosing a Threshold Value T

T = word similarity score threshold controls the number of hits/extensions:

●
too high may miss lower-similarity sequences (false negative)
●
too low will cause false positives/larger haystack
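
The effect of T can be seen by enumerating the "neighborhood" of a query word: all words whose score against it is at least T. The sketch below assumes a toy scoring function (+4 identical, -1 otherwise) and a 3-letter alphabet purely for illustration; real BLAST uses a substitution matrix such as BLOSUM62 over 20 amino acids.

```python
from itertools import product


def toy_score(a, b):
    """Toy substitution score (an assumption, not a real PAM/BLOSUM entry)."""
    return 4 if a == b else -1


def neighborhood(word, threshold, alphabet, score=toy_score):
    """All words of the same length whose summed pair score with `word` is >= threshold."""
    hits = []
    for candidate in product(alphabet, repeat=len(word)):
        total = sum(score(a, b) for a, b in zip(word, candidate))
        if total >= threshold:
            hits.append("".join(candidate))
    return hits
```

Lowering the threshold grows the neighborhood quickly, which means more hits to extend (more sensitive, but a larger haystack).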
E-Values and p-values of Significance
Once we have a maximal segment pair we can compute a score S for the alignment, e.g.
using a BLOSUM matrix. (raw score = sum of BLOSUM entries, bit score = raw score normalized by the scoring-system parameters lambda and K)

But which scores S are significant?

Karlin-Altschul Statistics

The scores of non-homologous (random) sequences follow an
extreme-value distribution (green)

u depends on the lengths m and n of the sequences

Lambda and K are determined by

analysis of the query database

●
E-Value: measures the expected number of “random” sequences with
a score S or better in a random sequence database of equal size.
●
Lower E-value = less likely this hit was a false-positive
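
A minimal sketch of how these quantities relate, using the standard Karlin-Altschul forms E = K m n exp(-lambda S) and S' = (lambda S - ln K) / ln 2. Here m is taken as the query length and n as the total database length; the function names are assumptions, and any lambda and K values passed in are up to the caller (they depend on the scoring system).

```python
import math


def bit_score(raw_score, lam, k):
    """Normalise a raw score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score, m, n, lam, k):
    """Expected number of chance hits scoring >= raw_score: E = K * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * raw_score)


def p_value(e):
    """Probability of at least one chance hit, assuming Poisson-distributed counts."""
    return 1.0 - math.exp(-e)
```

In terms of the bit score the same E-value is m * n * 2^(-S'), so bit scores from different scoring systems can be compared directly.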
BLAST updates
●
Gapped BLAST
– allows gaps.
– Fast implementation
– Now often used in preference to original BLAST

●
PSI-BLAST (Position-Specific Iterative)
– Instead of BLOSUM matrix, infers position-specific amino-
acid frequencies through iterative DB searches.
BLAST Website: https://blast.ncbi.nlm.nih.gov/Blast.cgi

query: FASTA
format or accession

database

Entrez query

algorithm

parameters
Optional BLASTP search parameters

max sequences
short queries
expect threshold
word size
max matches
scoring matrix
gap costs
compositional adjustment
filter
mask
formatting options

Expect value

BLOSUM62 matrix

Threshold value T

Size of database
Graphic summary of the results shows the alignment scores (coded
by color) and the length of the alignment (given by the length of the
horizontal bars)
BLASTP output
HMMER (1995)
●
Sequence search/alignment tool based on Hidden-Markov Models
●
Models position-specific amino-acid frequencies
●
Mathematical Model more easily accounts for gaps/indels
●
Claims to find more distant homologs than BLAST

References:
●
Biological sequence analysis
●
HMMER User guide
●
Fundamentals of Molecular Evolution (Graur/Li)
●
The Neutral Theory of Molecular Evolution (Kimura)
Review: HMMs (Hidden Markov Models)
●
System moves through a series of “hidden” states with specified
transition probabilities
●
System “emits” observables, with different emission
probabilities in each hidden state.

Examples (figures): the occasionally dishonest casino; weather/behavior (Wikipedia), each drawn with hidden states and observables.

Important Algorithms:
●
Forward algorithm: Compute probability of a sequence of observables,
summing over all possible paths through hidden states
●
Viterbi Algorithm: Compute the most likely path through hidden states, given a
sequence of observables
●
Baum–Welch algorithm: Given a set of observed sequences and a topology,
estimate emission and transition probabilities (build an HMM).
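
The forward and Viterbi algorithms can be sketched on the occasionally dishonest casino. The transition and emission numbers below are illustrative assumptions (a fair die F and a loaded die L that favors 6), and the code works with raw probabilities, so it is only suitable for short sequences; real implementations work in log space or rescale.

```python
def forward(obs, states, start, trans, emit):
    """Probability of the observations, summed over every hidden path."""
    f = {s: start[s] * emit[s][obs[0]] for s in states}
    for x in obs[1:]:
        f = {s: emit[s][x] * sum(f[r] * trans[r][s] for r in states) for s in states}
    return sum(f.values())


def viterbi(obs, states, start, trans, emit):
    """Return (probability, path) for the single most probable hidden path."""
    v = {s: (start[s] * emit[s][obs[0]], [s]) for s in states}
    for x in obs[1:]:
        nxt = {}
        for s in states:
            prob, path = max(
                ((v[r][0] * trans[r][s], v[r][1]) for r in states),
                key=lambda t: t[0],
            )
            nxt[s] = (prob * emit[s][x], path + [s])
        v = nxt
    return max(v.values(), key=lambda t: t[0])


# Occasionally dishonest casino: Fair and Loaded dice (illustrative numbers).
STATES = ("F", "L")
START = {"F": 0.5, "L": 0.5}
TRANS = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
EMIT = {
    "F": {r: 1 / 6 for r in "123456"},
    "L": {**{r: 0.1 for r in "12345"}, "6": 0.5},
}
```

The two functions differ only in replacing the sum over previous states with a max, which is why the forward probability is always at least the Viterbi path probability.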
There are many possible choices of “topology” for HMMs,
depending on the application. We will focus on two types:

1. Secondary-Structure HMMs (SAM, OSS-HMM, TMHMM)


2. Profile HMMs (HMMER)

(for multiple sequence alignments)
Simple Transmembrane Helix HMM
Want to model which parts of a protein
sequence are transmembrane helices. Simple Two-State Model

X = Everything else M = TM helix

M X

Transition
Probabilities Emission
Probabilities

Hidden Markov Models for Prediction of Protein Features. Christopher Bystroff and Anders Krogh. Methods in Molecular Biology, vol. 413
mismatches
Posterior Probability

Can improve with more complex

topologies
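
The posterior probability trace for the two-state model can be sketched with the forward/backward equations. Everything numeric here is an assumption for illustration: residues are reduced to two classes ('h' hydrophobic, 'p' polar), and the start, transition, and emission values are made up rather than taken from the cited chapter.

```python
def posteriors(obs, states, start, trans, emit):
    """P(state at position i = k | whole sequence), via forward/backward."""
    n = len(obs)
    fwd = [{s: start[s] * emit[s][obs[0]] for s in states}]
    for x in obs[1:]:
        prev = fwd[-1]
        fwd.append({s: emit[s][x] * sum(prev[r] * trans[r][s] for r in states) for s in states})
    bwd = [{s: 1.0 for s in states}]
    for x in reversed(obs[1:]):
        nxt = bwd[0]
        bwd.insert(0, {s: sum(trans[s][r] * emit[r][x] * nxt[r] for r in states) for s in states})
    total = sum(fwd[-1].values())
    return [{s: fwd[i][s] * bwd[i][s] / total for s in states} for i in range(n)]


# Two-state model: M = TM helix, X = everything else (hypothetical parameters).
TM_STATES = ("M", "X")
TM_START = {"M": 0.1, "X": 0.9}
TM_TRANS = {"M": {"M": 0.9, "X": 0.1}, "X": {"M": 0.05, "X": 0.95}}
TM_EMIT = {"M": {"h": 0.8, "p": 0.2}, "X": {"h": 0.3, "p": 0.7}}
```

For a sequence such as "pppphhhhhhhhpppp", the posterior for M rises inside the hydrophobic stretch and falls at the flanks, which is the kind of curve the slide plots.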
TMHMM
“Real” high performance HMMs can be much more complicated.

“Correctly predicts 97-98 % of

transmembrane helices.”

Used to estimate that 20-30 % of all genes encode TM proteins

Posterior probabilities for a single sequence:

Predicting Transmembrane Protein Topology with a Hidden Markov

Model: Application to Complete Genomes. Krogh et al. JMB 2001
HMMS for Secondary Structure Prediction
●
Naive idea: 3 state topology: Helix, Coil, Strand
●
Real World: 36 hidden states (OSS-HMM)
●
Topology not chosen by hand: An algorithm tries
topologies to find the one that best matches the
DSSP assignments

Similar Performance
to PSIPRED

Insights into
secondary structure
patterns?

(% correct predictions)

Analysis of an optimal hidden Markov model for secondary
structure prediction. Martin et al, BMC Structural Biology 2006
Profile HMMs (HMMER)
HMMs used to model and align Multiple Sequence Alignments

A Multiple Sequence Alignment:

●
Topology of a profile HMM
●
Building the HMM
●
Using the HMM to align sequences
●
Using the HMM to search for sequences in a database
●
Tool: HMMER
●
Practical challenges to building a profile HMM given an MSA
HMMER vs BLAST

HMMER’s claim:

– More sensitive detection of remote homologs

– Explicit probabilistic mathematical model (allows derivation

of exact model properties, eg expected distribution of bit
scores)

– As fast as BLAST

Useful for compiling MSAs of remote homologs, as used in the next lecture!
Profile HMMs (HMMER)
Review of Topology:

Delete states (gap)

Insert states (emit aa,

with unbiased probabilities)

Match states (emit aa,

with position-specific
probabilities)
1 2 3
Aligned position
Unaligned: VLSPADKTNVKAAWGKVGAHADEYHAEALERMFLSFPTTKTYFPHF

Aligned:
Profile HMMs (HMMER)
Example of a profile HMM with all values illustrated:

Weight of line
=
transition probability

Unaligned (insert)
What you can do with profile HMMs:

Basic (Given a Model):

●
Score Sequences:
●
Compute probability that a sequence was generated by the HMM
●
Search For Matching Sequences
●
Score all sequences in a database, pick only the high scoring ones
●
Compute Site Posterior Probability:
●
Reflects the degree of confidence in each individual aligned residue
●
Align Sequences to known model:
●
Find the most probable alignment of an unaligned sequence (Viterbi)

Higher-level (No initial model):

●
Build a model given an aligned MSA
●
simple counting
●
Build a model given unaligned sequences
●
Baum-Welch algorithm (iterates forward/backward algorithms)
●
Iterative model building/search
●
Iteratively run Baum-Welch, and collect new sequences from a
database using it.
Scoring Sequences
What is the probability that a given sequence was generated by the model?

We have two types of probabilities:

Observed sequence
●
We can calculate the probability of a
sequence along the most likely path
through hidden states (Viterbi algorithm)
●
Or, we can calculate the probability
of a sequence summing over all
paths (forward algorithm)

In practice we use the second: “To maximize power to discriminate true

homologs from nonhomologs in a database search, statistical inference
theory says you ought to be scoring sequences by integrating over
alignment uncertainty, not just scoring the single best alignment”
We can compare this to the probability that it would be generated by a “random” model

●
Null model (random aa)
mean residue frequencies
in Swiss-Prot 50.8
Scoring Sequences: Log Odds Ratio
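The “Log odds ratio” (“sequence bit score”):

$S = \log_2 \frac{P(\text{sequence} \mid \text{HMM})}{P(\text{sequence} \mid \text{null})}$

Numerator: the probability of the sequence, summed over all paths (forward algorithm).
Denominator: the probability of the sequence in the Null model (random amino acids).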

●
Standard way of testing model discriminatory power
●
Odds ratios >> 1 (log-odds >> 0) imply the sequence is more consistent with the HMM
●
We take the log to avoid numerical difficulties (numbers can be very small)
●
When the log is base 2, the value has units called “bits”.
●
Log-odds ratio is also called the “sequence bit score”
●
Can be efficiently computed for HMMs using a modified Forward algorithm
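
A minimal sketch of the log-odds score, under a strong simplifying assumption: the model is an ungapped profile (match states only, one per residue), so there is a single path and the forward sum reduces to one product. The function name and the dictionary layout are assumptions; the sequence is assumed to have the same length as the profile.

```python
import math


def log_odds_bits(seq, match_emissions, null_freqs):
    """Bit score log2(P(seq | model) / P(seq | null)) for an ungapped profile."""
    score = 0.0
    for residue, column in zip(seq, match_emissions):
        # Summing per-position logs avoids multiplying many tiny probabilities.
        score += math.log2(column[residue] / null_freqs[residue])
    return score
```

A positive score means the profile explains the sequence better than the null model; a profile identical to the null model scores every sequence at 0 bits.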
Scoring Sequences: E-Values
The “Log odds ratio”

Problem: The log-odds ratio gives us a score for each individual sequence, but if you are
testing many, many non-matching sequences, some will still have high scores by
chance.

E-values: Expected number of

sequences having this bit score or more,
when looking in a dataset of size N.

Theory: Can work out that the “null”

distribution of bit-scores is exponential

This is a way to estimate if a high score

appeared by chance due to having a large
dataset.

, necessary to calibrate mu/tau
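
The slide's formula did not survive extraction, so the following is a sketch under stated assumptions rather than the lecture's equation: the null distribution of bit scores has an exponential tail, P(score >= s) = exp(-lambda (s - tau)) for s above the location tau, with lambda = ln 2 for scores in bits, and tau obtained by calibration. The function and parameter names are hypothetical.

```python
import math


def e_value_from_bits(bit_score, n_sequences, tau, lam=math.log(2)):
    """E = N * P(score >= s), with an exponential tail P = exp(-lam * (s - tau))."""
    p = min(1.0, math.exp(-lam * (bit_score - tau)))
    return n_sequences * p
```

With lambda = ln 2, each additional bit of score halves the E-value, and doubling the database size N doubles it.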

Scoring Sequences: Summary

Given an HMM, and a sequence database of size N to

search through, we can calculate:

●
A log-odds ratio (bit score) for each sequence.

A score > 0 means it is more likely to be generated by the HMM than by the null model

●
An E-Value: Expected number of non-matching sequences with this score or better by
chance.

E-value < 1 means that on average we don’t expect this strong a score by
chance given our dataset size, so it is likely a homolog.
Commonly E < 0.01 is used as a significance cutoff
By setting a threshold on the E-Value, we can filter the database to pick out
the homologous sequences.
Posterior Probabilities
●
Per-site posterior probability annotation.
Probability that emitted observable i came from state k, averaging over
all possible paths.

Computable from forward/backward equations.

Gives an estimate of how “well aligned” a position is.

Aligning Sequences

Naive way: Viterbi Algorithm:

Computes the single most likely path through the HMM.
(maximizes sum of log probabilities)
Problem: The single most likely path might skip a node which almost all other likely paths go
through. i.e., it can go through “unlikely” nodes. It is an extremal path rather than an average.

Better way: MAC Algorithm: (Maximum ACcuracy)

Computes the path with the greatest total posterior probability through the HMM
(maximizes sum of posterior probabilities)
i.e., computes the path going through the most commonly visited nodes by
all likely paths, and so is a better representation of the “average path”.

●
Easy implementation: Very similar to the Viterbi algorithm, except we substitute the
posterior probabilities instead of the log probabilities, plus minor tweaks.
●
Gives “more accurate” alignments than Viterbi. MAC is used by HMMER
Higher-level (No initial model):

Building an HMM given an aligned MSA

Idea: Very simply, we count up what we see in the MSA, i.e., count up the
number of times each amino acid appears at each position, and normalize, to get the
emission probabilities, and similarly for transition probabilities.
Pseudocounts and weights may be used too (discussed later).
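
The counting step for emission probabilities can be sketched as follows. This assumes every alignment column is treated as a match column, gaps are written as '-', and no pseudocounts or sequence weights are applied; transition counting is omitted. The function name is an assumption.

```python
from collections import Counter


def emission_probabilities(msa):
    """Per-column residue frequencies from an aligned MSA (gaps '-' ignored)."""
    columns = zip(*msa)
    probs = []
    for column in columns:
        counts = Counter(res for res in column if res != "-")
        total = sum(counts.values())
        probs.append({res: n / total for res, n in counts.items()})
    return probs
```

Residues never seen in a column get no entry at all, i.e. probability zero, which is the problem pseudocounts address.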

Building an HMM given unaligned sequences (The Baum-Welch Algorithm)

Idea: Given a trial HMM, use forward/backward algorithms to average over all
possible alignments of all the sequences. Add up the observed
emissions and transitions seen in the data with these probabilities.

Building an HMM given a seed MSA and a database (JACKHMMER, HHBlits)

Idea: Given a seed aligned MSA, construct an initial HMM (see above). Use this
HMM to search the database for more homologous sequences, to get a bigger, more
diverse MSA. Construct a new HMM with this MSA. Repeat.
HMMER
●
Free Software for using/building profile HMMs
●
HMMER3 is highly optimized. Fast!
HMMER: Example Run
Example: Searching a sequence database
HMMER: Example Run
Example: Scoring and Aligning a sequence

HMM reference lines:

1. Annotated Secondary structure
2. Consensus sequence
3. Matches

Aligned Sequence
Posterior Probability
Profile HMMs: Practical issues

1) Local vs Global Alignments
2) Choosing Match states
3) Phylogenetic Weighting
4) Pseudocounts
5) Filtering Sequences
1. Local vs Global Alignments

HMM for global alignment

Try to match all residues in the
query sequence to the model

Template WIGDEVAVKAARHDDEDIS
Match WHGD-VAVKILKVVDPTPE

HMM for local alignment

Allows unmatched states at beginning
and end, so can find a match within a
longer sequence, or partial sequence.

Template EVAVKAARHDDEDISvykig
Match WHGD-VAVKILKVVDPTPE

Note: HMMER 3+ ONLY does local alignment
(Figure labels: flanking model states, switching points)
2. Choosing Match States
When building a new HMM given a seed MSA, you have to make a decision: Which
columns will count as match states, and which as insert states? (model construction)

●
manual construction:
The user marks alignment columns by hand.
●
heuristic construction:
A rule is used to decide whether a column should be marked. For
instance, a column might be marked when the proportion of gap
symbols in it is below a certain threshold.
●
MAP construction:
A maximum a posteriori choice is determined by dynamic programming
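
The heuristic construction rule can be sketched as a gap-fraction filter over alignment columns. The 0.5 default threshold and the function name are assumptions chosen for illustration, not the setting of any particular tool.

```python
def match_columns(msa, max_gap_fraction=0.5):
    """Indices of alignment columns whose gap fraction is below the threshold."""
    n_seqs = len(msa)
    keep = []
    for idx, column in enumerate(zip(*msa)):
        gap_fraction = column.count("-") / n_seqs
        if gap_fraction < max_gap_fraction:
            keep.append(idx)
    return keep
```

Columns that are not returned would be modelled as insert states between the neighboring match states.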
3. Phylogenetic Weighting
"God has an inordinate fondness for beetles." - J.B.S. Haldane
(no relation)
https://evolution.berkeley.edu

If a closely related group of species appears often in the dataset, this introduces a
bias in our estimates of emission probabilities. Similar to overcounting. Will hurt us
for distant homology detection.
Beetles with similar sequences (first six rows):
QIGSGGSSKVFQVLNEKKQIYAIKYVNLEEADNQTLDEIAY
QIGSAHSSKVFQVLNEKKQIYAIKYVNLEEADNQTLDEIAY
QIGSGGSSKVFQVLLEKTQIYAIKKENLEEADNQTLDEINY
QIGEGGSKKVFQVLNEKKQIYAIKYVNLEEADNQTLDEIAY
QIGSGGSSKVFQVLNEKKQIYAIKYVNLEEADNQTLDEIAY
QIGSGGFSKVFQVLNEHKQIYAIKYVNLEEADNQTLDEIAY
ELGRGESGTVYKGVLEDDRHVAVKKLENVRQGKEVFQELSV
RLGAGGFGSVYKATYHGVP-VAIKQVNKCTKNRLASRELNV
KIGAGNSGTVVKALHVPDSKVAKKTIPVEQNNSTIINELSI
ILGEGAGGSVSKCKLKNGSKFALKVINTLNTDPEYQKELQF
HLGEGNGGAVSLVKHRNI--FMARKTVYVGSDSKLQKELGV
KIGQGASGGVYTAYEIGTNVVAIKQMNLEKQPKKELIEILV
VIGRGSYGVvykainkhTDQVAIKEVVYENDEELNDIEISL
LIGSGSFGQVYLGMNASSGEMAVKQVILDSVSESKDREIAL
IIGRGGFGEVYGCRKADTGKYAMKCLDKKRIKMKQGENERI
LLGKGTFGKVILVKEKATGRYAMKILKKEVIVAKDEVENRV
SLGSGSFGTAKLCRHRGSGLFCSKTLRRETIVHEKHKEINI
LLGQGDVGKVYLVRERDTNQFALKVLNKHEMIKRKKIEQEI
TLGIGGFGRVVKAHHQDRVDFALKCLKKRHIVDTKQEERHI
3. Phylogenetic Weighting
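Many different strategies to “unweight” the dataset. One way (“Henikoff scheme”):

$n_{i,a}$ = number of times “a” is seen at position i

$k_i$ = number of different residues ever seen at position i

$w_i(s) = 1 / (k_i \, n_{i,s_i})$ = weight factor needed to get “equal” emission probabilities at position i, for a sequence s carrying residue $s_i$ there

$w(s) = \frac{1}{L} \sum_{i=1}^{L} w_i(s)$ = average weight factor over all L positions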
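
A minimal sketch of position-based (Henikoff-style) weights: at each column a sequence receives 1 / (number of different residues in the column x number of sequences sharing its residue), and the per-column values are averaged over the alignment length. As a simplification, the gap symbol is counted like any other residue; the function name is an assumption.

```python
from collections import Counter


def henikoff_weights(msa):
    """Position-based sequence weights, averaged over alignment columns."""
    length = len(msa[0])
    weights = [0.0] * len(msa)
    for column in zip(*msa):
        counts = Counter(column)          # times each residue is seen in this column
        k = len(counts)                   # number of different residues in this column
        for s, residue in enumerate(column):
            weights[s] += 1.0 / (k * counts[residue])
    return [w / length for w in weights]
```

Near-identical sequences share their weight, so the six similar "beetle" rows together would count for little more than one divergent sequence; the weights always sum to 1.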

4. Pseudocounts
We build an HMM given a limited seed alignment. At some positions there may be
residues that we didn’t see in our dataset, but exist some fraction of the time in nature,
particularly when we have few sequences.

Without any corrections, our HMM would predict that these other residues could never
be output. We correct for this by adding pseudocounts:
Different ways of computing emission probabilities:

●
No correction
●
Simple pseudocount (A=20 works well)
●
Dirichlet pseudocount (assumes there are different “types” of positions which have different biases)
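
The simple pseudocount can be sketched as e(x) = (c(x) + A q(x)) / (sum of counts + A), where c(x) are the observed counts in a column and q(x) are background frequencies. This form is an assumption about what the slide's lost equation showed; the function name is hypothetical, and the Dirichlet-mixture variant is not implemented here.

```python
def emissions_with_pseudocounts(counts, background, a=20.0):
    """e(x) = (c(x) + A * q(x)) / (sum of counts + A), with background q."""
    total = sum(counts.get(x, 0) for x in background)
    return {x: (counts.get(x, 0) + a * q) / (total + a) for x, q in background.items()}
```

With few sequences the estimate stays close to the background; as the counts grow, the observed frequencies dominate, and setting a=0 recovers the uncorrected estimate.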
5. Filtering
When searching a database, there are choices about which sequences to
“keep”: E values, bit scores, etc.
Pfam
●
Pfam is a collection of HMMs and MSAs of many protein
families. Uses HMMs to detect homologous sequences
●
Underlying sequence database (Pfamseq) is based on UniProt
●
Intimately tied to HMMER
●
http://pfam.xfam.org/ (European Bioinformatics Institute)

●
16,306 curated protein families (folds)
●
23 million protein sequences
●
7.6 billion residues

collected from all branches of life

Figure: UniProt database sequence statistics (panels: All Domains, Eukaryotes only)

Pfam
Each Pfam family consists of:

●
A curated seed alignment containing a small set of representative members of the family
●
Profile HMMs built iteratively from the seed alignment
●
an automatically generated full alignment, which contains all detected sequences
belonging to the family

Pfam entries are classified in one of six ways:

●
Family: A collection of related protein regions
●
Domain: A structural unit
●
Repeat: A short unit which is unstable in isolation but forms a stable structure when multiple copies are
present
●
Motif: A short unit found outside globular domains
●
Coiled-Coil: Regions that predominantly contain coiled-coil motifs, regions that typically contain alpha-helices
that are coiled together in bundles of 2-7.
●
Disordered: Regions that are conserved, yet are either shown or predicted to contain bias sequence
composition and/or are intrinsically disordered (non-globular).

(From Pfam website)

Pfam website
example

Fibronectin type III

Seed Matches

●
All fibronectin sequences in
the alignment have the same
“beta sandwich” fold.
●
~65,000 different sequences
●
~100 residues long

How similar do you think the

sequences are (% identical aa)?
Summary

●
Many possible sequences lead to the same fold
●
Proteins in the same family/fold accumulate
substitutions at a constant rate over time

Next time: How do proteins evolve? Modeling the Protein Evolutionary Process

All Years Note Package
100% (1)
All Years Note Package
224 pages
UNIT IV _ BLAST (1)
No ratings yet
UNIT IV _ BLAST (1)
21 pages
5 Database Similarity Search BLAST
No ratings yet
5 Database Similarity Search BLAST
47 pages
Bio 2
No ratings yet
Bio 2
39 pages
Sequence Alignments: Felix Sappelt Irina Wagner
100% (1)
Sequence Alignments: Felix Sappelt Irina Wagner
34 pages
Database Similarity Searching
No ratings yet
Database Similarity Searching
4 pages
Lecture 3
No ratings yet
Lecture 3
46 pages
Bioinformatics Lab 2 (Evelyn)
No ratings yet
Bioinformatics Lab 2 (Evelyn)
9 pages
Lab 2.1
No ratings yet
Lab 2.1
21 pages
Bioinformatics Lab 2
No ratings yet
Bioinformatics Lab 2
9 pages
BI205 Prac 5&6
No ratings yet
BI205 Prac 5&6
11 pages
Lecture 4
No ratings yet
Lecture 4
106 pages
Bioinformatics Is The Inter-Disciplinary Branch of Biology Which Merges Computer Science, Mathematics and Engineering To Study The Biological Data
No ratings yet
Bioinformatics Is The Inter-Disciplinary Branch of Biology Which Merges Computer Science, Mathematics and Engineering To Study The Biological Data
26 pages
Fundamentals of bioinformatics_L5
No ratings yet
Fundamentals of bioinformatics_L5
56 pages
Lecture 4: Blast: Ly Le, PHD
No ratings yet
Lecture 4: Blast: Ly Le, PHD
60 pages
BLAST
No ratings yet
BLAST
30 pages
Genomic Sequence Alignment
No ratings yet
Genomic Sequence Alignment
25 pages
Sequence Alignment
No ratings yet
Sequence Alignment
29 pages
2. Sequence alignment
No ratings yet
2. Sequence alignment
25 pages
Sequence Alignment and Searching
No ratings yet
Sequence Alignment and Searching
37 pages
Lecture 8 ACB
No ratings yet
Lecture 8 ACB
5 pages
Blast ND Fasta
No ratings yet
Blast ND Fasta
28 pages
Sequence DB Search
No ratings yet
Sequence DB Search
38 pages
Bioinformatics 1 p3
No ratings yet
Bioinformatics 1 p3
17 pages
Introduction To Different Resources of Bioinformatics and Application PDF
No ratings yet
Introduction To Different Resources of Bioinformatics and Application PDF
55 pages
Database Searching
No ratings yet
Database Searching
41 pages
04B. Bioinformatics-Lecture 4 (Alternative) - Blast
100% (1)
04B. Bioinformatics-Lecture 4 (Alternative) - Blast
38 pages
Blast Fasta
No ratings yet
Blast Fasta
27 pages
Lecture 05
No ratings yet
Lecture 05
36 pages
Lecture 3 and 4 LSM2241
No ratings yet
Lecture 3 and 4 LSM2241
6 pages
Diploma - Practical
No ratings yet
Diploma - Practical
11 pages
Blast & Fasta
No ratings yet
Blast & Fasta
47 pages
_second_done_w14b_searching squence databases
No ratings yet
_second_done_w14b_searching squence databases
32 pages
ALLIENU Blast and Fasta
No ratings yet
ALLIENU Blast and Fasta
27 pages
Blast
No ratings yet
Blast
18 pages
Bioinformatics: Blast and Sequence Analysis
No ratings yet
Bioinformatics: Blast and Sequence Analysis
45 pages
Blast 2 Sequences: Salman Khan Current Gpa in Bioinf 4 Gpa
No ratings yet
Blast 2 Sequences: Salman Khan Current Gpa in Bioinf 4 Gpa
45 pages
Lab Report 03
No ratings yet
Lab Report 03
18 pages
Bioinformatics Session8
No ratings yet
Bioinformatics Session8
33 pages
Blast
100% (1)
Blast
21 pages
blast-170122070200
No ratings yet
blast-170122070200
22 pages
Introduction To Bioinformatics: Sequence Alignment
No ratings yet
Introduction To Bioinformatics: Sequence Alignment
29 pages
Week 3 LocalAlignment
No ratings yet
Week 3 LocalAlignment
25 pages
TY-Exercise_4_(35)
No ratings yet
TY-Exercise_4_(35)
8 pages
University of Kwazulu-Natal Bioinformatics Gene320 3 May 2016 Test 2 Duration 100 Minutes Total Marks: 70
No ratings yet
University of Kwazulu-Natal Bioinformatics Gene320 3 May 2016 Test 2 Duration 100 Minutes Total Marks: 70
6 pages
Blast Nsuite
No ratings yet
Blast Nsuite
19 pages
BLAST Background
100% (1)
BLAST Background
27 pages
BLAST Script
No ratings yet
BLAST Script
10 pages
Bs982 l08 Basic Blast
No ratings yet
Bs982 l08 Basic Blast
38 pages
Blast
No ratings yet
Blast
115 pages
05 CAP5510 Fall21
No ratings yet
05 CAP5510 Fall21
40 pages
BLAST
100% (1)
BLAST
4 pages
LO5 Pairwise Sequence Alignment
No ratings yet
LO5 Pairwise Sequence Alignment
11 pages
BLAST Analysis and Algorythim
No ratings yet
BLAST Analysis and Algorythim
11 pages
Basics of Bioinformatics
100% (7)
Basics of Bioinformatics
99 pages
Basic Local Alignment
No ratings yet
Basic Local Alignment
36 pages
Random Sample Consensus: Robust Estimation in Computer Vision
From Everand
Random Sample Consensus: Robust Estimation in Computer Vision
Fouad Sabry
No ratings yet
K Nearest Neighbor Algorithm: Fundamentals and Applications
From Everand
K Nearest Neighbor Algorithm: Fundamentals and Applications
Fouad Sabry
No ratings yet
JMP for Mixed Models
From Everand
JMP for Mixed Models
Ruth Hummel
No ratings yet
Kernel Methods: Fundamentals and Applications
From Everand
Kernel Methods: Fundamentals and Applications
Fouad Sabry
No ratings yet
10 Minute Guide to Orthogonal Array Test Strategy
From Everand
10 Minute Guide to Orthogonal Array Test Strategy
Rajeev Nair Raman
No ratings yet
月ゼミ11:27
No ratings yet
月ゼミ11:27
14 pages
Jacs 9b13046flavunoidin
No ratings yet
Jacs 9b13046flavunoidin
5 pages
Nematode 2
No ratings yet
Nematode 2
21 pages
Metals and Metallurgy: Theodore L. Brown H. Eugene Lemay, Jr. and Bruce E. Bursten
No ratings yet
Metals and Metallurgy: Theodore L. Brown H. Eugene Lemay, Jr. and Bruce E. Bursten
43 pages
Qualitative Metabolomics-Based Characterization of A Phenolic UDP - Xylosyltransferase With A Broad Substrate Spectrum From Lentinus Brumalis
No ratings yet
Qualitative Metabolomics-Based Characterization of A Phenolic UDP - Xylosyltransferase With A Broad Substrate Spectrum From Lentinus Brumalis
12 pages
Nishitetsu Kaizuka Line
No ratings yet
Nishitetsu Kaizuka Line
3 pages
Metabolism: The Breakdown of Food Molecules Releases Stored Energy Needed To Make New Molecules
No ratings yet
Metabolism: The Breakdown of Food Molecules Releases Stored Energy Needed To Make New Molecules
17 pages
Microbiology and Molecular Biology Reviews-2017-Milani-e00036-17.full PDF
No ratings yet
Microbiology and Molecular Biology Reviews-2017-Milani-e00036-17.full PDF
67 pages
Group Assignment: Instructions To Candidates
No ratings yet
Group Assignment: Instructions To Candidates
6 pages
Cisco Nexus 9500 Series Switches: Product Overview
No ratings yet
Cisco Nexus 9500 Series Switches: Product Overview
9 pages
2002 – Creation of UniProt, a descendant of Dayhoff’s database (EBI)

The protein sequence → structure mapping is highly degenerate

Causes difficulty in the “twilight zone” of sequence similarity:

(sequence similarity => Homology)

Example of the dot matrix method; dot matrix for a nucleotide sequence

Needleman-Wunsch Algorithm and Smith-Waterman Algorithm
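
A minimal sketch of the Smith-Waterman recursion, for illustration only. The scoring values (match, mismatch, linear gap penalty) are assumptions chosen for the example; a real protein search would use a substitution matrix such as BLOSUM62 and affine gaps. The function name is hypothetical.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score of sequences a and b.

    Uses a linear gap penalty and keeps only two rows of the
    dynamic-programming matrix, so memory is O(len(b)).
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                  # local alignment: restart at zero
                prev[j - 1] + s,    # align a[i-1] with b[j-1]
                prev[j] + gap,      # gap in b
                curr[j - 1] + gap,  # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

The zero floor in the `max` is what makes the alignment local. Dropping it, initialising the first row and column with cumulative gap penalties, and reading the score from the bottom-right cell gives the Needleman-Wunsch global score instead.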

Naive idea for sequence search:

B) Potential matches where many nearby words

C) Join nearby initial matches together (accounts for

Similar Idea to FASTP, but with some new ideas:

Images and other material in this presentation are taken from

The lecture follows closely the contents of chapter 4 of

Original BLAST only returned the single “Maximal Segment Pair”

T = word similarity score threshold controls the number of hits/extensions:

But which scores S are significant?

The expected number of non-homologous

u depends on sequences m and n

Lambda and k are determined by
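
For reference, a sketch of the standard Karlin-Altschul statistics used for BLAST scores: E = K m n exp(-lambda S), with Gumbel location u = ln(K m n) / lambda. The numerical values of lambda and K below are illustrative assumptions, not values taken from this lecture; in practice they depend on the scoring matrix and gap penalties. Function names are hypothetical.

```python
import math


def evalue(score, m, n, lam, K):
    """Expected number of chance hits with raw score >= score.

    m is the query length, n the database length.
    Equivalent to exp(-lam * (score - u)) with u = ln(K*m*n) / lam.
    """
    return K * m * n * math.exp(-lam * score)


def bit_score(score, lam, K):
    """Normalised score S' such that E = m * n * 2**(-S')."""
    return (lam * score - math.log(K)) / math.log(2)


def pvalue(e):
    """Probability of at least one chance hit, P = 1 - exp(-E)."""
    return -math.expm1(-e)


# Illustrative parameters only (assumed, not from the slides).
LAM, K_PARAM = 0.267, 0.041
QUERY_LEN, DB_LEN = 300, 5_000_000

for s in (40, 60, 80):
    e = evalue(s, QUERY_LEN, DB_LEN, LAM, K_PARAM)
    print(s, round(bit_score(s, LAM, K_PARAM), 1), e, pvalue(e))
```

Because E scales with m and n, the same raw score becomes less significant as the database grows.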

Occasionally dishonest casino; weather behavior (Wikipedia)

1. Secondary-Structure HMMs (SAM, OSS-HMM, TMHMM)

2. Profile HMMs (HMMER)

X = Everything else M = TM helix

Can improve with more complex

“Correctly predicts 97-98 % of

Used to estimate that 20-30 %

Posterior probabilities for a single sequence:

Predicting Transmembrane Protein Topology with a Hidden Markov

Analysis of an optimal hidden Markov model for secondary

A Multiple Sequence Alignment:

– More sensitive detection of remote homologs

– Explicit probabilistic mathematical model (allows derivation

Delete states (gap)

Insert states (emit aa,

Match states (emit aa,

Basic (Given a Model):

Higher-level (No initial model):

We have two types of probabilities:

In practice we use the second: “To maximize power to discriminate true

The “Log odds ratio”

E-values: Expected number of

Theory: Can work out that the “null”

This is a way to estimate if a high score

, necessary to calibrate mu/tau

Given an HMM, and a sequence database of size N to

Computable from forward/backward equations.

Gives an estimate of how “well aligned” a position is.

Naive way: Viterbi Algorithm:

Better way: MAC Algorithm: (Maximum ACcuracy)
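
A minimal sketch of the Viterbi algorithm on a two-state "occasionally dishonest casino" HMM, for illustration. It shows only the Viterbi (single best path) decoding, not MAC. The transition and emission probabilities are assumed textbook-style values, not parameters from this lecture, and the function and variable names are hypothetical.

```python
import math


def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return (most probable state path, its log probability).

    obs must be non-empty. All probabilities are given as logs.
    """
    # V[t][s]: best log probability of any path ending in state s at time t
    V = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = []  # back[t][s]: best predecessor of state s at time t+1
    for x in obs[1:]:
        row, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: V[-1][p] + log_trans[p][s])
            row[s] = V[-1][prev] + log_trans[prev][s] + log_emit[s][x]
            ptr[s] = prev
        V.append(row)
        back.append(ptr)
    # Trace back from the best final state
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, V[-1][last]


# Assumed parameters: F = fair die, L = loaded die (6 comes up half the time).
STATES = ("F", "L")
LOG_START = {"F": math.log(0.5), "L": math.log(0.5)}
LOG_TRANS = {
    "F": {"F": math.log(0.95), "L": math.log(0.05)},
    "L": {"F": math.log(0.10), "L": math.log(0.90)},
}
LOG_EMIT = {
    "F": {r: math.log(1 / 6) for r in range(1, 7)},
    "L": {r: math.log(0.5 if r == 6 else 0.1) for r in range(1, 7)},
}

rolls = [1, 3, 2, 5, 4, 6, 6, 6, 6, 6, 6, 6, 6, 2, 1]
path, logp = viterbi(rolls, STATES, LOG_START, LOG_TRANS, LOG_EMIT)
print("".join(path), logp)
```

Viterbi returns one path and says nothing about how reliable each position is. Posterior probabilities from the forward/backward equations carry that per-position information, which is what maximum-accuracy decoding uses.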

Basic (Given a Model):

Higher-level (No initial model):

Building an HMM given an aligned MSA

Building an HMM given an unaligned MSA (The Baum-Welch Algorithm)

Building an HMM given a seed MSA and a database (JACKHMMER, HHBlits)

HMM reference lines:

1) Local vs Global Alignments

HMM for global alignment

HMM for local alignment

Number of times “a” is seen at position i

Number of different residues ever seen at position i

Average weight factor over all positions

Simple pseudocount; Dirichlet pseudocount

collected from all branches of life

All Domains; Eukaryotes only

Pfam entries are classified in one of six ways:

(From Pfam website)

Fibronectin type III

How similar do you think the