Sequence Comparison

Appendix A
Basic Concepts in Molecular Biology
DNA is made up of four chemicals – adenine, cytosine, guanine, and thymine – that
occur millions or billions of times throughout a genome. The human genome, for
example, has about three billion pairs of bases. RNA is made of four chemicals:
adenine, cytosine, guanine, and uracil. The bases are usually referred to by their
initial letters: A, C, G, T for DNA and A, C, G, U for RNA.
The particular order of As, Cs, Gs, and Ts is extremely important. The order
underlies all of life’s diversity. It even determines whether an organism is human or
another species such as yeast, fruit fly, or chimpanzee, all of which have their own
genomes.
In the late 1940s, Erwin Chargaff noted an important regularity: the amount
of adenine in DNA molecules is always equal to the amount of thymine, and the
amount of guanine is always equal to the amount of cytosine (#A = #T and #G =
#C). In 1953, based on the x-ray diffraction data of Rosalind Franklin and Maurice
Wilkins, James Watson and Francis Crick proposed a conceptual model for DNA
structure. The Watson-Crick model states that the DNA molecule is a double helix,
in which two strands are twisted together. The only two possible pairs are AT and
CG. This yields a molecule in which #A = #T and #G = #C. The model also suggests
that the basis for copying the genetic information is the complementarity of its
bases. For example, if the sequence on one strand is AGATC, then the sequence of
the other strand would have to be TCTAG, its complementary bases.
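To make base complementarity concrete, here is a minimal Python sketch (illustrative, not part of the original text; the function name is ours) that computes the complementary strand of a DNA sequence:

    # Watson-Crick pairing: A <-> T and C <-> G.
    COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

    def complement(strand: str) -> str:
        """Return the base-by-base complement of a DNA strand."""
        return "".join(COMPLEMENT[base] for base in strand)

    print(complement("AGATC"))  # prints TCTAG, matching the example above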
There are several different kinds of RNA made by the cell. In particular, mRNA,
messenger RNA, is a copy of a gene. It acts as a photocopy of a gene by having a
sequence complementary to one strand of the DNA and identical to the other strand.
Other RNAs include tRNA (transfer RNA), rRNA (ribosomal RNA), and snRNA
(small nuclear RNA). Since RNA cannot form a stable double helix, it actually exists
as a single-stranded molecule. However, some regions can form hairpin loops if
there is some base pair complementation (A and U, C and G). The RNA molecule
with its hairpin loops is said to have a secondary structure.
A.2 Proteins
The building blocks of proteins are the amino acids. Only 20 different amino acids
make up the diverse array of proteins found in living organisms. Table A.1 summa-
rizes these 20 common amino acids. Each protein differs in the number, type,
and arrangement of the amino acids that make up its structure. The chains of amino
acids are linked by peptide bonds. A long chain of amino acids linked by peptide
bonds is a polypeptide. Proteins are long, complex polypeptides.
The sequence of amino acids that makes up a particular polypeptide chain is
called the primary structure of a protein. The primary structure folds into the
secondary structure, which is the path that the polypeptide backbone of the protein
follows in space. The tertiary structure is the organization in three dimensions of
all the atoms in the polypeptide chain. The quaternary structure consists of aggre-
gates of more than one polypeptide chain. The structure of a protein is crucial to its
functionality.
A.3 Genes
Genes are the fundamental physical and functional units of heredity. Genes carry
the information for making all the proteins required by an organism, and these
proteins in turn determine how the organism functions.
A gene is an ordered sequence of nucleotides located in a particular position on a
particular chromosome that encodes a specific functional product (a protein or RNA
molecule). Expressed genes include those that are transcribed into mRNA and then
translated into protein and those that are transcribed into RNA but not translated
into protein (e.g., tRNA and rRNA).
How does a segment of a strand of DNA relate to the production of the amino
acid sequence of a protein? This concept is well explained by the central dogma of
molecular biology. Information flow (with the exception of reverse transcription) is
from DNA to RNA via the process of transcription, and then to protein via transla-
tion. Transcription is the making of an RNA molecule off a DNA template. Trans-
lation is the construction of an amino acid sequence (polypeptide) from an RNA
molecule.
How does an mRNA template specify amino acid sequence? The answer lies in
the genetic code. It would be impossible for each amino acid to be specified by one
or two nucleotides, because there are 20 types of amino acids, and yet there are
only four types of nucleotides. Indeed, each amino acid is specified by a particular
combination of three nucleotides, called a codon, which is a triplet on the mRNA
that codes for either a specific amino acid or a control word. The genetic code was
broken by Marshall Nirenberg and Heinrich Matthaei a decade after Watson and
Crick’s work.
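To illustrate how a triplet code works, the following Python sketch translates a short mRNA fragment codon by codon. Only four codon assignments are shown; they are entries of the standard genetic code, included here purely for illustration:

    # A tiny excerpt of the standard genetic code (codon -> amino acid);
    # UAA is a control word (a stop codon).
    CODON_TABLE = {"AUG": "Met", "UUU": "Phe", "GGC": "Gly", "UAA": "STOP"}

    def translate(mrna: str) -> list:
        """Read an mRNA string three bases at a time and translate it."""
        peptide = []
        for i in range(0, len(mrna) - 2, 3):
            amino_acid = CODON_TABLE.get(mrna[i:i + 3], "?")
            if amino_acid == "STOP":
                break
            peptide.append(amino_acid)
        return peptide

    print(translate("AUGUUUGGCUAA"))  # ['Met', 'Phe', 'Gly']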
A.4 The Genomes

A genome is all the DNA in an organism, including its genes.1 In 1990, the Hu-
man Genome Project was launched by the U.S. Department of Energy and the Na-
tional Institutes of Health. The project originally was planned to last 15 years, but
rapid technological advances accelerated the draft completion date to 2003. The
goals of the project were to identify all the genes in human DNA, determine the
1 Wouldn’t you agree that the genomes are the largest programs written in the oldest language and
are quite adaptable, flexible, and fault-tolerant?
sequences of the 3 billion base pairs that make up human DNA, store this informa-
tion in databases, develop tools for data analysis, and address the ethical, legal, and
social issues that may arise from the project.
Because all organisms are related through similarities in DNA sequences, in-
sights gained from other organisms often provide the basis for comparative studies
that are critical to understanding more complex biological systems. As the sequenc-
ing technology advances, digital genomes for many species are now available for
researchers to query and compare. Interested readers are encouraged to check out
the latest update at http://www.ncbi.nlm.nih.gov/sites/entrez?db=genomeprj.
Appendix B
Elementary Probability Theory
This appendix summarizes basic background knowledge and establishes the book's
terminology and notation in probability theory. We give informal definitions for
events, random variables, and probability distributions, and list statements without
proof. The reader is referred to an elementary probability textbook for justification.
B.1 Events and Probabilities

Intuitively, an event is something that will or will not occur in a situation depending
on chance. Such a situation is called an experiment. For example, in the experiment
of rolling a die, an event might be that the number turning up is 5. In an alignment
of two DNA sequences, whether the two sequences have the same nucleotide at a
given position is an event.
The certain event, denoted by Ω , always occurs. The impossible event, denoted
by φ , never occurs. In different contexts these two events might be described in
different, but essentially identical ways.
Let A be an event. The complementary event of A is the event that A does not
occur and is written Ā.
Let A1 and A2 be two events. The event that at least one of A1 or A2 occurs
is called the union of A1 and A2 and is written A1 ∪ A2 . The event that both A1
and A2 occur is called the intersection of A1 and A2 and is written A1 ∩ A2 . We
say A1 and A2 are disjoint, or mutually exclusive, if they cannot occur together.
In this case, A1 ∩ A2 = φ. These definitions extend to a finite number of events in a
natural way. The union of the events A1, A2, . . . , An is the event that at least one of
these events occurs and is written A1 ∪ A2 ∪ · · · ∪ An or ∪_{i=1}^{n} Ai. The intersection of
the events A1, A2, . . . , An is the event that all of these events occur, and is written
A1 ∩ A2 ∩ · · · ∩ An, or simply A1A2 · · · An.
The probability of an event A is written Pr[A]. Obviously, for the certain event
Ω, Pr[Ω] = 1; for the impossible event φ, Pr[φ] = 0. For disjoint events A1 and A2,

Pr[A1 ∪ A2] = Pr[A1] + Pr[A2]. (B.1)

The conditional probability of an event B given an event A, written Pr[B|A], is defined as

Pr[B|A] = Pr[AB] / Pr[A]. (B.2)

Events A1 and A2 are independent if Pr[A1 ∩ A2] = Pr[A1] Pr[A2]. In general,
events A1, A2, . . . , An are independent if

Pr[Ai1 Ai2 · · · Aik] = Pr[Ai1] Pr[Ai2] · · · Pr[Aik] (B.3)

for every subset {i1, i2, . . . , ik} of {1, 2, . . . , n}.

If the events A1, A2, . . . , An are mutually exclusive and one of them must occur, then

Pr[B] = ∑_{i=1}^{n} Pr[Ai] Pr[B|Ai] (B.4)

for any event B. This formula is called the law of total probability.

Combining (B.2) and (B.4), we obtain

Pr[A|B] = Pr[A] Pr[B|A] / Pr[B] (B.5)

when Pr[A] ≠ 0 and Pr[B] ≠ 0. This is called Bayes' formula in Bayesian statistics.
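A small numeric sketch of (B.4) and (B.5) in Python (the probabilities are made-up values for illustration only):

    # Assumed numbers: Pr[A] = 0.3, Pr[B|A] = 0.8, Pr[B|not A] = 0.1.
    pA, pB_given_A, pB_given_notA = 0.3, 0.8, 0.1

    # Law of total probability (B.4) with the partition {A, not A}.
    pB = pA * pB_given_A + (1 - pA) * pB_given_notA

    # Bayes' formula (B.5).
    pA_given_B = pA * pB_given_A / pB
    print(pB, pA_given_B)  # 0.31 and roughly 0.774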
B.2 Random Variables

A random variable is a variable that takes its value by chance. Random variables are
often written as capital letters such as X, Y, and Z, whereas the observed values of
a random variable are written as lowercase letters such as x, y, and z. For a random
variable X and a real number x, the expression {X ≤ x} denotes the event that X has
a value less than or equal to x. The probability that the event occurs is denoted by
Pr[X ≤ x].
The function FX(x) = Pr[X ≤ x], −∞ < x < ∞, is called the distribution function of X.
If there is a nonnegative function f(x) such that Pr[a < X ≤ b] = ∫_{a}^{b} f(x) dx for all
a < b, then f(x) is called the probability density function of the random variable X. If X
has a probability density function fX(x), then its distribution function FX(x) can be
written as

FX(x) = Pr[X ≤ x] = ∫_{−∞}^{x} fX(y) dy, −∞ < x < ∞. (B.7)
B.3 Major Discrete Distributions

In this section, we simply list the important discrete probability distributions that
appear frequently in bioinformatics.
A Bernoulli trial is a single experiment with two possible outcomes, "success" and
"failure." The Bernoulli random variable X associated with a Bernoulli trial takes
only two possible values, 0 and 1, with the following probability distribution:

Pr[X = 1] = p, Pr[X = 0] = 1 − p,

where p is the probability of success.
The geometric distribution arises from independent Bernoulli trials. Suppose that a
sequence of independent Bernoulli trials are conducted, each trial having probability
p of success. The number of successes prior to the first failure has the geometric
distribution with parameter p.
The probability that there are exactly k successes before the first failure is

Pr[X = k] = p^k (1 − p), k = 0, 1, 2, . . . . (B.10)

By (B.10), the distribution function of the geometric random variable X with
parameter p is

FX(k) = Pr[X ≤ k] = 1 − p^{k+1}, k = 0, 1, 2, . . . . (B.11)
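A quick Monte Carlo check of the geometric distribution in Python (a sketch with an assumed success probability p = 0.6):

    import random

    def successes_before_failure(p):
        """Run Bernoulli trials and count successes before the first failure."""
        k = 0
        while random.random() < p:  # success with probability p
            k += 1
        return k

    p, trials = 0.6, 100000
    samples = [successes_before_failure(p) for _ in range(trials)]
    for k in range(4):
        empirical = samples.count(k) / trials
        exact = p**k * (1 - p)  # Pr[X = k] as in (B.10)
        print(k, round(empirical, 3), round(exact, 3))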
The Poisson distribution is probably the most important discrete distribution. It not
only has elegant mathematical properties but also arises as the law of rare events.
For example, in a binomial distribution, if the number of trials n is large and the
probability of success p for each trial is small such that λ = np remains constant,
then the binomial distribution converges to the Poisson distribution with parameter
λ. A random variable X has the Poisson distribution with parameter λ if

Pr[X = k] = e^{−λ} λ^k / k!, k = 0, 1, 2, . . . . (B.12)

Applying integration by parts, we see that the distribution function of the Poisson
random variable X with parameter λ is

FX(k) = Pr[X ≤ k] = (λ^{k+1} / k!) ∫_{1}^{∞} y^k e^{−λy} dy, k = 0, 1, 2, . . . . (B.13)
For a discrete random variable X, its probability generating function (pgf) is written
G(t) and defined as

G(t) = E(t^X) = ∑_{k∈I} Pr[X = k] t^k, (B.14)

where I is the set of all possible values of X. This sum always converges at t = 1,
where G(1) = 1, and often converges in an open interval containing 1 for all probability
distributions of interest to us. Probability generating functions have the following
two basic properties.

First, probability generating functions are in one-to-one correspondence with
probability distributions: knowing the pgf of a random variable is, in this sense,
equivalent to knowing its probability distribution. In particular, for a nonnegative
integer-valued random variable X, we have

Pr[X = k] = (1/k!) d^k G(t)/dt^k |_{t=0}. (B.15)
Second, the pgf of a sum of independent random variables is the product of the
individual pgfs: if X1, X2, . . . , Xn are independent random variables and

X = X1 + X2 + · · · + Xn,

then GX(t) = GX1(t) GX2(t) · · · GXn(t).
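As a quick check of property (B.15), the following sketch (assuming the SymPy library is available; the distribution is the geometric one from above with p = 3/5, whose pgf is G(t) = (1 − p)/(1 − pt)) recovers Pr[X = 3] from the pgf:

    import sympy as sp

    t = sp.symbols("t")
    p, k = sp.Rational(3, 5), 3
    G = (1 - p) / (1 - p * t)  # pgf of the geometric distribution

    # (B.15): Pr[X = k] = (1/k!) * (k-th derivative of G at t = 0).
    prob = sp.diff(G, t, k).subs(t, 0) / sp.factorial(k)
    print(prob, p**k * (1 - p))  # both equal 54/625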
B.4 Major Continuous Distributions

In this section, we list several important continuous distributions and their simple
properties.
A continuous random variable X has the uniform distribution over an interval [a, b],
a < b, if it has the density function

fX(x) = 1/(b − a), a ≤ x ≤ b, (B.17)

and hence the distribution function

FX(x) = (x − a)/(b − a), a ≤ x ≤ b. (B.18)
A continuous random variable X has the exponential distribution with parameter
λ > 0 if it has the density function fX(x) = λ e^{−λx}, x ≥ 0, and hence the distribution
function FX(x) = 1 − e^{−λx}, x ≥ 0. Let Y be the integer part of X. Then, for
k = 0, 1, 2, . . . ,

Pr[Y = k] = Pr[k ≤ X < k + 1] = e^{−λk} − e^{−λ(k+1)} = (e^{−λ})^k (1 − e^{−λ}).

This shows that Y has the geometric distribution with parameter e^{−λ}.

Applying the definition of conditional probability (B.2), we obtain, for x, t > 0,

Pr[X > t + x | X > t] = Pr[X > t + x] / Pr[X > t] = e^{−λ(t+x)} / e^{−λt} = e^{−λx} = Pr[X > x].

Therefore, we often say that the exponential distribution has the memoryless property.
A continuous random variable X has the normal distribution with parameters µ and
σ > 0 if it has density function

fX(x) = (1/(√(2π) σ)) e^{−(x−µ)²/(2σ²)}, −∞ < x < ∞. (B.21)
The function fX is bell-shaped and symmetric about x = µ . By convention, such a
random variable is called an N(µ , σ 2 ) random variable.
The normal distribution with parameters 0 and 1 is called the standard normal
distribution. Suppose that X is an N(µ, σ²) random variable. Then, the random
variable Z defined by Z = (X − µ)/σ has the standard normal distribution. Because of
this, the value of Z is often called a z-score: If x is the observed value of X, then its
z-score is (x − µ)/σ.
Let a random variable X have a normal distribution. For arbitrary a and b, a <
b, we cannot find the probability Pr[a < X < b] by integrating its density
function in closed form. Instead, we reduce it to a probability statement about the
standard normal distribution and find an accurate approximation of it from tables
that are widely available in probability textbooks.
B.5 Mean, Variance, and Moments

The mean of a discrete random variable X, written µX, is defined as

µX = ∑_{x∈I} x Pr[X = x], (B.22)

where I is the range of X, provided that the sum converges. The mean µX is also
called the expected value of X and hence is often written E(X). If X has the pgf
G(t), then the mean of X is equal to the derivative of G(t) at t = 1. That is,
µX = dG(t)/dt |_{t=1}. (B.23)
The means of random variables with a distribution described in Sections B.3 and
B.4 are listed in Table B.1.
Let X1, X2, . . . , Xn be random variables having the same range I. Then, for any
constants a1, a2, . . . , an,

E(∑_{i=1}^{n} ai Xi) = ∑_{i=1}^{n} ai E(Xi). (B.27)
The variance of a random variable X with mean µX, written σ², is defined as
E((X − µX)²); its square root σ is the standard deviation. For any ε > 0,

σ² = E((X − µX)²) ≥ ε²σ² Pr[|X − µX| ≥ εσ],

or equivalently,

Pr[|X − µX| ≥ εσ] ≤ 1/ε². (B.30)

This inequality is called Chebyshev's inequality. It is very useful because it applies
to random variables of any distribution. The use of Chebyshev's inequality is
called the second moment method.
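An empirical check of Chebyshev's inequality (B.30) in Python, using the uniform distribution over [0, 1] as an assumed example:

    import random

    n = 200000
    xs = [random.random() for _ in range(n)]
    mu = sum(xs) / n
    sigma = (sum((v - mu) ** 2 for v in xs) / n) ** 0.5

    for eps in (1.5, 2.0, 3.0):
        # Empirical tail probability Pr[|X - mu| >= eps * sigma].
        tail = sum(abs(v - mu) >= eps * sigma for v in xs) / n
        print(eps, round(tail, 4), "<=", round(1 / eps**2, 4))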
Let X1, X2, . . . , Xn be n random variables having the same range I and let

X = X1 + X2 + · · · + Xn.

Then

σX² = ∑_{i=1}^{n} σXi² + ∑_{i≠j} cov[Xi, Xj], (B.31)

where cov[Xi, Xj] = E[Xi Xj] − E[Xi] E[Xj] is called the covariance of Xi and Xj. Notice
that cov[Xi, Xj] = 0 if Xi and Xj are independent.
An important technique for studying the moments of a random variable is to use its
moment-generating function. The moment-generating function (mgf) of a random
variable X is written MX(t) and defined as

MX(t) = E(e^{tX}). (B.32)

When MX(t) exists in a neighborhood of 0, the mean and variance of X satisfy

µX = dMX(t)/dt |_{t=0} = d log MX(t)/dt |_{t=0} (B.33)

and

σX² = d²MX(t)/dt² |_{t=0} − µX² = d² log MX(t)/dt² |_{t=0}. (B.34)

Hence, an mgf is useful when it has a simple form. For a random variable having a
probability distribution listed in Sections B.3 and B.4, its mgf can be evaluated in a
simple form. However, this is not always true.
The following property of mgfs plays a critical role in the theory of scoring matrices
and BLAST statistics.
Theorem B.1. Let X be a discrete random variable with MX(t) converging for all t
in (−∞, ∞). Suppose that X takes at least one negative value and at least one positive
value with nonzero probability and that E(X) ≠ 0. Then, there exists a unique nonzero
value θ such that MX(θ) = 1.
Proof. Let a, b > 0 be such that Pr[X = −a] > 0 and Pr[X = b] > 0. By definition,

MX(t) ≥ Pr[X = −a] e^{−at} → ∞ as t → −∞,

and

MX(t) ≥ Pr[X = b] e^{bt} → ∞ as t → +∞.

Moreover, since

d²MX(t)/dt² = E(X² e^{tX}) > 0,

the mgf MX(t) is convex in (−∞, +∞). Recall that MX(0) = 1 and, by (B.33),

dMX(t)/dt |_{t=0} = E(X).

If E(X) < 0, then there exists δ > 0 such that MX(δ) < MX(0) = 1. Because
MX(t) is continuous, there exists θ in the interval (δ, +∞) such that MX(θ) = 1.
If E(X) > 0, then there exists δ < 0 such that MX(δ) < MX(0) = 1. In this case,
there exists θ in the interval (−∞, δ) such that MX(θ) = 1. In either case, the
convexity of MX(t) guarantees that the nonzero root θ is unique.
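To illustrate Theorem B.1, the sketch below finds the nonzero root θ of MX(θ) = 1 by bisection for an assumed two-point score distribution (X = +1 with probability 1/4 and X = −1 with probability 3/4, so E(X) < 0); roots of this kind are essentially how the parameter λ of BLAST statistics is computed:

    from math import exp, log

    scores = {1: 0.25, -1: 0.75}  # assumed score distribution, E(X) < 0

    def mgf(t):
        """M_X(t) = E(e^{tX})."""
        return sum(p * exp(t * x) for x, p in scores.items())

    # M_X is convex with M_X(0) = 1; since E(X) < 0 it dips below 1 and
    # then grows to infinity, so the root theta > 0 can be bisected.
    lo, hi = 0.1, 50.0  # mgf(0.1) < 1 < mgf(50)
    for _ in range(100):
        mid = (lo + hi) / 2
        if mgf(mid) < 1.0:
            lo = mid
        else:
            hi = mid
    print((lo + hi) / 2, log(3))  # exact answer here is theta = ln 3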
B.6 Relative Entropy of Probability Distributions

Suppose that there are two probability distributions P1 and P2 having the same range
I, and let X1 and X2 denote random variables with distributions P1 and P2, respectively.
The relative entropy of P2 with respect to P1 is defined by

H(P2, P1) = ∑_{x∈I} Pr[X2 = x] log (Pr[X2 = x] / Pr[X1 = x]). (B.35)
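Computing (B.35) is a one-liner; the sketch below uses two assumed distributions over the DNA alphabet and reports the relative entropy in bits:

    from math import log

    P1 = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # uniform background
    P2 = {"A": 0.40, "C": 0.10, "G": 0.10, "T": 0.40}  # AT-rich target

    # Relative entropy (B.35), base-2 logarithm.
    H = sum(P2[x] * log(P2[x] / P1[x], 2) for x in P2)
    print(round(H, 4))  # about 0.278 bits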
B.7 Discrete-time Finite Markov Chains

A discrete stochastic process X0, X1, X2, . . . over a finite set of states is a Markov
chain if

Pr[Xt+1 = E | X0 = E0, X1 = E1, . . . , Xt−1 = Et−1, Xt = E′] = Pr[Xt+1 = E | Xt = E′] (B.36)

for each time point t and all states E, E′, E0, E1, . . . , Et−1. In other words, Markov
chains have the following characteristics:

(i) The memoryless property. If at some time point t the process is in state E′,
then the probability that at time point t + 1 it is in state E depends only on state E′,
and not on how the process reached state E′ before time t.

(ii) The time homogeneity property. The probability that the process moves from
state E′ to state E in the time step from t to t + 1 is independent of t.
The importance of Markov chains lies in the fact that many natural phenomena in
physics, biology, and economics can be described by them, and it is enhanced by the
amenability of Markov chains to quantitative analysis.
Let pij be the probability that the process X moves from Ei to Ej in one time
step. The pij's form the so-called transition (probability) matrix of X. We denote the
matrix by PX and write it as

       ⎛ p11 p12 p13 · · · p1n ⎞
       ⎜ p21 p22 p23 · · · p2n ⎟
PX =   ⎜  ⋮    ⋮    ⋮    ⋱    ⋮ ⎟ ,   (B.37)
       ⎝ pn1 pn2 pn3 · · · pnn ⎠
where n is the number of states in the state space. The rows of PX correspond
one-to-one to the states. The entries in any particular row are the probabilities
that the process moves from the corresponding state to each of the states in one
time step and hence must sum to 1. If the row corresponding to state E has 1 in the
diagonal entry, E is an absorbing state in the sense that X will never leave E once it
enters E.
We denote by pij^(k) the probability that X moves from Ei to Ej in k time steps and
set PX^(k) = (pij^(k)). If X is in Ei at time point t and moves to Ej at time point t + k,
then it must be in some state Es at time point t + k − 1. Therefore,

pij^(k) = ∑_s pis^(k−1) psj,

or equivalently

PX^(k) = PX^(k−1) × PX = PX^k. (B.38)
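Equation (B.38) says that k-step transition probabilities are entries of the kth matrix power. A small Python sketch with an assumed two-state chain:

    def mat_mult(A, B):
        """Multiply two square matrices given as lists of rows."""
        n = len(A)
        return [[sum(A[i][s] * B[s][j] for s in range(n)) for j in range(n)]
                for i in range(n)]

    P = [[0.9, 0.1],
         [0.2, 0.8]]  # assumed transition matrix; each row sums to 1

    Pk = P
    for _ in range(4):  # after the loop, Pk = P^5
        Pk = mat_mult(Pk, P)
    print(Pk)  # entry (i, j) is the 5-step transition probability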
In this section, we assume that the Markov chain of interest is finite, aperiodic, and
irreducible. Suppose that a Markov chain has transition probability matrix (pi j ) over
the n states and that at time point t the probability that the process is in state Ei is φi
for i = 1, 2, . . . , n. Then

∑_{i} φi = 1.

From the general conditional probability formula (B.4), at time point t + 1 the
probability that the process is in state Ei becomes ∑_k φk pki, i = 1, 2, . . . , n. If for
every i

φi = ∑_k φk pki, (B.39)
the probability that the process is in a state will never change. Hence, the probability
distribution satisfying (B.39) is called the stationary distribution of the process.
Because the probabilities in each row of P sum to 1, 1 is an eigenvalue of P. The
condition (B.39) can be rewritten as
(φ1 , φ2 , · · · , φn ) = (φ1 , φ2 , · · · , φn )P
and hence implies that the vector (φ1 , φ2 , · · · , φn ) is the left eigenvector of P corre-
sponding to the eigenvalue 1. Moreover, (1, 1, · · · , 1) is the right eigenvector of P
corresponding to 1. Because the Markov chain is aperiodic and irreducible, all other
eigenvalues of P have absolute value less than 1. It follows that, as the number of
steps k increases, P^k approaches the matrix
⎛ 1 ⎞                      ⎛ φ1 φ2 · · · φn ⎞
⎜ 1 ⎟                      ⎜ φ1 φ2 · · · φn ⎟
⎜ ⋮ ⎟ ( φ1 φ2 · · · φn ) = ⎜  ⋮   ⋮   ⋱   ⋮ ⎟ .
⎝ 1 ⎠                      ⎝ φ1 φ2 · · · φn ⎠
One implication of this statement is as follows. No matter what the probability
distribution of the starting state is, the process will be in state Ei with probability
close to φi after sufficiently many steps.
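The convergence just described is easy to observe numerically. Reusing the assumed two-state chain from the previous sketch, repeated application of (B.39) drives any initial distribution toward the stationary one:

    P = [[0.9, 0.1],
         [0.2, 0.8]]

    phi = [1.0, 0.0]  # any initial distribution works
    for _ in range(200):
        # One step of phi <- phi P.
        phi = [sum(phi[k] * P[k][i] for k in range(2)) for i in range(2)]
    print([round(v, 4) for v in phi])  # approaches (2/3, 1/3)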
Analysis of a Markov chain with absorbing states is often reduced to the following
two questions:
(i) What is the probability that the process moves into a particular absorbing
state?
(ii) What is the mean time until the process moves into some absorbing state?
These two basic problems can be answered by applying the so-called first step
analysis method. Suppose that X is a Markov chain with n states E1 , E2 , . . . , En and
the first m states are absorbing states, where m ≥ 1. For j ≤ m, the probability that
X eventually enters E j rather than other absorbing states depends on the initial state
X0. Let

uij = Pr[X eventually enters Ej | X0 = Ei], 1 ≤ j ≤ m, m + 1 ≤ i ≤ n. (B.40)

After the first step, X moves from Ei to state Ek with probability pik. By the law of
total probability (B.4), we have

uij = pij + ∑_{k>m} pik ukj, 1 ≤ j ≤ m, m + 1 ≤ i ≤ n. (B.41)
Solving this difference equation usually gives an answer to the first basic question.
To answer the second basic question, we let

µi = E[number of steps taken until X enters an absorbing state | X0 = Ei], m + 1 ≤ i ≤ n.

After step 1, if X1 is an absorbing state, then no further steps are required. If, on the
other hand, X1 is a non-absorbing state Ek, then the process behaves as if it had
started at Ek, and, on average, µk additional steps are required for entering an absorbing
state. Weighting these possibilities by their respective probabilities, we have

µi = 1 + ∑_{k>m} pik µk, m + 1 ≤ i ≤ n. (B.42)
Random walks are special cases of Markov chain processes. A simple random walk
is just a Markov chain process whose state space is a finite or infinite subset of
consecutive integers: 0, 1, . . . , c, in which the process, if it is in state k, can either
stay in k or move to one of the neighboring states k − 1 and k + 1. The states 0 and
c are often absorbing states.
A classic example of a random walk process concerns a gambler. Suppose a gambler
who initially has k dollars plays a series of games. In each game, he has probability
p of winning one dollar and probability q = 1 − p of losing one dollar. The amount
of money Yn that he has after n games is a random walk process.
In a general random walk, the process may stay put or move to one of m > 2 nearby
states. Although random walks can be analyzed as Markov chains, the special features
of a random walk allow simpler methods of analysis. For example, the moment-generating
function approach is a powerful tool for the analysis of simple random walks.
Markov chain processes satisfy the memoryless and time homogeneity properties.
In bioinformatics, more general Markov chain processes are used to model gene-
coding sequences. High-order Markov chains relax the memoryless property. A dis-
crete stochastic process is a kth-order Markov chain if

Pr[Xt+1 = E | X0 = E0, X1 = E1, . . . , Xt = Et] = Pr[Xt+1 = E | Xt−k+1 = Et−k+1, . . . , Xt = Et] (B.43)

for each time point t and all states E, E0, E1, . . . , Et. In other words, the probability
that the process is in a state at the next time point depends only on the states at the
last k time points.
It is not hard to see that a kth-order Markov chain is completely defined by the
initial distribution and the transition probabilities of the form (B.43).
B.8 Recurrent Events and the Renewal Theorem

The discrete renewal theory is a major branch of classical probability theory. It concerns
recurrent events occurring in repeated trials. In this section, we state two basic
theorems of renewal theory. The interested reader is referred to the book [68] of
Feller for their proofs.
Consider an infinite sequence of repeated trials with possible outcomes Xi (i =
1, 2, . . .). Let E be an event defined by an attribute of finite sequences of possible
outcomes Xi . We say that E occurs at the ith trial if the outcomes X1 , X2 , . . . , Xi
have the attribute. E is a recurrent event if, under the condition that E occurs
at the ith trial, Pr[X1 , X2 , . . . , Xi+k have the attribute] is equal to the product of
Pr[X1 , X2 , . . . , Xi have the attribute] and Pr[Xi+1 , Xi+2 , . . . , Xi+k have the attribute].
For instance, the event that a success is immediately followed by a failure is a
recurrent event in a sequence of Bernoulli trials.
For a recurrent event E, we are interested in two probability distributions: the
distribution of the waiting time until E occurs for the first time, and the distribution
of the number of occurrences of E in the first k trials. Let fi denote the probability
that E occurs for the first time at the ith trial. The mean recurrence time of E is

µE = ∑_{i} i fi. (B.44)

Let Ii be the indicator random variable that takes value 1 if E occurs at the ith
trial and 0 otherwise. Then

Xk = I1 + I2 + · · · + Ik

denotes the number of occurrences of E in the first k trials. By the linearity property
of the mean, we have

E(Xk) = ∑_{i=1}^{k} Pr[E occurs at the ith trial].

Consider two sequences of nonnegative real numbers

b0, b1, . . . ,
f1, f2, . . . ,

and define the sequence v0, v1, . . . through the convolution equations

vn = bn + vn−1 f1 + · · · + v0 fn, n = 0, 1, . . . .
The following theorem provides conditions under which the sequence vn defined
through the convolution equations converges as n goes to infinity.
References
1. Altschul, S.F. (1989) Generalized affine gap costs for protein sequence alignment. Proteins
32, 88-96.
2. Altschul, S.F. (1989) Gap costs for multiple sequence alignment. J. Theor. Biol. 138, 297-
309.
3. Altschul, S.F. (1991) Amino acid substitution matrices from an information theoretic per-
spective. J Mol. Biol. 219, 555-65.
4. Altschul, S.F., Boguski, M.S., Gish, W., and Wootton, J.C. (1994) Issues in searching molec-
ular sequence databases. Nat. Genet. 6, 119-129.
5. Altschul, S.F., Bundschuh, R., Olsen, R., and Hwa, T. (2001) The estimation of statistical
parameters for local alignment score distributions. Nucleic Acids Res. 29, 351-361.
6. Altschul, S.F. and Gish, W. (1996) Local alignment statistics. Methods Enzymol. 266, 460-
480.
7. Altschul, S.F., Gish, W., Miller, W., Myers, E., and Lipman, D.J. (1990) Basic local align-
ment search tool. J. Mol. Biol. 215, 403-410.
8. Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman,
D.J. (1997) Gapped Blast and Psi-Blast: a new generation of protein database search pro-
grams. Nucleic Acids Res. 25, 3389-3402.
9. Altschul, S.F., Wootton, J.C., Gertz, E.M., Agarwala, R., Morgulis, A., Schaffer, A.A., and
Yu, Y.K. (2005) Protein database searches using compositionally adjusted substitution ma-
trices. FEBS J. 272, 5101-5109.
10. Arratia, R., Gordon, L., and Waterman, M.S. (1986) An extreme value theory for sequence
matching. Ann. Stat. 14, 971-983.
11. Arratia, R., Gordon, L., and Waterman, M.S. (1990) The Erdős-Rényi law in distribution,
for coin tossing and sequence matching. Ann. Stat. 18, 539-570.
12. Arratia, R. and Waterman, M.S. (1985) An Erdős-Rényi law with shifts. Adv. Math. 55,
13-23.
13. Arratia, R. and Waterman, M.S. (1986) Critical phenomena in sequence matching. Ann.
Probab. 13, 1236-1249.
14. Arratia, R. and Waterman, M.S. (1989) The Erdős-Rényi strong law for pattern matching
with a given proportion of mismatches. Ann. Probab. 17, 1152-1169.
15. Arratia, R. and Waterman, M.S. (1994) A phase transition for the scores in matching random
sequences allowing deletions. Ann. Appl. Probab. 4, 200-225.
16. Arvestad, L. (2006), Efficient method for estimating amino acid replacement rates. J. Mol.
Evol. 62, 663-673.
17. Baase, S. and Gelder, A.V. (2000) Computer Algorithms - Introduction to Design and Anal-
ysis. Addison-Wesley Publishing Company, Reading, Massachusetts.
18. Bafna, V., Lawler, E.L., and Pevzner, P.A. (1997) Approximation algorithms for multiple
sequence alignment. Theor. Comput. Sci. 182, 233-244.
19. Bafna, V. and Pevzner, P.A. (1996) Genome rearrangements and sorting by reversals. SIAM
J. Comput. 25, 272-289.
20. Bahr, A., Thompson, J.D., Thierry, J.C., and Poch, O. (2001) BAliBASE (Benchmark Align-
ment dataBASE): enhancements for repeats. transmembrane sequences and circular permu-
tations. Nucleic Acids Res. 29, 323-326.
21. Bailey, T.L. and Gribskov, M. (2002) Estimating and evaluating the statistics of gapped
local-alignment scores. J. Comput. Biol. 9, 575-593.
22. Balakrishnan, N. and Koutras, M.V. (2002) Runs and Scans with Applications. John Wiley
& Sons, New York.
23. Batzoglou, S. (2005) The many faces of sequence alignment. Brief. Bioinform. 6, 6-22.
24. Bauer, M., Klau, G.W., and Reinert, K. (2007) Accurate multiple sequence-structure align-
ment of RNA sequences using combinatorial optimization. BMC Bioinformatics 8, 271.
25. Bellman, R. (1957) Dynamic Programming. Princeton University Press, Princeton, New Jer-
sey.
26. Benner, S.A., Cohen, M.A., and Gonnet, G.H. (1993) Empirical and structural models for
insertions and deletions in the divergent evolution of proteinsm. J. Mol. Biol. 229, 1065-
1082.
27. Bentley, J. (1986) Programming Pearls. Addison-Wesley Publishing Company, Reading,
Massachusetts.
28. Bray, N. and Pachter, L. (2004) MAVID: Constrained ancestral alignment of multiple se-
quences. Genome Res. 14, 693-699.
29. Brejová, B., Brown, D., and Vinař, T. (2004) Optimal spaced seeds for homologous coding
regions. J. Bioinform. Comput. Biol. 1, 595-610.
30. Brejová, B., Brown, D., and Vinař, T. (2005) Vector seeds: an extension to spaced seeds. J.
Comput. Sys. Sci. 70, 364-380.
31. Brown, D.G. (2005) Optimizing multiple seed for protein homology search. IEEE/ACM
Trans. Comput. Biol. and Bioinform. 2, 29-38.
32. Brudno, M., Do, C., Cooper, G., Kim, M.F., Davydov, E., Green, E.D., Sidow, A., and Bat-
zoglou, S. (2003) LAGAN and Multi-LAGAN: efficient tools for large-scale multiple align-
ment of genomic DNA. Genome Res. 13, 721-731.
33. Buhler, J. (2001) Efficient large-scale sequence comparison by locality-sensitive hashing.
Bioinformatics 17, 419-428.
34. Buhler, J., Keich, U., and Sun, Y. (2005) Designing seeds for similarity search in genomic
DNA. J. Comput. Sys. Sci. 70, 342-363.
35. Bundschuh, R. (2002) Rapid significance estimation in local sequence alignment with gaps.
J. Comput. Biol. 9, 243-260.
36. Burkhardt, S. and Kärkkäinen, J. (2003) Better filtering with gapped q-grams. Fund. Inform.
56, 51-70.
37. Califano, A. and Rigoutsos, I. (1993) FLASH: A fast look-up algorithm for string homology.
In Proc. 1st Int. Conf. Intell. Sys. Mol. Biol., AAAI Press, pp. 56-64.
38. Carrilo, H. and Lipman, D. (1988) The multiple sequence alignment problem in biology.
SIAM J. Applied Math. 48, 1073-1082.
39. Chan, H.P. (2003) Upper bounds and importance sampling of p-values for DNA and protein
sequence alignments. Bernoulli 9, 183-199
40. Chao, K.-M., Hardison, R.C., and Miller, W. (1993) Locating well-conserved regions within
a pairwise alignment. Comput. Appl. Biosci. 9, 387-396.
41. Chao, K.-M., Hardison, R.C., and Miller, W. (1994) Recent developments in linear-space
alignment methods: a survey. J. Comput. Biol. 1, 271-291.
42. Chao, K.-M. and Miller, W. (1995) Linear-space algorithms that build local alignments from
fragments. Algorithmica 13, 106-134.
43. Chao, K.-M., Pearson, W.R., and Miller, W. (1992) Aligning two sequences within a speci-
fied diagonal band. Comput. Appl. Biosci. 8, 481-487.
44. Chen, L. (1975) Poisson approximation for dependent trials. Ann. Probab. 3, 534-545.
45. Chiaromonte, F., Yap, V.B., and Miller, W. (2002) Scoring pairwise genomic sequence align-
ments. In Proc. Pac. Symp. Biocomput., 115-126.
46. Chiaromonte, F., Yang, S., Elnitski, L., Yap, V.B., Miller, W., and Hardison, R.C. (2001)
Association between divergence and interspersed repeats in mammalian noncoding genomic
DNA. Proc. Nat’l. Acad. Sci. USA 98, 14503-8.
47. Choi, K.P. and Zhang, L.X. (2004) Sensitivity analysis and efficient method for identifying
optimal spaced seeds. J. Comput. Sys. Sci. 68, 22-40.
48. Choi, K.P., Zeng, F., and Zhang L.X. (2004) Good spaced seeds for homology search. Bioin-
formatics 20, 1053-1059.
49. Chvátal, V. and Sankoff, D. (1975) Longest common subsequences of two random sequences.
J. Appl. Probab. 12, 306-315.
50. Coles, S. (2001) An Introduction to Statistical Modeling of Extreme Values. Springer-Verlag,
London, UK.
51. Collins, J.F., Coulson, A.F.W., and Lyall, A. (1998) The significance of protein sequence
similarities. Comput. Appl. Biosci. 4, 67-71.
52. Cormen, T.H., Leiserson, C.E., Rivest, R.L., and Stein, C. (2001) Introduction to Algorithms.
The MIT Press, Cambridge, Massachusetts.
53. Csürös, M., and Ma, B. (2007) Rapid homology search with neighbor seeds. Algorithmica
48, 187-202.
54. Darling, A., Treangen, T., Zhang, L.X., Kuiken, C., Messeguer, X., and Perna, N. (2006)
Procrastination leads to efficient filtration for local multiple alignment. In Proc. 6th Int.
Workshop Algorithms Bioinform. Lecture Notes in Bioinform., vol. 4175, pp.126-137.
55. Dayhoff, M.O., Schwartz, R.M., and Orcutt, B.C. (1978). A model of evolutionary changes
in proteins. In Atlas of Protein Sequence and Structure vol. 5, suppl 3 (ed. M.O. Dayhoff),
345-352, Nat’l Biomed. Res. Found, Washington.
56. Dembo, A., Karlin, S., and Zeitouni, O. (1994) Critical phenomena for sequence matching
with scoring. Ann. Probab. 22, 1993-2021.
57. Dembo, A., Karlin, S., and Zeitouni, O. (1994) Limit distribution of maximal non-aligned
two sequence segmental score. Ann. Probab. 22, 2022-2039.
58. Deonier, R.C., Tavaré, S., and Waterman, M.S. (2005), Computational Genome Analysis.
Springer, New York.
59. Do, C.B., Mahabhashyam, M.S.P., Brudno, M., and Batzoglou, S. (2005) PROBCONS:
Probabilistic consistency-based multiple sequence alignment. Genome Res. 15, 330-340.
60. Dobzhansky, T. and Sturtevant, A.H. (1938) Inversions in the chromosomes of Drosophila
pseudoobscura. Genetics 23, 28-64.
61. Durbin, R., Eddy, S., Krogh, A., and Mitichison, G. (1998) Biological Sequence Analysis:
Probabilistic Models of Protein and Nucleic Acids. Cambridge University Press, Cambridge,
UK.
62. Edgar, R.C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high
throughput. Nucleic Acids Res. 32, 1792-1797.
63. Edgar, R.C. (2004) MUSCLE: a multiple sequence alignment method with reduced time and
space complexity. BMC Bioinformatics 5, no. 113.
64. Ewens, W.J. and Grant, G.R. (2001) Statistical Methods in Bioinformatics: An Introduction.
Springer-Verlag, New York.
65. Farach-Colton, M., Landau, G., Sahinalp, S.C., and Tsur, D. (2007) Optimal spaced seeds
for faster approximate string matching. J. Comput. Sys. Sci. 73, 1035-1044.
66. Fayyaz, A.M., Mercier, S., Ferré, Hassenforder, C. (2008) New approximate P-value of
gapped local sequence alignments. C. R. Acad. Sci. Paris Ser. I 346, 87-92.
67. Feller, W. (1966) An Introduction to Probability Theory and Its Applications. Vol. 2 (1st
edition), John Wiley & Sons, New York.
68. Feller, W. (1968) An Introduction to Probability Theory and Its Applications. Vol. I (3rd
edition), John Wiley & Sons, New York.
69. Feng, D.F. and Doolittle, R.F. (1987) Progressive sequence alignment as a prerequisite to
correct phylogenetic trees. J. Mol. Evol. 25, 351-360.
70. Feng, D.F. Johnson, M.S. and Doolittle, R.F. (1985) Aligning amino acid sequences: com-
parison of commonly used methods. J. Mol. Evol. 21, 112-125.
71. Fitch, W.M. and Smith, T.F. (1983) Optimal sequence alignments. Proc. Nat’l. Acad. Sci.
USA 80, 1382-1386.
72. Flannick, J., and Batzoglou, S. (2005) Using multiple alignments to improve seeded local
alignment algorithms. Nucleic Acids Res. 33, 4563-4577.
73. Florea, L., Hartzell, G., Zhang, Z., Rubin, G.M., and Miller W. (1998) A computer program
for aligning a cDNA sequence with a genomic DNA sequence. Genome Res. 8, 967-974.
74. Gertz, E.M. (2005) BLAST scoring parameters.
Manuscript(ftp://ftp.ncbi.nlm.nih.gov/blast/documents/developer/scoring.pdf).
75. Giegerich, R. and Kurtz, S. (1997) From Ukkonen to McCreight and Weiner: A unifying
view of linear-time suffix tree construction. Algorithmica 19, 331-353.
76. Gilbert, W. (1991) Towards a paradigm shift in biology. Nature 349, 99.
77. Gonnet, G.H., Cohen, M.A., and Benner, S.A. (1992) Exhaustive matching of the entire
protein sequence database. Science 256, 1443-5.
78. Gotoh, O. (1982) An improved algorithm for matching biological sequences. J. Mol. Biol.
162, 705-708.
79. Gotoh, O. (1989) Optimal sequence alignment allowing for long gaps. Bull. Math. Biol. 52,
359-373.
80. Gotoh, O. (1996) Significant improvement in accuracy of multiple protein sequence align-
ments by iterative refinement as assessed by reference to structural alignments. J. Mol. Biol.
264, 823-838.
81. Gribskov, M., Luthy, R., and Eisenberg, D. (1990) Profile analysis. In R.F. Doolittle (ed.)
Molecular Evolution: Computer Analysis of Protein and Nucleic Acid Sequences, Methods
in Enzymol., vol. 183, Academic Press, New York, pp. 146-159.
82. Grossman, S. and Yakir, B. (2004) Large deviations for global maxima of independent su-
peradditive processes with negative drift and an application to optimal sequence alignment.
Bernoulli 10, 829-845.
83. Guibas, L.J. and Odlyzko, A.M. (1981) String overlaps, pattern matching, and nontransitive
games. J. Combin. Theory (Series A) 30, 183-208.
84. Gupta, S.K., Kececioglu, J., and Schaffer, A.A. (1995) Improving the practical space and
time efficiency of the shortest-paths approach to sum-of-pairs multiple sequence alignment.
J. Comput. Biol. 2, 459-472.
85. Gusfield, D. (1997) Algorithms on Strings, Trees, and Sequences. Cambridge University
Press, Cambridge, UK.
86. Hannenhalli, S. and Pevzner, P.A. (1999) Transforming cabbage into turnip (polynomial
algorithm for sorting signed permutations by reversals). J. Assoc. Comput. Mach. 46, 1-27.
87. Hardy, G., Littlewood, J.E., and Pólya, G. (1952) Inequalities, Cambridge University Press,
Cambridge, UK.
88. Henikoff, S. and Henikoff, J.G. (1992) Amino acid substitution matrices from protein blocks.
Proc. Nat'l Acad. Sci. USA 89, 10915-10919.
89. Henikoff, S. and Henikoff, J.G. (1993) Performance evaluation of amino acid substitution
matrices. Proteins 17, 49-61.
90. Hirosawa, M., Totoki, Y., Hoshida, M., and Ishikawa, M. (1995) Comprehensive study on
iterative algorithms of multiple sequence alignment. Comput. Appl. Biosci. 11, 13-18.
91. Hirschberg, D.S. (1975) A linear space algorithm for computing maximal common subse-
quences. Comm. Assoc. Comput. Mach. 18, 341-343.
92. Huang, X. and Miller, W. (1991) A time-efficient, linear-space local similarity algorithm.
Adv. Appl. Math. 12, 337-357.
93. Huang, X. and Chao, K.-M. (2003) A generalized global alignment algorithm. Bioinformat-
ics 19, 228-233.
94. Huffman, D.A. (1952) A method for the construction of minimum-redundancy codes. Proc.
IRE 40, 1098-1101.
95. Ilie, L., and Ilie, S. (2007) Multiple spaced seeds for homology search. Bioinformatics 23,
2969-2977
96. Indyk, P. and Motwani, R. (1998) Approximate nearest neighbors: towards removing the
curse of dimensionality. In Proc. 30th Ann. ACM Symp. Theory Comput., 604-613.
97. Jones D.T., Taylor, W.R., and Thornton, J.M. (1992) The rapid generation of mutation data
matrices from protein sequences. Comput. Appl. Biosci. 8, 275-82.
98. Jones, N.C. and Pevzner, P.A. (2004) Introduction to Bioinformatics Algorithms. The MIT
Press, Cambridge, Massachusetts.
99. Karlin, S. (2005) Statistical signals in bioinformatics. Proc Nat’l Acad Sci USA. 102, 13355-
13362.
100. Karlin, S. and Altschul, S.F. (1990) Methods for assessing the statistical significance of
molecular sequence features by using general scoring schemes. Proc. Nat’l. Acad. Sci. USA
87, 2264-2268.
101. Karlin, S. and Altschul, S.F. (1993) Applications and statistics for multiple high-scoring
segments in molecular sequences. Proc. Nat'l Acad. Sci. USA 90, 5873-5877.
102. Karlin, S. and Dembo, A. (1992) Limit distribution of maximal segmental score among
Markov-dependent partial sums. Adv. Appl. Probab. 24, 113-140.
103. Karlin, S. and Ost, F. (1988) Maximal length of common words among random letter se-
quences. Ann. Probab. 16, 535-563.
104. Karolchik, D., Kuhn, R.M., Baertsch, R., Barber, G.P., Clawson, H., Diekhans, M., Giardine,
B., Harte, R.A., Hinrichs, A.S., Hsu, F., Miller, W., Pedersen, J.S., Pohl, A., Raney, B.J.,
Rhead, B., Rosenbloom, K.R., Smith, K.E., Stanke, M., Thakkapallayil, A., Trumbower,
H., Wang, T., Zweig, A.S., Haussler, D., Kent, W.J. (2008) The UCSC genome browser
database: 2008 update. Nucleic Acids Res. 36, D773-779.
105. Karp, R.M., and Rabin, M.O. (1987) Efficient randomized pattern-matching algorithms.
IBM J. Res. Dev. 31, 249-260.
106. Katoh, K., Kuma, K., Toh, H., and Miyata, T. (2005) MAFFT version 5: improvement in
accuracy of multiple sequence alignment, Nucleic Acids Res. 20, 511-518.
107. Kececioglu, J. and Starrett, D. (2004) Aligning alignments exactly. In Proc. RECOMB, 85-
96.
108. Keich, U., Li, M., Ma, B., and Tromp, J. (2004) On spaced seeds for similarity search.
Discrete Appl. Math. 3, 253-263.
109. Kent, W.J. (2002) BLAT: The BLAST-like alignment tool. Genome Res. 12, 656-664.
110. Kent, W.J. and Zahler, A.M. (2000) Conservation, regulation, synteny, and introns in a large-
scale C. briggsae-C. elegans Genomic Alignment. Genome Res. 10, 1115-1125.
111. Kimura, M. (1983) The Neutral Theory of Molecular Evolution, Cambridge University
Press, Cambridge, UK.
112. Kisman, D., Li, M., Ma, B., and Wang, L. (2005) tPatternHunter: gapped, fast and sensitive
translated homology search. Bioinformatics 21, 542-544.
113. Knuth, D.E. (1973) The art of computer programming. Vol. 1, Addison-Wesley Publishing
Company, Reading, Massachusetts.
114. Knuth, D.E. (1973) The art of computer programming. Vol. 3, Addison-Wesley Publishing
Company, Reading, Massachusetts.
115. Kong, Y. (2007), Generalized correlation functions and their applications in selection of
optimal multiple spaced seeds for homology search. J. Comput. Biol. 14, 238-254.
116. Korf, I., Yandell, M., and Bedell, J. (2003) BLAST. O'Reilly, USA.
117. Kschischo, M., Lässig, M., and Yu, Y.-K. (2005) Towards an accurate statistics of gapped
alignment. Bull. Math. Biol. 67, 169-191.
118. Kucherov, G., Noé, L., and Roytberg, M. (2005) Multiseed lossless filtration. IEEE/ACM
Trans. Comput. Biol. Bioinform. 2, 51-61.
119. Kurtz, S., Phillippy, A., Delcher, A.L., Smoot, M., Shumway, M., Antonescu, C., and
Salzberg S.L. (2004) Versatile and open software for comparing large genomes. Genome
Biology 5, R12.
120. Larkin, M.A., Blackshields, G., Brown, N.P., Chenna, R., McGettigan, P.A., McWilliam, H.,
Valentin, F., Wallace, I.M., Wilm, A., Lopez, R., Thompson, J.D., Gibson, T.J., and Higgins,
D.G. (2007) ClustalW and ClustalX version 2. Bioinformatics 23, 2947-2948.
121. Lassmann, T. and Sonnhammer, L.L. (2005) Kalign - an accurate and fast multiple sequence
alignment algorithm. BMC Bioinformatics 6, 298.
122. Letunic, I., Copley, R.R., Pils, B., Pinkert, S., Schultz, J., Bork, P. (2006) SMART 5: domains
in the context of genomes and networks. Nucleic Acids Res. 34, D257-260.
123. Li, M., Ma, B., Kisman, D., and Tromp, J. (2004) PatternHunter II: Highly sensitive and fast
homology search. J. Bioinform. Comput. Biol. 2, 417-439.
124. Li, M., Ma, B. Kisman, D. and Tromp, J. (2004) PatternHunter II: Highly Sensitive and Fast
Homology Search. J. Bioinform Comput. Biol. 2 (3),417-439.
125. Li, M., Ma, B. and Zhang, L.X. (2006) Superiority and complexity of spaced seeds. In Proc.
17th SIAM-ACM Symp. Discrete Algorithms. 444-453.
126. Li, W.H., Wu, C.I., and Luo, C.C. (1985) A new method for estimating synonymous and
nonsynonymous rates of nucleotide substitution considering the relative likelihood of nu-
cleotide and codon changes. Mol. Biol. Evol. 2, 150-174.
127. Lipman, D.J., Altschul, S.F., and Kececioglu, J.D. (1989) A tool for multiple sequence align-
ment. Proc. Nat’l. Acad. Sci. USA 86, 4412-4415.
128. Lipman, D.J. and Pearson, W.R. (1985) Rapid and sensitive protein similarity searches. Sci-
ence 227, 1435-1441.
129. Lothaire, M. (2005) Applied Combinatorics on Words. Cambridge University Press, Cam-
bridge, UK.
130. Ma, B. and Li, M. (2007) On the complexity of spaced seeds. J. Comput. Sys. Sci. 73, 1024-
1034.
131. Ma, B., Tromp, J., and Li, M. (2002) PatternHunter-faster and more sensitive homology
search. Bioinformatics 18, 440-445.
132. Ma, B., Wang, Z., and Zhang, K. (2003) Alignment between two multiple alignments. Proc.
Combin. Pattern Matching, 254-265.
133. Manber, U. (1989) Introduction to Algorithms. Addison-Wesley Publishing Company, Read-
ing, Massachusetts.
134. Manber, U. and Myers, E. (1991) Suffix arrays: a new method for on-line string searches.
SIAM J. Comput. 22, 935-948.
135. McCreight, E.M. (1976) A space-economical suffix tree construction algorithm. J. Assoc.
Comput. Mach. 23, 262-272.
136. McLachlan, A.D. (1971) Tests for comparing related amino-acid sequences. Cytochrome c
and cytochrome c551. J. Mol. Biol. 61, 409-424.
137. Metzler, D. (2006) Robust E-values for gapped local alignment. J. Comput. Biol. 13, 882-
896.
138. Miller, W. (2001) Comparison of genomic DNA sequences: solved and unsolved problems.
Bioinformatics 17, 391-397.
139. Miller, W. and Myers, E. (1988) Sequence comparison with concave weighting functions.
Bull. Math. Biol. 50, 97-120.
140. Miller, W., Rosenbloom, K., Hardison, R.C., Hou, M., Taylor, J., Raney, B., Burhans, R.,
King, D.C., Baertsch, R., Blankenberg, D., Kosakovsky Pond, S.L., Nekrutenko, A., Giar-
dine, B., Harris, R.S., Tyekucheva, S., Diekhans, M., Pringle, T.H., Murphy, W.J., Lesk, A.,
Weinstock, G.M., Lindblad-Toh, K., Gibbs, R.A., Lander, E.S., Siepel, A., Haussler, D., and
Kent, W.J. (2007) 28-way vertebrate alignment and conservation track in the UCSC Genome
Browser. Genome Res. 17,1797-1808.
141. Mitrophanov, A.Y. and Borodovsky, M. (2006) Statistical significance in biological sequence
analysis. Brief. Bioinform. 7, 2-24.
142. Mohana Rao, J.K. (1987) New scoring matrix for amino acid residue exchanges based on
residue characteristic physical parameters. Int. J. Peptide Protein Res. 29, 276-281.
143. Morgenstern, B., French, K., Dress, A., and Werner, T. (1998) DIALIGN: Finding local
similarities by multiple sequence alignment. Bioinformatics 14, 290-294.
144. Mott, R. (1992) Maximum-likelihood estimation of the statistical distribution of Smith-
Waterman local sequence similarity scores. Bull. Math. Biol. 54, 59-75.
145. Mott, R. (2000) Accurate formula for P-values for gapped local sequence and profile align-
ment. J. Mol. Biol. 276, 71-84.
146. Mott, R and Tribe, R. (1999) Approximate statistics of gapped alignments. J. Comput. Biol.
6, 91-112.
147. Müller, T. and Vingron, M. (2000) Modeling amino acid replacement. J. Comput. Biol. 7,
761-776.
148. Müller, T., Spang, R., and Vingron, M. (2002) Estimating amino acid substitution models:
A comparison of Dayoff’s estimator, the resolvent approach and a maximum likelihood
method. Mol. Biol. Evol. 19, 8-13.
149. Myers, G. (1999) Whole-Genome DNA sequencing. Comput. Sci. Eng. 1, 33-43.
150. Myers, E. and Miller, W. (1988) Optimal alignments in linear space. Comput. Appl. Biosci.
4, 11-17.
151. Needleman, S.B. and Wunsch, C.D. (1970) A general method applicable to the search for
similarities in the amino acid sequences of two proteins. J. Mol. Biol. 48, 443-453.
152. Nicodème, P., Salvy, B., and Flajolet, P. (1999) Motif Statistics. In Lecture Notes in Comput.
Sci., vol. 1643, 194-211, New York.
153. Noé, L. and Kucherov, G. (2004) Improved hit criteria for DNA local alignment. BMC
Bioinformatics 5, no. 159.
154. Notredame, C. (2007) Recent evolutions of multiple sequence alignment algorithms. PLoS
Comput. Biol. 3, e123.
155. Notredame, C., Higgins, D., and Heringa, J. (2000) T-Coffee: A novel method for multiple
sequence alignments. J. Mol. Biol. 302, 205-217.
156. Overington, J., Donnelly, D., Johnson, M.S., Sali, A., and Blundell, T.L. (1992)
Environment-specific amino acid substitution tables: tertiary templates and prediction of
protein folds. Protein Sci. 1, 216-26.
157. Park, Y. and Spouge, J.L. (2002) The correction error and finite-size correction in an un-
gapped sequence alignment. Bioinformatics 18, 1236-1242.
158. Pascarella, S. and Argos, P. (1992) Analysis of insertions/deletions in protein structures, J.
Mol. Biol. 224, 461-471.
159. Pearson, W.R. (1995) Comparison of methods for searching protein sequence databases.
Protein Science 4, 1145-1160.
160. Pearson, W.R. (1998) Empirical statistical estimates for sequence similarity searches. J. Mol.
Biol. 276, 71-84.
161. Pearson, W.R. and Lipman, D. (1988) Improved tools for biological sequence comparison.
Proc. Nat’l. Acad. Sci USA 85, 2444-2448.
162. Pearson, W.R. and Wood, T.C. (2003) Statistical Significance in Biological Sequence Com-
parison. In Handbook of Statistical Genetics (edited by D.J. Balding, M. Bishop and C.
Cannings), 2nd Edition. John Wiley & Sons, West Sussex, UK.
163. Pei, J. and Grishin, N.V. (2006) MUMMALS: multiple sequence alignment improved by
using hidden Markov models with local structural information. Nucleic Acids Res. 34, 4364-
4374.
164. Pei, J. and Grishin, N.V. (2007) PROMALS: towards accurate multiple sequence alignments
of distantly related proteins. Bioinformatics 23, 802-808.
165. Pevzner, P.A. (2000) Computational Molecular Biology: An Algorithmic Approach. The
MIT Press, Cambridge, Massachusetts.
166. Pevzner, P.A. and Tesler, G. (2003) Genome rearrangements in mammalian evolution:
lessons from human and mouse genomes. Genome Res. 13, 37-45.
167. Pevzner, P.A. and Waterman, M.S. (1995) Multiple filtration and approximate pattern match-
ing. Algorithmica 13, 135-154.
168. Preparata, F.P., Zhang, L.X., and Choi, K.P. (2005) Quick, practical selection of effective
seeds for homology search. J. Comput. Biol. 12, 1137-1152.
169. Reich, J.G., Drabsch, H., and Däumler, A. (1984) On the statistical assessment of similarities
in DNA sequences. Nucleic Acids Res. 12, 5529-5543.
170. Rényi, A. (1970) Foundations of Probability. Holden-Day, San Francisco.
171. Reinert, G., Schbath, S., and Waterman, M.S. (2000), Probabilistic and statistical properties
of words: An overview. J. Comput. Biol. 7, 1 - 46.
172. Risler, J.L., Delorme, M.O., Delacroix, H., and Henaut, A. (1988) Amino Acid substitutions
in structurally related proteins. A pattern recognition approach. Determination of a new and
efficient scoring matrix. J. Mol. Biol. 204, 1019-1029.
173. Robinson, A.B. and Robinson, L.R. (1991) Distribution of glutamine and asparagine
residues and their near neighbors in peptides and proteins. Proc. Nat’l. Acad. Sci USA 88,
8880-8884.
174. Sankoff, D. (2000) The early introduction of dynamic programming into computational bi-
ology. Bioinformatics 16, 41-47.
175. Sankoff, D. and Kruskal, J.B. (eds.) (1983). Time Warps, String Edits, and Macro-
molecules: the Theory and Practice of Sequence Comparisons. Addison-Wesley, Reading,
Massachusetts.
176. Schwager, S.J. (1983) Run probabilities in sequences of Markov-dependent trials, J. Amer.
Stat. Assoc. 78, 168-175.
177. Schwartz, S., Kent, W.J., Smit, A., Zhang, Z., Baertsch, R., Hardison, R.C., Haussler, D.,
and Miller, W. (2003) Human-mouse alignment with BLASTZ. Genome Res. 13, 103-107.
178. Schwartz, S., Zhang, Z., Frazer, K.A., Smit, A., Riemer, C., Bouck, J., Gibbs, R., Hardi-
son, R.C., and Miller, W. (2000) PipMaker - a web server for aligning two genomic DNA
sequences. Genome Res. 10, 577-86.
179. Siegmund, D. and Yakir, B. (2000) Approximate p-value for local sequence alignments.
Ann. Stat. 28, 657-680.
180. Smith, T.F. and Waterman, M.S., (1981) Identification of common molecular subsequences.
J. Mol. Biol. 147, 195-197.
181. Smith, T.F., Waterman, M.S., and Burks, C. (1985) The statistical distribution of nucleic
acid similarities. Nucleic Acids Res. 13, 645-656.
182. Spang, R. and Vingron, M. (1998) Statistics of large-scale sequences searching. Bioinfor-
matics 14, 279-284.
183. Spitzer, F. (1960) A tauberian theorem and its probability interpretation. Trans. Amer Math
Soc. 94, 150-169.
184. States, D.J., Gish, W., and Altschul, S.F. (1991) Improved sensitivity of nucleic acid
database searches using application-specific scoring matrices. Methods 3, 61-70.
185. Sun, Y. and Buhler, J. (2005) Designing multiple simultaneous seeds for DNA similarity
search. J. Comput. Biol. 12, 847-861.
186. Sun, Y., and Buhler, J. (2006) Choosing the best heuristic for seeded alignment of DNA
sequences. BMC Bioinformatics 7, no. 133.
187. Sze, S.-H., Lu, Y., and Yang, Q. (2006) A polynomial time solvable formulation of multiple
sequence alignment. J. Comput. Biol. 13, 309-319.
188. Taylor, W.R. (1986) The classification of amino acid conservation. J. Theor. Biol. 119, 205-
218.
189. Thompson, J.D., Higgins, D.G., and Gibson, T.J. (1994) CLUSTAL W: improving the sen-
sitivity of progressive multiple sequence alignment through sequence weighting, position-
specific gap penalties and weight matrix choice. Nucleic Acids Res. 22, 4673-4680.
190. Thompson, J.D., Plewniak, F., and Poch, O. (1999) A comprehensive comparison of multiple
sequence alignment programs. Nucleic Acids Res. 27, 2682-2690.
191. Ukkonen, E. (1995) On-line construction of suffix trees. Algorithmica 14, 249-260.
192. Vingron, M. and Argos, P. (1990) Determination of reliable regions in protein sequence
alignment. Protein Eng. 3, 565-569.
193. Vingron, M. and Waterman, M.S. (1994) Sequence alignment and penalty choice: Review
of concepts, case studies and implications. J. Mol. Biol. 235, 1-12.
194. Wagner, R.A. and Fischer, M.J. (1974) The string-to-string correction problem. J. Assoc.
Comput. Mach. 21, 168-173.
195. Wang, L. and Jiang, T. (1994) On the complexity of multiple sequence alignment. J. Comput.
Biol. 1, 337-348.
196. Waterman, M.S. (1984) Efficient sequence alignment algorithms. J. Theor. Biol. 108, 333-
337.
197. Waterman, M.S. (1995) Introduction to Computational Biology. Chapman and Hall, New
York.
198. Waterman, M.S. and Eggert, M. (1987) A new algorithm for best subsequence alignments
with application to tRNA-rRNA comparisons. J. Mol. Biol. 197, 723-728.
199. Waterman, M.S., Gordon, L., and Arratia, R. (1987) Phase transition in sequence matches
and nucleic acid structure. Proc. Nat’l. Acad. Sci. USA 84, 1239-1243.
200. Waterman, M.S. and Vingron, M. (1994) Rapid and accurate estimate of statistical signifi-
cance for sequence database searches. Proc. Nat’l. Acad. Sci. USA 91, 4625-4628.
201. Webber, C. and Barton, G.J. (2001) Estimation of P-value for global alignments of protein
sequences. Bioinformatics 17, 1158-1167.
202. Weiner, P. (1973) Linear pattern matching algorithms. In Proc. the 14th IEEE Annu. Symp.
on Switching and Automata Theory, pp. 1-11.
203. Wilbur, W.J. (1985), On the PAM matrix model of protein evolution. Mol. Biol. Evol. 2,
434-447.
204. Wilbur, W. and Lipman, D. (1983) Rapid similarity searches of nucleic acid and protein data
banks. Proc. Nat. Acad. Sci. USA 80, 726-730.
205. Wilbur, W. and Lipman, D. (1984) The context dependent comparison of biological se-
quences. SIAM J. Appl. Math. 44, 557-567.
206. Xu, J., Brown, D. Li, M., and Ma, M. (2006) Optimizing multiple spaced seeds for homology
search. J. Comput. Biol. 13, 1355-1368.
207. Yang, I.-H., Wang, S.-H., Chen, Y.-H., Huang, P.-H., Ye, L., Huang, X. and Chao, K.-M.
(2004) Efficient methods for generating optimal single and multiple spaced seeds. In Proc.
IEEE 4th Symp. on Bioinform. and Bioeng., pp. 411-418.
208. Yang, J.L. and Zhang, L.X. (2008) Run probabilities of seed-like patterns and identifying
good transition seeds. J. Comput. Biol. (in press).
209. Yap, V.B. and Speed, T. (2005), Estimating substitution matrices. In Statistical Methods in
Molecular Evolution (ed. R. Nielsen), Springer.
210. Ye, L. and Huang, X. (2005) MAP2: Multiple alignment of syntenic genomic sequences.
Nucleic Acids Res. 33, 162-170.
211. Yu, Y.K., Wootton, J.C., and Altschul, S.F. (2003) The compositional adjustment of amino
acid substitution matrices. Proc. Nat’l. Acad. Sci. USA. 100, 15688-93.
212. Yu, Y.K. and Altschul, S.F. (2005) The construction of amino acid substitution matrices for
the comparison of proteins with non-standard compositions. Bioinformatics 21, 902-11.
213. Zachariah, M.A., Crooks, G.E., Holbrook, S.R., Brenner, S.E. (2005) A generalized affine
gap model significantly improves protein sequence alignment accuracy. Proteins 58, 329-38.
214. Zhang, L.X. (2007) Superiority of spaced seeds for homology search. IEEE/ACM Trans.
Comput. Biol. Bioinform. 4, 496-505.
215. Zhang, Z, Schwartz, S, Wagner, L, and Miller, W. (2000) A greedy algorithm for aligning
DNA sequences. J. Comput. Biol. 7, 203-214.
216. Zhou, L. and Florea, L. (2007) Designing sensitive and specific spaced seeds for cross-
species mRNA-to-genome alignment. J. Comput. Biol. 14, 113-130.
217. Zuker, M. and Somorjai, R.L. (1989) The alignment of protein structures in three dimen-
sions. Bull. Math. Biol. 50, 97-120.
Chapter 1
Introduction
A vast diversity of organisms exist on Earth, but they are amazingly similar. Every
organism is made up of cells and depends on two types of molecules: DNA and
proteins. DNA in a cell transmits itself from generation to generation and holds
genetic instructions that guide the synthesis of proteins and other molecules. Pro-
teins perform biochemical reactions, pass signals among cells, and form the body’s
components. DNA is a sequence of 4 nucleotides, whereas proteins are made of 20
amino acids arranged in a linear chain.
Due to the linear structure of DNA and proteins, sequence comparison has been
one of the common practices in molecular biology since the first protein sequence
was read in the 1950s. There are dozens of reasons for this. Among these reasons
are the following: sequence comparisons allow identification of genes and other
conserved sequence patterns; they can be used to establish functional, structural,
and evolutionary relationships among proteins; and they provide a reliable method
for inferring the biological functions of newly sequenced genes.
A striking fact revealed by the abundance of biological sequence data is that a
large proportion of genes in an organism have significant similarity with ones in
other organisms that diverged hundreds of millions of years ago. Two mammals
have as many as 99% of their genes in common. Humans and fruit flies even have at least 50% of their genes in common.
On the other hand, different organisms do differ to some degree. This strongly
suggests that the genomes of existing species are shaped by evolution. Accordingly,
comparing related genomes provides the best hope for understanding the language
of DNA and for unraveling the evolutionary relationship among different species.
Two sequences from different genomes are homologous if they evolved from a
common ancestor. We are interested in homologies because they usually have con-
served structures and functions. Because species diverged at different time points,
homologous proteins are expected to be more similar in closely related organisms
than in remotely related ones.
When a new protein is found, scientists usually have little idea about its function.
Direct experimentation on a protein of interest is often costly and time-consuming.
As a result, one common approach to inferring the protein's function is to find, by similarity search, its homologs that have been studied and are stored in databases. One remarkable finding made through sequence comparison concerns how cancer is caused. In 1983, a paper appearing in Science reported a 28-amino-acid sequence of platelet-derived growth factor, a normal protein whose function is to stimulate cell growth. By searching against a small protein database he had created himself, Doolittle found that the sequence is almost identical to a sequence of v-sis, an oncogene causing cancer in woolly monkeys. This finding changed the way oncogenesis had been seen and understood. Today, it is generally accepted that cancer can be caused by a normal growth gene that is switched on at the wrong time.
Today, powerful sequence comparison methods, together with comprehensive bi-
ological databases, have changed the practice of molecular biology and genomics.
In the words of Gilbert, Nobel prize winner and co-inventor of practical DNA se-
quencing technology,
The new paradigm, now emerging, is that all the ‘genes’ will be known (in the sense of
being resident in databases available electronically), and that the starting point of biological
investigation will be theoretical. (Gilbert, 1991, [76])
1.2 Alignment: A Model for Sequence Comparison

1.2.1 Definition
Shown below is an alignment of two protein sequences, GST1 and VEF, in which dashes denote gaps inserted by the alignment:

GST1 SRAFRLLWLLDHLNLEYEIVPYKRDANFRAPPELKKIHPLGRSPLLEVQDRETGKKKILA
VEF  SYLFRFRGLGDFMLLELQIVPILNLASVRVGNHHNGPHSYFNTTYLSVEVRDT-------

GST1 ESGFIFQYVL---QHFDHSHVLMSEDADIADQINYYLFYVEGSLQPPLMIEFILSKVKDS
VEF  SGGVVFSYSRLGNEPMTHEH----HKFEVFKDYTIHLFIQE----PGQRLQLIVNKTLDT

GST1 GMPFPISYLARKVADKISQAYSSGEVKNQFDFV
VEF  ALPNSQNIYARLTATQLVVGEQSIIISDDNDFV
There are many ways of aligning two sequences. By using graph-theoretic concepts, however, we can obtain a compact representation of all these alignments.
Consider two sequences of lengths m and n. The alignment graph of these two sequences is a directed graph G. The vertices of G are the lattice points (i, j), 0 ≤ i ≤ m and 0 ≤ j ≤ n, of the (m + 1) × (n + 1) grid, and there is an edge from vertex (i, j) to another vertex (i′, j′) if and only if 0 ≤ i′ − i ≤ 1 and 0 ≤ j′ − j ≤ 1. Figure 1.2
gives the alignment graph for sequences ATACTGG and GTCCGTG. As shown in
Figure 1.2, there are vertical, horizontal, and diagonal edges. In a directed graph,
a sink is a vertex that has every incident edge directed toward it, and a source is a
vertex that has every incident edge directed outwards. In the alignment graph, (0, 0)
is the unique source and (m, n) is the unique sink.
The alignment graph is very useful for understanding and developing alignment
algorithms as we shall see in Chapter 3. One reason for this is that all possible
alignments correspond one-to-one to the directed paths from the source (0, 0) to the
sink (m, n). Hence, it gives a compact representation of all possible alignments of
Fig. 1.2 Alignment graph for sequences ATACTGG and GTCCGTG.
the sequences. To see the correspondence, we consider the alignment graph given in
Figure 1.2 in what follows.
Let X = ATACTGG be the first sequence and Y = GTCCGTG be the second se-
quence. The highlighted path in the alignment graph is
s → (1, 0) → (1, 1) → (2, 2) → (3, 3) → (4, 4) → (4, 5) → (5, 6) → (6, 6) → (7, 7),
where s is the source (0, 0). Let xi (yj) denote the ith (jth) letter of the first (second) sequence. Walking along the path, we write down a column aligning xi with yj, xi with −, or − with yj if we see a diagonal, vertical, or horizontal edge entering vertex (i, j), respectively. After reaching the sink (7, 7), we obtain the following nine-column alignment of X and Y:
X A-TAC-TGG
Y -GTCCGT-G
where the ith column corresponds to the ith edge in the path.
On the other hand, given a k-column alignment of X and Y , we let ui (and vi ) be
the number of letters in the first i columns in the row of X (and Y ) for i = 0, 1, . . . , k.
Trivially, u0 = v0 = 0 and uk = m and vk = n. For any i, we also have that
0 ≤ ui+1 − ui ≤ 1 and 0 ≤ vi+1 − vi ≤ 1.

This fact implies that there is an edge from (ui, vi) to (ui+1, vi+1). Thus,

(u0, v0) → (u1, v1) → · · · → (uk, vk)

is a path from the source to the sink. For example, from alignment
X ATACTG-G
Y GTCC-GTG
we obtain the following path
(0, 0) → (1, 1) → (2, 2) → (3, 3) → (4, 4) → (5, 4) → (6, 5) → (6, 6) → (7, 7).
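To make the correspondence concrete, here is a small Python sketch (the function name is ours, for illustration only) that maps the two gapped rows of an alignment to its path of lattice points:

def alignment_to_path(row_x, row_y):
    """Map the two gapped rows of an alignment to the path of
    lattice points (u_i, v_i) in the alignment graph."""
    assert len(row_x) == len(row_y)
    u, v = 0, 0
    path = [(0, 0)]
    for a, b in zip(row_x, row_y):
        if a != '-':           # a letter of X is consumed
            u += 1
        if b != '-':           # a letter of Y is consumed
            v += 1
        path.append((u, v))
    return path

# The nine-column alignment of X = ATACTGG and Y = GTCCGTG:
print(alignment_to_path("A-TAC-TGG", "-GTCCGT-G"))
# [(0, 0), (1, 0), (1, 1), (2, 2), (3, 3), (4, 4), (4, 5), (5, 6), (6, 6), (7, 7)]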
Let a(m, n) denote the number of possible alignments of two sequences of lengths m and n. Because the alignments correspond one-to-one to the paths from the source to the sink, and the last edge of such a path enters (m, n) diagonally, vertically, or horizontally, we have

a(m, n) = a(m − 1, n − 1) + a(m − 1, n) + a(m, n − 1) (1.1)

for m, n ≥ 1. Noting that there is only one way to align the empty sequence to a non-empty sequence, we also have that

a(k, 0) = a(0, k) = 1 (1.2)

for any k ≥ 0.
Table 1.1 The number of possible alignments of two sequences of lengths m and n for m, n ≤ 8. Since a(m, n) = a(n, m), only the entries with m ≤ n are shown.

m\n   0   1    2    3     4      5      6       7        8
0     1   1    1    1     1      1      1       1        1
1         3    5    7     9     11     13      15       17
2             13   25    41     61     85     113      145
3                  63   129    231    377     575      833
4                       321    681  1,289   2,241    3,649
5                            1,683  3,653   7,183   13,073
6                                   8,989  19,825   40,081
7                                          48,639  108,545
8                                                  265,729
Recurrence relation (1.1) and basis conditions (1.2) enable us to calculate a(m, n) efficiently. Table 1.1 gives the values of a(m, n) for any m, n ≤ 8. As we see from the table, the number a(m, n) grows quite fast with m and n. For n = m = 8, this number is already 265,729! As a matter of fact, we have that (see the gray box below)

a(m, n) = ∑_{k = max{m,n}}^{m+n} C(k, m) C(m, n + m − k), (1.3)

where C(p, q) = p!/(q!(p − q)!) denotes the binomial coefficient. In particular, when m = n, the sum of the last two terms of the right side of (1.3) is

C(2n − 1, n) C(n, 1) + C(2n, n) = (n + 2)(2n)! / (2(n!)²).

Since a(n, n) is larger than the sum of these last two terms alone, setting n = 100 we obtain a(100, 100) > 2²⁰⁰ ≈ 10⁶⁰, an astronomically large number! Clearly, this shows that it is definitely not feasible to examine all possible alignments and that we need an efficient method for sequence alignment.
We first consider the arrangements of letters in the first row and then those in the second row of a k-column alignment. Assume there are m letters in the first row. There are C(k, m) ways to arrange the letters in the first row. Fix such an arbitrary arrangement. By the definition of alignment, there are k − n spaces in the second row, and these k − n spaces must be placed below the letters in the first row. Thus, there are C(m, k − n) = C(m, m + n − k) arrangements for the second row. By the multiplication principle, there are

ak(m, n) = C(k, m) C(m, m + n − k)

k-column alignments in total. Summing ak(m, n) over all k from max{m, n} to m + n gives (1.3).

1.3 Scoring Alignment
Our goal is to find, given two DNA or protein sequences, the best alignment of them. For this purpose, we need a rule that tells us the goodness of each possible alignment. The earliest similarity measure was based on percent identical residues, that is, simply counting the matches in an alignment. In the early days, this simple rule worked because proteins with similar functions but low percent identity, such as the homeodomain proteins, were rarely encountered. Nowadays, a more general rule has to be used to score an alignment.
First, we have a score s(a, b) for matching a with b for any pair of letters a and b. Usually, s(a, a) > 0, and s(a, b) < 0 for a ≠ b. Assume there are k letters a1, a2, . . . , ak. All these possible scores are specified by a matrix (s(ai, aj)), called a
scoring matrix. For example, when DNA sequences are aligned, the following scor-
ing matrix may be used:
     A    G    C    T
A    2   −1   −1   −1
G   −1    2   −1   −1
C   −1   −1    2   −1
T   −1   −1   −1    2
This scoring matrix indicates that all matches score 2 whereas mismatches are
penalized by 1. Assume we align two sequences. If one sequence has letter a and
another has b in a position, it is unknown whether a had been replaced by b or
the other way around in evolutionary history. Thus, scoring matrices are usually
symmetric like the one given above. In this book, we only write down the lower
triangular part of a scoring matrix if it is symmetric.
Scoring matrices for DNA sequence alignment are usually simple. All different
matches score the same, and so do all mismatches. For protein sequences, however,
scoring matrices are quite complicated. Frequently used scoring matrices are de-
veloped using statistical analysis of real sequence data. They reflect the frequency
of an amino acid replacing another in biologically related protein sequences. As a
result, a scoring matrix for protein sequence alignment is usually called a substitu-
8 1 Introduction
tion matrix. We will discuss in detail how substitution matrices are constructed and
selected for sequence alignment in Chapter 8.
Second, we consider indels. In an alignment, indels and mismatches are intro-
duced to bring up matches that appear later. Thus, indels are penalized like mis-
matches. The most straightforward method is to penalize each indel by some con-
stant δ . However, two or more nucleotides are frequently inserted or deleted together
as a result of biochemical processes such as replication slippage. Hence, penalizing
a gap of length k by −kδ is too cruel. A gap in an alignment is defined as a sequence
of spaces locating between two letters in one row. A popular gap penalty model,
called affine gap penalty, scores a gap of length k as
−(o + k × e),
where o > 0 is considered as the penalty for opening a gap and e > 0 is the penalty
for extending a gap by one letter. The opening gap penalty o is usually big whereas
the gap extension penalty e is small. Note that penalizing a gap simply in proportion to its length is the special case of the affine gap penalty model in which o = 0.
A scoring matrix and a gap penalty model form a scoring scheme or a scoring
system. With a scoring scheme in hand, the score of an alignment is calculated as the
sum of individual scores, one for each aligned pair of letters, and scores for gaps.
Consider the comparison of two DNA sequences with the simple scoring matrix given above, which assigns 2 to each match and −1 to each mismatch. If we simply penalize each indel by 1.5, the score of the nine-column alignment of X = ATACTGG and Y = GTCCGTG given in Section 1.2 is 2 × 4 − 1 × 1 − 1.5 × 4 = 1.
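A minimal Python sketch of this bookkeeping (the constants follow the example above; the helper name is ours):

MATCH, MISMATCH, INDEL = 2, -1, -1.5    # scoring scheme of the example

def alignment_score(row_x, row_y):
    """Sum the column scores of a gapped alignment of two DNA rows."""
    score = 0.0
    for a, b in zip(row_x, row_y):
        if a == '-' or b == '-':
            score += INDEL
        elif a == b:
            score += MATCH
        else:
            score += MISMATCH
    return score

# The alignment of X = ATACTGG and Y = GTCCGTG from Section 1.2:
print(alignment_score("A-TAC-TGG", "-GTCCGT-G"))    # 1.0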
As we will see in Section 8.3, in any scoring matrix, the substitution score s(a, b) is
essentially a logarithm of the ratio of the probability that we expect to see a and b
aligned in biologically related sequences to the probability that they are aligned in
unrelated random sequences. Hence, being the sum of individual log-odds scores,
the score of an ungapped alignment reflects the likelihood that this alignment was
generated as a consequence of sequence evolution.
1.4 Computing Sequence Alignment

In this section, we briefly define the global and local alignment problems and then
relate the alignment problem to some interesting algorithmic problems in computer
science, mathematics, and information theory.
With a scoring system, we associate a score to each possible alignment. The optimal
alignments are those with the maximum score. The global alignment problem (or
simply alignment problem) is stated as
Global Alignment Problem
Input: Two sequences x and y and a scoring scheme.
Solution: An optimal alignment of x and y (as defined by the scoring
scheme).
Because there is a huge number of possible alignments for two sequences, it
is not feasible to find the optimal one by examining all alignments one by one.
Fortunately, there is a very efficient algorithm for this problem. This algorithm is
now called the Needleman-Wunsch algorithm. The so-called dynamic programming
idea behind this algorithm is so simple that such an algorithm has been discovered
and rediscovered in different forms many times. The Needleman-Wunsch algorithm
and its generalizations are extensively discussed in Chapter 3.
The sequence alignment problem seems quite simple. But it is rather general, being closely related to several interesting problems in mathematics, computer science, and information theory. Here we just name two such examples. The longest common subsequence problem is, given two sequences, to find a longest sequence whose letters appear in each sequence in order, but not necessarily in consecutive positions. This problem had interested mathematicians long before Needleman
and Wunsch discovered their algorithm for aligning DNA sequences. Consider a
special scoring system S that assigns 1 to each match, −∞ to each mismatch, and
0 to each indel. It is easy to verify that the optimal alignment of two sequences
found using S must not contain mismatches. As a result, all the matches in the
alignment give a longest common subsequence. Hence, the longest common subse-
quence problem is identical to the sequence alignment problem under the particular
scoring system S .
There are two ways of measuring the similarity of two sequences: similarity
scores and distance scores. In distance scores, the smaller the score, the more
closely related are the two sequences. Hamming distance allows one only to com-
pare sequences of the same length. In 1966, Levenshtein introduced edit distance for
comparison of sequences of different lengths. It is defined as the minimum number
of editing operations that are needed to transform one sequence into another, where
the editing operations include insertion of a letter, deletion of a letter, and substitu-
tion of a letter for another. It is left to the reader to find out that calculating the edit
distance between two strings is equivalent to the sequence alignment problem under
a particular scoring system.
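As a concrete illustration of Levenshtein's definition, here is a minimal dynamic-programming sketch in Python (a standard textbook formulation, not code from this book):

def edit_distance(x, y):
    """Minimum number of insertions, deletions, and substitutions
    needed to transform x into y."""
    m, n = len(x), len(y)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i                   # delete all of x[:i]
    for j in range(n + 1):
        D[0][j] = j                   # insert all of y[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(D[i - 1][j] + 1,                        # deletion
                          D[i][j - 1] + 1,                        # insertion
                          D[i - 1][j - 1] + (x[i-1] != y[j-1]))   # substitution
    return D[m][n]

print(edit_distance("kitten", "sitting"))    # 3 (classic example)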
Proteins often have multiple functions. Two proteins that have a common function
may be similar only in functional domain regions. For example, homeodomain pro-
teins, which play important roles in developmental processes, are present in a vari-
ety of species. These proteins in different species are similar only in one domain, about 60 amino acids long, encoded by homeobox genes. Obviously, aligning the
entire sequences will not be useful for identification of the similarity among home-
odomain proteins. This raises the problem of finding, given two sequences, which
respective segments have the best alignment. Such an alignment between segments of the two sequences is called a local alignment of the given sequences. The problem of locally aligning sequences is formally stated as
Local Alignment Problem
Input: Two sequences x = x1 x2 . . . xm and y = y1 y2 . . . yn and a scoring
scheme.
Solution: An alignment of fragments xi xi+1 . . . xj and yk yk+1 . . . yl that has
the largest score among all alignments of all pairs of fragments of x and y.
A straightforward method for this problem is to find the optimal alignment for every pair of fragments of x and y using the Needleman-Wunsch algorithm. The sequence x has C(m, 2) fragments and y has C(n, 2) ones. Thus, this method is rather inefficient because its running time will increase by a factor of roughly m²n². Instead, applying di-
rectly the dynamic programming idea leads to an algorithm that is as efficient as
the Needleman-Wunsch algorithm although it is a bit more tricky this time. This
dynamic programming algorithm, called the Smith-Waterman algorithm, is covered in
Section 3.4.
Homology search is one important application of the local alignment problem. In
this case, we have a query sequence, say, a newly sequenced gene, and a database.
We wish to search the entire database to find those sequences that match locally (to
a significant degree) with the query. Because databases easily contain millions of sequences, the Smith-Waterman algorithm, having quadratic-time complexity, is too demanding in computational time for homology search. Accordingly, fast heuristic
search tools have been developed in the past two decades. Chapter 4 will present
several frequently used homology search tools.
Filtration is a useful idea for designing fast homology search programs. A filtration-based program first identifies short exact matches of the two sequences specified by a fixed pattern (called a seed) and then extends each match to both sides for a local alignment. A clever technique for speeding up the homology search process is to substitute an optimized spaced seed for the consecutive seed, as exemplified in PatternHunter. The theoretical treatment of the spaced seed technique is presented in Chapter 6.
1.5 Multiple Alignment
Hb a LSPADKTNVKAAWGKVGA----HAGEYGAE
Hb b LTPEEKSAVTALWGKV------NVDEVGGE
Mb SW LSEGEWQLVLHVWAKVEA----DVAGHGQD
LebHB LTESQAALVKSSWEEFNA----NIPKHTHR
BacHB QTINIIKATVPVLKEHG------V-TITTT
Multiple alignment is often used to assess sequence conservation of three or more
closely related proteins. Biologically similar proteins may have very diverged se-
quences and hence may not exhibit a strong sequence similarity. Comparing many
sequences at the same time often finds weak similarities that are invisible in pairwise
sequence comparison.
Several issues arise in aligning multiple sequences. First, it is not obvious how
to score a multiple alignment. Intuitively, high scores should correspond to highly
conserved sequences. One popular scoring method is the Sum-of-Pairs (SP) score.
Any alignment A of k sequences x1 , x2 , . . . , xk gives a pairwise alignment A(xi , x j ) of
xi and x j when restricted to these two sequences. We use s(i, j) to denote the score
of A(xi, xj). The SP score of the alignment A is defined as the sum of the scores of all induced pairwise alignments:

∑_{1 ≤ i < j ≤ k} s(i, j).

Note that the SP score is identical to the score of a pairwise alignment when there are only two sequences. The details of the SP score and other scoring methods can be found in Section 5.2.
Second, aligning multiple sequences is extremely time-consuming. The SP score of a multiple alignment is a generalization of a pairwise alignment score. Similarly, the dynamic-programming algorithm generalizes to multiple sequences in a straightforward manner. However, such an algorithm uses roughly (2m)^k arithmetic operations for aligning k sequences of length m. For small k, it works well, but the running time becomes prohibitive when k is large, say 30. Several heuristic approaches have been proposed for speeding up the multiple alignment process (see Section 5.4).
Although homology and similarity are often interchanged in popular usage, they are completely different things. Homology is qualitative: it means having a common ancestor. Similarity, on the other hand, refers to the degree of match between two sequences. Similarity is an expected consequence of homology, but not a necessary one. It may occur due to chance or due to an evolutionary process whereby organisms independently evolve similar traits, such as the wings of insects and bats.
Assume we find a good match for a newly sequenced gene through database
search. Does this match reflect a homology? Nobody knows what really happened
over evolutionary time. When we say that a sequence is homologous to another,
we are stating what we believe. No matter how high the alignment score is, we can never be 100% sure. Hence, a central question in sequence comparison is how
frequently an alignment score is expected to occur by chance. This question has been
extensively investigated through the study of the alignments of random sequences.
The Karlin-Altschul alignment statistics covered in Chapter 7 lay the foundation for
answering this important question.
To approach theoretically the question, we need to model biological sequences.
The simplest model for random sequences assumes that the letters in all positions
are generated independently, with probability distribution
p1 , p2 , . . . , pr
for all letters a1 , a2 , . . . ar in the alphabet, where r is 4 for DNA sequences and 20
for protein sequences. We call it the Bernoulli sequence model.
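As a small illustration, the Bernoulli sequence model can be sampled in a couple of lines of Python (the uniform letter distribution below is only an assumption for the example):

import random

def bernoulli_sequence(n, letters="ACGT", probs=(0.25, 0.25, 0.25, 0.25)):
    """Generate a random sequence in which every position is drawn
    independently from the same letter distribution."""
    return "".join(random.choices(letters, weights=probs, k=n))

print(bernoulli_sequence(30))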
The theoretical studies covered in Chapters 6 and 7 are based on this simple
model. However, most of the results generalize to the high-order Markov chain
model in a straightforward manner. In the kth-order Markov chain sequence model,
the probability that a letter is present at any position j depends on the letters in the
preceding k sites: i − k, i − k + 1, . . . , j + 1. The third-order Markov chain model is
often used to model gene coding sequences.
This book is structured into two parts. The first part examines alignment algorithms
and techniques and is composed of four chapters, and the second part focuses on the
theoretical issues of sequence comparison and has three chapters. The individual
chapters cover topics as follows.
4 Homology search tools. After showing how the filtration technique speeds up the homology search process, we describe in detail four frequently used homology search tools: FASTA, BLAST, BLAT, and PatternHunter.
8 Scoring matrices. We start with the frequently used PAM and BLOSUM ma-
trices. We show that scoring matrices for aligning protein sequences take essentially
a log-odds form and that there is a one-to-one correspondence between so-called valid
scoring matrices and the sets of target and background frequencies. We also discuss
how scoring matrices are selected and adjusted for comparing sequences of biased
letter composition. Finally, we discuss gap score schemes.
After nearly 50 years of research, there are hundreds of available tools and thousands
of research papers in sequence alignment. We will not attempt to cover all (or even
a large portion) of this research in this text. Rather, we will be content to provide
pointers to some of the most relevant and useful references on the topics not covered
in this text.
For the earlier phase of sequence comparison, we refer the reader to the paper
of Sankoff [174]. For information on probabilistic and statistical approaches to sequence alignment, we refer the reader to the books by Durbin et al. [61], Ewens
and Grant [64], and Waterman [197]. For information on sequence comparisons in
DNA sequence assembly, we refer the reader to the survey paper of Myers [149]
and the books of Gusfield [85] and Deonier, Tavaré and Waterman [58]. For infor-
mation on future directions and challenging problems in comparison of genomic
DNA sequences, we refer the reader to the review papers by Batzoglou [23] and
Miller [138].
Chapter 2
Basic Algorithmic Techniques
Table 2.1 The time needed by the functions, assuming one million operations per second.

f(n)      n = 10      n = 100      n = 100000

Table 2.2 The growth of the input size solvable in an hour as the computer runs faster.

f(n)      Present speed      1000-times faster      10^6-times faster
n         x1                 1000 x1                10^6 x1
n^2       x2                 31.62 x2               1000 x2
n^3       x3                 10 x3                  100 x3
10^n      x4                 x4 + 3                 x4 + 6

A greedy method works in stages. It always makes a locally optimal (greedy) choice at each stage. Once a choice has been made, it cannot be withdrawn, even if later we realize that it is a poor decision. In other words, this greedy choice may or may not lead to a globally optimal solution, depending on the characteristics of the problem.
It is a very straightforward algorithmic technique and has been used to solve a
variety of problems. In some situations, it is used to solve the problem exactly. In
others, it has been proved to be effective in approximation.
What kind of problems are suitable for a greedy solution? There are two ingre-
dients for an optimization problem to be exactly solved by a greedy approach. One
is that it has the so-called greedy-choice property, meaning that a locally optimal
choice can reach a globally optimal solution. The other is that it satisfies the princi-
ple of optimality, i.e., each solution substructure is optimal. We use Huffman coding,
a frequency-dependent coding scheme, to illustrate the greedy approach.
Suppose we are given a very long DNA sequence where the occurrence probabilities
of nucleotides A (adenine), C (cytosine), G (guanine), T (thymine) are 0.1, 0.1, 0.3,
and 0.5, respectively. In order to store it in a computer, we need to transform it into a
binary sequence, using only 0’s and 1’s. A trivial solution is to encode A, C, G, and
T by “00,” “01,” “10,” and “11,” respectively. This representation requires two bits
per nucleotide. The question is “Can we store the sequence in a more compressed
way?” Fortunately, by assigning shorter codes to the frequent nucleotides G and T, and longer codes to the rare nucleotides A and C, it can be shown that it requires less than
two bits per nucleotide on average.
In 1952, Huffman [94] proposed a greedy algorithm for building up an optimal
way of representing each letter as a binary string. It works in two phases. In phase
one, we build a binary tree based on the occurrence probabilities of the letters. To
do so, we first write down all the letters, together with their associated probabilities.
They are initially the unmarked terminal nodes of the binary tree that we will build
up as the algorithm proceeds. As long as there is more than one unmarked node left,
we repeatedly find the two unmarked nodes with the smallest probabilities, mark
them, create a new unmarked internal node with an edge to each of the nodes just
marked, and set its probability as the sum of the probabilities of the two nodes.
The tree building process is depicted in Figure 2.1. Initially, there are four un-
marked nodes with probabilities 0.1, 0.1, 0.3, and 0.5. The two smallest ones are
Fig. 2.1 Building a binary tree based on the occurrence probabilities of the letters.
with probabilities 0.1 and 0.1. Thus we mark these two nodes and create a new
node with probability 0.2 and connect it to the two nodes just marked. Now we have
three unmarked nodes with probabilities 0.2, 0.3, and 0.5. The two smallest ones are
with probabilities 0.2 and 0.3. They are marked, and a new node with probability 0.5 connecting them is created. The final iteration connects the only two unmarked
nodes with probabilities 0.5 and 0.5. Since there is only one unmarked node left,
i.e., the root of the tree, we are done with the binary tree construction.
After the binary tree is built in phase one, the second phase is to assign the binary
strings to the letters. Starting from the root, we recursively assign the value “zero”
to the left edge and “one” to the right edge. Then for each leaf, i.e., the letter, we
concatenate the 0’s and 1’s from the root to it to form its binary string representation. For example, in Figure 2.2 the resulting codewords for A, C, G, and T are “000,” “001,” “01,” and “1,” respectively. By this coding scheme, a 20-nucleotide DNA
sequence “GTTGTTATCGTTTATGTGGC” will be represented as a 34-bit binary
sequence “0111011100010010111100010110101001.” In general, since 3 × 0.1 +
3 × 0.1 + 2 × 0.3 + 1 × 0.5 = 1.7, we conclude that, by Huffman coding techniques,
each nucleotide requires 1.7 bits on average, which is superior to 2 bits by a trivial
solution. Notice that in a Huffman code, no codeword is also a prefix of any other
codeword. Therefore we can decode a binary sequence without any ambiguity. For
example, if we are given “0111011100010010111100010110101001,” we decode
the binary sequence as “01” (G), “1” (T), “1” (T), “01” (G), and so forth.
The correctness of Huffman’s algorithm lies in two properties: (1) greedy-choice
property and (2) optimal-substructure property. It can be shown that there exists an
optimal binary code in which the codewords for the two smallest-probability nodes
have the same length and differ only in the last bit. That’s the reason why we can
contract them greedily without missing the path to the optimal solution. Besides,
after contraction, the optimal-substructure property allows us to consider only those
unmarked nodes.
Let n be the number of letters under consideration. For DNA, n is 4 and for
English, n is 26. Since a heap can be used to maintain the minimum dynamically
in O(log n) time for each insertion or deletion, the time complexity of Huffman’s
algorithm is O(n log n).
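For concreteness, here is a minimal Python sketch of the two phases (our own illustration built on a binary heap, not the book's pseudo-code):

import heapq

def huffman_code(probs):
    """Build Huffman codewords from a dict letter -> probability."""
    # Each heap item: (probability, tie-breaker, subtree).
    heap = [(p, i, letter) for i, (letter, p) in enumerate(sorted(probs.items()))]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)    # two smallest-probability nodes
        p2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, count, (left, right)))
        count += 1
    code = {}
    def assign(node, prefix):
        if isinstance(node, str):            # a leaf, i.e., a letter
            code[node] = prefix or "0"
        else:
            assign(node[0], prefix + "0")    # "zero" on the left edge
            assign(node[1], prefix + "1")    # "one" on the right edge
    assign(heap[0][2], "")
    return code

print(huffman_code({"A": 0.1, "C": 0.1, "G": 0.3, "T": 0.5}))
# Codeword lengths: A and C get 3 bits, G gets 2, T gets 1, so 1.7 bits
# per nucleotide on average; the exact 0/1 labels may differ from Figure 2.2.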
2.3 Divide-and-Conquer Strategies

The divide-and-conquer strategy divides the problem into a number of smaller sub-
problems. If the subproblem is small enough, it conquers it directly. Otherwise, it
conquers the subproblem recursively. Once the solutions to the subproblems have been obtained, it combines them to form a solution to the original problem.
One of the well-known applications of the divide-and-conquer strategy is the
design of sorting algorithms. We use mergesort to illustrate the divide-and-conquer
algorithm design paradigm.
2.3.1 Mergesort
Mergesort divides the input sequence into two halves, sorts each half recursively, and then merges the two sorted halves into one sorted sequence. Figure 2.3 illustrates the dividing process. The original input se-
quence consists of eight numbers. We first divide it into two smaller sequences, each
consisting of four numbers. Then we divide each four-number sequence into two
smaller sequences, each consisting of two numbers. Here we can sort the two num-
bers by comparing them directly, or divide it further into two smaller sequences,
each consisting of only one number. Either way we’ll reach the boundary cases
where sorting is trivial. Notice that a sequential recursive process won’t expand the
subproblems simultaneously, but instead it solves the subproblems at the same re-
cursion depth one by one.
How to combine the solutions to the two smaller subproblems to form a solu-
tion to the original problem? Let us consider the process of merging two sorted
sequences into a sorted output sequence. For each merging sequence, we maintain
a cursor pointing to the smallest element not yet included in the output sequence.
At each iteration, the smaller of these two smallest elements is removed from the
merging sequence and added to the end of the output sequence. Once one merg-
ing sequence has been exhausted, the other sequence is appended to the end of the
output sequence. Figure 2.4 depicts the merging process. The merging sequences
are ⟨16, 25, 65, 85⟩ and ⟨8, 12, 36, 77⟩. The smallest elements of the two merging sequences are 16 and 8. Since 8 is the smaller one, we remove it from its merging sequence and add it to the output sequence. Now the smallest elements of the two merging sequences are 16 and 12. We remove 12 from the merging sequence and append it to the output sequence. Then 16 and 36 are the smallest elements of the two merging sequences, thus 16 is appended to the output list. Finally, the resulting output sequence is ⟨8, 12, 16, 25, 36, 65, 77, 85⟩. Let N and M be the lengths of the
two merging sequences. Since the merging process scans the two merging sequences
linearly, its running time is therefore O(N + M) in total.
After the top-down dividing process, mergesort accumulates the solutions in
a bottom-up fashion by combining two smaller sorted sequences into a larger
sorted sequence, as illustrated in Figure 2.5. In this example, the recursion depth is log₂ 8 = 3. At recursion depth 3, every single element is itself a sorted sequence. They are merged to form sorted sequences at recursion depth 2: ⟨16, 65⟩, ⟨25, 85⟩, ⟨8, 12⟩, and ⟨36, 77⟩. At recursion depth 1, they are further merged into two sorted sequences: ⟨16, 25, 65, 85⟩ and ⟨8, 12, 36, 77⟩. Finally, we merge these two sequences into one sorted sequence: ⟨8, 12, 16, 25, 36, 65, 77, 85⟩.
It can be easily shown that the recursion depth of mergesort is log2 n for sorting
n numbers, and the total time spent for each recursion depth is O(n). Thus, we
conclude that mergesort sorts n numbers in O(n log n) time.
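The entire algorithm is only a few lines; a minimal Python sketch:

def mergesort(seq):
    """Sort by dividing, recursing, and merging two sorted halves."""
    if len(seq) <= 1:                       # boundary case: trivially sorted
        return list(seq)
    mid = len(seq) // 2
    left, right = mergesort(seq[:mid]), mergesort(seq[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:             # take the smaller head element
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    merged.extend(left[i:])                 # append the exhausted remainder
    merged.extend(right[j:])
    return merged

print(mergesort([16, 25, 65, 85, 8, 12, 36, 77]))
# [8, 12, 16, 25, 36, 65, 77, 85]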
2.4 Dynamic Programming

Dynamic programming, like divide-and-conquer, solves a problem by combining the solutions to subproblems. When the subproblems overlap, the same subproblems may be solved again and again, and the problem can be solved more efficiently if the solutions to the subproblems are recorded. If the subproblems are not overlapping, a divide-and-conquer approach is the choice.
The development of a dynamic-programming algorithm has three basic com-
ponents: the recurrence relation for defining the value of an optimal solution, the
tabular computation for computing the value of an optimal solution, and the back-
tracking procedure for delivering an optimal solution. Here we introduce these basic
ideas by developing dynamic-programming solutions for problems from different
application areas.
First of all, the Fibonacci numbers are used to demonstrate how a tabular com-
putation can avoid recomputation. Then we use three classic problems, namely, the
maximum-sum segment problem, the longest increasing subsequence problem, and
the longest common subsequence problem, to explain how dynamic-programming
approaches can be used to solve the sequence-related problems.
The Fibonacci numbers are defined by the recurrence F0 = 0, F1 = 1, and Fn = Fn−1 + Fn−2 for n ≥ 2. By definition, the sequence goes like this: 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89,
144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, 10946, 17711, 28657, 46368,
75025, 121393, and so forth. Given a positive integer n, how would you compute
Fn ? You might say that it can be easily solved by a straightforward divide-and-
conquer method based on the recurrence. That’s right. But is it efficient? Take the
computation of F10 for example (see Figure 2.6). By definition, F10 is derived by
adding up F9 and F8 . What about the values of F9 and F8 ? Again, F9 is derived by
adding up F8 and F7 ; F8 is derived by adding up F7 and F6 . Working toward this
direction, we’ll finally reach the values of F1 and F0 , i.e., the end of the recursive
calls. By adding them up backwards, we have the value of F10 . It can be shown that
the number of recursive calls we have to make for computing Fn is exponential in n.
Those who are ignorant of history are doomed to repeat it. A major drawback of
this divide-and-conquer approach is to solve many of the subproblems repeatedly.
A tabular method solves every subproblem just once and then saves its answer in a
table, thereby avoiding the work of recomputing the answer every time the subprob-
lem is encountered. Figure 2.7 explains that Fn can be computed in O(n) steps by
a tabular computation. It should be noted that Fn can be computed in just O(log n)
steps by applying matrix computation.
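A minimal Python contrast of the two approaches (the function names are ours):

def fib_recursive(n):
    """Straightforward divide-and-conquer: exponentially many calls."""
    if n < 2:
        return n
    return fib_recursive(n - 1) + fib_recursive(n - 2)

def fib_tabular(n):
    """Tabular computation: each F_i is computed exactly once, O(n) steps."""
    table = [0, 1]
    for i in range(2, n + 1):
        table.append(table[i - 1] + table[i - 2])
    return table[n]

print(fib_tabular(10))      # 55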
Given a sequence of numbers A = ⟨a1, a2, . . . , an⟩, the maximum-sum segment problem is to find a consecutive segment of A whose sum is maximized. Let S[i] denote the maximum sum of a segment ending at position i; it satisfies the recurrence S[i] = max{S[i − 1] + ai, ai}. If S[i − 1] < 0, concatenating ai with its previous elements will give a smaller sum than ai itself. In this case, the maximum-sum segment ending at position i is ai itself.
By a tabular computation, each S[i] can be computed in constant time for i from 1
to n, therefore all S values can be computed in O(n) time. During the computation,
we record the largest S entry computed so far in order to report where the maximum-
sum segment ends. We also record the traceback information for each position i so
that we can trace back from the end position of the maximum-sum segment to its
start position. If S[i − 1] > 0, we need to concatenate with previous elements for a
larger sum, therefore the traceback symbol for position i is “←.” Otherwise, “↑” is
recorded. Once we have computed all S values, the traceback information is used
to construct the maximum-sum segment by starting from the largest S entry and
following the arrows until a “↑” is reached. For example, in Figure 2.8, A = ⟨3, 2, −6, 5, 2, −3, 6, −4, 2⟩. By computing from i = 1 to i = n, we have S = ⟨3, 5, −1, 5, 7, 4, 10, 6, 8⟩. The maximum S entry is S[7], whose value is 10. By backtracking from S[7], we conclude that the maximum-sum segment of A is ⟨5, 2, −3, 6⟩, whose sum is 10.
Let the prefix sum P[i] = ∑_{j=1}^{i} aj be the sum of the first i elements. It can be easily seen that ∑_{k=i}^{j} ak = P[j] − P[i − 1]. Therefore, if we wish to compute, for a given position, the maximum-sum segment ending at it, we can just look for the minimum prefix sum ahead of this position. This yields another linear-time algorithm for the maximum-sum segment problem.
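Both the recurrence and the traceback fit into a short Python sketch of the linear-time scan (Kadane's algorithm; the function name is ours):

def max_sum_segment(A):
    """Scan with S[i] = max(S[i-1] + A[i], A[i]), keeping start/end
    indices so the segment itself can be reported."""
    best_sum, best_start, best_end = A[0], 0, 0
    s, start = A[0], 0
    for i in range(1, len(A)):
        if s > 0:                  # extend the segment ending at i-1
            s += A[i]
        else:                      # start a fresh segment at i
            s, start = A[i], i
        if s > best_sum:
            best_sum, best_start, best_end = s, start, i
    return best_sum, A[best_start:best_end + 1]

print(max_sum_segment([3, 2, -6, 5, 2, -3, 6, -4, 2]))
# (10, [5, 2, -3, 6]) -- as in the Figure 2.8 example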
Let L[i] denote the length of a longest increasing subsequence ending at position i; it satisfies L[i] = 1 + max{L[j] | 0 ≤ j < i and aj < ai}. Here we assume that a0 is a dummy element smaller than any element in A, and L[0] is equal to 0. By tabular computation for every i from 1 to n, each L[i] can be computed in O(i) steps. Therefore, they require in total ∑_{i=1}^{n} O(i) = O(n²) steps. For each position i, we use an array P to record the index of the best previous element for the current element to concatenate with. By tracing back from the element with the largest L value, we derive a longest increasing subsequence.
Figure 2.9 illustrates the process of finding a longest increasing subsequence of
A = ⟨4, 8, 2, 7, 3, 6, 9, 1, 10, 5⟩. Take i = 4 for instance, where a4 = 7. Its previous smaller elements are a1 and a3, both with L value equaling 1. Therefore, we have L[4] = L[1] + 1 = 2, meaning that a longest increasing subsequence ending at position 4 has length 2. Indeed, both ⟨a1, a4⟩ and ⟨a3, a4⟩ are increasing subsequences ending at position 4. In order to trace back the solution, we use array P to record which entry contributes the maximum to the current L value. Thus, P[4] can be 1 (standing for a1) or 3 (standing for a3). Once we have computed all L and P values, a longest increasing subsequence is recovered by tracing back along P from the entry with the largest L value.
Fig. 2.9 An O(n²)-time algorithm for finding a longest increasing subsequence.

Fig. 2.10 An O(n log n)-time algorithm for finding a longest increasing subsequence.
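The O(n²) algorithm of Figure 2.9 translates directly into Python; a minimal sketch (function name ours) with the predecessor array P used for the traceback:

def longest_increasing_subsequence(A):
    """O(n^2) dynamic programming: L[i] is the length of a longest
    increasing subsequence ending at position i; P[i] is its predecessor."""
    n = len(A)
    L, P = [1] * n, [-1] * n
    for i in range(n):
        for j in range(i):
            if A[j] < A[i] and L[j] + 1 > L[i]:
                L[i], P[i] = L[j] + 1, j
    i = max(range(n), key=lambda k: L[k])    # end of a longest subsequence
    lis = []
    while i != -1:                           # trace back through P
        lis.append(A[i])
        i = P[i]
    return lis[::-1]

print(longest_increasing_subsequence([4, 8, 2, 7, 3, 6, 9, 1, 10, 5]))
# [2, 3, 6, 9, 10] -- one longest increasing subsequence, of length 5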
Given two sequences, a common subsequence of them is a sequence whose letters appear in both sequences in order, but not necessarily consecutively; a longest common subsequence (LCS) is a common subsequence of maximum length. For example, for the sequences ⟨P, R, E, S, I, D, E, N, T⟩ and ⟨P, R, O, V, I, D, E, N, C, E⟩, ⟨P, R, D, N⟩ is a common subsequence of them, whereas ⟨P, R, V⟩ is not. Their LCS is ⟨P, R, I, D, E, N⟩.
Now let us formulate the recurrence for computing the length of an LCS of two sequences. We are given two sequences A = ⟨a1, a2, . . . , am⟩ and B = ⟨b1, b2, . . . , bn⟩. Let len[i, j] denote the length of an LCS between ⟨a1, a2, . . . , ai⟩ (a prefix of A) and ⟨b1, b2, . . . , bj⟩ (a prefix of B). They can be computed by the following recurrence:

len[i, j] =
  0                                     if i = 0 or j = 0,
  len[i − 1, j − 1] + 1                 if i, j > 0 and ai = bj,
  max{len[i, j − 1], len[i − 1, j]}     otherwise.
In other words, if one of the sequences is empty, the length of their LCS is just zero. If ai and bj are the same, an LCS between ⟨a1, a2, . . . , ai⟩ and ⟨b1, b2, . . . , bj⟩ extends an LCS between ⟨a1, . . . , ai−1⟩ and ⟨b1, . . . , bj−1⟩ by this common letter; otherwise, the longer of the two LCSs obtained by dropping ai or by dropping bj will work. While filling the table, we record for each entry an arrow indicating which case attains the maximum. These arrows will guide the backtracking process upon reaching the terminal
entry (m, n). Since the time spent for each entry is O(1), the total running time of
algorithm LCS LENGTH is O(mn).
Figure 2.12 illustrates the tabular computation. The length of an LCS of ⟨A, L, G, O, R, I, T, H, M⟩ and ⟨A, L, I, G, N, M, E, N, T⟩ is 4.
Besides computing the length of an LCS of the whole sequences, Figure 2.12
in fact computes the length of an LCS between each pair of prefixes of the two
sequences. For example, by this table, we can also tell that the length of an LCS between ⟨A, L, G, O, R⟩ and ⟨A, L, I, G⟩ is 3.
Once algorithm LCS LENGTH reaches (m, n), the backtracking information re-
tained in array prev allows us to find out which common subsequence contributes
len[m, n], the maximum length of an LCS of sequences A and B. Figure 2.13 lists the
pseudo-code for delivering an LCS. We trace back the dynamic-programming ma-
trix from the entry (m, n) recursively following the direction of the arrow. Whenever
a diagonal arrow “↖” is encountered, we append the current matched letter to the
end of the LCS under construction. Algorithm LCS OUTPUT takes O(m + n) time
in total since each recursive call reduces the indices i and/or j by one.
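A compact Python sketch combining the two phases; instead of storing the arrows, this version recomputes each backtracking move from the len table (a common space-saving variation, not the book's exact pseudo-code):

def lcs(A, B):
    """Compute len[i][j] by the recurrence, then backtrack for an LCS."""
    m, n = len(A), len(B)
    length = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if A[i - 1] == B[j - 1]:
                length[i][j] = length[i - 1][j - 1] + 1
            else:
                length[i][j] = max(length[i][j - 1], length[i - 1][j])
    out = []
    i, j = m, n                       # backtrack from the terminal entry
    while i > 0 and j > 0:
        if A[i - 1] == B[j - 1]:      # a diagonal "match" move
            out.append(A[i - 1]); i -= 1; j -= 1
        elif length[i - 1][j] >= length[i][j - 1]:
            i -= 1                    # vertical move
        else:
            j -= 1                    # horizontal move
    return out[::-1]

print(lcs("ALGORITHM", "ALIGNMENT"))    # ['A', 'L', 'G', 'T'], length 4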
Figure 2.14 backtracks the dynamic-programming matrix computed in Fig-
ure 2.12. It outputs ⟨A, L, G, T⟩ (the shaded entries) as an LCS of ⟨A, L, G, O, R, I, T, H, M⟩ and ⟨A, L, I, G, N, M, E, N, T⟩.
2.5 Bibliographic Notes and Further Reading

This chapter presents three basic algorithmic techniques that are often used in de-
signing efficient methods for various problems in sequence comparison. Readers can
refer to algorithm textbooks for more instructive tutorials. The algorithm book (or
“The White Book”) by Cormen et al. [52] is a comprehensive reference of data struc-
tures and algorithms with a solid mathematical and theoretical foundation. Manber’s
book [133] provides a creative approach for the design and analysis of algorithms.
The book by Baase and Gelder [17] is a good algorithm textbook for beginners.
2.1
2.2
David A. Huffman [94] invented Huffman coding while he was a Ph.D. student
at MIT. It was actually a term paper for the problem of finding the most efficient
binary coding scheme assigned by Robert M. Fano.
2.3
There are numerous sorting algorithms such as insertion sort, bubblesort, quick-
sort, mergesort, to name a few. As noted by Donald E. Knuth [114], the first program
ever written for a stored program computer was the mergesort program written by
John von Neumann in 1945.
2.4
The name “dynamic programming” was given by Richard Bellman in 1957 [25].
The maximum-sum segment problem was first surveyed by Bentley and is linear-
time solvable using Kadane’s algorithm [27].
Chapter 3
Pairwise Sequence Alignment
3.1 Introduction
In nature, even a single amino acid sequence contains all the information necessary
to determine the fold of the protein. However, the folding process is still mysterious
to us, and some valuable information can be revealed by sequence comparison. Take
a look at the following sequence:
THETR UTHIS MOREI MPORT ANTTH ANTHE FACTS
What did you see in the above sequence? By comparing it with the words in the
dictionary, we find the tokens “FACTS,” “IMPORTANT,” “IS,” “MORE,” “THAN,”
“THE,” and “TRUTH.” Then we figure out the above is the sentence “The truth is
more important than the facts.”
Even though we have not yet decoded the DNA and protein languages, the emerg-
ing flood of sequence data has provided us with a golden opportunity of investigating
the evolution and function of biomolecular sequences. We are in a stage of compil-
ing dictionaries for DNA, proteins, and so forth. Sequence comparison plays a major
role in this line of research and thus becomes the most basic tool of bioinformatics.
Sequence comparison has wide applications to molecular biology, computer sci-
ence, speech processing, and so on. In molecular biology, it is often used to reveal
similarities among sequences, determine the residue-residue correspondences, lo-
cate patterns of conservation, study gene regulation, and infer evolutionary relation-
ships. It helps us to fish for related sequences in databanks, such as the GenBank
database. It can also be used for the annotation of genomes.
Fig. 3.1 A dot matrix of the two sequences ATACATGTCT and GTACGTCGG.
A dot matrix is a two-dimensional array of dots used to highlight the exact matches between two sequences. Given are two sequences A = ⟨a1, a2, . . . , am⟩ (or A = a1a2 . . . am in short) and B = ⟨b1, b2, . . . , bn⟩. A dot is plotted at the (i, j) entry of the matrix if ai = bj. Users can easily identify similar regions between the two sequences by locating the contiguous dots along the same diagonal. Figure 3.1 gives a dot matrix of the two sequences ATACATGTCT and GTACGTCGG. Dashed lines circle those regions with at least three contiguous matches on the same diagonal.
A dot matrix allows the users to quickly visualize the similar regions of two se-
quences. However, as the sequences get longer, it becomes more involved to deter-
mine their most similar regions, which can no longer be answered by merely looking
at a dot matrix. It would be more desirable to automatically identify those similar
regions and rank them by their “similarity scores.” This leads to the development of
sequence alignment.
3.3 Global Alignment

Given two sequences and a scoring scheme, the global alignment problem asks for an alignment that maximizes the score. By global alignment, we mean that both sequences are aligned globally, i.e., from their first symbols to their last.
Figure 3.2 gives an alignment of sequences ATACATGTCT and GTACGTCGG and its score. In this and the next sections, we assume the following simple scoring scheme. A match is given a bonus score 8, a mismatch is penalized by assigning score −5, and each indel is penalized by 3. In other words, σ(a, b) = 8 if a and b are the same, σ(a, b) = −5 if a and b are different, and β = 3 is the penalty charged for each gap symbol.
It is quite helpful to recast the problem of aligning two sequences as an equiva-
lent problem of finding a maximum-scoring path in the alignment graph defined in
Section 1.2.2, as has been observed by a number of researchers. Recall that the align-
ment graph of A and B is a directed acyclic graph whose vertices are the pairs (i, j)
where i ∈ {0, 1, 2, . . . , m} and j ∈ {0, 1, 2, . . . , n}. These vertices are arrayed in m + 1
rows and n + 1 columns. The edge set consists of three types of edges. The substitu-
tion aligned pairs, insertion aligned pairs, and deletion aligned pairs correspond to
the diagonal edges, horizontal edges, and vertical edges, respectively. Specifically, a
vertical edge from (i − 1, j) to (i, j), which corresponds to a deletion of ai , is drawn
for i ∈ {1, 2, . . . , m} and j ∈ {0, 1, 2, . . . , n}. A horizontal edge from (i, j − 1) to
(i, j) , which corresponds to an insertion of b j , is drawn for i ∈ {0, 1, 2, . . . , m} and
j ∈ {1, 2, . . . , n}. A diagonal edge from (i − 1, j − 1) to (i, j) , which corresponds to
a substitution of ai with b j , is drawn for i ∈ {1, 2, . . . , m} and j ∈ {1, 2, . . . , n}.
It has been shown that an alignment corresponds to a path from the leftmost cell
of the first row to the rightmost cell of the last row in the alignment graph. Figure 3.3
gives another example of this correspondence.
Fig. 3.3 A path in an alignment graph of the two sequences ATACATGTCT and GTACGTCGG.

Let S[i, j] denote the score of an optimal alignment between a1a2 . . . ai and b1b2 . . . bj. By definition, we have S[0, 0] = 0, S[i, 0] = −β × i, and S[0, j] = −β × j. With these initializations, S[i, j] for i ∈ {1, 2, . . . , m} and j ∈ {1, 2, . . . , n} can be computed by the following recurrence:

S[i, j] = max{S[i − 1, j] − β, S[i, j − 1] − β, S[i − 1, j − 1] + σ(ai, bj)}.
Figure 3.4 explains the recurrence by showing that there are three possible ways
entering into the grid point (i, j), and we take the maximum of their path weights.
The weight of the maximum-scoring path entering (i, j) from (i − 1, j) vertically
is the weight of the maximum-scoring path entering (i − 1, j) plus the weight of
edge (i − 1, j) → (i, j). That is, the weight of the maximum-scoring path entering
(i, j) with a deletion gap symbol at the end is S[i − 1, j] − β . Similarly, the weight
of the maximum-scoring path entering (i, j) from (i, j − 1) horizontally is S[i, j −
1] − β and the weight of the maximum-scoring path entering (i, j) from (i − 1, j −
1) diagonally is S[i − 1, j − 1] + σ (ai , b j ). To compute S[i, j], we simply take the
maximum value of these three choices. The value S[m, n] is the score of an optimal
global alignment between sequences A and B.
Figure 3.5 gives the pseudo-code for computing the score of an optimal global
alignment. Whenever there is a tie, any one of them will work. Since there are
O(mn) entries and the time spent for each entry is O(1), the total running time of
algorithm GLOBAL ALIGNMENT SCORE is O(mn).
Now let us use an example to illustrate the tabular computation. Figure 3.6 com-
putes the score of an optimal alignment of the two sequences ATACATGTCT and
GTACGTCGG, where a match is given a bonus score 8, a mismatch is penalized by
a score −5, and the gap penalty for each gap symbol is −3. The first row and col-
umn of the table are initialized with proper penalties. Other entries are computed
in order. Take the entry (5, 3) for example. Upon computing the value of this entry,
the following values are ready: S[4, 2] = −3, S[4, 3] = 8, and S[5, 2] = −6. Since the
edge weight of (4, 2) → (5, 3) is 8 (a match symbol “A”), the maximum score from
(4, 2) to (5, 3) is −3 + 8 = 5. The maximum score from (4, 3) is 8 − 3 = 5, and the
maximum score from (5, 2) is −6 − 3 = −9. Taking the maximum of them, we have S[5, 3] = 5. Once the table has been computed, the value in the rightmost cell of the last row, i.e., S[10, 9] = 29, is the score of an optimal global alignment.

Fig. 3.4 There are three ways of entering the grid point (i, j): vertically from (i − 1, j) or horizontally from (i, j − 1), each with weight −β, or diagonally from (i − 1, j − 1) with weight σ(ai, bj).
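The tabular computation translates directly into code; a minimal score-only Python sketch (function name ours), which reproduces S[10, 9] = 29 on the running example:

def global_alignment_score(A, B, match=8, mismatch=-5, beta=3):
    """Needleman-Wunsch tabular computation of S[i][j]."""
    m, n = len(A), len(B)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        S[i][0] = -beta * i                # leading deletions
    for j in range(1, n + 1):
        S[0][j] = -beta * j                # leading insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sigma = match if A[i - 1] == B[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j] - beta,           # vertical
                          S[i][j - 1] - beta,           # horizontal
                          S[i - 1][j - 1] + sigma)      # diagonal
    return S[m][n]

print(global_alignment_score("ATACATGTCT", "GTACGTCGG"))    # 29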
In Section 2.4.4, we have shown that if backtracking information is saved for each entry while we compute the dynamic-programming matrix, an optimal solution can be derived by following the backtracking pointers. Here we show that even if we don't save those backtracking pointers, we can still reconstruct an optimal solution by examining the values of an entry's possible contributors.

Fig. 3.6 The score of an optimal global alignment of the two sequences ATACATGTCT and GTACGTCGG, where a match is given a bonus score 8, a mismatch is penalized by a score −5, and the gap penalty for each gap symbol is −3.

Figure 3.7 lists the pseudo-code for delivering an optimal global alignment, where an initial call GLOBAL ALIGNMENT OUTPUT(A, B, S, m, n) is made to deliver an optimal global
alignment. Specifically, we trace back the dynamic-programming matrix from the
entry (m, n) recursively according to the following rules. Let (i, j) be the entry under
consideration. If i = 0 or j = 0, we simply output all the insertion pairs or deletion
pairs in these boundary conditions. Otherwise, consider the following three cases. If S[i, j] = S[i − 1, j − 1] + σ(ai, bj), we make a diagonal move and output a substitution pair aligning ai with bj. If S[i, j] = S[i − 1, j] − β, then we make a vertical move and output a deletion pair aligning ai with −. Otherwise, it must be the case where S[i, j] = S[i, j − 1] − β. We simply make a horizontal move and output an insertion pair aligning − with bj. Algorithm GLOBAL ALIGNMENT OUTPUT takes O(m + n) time in total since each recursive
call reduces i and/or j by one. The total space complexity is O(mn) since the size of
the dynamic-programming matrix is O(mn). In Section 3.6, we shall show that an
optimal global alignment can be recovered even if we don’t save the whole matrix.
Figure 3.8 delivers an optimal global alignment by backtracking from the right-
most cell of the last row of the dynamic-programming matrix computed in Fig-
ure 3.6. We start from the entry (10, 9) where S[10, 9] = 29. We have a tie there
because both S[10, 8] − 3 and S[9, 8] − 5 equal to 29. In this illustration, the horizon-
tal move to the entry (10, 8) is chosen. Interested readers are encouraged to try the
diagonal move to the entry (9, 8) for an alternative optimal global alignment, which
is actually chosen by GLOBAL ALIGNMENT OUTPUT. Continue this process until
the entry (0, 0) is reached. The shaded area depicts the backtracking path whose
corresponding alignment is given on the right-hand side of the figure.
It should be noted that during the backtracking procedure, we derive the aligned
pairs in a reverse order of the alignment. That’s why we make a recursive call before
actually printing out the pair in Figure 3.7. Another approach is to compute the
dynamic-programming matrix backward from the rightmost cell of the last row to
the leftmost cell of the first row. Then when we trace back from the leftmost cell of
the first row toward the rightmost cell of the last row, the aligned pairs are derived
in the same order as in the alignment. This approach could avoid the overhead of
reversing an alignment.
3.4 Local Alignment

In many applications, a global (i.e., end-to-end) alignment of the two given se-
quences is inappropriate; instead, a local alignment (i.e., involving only a part of
each sequence) is desired. In other words, one seeks a high-scoring local path that
need not terminate at the corners of the dynamic-programming matrix.
Let S[i, j] denote the score of the highest-scoring local path ending at (i, j) between a1a2 . . . ai and b1b2 . . . bj. S[i, j] can be computed as follows:

S[i, j] = max{0, S[i − 1, j] − β, S[i, j − 1] − β, S[i − 1, j − 1] + σ(ai, bj)}.
The recurrence is quite similar to that for global alignment except the first en-
try “zero.” For local alignment, we are not required to start from the source (0, 0).
Therefore, if the scores of all possible paths ending at the current position are all
negative, they are reset to zero. The largest value of S[i, j] is the score of the best
local alignment between sequences A and B.
Figure 3.9 gives the pseudo-code for computing the score of an optimal local
alignment. Whenever there is a tie, any one of them will work. Since there are O(mn)
entries and the time spent for each entry is O(1), the total running time of algorithm
LOCAL ALIGNMENT SCORE is O(mn).
Now let us use an example to illustrate the tabular computation. Figure 3.10 com-
putes the score of an optimal local alignment of the two sequences ATACATGTCT
and GTACGTCGG, where a match is given a bonus score 8, a mismatch is penalized
by a score −5, and the gap penalty for each gap symbol is −3. The first row and column of the table are initialized with zeros. Other entries are computed in order. Take
the entry (5, 5) for example. Upon computing the value of this entry, the following
values are ready: S[4, 4] = 24, S[4, 5] = 21, and S[5, 4] = 21. Since the edge weight
of (4, 4) → (5, 5) is −5 (a mismatch), the maximum score from (4, 4) to (5, 5) is
24−5 = 19. The maximum score from (4, 5) is 21−3 = 18, and the maximum score
from (5, 4) is 21 − 3 = 18. Taking the maximum of them, we have S[5, 5] = 19. Once
Fig. 3.10 Computation of the score of an optimal local alignment of the sequences ATACATGTCT
and GTACGTCGG.
the table has been computed, the maximum value, i.e., S[9, 7] = 42, is the score of
an optimal local alignment.
Figure 3.11 lists the pseudo-code for delivering an optimal local alignment, where an initial call LOCAL ALIGNMENT OUTPUT(A, B, S, Endi, Endj) is made to deliver an optimal local alignment. Specifically, we trace back the dynamic-programming matrix from the maximum-score entry (Endi, Endj) recursively according to the following rules. Let (i, j) be the entry under consideration. If S[i, j] = 0, we have reached the beginning of the optimal local alignment. Otherwise, consider the following three cases. If S[i, j] = S[i − 1, j − 1] + σ(ai, bj), we make a diagonal move and output a substitution pair (ai, bj). If S[i, j] = S[i − 1, j] − β, then we make a vertical move and output a deletion pair (ai, −). Otherwise, it must be the case where S[i, j] = S[i, j − 1] − β. We simply make a horizontal move and output an insertion pair (−, bj). Algorithm LOCAL ALIGNMENT OUTPUT takes O(m + n)
time in total since each recursive call reduces i and/or j by one. The space com-
plexity is O(mn) since the size of the dynamic-programming matrix is O(mn). In
Section 3.6, we shall show that an optimal local alignment can be recovered even if
we don’t save the whole matrix.
Figure 3.12 delivers an optimal local alignment by backtracking from the maximum-scoring entry of the dynamic-programming matrix computed in Figure 3.10.
We start from the entry (9, 7) where S[9, 7] = 42. Since S[8, 6] + 8 = 34 + 8 = 42 =
S[9, 7], we make a diagonal move back to the entry (8, 6). Continue this process until
an entry with zero value is reached. The shaded area depicts the backtracking path
whose corresponding alignment is given on the right-hand side of the figure.
Further complications arise when one seeks k best alignments, where k > 1. For
computing an arbitrary number of non-intersecting and high-scoring local align-
ments, Waterman and Eggert [198] developed a very time-efficient method. It
records those high-scoring candidate regions of the dynamic-programming matrix
in the first pass. Each time a best alignment is reported, it recomputes only those
entries in the affected area rather than recompute the whole matrix. Its linear-space
implementation was developed by Huang and Miller [92].
On the other hand, to attain greater speed, the strategy of building alignments
from alignment fragments is often used. For example, one could specify some frag-
ment length w and work with fragments consisting of a segment of length at least
w that occurs exactly or approximately in both sequences. In general, algorithms
that optimize the score over alignments constructed from fragments can run faster
than algorithms that optimize over all possible alignments. Moreover, alignments
constructed from fragments have been very successful as initial filtering criteria
within programs that search a sequence database for matches to a query sequence.
Database sequences whose alignment score with the query sequence falls below
a threshold are ignored, and the remaining sequences are subjected to a slower
but higher-resolution alignment process. The high-resolution process can be made
more efficient by restricting the search to a “neighborhood” of the alignment-from-
fragments. Chapter 4 will introduce four such homology search programs: FASTA,
BLAST, BLAT, and PatternHunter.
Fig. 3.12 Computation of an optimal local alignment of the two sequences ATACATGTCT and
GTACGTCGG.
3.5 Various Scoring Schemes

In this section, we shall briefly discuss how to modify the dynamic-programming methods to cope with three scoring schemes that are frequently used in biological sequence analysis.
3.5.1 Affine Gap Penalties

For aligning DNA and protein sequences, affine gap penalties are considered more appropriate than the simple scoring scheme discussed in the previous sections. "Affine" means that a gap of length k is penalized α + k × β, where α and β are both nonnegative constants. In other words, it costs α to open up a gap plus β for each symbol in the gap.
each symbol in the gap. Figure 3.13 computes the score of a global alignment of the
two sequences ATACATGTCT and GTACGTCGG under affine gap penalties, where
a match is given a bonus score 8, a mismatch is penalized by a score −5, and the
penalty for a gap of length k is −4 − k × 3.
In order to determine if a gap is newly opened, two more matrices are used to
distinguish gap extensions from gap openings. Let D(i, j) denote the score of an op-
timal alignment between a1 a2 . . . ai and b1 b2 . . . b j ending with a deletion. Let I(i, j)
denote the score of an optimal alignment between a1 a2 . . . ai and b1 b2 . . . b j end-
ing with an insertion. Let S(i, j) denote the score of an optimal alignment between
a1 a2 . . . ai and b1 b2 . . . b j .
By definition, D(i, j) can be derived as follows:

    D(i, j) = max { D(i − 1, j) − β, S(i − 1, j) − α − β }.

Fig. 3.13 The score of a global alignment of the two sequences ATACATGTCT and GTACGTCGG under affine gap penalties.
This recurrence can be explained in an alternative way. Recall that D(i, j) denotes
the score of an optimal alignment between a1 a2 . . . ai and b1 b2 . . . b j ending with a
deletion. If such an alignment is an extension of the alignment ending at (i − 1, j)
with a deletion, then it costs only β for such a gap extension. Thus, in this case,
D(i, j) = D(i − 1, j) − β . Otherwise, it is a new deletion gap and an additional gap-
opening penalty α is imposed. We have D(i, j) = S(i − 1, j) − α − β .
In a similar way, we derive I(i, j) as follows:

    I(i, j) = max { I(i, j − 1) − β, S(i, j − 1) − α − β }.
Therefore, with proper initializations, D(i, j), I(i, j), and S(i, j) can be computed by the following recurrences:

    D(i, j) = max { D(i − 1, j) − β, S(i − 1, j) − α − β };

    I(i, j) = max { I(i, j − 1) − β, S(i, j − 1) − α − β };

    S(i, j) = max { D(i, j), I(i, j), S(i − 1, j − 1) + σ(ai, bj) }.
3.5.2 Constant Gap Penalties

Now let us consider the constant gap penalties where each gap, regardless of its length, is charged a nonnegative constant penalty α.
Let D(i, j) denote the score of an optimal alignment between a1 a2 . . . ai and
b1 b2 . . . b j ending with a deletion. Let I(i, j) denote the score of an optimal align-
ment between a1 a2 . . . ai and b1 b2 . . . b j ending with an insertion. Let S(i, j) denote
the score of an optimal alignment between a1 a2 . . . ai and b1 b2 . . . b j . With proper
initializations, D(i, j), I(i, j) and S(i, j) can be computed by the following recur-
rences. In fact, these recurrences can be easily derived from those for the affine gap
penalties by setting β to zero. A gap penalty is imposed when the gap is just opened,
and the extension is free of charge.
    D(i, j) = max { D(i − 1, j), S(i − 1, j) − α };

    I(i, j) = max { I(i, j − 1), S(i, j − 1) − α };

    S(i, j) = max { D(i, j), I(i, j), S(i − 1, j − 1) + σ(ai, bj) }.
3.5.3 Restricted Affine Gap Penalties

Another interesting scoring scheme is called the restricted affine gap penalties, in which a gap of length k is penalized by α + f(k) × β, where α and β are both nonnegative constants, and f(k) = min{k, ℓ} for a given positive integer ℓ.
In order to deal with the free long gaps, two more matrices D′(i, j) and I′(i, j) are used to record the long gap penalties in advance. With proper initializations, D(i, j), D′(i, j), I(i, j), I′(i, j), and S(i, j) can be computed by the following recurrences:
    D(i, j) = max { D(i − 1, j) − β, S(i − 1, j) − α − β };

    D′(i, j) = max { D′(i − 1, j), S(i − 1, j) − α − ℓ × β };

    I(i, j) = max { I(i, j − 1) − β, S(i, j − 1) − α − β };

    I′(i, j) = max { I′(i, j − 1), S(i, j − 1) − α − ℓ × β };

    S(i, j) = max { D(i, j), D′(i, j), I(i, j), I′(i, j), S(i − 1, j − 1) + σ(ai, bj) }.

Fig. 3.14 There are seven ways of entering the three grid points of an entry (i, j).
Fig. 3.15 Entry locations of S− just before the entry value is evaluated at (i, j).
3.6 Space-Saving Strategies

The dynamic-programming solutions described in the previous sections store the entire O(mn) matrix, which becomes prohibitive when two long sequences are aligned. Because of this, different methods have been proposed to reduce the space used for aligning two sequences globally or locally.
We first describe a space-saving strategy proposed by Hirschberg in 1975 [91].
It uses only “linear space,” i.e., space proportional to the sum of the sequences’
lengths. The original formulation was for the longest common subsequence problem
that is discussed in Section 2.4.4. But the basic idea is quite robust and works readily for globally aligning two sequences with affine gap costs, as shown by Myers and Miller in 1988 [150]. Remarkably, this space-saving strategy has the same time
complexity as the original dynamic programming method presented in Section 3.3.
To introduce Hirschberg’s approach, let us first review the original algorithm
presented in Figure 3.5 for aligning two sequences of lengths m and n. It is apparent
that the scores in row i of the dynamic-programming matrix S are calculated from those
in row i − 1. Thus, after the scores in row i of S are calculated, the entries in row
i − 1 of S will no longer be used and hence the space used for storing these entries
can be recycled to calculate and store the entries in row i + 1. In other words, we
can get by with space for two rows, since all that we ultimately want is the single
entry S[m, n] in the rightmost cell of the last row.
In fact, a single array S− of size n, together with two extra variables, is adequate. S−[j] holds the most recently computed value for each 1 ≤ j ≤ n, so that as soon as the value of the jth entry of S− is computed, the old value at the entry is overwritten. There is a slight conflict in this strategy since we need the old value of an entry to compute a new value of the entry. To avoid this conflict, two additional variables, say s and c, are introduced to hold the new and old values of the entry, respectively. Figure 3.15 shows the locations of the scores kept in S− and in variables s and c. When S−[j] is updated, S−[j′] holds the score in the entry (i, j′) in row i for each j′ < j, and it holds the score in the entry (i − 1, j′) for any j′ ≥ j. Figure 3.16 gives the pseudo-code for computing the score of an optimal global alignment in linear space.
In the dynamic programming matrix S of aligning sequences A = a1 a2 . . . am and
B = b1 b2 . . . bn , S[i, j] denotes the optimal score of aligning a1 a2 . . . ai and b1 b2 . . . b j
Fig. 3.16 Computation of the optimal score of aligning sequences of lengths m and n in linear
space O(n).
or, equivalently, the maximum score of a path from (0, 0) to the cell (i, j) in the
alignment graph. By symmetry, the optimal score of aligning ai+1 ai+2 . . . am and
b j+1 b j+2 . . . bn or the maximum score of a path from (i, j) to (m, n) in the alignment
graph can be calculated in linear space in a backward manner. Figure 3.17 gives the
pseudo-code for computing the score of an optimal global alignment in a backward
manner in linear space.
In what follows, we use S− [i, j] and S+ [i, j] to denote the maximum score of
a path from (0, 0) to (i, j) and that from (i, j) to (m, n) in the alignment graph,
respectively. Without loss of generality, we assume that m is a power of 2. Obviously,
for each j, S− [m/2, j] + S+ [m/2, j] is the maximum score of a path from (0, 0) to
(m, n) through (m/2, j) in the alignment graph. Choose jmid such that

    S−[m/2, jmid] + S+[m/2, jmid] = max { S−[m/2, j] + S+[m/2, j] : 0 ≤ j ≤ n }.
Then, S− [m/2, jmid ] + S+ [m/2, jmid ] is the optimal alignment score of A and B and
there is a path having such a score from (0, 0) to (m, n) through (m/2, jmid ) in the
alignment graph.
Hirschberg’s linear-space approach is first to compute S− [m/2, j] for 1 ≤ j ≤ n
by a forward pass, stopping at row m/2 and to compute S+ [m/2, j] for 1 ≤ j ≤ n
by a backward pass and then to find jmid . After jmid is found, recursively compute
an optimal path from (0, 0) to (m/2, jmid ) and an optimal path from (m/2, jmid ) to
(m, n).
As the problem is partitioned further, there is a need to have an algorithm that
is capable of delivering an optimal path for any specified two ends. In Figure 3.18,
algorithm L INEAR ALIGN is a recursive procedure that delivers a maximum-scoring
path from (i1 , j1 ) to (i2 , j2 ). To deliver the whole optimal alignment, the two ends
are initially specified as (0, 0) and (m, n).
Now let us analyze the time and space taken by Hirschberg’s approach. Using
the algorithms given in Figures 3.16 and 3.17, both the forward and backward pass
take O(nm/2) time and O(n) space. Hence, it takes O(mn) time and O(n) space
to find jmid . Set T = mn and call it the size of the problem of aligning A and B. At
each recursive step, a problem is divided into two subproblems. However, regardless
of where the optimal path crosses the middle row m/2, the total size of the two
resulting subproblems is exactly half the size of the problem that we have at the
recursive step (see Figure 3.19). It follows that the total size of all problems, at all
levels of recursion, is at most T + T /2 + T /4 + · · · = 2T . Because computation time
is directly proportional to the problem size, Hirschberg’s approach will deliver an
optimal alignment using O(2T ) = O(T ) time. In other words, it yields an O(mn)-
time, O(n)-space global alignment algorithm.
Hirschberg’s original method, and the above discussion, apply to the case where
the penalty for a gap is merely proportional to the gap’s length, i.e., k × β for a k-
symbol gap. For applications in molecular biology, one wants penalties of the form
α + k × β , i.e., each gap is assessed an additional “gap-open” penalty α . Actually,
one can be slightly more general and substitute residue-dependent penalties for β .
In Section 3.5.1, we have shown that the relevant alignment graph is more compli-
cated. Now at each grid point (i, j) there are three nodes, denoted (i, j)S , (i, j)D , and
(i, j)I , and generally seven entering edges, as pictured in Figure 3.14. The align-
ment problem is to compute a highest-score path from (0, 0)S to (m, n)S . Fortu-
nately, Hirschberg’s strategy extends readily to this more general class of alignment
scores [150]. In essence, the main additional complication is that for each defining
corner of a subproblem, we need to specify one of the grid point’s three nodes.
Another issue is how to deliver an optimal local alignment in linear space. Recall
that in the local alignment problem, one seeks a highest-scoring alignment where
the end nodes can be arbitrary, i.e., they are not restricted to (0, 0)S and (m, n)S . In
fact, it can be reduced to a global alignment problem by performing a linear-space
score-only pass over the dynamic-programming matrix to locate the first and last
nodes of an optimal local alignment, then delivering a global alignment between
these two nodes by applying Hirschberg's approach.

Fig. 3.17 Backward computation of the score of an optimal global alignment in linear space.
3.7 Other Advanced Topics

3.7.1 Constrained Alignment

Here we introduce a more general approach that can be easily utilized by other relevant problems.
Given a narrow region R with two boundary lines L and U, we can proceed as
follows. We assume that L and U are non-decreasing since if, e.g., L[i] were larger
than L[i + 1], we could set L[i + 1] to equal L[i] without affecting the set of con-
strained alignments. Enclose as many rows as possible from the top of the region in
an upright rectangle, subject to the condition that the rectangle's area is at most double the area of its intersection with R. Then starting with the first row of R not in
the rectangle, we cover additional rows of R with a second such rectangle, and so
on.
A score-only backward pass is made over R, computing S+ . Values of S+ are
retained for the top line in every rectangle (the top rectangle can be skipped). It can
be shown that the total length of these pieces cannot exceed three times the total
number of columns, as required for a linear space bound. Next, perform a score-
only forward pass, stopping at the last row in the first rectangle. A sweep along
the boundary between the first and second rectangles locates a crossing edge on an
optimal path through R. That is, we can find a point p on the last row of the first
rectangle and a point q on the first row of the second rectangle such that there is a
vertical or diagonal edge e from p to q, and e is on an optimal path. Such an optimal
path can be found by applying Hirschberg’s strategy to R’s intersection with the first
rectangle (omitting columns following p) and recursively computing a path from q
through the remainder of R. This process inspects a grid point at most once during
the backward pass, once in a forward pass computing p and q, and an average of
four times for applying Hirschberg’s method to R’s intersection with a rectangle.
3.7.2 Similar Sequence Alignment

If two sequences are very similar, more efficient algorithms can be devised to deliver
an optimal alignment for them. In this case, we know that the maximum-scoring
path in the alignment graph will not get too far away from the diagonal of the source
(0, 0). One way is to draw a constrained region to restrict the diversion and run the
constrained alignment algorithm introduced in Section 3.7.1. Another approach is
to grow the path greedily until the destination (m, n) is reached. For this approach,
instead of working with the maximization of the alignment score, we look for the
minimum-cost set of single-nucleotide changes (i.e., insertions, deletions, or substi-
tutions) that will convert one sequence to the other. Any match costs zero, which
allows us to have a free advance. As for penalty, we have to pay a certain amount of
cost to get across a mismatch or a gap symbol.
Now we briefly describe the approach based on the diagonalwise monotonicity of
the cost tables. The following cost function is employed. Each match costs 0, each
mismatch costs 1, and a gap of length k is penalized at the cost of k + 1. Adding 1
to a gap’s length to derive its cost decreases the likelihood of generating gaps that
are separated by only a few paired nucleotides. The edit graph for sequences A and
B is a directed graph with a vertex at each integer grid point (x, y), 0 ≤ x ≤ m and
0 ≤ y ≤ n. Let I(z, c) denote the x value of the farthest point in diagonal z (=y − x)
that can be reached from the source (i.e., grid point (0, 0)) with cost c and that is
free to open an insertion gap. That is, the grid point can be (1) reached by a path of
cost c that ends with an insertion, or (2) reached by any path of cost c − 1 and the
gap-open penalty of 1 can be “paid in advance.” (The more traditional definition,
which considers only case (1), results in the storage of more vectors.) Let D(z, c)
denote the x value of the farthest point in diagonal z that can be reached from the
source with cost c and is free to open a deletion gap. Let S(z, c) denote the x value of
the farthest point in diagonal z that can be reached from the source with cost c. With
proper initializations, these vectors can be computed by suitable recurrence relations.
3.7.3 Suboptimal Alignments

The alignment methods discussed in the previous sections have been used to reveal similarities among biological sequences, to study gene regulation, and even to infer evolutionary trees.
However, biologically significant alignments are not necessarily mathematically
optimal. It has been shown that sometimes the neighborhood of an optimal align-
ment reveals additional interesting biological features. Besides, the most strongly
conserved regions can be effectively located by inspecting the range of variation of
suboptimal alignments. Although rigorous statistical analysis for the mean and vari-
ance of optimal global alignment scores is not yet available, suboptimal alignments
have been successfully used to informally estimate the significance of an optimal
alignment.
For most applications, it is impractical to enumerate all suboptimal alignments
since the number could be enormous. Therefore, a more compact representation of
all suboptimal alignments is indispensable. A 0-1 matrix can be used to indicate if
a pair of positions is in some suboptimal alignment or not. However, this approach
misses some connectivity information among those pairs of positions. An alternative
is to use a set of “canonical” suboptimal alignments to represent all suboptimal
alignments. The kernel of that representation is a minimal directed acyclic graph
(DAG) containing all suboptimal alignments.
Suppose we are given a threshold score that does not exceed the optimal align-
ment score. An alignment is suboptimal if its score is at least as large as the thresh-
old score. Here we briefly describe a linear-space method that finds all edges that
are contained in at least one path whose score exceeds a given threshold τ . Again, a
recursive subproblem will consist of applying the alignment algorithm over a rectan-
gular portion of the original dynamic-programming matrix, but now it is necessary
that we continue to work with values S− and S+ that are defined relative to the origi-
nal problem. To accomplish this, each problem to be solved is defined by specifying
values of S− for nodes on the upper and left borders of the defining rectangle, and
values of S+ for the lower and right borders.
To divide a problem of this form, a forward pass propagates values of S− to nodes
in the middle row and the middle column, and a backward pass propagates values
S+ to those nodes. This information allows us to determine all edges starting in the
middle row or middle column that are contained in a path of score at least τ . The
data determining any one of the four subproblems, i.e., the arrays of S values on its
borders, is then at most half the size of the set of data defining the parent problem.
The maximum total space requirement is realized when recursion reaches a directly
solvable problem where there is only the leftmost cell of the first row of the original
grid left; at that time there are essentially 2(m + n) S-values saved for borders of
the original problem, m + n values on the middle row and column of the original
problem, (m + n)/2 values for the upper left subproblem, (m + n)/4 values for the
upper-left-most subsubproblem, etc., giving a total of about 4(m + n) retained S-
values.
3.7.4 Robustness Measurement

The utility of information about the reliability of different regions within an alignment is widely appreciated; see [192] for example. One approach to obtaining such
information is to determine suboptimal alignments, i.e., some or all alignments
that come within a specified tolerance of the optimum score, as discussed in Sec-
tion 3.7.3. However, the number of suboptimal alignments, or even alternative opti-
mal alignments, can easily be so large as to preclude an exhaustive enumeration.
Sequence conservation has proved to be a reliable indicator of at least one class of
regulatory elements. Specifically, regions of six or more consecutive nucleotides that are identical across a range of mammalian sequences, called "phylogenetic footprints," frequently correspond to binding sites for sequence-specific nuclear proteins. It is
also interesting to look for longer, imperfectly conserved (but stronger matching)
regions, which may indicate other sorts of regulatory elements, such as a region that
binds to a nuclear matrix or assumes some altered chromatin structure.
In the following, we briefly describe some interesting measurements of the ro-
bustness of each aligned pair of a pairwise alignment. The first method computes,
for each position i of the first sequence, the lower and upper limits of the positions
in the second sequence to which it can be aligned and still come within a specified
tolerance of the optimum alignment score. Delimiting suboptimal alignments this
way, rather than enumerating all of them, allows the computation to run in only a
small constant factor more time than the computation of a single optimal alignment.
Another method determines, for each aligned pair of an optimal alignment, the
amount by which the optimum score must be lowered before reaching an align-
ment not containing that pair. In other words, if the optimum alignment score is s
and the aligned pair is assigned the robustness-measuring number r, then any align-
ment scoring strictly greater than s − r aligns those two sequence positions, whereas
some alignment of score s − r does not align them. As a special case, this value
Fig. 3.21 The total number of the boundary entries in the active subproblems is O(m + n).
tells whether the pair is in all optimal alignments (namely, the pair is in all opti-
mal alignments if and only if its associated value is non-zero). These computations
are performed using dynamic-programming methods that require only space pro-
portional to the sum of the two sequence lengths. It has also been shown how to
efficiently handle the case where alignments are constrained so that each position,
say position i, of the first sequence can be aligned only to positions on a certain
range of the second sequence.
To deliver an optimal alignment, Hirschberg’s approach applies forward and
backward passes in the first nondegenerate rectangle along the optimal path be-
ing generated. Within a subproblem (i.e., rectangle) the scores of paths can be taken
relative to the “start node” at the rectangle’s upper left and the “end node” at the
rightmost cell of the last row. This means that a subproblem is completely specified
by giving the coordinates of those two nodes. In contrast, methods for the robust-
ness measurement must maintain more information about each pending subproblem.
Fortunately, it can be done in linear space by observing that the total number of the
boundary entries of all pending subproblems of Hirschberg’s approach is bounded
by O(m + n) (see Figure 3.21).
3.8 Bibliographic Notes and Further Reading

3.1
3.2
3.3
The global alignment method was proposed by Needleman and Wunsch [151].
Such a dynamic-programming method was independently discovered by Wagner
and Fischer [194] and workers in other fields. For a survey of the history, see the
book by Sankoff and Kruskal [175].
3.4
3.5
3.6
Readers can refer to [40, 41, 43] for more space-saving strategies.
3.7
Readers should also be aware that the hidden Markov models are a probabilistic
approach to sequence comparison. They have been widely used in the bioinformat-
ics community [61]. Given an observed sequence, the Viterbi algorithm computes
the most probable state path. The forward algorithm computes the probability that
a given observed sequence is generated by the model, whereas the backward al-
gorithm computes the probability that a given observed symbol was generated by
a given state. The book by Durbin et al. [61] is a terrific reference book for this
paradigm.
Alignment of two genomic sequences poses problems not well addressed by ear-
lier alignment programs.
PipMaker [178] is a software tool for comparing two long DNA sequences to
identify conserved segments and for producing informative, high-resolution dis-
plays of the resulting alignments. It displays a percent identity plot (pip), which
shows both the position in one sequence and the degree of similarity for each
aligning segment between the two sequences in a compact and easily understand-
able form. The alignment program used by the PipMaker network server is called
BLASTZ [177]. It is an independent implementation of the Gapped BLAST algo-
rithm specifically designed for aligning two long genomic sequences. Several mod-
ifications have been made to BLASTZ to attain efficiency adequate for aligning
entire mammalian genomes and to increase the sensitivity.
MUMmer [119] is a system for aligning entire genomes rapidly. The core of the
MUMmer algorithm is a suffix tree data structure, which can be built and searched
in linear time and which occupies only linear space. DisplayMUMs 1.0 graphically
presents alignments of MUMs from a set of query sequences and a single reference
sequence. Users can navigate MUM alignments to visually analyze coverage, tiling
patterns, and discontinuities due to misassemblies or SNPs.
The analysis of genome rearrangements is another exciting field for whole
genome comparison. It looks for a series of genome rearrangements that would
transform one genome into another. It was pioneered by Dobzhansky and Sturte-
vant [60] in 1938. Recent milestone advances include the works by Bafna and
Pevzner [19], Hannenhalli and Pevzner [86], and Pevzner and Tesler [166].
Chapter 4
Homology Search Tools
The alignment methods introduced in Chapter 3 are good for comparing two
sequences accurately. However, they are not adequate for homology search against
a large biological database such as GenBank. As of February 2008, there are ap-
proximately 85,759,586,764 bases in 82,853,685 sequence records in the traditional
GenBank divisions. To search such huge databases, faster methods are re-
quired for identifying the homology between the query sequence and the database
sequence in a timely manner.
One common feature of homology search programs is the filtration idea, which
uses exact matches or approximate matches between the query sequence and the
database sequence as a basis to judge if the homology between the two sequences
passes the desired threshold.
This chapter is divided into six sections. Section 4.1 describes how to implement
the filtration idea for finding exact word matches between two sequences by using
efficient data structures such as hash tables, suffix trees, and suffix arrays.
FASTA was the first popular homology search tool, and its file format is still
widely used. Section 4.2 briefly describes a multi-step approach used by FASTA for
finding local alignments.
BLAST is the most popular homology search tool now. Section 4.3 reviews the
first version of BLAST, Ungapped BLAST, which generates ungapped alignments.
It then reviews two major products of BLAST 2.0: Gapped BLAST and Position-
Specific Iterated BLAST (PSI-BLAST). Gapped BLAST produces gapped align-
ments, yet it is able to run faster than the original one. PSI-BLAST can be used
to find distant relatives of a protein based on the profiles derived from the multi-
ple alignments of the highest scoring database sequence segments with the query
segment in iterative Gapped BLAST searches.
Section 4.4 describes BLAT, short for “BLAST-like alignment tool.” It is of-
ten used to search for the database sequences that are closely related to the query
sequences such as producing mRNA/DNA alignments and comparing vertebrate se-
quences.
PatternHunter, introduced in Section 4.5, is more sensitive than BLAST when
a hit contains the same number of matches. A novel idea in PatternHunter is the
use of an optimized spaced seed. Furthermore, it has been demonstrated that using optimized multiple spaced seeds will speed up the computation even more.
Finally, we conclude the chapter with the bibliographic notes in Section 4.6.
4.1 Finding Exact Word Matches

An exact word match is a run of identities between two sequences. In the following,
we discuss how to find all short exact word matches, sometimes referred to as hits,
between two sequences using efficient data structures such as hash tables, suffix
trees, and suffix arrays.
Given two sequences A = a1 a2 . . . am , and B = b1 b2 . . . bn , and a positive integer
k, the exact word match problem is to find all occurrences of exact word matches
of length k, referred to as k-mers, between A and B. This is a classic algorithmic
problem that has been investigated for decades. Here we describe three approaches
for this problem.
A hash table associates keys with numbers. It uses a hash function to transform a given key into a number, called the hash value, which is used as an index to look up or store
the corresponding data. A method that uses a hash table for finding all exact word
matches of length w between two DNA sequences A and B is described as follows.
Since a DNA sequence is a sequence of four letters A, C, G, and T, there are 4^w possible DNA w-mers. The following encoding scheme maps a DNA w-mer to an integer between 0 and 4^w − 1. Encode the nucleotides as A = 0, C = 1, G = 2, and T = 3, and let C = c1 c2 . . . cw be a w-mer. The hash value of C is written V(C) and its value is

    V(C) = V(c1) × 4^(w−1) + V(c2) × 4^(w−2) + · · · + V(cw) × 4^0,

where V(ci) denotes the code of the nucleotide ci. For example, for C = GTCAT, we have

    V(C) = 2 × 4^4 + 3 × 4^3 + 1 × 4^2 + 0 × 4^1 + 3 × 4^0 = 723.

In fact, we can use two bits to represent each nucleotide: A(00), C(01), G(10), and T(11). In this way, a DNA segment is transformed into a binary string by compressing four nucleotides into one byte. For C = GTCAT given above, we have the bit string 10 11 01 00 11, i.e., the binary representation of 723.
Initially, a hash table H of size 4^w is created. To find all exact word matches of length w between two sequences A and B, the following steps are executed. The first step is to hash sequence A into the table: all possible w-mers in A are calculated by sliding a window of length w along A, and the starting position of each w-mer is recorded in the entry of H indexed by its hash value. The second step is to look up each w-mer of B in H; the positions stored in the corresponding entry give all exact word matches.
4.2 FASTA
FASTA uses a multi-step approach to finding local alignments. First, it finds runs
of identities, and identifies regions with the highest density of identities. A param-
eter ktup is used to describe the minimum length of the identity runs. These runs
of identities are grouped together according to their diagonals. For each diagonal, it
locates the highest-scoring segment by adding up bonuses for matches and subtract-
ing penalties for intervening mismatches. The ten best segments of all diagonals are
selected for further consideration.
The next step is to re-score those selected segments using a scoring matrix such as PAM or BLOSUM, and eliminate segments that are unlikely to be part of the
alignment. If there exist several segments with scores greater than the cutoff, they
will be joined together to form a chain provided that the sum of the scores of the
joined regions minus the gap penalties is greater than the threshold.
Finally, it considers a band of a certain width, say 32 residues, centered on the chain
found in the previous step. A banded Smith-Waterman method is used to deliver an
optimal alignment between the query sequence and the database sequence.
Since FASTA was the first popular biological sequence database search program,
its sequence format, called FASTA format, has been widely adopted. FASTA format
is a text-based format for representing DNA, RNA, and protein sequences, where
each sequence is preceded by its name and comments as shown below:
>HAHU Hemoglobin alpha chain - Human
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTK
TYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNAL
SALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA
VHASLDKFLASVSTVLTSKYR
4.3 BLAST
The BLAST program is the most widely used tool for homology search in DNA and
protein databases. It finds regions of local similarity between a query sequence and
each database sequence. It also calculates the statistical significance of matches. It
has been used by numerous biologists to reveal functional and evolutionary relation-
ships between sequences and identify members of gene families.
The first version of BLAST was launched in 1990. It generates ungapped align-
ments and hence is called Ungapped BLAST. Seven years later, BLAST 2.0 came
to the world. Two major products of BLAST 2.0 are Gapped BLAST and Position-
Specific Iterated BLAST (PSI-BLAST). Gapped BLAST produces gapped align-
ments, yet it is able to run faster than the original one. PSI-BLAST can be used to
find distant relatives of a protein based on the profiles derived from the multiple align-
ments of the highest scoring database sequence segments with the query segment in
iterative Gapped BLAST searches.
4.3.1 Ungapped BLAST

As discussed in Chapter 3, all possible pairs of residues are assigned their similarity scores when we compare biological sequences. For protein sequences, PAM or BLOSUM scoring matrices are commonly used.

Fig. 4.5 A matrix of similarity scores for the pairs of residues of the two sequences GATCCATCTT and CTATCATTCTG.
Fig. 4.6 A maximum-scoring segment pair of the two sequences GATCCATCTT and
CTATCATTCTG.
For DNA sequences, hits are exact word matches of length w, whereas for protein sequences, these word pairs are fixed-length segment pairs that have a score no less than the threshold T.
Section 4.1 gives three methods for finding exact word matches between two
sequences. Figure 4.7 depicts all the exact word matches of length three between
the two sequences GATCCATCTT and CTATCATTCTG.
For protein sequences, we are not looking for exact word matches. Instead, a hit
is a fixed-length segment pair having a score no less than the threshold T . A query
word may be represented by several different words whose similarity scores with
the query word are at least T .
The second phase of BLAST is to extend a hit in both directions (diagonally)
to find a locally maximal-scoring segment pair containing that hit. It continues the
extension in one direction until the score has dropped more than X below the max-
imum score found so far for shorter extensions. If the resulting segment pair has
score at least S, then it is reported.
It should be noted that both the Smith-Waterman algorithm and BLAST asymptotically take time proportional to the product of the lengths of the sequences.
The speedup of BLAST comes from the reduced sample space size. For two se-
quences of lengths m and n, the Smith-Waterman algorithm involves (n + 1) × (m +
1) entries in the dynamic-programming matrix, whereas BLAST takes into account
only those w-mers, whose number is roughly mn/4^w for DNA sequences or mn/20^w for protein sequences.
Fig. 4.7 Exact word matches of length three between the two sequences GATCCATCTT and
CTATCATTCTG.
4.3.2 Gapped BLAST

Gapped BLAST uses a new criterion for triggering hit extensions and generates gapped alignments for segment pairs with "high scores."
It was observed that the hit extension step of the original BLAST consumes most
of the processing time, say 90%. It was also observed that an HSP of interest is much
longer than the word size w, and it is very likely to have multiple hits within a rel-
atively short distance of one another on the same diagonal. Specifically, the two-hit
method is to invoke an extension only when two non-overlapping hits occur within
distance D of each other on the same diagonal (see Figure 4.8). These adjacent non-
overlapping hits can be detected if we maintain, for each diagonal, the coordinate of
the most recent hit found.
Another desirable feature of Gapped BLAST is that it generates gapped align-
ments explicitly for some cases. The original BLAST delivers only ungapped align-
ments. Gapped alignments are implicitly taken care of by calculating a joint statis-
tical assessment of several distinct HSPs in the same database sequence.
A gapped extension is in general much slower than an ungapped extension. Two
ideas are used to handle gapped extensions more efficiently. The first idea is to
trigger a gapped extension only for those HSPs with scores exceeding a threshold
Sg . The parameter Sg is chosen in a way that no more than one gap extension is
invoked per 50 database sequences.
The second idea is to confine the dynamic programming to those cells for which
the optimal local alignment score drops no more than Xg below the best alignment
≤D
Fig. 4.8 Two non-overlapping hits within distance D of each other on the same diagonal.
Fig. 4.9 A scenario of the gap extensions in the dynamic-programming matrix confined by the
parameter Xg .
score found so far. The gapped extension for a selected HSP starts from a seed
residue pair, which is a central residue pair of the highest-scoring length-11 segment
pair along the HSP. If the HSP is shorter than 11, its central residue pair is chosen as
the seed. Then the gapped extension proceeds both forward and backward through
the dynamic-programming matrix confined by the parameter Xg (see Figure 4.9).
4.3.3 PSI-BLAST
PSI-BLAST runs BLAST iteratively with an updated scoring matrix generated auto-
matically. In each iteration, PSI-BLAST constructs a position-specific score matrix (PSSM) of dimension ℓ × 20 from a multiple alignment of the highest-scoring segments with the query segment of length ℓ. The constructed PSSM is then used to
score the segment pairs for the next iteration. It has been shown that this iterative
approach is often more sensitive to weak but biologically relevant sequence similar-
ities.
PSI-BLAST collects, from the BLAST output, all HSPs with E-value below a
threshold, say 0.01, and uses the query sequence as a template to construct a mul-
tiple alignment. For those selected HSPs, all database sequence segments that are
identical to the query segment are discarded, and only one copy is kept for those
database sequence segments with at least 98% identities. In fact, users can specify
the maximum number of database sequence segments to be included in the multiple
alignment. In case the number of HSPs with E-value below a threshold exceeds the
maximum number, only those top ones are reported. A sample multiple alignment
is given below:
query: GVDIIIMGSHGKTNLKEILLGSVTENVIKKSNKPVLVVK
seq1: GADVVVIGSR-NPSISTHLLGSNASSVIRHANLPVLVVR
seq2: PAHMIIIASH-RPDITTYLLGSNAAAVVRHAECSVLVVR
seq3: QAGIVVLGTVGRTGISAAFLGNTAEQVIDHLRCDLLVIK
If all segments in the alignment are given the same weight, then a small set of
more divergent sequences might be suppressed by a much larger set of closely re-
lated sequences. To avoid such a bias, PSI-BLAST assigns various sequence weights
by a sequence weighting method. Thus, to calculate the observed residue frequen-
cies of a column of a multiple alignment, PSI-BLAST takes its weighted frequencies
into account. In Chapter 8, we shall discuss the theoretical foundation for generating
scoring matrices from a given multiple alignment.
4.4 BLAT
BLAT is short for “BLAST-like alignment tool.” It is often used to search for
database sequences that are closely related to the query sequences. For DNA se-
quences, it aims to find those sequences of length 25 bp or more and with at least
95% similarity. For protein sequences, it finds those sequences of length 20 residues
or more and with at least 80% similarity.
A desirable feature is that BLAT builds an index of the whole database and keeps
it in memory. The index consists of non-overlapping K-mers and their positions
in the database. It excludes those K-mers that are heavily involved in repeats. DNA
BLAT sets K to 11, and protein BLAT sets K to 4. The index requires a few gigabytes
of RAM and is affordable for many users. This feature lets BLAT scan linearly
through the query sequence, rather than scan linearly through the database.
BLAT builds a list of hits by looking up each overlapping K-mer of the query se-
quence in the index (see Figure 4.10). The hits can be single perfect word matches or
near perfect word matches. The near perfect mismatch option allows one mismatch
in a hit. Given a K-mer, there are K × (|Σ | − 1) other possible K-mers that match
it in all but one position, where |Σ | is the alphabet size. Therefore, the near perfect
mismatch option would require K × (|Σ | − 1) + 1 lookups per K-mer of the query
sequences.
Fig. 4.10 The index consists of non-overlapping K-mers in the database, and each overlapping
K-mer of the query sequence is looked up for hits.
4.5 PatternHunter
In a spaced seed, each 1 denotes a position that must match, and each * denotes a don't-care position. For example, the words ACGTC and ATGAC form a word match under the spaced seed 1*1*1, but not under 11**1. Note that BLAST simply uses a consecutive model that consists of consecutive 1s, such as 11111.
In general, a spaced seed model shares fewer 1s with any of its shifted copies
than the contiguous one. Define the number of overlapping 1s between a model and
its shifted copy as the number of 1s in the shifted copy that correspond to 1s in the
model. The number of non-overlapping 1s between a model and its shifted copy
is the weight of the model minus the number of overlapping 1s. If there are more
non-overlapping 1s between the model and its shifted copy, then the conditional
probability of having another hit given one hit is smaller, resulting in higher sen-
sitivity. For rigorous analysis of the hit probabilities of spaced seeds, the reader is
referred to Chapter 6.
A model of length l has l − 1 shifted copies. For a model π of length l, the sum of overlapping hit probabilities between the model and each of its shifted copies, φ(π, p), is calculated by the equation

    φ(π, p) = Σ_{i=1}^{l−1} p^{n_i},    (4.1)

where n_i is the number of non-overlapping 1s between π and its copy shifted by i.

Fig. 4.11 Calculation of the sum of overlapping hit probabilities between the model π and each of its shifted copies, φ(π, p) = p^3 + p^2 + p^2 + p^4 + p^3, for π = 1*11*1.
One can develop an O(l)-time algorithm for updating the numbers of non-overlapping 1s for a swapped model. This suggests a practical computation method for evaluating φ values of all models by orderly swapping one pair of * and 1 at a time.
Given a spaced seed model, how can one find all the hits? Recall that only those
1s account for a match. One way is to employ a hash table like Figure 4.1 for exact
word matches but use a spaced index instead of a contiguous index. Let the spaced seed model be of length l and weight w. For DNA sequences, a hash table of size 4^w is initialized. Then we scan the sequence with a window of size l from left to right.
We extract the residues corresponding to those 1s as the index of the window (see
Figure 4.12).
Once the hash table is built, the lookup can be done in a similar way. Indeed, one
can scan the other sequence with a window of size l from left to right and extract
the residues corresponding to those 1s as the index for looking up the hash table for
hits.
Hits are extended diagonally in both directions until the score drops by a certain amount. Those extended hits with scores exceeding a threshold are collected as
Fig. 4.12 A hash table for GATCCATCTT under a weight three model 11*1.
high-scoring segment pairs (HSPs). As for the gap extension of HSPs, a red-black
tree with diagonals as the key is employed to manipulate the extension process effi-
ciently.
4.6 Bibliographic Notes and Further Reading

Twenty years ago, finding similarities between two sequences was a remarkable event. However, nowadays it would be a great surprise if we could not find similarities between a newly sequenced biomolecular sequence and the GenBank database. FASTA and BLAST were the ones that boosted this historical change. In Table C.2 of Appendix C, we compile a list of homology search tools.
4.1
The first linear-time suffix tree was given by Weiner [202] in 1973. A space-
efficient linear-time construction was proposed by McCreight [135] in 1976. An on-
line linear-time construction was presented by Ukkonen [191] in 1995. A review of
these three linear-time algorithms was given by Giegerich and Kurtz [75]. Gusfield’s
book [85] gave a very detailed explanation of suffix trees and their applications.
Suffix arrays were proposed by Manber and Myers [134] in 1991.
4.2
The FASTA program was proposed by Pearson and Lipman [161] in 1988. It im-
proved the sensitivity of the FASTP program [128] by joining several initial regions
if their scores pass a certain threshold.
4.3
The original BLAST paper by Altschul et al. [7] was the most cited paper pub-
lished in the 1990s. The first version of BLAST generates ungapped alignments,
whereas BLAST 2.0 [8] considers gapped alignments as well as position-specific
scoring schemes. The idea of seeking multiple hits on the same diagonal was first
proposed by Wilbur and Lipman in 1983 [204]. The book of Korf, Yandell, and
Bedell [116] gave a full account of the BLAST tool. It gave a clear understanding of BLAST programs and demonstrated how to use BLAST to full advantage.
4.4
4.5
Several variants of PatternHunter [131] are also available to the public. For in-
stance, PatternHunter II [123] improved the sensitivity even further by using multi-
ple spaced seeds, and tPatternHunter [112] was designed for doing protein-protein,
translated protein-DNA, and translated DNA-DNA homology searches. A new ver-
sion of BLASTZ also adapted the idea of spaced models [177] by allowing one
transition (A-G, G-A, C-T or T-C) in any position of a seed.
Some related but different spaced approaches have been considered by others.
Califano and Rigoutsos [37] introduced a new index generating mechanism where
k-tuples are formed by concatenating non-contiguous subsequences that are spread
over a large portion of the sequence of interest. The first stage of the multiple fil-
tration approach proposed by Pevzner and Waterman [167] uses a new technique to
preselect similar m-tuples that allow a small number of mismatches. Buhler [33] uses
locality-sensitive hashing [96] to find K-mers that differ by no more than a certain
number of substitutions.
Chapter 5
Multiple Sequence Alignment
S1 : TTATTTCACC-----CTTATATCA
S2 : TCCTTTCA--------TGATATCA
S3 : T--TTTCACCGACATCAGATAAAA

Fig. 5.1 A multiple alignment of three sequences S1, S2, and S3.
When all sequence entries are required to appear in the alignment, we use the term global (as opposed to local). Each column of an alignment is called an aligned pair. In general, we require that an alignment does not contain two spaces in a column, which we call a null column. In contexts where null columns are permitted, the term quasi-alignment is used to emphasize that the ban on null columns has been temporarily lifted.
Assume that we are given S1 , S2 , . . . , Sm , each of which is a sequence of “letters.”
A multiple alignment of these sequences is an m × n array of letters and dashes,
such that no column consists entirely of dashes, and removing dashes from row
i leaves the sequence Si for 1 ≤ i ≤ m. For each pair of sequences, say Si and S j ,
rows i and j of the m-way alignment constitute a pairwise quasi-alignment of Si and
S j ; removing any null columns produces a pairwise alignment of these sequences.
Figure 5.1 gives a multiple alignment of the three sequences shown above.
5.2 Scoring Multiple Sequence Alignment

For any two given sequences, there are numerous alignments of those sequences.
To make explicit the criteria for preferring one alignment over another, we define
a score for each alignment. The higher the score is, the better the alignment is. Let
us review the scoring scheme given in Section 1.3. First, we assign a score denoted σ(x, y) to each aligned pair (x, y). In the case that x or y is a space, σ(x, y) = −β.
Score function σ depends only on the contents of the two locations, not their posi-
tions within the sequences. Thus, σ (x, y) does not depend on where the particular
symbols occur. However, it should be noted that there are situations where position-
dependent scores are quite appropriate. Similar remarks hold for the gap penalties
defined below.
The other ingredient for scoring pairwise alignments is a constant gap-opening
penalty, denoted α , that is assessed for each gap in the alignment; a gap is defined
as a run of spaces in a row of the alignment that is terminated by either a non-space
symbol or an end of the row. Gap penalties are charged so that a single gap of length,
say, k will be preferred to several gaps of total length k, which is desirable since a
gap can be created in a single evolutionary event. Occasionally, a different scoring
criterion will be applied to end-gaps, i.e., gaps that are terminated by an end of the
row. The score of an alignment is defined as the sum of σ values for all aligned
pairs, minus α times the number of gaps.
A multiple alignment Π is commonly scored by the sum of the scores of its induced pairwise alignments, i.e., the sum-of-pairs (SP) score:

    SP-score(Π) = Σ_{1≤i<j≤m} score(Πi,j),

where Πi,j is the pairwise alignment of Si and Sj induced by Π (see Figure 5.2).
Π1,2
S1 : TTATTTCACCCTTATATCA
S2 : TCCTTTCA---TGATATCA
Π1,3
S1 : TTATTTCACC-----CTTATATCA
S3 : T--TTTCACCGACATCAGATAAAA
Π2,3
S2 : TCCTTTCA--------TGATATCA
S3 : T--TTTCACCGACATCAGATAAAA
Fig. 5.2 Three pairwise alignments induced by the multiple alignment in Figure 5.1.
S1 : TTATTTCACC-----CTTATATCA
S2 : TCCTTTCA--------TGATATCA
Fig. 5.3 A quasi-alignment of Π (in Figure 5.1) projected on S1 and S2 without discarding null
columns.
For many applications of multiple alignment, more efficient heuristic methods are
often required. Among them, most methods adopt the approach of “progressive”
pairwise alignments proposed by Feng and Doolittle [69]. It iteratively merges the
most similar pairs of sequences/alignments following the principle “once a gap,
always a gap.” Thus, later steps of the process align two “sequences,” one or both of
which can themselves be an alignment, i.e., sequence of fixed-height columns.
In aligning two pairwise alignments, the columns of each given pairwise align-
ment are treated as “symbols,” and these sequences of symbols are aligned by
padding each sequence with appropriate-sized columns containing only dash sym-
bols. It is quite helpful to recast the problem of aligning two alignments as an equiv-
alent problem of finding a maximum-scoring path in an alignment graph. For ex-
ample, the path depicted in Figure 5.4 corresponds to a 4-way alignment. This al-
ternative formulation allows the problem to be visualized in a way that permits the
use of geometric intuition. We find this visual imagery critical for keeping track
of the low-level details that arise in development and implementation of alignment
algorithms.
Each step of the progressive alignment procedure produces an alignment that is
highest-scoring relative to the chosen scoring scheme subject to the constraint that
columns of the two smaller alignments being combined are treated as indivisible
“symbols.” Thus, the relationships between entries of two of the original sequences
are fixed at the first step that aligns those sequences or alignments containing those
sequences.
For that reason, it is wise to first compute the pairwise alignments that warrant
the most confidence, then combine those into multiple alignments. Though each
step is performed optimally, there is no guarantee that the resulting multiple align-
ment is highest-scoring over all possible ways of aligning the given sequences. An
appropriate order for progressive alignment is very critical for the success of a mul-
tiple alignment program. This order can either be determined by the guide tree con-
structed from the distance matrix of all pairs of sequences, or can be inferred di-
rectly from an evolutionary tree for those sequences. In any case, the progressive
alignment algorithm invokes the “generalized” pairwise alignment m − 1 times for
constructing an m-way alignment, and its time complexity is roughly the order of
the time for computing all O(m^2) pairwise alignments.
In spite of the plethora of existing ideas and methods for multiple sequence alignment, it remains an important and exciting line of investigation in computational biology.
5.5 Bibliographic Notes and Further Reading

5.1
5.2
A recent survey by Notredame [154] divides the scoring schemes into two categories. Matrix-based methods, such as ClustalW [120, 189], Kalign [121], and
MUSCLE [62], use a substitution matrix to score matches. On the other hand,
consistency-based methods, such as T-Coffee [155], MUMMALS [163], and Prob-
Cons [59], compile a collection of pairwise alignments and produce a position-
specific substitution matrix to judge the consistency of a given aligned pair. An
alternative is to score a multiple alignment by using a consensus sequence that is
derived from the consensus of each column of the alignment.
Besides these two categories, there are some other scoring schemes. For example,
DIALIGN [143] focuses on scoring complete segments of sequences.
5.3
5.4
algorithm and maximizes the sum of pairwise scores with affine quasi-gap penal-
ties. To increase efficiency, a step of the progressive alignment algorithm can be
constrained to the portion of the dynamic-programming grid lying between two
boundary lines. Another option is to consider constrained alignments consisting of
aligned pairs in nearly optimal alignments. A set of “patterns” is specified (perhaps
the consensus sequences for transcription factor binding sites); YAMA selects, from
among all alignments with the highest score, an alignment with the largest number
of conserved blocks that match a pattern.
MUSCLE [62, 63] is a very efficient and accurate multiple alignment tool. It
uses a strategy similar to PRRP [80] and MAFFT [106]. It works in three stages.
Stage 1 is similar to ClustalW. A guide tree is built based on the distance matrix,
which is derived from pairwise similarities. A draft progressive alignment is built
according the constructed guide tree. In Stage 2, MUSCLE computes a Kimura dis-
tance matrix [111] using the pairwise identity percentage induced from the multiple
alignment. It then builds a new guide tree based on such a matrix, compares it with
the previous tree, and re-aligns the affected subtrees. This stage may be iterated if
desired. Stage 3 iteratively refines the multiple alignment by re-aligning the pro-
files of two disjoint subsets of sequences derived from deleting an edge from the
tree constructed in Stage 2. This refinement is a variant of the method by Hirosawa
et al. [90]. If the new alignment has a better score, save it. The refinement process
terminates if all edges incur no changes or a user-specified maximum number of
iterations has been reached.
Chapter 6
Anatomy of Spaced Seeds
BLAST was developed by Lipman and his collaborators to meet demanding needs
of homology search in the late 1980s. It is based on the filtration technique and
is multiple times faster than the Smith-Waterman algorithm. It first identifies short
exact matches (called seed matches) of a fixed length (usually 11 bases) and then
extends each match to both sides until a drop-off score is reached. Motivated by the
success of BLAST, several other seeding strategies were proposed at about the same
time in the early 2000s. In particular, PatternHunter demonstrates that an optimized
spaced seed improves sensitivity substantially. Accordingly, elucidating the mecha-
nism that confers power to spaced seeds and identifying good spaced seeds are two
new issues of homology search.
This chapter is divided into six sections. In Section 6.1, we define spaced seeds
and discuss the trade-off between sensitivity and specificity for homology search.
The sensitivity and specificity of a seeding-based program are largely related to
the probability that a seed match is expected to occur by chance, called the hit prob-
ability. Here we study analytically spaced seeds in the Bernoulli sequence model
defined in Section 1.6. Section 6.2 gives a recurrence relation system for calculating
hit probability.
In Section 6.3, we investigate the expected distance µ between adjacent non-
overlapping seed matches. By estimating µ , we further discuss why spaced seeds
are often more sensitive than the consecutive seed used in BLAST.
A spaced seed has a larger span than the consecutive seed of the same weight.
As a result, it has less hit probability in a small region but surpass the consecu-
tive seed for large regions. Section 6.4 studies the hit probability of spaced seeds
in asymptotic limit. Section 6.5 describes different methods for identifying good
spaced seeds.
Section 6.6 introduces briefly three generalizations of spaced seeds: transition
seeds, multiple spaced seeds, and vector seeds.
Finally, we conclude the chapter with the bibliographic notes in Section 6.7.
91
92 6 Anatomy of Spaced Seeds
The pattern used in filtration strategy is usually specified by one or more strings over
the alphabet {1, ∗}. Each such string is a spaced seed in which 1s denote matching
positions. For instance, if the seed P = 11 ∗ 1 ∗ ∗11 is used, then the segments ex-
amined for match in the first stage span 8 positions and the program only checks
whether they match in the 1st, 2nd, 4th, 7th, and 8th positions or not. If the seg-
ments match in these positions, they form a perfect match. As we have seen in this
example, the positions specified by ∗s are irrelevant and sometimes called don’t care
positions.
The most natural seeds are those in which all the positions are matching posi-
tions. These seeds are called the consecutive seeds. They are used in BLASTN.
WABA, another homology search program, uses spaced seed 11 ∗ 11 ∗ 11 ∗ 11 ∗ 11
to align gene-coding regions. The rationale behind this seed is that mutation in the
third position of a codon usually does not affect the function of the encoded amino
acid and hence substitution in the third base of a codon is irrelevant. PatternHunter
uses a rather unusual seed 111 ∗ 1 ∗ ∗1 ∗ 1 ∗ ∗11 ∗ 111 as its default seed. As we shall
see later, this spaced seed is much more sensitive than the BLASTN’s default seed
although they have the same number of matching positions.
We call πi the first hit probability. Let Πn := Πn (p) denote the probability that π
hits R[0, n − 1] and Π̄n := 1 − Πn . We call Πn the hit probability and Π̄n the non-hit
probability of π .
For each 0 ≤ n < |π | − 1, trivially, An = 0/ and Πn = 0. Because the event
∪0≤i≤n−1 Ai is the disjoint union of the events
∪0≤i≤n−2 Ai
and
∪0≤i≤n−2 Ai An−1 = Ā0 Ā1 · · · Ān−2 An−1 ,
Πn = π1 + π2 + · · · + πn ,
or equivalently,
for n ≥ |π |.
Example 6.1. Let θ be the consecutive seed of weight w. If θ hits random sequence
R at position n − 1, but not at position n − 2, then R[n − w, n − 1] = 11 · · · 1 and
R[n − w − 1] = 0. As a result, θ cannot hit R at positions n − 3, n − 4, . . . , n − w − 1.
This implies
By formula (6.2), its non-hit probability satisfies the following recurrence relation
Example 6.2. Let π be a spaced seed and k > 1. By inserting (k − 1) ∗’s between
every two consecutive positions in π , we obtain a spaced seed π of the same weight
and length |π | = k|π | − k + 1. It is not hard to see that π hits the random sequence
R[0, n − 1] = s[0]s[1] · · · s[n − 1] if and only if π hits one of the following k random
sequences:
Hence, formula (6.4) implies that Π̄n > Π̄n or equivalently Πn < Πn for any n ≥ |π |.
We have shown that the non-hit probability of a consecutive seed θ satisfies equa-
tion (6.3). Given a consecutive seed θ and n > |θ |, it takes linear-time to compute
the hit probability Θn . However, calculating the hit probability for an arbitrary seed
is rather complicated. In this section, we generalize the recurrence relation (6.3) to
a recurrence system in the general case.
For a spaced seed π , we set m = 2|π |−wπ . Let Wπ be the set of all m distinct strings
obtained from π by filling 0 or 1 in the ∗’s positions. For example, for π = 1 ∗ 11 ∗ 1,
The seed π hits the random sequence R at position n − 1 if and only if a unique
( j)
W j ∈ Wπ occurs at the position. For each j, let Bn denote the event that W j occurs
at the position n − 1. Because An denotes the event that π hits the sequences R at
( j) ( j)
position n − 1, we have that An = ∪1≤ j≤m Bn and Bn ’s are disjoint. Setting
( j) ( j)
πn = Pr Ā0 Ā1 · · · Ān−2 Bn−1 , j = 1, 2, · · · , m.
We have
∑
( j)
πn = πn
1≤ j≤m
Recall that, for any W j ∈ Wπ and a, b such that 0 ≤ a < b ≤ |π | − 1, W j [a, b] denotes
the substring of W j from position a to position b inclusively. For a string s, we use
96 6 Anatomy of Spaced Seeds
Pr[s] to denote the probability that s occurs at a position k ≥ |s|. For any i, j, and k
such that 1 ≤ i, j ≤ m, 1 ≤ k ≤ |π |, we define
⎧
⎨ Pr[W j [k, |π | − 1]] if Wi [|π | − k, |π − 1] = W j [0, k − 1];
(i j)
pk = 1 k = |π | & i = j;
⎩
0 otherwise.
(i j)
It is easy to see that pk is the conditional probability that W j hits at the position
n + k given that Wi hits at position n for k < |π | and n.
Theorem 6.1. Let p j = Pr[W j ] for W j ∈ Wπ (1 ≤ j ≤ m). Then, for any n ≥ |π |,
m |π |
p j Π̄n = ∑ ∑ πn+k pk
(i) (i j)
, j = 1, 2, . . . , m. (6.6)
i=1 k=1
p j Π̄n
( j)
= Pr Ā0 Ā1 · · · Ān−1 Bn+|π |−1
|π |−1
∑
( j) ( j)
= Pr Ā0 Ā1 · · · Ān+k−2 An+k−1 Bn+|π |−1 + Pr Ā0 Ā1 · · · Ān+|π |−2 Bn+|π |−1
k=1
|π |−1 m
∑ ∑ Pr
(i) ( j) ( j)
= Ā0 Ā1 · · · Ān+k−2 Bn+k−1 Bn+|π |−1 + πn+|π |
k=1 i=1
|π |−1 m
∑ ∑ Pr
(i) ( j) (i) ( j)
= Ā0 Ā1 · · · Ān+k−2 Bn+k−1 Pr Bn+|π |−1 |Bn+k−1 + πn+|π |
k=1 i=1
|π |−1 m
∑ ∑ πn+k pk
(i) (i j) ( j)
= + πn+|π |
k=1 i=1
m |π |
∑ ∑ πn+k pk
(i) (i j)
= .
i=1 k=1
= p|π |−k−1 q, k = 1, 2, . . . , b,
(11)
pk
(11)
pk = 0, k = b + 1, b + 2, . . . , |π | − 1,
(11)
p|π | = 1,
6.2 Basic Formulas on Hit Probability 97
= p|π |−k−1 q, k = 1, 2, . . . , a,
(21)
pk
(21)
pk = 0, k = a + 1, a + 2, . . . , |π |,
= p|π |−k , k = 1, 2, . . . , b,
(12)
pk
(12)
pk = 0, k = b + 1, b + 2, . . . , |π |,
= p|π |−k , k = 1, 2, . . . , |π |.
(22)
pk
⎪
⎩ |π |
p|π | Π̄n = ∑bk=1 πn+k p|π |−k + ∑k=1 πn+k p|π |−k .
(1) (2)
The recurrence system consisting of (6.5) and (6.6) can be used to calculate the hit
probability for arbitrary spaced seeds. Because there are as many as 2|π |−wπ recur-
rence relations in the system, such a computing method is not very efficient. One
may also use a dynamic programming approach for calculating the hit probability.
The rationale behind this approach is that the sequences hit by a spaced seed form
a regular language and can be represented by a finite automata (or equivalently, a
finite directed graph).
Let π be a spaced seed and n > |π |. For any suffix b of a string in Wπ (defined
in the last subsection) and |π | ≤ i ≤ n, we use P(i, b) to denote the conditional
probability that π hits R[0, i − 1] given that R[0, i − 1] ends with string b, that is,
P(i, b) = Pr ∪|π |−1≤ j≤i Ai | R[i − |b|, i − 1] = b .
Clearly, Π (n) = P(n, ε ) where ε is the empty string. Furthermore, P(i, b) can be
recursively computed as follows:
In this section, we establish two inequalities on hit probability. As we shall see in the
following sections, these two inequalities are very useful for comparison of spaced
seeds in asymptotic limit.
Theorem 6.2. Let π be a spaced seed and n > |π |. Then, for any 2|π | − 1 ≤ k ≤ n,
(i) πk Π̄n−k+|π |−1 ≤ πn ≤ πk Π̄n−k .
(ii) Π̄k Π̄n−k+|π |−1 ≤ Π̄n < Π̄k Π̄n−k .
Proof. (i). Recall that Ai denotes the event that seed π hits the random sequence R
at position i − 1 and Āi the complement of Ai . Set Āi, j = Āi Āi+1 . . . Ā j . By symmetry,
πn = Pr A|π | Ā|π |+1,n . (6.7)
The second inequality of fact (i) follows directly from that the event A|π | Ā|π |+1,n is
a subevent of A|π | Ā|π |+1,k Āk+|π |,n for any |π | + 1 < k < n. The first inequality in fact
(i) is proved as follows.
Let k be an integer in the range from 2|π | − 1 to n. For any 1 ≤ i ≤ |π | − 1, let
Si be the set of all length-i binary strings. For any w ∈ Si , we use Ew to denote
the event that R[k − |π | + 2, k − |π | + i + 1] = w in the random sequence R. With
ε being the empty string, Eε is the whole sample space. Obviously, it follows that
Eε = E0 ∪ E1 . In general, for any string w of length less than |π | − 1, Ew = Ew0 ∪ Ew1
and Ew0 and Ew1 are disjoint. By conditioning to A|π | Ā|π |+1,n in formula (6.7), we
have
πn = ∑ Pr [Ew ] Pr A|π | Ā|π |+1,k Āk+1,n |Ew
w∈S|π |−1
= ∑ Pr [Ew ] Pr A|π | Ā|π |+1,k |Ew Pr Āk+1,n |Ew
w∈S|π |−1
where the last equality follows from the facts: (a) conditioned on Ew , with w ∈
S|π |−1 , the event A|π | Ā|π |+1,k is independent of the positions beyond position k, and
(b) Āk+1,n is independent of the first k − |π | + 1 positions. Note that
πk = Pr A|π | Ā|π |+1,k = Pr A|π | Ā|π |+1,k |Eε
and
Π̄n−k+|π |−1 = Pr Āk+1,n |Eε .
Thus, we only need to prove that
and
Pr Āk+1,n |Ew = p Pr Āk+1,n |Ew1 + q Pr Āk+1,n |Ew0 .
and
Pr[Āk+1,n |Ew1 ] ≤ Pr[Āk+1,n |Ew0 ].
By applying Chebyshev’s inequality (see the book [87] of Hardy, Littlewood, and
Pólya, page 83), we obtain that
Pr A|π | Ā|π |+1,k |Ew Pr Āk+1,n |Ew
≤ p Pr A|π | Ā|π |+1,k |Ew1 Pr Āk+1,n |Ew1 + q Pr A|π | Ā|π |+1,k |Ew0 Pr Āk+1,n |Ew0 .
from Pr[Ew1 ] = Pr[Ew ]p and Pr[Ew0 ] = Pr[Ew ]q. Therefore, inequality (6.8) follows
from the fact that S j = {w0, w1 | w ∈ S j−1 }. Hence the first inequality in (i) is
proved.
(ii). Because ∑ j≥n+1 π j = Π̄n , the first inequality in (b) follows immediately as
follows
The second inequality in (ii) is similar to its counterpart in (ii) and is obvious.
The following two questions are often asked in the analysis of word patterns: “What
is the mean number of times that a pattern hits a random sequence of length N?” and
“What is the mean distance between one hit of the pattern and the next?” We call
these two questions “number of hits” and “distance between hits.” We first study
100 6 Anatomy of Spaced Seeds
the distance between hits problem. In this section, we use µπ to denote the expected
distance between the non-overlapping hits of a spaced seed π in a random sequence.
To find the average distance µπ between non-overlapping hits for a spaced seed π ,
we define the generating functions
∞
U(x) = ∑ Π̄n xn ,
n=0
∞
∑ πn
(i) n
Fi (x) = x , i ≤ m.
n=0
By formula (6.2),
µπ = ∑ jπ j = |π | + ∑ Π̄ j = U(1) (6.9)
j≥|π | j≥|π |
where π j is the first hit probability. Both U(x) and Fi (x) converge when x ∈ [0, 1].
Multiplying formula (6.5) by xn−1 and summing on n, we obtain
|π |
where Ci j (x) = ∑k=1 pk x|π |−k . Solving the above linear functional equation sys-
(i j)
Aπ = [Ci j (1)]m×m ,
and
0 I
Mπ = .
P Aπ
Then,
µπ = det(Aπ )/ det(Mπ )
where det() is the determinant of a matrix.
|π |
C11 (x) = ∑ p|π |−k x|π |−k ,
k=1
w−1
Aθ = [ ∑ pi ],
i=0
and
0 1
Mθ = w−1 i .
−pw ∑i=0 p
By Theorem 6.3,
w
µθ = ∑ (1/p)i .
i=1
Aπ = ⎣ ⎦.
b−1 a+1+i a+b i
∑i=0 p ∑i=0 p
Therefore,
∑a+b i b b−1 a+i+ j
i=0 p + ∑i=0 ∑ j=0 p q
µπ = .
p (1 + p(1 − p ))
a+b b
and let
oπ ( j) = |RP(π ) ∩ (RP(π ) + j)|.
102 6 Anatomy of Spaced Seeds
Then, oπ ( j) is the number of 1’s that coincide between the seed and the jth shifted
version of it. Trivially, oπ (0) = wπ and oπ (|π | − 1) = 1 hold for any spaced seed π .
Proof. Noticed that the equality holds for the consecutive seeds. Recall that A j
denotes the event that the seed π hits the random sequence at position j and Ā j the
complement of A j . Let mπ ( j) = wπ − oπ ( j). We have
Pr An−1 |An− j−1 = pmπ ( j)
for any n and j ≤ |π |. Because An−1 is negatively correlated with the joint event
Ā0 Ā1 · · · Ān− j−2 ,
Pr Ā0 Ā1 · · · Ān− j−2 |An− j−1 An−1 ≤ Pr Ā0 Ā1 · · · Ān− j−2 |An− j−1
Π̄n−|π | pwπ
= Pr[Ā0 Ā1 · · · Ān−|π |−1 An−1 ]
|π |−1
= Pr[Ā0 Ā1 · · · Ān−2 An−1 ] + ∑i=1 Pr[Ā0 Ā1 · · · Ān−|π |+i−2 An−|π |+i−1 An−1 ]
|π |−1 mπ (i)
≤ πn + ∑i=1 p πn−|π |+i
∞ |π |−1 |π |−1
µπ pwπ = ∑ Π̄n pwπ ≤ 1 + ∑ pmπ (i) = ∑ pmπ (i)
n=0 i=1 i=0
or
|π |−1 |π |−1
µπ ≤ ∑ pmπ (i)−wπ = ∑ (1/p)oπ (i) .
i=0 i=0
6.3 Distance between Non-Overlapping Hits 103
Table 6.1 The values of the upper bound in Theorem 6.6 for different w and p after rounding to
the nearest integer).
@ w 10 11 12 13 14
p @
Using the above theorem, the following explicit upper bound on µπ can be proved
for non-uniformly spaced seeds π . Its proof is quite involved and so is omitted.
Recall that, for the consecutive seed θ of weight w, µθ = ∑wi=1 ( 1p )i . By Theorem 6.5,
we have
Theorem 6.6. Let π be a non-uniformly spaced seed and θ the consecutive seed of
the same weight. If |π | < wπ + qp [( 1p )wπ −2 − 1], then, µπ < µθ .
Non-overlapping hit of a spaced seed π is a recurrent event with the follow-
ing convention: If a hit at position i is selected as a non-overlapping hit, then the
next non-overlapping hit is the first hit at or after position i + |π |. By (B.45) in
Section B.8, the expected number of the non-overlapping hits of a spaced seed π
in a random sequence of length N is approximately µNπ . Therefore, if |π | < wπ +
q 1 wπ −2
p [( p ) − 1] (see Table 6.1 for the values of this bound for p = 0.6, 0.7, 0.8, 0.9
and 10 ≤ w ≤ 14), Theorem 6.6 implies that π has on average more non-overlapping
hits than θ in a long homologous region with sequence similarity p in the Bernoulli
sequence model. Because overlapping hits can only be extended into one local align-
ment, the above fact indicates that a homology search program with a good spaced
seed is usually more sensitive than with the consecutive seed (of the same weight)
especially for genome-genome comparison.
104 6 Anatomy of Spaced Seeds
Because of its larger span, in terms of hit probability Πn , a spaced seed π lags
behind the consecutive seed of the same weight at first, but overtakes it when n is
relatively big. Therefore, to compare spaced seeds, we should analyze hit probability
in asymptotic limit. In this section, we shall show that Πn is approximately equal to
1 − απ λπn for some α and λ independent of n. Moreover, we also establish a close
connection between λπ and µπ .
If a consecutive seed θ does not hit the random sequence R of length n, there must
be 0s in the first wθ positions of R. Hence, by conditioning, we have that Θ̄n satisfies
the following recurrence relation
Let
g(x) = f (x)(p − x) = xwθ (1 − x) − pwθ q.
Then,
g (x) = xwθ −1 [wθ − (wθ + 1)x].
Hence, g(x) and g (x) have no common factors except for x − p when p = ww+1 θ
.
θ
This implies that f (x) has wθ − 1 distinct roots for any 0 < p < 1.
Because g(x) increases in the interval (0, ww+1 θ
), decreases in the interval
θ
wθ
( w +1 , ∞), and
θ
g(0) = g(1) < 0,
g(x) has exactly two positive real roots: p and some r0 . If p > ww+1 θ
, then, r0 ∈
θ
wθ wθ
(0, w +1 ); otherwise, r0 ∈ ( w +1 , ∞) (see Figure 6.1). Note that r0 is the unique
θ θ
positive real root of f (x). That f (r0 ) = 0 implies that
q p p
[1 + + · · · + ( )wθ −1 ] = 1.
r0 r0 r0
g(x)
x
0.875
Fig. 6.1 The graph of g(x) = xw (1 − x) − pw q when w=7 and p=0.7. It increases in (0, w+1
w
) and
w
decreases in ( w+1 , 1).
where the equality sign is possible only if all terms on the left have the same argu-
ment, that is, if r = r0 . Hence, r0 is larger in absolute value than any other root of
f (x).
Let f (x) has the following distinct roots
r0 , r1 , r2 , · · · , rwθ −1 ,
θi = Θ̄i−1 − Θ̄i = pw q
for any i = wθ + 1, . . . , 2wθ , we obtain the following linear equation system with
ai ’s as variables
⎧ wθ w w
⎪ a0 (1 − r0 )r0
⎪
⎪
+ a1 (1 − r1 )r1 θ + · · · + awθ −1 (1 − rwθ −1 )rwθθ −1 = pw q
⎨ a (1 − r )rwθ +1 + a (1 − r )rwθ +1 + · · · + a wθ +1
wθ −1 (1 − rwθ −1 )rwθ −1 = p q
w
0 0 0 1 1 1
⎪
⎪ ···
⎪
⎩ a (1 − r )r2wθ −1 + a (1 − r )r2wθ −1 + · · · + a 2wθ −1
wθ −1 (1 − rwθ −1 )rwθ −1 = p q
w
0 0 0 1 1 1
w
Solving this linear equation system and using ri θ (1 − ri ) = pwθ q, we obtain
pw q f (1) (p − ri )ri
ai = wθ = , i = 1, 2, . . . , wθ − 1.
(1 − ri ) ri f (ri ) q[wθ − (wθ + 1)ri ]
2
Table 6.2 The lower and bounds of the largest eigenvalue r0 in Theorem 6.7 with p = 0.70.
Weight Lower bound The largest eigenvalue Upper bound
(w) (r0 )
2 0.5950413223 0.6321825380 0.6844816484
4 0.8675456501 0.8797553586 0.8899014400
6 0.9499988312 0.9528375570 0.9544226080
8 0.9789424218 0.9796064694 0.9798509028
10 0.9905366752 0.9906953413 0.9907330424
12 0.9955848340 0.9956231900 0.9956289742
14 0.9978953884 0.9979046968 0.9979055739
16 0.9989844497 0.9989867082 0.9989868391
18 0.9995065746 0.9995071215 0.9995071407
(p − r0 )
Θ̄n = rn+1 + ε (n), (6.12)
q[wθ − (wθ + 1)r0 ] 0
1 1
g(x) = xwθ +1 [ − 1 − pwθ q( )wθ +1 ]
x x
1 1
implies that all ri and p are the roots of the equation
6.4 Asymptotic Analysis of Hit Probability 107
y = 1 + pwθ qywθ +1 .
Let h(y) = 1 + pwθ qywθ +1 . It intersects the bisector z = y at r10 and 1p . Moreover, in
the interval [ r10 , 1p ], the graph of z = h(y) is convex, and hence the graph of h(y) lies
below the bisector. Thus, for any complex number s such that r10 < |s| < 1p , we have
Hence, s = h(s) and any complex root of y = h(y) must be larger than 1
p in absolute
value. This implies that |ri | ≤ p.
to the error term ε (n). Note that for any complex number s such that s ≤ p < r0 ,
p−s
| |
wθ − (wθ + 1)s
p + |ri | 2
|An (ri )| ≤ |ri |n+1 < pn+2
q[wθ + (wθ + 1)|ri |] q[wθ + (wθ + 1)p]
and
2(wθ − 1)
ε (n) < pn+2 .
q[wθ + (wθ + 1)p]
This error estimation indicates that the asymptotic formula (6.12) gives a close
approximation to Θ̄n and the approximation improves rapidly with n if p ≤ ww+1 θ
.
θ
Table 6.3 gives the exact and approximate values of the non-hit probability for the
consecutive seed of weight 3 and n ≤ 10. When n = 10, the approximation is already
very accurate.
For an arbitrary spaced seed π , Π̄n does not satisfy a simple recursive relation. To
generalize the theorem proved in the last subsection to spaced seeds, we first derive
a formula for the generating function U(z) = ∑∞ n=0 Π̄n z using a “transition matrix.”
n
Let
V = {v1 , v2 , . . . , v2|π |−1 }
be the set of all 2|π |−1 binary strings of length |π | − 1. Define the transition matrix
108 6 Anatomy of Spaced Seeds
Table 6.3 The accurate and estimated values of non-hit probability Θ̄n with θ = 111 and p = 0.70
for 3 ≤ n ≤ n.
on V as follows. For any vi = vi [1]vi [2] · · · vi [|π |−1] and v j = v j [1]v j [2] · · · v j [|π |−1]
in V ,
Pr[v j [|π | − 1]] if vi [2, |π | − 1] = v j [1, |π | − 2] and vi [1]v j = π ,
ti j =
0 otherwise
Let P = (Pr[v1 ], Pr[v2 ], · · · , Pr[v2|π |−1 ]) and E = (1, 1, · · · , 1)t . By conditional proba-
bility, for n ≥ |π | − 1,
Hence,
U(z)
|π |−2 ∞
n−|π |+1
= ∑ zi + ∑ PTπ Ezn
i=0 |π |−1|
|π |−2 ∞
= ∑ zi + z|π |−1| ∑ P(zTπ )n E
i=0 n=0
6.4 Asymptotic Analysis of Hit Probability 109
|π |−2
Padj(I − zTπ ) E
= ∑ zi + z|π |−1|
det(I − zTπ )
i=0
where adj(I −zTπ ) and det(I −zTπ ) are the adjoint matrix and determinant of I −zTπ
respectively.
For any vi , v j ∈ V , π does not hit vi 0π v j . This implies that the power matrix
2|π |−1
Tπ is positive. Hence, Tπ is primitive. By Perron-Frobenius theorem on primi-
tive non-negative matrices, det(xI − Tπ ) has a simple real root λ1 > 0 that is larger
than the modulus of any other roots. Let λ2 , · · · , λ2|π |−1 be the rest of the roots. Then,
det(xI − Tπ ) = (x − λ1 )(x − λ2 ) · · · (x − λ2|π |−1 ), and
|π |−2
Padj(I − zTπ ) E
U(z) = ∑ zi + z|π |−1|
(1 − zλ1 )(1 − zλ2 ) · · · (1 − λ2|π |−1 )
.
i=0
and
Π̄n+1 πn+1 1
λπ = lim = 1 − lim ≥ 1− .
n→∞ Π̄n n→∞ Π̄n µπ − |π | + 1
Similarly, by the second inequality in Theorem 6.2 (i), πn+1+ j ≤ πn+1 Π̄ j for any
j ≥ |π |. Therefore,
and
1
λπ ≤ 1 − .
µπ
110 6 Anatomy of Spaced Seeds
1 1
λπ ≤ 1 − < 1− ≤ λθ
µπ µθ − wπ + 1
Π̄n
and so limn→∞ Θ̄ = 0. Therefore, there exists a large integer N such that, for any
n
n ≥ N, Πn > Θn . In other words, the spaced seed π will eventually surpass the
consecutive seed of the same weight in hit probability if π ’s length is not too large.
This statement raises the following interesting questions:
Question 1 For any non-uniform spaced seed π and the consecutive seed θ of the
same weight, is λπ always smaller than λθ ?
Question 2 For any non-uniform spaced seed π and the consecutive seed θ of the
same weight, does there exist a large integer N such that Πn > Θn whenever n > N?
Consider a homologous region with similarity level p and length n. The sensitivity
of a spaced seed π on the region largely depends on the hit probability Πn (p). The
larger Πn (p) is, the more sensitive π . The spaced seed π = 111 ∗ 1 ∗ ∗1 ∗ 1 ∗ ∗11 ∗
111 was chosen as the default seed in PatternHunter due to the fact that it has the
maximum value of Π64 (0.70) among all the spaced seeds of weight 11.
A straightforward approach to identifying good spaced seeds is through ex-
haustive search after Πn (p) (for fixed n and p) is computed. The Hedera program
designed by Noe and Kocherov takes this approach and is implemented through
automata method. Such an approach is quite efficient for short and low-weight
seeds. It, however, becomes impractical for designing long seeds as demonstrated by
Hedera for two reasons. First, the number of spaced seeds grows rapidly when the
number of don’t care positions is large. Second, computing hit probability for an ar-
bitrary spaced seed is proved to be NP-hard. Therefore, different heuristic methods
are developed for identifying good spaced seeds.
Theorems 6.8 and 6.9 in Section 6.4 suggest that the hit probability of a spaced
seed π is closely related to the expected distance µπ between non-overlapping hits.
The smaller the µπ is, the higher sensitivity the spaced seed π has. Hence, any
simple but close approximation of µπ can be used for the purpose. For example, the
upper bound in Theorem 6.4 has been suggested to find good spaced seeds. It is also
known that, for any spaced seed π ,
6.5 Spaced Seed Selection 111
1800
1600
1400
1200
Counts
1000
800
600
400
200
0
0.30
0.31
0.32
0.33
0.34
0.35
0.36
0.37
0.38
0.39
0.40
0.41
0.42
0.43
0.44
0.45
0.46
Hitting probability
Fig. 6.2 Frequency histogram of spaced seeds of weight 11 and length between 11 and 18, where
n = 64 and p = 0.8. The abscissa interval [0.30, 0.47] is subdivided into 34 bins. Each bar rep-
resents the number of spaced seeds whose hit probability falls in the corresponding bin. This is
reproduced from the paper [168] of Preparata, Zhang, and Choi; reprinted with permission of Mary
Ann Liebert, Inc.
Π̄L Π̄L
≤ µπ ≤ + |π |,
πL πL
where L = 2|π | − 1. Therefore, Π̄πLL is another proposed indicator for good spaced
seeds.
Sampling method is also useful because the optimal spaced seed is usually un-
necessary for practical seed design. As a matter of fact, the optimal spaced seed for
one similarity level may not be optimal for another similarity level as we shall see
in the next section. Among the spaced seeds with the fixed length and weight, most
spaced seeds have nearly optimal hit probability. For instance, Figure 6.2 gives the
frequency histogram of all the spaced seeds of weight 11 and length between 11 and
18 when the similarity level set to be 70%. The largest hit probability is 0.467122
and reaches at the default PatternHunter seed. However, more than 86% of the seeds
have hit probability 0.430 or more (> 92% × 0.467122). Therefore, search with hill
climbing from a random spaced seed usually finds nearly optimal spaced seed. Man-
dala program developed by Buhler and Sun takes this approach.
Recall that good spaced seeds are selected based on Πn for a fixed length n and
similarity level p of homologous regions. Two natural questions to ask are: Is the
best spaced seed with n = 64 optimal for every n > 64 when p is fixed? and is there
a spaced seed π that has the largest hit probability Πn (p) for every 0 < p < 1 when
n is fixed?
To answer the first question, we consider the probabilistic parameters λπ and µπ
that are studied in Sections 6.3 and 6.4. By Theorems 6.8 and 6.9, λπ is a dominant
112 6 Anatomy of Spaced Seeds
factor of the hit probability Πn (p) when n is large, say, 64 or more, and µπ is closely
related to λπ . In addition, Theorem 6.3 shows that µπ depends only on the similarity
level p and the positional structure of π . Hence, we conclude that the ranking of a
spaced seed should be quite stable when the length n of homologous region changes.
Unfortunately, the second question does not have an analytical answer. It is an-
swered through an empirical study. The rankings of top spaced seeds for n = 64 and
p = 65%, 70%, 75%, 80%, 85%, 90% are given in Table 6.4. When the length and
weight of a spaced seed are large, its sensitivity could fluctuate greatly with the sim-
ilarity level p. Hence, the similarity range in which it is optimal over all the spaced
seeds of the same weight, called the optimum span of a spaced seed, is a critical fac-
tor when a spaced seed is evaluated. In general, the larger the weight and length of a
spaced seed, the narrower its optimum span. Table 6.5 gives the sensitivity Π64 for
good spaced seeds of weight from 9 to 14 at different similarity levels. The Pattern-
Hunter default seed that we mentioned before is the first seed in the row for weight
11 in Table 6.4. It is actually only optimal in the similarity interval [61%, 73%] and
hence is suitable for detection of remote homologs.
This study also indicates that the spaced seed 111 ∗ 1 ∗ 11 ∗ 1 ∗ ∗11 ∗ 111 is prob-
ably the best good spaced seed for database search purpose. First, its weight is 12,
but its hit probability is larger than the consecutive seed of weight 11 for n = 64 and
p = 0.7. Second, it has rather wide optimum span [59%, 96%]. Finally, it contains
four repeats of 11* in its 6-codon span (in the reading frame 5). Such properties are
desirable for searching DNA genomic databases in which homologous sequences
have diverse similarity and for aligning coding regions.
M = {i1 , i2 , · · · , im }
Table 6.4 Top-ranked spaced seeds for different similarity levels and weights. The sign ‘−’ means
that the corresponding seed is not among the top 10 seeds for the similarity level and weight. This
is reproduced from the paper [48] of Choi, Zeng, and Zhang; reprinted with permission of Oxford
University Press.
Rank under
W Good spaced seeds
a similarity level (%)
65 70 75 80 85 90
9 11*11*1*1***111 1 1 1 1 1 1
11*1*11***1*111 2 2 2 2 2 3
11*11**1*1**111 4 4 4 4 4 4
10 11*11***11*1*111 1 1 1 1 1 1
111**1*1**11*111 2 2 4 6 8 9
11*11**1*1*1**111 8 6 2 2 2 5
11 111*1**1*1**11*111 1 1 2 2 2 3
111**1*11**1*1*111 2 2 1 1 1 1
11*1*1*11**1**1111 6 3 3 5 5 6
12 111*1*11*1**11*111 1 1 1 1 1 1
111*1**11*1*11*111 2 2 2 5 3 2
111**1*1*1**11*1111 6 3 3 2 4 4
13 111*1*11**11**1*1111 2 1 1 2 2 2
111*1**11*1**111*111 7 2 2 1 1 1
111*11*11**1*1*1111 1 4 5 7 8 8
14 111*111**1*11**1*1111 2 1 1 1 1 1
1111*1**11**11*1*1111 5 2 2 3 3 6
1111*1*1*11**11*1111 1 3 7 - - -
15 1111**1*1*1*11**11*1111 - 5 1 1 1 1
111*111**1*11**1*11111 - 1 2 5 5 4
111*111*1*11*1**11111 1 2 - - - -
1111*11**11*1*1*11*1111
16 7 1 2 6 - -
1111**11*1*1*11**11*1111
- 7 1 1 1 3
1111*1**11*1*1**111*1111
1 9 - - - -
111*111*1**111*11*1111
T = { j1 , j2 , · · · , jt }
the set of its transition positions. Two sequences S1 and S2 exhibit a match of the
transition seed P in positions x and y if, for 1 ≤ k ≤ m, S1 [x − LQ + ik ] = S2 [y − LQ +
ik ] and, for 1 ≤ k ≤ t, S1 [x − LQ + jk ] = S2 [y − LQ + jk ], or two residues S1 [x − LQ +
jk ] and S2 [y − LQ + jk ] are both pyrimidines or purines.
The analytical studies presented in this chapter can be generalized to transition
seeds in a straightforward manner. As a result, good transition seeds can be found
using each of the approaches discussed in Section 6.5.1. Hedera and Mandala that
were mentioned earlier can be used for transition seed design.
114 6 Anatomy of Spaced Seeds
Table 6.5 The hit probability of the best spaced seeds on a region of length 64 for each weight and
a similarity level. This is reproduced from the paper [48] of Choi, Zeng, and Zhang; reprinted with
permission of Oxford University Press.
Similarity
W Best spaced seeds Sensitivity
(%)
9 11*11*1*1***111 65 0.52215
70 0.72916
80 0.97249
90 0.99991
10 11*11***11*1*111 65 0.38093
70 0.59574
80 0.93685
90 0.99957
11 111*1**1*1**11*111 65 0.26721
70 0.46712
111**1*11**1*1*111 80 0.88240
90 0.99848
12 111*1*11*1**11*111 65 0.18385
70 0.35643
80 0.81206
90 0.99583
13 111*11*11**1*1*1111 65 0.12327
111*1*11**11**1*1111 70 0.26475
111*1**11*1**111*111 80 0.73071
90 0.99063
14 1111*1*1*11**11*1111 65 0.08179
111*111**1*11**1*1111 70 0.19351
80 0.66455
90 0.98168
Empirical studies show that transition seeds have a good trade-off between sen-
sitivity and specificity for homology search in both coding and non-coding regions.
The idea of spaced seeds is equivalent to using multiple similar word patterns to
increase the sensitivity. Naturally, multiple spaced seeds are employed to optimize
the sensitivity. In this case, a set of spaced seeds are selected first; then all the hits
generated by these seeds are examined to produce local alignments.
The empirical studies by several research groups show that doubling the number
of seeds gets better sensitivity than reducing the single seed by one bit. Moreover,
for DNA homology search, the former roughly doubles the number of hits, whereas
6.7 Bibliographic Notes and Further Reading 115
the latter will increase the number of hits by a factor of four. This implies that using
multiple seeds gains not only sensitivity, but also speed.
The formulas in Sections 6.2.1, 6.3.1, and 6.4.2 can be easily generalized to the
case of multiple spaced seeds. However, because there is a huge number of combi-
nations of multiple spaced seeds, the greedy approach seems to be the only efficient
one for multiple-seed selection. Given the number k of spaced seeds to be selected,
the greedy approach finds k spaced seeds in k steps. Let Si−1 be the set of spaced
seeds selected in the first i − 1 steps, where 1 ≤ i < k and we assume S0 = φ . At
the ith iteration step, the greedy approach selects a spaced seed πi satisfying that
Si−1 ∪ {πi } has the largest hit probability.
Another way to generalize the spaced seed idea is to use a real vector s and a thresh-
old T > 0 like the seeding strategy used in BLASTP. It identifies a hit at position k
in an alignment sequence if and only if the inner product s · (ak , ak+1 , · · · , ak+|s|−1 )
is at least T , where |s| denotes the dimension of the vector and ai is the score of
the ith column for each i. The tuple (s, T ) is called a vector seed. This framework
encodes many other seeding strategies that have been proposed so far. For example,
the spaced seed 11 ∗ ∗ ∗ 1 ∗ 111 corresponds to the vector seed ((1,1,0,0,0,1,0,1,1),
5). The seeding strategy for the BLASTP, which requires three consecutive positions
with total score greater than 13, corresponds to the vector seed ((1, 1, 1), 13). Em-
pirical studies show that using multiple vector seeds is not only effective for DNA
sequence alignment, but also for protein sequence alignment.
In this section, we briefly summarize the most relevant and useful references on the
topics covered in the text.
6.1
Although the filtration technique for string matching has been known for a long
time [105], the seed idea is first used in BLAST. In comparison of DNA sequences,
BLAST first identifies short exact matches of a fixed length (usually 11 bases) and
then extends each seed match to both sides until a drop-off score is reached. By ob-
serving that the sensitivities of seed models (of the same weight) vary significantly,
Ma, Tromp, and Li proposed to use an optimized spaced seed for achieving the high-
est sensitivity in PatternHunter [131]. Some other seeding strategies had also been
developed at about the same time. WABA developed by Kent employs a simple pat-
tern equivalent to the spaced seed 11*11*11*11*11 to align homologous coding
116 6 Anatomy of Spaced Seeds
regions [110]. BLAT uses a consecutive seed but allows one or two mismatches
to occur in any positions of the seed [109]. Random projection idea, originated in
the work of Indyk and Motwani [96], was proposed for improving sensitivity in the
work of Buhler [33].
6.2
The mathematical study of spaced seeds is rooted in the classical renewal theory
and run statistics. The analysis of the hit probability of the consecutive seeds is
even found in the textbook by Feller [68]. The hit probability of multiple string
patterns was studied in the papers of Schwager [176] and Guibas and Odlyzko [83].
The interested readers are referred to the books of Balakrishnan and Koutras [22],
Deonier, Tavaré, and Waterman [58], and Lothaire [129] for the probability and
statistical theory of string pattern matching. Most material presented in this text is
not covered in the books that were just mentioned.
The recurrence relation (6.3) in the consecutive seed case had been independently
discovered by different researchers (see, for example, [22]). A recurrence relation
system for the hit probability of multiple string patterns is presented in [176]. The
recurrence system (6.5) and (6.6) for the spaced seeds is given by Choi and Zhang
[47]. The dynamic programming algorithm in Section 6.2.2 is due to Keich, Li, Ma,
and Tromp [108]. Computing the hit probability was proved to be NP-hard by Ma
and Li [130] (see also [125]). Theorem 6.2 is found in the paper of Choi and Zhang
[47].
6.3
6.4
The close approximation formula (6.12) and Lemma 6.1 can be found in the book
of Feller [68]. The bounds in Theorems 6.7 and 6.9 are due to Zhang [214]. Theo-
rem 6.8 is due to Buhler, Keich, and Sun [34] (see also [152] and Chapter 7 in [129]
for proofs).
6.5
The idea of using a close approximation of µ for identifying good spaced seeds
Π
appears in several papers. The close formula π 2|π |−1 is used by Choi and Zhang [47].
2|π |−1
The upper bound established in Theorem 6.4 is used by Yang et al. [207] and Kong
[115]. The random sampling approach is proposed by Buhler, Keich, and Sun [34].
6.7 Bibliographic Notes and Further Reading 117
Sampling approach is also studied in the paper of Preparata, Zhang, and Choi [168].
The spaced seeds reported in Section 6.5.2 are from the paper of Choi, Zeng, and
Zhang [48].
6.6
To our best knowledge, the idea of multiple spaced seeds is first used in Pattern-
Hunter [131] and is further studied in the papers of Li et al. [123], Brejovà, Brown,
and Vinar̆ [29], Buhler, Keich, and Sun [34], Csürös and Ma [53], Ilie and Ilie [95],
Sun and Buhler [185], and Xu et al. [206].
Transition seeds are studied in the papers of Kucherov, Noè, and Roytberg [118],
Noè and Kucherov [153], Schwartz et al. [177], Sun and Buhler [186], Yang and
Zhang [208], and Zhou and Florea [216].
The vector seed is proposed in the paper of Brejovà, Brown, and Vinar̆ [30] for
the purpose of protein alignment. In [31], Brown presents a seeding strategy that
has approximately the same sensitivity as BLASTP while keeping five times fewer
false positives.
Miscellaneous
119
120 7 Local Alignment Statistics
transition phenomena for optimal gapped alignment scores. Section 7.3.2 introduces
two methods for estimating the key parameters of the distribution. Section 7.3.3 lists
the empirical values of these parameters for BLOSUM and PAM matrices.
In Section 7.4, we describe how P-value and E-value (also called Expect value)
are calculated for BLAST database search.
Finally, we conclude the chapter with the bibliographic notes in Section 7.5.
7.1 Introduction
where u and λ are called the location and scale parameters of this distribution, re-
spectively. The distribution defined in (7.1) has probability function
and
7.1 Introduction 121
∞
V =λ x2 f (x)dx − µ 2
−∞
∞
= (u − ln(z)/λ )2 e−z dz − (u + γ /λ )2
0
= π λ /6,
2 2
(7.3)
Hence, the best local alignment score in this simple case has the extreme value type-
1 distribution (7.1) with
and
λ = ln(1/p).
In general, to study the distribution of optimal local ungapped alignment scores,
we need a model of random sequences. Through this chapter, we assume that the
two aligned sequence are made up of residues that are drawn independently, with
respective probabilities pi for different residues i. These probabilities (pi ) define the
background frequency distribution of the aligned sequences. The score for align-
ing residues i and j is written si j . Under the condition that the expected score for
aligning two randomly chosen residues is negative, i.e.,
the optimal local ungapped alignment scores are proved to approach an extreme
value distribution when the aligned sequences are sufficiently long. Moreover, sim-
ple formulas are available for the corresponding parameters λ and u.
122 7 Local Alignment Statistics
Fig. 7.1 The accumulative score of the ungapped alignment in (7.7). The circles denote the ladder
positions where the accumulative score is lower than any previously reached ones.
The scale parameter λ is the unique positive number satisfying the following
equation (see Theorem B.1 for its existence):
∑ pi p j eλ si j = 1. (7.5)
i, j
By (7.5), λ depends on the scoring matrix (si j ) and the background (pi ).
frequencies
It converts pairwise match scores to a probabilistic distribution pi p j eλ si j .
The location parameter u is given by
u = ln(Kmn)/λ , (7.6)
where m and n are the lengths of aligned sequences and K < 1. K is considered as
a space correcting factor because optimal local alignments cannot locate in all mn
possible sites. It is analytically given by a geometrically convergent series, depend-
ing only on the (pi ) and (si j ) (see, for example, Karlin and Altschul, 1990, [100]).
We use s j to denote the score of the aligned pair of residues at position j and con-
sider the accumulative score
Sk = s1 + s2 + · · · + sk , k = 1, 2, . . . .
Starting from the left, the accumulative score Sk is graphically represented in Fig-
ure 7.1.
7.2 Ungapped Local Alignment Scores 123
Rx = Sx − Sa ,
(i) The probability distribution of the maximum value that the corresponding ran-
dom walk ever achieves before stopping at the absorbing state -1, and
(ii) The mean number of steps before the corresponding random walk first reaches
the absorbing state -1.
When two protein sequences are aligned, scores other than the simple scores 1 and
−1 are used for match and mismatches. These scores are taken from a substitu-
tion matrix such as the BLOSUM62 matrix. Because match and mismatches score
a range of integer values, the accumulative score performs a complicated random
walk. We need to apply the advanced random walk theory to study the distribution
of local alignment scores for protein sequences in this section.
Consider a random walk that starts at 0 and whose possible step sizes are
−d, −d + 1, . . . , −1, 0, 1, . . . , c − 1, c
such that
Let Xi denote the score of the aligned pair at the ith position. Then, Xi s are iid
random variables. Let X be a random variable with the same distribution as Xi s. The
moment generating function of X is
c
E eθ X = ∑ p j e jθ .
j=−d
S0 = 0,
j
S j = ∑ Xi , j = 1, 2, . . . ,
i=1
and partition the walk into non-negative excursions between the successive descend-
ing ladder points in the path:
K0 = 0,
Ki = min k | k ≥ Ki−1 + 1, Sk < SKi−1 , i = 1, 2, . . . . (7.8)
Because the mean step size is negative, the Ki − Ki−1 are positive integer-valued
iid random variables. Define Qi to be the maximal score attained during the ith
excursion between Ki−1 and Ki , i.e.,
Qi = max Sk − SKi−1 , i = 1, 2, . . . . (7.9)
Ki−1 ≤k<Ki
The Qi s are non-negative iid random variables. In the rest of this section, we focus
on estimating Pr[Q1 ≤ x] and E(K1 ).
Define
and set t + = ∞ if Si ≤ 0 for all i. Then t + is the stopping time of the first positive
accumulative score. We define
Z + = St + , t + < ∞ (7.11)
and
and
7.2 Ungapped Local Alignment Scores 125
+ ∞
1
1 − E eθ Z ; t + < ∞ = 1 − E eθ X exp ∑ kE eθ Sk ; Sk ≤ 0 .(7.14)
k=1
Proof. The formula (7.13) is symmetric to Formula (7.11) on page 396 of the book
[67] of Feller, which is proved for strict ascending ladder points.
Because the mean step size is negative in our case, we have that
+ ∞
1
1 − E e ; t < ∞ = exp − ∑ E eθ Sk ; Sk > 0
θZ +
. (7.15)
k=1 k
by using the same argument as in the proof of a result due to Baxter (Theorem 3.1
in Spitzer (1960)). Because
k
E eθ Sk ; Sk > 0 + E eθ Sk ; Sk ≤ 0 = E eθ Sk = E eθ X
for any k and ln(1 − y) = − ∑∞i=1 i y for 0 ≤ y < 1, the equation (7.15) becomes
1 i
+
1 − E eθ Z ; t + < ∞
∞
1 θ X k ∞ 1 θ Sk
= exp − ∑ E e + ∑ E e ; Sk ≤ 0
k=1 k k=1 k
! " ∞
1
= exp ln 1 − E e θX
exp ∑ E eθ Sk ; Sk ≤ 0
k=1 k
∞
1
= 1 − E eθ X exp ∑ E eθ Sk ; Sk ≤ 0 .
k=1 k
S(y)
= S(−1) + (S(y) − S(−1))
y
= 1 − Pr[t + < ∞] + ∑ Pr[Z + = k]S(y − k) (7.20)
k=0
y
V (y) = Pr[t + < ∞] − Pr[Z + ≤ y] eλ y + ∑ eλ k Pr[Z + = k]V (y − k). (7.22)
k=0
lim V (y)
y→∞
∑∞ λy
y=0 e (Pr[t < ∞] − Pr[Z ≤ y])
+ +
= .
E Z + eλ Z ; t + < ∞
+
lim V (y)
y→∞
1 − Pr[t + < ∞]
= λ .
e − 1 E Z + eλ Z ; t + < ∞
+
By the definition of S(y) in (7.19) and the law of probabilities, expanding according
to the outcome of σ (0, y) (either Sσ (0,h) > y or Sσ (0,h) < 0 ) yields
∞
1 − S(h) = 1 − F(h) + ∑ (1 − S(h + k)) Pr[Q1 < h and Sσ (0,h) = −k].
k=1
lim eλ h (1 − F(h))
h→∞
# $
∞
= lim V (h) 1 − ∑ e −λ k
Pr[Sσ (0,∞) = −k]
h→∞ k=1
−
(1 − Pr[t + < ∞]) 1 − E eλ Z
= λ . (7.24)
e − 1 E Z + eλ Z ; t + < ∞
+
from (7.21).
Combining (7.13) and (7.16) gives
exp ∑∞ 1
k=1 k Pr[Sk = 0]
1 − Pr[t < ∞] =
+
.
E(K1 )
Recall that the derivative of a sum of functions can be calculated as the sum of the
derivatives of the individual functions. Differentiating (7.14) with respect to θ and
setting θ = λ afterwards shows
E Z + eλ Z ; t + < ∞
+
∞
1
= E Xe λX
exp ∑ E eλ Sk ; Sk ≤ 0
k=1 k
∞
1 λ Sk
= E Xeλ X exp ∑ E e ; Sk < 0 + Pr[Sk = 0]
k=1 k
λX ∞ 1
E Xe exp ∑k=1 k Pr[Sk = 0]
= − .
1 − E eλ Z
Setting
− − 2
(1 − Pr[Z + ; t + < ∞]) 1 − E eλ Z 1 − E eλ Z
C= = λX (7.25)
E Z + eλ Z ; t + < ∞
+
E Xe E(K1 )
128 7 Local Alignment Statistics
In this subsection, we continue the above discussion to derive formulas for the esti-
mation of the tail distribution of maximal segment scores in ungapped alignment.
Consider the successive excursions with associated local maxima Qk defined in
(7.9). Define
and
C
C = . (7.28)
eλ − 1
Lemma 7.2. With notations defined above,
! "
ln m
lim inf Pr M(Km ) ≤ + y ≥ exp −C eλ −λ y ,
m→∞ λ
! "
ln m
lim sup Pr M(Km ) ≤ + y ≤ exp −C e−λ y .
m→∞ λ
for any 0 < x ≤ ∞, where F(x) is the distribution function of Q1 and is defined in
(7.23). For any real number x < ∞,
1 − F(x) ≤ 1 − F(x) ≤ 1 − F(x).
Hence,
7.2 Ungapped Local Alignment Scores 129
C
lim inf (1 − F(x))eλ x ≥ lim inf (1 − F(x))eλ x = = C .
x→∞ x→∞ eλ − 1
On the other hand, because lim supx→∞ x − x = 1,
C
lim sup (1 − F(x))eλ x ≤ lim (1 − F(x))ex eλ (x−x) ≤ eλ = C eλ .
x→∞ x→∞ eλ −1
1
Pr [M(Km ) ≤ ln(m)/λ + y] = (F(ln(m)/λ + y))m = .
exp {−m ln(F(ln(m)/λ + y))}
ln z
Using limz→1 z−1 = 1, we obtain
Similarly,
A = E(K1 ), (7.30)
which is the mean distance between two successive ladder points in the walk. Then
the mean number of ladder points is approximately An when n is large (see Sec-
tion B.8). Ignoring edge effects, we derive the following asymptotic bounds from
Lemma 7.2 by setting y = y + lnλA and m = An :
%
C eλ −λ y ln n C −λ y
exp − e ≤ Pr M(n) ≤ + y ≤ exp − e . (7.31)
A λ A
Set
C
K= . (7.32)
A
Replacing y by (ln(K) + s)/λ , inequality (7.31) becomes
! "
exp −eλ −s ≤ Pr [M(n) ≤ ln(Kn) + s/λ ] ≤ exp −e−s ,
or equivalently
! "
exp −eλ −s ≤ Pr [λ M(n) − ln(Kn) ≤ s] ≤ exp −e−s . (7.33)
is called the normalized score of the alignment. Hence, the P-value corresponding
to an observed value s of the normalized score is
P-value ≈ 1 − exp −e−s . (7.34)
By Theorem 7.1 and (7.28), the probability that any maximal-scoring segment has
score s or more is approximately C e−λ s . By (7.30), to a close approximation there
are N/A maximal-scoring segments in a fixed alignment of N columns as discussed
in Section 7.2.2. Hence, the expected number of the maximal-scoring segments with
score s or more is approximately
NC −λ s
e = NKe−λ s , (7.35)
A
7.2 Ungapped Local Alignment Scores 131
When two sequences are aligned, insertions and deletions can break a long align-
ment into several parts. If this is the case, focusing on the single highest-scoring
segment could lose useful information. As an option, one may consider the scores
of the multiple highest-scoring segments.
Denote the r disjoint highest segment scores as
Karlin and Altschul (1993) showed that the limiting joint density fS (x1 , x2 , · · · , xr )
of S = (S(1) , S(2) , · · · , S(r) ) is
# $
r
fS (x1 , x2 , · · · , xr ) = exp −e−xr
− ∑ xk (7.37)
k=1
in the domain x1 ≥ x2 ≥ . . . ≥ xr .
Assessing multiple highest-scoring segments is more involved than it might first
appear. Suppose, for example, comparison X reports two highest scores 108 and 88,
whereas comparison Y reports 99 and 90. One can say that Y is not better than X,
because its high score is lower than that of X. But neither is X considered better,
because the second high score of X is lower than that of Y. The natural way to
rank all the possible results is to consider the sum of the normalized scores of the r
highest-scoring segments
as suggested by Karlin and Altschul. This sum is now called the Karlin-Altschul
sum statistic.
Integrating fn,r (x) from t to ∞ gives the tail probability that Sn,r ≥ t. This re-
sulting double integral can be easily calculated numerically. Asymptotically, the tail
132 7 Local Alignment Statistics
e−t t r−1
Pr[Sn,r ≥ t] = . (7.40)
r!(r − 1)!
especially when t > r(r − 1). This is the formula used in the BLAST program for
calculating p-value when multiple highest-scoring segments are reported.
The statistic theory of Sections 7.2.1 and 7.2.2 considered the maximal scoring seg-
ments in a fixed ungapped alignment. In practice, the objective of database search is
to find all good matches between a query sequence and the sequences in a database.
Here, we consider a general problem of calculating the statistical significance of
a local ungapped alignment between two sequences. The sequences in a highest-
scoring local ungapped alignment between two sequences is called the maximal-
scoring segment pairs (MSP).
Consider two sequences of length n1 and n2 . To find the MSPs of the sequences,
we have to consider all n1 + n2 − 1 possible ungapped alignments between the se-
quences. Each such alignment yields a random work as that studied in Section 7.2.1.
Because the n1 +n2 −1 corresponding random walks are not independent, it is much
more involved to estimate the mean value of the maximum segment score. The the-
ory developed by Dembo et al. [56, 57] for this general case is too advanced to be
discussed in this book. Here we simply state the relevant results.
To some extent, the key formulas in Sections 7.2.2 can be taken over to the gen-
eral case, with n being simply replaced by n1 n2 . Consider the sequences x1 x2 . . . xn1
and y1 y2 . . . yn2 , where xi and y j are residues. We use s(xi , y j ) to denote the score for
aligning residues xi and y j . The optimal local ungapped alignment score Smax from
the comparison of the sequences is
∆
Smax = max max ∑ s(xi+l , x j+l ).
∆ ≤min{n1 ,n2 } i≤n1 −∆ l=1
j≤n2 −∆
Suppose the sequences are random and independent: xi and y j follows the same
distribution. The random variable Smax has the following tail probability distribution:
1 −λ y
Pr[Smax > log(n1 n2 ) + y] ≈ 1 − e−Ke , (7.41)
λ
where λ is given by equation (7.5) and K is given by equation (7.32) with n replaced
by n1 n2 . The mean value of Smax is approximately
1
log(Kn1 n2 ). (7.42)
λ
7.2 Ungapped Local Alignment Scores 133
Kn1 n2 eλ s (7.45)
The above theory is developed under several conditions and only applies, for ex-
ample, when the sequences are sufficiently long and the aligned sequences grow at
similar rate. But empirical studies show that the above theory carries over essen-
tially unchanged after edge effect is made in practical cases in which the aligned
sequences are only a few hundred of base pairs long.
When the aligned sequences have finite lengths n1 and n2 , optimal local ungapped
alignment will tend not to appear at the end of a sequence. As a result, the optimal
local alignment score will be less than that predicated by theory. Therefore, edge
effects have to be taken into account by subtracting from n1 and n2 the mean length
of MSPs.
Let E(L) denote the mean length of a MSP. Then effective lengths for the se-
quence compared are
Given that the score of a MSP is denoted by s, the mean length E(L) of this MSP
is obtained from dividing s by the expected score of aligning a pair of residues:
s
E(L) = , (7.49)
∑i j qi j si j
134 7 Local Alignment Statistics
where qi j is the target frequency at which we expect to see residue i aligned with
residue j in the MSP, and si j is the score for aligning i and j. With the value of λ ,
the si j and the background frequencies qi and q j in hand, qi j can be calculated as
qi j ≈ pi p j eλ si j . (7.50)
Simulation shows that the values calculated from (7.49) are often larger than the em-
pirical mean lengths of MSPs especially when n1 and n2 are in the range from 102
to 103 (Altschul and Gish, 1996, [6]). Accordingly, the effective lengths defined by
(7.46) might lead to P-value estimates less than the correct values. The current ver-
sion of BLAST calculates empirically the mean length of MSPs in database search.
In this section, we concern with the optimal local alignment scores S (in the general
case that gaps are allowed). Although the explicit theory is unknown in this case,
a number of empirical studies strongly suggest that S also has asymptotically an
extreme value distribution (7.1) under certain conditions on the scoring matrix and
gap penalty used for alignment. As we will see, these conditions are satisfied for
most combinations of scoring matrices and gap penalty costs.
Most empirical studies focus on the statistical distribution of the scores of optimal
local alignments with affine gap costs. Each such gap cost has gap opening penalty
o and gap extension penalty e, by which a gap of length k receives a score of −(o +
k × e).
Consider two sequences X and X that are random with letters generated in-
dependently according to a probabilistic distribution. The optimal local alignment
score Smax of X and X depends on the sequence lengths m and n, and the letter
distribution, the substitution matrix, and affine gap cost (o, e). Although the exact
probabilistic distribution of Smax is unclear, Smax has either linear or logarithmic
growth as m and n go to infinite.
This phase transition phenomenon for the optimal local alignment score was
studied by Arratia and Waterman [15]. Although a rigorous treatment is far be-
yond the scope of this book, an intuitive account is quite straightforward. Consider
two sequences x1 x2 . . . xn and y1 y2 . . . yn . Let S(t) denote the score of the optimal
alignment of x1 x2 . . . xt and y1 y2 . . . yt . Then,
Table 7.1 The average amino acid frequencies reported by Robinson and Robinson in [173].
Amino acid Freqency Amino acid Freqency Amino acid Freqency Amino acid Freqency
Ala 0.078 Gln 0.043 Leu 0.090 Ser 0.071
Arg 0.051 Glu 0.063 Lys 0.057 Thr 0.058
Asn 0.045 Gly 0.074 Met 0.022 Trp 0.013
Asp 0.054 His 0.022 Phe 0.039 Tyr 0.032
Cys 0.019 Ile 0.051 Pro 0.052 Val 0.064
Because S(xt+1 xt+2 · · · xt+k , yt+1 yt+2 · · · yt+k ) equals Sk in distribution, St and hence
E(St ) satisfy the subadditive property. This implies that the following limit exists
and equals the supremum
E(St ) E(St )
lim = sup = c,
t t≥1 t
where c is a constant that depends only on the penalty parameters and the letter
distribution.
When the penalty cost is small, the expected score of each aligned pair is positive;
then the limit c is positive. In this case, the score St grows linearly in t. For example,
if the score is 1 for matches and 0 otherwise, the optimal local alignment score
is equal to the length of the longest subsequence common to the sequences. The
alignment score converges to c log n for some c ∈ (1/k, 1) as n goes to infinity if the
aligned sequences have the same length n and are generated by drawing uniformly
k letters.
When the penalty cost is large such that the expected score of each aligned pair
is negative, the optimal local alignments, having positive score, represent large de-
viation behavior. In this case, c = 0 and the probability that a local alignment has
a positive score decays exponentially fast in its length. Hence, St grows like log(t).
The region consisting of such penalty parameters is called the logarithmic region. In
this region, local alignments of positive scores are rare events. By using the Chen-
Stein method (see, for example, [44]), Poisson approximation can be established to
show that the optimal local alignment score approaches an extreme value distribu-
tion; see, for example, the papers of Karlin and Altschul [100] and Arratia, Gordan,
and Waterman [11]. Empirical studies demonstrate that affine gap costs (o, e) sat-
isfying o ≥ 9 and e ≥ 1 are in the logarithmic region if the BLOSUM62 matrix is
used.
There are no formulas for calculating the relevant parameters for the hypothetical
extreme value distribution of the optimal local alignment scores. These parameters
136 7 Local Alignment Statistics
Table 7.2 Empirical values for l, u, λ , and K. Data are from Methods in Enzymology, Vol. 26,
Altschul and Gish, Local alignment statistics, 460-680, Copyright (1996), with permission from
Elsevier [6].
Sequence length Mean alignment length
u λ K
n l
403 32.4 32.04 0.275 0.041
518 36.3 33.92 0.279 0.048
854 43.9 37.84 0.272 0.040
1408 51.6 41.71 0.268 0.036
1808 55.1 43.54 0.271 0.041
2322 59.1 45.53 0.267 0.035
2981 63.5 47.32 0.270 0.040
1
A(s) = ∑ (S(i) − s),
|Is | i∈I
(7.51)
s
where S(i) is the score of island i and |Is | the number of islands in Is . Because
the island scores are integral and have no proper common divisors in the cases of
interest, the maximum-likelihood estimate for λ is
1
λs = ln(1 + ). (7.52)
A(s)
Pr[S = x] = De−λ x ,
De−λ x
Pr[S = x|S ≥ c] ≈ = (1 − e−λ )e−λ (x−c) .
∑∞j=c De−λ j
Let xi denote the ith island scores for i = 1, 2, . . . , M. Then the logarithm of
the probability that all xi s have a value of c or greater is
The best value λML of λ is the one that maximizes this expression. By equating
the first derivation of this expression to zero, we obtain that
# $
1
λML = ln 1 + .
1
M ∑ j=1 (x j − c)
M
138 7 Local Alignment Statistics
For practical purposes, the parameters λ and K, together with the relative entropy H, were empirically determined by Altschul and Gish for popular scoring matrices. The parameters for the BLOSUM62 and PAM250 matrices are compiled in Table 7.3. The numbers for infinite gap costs were derived from theory; the others were obtained based on the average amino acid frequencies listed in Table 7.1. They may be poor for other protein sequence models.
From the data presented in Table 7.3, Altschul and Gish (1996, [6]) observed two interesting facts. First, for a given gap opening cost o, the parameters λ, u, and K remain the same when the gap extension cost e is relatively large. This implies that, with these affine gap costs, optimal local alignments that occur by chance are unlikely to contain insertions or deletions involving more than one residue. Hence, for a given o, it is unrewarding to use any gap extension cost that is close to o.
Second, the ratio of λ to λ_∞ for the ungapped case indicates what proportion of the information in ungapped local alignments is lost in the hope of extending the alignments using gaps. Low gap costs are sometimes used in the hope that local alignments containing long insertions or deletions will not be severely penalized. But this decreases the information contained in each aligned pair of residues. As an alternative, one may employ fairly high gap costs and evaluate multiple high-scoring local alignments using the Karlin-Altschul sum statistic presented in Section 7.2.4.
Table 7.3 Parameters λ, K, and H for PAM250 (left column) and BLOSUM62 (right column) in conjunction with affine gap costs (o, e) and the average amino acid frequencies listed in Table 7.1. Here the relative entropy is calculated using the natural logarithm. Data are from Methods in Enzymology, Vol. 266, Altschul and Gish, Local alignment statistics, 460-480, Copyright (1996), with permission from Elsevier [6].

o  e  λ  K  H        o  e  λ  K  H
7.4 BLAST Database Search

We now consider the most relevant case in practice: we have a query sequence and a database, and we wish to search the entire database for all sequences that are homologous to the query sequence. Significant high-scoring local alignments, together with their P-values and E-values, are reported. In this section, the calculations of P-values and E-values in BLAST are discussed.
eff-searchSP = (m − l̄(s)) (∑_{T∈D} l_T − N l̄(s)),   (7.55)

where m is the length of the query sequence, l_T the length of a sequence T in the database D, N the number of sequences in the database, and l̄(s) the length adjustment.
An admissible set consists of alignments A_{i_1}, A_{i_2}, . . . , A_{i_r} whose offset and end positions in the query sequence satisfy

end(A_{i_k}) ≤ end(A_{i_{k+1}}),
offset(A_{i_k}) ≤ offset(A_{i_{k+1}}),
end(A_{i_k}) − d_q ≤ offset(A_{i_{k+1}}) ≤ end(A_{i_k}) + d_q,

and whose offset and end positions in the database sequence satisfy the same inequalities with d_q replaced by d_s.
For an admissible set A of alignments, its sum P-value and E-value are calculated using the sum of the normalized scores λ_A S_A − ln K_A of the alignments A in the set, where S_A is the raw score of alignment A and λ_A and K_A are the statistical parameters associated with A. In most cases, these statistical parameters in (7.58) have the same values. However, they can be different when BLASTX is used.
Assume that A contains r alignments and that the query and subject sequences have lengths m and n, respectively. S(A) is further adjusted using r, m, and n; from the P-value of the adjusted score, the E-value is calculated as

Expect_A = − ln(1 − P-value) × eff-searchSP / (mn),   (7.61)
where eff-searchSP is calculated from (7.55). There is no obvious choice for the value of r. Hence, BLAST considers all possible values of r and chooses the admissible set that gives the lowest P-value. This means that a set of tests is performed. To address this multiple testing issue, the E-value in (7.61) is further adjusted by dividing it by a factor of (1 − τ)τ^{r−1}. For a BLASTN search, τ is set to 0.5. For other searches, τ is set to 0.5 for ungapped alignments and 0.1 for gapped alignments. Finally, the P-value for the alignments in A is calculated from (7.57).
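The choice of r and the multiple-testing correction can be sketched as follows; the raw E-values and τ below are hypothetical, and P = 1 − e^{−E} is used to convert between E-values and P-values.

    import math

    def adjusted_evalue(evalue, r, tau):
        # Divide by the factor (1 - tau) * tau**(r - 1) to account for
        # testing admissible sets of every possible size r.
        return evalue / ((1.0 - tau) * tau ** (r - 1))

    def pvalue(evalue):
        return 1.0 - math.exp(-evalue)

    raw = {1: 0.05, 2: 0.004, 3: 0.009}  # hypothetical E-values per r
    tau = 0.1                            # gapped protein search
    best_r = min(raw, key=lambda r: pvalue(adjusted_evalue(raw[r], r, tau)))
    print(best_r, adjusted_evalue(raw[best_r], best_r, tau))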
Finally, we must warn that the above calculations of the P-value and E-value are the ones used in the current version of BLAST (version 2.2). They differ from those used in earlier versions. For example, in earlier versions the length adjustment was calculated as the product of λ and the raw score divided by H. Accordingly, the calculations might be modified again in the future.
Table 7.4 The empirical values of α and β for PAM and BLOSUM matrices. Data are from
Altschul et al. (2001), with permission from Oxford University Press [5].
Scoring matrix BLOSUM45 BLOSUM62 BLOSUM80 PAM30 PAM70
Gap cost (14, 2) (11, 1) (10, 1) (9, 1) (10, 1)
BLASTP 2.2.18+
...
Database: Non-redundant SwissProt sequences
332,988 sequences; 124,438,792 total letters
Query= gi|731924|sp|P40582.1|GST1 YEAST Glutathione S-transferase 1 (GST-I)
Length=234
...
Alignments
...
s = (1/λ) [ln((m − l̄(s))(M − N l̄(s))) − ln(E)] ≈ (1/λ) ln((m − l̄(s))(M − N l̄(s))).

Hence, we have

l̄(s) ≈ (α/λ) ln((m − l̄(s))(M − N l̄(s))) + β.
...
Score = 36.6 bits (83), Expect = 0.11, Method: Compositional matrix adjust.
Identities = 29/86 (33%), Positives = 44/86 (51%), Gaps = 9/86 (10%)
Query 1 MSLPIIKVH-WLDHSRAFRLLWLLDHLNLEYEIVPYKR-DANFRAPPELKKIHPLGRSPL 58
M+ P +KV+ W R L L+ ++YE+VP R D + R P L + +P G+ P+
Sbjct 1 MATPAVKVYGWAISPFVSRALLALEEAGVDYELVPMSRQDGDHRRPEHLAR-NPFGKVPV 59
Query 59 LEVQDRETGKKKILAESGFIFQYVLQ 84
LE D L ES I ++VL+
Sbjct 60 LEDGDL------TLFESRAIARHVLR 79
...
Lambda K H
0.320 0.137 0.401
Gapped
Lambda K H
0.267 0.0410 0.140
Matrix: BLOSUM62
Gap Penalties: Existence: 11, Extension: 1
Number of Sequences: 332988
...
Length of query: 234
Length of database: 124438792
Length adjustment: 111
...
The printout above shows that the query sequence has length 234, the number of letters in the database is 124,438,792, the number of sequences in the database is 332,988, and the length adjustment is 111. The Maize Glutathione match listed in the printout contains gaps. Hence,

λ = 0.267, K = 0.041,

and the E-value of the match is approximately

K (234 − 111)(124,438,792 − 332,988 × 111) e^{−0.267 × 83} ≈ 0.1,

in agreement, up to rounding of the displayed parameters, with the value 0.11 in the printout. Finally, one can easily check that the length adjustment 111 is an approximate fixed point of the function in (7.62).
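The following sketch carries out this computation with the numbers taken from the printout; small discrepancies with the reported Expect value are due to rounding of the displayed parameters.

    import math

    m, M, N = 234, 124_438_792, 332_988  # query length, db letters, db sequences
    L = 111                              # length adjustment
    lam, K, S = 0.267, 0.041, 83         # gapped lambda, K, raw score

    eff_search_space = (m - L) * (M - N * L)
    E = K * eff_search_space * math.exp(-lam * S)
    print(E)  # roughly 0.1, matching Expect = 0.11 up to rounding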
Consider the alignments numbered 1 to 8 as they appear in the printout. It is easy to see that alignments 1, 2, 3, 4, 6, and 7 form the most significant admissible set. The P-value and E-value associated with these alignments are calculated using this set. It is not clear from the printout, however, which admissible set is used for calculating the P-values and E-values of alignments 5 and 8. Furthermore, because the values of some parameters needed for the calculation of the E-values are missing, we are unable to verify the E-values in the printout.
BLASTP 2.0MP-WashU [04-May-2006] [linux26-x64-I32LPF64 2006-05-10T17:22:28]
Query= Sequence
(756 letters)
...
Query: 6 GVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIVKEGGLKLIQIQDNG 65
G I +L + V+NRIAAGEV+QRP+ A+KE++EN +DA + +QV+ EGGL+++Q+ D+G
Sbjct: 2 GSIHKLTDDVINRIAAGEVVQRPSAALKELLENAIDAGCSRVQVVAAEGGLEVLQVCDDG 61
7.5 Bibliographic Notes and Further Reading

There is a large body of literature on the topics presented in this chapter. We will
not attempt to cite all of the literature here. Rather, we will just point out some
of the most relevant and useful references on this subject matter. For further infor-
mation, we refer the reader to the survey papers of Altschul et al. [4], Karlin [99],
Mitrophanov and Borodovsky [141], Pearson and Wood [162], and Vingron and
Waterman [193]. The book by Ewens and Grant [64] is another useful source for
this subject.
7.1
Several studies have demonstrated that local similarity measures with or without gaps follow an extreme value type-I distribution. In the late 1980s and early 1990s, Arratia and Waterman studied the distribution of the longest run of matching letters. Consider two random sequences of lengths n_1 and n_2 in which the probability that two letters match is p. By generalizing the result of Erdős and Rényi [170] on the longest head run in n tosses of a coin, Arratia and Waterman showed that the length of the longest run of matching letters in two sequences is approximately k log_{1/p}(n_1 n_2) as n_1 and n_2 go to infinity at similar rates, where k is a constant depending on both ln(n_1)/ln(n_2) and the letter frequencies of the sequences being compared [12, 13]. A similar result was also proved by Karlin and Ost [103]. Later, Arratia, Gordon, and Waterman further showed that an extreme value type-I distribution even holds asymptotically for the longest run of matching letters between two sequences allowing m mismatches [10]. In this case, the extreme value type-I distribution has scale parameter log(e) and a location parameter whose leading term is log(n_1 n_2), where log denotes the logarithm to base 1/p. A similar result was also proved for the longest run of matching letters with a given proportion of mismatches [14, 11]. All these results can be generalized to the case of comparing multiple random sequences. The intuitive argument for the case in which matches score 1 and mismatches and indels score −∞ first appeared in the book by Waterman [197] (see also the book [58]).
Altschul, Dembo, and Karlin studied this problem in the more general case of locally aligning sequences using a substitution matrix [100, 102, 57]. Karlin and
Dembo proved that the best segment scores follow asymptotically an extreme value
type-I distribution [102]. This result is presented in Section 7.2.1. Dembo, Karlin,
and Zeitouni further showed that optimal local alignment scores without gaps ap-
proach in the asymptotic limit an extreme value type-I distribution with parameters
given in (7.5) and (7.6) when the condition (7.4) holds and the lengths of the se-
quences being aligned grow at similar rates [57].
For the coverage of the statistics of extreme values, the reader is referred to the
book [50] written by Coles.
7.2
The theorems and their proofs in Sections 7.2.1 and 7.2.2 are from the work
of Karlin and Dembo [102]. The Karlin-Altschul sum statistic in Section 7.2.4 is
reported in [101]. The results summarized in Section 7.2.5 are found in the work
of Dembo, Karlin, and Zeitouni [57]. The edge effect correction presented in Section 7.2.6 was first used in the BLAST program [7]. The justification of the edge effect correction is given by Altschul and Gish [6]. Edge effects are even
more serious for gapped alignment as shown by Park and Spouge [157] and Spang
and Vingron [182].
7.3
7.4
The material covered in Section 7.4.1 can be found in the manuscript of Gertz [74]. The printouts given in Section 7.4.1 were prepared using the NCBI BLAST and EBI WU-BLAST web servers, respectively.
Miscellaneous

We have studied the statistical significance of local alignment scores. The distribution of global alignment scores has hardly been studied, and no theoretical results are known. Readers are referred to the papers by Reich et al. [169] and Webber and Barton [201] for information on global alignment statistics.
Chapter 8
Scoring Matrices
With the introduction of the dynamic programming algorithm for comparing protein sequences in the 1970s, a need arose for scoring amino acid substitutions. Since then, the construction of scoring matrices has become a key issue in sequence comparison. A variety of considerations, such as physicochemical properties and three-dimensional structure, have been used for deriving amino acid scoring (or substitution) matrices.
The chapter is divided into eight sections. The PAM matrices are introduced in Section 8.1. We first define the PAM evolutionary distance. We then describe Dayhoff's method of constructing the PAM matrices. Frequently used PAM matrices are listed at the end of that section.
In Section 8.2, after briefly introducing the BLOCKS database, we describe Henikoff and Henikoff's method of constructing the BLOSUM matrices. In addition, frequently used BLOSUM matrices are listed.
In Section 8.3, we show that in seeking local alignment without gaps, any amino
acid scoring matrix takes essentially a log-odds form. There is a one-to-one corre-
spondence between the so-called valid scoring matrices and the sets of target and
background frequencies. Moreover, given a valid scoring matrix, its implicit target
and background frequencies can be retrieved efficiently.
The log-odds form of the scoring matrices suggests that the quality of database search results relies on the proper choice of scoring matrix. Section 8.4 describes how to select a scoring matrix for database search using an information-theoretic method.
In comparisons of protein sequences with biased amino acid compositions, standard scoring matrices are no longer optimal. Section 8.5 introduces a general procedure for converting a standard scoring matrix into one suitable for the comparison of two sequences with biased compositions.
For certain applications of DNA sequence comparison, a nontrivial scoring matrix is critical. In Section 8.6, we discuss a variant of Dayhoff's method for constructing nucleotide substitution matrices. In addition, we address why comparison of protein sequences is often more effective than comparison of the corresponding coding DNA sequences.
8.1 The PAM Scoring Matrices

[Figure 8.1: a curve of PAM distance (0 to 350) against observed percent difference (0 to 85).]

Fig. 8.1 The correspondence between the PAM evolutionary distances and the dissimilarity levels.
The PAM matrices are the first amino acid substitution matrices for protein sequence
comparison. These matrices were first constructed by Dayhoff and coworkers based
on a Markov chain model of evolution. A point accepted mutation in a protein is
a substitution of one amino acid by another that is “accepted” by natural selection.
For a mutation to be accepted, the resulting amino acid must have the same function
as the original one. A PAM unit is an evolutionary time period over which 1% of
the amino acids in a sequence are expected to undergo accepted mutations. Because
a mutation might occur several times at a position, two protein sequences that are
100 PAM diverged are not necessarily different in every position; instead, they are
expected to be different in about 52% of positions. Similarly, two protein sequences
that are 250 PAM diverged have only roughly 80% dissimilarity. The correspon-
dence between the PAM evolutionary distance and the dissimilarity level is shown
in Figure 8.1.
Dayhoff and her coworkers first constructed the amino acid substitution ma-
trix for one PAM time unit, and then extrapolated it to other PAM distances. The
construction started with 71 blocks of aligned protein sequences. In each of these
blocks, a sequence is no more than 15% different from any other sequence. The high
within-block similarity was imposed to minimize the number of substitutions that
may have resulted from multiple substitutions at the same position.
[Figure 8.2: a phylogenetic tree whose leaves are the observed sequences ABKM, ACGH, and NBGH; inferred ancestral sequences such as ABGH label the internal nodes, and substitutions such as G→K, H→M, B→C, and A→N label the branches.]

Fig. 8.2 A phylogenetic tree over three observed protein sequences. Inferred ancestors are shown at the internal nodes, and amino acid substitutions are indicated along the branches.
Let A_ij denote the number of substitutions observed between amino acids i and j in the phylogenetic trees (see Table 8.1). For i ≠ j, set

p_ij = c A_ij / ∑_k A_ik,

where c is a positive constant to be specified according to the constraint of one PAM divergence, and set

p_ii = 1 − ∑_{j≠i} p_ij.
Because 1% of amino acids are expected to change over the one PAM period, we set c as

c = 0.01 / (∑_i ∑_{j≠i} f_i (A_ij / ∑_k A_ik)).
By estimating fi with the observed frequency of the corresponding amino acid in all
the sequences in the 71 phylogenetic trees, Dayhoff et al. obtained the substitution
matrix M1 over the one PAM period, which is given in Table 8.2.
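In code, the construction of M1 from substitution counts reads as follows; this is a sketch with a hypothetical 3-letter count matrix standing in for the 20 × 20 counts of Table 8.1.

    import numpy as np

    def dayhoff_m1(A, f):
        # A[i, j]: substitutions observed between amino acids i and j
        # (symmetric, zero diagonal); f: background frequencies.
        row = A.sum(axis=1)                 # sum_k A_ik
        off = A / row[:, None]              # A_ij / sum_k A_ik
        c = 0.01 / (f @ off.sum(axis=1))    # enforce 1% expected change
        P = c * off                         # p_ij = c A_ij / sum_k A_ik
        np.fill_diagonal(P, 1.0 - P.sum(axis=1))   # p_ii
        return P

    # Toy example; in practice A comes from Table 8.1 and f from the
    # observed amino acid frequencies.
    A = np.array([[0.0, 30.0, 109.0], [30.0, 0.0, 17.0], [109.0, 17.0, 0.0]])
    f = np.array([0.4, 0.3, 0.3])
    M1 = dayhoff_m1(A, f)
    print(M1.sum(axis=1))  # each row sums to 1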
If M1 = (mi j ) is considered as the transition matrix of the Markov chain model
of the evolution over one PAM period, then M1n is the transition matrix over the n
Table 8.1 The 1,572 amino acid substitutions observed in the 71 phylogenetic trees. Here, each entry is 10 times the corresponding number of substitutions. Fractional substitution numbers arise when ancestral sequences are ambiguous, in which case the substitutions are counted statistically. This is reproduced from the paper [55] of Dayhoff, Schwartz, and Orcutt; reprinted with permission of Nat'l Biomed. Res. Foundation.
Ala A
Arg R 30
Asn N 109 17
Asp D 154 0 532
Cys C 33 10 0 0
Gln Q 93 120 50 76 0
Glu E 266 0 94 831 0 422
Gly G 579 10 156 162 10 30 112
His H 21 103 226 43 10 243 23 10
Ile I 66 30 36 13 17 8 35 0 3
Leu L 95 17 37 0 0 75 15 17 40 253
Lys K 57 477 322 85 0 147 104 60 23 43 39
Met M 29 17 0 0 0 20 7 7 0 57 207 90
Phe F 20 7 7 0 0 0 0 17 20 90 167 0 17
Pro P 345 67 27 10 10 93 40 49 50 7 43 43 4 7
Ser S 772 137 432 98 117 47 86 450 26 20 32 168 20 40 269
Thr T 590 20 169 57 10 37 31 50 14 129 52 200 28 10 73 696
Trp W 0 27 3 0 0 0 0 0 3 0 13 0 0 10 0 17 0
Tyr Y 20 3 36 0 30 0 10 0 40 13 23 10 0 260 0 22 23 6
Val V 365 20 13 17 33 27 37 97 30 661 303 17 77 10 50 43 186 0 17
A R N D C Q E G H I L K M F P S T W Y V
PAM period. Let M_1^n = (m_ij^(n)). The entry m_ij^(n) is the probability that the ith amino acid is replaced by the jth amino acid at a position over n PAM periods. As n gets larger, all the entries in the jth column of this matrix have approximately the same value, converging to the background frequency f_j of the jth amino acid. Finally, the (i, j)-entry in the PAMn matrix is defined to be

C log(m_ij^(n) / f_j)

for some scaling constant C, rounded to the nearest integer.
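Extrapolating to n PAMs is then a matrix power followed by a log-odds transform. A sketch, assuming Dayhoff's convention of base-10 logarithms scaled by C = 10:

    import numpy as np

    def pam_scores(M1, f, n, C=10.0):
        # Transition matrix over n PAM periods, then scaled log-odds
        # against the background frequencies f_j (column-wise).
        Mn = np.linalg.matrix_power(M1, n)
        return np.rint(C * np.log10(Mn / f))

    # Usage, with M1 and f from the previous sketch:
    # S250 = pam_scores(M1, f, 250)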
Table 8.2 The amino acid substitution matrix over the one PAM period. The (i, j)-entry equals 10,000 times the probability that the amino acid in column j is replaced by the amino acid in row i. This is reproduced from the paper [55] of Dayhoff, Schwartz, and Orcutt; reprinted with permission of Nat'l Biomed. Res. Foundation.
A 9876 2 9 10 3 8 17 21 2 6 4 2 6 2 22 35 32 0 2 18
R 1 9913 1 0 1 10 0 0 10 3 1 19 4 1 4 6 1 8 0 1
N 4 1 9822 36 0 4 6 6 21 3 1 13 0 1 2 20 9 1 4 1
D 6 0 42 9859 0 6 53 6 4 1 0 3 0 0 1 5 3 0 0 1
C 1 1 0 0 9973 0 0 0 1 1 0 0 0 0 1 5 1 0 3 2
Q 3 9 4 5 0 9876 27 1 23 1 3 6 4 0 6 2 2 0 0 1
E 10 0 7 56 0 35 9865 4 2 3 1 4 1 0 3 4 2 0 1 2
G 21 1 12 11 1 3 7 9935 1 0 1 2 1 1 3 21 3 0 0 5
H 1 8 18 3 1 20 1 0 9912 0 1 1 0 2 3 1 1 1 4 1
I 2 2 3 1 2 1 2 0 0 9872 9 2 12 7 0 1 7 0 1 33
L 3 1 3 0 0 6 1 1 4 22 9947 2 45 13 3 1 3 4 2 15
K 2 37 25 6 0 12 7 2 2 4 1 9926 20 0 3 8 11 0 1 1
M 1 1 0 0 0 2 0 0 0 5 8 4 9874 1 0 1 2 0 0 4
F 1 1 1 0 0 0 0 1 2 8 6 0 4 9946 0 2 1 3 28 0
P 13 5 2 1 1 8 3 2 5 1 2 2 1 1 9926 12 4 0 0 2
S 28 11 34 7 11 4 6 16 2 2 1 7 4 3 17 9840 38 5 2 2
T 22 2 13 4 1 3 2 2 1 11 2 8 6 1 5 32 9871 0 2 9
W 0 2 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 9976 1 0
Y 1 0 3 0 3 0 1 0 4 1 1 0 0 21 0 1 1 2 9945 1
V 13 2 1 1 3 2 2 3 3 57 11 1 17 1 3 2 10 0 2 9901
A R N D C Q E G H I L K M F P S T W Y V
The PAM matrices seen in the literature often include extra rows/columns, denoted by B, Z, X, and *, whose scores are derived from the entries for the 20 amino acids; * stands for any character that is not an amino acid, such as the translation * of a stop codon.
8.2 The BLOSUM Scoring Matrices

In homology search, what we actually do is test whether two residues are correlated because they are homologous or not. The probability that two residues are aligned in an alignment of two homologous sequences is called the tar-
get frequency. If two residues are uncorrelated, occurring independently, the prob-
ability that we expect to observe these two residues aligned is the product of the
probabilities that these two residues occur in a sequence. This product is called
the background frequency. As we shall see in the next section, theory says that
the best score for aligning two amino acids is essentially the logarithm of the ratio
of their target frequencies to their background frequencies. In 1992, Henikoff and
Henikoff constructed the BLOSUM scoring matrices from the target frequencies in-
ferred from roughly 2000 blocks. Here, each block is the ungapped alignment of a
conserved region of a protein family. These blocks were obtained from 500 protein
groups.
Henikoff and Henikoff first counted the number of occurrences of each amino acid and the number of occurrences of each pair of aligned amino acids in the block dataset. Assume a block has k sequences of length w. There are kw residues and wk(k − 1)/2 pairs of aligned amino acids in this block. If amino acids i and j (i ≠ j) occur p and q times in a column, respectively, then there are p(p − 1)/2 i-i pairs and pq i-j pairs
in the column. The counts of all the possible pairs in every column of each block in
the dataset are then summed.
We further use n_ij to denote the number of i-j pairs in the block dataset. The frequency with which we expect to see i and j aligned is

p_ij = n_ij / ∑_{k≤l} n_kl.
Given the frequencies p_k of the amino acids for all 1 ≤ k ≤ 20, the probability e_ij that two amino acids i and j are aligned by chance is p_i p_j if i = j and 2 p_i p_j otherwise. A BLOSUM matrix is obtained by taking 2 times the logarithm to base 2 of p_ij / e_ij and rounding to the nearest integer.
The score for a pair of amino acids in a BLOSUM matrix is positive if the pair is aligned more often than expected by chance, and negative if it is aligned less often.
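A minimal sketch of this computation for a toy two-letter alphabet follows; real inputs would be the 20 × 20 pair counts gathered from the blocks.

    import numpy as np

    def blosum_scores(n_pairs):
        # n_pairs[i, j] for i <= j: number of i-j aligned pairs observed.
        n_pairs = np.triu(n_pairs)
        p = n_pairs / n_pairs.sum()         # pair frequencies p_ij
        k = p.shape[0]
        q = np.zeros(k)                     # single-residue frequencies
        for i in range(k):
            q[i] = p[i, i] + (p[i, :].sum() + p[:, i].sum() - 2 * p[i, i]) / 2
        s = np.zeros((k, k), dtype=int)
        for i in range(k):
            for j in range(k):
                a, b = min(i, j), max(i, j)
                e = q[i] * q[j] if i == j else 2 * q[i] * q[j]
                s[i, j] = round(2 * np.log2(p[a, b] / e))   # half-bit units
        return s

    # Toy counts: 9 A-A pairs, 2 A-B pairs, 5 B-B pairs.
    print(blosum_scores(np.array([[9.0, 2.0], [0.0, 5.0]])))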
This counting approach overlooks an important factor that can bias the result. If
there are many very closely related proteins and a few others that are less closely re-
lated in a block, then the contribution of that block will be biased toward the closely
related proteins. As a result, the substitution matrix derived will not be good for de-
tecting two distantly related protein sequences. To reduce multiple contributions to
the frequencies of pairs of amino acids, sequences in a block are first clustered. This
is done by specifying a cutoff similarity level x% and then grouping the sequences
in each block into clusters in such a way that each sequence in a cluster has x% or
higher similarity to one or more other sequences in the same cluster.
The sequences in each cluster are then weighted as a single sequence in the frequency calculation. Specifically, each occurrence of an amino acid in a sequence is counted 1/m times, where m is the size of the cluster that contains the sequence. Each occurrence of an amino acid pair within a cluster is not counted; a pair is counted 1/(mn) times if its two residues are from different clusters, where m and n are the sizes of the two clusters from which the two sequences are taken.
If x% is used as the cutoff similarity level for clustering, the resulting substitution
matrix is called the BLOSUMx matrix. BLOSUM62 given in Figure 8.7 is the most
frequently used scoring matrix for homology search. Other popular scoring matrices
include BLOSUM45 and BLOSUM80, which are given in Figures 8.8 and 8.9. The
BLOSUM matrices seen in literature sometimes include four extra rows/columns
denoted by B, Z, X, and *. These four columns have the same meaning as in the
PAM matrices.
The program used for constructing the block dataset uses a scoring matrix, too! This raises two questions: what substitution matrices are used there, and do those matrices bias the result? To break the circularity and eliminate the bias
effect, Henikoff and Henikoff took a three-step iterative approach. First, a unitary
scoring matrix, where the match score is 1 and the mismatch score is 0, was used
initially, generating 2205 blocks; a scoring matrix was obtained from these blocks
by clustering at similarity level 60%. Next, the resulting scoring matrix was used to
construct a second dataset of 1961 blocks, and another scoring matrix was obtained
in the same way as in the first step. Finally, the second scoring matrix was used to
construct the final version of the dataset of 2106 blocks (the BLOCKS database,
version 5.0); from this final dataset, various BLOSUM matrices were constructed
using corresponding similarity levels.
8.3 General Form of the Scoring Matrices

We now consider optimal ungapped local alignments obtained with a scoring matrix (s_ij) for the sequences in a protein model that is specified by a set of background frequencies p_i for all amino acids i. Assume (s_ij) has at least one positive entry and the expected score ∑_ij p_i p_j s_ij for aligning two random residues is negative. Karlin and Altschul (1990, [100]) showed that, among optimal ungapped local alignments obtained from the comparison of random sequences with scoring matrix (s_ij), the amino acids a_i and a_j are aligned with frequency

q_ij = p_i p_j e^{λ s_ij},   (8.1)

where λ is the unique positive solution of the equation

∑_{i,j} p_i p_j e^{λ s_ij} = 1.   (8.2)
Therefore, the scoring matrix (si j ) has an implicit set of target frequencies qi j for
aligning amino acids satisfying (8.2). In other words, the substitution score for a pair
of amino acids in any scoring matrix is essentially a log-odds ratio. Accordingly, no
matter what method is used, the resulting scoring matrices implicitly have the same
underlying mathematical structure as the Dayhoff and BLOSUM matrices.
Consider a scoring matrix (s_ij) and a protein model given by background frequencies p_i. As long as the expected score ∑_ij p_i p_j s_ij remains negative, (s_ij) can always be expressed in the log-odds form

s_ij = (1/λ) ln(q_ij / (p_i p'_j))   (8.3)

for some λ, where p_i and p'_j are the marginal sums of the q_ij:

p_i = ∑_j q_ij,   (8.4)
p'_j = ∑_i q_ij.   (8.5)
By (8.5),

∑_i p_i e^{λ s_ij} = 1,  1 ≤ j ≤ 20.   (8.6)

Let Y(λ) = (Y_ij(λ)) denote the inverse of the matrix (e^{λ s_ij}). Then

p_i = ∑_j Y_ji(λ),  1 ≤ i ≤ 20.   (8.7)

Similarly,

p'_j = ∑_i Y_ji(λ),  1 ≤ j ≤ 20,   (8.8)

and, summing over all i and j,

∑_{i,j} Y_ij(λ) = 1.   (8.9)

Solving (8.9), we obtain the value of λ. Once λ is known, we obtain p_i, p'_j, and q_ij from (8.7), (8.8), and (8.3), respectively.
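Given the background frequencies of a protein model, λ and the implicit target frequencies of a scoring matrix can be computed numerically. The sketch below solves (8.2) for λ by bisection and then applies (8.1); it assumes the background frequencies are given, rather than recovering them through Y(λ) as in the full retrieval procedure above.

    import numpy as np

    def karlin_altschul_lambda(s, p, tol=1e-12):
        # Solve sum_ij p_i p_j exp(lambda s_ij) = 1 (equation (8.2)) for
        # the unique positive root; requires a negative expected score
        # and at least one positive entry in s.
        f = lambda lam: float((p[:, None] * p[None, :] * np.exp(lam * s)).sum())
        lo, hi = 1e-9, 1.0
        while f(hi) < 1.0:   # expand the bracket to the right
            hi *= 2.0
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if f(mid) < 1.0 else (lo, mid)
        return 0.5 * (lo + hi)

    s = np.array([[1.0, -2.0], [-2.0, 1.0]])   # toy two-letter matrix
    p = np.array([0.5, 0.5])
    lam = karlin_altschul_lambda(s, p)
    q = p[:, None] * p[None, :] * np.exp(lam * s)   # target frequencies (8.1)
    print(lam, q.sum())   # q sums to 1 at the correct lambda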
8.4 How to Select a Scoring Matrix?
From the discussion in Section 8.3, given a protein sequence model in which amino acids occur by chance with background frequencies p_i (1 ≤ i ≤ 20), a scoring matrix (s_ij) with negative expected score and at least one positive entry is optimal only for the alignments in which the amino acids a_i and a_j are aligned with target frequencies q_ij that satisfy equation (8.2). In other words, different scoring matrices are optimal for detecting different classes of alignments. This raises the problem of selecting the scoring matrix that best distinguishes true protein alignments from chance.
Multiplying a scoring matrix by a positive constant has no effect on the relative
scores of different MSPs. Therefore, two matrices related by such a constant factor
are said to be equivalent. By (8.2), any scaling corresponds merely to taking the logarithm to a different base. When λ is set to 1, the scores are natural logarithms of the
odds ratios; when λ is set to ln 2, the scores become logarithms to base 2.
Let N denote the expected number of MSPs with score s or more obtained in the alignment of two sequences of lengths n_1 and n_2. Setting λ to ln 2 in (7.45), we obtain

s = log_2(K/N) + log_2(n_1 n_2).   (8.10)
The parameter K is usually less than 1 for a typical scoring matrix. An alignment may be considered significant when N is 0.01. As a result, the right-hand side of equation (8.10) is dominated by the term log_2(n_1 n_2), and the alignment score needed to distinguish an MSP from chance is approximately log_2(n_1 n_2). Thus, for comparing two proteins of 250 amino acid residues each, an MSP is statistically significant only if its score is 16 (≈ 2 log_2 250) or more; if such a protein is searched against a database of 10,000,000 residues, a significant MSP should then have a score of 31 (≈ log_2 2,500,000,000) or more.
The PAM and BLOSUM matrices given in Sections 8.1 and 8.2 have different λs (see Table 8.3). By multiplying by λ/ln 2, a scoring matrix is normalized into logarithms of odds ratios to base 2. Scores obtained with such a normalized scoring matrix can be interpreted as information measured in bits. As a result, we may say that 16 bits of information are required to distinguish an MSP from chance in comparing two protein sequences of 250 amino acid residues.
Given a protein model (which specifies the background frequencies p_i) and a normalized scoring matrix (s_ij), one can calculate the target frequencies q_ij, which characterize the alignments for which the scoring matrix is optimal, as

q_ij = p_i p_j e^{(ln 2) s_ij}.

Define

H = ∑_{i,j} q_ij s_ij = ∑_{i,j} q_ij log_2(q_ij / (p_i p_j)).   (8.11)
H is the expected score (or bit information) per residue pair in the alignments characterized by the target frequencies. From an information-theoretic point of view, H is
Table 8.3 The values of the parameter λ in equation (8.2) and the relative entropy of PAM and
BLOSUM matrices listed in Sections 8.1 and 8.2.
PAM30 PAM70 PAM120 PAM250 BLOSUM45 BLOSUM62 BLOSUM80
the relative entropy of the target frequency distribution with respect to the back-
ground distribution (see Section B.6). Hence, we call H the relative entropy of the
scoring matrix (si j ) (with respect to the protein model). Table 8.3 gives the relative
entropy of popular scoring matrices with respect to the implicit protein model.
Intuitively, if the value of H is high, relatively short alignments with the target
frequencies can be distinguished from chance; if the value of H is low, however,
long alignments are necessary. Recall that distinguishing an alignment from chance
needs 16 bits of information in comparison of two protein sequences of 250 amino
acids. Using this fact, we are able to estimate the length of a significant alignment of
two sequences that are x-PAM divergent. For example, at a distance of 120 PAMs,
there is on average 0.979 bit of information in every aligned position as shown in
Table 8.3. As a result, a significant alignment has at least 17 residues.
For database search, the situation is more complex. In this case, alignments are
unknown and hence it is not clear which scoring matrix is optimal. It is suggested
to use multiple PAM matrices.
The PAMx matrix is designed to compare two protein sequences that are sepa-
rated by x PAM distance. If it is used to compare two protein sequences that are
actually separated by y PAM distance, the average bit information achieved per po-
sition is smaller than its relative entropy. When y is close to x, the average infor-
mation achieved is near-optimal. Assume we are satisfied with using a PAM matrix
that yields a score greater than 93% of the optimal achievable score. Because a sig-
nificant MSP contains about 31 bits of information in searching a protein against
a protein database containing 10,000,000 residues, the length range of the local
alignments that the PAM120 matrix can detect is from 19 to 50. As a result, when
PAM120 is used, it may miss short but strong or long but weak alignments that con-
tain sufficient information to be found. Accordingly, PAM40 and PAM250 may be
used together with PAM120 in database search.
8.5 Compositional Adjustment of Scoring Matrices

We showed in Section 8.3 that a scoring matrix is valid only in one unique context. Thus, it is not ideal to use a scoring matrix that was constructed for a specific set of target and background frequencies in a different context. To compare proteins having biased compositions, one approach is to repeat Henikoff and Henikoff's procedure for constructing the BLOSUM matrices. A set of true alignments for
the proteins under consideration is first constructed. From this alignment set, a new
scoring matrix is then derived. But there are two problems with this approach. First,
it requires a large set of alignments. Such a set is often not available. Second, the
whole procedure requires a curatorial effort. Accordingly, an automatic adjustment
of a standard scoring matrix for different compositions is necessary. In the rest of
this section, we present a solution to this adjustment problem, which is due to Yu
and Altschul (2005, [212]).
Consider a scoring matrix (s_ij) with implicit target frequencies (q_ij), and a set of background frequencies (P_i) and (P'_j) that are inconsistent with (q_ij). Here, (P_i) and (P'_j) are not necessarily equal, although they are in the practical cases of interest. The problem of adjusting a scoring matrix is formulated as finding a set of target frequencies (Q_ij) that minimizes the following relative entropy with respect to the distribution (q_ij):

D((Q_ij)) = ∑_{ij} Q_ij ln(Q_ij / q_ij)   (8.12)

subject to consistency with the given background frequencies (P_i) and (P'_j):

∑_j Q_ij = P_i,  1 ≤ i ≤ 20,   (8.13)
∑_i Q_ij = P'_j,  1 ≤ j ≤ 20.   (8.14)
Because

∂²D / ∂Q_ij² = 1/Q_ij > 0

and

∂²D / (∂Q_ij ∂Q_km) = 0 for i ≠ k or j ≠ m,

the problem has a unique solution under the constraints of (8.13) and (8.14).
An additional constraint to impose is to keep the relative entropy H of the scoring matrix sought unchanged in the given background:

∑_{ij} Q_ij ln(Q_ij / (P_i P'_j)) = ∑_{ij} q_ij ln(q_ij / (P_i P'_j)).   (8.15)
Introduce the Lagrangian

F((Q_ij), (α_i), (β_j), γ) = D((Q_ij)) + ∑_i α_i (P_i − ∑_j Q_ij) + ∑_j β_j (P'_j − ∑_i Q_ij)
    + γ [∑_{ij} q_ij ln(q_ij / (P_i P'_j)) − ∑_{ij} Q_ij ln(Q_ij / (P_i P'_j))].   (8.16)
Setting the partial derivative of the Lagrangian F with respect to each of the Q_ij equal to 0, we obtain

ln(Q_ij / q_ij) + 1 − α_i − β_j − γ (ln(Q_ij / (P_i P'_j)) + 1) = 0.   (8.17)
The resulting scoring matrix has the same λ as the original scoring matrix (s_ij). The constraint (8.17) may be rewritten as

Q_ij = e^{(α_i − 1)/(1−γ)} e^{(β_j + γ)/(1−γ)} q_ij^{1/(1−γ)} (P_i P'_j)^{−γ/(1−γ)}.
Table 8.4 PAM substitution scores (bits) calculated from (8.18) in the uniform model.
PAM distance Match score Mismatch score Information per position
5 1.928 -3.946 1.64
30 1.588 -1.593 0.80
47 1.376 -1.096 0.51
70 1.119 -0.715 0.28
120 0.677 -0.322 0.08
This implies that the resulting scoring matrix (S_ij) is related to the original scores (s_ij) by

S_ij = γ s_ij + δ_i + ε_j

for some γ, δ_i, and ε_j. Here, we omit the formal argument for this conclusion. For details, the reader is referred to the work of Yu and Altschul (2005, [212]).
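A minimal numerical sketch of the γ = 0 variant of this problem: if the extra constraint (8.15) is dropped, the minimizer of (8.12) subject to the marginal constraints (8.13)-(8.14) can be found by iterative proportional fitting; the full problem, including (8.15), is solved with Newton's method in [212].

    import numpy as np

    def adjust_targets(q, P, Pp, iters=500):
        # Minimize sum_ij Q_ij ln(Q_ij / q_ij) subject to row sums P_i
        # and column sums P'_j, by alternately rescaling rows and columns.
        Q = q.copy()
        for _ in range(iters):
            Q *= (P / Q.sum(axis=1))[:, None]
            Q *= (Pp / Q.sum(axis=0))[None, :]
        return Q

    def adjusted_scores(Q, P, Pp, lam):
        # New scores in log-odds form against the biased composition.
        return np.log(Q / (P[:, None] * Pp[None, :])) / lam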
8.6 DNA Scoring Matrices

We have discussed how scoring matrices are calculated in the context of protein sequence comparison. All the results carry over to the case of DNA sequence comparison. In particular, one can construct a proper scoring matrix for aligning DNA sequences in a non-standard context using the log-odds approach as follows.
In comparison of non-coding DNA sequences, two evolutionary models are fre-
quently used. One assumes all nucleotides are uniformly distributed and all sub-
stitutions are equally likely. This model yields a scoring scheme by which all the
matches have the same score and so do all the mismatches. The so-called transition-
transversion model assumes that transitions (A ↔ G and C ↔ T) are threefold more
likely than transversions (A ↔ C, A ↔ T, G ↔ C, and G ↔ T). As a result, transi-
tions and transversions score differently.
Let M be the transition probability matrix that reflects 99% sequence conservation and one point accepted mutation per 100 bases (1 PAM distance), and let α = 0.01. In the uniform distribution model,

M =
    ( 1−α   α/3   α/3   α/3 )
    ( α/3   1−α   α/3   α/3 )
    ( α/3   α/3   1−α   α/3 )
    ( α/3   α/3   α/3   1−α ).

In the transition-transversion model, M has the same diagonal entries 1 − α, but the off-diagonal elements corresponding to transitions are (3/5)α and those for transversions are (1/5)α.
In a Markov chain evolutionary model, the matrix of probabilities for substituting base i by base j after n PAMs is calculated by n successive iterations of M:

(m_ij) = M^n.
The n-PAM score s_ij for aligning base i with base j is simply the logarithm of the relative chance of the pair occurring in an alignment of homologous sequences as opposed to occurring in a random alignment with background frequencies:

s_ij = log(p_i m_ij / (p_i p_j)) = log(4 m_ij),   (8.18)
because both models assume equal frequencies for the four bases.
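Equation (8.18) is straightforward to evaluate numerically; the sketch below reproduces the uniform-model scores of Table 8.4.

    import numpy as np

    def dna_pam_scores(n, alpha=0.01):
        # Match and mismatch scores (bits) at n PAMs in the uniform model:
        # s_ij = log2(4 * m_ij), where (m_ij) = M^n.
        M = np.full((4, 4), alpha / 3.0)
        np.fill_diagonal(M, 1.0 - alpha)
        Mn = np.linalg.matrix_power(M, n)
        return np.log2(4 * Mn[0, 0]), np.log2(4 * Mn[0, 1])

    print(dna_pam_scores(47))   # about (1.38, -1.10), as in Table 8.4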
Table 8.4 shows the substitution scores for various PAM distances in the uniform model. Base 2 is used for these calculations, and hence the scores can be thought of as bit information. At 47 PAMs (about 65% sequence conservation), the ratio of the match and mismatch scores is approximately 5 to 4; the resulting scores are equivalent to those used in the current version of BLASTN.
Table 8.5 presents the substitution scores for various PAM distances in the transition-transversion model. Notice that at a PAM distance of 120, transitions score positively and hence are considered conservative substitutions. Numerical calculation indicates that transitions score positively for 87 PAMs or more.
All the derived PAM substitution scores can be used to compare non-coding DNA sequences. The best scores to use depend on whether one is seeking distantly or closely related sequences. For example, in sequence assembly, one often uses alignment to determine whether a new segment of sequence overlaps an existing sequence significantly enough to form a contig. In this case, one is interested only in alignments that differ by a few bases. Hence, PAM5 substitution scores are more effective than other scores. On the other hand, the PAM120 scoring matrix is probably more suitable for the application of finding transcription-factor binding sites.
One natural question often asked by a practitioner of homology search is: should I compare gene sequences or the corresponding protein sequences? This can be answered through a simple quantitative analysis. Synonymous mutations are nucleotide substitutions that do not result in a change to the amino acid sequence of a protein. Evolutionary studies suggest that there tend to be approximately 1.5 synonymous point mutations for every nonsynonymous point mutation. Because each codon has 3 nucleotides, each protein PAM translates into roughly (1 + 1.5)/3 ≈ 0.8 PAMs at the DNA level. In the alignment of two proteins that have diverged by 120 protein PAMs, each residue carries on average 0.98 bit of information (see Table 8.3), whereas in the alignment of two DNA sequences that have diverged by 96 (or 120 × 0.8) PAMs, every three bases (a codon) carry only about 0.62 bit of information. In other words, at this evolutionary distance, as much as 37% of the information available in protein comparison is lost in DNA sequence comparison.

Table 8.5 PAM substitution scores (bits) calculated from (8.18) in the transition-transversion model.

PAM distance  Match score  Transition score  Transversion score  Information per position
8.7 Gap Penalty Costs

There is no general theory available for guiding the choice of gap costs. The most straightforward scheme is to charge a fixed penalty for each indel. Over the years, it has been observed that the optimal alignments produced by this scheme usually contain a large number of short gaps and are often not biologically meaningful (see Section 1.3 for the definition of gaps).
To capture the idea that a single mutational event might insert or delete a se-
quence of residues, Waterman and Smith (1981, [180]) introduced the affine gap
penalty model. Under this model, the penalty o + e × k is charged for a gap of length
k, where o is a large penalty for opening a gap and e a smaller penalty for extending
it. The current version of BLASTP uses, by default, 11 for gap opening and 1 for
gap extension, together with BLOSUM62, for aligning protein sequences.
The affine gap cost is based on the hypothesis that gap length has an exponential distribution, that is, the probability of a gap of length k is α(1 − β)β^k for some constants α and β. Under this hypothesis, an affine gap cost is derived by charging −log(α(1 − β)β^k) for a gap of length k. But this hypothesis might not hold in general. For instance, the study of Benner, Cohen, and Gonnet (1993, [26]) suggests that the frequency of a gap of length k is more accurately described by mk^{−1.7} for some constant m.
A generalized affine gap cost was introduced by Altschul (1998, [2]). A generalized
gap consists of a consecutive sequence of indels in which spaces can be in either
row. A generalized gap of length 10 may contain 10 insertions; it may also contain
4 insertions and 6 deletions. To reflect the structural property of a generalized gap, a
generalized affine gap cost has three parameters a, b, c. The score −a is introduced
for the opening of a gap; −b is for each residue inserted or deleted; and −c is
for each pair of residues left unaligned. A generalized gap with k insertions and l
deletions scores −(a + |k − l|b + c min{k, l}).
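For concreteness, the cost of a generalized gap is computed as follows (the parameter values in the example are arbitrary):

    def generalized_gap_score(k, l, a, b, c):
        # Score of a generalized gap with k insertions and l deletions:
        # -(a + |k - l| * b + c * min(k, l)).
        return -(a + abs(k - l) * b + c * min(k, l))

    # A generalized gap of length 10: all insertions vs. 4 insertions
    # plus 6 deletions.
    print(generalized_gap_score(10, 0, a=10, b=1, c=1))  # -20
    print(generalized_gap_score(4, 6, a=10, b=1, c=1))   # -16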
Generalized affine gap costs can be used for aligning protein sequences locally or globally. The dynamic programming algorithm carries over to this generalized affine gap cost in a straightforward manner, and it still has quadratic time complexity. For local alignment, the distribution of optimal alignment scores also approximately follows an extreme value distribution (7.1). The empirical study of Zachariah et al. (2005, [213]) shows that this generalized affine gap cost model significantly improves the accuracy of protein alignment.
8.8 Bibliographic Notes and Further Reading
For DNA sequence comparison and database search, simple scoring schemes are
usually effective. However, for protein sequences, some substitutions are much more
likely than others. The performance of an alignment program is clearly improved
when the scoring matrix employed accounts for this difference. As a result, a vari-
ety of amino acid properties have been used for constructing substitution matrices
in the papers of McLachlan [136], Taylor [188], Rao [142], Overington et al. [156],
and Risler et al. [172].
8.1
The PAM matrices are due to Dayhoff, Schwartz, and Orcutt [55]. Although the details of Dayhoff's approach were criticized in the paper of Wilbur [203], the PAM matrices have remained popular scoring matrices for protein sequence comparison over the past 30 years. New versions of the Dayhoff matrices were obtained by recalculating the model parameters using more protein sequence data in the papers of Gonnet, Cohen, and Benner [77] and Jones, Taylor, and Thornton [97]. The Dayhoff approach was also extended to estimate a Markov model of amino acid substitution from alignments of remotely homologous sequences by Müller, Spang, and Vingron [147, 148] and Arvestad [16]. For information on the statistical approach to the estimation of substitution matrices, we refer the reader to the survey paper of Yap and Speed [209].
8.2
The BLOSUM matrices are due to Henikoff and Henikoff [88]. They were de-
rived from direct estimation of target frequencies based on relatively distant, pre-
sumed correct sequence alignments. Although these matrices do not rely on fitting
an evolutionary model, they are the most effective ones for homology search as
demonstrated in the papers of Henikoff and Henikoff [89] and Pearson [159].
8.3
8.4
The empirical study of Pearson [159] illustrates that scaling similarity scores by the logarithm of the size of the database can dramatically improve the performance of a scoring matrix.
8.5
The method discussed in this section is from the paper of Yu and Altschul [212].
An improved method is given in the paper of Altschul et al. [9].
8.6
This section is written based on the paper of States, Gish, and Altschul [184]. The Henikoff and Henikoff method was used to construct scoring matrices for aligning non-coding genomic sequences in the paper of Chiaromonte, Yap, and Miller [45] (see also [46]). More methods for nucleotide scoring matrices can be found in the papers of Müller, Spang, and Vingron [148] and Schwartz et al. [178]. The transition and transversion rates are given in the paper of Li, Wu, and Luo [126].
8.7
The affine gap cost was first proposed by Smith and Waterman [180]. The generalized affine gap cost discussed in this section is due to Altschul [1]. When c = 2b, the generalized affine gap cost reduces to a cost model proposed by Zuker and Somorjai for protein structural alignment [217]. The empirical study of Zachariah et al. [213] shows that the generalized affine gap model aligns fewer residue pairs than the affine gap model but achieves significantly higher per-residue accuracy. Empirical studies on the distribution of insertions/deletions are found in the papers of Benner, Cohen, and Gonnet [26] and Pascarella and Argos [158].
The PAM250 scoring matrix:

A 2
R -2 6
N 0 0 2
D 0 -1 2 4
C -2 -4 -4 -5 12
Q 0 1 1 2 -5 4
E 0 -1 1 3 -5 2 4
G 1 -3 0 1 -3 -1 0 5
H -1 2 2 1 -3 3 1 -2 6
I -1 -2 -2 -2 -2 -2 -2 -3 -2 5
L -2 -3 -3 -4 -6 -2 -3 -4 -2 2 6
K -1 3 1 0 -5 1 0 -2 0 -2 -3 5
M -1 0 -2 -3 -5 -1 -2 -3 -2 2 4 0 6
F -4 -4 -4 -6 -4 -5 -5 -5 -2 1 2 -5 0 9
P 1 0 -1 -1 -3 0 -1 -1 0 -2 -3 -1 -2 -5 6
S 1 0 1 0 0 -1 0 1 -1 -1 -3 0 -2 -3 1 2
T 1 -1 0 0 -2 -1 0 0 -1 0 -2 0 -1 -3 0 1 3
W -6 2 -4 -7 -8 -5 -7 -7 -3 -5 -2 -3 -4 0 -6 -2 -5 17
Y -3 -4 -2 -4 0 -4 -4 -5 0 -1 -1 -4 -2 7 -5 -3 -3 0 10
V 0 -2 -2 -2 -2 -2 -2 -1 -2 4 2 -2 2 -1 -1 -1 0 -6 -2 4
A R N D C Q E G H I L K M F P S T W Y V
The PAM30 scoring matrix:

A 6
R -7 8
N -4 -6 8
D -3 -10 2 8
C -6 -8 -11 -14 10
Q -4 -2 -3 -2 -14 8
E -2 -9 -2 2 -14 1 8
G -2 -9 -3 -3 -9 -7 -4 6
H -7 -2 0 -4 -7 1 -5 -9 9
I -5 -5 -5 -7 -6 -8 -5 -11 -9 8
K -7 0 -1 -4 -14 -3 -4 -7 -6 -6 -8 7
P -2 -4 -6 -8 -8 -3 -5 -6 -4 -8 -7 -6 -8 -10 8
S 0 -3 0 -4 -3 -5 -4 -2 -6 -7 -8 -4 -5 -6 -2 6
T -1 -6 -2 -5 -8 -5 -6 -6 -7 -2 -7 -3 -4 -9 -4 0 7
W -13 -2 -8 -15 -15 -13 -17 -15 -7 -14 -6 -12 -13 -4 -14 -5 -13 13
V -2 -8 -8 -8 -6 -7 -6 -5 -6 2 -2 -9 -1 -8 -6 -6 -3 -15 -7 7
A R N D C Q E G H I L K M F P S T W Y V
The PAM70 scoring matrix:

A 5
R -4 8
N -2 -3 6
D -1 -6 3 6
C -4 -5 -7 -9 9
Q -2 0 -1 0 -9 7
E -1 -5 0 3 -9 2 6
G 0 -6 -1 -1 -6 -4 -2 6
H -4 0 1 -1 -5 2 -2 -6 8
I -2 -3 -3 -5 -4 -5 -4 -6 -6 7
L -4 -6 -5 -8 -10 -3 -6 -7 -4 1 6
K -4 2 0 -2 -9 -1 -2 -5 -3 -4 -5 6
M -3 -2 -5 -7 -9 -2 -4 -6 -6 1 2 0 10
F -6 -7 -6 -10 -8 -9 -9 -7 -4 0 -1 -9 -2 8
P 0 -2 -3 -4 -5 -1 -3 -3 -2 -5 -5 -4 -5 -7 7
S 1 -1 1 -1 -1 -3 -2 0 -3 -4 -6 -2 -3 -4 0 5
T 1 -4 0 -2 -5 -3 -3 -3 -4 -1 -4 -1 -2 -6 -2 2 6
Y -5 -7 -3 -7 -2 -8 -6 -9 -1 -4 -4 -7 -7 4 -9 -5 -4 -3 9
V -1 -5 -5 -5 -4 -4 -4 -3 -4 3 0 -6 0 -5 -3 -3 -1 -10 -5 6
A R N D C Q E G H I L K M F P S T W Y V
The PAM120 scoring matrix:

A 3
R -3 6
N -1 -1 4
D 0 -3 2 5
C -3 -4 -5 -7 9
Q -1 1 0 1 -7 6
E 0 -3 1 3 -7 2 5
G 1 -4 0 0 -4 -3 -1 5
H -3 1 2 0 -4 4 -1 -4 7
I -1 -2 -2 -3 -3 -3 -3 -4 -4 6
L -3 -4 -4 -5 -7 -2 -4 -5 -3 1 5
K -2 2 1 -1 -7 0 -1 -3 -2 -3 -4 5
M -2 -1 -3 -4 -6 -1 -3 -4 -4 1 3 0 8
F -4 -5 -4 -7 -6 -6 -7 -5 -3 0 0 -7 -1 8
P 1 -1 -2 -3 -4 0 -2 -2 -1 -3 -3 -2 -3 -5 6
S 1 -1 1 0 0 -2 -1 1 -2 -2 -4 -1 -2 -3 1 3
T 1 -2 0 -1 -3 -2 -2 -1 -3 0 -3 -1 -1 -4 -1 2 4
W -7 1 -4 -8 -8 -6 -8 -8 -3 -6 -3 -5 -6 -1 -7 -2 -6 12
Y -4 -5 -2 -5 -1 -5 -5 -6 -1 -2 -2 -5 -4 4 -6 -3 -3 -2 8
V 0 -3 -3 -3 -3 -3 -3 -2 -3 3 1 -4 1 -3 -2 -2 0 -8 -3 5
A R N D C Q E G H I L K M F P S T W Y V
The BLOSUM62 scoring matrix (Figure 8.7):

A 4
R -1 5
N -2 0 6
D -2 -2 1 6
C 0 -3 -3 -3 9
Q -1 1 0 0 -3 5
E -1 0 0 2 -4 2 5
G 0 -2 0 -1 -3 -2 -2 6
H -2 0 1 -1 -3 0 0 -2 8
I -1 -3 -3 -3 -1 -3 -3 -4 -3 4
L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4
K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5
M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5
F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7
S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4
T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11
Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7
V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4
A R N D C Q E G H I L K M F P S T W Y V
The BLOSUM45 scoring matrix (Figure 8.8):

A 5
R -2 7
N -1 0 6
D -2 -1 2 7
C -1 -3 -2 -3 12
Q -1 1 0 0 -3 6
E -1 0 0 2 -3 2 6
G 0 -2 0 -1 -3 -2 -2 7
H -2 0 1 0 -3 1 0 -2 10
I -1 -3 -2 -4 -3 -2 -3 -4 -3 5
L -1 -2 -3 -3 -2 -2 -2 -3 -2 2 5
K -1 3 0 0 -3 1 1 -2 -1 -3 -3 5
M -1 -1 -2 -3 -2 0 -2 -2 0 2 2 -1 6
F -2 -2 -2 -4 -2 -4 -3 -3 -2 0 1 -3 0 8
P -1 -2 -2 -1 -4 -1 0 -2 -2 -2 -3 -1 -2 -3 9
S 1 -1 1 0 -1 0 0 0 -1 -2 -3 -1 -2 -2 -1 4
T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -1 -1 2 5
W -2 -2 -4 -4 -5 -2 -3 -2 -3 -2 -2 -2 -2 1 -3 -4 -3 15
Y -2 -1 -2 -2 -3 -1 -2 -3 2 0 0 -1 0 3 -3 -2 -1 3 8
V 0 -2 -3 -3 -1 -3 -3 -3 -3 3 1 -2 1 0 -3 -1 0 -3 -1 5
A R N D C Q E G H I L K M F P S T W Y V
The BLOSUM80 scoring matrix (Figure 8.9):

A 7
R -3 9
N -3 -1 9
D -3 -3 2 10
C -1 -6 -5 -7 13
Q -2 1 0 -1 -5 9
E -2 -1 -1 2 -7 3 8
G 0 -4 -1 -3 -6 -4 -4 9
H -3 0 1 -2 -7 1 0 -4 12
I -3 -5 -6 -7 -2 -5 -6 -7 -6 7
L -3 -4 -6 -7 -3 -4 -6 -7 -5 2 6
K -1 3 0 -2 -6 2 1 -3 -1 -5 -4 8
M -2 -3 -4 -6 -3 -1 -4 -5 -4 2 3 -3 9
F -4 -5 -6 -6 -4 -5 -6 -6 -2 -1 0 -5 0 10
P -1 -3 -4 -3 -6 -3 -2 -5 -4 -5 -5 -2 -4 -6 12
S 2 -2 1 -1 -2 -1 -1 -1 -2 -4 -4 -1 -3 -4 -2 7
T 0 -2 0 -2 -2 -1 -2 -3 -3 -2 -3 -1 -1 -4 -3 2 8
W -5 -5 -7 -8 -5 -4 -6 -6 -4 -5 -4 -6 -3 0 -7 -6 -5 16
Y -4 -4 -4 -6 -5 -3 -5 -6 3 -3 -2 -4 -3 4 -6 -3 -3 3 11
V -1 -4 -5 -6 -2 -4 -4 -6 -5 4 1 -4 1 -2 -4 -3 0 -5 -3 7
A R N D C Q E G H I L K M F P S T W Y V
Editors-in-Chief
Andreas Dress
University of Bielefeld (Germany)
Martin Vingron
Max Planck Institute for Molecular Genetics (Germany)
Editorial Board
Gene Myers, Janelia Farm Research Campus, Howard Hughes Medical Institute (USA)
Robert Giegerich, University of Bielefeld (Germany)
Walter Fitch, University of California, Irvine (USA)
Pavel A. Pevzner, University of California, San Diego (USA)
Advisory Board
Gordon Crippen, University of Michigan (USA)
Joe Felsenstein, University of Washington (USA)
Dan Gusfield, University of California, Davis (USA)
Sorin Istrail, Brown University, Providence (USA)
Samuel Karlin, Stanford University
Thomas Lengauer, Max Planck Institut Informatik (Germany)
Marcella McClure, Montana State University (USA)
Martin Nowak, Harvard University (USA)
David Sankoff, University of Ottawa (Canada)
Ron Shamir, Tel Aviv University (Israel)
Mike Steel, University of Canterbury (New Zealand)
Gary Stormo, Washington University Medical School (USA)
Simon Tavaré, University of Southern California (USA)
Tandy Warnow, University of Texas, Austin (USA)
The Computational Biology series publishes the very latest, high-quality research devoted to specific
issues in computer-assisted analysis of biological data. The main emphasis is on current scientific devel-
opments and innovative techniques in computational biology (bioinformatics), bringing to light methods
from mathematics, statistics and computer science that directly address biological problems currently
under investigation.
The series offers publications that present the state-of-the-art regarding the problems in question; show
computational biology/bioinformatics methods at work; and finally discuss anticipated demands regard-
ing developments in future methodology. Titles can range from focused monographs, to undergraduate
and graduate textbooks, and professional text/reference works.
Sequence Comparison
Theory and Methods
ABC
Kun-Mao Chao, BS, MS, PhD Louxin Zhang, BSc, MSc, PhD
Department of Computer Science and Department of Mathematics,
Information Engineering, National University of Singapore,
National Taiwan University, Singapore
Taiwan
LXZ:
To my parents
Foreword
My first thought when I saw a preliminary version of this book was: Too bad there
was nothing like this book when I really needed it.
Around 20 years ago, I decided it was time to change my research directions.
After exploring a number of possibilities, I decided that the area of overlap between
molecular biology and computer science (which later came to be called "bioinformatics") was my best bet for an exciting career. The next decision was to select a
specific class of problems to work on, and the main criterion for me was that algo-
rithmic methods would be the main key to success. I decided to work on sequence
analysis. A book like this could have, so to speak, straightened my learning curve.
It is amazing to me that those two conclusions still apply: bioinformatics is a
tremendously vibrant and rewarding field to be in, and sequence comparison is (ar-
guably, at least) the subfield of bioinformatics where algorithmic techniques play
the largest role in achieving success. The importance of sequence-analysis methods
in bioinformatics can be measured objectively, simply by looking at the numbers
of citations in the scientific literature for papers that describe successful develop-
ments; a high percentage of the most heavily cited scientific publications in the past
30 years are from this new field. Continued growth and importance of sequence
analysis is guaranteed by the explosive development of new technologies for gen-
erating sequence data, where the cost has dropped 1000-fold in the past few years,
and this fantastic decrease in cost means that sequencing and sequence analysis are
taking over jobs that were previously handled another way.
Careful study of this book will be valuable for a wide range of readers, from
students wanting to enter the field of bioinformatics, to experienced users of bioin-
formatic tools wanting to use tool options more intelligently, to bioinformatic spe-
cialists looking for the killer algorithm that will yield the next tool to sweep the
field. I predict that you will need more than just mastery of this material to reach
stardom in bioinformatics – there is also a huge amount of biology to be learned, to-
gether with a regular investment of time to keep up with the latest in data-generation
technology and its applications. However, the material herein will remain useful for
years, as new sequencing technologies and biological applications come and go.
I invite you to study this book carefully and apply ideas from it to one of the most
exciting areas of science. And be grateful that two professionals with a combined
30 years of experience have taken the time to open the door for you.
Chapters 2 to 5 form the method part. This part covers the basic algorithms and
methods for sequence alignment. Chapter 2 introduces basic algorithmic techniques
that are often used for solving various problems in sequence comparison.
In Chapter 3, we present the Needleman-Wunsch and Smith-Waterman algo-
rithms, which, respectively, align a pair of sequences globally and locally, and their
variants for coping with various gap penalty costs. For analysis of long genomic
sequences, the space restriction is more critical than the time constraint. We there-
fore introduce an efficient space-saving strategy for sequence alignment. Finally, we
discuss a few advanced topics of sequence alignment.
Chapter 4 introduces four popular homology search programs: FASTA, BLAST
family, BLAT, and PatternHunter. We also discuss how to implement the filtration
idea used in these programs with efficient data structures such as hash tables, suffix
trees, and suffix arrays.
Chapter 5 covers briefly multiple sequence alignment. We discuss how a multi-
ple sequence alignment is scored, and then show why the exact method based on a
dynamic-programming approach is not feasible. Finally, we introduce the progres-
sive alignment approach, which is adopted by ClustalW, MUSCLE, YAMA, and
other popular programs for multiple sequence alignment.
Chapters 6 to 8 form the theory part. Chapter 6 covers the theoretic aspects of the
seeding technique. PatternHunter demonstrates that an optimized spaced seed im-
proves sensitivity substantially. Accordingly, elucidating the mechanism that con-
fers power to spaced seeds and identifying good spaced seeds become new issues in
homology search. This chapter presents a framework of studying these two issues
by relating them to the probability of a spaced seed hitting a random alignment. We
address why spaced seeds improve homology search sensitivity and discuss how to
design good spaced seeds.
The Karlin-Altschul statistics of optimal local alignment scores are covered in
Chapter 7. Optimal segment scores are shown to follow an extreme value distribution in the asymptotic limit. The Karlin-Altschul sum statistic is also introduced. In the case of gapped local alignment, we describe how the statistical parameters of the distribution of the optimal alignment scores are estimated through an empirical approach, and we discuss the edge-effect and multiple testing issues. We also relate the theory to the calculations of the Expect and P-values in the BLAST program.
Chapter 8 is about the substitution matrices. We start with the reconstruction
of popular PAM and BLOSUM matrices. We then present Altschul's information-theoretic approach to scoring matrix selection and recent work on compositional
adjustment of scoring matrices for aligning sequences with biased letter frequencies.
Finally, we discuss gap penalty costs.
This text is targeted to a reader with a general scientific background. Little or
no prior knowledge of biology, algorithms, and probability is expected or assumed.
The basic notions from molecular biology that are useful for understanding the top-
ics covered in this text are outlined in Appendix A. Appendix B provides a brief
introduction to probability theory. Appendix C lists popular software packages for
pairwise alignment, homology search, and multiple alignment.
This book is a general and rigorous text on the algorithmic techniques and math-
ematical foundations of sequence alignment and homology search. But, it is by no
means comprehensive. It is impossible to give a complete introduction to this field
because it is evolving too quickly. Accordingly, each chapter concludes with the
bibliographic notes that report related work and recent progress. The reader may
ultimately turn to the research articles published in scientific journals for more in-
formation and new progress.
Most of the text is written at a level suitable for undergraduates. It is based on lectures given in bioinformatics and mathematical genomics courses at the National University of Singapore and National Taiwan University each year from 2002 to 2008. These courses were offered to students majoring in biology, computer science, electrical engineering, statistics, and mathematics. We thank the students in these courses for their comments on the material, many of which have been incorporated into this text.
Despite our best efforts, this book may contain errors. It is our responsibility
to correct any errors and omissions. A list of errata will be compiled and made
available at http://www.math.nus.edu.sg/~matzlx/sequencebook.
We are extremely grateful to our mentor Webb Miller for kindly writing the fore-
word for this book. The first author particularly wants to thank Webb for introducing
him to the emerging field of computational molecular biology and guiding him from
the basics nearly two decades ago.
The second author is particularly thankful to Ming Li for guiding and encouraging him since his student days in Waterloo. He also thanks his collaborators Kwok
Pui Choi, Aaron Darling, Minmei Hou, Yong Kong, Jian Ma, Bin Ma, and Franco
Preparata, with whom he worked on the topics covered in this book. In addition, he
would like to thank Kal Yen Kaow Ng and Jialiang Yang for reading sections of the
text and catching some nasty bugs.
We also thank the following people for their inspiring conversations, suggestions,
and pointers: Stephen F. Altschul, Vineet Bafna, Louis H.Y. Chen, Ross C. Hardison,
Xiaoqiu Huang, Tao Jiang, Jim Kent, Pavel Pevzner, David Sankoff, Scott Schwartz,
Nikola Stojanovic, Lusheng Wang, Von Bing Yap, and Zheng Zhang.
Finally, it has been a pleasure to work with Springer in the development of this book. We especially thank our editors Wayne Wheeler and Catherine Brett for patiently shepherding this project and for their constant deadline reminders, which eventually saw us through. We also thank the copy editor C. Curioli for valuable comments and the production editor Frank Ganz for assistance with formatting.
About the Authors
Kun-Mao Chao was born in Tou-Liu, Taiwan, in 1963. He received the B.S. and
M.S. degrees in computer engineering from National Chiao-Tung University, Tai-
wan, in 1985 and 1987, respectively, and the Ph.D. degree in computer science from
The Pennsylvania State University, University Park, in 1993. He is currently a pro-
fessor of bioinformatics at National Taiwan University, Taipei, Taiwan. From 1987
to 1989, he served in the ROC Air Force Headquarters as a system engineer. From
1993 to 1994, he worked as a postdoctoral fellow at Penn State’s Center for Compu-
tational Biology. In 1994, he was a visiting research scientist at the National Center
for Biotechnology Information, National Institutes of Health, Bethesda, Maryland.
Before joining the faculty of National Taiwan University, he taught in the Depart-
ment of Computer Science and Information Management, Providence University,
Taichung, Taiwan, from 1994 to 1999, and the Department of Life Science, Na-
tional Yang-Ming University, Taipei, Taiwan, from 1999 to 2002. He received teaching awards at both Providence University and National Taiwan University. His
current research interests include algorithms and bioinformatics. He is a member of
Phi Tau Phi and Phi Kappa Phi.
Louxin Zhang studied mathematics at Lanzhou University, earning his B.S. and
M.S. degrees, and studied computer science at the University of Waterloo, where he
received his Ph.D. He has been a researcher and teacher in bioinformatics and computational biology at the National University of Singapore (NUS) since 1996. His cur-
rent research interests include genomic sequence analysis and phylogenetic analysis.
His research interests also include applied combinatorics, algorithms, and theoreti-
cal computer science. In 1997, he received a Lee Kuan Yew Postdoctoral Research
Fellowship to further his research. Currently, he is an associate professor of compu-
tational biology at NUS.
Contents
Foreword . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Biological Motivations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Alignment: A Model for Sequence Comparison . . . . . . . . . . . . . . . . . 2
1.2.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.2 Alignment Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Scoring Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4 Computing Sequence Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4.1 Global Alignment Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4.2 Local Alignment Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.5 Multiple Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.6 What Alignments Are Meaningful? . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.7 Overview of the Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.8 Bibliographic Notes and Further Reading . . . . . . . . . . . . . . . . . . . . . . 13
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207