Longest Common Subsequence
Inspiration
• Biological applications often need to compare the DNA
of two (or more) different organisms
• A strand of DNA consists of a string of molecules called
bases, where the possible bases are adenine, guanine,
cytosine, and thymine
• Representing each of these bases by its initial letter, we can express a
strand of DNA as a string over the finite set {A, C, G, T}
• For example, the DNA of one organism may be S1=
ACCGGTCGAGTGCGCGGAAGCCGGCCGAA, and
the DNA of another organism may be S2=
GTCGTTCGGAATGCCGTTGCTCTGTAAA.
• One reason to compare two strands of DNA is to
determine how "similar" the two strands are, as some
measure of how closely related the two organisms are
• We can define similarity in many different ways
• First way - we can say that two DNA strands are similar
if one is a substring of the other
• In our example, neither S1 nor S2 is a substring of the
other.
• Second way - two strands are similar if the number of
changes needed to turn one into the other is small
• Third way - we measure the similarity of strands S1 and S2 by
finding a third strand S3
• The bases in S3 must appear in each of S1 and S2; these bases
must appear in the same order, but not necessarily
consecutively
• The longer the strand S3 we can find, the more similar S1 and S2 are
• S1= ACCGGTCGAGTGCGCGGAAGCCGGCCGAA
• S2 = GTCGTTCGGAATGCCGTTGCTCTGTAAA
• S3 is GTCGTCGGAAGCCGGCCGAA
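The claim that S3 appears, in order, inside both S1 and S2 can be checked mechanically. The sketch below is only an illustration: the helper name `is_subsequence` is an assumption, not something defined in these slides.

```python
def is_subsequence(z: str, x: str) -> bool:
    """Return True if z can be obtained from x by deleting zero or more characters."""
    it = iter(x)
    # `ch in it` consumes the iterator up to and including the first match,
    # so later characters of z can only match later positions of x.
    return all(ch in it for ch in z)


S1 = "ACCGGTCGAGTGCGCGGAAGCCGGCCGAA"
S2 = "GTCGTTCGGAATGCCGTTGCTCTGTAAA"
S3 = "GTCGTCGGAAGCCGGCCGAA"

# S3 must be a subsequence of both strands to count as a common strand.
s3_is_common = is_subsequence(S3, S1) and is_subsequence(S3, S2)
```

Here `s3_is_common` evaluates to `True`; the check confirms only that S3 is common to both strands, not that it is the longest such strand.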
Problem Statement
• A subsequence of a given sequence is just the given
sequence with zero or more elements left out
• Formally, given a sequence X = <x1,x2,...,xm>, another
sequence Z =<z1,z2,...,zk> is a subsequence of X if there
exists a strictly increasing sequence <i1,i2,...,ik> of indices
of X such that for all j = 1,2,...,k, we have x_{i_j} = z_j
• For example, Z = <B, C, D, B> is a subsequence of X = <A,
B, C, B, D, A, B> with corresponding index sequence <2, 3,
5, 7>
• Given two sequences X and Y , we say that a sequence Z is a
common subsequence of X and Y if Z is a subsequence of
both X and Y
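The formal definition asks for a strictly increasing index sequence as a witness. A minimal sketch of finding one greedily is shown here; the names `index_witness` and `is_common_subsequence` are hypothetical, and indices are reported 1-based to match the notation above.

```python
from typing import List, Optional, Sequence


def index_witness(z: Sequence, x: Sequence) -> Optional[List[int]]:
    """Return 1-based indices i_1 < i_2 < ... < i_k with x[i_j] == z[j],
    or None if z is not a subsequence of x."""
    indices = []
    pos = 0  # 0-based position in x where the search for the next element starts
    for target in z:
        # Greedily take the earliest remaining position that matches.
        while pos < len(x) and x[pos] != target:
            pos += 1
        if pos == len(x):
            return None
        indices.append(pos + 1)  # convert to the 1-based indexing used in the text
        pos += 1
    return indices


def is_common_subsequence(z: Sequence, x: Sequence, y: Sequence) -> bool:
    """Z is a common subsequence if it has an index witness in both X and Y."""
    return index_witness(z, x) is not None and index_witness(z, y) is not None


witness = index_witness("BCDB", "ABCBDAB")  # [2, 3, 5, 7]
```

For Z = <B, C, D, B> and X = <A, B, C, B, D, A, B>, the greedy search happens to return the same index sequence <2, 3, 5, 7> given in the example; in general a subsequence may have several valid index sequences.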
• For example, if X = <A, B, C, B, D, A, B> and Y = <B, D, C, A,
B, A>, the sequence <B, C, A> is a common subsequence of
both X and Y
• But it is not a longest common subsequence (LCS) of X and Y,
since it has length 3
• The sequence <B, C, B, A>, which is also common to both X and
Y, has length 4 and is an LCS of X and Y
• It is an LCS because X and Y have no common subsequence of
length 5 or greater
Step 1: Characterizing a longest common subsequence
• Brute-force approach to solve LCS problem:
• Enumerate all subsequences of X
• Check each subsequence to see whether it is also a subsequence of Y
• Keep track of the longest common subsequence found
• Each subsequence of X corresponds to a subset of the indices
{1, 2,...,m} of X
• Because X has 2^m subsequences, this approach requires
exponential time, making it impractical for long sequences
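The brute-force idea can be sketched as follows. This is an illustrative variant, not the slides' own code: instead of tracking the longest match seen so far, it enumerates index subsets from largest to smallest and stops at the first hit, which gives the same answer. The function names are assumptions.

```python
from itertools import combinations


def is_subsequence(z: str, y: str) -> bool:
    """Return True if z appears in y in order, not necessarily consecutively."""
    it = iter(y)
    return all(ch in it for ch in z)


def lcs_brute_force(x: str, y: str) -> str:
    """Exponential-time LCS: try every subset of indices of x, longest first."""
    m = len(x)
    for k in range(m, -1, -1):
        # Each k-element subset of {0, ..., m-1} picks out one subsequence of x.
        for idx in combinations(range(m), k):
            candidate = "".join(x[i] for i in idx)
            if is_subsequence(candidate, y):
                return candidate  # first hit at the largest k is an LCS
    return ""  # not reached: the empty subsequence always matches at k = 0


result = lcs_brute_force("ABCBDAB", "BDCABA")
```

For the running example, `result` has length 4. In the worst case all 2^m index subsets are examined, so this is usable only for very short inputs.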
Basis of Optimal substructure of an LCS
• Given a sequence X = <x1, x2,...,xm>, we define the ith prefix of
X , for i = 0,1,...,m, as Xi = <x1, x2,...,xi>
• For example, if X = <A, B, C, B, D, A, B>, then X4 = <A, B, C,
B> and X0 is the empty sequence
Theorem 15.1 Optimal substructure of an LCS
• Let X = <x1, x2,...,xm> and Y = <y1, y2,...,yn> be sequences, and let Z
= <z1, z2,..., zk> be any LCS of X and Y .
1. If xm = yn, then zk = xm = yn and Zk-1 is an LCS of Xm-1 and Yn-1
2. If xm ≠ yn, then zk ≠ xm implies that Z is an LCS of Xm-1 and Y
3. If xm ≠ yn, then zk ≠ yn implies that Z is an LCS of X and Yn-1
Proof of Theorem 15.1
• (1) If zk ≠ xm, then we could append xm = yn to Z to obtain a
common subsequence of X and Y of length k + 1, contradicting
the supposition that Z is a longest common subsequence of X
and Y. Thus, we must have zk = xm = yn.
• Now, the prefix Zk-1 is a length-(k-1) common subsequence of
Xm-1 and Yn-1
• We wish to show that it is an LCS
• Suppose for the purpose of contradiction that there exists a
common subsequence W of Xm-1 and Yn-1 with length greater
than k-1
• Then, appending xm = yn to W produces a common subsequence
of X and Y whose length is greater than k, which is a
contradiction
• (2) If zk ≠ xm, then Z is a common subsequence of Xm-1 and Y
• If there were a common subsequence W of Xm-1 and Y with
length greater than k, then W would also be a common
subsequence of X and Y, contradicting the assumption that Z is
an LCS of X and Y
• (3) The proof is symmetric to (2)
Step 2: A recursive solution
• Theorem 15.1 implies that we should examine either one or two
subproblems when finding an LCS of X = <x1, x2,...,xm> and
Y= <y1, y2,...,yn>
• If xm = yn, we must find an LCS of Xm-1 and Yn-1
• Appending xm = yn to this LCS yields an LCS of X and Y
• If xm ≠ yn, then we must solve two subproblems: finding an LCS
of Xm-1 and Y and finding an LCS of X and Yn-1
• Whichever of these two LCSs is longer is an LCS of X and Y
• Because these cases exhaust all possibilities, we know that one
of the optimal subproblem solutions must appear within an LCS
of X and Y .
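The case analysis above translates almost directly into a recursive function. The sketch below is a naive illustration and returns only the length of an LCS; it assumes that an empty prefix yields length 0, and the name `lcs_length_recursive` is hypothetical.

```python
def lcs_length_recursive(x: str, y: str) -> int:
    """Length of an LCS of x and y, following the two cases of the theorem."""
    if not x or not y:
        return 0  # assumed base case: an empty sequence has only the empty LCS
    if x[-1] == y[-1]:
        # Case xm = yn: solve one subproblem on X_{m-1}, Y_{n-1} and extend it.
        return lcs_length_recursive(x[:-1], y[:-1]) + 1
    # Case xm != yn: solve both subproblems and keep the longer result.
    return max(
        lcs_length_recursive(x[:-1], y),
        lcs_length_recursive(x, y[:-1]),
    )


length = lcs_length_recursive("ABCBDAB", "BDCABA")  # 4
```

This version recomputes the same prefix pairs many times, so its running time can grow exponentially with the input length.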
Step 2: Overlapping Subproblem
• To find an LCS of X and Y, we may need to find the LCSs of X
and Yn-1 and of Xm-1 and Y
• But each of these subproblems has the subsubproblem of finding
an LCS of Xm-1 and Yn-1
• Many other subproblems share subsubproblems.
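One way to see the overlap is to count how often the naive recursion is invoked versus how many distinct prefix pairs (i, j) it ever visits. The following is a small experiment under that framing; `count_calls` is a hypothetical helper.

```python
from typing import Tuple


def count_calls(x: str, y: str) -> Tuple[int, int]:
    """Return (total recursive calls, number of distinct (i, j) subproblems)."""
    calls = 0
    distinct = set()

    def rec(i: int, j: int) -> int:
        # rec(i, j) is the LCS length of the prefixes X_i and Y_j.
        nonlocal calls
        calls += 1
        distinct.add((i, j))
        if i == 0 or j == 0:
            return 0
        if x[i - 1] == y[j - 1]:
            return rec(i - 1, j - 1) + 1
        return max(rec(i - 1, j), rec(i, j - 1))

    rec(len(x), len(y))
    return calls, len(distinct)


calls, distinct = count_calls("ABCBDAB", "BDCABA")
```

The number of distinct subproblems can never exceed (m + 1)(n + 1), while the number of calls is larger whenever a subproblem is reached along more than one path, as happens for this example.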
• Let us define c[i, j] to be the length of an LCS of the sequences
Xi and Yj
• If either i = 0 or j = 0, one of the sequences has length 0, and so the
LCS has length 0
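Combining this base case with the two cases of Theorem 15.1 gives a recurrence for c[i, j]. It is written out here as a reconstruction from the statements above, not copied from the slides:

```latex
c[i, j] =
\begin{cases}
0 & \text{if } i = 0 \text{ or } j = 0, \\
c[i-1, j-1] + 1 & \text{if } i, j > 0 \text{ and } x_i = y_j, \\
\max\bigl(c[i, j-1],\; c[i-1, j]\bigr) & \text{if } i, j > 0 \text{ and } x_i \neq y_j.
\end{cases}
```

The second line corresponds to case 1 of the theorem (one subproblem), and the third line to cases 2 and 3 (two subproblems, keep the larger).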
Step 3: Computing the length of an LCS
• Because the LCS problem has only Θ(mn) distinct subproblems,
we can use dynamic programming to compute the solutions
bottom up.
• We maintain two 2D tables c and b for dynamic programming
• The c table stores the LCS lengths: c[i, j] is the length of an LCS of Xi and Yj
• The b table records which subproblem produced each c[i, j], which helps to construct the solution
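A bottom-up table fill along these lines might look like the sketch below. It is an illustrative Python rendering, not the LCS-LENGTH pseudocode itself: the arrow strings stored in `b` and the tie-breaking rule (prefer "up" when both neighbours are equal) are assumptions made for this example.

```python
from typing import List, Optional, Tuple


def lcs_length(x: str, y: str) -> Tuple[List[List[int]], List[List[Optional[str]]]]:
    """Fill the c (lengths) and b (directions) tables row by row."""
    m, n = len(x), len(y)
    # Row 0 and column 0 stay 0: an empty prefix has an LCS of length 0.
    c = [[0] * (n + 1) for _ in range(m + 1)]
    b: List[List[Optional[str]]] = [[None] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
                b[i][j] = "↖"  # x_i = y_j belongs to the LCS
            elif c[i - 1][j] >= c[i][j - 1]:
                c[i][j] = c[i - 1][j]
                b[i][j] = "↑"  # answer comes from X_{i-1}, Y_j
            else:
                c[i][j] = c[i][j - 1]
                b[i][j] = "←"  # answer comes from X_i, Y_{j-1}
    return c, b


c, b = lcs_length("ABCBDAB", "BDCABA")
lcs_len = c[7][6]  # 4
```

Each of the (m + 1)(n + 1) entries is computed once in constant time, so the fill takes Θ(mn) time and space.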
Step 4: Constructing an LCS
• The b table returned by LCS-LENGTH enables us to quickly
construct an LCS for X = <x1, x2,...,xm> and Y = <y1, y2,...,yn>
• We simply begin at b[m, n] and trace through the table by
following the arrows
• Whenever we encounter a "↖" in entry b[i, j], it implies that xi =
yj is an element of the LCS that LCS-LENGTH found.
• With this method, we encounter the elements of this LCS in
reverse order.
• The following recursive procedure prints out an LCS of X and Y
in the proper, forward order
• The initial call is PRINT-LCS(b, X, X.length, Y.length)
• For the b table in Figure 15.8, this procedure prints BCBA
• The procedure takes time O(m + n), since it decrements at least
one of i and j in each recursive call
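The trace through the b table can be sketched in Python as follows. This is a hedged illustration rather than the PRINT-LCS pseudocode: it collects characters into a list instead of printing, and it rebuilds the b table with an assumed tie-breaking rule (prefer "↑" on ties). The names `build_b` and `collect_lcs` are hypothetical.

```python
from typing import List, Optional


def build_b(x: str, y: str) -> List[List[Optional[str]]]:
    """Compute the direction table b (assumed convention: ties go up)."""
    m, n = len(x), len(y)
    c = [[0] * (n + 1) for _ in range(m + 1)]
    b: List[List[Optional[str]]] = [[None] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                c[i][j], b[i][j] = c[i - 1][j - 1] + 1, "↖"
            elif c[i - 1][j] >= c[i][j - 1]:
                c[i][j], b[i][j] = c[i - 1][j], "↑"
            else:
                c[i][j], b[i][j] = c[i][j - 1], "←"
    return b


def collect_lcs(b, x: str, i: int, j: int, out: List[str]) -> None:
    """Follow the arrows from b[i][j]; append after recursing to get forward order."""
    if i == 0 or j == 0:
        return
    if b[i][j] == "↖":
        collect_lcs(b, x, i - 1, j - 1, out)
        out.append(x[i - 1])  # emitted after the recursive call, so order is forward
    elif b[i][j] == "↑":
        collect_lcs(b, x, i - 1, j, out)
    else:
        collect_lcs(b, x, i, j - 1, out)


X, Y = "ABCBDAB", "BDCABA"
chars: List[str] = []
collect_lcs(build_b(X, Y), X, len(X), len(Y), chars)
lcs = "".join(chars)  # "BCBA"
```

The recursion depth is at most m + n, which is fine for short inputs; for long sequences an iterative loop would avoid Python's recursion limit.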
Improving the code
• Once you have developed an algorithm, you will often find that
you can improve on the time or space it uses
• Some changes can simplify the code and improve constant
factors but otherwise yield no asymptotic improvement in
performance.
• Others can yield substantial asymptotic savings in time and
space.
• In the LCS algorithm, for example, we can eliminate the b table
altogether. Each c[i, j] entry depends on only three other c table
entries: c[i-1, j-1], c[i-1, j], and c[i, j-1].
• Given the value of c[i, j], we can determine in O(1) time which
of these three values was used to compute c[i,j], without
inspecting table b.
• Thus, we can reconstruct an LCS in O(m + n) time using a
procedure similar to PRINT-LCS.
• Although we save Θ(mn) space by this method, the auxiliary
space requirement for computing an LCS does not
asymptotically decrease, since we need Θ(mn) space for the c
table anyway.
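A reconstruction that consults only the c table could look like the sketch below. It is one possible way to do it, assuming the same tie-breaking as before (move up when c[i-1][j] >= c[i][j-1]); the function name `lcs_without_b` is hypothetical.

```python
def lcs_without_b(x: str, y: str) -> str:
    """Compute the c table, then walk back from c[m][n] without a b table."""
    m, n = len(x), len(y)
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])

    # Each step decides in O(1) which neighbour produced c[i][j].
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            out.append(x[i - 1])  # diagonal step: this character is in the LCS
            i, j = i - 1, j - 1
        elif c[i - 1][j] >= c[i][j - 1]:
            i -= 1  # value came from the row above
        else:
            j -= 1  # value came from the column to the left
    return "".join(reversed(out))  # characters were collected back to front


lcs = lcs_without_b("ABCBDAB", "BDCABA")  # "BCBA"
```

The walk decrements i or j at every step, so it takes O(m + n) time after the Θ(mn) table fill.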
• We can, however, reduce the asymptotic space requirements for
LCS-LENGTH, since it needs only two rows of table c at a
time: the row being computed and the previous row.
• This improvement works if we need only the length of an LCS;
if we need to reconstruct the elements of an LCS, the smaller
table does not keep enough information to retrace our steps in
O(m + n) time.
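The two-row idea can be sketched as follows. This is a minimal illustration that returns only the length of an LCS; the function name `lcs_length_two_rows` is an assumption.

```python
def lcs_length_two_rows(x: str, y: str) -> int:
    """LCS length using only the previous row and the row being computed."""
    n = len(y)
    prev = [0] * (n + 1)  # row i-1 of the c table; starts as row 0 (all zeros)
    for xi in x:
        curr = [0]  # column 0 of every row is 0
        for j in range(1, n + 1):
            if xi == y[j - 1]:
                curr.append(prev[j - 1] + 1)  # uses c[i-1][j-1]
            else:
                curr.append(max(prev[j], curr[j - 1]))  # c[i-1][j] vs c[i][j-1]
        prev = curr  # the finished row becomes the previous row
    return prev[n]


length = lcs_length_two_rows("ABCBDAB", "BDCABA")  # 4
```

Only two lists of n + 1 entries are alive at any time, so the extra space is Θ(n) instead of Θ(mn), at the cost of losing the information needed to retrace the LCS itself.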