LONGEST COMMON SUBSEQUENCE
Submitted to:
Dr. V.K.Pathak
C.S.E. Deptt.
HBTI-Kanpur
Submitted by :
Shweta Singhal
3rd C.S.E.
127/07
Definition
Character
Alphabet: A set of characters
String (or sequence): A list of characters from an alphabet
e.g. strings over {0,1}: binary strings
e.g. strings over {A,C,G,T}: DNA sequences
Substring
CBD is a substring of ABCBDAB
Subsequence
BCDB is a subsequence of ABCBDAB
Common subsequence
BCA is a common subsequence of
X=ABCBDAB and Y=BDCABA
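The difference between the two definitions above can be checked mechanically. The following is a minimal sketch in Python; the helper names `is_substring` and `is_subsequence` are illustrative and not part of the original slides.

```python
def is_substring(s: str, t: str) -> bool:
    """True if s occurs in t as a run of consecutive characters."""
    return s in t


def is_subsequence(s: str, t: str) -> bool:
    """True if the characters of s occur in t in order, gaps allowed."""
    it = iter(t)
    # `ch in it` advances the iterator until it finds ch (or runs out),
    # so each character must be found after the previous match.
    return all(ch in it for ch in s)


print(is_substring("CBD", "ABCBDAB"))     # True
print(is_subsequence("BCDB", "ABCBDAB"))  # True
print(is_substring("BCDB", "ABCBDAB"))    # False: the letters are not consecutive
# BCA is a common subsequence of X and Y
print(is_subsequence("BCA", "ABCBDAB") and is_subsequence("BCA", "BDCABA"))  # True
```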
USE OF LCS
In biological applications, we may want to compare the DNA
of two organisms. A strand of DNA consists of a string of
molecules called bases, where the possible bases are
adenine, guanine, cytosine, and thymine. We represent each
of these bases by its initial letter.
The DNA of one organism might be:
S1=ACCGGTCGAGTGCGCGGAAGCGGCCGAA
and another might be:
S2=GTCGTTCGGAATGCCGTTGCTCTGTAAA
One goal is to determine how “similar” these strands are.
One definition says that two DNA strands are similar if one
is a substring of the other. In our case, neither S1 nor S2
is a substring of the other, so we say two strands are similar
if we can find a third strand S3 whose bases appear in
each of the first two strands, in the same order but not
necessarily consecutively.
The longer the strand S3 we can find that
appears in both S1
and S2, the more similar S1 and S2 are.
Dynamic programming
The ith prefix Xi of X is Xi = <x1, x2, …, xi>.
If X = <A, B, C, B, D, A, B>, then
X4 = <A, B, C, B>
X0 = <>
Longest common subsequence (LCS)
BCBA is a longest common subsequence of X and Y (an LCS need not be unique)
X=ABCBDAB
Y=BDCABA
LCS problem
Given two sequences X = <x1, x2, …, xm> and Y = <y1, y2, …, yn>,
find an LCS of X and Y.
How to Solve LCS Quickly
If X and Y are each 1 character long, the LCS length is 0 or 1
X:  a   a
Y:  b   a
If we then add 1 character to each of X and Y, the LCS length
increases by at most 1
X:  ab  ab
Y:  bd  ad
Note that we do not need to rescan the first
character
Longest Common Subsequence (LCS)
Define Xi, Yj to be the prefixes of X and Y of length i and j; m = |X|, n = |Y|
We store the length of LCS(Xi, Yj) in c[i,j]
Trivial cases: LCS(X0, Yj) and LCS(Xi, Y0) are empty (so c[0,j] = c[i,0] = 0)
Recursive formula for c[i,j]:
\[
c[i,j] =
\begin{cases}
c[i-1,j-1] + 1 & \text{if } x_i = y_j, \\
\max(c[i,j-1],\; c[i-1,j]) & \text{otherwise}
\end{cases}
\]
c[m,n] is the final solution
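The recursive formula can be turned into code almost line for line. The sketch below is one possible top-down version in Python; the function name `lcs_length_recursive` is made up for illustration, and memoization via `functools.lru_cache` is an assumption added so that each c[i,j] is computed only once. It is meant for short inputs, since the recursion depth grows with m + n.

```python
from functools import lru_cache


def lcs_length_recursive(x: str, y: str) -> int:
    """Length of an LCS of x and y, computed top-down from the recurrence."""

    @lru_cache(maxsize=None)
    def c(i: int, j: int) -> int:
        # c(i, j) is the LCS length of the prefixes x[:i] and y[:j].
        if i == 0 or j == 0:
            return 0                      # trivial case: an empty prefix
        if x[i - 1] == y[j - 1]:          # x_i == y_j (Python strings are 0-based)
            return c(i - 1, j - 1) + 1
        return max(c(i, j - 1), c(i - 1, j))

    return c(len(x), len(y))              # c[m,n] is the final solution


print(lcs_length_recursive("ABCBDAB", "BDCABA"))  # 4
```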
Optimal substructure
Let X = <x1, x2, ..., xm> and Y = <y1, y2, ..., yn> be the
sequences, and let Z = <z1, z2, ..., zk> be any LCS of X
and Y.
1. If xm = yn,
then zk = xm = yn and Zk-1 is an LCS of Xm-1 and Yn-1.
2. If xm ≠ yn,
then zk ≠ xm implies Z is an LCS of Xm-1 and Y.
3. If xm ≠ yn,
then zk ≠ yn implies Z is an LCS of X and Yn-1.
Brute force approach
Enumerate all subsequences of X, check whether each one
is also a subsequence of Y, and keep the longest one.
Infeasible!
The number of subsequences of X is 2^m.
Longest common subsequence
Can we use a brute-force approach?
Brute-force algorithm:
1. For every subsequence of x, check to see if it’s a
subsequence of y.
2. Worst-case running time: Θ(n·2^m), because
we have 2^m subsequences of x to check and each
check takes Θ(n) time: scan y for the first
element, scan from there for the next element, and so on.
Clearly, to optimize this, we would let m ≤ n
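To make the cost of the brute-force approach concrete, here is a minimal Python sketch of it. The names `lcs_brute_force` and `is_subsequence` are illustrative, and trying candidates from longest to shortest is a convenience assumed here so the search can stop at the first hit; in the worst case it still examines all 2^m index subsets of x.

```python
from itertools import combinations


def is_subsequence(s: str, t: str) -> bool:
    """Theta(len(t)) scan: find each character of s after the previous match."""
    it = iter(t)
    return all(ch in it for ch in s)


def lcs_brute_force(x: str, y: str) -> str:
    """Try subsequences of x from longest to shortest; return the first
    one that is also a subsequence of y."""
    for k in range(len(x), 0, -1):
        # every choice of k positions of x, in increasing order
        for positions in combinations(range(len(x)), k):
            candidate = "".join(x[i] for i in positions)
            if is_subsequence(candidate, y):
                return candidate
    return ""  # no common character at all


print(lcs_brute_force("ABCB", "BDCAB"))  # BCB
```

This is only practical for very short x, which is the point of the slide.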
Longest Common Subsequence
Problem: Given 2 sequences, X = 〈x1,...,xm〉 and Y = 〈y1,...,yn〉,
find a common subsequence whose length is maximum.
springtime   ncaa tournament   basketball
printing     north carolina    snoeyink
A subsequence need not be consecutive, but must be in order.
c[i, j]: The length of an LCS of the sequences Xi and Yj.
If either i = 0 or j = 0, one of the sequences is empty, so the LCS has length 0.
\[
c[i,j] =
\begin{cases}
0 & \text{if } i = 0 \text{ or } j = 0, \\
c[i-1,j-1] + 1 & \text{if } i, j > 0 \text{ and } x_i = y_j, \\
\max(c[i,j-1],\; c[i-1,j]) & \text{if } i, j > 0 \text{ and } x_i \neq y_j.
\end{cases}
\]
LCS Length Algorithm
LCS-Length(X, Y)
m = length(X)                    // get the # of symbols in X
n = length(Y)                    // get the # of symbols in Y
for i = 0 to m  c[i,0] = 0       // special case: Y0 (starts at 0 so c[0,0] is set)
for j = 0 to n  c[0,j] = 0       // special case: X0
for i = 1 to m                   // for all Xi
    for j = 1 to n               // for all Yj
        if ( xi == yj )
            c[i,j] = c[i-1,j-1] + 1
        else c[i,j] = max( c[i-1,j], c[i,j-1] )
return c
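The pseudocode above translates directly into a runnable program. The following Python sketch is one such translation, under the assumption that X and Y are given as strings; the function name `lcs_length` is chosen here for illustration. Python strings are 0-based, so xi is `x[i - 1]`.

```python
def lcs_length(x: str, y: str) -> list:
    """Return the (m+1) x (n+1) table c, where c[i][j] is the length of
    an LCS of the prefixes x[:i] and y[:j]."""
    m, n = len(x), len(y)
    # Row 0 and column 0 are the special cases X0 and Y0 (all zeros).
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # for all Xi
        for j in range(1, n + 1):      # for all Yj
            if x[i - 1] == y[j - 1]:   # xi == yj
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
    return c


table = lcs_length("ABCB", "BDCAB")
for row in table:
    print(row)
print(table[4][5])  # 3
```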
LCS Example
We’ll see how the LCS algorithm works on the following example:
X = ABCB
Y = BDCAB
What is the Longest Common Subsequence
of X and Y?
LCS(X, Y) = BCB
LCS Example (0)
j      0  1  2  3  4  5
i  Yj     B  D  C  A  B
0  Xi
1  A
2  B
3  C
4  B
X = ABCB; m = |X| = 4
Y = BDCAB; n = |Y| = 5
Allocate array c[0..4, 0..5] (5 rows, 6 columns)
LCS Example (1)
j      0  1  2  3  4  5
i  Yj     B  D  C  A  B
0  Xi  0  0  0  0  0  0
1  A   0
2  B   0
3  C   0
4  B   0
for i = 0 to m  c[i,0] = 0
for j = 0 to n  c[0,j] = 0
LCS Example (2)
j      0  1  2  3  4  5
i  Yj     B  D  C  A  B
0  Xi  0  0  0  0  0  0
1  A   0  0
2  B   0
3  C   0
4  B   0
if ( xi == yj )
    c[i,j] = c[i-1,j-1] + 1
else c[i,j] = max( c[i-1,j], c[i,j-1] )
LCS Example (3)
j      0  1  2  3  4  5
i  Yj     B  D  C  A  B
0  Xi  0  0  0  0  0  0
1  A   0  0  0  0
2  B   0
3  C   0
4  B   0
if ( xi == yj )
    c[i,j] = c[i-1,j-1] + 1
else c[i,j] = max( c[i-1,j], c[i,j-1] )
LCS Example (4)
j      0  1  2  3  4  5
i  Yj     B  D  C  A  B
0  Xi  0  0  0  0  0  0
1  A   0  0  0  0  1
2  B   0
3  C   0
4  B   0
if ( xi == yj )
    c[i,j] = c[i-1,j-1] + 1
else c[i,j] = max( c[i-1,j], c[i,j-1] )
LCS Example (5)
j      0  1  2  3  4  5
i  Yj     B  D  C  A  B
0  Xi  0  0  0  0  0  0
1  A   0  0  0  0  1  1
2  B   0
3  C   0
4  B   0
if ( xi == yj )
    c[i,j] = c[i-1,j-1] + 1
else c[i,j] = max( c[i-1,j], c[i,j-1] )
LCS Example (6)
j      0  1  2  3  4  5
i  Yj     B  D  C  A  B
0  Xi  0  0  0  0  0  0
1  A   0  0  0  0  1  1
2  B   0  1
3  C   0
4  B   0
if ( xi == yj )
    c[i,j] = c[i-1,j-1] + 1
else c[i,j] = max( c[i-1,j], c[i,j-1] )
LCS Example (7)
j      0  1  2  3  4  5
i  Yj     B  D  C  A  B
0  Xi  0  0  0  0  0  0
1  A   0  0  0  0  1  1
2  B   0  1  1  1  1
3  C   0
4  B   0
if ( xi == yj )
    c[i,j] = c[i-1,j-1] + 1
else c[i,j] = max( c[i-1,j], c[i,j-1] )
LCS Example (8)
j      0  1  2  3  4  5
i  Yj     B  D  C  A  B
0  Xi  0  0  0  0  0  0
1  A   0  0  0  0  1  1
2  B   0  1  1  1  1  2
3  C   0
4  B   0
if ( xi == yj )
    c[i,j] = c[i-1,j-1] + 1
else c[i,j] = max( c[i-1,j], c[i,j-1] )
LCS Example (10)
j      0  1  2  3  4  5
i  Yj     B  D  C  A  B
0  Xi  0  0  0  0  0  0
1  A   0  0  0  0  1  1
2  B   0  1  1  1  1  2
3  C   0  1  1
4  B   0
if ( xi == yj )
    c[i,j] = c[i-1,j-1] + 1
else c[i,j] = max( c[i-1,j], c[i,j-1] )
LCS Example (11)
j      0  1  2  3  4  5
i  Yj     B  D  C  A  B
0  Xi  0  0  0  0  0  0
1  A   0  0  0  0  1  1
2  B   0  1  1  1  1  2
3  C   0  1  1  2
4  B   0
if ( xi == yj )
    c[i,j] = c[i-1,j-1] + 1
else c[i,j] = max( c[i-1,j], c[i,j-1] )
LCS Example (12)
j      0  1  2  3  4  5
i  Yj     B  D  C  A  B
0  Xi  0  0  0  0  0  0
1  A   0  0  0  0  1  1
2  B   0  1  1  1  1  2
3  C   0  1  1  2  2  2
4  B   0
if ( xi == yj )
    c[i,j] = c[i-1,j-1] + 1
else c[i,j] = max( c[i-1,j], c[i,j-1] )
LCS Example (13)
j      0  1  2  3  4  5
i  Yj     B  D  C  A  B
0  Xi  0  0  0  0  0  0
1  A   0  0  0  0  1  1
2  B   0  1  1  1  1  2
3  C   0  1  1  2  2  2
4  B   0  1
if ( xi == yj )
    c[i,j] = c[i-1,j-1] + 1
else c[i,j] = max( c[i-1,j], c[i,j-1] )
LCS Example (14)
j      0  1  2  3  4  5
i  Yj     B  D  C  A  B
0  Xi  0  0  0  0  0  0
1  A   0  0  0  0  1  1
2  B   0  1  1  1  1  2
3  C   0  1  1  2  2  2
4  B   0  1  1  2  2
if ( xi == yj )
    c[i,j] = c[i-1,j-1] + 1
else c[i,j] = max( c[i-1,j], c[i,j-1] )
LCS Example (15)
j      0  1  2  3  4  5
i  Yj     B  D  C  A  B
0  Xi  0  0  0  0  0  0
1  A   0  0  0  0  1  1
2  B   0  1  1  1  1  2
3  C   0  1  1  2  2  2
4  B   0  1  1  2  2  3
if ( xi == yj )
    c[i,j] = c[i-1,j-1] + 1
else c[i,j] = max( c[i-1,j], c[i,j-1] )
LCS Algorithm Running Time
The LCS algorithm calculates the value of each entry of the array
c[0..m, 0..n]
So what is the running time?
O(m*n)
since each c[i,j] is calculated in
constant time, and there are (m+1)*(n+1)
elements in the array
How to find actual LCS
So far, we have found only the length of an LCS, not the LCS itself.
We want to modify this algorithm so that it outputs a Longest
Common Subsequence of X and Y
Each c[i,j] depends on c[i-1,j] and c[i,j-1],
or on c[i-1,j-1]
For each c[i,j] we can say how it was acquired.
For example, here c[i,j] = c[i-1,j-1] + 1 = 2 + 1 = 3:
2  2
2  3
How to find actual LCS - continued
Remember that
\[
c[i,j] =
\begin{cases}
c[i-1,j-1] + 1 & \text{if } x_i = y_j, \\
\max(c[i,j-1],\; c[i-1,j]) & \text{otherwise}
\end{cases}
\]
■ So we can start from c[m,n] and go backwards
■ Whenever xi = yj (so that c[i,j] = c[i-1,j-1] + 1), remember xi
(because xi is a part of the LCS) and move to c[i-1,j-1]
■ Otherwise, move to whichever of c[i-1,j] and c[i,j-1] is larger
■ When i=0 or j=0 (i.e. we reached the beginning), output the
remembered letters in reverse order
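The backward walk described above can be sketched as follows. This is a minimal Python illustration, not a definitive implementation: the names `lcs_length` and `lcs` are chosen here, and moving up when c[i-1,j] and c[i,j-1] are equal is an arbitrary tie-breaking assumption (a different choice may return a different LCS of the same length).

```python
def lcs_length(x: str, y: str) -> list:
    """Fill the table c bottom-up; c[i][j] = LCS length of x[:i] and y[:j]."""
    m, n = len(x), len(y)
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
    return c


def lcs(x: str, y: str) -> str:
    """Walk back from c[m][n], remembering a letter on each diagonal step."""
    c = lcs_length(x, y)
    i, j = len(x), len(y)
    remembered = []
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:          # xi is part of the LCS
            remembered.append(x[i - 1])
            i, j = i - 1, j - 1
        elif c[i - 1][j] >= c[i][j - 1]:  # tie-break: prefer moving up
            i -= 1
        else:
            j -= 1
    return "".join(reversed(remembered))  # letters were collected back to front


print(lcs("ABCB", "BDCAB"))  # BCB
```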
Finding LCS
j      0  1  2  3  4  5
i  Yj     B  D  C  A  B
0  Xi  0  0  0  0  0  0
1  A   0  0  0  0  1  1
2  B   0  1  1  1  1  2
3  C   0  1  1  2  2  2
4  B   0  1  1  2  2  3
Finding LCS (2)
j      0  1  2  3  4  5
i  Yj     B  D  C  A  B
0  Xi  0  0  0  0  0  0
1  A   0  0  0  0  1  1
2  B   0  1  1  1  1  2
3  C   0  1  1  2  2  2
4  B   0  1  1  2  2  3
LCS (reversed order): B C B
LCS (straight order): B C B
(this string turned out to be a palindrome)
Another example
X = <A, B, C, B, D, A, B> and Y = <B, D, C, A, B, A>
j      0  1  2  3  4  5  6
i  yj     B  D  C  A  B  A
0  xi  0  0  0  0  0  0  0
1  A   0  0  0  0  1  1  1
2  B   0  1  1  1  1  2  2
3  C   0  1  1  2  2  2  2
4  B   0  1  1  2  2  3  3
5  D   0  1  2  2  2  3  3
6  A   0  1  2  2  3  3  4
7  B   0  1  2  2  3  4  4
PRINT-LCS(b,X,i,j)
Here b[i,j] records which entry c[i,j] was computed from:
“↖” for c[i-1,j-1], “↑” for c[i-1,j], “←” for c[i,j-1].
if i = 0 or j = 0
    then return
if b[i,j] = “↖”
    then PRINT-LCS(b,X,i-1,j-1)
         print xi
else if b[i,j] = “↑”
    then PRINT-LCS(b,X,i-1,j)
else PRINT-LCS(b,X,i,j-1)
Note: This procedure takes O(m+n) time, since at least one
of i and j is decremented in each stage of the
recursion.
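For readers who want to run PRINT-LCS, here is a hedged Python sketch. It assumes the table b is filled alongside c while the lengths are computed, and that ties between c[i-1,j] and c[i,j-1] are recorded as “↑”; the slides do not show how b is built, so the helper `lcs_tables` is an assumption. `print_lcs` collects letters in a list instead of printing them so the result is easy to check.

```python
def lcs_tables(x: str, y: str):
    """Return (c, b): c holds LCS lengths, b holds the arrow for each entry."""
    m, n = len(x), len(y)
    c = [[0] * (n + 1) for _ in range(m + 1)]
    b = [[""] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                c[i][j], b[i][j] = c[i - 1][j - 1] + 1, "↖"
            elif c[i - 1][j] >= c[i][j - 1]:   # assumed tie-break: up
                c[i][j], b[i][j] = c[i - 1][j], "↑"
            else:
                c[i][j], b[i][j] = c[i][j - 1], "←"
    return c, b


def print_lcs(b, x, i, j, out):
    """Recursive walk over b; appends the LCS letters to out in order."""
    if i == 0 or j == 0:
        return
    if b[i][j] == "↖":
        print_lcs(b, x, i - 1, j - 1, out)
        out.append(x[i - 1])                   # xi (0-based index i - 1)
    elif b[i][j] == "↑":
        print_lcs(b, x, i - 1, j, out)
    else:
        print_lcs(b, x, i, j - 1, out)


X, Y = "ABCBDAB", "BDCABA"
c, b = lcs_tables(X, Y)
letters = []
print_lcs(b, X, len(X), len(Y), letters)
print("".join(letters), c[len(X)][len(Y)])  # BCBA 4
```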
Longest Common Subsequence (LCS)
Need a separate data structure (the table) to retrieve the
answer
The algorithm runs in O(m*n)
Brute-force algorithm: O(n·2^m)