Sequence Alignment: Lecture 2, Thursday April 3, 2003
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
 || |||||||  |||| |  || |||  |||||
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
(Figure: DP matrix with x1 … xM along one axis and y1 … yN along the other.)
Every nondecreasing path through the matrix corresponds to an alignment of the two sequences.
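The correspondence between paths and alignments can be sketched in a few lines of Python. This is only an illustration: the move encoding ("D", "X", "Y") and the function name `path_to_alignment` are hypothetical and do not come from the slides.

```python
def path_to_alignment(x, y, moves):
    """Turn a monotone path through the DP grid into an alignment.

    moves is a string over a hypothetical alphabet:
      "D" = diagonal step: one letter of x against one letter of y
      "X" = step along x only: a letter of x against a gap
      "Y" = step along y only: a letter of y against a gap
    """
    i = j = 0
    top, bottom = [], []
    for mv in moves:
        if mv == "D":
            top.append(x[i])
            bottom.append(y[j])
            i += 1
            j += 1
        elif mv == "X":
            top.append(x[i])
            bottom.append("-")
            i += 1
        elif mv == "Y":
            top.append("-")
            bottom.append(y[j])
            j += 1
        else:
            raise ValueError("unknown move: " + mv)
    if (i, j) != (len(x), len(y)):
        raise ValueError("path must end at (M, N)")
    return "".join(top), "".join(bottom)


# The path diagonal, x-only, diagonal, diagonal aligns AGTA with A-TA.
print(path_to_alignment("AGTA", "ATA", "DXDD"))
```

Because every step advances in x, in y, or in both, the path never moves backwards, which is exactly the nondecreasing property.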
Example: x = AGTA, y = ATA
Scoring: match m = 1, mismatch s = -1, gap penalty d = 1 (each gap position subtracts d)

F(i,j)    i=0    1    2    3    4
                 A    G    T    A
j=0         0   -1   -2   -3   -4
  1   A    -1    1    0   -1   -2
  2   T    -2    0    0    1    0
  3   A    -3   -1   -1    0    2

Optimal Alignment: F(4,3) = 2
AGTA
A-TA
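1. Initialization.
   a. F(0, 0) = 0
   b. F(0, j) = -j d
   c. F(i, 0) = -i d
2. Main iteration. For each i = 1…M, for each j = 1…N:
                     F(i - 1, j) - d                [case 1]
   F(i, j) = max     F(i, j - 1) - d                [case 2]
                     F(i - 1, j - 1) + s(xi, yj)    [case 3]

                UP,    if [case 1]
   Ptr(i, j) =  LEFT,  if [case 2]
                DIAG,  if [case 3]
3. Termination. F(M, N) is the optimal score, and the optimal alignment
   can be traced back from Ptr(M, N)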
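A minimal Python sketch of the three steps (initialization, iteration with back-pointers, termination with traceback) follows. It assumes a match score m, a mismatch score s, and a gap penalty d that is subtracted per gap position. The function name and the pointer labels ("DIAG", "GAP_IN_Y", "GAP_IN_X") are illustrative choices, not the slides' notation.

```python
def needleman_wunsch(x, y, m=1, s=-1, d=1):
    """Global alignment; returns (score, aligned_x, aligned_y)."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    ptr = [[None] * (N + 1) for _ in range(M + 1)]

    # 1. Initialization: leading gaps cost d each.
    for i in range(1, M + 1):
        F[i][0] = -i * d
        ptr[i][0] = "GAP_IN_Y"
    for j in range(1, N + 1):
        F[0][j] = -j * d
        ptr[0][j] = "GAP_IN_X"

    # 2. Iteration: best of the three ways to end the alignment.
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else s
            candidates = [
                (F[i - 1][j - 1] + sub, "DIAG"),
                (F[i - 1][j] - d, "GAP_IN_Y"),   # x[i] against a gap
                (F[i][j - 1] - d, "GAP_IN_X"),   # y[j] against a gap
            ]
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])

    # 3. Termination: follow the pointers back from (M, N).
    ax, ay = [], []
    i, j = M, N
    while i > 0 or j > 0:
        p = ptr[i][j]
        if p == "DIAG":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif p == "GAP_IN_Y":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[M][N], "".join(reversed(ax)), "".join(reversed(ay))


print(needleman_wunsch("AGTA", "ATA"))
```

On the example x = AGTA, y = ATA this reproduces the score 2 and the alignment AGTA / A-TA. Time and space are both O(MN).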
Changes (relative to Needleman-Wunsch, so that gaps at the ends of x or y are not penalized):
1. Initialization
   For all i, j,
     F(i, 0) = 0
     F(0, j) = 0
2. Termination
   FOPT = max { max over i of F(i, N),  max over j of F(M, j) }
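The two changes can be sketched as follows, assuming the same linear scoring as before (match m, mismatch s, gap penalty d). The function name `overlap_score` is a hypothetical label for this variant.

```python
def overlap_score(x, y, m=1, s=-1, d=1):
    """Best score of an alignment whose end gaps are free."""
    M, N = len(x), len(y)
    # Change 1: first row and first column stay at 0.
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else s
            F[i][j] = max(F[i - 1][j - 1] + sub,
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
    # Change 2: take the best cell on the last row or last column.
    best_last_col = max(F[i][N] for i in range(M + 1))
    best_last_row = max(F[M][j] for j in range(N + 1))
    return max(best_last_col, best_last_row)


# The suffix AGT of x overlaps the prefix AGT of y.
print(overlap_score("CCCAGT", "AGTGGG"))
```

Only the boundary handling differs from the global algorithm; the inner recurrence is unchanged.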
• Local alignment
• Linear-Space Alignment
DNA → transcription → pre-mRNA → splicing → mature mRNA → translation → protein
Human Genome: 3×10^9 bp
  ~30,000 genes
  ~200,000 exons
  ~23 Mb coding
  ~15 Mb noncoding
Short sequences regulate expression of genes:
  gene D (regulatory sites A, B, C): Make D; If B then NOT D; If A and B then D
  gene B (regulatory sites D, C): Make B; If D then B
Lots of “junk” sequence, e.g. ~50% repeats (selfish DNA)
Cross-species genome similarity
e.g. x = aaaacccccgggg
y = cccgggaaccaacc
Modifications to Needleman-Wunsch:
Iteration:
                     0
   F(i, j) = max     F(i - 1, j) - d
                     F(i, j - 1) - d
                     F(i - 1, j - 1) + s(xi, yj)
Termination: the best local alignment score is the maximum of F(i, j) over all i, j
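A short sketch of the local-alignment recurrence, under the assumption of match m = 1, mismatch s = -1 and gap penalty d = 1. The first row and column are taken to be 0, and the answer is read from the best cell anywhere in the matrix, which is the standard Smith-Waterman convention. The function name is illustrative.

```python
def local_alignment_score(x, y, m=1, s=-1, d=1):
    """Best local alignment score: the recurrence with the extra option 0."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    best = 0
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else s
            F[i][j] = max(0,                     # start a new local alignment
                          F[i - 1][j] - d,
                          F[i][j - 1] - d,
                          F[i - 1][j - 1] + sub)
            best = max(best, F[i][j])
    return best


# The shared region cccggg gives six matches and no gaps.
print(local_alignment_score("aaaacccccgggg", "cccgggaaccaacc"))
```

The option 0 lets an alignment restart wherever the running score would drop below zero, which is what makes the result local.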
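Current model: a gap of length n incurs penalty n × d
General gap penalty function γ(n), where each additional gap position costs no more than the previous one:
   for all n, γ(n + 1) - γ(n) ≤ γ(n) - γ(n - 1)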
Initialization: same
Iteration:
                     F(i - 1, j - 1) + s(xi, yj)
   F(i, j) = max     max over k = 0…i-1 of  F(k, j) - γ(i - k)
                     max over k = 0…j-1 of  F(i, k) - γ(j - k)
Termination: same
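The recurrence with an arbitrary gap function can be sketched as below. The gap function is passed in as a Python callable named `gamma` (an assumed name), and the boundary is taken to be F(i, 0) = -gamma(i), F(0, j) = -gamma(j), which is one reasonable reading of "same" initialization.

```python
def general_gap_score(x, y, gamma, m=1, s=-1):
    """Global alignment score where a gap of length n costs gamma(n)."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        F[i][0] = -gamma(i)
    for j in range(1, N + 1):
        F[0][j] = -gamma(j)
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else s
            best = F[i - 1][j - 1] + sub
            # A gap of length i - k ending at x[i].
            for k in range(i):
                best = max(best, F[k][j] - gamma(i - k))
            # A gap of length j - k ending at y[j].
            for k in range(j):
                best = max(best, F[i][k] - gamma(j - k))
            F[i][j] = best
    return F[M][N]


# With gamma(n) = n this reduces to the linear-gap score.
print(general_gap_score("AGTA", "ATA", lambda n: n))
# A gap function that charges 3 to open and 1 per extra position.
print(general_gap_score("ACGT", "AT", lambda n: 3 + (n - 1)))
```

Each cell now scans a whole row and a whole column, so the running time grows to O(MN(M + N)).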
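(d = penalty for opening a gap, e = penalty for extending it;
 F(i, j) = best score with xi aligned to yj, G(i, j) = best score with xi or yj aligned to a gap)
Iteration:
   F(i, j) = max     F(i - 1, j - 1) + s(xi, yj)
                     G(i - 1, j - 1) + s(xi, yj)

                     F(i - 1, j) - d
   G(i, j) = max     F(i, j - 1) - d
                     G(i, j - 1) - e
                     G(i - 1, j) - e
Termination: same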
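A sketch of the two-matrix recurrence follows. The boundary values are an assumption, since the slide text does not show them: a leading gap of length n is charged d + (n - 1) × e, and cells that cannot occur are set to minus infinity. The final score is taken as the better of F(M, N) and G(M, N).

```python
def affine_gap_score(x, y, m=1, s=-1, d=3, e=1):
    """Global score with gap-open penalty d and gap-extend penalty e."""
    NEG = float("-inf")
    M, N = len(x), len(y)
    F = [[NEG] * (N + 1) for _ in range(M + 1)]  # x[i] aligned to y[j]
    G = [[NEG] * (N + 1) for _ in range(M + 1)]  # alignment ends in a gap
    F[0][0] = 0
    for i in range(1, M + 1):
        G[i][0] = -(d + (i - 1) * e)   # assumed boundary: one long gap
    for j in range(1, N + 1):
        G[0][j] = -(d + (j - 1) * e)
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else s
            F[i][j] = max(F[i - 1][j - 1], G[i - 1][j - 1]) + sub
            G[i][j] = max(F[i - 1][j] - d,   # open a gap
                          F[i][j - 1] - d,
                          G[i - 1][j] - e,   # extend a gap
                          G[i][j - 1] - e)
    return max(F[M][N], G[M][N])


# One gap of length 2 costs 3 + 1 = 4, so A--T against ACGT scores 2 - 4.
print(affine_gap_score("ACGT", "AT", d=3, e=1))
```

With e = d the two matrices collapse to the linear-gap recurrence, which is a convenient sanity check. The cost stays O(MN), unlike the general gap function.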
Assume the optimal alignment contains fewer than k(N) gaps.
Then, xi aligned to yj implies | i - j | < k(N)
Iteration:
   For i = 1…M
      For j = max(1, i - k)…min(N, i + k)
         F(i, j) = … (same recurrence, using only cells inside the band)
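The banded loop can be sketched as below. For readability this version still allocates the full matrix and marks cells outside the band as minus infinity; a space-efficient version would store only the 2k + 1 band cells per row. The function name and the choice to raise an error when (M, N) lies outside the band are illustrative.

```python
def banded_score(x, y, k, m=1, s=-1, d=1):
    """Global alignment score restricted to cells with |i - j| <= k."""
    NEG = float("-inf")
    M, N = len(x), len(y)
    if abs(M - N) > k:
        raise ValueError("(M, N) lies outside the band")
    F = [[NEG] * (N + 1) for _ in range(M + 1)]
    F[0][0] = 0
    for i in range(1, min(M, k) + 1):
        F[i][0] = -i * d
    for j in range(1, min(N, k) + 1):
        F[0][j] = -j * d
    for i in range(1, M + 1):
        for j in range(max(1, i - k), min(N, i + k) + 1):
            sub = m if x[i - 1] == y[j - 1] else s
            # Cells outside the band are -inf, so they never win the max.
            F[i][j] = max(F[i - 1][j - 1] + sub,
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
    return F[M][N]


# The optimal path for AGTA / ATA stays within one cell of the diagonal.
print(banded_score("AGTA", "ATA", k=1))
```

Only O(k) cells are filled per row, so the work drops from O(MN) to roughly O(M × k) when the sequences are similar.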
Allocate( column[0] )
Allocate( column[1] )
For i = 1…M
   If i > 1, then:
      Free( column[i - 2] )
      Allocate( column[i] )
   For j = 1…N
      F(i, j) = …
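The column-recycling idea can be sketched in Python by keeping just two lists, one for column i - 1 and one for column i. This computes the optimal score only, since the back-pointers are not stored. The function name is an illustrative choice.

```python
def score_two_columns(x, y, m=1, s=-1, d=1):
    """Optimal global score using O(N) space (two columns of the matrix)."""
    N = len(y)
    prev = [-j * d for j in range(N + 1)]        # column i = 0
    for i in range(1, len(x) + 1):
        cur = [-i * d] + [0] * N                 # column i, F(i, 0) = -i d
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else s
            cur[j] = max(prev[j - 1] + sub,      # F(i-1, j-1) + s(xi, yj)
                         prev[j] - d,            # F(i-1, j) - d
                         cur[j - 1] - d)         # F(i, j-1) - d
        prev = cur                               # column i-1 is no longer needed
    return prev[N]


print(score_two_columns("AGTA", "ATA"))
```

Each cell depends only on the current and the previous column, which is why older columns can be discarded.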
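Notation:
   Fr(i, j) = optimal score of aligning the last i letters of x with the last j letters of y
              (the same DP run on the reversed sequences)
Lemma:
   F(M, N) = max over k = 0…N of [ F(M/2, k) + Fr(M/2, N - k) ]
Let k* be the maximizing k. Then the optimal alignment pairs
   x1…xM/2 with y1…yk*, and xM/2+1…xM with yk*+1…yN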
Linear-space alignment
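To align xl…xl’ with yr…yr’:
1. Let h = l + (l’ - l)/2 (the middle column)
2. Find, in time O((l’ - l) × (r’ - r)) and space O(r’ - r),
   the optimal path, Lh, at column h
   Let k1 = position at column h - 1 where Lh enters
       k2 = position at column h + 1 where Lh exits
3. Recursively align the part to the left of Lh (ending at k1)
4. Output Lh
5. Recursively align the part to the right of Lh (starting at k2)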
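The divide-and-conquer scheme can be sketched in Python as follows. This is the common textbook formulation of Hirschberg's algorithm, which splits x in half and finds the best split point of y from a forward pass and a reverse pass. It is not a line-by-line transcription of the steps above, and the helper name `last_column` is hypothetical.

```python
def last_column(x, y, m, s, d):
    """Scores F(len(x), j) for j = 0..len(y), keeping two columns."""
    prev = [-j * d for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [-i * d] + [0] * len(y)
        for j in range(1, len(y) + 1):
            sub = m if x[i - 1] == y[j - 1] else s
            cur[j] = max(prev[j - 1] + sub, prev[j] - d, cur[j - 1] - d)
        prev = cur
    return prev


def hirschberg(x, y, m=1, s=-1, d=1):
    """Optimal global alignment in linear space; returns (aligned_x, aligned_y)."""
    if len(x) == 0:
        return "-" * len(y), y
    if len(y) == 0:
        return x, "-" * len(x)
    if len(x) == 1:
        # Base case: pair the single letter with its best partner in y,
        # unless aligning it to a gap is cheaper.
        j = max(range(len(y)), key=lambda t: m if x[0] == y[t] else s)
        sub = m if x[0] == y[j] else s
        if sub - (len(y) - 1) * d >= -(len(y) + 1) * d:
            return "-" * j + x + "-" * (len(y) - 1 - j), y
        return x + "-" * len(y), "-" + y
    h = len(x) // 2
    N = len(y)
    forward = last_column(x[:h], y, m, s, d)               # F(h, k)
    backward = last_column(x[h:][::-1], y[::-1], m, s, d)  # Fr(M - h, N - k)
    k = max(range(N + 1), key=lambda t: forward[t] + backward[N - t])
    left = hirschberg(x[:h], y[:k], m, s, d)
    right = hirschberg(x[h:], y[k:], m, s, d)
    return left[0] + right[0], left[1] + right[1]


print(hirschberg("AGTA", "ATA"))
```

Each level of the recursion does work proportional to the area it covers, and the areas halve at every level, so total time stays O(MN) while space is linear.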
Definition:
   A t-block is a t × t square of the DP matrix
Idea:
   Divide matrix in t-blocks,
   Precompute t-blocks
Speedup: O(t)
The Four-Russian Algorithm
• For i = 1…K
   • For j = 1…K
      • Compute Di,j as a function of
        Ai,j, Bi,j, Ci,j, x[li…l’i], y[rj…r’j]
Another observation:
( Assume m = 1, s = 1, d = 1 )
Definition:
   The offset vector is a t-long vector of values from {-1, 0, 1},
   where the first entry is 0
If we know the value at A,
and the top row, left column offset vectors,
and xl…xl’, yr…yr’,
then we can find D
(Figure: a t-block spanning xl…xl’ and yr…yr’, with regions labeled A, B, C and D.)
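The block idea can be sketched for unit-cost edit distance (match 0, mismatch 1, gap 1), where neighboring cells are known to differ by at most 1. Several things here are assumptions rather than the slides' definitions: offsets are stored as differences between consecutive boundary cells (so there is no leading 0 entry), block results are memoized lazily with `lru_cache` instead of being precomputed for all inputs, and both sequence lengths must be multiples of t.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def block(top, left, xs, ys):
    """Map a block's input offsets to its output offsets.

    top / left: t differences along the top row / down the left column.
    xs labels the block's columns, ys its rows.
    Returns (bottom offsets, right offsets).
    """
    t = len(xs)
    D = [[0] * (t + 1) for _ in range(t + 1)]   # values relative to corner A = 0
    for c in range(1, t + 1):
        D[0][c] = D[0][c - 1] + top[c - 1]
    for r in range(1, t + 1):
        D[r][0] = D[r - 1][0] + left[r - 1]
    for r in range(1, t + 1):
        for c in range(1, t + 1):
            D[r][c] = min(D[r - 1][c] + 1,
                          D[r][c - 1] + 1,
                          D[r - 1][c - 1] + (xs[c - 1] != ys[r - 1]))
    bottom = tuple(D[t][c] - D[t][c - 1] for c in range(1, t + 1))
    right = tuple(D[r][t] - D[r - 1][t] for r in range(1, t + 1))
    return bottom, right


def edit_distance_blocks(x, y, t=2):
    """Edit distance computed block by block from offset vectors."""
    assert len(x) % t == 0 and len(y) % t == 0, "sketch needs multiples of t"
    # Row 0 of the matrix is 0, 1, 2, ... so every top offset starts as +1.
    tops = [(1,) * t for _ in range(len(x) // t)]
    for br in range(len(y) // t):
        ys = y[br * t:(br + 1) * t]
        left = (1,) * t                          # column 0 is 0, 1, 2, ...
        for bc in range(len(x) // t):
            xs = x[bc * t:(bc + 1) * t]
            tops[bc], left = block(tops[bc], left, xs, ys)
    # Walk down column 0 (len(y) steps of +1), then along the bottom row.
    return len(y) + sum(sum(b) for b in tops)


print(edit_distance_blocks("kitten", "sittin", t=2))
```

Because the block function depends only on the offsets and the two substrings, identical blocks are computed once and then looked up, which is the source of the speedup.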
Definition:
   The offset function of a t-block is a function that for any …
We can keep all these values in a table, and look up in linear time, or in
O(1) time if we assume constant-lookup RAM
for log-sized inputs