This document summarizes key concepts from a lecture on sequence alignment: 1) the Needleman-Wunsch algorithm uses dynamic programming to find the optimal global alignment between two sequences by filling a matrix with alignment scores; 2) the Smith-Waterman algorithm modifies Needleman-Wunsch to find local alignments by ignoring badly aligning regions and initializing matrix values to 0; 3) more accurate scoring models use convex or affine gap penalties rather than a linear gap penalty, better reflecting the biological properties of gaps.

Sequence Alignment

Lecture 2, Thursday April 3, 2003


Review of Last Lecture



Sequence conservation implies function

Interleukin region in human and mouse


Sequence Alignment

AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
 || |||||||  |||| |  || |||  |||||
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC



The Needleman-Wunsch Matrix

x1 ……………… xM
y1 ……………… yN

Every nondecreasing path from (0, 0) to (M, N)
corresponds to an alignment of the two sequences.


The Needleman-Wunsch Algorithm

x = AGTA        m = 1  (match)
y = ATA         s = -1 (mismatch)
                d = -1 (gap)

F(i,j)   i=0    1    2    3    4
                A    G    T    A
j=0        0   -1   -2   -3   -4
 1  A     -1    1    0   -1   -2
 2  T     -2    0    0    1    0
 3  A     -3   -1   -1    0    2

Optimal alignment: F(4,3) = 2

AGTA
A-TA


The Needleman-Wunsch Algorithm

1. Initialization.
   a. F(0, 0) = 0
   b. F(0, j) = -j·d
   c. F(i, 0) = -i·d

2. Main Iteration. Filling in partial alignments:

   For each i = 1……M
     For each j = 1……N

                        F(i-1, j) - d             [case 1]
       F(i, j)  = max   F(i, j-1) - d             [case 2]
                        F(i-1, j-1) + s(xi, yj)   [case 3]

                        UP,   if [case 1]
       Ptr(i, j) =      LEFT, if [case 2]
                        DIAG, if [case 3]

3. Termination. F(M, N) is the optimal score, and from
   Ptr(M, N) we can trace back the optimal alignment.
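
A minimal Python sketch of this recurrence (mine, not the lecture's; note that here d = 1 is the penalty that gets subtracted, which reproduces the slide's example where each gap scores -1):

def needleman_wunsch(x, y, m=1, s=-1, d=1):
    """Global alignment: m = match score, s = mismatch score, d = gap penalty."""
    M, N = len(x), len(y)
    # Initialization: F(0,0) = 0, F(i,0) = -i*d, F(0,j) = -j*d
    F = [[0] * (N + 1) for _ in range(M + 1)]
    Ptr = [[None] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        F[i][0], Ptr[i][0] = -i * d, "UP"
    for j in range(1, N + 1):
        F[0][j], Ptr[0][j] = -j * d, "LEFT"
    # Main iteration: fill in partial alignments
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sc = m if x[i - 1] == y[j - 1] else s        # s(xi, yj)
            F[i][j], Ptr[i][j] = max(
                (F[i - 1][j] - d, "UP"),                 # case 1: xi aligns to a gap
                (F[i][j - 1] - d, "LEFT"),               # case 2: yj aligns to a gap
                (F[i - 1][j - 1] + sc, "DIAG"))          # case 3: xi aligns to yj
    # Termination: F(M, N) is optimal; trace back from Ptr(M, N)
    ax, ay, i, j = [], [], M, N
    while i > 0 or j > 0:
        p = Ptr[i][j]
        if p == "DIAG":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif p == "UP":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[M][N], "".join(reversed(ax)), "".join(reversed(ay))

# needleman_wunsch("AGTA", "ATA") returns (2, 'AGTA', 'A-TA'),
# matching the worked example above.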



The Overlap Detection variant

x1 ……………… xM
y1 ……………… yN

Changes:

1. Initialization
   For all i, j:
     F(i, 0) = 0
     F(0, j) = 0

2. Termination
                  maxi F(i, N)
   FOPT = max
                  maxj F(M, j)
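
A sketch in Python of just these two changes (same scoring conventions as the Needleman-Wunsch sketch above; the function name is mine):

def overlap_align_score(x, y, m=1, s=-1, d=1):
    """Overlap detection: leading gaps are free (border cells start at 0),
    and the optimum is read off the last row and last column."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]    # F(i,0) = F(0,j) = 0
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sc = m if x[i - 1] == y[j - 1] else s
            F[i][j] = max(F[i - 1][j] - d, F[i][j - 1] - d, F[i - 1][j - 1] + sc)
    # FOPT: best score on the bottom row or rightmost column (trailing gaps free)
    return max(max(F[i][N] for i in range(M + 1)),
               max(F[M][j] for j in range(N + 1)))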



Today

• Structure of a genome, and cross-species similarity

• Local alignment

• More elaborate scoring functions

• Linear-space alignment

• The Four-Russian speedup



Structure of a genome
a gene
  → transcription → pre-mRNA
  → splicing → mature mRNA
  → translation → protein

Human genome: 3×10^9 bp
  ~30,000 genes
  ~200,000 exons
  ~23 Mb coding
  ~15 Mb noncoding



Structure of a genome

Short sequences regulate the expression of genes:

  gene D:  A B C → Make D
           If B then NOT D
           If A and B then D

  gene B:  D C → Make B
           If D then B

Lots of “junk” sequence, e.g. ~50% repeats (selfish DNA)
Cross-species genome similarity

• 98% of genes are conserved between any two mammals


• ~75% average similarity in protein sequence

hum_a : GTTGACAATAGAGGGTCTGGCAGAGGCTC--------------------- @ 57331/400001
mus_a : GCTGACAATAGAGGGGCTGGCAGAGGCTC--------------------- @ 78560/400001
rat_a : GCTGACAATAGAGGGGCTGGCAGAGACTC--------------------- @ 112658/369938
fug_a : TTTGTTGATGGGGAGCGTGCATTAATTTCAGGCTATTGTTAACAGGCTCG @ 36008/68174

hum_a : CTGGCCGCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG @ 57381/400001
mus_a : CTGGCCCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG @ 78610/400001
rat_a : CTGGCCCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG @ 112708/369938
fug_a : TGGGCCGAGGTGTTGGATGGCCTGAGTGAAGCACGCGCTGTCAGCTGGCG @ 36058/68174

hum_a : AGCGCACTCTCCTTTCAGGCAGCTCCCCGGGGAGCTGTGCGGCCACATTT @ 57431/400001
mus_a : AGCGCACTCG-CTTTCAGGCCGCTCCCCGGGGAGCTGAGCGGCCACATTT @ 78659/400001
rat_a : AGCGCACTCG-CTTTCAGGCCGCTCCCCGGGGAGCTGCGCGGCCACATTT @ 112757/369938
fug_a : AGCGCTCGCG------------------------AGTCCCTGCCGTGTCC @ 36084/68174

hum_a : AACACCATCATCACCCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG @ 57481/400001
mus_a : AACACCGTCGTCA-CCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG @ 78708/400001
rat_a : AACACCGTCGTCA-CCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG @ 112806/369938
fug_a : CCGAGGACCCTGA------------------------------------- @ 36097/68174

(“atoh” enhancer in human, mouse, rat, fugu fish)



The local alignment problem

Given two strings  x = x1……xM,  y = y1……yN

Find substrings x’, y’ whose similarity
(optimal global alignment value) is maximum

e.g.  x = aaaacccccgggg
      y = cccgggaaccaacc



Why local alignment

• Genes are shuffled between genomes

• Portions of proteins (domains) are often conserved



The Smith-Waterman algorithm

Idea: Ignore badly aligning regions

Modifications to Needleman-Wunsch:

Initialization:  F(0, j) = F(i, 0) = 0

Iteration:
                       0
  F(i, j)  =  max      F(i - 1, j) - d
                       F(i, j - 1) - d
                       F(i - 1, j - 1) + s(xi, yj)



The Smith-Waterman algorithm

Termination:

1. If we want the best local alignment…

FOPT = maxi,j F(i, j)

2. If we want all local alignments scoring > t

For all i, j find F(i, j) > t, and trace back
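
A Python sketch of the common core of both termination rules (mine, with the same scoring conventions as before); it clamps scores at 0 and traces back from the best cell, returning the aligned substrings x’, y’:

def smith_waterman(x, y, m=1, s=-1, d=1):
    """Local alignment: scores are clamped at 0; the best local alignment
    ends at the maximum cell and is traced back until a 0 score."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sc = m if x[i - 1] == y[j - 1] else s
            F[i][j] = max(0,                          # start a fresh alignment here
                          F[i - 1][j] - d,
                          F[i][j - 1] - d,
                          F[i - 1][j - 1] + sc)
            if F[i][j] > best:
                best, bi, bj = F[i][j], i, j          # FOPT = max over all i, j
    # Trace back from the best cell until the score drops to 0
    i, j = bi, bj
    while i > 0 and j > 0 and F[i][j] > 0:
        sc = m if x[i - 1] == y[j - 1] else s
        if F[i][j] == F[i - 1][j - 1] + sc:
            i, j = i - 1, j - 1
        elif F[i][j] == F[i - 1][j] - d:
            i -= 1
        else:
            j -= 1
    return best, x[i:bi], y[j:bj]

# For all local alignments scoring > t, one would instead collect every
# cell with F(i, j) > t and trace back from each.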



Scoring the gaps more accurately
Current model: a gap of length n incurs penalty n·d

However, gaps usually occur in bunches.

Convex gap penalty function γ(n):

  for all n,  γ(n + 1) - γ(n) ≤ γ(n) - γ(n - 1)



General gap dynamic programming

Initialization: same

Iteration:
                     F(i-1, j-1) + s(xi, yj)
  F(i, j)  =  max    maxk=0…i-1 F(k, j) - γ(i - k)
                     maxk=0…j-1 F(i, k) - γ(j - k)

Termination: same

Running time: O(N²M)   (assume N > M)

Space: O(NM)
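
A direct Python transcription of this cubic-time recurrence (mine; gamma can be any penalty function, e.g. a convex one):

def general_gap_align(x, y, gamma, m=1, s=-1):
    """Global alignment with arbitrary gap penalty gamma(n); O(N^2 M) time."""
    M, N = len(x), len(y)
    F = [[0.0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        F[i][0] = -gamma(i)
    for j in range(1, N + 1):
        F[0][j] = -gamma(j)
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sc = m if x[i - 1] == y[j - 1] else s
            F[i][j] = max(
                F[i - 1][j - 1] + sc,
                max(F[k][j] - gamma(i - k) for k in range(i)),   # gap of length i-k in y
                max(F[i][k] - gamma(j - k) for k in range(j)))   # gap of length j-k in x
    return F[M][N]

# e.g. with the affine penalty of the next slide, gamma(n) = d + (n-1)e:
# general_gap_align("AGTA", "ATA", lambda n: 1 + (n - 1) * 0.5)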



Compromise: affine gaps
γ(n) = d + (n - 1)·e
       |          |
   gap open   gap extend

To compute the optimal alignment, at position (i, j) we need to
“remember” the best score if the gap is open, and the best score
if the gap is not open:

F(i, j): score of alignment x1…xi to y1…yj, if xi aligns to yj

G(i, j): score if xi, or yj, aligns to a gap



Needleman-Wunsch with affine gaps

Initialization:  F(i, 0) = -(d + (i - 1)·e)
                 F(0, j) = -(d + (j - 1)·e)

Iteration:
                      F(i - 1, j - 1) + s(xi, yj)
  F(i, j)  =  max
                      G(i - 1, j - 1) + s(xi, yj)

                      F(i - 1, j) - d
                      F(i, j - 1) - d
  G(i, j)  =  max
                      G(i - 1, j) - e
                      G(i, j - 1) - e

Termination: same
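
A Python sketch of the two-matrix recurrence (mine; one detail differs from the slide: the border penalties are placed in G rather than F, since border cells are by definition gaps, and the returned score is the same):

def affine_gap_align_score(x, y, m=1, s=-1, d=2, e=1):
    """Affine gaps gamma(n) = d + (n-1)e.
    F(i,j): best score with xi aligned to yj.
    G(i,j): best score with xi, or yj, aligned to a gap."""
    M, N = len(x), len(y)
    NEG = float("-inf")
    F = [[NEG] * (N + 1) for _ in range(M + 1)]
    G = [[NEG] * (N + 1) for _ in range(M + 1)]
    F[0][0] = 0
    for i in range(1, M + 1):
        G[i][0] = -(d + (i - 1) * e)
    for j in range(1, N + 1):
        G[0][j] = -(d + (j - 1) * e)
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sc = m if x[i - 1] == y[j - 1] else s
            F[i][j] = max(F[i - 1][j - 1], G[i - 1][j - 1]) + sc
            G[i][j] = max(F[i - 1][j] - d,      # open a gap in y
                          F[i][j - 1] - d,      # open a gap in x
                          G[i - 1][j] - e,      # extend a gap in y
                          G[i][j - 1] - e)      # extend a gap in x
    return max(F[M][N], G[M][N])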



To generalize a little…

… think of how you would compute the optimal alignment
with this gap function … in time O(MN)

[figure: a piecewise-linear gap penalty γ(n)]


Bounded Dynamic Programming

Assume we know that x and y are very similar.

Assumption:  # gaps(x, y) < k(N)     (say N > M)

Then, xi aligned to yj implies |i - j| < k(N)

We can align x and y more efficiently:

Time, Space: O(N × k(N)) << O(N²)



Bounded Dynamic Programming
Initialization:
  F(i, 0), F(0, j) undefined for i, j > k

Iteration:

  For i = 1…M
    For j = max(1, i - k)…min(N, i + k)

                        F(i - 1, j - 1) + s(xi, yj)
      F(i, j)  = max    F(i, j - 1) - d,  if j > i - k(N)
                        F(i - 1, j) - d,  if j < i + k(N)

Termination: same

Easy to extend to the affine gap case.
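
A Python sketch of the banded iteration (mine; it stores only the band in a dict, and assumes |len(x) - len(y)| <= k so that the corner (M, N) lies inside the band):

def banded_align_score(x, y, k, m=1, s=-1, d=1):
    """Bounded DP: fill only cells with |i - j| <= k.
    Time and space O(N * k) instead of O(N^2)."""
    M, N = len(x), len(y)
    F = {(0, 0): 0}
    for i in range(1, min(M, k) + 1):
        F[(i, 0)] = -i * d
    for j in range(1, min(N, k) + 1):
        F[(0, j)] = -j * d
    for i in range(1, M + 1):
        for j in range(max(1, i - k), min(N, i + k) + 1):
            sc = m if x[i - 1] == y[j - 1] else s
            cands = [F[(i - 1, j - 1)] + sc]     # diagonal is always in the band
            if (i, j - 1) in F:                  # left cell inside band (j > i - k)
                cands.append(F[(i, j - 1)] - d)
            if (i - 1, j) in F:                  # upper cell inside band (j < i + k)
                cands.append(F[(i - 1, j)] - d)
            F[(i, j)] = max(cands)
    return F[(M, N)]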


Linear-Space Alignment
Introduction: compute the optimal score

It is easy to compute F(M, N) in linear space:

  Allocate( column[1] )
  Allocate( column[2] )

  For i = 1…M
    If i > 1, then:
      Free( column[i - 2] )
      Allocate( column[i] )
    For j = 1…N
      F(i, j) = …
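
The same idea as concrete Python (mine; using two rolling columns rather than explicit Allocate/Free):

def nw_score_linear_space(x, y, m=1, s=-1, d=1):
    """Compute the optimal global alignment score F(M, N) in O(N) space."""
    M, N = len(x), len(y)
    prev = [-j * d for j in range(N + 1)]     # column i - 1
    for i in range(1, M + 1):
        cur = [-i * d] + [0] * N              # column i; cur[0] = F(i, 0)
        for j in range(1, N + 1):
            sc = m if x[i - 1] == y[j - 1] else s
            cur[j] = max(prev[j] - d, cur[j - 1] - d, prev[j - 1] + sc)
        prev = cur                            # "free" column i - 1
    return prev[N]

# nw_score_linear_space("AGTA", "ATA") -> 2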



Linear-space alignment

To compute both the optimal score and the optimal alignment:
Divide & Conquer approach

Notation:

  xr, yr: reverse of x, y
    e.g. x = accgg;  xr = ggcca

  Fr(i, j): optimal score of aligning xr1…xri & yr1…yrj
            (equivalently, the score of aligning xM-i+1…xM with yN-j+1…yN)



Linear-space alignment

Lemma:  F(M, N) = maxk=0…N ( F(M/2, k) + Fr(M/2, N - k) )

[figure: the matrix cut at column M/2; an optimal path crosses it at
some k*, contributing F(M/2, k*) from the left half and Fr(M/2, N - k*)
from the right half]



Linear-space alignment

Now, using 2 columns of space, we can compute,
for k = 0…N,  F(M/2, k) and Fr(M/2, N - k),

PLUS the backpointers.



Linear-space alignment

Now, we can find k* maximizing F(M/2, k) + Fr(M/2, N - k)

Also, we can trace the path exiting column M/2 from k*

Conclusion: In O(NM) time and O(N) space,
we found the optimal alignment path at column M/2



Linear-space alignment

• Iterate this procedure to the left and to the right!

[figure: two subproblems remain, of sizes M/2 × k* and M/2 × (N - k*)]
Linear-space alignment

Hirschberg’s linear-space algorithm:

MEMALIGN(l, l’, r, r’):   (aligns xl…xl’ with yr…yr’)

1. Let h = ⌊(l + l’) / 2⌋
2. Find, in time O((l’ - l) × (r’ - r)) and space O(r’ - r),
   the optimal path at column h, Lh.
   Let k1 = position at column h - 1 where Lh enters
       k2 = position at column h + 1 where Lh exits
3. MEMALIGN(l, h - 1, r, k1)
4. Output Lh
5. MEMALIGN(h + 1, l’, k2, r’)
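
A compact Python sketch of the whole divide & conquer (mine; it splits on the middle of x and recombines, rather than tracking Lh explicitly, but it is the same O(MN)-time, linear-space idea):

def hirschberg(x, y, m=1, s=-1, d=1):
    """Optimal global alignment in O(MN) time and O(M + N) space."""
    def last_col(x, y):
        # F(len(x), j) for all j, computed with two rolling columns
        prev = [-j * d for j in range(len(y) + 1)]
        for i in range(1, len(x) + 1):
            cur = [-i * d] + [0] * len(y)
            for j in range(1, len(y) + 1):
                sc = m if x[i - 1] == y[j - 1] else s
                cur[j] = max(prev[j] - d, cur[j - 1] - d, prev[j - 1] + sc)
            prev = cur
        return prev

    def rec(x, y):
        if not x:
            return "-" * len(y), y
        if not y:
            return x, "-" * len(x)
        if len(x) == 1:
            # base case: put the single character where it scores best
            # (for these scoring values a mismatch never loses to two gaps)
            j = max(range(len(y)), key=lambda j: m if x == y[j] else s)
            return "-" * j + x + "-" * (len(y) - 1 - j), y
        h = len(x) // 2
        left = last_col(x[:h], y)                    # F(h, k) for all k
        right = last_col(x[h:][::-1], y[::-1])       # Fr(M - h, N - k) for all k
        # split point k* maximizing F(h, k) + Fr(M - h, N - k)
        k = max(range(len(y) + 1), key=lambda k: left[k] + right[len(y) - k])
        ax1, ay1 = rec(x[:h], y[:k])
        ax2, ay2 = rec(x[h:], y[k:])
        return ax1 + ax2, ay1 + ay2

    return rec(x, y)

# hirschberg("AGTA", "ATA") -> ('AGTA', 'A-TA')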



Linear-space Alignment

Time and space analysis of Hirschberg’s algorithm:

To compute the optimal path at the middle column,
for a box of size M × N:

  Space: 2N
  Time:  cMN, for some constant c

Then the left and right calls cost
c( M/2 × k* + M/2 × (N - k*) ) = cMN/2

Summing over all recursive calls:

Total time: cMN + cMN/2 + cMN/4 + … = 2cMN = O(MN)

Total space: O(N) for the computation,
O(N + M) to store the optimal alignment



The Four-Russian Algorithm

A useful speedup of dynamic programming

Main observation:

  Within a rectangle of the DP matrix, the values of D depend only
  on the values of A, B, C, and the substrings xl…l’, yr…r’

  [figure: a block of the DP matrix spanning columns xl…xl’ and rows
  yr…yr’, with corner value A, top row B, left column C, and D the
  bottom-right corner to be computed]

Definition:
  A t-block is a t × t square of the DP matrix

Idea:
  Divide the matrix into t-blocks, and precompute all t-blocks

Speedup: O(t)

The Four-Russian Algorithm

Main structure of the algorithm:

• Divide the N×N DP matrix into K×K log2N-blocks that overlap by
  1 column & 1 row

• For i = 1……K
    For j = 1……K
      Compute Di,j as a function of Ai,j, Bi,j, Ci,j, x[li…l’i], y[rj…r’j]

Time: O(N² / log²N) times the cost of the block computation step



The Four-Russian Algorithm

Another observation:

( Assume the distance version: m = 0, s = 1, d = 1 )

Two adjacent cells of F(.,.) differ by at most 1.



The Four-Russian Algorithm

Definition:
  The offset vector is a t-long vector of values from {-1, 0, 1},
  where the first entry is 0.

If we know the value at A, the offset vectors of the top row and of
the left column, and xl……xl’, yr……yr’, then we can find D.

The Four-Russian Algorithm

Definition:
  The offset function of a t-block is a function that, for any given
  offset vectors of the top row and the left column, and xl……xl’,
  yr……yr’, produces the offset vectors of the bottom row and the
  right column.

The Four-Russian Algorithm

We can pre-compute the offset function:

  3^(2(t-1)) possible pairs of input offset vectors

  4^(2t) possible pairs of strings xl……xl’, yr……yr’

  Therefore 3^(2(t-1)) × 4^(2t) values to pre-compute

We can keep all these values in a table and look them up in linear
time, or in O(1) time if we assume a constant-lookup RAM for
log-sized inputs.
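
A trivial Python check of this count (mine), which also shows why t must stay around log N for the table to be feasible:

def four_russians_table_size(t):
    """Number of precomputed entries for a t-block over a 4-letter alphabet:
    3^(2(t-1)) offset-vector pairs  x  4^(2t) substring pairs."""
    return 3 ** (2 * (t - 1)) * 4 ** (2 * t)

# four_russians_table_size(2) ->          2,304
# four_russians_table_size(4) ->     47,775,744
# four_russians_table_size(8) -> already ~2.0e16: hopelessly large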

