External Memory Suffix Array Construction: Roman Dementiev Juha Kärkkäinen Jens Mehnert

The document describes algorithms for constructing suffix arrays in external memory. It proposes a new pipelined Difference Cover 3 (DC3) algorithm that requires only sorting(30n) + scan(6n) I/Os for a string of length n, making it practical for large datasets. Experimental results on genome, text, and random string datasets show DC3 performs better than previous doubling, discarding, and skewed algorithms - in both I/O volume and time. The algorithms were implemented using the STXXL library to enable pipelining for improved performance.


Dementiev/Mehnert/Kärkkäinen/Sanders: External Suffix Arrays

External Memory Suffix Array Construction

Roman Dementiev, Juha Kärkkäinen, Jens Mehnert, Peter Sanders
MPI Informatik, U. Karlsruhe, U. Helsinki


Suffix Arrays
Sort the suffixes T[i..n] of a string T[0..n] over the alphabet {1..n}.

Example: T = banana
Sorted suffixes: a, ana, anana, banana, na, nana

Applications
- Full text search
- Burrows-Wheeler text compression
- Bioinformatics, ...
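The definition can be checked directly on the banana example. The following is a minimal in-memory sketch, not the external algorithm of the talk; the function name `naive_suffix_array` is an illustrative choice, and positions are 0-based.

```python
def naive_suffix_array(text):
    """Return the start positions of the suffixes of text in sorted order."""
    # Sorting whole suffixes is simple but slow; it only illustrates the definition.
    return sorted(range(len(text)), key=lambda i: text[i:])


text = "banana"
sa = naive_suffix_array(text)
print(sa)                        # start positions in sorted order
print([text[i:] for i in sa])    # the suffixes themselves
```

For "banana" this lists the suffixes in the order shown on the slide: a, ana, anana, banana, na, nana.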


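Big interest in BIG inputs ⇒ External memory

Model: registers and ALU; fast memory of capacity M, freely programmable; large memory, accessed in blocks of size B.

$\mathrm{scan}(n) = \dfrac{n}{B}$

$\mathrm{sort}(n) = \dfrac{2n}{B} \left\lceil \log_{M/B} \dfrac{n}{M} \right\rceil$ (n in machine words)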
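To get a feeling for the two cost measures, they can be evaluated for concrete sizes. This is a sketch under assumptions: the function names `scan_ios` and `sort_ios` are hypothetical, all sizes are in machine words, n/B is rounded up to whole blocks, and the logarithm is computed with integer arithmetic as the number of merge passes needed to reduce n/M runs to one with fan-in M/B.

```python
def scan_ios(n, B):
    """Block transfers needed to scan n words with block size B (rounded up)."""
    return -(-n // B)


def sort_ios(n, M, B):
    """Evaluate (2n/B) * ceil(log_{M/B}(n/M)) without floating point."""
    runs = -(-n // M)          # ceil(n / M)
    fan_in = M // B
    passes = 0
    while runs > 1:            # each pass merges fan_in runs into one
        runs = -(-runs // fan_in)
        passes += 1
    return 2 * scan_ios(n, B) * passes


# Example sizes (assumed for illustration): M = 2^20 words, B = 2^10 words.
M, B = 2**20, 2**10
for exponent in (30, 31):
    n = 2**exponent
    print(exponent, scan_ios(n, B), sort_ios(n, M, B))
```

With these example parameters one merge pass suffices for n = 2^30, while n = 2^31 already needs two.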


Related Work

Incremental: $\frac{n}{M}\,\mathrm{scan}(n)$ I/Os [Gonnet/Baeza-Yates/Snider 92]
  [CF 97] not very scalable, a lot of internal work

Doubling: sort by the first 2^i characters in iteration i [Manber/Myers 93]
  ⇒ O(sort(n) log maxlcp) I/Os [AFGV 97]

Doubling+Discarding: avoid sorting suffixes known to be unique [Crauser/Ferragina 97]
  Best scalable algorithm in study. > 6h for 26 MByte.
  ⇒ External construction not practical?

via Suffix Tree: O(sort(n)) I/Os [Farach/Ferragina/Muthukrishnan 00]
  very complicated

DC3: simple, linear time, O(sort(n)) I/Os [KS 03].
  Practical? Better than improved doubling?


Pipelined Doubling with Bit Shuffling

name(T[i..i+k)) ∈ {1..n} preserves the order of the k-substrings.

[Figure: pipeline of one doubling iteration. Input triples (T[j], T[j+1], j), in later iterations (name(T[j..j+2^i)), name(T[j+2^i..j+2^(i+1))), j); stages: sort (form runs, merge), name, pair, sort; then i := i+1. Edge labels: 3n words, 2n words, i bits.]

Total I/O complexity: sort(5n) log maxlcp + O(sort(n))
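The naming and pairing steps can be tried out in memory. The sketch below is an illustration only: it uses Python's built-in sort and dictionaries instead of external sorting, omits pipelining and bit shuffling, and the function name `doubling_suffix_array` is hypothetical. Name 0 stands for a substring that starts past the end of the text.

```python
def doubling_suffix_array(text):
    """Prefix doubling: refine names of h-substrings until all names are unique."""
    n = len(text)
    if n == 0:
        return []
    # Start with names of the 1-substrings (single characters).
    codes = {c: k for k, c in enumerate(sorted(set(text)), start=1)}
    name = [codes[c] for c in text]
    h = 1
    while True:
        # Pair the name of T[j..j+h) with the name of T[j+h..j+2h).
        pairs = [(name[j], name[j + h] if j + h < n else 0) for j in range(n)]
        # Sorting the distinct pairs and numbering them yields order-preserving
        # names of the 2h-substrings.
        new_names = {p: k for k, p in enumerate(sorted(set(pairs)), start=1)}
        name = [new_names[p] for p in pairs]
        if len(new_names) == n:      # every suffix has a unique name: done
            break
        h *= 2
    suffix_array = [0] * n
    for j, r in enumerate(name):
        suffix_array[r - 1] = j
    return suffix_array


print(doubling_suffix_array("banana"))
```

The loop stops as soon as all names are distinct, which is where the log maxlcp factor in the bound comes from.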


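Improved Discarding
- Scan all unique suffixes [CF 97] ⇒ scan new unique suffixes [Kärkkäinen 03]
- Triples ⇒ pairs

[Figure: discarding pipeline with the stages merge, pair, and "name and mark unique", and separate streams for fully and partially discarded suffixes. Edge labels: 3N, 2N, 2n.]

$\mathrm{sort}(5N) + O(\mathrm{sort}(n))$ I/Os, where $N = \sum_i \log \mathrm{distPrefixSize}(T[i..n])$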
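The quantity N can be computed for small inputs. The sketch below rests on assumptions that the slide does not spell out: distPrefixSize of a suffix is taken to be the length of its shortest prefix that distinguishes it from every other suffix (longest common prefix with any other suffix, plus one), and the logarithm is taken to base 2. The function names are hypothetical.

```python
from math import log2


def dist_prefix_sizes(text):
    """Assumed definition: 1 + longest common prefix with any other suffix."""
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])

    def lcp(i, j):
        k = 0
        while i + k < n and j + k < n and text[i + k] == text[j + k]:
            k += 1
        return k

    sizes = [1] * n
    for r in range(n):
        # The longest common prefix is always attained at a neighbor in sorted order.
        best = 0
        if r > 0:
            best = max(best, lcp(sa[r - 1], sa[r]))
        if r + 1 < n:
            best = max(best, lcp(sa[r], sa[r + 1]))
        sizes[sa[r]] = best + 1
    return sizes


def discarding_volume(text):
    """N = sum over all suffixes of log2(distPrefixSize)."""
    return sum(log2(d) for d in dist_prefix_sizes(text))


print(dist_prefix_sizes("banana"))
print(discarding_volume("banana"))
```

Under these assumptions N is small for texts whose suffixes become unique after a few characters, which is the case in which discarding pays off.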


a-Tupling
Sort by the first a^i characters in iteration i.

Constant Factor in I/Os

a                2     3     4     5     6     7
(a + 3)/log a    5.00  3.78  3.50  3.45  3.48  3.56
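The constant factors can be recomputed from the expression (a + 3)/log a. The sketch assumes that the six values belong to a = 2, ..., 7 and that the logarithm is to base 2; with these assumptions the results agree with the listed numbers to within 0.01 (the entry for a = 3 evaluates to about 3.786). The function name `tupling_factor` is hypothetical.

```python
from math import log2


def tupling_factor(a):
    """Constant factor (a + 3) / log2(a) for a-tupling."""
    return (a + 3) / log2(a)


for a in range(2, 8):
    print(a, round(tupling_factor(a), 3))

best = min(range(2, 8), key=tupling_factor)
print("smallest factor at a =", best)
```

The factor is smallest at a = 5, and a = 4 (quadrupling) is close to it.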


Difference Cover 3 (DC3) Algorithm

1. Sort T[i..n] for i mod 3 ∈ {1, 2}: sort and name triples, recurse
2. Sort T[i..n] for i mod 3 ∈ {0}: sort pairs (T[3i], name(T[3i+1..n]))
3. Merge using the difference cover property of {1, 2}:

$T[3i..n] \le T[3j+1..n] \iff (T[3i], \mathrm{name}(T[3i+1..n])) \le (T[3j+1], \mathrm{name}(T[3j+2..n]))$

$T[3i..n] \le T[3j+2..n] \iff (T[3i], T[3i+1], \mathrm{name}(T[3i+2..n])) \le (T[3j+2], T[3j+3], \mathrm{name}(T[3j+4..n]))$
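The three steps can be followed in a small in-memory sketch. It is not the external, pipelined implementation of the talk: it uses Python's comparison sort instead of radix or external sorting, so it is not linear time, and the function name `dc3_suffix_array` is hypothetical. The input is assumed to be a sequence of positive integers, because 0 serves as the end-of-text sentinel.

```python
def dc3_suffix_array(text):
    """Suffix array of a sequence of positive integers via the three DC3 steps."""
    n = len(text)
    if n < 2:
        return list(range(n))
    t = list(text) + [0, 0, 0]            # sentinel padding

    # Step 1: sort suffixes starting at i with i mod 3 in {1, 2}.
    # If n mod 3 == 1, a dummy position n is added so that the last
    # mod-1 triple contains a sentinel.
    limit = n + 1 if n % 3 == 1 else n
    sample = list(range(1, limit, 3)) + list(range(2, limit, 3))
    triples = [tuple(t[i:i + 3]) for i in sample]
    names = {tr: k for k, tr in enumerate(sorted(set(triples)), start=1)}
    reduced = [names[tr] for tr in triples]
    if len(names) < len(sample):          # names not unique: recurse
        order = dc3_suffix_array(reduced)
    else:                                 # unique names are already the ranks
        order = [0] * len(sample)
        for idx, name in enumerate(reduced):
            order[name - 1] = idx
    rank = [0] * (n + 3)                  # rank 0: position past the end
    for r, idx in enumerate(order, start=1):
        rank[sample[idx]] = r
    sa12 = [sample[idx] for idx in order if sample[idx] < n]   # drop the dummy

    # Step 2: sort the mod-0 suffixes by the pair (character, rank of next suffix).
    sa0 = sorted(range(0, n, 3), key=lambda i: (t[i], rank[i + 1]))

    # Step 3: merge; every comparison looks at no more than two characters and one rank.
    def sample_is_smaller(i, j):
        if i % 3 == 1:
            return (t[i], rank[i + 1]) <= (t[j], rank[j + 1])
        return (t[i], t[i + 1], rank[i + 2]) <= (t[j], t[j + 1], rank[j + 2])

    result, a, b = [], 0, 0
    while a < len(sa12) and b < len(sa0):
        if sample_is_smaller(sa12[a], sa0[b]):
            result.append(sa12[a])
            a += 1
        else:
            result.append(sa0[b])
            b += 1
    return result + sa12[a:] + sa0[b:]


print(dc3_suffix_array([ord(c) for c in "banana"]))
```

The merge works because for any mod-0 position j and sample position i, either (i+1, j+1) or (i+2, j+2) are both sample positions, so their ranks from step 1 decide the comparison.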


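Pipelined DC3

[Figure: data-flow graph of pipelined DC3, built from file nodes, streaming nodes, and sorting nodes. Stages: input, triple, name, recurse (if names are not unique), permute, tuple, merge of the mod 0, mod 1, and mod 2 streams, output. Edge labels give data volumes: n, 4n/3, 5n/3, 8n/3.]

sort(30n) + scan(6n) I/Os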
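The distinction between streaming nodes and sorting nodes can be imitated with Python generators. This is only an analogy under assumptions: it is not the STXXL API, the function names are hypothetical, it covers just the first stages (triple, sort, name), and it leaves out the dummy triple that a full DC3 needs when n mod 3 = 1. Streaming nodes pass items on one at a time without storing the whole stream; only the sorting node has to see all of its input.

```python
def make_triples(text):
    """Streaming node: emit ((T[i], T[i+1], T[i+2]), i) for i mod 3 in {1, 2}."""
    padded = list(text) + [0, 0]          # 0 marks positions past the end
    for i in range(len(text)):
        if i % 3 != 0:
            yield (padded[i], padded[i + 1], padded[i + 2]), i


def name_triples(sorted_triples):
    """Streaming node: give equal triples equal names, in sorted order."""
    name, previous = 0, None
    for triple, i in sorted_triples:
        if triple != previous:
            name += 1
            previous = triple
        yield name, i


text = [ord(c) for c in "banana"]
# sorted() plays the role of the sorting node between the two streaming nodes.
named = list(name_triples(sorted(make_triples(text))))
print(named)
```

Because the generators are chained, the triples are never written out as a separate intermediate sequence between producer and sorter, which mirrors how pipelining avoids intermediate files.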


Experimental Setup
- g++ 3.2.3 -O2
- STXXL library [Dementiev 03] with new iterator-like pipelining feature

[Figure: test machine. 2x Xeon (4 threads), 1 GB DDR RAM, Intel E7500 chipset, PCI buses, controllers, channels, 8x80 GB. Bandwidth labels: 2x64x66 Mb/s, 4x2x100 MB/s, 8x45 MB/s, 400x64 Mb/s. Stray label: 128.]

Inputs
- Genome: Human Genome
- Gutenberg: 3 GByte English text from the Gutenberg project
- HTML: 3 GByte text from a crawl of .gov
- Source: 0.5 GByte Linux sources
- Random2: T T with T := randChar^(n/2)


Gutenberg I/Os

[Figure: I/O volume [byte] / n (0 to 1000) versus n (2^24 to 2^32) for Doubling, Quadrupling, Discarding, Quad-Discarding, and DC3.]


Gutenberg Time

[Figure: Gutenberg: Time [s] / n (0 to 80) versus n (2^24 to 2^32) for Doubling, Discarding, Quadrupling, Quad-Discarding, and DC3.]

I/O bound even for a single disk


Comparison with Previous Implementations
- 5x less I/O volume than [CF 97]
- 78x fewer clock cycles than [CF 97] (including BGS algorithm)
- 2.4x faster than internal compressed Genome [LSSSY 02]
- 1.2x slower than internal Genome on 64 GByte supercomputer [Sadakane Shibuya 01]
- Faster than linear time internal LCP computation on MPII's SUN Starfire 15000


Conclusion
- External DC3 is practical
- Better than pipelined, shuffled 4-tupling with improved discarding
- STXXL makes pipelining easy. Saves factor 23 in I/O volume.

Future Work
- Tune pipelined sorters
- Go parallel
- Larger difference covers for first iteration?
- Will discarding help for DC algorithms?

Terabytes overnight?


Random2 I/Os

[Figure: I/O volume [byte] / n (0 to 3500) versus n (2^24 to 2^32) for Doubling, Discarding, Quadrupling, Quad-Discarding, Skew, and nonpipelined.]


Random2 Time

[Figure: Random2: Time [s] / n (0 to 140) versus n (2^26 to 2^32) for nonpipelined, Doubling, Discarding, Quadrupling, Quad-Discarding, and DC3.]


Genome I/Os

[Figure: I/O volume [byte] / n (0 to 1000) versus n (2^24 to 2^32) for Doubling, Quadrupling, Discarding, Quad-Discarding, and Skew.]


Genome Time

[Figure: Genome: Time [s] / n (0 to 80) versus n (2^24 to 2^32) for Doubling, Quadrupling, Discarding, Quad-Discarding, and Skew.]


[Figure: Source. Two panels, I/O volume [byte] / n (0 to 600) and Time [s] / n (0 to 40), versus n (2^24 to 2^32) for Quadrupling, Quad-Discarding, and Skew.]


[Figure: HTML. Two panels, I/O volume [byte] / n (0 to 600) and Time [s] / n (0 to 40), versus n (2^24 to 2^32) for Quadrupling, Quad-Discarding, and Skew.]
