External Memory Suffix Array Construction: Roman Dementiev Juha Kärkkäinen Jens Mehnert
Suffix Arrays
Sort the suffixes T[i..n] of a string T[0..n] over the alphabet {1..n}.
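As a point of reference for the definition above, here is a minimal in-memory sketch that sorts suffix start positions by direct comparison. The function name `naive_suffix_array` is hypothetical, and this quadratic-or-worse baseline is not one of the external-memory algorithms the slides discuss; it only pins down what the output should be.

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Naive suffix array: sort the start positions 0..n-1 by comparing the
// suffixes character by character. For illustration only; the comparison
// cost makes this unsuitable for large inputs.
std::vector<int> naive_suffix_array(const std::string& t) {
    std::vector<int> sa(t.size());
    std::iota(sa.begin(), sa.end(), 0);  // 0, 1, ..., n-1
    std::sort(sa.begin(), sa.end(), [&t](int a, int b) {
        // Lexicographic comparison of t[a..] with t[b..].
        return t.compare(a, std::string::npos, t, b, std::string::npos) < 0;
    });
    return sa;
}
```

For the text banana this returns the positions 5, 3, 1, 0, 4, 2, corresponding to the sorted suffixes a, ana, anana, banana, na, nana.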
Applications
Example text: banana
scan(n) = n/B
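The cost measures used in these slides can be made concrete with a small calculation. The sketch below assumes the usual external-memory model with block size B and internal memory M, both counted in items. The `sort_ios` formula is the standard bound of about (n/B) passes times log base M/B of n/B, which is an assumption here rather than something spelled out on the slide. Both function names are hypothetical, and constant factors are ignored.

```cpp
#include <cstdint>

// Block transfers needed to read n items sequentially: ceil(n / B).
std::uint64_t scan_ios(std::uint64_t n, std::uint64_t B) {
    return (n + B - 1) / B;
}

// Rough sorting cost: one scan per merge pass, with fan-out M / B.
// Assumes M >= 2 * B so that the fan-out is at least 2.
std::uint64_t sort_ios(std::uint64_t n, std::uint64_t M, std::uint64_t B) {
    const std::uint64_t blocks = scan_ios(n, B);
    const std::uint64_t fan_out = M / B;
    std::uint64_t passes = 1;
    std::uint64_t reach = fan_out;  // blocks that `passes` passes can cover
    while (reach < blocks) {
        reach *= fan_out;
        ++passes;
    }
    return blocks * passes;
}
```

With n = 2^30, M = 2^20 and B = 2^10, a scan costs 2^20 block transfers and a sort costs two passes, that is 2^21 transfers under this simplified model.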
Related Work
Incremental:
(n/M) · scan(n)
[CF 97] not very scalable, a lot of internal work
Doubling: Sort by first 2^i characters in iteration i [Manber/Myers 93]
6h for 26 MByte.
very complicated
DC3: Simple, linear time, O(sort(n)) I/Os [KS 03]. Practical? Better than improved doubling?
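The doubling idea can be illustrated with a small in-memory sketch: in each round, every suffix is keyed by the pair (name of its first h characters, name of the following h characters), the pairs are sorted, and new names are assigned. This is a plain Manber/Myers-style loop under the assumption that everything fits in RAM. The external-memory variants on these slides replace the sort and the naming scan with pipelined external sorters. The function name `doubling_suffix_array` is hypothetical.

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <utility>
#include <vector>

// Prefix doubling: after the round with step h, rank[i] names the first
// 2h characters of suffix i (the end of the string sorts before any character).
std::vector<int> doubling_suffix_array(const std::string& t) {
    const int n = static_cast<int>(t.size());
    std::vector<int> sa(n), rank(n), next(n);
    std::iota(sa.begin(), sa.end(), 0);
    for (int i = 0; i < n; ++i) rank[i] = static_cast<unsigned char>(t[i]);

    for (int h = 1;; h *= 2) {
        // Pair step: combine the name at i with the name at i + h.
        auto key = [&](int i) {
            return std::make_pair(rank[i], i + h < n ? rank[i + h] : -1);
        };
        // Sort step.
        std::sort(sa.begin(), sa.end(),
                  [&](int a, int b) { return key(a) < key(b); });
        // Naming step: equal pairs share a name.
        if (n > 0) next[sa[0]] = 0;
        for (int k = 1; k < n; ++k)
            next[sa[k]] = next[sa[k - 1]] + (key(sa[k - 1]) < key(sa[k]) ? 1 : 0);
        rank = next;
        // Stop once every suffix has a unique name.
        if (n == 0 || rank[sa[n - 1]] == n - 1) break;
    }
    return sa;
}
```

Each round doubles the number of characters the names account for, so at most about log2(n) rounds are needed; the loop stops earlier when all names are already unique.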
[Figure: data-flow diagram with stages sort (form runs, merge), name, pair, sort; edge labels: 2n words, i bits]
Improved Discarding
Scan all unique suffixes [CF 97];
Scan new unique suffixes [Kärkkäinen 03]
Triples → pairs
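Discarding relies on detecting suffixes whose current name is already unique, since their final rank is then known. The slides do not spell out the exact rule, so the sketch below only shows that uniqueness test on a sequence of names in sorted order. The function name `unique_names` and the in-memory vector are assumptions for illustration.

```cpp
#include <cstddef>
#include <vector>

// names[k] is the name of the k-th suffix in the current sorted order
// (non-decreasing). A name is unique if neither neighbour carries it.
std::vector<bool> unique_names(const std::vector<int>& names) {
    std::vector<bool> unique(names.size());
    for (std::size_t k = 0; k < names.size(); ++k) {
        const bool same_prev = k > 0 && names[k - 1] == names[k];
        const bool same_next = k + 1 < names.size() && names[k + 1] == names[k];
        unique[k] = !same_prev && !same_next;
    }
    return unique;
}
```

For the names 0, 1, 1, 2 only the first and last entries are flagged; the two suffixes sharing name 1 still need further rounds.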
[Figure: data-flow diagram with discarding; stages: merge, pair; edge labels: 3N, 2N, 2n; outputs: fully discarded suffixes, partially discarded suffixes]
a-Tupling
Sort by first a^i characters in iteration i
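A quick way to see the effect of the tuple size a is to count rounds: if iteration i covers a^i characters, the worst case needs the smallest k with a^k >= n. The helper below is a hypothetical back-of-the-envelope calculation, not part of the presented algorithms, and real runs may stop earlier once all names are unique.

```cpp
#include <cstdint>

// Smallest k such that a^k >= n, i.e. the worst-case number of a-tupling
// rounds. Assumes a >= 2 and that n is small enough for len not to overflow.
int tupling_iterations(std::uint64_t n, std::uint64_t a) {
    int k = 0;
    for (std::uint64_t len = 1; len < n; len *= a) ++k;
    return k;
}
```

For n = 2^24 this gives 24 rounds for doubling (a = 2) and 12 rounds for quadrupling (a = 4).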
Sample positions: i mod 3 ∈ {1, 2}
T[3i..n] ≤ T[3j+1..n] iff (T[3i], name(T[3i+1..n])) ≤ (T[3j+1], name(T[3j+2..n]))
T[3i..n] ≤ T[3j+2..n] iff (T[3i], T[3i+1], name(T[3i+2..n])) ≤ (T[3j+2], T[3j+3], name(T[3j+4..n]))
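The two comparison rules can be checked with a small in-memory sketch. It uses absolute positions (i is a multiple of 3, j is a sample position) instead of the 3i / 3j+1 / 3j+2 indexing of the slide. The helper `sample_names` is a hypothetical stand-in for the recursive step: it obtains the names by naive sorting, which a real DC3 implementation would not do. Positions past the end get character 0 and name 0, assuming the text contains no NUL characters.

```cpp
#include <algorithm>
#include <string>
#include <tuple>
#include <vector>

// Stand-in for the recursion: name[p] is the 1-based rank of t[p..] among
// the sample suffixes (p mod 3 in {1, 2}); entries with p mod 3 == 0 stay 0.
std::vector<int> sample_names(const std::string& t) {
    std::vector<int> pos;
    for (int p = 0; p < static_cast<int>(t.size()); ++p)
        if (p % 3 != 0) pos.push_back(p);
    std::sort(pos.begin(), pos.end(), [&t](int a, int b) {
        return t.compare(a, std::string::npos, t, b, std::string::npos) < 0;
    });
    std::vector<int> name(t.size(), 0);
    for (int r = 0; r < static_cast<int>(pos.size()); ++r) name[pos[r]] = r + 1;
    return name;
}

// True iff the mod-0 suffix at i is <= the sample suffix at j, using only
// one or two characters plus the names of sample suffixes.
bool mod0_leq_sample(const std::string& t, const std::vector<int>& name,
                     int i, int j) {
    const int n = static_cast<int>(t.size());
    auto ch = [&](int p) { return p < n ? static_cast<unsigned char>(t[p]) : 0; };
    auto nm = [&](int p) { return p < n ? name[p] : 0; };
    if (j % 3 == 1)  // i+1 and j+1 are both sample positions
        return std::make_tuple(ch(i), nm(i + 1)) <=
               std::make_tuple(ch(j), nm(j + 1));
    // j % 3 == 2: i+2 and j+2 are both sample positions
    return std::make_tuple(ch(i), ch(i + 1), nm(i + 2)) <=
           std::make_tuple(ch(j), ch(j + 1), nm(j + 2));
}
```

The point of the rules is that each comparison takes constant time: after at most two characters, both remaining suffixes start at sample positions, so their precomputed names decide the order.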
Pipelined DC3
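[Figure: data-flow graph of pipelined DC3 built from file nodes, streaming nodes, and sorting nodes; stages: triple, name, recurse, permute, tuple, mod 0, output; edges are labelled with data volumes as multiples of n, such as 4n/3 and 5n/3]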
Experimental Setup
g++ 3.2.3 -O2
STXXL library [Dementiev 03] with new iterator-like pipelining feature
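To illustrate what an iterator-like pipelining interface can look like, here is a minimal sketch in plain C++. It is not STXXL code: the classes `VectorSource` and `WithPosition` and the function `drain` are hypothetical, and the three-operation interface (`empty()`, `operator*`, `operator++`) is an assumption chosen for the example. The idea shown is that a streaming node pulls items from its predecessor on demand, so intermediate results never have to be written out in full.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical source node: streams the elements of an in-memory vector.
class VectorSource {
    const std::vector<int>& data_;
    std::size_t pos_ = 0;
public:
    explicit VectorSource(const std::vector<int>& data) : data_(data) {}
    bool empty() const { return pos_ == data_.size(); }
    int operator*() const { return data_[pos_]; }
    VectorSource& operator++() { ++pos_; return *this; }
};

// Hypothetical streaming node: tags each upstream item with its position,
// pulling one item at a time from the node before it.
template <class In>
class WithPosition {
    In& in_;
    int index_ = 0;
public:
    explicit WithPosition(In& in) : in_(in) {}
    bool empty() const { return in_.empty(); }
    std::pair<int, int> operator*() const { return std::make_pair(*in_, index_); }
    WithPosition& operator++() { ++in_; ++index_; return *this; }
};

// Sink: pulls every item through the pipeline and collects the results.
template <class T, class In>
std::vector<T> drain(In& in) {
    std::vector<T> out;
    for (; !in.empty(); ++in) out.push_back(*in);
    return out;
}
```

Chaining `VectorSource` into `WithPosition` and draining it yields (item, position) pairs without ever storing the intermediate stream, which is the kind of saving in I/O volume that pipelining aims for.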
[Figure: block diagram of the test machine, Intel E7500 chipset; bandwidth labels: 2x64x66 Mb/s, 4x2x100 MB/s, 8x45 MB/s, 400x64 Mb/s]
Inputs:
3 GByte text from a crawl of .gov
0.5 GByte Linux sources
Random2: TT with T := randChar^(n/2), i.e. two concatenated copies of a random string of length n/2
Gutenberg I/Os
[Plot: y-axis 0 to 1000, x-axis n from 2^24 to 2^32; curve data not recoverable]
Gutenberg Time
[Plot: Gutenberg, "Time [s] / n" (y-axis 0 to 80) against n from 2^24 to 2^32; legend: Doubling, Discarding, Quadrupling, Quad-Discarding, DC3]
Conclusion
External DC3 is practical
Better than pipelined, shuffled 4-tupling with improved discarding
STXXL makes pipelining easy. Saves a factor of 2-3 in I/O volume.
Future Work
Tune pipelined sorters
Go parallel
Larger difference covers for first iteration?
Will discarding help for DC algorithms?
Terabytes over night?
Random2 I/Os
[Plot: Random2, "I/O Volume [byte] / n" (y-axis 0 to 3500) against n from 2^24 to 2^32; legend entries: Doubling, Discarding, Quadrupling, Quad-Discarding, Skew, nonpipelined]
Random2 Time
[Plot: Random2, "Time [s] / n" (y-axis 0 to 140) against n from 2^26 to 2^32; legend entries: nonpipelined, Doubling, Discarding, Quadrupling, Quad-Discarding, DC3]
Genome I/Os
[Plot: y-axis 0 to 1000, x-axis n from 2^24 to 2^32; curve data not recoverable]
Genome Time
[Plot: Genome, "Time [s] / n" (y-axis 0 to 80) against n from 2^24 to 2^32; legend: Doubling, Quadrupling, Discarding, Quad-Discarding, Skew]