HPDC'23 Rapidgzip

HPDC ’23
Rapidgzip: Parallel Decompression and Seeking in
Gzip Files Using Cache Prefetching
Maximilian Knespel, Holger Brunst
Technische Universität Dresden

Slide 2 of 20
Motivation
Accessing huge datasets, e.g., from academictorrents.com:
● wikidata-20220103-all.json.gz: gzip-compressed JSON, 109 GB, 1.4 TB uncompressed
● ImageNet21K: gzip-compressed TAR archive, 1.2 TB, 14 million images averaging 9 KiB
Solutions:
● ratarmount: Random access TAR mount. Makes the contents of (huge) archives available via FUSE.
● rapidgzip, indexed_bzip2: Backends for ratarmount for parallel decompression and fast seeking inside gzip- and bzip2-compressed files. They also offer command line tools for parallelized decompression.

Slide 3 of 20
Requirements
● Parallelize gzip decompression
  ● Without additional metadata
  ● After any seeking
  ● For concurrent accesses at two offsets
● Decompress all kinds of gzip files
● Enable fast backward and forward seeking
  ● After the index has been created
  ● While the index has only been partially created
● Usable as a (Python-)library

Slide 4 of 20
Introducing rapidgzip
● rapidgzip: Random access parallel (indexed) decompression for gzip files
● Parallel decompression of gzip files
● Decompression is faster and less memory intensive with an existing index
● The index enables seeking without having to start decompression from the beginning of the file
● Header-only C++ library with Python bindings:
  > pip install rapidgzip
● Also has a command line interface that can be used as a drop-in replacement for decompression:
  gzip -d → rapidgzip -d
● Not for compression
● https://github.com/mxmlnkn/rapidgzip
Figure: Decompression benchmarks on a 12 GB FASTQ file using 64 cores of an AMD EPYC 7702 @ 2.0 GHz processor. Bandwidth in GB/s:
  gzip               0.177
  pigz               0.301
  igzip              0.78
  pugz               1.4
  rapidgzip          5.3   (30× gzip)
  rapidgzip (index)  13.1  (74× gzip)

Slide 5 of 20
Tools Overview
● 1992: gzip by Jean-loup Gailly and Mark Adler
● 2005: zlib/examples/zran.c example by Mark Adler
  ● Shows how to resume decompression in the middle of a gzip stream
● 2007: pigz (parallel implementation of gzip) by Mark Adler
  ● Compresses in parallel
● 2008: Blocked GNU Zip Format (BGZF) and the command line tool bgzip, part of HTSlib
  ● Compresses in parallel to gzip files with additional metadata
  ● Can decompress files containing such metadata in parallel
  ● James K. Bonfield and others, "HTSlib: C library for reading/writing high-throughput sequencing data", GigaScience, Volume 10, Issue 2, February 2021
● 2016: indexed_gzip: Python module for random access based on zran.c
● 2019: pugz
  ● Can decompress gzip-compressed files in parallel if the decompressed data only contains characters in the range 9–126
  ● Kerbiriou, Maël, and Rayan Chikhi. "Parallel decompression of gzip-compressed files and random access to DNA sequences." 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, 2019.
→ What makes parallel decompression so difficult?
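The zran.c idea of resuming in the middle of a stream can be illustrated with Python's standard zlib module. This is only a minimal sketch, not how zran.c or rapidgzip are implemented: it assumes a raw Deflate stream and a byte-aligned block boundary created with Z_SYNC_FLUSH, and the helper names are made up for illustration. The last 32 KiB of the preceding output are passed as the window.

```python
import zlib

WINDOW_SIZE = 32 * 1024  # Deflate back-references reach at most 32 KiB back


def compress_in_two_parts(first, second):
    """Compress two byte strings into one raw Deflate stream and return both halves.

    Z_SYNC_FLUSH ends the first half at a byte-aligned block boundary but keeps
    the compressor's window, so the second half may reference the first one.
    """
    compressor = zlib.compressobj(level=6, wbits=-15)  # wbits=-15: raw Deflate
    part1 = compressor.compress(first) + compressor.flush(zlib.Z_SYNC_FLUSH)
    part2 = compressor.compress(second) + compressor.flush()
    return part1, part2


def resume(part2, preceding_output):
    """Decompress only the second half, given the data that precedes it."""
    window = preceding_output[-WINDOW_SIZE:]
    decompressor = zlib.decompressobj(wbits=-15, zdict=window)
    return decompressor.decompress(part2) + decompressor.flush()


first = b"How much wood would a woodchuck chuck " * 100
second = b"if a woodchuck could chuck wood? " * 100
part1, part2 = compress_in_two_parts(first, second)
resumed = resume(part2, first)
```

An index in this spirit would store, for each seek point, the compressed offset together with such a window, so that decompression does not have to restart at the beginning of the file.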

Slide 6 of 20
Deflate Compression
Example input:
  How⎵much⎵wood⎵would⎵a⎵woodchuck⎵chuck⎵
  if⎵a⎵woodchuck⎵could⎵chuck⎵wood?
LZSS replaces repeated sequences with (distance, length) back-references:
  How⎵much⎵wood⎵would⎵a(13,5)chuck⎵(6,6)
  if(21,13)c(39,5)(12,6)(22,4)?
Huffman coding then encodes the symbols with codes from the Huffman tree:
  11110|10|11111|0|11101|...
  H     o  w     ⎵ m

Slide 7 of 20
Parallelized Deflate Decompression
Challenges for parallel decompression:
● Find offsets in the bit stream to start decompression from
● Handle references to unknown data
  → Two-staged decompression as introduced by Kerbiriou and Chikhi (2019)
● Non-resolvable references result in markers that get resolved in the second stage after the 32 KiB window has become known
Example: let thread 2 decompress line 2 of the LZSS output
  How⎵much⎵wood⎵would⎵a(13,5)chuck⎵(6,6)
  if(21,13)c(39,5)(12,6)(22,4)?
First stage (numbers are markers that point into the not yet known window):
  if 20 21 22 23 24 25 26 27 28 29 30 31 32 c 16 17 18 19 20 chuck⎵wood?
Second stage: marker replacement using the window for line 2, which is the decompressed line 1.
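The two stages can be tried out on the woodchuck example. The following is a toy sketch under stated assumptions, not the pugz or rapidgzip implementation: tokens are plain Python strings and (distance, length) tuples instead of Huffman-coded bits, ordinary spaces replace ⎵, and the convention that symbol values of 256 and above are markers is chosen for this illustration only.

```python
WINDOW_SIZE = 32 * 1024  # Deflate back-references reach at most 32 KiB back
MARKER_BASE = 256        # assumption of this sketch: values >= 256 are markers

LINE1_TOKENS = ["How much wood would a", (13, 5), "chuck ", (6, 6)]
LINE2_TOKENS = ["if", (21, 13), "c", (39, 5), (12, 6), (22, 4), "?"]


def first_stage(tokens, window_size=WINDOW_SIZE):
    """Decode LZSS tokens without knowing the data that precedes them.

    The unknown window is filled with markers. Back-references that reach into
    it simply copy those markers, like any other symbol.
    """
    out = [MARKER_BASE + i for i in range(window_size)]
    for token in tokens:
        if isinstance(token, str):
            out.extend(ord(char) for char in token)
        else:
            distance, length = token
            for _ in range(length):
                out.append(out[-distance])
    return out[window_size:]


def second_stage(symbols, window, window_size=WINDOW_SIZE):
    """Replace markers with the characters of the now known window."""
    window = window[-window_size:]
    padding = window_size - len(window)  # the window is right-aligned
    chars = []
    for symbol in symbols:
        if symbol < MARKER_BASE:
            chars.append(chr(symbol))
            continue
        index = symbol - MARKER_BASE - padding
        if index < 0:
            raise ValueError("reference reaches before the start of the data")
        chars.append(window[index])
    return "".join(chars)


# Thread 1 starts at the beginning of the stream, so its window is empty.
line1 = second_stage(first_stage(LINE1_TOKENS), "")
# Thread 2 decodes line 2 in parallel and resolves its markers afterwards.
line2_with_markers = first_stage(LINE2_TOKENS)
line2 = second_stage(line2_with_markers, line1)
```

Because the first stage does not depend on the preceding data, it can run for all chunks at the same time. Only the comparatively cheap marker replacement has to wait for the window.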

Slide 8 of 20
Granularities for Parallelization
Gzip File: Gzip Stream | Gzip Stream | ...
Gzip Stream: Gzip Header | Deflate Stream | Gzip Footer
Deflate Stream: Deflate Block | Deflate Block | ...
Deflate Block: Final Block Flag | Block Type | Block Data
● Block Type: 00, Non-Compressed: Length | ~Length | Non-Compressed Data
● Block Type: 01, Dynamic Huffman: Tree Lengths | Huffman Trees | Compressed Data
● Block Type: 10, Fixed Huffman: Compressed Data
Often, there is only one Gzip stream per file.
→ Parallelize decompression of Deflate blocks
Slide 9 of 20
Redundancies that Help in Finding Deflate Blocks
[Figure: the gzip file structure diagram from the previous slide (Gzip File → Gzip Stream → Deflate Stream → Deflate Block → Block Type and Block Data).]
Fixed Huffman blocks are mostly used for very small files and the tail end of streams.

Slide 10 of 20
How Often Will Valid-Looking Block Headers be Found in Random Data?
Huffman code lengths for three symbols A, B, C:
● Invalid: A: 1, B: 1, C: 1
● Non-optimal: A: 2, B: 2, C: 2
● Valid and optimal: A: 2, B: 2, C: 1
● Deflate blocks with Fixed Huffman coding: ignored because they are rare
● Non-Compressed Deflate blocks: look for 16-bit lengths followed by their one’s complement
  → 1 false positive per 525 kB
● Deflate blocks with Dynamic Huffman coding: look for valid Deflate block headers and valid and optimal Huffman codings
  → ~200 offsets pass this test in 1 Tbit of random data, i.e., 1 false positive per 625 MB (10¹² bit / 200 = 5·10⁹ bit = 625 MB)
● Pugz reduces false positives further by checking that the decompressed data only contains characters in the range 9–126, an assumption made for FASTQ files
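Two of the checks above can be sketched in a few lines. This is a simplified illustration, not the rapidgzip block finder: the first function only tests the length field and its one's complement at byte offsets, and the second classifies code lengths with the Kraft sum (the sum of 2^-length over all symbols), ignoring special cases of the Deflate format. All names are made up for this example.

```python
import struct
import zlib
from fractions import Fraction


def stored_block_candidates(data):
    """Byte offsets at which a 16-bit length is followed by its one's complement."""
    hits = []
    for offset in range(len(data) - 3):
        length, inverted = struct.unpack_from("<HH", data, offset)
        if length ^ inverted == 0xFFFF:
            hits.append(offset)
    return hits


def classify_code_lengths(lengths):
    """Classify Huffman code lengths using the Kraft sum."""
    kraft_sum = sum(Fraction(1, 2 ** length) for length in lengths if length > 0)
    if kraft_sum > 1:
        return "invalid"            # more codes than the code space can hold
    if kraft_sum < 1:
        return "non-optimal"        # some of the code space stays unused
    return "valid and optimal"


# Hand-built final Non-Compressed block: header byte, Length, ~Length, data.
payload = b"hello world"
block = bytes([0x01]) + struct.pack("<HH", len(payload), len(payload) ^ 0xFFFF) + payload
candidates = stored_block_candidates(b"AAAA" + block)
```

In this example the only candidate is offset 5, which is where the Length field starts after four filler bytes and the header byte.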

Slide 11 of 20
False Positives Can Be Crafted Using Non-Compressed Blocks
Example: Consider a gzip file that is compressed a second time.
example.gz: Gzip Header | Deflate Stream (Final Block Flag, Block Type, Block Data) | Gzip Footer
example.gz.gz: Gzip Header | Deflate Stream (Final Block Flag, Block Type: Non-Compressed, Length, ~Length, Non-Compressed Data) | Gzip Footer
[Figure: example.gz is contained in the Non-Compressed Data of example.gz.gz. Labels: "Block Boundary of Interest" and "False Positive".]
Sequences inside the compressed Block Data might be recognized as Deflate block headers
→ These cases need to be detected and handled
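Such a file is easy to construct. The sketch below forces Non-Compressed blocks with compression level 0, which is an assumption made to keep the example deterministic, and keeps the input small so that a single Non-Compressed block suffices. The inner gzip stream, including its Deflate block headers, then appears byte for byte inside the outer file.

```python
import gzip


def nested_gzip(data):
    """Return (example.gz, example.gz.gz) as byte strings."""
    inner = gzip.compress(data)
    # Level 0 stores the input in Non-Compressed Deflate blocks.
    outer = gzip.compress(inner, compresslevel=0)
    return inner, outer


data = b"How much wood would a woodchuck chuck " * 20
inner, outer = nested_gzip(data)
position = outer.find(inner)  # where the inner file starts inside the outer one
```

A block finder that scans the outer file will therefore find the valid block headers of the inner stream, although they are not block boundaries of the outer stream.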

Slide 12 of 20
Implementation
[Figure: compressed input data with guessed chunk boundaries at 0, 100, 200, and 300 for the chunks C₁ to C₄, which are processed by a thread pool and stored in a cache and an index. Labels: Window, First Block Offset, False Positive, Guessed Chunk Boundary, Marker.]
First access:
① Request chunk at offset 0
② Dispatch
③ Prefetch
④ Find block start
⑤ First-stage decompression
⑥ Periodically check for ready chunks and move them into the cache until C₁ has become ready
⑦ Add C₁ information to the index
⑧ Return decompressed C₁
Cache (offset → chunk): 0 → C₁, 111 → C₂, 220 → C₃, 305 → C₄
Index (offset, size, decompressed size; each entry also stores a window):
  0    111  340
Second access:
① Request chunk at 111 directly after C₁
② Found chunk in cache but with markers
③ Resolve markers in windows
④ Add resolved windows to the index
⑤ Resolve the markers inside each chunk in parallel using the thread pool
⑥ Return decompressed C₂
No prefetched chunk matches the end offset of C₃ (327, while C₄ was cached under offset 305). The chunk after will be decompressed on demand.
Index (offset, size, decompressed size; each entry also stores a window):
  0    111  340
  111  109  421
  220  107  123
● Prefetching to generate work that can be processed in parallel
● Thread pool for work balancing
● Cache to speed up seeking and concurrent decompression
● Use block offset as cache key to catch false positives
● On-demand cache fill to recover from errors
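The interplay of prefetching, the offset-keyed cache, and the on-demand fill can be sketched with a toy model. The numbers are taken from the figure. Everything else is an assumption of this sketch: decode_chunk is a stand-in that only returns where a chunk ends, and the block finder results are hard-coded, including the false positive at 305.

```python
from concurrent.futures import ThreadPoolExecutor

TRUE_BLOCK_STARTS = [0, 111, 220, 327]  # real chunk starts in this toy file
FILE_END = 400                          # hypothetical end of the compressed data
# Guessed chunk boundary -> offset reported by the block finder.
BLOCK_FINDER_RESULTS = {0: 0, 100: 111, 200: 220, 300: 305}


def decode_chunk(start):
    """Stand-in for the first-stage decompression: return (start, end offset)."""
    later_starts = [offset for offset in TRUE_BLOCK_STARTS if offset > start]
    return start, (later_starts[0] if later_starts else FILE_END)


def read_sequentially(pool_size=4):
    """Read all chunks in order and report cache hits and misses by offset."""
    cache = {}  # found block offset -> future holding the decoded chunk
    hits, misses = [], []
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        # Prefetch: one task per guessed chunk, keyed by the found block offset.
        for found_offset in BLOCK_FINDER_RESULTS.values():
            cache[found_offset] = pool.submit(decode_chunk, found_offset)
        offset = 0
        while offset < FILE_END:
            future = cache.get(offset)
            if future is None:
                # No cached chunk starts exactly here: decode on demand.
                misses.append(offset)
                future = pool.submit(decode_chunk, offset)
            else:
                hits.append(offset)
            _, offset = future.result()  # the next chunk starts where this one ends
    return hits, misses


hits, misses = read_sequentially()
```

Since the chunk found at the false positive 305 is never requested by its exact offset, it is simply ignored, and the chunk starting at 327 is decompressed on demand.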

Slide 15 of 20
Optimal Chunk Size
[Figure: decompression bandwidth in MB/s (1000 to 3000) over the chunk size in MiB (0.125 to 1024) for rapidgzip and pugz. Top axis: theoretical number of chunks (49808, 12452, 3113, 779, 390, 195, 98, 49, 25, 13, 7). Annotation: 3.8×.]
Decompression bandwidth using 16 cores and a 6.08 GiB test file, which decompresses to 8 GiB.
● For smaller chunk sizes, the block finder overhead leads to worse performance
● Larger chunk sizes lead to work balancing issues and might also adversely affect the cache behavior and allocation speed
Optimal chunk sizes: rapidgzip: 4–8 MiB, pugz: 32–64 MiB
DBF ... Dynamic Block Finder
NBF ... Non-Compressed Block Finder
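The theoretical number of chunks on the top axis follows from dividing the compressed file size by the chunk size and rounding up. A short check, assuming the rounded file size of 6.08 GiB from the caption:

```python
import math

FILE_SIZE_MIB = 6.08 * 1024  # compressed test file, rounded value from the caption


def theoretical_chunk_count(chunk_size_mib, file_size_mib=FILE_SIZE_MIB):
    """Number of chunks needed to cover the compressed file."""
    return math.ceil(file_size_mib / chunk_size_mib)


chunk_sizes_mib = [0.125, 0.5, 2, 8, 16, 32, 64, 128, 256, 512, 1024]
chunk_counts = [theoretical_chunk_count(size) for size in chunk_sizes_mib]
```

With 16 cores, chunk sizes of 512 MiB and above yield fewer chunks than cores, which illustrates the work balancing issue for large chunks.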

Slide 16 of 20
Decompression Benchmark: FASTQ File
Weak scaling, writing the results to /dev/null
● gzip is the slowest, roughly half as fast as the zlib-based pigz
● rapidgzip with an index scales up to 20 GB/s, without an index up to 5 GB/s
● pugz tops out at 1.4 GB/s and crashes for 96+ cores
● igzip by Intel shows leading single-core performance
● pigz does not parallelize decompression
[Figure: bandwidth in MB/s (logarithmic axis, 100 to 20000) over the number of cores (1 to 128) for rapidgzip (index), rapidgzip (no index), linear scaling (no index), pugz, pigz, igzip, and gzip.]

Slide 18 of 20
Benchmark: Various Gzip Compressors → Rapidgzip Decompression
Rapidgzip decompression bandwidth in GB/s by the tool used for compression:
  pigz -9       3.73
  pigz -6       3.76
  pigz -3       3.81
  pigz -1       3.82
  igzip -3      6.52
  igzip -2      6.42
  igzip -1      6.15
  igzip -0      0.1586
  gzip -9       5.03
  gzip -6       5.17
  gzip -3       5.55
  gzip -1       6.05
  bgzip -l 9    5.64
  bgzip -l 6    5.67
  bgzip -l 3    5.9
  bgzip -l 0    10.6
  bgzip -l -1   5.65
● rapidgzip can parallelize decompression for gzip files produced with a wide variety of tools and compression levels
● bgzip -l 0: contains only Non-Compressed Deflate blocks so that decompression is reduced to a fast copy and some accounting
● igzip -0: contains only a single Deflate block with Fixed Huffman coding and therefore cannot be parallelized

Slide 19 of 20
Benchmark: Competing File Formats
Decompression bandwidth in GB/s (compression tool → decompression tool):
128 cores:
  pzstd → pzstd       8.8
  gzip → rapidgzipⁱ   16.43
  gzip → rapidgzip    5.13
  bgzip → bgzip       5.5
16 cores:
  pzstd → pzstd       6.78
  zstd → pzstd        0.882
  gzip → rapidgzipⁱ   4.25
  gzip → rapidgzip    1.86
  gzip → bgzip        0.3017
  bgzip → bgzip       2.82
1 core:
  pzstd → pzstd       0.811
  zstd → pzstd        0.816
  zstd → zstd         0.82
  gzip → igzip        0.656
  gzip → rapidgzip    0.1527
  gzip → bgzip        0.2965
  bgzip → bgzip       0.2977
ⁱ rapidgzip decompression with an existing index
Annotations in the figure: 3×, +25%
● igzip is surprisingly competitive with zstd
● zstd is the fastest in single-core decompression
● bgzip and pzstd can only decompress in parallel those gzip/zstd files that they produced themselves
● zstd-compressed (TAR) files are not eligible for random access and parallel decompression. pzstd is recommended instead to create multi-frame Zstandard files.
● Applying the rapidgzip approach to arbitrary Zstandard files might be infeasible because their window size is not limited to 32 KiB.

Slide 21 of 20
Improvements Since Submission
● High memory usage has been alleviated by limiting the decompressed chunk size
  Worst case (compression ratio ~1000):
  was: ~9 GB per thread
  now: ~200 MB per thread (configurable)
● The Inflate implementation has been improved for high compression ratios
  → 25 % faster for Silesia by using memcpy/memset for long references
● CRC32 computation has been added
  The slice-by-16 algorithm has been implemented and parallelized using crc32_combine.
  → Achieves ~4 GB/s per core (~6 % overhead independent of parallelism)
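How crc32_combine makes the checksum parallelizable can be sketched in Python. The combine function below is a compact port of the well-known matrix-based algorithm from zlib, written out here because older Python versions do not expose it. It is an illustration, not the slice-by-16 code used in rapidgzip, and the helper names are made up.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def _matrix_times(matrix, vector):
    """Multiply a 32x32 matrix over GF(2) with a 32-bit vector."""
    result, row = 0, 0
    while vector:
        if vector & 1:
            result ^= matrix[row]
        vector >>= 1
        row += 1
    return result


def _matrix_square(matrix):
    return [_matrix_times(matrix, column) for column in matrix]


def crc32_combine(crc1, crc2, len2):
    """CRC32 of A + B, given crc32(A), crc32(B), and len(B)."""
    # Operator that appends one zero bit to the data behind crc1.
    operator = [0xEDB88320] + [1 << n for n in range(31)]
    operator = _matrix_square(_matrix_square(operator))  # four zero bits
    while len2:
        operator = _matrix_square(operator)  # 1, 2, 4, ... zero bytes
        if len2 & 1:
            crc1 = _matrix_times(operator, crc1)
        len2 >>= 1
    return crc1 ^ crc2


def parallel_crc32(chunks):
    """Checksum each chunk in a thread pool, then combine the results in order."""
    with ThreadPoolExecutor() as pool:
        checksums = list(pool.map(zlib.crc32, chunks))
    total = 0
    for checksum, chunk in zip(checksums, chunks):
        total = crc32_combine(total, checksum, len(chunk))
    return total


data = bytes(range(256)) * 1000
chunks = [data[i:i + 50_000] for i in range(0, len(data), 50_000)]
combined = parallel_crc32(chunks)
```

The combination step only depends on the chunk lengths and checksums, so its cost is negligible compared to checksumming the decompressed chunks themselves.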

Slide 22 of 20
Summary
● We have shown that the specialized approach for parallelized gzip decompression introduced by Kerbiriou and Chikhi (2019) can be generalized without affecting performance and stability.
● Our architecture achieves better performance, scales to more cores, adds robustness against false positives, and also increases versatility by adding fast seeking capabilities.
● An index is created internally on first-time decompression, and it can be exported and imported to speed up subsequent decompression and seeking.
● Can be used with ratarmount to mount .tar.gz archives.
● Available at https://github.com/mxmlnkn/rapidgzip
Figure: Decompression benchmarks on a 12 GB FASTQ file using 64 cores of an AMD EPYC 7702 @ 2.0 GHz processor. Bandwidth in GB/s: gzip 0.177, pigz 0.301, igzip 0.78, pugz 1.4, rapidgzip 5.3 (30× gzip), rapidgzip (index) 13.1 (74× gzip).
