HPDC ’23
Rapidgzip: Parallel Decompression and Seeking in
Gzip Files Using Cache Prefetching
Maximilian Knespel, Holger Brunst
Technische Universität Dresden
Motivation
Accessing huge datasets, e.g., from academictorrents.com:
● wikidata-20220103-all.json.gz: gzip-compressed JSON, 109 GB, 1.4 TB uncompressed
● ImageNet21K: gzip-compressed TAR archive, 1.2 TB, 14 million images averaging 9 KiB
Solutions:
● ratarmount: Random access TAR mount. Makes the contents of (huge) archives available via FUSE.
● rapidgzip, indexed_bzip2: Backends for ratarmount for parallel decompression and fast seeking inside gzip- and bzip2-compressed files. They also offer command line tools for parallelized decompression.
Requirements
● Parallelize gzip decompression
    ● Without additional metadata
    ● After any seeking
    ● For concurrent accesses at two offsets
● Decompress all kinds of gzip files
● Enable fast backward and forward seeking
    ● After the index has been created
    ● While the index has only been partially created
● Usable as a (Python) library
Introducing rapidgzip
● rapidgzip: Random access parallel (indexed) decompression for gzip files
● Parallel decompression of gzip files
● Decompression is faster and less memory intensive with an existing index
● The index enables seeking without having to start decompression from the file beginning
● Header-only C++ library with Python bindings:
  > pip install rapidgzip
● Also has a command line interface that can be used as a drop-in replacement for decompression:
  gzip -d → rapidgzip -d
● Not for compression
● https://github.com/mxmlnkn/rapidgzip
[Figure: bar chart of decompression bandwidth in GB/s. Decompression benchmarks on a 12 GB FASTQ file using 64 cores of an AMD EPYC 7702 @ 2.0 GHz processor.]
gzip: 0.177, pigz: 0.301, igzip: 0.78, pugz: 1.4, rapidgzip: 5.3 (30× gzip), rapidgzip (index): 13.1 (74× gzip)
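What an index entry has to contain can be sketched with Python's standard zlib module. This is an illustration only, not rapidgzip's implementation or index format: it assumes a raw Deflate stream whose seek point is byte-aligned (forced here with Z_SYNC_FLUSH), whereas real gzip files need bit-level offsets. The names SeekPoint and resume_at are hypothetical.

```python
import zlib
from dataclasses import dataclass

WINDOW_SIZE = 32 * 1024  # Deflate back-references reach at most 32 KiB back


@dataclass
class SeekPoint:
    """One hypothetical index entry: where to resume and what came before."""
    compressed_offset: int    # byte offset of a Deflate block start
    decompressed_offset: int  # position of that block's output in the plain data
    window: bytes             # last (up to) 32 KiB of output before the block


def resume_at(raw_deflate: bytes, point: SeekPoint) -> bytes:
    """Decompress from a seek point instead of from the file beginning."""
    # For raw Deflate (negative wbits), zdict preloads the history window so
    # that back-references into earlier data can be resolved.
    inflater = zlib.decompressobj(wbits=-zlib.MAX_WBITS, zdict=point.window)
    return inflater.decompress(raw_deflate[point.compressed_offset:])


# Build a small raw Deflate stream with one byte-aligned block boundary.
first = b"How much wood would a woodchuck chuck " * 100
second = b"if a woodchuck could chuck wood? " * 100
deflater = zlib.compressobj(6, zlib.DEFLATED, -zlib.MAX_WBITS)
head = deflater.compress(first) + deflater.flush(zlib.Z_SYNC_FLUSH)
raw = head + deflater.compress(second) + deflater.flush()

point = SeekPoint(len(head), len(first), first[-WINDOW_SIZE:])
tail = resume_at(raw, point)
```

The stored window is what makes the seek point usable: the second part may reference bytes of the first part, and those references only resolve because the window is preloaded.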
Tools Overview
● 1992: gzip by Jean-loup Gailly and Mark Adler
● 2005: zlib/examples/zran.c example by Mark Adler
    ● Shows how to resume decompression in the middle of a gzip stream
● 2007: pigz (parallel implementation of gzip) by Mark Adler
    ● Compresses in parallel
● 2008: Blocked GNU Zip Format (BGZF) and the command line tool bgzip, part of HTSlib
    ● Compresses in parallel to gzip files with additional metadata
    ● Can decompress files containing such metadata in parallel
    ● James K. Bonfield and others, "HTSlib: C library for reading/writing high-throughput sequencing data", GigaScience, Volume 10, Issue 2, February 2021
● 2016: indexed_gzip: Python module for random access based on zran.c
● 2019: pugz
    ● Can decompress gzip-compressed files in parallel if they only contain characters 9–126
    ● Kerbiriou, Maël, and Rayan Chikhi. "Parallel decompression of gzip-compressed files and random access to DNA sequences." 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, 2019.
→ What makes parallel decompression so difficult?
Challenges for Parallel Decompression:
● Find Deflate block start offsets in the bit stream
● Handling references to unknown data
  → Two-Staged Decompression as introduced by Kerbiriou and Chikhi (2019)
    ● Non-resolvable references result in markers that get resolved in the second stage after the 32 KiB window has become known
[Figure: Deflate compression and two-stage decompression]
Deflate Compression
  Input:
    How⎵much⎵wood⎵would⎵a⎵woodchuck⎵chuck⎵
    if⎵a⎵woodchuck⎵could⎵chuck⎵wood?
  LZSS, references written as (distance, length):
    How⎵much⎵wood⎵would⎵a(13,5)chuck⎵(6,6)
    if(21,13)c(39,5)(12,6)(22,4)?
  Huffman Coding (using a Huffman tree): 11110|10|11111|0|11101|... for H o w ⎵ m
Let Thread 2 Decompress Line 2: if(21,13)c(39,5)(12,6)(22,4)?
  First Stage (numbers are markers pointing into the unknown window):
    if 20 21 22 23 24 25 26 27 28 29 30 31 32 c 16 17 18 19 20 chuck⎵wood?
  Marker Replacement Window For Line 2 (positions 1 to 38):
    How⎵much⎵wood⎵would⎵a⎵woodchuck⎵chuck⎵
  Second Stage:
    if⎵a⎵woodchuck⎵could⎵chuck⎵wood?
Parallelized Deflate Decompression
Challenges for Parallel Decompression:
● Find offsets in the bit stream to start decompression from
● Handling references to unknown data
  → Two-Staged Decompression as introduced by Kerbiriou and Chikhi (2019)
    ● Non-resolvable references result in markers that get resolved in the second stage after the 32 KiB window has become known
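The two stages can be sketched in a few lines of Python on the woodchuck example. This is a toy model of the idea, not rapidgzip's or pugz's code: it works on already Huffman-decoded tokens, and the names Marker, first_stage, and second_stage are hypothetical. A marker here records how far before the chunk start the missing byte lies.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Marker:
    """Placeholder for a byte `back` positions before the chunk start."""
    back: int  # 1 means the last byte of the preceding window


Token = Union[str, Tuple[int, int]]  # literal text or (distance, length)
Symbol = Union[str, Marker]


def first_stage(tokens: List[Token]) -> List[Symbol]:
    """Decode without knowing the preceding window; emit markers instead."""
    out: List[Symbol] = []
    for token in tokens:
        if isinstance(token, str):
            out.extend(token)
            continue
        distance, length = token
        for _ in range(length):
            source = len(out) - distance
            # A reference before the chunk start cannot be resolved yet.
            # Copying an earlier marker simply propagates it.
            out.append(out[source] if source >= 0 else Marker(back=-source))
    return out


def second_stage(symbols: List[Symbol], window: str) -> str:
    """Replace every marker once the preceding window has become known."""
    return "".join(
        window[len(window) - s.back] if isinstance(s, Marker) else s
        for s in symbols
    )


line1_tokens: List[Token] = ["How much wood would a", (13, 5), "chuck ", (6, 6)]
line2_tokens: List[Token] = ["if", (21, 13), "c", (39, 5), (12, 6), (22, 4), "?"]

# Thread 1 starts at the stream beginning, so it never produces markers.
window = "".join(first_stage(line1_tokens))
# Thread 2 decodes line 2 independently, then resolves its markers.
unresolved = first_stage(line2_tokens)
resolved = second_stage(unresolved, window)
```

Unlike the simplified slide figure, this sketch also turns references to earlier markers into markers, so "chuck wood" in line 2 stays unresolved until the second stage.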
Granularities for Parallelization
[Figure: nesting of a gzip file]
Gzip File: Gzip Stream, Gzip Stream, ...
Gzip Stream: Gzip Header, Deflate Stream, Gzip Footer
Deflate Stream: Deflate Block, Deflate Block, ...
Deflate Block: Final Block Flag, Block Type, Block Data
  Block Type: 00, Non-Compressed: Length, ~Length, Non-Compressed Data
  Block Type: 01, Dynamic Huffman: Tree Lengths, Huffman Trees, Compressed Data
  Block Type: 10, Fixed Huffman: Compressed Data
● Often, there is only one Gzip stream per file.
→ Parallelize decompression of Deflate blocks
Redundancies that Help in Finding Deflate Blocks
● Fixed Huffman blocks are mostly used for very small files and the tail end of streams.
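The header fields and the Length/~Length redundancy can be inspected with Python's zlib module. The sketch below is an illustration, not rapidgzip's block finder; the helper names are hypothetical. It uses the numeric BTYPE values of RFC 1951 (0 = non-compressed, 1 = fixed Huffman, 2 = dynamic Huffman); the slide's 01 and 10 labels appear to be written in bit-stream order, least-significant bit first.

```python
import zlib


def parse_block_header(raw: bytes, bit_offset: int = 0):
    """Read the 3 header bits of a Deflate block starting at `bit_offset`."""
    def bit(position: int) -> int:
        # Deflate packs bits starting at the least-significant bit of each byte.
        return (raw[position // 8] >> (position % 8)) & 1

    is_final = bool(bit(bit_offset))
    block_type = bit(bit_offset + 1) | (bit(bit_offset + 2) << 1)
    return is_final, block_type


def stored_lengths_match(raw: bytes, bit_offset: int = 0) -> bool:
    """Check the redundancy of a non-compressed block: LEN xor NLEN == 0xFFFF."""
    # After the 3 header bits, the stream is padded to the next byte boundary.
    start = (bit_offset + 3 + 7) // 8
    length = int.from_bytes(raw[start:start + 2], "little")
    inverted = int.from_bytes(raw[start + 2:start + 4], "little")
    return length ^ inverted == 0xFFFF


def raw_deflate(data: bytes, level: int) -> bytes:
    """Compress to a raw Deflate stream without gzip or zlib framing."""
    compressor = zlib.compressobj(level, zlib.DEFLATED, -zlib.MAX_WBITS)
    return compressor.compress(data) + compressor.flush()


stored = raw_deflate(b"hello", 0)  # level 0 emits non-compressed blocks
tiny = raw_deflate(b"a", 9)        # a tiny input gets fixed Huffman codes
```

A block finder has to run such checks at every candidate bit offset, which is why the redundancy in each block type matters for the false positive rate.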
How Often Will Valid-Looking Block Headers be Found in Random Data?
[Figure: three Huffman trees for the symbols A, B, C]
Invalid: Code Lengths A: 1, B: 1, C: 1
Non-optimal: Code Lengths A: 2, B: 2, C: 2
Valid and optimal: Code Lengths A: 2, B: 2, C: 1
● Deflate blocks with Fixed Huffman codings: ignored because they are rare
● Non-Compressed Deflate blocks: look for 16-bit lengths and their one’s complement
  → 1 false positive per 525 kB
● Deflate blocks with Dynamic Huffman codings: look for valid Deflate block headers and valid and optimal Huffman codings
  ~200 offsets pass this test given 1 Tbit of random data → 1 false positive per 625 MB
● Pugz reduces false positives further by checking that the decompressed data only contains characters in the range 9–126, an assumption made for FASTQ files
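The three code-length examples can be told apart with the Kraft sum, a standard criterion for prefix codes. The sketch below is one way to express the check, not necessarily how rapidgzip implements it; classify_code_lengths is a hypothetical name, and special cases such as single-symbol codes are ignored. The last lines redo the slide's arithmetic for the dynamic-block estimate.

```python
from fractions import Fraction
from typing import Iterable


def classify_code_lengths(lengths: Iterable[int]) -> str:
    """Classify Huffman code lengths with the Kraft sum over used symbols."""
    kraft = sum(Fraction(1, 2 ** n) for n in lengths if n > 0)
    if kraft > 1:
        return "invalid"        # more codes requested than bit patterns exist
    if kraft < 1:
        return "non-optimal"    # some bit patterns are never used
    return "valid and optimal"  # the code tree is complete


examples = {
    "A: 1, B: 1, C: 1": classify_code_lengths([1, 1, 1]),
    "A: 2, B: 2, C: 2": classify_code_lengths([2, 2, 2]),
    "A: 2, B: 2, C: 1": classify_code_lengths([2, 2, 1]),
}

# ~200 passing offsets in 1 Tbit of random data, expressed in bytes.
random_bytes = 10 ** 12 // 8
bytes_per_false_positive = random_bytes // 200
```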
False Positives Can Be Crafted Using Non-Compressed Blocks
Example: Consider a gzip file that is compressed a second time.
[Figure: nested layout of example.gz.gz]
example.gz.gz: Gzip Header, Deflate Stream, Gzip Footer
  Outer Deflate block: Final Block Flag, Block Type: Non-Compressed, Length, ~Length, Non-Compressed Data
  The Non-Compressed Data holds example.gz: Gzip Header, Deflate Stream (Final Block Flag, Block Type, Block Data), Gzip Footer
  Labels in the figure: "Block Boundary of Interest" and "False Positive"
Sequences inside the compressed Block Data might be recognized as Deflate block headers
→ These cases need to be detected and handled
Implementation
[Figure: chunk fetching with a thread pool, a cache, and an index. The compressed input data (offsets 0 to 300) is split at guessed chunk boundaries into the chunks C₁ to C₄. The first block offset found inside a guessed chunk may be a false positive. First-stage output may contain markers.]
First request:
① Request chunk at offset 0
② Dispatch the chunk to the thread pool
③ Prefetch further chunks
④ Find block start
⑤ First-stage decompression
⑥ Periodically check for ready chunks and move them into the cache until C₁ has become ready
⑦ Add C₁ information to the index
⑧ Return decompressed C₁
Cache (Offset → Chunk): 0 → C₁, 111 → C₂, 220 → C₃, 305 → C₄
Index (Offset, Size, Dec. Size, Win.): C₁ at offset 0 with size 111 and decompressed size 340
Second request:
① Request chunk at 111 directly after C₁
② Found chunk in cache but with markers
③ Resolve markers in windows
④ Add resolved windows to the index
⑤ Resolve the markers inside each chunk in parallel using the thread pool
⑥ Return decompressed C₂
No prefetched chunk matches the end offset of C₃. The chunk after will be decompressed on demand.
● Prefetching to generate work that can be processed in parallel
● Thread pool for work balancing
● Cache to speed up seeking and concurrent decompression
● Use block offset as cache key to catch false positives
● On-demand cache fill to recover from errors
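The interplay of prefetching, thread pool, and offset-keyed cache can be sketched as follows. This is a simplified illustration, not rapidgzip's architecture: it sidesteps block finding and markers by using a multi-member gzip blob whose chunk offsets are already known, and PrefetchingChunkReader is a hypothetical name.

```python
import gzip
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Dict, List


class PrefetchingChunkReader:
    """Toy chunk fetcher: thread pool plus cache keyed by chunk start offset."""

    def __init__(self, blob: bytes, offsets: List[int],
                 threads: int = 4, prefetch: int = 2):
        self.blob = blob
        self.bounds = offsets + [len(blob)]  # chunk i spans bounds[i]:bounds[i + 1]
        self.prefetch = prefetch
        self.pool = ThreadPoolExecutor(max_workers=threads)
        self.cache: Dict[int, Future] = {}

    def _decode(self, index: int) -> bytes:
        return gzip.decompress(self.blob[self.bounds[index]:self.bounds[index + 1]])

    def _submit(self, index: int) -> Future:
        offset = self.bounds[index]
        if offset not in self.cache:  # cache miss: decode on demand
            self.cache[offset] = self.pool.submit(self._decode, index)
        return self.cache[offset]

    def read_chunk(self, index: int) -> bytes:
        requested = self._submit(index)
        last = min(index + self.prefetch, len(self.bounds) - 2)
        for ahead in range(index + 1, last + 1):
            self._submit(ahead)  # prefetch to create parallel work
        return requested.result()

    def close(self) -> None:
        self.pool.shutdown(wait=True)


# Build a multi-member gzip blob so that each chunk is independently decodable.
parts = [f"chunk {i}\n".encode() * 1000 for i in range(5)]
members = [gzip.compress(part) for part in parts]
offsets = [sum(len(m) for m in members[:i]) for i in range(len(members))]

reader = PrefetchingChunkReader(b"".join(members), offsets)
data = reader.read_chunk(1)
cached_offsets = sorted(reader.cache)
reader.close()
```

Keying the cache by the exact start offset means a chunk that was prefetched from a wrong guess is never looked up: the request for the true offset misses and falls back to on-demand decoding.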
Optimal Chunk Size
[Figure: decompression bandwidth in MB/s (1000 to 3000) over chunk size in MiB (0.125 to 1024) for rapidgzip and pugz. A second axis shows the theoretical number of chunks, from 49808 down to 7. Annotation: 3.8×]
DBF ... Dynamic Block Finder
NBF ... Non-Compressed Block Finder
Decompression bandwidth using 16 cores and a 6.08 GiB test file, which decompresses to 8 GiB.
● For smaller chunk sizes, the block finder overhead leads to worse performance
● Larger chunk sizes lead to work balancing issues and also might adversely affect the cache behavior and allocation speed
Optimal chunk sizes: Rapidgzip: 4–8 MiB, Pugz: 32–64 MiB
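The values on the "Theoretical Number of Chunks" axis are consistent with dividing the compressed file size by the chunk size and rounding up. The short check below assumes exactly that relation; the pairing of chunk sizes with tick labels is inferred from the numbers, not stated on the slide.

```python
import math

FILE_SIZE_MIB = 6.08 * 1024  # compressed test file size from the slide


def theoretical_chunk_count(chunk_size_mib: float) -> int:
    """Number of chunks needed to cover the compressed file."""
    return math.ceil(FILE_SIZE_MIB / chunk_size_mib)


chunk_sizes = [0.125, 0.5, 2, 8, 16, 32, 64, 128, 256, 512, 1024]
counts = [theoretical_chunk_count(size) for size in chunk_sizes]
```

With 16 cores, a chunk size of 512 MiB or more leaves fewer chunks than threads, which is one way to see the work balancing issue for large chunks.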
Decompression Benchmark: FASTQ File
Weak scaling, writing the results to /dev/null
● gzip is the slowest, about half as fast as zlib-based pigz
● rapidgzip with an index scales up to 20 GB/s, without an index up to 5 GB/s
● pugz tops out at 1.4 GB/s and crashes for 96+ cores
● igzip by Intel shows leading single-core performance
● pigz does not parallelize decompression
[Figure: bandwidth in MB/s (logarithmic, 100 to 20000) over the number of cores (1 to 128) for rapidgzip (index), rapidgzip (no index), linear scaling (no index), pugz, pigz, igzip, and gzip]
Benchmark: Various Gzip Compressors → Rapidgzip Decompression
● rapidgzip can parallelize decompression for gzip files produced with a wide variety of tools and compression levels
● bgzip -l 0: Contains only Non-Compressed Deflate blocks so that decompression is reduced to a fast copy and some accounting
● igzip -0: Contains only a single Deflate block with Fixed Huffman coding and therefore cannot be parallelized
[Figure: rapidgzip decompression bandwidth in GB/s by tool used for compression]
pigz: -9: 3.73, -6: 3.76, -3: 3.81, -1: 3.82
igzip: -3: 6.52, -2: 6.42, -1: 6.15, -0: 0.1586
gzip: -9: 5.03, -6: 5.17, -3: 5.55, -1: 6.05
bgzip: -l 9: 5.64, -l 6: 5.67, -l 3: 5.9, -l 0: 10.6, -l -1: 5.65
Benchmark: Competing File Formats
● igzip is surprisingly competitive with zstd
● zstd is the fastest in single-core decompression
● bgzip and pzstd can only decompress gzip/zstd files produced by themselves in parallel
● zstd-compressed (TAR) files are not eligible for random access and parallel decompression. pzstd is recommended instead to create multi-frame Zstandard files.
● Applying the rapidgzip approach to arbitrary Zstandard files might be infeasible because their window size is not limited to 32 KiB.
[Figure: decompression bandwidth in GB/s (0 to 20) by compression tool → decompression tool. Annotations: 3×, +25%]
128 cores: pzstd → pzstd: 8.8, gzip → rapidgzipⁱ: 16.43, gzip → rapidgzip: 5.13, bgzip → bgzip: 5.5
16 cores: pzstd → pzstd: 6.78, zstd → pzstd: 0.882, gzip → rapidgzipⁱ: 4.25, gzip → rapidgzip: 1.86, gzip → bgzip: 0.3017, bgzip → bgzip: 2.82
1 core: pzstd → pzstd: 0.811, zstd → pzstd: 0.816, zstd → zstd: 0.82, gzip → igzip: 0.656, gzip → rapidgzip: 0.1527, gzip → bgzip: 0.2965, bgzip → bgzip: 0.2977
ⁱ rapidgzip decompression with an existing index
Improvements Since Submission
● High memory usage has been alleviated by limiting the decompressed chunk size
  Worst case (compression ratio ~1000):
    was: ~9 GB per thread
    now: ~200 MB per thread (configurable)
● The Inflate implementation has been improved for high compression ratios
  → 25 % faster for Silesia by using memcpy/memset for long references
● CRC32 computation has been added
  The slice-by-16 algorithm has been implemented and parallelized using crc32_combine.
  → Achieves ~4 GB/s per core (~6 % overhead independent of parallelism)
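Why per-chunk checksums can be merged is easiest to see in a small sketch. The version below is an illustration built on Python's zlib.crc32 only: it advances the first checksum by feeding zero bytes, which costs time linear in the chunk length, whereas zlib's C function crc32_combine gets the same result in logarithmic time with matrix squaring. It is not rapidgzip's slice-by-16 code, and parallel_crc32 is a hypothetical name.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from typing import List, Tuple

MASK = 0xFFFFFFFF


def crc32_combine(crc_a: int, crc_b: int, len_b: int) -> int:
    """Return crc32(A + B) given crc32(A), crc32(B), and len(B)."""
    # Feeding len_b zero bytes through the raw CRC register advances crc_a
    # past B. zlib.crc32 inverts its start value and its result, so both
    # inversions are undone here.
    shifted = zlib.crc32(bytes(len_b), crc_a ^ MASK) ^ MASK
    return shifted ^ crc_b


def parallel_crc32(chunks: List[bytes]) -> int:
    """Checksum each chunk independently, then merge the results in order."""
    with ThreadPoolExecutor() as pool:
        partial = list(pool.map(lambda c: (zlib.crc32(c), len(c)), chunks))

    def merge(left: Tuple[int, int], right: Tuple[int, int]) -> Tuple[int, int]:
        return crc32_combine(left[0], right[0], right[1]), left[1] + right[1]

    return reduce(merge, partial, (0, 0))[0]


chunks = [b"How much wood ", b"would a woodchuck chuck ", b"if a woodchuck could?"]
combined = parallel_crc32(chunks)
```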
Summary
● We have shown that the specialized approach for parallelized gzip decompression introduced by Kerbiriou and Chikhi (2019) can be generalized without affecting performance and stability.
● Our architecture achieves better performance, scales to more cores, adds robustness against false positives, and also increases versatility by adding fast seeking capabilities.
● An index is created internally on first-time decompression, and it can be exported and imported to speed up subsequent decompression and seeking.
● Can be used with ratarmount to mount .tar.gz archives.
● Available at https://github.com/mxmlnkn/rapidgzip
[Figure: bar chart of decompression bandwidth in GB/s. Decompression benchmarks on a 12 GB FASTQ file using 64 cores of an AMD EPYC 7702 @ 2.0 GHz processor.]
gzip: 0.177, pigz: 0.301, igzip: 0.78, pugz: 1.4, rapidgzip: 5.3 (30× gzip), rapidgzip (index): 13.1 (74× gzip)

More Related Content

PPTX
Introduction to Apache Kafka
PPTX
Kafka: Internals
PDF
Kafka Cluster Federation at Uber (Yupeng Fui & Xiaoman Dong, Uber) Kafka Summ...
PDF
Messaging queue - Kafka
PDF
Kubernetes Sealed secrets
PDF
Fundamentals of Apache Kafka
PPTX
All you didn't know about the CAP theorem
Introduction to Apache Kafka
Kafka: Internals
Kafka Cluster Federation at Uber (Yupeng Fui & Xiaoman Dong, Uber) Kafka Summ...
Messaging queue - Kafka
Kubernetes Sealed secrets
Fundamentals of Apache Kafka
All you didn't know about the CAP theorem

What's hot (20)

PPTX
Monitoramento de Banco de dados SQL Server com Zabbix
PPTX
PPTX
Apache Kafka
PPTX
Creating Highly-Available MongoDB Microservices with Docker Containers and Ku...
PPTX
Replication and Consistency in Cassandra... What Does it All Mean? (Christoph...
PPTX
Git hub plugin setup and working with Git hub on anypoint studio
PDF
Stream Processing with Kafka in Uber, Danny Yuan
PDF
A Deep Dive into Kafka Controller
PDF
Solving Enterprise Data Challenges with Apache Arrow
PPTX
Apache Kafka - Patterns anti-patterns
PDF
Introduction to gRPC: A general RPC framework that puts mobile and HTTP/2 fir...
PDF
Uber: Kafka Consumer Proxy
PDF
Neo4j: Graph-like power
PDF
Debugging distributed systems
PPTX
Envoy and Kafka
PDF
Bigtable and Dynamo
PDF
Introduction to apache kafka
PPTX
Apache Kafka
PDF
.Net framework vs .net core a complete comparison
PPTX
Apache kafka
Monitoramento de Banco de dados SQL Server com Zabbix
Apache Kafka
Creating Highly-Available MongoDB Microservices with Docker Containers and Ku...
Replication and Consistency in Cassandra... What Does it All Mean? (Christoph...
Git hub plugin setup and working with Git hub on anypoint studio
Stream Processing with Kafka in Uber, Danny Yuan
A Deep Dive into Kafka Controller
Solving Enterprise Data Challenges with Apache Arrow
Apache Kafka - Patterns anti-patterns
Introduction to gRPC: A general RPC framework that puts mobile and HTTP/2 fir...
Uber: Kafka Consumer Proxy
Neo4j: Graph-like power
Debugging distributed systems
Envoy and Kafka
Bigtable and Dynamo
Introduction to apache kafka
Apache Kafka
.Net framework vs .net core a complete comparison
Apache kafka
Ad

Similar to HPDC'23 Rapidgzip (20)

PDF
Relations between archive formats
PDF
How GZIP works... in 10 minutes
PDF
Ange Albertini and Gynvael Coldwind: Schizophrenic Files – A file that thinks...
PDF
Schizophrenic files
PPT
G zip compresser ppt
PPT
Compression Commands in Linux
PDF
Siemens s7 300-400-pkzip 4.0
PDF
Schizophrenic files v2
PDF
Perly Parallel Processing of Fixed Width Data Records
PDF
Joblib Toward efficient computing : from laptop to cloud
PDF
Joblib PyDataParis2016
PPTX
PDF
Creating a phar
PPTX
Data compression
PPTX
Extended memory access in PHP
PDF
7-zip compression settings guide
PDF
ODP
bup backup system (2011-04)
PDF
Designs, Lessons and Advice from Building Large Distributed Systems
PDF
In-memory Caching: Curb Tail Latency with Pelikan
Relations between archive formats
How GZIP works... in 10 minutes
Ange Albertini and Gynvael Coldwind: Schizophrenic Files – A file that thinks...
Schizophrenic files
G zip compresser ppt
Compression Commands in Linux
Siemens s7 300-400-pkzip 4.0
Schizophrenic files v2
Perly Parallel Processing of Fixed Width Data Records
Joblib Toward efficient computing : from laptop to cloud
Joblib PyDataParis2016
Creating a phar
Data compression
Extended memory access in PHP
7-zip compression settings guide
bup backup system (2011-04)
Designs, Lessons and Advice from Building Large Distributed Systems
In-memory Caching: Curb Tail Latency with Pelikan
Ad

Recently uploaded (20)

PPTX
Outcomes of Communication & Overcoming
PPTX
The Power of Communication & Overcoming
PPTX
Brief presentation for multiple products
PPTX
2025-08-24 Joseph 04 (shared slides).pptx
PDF
Overview of Fundamentals of Project Management
PDF
Community User Group Leaders_ Agentblazer Status, AI Sustainability, and Work...
PPTX
RP Virtual Session One intro to workplace readiness
PPTX
Basics of Stereotypes and Prejudice(1).pptx
PPTX
VIVEK BOOK REVIEW the fish sticks book.pptx
PPTX
Ease_of_Paying_Taxes_Act_Presentation.pptx
PPTX
Lesson-4-MS-Word-Inserting-Editing-Formatting-Objects.pptx.pptx
PDF
Enhancing the Value of African Agricultural Products through Intellectual Pro...
PDF
Echoes of AccountabilityComputational Analysis of Post-Junta Parliamentary Qu...
PPTX
The Electronic Technocracy (Electric Paradise) - Built on the Legal Reality o...
PPTX
HRPTA PPT 2024-2025 FOR PTA MEETING STUDENTS
PPTX
Enterprise Network Design and Implementation Project using Cisco ASA, FortiGa...
PPTX
Prevention of sexual harassment at work place
PPTX
Go Kiss the World book review presentation.pptx
PPTX
Principles-of-International-Environmental-Law.pptx
PDF
Lessons Learned building a product with clean core abap
Outcomes of Communication & Overcoming
The Power of Communication & Overcoming
Brief presentation for multiple products
2025-08-24 Joseph 04 (shared slides).pptx
Overview of Fundamentals of Project Management
Community User Group Leaders_ Agentblazer Status, AI Sustainability, and Work...
RP Virtual Session One intro to workplace readiness
Basics of Stereotypes and Prejudice(1).pptx
VIVEK BOOK REVIEW the fish sticks book.pptx
Ease_of_Paying_Taxes_Act_Presentation.pptx
Lesson-4-MS-Word-Inserting-Editing-Formatting-Objects.pptx.pptx
Enhancing the Value of African Agricultural Products through Intellectual Pro...
Echoes of AccountabilityComputational Analysis of Post-Junta Parliamentary Qu...
The Electronic Technocracy (Electric Paradise) - Built on the Legal Reality o...
HRPTA PPT 2024-2025 FOR PTA MEETING STUDENTS
Enterprise Network Design and Implementation Project using Cisco ASA, FortiGa...
Prevention of sexual harassment at work place
Go Kiss the World book review presentation.pptx
Principles-of-International-Environmental-Law.pptx
Lessons Learned building a product with clean core abap

HPDC'23 Rapidgzip

  • 1. HPDC’ 23 Rapidgzip: Parallel Decompression and Seeking in Gzip Files Using Cache Prefetching Maximilian Knespel, Holger Brunst Technische Universität Dresden
  • 2. Rapidgzip: Parallel Decompression and Seeking in Gzip Files Using Cache Prefetching Maximilian Knespel, Holger Brunst Slide 2 of 20 Motivation Accessing huge datasets, e.g., from academictorrents.com: ● wikidata-20220103-all.json.gz: gzip-compressed JSON, 109 GB, 1.4 TB uncompressed ● ImageNet21K: gzip-compressed TAR archive, 1.2 TB, 14 million images averaging 9 KiB. Solutions: ● : Random access TAR mount. Make (huge) archives’ contents available via FUSE. ● , indexed_bzip2: Backends for ratarmount for parallel decompression and fast seeking inside compressed gzip and bzip2 files. They also offer command line tools for parallelized decompression.
  • 3. Rapidgzip: Parallel Decompression and Seeking in Gzip Files Using Cache Prefetching Maximilian Knespel, Holger Brunst Slide 3 of 20 Requirements Data Stream Data Stream Data Stream ● Parallelize gzip decompression ● Without additional metadata ● After any seeking ● For concurrent accesses at two offsets ● Decompress all kinds of gzip files ● Enable fast backward and forward seeking ● After the index has been created ● While the index has only been partially created ● Usable as a (Python-)library
  • 4. Rapidgzip: Parallel Decompression and Seeking in Gzip Files Using Cache Prefetching Maximilian Knespel, Holger Brunst Slide 4 of 20 ● : Random access parallel (indexed) decompression for gzip files ● Parallel decompression of gzip files ● Decompression is faster and less memory intensive with an existing index ● The index enables seeking without having to start decompression from the file beginning ● Header-only C++ library with Python bindings: > pip install rapidgzip ● Also a has a command line interface that can be used as a drop-in replacement for decompression: gzip -d → rapidgzip -d ● Not for compression ● https://github.com/mxmlnkn/rapidgzip Decompression benchmarks on a 12 GB FASTQ file using 64 cores of an AMD EPYC 7702 @ 2.0 GHz processor. g z i p p i g z i g z i p p u g z r a p i d g z i p r a p i d g z i p ( i n d e x ) 0 2 4 6 8 10 12 14 Bandwidth / (GB/s) 0.177 0.301 0.78 1.4 5.3 13.1 Introducing rapidgzip 74× 30×
  • 5. Rapidgzip: Parallel Decompression and Seeking in Gzip Files Using Cache Prefetching Maximilian Knespel, Holger Brunst Slide 5 of 20 Tools Overview ● 1992: gzip by Jean-loup Gailly and Mark Adler ● 2005: zlib/examples/zran.c example by Mark Adler ● Shows how to resume decompression in the middle of a gzip stream ● 2007: pigz (parallel implementation of gzip) by Mark Adler ● Compresses in parallel ● 2008: Blocked GNU Zip Format (BGZF) and the command line tool bgzip, part of HTSlib ● Compresses in parallel to gzip files with additional metadata ● Can decompress files containing such metadata in parallel ● James K. Bonfield and others, "HTSlib: C library for reading/writing high-throughput sequencing data", GigaScience, Volume 10, Issue 2, February 2021 ● 2016: indexed_gzip: Python module for random access based on zran.c ● 2019: pugz ● Can decompress gzip-compressed files in parallel if it only contains characters 9–126 ● Kerbiriou, Maël, and Rayan Chikhi. "Parallel decompression of gzip-compressed files and random access to DNA sequences." 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, 2019. → What makes parallel decompression so difficult?
  • 6. Rapidgzip: Parallel Decompression and Seeking in Gzip Files Using Cache Prefetching Maximilian Knespel, Holger Brunst Slide 6 of 20 Challenges for Parallel Decompression: ● Find deflate block start offsets in bit stream ● Handling references to unknown data → Two-Staged Decompression as introduced by Kerbiriou and Chikhi (2019) ● Non-resolvable references result in markers that get resolved in the second stage after the 32 KiB window has become known Deflate Compression ⎵ o 0 1 0 1 1 0 How⎵much⎵wood⎵would⎵a⎵woodchuck⎵chuck⎵ if⎵a⎵woodchuck⎵could⎵chuck⎵wood? LZSS How⎵much⎵wood⎵would⎵a(13,5)chuck⎵(6,6) if(21,13)c(39,5)(12,6)(22,4)? Huffman Coding 11110|10|11111|0|11101|... H o w ⎵ m Deflate Compression First Stage How⎵much⎵wood⎵would⎵a⎵woodchuck⎵chuck⎵ if 20 21 22 23 24 25 26 27 28 29 30 31 32 c 16 17 18 19 20 chuck⎵wood? 1 10 20 30 if(21,13)c(39,5)(12,6)(22,4)? Second Stage Let Thread 2 Decompress Line 2 if⎵a⎵woodchuck⎵could⎵chuck⎵wood? Marker Replacement Window For Line 2 Distance Length Huffman Tree
  • 7. Rapidgzip: Parallel Decompression and Seeking in Gzip Files Using Cache Prefetching Maximilian Knespel, Holger Brunst Slide 7 of 20 Parallelized Deflate Decompression ⎵ o 0 1 0 1 1 0 How⎵much⎵wood⎵would⎵a⎵woodchuck⎵chuck⎵ if⎵a⎵woodchuck⎵could⎵chuck⎵wood? LZSS How⎵much⎵wood⎵would⎵a(13,5)chuck⎵(6,6) if(21,13)c(39,5)(12,6)(22,4)? Huffman Coding 11110|10|11111|0|11101|... H o w ⎵ m Deflate Compression First Stage How⎵much⎵wood⎵would⎵a⎵woodchuck⎵chuck⎵ if 20 21 22 23 24 25 26 27 28 29 30 31 32 c 16 17 18 19 20 chuck⎵wood? 1 10 20 30 if(21,13)c(39,5)(12,6)(22,4)? Second Stage Let Thread 2 Decompress Line 2 if⎵a⎵woodchuck⎵could⎵chuck⎵wood? Marker Replacement Window For Line 2 Distance Length Huffman Tree Challenges for Parallel Decompression: ● Find offsets in bit stream to start decompression from ● Handling references to unknown data → Two-Staged Decompression as introduced by Kerbiriou and Chikhi (2019) ● Non-resolvable references result in markers that get resolved in the second stage after the 32 KiB window has become known
  • 8. Rapidgzip: Parallel Decompression and Seeking in Gzip Files Using Cache Prefetching Maximilian Knespel, Holger Brunst Slide 8 of 20 Granularities for Parallelization Gzip File Gzip Stream Gzip Stream ... Deflate Stream Deflate Block Deflate Block ... Deflate Stream Gzip Stream Gzip Header Gzip Footer Deflate Block Final Block Flag Block Data Block Type Block Type: 00 Non-Compressed Length Non-Compressed Data ~Length Block Type: 01 Dynamic Huffman Tree Lengths Compressed Data Huffman Trees Block Type: 10 Fixed Huffman Compressed Data Often, there is only one Gzip stream per file. Parallelize decompression of Deflate blocks
  • 9. Rapidgzip: Parallel Decompression and Seeking in Gzip Files Using Cache Prefetching Maximilian Knespel, Holger Brunst Slide 9 of 20 Redundancies that Help in Finding Deflate Blocks Gzip File Gzip Stream Gzip Stream ... Deflate Stream Deflate Block Deflate Block ... Deflate Stream Gzip Stream Gzip Header Gzip Footer Deflate Block Final Block Flag Block Data Block Type Block Type: 00 Non-Compressed Length Non-Compressed Data ~Length Block Type: 01 Dynamic Huffman Tree Lengths Compressed Data Huffman Trees Block Type: 10 Fixed Huffman Compressed Data Fixed Huffman Blocks are mostly used for very small files and the tail end of streams.
  • 10. Rapidgzip: Parallel Decompression and Seeking in Gzip Files Using Cache Prefetching Maximilian Knespel, Holger Brunst Slide 10 of 20 How Often Will Valid-Looking Block Headers be Found in Random Data? Invalid Non-optimal Valid and optimal Code Lengths: A: 1, B: 1, C: 1 Code Lengths: A: 2, B: 2, C: 2 Code Lengths: A: 2, B: 2, C: 1 ● Pugz reduces false positives further by checking that the decompressed data only contains characters in the range 9-126, an assumption made for FASTQ files ● Deflate Blocks with Fixed Huffman codings: Ignore because they are rare ● Non-Compressed Deflate blocks: Look for 16-bit lengths and their one’s complement → 1 false positive per 525 kB ● Deflate Blocks with Dynamic Huffman codings: Look for valid Deflate block headers and valid and optimal Huffman codings. ~200 offsets pass this test given 1 Tbits of random data → 1 false positive per 625 MB
  • 11. Rapidgzip: Parallel Decompression and Seeking in Gzip Files Using Cache Prefetching Maximilian Knespel, Holger Brunst Slide 11 of 20 False Positives Can Be Crafted Using Non-Compressed Blocks Example: Consider a gzip file that is compressed a second time. Deflate Stream example.gz.gz Gzip Header Gzip Footer Length Non-Compressed Data ~Length Final Block Flag Block Type: Non-Compressed Deflate Stream example.gz Gzip Header Gzip Footer Block Data Final Block Flag Block Type Block Boundary of Interest False Positive Sequences inside the compressed Block Data might be recognized as Deflate block headers → These cases need to be detected and handled
  • 12. Implementation
    Diagram elements: Compressed Input Data split at Guessed Chunk Boundaries (0, 100, 200, 300) into the chunks C₁ to C₄, First Block Offset, False Positive, Window, Marker, Thread Pool, Cache, Index.
    First access:
    ① Request chunk at offset 0
    ② Dispatch chunk
    ③ Prefetch chunks
    ④ Find block start
    ⑤ First-stage decompression
    ⑥ Periodically check for ready chunks and move them into the cache until C₁ has become ready
    ⑦ Add C₁ information to the index
    ⑧ Return decompressed C₁
    Cache (Offset → Chunk): 0 → C₁, 111 → C₂, 220 → C₃, 305 → C₄
    Index columns: Offset, Size, Win. (window), Dec. Size (decompressed size). Entry after the first access: Offset 0, Size 111, Dec. Size 340
    Second access:
    ① Request chunk at 111 directly after C₁
    ② Found chunk in cache but with markers
    ③ Resolve markers in windows
    ④ Add resolved windows to the index
    ⑤ Resolve the markers inside each chunk in parallel using the thread pool
    ⑥ Return decompressed C₂
    No prefetched chunk matches the end offset of C₃ (327, whereas C₄ was cached under offset 305). The chunk after will be decompressed on demand.
    Index after the second access:
      Offset  Size  Dec. Size
      0       111   340
      111     109   421
      220     107   123
    ● Prefetching to generate work that can be processed in parallel
    ● Thread pool for work balancing
    ● Cache to speed up seeking and concurrent decompression
    ● Use block offset as cache key to catch false positives
    ● On-demand cache fill to recover from errors
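The interplay of prefetching, thread pool, and offset-keyed cache can be sketched in a few lines of Python. This is a toy model under stated assumptions, not rapidgzip's C++ implementation: the class name and the callbacks `find_block` and `decode` are hypothetical, decoding is faked with strings, the requested chunk itself is decoded in the calling thread, and the two-stage decompression with markers and windows is left out. The caller passes exact block offsets, which in practice come from where the previous chunk ended.

```python
from concurrent.futures import ThreadPoolExecutor


class ChunkFetcher:
    """Toy prefetching cache; decoded chunks are keyed by their real block offset."""

    def __init__(self, find_block, decode, chunk_size, threads=4, prefetch=3):
        self._find_block = find_block  # guessed offset -> block offset or None
        self._decode = decode          # block offset -> decoded chunk
        self._chunk_size = chunk_size
        self._prefetch = prefetch
        self._pool = ThreadPoolExecutor(max_workers=threads)
        self._dispatched = set()       # guessed boundaries already handed out
        self._pending = []             # futures of running prefetch tasks
        self.cache = {}                # real block offset -> decoded chunk

    def _work(self, guess):
        start = self._find_block(guess)
        return None if start is None else (start, self._decode(start))

    def get(self, offset):
        # Prefetch the chunks at the next guessed boundaries in parallel.
        first = offset // self._chunk_size + 1
        for i in range(first, first + self._prefetch):
            guess = i * self._chunk_size
            if guess not in self._dispatched:
                self._dispatched.add(guess)
                self._pending.append(self._pool.submit(self._work, guess))
        if offset not in self.cache:
            # Move finished prefetch results into the cache.
            for future in self._pending:
                result = future.result()
                if result is not None:
                    self.cache.setdefault(*result)
            self._pending.clear()
        if offset not in self.cache:
            # Nothing prefetched starts here, for example because a false
            # positive was found instead: decode on demand at the exact offset.
            self.cache[offset] = self._decode(offset)
        return self.cache[offset]

    def close(self):
        self._pool.shutdown()


# Toy data modelled on the slide: 305 is a false positive, C3 really ends at 327.
CANDIDATES = [0, 111, 220, 305, 327]


def find_block(guess):
    return next((c for c in CANDIDATES if c >= guess), None)


def decode(start):
    return f"chunk@{start}"


fetcher = ChunkFetcher(find_block, decode, chunk_size=100)
chunks = [fetcher.get(offset) for offset in (0, 111, 220, 327)]
fetcher.close()
print(chunks)
```

Because the cache is keyed by the real block offset, the chunk decoded from the false positive at 305 is simply never requested, and the request for 327 falls through to the on-demand path.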
  • 15. Optimal Chunk Size
    Plot: decompression bandwidth in MB/s over the chunk size in MiB (0.125 to 1024) for rapidgzip and pugz, with a second axis for the Theoretical Number of Chunks (49808 down to 7). Plot annotation: 3.8×.
    Decompression bandwidth using 16 cores and a 6.08 GiB test file, which decompresses to 8 GiB.
    ● For smaller chunk sizes, the block finder overhead leads to worse performance
    ● Larger chunk sizes lead to work balancing issues and might also adversely affect the cache behavior and allocation speed
    ● Optimal chunk sizes: rapidgzip: 4-8 MiB, pugz: 32-64 MiB
    DBF ... Dynamic Block Finder
    NBF ... Non-Compressed Block Finder
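The Theoretical Number of Chunks on the second axis follows from dividing the compressed file size by the chunk size and rounding up. The short sketch below reproduces the axis labels under two assumptions: the test file is exactly 6.08 GiB, and the labels belong to the chunk sizes listed in the loop, which is inferred from the arithmetic rather than read off the plot.

```python
import math

MIB = 2 ** 20
FILE_SIZE = int(6.08 * 2 ** 30)  # assumption: exactly 6.08 GiB of compressed data


def chunk_count(chunk_size_mib: float) -> int:
    """Number of chunks needed to cover the compressed file."""
    return math.ceil(FILE_SIZE / (chunk_size_mib * MIB))


SIZES_MIB = (0.125, 0.5, 2, 8, 16, 32, 64, 128, 256, 512, 1024)
for size in SIZES_MIB:
    print(f"{size:>8} MiB -> {chunk_count(size)} chunks")
```

At 1024 MiB only 7 chunks remain for 16 cores, which illustrates the work balancing issue named on the slide.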
  • 16. Decompression Benchmark: FASTQ File
    Weak-scaling and writing the results to /dev/null
    Plot: bandwidth in MB/s (100 to 20000, logarithmic) over the number of cores (1 to 128) for rapidgzip (index), rapidgzip (no index), linear scaling (no index), pugz, pigz, igzip, and gzip.
    ● gzip is the slowest, roughly half as fast as zlib-based pigz
    ● rapidgzip with an index scales up to 20 GB/s, without an index up to 5 GB/s
    ● pugz tops out at 1.4 GB/s and crashes for 96+ cores
    ● igzip by Intel shows leading single-core performance
    ● pigz does not parallelize decompression
  • 17. Benchmark: Various Gzip Compressors
    Rapidgzip decompression bandwidth by the tool used for compression:
      Tool Used for Compression   Decompression Bandwidth / (GB/s)
      pigz -9                     3.73
      pigz -6                     3.76
      pigz -3                     3.81
      pigz -1                     3.82
      igzip -3                    6.52
      igzip -2                    6.42
      igzip -1                    6.15
      igzip -0                    0.1586
      gzip -9                     5.03
      gzip -6                     5.17
      gzip -3                     5.55
      gzip -1                     6.05
      bgzip -l 9                  5.64
      bgzip -l 6                  5.67
      bgzip -l 3                  5.9
      bgzip -l 0                  10.6
      bgzip -l -1                 5.65
    ● rapidgzip can parallelize decompression for gzip files produced with a wide variety of tools and compression levels
    ● bgzip -l 0: The file contains only Non-Compressed Deflate blocks so that decompression is reduced to a fast copy and some accounting
    ● igzip -0: The file contains only a single Deflate block with Fixed Huffman coding and therefore cannot be parallelized
  • 18. Benchmark: Competing File Formats
    Decompression bandwidth in GB/s by compression tool, decompression tool, and number of cores:
      Cores  Compression Tool  Decompression Tool  Bandwidth / (GB/s)
      128    pzstd             pzstd               8.8
      128    gzip              rapidgzip ⁱ         16.43
      128    gzip              rapidgzip           5.13
      128    bgzip             bgzip               5.5
      16     pzstd             pzstd               6.78
      16     zstd              pzstd               0.882
      16     gzip              rapidgzip ⁱ         4.25
      16     gzip              rapidgzip           1.86
      16     gzip              bgzip               0.3017
      16     bgzip             bgzip               2.82
      1      pzstd             pzstd               0.811
      1      zstd              pzstd               0.816
      1      zstd              zstd                0.82
      1      gzip              igzip               0.656
      1      gzip              rapidgzip           0.1527
      1      gzip              bgzip               0.2965
      1      bgzip             bgzip               0.2977
    ⁱ rapidgzip decompression with an existing index
    Plot annotations: 3×, +25%
    ● igzip is surprisingly competitive with zstd
    ● zstd is the fastest in single-core decompression
    ● bgzip and pzstd can only decompress in parallel those gzip/zstd files that they produced themselves
    ● zstd-compressed (TAR) files are not eligible for random access and parallel decompression. pzstd is recommended instead to create multi-frame Zstandard files.
    ● Applying the rapidgzip approach to arbitrary Zstandard files might be infeasible because their window size is not limited to 32 KiB.
  • 19. Improvements Since Submission
    ● High memory usage has been alleviated by limiting the decompressed chunk size
      Worst case (compression ratio ~1000) was: ~9 GB per thread, now: ~200 MB per thread (configurable)
    ● The Inflate implementation has been improved for high compression ratios
      → 25 % faster for Silesia by using memcpy/memset for long references
    ● CRC32 computation has been added
      The slice-by-16 algorithm has been implemented and parallelized using crc32_combine.
      → Achieves ~4 GB/s per core (~6 % overhead independent of parallelism)
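Parallel CRC32 works because the checksum of a concatenation can be derived from the checksums of its parts and the length of the second part. The sketch below shows the principle with Python's standard zlib module, which does not offer crc32_combine. As a stand-in it uses the identity crc32(A + B) = crc32(A + zeros) xor crc32(B) xor crc32(zeros), where zeros has the length of B. This costs time linear in len(B), whereas zlib's C function crc32_combine reaches the same result in logarithmic time. The function names are hypothetical, and the slice-by-16 algorithm is not shown.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def combine_crc32(crc_a: int, crc_b: int, length_b: int) -> int:
    """CRC32 of A + B, given crc32(A), crc32(B), and len(B).

    Simple linear-time stand-in for zlib's crc32_combine.
    """
    zeros = bytes(length_b)
    # zlib.crc32(zeros, crc_a) continues the running checksum of A over zeros.
    return zlib.crc32(zeros, crc_a) ^ crc_b ^ zlib.crc32(zeros)


def parallel_crc32(data: bytes, chunk_size: int, threads: int = 4) -> int:
    """Checksum the chunks independently, then fold the results in order."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        checksums = list(pool.map(zlib.crc32, chunks))
    total = 0  # CRC32 of the empty byte string
    for chunk, checksum in zip(chunks, checksums):
        total = combine_crc32(total, checksum, len(chunk))
    return total


data = bytes(range(256)) * 1000
print(parallel_crc32(data, chunk_size=4096) == zlib.crc32(data))
```

Since each chunk's checksum is independent of the others, it can be computed by the thread that decompresses the chunk, and only the cheap combination step has to run in order.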
  • 20. Summary
    ● We have shown that the specialized approach for parallelized gzip decompression introduced by Kerbiriou and Chikhi (2019) can be generalized without affecting performance and stability.
    ● Our architecture achieves better performance, scales to more cores, adds robustness against false positives, and also increases versatility by adding fast seeking capabilities.
    ● An index is created internally on first-time decompression, and it can be exported and imported to speed up subsequent decompression and seeking.
    ● Can be used with ratarmount to mount .tar.gz archives.
    ● Available at https://github.com/mxmlnkn/rapidgzip
    Decompression benchmarks on a 12 GB FASTQ file using 64 cores of an AMD EPYC 7702 @ 2.0 GHz processor:
      Tool               Bandwidth / (GB/s)
      gzip               0.177
      pigz               0.301
      igzip              0.78
      pugz               1.4
      rapidgzip          5.3
      rapidgzip (index)  13.1
    Compared to gzip, rapidgzip is 30× faster without an index (5.3 / 0.177) and 74× faster with an index (13.1 / 0.177).