Apache Parquet vs. New File Formats

There's no denying the impact Apache Parquet has had on the analytics ecosystem:
- Columnar layout
- Strong compression & encoding
- The de facto format in data lakes and warehouses for over a decade

But a lot is changing in our analytical workloads:
❌ Workloads are no longer just batch analytics - they involve AI pipelines, low-latency use cases, and hardware-accelerated compute
❌ Hardware is not only CPU but also GPU, spanning x86 to ARM/RISC-V, with wide SIMD instruction sets becoming the norm
❌ Performance bottlenecks now show up in decompression speed, memory pressure, and the lack of vectorized execution paths

As a result, we're seeing a wave of innovation around next-gen file formats. Efforts like:
- BtrBlocks - introduces cascaded compression with lightweight codecs
- Nimble (Meta) - optimized for fast scans & inference
- LanceDB - tailored for vector search and ML use cases

This past week I read about "FastLanes". Here are some highlights from their VLDB paper:

✅ Expression Encoding: flexible chains of lightweight codecs (FFOR, DELTA, DICT, FSST) that outperform heavyweight compression like Zstd (see the sketch below)
✅ Multi-Column Compression: exploits inter-column correlations (e.g. equality, one-to-one mappings) to go beyond traditional per-column compression
✅ Segmented Layout: decompresses in small vectors (1024 values) rather than row groups, reducing memory pressure and improving cache efficiency
✅ Compressed Execution Support: returns compressed vectors to engines like DuckDB/Velox for SIMD/GPU-friendly query execution

Some performance numbers that stood out:
- 800× faster random access than Parquet+Zstd
- 41% better compression than Parquet+Snappy
- 2% better compression than Parquet+Zstd (without using heavyweight codecs)
- Decoding accelerates with AVX-512 (up to 40% faster)

I think we are going to see more fit-for-purpose formats in the future as we tackle specific use cases around AI workloads and chase performance. Paper link in comments.

#dataengineering #softwareengineering
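To make the "expression encoding" idea concrete, here is a minimal Python/NumPy sketch of chaining two lightweight codecs - DELTA, then a frame-of-reference + bit-width pass in the spirit of FFOR - over one 1024-value vector. All function names here are hypothetical illustrations, not the FastLanes API; the real format bit-packs residuals into SIMD-friendly interleaved layouts described in the paper.

```python
# Illustrative sketch (NOT the FastLanes API): a cascaded chain of
# lightweight codecs over one small vector, DELTA followed by a
# frame-of-reference/bit-width pass in the spirit of FFOR.
import numpy as np

VECTOR_SIZE = 1024  # FastLanes (de)compresses in small vectors of this size


def delta_encode(values: np.ndarray) -> tuple[int, np.ndarray]:
    """DELTA: store the first value plus consecutive differences."""
    base = int(values[0])
    deltas = np.diff(values, prepend=values[0]).astype(np.int64)
    return base, deltas


def ffor_encode(deltas: np.ndarray) -> tuple[int, int, np.ndarray]:
    """FFOR-style pass: subtract the frame-of-reference (the minimum)
    so every residual fits in a small number of bits."""
    ref = int(deltas.min())
    residuals = (deltas - ref).astype(np.uint64)
    bit_width = max(int(residuals.max()).bit_length(), 1)
    return ref, bit_width, residuals  # a real codec would bit-pack these


def decode(base: int, ref: int, residuals: np.ndarray) -> np.ndarray:
    """Invert the chain: undo FFOR, then undo DELTA. Just an add and a
    cumulative sum - a tight, branch-free loop that vectorizes well."""
    deltas = residuals.astype(np.int64) + ref
    return base + np.cumsum(deltas)


# Usage: sorted-ish integers, a common case in analytics columns.
values = 100 + 3 * np.arange(VECTOR_SIZE, dtype=np.int64)
base, deltas = delta_encode(values)
ref, bits, residuals = ffor_encode(deltas)
assert np.array_equal(decode(base, ref, residuals), values)
print(f"{bits} bits/value instead of 64")
```

Note how decoding is just one add plus a cumulative sum over 1024 values: that is why chains of lightweight codecs can beat heavyweight compression like Zstd on decode speed, and why the small fixed vector size keeps the working set in cache.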
In general, my preference would be for enhancements to be added to the Parquet format rather than creating entirely new formats where possible. That way we can continue to exploit the network effects of the many Parquet libraries across the ecosystem.
Dipankar Mazumdar, have you seen https://db.cs.cmu.edu/projects/future-file-formats/ ?
Thanks for sharing, Dipankar Mazumdar, M.Sc
https://github.com/cwida/FastLanes/blob/dev/docs/specification.pdf