Apache Hudi: The Path Forward
Vinoth Chandar, Raymond Xu
PMC, Apache Hudi
Agenda
1) Hudi Intro
2) Table Metadata
3) Caching
4) Community
Hudi Intro
Components, Evolution
Typical Use-Cases
Hudi - the Pioneer
Serverless, transactional layer over
lakes.
Multi-engine, Decoupled storage
from engine/compute
Introduced notions of
Copy-On-Write and
Merge-on-Read
Change capture on lakes
Ideas now heavily borrowed
outside.
The Hudi Stack
Lakes on cheap, scalable Hadoop compatible storage
Built on open file and data formats
Transactional Database Kernel
- Table Format for file layouts, schema, …
- Indexing for faster updates/deletes
- Built-in “daemons” aka table services
- MVCC, OCC Concurrency Control
SQL and Programming APIs
Platform services and operational tools
Universally queryable from popular engines
It’s a platform!
Both streaming + batch style pipelines
- State store for incremental merging intermediate results
- Change events like Apache Kafka topics
For data lake workloads
- Optimized, self-managing data plane
- Large scale data processing
- Lakehouse?
With tightly-integrated components
- Loose coupling => too many to integrate
- Reduce build out time for data lakes
http://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform
Table Format
Avro Schema, Evolution rules
File groups, reduce merge overhead
Timeline => event log, WAL
Internal metadata table
Ongoing
- Schema-on-read, i.e. drops/renames (RFC-33)
- Infinite retention
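The timeline above is an ordered event log of table actions, where each action moves through requested, inflight, and completed states. A minimal pure-Python sketch of that idea (class and method names are illustrative, not Hudi's actual API):

```python
class Timeline:
    """Sketch of Hudi's timeline: an ordered event log (WAL) of table
    actions; each action transitions requested -> inflight -> completed."""

    def __init__(self):
        self.instants = []  # list of (instant_time, action, state)

    def transition(self, instant_time, action, state):
        assert state in ("requested", "inflight", "completed")
        self.instants.append((instant_time, action, state))

    def completed_commits(self):
        # Readers only see data from completed commit instants.
        return [(t, a) for t, a, s in self.instants
                if s == "completed" and a == "commit"]
```

Readers resolve a consistent view by filtering to completed instants, which is what makes the timeline act like a write-ahead log.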
File Formats
Base and Delta Log Files
- Parquet, ORC, HFile base files
- Avro log files
- Encode changes as blocks
Ongoing
- Parquet log blocks for large
batch writes
- CSV, unstructured formats
- pre-materialization for
masking/data privacy
Indexes
Pluggable, Consistent with txns
For upserts, deletes
- HBase, External index ->
pluggable
- Simple, Bloom/Local vs Global
Ongoing
- RFC-27 Range indexes
- Bucketed Index
- DynamoDB index
- Metadata index
- Record level indexing
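The bloom-style index above lets an upsert shortlist candidate base files by probing per-file filters over record keys instead of scanning everything. A toy pure-Python sketch (the `BloomFilter` class and `candidate_files` helper are hypothetical, far simpler than Hudi's implementation):

```python
import hashlib

class BloomFilter:
    """Toy bloom filter over record keys (illustrative only)."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits |= (1 << p)

    def might_contain(self, key):
        # No false negatives; small false-positive rate.
        return all(self.bits & (1 << p) for p in self._positions(key))

def candidate_files(file_filters, key):
    """Return files whose bloom filter may contain `key`."""
    return [f for f, bf in file_filters.items() if bf.might_contain(key)]
```

An upsert only needs to read and merge the shortlisted files, which is why index quality directly drives write performance.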
Concurrency Control
Hudi did not need multi-writer support
- Treat writers and services differently
- MVCC, non-blocking
- Table services satisfy most needs
Hudi now does Optimistic Concurrency Control
- File level, timeline consistent
- Still MVCC for table services
Future/Ongoing
- Multi-table transactions
- MVCC, fully lock free transactions
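File-level optimistic concurrency, as described above, means a writer proceeds without locks and checks at commit time whether any commit that completed after it started touched the same file groups. A simplified sketch (the function and its arguments are hypothetical, not Hudi's actual conflict-resolution code):

```python
def has_conflict(my_files, my_start_instant, completed_commits):
    """File-level optimistic concurrency check (simplified).

    my_files: set of file group ids this writer modified.
    completed_commits: list of (instant_time, files_touched) for
    commits already on the timeline.
    A conflict exists iff a commit completed after we started AND
    overlaps the file groups we wrote."""
    for instant, files in completed_commits:
        if instant > my_start_instant and my_files & files:
            return True
    return False
```

On conflict the writer aborts and retries; disjoint writers (e.g. to different partitions) commit concurrently without contention.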
Writers
Incremental & Batch write operations
- File sizing, Layout control upon write
- Sorting, compression, Index maintenance
- Spill handling, Multi-threaded write pipeline
Record level merges APIs
- Unique keys, composite keys
- Key generators, virtual or physical
- Partial merges, event-time processing
Record level metadata
- Arrival and event time, watermarks
- Encode source CDC operation
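Event-time merging, mentioned above, resolves duplicate keys by keeping the record with the newest event timestamp, akin to Hudi's precombine semantics but heavily simplified here (field names and the helper are illustrative assumptions):

```python
def merge_records(existing, incoming, key_field="key", ts_field="event_ts"):
    """Event-time record merge (simplified precombine-style semantics):
    for each unique key, keep the record with the newest event timestamp,
    so late-arriving stale updates do not clobber fresher data."""
    merged = {r[key_field]: r for r in existing}
    for rec in incoming:
        k = rec[key_field]
        current = merged.get(k)
        if current is None or rec[ts_field] >= current[ts_field]:
            merged[k] = rec
    return list(merged.values())
```

Real merge APIs also support partial merges (patching a subset of columns), but the key idea is the same: ordering by event time, not arrival time.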
Readers
Hive, Impala, Presto, Spark, Trino, Redshift
Use engine’s native readers
First class support for incremental queries
Flexibility - snapshot vs read-optimized
Future
- Flexible change stream data models.
- Snowflake/BigQuery external tables
Table Services
Self-managing database runtime
Table services know each other
- E.g. avoid duplicate schedules
- E.g. skip compacting files being clustered
Cleaning (committed/uncommitted), archival,
clustering, compaction, ..
Services can be run continuously or scheduled
Platform Services
DeltaStreamer/FlinkStreamer
ingest/ETL utility
Deliver Commit notifications
Kafka Connect Sink
Data Quality checkers
Snapshot, Restore, Export, Import
Table Metadata
Current choices, Ongoing work, Future plans
What qualifies as table metadata?
Schema - Columns names/types, keys, partitioning, evolution/versions
- Typically small, < 1MB per version.
Files/Objects - Length, paths, URIs
- 2M objects => 10s of MBs
Stats - Min, Max, Nulls etc, Per col Per file
- 2M objects => 100s of MBs
Redo Logs - Changes to metadata => writes, rollbacks, table optimizations.
- Committing (200 KB) every minute for a year => ~100 GB
Indexes? - Remember: stats != index; indexes can be much bigger.
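The redo-log growth figure above checks out arithmetically; a quick worked sketch of the back-of-envelope calculation:

```python
# One ~200 KB commit metadata file per minute, retained for a year.
commit_size_kb = 200
minutes_per_year = 60 * 24 * 365          # 525,600 commits
total_kb = commit_size_kb * minutes_per_year
total_gb = total_kb / (1024 * 1024)       # KB -> GB, ~100 GB
```

This is why the timeline archives completed instants out of the active `.hoodie` folder rather than retaining them all in hot metadata.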
How’s this stored in Hudi, today?
Schema - Stored within the redo log, consistent with table changes.
- Synced out to different meta-stores, post commit
Files/Objects - Obtained from an internal metadata table partition `files`
- Or just by listing storage - sometimes it’s faster!
Redo Logs - As an event log in the timeline folder “.hoodie”
- Archived out, once transactions/table operations complete/expire.
Stats - We don’t. Yet. Fetch from file footers.
- Again sometimes faster if parallelized, even on cloud storage.
RFC-27 (Ongoing): Flat Files are not cool
Scaling file stats for high scale writing
- 65536 files (1TB data, stored as 16MB
small files)
- 100 columns, 6.5M stat entries
- O(total_cols_tracked_in_table)
- Slow, 10s of seconds.
Range reads to the rescue!
- O(num_cols_in_query) performance
- Interval trees with smart skipping
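The data-skipping idea above can be sketched with plain min/max range checks: a file survives pruning only if, for every column in the query predicate, its [min, max] range overlaps the predicate range. Only queried columns' stats are consulted, hence O(num_cols_in_query); this sketch uses a dict scan where RFC-27 proposes interval trees (function and data shapes here are hypothetical):

```python
def prune_files(file_stats, predicates):
    """Min/max data-skipping sketch.

    file_stats: {file: {col: (min_val, max_val)}}
    predicates: {col: (lo, hi)} range predicates from the query.
    A file is kept only if every queried column's stat range
    overlaps the predicate range."""
    survivors = []
    for f, stats in file_stats.items():
        if all(not (stats[c][1] < lo or stats[c][0] > hi)
               for c, (lo, hi) in predicates.items() if c in stats):
            survivors.append(f)
    return survivors
```

With 6.5M stat entries, consulting only the queried columns (and skipping subtrees whose ranges cannot match) is what turns 10s-of-seconds scans into fast lookups.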
The Hudi Timeline server
Metadata needs efficient serving and caching
- Not just efficient storage
Responsibilities
- Cache file listings across executors
- Amortize access to metadata table
- Performant uncommitted file
cleanup
Incremental sync
- Streaming/continuous writes
- Lazy refreshing of timeline
S3 Baseline: listing p90
- 1sec (10k files),
- 10 sec (100K files)
Timeline Server: 1-10 ms!
File-backed metadata: ~1 second!
Extending the Timeline Server
New APIs
- Serve also stats, redo log information.
- Locking APIs
Let’s make a cluster!
- Shard servers by table/db
- Pluggable backing storage
- Local DB w/ recovery/checkpointing
- Remote DB with
newSQL/transactional storage
Cache
Basic Idea, Design Considerations
Basic Idea
Problems
- Frequent commits => small objects /
blocks => I/O costly
- File System / Block level caching not
very effective
base file b @ t1
base file b’ @ t2
log file 1 for b
log file 2 for b
log file 1 for b’
log file 2 for b’
Time
Hudi FileGroup
log file 3 for b’
Hudi FileGroup fits caching
- Smallest unit to compact
- Size properly to fit cache store
- Cache compacted data for
real-time views => save
computation
Design Considerations
Refresh-Ahead
- Works with Change-Data-Capture
scenario
- Micro-compact FileGroup and save in
cache
Cache
base file b
log file 1 for b
log file 2 for b
compacted
Change-Data-Capture
Refresh-Ahead
Read-Through
- Driven by usage, on-demand
computation
- LRU or LFU
Query I/O
Read-Through
Design Considerations
FileGroup consistent hashing
- Each FileGroup has a unique ID
- Work with distributed cache servers
Cache
Node A
FileGroup
Query I/O
Cache
Node B
FileGroup FileGroup
Coordinator
(Timeline server?)
Query I/O
Lake Storage
Cache (e.g. Alluxio)
Transactionality
- Only committed files can be
cached
- Rollbacks include cache invalidation
Pluggable Caching Layer
- Define APIs for pluggable
caching implementations
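The FileGroup consistent-hashing idea above routes each FileGroup's unique ID to a cache node so that adding or removing a node only remaps a fraction of file groups. A minimal pure-Python ring, assuming hypothetical node names and virtual nodes for balance (not an actual Hudi or Alluxio API):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Map FileGroup IDs to cache nodes via consistent hashing; adding
    a node remaps only ~1/N of the file groups instead of all of them."""

    def __init__(self, nodes, vnodes=64):
        # Place `vnodes` virtual points per node on the hash ring.
        self.ring = sorted(
            (self._hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s):
        return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

    def node_for(self, file_group_id):
        # First ring point clockwise of the file group's hash owns it.
        i = bisect.bisect(self.keys, self._hash(file_group_id)) % len(self.ring)
        return self.ring[i][1]
```

The coordinator (possibly the timeline server) would hold the ring; queries then fetch a FileGroup's compacted bytes from its owning cache node.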
Community
Adoption, Operating the Apache way, Ongoing work
How we roll?
Friendly and diverse community
- Open and Collaborative
- 20+ PMCs/Committers from 10+
organizations
Developers
- Propose new RFCs (design docs)
- Dev list discussions, JIRA for issue tracking.
Users
- Weekly community on-call rotations
- Issue triage, bug filing process on Github
1200+
Slack
200+
Contributors
1000+
GH Engagers
~10-20
PRs/week
20+
Committers
10+
PMCs
Major Ongoing Works
RFC-26: Z-order indexing, Hilbert curves (PR #3330)
RFC-27: Data skipping/Range indexing (PR #3475)
RFC-29: Hashed Indexing (PR #3173)
RFC-32: Kafka Connect Sink for Hudi (Pre-release; available in 0.10.0)
RFC-33: Full-schema evolution support (PR #3668)
RFC-35: BigQuery integration
Major Ongoing Works
RFC-20: Error tables (PR #3312)
RFC-08: Record level indexing (PR #3508)
RFC-15: Synchronous, Multi table Metadata writes (PR #3590)
Hudi + Dbt (dbt-labs/dbt-spark/pull/210)
PrestoDB/Trino Connectors (Early design)
Hudi is broadly adopted outside
More at: http://hudi.apache.org/powered-by
Engage With Our Community
User Docs : https://hudi.apache.org
Technical Wiki : https://cwiki.apache.org/confluence/display/HUDI
Github : https://github.com/apache/hudi/
Twitter : https://twitter.com/apachehudi
Mailing list(s) : dev-subscribe@hudi.apache.org (send an empty email to subscribe)
dev@hudi.apache.org (actual mailing list)
Slack : https://join.slack.com/t/apache-hudi/signup
Thanks!
Questions?
