Building Data Products on Spark at Airbnb
LIYIN TANG & JINGWEI LU
Data Infrastructure at Airbnb
Batch Infrastructure
[Diagram: Event Logs, MySQL Dumps, Kafka, Sqoop, Gold Cluster (HDFS, Hive, Yarn), Silver Cluster (HDFS, Hive, Yarn), Spark Cluster (Spark), ReAir, Airflow Scheduling, S3, Presto Cluster, AirPal, SuperSet, Tableau]
Streaming at Airbnb
Sources: Kafka, S3, HDFS, …
Cluster: Spark Streaming, Airflow Scheduling, HBase, HDFS
Sinks: Datadog, Kafka, DynamoDB, Elasticsearch, …
Lambda Architecture
[Diagram: AirStream with a Batch path (Hive, Spark SQL), a Streaming path (Kafka, Spark Streaming), and State Storage]
Combine Streaming and Batch Processing
Sources
Streaming
source: [
  {
    name: source_example,
    type: kafka,
    config: {
      topic: "example_topic",
    }
  }
]
Batch
source: [
  {
    name: source_example,
    type: hive,
    sql: """
      select * from db.table where ds='2017-06-05';
    """
  }
]
Computation
Streaming/Batch
process: [{
  name = process_example,
  type = sql,
  sql = """
    SELECT listing_id, checkin_date, context.source as source
    FROM source_example
    WHERE user_id IS NOT NULL
  """
}]
Sinks
Streaming
sink: [
  {
    name = sink_example
    input = process_example
    type = hbase_update
    hbase_table_name = test_table
    bulk_upload = false
  }
]
Batch
sink: [
  {
    name = sink_example
    input = process_example
    type = hbase_update
    hbase_table_name = test_table
    bulk_upload = true
  }
]
Computation Flow
[Diagram, identical for Streaming and Batch: Source; Process_A, Process_B; Process_A1; Sink_A2, Sink_B2]
Unified API through AirStream
• Declarative job configuration
• Streaming source vs. static source
• A computation operator or sink can be shared by streaming and batch jobs.
• The computation flow is shared by streaming and batch.
• A single driver executes the job in either streaming or batch mode.
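The idea of one computation flow driven by either a streaming or a static source can be sketched in plain Python. This is only an illustration under stated assumptions: the names run_job, static_source, and streaming_source are hypothetical and are not AirStream's API, and process_example is a simplified version of the SQL shown earlier (it omits context.source).

```python
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, object]


def process_example(batch: List[Record]) -> List[Record]:
    """Shared computation: keep rows with a user_id and project two columns."""
    return [
        {"listing_id": r["listing_id"], "checkin_date": r["checkin_date"]}
        for r in batch
        if r.get("user_id") is not None
    ]


def static_source(rows: List[Record]) -> Iterator[List[Record]]:
    """Batch mode (hypothetical): the whole input arrives as a single batch."""
    yield list(rows)


def streaming_source(rows: List[Record], batch_size: int) -> Iterator[List[Record]]:
    """Streaming mode (hypothetical): the input arrives as micro-batches."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def run_job(source: Iterable[List[Record]],
            sink: Callable[[List[Record]], None]) -> None:
    """Single driver: only the source differs between the two modes."""
    for batch in source:
        sink(process_example(batch))


events = [
    {"listing_id": 1, "checkin_date": "2017-06-05", "user_id": 7},
    {"listing_id": 2, "checkin_date": "2017-06-06", "user_id": None},
    {"listing_id": 3, "checkin_date": "2017-06-07", "user_id": 9},
]

batch_out: List[Record] = []
stream_out: List[Record] = []
run_job(static_source(events), batch_out.extend)
run_job(streaming_source(events, batch_size=2), stream_out.extend)
```

Both runs end with the same rows in their sinks, which is the property the shared flow is meant to give.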
Shared State Storage
Shared Global State Store
[Diagram: multiple AirStream Spark Streaming and Spark Batch jobs sharing the same HBase Tables]
Why HBase
• Well integrated with the Hadoop ecosystem
• Efficient API for streaming writes and bulk uploads
• Rich API for sequential scans and point lookups
• Merged view based on version
Unified Write API
[Diagram: a DataFrame is re-partitioned by HBase region into <Region 1, [RowKey, Value]>, <Region 2, [RowKey, Value]>, …, <Region N, [RowKey, Value]>, then written to HBase (Region 1, Region 2, …, Region N) either as Puts or as HFiles via BulkLoad]
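A minimal sketch of the re-partition step, assuming a hypothetical list of region start keys. No HBase client calls are made here; the write function only reports which path (Puts, or HFile plus BulkLoad) each region's cells would take, mirroring the bulk_upload flag in the sink configuration.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import Dict, List, Tuple

Cell = Tuple[bytes, bytes]  # (row_key, value)

# Hypothetical region start keys, sorted; the first region starts at the empty key.
REGION_START_KEYS = [b"", b"g", b"n"]


def region_for(row_key: bytes) -> int:
    """Index of the region whose [start, next_start) range contains row_key."""
    return bisect_right(REGION_START_KEYS, row_key) - 1


def repartition(rows: List[Cell]) -> Dict[int, List[Cell]]:
    """Group cells by region and sort each group by row key."""
    by_region: Dict[int, List[Cell]] = defaultdict(list)
    for row_key, value in rows:
        by_region[region_for(row_key)].append((row_key, value))
    return {region: sorted(cells) for region, cells in by_region.items()}


def write(rows: List[Cell], bulk_upload: bool) -> Dict[int, str]:
    """Describe what would be sent to each region for the chosen write path."""
    mode = "HFile + BulkLoad" if bulk_upload else "Puts"
    return {region: f"{len(cells)} cells via {mode}"
            for region, cells in repartition(rows).items()}


rows = [(b"zeta", b"1"), (b"alpha", b"2"), (b"kilo", b"3"), (b"beta", b"4")]
partitions = repartition(rows)
streaming_plan = write(rows, bulk_upload=False)
batch_plan = write(rows, bulk_upload=True)
```

Sorting inside each partition matters mainly for the bulk path, since HFiles store cells in row-key order.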
Rich Read API
[Diagram: Spark Streaming/Batch Jobs read HBase Tables via Multi-Gets, Prefix Scan, and Time Range Scan]
Merged Views
Row Key   Value   Timestamp   Written by
R1        V200    TS200       Streaming Writes
R1        V150    TS150       Streaming Writes
…         …       …           …
R1        V01     TS01        Streaming Writes
Merged Views
Row Key   Value   Timestamp   Written by
R1        V200    TS200       Streaming Writes
R1        V150    TS150       Streaming Writes
R1        V100    TS100       Batch Bulk Upload
R1        V01     TS01        Streaming Writes
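The merged view can be illustrated with a toy versioned table. This is a sketch that assumes each write carries an explicit timestamp as its version and that a read returns the newest visible version; VersionedTable is a hypothetical in-memory stand-in, not the HBase client.

```python
from typing import Dict, Optional, Tuple


class VersionedTable:
    """Toy table that keeps several timestamped versions per row key."""

    def __init__(self) -> None:
        self._cells: Dict[str, Dict[int, str]] = {}

    def write(self, row_key: str, value: str, ts: int) -> None:
        """Store a value under an explicit version (timestamp)."""
        self._cells.setdefault(row_key, {})[ts] = value

    def read(self, row_key: str,
             max_ts: Optional[int] = None) -> Optional[Tuple[str, int]]:
        """Return (value, ts) of the newest version, optionally capped at max_ts."""
        versions = self._cells.get(row_key, {})
        visible = [ts for ts in versions if max_ts is None or ts <= max_ts]
        if not visible:
            return None
        latest = max(visible)
        return versions[latest], latest


table = VersionedTable()
# Streaming writes arrive first.
for value, ts in [("V01", 1), ("V150", 150), ("V200", 200)]:
    table.write("R1", value, ts)
# A batch bulk upload lands later in wall-clock time but carries an older version.
table.write("R1", "V100", 100)

latest = table.read("R1")
as_of_120 = table.read("R1", max_ts=120)
```

Because ordering comes from the version and not from arrival time, the late batch upload fills in history without hiding the newer streaming value.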
Our Foundations
• Unify streaming with batch processing
• Shared global state store
Use Cases
MySQL DB Snapshot Using Binlog Replay
Database Snapshot
• Large amount of data: multiple large MySQL databases
• Realtime-ness: minutes of delay vs. hours of delay
• Transactions: need to preserve transactions across different tables
• Schema change: table schemas evolve
Move Elephant
Binlog Replay on Spark
[Diagram: AirStream Job, spinal tap, and seed, with time labels 20+ hr, 4+ hr, 1 hr, 15 mins, 5 mins]
• Streaming and batch share logic: binlog file reader, DDL processor, transaction processor, DML processor.
• Merged by binlog position: <filenum, offset>
• Idempotent: the log can be replayed multiple times.
• Schema changes: full schema change history.
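Merging by binlog position and idempotent replay can be sketched as follows. The BinlogEvent shape and the apply_events helper are hypothetical; the only idea taken from the slide is that <filenum, offset> acts as the version, so applying the same log again, or in a different order, leaves the state unchanged.

```python
from typing import Dict, Iterable, NamedTuple, Optional, Tuple

Position = Tuple[int, int]  # (filenum, offset), compared lexicographically


class BinlogEvent(NamedTuple):
    filenum: int
    offset: int
    row_key: str
    value: Optional[str]  # None stands for a delete


State = Dict[str, Tuple[Position, Optional[str]]]


def apply_events(state: State, events: Iterable[BinlogEvent]) -> None:
    """Keep, per row, the change with the highest binlog position."""
    for ev in events:
        pos = (ev.filenum, ev.offset)
        current = state.get(ev.row_key)
        if current is None or pos >= current[0]:
            state[ev.row_key] = (pos, ev.value)


log = [
    BinlogEvent(1, 100, "user:1", "a"),
    BinlogEvent(1, 250, "user:1", "b"),
    BinlogEvent(2, 40, "user:2", "x"),
    BinlogEvent(2, 90, "user:1", "c"),
]

state: State = {}
apply_events(state, log)
snapshot = dict(state)

apply_events(state, log)                 # replaying the same log changes nothing
out_of_order: State = {}
apply_events(out_of_order, reversed(log))  # order of arrival does not matter
```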
[Diagram (Lambda Architecture): MySQL Instance; Binlog (realtime/history); Log Parser; DML, DDL, XVID; Transaction Processor, Change Processor, Schema Processor; HBase]
Realtime Indexing
[Diagram: Kafka events and Hive tables (Table A, Table B, Table C) are processed by AirStream (Spark Streaming, Spark Batch) and indexed into Elasticsearch, with es_version = mutation id]
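Using the mutation id as the document version lets the index reject stale or replayed writes. The sketch below is an in-memory stand-in that assumes "accept only a strictly newer version" semantics, which is how Elasticsearch's external versioning behaves; whether the talk's pipeline uses that exact mode is an assumption, and VersionedIndex is a hypothetical class, not an Elasticsearch client.

```python
from typing import Dict, Optional, Tuple


class VersionedIndex:
    """In-memory stand-in for a search index that enforces external versions."""

    def __init__(self) -> None:
        self._docs: Dict[str, Tuple[int, dict]] = {}

    def index(self, doc_id: str, doc: dict, version: int) -> bool:
        """Accept the write only if its version is newer than the stored one."""
        current = self._docs.get(doc_id)
        if current is not None and version <= current[0]:
            return False  # stale or duplicate mutation, ignored
        self._docs[doc_id] = (version, doc)
        return True

    def get(self, doc_id: str) -> Optional[dict]:
        current = self._docs.get(doc_id)
        return None if current is None else current[1]


index = VersionedIndex()
accepted = [
    index.index("listing:1", {"price": 120}, version=7),  # from the stream
    index.index("listing:1", {"price": 100}, version=5),  # late batch backfill
    index.index("listing:1", {"price": 120}, version=7),  # replayed event
    index.index("listing:1", {"price": 130}, version=9),  # newer mutation
]
```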
Realtime OLAP with Druid
Druid Ingestion
[Diagram: Kafka; AirStream (Spark Streaming); Dimension, Metrics; Druid Beam; Druid]
Superset Powered by Druid
Tips
Moving Window Computation
Long Window Computation
What if the window is weeks, months, or even years?
Distinct in a Large Window
I don’t want approximation. What should I do?
Distinct Count
Row Key                  Timestamp
Listing 1  Visitor 01    TS100
Listing 1  Visitor 02    TS100
Listing 1  Visitor 04    TS98
Listing 1  Visitor 03    TS99
[Diagram: two Prefix Scans with TimeRange along the Time axis]
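An exact distinct count over a long window can be computed by scanning one listing's key prefix and filtering by timestamp. The sketch below assumes a hypothetical row key of the form "listing|visitor" with one cell per pair, stamped with the latest visit time, and treats the time range as [min_ts, max_ts). It works on a sorted in-memory list, not on HBase.

```python
from bisect import bisect_left
from typing import List, Tuple

Cell = Tuple[str, int]  # (row_key, timestamp)


def distinct_visitors(cells: List[Cell], listing_id: str,
                      min_ts: int, max_ts: int) -> int:
    """Count visitors of one listing whose cell timestamp is in [min_ts, max_ts).

    `cells` must be sorted by row key, as a key-ordered store would return them.
    """
    prefix = f"{listing_id}|"
    keys = [key for key, _ in cells]
    visitors = set()
    # Jump to the first key at or after the prefix, then stop when it no longer matches.
    for key, ts in cells[bisect_left(keys, prefix):]:
        if not key.startswith(prefix):
            break
        if min_ts <= ts < max_ts:
            visitors.add(key[len(prefix):])
    return len(visitors)


cells = sorted([
    ("listing_1|visitor_01", 100),
    ("listing_1|visitor_02", 100),
    ("listing_1|visitor_03", 99),
    ("listing_1|visitor_04", 98),
    ("listing_2|visitor_01", 100),
])

recent = distinct_visitors(cells, "listing_1", 99, 101)
wider = distinct_visitors(cells, "listing_1", 98, 101)
```

Widening the window only changes the time range passed to the scan; no per-window state has to be kept in the streaming job.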
Moving Average
Row Key      Value                    Timestamp
Listing 1    Total Review Cnt: 100    TS100
Listing 1    Total Review Cnt: 98     TS99
…            …                        …
Listing 1    Total Review Cnt: 50     TS50
…            …                        …
Listing 1    Total Review Cnt: 01     TS01
Window 1: Count Difference / Time Elapsed
Window 2: Count Difference / Time Elapsed
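The "count difference / time elapsed" computation can be sketched over versioned running totals. The snapshot values come from the slide; the helper names and the choice to use the latest snapshot at or before each window boundary are assumptions made for illustration.

```python
from bisect import bisect_right
from typing import List, Tuple

# (timestamp, running total of reviews), sorted by timestamp.
snapshots: List[Tuple[int, int]] = [(1, 1), (50, 50), (99, 98), (100, 100)]


def total_at(history: List[Tuple[int, int]], ts: int) -> int:
    """Running total from the latest snapshot at or before ts (0 if none)."""
    timestamps = [t for t, _ in history]
    idx = bisect_right(timestamps, ts) - 1
    return history[idx][1] if idx >= 0 else 0


def moving_average(history: List[Tuple[int, int]],
                   start_ts: int, end_ts: int) -> float:
    """Count difference divided by time elapsed over [start_ts, end_ts]."""
    if end_ts <= start_ts:
        raise ValueError("window must have positive length")
    delta = total_at(history, end_ts) - total_at(history, start_ts)
    return delta / (end_ts - start_ts)


window_1 = moving_average(snapshots, 50, 100)
window_2 = moving_average(snapshots, 99, 100)
```

Only two point lookups are needed per window, regardless of how long the window is.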
Streaming Ingestion & Realtime Interactive Query
Realtime Ingestion and Interactive Query
[Diagram: Kafka; AirStream (Spark Streaming); HBase; Query Engine (Spark SQL, Hive SQL, Presto SQL); Data Portal]
Interactive Query in SqlLab
Schema Enforcement
Streaming Events
Thrift -> DataFrame
https://github.com/airbnb/airbnb-spark-thrift
[Diagram: a Thrift Event has a Thrift Class and a Thrift Object; Field Meta Data maps to a Struct Type and Field Values map to a Row, which together form a DataFrame]
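The mapping from field metadata to a schema and from field values to a row can be illustrated without Thrift or Spark. In this sketch a Python dataclass stands in for a generated Thrift class, a list of (name, type name) pairs stands in for the struct type, and a tuple stands in for a row. All names are hypothetical and this is not the airbnb-spark-thrift API.

```python
from dataclasses import dataclass, fields
from typing import Any, List, Tuple

Schema = List[Tuple[str, str]]  # (field name, type name)


@dataclass
class SearchEvent:
    """Hypothetical event class standing in for a generated Thrift class."""
    listing_id: int
    checkin_date: str
    source: str


def struct_type(event_class: type) -> Schema:
    """Derive the schema from the class's field metadata."""
    return [(f.name, f.type.__name__) for f in fields(event_class)]


def to_row(event: Any, schema: Schema) -> Tuple[Any, ...]:
    """Read field values in schema order, rejecting values of the wrong type."""
    row = []
    for name, type_name in schema:
        value = getattr(event, name)
        if type(value).__name__ != type_name:
            raise TypeError(f"{name}: expected {type_name}, "
                            f"got {type(value).__name__}")
        row.append(value)
    return tuple(row)


schema = struct_type(SearchEvent)
row = to_row(SearchEvent(42, "2017-06-05", "web"), schema)
```

Deriving the schema once from the class, instead of inferring it from data, is what makes enforcement possible: a malformed event fails at conversion time.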
Summary
Unify Batch and Streaming Computation
Global State Store Using HBase
We are hiring
Happy Hour: 6pm, B Restaurant & Bar, 720 Howard St, SF
