Apache Beam in the Google Cloud
Lessons learned from building and operating a serverless
streaming runtime
Reuven Lax, Google (@reuvenlax)
Sergei Sokolenko, Google (@datancoffee)
Lessons we learned
Watermarks
Adaptive Scaling: Flow Control
Adaptive Scaling: Autoscaling
Separating compute from state storage
History Lesson
[Timeline, 2002-2015: bespoke streaming systems, then Flume and Millwheel, then Dataflow in 2015]
Lesson Learned: Watermarks
A pipeline stage S with a watermark value of t means that all future data that will be seen by S will
have a timestamp later than t. In other words, all data older than t has already been processed.
Key use case: process windows once the watermark passes the end of the window, since we
expect all data for that window to have arrived already
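To make the key use case concrete, here is a minimal Beam Java sketch that relies on the default trigger, which fires when the watermark passes the end of each window; the wrapper class, the input PCollection, its element type, and the one-minute window size are illustrative assumptions.

```java
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class WindowedCounts {
  /** Counts events per key in one-minute event-time windows. */
  static PCollection<KV<String, Long>> countPerMinute(PCollection<KV<String, Long>> events) {
    return events
        // Assign each element to a one-minute event-time window.
        .apply(Window.<KV<String, Long>>into(FixedWindows.of(Duration.standardMinutes(1))))
        // With Beam's default trigger, each window's count is emitted once the
        // watermark passes the end of that window, i.e. once the runner believes
        // all data for the window has already arrived.
        .apply(Count.perKey());
  }
}
```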
What Triggers Output?
Traditional batch: the query triggers output.
Streaming: when to trigger?
● Standing query
● Unbounded data
[Diagram: batch runs a query over a finite dataset and produces output; streaming runs a standing query over unbounded data]
Use Case: Anomaly Detection Pipelines
An early Millwheel user was an anomaly detection pipeline that built a cubic-spline model for each key.
Once a spline was calculated, it could not be modified: no late data, trigger only when ready!
[Chart: per-key spline models, shown for two example users, Bob and Sara]
First attempt: leading-edge watermark
Watermark = latest timestamp - δ
[Graph: watermark skew over time, peaking at 10 minutes]
Skew peaked at 10 minutes, so δ was set to 10 minutes to minimize data drops.
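A minimal sketch of this first heuristic, just to make the arithmetic concrete; the class and its fields are hypothetical, not the actual Millwheel implementation.

```java
import java.time.Duration;
import java.time.Instant;

/** Leading-edge watermark: latest observed event time minus a fixed skew allowance δ. */
class LeadingEdgeWatermark {
  private final Duration delta;                    // e.g. Duration.ofMinutes(10), chosen from observed skew
  private Instant latestEventTime = Instant.EPOCH; // stays at the epoch until data is seen

  LeadingEdgeWatermark(Duration delta) {
    this.delta = delta;
  }

  void observe(Instant eventTime) {
    if (eventTime.isAfter(latestEventTime)) {
      latestEventTime = eventTime;
    }
  }

  /** Anything older than this is assumed to have arrived; data behind it gets dropped. */
  Instant current() {
    return latestEventTime.minus(delta);
  }
}
```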
First attempt: leading-edge watermark
Too fast:
● Too often, a lot of data was still behind this watermark.
● This left many gaps in the output and hurt the quality of results.
Too slow:
● Subtracting a fixed delta puts a lower bound on latency.
● We subtracted 10 minutes because the system is sometimes delayed by 10 minutes, yet most of the time the delay was under 1 minute!
Second attempt: dynamic leading-edge watermark
Still a leading-edge watermark, but with dynamic statistical models to compute how far back the lookback should be.
Second attempt: dynamic leading-edge watermark
Still many gaps in the output data:
● The input is too noisy.
● Many delays are unpredictable (e.g. a machine restarting).
● Models take time to adapt, and during that time you are dropping data.
Trailing-edge watermark
Tracking the minimum event time instead generally solved the problem.
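A minimal sketch of a trailing-edge watermark, assuming the stage can track every element still pending; the class and method names are hypothetical, not Millwheel's or Dataflow's actual code.

```java
import java.time.Instant;
import java.util.TreeMap;

/**
 * Trailing-edge watermark: tracks the minimum event time of work still pending in a stage,
 * so the watermark only advances once everything older has actually been processed.
 */
class TrailingEdgeWatermark {
  // Count of pending elements per event time (completion is assumed to follow arrival).
  private final TreeMap<Instant, Integer> pending = new TreeMap<>();
  private Instant lastReported = Instant.EPOCH;

  synchronized void onArrival(Instant eventTime) {
    pending.merge(eventTime, 1, Integer::sum);
  }

  synchronized void onCompletion(Instant eventTime) {
    pending.merge(eventTime, -1, (a, b) -> a + b == 0 ? null : a + b);
  }

  /** Monotonic by construction: the reported value never moves backwards. */
  synchronized Instant current() {
    if (!pending.isEmpty() && pending.firstKey().isAfter(lastReported)) {
      lastReported = pending.firstKey();
    }
    return lastReported;
  }
}
```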
Watermark: Definition
Given a node N of a computation graph G, let I_n be the sequence of input elements, processed in the order provided by an oracle. t: I_n -> R is a real-valued function on I_n called the timestamp function. A watermark function is a real-valued function W defined on prefixes of I_n satisfying the following:
● {W_n} = {W({I_1, …, I_n})} is eventually increasing.
● {W_n} is monotonic.
W is said to be a temporal watermark if it also satisfies W_n < t(I_m) for m >= n.
Load Variance
Streaming pipelines must keep up with their input.
Load varies throughout the day and the week, and spikes can happen at any time.
Load Variance: Hand Tuning
Every pipeline is different, and hand tuning is hard
Eventually tuning parameters go stale
Hand-tuned flags become cargo cult science
Must tune for worst case
● Tuning for the peak is wasteful
● If pipeline ever falls behind, must be able to catch up faster than real time.
○ An exactly-once streaming system is a batch system whenever it falls behind.
Techniques: Batching
Always process data in batches.
Batch sizes are dynamic: small when caught up, large while catching up.
Lesson: be careful about putting arbitrary limits on batches.
● Don't limit by event-time ranges - event-time density changes.
● Don't limit by windows - window policies change.
Batching limits will be especially painful while catching up on a backlog.
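A minimal sketch of dynamic, count-bounded batching, assuming the runner already knows whether it is behind on a backlog; the class and its parameters are illustrative, not Dataflow's actual batching code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Builds batches bounded by element count only - never by event-time range or window. */
class DynamicBatcher<T> {
  private final int caughtUpMax; // small batches when caught up, for low latency
  private final int backlogMax;  // large batches when behind, for high throughput

  DynamicBatcher(int caughtUpMax, int backlogMax) {
    this.caughtUpMax = caughtUpMax;
    this.backlogMax = backlogMax;
  }

  /** Drains up to the current limit from the queue into one batch. */
  List<T> nextBatch(Queue<T> queue, boolean behindOnBacklog) {
    int maxElements = behindOnBacklog ? backlogMax : caughtUpMax;
    List<T> batch = new ArrayList<>();
    T element;
    while (batch.size() < maxElements && (element = queue.poll()) != null) {
      batch.add(element);
    }
    return batch;
  }
}
```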
Techniques: Flow Control
A good adaptive backpressure system is critical
● Prevents workers from overloading and crashing
● Adaptive backpressure adapts to changing load.
● Reduces need to perfectly tune cluster.
Techniques: Flow Control
Soft resources: CPU. Hard resources: memory.
Signals:
● Queue length
● Memory usage
Eventually flow control will pause pulling from sources.
[Diagram: three workers, each running stages A, B, and C; one inter-worker stream is flow controlled]
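A minimal sketch of an admission check driven by the two signals above; the thresholds and class name are hypothetical, not the actual Dataflow flow-control logic.

```java
/** Decides whether a worker should keep accepting new input, based on the two signals above. */
class FlowController {
  private final int maxQueuedElements;  // queue-length signal
  private final long maxInFlightBytes;  // memory signal: bytes held for pending deliveries

  FlowController(int maxQueuedElements, long maxInFlightBytes) {
    this.maxQueuedElements = maxQueuedElements;
    this.maxInFlightBytes = maxInFlightBytes;
  }

  /**
   * When this returns false, upstream senders are flow controlled and, if the pressure
   * persists, the worker eventually stops pulling from its sources.
   */
  boolean admit(int queuedElements, long inFlightBytes) {
    return queuedElements < maxQueuedElements && inFlightBytes < maxInFlightBytes;
  }
}
```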
Techniques: Flow Control
What happens if all streams are flow controlled? Deadlock!
● Every worker is holding onto memory for pending deliveries.
● Every worker is flow controlling its input streams.
[Diagram: three workers, each running stages A, B, and C, with every inter-worker stream flow controlled]
Techniques: Flow Control
To avoid deadlock, workers must be able to release memory. This might involve canceling in-flight work to be retried later.
Dataflow streaming workers can spill pending deliveries to disk to release memory; the data is scanned back in later.
[Diagram: three workers, each running stages A, B, and C, with every inter-worker stream flow controlled]
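A minimal sketch of spilling pending deliveries to disk and scanning them back in later; the class, the length-prefixed file format, and the single spill file are assumptions for illustration, not the actual Dataflow worker implementation.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayDeque;
import java.util.Deque;

/** Buffers pending deliveries in memory, spilling them to disk under pressure to break deadlocks. */
class PendingDeliveries {
  private final Deque<byte[]> inMemory = new ArrayDeque<>();
  private final Path spillFile;
  private long inMemoryBytes = 0;

  PendingDeliveries(Path spillFile) {
    this.spillFile = spillFile;
  }

  void add(byte[] delivery) {
    inMemory.addLast(delivery);
    inMemoryBytes += delivery.length;
  }

  long inMemoryBytes() {
    return inMemoryBytes;
  }

  /** Writes all buffered deliveries to disk and releases their memory. */
  void spill() throws IOException {
    try (DataOutputStream out = new DataOutputStream(Files.newOutputStream(
        spillFile, StandardOpenOption.CREATE, StandardOpenOption.APPEND))) {
      while (!inMemory.isEmpty()) {
        byte[] delivery = inMemory.pollFirst();
        out.writeInt(delivery.length);
        out.write(delivery);
      }
    }
    inMemoryBytes = 0;
  }

  /** Scans spilled deliveries back in once memory pressure has eased. */
  void reload() throws IOException {
    if (!Files.exists(spillFile)) {
      return;
    }
    try (DataInputStream in = new DataInputStream(Files.newInputStream(spillFile))) {
      while (in.available() > 0) {
        byte[] delivery = new byte[in.readInt()];
        in.readFully(delivery);
        add(delivery);
      }
    }
    Files.delete(spillFile);
  }
}
```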
Techniques: Auto Scaling
Adaptive autoscaling allows elastic scaling with load.
Work must be dynamically load balanced to take advantage of autoscaling.
Techniques: Auto Scaling
Never assume a fixed set of workers: work ownership can be moved at any time.
All keys are hash sharded, and hash ranges are distributed among workers. RPCs are addressed to keys, not workers.
Separating storage from compute adds a lot of complexity to the exactly-once and consistency protocols!
[Diagram: hash ranges [0, 3), [3, a), and [a, f) distributed among workers, e.g. [0, 3) on worker23 and [3, a), [a, f) on worker32]
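A minimal sketch of key-addressed routing over hash ranges, assuming the range-to-worker assignment is pushed to the router whenever ownership moves; the class, the fingerprint choice, and the wrap-around rule are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

/** Routes requests by key: hash the key, then look up which worker currently owns that hash range. */
class KeyRouter {
  // Start of each hash range -> worker that currently owns it, e.g. {0x0 -> "worker23", ...}.
  // Replaced wholesale whenever ownership moves; nothing here assumes a fixed worker set.
  private volatile TreeMap<Long, String> rangeOwners = new TreeMap<>();

  void updateAssignment(Map<Long, String> newRangeOwners) {
    rangeOwners = new TreeMap<>(newRangeOwners);
  }

  /** RPCs are addressed to keys; the router resolves whichever worker owns the key right now. */
  String ownerOf(String key) {
    TreeMap<Long, String> current = rangeOwners;
    Map.Entry<Long, String> range = current.floorEntry(fingerprint(key));
    if (range == null) {
      range = current.lastEntry(); // wrap around if the hash falls below the first range start
    }
    return range == null ? null : range.getValue();
  }

  private static long fingerprint(String key) {
    // Any stable hash works for this sketch; a production system would use a strong fingerprint.
    return Integer.toUnsignedLong(key.hashCode());
  }
}
```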
Load Variance: Lesson
Dynamic control is key
No amount of static configuration works
Eventually the universe will outsmart your configuration
Separating compute from state storage to improve scalability
Sergei Sokolenko, Google (@datancoffee)
Streaming processing options in GCP
[Architecture diagram: IoT events and end-user apps feed Cloud Pub/Sub; Dataflow Streaming and Dataflow Batch process the data, with Cloud Composer for orchestration; results go to BigQuery (via the BigQuery Streaming API) for data warehousing, to Bigtable and other databases, and to Cloud AI Platform for machine learning, driving action]
Motivating Example: Spotify migrating the largest European Hadoop cluster to Dataflow
● Runs 80,000+ Dataflow jobs per month
● 90% batch, 10% streaming
Uses Dataflow for "everything":
● Music recommendations, ads targeting
● A/B testing, behavioral analysis, business metrics
Huge batch jobs:
● 26,000 CPUs, 166 TB RAM
● Processing 325 billion rows (240 TB) from Bigtable
Traditional Distributed Data Processing Architecture
● Jobs executed on clusters of VMs
● Job state stored on network-attached volumes
● Control plane orchestrates the data plane
[Diagram: a control-plane VM coordinating, over the network, four VMs running user code, each with its own attached state storage]
Traditional Architecture works well ...
… except for Joins and Group By's
[Diagram: a pipeline of Filter, Join, and Group stages reading from and writing to file systems (fs://) and databases]
Shuffling key-value pairs
● Starting with <K,V> pairs placed on different workers
● Goal: co-locate all pairs with the same key
[Diagram: four workers, each holding an unordered mix of <key, record> pairs for keys such as key1, key3, key5, and key8]
Shuffling key-value pairs
● Starting with <K,V> pairs placed on different workers
● Goal: co-locate all pairs with the same key
● Workers exchange <K,V> pairs
[Diagram: the same four workers exchanging <key, record> pairs over the network]
Shuffling key-value pairs
● Starting with <K,V> pairs placed on different workers
● Goal: co-locate all pairs with the same key
● Workers exchange <K,V> pairs
● Until everything is sorted
[Diagram: after the exchange, each worker holds a contiguous sorted key range: key1-key2, key3-key4, key5-key6, key7-key8]
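A single-process sketch of the core routing step behind this shuffle, assuming a simple hash partitioner; the class and the use of Java's built-in hashCode are illustrative only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Single-process sketch of a hash shuffle: route every <key, record> pair to a bucket by key. */
class HashShuffle {
  /** One list per destination "worker"; all records for a given key land in the same list. */
  static Map<Integer, List<Map.Entry<String, String>>> shuffle(
      List<Map.Entry<String, String>> pairs, int numWorkers) {
    Map<Integer, List<Map.Entry<String, String>>> byDestination = new HashMap<>();
    for (Map.Entry<String, String> pair : pairs) {
      // Same key -> same hash -> same destination, which is what co-locates the pairs.
      int destination = Math.floorMod(pair.getKey().hashCode(), numWorkers);
      byDestination.computeIfAbsent(destination, d -> new ArrayList<>()).add(pair);
    }
    return byDestination;
  }
}
```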
Traditional Architecture Requires Manual Tuning
● When data volumes exceed dozens of TBs
[Diagram: the same cluster of user-code VMs, control-plane VM, and per-VM state storage as before]
Distributed in-memory Shuffle in batch Cloud Dataflow
[Diagram: pipeline user code runs on Compute workers; shuffling operations are offloaded over a petabit network to the Dataflow Shuffle service, which combines a distributed in-memory file system with a distributed on-disk file system behind a shuffle proxy, with automatic zone placement across zones 'a', 'b', and 'c' of a region]
Faster Processing
No tuning required: Dataflow Shuffle is usually faster than worker-based shuffle, including shuffles backed by SSD persistent disks.
Better autoscaling keeps aggregate resource usage the same, but cuts processing time.
[Chart: shuffle runtime in minutes]
Supporting larger datasets
Dataflow Shuffle has been used to shuffle 300TB+ datasets.
[Chart: shuffled dataset size in TB]
Storing state: what about streaming pipelines?
Streaming shuffle
● Just like in batch, streams need to be grouped and joined: a distributed streaming shuffle.
Windowed data elements
● Time-window aggregations need to be buffered until their triggering conditions occur.
Goal: Grouping by Event Time into Time Windows
[Diagram: input elements arriving between 9:00 and 14:00 in processing time are grouped into output windows by their event time (9:00-14:00)]
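A minimal sketch of the window-assignment arithmetic behind this grouping, assuming fixed (tumbling) windows; the class name and the one-hour example in the comments are illustrative, chosen to match the hourly axis above.

```java
import java.time.Duration;
import java.time.Instant;

/** Assigns an event time to the fixed window that buffers it until the window's trigger fires. */
class FixedWindowAssigner {
  private final Duration size;

  FixedWindowAssigner(Duration size) {
    this.size = size;
  }

  /** Start of the window containing eventTime, e.g. 12:34 -> 12:00 for one-hour windows. */
  Instant windowStart(Instant eventTime) {
    long sizeMillis = size.toMillis();
    long start = Math.floorDiv(eventTime.toEpochMilli(), sizeMillis) * sizeMillis;
    return Instant.ofEpochMilli(start);
  }

  /** End of that window; the aggregation is typically emitted once the watermark passes it. */
  Instant windowEnd(Instant eventTime) {
    return windowStart(eventTime).plus(size);
  }
}
```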
Even more state to store on disks in streaming
Shuffled data elements
● Key ranges are assigned to workers
● Data elements for these keys are stored on Persistent Disks
Time window data
● Also assigned to workers
● When time windows close, the data is processed on the workers
[Diagram: four user-code VMs, each owning a key range (key 0000 - key 1234, key 1235 - key ABC2, key ABC3 - key DEF5, key DEF6 - key GHI2) with its state on attached storage]
Dataflow Streaming Engine
Benefits:
● Better supportability
● Fewer worker resources
● More efficient autoscaling
[Diagram: workers run only user code; the Streaming Engine service holds window state storage and performs the streaming shuffle]
Autoscaling: Even better with separate Compute and State Storage
[Diagram: with Streaming Engine, workers run only user code while window state storage and the streaming shuffle live in the service; without Streaming Engine, each VM runs user code and owns a key range of state (e.g. key 0000 - key 1234, key 1235 - key ABC2) on its own storage]
[Side-by-side comparison: Dataflow with Streaming Engine vs. Dataflow without Streaming Engine]
AB Tasty is using Dataflow Streaming Engine
● Personalization and experimentation platform
● Wanted things to work out of the box
Significant data volumes:
● 25 million user sessions per day
● 2B events per day
Dataflow usage profile:
● Streaming Engine for worry-free autoscaling
● Batch processing with FlexRS for cost savings
Main Takeaways
Trailing-edge watermarks provided a solution for triggering aggregations.
The system must be elastic and adaptive.
Separating compute from state storage helps make stream and batch processing scalable.
Thank You!
  • 42. Main Takeaways Trailing edge watermarks provided a solution for triggering aggregations The system must be elastic and adaptive Separating compute from state storage help make stream and batch processing scalable