Apache Beam in the Google Cloud
Lessons learned from building and operating a serverless
streaming runtime
Reuven Lax, Google (@reuvenlax)
Sergei Sokolenko, Google (@datancoffee)
Lessons we learned
Watermarks
Adaptive Scaling: Flow Control
Adaptive Scaling: Autoscaling
Separating compute from state storage
History Lesson
[Timeline, 2002-2015: bespoke streaming systems, then Flume and Millwheel, then Dataflow in 2015]
Lesson Learned: Watermarks
A pipeline stage S with a watermark value of t means that all future data that will be seen by S will
have a timestamp later than t. In other words, all data older than t has already been processed.
Key use case: process windows once the watermark passes the end of the window, since we
expect all data for that window to have arrived already
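To make the key use case concrete, here is a minimal Beam Java sketch that relies on the default trigger, which fires when the watermark passes the end of each window; the wrapper class, the input PCollection, its element type, and the one-minute window size are illustrative assumptions.

```java
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class WindowedCounts {
  /** Counts events per key in one-minute event-time windows. */
  static PCollection<KV<String, Long>> countPerMinute(PCollection<KV<String, Long>> events) {
    return events
        // Assign each element to a one-minute event-time window.
        .apply(Window.<KV<String, Long>>into(FixedWindows.of(Duration.standardMinutes(1))))
        // With Beam's default trigger, each window's count is emitted once the
        // watermark passes the end of that window, i.e. once the runner believes
        // all data for the window has already arrived.
        .apply(Count.perKey());
  }
}
```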
What Triggers Output?
Traditional batch: the query triggers output.
Streaming: when to trigger?
● Standing query
● Unbounded data
[Diagram: batch runs a query over a finite dataset and produces output; streaming runs a standing query over unbounded data]
Use Case: Anomaly Detection Pipelines
An early Millwheel user was an anomaly detection pipeline that built a cubic-spline model for each key.
Once a spline was calculated, it could not be modified: no late data, trigger only when ready!
[Chart: per-key spline models, shown for two example users, Bob and Sara]
First attempt: leading-edge watermark
Watermark = latest timestamp - δ
[Graph: watermark skew over time, peaking at 10 minutes]
Skew peaked at 10 minutes, so δ was set to 10 minutes to minimize data drops.
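A minimal sketch of this first heuristic, just to make the arithmetic concrete; the class and its fields are hypothetical, not the actual Millwheel implementation.

```java
import java.time.Duration;
import java.time.Instant;

/** Leading-edge watermark: latest observed event time minus a fixed skew allowance δ. */
class LeadingEdgeWatermark {
  private final Duration delta;                    // e.g. Duration.ofMinutes(10), chosen from observed skew
  private Instant latestEventTime = Instant.EPOCH; // stays at the epoch until data is seen

  LeadingEdgeWatermark(Duration delta) {
    this.delta = delta;
  }

  void observe(Instant eventTime) {
    if (eventTime.isAfter(latestEventTime)) {
      latestEventTime = eventTime;
    }
  }

  /** Anything older than this is assumed to have arrived; data behind it gets dropped. */
  Instant current() {
    return latestEventTime.minus(delta);
  }
}
```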
First attempt: leading-edge watermark
Too fast:
● Too often, a lot of data was still behind this watermark.
● This left many gaps in the output and hurt the quality of results.
Too slow:
● Subtracting a fixed delta puts a lower bound on latency.
● We subtracted 10 minutes because the system is sometimes delayed by 10 minutes, yet most of the time the delay was under 1 minute!
Second attempt: dynamic leading-edge watermark
Still a leading-edge watermark, but with dynamic statistical models to compute how far back the lookback should be.
Second attempt: dynamic leading-edge watermark
Still many gaps in the output data:
● The input is too noisy.
● Many delays are unpredictable (e.g. a machine restarting).
● Models take time to adapt, and during that time you are dropping data.
Trailing-edge watermark
Tracking the minimum event time instead generally solved the problem.
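A minimal sketch of a trailing-edge watermark, assuming the stage can track every element still pending; the class and method names are hypothetical, not Millwheel's or Dataflow's actual code.

```java
import java.time.Instant;
import java.util.TreeMap;

/**
 * Trailing-edge watermark: tracks the minimum event time of work still pending in a stage,
 * so the watermark only advances once everything older has actually been processed.
 */
class TrailingEdgeWatermark {
  // Count of pending elements per event time (completion is assumed to follow arrival).
  private final TreeMap<Instant, Integer> pending = new TreeMap<>();
  private Instant lastReported = Instant.EPOCH;

  synchronized void onArrival(Instant eventTime) {
    pending.merge(eventTime, 1, Integer::sum);
  }

  synchronized void onCompletion(Instant eventTime) {
    pending.merge(eventTime, -1, (a, b) -> a + b == 0 ? null : a + b);
  }

  /** Monotonic by construction: the reported value never moves backwards. */
  synchronized Instant current() {
    if (!pending.isEmpty() && pending.firstKey().isAfter(lastReported)) {
      lastReported = pending.firstKey();
    }
    return lastReported;
  }
}
```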
Watermark: Definition
Given a node N of a computation graph G, let I_n be the sequence of input elements, processed in the order provided by an oracle. t: I_n -> R is a real-valued function on I_n called the timestamp function. A watermark function is a real-valued function W defined on prefixes of I_n satisfying the following:
● {W_n} = {W({I_1, …, I_n})} is eventually increasing.
● {W_n} is monotonic.
W is said to be a temporal watermark if it also satisfies W_n < t(I_m) for m >= n.
Load Variance
Streaming pipelines must keep up with their input.
Load varies throughout the day and the week, and spikes can happen at any time.
Load Variance: Hand Tuning
Every pipeline is different, and hand tuning is hard
Eventually tuning parameters go stale
Hand-tuned flags become cargo cult science
Must tune for worst case
● Tuning for the peak is wasteful
● If pipeline ever falls behind, must be able to catch up faster than real time.
○ An exactly-once streaming system is a batch system whenever it falls behind.
Techniques: Batching
Always process data in batches.
Batch sizes are dynamic: small when caught up, large while catching up.
Lesson: be careful about putting arbitrary limits on batches.
● Don't limit by event-time ranges - event-time density changes.
● Don't limit by windows - window policies change.
Batching limits will be especially painful while catching up on a backlog.
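A minimal sketch of dynamic, count-bounded batching, assuming the runner already knows whether it is behind on a backlog; the class and its parameters are illustrative, not Dataflow's actual batching code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Builds batches bounded by element count only - never by event-time range or window. */
class DynamicBatcher<T> {
  private final int caughtUpMax; // small batches when caught up, for low latency
  private final int backlogMax;  // large batches when behind, for high throughput

  DynamicBatcher(int caughtUpMax, int backlogMax) {
    this.caughtUpMax = caughtUpMax;
    this.backlogMax = backlogMax;
  }

  /** Drains up to the current limit from the queue into one batch. */
  List<T> nextBatch(Queue<T> queue, boolean behindOnBacklog) {
    int maxElements = behindOnBacklog ? backlogMax : caughtUpMax;
    List<T> batch = new ArrayList<>();
    T element;
    while (batch.size() < maxElements && (element = queue.poll()) != null) {
      batch.add(element);
    }
    return batch;
  }
}
```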
Techniques: Flow Control
A good adaptive backpressure system is critical
● Prevents workers from overloading and crashing
● Adaptive backpressure adapts to changing load.
● Reduces need to perfectly tune cluster.
Techniques: Flow Control
Soft resources: CPU. Hard resources: memory.
Signals:
● Queue length
● Memory usage
Eventually flow control will pause pulling from sources.
[Diagram: three workers, each running stages A, B, and C; one inter-worker stream is flow controlled]
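A minimal sketch of an admission check driven by the two signals above; the thresholds and class name are hypothetical, not the actual Dataflow flow-control logic.

```java
/** Decides whether a worker should keep accepting new input, based on the two signals above. */
class FlowController {
  private final int maxQueuedElements;  // queue-length signal
  private final long maxInFlightBytes;  // memory signal: bytes held for pending deliveries

  FlowController(int maxQueuedElements, long maxInFlightBytes) {
    this.maxQueuedElements = maxQueuedElements;
    this.maxInFlightBytes = maxInFlightBytes;
  }

  /**
   * When this returns false, upstream senders are flow controlled and, if the pressure
   * persists, the worker eventually stops pulling from its sources.
   */
  boolean admit(int queuedElements, long inFlightBytes) {
    return queuedElements < maxQueuedElements && inFlightBytes < maxInFlightBytes;
  }
}
```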
Techniques: Flow Control
What happens if all streams are flow controlled? Deadlock!
● Every worker is holding onto memory for pending deliveries.
● Every worker is flow controlling its input streams.
[Diagram: three workers, each running stages A, B, and C, with every inter-worker stream flow controlled]
Techniques: Flow Control
To avoid deadlock, workers must be able to release memory. This might involve canceling in-flight work to be retried later.
Dataflow streaming workers can spill pending deliveries to disk to release memory; the data is scanned back in later.
[Diagram: three workers, each running stages A, B, and C, with every inter-worker stream flow controlled]
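A minimal sketch of spilling pending deliveries to disk and scanning them back in later; the class, the length-prefixed file format, and the single spill file are assumptions for illustration, not the actual Dataflow worker implementation.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayDeque;
import java.util.Deque;

/** Buffers pending deliveries in memory, spilling them to disk under pressure to break deadlocks. */
class PendingDeliveries {
  private final Deque<byte[]> inMemory = new ArrayDeque<>();
  private final Path spillFile;
  private long inMemoryBytes = 0;

  PendingDeliveries(Path spillFile) {
    this.spillFile = spillFile;
  }

  void add(byte[] delivery) {
    inMemory.addLast(delivery);
    inMemoryBytes += delivery.length;
  }

  long inMemoryBytes() {
    return inMemoryBytes;
  }

  /** Writes all buffered deliveries to disk and releases their memory. */
  void spill() throws IOException {
    try (DataOutputStream out = new DataOutputStream(Files.newOutputStream(
        spillFile, StandardOpenOption.CREATE, StandardOpenOption.APPEND))) {
      while (!inMemory.isEmpty()) {
        byte[] delivery = inMemory.pollFirst();
        out.writeInt(delivery.length);
        out.write(delivery);
      }
    }
    inMemoryBytes = 0;
  }

  /** Scans spilled deliveries back in once memory pressure has eased. */
  void reload() throws IOException {
    if (!Files.exists(spillFile)) {
      return;
    }
    try (DataInputStream in = new DataInputStream(Files.newInputStream(spillFile))) {
      while (in.available() > 0) {
        byte[] delivery = new byte[in.readInt()];
        in.readFully(delivery);
        add(delivery);
      }
    }
    Files.delete(spillFile);
  }
}
```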
Techniques: Auto Scaling
Adaptive autoscaling allows elastic scaling with load.
Work must be dynamically load balanced to take advantage of autoscaling.
Techniques: Auto Scaling
Never assume a fixed set of workers: work ownership can be moved at any time.
All keys are hash sharded, and hash ranges are distributed among workers. RPCs are addressed to keys, not workers.
Separating storage from compute adds a lot of complexity to the exactly-once and consistency protocols!
[Diagram: hash ranges [0, 3), [3, a), and [a, f) distributed among workers, e.g. [0, 3) on worker23 and [3, a), [a, f) on worker32]
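A minimal sketch of key-addressed routing over hash ranges, assuming the range-to-worker assignment is pushed to the router whenever ownership moves; the class, the fingerprint choice, and the wrap-around rule are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

/** Routes requests by key: hash the key, then look up which worker currently owns that hash range. */
class KeyRouter {
  // Start of each hash range -> worker that currently owns it, e.g. {0x0 -> "worker23", ...}.
  // Replaced wholesale whenever ownership moves; nothing here assumes a fixed worker set.
  private volatile TreeMap<Long, String> rangeOwners = new TreeMap<>();

  void updateAssignment(Map<Long, String> newRangeOwners) {
    rangeOwners = new TreeMap<>(newRangeOwners);
  }

  /** RPCs are addressed to keys; the router resolves whichever worker owns the key right now. */
  String ownerOf(String key) {
    TreeMap<Long, String> current = rangeOwners;
    Map.Entry<Long, String> range = current.floorEntry(fingerprint(key));
    if (range == null) {
      range = current.lastEntry(); // wrap around if the hash falls below the first range start
    }
    return range == null ? null : range.getValue();
  }

  private static long fingerprint(String key) {
    // Any stable hash works for this sketch; a production system would use a strong fingerprint.
    return Integer.toUnsignedLong(key.hashCode());
  }
}
```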
Load Variance: Lesson
Dynamic control is key
No amount of static configuration works
Eventually the universe will outsmart your configuration
Separating compute from state storage to improve scalability
Sergei Sokolenko, Google (@datancoffee)
Streaming processing options in GCP
[Architecture diagram: IoT events and end-user apps feed Cloud Pub/Sub; Dataflow Streaming and Dataflow Batch process the data, with Cloud Composer for orchestration; results go to BigQuery (via the BigQuery Streaming API) for data warehousing, to Bigtable and other databases, and to Cloud AI Platform for machine learning, driving action]
Motivating Example: Spotify migrating the largest European Hadoop cluster to Dataflow
● Runs 80,000+ Dataflow jobs per month
● 90% batch, 10% streaming
Uses Dataflow for "everything":
● Music recommendations, ads targeting
● A/B testing, behavioral analysis, business metrics
Huge batch jobs:
● 26,000 CPUs, 166 TB RAM
● Processing 325 billion rows (240 TB) from Bigtable
Traditional Distributed Data Processing Architecture
● Jobs executed on clusters of VMs
● Job state stored on network-attached volumes
● Control plane orchestrates the data plane
[Diagram: a control-plane VM coordinating, over the network, four VMs running user code, each with its own attached state storage]
Traditional Architecture works well ...
… except for Joins and Group By's
[Diagram: a pipeline of Filter, Join, and Group stages reading from and writing to file systems (fs://) and databases]
Shuffling key-value pairs
● Starting with <K,V> pairs placed on different workers
● Goal: co-locate all pairs with the same key
[Diagram: four workers, each holding an unordered mix of <key, record> pairs for keys such as key1, key3, key5, and key8]
Shuffling key-value pairs
● Starting with <K,V> pairs placed on different workers
● Goal: co-locate all pairs with the same key
● Workers exchange <K,V> pairs
[Diagram: the same four workers exchanging <key, record> pairs over the network]
Shuffling key-value pairs
● Starting with <K,V> pairs placed on different workers
● Goal: co-locate all pairs with the same key
● Workers exchange <K,V> pairs
● Until everything is sorted
[Diagram: after the exchange, each worker holds a contiguous sorted key range: key1-key2, key3-key4, key5-key6, key7-key8]
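A single-process sketch of the core routing step behind this shuffle, assuming a simple hash partitioner; the class and the use of Java's built-in hashCode are illustrative only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Single-process sketch of a hash shuffle: route every <key, record> pair to a bucket by key. */
class HashShuffle {
  /** One list per destination "worker"; all records for a given key land in the same list. */
  static Map<Integer, List<Map.Entry<String, String>>> shuffle(
      List<Map.Entry<String, String>> pairs, int numWorkers) {
    Map<Integer, List<Map.Entry<String, String>>> byDestination = new HashMap<>();
    for (Map.Entry<String, String> pair : pairs) {
      // Same key -> same hash -> same destination, which is what co-locates the pairs.
      int destination = Math.floorMod(pair.getKey().hashCode(), numWorkers);
      byDestination.computeIfAbsent(destination, d -> new ArrayList<>()).add(pair);
    }
    return byDestination;
  }
}
```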
Traditional Architecture Requires Manual Tuning
● When data volumes exceed dozens of TBs
[Diagram: the same cluster of user-code VMs, control-plane VM, and per-VM state storage as before]
Distributed in-memory Shuffle in batch Cloud Dataflow
[Diagram: pipeline user code runs on Compute workers; shuffling operations are offloaded over a petabit network to the Dataflow Shuffle service, which combines a distributed in-memory file system with a distributed on-disk file system behind a shuffle proxy, with automatic zone placement across zones 'a', 'b', and 'c' of a region]
Faster Processing
No tuning required: Dataflow Shuffle is usually faster than worker-based shuffle, including shuffles backed by SSD persistent disks.
Better autoscaling keeps aggregate resource usage the same, but cuts processing time.
[Chart: shuffle runtime in minutes]
Supporting larger datasets
Dataflow Shuffle has been used to shuffle 300TB+ datasets.
[Chart: shuffled dataset size in TB]
Storing state: what about streaming pipelines?
Streaming shuffle
● Just like in batch, streams need to be grouped and joined: a distributed streaming shuffle.
Windowed data elements
● Time-window aggregations need to be buffered until their triggering conditions occur.
Goal: Grouping by Event Time into Time Windows
[Diagram: input elements arriving between 9:00 and 14:00 in processing time are grouped into output windows by their event time (9:00-14:00)]
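A minimal sketch of the window-assignment arithmetic behind this grouping, assuming fixed (tumbling) windows; the class name and the one-hour example in the comments are illustrative, chosen to match the hourly axis above.

```java
import java.time.Duration;
import java.time.Instant;

/** Assigns an event time to the fixed window that buffers it until the window's trigger fires. */
class FixedWindowAssigner {
  private final Duration size;

  FixedWindowAssigner(Duration size) {
    this.size = size;
  }

  /** Start of the window containing eventTime, e.g. 12:34 -> 12:00 for one-hour windows. */
  Instant windowStart(Instant eventTime) {
    long sizeMillis = size.toMillis();
    long start = Math.floorDiv(eventTime.toEpochMilli(), sizeMillis) * sizeMillis;
    return Instant.ofEpochMilli(start);
  }

  /** End of that window; the aggregation is typically emitted once the watermark passes it. */
  Instant windowEnd(Instant eventTime) {
    return windowStart(eventTime).plus(size);
  }
}
```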
Even more state to store on disks in streaming
Shuffled data elements
● Key ranges are assigned to workers
● Data elements for these keys are stored on Persistent Disks
Time window data
● Also assigned to workers
● When time windows close, the data is processed on the workers
[Diagram: four user-code VMs, each owning a key range (key 0000 - key 1234, key 1235 - key ABC2, key ABC3 - key DEF5, key DEF6 - key GHI2) with its state on attached storage]
Dataflow Streaming Engine
Benefits:
● Better supportability
● Fewer worker resources
● More efficient autoscaling
[Diagram: workers run only user code; the Streaming Engine service holds window state storage and performs the streaming shuffle]
Autoscaling: Even better with separate Compute and State Storage
[Diagram: with Streaming Engine, workers run only user code while window state storage and the streaming shuffle live in the service; without Streaming Engine, each VM runs user code and owns a key range of state (e.g. key 0000 - key 1234, key 1235 - key ABC2) on its own storage]
[Side-by-side comparison: Dataflow with Streaming Engine vs. Dataflow without Streaming Engine]
AB Tasty is using Dataflow Streaming Engine
● Personalization and experimentation platform
● Wanted things to work out of the box
Significant data volumes:
● 25 million user sessions per day
● 2B events per day
Dataflow usage profile:
● Streaming Engine for worry-free autoscaling
● Batch processing with FlexRS for cost savings
Main Takeaways
Trailing-edge watermarks provided a solution for triggering aggregations.
The system must be elastic and adaptive.
Separating compute from state storage helps make stream and batch processing scalable.
Thank You!
  • 42. Main Takeaways Trailing edge watermarks provided a solution for triggering aggregations The system must be elastic and adaptive Separating compute from state storage help make stream and batch processing scalable