A NETFLIX ORIGINAL SERVICE
Peter Bakas | @peter_bakas
@ Netflix : Cloud Platform Engineering - Real Time Data Infrastructure
@ Ooyala : Analytics, Discovery, Platform Engineering & Infrastructure
@ Yahoo : Display Advertising, Behavioral Targeting, Payments
@ PayPal : Site Engineering and Architecture
@ Play : Advisor to Startups (Data, Security, Containers)
Who is this guy?
A common data pipeline to collect, transport, aggregate, process, and visualize events
Why are we here?
● Architectural design and principles for Keystone
● Technologies that Keystone is leveraging
● Best practices
What should I expect?
Let’s get down to business
Netflix is a logging company
that occasionally streams movies
600+ billion events ingested per day
11 million events per second (24 GB per second) at peak
Hundreds of event types
Over 1.3 petabytes per day
Numbers Galore!
But wait, there’s more
1+ trillion events processed every day
1 trillion events ingested per day during holiday season
Numbers Galore - Part Deux
How did we get here?
Chukwa
Chukwa/Suro + Real-Time Branch
Keystone
[Architecture diagram: Event Producers (directly or via an HTTP proxy) publish to Fronting Kafka; a Samza Router routes events to Consumer Kafka and EMR sinks; Stream Consumers read from Consumer Kafka; a Control Plane manages the pipeline.]
Kafka Primer
Kafka is a distributed, partitioned, replicated commit log service.
Kafka Terminology
● Producer
● Consumer
● Topic
● Partition
● Broker
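To make these terms concrete, here is a minimal, self-contained Java example using the standard Apache Kafka client API. The broker address, topic name, and group id are placeholders, not anything from the Keystone deployment: a producer appends a record to a partition of a topic hosted on a broker, and a consumer reads that partition's log in order, tracking its offset.

```java
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaTermsDemo {
    public static void main(String[] args) {
        Properties p = new Properties();
        p.put("bootstrap.servers", "localhost:9092"); // broker (placeholder address)
        p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
            // Producer appends a record to a partition of the "events" topic.
            producer.send(new ProducerRecord<>("events", "device-123", "{\"type\":\"play\"}"));
        }

        Properties c = new Properties();
        c.put("bootstrap.servers", "localhost:9092");
        c.put("group.id", "demo");
        c.put("auto.offset.reset", "earliest");
        c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
            // Consumer reads the partition's log in order, tracking its offset.
            consumer.subscribe(Collections.singletonList("events"));
            ConsumerRecords<String, String> records = consumer.poll(1000);
            for (ConsumerRecord<String, String> r : records) {
                System.out.printf("partition=%d offset=%d value=%s%n",
                        r.partition(), r.offset(), r.value());
            }
        }
    }
}
```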
Netflix Kafka Producer
● Best effort delivery
● Prefer dropping messages to disrupting the producer app
● Wraps Apache Kafka Producer
● Integration with Netflix ecosystem: Eureka, Atlas, etc.
Producer Impact
● A Kafka outage does not prevent existing instances from serving their purpose
● A Kafka outage should never prevent new instances from starting up
● After the Kafka cluster is restored, event producing should resume automatically
Prefer Drop over Block
● Drop when the buffer is full
● Handle potential blocking of the first metadata request
● acks=1 (vs. 2) (see the config sketch below)
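The deck does not show the actual producer configuration; the snippet below is only a sketch of what a "drop rather than block" setup could look like with standard Kafka 0.9-era producer settings. The broker address, buffer size, and linger value are assumptions; the wrapping Netflix producer would catch the send-time exception and count the message as dropped.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;

// Sketch of "drop rather than block" producer settings; values are illustrative,
// not Netflix's production configuration.
public final class DropNotBlockProducerConfig {
    public static Producer<byte[], byte[]> create() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "fronting-kafka:9092"); // placeholder address
        props.put("acks", "1");                  // wait for the partition leader only
        props.put("max.block.ms", "0");          // never block the caller on metadata or a full buffer
        props.put("buffer.memory", "33554432");  // bounded in-memory buffer (32 MB, assumed)
        props.put("linger.ms", "100");           // small linger so the sticky partitioner can batch
        props.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        // With max.block.ms=0 a full buffer makes send() throw instead of blocking;
        // the wrapping producer catches that, drops the message, and bumps a drop counter.
        return new KafkaProducer<>(props);
    }

    private DropNotBlockProducerConfig() {}
}
```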
Sticky Partitioner
● Batching is important to reduce CPU and network I/O on brokers
● Stick to one partition for a while when producing non-keyed messages
● linger.ms works well with the sticky partitioner (see the sketch below)
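Netflix's partitioner itself is not shown in the slides; the sketch below only illustrates the idea against Kafka's public Partitioner interface. The class name and the 30-second stickiness window are assumptions. Non-keyed messages stay on one randomly chosen partition for a while so batches fill up and brokers see fewer, larger requests; it would be plugged in through the standard partitioner.class producer property.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.PartitionInfo;

// Illustrative sticky partitioner: non-keyed messages stick to one partition for a
// configurable window so the producer builds larger batches per partition.
public class StickyPartitionerSketch implements Partitioner {

    private static final long STICK_MS = 30_000L;   // assumed stickiness window

    // Not strictly thread-safe; good enough for a sketch.
    private volatile int stickyPartition = -1;
    private volatile long lastSwitchMs = 0L;

    @Override
    public void configure(Map<String, ?> configs) { }

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        // Keyed messages keep the usual hash-by-key behaviour.
        if (keyBytes != null) {
            int numPartitions = cluster.partitionsForTopic(topic).size();
            return (Arrays.hashCode(keyBytes) & 0x7fffffff) % numPartitions;
        }
        List<PartitionInfo> partitions = cluster.availablePartitionsForTopic(topic);
        if (partitions.isEmpty()) {
            partitions = cluster.partitionsForTopic(topic);
        }
        long now = System.currentTimeMillis();
        if (stickyPartition < 0 || now - lastSwitchMs > STICK_MS) {
            // Rotate to a new random partition once the window expires.
            stickyPartition = partitions.get(
                    ThreadLocalRandom.current().nextInt(partitions.size())).partition();
            lastSwitchMs = now;
        }
        return stickyPartition;
    }

    @Override
    public void close() { }
}
```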
Producing events to Keystone
● Using Netflix Platform logging API
○ LogManager.logEvent(Annotatable): majority of the cases
○ KeyValueSerializer with ILog#log(String)
● REST endpoint that proxies Platform logging
○ ksproxy
○ Prana sidecar
Injected Event Metadata
● GUID
● Timestamp
● Host
● App
Keystone Extensible Wire Protocol
● Invisible to sources & sinks
● Backwards and forwards compatibility
● Supports JSON; Avro on the horizon
● Efficient - ~10 bytes of overhead per message (see the envelope sketch below)
○ message size - hundreds of bytes to 10MB
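The wire format is internal to Netflix; purely to illustrate how a fixed envelope of roughly 10 bytes can stay extensible while leaving the payload untouched, here is a hypothetical layout. Every field name, id, and size below is an assumption, not the actual Keystone protocol.

```java
import java.nio.ByteBuffer;

// Hypothetical envelope sketch: a ~10-byte fixed header in front of an opaque,
// immutable event payload. Field names and sizes are assumptions for illustration.
public final class EnvelopeSketch {
    private static final byte MAGIC = 0x4B;   // 'K' (assumed)
    private static final byte VERSION = 1;    // bump to evolve the protocol independently

    // magic(1) + version(1) + payload format(1) + flags(1) + metadata length(2) + payload length(4)
    private static final int HEADER_BYTES = 10;

    public static byte[] wrap(byte payloadFormat, byte[] metadata, byte[] payload) {
        ByteBuffer buf = ByteBuffer.allocate(HEADER_BYTES + metadata.length + payload.length);
        buf.put(MAGIC)
           .put(VERSION)
           .put(payloadFormat)               // e.g. 0 = JSON, 1 = Avro (assumed ids)
           .put((byte) 0)                    // flags, unused in this sketch
           .putShort((short) metadata.length)
           .putInt(payload.length)
           .put(metadata)                    // injected metadata (GUID, timestamp, host, app)
           .put(payload);                    // payload is passed through untouched
        return buf.array();
    }

    private EnvelopeSketch() {}
}
```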
Keystone Extensible Wire Protocol
● Packaged as a jar
● Why? Evolve Independently
○ event metadata & traceability metadata
○ event payload serialization
Max message size: 10 MB
● Keystone drops any message larger than 10 MB
○ Event payloads are immutable, so oversized messages cannot be truncated
Fronting Kafka Clusters
[Keystone architecture diagram repeated, highlighting the Fronting Kafka clusters.]
Fronting Kafka Clusters
● Normal-priority (majority)
● High-priority (streaming activities etc.)
Fronting Kafka Instances
● 3 ASGs per cluster, 1 ASG per zone
● 3000 d2.xl AWS instances across 3 regions for regular & failover traffic
Partition Assignment
● All replica assignments are zone aware (see the assignment sketch below)
○ Improved availability
○ Reduced cost of maintenance
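The actual assignment tooling is not shown in the deck; the sketch below only illustrates the zone-aware idea: spread each partition's replicas across distinct availability zones so a single-zone outage never removes all copies. Class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of zone-aware replica assignment; not Kafka's or Netflix's actual code.
public class ZoneAwareAssignmentSketch {

    /**
     * brokersByZone.get(z) = broker ids in zone z; returns replica broker ids per partition.
     * Assumes replicationFactor <= number of zones so each replica lands in a different zone.
     */
    public static List<List<Integer>> assign(List<List<Integer>> brokersByZone,
                                             int numPartitions, int replicationFactor) {
        int zones = brokersByZone.size();
        List<List<Integer>> assignment = new ArrayList<>();
        for (int partition = 0; partition < numPartitions; partition++) {
            List<Integer> replicas = new ArrayList<>();
            for (int r = 0; r < replicationFactor; r++) {
                // Each replica of this partition goes to a different zone.
                List<Integer> zoneBrokers = brokersByZone.get((partition + r) % zones);
                replicas.add(zoneBrokers.get(partition % zoneBrokers.size()));
            }
            assignment.add(replicas);
        }
        return assignment;
    }
}
```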
Kafka Fault Tolerance
● Instance failure
○ With replication factor of N, guarantee no data loss with N-1 failures
○ With zone-aware replica assignment, guarantee no data loss with multiple instance failures in the same zone
● Sink failure
○ No data loss during retention period
● Replication is the key
○ Data loss can happen if leader dies while follower AND consumer cannot catch up
○ Usually indicated by UncleanLeaderElection metric
Kafka Auditor as a Service
● Broker monitoring
● Consumer monitoring
● Heart-beat & Continuous message latency
● On-demand Broker performance testing
● Built as a service deployable on single or multiple instances
Current Issues
● With the d2.xl instance type there is a trade-off between cost and performance
● Performance deteriorates as the number of partitions increases
● Replication lag during peak traffic
Routing Service
[Keystone architecture diagram repeated, highlighting the Samza Router.]
[Diagram: routing infrastructure = brokers plus a checkpointing cluster (0.9.1 noted on the slide).]
[Diagram: a Job Manager (control plane) assigns router Jobs to ksnode EC2 instances in an ASG; ZooKeeper handles instance-id assignment, offsets are checkpointed to the checkpointing cluster, and state is reconciled every minute.]
Routing Layer
● Total of 13,000 containers on 1,300 AWS C3.4XL instances
○ S3 sink: ~7,000 containers
○ Consumer Kafka sink: ~4,500 containers
○ Elasticsearch sink: ~1,500 containers
Routing Layer
● Total of ~1400 streams across all regions
○ ~1000 S3 streams
○ ~250 Consumer Kafka streams
○ ~150 Elasticsearch streams
Router Job Details
● One Job per sink and Kafka source topic
○ Separate Jobs for the S3, Elasticsearch & Kafka sinks
○ Provides better isolation & better QoS
● Messages are batched into requests to the sinks (see the loop sketch below)
● Offsets are checkpointed only after the batch request succeeds
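The routers run as Samza jobs; as a simplified stand-in for the batch-then-checkpoint behaviour described above, here is a plain Kafka consumer sketch (topic name and sink interface are hypothetical). Committing offsets only after the sink accepts the whole batch is what yields at-least-once delivery: a failed batch is retried and can produce duplicates, but nothing is acknowledged before it has been written.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Simplified router loop: read a batch from the fronting Kafka topic, write it to
// the sink, and only then checkpoint offsets. Topic and sink are placeholders.
public class RouterLoopSketch {

    /** Hypothetical sink interface (S3, Consumer Kafka, or Elasticsearch behind it). */
    interface Sink {
        void writeBatch(List<ConsumerRecord<byte[], byte[]>> batch) throws Exception;
    }

    public static void run(KafkaConsumer<byte[], byte[]> consumer, Sink sink) {
        consumer.subscribe(Collections.singletonList("keystone-source-topic")); // placeholder topic
        while (true) {
            ConsumerRecords<byte[], byte[]> records = consumer.poll(500);
            if (records.count() == 0) {
                continue;
            }
            List<ConsumerRecord<byte[], byte[]>> batch = new ArrayList<>();
            for (ConsumerRecord<byte[], byte[]> r : records) {
                batch.add(r);
            }
            try {
                sink.writeBatch(batch);   // e.g. one S3 upload or one bulk request
                consumer.commitSync();    // checkpoint offsets only after the batch succeeded
            } catch (Exception e) {
                // Batch failed: do not commit; the records will be re-read and retried,
                // which is where at-least-once duplicates can come from.
            }
        }
    }
}
```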
Processing Semantics
Data Loss & Duplicates
Backpressure
Producer ⇐ Kafka Cluster ⇐ Samza router job ⇐ Sink
● Keystone - at-least-once delivery
Data Loss - Producer
● buffer full
● network error
● partition leader change
● partition migration
Data Loss - Kafka
● Lose all Kafka replicas of data
○ Safeguards:
■ AZ isolation / broker replacement automation
■ Alerts and monitoring
● Unclean partition leader election
○ ack = 1 could cause loss
Data Loss - Router
● Checkpointed offset lost & the router was down for the full retention period
● Messages not processed within the retention period (8h / 24h)
● Unclean leader election causes offsets to go backwards
● Safeguard
○ alerts when lag > 0.1% of traffic for 10 minutes
● A concern only if router instances cannot be launched
Duplicates Router - Sink
● Duplicates possible
○ messages reprocessed - retry after batch S3 upload failure
○ Loss of checkpointed offset (message processed marker)
○ Event GUID helps dedup
Measure Duplicates
● Diff the producer sent count against Kafka messages received
● Monitor the router checkpointed offset over time
Note: the GUID can be used to dedup at the sink (see the sketch below)
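As the note says, the injected GUID makes deduplication at the sink possible. A minimal sketch follows; the bounded window size and class name are assumptions, and in practice the GUID only helps measure or reduce duplicates since Keystone itself stays at-least-once.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of GUID-based dedup at a sink: remember the last N GUIDs seen and skip any
// event whose GUID is already in the window. The window size is an assumption.
public class GuidDeduper {
    private static final int WINDOW = 1_000_000;

    private final Map<String, Boolean> seen =
        new LinkedHashMap<String, Boolean>(16, 0.75f, false) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > WINDOW;   // bound memory by evicting the oldest GUIDs
            }
        };

    /** Returns true if this GUID has not been seen yet and the event should be processed. */
    public synchronized boolean firstTime(String guid) {
        return seen.put(guid, Boolean.TRUE) == null;
    }
}
```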
End to End metrics
● Producer to Router to Sink Average Latencies
○ Batch processing [S3 sink]: ~3 sec
○ Stream processing [Consumer Kafka sink]: ~1 sec
○ Log analysis [Elasticsearch]: ~400 seconds (with backpressure)
End to End metrics
● End to End latencies
○ S3:
■ 50th percentile under 1 second
■ 80th percentile under 8 seconds
○ Consumer Kafka:
■ 50th percentile under 800 ms
■ 80th percentile under 4 seconds
○ Elasticsearch:
■ 50th percentile under 13 seconds
■ 80th percentile under 53 seconds
Alerts
● Producer drop rate over 1%
● Consumer lag > 0.1%
● Next offset after the checkpointed offset not found
● Consumer stuck at the partition level
Keystone Dashboard (dashboard screenshots)
And Then???
There’s more in the pipeline...
● Self-service tools
● Better management of Kafka scaling
● More capable control plane
● JSON support exists; support for Avro on the horizon
● Multi-tenant Messaging as a Service - MaaS
● Multi-tenant Stream Processing as a Service - SPaaS
???s
