Shriya Arora
Streaming datasets for Personalization
What is Netflix’s Mission?
Entertainment by allowing you to stream content anywhere, anytime
What is Netflix’s Mission?
Entertainment by allowing you to stream personalized content anywhere, anytime
How much data do we process to have a personalized Netflix for everyone?
● 125M hours/day
● 86M active members
● 450B unique events/day
● 600+ Kafka topics
Data Infrastructure (diagram components)
● Application instances
● Keystone Ingestion Pipeline
● Raw data (S3/HDFS)
● Stream Processing (Spark, Flink …)
● Batch processing (Spark/Pig/Hive/MR)
● Processed data (Tables/Indexers)
What do we solve with streaming that we can’t solve with batch ETL?
● Business Wins
○ Algorithms become more dynamic/responsive
○ Enables research by reducing time delay between event generation and consumption
○ Creates opportunity for new types of algorithms
● Technical Wins
○ Fewer moving parts means fewer places for error
○ Save on storage costs
○ Avoid long running jobs
■ Reduces processing resources
■ Shortens turnaround times
Picking a Stream Processing Engine?
Things to consider:
● Problem Scope/Requirements
○ Event-based pure streaming or micro-batches?
○ Do you want to implement Lambda?
● Existing Internal Technologies
○ Streaming Infrastructure: What are other teams using?
○ ETL ecosystem: What about teams that don’t consume out of Kafka?
● What’s your team’s learning curve?
○ Do you know Spark?
○ Do you know Scala?
Getting started with Spark Streaming
Micro-batches
● Data is received as DStreams, which are easily converted to RDDs
● Supports all fundamental RDD operations such as map/flatMap/reduce/reduceByKey
● Basic time-based windowing
● Checkpointing support for resilience to failures
Writing a basic Spark Streaming app
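The code for this slide did not survive in the transcript. As a stand-in, here is a minimal plain-Python sketch of the micro-batch model described above. It is not the Spark API: the function names and the play events are hypothetical, and it only illustrates how a stream is cut into fixed intervals and then reduced by key within each batch.

```python
from collections import Counter, defaultdict


def micro_batches(events, batch_interval):
    """Group (timestamp, payload) events into batches of batch_interval seconds."""
    batches = defaultdict(list)
    for ts, payload in events:
        batches[int(ts // batch_interval)].append(payload)
    # Return batches in time order, like the sequence of RDDs behind a DStream.
    # Simplification: intervals with no events are skipped here.
    return [batches[k] for k in sorted(batches)]


def count_by_key(batch):
    """Count occurrences per key, the effect of map to (key, 1) plus reduceByKey."""
    return dict(Counter(batch))


# Hypothetical play events: (event time in seconds, title id)
events = [(0.5, "title_a"), (1.2, "title_b"), (1.9, "title_a"),
          (2.1, "title_a"), (3.7, "title_b"), (3.8, "title_b")]

per_batch_counts = [count_by_key(b) for b in micro_batches(events, batch_interval=2)]
print(per_batch_counts)
```

Each result is only available once its batch closes, which is why the batch interval sets a floor on latency.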
Performance tuning your Spark streaming application
● Choice of micro-batch interval
○ The most important parameter
● Cluster memory
○ Large batch intervals need more memory
● Parallelism
○ DStreams are naturally partitioned along Kafka partitions
○ Repartition can help with increased parallelism at the cost of shuffle
● Number of CPUs
○ Should be <= the number of tasks
○ Depends on how computationally intensive your processing is
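Why the batch interval matters so much can be seen with a little arithmetic. The sketch below is a simplified model, assuming batches are processed one at a time and in order; the function name and the numbers are illustrative, not measurements.

```python
def scheduling_delays(batch_interval, processing_times):
    """Return how long each batch waits before processing can start.

    Batch i is ready at (i + 1) * batch_interval and can only start once the
    previous batch has finished (assumption: one batch processed at a time).
    """
    delays = []
    free_at = 0  # time at which the processing queue becomes free
    for i, proc in enumerate(processing_times):
        ready_at = (i + 1) * batch_interval
        start = max(ready_at, free_at)
        delays.append(start - ready_at)
        free_at = start + proc
    return delays


# Processing finishes within the interval: no delay accumulates.
stable = scheduling_delays(batch_interval=10, processing_times=[8, 9, 8, 9])
# Processing takes longer than the interval: delay grows with every batch.
unstable = scheduling_delays(batch_interval=10, processing_times=[12, 12, 12, 12])
print(stable, unstable)
```

In this model the delay grows by (processing time minus batch interval) per batch, so an interval that is slightly too short never recovers on its own.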
Getting started with Flink
Performance tuning your Flink application (Yet to be productionised)
● Persistent data storage for checkpointing
○ Fault-tolerant, highly-available system
○ Support high-throughput for frequent state updates
● Parallelism
○ Optimize for the number of Kafka partitions
○ Find the optimal number of slots per CPU
● Size of cluster
○ Function of your incoming stream
○ What is your bottleneck? Network/Memory/Computation
● Code Optimization
○ Build an optimal DAG with least network shuffle
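To see why source parallelism is tuned against the number of Kafka partitions, here is a rough sketch that spreads partitions over parallel source subtasks. The round-robin assignment is a simplifying assumption for illustration, not the engine's actual assignment logic, and the function names are hypothetical.

```python
def assign_partitions(num_partitions, parallelism):
    """Spread Kafka partition ids over source subtasks round-robin (simplified)."""
    assignment = {subtask: [] for subtask in range(parallelism)}
    for partition in range(num_partitions):
        assignment[partition % parallelism].append(partition)
    return assignment


def idle_subtasks(assignment):
    """Subtasks that received no partition and therefore read nothing."""
    return [s for s, parts in assignment.items() if not parts]


balanced = assign_partitions(num_partitions=8, parallelism=4)   # 2 partitions each
over = assign_partitions(num_partitions=8, parallelism=12)      # 4 subtasks idle
skewed = assign_partitions(num_partitions=8, parallelism=3)     # 3, 3 and 2 partitions

print(idle_subtasks(over))
print(sorted(len(p) for p in skewed.values()))
```

Parallelism above the partition count leaves source subtasks with nothing to read, and a parallelism that does not divide the partition count gives an uneven load.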
Challenges with Spark
● Not a ‘pure’ event streaming system
○ Latency can never be lower than the batch interval
○ Un-intuitive for stream design
● Choice of batch interval is a little too critical
○ Everything can go wrong if you choose it badly
○ Build-up of scheduling delay can lead to data loss
● Only time-based windowing
○ Cannot be used to solve session-stitching use cases or trigger-based event aggregations*
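The windowing limitation is easier to see with data. This plain-Python sketch (hypothetical timestamps and function names, not engine code) compares fixed time windows with gap-based sessions for one member's events.

```python
def fixed_windows(timestamps, size):
    """Bucket timestamps into fixed, non-overlapping windows of `size` seconds."""
    windows = {}
    for ts in sorted(timestamps):
        windows.setdefault(ts // size, []).append(ts)
    return [windows[k] for k in sorted(windows)]


def sessionize(timestamps, gap):
    """Group timestamps into sessions; more than `gap` seconds of inactivity starts a new one."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


# Hypothetical event times (seconds) for a single member
events = [0, 20, 50, 70, 400, 410]

by_time = fixed_windows(events, size=60)
by_session = sessionize(events, gap=60)
print(by_time)
print(by_session)
```

The first burst of activity is one session but lands in two fixed windows, so stitching it back together needs state that outlives any single time window.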
Challenges with Flink
● Non-trivial to bring up a basic app; newer concepts to adjust to
○ Complex (though powerful) concepts like watermarking, checkpointing, custom windows
● Insufficient monitoring and debugging tools
● Documentation is basic; online community support is not as widespread
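As an illustration of one of those newer concepts, here is a simplified model of a bounded-out-of-orderness watermark. It is a sketch under stated assumptions, not Flink's API: the watermark is recomputed after every event (real engines typically emit it periodically), and the function name and timestamps are hypothetical.

```python
def classify_events(event_times, max_out_of_orderness):
    """Label events, in arrival order, as on-time or late against a watermark.

    The watermark trails the largest event time seen so far by
    max_out_of_orderness; an event older than the watermark counts as late.
    """
    watermark = float("-inf")
    max_seen = float("-inf")
    labels = []
    for ts in event_times:
        labels.append((ts, "late" if ts < watermark else "on-time"))
        max_seen = max(max_seen, ts)
        watermark = max_seen - max_out_of_orderness
    return labels


# Event times in the order they arrive; 13 shows up after 20 has been seen.
labels = classify_events([10, 12, 11, 20, 13, 19], max_out_of_orderness=5)
print(labels)
```

Event 11 arrives out of order but within the bound, so it is still on time; event 13 arrives after the watermark has passed 15 and is treated as late.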
Challenges with Streaming
● Pioneer Tax: batch.getInfrastructure >= streaming.getInfrastructure
○ Analytics has historically been batch; it is instinctively easier to formulate analytical problems in batch frameworks like MR, Pig, Hive, etc.
○ Deployments are non-trivial
● Moving towards unbounded === moving towards “On Call”
○ Batch failures have to be addressed urgently; streaming failures have to be addressed immediately.
● Streaming outages more critical than batch outages
○ In batch it’s easy/cheap to recover from outages (as long as the data isn’t lost).
○ In streaming, data recovery (beyond the fault-tolerant limits of the system) can be exhausting
Questions?


Editor's Notes

  • #5: Total number of processed events is much higher (~1.5T) because of duplicates and redundancy. Thousands of shows in every country, ma
  • #6: Post processing required