Streaming datasets for personalization

Shriya Arora

What is Netflix’s Mission?
Entertainment by allowing you to stream content anywhere, anytime

What is Netflix’s Mission?
Entertainment by allowing you to stream personalized content anywhere, anytime

How much data do we process to have a personalized Netflix for everyone?
● 125M hours/day
● 86M active members
● 450B unique events/day
● 600+ Kafka topics
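
To put the daily totals above in per-second terms, here is a small back-of-the-envelope calculation. It only uses the figures from the slide; the averages assume traffic is spread evenly over the day, which is an assumption made for illustration (real peaks would be higher).

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Figures taken from the slide above
events_per_day = 450_000_000_000   # 450B unique events/day
hours_per_day = 125_000_000        # 125M hours/day
active_members = 86_000_000        # 86M active members

# Average rates, assuming an even spread over the day (illustrative only)
avg_events_per_second = events_per_day / SECONDS_PER_DAY
avg_hours_per_member = hours_per_day / active_members

print(f"{avg_events_per_second:,.0f} events/second on average")
print(f"{avg_hours_per_member:.2f} hours per member per day on average")
```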

Data Infrastructure
[Diagram: components of the data infrastructure]
● Application instances
● Keystone Ingestion Pipeline
● Raw data (S3/HDFS)
● Stream processing (Spark, Flink …)
● Batch processing (Spark/Pig/Hive/MR)
● Processed data (Tables/Indexers)

What do we solve with streaming that we can’t solve with batch ETL?
● Business Wins
○ Algorithms become more dynamic/responsive
○ Enables research by reducing time delay between event generation and consumption
○ Creates opportunity for new types of algorithms
● Technical Wins
○ Fewer moving parts means fewer places for error
○ Save on storage costs
○ Avoid long running jobs
■ Reduces processing resources
■ Shortens turnaround times

Picking a Stream Processing Engine
Things to consider:
● Problem Scope/Requirements
○ Event-based pure streaming or micro-batches?
○ Do you want to implement Lambda?
● Existing Internal Technologies
○ Streaming Infrastructure: What are other teams using?
○ ETL ecosystem: What about teams that don’t consume out of Kafka?
● What’s your team’s learning curve?
○ Do you know Spark?
○ Do you know Scala?

Getting started with Spark Streaming
Micro-batches
● Data is received as DStreams, which are easily converted to RDDs
● Supports all fundamental RDD operations like map/flatMap/reduce/reduceByKey
● Basic time-based windowing
● Checkpointing support for resilience to failures
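
The micro-batch model can be sketched in plain Python. This is not Spark code: `micro_batches` and `reduce_by_key` are hypothetical helpers that only imitate how a stream is sliced into fixed intervals and how a map followed by a reduceByKey-style step runs on each slice. The event data is made up.

```python
from collections import defaultdict

def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into consecutive batches of
    batch_interval seconds, the way a micro-batch engine slices a stream."""
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // batch_interval)].append(value)
    # Emit batches in time order; empty intervals are skipped in this sketch.
    return [batches[index] for index in sorted(batches)]

def reduce_by_key(pairs, combine):
    """Merge the values of each key with combine (a reduceByKey imitation)."""
    result = {}
    for key, value in pairs:
        result[key] = combine(result[key], value) if key in result else value
    return result

# Hypothetical playback events: (seconds since start, title id)
events = [(0.5, "title_a"), (1.2, "title_b"), (1.9, "title_a"),
          (2.1, "title_a"), (3.7, "title_b"), (3.8, "title_b")]

counts_per_batch = []
for batch in micro_batches(events, batch_interval=2):
    pairs = [(title, 1) for title in batch]                            # map step
    counts_per_batch.append(reduce_by_key(pairs, lambda a, b: a + b))  # reduce step

for counts in counts_per_batch:
    print(counts)
```

Each batch is processed on its own, so no result is visible before its interval closes.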

Writing a basic Spark Streaming app

Performance tuning your Spark streaming application
● Choice of micro-batch interval
○ The most important parameter
● Cluster memory
○ Large batch intervals need more memory
● Parallelism
○ DStreams naturally partitioned to Kafka partitions
○ Repartition can help with increased parallelism at the cost of shuffle
● # of CPUs
○ <= number of tasks
○ Depends on how computationally intensive your processing is
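
Why the batch interval matters so much can be shown with a little arithmetic. The sketch below is a simplified model, not Spark's scheduler: it assumes batches run one at a time and that each batch takes a fixed processing time. `scheduling_delays` is a hypothetical helper.

```python
def scheduling_delays(batch_interval, processing_times):
    """Return how long each batch waits before it can start.

    Batch i is ready at (i + 1) * batch_interval and starts only after
    the previous batch has finished (one batch at a time in this model).
    """
    delays = []
    previous_finish = 0
    for index, processing_time in enumerate(processing_times):
        ready_at = (index + 1) * batch_interval
        start_at = max(ready_at, previous_finish)
        delays.append(start_at - ready_at)
        previous_finish = start_at + processing_time
    return delays

# Processing fits inside the interval: no delay accumulates.
stable = scheduling_delays(batch_interval=10, processing_times=[8] * 5)

# Processing is slower than the interval: delay grows with every batch.
unstable = scheduling_delays(batch_interval=10, processing_times=[12] * 5)

print("stable:  ", stable)
print("unstable:", unstable)
```

In this model, once processing time exceeds the batch interval the delay never recovers on its own; it grows by the difference on every batch.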

Performance tuning your Flink application (Yet to be productionised)
● Persistent data storage for checkpointing
○ Fault-tolerant, highly-available system
○ Support high-throughput for frequent state updates
● Parallelism
○ Optimized for # of Kafka Partitions
○ Optimal number of slots/CPU
● Size of cluster
○ Function of your incoming stream
○ What is your bottleneck? Network/Memory/Computation
● Code Optimization
○ Build an optimal DAG with least network shuffle

Challenges with Spark
● Not a ‘pure’ event streaming system
○ Minimum latency of batch interval
○ Un-intuitive for stream design
● Choice of batch interval is a little too critical
○ Everything can go wrong if you choose it badly
○ Build-up of scheduling delay can lead to data loss
● Only time-based windowing
○ Cannot be used to solve session-stitching use cases, or trigger-based event aggregations*
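
The windowing limitation can be illustrated in plain Python. This sketch compares fixed time windows with gap-based session windows on made-up click timestamps; both helpers are hypothetical and neither uses a Spark API.

```python
def fixed_windows(timestamps, size):
    """Bucket timestamps into consecutive windows of `size` seconds."""
    windows = {}
    for t in sorted(timestamps):
        windows.setdefault(t // size, []).append(t)
    return list(windows.values())

def session_windows(timestamps, gap):
    """Stitch timestamps into sessions; a new session starts whenever
    the pause since the previous event is longer than `gap` seconds."""
    sessions = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][-1] <= gap:
            sessions[-1].append(t)
        else:
            sessions.append([t])
    return sessions

# Hypothetical click times (seconds) for one member
clicks = [5, 25, 55, 70, 300, 310]

fixed = fixed_windows(clicks, size=60)
sessions = session_windows(clicks, gap=30)

print("fixed 60s windows:", fixed)
print("30s-gap sessions: ", sessions)
```

The fixed windows cut the first burst of activity in two at the 60-second boundary, while the gap-based version keeps it together as one session.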

Challenges with Flink
● Non-trivial to bring up a basic app, newer concepts to adjust to
○ Complex (though powerful) concepts like watermarking, checkpointing, custom windows
● Insufficient monitoring and debugging tools
● Documentation is basic; online community support is not as widespread
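
As an example of one of those newer concepts, here is a simplified model of watermarking in plain Python. It is a sketch under stated assumptions, not Flink's API: the watermark trails the largest event time seen so far by a fixed bound, and any event at or below the current watermark is treated as late. `split_by_watermark` and the arrival data are hypothetical.

```python
def split_by_watermark(event_times, max_out_of_orderness):
    """Classify event times, in arrival order, as on-time or late.

    The watermark is (largest event time seen so far) minus
    max_out_of_orderness; an event at or below it counts as late.
    """
    max_seen = float("-inf")
    on_time, late = [], []
    for event_time in event_times:
        watermark = max_seen - max_out_of_orderness
        if event_time <= watermark:
            late.append(event_time)
        else:
            on_time.append(event_time)
        max_seen = max(max_seen, event_time)
    return on_time, late

# Event times in the order they arrive; some arrive out of order.
arrivals = [10, 12, 11, 20, 14, 21, 9]
on_time, late = split_by_watermark(arrivals, max_out_of_orderness=5)

print("on time:", on_time)
print("late:   ", late)
```

A larger bound tolerates more disorder but delays results, which is part of what makes the concept powerful and hard to tune.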

Challenges with Streaming
● Pioneer Tax: batch.getInfrastructure >= streaming.getInfrastructure
○ Analytics has historically been batch, so it is instinctively easier to formulate analytical problems in batch frameworks like MR, Pig, Hive etc.
○ Deployments are non-trivial
● Moving towards unbounded === moving towards “On Call”
○ Batch failures have to be addressed urgently; streaming failures have to be addressed immediately.
● Streaming outages more critical than batch outages
○ In batch it’s easy/cheap to recover from outages (as long as the data isn’t lost).
○ In streaming, data recovery (beyond the fault-tolerant limits of the system) can be difficult and costly.
