Streaming items through a
cluster with Spark Streaming
Tathagata “TD” Das
@tathadas
CME 323: Distributed Algorithms and Optimization
Stanford, May 6, 2015
Who am I?
> Project Management Committee (PMC)
member of Apache Spark
> Lead developer of Spark Streaming
> Formerly in AMPLab, UC Berkeley
> Software developer at Databricks
> Databricks was started by creators of Spark to
provide Spark-as-a-service in the cloud
Big Data
Big Streaming Data
Why process Big Streaming
Data?
Fraud detection in bank transactions
Anomalies in sensor data
Cat videos in tweets
How to Process Big Streaming
Data
> Ingest – Receive and buffer the streaming data
> Process – Clean, extract, transform the data
> Store – Store transformed data for consumption
Ingest
data
Process
data
Store
results
Raw Tweets
How to Process Big Streaming
Data
> For big streams, every step requires a cluster
> Every step requires a system that is designed for
it
Stream Ingestion Systems
> Kafka – popular distributed pub-sub system
> Kinesis – Amazon managed distributed pub-sub
> Flume – like a distributed data pipe
Stream Processing Systems
> Spark Streaming – From UC Berkeley + Databricks
> Storm – From Backtype + Twitter
> Samza – From LinkedIn
Data Storage Systems
> File systems – HDFS, Amazon S3, etc.
> Key-value stores – HBase, Cassandra, etc.
> Databases – MongoDB, MemSQL, etc.
Kafka Cluster
Producers and
Consumers
> Producers publish data tagged by “topic”
> Consumers subscribe to data of a particular “topic”
Producer 1
Producer 2
(topicX, data1)
(topicY, data2)
(topicX, data3)
Topic X
Consumer
Topic Y
Consumer
(topicX, data1)
(topicX, data3)
(topicY, data2)
> Topic = category of message, divided into
partitions
> Partition = ordered, numbered stream of messages
> Producer decides which (topic, partition) to put
each message in
> Consumer decides which (topic, partition) to pull
messages from
- High-level consumer – handles fault-recovery with
Zookeeper
- Simple consumer – low-level API for greater control
Topics and Partitions
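The producer-side choice of (topic, partition) can be sketched in pure Python. This is a toy simulation of the idea, not Kafka's actual client code: `choose_partition` and `Topic` are illustrative names, and hashing the key so that the same key always lands in the same ordered partition mirrors common hash-partitioning producer behavior.

```python
import zlib

def choose_partition(key: str, num_partitions: int) -> int:
    # Hash the key deterministically so the same key always
    # maps to the same partition (and thus stays ordered).
    return zlib.crc32(key.encode()) % num_partitions

class Topic:
    """A toy in-memory topic: a fixed set of ordered partition logs."""

    def __init__(self, name: str, num_partitions: int):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def publish(self, key: str, value: str) -> int:
        # Producer decides which partition each message goes to.
        p = choose_partition(key, len(self.partitions))
        self.partitions[p].append((key, value))  # append preserves order
        return p
```

A consumer would then pull from a chosen (topic, partition) and read messages in the order they were appended.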
How to process Kafka
messages?
> Incoming tweets received in distributed manner
and buffered in Kafka
> How to process them?
What is Spark Streaming?
Scalable, fault-tolerant stream processing system
File systems
Databases
Dashboards
Flume
HDFS
Kinesis
Kafka
Twitter
High-level API
joins, windows, …
often 5x less code
Fault-tolerant
Exactly-once semantics,
even for stateful ops
Integration
Integrate with MLlib, SQL,
DataFrames, GraphX
How does Spark Streaming
work?
> Receivers chop up data streams into batches of a few seconds
> Spark processing engine processes each batch
and pushes out the results to external data
stores
data streams
Receivers
batches
as RDDs
results as
RDDs
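The batching step above can be sketched in pure Python. This is a simulation of the idea only, not Spark's receiver code: `batch_stream` is an illustrative name, and timestamps are bucketed by the batch interval the way a receiver chops a stream into batches.

```python
def batch_stream(records, batch_interval):
    """Group (timestamp_seconds, value) records into consecutive
    batches of `batch_interval` seconds, like a receiver chopping
    a stream into micro-batches."""
    batches = {}
    for t, value in records:
        # All records in the same interval land in the same batch.
        batches.setdefault(int(t // batch_interval), []).append(value)
    # Emit batches in time order; each batch would become one RDD.
    return [batches[k] for k in sorted(batches)]
```

Each emitted batch is what Spark Streaming would hand to the processing engine as one RDD.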
Spark Programming Model
> Resilient distributed datasets (RDDs)
- Distributed, partitioned collection of objects
- Manipulated through parallel transformations
(map, filter, reduceByKey, …)
- All transformations are lazy, execution forced by
actions
(count, reduce, take, …)
- Can be cached in memory across cluster
- Automatically rebuilt on failure
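The lazy-transformation model can be illustrated with plain Python generators. This is a sketch of the semantics, not Spark's implementation: transformations only build a pipeline, and nothing executes until an action forces it.

```python
trace = []  # records when work actually happens

def times_two(x):
    trace.append(x)  # side effect lets us observe execution
    return x * 2

nums = range(5)
# Build the pipeline: generators are lazy, like RDD transformations.
mapped = (times_two(x) for x in nums)       # "map"
filtered = (x for x in mapped if x > 5)     # "filter"

assert trace == []       # nothing has run yet: transformations are lazy
result = list(filtered)  # "action": forces the whole pipeline to execute
```

Only when `list(...)` (the "action") runs does `times_two` get called on every element, mirroring how `count`, `reduce`, or `take` force RDD evaluation.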
Spark Streaming Programming
Model
> Discretized Stream (DStream)
- Represents a stream of data
- Implemented as an infinite sequence of RDDs
> DStream API very similar to RDD API
- Functional APIs
- Create input DStreams from Kafka, Flume, Kinesis, HDFS, …
- Apply transformations
Example – Get hashtags from
Twitter
val ssc = new StreamingContext(conf, Seconds(1))
StreamingContext is the starting
point of all streaming functionality
Batch interval, by which
streams will be chopped up
Example – Get hashtags from
Twitter
val ssc = new StreamingContext(conf, Seconds(1))
val tweets = TwitterUtils.createStream(ssc, auth)
Input DStream
batch @ t | batch @ t+1 | batch @ t+2
tweets DStream
replicated and stored in
memory as RDDs
Twitter Streaming API
Example – Get hashtags from
Twitter
val tweets = TwitterUtils.createStream(ssc, None)
val hashTags = tweets.flatMap(status => getTags(status))
flatMap flatMap flatMap
…
transformation: modify data in one
DStream to create another DStream
transformed
DStream
new RDDs created
for every batch
batch @ t | batch @ t+1 | batch @ t+2
tweets DStream
hashTags DStream
[#cat, #dog, … ]
Example – Get hashtags from
Twitter
val tweets = TwitterUtils.createStream(ssc, None)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsTextFiles("hdfs://...")
output operation: to push data to external storage
flatMap flatMap flatMap
save save save
batch @ t | batch @ t+1 | batch @ t+2
tweets DStream
hashTags DStream
every batch
saved to HDFS
Example – Get hashtags from
Twitter
val tweets = TwitterUtils.createStream(ssc, None)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.foreachRDD(hashTagRDD => { ... })
foreachRDD: do whatever you want with the processed data
flatMap flatMap flatMap
foreach foreach foreach
batch @ t | batch @ t+1 | batch @ t+2
tweets DStream
hashTags DStream
Write to a database, update analytics
UI, do whatever you want
Example – Get hashtags from
Twitter
val tweets = TwitterUtils.createStream(ssc, None)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.foreachRDD(hashTagRDD => { ... })
ssc.start()
all of this was just setup for what to do when
streaming data is received
this actually starts the receiving and processing
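The per-batch behavior of this example can be simulated end-to-end in pure Python on toy data. `get_tags` and `flat_map` here are illustrative stand-ins for the Scala code above, not Spark or Twitter APIs.

```python
def get_tags(status):
    # Hypothetical stand-in for getTags(): pull #hashtags from tweet text.
    return [w for w in status.split() if w.startswith("#")]

def flat_map(f, batch):
    # flatMap: apply f to each record and flatten the results.
    return [item for record in batch for item in f(record)]

# Two micro-batches of toy tweets, as a receiver might deliver them.
batches = [
    ["love my #cat", "walking the #dog #outside"],
    ["no tags here"],
]

saved = []  # stands in for the output operation (saveAsTextFiles / foreachRDD)
for batch in batches:
    saved.append(flat_map(get_tags, batch))
```

Each loop iteration corresponds to one batch interval: a new RDD of tweets arrives, the transformation produces a new RDD of hashtags, and the output operation pushes it out.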
What’s going on inside?
> Receiver buffers tweets
in Executors’ memory
> Spark Streaming Driver
launches tasks to
process tweets
Driver
running
DStreams
Executors
launch tasks to
process tweets
Raw Tweets
Twitter
Receiver
Buffered
Tweets
Buffered
Tweets
Spark Cluster
What’s going on inside?
Driver
running
DStreams
Executors
launch tasks to
process data
Receiver
Kafka Cluster
Receiver Receiver
receive data in
parallel
Spark Cluster
Performance
Can process 60M records/sec (6 GB/sec) on
100 nodes at sub-second latency
[Chart: WordCount – cluster throughput (GB/s) vs. # nodes in cluster, for 1 sec and 2 sec batch intervals]
[Chart: Grep – cluster throughput (GB/s) vs. # nodes in cluster, for 1 sec and 2 sec batch intervals]
Window-based Transformations
val tweets = TwitterUtils.createStream(ssc, auth)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.window(Minutes(1), Seconds(5)).countByValue()
sliding window operation: window length = Minutes(1), sliding interval = Seconds(5)
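The windowed count can be sketched in pure Python. This is a simulation of the semantics, not Spark code: `windowed_counts` is an illustrative name, and window length and slide are expressed in units of whole batch intervals for simplicity.

```python
from collections import Counter, deque

def windowed_counts(batches, window_len, slide):
    """countByValue over a sliding window. window_len and slide are in
    units of batch intervals, so window(Minutes(1), Seconds(5)) with a
    5-second batch interval would be window_len=12, slide=1."""
    results = []
    window = deque(maxlen=window_len)  # old batches fall off automatically
    for i, batch in enumerate(batches, start=1):
        window.append(batch)
        if i % slide == 0:  # emit one result per sliding interval
            results.append(Counter(v for b in window for v in b))
    return results
```

Each emitted `Counter` plays the role of one RDD in the `tagCounts` DStream: the counts over the last `window_len` batches, refreshed every `slide` batches.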
Arbitrary Stateful Computations
Specify function to generate new state based on
previous state and new data
- Example: Maintain per-user mood as state, and update
it with their tweets
def updateMood(newTweets, lastMood) => newMood
val moods = tweetsByUser.updateStateByKey(updateMood _)
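The semantics of `updateStateByKey` can be simulated in pure Python. Both functions below are illustrative: `update_state_by_key` mirrors the per-key update contract (new values plus previous state, or `None` for a new key), and `update_mood` is a toy mood model assumed for the example, not the talk's actual function.

```python
def update_state_by_key(state, new_values, update_fn):
    """Apply update_fn(new_values_for_key, old_state_or_None) per key,
    mirroring the semantics of DStream.updateStateByKey."""
    keys = set(state) | set(new_values)
    return {k: update_fn(new_values.get(k, []), state.get(k)) for k in keys}

def update_mood(new_tweets, last_mood):
    # Toy mood model (an assumption for illustration):
    # running count of ":)" minus ":(" across all tweets seen so far.
    score = last_mood or 0
    for t in new_tweets:
        score += t.count(":)") - t.count(":(")
    return score

state = {}
# Batch 1 of tweets per user:
state = update_state_by_key(
    state, {"alice": ["great day :)"], "bob": ["ugh :("]}, update_mood)
# Batch 2: only alice tweets; bob's state carries forward unchanged.
state = update_state_by_key(
    state, {"alice": ["still happy :)"]}, update_mood)
```

Note that keys with no new data still get their state re-evaluated, which is also how a key's state can persist (or be expired) across batches.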
Integrates with Spark
Ecosystem
Spark Core
Spark
Streaming
Spark
SQL
MLlib GraphX
Combine batch and streaming
processing
> Join data streams with static data sets
// Create data set from Hadoop file
val dataset = sparkContext.hadoopFile("file")
// Join each batch in stream with dataset
kafkaStream.transform { batchRDD =>
batchRDD.join(dataset).filter(...)
}
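The stream-static join can be sketched per batch in pure Python. This is a simulation of the semantics only: `join_batch` is an illustrative name, and the static data set is modeled as a key-to-info lookup.

```python
def join_batch(batch, static):
    """Inner-join one micro-batch of (key, value) pairs with a static
    key -> info lookup, like batchRDD.join(dataset) applied per batch."""
    return [(k, (v, static[k])) for k, v in batch if k in static]

static_dataset = {"u1": "US", "u2": "DE"}  # the static data set
batch = [("u1", "click"), ("u3", "view"), ("u2", "click")]
joined = join_batch(batch, static_dataset)
```

Keys absent from the static set (here `u3`) are dropped, matching inner-join semantics; the real `transform` would run this join on every batch of the stream.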
Combine machine learning with
streaming
> Learn models offline, apply them online
// Learn model offline
val model = KMeans.train(dataset, ...)
// Apply model online on stream
kafkaStream.map { event =>
model.predict(event.feature)
}
Combine SQL with streaming
> Interactively query streaming data with SQL
// Register each batch in stream as table
kafkaStream.foreachRDD { batchRDD =>
batchRDD.registerTempTable("latestEvents")
}
// Interactively query table
sqlContext.sql("select * from latestEvents")
100+ known industry
deployments
Why are they adopting Spark
Streaming?
Easy, high-level API
Unified API across batch and streaming
Integration with Spark SQL and MLlib
Ease of operations
Neuroscience @ Freeman Lab, Janelia Farm
Spark Streaming and MLlib
to analyze neural activities
Laser microscope scans zebrafish brain → Spark Streaming → interactive visualization → laser ZAP to kill neurons!
http://www.jeremyfreeman.net/share/talks/spark-summit-2014/
Neuroscience @ Freeman Lab, Janelia Farm
Streaming machine learning
algorithms on time series
data of every neuron
Up to 2 TB/hour and
increasing with brain size
Up to 80 HPC nodes
http://www.jeremyfreeman.net/share/talks/spark-summit-2014/
Streaming Machine Learning
Algos
> Streaming Linear Regression
> Streaming Logistic Regression
> Streaming KMeans
http://www.jeremyfreeman.net/share/talks/spark-summit-east-2015/#/algorithms-repeat
Okay okay, how do I start off?
> Online Streaming
Programming Guide
http://spark.apache.org/docs/latest/streaming-programming-guide.html
> Streaming examples
https://github.com/apache/spark/tree/master/examples/src/main/scala/org/apache/spark/examples/streaming