ECS640U/ECS765P Big Data Processing
Stream Processing I
Lecturer: Ahmed M. A. Sayed
School of Electronic Engineering and Computer Science
Weeks 9-11: Processing
[Diagram: Big Data pipeline: Data Sources → Ingestion → Storage → Processing → Output]
In the next two weeks (9 and 10), we focus on Stream Processing
Streaming applications
• High frequency trading
• Trend detection: News, topics, tweets
• Systems management: detection of system failures
• User behaviour monitoring
• Service usage billing
• Goods tracking
• Cybersecurity, DDoS detection
Streaming data vs stream processing
It is important to clearly distinguish between two related streaming concepts:
• Streaming data is a term used to describe many data sources generating data continuously, typically
in small records. Examples include log files generated by customers using a mobile application, social
network activity, or e-commerce purchases.
• Stream processing describes a processing mode where individual records, or small sets of records, are
processed continuously as they arrive, each producing a simple response.
Streaming data vs stream processing
Distinguishing between streaming data and stream processing is important because:
• In addition to stream processing, other processing modes can be applied to streaming data. For
instance, streaming data can be stored and then analysed by complex analytics based on batch
processing.
• Stream processing solutions can also be applied to batch data by creating streams from it.
• There exist tools for ingesting streaming data and tools for stream processing. Understanding this
distinction is important to understand the role of each tool.
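The distinction can be made concrete with a small sketch: the same readings handled in batch mode (the whole, bounded dataset is available up front) versus stream mode (records arrive one at a time and we keep running state). This is illustrative Python only, not the API of any particular framework.

```python
def batch_average(records):
    # Batch processing: the full (bounded) dataset is available up front.
    return sum(records) / len(records)

def stream_average():
    # Stream processing: records arrive one at a time; we keep running
    # state (count, total) and emit an updated result per record.
    count, total = 0, 0
    while True:
        record = yield (total / count if count else None)
        count += 1
        total += record

readings = [10, 20, 30]
print(batch_average(readings))             # 20.0

avg = stream_average()
next(avg)                                  # prime the generator
results = [avg.send(r) for r in readings]  # [10.0, 15.0, 20.0]
```

Note that both modes end at the same answer here; the difference is that the streaming version had a usable (partial) result after every single record.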
[Diagram: Big Data pipeline: Streaming Data Sources → Ingestion → Storage → Processing → Output]
Even though this week’s focus is on stream processing, at times we will also consider other stages in
the Big Data pipeline.
Stream Processing 1
Topic List:
● Generating, ingesting and processing streaming data
● Stream processing model
● Large-scale stream processing
Back in the Lab of week 5 (Twitter Dataset)
• A large dataset containing tweets collected during the Olympics from the Twitter Streaming API was used.
• We used the Olympics dataset to compute different metrics describing the tweets, such as length and
number of hashtags per tweet.
• What is the difference between the collected dataset and the Twitter Streaming API?
• Would a processing solution be the same for data stored in a local dataset on hard disks versus
data coming from the Twitter Streaming API? No. Both originate from streaming data, but the stored
dataset was analysed with batch processing, while data arriving live from the API calls for stream
processing.
Bounded and unbounded data
In streaming systems, it is useful to distinguish between bounded and unbounded data:
• Bounded data is used to describe datasets that are finite in size. Batch processing systems such as
Hadoop or Spark have been designed with bounded data in mind.
• Unbounded data describes datasets that are, at least theoretically, infinite in size. New data can arrive
and be made available at any point in time. Streaming systems are designed with unbounded data in
mind. Examples of streaming systems include Storm, Spark Streaming and Flink.
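In code, the difference shows up as what you are allowed to ask of the data. A minimal sketch (illustrative only): a bounded dataset is a finite collection you can measure, while an unbounded source can only ever be consumed incrementally.

```python
import itertools
import random

def sensor_stream():
    # Unbounded source: yields readings forever; there is no "end of data".
    while True:
        yield random.random()

bounded = [0.1, 0.2, 0.3]      # bounded: finite, len() is meaningful
unbounded = sensor_stream()    # unbounded: can only be consumed incrementally

# A streaming system never sees "all" of an unbounded source; it can only
# ever take a prefix or window of it:
window = list(itertools.islice(unbounded, 5))
print(len(window))  # 5
```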
The role of time in streaming
In streaming systems, new pieces of data are made available at a point in time. In some cases, time itself
plays an important role:
• In problems where we want to identify temporal patterns, the actual time when new data arrives needs
to be considered during processing, as additional data alongside the record itself.
• In some scenarios, failing to produce a processing result within a time window is as bad as not producing
a result at all. In these scenarios we talk about real-time systems.
Streaming data sources
[Diagram: Big Data pipeline: Data Sources → Ingestion → Storage → Processing → Output]
Streaming data sources are unbounded, as theoretically there is no limit to the amount of data that they
can generate. Examples include:
• Messages from social platforms (e.g. Twitter)
• Internet traffic going through a network device such as a switch
• Readings from an IoT device
• Interactions of users with a web application
Stream ingestion
[Diagram: Big Data pipeline: Data Sources → Ingestion → Storage → Processing → Output]
Data generated by unbounded sources needs to be ingested so that it can be made available for storage or
further processing. There exist multiple solutions for ingesting unbounded data, including:
Generic Frameworks:
• Apache Kafka, Apache Flume (https://kafka.apache.org)
Custom Frameworks (cloud offerings):
• Amazon Kinesis Firehose, AWS IoT Events (https://aws.amazon.com/kinesis/data-firehose/)
• Azure Event Hub, IoT Hub (https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-about )
• Google Pub/Sub (https://cloud.google.com/pubsub/docs/overview)
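None of these brokers' real APIs are shown here; as a minimal sketch of the job an ingestion layer does, a thread-safe buffer can decouple a producer of events from an independent consumer. (Real systems like Kafka add partitioning, persistence and replication on top of this idea; the sentinel value below is purely a demo convention.)

```python
from queue import Queue
from threading import Thread

# Toy stand-in for an ingestion layer: producers append events to a buffer,
# consumers read them at their own pace.
buffer = Queue()

def producer(events):
    for e in events:
        buffer.put(e)
    buffer.put(None)  # sentinel: no more events (demo convention only)

def consumer(out):
    while True:
        e = buffer.get()
        if e is None:
            break
        out.append(e)

received = []
t1 = Thread(target=producer, args=(["click:1", "click:2", "click:3"],))
t2 = Thread(target=consumer, args=(received,))
t1.start(); t2.start(); t1.join(); t2.join()
print(received)  # ['click:1', 'click:2', 'click:3']
```

The key property is that the producer never waits for the consumer to process anything: ingestion and processing are decoupled.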
Stream processing
[Diagram: Big Data pipeline: Data Sources → Ingestion → Storage → Processing → Output]
Unbounded data that has been ingested can be processed by streaming systems.
Note that streaming systems can generate unbounded data as an output and hence constitute data sources
themselves. Examples of stream processing solutions include:
Generic Frameworks:
• Apache Storm, Spark Streaming, Kafka Streams, Flink (https://storm.apache.org)
Custom Frameworks (cloud offerings):
• Amazon Kinesis Streams, AWS Lambda (https://aws.amazon.com/lambda/)
• Azure Stream Analytics, Azure Functions (https://azure.microsoft.com/en-us/services/stream-analytics/)
• Google Dataflow (https://cloud.google.com/dataflow)
Stream Processing 1
Topic List:
● Generating, ingesting and processing streaming data
● Stream processing model
● Large-scale stream processing
Streams and tables
Streams and tables can be used to represent data:
• Streams can be seen as sequences of immutable records that arrive at some point in time.
They are also known as event streams, event logs and message queues.
Streams can be seen as a dataset in motion.
• Tables are collections of records.
They can be seen as datasets at rest.
Tables can be converted into streams and vice-versa. To turn a table into a stream, the records must be
given an order, e.g. (1) by their order in the table or (2) using a timestamp stored in the table.
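This stream/table duality can be sketched in a few lines of Python (illustrative names and data, no particular framework assumed): replaying a stream of immutable events materialises a table, and emitting a table's rows in timestamp order turns it back into a stream.

```python
# Stream -> table: replaying a changelog of immutable events
# materialises the current state.
events = [
    ("alice", "login"), ("bob", "login"), ("alice", "logout"),
]

def materialise(stream):
    table = {}
    for key, value in stream:   # apply events in arrival order
        table[key] = value      # later events overwrite earlier state
    return table

# Table -> stream: emitting each row, here ordered by a timestamp
# column stored in the table, yields a sequence of events again.
table = {"alice": ("logout", 3), "bob": ("login", 2)}

def to_stream(table):
    rows = [(ts, key, value) for key, (value, ts) in table.items()]
    for ts, key, value in sorted(rows):  # order by the stored timestamp
        yield (key, value)

print(materialise(events))     # {'alice': 'logout', 'bob': 'login'}
print(list(to_stream(table)))  # [('bob', 'login'), ('alice', 'logout')]
```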
MapReduce processing flow: Reminder
[Diagram: Input Data (lines) → Map (emits k/v pairs) → shuffle/sort (groups into k/[vs]) → Reduce (emits k/v) → Output Data]
MapReduce processing flow: Reminder
def mapper(self, _, line):
    words = WORD_REGEX.findall(line)
    for word in words:
        yield (word.lower(), 1)    # emits k/v (key/value) pairs

def reducer(self, word, counts):
    yield (word, sum(counts))      # receives k/[vs]: a key with its list of values
Map-only processing flow as tables and streams
[Diagram: Input Data (table) → MapRead → stream of lines → Map → stream of k/v pairs → MapWrite → Output Data (table)]
A different view of MapReduce: Tables and streams
[Diagram: Input Data (table) → MapRead → stream of lines → Map → stream of k/v pairs → MapWrite → intermediate table → RedRead → stream of k/[vs] → Reduce → stream of k/v pairs → RedWrite → Output Data (table)]
Traditional stream processing
• Designed for processing unbounded datasets
• Datasets are handled as streams
• Input sources are a continuous generator of data
• Processing elements operate on stream events, one at a time
• The output from a stream processor is another stream
Traditional stream processing
• An initial event stream is created by an unbounded data source.
• The event stream is processed by a network of Processing Elements (PE) consisting of input queues,
computing elements and an output queue.
[Diagram: macro view shows a network of connected PEs; micro view of a single PE: Input Queue → Computing Element → Output Queue]
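The micro view of a PE can be sketched directly with queues (illustrative Python, single-threaded for clarity; the `None` sentinel is a demo convention, since a real PE would run indefinitely on an unbounded stream):

```python
from queue import Queue

# Toy Processing Element: input queue -> computing element -> output queue.
def run_pe(in_q, compute, out_q):
    while True:
        event = in_q.get()
        if event is None:          # sentinel: end of stream (demo only)
            out_q.put(None)
            return
        out_q.put(compute(event))  # operate on one event at a time

in_q, out_q = Queue(), Queue()
for e in [1, 2, 3, None]:
    in_q.put(e)

run_pe(in_q, lambda x: x * x, out_q)   # this PE squares each event

results = []
while (e := out_q.get()) is not None:
    results.append(e)
print(results)  # [1, 4, 9]
```

Because the output is itself a queue of events, PEs compose: one PE's output queue can serve as the next PE's input queue, giving the network of PEs from the macro view.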
Stream Processing 1
Topic List:
● Generating, ingesting and processing streaming data
● Stream processing model
● Large-scale stream processing
Quiz and Break
Apache Storm
Apache Storm was developed by BackType (acquired by Twitter) and was donated to the Apache
Foundation.
Storm provides real-time computation of Big Data streams:
• Scalable
• Robust and fault-tolerant
• Guarantees no data loss (at-least-once processing)
• High throughput
• Programming language agnostic
https://storm.apache.org/releases/current/Rationale.html
Apache Storm: Concepts
Storm defines the following basic concepts:
• Tuple: Basic unit of data that can be processed by a storm application.
It consists of a pre-defined list of fields
• Stream: Unbounded stream of tuples
• Spouts: Elements that generate streams from external sources ~ ingestion units
• Bolts: Processing elements that consume and generate streams.
Apply transforms, produce aggregations, join or split streams, …
• Topology: Flow of spouts, streams and bolts
Apache Storm: Streams
• Tuple: Basic unit of data that can be processed by a Storm application. It
consists of a pre-defined list of fields (it plays the role the input "line" played
in the MapReduce examples).
• Stream: Unbounded sequence of tuples
[Diagram: a sequence of tuples arriving along a time axis]
Apache Storm: Spouts
Spouts: Elements that generate streams from external sources
[Diagram: a Spout emitting an unbounded stream of tuples]
Apache Storm: Bolts
Bolts: Processing elements that consume and generate streams.
Apply transforms, produce aggregations, join or split streams, ...
[Diagram: a Bolt consuming an incoming stream of tuples and emitting a new stream]
Apache Storm: Topology
Topology: Flow of spouts, streams and bolts
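The classic Storm word-count topology can be sketched in plain Python to show how the pieces fit together. The class and method names below are illustrative, not Storm's actual Java API: a spout emits sentence tuples, a split bolt turns them into word tuples, and a count bolt keeps a running count.

```python
class SentenceSpout:
    def __init__(self, sentences):
        self.sentences = sentences
    def emit(self):
        yield from self.sentences          # one tuple per sentence

class SplitBolt:
    def process(self, sentence):
        for word in sentence.split():
            yield word                     # one tuple per word

class CountBolt:
    def __init__(self):
        self.counts = {}                   # stateful: running counts
    def process(self, word):
        self.counts[word] = self.counts.get(word, 0) + 1
        yield (word, self.counts[word])

# "Topology": wire spout -> split bolt -> count bolt and drive the stream.
spout = SentenceSpout(["to be or not to be"])
split, count = SplitBolt(), CountBolt()
emitted = [t for s in spout.emit()
             for w in split.process(s)
             for t in count.process(w)]
print(emitted[-1])   # ('be', 2)
```

Note that the count bolt's output is itself a stream of tuples, which a further bolt (or an output sink) could consume, exactly as in a real topology.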
Apache Storm: Website click analysis
Storm streaming word count - Spout
Note: in the exam you will not be asked to write Java code for Storm,
but you do need to understand the code and its functionality.
https://storm.apache.org/releases/current/Concepts.html
https://storm.apache.org/releases/current/javadocs/org/apache/storm/topology/base/BaseRichSpout.html
Storm streaming word count - Bolts
https://storm.apache.org/releases/current/javadocs/org/apache/storm/topology/base/BaseBasicBolt.html
Storm streaming word count – Topology
The topology code wires the spout and bolts together. Three settings control parallelism:
the parallelism hint (# of executor threads), setNumTasks (# of tasks) and
setNumWorkers (# of JVM workers). Storm code is not examinable.
https://storm.apache.org/releases/current/javadocs/org/apache/storm/topology/TopologyBuilder.html
https://storm.apache.org/releases/2.2.0/Understanding-the-parallelism-of-a-Storm-topology.html
Understanding Parallelism

Config conf = new Config();
conf.setNumWorkers(2); // use two worker processes

topologyBuilder.setSpout("blue-spout", new BlueSpout(), 2); // parallelism hint of 2 executors

topologyBuilder.setBolt("green-bolt", new GreenBolt(), 2)
               .setNumTasks(4)                // 4 tasks spread over 2 executors
               .shuffleGrouping("blue-spout");

topologyBuilder.setBolt("yellow-bolt", new YellowBolt(), 6)
               .shuffleGrouping("green-bolt");

With these settings the topology runs 2 worker processes and 2 + 2 + 6 = 10 executors,
i.e. 5 executors per worker; the green bolt's 4 tasks are shared by its 2 executors.
Bolts translated to mrjob job (pseudocode)
def parse_tweet_bolt(self, tuple):
    tweet = tuple[0]
    for word in tweet.split(" "):
        yield [word]

def word_count_bolt(self, tuple):
    word = tuple[0]
    self.counts[word] = self.counts.get(word, 0) + 1   # stateful: running count
    yield [word, self.counts[word]]
Stream processing state
Traditional stream operators are stateless: the output depends only on the current input element.
Non-trivial operations require state:
• Much more difficult to scale
• Risk of losing state because of failures
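A minimal sketch of the difference (illustrative Python): a stateless operator can be replicated and restarted freely, because its output depends only on the event in hand; a stateful operator carries memory of everything seen so far, which must somehow survive failures and be split across replicas.

```python
# Stateless operator: output depends only on the current event, so any
# replica can process any event, and a restart loses nothing.
def to_upper(event):
    return event.upper()

# Stateful operator: output depends on all events seen so far, so the
# state (here, the `seen` set) is what scaling and fault tolerance
# must protect.
class Deduplicate:
    def __init__(self):
        self.seen = set()
    def process(self, event):
        if event in self.seen:
            return None          # drop duplicate
        self.seen.add(event)
        return event

dedup = Deduplicate()
out = [dedup.process(e) for e in ["a", "b", "a"]]
print(out)               # ['a', 'b', None]
print(to_upper("storm")) # 'STORM'
```

If the `Deduplicate` instance crashed and restarted mid-stream, its empty `seen` set would let duplicates through, which is exactly the "risk of losing state because of failures" above.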
Storm scalability
Bolts and spouts are distributed across different nodes (workers) of the cluster.
Multiple replicas of bolts and spouts are run:
• Tuples are grouped and distributed among the replicas (hash-based partitions, random partitions)
• However, it is no longer possible to keep a global state
Why is the lack of a global state in Storm potentially a problem? Each replica only sees its own share
of the tuples, so operations that need state across the whole stream can produce inconsistent results.
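Hash-based grouping is what keeps per-key state consistent despite the lack of global state: every tuple with the same key is routed to the same replica, so that replica's local state is complete for its keys. A toy sketch (illustrative only; Storm exposes this idea as its "fields grouping"):

```python
# Each replica keeps only local per-word counts; routing by hash of the
# key guarantees all occurrences of a word land on the same replica.
NUM_REPLICAS = 3
replicas = [dict() for _ in range(NUM_REPLICAS)]

def route(word):
    return hash(word) % NUM_REPLICAS   # hash-based partitioning

for word in ["cat", "dog", "cat", "cat", "dog"]:
    state = replicas[route(word)]
    state[word] = state.get(word, 0) + 1

# Every "cat" landed on the same replica, so its local count is correct:
print(replicas[route("cat")]["cat"])  # 3
print(replicas[route("dog")]["dog"])  # 2
```

Had the tuples been routed randomly (shuffle grouping) instead, occurrences of "cat" would be spread over several replicas, and no single replica could report the true count: the inconsistent-results problem noted above.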
Storm scalability
Apache Storm topologies are inherently parallel and run across a cluster of machines.
Different parts of the topology can be scaled individually by tweaking their parallelism. The
"rebalance" command of the "storm" command line client can adjust the parallelism of
running topologies on the fly.
https://storm.apache.org/about/scalable.html
Stream Processing 1
Topic List:
● Generating, ingesting and processing streaming data
● Stream processing model
● Large-scale stream processing
Quiz and End