
ECS640U/ECS765P Big Data Processing

Stream Processing I
Lecturer: Ahmed M. A. Sayed
School of Electronic Engineering and Computer Science
Weeks 9-11: Processing

[Figure: Big Data pipeline: Data Sources → Ingestion → Storage → Processing → Output]

In the next two weeks (9 and 10), we focus on Stream Processing


Streaming applications

• High frequency trading


• Trend detection: News, topics, tweets
• Systems management: detection of system failures
• User behaviour monitoring
• Service usage billing
• Goods tracking
• Cybersecurity, DDoS detection
Streaming data vs stream processing

It is important to clearly distinguish between two related streaming concepts:

• Streaming data is a term used to describe data generated continuously by many data sources, typically
in small sizes. Examples include log files generated by customers using a mobile application, social
network activity or e-commerce purchases.

• Stream processing describes a processing mode where individual records, or small sets of records, are
processed continuously as they arrive, producing a simple response for each.
Streaming data vs stream processing

Distinguishing between streaming data and stream processing is important because:

• In addition to stream processing, other processing modes can be applied to streaming data. For
instance, streaming data can be stored and then analysed by complex analytics based on batch
processing.

• Stream processing solutions can also be applied to batch data by creating streams from it.

• There exist tools for ingesting streaming data and tools for stream processing. This distinction is
important for understanding the role of each tool.
Streaming

[Figure: Big Data pipeline: Data Sources → Ingestion → Storage → Processing → Output]

Even though this week’s focus is on stream processing, at times we will also consider other stages in
the Big Data pipeline.
Stream Processing 1
Topic List:

● Generating, ingesting and processing streaming data


● Stream processing model
● Large-scale stream processing
Back in the Lab of week 5 (Twitter Dataset)

• A large dataset containing tweets collected during the Olympics from the Twitter Streaming API was used
• We used the Olympics dataset to compute different metrics describing the tweets, such as length and
number of hashtags per tweet

• What is the difference between the collected dataset and the Twitter Streaming API? The collected
dataset is streaming data that has been stored, which we then analysed with batch processing; the
Streaming API delivers the same data continuously, which calls for stream processing.
• Would a processing solution be the same for data stored in a local dataset on hard disks versus
data coming from the Twitter Streaming API? No: both originate as streaming data, but the
processing solutions differ.
Bounded and unbounded data

In streaming systems, it is useful to distinguish between bounded and unbounded data:

• Bounded data is used to describe datasets that are finite in size. Batch processing systems such as
Hadoop or Spark have been designed with bounded data in mind.

• Unbounded data describes datasets that are, at least theoretically, infinite in size. New data can arrive
and be made available at any point in time. Streaming systems are designed with unbounded data in
mind. Examples of streaming systems include Storm, Spark Streaming and Flink.
The role of time in streaming

In streaming systems, new pieces of data are made available at a point in time. In some cases, time itself
plays an important role:

• In problems where we want to identify temporal patterns, the actual time when new data arrives needs
to be considered during processing.

• In some scenarios, failing to produce a processing result within a time window is as bad as not producing
a result at all. In these scenarios we talk about real-time systems.
Streaming data sources

[Figure: Big Data pipeline: Data Sources → Ingestion → Storage → Processing → Output]

Streaming data sources are unbounded, as theoretically there is no limit to the amount of data that they
can generate. Examples include:

• Messages from social platforms (e.g. Twitter)


• Internet traffic going through a network device such as a switch
• Readings from an IoT device
• Interactions of users with a web application
Stream ingestion

[Figure: Big Data pipeline: Data Sources → Ingestion → Storage → Processing → Output]

Data generated by unbounded sources needs to be ingested so that it can be made available for storage or
further processing. There exist multiple solutions for ingesting unbounded data, including the following
(a minimal ingestion sketch follows the list):

Generic Frameworks:
• Apache Kafka, Apache Flume (https://kafka.apache.org)
Custom Frameworks (cloud offerings):
• Amazon Kinesis Firehose, AWS IoT Events (https://aws.amazon.com/kinesis/data-firehose/)
• Azure Event Hub, IoT Hub (https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-about )
• Google Pub/Sub (https://cloud.google.com/pubsub/docs/overview)
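
To make the ingestion step concrete, below is a minimal sketch of publishing events to Apache Kafka with
its Java producer client. The broker address, the topic name "app-logs" and the record contents are
made-up examples, not part of the lecture material:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LogEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // Kafka broker to connect to
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        // Each send() appends one immutable record to the "app-logs" topic, from
        // which storage systems or stream processors can later consume it.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("app-logs", "user-42", "clicked checkout"));
        }
    }
}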
Stream processing

[Figure: Big Data pipeline: Data Sources → Ingestion → Storage → Processing → Output]

Unbounded data that has been ingested can be processed by streaming systems.
Note that streaming systems can generate unbounded data as an output and hence constitute data sources
themselves. Examples of stream processing solutions include:

Generic Frameworks:
• Apache Storm, Spark Streaming, Kafka, Flink (https://storm.apache.org)
Custom Frameworks (cloud offerings):
• Amazon Kinesis Streams, AWS Lambda (https://aws.amazon.com/lambda/)
• Azure Stream Analytics, Azure Functions (https://azure.microsoft.com/en-us/services/stream-analytics/)
• Google Dataflow (https://cloud.google.com/dataflow)
Stream Processing 1
Topic List:

● Generating, ingesting and processing streaming data


● Stream processing model
● Large-scale stream processing
Streams and tables
Streams and tables can be used to represent data:
• Streams can be seen as sequences of immutable records that arrive at some point in time.
They are also known as event streams, event logs and message queues.
Streams can be seen as a dataset in motion.
• Tables are collections of records.
They can be seen as datasets at rest.

Tables can be converted into streams and vice-versa. Converting a table into a stream requires ordering
its records into a sequence, either (1) by their order in the table or (2) using a timestamp stored in
the table. A small sketch of both conversions follows.
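
The following is a minimal sketch of the stream/table duality using plain Java collections (Java 17; the
Record type, keys and values are illustrative): a table is replayed as a stream by ordering its records
on a timestamp, and a stream is materialised back into a table by keeping the latest value per key.

import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StreamTableDuality {
    record Record(String key, String value, long timestamp) {}   // illustrative record type

    public static void main(String[] args) {
        List<Record> table = List.of(
            new Record("user1", "login", 2L),
            new Record("user2", "login", 1L),
            new Record("user1", "logout", 3L));

        // Table -> stream: order the records (here by timestamp) and emit them in sequence.
        table.stream()
             .sorted(Comparator.comparingLong(Record::timestamp))
             .forEach(r -> System.out.println("event: " + r));

        // Stream -> table: materialise the events as a dataset at rest,
        // keeping the latest value seen for each key.
        Map<String, String> materialised = new HashMap<>();
        table.stream()
             .sorted(Comparator.comparingLong(Record::timestamp))
             .forEach(r -> materialised.put(r.key(), r.value()));
        System.out.println("table: " + materialised);
    }
}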
MapReduce processing flow: Reminder

[Figure: Input Data (lines) → Map → k/v pairs → shuffle/group → k/[vs] → Reduce → k/v pairs → Output Data]
MapReduce processing flow: Reminder

from mrjob.job import MRJob
import re

WORD_REGEX = re.compile(r"[\w']+")   # definition assumed; not shown on the slide

class MRWordCount(MRJob):

    def mapper(self, _, line):
        # each input line -> k/v (key/value) pairs
        words = WORD_REGEX.findall(line)
        for word in words:
            yield (word.lower(), 1)

    def reducer(self, word, counts):
        # k/[vs] (key with grouped values) -> k/v
        yield (word, sum(counts))

if __name__ == '__main__':
    MRWordCount.run()
Map-only processing flow as tables and streams

[Figure: Input Data table → MapRead → stream → Map → stream → MapWrite → Output Data table]
A different view of MapReduce: Tables and streams

[Figure: Input Data table → MapRead → stream → Map → stream → MapWrite → table → RedRead → stream → Reduce → stream → RedWrite → Output Data table]
Traditional stream processing

• Designed for processing unbounded datasets


• Datasets are handled as streams

• Input sources are continuous generators of data


• Processing elements operate on stream events, one at a time
• The output from a stream processor is another stream
Traditional stream processing
• An initial event stream is created by an unbounded data source.
• The event stream is processed by a network of Processing Elements (PEs), each consisting of an input
queue, a computing element and an output queue (see the sketch below).

[Figure: macro view, a network of connected PEs; micro view, a single PE with its input queue, computing element and output queue]
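
A minimal sketch of a single PE using plain Java queues (illustrative only, not any particular
framework's API): the computing element repeatedly takes one event from the input queue, applies a
stateless transform, and puts the result on the output queue that feeds the next PE.

import java.util.concurrent.BlockingQueue;

public class ProcessingElement implements Runnable {
    private final BlockingQueue<String> inputQueue;    // events arriving from upstream
    private final BlockingQueue<String> outputQueue;   // results flowing downstream

    public ProcessingElement(BlockingQueue<String> in, BlockingQueue<String> out) {
        this.inputQueue = in;
        this.outputQueue = out;
    }

    @Override
    public void run() {
        try {
            while (true) {
                String event = inputQueue.take();        // block until an event arrives
                outputQueue.put(event.toUpperCase());    // process one event at a time
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();          // allow clean shutdown
        }
    }
}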
Stream Processing 1
Topic List:

● Generating, ingesting and processing streaming data


● Stream processing model
● Large-scale stream processing

Quiz and Break


Apache Storm

Apache Storm was developed by BackType (acquired by Twitter) and was donated to the Apache
Foundation.

Storm provides real-time computation of Big Data streams:


• Scalable
• Robust and fault-tolerant
• Guarantees no data loss (at-least-once processing)
• High throughput
• Programming language agnostic

https://storm.apache.org/releases/current/Rationale.html
Apache Storm: Concepts

Storm defines the following basic concepts:

• Tuple: Basic unit of data that can be processed by a Storm application.


It consists of a pre-defined list of fields
• Stream: Unbounded stream of tuples
• Spouts: Elements that generate streams from external sources ~ ingestion units
• Bolts: Processing elements that consume and generate streams.
Apply transforms, produce aggregations, join or split streams, …
• Topology: Flow of spouts, streams and bolts
Apache Storm: Streams (the line)

• Tuple: Basic unit of data that can be processed by a storm application. It


consists of a pre-defined list of fields

• Stream: Unbounded stream of tuples

[Figure: a stream as a time-ordered sequence of tuples]
Apache Storm: Spouts

Spouts: Elements that generate streams from external sources


[Figure: a spout pulling data from an external source and emitting tuples on several output streams]
Apache Storm: Bolts
Bolts: Processing elements that consume and generate streams.
Apply transforms, produce aggregations, join or split streams, ...

[Figure: a bolt consuming tuples from one or more input streams and emitting tuples on an output stream]
Apache Storm: Topology
Topology: Flow of spouts, streams and bolts
Apache Storm: Website click analysis
Storm streaming word count - Spout

Note: in the exam you will not be asked to write Java-based Storm code, but you need to understand
the code and its functionality. The slide's code is not reproduced in this text; a sketch follows
the links below.

https://storm.apache.org/releases/current/Concepts.html
https://storm.apache.org/releases/current/javadocs/org/apache/storm/topology/base/BaseRichSpout.html
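
Since the slide's spout code is missing here, the following is a minimal word-count spout sketch written
against the BaseRichSpout API linked above (assuming Storm 2.x; the class name, sample sentences and
field name are illustrative, not from the lecture):

import java.util.Map;
import java.util.Random;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class RandomSentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private Random rand;
    private static final String[] SENTENCES = {
        "the cow jumped over the moon",
        "an apple a day keeps the doctor away"
    };

    @Override
    public void open(Map<String, Object> conf, TopologyContext context,
                     SpoutOutputCollector collector) {
        this.collector = collector;   // handle used to emit tuples into the stream
        this.rand = new Random();
    }

    @Override
    public void nextTuple() {
        Utils.sleep(100);   // avoid busy-spinning; Storm calls this method in a loop
        // emit one single-field tuple per call
        collector.emit(new Values(SENTENCES[rand.nextInt(SENTENCES.length)]));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence"));   // name the tuple's single field
    }
}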
Storm streaming word count - Bolts

https://storm.apache.org/releases/current/javadocs/org/apache/storm/topology/base/BaseBasicBolt.html
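
The slide's bolt code is likewise missing from this text. Below is a minimal sketch of the two word-count
bolts against the BaseBasicBolt API linked above (class and field names are illustrative):

import java.util.HashMap;
import java.util.Map;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Split bolt: consumes sentence tuples, emits one tuple per word (stateless).
public class SplitSentenceBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        for (String word : tuple.getStringByField("sentence").split(" ")) {
            collector.emit(new Values(word));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}

// Count bolt: keeps per-word state and emits a running count (stateful).
class WordCountBolt extends BaseBasicBolt {
    private final Map<String, Integer> counts = new HashMap<>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getStringByField("word");
        int count = counts.merge(word, 1, Integer::sum);   // increment this word's count
        collector.emit(new Values(word, count));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}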
Storm streaming word count – Topology

The topology-building code (not reproduced here) wires the spout and bolts together and sets three
parallelism knobs: the number of executor threads (the parallelism hint), the number of tasks, and the
number of JVM worker processes. The example on the next slide shows all three.

Storm code is not examinable.

https://storm.apache.org/releases/current/javadocs/org/apache/storm/topology/TopologyBuilder.html

https://storm.apache.org/releases/2.2.0/Understanding-the-parallelism-of-a-Storm-topology.html
Understanding Parallelism

Config conf = new Config();
conf.setNumWorkers(2);                          // use two worker processes (JVMs)

topologyBuilder.setSpout("blue-spout",
    new BlueSpout(), 2);                        // parallelism hint: 2 executor threads

topologyBuilder.setBolt("green-bolt",
    new GreenBolt(), 2).setNumTasks(4)          // 4 tasks spread over the 2 executors
    .shuffleGrouping("blue-spout");

topologyBuilder.setBolt("yellow-bolt",
    new YellowBolt(), 6)                        // 6 executors, one task each
    .shuffleGrouping("green-bolt");
Bolts translated to mrjob job (pseudocode)

def parse_tweet_bolt(self, tuple):
    tweet = tuple[0]
    for word in tweet.split(" "):
        yield [word]

def word_count_bolt(self, tuple):
    word = tuple[0]
    # stateful: keep a running count for each word
    self.counts[word] = self.counts.get(word, 0) + 1
    yield [word, self.counts[word]]
Stream processing state

Traditional stream operators are stateless: the output depends directly on the input element.

Non-trivial operations require state (for instance, word_count_bolt above keeps a running count per word):

• Much more difficult to scale
• Risk of losing state because of failures
Storm scalability

Bolts and spouts are distributed in different nodes (workers) of the cluster

Multiple replicas of bolts and spouts are run:


• Tuples are grouped and sent to several replicas (hash-based partitions, random partitions)
• However, it is no longer possible to keep a global state

Why is the lack of a global state in Storm potentially a problem? Each replica holds only a partial,
local state, so independent replicas can produce inconsistent results. Grouping tuples by key, as
sketched below, keeps per-key state consistent within one replica.
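
A minimal sketch of the two grouping choices, reusing the hypothetical word-count classes from the
earlier sketches (not the lecture's own code):

import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class WordCountTopology {
    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentences", new RandomSentenceSpout(), 2);

        // shuffleGrouping: each tuple goes to a random replica of the bolt;
        // fine for the stateless split bolt.
        builder.setBolt("split", new SplitSentenceBolt(), 4)
               .shuffleGrouping("sentences");

        // fieldsGrouping: tuples are hash-partitioned on the "word" field, so all
        // tuples for a given word reach the same replica, keeping that replica's
        // per-word count consistent even without a global state.
        builder.setBolt("count", new WordCountBolt(), 4)
               .fieldsGrouping("split", new Fields("word"));
    }
}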
Storm scalability
Apache Storm topologies are inherently parallel and run across a cluster of machines.

Different parts of the topology can be scaled individually by tweaking their parallelism. The
"rebalance" command of the "storm" command line client can adjust the parallelism of
running topologies on the fly.

https://storm.apache.org/about/scalable.html
Stream Processing 1
Topic List:

● Generating, ingesting and processing streaming data


● Stream processing model
● Large-scale stream processing

Quiz and End
