ECS640U/ECS765P Big Data Processing
Stream Processing I
Lecturer: Ahmed M. A. Sayed
School of Electronic Engineering and Computer Science
Weeks 9-11: Processing
[Diagram: Big Data pipeline: Data Sources → Ingestion → Storage → Processing → Output]
In the next two weeks (9 and 10), we focus on Stream Processing
Streaming applications
• High frequency trading
• Trend detection: News, topics, tweets
• Systems management: detection of system failures
• User behaviour monitoring
• Service usage billing
• Goods tracking
• Cybersecurity, DDoS detection
Streaming data vs stream processing
It is important to clearly distinguish between two related streaming concepts:
• Streaming data is a term used to describe many data sources generating data continuously, typically
in small records. Examples include log files generated by customers using a mobile application, social
network activity, or e-commerce purchases.
• Stream processing describes a processing mode where individual records, or small sets of records, are
processed continuously as they arrive, each producing a simple response.
Streaming data vs stream processing
Distinguishing between streaming data and stream processing is important because:
• In addition to stream processing, other processing modes can be applied to streaming data. For
instance, streaming data can be stored and then analysed by complex analytics based on batch
processing.
• Stream processing solutions can also be applied to batch data by creating streams from it.
• There exist tools for ingesting streaming data and tools for stream processing. Understanding this
distinction is important to understand the role of each tool.
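The distinction can be made concrete with a small sketch: the same readings handled in batch mode (the whole, bounded dataset is available up front) versus stream mode (records arrive one at a time and we keep running state). This is illustrative Python only, not the API of any particular framework.

```python
def batch_average(records):
    # Batch processing: the full (bounded) dataset is available up front.
    return sum(records) / len(records)

def stream_average():
    # Stream processing: records arrive one at a time; we keep running
    # state (count, total) and emit an updated result per record.
    count, total = 0, 0
    while True:
        record = yield (total / count if count else None)
        count += 1
        total += record

readings = [10, 20, 30]
print(batch_average(readings))             # 20.0

avg = stream_average()
next(avg)                                  # prime the generator
results = [avg.send(r) for r in readings]  # [10.0, 15.0, 20.0]
```

Note that both modes end at the same answer here; the difference is that the streaming version had a usable (partial) result after every single record.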
[Diagram: Big Data pipeline: Streaming Data Sources → Ingestion → Storage → Processing → Output]
Even though this week’s focus is on stream processing, at times we will also consider other stages in
the Big Data pipeline.
Stream Processing 1
Topic List:
● Generating, ingesting and processing streaming data
● Stream processing model
● Large-scale stream processing
Back in the Lab of week 5 (Twitter Dataset)
• A large dataset containing tweets collected during the Olympics from the Twitter Streaming API was used.
• We used the Olympics dataset to compute different metrics describing the tweets, such as length and
number of hashtags per tweet.
• What is the difference between the collected dataset and the Twitter Streaming API?
• Would a processing solution be the same for data stored in a local dataset on hard disks versus
data coming from the Twitter Streaming API? No. Both originate from streaming data, but the stored
dataset was analysed with batch processing, while data arriving live from the API calls for stream
processing.
Bounded and unbounded data
In streaming systems, it is useful to distinguish between bounded and unbounded data:
• Bounded data is used to describe datasets that are finite in size. Batch processing systems such as
Hadoop or Spark have been designed with bounded data in mind.
• Unbounded data describes datasets that are, at least theoretically, infinite in size. New data can arrive
and be made available at any point in time. Streaming systems are designed with unbounded data in
mind. Examples of streaming systems include Storm, Spark Streaming and Flink.
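In code, the difference shows up as what you are allowed to ask of the data. A minimal sketch (illustrative only): a bounded dataset is a finite collection you can measure, while an unbounded source can only ever be consumed incrementally.

```python
import itertools
import random

def sensor_stream():
    # Unbounded source: yields readings forever; there is no "end of data".
    while True:
        yield random.random()

bounded = [0.1, 0.2, 0.3]      # bounded: finite, len() is meaningful
unbounded = sensor_stream()    # unbounded: can only be consumed incrementally

# A streaming system never sees "all" of an unbounded source; it can only
# ever take a prefix or window of it:
window = list(itertools.islice(unbounded, 5))
print(len(window))  # 5
```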
The role of time in streaming
In streaming systems, new pieces of data are made available at a point in time. In some cases, time itself
plays an important role:
• In problems where we want to identify temporal patterns, the actual time when new data arrives needs
to be considered during processing, as additional data alongside the record itself.
• In some scenarios, failing to produce a processing result within a time window is as bad as not producing
a result at all. In these scenarios we talk about real-time systems.
Streaming data sources
[Diagram: Big Data pipeline: Data Sources → Ingestion → Storage → Processing → Output]
Streaming data sources are unbounded, as theoretically there is no limit to the amount of data that they
can generate. Examples include:
• Messages from social platforms (e.g. Twitter)
• Internet traffic going through a network device such as a switch
• Readings from an IoT device
• Interactions of users with a web application
Stream ingestion
[Diagram: Big Data pipeline: Data Sources → Ingestion → Storage → Processing → Output]
Data generated by unbounded sources needs to be ingested so that it can be made available for storage or
further processing. There exist multiple solutions for ingesting unbounded data, including:
Generic Frameworks:
• Apache Kafka, Apache Flume (https://kafka.apache.org)
Custom Frameworks (cloud offerings):
• Amazon Kinesis Firehose, AWS IoT Events (https://aws.amazon.com/kinesis/data-firehose/)
• Azure Event Hub, IoT Hub (https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-about )
• Google Pub/Sub (https://cloud.google.com/pubsub/docs/overview)
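None of these brokers' real APIs are shown here; as a minimal sketch of the job an ingestion layer does, a thread-safe buffer can decouple a producer of events from an independent consumer. (Real systems like Kafka add partitioning, persistence and replication on top of this idea; the sentinel value below is purely a demo convention.)

```python
from queue import Queue
from threading import Thread

# Toy stand-in for an ingestion layer: producers append events to a buffer,
# consumers read them at their own pace.
buffer = Queue()

def producer(events):
    for e in events:
        buffer.put(e)
    buffer.put(None)  # sentinel: no more events (demo convention only)

def consumer(out):
    while True:
        e = buffer.get()
        if e is None:
            break
        out.append(e)

received = []
t1 = Thread(target=producer, args=(["click:1", "click:2", "click:3"],))
t2 = Thread(target=consumer, args=(received,))
t1.start(); t2.start(); t1.join(); t2.join()
print(received)  # ['click:1', 'click:2', 'click:3']
```

The key property is that the producer never waits for the consumer to process anything: ingestion and processing are decoupled.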
Stream processing
[Diagram: Big Data pipeline: Data Sources → Ingestion → Storage → Processing → Output]
Unbounded data that has been ingested can be processed by streaming systems.
Note that streaming systems can generate unbounded data as an output and hence constitute data sources
themselves. Examples of stream processing solutions include:
Generic Frameworks:
• Apache Storm, Spark Streaming, Kafka Streams, Flink (https://storm.apache.org)
Custom Frameworks (cloud offerings):
• Amazon Kinesis Streams, AWS Lambda (https://aws.amazon.com/lambda/)
• Azure Stream Analytics, Azure Functions (https://azure.microsoft.com/en-us/services/stream-analytics/)
• Google Dataflow (https://cloud.google.com/dataflow)
Stream Processing 1
Topic List:
● Generating, ingesting and processing streaming data
● Stream processing model
● Large-scale stream processing
Streams and tables
Streams and tables can be used to represent data:
• Streams can be seen as sequences of immutable records that arrive at some point in time.
They are also known as event streams, event logs and message queues.
Streams can be seen as a dataset in motion.
• Tables are collections of records.
They can be seen as datasets at rest.
Tables can be converted into streams and vice-versa. To turn a table into a stream, the records must be
given an order, e.g. (1) by their order in the table or (2) using a timestamp stored in the table.
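This stream/table duality can be sketched in a few lines of Python (illustrative names and data, no particular framework assumed): replaying a stream of immutable events materialises a table, and emitting a table's rows in timestamp order turns it back into a stream.

```python
# Stream -> table: replaying a changelog of immutable events
# materialises the current state.
events = [
    ("alice", "login"), ("bob", "login"), ("alice", "logout"),
]

def materialise(stream):
    table = {}
    for key, value in stream:   # apply events in arrival order
        table[key] = value      # later events overwrite earlier state
    return table

# Table -> stream: emitting each row, here ordered by a timestamp
# column stored in the table, yields a sequence of events again.
table = {"alice": ("logout", 3), "bob": ("login", 2)}

def to_stream(table):
    rows = [(ts, key, value) for key, (value, ts) in table.items()]
    for ts, key, value in sorted(rows):  # order by the stored timestamp
        yield (key, value)

print(materialise(events))     # {'alice': 'logout', 'bob': 'login'}
print(list(to_stream(table)))  # [('bob', 'login'), ('alice', 'logout')]
```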
MapReduce processing flow: Reminder
[Diagram: Input Data (lines) → Map (emits k/v pairs) → shuffle/sort (groups into k/[vs]) → Reduce (emits k/v) → Output Data]
MapReduce processing flow: Reminder
def mapper(self, _, line):
    words = WORD_REGEX.findall(line)
    for word in words:
        yield (word.lower(), 1)    # emits k/v (key/value) pairs

def reducer(self, word, counts):
    yield (word, sum(counts))      # receives k/[vs]: a key with its list of values
Map-only processing flow as tables and streams
[Diagram: Input Data (table) → MapRead → stream of lines → Map → stream of k/v pairs → MapWrite → Output Data (table)]
A different view of MapReduce: Tables and streams
[Diagram: Input Data (table) → MapRead → stream of lines → Map → stream of k/v pairs → MapWrite → intermediate table → RedRead → stream of k/[vs] → Reduce → stream of k/v pairs → RedWrite → Output Data (table)]
Traditional stream processing
• Designed for processing unbounded datasets
• Datasets are handled as streams
• Input sources are a continuous generator of data
• Processing elements operate on stream events, one at a time
• The output from a stream processor is another stream
Traditional stream processing
• An initial event stream is created by an unbounded data source.
• The event stream is processed by a network of Processing Elements (PE) consisting of input queues,
computing elements and an output queue.
[Diagram: macro view shows a network of connected PEs; micro view of a single PE: Input Queue → Computing Element → Output Queue]
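The micro view of a PE can be sketched directly with queues (illustrative Python, single-threaded for clarity; the `None` sentinel is a demo convention, since a real PE would run indefinitely on an unbounded stream):

```python
from queue import Queue

# Toy Processing Element: input queue -> computing element -> output queue.
def run_pe(in_q, compute, out_q):
    while True:
        event = in_q.get()
        if event is None:          # sentinel: end of stream (demo only)
            out_q.put(None)
            return
        out_q.put(compute(event))  # operate on one event at a time

in_q, out_q = Queue(), Queue()
for e in [1, 2, 3, None]:
    in_q.put(e)

run_pe(in_q, lambda x: x * x, out_q)   # this PE squares each event

results = []
while (e := out_q.get()) is not None:
    results.append(e)
print(results)  # [1, 4, 9]
```

Because the output is itself a queue of events, PEs compose: one PE's output queue can serve as the next PE's input queue, giving the network of PEs from the macro view.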
Stream Processing 1
Topic List:
● Generating, ingesting and processing streaming data
● Stream processing model
● Large-scale stream processing
Quiz and Break
Apache Storm
Apache Storm was developed by BackType (acquired by Twitter) and was donated to the Apache
Foundation.
Storm provides real-time computation of Big Data streams:
• Scalable
• Robust and fault-tolerant
• Guarantees no data loss (at-least-once processing)
• High throughput
• Programming language agnostic
https://storm.apache.org/releases/current/Rationale.html
Apache Storm: Concepts
Storm defines the following basic concepts:
• Tuple: Basic unit of data that can be processed by a storm application.
It consists of a pre-defined list of fields
• Stream: Unbounded stream of tuples
• Spouts: Elements that generate streams from external sources ~ ingestion units
• Bolts: Processing elements that consume and generate streams.
Apply transforms, produce aggregations, join or split streams, …
• Topology: Flow of spouts, streams and bolts
Apache Storm: Streams
• Tuple: Basic unit of data that can be processed by a Storm application. It
consists of a pre-defined list of fields (it plays the role the input "line" played
in the MapReduce examples).
• Stream: Unbounded sequence of tuples
[Diagram: a sequence of tuples arriving along a time axis]
Apache Storm: Spouts
Spouts: Elements that generate streams from external sources
[Diagram: a Spout emitting an unbounded stream of tuples]
Apache Storm: Bolts
Bolts: Processing elements that consume and generate streams.
Apply transforms, produce aggregations, join or split streams, ...
[Diagram: a Bolt consuming an incoming stream of tuples and emitting a new stream]
Apache Storm: Topology
Topology: Flow of spouts, streams and bolts
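The classic Storm word-count topology can be sketched in plain Python to show how the pieces fit together. The class and method names below are illustrative, not Storm's actual Java API: a spout emits sentence tuples, a split bolt turns them into word tuples, and a count bolt keeps a running count.

```python
class SentenceSpout:
    def __init__(self, sentences):
        self.sentences = sentences
    def emit(self):
        yield from self.sentences          # one tuple per sentence

class SplitBolt:
    def process(self, sentence):
        for word in sentence.split():
            yield word                     # one tuple per word

class CountBolt:
    def __init__(self):
        self.counts = {}                   # stateful: running counts
    def process(self, word):
        self.counts[word] = self.counts.get(word, 0) + 1
        yield (word, self.counts[word])

# "Topology": wire spout -> split bolt -> count bolt and drive the stream.
spout = SentenceSpout(["to be or not to be"])
split, count = SplitBolt(), CountBolt()
emitted = [t for s in spout.emit()
             for w in split.process(s)
             for t in count.process(w)]
print(emitted[-1])   # ('be', 2)
```

Note that the count bolt's output is itself a stream of tuples, which a further bolt (or an output sink) could consume, exactly as in a real topology.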
Apache Storm: Website click analysis
Storm streaming word count - Spout
Note: in the exam you will not be asked to write Java code for Storm,
but you do need to understand the code and its functionality.
https://storm.apache.org/releases/current/Concepts.html
https://storm.apache.org/releases/current/javadocs/org/apache/storm/topology/base/BaseRichSpout.html
Storm streaming word count - Bolts
https://storm.apache.org/releases/current/javadocs/org/apache/storm/topology/base/BaseBasicBolt.html
Storm streaming word count – Topology
The topology code wires the spout and bolts together. Three settings control parallelism:
the parallelism hint (# of executor threads), setNumTasks (# of tasks) and
setNumWorkers (# of JVM workers). Storm code is not examinable.
https://storm.apache.org/releases/current/javadocs/org/apache/storm/topology/TopologyBuilder.html
https://storm.apache.org/releases/2.2.0/Understanding-the-parallelism-of-a-Storm-topology.html
Understanding Parallelism

Config conf = new Config();
conf.setNumWorkers(2); // use two worker processes

topologyBuilder.setSpout("blue-spout", new BlueSpout(), 2); // parallelism hint of 2 executors

topologyBuilder.setBolt("green-bolt", new GreenBolt(), 2)
               .setNumTasks(4)                // 4 tasks spread over 2 executors
               .shuffleGrouping("blue-spout");

topologyBuilder.setBolt("yellow-bolt", new YellowBolt(), 6)
               .shuffleGrouping("green-bolt");

With these settings the topology runs 2 worker processes and 2 + 2 + 6 = 10 executors,
i.e. 5 executors per worker; the green bolt's 4 tasks are shared by its 2 executors.
Bolts translated to mrjob job (pseudocode)
def parse_tweet_bolt(self, tuple):
    tweet = tuple[0]
    for word in tweet.split(" "):
        yield [word]

def word_count_bolt(self, tuple):
    word = tuple[0]
    self.counts[word] = self.counts.get(word, 0) + 1   # stateful: running count
    yield [word, self.counts[word]]
Stream processing state
Traditional stream operators are stateless: the output depends only on the current input element.
Non-trivial operations require state:
• Much more difficult to scale
• Risk of losing state because of failures
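A minimal sketch of the difference (illustrative Python): a stateless operator can be replicated and restarted freely, because its output depends only on the event in hand; a stateful operator carries memory of everything seen so far, which must somehow survive failures and be split across replicas.

```python
# Stateless operator: output depends only on the current event, so any
# replica can process any event, and a restart loses nothing.
def to_upper(event):
    return event.upper()

# Stateful operator: output depends on all events seen so far, so the
# state (here, the `seen` set) is what scaling and fault tolerance
# must protect.
class Deduplicate:
    def __init__(self):
        self.seen = set()
    def process(self, event):
        if event in self.seen:
            return None          # drop duplicate
        self.seen.add(event)
        return event

dedup = Deduplicate()
out = [dedup.process(e) for e in ["a", "b", "a"]]
print(out)               # ['a', 'b', None]
print(to_upper("storm")) # 'STORM'
```

If the `Deduplicate` instance crashed and restarted mid-stream, its empty `seen` set would let duplicates through, which is exactly the "risk of losing state because of failures" above.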
Storm scalability
Bolts and spouts are distributed across different nodes (workers) of the cluster.
Multiple replicas of bolts and spouts are run:
• Tuples are grouped and distributed among the replicas (hash-based partitions, random partitions)
• However, it is no longer possible to keep a global state
Why is the lack of a global state in Storm potentially a problem? Each replica only sees its own share
of the tuples, so operations that need state across the whole stream can produce inconsistent results.
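Hash-based grouping is what keeps per-key state consistent despite the lack of global state: every tuple with the same key is routed to the same replica, so that replica's local state is complete for its keys. A toy sketch (illustrative only; Storm exposes this idea as its "fields grouping"):

```python
# Each replica keeps only local per-word counts; routing by hash of the
# key guarantees all occurrences of a word land on the same replica.
NUM_REPLICAS = 3
replicas = [dict() for _ in range(NUM_REPLICAS)]

def route(word):
    return hash(word) % NUM_REPLICAS   # hash-based partitioning

for word in ["cat", "dog", "cat", "cat", "dog"]:
    state = replicas[route(word)]
    state[word] = state.get(word, 0) + 1

# Every "cat" landed on the same replica, so its local count is correct:
print(replicas[route("cat")]["cat"])  # 3
print(replicas[route("dog")]["dog"])  # 2
```

Had the tuples been routed randomly (shuffle grouping) instead, occurrences of "cat" would be spread over several replicas, and no single replica could report the true count: the inconsistent-results problem noted above.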
Storm scalability
Apache Storm topologies are inherently parallel and run across a cluster of machines.
Different parts of the topology can be scaled individually by tweaking their parallelism. The
"rebalance" command of the "storm" command line client can adjust the parallelism of
running topologies on the fly.
https://storm.apache.org/about/scalable.html
Stream Processing 1
Topic List:
● Generating, ingesting and processing streaming data
● Stream processing model
● Large-scale stream processing
Quiz and End