Lecture 07: Streaming
Data Streaming

What is a Stream?
• A sequence of data elements made available over time.
• It represents a continuous flow of data that may be generated, collected, or transmitted at various rates.
• A data stream is dynamic and may evolve over time.
What is a Stream?
• In streaming terminology, an event is generated once by a producer (also known as a publisher or sender), and then potentially processed by multiple consumers (subscribers or recipients).
• Related events are usually grouped together into a topic or stream.
• An event may be encoded as a text string, as JSON, or perhaps in some binary form → recall the Encoding lecture.
What is a Stream?
[Diagram: the publish/subscribe model, in which a producer publishes events to topics and a consumer subscribes to a topic to receive its events.]
What is a Stream?
[Diagram: multiple producers publish events to topics, and multiple consumers subscribe to receive them.]
What is a Stream?
What can go wrong?
• What happens if the producers send messages faster than the consumers can process them?
  • drop messages
  • buffer messages in a queue
  • apply backpressure
• What happens if nodes crash or temporarily go offline? Are any messages lost?
  • write messages to disk
  • replicate them … but streams are infinite
Streaming
• Where do streams come from?
• How are streams transported?
• What can you do with a stream?
Transmitting Event Streams
• Messaging Systems
  • Direct messaging
  • Message brokers
• Partitioned Logs
Messaging Systems: Direct Messaging
[Diagram: a producer sends events directly to a consumer, with no intermediary.]
Messaging Systems: Direct Messaging
• UDP multicast: a good fit for the financial industry, e.g. streams of stock prices, where low latency matters.
• Unreliable UDP messaging for collecting metrics from all machines on the network and monitoring them, where approximate results are acceptable: StatsD and Brubeck.
• Brokerless messaging libraries such as ZeroMQ over TCP.
• If the consumer exposes a service on the network, producers can push messages to it via a direct HTTP or RPC request (webhooks); see the sketch below.
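A minimal sketch of the webhook style of direct messaging, using only the Python standard library. The endpoint path and event fields are made up for illustration; the point is that the producer pushes each event straight to the consumer, and there is no durability if the consumer is offline.

```python
# Direct-messaging sketch (webhook style): the producer POSTs each event
# straight to the consumer; if the consumer is offline, the event is lost.
import json
import threading
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class EventHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers["Content-Length"])
        event = json.loads(self.rfile.read(length))
        print("consumer received:", event)
        self.send_response(204)          # acknowledge receipt, no body
        self.end_headers()

    def log_message(self, *args):        # silence default request logging
        pass

def run_consumer():
    HTTPServer(("localhost", 8080), EventHandler).serve_forever()

def produce(event):
    req = urllib.request.Request(
        "http://localhost:8080/events",  # hypothetical consumer endpoint
        data=json.dumps(event).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)          # fails if the consumer is offline

if __name__ == "__main__":
    threading.Thread(target=run_consumer, daemon=True).start()
    time.sleep(0.2)                      # give the consumer time to start
    produce({"type": "page_view", "user": "alice"})
```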
Messaging Systems: Direct Messaging
• These approaches assume that producers and consumers are constantly online.
• Offline consumer? It may miss messages.
• The producer can retry failed message deliveries,
• but what if the producer itself crashes?
Messaging Systems: Message Brokers
[Diagram: a producer sends events to a broker, which delivers them to a consumer (fire-and-forget model).]
Messaging Systems: Message Brokers
• Centralizing the data in the broker
  • tolerates clients that come and go (connect, disconnect, and crash).
• The broker is now responsible for the durability of messages:
  • keep messages in memory,
  • store them on disk,
  • allow unbounded queueing.
Message Brokers vs. Databases
• Data retention
  • Databases usually keep data until it is explicitly deleted.
  • Message brokers delete a message once it has been successfully delivered to its consumers.
• Size of the working set
  • Database tables grow as the data grows.
  • Broker queues are assumed to be short; performance degrades when queues have to grow large.
• Querying
  • Databases support secondary indexes and search queries.
  • Brokers support subscribing to a subset of topics matching some pattern.
• Change notification
  • A database query result is typically a point-in-time snapshot; if the data later changes, a client holding the old result is not notified.
  • Brokers do not support arbitrary queries, but they do notify clients when data changes.
Messaging Systems: Message Brokers
• If multiple consumers read from the same topic, there are two main patterns of delivery:
• Load balancing
  • Each message is delivered to one of the consumers,
  • so the consumers share the work of processing the messages in the topic.
  • JMS (shared subscriptions) and AMQP.
• Fanout
  • Each message is delivered to all of the consumers.
  • Topic subscriptions in JMS, exchange bindings in AMQP.
• (A consumer-group sketch illustrating both patterns follows below.)
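As a concrete illustration (using Kafka rather than JMS/AMQP): consumers that share a group_id divide a topic's partitions among themselves (load balancing), while consumers in distinct groups each receive every message (fanout). This is only a sketch, assuming a broker on localhost:9092 and the kafka-python package; the topic and group names are made up.

```python
# Sketch: load balancing vs. fanout with Kafka consumer groups
# (assumes a Kafka broker on localhost:9092 and `pip install kafka-python`).
from kafka import KafkaConsumer

# Load balancing: both consumers join the SAME group, so each message in
# the topic "clicks" is processed by only one of them.
worker_a = KafkaConsumer("clicks", group_id="billing",
                         bootstrap_servers="localhost:9092")
worker_b = KafkaConsumer("clicks", group_id="billing",
                         bootstrap_servers="localhost:9092")

# Fanout: a consumer in a DIFFERENT group gets its own copy of every
# message, independently of the "billing" group above.
auditor = KafkaConsumer("clicks", group_id="audit-log",
                        bootstrap_servers="localhost:9092")

for message in auditor:              # iterate over incoming events
    print(message.offset, message.value)
```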
Messaging Systems: Message Brokers
• Acknowledgments and redelivery
  • The client must explicitly tell the broker when it has finished processing a message.
  • When the consumer acknowledges receipt of a message, the broker can delete it.
  • What happens if the acknowledgment is lost in the network? The broker redelivers the message, so it may be processed more than once.
  • Combining this with load balancing means a redelivered message may go to a different consumer, so messages can be processed out of order.
  • (An explicit-acknowledgment sketch follows below.)
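A sketch of explicit acknowledgment using RabbitMQ's pika client, assuming a broker on localhost and a queue named task_queue (both assumptions): the broker redelivers any message whose acknowledgment never arrives, for example because the consumer crashed before sending it.

```python
# Sketch: explicit acknowledgment with RabbitMQ / pika
# (assumes a RabbitMQ broker on localhost and `pip install pika`).
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="task_queue", durable=True)  # survive broker restarts

def process(body):
    print("processed:", body)        # application-specific work

def handle(ch, method, properties, body):
    process(body)
    # Only after processing succeeds do we ack; if this consumer crashes
    # before the ack, the broker will redeliver the message (possibly to
    # another consumer when load balancing is in use).
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="task_queue", on_message_callback=handle)
channel.start_consuming()
```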
Messaging Systems: Message Brokers
• Redelivery
  • One workaround is to use a separate queue per consumer, but then there is no load balancing.
Partitioned Logs
• A log is simply an append-only sequence of records on disk.
• A producer sends a message by appending it to the end of the log, and a consumer receives messages by reading the log sequentially.
• Log-based message brokers write incoming messages to such logs and let consumers read from them.
• The log for a topic can be partitioned.
  • There is no ordering guarantee across the different partitions.
• Within a partition, each message is tagged with a monotonically increasing sequence number (offset), which producers and consumers use to track their position when writing and reading.
• (A toy sketch of this structure follows below.)
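A toy in-memory sketch of the idea (not how a real log-based broker is implemented): each partition is an append-only list, every appended message receives the next offset in that partition, and a consumer reads sequentially from the last offset it has seen.

```python
# Toy sketch of a partitioned log: append-only per-partition lists with offsets.
class PartitionedLog:
    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, message):
        # Messages with the same key always go to the same partition,
        # so ordering is preserved per key (but not across partitions).
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(message)
        return p, len(self.partitions[p]) - 1     # (partition, offset)

    def read(self, partition, offset):
        # A consumer reads sequentially from its last known offset onwards.
        return self.partitions[partition][offset:]

log = PartitionedLog(num_partitions=4)
log.append("user-17", {"event": "cart_add", "item": "book"})
partition, offset = log.append("user-17", {"event": "checkout"})
print(log.read(partition, 0))   # both events for user-17, in order
```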
Partitioned Logs
Partitioned Log Broker (Kafka) vs. Message Passing Broker (JMS)
• Throughput
  • Kafka: producers don't wait for acknowledgments from the brokers, so brokers can write messages at a very high rate, resulting in higher throughput.
  • JMS: the broker maintains the delivery state of every message, resulting in lower throughput.
• Delivery
  • Kafka: it is the responsibility of the consumers to consume all the messages they are supposed to consume.
  • JMS: it is the responsibility of the producers to ensure that messages have been delivered.
• Ordering
  • Kafka: can ensure that messages are received in the order they were sent, at the partition level.
  • JMS: cannot ensure that messages are received in the same order they were sent.
Partitioned Log Broker (Kafka) vs. Message Passing Broker (JMS)
• Filtering
  • Kafka: has no concept of filters at the brokers to ensure that messages picked up by consumers match a certain criterion; the filtering has to be done by the consumers.
  • JMS: allows a consumer to specify the messages it is interested in; the filtering is done on the broker side.
• Pull/Push
  • Kafka: a pull-type messaging platform, where the consumers pull the messages from the brokers.
  • JMS: a push-type messaging platform, where the providers push the messages to the consumers.
Partitioned Log Broker (Kafka) vs. Message Passing Broker (JMS)
• Scalability
  • Kafka: highly scalable; thanks to replication of partitions, it offers higher availability too.
  • JMS: cannot scale horizontally; there is also no concept of replication.
• Endurance
  • Kafka: does not slow down with the addition of new consumers.
  • JMS: the performance of both queues and topics degrades as the number of consumers rises.
• Replay (Debugging)
  • Kafka: messages can be re-read, as they are not deleted once consumed.
  • JMS: messages cannot be re-read.
Disk Space
• What happens when the disk space runs out?
  • To reclaim disk space, the log is divided into segments, and from time to time old segments are deleted or moved to archive storage.
• If a consumer is very slow, it can fall so far behind that the segment it needs has already been deleted, and it then misses those messages.
• This resembles a circular buffer or ring buffer; however, since the buffer is on disk, it can be quite large.
• (A small retention sketch follows below.)
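A small sketch of the segment idea under simplifying assumptions (sizes counted in numbers of messages, everything in memory): the log is split into fixed-size segments, and when the total size exceeds a retention limit, whole old segments are dropped. A consumer whose offset falls inside a dropped segment has lost those messages.

```python
# Sketch: segment-based retention. Old segments are dropped wholesale when
# the log exceeds its retention limit.
SEGMENT_SIZE = 1000     # messages per segment
RETENTION = 5000        # keep at most this many messages on "disk"

segments = []           # list of (base_offset, [messages])
next_offset = 0

def append(message):
    global next_offset
    if not segments or len(segments[-1][1]) == SEGMENT_SIZE:
        segments.append((next_offset, []))        # start a new segment
    segments[-1][1].append(message)
    next_offset += 1
    # Reclaim space by deleting the oldest whole segments.
    while sum(len(msgs) for _, msgs in segments) > RETENTION:
        base, dropped = segments.pop(0)
        print(f"dropped segment at offset {base}; consumers still behind "
              f"offset {base + len(dropped)} have lost those messages")

for i in range(12000):
    append({"seq": i})
```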
Keeping systems in sync
• No single system can satisfy all data storage, querying, and processing needs.
• We have discussed several different technologies to satisfy these requirements:
  • an OLTP database to serve user requests,
  • a cache to speed up common requests,
  • a full-text index to handle search queries,
  • and a data warehouse for analytics.
• Each of these has its own copy of the data, stored in its own representation that is optimized for its own purposes.
Interaction of Databases and Streams
• Keeping Systems in Sync
• Change Data Capture (CDC)
• Event Sourcing
Keeping systems in sync
Initial snapshot
Change Data Capture (CDC)
• CDC observes all data changes written to a database (for example by reading its replication log) and extracts them in a form that can be applied to other systems, such as a cache or a search index (see the sketch below).
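A sketch of applying CDC events to keep a derived cache in sync. The change-event format used here (op / key / row fields) is made up for illustration; real CDC tools such as Debezium emit their own schemas.

```python
# Sketch: applying change-data-capture events to keep a derived cache in sync.
# The event format below (op / key / row) is hypothetical, for illustration only.
cache = {}   # derived copy of a "users" table, e.g. backing a web page

change_events = [                      # would normally be read from a CDC stream
    {"op": "insert", "key": 1, "row": {"name": "Alice", "city": "Oslo"}},
    {"op": "update", "key": 1, "row": {"name": "Alice", "city": "Bergen"}},
    {"op": "delete", "key": 1},
]

def apply_change(event):
    if event["op"] in ("insert", "update"):
        cache[event["key"]] = event["row"]    # upsert the latest row version
    elif event["op"] == "delete":
        cache.pop(event["key"], None)

for event in change_events:            # consume the change log in order
    apply_change(event)
    print(cache)
```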
Event Sourcing
• A technique for data modeling:
  • It is more meaningful to record the users' actions as immutable events, rather than recording the effect of those actions on a mutable database.
• In CDC, the application uses the database in a mutable way, updating and deleting records at will.
• Event sourcing, by contrast, is based on immutable events written to an event log. The event store is append-only, and updates or deletes are discouraged or prohibited.
Event Sourcing
• Events are designed to reflect things that happened at the application level.
• Distinguish between events and commands:
  • When a request from a user first arrives, it is initially a command: at this point it may still fail.
  • The application must first validate that it can execute the command.
  • Successful? The command becomes an event, which is durable and immutable.
• Synchronously: use a serializable transaction that atomically validates the command and publishes the event.
• Asynchronously: split it into a request followed by a confirmation.
• (A minimal sketch of this flow follows below.)
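A minimal event-sourcing sketch under simplifying assumptions (an in-memory event log, a made-up seat-reservation domain): a command is validated against the current state, and only if validation succeeds is an immutable event appended; the current state is always derived by replaying the events.

```python
# Sketch: commands are validated, then recorded as immutable events;
# current state is derived by replaying the event log from the start.
event_log = []                       # append-only; events are never modified

def current_state():
    """Fold the event log into the set of seats that are taken."""
    taken = set()
    for event in event_log:
        if event["type"] == "SeatReserved":
            taken.add(event["seat"])
        elif event["type"] == "ReservationCancelled":
            taken.discard(event["seat"])
    return taken

def handle_reserve_command(seat):
    # The command may still fail: validate it against the derived state.
    if seat in current_state():
        raise ValueError(f"seat {seat} is already taken")
    # Validation succeeded: the command becomes a durable, immutable event.
    event_log.append({"type": "SeatReserved", "seat": seat})

handle_reserve_command("12A")
handle_reserve_command("12B")
print(current_state())               # {'12A', '12B'}
```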
Stream Processing

What can a consumer do with a stream?
• Store the events in a database.
• Pipeline the events, after processing, into another stream.
• Transfer the events to the users.
Stream Processing
• A stream processor consumes input streams in a read-only fashion and writes its output to a different location in an append-only fashion.
• The difference between a stream job and a batch job is that a stream never ends.
Applications of Stream Processing

• Fraud detection systems: credit card usage

• Trading systems: price changes

• Manufacturing systems: status of machines in a factory

• Military and intelligence systems: signs of an attack


Stream Processing
Complex event processing (CEP)
• CEP systems often use a high-level declarative query
language like SQL to describe the patterns of events that
should be detected.
• These queries are submitted to a processing engine that
consumes the input streams.
Stream Processing
Complex event processing (CEP)
• The engine maintains a state machine that performs the required matching.
• When a match is found, the engine emits a complex event.
• Here the query comes first, and then the data flows past it: the reverse of a normal database, where the data is stored and the queries are transient.
• Examples: Esper, IBM InfoSphere Streams, Apama, TIBCO StreamBase, and SQLstream.
• (A toy pattern-matching sketch follows below.)
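Real CEP engines use declarative query languages; the sketch below only illustrates the idea of a stored query plus a small per-key state machine, using a made-up rule: emit a complex event when the same user fails to log in three times within 60 seconds.

```python
# Sketch of the CEP idea: a stored pattern is matched against incoming events
# by a small per-user state machine; a match emits a "complex event".
from collections import defaultdict, deque

WINDOW_SECONDS = 60
THRESHOLD = 3
recent_failures = defaultdict(deque)   # user -> timestamps of failed logins

def process(event):
    """event: {'type': ..., 'user': ..., 'ts': seconds} (illustrative schema)."""
    if event["type"] != "login_failed":
        return None
    window = recent_failures[event["user"]]
    window.append(event["ts"])
    while window and event["ts"] - window[0] > WINDOW_SECONDS:
        window.popleft()               # discard failures outside the window
    if len(window) >= THRESHOLD:
        return {"complex_event": "possible_brute_force", "user": event["user"]}
    return None

stream = [
    {"type": "login_failed", "user": "bob", "ts": 0},
    {"type": "login_failed", "user": "bob", "ts": 20},
    {"type": "login_failed", "user": "bob", "ts": 45},
]
for e in stream:
    match = process(e)
    if match:
        print(match)
```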
Stream Processing
Stream analytics
• Aggregations and statistical metrics over many events
• Measuring the rate of some type of event.
• Calculating the rolling average of a value over some time.
• Comparing current statistics to previous time intervals
Stream Processing
Stream analytics
• Statistics are aggregated over windows of time: recap from the first lecture.
• Frameworks: Apache Storm, Spark Streaming, Flink, Concord, Samza, and Kafka Streams.
• Hosted services include Google Cloud Dataflow and Azure Stream Analytics.
• (A tumbling-window aggregation sketch follows below.)
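A small sketch of windowed aggregation, assuming events carry a timestamp in seconds and we want a per-minute (tumbling-window) event count and mean value; engines such as Flink or Kafka Streams provide this kind of operator built in, so this is only the underlying idea.

```python
# Sketch: tumbling one-minute windows computing an event count and a mean value.
from collections import defaultdict

WINDOW = 60   # window length in seconds

def window_start(ts):
    return ts - (ts % WINDOW)          # align the timestamp to its window

windows = defaultdict(lambda: {"count": 0, "total": 0.0})

def process(event):
    """event: {'ts': seconds, 'value': float} (illustrative schema)."""
    w = windows[window_start(event["ts"])]
    w["count"] += 1
    w["total"] += event["value"]

for e in [{"ts": 3, "value": 10.0}, {"ts": 59, "value": 20.0},
          {"ts": 61, "value": 5.0}]:
    process(e)

for start, agg in sorted(windows.items()):
    print(start, agg["count"], agg["total"] / agg["count"])   # rate and mean
```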
Stream Processing
Maintaining materialized views
• Application state that is maintained by applying a log of events is a materialized view.
• It is usually not sufficient to consider only events within some time window: building the materialized view potentially requires all events over an arbitrary period.
• Samza and Kafka Streams support this kind of usage, building upon Kafka's support for log compaction.
Stream Processing
Search on streams
• Allows searching for patterns consisting of multiple events.
• Example: media monitoring services subscribe to feeds of news articles and broadcasts from media outlets.
• A client formulates a search query in advance, and the service then continually matches the stream of news items against this query.
• Searching a stream turns conventional processing on its head: the queries are stored, and the documents run past the queries, as in CEP.
• (A toy sketch of stored queries follows below.)
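A toy sketch of search on a stream: the queries (keyword sets that clients registered in advance; the names and keywords here are made up) are stored, and every incoming news item is run past all of them. Production systems index the queries themselves to avoid this linear scan.

```python
# Sketch: stored queries are matched against every document in the stream.
stored_queries = {
    "client-a": {"election", "poll"},      # keywords registered in advance
    "client-b": {"earthquake"},
}

def match(document_text):
    words = set(document_text.lower().split())
    # Naive linear scan over all stored queries; real systems index the queries.
    return [client for client, keywords in stored_queries.items()
            if keywords & words]

news_stream = [
    "Early poll results ahead of the election",
    "Minor earthquake reported off the coast",
]
for item in news_stream:
    print(match(item), "<-", item)
```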
References
[1] M. Kleppmann, "Chapter 11: Stream Processing," in Designing Data-Intensive Applications, 1st ed., O'Reilly, 2017, ISBN 9781449373320.
