Lecture 07: Streaming
Data Streaming

What is a Stream?
• A sequence of data elements made available over time.
• It represents a continuous flow of data that may be generated, collected, or transmitted at various rates.
• A data stream is dynamic and may evolve over time.
What is a Stream?
• In streaming terminology, an event is generated once by a producer (also known as a publisher or sender), and then potentially processed by multiple consumers (subscribers or recipients).
• Related events are usually grouped together into a topic or stream.
• An event may be encoded as a text string, as JSON, or perhaps in some binary form → recall the Encoding lecture.
What is a Stream?
[Diagram: the publish/subscribe model, in which a producer publishes events to topics and a consumer subscribes to a topic to receive its events.]
What is a Stream?
[Diagram: multiple producers publish events to topics, and multiple consumers subscribe to receive them.]
What is a Stream?
What can go wrong?
• What happens if the producers send messages faster than the consumers can process them?
  • drop messages
  • buffer messages in a queue
  • apply backpressure
• What happens if nodes crash or temporarily go offline? Are any messages lost?
  • write messages to disk
  • replicate them … but streams are infinite
Streaming
• Where do streams come from?
• How are streams transported?
• What can you do with a stream?
Transmitting Event Streams
• Messaging Systems
  • Direct messaging
  • Message brokers
• Partitioned Logs
Messaging Systems: Direct Messaging
[Diagram: a producer sends events directly to a consumer, with no intermediary.]
Messaging Systems: Direct Messaging
• UDP multicast: a good fit for the financial industry, e.g. streams of stock prices, where low latency matters.
• Unreliable UDP messaging for collecting metrics from all machines on the network and monitoring them, where approximate results are acceptable: StatsD and Brubeck.
• Brokerless messaging libraries such as ZeroMQ over TCP.
• If the consumer exposes a service on the network, producers can push messages to it via a direct HTTP or RPC request (webhooks); see the sketch below.
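A minimal sketch of the webhook style of direct messaging, using only the Python standard library. The endpoint path and event fields are made up for illustration; the point is that the producer pushes each event straight to the consumer, and there is no durability if the consumer is offline.

```python
# Direct-messaging sketch (webhook style): the producer POSTs each event
# straight to the consumer; if the consumer is offline, the event is lost.
import json
import threading
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class EventHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers["Content-Length"])
        event = json.loads(self.rfile.read(length))
        print("consumer received:", event)
        self.send_response(204)          # acknowledge receipt, no body
        self.end_headers()

    def log_message(self, *args):        # silence default request logging
        pass

def run_consumer():
    HTTPServer(("localhost", 8080), EventHandler).serve_forever()

def produce(event):
    req = urllib.request.Request(
        "http://localhost:8080/events",  # hypothetical consumer endpoint
        data=json.dumps(event).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)          # fails if the consumer is offline

if __name__ == "__main__":
    threading.Thread(target=run_consumer, daemon=True).start()
    time.sleep(0.2)                      # give the consumer time to start
    produce({"type": "page_view", "user": "alice"})
```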
Messaging Systems: Direct Messaging
• These approaches assume that producers and consumers are constantly online.
• Offline consumer? It may miss messages.
• The producer can retry failed message deliveries,
• but what if the producer itself crashes?
Messaging Systems: Message Brokers
[Diagram: a producer sends events to a broker, which delivers them to a consumer (fire-and-forget model).]
Messaging Systems: Message Brokers
• Centralizing the data in the broker
  • tolerates clients that come and go (connect, disconnect, and crash).
• The broker is now responsible for the durability of messages:
  • keep messages in memory,
  • store them on disk,
  • allow unbounded queueing.
Message Brokers vs. Databases
• Data retention
  • Databases usually keep data until it is explicitly deleted.
  • Message brokers delete a message once it has been successfully delivered to its consumers.
• Size of the working set
  • Database tables grow as the data grows.
  • Broker queues are assumed to be short; performance degrades when queues have to grow large.
• Querying
  • Databases support secondary indexes and search queries.
  • Brokers support subscribing to a subset of topics matching some pattern.
• Change notification
  • A database query result is typically a point-in-time snapshot; if the data later changes, a client holding the old result is not notified.
  • Brokers do not support arbitrary queries, but they do notify clients when data changes.
Messaging Systems: Message Brokers
• If multiple consumers read from the same topic, there are two main patterns of delivery:
• Load balancing
  • Each message is delivered to one of the consumers,
  • so the consumers share the work of processing the messages in the topic.
  • JMS (shared subscriptions) and AMQP.
• Fanout
  • Each message is delivered to all of the consumers.
  • Topic subscriptions in JMS, exchange bindings in AMQP.
• (A consumer-group sketch illustrating both patterns follows below.)
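As a concrete illustration (using Kafka rather than JMS/AMQP): consumers that share a group_id divide a topic's partitions among themselves (load balancing), while consumers in distinct groups each receive every message (fanout). This is only a sketch, assuming a broker on localhost:9092 and the kafka-python package; the topic and group names are made up.

```python
# Sketch: load balancing vs. fanout with Kafka consumer groups
# (assumes a Kafka broker on localhost:9092 and `pip install kafka-python`).
from kafka import KafkaConsumer

# Load balancing: both consumers join the SAME group, so each message in
# the topic "clicks" is processed by only one of them.
worker_a = KafkaConsumer("clicks", group_id="billing",
                         bootstrap_servers="localhost:9092")
worker_b = KafkaConsumer("clicks", group_id="billing",
                         bootstrap_servers="localhost:9092")

# Fanout: a consumer in a DIFFERENT group gets its own copy of every
# message, independently of the "billing" group above.
auditor = KafkaConsumer("clicks", group_id="audit-log",
                        bootstrap_servers="localhost:9092")

for message in auditor:              # iterate over incoming events
    print(message.offset, message.value)
```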
Messaging Systems: Message Brokers
• Acknowledgments and redelivery
  • The client must explicitly tell the broker when it has finished processing a message.
  • When the consumer acknowledges receipt of a message, the broker can delete it.
  • What happens if the acknowledgment is lost in the network? The broker redelivers the message, so it may be processed more than once.
  • Combining this with load balancing means a redelivered message may go to a different consumer, so messages can be processed out of order.
  • (An explicit-acknowledgment sketch follows below.)
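A sketch of explicit acknowledgment using RabbitMQ's pika client, assuming a broker on localhost and a queue named task_queue (both assumptions): the broker redelivers any message whose acknowledgment never arrives, for example because the consumer crashed before sending it.

```python
# Sketch: explicit acknowledgment with RabbitMQ / pika
# (assumes a RabbitMQ broker on localhost and `pip install pika`).
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="task_queue", durable=True)  # survive broker restarts

def process(body):
    print("processed:", body)        # application-specific work

def handle(ch, method, properties, body):
    process(body)
    # Only after processing succeeds do we ack; if this consumer crashes
    # before the ack, the broker will redeliver the message (possibly to
    # another consumer when load balancing is in use).
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="task_queue", on_message_callback=handle)
channel.start_consuming()
```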
Messaging Systems: Message Brokers
• Redelivery
  • One workaround is to use a separate queue per consumer, but then there is no load balancing.
Partitioned Logs
• A log is simply an append-only sequence of records on disk.
• A producer sends a message by appending it to the end of the log, and a consumer receives messages by reading the log sequentially.
• Log-based message brokers write incoming messages to such logs and let consumers read from them.
• The log for a topic can be partitioned.
  • There is no ordering guarantee across the different partitions.
• Within a partition, each message is tagged with a monotonically increasing sequence number (offset), which producers and consumers use to track their position when writing and reading.
• (A toy sketch of this structure follows below.)
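A toy in-memory sketch of the idea (not how a real log-based broker is implemented): each partition is an append-only list, every appended message receives the next offset in that partition, and a consumer reads sequentially from the last offset it has seen.

```python
# Toy sketch of a partitioned log: append-only per-partition lists with offsets.
class PartitionedLog:
    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, message):
        # Messages with the same key always go to the same partition,
        # so ordering is preserved per key (but not across partitions).
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(message)
        return p, len(self.partitions[p]) - 1     # (partition, offset)

    def read(self, partition, offset):
        # A consumer reads sequentially from its last known offset onwards.
        return self.partitions[partition][offset:]

log = PartitionedLog(num_partitions=4)
log.append("user-17", {"event": "cart_add", "item": "book"})
partition, offset = log.append("user-17", {"event": "checkout"})
print(log.read(partition, 0))   # both events for user-17, in order
```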
Partitioned Logs
Partitioned Log Broker (Kafka) vs. Message Passing Broker (JMS)
• Throughput
  • Kafka: producers don't wait for acknowledgments from the brokers, so brokers can write messages at a very high rate, resulting in higher throughput.
  • JMS: the broker maintains the delivery state of every message, resulting in lower throughput.
• Delivery
  • Kafka: it is the responsibility of the consumers to consume all the messages they are supposed to consume.
  • JMS: it is the responsibility of the producers to ensure that messages have been delivered.
• Ordering
  • Kafka: can ensure that messages are received in the order they were sent, at the partition level.
  • JMS: cannot ensure that messages are received in the same order they were sent.
Partitioned Log Broker (Kafka) vs. Message Passing Broker (JMS)
• Filtering
  • Kafka: has no concept of filters at the brokers to ensure that messages picked up by consumers match a certain criterion; the filtering has to be done by the consumers.
  • JMS: allows a consumer to specify the messages it is interested in; the filtering is done on the broker side.
• Pull/Push
  • Kafka: a pull-type messaging platform, where the consumers pull the messages from the brokers.
  • JMS: a push-type messaging platform, where the providers push the messages to the consumers.
Partitioned Log Broker (Kafka) vs. Message Passing Broker (JMS)
• Scalability
  • Kafka: highly scalable; thanks to replication of partitions, it offers higher availability too.
  • JMS: cannot scale horizontally; there is also no concept of replication.
• Endurance
  • Kafka: does not slow down with the addition of new consumers.
  • JMS: the performance of both queues and topics degrades as the number of consumers rises.
• Replay (Debugging)
  • Kafka: messages can be re-read, as they are not deleted once consumed.
  • JMS: messages cannot be re-read.
Disk Space
• What happens when the disk space runs out?
  • To reclaim disk space, the log is divided into segments, and from time to time old segments are deleted or moved to archive storage.
• If a consumer is very slow, it can fall so far behind that the segment it needs has already been deleted, and it then misses those messages.
• This resembles a circular buffer or ring buffer; however, since the buffer is on disk, it can be quite large.
• (A small retention sketch follows below.)
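A small sketch of the segment idea under simplifying assumptions (sizes counted in numbers of messages, everything in memory): the log is split into fixed-size segments, and when the total size exceeds a retention limit, whole old segments are dropped. A consumer whose offset falls inside a dropped segment has lost those messages.

```python
# Sketch: segment-based retention. Old segments are dropped wholesale when
# the log exceeds its retention limit.
SEGMENT_SIZE = 1000     # messages per segment
RETENTION = 5000        # keep at most this many messages on "disk"

segments = []           # list of (base_offset, [messages])
next_offset = 0

def append(message):
    global next_offset
    if not segments or len(segments[-1][1]) == SEGMENT_SIZE:
        segments.append((next_offset, []))        # start a new segment
    segments[-1][1].append(message)
    next_offset += 1
    # Reclaim space by deleting the oldest whole segments.
    while sum(len(msgs) for _, msgs in segments) > RETENTION:
        base, dropped = segments.pop(0)
        print(f"dropped segment at offset {base}; consumers still behind "
              f"offset {base + len(dropped)} have lost those messages")

for i in range(12000):
    append({"seq": i})
```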
Keeping systems in sync
• No single system can satisfy all data storage, querying, and processing needs.
• We have discussed several different technologies to satisfy these requirements:
  • an OLTP database to serve user requests,
  • a cache to speed up common requests,
  • a full-text index to handle search queries,
  • and a data warehouse for analytics.
• Each of these has its own copy of the data, stored in its own representation that is optimized for its own purposes.
Interaction of Databases and Streams
• Keeping Systems in Sync
• Change Data Capture (CDC)
• Event Sourcing
Keeping systems in sync
Initial snapshot
Change Data Capture (CDC)
• CDC observes all data changes written to a database (for example by reading its replication log) and extracts them in a form that can be applied to other systems, such as a cache or a search index (see the sketch below).
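A sketch of applying CDC events to keep a derived cache in sync. The change-event format used here (op / key / row fields) is made up for illustration; real CDC tools such as Debezium emit their own schemas.

```python
# Sketch: applying change-data-capture events to keep a derived cache in sync.
# The event format below (op / key / row) is hypothetical, for illustration only.
cache = {}   # derived copy of a "users" table, e.g. backing a web page

change_events = [                      # would normally be read from a CDC stream
    {"op": "insert", "key": 1, "row": {"name": "Alice", "city": "Oslo"}},
    {"op": "update", "key": 1, "row": {"name": "Alice", "city": "Bergen"}},
    {"op": "delete", "key": 1},
]

def apply_change(event):
    if event["op"] in ("insert", "update"):
        cache[event["key"]] = event["row"]    # upsert the latest row version
    elif event["op"] == "delete":
        cache.pop(event["key"], None)

for event in change_events:            # consume the change log in order
    apply_change(event)
    print(cache)
```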
Event Sourcing
• A technique for data modeling:
  • It is more meaningful to record the users' actions as immutable events, rather than recording the effect of those actions on a mutable database.
• In CDC, the application uses the database in a mutable way, updating and deleting records at will.
• Event sourcing, by contrast, is based on immutable events written to an event log. The event store is append-only, and updates or deletes are discouraged or prohibited.
Event Sourcing
• Events are designed to reflect things that happened at the application level.
• Distinguish between events and commands:
  • When a request from a user first arrives, it is initially a command: at this point it may still fail.
  • The application must first validate that it can execute the command.
  • Successful? The command becomes an event, which is durable and immutable.
• Synchronously: use a serializable transaction that atomically validates the command and publishes the event.
• Asynchronously: split it into a request followed by a confirmation.
• (A minimal sketch of this flow follows below.)
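A minimal event-sourcing sketch under simplifying assumptions (an in-memory event log, a made-up seat-reservation domain): a command is validated against the current state, and only if validation succeeds is an immutable event appended; the current state is always derived by replaying the events.

```python
# Sketch: commands are validated, then recorded as immutable events;
# current state is derived by replaying the event log from the start.
event_log = []                       # append-only; events are never modified

def current_state():
    """Fold the event log into the set of seats that are taken."""
    taken = set()
    for event in event_log:
        if event["type"] == "SeatReserved":
            taken.add(event["seat"])
        elif event["type"] == "ReservationCancelled":
            taken.discard(event["seat"])
    return taken

def handle_reserve_command(seat):
    # The command may still fail: validate it against the derived state.
    if seat in current_state():
        raise ValueError(f"seat {seat} is already taken")
    # Validation succeeded: the command becomes a durable, immutable event.
    event_log.append({"type": "SeatReserved", "seat": seat})

handle_reserve_command("12A")
handle_reserve_command("12B")
print(current_state())               # {'12A', '12B'}
```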
Stream Processing

What can a consumer do with a stream?
• Store the events in a database.
• Pipeline the events, after processing, into another stream.
• Transfer the events to the users.
Stream Processing
• A stream processor consumes input streams in a read-only fashion and writes its output to a different location in an append-only fashion.
• The difference between a stream job and a batch job is that a stream never ends.
Applications of Stream Processing

• Fraud detection systems: credit card usage

• Trading systems: price changes

• Manufacturing systems: status of machines in a factory

• Military and intelligence systems: signs of an attack


Stream Processing
Complex event processing (CEP)
• CEP systems often use a high-level declarative query
language like SQL to describe the patterns of events that
should be detected.
• These queries are submitted to a processing engine that
consumes the input streams.
Stream Processing
Complex event processing (CEP)
• The engine maintains a state machine that performs the required matching.
• When a match is found, the engine emits a complex event.
• Here the query comes first, and then the data flows past it: the reverse of a normal database, where the data is stored and the queries are transient.
• Examples: Esper, IBM InfoSphere Streams, Apama, TIBCO StreamBase, and SQLstream.
• (A toy pattern-matching sketch follows below.)
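Real CEP engines use declarative query languages; the sketch below only illustrates the idea of a stored query plus a small per-key state machine, using a made-up rule: emit a complex event when the same user fails to log in three times within 60 seconds.

```python
# Sketch of the CEP idea: a stored pattern is matched against incoming events
# by a small per-user state machine; a match emits a "complex event".
from collections import defaultdict, deque

WINDOW_SECONDS = 60
THRESHOLD = 3
recent_failures = defaultdict(deque)   # user -> timestamps of failed logins

def process(event):
    """event: {'type': ..., 'user': ..., 'ts': seconds} (illustrative schema)."""
    if event["type"] != "login_failed":
        return None
    window = recent_failures[event["user"]]
    window.append(event["ts"])
    while window and event["ts"] - window[0] > WINDOW_SECONDS:
        window.popleft()               # discard failures outside the window
    if len(window) >= THRESHOLD:
        return {"complex_event": "possible_brute_force", "user": event["user"]}
    return None

stream = [
    {"type": "login_failed", "user": "bob", "ts": 0},
    {"type": "login_failed", "user": "bob", "ts": 20},
    {"type": "login_failed", "user": "bob", "ts": 45},
]
for e in stream:
    match = process(e)
    if match:
        print(match)
```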
Stream Processing
Stream analytics
• Aggregations and statistical metrics over many events
• Measuring the rate of some type of event.
• Calculating the rolling average of a value over some time.
• Comparing current statistics to previous time intervals
Stream Processing
Stream analytics
• Statistics are aggregated over windows of time: recap from the first lecture.
• Frameworks: Apache Storm, Spark Streaming, Flink, Concord, Samza, and Kafka Streams.
• Hosted services include Google Cloud Dataflow and Azure Stream Analytics.
• (A tumbling-window aggregation sketch follows below.)
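A small sketch of windowed aggregation, assuming events carry a timestamp in seconds and we want a per-minute (tumbling-window) event count and mean value; engines such as Flink or Kafka Streams provide this kind of operator built in, so this is only the underlying idea.

```python
# Sketch: tumbling one-minute windows computing an event count and a mean value.
from collections import defaultdict

WINDOW = 60   # window length in seconds

def window_start(ts):
    return ts - (ts % WINDOW)          # align the timestamp to its window

windows = defaultdict(lambda: {"count": 0, "total": 0.0})

def process(event):
    """event: {'ts': seconds, 'value': float} (illustrative schema)."""
    w = windows[window_start(event["ts"])]
    w["count"] += 1
    w["total"] += event["value"]

for e in [{"ts": 3, "value": 10.0}, {"ts": 59, "value": 20.0},
          {"ts": 61, "value": 5.0}]:
    process(e)

for start, agg in sorted(windows.items()):
    print(start, agg["count"], agg["total"] / agg["count"])   # rate and mean
```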
Stream Processing
Maintaining materialized views
• Application state that is maintained by applying a log of events is a materialized view.
• It is usually not sufficient to consider only events within some time window: building the materialized view potentially requires all events over an arbitrary period.
• Samza and Kafka Streams support this kind of usage, building upon Kafka's support for log compaction.
Stream Processing
Search on streams
• Allows searching for patterns consisting of multiple events.
• Example: media monitoring services subscribe to feeds of news articles and broadcasts from media outlets.
• A client formulates a search query in advance, and the service then continually matches the stream of news items against this query.
• Searching a stream turns conventional processing on its head: the queries are stored, and the documents run past the queries, as in CEP.
• (A toy sketch of stored queries follows below.)
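A toy sketch of search on a stream: the queries (keyword sets that clients registered in advance; the names and keywords here are made up) are stored, and every incoming news item is run past all of them. Production systems index the queries themselves to avoid this linear scan.

```python
# Sketch: stored queries are matched against every document in the stream.
stored_queries = {
    "client-a": {"election", "poll"},      # keywords registered in advance
    "client-b": {"earthquake"},
}

def match(document_text):
    words = set(document_text.lower().split())
    # Naive linear scan over all stored queries; real systems index the queries.
    return [client for client, keywords in stored_queries.items()
            if keywords & words]

news_stream = [
    "Early poll results ahead of the election",
    "Minor earthquake reported off the coast",
]
for item in news_stream:
    print(match(item), "<-", item)
```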
References
[1] M. Kleppmann, "Chapter 11: Stream Processing," in Designing Data-Intensive Applications, 1st ed., O'Reilly, 2017, ISBN 9781449373320.
