A Visual Introduction To Apache Kafka PDF


Introducing Apache Kafka®

Paul Brebner
Technical Evangelist February 2019
What is Kafka?
A distributed streams processing system.

Messages are sent by distributed producers to distributed consumers via a distributed Kafka cluster.
Kafka benefits
• Fast: high throughput and low latency
• Scalable: horizontally scalable with nodes and partitions
• Reliable: distributed and fault tolerant
• Durable: zero data loss, messages persisted to disk in an immutable log
• Open Source: an Apache project
• Available as a Managed Service: on multiple cloud platforms
Message flow
To send a message from A to B:
“A” is the producer, which sends a message to
“B”, the consumer (recipient) of the message.
Due to the decline in “snail mail” direct deliveries, instead we have “Poste Restante”.
“Poste Restante”?
• Not a post office in a restaurant
• General delivery (in the US)
• The mail is delivered to a post office, which holds it for you until you call
Consumers “poll” for messages by visiting the
Poste Restante counter at the post office
Kafka topics act like a Post Office
Benefits include
• Disconnected delivery: the consumer doesn’t need to be available to receive messages
• Less effort for the messaging service: it only has to deliver to a few locations, not to every consumer
• Can scale better and handle more complex delivery semantics!
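
The post office analogy can be sketched in a few lines of plain Python. This is only an in-memory illustration, not the Kafka client API: the `Topic` and `Consumer` classes and their methods are hypothetical names. It shows disconnected delivery, where messages sent before the consumer exists are held until the consumer polls.

```python
class Topic:
    """An append-only log that holds messages until consumers call for them."""

    def __init__(self):
        self.log = []  # messages are kept in arrival order and never modified

    def send(self, message):
        self.log.append(message)
        return len(self.log) - 1  # offset of the stored message


class Consumer:
    """Remembers how far it has read, like a customer visiting the counter."""

    def __init__(self, topic):
        self.topic = topic
        self.position = 0  # next offset to read

    def poll(self):
        new_messages = self.topic.log[self.position:]
        self.position = len(self.topic.log)
        return new_messages


parties = Topic()
parties.send("letter 1")  # sent while no consumer exists yet
parties.send("letter 2")

late_consumer = Consumer(parties)
first_visit = late_consumer.poll()   # picks up everything held so far
second_visit = late_consumer.poll()  # nothing new since the last visit
```

The topic only appends; each consumer tracks its own position, so the messaging side does no per-consumer delivery work.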
Scalability? Many consumers for a topic?
A single counter introduces delays
More counters increase concurrency
Kafka topics have >= 1 Partitions (“counters”)

• Partitions increase
consumer concurrency
• Increase throughput
• Reduce latency
What’s a Kafka message?

A record, like a letter (addressed to Santa, North Pole).
• Topic is the destination (topic “North Pole”)
• The “postmark”: timestamp, offset, partition. Time semantics are flexible: time of event creation, ingestion, or processing.
• Key (optional): Key -> Partition. Refines the destination: send to Santa, not just any Elf.
• Value is the contents (a byte array)

Kafka producers and consumers need a serializer and de-serializer to write and read the key and value.
• Kafka doesn’t look at the value
• The consumer can read the value
• And try to make sense of the message
• What will Santa be delivering?!
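
A rough sketch of the parts of a record, assuming JSON as the serialization format (an assumption made for illustration; Kafka itself only sees byte arrays). The `Record` dataclass and the `serialize` / `deserialize` helpers are hypothetical names, not the client library's types.

```python
import json
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    topic: str                         # the destination, e.g. "North Pole"
    value: bytes                       # the contents; never interpreted in transit
    key: Optional[bytes] = None        # optional; refines the destination
    timestamp: Optional[float] = None  # part of the "postmark"
    partition: Optional[int] = None    # part of the "postmark", set when stored
    offset: Optional[int] = None       # part of the "postmark", set when stored


def serialize(obj) -> bytes:
    """Producer side: turn an object into bytes (JSON is an assumption)."""
    return json.dumps(obj).encode("utf-8")


def deserialize(data: bytes):
    """Consumer side: turn the bytes back into an object."""
    return json.loads(data.decode("utf-8"))


letter = Record(
    topic="North Pole",
    key=serialize("Santa"),
    value=serialize({"wish": "a bicycle"}),
    timestamp=time.time(),
)
wish = deserialize(letter.value)  # only the consumer makes sense of the value
```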
Next…
Delivery Semantics

Do we care if the message arrives?
Yes! Guaranteed delivery is desirable.
But homing pigeons got lost or eaten.
Send multiple pigeons: one pigeon may make it.
How does Kafka guarantee delivery?
[Figure: a producer sends message M1 to a cluster of three brokers (1, 2, 3).]

A message (M1) is written to a broker (2).
The message is always persisted to disk. This makes it resilient to loss of power.
The message is also replicated on multiple “brokers”, which makes it resilient to the loss of most servers.

Acknowledgement
The producer gets an acknowledgement once the message is persisted and replicated (configurable).
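
The persist, replicate, acknowledge sequence can be sketched as a small in-memory model. Everything here is hypothetical (the `Broker` class, the `send` function and its `required_acks` parameter); in real Kafka the producer controls this behaviour through its `acks` setting.

```python
class Broker:
    def __init__(self, broker_id):
        self.broker_id = broker_id
        self.disk = []     # stands in for the on-disk log
        self.alive = True

    def persist(self, message):
        if not self.alive:
            return False
        self.disk.append(message)
        return True


def send(message, leader, followers, required_acks):
    """Write to the leader, replicate to the followers, and report whether
    at least `required_acks` brokers stored the message."""
    stored = sum(broker.persist(message) for broker in [leader] + followers)
    return stored >= required_acks


brokers = [Broker(1), Broker(2), Broker(3)]
leader, followers = brokers[1], [brokers[0], brokers[2]]  # broker 2 takes the write

acked = send("M1", leader, followers, required_acks=3)

# Losing two of the three servers still leaves one copy of M1.
brokers[0].alive = False
brokers[1].alive = False
survivors = [b for b in brokers if b.alive and "M1" in b.disk]
```

Lowering `required_acks` in this model trades durability for a faster acknowledgement, which is the kind of choice the "(configurable)" note refers to.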
Multiple brokers and partitions give increased read availability and concurrency.
The 2nd aspect of delivery semantics:
Who gets the messages?
How many times are messages delivered?
Delivery Semantics - Kafka is “pub-sub”
- Loosely coupled
- Producers and consumers don’t know about each other
Which consumers get which messages (filtering) is topic based.
[Figure: a producer, two topics (“Parties” and “Work”), and four consumers.]
• Consumers subscribe to the topic “Parties”
• Publishers send messages to topics
• Consumers only receive messages from subscribed topics
• Consumers poll to receive messages from “Parties”
• Consumers not subscribed to “Work” don’t receive any “Work” messages
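
Topic based filtering can be sketched as follows. This is a plain Python model with hypothetical `Cluster` and `Consumer` classes, not the Kafka client; the point is that the producer never names a consumer, and each consumer only sees the topics it subscribed to.

```python
from collections import defaultdict


class Cluster:
    def __init__(self):
        self.topics = defaultdict(list)  # topic name -> list of messages

    def send(self, topic, message):
        self.topics[topic].append(message)


class Consumer:
    def __init__(self, cluster, subscriptions):
        self.cluster = cluster
        # next position to read, per subscribed topic
        self.positions = {topic: 0 for topic in subscriptions}

    def poll(self):
        received = []
        for topic, position in self.positions.items():
            log = self.cluster.topics[topic]
            received.extend(log[position:])
            self.positions[topic] = len(log)
        return received


cluster = Cluster()
party_goer = Consumer(cluster, ["Parties"])
worker = Consumer(cluster, ["Work"])

cluster.send("Parties", "Pool party on Friday")
cluster.send("Work", "Status report due")

party_messages = party_goer.poll()
work_messages = worker.poll()
```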
Partitions and Consumer Groups
• Enable sharing of work across consumers
• Duplicate message delivery: each message is delivered to each subscribed consumer group

Consumers subscribed to a topic are allocated partitions.
They will only get messages from their allocated partitions.
[Figure: a producer writes to topic “Parties” (partitions 1, 2, … n), which is read by two consumer groups.]

Consumers in the same group share the work around.
Each consumer gets only a subset of the messages.

Multiple groups enable message broadcasting.
Messages are duplicated across groups: each consumer group receives a copy of each message.
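
A minimal sketch of how partitions might be shared out within each group. The `assign` function is hypothetical, and the round-robin spread is an assumption made for illustration: real Kafka supports configurable assignment strategies.

```python
def assign(partitions, consumers):
    """Spread partitions over the consumers of one group, round robin.
    Every partition goes to exactly one consumer in the group."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


partitions = [0, 1, 2, 3]
group_a = assign(partitions, ["a1", "a2"])  # two consumers share the work
group_b = assign(partitions, ["b1"])        # a single consumer reads everything
```

Each group covers all four partitions independently, which is why every group sees a copy of every message while a consumer inside a larger group sees only a subset.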
Which messages are delivered to which consumers?

Key: partition based delivery
If a message has a key, then Kafka uses partition based delivery.
Messages with the same key are always sent to the same partition and therefore the same consumer.
And the order is guaranteed.

No Key: round robin delivery
If the key is null, then Kafka uses round robin delivery.
Each message is delivered to the next partition.
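
The two delivery rules can be sketched as a toy partitioner. The `Partitioner` class is a hypothetical name, and CRC32 is used only as a convenient stand-in hash; it is not the hash function the Kafka client uses.

```python
import itertools
import zlib


class Partitioner:
    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._next = itertools.cycle(range(num_partitions))

    def partition(self, key):
        if key is None:
            return next(self._next)  # no key: take the next partition in turn
        # key present: the same bytes always give the same partition
        return zlib.crc32(key) % self.num_partitions


partitioner = Partitioner(num_partitions=3)
keyless = [partitioner.partition(None) for _ in range(4)]
keyed = [partitioner.partition(b"Santa") for _ in range(4)]
```

Here `keyless` walks through the partitions in turn, while every entry of `keyed` is the same partition number.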


Time for an example, with 2 consumer groups.
• Consumer group “Nerds”: multiple consumers
• Consumer group “Hairy”: single consumer
Case 1: No Key
[Figure: the producer sends M1, M2, etc. with no key, round robin, to partitions 1 to n of topic “Parties”. Group “Nerds” (Consumer 1 (Bill), Consumer 2 (Paul), … Consumer n) and group “Hairy” (Consumer 1 (Chewy)) are subscribed to “Parties”. Bill receives M1, Paul receives M2, and Chewy receives both.]

Each message (M1, M2, etc.) is sent to the next partition.
All consumers allocated to that partition will receive the message when they next poll.
Here’s what happens (the producer and topics are not shown, you have to imagine them):
1. Both groups subscribe to topic “parties” (11 partitions, so 1 consumer per partition).
2. Producer sends record <key=null, value=“Cool pool party - Invitation”> to the “parties” topic (no key).
3. Bill and Chewbacca receive a copy of the invitation and plan to attend.
4. Producer sends another record <key=null, value=“Cool pool party - Cancelled”> to the “parties” topic.
5. Paul and Chewbacca receive the cancellation.
Paul gets the message this time as it’s round robin, and ignores it as he didn’t get the invitation. Bill wastes his
time trying to go to a cancelled party. The rest of the gang aren’t surprised at not receiving any party invites and
stay at home to do some hacking. Chewy is the only consumer in his group so he gets all messages, and plans something
fun instead…
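
The no-key walkthrough can be replayed with a small simulation. It assumes that round robin starts at the first partition, that partitions are allocated to the Nerds in list order (Bill first, Paul second), and that the remaining nine Nerds are called “Nerd 3” to “Nerd 11”; those names and the `owner` helper are made up for this sketch.

```python
NUM_PARTITIONS = 11
nerds = ["Bill", "Paul"] + [f"Nerd {i}" for i in range(3, NUM_PARTITIONS + 1)]
hairy = ["Chewy"]


def owner(group, partition):
    """Consumer in `group` that is allocated `partition` (simple modulo spread)."""
    return group[partition % len(group)]


inbox = {name: [] for name in nerds + hairy}
records = ["Cool pool party - Invitation", "Cool pool party - Cancelled"]

for i, value in enumerate(records):
    partition = i % NUM_PARTITIONS  # key is null: next partition each time
    for group in (nerds, hairy):    # every group gets its own copy
        inbox[owner(group, partition)].append(value)
```

Under these assumptions Bill ends up with only the invitation, Paul with only the cancellation, and Chewy with both.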
Case 2: If there is a Key
[Figure: the producer sends M1, M2 and M3 with keys, hashed to partitions of topic “Parties”. In group “Nerds”, Consumer 1 (Bill) receives M1 and M2, and Consumer 2 (Paul) receives M3. In group “Hairy”, Consumer 1 (Chewy) receives M1, M2 and M3.]

A key is hashed to a partition, and a message with that key is always sent to that partition.
Assume there are 3 messages, and messages 1 and 2 are hashed to the same partition.
Here’s what happens with a key: the key is the “title” of the message (e.g. “Cool pool party”).
Same set up as before:
1. Both groups subscribe to topic “parties” (11 partitions).
2. Producer sends record <key=“Cool pool party”, value=“Invitation”> to the “parties” topic.
3. As before, Bill and Chewbacca receive a copy of the invitation and plan to attend.
4. Producer sends another record <key=“Cool pool party”, value=“Cancelled”> to the “parties” topic.
5. Bill and Chewbacca receive the cancellation (same consumers this time, as the key is identical).
6. Producer sends another record <key=“Horrible Halloween party”, value=“Invitation”> to the “parties” topic.
7. Paul and Chewy receive the invitation.
Paul receives the Halloween invitation as the key is different and the record is sent to the partition that Paul is
allocated to.
Chewy is the only consumer in his group so he gets every record no matter what partition it’s sent to.
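
The keyed walkthrough can be replayed in the same way. CRC32 stands in for the real key hash and the helper names are hypothetical, so which Nerd ends up with the pool party records depends on this toy hash and may not be Bill. What the sketch does show is that both pool party records reach the same Nerd, in order.

```python
import zlib

NUM_PARTITIONS = 11
nerds = ["Bill", "Paul"] + [f"Nerd {i}" for i in range(3, NUM_PARTITIONS + 1)]
hairy = ["Chewy"]


def partition_for(key):
    """Same key, same partition (CRC32 is a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def owner(group, partition):
    """Consumer in `group` that is allocated `partition` (simple modulo spread)."""
    return group[partition % len(group)]


records = [
    ("Cool pool party", "Invitation"),
    ("Cool pool party", "Cancelled"),
    ("Horrible Halloween party", "Invitation"),
]

inbox = {name: [] for name in nerds + hairy}
for key, value in records:
    partition = partition_for(key)
    for group in (nerds, hairy):  # every group gets its own copy
        inbox[owner(group, partition)].append((key, value))

# The Nerds who saw anything about the pool party: exactly one of them.
pool_party_readers = {
    name for name in nerds
    if any(key == "Cool pool party" for key, _ in inbox[name])
}
```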
Example Kafka Use Cases
Real-time data pipeline

Real-time data pipeline features:


• Ingestion of multiple heterogeneous sources
• Sending data to multiple heterogeneous sinks
• Acts as a buffer to smooth out load spikes
• Enables use cases which reprocess data (e.g. disaster recovery)
Anomaly Detection Pipeline

Real-time Event processing pipeline:


• Simple event driven applications (If X then Y…)
• May write and read from other data sources (e.g. Cassandra)
• New Events sent back to Kafka or to other systems
• E.g. Anomaly Detection, check out my current blog series if you are interested in this example.
Kafka Streams Processing (Kongo IoT Blog series)
Streams processing features:
• Complex streams processing (multiple events and streams)
• Time, windows, and transformations
• Uses Kafka Streams API, includes state store
• Visualization of the streams topology
• Continuously computes the loads for trucks and checks if they are overloaded.
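
The truck load check is a stateful computation over a stream of events. The sketch below is plain Python rather than the Kafka Streams API, and the event shape, the `check_loads` function and the 1000 kg limit are all invented for illustration. A dictionary plays the role of the state store.

```python
from collections import defaultdict

MAX_LOAD_KG = 1000  # hypothetical limit, the same for every truck


def check_loads(events, max_load=MAX_LOAD_KG):
    """Consume (truck_id, weight_change_kg) events in order and yield an
    alert each time a truck's running load exceeds the limit."""
    loads = defaultdict(int)  # running load per truck (the "state store")
    for truck_id, change in events:
        loads[truck_id] += change
        if loads[truck_id] > max_load:
            yield (truck_id, loads[truck_id])


events = [
    ("truck-1", 600),   # load goods
    ("truck-2", 900),
    ("truck-1", 500),   # truck-1 now carries 1100 kg
    ("truck-1", -300),  # unload: back to 800 kg
    ("truck-2", 200),   # truck-2 now carries 1100 kg
]
alerts = list(check_loads(events))
```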
LinkedIn - Before Kafka (BK)

A real example from LinkedIn, who developed Kafka.


Before Kafka they had spaghetti integration of monolithic applications.
To accommodate growing membership and increasing site complexity, they migrated from a monolithic
application infrastructure to one based on microservices, which made the integration even more complex!
After Kafka (AK)

Rather than maintaining and scaling each pipeline individually, they invested in the
development of a single, distributed pub-sub platform - Kafka was born.
The main benefit was better Service decoupling and independent scaling.
The End (of the introduction)
Find out more

Apache Kafka: https://kafka.apache.org/

Instaclustr blogs
• Mix of Cassandra, Spark, Zeppelin and Kafka
https://www.instaclustr.com/paul-brebner/
• Kafka introduction
https://insidebigdata.com/2018/04/12/developing-deeper-understanding-apache-kafka-architecture/
https://insidebigdata.com/2018/04/19/developing-deeper-understanding-apache-kafka-architecture-part-2-
• Kongo – Kafka IoT logistics application blog series
https://www.instaclustr.com/instaclustr-kongo-iot-logistics-streaming-demo-application/
• Anomaly detection with Kafka and Cassandra (and Kubernetes), current blog series
https://www.instaclustr.com/anomalia-machina-1-massively-scalable-anomaly-detection-with-apache-kafka-

Instaclustr’s Managed Kafka (Free trial)
https://www.instaclustr.com/solutions/managed-apache-kafka/
