A Visual Introduction To Apache Kafka PDF
Paul Brebner
Technical Evangelist, February 2019
What is Kafka?
A distributed streams processing system
Messages are sent by distributed producers to distributed consumers via a distributed Kafka cluster
Kafka benefits
Fast – high throughput and low latency
Scalable – horizontally scalable with nodes and partitions
Reliable – distributed and fault tolerant
Durable – zero data loss, messages persisted to disk in an immutable log
Open Source – an Apache project
Available as a Managed Service – on multiple cloud platforms
Message flow
To send a message from A to B:
“A” is the Producer, which sends the message to
“B”, the Consumer (recipient) of the message
Due to the decline in “snail mail” direct deliveries
Instead … “Poste Restante”
“Poste Restante”?
• Partitions increase consumer concurrency
• Increase throughput
• Reduce latency
What’s a Kafka message?
Example: a letter to Santa
Topic: “North Pole”
The “postmark”: timestamp, offset, partition
or processing.
Key (optional)
• Consumer can read value
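The parts of a message listed above can be pictured as a small record type. This is only a sketch: the class name, field names and the sample postmark values are assumptions made for illustration, not Kafka's client API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Record:
    """Toy model of a Kafka message; all names here are illustrative."""
    topic: str                       # where the message is sent, e.g. "North Pole"
    value: bytes                     # the payload the consumer reads
    key: Optional[bytes] = None      # optional
    # The "postmark": known once the message has been written to a partition.
    timestamp_ms: Optional[int] = None
    partition: Optional[int] = None
    offset: Optional[int] = None


# A letter before it has been posted: no postmark yet.
letter = Record(topic="North Pole", value=b"Dear Santa ...")

# The same letter with hypothetical postmark values filled in.
stamped = Record(letter.topic, letter.value, letter.key,
                 timestamp_ms=1_550_000_000_000, partition=0, offset=42)
```

The record is frozen to mirror the idea that a stored message is not edited afterwards.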
Do we care if the message arrives?
Yes! Guaranteed delivery is desirable
But homing pigeons got lost or eaten
Send multiple pigeons: one pigeon may make it
How does Kafka guarantee delivery?
[Diagram: the Producer sends message M1 to the Kafka cluster; copies of M1 are held on Broker 1, Broker 2 and Broker 3; a Consumer then reads M1.]
This makes it resilient to loss of power.
Acknowledgement: the Producer gets an acknowledgement once the message is persisted and replicated (configurable)
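The "configurable" acknowledgement can be illustrated with a toy model. The Kafka producer has an `acks` setting whose values are 0, 1 and all; everything else below (the `Broker` class, the `send` function, the in-memory list standing in for the on-disk log) is a hypothetical simplification, and real brokers replicate asynchronously to in-sync replicas.

```python
class Broker:
    """Toy broker: a list stands in for the persisted, append-only log."""

    def __init__(self, broker_id):
        self.broker_id = broker_id
        self.log = []

    def append(self, message):
        self.log.append(message)


def send(message, leader, followers, acks="all"):
    """Store the message and return how many copies existed when the
    producer would have been acknowledged."""
    if acks not in (0, 1, "all"):
        raise ValueError("acks must be 0, 1 or 'all'")
    copies_at_ack = 0                   # acks=0: acknowledged before any copy exists
    leader.append(message)
    if acks == 1:
        copies_at_ack = 1               # only the leader has the message so far
    for follower in followers:
        follower.append(message)        # replication to the other brokers
    if acks == "all":
        copies_at_ack = 1 + len(followers)
    return copies_at_ack


b1, b2, b3 = Broker(1), Broker(2), Broker(3)
copies_all = send("M1", b1, [b2, b3], acks="all")
copies_one = send("M2", b1, [b2, b3], acks=1)
```

With `acks="all"` the acknowledgement is only issued after all three brokers hold M1, which is the case shown in the slides; the lower settings trade that safety for lower latency.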
The 2nd aspect of delivery semantics:
Who gets the messages?
How many times are messages delivered?
Delivery Semantics - Kafka is “pub-sub”
- Loosely coupled
- Producers and consumers don’t know about each other
Which consumers get which messages (filtering) is topic based
[Diagram: one Producer, two topics (“Parties” and “Work”), and several Consumers.]
Consumers subscribe to topic “Parties”
Publishers send messages to topics
Consumers only receive messages from subscribed topics
Consumers poll to receive messages from “Parties”
Consumers not subscribed to “Work” don’t receive any “Work” messages
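The subscribe, send and poll steps above can be sketched in a few lines. This is an in-memory toy, not the Kafka client: `MiniBroker`, `MiniConsumer` and the message texts are invented for illustration.

```python
from collections import defaultdict


class MiniBroker:
    """Keeps one list of messages per topic."""

    def __init__(self):
        self.topics = defaultdict(list)

    def send(self, topic, message):
        self.topics[topic].append(message)


class MiniConsumer:
    """Pulls messages from the topics it has subscribed to."""

    def __init__(self, broker):
        self.broker = broker
        self.positions = {}     # subscribed topic -> index of next unread message

    def subscribe(self, topic):
        self.positions.setdefault(topic, 0)

    def poll(self):
        received = []
        for topic, position in self.positions.items():
            log = self.broker.topics[topic]
            received.extend(log[position:])     # everything not yet read
            self.positions[topic] = len(log)    # remember how far we got
        return received


broker = MiniBroker()
party_goer = MiniConsumer(broker)
party_goer.subscribe("Parties")

broker.send("Parties", "Pool party on Saturday")
broker.send("Work", "Status meeting on Monday")

first_poll = party_goer.poll()    # only the "Parties" message
second_poll = party_goer.poll()   # nothing new since the last poll
```

The producer never refers to a consumer and the consumer never refers to a producer; the topic name is the only thing they share, which is the loose coupling described earlier.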
Partitions and Consumer Groups
Enable sharing of work across consumers
Duplicate message delivery: each message is delivered to each subscribed consumer group
Consumers subscribed to a topic are allocated partitions.
They will only get messages from their allocated partitions.
[Diagram: the Producer writes to topic “Parties”, which has Partitions 1 to n; the partitions are allocated to the Consumers in two Consumer Groups.]
Consumers in the same group share the work
Each consumer gets only a subset of messages
Messages are duplicated across consumer groups
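The two rules above (share within a group, duplicate across groups) can be sketched as follows. The allocation shown simply deals partitions out in turn; Kafka's own assignment strategies are more involved, and the group and consumer names are placeholders.

```python
def assign_partitions(partitions, consumers):
    """Deal the partitions out to the consumers of one group, in turn."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


def receivers(partition, num_partitions, groups):
    """For a message in the given partition, return the one consumer
    in each group that is allocated that partition."""
    result = {}
    for group_name, consumers in groups.items():
        assignment = assign_partitions(list(range(num_partitions)), consumers)
        for consumer, owned in assignment.items():
            if partition in owned:
                result[group_name] = consumer
    return result


# Hypothetical setup: a 3-partition topic and two consumer groups.
groups = {"group A": ["A1", "A2"], "group B": ["B1"]}
from_partition_0 = receivers(0, 3, groups)
from_partition_1 = receivers(1, 3, groups)
```

Within group A the two partitions go to different consumers, so each sees only a subset; group B has a single consumer, so it sees every message.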
Which messages are delivered to which consumers? It depends on the key.
No key: round robin
[Diagram: the Producer sends M1 to Partition 1 and M2 to Partition 2 of topic “Parties” (Partitions 1 to n). In the first group, Consumer 1 (Bill) gets M1 and Consumer 2 (Paul) gets M2. In group “Hairy”, Consumer 1 (Chewy) gets M1 and M2.]
Key: hashed to partition
[Diagram: M1 and M2 go to Partition 1 and M3 goes to Partition 2. In the first group, Consumer 1 (Bill) gets M1 and M2 and Consumer 2 (Paul) gets M3. In group “Hairy”, Consumer 1 (Chewy) gets M1, M2 and M3.]
A key is hashed to a partition, and a message with that key is always sent to that partition.
Assume there are 3 messages, and messages 1 and 2 are hashed to the same partition.
Here’s what happens with a key: the key is the “title” of the message (e.g. “Cool pool party”)
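The two cases (no key, key) can be sketched as a toy partitioner. Note the assumptions: `ToyPartitioner` is an invented class, and `zlib.crc32` is a stand-in hash chosen because it is in the standard library. Kafka's default partitioner hashes the key bytes with murmur2, so the actual partition numbers will differ, and newer Kafka clients may group keyless records into batches rather than strictly alternating.

```python
import itertools
import zlib


class ToyPartitioner:
    """Picks a partition: round robin without a key, hash with a key."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._counter = itertools.count()

    def partition(self, key=None):
        if key is None:
            # No key: take the partitions in turn.
            return next(self._counter) % self.num_partitions
        # Key: the same key always hashes to the same partition.
        # crc32 is a stand-in for the hash Kafka really uses.
        return zlib.crc32(key.encode("utf-8")) % self.num_partitions


partitioner = ToyPartitioner(num_partitions=11)
keyless = [partitioner.partition() for _ in range(3)]
first = partitioner.partition("Cool pool party")
second = partitioner.partition("Cool pool party")
```

Here `first` and `second` are equal because the key is identical, while the three keyless sends land on three consecutive partitions.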
Same setup as before:
1. Both groups subscribe to topic “parties” (11 partitions).
2. Producer sends record <key=“Cool pool party”, value=“Invitation”> to “parties” topic.
3. As before, Bill and Chewbacca receive a copy of the invitation and plan to attend.
4. Producer sends another record <key=“Cool pool party”, value=“Cancelled”> to “parties” topic.
5. Bill and Chewbacca receive the cancellation (same consumers this time, as the key is identical).
6. Producer sends another record <key=“Horrible Halloween party”, value=“Invitation”> to “parties” topic.
7. Paul and Chewy receive the invitation.
Paul receives the Halloween invitation as the key is different and the record is sent to the partition that Paul is allocated to.
Chewy is the only consumer in his group, so he gets every record no matter what partition it’s sent to.
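The walkthrough above can be replayed with a small simulation. Several things here are assumptions: the first group is reduced to Bill and Paul and given a placeholder name, partitions are dealt to consumers in turn, and `zlib.crc32` stands in for Kafka's real key hash. Because of that stand-in, which of Bill or Paul ends up with a given key may not match the slides; what does carry over is that an identical key reaches the same consumer and that Chewy receives everything.

```python
import zlib

NUM_PARTITIONS = 11
# "first group" is a placeholder name; "Hairy" is the group named in the slides.
GROUPS = {"first group": ["Bill", "Paul"], "Hairy": ["Chewy"]}


def partition_for(key):
    """Hash the key to a partition (crc32 is a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def owner(partition, consumers):
    """The consumer in a group that is allocated this partition."""
    return consumers[partition % len(consumers)]


def receivers(key):
    """One receiving consumer per group for a record with this key."""
    partition = partition_for(key)
    return {group: owner(partition, consumers)
            for group, consumers in GROUPS.items()}


invitation = receivers("Cool pool party")
cancellation = receivers("Cool pool party")
halloween = receivers("Horrible Halloween party")
```

Comparing `invitation` with `cancellation` shows the same consumers receiving both records, and every result maps group “Hairy” to Chewy.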
Example Kafka Use Cases
Real-time data pipeline
Rather than maintaining and scaling each pipeline individually, LinkedIn invested in the
development of a single, distributed pub-sub platform, and Kafka was born.
The main benefit was better service decoupling and independent scaling.
The End (of the introduction)