Kafka
Kafka Architecture: Below we discuss the core components of Apache Kafka.
1) Topics: A stream of messages belonging to a particular category is called a topic. It is a
logical feed name to which records are published, similar to a table in a DB (records are
the messages here). A topic is uniquely identified by its name, and we can create as many
topics as we want.
Partitions:
Topics are split into partitions. All the messages within a partition are ordered and
immutable. Each message within a partition has a unique id associated with it, known as its offset.
Replica/Replication:
Replicas are backups of a partition. Clients never read from or write to replicas directly;
they are used to prevent data loss (fault tolerance).
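For example, a replicated topic could be created like this, a sketch assuming a 3-broker
cluster and the hypothetical topic name myReplicatedTopic (topic commands are covered
in detail later):
bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic myReplicatedTopic
--partitions 3 --replication-factor 3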
2) Producers:
Producers are applications which write/publish data to the topics within a cluster
using the Producer APIs.
Producers can write data either at the topic level (to all the partitions of that topic in a
round-robin manner) or to specific partitions of that topic.
3) Consumers:
Consumers are applications which read/consume data from the topics within a
cluster using the Consumer APIs.
Consumers can read data either at the topic level (from all the partitions of that topic) or
from specific partitions of that topic.
Consumers are always associated with exactly one consumer group, which is a
group of consumers that work together to consume a topic.
4) Brokers:
Brokers, also known as Kafka servers, are software processes that maintain and
manage the published messages.
Brokers also manage the consumer-offsets and are responsible for the delivery of
messages to the right consumers.
A set of brokers that communicate with each other to perform these
management and maintenance tasks is collectively known as a Kafka cluster.
We can add more brokers to an already running Kafka cluster without any downtime,
which ensures horizontal scalability.
5) Zookeeper:
Zookeeper is used to monitor the Kafka cluster and coordinate with each broker.
It keeps all the metadata related to the Kafka cluster in the form of key-value
pairs.
Metadata Includes:
Configuration Information
Health status of each broker.
It is used for the controller election within the Kafka cluster.
A set of Zookeeper nodes working together to manage other distributed systems is
known as a Zookeeper cluster or Zookeeper ensemble.
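A minimal ensemble configuration sketch, assuming three local Zookeeper nodes (ports
and data directories are illustrative; each node gets its own copy of zoo.cfg with its own
dataDir and clientPort):
# zoo.cfg for node 1 of a 3-node local ensemble
tickTime=2000
initLimit=5
syncLimit=2
dataDir=/tmp/zookeeper-1
clientPort=2181
server.1=localhost:2888:3888
server.2=localhost:2889:3889
server.3=localhost:2890:3890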
Kafka Features:
1) Scalable: Horizontal scaling is done by adding new brokers to the existing clusters.
2) Fault Tolerance: Kafka Clusters can handle failures because of its distributed nature.
3) Performance: Kafka has high throughput for both publishing and subscribing to
messages.
4) No Data Loss: It ensures no data loss if we configure it properly.
5) Zero Down Time: It ensures zero downtime when the required number of brokers is
present in the cluster.
Zookeeper:
a) Start the Zookeeper:
cd $ZOOKEEPER_HOME
bin/zkServer.sh start
b) Validate whether Zookeeper is running:
echo stat | nc localhost 2181
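Note: on newer Zookeeper versions the four-letter-word commands are disabled by
default; if stat (or dump, used below for Kafka) is refused, whitelist them in zoo.cfg:
4lw.commands.whitelist=stat, dump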
c) Stop the Zookeeper:
bin/zkServer.sh stop
Kafka:
a) Start the Kafka:
cd $KAFKA_HOME
bin/kafka-server-start.sh config/server.properties
b) Validate whether Kafka is running:
echo dump | nc localhost 2181 | grep brokers
c) Stop the Kafka:
bin/kafka-server-stop.sh
1) Create Topic:
bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic myTopic
--partitions 1 --replication-factor 1
2) List Topic:
bin/kafka-topics.sh --bootstrap-server localhost:9092 --list
3) Describe Topic:
bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic myTopic
4) To create a Producer:
bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic myTopic
5) To create a Consumer:
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic myTopic
--from-beginning
6) Let's verify how many Consumer Groups are now available:
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list
Push data from the producer and see the messages arrive in your user-defined consumer group.
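A minimal sketch of that flow, assuming the hypothetical group name myGroup (the
--group flag attaches the console consumer to a user-defined consumer group):
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic myTopic --group myGroup
bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic myTopic
Messages typed into the producer should show up in the consumer, and myGroup should
appear in the --list output above.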
Before starting multi-node Zookeeper & Kafka you need to create the tmp directories and
add the broker ids as mentioned in the config files (see the sketch after the commands below).
mkdir /tmp/zookeeper-1
mkdir /tmp/zookeeper-2
mkdir /tmp/zookeeper-3
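Each Zookeeper node also needs a myid file in its data directory, and each Kafka broker
needs its own properties file with a unique broker id; a sketch with illustrative ids, ports,
and file names:
echo 1 > /tmp/zookeeper-1/myid
echo 2 > /tmp/zookeeper-2/myid
echo 3 > /tmp/zookeeper-3/myid
# config/server-1.properties (repeat with unique values for each broker)
broker.id=1
listeners=PLAINTEXT://localhost:9092
log.dirs=/tmp/kafka-logs-1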
bin/kafka-server-start.sh config/server.properties
echo dump | nc localhost 2181 | grep brokers
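The describe command below assumes the topic already exists; a sketch of creating it
across the three brokers (partition and replication counts are illustrative):
bin/kafka-topics.sh --bootstrap-server localhost:9092,localhost:9093,localhost:9094
--create --topic myMultiTopic --partitions 3 --replication-factor 3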
Note: If you stop one broker, the in-sync replica (ISR) set will no longer match the full
replica set; it becomes a subset of the replicas, and the leader may change.
bin/kafka-topics.sh --bootstrap-server localhost:9092,localhost:9093,localhost:9094
--describe --topic myMultiTopic
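The per-partition output should look roughly like this (broker ids and assignments are
illustrative; after stopping broker 3, Isr would shrink to a subset such as 1,2):
Topic: myMultiTopic  Partition: 0  Leader: 1  Replicas: 1,2,3  Isr: 1,2,3
Topic: myMultiTopic  Partition: 1  Leader: 2  Replicas: 2,3,1  Isr: 2,3,1
Topic: myMultiTopic  Partition: 2  Leader: 3  Replicas: 3,1,2  Isr: 3,1,2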
Internals of Producer:
Offsets:
The records in the partitions are assigned a sequential id number called offset that
uniquely identifies each record within the partition.
1) Log-end offset: Offset of the last message written to a log/partition.
2) Current offset: Pointer to the last record that Kafka has already sent to a consumer in
the most recent poll.
3) Committed offset: Marking an offset as consumed is called committing an offset.
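These offsets can be observed per partition with kafka-consumer-groups.sh --describe
--group <group>; its output includes columns along these lines (values illustrative):
GROUP    TOPIC    PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
myGroup  myTopic  0          42              45              3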
Metadata, i.e. information about the data (such as the partition and offset of each record),
is returned to the producer when messages are sent to a topic.
Note: When the key is null, the producer always sends messages in a round-robin fashion
across the partitions. If the key has some value, the messages are sent to the partition
determined by that key (a hash of the key), so records with the same key land in the same partition.
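A sketch of sending keyed messages from the console producer; the separator character
is a choice, and input is then typed as key:value:
bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic myTopic
--property parse.key=true --property key.separator=: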
Kafka Consumer Group:
A consumer group is a logical entity in the Kafka ecosystem that provides parallel
processing and scalable message consumption to consumer clients.
1) Each consumer must be associated with some consumer group.
2) Kafka makes sure there is no duplicate consumption among consumers that are part
of the same consumer group: each partition is assigned to exactly one consumer in the group.
Consumer Group Rebalancing:
The process of re-distributing partitions to the consumers within a consumer group is
known as Consumer Group Rebalancing.
Create Producer:
bin/kafka-console-producer.sh --bootstrap-server
localhost:9092,localhost:9093,localhost:9094 --topic myConsumerGroupRebalancing
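Before describing the group, start the first consumer in it (the same command is used
again below when adding more consumers; here the group name matches the topic name):
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic
myConsumerGroupRebalancing --group myConsumerGroupRebalancing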
Describe:
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group
myConsumerGroupRebalancing
You will see that all 3 partitions are consumed by the same consumer; check the
CONSUMER-ID column.
Create One more Consumer in the same Group & check Rebalancing:
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic
myConsumerGroupRebalancing --group myConsumerGroupRebalancing
Describe again and see the partition rebalancing.
Try: Now send messages from one producer and observe their distribution in a round-robin
fashion to the consumers in the group.
Try: Add one more consumer to the group and see which one sits idle, since we have 3
partitions but 4 consumers in the same consumer group.
Try: Kill one consumer and send data from the producer again; the data will be
redistributed in a round-robin fashion among the remaining consumers.
Delete Topic:
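A sketch of the standard deletion command (assuming delete.topic.enable has not been
set to false on the brokers):
bin/kafka-topics.sh --bootstrap-server localhost:9092 --delete --topic myTopic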