Apache Kafka
Table of Contents

About

Chapter 1: Getting started with apache-kafka
    Remarks
    Examples
        Installation or Setup
        Introduction
            Which means
        Installation
        Create a topic
        Stop kafka
        Clean-up

Chapter 2: Consumer Groups and Offset Management
    Parameters
    Examples
        Processing guarantees

Chapter 3: Custom Serializer/Deserializer
    Introduction
    Syntax
    Parameters
    Remarks
    Examples
        Gson (de)serializer
            Serializer
                Code
                Usage
            Deserializer
                Code
                Usage

Chapter 4: kafka console tools
    Introduction
    Examples
        kafka-topics
        kafka-console-producer
        kafka-console-consumer
        kafka-simple-consumer-shell
        kafka-consumer-groups

Chapter 5: Producer/Consumer in Java
    Introduction
    Examples
        Basic poll
        The code
            Basic example
            Runnable example
        Sending messages
        The code

Credits
About
You can share this PDF with anyone you feel could benefit from it; download the latest version from: apache-kafka

It is an unofficial and free apache-kafka ebook created for educational purposes. All the content is extracted from Stack Overflow Documentation, which is written by many hardworking individuals at Stack Overflow. It is neither affiliated with Stack Overflow, nor is it official apache-kafka documentation. The content is released under Creative Commons BY-SA, and the list of contributors to each chapter is provided in the credits section at the end of this book. Images may be copyright of their respective owners unless otherwise specified. All trademarks and registered trademarks are the property of their respective company owners.

Use the content presented in this book at your own risk; it is not guaranteed to be correct or accurate. Please send your feedback and corrections to [email protected]
Chapter 1: Getting started with apache-kafka
Remarks
Kafka is a high-throughput publish-subscribe messaging system implemented as a distributed, partitioned, replicated commit log service.
Fast
A single Kafka broker can handle hundreds of megabytes of reads and writes per
second from thousands of clients.
Scalable
Kafka is designed to allow a single cluster to serve as the central data backbone for a
large organization. It can be elastically and transparently expanded without downtime.
Data streams are partitioned and spread over a cluster of machines to allow data streams larger than the capability of any single machine and to allow clusters of coordinated consumers.
Durable
Messages are persisted on disk and replicated within the cluster to prevent data loss.
Each broker can handle terabytes of messages without performance impact.
Distributed by Design
Kafka has a modern cluster-centric design that offers strong durability and fault-
tolerance guarantees.
Examples
Installation or Setup
On Linux:
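For example, assuming the kafka_2.11-0.10.0.0 release archive (used in the steps below) has already been downloaded from an Apache mirror:

tar -xzf kafka_2.11-0.10.0.0.tgz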
On Windows: Right click --> Extract here
cd kafka_2.11-0.10.0.0
Start ZooKeeper.
Linux:
bin/zookeeper-server-start.sh config/zookeeper.properties
Windows:
bin/windows/zookeeper-server-start.bat config/zookeeper.properties
Then start the Kafka broker.
Linux:
bin/kafka-server-start.sh config/server.properties
Windows:
bin/windows/kafka-server-start.bat config/server.properties
Introduction
Which means

1. It lets you publish and subscribe to streams of records. In this respect it is similar to a message queue or enterprise messaging system.
2. It lets you build real-time streaming applications that transform or react to streams of data.
Kafka console scripts are different for Unix-based and Windows platforms. In the examples, you might need to add the extension according to your platform. Linux: scripts are located in bin/ and have a .sh extension. Windows: scripts are located in bin\windows\ and have a .bat extension.
Installation
Step 1: Download the code and untar it:
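For example, assuming the 0.10.0.0 release archive from an Apache mirror:

tar -xzf kafka_2.11-0.10.0.0.tgz
cd kafka_2.11-0.10.0.0

(The commands below assume the scripts in bin/ are on your PATH, e.g., via a package-manager install; otherwise prefix them with bin/ and add the .sh extension.)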
Kafka relies heavily on zookeeper, so you need to start it first. If you don't have it installed, you
can use the convenience script packaged with kafka to get a quick-and-dirty single-node
ZooKeeper instance.
zookeeper-server-start config/zookeeper.properties
kafka-server-start config/server.properties
You should now have ZooKeeper listening on localhost:2181 and a single Kafka broker on localhost:9092 (the default port in config/server.properties).
Create a topic

We only have one broker, so we create a topic with no replication (replication factor 1) and just one partition:
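A plausible invocation for this local single-broker setup (the topic name test is an assumption):

kafka-topics --zookeeper localhost:2181 --create --topic test --partitions 1 --replication-factor 1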
Launch a consumer:
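For example (topic name assumed; newer Kafka versions take --bootstrap-server localhost:9092 instead of --zookeeper):

kafka-console-consumer --zookeeper localhost:2181 --topic test --from-beginning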
On another terminal, launch a producer and send some messages. By default, the tool sends each line as a separate message to the broker, without special encoding. Write some lines and exit with CTRL+D or CTRL+C:
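For example, again assuming the test topic:

kafka-console-producer --broker-list localhost:9092 --topic test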
Stop kafka
kafka-server-stop
Step 1: To avoid collisions, we create a server.properties file for each broker and change the id, port and log directory configuration properties.
Copy:
cp config/server.properties config/server-1.properties
cp config/server.properties config/server-2.properties
vim config/server-1.properties
broker.id=1
listeners=PLAINTEXT://:9093
log.dirs=/usr/local/var/lib/kafka-logs-1
vim config/server-2.properties
broker.id=2
listeners=PLAINTEXT://:9094
log.dirs=/usr/local/var/lib/kafka-logs-2
kafka-server-start config/server-1.properties &
kafka-server-start config/server-2.properties &
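To see the fields described below, create a topic that is replicated across the three brokers and describe it; a sketch (topic name assumed):

kafka-topics --zookeeper localhost:2181 --create --topic replicated-topic --partitions 1 --replication-factor 3
kafka-topics --zookeeper localhost:2181 --describe --topic replicated-topic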
• "leader" is the node responsible for all reads and writes for the given partition. Each node will
be the leader for a randomly selected portion of the partitions.
• "replicas" is the list of nodes that replicate the log for this partition regardless of whether they
are the leader or even if they are currently alive.
• "isr" is the set of "in-sync" replicas. This is the subset of the replicas list that is currently alive
and caught-up to the leader.
Now kill the broker that is the leader for the partition: on Linux, find its PID and kill it; on Windows, end the process via the Task Manager or taskkill. Describe the topic again: the leadership has switched to broker 2 and "1" is not in-sync anymore. But the messages are still there (use the consumer to check for yourself).
Clean-up
Delete the two topics using:
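A sketch, assuming the topic names used above (topic deletion must be enabled with delete.topic.enable=true on the brokers):

kafka-topics --zookeeper localhost:2181 --delete --topic test
kafka-topics --zookeeper localhost:2181 --delete --topic replicated-topic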
Chapter 2: Consumer Groups and Offset
Management
Parameters
Examples
What is a Consumer Group
As of Kafka 0.9, the new high-level KafkaConsumer client is available. It exploits a new built-in Kafka protocol that allows combining multiple consumers in a so-called Consumer Group. A Consumer Group can be described as a single logical consumer that subscribes to a set of topics. The partitions over all topics are assigned to the physical consumers within the group, such that each partition is assigned to exactly one consumer (a single consumer can get multiple partitions assigned). The individual consumers belonging to the same group can run on different hosts in a distributed manner.

Consumer Groups are identified via their group.id. To make a specific client instance a member of a Consumer Group, it is sufficient to assign the group's group.id to this client, via the client's configuration:
Properties props = new Properties();
props.put("group.id", "groupName");
// ...some more properties required
new KafkaConsumer<K, V>(props);
Thus, all consumers that connect to the same Kafka cluster and use the same group.id form a Consumer Group. Consumers can leave a group at any time and new consumers can join a group at any time. In both cases, a so-called rebalance is triggered and partitions are reassigned within the Consumer Group to ensure that each partition is processed by exactly one consumer within the group.

Note that even a single KafkaConsumer forms a Consumer Group, with itself as the single member.
KafkaConsumers request messages from a Kafka broker via a call to poll() and their progress is tracked via offsets. Each message within each partition of each topic has a so-called offset assigned: its logical sequence number within the partition. A KafkaConsumer tracks its current offset for each partition that is assigned to it. Note that the Kafka brokers are not aware of the current offsets of the consumers. Thus, on poll() the consumer needs to send its current offsets to the broker, such that the broker can return the corresponding messages, i.e., messages with a larger consecutive offset. For example, let us assume we have a single-partition topic and a single consumer with current offset 5. On poll() the consumer sends its offset to the broker and the broker returns messages for offsets 6, 7, 8, ...
Because consumers track their offsets themselves, this information could get lost if a consumer fails. Thus, offsets must be stored reliably, such that on restart a consumer can pick up its old offset and resume where it left off. In Kafka, there is built-in support for this via offset commits. The new KafkaConsumer can commit its current offset to Kafka, and Kafka stores those offsets in a special topic called __consumer_offsets. Storing the offsets within a Kafka topic is not only fault-tolerant, but also allows reassigning partitions to other consumers during a rebalance. Because all consumers of a Consumer Group can access all committed offsets of all partitions, on rebalance a consumer that gets a new partition assigned just reads the committed offset of this partition from the __consumer_offsets topic and resumes where the old consumer left off.
As an alternative to auto commit, offsets can also be managed manually. For this, auto commit should be disabled (enable.auto.commit = false). For manual committing, KafkaConsumer offers two methods, namely commitSync() and commitAsync(). As the name indicates, commitSync() is a blocking call that only returns after the offsets have been committed successfully, while commitAsync() returns immediately. If you want to know whether a commit was successful or not, you can provide a callback handler (OffsetCommitCallback) as a method parameter. Note that in both commit calls, the consumer commits the offsets of the latest poll() call. For example, let us assume a single-partition topic with a single consumer and that the last call to poll() returned messages with offsets 4, 5, 6. On commit, offset 6 will be committed because this is the latest offset tracked by the consumer client. At the same time, both commitSync() and commitAsync() allow for more control over which offsets you want to commit: if you use the corresponding overloads that take a Map<TopicPartition, OffsetAndMetadata>, the consumer will commit only the specified offsets (i.e., the map can contain any subset of assigned partitions, and the specified offset can have any value).

Note that by design it is also possible to commit a smaller offset than the last committed offset. This can be done if messages should be read a second time.
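A minimal sketch of a manual-commit loop, assuming a consumer that is already subscribed and configured with enable.auto.commit = false (process() stands in for your own logic):

while (isRunning) {
    ConsumerRecords<String, String> records = consumer.poll(100);
    for (ConsumerRecord<String, String> record : records) {
        process(record); // placeholder for the actual message processing
    }
    // blocking commit of the offsets of the latest poll()
    consumer.commitSync();

    // non-blocking alternative with a callback to learn about success or failure:
    // consumer.commitAsync((offsets, exception) -> {
    //     if (exception != null) {
    //         // the commit failed; handle or log it
    //     }
    // });
}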
Processing guarantees
Using auto commit provides at-least-once processing semantics. The underlying assumption is that poll() is only called after all previously delivered messages have been processed successfully. This ensures that no messages get lost, because a commit happens after processing. If a consumer fails before a commit, all messages after the last commit are received from Kafka and processed again. However, this retry might result in duplicates, as some messages from the last poll() call might have been processed but the failure happened right before the auto commit call.

If at-most-once processing semantics are required, auto commit must be disabled and a manual commitSync() directly after poll() should be done. Afterward, messages get processed. This ensures that messages are committed before they are processed and thus never read a second time. Of course, some messages might get lost in case of failure.
There are multiple strategies to read a topic from its beginning. To explain them, we first need to understand what happens at consumer startup. On startup of a consumer, the following happens:

1. join the configured consumer group, which triggers a rebalance and assigns partitions to the consumer
2. look for committed offsets (for all partitions that got assigned to the consumer)
3. for all partitions with a valid committed offset, resume from this offset
4. for all partitions without a valid committed offset, set the start offset according to the auto.offset.reset configuration parameter
Start a new Consumer Group
If you want to process a topic from its beginning, you can simply start a new consumer group (i.e., choose an unused group.id) and set auto.offset.reset = earliest. Because there are no committed offsets for a new group, auto offset reset will trigger and the topic will be consumed from its beginning. Note that on consumer restart, if you use the same group.id again, it will not read the topic from the beginning again, but resume where it left off. Thus, for this strategy, you will need to assign a new group.id each time you want to read a topic from its beginning.
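A sketch of the corresponding configuration for this strategy; the group id just needs to be a name that has never been used before:

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "new-group-" + System.currentTimeMillis()); // fresh, unused group id
props.put("auto.offset.reset", "earliest");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);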
A second strategy is to keep the group.id but never commit any offsets: disable auto commit (and do not commit manually) and instead seek to the beginning of each assigned partition on startup, as the snippet below does with seekToBeginning(). This approach has two drawbacks:

1. it is not fault-tolerant
2. group rebalancing does not work as intended

(1) Because offsets are never committed, a failing and a stopped consumer are handled the same way on restart: in both cases, the topic will be consumed from its beginning. (2) Because offsets are never committed, on rebalance newly assigned partitions will be consumed from the very beginning.

Therefore, this strategy only works for consumer groups with a single consumer and should only be used for development purposes.
if (consumer-stop-and-restart-from-beginning) {
consumer.poll(0); // dummy poll() to join consumer group
consumer.seekToBeginning(...);
}
// now you can start your poll() loop
while (isRunning) {
for (ConsumerRecord record : consumer.poll(0)) {
// process a record
}
}
Chapter 3: Custom Serializer/Deserializer
Introduction
Kafka stores and transports byte arrays in its queue. The (de)serializers are responsible for
translating between the byte array provided by Kafka and POJOs.
Syntax
• public void configure(Map<String, ?> config, boolean isKey);
• public T deserialize(String topic, byte[] bytes);
• public byte[] serialize(String topic, T obj);
Parameters
isKey: custom (de)serializers can be used for keys and/or values. This parameter tells you which of the two this instance will deal with.

topic: the topic of the current message. This lets you define custom logic based on the source/destination topic.

obj: the message to serialize. Its actual class depends on your serializer.
Remarks
Before version 0.9.0.0 the Kafka Java API used Encoders and Decoders. They have been replaced
by Serializer and Deserializer in the new API.
Examples
Gson (de)serializer
This example uses the gson library to map Java objects to JSON strings. The (de)serializers are generic, but they don't always need to be!
Serializer
Code
import com.google.gson.Gson;
import java.util.Map;
import org.apache.kafka.common.serialization.Serializer;

// the class declaration and gson field are assumed here; the class name is illustrative
public class GsonSerializer<T> implements Serializer<T> {

    private final Gson gson = new Gson();

    @Override
    public void configure(Map<String, ?> config, boolean isKey) {
        // this is called right after construction
        // use it for initialisation
    }

    @Override
    public byte[] serialize(String topic, T t) {
        return gson.toJson(t).getBytes();
    }

    @Override
    public void close() {
        // this is called right before destruction
    }
}
Usage
Serializers are defined through the required key.serializer and value.serializer producer
properties.
Assume we have a POJO class named SensorValue and that we want to produce messages
without any key (keys set to null):
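A sketch of such a producer setup, assuming the GsonSerializer class shown above and a topic named sensor-data:

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", GsonSerializer.class.getName());

KafkaProducer<String, SensorValue> producer = new KafkaProducer<>(props);
// assumes SensorValue has a no-arg constructor; the topic name is illustrative
producer.send(new ProducerRecord<>("sensor-data", new SensorValue()));
producer.close();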
(key.serializer is a required configuration. Since we don't specify message keys, we keep the
StringSerializer shipped with kafka, which is able to handle null).
Deserializer
Code
import com.google.gson.Gson;
import java.util.Map;
import org.apache.kafka.common.serialization.Deserializer;

// the class declaration, fields and config-key constants are assumed here;
// the constant values are illustrative
public class GsonDeserializer<T> implements Deserializer<T> {

    public static final String CONFIG_KEY_CLASS = "key.class";
    public static final String CONFIG_VALUE_CLASS = "value.class";

    private final Gson gson = new Gson();
    private Class<T> cls;

    @Override
    public void configure(Map<String, ?> config, boolean isKey) {
        String configKey = isKey ? CONFIG_KEY_CLASS : CONFIG_VALUE_CLASS;
        String clsName = String.valueOf(config.get(configKey));
        try {
            cls = (Class<T>) Class.forName(clsName);
        } catch (ClassNotFoundException e) {
            System.err.printf("Failed to configure GsonDeserializer. " +
                    "Did you forget to specify the '%s' property ?%n", configKey);
        }
    }

    @Override
    public T deserialize(String topic, byte[] bytes) {
        return (T) gson.fromJson(new String(bytes), cls);
    }

    @Override
    public void close() {}
}
Usage
Deserializers are defined through the required key.deserializer and value.deserializer consumer
properties.
Assume we have a POJO class named SensorValue and that we want to consume messages without any key (keys set to null):
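A sketch of such a consumer setup, using the GsonDeserializer shown above (group id and topic name are assumptions):

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "sensor-consumers");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", GsonDeserializer.class.getName());
// custom property read by GsonDeserializer.configure() to know which POJO class to build
props.put(GsonDeserializer.CONFIG_VALUE_CLASS, SensorValue.class.getName());

KafkaConsumer<String, SensorValue> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList("sensor-data"));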
Here, we add a custom property to the consumer configuration, namely CONFIG_VALUE_CLASS. The
GsonDeserializer will use it in the configure() method to determine what POJO class it should
handle (all properties added to props will be passed to the configure method in the form of a map).
Chapter 4: kafka console tools
Introduction
Kafka offers command-line tools to manage topics and consumer groups, to consume and publish messages, and so forth.
Important: Kafka console scripts are different for Unix-based and Windows platforms. In the
examples, you might need to add the extension according to your platform.
Examples
kafka-topics
This tool lets you list, create, alter and describe topics.
List topics:
Create a topic:
Describe a topic:
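Plausible invocations for a local setup, covering the three operations listed above (the topic name test is an assumption):

# list topics
kafka-topics --zookeeper localhost:2181 --list

# create a topic
kafka-topics --zookeeper localhost:2181 --create --topic test --partitions 1 --replication-factor 1

# describe a topic
kafka-topics --zookeeper localhost:2181 --describe --topic test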
Alter a topic:
# change configuration
kafka-topics --zookeeper localhost:2181 --alter --topic test --config max.message.bytes=128000

# add a partition
kafka-topics --zookeeper localhost:2181 --alter --topic test --partitions 2
(Beware: Kafka does not support reducing the number of partitions of a topic) (see this list of
configuration properties)
kafka-console-producer
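This tool lets you publish messages from the command line; a typical invocation (topic name assumed):

kafka-console-producer --broker-list localhost:9092 --topic test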
kafka-console-consumer
In order to see older messages, you can use the --from-beginning option.
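A typical invocation (topic name assumed; older versions take --zookeeper localhost:2181 instead of --bootstrap-server):

kafka-console-consumer --bootstrap-server localhost:9092 --topic test
# add --from-beginning to also see older messages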
kafka-simple-consumer-shell
This consumer is a low-level tool which allows you to consume messages from specific partitions,
offsets and replicas.
Useful parameters:

• partition: the specific partition to consume from (defaults to all)
• offset: the beginning offset. Use -2 to consume messages from the beginning, -1 to consume from the end.
• max-messages: the number of messages to print
• replica: the replica to consume from; defaults to the broker leader (-1)
Example:
kafka-simple-consumer-shell \
--broker-list localhost:9092 \
--partition 1 \
--offset 4 \
--max-messages 3 \
--topic test-topic
kafka-consumer-groups
This tool allows you to list, describe, or delete consumer groups. Have a look at this article for
more information about consumer groups.
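A sketch of listing all groups known to the cluster:

kafka-consumer-groups --bootstrap-server localhost:9092 --list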
if you still use the old consumer implementation, replace --bootstrap-server with --
zookeeper.
Describe a consumer-group:
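A sketch (the group name octopus is borrowed from the Java example later in this book):

kafka-consumer-groups --bootstrap-server localhost:9092 --describe --group octopus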
Delete a consumer-group:
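A sketch; as the note below explains, deletion goes through ZooKeeper with the old consumer API:

kafka-consumer-groups --zookeeper localhost:2181 --delete --group octopus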
deletion is only available when the group metadata is stored in zookeeper (old
consumer api). With the new consumer API, the broker handles everything including
metadata deletion: the group is deleted automatically when the last committed offset
for the group expires.
Chapter 5: Producer/Consumer in Java
Introduction
This topic shows how to produce and consume records in Java.
Examples
SimpleConsumer (Kafka >= 0.9.0)
The 0.9 release of Kafka introduced a complete redesign of the kafka consumer. If you are
interested in the old SimpleConsumer (0.8.X), have a look at this page. If your Kafka installation is
newer than 0.8.X, the following codes should work out of the box.
First, create a maven project and add the following dependency in your pom:
<dependencies>
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-clients</artifactId>
<version>0.9.0.1</version>
</dependency>
</dependencies>
Note: don't forget to update the version field for the latest releases (now > 0.10).
The consumer is initialised using a Properties object. There are lots of properties allowing you to
fine-tune the consumer behaviour. Below is the minimal configuration needed:
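A sketch of such a minimal configuration, matching the properties discussed below (the broker address and group name are assumptions):

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("group.id", "octopus");

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);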
The bootstrap-servers is an initial list of brokers for the consumer to be able to discover the rest of the cluster. This doesn't need to be all the servers in the cluster: the client will determine the full set of alive brokers from the brokers in this list.
The deserializer tells the consumer how to interpret/deserialize the message keys and values.
Here, we use the built-in StringDeserializer.
Finally, the group.id corresponds to the consumer group of this client. Remember: all consumers
of a consumer group will split messages between them (kafka acting like a message queue), while
consumers from different consumer groups will get the same messages (kafka acting like a
publish-subscribe system).
• session.timeout.ms: a session timeout ensures that the lock will be released if the consumer
crashes or if a network partition isolates the consumer from the coordinator. Indeed:
After you have subscribed, the consumer can coordinate with the rest of the group to get its
partition assignment. This is all handled automatically when you begin consuming data.
Basic poll
The consumer needs to be able to fetch data in parallel, potentially from many partitions for many
topics likely spread across many brokers. Fortunately, this is all handled automatically when you
begin consuming data. To do that, all you need to do is call poll in a loop and the consumer
handles the rest.
poll returns a (possibly empty) set of messages from the partitions that were assigned.
while( true ){
ConsumerRecords<String, String> records = consumer.poll( 100 );
if( !records.isEmpty() ){
StreamSupport.stream( records.spliterator(), false ).forEach( System.out::println );
}
}
The code
Basic example
This is the most basic code you can use to fetch messages from a kafka topic.
public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("group.id", "simple-consumer"); // group id assumed; any unused name works

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Arrays.asList("test-topic")); // topic name assumed

        while (true) {
            // poll with a 100 ms timeout
            ConsumerRecords<String, String> records = consumer.poll(100);
            if (records.isEmpty()) continue;
            StreamSupport.stream(records.spliterator(), false).forEach(System.out::println);
        }
    }
}
Runnable example
The consumer is designed to be run in its own thread. It is not safe for multithreaded
use without external synchronization and it is probably not a good idea to try.
Below is a simple Runnable task which initializes the consumer, subscribes to a list of topics, and
executes the poll loop indefinitely until shutdown externally.
@Override
public void run(){
try{
consumer.subscribe( topics );
while( true ){
ConsumerRecords<String, String> records = consumer.poll( Long.MAX_VALUE );
StreamSupport.stream( records.spliterator(), false ).forEach(
System.out::println );
}
}catch( WakeupException e ){
// ignore for shutdown
}finally{
consumer.close();
}
}
Note that we use a timeout of Long.MAX_VALUE during poll, so it will wait indefinitely for a new
message. To properly close the consumer, it is important to call its shutdown() method before
ending the application.
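A hedged sketch of what the surrounding class might look like; the class name ConsumerLoop comes from the driver code below, while the fields, constructor and shutdown() body are assumptions:

public class ConsumerLoop implements Runnable {

    private final KafkaConsumer<String, String> consumer;
    private final List<String> topics;
    private final int id; // mirrors the constructor argument used by the driver below

    public ConsumerLoop(int id, String groupId, List<String> topics) {
        this.id = id;
        this.topics = topics;
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", groupId);
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        this.consumer = new KafkaConsumer<>(props);
    }

    @Override
    public void run() {
        // the poll loop shown above goes here
    }

    public void shutdown() {
        // wakeup() is the only consumer method that is safe to call from another thread;
        // it makes the blocking poll() throw a WakeupException, which run() catches to exit cleanly
        consumer.wakeup();
    }
}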
int numConsumers = 3;
String groupId = "octopus";
List<String> topics = Arrays.asList( "test-topic" );

ExecutorService executor = Executors.newFixedThreadPool( numConsumers ); // assumed executor setup
List<ConsumerLoop> consumers = new ArrayList<>();

for( int i = 0; i < numConsumers; i++ ){
    ConsumerLoop consumer = new ConsumerLoop( i, groupId, topics );
    consumers.add( consumer );
    executor.submit( consumer );
}
The producer needs the same kafka-clients dependency as the consumer:

<dependencies>
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-clients</artifactId>
<version>0.9.0.1</version>
</dependency>
</dependencies>
The producer is initialized using a Properties object. There are lots of properties allowing you to
fine-tune the producer behavior. Below is the minimal configuration needed:
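A sketch of such a minimal configuration; the same values reappear in the complete example at the end of this chapter:

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("client.id", "octopus");

KafkaProducer<String, String> producer = new KafkaProducer<>(props);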
The bootstrap-servers is an initial list of one or more brokers for the producer to be able to discover the rest of the cluster. The serializer properties tell Kafka how the message key and value should be encoded. Here, we will send string messages. Although not required, setting a client.id is always recommended: it allows you to easily correlate requests on the broker with the client instance that made them.
props.put("acks", "all");
props.put("retries", 0);
props.put("batch.size", 16384);
props.put("linger.ms", 1);
props.put("buffer.memory", 33554432);
You can control the durability of messages written to Kafka through the acks setting. The default
value of “1” requires an explicit acknowledgement from the partition leader that the write
succeeded. The strongest guarantee that Kafka provides is with acks=all, which guarantees that
not only did the partition leader accept the write, but it was successfully replicated to all of the in-
sync replicas. You can also use a value of “0” to maximize throughput, but you will have no
guarantee that the message was successfully written to the broker’s log since the broker does not
even send a response in this case.
retries determines whether the producer tries to resend a message after a failure. Note that with retries > 0, message reordering may occur, since a retry may happen after a following write has already succeeded.
Kafka producers attempt to collect sent messages into batches to improve throughput. With the
Java client, you can use batch.size to control the maximum size in bytes of each message batch.
To give more time for batches to fill, you can use linger.ms to have the producer delay sending.
Finally, compression can be enabled with the compression.type setting.
Use buffer.memory to limit the total memory that is available to the Java client for collecting unsent
messages. When this limit is hit, the producer will block on additional sends for as long as
max.block.ms before raising an exception. Additionally, to avoid keeping records queued
indefinitely, you can set a timeout using request.timeout.ms.
The complete list of properties is available here. I suggest reading this article from Confluent for more details.
Sending messages
The send() method is asynchronous. When called it adds the record to a buffer of pending record
sends and immediately returns. This allows the producer to batch together individual records for
efficiency.
The result of send is a RecordMetadata specifying the partition the record was sent to and the offset
it was assigned. Since the send call is asynchronous it returns a Future for the RecordMetadata
that will be assigned to this record. To consult the metadata, you can either call get(), which will
block until the request completes or use a callback.
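A sketch of both variants, assuming a producer configured with string serializers and a topic named test-topic:

ProducerRecord<String, String> record = new ProducerRecord<>("test-topic", "some value");

// blocking: wait for the acknowledgement and inspect the metadata
// (get() throws the checked InterruptedException and ExecutionException)
RecordMetadata metadata = producer.send(record).get();
System.out.printf("stored in partition %d at offset %d%n", metadata.partition(), metadata.offset());

// non-blocking: register a callback instead
producer.send(record, (md, exception) -> {
    if (exception != null) {
        exception.printStackTrace();
    } else {
        System.out.printf("stored in partition %d at offset %d%n", md.partition(), md.offset());
    }
});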
The code
public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("acks", "all");
        props.put("retries", 0);
        props.put("batch.size", 16384);
        props.put("linger.ms", 1);
        props.put("buffer.memory", 33554432);
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("client.id", "octopus");

        // topic name and payloads are illustrative assumptions
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 10; i++) {
                producer.send(new ProducerRecord<>("test-topic", "message-" + i));
            }
        }
    }
}
Credits
1. Getting started with apache-kafka: Ali786, Community, Derlin, Laurel, Mandeep Lohan, Matthias J. Sax, Mincong Huang, NangSaigon, Vivek
2. Consumer Groups and Offset Management: Matthias J. Sax, Sönke Liebau
3. Custom Serializer/Deserializer: Derlin, G McNicol
5. Producer/Consumer in Java: Derlin, ha9u63ar