Kafka Fundamentals
Student Handbook
Version 5.5.0-v1.1.0
Table of Contents
01 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
07 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
01 Introduction
◦ Why is it relevant?
This module gives the motivation for stream processing and a high-level overview of the following:
◦ Broker
◦ Producer
◦ ZooKeeper
◦ Introduction to development
◦ Partition Strategies
◦ Topic Compaction
◦ Troubleshooting
◦ Security Overview
◦ REST-Proxy
◦ Schema-Registry
◦ Kafka Connect
◦ Kafka Streams
◦ KSQL
6. Confluent Platform
◦ Confluent Platform
◦ Confluent Cloud
◦ Confluent CLI
◦ RBAC
◦ Confluent Operator
Trainer, please demo access to the CR environment, including the VM and how to best use it.
• Breaks
• Lunch
• Restrooms
• Emergency procedures
Why are we all here? Why do we care about Kafka? What’s in it for us? These are questions
that we hope to address in this module. We will provide some use cases from real customers
of Confluent that have solved mission critical problems by using Kafka to power their real
time event streaming platforms.
Stay tuned!
This slide summarizes what makes real-time systems for the scenarios shown on the last slide possible.
• Credit Card Payments
• Streaming Provider: On-demand Digital Content
• Microservices Architecture
• Ride Hailing: Connecting Provider with Consumer in real-time
• Connected Cars: IoT Real-time Traffic Routing
• Global-scale
• Real-time
• Persistent Storage
• Stream Processing
https://kafka.apache.org/powered-by
• Detect fraud
• Minimize risk
The company needed a streaming platform to move transactional and business-critical data between all of their apps in real-time to not only ensure high-quality customer experience,
but also to minimize risk by detecting fraud and managing approval and rejection status for
card services. These applications also needed to seamlessly interact with the company’s
loyalty platform in real-time, as membership rewards management impacts both card
holders and merchants alike.
CHALLENGE
To increase revenue and support their mission of “Advancement through Technology,” Audi
developed a platform to collect and analyze the vast amounts of data from its fleet. An
autonomous vehicle driven for just a few hours generates over four terabytes of data, which
is useless if that data cannot be organized and used to generate business insights. To make
use of this data and bring their “Swarm Intelligence” vision to life, Audi deployed a
streaming platform to reveal once undiscoverable and actionable business insights. They can
now process data about the car's sensors and components and use "Predictive AI," which is the ability for a car to anticipate changes in its state and surroundings.
SOLUTION
Confluent Platform was chosen to be the streaming platform for their Automotive Cloud
Data Collector (ACDC). By using the insights gained in real time via the platform, Audi will be able to use it to avoid accidents, suggest vacant parking spots, save people a lot of time, and maneuver thousands of cars around obstacles in the road.
Kafka is also used to connect the data from the ACDC to other approved systems
requesting the data from the platform to guide business decisions. Audi selected Kafka as the streaming backbone for a platform that unifies the entire Volkswagen Group.
RESULTS
• Worldwide tracking and data gathering of Audi cars to improve user experience
• Providing sales, engineering and post-sales departments with the data required to provide
additional or better services to customers
CHALLENGE
The Bank sought to launch an eCommerce digital platform, a rewards program that allows customers to earn points and cash when they use their Bank card to make purchases from hundreds of bank-sponsored merchants, such as Starbucks and Walmart. As their first use case, the platform would allow the Bank to have exponentially more touchpoints with their customers and improved merchant analytics, and would require the real-time management
of large amounts of data. Batch processes, using Informatica, were part of their
architecture so data was not updated in real time. Ultimately, the Bank was having difficulty
in delivering meaningful mobile applications to their customers.
SOLUTION
To help manage the vast amounts of data coming from the eCommerce platform program,
Confluent Platform is now the standard event streaming platform for Mobile Commerce
and Technology within the Consumer Bank.
RESULTS
• Ability to onboard new merchants into the eCommerce platform program faster
• Enabled a full 360 view of their customers that was not possible before
• Projected savings of millions of dollars by reducing reliance on IBM MQ, Informatica and
• Saved costs
CHALLENGE
With rising expectations from customers to leverage mobile, email and SMS
communications, The Insurance Company was struggling to keep up with the ever changing
ways to interact with their retail insurance customers. They also struggled to achieve a true
360 view of customer interactions and wanted to leverage omnichannel marketing best
practices.
SOLUTION
The Insurance Company implemented Confluent Platform to get a real time view of their
customer interactions, email history and interaction, as well as advertisement impressions
and success rates. Confluent enabled The Insurance Company to gain a complete view of
what products a customer had purchased, which enabled the identification of up-sell and
cross-sell opportunities.
RESULTS
• Improved scalability
CHALLENGE
The Bank built a new core banking platform to support their 150+ million customers. The bank relies on Apache Kafka to process their 200,000 transactions per day, and was in dire need of enterprise-wide support for the January 2018 go-live. They sought to mitigate the risk of the platform going down and wanted to improve the time to market, efficiency, analytics, personalization, and the speed of the new core banking platform.
SOLUTION
Apache Kafka with Confluent support enabled The Bank to build a new Core Banking Platform that all their existing and new business models run on. These range from standard banking applications, such as fraud mitigation, to online payments, cashless smart cities, and AI for chatbots.
RESULTS
• Improved scalability, as the bank anticipates that transactions will triple or quadruple in the coming years
• Internet of Things
CHALLENGE
The Health Care Company needed a streaming data platform to support their “Clinics Without Walls” initiative. This initiative aimed to install a dedicated server in each of their clinics to capture IoT sensor data at frequent intervals for analysis. They also wanted to improve their streaming analytics capabilities to connect their 3,000 outpatient dialysis centers globally.
SOLUTION
The Health Care Company first launched Apache Kafka for their event sourcing pub/sub
scenarios. They then adopted Confluent Platform to integrate with the in-clinic hardware to
stream patient data every 22 seconds to their central databases.
RESULTS
• Improved ability to more accurately customize medications and dosages for each unique
patient
• Increased reliability
CHALLENGE
The Gaming Company needed to deliver smooth game launches and understand when usage was going to spike. Prior to deploying Kafka, The Gaming Company was using a database as a data pipeline, but as the number of players increased from year to year they began running into high IO problems; different applications were all reading and writing to the same few tables, and the databases could not keep up with the demand as the data volumes grew. This solution was breaking at the seams. The Gaming Company started looking for a solution that could handle a high volume of data at scale and could provide a fundamental platform for real-time data access.
SOLUTION
The Gaming Company deployed Confluent Platform and Apache Kafka to scale their data
pipeline infrastructure depending on usage as they capture and process the various types of
data coming in every second from user activity. The data is then sent to teams within The Gaming Company, who use it to power various services, including diagnostics, and to optimize and improve player experiences, enabling designers to adjust and make improvements in the games.
RESULTS
• Increased reliability
• Increased efficiency
CHALLENGE
NAV has established a vision that “Life is a Stream of Events,” and wanted to leverage streaming data technology to associate every major life event of its citizens with an event in a streaming platform. NAV had legacy systems that operated on their mainframe, which led to slow development speeds and technological and organizational dependencies. This left them with no single holistic perspective on individual citizens, which led to a lack of verifiable process outcomes and an inability to track changes to data throughout history.
SOLUTION
NAV has established a vision that each life event triggers an event in Kafka. By doing that,
instead of a citizen having to apply for welfare or apply for pensions, the event is
automatically triggered and NAV is able to proactively send the benefit to the taxpayer.
RESULTS
• Client-oriented tasks and processes supported by near real-time events and better data
quality
Payments engine
CHALLENGE
Faced with regulatory uncertainty, changing customer behavior and an evolving global
SOLUTION
The company chose Confluent Platform and Apache Kafka to re-architect their technology
stack. The Company leveraged Confluent Platform to improve fraud detection company-wide, to develop a standard of on-time customer communication where customers are contacted only when and via the methods they prefer, and to power their future payments engine.
RESULTS
The world produces data, loads of it, and we want to collect and store all that data in Kafka.
The applications that forward or write all this data to Kafka are called producers.
• If a NACK (not acknowledged) is received then the producer knows that Kafka was not
able to accept the data for whatever reason. In this case the producer automatically
retries to send the data.
• Kafka consists of a bunch of what we call brokers. More formally, Kafka is a cluster of brokers
• Brokers receive the data from producers and store it temporarily in the page cache before it is persisted to disk
• How long the data is kept around is determined by the so called retention time (1 week by
default)
Just storing the data in Kafka in most cases makes no sense. The real value comes into play if we have some downstream consumers that use this data and create business value out of it.
• Each consumer periodically asks the brokers of the Kafka cluster: "Do you have more data for me?". Normally the consumer does this in an endless loop (a minimal sketch follows after this list).
• Many consumers can poll data from Kafka at the same time
• Many different consumers can poll the same data, each at their own pace
• To allow for parallelism, consumers are organized in consumer groups that split up the
work
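A minimal sketch of such a poll loop using the Java consumer client; the broker address, group id, and topic name below are placeholder examples, not values from this course:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SimplePollLoop {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker address
        props.put("group.id", "temperature-processors");    // consumers sharing a group.id split the work
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("temperature_readings"));   // example topic
            while (true) {                                         // the "endless loop": keep asking for more data
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d key=%s value=%s%n",
                            record.offset(), record.key(), record.value());
                }
            }
        }
    }
}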
Up to this point, we have learned about the producer, broker, and consumer separately. Now, let's see how these all fit together in an architecture diagram.
• In the middle you have the Kafka cluster consisting of many brokers that receive and store
the data
• On the right you see consumers that poll or read data from Kafka for downstream
processing
• On top there is a cluster of ZooKeeper instances that form a so called ensemble. We will
talk more about the latter on the subsequent slides
• A key feature of Kafka is that Producers and Consumers are decoupled, that is,
◦ the internal logic of a producer never depends on any of the downstream consumers
• Producers and Consumers simply need to agree on the data format of the records
https://techvidvan.com/tutorials/apache-zookeeper-tutorial/
• Kafka Brokers use ZooKeeper for a number of important internal features such as
◦ Cluster management
◦ To store Access Control Lists (ACLs) used for authorization in the Kafka cluster
• The number of ZooKeeper nodes needs to be odd due to the ZAB consensus algorithm used to achieve a quorum (ZK instances need to communicate synchronously with each other to achieve a quorum)
Quorum defines the rule to form a healthy Ensemble. It is defined using the formula Q = 2N+1, where Q is the number of nodes required to form a healthy Ensemble that can tolerate N failed nodes.
Example with a 2-node cluster: there is a problem forming a healthy Ensemble, because if the connection between these 2 nodes is lost, then both nodes will think the other node is down, so both of them try to act as Leader, which leads to inconsistency as they can't communicate with each other. Inconsistencies take place when changes are not propagated to all the nodes/machines in the distributed system.
ZooKeeper provides the following services:
• Naming service: identifies the nodes in the cluster by name; similar to DNS, but for nodes.
• Configuration management: provides the latest and up-to-date configuration information of the system to a joining node.
• Cluster management: keeps track of nodes joining or leaving the cluster and of the node status in real time.
• Leader election: elects a node as leader for coordination purposes.
For example: a setup of 4 Kafka brokers in a Kafka cluster would need at least 3 ZooKeeper nodes to form a quorum.
Topics
• Topics: Streams of "related" Messages in Kafka
◦ Is a Logical Representation
The Topic is a logical representation that spans across Producers, Brokers, and Consumers
Let’s now transfer what we have learned about the log to Kafka. In Kafka we have the three
terms Topic, Partition and Segment. Let’s review each of them:
• Topic: A topic comprises all messages of a given category. We could e.g. have the topic
"temperature_readings" which would contain all messages that contain a temperature
reading from one of the many measurement stations the company has around the globe.
• Partition: To parallelize work and thus increase the throughput, Kafka can split a single topic into many partitions. The messages of the topic will then be split between the partitions. The default algorithm used to decide to which partition a message goes uses the hash code of the message key. A partition is handled in its entirety by a single Kafka broker. A partition can be viewed as a "log".
On the slide we annotate the segment and say that it "corresponds to a log".
The Log will be introduced on a subsequent slide…
Normally one uses several Kafka brokers that form a cluster to scale out. If we now have three topics (that is, 3 categories of messages) then they may be organized physically as shown on the slide.
• Each partition on a given broker results in one to many physical files called segments
• Segments are treated with a rolling file strategy. Kafka opens/allocates a file on disk and then fills that file sequentially and in append-only mode until it is full, where "full" depends on the defined max size of the file (or until the defined max time per segment has expired). Subsequently a new segment is created.
log == partition
One central element that enables Kafka and stream processing is the so-called "log".
• A log is a data structure that is like a queue of elements. New elements are always appended at the end of the log and once written they are never changed. In this regard one talks of an append-only, write-once data structure.
• Elements that are added to the log are strictly ordered in time. The first element added to
the log is older than the second one which in turn is older than the third one.
• In the image the time axis reflects that fact. The offset of the elements in that sense can
be viewed as a time scale.
• The log is produced by some data source. In the image the data source has already
produced elements with offset 0 to 10 and the next element the source produces will be
written at offset 11.
• The data source can write at its own speed since it is totally decoupled from any of the
destination systems. In fact, the source system does not know anything about the
consuming applications.
• Multiple destination systems can independently consume from the log, each at its own
speed. In the sample we have two consumers that read from different positions at the
same time. This is totally fine and an expected behavior.
• Each destination system consumes the elements from the log in temporal order, that is
from left to right in our image.
In this course we're often going to hear about streaming data. Let's thus give a very simple definition of what a stream is.
As shown in the image, a stream is a sequence of events. The sequence has a beginning somewhere in the past. The first event has offset 0. "Past" in this context can mean anything from seconds to hours to weeks or even longer periods.
In the image we have denoted the event with offset n to be "now". Everything that comes
after event n is considered the "future".
A stream is open ended. There is not necessarily an end-time when streaming of events
stops (although there can be one of course).
Given the fact that a stream is open ended we cannot know how much data to expect in the
future and we can certainly not wait until all data has arrived before we start doing
something with the stream.
Also very important to note is that a stream is "immutable". In stream processing one never
modifies an existing stream but always generates a new output stream.
A data element in a log (or topic; to be introduced later) is called a record in the Kafka
world. Often we also use equivalent words for a record; the most common ones are "message" and "event".
• The metadata contains offset, compression, magic byte, timestamp and optional headers
• The key by default is used to decide into which partition a record is written. As a consequence, all records with an identical key go into the same partition. This is important in the downstream processing, since ordering is (only) guaranteed on a partition and not on a topic level!
There is also a timestamp in the message but we will talk later about the concept of time…
From the perspective of Kafka (Brokers) it doesn’t really matter what is in the key and
value. Any data type is possible that can be serialized by the producers. The broker only sees
arrays of bytes.
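As an illustration of this, a hedged Java producer sketch: the configured serializers turn the key and value into the byte arrays the broker sees; the broker address and topic are placeholders:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker address
        // Serializers turn the key and value into byte arrays; the broker never inspects them
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // All records with the same key (here a station id) land in the same partition
            producer.send(new ProducerRecord<>("temperature_readings", "station-42", "21.5"));
            producer.flush();
        }
    }
}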
• Each Partition is stored on the Broker’s disk as one or more log files
• Each message in the log is identified by its offset which is a monotonically increasing value
• Kafka provides a configurable retention policy for messages to manage log file growth
Brokers share metadata with Producers and Consumers, e.g. the mapping of partitions to brokers.
A typical Kafka cluster has many brokers for high availability and scalability reasons. Each
broker handles many partitions from either the same topic (if the number of partitions is
bigger than the number of brokers) or from different topics.
Each broker can handle hundreds of thousands, or millions, of messages per second
Kafka can replicate partitions across a configurable number of Kafka servers, which is used for fault tolerance. Each partition has a leader server and zero or more follower servers.
In this image we have four brokers and three replicated partitions. The replication factor is
also 3. For each partition a different broker is the leader, for optimal resource usage.
◦ Native Java, C/C++, Python, Go, .NET, JMS clients (for legacy or enterprise Java apps)
are supported by Confluent
◦ Clients for many other languages exist that are supported by the community
◦ Confluent develops and supports a REST server (REST Proxy) which can be used by clients written in any language
• A command-line Producer tool exists to send messages to the cluster, which is useful for testing and debugging
• Two Purposes:
◦ Load Balancing
◦ Semantic Partitioning
◦ No Key → Round-Robin
◦ With Key → hash(key) % number_of_partitions
• Load balancing: for example, round robin just to do random load balancing
• Semantic partitioning: for example, user-specified key is the user id, allows Consumers to
make locality assumptions about their consumption. This style of partitioning is explicitly
designed to allow locality-sensitive processing in Consumers.
• Consumer offset
• A Consumer pulls messages from one or more Topics in the cluster. As messages are
written to a Topic, the Consumer will automatically retrieve them
• The Consumer Offset keeps track of the latest message read, and it is stored in a special Kafka Topic. If necessary, the Consumer Offset can be changed, for example, to reread messages (a minimal sketch of this follows below)
• A command-line Consumer tool exists to read messages from the Kafka cluster, which is useful for testing and debugging
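A hedged sketch of changing the consumer offset with the Java client so that messages are reread; topic, partition, and group id are placeholder examples:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class RereadFromOffset {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder
        props.put("group.id", "reprocessing-job");           // placeholder group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("temperature_readings", 0);
            consumer.assign(List.of(tp));      // manual assignment instead of subscribe()
            consumer.seek(tp, 0L);             // rewind to offset 0 to reread everything
            consumer.poll(Duration.ofMillis(100)).forEach(r ->
                    System.out.println(r.offset() + " -> " + r.value()));
        }
    }
}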
Kafka stores messages in topics for a pre-defined amount of time (by default 1 week), such that many consumers (or more precisely, consumer groups) can access the data. Most
consumers will process the incoming data in near-real time, but others may consume the
data at a later time.
On the slide we see 3 consumers that all poll data from the same partition of a given topic.
I mentioned consumer groups as a concept on the previous slide already. To increase the throughput of downstream consumption of data flowing into a topic, Kafka introduced Consumer Groups.
• All consumer instances in a consumer group are identical clones of each other
• A consumer group can scale out and thus parallelize work until the number of consumer
instances is equal to the number of topic partitions
• Conceptually the Kafka Cluster can grow infinitely. The only limit is the failover time after a catastrophic failure. This limits the reasonable max size of a Kafka cluster to approximately 50 brokers with up to 4,000 partitions each.
• Downstream work can also be parallelized by creating topics with many partitions and by running a number of consumer instances equal to the number of partitions in a consumer group. Here the limit of parallelism lies in the number of distinct keys (if using the default partitioner on the producers). It doesn't make sense to have more partitions in a topic than distinct keys in the messages that are sent to this topic.
Answers:
• ZooKeeper forms an ensemble which requires an odd number of nodes to be able to form a quorum
• The minimum number of brokers for limited high availability is 2. Each partition can then be replicated once.
These are a few links to material diving into the topics of this module in more depth.
• Illustrate on a high level how you can secure your Kafka cluster
This slide is not about the code details, but just wants to give a high-level overview. When writing a basic producer we can distinguish the following parts (a comparable sketch in Java follows below):
• Configuration: this is the part where we define all non-default properties of our producer
• Shutdown: in this section we define how the application shall behave in case it receives a SIGTERM or SIGKILL signal
• Sending: in this part we define the logic that actually sends messages to the respective topic in the Kafka cluster
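The slide's own example is not reproduced here; below is a comparable, hedged Java sketch of the same three parts, with placeholder broker address and topic:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class StructuredProducer {
    public static void main(String[] args) {
        // 1. Configuration: all non-default properties of our producer
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("acks", "all");

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);

        // 2. Shutdown: close the producer cleanly when the JVM receives a termination signal
        Runtime.getRuntime().addShutdownHook(new Thread(producer::close));

        // 3. Sending: the logic that actually sends messages to the topic
        for (int i = 0; i < 10; i++) {
            producer.send(new ProducerRecord<>("demo-topic", Integer.toString(i), "value-" + i));
        }
        producer.flush();
    }
}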
This slide is not about the code details, but just wants to give a high-level overview.
When writing a basic consumer (here in C#, .NET) we can distinguish the following parts:
• Message Callback: in this callback, which is triggered for each message, we define what shall happen with the particular message.
• Polling: Here we define how the consumer shall be polling. In this sample in an endless loop
with a 100 ms wait time between polls.
• Kafka does its best to equally distribute the work load among available brokers, thus in
this case each broker is leader for one partition
• Followers poll the leader periodically for new data and then write it to their own local
commit log
• The controller of the Kafka cluster reassigns the partition leadership for topic1/partition4
• If the number of replicas is lower than the minimal requested number of ISRs (in-sync
replicas), producers will not be able to write to this topic anymore
The above happens totally transparently to the users. Note that if broker 4 is
brought up again and added to the cluster, the leadership for partition 4 might
go back to it eventually (after a few minutes), since Kafka tries to balance the
leadership across all available brokers
• Business decision
• Cost factor
• Contrary to a normal enterprise service bus such as IBM MQ Series or RabbitMQ, Kafka does not delete messages once they have been consumed
• The time for which messages are to be stored in their respective topics is called the retention time
• By default the retention time is set to 1 week, but it can be changed from 0 to infinite
• How long you want to store your data is a business decision and an important cost factor.
It is also a compliance question, e.g. GDPR determines how long (sensitive) customer data
can be maximally stored
The data is always purged per segment. When all the messages in a segment
are older than the retention time then the segment is deleted.
The exception here are compacted topics. Log/topic compaction means that
Kafka will keep the latest version of a record (with a given key) and delete the
older versions during a log compaction.
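As an illustration, the retention time can be set per topic when the topic is created. A hedged Java AdminClient sketch with a placeholder topic name and an example 30-day retention:

import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicWithRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic topic = new NewTopic("payments", 6, (short) 3)   // name, partitions, replication factor (examples)
                    .configs(Map.of("retention.ms", String.valueOf(30L * 24 * 60 * 60 * 1000)));  // ~30 days
            admin.createTopics(List.of(topic)).all().get();   // block until the topic is created
        }
    }
}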
On this slide we dive a bit more into the high-level internals of a producer. No worries, we will not go through the code in detail here.
• The area with the grey underlay represents the code of the Kafka Client library, to be more precise, of its producer part. Thus this is not the code that you write, but code that is provided for you by the library
• On the left upper side you can see a single record. It contains elements such as the record
key and value as well as other meta data such as topic name and partition number
• First your record is serialized into an array of bytes, using the pre-configured serializer
• Then the record is passing through a partitioner. By default the partitioner looks at the
(serialized) key of the record and decides to which partition of the topic this message will
be sent
• Often messages are not sent directly to the respective broker but are first batched. This depends on some settings the developer can define in their code
• Once a batch is "full" the producer library code flushes the batch to the respective Kafka
broker
• The broker tries to store the batch in its local commit log
• If the broker was successful it will answer with an ACK and some additional meta data. All
is good in this case!
• If the sending is not successful and the number of retries have been exhausted, then the
producer triggers an exception which needs to be handled by your code
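A hedged Java sketch showing where the batching settings and the two outcomes described above (an ACK with metadata, or an exception once retries are exhausted) appear in producer code; all values are examples:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProducerWithCallback {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("batch.size", "32768");   // flush a batch once it holds ~32 KB (example value)
        props.put("linger.ms", "10");       // or after waiting at most 10 ms for more records
        props.put("retries", "5");          // automatic retries before the exception surfaces

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("demo-topic", "key", "value"), (metadata, exception) -> {
                if (exception != null) {
                    // retries exhausted: this is where your own error handling goes
                    exception.printStackTrace();
                } else {
                    // broker answered with an ACK plus metadata
                    System.out.printf("stored in %s-%d at offset %d%n",
                            metadata.topic(), metadata.partition(), metadata.offset());
                }
            });
        }
    }
}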
When using a producer, you can configure its acks (acknowledgments) setting, which defaults to 1. The acks setting is the count of write acknowledgments required from the partition leader before the producer write request is deemed complete. This setting controls the producer's durability, which can be very strong (all) or none. Durability is a tradeoff between throughput and consistency. The acks setting is set to "all" (-1), "none" (0), or "leader" (1).
• Acks 0 (NONE): The acks=0 is none meaning the Producer does not wait for any ack from
Kafka broker at all. The records added to the socket buffer are considered sent. There are
no guarantees of durability. The record offset returned from the send method is set to -1
(unknown). There could be record loss if the leader is down. There could be use cases that
need to maximize throughput over durability, for example, log aggregation.
• Acks 1 (LEADER): The acks=1 is leader acknowledgment. This means that the Kafka broker acknowledges that the partition leader wrote the record to its local log but responds without the partition followers confirming the write. If the leader fails right after sending the ack, the record could be lost as the followers might not have replicated the record
yet. Record loss is rare but possible, and you might only see this used if a rarely missed
record is not statistically significant, log aggregation, a collection of data for machine
learning or dashboards, etc.
• Acks -1 (ALL): The acks=all or acks=-1 is all acknowledgment which means the leader
gets write confirmation from the full set of ISRs before sending an ack back to the
producer. This guarantees that a record is not lost as long as one ISR remains alive. This
ack=all setting is the strongest available guarantee that Kafka provides for durability.
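For illustration, acks is just a producer configuration property; a hedged snippet showing the three values side by side:

import java.util.Properties;

public class AcksConfigExample {
    public static Properties producerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // Pick exactly one of the following:
        // props.put("acks", "0");    // NONE: fire and forget, highest throughput, no durability guarantee
        // props.put("acks", "1");    // LEADER: leader has written the record, followers may lag behind
        props.put("acks", "all");     // ALL (-1): all in-sync replicas confirm, strongest durability
        return props;
    }
}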
• At most once: From all records written to Kafka it is guaranteed that there will never be a
duplicate. Under certain bad circumstances some record may be lost
• At least once: From all records written to Kafka none is ever lost. Under certain bad circumstances some record may be duplicated
• Exactly once: Every single record written to Kafka will be found in the Kafka logs exactly once. There is no situation where either a record is lost or where a record is duplicated
To achieve exactly once semantics (EOS) the first step is to establish idempotent producers.
On the slide I have shown what happens if a producer is not idempotent: upon failure, a retry can cause the same record to be written to the commit log more than once.
But, an idempotent producer guarantees, in collaboration with the respective broker, that:
• all messages written to a specific partition are maintaining their relative order in the
commit log of the broker
• each message is only written once. No duplicates ever are to be found in the commit log
• together with acks=all we also make sure that no message is ever lost
For the longest time it seemed an impossibility to have transactional guarantees in a highly
distributed and asynchronous system such as Kafka.
Exactly Once Semantics (EOS) bring strong transactional guarantees to Kafka, preventing duplicate messages from being processed by client applications, even in the event of client failures.
Use cases:
• tracking ad views,
• stream processing with e.g. aggregates only really makes sense with EOS
Process:
• The producer starts a transaction (TX) and then writes several records to multiple partitions on different brokers
• If the TX succeeds, then the producer has the guarantee that all records have been written exactly once to the Kafka brokers, maintaining their local ordering
• If the TX fails, then the producer knows that none of the records written will show up in the downstream consumers. That is, the aborted TX leaves no unwanted side effects
• NOTE: To have this downstream guarantee the consumers need to set their reading behavior to read_committed (a sketch follows below)
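A hedged Java sketch of that process; the transactional id and topic names are placeholders, and downstream consumers would set isolation.level=read_committed:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.KafkaException;

public class TransactionalProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");        // placeholder
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("enable.idempotence", "true");                  // idempotent producer is the first step to EOS
        props.put("transactional.id", "payments-tx-1");           // placeholder, unique per producer instance

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        producer.initTransactions();
        try {
            producer.beginTransaction();
            // several records to multiple partitions/topics inside one TX
            producer.send(new ProducerRecord<>("orders", "order-1", "created"));
            producer.send(new ProducerRecord<>("payments", "order-1", "charged"));
            producer.commitTransaction();    // either all records become visible downstream ...
        } catch (KafkaException e) {
            producer.abortTransaction();     // ... or none of them do
        } finally {
            producer.close();
        }
    }
}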
If you have enough load that you need more than a single instance of your application, you need to partition your data. The producer clients decide which topic partition data ends up in, but it's what the consumer applications will do with that data that drives the decision.
• The default partitioner of the Kafka client library is hashing the key of the message and
taking the hash code (modulo number of partitions) as the partition number (hash(key)
% number_of_partitions).
• The partitioner is pluggable and thus we are free to implement a custom partitioner that
uses any scenario specific algorithm to partition the data, e.g. any field of the value
object (a sketch of a custom partitioner follows after the note below).
The guarantee of order from key-based allocation only applies if all messages
with the same key are sent by the same producer.
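A hedged sketch of a custom partitioner implementing the Java client's Partitioner interface; the class name and routing rule are invented for illustration:

import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

/** Example only: route records whose key starts with "vip-" to partition 0, everything else by key hash. */
public class VipFirstPartitioner implements Partitioner {

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if (key != null && key.toString().startsWith("vip-")) {
            return 0;                                            // dedicated partition for "VIP" keys
        }
        // fall back to something similar to the default: hash(key) % number_of_partitions
        int hash = (keyBytes == null) ? 0 : java.util.Arrays.hashCode(keyBytes);
        return (hash & 0x7fffffff) % numPartitions;              // mask keeps the value non-negative
    }

    @Override
    public void close() {}

    @Override
    public void configure(Map<String, ?> configs) {}
}

Such a class would be registered on the producer via the partitioner.class configuration property.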
Kafka Topics allow the same message to be consumed multiple times by different Consumer Groups.
Within a Consumer Group, a Consumer can consume from multiple Partitions, but a
Partition is only assigned to one Consumer to prevent repeated data.
Consumer Groups have built-in logic which triggers partition assignment on certain events,
e.g., Consumers joining or leaving the group.
• A rebalance is triggered whenever a consumer is added to the group or a consumer is removed from the group, e.g. to scale the processing in or out
• The partitions are automatically assigned (by the consumer group protocol) to the consumers using the specified strategy.
• Range (default): In Range, the Partition assignment assigns matching partitions to the
same Consumer. The Range strategy is useful for "co-partitioning", which is particularly
useful for Topics with keyed messages. Imagine that these two Topics are using the same
key - for example, a userid.
• Range (continued): Topic A is tracking search results for specific user IDs; Topic B is
tracking search clicks for the same set of user IDs. By using the same user IDs for the key
in both Topics, messages with the same key would land in the same numbered Partition in
both Topics (assuming both topics had the same number of Partitions) and so will land in
the same Consumer.
• Sticky: An important note about Range and RoundRobin: Neither strategy guarantees
that Consumers will retain the same Partitions after a reassignment. In other words, if a
Consumer 1 is assigned to Partition A-0 right now, Partition A-0 may be assigned to
another Consumer if a reassignment were to happen. Most Consumer applications are
not locality-dependent enough to require that Consumer-Partition assignments be static.
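For reference, the assignment strategy is a consumer configuration; a hedged snippet using the assignor classes shipped with the Java client:

import java.util.Properties;

public class AssignmentStrategyConfig {
    public static Properties consumerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder
        props.put("group.id", "search-clicks-joiner");       // placeholder
        // Choose the assignor; RangeAssignor is the default described above
        props.put("partition.assignment.strategy",
                  "org.apache.kafka.clients.consumer.StickyAssignor");
        // Alternatives: RangeAssignor, RoundRobinAssignor, CooperativeStickyAssignor
        return props;
    }
}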
The log on top contains all events/records that are produced by the data source (each event in this slide has an offset [or a time], a key and a value). Events with the same key are color-coded.
If we create a new stream that only ever contains the last event per key of the full log then
we call this "log compaction". The lower part of the slide shows the result of such a
compaction.
Compacted logs are useful for restoring state after a crash or system failure. Kafka log
compaction also allows downstream consumers to restore their state from a log compacted
topic.
They are useful for in-memory services, persistent data stores, reloading a cache, etc. An important use case of data streams is to log changes to keyed, mutable data: changes to a database table or changes to an object in an in-memory microservice.
Log compaction is a granular retention mechanism that retains the last update for each
key. A log compacted topic log contains a full snapshot of final record values for every
record key not just the recently changed keys.
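For illustration, compaction is enabled per topic via the cleanup.policy configuration; a hedged Java AdminClient sketch on a placeholder topic:

import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class EnableCompaction {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "customer-profiles");
            AlterConfigOp setCompact = new AlterConfigOp(
                    new ConfigEntry("cleanup.policy", "compact"), AlterConfigOp.OpType.SET);
            Map<ConfigResource, Collection<AlterConfigOp>> update = Map.of(topic, List.of(setCompact));
            admin.incrementalAlterConfigs(update).all().get();   // keep only the latest record per key
        }
    }
}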
• Log Files
◦ SSL logging
◦ Authorizer debugging
• Log Files should be centrally aggregated, e.g. in Datadog, to avoid SSH access to brokers (security risk!)
◦ SSL logging: Enable SSL debug logging at the JVM level by starting the Kafka broker
and/or clients with the javax.net.debug system property.
◦ Authorizer Debugging: It’s possible to run with authorizer logs in DEBUG mode by
making some changes to the log4j.properties file.
• Kafka supports cluster wide encryption in transit via TLS and authentication (SASL or
TLS) as well as authorization via ACLs
• Kafka does not support encryption at rest out of the box but can be combined with other
data encryption methods such as volume encryption or client based encryption.
• Kafka supports a mix of authenticated and unauthenticated, and encrypted and non-encrypted clients.
• Client Authentication
• Client Authorization
• Encrypt data-in-transit between Kafka client applications and Kafka brokers: You can enable the encryption of the client-server communication between your applications and the Kafka brokers. For example, you can configure your applications to always use encryption when reading and writing data to and from Kafka. This is critical when reading and writing data across security domains such as internal network, public internet, and partner networks.
• Client authentication: You can enable client authentication for connections from your
application to Kafka brokers. For example, you can define that only specific applications
are allowed to connect to your Kafka cluster.
• Client authorization: You can enable client authorization of read and write operations by
your applications. For example, you can define that only specific applications are allowed
to read from a Kafka topic. You can also restrict write access to Kafka topics to prevent
data pollution or fraudulent activities.
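As an illustration of what this looks like on the client side, a hedged Java snippet that enables TLS encryption plus SASL authentication; all hostnames, paths, and credentials are placeholders:

import java.util.Properties;

public class SecureClientConfig {
    public static Properties clientProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka.example.com:9093");                  // placeholder TLS listener
        props.put("security.protocol", "SASL_SSL");                                 // encrypt in transit + authenticate
        props.put("ssl.truststore.location", "/etc/kafka/client.truststore.jks");   // placeholder path
        props.put("ssl.truststore.password", "changeit");                           // placeholder
        props.put("sasl.mechanism", "PLAIN");                                       // or SCRAM, GSSAPI, OAUTHBEARER
        props.put("sasl.jaas.config",
                "org.apache.kafka.common.security.plain.PlainLoginModule required "
              + "username=\"app-user\" password=\"app-secret\";");                  // placeholder credentials
        return props;
    }
}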
The idea is that the learner understands that security is of utmost importance when running
a Kafka cluster in production or in a production like environment. Various means can be used
to achieve this goal:
• run the cluster on a secured private network with access only through secured gateways
• use TLS to encrypt data in transit between the brokers themselves and the clients and
brokers
• use either MTLS or SASL to authenticate access to the brokers from the clients
• use ACLs to lock down access to resources (use the least-privilege principle)
• run components of the cluster (brokers, clients, etc.) in containers, e.g. in Kubernetes, to further isolate them
◦ https://www.confluent.io/blog/optimizing-apache-kafka-deployment/
◦ https://www.confluent.io/white-paper/optimizing-your-apache-kafka-deployment/
These are a few links to material diving into the topics of this module in more depth.
Manage hundreds of data sources and sinks
Integrated within Confluent Control Center
Kafka Connect, a part of the Apache Kafka project, is a standardized framework for handling import and export of data from Kafka. This framework can address a variety of use cases; it makes adopting Kafka much simpler for users with existing data pipelines; it encourages an ecosystem of tools for integration of other systems with Kafka using a unified interface; and it provides a better user experience, guarantees, and scalability than other frameworks that are not Kafka-specific.
Kafka Connect is a pluggable framework for inbound and outbound connectors. It’s fully
distributed and integrates Kafka with a number of sources and sinks. This approach is
operationally simpler, particularly if you have a diverse set of other systems you need to
copy data to/from, as it can be holistically and centrally managed. It also pushes a lot of the hard work of scalability into the framework, so if you want to work in the Kafka Connect framework you would only have to worry about getting your data from a source into the Kafka Connect format; scalability, fault tolerance, etc. would be handled for you intrinsically.
Most of the data systems illustrated here are supported with both Sources and Sinks
(including JDBC, MySQL, Mongo, and Cassandra).
• Open Source
• Kafka Connect is a framework for streaming data between Apache Kafka and other data
systems
• Kafka Connect is open source, and is part of the Apache Kafka distribution
Kafka Connect is not an API like the Client API (which implements Producers and Consumers
within the applications you write) or Kafka Streams. It is a reusable framework that uses
plugins called Connectors to customize its behavior for the endpoints you choose to work
with.
◦ …
• Grey Logo denotes commercially licensed Connectors in preview state. Confluent support
will be available soon
We will talk in more detail about connectors and Confluent Hub in the module about Confluent Platform.
Apache Kafka Connect
• Splitting the workload into smaller pieces provides the parallelism and scalability
• Connector jobs are broken down into Tasks that do the actual copying of the data
• Workers are processes running one or more tasks, each in a different thread
A Connector is a Connector class and a configuration. Each connector defines and updates
a set of tasks that actually copy the data. Connect distributes these tasks across workers.
In the case of Connect, the term partition can mean any subset of data in the source or the
sink. How a partition is represented depends on the type of Connector (e.g., tables are
partitions for the JDBC connector, files are the partitions for the FileStream connector).
Worker processes are not managed by Kafka. Other tools can do this: YARN, Kubernetes,
Chef, Puppet, custom scripts, etc.
As with Kafka topics, the position within the partitions used by Connect must be tracked as
well to prevent the unexpected replay of data. As with the partitions, the object used as the
offset varies from connector to connector, as does the method of tracking the offset. The
most common way to track the offset is through the use of a Kafka topic, though the
Connector developer can use whatever they want.
The variation is primarily with source connectors. With sink connectors, the offsets are
Kafka topic partition offsets, and they’re usually stored in the external system where the
data is being written.
$ connect-distributed connect-distributed.properties
• Group coordination
In distributed mode, the connector configurations cannot be kept on the Worker systems; a failure of a worker should not make the configurations unavailable. Instead, distributed workers keep their connector configurations in a special Kafka topic which is specified in the worker configuration.
Workers coordinate their tasks to distribute the workload using the same mechanisms as Consumer Groups.
Confluent REST Proxy
Talk to non-native Kafka Apps and outside the Firewall
Simplifies administrative actions
1. For remote clients (such as outside the datacenter including over the public internet)
2. For internal client apps that are written in a language not supported by native Kafka
client libraries (i.e. other than Java, C/C++, Python, and Go)
3. For developers or existing apps for which REST is more productive and familiar than
Kafka Producer/Consumer API.
Confluent REST Proxy
• The REST Proxy allows an administrator to use HTTP to perform actions on the Kafka cluster
Although there are native clients for the most important languages next to
Java such as .NET/C#, Python, Go, C/C++, etc., there is still a need for REST
Proxy even when using one of those languages to build Kafka clients. Imagine a
situation where the Kafka Cluster sits behind a firewall which only opens port
80 and 443. In this case Kafka clients sitting outside this firewall can still access
Kafka (via the REST Proxy).
Data Compatibility
The Challenge of Data Compatibility at Scale
Many sources without a policy cause mayhem in a centralized data pipeline
Confluent Schema Registry
Make Data Backwards Compatible and Future-Proof
• Define the expected fields for each Kafka topic
• Automatically handle schema changes (e.g. new fields)
• Prevent backwards-incompatible changes
• Support multi-data center environments
This diagram shows what schema registry is for. From left to right:
• Schema registry is specific to a Kafka cluster, so you can have one schema registry per
cluster.
• Applications can be third party apps like Twitter or SFDC or a custom application.
Schema registry is relevant to every producer who can feed messages to your cluster.
• Within an application, there is a function called serializer that serializes messages for
delivery to the kafka data pipeline. Confluent’s schema registry is integrated into the
serializer. Some other SaaS provider companies have created their own schema registries
outside the serializer; argument is that’s more complexity for the same functionality.
• The serializer places a call to the schema registry, to see if it has a format for the data the
application wants to publish.
• If it does, then schema registry passes that format to the application’s serializer, which
uses it to filter out incorrectly formatted messages. This keeps your data pipeline clean.
The formats Confluent uses for its schema registry are Avro, Protobuf, and JSON Schema.
When you use the Confluent schema registry, it’s automatically integrated into the serializer
and there’s no effort you need to put into it. Your data pipeline will just be clean. You simply
need to have all applications call the schema registry when publishing.
The ultimate benefit of schema registry is that the consumer applications of your kafka
topic can successfully read the messages they receive.
Confluent Schema Registry
◦ Checks schemas and throws an exception if data does not conform to the schema
◦ The full schema itself is not sent with each message; instead, a globally unique ID representing the Avro schema is sent with each message
• The Schema Registry is accessible both via a REST API and a Java API
1. The message key and value can be independently serialized
5. Producers and Consumers cache the schema/ID mapping for future messages
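A hedged Java sketch of a producer configured with Confluent's Avro serializer, which registers the schema and sends only its ID with each message; broker, Schema Registry URL, schema, and topic are placeholder examples (the kafka-avro-serializer dependency is assumed to be on the classpath):

import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AvroProducerExample {
    private static final String PAYMENT_SCHEMA =
        "{\"type\":\"record\",\"name\":\"Payment\",\"fields\":["
      + "{\"name\":\"id\",\"type\":\"string\"},{\"name\":\"amount\",\"type\":\"double\"}]}";

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");                    // placeholder
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081");            // placeholder Schema Registry

        Schema schema = new Schema.Parser().parse(PAYMENT_SCHEMA);
        GenericRecord payment = new GenericData.Record(schema);
        payment.put("id", "p-1");
        payment.put("amount", 42.0);

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            // The serializer registers/looks up the schema and sends only its ID with the message
            producer.send(new ProducerRecord<>("payments", "p-1", payment));
        }
    }
}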
AVRO
Schema Evolution
◦ Backward compatibility
▪ Code with a new version of the schema can read data written in the old schema
▪ Code that reads data written with the schema will assume default values if fields are
not provided
◦ Forward compatibility
▪ Code with previous versions of the schema can read data written in a new schema
▪ Code that reads data written with the schema ignores new fields
◦ Full compatibility
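As an illustration of backward compatibility, a hedged example using Avro's SchemaCompatibility helper: the new schema adds a field with a default, so code with the new schema can still read data written with the old one (field names are invented):

import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;

public class BackwardCompatibilityCheck {
    public static void main(String[] args) {
        // Old (writer) schema: a record with a single field
        Schema v1 = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"}]}");

        // New (reader) schema: adds a field WITH a default, so old data can still be read
        Schema v2 = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"},"
          + "{\"name\":\"country\",\"type\":\"string\",\"default\":\"unknown\"}]}");

        // Backward compatible: code with the new schema (reader) can read data written with the old one
        System.out.println(SchemaCompatibility.checkReaderWriterCompatibility(v2, v1).getType());
        // prints COMPATIBLE
    }
}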
Stream Processing
Confluent ksqlDB
Streaming SQL Engine for Apache Kafka
Develop real-time stream processing apps writing only SQL! No
Java, Python, or other boilerplate to wrap around it
Confluent ksqlDB
Enable Stream Processing using SQL-like Semantics
• Streaming ETL
• Anomaly detection
• Event monitoring
ksqlDB is the open source streaming SQL engine for Apache Kafka. It provides an easy-to-
use yet powerful interactive SQL interface for stream processing on Kafka, without the
need to write code in a programming language such as Java or Python. ksqlDB is scalable,
elastic, fault-tolerant, and real-time. It supports a wide range of streaming operations,
including data filtering, transformations, aggregations, joins, windowing, and sessionization.
Confluent ksqlDB & Apache Kafka = easy
All you need is Kafka – no complex deployments of bespoke systems for stream processing!
ksqlDB reads and writes data from and to your Kafka cluster over the network. Yes, that's right, ksqlDB runs on its own dedicated servers and not as part of the Kafka cluster.
Apache Kafka Streams
Transform Data with Real-Time Applications
Overview
• Microservices
• Continuous queries
• Continuous transformations
Kafka Streams is a client library for building applications and microservices, where the input
and output data are stored in a Kafka cluster. It combines the simplicity of writing and
deploying standard Java and Scala applications on the client side with the benefits of
Kafka’s server-side cluster technology.
Your Kafka Streams App
• The Streams API of Apache Kafka, available through a Java library, can be used to build stream processing applications and microservices
• A unique feature of the Kafka Streams API is that the applications we build with the Kafka Streams library are normal Java applications. These applications can be packaged,
deployed, and monitored like any other Java application – there is no need to install
separate processing clusters or similar special-purpose and expensive infrastructure!
• We can run more than one instance of the application. The instances run independently
but will automatically discover each other and collaborate.
• If one instance unexpectedly dies the others will automatically take over its work and
continue where it left off.
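A minimal, hedged Kafka Streams sketch in Java; the application id and topic names are placeholders, and the filter shows the DSL style referred to below:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class TemperatureFilterApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "temperature-filter");        // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");          // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> readings = builder.stream("temperature_readings");    // input topic
        readings.filter((station, value) -> Double.parseDouble(value) > 30.0)         // keep "hot" readings
                .to("high_temperature_alerts");                                        // output topic

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));              // clean shutdown
    }
}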
Apache Kafka Streams
Let's first elaborate on the motivation of why one would care about Kafka Streams.
• For the most flexibility, but also high complexity, we can rely purely on the Kafka Client
library with its Producer and Consumer objects. We have learned about those in the
previous module.
• If we want less complexity but are willing to trade in a bit of flexibility then that’s where
Kafka Streams comes into play. We can build an application that uses the Kafka Streams
library. It offers us an easy to learn and manage DSL with functions such as filter, map,
groupBy or aggregate.
• Later we will discuss KSQL, and see why it appeals to so many people. One main point is
its simplicity. But looking at the graph on the slide we have this divergence between
simplicity and flexibility. You cannot have both at the same time. The more flexible you
want to be the more complex a system usually becomes. KSQL is on the simplicity side. It
is super easy to author a ksqlDB query that processes streams. That comes at a cost.
ksqlDB has its clear boundaries.
In summary: Kafka Streams sits somewhere at the middle grounds and gives us a healthy
mix of flexibility at a manageable complexity.
Q&A
Question:
In your company and/or team, which of the three methods to
write a Kafka client application do you think will be the most
appropriate:
a) Producer/Consumer
b) Kafka Streams application, or
c) KSQL
The idea is that the learner understands that there are compromises in each method.
Producer/consumer clients are the most flexible yet the most difficult to implement for
complex problems. ksqlDB on the other hand is easy but also limited in its ability. The latter
benefits from the possibility of extending it via UDFs and UDAFs, yet those have to be
written in Java.
Somewhere in the middle sits the Kafka Streams API. It offers a fair amount of flexibility with a high-level abstraction that makes coding streaming applications fairly easy for a Java developer.
Hands-On Lab
Further Reading
• Confluent REST Proxy: https://docs.confluent.io/current/kafka-rest/docs/index.html
◦ https://www.confluent.io/confluent-schema-registry/
◦ https://docs.confluent.io/current/schema-registry/docs/index.html
These are a few links to material diving into the topics of this module in more depth.
06 The Confluent Platform
Agenda
1. Introduction
7. Conclusion
Central Nervous System
You'll see how an event streaming platform sits at the center of your applications and data infrastructure, connecting to your existing apps, 3rd party apps, as well as to your event-driven applications. For this reason, customers often describe it as the central nervous system of their organization.
The Maturity Model
The journey starts with awareness and moves up a curve, ending with a streaming platform
1. you are starting to realize the potential that streaming offers to your company
3. now you put a mission critical application into production and setup management and
monitoring in parallel
4. more and more departments realize the benefit of streaming and like to participate. Silos
are broken up
5. Everything is integrated with your company’s streaming platform. The platform acts as
the backbone or even better, as the central nervous system. It allows for self-servicing
Using Apache Kafka® on this journey is a good starting point. Yet along your journey, as your
company matures you may encounter some challenges that the Confluent Platform can help
you to solve in an elegant, cost and resource effective way. Confluent accompanies you with
products, processes and support on all stages of the journey, until you stream everything.
Confluent Platform
Confluent Platform complements Apache Kafka with a complete set of capabilities around connectivity, stream processing and operations that maximize developer productivity and operational efficiency.
Developers focus primarily on building streaming applications on top of Kafka, caring mainly about the data and topics needed to solve important business problems. Others, like BI
analysts, mainly run queries against Kafka and need to know what data is available and how
to access it to feed their analytics toolset. Adding to this diversity, users also vary in their
preferred modes of interacting with software. Some prefer to solve their problems by
writing code, while others prefer to click buttons.
Confluent Platform
Confluent Platform
Confluent is Enabling Event-driven Transformation
across Industries
Confluent Platform
Complete Set of Development, Operations and
Management Capabilities to run Kafka at Scale
Confluent Platform
Let’s walk through an overview of Confluent Platform, the enterprise-ready event streaming
platform built by the original creators of Apache Kafka.
Apache Kafka
• Let’s start with Kafka. A lot of people think of Kafka as pub/sub messaging, but it’s much
more than that. At its core, Kafka delivers what we call a continuous commit log, which
allows users to treat data neither as stored records nor transient messages, but as a
continually updating stream of events. These streams are readily accessible, fast enough
to handle events as they occur, and able to store events that previously occurred.
• Over the years, Kafka has evolved to add capabilities that complement the continuous commit log. In Kafka 0.9, we added Kafka Connect, a framework for building connectors to external systems such as databases, applications, other messaging systems, and so on.
• Then, in Kafka 0.10, we added Kafka Streams, another framework, this time for building stateful and stateless applications that can process and enrich the event streams flowing through the platform in real time (a minimal sketch follows this list).
• The combination of the continuous commit log, along with the Connect and Streams
frameworks constitutes what we call an Event Streaming Platform. However, running
Kafka at enterprise-scale can come with its own set of challenges, so our goal at
Confluent is to deliver an enterprise-ready platform that can become the central nervous
system for any organization.
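To make the Streams framework concrete, here is a minimal sketch of a stateless Kafka Streams application in Java. The application id, the broker address and the topic names (payments, payments-uppercased) are illustrative assumptions, not values from the course environment; the API calls themselves (StreamsBuilder, mapValues, to) are standard Kafka Streams.

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;

    public class UppercaseApp {
        public static void main(String[] args) {
            // Basic configuration; the application id also names the consumer group.
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-demo");        // assumed id
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");     // assumed broker
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            // Build a topology: read a stream, transform each value, write to another topic.
            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> source = builder.stream("payments");             // assumed input topic
            source.mapValues(value -> value.toUpperCase())
                  .to("payments-uppercased");                                        // assumed output topic

            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.start();

            // Close the application cleanly on shutdown.
            Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        }
    }

The same topology could be made stateful by adding operations such as groupByKey and count, which is where the framework’s state store management comes into play.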
• The first thing the Confluent Platform (CP) does to deliver a complete event streaming
platform is to complement Kafka with advanced development and stream processing
capabilities.
• To enhance the continuous commit log, CP offers tools to facilitate the flow of event
streams through the platform:
◦ Native client libraries to develop custom Kafka producers and consumers in programming languages other than Java or Scala, which are the ones packaged with open source Kafka. Confluent Platform adds clients for C/C++, Python, Go and .NET.
◦ REST Proxy provides a simple RESTful interface to produce and consume messages into
Kafka, view the state of the cluster, and perform administrative actions without using
the native Kafka protocol or clients.
◦ For customers working on IoT projects, CP’s MQTT Proxy offers an interface for MQTT
clients to produce messages into Kafka without the need for third party brokers.
◦ As event streams continue to flow into the event streaming platform, customers need a way to ensure data compatibility between all the producers and consumers. Schema Registry stores a versioned history of all schemas and allows schemas to evolve while preserving backwards compatibility for existing consumers (a short producer sketch using Schema Registry follows this list).
• Confluent Platform includes connectors that run on Kafka Connect. These are pre-built connectors for popular data sources and sinks, such as S3, HDFS, Elasticsearch, IBM MQ, Splunk, Cassandra and many others. Many connectors are developed and supported by Confluent, and there are many more developed by partners and the broader Kafka community. To allow easy access to them, Confluent has an online marketplace called Confluent Hub.
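As a rough illustration of how the client libraries and Schema Registry work together, the sketch below uses the standard Java producer with Confluent’s Avro serializer, so that the record schema is registered with and validated by Schema Registry. The broker address, Schema Registry URL, topic name and schema are assumptions made for this example only.

    import java.util.Properties;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class AvroPaymentProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumed broker
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                      "org.apache.kafka.common.serialization.StringSerializer");
            // The Avro serializer contacts Schema Registry to register and check schemas.
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                      "io.confluent.kafka.serializers.KafkaAvroSerializer");
            props.put("schema.registry.url", "http://localhost:8081");              // assumed Schema Registry URL

            // A tiny illustrative schema; a real project would load it from an .avsc file.
            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Payment\",\"fields\":["
              + "{\"name\":\"id\",\"type\":\"string\"},"
              + "{\"name\":\"amount\",\"type\":\"double\"}]}");

            GenericRecord payment = new GenericData.Record(schema);
            payment.put("id", "P-1001");
            payment.put("amount", 42.50);

            try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
                // Send one record keyed by the payment id.
                producer.send(new ProducerRecord<>("payments", "P-1001", payment));
                producer.flush();
            }
        }
    }

If a later producer tried to publish an incompatible version of the Payment schema, Schema Registry would reject the new schema according to the configured compatibility rules, protecting existing consumers.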
Mission-Critical Reliability
◦ Confluent Control Center provides a simple way to view and understand key aspects and metrics of Kafka using a graphical user interface. It allows customers to track key metrics about event streams with expertly curated dashboards, check the health of Kafka brokers, and it provides useful integration points into other components of Confluent Platform, such as Kafka Connect, ksqlDB and Schema Registry. Overall, Control Center is a critical component for meeting event streaming SLAs in distributed Kafka environments.
◦ Scaling Kafka up and down can be challenging at larger scale. Auto Data Balancer
monitors the cluster and shifts data automatically to balance the load across brokers.
This helps customers expand and contract the size of the cluster more dynamically.
◦ Organizations tapping into cloud native operating models have started to standardize
on Kubernetes as their platform runtime. Operator provides an implementation of the
Kubernetes Operator API to automate the management of Confluent Platform in a
cloud-native environment, and performs key operations such as rolling upgrades.
◦ Confluent Platform provides a series of security plugins and features.
◦ Starting with CP 5.3, Role-Based Access Control (RBAC) is introduced, which allows users to be added to predefined roles, ensuring that each persona in the organization has the right level of access to Kafka. This is a key feature that customers have been asking for.
• And mission-critical reliability would not be complete without comprehensive enterprise support. With Confluent Platform, you will have access to 24/7 support and experts with the deepest knowledge base on Kafka and Confluent Platform, with guaranteed sub-1-hour response times and full application lifecycle support from development to operations.
Freedom of Choice
◦ Confluent Platform can run on any kind of infrastructure, from bare metal to VMs to containers, and can be deployed on any cloud, private or public, with the consistency of a single platform.
◦ Moreover, Confluent delivers choice in the operational model. Customers can run Confluent Platform as self-managed software, but they can also offload operations to Confluent’s experts through Confluent Cloud, the industry’s only fully managed, complete streaming service.
◦ Customers can build a bridge to cloud by streaming data between private and public
clouds, enabling the use of best-of-breed cloud services regardless of where the data is
produced.
Confluent Platform Deployment Models
There are two primary ways you can use Confluent Platform. The on-premises distribution is called Confluent Platform, and Confluent Cloud is Confluent’s fully managed solution in the public cloud.
Confluent Cloud
Confluent Cloud is a fully managed offering of the whole Confluent Platform. It frees you from managing your own deployment and allows you to concentrate on the things that bring value to your business, such as your context-driven streaming applications.
Confluent Cloud is available on AWS and GCP and will also be available very soon on MS Azure.
Confluent Control Center
Kafka is powerful … but has many parts
We all know that Kafka is very powerful and very popular. But there are also many things you need to both understand and do in order to use it. There are brokers, topics, partitions, and other things that operators and application developers need to become familiar with in order to use Kafka. It’s both a lot to understand and a lot to manage as part of standing up and running a Kafka deployment.
Confluent Control Center
Management and Monitoring for the Enterprise
Control Center has swiftly come of age in the past few releases since Confluent Platform 5.0, introducing important enhancements such as a view of consumer lag, topic inspection, Schema Registry integration, the ksqlDB UI, dynamic broker configuration and several more.
It’s time for us to improve the way that Control Center allows you to explore and
understand Apache Kafka. Through extensive customer research, we have mapped out the
best way to build a mental model of the composition of the Kafka infrastructure.
Whether you are a Kafka operator primarily concerned with keeping Kafka running,
understanding how the various components of the architecture fit together and configuring
those components the right way - or you are a developer who wants to create topics to
produce into and consume from, view and manage schemas, create connections and write
queries to power your streaming applications - the redesigned GUI will allow you to slice and dice through your Apache Kafka infrastructure through sensible categories and workflows,
offering the simplest way to understand and control Apache Kafka.
Confluent CLI
Overview
• Availability: CP 5.3
• Platforms: Linux, Unix-based
• License: Proprietary
• Packaging: Independent of CP
• Grammar: <resource: noun> <optional subresource: noun> <operation: verb>

Key features
• RBAC management
• Password protection
• Subsumed confluent-cli commands for local

Manage your Confluent Platform.

Usage:
  confluent [command]

Available Commands:
  completion  Print shell completion code.
  help        Help about any command
  iam         Manage RBAC and IAM permissions.
  local       Manage local Confluent Platform development environment.
  login       Login to Confluent Platform.
  logout      Logout of Confluent Platform.
  secret      Manage secrets for Confluent Platform.
  update      Update the confluent CLI.
  version     Print the confluent CLI version.
  ...
Confluent Platform 5.3 introduces a new CLI that is ready for production environments and fully supported by Confluent. The new CLI offers the following characteristics:
• Consistent – CLI grammar that’s intuitive and easy-to-understand so new concepts fit
into your existing mental model
• Unambiguous – Commands will be self-explanatory for those core concepts with which
you are familiar, e.g. topics, schemas, etc.
Role Based Access Control
RBAC
• Availability: As of CP 5.3 (As a Preview)
Modern enterprises are extremely careful about ensuring their data is secure. Security
breaches are very costly and put a major burden on organizations around system downtime
and damage to their reputation in front of customers and partners. In many cases,
organizations have dedicated security staff to ensure that any new platform that is utilized
in production follows strict security best practices.
One of the pillars of a comprehensive security strategy is access controls. As the number of
enterprise applications increases, so does the number of users (developers and operators)
involved across multiple groups. Managing users and groups at scale is complex without a
framework to provide controlled access to applications according to the roles of the users.
This is especially true with Apache Kafka, because of its distributed nature. We have heard
from Kafka users that it can take them several days to set ACLs for all their teams, because
they do not know about the content of topics and who should have access to them. Overall,
the process of setting access controls takes too long when it involves authorization for 100s
or 1000s of resources; it is difficult to maintain and manage, and it is error-prone.
Confluent Platform 5.3 introduces Role-Based Access Control (RBAC) in Preview for
development environments. These are some of the key characteristics of RBAC:
• Authorization is enforced via all user interfaces:
◦ APIs
• A user who is assigned a role receives all the privileges of that role
While the overall benefit of increased enterprise-level security across the entire platform is
obvious, RBAC also allows you to:
• Delegate responsibility of managing permissions and access to true resource owners, such
as departments/business units. This greatly simplifies the management of authorization
at scale in shared platforms
• UserAdmin - Responsible for managing user/groups and mapping them to roles (setting
role bindings)
• ClusterAdmin - Responsible for managing Kafka clusters and brokers, including ksqlDB and Connect clusters. Can create topics but cannot read from or write to topics. Cannot create or change applications. Can change the setup of topics, etc. Can make a user/group the owner of a resource or of an application. Can manage quotas.
• Operator - Responsible for monitoring the health of the application(s). Can start/stop
applications but cannot change them
• SecurityAdmin - Responsible for managing security initiatives, such as audit logs.
• ResourceOwner - Owns a set of resources and can grant permission to others to get
access to those resources
Confluent Operator
For Whom:
Organizations that have standardized on
Kubernetes as the platform runtime.
• That is why we created Confluent Operator. It runs Kafka, and soon the entire CP, on Kubernetes.
• Our fully managed service, Confluent Cloud, runs Confluent Operator underneath the covers. It is well tested and includes several operational improvements reflecting our experience of running hosted Kafka.
Confluent Operator Capabilities
Deploy
Update / Upgrade
Monitor
Confluent - Feature Comparison
Feature  | Confluent Platform (On Premise)            | Confluent Cloud (Fully Managed)
Scale    | Unlimited throughput, unlimited retention  | Unlimited throughput, unlimited retention (CCE)
Support  | 24x7 Gold & Platinum (option)              | 24x7 Gold, VPC peering (option)
• Pay only for what you use, based on ingress, egress and storage
• User-level usage metrics available to Cloud Enterprise customers via Kafka API
• Now on AWS/GCP: SOC I, II, III, PCI phase 1 (requires message-level encryption), and HIPAA
Broad Connector Eco-System
Kafka Connect is part of Apache Kafka, and thus fully open source. But Confluent doesn’t stop there. Let’s first recap before we dive into the extras that our company provides.
Kafka Connect, a part of the Apache Kafka project, is a standardized framework for handling import and export of data from/to Kafka. This framework can address a variety of use cases; it makes adopting Kafka much simpler for users with various data stores, encourages an ecosystem of tools for integrating other systems with Kafka using a unified interface, and provides a better user experience, guarantees, and scalability than other frameworks that are not Kafka-specific.
Kafka Connect is a pluggable framework for inbound and outbound connectors. It’s fully
distributed and integrates Kafka with a number of sources and sinks. This approach is
operationally simpler, particularly if you have a diverse set of other systems you need to
copy data to/from as it can be holistically and centrally managed. It also pushes a lot of the
hard work of scalability into the framework, so if you want to work in the Kafka Connect
framework you would only have to worry about getting your data from a source into the
Kafka Connect format – scalability, fault tolerance, etc. would be handled intrinsically, as the sketch below illustrates.
Most of the data systems illustrated here are supported with both Sources and Sinks
(including JMS, JDBC, and Cassandra).
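To give a feel for what “getting your data into the Kafka Connect format” means, here is a heavily simplified sketch of the task half of a custom source connector in Java. The class name, the topic default and the fake data source are assumptions for illustration; a complete connector also needs a SourceConnector class, a ConfigDef and real offset handling.

    import java.util.Collections;
    import java.util.List;
    import java.util.Map;
    import org.apache.kafka.connect.data.Schema;
    import org.apache.kafka.connect.source.SourceRecord;
    import org.apache.kafka.connect.source.SourceTask;

    public class GreetingSourceTask extends SourceTask {
        private String topic;

        @Override
        public String version() {
            return "0.0.1";
        }

        @Override
        public void start(Map<String, String> props) {
            // The Connect framework passes the connector configuration in here.
            topic = props.getOrDefault("topic", "greetings");   // hypothetical config key
        }

        @Override
        public List<SourceRecord> poll() throws InterruptedException {
            Thread.sleep(1000);   // stand-in for waiting on an external system
            // Wrap one value in the Connect format; the partition and offset maps
            // would normally describe where we are in the external source.
            SourceRecord record = new SourceRecord(
                    Collections.singletonMap("source", "demo"),   // source partition
                    Collections.singletonMap("position", 0L),     // source offset
                    topic,
                    Schema.STRING_SCHEMA,
                    "hello from a custom source");
            return Collections.singletonList(record);
        }

        @Override
        public void stop() {
            // Nothing to clean up in this sketch.
        }
    }

Scaling, offset commits, retries and delivery to Kafka are then handled by the Connect workers rather than by the connector author.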
Confluent Hub
Confluent Hub is an online library of pre-packaged, ready-to-install add-ons for Confluent Platform and Apache Kafka. You can browse the large ecosystem of connectors, transforms, and converters to find the components that suit your needs and easily install them into your environment.
Confluent Hub
GOAL:
Connect Everything!
Confluent has the ambitious goal of connecting everything with Kafka! All connectors will be available through the Confluent Hub and be fully supported by Confluent or certified partners. Some connectors are, and will continue to be, contributed and supported by the community.
Module Review
Question:
Hands-On Lab
07 Conclusion
Course Review
You are now able to:
Verify for yourself that you are now really able to successfully perform all of the tasks listed above. If not, please feel free to contact us at [email protected] or seek help on the numerous Confluent Community Slack channels.
Other Confluent Training Courses
• Apache Kafka® Administration by Confluent
◦ Inter-cluster design
◦ How the REST Proxy supports development in languages other than Java
◦ How to use the Schema Registry to store Avro, Protobuf, and JSON data in Kafka
◦ How to write streaming applications with Kafka Streams API and KSQL
◦ Installing ksqlDB containerized and natively
Confluent Certified Administrator for Apache Kafka
Duration: 90 minutes
Cost: $150
Benefits:
• Digital certificate and use of the official Confluent Certified Operator Associate logo
Exam Details:
Confluent Certified Developer for Apache Kafka
Duration: 90 minutes
Cost: $150
Benefits:
• Digital certificate and use of the official Confluent Certified Developer Associate logo
Exam Details:
Thank You
• Thank you for attending the course!
• Your instructor will give you details on how to access the survey.