1
Exactly Once Semantics in
Apache Kafka
2
Apache Kafka: A Distributed Streaming Platform
Consumers
Producers
Connectors
Processing
3
A Distributed What???
A Streaming Platform is a little like…
• A messaging system
• Except it scales horizontally, stores the streams persistently,
and allows continuous stream processing
• Hadoop
• But not batch oriented
4
Logs: A data structure for continuous streams
5
APIs
1. Producer and Consumer: Read and write streams
2. Connect: Managed Connectors that connect existing
systems
3. Streams: Transformations of streams
6
Producer & Consumer API
7
Consumers Scale With Groups
8
Connect API
9
Kafka Connect Does The Hard Parts
1. Scale out
2. Fault Tolerance
3. Central Management
4. Schemas
10
11
12
13
14
Streams API
• Full power of a modern stream processing framework
• Distributed and fault-tolerant
• Natively uses event-time
• Stateful processing: joins, aggregations, etc
• Integrates tables and streams
• Easy re-processing
• Just a library
15
Wordcount Example
16
Wordcount Example
17
Deploy as you wish
24
Not Limited To Java
CREATE TABLE possible_fraud AS
SELECT card_number, count(*)
FROM authorization_attempts
WINDOW TUMBLING (SIZE 5 SECONDS)
GROUP BY card_number
HAVING count(*) > 3;
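To make the query's semantics concrete, here is a minimal Java sketch of what a 5-second tumbling window with GROUP BY card_number and HAVING count(*) > 3 computes. It runs over an in-memory list instead of a stream and does not use KSQL or Kafka Streams. The names TumblingFraudCount, Attempt and possibleFraud are hypothetical, and timestamps are assumed to be plain milliseconds.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TumblingFraudCount {
    // Hypothetical event type: one row of authorization_attempts.
    static final class Attempt {
        final String cardNumber;
        final long timestampMs;
        Attempt(String cardNumber, long timestampMs) {
            this.cardNumber = cardNumber;
            this.timestampMs = timestampMs;
        }
    }

    static final long WINDOW_MS = 5_000; // SIZE 5 SECONDS

    // Returns "cardNumber@windowStart" for every card with more than 3 attempts in one window.
    static List<String> possibleFraud(List<Attempt> attempts) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Attempt a : attempts) {
            // Tumbling windows are fixed-size and non-overlapping, aligned to multiples of the size.
            long windowStart = (a.timestampMs / WINDOW_MS) * WINDOW_MS;
            counts.merge(a.cardNumber + "@" + windowStart, 1, Integer::sum); // GROUP BY + count(*)
        }
        List<String> flagged = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > 3) {
                flagged.add(e.getKey()); // HAVING count(*) > 3
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        List<Attempt> attempts = Arrays.asList(
            new Attempt("card-1", 0), new Attempt("card-1", 1_000),
            new Attempt("card-1", 2_000), new Attempt("card-1", 4_999),
            new Attempt("card-2", 100), new Attempt("card-1", 5_000));
        System.out.println(possibleFraud(attempts)); // [card-1@0]
    }
}
```

The fifth attempt on card-1 lands in the next window (starting at 5000 ms), so it does not add to the count of the first one.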
25
The Semantics of Working With Streams
26
Two Problems
1. Duplicate writes
2. Exactly-once processing
27
Problem #1:
Duplicate Writes
28
Duplicate Writes
29
Duplicate Writes
30
Duplicate Writes
31
Duplicate Writes
32
Duplicate Writes
33
Duplicate Writes
34
Duplicate Writes
35
Duplicate Writes
36
Duplicate Writes
37
Problem #2:
Duplicate Processing
38
Read From Offset=0
39
Process and Update State
40
Commit offset 0 as processed
41
Read from offset=1
42
Process and Update State
43
App crashes!
44
Restore from offset=0, resume processing
45
Choose your undesirable semantics
1. Update state, then save offset => At-Least-Once Delivery
2. Save offset, then update state => At-Most-Once Delivery
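The two orderings can be replayed in a small simulation. This is a sketch under stated assumptions, not Kafka code: Durable is a hypothetical stand-in for whatever survives a crash (a state store plus a saved offset), and the crash is injected between the two durable writes while handling offset 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class DeliverySemanticsDemo {
    // Hypothetical durable storage: survives the simulated crash.
    static final class Durable {
        final List<String> state = new ArrayList<>(); // every record applied to state
        int nextOffset = 0;                           // where to resume after a restart
    }

    // Processes the log from the saved offset. If crashAtOffset matches, the method
    // returns between the two durable writes. Pass -1 for a run without a crash.
    static void run(List<String> log, Durable d, boolean stateFirst, int crashAtOffset) {
        for (int offset = d.nextOffset; offset < log.size(); offset++) {
            if (stateFirst) {
                d.state.add(log.get(offset));        // 1. update state
                if (offset == crashAtOffset) return; // crash before the offset is saved
                d.nextOffset = offset + 1;           // 2. save offset
            } else {
                d.nextOffset = offset + 1;           // 1. save offset
                if (offset == crashAtOffset) return; // crash before the state is updated
                d.state.add(log.get(offset));        // 2. update state
            }
        }
    }

    // Crash once while handling offset 1, then restart and run to completion.
    static List<String> crashAndRecover(boolean stateFirst) {
        List<String> log = Arrays.asList("a", "b", "c");
        Durable d = new Durable();
        run(log, d, stateFirst, 1);
        run(log, d, stateFirst, -1);
        return d.state;
    }

    public static void main(String[] args) {
        System.out.println("state, then offset: " + crashAndRecover(true));  // [a, b, b, c]
        System.out.println("offset, then state: " + crashAndRecover(false)); // [a, c]
    }
}
```

Updating state first applies "b" twice (at-least-once); saving the offset first never applies "b" at all (at-most-once).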
46
Two workarounds
1. Make processing idempotent
• Much harder than it sounds in practice
2. Store offset in the application DB and update transactionally
• Not all stores support transactions
• Must handle zombies
47
Solving These Problems
48
Solving Problem #1:
Avoiding Duplicate Writes with an Idempotent
Producer
49
Basic idea
1. Unique ID for each message
2. Server deduplicates
50
Basic idea has problems
1. Random access database of all message ids?
2. Message IDs would be bulky
3. Must handle server fail-over
51
Better idea: Do it like TCP
1. Unique producer id for each producer (PID)
2. Each producer assigns a sequential number to each
message it sends
3. The unique identifier is the PID + sequence number
4. Sequence number and PID both stored in the log
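A minimal sketch of the deduplication rule, assuming a single partition and an in-memory map from PID to the last sequence number appended. PartitionLog and its methods are hypothetical names, not broker internals. In Kafka the PID and sequence number live in the log itself, which is what lets a new leader rebuild this map after fail-over; the sketch skips that part.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class IdempotentLogDemo {
    // Hypothetical single-partition log with per-producer sequence tracking.
    static final class PartitionLog {
        private final List<String> records = new ArrayList<>();
        private final Map<Long, Integer> lastSeq = new HashMap<>(); // PID -> last appended sequence

        // Returns true if the message was appended, false if it was a duplicate
        // (a retry of a message that is already in the log).
        boolean append(long pid, int seq, String value) {
            int last = lastSeq.getOrDefault(pid, -1);
            if (seq <= last) {
                return false; // already written: acknowledge again, do not append
            }
            if (seq != last + 1) {
                // A gap means an earlier message was lost; reject instead of reordering.
                throw new IllegalStateException("expected sequence " + (last + 1) + " but got " + seq);
            }
            records.add(value);
            lastSeq.put(pid, seq);
            return true;
        }

        List<String> records() {
            return records;
        }
    }

    public static void main(String[] args) {
        PartitionLog log = new PartitionLog();
        log.append(7L, 0, "m0");
        log.append(7L, 1, "m1");
        log.append(7L, 1, "m1"); // retry after a lost acknowledgement
        System.out.println(log.records()); // [m0, m1]
    }
}
```

As with TCP, the receiver only needs one small counter per sender, not a database of every message id.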
52
The idempotent producer
53
The idempotent producer
54
The idempotent producer
55
The idempotent producer
56
The idempotent producer
57
The idempotent producer
58
The idempotent producer
59
The idempotent producer
60
Idempotent Producer
• Works transparently – no API changes.
• Fast enough you don’t need to worry about it
• Will be on by default in the future
61
Solving Problem #2:
Avoiding Duplicate Processing with Transactions
62
63
It’s More Complex Than I’ve Let On
• Multiple partitions
• Multiple input streams
• Non-determinism
• Diverse data stores
• Zombies
64
Transactions in Kafka
65
Introducing transactions
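producer.initTransactions();  // registers the transactional.id with the cluster
try {
producer.beginTransaction();
producer.send(record0);
producer.send(record1);
producer.commitTransaction();  // both records become visible together
} catch (ProducerFencedException e) {
producer.close();  // fatal: another producer with the same transactional.id has taken over
} catch (KafkaException e) {
producer.abortTransaction();  // recoverable: neither record is exposed to read_committed consumers
}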
66
Introducing ‘transactions’
67
Initializing ‘transactions’
68
Transactional sends – part 1
69
Transactional sends – part 2
70
Commit – phase 1
71
Commit – phase 2
72
Commit – phase 2
73
Success!
74
Consumer returns only committed messages
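The filtering can be pictured with a toy log that interleaves data records and commit/abort markers. This is a simplified simulation, not the consumer's actual implementation: Entry, Kind and the helper methods are hypothetical, each producer id has at most one open transaction at a time, and a read_committed reader is assumed to stop before the earliest record of any transaction that is still open.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class ReadCommittedDemo {
    enum Kind { DATA, COMMIT, ABORT }

    // Hypothetical log entry: a data record or a control marker written for a producer id.
    static final class Entry {
        final Kind kind;
        final long pid;
        final String value;
        Entry(Kind kind, long pid, String value) { this.kind = kind; this.pid = pid; this.value = value; }
    }

    static Entry data(long pid, String value) { return new Entry(Kind.DATA, pid, value); }
    static Entry commit(long pid) { return new Entry(Kind.COMMIT, pid, null); }
    static Entry abort(long pid) { return new Entry(Kind.ABORT, pid, null); }

    // read_uncommitted: every data record, markers are never handed to the application.
    static List<String> readUncommitted(List<Entry> log) {
        List<String> out = new ArrayList<>();
        for (Entry e : log) {
            if (e.kind == Kind.DATA) out.add(e.value);
        }
        return out;
    }

    // read_committed: only records whose transaction has a commit marker.
    static List<String> readCommitted(List<Entry> log) {
        Map<Long, List<Integer>> open = new HashMap<>(); // pid -> offsets in its open transaction
        Set<Integer> committed = new HashSet<>();
        for (int i = 0; i < log.size(); i++) {
            Entry e = log.get(i);
            if (e.kind == Kind.DATA) {
                open.computeIfAbsent(e.pid, k -> new ArrayList<>()).add(i);
            } else {
                List<Integer> offsets = open.remove(e.pid); // marker closes the transaction
                if (offsets != null && e.kind == Kind.COMMIT) committed.addAll(offsets);
            }
        }
        int limit = log.size(); // do not read past the first still-open transaction
        for (List<Integer> offsets : open.values()) limit = Math.min(limit, offsets.get(0));
        List<String> out = new ArrayList<>();
        for (int i = 0; i < limit; i++) {
            if (committed.contains(i)) out.add(log.get(i).value);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Entry> log = Arrays.asList(
            data(1, "a"), data(2, "x"), data(1, "b"), abort(2), commit(1), data(3, "p"));
        System.out.println(readCommitted(log));   // [a, b]
        System.out.println(readUncommitted(log)); // [a, x, b, p]
    }
}
```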
75
Transactions => Stream Processing
76
Factor problem into two parts
1. Transforming input streams to output streams (Streams)
2. Connecting output streams to data systems (Connect)
77
Stream processing with Kafka
1. Read from input streams
2. Process and update state
3. Produce to output streams
4. Save offsets
78
Stream processing as a sequence of transactions
BEGIN
1. Read from input streams
2. Process and update state
3. Produce to output streams
4. Save Offsets
COMMIT
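A toy model of the BEGIN ... COMMIT loop above, to show why a crash no longer produces duplicates or losses. It is a single-threaded simulation under stated assumptions: Store is a hypothetical stand-in for the durable output stream plus the committed offset, the "processing" is just upper-casing, and the two assignments at COMMIT stand in for an atomic commit that Kafka implements with a coordinator and commit markers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class TransactionalLoopDemo {
    // Hypothetical durable side: output stream and committed input offset.
    static final class Store {
        final List<String> output = new ArrayList<>();
        int committedOffset = 0;
    }

    // Handles one input record per transaction. If crashAtOffset matches, the method
    // returns after producing but before COMMIT. Pass -1 for a run without a crash.
    static void run(List<String> input, Store store, int crashAtOffset) {
        for (int offset = store.committedOffset; offset < input.size(); offset++) {
            // BEGIN: everything below is pending and invisible to readers
            List<String> pendingOutput = new ArrayList<>();
            String record = input.get(offset);                   // 1. read from input stream
            String result = record.toUpperCase(Locale.ROOT);     // 2. process
            pendingOutput.add(result);                           // 3. produce to output stream
            int pendingOffset = offset + 1;                      // 4. save offsets
            if (offset == crashAtOffset) return;                 // crash: pending work vanishes as a unit
            // COMMIT: output and offset become durable together
            store.output.addAll(pendingOutput);
            store.committedOffset = pendingOffset;
        }
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("a", "b", "c");
        Store store = new Store();
        run(input, store, 1);  // crashes while handling "b"
        run(input, store, -1); // restart resumes from the committed offset
        System.out.println(store.output); // [A, B, C]
    }
}
```

In the Kafka Java producer, step 4 is folded into the transaction with sendOffsetsToTransaction, so the consumed offsets commit or abort together with the produced records.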
79
The Theory
• Two Generals
• Atomic Broadcast
• Consensus
80
In Practice
81
Performance
• Up to +20% producer throughput
• Up to +50% consumer throughput
• Up to -20% disk utilization
• Savings start when you batch
• Details: https://bit.ly/kafka-eos-perf
82
Cool!
But how do I use this?
83
Producer Configs
• enable.idempotence = true
• acks = “all”
• retries > 0 (preferably MAX_INT)
• transactional.id = ‘some unique id’
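As a sketch, the four settings above can be collected in a java.util.Properties object, the form the Kafka Java clients accept. The class name, method name and the transactional id value are hypothetical. Connection and serializer settings (bootstrap.servers, key.serializer, value.serializer) are also needed in practice and are left out here.

```java
import java.util.Properties;

class EosProducerConfig {
    // Builds only the exactly-once related producer settings from the slide.
    static Properties producerProps(String transactionalId) {
        Properties props = new Properties();
        props.setProperty("enable.idempotence", "true");
        props.setProperty("acks", "all");
        props.setProperty("retries", String.valueOf(Integer.MAX_VALUE));
        // Must be unique per producer instance and stable across restarts.
        props.setProperty("transactional.id", transactionalId);
        return props;
    }

    public static void main(String[] args) {
        // "orders-processor-1" is a made-up id for illustration.
        producerProps("orders-processor-1").forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```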
84
Consumer configs
• isolation.level:
• “read_committed”, or
• “read_uncommitted”
85
Streams config
• processing.guarantee = “exactly_once”
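The consumer and Streams settings can be sketched the same way, again as plain java.util.Properties with hypothetical class and method names. The Streams setting is written as processing.guarantee, the property name the Kafka Streams library uses in StreamsConfig. Newer Kafka releases offer further values for it, so check the documentation for the version in use.

```java
import java.util.Properties;

class EosReaderConfig {
    // Consumer side: hide records from open or aborted transactions.
    static Properties consumerProps() {
        Properties props = new Properties();
        props.setProperty("isolation.level", "read_committed");
        return props;
    }

    // Streams side: one switch turns on idempotence and transactions internally.
    static Properties streamsProps() {
        Properties props = new Properties();
        props.setProperty("processing.guarantee", "exactly_once");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(consumerProps());
        System.out.println(streamsProps());
    }
}
```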
86
Confluent
• Founded by the original creators of Apache Kafka
• Headquarters based in Palo Alto, CA
KSQL: Streaming SQL for Apache Kafka
Developer Preview
(https://github.com/confluentinc/ksql)
87
Thank You!

Editor's Notes

  • #11: First example is loading data into Hadoop. Note that Kafka is a big, scale-out system, and Hadoop is also a big, scale-out system, so it’s important in this instance that we can scale out Kafka Connect. It’s also important that we capture the metadata or schema information for the records we have in our topics so that we can replicate that into HDFS as well and load data in a structured format like Parquet.
  • #12: Another example is loading data out of a relational database using JDBC. (Note that you can also go the other way, loading data into a DB…that would be a sink and we have one of those coming soon too.) So here you would think you don’t really need the scale-out capability of Connect because you just have a single centralized relational database. But in reality you face a similar problem: instead of replicating one big data system into another big data system you likely have something like this...
  • #13: …here you have lots of little relational databases and you need to manage the replication with all of these. This is part of how the Connect API in Kafka makes this really manageable: even if you have hundreds of databases to pull data from, you can manage these dynamically off a small set of Connect workers. You don’t need to set up one process per database.
  • #14: And of course the idea isn’t that you run connectors to one or another system, but rather that you are able to manage lots of these connections to all different kinds of systems. We’ll dive into Kafka connect in more detail in the third installment of this talk series which goes far deeper into the practice of building streaming pipelines with Kafka.
  • #61: Stress the application level resends bit. Encourage people to rely on the producer retries and not re-send messages from their apps.
  • #72: New concept – control messages. Mention that commit markers are special messages which log the producer ID and the result of the transaction. These messages are not passed on to the application – the client interprets them and acts accordingly.
  • #75: Mention ‘read_uncommitted’. Mention that the buffering is broker side.
  • #77: Transformation may be complex and stateful. The connector is pretty simple and reusable.
  • #78: Solution is to do all these in a transaction
  • #79: Solution is to do all these in a transaction
  • #84: A limit on in-flight requests (max.in.flight.requests.per.connection) is required for idempotence. It will cause a slowdown because you now effectively have a sync producer.