CDC patterns in Apache Kafka®

Hi!
○ Mario Molina.
○ Data Engineer.
○ Working in data & all things related since 2005.
○ You can find me at:
mmolimar_
mmolimar
mmolimar

Needs
● System integration / Data replication.
○ Usually done via API calls, with an orchestration engine or a service mesh for microservices.
○ But… sometimes no such APIs exist, or the legacy systems are too complicated to integrate that way.
● Migration.
○ From monolithic to a microservices architecture.
○ From one database type to another.
● Audit trail of changes.
○ Analysis of changes.
○ Identification of patterns.

What is CDC?
● Stands for Change Data Capture.
● Identifies changes that happen in source systems in order to sync other systems.
● Provides observability into what is happening in the applications (data layer).
● Most commonly applied to database systems.
● Helps to avoid dual writes.

Patterns - Modification Date
● Add a column to each table holding the date/time of each record's last modification.
● This column must be available in all tables we want to track.
● Be sure this column is reliably set (in the app or even with database triggers).
● Query the data based on a date range.
● Take into account the number of inserts/updates for this range.
● Deletes might be an issue. Use logical deletes instead.
SELECT *
FROM sample_table
WHERE updated_time > now() - interval '1 second';
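
A minimal sketch of the polling loop behind this pattern, using Python's built-in sqlite3 so it runs anywhere. Instead of a sliding `now() - interval` window, it keeps the last `updated_time` it has seen (a watermark), which is one common way to avoid missing or re-reading rows between polls. The columns `name` and `deleted`, the integer timestamps, and the function `fetch_changes` are assumptions made for illustration; they are not part of the original slides.

```python
import sqlite3

def fetch_changes(conn, last_seen):
    """Return rows modified after `last_seen`, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, name, updated_time, deleted FROM sample_table "
        "WHERE updated_time > ? ORDER BY updated_time",
        (last_seen,),
    ).fetchall()
    # The watermark only moves forward when new rows were found.
    new_watermark = rows[-1][2] if rows else last_seen
    return rows, new_watermark

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sample_table ("
    "id INTEGER PRIMARY KEY, name TEXT, "
    "updated_time INTEGER, deleted INTEGER DEFAULT 0)"
)
conn.execute("INSERT INTO sample_table VALUES (1, 'a', 100, 0)")
conn.execute("INSERT INTO sample_table VALUES (2, 'b', 105, 0)")

# First poll: everything is new.
changes, watermark = fetch_changes(conn, 0)

# An update and a logical delete; both touch updated_time,
# so the next poll can see them.
conn.execute("UPDATE sample_table SET name = 'a2', updated_time = 110 WHERE id = 1")
conn.execute("UPDATE sample_table SET deleted = 1, updated_time = 120 WHERE id = 2")

changes2, watermark2 = fetch_changes(conn, watermark)
```

The logical delete shows up in the second poll as a normal row with `deleted = 1`; a physical `DELETE` would simply disappear and never be captured by this query.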

Patterns - Triggers
● Database triggers can store each change made to the original table in a “shadow” table.
● Store the whole record or just the PK from the original table.
● Must be defined for each table to track.
● Adds overhead to the database.
● Ensure all the triggers are enabled.
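
A small sketch of the trigger pattern, again with Python's built-in sqlite3. It stores only the operation and the PK in the shadow table. The names `sample_table_shadow`, `op`, `pk` and the trigger names are hypothetical, and trigger syntax differs between databases, so treat this as an illustration rather than a portable script.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sample_table (id INTEGER PRIMARY KEY, name TEXT);

-- Shadow table: one row per captured change (operation + PK only).
CREATE TABLE sample_table_shadow (
    change_id INTEGER PRIMARY KEY AUTOINCREMENT,
    op TEXT NOT NULL,
    pk INTEGER NOT NULL
);

-- One trigger per operation, defined for each tracked table.
CREATE TRIGGER sample_table_ins AFTER INSERT ON sample_table
BEGIN
    INSERT INTO sample_table_shadow (op, pk) VALUES ('I', NEW.id);
END;

CREATE TRIGGER sample_table_upd AFTER UPDATE ON sample_table
BEGIN
    INSERT INTO sample_table_shadow (op, pk) VALUES ('U', NEW.id);
END;

CREATE TRIGGER sample_table_del AFTER DELETE ON sample_table
BEGIN
    INSERT INTO sample_table_shadow (op, pk) VALUES ('D', OLD.id);
END;
""")

conn.execute("INSERT INTO sample_table VALUES (1, 'a')")
conn.execute("UPDATE sample_table SET name = 'b' WHERE id = 1")
conn.execute("DELETE FROM sample_table WHERE id = 1")

# Changes in the order they happened.
captured = conn.execute(
    "SELECT op, pk FROM sample_table_shadow ORDER BY change_id"
).fetchall()
```

Unlike the modification-date pattern, the physical delete is captured here, at the cost of an extra write per change.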

Patterns - Log based
● Changes in the database are stored in its transaction log.
● Changes are published by reading this log.
● Adds little overhead to the database and requires no extra SQL or schema changes.
● More reliable than the query- and trigger-based patterns.
● Each database implements its own way of representing changes.
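
To make the consuming side of log-based CDC concrete, here is a sketch that replays a list of change events onto an in-memory replica. The event shape (`op`, `before`, `after`) is a simplified, hypothetical format loosely modeled on the envelopes that log-based tools emit; real events carry more metadata and vary per tool and per database.

```python
def apply_change(replica, event):
    """Apply one change event to `replica`, a dict keyed by primary key."""
    op = event["op"]
    if op in ("c", "u"):          # create or update: take the new row image
        row = event["after"]
        replica[row["id"]] = row
    elif op == "d":               # delete: the old row image carries the key
        replica.pop(event["before"]["id"], None)
    else:
        raise ValueError(f"unknown operation: {op!r}")
    return replica

# Hypothetical events, in the order they appear in the transaction log.
events = [
    {"op": "c", "before": None, "after": {"id": 1, "name": "a"}},
    {"op": "c", "before": None, "after": {"id": 2, "name": "b"}},
    {"op": "u", "before": {"id": 2, "name": "b"}, "after": {"id": 2, "name": "b2"}},
    {"op": "d", "before": {"id": 1, "name": "a"}, "after": None},
]

replica = {}
for event in events:
    apply_change(replica, event)
```

Because the log records every operation in order, deletes and intermediate updates are not lost, which is what the polling pattern cannot guarantee.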

Patterns - Diff or custom scripts
● Compute the difference between the previous and the current state.
● Via SQL or custom implementations (based on the data source).
● Might generate more overhead.
● Requires more maintenance.
● More oriented to specific use cases.
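
As an illustration of the diff approach, the sketch below compares two full snapshots keyed by primary key. The function name `diff_snapshots` and the sample data are made up for this example. It also hints at the overhead mentioned above: both snapshots have to be read in full on every run.

```python
def diff_snapshots(previous, current):
    """Compare two snapshots (dicts keyed by PK) and classify the changes."""
    inserted = {k: current[k] for k in current.keys() - previous.keys()}
    deleted = {k: previous[k] for k in previous.keys() - current.keys()}
    updated = {
        k: (previous[k], current[k])
        for k in previous.keys() & current.keys()
        if previous[k] != current[k]
    }
    return inserted, updated, deleted

previous = {1: {"name": "a"}, 2: {"name": "b"}, 3: {"name": "c"}}
current = {1: {"name": "a"}, 2: {"name": "b2"}, 4: {"name": "d"}}

inserted, updated, deleted = diff_snapshots(previous, current)
```

Only the net difference is visible: if a record changed twice between snapshots, the intermediate state is not captured.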

Considerations
● The same approach cannot be applied to all data sources.
● Understanding of the data changed (context).
● Management of deleted records.
● Schema changes in the data source.
● Heavy initial loads due to huge volumes of data.

Some products
● Commercial:
○ Oracle GoldenGate - https://www.oracle.com/integration/goldengate
○ Qlik Replicate - https://www.qlik.com/us/data-streaming/data-streaming-cdc
○ HVR - https://www.hvr-software.com/product/change-data-capture
○ AWS Database Migration Service - https://aws.amazon.com/dms
● Open source:
○ Debezium - https://debezium.io
○ Maxwell - https://maxwells-daemon.io
○ MongoDB Kafka connector - https://docs.mongodb.com/kafka-connector

What about Kafka?
(Diagram: Apps acting as Producers and Consumers, Connectors acting as Sources and Sinks, and Apps using Streams.)

With Kafka Connect
● We already have a bunch of connectors:
○ Debezium (PostgreSQL, MySQL, SQL Server, MongoDB).
○ MongoDB.
○ Oracle CDC.
○ SQData.
○ TiDB.
○ …
● You can find more in Confluent Hub!
○ https://www.confluent.io/hub

Demo - Database Sync
(Diagram: MongoDB, Kafka Connect with mongo-kafka, kukulcan, Kafka Connect with kafka-connect-jdbc, PostgreSQL.)
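
A rough sketch of how the two connectors in a setup like this could be registered. Kafka Connect accepts new connectors as a JSON body with `name` and `config` on `POST /connectors`; the block only builds those bodies and sends nothing. The direction (MongoDB as source, PostgreSQL as sink) is an assumption, since the slides do not state it, and the connector names, hosts, database and collection are hypothetical. Real deployments need more settings (converters, credentials, key handling).

```python
import json

def connector_request(name, config):
    """Build the JSON body for Kafka Connect's POST /connectors endpoint."""
    return json.dumps({"name": name, "config": config}, indent=2)

# Assumed source side: the MongoDB connector (mongo-kafka).
mongo_source = connector_request("mongo-source", {
    "connector.class": "com.mongodb.kafka.connect.MongoSourceConnector",
    "connection.uri": "mongodb://localhost:27017",   # hypothetical host
    "database": "demo",                              # hypothetical database
    "collection": "sample",                          # hypothetical collection
})

# Assumed sink side: the JDBC connector (kafka-connect-jdbc) writing to PostgreSQL.
jdbc_sink = connector_request("jdbc-sink", {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "connection.url": "jdbc:postgresql://localhost:5432/demo",  # hypothetical
    "topics": "demo.sample",                         # hypothetical topic name
})

print(mongo_source)
print(jdbc_sink)
```

Each body would then be posted to the Connect worker's REST endpoint, for example with an HTTP client or from a Kukulcan shell.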

kukulcan
○ A REPL for Apache Kafka.
○ Supports POSIX and Windows OS.
○ Written in Scala, Java and Python.
○ Shells in: Ammonite REPL, Scala REPL, JShell, and Python shell.
○ APIs for Admin, Producer, Consumer, Connect, Streams, Schema Registry, and KSQL.
○ https://github.com/mmolimar/kukulcan

Ammonite scripts
○ Scripts to run the demo in Kukulcan.
○ Source code: https://github.com/mmolimar/meetups
○ Documentation: https://github.com/mmolimar/meetups/tree/master/kafka-cdc

mongo-kafka
○ Source code: https://github.com/mongodb/mongo-kafka
○ Documentation: https://docs.mongodb.com/kafka-connector
○ Confluent Hub: https://www.confluent.io/hub/mongodb/kafka-connect-mongodb

kafka-connect-jdbc
○ Source code: https://github.com/confluentinc/kafka-connect-jdbc
○ Documentation: https://docs.confluent.io/current/connect/kafka-connect-jdbc
○ Confluent Hub: https://www.confluent.io/hub/confluentinc/kafka-connect-jdbc

Getting involved with Apache Kafka
○ Website: http://kafka.apache.org
○ Join the mailing lists:
○ users@kafka.apache.org
○ dev@kafka.apache.org
○ Slack: https://confluentcommunity.slack.com
○ Meetups: https://www.meetup.com/<LOCATION>-Kafka
○ Contribute: https://github.com/apache/kafka
○ Kafka Summit 2021: https://kafka-summit.org

THANKS!
Any questions?
mmolimar
mmolimar
mmolimar_
