
Change Data Capture (CDC) for Iceberg

Yufei Gu ([email protected])

Background and Motivation


Users want to capture data changes in Iceberg tables continuously for the following use cases:
1. Incrementally refreshing materialized views
2. Updating downstream tables (Iceberg tables/Vertica tables/Snowflake tables)
3. Keeping CDC records for auditing
The existing incremental read comes close to a CDC feature, but with the following limitations:
1. It only supports snapshots with DataOperations.APPEND.
2. Its support for the Merge-On-Read table format is limited.
3. It does not output deleted or updated rows.
These limitations force users to develop their own tools to meet their requirements.
Other table formats provide a CDC feature. For example, Databricks' Delta Lake has Change Data Feed, which generates CDC records and is popular among its users.
This document proposes a way to generate CDC records for Iceberg tables.

Goal
1. Support row-level CDC record generation for Iceberg tables
2. Support all Iceberg operation types (Append, Delete, Replace, Overwrite)
3. Emit all CDC record types: update/insert/delete
4. Require no table spec change and no write-time logging
5. Support the v1 table format, in which the write mode is Copy-On-Write only
6. Support the v2 table format, in which the write mode can be Merge-On-Read, Copy-On-Write, or a mix of both

Non-goal

1. This proposal doesn't distinguish the write order of rows inside one snapshot. The event granularity is at the snapshot level.
2. This proposal doesn't discuss how CDC records are persisted. CDC records cannot be generated from expired snapshots, so users need a way to persist CDC records if snapshots are expired frequently. For example, CDC records can be saved in another Iceberg table or in Kafka.
Proposal

CDC Record Format


The following metadata columns are added to CDC records:

1. CDC record type (insert, update preImage, update postImage, delete)
2. Commit timestamp
3. Commit snapshot id
4. Commit order (sequence number for v2 tables, snapshot order number for v1 tables)

table_col1 | table_col2 | ... | _record_type | _commit_snapshot_id | _commit_timestamp | _commit_order

For example, here is what a CDC record looks like for an append operation on a table with schema (id, name, age).

INSERT INTO student VALUES (1, 'Amy', 21)

ID | Name | Age | _record_type | _commit_snapshot_id | _commit_timestamp   | _commit_order
1  | Amy  | 21  | Insert       | 8563478696506531065 | 2021-01-23 04:30:45 | 34
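
If the metatable interface proposed later in this document were available, such records could be retrieved with a query along these lines (a sketch only; 'student.cdc' follows the hypothetical naming used in the User Interface Changes section):

SELECT id, name, age, _record_type, _commit_snapshot_id, _commit_timestamp, _commit_order
FROM 'student.cdc';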

Solution
Append-only snapshots are handled by emitting all new rows as CDC insert records.

Delete-only snapshots are handled by emitting all deleted rows as CDC delete records.

Replace snapshots are ignored. This type of snapshot is generated by rewriting data files and metadata files without changing any data in the table.

Overwrite snapshots require special consideration: an overwrite can contain updated rows as well as new rows and deleted rows. Updates can be handled either by emitting a delete and an insert CDC record for each updated row, or by attempting to determine whether a record was deleted or updated. Users can decide at execution time whether to attempt to discern updates or to just return inserts and deletes.

Updated rows are difficult because within Iceberg files, rows are only ever deleted or inserted. There is no concept of "update" in the Iceberg persistence layer, so determining updates requires additional work. One of the major challenges is to reconstruct updates from deletes and inserts, so users are given two options:
1. Generate insert and delete CDC records by returning all deleted rows in existing files as deleted and all rows in new files as inserted (default).
2. Generate update CDC records using identifier columns, provided either by users for the v1 table format (COW) or by the identifier-field-ids in the v2 table format. Identifier columns are a concept similar to primary keys in a relational database system: a row should be unique in a table based on the values of its identifier columns. With identifier columns, updates can be reconstructed from deletes and inserts.

Note: The identifier-field-ids must not change across the snapshots that the CDC generation is based on.
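
As an aside, a v2 table's identifier fields can be declared through Iceberg's Spark SQL extensions; a minimal sketch, reusing the student table from the earlier example (check the exact DDL against your Iceberg version):

ALTER TABLE catalog_name.db.student SET IDENTIFIER FIELDS id;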

Use Case Analysis

1. The sink tables are Iceberg tables.
   1. CDC records for refreshing materialized views. The materialized view will likely be implemented as an Iceberg table, so reconstructing updates is not necessary.
   2. CDC records for a medallion architecture. Reconstructing updates is also not necessary if the Bronze, Silver, and Gold tables are Iceberg tables.
2. The sink tables are not Iceberg tables. For example, they can be Hive tables/Vertica tables/Snowflake tables. Users should provide identifier columns so that they can update the sink tables with the CDC outputs.

Limitations
1. It pushes the responsibility of deciding which columns are identifier columns to users.
2. It may not reconstruct the same updates as at write time.
   1. Users can merge/update with different columns. For example, the columns could be (id, name) or (name, city).
   2. Users can query CDC with different identifier columns.
3. It may not perform as well as the approach of adding meta columns or write-time CDC file logging. It's a tradeoff, though; write-time logging has the following limitations:
   1. It needs changes in all engines (e.g., Spark/Flink/Trino/others), which requires a lot of effort. Some customized Java/Python clients may still break the CDC generation even after all the main engines have been taken care of.
   2. It performs worse at write time. (This solution requires no additional work at write time.)
   3. Extra maintenance is needed for the CDC log files, e.g., expiration, compaction, encryption, etc.

Algorithm
Let's focus on CDC record generation for a single snapshot first. Here are discussions for MOR and COW.

Merge-On-Read
Let's start with an example in which a single snapshot is generated by a Spark "MERGE INTO" operation. The table schema is (id, name, age, _pos), where _pos is the position of the row inside a data file. The column _pos is only for demonstration; it isn't necessarily a real column.

To get CDC records:

1. Get the deleted rows by merging the delete files (pos delete files) added by the snapshot (S2) with their affected data files from this snapshot (S2: No Applicable Files) and previous snapshots (S1: Data File 1).
2. Get the inserted rows by reading the rows from the data files (Data File 2) added by the snapshot (S2).
3. Output CDC records.
   a. (Option 1: Only Inserts and Deletes) Output the deleted/inserted rows as the CDC records when no identifier columns are provided.

ID | Name  | Age | _record_type | _commit_snapshot_id
1  | Amy   | 20  | Delete       | s2
1  | Amy   | 21  | Insert       | s2
5  | Harry | 19  | Delete       | s2
7  | Abby  | 22  | Insert       | s2

   b. (Option 2: Differentiate Updates from Inserts and Deletes) Inner join the deleted rows and inserted rows on the identifier columns (ID) to get the CDC update records. Left outer join the deleted rows with the inserted rows on the identifier columns (ID); the unmatched deleted rows are the CDC delete records. Right outer join the deleted rows with the inserted rows on the identifier columns (ID); the unmatched inserted rows are the CDC insert records. Note: this could be a full outer join in the implementation.
ID | Name  | Age | _record_type | _commit_snapshot_id
1  | Amy   | 20  | PreImage     | s2
1  | Amy   | 21  | PostImage    | s2
5  | Harry | 19  | Delete       | s2
7  | Abby  | 22  | Insert       | s2

Algorithm
With the analysis above, here is the algorithm pseudocode:

deletedRows = get deleted rows by checking the delete files added by the snapshot
if (a user does not provide identifier columns) {
    // Option 1: Only Inserts and Deletes
    output deletedRows as the CDC deleted records
    output all rows in new data files as CDC inserted records
} else {
    // Option 2: Differentiate Updates from Inserts and Deletes
    deletedRows inner join with inserted rows on identifier columns
    output the matched rows as the updated rows

    deletedRows left join with inserted rows on identifier columns
    output the unmatched rows as the deleted rows

    deletedRows right join with inserted rows on identifier columns
    output the unmatched rows as the inserted rows
}

Note: The current way to get deleted rows is to read the data file first, then apply the records from the delete files. This is not efficient when the portion of deleted rows is small; for example, when 10 rows are deleted out of 1M rows. We need a more efficient read.
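
To make Option 2 concrete, here is a minimal Spark SQL sketch of the three joins, assuming the snapshot's deleted and inserted rows have been registered as temporary views named deleted_rows and inserted_rows (hypothetical names) and that id is the identifier column:

-- Updates: pre-images and post-images of rows whose identifier matches
SELECT d.id, d.name, d.age, 'PreImage' AS _record_type
FROM deleted_rows d JOIN inserted_rows i ON d.id = i.id
UNION ALL
SELECT i.id, i.name, i.age, 'PostImage' AS _record_type
FROM deleted_rows d JOIN inserted_rows i ON d.id = i.id
UNION ALL
-- Deletes: deleted rows with no matching inserted row
SELECT d.id, d.name, d.age, 'Delete' AS _record_type
FROM deleted_rows d LEFT JOIN inserted_rows i ON d.id = i.id
WHERE i.id IS NULL
UNION ALL
-- Inserts: inserted rows with no matching deleted row
SELECT i.id, i.name, i.age, 'Insert' AS _record_type
FROM inserted_rows i LEFT JOIN deleted_rows d ON d.id = i.id
WHERE d.id IS NULL;

As noted above, an implementation could fold the four branches into a single full outer join.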

Here are additional examples.

Example: Both Eq delete and Pos delete

Flink can write both pos deletes and eq deletes in a snapshot.
The CDC records of S2:

● Delete: (5, Harry, 19)
● Delete: (1, Amy, 20)
● Insert: (1, Amy, 20)
● Delete: (1, Amy, 20)

Example: Delete an appended row in the same snapshot

This is an edge case, which deletes a row appended in the same snapshot.

Output two CDC records like this:

● Insert: (7, Abby, 22)
● Delete: (7, Abby, 22)

Copy-On-Write
Let's start with an example in which a single snapshot is generated by a "MERGE INTO" operation. The table schema is (id, name, age). The MERGE INTO command deleted row (3, Alice), updated row (1, Amy), and inserted row (7, Abby).
To get CDC records:

1. Skip the "Existing" data files (data file 1), since the rows inside them are not changed.
2. Note that the rows in the "Deleted" data files could be updated, deleted, or unchanged.
3. Left outer join the "Deleted" files (data file 2) with the "Added" data files (data file 3 and data file 4) on all columns to get the deleted rows.
4. Right outer join the "Deleted" files (data file 2) with the "Added" data files (data file 3 and data file 4) on all columns to get the inserted rows.
5. Output CDC records.
   a. (Option 1: Only Inserts and Deletes) Output the deleted/inserted rows as the CDC records when no identifier columns are provided.

ID | Name  | Age | _record_type | _commit_snapshot_id
1  | Amy   | 20  | Delete       | s2
1  | Amy   | 21  | Insert       | s2
3  | Alice | 21  | Delete       | s2
7  | Abby  | 22  | Insert       | s2

   b. (Option 2: Differentiate Updates from Inserts and Deletes) Inner join the deleted rows and inserted rows on the identifier columns (ID) to get the CDC update records. Left outer join the deleted rows with the inserted rows on the identifier columns (ID); the unmatched deleted rows are the CDC delete records. Right outer join the deleted rows with the inserted rows on the identifier columns (ID); the unmatched inserted rows are the CDC insert records.
ID | Name  | Age | _record_type | _commit_snapshot_id
1  | Amy   | 20  | PreImage     | s2
1  | Amy   | 21  | PostImage    | s2
3  | Alice | 21  | Delete       | s2
7  | Abby  | 22  | Insert       | s2

Algorithm
With the analysis above, here is the algorithm pseudocode:

// skip the "Existing" data files
deletedRows = Deleted data files left join with the Added data files on all columns, keeping the unmatched rows
insertedRows = Deleted data files right join with the Added data files on all columns, keeping the unmatched rows
if (a user does not provide identifier columns) {
    output deletedRows as the CDC deleted records
    output insertedRows as the CDC inserted records
} else {
    deletedRows inner join with insertedRows on identifier columns
    output the matched rows as the updated rows

    deletedRows left join with insertedRows on identifier columns
    output the unmatched rows as the deleted rows

    deletedRows right join with insertedRows on identifier columns
    output the unmatched rows as the inserted rows
}
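
As a sketch, the first two steps map naturally onto anti-join semantics in Spark SQL, assuming the rows of the "Deleted" and "Added" data files have been registered as views named deleted_file_rows and added_file_rows (hypothetical names):

-- Net deleted rows: present in the deleted files but not rewritten into the added files
SELECT id, name, age FROM deleted_file_rows
EXCEPT ALL
SELECT id, name, age FROM added_file_rows;

-- Net inserted rows: present in the added files but absent from the deleted files
SELECT id, name, age FROM added_file_rows
EXCEPT ALL
SELECT id, name, age FROM deleted_file_rows;

EXCEPT ALL preserves duplicate rows, which matters when a file contains repeated rows; the identifier-column joins then proceed exactly as in the Merge-On-Read sketch above.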

Merge Records From Multiple Snapshots

There are two use cases when generating CDC records from multiple snapshots.
1. Users require the full change history when the records are used for cases like auditing and replaying changes to other tables. The outputs should include every CDC record. This is trivial to implement.
2. Users require merged CDC records across snapshots, for a case like materialized view refreshing. Here are two merge solutions:
   1. Calculate the CDC records for each individual snapshot, then merge them together by joining the records from the individual snapshots. The joins have to happen in the order of the snapshots.
   2. Calculate the merged CDC records for multiple snapshots directly. This may perform better with optimizations. We leave it for future discussion.

Users should have the option to enable/disable merging.


Here is an example with multiple snapshots on MOR. We try to generate CDC records for snapshots S2 and S3. In this example, S2 deleted row (5, Harry, 19) and inserted row (7, Abby, 22); S3 deleted row (7, Abby, 22) and inserted row (5, Harry, 19).

CDC records for S2:


ID | Name  | Age | _record_type | _commit_snapshot_id
5  | Harry | 19  | Delete       | s2
7  | Abby  | 22  | Insert       | s2

CDC records for S3:


ID | Name  | Age | _record_type | _commit_snapshot_id
5  | Harry | 19  | Insert       | s3
7  | Abby  | 22  | Delete       | s3

For use cases requiring the full history (e.g., audit and replay), the outputs should include the CDC records of every snapshot:

ID | Name  | Age | _record_type | _commit_snapshot_id
5  | Harry | 19  | Delete       | s2
7  | Abby  | 22  | Insert       | s2
5  | Harry | 19  | Insert       | s3
7  | Abby  | 22  | Delete       | s3

For other use cases that do not need the intermediate steps (e.g., materialized view refresh), the output would be empty in this example.

The merge should also consider whether an update can be reconstructed from the given identifier columns. For example, the following CDC records will be considered an update:

ID | Name  | Age | _record_type | _commit_snapshot_id
6  | Alice | 19  | Delete       | s2
6  | Alice | 22  | Insert       | s3

After merging, the CDC records will look like this:

ID | Name  | Age | _record_type | _commit_snapshot_id
6  | Alice | 19  | PreImage     | s2
6  | Alice | 22  | PostImage    | s3
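
Here is a simplified Spark SQL sketch of this merge step, assuming the per-snapshot CDC records for the window sit in a view named cdc_records (hypothetical) and that each identifier value appears at most once as a delete and once as an insert within the window:

-- Reconstruct an update from a delete followed by a later insert with the same identifier
SELECT d.id, d.name, d.age, 'PreImage' AS _record_type, d._commit_snapshot_id
FROM cdc_records d
JOIN cdc_records i
  ON d.id = i.id
 AND d._record_type = 'Delete'
 AND i._record_type = 'Insert'
 AND d._commit_order < i._commit_order
UNION ALL
SELECT i.id, i.name, i.age, 'PostImage' AS _record_type, i._commit_snapshot_id
FROM cdc_records d
JOIN cdc_records i
  ON d.id = i.id
 AND d._record_type = 'Delete'
 AND i._record_type = 'Insert'
 AND d._commit_order < i._commit_order;

A real implementation would additionally emit unmatched deletes and inserts as-is, and drop matched pairs whose pre- and post-images are identical (as in the Harry/Abby example above, where the merged output is empty).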

User Interface Changes

The following interfaces should cover most use cases.
● Spark procedure
  ○ Provide a new procedure which, given a table and snapshot information, produces CDC records.
● Metatable
  ○ Add a postfix identifier, like the metadata tables, which returns CDC records instead of normal table data. This identifier could serve both streaming and batch use cases.

Users should be able to specify the following parameters:

1. Start snapshot id and end snapshot id
2. A time window (start timestamp and end timestamp)
3. The types of CDC records (delete/insert/update), optional
4. Identifier columns for COW tables, optional
Here are interface examples for the Spark procedure and the metatable.

-- Spark procedure
CALL catalog_name.system.table_changes('tableName', start_snapshot_id, end_snapshot_id)
-- for example
CALL catalog_name.system.table_changes('tableName', 0, 10)

CALL catalog_name.system.table_changes('tableName', start_timestamp, end_timestamp)
-- for example
CALL catalog_name.system.table_changes('tableName', TIMESTAMP '2021-01-23 04:30:45', TIMESTAMP '2021-02-23 06:00:00')

-- Metatable
SELECT * FROM 'tableName.cdc' WHERE snapshot_id BETWEEN start_snapshot_id AND end_snapshot_id
-- for example
SELECT * FROM 'tableName.cdc' WHERE snapshot_id BETWEEN 0 AND 10

SELECT * FROM 'tableName.cdc' WHERE timestamp BETWEEN start_timestamp AND end_timestamp
-- for example
SELECT * FROM 'tableName.cdc' WHERE timestamp BETWEEN '2021-01-23 04:30:45' AND '2021-02-23 06:00:00'
