
Change Data Capture (CDC) for Iceberg

Yufei Gu ([email protected])

Background and Motivation


Users want to capture data changes in Iceberg tables continuously for the following use cases:
1. Incrementally refreshing materialized views
2. Updating downstream tables (Iceberg tables/Vertica tables/Snowflake tables)
3. Keeping CDC records for auditing
The existing incremental read comes close to a CDC feature, but with the following limitations:
1. It only supports snapshots with DataOperations.APPEND.
2. Its support for the Merge-On-Read table format is limited.
3. It does not output deleted or updated rows.
These limitations force users to develop their own tools to meet their requirements.
Other table formats provide a CDC feature. For example, Databricks' Delta Lake has Change Data Feed, which generates CDC records and is popular among its users.
This document proposes a way to generate CDC records for Iceberg tables.

Goal
1. Support row-level CDC record generation for Iceberg tables
2. Support all Iceberg operation types (Append, Delete, Replace, Overwrite)
3. Emit all CDC record types: update/insert/delete
4. Require no table spec change and no write-time logging
5. Support the v1 table format, in which the write mode is Copy-On-Write only
6. Support the v2 table format, in which the write mode can be Merge-On-Read, Copy-On-Write, or a mix of both

Non-goal

1. This proposal doesn't distinguish the write order of rows inside one snapshot. The event granularity is at the snapshot level.
2. This proposal doesn't discuss how CDC records are persisted. CDC records cannot be generated from expired snapshots, so users need a way to persist CDC records if snapshots are expired frequently. For example, CDC records can be saved in another Iceberg table or in Kafka.
Proposal

CDC Record Format


The following metadata columns are added to CDC records:

1. CDC record type (insert, update preImage, update postImage, delete)
2. Commit timestamp
3. Commit snapshot id
4. Commit order (sequence number for v2 tables, snapshot order number for v1 tables)

table_col1 | table_col2 | ... | _record_type | _commit_snapshot_id | _commit_timestamp | _commit_order

For example, here is what a CDC record looks like for an append operation on a table with schema (id, name, age).

INSERT INTO student VALUES (1, 'Amy', 21)

ID | Name | Age | _record_type | _commit_snapshot_id | _commit_timestamp   | _commit_order
1  | Amy  | 21  | Insert       | 8563478696506531065 | 2021-01-23 04:30:45 | 34
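
If the metatable interface proposed later in this document were available, such records could be retrieved with a query along these lines (a sketch only; 'student.cdc' follows the hypothetical naming used in the User Interface Changes section):

SELECT id, name, age, _record_type, _commit_snapshot_id, _commit_timestamp, _commit_order
FROM 'student.cdc';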

Solution
Append-only snapshots are handled by emitting all new rows as CDC insert records.

Delete-only snapshots are handled by emitting all deleted rows as CDC delete records.

Replace snapshots are ignored. This type of snapshot is generated by rewriting data files and metadata files without changing any data in the table.

Overwrite snapshots require special consideration: an overwrite can contain updated rows as well as new rows and deleted rows. Updates can be handled either by emitting a delete and an insert CDC record for each updated row, or by attempting to determine whether a record was deleted or updated. Users can decide at execution time whether to attempt to discern updates or to just return inserts and deletes.

Updated rows are difficult because within Iceberg files, rows are only ever deleted or inserted. There is no concept of "update" in the Iceberg persistence layer, so determining updates requires additional work. One of the major challenges is to reconstruct updates from deletes and inserts, so users are given two options:
1. Generate insert and delete CDC records by returning all deleted rows in existing files as deleted and all rows in new files as inserted (default).
2. Generate update CDC records using identifier columns, provided either by users for the v1 table format (COW) or by the identifier-field-ids in the v2 table format. Identifier columns are a concept similar to primary keys in a relational database system: a row should be unique in a table based on the values of its identifier columns. With identifier columns, updates can be reconstructed from deletes and inserts.

Note: The identifier-field-ids must not change across the snapshots that the CDC generation is based on.
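
As an aside, a v2 table's identifier fields can be declared through Iceberg's Spark SQL extensions; a minimal sketch, reusing the student table from the earlier example (check the exact DDL against your Iceberg version):

ALTER TABLE catalog_name.db.student SET IDENTIFIER FIELDS id;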

Use Case Analysis

1. The sink tables are Iceberg tables.
   1. CDC records for refreshing materialized views. The materialized view will likely be implemented as an Iceberg table, so reconstructing updates is not necessary.
   2. CDC records for a medallion architecture. Reconstructing updates is also not necessary if the Bronze, Silver, and Gold tables are Iceberg tables.
2. The sink tables are not Iceberg tables. For example, they can be Hive tables/Vertica tables/Snowflake tables. Users should provide identifier columns so that they can update the sink tables with the CDC outputs.

Limitations
1. It pushes the responsibility of deciding which columns are identifier columns to users.
2. It may not reconstruct the same updates as at write time.
   1. Users can merge/update with different columns. For example, the columns could be (id, name) or (name, city).
   2. Users can query CDC with different identifier columns.
3. It may not perform as well as the approach of adding meta columns or write-time CDC file logging. It's a tradeoff, though; write-time logging has the following limitations:
   1. It needs changes in all engines (e.g., Spark/Flink/Trino/others), which requires a lot of effort. Some customized Java/Python clients may still break the CDC generation even after all the main engines have been taken care of.
   2. It performs worse at write time. (This solution requires no additional work at write time.)
   3. Extra maintenance is needed for the CDC log files, e.g., expiration, compaction, encryption, etc.

Algorithm
Let's focus on CDC record generation for a single snapshot first. Here are discussions for MOR and COW.

Merge-On-Read
Let's start with an example in which a single snapshot is generated by a Spark "MERGE INTO" operation. The table schema is (id, name, age, _pos), where _pos is the position of the row inside a data file. The column _pos is only for demonstration; it isn't necessarily a real column.

To get CDC records:

1. Get the deleted rows by merging the delete files (pos delete files) added by the snapshot (S2) with their affected data files from this snapshot (S2: No Applicable Files) and previous snapshots (S1: Data File 1).
2. Get the inserted rows by reading the rows from the data files (Data File 2) added by the snapshot (S2).
3. Output CDC records.
   a. (Option 1: Only Inserts and Deletes) Output the deleted/inserted rows as the CDC records when no identifier columns are provided.

ID | Name  | Age | _record_type | _commit_snapshot_id
1  | Amy   | 20  | Delete       | s2
1  | Amy   | 21  | Insert       | s2
5  | Harry | 19  | Delete       | s2
7  | Abby  | 22  | Insert       | s2

   b. (Option 2: Differentiate Updates from Inserts and Deletes) Inner join the deleted rows and inserted rows on the identifier columns (ID) to get the CDC update records. Left outer join the deleted rows with the inserted rows on the identifier columns (ID); the unmatched deleted rows are the CDC delete records. Right outer join the deleted rows with the inserted rows on the identifier columns (ID); the unmatched inserted rows are the CDC insert records. Note: this could be a full outer join in the implementation.
ID | Name  | Age | _record_type | _commit_snapshot_id
1  | Amy   | 20  | PreImage     | s2
1  | Amy   | 21  | PostImage    | s2
5  | Harry | 19  | Delete       | s2
7  | Abby  | 22  | Insert       | s2

Algorithm
With the analysis above, here is the algorithm pseudocode:

deletedRows = get deleted rows by checking the delete files added by the snapshot
if (a user does not provide identifier columns) {
    // Option 1: Only Inserts and Deletes
    output deletedRows as the CDC deleted records
    output all rows in new data files as CDC inserted records
} else {
    // Option 2: Differentiate Updates from Inserts and Deletes
    deletedRows inner join with inserted rows on identifier columns
    output the matched rows as the updated rows

    deletedRows left join with inserted rows on identifier columns
    output the unmatched rows as the deleted rows

    deletedRows right join with inserted rows on identifier columns
    output the unmatched rows as the inserted rows
}

Note: The current way to get deleted rows is to read the data file first, then apply the records from the delete files. This is not efficient when the portion of deleted rows is small; for example, when 10 rows are deleted out of 1M rows. We need a more efficient read.
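
To make Option 2 concrete, here is a minimal Spark SQL sketch of the three joins, assuming the snapshot's deleted and inserted rows have been registered as temporary views named deleted_rows and inserted_rows (hypothetical names) and that id is the identifier column:

-- Updates: pre-images and post-images of rows whose identifier matches
SELECT d.id, d.name, d.age, 'PreImage' AS _record_type
FROM deleted_rows d JOIN inserted_rows i ON d.id = i.id
UNION ALL
SELECT i.id, i.name, i.age, 'PostImage' AS _record_type
FROM deleted_rows d JOIN inserted_rows i ON d.id = i.id
UNION ALL
-- Deletes: deleted rows with no matching inserted row
SELECT d.id, d.name, d.age, 'Delete' AS _record_type
FROM deleted_rows d LEFT JOIN inserted_rows i ON d.id = i.id
WHERE i.id IS NULL
UNION ALL
-- Inserts: inserted rows with no matching deleted row
SELECT i.id, i.name, i.age, 'Insert' AS _record_type
FROM inserted_rows i LEFT JOIN deleted_rows d ON d.id = i.id
WHERE d.id IS NULL;

As noted above, an implementation could fold the four branches into a single full outer join.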

Here are additional examples.

Example: Both Eq delete and Pos delete

Flink can write both pos deletes and eq deletes in a snapshot.
The CDC records of S2:

● Delete: (5, Harry, 19)
● Delete: (1, Amy, 20)
● Insert: (1, Amy, 20)
● Delete: (1, Amy, 20)

Example: Delete an appended row in the same snapshot

This is an edge case, which deletes a row appended in the same snapshot.

Output two CDC records like this:

● Insert: (7, Abby, 22)
● Delete: (7, Abby, 22)

Copy-On-Write
Let's start with an example in which a single snapshot is generated by a "MERGE INTO" operation. The table schema is (id, name, age). The MERGE INTO command deleted row (3, Alice), updated row (1, Amy), and inserted row (7, Abby).
To get CDC records:

1. Skip the "Existing" data files (data file 1), since the rows inside them are not changed.
2. Note that the rows in the "Deleted" data files could be updated, deleted, or unchanged.
3. Left outer join the "Deleted" files (data file 2) with the "Added" data files (data file 3 and data file 4) on all columns to get the deleted rows.
4. Right outer join the "Deleted" files (data file 2) with the "Added" data files (data file 3 and data file 4) on all columns to get the inserted rows.
5. Output CDC records.
   a. (Option 1: Only Inserts and Deletes) Output the deleted/inserted rows as the CDC records when no identifier columns are provided.

ID | Name  | Age | _record_type | _commit_snapshot_id
1  | Amy   | 20  | Delete       | s2
1  | Amy   | 21  | Insert       | s2
3  | Alice | 21  | Delete       | s2
7  | Abby  | 22  | Insert       | s2

   b. (Option 2: Differentiate Updates from Inserts and Deletes) Inner join the deleted rows and inserted rows on the identifier columns (ID) to get the CDC update records. Left outer join the deleted rows with the inserted rows on the identifier columns (ID); the unmatched deleted rows are the CDC delete records. Right outer join the deleted rows with the inserted rows on the identifier columns (ID); the unmatched inserted rows are the CDC insert records.
ID | Name  | Age | _record_type | _commit_snapshot_id
1  | Amy   | 20  | PreImage     | s2
1  | Amy   | 21  | PostImage    | s2
3  | Alice | 21  | Delete       | s2
7  | Abby  | 22  | Insert       | s2

Algorithm
With the analysis above, here is the algorithm pseudocode:

// skip the "Existing" data files
deletedRows = Deleted data files left join with the Added data files on all columns, keeping the unmatched rows
insertedRows = Deleted data files right join with the Added data files on all columns, keeping the unmatched rows
if (a user does not provide identifier columns) {
    output deletedRows as the CDC deleted records
    output insertedRows as the CDC inserted records
} else {
    deletedRows inner join with insertedRows on identifier columns
    output the matched rows as the updated rows

    deletedRows left join with insertedRows on identifier columns
    output the unmatched rows as the deleted rows

    deletedRows right join with insertedRows on identifier columns
    output the unmatched rows as the inserted rows
}
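
As a sketch, the first two steps map naturally onto anti-join semantics in Spark SQL, assuming the rows of the "Deleted" and "Added" data files have been registered as views named deleted_file_rows and added_file_rows (hypothetical names):

-- Net deleted rows: present in the deleted files but not rewritten into the added files
SELECT id, name, age FROM deleted_file_rows
EXCEPT ALL
SELECT id, name, age FROM added_file_rows;

-- Net inserted rows: present in the added files but absent from the deleted files
SELECT id, name, age FROM added_file_rows
EXCEPT ALL
SELECT id, name, age FROM deleted_file_rows;

EXCEPT ALL preserves duplicate rows, which matters when a file contains repeated rows; the identifier-column joins then proceed exactly as in the Merge-On-Read sketch above.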

Merge Records From Multiple Snapshots

There are two use cases when generating CDC records from multiple snapshots.
1. Users require the full change history when the records are used for cases like auditing and replaying changes to other tables. The outputs should include every CDC record. This is trivial to implement.
2. Users require merged CDC records across snapshots, for a case like materialized view refreshing. Here are two merge solutions:
   1. Calculate the CDC records for each individual snapshot, then merge them together by joining the records from the individual snapshots. The joins have to happen in the order of the snapshots.
   2. Calculate the merged CDC records for multiple snapshots directly. This may perform better with optimizations. We leave it for future discussion.

Users should have the option to enable/disable merging.


Here is an example with multiple snapshots on MOR. We try to generate CDC records for snapshots S2 and S3. In this example, S2 deleted row (5, Harry, 19) and inserted row (7, Abby, 22); S3 deleted row (7, Abby, 22) and inserted row (5, Harry, 19).

CDC records for S2:


ID | Name  | Age | _record_type | _commit_snapshot_id
5  | Harry | 19  | Delete       | s2
7  | Abby  | 22  | Insert       | s2

CDC records for S3:


ID | Name  | Age | _record_type | _commit_snapshot_id
5  | Harry | 19  | Insert       | s3
7  | Abby  | 22  | Delete       | s3

For use cases requiring the full history (e.g., audit and replay), the outputs should include the CDC records of every snapshot:

ID | Name  | Age | _record_type | _commit_snapshot_id
5  | Harry | 19  | Delete       | s2
7  | Abby  | 22  | Insert       | s2
5  | Harry | 19  | Insert       | s3
7  | Abby  | 22  | Delete       | s3

For other use cases that do not need the intermediate steps (e.g., materialized view refresh), the output would be empty in this example.

The merge should also consider whether an update can be reconstructed from the given identifier columns. For example, the following CDC records will be considered an update:

ID | Name  | Age | _record_type | _commit_snapshot_id
6  | Alice | 19  | Delete       | s2
6  | Alice | 22  | Insert       | s3

After merging, the CDC records will look like this:

ID | Name  | Age | _record_type | _commit_snapshot_id
6  | Alice | 19  | PreImage     | s2
6  | Alice | 22  | PostImage    | s3
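
Here is a simplified Spark SQL sketch of this merge step, assuming the per-snapshot CDC records for the window sit in a view named cdc_records (hypothetical) and that each identifier value appears at most once as a delete and once as an insert within the window:

-- Reconstruct an update from a delete followed by a later insert with the same identifier
SELECT d.id, d.name, d.age, 'PreImage' AS _record_type, d._commit_snapshot_id
FROM cdc_records d
JOIN cdc_records i
  ON d.id = i.id
 AND d._record_type = 'Delete'
 AND i._record_type = 'Insert'
 AND d._commit_order < i._commit_order
UNION ALL
SELECT i.id, i.name, i.age, 'PostImage' AS _record_type, i._commit_snapshot_id
FROM cdc_records d
JOIN cdc_records i
  ON d.id = i.id
 AND d._record_type = 'Delete'
 AND i._record_type = 'Insert'
 AND d._commit_order < i._commit_order;

A real implementation would additionally emit unmatched deletes and inserts as-is, and drop matched pairs whose pre- and post-images are identical (as in the Harry/Abby example above, where the merged output is empty).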

User Interface Changes

The following interfaces should cover most use cases.
● Spark procedure
  ○ Provide a new procedure which, given a table and snapshot information, produces CDC records.
● Metatable
  ○ Add a postfix identifier, like the metadata tables, which returns CDC records instead of normal table data. This identifier could serve both streaming and batch use cases.

Users should be able to specify the following parameters:

1. Start snapshot id and end snapshot id
2. A time window (start timestamp and end timestamp)
3. The types of CDC records (delete/insert/update), optional
4. Identifier columns for COW tables, optional
Here are interface examples for the Spark procedure and the metatable.

-- Spark procedure
CALL catalog_name.system.table_changes('tableName', start_snapshot_id, end_snapshot_id)
-- for example
CALL catalog_name.system.table_changes('tableName', 0, 10)

CALL catalog_name.system.table_changes('tableName', start_timestamp, end_timestamp)
-- for example
CALL catalog_name.system.table_changes('tableName', TIMESTAMP '2021-01-23 04:30:45', TIMESTAMP '2021-02-23 06:00:00')

-- Metatable
SELECT * FROM 'tableName.cdc' WHERE snapshot_id BETWEEN start_snapshot_id AND end_snapshot_id
-- for example
SELECT * FROM 'tableName.cdc' WHERE snapshot_id BETWEEN 0 AND 10

SELECT * FROM 'tableName.cdc' WHERE timestamp BETWEEN start_timestamp AND end_timestamp
-- for example
SELECT * FROM 'tableName.cdc' WHERE timestamp BETWEEN '2021-01-23 04:30:45' AND '2021-02-23 06:00:00'
