26.3 Change Data Capture (CDC) in DataVault
In the following, we explain what to consider and how to deal with it.
There are different ways to handle the validation of relationships from source
systems, depending on how the data is delivered (full-extract or CDC) and on
how the source system delivers a delete: as a soft delete or as a hard delete.
1. Hard delete – The record is physically deleted in the source system and no
longer appears there.
2. Soft delete – The deleted record still exists in the source system's database
and is flagged as deleted.
Second, let’s explore how the data arrives in the staging area:
1. Full-extract – the complete current state of the source system.
2. Delta (incremental) extract – only the changes, for example delivered by CDC.
For a full-extract, perform a lookup back into the staging area to check whether
the business key still exists. If it does not, add a record with the delete
information (i.e. a flag and a date) to the Effectivity Satellite.
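The full-extract lookup described above can be sketched as a set comparison between the hash keys that are currently active in the Effectivity Satellite and the hash keys found in the staging area. The key values and column names below are hypothetical, assumed only for illustration:

```python
from datetime import datetime, timezone

# Hypothetical Link Hash Keys currently marked active in the Effectivity Satellite.
active_link_hash_keys = {"hk_4711_1234", "hk_4712_1234", "hk_4713_5678"}

# Link Hash Keys present in today's full extract in the staging area.
staged_link_hash_keys = {"hk_4712_1234", "hk_4713_5678"}

# Any active key that no longer appears in the full extract was deleted in
# the source: record the delete information in the Effectivity Satellite.
deleted_keys = active_link_hash_keys - staged_link_hash_keys

effectivity_satellite_inserts = [
    {
        "link_hash_key": hk,
        "is_deleted": True,                      # the delete flag
        "load_date": datetime.now(timezone.utc),  # the delete date
    }
    for hk in sorted(deleted_keys)
]

for row in effectivity_satellite_inserts:
    print(row["link_hash_key"], row["is_deleted"])
```

In a real warehouse the two key sets would come from queries against the satellite and the staging table; the set difference is the same idea expressed as an anti-join.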
From where do we get the information that Customer 4711 no longer works for
Company 1234, and where is that information stored? We need to soft-delete the
old link entry in the data warehouse to make the data consistent again. At the
moment, it looks as if the customer works for both companies, because both links
are currently active.
There are two possible ways:
1. You get the “from” and the “to” values in your audit trail and identify a
difference in the company_id.
If that is the case, create two new entries in the Effectivity Satellite: one
marks the old relationship (from) as deleted, the other marks the new
relationship (to) as not deleted (also known as the current record). It is
necessary to insert new relationships as “not deleted” so that you can activate
and deactivate Hash Keys back and forth.
Think about what happens when customer 4711 works for company 1234 again.
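The toggling above can be sketched as a small helper that emits the two Effectivity Satellite rows per change. The `hash_key` function and the column names are placeholders, not the real Data Vault hashing:

```python
from datetime import datetime, timezone

def hash_key(customer_id: int, company_id: int) -> str:
    # Stand-in for the real hash function (e.g. MD5 over the business keys).
    return f"hk_{customer_id}_{company_id}"

def effectivity_rows(customer_id, company_id_from, company_id_to, load_date):
    """Two Effectivity Satellite inserts: the old relationship (from) is
    marked deleted, the new one (to) is (re)activated as the current record."""
    return [
        {"link_hash_key": hash_key(customer_id, company_id_from),
         "is_deleted": True, "load_date": load_date},
        {"link_hash_key": hash_key(customer_id, company_id_to),
         "is_deleted": False, "load_date": load_date},
    ]

now = datetime.now(timezone.utc)
# Customer 4711 moves from company 1234 to company 5678 ...
rows = effectivity_rows(4711, 1234, 5678, now)
# ... and later moves back: the very same Hash Keys are simply toggled again,
# which is why new relationships must be inserted as "not deleted".
rows += effectivity_rows(4711, 5678, 1234, now)

for r in rows:
    print(r["link_hash_key"], r["is_deleted"])
```

Note that no rows are ever updated: each change only appends two inserts, and the most recent row per Hash Key determines whether the relationship is currently active.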
2. If you don’t have the “from” and “to” values, you either have to load the
CDC data into a persistent staging area, where you keep the full history of
the data delivered by CDC, or into a source replica, where you create a mirror
of the source system by feeding it with the CDC data, performing hard updates
when an “Updated” event arrives and hard deletes when a “Delete” event arrives.
When using the source replica, you can follow the same approach as described
above for full loads: join into the replica and check whether the Hash Key
still exists.
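Maintaining the source replica from a CDC feed might look like the following sketch; the event shapes and operation names ("Updated", "Delete") are assumptions for illustration, since the exact format depends on the CDC tool:

```python
# In-memory stand-in for a source replica table, keyed by Customer_ID.
replica = {
    4711: {"customer_id": 4711, "company_id": 1234},
    4712: {"customer_id": 4712, "company_id": 1234},
}

# Hypothetical CDC feed: each event carries an operation and the row image.
cdc_events = [
    {"op": "Updated", "row": {"customer_id": 4711, "company_id": 5678}},
    {"op": "Delete",  "row": {"customer_id": 4712, "company_id": 1234}},
]

for event in cdc_events:
    key = event["row"]["customer_id"]
    if event["op"] == "Updated":
        replica[key] = event["row"]   # hard update: overwrite the row in place
    elif event["op"] == "Delete":
        replica.pop(key, None)        # hard delete: the row disappears

print(replica)
```

After the events are applied, the replica mirrors the source system's current state, so the full-load delete detection (lookup by Hash Key) works against it unchanged.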
The biggest disadvantage here is that you have to scan more data, which means
more I/O. When using a persistent staging area, you can detect a change in a
relationship with the window function lead(): partition by the technical ID
(Customer_ID in this case) and order by the load date timestamp.
As soon as the Link Hash Key differs, the relationship has changed and the
old one no longer exists.
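The lead() approach can be demonstrated end to end with an in-memory SQLite database (window functions require SQLite 3.25 or newer); table and column names here are illustrative:

```python
import sqlite3

# A persistent staging area keeps the full history delivered by CDC.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE psa (
    customer_id   INTEGER,
    link_hash_key TEXT,
    load_date     TEXT)""")
con.executemany("INSERT INTO psa VALUES (?, ?, ?)", [
    (4711, "hk_4711_1234", "2021-01-01"),
    (4711, "hk_4711_5678", "2021-02-01"),  # customer 4711 changed company
    (4712, "hk_4712_1234", "2021-01-01"),
])

# lead() partitioned by the technical ID and ordered by the load date:
# wherever the next Link Hash Key differs, the old relationship ended.
changed = con.execute("""
    SELECT customer_id, link_hash_key, next_hash_key
    FROM (
        SELECT customer_id, link_hash_key, load_date,
               LEAD(link_hash_key) OVER (
                   PARTITION BY customer_id
                   ORDER BY load_date) AS next_hash_key
        FROM psa
    )
    WHERE next_hash_key IS NOT NULL
      AND next_hash_key <> link_hash_key
""").fetchall()

print(changed)  # -> [(4711, 'hk_4711_1234', 'hk_4711_5678')]
```

Each row returned identifies an old Link Hash Key to mark as deleted and the new one to activate in the Effectivity Satellite.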
The result is the following Effectivity Satellite (logical):
Table 2: Effectivity Satellite on the LinkH
We covered two major points in this article. The first is that in Data Vault we
extract relationship information from the source tables, and therefore we have
to pay closer attention to validating those relationships.
The second point is that the way you receive the data (delta by CDC or
full-extract) offers different options for loading it. When you are dealing
with a huge amount of data, CDC is definitely the way to go. In addition, with
the CDC mechanism you receive all updates from the source, and you can load
data in (near) real time more easily.