
26.3 Change Data Capture (CDC) in DataVault

1. When using change data capture (CDC) to load incremental data, validating relationships can be challenging since CDC only provides new, updated, and deleted records rather than a full extract.
2. To handle deleted relationships when using CDC, one option is to use a persistent staging area or source replica to determine when a relationship no longer exists by comparing the current and previous values.
3. Another option is to include "from" and "to" values in the audit trail to explicitly identify when a relationship changes, in which case two entries can be made to the effectivity satellite to mark the old relationship as deleted and the new one as current.

Change Data Capture (CDC) – Handling Validation of Relationships in Data Vault 2.0

In Data Vault 2.0, we differentiate data by keys, relationships and description.

This article explains what to consider when validating relationships and how to handle it.

There are different ways to handle the validation of relationships from source
systems, depending on how the data is delivered (full-extract or CDC) and how
the source system delivers a delete (soft delete or hard delete).

First, let us explain the different kinds of deletes in source systems:

1. Hard delete – A record is hard deleted in the source system and no longer
appears in the system.

2. Soft delete – The deleted record still exists in the source system’s database and
is flagged as deleted.

Secondly, let’s explore how we find the data in the staging area:

1. Full-extract – This can be the current status of the source system or a delta
(incremental) extract.

2. CDC (Change Data Capture) – Only new, updated or deleted records are delivered, so
the data is loaded as deltas.
 
To keep the following explanation as simple as possible, we assume that we
want to mark relationships as deleted as soon as we get the delete information, even
if there is no audit trail from the source system (data aging is another topic).

Delete detection for business keys, or Hubs, is straightforward. Soft deletes are
handled directly as descriptive attributes in the Satellite, regardless of
whether the data arrives from a full extract or CDC. For hard deletes
in the source system, we have to distinguish between full-extract and CDC.

Here we introduce the Effectivity Satellite. In case of:

1. Full-extract – Perform a lookup back into the staging area to check whether
the business key still exists. If not, add a record with the delete information
(i.e. a flag and a date) into the Effectivity Satellite. 

2. CDC – We receive a “Delete” record, which becomes a new entry in the
Effectivity Satellite.
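The two cases above can be sketched in Python as follows. This is only a minimal illustration, not a reference implementation: the row layout (`business_key`, `is_deleted`, `load_ts`), the `op` codes "U" and "D", and the sample keys 4712 and 4713 are assumptions made for the example.

```python
from datetime import datetime, timezone


def hub_deletes_full_extract(active_keys, staged_keys, load_ts):
    """Full-extract: every active business key missing from staging gets a delete entry."""
    missing = sorted(set(active_keys) - set(staged_keys))
    return [{"business_key": k, "is_deleted": True, "load_ts": load_ts} for k in missing]


def hub_deletes_cdc(cdc_rows, load_ts):
    """CDC: every delivered delete record becomes a delete entry directly."""
    return [
        {"business_key": r["business_key"], "is_deleted": True, "load_ts": load_ts}
        for r in cdc_rows
        if r["op"] == "D"  # hypothetical operation code for a delete
    ]


load_ts = datetime(2024, 1, 2, tzinfo=timezone.utc)

# Full-extract: customer 4712 no longer appears in the staging area.
full_extract_rows = hub_deletes_full_extract(
    {"4711", "4712", "4713"}, {"4711", "4713"}, load_ts
)

# CDC: the delete for customer 4712 is delivered explicitly.
cdc_rows = hub_deletes_cdc(
    [{"op": "U", "business_key": "4711"}, {"op": "D", "business_key": "4712"}],
    load_ts,
)
```

Both paths produce the same Effectivity Satellite entry. The difference is only where the delete information comes from: a comparison against the staging area, or the CDC record itself.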
 
Delete detection of relationships needs a bit more attention and is often forgotten.
With a full-extract, we can follow the same approach as for business keys:
check whether the Link Hash Key exists in the current staging load and, if it
does not, insert a new entry into the Effectivity Satellite.
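For links, the check has to consider which relationships are currently active. The following sketch assumes that the latest Effectivity Satellite entry per link decides this. As in the rest of this article, business-key pairs stand in for the Link Hash Key; the field names, the integer load timestamps and company 5678 are assumptions made for the example.

```python
def active_link_keys(effectivity_rows):
    """The latest Effectivity Satellite entry per link key decides whether it is active."""
    latest = {}
    for row in sorted(effectivity_rows, key=lambda r: r["load_ts"]):
        latest[row["link_key"]] = row["is_deleted"]
    return {key for key, deleted in latest.items() if not deleted}


def link_deletes_full_extract(effectivity_rows, staged_link_keys, load_ts):
    """Active links that are missing from the current staging load get a delete entry."""
    missing = sorted(active_link_keys(effectivity_rows) - set(staged_link_keys))
    return [{"link_key": k, "is_deleted": True, "load_ts": load_ts} for k in missing]


# (customer_id, company_id) pairs stand in for the Link Hash Key.
effectivity = [
    {"link_key": ("4711", "1234"), "is_deleted": False, "load_ts": 1},
    {"link_key": ("4712", "1234"), "is_deleted": False, "load_ts": 1},
    {"link_key": ("4712", "1234"), "is_deleted": True, "load_ts": 2},
]

# The current full extract only contains customer 4711 at a different company.
staged = {("4711", "5678")}

new_rows = link_deletes_full_extract(effectivity, staged, load_ts=3)
```

Only the link between customer 4711 and company 1234 receives a delete entry. The link of customer 4712 is already marked as deleted and is not touched again.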
CDC, however, is becoming more common. Because CDC delivers deltas
only, the challenge is to identify relationships that no longer exist. The example
below shows a relationship between the business objects customer and company.
This is a 1:n relationship:

Image 1: Tables Customer and Company


The Link table in Data Vault looks like this:

Table 1: Customer Link


For better readability and simplification, we present the business keys instead of
hash keys and don’t show system fields like the load date timestamp and record
source.
 
So far so good, but what happens when the customer starts to work for another
company? This results in a new record in the Link. The CDC mechanism
delivers the change as an update of the customer table.

Image 2: Source tables and Link after company change.

Where do we get the information that customer 4711 no longer works for
company 1234, and where is that information stored? We need to soft-delete the
old link entry in the data warehouse to make the data consistent again. At the
moment, it looks like the customer works for both companies, as both links are
currently active.
 
There are two possible ways:

1. You get the “from” and the “to” in your audit trail and you identify a
difference for the company_id.

If that is the case, create two new entries in the Effectivity Satellite: one marks
the old relationship (from) as deleted and the other marks the new relationship (to) as
not deleted (also known as the current record). New relationships must be
inserted as “not deleted” so that you can activate and deactivate Hash
Keys back and forth.

Think about what happens when customer 4711 works for company 1234 again.

2. If you don’t have the “from” and “to”, you have to load the
CDC data either into a persistent staging area, where you keep the full history of
data delivered by CDC, or into a source replica, where you create a mirror of
the source system by feeding it with the CDC data: you perform hard
updates when an “Update” comes from the CDC and hard deletes when a
“Delete” comes from the CDC.

When using the source replica, you can follow the same approach as stated
before when getting full loads: join into the replica and figure out whether
the Hash Key still exists or not.

The biggest disadvantage here is that you have to scan more data, which
means more IO.

When using a persistent staging area, you can detect a
change in a relationship by using the window function lead(), where you
partition by the technical ID (Customer_ID in this case) and order by the load
date timestamp. As soon as the Link Hash Key is different, the relationship has
changed and the old one no longer exists.
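The persistent staging area approach can be sketched in plain Python, emulating what lead() partitioned by Customer_ID and ordered by the load date timestamp would deliver. This is an illustration under assumptions, not a definitive implementation: the field names, the ISO date strings used as load timestamps, and the sample rows (including company 5678 and customer 4712) are made up. Within one customer partition, a different company_id is equivalent to a different Link Hash Key.

```python
from itertools import groupby
from operator import itemgetter


def relationship_changes(psa_rows):
    """Compare each row with the next one per customer, like LEAD() in SQL."""
    entries = []
    ordered = sorted(psa_rows, key=itemgetter("customer_id", "load_ts"))
    for customer_id, group in groupby(ordered, key=itemgetter("customer_id")):
        rows = list(group)
        # zip(rows, rows[1:]) pairs every row with its successor in the partition.
        for current, following in zip(rows, rows[1:]):
            if following["company_id"] != current["company_id"]:
                entries.append({
                    "link_key": (customer_id, current["company_id"]),
                    "is_deleted": True,
                    "load_ts": following["load_ts"],
                })
    return entries


psa = [
    {"customer_id": "4711", "company_id": "1234", "load_ts": "2024-01-01"},
    {"customer_id": "4711", "company_id": "5678", "load_ts": "2024-02-01"},
    {"customer_id": "4712", "company_id": "1234", "load_ts": "2024-01-01"},
    # An update that does not touch the relationship produces no entry.
    {"customer_id": "4712", "company_id": "1234", "load_ts": "2024-03-01"},
]

deleted_links = relationship_changes(psa)
```

The old relationship is marked as deleted with the load timestamp of the row that replaced it.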
 
The result is the following Effectivity Satellite (logical):

Table 2: Effectivity Satellite on the Link

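The first option above (an audit trail that delivers the “from” and the “to”) can be sketched as follows, including the case where customer 4711 later returns to company 1234. The field names `company_id_from` and `company_id_to`, the load timestamps and company 5678 are assumptions for illustration; real audit trails will look different.

```python
def effectivity_entries_from_audit(audit_row):
    """Turn one audit-trail update with from/to values into Effectivity Satellite rows."""
    old_company = audit_row["company_id_from"]
    new_company = audit_row["company_id_to"]
    if old_company == new_company:
        return []  # the relationship did not change
    customer, ts = audit_row["customer_id"], audit_row["load_ts"]
    return [
        # The old relationship (from) is marked as deleted ...
        {"link_key": (customer, old_company), "is_deleted": True, "load_ts": ts},
        # ... and the new relationship (to) is marked as current.
        {"link_key": (customer, new_company), "is_deleted": False, "load_ts": ts},
    ]


audit_trail = [
    {"customer_id": "4711", "company_id_from": "1234",
     "company_id_to": "5678", "load_ts": "2024-02-01"},
    # Customer 4711 works for company 1234 again.
    {"customer_id": "4711", "company_id_from": "5678",
     "company_id_to": "1234", "load_ts": "2024-06-01"},
]

satellite = [
    entry for row in audit_trail for entry in effectivity_entries_from_audit(row)
]
```

Because the new relationship is always written as “not deleted”, the last entry reactivates the link between customer 4711 and company 1234 without touching the Link itself.
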
We covered two major points in this article. The first one is that in Data Vault, we
extract relationship information from the source tables and thus have to pay more
attention to validating those relationships.

The second point is that the way you get the data (delta by CDC or full-extract)
gives you different options for loading the data. When you are
dealing with a huge amount of data, CDC is definitely the way to go. In addition,
the CDC mechanism delivers all updates from the source, and
you can more easily load data in (near) real time.
