26.3 Change Data Capture (CDC) in DataVault
In the following, we explain what to consider and how to deal with it.
There are different ways to handle the validation of relationships from source
systems, depending on how the data is delivered (full-extract or CDC) and on
how the source system delivers a delete: as a soft delete or as a hard delete.
1. Hard delete – The record is physically deleted in the source system and no
longer appears there.
2. Soft delete – The deleted record still exists in the source system's database
and is flagged as deleted.
Second, let’s explore how the data arrives in the staging area:
1. Full-extract – the complete current state of the source system.
2. Delta (incremental) extract – only the changes, for example delivered by CDC.
For a full-extract, perform a lookup back into the staging area to check whether
the business key still exists. If it does not, add a record with the delete
information (i.e. a flag and a date) to the Effectivity Satellite.
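The full-extract lookup described above can be sketched as a set comparison between the hash keys that are currently active in the Effectivity Satellite and the hash keys found in the staging area. The key values and column names below are hypothetical, assumed only for illustration:

```python
from datetime import datetime, timezone

# Hypothetical Link Hash Keys currently marked active in the Effectivity Satellite.
active_link_hash_keys = {"hk_4711_1234", "hk_4712_1234", "hk_4713_5678"}

# Link Hash Keys present in today's full extract in the staging area.
staged_link_hash_keys = {"hk_4712_1234", "hk_4713_5678"}

# Any active key that no longer appears in the full extract was deleted in
# the source: record the delete information in the Effectivity Satellite.
deleted_keys = active_link_hash_keys - staged_link_hash_keys

effectivity_satellite_inserts = [
    {
        "link_hash_key": hk,
        "is_deleted": True,                      # the delete flag
        "load_date": datetime.now(timezone.utc),  # the delete date
    }
    for hk in sorted(deleted_keys)
]

for row in effectivity_satellite_inserts:
    print(row["link_hash_key"], row["is_deleted"])
```

In a real warehouse the two key sets would come from queries against the satellite and the staging table; the set difference is the same idea expressed as an anti-join.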
From where do we get the information that Customer 4711 no longer works for
Company 1234, and where is that information stored? We need to soft-delete the
old link entry in the data warehouse to make the data consistent again. At the
moment, it looks as if the customer works for both companies, because both links
are currently active.
There are two possible ways:
1. You get the “from” and the “to” values in your audit trail and identify a
difference in the company_id.
If that is the case, create two new entries in the Effectivity Satellite: one
marks the old relationship (from) as deleted, the other marks the new
relationship (to) as not deleted (also known as the current record). It is
necessary to insert new relationships as “not deleted” so that you can activate
and deactivate Hash Keys back and forth.
Think about what happens when customer 4711 works for company 1234 again.
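The toggling above can be sketched as a small helper that emits the two Effectivity Satellite rows per change. The `hash_key` function and the column names are placeholders, not the real Data Vault hashing:

```python
from datetime import datetime, timezone

def hash_key(customer_id: int, company_id: int) -> str:
    # Stand-in for the real hash function (e.g. MD5 over the business keys).
    return f"hk_{customer_id}_{company_id}"

def effectivity_rows(customer_id, company_id_from, company_id_to, load_date):
    """Two Effectivity Satellite inserts: the old relationship (from) is
    marked deleted, the new one (to) is (re)activated as the current record."""
    return [
        {"link_hash_key": hash_key(customer_id, company_id_from),
         "is_deleted": True, "load_date": load_date},
        {"link_hash_key": hash_key(customer_id, company_id_to),
         "is_deleted": False, "load_date": load_date},
    ]

now = datetime.now(timezone.utc)
# Customer 4711 moves from company 1234 to company 5678 ...
rows = effectivity_rows(4711, 1234, 5678, now)
# ... and later moves back: the very same Hash Keys are simply toggled again,
# which is why new relationships must be inserted as "not deleted".
rows += effectivity_rows(4711, 5678, 1234, now)

for r in rows:
    print(r["link_hash_key"], r["is_deleted"])
```

Note that no rows are ever updated: each change only appends two inserts, and the most recent row per Hash Key determines whether the relationship is currently active.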
2. If you don’t have the “from” and “to” values, you either have to load the
CDC data into a persistent staging area, where you keep the full history of
the data delivered by CDC, or into a source replica, where you create a mirror
of the source system by feeding it with the CDC data, performing hard updates
when an “Updated” event arrives and hard deletes when a “Delete” event arrives.
When using the source replica, you can follow the same approach as described
above for full loads: join into the replica and check whether the Hash Key
still exists.
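Maintaining the source replica from a CDC feed might look like the following sketch; the event shapes and operation names ("Updated", "Delete") are assumptions for illustration, since the exact format depends on the CDC tool:

```python
# In-memory stand-in for a source replica table, keyed by Customer_ID.
replica = {
    4711: {"customer_id": 4711, "company_id": 1234},
    4712: {"customer_id": 4712, "company_id": 1234},
}

# Hypothetical CDC feed: each event carries an operation and the row image.
cdc_events = [
    {"op": "Updated", "row": {"customer_id": 4711, "company_id": 5678}},
    {"op": "Delete",  "row": {"customer_id": 4712, "company_id": 1234}},
]

for event in cdc_events:
    key = event["row"]["customer_id"]
    if event["op"] == "Updated":
        replica[key] = event["row"]   # hard update: overwrite the row in place
    elif event["op"] == "Delete":
        replica.pop(key, None)        # hard delete: the row disappears

print(replica)
```

After the events are applied, the replica mirrors the source system's current state, so the full-load delete detection (lookup by Hash Key) works against it unchanged.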
The biggest disadvantage here is that you have to scan more data, which means
more I/O. When using a persistent staging area, you can detect a change in a
relationship with the window function lead(): partition by the technical ID
(Customer_ID in this case) and order by the load date timestamp.
As soon as the Link Hash Key differs, the relationship has changed and the
old one no longer exists.
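The lead() approach can be demonstrated end to end with an in-memory SQLite database (window functions require SQLite 3.25 or newer); table and column names here are illustrative:

```python
import sqlite3

# A persistent staging area keeps the full history delivered by CDC.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE psa (
    customer_id   INTEGER,
    link_hash_key TEXT,
    load_date     TEXT)""")
con.executemany("INSERT INTO psa VALUES (?, ?, ?)", [
    (4711, "hk_4711_1234", "2021-01-01"),
    (4711, "hk_4711_5678", "2021-02-01"),  # customer 4711 changed company
    (4712, "hk_4712_1234", "2021-01-01"),
])

# lead() partitioned by the technical ID and ordered by the load date:
# wherever the next Link Hash Key differs, the old relationship ended.
changed = con.execute("""
    SELECT customer_id, link_hash_key, next_hash_key
    FROM (
        SELECT customer_id, link_hash_key, load_date,
               LEAD(link_hash_key) OVER (
                   PARTITION BY customer_id
                   ORDER BY load_date) AS next_hash_key
        FROM psa
    )
    WHERE next_hash_key IS NOT NULL
      AND next_hash_key <> link_hash_key
""").fetchall()

print(changed)  # -> [(4711, 'hk_4711_1234', 'hk_4711_5678')]
```

Each row returned identifies an old Link Hash Key to mark as deleted and the new one to activate in the Effectivity Satellite.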
The result is the following Effectivity Satellite (logical):
Table 2: Effectivity Satellite on the LinkH
We covered two major points in this article. The first is that in Data Vault we
extract relationship information from the source tables, and therefore we have
to pay closer attention to validating those relationships.
The second point is that the way you receive the data (delta by CDC or
full-extract) offers different options for loading it. When you are dealing
with a huge amount of data, CDC is definitely the way to go. In addition, with
the CDC mechanism you receive all updates from the source, and you can load
data in (near) real time more easily.