Implementing Parameterization in ADF

Parameterization in ADF is achieved using pipeline parameters, dataset parameters, and activity parameters. Here's an example of parameterizing file names and paths to read data from Azure Blob Storage.

Example Scenario: Read Dynamic Files from Azure Blob Storage

1. Requirement: A pipeline should read files from an Azure Blob Storage container. The file name and folder path will be passed as parameters.

Steps to Implement

1. Create a Pipeline with Parameters:
o In ADF, create a new pipeline.
o Define parameters for FolderPath and FileName.

json
{
  "parameters": {
    "FolderPath": {
      "type": "string",
      "defaultValue": ""
    },
    "FileName": {
      "type": "string",
      "defaultValue": ""
    }
  }
}

2. Create a Dataset with Parameters:
o Create a dataset pointing to Azure Blob Storage.
o Add parameters for FolderPath and FileName in the dataset.
o Modify the dataset connection properties to use these parameters.
Example JSON for dataset parameters:

json
{
  "parameters": {
    "FolderPath": {
      "type": "string"
    },
    "FileName": {
      "type": "string"
    }
  },
  "typeProperties": {
    "fileName": "@dataset().FileName",
    "folderPath": "@dataset().FolderPath"
  }
}

3. Connect the Dataset to the Pipeline:
o In the pipeline, add an activity (e.g., Copy Data).
o Link the dataset to the activity.
o Map the pipeline parameters to the dataset parameters:
- For FolderPath: pass @pipeline().parameters.FolderPath.
- For FileName: pass @pipeline().parameters.FileName.
4. Trigger the Pipeline:
o Use the "Add Trigger" option to test the pipeline.
o Pass different values for FolderPath and FileName when triggering the pipeline (a Python sketch for triggering with parameters follows below).
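
For programmatic runs, the same parameters can be supplied from Python. This is a minimal sketch, assuming the azure-identity and azure-mgmt-datafactory packages and placeholder subscription, resource group, factory, and pipeline names:

python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

# Placeholder identifiers; replace with real values for your environment
subscription_id = "<subscription-id>"
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Trigger the pipeline, passing values for the FolderPath and FileName parameters
run = adf_client.pipelines.create_run(
    resource_group_name="<resource-group>",
    factory_name="<data-factory-name>",
    pipeline_name="<pipeline-name>",
    parameters={"FolderPath": "input-data", "FileName": "sales-data.json"},
)
print(run.run_id)  # run ID that can be monitored in ADF Monitor or via the SDK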

Example Configuration

- Pipeline Parameters:

json
{
  "FolderPath": "input-data",
  "FileName": "sales-data.json"
}

- Dataset Connection:
o FolderPath: input-data
o FileName: sales-data.json

Final Outcome

By passing parameters dynamically, the same pipeline can read different files, such as:

- input-data/sales-data.json
- input-data/inventory-data.csv

&&&&&&&&&&&&&&&

What is SCD2 (Slowly Changing Dimension Type 2)?

SCD2 (Slowly Changing Dimension Type 2) is a method used in data warehousing to track and maintain historical changes to dimensional data over time. It creates multiple records for a given entity in the dimension table, with each record representing a version of the data valid for a specific time period.

Key Features of SCD2:

1. Tracks Historical Data: Maintains a history of changes to data.
2. Versioning: Each change results in a new version of the dimension record.
3. Validity Range: Typically uses date fields (e.g., StartDate and EndDate) or a flag (e.g., IsCurrent) to indicate the validity of each version.

How SCD2 Works:


1. Initial Load:
o Load the dimension table with the first version of data.
2. Change Detection:
o Compare the incoming data (source) with the existing data (dimension table) to detect changes.
3. Insert New Records:
o For records that have changes, close the existing record by updating the EndDate or setting IsCurrent to false.
o Insert a new record with the updated data, new version number, and StartDate.
4. Unchanged Records:
o Retain existing records if no changes are detected.

Example Scenario: Employee Dimension

Initial State of Dimension Table:

EmployeeID   Name       Department   StartDate    EndDate      IsCurrent
101          John Doe   HR           2023-01-01   9999-12-31   true
102          Jane Doe   IT           2023-01-01   9999-12-31   true

Incoming Source Data (New Load):

EmployeeID   Name       Department
101          John Doe   Finance
102          Jane Doe   IT

Steps to Implement SCD2:

1. Compare Records:
o Compare EmployeeID and Department between the source and the dimension table.
o Detect that the department for EmployeeID 101 has changed.
2. Update Existing Record:
o For EmployeeID 101, update the EndDate to the current date (e.g., 2024-01-01) and set IsCurrent to false.
3. Insert New Record:
o Insert a new record for EmployeeID 101 with the updated department and StartDate as the current date.

Updated Dimension Table:

EmployeeID   Name       Department   StartDate    EndDate      IsCurrent
101          John Doe   HR           2023-01-01   2024-01-01   false
101          John Doe   Finance      2024-01-01   9999-12-31   true
102          Jane Doe   IT           2023-01-01   9999-12-31   true

Key Benefits of SCD2:

- Provides a complete history of changes.
- Supports complex historical analysis, such as tracking how values evolve over time.

SCD2 in Practice (Implementation in SQL):

sql
-- Update the current record to set the EndDate and IsCurrent flag
UPDATE DimensionTable
SET EndDate = GETDATE(),
    IsCurrent = 0
WHERE EmployeeID = @EmployeeID
  AND IsCurrent = 1;

-- Insert the new record
INSERT INTO DimensionTable (EmployeeID, Name, Department, StartDate, EndDate, IsCurrent)
VALUES (@EmployeeID, @Name, @Department, GETDATE(), '9999-12-31', 1);
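
On Databricks/Delta Lake, the same close-and-insert logic can be expressed with a Delta MERGE followed by an append. This is a minimal PySpark sketch rather than a complete solution, assuming a Delta dimension table and staging table at hypothetical paths with the Employee columns shown above (brand-new employees not yet in the dimension would need a separate insert step):

python
from pyspark.sql import SparkSession, functions as F
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

dim_path = "/mnt/dim_employee"                                   # hypothetical path
src = spark.read.format("delta").load("/mnt/staging_employee")   # hypothetical path
current = spark.read.format("delta").load(dim_path).filter("IsCurrent = true")

# 1. Detect rows whose Department changed for an employee already in the dimension
changed = (src.alias("s")
           .join(current.alias("c"), F.col("s.EmployeeID") == F.col("c.EmployeeID"))
           .where(F.col("s.Department") != F.col("c.Department"))
           .select(F.col("s.EmployeeID"), F.col("s.Name"), F.col("s.Department"))
           .cache())
changed.count()  # materialize before the dimension table is modified below

# 2. Close the current version of each changed employee
(DeltaTable.forPath(spark, dim_path).alias("t")
 .merge(changed.alias("u"), "t.EmployeeID = u.EmployeeID AND t.IsCurrent = true")
 .whenMatchedUpdate(set={"EndDate": "current_date()", "IsCurrent": "false"})
 .execute())

# 3. Append the new version with an open-ended EndDate
(changed
 .withColumn("StartDate", F.current_date())
 .withColumn("EndDate", F.lit("9999-12-31").cast("date"))
 .withColumn("IsCurrent", F.lit(True))
 .select("EmployeeID", "Name", "Department", "StartDate", "EndDate", "IsCurrent")
 .write.format("delta").mode("append").save(dim_path))
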
What is Incremental Load?

Incremental Load refers to the process of loading only the new or changed data
from a source system into a target system, rather than reloading the entire dataset.
It is a crucial process in ETL (Extract, Transform, Load) operations, as it optimizes
performance, reduces data transfer, and minimizes load times.

Types of Incremental Loads

1. Insert Only: Loads only new records.
2. Insert and Update: Loads new records and updates existing records.
3. Insert, Update, and Delete: Handles new records, updates existing records, and deletes records that no longer exist in the source.

How Incremental Load Works

1. Detect Changes:
o Use a mechanism to identify new or modified records in the source data. Common methods include:
- Timestamps: Compare a LastModified column with the last processed time.
- Watermarking: Use a high watermark (e.g., max date or ID) to track the last processed record.
- CDC (Change Data Capture): Use database features to capture data changes.
2. Extract Changes:
o Extract only the identified new or modified records from the source.
3. Load Changes:
o Insert new records into the target system.
o Update or delete existing records as needed.

Example: Incremental Load Using Timestamps

Scenario:
You have a source database table called Sales and need to incrementally load it
into a data warehouse table called SalesDW.

Source Table (Sales):

SalesID   CustomerID   Amount   LastModified
1         101          100      2023-12-01 10:00:00
2         102          200      2023-12-02 11:00:00
3         103          300      2023-12-03 12:00:00

Target Table (SalesDW):

SalesID   CustomerID   Amount   LastModified
1         101          100      2023-12-01 10:00:00

Steps to Implement Incremental Load:

1. Set a High Watermark:
o Determine the last LastModified timestamp processed (e.g., 2023-12-01 10:00:00).
2. Extract Changes:
o Query the source table to fetch records with a LastModified greater than the watermark.

sql
SELECT *
FROM Sales
WHERE LastModified > '2023-12-01 10:00:00';

3. Result:

SalesID   CustomerID   Amount   LastModified
2         102          200      2023-12-02 11:00:00
3         103          300      2023-12-03 12:00:00

4. Load Changes into Target:
o Insert New Records: Insert the extracted records into SalesDW.
o Update Existing Records: If a record already exists in the target, update it based on SalesID.

sql
-- Insert records that are new to the target (guard against re-inserting existing rows)
INSERT INTO SalesDW (SalesID, CustomerID, Amount, LastModified)
SELECT S.SalesID, S.CustomerID, S.Amount, S.LastModified
FROM Sales S
WHERE S.LastModified > '2023-12-01 10:00:00'
  AND NOT EXISTS (SELECT 1 FROM SalesDW D WHERE D.SalesID = S.SalesID);

-- Update records that already exist in the target (if needed)
UPDATE D
SET CustomerID = S.CustomerID,
    Amount = S.Amount,
    LastModified = S.LastModified
FROM SalesDW D
INNER JOIN Sales S ON D.SalesID = S.SalesID
WHERE S.LastModified > '2023-12-01 10:00:00';

5. Update the High Watermark:
o Set the new high watermark to the maximum LastModified timestamp processed (2023-12-03 12:00:00) and persist it so the next run starts from it (see the sketch below).
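
The watermark has to survive between runs, so it is usually kept in a small control table or file that the next run reads before extraction. A minimal Python sketch of that idea, assuming a hypothetical local watermark.json file as the store (a database control table works the same way):

python
import json
from pathlib import Path

WATERMARK_FILE = Path("watermark.json")  # hypothetical store; a control table is more common

def read_watermark(default="1900-01-01 00:00:00"):
    """Return the last processed LastModified value, or a default on the first run."""
    if WATERMARK_FILE.exists():
        return json.loads(WATERMARK_FILE.read_text())["last_modified"]
    return default

def save_watermark(value):
    """Persist the new high watermark after a successful load."""
    WATERMARK_FILE.write_text(json.dumps({"last_modified": value}))

# After the load above succeeds, record the highest timestamp that was processed
save_watermark("2023-12-03 12:00:00")
print(read_watermark())  # -> 2023-12-03 12:00:00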

Incremental Load in Tools

Azure Data Factory:

- Use Lookup or Filter activities to fetch incremental data.
- Use a parameterized pipeline with a watermark value passed as a parameter.
- Use a sink dataset to write the incremental data to the target.

Databricks with PySpark:

- Use max() on the LastModified column to determine the high watermark.
- Filter the DataFrame based on this value.

Example PySpark Code:

python
# Read source data
source_df = spark.read.format("delta").load("/source_path")

# Read target data
target_df = spark.read.format("delta").load("/target_path")

# Determine the high watermark (latest LastModified already in the target)
last_processed_time = target_df.agg({"LastModified": "max"}).collect()[0][0]

# Filter for new or modified records
incremental_df = source_df.filter(source_df.LastModified > last_processed_time)

# Append the new/changed records to the target
incremental_df.write.format("delta").mode("append").save("/target_path")

Benefits of Incremental Load

1. Efficiency: Reduces the volume of data processed.
2. Performance: Faster ETL processes, as only a subset of data is handled.
3. Scalability: Suitable for large datasets where full loads are impractical.

This approach is widely used in data warehousing, ETL tools, and cloud services to
ensure efficient and scalable data pipelines.
