PROJECT 2 For Python

The document outlines an architecture for migrating data from an on-premises virtual machine to Azure, utilizing components like Azure Data Factory, Azure Data Lake Storage Gen2, and Azure Synapse Analytics. It details the process of data transfer, transformation through Azure Databricks, and the setup of security and monitoring mechanisms. Step-by-step instructions for creating the necessary resources and pipelines in Azure are also provided, ensuring a structured approach to the migration process.


On-Prem to Azure Data Migration Architecture

Architecture Components

1. On-Premises Virtual Machine (VM):


The source data resides on an on-premises virtual machine. This VM has a file system that
holds data in various formats like TXT, CSV, and Parquet.

2. Azure Data Factory (ADF):


ADF is used to manage, schedule, and automate the data transfer from the on-prem system to
the cloud. It acts as the central orchestrator for the end-to-end migration process.

3. Azure Data Lake Storage Gen2 (ADLS Gen2):


This is the main storage layer in Azure where the data is organized into different zones. The raw
zone holds the original data, while cleaned and transformed data goes into the preprocessed
and final processed zones.

4. Azure Synapse Analytics:


The final data, after all processing is complete, is loaded into Synapse Analytics. This serves
as the central data warehouse for analytics and BI purposes.

5. Azure Databricks using PySpark:


Azure Databricks is used for building data transformation layers—Bronze, Silver, and Gold.

• Bronze Layer: Cleans up the raw data by removing duplicates, null values, and
unwanted records.

• Silver Layer: Applies further transformations like joins, filters, and business logic.

• Gold Layer: Produces fully structured, high-quality datasets that are ready for
reporting or machine learning.

6. Azure App Registration:


Used to securely connect Databricks to ADLS Gen2 through mounted access. This ensures
secure and consistent connectivity.

7. Azure Logic Apps:


Set up to monitor the data pipelines and send notifications or alerts when jobs fail or succeed.

8. Azure Key Vault:


Secrets, credentials, and other sensitive configurations are stored securely in Key Vault, which
is accessed by ADF and Databricks.
High-Level Architecture Flow

1. On-Prem Data Availability:


We begin by preparing a virtual machine on-prem with files (TXT, CSV, Parquet) stored in a
local directory.

2. Connecting On-Prem VM to Azure:


In Azure Data Factory, we set up a Self-Hosted Integration Runtime (SHIR). This acts as a
secure bridge between Azure and the on-prem system to enable data transfer.

3. Transferring Data to ADLS Gen2:


ADF pipelines use various activities like Copy, Lookup, and Metadata to extract files from the
VM and load them into the raw layer of ADLS Gen2.

4. Processing with Azure Databricks:


In Databricks, PySpark scripts handle the transformation across three stages (a minimal sketch follows the list below):

• Bronze: Initial cleaning like removing nulls and duplicates.

• Silver: Applying joins and business logic to transform the data.

• Gold: Creating final datasets ready for consumption and analytics.
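
The three stages correspond to three small PySpark steps. The following is a minimal, hypothetical sketch only: the paths, column names such as cid, and the sample lookup data are illustrative and not taken from the project, and it assumes a Databricks notebook where spark is predefined.

from pyspark.sql.functions import col

# Bronze: basic cleanup of the raw files (drop duplicates and fully empty rows)
raw_df = spark.read.option("header", True).csv("/mnt/global/raw/cust/2024/01/01")
bronze_df = raw_df.dropDuplicates().na.drop(how="all")

# Silver: apply joins / business logic (an illustrative enrichment join)
phones_df = spark.createDataFrame([("1", "78654345")], ["cid", "cphone"])
silver_df = bronze_df.join(phones_df, on="cid", how="left")

# Gold: a structured, consumption-ready dataset
gold_df = silver_df.filter(col("cid").isNotNull())
gold_df.write.mode("overwrite").option("header", True).csv("/mnt/global/gold/dimcust")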

5. Loading into Synapse Analytics:


The transformed data from the Gold layer is loaded into Synapse SQL Data Warehouse where
analysts and BI tools can run queries.

6. Security Layer:

• App Registration enables token-based secure access to storage.

• Key Vault holds credentials, which are accessed by ADF and Databricks as needed.

7. Monitoring and Notifications:


Logic Apps are used to set up alert mechanisms that notify the team when a pipeline fails or
completes successfully.
Pipeline Structure

• Pipeline 1: Moves data from the on-prem VM to ADLS (raw layer).

• Pipeline 2: Transforms raw data to the Bronze layer in ADLS.

• Pipeline 3: Converts Bronze data to the Silver layer.

• Pipeline 4: Moves the refined Silver data into Synapse SQL Data Warehouse.

• Pipeline 5: A master pipeline that orchestrates the above pipelines in sequence.


Step-by-Step Setup Instructions
Step 1 - Create the On-Prem VM:
Choose an image with SQL Server, then proceed through the VM creation wizard using the default
options. Make sure to enable SQL Authentication.
Step 2 - Set Up Azure Data Factory:
Create an ADF instance from the portal.

Step 3 - Provision ADLS Gen2:


• Create a storage account.

• Add a container named global.

• Within that container, create folders for raw, bronze, and silver data.

Step 4 - Create a Key Vault:


• Assign access to ADF in the Key Vault’s Access Policies section so it can retrieve
secrets.

• Complete the setup by reviewing and creating the Key Vault.


Step 5 - Deploy Dedicated SQL Pool (Synapse):
Create a Synapse SQL pool that will later serve as the target data warehouse.
Step 6 - Remote into the VM:
Use RDP to connect to the VM using its public IP and login credentials.

Step 7 - Install Self-Hosted Integration Runtime (SHIR):


• In ADF, create a new SHIR.

• Download the integration runtime installer.

• Copy the registration key.

• Move to the VM and disable Enhanced Security in Internet Explorer.

• Install the integration runtime on the VM and register it using the copied key.
Step 8 - Connect Synapse Pool to SSMS:
Use SQL Server Management Studio (SSMS) to verify connectivity to your Synapse dedicated pool.

Step 9 - Prepare Files on the VM:


Upload your sample data files (preferably in .txt format since Excel isn't available) to the C: drive of the
VM.

Step 10 - Disable Validation in Integration Runtime:


Go to the folder where the integration runtime is installed:

C:\Program Files\Microsoft Integration Runtime\5.0\Shared

• Open PowerShell inside the VM.

• Navigate to the directory using:

cd "C:\Program Files\Microsoft Integration Runtime\5.0\Shared"

• Run the following command to disable path validation:

.\dmgcmd.exe -DisableLocalFolderPathValidation
Step 11: Creating Linked Services in Azure Data Factory
1. File System Linked Service (On-Prem VM):

Choose Self-Hosted Integration Runtime.

In the host/path section, provide the exact directory path where the files were copied
(e.g., C:\Data).

For authentication:

▪ Use the username of the VM (RDP username).

▪ For the password, store it as a secret in Azure Key Vault, then reference it in the
linked service using Key Vault integration.

Click Test Connection to ensure it connects successfully. If the dmgcmd.exe path validation was disabled correctly, the connection will succeed.
2. ADLS Gen2 Linked Service:

Use Azure Key Vault to securely retrieve secrets.

The URL format should be:


https://<adlsaccountname>.dfs.core.windows.net

To get the connection string:

▪ Go to the ADLS Storage Account → Access Keys → Connection String → copy it.

▪ Create a secret in Key Vault and store the connection string.

If not already created, set up a Key Vault linked service in ADF to access the secret.

3. Azure Synapse (SQL Pool) Linked Service:

Choose Azure Synapse Analytics as the type.

Use the Key Vault secret to get the password dynamically via the connection string.

The connection string should be of .NET format with the password field referencing Key
Vault.
4. Firewall Configuration:

Ensure that the SQL Pool’s firewall allows access from the Self-Hosted IR.
Step 12: Setting Up Metadata and Stored Procedures in Synapse (SSMS)
Run these scripts inside the testpool database using SSMS:

-- Create metadata table
CREATE TABLE metadata (
    sourcefoldername VARCHAR(50),
    storagepath VARCHAR(50),
    isactive INT,
    status VARCHAR(50)
);

-- Insert initial folder details (dedicated SQL pools accept only one row per INSERT ... VALUES)
INSERT INTO metadata (sourcefoldername, storagepath, isactive, status) VALUES ('cust', 'cust', 0, 'ready');
INSERT INTO metadata (sourcefoldername, storagepath, isactive, status) VALUES ('orders', 'orders', 0, 'ready');
INSERT INTO metadata (sourcefoldername, storagepath, isactive, status) VALUES ('emp', 'emp', 0, 'ready');
INSERT INTO metadata (sourcefoldername, storagepath, isactive, status) VALUES ('discounts', 'discounts', 0, 'ready');
GO

-- Create stored procedure to update status (each CREATE PROCEDURE must start its own batch)
CREATE PROCEDURE metadata_usp (@status VARCHAR(50), @sourcefoldername VARCHAR(50))
AS
BEGIN
    UPDATE metadata
    SET status = @status
    WHERE sourcefoldername = @sourcefoldername;
END;
GO

-- Create stored procedure to reset status
CREATE PROCEDURE reset_status_usp
AS
BEGIN
    UPDATE metadata
    SET status = 'ready';
END;
GO
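
The ADF Stored Procedure activities used later in the pipeline issue calls equivalent to the following. This pyodbc sketch is hypothetical (not part of the project) and only useful for testing the procedures outside ADF and SSMS; the server name, user, and password are placeholders, and it assumes the ODBC Driver 17 for SQL Server is installed.

import pyodbc

# Placeholder connection details for the Synapse dedicated SQL pool (testpool)
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<synapse-workspace>.sql.azuresynapse.net;"
    "DATABASE=testpool;UID=<sql-admin-user>;PWD=<sql-admin-password>"
)
cursor = conn.cursor()

# Mark the 'cust' folder as succeeded, as the pipeline does after a successful copy
cursor.execute("EXEC metadata_usp @status = ?, @sourcefoldername = ?", "succeeded", "cust")

# Reset all folders back to 'ready' before a rerun
cursor.execute("EXEC reset_status_usp")
conn.commit()

# Check the current state of the metadata table
for row in cursor.execute("SELECT sourcefoldername, status FROM metadata"):
    print(row.sourcefoldername, row.status)

conn.close()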
Step 13 - Create the copy pipeline in ADF (on-prem VM → raw layer)

• Lookup activity: the first activity looks up the metadata table in Azure Synapse. Under Settings, create a dataset that points to the Synapse metadata table.

• ForEach activity: place it after the Lookup and pass the output of the Lookup as the input to the ForEach loop.

• Copy activity (inside the ForEach):
  - Source: a File System dataset (CSV files on the on-prem VM).
  - Sink: an ADLS Gen2 dataset. The sink file path points at the raw folder, as we are going to put the raw data in the raw layer.
  - Parameterize the directory on the file system source dataset, and parameterize the sink dataset as well.
  - Build the sink directory in the form raw/<foldername>/yyyy/mm/dd; it is created as a folder for each run.
  - Map the Lookup output: sourcefoldername for the source, storagepath for the sink.

• Stored Procedure activity (following the Copy activity, on success): use metadata_usp as the stored procedure, import its parameters, pass sourcefoldername dynamically, and hard-code status to 'succeeded' to indicate that the copy activity finished.

• Stored Procedure activity (on failure): use the same stored procedure and hard-code status to 'failed'.

• Outside the ForEach, take another Stored Procedure activity that runs on success and calls reset_status_usp (created in Step 12) to reset every folder's status back to 'ready'.
Step - Create a Logic App

Create the Logic App → go to the resource → create a blank workflow → search for "HTTP" → select the Request trigger → add the Method parameter and set it to GET → add the next step → Gmail → Send email → connection name = gmail → sign in → add the To address and the subject → save. A URL is generated on the HTTP trigger; copy this URL.
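
Before wiring the copied URL into ADF, the trigger can be exercised directly. This is a hypothetical sketch (not part of the original walkthrough); the URL below is a placeholder for the one generated by the Logic App, and it assumes the requests package is available.

import requests

# Placeholder for the HTTP trigger URL copied from the Logic App
logic_app_url = "https://prod-00.eastus.logic.azure.com/workflows/<workflow-id>/triggers/manual/paths/invoke?<signature>"

# The trigger was configured with method GET, so a simple GET fires the email
response = requests.get(logic_app_url)
print(response.status_code)  # a 2xx status (typically 202 Accepted) means the run was queued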
Step - Wire the alert into ADF

In ADF, on failure of the ForEach, add a Web activity that calls the copied Logic App URL. One small change: add a Wait activity (30 seconds) after the ForEach loop. In the copy source, add a wildcard (*) to the file path, since the files on the VM (added over RDP) sit inside folders. If any failure occurs, such as a change in a file name or in the metadata, the email is triggered.
Step 14 - Perform data quality checks and move cleansed data from the raw layer to the bronze layer

Data quality checks are performed in Databricks, so create an Azure Databricks service in Azure.

The data source is the ADLS container. Create a mount point to connect to the container (since we are using only one container, and inside it we are only moving data from the raw folder to the bronze folder, one mount point is enough).

Mounting ADLS requires three services: Azure Databricks, Azure Key Vault, and a service principal (SPN), i.e. an App Registration (where we create a new client secret, extract the client secret value, client ID, and tenant ID, and create Key Vault secrets for each of them).

Create the App Registration: go to Azure Active Directory → App registrations → register a new app. Once it is created, go to Certificates & secrets → New client secret, and copy the secret value right away, as it is masked once the page is closed.

Step - Go to the ADLS storage account → Access control (IAM) → add the Storage Blob Data Contributor role → assign it to the App Registration created above → review and assign.

Step 15 - Go to Azure Databricks and create a notebook

Create dbutils widgets (once executed, the widget boxes appear at the top of the notebook, through which we can filter for particular data and run the entire notebook):

dbutils.widgets.text('processeddate', '')
dbutils.widgets.text('foldername', '')

Step - Create the ADLS mount point

configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": dbutils.secrets.get(scope="adlsgenkey", key="appid"),
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="adlsgenkey", key="apppwd"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/f5ea40f2-c7b8-4658-8d25-0aac8535e48c/oauth2/v2.0/token",
    "fs.azure.createRemoteFileSystemDuringInitialization": "true"
}

dbutils.fs.mount(
    source = "abfss://[email protected]/",
    mount_point = "/mnt/global",
    extra_configs = configs
)

Create the secret scope by appending #secrets/createScope to the end of the Databricks workspace URL.

• Give the scope the name adlsgenkey.
• The DNS name and resource ID come from the Azure Key Vault.

In the mount point configuration:

• "client.secret" → apppwd, i.e. the secret value copied from the App Registration.
• "client.id" → the application (client) ID.
• "client.endpoint" → copy the tenant ID and paste it into "https://login.microsoftonline.com/<tenantID>/oauth2/v2.0/token".
• source = "abfss://<containername>@<storageaccountname>.dfs.core.windows.net/"

Once the mount is executed, comment everything out (select all the text and use the Ctrl+/ shortcut to comment it all together) so the notebook does not try to remount on later runs.
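
Once the mount command has run, it can be checked from the same notebook. A quick sanity check, assuming the mount at /mnt/global succeeded:

# List all mounts and confirm /mnt/global points at the ADLS container
for m in dbutils.fs.mounts():
    if m.mountPoint == "/mnt/global":
        print(m.mountPoint, "->", m.source)

# List the top-level folders (raw, bronze, silver) to confirm access works
display(dbutils.fs.ls("/mnt/global/"))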
Step 16 - Move data from raw to bronze

src_path = "/mnt/global/raw/"
dest_path = "/mnt/global/bronze/"

dbutils.widgets.text('processeddate', '')
dbutils.widgets.text('foldername', '')

foldername = dbutils.widgets.get('foldername')
pdate = dbutils.widgets.get('processeddate')
print(foldername)
print(pdate)

src_final_path = src_path + foldername + "/" + pdate
print(src_final_path)

dest_final_path = dest_path + foldername + "/" + pdate
print(dest_final_path)

# Following is the code for cleaning the data
try:
    # Read data from the source path
    df = spark.read.format("csv").option("header", True).load(src_final_path)

    # Count the number of rows in the source DataFrame
    src_count = df.count()
    print("Source count:", src_count)

    # Remove duplicates
    df1 = df.dropDuplicates()

    # Count the number of rows in the destination DataFrame
    dest_count = df1.count()
    print("Destination count:", dest_count)

    # Write the cleaned data to the destination path
    df1.write.mode("overwrite").format("csv").option("header", True).save(dest_final_path)

    # Report success with the counts
    print("Success: Source count = " + str(src_count) + ", Destination count = " + str(dest_count))

except Exception as e:
    # Handle exceptions and exit with an error message
    dbutils.notebook.exit("Error: " + str(e))
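
The Bronze layer description earlier also mentions removing null values, while this notebook only drops duplicates. A minimal extension (not in the original notebook) that additionally drops fully empty rows could be:

# In addition to dropDuplicates(), drop rows in which every column is null
# (use how="any" instead if rows with any missing value should be discarded)
df1 = df.dropDuplicates().na.drop(how="all")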


Step 17 - Go to ADF and create the pipeline that moves data from raw to bronze

• Take a Lookup activity and create the Synapse source dataset for it.

• Next, take a ForEach activity and pass the output of the Lookup as its input. Inside the ForEach, take a Notebook activity and create the Databricks linked service and the dataset for the notebook.

• Under Settings, pass two base parameters, since we defined processeddate and foldername as widgets in the notebook.

• Similarly to the last pipeline, after the Notebook activity take two Stored Procedure activities, one for success and one for failure, and add the parameters.

• Outside the ForEach, take the reset Stored Procedure activity and a Web activity on failure that calls the Logic App.

If there is any error in the notebook, the pipeline fails and the activity output contains the notebook run URL. Click it to go directly to Databricks; the highlighted cell is the one with the error. Fix it and run the pipeline again. The run page cannot be edited directly, so go to the main development branch (the notebook itself) and edit the code there.
Step 18 - Transform and move data from the bronze layer to the silver layer

Below is the notebook script that performs the bronze-to-silver transformation; here only a join transformation for the cust table is performed.

# Set source and destination paths
src_path = "/mnt/global/bronze/"
dest_path = "/mnt/global/silver/"

# Input widgets for folder name and processing date
dbutils.widgets.text('foldername', '')
dbutils.widgets.text('pdate', '')

try:
    # Get user input for folder name and processing date
    foldername = dbutils.widgets.get('foldername')
    pdate = dbutils.widgets.get('pdate')
    print("Folder Name:", foldername)
    print("Processing Date:", pdate)

    # Create source and destination paths based on user input
    src_final_path = src_path + foldername + "/" + pdate
    print("Source Path:", src_final_path)

    # Destination path for writing processed data
    dest_final_path = dest_path + 'dim' + foldername
    print("Destination Path:", dest_final_path)

    # Load data from the source path
    df = spark.read.format("csv").option("header", True).load(src_final_path)
    src_count = df.count()
    print("Source Count:", src_count)

    # Display the DataFrame
    df.show()

    # Create a sample DataFrame (df11) - replace this with your actual data
    df11 = spark.createDataFrame([(2, '78654345'), (3, '67865467')], ['cid', 'cphone'])
    df11.show()

    # Join DataFrames if foldername is 'cust', otherwise use df as is
    from pyspark.sql.functions import col

    if foldername == 'cust':
        df1 = df.alias('a').join(df11.alias('b'), col('a.cid') == col('b.cid'), "inner").select('a.*', 'b.cphone')
        df1.show()
    else:
        df1 = df

    # Count rows in the destination DataFrame
    dest_count = df1.count()

    # Write processed data to the destination path
    df1.coalesce(1).write.mode("overwrite").format("csv").option("header", True).save(dest_final_path)

    print("Processing completed successfully.")
    print("Source Count:", src_count)
    print("Destination Count:", dest_count)

    dbutils.notebook.exit("Processing completed successfully.")

except Exception as e:
    print("Error:", str(e))
    dbutils.notebook.exit("Error: " + str(e))

Create a pipeline similar to the raw-to-bronze pipeline above; change the notebook and provide the base parameters correctly.

Step 19 - Move data from the silver layer to the SQL Data Warehouse

# Load silver data into the data warehouse
dbutils.widgets.text('foldername', '')
foldername = dbutils.widgets.get('foldername')
print("Folder Name:", foldername)

# Set source and destination paths for the SQL load
src_path = "/mnt/global/silver/" + 'dim' + foldername
dest_path = "dim" + foldername
print("Source Path:", src_path)
print("Destination Path:", dest_path)

# Read data from the source path
df = spark.read.format("csv").option("header", True).load(src_path)
src_count = df.count()
print("Source Count:", src_count)

# Set the Azure Storage account key
spark.conf.set("fs.azure.account.key.onpremdatasynasegen.dfs.core.windows.net",
               "o82RdY56QpidiJOBzA0+c0xBYomGajKVXZ8oZKRr+TtVSjYOTI5+i6IVTmOFL5E73Ha5wJHe7aQ1+AStdIFwNA==")

# Write data to the SQL Data Warehouse (the JDBC connection string comes from Key Vault)
df.write \
    .mode("overwrite") \
    .format("com.databricks.spark.sqldw") \
    .option("url", dbutils.secrets.get(scope="adlsgenkey", key="sqljdbcpwd")) \
    .option("dbtable", dest_path) \
    .option("tempDir", "abfss://[email protected]/tmp/synapse") \
    .option("forwardSparkAzureStorageCredentials", "true") \
    .save()

# Rows written equal the rows read for this overwrite load
dest_count = src_count
print("Source Count:", src_count)

dbutils.notebook.exit("Source Count: " + str(src_count) + ", Destination Count: " + str(dest_count))
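
To confirm the load, the same Azure Synapse connector can read the table back from the dedicated SQL pool. This is a minimal sketch, assuming it runs in the same notebook (so the storage account key and secret scope set above still apply); the dbTable and tempDir options mirror the write.

# Read the target table back from the dedicated SQL pool via the Synapse connector
table_name = "dim" + dbutils.widgets.get('foldername')  # same target table as the write above
check_df = (spark.read
    .format("com.databricks.spark.sqldw")
    .option("url", dbutils.secrets.get(scope="adlsgenkey", key="sqljdbcpwd"))
    .option("dbTable", table_name)
    .option("tempDir", "abfss://[email protected]/tmp/synapse")
    .option("forwardSparkAzureStorageCredentials", "true")
    .load())

print("Rows now in", table_name, ":", check_df.count())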

Create a pipeline similar to the one above; here we pass only one base parameter (foldername).


Step 20 - Create a master pipeline that executes all the pipelines in sequence using the Execute Pipeline activity.
