Data Engineering Standards Guide

This document provides an overview of the data engineering platform and establishes standards for development activities. It covers required skills, coding languages, tools, code repositories, code consistency standards, unit testing, Azure Data Factory, Azure DevOps Pipelines, and the release process.

Data Engineering - DDH overview + standards

Context:
This document provides the platform overview from the data engineering perspective. It also establishes standards for development activities.

1. Data Engineer required skillset


2. Coding languages
a. Python / PySpark (all data transformation activities)
b. T-SQL (Synapse, SQL Server)
c. PowerShell (infrastructure automation, deployments)
d. C# / .NET (auxiliary platform activities: APIs / Functions)
e. ARM / JSON / YAML
3. Tools
a. IDE / Visual Studio (VS) Code
i. Remote Containers / WSL2 (see: "DATA ENG ONLY - Create…: New joiners’ Checklist - Copy only")
ii. Databricks-Connect (a smoke-test sketch follows this tools list)
iii. Manual code deployment (Databricks, ADF, lake)
iv. Linting
v. Unit Tests
vi. Regression tests
vii. Validation routines + Control DB tables
b. Databricks Notebooks - only for investigation / data-debugging purposes (see the ReadMe file)
c. SSMS (SQL Server Management Studio) - tool for querying data in the DW / Control DB
d. Full Visual Studio:
i. SSDT - Data Definition
ii. Azure Function Definitions
e. Azure Storage Explorer - Accessing the Lake resources
f. Azure Portal - Access control, resources status etc.
g. Azure Data Factory - pipeline orchestration
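
Databricks-Connect (3.a.ii) lets transformation code written in VS Code execute on a remote cluster instead of inside a notebook. A minimal smoke-test sketch, assuming the client has already been set up with `databricks-connect configure` against the DEV workspace; the DataFrame contents are purely illustrative:

    from pyspark.sql import SparkSession

    # With databricks-connect configured, the builder resolves to the
    # remote DEV cluster rather than a local Spark instance.
    spark = SparkSession.builder.getOrCreate()

    # Round-trip check: build a small DataFrame and run an action on the cluster.
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    print(df.count())  # expected: 2, executed remotely
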
4. Code repositories:
a. Ingestion (Databricks/ADF/Synapse) : https://dev.azure.com/diageoinsights/Orion/_git/Orion?version=GBmaster
i. ADF (triggers)
ii. Databricks
iii. Warehouse
iv. Tests
b. Services / Control DB/ API etc: https://dev.azure.com/diageoinsights/Orion/_git/Orion-Services
c. Infrastructure: https://dev.azure.com/diageoinsights/Orion/_git/Orion-Infrastructure
d. ReadMe files / Wiki: https://dev.azure.com/diageoinsights/Orion/_wiki/wikis/Development%20Wiki/90/Development-Wiki
5. Code consistency (per "Data Engineering, the pipelines building standards"):
a. Helpers
b. Hash key creation methods on the target side (see the sketch after this list)
c. Schema object life cycle
d. Static vs. dynamic SQL
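
Item 5.b above names the standard but not a method, so the following is only an illustrative PySpark sketch of one common target-side approach: SHA-256 over the concatenated business-key columns, with an explicit separator and NULL handling so that ("ab", "c") and ("a", "bc") hash differently. The column names are hypothetical:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical business keys for a target table.
    business_keys = ["country_code", "sku"]

    df = spark.createDataFrame(
        [("GB", "SKU-001"), ("IE", None)],
        business_keys,
    )

    # Cast to string and replace NULL with "" before joining with "||";
    # concat_ws silently skips NULLs, which would collapse separators and
    # make distinct keys collide, hence the explicit coalesce.
    hash_input = F.concat_ws(
        "||",
        *[F.coalesce(F.col(c).cast("string"), F.lit("")) for c in business_keys],
    )
    df = df.withColumn("hash_key", F.sha2(hash_input, 256))
    df.show(truncate=False)
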
6. Unit Tests (a minimal example follows this list):
a. Local / container PySpark + pytest installation
b. TestExplorer
c. Fixtures
d. Data samples / Inputs / Expected Results
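
A minimal shape for 6.b-6.d, assuming the local/container PySpark + pytest setup from 6.a; the transformation under test, module layout, and column names are hypothetical. Tests written this way are picked up by Test Explorer (6.b) in VS Code:

    import pytest
    from pyspark.sql import SparkSession, functions as F


    @pytest.fixture(scope="session")
    def spark():
        # Shared local SparkSession for the whole test run (fixture, 6.c).
        return (
            SparkSession.builder.master("local[2]")
            .appName("unit-tests")
            .getOrCreate()
        )


    def test_dedup_keeps_one_row_per_id(spark):
        # Data sample / expected result kept next to the test (6.d).
        input_df = spark.createDataFrame(
            [(1, "2024-01-01"), (1, "2024-02-01"), (2, "2024-01-15")],
            ["id", "load_date"],
        )
        result = input_df.groupBy("id").agg(F.max("load_date"))
        assert result.count() == 2
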
7. Azure Data Factory: https://adf.azure.com/en-us/home?factory=%2Fsubscriptions%2F3a683d84-be08-4356-bb14-3b62df1bad55%2FresourceGroups%2Fdiageo-analytics-prod-rg-orion%2Fproviders%2FMicrosoft.DataFactory%2Ffactories%2Fdiageo-eun-analytics-prod-adf-ingest-prod01
a. Folder structure
b. Data sets
c. Linked services
d. Triggers
e. Global parameters
f. Deployment with REST (see the sketch after this list)
g. BAU ADF (admin purposes): https://adf.azure.com/en-us/authoring?factory=%2Fsubscriptions%2Fcbce95f0-3cdf-467f-852e-1b64bd2105d3%2FresourceGroups%2Fdiageo-analytics-nonprod-rg-orion%2Fproviders%2FMicrosoft.DataFactory%2Ffactories%2Fdiageo-eun-analytics-nonprod-adf-bau-dev01
h. Development in the main DEV environment using git branches; deploy from VS Code; test in a private ADF instance
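
Item 7.f only names the mechanism, so below is a hedged sketch of one way to upsert a single pipeline definition through the public Microsoft.DataFactory management REST endpoint (api-version 2018-06-01). The subscription, resource group, factory, pipeline name, and token are placeholders; the bearer token is assumed to come from e.g. `az account get-access-token`:

    import json
    import requests

    SUBSCRIPTION = "<subscription-id>"   # placeholder
    RESOURCE_GROUP = "<resource-group>"  # placeholder
    FACTORY = "<factory-name>"           # placeholder
    PIPELINE = "pl_example"              # hypothetical pipeline name
    TOKEN = "<bearer-token>"             # e.g. from `az account get-access-token`

    url = (
        f"https://management.azure.com/subscriptions/{SUBSCRIPTION}"
        f"/resourceGroups/{RESOURCE_GROUP}/providers/Microsoft.DataFactory"
        f"/factories/{FACTORY}/pipelines/{PIPELINE}?api-version=2018-06-01"
    )

    # Pipeline JSON as kept in the git repo / exported from the ADF authoring UI.
    body = {"properties": {"activities": [], "annotations": []}}

    resp = requests.put(
        url,
        headers={"Authorization": f"Bearer {TOKEN}",
                 "Content-Type": "application/json"},
        data=json.dumps(body),
    )
    resp.raise_for_status()
    print(resp.json()["name"])  # name of the created/updated pipeline
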
8. Azure DevOps Pipelines:
a. Ingestion: https://dev.azure.com/diageoinsights/Orion/_build?definitionId=90
b. Services: https://dev.azure.com/diageoinsights/Orion/_release?_a=releases&view=mine&definitionId=24
c. Infra: https://dev.azure.com/diageoinsights/Orion/_release?_a=releases&view=mine&definitionId=7
d. Config: https://dev.azure.com/diageoinsights/Orion/_build?definitionId=98
e. Private DEV environment: https://dev.azure.com/diageoinsights/Orion/_build?definitionId=226
i. Private Synapse Build: https://dev.azure.com/diageoinsights/Orion/_build?definitionId=95
f. Pull Request Validation: https://dev.azure.com/diageoinsights/Orion/_build?definitionId=101
g. Environment Dim Sync: https://dev.azure.com/diageoinsights/Orion/_build?definitionId=229
h. Unit Tests: https://dev.azure.com/diageoinsights/Orion/_build?definitionId=211
9. Release process:
a. Pull requests: https://dev.azure.com/diageoinsights/Orion/_git/Orion/pullrequests?_a=completed
i. Approvals (Viz / BAU)
ii. Policies: "Data Engineering, the pipelines building standards"
b. Regression tests
c. Release standup
i. Release document
ii. Testing in UAT
d. Releasing to PROD / git tagging
