Azure Data Engineering Interview Q & A - Topicwise
INTERVIEW Q & A -
TOPICWISE
(This document contains interview questions that I was personally asked across my 20+ interviews)
Skills Covered
PySpark
SQL
Databricks
Azure Data Factory
ADLS
Python
Data Modeling and Architecture
Other Questions (Azure DevOps, Azure Functions, Logic Apps, and general questions)
1. PySpark
3. Perform aggregation, grouping, and filtering on a DataFrame.
How to handle incremental load in PySpark when the table lacks a last_modified or
updated_time column?
How many jobs, stages, and tasks are created during a Spark job execution?
How do you design and implement data pipelines using Azure Data Factory?
Explain the use of Azure Synapse Analytics and how it integrates with other Azure services.
How do you optimize data storage and retrieval in Azure Data Lake Storage?
How do you implement security measures for data in transit and at rest in Azure?
How do you use Azure Stream Analytics for real-time data processing?
What are the different data partitioning strategies in Azure SQL Data Warehouse?
How do you monitor and troubleshoot data pipelines in Azure Data Factory?
Describe the process of integrating Azure Databricks with Azure Data Lake Storage.
How do you manage and automate data workflows using Azure Logic Apps?
Explain the concept of Managed Identity in Azure and its use in data engineering.
How do you ensure data consistency and reliability in distributed systems on Azure?
Describe the process of performing ETL operations using Azure Data Factory.
How do you use Azure Monitor to track and analyze performance metrics of your data
infrastructure?
What are the best practices for data archiving and retention in Azure?
How do you implement disaster recovery and backup strategies for data in Azure?
2. SQL
What is the difference between star schema, snowflake schema, and 3NF?
Write a SQL query to convert row-level data to column-level data using pivot.
How do you set up and configure an Azure Data Factory pipeline for ETL processes?
What are the best practices for database security in Azure SQL Database?
Describe the use of Azure Cosmos DB and its benefits over traditional SQL databases.
How do you integrate Azure SQL Database with other Azure services?
Explain the concept of Azure Data Lake and its integration with SQL-based systems.
How do you use Azure Logic Apps to automate data workflows in SQL databases?
How do you use Azure Data Lake Analytics with U-SQL for big data processing?
Explain the differences between Azure SQL Database and Azure SQL Managed Instance.
How do you ensure high availability and disaster recovery for Azure SQL databases?
What are the common challenges in managing large-scale SQL databases in Azure?
Describe the process of setting up and managing an Azure Synapse Analytics workspace.
Explain the use of Azure Active Directory for SQL database authentication.
Describe the process of using Azure SQL Data Sync for data replication.
What are the key considerations for designing a scalable data architecture in Azure?
How to run one notebook from another notebook using %run or dbutils.notebook.run()?
What are Delta logs, and how to track data versioning in Delta tables?
Explain the process of copying large files (e.g., 3TB) from on-premises to Azure in ADF.
How to implement error handling in ADF using retry, try-catch blocks, and failover mechanisms?
What are the activities in ADF (e.g., Copy Activity, Notebook Activity)?
Where to store sensitive information like passwords in ADF (e.g., Azure Key Vault)?
How to copy all tables from one source to the target using metadata-driven pipelines.
Difference between Blob Storage and Azure Data Lake Storage (ADLS).
Why is mounting preferred over using access keys to connect Databricks with ADLS?
Explain how to copy all files with different formats (CSV, Parquet, Excel) from a source to
target.
How to track file names in the output table while performing copy operations in ADF?
What are the security features available in ADLS (e.g., access control lists, role-based
access)?
Find consecutive numbers in a list (e.g., "Print the number that appears
consecutively 3 times").
Write Python code to split a name column into firstname and lastname.
Explain the deployment architecture of a data pipeline involving ADF, Databricks, and ADLS.
What are the different types of data schemas, and how do you choose the right one for your
data model?
How do you handle slowly changing dimensions (SCD) in data modeling?
Explain the concept of denormalization and when it should be used.
How do you design a star schema for a sales reporting system?
What is data mart, and how does it differ from a data warehouse?
How do you ensure data consistency and integrity in a distributed data architecture?
Describe the role of metadata in data modeling and data architecture.
How do you optimize data models for performance and scalability?
Explain the use of surrogate keys in dimensional modeling.
How do you implement a data governance framework in a data lake environment?
8. Other Questions
What is the difference between wide transformations and narrow transformations in Spark?
What challenges have you faced in managing large datasets (e.g., 3TB+ files)?
How do you implement CI/CD pipelines for deploying ADF and Databricks solutions?
What is the use of Delta Lake, and how does it support ACID transactions?
Snowflake-specific features:
# reduceByKey
rdd.reduceByKey(lambda a, b: a + b)
df1 = spark.read.csv("large_table.csv")
df2 = spark.read.csv("small_lookup.csv")
from pyspark.sql.functions import udf

def to_upper(s):
    return s.upper()

to_upper_udf = udf(to_upper)  # returns StringType by default
df = df.withColumn("upper_name", to_upper_udf(df.name))
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()
sc = spark.sparkContext  # Access SparkContext if needed
spark = SparkSession.builder.appName("PySparkExample").getOrCreate()
result.show()
10. How to create a rank column using the Window function in PySpark?
from pyspark.sql.window import Window
from pyspark.sql.functions import rank
windowSpec = Window.partitionBy("department").orderBy("salary")
df.withColumn("rank", rank().over(windowSpec)).show()
✅ You can also use dense_rank() or row_number() depending on the use case.
14. How to handle incremental load in PySpark when the table lacks last_modified or
updated_time?
If no timestamp column is available:
Approach 1: Hash-Based Comparison (see the PySpark sketch at the end of this answer)
Add a hash column (e.g., MD5 on concatenated columns) to detect changes.
Compare with previously stored hash to identify new/changed records.
Approach 2: Full load with Delta Merge
Load the entire dataset and use MERGE INTO in Delta to upsert data.
-- Pseudo Delta MERGE (Delta SQL)
MERGE INTO target USING source
ON target.id = source.id
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...
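A minimal PySpark sketch of Approach 1 (hash-based change detection); the paths and column list are illustrative, and the detected rows would typically feed the MERGE above:
from pyspark.sql import functions as F

source_df = spark.read.parquet("/data/source")               # incoming full extract (illustrative path)
target_df = spark.read.format("delta").load("/data/target")  # previously loaded data with stored hashes

# Fingerprint each row by hashing the business columns
hashed_src = source_df.withColumn("row_hash", F.md5(F.concat_ws("||", "id", "name", "amount")))

# Rows whose hash is not present in the target are new or changed
changed = hashed_src.join(target_df.select("row_hash"), on="row_hash", how="left_anti")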
15. How many jobs, stages, and tasks are created during a Spark job execution?
Job: One job per action (e.g., .collect(), .write())
Stage: Divided by shuffle boundaries
Task: One task per partition in a stage
🔸 Example:
df.groupBy("col").sum().show()
Triggers 1 job
If 3 shuffles: 3 stages
If each stage has 4 partitions: 4 tasks per stage
You can see this breakdown in Spark UI under the Jobs tab.
from pyspark.sql.functions import sum as _sum
df.groupBy("department").agg(_sum("salary").alias("sumofsalary")).filter("sumofsalary > 50000").show()
🔹 Explanation:
groupBy("department"): Groups rows by department
agg(_sum("salary")): Aggregates salary column
alias(...): Renames aggregated column
filter(...): Filters aggregated result
20. How do you design and implement data pipelines using Azure Data Factory (ADF)?
ADF Pipeline Design Flow:
1. Source Identification
o Connectors: On-prem (SQL, Oracle), Cloud (ADLS, Blob, S3, API, etc.)
2. Create Linked Services
o Define connections to source/destination.
3. Datasets
o Represent data structures (tables/files).
4. Pipeline Activities
o Use Copy Activity, Data Flow, Stored Procedure, Notebook, etc.
5. Orchestration
o Use Trigger (schedule/event-based/manual)
o Use control flow: If, ForEach, Until, etc.
6. Monitoring
o Use ADF Monitoring tab, alerts, and logs.
🔹 Example Use Case:
Extract from SQL → Transform in Data Flow → Load to ADLS/SQL DB.
21. Explain the use of Azure Synapse Analytics and how it integrates with other Azure services.
Azure Synapse Analytics is an integrated analytics service combining big data and data warehousing.
🔹 Features:
SQL-based data warehouse (Synapse SQL)
Spark runtime for big data processing
Serverless and dedicated pools
Deep integration with Power BI, ADF, Azure ML, ADLS
🔹 Integration Examples:
ADLS Gen2: Synapse reads data directly from lake
ADF Pipelines: Orchestrate data flows into Synapse
Power BI: Direct Query mode for reports
Azure Purview: Metadata catalog and lineage
Azure Key Vault: Secure credential storage
22. How do you optimize data storage and retrieval in Azure Data Lake Storage (ADLS)?
✅ Best Practices for Optimization:
1. Partitioning
o Use date-based folder structure (e.g., /year/month/day/)
o Helps improve query performance and scanning
2. File Size Tuning
o Avoid small files (optimum: 100–250 MB per file)
3. Format Choices
o Use columnar formats like Parquet/ORC over CSV
4. Compression
o Use snappy/gzip for faster transfer and less storage
5. Delta Lake Usage
o Enables ACID, upserts, schema evolution
6. Z-Ordering
o For faster lookup of specific values
7. Caching (Databricks)
o Cache frequently accessed data
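A short PySpark sketch combining several of these practices (date partitioning, Parquet, snappy compression); the path and column names are illustrative:
(df.write
   .format("parquet")
   .option("compression", "snappy")
   .partitionBy("year", "month", "day")   # produces /year=.../month=.../day=... folders
   .mode("overwrite")
   .save("abfss://<container>@<storageaccount>.dfs.core.windows.net/curated/sales"))

# On Databricks with Delta, compact small files and co-locate a frequently filtered key
spark.sql("OPTIMIZE delta.`/mnt/curated/sales` ZORDER BY (customer_id)")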
24. Describe the process of setting up and managing an Azure SQL Database
✅ Setup Steps:
1. Create Azure SQL DB:
o From Azure Portal → Create Resource → Azure SQL → SQL Database
2. Configure Settings:
o Choose server (or create new), pricing tier, backup, security
3. Connect:
o Use SSMS, Azure Data Studio, or ADF/Databricks
4. Firewall Settings:
o Allow your IP or use Azure services
5. Create Tables/Objects:
o Run SQL scripts for schema setup
6. Performance Tuning:
o Use Query Performance Insight, Automatic Tuning
7. Security:
o Use Transparent Data Encryption, Azure Defender, Key Vault
8. Monitoring:
o Use Azure Monitor, Log Analytics, Alerts
25. Explain the concept of PolyBase in Azure SQL Data Warehouse (now Synapse SQL)
PolyBase is a feature in Azure Synapse Analytics that lets you run T-SQL queries over external
data sources like:
Azure Data Lake Storage (ADLS)
Blob Storage
Hadoop
External SQL Servers
💡 Why is it useful?
Enables ELT (Extract-Load-Transform) pattern: load data directly from files into Synapse
tables without pre-processing.
You can query Parquet, CSV, or ORC files as if they were tables.
Example:
CREATE EXTERNAL TABLE ext_sales (
id INT,
amount FLOAT
)
WITH (
LOCATION = '/sales-data/',
DATA_SOURCE = MyADLS,
FILE_FORMAT = ParquetFileFormat
);
26. How do you implement security measures for data in transit and at rest in Azure?
✅ Data at Rest:
Encryption by default using Azure-managed keys (Storage, SQL, Synapse)
Customer-Managed Keys (CMK) via Azure Key Vault
Transparent Data Encryption (TDE) in SQL
✅ Data in Transit:
HTTPS and TLS 1.2+ for secure data movement
Use Private Endpoints to keep traffic within Azure network
SAS Tokens and OAuth tokens for controlled access
27. How do you use Azure Stream Analytics for real-time data processing?
Azure Stream Analytics (ASA) processes real-time streaming data from sources like:
Azure Event Hubs
Azure IoT Hub
Azure Blob Storage (append blobs)
🔹 You can run SQL-like queries on streaming data.
Example:
SELECT deviceId, AVG(temperature) AS avgTemp
FROM inputStream TIMESTAMP BY eventTime
GROUP BY deviceId, TumblingWindow(minute, 5)
📤 Output: Write to SQL DB, Power BI, Blob, ADLS, etc.
Use ASA when you want dashboards or alerts within seconds/minutes of event occurrence.
28. What are the different data partitioning strategies in Azure SQL Data Warehouse
(Synapse)?
Partitioning helps improve query performance and data management.
✅ Common Strategies:
1. Hash Partitioning:
o Distributes rows using a hash function on a column (e.g., CustomerID)
2. Range Partitioning:
o Split data by date ranges, like monthly or yearly partitions
3. Round-Robin Distribution:
o Default method; spreads rows evenly regardless of value
4. Replicated Tables:
o Whole table is copied to all compute nodes (use for small dimension tables)
🧠 Choosing the right strategy depends on your query patterns and data size.
29. How do you monitor and troubleshoot data pipelines in Azure Data Factory (ADF)?
ADF provides built-in Monitoring tools:
🔍 Monitoring Portal:
Check pipeline run status: succeeded, failed, in progress
Inspect activity run logs
📧 Alerts & Metrics:
Set up alerts via Azure Monitor for failure or latency
Debugging Tools:
Use Data Preview in Mapping Data Flows
Enable logging to Log Analytics or Storage Accounts
💡 Best Practice:
Use Retry policies and Activity dependencies to handle failures gracefully.
30. Describe the process of integrating Azure Databricks with Azure Data Lake Storage (ADLS)
🔗 Steps to integrate Databricks with ADLS Gen2:
1. Create Storage Account (with hierarchical namespace enabled)
2. Set up Azure Key Vault (to store credentials securely)
3. Mount ADLS in Databricks:
configs = {
"fs.azure.account.auth.type": "OAuth",
"fs.azure.account.oauth.provider.type": "...",
"fs.azure.account.oauth2.client.id": "...",
...
}
dbutils.fs.mount(
source = "abfss://<container>@<storageaccount>.dfs.core.windows.net/",
mount_point = "/mnt/myadls",
extra_configs = configs
)
4. Read/write data from /mnt/myadls/
✅ You can also use ABFS paths directly in read/write code without mounting.
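A hedged sketch of the direct-path approach (assumes the storage credentials have already been configured on the cluster or via spark.conf; the container, account, and folder names are placeholders):
path = "abfss://<container>@<storageaccount>.dfs.core.windows.net/raw/sales"
df = spark.read.format("parquet").load(path)
df.write.format("delta").mode("overwrite").save(
    "abfss://<container>@<storageaccount>.dfs.core.windows.net/curated/sales"
)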
31. How do you manage and automate data workflows using Azure Logic Apps?
Azure Logic Apps is a no-code/low-code platform to automate workflows like:
Data movement
Notification alerts
File uploads
Database triggers
📌 Example Workflow:
1. Trigger: When a new file is uploaded to Blob
2. Condition: If file name contains “sales”
3. Action: Trigger a stored procedure or pipeline in ADF
📌 Common Connectors:
SQL Server, Outlook, Blob, ADLS, Service Bus, Salesforce, etc.
32. Explain the concept of Managed Identity in Azure and its use in data engineering
Managed Identity allows Azure services to authenticate securely with other Azure resources
without needing secrets/passwords.
👷♂️Two types:
System-assigned: Bound to a single resource
User-assigned: Reusable across multiple resources
🔐 Use Cases in Data Engineering:
ADF accessing Key Vault
Databricks accessing ADLS
Synapse accessing Azure SQL
✅ This improves security, automation, and governance.
33. How do you ensure data consistency and reliability in distributed systems on Azure?
✅ Key Practices:
1. Idempotent Operations:
o Ensure retrying the same operation doesn't lead to duplicates
2. Exactly-once Delivery (e.g., Event Hubs + Stream Analytics)
3. Checkpointing & Watermarking (e.g., in Databricks Structured Streaming; see the sketch after this list)
4. Delta Lake Transactions:
o Use ACID compliance to manage updates reliably
5. Retry Policies and Failover:
o Handle transient failures with exponential backoff
6. Monitoring and Alerting:
o Use Azure Monitor to detect and react to data issues
7. Data Quality Checks:
o Enforce schema validation, null checks, uniqueness
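For checkpointing (point 3), a minimal Structured Streaming sketch; the paths are illustrative, and the checkpoint lets the stream resume exactly where it left off after a failure:
(spark.readStream
      .format("cloudFiles")                                    # Databricks Auto Loader (assumption)
      .option("cloudFiles.format", "json")
      .load("/mnt/raw/events")
      .writeStream
      .format("delta")
      .option("checkpointLocation", "/mnt/checkpoints/events")  # progress + state used for recovery
      .outputMode("append")
      .start("/mnt/curated/events"))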
34. Describe the process of performing ETL operations using Azure Data Factory (ADF)
Azure Data Factory is a cloud-based ETL tool that helps you move and transform data across
sources and destinations.
🔹 ETL Workflow in ADF:
1. Extract: Use Copy Activity to pull data from sources like SQL, Oracle, SAP, Blob, ADLS, etc.
2. Transform:
o Use Mapping Data Flows (visually designed Spark-based transformations)
o Or trigger Azure Databricks Notebooks or Stored Procedures
3. Load: Write the processed data to a sink like Azure SQL, ADLS, Synapse, etc.
🔄 Trigger Options:
Manual
Scheduled (time-based)
Event-based (e.g., blob created)
📂 Components:
Pipeline (group of activities)
Linked Service (connection to data)
Datasets (metadata about data)
Integration Runtime (compute engine for data movement)
35. How do you use Azure Monitor to track and analyze performance metrics of your data
infrastructure?
Azure Monitor helps you observe, alert, and analyze the health of your Azure resources.
🔍 For Data Infrastructure:
Azure Data Factory: Monitor pipeline runs, trigger executions, activity failures
Azure SQL/Synapse: Track DTU usage, query duration, deadlocks, etc.
ADLS: Monitor storage usage, access patterns
📈 Key Tools Inside Azure Monitor:
Metrics: Real-time charts (CPU, memory, latency)
Logs: Deep insights via Log Analytics queries (KQL)
Alerts: Automatic triggers for thresholds (e.g., pipeline failure or high latency)
Dashboards: Custom views to observe performance in one place
36. Explain the role of Azure Key Vault in securing sensitive data
Azure Key Vault is a secure secrets management service that stores:
🔐 What you can store:
API keys
Passwords
Certificates
Encryption keys
🎯 Why it matters:
Keeps secrets out of code
Supports RBAC and Managed Identity for secure access
Integrates with ADF, Databricks, Function Apps, Logic Apps, etc.
✅ Use Cases:
ADF securely accessing SQL DB using secrets from Key Vault
Databricks fetching credentials during runtime
37. How do you handle big data processing using Azure HDInsight?
Azure HDInsight is a managed cloud service for open-source big data frameworks.
🔧 Supported Frameworks:
Apache Spark
Hadoop
Hive
HBase
Kafka
📌 How to use it:
1. Choose the right cluster type (e.g., Spark for ETL, Kafka for streaming)
2. Load your data (from Blob, ADLS, etc.)
3. Write your ETL/ML code in PySpark or HiveQL
4. Submit jobs using Ambari, Jupyter, or REST APIs
5. Monitor via Ambari UI or Azure Monitor
💡 Why HDInsight:
Enterprise-scale processing
Supports large volume batch/stream processing
Integrated with Azure ecosystem
38. What are the best practices for data archiving and retention in Azure?
📁 Archiving = moving less-used data to cheaper storage, while retention = how long data is
kept.
🔹 Best Practices:
1. Use ADLS Lifecycle Management:
o Automatically move data from hot → cool → archive tiers
o Delete files after a specific retention period
2. Tag and Classify Data:
o Metadata tagging helps apply rules based on importance
3. Immutable Storage (Write Once, Read Many):
o Use this for compliance (e.g., financial/legal data)
4. Use Azure Blob Archive Tier:
o Cheapest option for infrequently accessed data
5. Document Retention Policies:
o Clearly define which datasets are critical, long-term, or disposable
39. How do you implement disaster recovery and backup strategies for data in Azure?
🔒 Key Components of Disaster Recovery (DR):
1. Geo-Redundancy:
o Use GZRS/RA-GRS for storage
o Azure SQL & Synapse have built-in geo-replication
2. Automated Backups:
o Azure SQL: Point-in-time restore, Long-Term Retention (LTR)
o ADLS: Use soft delete and snapshots
3. Cross-Region Replication:
o Set up secondary regions for business continuity
4. Infrastructure as Code:
o Use ARM templates or Terraform to quickly redeploy
5. Runbook/Automation:
o Automate failover or restore via Logic Apps or Azure Automation
💡 Test your DR strategy periodically (mock failovers, validation scripts)
SQL
1. How to find the second-highest salary in a table? (Multiple methods)
Assuming table: employees(emp_id, emp_name, salary)
✅ Method 1: Using DISTINCT + LIMIT / OFFSET (SQL Server: TOP)
SELECT DISTINCT salary
FROM employees
ORDER BY salary DESC
LIMIT 1 OFFSET 1;
✅ Method 2: Using MAX() with a subquery
SELECT MAX(salary)
FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);
✅ Method 3: Using DENSE_RANK()
SELECT salary
FROM (
SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) as rnk
FROM employees
) ranked
WHERE rnk = 2;
24. Differences between Azure SQL Database and Azure SQL Managed Instance
Feature | Azure SQL Database | Azure SQL Managed Instance
Deployment Type | Single DB or Elastic Pool | Instance-level (multiple DBs, like on-prem SQL Server)
Feature Compatibility | Limited (no SQL Agent, CLR, etc.) | Near-full compatibility with on-prem SQL Server
Use Case | Modern cloud apps needing high availability | Lift-and-shift of on-prem workloads with minimal changes
VNET Support | Limited (private endpoint) | Full VNET support
Cross-DB Queries | Not supported | Supported
Azure Databricks
📜 8. What are Delta logs, and how to track data versioning in Delta tables?
Delta uses _delta_log/ directory to store JSON files (transaction logs).
Use:
DESCRIBE HISTORY delta.`/path/to/table`
To see version history, timestamp, operation type, etc.
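A small PySpark sketch of time travel against the same table (the version number and timestamp are illustrative):
v1 = spark.read.format("delta").option("versionAsOf", 1).load("/path/to/table")
old = spark.read.format("delta").option("timestampAsOf", "2024-01-01").load("/path/to/table")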
⏱️16. How to manage and automate ETL workflows using Databricks Workflows?
Use Workflows (Jobs UI) → Create tasks → Chain notebooks/scripts → Define schedule,
retries, parameters.
🔁 19. How to use Databricks Delta Live Tables (DLT) for continuous processing?
DLT is a declarative ETL framework.
Use SQL or Python to define LIVE TABLES.
It handles:
o Data quality
o Incremental loads
o Lineage
o Monitoring
Example:
CREATE LIVE TABLE bronze AS SELECT * FROM cloud_files("path", "json")
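The same bronze table can be declared in Python (a sketch; it only runs inside a DLT pipeline, and the path is a placeholder):
import dlt

@dlt.table(comment="Raw files ingested with Auto Loader")
def bronze():
    return (spark.readStream
                 .format("cloudFiles")
                 .option("cloudFiles.format", "json")
                 .load("path"))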
Azure Data Factory
1. How to implement incremental load in ADF?
Use watermark columns (e.g., LastModifiedDate) and store the last loaded value in a variable or file.
In the next run, filter data using a query like:
SELECT * FROM table WHERE LastModifiedDate > @lastLoadedTime
Use a Lookup activity to get the last value, then set it via a variable or parameter in the pipeline.
4. Explain the process of copying large files (e.g., 3TB) from on-premises to Azure in ADF.
Use Self-hosted IR.
Enable parallelism via data partitioning.
Use binary copy for non-transformational moves.
Consider staging in Azure Blob Storage first.
5. How to implement error handling in ADF using retry, try-catch blocks, and failover mechanisms?
Retry: Configure retries in activity settings.
Try-Catch: Use 'If Condition', 'Switch', and 'Until' activities with proper success/failure
dependencies.
Failover: Route failures to a different branch (e.g., send email, log error, retry with backup).
9. What are the activities in ADF (e.g., Copy Activity, Notebook Activity)?
Copy Data Activity
Data Flow
Execute Pipeline
Lookup
Web/REST API
Execute Notebook
Set Variable, If Condition, ForEach
12. Where to store sensitive information like passwords in ADF (e.g., Azure Key Vault)?
Store in Azure Key Vault, and reference secrets in Linked Services using the Key Vault integration.
13. How to copy all tables from one source to the target using metadata-driven pipelines?
Create a metadata table with source/target table names.
Use Lookup + ForEach to iterate and pass values to a Copy Activity dynamically.
18. Describe the process of integrating ADF with Azure Synapse Analytics.
Use Linked Service to connect to Synapse.
Load data via Copy Activity or Stored Proc activity.
Trigger Synapse Notebooks or SQL Scripts for processing.
20. Explain how to use ADF with Azure Databricks for complex transformations.
Use Notebook Activity in ADF.
Pass parameters to Databricks.
Databricks handles transformation logic; output can be written to ADLS or Synapse.
21. How do you implement data masking in ADF for sensitive data?
Mask in Mapping Data Flows using expressions.
substring(col, 1, 2) + '***'
Mask using SQL views or Databricks logic before loading.
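For the Databricks option, a minimal PySpark masking sketch (the column name is illustrative):
from pyspark.sql import functions as F

# Keep the first two characters and mask the rest before loading
masked = df.withColumn("email_masked", F.concat(F.substring("email", 1, 2), F.lit("***"))).drop("email")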
✅ 8. Versioning in ADLS
ADLS Gen2 doesn't have built-in versioning.
Use Delta Lake on top of ADLS to manage data versioning with Delta logs.
✅ 9. Hierarchical Namespace
Makes ADLS Gen2 behave like a file system.
Enables:
o Directory-level operations (rename, delete)
o ACLs at folder/file level
o Better integration with analytics engines
How do you handle schema evolution in ADLS?
Handling schema evolution in ADLS (Azure Data Lake Storage) depends on the tools you're using on top of ADLS, since ADLS itself is just storage; it doesn't enforce or manage schema. Schema evolution is actually managed by processing engines like Delta Lake (Databricks), ADF, Synapse, or Spark.
Delta Lake handles schema evolution natively, making it the most efficient way to manage evolving
data structures.
🔹How to enable schema evolution:
df.write \
.format("delta") \
.option("mergeSchema", "true") \
.mode("overwrite") \
.save("abfss://<container>@<storageaccount>.dfs.core.windows.net/path")
🔹Benefits:
Adds new columns automatically.
Keeps a history of schema versions (data versioning).
Supports time travel to roll back to previous versions.
ADF handles schema drift and evolution using Mapping Data Flows.
🔹Steps:
1. In the source and sink, enable "Allow schema drift".
2. Enable "Auto Mapping" or manually map fields.
3. Use Derived Columns to handle transformations for new/changed columns.
⚠️ADF does not support adding new columns to SQL sinks automatically. You may need to modify
the schema manually or use stored procedures.
✅ 22. Explain the process of auditing and logging access to data in ADLS.
Enable:
o Azure Diagnostics
o Activity Logs
o Storage Analytics
o Monitor with Azure Monitor + Log Analytics
Auditing and logging access to data in Azure Data Lake Storage (ADLS)—especially Gen2—is crucial
for maintaining data security, tracking usage, and meeting compliance requirements (like GDPR,
HIPAA, etc.).
Here’s how you can audit and log access to ADLS Gen2 step-by-step:
✅ 1. Enable Azure Storage Logging and Monitoring
ADLS Gen2 is built on top of Azure Blob Storage, so you use Azure Monitor, Diagnostic Settings, and
Azure Activity Logs to audit access.
These logs track who did what on the Azure Resource itself (e.g., permissions changed, blob
containers created).
Go to Monitor > Activity Log.
Filter by resource type = Storage accounts.
Audit who has access via IAM roles like Storage Blob Data Reader, Contributor, etc.
Keep least privilege principle.
Use Azure Policy to restrict public access and enforce rules.
✅ 6. Set up Alerts
Set up alerts for:
High volume of deletes.
Access from unknown IPs.
Unauthorized access attempts.
def generate_primes(n):
primes = []
for num in range(2, n + 1):
if all(num % i != 0 for i in range(2, int(num ** 0.5) + 1)):
primes.append(num)
return primes
def is_palindrome(s):
s = s.lower().replace(" ", "")
return s == s[::-1]
def replace_vowels(text):
    return ''.join('' if ch.lower() in 'aeiou' else ch for ch in text)  # replaces each vowel with an empty string
from collections import Counter

def count_words(text):
    words = text.lower().split()
    return Counter(words)
def find_consecutive_triplet(nums):
for i in range(len(nums) - 2):
if nums[i] == nums[i+1] == nums[i+2]:
return nums[i]
return None
def split_name(full_name):
    first, last = full_name.strip().split(' ', 1)
    return first, last
def find_duplicates(lst):
counts = Counter(lst)
return {k: v for k, v in counts.items() if v > 1}
2. Write Python code to split a name column into firstname and lastname.
Assuming you have a list of full names:
names = ['John Doe', 'Jane Smith']
split_names = [name.split(' ', 1) for name in names]
for first, last in split_names:
print(f"First Name: {first}, Last Name: {last}")
With pandas:
import pandas as pd
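A minimal pandas sketch (assuming the full names sit in a column called name):
df = pd.DataFrame({"name": ["John Doe", "Jane Smith"]})
df[["firstname", "lastname"]] = df["name"].str.split(" ", n=1, expand=True)
print(df)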
8. How do you read and write data to and from a database using Python?
Use pyodbc, sqlalchemy, or psycopg2 depending on the DB.
Using sqlalchemy and pandas:
from sqlalchemy import create_engine
import pandas as pd
# Load a CSV, add a derived column, and aggregate with pandas
df = pd.read_csv('employees.csv')
df['total'] = df['salary'] + df['bonus']
grouped = df.groupby('department')['total'].mean()
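A hedged end-to-end sketch of the database read/write itself with SQLAlchemy and pandas (the connection string and table names are placeholders):
from sqlalchemy import create_engine
import pandas as pd

engine = create_engine("mssql+pyodbc://<user>:<password>@<server>/<db>?driver=ODBC+Driver+17+for+SQL+Server")
employees = pd.read_sql("SELECT * FROM employees", engine)                     # read
employees.to_sql("employees_copy", engine, if_exists="replace", index=False)   # write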
14. How do you integrate Python with big data tools like Apache Spark?
Use PySpark, the Python API for Apache Spark.
Example:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Example").getOrCreate()
df = spark.read.csv('data.csv', header=True, inferSchema=True)
df.select("name", "age").show()
Use it for distributed data processing (ETL, ML) across clusters.
• Explain the deployment architecture of a data pipeline involving ADF, Databricks, and ADLS
A common architecture looks like this:
1. ADF extracts data from source (on-prem or cloud) and loads it into ADLS (raw zone).
2. Databricks processes data (transformations, joins, aggregations) and writes clean and
enriched data back to ADLS (cleansed and curated zones).
3. ADF may load curated data into Synapse or a data warehouse for reporting.
4. Monitoring and logging are handled in ADF via pipeline monitoring and alerts.
• What are the different types of data schemas, and how do you choose the right one for your data
model?
Star Schema: Simplified, easy to understand, good for reporting (fact in the middle,
surrounded by dimensions).
Snowflake Schema: Normalized dimension tables; useful when dimensions have hierarchies.
Galaxy/Fact Constellation: Multiple fact tables sharing dimension tables.
Choose based on:
Query performance
Complexity of dimensions
Scalability and maintenance needs
• What is data mart, and how does it differ from a data warehouse?
A Data Mart is a subset of a data warehouse tailored for a specific business unit (e.g., sales,
marketing). It’s more focused and faster to build, while a Data Warehouse is enterprise-wide and
more comprehensive.
• How do you ensure data consistency and integrity in a distributed data architecture?
Use primary/foreign key constraints during ingestion
Employ validation rules during transformation (e.g., NULL checks, data types)
Use schema validation, data quality checks, and ETL error handling
Implement data lineage and audit logging
Other Questions
1. What is DAG in Spark, and how does it work?
In Apache Spark, a DAG (Directed Acyclic Graph) represents the sequence of computations
performed on data. When you execute a series of transformations on an RDD (Resilient Distributed
Dataset), Spark constructs a DAG to track these operations. The DAG is divided into stages based on
transformations:
Stages: Each stage consists of tasks that can be executed in parallel. There are two types:
o Shuffle Stages: Involve data shuffling between nodes.
o Non-Shuffle Stages: Do not involve data movement between nodes.
The DAG scheduler in Spark divides the graph into these stages and assigns tasks to worker nodes for execution.
3. What is the difference between wide transformations and narrow transformations in Spark?
In Spark, transformations are categorized based on data dependencies:
Narrow Transformations: Each partition of the parent RDD is used by at most one partition
of the child RDD. Examples include map() and filter(). These transformations do not require
data shuffling and are more efficient.
Wide Transformations: Multiple child partitions may depend on one or more partitions of
the parent RDD, necessitating data shuffling across the cluster. Examples include
groupByKey() and reduceByKey(). These operations are more resource-intensive due to the
data movement involved.
Understanding these helps in optimizing Spark jobs by minimizing expensive wide transformations when possible.
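A small PySpark illustration: filter is a narrow transformation (no shuffle), while groupBy forces a wide one (a shuffle), visible as an extra stage in the Spark UI:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("WideVsNarrow").getOrCreate()
df = spark.range(1_000_000).withColumn("bucket", F.col("id") % 10)

narrow = df.filter(F.col("id") > 100)           # narrow: processed within each partition
wide = df.groupBy("bucket").agg(F.count("*"))   # wide: data is shuffled across the cluster
wide.show()                                     # the action triggers the job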
4. How does fault tolerance work in ADF and Databricks?
Both Azure Data Factory (ADF) and Azure Databricks incorporate fault tolerance mechanisms:
Azure Data Factory: ADF ensures fault tolerance by allowing activities to retry upon failure.
You can configure the retry policy with parameters like count and interval. Additionally, ADF
can handle transient failures by implementing robust error-handling logic within pipelines.
Azure Databricks: Databricks enhances fault tolerance through features like job clusters from
pools, which provide workload isolation and faster cluster creation. This setup supports auto-
termination upon job completion and offers resilience against failures.
5. What challenges have you faced in managing large datasets (e.g., 3TB+ files)?
Managing large datasets presents several challenges:
Storage Requirements: Large volumes necessitate substantial storage capacity, leading to
increased costs and infrastructure complexity.
Performance Issues: As data volume grows, retrieval and processing times can increase,
resulting in slower and less responsive analytics.
Data Quality: Ensuring consistency and accuracy becomes more complex with larger
datasets, requiring robust validation mechanisms.
Addressing these challenges involves implementing efficient storage solutions, optimizing data processing workflows, and establishing stringent data quality protocols.
6. How do you implement CI/CD pipelines for deploying ADF and Databricks solutions?
Implementing CI/CD pipelines involves automating the deployment and integration processes:
Azure Data Factory: Utilize Azure DevOps to create build and release pipelines that automate
the deployment of ADF pipelines and related resources.
Azure Databricks: Implement CI/CD by integrating Databricks notebooks with version control
systems like Git. Use Azure DevOps pipelines to automate testing and deployment of
notebooks and other Databricks artifacts.
This approach ensures consistent and reliable deployments across environments.
8. What is the use of Delta Lake, and how does it support ACID transactions?
Delta Lake is an open-source storage layer that brings ACID (Atomicity, Consistency, Isolation,
Durability) transactions to data lakes:
ACID Transactions: Delta Lake ensures that all operations on data are atomic, consistent,
isolated, and durable. This means that each transaction is completed fully or not at all, data
remains consistent before and after transactions, transactions are isolated from each other,
and once a transaction is committed, it remains permanent even in the event of a failure (durability).
✅ What is a Unity Catalog in Databricks?
Unity Catalog is a unified governance solution for all data and AI assets in the Databricks Lakehouse
platform. It centralizes and standardizes access control, auditing, and lineage tracking across
workspaces.
Key Features:
Centralized metadata management: One catalog for multiple workspaces.
Fine-grained access control: Manage access down to the column and row level using ANSI
SQL.
Data lineage: Automatically track how data moves and transforms from raw to refined.
Support for multiple storage accounts: Unlike legacy Hive metastore, it supports federated
access.
Use case: If you’re working in a multi-team or enterprise setup, Unity Catalog helps you enforce
consistent data governance and security across projects.
✅ Snowflake-Specific Features
❓ How does it compare to Delta Lake or other data lakes?
Feature | Snowflake | Delta Lake (Databricks)
Storage Architecture | Decoupled compute & storage (multi-cluster shared data) | Built on top of Parquet files with transactional support
ACID Transactions | Fully supported | Supported via Delta Lake
Performance | Auto-scaling compute, fast query performance | High performance with tuning options
Ease of Use | SQL-first, no infrastructure management | More flexible but requires Spark knowledge
Data Sharing | Seamless cross-account sharing (Snowflake Data Sharing) | Data sharing possible but more manual
Streaming Support | Limited native support | Excellent streaming with Structured Streaming
❓ When would you choose Snowflake over other platforms?
You want a fully-managed, SQL-friendly cloud data warehouse.
Your team has limited Spark expertise.
You need multi-region or cross-cloud deployment (Snowflake runs on AWS, Azure, and GCP).
For use cases needing instant compute scaling, automatic tuning, and zero maintenance.
✅ Difference between Azure Logic Apps, Azure Functions, and Azure Data Factory
Feature | Azure Logic Apps | Azure Functions | Azure Data Factory
Use Case | Workflow automation (no/low code) | Serverless functions/code execution | Data movement and transformation
Code vs No-Code | Low-code visual designer | Code-first (C#, Python, JS) | Low-code UI with some scripting
Triggers | Event-based, HTTP, schedules | HTTP trigger, timer, queue, blob triggers | Time-based and event-based triggers
Best for | Integrations between systems (e.g., email, Teams, SharePoint) | Light compute logic, custom scripts | ETL/ELT data workflows
Integration | 300+ connectors | Extend workflows with custom logic | Works well with data sources (SQL, ADLS, Blob, etc.)
⚙️Azure Functions
6. What are Azure Functions, and how do they differ from traditional web services?
Azure Functions are serverless – you don't manage infrastructure. They're:
Event-driven: Triggered by HTTP requests, blobs, queues, timers, etc.
Short-lived, stateless, and scalable.
Compared to traditional web APIs, Functions are lighter, cost-effective, and easier to deploy
for specific tasks.
9. How do you secure secrets and keys using Azure Key Vault?
Store secrets (passwords, connection strings), certificates, and encryption keys securely.
Integrate with Azure services to avoid hardcoding secrets in pipelines or notebooks.
Access controlled via Azure RBAC and Access Policies.
10. How do you integrate Azure Key Vault with other services?
Azure Data Factory: Linked Service → “Use Azure Key Vault” to pull secrets.
Azure Functions: Use Managed Identity to access secrets securely.
Azure Databricks: Mount secrets via Databricks-backed secret scope or Azure Key Vault-
backed scope.
11. What are the benefits of Azure Synapse for big data processing?
Unified platform: Combines data warehousing and big data.
Supports T-SQL and Spark workloads.
Serverless and dedicated compute options.
Integration with Power BI, ADF, and ADLS for end-to-end pipelines.
13. Difference between dedicated SQL pool and serverless SQL pool:
Dedicated SQL Pool: Pre-allocated compute for high-performance queries, paid per hour.
Serverless SQL Pool: Query data on-demand (pay-per-query), great for ad hoc analysis or
querying raw files (e.g., CSV, Parquet).
📈 Azure Monitor
21. How do you use Azure Monitor to track performance in Azure services?
Collect telemetry from services like ADF, Synapse, Functions, etc.
Use Log Analytics workspace to write Kusto queries for analysis.
Set up alerts based on metrics (e.g., pipeline failures, high latency).
Integrate with Application Insights for detailed diagnostics.
Gopi Rayavarapu!
www.linkedin.com/in/gopi-rayavarapu-5b560020a