Advanced Interview QA ADF Databricks PowerBI

The document contains advanced interview questions and answers related to Azure Data Factory, Databricks, and Power BI, focusing on scenario-based and technical questions. It covers troubleshooting strategies, delta load designs, integration runtime usage, performance optimization, and data modeling best practices. Each section provides concise answers to help candidates prepare for technical interviews in data engineering and analytics roles.


Advanced Interview Questions and Answers

Azure Data Factory (ADF)


Scenario-Based Questions:

1. Q: You have a pipeline that loads millions of records daily from an on-prem SQL Server to Azure SQL Database. One day, the copy activity fails without any code changes. How would you troubleshoot and ensure minimal downtime?

A: - Check the pipeline run history and error details in ADF Monitor.
- Review the integration runtime status.
- Inspect source and target connectivity.
- Retry the run manually or rely on the activity's retry policy.
- Maintain checkpoints and use data partitioning so a rerun does not reload everything.
- Enable alerts using Azure Monitor.

2. Q: Your client wants to implement a CDC-based delta load from SAP to Azure SQL via ADF, but SAP only provides full extracts. How would you design a delta load strategy with minimal load time?

A: - Use watermark columns or hash-diff logic to identify changed rows (see the sketch below).
- Store the last load value in pipeline variables or a metadata table.
- Filter the SAP extract using this watermark.
- Use snapshot-diff logic if no timestamp column is available.
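Because SAP delivers only full extracts, the diff has to be computed somewhere; one option is a Databricks notebook (or mapping data flow) invoked from the ADF pipeline. The following is a minimal PySpark sketch of the hash-diff idea, assuming hypothetical tables staging.sap_extract (today's full extract) and dw.customer_target (the previously loaded data); the key and attribute columns are illustrative.

```python
# Hash-diff sketch: derive a delta from a full extract (illustrative names throughout).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

key_cols = ["customer_id"]                    # business key in the SAP extract
attr_cols = ["name", "city", "credit_limit"]  # columns that define "changed"

def with_hash(df):
    # Stable row hash over the tracked attributes; any attribute change flips the hash.
    return df.withColumn("row_hash", F.sha2(F.concat_ws("||", *attr_cols), 256))

source = with_hash(spark.table("staging.sap_extract"))
target = with_hash(spark.table("dw.customer_target"))

# Keep rows whose key is new, or whose hash differs from the previously loaded version.
delta = (source.alias("s")
         .join(target.alias("t"), key_cols, "left")
         .where(F.col("t.row_hash").isNull() | (F.col("s.row_hash") != F.col("t.row_hash")))
         .select("s.*"))

delta.write.mode("overwrite").saveAsTable("staging.sap_delta")  # ADF copies only this delta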

Technical-Based Questions:

1. Q: Explain how integration runtime works in ADF. When would you use self-hosted IR over Azure IR?

A: - Azure IR for cloud data movement.
- Self-hosted IR for on-prem/private network resources.
- Use self-hosted IR when accessing on-prem SQL Server.

2. Q: How can you parameterize Linked Services and Datasets for reusability in ADF across multiple environments (dev/test/prod)?

A: - Use global parameters or dynamic content.
- Define parameters for ServerName, DatabaseName, etc.
- Helps in CI/CD deployment and reuse.

3. Q: How does ADF handle retry policies, and what are the best practices for configuring them in mission-critical pipelines?

A: - Retry options: count and interval.
- Enable retries for transient faults.
- Avoid retrying logic-based failures.
- Use timeouts and fail-fast logic.

Databricks
Scenario-Based Questions:

1. Q: You are implementing a CDC pipeline using Delta Lake. The source system provides both insert and delete records. How would you design the pipeline in Databricks to handle this efficiently using DLT or Auto Loader?

A: - Use apply_changes() in DLT with apply_as_deletes (see the sketch below).
- For Auto Loader, use MERGE with DELETE logic.
- Maintain a high watermark using the change timestamp.
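A minimal DLT sketch of the apply_changes() approach, assuming the raw CDC feed lands as JSON files readable by Auto Loader; the path, table names, and the op/event_ts metadata columns are illustrative assumptions, not fixed names.

```python
import dlt
from pyspark.sql import functions as F

# Bronze: ingest the raw CDC feed with Auto Loader (path and format are illustrative).
@dlt.table
def cdc_raw():
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/mnt/raw/customers_cdc/"))

# Silver: apply inserts, updates, and deletes into the target table.
dlt.create_streaming_table("customers")

dlt.apply_changes(
    target="customers",
    source="cdc_raw",
    keys=["customer_id"],                    # business key (assumed)
    sequence_by=F.col("event_ts"),           # ordering column, i.e. the high watermark
    apply_as_deletes=F.expr("op = 'D'"),     # rows flagged as deletes remove the key
    except_column_list=["op", "event_ts"],   # CDC metadata not persisted in the target
)
```

With plain Auto Loader (no DLT), the equivalent is a foreachBatch MERGE that updates or inserts matched keys and uses whenMatchedDelete for rows flagged as deletes.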

2. Q: A data team complains that a notebook job is running slower after new columns were added to the Delta table. How would you investigate and optimize it?

A: - Check for file skew and small files.
- Run OPTIMIZE and ZORDER.
- Consider schema evolution impact.
- Use Photon runtime and cache hot tables.

Technical-Based Questions:

1. Q: Explain the difference between OPTIMIZE, VACUUM, and ZORDER BY in Delta Lake. When and how should each be used?

A: - OPTIMIZE: compacts small files into larger ones.
- ZORDER BY: co-locates data by the specified columns to speed up filtering.
- VACUUM: removes obsolete data files no longer referenced by the table.
- ZORDER is applied as a clause of the OPTIMIZE command (see the sketch below).
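For illustration, all three maintenance commands can be issued from a notebook via spark.sql; the table name sales.orders and the ZORDER column are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Compact small files and co-locate data by a frequently filtered column.
# Note that ZORDER BY is part of the OPTIMIZE statement, not a separate command.
spark.sql("OPTIMIZE sales.orders ZORDER BY (customer_id)")

# Remove data files that are no longer referenced and are older than the
# retention window (the default retention is 7 days, i.e. 168 hours).
spark.sql("VACUUM sales.orders RETAIN 168 HOURS")
```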

2. Q: How would you implement a slowly changing dimension Type 2 (SCD2) logic using PySpark in Databricks?

A: - Join the source with the target on the business keys.
- Detect changes and expire the old records (set end_date / is_current).
- Insert a new version with an updated start_date.
- Use MERGE INTO on the Delta table (see the sketch below).
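A condensed PySpark sketch of one common SCD2 pattern using the Delta Lake MERGE API. The dimension dw.dim_customer, the staging table, the business key customer_id, and the tracked columns are assumptions for illustration.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

updates = spark.table("staging.customer_updates")     # incoming snapshot (illustrative)
dim = DeltaTable.forName(spark, "dw.dim_customer")    # SCD2 dimension (illustrative)

tracked = ["name", "city", "tier"]
changed = " OR ".join(f"t.{c} <> s.{c}" for c in tracked)

# Step 1: expire the current version of any row whose tracked attributes changed.
(dim.alias("t")
 .merge(updates.alias("s"), "t.customer_id = s.customer_id AND t.is_current = true")
 .whenMatchedUpdate(condition=changed,
                    set={"is_current": "false", "end_date": "current_date()"})
 .execute())

# Step 2: append a new current version for new keys and for the rows expired above.
current_keys = (spark.table("dw.dim_customer")
                .where("is_current = true")
                .select("customer_id"))
new_versions = (updates.join(current_keys, "customer_id", "left_anti")
                .withColumn("start_date", F.current_date())
                .withColumn("end_date", F.lit(None).cast("date"))
                .withColumn("is_current", F.lit(True)))
new_versions.write.format("delta").mode("append").saveAsTable("dw.dim_customer")
```

The anti-join in step 2 picks up both brand-new keys and the keys whose current row was just expired, so each change produces exactly one new current version.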

3. Q: What are the pros and cons of using Delta Live Tables (DLT) over traditional notebooks for data pipeline orchestration?

A: - Pros: declarative, auto-lineage, CDC support.
- Cons: less flexible, more resource intensive.
- Best for production-grade pipelines.

Power BI
Scenario-Based Questions:

1. Q: Your report is slow when filtering on slicers and visuals take 10+ seconds to render. How would you go about identifying and resolving performance bottlenecks?

A: - Use Performance Analyzer.
- Optimize DAX and reduce model size.
- Use summary/aggregation tables.
- Disable auto-date/time.

2. Q: A user requests row-level security (RLS) based on department and region. Departments may span multiple regions. How do you implement this dynamic RLS in Power BI?

A: - Create a user-department-region mapping table.
- Apply USERPRINCIPALNAME() in the DAX filter.
- Set up roles and test with 'View as Role'.

Technical-Based Questions:

1. Q: Explain the differences between Import, DirectQuery, and Composite models. When should each be used and why?

A: - Import: fastest performance, full DAX support.
- DirectQuery: near real-time data, but limited features.
- Composite: a mix of both.
- Use Composite when large DirectQuery fact tables are combined with imported aggregations for fast KPIs.

2. Q: How do you handle circular dependency errors in complex DAX measures or calculated columns?

A: - Use variables.
- Break logic into steps.
- Avoid calculated columns depending on measures.

3. Q: What are the best practices for designing a Power BI data model for large-scale datasets (e.g., over 1 billion rows)?

A: - Use star schema.
- Apply aggregations.
- Use surrogate keys.
- Apply incremental refresh.
