
Frankline Florence

INFOSYS
DATA
ENGINEERING
Q&A

2025
www.linkedin.com/in/frankline-florence
INFOSYS DATA ENGINEERING INTERVIEW
QUESTIONS AND ANSWERS - 2025

⚙️ Azure Data Factory (ADF)


1.​ Explain different types of Integration Runtimes in ADF.
2.​ How do you handle incremental load in ADF pipelines?
3.​ What happens if an Until activity fails?

💾 Azure Databricks / PySpark

4.​ Difference between schema enforcement and schema evolution in Delta tables.
5.​ How to optimize a Spark job running for 2 hours down to 30 minutes?
6.​ Difference between groupBy vs groupByKey and their use cases.
7.​ Given a dataset with nested JSON structures, how would you flatten it into a
tabular format using PySpark?
8.​ How do you handle missing or null values in a DataFrame? What strategies would
you use in different scenarios?

🏁 SQL / Data Modeling


9.​ Find the maximum salary per department and employee name.
10. Difference between Clustered vs Non-Clustered Index.
11. Write a query to get 2nd highest salary without using TOP/Limit.

🎯 General Data Engineering:


12. Managed vs External tables in Databricks.
13. What is Lakehouse architecture and how does it differ from Data Lake?
14. How do you implement data quality checks during ingestion?
ANSWERS

⚙️ AZURE DATA FACTORY (ADF)


🟧 Explain different types of Integration Runtimes in ADF
Integration Runtime (IR) is the compute infrastructure ADF uses to move/transform
data and to execute SSIS packages.

1.​ Azure Integration Runtime (Azure IR)

Use cases: Copy between cloud data stores; Mapping Data Flows; Data Flow Debug;
lookups; metadata.

Where it runs: Microsoft-managed compute in Azure (region you choose or “Auto-resolve”).
Networking: Can run inside Managed Virtual Network with Private Endpoints to reach
private resources without self-hosting.

Scaling: Elastic; Mapping Data Flows let you choose core sizes and parallelism;
autoscaling during execution.

Security: Managed Identity (MSI) for auth to Azure services; keyless connections via linked services/Key Vault.

2.​ Self-Hosted Integration Runtime (SHIR)

Use cases: Access on-premises/private-network data (SQL Server, file shares), VMs without public endpoints, cross-cloud private networks.

Where it runs: Your Windows machines/VMs; you can create a cluster (multiple nodes)
for HA and scale-out; can be shared across multiple factories/Synapse.

Networking: Your outbound IPs; respects local firewall/proxy; supports proxy config.

Capabilities: Copy (source/sink), lookups, metadata, stored procedures, etc. (not Mapping Data Flows). For transformations, you’d typically stage to a compute (e.g., Databricks/Synapse) or use SSIS IR.
3.​ Azure-SSIS Integration Runtime (SSIS IR)

Use cases: Lift-and-shift SSIS packages to Azure with minimal changes.

Where it runs: Managed Azure VMs that host SSIS; you choose node size, count, and
edition (Standard/Enterprise).

Features: Package Store (SSISDB), scale-out execution, custom components via custom
setup, join to a VNet (including Managed VNet).

Triggering: Orchestrated by ADF pipelines/Triggers, just like any other activity.​

🖐🏼 Quick rule of thumb:

●​ Cloud-to-cloud & data flows → Azure IR​

●​ Anything behind a firewall → Self-Hosted IR​

●​ SSIS packages → Azure-SSIS IR


🟧 How do you handle incremental load in ADF pipelines?
There isn’t a single “switch”—you implement a pattern. The most common ones:

1) Watermark (High-Watermark) Pattern — Relational Sources

● Idea: Track the maximum value of a monotonically increasing column (e.g., LastModifiedDate, UpdateTS, or a numeric surrogate key) and load only rows that changed since the last run.

● Where to store state: Control table (e.g., ETL_Watermark) in Azure SQL or a file (e.g., in ADLS).

●​ Flow:​

1.​ Look up the current watermark from the control table.​

2.​ Set Variable prevWatermark.​

3.​ Get a new watermark (e.g., SELECT MAX(LastModifiedDate) FROM src).​


4.​ Copy Activity with a query filter:

SELECT * ​
FROM dbo.Orders ​
WHERE LastModifiedDate > @prevWatermark ​
AND LastModifiedDate <= @newWatermark;

5.​ Upsert to target (via stored proc or MERGE in sink).​

6.​ Update control table with newWatermark.​

Notes: Always bound the upper window (<= @newWatermark) to be restart-safe.
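A minimal sketch of the control table used in steps 1 and 6 above (table and column names are illustrative):

-- One row per source table, holding the last successfully loaded watermark
CREATE TABLE dbo.ETL_Watermark (
    table_name      VARCHAR(128) NOT NULL PRIMARY KEY,
    watermark_value DATETIME2    NOT NULL
);

-- Step 6: after the copy and upsert succeed, move the bookmark forward
UPDATE dbo.ETL_Watermark
SET    watermark_value = @newWatermark
WHERE  table_name = 'dbo.Orders';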

2) Change Tracking / CDC (SQL Server/Azure SQL)

●​ Idea: Let the source system tell you what changed.​

● Flow: Enable Change Tracking or CDC on source tables. Use a Lookup to fetch the last LSN/version, copy only the deltas via the CDC/CT functions (sketched below), then update the LSN/version bookmark in your control table.

●​ Pros: Robust for updates/deletes; minimal source impact compared with full
scans.
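A hedged sketch of the Change Tracking variant (assumes Change Tracking is enabled on dbo.Orders, that OrderID is its key, and that @last_sync_version comes from your control table):

-- Rows changed since the stored version; LEFT JOIN keeps deleted keys
SELECT ct.OrderID, ct.SYS_CHANGE_OPERATION, o.*
FROM   CHANGETABLE(CHANGES dbo.Orders, @last_sync_version) AS ct
LEFT JOIN dbo.Orders AS o ON o.OrderID = ct.OrderID;

-- New bookmark to persist in the control table after the copy succeeds
SELECT CHANGE_TRACKING_CURRENT_VERSION();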

3) Files by Last Modified Time (Data Lake/Blob/S3)

●​ Idea: Only ingest new or changed files.​

●​ Flow:​

○​ Get Metadata (child items + lastModified).​

○​ Filter activity against stored watermark timestamp.​

○​ ForEach over filtered files → Copy.​

○​ Update watermark to max lastModified processed.​

●​ Alternatives: Event-based triggers (Blob storage events) ingest files as they land.​
4) Delta Lake/Databricks Autoloader (if Databricks is in the mix)

●​ Idea: Let Databricks track files via a checkpoint and use MERGE on a Delta table.​

●​ ADF orchestrates the notebook/job; the incremental logic lives in the notebook.
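A minimal sketch of what that notebook logic can look like (paths, the target table location, and the order_id key are illustrative):

from delta.tables import DeltaTable

# Auto Loader tracks already-ingested files via the checkpoint
incoming = (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/mnt/raw/orders/"))

def upsert_batch(batch_df, batch_id):
    # MERGE each micro-batch into the target Delta table
    target = DeltaTable.forPath(spark, "/mnt/delta/orders")
    (target.alias("t")
     .merge(batch_df.alias("s"), "t.order_id = s.order_id")
     .whenMatchedUpdateAll()
     .whenNotMatchedInsertAll()
     .execute())

(incoming.writeStream
 .foreachBatch(upsert_batch)
 .option("checkpointLocation", "/mnt/checkpoints/orders")
 .trigger(availableNow=True)
 .start())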

5) Tumbling Window Triggers

● Idea: Time-sliced orchestrations with automatic “window start/end” parameters to bound your query (useful for append-only event data); see the example below.
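A hedged example of wiring those window bounds into a source query (ADF expression syntax; the parameter names windowStart/windowEnd are illustrative and would be mapped from @trigger().outputs.windowStartTime / windowEndTime):

SELECT *
FROM   dbo.Events
WHERE  EventTime >  '@{pipeline().parameters.windowStart}'
  AND  EventTime <= '@{pipeline().parameters.windowEnd}'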

Common pitfalls & tips

●​ Never rely on ADF pipeline variables as persistent state—store them externally.​

●​ Use idempotent sinks (MERGE/upsert) so re-runs are safe.​

●​ Guard for late-arriving data by overlapping windows (e.g., process last 5 minutes
again but de-duplicate in sink).
🟧 What happens if an Until activity fails?
Until repeats its inner activities until a boolean expression evaluates to true, or until it times out.

●​ Inner activity failure: If any inner activity fails and is not set to
continueOnFailure = true, the Until activity fails immediately and the loop
stops.​

● Retries: Inner activities respect their own retry policy (retry count/interval). If retries exhaust and an inner activity still fails, Until fails.

●​ Timeout: The Until itself has a timeout. If the condition never becomes true
before the timeout expires, the Until fails with a timeout error.​

●​ Zero iterations: If the condition is already true at entry, the loop doesn’t run and
succeeds immediately.​

●​ Best practice to “break”: Update a variable/state inside the loop and make the
expression depend on it; use continueOnFailure selectively if a failure is
acceptable for that iteration.
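A hedged sketch of that break pattern (the variable name loopDone is illustrative): initialize a Boolean pipeline variable to false, flip it with a Set Variable activity inside the loop when the exit condition is met, and use it as the Until expression:

@equals(variables('loopDone'), true)

The loop exits as soon as the expression evaluates to true; the timeout still acts as a safety net.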
💾 AZURE DATABRICKS / PYSPARK
🟧 Difference between schema enforcement and schema evolution in Delta tables.
●​ Schema Enforcement (“write compliance”)​

○​ Delta checks that incoming data matches the table schema (column
names, types, nullability).​

○​ Mismatches (e.g., extra columns, wrong types) → write fails to protect
data quality.​

○​ Example (fails):

# df has an extra column -> write fails with a schema mismatch error
df.write.format("delta").mode("append").save("/mnt/delta/sales")

●​ Schema Evolution (“allow controlled changes”)​



○​ Allows adding new columns (and some compatible type changes) during
writes/merge.

○​ Requires enabling:
FR

spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled",
"true")​

# Append with evolution​
df.write.format("delta").mode("append").option("mergeSchema",
"true").save("/mnt/delta/sales")​

# Overwrite with new schema​
df.write.format("delta").mode("overwrite").option("overwriteSchema",
"true").save("/mnt/delta/sales")
○​ With MERGE:

from delta.tables import DeltaTable

spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

# The merge target must be a DeltaTable handle, not a DataFrame (path is illustrative)
bronze = DeltaTable.forPath(spark, "/mnt/delta/bronze")
(
    bronze.alias("b").merge(silver.alias("s"), "b.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)

In short: Enforcement prevents bad writes; Evolution permits controlled schema growth
(usually adding columns) when explicitly enabled.

🟧 How to optimize a Spark job running for 2 hours down to 30 minutes?


A pragmatic, stepwise plan:

1) Profile first (don’t guess)



●​ Use Spark UI: identify long stages, skewed shuffles, excessive tasks, spill, GC
time, and I/O hotspots.​

●​ Capture input sizes, # of files, avg file size, shuffle read/write, cache usage.

2) Reduce I/O and shuffles



●​ Read less: Select only needed columns; push predicates down (Parquet/Delta do
this automatically).​

●​ Avoid small files: Compact to target file sizes (128–512 MB).

(df.repartition(200)                 # or tune after AQE
   .write.format("delta")
   .option("dataChange", "false")
   .mode("overwrite")
   .save(path))
●​ Coalesce after wide ops if you created too many partitions:

df = df.coalesce( max(1, df.rdd.getNumPartitions() // 2) )

3) Fix joins

●​ Prefer broadcast hash join when one side is small-ish (tens/hundreds of MB):

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100 * 1024 * 1024)  # 100 MB
fact.join(F.broadcast(dim), "key", "left")

●​ If data is skewed, enable AQE + skew join and/or salt keys:


spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

Key salting (example):


from pyspark.sql import functions as F

k = 10  # salt buckets
big = big.withColumn("salt", F.floor(F.rand() * k))
small = small.withColumn("salt", F.explode(F.expr(f"sequence(0, {k-1})")))
joined = big.join(small, ["key", "salt"])
4) Cache/persist only when reused

●​ Cache only DataFrames reused multiple times, and unpersist promptly:

cached = heavy_df.persist()   # or .persist(StorageLevel.MEMORY_AND_DISK)
# use cached multiple times...
cached.unpersist()

5) Partitioning and Z-ordering (Delta)

●​ Partition large Delta tables by low-cardinality columns used in filters (e.g., date,
country).​

● Use OPTIMIZE + Z-ORDER to improve data skipping on high-cardinality filter columns:

OPTIMIZE delta.`/mnt/delta/sales` ZORDER BY (customer_id, event_ts);



6) Right-size the cluster

● Use Photon (Databricks Runtime with Photon) for SQL/DataFrame workloads—often 2–3× speedups.

● Ensure sufficient parallelism: spark.sql.shuffle.partitions (AQE can reduce this automatically).

● Prefer fewer, more powerful nodes to avoid network bottlenecks; enable autoscaling for bursty stages.

7) Clean code patterns

● Minimize UDFs; prefer built-in functions. If unavoidable, use Pandas UDFs for vectorization (see the sketch after this list).
●​ Combine transformations to reduce passes over data; avoid repeated
.count()/.collect() in the hot path.​

●​ Materialize critical joins/aggregations once and reuse.
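A minimal Pandas UDF sketch for the UDF point above (the temp_f column and the conversion are illustrative; a built-in expression would still be preferable when one exists):

from pyspark.sql.functions import pandas_udf
import pandas as pd

# Vectorized: each call receives a whole pandas Series batch instead of one row at a time
@pandas_udf("double")
def f_to_c(temp_f: pd.Series) -> pd.Series:
    return (temp_f - 32.0) * 5.0 / 9.0

df = df.withColumn("temp_c", f_to_c("temp_f"))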

8) Practical target

●​ After applying AQE, right-sizing partitions, fixing joins, compacting files, and
enabling Photon, two-hour jobs commonly land < 30–40 mins.

🟧 Difference between groupBy vs groupByKey and their use cases.

●​ In DataFrame API (PySpark) you typically use groupBy:

df.groupBy("key").agg(F.sum("value").alias("sum_v"))
○ Exec model: Pushes aggregation to the engine using combiners; efficient; uses partial aggregates before shuffle (map-side combine).

●​ groupByKey exists for key-value RDDs (and typed Datasets in Scala). It groups
all values for a key before you apply a function:

rdd = sc.parallelize([("a",1),("a",2),("b",3)])​
rdd.groupByKey().mapValues(lambda xs: sum(xs)).collect()

○ Downside: Moves all values for each key across the network (heavy shuffle), then aggregates—often slower and memory-intensive.

○ Prefer reduceByKey / aggregateByKey on RDDs because they combine before the shuffle:

rdd.reduceByKey(lambda a,b: a+b).collect()


Guidance:

● Use groupBy (DataFrame) for most analytics (Catalyst optimizer, Tungsten, AQE).

●​ Avoid groupByKey unless you truly need all values per key (e.g., complex
non-associative operations) and cannot express it with aggregations.

🟧 Given a dataset with nested JSON structures, how would you flatten it into a
tabular format using PySpark?

Example JSON

{
    "order_id": "O1",
    "customer": {"id": "C1", "name": "Ana"},
    "items": [
        {"sku": "S1", "qty": 2, "price": 10.0, "attrs": {"color": "red"}},
        {"sku": "S2", "qty": 1, "price": 25.0, "attrs": {"color": "blue"}}
    ],
    "shipping": {"address": {"city": "NYC", "zip": "10001"}}
}

Steps:

1.​ Read JSON (multi-line if needed) and infer or provide schema.​

2.​ Flatten structs by selecting col("struct.*").​



3.​ Explode arrays (explode_outer) to get one row per array element.​

4.​ Handle nested structs/maps recursively.

Code:

from pyspark.sql import functions as F, types as T

# 1) Read JSON (multiLine=True if each file holds a multi-line document)
df = spark.read.json("/mnt/data/orders/*.json", multiLine=True)

# 2) Utility to flatten struct columns (and keep simple ones)
def flatten_structs(_df):
    flat_cols = []
    nested_cols = []
    for field in _df.schema.fields:
        name = field.name
        dtype = field.dataType
        if isinstance(dtype, T.StructType):
            for sub in dtype.fields:
                flat_cols.append(
                    F.col(f"`{name}`.`{sub.name}`").alias(f"{name}__{sub.name}")
                )
        else:
            flat_cols.append(F.col(f"`{name}`"))
        if isinstance(dtype, T.ArrayType) and isinstance(dtype.elementType, T.StructType):
            nested_cols.append(name)
    return _df.select(flat_cols), nested_cols

# 3) Flatten top-level structs
df1, array_struct_cols = flatten_structs(df)

# 4) Explode arrays of structs (repeat if multiple)
for arr in array_struct_cols:
    df1 = df1.withColumn(arr, F.explode_outer(F.col(arr)))

# 5) After explode, flatten again because array elements are structs
df2, _ = flatten_structs(df1)

# 6) (Optional) Flatten deeper nested structs
def flatten_all(df_in):
    # Recursively flatten until no StructType remains
    while any(isinstance(f.dataType, T.StructType) for f in df_in.schema.fields):
        cols = []
        expanded = False
        for f in df_in.schema.fields:
            if isinstance(f.dataType, T.StructType):
                expanded = True
                for sub in f.dataType.fields:
                    cols.append(
                        F.col(f"`{f.name}`.`{sub.name}`").alias(f"{f.name}__{sub.name}")
                    )
            else:
                cols.append(F.col(f"`{f.name}`"))
        df_in = df_in.select(cols)
        if not expanded:
            break
    return df_in

flat = flatten_all(df2)

flat.show(truncate=False)
Notes:

●​ Use explode_outer to retain orders even when items is null/empty.​

●​ For maps, you can map_keys/map_values or transform to key,value rows:



flat = flat.select("*", F.explode("attrs"))  # attrs is a MAP -> one row per key/value pair

●​ For performance on very wide schemas, consider generating the select list from
the schema programmatically (as above) rather than manual selection.

🟧 How do you handle missing or null values in a DataFrame? What strategies would
you use in different scenarios?

1) Drop vs Fill

-​ Drop when the row/column isn’t useful:

# Drop rows where any of these columns are null​


df_clean = df.na.drop(subset=["id","event_ts"], how="any")​

# Drop rows with fewer than 3 non-null values​
df_thresh = df.na.drop(thresh=3)

-​ Fill with sensible defaults:

from pyspark.sql import functions as F

df_filled = (df
    .fillna({"age": 0, "country": "UNKNOWN"})  # per-column defaults
    .fillna(0)                                  # remaining numeric columns
)

# Use COALESCE in expressions
df = df.withColumn("price", F.coalesce("price", F.lit(0.0)))
2) Statistically informed imputation

-​ Numeric with mean/median/quantiles:



from pyspark.ml.feature import Imputer​


imputer = Imputer(strategy="median", inputCols=["age","income"],
outputCols=["age","income"])​
df_imp = imputer.fit(df).transform(df)

-​ Categorical with mode:

mode_val = (df.groupBy("segment").count()​
.orderBy(F.desc("count")).first()["segment"])​
df_mode = df.fillna({"segment": mode_val})

3) Group-wise imputation (e.g., by country)

stats = df.groupBy("country").agg(F.avg("income").alias("avg_income"))
df_join = (df.join(stats, "country", "left")
           .withColumn("income", F.coalesce("income", "avg_income"))
           .drop("avg_income"))

4) Time-series forward/backward fill

from pyspark.sql import Window, functions as F

w = Window.partitionBy("id").orderBy("ts").rowsBetween(Window.unboundedPreceding, 0)
df_ffill = df.withColumn("value_ffill", F.last("value", ignorenulls=True).over(w))

w_rev = Window.partitionBy("id").orderBy(F.col("ts").desc()).rowsBetween(Window.unboundedPreceding, 0)
df_bfill = df_ffill.withColumn("value_bfill", F.last("value", ignorenulls=True).over(w_rev))

5) Special values: NaN vs null

from pyspark.sql import functions as F

# Replace NaN with null in float/double columns, then use the standard fills
df = df.select([
    F.when(F.isnan(c), None).otherwise(F.col(c)).alias(c)
    if t in ("double", "float") else F.col(c)
    for c, t in df.dtypes
])

6) Nested structures

-​ Use withField (Spark 3.1+) or rebuild the struct:

from pyspark.sql import functions as F​


df = df.withColumn(​
"address",​
F.when(F.col("address").isNull(),
F.struct(F.lit("").alias("city"), F.lit("").alias("zip")))​
.otherwise(F.col("address"))​
)

7) When to choose what

● Training data: Use statistically sound imputation; preserve variance (median/quantile).

● Reporting/BI: Use defaults (0/“Unknown”) but flag imputed fields (see the sketch after this list).

●​ Data quality pipelines: Reject or quarantine rows when mandatory fields are
missing.
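A minimal sketch of the flagging idea from the Reporting/BI bullet (column names are illustrative):

from pyspark.sql import functions as F

# Keep a flag so BI users can tell real values from imputed defaults
df = (df
      .withColumn("country_imputed", F.col("country").isNull())
      .fillna({"country": "UNKNOWN"}))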
🏁 SQL AND DATA MODELING
🟧 Find the maximum salary per department and employee name
We want each department’s max salary, along with the employee(s) who earn it. Two
common approaches:

1) Window Functions (preferred in analytics databases)

SELECT department_id,
       employee_name,
       salary
FROM (
    SELECT department_id,
           employee_name,
           salary,
           RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS rnk
    FROM Employees
) t
WHERE rnk = 1;

●​ Why: RANK() handles ties, so if two employees share the max salary, both are
returned.

2) Join with MAX per department

SELECT e.department_id,
       e.employee_name,
       e.salary
FROM Employees e
JOIN (
    SELECT department_id, MAX(salary) AS max_salary
    FROM Employees
    GROUP BY department_id
) m
  ON e.department_id = m.department_id
 AND e.salary = m.max_salary;

● Why: Classic “join-on-aggregate” method; works across all SQL engines.


🟧 Explain the difference between Clustered vs Non-Clustered Index
Clustered Index

●​ Defines the physical order of rows in the table.​

●​ Each table can have only one clustered index (since data can only be sorted one
way).​

●​ Often built on the primary key.​

●​ Accessing data via clustered index = direct (data is the index).​

● Example: In SQL Server and MySQL InnoDB, the primary key is clustered by default.

●​ Analogy: Phonebook sorted by last name.​

Non-Clustered Index
●​ Stored separately from table data.​

●​ Contains index keys + pointers (row locators) to actual data.​

●​ A table can have multiple non-clustered indexes.​



●​ Best for queries filtering by non-PK columns.​

●​ Example: Index on email column when primary key is employee_id.​

●​ Analogy: Index at the back of a book pointing to page numbers.
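A short T-SQL illustration of the two index types (table and index names are illustrative):

-- The primary key creates the clustered index by default (SQL Server)
CREATE TABLE Employees (
    employee_id   INT PRIMARY KEY,      -- clustered: defines physical row order
    employee_name VARCHAR(100),
    email         VARCHAR(255),
    department_id INT
);

-- A separate non-clustered index for lookups on a non-key column
CREATE NONCLUSTERED INDEX IX_Employees_Email
    ON Employees (email);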



🟧 Write a query to get 2nd highest salary without using TOP/Limit


Several valid patterns:

1) Window Function

SELECT DISTINCT salary
FROM (
    SELECT salary,
           DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
    FROM Employees
) t
WHERE rnk = 2;

2) Subquery with MAX

SELECT MAX(salary) AS second_highest
FROM Employees
WHERE salary < (
    SELECT MAX(salary) FROM Employees
);

●​ Simpler if you only need the value (not employee details).​
3) Self-Join

SELECT e1.salary
FROM Employees e1
WHERE 2 = (
    SELECT COUNT(DISTINCT e2.salary)
    FROM Employees e2
    WHERE e2.salary >= e1.salary
);

●​ Works in all ANSI SQL engines; scales less well for big data.
🎯 GENERAL DATA ENGINEERING
🟧 Managed vs External tables in Databricks
Managed Table (a.k.a. Internal table)

● Created in the Hive Metastore/Unity Catalog without a location → data is stored under Databricks’ default warehouse location (dbfs:/user/hive/warehouse/...).

●​ Dropping the table → drops both metadata & underlying data.​

●​ Example:
CREATE TABLE sales (id INT, amount DOUBLE);​
-- Data physically lives in the default managed path

External Table

●​ Created with an explicit LOCATION pointing to external storage (ADLS, S3, Blob,
etc.).​

●​ Dropping the table → removes only metadata, not the data files.​

●​ Example:

CREATE TABLE sales_ext (id INT, amount DOUBLE)
USING DELTA
LOCATION 'abfss://<container>@<storage_account>.dfs.core.windows.net/sales/';
🟧 What is Lakehouse architecture and how does it differ from Data Lake?
Traditional Data Lake

●​ Stores raw files (Parquet, ORC, CSV, JSON).​

●​ Flexible schema-on-read.​

●​ Great for cheap storage, ML feature feeds.​

●​ Weaknesses:​

○​ No ACID transactions → risk of partial writes.​

○​ Harder data governance (no unified catalog).​

○​ Performance issues (small files, lack of indexes, no optimizer).

Data Warehouse
●​ Structured, ACID-compliant, governed.​

●​ Schema-on-write with BI-friendly optimizations (indexes, caching).​

●​ Weakness: Expensive and rigid; not ideal for semi-structured/unstructured data.



Lakehouse (e.g., Delta Lake on Databricks)

●​ Unifies the strengths:​

○​ Open format (Parquet-based).​

○​ ACID transactions and schema enforcement (like a warehouse).​



○​ Support for ML + BI on the same data.​

○​ Unified catalog & governance.​

○​ Performance enhancements (Z-ordering, caching, vectorized execution).​

●​ Key idea: One copy of data serving batch, streaming, ML, and BI.
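A small illustration of the “one copy” idea (the path is illustrative): the same Delta table can feed batch BI queries and incremental streaming jobs.

path = "/mnt/delta/sales"

batch_df  = spark.read.format("delta").load(path)         # BI / ad-hoc analytics
stream_df = spark.readStream.format("delta").load(path)   # incremental pipelines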
🟧 How do you implement data quality checks during ingestion?
Data quality checks catch bad data before it pollutes downstream tables. Common
strategies:

1) Schema Validation

●​ Reject rows with unexpected schema (wrong columns, types).​

●​ In PySpark:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])

df = spark.read.schema(schema).json("/mnt/raw/customers/*.json")

●​ Enforces id as non-null int.

2) Null / Mandatory Checks

invalid = df.filter("id IS NULL OR name IS NULL")

if invalid.count() > 0:
    invalid.write.format("delta").mode("append").save("/mnt/quarantine/customers")
    df = df.exceptAll(invalid)  # keep only valid records

3) Referential Integrity

●​ Validate FK relationships (e.g., orders.customer_id exists in customers).


valid_orders = orders.join(customers, "customer_id", "inner")​
invalid_orders = orders.join(customers, "customer_id", "left_anti")

4) Business Rule Checks

●​ Example: age >= 0, amount > 0, date <= today.

df = df.filter("amount > 0 AND age >= 0 AND order_date <= current_date()")

5) Duplicate Detection

from pyspark.sql import Window, functions as F

window = Window.partitionBy("id").orderBy(F.col("last_updated").desc())
deduped = (df.withColumn("rn", F.row_number().over(window))
           .filter("rn = 1")
           .drop("rn"))

6) Monitoring / Metrics

●​ Generate data quality KPIs: null % per column, row counts, duplicates,
distribution checks.​

●​ Store in a DQ dashboard (Power BI/Datadog/ADF logs).​

●​ Example metric output:



dq_stats = df.agg(​
F.count("*").alias("row_count"),​
F.sum(F.col("id").isNull().cast("int")).alias("null_ids")​
)​
dq_stats.write.mode("append").save("/mnt/metrics/dq_checks")
7) Tools & Frameworks

●​ Great Expectations (popular for rule-based validations).​

●​ Deequ (Amazon’s constraint-based library, runs on Spark).​

●​ Delta Live Tables (Databricks) → allows expect clauses:

CREATE OR REFRESH STREAMING LIVE TABLE clean_orders (
  CONSTRAINT valid_amount   EXPECT (amount > 0)           ON VIOLATION DROP ROW,
  CONSTRAINT valid_order_id EXPECT (order_id IS NOT NULL) ON VIOLATION FAIL UPDATE
)
AS SELECT *
FROM STREAM(live.raw_orders);

