Frankline Florence
INFOSYS
DATA
ENGINEERING
Q&A
2025
www.linkedin.com/in/frankline-florence
INFOSYS DATA ENGINEERING INTERVIEW
QUESTIONS AND ANSWERS - 2025
⚙️ Azure Data Factory (ADF)
1. Explain different types of Integration Runtimes in ADF.
2. How do you handle incremental load in ADF pipelines?
3. What happens if an Until activity fails?
💾 Azure Databricks / PySpark
4. Difference between schema enforcement and schema evolution in Delta tables.
5. How to optimize a Spark job running for 2 hours down to 30 minutes?
6. Difference between groupBy vs groupByKey and their use cases.
7. Given a dataset with nested JSON structures, how would you flatten it into a
tabular format using PySpark?
8. How do you handle missing or null values in a DataFrame? What strategies would
you use in different scenarios?
🏁 SQL / Data Modeling
9. Find the maximum salary per department and employee name.
10. Difference between Clustered vs Non-Clustered Index.
11. Write a query to get 2nd highest salary without using TOP/Limit.
🎯 General Data Engineering:
12. Managed vs External tables in Databricks.
13. What is Lakehouse architecture and how does it differ from Data Lake?
14. How do you implement data quality checks during ingestion?
ANSWERS
⚙️ AZURE DATA FACTORY (ADF)
🟧 Explain different types of Integration Runtimes in ADF
Integration Runtime (IR) is the compute infrastructure ADF uses to move/transform
data and to execute SSIS packages.
1. Azure Integration Runtime (Azure IR)
Use cases: Copy between cloud data stores; Mapping Data Flows; Data Flow Debug;
lookups; metadata.
Where it runs: Microsoft-managed compute in Azure (region you choose or
“Auto-resolve”).
Networking: Can run inside Managed Virtual Network with Private Endpoints to reach
private resources without self-hosting.
Scaling: Elastic; Mapping Data Flows let you choose core sizes and parallelism;
autoscaling during execution.
Security: Managed Identity (MSI) for authentication to Azure services; keyless connections via
linked services/Key Vault.
2. Self-Hosted Integration Runtime (SHIR)
Use cases: Access on-premises/private-network data (SQL Server, file shares), VMs
without public endpoints, cross-cloud private networks.
Where it runs: Your Windows machines/VMs; you can create a cluster (multiple nodes)
for HA and scale-out; can be shared across multiple factories/Synapse.
Networking: Your outbound IPs; respects local firewall/proxy; supports proxy config.
Capabilities: Copy (source/sink), lookups, metadata, stored procedures, etc. (not
Mapping Data Flows). For transformations, you'd typically hand the work to a compute
engine (e.g., Databricks/Synapse) or use the SSIS IR.
3. Azure-SSIS Integration Runtime (SSIS IR)
Use cases: Lift-and-shift SSIS packages to Azure with minimal changes.
Where it runs: Managed Azure VMs that host SSIS; you choose node size, count, and
edition (Standard/Enterprise).
Features: Package Store (SSISDB), scale-out execution, custom components via custom
setup, join to a VNet (including Managed VNet).
Triggering: Orchestrated by ADF pipelines/Triggers, just like any other activity.
🖐🏼 Quick rule of thumb:
● Cloud-to-cloud & data flows → Azure IR
● Anything behind a firewall → Self-Hosted IR
● SSIS packages → Azure-SSIS IR
🟧 How do you handle incremental load in ADF pipelines?
There isn’t a single “switch”—you implement a pattern. The most common ones:
1) Watermark (High-Watermark) Pattern — Relational Sources
● Idea: Track the maximum value of a monotonically increasing column (e.g.,
LastModifiedDate, UpdateTS, or a numeric surrogate key) and load only rows changed since the last run.
● Where to store state: Control table (e.g., ETL_Watermark) in Azure SQL or a file
(e.g., in ADLS).
● Flow:
1. Look up the current watermark from the control table.
2. Set Variable prevWatermark.
3. Get a new watermark (e.g., SELECT MAX(LastModifiedDate) FROM src).
4. Copy Activity with a query filter:
SELECT *
FROM dbo.Orders
WHERE LastModifiedDate > @prevWatermark
AND LastModifiedDate <= @newWatermark;
5. Upsert to target (via stored proc or MERGE in sink).
6. Update control table with newWatermark.
Notes: Always bound the upper window (<= @newWatermark) to be restart-safe.
2) Change Tracking / CDC (SQL Server/Azure SQL)
● Idea: Let the source system tell you what changed.
● Flow: Enable Change Tracking or CDC on source tables. Use a Lookup to fetch
last LSN/Version, Copy only deltas via CDC/CT functions, then Update the
LSN/Version bookmark in your control table.
● Pros: Robust for updates/deletes; minimal source impact compared with full
scans.
3) Files by Last Modified Time (Data Lake/Blob/S3)
● Idea: Only ingest new or changed files.
● Flow:
○ Get Metadata (child items + lastModified).
○ Filter activity against stored watermark timestamp.
○ ForEach over filtered files → Copy.
○ Update watermark to max lastModified processed.
● Alternatives: Event-based triggers (Blob storage events) ingest files as they land.
4) Delta Lake/Databricks Autoloader (if Databricks is in the mix)
● Idea: Let Databricks track files via a checkpoint and use MERGE on a Delta table.
● ADF orchestrates the notebook/job; the incremental logic lives in the notebook (a
minimal Auto Loader sketch follows below).
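For illustration only, a minimal Auto Loader sketch for this pattern; the source path, checkpoint/schema locations, target table name, and trigger mode are assumed placeholders, not part of the original answer:

from pyspark.sql import functions as F

# Auto Loader tracks already-ingested files via the checkpoint, so each run picks up only new files
raw = (spark.readStream
       .format("cloudFiles")
       .option("cloudFiles.format", "json")
       .option("cloudFiles.schemaLocation", "/mnt/checkpoints/orders/_schema")
       .load("/mnt/raw/orders/"))

(raw.withColumn("ingest_ts", F.current_timestamp())
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/orders")
    .trigger(availableNow=True)   # batch-style run, suitable for ADF-triggered jobs
    .toTable("bronze.orders"))

A downstream MERGE into a Silver Delta table can then run in the same notebook (e.g., via foreachBatch), as the answer notes.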
5) Tumbling Window Triggers
● Idea: Time-sliced orchestrations with automatic “window start/end” parameters
to bound your query (useful for append-only event data).
Common pitfalls & tips
● Never rely on ADF pipeline variables as persistent state—store them externally.
● Use idempotent sinks (MERGE/upsert) so re-runs are safe (see the sketch after this list).
● Guard for late-arriving data by overlapping windows (e.g., process last 5 minutes
again but de-duplicate in sink).
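To make the "idempotent sink" and "overlap + de-duplicate" tips concrete, a hedged PySpark sketch; the table name (silver.orders), key (order_id), timestamp column (last_modified), and incremental_df are assumed for illustration:

from delta.tables import DeltaTable
from pyspark.sql import functions as F, Window

# De-duplicate the (possibly overlapping) incremental batch: keep the latest row per key
w = Window.partitionBy("order_id").orderBy(F.col("last_modified").desc())
batch = (incremental_df.withColumn("rn", F.row_number().over(w))
         .filter("rn = 1").drop("rn"))

# Idempotent upsert: re-running the same window leaves the target in the same state
target = DeltaTable.forName(spark, "silver.orders")
(target.alias("t")
       .merge(batch.alias("s"), "t.order_id = s.order_id")
       .whenMatchedUpdateAll()
       .whenNotMatchedInsertAll()
       .execute())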
🟧 What happens if an Until activity fails?
Until repeats its inner activities until a boolean expression evaluates to true, or until it
times out.
● Inner activity failure: If any inner activity fails and is not set to
continueOnFailure = true, the Until activity fails immediately and the loop
stops.
● Retries: Inner activities respect their own retry policy (retry count/interval). If
retries exhaust and an inner activity still fails, Until fails.
● Timeout: The Until itself has a timeout. If the condition never becomes true
before the timeout expires, the Until fails with a timeout error.
● First iteration: The expression is evaluated after each iteration (do-until semantics), so
the inner activities always run at least once, even if the condition is already true at entry.
● Best practice to “break”: Update a variable/state inside the loop and make the
expression depend on it; use continueOnFailure selectively if a failure is
acceptable for that iteration.
💾 AZURE DATABRICKS / PYSPARK
🟧 Difference between schema enforcement and schema evolution in Delta tables.
● Schema Enforcement (“write compliance”)
○ Delta checks that incoming data matches the table schema (column
names, types, nullability).
○ Mismatches (e.g., extra columns, wrong types) → write fails to protect
data quality.
○ Example (fails):
df.write.format("delta").mode("append").save("/mnt/delta/sales")
# df has an extra column -> error
● Schema Evolution (“allow controlled changes”)
○ Allows adding new columns (and some compatible type changes) during
writes/merge.
○ Requires enabling:
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

# Append with evolution
df.write.format("delta").mode("append").option("mergeSchema", "true").save("/mnt/delta/sales")

# Overwrite with new schema
df.write.format("delta").mode("overwrite").option("overwriteSchema", "true").save("/mnt/delta/sales")
○ With MERGE:
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

# bronze is assumed to be a DeltaTable handle (e.g., DeltaTable.forName(spark, "bronze"));
# silver is the source DataFrame being merged in.
(
    bronze.alias("b").merge(silver.alias("s"), "b.id = s.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
)
In short: Enforcement prevents bad writes; Evolution permits controlled schema growth
(usually adding columns) when explicitly enabled.
🟧 How to optimize a Spark job running for 2 hours down to 30 minutes?
A pragmatic, stepwise plan:
1) Profile first (don’t guess)
● Use Spark UI: identify long stages, skewed shuffles, excessive tasks, spill, GC
time, and I/O hotspots.
● Capture input sizes, # of files, avg file size, shuffle read/write, cache usage (a quick starting point is sketched below).
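A quick, hedged starting point before any tuning; it assumes a DataFrame named df:

# Inspect the physical plan: look for Exchange (shuffles) and SortMergeJoin vs BroadcastHashJoin
df.explain("formatted")

# Open the Spark UI (Stages/SQL tabs) to spot skewed tasks, spill, and GC time
print(spark.sparkContext.uiWebUrl)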
2) Reduce I/O and shuffles
● Read less: Select only needed columns; push predicates down (Parquet/Delta do
this automatically); a minimal sketch appears at the end of this subsection.
● Avoid small files: Compact to target file sizes (128–512 MB).
(df.repartition(200) # or tune after AQE
.write.format("delta")
.option("dataChange", "false")
.mode("overwrite")
.save(path))
● Coalesce after wide ops if you created too many partitions:
df = df.coalesce( max(1, df.rdd.getNumPartitions() // 2) )
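A minimal sketch of the "read less" bullet above; the path and column names are assumed for illustration:

from pyspark.sql import functions as F

orders = (spark.read.format("delta").load("/mnt/delta/orders")         # assumed path
          .select("order_id", "customer_id", "amount", "order_date")   # column pruning
          .filter(F.col("order_date") >= "2025-01-01"))                # predicate pushed to the scan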
3) Fix joins
● Prefer broadcast hash join when one side is small-ish (tens/hundreds of MB):
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100 * 1024 * 1024)  # 100 MB

fact.join(F.broadcast(dim), "key", "left")
● If data is skewed, enable AQE + skew join and/or salt keys:
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
Key salting (example):
from pyspark.sql import functions as F

k = 10  # salt buckets
big = big.withColumn("salt", F.floor(F.rand() * k))
small = (small.withColumn("salt", F.expr(f"sequence(0, {k-1})"))
              .withColumn("salt", F.explode("salt")))
joined = big.join(small, ["key", "salt"])
4) Cache/persist only when reused
● Cache only DataFrames reused multiple times, and unpersist promptly:
cached = heavy_df.persist()  # or .persist(StorageLevel.MEMORY_AND_DISK)
# use cached multiple times ...
cached.unpersist()
5) Partitioning and Z-ordering (Delta)
● Partition large Delta tables by low-cardinality columns used in filters (e.g., date,
country); a partitioned-write sketch follows the OPTIMIZE example below.
● Use OPTIMIZE + Z-ORDER to improve data skipping on high-cardinality filter
columns:
OPTIMIZE delta.`/mnt/delta/sales` ZORDER BY (customer_id, event_ts);
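A hedged example of the partitioning advice; the partition column and path are assumed names:

# Partition by a low-cardinality column used in most filters; ZORDER covers high-cardinality ones
(df.write.format("delta")
   .partitionBy("order_date")
   .mode("overwrite")
   .save("/mnt/delta/sales"))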
6) Right-size the cluster
● Use Photon (Databricks Runtime with Photon) for SQL/DataFrame
workloads—often 2–3× speedups.
● Ensure sufficient parallelism: spark.sql.shuffle.partitions (AQE can reduce this
automatically).
● Prefer fewer, more powerful nodes to avoid network bottlenecks; enable
autoscaling for bursty stages.
7) Clean code patterns
● Minimize UDFs; prefer built-in functions. If unavoidable, use Pandas UDFs for
vectorization (see the sketch after this list).
● Combine transformations to reduce passes over data; avoid repeated
.count()/.collect() in the hot path.
● Materialize critical joins/aggregations once and reuse.
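If a Python UDF really is unavoidable, a vectorized Pandas UDF sketch; the function and column names are illustrative:

import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def zscore(v: pd.Series) -> pd.Series:
    # Operates on whole Arrow batches instead of row-at-a-time Python calls
    return (v - v.mean()) / v.std()

df = df.withColumn("amount_z", zscore("amount"))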
8) Practical target
● After applying AQE, right-sizing partitions, fixing joins, compacting files, and
enabling Photon, two-hour jobs commonly drop to 30–40 minutes or less.
🟧 Difference between groupBy vs groupByKey and their use cases.
● In DataFrame API (PySpark) you typically use groupBy:
df.groupBy("key").agg(F.sum("value").alias("sum_v"))
○ Exec model: Pushes aggregation to the engine using combiners; efficient;
uses partial aggregates before shuffle (map-side combine).
● groupByKey exists for key-value RDDs (and typed Datasets in Scala). It groups
all values for a key before you apply a function:
rdd = sc.parallelize([("a",1),("a",2),("b",3)])
rdd.groupByKey().mapValues(lambda xs: sum(xs)).collect()
○ Downside: Moves all values for each key across the network (heavy
shuffle), then aggregates—often slower and memory-intensive.
○ Prefer reduceByKey / aggregateByKey on RDDs because they
combine before the shuffle:
rdd.reduceByKey(lambda a,b: a+b).collect()
Guidance:
● Use groupBy (DataFrame) for most analytics (Catalyst optimizer, Tungsten,
AQE).
● Avoid groupByKey unless you truly need all values per key (e.g., complex
non-associative operations) and cannot express it with aggregations.
🟧 Given a dataset with nested JSON structures, how would you flatten it into a
tabular format using PySpark?
Example JSON
{
  "order_id": "O1",
  "customer": {"id": "C1", "name": "Ana"},
  "items": [
    {"sku": "S1", "qty": 2, "price": 10.0, "attrs": {"color": "red"}},
    {"sku": "S2", "qty": 1, "price": 25.0, "attrs": {"color": "blue"}}
  ],
  "shipping": {"address": {"city": "NYC", "zip": "10001"}}
}
Steps:
1. Read JSON (multi-line if needed) and infer or provide schema.
2. Flatten structs by selecting col("struct.*").
3. Explode arrays (explode_outer) to get one row per array element.
4. Handle nested structs/maps recursively.
Code:
from pyspark.sql import functions as F, types as T

# 1) Read JSON (multiLine=True if each file holds a multi-line JSON document)
df = spark.read.json("/mnt/data/orders/*.json", multiLine=True)

# 2) Utility to flatten struct columns (and keep simple ones)
def flatten_structs(_df):
    flat_cols = []
    nested_cols = []
    for field in _df.schema.fields:
        name = field.name
        dtype = field.dataType
        if isinstance(dtype, T.StructType):
            for sub in dtype.fields:
                flat_cols.append(
                    F.col(f"`{name}`.`{sub.name}`").alias(f"{name}__{sub.name}"))
        else:
            flat_cols.append(F.col(f"`{name}`"))
            if isinstance(dtype, T.ArrayType) and isinstance(dtype.elementType, T.StructType):
                nested_cols.append(name)
    return _df.select(flat_cols), nested_cols

# 3) Flatten top-level structs
df1, array_struct_cols = flatten_structs(df)

# 4) Explode arrays of structs (repeat if multiple)
for arr in array_struct_cols:
    df1 = df1.withColumn(arr, F.explode_outer(F.col(arr)))

# 5) After explode, flatten again because array elements are structs
df2, _ = flatten_structs(df1)

# 6) (Optional) Flatten deeper nested structs
def flatten_all(df_in):
    # Recursively flatten until no StructType remains
    while any(isinstance(f.dataType, T.StructType) for f in df_in.schema.fields):
        cols = []
        expanded = False
        for f in df_in.schema.fields:
            if isinstance(f.dataType, T.StructType):
                expanded = True
                for sub in f.dataType.fields:
                    cols.append(
                        F.col(f"`{f.name}`.`{sub.name}`").alias(f"{f.name}__{sub.name}"))
            else:
                cols.append(F.col(f"`{f.name}`"))
        df_in = df_in.select(cols)
        if not expanded:
            break
    return df_in

flat = flatten_all(df2)
flat.show(truncate=False)
Notes:
● Use explode_outer to retain orders even when items is null/empty.
● For maps, you can map_keys/map_values or transform to key,value rows:
flat = flat.selectExpr("*", "explode_outer(attrs) AS (attr_key, attr_value)")  # MAP column -> key/value columns
● For performance on very wide schemas, consider generating the select list from
the schema programmatically (as above) rather than manual selection.
🟧 How do you handle missing or null values in a DataFrame? What strategies would
you use in different scenarios?
1) Drop vs Fill
- Drop when the row/column isn’t useful:
# Drop rows where any of these columns are null
df_clean = df.na.drop(subset=["id","event_ts"], how="any")
# Drop rows with fewer than 3 non-null values
df_thresh = df.na.drop(thresh=3)
- Fill with sensible defaults:
from pyspark.sql import functions as F

df_filled = (df
    .fillna({"age": 0, "country": "UNKNOWN"})  # per-column defaults
    .fillna(0)                                 # remaining numeric columns
)

# Use COALESCE in expressions
df = df.withColumn("price", F.coalesce("price", F.lit(0.0)))
2) Statistically informed imputation
- Numeric with mean/median/quantiles:
from pyspark.ml.feature import Imputer

imputer = Imputer(strategy="median", inputCols=["age", "income"], outputCols=["age", "income"])
df_imp = imputer.fit(df).transform(df)
- Categorical with mode:
mode_val = (df.groupBy("segment").count()
.orderBy(F.desc("count")).first()["segment"])
df_mode = df.fillna({"segment": mode_val})
3) Group-wise imputation (e.g., by country)
stats = df.groupBy("country").agg(F.avg("income").alias("avg_income"))
df_join = (df.join(stats, "country", "left")
             .withColumn("income", F.coalesce("income", "avg_income"))
             .drop("avg_income"))
4) Time-series forward/backward fill
from pyspark.sql import Window, functions as F

w = Window.partitionBy("id").orderBy("ts").rowsBetween(Window.unboundedPreceding, 0)
df_ffill = df.withColumn("value_ffill", F.last("value", ignorenulls=True).over(w))

w_rev = Window.partitionBy("id").orderBy(F.col("ts").desc()).rowsBetween(Window.unboundedPreceding, 0)
df_bfill = df_ffill.withColumn("value_bfill", F.last("value", ignorenulls=True).over(w_rev))
5) Special values: NaN vs null
# Replace NaN with null in float/double columns, then use standard fills
df = df.select([
    F.when(F.isnan(c), None).otherwise(F.col(c)).alias(c) if t in ("double", "float") else F.col(c)
    for c, t in df.dtypes
])
6) Nested structures
- Use withField (Spark 3.1+) or rebuild the struct:
from pyspark.sql import functions as F
df = df.withColumn(
"address",
F.when(F.col("address").isNull(),
F.struct(F.lit("").alias("city"), F.lit("").alias("zip")))
.otherwise(F.col("address"))
)
7) When to choose what
● Training data: Use statistically sound imputation; preserve variance
(median/quantile).
● Reporting/BI: Use defaults (0/“Unknown”) but flag imputed fields.
● Data quality pipelines: Reject or quarantine rows when mandatory fields are
missing.
🏁 SQL AND DATA MODELING
🟧 Find the maximum salary per department and employee name
We want each department’s max salary, along with the employee(s) who earn it. Two
common approaches:
1) Window Functions (preferred in analytics databases)
SELECT department_id,
       employee_name,
       salary
FROM (
    SELECT department_id,
           employee_name,
           salary,
           RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS rnk
    FROM Employees
) t
WHERE rnk = 1;
● Why: RANK() handles ties, so if two employees share the max salary, both are
returned.
2) Join with MAX per department
SELECT e.department_id,
e.employee_name,
e.salary
FROM Employees e
JOIN (
SELECT department_id, MAX(salary) AS max_salary
FROM Employees
GROUP BY department_id
) m
ON e.department_id = m.department_id
AND e.salary = m.max_salary;
● Why: Classic “join-on-aggregate” method; works across all SQL engines.
🟧 Explain the difference between Clustered vs Non-Clustered Index
Clustered Index
● Defines the physical order of rows in the table.
● Each table can have only one clustered index (since data can only be sorted one
way).
● Often built on the primary key.
● Accessing data via clustered index = direct (data is the index).
● Example: In SQL Server/MySQL InnoDB, primary key by default is clustered.
● Analogy: Phonebook sorted by last name.
Non-Clustered Index
● Stored separately from table data.
● Contains index keys + pointers (row locators) to actual data.
● A table can have multiple non-clustered indexes.
● Best for queries filtering by non-PK columns.
● Example: Index on email column when primary key is employee_id.
● Analogy: Index at the back of a book pointing to page numbers.
🟧 Write a query to get 2nd highest salary without using TOP/Limit
Several valid patterns:
1) Window Function
SELECT DISTINCT salary
FROM (
SELECT salary,
DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
FROM Employees
) t
WHERE rnk = 2;
2) Subquery with MAX
SELECT MAX(salary) AS second_highest
FROM Employees
WHERE salary < (
SELECT MAX(salary) FROM Employees
);
● Simpler if you only need the value (not employee details).
3) Self-Join
SELECT e1.salary
FROM Employees e1
WHERE 2 = (
SELECT COUNT(DISTINCT e2.salary)
FROM Employees e2
WHERE e2.salary >= e1.salary
);
● Works in all ANSI SQL engines; scales less well for big data.
🎯 GENERAL DATA ENGINEERING
🟧 Managed vs External tables in Databricks
Managed Table (a.k.a. Internal table)
● Created in the Hive Metastore/Unity Catalog without a location → data stored
under Databricks’ default warehouse location
(dbfs:/user/hive/warehouse/...).
● Dropping the table → drops both metadata & underlying data.
● Example:
CREATE TABLE sales (id INT, amount DOUBLE);
-- Data physically lives in the default managed path
External Table
● Created with an explicit LOCATION pointing to external storage (ADLS, S3, Blob,
etc.).
● Dropping the table → removes only metadata, not the data files.
● Example:
CREATE TABLE sales_ext (id INT, amount DOUBLE)
USING DELTA
LOCATION 'abfss://<container>@<storage_account>.dfs.core.windows.net/sales/';
🟧 What is Lakehouse architecture and how does it differ from a Data Lake?
Traditional Data Lake
● Stores raw files (Parquet, ORC, CSV, JSON).
● Flexible schema-on-read.
● Great for cheap storage, ML feature feeds.
● Weaknesses:
○ No ACID transactions → risk of partial writes.
○ Harder data governance (no unified catalog).
○ Performance issues (small files, lack of indexes, no optimizer).
Data Warehouse
● Structured, ACID-compliant, governed.
● Schema-on-write with BI-friendly optimizations (indexes, caching).
● Weakness: Expensive and rigid; not ideal for semi-structured/unstructured data.
Lakehouse (e.g., Delta Lake on Databricks)
● Unifies the strengths:
○ Open format (Parquet-based).
○ ACID transactions and schema enforcement (like a warehouse).
○ Support for ML + BI on the same data.
○ Unified catalog & governance.
○ Performance enhancements (Z-ordering, caching, vectorized execution).
● Key idea: One copy of data serving batch, streaming, ML, and BI (illustrated below).
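A small illustration of that key idea on a single Delta table; the path is an assumed placeholder, not part of the original answer:

path = "/mnt/delta/sales"

batch_df  = spark.read.format("delta").load(path)                           # BI / batch SQL
stream_df = spark.readStream.format("delta").load(path)                     # streaming consumers
asof_df   = spark.read.format("delta").option("versionAsOf", 0).load(path)  # time travel for reproducible ML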
🟧 How do you implement data quality checks during ingestion?
Data quality checks catch bad data before it pollutes downstream tables. Common
strategies:
1) Schema Validation
● Reject rows with unexpected schema (wrong columns, types).
● In PySpark:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])

df = spark.read.schema(schema).json("/mnt/raw/customers/*.json")
● Enforces id as non-null int.
2) Null / Mandatory Checks
invalid = df.filter("id IS NULL OR name IS NULL")
if invalid.count() > 0:
    invalid.write.format("delta").mode("append").save("/mnt/quarantine/customers")
df = df.exceptAll(invalid)  # keep only valid records
3) Referential Integrity
● Validate FK relationships (e.g., orders.customer_id exists in customers).
valid_orders = orders.join(customers, "customer_id", "inner")
invalid_orders = orders.join(customers, "customer_id", "left_anti")
4) Business Rule Checks
● Example: age >= 0, amount > 0, date <= today.
df = df.filter("amount > 0 AND age >= 0 AND order_date <= current_date()")
5) Duplicate Detection
from pyspark.sql import Window, functions as F

window = Window.partitionBy("id").orderBy(F.col("last_updated").desc())
deduped = (df.withColumn("rn", F.row_number().over(window))
             .filter("rn = 1")
             .drop("rn"))
6) Monitoring / Metrics
● Generate data quality KPIs: null % per column, row counts, duplicates,
distribution checks.
● Store in a DQ dashboard (Power BI/Datadog/ADF logs).
● Example metric output:
dq_stats = df.agg(
F.count("*").alias("row_count"),
F.sum(F.col("id").isNull().cast("int")).alias("null_ids")
)
dq_stats.write.mode("append").save("/mnt/metrics/dq_checks")
7) Tools & Frameworks
● Great Expectations (popular for rule-based validations).
● Deequ (Amazon’s constraint-based library, runs on Spark).
● Delta Live Tables (Databricks) → allows expect clauses:
CREATE OR REFRESH STREAMING LIVE TABLE clean_orders (
  CONSTRAINT valid_amount   EXPECT (amount > 0)           ON VIOLATION DROP ROW,
  CONSTRAINT valid_order_id EXPECT (order_id IS NOT NULL) ON VIOLATION FAIL UPDATE
)
AS SELECT * FROM STREAM(live.raw_orders);