
Frankline Florence

INFOSYS
DATA
ENGINEERING
Q&A

2025
www.linkedin.com/in/frankline-florence
INFOSYS DATA ENGINEERING INTERVIEW
QUESTIONS AND ANSWERS - 2025

⚙️ Azure Data Factory (ADF)


1.​ Explain different types of Integration Runtimes in ADF.
2.​ How do you handle incremental load in ADF pipelines?
3.​ What happens if an Until activity fails?

💾 Azure Databricks / PySpark

4.​ Difference between schema enforcement and schema evolution in Delta tables.
5.​ How to optimize a Spark job running for 2 hours down to 30 minutes?
6.​ Difference between groupBy vs groupByKey and their use cases.
7.​ Given a dataset with nested JSON structures, how would you flatten it into a
tabular format using PySpark?
8.​ How do you handle missing or null values in a DataFrame? What strategies would
you use in different scenarios?

🏁 SQL / Data Modeling


9.​ Find the maximum salary per department and employee name.
10. Difference between Clustered vs Non-Clustered Index.
11. Write a query to get 2nd highest salary without using TOP/Limit.

🎯 General Data Engineering:


12. Managed vs External tables in Databricks.
13. What is Lakehouse architecture and how does it differ from Data Lake?
14. How do you implement data quality checks during ingestion?
ANSWERS

⚙️ AZURE DATA FACTORY (ADF)


🟧 Explain different types of Integration Runtimes in ADF
Integration Runtime (IR) is the compute infrastructure ADF uses to move/transform
data and to execute SSIS packages.

1.​ Azure Integration Runtime (Azure IR)

Use cases: Copy between cloud data stores; Mapping Data Flows; Data Flow Debug;
lookups; metadata.

Where it runs: Microsoft-managed compute in Azure (region you choose or “Auto-resolve”).
Networking: Can run inside Managed Virtual Network with Private Endpoints to reach
private resources without self-hosting.

Scaling: Elastic; Mapping Data Flows let you choose core sizes and parallelism;
autoscaling during execution.

Security: Managed Identity (MSI) for auth to Azure services; keyless connections via linked services/Key Vault.

2.​ Self-Hosted Integration Runtime (SHIR)

Use cases: Access on-premises/private-network data (SQL Server, file shares), VMs without public endpoints, cross-cloud private networks.

Where it runs: Your Windows machines/VMs; you can create a cluster (multiple nodes)
for HA and scale-out; can be shared across multiple factories/Synapse.

Networking: Your outbound IPs; respects local firewall/proxy; supports proxy config.

Capabilities: Copy (source/sink), lookups, metadata, stored procedures, etc. (not Mapping Data Flows). For transformations, you’d typically stage to a compute (e.g., Databricks/Synapse) or use SSIS IR.
3.​ Azure-SSIS Integration Runtime (SSIS IR)

Use cases: Lift-and-shift SSIS packages to Azure with minimal changes.

Where it runs: Managed Azure VMs that host SSIS; you choose node size, count, and
edition (Standard/Enterprise).

Features: Package Store (SSISDB), scale-out execution, custom components via custom
setup, join to a VNet (including Managed VNet).

Triggering: Orchestrated by ADF pipelines/Triggers, just like any other activity.​

🖐🏼 Quick rule of thumb:

●​ Cloud-to-cloud & data flows → Azure IR​

●​ Anything behind a firewall → Self-Hosted IR​

●​ SSIS packages → Azure-SSIS IR


🟧 How do you handle incremental load in ADF pipelines?
There isn’t a single “switch”—you implement a pattern. The most common ones:

1) Watermark (High-Watermark) Pattern — Relational Sources

● Idea: Track the maximum value of a monotonically increasing column (e.g., LastModifiedDate, UpdateTS, or a numeric surrogate key) and load only rows that changed since the last run.

● Where to store state: Control table (e.g., ETL_Watermark) in Azure SQL or a file (e.g., in ADLS).

●​ Flow:​

1.​ Look up the current watermark from the control table.​

2.​ Set Variable prevWatermark.​

3.​ Get a new watermark (e.g., SELECT MAX(LastModifiedDate) FROM src).​


4.​ Copy Activity with a query filter:

SELECT * ​
FROM dbo.Orders ​
WHERE LastModifiedDate > @prevWatermark ​
AND LastModifiedDate <= @newWatermark;

5.​ Upsert to target (via stored proc or MERGE in sink).​

6.​ Update control table with newWatermark.​

Notes: Always bound the upper window (<= @newWatermark) to be restart-safe.
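A minimal sketch of the control table used in steps 1 and 6 above (table and column names are illustrative):

-- One row per source table, holding the last successfully loaded watermark
CREATE TABLE dbo.ETL_Watermark (
    table_name      VARCHAR(128) NOT NULL PRIMARY KEY,
    watermark_value DATETIME2    NOT NULL
);

-- Step 6: after the copy and upsert succeed, move the bookmark forward
UPDATE dbo.ETL_Watermark
SET    watermark_value = @newWatermark
WHERE  table_name = 'dbo.Orders';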

2) Change Tracking / CDC (SQL Server/Azure SQL)

●​ Idea: Let the source system tell you what changed.​

● Flow: Enable Change Tracking or CDC on source tables. Use a Lookup to fetch the last LSN/version, copy only the deltas via the CDC/CT functions (sketched below), then update the LSN/version bookmark in your control table.

●​ Pros: Robust for updates/deletes; minimal source impact compared with full
scans.
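A hedged sketch of the Change Tracking variant (assumes Change Tracking is enabled on dbo.Orders, that OrderID is its key, and that @last_sync_version comes from your control table):

-- Rows changed since the stored version; LEFT JOIN keeps deleted keys
SELECT ct.OrderID, ct.SYS_CHANGE_OPERATION, o.*
FROM   CHANGETABLE(CHANGES dbo.Orders, @last_sync_version) AS ct
LEFT JOIN dbo.Orders AS o ON o.OrderID = ct.OrderID;

-- New bookmark to persist in the control table after the copy succeeds
SELECT CHANGE_TRACKING_CURRENT_VERSION();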

3) Files by Last Modified Time (Data Lake/Blob/S3)

●​ Idea: Only ingest new or changed files.​

●​ Flow:​

○​ Get Metadata (child items + lastModified).​

○​ Filter activity against stored watermark timestamp.​

○​ ForEach over filtered files → Copy.​

○​ Update watermark to max lastModified processed.​

●​ Alternatives: Event-based triggers (Blob storage events) ingest files as they land.​
4) Delta Lake/Databricks Autoloader (if Databricks is in the mix)

●​ Idea: Let Databricks track files via a checkpoint and use MERGE on a Delta table.​

●​ ADF orchestrates the notebook/job; the incremental logic lives in the notebook.
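A minimal sketch of what that notebook logic can look like (paths, the target table location, and the order_id key are illustrative):

from delta.tables import DeltaTable

# Auto Loader tracks already-ingested files via the checkpoint
incoming = (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/mnt/raw/orders/"))

def upsert_batch(batch_df, batch_id):
    # MERGE each micro-batch into the target Delta table
    target = DeltaTable.forPath(spark, "/mnt/delta/orders")
    (target.alias("t")
     .merge(batch_df.alias("s"), "t.order_id = s.order_id")
     .whenMatchedUpdateAll()
     .whenNotMatchedInsertAll()
     .execute())

(incoming.writeStream
 .foreachBatch(upsert_batch)
 .option("checkpointLocation", "/mnt/checkpoints/orders")
 .trigger(availableNow=True)
 .start())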

5) Tumbling Window Triggers

● Idea: Time-sliced orchestrations with automatic “window start/end” parameters to bound your query (useful for append-only event data); see the example below.
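A hedged example of wiring those window bounds into a source query (ADF expression syntax; the parameter names windowStart/windowEnd are illustrative and would be mapped from @trigger().outputs.windowStartTime / windowEndTime):

SELECT *
FROM   dbo.Events
WHERE  EventTime >  '@{pipeline().parameters.windowStart}'
  AND  EventTime <= '@{pipeline().parameters.windowEnd}'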

Common pitfalls & tips

●​ Never rely on ADF pipeline variables as persistent state—store them externally.​

●​ Use idempotent sinks (MERGE/upsert) so re-runs are safe.​

●​ Guard for late-arriving data by overlapping windows (e.g., process last 5 minutes
again but de-duplicate in sink).
🟧 What happens if an Until activity fails?
Until repeats its inner activities until a boolean expression evaluates to true, or until it times out.

●​ Inner activity failure: If any inner activity fails and is not set to
continueOnFailure = true, the Until activity fails immediately and the loop
stops.​

● Retries: Inner activities respect their own retry policy (retry count/interval). If retries exhaust and an inner activity still fails, Until fails.

●​ Timeout: The Until itself has a timeout. If the condition never becomes true
before the timeout expires, the Until fails with a timeout error.​

●​ Zero iterations: If the condition is already true at entry, the loop doesn’t run and
succeeds immediately.​

●​ Best practice to “break”: Update a variable/state inside the loop and make the
expression depend on it; use continueOnFailure selectively if a failure is
acceptable for that iteration.
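A hedged sketch of that break pattern (the variable name loopDone is illustrative): initialize a Boolean pipeline variable to false, flip it with a Set Variable activity inside the loop when the exit condition is met, and use it as the Until expression:

@equals(variables('loopDone'), true)

The loop exits as soon as the expression evaluates to true; the timeout still acts as a safety net.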
💾 AZURE DATABRICKS / PYSPARK
🟧 Difference between schema enforcement and schema evolution in Delta tables.
●​ Schema Enforcement (“write compliance”)​

○​ Delta checks that incoming data matches the table schema (column
names, types, nullability).​

○​ Mismatches (e.g., extra columns, wrong types) → write fails to protect
data quality.​

○​ Example (fails):

# df has an extra column -> write fails with a schema mismatch error
df.write.format("delta").mode("append").save("/mnt/delta/sales")

●​ Schema Evolution (“allow controlled changes”)​



○​ Allows adding new columns (and some compatible type changes) during
writes/merge.

○​ Requires enabling:
FR

spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled",
"true")​

# Append with evolution​
df.write.format("delta").mode("append").option("mergeSchema",
"true").save("/mnt/delta/sales")​

# Overwrite with new schema​
df.write.format("delta").mode("overwrite").option("overwriteSchema",
"true").save("/mnt/delta/sales")
○​ With MERGE:

from delta.tables import DeltaTable

spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

# The merge target must be a DeltaTable handle, not a DataFrame (path is illustrative)
bronze = DeltaTable.forPath(spark, "/mnt/delta/bronze")
(
    bronze.alias("b").merge(silver.alias("s"), "b.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)

In short: Enforcement prevents bad writes; Evolution permits controlled schema growth
(usually adding columns) when explicitly enabled.

🟧 How to optimize a Spark job running for 2 hours down to 30 minutes?


A pragmatic, stepwise plan:

1) Profile first (don’t guess)



●​ Use Spark UI: identify long stages, skewed shuffles, excessive tasks, spill, GC
time, and I/O hotspots.​

●​ Capture input sizes, # of files, avg file size, shuffle read/write, cache usage.

2) Reduce I/O and shuffles



●​ Read less: Select only needed columns; push predicates down (Parquet/Delta do
this automatically).​

●​ Avoid small files: Compact to target file sizes (128–512 MB).

(df.repartition(200)                 # or tune after AQE
   .write.format("delta")
   .option("dataChange", "false")
   .mode("overwrite")
   .save(path))
●​ Coalesce after wide ops if you created too many partitions:

df = df.coalesce( max(1, df.rdd.getNumPartitions() // 2) )

3) Fix joins

●​ Prefer broadcast hash join when one side is small-ish (tens/hundreds of MB):

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100 * 1024 * 1024)  # 100 MB
fact.join(F.broadcast(dim), "key", "left")

●​ If data is skewed, enable AQE + skew join and/or salt keys:


spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

Key salting (example):


from pyspark.sql import functions as F

k = 10  # salt buckets
big = big.withColumn("salt", F.floor(F.rand() * k))
small = small.withColumn("salt", F.explode(F.expr(f"sequence(0, {k-1})")))
joined = big.join(small, ["key", "salt"])
4) Cache/persist only when reused

●​ Cache only DataFrames reused multiple times, and unpersist promptly:

cached = heavy_df.persist()   # or .persist(StorageLevel.MEMORY_AND_DISK)
# use cached multiple times...
cached.unpersist()

5) Partitioning and Z-ordering (Delta)

●​ Partition large Delta tables by low-cardinality columns used in filters (e.g., date,
country).​

● Use OPTIMIZE + Z-ORDER to improve data skipping on high-cardinality filter columns:

OPTIMIZE delta.`/mnt/delta/sales` ZORDER BY (customer_id, event_ts);



6) Right-size the cluster

● Use Photon (Databricks Runtime with Photon) for SQL/DataFrame workloads—often 2–3× speedups.

● Ensure sufficient parallelism: spark.sql.shuffle.partitions (AQE can reduce this automatically).

● Prefer fewer, more powerful nodes to avoid network bottlenecks; enable autoscaling for bursty stages.

7) Clean code patterns

● Minimize UDFs; prefer built-in functions. If unavoidable, use Pandas UDFs for vectorization (see the sketch after this list).
●​ Combine transformations to reduce passes over data; avoid repeated
.count()/.collect() in the hot path.​

●​ Materialize critical joins/aggregations once and reuse.
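A minimal Pandas UDF sketch for the UDF point above (the temp_f column and the conversion are illustrative; a built-in expression would still be preferable when one exists):

from pyspark.sql.functions import pandas_udf
import pandas as pd

# Vectorized: each call receives a whole pandas Series batch instead of one row at a time
@pandas_udf("double")
def f_to_c(temp_f: pd.Series) -> pd.Series:
    return (temp_f - 32.0) * 5.0 / 9.0

df = df.withColumn("temp_c", f_to_c("temp_f"))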

8) Practical target

●​ After applying AQE, right-sizing partitions, fixing joins, compacting files, and
enabling Photon, two-hour jobs commonly land < 30–40 mins.

🟧 Difference between groupBy vs groupByKey and their use cases.

●​ In DataFrame API (PySpark) you typically use groupBy:

df.groupBy("key").agg(F.sum("value").alias("sum_v"))
○ Exec model: Pushes aggregation to the engine using combiners; efficient; uses partial aggregates before shuffle (map-side combine).

●​ groupByKey exists for key-value RDDs (and typed Datasets in Scala). It groups
all values for a key before you apply a function:

rdd = sc.parallelize([("a",1),("a",2),("b",3)])​
rdd.groupByKey().mapValues(lambda xs: sum(xs)).collect()

○ Downside: Moves all values for each key across the network (heavy shuffle), then aggregates—often slower and memory-intensive.

○ Prefer reduceByKey / aggregateByKey on RDDs because they combine before the shuffle:

rdd.reduceByKey(lambda a,b: a+b).collect()


Guidance:

● Use groupBy (DataFrame) for most analytics (Catalyst optimizer, Tungsten, AQE).

●​ Avoid groupByKey unless you truly need all values per key (e.g., complex
non-associative operations) and cannot express it with aggregations.

🟧 Given a dataset with nested JSON structures, how would you flatten it into a
tabular format using PySpark?

Example JSON

{
    "order_id": "O1",
    "customer": {"id": "C1", "name": "Ana"},
    "items": [
        {"sku": "S1", "qty": 2, "price": 10.0, "attrs": {"color": "red"}},
        {"sku": "S2", "qty": 1, "price": 25.0, "attrs": {"color": "blue"}}
    ],
    "shipping": {"address": {"city": "NYC", "zip": "10001"}}
}

Steps:

1.​ Read JSON (multi-line if needed) and infer or provide schema.​

2.​ Flatten structs by selecting col("struct.*").​



3.​ Explode arrays (explode_outer) to get one row per array element.​

4.​ Handle nested structs/maps recursively.

Code:

from pyspark.sql import functions as F, types as T

# 1) Read JSON (multiLine=True if each file holds a multi-line document)
df = spark.read.json("/mnt/data/orders/*.json", multiLine=True)

# 2) Utility to flatten struct columns (and keep simple ones)
def flatten_structs(_df):
    flat_cols = []
    nested_cols = []
    for field in _df.schema.fields:
        name = field.name
        dtype = field.dataType
        if isinstance(dtype, T.StructType):
            for sub in dtype.fields:
                flat_cols.append(
                    F.col(f"`{name}`.`{sub.name}`").alias(f"{name}__{sub.name}")
                )
        else:
            flat_cols.append(F.col(f"`{name}`"))
        if isinstance(dtype, T.ArrayType) and isinstance(dtype.elementType, T.StructType):
            nested_cols.append(name)
    return _df.select(flat_cols), nested_cols

# 3) Flatten top-level structs
df1, array_struct_cols = flatten_structs(df)

# 4) Explode arrays of structs (repeat if multiple)
for arr in array_struct_cols:
    df1 = df1.withColumn(arr, F.explode_outer(F.col(arr)))

# 5) After explode, flatten again because array elements are structs
df2, _ = flatten_structs(df1)

# 6) (Optional) Flatten deeper nested structs
def flatten_all(df_in):
    # Recursively flatten until no StructType remains
    while any(isinstance(f.dataType, T.StructType) for f in df_in.schema.fields):
        cols = []
        expanded = False
        for f in df_in.schema.fields:
            if isinstance(f.dataType, T.StructType):
                expanded = True
                for sub in f.dataType.fields:
                    cols.append(
                        F.col(f"`{f.name}`.`{sub.name}`").alias(f"{f.name}__{sub.name}")
                    )
            else:
                cols.append(F.col(f"`{f.name}`"))
        df_in = df_in.select(cols)
        if not expanded:
            break
    return df_in

flat = flatten_all(df2)

flat.show(truncate=False)
Notes:

●​ Use explode_outer to retain orders even when items is null/empty.​

●​ For maps, you can map_keys/map_values or transform to key,value rows:



flat = flat.select("*", F.explode("attrs"))  # attrs is a MAP -> one row per key/value pair

●​ For performance on very wide schemas, consider generating the select list from
the schema programmatically (as above) rather than manual selection.

🟧 How do you handle missing or null values in a DataFrame? What strategies would
you use in different scenarios?

1) Drop vs Fill

-​ Drop when the row/column isn’t useful:

# Drop rows where any of these columns are null​


df_clean = df.na.drop(subset=["id","event_ts"], how="any")​

# Drop rows with fewer than 3 non-null values​
df_thresh = df.na.drop(thresh=3)

-​ Fill with sensible defaults:

from pyspark.sql import functions as F

df_filled = (df
    .fillna({"age": 0, "country": "UNKNOWN"})  # per-column defaults
    .fillna(0)                                  # remaining numeric columns
)

# Use COALESCE in expressions
df = df.withColumn("price", F.coalesce("price", F.lit(0.0)))
2) Statistically informed imputation

-​ Numeric with mean/median/quantiles:



from pyspark.ml.feature import Imputer​


imputer = Imputer(strategy="median", inputCols=["age","income"],
outputCols=["age","income"])​
df_imp = imputer.fit(df).transform(df)

-​ Categorical with mode:

mode_val = (df.groupBy("segment").count()​
.orderBy(F.desc("count")).first()["segment"])​
df_mode = df.fillna({"segment": mode_val})

3) Group-wise imputation (e.g., by country)

stats = df.groupBy("country").agg(F.avg("income").alias("avg_income"))
df_join = (df.join(stats, "country", "left")
           .withColumn("income", F.coalesce("income", "avg_income"))
           .drop("avg_income"))

4) Time-series forward/backward fill

from pyspark.sql import Window, functions as F

w = Window.partitionBy("id").orderBy("ts").rowsBetween(Window.unboundedPreceding, 0)
df_ffill = df.withColumn("value_ffill", F.last("value", ignorenulls=True).over(w))

w_rev = Window.partitionBy("id").orderBy(F.col("ts").desc()).rowsBetween(Window.unboundedPreceding, 0)
df_bfill = df_ffill.withColumn("value_bfill", F.last("value", ignorenulls=True).over(w_rev))

5) Special values: NaN vs null

from pyspark.sql import functions as F

# Replace NaN with null in float/double columns, then use the standard fills
df = df.select([
    F.when(F.isnan(c), None).otherwise(F.col(c)).alias(c)
    if t in ("double", "float") else F.col(c)
    for c, t in df.dtypes
])

6) Nested structures

-​ Use withField (Spark 3.1+) or rebuild the struct:

from pyspark.sql import functions as F​


df = df.withColumn(​
"address",​
F.when(F.col("address").isNull(),
F.struct(F.lit("").alias("city"), F.lit("").alias("zip")))​
.otherwise(F.col("address"))​
)

7) When to choose what

● Training data: Use statistically sound imputation; preserve variance (median/quantile).

● Reporting/BI: Use defaults (0/“Unknown”) but flag imputed fields (see the sketch after this list).

●​ Data quality pipelines: Reject or quarantine rows when mandatory fields are
missing.
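A minimal sketch of the flagging idea from the Reporting/BI bullet (column names are illustrative):

from pyspark.sql import functions as F

# Keep a flag so BI users can tell real values from imputed defaults
df = (df
      .withColumn("country_imputed", F.col("country").isNull())
      .fillna({"country": "UNKNOWN"}))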
🏁 SQL AND DATA MODELING
🟧 Find the maximum salary per department and employee name
We want each department’s max salary, along with the employee(s) who earn it. Two
common approaches:

1) Window Functions (preferred in analytics databases)

SELECT department_id,
       employee_name,
       salary
FROM (
    SELECT department_id,
           employee_name,
           salary,
           RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS rnk
    FROM Employees
) t
WHERE rnk = 1;

●​ Why: RANK() handles ties, so if two employees share the max salary, both are
returned.

2) Join with MAX per department

SELECT e.department_id,
       e.employee_name,
       e.salary
FROM Employees e
JOIN (
    SELECT department_id, MAX(salary) AS max_salary
    FROM Employees
    GROUP BY department_id
) m
  ON e.department_id = m.department_id
 AND e.salary = m.max_salary;

● Why: Classic “join-on-aggregate” method; works across all SQL engines.


🟧 Explain the difference between Clustered vs Non-Clustered Index
Clustered Index

●​ Defines the physical order of rows in the table.​

●​ Each table can have only one clustered index (since data can only be sorted one
way).​

●​ Often built on the primary key.​

●​ Accessing data via clustered index = direct (data is the index).​

● Example: In SQL Server and MySQL InnoDB, the primary key is clustered by default.

●​ Analogy: Phonebook sorted by last name.​

Non-Clustered Index
●​ Stored separately from table data.​

●​ Contains index keys + pointers (row locators) to actual data.​

●​ A table can have multiple non-clustered indexes.​



●​ Best for queries filtering by non-PK columns.​

●​ Example: Index on email column when primary key is employee_id.​

●​ Analogy: Index at the back of a book pointing to page numbers.
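A short T-SQL illustration of the two index types (table and index names are illustrative):

-- The primary key creates the clustered index by default (SQL Server)
CREATE TABLE Employees (
    employee_id   INT PRIMARY KEY,      -- clustered: defines physical row order
    employee_name VARCHAR(100),
    email         VARCHAR(255),
    department_id INT
);

-- A separate non-clustered index for lookups on a non-key column
CREATE NONCLUSTERED INDEX IX_Employees_Email
    ON Employees (email);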



🟧 Write a query to get 2nd highest salary without using TOP/Limit


Several valid patterns:

1) Window Function

SELECT DISTINCT salary
FROM (
    SELECT salary,
           DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
    FROM Employees
) t
WHERE rnk = 2;

2) Subquery with MAX

SELECT MAX(salary) AS second_highest
FROM Employees
WHERE salary < (
    SELECT MAX(salary) FROM Employees
);

●​ Simpler if you only need the value (not employee details).​
3) Self-Join

SELECT e1.salary
FROM Employees e1
WHERE 2 = (
    SELECT COUNT(DISTINCT e2.salary)
    FROM Employees e2
    WHERE e2.salary >= e1.salary
);

●​ Works in all ANSI SQL engines; scales less well for big data.
🎯 GENERAL DATA ENGINEERING
🟧 Managed vs External tables in Databricks
Managed Table (a.k.a. Internal table)

● Created in the Hive Metastore/Unity Catalog without a location → data is stored under Databricks’ default warehouse location (dbfs:/user/hive/warehouse/...).

●​ Dropping the table → drops both metadata & underlying data.​

●​ Example:
CREATE TABLE sales (id INT, amount DOUBLE);​
-- Data physically lives in the default managed path

External Table

●​ Created with an explicit LOCATION pointing to external storage (ADLS, S3, Blob,
etc.).​

●​ Dropping the table → removes only metadata, not the data files.​

●​ Example:

CREATE TABLE sales_ext (id INT, amount DOUBLE)
USING DELTA
LOCATION 'abfss://<container>@<storage_account>.dfs.core.windows.net/sales/';
🟧 What is Lakehouse architecture and how does it differ from Data Lake?
Traditional Data Lake

●​ Stores raw files (Parquet, ORC, CSV, JSON).​

●​ Flexible schema-on-read.​

●​ Great for cheap storage, ML feature feeds.​

●​ Weaknesses:​

○​ No ACID transactions → risk of partial writes.​

○​ Harder data governance (no unified catalog).​

○​ Performance issues (small files, lack of indexes, no optimizer).

Data Warehouse
●​ Structured, ACID-compliant, governed.​

●​ Schema-on-write with BI-friendly optimizations (indexes, caching).​

●​ Weakness: Expensive and rigid; not ideal for semi-structured/unstructured data.



Lakehouse (e.g., Delta Lake on Databricks)

●​ Unifies the strengths:​

○​ Open format (Parquet-based).​

○​ ACID transactions and schema enforcement (like a warehouse).​



○​ Support for ML + BI on the same data.​

○​ Unified catalog & governance.​

○​ Performance enhancements (Z-ordering, caching, vectorized execution).​

●​ Key idea: One copy of data serving batch, streaming, ML, and BI.
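A small illustration of the “one copy” idea (the path is illustrative): the same Delta table can feed batch BI queries and incremental streaming jobs.

path = "/mnt/delta/sales"

batch_df  = spark.read.format("delta").load(path)         # BI / ad-hoc analytics
stream_df = spark.readStream.format("delta").load(path)   # incremental pipelines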
🟧 How do you implement data quality checks during ingestion?
Data quality checks catch bad data before it pollutes downstream tables. Common
strategies:

1) Schema Validation

●​ Reject rows with unexpected schema (wrong columns, types).​

●​ In PySpark:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])

df = spark.read.schema(schema).json("/mnt/raw/customers/*.json")

●​ Enforces id as non-null int.

2) Null / Mandatory Checks

invalid = df.filter("id IS NULL OR name IS NULL")

if invalid.count() > 0:
    invalid.write.format("delta").mode("append").save("/mnt/quarantine/customers")
    df = df.exceptAll(invalid)  # keep only valid records

3) Referential Integrity

●​ Validate FK relationships (e.g., orders.customer_id exists in customers).


valid_orders = orders.join(customers, "customer_id", "inner")​
invalid_orders = orders.join(customers, "customer_id", "left_anti")

4) Business Rule Checks

●​ Example: age >= 0, amount > 0, date <= today.

df = df.filter("amount > 0 AND age >= 0 AND order_date <= current_date()")

5) Duplicate Detection

from pyspark.sql import Window, functions as F

window = Window.partitionBy("id").orderBy(F.col("last_updated").desc())
deduped = (df.withColumn("rn", F.row_number().over(window))
           .filter("rn = 1")
           .drop("rn"))

6) Monitoring / Metrics

●​ Generate data quality KPIs: null % per column, row counts, duplicates,
distribution checks.​

●​ Store in a DQ dashboard (Power BI/Datadog/ADF logs).​

●​ Example metric output:



dq_stats = df.agg(​
F.count("*").alias("row_count"),​
F.sum(F.col("id").isNull().cast("int")).alias("null_ids")​
)​
dq_stats.write.mode("append").save("/mnt/metrics/dq_checks")
7) Tools & Frameworks

●​ Great Expectations (popular for rule-based validations).​

●​ Deequ (Amazon’s constraint-based library, runs on Spark).​

●​ Delta Live Tables (Databricks) → allows expect clauses:

CREATE OR REFRESH STREAMING LIVE TABLE clean_orders (
  CONSTRAINT valid_amount   EXPECT (amount > 0)           ON VIOLATION DROP ROW,
  CONSTRAINT valid_order_id EXPECT (order_id IS NOT NULL) ON VIOLATION FAIL UPDATE
)
AS SELECT *
FROM STREAM(live.raw_orders);

