Barclays Data Engineer Interview Questions
(3–4 YOE | 12–16 LPA)
SQL
How would you optimize a slow-running SQL query?
1. Analyze the Execution Plan:
o Identify costly operations like full table scans, nested loops, or missing indexes.
2. Avoid SELECT *:
o Select only the columns you actually need so less data is read and transferred.
3. Optimize Joins:
o Prefer INNER JOIN over OUTER JOIN if NULLs are not needed.
4. Filter Early:
o Apply WHERE clauses early to limit the data set before joins and aggregations.
5. Rewrite Subqueries:
o Use JOINs or CTEs (Common Table Expressions) for better performance and readability.
-- Avoid this: correlated subquery evaluated once per row
SELECT e.name,
       (SELECT d.department_name
        FROM departments d
        WHERE d.department_id = e.department_id) AS department_name
FROM employees e;
-- Prefer this: a single join
SELECT e.name, d.department_name
FROM employees e
JOIN departments d ON d.department_id = e.department_id;
6. Partition Large Tables:
o Use table partitioning to divide large tables logically for faster access.
Query to find the second highest salary using a window function:
SELECT salary
FROM (
    SELECT salary,
           DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
    FROM employees
) ranked
WHERE rnk = 2;
WHERE vs HAVING:
• Aggregate Functions: cannot be used in WHERE (SUM, AVG, etc.); can be used in HAVING
Example:
-- Using WHERE (filters individual rows before grouping)
SELECT department, AVG(salary) AS avg_salary
FROM employees
WHERE salary > 50000
GROUP BY department;

-- Using HAVING (filters groups after aggregation)
SELECT department, AVG(salary) AS avg_salary
FROM employees
GROUP BY department
HAVING AVG(salary) > 50000;
Using CASE to handle NULL values:
SELECT
    name,
    CASE
        WHEN salary IS NULL THEN 'Not Available'
        ELSE salary
    END AS salary_status
FROM employees;
-- <=> is MySQL's NULL-safe equality operator (standard SQL uses IS NULL):
SELECT * FROM table WHERE column <=> NULL; -- Only TRUE if column is NULL
Data Normalization
1NF (First Normal Form): no repeating groups; atomic columns only (avoid arrays or multiple values in a single column)
• Reduces data redundancy (e.g., no repeated customer info in each order row)
• Query Performance: fewer joins make queries faster; a fully normalized design is slightly slower because it needs more joins
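To make the redundancy and join trade-off concrete, here is a minimal pandas sketch (the table and column names are hypothetical): a denormalized orders table that repeats customer details on every row is split into a separate customers table keyed by customer_id.

import pandas as pd

# Hypothetical denormalized data: customer info repeated on every order row
orders_denorm = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_id": [101, 101, 102],
    "customer_name": ["Asha", "Asha", "Ravi"],
    "customer_city": ["Pune", "Pune", "Delhi"],
    "amount": [250.0, 400.0, 120.0],
})

# Normalized form: one row per customer; orders keep only the foreign key
customers = (
    orders_denorm[["customer_id", "customer_name", "customer_city"]]
    .drop_duplicates()
    .reset_index(drop=True)
)
orders = orders_denorm[["order_id", "customer_id", "amount"]]

# Rebuilding the combined view now needs a join, which is the query-performance trade-off
joined = orders.merge(customers, on="customer_id")
print(customers)
print(joined)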
Fact Table:
Fact_Transactions (transaction_id, customer_id, product_id, amount, date_id)
Dimension Tables:
• Customers
• Products
• Date
• Loans
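As a rough illustration of how this star schema is queried, a small pandas sketch (dimension attribute columns such as customer_name and product_name are hypothetical): the fact table is joined to each dimension on its key and then aggregated.

import pandas as pd

# Fact table: one row per transaction with foreign keys into the dimensions
fact_transactions = pd.DataFrame({
    "transaction_id": [1, 2],
    "customer_id": [101, 102],
    "product_id": [11, 12],
    "date_id": [20240101, 20240102],
    "amount": [5000.0, 750.0],
})

# Dimension tables (attribute columns here are made up for the example)
dim_customers = pd.DataFrame({"customer_id": [101, 102], "customer_name": ["Asha", "Ravi"]})
dim_products = pd.DataFrame({"product_id": [11, 12], "product_name": ["Savings", "Credit Card"]})

# Typical star-schema query: join fact to dimensions, then aggregate a measure
report = (
    fact_transactions
    .merge(dim_customers, on="customer_id")
    .merge(dim_products, on="product_id")
    .groupby("product_name")["amount"]
    .sum()
)
print(report)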
• Implement:
o Deduplication (see the sketch below)
• Use cloud data warehouses like Snowflake, Amazon Redshift, Google BigQuery, or on-premises options like Teradata
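A minimal PySpark sketch of the deduplication step mentioned above, assuming a hypothetical transactions dataset where transaction_id identifies a unique record:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dedup_example").getOrCreate()

# Hypothetical input containing an exact duplicate row
data = [
    (1, "C101", 250.0),
    (1, "C101", 250.0),   # duplicate transaction
    (2, "C102", 400.0),
]
df = spark.createDataFrame(data, ["transaction_id", "customer_id", "amount"])

# Keep one row per business key; dropDuplicates() with no arguments removes exact duplicates
deduped = df.dropDuplicates(["transaction_id"])
deduped.show()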
• Hadoop Core Components: HDFS (storage), YARN (resource management), MapReduce (batch processing)
• Workflow: data stored in HDFS is processed by MapReduce jobs and the results are written back to HDFS
• Spark Core Components: driver, executors, and a cluster manager operating on in-memory RDDs/DataFrames
• Spark Ecosystem: Spark SQL, Spark Streaming, MLlib, GraphX
• Fault Tolerance: Hadoop: yes (via HDFS replication); Spark: yes (via lineage of RDDs)
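To illustrate the lineage point, a small PySpark sketch (assuming a local session): each transformation only records how the RDD is derived from its parent, so a lost partition can be recomputed from that lineage instead of from replicated copies.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lineage_example").getOrCreate()
sc = spark.sparkContext

# Each transformation records its parent, building a lineage graph (no data is copied)
numbers = sc.parallelize(range(10))
squares = numbers.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# toDebugString() shows the lineage Spark would replay to recompute lost partitions
print(evens.toDebugString().decode())
print(evens.collect())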
When to Use: to control parallelism, reduce shuffling in repeated joins/aggregations on a key, and enable partition pruning when reading.
Types of Partitioning:
1. Default Partitioning: Spark chooses the number of partitions from the input splits and available cores.
2. Hash Partitioning: rows are distributed by the hash of a key column (e.g., df.repartition("country")).
3. Range Partitioning: rows are distributed into sorted, non-overlapping key ranges (e.g., df.repartitionByRange("date")).
Repartition vs Coalesce (see the sketch below):
• repartition(n) performs a full shuffle; it can increase or decrease the partition count and optimizes parallelism.
• coalesce(n) avoids a full shuffle and is used only to reduce the partition count (e.g., before writing output).
Partitioned write example (lays data out in country=/year= folders so readers can prune partitions):
df.write.partitionBy("country", "year").parquet("output_path")
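A short PySpark sketch contrasting repartition and coalesce (the data and partition counts are arbitrary examples):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioning_example").getOrCreate()

df = spark.createDataFrame(
    [("IN", 2023, 100.0), ("UK", 2023, 220.0), ("IN", 2024, 310.0)],
    ["country", "year", "amount"],
)

# repartition: full shuffle; can increase partitions and redistribute rows by a key
by_country = df.repartition(8, "country")
print(by_country.rdd.getNumPartitions())   # 8

# coalesce: no full shuffle; only merges existing partitions downward (e.g., before writing)
fewer = by_country.coalesce(2)
print(fewer.rdd.getNumPartitions())        # 2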
3. Why might you choose Parquet over CSV for storing large datasets?
Parquet vs CSV Comparison:
• Data Types: CSV stores everything as strings (needs manual parsing); Parquet is strongly typed (ints, floats, etc.)
1. Columnar Storage: queries read only the columns they need instead of entire rows.
2. Compression: similar values stored together compress well, reducing storage and I/O.
3. Schema Enforcement: column names and types are stored in the file and validated on read/write.
4. Integration: natively supported by Spark, Hive, and cloud warehouses like Snowflake, Redshift, and BigQuery.
• Parquet would only load account_id, region, amount columns → faster and cheaper.
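A small PySpark sketch of that column pruning (the path is a placeholder): the DataFrame is written once as Parquet, and the read back touches only the three columns the query needs.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet_example").getOrCreate()

df = spark.createDataFrame(
    [("A1", "EMEA", 120.0, "2024-01-01"), ("A2", "APAC", 90.0, "2024-01-02")],
    ["account_id", "region", "amount", "txn_date"],
)
df.write.mode("overwrite").parquet("/tmp/transactions_parquet")  # placeholder path

# Only the selected columns are read from the columnar files (column pruning)
slim = spark.read.parquet("/tmp/transactions_parquet").select("account_id", "region", "amount")
slim.show()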
Coding
When working with large datasets (e.g., millions of rows), it's efficient to:
• Read the file in chunks instead of loading it all into memory at once
• Transform each chunk (e.g., drop rows with missing values)
• Concatenate the processed chunks and write out the result
Example Code:
import pandas as pd
chunk_size = 100000
result = []

# Read the CSV in chunks instead of loading the whole file into memory
for chunk in pd.read_csv("large_file.csv", chunksize=chunk_size):  # input path is a placeholder
    chunk = chunk.dropna()   # transform each chunk: drop rows with missing values
    result.append(chunk)

final_df = pd.concat(result)
# Save to new file
final_df.to_csv("transformed_file.csv", index=False)
Best Practices:
• Fill with mean/median/mode: df['col'].fillna(df['col'].mean()) (for numerical columns)
Example:
# Fill missing ages with the column mean
df['Age'] = df['Age'].fillna(df['Age'].mean())
# Drop rows where Salary is missing
df = df.dropna(subset=['Salary'])
# Fill missing city names with "Unknown"
df['City'] = df['City'].fillna("Unknown")
A decorator wraps a function to add behaviour without modifying its code:
def my_decorator(func):
    def wrapper():
        func()                        # call the wrapped function
        print("After function runs")  # extra behaviour added by the decorator
    return wrapper

@my_decorator
def say_hello():
    print("Hello!")

say_hello()
Output:
Hello!
After function runs
import time

def timer_decorator(func):
    def wrapper():
        start = time.time()
        result = func()               # run the wrapped function
        end = time.time()
        print(f"Execution time: {end - start:.2f} seconds")
        return result
    return wrapper
@timer_decorator
def process_data():
    time.sleep(2)
    print("Data processed")

process_data()
Output:
Data processed