KPMG Data Analyst Interview Questions
SQL Questions
1. Write a SQL query to find the second-highest salary from an
employee table.
Given an employee table with columns:
id | name | salary
---+-------+--------
1 | John | 50000
2 | Alice | 60000
3 | Bob | 70000
4 | David | 70000
5 | Emma | 80000
SELECT DISTINCT salary
FROM employee
ORDER BY salary DESC
LIMIT 1 OFFSET 1;
• LIMIT 1 OFFSET 1: Skips the highest salary and returns the second-highest salary.
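SELECT MAX(salary)
FROM employee
WHERE salary < (SELECT MAX(salary) FROM employee);
• The subquery finds the maximum salary; the outer query then finds the highest salary below that maximum, which is the second-highest.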
SELECT DISTINCT salary
FROM (
    SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS salary_rank
    FROM employee
) ranked_salaries
WHERE salary_rank = 2;
• The outer query filters the row where rank is 2 (second-highest salary).
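The three approaches described above (LIMIT/OFFSET, a MAX subquery, and a ranking window function) can be checked against the sample table. The following is a minimal sketch using Python's built-in sqlite3 module; the exact SQL text is an assumption reconstructed from the explanations, and DENSE_RANK() is an assumed choice of ranking function. Window functions require SQLite 3.25 or newer.

```python
import sqlite3

# In-memory database holding the sample employee table from above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER, name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [(1, "John", 50000), (2, "Alice", 60000), (3, "Bob", 70000),
     (4, "David", 70000), (5, "Emma", 80000)],
)

# Approach 1: sort the distinct salaries and skip the top one.
limit_offset = conn.execute(
    "SELECT DISTINCT salary FROM employee "
    "ORDER BY salary DESC LIMIT 1 OFFSET 1"
).fetchone()[0]

# Approach 2: highest salary strictly below the overall maximum.
max_subquery = conn.execute(
    "SELECT MAX(salary) FROM employee "
    "WHERE salary < (SELECT MAX(salary) FROM employee)"
).fetchone()[0]

# Approach 3: rank salaries with a window function and keep rank 2.
ranked = conn.execute(
    "SELECT DISTINCT salary FROM ("
    "  SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS salary_rank"
    "  FROM employee"
    ") AS ranked_salaries WHERE salary_rank = 2"
).fetchone()[0]

print(limit_offset, max_subquery, ranked)  # 70000 70000 70000
```

DISTINCT and DENSE_RANK() matter when salaries are tied: if two employees shared the top salary, a plain LIMIT 1 OFFSET 1 without DISTINCT would return that top salary again.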
• Example:
• Instead of:
• Use:
• Bad:
• Good:
• Bad:
SELECT name FROM employee WHERE salary = (SELECT MAX(salary) FROM employee);
• Good:
Feature | JOIN | UNION
---------------+------+------
Purpose | Combines data from multiple tables based on a relationship. | Combines results of two or more queries into a single result set.
Duplicate Rows | Can return duplicate rows (INNER JOIN, LEFT JOIN, etc.). | UNION removes duplicates (UNION ALL keeps them).
id | name | department
---+-------+-----------
1 | John | HR
2 | Alice | IT
3 | John | HR
4 | Bob | IT
5 | John | HR
SELECT name, department, COUNT(*) AS count
FROM employees
GROUP BY name, department
HAVING COUNT(*) > 1;
• HAVING COUNT(*) > 1 filters groups with more than one record, indicating
duplicates.
FROM employees
This returns:
name | department | count
------+------------+------
John | HR | 3
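The GROUP BY / HAVING approach to finding duplicates can be tried on the sample employees table. This is a small sketch using Python's built-in sqlite3 module; the column alias `occurrences` is a hypothetical name chosen for illustration.

```python
import sqlite3

# In-memory copy of the sample employees table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER, name TEXT, department TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "John", "HR"), (2, "Alice", "IT"), (3, "John", "HR"),
     (4, "Bob", "IT"), (5, "John", "HR")],
)

# Group identical (name, department) pairs and keep groups seen more than once.
duplicates = conn.execute(
    "SELECT name, department, COUNT(*) AS occurrences "
    "FROM employees "
    "GROUP BY name, department "
    "HAVING COUNT(*) > 1"
).fetchall()

print(duplicates)  # [('John', 'HR', 3)]
```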
Window functions perform calculations across a set of table rows related to the current
row. Unlike aggregate functions, they do not collapse rows into a single output.
id | employee | sales_amount
---+----------+-------------
1 | John | 500
2 | Alice | 700
3 | Bob | 600
4 | John | 800
5 | Alice | 900
SELECT
    employee,
    sales_amount,
    RANK() OVER (PARTITION BY employee ORDER BY sales_amount DESC) AS sales_rank
FROM
    sales;
Output:
employee | sales_amount | sales_rank
---------+--------------+-----------
Alice | 900 | 1
Alice | 700 | 2
Bob | 600 | 1
John | 800 | 1
John | 500 | 2
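The per-employee ranking shown in the output can be reproduced with a window function. The sketch below uses Python's built-in sqlite3 module and assumes the ranking is RANK() partitioned by employee and ordered by descending sales_amount; the alias `sales_rank` is a hypothetical name. Window functions require SQLite 3.25 or newer.

```python
import sqlite3

# In-memory copy of the sample sales table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, employee TEXT, sales_amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, "John", 500), (2, "Alice", 700), (3, "Bob", 600),
     (4, "John", 800), (5, "Alice", 900)],
)

# Rank each employee's sales from largest to smallest; every input row is kept.
ranked_sales = conn.execute(
    "SELECT employee, sales_amount, "
    "       RANK() OVER (PARTITION BY employee ORDER BY sales_amount DESC) AS sales_rank "
    "FROM sales "
    "ORDER BY employee, sales_rank"
).fetchall()

for row in ranked_sales:
    print(row)
```

Five rows go in and five rows come out, which illustrates that a window function does not collapse rows the way GROUP BY does.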
Explanation:
1. Use COALESCE():
o Example:
o Example:
o Example:
o Example:
UPDATE employees
o Example:
SELECT
name,
CASE
ELSE phone_number
END AS contact_info
FROM employees;
o Example:
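Two of the NULL-handling techniques listed above, COALESCE() and a CASE expression, can be compared side by side. This is a sketch under assumptions: the table contents and the 'N/A' placeholder are hypothetical, and only the column names `name`, `phone_number`, and `contact_info` come from the query above.

```python
import sqlite3

# Hypothetical employees table in which one phone number is missing (NULL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER, name TEXT, phone_number TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "John", "555-0100"), (2, "Alice", None)],
)

# COALESCE returns its first non-NULL argument.
with_coalesce = conn.execute(
    "SELECT name, COALESCE(phone_number, 'N/A') AS contact_info "
    "FROM employees ORDER BY id"
).fetchall()

# The same substitution written as an explicit CASE expression.
with_case = conn.execute(
    "SELECT name, "
    "       CASE WHEN phone_number IS NULL THEN 'N/A' ELSE phone_number END AS contact_info "
    "FROM employees ORDER BY id"
).fetchall()

print(with_coalesce)  # [('John', '555-0100'), ('Alice', 'N/A')]
print(with_case == with_coalesce)  # True
```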
Example Demonstration:
Given a sales table:
id | customer_id | amount
---+------------+-------
1 | 101 | 200
2 | 102 | 400
3 | 101 | 300
4 | 103 | 800
5 | 102 | 600
SELECT *
FROM sales
WHERE amount >= 300;
Filters out rows where amount is less than 300 before aggregation.
SELECT customer_id, SUM(amount) AS total_amount
FROM sales
GROUP BY customer_id
HAVING SUM(amount) > 500;
Groups data first and then filters out customers with SUM(amount) ≤ 500.
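The difference between WHERE (row filter before aggregation) and HAVING (group filter after aggregation) can be seen by running both against the sample sales table. This is a minimal sketch with Python's built-in sqlite3 module; the thresholds follow the explanations above, and the exact SQL text is an assumption.

```python
import sqlite3

# In-memory copy of the sample sales table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, customer_id INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, 101, 200), (2, 102, 400), (3, 101, 300), (4, 103, 800), (5, 102, 600)],
)

# WHERE works on individual rows: amounts below 300 are dropped.
where_rows = conn.execute(
    "SELECT id, amount FROM sales WHERE amount >= 300 ORDER BY id"
).fetchall()

# HAVING works on groups: customers whose total is 500 or less are dropped.
having_rows = conn.execute(
    "SELECT customer_id, SUM(amount) FROM sales "
    "GROUP BY customer_id HAVING SUM(amount) > 500 "
    "ORDER BY customer_id"
).fetchall()

print(where_rows)   # [(2, 400), (3, 300), (4, 800), (5, 600)]
print(having_rows)  # [(102, 1000), (103, 800)]
```

Customer 101 totals exactly 500, so the HAVING filter excludes it even though one of its rows passes the WHERE filter.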
Example of CTE:
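WITH total_sales AS (
    SELECT customer_id, SUM(amount) AS total_spent
    FROM sales
    GROUP BY customer_id
)
SELECT customer_id, total_spent
FROM total_sales
WHERE total_spent > 500;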
• The main query then filters customers who spent more than 500.
Feature | CTE | Subquery
------------------+-----+---------
Recursion Support | Yes (RECURSIVE CTE) | No recursion
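SELECT customer_id, total_spent
FROM (
    SELECT customer_id, SUM(amount) AS total_spent
    FROM sales
    GROUP BY customer_id
) AS total_sales
WHERE total_spent > 500;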
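A CTE and an equivalent subquery should return the same rows; the practical differences are readability and recursion support. The sketch below, using Python's built-in sqlite3 module, checks both on the sample sales table and adds a tiny recursive CTE. The alias `total_spent` and the `counter` CTE are hypothetical names used for illustration.

```python
import sqlite3

# In-memory copy of the sample sales table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, customer_id INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, 101, 200), (2, 102, 400), (3, 101, 300), (4, 103, 800), (5, 102, 600)],
)

# Named intermediate result (CTE), then filter it.
cte_rows = conn.execute(
    "WITH total_sales AS ("
    "  SELECT customer_id, SUM(amount) AS total_spent FROM sales GROUP BY customer_id"
    ") "
    "SELECT customer_id, total_spent FROM total_sales "
    "WHERE total_spent > 500 ORDER BY customer_id"
).fetchall()

# Same logic written as a subquery in the FROM clause.
subquery_rows = conn.execute(
    "SELECT customer_id, total_spent FROM ("
    "  SELECT customer_id, SUM(amount) AS total_spent FROM sales GROUP BY customer_id"
    ") AS total_sales "
    "WHERE total_spent > 500 ORDER BY customer_id"
).fetchall()

# A recursive CTE that counts from 1 to 5; a plain subquery cannot refer to itself.
counter = [row[0] for row in conn.execute(
    "WITH RECURSIVE counter(n) AS ("
    "  SELECT 1 UNION ALL SELECT n + 1 FROM counter WHERE n < 5"
    ") SELECT n FROM counter"
)]

print(cte_rows)  # [(102, 1000), (103, 800)]
print(counter)   # [1, 2, 3, 4, 5]
```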
Customer Lifetime Value (CLV) estimates the total revenue a business can expect from a
customer over their lifetime.
Formula:
WITH customer_data AS (
SELECT
customer_id,
SUM(amount) AS total_revenue,
FROM sales
GROUP BY customer_id
SELECT
customer_id,
FROM customer_data;
Explanation:
Example Output:
customer_id | estimated_CLV
------------+--------------
101 | 1500.00
102 | 2000.00
103 | 2500.00
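One way to sketch a CLV calculation in SQL is shown below. The formula is an assumption, since several variants are in common use: CLV = average purchase value × number of purchases per year × expected customer lifespan in years. The 3-year lifespan, the treatment of the sample rows as one year of purchases, and the alias `estimated_clv` are all hypothetical. The figures it prints are not meant to reproduce the example output above, whose inputs are not stated.

```python
import sqlite3

ASSUMED_LIFESPAN_YEARS = 3  # hypothetical value chosen for illustration

# Sample sales table, treated here as one year of purchases (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, customer_id INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, 101, 200), (2, 102, 400), (3, 101, 300), (4, 103, 800), (5, 102, 600)],
)

# Per customer: average purchase value * purchases per year * lifespan.
rows = conn.execute(
    "WITH customer_data AS ("
    "  SELECT customer_id, "
    "         AVG(amount) AS avg_purchase_value, "
    "         COUNT(*) AS purchases_per_year "
    "  FROM sales GROUP BY customer_id"
    ") "
    "SELECT customer_id, "
    "       ROUND(avg_purchase_value * purchases_per_year * ?, 2) AS estimated_clv "
    "FROM customer_data ORDER BY customer_id",
    (ASSUMED_LIFESPAN_YEARS,),
).fetchall()

clv = dict(rows)
print(clv)  # {101: 1500.0, 102: 3000.0, 103: 2400.0}
```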
• An index is a data structure that improves the speed of data retrieval operations on
a table.
Type Description
Creating an Index
Performance Improvement:
Query | Without Index | With Index
SELECT * FROM sales WHERE customer_id = 101; | Full table scan (Slow) | Index lookup (Fast)
• Columns are frequently used in WHERE, JOIN, ORDER BY, GROUP BY.
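The effect of an index on the lookup above can be observed through the query plan. This is a sketch using Python's built-in sqlite3 module and SQLite's EXPLAIN QUERY PLAN; the index name `idx_sales_customer_id` is hypothetical, and the exact plan wording varies between SQLite versions, so no literal plan text is shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, customer_id INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, 101, 200), (2, 102, 400), (3, 101, 300), (4, 103, 800), (5, 102, 600)],
)

query = "SELECT * FROM sales WHERE customer_id = 101"


def plan(sql):
    """Return the human-readable detail column of SQLite's query plan."""
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


plan_before = plan(query)  # no index yet: the planner scans the whole table

# Create an index on the column used in the WHERE clause.
conn.execute("CREATE INDEX idx_sales_customer_id ON sales (customer_id)")

plan_after = plan(query)  # the planner can now search via the index

print(plan_before)
print(plan_after)
```

Indexes speed up reads at the cost of extra storage and slower writes, which is why they are usually reserved for columns that appear often in filters, joins, and sorts.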
PYTHON Questions
import pandas as pd
df = pd.DataFrame({
})
Output:
Name 1
Age 1
Salary 1
dtype: int64
• Forward Fill (ffill) → Replaces missing values with the previous row value.
• Backward Fill (bfill) → Replaces missing values with the next row value.
4. Interpolation (interpolate())
df.interpolate(method='linear', inplace=True)
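Forward fill, backward fill, and linear interpolation can be compared on one small Series. The values below are hypothetical sample data chosen so that the three results differ visibly.

```python
import pandas as pd

# Hypothetical series with two missing values.
s = pd.Series([1.0, None, 3.0, None, 5.0])

forward = s.ffill()                      # copy the previous valid value down
backward = s.bfill()                     # copy the next valid value up
linear = s.interpolate(method="linear")  # estimate from the neighbouring values

print(forward.tolist())   # [1.0, 1.0, 3.0, 3.0, 5.0]
print(backward.tolist())  # [1.0, 3.0, 3.0, 5.0, 5.0]
print(linear.tolist())    # [1.0, 2.0, 3.0, 4.0, 5.0]
```

Interpolation suits ordered numeric data such as time series; for categorical columns, forward or backward fill is usually the safer choice.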
Feature | List | Tuple
-------------+------+------
Memory Usage | Uses more memory. | Uses less memory.
Example:
# List Example
my_list = [1, 2, 3]
my_list.append(4) # Allowed
print(my_list) # [1, 2, 3, 4]
# Tuple Example
my_tuple = (1, 2, 3)
print(my_tuple) # (1, 2, 3)
When to Use?
• Tuples: When the data should remain unchanged (e.g., coordinates, database
records).
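The mutability and memory differences between lists and tuples can be checked directly. This is a short sketch; the exact byte sizes reported by sys.getsizeof depend on the Python build, so only the comparison is relied on.

```python
import sys

my_list = [1, 2, 3]
my_tuple = (1, 2, 3)

# Lists are mutable: item assignment works.
my_list[0] = 10

# Tuples are immutable: item assignment raises TypeError.
try:
    my_tuple[0] = 10
except TypeError as exc:
    error_message = str(exc)

# A tuple holding the same items needs less memory than a list.
list_size = sys.getsizeof(my_list)
tuple_size = sys.getsizeof(my_tuple)

print(my_list)                 # [10, 2, 3]
print(error_message)
print(tuple_size < list_size)  # True
```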
13. What are lambda functions in Python? Give an example.
What is a Lambda Function?
A lambda function in Python is an anonymous function (without a name) that is used for
short, simple operations. It is defined using the lambda keyword.
Syntax:
lambda arguments: expression
Example:
square = lambda x: x ** 2
print(square(5)) # Output: 25
Equivalent to:
def square(x):
return x ** 2
• Useful for One-time Operations: Often used in functions like map(), filter(), and
sorted().
numbers = [1, 2, 3, 4, 5]
numbers = [1, 2, 3, 4, 5, 6]
even_numbers = list(filter(lambda x: x % 2 == 0, numbers))  # keep only even values
print(even_numbers) # [2, 4, 6]
print(students_sorted)
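The three typical uses of lambda functions mentioned above, with map(), filter(), and sorted(), can be sketched together. The student records and the choice of squaring in the map() step are hypothetical, included only for illustration.

```python
numbers = [1, 2, 3, 4, 5, 6]

# map(): apply the lambda to every element.
squared = list(map(lambda x: x ** 2, numbers))

# filter(): keep only the elements for which the lambda returns True.
even_numbers = list(filter(lambda x: x % 2 == 0, numbers))

# sorted(): use the lambda as the sort key (hypothetical name/score pairs).
students = [("Asha", 82), ("Ben", 91), ("Chen", 77)]
students_sorted = sorted(students, key=lambda student: student[1], reverse=True)

print(squared)          # [1, 4, 9, 16, 25, 36]
print(even_numbers)     # [2, 4, 6]
print(students_sorted)  # [('Ben', 91), ('Asha', 82), ('Chen', 77)]
```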
Summary Table
Feature | List | Tuple | Lambda Function
-----------+------+-------+----------------
Definition | Mutable sequence | Immutable sequence | Anonymous function
Basic Syntax:
import pandas as pd
Scenario Solution
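Feature | apply() | map() | Vectorization
------------+---------+-------+--------------
Performance | Slower than vectorization | Faster than apply() | Fastest (uses NumPy)
Use Case | Complex operations | Element-wise transformation | Mathematical operations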
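The same column transformation can be written with apply(), map(), or a vectorized expression. The sketch below uses a hypothetical Salary column and a hypothetical flat bonus of 5000; all three forms produce identical values, and the difference lies in speed and in how complex the per-element logic can be.

```python
import pandas as pd

# Hypothetical salary data.
df = pd.DataFrame({"Salary": [50000, 60000, 70000]})

BONUS = 5000  # hypothetical flat bonus

via_apply = df["Salary"].apply(lambda s: s + BONUS)  # flexible, runs Python per element
via_map = df["Salary"].map(lambda s: s + BONUS)      # element-wise mapping on a Series
via_vector = df["Salary"] + BONUS                    # whole-column arithmetic in NumPy

print(via_vector.tolist())  # [55000, 65000, 75000]
```

For plain arithmetic like this, the vectorized form is usually the one to prefer; apply() is better kept for logic that cannot be expressed as column operations.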
3. String Transformations
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()  # rescales each column to the [0, 1] range
df[['Salary']] = scaler.fit_transform(df[['Salary']])
Feature | NumPy | Pandas
---------------+-------+-------
Data Structure | ndarray (N-dimensional array) | DataFrame (Tabular) & Series (1D)
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr * 2) # [ 2  4  6  8 10]
import pandas as pd
print(df['Salary'].mean()) # 55000
Best Practice:
• Use Pandas for working with structured data (CSV, Excel, databases).
POWER BI Questions
21. What are Calculated Columns and Measures in Power BI?
Power BI allows users to create Calculated Columns and Measures using DAX (Data
Analysis Expressions), but they serve different purposes.
Feature | Calculated Column | Measure
----------+-------------------+--------
Stored in | The table (each row gets a value) | Computed dynamically in visuals
Example | TotalPrice = Sales[Quantity] * Sales[UnitPrice] | Total Sales = SUM(Sales[Amount])
Example of a Measure:
2. Transform Data:
o Remove duplicates
o Apply changes and load the transformed data into Power BI reports.
• If a report has sales data from multiple countries, you can filter a single page to
show only India’s sales.
Query Optimization
Report-Level Optimization
• Reduce the number of visuals per page (ideally under 8-10 visuals).
• Turn off unnecessary interactions between visuals.
• Use Aggregations and Summary Tables instead of detailed data.
Bad:
Optimized:
Feature | Star Schema | Snowflake Schema
----------+-------------+-----------------
Structure | Fact table with directly linked dimension tables | Dimension tables are normalized into sub-tables
Type | Description
-----------+------------
Static RLS | Filters are manually assigned (e.g., a Sales Manager sees only their region).
Suppose we have a Sales Table and a Users Table with Region info.
Result: Users see only their specific data without modifying the report!
Purpose of DAX:
SWITCH(
    SELECTEDVALUE(Metrics[Metric]),
    "Sales", SUM(Sales[Amount]),
    "Profit", SUM(Sales[Profit])
)
Result: Users can switch between Sales and Profit dynamically!