PySpark Interview Questions & Answers
Prepared for Shubham Soude - Final Interview Prep
Core PySpark Concepts
Q: What is PySpark?
A: PySpark is the Python API for Apache Spark, allowing Python developers to write distributed data
processing applications using Spark's capabilities.
Q: What is the role of SparkSession?
A: SparkSession is the entry point to programming with DataFrame and SQL functionality in
PySpark.
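A minimal sketch of creating one (the app name and CSV path are placeholders):

```python
from pyspark.sql import SparkSession

# Build or reuse the session -- the single entry point since Spark 2.0
spark = SparkSession.builder \
    .appName("interview-prep") \
    .getOrCreate()

# DataFrame reads, SQL, and configuration all hang off this object
df = spark.read.csv("data.csv", header=True, inferSchema=True)
```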
Q: What is lazy evaluation?
A: Lazy evaluation means Spark waits until an action is called before executing transformations to
optimize the execution plan.
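For example, assuming a DataFrame `df` with hypothetical `id` and `amount` columns, the transformations below only build a plan; nothing runs until the action:

```python
# Transformations only record the plan
filtered = df.filter(df["amount"] > 100).select("id", "amount")

# The action triggers the optimized execution
print(filtered.count())
```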
RDDs
Q: What is an RDD?
A: RDD (Resilient Distributed Dataset) is the fundamental data structure in Spark, representing an
immutable, distributed collection of objects.
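A quick illustration using the SparkContext behind an existing `spark` session:

```python
# Distribute a local collection as an RDD
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])

# Low-level transformation plus an action
print(rdd.map(lambda x: x * x).collect())  # [1, 4, 9, 16, 25]
```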
Q: When do you use RDDs over DataFrames?
A: When you need low-level transformations, fine-grained control, or are working with unstructured data.
Q: How does Spark achieve fault tolerance?
A: Through lineage graphs: Spark can recompute lost data using the transformation history.
DataFrames & SQL
Q: What are DataFrames in PySpark?
A: DataFrames are distributed collections of data organized into named columns, similar to SQL
tables.
Q: How do you register a DataFrame as a SQL table?
A: Using `createOrReplaceTempView()` or `createGlobalTempView()`.
Q: What is Catalyst Optimizer?
A: It is the query optimization engine in Spark that generates an optimized logical and physical
query plan.
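You can inspect the plans Catalyst produces with `explain()`; the column names here are placeholders:

```python
# extended=True prints the parsed, analyzed, and optimized logical plans plus the physical plan
df.filter(df["amount"] > 100).groupBy("id").count().explain(extended=True)
```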
Performance Tuning
Q: What is shuffling in Spark?
A: Shuffling is the movement of data between executors when operations like joins or groupBy are
performed.
Q: What is the difference between repartition and coalesce?
A: `repartition()` can increase or decrease the number of partitions and always performs a full shuffle; `coalesce()` only reduces partitions and avoids a full shuffle by merging existing ones.
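A small sketch of the difference:

```python
# repartition() always shuffles and can raise or lower the partition count
df_wide = df.repartition(200)

# coalesce() merges existing partitions without a full shuffle (reduce only)
df_narrow = df_wide.coalesce(10)

print(df_wide.rdd.getNumPartitions(), df_narrow.rdd.getNumPartitions())
```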
Q: How do you cache data in PySpark?
A: Using `df.cache()` or `df.persist()` to store intermediate results in memory.
Joins, UDFs, Window Functions
Q: What is a broadcast join?
A: A join where a small DataFrame is broadcast to all executors to avoid shuffling the large
DataFrame.
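A sketch assuming hypothetical `orders_df` (large) and `customers_df` (small) DataFrames joined on `customer_id`:

```python
from pyspark.sql.functions import broadcast

# Ship the small table to every executor so the large one is not shuffled
result = orders_df.join(broadcast(customers_df), on="customer_id", how="left")
```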
Q: When should you use UDFs in PySpark?
A: Only when built-in functions can't express the logic, since Python UDFs are opaque to the Catalyst optimizer.
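A minimal UDF sketch (the `email` column is a placeholder):

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Python UDFs are a black box to Catalyst, so reach for them last
@udf(returnType=StringType())
def mask_email(email):
    return email.split("@")[0][:2] + "***" if email else None

df.withColumn("masked_email", mask_email(df["email"])).show()
```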
Q: What are window functions?
A: Functions that operate over a window of rows, like `row_number()`, `rank()`, and `lag()`.
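A typical pattern, assuming hypothetical `customer_id`, `order_date`, and `amount` columns:

```python
from pyspark.sql import Window
from pyspark.sql.functions import row_number, lag

# Number each customer's orders and pull the previous order amount
w = Window.partitionBy("customer_id").orderBy("order_date")

df.withColumn("order_seq", row_number().over(w)) \
  .withColumn("prev_amount", lag("amount", 1).over(w)) \
  .show()
```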
File I/O & ETL
Q: What formats does PySpark support?
A: CSV, JSON, Parquet, Avro, ORC.
Q: Which format is best for performance?
A: Parquet, because it's columnar and compressed.
Q: How do you write partitioned data in PySpark?
A: Using `df.write.partitionBy('col').parquet(path)`.
Airflow Integration
Q: How do you trigger a PySpark job in Airflow?
A: Using `BashOperator` with `spark-submit`, or the `SparkSubmitOperator` from the Apache Spark provider, inside an Airflow DAG.
Q: How can you pass parameters from Airflow to PySpark?
A: By passing them as command-line arguments in the `bash_command`.
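A minimal DAG sketch (the DAG id, script path, and `--run-date` flag are placeholders):

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG("spark_etl", start_date=datetime(2024, 1, 1),
         schedule="@daily", catchup=False) as dag:
    run_job = BashOperator(
        task_id="run_spark_job",
        # {{ ds }} is the logical date; the PySpark script reads it from sys.argv
        bash_command="spark-submit /opt/jobs/etl_job.py --run-date {{ ds }}",
    )
```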
Q: What executor is best for limited memory in Airflow?
A: `SequentialExecutor`, as it requires no extra services like Redis or Celery.
Debugging & Real World
Q: How do you handle corrupt records in JSON?
A: Set `mode='PERMISSIVE'` (bad rows are kept and the raw text goes into a corrupt-record column) or `mode='DROPMALFORMED'` to discard them; on Databricks, `badRecordsPath` can also write them to a separate location.
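A read sketch using the open-source options (the path is a placeholder):

```python
# PERMISSIVE keeps bad rows and stores the raw text in a corrupt-record column;
# DROPMALFORMED silently discards them
df = (spark.read
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .json("/data/events.json"))

# Cache first: Spark disallows queries over raw files that reference only the corrupt column
df.cache()
bad_rows = df.filter(df["_corrupt_record"].isNotNull())
```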
Q: How do you monitor Spark jobs?
A: Using the Spark UI (`localhost:4040`) and checking logs in Airflow.
Q: How do you handle null values in PySpark?
A: Use `.na.fill()`, `.na.drop()`, or `col.isNull()` to manage nulls during processing.