PySpark Interview Questions & Answers

Prepared for Shubham Soude - Final Interview Prep

Core PySpark Concepts


Q: What is PySpark?
A: PySpark is the Python API for Apache Spark, allowing Python developers to write distributed data
processing applications using Spark's capabilities.

Q: What is the role of SparkSession?


A: SparkSession is the entry point to programming with DataFrame and SQL functionality in
PySpark.
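
A minimal sketch of creating one (the app name is an arbitrary example):

```python
from pyspark.sql import SparkSession

# Build (or reuse) the single SparkSession for this application.
spark = SparkSession.builder \
    .appName("interview-prep") \
    .getOrCreate()
```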

Q: What is lazy evaluation?


A: Lazy evaluation means Spark waits until an action is called before executing transformations to
optimize the execution plan.
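
For example (the file path and column names are hypothetical), nothing executes until the action at the end:

```python
# Transformations only build up the logical plan; they run nothing yet.
df = spark.read.csv("sales.csv", header=True, inferSchema=True)
high_value = df.filter(df["amount"] > 100).select("order_id", "amount")

# The action below triggers optimization and execution of the whole plan.
print(high_value.count())
```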

RDDs
Q: What is an RDD?
A: RDD (Resilient Distributed Dataset) is the fundamental data structure in Spark, representing an
immutable, distributed collection of objects.
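
A small sketch of creating an RDD directly from a Python list:

```python
# parallelize() distributes a local collection across the cluster as an RDD.
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
squared = rdd.map(lambda x: x * x)   # transformation (lazy)
print(squared.collect())             # action -> [1, 4, 9, 16, 25]
```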

Q: When do you use RDDs over DataFrames?


A: When you need low-level transformations, fine-grained control, or to work with unstructured data.

Q: How does Spark achieve fault tolerance?


A: Through lineage graphs: Spark can recompute lost data using the transformation history.

DataFrames & SQL


Q: What are DataFrames in PySpark?
A: DataFrames are distributed collections of data organized into named columns, similar to SQL
tables.

Q: How do you register a DataFrame as a SQL table?


A: Using `createOrReplaceTempView()` or `createGlobalTempView()`.
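
For example, assuming a DataFrame `df` with hypothetical `product` and `amount` columns:

```python
df.createOrReplaceTempView("sales")

# The view can now be queried with plain SQL through the same SparkSession.
totals = spark.sql("SELECT product, SUM(amount) AS total FROM sales GROUP BY product")
totals.show()
```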

Q: What is Catalyst Optimizer?


A: It is the query optimization engine in Spark that generates an optimized logical and physical
query plan.
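
You can inspect the plans Catalyst produces with `explain()`; the filter below is a hypothetical example:

```python
# extended=True prints the parsed, analyzed, and optimized logical plans plus the physical plan.
df.filter(df["amount"] > 100).explain(extended=True)
```
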
Performance Tuning
Q: What is shuffling in Spark?
A: Shuffling is the movement of data between executors when operations like joins or groupBy are
performed.

Q: What is the difference between repartition and coalesce?


A: `repartition()` can increase or decrease the number of partitions and always performs a full shuffle; `coalesce()` only reduces partitions by merging existing ones, avoiding a full shuffle.
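
A quick sketch of both calls:

```python
df_200 = df.repartition(200)    # full shuffle; partition count can go up or down
df_10 = df_200.coalesce(10)     # merges existing partitions; avoids a full shuffle
```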

Q: How do you cache data in PySpark?


A: Using `df.cache()` or `df.persist()` to store intermediate results in memory.
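
A minimal sketch (the explicit storage level is optional):

```python
from pyspark import StorageLevel

df.cache()        # shorthand for persist() with the default storage level
df.count()        # an action is required to actually materialize the cache
df.unpersist()    # release the cached data when it is no longer needed

# Alternatively, choose the storage level explicitly.
df.persist(StorageLevel.MEMORY_AND_DISK)
```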

Joins, UDFs, Window Functions


Q: What is a broadcast join?
A: A join where a small DataFrame is broadcast to all executors to avoid shuffling the large
DataFrame.
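
A sketch using the `broadcast()` hint; the DataFrame names are hypothetical and `small_dim` is assumed to fit in executor memory:

```python
from pyspark.sql.functions import broadcast

# The broadcast hint ships small_dim to every executor instead of shuffling large_fact.
joined = large_fact.join(broadcast(small_dim), on="customer_id", how="left")
```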

Q: When should you use UDFs in PySpark?


A: Only when built-in functions can't perform the logic, as UDFs are not optimized by Catalyst.
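
A hypothetical UDF sketch; in practice the built-in `upper()` from `pyspark.sql.functions` would be preferred here:

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Python UDFs run row by row outside the JVM, so Catalyst cannot optimize them.
@udf(returnType=StringType())
def shout(s):
    return s.upper() if s is not None else None

df = df.withColumn("name_upper", shout(df["name"]))
```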

Q: What are window functions?


A: Functions that operate over a window of rows, like `row_number()`, `rank()`, and `lag()`.
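
For example, ranking rows within each department by salary (column names are hypothetical):

```python
from pyspark.sql import Window
from pyspark.sql.functions import row_number

w = Window.partitionBy("dept").orderBy(df["salary"].desc())
ranked = df.withColumn("rank_in_dept", row_number().over(w))
```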

File I/O & ETL


Q: What formats does PySpark support?
A: CSV, JSON, Parquet, Avro, ORC.

Q: Which format is best for performance?


A: Parquet, because it's columnar and compressed.

Q: How do you write partitioned data in PySpark?


A: Using `df.write.partitionBy('col').parquet(path)`.
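
For example, with hypothetical partition columns and output path:

```python
# Writes one subdirectory per (year, month) combination, e.g. year=2024/month=1/.
df.write.mode("overwrite").partitionBy("year", "month").parquet("/data/sales")
```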

Airflow Integration
Q: How do you trigger a PySpark job in Airflow?
A: Using `BashOperator` with `spark-submit` inside an Airflow DAG.

Q: How can you pass parameters from Airflow to PySpark?


A: By passing them as command-line arguments in the `bash_command`.
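
A minimal DAG sketch covering both questions above; the script path, schedule, and `--run-date` argument are illustrative (Airflow 2.x imports):

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="pyspark_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_spark_job = BashOperator(
        task_id="spark_submit_job",
        # {{ ds }} is the logical date, passed to the PySpark script as an argument.
        bash_command="spark-submit /opt/jobs/etl_job.py --run-date {{ ds }}",
    )
```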

Q: What executor is best for limited memory in Airflow?


A: `SequentialExecutor`, as it requires no extra services like Redis or Celery.
Debugging & Real World
Q: How do you handle corrupt records in JSON?
A: Set `mode='PERMISSIVE'` (malformed rows are kept and captured in the `_corrupt_record` column) or `mode='DROPMALFORMED'` (they are discarded); on Databricks, `badRecordsPath` can also route bad records to a file path.
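
A read sketch with a hypothetical input path:

```python
# PERMISSIVE keeps malformed rows and stores the raw text in _corrupt_record;
# DROPMALFORMED would silently discard them instead.
df = (spark.read
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .json("/data/events/"))
```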

Q: How do you monitor Spark jobs?


A: Using the Spark UI (`localhost:4040`) and checking logs in Airflow.

Q: How do you handle null values in PySpark?


A: Use `.na.fill()`, `.na.drop()`, or `col.isNull()` to manage nulls during processing.
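
A few common patterns (column names are hypothetical):

```python
from pyspark.sql.functions import col

filled = df.na.fill({"amount": 0, "city": "unknown"})   # replace nulls per column
cleaned = df.na.drop(subset=["amount"])                 # drop rows where amount is null
nulls = df.filter(col("amount").isNull())               # inspect the null rows
```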
