Spark Interview More Questions With Answers

Apache Spark – Questions Deepa Vasanthkumar

1. How many types of join strategies are there in Spark?


In Spark, there are four main join strategies:
1. Broadcast Hash Join: Used when one of the DataFrames is small enough (below
spark.sql.autoBroadcastJoinThreshold, 10 MB by default) to be broadcast to every
executor node, avoiding a shuffle.
2. Sort Merge Join: The default for large equi-joins; both DataFrames are shuffled
on the join key, sorted, and then merged.
3. Shuffle Hash Join: Used for equi-joins when one side is too large to broadcast
but small enough to build a per-partition hash table after the shuffle.
4. Cartesian (Cross) Join: Used when every row of one DataFrame must be joined with
every row of another, for example in cross joins or non-equi joins with no
broadcastable side.
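
A minimal PySpark sketch of these strategies, using two hypothetical DataFrames,
orders (large) and countries (small):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.appName("join-strategies").getOrCreate()

    # Hypothetical data: a large fact table and a small dimension table.
    orders = spark.createDataFrame(
        [(1, "IN"), (2, "US"), (3, "IN")], ["order_id", "country_code"])
    countries = spark.createDataFrame(
        [("IN", "India"), ("US", "United States")], ["country_code", "country_name"])

    # Broadcast hash join: the small side is shipped to every executor, avoiding a shuffle.
    bcast = orders.join(broadcast(countries), "country_code")

    # Sort merge join and shuffle hash join can be requested via join hints.
    smj = orders.hint("merge").join(countries, "country_code")
    shj = orders.hint("shuffle_hash").join(countries, "country_code")

    # Cartesian (cross) join: every row of one side is paired with every row of the other.
    cart = orders.crossJoin(countries)

    bcast.explain()   # the physical plan shows a BroadcastHashJoin node

The hints are only suggestions; Spark may fall back to another strategy if the hinted
one cannot be applied, and automatic broadcasting is governed by
spark.sql.autoBroadcastJoinThreshold.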

2. When using Shuffle Sort Merge Join, does the shuffling occur on the driver node or the
executor node?
Shuffling occurs on the executor nodes. The driver node initiates the job, but the actual
shuffling and sorting of data happen on the executors.

3. What optimization techniques have you used in Spark?


Common Spark optimization techniques include:
• Persisting/Caching Data: To avoid recomputation of the same data.

• Using the repartition and coalesce methods: For efficient partition management.
• Broadcasting small DataFrames: To avoid shuffling.
• Predicate Pushdown: Filtering data as early as possible.
• Using proper file formats: Such as Parquet or ORC for efficient storage and query
performance.
• Tuning Spark configurations: Like executor memory, cores, and shuffle partitions.
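
A short sketch illustrating a few of these techniques, using a hypothetical Parquet
dataset at /data/events with an event_date column (the configuration values are
illustrative, not recommendations):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("optimizations")
             .config("spark.sql.shuffle.partitions", "64")   # tune shuffle partitions
             .config("spark.executor.memory", "4g")          # tune executor memory
             .getOrCreate())

    # Columnar format + predicate pushdown: the filter is pushed down to the Parquet reader.
    events = spark.read.parquet("/data/events").filter("event_date >= '2024-01-01'")

    # Cache a DataFrame that is reused by several downstream actions.
    events.cache()
    events.count()   # materializes the cache

    # repartition() redistributes data with a full shuffle; coalesce() merges partitions
    # without a shuffle, which is useful before writing fewer output files.
    wide = events.repartition(64, "event_date")
    wide.coalesce(8).write.mode("overwrite").parquet("/data/events_compacted")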

4. What is a DAG in Spark, and what is its purpose?


A DAG (Directed Acyclic Graph) in Spark is a sequence of computation stages where each
stage consists of a set of tasks that can be executed in parallel. The DAG scheduler in Spark
is responsible for breaking down a job into stages of tasks based on shuffle boundaries and
executing these stages in the right order to compute the final result. The purpose of the
DAG is to optimize the execution plan for performance and fault tolerance.
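A small sketch that makes the stage split visible; the groupBy below introduces a
shuffle, so Spark divides the job into two stages:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("dag-demo").getOrCreate()

    df = spark.range(1_000_000)                      # narrow operations stay in one stage
    agg = df.withColumn("bucket", df.id % 10) \
            .groupBy("bucket").count()               # the shuffle creates a stage boundary

    agg.explain()    # the physical plan contains an Exchange (shuffle) node
    agg.collect()    # the DAG scheduler runs the stages; see the DAG view in the Spark UI
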
5. If data can be spilled to disk, why do we encounter OOM (Out Of Memory) errors?
OOM (Out Of Memory) errors can occur in Spark even if data can be spilled to disk because:
• Executor memory limits: Not everything can be spilled; for example, a broadcast
hash table, a single oversized record, or objects created by user code must fit in
executor memory.
• Memory leaks: Due to inefficient code or bugs.
• Improperly tuned memory settings: Such as insufficient memory allocated to
executors or drivers.
• Skewed data: Where certain partitions are significantly larger than others, causing
uneven memory usage.
• Complex transformations: That require a large amount of intermediate data to be
held in memory simultaneously.
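
A hedged sketch of memory-related settings that are commonly tuned to avoid such
errors (the values below are illustrative, not recommendations):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("memory-tuning")
             .config("spark.executor.memory", "8g")           # on-heap executor memory
             .config("spark.executor.memoryOverhead", "2g")   # off-heap / JVM overhead
             .config("spark.sql.shuffle.partitions", "400")   # more, smaller shuffle partitions
             .config("spark.sql.adaptive.enabled", "true")    # AQE can split skewed partitions
             .getOrCreate())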

6. How does Spark work internally?


Spark works internally using the following steps:
1. Job Submission: The user submits a Spark job via a SparkContext.
2. DAG Creation: The SparkContext creates a logical plan in the form of a DAG.
3. Job Scheduling: The DAG scheduler splits the job into stages based on shuffle
boundaries.
4. Task Assignment: The stages are further divided into tasks and assigned to
executor nodes.
5. Task Execution: Executors run the tasks, processing data and storing intermediate
results in memory or disk.
6. Shuffle Operations: Data is shuffled between executors as required by operations
like joins.
7. Result Collection: The final results are collected back to the driver node or written
to storage.
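
A minimal end-to-end example that walks through these steps, using a hypothetical
input file /data/sample.txt:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("word-count").getOrCreate()   # 1. job submission

    lines = spark.read.text("/data/sample.txt")                        # hypothetical path; lazy, nothing runs yet
    counts = (lines.selectExpr("explode(split(value, ' ')) AS word")   # narrow transformation
                   .groupBy("word").count())                           # wide transformation -> shuffle

    counts.show(10)   # action: the DAG is built, split into stages, tasks run on executors,
                      # and the result is collected back to the driver
    spark.stop()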

7. What are the different phases of the SQL Engine?


The different phases of the SQL engine in Spark include:
1. Parsing: Converts the SQL query into an unresolved logical plan.
2. Analysis: Resolves the logical plan by determining the correct attributes and data
types using the catalog.
3. Optimization: Optimizes the logical plan using the Catalyst optimizer's rule-based
and cost-based optimizations.
4. Physical Planning: Converts the optimized logical plan into a physical plan with
specific execution strategies.
5. Execution: Executes the physical plan and returns the result.
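
A small sketch to inspect these phases; explain(extended=True) prints the parsed,
analyzed, optimized, and physical plans:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sql-phases").getOrCreate()
    spark.range(10).createOrReplaceTempView("t")

    q = spark.sql("SELECT id * 2 AS doubled FROM t WHERE id > 3")

    # Prints the parsed (unresolved) logical plan, the analyzed logical plan,
    # the optimized logical plan, and the physical plan.
    q.explain(extended=True)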

8. Why do we get an AnalysisException error?


AnalysisException errors occur in Spark for several reasons:
• Unresolved attributes: When column names in the query don't match the schema.
• Missing tables: When referenced tables or DataFrames are not available.
• Schema mismatch: When the data types of the columns don't match the expected
types.
• Unsupported operations: When trying to perform operations that are not
supported by Spark SQL.
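
A small sketch that triggers a typical AnalysisException (an unresolved attribute):

    from pyspark.sql import SparkSession
    from pyspark.sql.utils import AnalysisException

    spark = SparkSession.builder.appName("analysis-exception").getOrCreate()
    df = spark.createDataFrame([(1, "a")], ["id", "label"])

    try:
        df.select("labell").show()        # misspelled column name
    except AnalysisException as e:
        print("AnalysisException:", e)    # cannot resolve the column 'labell'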

9. Explain in detail the different types of transformations in Spark.


Transformations in Spark are of two types:
1. Narrow Transformations: Each output partition is computed from a single input
partition, so no shuffling of data across the cluster is required. Examples include:
– map(): Applies a function to each element.
– filter(): Selects elements that satisfy a condition.
– flatMap(): Similar to map but can return multiple elements for each input
element.
2. Wide Transformations: Computing an output partition may require data from many
input partitions, so records must be shuffled across the network. Examples include:
– reduceByKey(): Aggregates data across keys and requires shuffling.
– groupByKey(): Groups data by key, resulting in shuffling of data.
– join(): Joins two RDDs based on a key, involving shuffling.
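
A minimal RDD sketch contrasting the two kinds of transformations:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("transformations").getOrCreate()
    sc = spark.sparkContext

    words = sc.parallelize(["spark", "scala", "spark", "python"])

    # Narrow transformations: each output partition depends on a single input partition.
    pairs = words.map(lambda w: (w, 1)).filter(lambda kv: len(kv[0]) >= 5)

    # Wide transformation: reduceByKey shuffles records so that all values for a key
    # end up in the same partition.
    counts = pairs.reduceByKey(lambda a, b: a + b)
    print(counts.collect())   # e.g. [('spark', 2), ('scala', 1), ('python', 1)]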

10. How many partitions are created when we invoke a wide dependency transformation?
For DataFrame and Spark SQL operations, the number of partitions produced by a wide
(shuffle) transformation is determined by the spark.sql.shuffle.partitions
configuration, which defaults to 200. For RDD operations, spark.default.parallelism
applies unless an explicit numPartitions argument is passed. The value can be
adjusted based on the size of the data and the specific requirements of the job.
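
A small sketch (note that with Adaptive Query Execution enabled, Spark may coalesce
these shuffle partitions at runtime):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("shuffle-partitions").getOrCreate()
    spark.conf.set("spark.sql.shuffle.partitions", "50")   # applies to DataFrame/SQL shuffles

    df = spark.range(1_000_000)
    agg = df.groupBy((F.col("id") % 10).alias("bucket")).count()
    print(agg.rdd.getNumPartitions())   # 50, unless Adaptive Query Execution coalesces them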
