Apache Spark
Follow me Here:
LinkedIn:
https://www.linkedin.com/in/ajay026/
https://lnkd.in/gU5NkCqi
APACHE SPARK
DATA ENGINEER
INTERVIEW QUESTIONS &
ANSWERS
• Spark includes specialized components like Spark SQL for structured data
processing, Spark Streaming for real-time data, MLlib for machine learning, and
GraphX for graph computations.
Why is it Important?
• Speed and Efficiency: Spark's in-memory computation increases processing
speed significantly compared to traditional disk-based engines like Hadoop
MapReduce.
• Integration: It easily integrates with various data sources and storage systems
like Hadoop HDFS, Apache Cassandra, and Amazon S3.
Table of Contents
1. Introduction to Apache Spark
2. Core Spark Concepts and Architecture
3. Spark DataFrames and Datasets
4. Spark SQL
5. Optimization Techniques
6. Spark Streaming
7. Advanced RDD Operations
8. DataFrames, Datasets, and Spark SQL Optimization
9. Scenario-Based Questions
from pyspark.sql import SparkSession, Row
spark = SparkSession.builder.appName("example").getOrCreate()
rdd = spark.sparkContext.parallelize([Row(name="Alice", age=5), Row(name="Bob", age=7)])
df = spark.createDataFrame(rdd)
df.show()
• Answer: Spark SQL allows SQL-style querying of structured data. It’s useful
because it leverages the Catalyst Optimizer for query optimization and
seamlessly integrates SQL and DataFrame/Dataset APIs, making it easy for
developers with SQL experience to work in Spark.
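A minimal sketch of that integration (the table and column names are illustrative):
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()
people_df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])
people_df.createOrReplaceTempView("people")                   # expose the DataFrame to SQL
adults = spark.sql("SELECT name FROM people WHERE age > 40")  # query is optimized by Catalyst
adults.show()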
• Answer: Spark Streaming divides the incoming data into mini-batches, which are
then processed using Spark’s computational model. These batches are processed
sequentially, providing near real-time results by transforming and aggregating
data within each batch.
• Answer: map applies a function to each element and returns a new RDD of the
same length, whereas flatMap may produce multiple elements from each input,
resulting in a flattened structure.
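For example, a quick sketch (assuming the spark session created above):
rdd = spark.sparkContext.parallelize(["hello world", "apache spark"])
mapped = rdd.map(lambda line: line.split(" "))        # 2 elements, each a list of words
flat_mapped = rdd.flatMap(lambda line: line.split(" "))  # 4 elements, one per word
print(mapped.collect())       # [['hello', 'world'], ['apache', 'spark']]
print(flat_mapped.collect())  # ['hello', 'world', 'apache', 'spark']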
• Answer: Use broadcast joins if one DataFrame is small enough to fit in memory.
For large tables, ensure columns used in joins are partitioned and cached, and
use column pruning to avoid processing unnecessary columns.
21: What is the role of the Catalyst Optimizer in Spark SQL, and how does
it enhance query performance?
• Answer: The Catalyst Optimizer is Spark SQL's query optimization engine. It
analyzes logical plans for SQL queries and applies various optimization
techniques, such as predicate pushdown, constant folding, and join reordering.
By generating efficient physical plans, Catalyst reduces query execution time and
resource consumption, leading to enhanced performance.
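You can inspect the plans Catalyst produces with explain(); a quick sketch reusing the df defined earlier:
df.filter(df.age > 6).select("name").explain(True)  # prints the parsed, analyzed, optimized logical, and physical plans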
22: Can you explain the concept of Tungsten in Spark and its significance?
23: How does Spark handle fault tolerance, and what mechanisms are in
place to recover from failures?
• Answer: Spark ensures fault tolerance through lineage information in RDDs. Each
RDD maintains a record of the transformations that created it, forming a lineage
graph. If a partition is lost due to node failure, Spark can recompute the lost data
using this lineage. Additionally, Spark supports checkpointing, where RDDs are
saved to reliable storage, allowing for recovery without recomputation.
24: What are Accumulators and Broadcast Variables in Spark, and how are
they used?
• Answer: Accumulators are write-only shared variables used for aggregating
information across executors, such as counters or sums. They are primarily used
for debugging and monitoring. Broadcast Variables, on the other hand, allow large read-only data to be cached on each executor, reducing communication overhead during tasks like joins with small datasets.
27: What are Watermarks in Structured Streaming, and why are they
important?
• Answer: Watermarks are a mechanism in Structured Streaming to handle late
data. They define a threshold of how late data can arrive and still be processed.
By specifying a watermark, Spark can manage stateful operations efficiently,
discarding state information for data older than the watermark, thus preventing
unbounded state growth.
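A one-line sketch (assuming a streaming DataFrame events with an event-time column "eventTime"):
late_tolerant = events.withWatermark("eventTime", "10 minutes")  # state for data more than 10 minutes late can be dropped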
28: Can you describe the concept of Event Time and Processing Time in the
context of Spark Streaming?
• Answer: Event Time refers to the time when data is generated at the source,
while Processing Time is the time when data is processed by the Spark
application. Handling Event Time is crucial for applications where the order and timing of events matter, especially when dealing with out-of-order or late-arriving data. Structured Streaming provides tools to manage these differences effectively.
30: What are some common sources and sinks supported by Structured
Streaming?
• Answer: Structured Streaming supports various sources and sinks, including:
o Sources: Kafka, Kinesis, File Systems (e.g., HDFS, S3), Socket Streams.
o Sinks: Kafka, Kinesis, File Systems, Console, Foreach (for custom sinks).
31: How does Spark's MLlib differ from traditional machine learning
libraries?
• Answer: MLlib is Spark's scalable machine learning library designed for
distributed computing. Unlike traditional libraries that operate on a single
machine, MLlib leverages Spark's distributed architecture to handle large-scale
data across clusters, providing algorithms for classification, regression,
clustering, and collaborative filtering.
• Answer: Feature Transformers are components in MLlib that process raw data
into features suitable for machine learning models. They include operations like
normalization, one-hot encoding, and feature scaling, which are essential steps
in preparing data for effective model training.
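A hedged sketch combining two common transformers (the input columns "height" and "weight" are illustrative):
from pyspark.ml.feature import VectorAssembler, StandardScaler
assembler = VectorAssembler(inputCols=["height", "weight"], outputCol="raw_features")
assembled_df = assembler.transform(df)
scaler = StandardScaler(inputCol="raw_features", outputCol="features")  # feature scaling
features_df = scaler.fit(assembled_df).transform(assembled_df)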
Question 36: What strategies can be employed to optimize Spark job performance?
• Answer: To optimize Spark job performance, employ strategies like efficient
partitioning, using Kryo serialization, caching frequently used data, tuning shuffle
and memory configurations, and leveraging broadcast variables for small lookup
tables.
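A few of these settings expressed as configuration (the values and the lookup table are illustrative, not recommendations):
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("TunedJob") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.sql.shuffle.partitions", "400") \
    .getOrCreate()
lookup_df = spark.read.parquet("path/to/small_lookup")  # hypothetical small lookup table
lookup_df.cache()                                       # cache data that is reused across stages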
38: What is the significance of Tungsten in Spark, and how does it improve
performance?
• Answer: Tungsten is a Spark initiative aimed at improving the efficiency of
memory and CPU usage. It achieves this through techniques like whole-stage code generation, off-heap memory management, and cache-aware computation, which reduce JVM object overhead and CPU time spent on interpretation.
39: Can you explain the difference between narrow and wide
transformations in Spark?
• Answer: Narrow transformations, like map and filter, operate on a single
partition and do not require data shuffling across the network. Wide
transformations, such as reduceByKey and join, involve shuffling data between
partitions, as they depend on data from multiple partitions. Understanding this
distinction is crucial for optimizing Spark jobs, as wide transformations are more
resource-intensive.
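For example (a sketch on a small pair RDD, assuming the spark session above; map/filter stay within a partition while reduceByKey shuffles):
pairs = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])
narrow = pairs.map(lambda kv: (kv[0], kv[1] * 2)).filter(lambda kv: kv[1] > 2)  # narrow: per-partition work
wide = pairs.reduceByKey(lambda x, y: x + y)  # wide: values for each key are shuffled across partitions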
40: How does Spark handle fault tolerance, and what mechanisms are in
place to recover from failures?
• Answer: Spark ensures fault tolerance through lineage information in RDDs. Each
RDD maintains a record of the transformations that created it, forming a lineage
graph. If a partition is lost due to node failure, Spark can recompute the lost data
using this lineage. Additionally, Spark supports checkpointing, where RDDs are
saved to reliable storage, allowing for recovery without recomputation.
41: What are Accumulators and Broadcast Variables in Spark, and how are
they used?
• Answer: Accumulators are write-only shared variables used for aggregating
information across executors, such as counters or sums. They are primarily used
for debugging and monitoring. Broadcast Variables, on the other hand, allow
large read-only data to be cached on each executor, reducing communication
overhead during tasks like joins with small datasets.
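A minimal sketch of both (assuming the spark session above; the lookup data is illustrative):
sc = spark.sparkContext
error_count = sc.accumulator(0)                                   # write-only counter updated by executors
country_names = sc.broadcast({"US": "United States", "IN": "India"})  # read-only data cached per executor
def expand(code):
    if code not in country_names.value:
        error_count.add(1)                                        # executors add; only the driver reads the value
    return country_names.value.get(code, "Unknown")
expanded = sc.parallelize(["US", "IN", "XX"]).map(expand).collect()
print(error_count.value)                                          # 1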
42: How does Spark Streaming process real-time data, and what are
DStreams?
• Answer: Spark Streaming divides the incoming data into mini-batches, which are
then processed using Spark's computational model. These batches are processed sequentially, providing near real-time results. DStreams (Discretized Streams) are the basic abstraction: a continuous sequence of RDDs, each holding the data received during one batch interval.
43: What is Structured Streaming in Spark, and how does it differ from
traditional Spark Streaming?
• Answer: Structured Streaming is a newer API in Spark that treats streaming data
as an unbounded table, allowing for SQL-like operations and integration with
DataFrames and Datasets. Unlike traditional Spark Streaming, which processes
data in micro-batches, Structured Streaming can achieve lower latency and
provides better fault tolerance and consistency guarantees.
44: Can you explain the concept of Event Time and Processing Time in the
context of Spark Streaming?
• Answer: Event Time refers to the time when data is generated at the source,
while Processing Time is the time when data is processed by the Spark
application. Handling Event Time is crucial for applications where the order and
timing of events matter, especially when dealing with out-of-order or late-
arriving data. Structured Streaming provides tools to manage these differences
effectively.
46: What are some common sources and sinks supported by Structured
Streaming?
• Answer: Structured Streaming supports various sources and sinks, including:
o Sources: Kafka, Kinesis, File Systems (e.g., HDFS, S3), Socket Streams.
o Sinks: Kafka, Kinesis, File Systems, Console, Foreach (for custom sinks).
47: How does Spark's MLlib differ from traditional machine learning
libraries?
• Answer: MLlib is Spark's scalable machine learning library designed for
distributed computing. Unlike traditional libraries that operate on a single
machine, MLlib leverages Spark's distributed architecture to handle large-scale
data across clusters, providing algorithms for classification, regression,
clustering, and collaborative filtering.
My Interview Experience
2. You have multiple JSON files with slightly different structures. How
would you read and process them in Spark without losing any fields?
• I’d rely on schema merging. Spark’s JSON reader infers a unified schema across all the files it reads, so fields present in only some files simply come back as null in the others; for columnar sources such as Parquet, the equivalent is the mergeSchema option.
df = spark.read.json("path/to/json_files")
df.show()
from pyspark.sql.functions import rank
from pyspark.sql.window import Window
window = Window.partitionBy("category").orderBy(df.sales.desc())
top_products = df.withColumn("rank", rank().over(window)).filter("rank <= 5")
• I’d use fillna() to replace null values with default values for aggregations, or use
dropna() to ignore nulls in specific aggregations.
from pyspark.sql.functions import sum as spark_sum
cleaned_df = df.fillna({'column': 0}).groupBy('category').agg(spark_sum('column'))
9. Given a dataset with nested JSON fields, how would you flatten the
structure to make analysis easier?
• I’d use the selectExpr() function to pull out fields from nested structures and
make them top-level columns.
flattened_df = df.selectExpr("nested_field.subfield1 as subfield1",
"nested_field.subfield2 as subfield2")
10.You need to cache a DataFrame for faster access in a Spark job. How
would you decide between cache() and persist()?
• If I only need to reuse the DataFrame in memory, cache() is sufficient. But if I
need memory and disk storage or specific serialization, I’d choose persist() with
an appropriate storage level.
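A brief sketch of the second case:
from pyspark import StorageLevel
df.persist(StorageLevel.MEMORY_AND_DISK)  # keep partitions in memory, spill to disk when memory is insufficient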
11.To detect anomalies in streaming sensor data, how would you set up
the pipeline in Spark Structured Streaming?
• I’d configure Spark Structured Streaming with a sliding window on the data
stream and use statistical methods like Z-score calculations to identify outliers.
from pyspark.sql.functions import window, avg, stddev, max as spark_max, col
windowed = df.groupBy(window("timestamp", "5 minutes")).agg(
    avg("value").alias("avg_value"), stddev("value").alias("stddev_value"), spark_max("value").alias("max_value"))
anomalies = windowed.filter(col("max_value") > col("avg_value") + 3 * col("stddev_value"))  # flag windows whose peak deviates > 3 std devs
12.If you have a dataset with highly repetitive values, how would you
reduce the storage size in Spark?
• I’d use encoding techniques like dictionary encoding or leverage parquet format,
which efficiently compresses repeated values.
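For example, writing with an explicit compression codec (the path is illustrative):
df.write.option("compression", "snappy").parquet("hdfs://path/to/compressed_output")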
14.How would you implement an ETL pipeline in Spark that ingests data
from a database and writes it to HDFS?
• I’d use Spark’s read and write APIs with the JDBC connector for database
ingestion, transforming data as necessary, and then write the output to HDFS.
jdbc_df = spark.read.format("jdbc").options(driver="com.mysql.jdbc.Driver", url=db_url,
dbtable="table").load()
jdbc_df.write.format("parquet").save("hdfs://output_path")
15.How do you ensure that a Spark job doesn’t run out of memory when
processing large datasets?
• I’d tune executor memory, adjust partitioning, and avoid wide transformations
that lead to large shuffles. Using Kryo serialization and caching intermediate
results efficiently also helps.
16.A dataset contains repeated keys, and you need to get the latest
record per key. How would you do this in Spark?
• I’d use row_number() with a window function to rank records by timestamp
within each key, keeping only the latest record.
from pyspark.sql.functions import row_number
from pyspark.sql.window import Window
window = Window.partitionBy("key").orderBy(df.timestamp.desc())
latest_records = df.withColumn("rank", row_number().over(window)).filter("rank = 1")
19.If Spark tasks are taking too long due to data shuffling, what
configuration changes could improve performance?
• I’d make sure spark.shuffle.compress is enabled, raise spark.sql.shuffle.partitions so each shuffle task handles less data, and give executors more memory (e.g., via spark.executor.memory and spark.memory.fraction) so shuffle data spills to disk less often.
20.For a machine learning model pipeline in Spark, how would you handle
categorical variables?
• I’d use StringIndexer and OneHotEncoder for encoding categorical variables,
integrating these transformations into the ML pipeline.
from pyspark.ml.feature import StringIndexer, OneHotEncoder
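A sketch of wiring them into a pipeline (the column names are illustrative):
from pyspark.ml import Pipeline
indexer = StringIndexer(inputCol="category", outputCol="category_index")
encoder = OneHotEncoder(inputCols=["category_index"], outputCols=["category_vec"])
pipeline = Pipeline(stages=[indexer, encoder])
encoded_df = pipeline.fit(df).transform(df)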
24. You have complex JSON data in HDFS that needs transforming into a
structured format. How would you handle this in Spark?
• I’d use explode to expand arrays, selectExpr (or select with dot notation) to pull nested struct fields up to the top level, and explicit schema definitions for each field, converting complex structures into flat DataFrames.
25. You have a Spark job that processes a large dataset and frequently
encounters out-of-memory errors during shuffling. How would you
address this issue?
• Answer: To mitigate out-of-memory errors during shuffling, I would:
o Increase Executor Memory: Allocate more memory to executors by
adjusting the spark.executor.memory configuration.
o Optimize Partitioning: Increase the number of partitions (e.g., with repartition() or by raising spark.sql.shuffle.partitions) so each task shuffles less data; use coalesce() only when reducing the partition count.
o Use Kryo Serialization: Enable Kryo serialization
(spark.serializer=org.apache.spark.serializer.KryoSerializer) for more
efficient data serialization.
26. In a Spark Streaming application, you observe that the processing time
for batches is increasing over time, leading to delays. What steps would
you take to diagnose and resolve this issue?
• Answer: To address increasing batch processing times in Spark Streaming:
o Monitor Batch Processing Times: Use Spark's web UI to track batch
processing durations and identify trends.
o Optimize Batch Interval: Adjust the batch interval to ensure that
processing completes before the next batch arrives.
o Scale Resources: Increase the number of executors or cores to handle the
workload more efficiently.
o Optimize Transformations: Review and optimize transformations to
reduce computational complexity.
o Manage State Efficiently: If using stateful operations, ensure that state
data is managed and pruned appropriately to prevent unbounded
growth.
27. You need to join two large datasets in Spark, but one dataset is
significantly smaller than the other. How would you optimize this join
operation?
• Answer: For joining a large dataset with a significantly smaller one, I would use a
broadcast join. Broadcasting the smaller dataset to all executors allows each
executor to perform the join locally, reducing the need for shuffling.
from pyspark.sql.functions import broadcast
large_df = spark.read.parquet("hdfs://path/to/large_dataset")
small_df = spark.read.parquet("hdfs://path/to/small_dataset")
joined_df = large_df.join(broadcast(small_df), "join_key")
28. During a Spark job, you notice that certain tasks are consistently
slower than others, leading to performance bottlenecks. How would you
identify and address these straggling tasks?
• Answer: To handle straggling tasks:
o Identify Stragglers: Use Spark's web UI to monitor task durations and
identify outliers.
o Data Skew Mitigation: If data skew is causing stragglers, implement
techniques like salting to distribute data more evenly across partitions.
o Speculative Execution: Enable speculative execution
(spark.speculation=true) to re-launch slow tasks on other nodes,
potentially completing them faster.
o Resource Allocation: Ensure that resources are evenly distributed and
that no executor is overloaded.
29. You are tasked with processing a real-time data stream that includes
late-arriving data. How would you handle late data in Spark Structured
Streaming to ensure accurate results?
• Answer: In Spark Structured Streaming, I would handle late-arriving data using
watermarks and windowed aggregations:
o Define Watermark: Specify a watermark to define how late data can
arrive and still be processed.
from pyspark.sql.functions import window
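Continuing the sketch (the streaming DataFrame events and its columns eventTime and deviceId are illustrative):
events_with_watermark = events.withWatermark("eventTime", "10 minutes")
windowed_counts = events_with_watermark.groupBy(window("eventTime", "5 minutes"), "deviceId").count()
Aggregation state for windows older than the watermark is dropped, so rows arriving later than the threshold no longer update results.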
30. Your Spark application requires reading from and writing to a Hive
table. How would you configure Spark to integrate seamlessly with Hive?
• Answer: To integrate Spark with Hive:
o Enable Hive Support: Initialize SparkSession with Hive support.
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName("SparkHiveIntegration") \
.enableHiveSupport() \
.getOrCreate()
o Configure Hive Metastore: Ensure that Spark is configured to connect to
the Hive metastore by setting the appropriate configurations in spark-
defaults.conf or programmatically.
o Access Hive Tables: Use Spark SQL to read from and write to Hive tables.
df = spark.sql("SELECT * FROM hive_database.hive_table")
df.write.mode("overwrite").saveAsTable("hive_database.new_table")
• This setup allows Spark to interact with Hive tables, leveraging Hive's metastore
for schema information.
window_spec = Window.partitionBy("id").orderBy("timestamp").rowsBetween(-2, 0)
o Apply Rolling Average: Use avg() over the window specification.
df_with_rolling_avg = df.withColumn("rolling_avg", avg("value").over(window_spec))
• This calculates the average of the current and previous two rows for each
partition defined by "id," ordered by "timestamp."
32. In a Spark job, you need to write output data to multiple destinations,
such as HDFS and a relational database. How would you implement this
efficiently?
• Answer: To write data to multiple destinations:
o Write to HDFS:
df.write.mode("overwrite").parquet("hdfs://path/to/output")
o Write to Relational Database:
df.write \
.format("jdbc") \
.option("url", "jdbc:mysql://hostname:port/dbname") \
.option("dbtable", "table_name") \
.option("user", "username") \
.option("password", "password") \
.mode("overwrite") \
.save()
• Ensure that the write operations are appropriately managed to prevent resource
contention, possibly by performing them sequentially or allocating sufficient
resources.
df = df.withColumn("exploded_column", explode("nested_array_column"))
o Select Nested Fields: Access nested fields using dot notation and alias
them as top-level columns.
df = df.select(
"top_level_field",
"nested_struct_field.sub_field1",
"nested_struct_field.sub_field2"
)
o Flatten Structs: If there are nested structs, repeat the selection process
to bring all nested fields to the top level.
df = df.select(
"top_level_field",
"sub_field1",
"sub_field2",
"nested_struct_field.sub_field3"
)
• This process transforms nested JSON structures into a flat DataFrame, facilitating
easier analysis.
window_spec = Window.partitionBy("group_column").orderBy("sort_column")
o Apply Window Function: Use row_number() to assign a unique number
to each row within the partition.
df = df.withColumn("row_num", row_number().over(window_spec))
o Filter Rows: If needed, filter rows based on the row number.
df = df.filter(df.row_num <= n)
• This method allows for sorting within each group efficiently.
36. You are tasked with optimizing a Spark job that performs multiple
joins and aggregations. What strategies would you employ to improve
performance?
• Answer: To optimize a Spark job with multiple joins and aggregations:
o Broadcast Joins: Use broadcast joins for small datasets to avoid shuffling.
from pyspark.sql.functions import broadcast
df = large_df.join(broadcast(small_df), "join_key")
o Optimize Aggregations: Use reduceByKey instead of groupByKey to
reduce data shuffling.
rdd = rdd.reduceByKey(lambda x, y: x + y)
o Cache Intermediate Results: Cache DataFrames that are reused multiple
times.
df.cache()
o Repartition Data: Repartition data to balance the workload across
partitions.
df = df.repartition(num_partitions, "partition_key")
37. You need to process a large dataset that doesn't fit into memory. How
would you handle this in Spark?
• Answer: To process a large dataset that exceeds memory capacity:
o Use Disk Storage: Spark automatically spills data to disk when it doesn't
fit into memory. Ensure that there's sufficient disk space and that
spark.local.dir is set to a fast disk.
o Increase Partitions: Increase the number of partitions to reduce the size
of data in each partition.
df = df.repartition(num_partitions)
o Optimize Transformations: Use transformations that minimize memory
usage, such as mapPartitions instead of map.
rdd = rdd.mapPartitions(lambda partition: [process(record) for record in partition])
o Use Efficient Data Formats: Read and write data in efficient formats like
Parquet or ORC to reduce memory overhead.
df = spark.read.parquet("path/to/data")
• These approaches enable Spark to handle large datasets efficiently without
running out of memory.
state_dstream = dstream.updateStateByKey(updateFunc)
o Use Stateful APIs in Structured Streaming: Structured Streaming does not expose mapWithState; arbitrary stateful processing is done with mapGroupsWithState / flatMapGroupsWithState (Scala/Java) or applyInPandasWithState (PySpark), and the engine persists the state to the checkpoint location between micro-batches.
o Use Checkpointing: Enable checkpointing to allow Spark to recover state
after failures.
streamingContext.checkpoint("hdfs://path/to/checkpoint")
• These methods ensure that state is maintained across batches in Spark
Streaming applications.
from pyspark.sql.functions import window, sum as spark_sum
windowed_counts = df.groupBy(
window(df.timestamp, "10 minutes", "5 minutes"),
df.word
).count()
o Perform Aggregation: Apply the desired aggregation function within the
defined window.
windowed_counts = df.groupBy(
window(df.timestamp, "10 minutes", "5 minutes"),
df.word
).agg(spark_sum("count"))
• This approach allows for aggregations over sliding windows in streaming data.
40. In a Spark job, you need to read data from a JDBC source and write it
to HDFS. How would you handle this efficiently?
• Answer: To efficiently read from a JDBC source and write to HDFS:
o Read from JDBC Source: Use Spark's read method with the JDBC format.
jdbc_df = spark.read \
.format("jdbc") \
.option("url", "jdbc:mysql://hostname:port/dbname") \
.option("dbtable", "table_name") \
.option("user", "username") \
.option("password", "password") \
.load()
o Write to HDFS: Write the DataFrame to HDFS in an efficient format like
Parquet.
jdbc_df.write.parquet("hdfs://path/to/output")
41. You are tasked with processing a large graph dataset in Spark. Which
library would you use, and how would you implement a PageRank
algorithm?
• Answer: To process a large graph dataset in Spark, I would use GraphX, Spark's API for graphs and graph-parallel computation. GraphX itself is exposed only in Scala/Java; from PySpark, the GraphFrames package is the usual way to run graph algorithms such as PageRank.
o Load the Graph and Run PageRank: Build vertex and edge DataFrames and call pageRank (a sketch that assumes the external graphframes package is installed; the sample data is illustrative).
from pyspark.sql import SparkSession
from graphframes import GraphFrame
spark = SparkSession.builder.appName("PageRankExample").getOrCreate()
vertices = spark.createDataFrame([("1", "Alice"), ("2", "Bob")], ["id", "name"])
edges = spark.createDataFrame([("1", "2", "follows")], ["src", "dst", "relationship"])
graph = GraphFrame(vertices, edges)
ranks = graph.pageRank(resetProbability=0.15, maxIter=10)
ranks.vertices.select("id", "pagerank").show()
42. In a Spark application, you need to process data from multiple Kafka
topics with different schemas. How would you handle this scenario?
• Answer: To process data from multiple Kafka topics with different schemas in
Spark:
o Read from Multiple Topics: Use Spark Structured Streaming to read from
multiple Kafka topics.
kafka_df = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
.option("subscribe", "topic1,topic2") \
.load()
o Deserialize and Parse Messages: Apply appropriate deserialization and
parsing logic for each topic based on its schema.
from pyspark.sql.functions import col, when
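One way to continue this, splitting the stream by topic and parsing each with its own schema (the schemas and field names are illustrative):
from pyspark.sql.functions import from_json
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
topic1_schema = StructType([StructField("user_id", StringType()), StructField("action", StringType())])
topic2_schema = StructType([StructField("order_id", StringType()), StructField("amount", IntegerType())])
raw = kafka_df.selectExpr("topic", "CAST(value AS STRING) AS json_value")
topic1_df = raw.filter(col("topic") == "topic1").select(from_json("json_value", topic1_schema).alias("data")).select("data.*")
topic2_df = raw.filter(col("topic") == "topic2").select(from_json("json_value", topic2_schema).alias("data")).select("data.*")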
• PySpark does not expose UserDefinedAggregateFunction (that is a Scala/Java API); the idiomatic way to define a custom aggregation in Python is a grouped-aggregate pandas UDF (a sketch; the column names are illustrative).
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

@pandas_udf("long")
def custom_sum(values: pd.Series) -> int:
    # A Series-to-scalar pandas UDF acts as a custom aggregate function.
    return int(values.sum())

spark = SparkSession.builder.appName("CustomAggregation").getOrCreate()
spark.udf.register("custom_sum", custom_sum)   # optional: makes it callable from SQL
# Usage: df.groupBy("key").agg(custom_sum("value"))
44. In a Spark job, you need to handle data with varying schemas arriving
in real-time. How would you design your application to accommodate
this?
• Answer: To handle real-time data with varying schemas in a Spark application:
o Use Schema Inference: File-based streaming sources require an explicit schema unless streaming schema inference is enabled, so turn on spark.sql.streaming.schemaInference before reading.
spark.conf.set("spark.sql.streaming.schemaInference", "true")
df = spark.readStream \
    .format("json") \
    .option("path", "path/to/data") \
    .load()
o Implement Schema Evolution: Use formats like Avro or Parquet that
support schema evolution to handle changes in data structure.
df = spark.readStream \
.format("avro") \
.option("path", "path/to/data") \
.load()
o Apply a Schema Registry: Integrate with a schema registry to manage and retrieve schemas for incoming data. Note that from_avro expects the Avro schema as a JSON string, so the schema is fetched from the registry first (the fetch helper below is hypothetical).
from pyspark.sql.avro.functions import from_avro
avro_schema_json = fetch_schema_from_registry("http://schema-registry:8081", "topic-value")  # hypothetical helper
df = df.withColumn("value", from_avro(df.value, avro_schema_json))
o Use Try-Catch Blocks: Implement error handling to manage records that
do not conform to expected schemas.
from pyspark.sql.functions import col, from_json, schema_of_json
PySpark has no Partitioner base class to subclass; a custom partitioning scheme is expressed as a key-to-partition function passed to rdd.partitionBy (num_partitions is assumed to be defined).
def custom_partitioner(key):
    return hash(key) % num_partitions   # route each key to a deterministic partition
pair_rdd = rdd.map(lambda record: (record["key"], record)).partitionBy(num_partitions, custom_partitioner)
46. In a Spark application, you need to perform a left join between two
large datasets, but one dataset is significantly smaller. How would you
optimize this join operation?
• Answer: To optimize a left join between a large and a significantly smaller
dataset:
o Use Broadcast Join: Broadcast the smaller dataset to all executors to
perform the join locally, reducing shuffling.
from pyspark.sql.functions import broadcast
result_df = large_df.join(broadcast(small_df), "join_key", "left")
47. You are processing a dataset with skewed key distribution, leading to
performance bottlenecks. How would you handle this data skew in Spark?
• Answer: To handle data skew in Spark:
o Salting Technique: Add a random "salt" to the keys to distribute skewed
data across multiple partitions.
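A hedged sketch of two-stage aggregation with salting (the salt range and the columns "key"/"value" are illustrative):
from pyspark.sql.functions import floor, rand, sum as spark_sum
salted = df.withColumn("salt", floor(rand() * 10))                                     # spread each hot key over 10 sub-keys
partial = salted.groupBy("key", "salt").agg(spark_sum("value").alias("partial_sum"))   # first-stage aggregation on salted keys
final = partial.groupBy("key").agg(spark_sum("partial_sum").alias("total"))            # recombine per original key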
o Combine Watermarks with Windowed Aggregations: Define a watermark on the event-time column so Spark can drop aggregation state once data is older than the allowed lateness (a sketch; column names and thresholds are illustrative).
from pyspark.sql.functions import window
late_tolerant_counts = df.withWatermark("timestamp", "10 minutes") \
    .groupBy(window("timestamp", "5 minutes"), "key") \
    .count()
• These methods ensure that late-arriving data is appropriately managed in Spark
Streaming applications.
schema = StructType([
StructField("field1", StringType(), True),
StructField("field2", IntegerType(), True)
])
from pyspark.sql import Row
def complex_type_encoder(field1, field2):
    return Row(field1=field1, field2=field2)   # returns a struct matching the schema above
o Register the Encoder: Register the encoder with Spark.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("CustomSerialization").getOrCreate()
spark.udf.register("complex_type_encoder", complex_type_encoder, schema)
o Use the Encoder: Apply the encoder when processing data.
df = spark.read.json("path/to/data")
encoded_df = df.selectExpr("complex_type_encoder(field1, field2) as complex_field")
• This approach allows for custom serialization of complex data types in Spark.
class CustomBinaryReader(DataSourceReader):
def __init__(self, options):
self.options = options
def readSchema(self):
return StructType([]) # Define the schema
def planInputPartitions(self):
return [CustomBinaryInputPartition(self.options)]
class CustomBinaryInputPartition(InputPartition):
def __init__(self, options):
self.options = options
def createPartitionReader(self):
return CustomBinaryPartitionReader(self.options)
51. You are tasked with processing a large dataset containing user activity
logs stored in JSON format. Each log entry includes nested structures with
arrays and dictionaries. How would you efficiently flatten this nested
JSON structure using PySpark for analysis?
• Answer: To efficiently flatten nested JSON structures in PySpark:
o Read the JSON Data: Load the JSON data into a DataFrame.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("FlattenJSON").getOrCreate()
df = spark.read.json("path/to/json/files")
o Flatten the Nested Structure: Use the select function along with dot
notation to extract nested fields. For arrays, use the explode function to
flatten them.
from pyspark.sql.functions import explode
flattened_df = df.select(
"userId",
"userName",
"activity.timestamp",
"activity.type",
explode("activity.details").alias("detail")
)
window_spec = Window.partitionBy("group_column").orderBy("timestamp_column").rowsBetween(-window_size, 0)
o Calculate the Rolling Average: Apply the avg function over the defined window.
df_with_rolling_avg = df.withColumn("rolling_avg", avg(col("value_column")).over(window_spec))
54. In a PySpark application, you need to read data from a REST API that
returns paginated JSON responses. How would you implement this to
create a DataFrame containing all the data?
• Answer: To read paginated JSON data from a REST API into a PySpark
DataFrame:
o Use Python's requests Library: Fetch data from the API in a loop until all
pages are retrieved.
import requests
import json
all_data = []
url = "https://api.example.com/data"
params = {"page": 1}
while True:
response = requests.get(url, params=params)
data = response.json()
if not data["results"]:
break
all_data.extend(data["results"])
params["page"] += 1
o Create a Spark DataFrame: Convert the collected data into a DataFrame.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("APIData").getOrCreate()
55. You have a PySpark DataFrame with a column containing JSON strings,
and you need to extract specific fields from these JSON strings into
separate columns. How would you achieve this?
• Answer: To extract specific fields from JSON strings into separate columns:
o Use the from_json Function: Parse the JSON strings into a structured
format.
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
json_schema = StructType([
StructField("field1", StringType(), True),
StructField("field2", IntegerType(), True)
])
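Applying the schema (assuming the JSON strings live in a column named "json_col"):
parsed_df = df.withColumn("parsed", from_json(col("json_col"), json_schema))
result_df = parsed_df.select(col("parsed.field1").alias("field1"), col("parsed.field2").alias("field2"))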
57. You are tasked with processing a large dataset containing user activity
logs stored in JSON format. Each log entry includes nested structures with
arrays and dictionaries. How would you efficiently flatten this nested
JSON structure using PySpark for analysis?
• Answer: To efficiently flatten nested JSON structures in PySpark:
o Read the JSON Data: Load the JSON data into a DataFrame.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("FlattenJSON").getOrCreate()
df = spark.read.json("path/to/json/files")
o Flatten the Nested Structure: Use the select function along with dot
notation to extract nested fields. For arrays, use the explode function to
flatten them.
from pyspark.sql.functions import explode
flattened_df = df.select(
"userId",
"userName",
"activity.timestamp",
"activity.type",
explode("activity.details").alias("detail")
)
window_spec = Window.partitionBy("group_column").orderBy("timestamp_column").rowsBetween(-window_size, 0)
o Calculate the Rolling Average: Apply the avg function over the defined window.
df_with_rolling_avg = df.withColumn("rolling_avg", avg(col("value_column")).over(window_spec))
• This approach computes the rolling average for each group over the specified
window size.
all_data = []
url = "https://api.example.com/data"
params = {"page": 1}
while True:
response = requests.get(url, params=params)
data = response.json()
if not data["results"]:
break
all_data.extend(data["results"])
params["page"] += 1
o Create a Spark DataFrame: Convert the collected data into a DataFrame.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("APIData").getOrCreate()
df = spark.read.json(spark.sparkContext.parallelize([json.dumps(all_data)]))
• This method allows for the aggregation of paginated API responses into a single
DataFrame for analysis.
json_schema = StructType([
StructField("field1", StringType(), True),
StructField("field2", IntegerType(), True)
])
63. You are tasked with processing a large dataset containing user activity
logs stored in JSON format. Each log entry includes nested structures with
arrays and dictionaries. How would you efficiently flatten this nested
JSON structure using PySpark for analysis?
• Answer: To efficiently flatten nested JSON structures in PySpark:
o Read the JSON Data: Load the JSON data into a DataFrame.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("FlattenJSON").getOrCreate()
df = spark.read.json("path/to/json/files")
o Flatten the Nested Structure: Use the select function along with dot
notation to extract nested fields. For arrays, use the explode function to
flatten them.
from pyspark.sql.functions import explode
flattened_df = df.select(
"userId",
"userName",
"activity.timestamp",
"activity.type",
explode("activity.details").alias("detail")
)
o Extract Fields from Exploded Columns: If the exploded column contains
further nested structures, continue to select the necessary fields.
window_spec = Window.partitionBy("group_column").orderBy("timestamp_column").rowsBetween(-window_size, 0)
o Calculate the Rolling Average: Apply the avg function over the defined window.
df_with_rolling_avg = df.withColumn("rolling_avg", avg(col("value_column")).over(window_spec))
• This approach computes the rolling average for each group over the specified
window size.
66. In a PySpark application, you need to read data from a REST API that
returns paginated JSON responses. How would you implement this to
create a DataFrame containing all the data?
all_data = []
url = "https://api.example.com/data"
params = {"page": 1}
while True:
response = requests.get(url, params=params)
data = response.json()
if not data["results"]:
break
all_data.extend(data["results"])
params["page"] += 1
o Create a Spark DataFrame: Convert the collected data into a DataFrame.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("APIData").getOrCreate()
df = spark.read.json(spark.sparkContext.parallelize([json.dumps(all_data)]))
• This method allows for the aggregation of paginated API responses into a single
DataFrame for analysis.
67. You have a PySpark DataFrame with a column containing JSON strings,
and you need to extract specific fields from these JSON strings into
separate columns. How would you achieve this?
json_schema = StructType([
StructField("field1", StringType(), True),
StructField("field2", IntegerType(), True)
])
69. You are tasked with processing a large dataset containing user activity
logs stored in JSON format. Each log entry includes nested structures with
arrays and dictionaries. How would you efficiently flatten this nested
JSON structure using PySpark for analysis?
• Answer: To efficiently flatten nested JSON structures in PySpark:
o Read the JSON Data: Load the JSON data into a DataFrame.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("FlattenJSON").getOrCreate()
df = spark.read.json("path/to/json/files")
o Flatten the Nested Structure: Use the select function along with dot
notation to extract nested fields. For arrays, use the explode function to
flatten them.
from pyspark.sql.functions import explode
flattened_df = df.select(
"userId",
"userName",
"activity.timestamp",
"activity.type",
explode("activity.details").alias("detail")
)
o Extract Fields from Exploded Columns: If the exploded column contains
further nested structures, continue to select the necessary fields.
final_df = flattened_df.select(
"userId",
"userName",
window_spec = Window.partitionBy("group_column").orderBy("timestamp_column").rowsBetween(-window_size, 0)
o Calculate the Rolling Average: Apply the avg function over the defined window.
df_with_rolling_avg = df.withColumn("rolling_avg", avg(col("value_column")).over(window_spec))
• This approach computes the rolling average for each group over the specified
window size.
72. In a PySpark application, you need to read data from a REST API that
returns paginated JSON responses. How would you implement this to
create a DataFrame containing all the data?
• Answer: To read paginated JSON data from a REST API into a PySpark
DataFrame:
o Use Python's requests Library: Fetch data from the API in a loop until all
pages are retrieved.
import requests
import json
spark = SparkSession.builder.appName("APIData").getOrCreate()
df = spark.read.json(spark.sparkContext.parallelize([json.dumps(all_data)]))
• This method allows for the aggregation of paginated API responses into a single
DataFrame for analysis.
73. You have a PySpark DataFrame with a column containing JSON strings,
and you need to extract specific fields from these JSON strings into
separate columns. How would you achieve this?
• Answer: To extract specific fields from JSON strings into separate columns:
o Use the from_json Function: Parse the JSON strings into a structured
format.
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
json_schema = StructType([
StructField("field1", StringType(), True),
StructField("field2", IntegerType(), True)
])
FREE RESOURCES
https://www.linkedin.com/posts/ajay026_apachespark-spark-spark-activity-6994141335872028673-ARBl?