spark_notes
The executors accept tasks from the driver, execute those tasks, and return results to the driver.
Executors only persist during the lifetime of an application. An executor is always specific to the
application. It is terminated when the application completes. Executors are launched by the cluster
manager on behalf of the driver.
Spark executors run on worker nodes in the Spark cluster. Each worker node may run one or more
executors depending on the resource availability of that node, so you may have more executors than
workers.
The cluster manager keeps track of executors and, when an executor fails, works together with the
driver to launch a replacement executor and assign the workload of the failed executor to it.
The cluster manager allocates resources to Spark applications and maintains the executor processes in
client mode.
In cluster mode, the cluster manager is located on a node other than the client machine. From there it
starts and ends executor processes on the cluster nodes as required by the Spark application running on
the Spark driver.
Every Spark Application creates one driver and one or more executors at run time. Drivers and executors
are never shared. Every application will have its own dedicated Spark Driver.
The catalog is an interface through which the user may create, drop, alter, or query underlying
databases, tables, functions, etc.
The SparkSession allows you to set JVM runtime parameters, define DataFrames and Datasets, read from
data sources, access catalog metadata, and issue Spark SQL queries. SparkSession provides a single
unified entry point to all of Spark's functionality.
The driver converts your Spark application into one or more Spark jobs. It then transforms each job into
a DAG (Spark’s execution plan), where each node within a DAG could be single or multiple Spark stages.
You can access DAGs through the Spark UI and they can be of great help when optimizing queries
manually. DAGs represent the execution plan in Spark and as such are lazily executed when the driver
requests the data processed in the DAG. DAGs can be decomposed into tasks that are executed in
parallel.
The Spark engine starts new stages after operations called shuffles. A shuffle represents a physical
repartitioning of the data. This type of repartitioning requires coordinating across executors to move
data around.
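As a small illustration (a sketch assuming a DataFrame transactionsDf with a storeId column, as used in later examples), a wide transformation such as a groupBy aggregation forces such a shuffle and therefore starts a new stage:
# groupBy/count requires repartitioning the data by storeId, so the physical plan
# typically contains an Exchange (shuffle) node and Spark starts a new stage there.
transactionsDf.groupBy("storeId").count().explain()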
The Spark driver has multiple roles: it communicates with the cluster manager; it requests resources
(CPU, memory, etc.) from the cluster manager for Spark’s executors (JVMs); and it transforms all the
Spark operations into DAG computations, schedules them, and distributes their execution as tasks
across the Spark executors. Once the resources are allocated, it communicates directly with the
executors.
Logical planning applies a standard rule-based optimization approach to construct a set of multiple plans
and then uses a cost-based optimizer (CBO) to assign a cost to each plan. These plans are laid out as
operator trees. The optimizations may include, for example, constant folding, predicate pushdown,
projection pruning, Boolean expression simplification, etc. The logical plan is the input into the physical
plan.
The Catalyst optimizer takes a computational query and converts it into an execution plan. It goes
through four transformational phases, as shown below.
1. Analysis
2. Logical optimization: the first set of rule-based optimizations is applied in this step.
3. Physical planning: the Catalyst optimizer generates one or more physical plans.
4. Code generation (Project Tungsten)
https://spark.apache.org/docs/latest/sql-performance-tuning.html#adaptive-query-execution
1. Reduce the number of reducers in the shuffle stage by decreasing the number of shuffle
partitions. AQE allows you to dynamically coalesce shuffle partitions.
2. Optimize the physical execution plan of the query, for example by converting a SortMergeJoin
into a BroadcastHashJoin where appropriate. - AQE allows you to dynamically switch join
strategies
3. Handle data skew during a join.
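A minimal sketch of enabling AQE and the features above on an existing SparkSession named spark (these configuration keys come from the Spark documentation linked above):
spark.conf.set("spark.sql.adaptive.enabled", "true")                       # turn on AQE
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")    # dynamically coalesce shuffle partitions
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")              # handle data skew during joins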
The cluster manager receives input from the driver through the SparkContext.
Deployment modes often refer to ways that Spark can be deployed in cluster mode and how it uses
specific frameworks outside Spark. YARN and Mesos modes are two deployment modes. These modes
allow Spark to run alongside other frameworks on a cluster.
When Spark is run in standalone mode, only the Spark framework can run on the cluster. Standalone
mode uses only a single executor per worker per application.
client mode
cluster mode
• a user submits a pre-compiled JAR, Python script, or R script to a cluster manager. The cluster
manager then launches the driver process on a worker node inside the cluster, in addition to the
executor processes.
local mode
• all Spark infrastructure is started in a single JVM (Java Virtual Machine) in a single computer
which then also includes the driver.
• Local mode uses a single JVM to run Spark driver and executor processes.
Dataset API
Tasks in a stage may be executed by multiple machines at the same time. Within a single stage, tasks do
not depend on each other. Executors on multiple machines may execute tasks belonging to the same
stage on the respective partitions they are holding at the same time.
• Execution memory: used for computation in shuffles, joins, sorts, and aggregations.
• Storage memory: used for caching partitions derived from DataFrames.
garbage collection
Spark’s garbage collection runs faster on fewer big objects than many small objects.
Optimizing garbage collection performance in Spark may limit caching ability. A full garbage collection
run slows down a Spark application. When talking about "tuning" garbage collection, we mean reducing
the amount or duration of these slowdowns.
Garbage collection information can be accessed in the Spark UI's stage detail view.
In Spark, using the G1 garbage collector (not enabled by default) is an alternative to using the default
Parallel garbage collector.
• The maximum number of tasks that an executor can process at the same time is controlled by
the spark.task.cpus and spark.executor.cores properties.
• The default value for spark.sql.autoBroadcastJoinThreshold is 10MB.
• The default number of partitions to use when shuffling data for joins or aggregations is 200.
• The default number of partitions returned from certain transformations can be controlled by the
spark.default.parallelism property.
spark.conf.set("spark.sql.shuffle.partitions", 100) => set the number of partitions that Spark uses when
shuffling data for joins or aggregations to 100
spark.memory.fraction is the fraction of the JVM heap (minus a reserved 300 MB) used for execution and
storage; the execution portion covers computation in shuffles, joins, sorts, and aggregations.
• The location for the Spark Managed Tables data is configured using spark.sql.warehouse.dir
configuration.
• Spark supports creating managed and unmanaged tables using Spark SQL and DataFrame APIs.
• Spark only manages the metadata for unmanaged tables, while you manage the data yourself.
• If you drop an unmanaged table, Spark will only drop the metadata and does not touch the actual
data.
• If you drop a managed table, Spark will drop the metadata and also delete the actual data.
df.write.option("path", "/data/").saveAsTable("my_unmanaged_table")
As soon as you add the 'path' option to the DataFrameWriter, the table is treated as an
external/unmanaged table.
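A short sketch of the contrast (table names and path are illustrative, assuming an existing DataFrame df):
df.write.saveAsTable("my_managed_table")                              # managed: Spark owns both metadata and data
df.write.option("path", "/data/").saveAsTable("my_unmanaged_table")   # unmanaged/external: Spark owns only the metadata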
Actions
Actions trigger the distributed execution of tasks on executors which, upon task completion, transfer
result data back to the driver.
Actions can trigger Adaptive Query Execution, while transformations cannot. Adaptive Query Execution
optimizes queries at runtime. Since transformations are evaluated lazily, Spark does not have any
runtime information to optimize the query until an action is called. If Adaptive Query Execution is
enabled, Spark will then try to optimize the query based on the feedback it gathers while it is evaluating
the query.
Lazy evaluation:
Accumulator
Accumulator values can only be read by the driver, not by executors; this is a known limitation when
using accumulators.
• Accumulators provide a shared, mutable variable that a Spark cluster can safely update on a per-
row basis.
• The Spark UI does not display all accumulators used by your application; you need to name an
accumulator in order to see it in the Spark UI.
• By extending org.apache.spark.util.AccumulatorV2 in Java/Scala or pyspark.AccumulatorParam
in Python, you can define your own custom accumulator class.
• For accumulator updates performed inside actions only, Spark guarantees that each task’s
update to the accumulator will be applied only once, meaning that restarted tasks will not
update the value. If an action including an accumulator fails during execution and Spark
manages to restart the action and complete it successfully, only the successful attempt will be
counted in the accumulator.
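A minimal sketch of an accumulator updated inside an action (the column name value is an assumption for illustration):
bad_records = spark.sparkContext.accumulator(0)   # numeric accumulator, starts at 0

def count_missing(row):
    if row["value"] is None:      # 'value' is an assumed column
        bad_records.add(1)        # executors can only add to the accumulator

transactionsDf.foreach(count_missing)   # action: updates are applied once per successful task
print(bad_records.value)                # only the driver can read the accumulated value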
DDL expressions
You must use enableHiveSupport() while creating your Spark session. Without Hive support, you cannot
run Spark DDL expressions. DDL expressions create Spark database objects that are stored in the
metastore. Spark uses the Hive metastore by default, so you must enable Hive support and include the
Hive dependencies to use DDL expressions.
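A sketch of creating a session with Hive support (app and database names are illustrative):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("ddl-example")
         .enableHiveSupport()      # enables the Hive metastore so DDL objects persist
         .getOrCreate())
spark.sql("CREATE DATABASE IF NOT EXISTS my_db")   # example DDL statement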
spark.sql("CACHE LAZY TABLE MY_TABLE") => LAZY cache will cache the table on its first use.
To materialize the DataFrame in cache, you need to call an action (and also you need to be using all
partitions with that action otherwise it will only cache some partitions).
print(table_list) => print the list of Spark tables in the current database (table_list here is presumably the
result of spark.catalog.listTables()).
1. Using the DROP VIEW Spark SQL Expression; spark.sql("DROP VIEW IF EXISTS my_view")
2. Using the spark.catalog.dropTempView(); spark.catalog.dropTempView("my_view")
Spark UI
• Jobs: DAGs
• Stages: DAGs
• Storage: information about your cached DataFrames
• Environment: configuration properties such as spark.executor.memory
• Executors
• SQL: DAGs
OOM
To process a partition, this partition needs to be loaded into the memory of an executor. If you imagine
that every core in every executor processes a partition, potentially in parallel with other executors, you
can imagine that memory on the machine hosting the executors fills up quite quickly. So, memory usage
of executors is a concern, especially when multiple partitions are processed at the same time. To strike a
balance between performance and memory usage, decreasing the number of cores may help against
out-of-memory errors.
When using commands like collect() that trigger the transmission of potentially large amounts of data
from the cluster to the driver, the driver may experience out-of-memory errors. One strategy to avoid
this is to be careful about using commands like collect() that send back large amounts of data to the
driver. Another strategy is setting the parameter spark.driver.maxResultSize. If data to be transmitted to
the driver exceeds the threshold specified by the parameter, Spark will abort the job and therefore
prevent an out-of-memory error.
Broadcast variables:
A broadcast operation may look like a simple fan-out from the driver to every executor, with the driver as
the bottleneck, but in practice Spark distributes broadcast variables with a torrent-like protocol, so
executors also exchange the broadcast data among themselves.
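A minimal, self-contained sketch of a broadcast variable (the lookup data is illustrative): the read-only value is shipped to each executor once instead of with every task.
lookup = spark.sparkContext.broadcast({"A": 1, "B": 2})          # read-only copy cached on each executor
rdd = spark.sparkContext.parallelize(["A", "B", "C"])
print(rdd.map(lambda k: lookup.value.get(k, 0)).collect())       # tasks read lookup.value locally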
A viable way to improve Spark's performance when dealing with large amounts of data, given that there
is only a single application running on the cluster:
With only a single application running on the cluster, enabling dynamic allocation would not yield a
performance benefit.
During a shuffle, data is compared between partitions because shuffling includes sorting. For sorting,
data needs to be compared. Since, by definition, more than one partition is involved in a shuffle, data is
compared across partitions.
In a shuffle, Spark writes data to disk. Spark's architecture dictates that intermediate results during a
shuffle are written to disk.
The following configurations are related to enabling dynamic adjustment of the resources for your
application based on the workload (see the sketch after this list):
• spark.dynamicAllocation.shuffleTracking.enabled
• spark.dynamicAllocation.enabled
• spark.shuffle.service.enabled
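A sketch of setting these at session build time (with shuffle tracking enabled, an external shuffle service is not strictly required):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.dynamicAllocation.enabled", "true")
         .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
         .getOrCreate())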
• The high-level DataFrame API is built on top of the low-level RDD API.
• RDDs are immutable.
• RDD stands for Resilient Distributed Dataset.
• RDDs are great for precisely instructing Spark on how to do a query.
joinType = "inner"
joinExpr = df1.BatchID == df2.BatchID
df1.join(df2, joinExpr, joinType)
If df2 is very small, we can apply a broadcast hint to it (requires from pyspark.sql.functions import broadcast):
df1.join(broadcast(df2), joinExpr, joinType)
Three ways to resolve the ambiguous BatchID column after the join:
1. Use the DataFrame to reference the column name:
joinType = "inner"
joinExpr = df1.BatchID == df2.BatchID
df1.join(df2, joinExpr, joinType).select(df1.BatchID, df1.Year).show()
2. Drop one of the two ambiguous columns after the join:
df1.join(df2, joinExpr, joinType).drop(df2.BatchID).select("BatchID", "Year").show()
3. Use the common column name as the join expression so Spark automatically removes one of the
ambiguous columns after the join:
joinExpr = "BatchID"
df1.join(df2, joinExpr, joinType).select("BatchID", "Year").show()
Semi Join: A semi join returns values from the left side of the relation that have a match with the right. It
is also referred to as a left semi join. Use LEFT SEMI JOIN if you want each matching record from the
left-hand side table to appear only once, no matter how many matches it has on the right-hand side.
Anti Join: An anti join returns values from the left relation that have no match with the right. It is also
referred to as a left anti join.
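Reusing df1 and df2 from the join example above, a sketch of both joins:
df1.join(df2, df1.BatchID == df2.BatchID, "left_semi").show()   # left rows that have a match, listed once
df1.join(df2, df1.BatchID == df2.BatchID, "left_anti").show()   # left rows with no match on the right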
transactionsDf.dropna(thresh=4) removes all rows in the 6-column DataFrame transactionsDf that have
missing data in at least 3 columns. thresh defines the minimum number of columns that must contain
data for the row not to be dropped (thresh=1 says "keep the row if at least one column is not null").
Keeping only rows with at most 2 missing values out of 6 columns means requiring 6 - 2 = 4 non-null
values, hence thresh=4.
transactionsDf.dropna("any"): remove all rows that have at least one missing value.
df.na.drop() or df.na.drop("any") deletes all rows that have any null column.
Note that the storage level MEMORY_ONLY means that all partitions that do not fit into memory will be
recomputed when they are needed.
transactionsDf.cache()
The default storage level of DataFrame.cache() is MEMORY_AND_DISK (named MEMORY_AND_DISK_DESER
in later Spark 3.x releases), meaning that partitions that do not fit into memory are stored on disk. cache()
does not accept any arguments.
DataFrame.cache() is evaluated like a transformation, i.e., lazily. This means that calling
DataFrame.cache() has no effect until you call a subsequent action, for example
DataFrame.cache().count().
transactionsDf.persist()
When thinking about fault tolerance and storage levels, you would want to store redundant copies of the
dataset. This can be achieved by using a storage level such as StorageLevel.MEMORY_AND_DISK_2.
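A sketch of persisting with a replicated storage level:
from pyspark import StorageLevel

transactionsDf.persist(StorageLevel.MEMORY_AND_DISK_2)   # two replicas of each partition
transactionsDf.count()                                   # an action materializes the persisted data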
transactionsDf sorted in descending order by column predError, showing missing values last:
transactionsDf.sort("predError", ascending=False)
transactionsDf.sort(desc_nulls_last("predError"))
df.orderBy(df.a.desc_nulls_first()) == df.sort(desc_nulls_first("a"))
spark.read.schema(fileSchema).format("parquet").load(filePath)
transactionsDf.write.format("parquet").mode("overwrite").option("compression", "snappy").save(storeDir)
The DataFrameWriter's default format is Parquet. To save a DataFrame in Parquet format, call either the
parquet() or the save() method.
spark.read.format("binaryFile").option("pathGlobFilter", "*.png").load(path)
DataFrameReader.format(...).option("key", "value").schema(...).load()
There are 3 typical read modes and the default read mode is permissive.
• permissive — All fields are set to null and corrupted records are placed in a string column called
_corrupt_record
• dropMalformed — Drops all rows containing corrupt records.
• failFast — Fails when corrupt records are encountered.
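A sketch of picking one of the read modes above explicitly (file path and options are illustrative):
df = (spark.read.format("csv")
      .option("header", "true")
      .option("mode", "DROPMALFORMED")   # silently drop corrupt records
      .load("/data/input.csv"))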
DataFrameWriter.format(...).option(...).partitionBy(...).bucketBy(...).sortBy(...).save()
There are 4 typical save modes (append, overwrite, errorIfExists, ignore) and the default mode is errorIfExists.
+----+---+------+---------------+
|name|age|salary|name,age,salary|
+----+---+------+---------------+
When you write a text file, you need to be sure to have only one string column; otherwise, the write will
fail
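A sketch of collapsing several columns into a single string column before writing text (output path is illustrative, assuming df has name, age, and salary columns as shown above):
from pyspark.sql.functions import concat_ws

(df.select(concat_ws(",", "name", "age", "salary").alias("value"))   # one string column
   .write.mode("overwrite").text("/data/out_text"))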
json_schema = """
{
  "type": "struct",
  "fields": [
    {"name": "itemId", "type": "integer", "nullable": true, "metadata": {}},
    {"name": "supplier", "type": "string", "nullable": true, "metadata": {}}
  ]
}
"""
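A sketch of turning the JSON schema string above into a StructType and using it with the reader (the file path is illustrative):
import json
from pyspark.sql.types import StructType

fileSchema = StructType.fromJson(json.loads(json_schema))
df = spark.read.schema(fileSchema).format("parquet").load("/data/items")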
df.selectExpr("name", "case when (salary < 5000) then salary * 0.20 else 0 end as increment")
df.select(expr("number + 10"))
The select() method does not accept SQL expression strings. You must use selectExpr() or wrap the
expression in expr().
df.selectExpr("*", expr("salary * 0.15").alias("salary_increment")) => wrong: selectExpr() accepts only
column names or SQL expression strings; you cannot use the expr() method inside selectExpr().
The isnull() function takes a single argument and returns true if expr is null, or false otherwise. You do
not need lit() in Spark SQL. In fact, we do not have a lit() function in Spark SQL because it is not needed.
transactionsDf.groupBy("storeId").agg(avg("value"))
You can use the countDistinct() for counting unique records. But remember, this function is not available
in Spark SQL.
df.select("InvoiceNo").distinct().agg(count("InvoiceNo"))
df.selectExpr("count(distinct InvoiceNo)")
df.select(countDistinct("InvoiceNo"))
Spark SQL provides functions such as INT(), DOUBLE(), DATE(), etc. to cast a value:
df.select("name", expr("INT(age)"), expr("DOUBLE(salary)"))
Do NOT forget that the expected result requires you to give a column alias:
df.groupBy("department", "country").agg(expr("count(*) as NumEmployee"), expr("sum(salary) as
TotalSalary")).show()
df.groupBy("Year").pivot("CourseName").agg(expr("sum(Students)"))
• df1 = df.na.fill("Unknown")
• df1 = df.fillna("Unknown")
Returns about 150 randomly selected rows from the 1000-row DataFrame transactionsDf, assuming that
any row can appear more than once: transactionsDf.sample(True, 0.15, 8261)
withReplacement (default: False): we have to enable it here, since the question asks for a row being able
to appear more than once.
seed: a random seed makes a randomized process repeatable. This means that if you re-run the same
sample() operation with the same random seed, you get the same rows returned from the sample()
command.
A Python UDF runs in a Python process on the worker node. Spark must serialize the data and send it
from the JVM process to the Python process, execute the function row by row on that data in the Python
process, and then finally return the results of the row operations to the JVM and Spark. This transfer of
data and results is extra work that causes some performance overhead.
When we create a Spark UDF in Python, it does NOT run in the executor JVM.
1. func_name_udf = udf(func_name):
registers the UDF as a DataFrame function so you can use it in your DataFrame expressions as a
function, but you cannot use it in a string expression.
power3_udf = udf(power3)
df.select(power3_udf(col("num"))).show()
2. spark.udf.register("func_name_udf", func_name):
registers the UDF as a SQL function, so you can use it in string expressions.
spark.udf.register("power3_udf", power3)
df.selectExpr("power3_udf(num)").show()
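Putting both registration paths together, a minimal runnable sketch (power3 and the num column are assumptions for illustration):
from pyspark.sql.functions import udf, col

def power3(n):
    return n * n * n

power3_udf = udf(power3)                   # DataFrame-expression usage
spark.udf.register("power3_sql", power3)   # SQL / string-expression usage

df.select(power3_udf(col("num"))).show()
df.selectExpr("power3_sql(num)").show()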
Four ways to reference a column in the DataFrame API:
1. "colName"
2. col("colName")
3. column("colName")
4. df["colName"]
Adaptive Query Execution (AQE): An optimization technique in Spark SQL that makes use of the runtime
statistics to choose the most efficient query execution plan
AQE converts a sort-merge join to a broadcast hash join when the runtime statistics of either join side are
smaller than the broadcast hash join threshold; with spark.sql.adaptive.localShuffleReader.enabled set to
true, Spark also tries to use the local shuffle reader to read the shuffle data after such a conversion.
spark.sql.adaptive.skewJoin.skewedPartitionFactor: a partition is considered skewed if its size is larger
than this factor multiplied by the median partition size and also larger than
spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes.
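A sketch of enabling skew-join handling and tuning its thresholds (the values shown are the documented defaults):
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")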
transactionsDf.dropDuplicates(subset=["productId"])
Spark Session
Subsumes previous entry points to Spark such as the SparkContext, SQLContext, HiveContext, and
StreamingContext.
Setting spark.speculation to true will re-launch one or more tasks if they are running slowly in a stage
(it does not kill the slow-running task and restart it on another node). Apache Spark has the 'speculative
execution' feature to handle tasks that run slowly in a stage due to environmental issues like a slow
network, disk, etc. If one task is running slowly in a stage, the Spark driver can launch a speculation task
for it on a different host. Between the regular task and its speculative copy, Spark will later take the
result from the first successfully completed task and kill the slower one.
You noticed that a full garbage collection is invoked multiple times before a task completes. So you
concluded that there isn’t enough memory available for executing tasks. How can you address the task
memory problem? Decrease the spark.memory.fraction so you reduce the amount of memory that
Spark uses for caching.
The default date format for reading CSV and JSON data files is yyyy-MM-dd.
• By default, UDFs are registered as temporary functions to be used in that specific SparkSession.
• UDFs can take one or more columns as input and return a value.
• UDFs are just functions that operate on the data, record by record.
Spark Session:
• In a standalone Spark application, you can create a SparkSession using one of the high-level APIs
in the programming language of your choice.
• A SparkSession is created for you in spark-shell, and you can access it via a global variable called
spark.
You noticed too many minor collections but not many major garbage collections. How can you address
this problem? Allocating more memory for Eden might help.
Stage: A unit of work that is executed as a sequence of tasks in parallel without a shuffle.
What is the use of the coalesce(expr*) function in Spark SQL? It returns the first non-null argument if one
exists; otherwise, null.
Explicit caching can decrease application performance by interfering with the Catalyst optimizer's
ability to optimize some queries.
from pyspark.sql.window import Window
from pyspark.sql.functions import dense_rank
windowSpec = Window.partitionBy("department").orderBy("salary")
df.withColumn("dense_rank", dense_rank().over(windowSpec)).show()
The driver is the machine in which the application runs. It is responsible for three main things: 1)
Maintaining information about the Spark Application, 2) Responding to the user’s program, 3) Analyzing,
distributing, and scheduling work across the executors.