
Spark Notes

The executors accept tasks from the driver, execute those tasks, and return results to the driver.
Executors only persist during the lifetime of an application. An executor is always specific to the
application. It is terminated when the application completes. Executors are launched by the cluster
manager on behalf of the driver.

Spark executors run on worker nodes in the Spark cluster. Each worker node may run one or more
executors depending on the resource availability on the worker node. You may have more executors
than workers.

The cluster manager keeps track of executors and will work together with the driver to launch a
replacement executor and assign the workload of a failed executor to it.

The cluster manager allocates resources to Spark applications and maintains the executor processes in
client mode.

In cluster mode, the cluster manager is located on a node other than the client machine. From there, it
starts and stops executor processes on the cluster nodes as required by the Spark application running on
the Spark driver.

Every Spark Application creates one driver and one or more executors at run time. Drivers and executors
are never shared. Every application will have its own dedicated Spark Driver.

The catalog is an interface through which the user may create, drop, alter, or query underlying
databases, tables, functions, etc.

Spark Session allows you to create JVM runtime parameters, define DataFrames and Datasets, read from
data sources, access catalog metadata, and issue Spark SQL queries. SparkSession provides a single
unified entry point to all of Spark’s functionality.
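A minimal sketch of creating a SparkSession and reaching the catalog through it (the app name is illustrative):

from pyspark.sql import SparkSession

# Single unified entry point; the builder reuses an existing session if one is already running
spark = (SparkSession.builder
         .appName("spark-notes-demo")   # illustrative name
         .getOrCreate())

# The catalog interface hangs off the session
print(spark.catalog.listDatabases())
print(spark.catalog.listTables())

# Spark SQL queries are issued through the same session
spark.sql("SELECT 1 AS id").show()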

DAG “Directed Acyclic Graph”

The driver converts your Spark application into one or more Spark jobs. It then transforms each job into
a DAG (Spark’s execution plan), where each node within a DAG could be single or multiple Spark stages.

You can access DAGs through the Spark UI and they can be of great help when optimizing queries
manually. DAGs represent the execution plan in Spark and as such are lazily executed when the driver
requests the data processed in the DAG. DAGs can be decomposed into tasks that are executed in
parallel.

The Spark engine starts new stages after operations called shuffles. A shuffle represents a physical
repartitioning of the data. This type of repartitioning requires coordinating across executors to move
data around.

The Spark driver has multiple roles: it communicates with the cluster manager; it requests resources
(CPU, memory, etc.) from the cluster manager for Spark’s executors (JVMs); and it transforms all the
Spark operations into DAG computations, schedules them, and distributes their execution as tasks
across the Spark executors. Once the resources are allocated, it communicates directly with the
executors.

• Each Spark application may run as a series of Spark Jobs.


• Spark Job is internally represented as a DAG of stages.

Logical planning applies a standard rule-based optimization approach to construct a set of multiple plans
and then uses a cost-based optimizer (CBO) to assign costs to each plan. These plans are laid out as
operator trees. They may include, for example, constant folding, predicate pushdown, projection pruning,
Boolean expression simplification, etc. This logical plan is the input into the physical plan.

The Catalyst optimizer takes a computational query and converts it into an execution plan. It goes
through four transformational phases, as shown below.

1. Analysis
2. Logical optimization: First set of optimizations takes place in step logical optimization.
3. Physical planning: Catalyst optimizer generates one or more physical plans
4. Code generation (Project Tungsten)
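A quick way to inspect the output of these phases is DataFrame.explain(); a minimal sketch, with an illustrative DataFrame:

# Illustrative DataFrame
df = spark.range(10).withColumnRenamed("id", "value")

# extended=True prints the parsed and analyzed logical plans, the optimized logical plan,
# and the selected physical plan
df.filter("value > 5").explain(extended=True)

# Spark 3.0+ also supports a more readable output mode
df.filter("value > 5").explain(mode="formatted")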

optimizations enabled by adaptive query execution (AQE):

• AQE does NOT perform dynamic partition pruning


• AQE applies coalescing post-shuffle partitions
• AQE attempts to convert sort-merge join to broadcast join
• AQE performs skew join optimization
• AQE reoptimizes query plans based on runtime statistics collected during query execution.

https://spark.apache.org/docs/latest/sql-performance-tuning.html#adaptive-query-execution

AQE attempts to do the following at runtime:

1. Reduce the number of reducers in the shuffle stage by decreasing the number of shuffle
partitions. AQE allows you to dynamically coalesce shuffle partitions.
2. Optimize the physical execution plan of the query, for example by converting a SortMergeJoin
into a BroadcastHashJoin where appropriate. - AQE allows you to dynamically switch join
strategies
3. Handle data skew during a join.
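A hedged configuration sketch for the AQE behaviors listed above (property names are from Spark 3.x; defaults vary by version, and AQE is on by default from Spark 3.2):

spark.conf.set("spark.sql.adaptive.enabled", "true")                      # enable AQE
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")   # coalesce shuffle partitions
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")             # skew join optimization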

Spark’s Catalyst optimizer lets you:

1. Dynamically convert physical plans to RDDs.
2. Dynamically reorganize query orders.
3. Dynamically select physical plans based on cost.

The main considerations in tuning memory usage in a Spark application:

• Check the overhead of garbage collection


• Check the amount of memory used by your objects

Spark’s Job Scheduler

• By default, Spark’s scheduler runs jobs in a FIFO fashion


• FAIR (spark.scheduler.mode) scheduler assigns tasks between jobs in a “round robin” fashion

The cluster manager receives input from the driver through the SparkContext.

Deployment modes often refer to ways that Spark can be deployed in cluster mode and how it uses
specific frameworks outside Spark. YARN and Mesos modes are two deployment modes. These modes
allow Spark to run alongside other frameworks on a cluster.

When Spark is run in standalone mode, only the Spark framework can run on the cluster. Standalone
mode uses only a single executor per worker per application.

Execution modes: Client, Cluster, Local
Deployment modes: Standalone, Apache YARN, Apache Mesos, and Kubernetes

client mode

• the driver sits on a machine outside the cluster.


• DataFrames cached with the MEMORY_ONLY_2 level will not be stored in the edge node's
memory. In fact, Spark does not have a provision to cache DataFrames in the driver (which sits
on the edge node in client mode). Spark caches DataFrames in the executors' memory.
• gateway machines are located outside the cluster.
• the cluster manager is independent of the edge node and runs in the cluster.

cluster mode

• the driver sits on a machine inside the cluster (worker nodes).


• on YARN, the driver process runs inside an application master process.


• a user submits a pre-compiled JAR, Python script, or R script to a cluster manager. The cluster
manager then launches the driver process on a worker node inside the cluster, in addition to the
executor processes.

local mode

• all Spark infrastructure is started in a single JVM (Java Virtual Machine) in a single computer
which then also includes the driver.
• Local mode uses a single JVM to run Spark driver and executor processes.

Dataset API

• The Dataset API is available in Scala, but it is not available in Python.


• The Dataset API provides compile-time type safety.
• Dataset API supports structured and unstructured data.
• The Dataset API uses fixed typing and is typically used for object-oriented programming.

Tasks in a stage may be executed by multiple machines at the same time. Within a single stage, tasks do
not depend on each other. Executors on multiple machines may execute tasks belonging to the same
stage on the respective partitions they are holding at the same time.

Spark’s memory usage categories:

• Execution
• Storage. Storage memory is used for caching partitions derived from DataFrames.

garbage collection

Spark’s garbage collection runs faster on fewer big objects than many small objects.

strategies in order to decrease garbage collection time:

• Use Structured APIs and create fewer objects


• Increase the Java Heap Size
• Persist objects in serialized form

Serialized caching is a strategy to increase the performance of garbage collection.

Optimizing garbage collection performance in Spark may limit caching ability. A full garbage collection
run slows down a Spark application. When talking about "tuning" garbage collection, we mean reducing
the amount or duration of these slowdowns.


Garbage collection information can be accessed in the Spark UI's stage detail view.

In Spark, using the G1 garbage collector (not enabled by default) is an alternative to using the default
Parallel garbage collector.

Spark’s configuration properties:

• The maximum number of tasks that an executor can process at the same time is controlled by
the spark.task.cpus and spark.executor.cores property
• The default value for spark.sql.autoBroadcastJoinThreshold is 10MB.
• The default number of partitions to use when shuffling data for joins or aggregations is 200.
• The default number of partitions returned from certain transformations can be controlled by the
spark.default.parallelism property.

spark.conf.set("spark.sql.shuffle.partitions", 100) => set the number of partitions that Spark uses when
shuffling data for joins or aggregations to 100

spark.memory.fraction is the percentage of memory used for computation in shuffles, joins, sorts, and
aggregations.

Spark Managed and Unmanaged Tables

• The location for the Spark Managed Tables data is configured using spark.sql.warehouse.dir
configuration.
• Spark supports creating managed and unmanaged tables using Spark SQL and DataFrame APIs.
• Spark only manages the metadata for unmanaged tables, while you manage the data yourself.
• If you drop an unmanaged table, Spark will only drop the metadata and will not touch the actual
data.
• If you drop a managed table, Spark will drop the metadata and also delete the actual data.

df.write.option("path", "/data/").saveAsTable("my_unmanaged_table")

As soon as you add the 'path' option in the DataFrame writer, the table is treated as an
external/unmanaged table.
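A minimal sketch contrasting the two table types (table names and the /data/ path are illustrative; df is an assumed DataFrame):

# Managed table: Spark owns both metadata and data (stored under spark.sql.warehouse.dir)
df.write.saveAsTable("my_managed_table")

# Unmanaged (external) table: the 'path' option means Spark manages only the metadata
df.write.option("path", "/data/my_table/").saveAsTable("my_unmanaged_table")

spark.sql("DROP TABLE my_managed_table")    # deletes metadata AND data
spark.sql("DROP TABLE my_unmanaged_table")  # deletes metadata only; files under /data/my_table/ remain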

Actions

Actions trigger the distributed execution of tasks on executors which, upon task completion, transfer
result data back to the driver.

Actions can trigger Adaptive Query Execution, while transformation cannot. Adaptive Query Execution
optimizes queries at runtime. Since transformations are evaluated lazily, Spark does not have any
runtime information to optimize the query until an action is called. If Adaptive Query Execution is
enabled, Spark will then try to optimize the query based on the feedback it gathers while it is evaluating
the query.

Lazy evaluation:

• Predicate pushdown is a feature resulting from lazy evaluation.


• Spark will fail a job only during execution, but not during definition.
• Accumulators do not change the lazy evaluation model of Spark.
• Lineages allow Spark to coalesce transformations into stages.

Accumulator

Accumulator values can only be read by the driver, not by the executors (a known limitation when using
accumulators).

• Accumulators provide a shared, mutable variable that a Spark cluster can safely update on a per-
row basis.
• The Spark UI doesn't display all accumulators used by your application. You need to name the
accumulator in order to see it in the Spark UI.
• By extending org.apache.spark.util.AccumulatorV2 in Java/Scala or pyspark.AccumulatorParam
in Python, you can define your own custom accumulator class.
• For accumulator updates performed inside actions only, Spark guarantees that each task’s
update to the accumulator will be applied only once, meaning that restarted tasks will not
update the value. If an action including an accumulator fails during execution and Spark
manages to restart the action and complete it successfully, only the successful attempt will be
counted in the accumulator.
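A minimal PySpark sketch of these points (the DataFrame and the 'status' column are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

error_count = sc.accumulator(0)   # numeric accumulator; named accumulators are a Scala/Java feature

def count_errors(row):
    # Executors may only add to the accumulator; they cannot read its value
    if row["status"] == "error":
        error_count.add(1)

df = spark.createDataFrame([("ok",), ("error",), ("error",)], ["status"])
df.foreach(count_errors)      # foreach is an action, so the once-only update guarantee applies
print(error_count.value)      # only the driver can read the value -> 2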

DDL expressions

You must use enableHiveSupport() when creating your Spark session. Without Hive support, you cannot
run Spark DDL expressions. DDL expressions create Spark database objects which are stored in the
metastore. Spark by default uses the Hive metastore, so you must enable Hive support and include the
Hive dependencies in order to use DDL expressions.
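A minimal sketch of a session with Hive support enabled (assumes the Hive dependencies are available; the app name is illustrative):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("ddl-demo")
         .enableHiveSupport()   # required for DDL expressions backed by the Hive metastore
         .getOrCreate())

spark.sql("CREATE DATABASE IF NOT EXISTS demo_db")   # DDL object stored in the metastore
spark.sql("SHOW DATABASES").show()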

spark.sql("CACHE LAZY TABLE MY_TABLE") => a LAZY cache will cache the table on its first use.

spark.catalog.uncacheTable("MY_TABLE") or spark.sql("uncache table MY_TABLE") => the Spark catalog
allows you to cache or uncache tables.


df1.unpersist() - There is no method such as uncache() for DataFrame.

To materialize the DataFrame in cache, you need to call an action (and also you need to be using all
partitions with that action otherwise it will only cache some partitions).

print(table_list) => print the list of Spark tables in the current database (table_list here presumably comes from spark.catalog.listTables()).

You can drop a Spark view in two ways.

1. Using the DROP VIEW Spark SQL Expression; spark.sql("DROP VIEW IF EXISTS my_view")
2. Using the spark.catalog.dropTempView(); spark.catalog.dropTempView("my_view")

Spark global temporary views must be accessed with the global_temp prefix.

df1.createOrReplaceGlobalTempView("my_view")
spark.sql("select * from global_temp.my_view").show()
spark.read.table("global_temp.my_view").show()

When you should avoid caching your DataFrames:

• Small DataFrames that are not frequently used
• DataFrames that are too big to fit in memory

Spark UI

Jobs tab: DAGs
Stages tab: DAGs
Storage tab: information about your cached DataFrames
Environment tab: configuration properties such as spark.executor.memory
Executors tab
SQL tab: DAGs

OOM

• Reducing partition size is a viable way to aid against out-of-memory errors


• Decreasing the number of cores available to each executor can help against out-of-memory
errors.
• Setting a limit on the maximum size of serialized data returned to the driver may help prevent
out-of-memory errors.
• Limiting the amount of data being automatically broadcast in joins can help against out-of-
memory errors.


To process a partition, this partition needs to be loaded into the memory of an executor. If you imagine
that every core in every executor processes a partition, potentially in parallel with other executors, you
can imagine that memory on the machine hosting the executors fills up quite quickly. So, memory usage
of executors is a concern, especially when multiple partitions are processed at the same time. To strike a
balance between performance and memory usage, decreasing the number of cores may help against
out-of-memory errors.

When using commands like collect() that trigger the transmission of potentially large amounts of data
from the cluster to the driver, the driver may experience out-of-memory errors. One strategy to avoid
this is to be careful about using commands like collect() that send back large amounts of data to the
driver. Another strategy is setting the parameter spark.driver.maxResultSize. If data to be transmitted to
the driver exceeds the threshold specified by the parameter, Spark will abort the job and therefore
prevent an out-of-memory error.
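A hedged sketch of these safeguards (the 1g threshold and the DataFrame df are illustrative; spark.driver.maxResultSize must be set before the application starts, shown here at session build time):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.driver.maxResultSize", "1g")   # abort jobs returning more than ~1 GB to the driver
         .getOrCreate())

# Prefer limiting what is collected instead of collecting everything
sample_rows = df.limit(100).collect()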

Broadcast variables:

• immutable – they are never updated.


• meant to be shared across the cluster.
• small and do fit into memory.
• cached on every machine in the cluster, precisely avoiding to have to be serialized with every
single task.
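A minimal sketch of a broadcast variable (the lookup dictionary and DataFrame are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Small, immutable lookup data cached on every executor
country_lookup = sc.broadcast({"DE": "Germany", "FR": "France"})

df = spark.createDataFrame([("DE",), ("FR",)], ["code"])

def to_name(row):
    # Executors read the cached copy via .value instead of receiving it with every task
    return (row["code"], country_lookup.value.get(row["code"], "unknown"))

print(df.rdd.map(to_name).collect())   # [('DE', 'Germany'), ('FR', 'France')]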

Both panels 2 and 3 represent the operation performed for broadcast variables. While a broadcast
operation may look like panel 3, with the driver being the bottleneck, it most probably looks like panel 2.

a viable way to improve Spark’s performance when dealing with large amounts of data, given that there
is only a single application running on the cluster:

Increase values for the properties spark.default.parallelism, and spark.sql.shuffle.partitions

Increasing the value of spark.dynamicAllocation.maxExecutors would not help: this property is only in
effect if dynamic allocation is enabled via the spark.dynamicAllocation.enabled property, which is
disabled by default. Dynamic allocation is useful when running multiple applications on the same cluster
in parallel. In this case, however, there is only a single application running on the cluster, so enabling
dynamic allocation would not yield a performance benefit.

During a shuffle, data is compared between partitions because shuffling includes the process of sorting.
For sorting, data need to be compared. Since per definition, more than one partition is involved in a
shuffle, it can be said that data is compared across partitions.

In a shuffle, Spark writes data to disk. Spark's architecture dictates that intermediate results during a
shuffle are written to disk.

configurations are related to enabling dynamic adjustment of the resources for your application based
on the workload:

• spark.dynamicAllocation.shuffleTracking.enabled
• spark.dynamicAllocation.enabled
• spark.shuffle.service.enabled
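A hedged sketch of enabling these settings at session build time (executor counts are illustrative; shuffle tracking is the alternative to an external shuffle service):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.dynamicAllocation.enabled", "true")
         .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")   # or spark.shuffle.service.enabled
         .config("spark.dynamicAllocation.minExecutors", "1")
         .config("spark.dynamicAllocation.maxExecutors", "10")
         .getOrCreate())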

RDD (Resilient Distributed Dataset)

• The high-level DataFrame API is built on top of the low-level RDD API.
• RDDs are immutable.
• RDD stands for Resilient Distributed Dataset.
• RDDs are great for precisely instructing Spark on how to do a query.

spark.sql.autoBroadcastJoinThreshold is enabled by default and is set to 10 MB by default. If the property
is set to -1, broadcast joining is disabled. It expects a value in bytes.
DataFrame has no broadcast() method.

from pyspark.sql.functions import broadcast

joinType = "inner"
joinExpr = df1.BatchID == df2.BatchID
df1.join(df2, joinExpr, joinType)

df2 is very small and we want to apply the broadcast hint to df2:
df1.join(broadcast(df2), joinExpr, joinType)

itemsDf.join(broadcast(transactionsDf), itemsDf.itemId == transactionsDf.storeId)

transactionsDf.join(itemsDf, itemsDf.itemId == transactionsDf.productId, "outer")


You can handle ambiguous column names in three ways.

1. Use the DataFrame to reference the column name:
joinType = "inner"
joinExpr = df1.BatchID == df2.BatchID
df1.join(df2, joinExpr, joinType).select(df1.BatchID, df1.Year).show()

2. Drop one of the two ambiguous columns after the join:
joinType = "inner"
joinExpr = df1.BatchID == df2.BatchID
df1.join(df2, joinExpr, joinType).drop(df2.BatchID).select("BatchID", "Year").show()

3. Use the common column name as your join expression so Spark can auto-remove one ambiguous column after the join:
joinType = "inner"
joinExpr = "BatchID"
df1.join(df2, joinExpr, joinType).select("BatchID", "Year").show()

Semi Join: A semi join returns values from the left side of the relation that has a match with the right. It
is also referred to as a left semi join. Use LEFT SEMI JOIN if you want to list the matching record from the
left hand side table only once for each matching record in the right hand side.

Anti Join: An anti join returns values from the left relation that has no match with the right. It is also
referred to as a left anti join.
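A minimal sketch of both join types, reusing the df1/df2 and BatchID naming from the earlier join examples (assumed DataFrames):

# Rows of df1 that HAVE a match in df2, each listed at most once
df1.join(df2, df1.BatchID == df2.BatchID, "left_semi").show()

# Rows of df1 that have NO match in df2
df1.join(df2, df1.BatchID == df2.BatchID, "left_anti").show()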

transactionsDf.dropna(thresh=4) removes all rows in the 6-column DataFrame transactionsDf that have
missing data in at least 2 columns. thresh=1 says "keep the row if at least one column is not null".

thresh defines the minimum number of non-null columns a row needs in order not to be dropped.
So here, 6 - 2 = 4.

transactionsDf.dropna("any"): remove all rows that have at least one missing value.

drop.na is not a proper DataFrame method.

df.na.drop(subset=("Year","CourseName")) to delete all the rows if Year or CourseName column is null

df.na.drop() or df.na.drop("any") to delete all the rows having any null column.

df.na.drop("all") to delete rows if all columns are null.

from pyspark import StorageLevel


transactionsDf.persist(StorageLevel.MEMORY_ONLY)

Note that the storage level MEMORY_ONLY means that all partitions that do not fit into memory will be
recomputed when they are needed.

transactionsDf.cache()


The default storage level of DataFrame.cache() is MEMORY_AND_DISK in Spark 3.0.0 (changed to
MEMORY_AND_DISK_DESER in later versions), meaning that partitions that do not fit into memory are
stored on disk. cache() does not take any arguments.

DataFrame.cache() is evaluated like a transformation: Through lazy evaluation. This means that after
calling DataFrame.cache() the command will not have any effect until you call a subsequent action, like
DataFrame.cache().count().

transactionsDf.persist()

The default storage level of DataFrame.persist() is MEMORY_AND_DISK_DESER.


An explicit storage level can be passed instead: transactionsDf.persist(StorageLevel.MEMORY_ONLY)

When thinking about fault tolerance and storage levels, you would want to store redundant copies of the
dataset. This can be achieved by using a storage level such as StorageLevel.MEMORY_AND_DISK_2.

transactionsDf sorted in descending order by column predError, showing missing values last

transactionsDf.sort("predError", ascending=False)
transactionsDf.sort(desc_nulls_last("predError"))

df.orderBy(df.a.desc_nulls_first()) == df.sort(desc_nulls_first("a"))

transactionsDf.repartition(14, "storeId", "transactionDate").count()


df.repartition(numPartitions, *cols). Repartitioning on a column name creates a hash-partitioned
DataFrame. The number of partitions depends on the spark.sql.shuffle.partitions value: if
spark.sql.shuffle.partitions=200, the new DataFrame df2 will have 200 partitions. However, you can
override spark.sql.shuffle.partitions by passing numPartitions in the API call.

transactionsDf.select((col("storeId").between(20, 30)) & (col("productId") == 2)) - Operators like && or
and are not valid.

DataFrame.first() returns an object of type Row, but not DataFrame.

itemsDf.filter(~col('supplier').contains('X')).select('supplier').distinct() - the ~ (not) operator

spark.read.schema(fileSchema).format("parquet").load(filePath)

transactionsDf.repartition(1).write.option("sep", " ").option("nullValue", "n/a").csv(csvPath)


transactionsDf.write.format("parquet").mode("overwrite").option("compression", "snappy").save(storeDir)

The DataFrame writer's default format is Parquet. To save a DataFrame in Parquet format, you can call
either the parquet() or the save() method.

transactionsDf = spark.read.load("data/transactions.csv", sep=";", format="csv", header=True,
inferSchema=True) => By default, Spark does not infer the schema of a CSV file.

spark.read.format("binaryFile").option("pathGlobFilter", "*.png").load(path)

• binary files, like images.


• pathGlobFilter option is a great way to filter files by name (and ending).

DataFrameReader.format(...).option("key", "value").schema(...).load()

There are 3 typical read modes and the default read mode is permissive.

• permissive — All fields are set to null and corrupted records are placed in a string column called
_corrupt_record
• dropMalformed — Drops all rows containing corrupt records.
• failFast — Fails when corrupt records are encountered.

DataFrameWriter.format(...).option(...).partitionBy(...).bucketBy(...).sortBy( ...).save()

There are 4 typical save modes and the default mode is errorIfExists

• append — appends output data to files that already exist


• overwrite — completely overwrites any data present at the destination
• errorIfExists — Spark throws an error if data already exists at the destination
• ignore — if data exists do nothing with the dataFrame
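A minimal sketch of choosing a save mode (df and the output path are illustrative):

(df.write
   .format("parquet")
   .mode("overwrite")        # one of: append, overwrite, errorIfExists (default), ignore
   .save("/tmp/output"))     # illustrative path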

+----+---+------+---------------+
|name|age|salary|name,age,salary|
+----+---+------+---------------+

df2 = df1.drop("name", "age", "salary")

df2.write.text("data/test.csv")

When you write a text file, you need to be sure to have only one string column; otherwise, the write will
fail


For column expressions, you can use three approaches.

1. Use selectExpr() - df.selectExpr("avg(salary)")


2. Use expr(), as in expr("avg(columnName)") - df.select(expr("avg(salary)"))
3. Use DataFrame functions, as in avg("columnName") - df.select(avg("salary"))

json_schema = """
{"type": "struct",
 "fields": [
  {
   "name": "itemId",
   "type": "integer",
   "nullable": true,
   "metadata": {}
  },
  {
   "name": "supplier",
   "type": "string",
   "nullable": true,
   "metadata": {}
  }
 ]
}
"""
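A hedged sketch of turning that JSON string into a schema and applying it while reading (the file path is illustrative):

import json
from pyspark.sql.types import StructType

item_schema = StructType.fromJson(json.loads(json_schema))

df = spark.read.schema(item_schema).json("/data/items.json")   # illustrative path
df.printSchema()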

df.selectExpr("name", "case when (salary < 5000) then salary * 0.20 else 0 end as increment")
df.select(expr("number + 10"))

The select() method does not accept SQL expression strings. You must use selectExpr() or wrap your
column expressions in expr().
df.selectExpr("*", expr("salary * 0.15").alias("salary_increment")) => Wrong. selectExpr() accepts only
column names or string expressions; you cannot use the expr() method inside selectExpr().

• df.withColumn("salary_increment", expr("salary * 0.15"))


• df.selectExpr("*", "salary * 0.15 as salary_increment")

df1 = df.selectExpr("ID", "struct(FName,LName,DOB) as PersonalDetails", "Department") - [John, Doe, 1977]

df.withColumn("Year", expr("ifnull(Year, '2021')"))


The isnull() function takes a single argument and returns true if expr is null, or false otherwise. You do
not need lit() in Spark SQL. In fact, we do not have a lit() function in Spark SQL because it is not needed.

df.groupBy("Key") .agg(collect_list(struct("Name", "Score")).alias("Collection")) -


|1 |[[Apple, 0.76], [Orange, 0.98], [Banana, 0.24]] |

transactionsDf.sort("storeId", desc("productId")) => sort transactionsDf by two columns


transactionsDf.sort("storeId").sort(desc("productId")) => two separate sorts

spark.createDataFrame([("red",), ("blue",), ("green",)], ["color"])

transactionsDf.groupBy("storeId").agg(avg("value"))

You can use the countDistinct() for counting unique records. But remember, this function is not available
in Spark SQL.

df.select("InvoiceNo").distinct().agg(count("InvoiceNo"))
df.selectExpr("count(distinct InvoiceNo)")
df.select(countDistinct("InvoiceNo"))

df1 = df.selectExpr("ID", "struct(FName,LName,DOB) as PersonalDetails", "Department")


df1.show(truncate=0)
df1.select("ID", col("FName"), col("LName"), "Department").show()

df.select("name", expr("salary * 0.20 as increment"))

Use Spark SQL functions such as INT(), DOUBLE(), DATE(), etc. to cast a value:
df.select("name", expr("INT(age)"), expr("DOUBLE(salary)"))

Do NOT forget that the expected result requires you to give a column alias:
df.groupBy("department", "country").agg(expr("count(*) as NumEmployee"), expr("sum(salary) as
TotalSalary")).show()

df.groupBy("Year").pivot("CourseName").agg(expr("sum(Students)"))


replacing all null values in all columns with Unknown:

• df1 = df.na.fill("Unknown")
• df1 = df.fillna("Unknown")

df.na.fill({"CourseName": "Python", "Year": "2021"})

Returns about 150 randomly selected rows from the 1000-row DataFrame transactionsDf, assuming that
any row can appear more than once: transactionsDf.sample(True, 0.15, 8261)

sample(withReplacement=None, fraction=None, seed=None).

withReplacement (default: False). We have to enable it here, since the question asks for a row being able
to appear more than once.

seed. A random seed makes a randomized process repeatable. This means that if you re-run the same
sample() operation with the same random seed, you will get the same rows returned from the sample()
command.

to_timestamp does not always require a format to be specified.

df.select(to_timestamp(col("date"))).show()

A Python UDF runs in a Python process on the worker node. Spark must serialize data in the JVM process
and send it to the Python process, which executes the function row by row and then returns the results
of the row operations to the JVM and Spark. This transfer of data and returning of results is extra work
that causes performance overhead. When we create a Spark UDF in Python, it does NOT run in the
executor JVM.

You can register a UDF in two ways.

1. func_name_udf = udf(func_name):
registers the UDF as a DataFrame function so you can use it in your DataFrame expressions as a
function. But you cannot use it in a string expression.
power3_udf = udf(power3)
df.select(power3_udf(col("num"))).show()

2. spark.udf.register("func_name_udf", func_name):
registers the UDF as a SQL function, so you can use it in string expressions.
spark.udf.register("power3_udf", power3)
df.selectExpr("power3_udf(num)").show()
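A complete sketch of both registration styles, assuming power3 is a simple cube function (the function body and the 'num' column are illustrative; they are not defined in these notes):

from pyspark.sql.functions import udf, col
from pyspark.sql.types import LongType

def power3(n):
    return n * n * n   # assumed implementation

# 1. DataFrame-function style
power3_udf = udf(power3, LongType())
df.select(power3_udf(col("num"))).show()

# 2. SQL-function style, usable inside string expressions
spark.udf.register("power3_udf", power3, LongType())
df.selectExpr("power3_udf(num)").show()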


You can reference a column using four ways.

1. “colName”
2. col(“colName”)
3. column(“colName”)
4. df[“colName”]
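All four forms side by side (df and colName are assumed):

from pyspark.sql.functions import col, column

df.select("colName").show()           # 1. plain string
df.select(col("colName")).show()      # 2. col()
df.select(column("colName")).show()   # 3. column()
df.select(df["colName"]).show()       # 4. DataFrame indexing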

Adaptive Query Execution (AQE): An optimization technique in Spark SQL that makes use of the runtime
statistics to choose the most efficient query execution plan
spark.sql.adaptive.localShuffleReader.enabled - AQE converts sort-merge join to broadcast hash join
when the runtime statistics of any join side is smaller than the broadcast hash join threshold.

Dynamic Partition Pruning (DPP)

Cost-Based Optimization (CBO)

Rule-Based Optimization (RBO)


A partition is considered skewed if its size is larger than spark.sql.adaptive.skewJoin.skewedPartitionFactor
multiplied by the median partition size and also larger than
spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes.

spark.scheduler.allocation.file configuration => create and configure FAIR scheduler pools.

transactionsDf.dropDuplicates(subset=["productId"])

transactionsDf.filter("transactionId" % 2 == 0).select("predError", "value") > wrong

transactionsDf.filter(col("transactionId") % 2 == 0).select("predError", "value") > correct

Storage memory is used for caching partitions derived from DataFrames.

Spark Session

Allows you to access SparkContext and SQLContext etc.

Eliminates the need for accessing SparkContext => Wrong

Subsumes previous entry points to Spark, like the SparkContext, SQLContext, HiveContext, and
StreamingContext.

Setting spark.speculation to true will re-launch one or more tasks if they are running slowly in a stage
(it does not kill the slow-running task and restart it on another node). Apache Spark has the 'speculative
execution' feature to handle slow tasks in a stage caused by environmental issues such as a slow network
or disk. If one task is running slowly in a stage, the Spark driver can launch a speculative task for it on a
different host. Between the regular task and its speculative task, Spark will later take the result from the
first successfully completed task and kill the slower one.


You noticed that a full garbage collection is invoked multiple times before a task completes. So you
concluded that there isn’t enough memory available for executing tasks. How can you address the task
memory problem? Decrease the spark.memory.fraction so you reduce the amount of memory that
Spark uses for caching.

The default date format for reading CSV and JSON data files is yyyy-MM-dd.

• By default, UDFs are registered as temporary functions to be used in that specific SparkSession.
• UDFs can take and return one or more columns.
• UDFs are just functions that operate on the data, record by record.


Spark Session:

• In a standalone Spark application, you can create a SparkSession using one of the high-level APIs
in the programming language of your choice.
• A SparkSession is created for you in spark-shell, and you can access it via a global variable called
spark.

You noticed too many minor collections but not many major garbage collections. How can you address
this problem? Allocating more memory for Eden might help.

Stage: A unit for work that is executed as a sequence of tasks in parallel without a shuffle


df1.withColumn("Flight_Delays",
    when(col("delay") > 360, lit("Very Long Delays"))
    .when((col("delay") >= 120) & (col("delay") <= 360), lit("Long Delays"))
    .when((col("delay") >= 60) & (col("delay") <= 120), lit("Short Delays"))
    .when((col("delay") > 0) & (col("delay") < 60), lit("Tolerable Delays"))
    .when(col("delay") == 0, lit("No Delays"))
    .otherwise(lit("Early")))


What is the use of coalesce(expr*) function in Spark SQL? Returns the first non-null argument if exists.
Otherwise, null.
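A minimal sketch distinguishing the SQL coalesce(expr*) function from DataFrame.coalesce(numPartitions) (column names are illustrative):

from pyspark.sql.functions import coalesce, col, lit

# Returns the first non-null value among Year, BackupYear, and the literal '2021'
df.select(coalesce(col("Year"), col("BackupYear"), lit("2021")).alias("Year")).show()

# Not to be confused with df.coalesce(4), which only reduces the number of partitions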


Explicit caching can decrease application performance by interfering with the Catalyst optimizer's ability
to optimize some queries.

from pyspark.sql.functions import dense_rank
from pyspark.sql.window import Window

windowSpec = Window.partitionBy("department").orderBy("salary")

df.withColumn("dense_rank", dense_rank().over(windowSpec)).show()

The driver is the machine in which the application runs. It is responsible for three main things: 1)
Maintaining information about the Spark Application, 2) Responding to the user’s program, 3) Analyzing,
distributing, and scheduling work across the executors.
