Performance Tuning Spark UI
2
How can I make this faster?
Can I make this application faster? Look for these 4 symptoms
• Spill
• Shuffle
• Skew/Stragglers
• Small Files
3
Ganglia Metrics
4
Ganglia Metrics
Things to note as we walk through the UI
• CPU
– Aggregated CPU is over all machines, including driver, so 80% of usage on a 4
node + 1 driver cluster is max cluster cpu usage
• Network
– Read from data source = data in
– Write to data target = data out
• Memory
– Cache + objects created by the application
5
The Spark UI
6
Spark UI - What we care about
Things to cover as we walk through it
• Jobs/Stages/Tasks, Transformations and Actions
– Accessing the executor heap, stdout, stderr
– What do the metrics mean? What is important?
• Storage Tab
• Executor Tab
7
Analyzing a Spark SQL Plan
Things to look at in the Spark SQL Plan
• Pushed vs Partition Filters (see the sketch below)
• Join Order
• DataFrame reuse
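The first two are visible straight from explain(). A minimal sketch, assuming a Parquet table partitioned by a date column (the path and column names are invented for illustration):

import spark.implicits._

// Hypothetical Parquet table partitioned by `date`; path and columns are placeholders
val events = spark.read.parquet("/mnt/data/events")

events
  .filter($"date" === "2019-01-01")   // should appear under PartitionFilters (directory pruning)
  .filter($"status" === "ok")         // should appear under PushedFilters (pushed into the Parquet scan)
  .explain()
// Compare PartitionFilters: [...] vs PushedFilters: [...] in the plan's FileScan node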
8
Choosing your instance type
9
Different AWS Instance Types
Compute Optimized
• C4
– 1 vCPU per 1.875 GB RAM
– Storage: attached EBS only
– Price: $0.085
• C3
– 1 vCPU per 1.875 GB RAM
– Storage: attached SSD
– Price: $0.105
Memory Optimized
• R4
– 1 vCPU per 7.625 GB RAM
– Storage: attached EBS only
– Price: $0.133
• R3
– 1 vCPU per 7.625 GB RAM
– Storage: attached SSD
– Price: $0.166
Storage Optimized
• I3
– 1 vCPU per 7.625 GB RAM
– Storage: NVMe SSD
– Price: $0.156
• I2
– 1 vCPU per 7.625 GB RAM
– Storage: attached SSD
– Price: $0.853
10
Cluster Sizing Starting Points
Rules of Thumb
• Fewer big instances > more small instances (caveat: larger heap = longer GC pauses)
– Reduces network shuffle
– Aggregations seem to benefit from smaller machines due to cache locality
• Size based on the number of tasks initially, tweak later
– Run the job with a small cluster to get idea of # of tasks
• Choose based on workload (Probably start with C):
– ETL with full file scans and no data reuse - C/R class
– ML workload with data caching - C class (Total RAM = dataset size)
– Analytics - i3
11
How do we tweak these?
Workload requires caching (like machine learning)
• Look at the Storage tab in Spark UI to see if the entirety of the training
dataset is cached
– Fully cached with room to spare -> fewer instances
– Partially cached
• Almost completely cached? -> Increase the cluster size
• Not even close to cached -> Consider an instance with SSDs instead of EBS, or use the R class
– Check whether persist is MEMORY_ONLY or MEMORY_AND_DISK (see the sketch after this list)
– Spilling to disk on NVMe SSDs isn't so bad
• Still not good enough? Follow the steps in the next section
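A minimal persist sketch for the caching check above, assuming a notebook with a SparkSession in scope (the path is a placeholder):

import org.apache.spark.storage.StorageLevel

val training = spark.read.parquet("/mnt/data/training")   // placeholder dataset
training.persist(StorageLevel.MEMORY_AND_DISK)            // can spill to local (ideally NVMe) disk instead of recomputing
training.count()                                          // materialize the cache, then check the Storage tab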
12
How do we tweak these?
ETL and Analytic Workloads
• Are we compute bound?
– Look at ganglia and check CPU Usage
– Only way to make faster is more cores
• Are we network bound?
– Look for low CPU usage that occurs after shuffle boundaries
– Use bigger/fewer machines to reduce the shuffle
– Use an ssd backed instance for faster remote reads/add compression
• Are we spilling a ton?
– Check the Spark SQL tab for spill (partial aggregations before shuffles commonly spill)
• Use i3
• Increase the core/memory ratio
13
File Formats and
Compression
14
File Formats and Compression
[Comparison table in the original slide: File Format | Schema | DBR Version for Performance Benefits | Common When]
15
File Formats and Compression
Additional notes
• ORC
– requires rewriting the table to add a column
• Parquet
– requires rewriting the table to add a column (or use Delta's mergeSchema)
– DBR Native Parquet Reader
• Avro - essentially compressible JSON
– OK if you always select the entire dataset
16
File Formats and Compression
Compression is usually good
• GZIP
– Gzipped JSON/CSV is unsplittable: one core must process the entire file
– Highest compression ratio, longer to decompress
– Really big gzipped JSON/CSV files = BAD
• Snappy
– Splittable, file can be read by multiple cores
– Lower compression ratio, quicker to decompress
– Better than gzip for large files as far as Spark is concerned
17
File Formats and Compression
Compression is usually good
• BZIP2
– Compresses the most
– The better it compresses, the longer it takes to decompress
– 10x slower for 5% compression gains*
– 5x slower to decompress
• Parquet (.gz.parquet) is always splittable regardless of compression codec, due to the internal layout of a Parquet file (write-side options are sketched below)
– Compression is done on a per-page basis
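For reference, a small write-side sketch (paths are placeholders); either codec keeps the Parquet files splittable because compression happens per page:

val df = spark.read.json("/mnt/raw/events")                 // placeholder input

df.write.option("compression", "snappy").parquet("/mnt/out/events_snappy")
df.write.option("compression", "gzip").parquet("/mnt/out/events_gzip")     // .gz.parquet, still splittable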
18
FYI’s and Spark Functionality
19
Picking Number of Partitions
Rules of Thumb
• Too many >> too few >> way too many >> way too few
– Once you get to ~20-40k partitions, shuffle server problems
• Look for the largest shuffle read input in a stage
– Ideally each partition will be around 200 MB
– e.g. shuffle input = 10 GB -> ~50 partitions
• Number of initial partitions is set by spark.sql.files.maxPartitionBytes (128 MB default)
• # partitions should be a multiple of # cores (e.g. 20 total worker cores -> try 40 partitions); a sizing sketch follows this list
• Use auto optimize, otherwise you may have to coalesce/repartition before
writes
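A rough sizing sketch of the 200 MB rule of thumb above (all numbers are illustrative, not measured; df and the paths are placeholders):

val shuffleInputBytes   = 10L * 1024 * 1024 * 1024                          // e.g. a 10 GB shuffle read
val targetPartitionSize = 200L * 1024 * 1024                                // ~200 MB per partition
val numPartitions       = (shuffleInputBytes / targetPartitionSize).toInt   // ~51

spark.conf.set("spark.sql.shuffle.partitions", numPartitions.toString)

// Without auto optimize, compact explicitly before writing
val df = spark.read.parquet("/mnt/data/input")
df.repartition(numPartitions).write.parquet("/mnt/out/compacted")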
20
Untyped vs Typed API vs UDF/UDAF
FYI’s
• Untyped dataset API is well optimized by Catalyst
• Typed dataset API is opaque and may involve SerDe
• UDFs are somewhat similar to the typed API (UDFs operate on columns, whereas the typed Dataset API operates on rows)
• UDAFs force the use of a less optimized aggregate operator (due to current implementation limitations), which may cause the aggregation to run slower
21
Spark DataFrame vs Dataset
DataFrame == Dataset[Row]
• Untyped vs Typed API
– df.select($"col.att") is faster than df.map(l => l.col.att)
– InternalRow (Tungsten) format vs converting to objects (expanded in the sketch below)
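A small illustration of the same point; the Event/Attr case classes and the path are invented for the example:

import spark.implicits._

case class Attr(att: String)
case class Event(col: Attr)

val ds = spark.read.parquet("/mnt/data/events").as[Event]   // placeholder path/schema

val fast   = ds.select($"col.att")    // stays in Tungsten's InternalRow format, visible to Catalyst
val slower = ds.map(e => e.col.att)   // deserializes every row into an Event object first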
22
RDD Cache vs DBIO Cache
When to use what, and what it means for your application
• RDD Cache uses the RAM on an executor
– Useful for ML workloads that iterate repeatedly over the same dataset
– Some feature building functions are expensive, we want to compute the feature
once and reuse as opposed to rebuild
• DBIO Cache uses SSD on i3
– Reduce the memory pressure of your application (spill less)
– Cheaper than memory
– Only the base data is cached, so it's not ideal when heavy processing is done before the data is reused
– Faster for queries that use subsets of the data (contrast sketched below)
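A sketch contrasting the two (paths and column names are placeholders; the DBIO cache config name is the Databricks Runtime setting as documented at the time and should be treated as an assumption):

import spark.implicits._

// RDD/DataFrame cache: keep a derived result in executor RAM for reuse
val raw      = spark.read.parquet("/mnt/data/raw")
val features = raw.groupBy($"user_id").count()              // stand-in for expensive feature building
features.cache()
features.count()                                            // materialize

// DBIO cache (i3 NVMe SSDs): transparently caches the base Parquet data as it is read
spark.conf.set("spark.databricks.io.cache.enabled", "true") // DBR setting (assumed name)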
23
Joins
24
Physical Join Types
• BroadcastHashJoin
– One side must fit in memory
• ShuffledHashJoin
– One side is smaller (3x or more) and a partition of it can fit in memory
– Off by default; enable with spark.sql.join.preferSortMergeJoin = false
• SortMergeJoin
• BroadcastNestedLoopJoin
– More expensive
28
Using skew joins
What is a skew join and how do I use it
• Specify Skewed Table
df.hint("skew") .join(df2, $"df1_id" === $"df2_id")
29
Solving for the 4 S’s
30
So I found spill, now what
Not always obvious to find; the SQL plan is most informative
• More partitions
• Higher ratio of ram/core
• Reduce the number of cores the executor can use
– spark.executor.cores < total cores per worker
• Is the spill happening in a pre-reduce sort?
– Spark tries to reduce locally first; sometimes this has very little effect and takes a lot of time
– spark.databricks.execution.singlePassAggregate.enabled = true
31
So I found skew, now what
Very obvious from task durations and Ganglia CPU
• Ganglia CPU usage looks like a standard deviation curve
• Task durations: max and 75th percentile >> average and 25th percentile
• Use a Skew join
• Filter out large keys, or salt keys and set up multiple reduce steps (salting sketch below)
• Explicitly repartition the data on a different field
– Spend 5 mins repartitioning vs 10-15 mins for huge partitions
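One concrete form of the salting idea above, applied to a skewed join; the column names, paths, and salt count are invented for illustration:

import org.apache.spark.sql.functions._

val numSalts = 16
val bigDf    = spark.read.parquet("/mnt/data/big_skewed")   // placeholder inputs
val smallDf  = spark.read.parquet("/mnt/data/small")

// Spread the big side's hot keys across numSalts buckets
val saltedBig = bigDf.withColumn("salt", (rand() * numSalts).cast("int"))

// Duplicate the small side once per bucket so every (key, salt) pair can match
val saltedSmall = smallDf.withColumn("salt", explode(array((0 until numSalts).map(i => lit(i)): _*)))

// Join on (key, salt) so no single partition carries an entire hot key
val joined = saltedBig.join(saltedSmall, Seq("join_key", "salt"))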
32
So I found shuffle, now what
Lots of joins, low compute on ganglia, huh?
• Reduce number of partitions
– Node-local vs remote reads: fewer partitions -> more node-local reads = less shuffle
• Review Join order
– Look at the plan, are we joining the biggest table several times?
– Join the smallest tables first and work up to the big one
33
So I found shuffle, now what (cont.)
Green and blue lines look to be maxed out on ganglia
• Size disparity between tables? 1 TB joined to 10 GB?
– Use fewer, bigger machines and broadcast the 10 GB one (see the sketch below)
• Shuffle the 1 TB side once vs shipping 10 GB to n executors
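Sketch of forcing the broadcast (table paths are placeholders):

import org.apache.spark.sql.functions.broadcast

val huge  = spark.read.parquet("/mnt/data/fact_1tb")   // ~1 TB side
val small = spark.read.parquet("/mnt/data/dim_10gb")   // ~10 GB side

// Ship the small side to every executor once instead of shuffling the 1 TB side;
// the executors need enough memory headroom to hold the broadcast
val joined = huge.join(broadcast(small), "key")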
34
So I found a lot of small files, now what
This one is a problem that we can’t easily fix
• If you’re processing a lot of small files in batch, it’s probably better to treat
it as a streaming application
– Small etl-streaming cluster that writes to Delta and compacts the files to help downstream processing (sketch below)
• Use the S3-SQS connector + Trigger.Once() in DBR 4.2+
• S3 inventory future work
• Fix the upstream application building tons of files
• Ingest it into Delta first
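A minimal sketch of the etl-streaming idea above, using a plain file source and Trigger.Once; the S3-SQS connector setup is not shown, and the schema, paths, and checkpoint location are placeholders:

import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types._

val inputSchema = new StructType()
  .add("id", StringType)
  .add("ts", TimestampType)

val input = spark.readStream
  .schema(inputSchema)
  .json("/mnt/raw/small-files/")                          // plain file source standing in for S3-SQS

input.writeStream
  .format("delta")                                        // land in Delta so files can be compacted for downstream jobs
  .option("checkpointLocation", "/mnt/chk/small-files")
  .trigger(Trigger.Once())                                // run as a scheduled, incremental batch
  .start("/mnt/delta/ingest")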
35
Thank you
Joe Widen
36