Performance Tuning Spark UI
2
How can I make this faster?
Can I make this application faster? Look for these 4 symptoms
• Spill
• Shuffle
• Skew/Stragglers
• Small Files
3
Ganglia Metrics
4
Ganglia Metrics
Things to note as we walk through the UI
• CPU
– Aggregated CPU is over all machines, including driver, so 80% of usage on a 4
node + 1 driver cluster is max cluster cpu usage
• Network
– Read from data source = data in
– Write to data target = data out
• Memory
– Cache + objects created by the application
5
The Spark UI
6
Spark UI - What we care about
Things to cover as we walk through it
• Jobs/Stages/Tasks, Transformations and Actions
– Accessing the executor heap, stdout, stderr
– What do the metrics mean? What is important?
• Storage Tab
• Executor Tab
7
Analyzing a Spark SQL Plan
Things to look at in the Spark SQL Plan
• Pushed vs Partition Filters (see the sketch below)
• Join Order
• DataFrame reuse
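The first two are visible straight from explain(). A minimal sketch, assuming a Parquet table partitioned by a date column (the path and column names are invented for illustration):

import spark.implicits._

// Hypothetical Parquet table partitioned by `date`; path and columns are placeholders
val events = spark.read.parquet("/mnt/data/events")

events
  .filter($"date" === "2019-01-01")   // should appear under PartitionFilters (directory pruning)
  .filter($"status" === "ok")         // should appear under PushedFilters (pushed into the Parquet scan)
  .explain()
// Compare PartitionFilters: [...] vs PushedFilters: [...] in the plan's FileScan node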
8
Choosing your instance type
9
Different AWS Instance Types
Compute Optimized
• C4
– 1 vCPU per 1.875 GB RAM
– Storage: attached EBS only
– Price: $0.085
• C3
– 1 vCPU per 1.875 GB RAM
– Storage: attached SSD
– Price: $0.105
Memory Optimized
• R4
– 1 vCPU per 7.625 GB RAM
– Storage: attached EBS only
– Price: $0.133
• R3
– 1 vCPU per 7.625 GB RAM
– Storage: attached SSD
– Price: $0.166
Storage Optimized
• I3
– 1 vCPU per 7.625 GB RAM
– Storage: NVMe SSD
– Price: $0.156
• I2
– 1 vCPU per 7.625 GB RAM
– Storage: attached SSD
– Price: $0.853
10
Cluster Sizing Starting Points
Rules of Thumb
• Fewer big instances > more small instances (caveat: larger heap = longer GC pauses)
– Reduces network shuffle
– Aggregations seem to benefit from smaller machines due to cache locality
• Size based on the number of tasks initially, tweak later
– Run the job with a small cluster to get idea of # of tasks
• Choose based on workload (Probably start with C):
– ETL with full file scans and no data reuse - C/R class
– ML workload with data caching - C class (Total RAM = dataset size)
– Analytics - i3
11
How do we tweak these?
Workload requires caching (like machine learning)
• Look at the Storage tab in Spark UI to see if the entirety of the training
dataset is cached
– Fully cached with room to spare -> fewer instances
– Partially cached
• Almost completely cached? -> Increase the cluster size
• Not even close to cached -> Consider an instance with SSDs instead of EBS, or use the R class
– Check whether persist is MEMORY_ONLY or MEMORY_AND_DISK (see the sketch after this list)
– Spilling to disk on NVMe SSDs isn't so bad
• Still not good enough? Follow the steps in the next section
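A minimal persist sketch for the caching check above, assuming a notebook with a SparkSession in scope (the path is a placeholder):

import org.apache.spark.storage.StorageLevel

val training = spark.read.parquet("/mnt/data/training")   // placeholder dataset
training.persist(StorageLevel.MEMORY_AND_DISK)            // can spill to local (ideally NVMe) disk instead of recomputing
training.count()                                          // materialize the cache, then check the Storage tab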
12
How do we tweak these?
ETL and Analytic Workloads
• Are we compute bound?
– Look at ganglia and check CPU Usage
– Only way to make faster is more cores
• Are we network bound?
– Look for low CPU usage that occurs after shuffle boundaries
– Use bigger/fewer machines to reduce the shuffle
– Use an ssd backed instance for faster remote reads/add compression
• Are we spilling a ton?
– Check the Spark SQL tab for spill (partial aggregations before shuffles commonly spill)
• Use i3
• Increase the core/memory ratio
13
File Formats and
Compression
14
File Formats and Compression
[Comparison table in the original slide: File Format | Schema | DBR Version for Performance Benefits | Common When]
15
File Formats and Compression
Additional notes
• ORC
– requires rewriting the table to add a column
• Parquet
– requires rewriting the table to add a column (or use Delta's mergeSchema)
– DBR Native Parquet Reader
• Avro - essentially compressible JSON
– OK if you always select the entire dataset
16
File Formats and Compression
Compression is usually good
• GZIP
– Gzipped JSON/CSV is unsplittable: one core must process the entire file
– Highest compression ratio, longer to decompress
– Really big gzipped JSON/CSV files = BAD
• Snappy
– Splittable, file can be read by multiple cores
– Lower compression ratio, quicker to decompress
– Better than gzip for large files as far as Spark is concerned
17
File Formats and Compression
Compression is usually good
• BZIP2
– Compresses the most
– The better it compresses, the longer it takes to decompress
– 10x slower for 5% compression gains*
– 5x slower to decompress
• Parquet (.gz.parquet) is always splittable regardless of compression codec, due to the internal layout of a Parquet file (write-side options are sketched below)
– Compression is done on a per-page basis
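For reference, a small write-side sketch (paths are placeholders); either codec keeps the Parquet files splittable because compression happens per page:

val df = spark.read.json("/mnt/raw/events")                 // placeholder input

df.write.option("compression", "snappy").parquet("/mnt/out/events_snappy")
df.write.option("compression", "gzip").parquet("/mnt/out/events_gzip")     // .gz.parquet, still splittable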
18
FYI’s and Spark Functionality
19
Picking Number of Partitions
Rules of Thumb
• Too many >> too few >> way too many >> way too few
– Once you get to ~20-40k partitions, shuffle server problems
• Look for the largest shuffle read input in a stage
– Ideally each partition will be around 200 MB
– e.g. shuffle input = 10 GB -> ~50 partitions
• Number of initial partitions is set by spark.sql.files.maxPartitionBytes (128 MB default)
• # partitions should be a multiple of # cores (e.g. 20 total worker cores -> try 40 partitions); a sizing sketch follows this list
• Use auto optimize, otherwise you may have to coalesce/repartition before
writes
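A rough sizing sketch of the 200 MB rule of thumb above (all numbers are illustrative, not measured; df and the paths are placeholders):

val shuffleInputBytes   = 10L * 1024 * 1024 * 1024                          // e.g. a 10 GB shuffle read
val targetPartitionSize = 200L * 1024 * 1024                                // ~200 MB per partition
val numPartitions       = (shuffleInputBytes / targetPartitionSize).toInt   // ~51

spark.conf.set("spark.sql.shuffle.partitions", numPartitions.toString)

// Without auto optimize, compact explicitly before writing
val df = spark.read.parquet("/mnt/data/input")
df.repartition(numPartitions).write.parquet("/mnt/out/compacted")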
20
Untyped vs Typed API vs UDF/UDAF
FYI’s
• Untyped dataset API is well optimized by Catalyst
• Typed dataset API is opaque and may involve SerDe
• UDFs are somewhat similar to the typed API (UDFs operate on columns, whereas the typed Dataset API operates on rows)
• UDAFs force the use of a less optimized aggregate operator (due to current implementation limitations), which may cause the aggregation to run slower
21
Spark DataFrame vs Dataset
DataFrame == Dataset[Row]
• Untyped vs Typed API
– df.select($"col.att") is faster than df.map(l => l.col.att)
– InternalRow (Tungsten) format vs converting to objects (expanded in the sketch below)
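A small illustration of the same point; the Event/Attr case classes and the path are invented for the example:

import spark.implicits._

case class Attr(att: String)
case class Event(col: Attr)

val ds = spark.read.parquet("/mnt/data/events").as[Event]   // placeholder path/schema

val fast   = ds.select($"col.att")    // stays in Tungsten's InternalRow format, visible to Catalyst
val slower = ds.map(e => e.col.att)   // deserializes every row into an Event object first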
22
RDD Cache vs DBIO Cache
When to use what, and what it means for your application
• RDD Cache uses the RAM on an executor
– Useful for ML workloads that iterate repeatedly over the same dataset
– Some feature building functions are expensive, we want to compute the feature
once and reuse as opposed to rebuild
• DBIO Cache uses SSD on i3
– Reduce the memory pressure of your application (spill less)
– Cheaper than memory
– Only the base data is cached, so it's not ideal when heavy processing is done before the data is reused
– Faster for queries that use subsets of the data (contrast sketched below)
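A sketch contrasting the two (paths and column names are placeholders; the DBIO cache config name is the Databricks Runtime setting as documented at the time and should be treated as an assumption):

import spark.implicits._

// RDD/DataFrame cache: keep a derived result in executor RAM for reuse
val raw      = spark.read.parquet("/mnt/data/raw")
val features = raw.groupBy($"user_id").count()              // stand-in for expensive feature building
features.cache()
features.count()                                            // materialize

// DBIO cache (i3 NVMe SSDs): transparently caches the base Parquet data as it is read
spark.conf.set("spark.databricks.io.cache.enabled", "true") // DBR setting (assumed name)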
23
Joins
24
Physical Join Types
• BroadcastHashJoin
– One side must fit in memory
• ShuffledHashJoin
– One side is smaller (3x or more) and a partition of it can fit in memory
– Off by default; enable with spark.sql.join.preferSortMergeJoin = false
• SortMergeJoin
• BroadcastNestedLoopJoin
– More expensive
28
Using skew joins
What is a skew join and how do I use it
• Specify Skewed Table
df.hint("skew") .join(df2, $"df1_id" === $"df2_id")
29
Solving for the 4 S’s
30
So I found spill, now what
Not always obvious to find; the SQL plan is most informative
• More partitions
• Higher ratio of ram/core
• Reduce the number of cores the executor can use
– spark.executor.cores < total cores per worker
• Is the spill happening in a pre-reduce sort?
– Spark tries to reduce locally first; sometimes this has very little effect and takes a lot of time
– spark.databricks.execution.singlePassAggregate.enabled = true
31
So I found skew, now what
Very obvious from task durations and Ganglia CPU
• Ganglia CPU usage looks like a standard deviation curve
• Task durations: max and 75th percentile >> average and 25th percentile
• Use a Skew join
• Filter out large keys, or salt keys and set up multiple reduce steps (salting sketch below)
• Explicitly repartition the data on a different field
– Spend 5 mins repartitioning vs 10-15 mins for huge partitions
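One concrete form of the salting idea above, applied to a skewed join; the column names, paths, and salt count are invented for illustration:

import org.apache.spark.sql.functions._

val numSalts = 16
val bigDf    = spark.read.parquet("/mnt/data/big_skewed")   // placeholder inputs
val smallDf  = spark.read.parquet("/mnt/data/small")

// Spread the big side's hot keys across numSalts buckets
val saltedBig = bigDf.withColumn("salt", (rand() * numSalts).cast("int"))

// Duplicate the small side once per bucket so every (key, salt) pair can match
val saltedSmall = smallDf.withColumn("salt", explode(array((0 until numSalts).map(i => lit(i)): _*)))

// Join on (key, salt) so no single partition carries an entire hot key
val joined = saltedBig.join(saltedSmall, Seq("join_key", "salt"))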
32
So I found shuffle, now what
Lots of joins, low compute on ganglia, huh?
• Reduce number of partitions
– Node-local vs remote reads: fewer partitions -> more node-local reads = less shuffle
• Review Join order
– Look at the plan, are we joining the biggest table several times?
– Join the smallest tables first and work up to the big one
33
So I found shuffle, now what (cont.)
Green and blue lines look to be maxed out on ganglia
• Size disparity between tables? 1 TB joined to 10 GB?
– Use fewer, bigger machines and broadcast the 10 GB one (see the sketch below)
• Shuffle the 1 TB side once vs shipping 10 GB to n executors
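Sketch of forcing the broadcast (table paths are placeholders):

import org.apache.spark.sql.functions.broadcast

val huge  = spark.read.parquet("/mnt/data/fact_1tb")   // ~1 TB side
val small = spark.read.parquet("/mnt/data/dim_10gb")   // ~10 GB side

// Ship the small side to every executor once instead of shuffling the 1 TB side;
// the executors need enough memory headroom to hold the broadcast
val joined = huge.join(broadcast(small), "key")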
34
So I found a lot of small files, now what
This one is a problem that we can’t easily fix
• If you’re processing a lot of small files in batch, it’s probably better to treat
it as a streaming application
– Small etl-streaming cluster that writes to Delta and compacts the files to help downstream processing (sketch below)
• Use the S3-SQS connector + Trigger.Once() in DBR 4.2+
• S3 inventory future work
• Fix the upstream application building tons of files
• Ingest it into Delta first
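A minimal sketch of the etl-streaming idea above, using a plain file source and Trigger.Once; the S3-SQS connector setup is not shown, and the schema, paths, and checkpoint location are placeholders:

import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types._

val inputSchema = new StructType()
  .add("id", StringType)
  .add("ts", TimestampType)

val input = spark.readStream
  .schema(inputSchema)
  .json("/mnt/raw/small-files/")                          // plain file source standing in for S3-SQS

input.writeStream
  .format("delta")                                        // land in Delta so files can be compacted for downstream jobs
  .option("checkpointLocation", "/mnt/chk/small-files")
  .trigger(Trigger.Once())                                // run as a scheduled, incremental batch
  .start("/mnt/delta/ingest")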
35
Thank you
Joe Widen
36