Sean Suchter
CTO @ Pepperdata
Spark performance is too hard,
let’s make it easier
Pepperdata does performance (for Big Data)
• 15 thousand production nodes
• 50 million jobs/year
• 200 trillion performance data points
Today’s talk will cover…
• How code translates to execution
• How to find common, known problems
• For the rest of the problems…
– Why debugging performance problems is hard
– Data elements needed for a complete view of application
performance, gathered from separate tools
– Bringing these elements together in a single tool
Brief terminology about Spark
• An app contains
multiple jobs
• A job contains
multiple stages
• A stage contains
multiple tasks
• Executors run tasks
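The containment hierarchy above can be written down as a small data model. This is only an illustrative sketch: the case classes and the `Terminology` object are hypothetical names, not Spark's own classes, and the example app is made up.

```scala
object Terminology {
  // Hypothetical model of the hierarchy; these are not Spark's own classes.
  final case class Task(id: Int, executorId: Int)
  final case class Stage(id: Int, tasks: Seq[Task])
  final case class Job(id: Int, stages: Seq[Stage])
  final case class App(name: String, jobs: Seq[Job])

  // Total number of tasks an app runs, summed over all jobs and stages.
  def taskCount(app: App): Int =
    app.jobs.flatMap(_.stages).map(_.tasks.size).sum

  // The executors that ran at least one task of the app.
  def executorsUsed(app: App): Set[Int] =
    app.jobs.flatMap(_.stages).flatMap(_.tasks).map(_.executorId).toSet

  // Made-up app: one job, two stages, two tasks per stage, two executors.
  val example: App = App("wordcount", Seq(
    Job(0, Seq(
      Stage(0, Seq(Task(0, 1), Task(1, 2))),
      Stage(1, Seq(Task(2, 1), Task(3, 2)))
    ))
  ))
}
```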
Example App
A word count app:
val textFile = sc.textFile("hdfs:/dict.txt")
val counts = textFile.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs:/wordcounts.txt")
1. Declares input from
external storage
2. Specifies
transformations
3. Triggers an action
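To see what each transformation produces without a cluster, the same pipeline can be mimicked on a local Scala collection. This is a sketch under assumptions: the two input lines are made up, `groupBy` plus `sum` stands in for `reduceByKey(_ + _)`, and, unlike Spark, local collections evaluate eagerly rather than waiting for an action.

```scala
object LocalWordCount {
  // Made-up stand-in for the lines of hdfs:/dict.txt.
  val lines: Seq[String] = Seq("a b a", "b c")

  // Same shape as the Spark pipeline: split into words, pair each word with 1.
  val pairs: Seq[(String, Int)] =
    lines.flatMap(line => line.split(" ")).map(word => (word, 1))

  // Sum the counts per word. groupBy plus sum is a local substitute for
  // reduceByKey(_ + _), the step that requires a shuffle in Spark.
  val counts: Map[String, Int] =
    pairs.groupBy(_._1).map { case (word, ps) => word -> ps.map(_._2).sum }
}
```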
Distributed Architecture
Spark executes a job using
multiple machines.
[Diagram: the Spark Driver process sends tasks to Spark Executor processes 1, 2, … N]
Stages
Image source.
val textFile = sc.textFile("hdfs:/dict.txt")
val counts = textFile.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs:/wordcounts.txt")
Shuffle and Re-partitioning
Image source.
Stages and Tasks in Example Job
[Diagram: tasks grouped by stage: Task 0, Task 1, … Task n, then Task n+1, Task n+2, … Task n+m]
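A rough way to picture how the example job ends up with two stages is to cut the chain of operations wherever a shuffle is needed. The sketch below is a simplification, not Spark's scheduler: `Op`, `needsShuffle` and `splitIntoStages` are hypothetical names, and in real Spark the map-side part of `reduceByKey` runs before the shuffle.

```scala
object StagePlanner {
  // Hypothetical description of one operation in the lineage.
  // needsShuffle marks operations such as reduceByKey.
  final case class Op(name: String, needsShuffle: Boolean)

  // Start a new stage at every operation that needs a shuffle;
  // all other operations are pipelined into the current stage.
  def splitIntoStages(ops: Seq[Op]): Seq[Seq[String]] =
    ops.foldLeft(Vector(Vector.empty[String])) { (stages, op) =>
      if (op.needsShuffle && stages.last.nonEmpty) stages :+ Vector(op.name)
      else stages.init :+ (stages.last :+ op.name)
    }

  // The word count app from the earlier slide.
  val wordCount: Seq[Op] = Seq(
    Op("textFile", needsShuffle = false),
    Op("flatMap", needsShuffle = false),
    Op("map", needsShuffle = false),
    Op("reduceByKey", needsShuffle = true),
    Op("saveAsTextFile", needsShuffle = false)
  )
}
```

Each stage then runs one task per partition of its data, which is where the task numbering in the diagram comes from.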
Debugging known problems
The easier case…
Spark History Server
Intro: Dr. Elephant (MapReduce)
What does Dr. Elephant do?
• Performance monitoring and tuning service
• Finds common mistakes, indicates best practices
Spark Application Heuristics
3 Classes of Spark Heuristics
• Configuration Settings
• Simple Alarms on Stage/Job Failure
• Data-Dependent Tuning Suggestions
Configuration Heuristic
• Displays some basic config settings for your app
• Complains if some settings are not explicitly set
• Recommends configuring an external shuffle
service (especially if dynamic allocation is
enabled)
• These recommendations won’t change over
multiple runs of an application
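A configuration check of this kind might look like the sketch below. It is not Dr. Elephant's actual implementation: `ConfigHeuristic`, the list of expected keys and the message wording are assumptions for illustration, and this version only raises the shuffle-service advice when dynamic allocation is on. The two property names `spark.dynamicAllocation.enabled` and `spark.shuffle.service.enabled` are standard Spark settings.

```scala
object ConfigHeuristic {
  // Settings this sketch expects to be set explicitly (an illustrative choice).
  val expectedKeys: Seq[String] =
    Seq("spark.executor.memory", "spark.executor.cores", "spark.driver.memory")

  // Returns one human-readable complaint per problem found in the app's config.
  def check(conf: Map[String, String]): Seq[String] = {
    val missing =
      expectedKeys.filterNot(conf.contains).map(k => s"$k is not explicitly set")

    val dynamicAllocation = conf.get("spark.dynamicAllocation.enabled").contains("true")
    val shuffleService = conf.get("spark.shuffle.service.enabled").contains("true")
    val shuffleAdvice =
      if (dynamicAllocation && !shuffleService)
        Seq("dynamic allocation is enabled: configure the external shuffle service")
      else Seq.empty[String]

    missing ++ shuffleAdvice
  }
}
```

Because the result depends only on the submitted configuration, it is the same on every run of the application.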
Stages and Jobs Heuristics
• Simple alarms showing stage and job failure rates
• Good for seeing when there’s a problem
Executors Heuristic
• Looks at the distribution across executors of
several different metrics
• Outliers in these distributions probably indicate:
– Suboptimal partitioning.
– One or more slow executors due to external
circumstances (cluster weather)
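One simple way to flag outliers in a per-executor metric is to compare each executor against the median. The sketch below assumes that approach; the `ExecutorOutliers` name, the 1.5x factor and the sample numbers are illustrative choices, not what Dr. Elephant necessarily does.

```scala
object ExecutorOutliers {
  // Median of a non-empty sequence.
  def median(xs: Seq[Double]): Double = {
    val sorted = xs.sorted
    val n = sorted.size
    if (n % 2 == 1) sorted(n / 2) else (sorted(n / 2 - 1) + sorted(n / 2)) / 2.0
  }

  // Executors whose metric (for example task time or input bytes)
  // exceeds `factor` times the median across all executors.
  def outliers(metricByExecutor: Map[String, Double], factor: Double = 1.5): Set[String] =
    if (metricByExecutor.isEmpty) Set.empty
    else {
      val m = median(metricByExecutor.values.toSeq)
      metricByExecutor.collect { case (id, v) if v > factor * m => id }.toSet
    }
}
```

An outlier found this way does not say which of the two causes applies; that still needs partitioning details or a view of the rest of the cluster.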
Partitions Heuristic
• Ideally data for each task will fit into the RAM
available to that task.
• Sandy Ryza (formerly of Cloudera) has an
excellent blog post on Spark tuning:
Estimated number of partitions (tasks) =
  (observed shuffle write) * (observed shuffle spill memory) * (spark.executor.cores)
  / ((observed shuffle spill disk) * (spark.executor.memory) * (spark.shuffle.memoryFraction) * (spark.shuffle.safetyFraction))
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
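The expression on this slide is a fraction: the product of the first three terms divided by the product of the last four. The blog post uses it as a rough estimate of how many partitions a stage needs so that each task's shuffle data fits in memory, and suggests rounding the result up. A minimal sketch of the arithmetic follows; the function and parameter names are hypothetical, and `spark.shuffle.memoryFraction` and `spark.shuffle.safetyFraction` are legacy settings from Spark's pre-1.6 memory model.

```scala
object PartitionEstimate {
  // Numerator and denominator follow the slide term by term.
  // Byte quantities are Doubles to avoid overflow when multiplied.
  def estimate(
      shuffleWriteBytes: Double,
      shuffleSpillMemoryBytes: Double,
      shuffleSpillDiskBytes: Double,
      executorCores: Int,
      executorMemoryBytes: Double,
      shuffleMemoryFraction: Double,
      shuffleSafetyFraction: Double): Double = {
    val numerator = shuffleWriteBytes * shuffleSpillMemoryBytes * executorCores
    val denominator =
      shuffleSpillDiskBytes * executorMemoryBytes * shuffleMemoryFraction * shuffleSafetyFraction
    numerator / denominator
  }
}
```

The ratio of spill memory to spill disk approximates how much the shuffle data expands when deserialized, and executor memory times the two fractions divided by the core count approximates the shuffle memory available to one task.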
More Heuristics?
Yes, please! Dr. Elephant is open source.
https://github.com/linkedin/dr-elephant
Is there an enterprise version?
Pepperdata Application Profiler
• Benefits to our users:
– Provide simple answers to simple questions
– Combination of metrics for experts
– Simple actionable insights for all users
– Pepperdata support
• Why stay close to open source?
– Heuristics
Debugging novel problems
The harder case…
2 reasons this is hard
Reason #1
Same external symptom (“too slow”), but many possible
causes:
• code
• data
• configuration
• cluster weather
Reason #2
Existing tools provide limited visibility
• Spark Web UI is the most popular
– Good view of query execution plan (job/stages/DAG)
– Limited view of aggregate performance data
• Time series
– Ganglia, Ambari, Cloudera Manager (CM), etc. provide time series data for the
cluster (but not specific to Spark apps)
– Spark metrics sinks can feed InfluxDB and similar stores, yielding partial
Spark app metrics
• Code execution is not connected to resource consumption
• Load from other apps is unaccounted for
3 data elements form a complete picture
of Spark application performance
1. Code execution plan
– Indicates which block of code is being executed, and where
2. Time series view
– Visual of resource consumption of application
– Outliers in resource usage very easy to detect
3. Cluster weather
– A view of all applications that run on the cluster
Spark Web UI
First half of solution
Logical code execution plan from Spark:
Jobs / Stages / DAG
Physical execution plan from Spark:
Executors / Tasks
Time series view
Second half of solution
Time series view of resource consumption
for the App
Bring them together
Best of both worlds
Code Analyzer = execution plan + time series
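Conceptually, combining the two views is a join between stage run intervals from the execution plan and timestamped resource samples. The sketch below illustrates that idea only; it is not Pepperdata's implementation, and `StageRun`, `Sample` and `gcByStage` are hypothetical names with made-up fields.

```scala
object StageResourceJoin {
  // Hypothetical records: when a stage ran, and one resource sample
  // (here, milliseconds of GC observed at a point in time).
  final case class StageRun(stageId: Int, startMs: Long, endMs: Long)
  final case class Sample(timeMs: Long, gcMs: Double)

  // Sum the sampled GC time that falls inside each stage's interval
  // [startMs, endMs). Stages that overlap in time would each count
  // the shared samples; this sketch does not try to split them.
  def gcByStage(stages: Seq[StageRun], samples: Seq[Sample]): Map[Int, Double] =
    stages.map { s =>
      val inStage = samples.filter(x => x.timeMs >= s.startMs && x.timeMs < s.endMs)
      s.stageId -> inStage.map(_.gcMs).sum
    }.toMap
}
```

With a mapping like this, a spike in the time series can be traced back to the stage, and therefore the block of code, that was running when it occurred.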
GC across all Stages of App
Let’s examine GC activity in Stage 4
Executor skew increased Stage duration 2x
Executor 6 does twice as much work; a possible
solution is to increase the number of partitions
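The slide does not say why Executor 6 received twice the work, but one way such skew can arise is when there are only slightly more partitions than executors. The sketch below assumes equal-sized partitions dealt round-robin to executors that each run one task at a time; `PartitionSkew` and its functions are hypothetical names used only to show why more partitions even out the load.

```scala
object PartitionSkew {
  // Number of partitions each executor receives when `partitions`
  // equal-sized partitions are dealt round-robin to `executors`.
  def partitionsPerExecutor(partitions: Int, executors: Int): Seq[Int] = {
    require(executors > 0, "need at least one executor")
    (0 until executors).map { e =>
      partitions / executors + (if (e < partitions % executors) 1 else 0)
    }
  }

  // Work of the busiest executor relative to the least busy one.
  // Infinite when some executor receives no partition at all.
  def skewRatio(partitions: Int, executors: Int): Double = {
    val counts = partitionsPerExecutor(partitions, executors)
    counts.max.toDouble / counts.min
  }
}
```

With 7 partitions on 6 executors one executor handles two partitions while the rest handle one, so the stage takes about twice as long as a balanced run; with 60 partitions every executor handles ten.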
What if it’s not your fault?
Cluster weather
How does cluster weather impact your app?
No apparent reason for delay from Spark
Web UI
Time series shows slower run of app with
much lower resources
View cluster weather for slower run of app
Cluster weather reveals reason for CPU
constraints on slower app
Cluster weather reveals reason for
memory constraints on slower app
Cluster weather reveals reason for HDFS
constraints on slower app
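Quantifying cluster weather amounts to asking how much of a resource other applications were using while your app ran. The sketch below shows that calculation for CPU under simple assumptions: samples are taken at common timestamps for every app, and `Usage` and `otherAppsAvgCores` are hypothetical names, not part of any tool mentioned in the talk.

```scala
object ClusterWeather {
  // Hypothetical sample: CPU cores used by one application at one timestamp.
  final case class Usage(timeMs: Long, appId: String, cores: Double)

  // Average number of cores used by all other apps, taken over the
  // timestamps at which `appId` itself was running.
  def otherAppsAvgCores(samples: Seq[Usage], appId: String): Double = {
    val times = samples.filter(_.appId == appId).map(_.timeMs).toSet
    if (times.isEmpty) 0.0
    else {
      val others = samples.filter(s => s.appId != appId && times.contains(s.timeMs))
      others.map(_.cores).sum / times.size
    }
  }
}
```

Comparing this figure between a fast run and a slow run of the same app indicates whether the slowdown came from competing load rather than from the app itself.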
Code Analyzer for Apache Spark
• Free during Early Access starting today
• Early Access is for development teams
• To learn more visit booth #101
• info@pepperdata.com
pepperdata.com/products/code-analyzer
Other performance tools mentioned
• Dr. Elephant
– github.com/linkedin/dr-elephant
• Application Profiler
– www.pepperdata.com/products/application-profiler/
To recap
• Use heuristics to find known problems
• Execution plan + time series = powerful visualization
• Knowing cluster weather can prevent time wasted
debugging performance “issues” that aren’t the app’s
fault
Spark Summit Talk Plugs
Tuesday 11:40AM Connect Code to Resource Consumption to Scale Your
Production Spark Applications (Vinod @ Pepperdata)
Tuesday 12:50PM Kubernetes SIG Big Data Birds-of-a-Feather session
(many)
Tuesday 3:20PM Apache Spark on Kubernetes (Anirudh @ Google, Tim @
Hyperpilot)
Wednesday 11:00AM HDFS on Kubernetes – Lessons Learned (Kimoon @
Pepperdata)
Wednesday 11:00AM Dr. Elephant for Monitoring and Tuning Apache Spark Jobs
on Hadoop (Carl @ LinkedIn, Simon @ Pepperdata)
Thank You.
www.pepperdata.com/products/code-analyzer/
ssuchter@pepperdata.com
