Spark Shuffle Deep Dive
Bo Yang
Content
• Overview
• Major Classes
• Shuffle Writer
• Spark Serializer
• Shuffle Reader
• External Shuffle Service
• Suggestions
Shuffle Overview
• Mapper 1: Orange 3, Apple 2, Peach 5, Pear 1
• Mapper 2: Peach 3, Banana 2, Grape 5
• Reducer 1: Apple 2, Peach 8, Pear 1
• Reducer 2: Grape 5, Orange 3
• Reducer 3: Banana 2
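The picture above can be reproduced with a small sketch. This is plain Python, not Spark code. The key-to-reducer table is a hypothetical stand-in copied from the slide; Spark would normally derive the assignment from a partitioner.

```python
from collections import defaultdict

# Hypothetical key-to-reducer assignment, copied from the slide's picture.
# Spark would normally compute this with a partitioner (for example, a hash of the key).
REDUCER_FOR_KEY = {
    "Apple": 1, "Peach": 1, "Pear": 1,
    "Grape": 2, "Orange": 2,
    "Banana": 3,
}

def shuffle_and_sum(mapper_outputs):
    """Route every (key, count) record to its reducer and sum the counts per key."""
    reducers = defaultdict(lambda: defaultdict(int))
    for records in mapper_outputs:
        for key, count in records:
            reducers[REDUCER_FOR_KEY[key]][key] += count
    return {rid: dict(sorted(totals.items())) for rid, totals in sorted(reducers.items())}

mapper_1 = [("Orange", 3), ("Apple", 2), ("Peach", 5), ("Pear", 1)]
mapper_2 = [("Peach", 3), ("Banana", 2), ("Grape", 5)]
result = shuffle_and_sum([mapper_1, mapper_2])
```

Peach appears in both mappers, so Reducer 1 ends up with the combined count of 8.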
High Level Abstraction
• Pluggable Interface: ShuffleManager
• registerShuffle(…)
• getWriter(…)
• getReader(…)
• Configurable: spark.shuffle.manager=xxx
• Mapper: ShuffleWriter
• write(records: Iterator)
• Reducer: ShuffleReader
• read(): Iterator
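A rough Python paraphrase of the three roles listed above. The method names follow the slide, but the parameter lists are simplified assumptions and do not match Spark's real Scala signatures. InMemoryShuffleManager and its toy partitioner are hypothetical and exist only to show how the pieces plug together.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterator, List, Tuple

Record = Tuple[str, int]

class ShuffleWriter(ABC):
    @abstractmethod
    def write(self, records: Iterator[Record]) -> None: ...

class ShuffleReader(ABC):
    @abstractmethod
    def read(self) -> Iterator[Record]: ...

class ShuffleManager(ABC):
    @abstractmethod
    def register_shuffle(self, shuffle_id: int, num_partitions: int) -> None: ...

    @abstractmethod
    def get_writer(self, shuffle_id: int, map_id: int) -> ShuffleWriter: ...

    @abstractmethod
    def get_reader(self, shuffle_id: int, partition: int) -> ShuffleReader: ...

class InMemoryShuffleManager(ShuffleManager):
    """Hypothetical toy manager that keeps every partition in a Python list."""

    def __init__(self) -> None:
        self._blocks: Dict[int, List[List[Record]]] = {}

    def register_shuffle(self, shuffle_id: int, num_partitions: int) -> None:
        self._blocks[shuffle_id] = [[] for _ in range(num_partitions)]

    def get_writer(self, shuffle_id: int, map_id: int) -> ShuffleWriter:
        partitions = self._blocks[shuffle_id]

        class _Writer(ShuffleWriter):
            def write(self, records: Iterator[Record]) -> None:
                for key, value in records:
                    # Toy partitioner: sum of character codes modulo the partition count.
                    partitions[sum(map(ord, key)) % len(partitions)].append((key, value))

        return _Writer()

    def get_reader(self, shuffle_id: int, partition: int) -> ShuffleReader:
        block = self._blocks[shuffle_id][partition]

        class _Reader(ShuffleReader):
            def read(self) -> Iterator[Record]:
                return iter(block)

        return _Reader()

manager = InMemoryShuffleManager()
manager.register_shuffle(shuffle_id=0, num_partitions=2)
manager.get_writer(0, map_id=0).write(iter([("a", 1), ("b", 2)]))
manager.get_writer(0, map_id=1).write(iter([("a", 3)]))
partition_1 = list(manager.get_reader(0, partition=1).read())
```

Both mappers write through their own writer, and one reader per partition sees the records from all of them.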
Implementations
• SortShuffleManager (extends ShuffleManager)
• Three Writers (optimized for different scenarios)
• SortShuffleWriter: uses ExternalSorter
• BypassMergeSortShuffleWriter: no sorter
• UnsafeShuffleWriter: uses ShuffleExternalSorter
• One Reader
• BlockStoreShuffleReader, which uses
• ExternalAppendOnlyMap (for aggregation)
• ExternalSorter (if key ordering is required)
Writer Output Example (Shuffle Files)
• Each mapper (Mapper 1, Mapper 2) writes one Data File and one Index File
• Data File: Partition 1, Partition 2, Partition 3, stored one after another
• Index File: Offset 1, Offset 2, Offset 3, one offset per partition
• Reducer 1, Reducer 2 and Reducer 3 each read their own partition from every mapper's Data File
• Number of Partitions == Number of Reducers
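A minimal sketch of the data file plus index file idea, using in-memory byte strings instead of real files. The exact index layout used here (one 8-byte big-endian offset per partition boundary, so N + 1 entries for N partitions) is my understanding of Spark's format and should be treated as an assumption. The function names are hypothetical, and partitions are numbered from 0 here rather than from 1 as in the slide.

```python
import struct
from typing import List, Tuple

def build_shuffle_files(partitions: List[bytes]) -> Tuple[bytes, bytes]:
    """Concatenate partitions into a data blob and record cumulative offsets in an index blob."""
    offsets = [0]
    for block in partitions:
        offsets.append(offsets[-1] + len(block))
    data = b"".join(partitions)
    # One 8-byte big-endian integer per offset: len(partitions) + 1 entries in total.
    index = struct.pack(f">{len(offsets)}q", *offsets)
    return data, index

def read_partition(data: bytes, index: bytes, reduce_id: int) -> bytes:
    """Look up [start, end) for one partition in the index, then slice the data blob."""
    start, end = struct.unpack_from(">2q", index, reduce_id * 8)
    return data[start:end]

data, index = build_shuffle_files([b"apple,peach", b"", b"banana"])
first = read_partition(data, index, 0)
empty = read_partition(data, index, 1)
last = read_partition(data, index, 2)
```

A reducer only needs two adjacent offsets to find its byte range, which is why a single data file per mapper is enough.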
Three Shuffle Writers
• Different Writer Algorithms
• SortShuffleWriter
• BypassMergeSortShuffleWriter
• UnsafeShuffleWriter
• Used in different situations (optimizations)
• Things to consider
• Reduce total number of files
• Reduce serialization/deserialization when possible
When Are Different Writers Used?
• Small number of partitions (and no map side combine)?
---> BypassMergeSortShuffleWriter
• Able to sort records in serialized form?
---> UnsafeShuffleWriter
• Otherwise
---> SortShuffleWriter
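The decision order above can be written as one function. This is a sketch of the selection logic as I understand it, not Spark's code. The function name and parameters are hypothetical, the default threshold of 200 is the documented default of spark.shuffle.sort.bypassMergeThreshold, and the 2^24 partition limit for the serialized path is an assumption derived from the pointer layout.

```python
# Assumed upper bound on partitions for the serialized (Unsafe) path.
MAX_SERIALIZED_MODE_PARTITIONS = 1 << 24

def choose_writer(num_partitions: int,
                  map_side_combine: bool,
                  serializer_supports_relocation: bool,
                  bypass_merge_threshold: int = 200) -> str:
    """Return the name of the writer that would be picked for a shuffle."""
    # 1. Few partitions and nothing to combine on the map side: skip sorting entirely.
    if not map_side_combine and num_partitions <= bypass_merge_threshold:
        return "BypassMergeSortShuffleWriter"
    # 2. Records can be sorted while still serialized.
    if (serializer_supports_relocation
            and not map_side_combine
            and num_partitions <= MAX_SERIALIZED_MODE_PARTITIONS):
        return "UnsafeShuffleWriter"
    # 3. General-purpose fallback.
    return "SortShuffleWriter"
```

The checks run in the same order as the slide: bypass first, serialized sorting second, the general writer last.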
BypassMergeSortShuffleWriter
One file for each partition, then merge them
Mapper ---> BypassMergeSortShuffleWriter ---> write ---> Temp File: Partition 0, Temp File: Partition 1, …, Temp File: Partition X ---> merge ---> Data File + Index File
BypassMergeSortShuffleWriter (cont’d)
Used when
• No map side combine
• Number of partitions <= spark.shuffle.sort.bypassMergeThreshold (default 200)
Pros
• Simple
Cons
• 1 to 1 mapping between temp file and partition
• Many temp files
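A small sketch of the "one temp file per partition, then merge" approach, using ordinary files in a temporary directory. The function name, file names and record shape (partition ID, payload bytes) are all hypothetical; real Spark also writes an index file and serializes records through a stream.

```python
import os
import tempfile
from typing import Iterable, List, Tuple

def bypass_merge_write(records: Iterable[Tuple[int, bytes]],
                       num_partitions: int,
                       out_dir: str) -> Tuple[str, List[int]]:
    """Write each record to its partition's temp file, then concatenate the files in partition order."""
    temp_paths = [os.path.join(out_dir, f"temp_partition_{p}") for p in range(num_partitions)]
    temp_files = [open(path, "wb") for path in temp_paths]
    try:
        for partition_id, payload in records:
            temp_files[partition_id].write(payload)
    finally:
        for f in temp_files:
            f.close()

    data_path = os.path.join(out_dir, "shuffle.data")
    lengths = []
    with open(data_path, "wb") as data_file:
        for path in temp_paths:
            with open(path, "rb") as f:
                block = f.read()
            data_file.write(block)
            lengths.append(len(block))  # partition lengths are what an index file would be built from
            os.remove(path)
    return data_path, lengths

with tempfile.TemporaryDirectory() as tmp:
    records = [(1, b"peach"), (0, b"apple"), (1, b"pear"), (2, b"banana")]
    data_path, lengths = bypass_merge_write(records, num_partitions=3, out_dir=tmp)
    with open(data_path, "rb") as f:
        merged = f.read()
```

No sorting happens at any point, but the number of simultaneously open files equals the number of partitions, which is why this path only makes sense for small partition counts.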
SortShuffleWriter
• Why sort?
• Sort records by partition ID, so that records for the same partition are stored together
• Reduces the number of files: number of spill files < number of partitions
• Buffer (in memory):
• PartitionedAppendOnlyMap (when there is map side combine)
• PartitionedPairBuffer (when there is no map side combine)
Mapper ---> SortShuffleWriter ---> ExternalSorter Buffer ---> Spill File (Sorted), …, Spill File (Sorted) ---> merge ---> Data File + Index File
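The buffer, spill and merge flow can be sketched in a few lines. This is an illustration only: spills are kept as in-memory lists rather than files, the buffer limit is a record count rather than a memory estimate, and the function name is hypothetical.

```python
import heapq
from typing import Iterable, List, Tuple

Record = Tuple[int, str]  # (partition ID, payload)

def sort_with_spills(records: Iterable[Record], max_buffered: int) -> Tuple[List[Record], int]:
    """Buffer records, spill a sorted run whenever the buffer is full, then merge all runs."""
    spills: List[List[Record]] = []
    buffer: List[Record] = []
    for record in records:
        buffer.append(record)
        if len(buffer) >= max_buffered:
            # Each spill is sorted by partition ID before it leaves memory.
            spills.append(sorted(buffer, key=lambda r: r[0]))
            buffer = []
    # The last buffer is merged directly, without being counted as a spill.
    runs = spills + [sorted(buffer, key=lambda r: r[0])]
    merged = list(heapq.merge(*runs, key=lambda r: r[0]))
    return merged, len(spills)

records = [(2, "c1"), (0, "a1"), (1, "b1"), (0, "a2"), (2, "c2")]
merged, spill_count = sort_with_spills(records, max_buffered=2)
```

Because every run is already sorted by partition ID, the final merge is a streaming k-way merge and the output groups each partition's records together.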
SortShuffleWriter (cont’d)
Used when
• Has map side combine, or
• Many partitions and the serializer does not support record relocation
Pros
• Flexible, supports all shuffle situations
Cons
• Serializes/deserializes multiple times
Internal configurations that control spill behavior
(inside Spillable.scala):
spark.shuffle.spill.initialMemoryThreshold
spark.shuffle.spill.numElementsForceSpillThreshold
UnsafeShuffleWriter
• Each record is serialized once, then stored in memory pages
• 8-byte record pointer (pointing to: memory page + offset)
• All record pointers are stored in a long array
• Sort the record pointers (long array)
• Small memory footprint
• Fits the CPU cache better
• Sorter class: ShuffleExternalSorter
Diagram: serialized records live in Memory Page 1, Memory Page 2, …, Memory Page xxx; the pointers Record 1 (8 bytes), Record 2 (8 bytes), … are stored and sorted as an array.
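A sketch of how a partition ID, a page number and an in-page offset can share one 64-bit value, so that sorting plain integers groups records by partition. The 24/13/27 bit split is my recollection of Spark's packed pointer layout and should be treated as an assumption; the function names are hypothetical.

```python
# Assumed bit layout, 64 bits in total: [partition ID | page number | offset in page]
PARTITION_BITS, PAGE_BITS, OFFSET_BITS = 24, 13, 27

def pack_pointer(partition_id: int, page: int, offset: int) -> int:
    """Pack the three fields into one integer, with the partition ID in the highest bits."""
    assert 0 <= partition_id < (1 << PARTITION_BITS)
    assert 0 <= page < (1 << PAGE_BITS)
    assert 0 <= offset < (1 << OFFSET_BITS)
    return (partition_id << (PAGE_BITS + OFFSET_BITS)) | (page << OFFSET_BITS) | offset

def unpack_pointer(pointer: int):
    """Recover (partition ID, page, offset) from a packed pointer."""
    offset = pointer & ((1 << OFFSET_BITS) - 1)
    page = (pointer >> OFFSET_BITS) & ((1 << PAGE_BITS) - 1)
    partition_id = pointer >> (PAGE_BITS + OFFSET_BITS)
    return partition_id, page, offset

pointers = [
    pack_pointer(2, 0, 0),
    pack_pointer(0, 1, 64),
    pack_pointer(1, 0, 128),
    pack_pointer(0, 0, 32),
]
pointers.sort()  # sorting 8-byte integers; the serialized records themselves never move
sorted_partitions = [unpack_pointer(p)[0] for p in pointers]
```

This sketch sorts on the whole integer, which also orders by page and offset inside a partition; only the partition ID ordering matters for the shuffle output.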
UnsafeShuffleWriter (cont’d)
Used when
• Serializer supports record relocation
• No aggregator
Pros
• Single serialization, no deserialization/serialization when merging spill files
• Sorting is CPU cache friendly
Cons
• Not supported with the default serializer (JavaSerializer); supported with KryoSerializer
Serializer: JavaSerializer
• Default serializer in Spark
• spark.serializer=org.apache.spark.serializer.JavaSerializer
• Uses object references in the serialized stream
• Writes a reference instead of the whole object when the same object repeats
• Does not support record relocation
• Records cannot be moved within the serialized stream because of these object references
• Pros: supports serialization in all situations
• Cons: poor performance
Serializer: KryoSerializer
• Uses the Kryo library
• By default, does not carry object references from one record to the next in the serialized stream
• Supports record relocation
• Because no object reference crosses a record boundary, each serialized record is independent
• Classes should be explicitly registered for serialization; otherwise Kryo writes the fully qualified class name with each serialized object
• Pros: good performance for common classes and registered classes (see KryoSerializer.scala)
• Cons: poor performance for custom classes that are not registered, so they need to be registered explicitly
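Why back-references block record relocation can be shown with an analogy in Python's pickle module. This is not Java serialization or Kryo; it only mirrors the idea. A Pickler reused across records remembers earlier objects and writes short references to them, while a fresh pickle per record is self-contained. The helper names are hypothetical.

```python
import io
import pickle

shared_key = "customer-0001"
records = [(shared_key, 1), (shared_key, 2)]

def serialize_with_shared_references(recs):
    """One Pickler for the whole stream: later records may point back at earlier ones."""
    buffer = io.BytesIO()
    pickler = pickle.Pickler(buffer)
    chunks = []
    for rec in recs:
        start = buffer.tell()
        pickler.dump(rec)
        chunks.append(buffer.getvalue()[start:])
    return chunks

def serialize_independently(recs):
    """A fresh pickle per record: every chunk carries everything it needs."""
    return [pickle.dumps(rec) for rec in recs]

def can_read_alone(chunk):
    """True if the chunk can be decoded without the bytes that preceded it."""
    try:
        pickle.loads(chunk)
        return True
    except Exception:
        return False

shared = [can_read_alone(c) for c in serialize_with_shared_references(records)]
independent = [can_read_alone(c) for c in serialize_independently(records)]
```

In the shared stream the second record refers back to the key written by the first, so it cannot be moved or read on its own. That is the property a sorter working on serialized bytes cannot tolerate.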
Shuffle Reader: BlockStoreShuffleReader
• Mapper 1 and Mapper 2 each provide a Data File (Partition 1, Partition 2, Partition 3) and an Index File (Offset 1, Offset 2, Offset 3)
• Reducer: BlockStoreShuffleReader reads its partition from each mapper's Data File
• Aggregator: ExternalAppendOnlyMap (uses HashComparator) ---> Spill File, …, Spill File ---> Iterator
• If ordering by key: ExternalSorter ---> Iterator
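The reduce side can be sketched as: collect one partition from every mapper, aggregate by key, and sort only if key ordering is requested. This is an in-memory illustration with no spilling; the function name and the dictionary-based map outputs are hypothetical.

```python
from typing import Dict, List, Tuple

Record = Tuple[str, int]

def read_reduce_partition(map_outputs: List[Dict[int, List[Record]]],
                          reduce_id: int,
                          key_ordering: bool = False) -> List[Record]:
    """Fetch one partition from every mapper, sum values per key, optionally sort by key."""
    totals: Dict[str, int] = {}
    for output in map_outputs:
        for key, value in output.get(reduce_id, []):
            totals[key] = totals.get(key, 0) + value  # the aggregation step
    items = list(totals.items())
    return sorted(items) if key_ordering else items   # the optional ordering step

# Map outputs keyed by reducer number, following the fruit example from the overview slide.
mapper_1 = {1: [("Apple", 2), ("Peach", 5), ("Pear", 1)], 2: [("Orange", 3)]}
mapper_2 = {1: [("Peach", 3)], 2: [("Grape", 5)], 3: [("Banana", 2)]}
reducer_1 = read_reduce_partition([mapper_1, mapper_2], reduce_id=1, key_ordering=True)
```

In Spark the aggregation map and the sorter can spill to disk when memory runs short; this sketch leaves that part out.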
External Shuffle Service
• YarnShuffleService / MesosExternalShuffleService
• YarnShuffleService: runs inside the YARN NodeManager as an AuxiliaryService
• Runs on each machine in the YARN/Mesos cluster
• Gets shuffle files from local disk and streams them to reducers
• Uses a file name convention to locate shuffle files (ExternalShuffleBlockResolver)
• "shuffle_" + shuffleId + "_" + mapId + "_0.index"
• "shuffle_" + shuffleId + "_" + mapId + "_0.data"
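The naming convention above, written as a helper. The function name is hypothetical; the string pattern is taken directly from the slide.

```python
def shuffle_file_names(shuffle_id: int, map_id: int):
    """Build the index and data file names for one map task's shuffle output."""
    base = f"shuffle_{shuffle_id}_{map_id}_0"
    return base + ".index", base + ".data"

index_name, data_name = shuffle_file_names(shuffle_id=3, map_id=7)
```

Because the names are fully determined by the shuffle ID and map ID, the service can locate the files without asking the executor that wrote them.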
Suggestions / Takeaway
• Shuffle is expensive; avoid unnecessary shuffles
• Shuffle vs Cache (Dataset.persist(…))
• Shuffle files provide the full data set for next stage execution
• Caching may not be necessary when there is a shuffle (unless cache replicas are wanted)
• Use KryoSerializer if possible
• Tune configurations
• spark.shuffle.sort.bypassMergeThreshold
• spark.shuffle.spill.initialMemoryThreshold
• spark.shuffle.spill.numElementsForceSpillThreshold
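One possible way to pass such settings is as --conf key=value pairs on the spark-submit command line. The helper below is a hypothetical convenience that only formats the arguments. The threshold value of 400 is purely illustrative and is not a recommendation.

```python
def to_spark_submit_args(conf):
    """Turn a dict of Spark properties into a flat list of --conf arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args

shuffle_conf = {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.shuffle.sort.bypassMergeThreshold": 400,  # illustrative value only
}
submit_args = to_spark_submit_args(shuffle_conf)
```

Any change of this kind is best validated against the actual workload, since the best values depend on partition counts and record sizes.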

