Data Distribution and Ordering for Efficient Data Source V2
Anton Okolnychyi
This is not a contribution
Data + AI Summit 2021
Presenter
• Apache Iceberg PMC member
• Apache Spark contributor
• Data Lakes at Apple
• Open source enthusiast
Agenda
• Why V2?
• Data distribution and ordering
• Future work
What’s wrong with V1?
Reliability
• Behavior of DataFrameWriter is not defined
- Connectors interpret SaveMode differently
- SaveIntoDataSourceCommand vs InsertIntoDataSourceCommand
Reliability
• Validation rules are not consistent
- PreprocessTableCreation vs PreprocessTableInsertion
- No schema validation for path-based tables
Design choices
• Connectors interact with internal APIs
- SQLContext
- RDD
- DataFrame
Extensibility
• Hard to support new features
- No easy way to extend PrunedFilterScan
- Exposing ColumnarBatch instead of Row is challenging
Features
• No Structured Streaming support
• No multi-catalog support
• Limited bucketed tables support
What’s different in V2?
Reliability
• Predictable and reliable behavior
- Clearly defined logical plans for all connectors
- Consistent validation rules
- Less delegation to connectors
Design choices
• Proper abstractions
- Connectors interact only with InternalRow and ColumnarBatch
- Mix-in traits for optional functionality
Features
• Multi-catalog support
• Structured Streaming
• Vectorization
• Bucketed tables (in progress)
Data distribution and ordering
Distribution
Ordering
Why should I care?
Impact
• Writes
- Control the number of generated files
- Reduce the overall memory consumption
- Reduce the actual writing time
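The effect on file counts can be illustrated with a small simulation. This is a minimal sketch, not Spark code: it assumes a writer that opens one file per (task, table partition) pair, models an unspecified distribution as round-robin assignment, and models a clustered distribution as hashing on the partition value. All class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

final class FileCountSketch {
    // Hypothetical data: record i belongs to table partition (i / tasks) % partitions,
    // chosen so that round-robin assignment spreads every partition over every task.
    static int[] samplePartitions(int records, int partitions, int tasks) {
        int[] result = new int[records];
        for (int i = 0; i < records; i++) {
            result[i] = (i / tasks) % partitions;
        }
        return result;
    }

    // Unspecified distribution, modelled as round-robin: record i goes to task i % tasks.
    static int[] roundRobin(int records, int tasks) {
        int[] result = new int[records];
        for (int i = 0; i < records; i++) {
            result[i] = i % tasks;
        }
        return result;
    }

    // Clustered distribution: all records of one table partition go to the same task.
    static int[] clustered(int[] partitionOfRecord, int tasks) {
        int[] result = new int[partitionOfRecord.length];
        for (int i = 0; i < partitionOfRecord.length; i++) {
            result[i] = Math.floorMod(Integer.hashCode(partitionOfRecord[i]), tasks);
        }
        return result;
    }

    // Assumption: the writer opens one file per (task, partition) pair it sees.
    static int countFiles(int[] partitionOfRecord, int[] taskOfRecord) {
        Set<String> files = new HashSet<>();
        for (int i = 0; i < partitionOfRecord.length; i++) {
            files.add(taskOfRecord[i] + "/" + partitionOfRecord[i]);
        }
        return files.size();
    }

    public static void main(String[] args) {
        int[] p = samplePartitions(1000, 10, 4);
        System.out.println(countFiles(p, roundRobin(p.length, 4))); // prints 40
        System.out.println(countFiles(p, clustered(p, 4)));         // prints 10
    }
}
```

With 10 table partitions and 4 tasks, the round-robin case produces one file per task per partition, while the clustered case produces one file per partition. Fewer open files per task also means fewer concurrent writers and buffers held in memory.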
© 2021 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.
Unspecified distribution
Proper distribution
Unspecified ordering
Proper ordering
Impact
• Reads
- Cluster data on write for faster reads
- Enable efficient data skipping
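The data skipping point can be sketched with per-file min/max statistics. This is a hypothetical model rather than any connector's implementation: values are split into fixed-size files, and a file has to be read only if its [min, max] range overlaps the queried range.

```java
import java.util.Arrays;

final class DataSkippingSketch {
    // Splits values into files of fileSize records and counts the files whose
    // [min, max] range overlaps the query range [lo, hi]; only those must be read.
    static int filesToRead(int[] values, int fileSize, int lo, int hi) {
        int count = 0;
        for (int start = 0; start < values.length; start += fileSize) {
            int end = Math.min(start + fileSize, values.length);
            int min = Integer.MAX_VALUE;
            int max = Integer.MIN_VALUE;
            for (int i = start; i < end; i++) {
                min = Math.min(min, values[i]);
                max = Math.max(max, values[i]);
            }
            if (min <= hi && max >= lo) {
                count++;
            }
        }
        return count;
    }

    // Hypothetical column: a fixed permutation of 0..99 standing in for unordered input.
    static int[] unorderedValues() {
        int[] values = new int[100];
        for (int i = 0; i < values.length; i++) {
            values[i] = (i * 37) % 100;
        }
        return values;
    }

    public static void main(String[] args) {
        int[] unordered = unorderedValues();
        int[] sorted = unordered.clone();
        Arrays.sort(sorted);
        // Query: 40 <= value <= 44, ten records per file.
        System.out.println(filesToRead(unordered, 10, 40, 44)); // prints 10
        System.out.println(filesToRead(sorted, 10, 40, 44));    // prints 1
    }
}
```

On unordered input every file spans nearly the whole value range, so no file can be skipped. After sorting, each file covers a narrow range and the query touches a single file.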
Impact
• Storage footprint
- Columnar encodings perform better on sorted data (e.g. dictionary encoding)
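As a rough illustration of why sorted data encodes better, the sketch below counts the runs a run-length encoder would emit for a low-cardinality column. Run-length encoding is used here as a stand-in because it is simple to reason about (columnar formats often apply it to dictionary ids); the column contents and names are hypothetical.

```java
import java.util.Arrays;

final class RunLengthSketch {
    // Number of (value, count) pairs a run-length encoder would emit.
    static int countRuns(int[] values) {
        if (values.length == 0) {
            return 0;
        }
        int runs = 1;
        for (int i = 1; i < values.length; i++) {
            if (values[i] != values[i - 1]) {
                runs++;
            }
        }
        return runs;
    }

    // Hypothetical column with 5 distinct values that alternate on every row.
    static int[] sampleColumn(int rows) {
        int[] values = new int[rows];
        for (int i = 0; i < rows; i++) {
            values[i] = i % 5;
        }
        return values;
    }

    public static void main(String[] args) {
        int[] column = sampleColumn(1000);
        int[] sorted = column.clone();
        Arrays.sort(sorted);
        System.out.println(countRuns(column)); // prints 1000
        System.out.println(countRuns(sorted)); // prints 5
    }
}
```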
How do connectors control this?
Data Source V1
• Connectors can apply arbitrary transformations on DataFrame
• Built-in connectors sort data within tasks using partition columns
Data Source V2
• No way to control (SPARK-23889)
• Severe performance issues unless explicitly handled by the user
• Blocks migration to V2
• Fixed in upcoming Spark 3.2
Solution
Use cases
• Global sort
• Cluster + sort within tasks
• Local sort within tasks
• No distribution and sort
API
interface WriteBuilder {
Write build();
}
API
interface Write {
BatchWrite toBatch();
StreamingWrite toStreaming();
}
API
interface RequiresDistributionAndOrdering extends Write {
Distribution requiredDistribution();
SortOrder[] requiredOrdering();
}
Distributions
• OrderedDistribution
• ClusteredDistribution
• UnspecifiedDistribution
SortOrder
interface SortOrder extends Expression {
Expression expression();
SortDirection direction();
NullOrdering nullOrdering();
}
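One way to read the four use cases listed earlier is as combinations of a required distribution and a required ordering. The sketch below spells out that mapping with simplified stand-in types so that it compiles without Spark. The nested classes are NOT the Spark classes: columns are plain strings instead of Expression and SortOrder, and the Requirements holder and factory methods are hypothetical.

```java
import java.util.Arrays;

final class RequiredLayoutSketch {
    // Stand-ins for the interfaces on the slides, not the Spark classes.
    interface Distribution {}

    static final class OrderedDistribution implements Distribution {
        final String[] ordering;
        OrderedDistribution(String... ordering) { this.ordering = ordering; }
    }

    static final class ClusteredDistribution implements Distribution {
        final String[] clustering;
        ClusteredDistribution(String... clustering) { this.clustering = clustering; }
    }

    static final class UnspecifiedDistribution implements Distribution {}

    // Hypothetical holder for what a write would return from
    // requiredDistribution() and requiredOrdering().
    static final class Requirements {
        final Distribution distribution;
        final String[] ordering;
        Requirements(Distribution distribution, String... ordering) {
            this.distribution = distribution;
            this.ordering = ordering;
        }
    }

    // Global sort: distribute by ranges of the sort key, then sort each task by it.
    static Requirements globalSort(String... sortColumns) {
        return new Requirements(new OrderedDistribution(sortColumns), sortColumns);
    }

    // Cluster + sort within tasks: co-locate equal keys, then sort each task.
    static Requirements clusterAndSort(String[] clusterColumns, String... sortColumns) {
        return new Requirements(new ClusteredDistribution(clusterColumns), sortColumns);
    }

    // Local sort within tasks: no shuffle requirement, only a per-task sort.
    static Requirements localSort(String... sortColumns) {
        return new Requirements(new UnspecifiedDistribution(), sortColumns);
    }

    // No distribution and sort: the incoming data is written as is.
    static Requirements none() {
        return new Requirements(new UnspecifiedDistribution());
    }

    public static void main(String[] args) {
        Requirements r = clusterAndSort(new String[] {"day"}, "day", "id");
        System.out.println(r.distribution.getClass().getSimpleName()
            + " " + Arrays.toString(r.ordering)); // prints ClusteredDistribution [day, id]
    }
}
```

In a real connector the same choices would be expressed by implementing RequiresDistributionAndOrdering on the Write returned from the WriteBuilder.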
Current state
• Available and fully functional in master for batch queries
• Structured Streaming support is in progress (SPARK-34183)
Future work
• Distribution and ordering in CREATE TABLE
• Ability to control the number of shuffle partitions
• Coalesce partitions during adaptive query execution
Key takeaways
Summary
• Consider migrating to Data Source V2
• Data distribution and ordering are critical at scale
Thank you!
TM and © 2021 Apple Inc. All rights reserved.
