Data Distribution and Ordering for Efficient Data Source V2

Anton Okolnychyi
This is not a contribution
Data + AI Summit 2021

Presenter
• Apache Iceberg PMC member
• Apache Spark contributor
• Data Lakes at Apple
• Open source enthusiast

Agenda
• Why V2?
• Data distribution and ordering
• Future work

Reliability (Data Source V1)
• Behavior of DataFrameWriter is not defined
- Connectors interpret SaveMode differently
- SaveIntoDataSourceCommand vs InsertIntoDataSourceCommand

Reliability (Data Source V1)
• Validation rules are not consistent
- PreprocessTableCreation vs PreprocessTableInsertion
- No schema validation for path-based tables

Design choices (Data Source V1)
• Connectors interact with internal APIs
- SQLContext
- RDD
- DataFrame

Extensibility (Data Source V1)
• Hard to support new features
- No easy way to extend PrunedFilteredScan
- Exposing ColumnarBatch instead of Row is challenging

Features (Data Source V1)
• No Structured Streaming support
• No multi-catalog support
• Limited bucketed tables support

Reliability (Data Source V2)
• Predictable and reliable behavior
- Clearly defined logical plans for all connectors
- Consistent validation rules
- Less delegation to connectors

Design choices (Data Source V2)
• Proper abstractions
- Connectors interact only with InternalRow and ColumnarBatch
- Mix-in traits for optional functionality

Features (Data Source V2)
• Multi-catalog support
• Structured Streaming
• Vectorization
• Bucketed tables (in progress)

Data distribution and ordering

Impact
• Writes
- Control the number of generated files
- Reduce the overall memory consumption
- Reduce the actual writing time

© 2021 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.
Unspecified distribution

Proper distribution

Unspecified ordering

Proper ordering

Impact
• Reads
- Cluster data on write for faster reads
- Enable efficient data skipping
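
Data skipping can be sketched with per-file min/max statistics. The example below is an illustration under simple assumptions (integer values, one min/max range per file, a point lookup); it is not how any particular connector stores or evaluates its statistics, and all names are hypothetical.

```java
import java.util.Arrays;

final class DataSkippingSketch {
    // Counts the files whose [min, max] range may contain the target value.
    // Files outside that range can be skipped without being read.
    static int filesToScan(int[][] files, int target) {
        int count = 0;
        for (int[] file : files) {
            int min = Arrays.stream(file).min().getAsInt();
            int max = Arrays.stream(file).max().getAsInt();
            if (min <= target && target <= max) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // The same nine values, written without and with a sort.
        int[][] unsorted = {{1, 9, 5}, {2, 8, 4}, {3, 7, 6}};
        int[][] sorted = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}};
        System.out.println(filesToScan(unsorted, 5)); // prints 3
        System.out.println(filesToScan(sorted, 5));   // prints 1
    }
}
```

When the data is not sorted, every file covers a wide value range and none can be skipped. Sorting on write narrows each file's range, so a lookup touches only the files that can actually match.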

Impact
• Storage footprint
- Columnar encodings perform better on sorted data (e.g. dictionary encoding)
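
To see why sorting helps encodings, the sketch below counts runs in a column, using run-length encoding as a simple stand-in (the slide names dictionary encoding; run-length encoding is swapped in here only because it is easy to measure). The column values are made up.

```java
import java.util.Arrays;

final class EncodingSketch {
    // Number of (value, length) pairs a run-length encoder would emit.
    static int runCount(String[] column) {
        int runs = 0;
        for (int i = 0; i < column.length; i++) {
            if (i == 0 || !column[i].equals(column[i - 1])) {
                runs++;
            }
        }
        return runs;
    }

    public static void main(String[] args) {
        String[] unsorted = {"a", "b", "a", "b", "a", "b"};
        String[] sorted = unsorted.clone();
        Arrays.sort(sorted);
        System.out.println(runCount(unsorted)); // prints 6
        System.out.println(runCount(sorted));   // prints 2
    }
}
```

The same six values compress to two runs once sorted, which is the kind of gain the storage footprint point refers to.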

How do connectors control this?

Data Source V1
• Connectors can apply arbitrary transformations on DataFrame
• Built-in connectors sort data within tasks using partition columns

Data Source V2
• No way for connectors to control distribution and ordering on write (SPARK-23889)
• Severe performance issues unless explicitly handled by the user
• Blocks migration to V2
• Fixed in upcoming Spark 3.2

Use cases
• Global sort
• Cluster + sort within tasks
• Local sort within tasks
• No distribution or sort
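
As an illustration of the "cluster + sort within tasks" use case, here is a minimal in-memory sketch. It is not Spark's implementation: rows are modeled as hypothetical `{key, value}` integer pairs, clustering is a plain modulo on the key, and each task sorts only its own rows.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class ClusterAndSortSketch {
    // Rows are {key, value} pairs. Rows with the same key end up in the same
    // task (cluster), then each task orders its rows by key and value (local sort).
    static List<List<int[]>> clusterAndSort(List<int[]> rows, int numTasks) {
        List<List<int[]>> tasks = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) {
            tasks.add(new ArrayList<>());
        }
        for (int[] row : rows) {
            tasks.get(Math.floorMod(row[0], numTasks)).add(row);
        }
        Comparator<int[]> byKeyThenValue =
            Comparator.<int[]>comparingInt(r -> r[0]).thenComparingInt(r -> r[1]);
        for (List<int[]> task : tasks) {
            task.sort(byKeyThenValue);
        }
        return tasks;
    }

    public static void main(String[] args) {
        List<int[]> rows = Arrays.asList(
            new int[] {1, 5}, new int[] {2, 3}, new int[] {1, 2},
            new int[] {2, 9}, new int[] {1, 7});
        for (List<int[]> task : clusterAndSort(rows, 2)) {
            StringBuilder line = new StringBuilder();
            for (int[] row : task) {
                line.append(Arrays.toString(row)).append(' ');
            }
            System.out.println(line.toString().trim());
        }
    }
}
```

The other use cases differ only in which step is kept: a global sort orders rows across tasks as well, a local sort skips the clustering step, and the last case skips both.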

API
interface WriteBuilder {
    // Builds the logical write for this operation.
    Write build();
}

API
interface Write {
    // Physical representations of the write for batch and streaming queries.
    BatchWrite toBatch();
    StreamingWrite toStreaming();
}

API
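interface RequiresDistributionAndOrdering extends Write {
    // How incoming records must be distributed across tasks.
    Distribution requiredDistribution();
    // How incoming records must be ordered within each task.
    SortOrder[] requiredOrdering();
}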

Distributions
• OrderedDistribution
• ClusteredDistribution
• UnspecifiedDistribution

SortOrder
interface SortOrder extends Expression {
    Expression expression();
    SortDirection direction();
    NullOrdering nullOrdering();
}
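
A connector's use of this API can be sketched as follows. So that the example compiles without Spark on the classpath, every type below is a simplified local stand-in that only mirrors the names on the slides; these are not the Spark classes, and the `PartitionedTableWrite` class and the `day` and `id` columns are hypothetical.

```java
final class RequirementsSketch {
    // Simplified stand-ins for the slide's interfaces (NOT the Spark types).
    interface Distribution {}

    static final class ClusteredDistribution implements Distribution {
        final String[] clustering;

        ClusteredDistribution(String... clustering) {
            this.clustering = clustering;
        }
    }

    enum SortDirection { ASCENDING, DESCENDING }

    static final class SortOrder {
        final String column;
        final SortDirection direction;

        SortOrder(String column, SortDirection direction) {
            this.column = column;
            this.direction = direction;
        }
    }

    interface RequiresDistributionAndOrdering {
        Distribution requiredDistribution();
        SortOrder[] requiredOrdering();
    }

    // Hypothetical write for a table partitioned by "day":
    // cluster by the partition column, then sort within each task.
    static final class PartitionedTableWrite implements RequiresDistributionAndOrdering {
        @Override
        public Distribution requiredDistribution() {
            return new ClusteredDistribution("day");
        }

        @Override
        public SortOrder[] requiredOrdering() {
            return new SortOrder[] {
                new SortOrder("day", SortDirection.ASCENDING),
                new SortOrder("id", SortDirection.ASCENDING)
            };
        }
    }

    public static void main(String[] args) {
        RequiresDistributionAndOrdering write = new PartitionedTableWrite();
        System.out.println(write.requiredOrdering().length); // prints 2
    }
}
```

Read against the use cases listed earlier, a global sort would pair an ordered distribution with a matching ordering, while a local sort would pair an unspecified distribution with a non-empty ordering. The connector only declares what it needs; the engine is responsible for shuffling and sorting the data before it reaches the writer.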

Current state
• Available and fully functional in master for batch queries
• Structured Streaming support is in progress (SPARK-34183)

Future work
• Distribution and ordering in CREATE TABLE
• Ability to control the number of shuffle partitions
• Coalesce partitions during adaptive query execution

Summary
• Consider migrating to Data Source V2
• Data distribution and ordering are critical at scale

Feedback
• Your feedback is important to us
• Don’t forget to review and rate sessions
