Dynamic Partition Pruning in Apache Spark

Oct 30, 2019

In data analytics frameworks such as Spark, it is important to detect and avoid scanning data that is irrelevant to the executed query, an optimization known as partition pruning. Dynamic partition pruning occurs when the optimizer is unable to identify at parse time the partitions it has to eliminate. In particular, we consider a star schema which consists of one or multiple fact tables referencing any number of dimension tables. In such join operations, we can prune the partitions the join reads from a fact table by identifying those partitions that result from filtering the dimension tables. In this talk we present a mechanism for performing dynamic partition pruning at runtime by reusing the dimension table broadcast results in hash joins, and we show significant improvements for most TPC-DS queries.

Dynamic Partition Pruning
in Apache Spark
Spark + AI Summit, Amsterdam
Bogdan Ghit and Juliusz Sompolski

About Us
BI Experience team in the
Databricks Amsterdam European Development Centre
● Working on improving the experience and performance of
Business Intelligence / SQL analytics workloads using
Databricks
○ JDBC / ODBC connectivity to Databricks clusters
○ Integrations with BI tools such as Tableau
○ But also: core performance improvements in
Apache Spark for common SQL analytics query
patterns
Bogdan Ghit
Juliusz Sompolski

TPCDS Q98 on 10 TB
How to Make a Query 100x Faster?

Static Partition Pruning
SELECT * FROM Sales WHERE day_of_week = 'Mon'
Filter
Scan
Basic data-flow
Filter
Scan
Filter Push-down
Filter
Scan
Partition files with
multi-columnar data
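
The idea behind static pruning can be illustrated with a toy model in plain Python. This is only a sketch, not Spark code: the dictionary `sales_partitions` stands in for a table laid out as one directory per `day_of_week` value, and the rows and the helper `scan_with_static_pruning` are made up for illustration. Because the filter value is a literal known when the query is planned, the scan can select the matching partition up front and never touch the others.

```python
# Toy model of a table partitioned by day_of_week: one "directory" per value.
# The layout and the row values are hypothetical, for illustration only.
sales_partitions = {
    "Mon": [{"item": "a", "amount": 10}, {"item": "b", "amount": 5}],
    "Tue": [{"item": "c", "amount": 7}],
    "Wed": [{"item": "d", "amount": 3}],
}


def scan_with_static_pruning(partitions, day):
    """Read only the partitions whose key matches the literal in the WHERE clause."""
    scanned = [key for key in partitions if key == day]
    rows = [row for key in scanned for row in partitions[key]]
    return scanned, rows


scanned, rows = scan_with_static_pruning(sales_partitions, "Mon")
print(scanned)    # partitions that were actually read
print(len(rows))  # number of rows returned
```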

TPCDS 10 TB
Highly selective dimension filter that retains only
one month out of 5 years of data
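
A rough back-of-the-envelope calculation shows how selective such a filter is. The sketch below uses the date range from the Query 98 predicate (1999-02-22 plus 30 days) and assumes, purely for illustration, that the fact table holds one partition per day over five 365-day years; the real partition count of the benchmark data may differ.

```python
from datetime import date, timedelta

# Date range taken from the Query 98 filter on d_date.
start = date(1999, 2, 22)
end = start + timedelta(days=30)

# BETWEEN is inclusive on both ends, hence the +1.
days_selected = (end - start).days + 1

# Assumption: roughly 5 years of data, one fact-table partition per day.
days_total = 5 * 365

fraction = days_selected / days_total
print(days_selected)               # 31
print(round(fraction * 100, 1))    # 1.7 (percent of daily partitions kept)
```

Under these assumptions, a scan that prunes on the filtered dates would read under 2% of the daily partitions.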

Conclusion
Apache Spark 3.0 introduces Dynamic Partition Pruning
- Strawman approach at logical planning time
- Optimized approach during execution time
Significant speedup, exhibited in many TPC-DS queries
With this optimization Spark may now work well with
star-schema queries, making it unnecessary to ETL
denormalized tables.

Thanks!
Bogdan Ghit - linkedin.com/in/bogdanghit
Juliusz Sompolski - linkedin.com/in/juliuszsompolski

Table Denormalization
SELECT * FROM Sales JOIN Date
WHERE Date.day_of_week = 'Mon'
Static pruning not possible
Scan
Sales
Filter
day_of_week = 'Mon'
Join
Simple workaround
Scan
Sales
Join
Scan
Date
Filter
day_of_week = 'Mon'
Scan
Scan
Date

This Talk
Dynamic pruning
Scan
Sales
Filter
day_of_week = 'Mon'
Join
SELECT * FROM Sales JOIN Date
WHERE Date.day_of_week = 'Mon'
Scan
Countries

Spark In a Nutshell
Query Logical Plan
Optimization
Physical Plan
Selection
RDD batches
Cluster slots
Stats-based
cost model
Rule-based
transformations
APIs

Optimization Opportunities
Data Layout
Partition files with
multi-columnar data
Scan FACT TABLE
Scan DIM TABLE
Non-partitioned dataset
Filter DIM
Join on partition id
Query Shape

A Simple Approach
Partition files with
multi-columnar data
Scan FACT TABLE
Scan DIM TABLE
Non-partitioned dataset
Filter DIM
Join on partition id
Scan DIM TABLE
Filter DIM
Work duplication may be expensive
Heuristics based on inaccurate stats
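
The work duplication in this simple approach can be made concrete with a toy model in plain Python. This is a sketch under stated assumptions, not Spark's implementation: the tables, the `date_id` partition key, and the helpers `scan_filtered_dim` and `join_with_subquery_pruning` are hypothetical. The filtered dimension table is evaluated once as a subquery to decide which fact partitions to read, and then a second time as the input of the join itself.

```python
# Toy data: Sales (fact) is partitioned by date_id; Date (dimension) is small
# and not partitioned. All names and values are hypothetical.
date_dim = [
    {"date_id": 1, "day_of_week": "Mon"},
    {"date_id": 2, "day_of_week": "Tue"},
    {"date_id": 8, "day_of_week": "Mon"},
]
sales_partitions = {
    1: [{"date_id": 1, "amount": 10}],
    2: [{"date_id": 2, "amount": 7}],
    8: [{"date_id": 8, "amount": 4}, {"date_id": 8, "amount": 6}],
}

stats = {"dim_scans": 0}


def scan_filtered_dim(day):
    """Scan the dimension table and apply the filter; counts every scan."""
    stats["dim_scans"] += 1
    return [d for d in date_dim if d["day_of_week"] == day]


def join_with_subquery_pruning(day):
    # Pass 1: run the dimension filter as a subquery to find partitions to keep.
    keep = {d["date_id"] for d in scan_filtered_dim(day)}
    fact_rows = [
        row
        for pid, rows in sales_partitions.items()
        if pid in keep
        for row in rows
    ]
    # Pass 2: the join scans and filters the dimension table again.
    dim_by_id = {d["date_id"]: d for d in scan_filtered_dim(day)}
    joined = [
        {**row, **dim_by_id[row["date_id"]]}
        for row in fact_rows
        if row["date_id"] in dim_by_id
    ]
    return sorted(keep), joined


kept, joined = join_with_subquery_pruning("Mon")
print(kept)                 # fact partitions that were read
print(len(joined))          # joined rows
print(stats["dim_scans"])   # the dimension side was evaluated twice
```

Whether the second evaluation pays off depends on how expensive the dimension side is and how many partitions it eliminates, which is why a planner would have to rely on size estimates to decide.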

Broadcast Hash Join
FileScan
FileScan with Dim Filter
Non-partitioned dataset
BroadcastExchange
Broadcast Hash Join
Execute the build side of the join
Place the result in a broadcast variable
Broadcast the build side results
Execute the join locally without a shuffle
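
The steps of a broadcast hash join can be sketched in plain Python. This is a toy model, not Spark code: `broadcast_hash_join` is a hypothetical helper, the "broadcast" is just a shared in-memory dictionary, and the build side is assumed to have unique join keys.

```python
def broadcast_hash_join(fact_partitions, dim_rows, key):
    """Toy broadcast hash join; assumes join keys are unique on the build side."""
    # Execute the build side and place the result in a hash table.
    hash_table = {row[key]: row for row in dim_rows}
    # "Broadcast": every partition probes the same table. In a cluster, each
    # executor would hold its own copy; here it is a shared reference.
    joined_partitions = []
    for partition in fact_partitions:
        # Execute the join locally: rows never move between partitions.
        joined_partitions.append(
            [
                {**row, **hash_table[row[key]]}
                for row in partition
                if row[key] in hash_table
            ]
        )
    return joined_partitions


# Hypothetical inputs: two fact partitions and an already filtered dimension.
fact_partitions = [
    [{"date_id": 1, "amount": 10}, {"date_id": 2, "amount": 7}],
    [{"date_id": 8, "amount": 4}],
]
dim_rows = [
    {"date_id": 1, "day_of_week": "Mon"},
    {"date_id": 8, "day_of_week": "Mon"},
]

result = broadcast_hash_join(fact_partitions, dim_rows, "date_id")
print([len(p) for p in result])  # matches per fact partition
```

Note that every fact partition is still read in full here; the join only discards non-matching rows after they have been scanned.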

Reusing Broadcast Results
Partition files with
multi-columnar data
FileScan
FileScan with Dim Filter
Non-partitioned dataset
BroadcastExchange
Broadcast Hash Join
Dynamic Filter
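
Reusing the broadcast result can be sketched with the same kind of toy model in plain Python. The names below (`join_reusing_broadcast`, `date_id`, the sample rows) are hypothetical and the code only mimics the idea: the hash table built for the join doubles as the dynamic filter, so the dimension side is evaluated once and non-matching fact partitions are skipped before they are read. In Spark 3.0 the feature is controlled by the `spark.sql.optimizer.dynamicPartitionPruning.enabled` setting, with `spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly` restricting it to cases where the broadcast can be reused.

```python
# Toy data with hypothetical names: Sales is partitioned by date_id.
date_dim = [
    {"date_id": 1, "day_of_week": "Mon"},
    {"date_id": 2, "day_of_week": "Tue"},
    {"date_id": 8, "day_of_week": "Mon"},
]
sales_partitions = {
    1: [{"date_id": 1, "amount": 10}],
    2: [{"date_id": 2, "amount": 7}],
    8: [{"date_id": 8, "amount": 4}, {"date_id": 8, "amount": 6}],
}


def join_reusing_broadcast(fact_partitions, dim_rows, day):
    stats = {"dim_scans": 0, "partitions_read": 0}

    # Build side: scan and filter the dimension table exactly once.
    stats["dim_scans"] += 1
    hash_table = {d["date_id"]: d for d in dim_rows if d["day_of_week"] == day}

    # Dynamic filter: the keys of the broadcast hash table.
    dynamic_filter = hash_table.keys()

    joined = []
    for pid, rows in fact_partitions.items():
        if pid not in dynamic_filter:
            continue  # partition pruned at runtime, never read
        stats["partitions_read"] += 1
        # Probe the same hash table that produced the filter.
        joined.extend({**row, **hash_table[row["date_id"]]} for row in rows)
    return joined, stats


joined, stats = join_reusing_broadcast(sales_partitions, date_dim, "Mon")
print(len(joined))  # joined rows
print(stats)        # one dimension scan, two of three partitions read
```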

Experimental Setup
Workload Selection
- TPC-DS scale factors 1-10 TB
Cluster Configuration
- 10 i3.xlarge machines
Data-Processing Framework
- Apache Spark 3.0

TPCDS 1 TB
60 of 102 queries show a speedup between 2x and 18x

Top Queries
Very good speedups for top 10% of the queries

Data Skipped
Very effective in skipping data

TPCDS 10 TB
Even better speedups at 10x the scale

Query 98
SELECT i_item_desc, i_category, i_class, i_current_price,
sum(ss_ext_sales_price) as itemrevenue,
sum(ss_ext_sales_price)*100/sum(sum(ss_ext_sales_price)) over
(partition by i_class) as revenueratio
FROM
store_sales, item, date_dim
WHERE
ss_item_sk = i_item_sk
and i_category in ('Sports', 'Books', 'Home')
and ss_sold_date_sk = d_date_sk
and cast(d_date as date) between cast('1999-02-22' as date)
and (cast('1999-02-22' as date) + interval '30' day)
GROUP BY
i_item_id, i_item_desc, i_category, i_class, i_current_price
ORDER BY
i_category, i_class, i_item_id, i_item_desc, revenueratio


Deep Dive into SparkEric Xiao

Dynamic Partition Pruning in Apache Spark

1. Dynamic Partition Pruning in Apache Spark. Spark + AI Summit, Amsterdam. Bogdan Ghit and Juliusz Sompolski

2. About Us
BI Experience team in the Databricks Amsterdam European Development Centre
● Working on improving the experience and performance of Business Intelligence / SQL analytics workloads using Databricks
○ JDBC / ODBC connectivity to Databricks clusters
○ Integrations with BI tools such as Tableau
○ But also: core performance improvements in Apache Spark for common SQL analytics query patterns
Bogdan Ghit, Juliusz Sompolski

3. TPC-DS Q98 on 10 TB: How to Make a Query 100x Faster?

4. Static Partition Pruning
SELECT * FROM Sales WHERE day_of_week = 'Mon'
[Diagram: three Filter-over-Scan plans, captioned "Basic data-flow", "Filter Push-down", and "Partition files with multi-columnar data"]
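
The difference between the basic data-flow and a pruned scan can be sketched in plain Python. This is only an illustration, not Spark code: the `sales_partitions` dictionary and both function names are hypothetical stand-ins for a table stored as one directory per `day_of_week` value.

```python
# Hypothetical in-memory stand-in for a Sales table partitioned by day_of_week.
sales_partitions = {
    "Mon": [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 5.0}],
    "Tue": [{"id": 3, "amount": 7.5}],
    "Wed": [{"id": 4, "amount": 2.0}],
}


def scan_without_pruning(partitions, day):
    """Basic data-flow: read every partition, then filter the rows."""
    partitions_read = 0
    result = []
    for part_day, rows in partitions.items():
        partitions_read += 1
        for row in rows:
            if part_day == day:
                result.append({**row, "day_of_week": part_day})
    return result, partitions_read


def scan_with_static_pruning(partitions, day):
    """Static pruning: the literal in the filter selects partitions up front."""
    selected = {d: rows for d, rows in partitions.items() if d == day}
    result = [
        {**row, "day_of_week": d} for d, rows in selected.items() for row in rows
    ]
    return result, len(selected)


full_rows, full_reads = scan_without_pruning(sales_partitions, "Mon")
pruned_rows, pruned_reads = scan_with_static_pruning(sales_partitions, "Mon")
```

Both scans return the same rows; the pruned one touches a single partition because the value 'Mon' is known before any data is read.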

5. Table Denormalization
SELECT * FROM Sales JOIN Date WHERE Date.day_of_week = 'Mon'
[Diagram: two query plans, captioned "Static pruning not possible" and "Simple workaround"; node labels: Scan Sales, Scan Date, Filter day_of_week = 'mon', Join, Scan]

6. This Talk: Dynamic pruning
SELECT * FROM Sales JOIN Date WHERE Date.day_of_week = 'Mon'
[Diagram node labels: Scan Sales, Scan Countries, Filter day_of_week = 'mon', Join]

7. Spark in a Nutshell
[Diagram labels: APIs, Query, Logical Plan, Optimization, Physical Plan, Selection, RDD batches, Cluster slots; annotations: Rule-based transformations, Stats-based cost model]

8. Optimization Opportunities
Data Layout: partition files with multi-columnar data; non-partitioned dataset
Query Shape: Scan FACT TABLE; Scan DIM TABLE; Filter DIM; Join on partition id

9. A Simple Approach
[Diagram labels: Scan FACT TABLE (partition files with multi-columnar data); Scan DIM TABLE (non-partitioned dataset); Filter DIM; Join on partition id; Scan DIM TABLE and Filter DIM appear a second time]
- Work duplication may be expensive
- Heuristics based on inaccurate stats

10. Broadcast Hash Join
[Diagram labels: FileScan; FileScan with Dim Filter (non-partitioned dataset); BroadcastExchange; Broadcast Hash Join]
- Execute the build side of the join
- Place the result in a broadcast variable
- Broadcast the build side results
- Execute the join locally without a shuffle
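
The broadcast hash join steps can be sketched in plain Python under simplifying assumptions: a single process stands in for the cluster, an ordinary dictionary stands in for the broadcast variable, and the function name and sample rows are hypothetical.

```python
def broadcast_hash_join(fact_rows, dim_rows, fact_key, dim_key):
    """Join fact rows against a small dimension table via a hash table."""
    # Execute the build side: hash the (already filtered) dimension rows.
    build_side = {}
    for dim_row in dim_rows:
        build_side.setdefault(dim_row[dim_key], []).append(dim_row)

    # In Spark this table would be shipped to every executor as a broadcast
    # variable; here it is simply a local dictionary.
    joined = []
    for fact_row in fact_rows:
        # Probe locally: no shuffle of the fact rows is needed.
        for dim_row in build_side.get(fact_row[fact_key], []):
            joined.append({**fact_row, **dim_row})
    return joined


# Dimension rows that survived the filter day_of_week = 'Mon'.
filtered_dates = [{"d_date_sk": 1, "day_of_week": "Mon"}]
sales = [
    {"ss_sold_date_sk": 1, "amount": 10.0},
    {"ss_sold_date_sk": 2, "amount": 7.5},
    {"ss_sold_date_sk": 1, "amount": 5.0},
]

joined = broadcast_hash_join(sales, filtered_dates, "ss_sold_date_sk", "d_date_sk")
```

Note that every fact row is still read and probed here, including the one that finds no match; avoiding that read is what the dynamic filter is for.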

11. Reusing Broadcast Results
[Diagram labels: FileScan (partition files with multi-columnar data); FileScan with Dim Filter (non-partitioned dataset); BroadcastExchange; Broadcast Hash Join; Dynamic Filter]
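
A minimal sketch of the idea of reusing the broadcast result as a dynamic filter, again in plain Python rather than Spark. It assumes the fact table is partitioned by the join key; the function name, the `sales_by_date_sk` layout, and the sample rows are hypothetical.

```python
def join_with_dynamic_pruning(fact_partitions, dim_rows, dim_key, dim_predicate):
    """Build the join hash table once, then reuse its keys to prune the scan."""
    # Build side: filter the dimension table and hash it on the join key.
    build_side = {}
    for dim_row in dim_rows:
        if dim_predicate(dim_row):
            build_side.setdefault(dim_row[dim_key], []).append(dim_row)

    # Dynamic filter: the hash table's keys are only known at run time.
    # Fact partitions whose key is absent are never read.
    scanned_partitions = []
    joined = []
    for partition_key, fact_rows in fact_partitions.items():
        if partition_key not in build_side:
            continue
        scanned_partitions.append(partition_key)
        for fact_row in fact_rows:
            for dim_row in build_side[partition_key]:
                joined.append({**fact_row, **dim_row})
    return joined, scanned_partitions


date_dim = [
    {"d_date_sk": 1, "day_of_week": "Mon"},
    {"d_date_sk": 2, "day_of_week": "Tue"},
    {"d_date_sk": 8, "day_of_week": "Mon"},
]
# Hypothetical fact table, one partition per date key.
sales_by_date_sk = {
    1: [{"amount": 10.0}, {"amount": 5.0}],
    2: [{"amount": 7.5}],
    3: [{"amount": 2.0}],
    8: [{"amount": 4.0}],
}

joined, scanned = join_with_dynamic_pruning(
    sales_by_date_sk, date_dim, "d_date_sk", lambda row: row["day_of_week"] == "Mon"
)
```

The dimension table is filtered and hashed only once: the same structure serves as the join's build side and as the partition filter, which avoids the work duplication of the simpler approach.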

12. Experimental Setup
Workload Selection
- TPC-DS scale factors 1-10 TB
Cluster Configuration
- 10 i3.xlarge machines
Data-Processing Framework
- Apache Spark 3.0

13. TPC-DS 1 TB: 60 of 102 queries show a speedup between 2x and 18x

14. Top Queries: Very good speedups for the top 10% of the queries

15. Data Skipped: Very effective in skipping data

16. TPC-DS 10 TB: Even better speedups at 10x the scale

17. Query 98
SELECT i_item_desc, i_category, i_class, i_current_price,
  sum(ss_ext_sales_price) as itemrevenue,
  sum(ss_ext_sales_price)*100/sum(sum(ss_ext_sales_price)) over (partition by i_class) as revenueratio
FROM store_sales, item, date_dim
WHERE ss_item_sk = i_item_sk
  and i_category in ('Sports', 'Books', 'Home')
  and ss_sold_date_sk = d_date_sk
  and cast(d_date as date) between cast('1999-02-22' as date) and (cast('1999-02-22' as date) + interval '30' day)
GROUP BY i_item_id, i_item_desc, i_category, i_class, i_current_price
ORDER BY i_category, i_class, i_item_id, i_item_desc, revenueratio

18. TPC-DS 10 TB: Highly selective dimension filter that retains only one month out of 5 years of data
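
A rough back-of-the-envelope calculation shows why such a filter is so selective. It rests on assumptions that are not stated on the slide: that the fact table has one partition per sale date, that data is spread evenly over the days, and that "5 years" means 5 x 365 days. The date range is the one in Query 98.

```python
from datetime import date, timedelta

# Date range from Query 98: BETWEEN is inclusive on both ends.
range_start = date(1999, 2, 22)
range_end = range_start + timedelta(days=30)
days_kept = (range_end - range_start).days + 1

# Assumption: 5 years of data, one partition per day, no leap days counted.
days_total = 5 * 365

fraction_kept = days_kept / days_total
fraction_skipped = 1 - fraction_kept

print(f"partitions kept: {days_kept} of {days_total}")
print(f"fraction skipped: {fraction_skipped:.1%}")
```

Under these assumptions, only a small fraction of the daily partitions needs to be scanned once the date filter is applied to the fact table at run time.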

19. Conclusion
Apache Spark 3.0 introduces Dynamic Partition Pruning
- Strawman approach at logical planning time
- Optimized approach during execution time
Significant speedup, exhibited in many TPC-DS queries
With this optimization Spark may now work well with star-schema queries, making it unnecessary to ETL denormalized tables.

20. Thanks!
Bogdan Ghit - linkedin.com/in/bogdanghit
Juliusz Sompolski - linkedin.com/in/juliuszsompolski
