Materialized Column: An Efficient Way to Optimize Queries on Nested Columns
Guo, Jun (jason.guo.vip@gmail.com)
Lead of Data Engine Team, @ByteDance
Who we are
o Data Engine team of ByteDance
o Build a one-stop OLAP platform on which users can analyze PB-level data by writing SQL, without caring about the underlying execution engine
What we do
o Manage Spark SQL / Presto / Hive workloads
o Offer an Open API and a self-serve platform
o Optimize the Spark SQL / Presto / Hive engines
o Design data architecture for most business lines in ByteDance
Agenda
▪ Spark SQL at ByteDance
▪ Why nested types are widely used
▪ What are the main issues with nested types
▪ Optional solutions
▪ How does Materialized Column solve these problems
Spark SQL at ByteDance
Timeline, 2016 to 2020:
▪ Small-scale experiments
▪ Ad-hoc workload
▪ Few ETL pipelines in production
▪ Full-production deployment
▪ Main engine in DW area
Why nested types are widely used
▪ Event log
  ▪ A lot of new tracking events are created every day
  ▪ It is not a good idea to create a new column for each new type of event
▪ Dimension
  ▪ Dimension tables are dumped from the MySQL databases of service backends
  ▪ A service backend may add new fields on demand. These fields may not be helpful now, but they may be useful in the future
Main issues for nested types
▪ Unnecessary data is read, which wastes IO
▪ Vectorized read cannot be exploited when a nested-type column is read
▪ Filter pushdown cannot be utilized when a nested column is read
▪ Duplicated computation, e.g. CPU-intensive JSON parsing
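As a rough illustration of the duplicated-computation issue, the sketch below models a nested column stored as JSON strings. It is plain Python rather than Spark, and the rows and the `NestedReader` class are hypothetical; it only shows that every query touching one subfield has to parse the whole nested value again.

```python
import json

# Hypothetical rows: the nested column is stored as a JSON string.
rows = [
    {"item": "apple", "people": json.dumps({"age": "18", "name": "jack"})},
    {"item": "pear", "people": json.dumps({"age": "35", "name": "rose"})},
]


class NestedReader:
    """Reads one subfield and counts how often the nested value is parsed."""

    def __init__(self):
        self.parse_calls = 0

    def age(self, row):
        self.parse_calls += 1
        # The whole nested value is parsed even though only "age" is needed.
        return int(json.loads(row["people"])["age"])


reader = NestedReader()
ages = [reader.age(r) for r in rows]                       # query 1
over_20 = [r["item"] for r in rows if reader.age(r) > 20]  # query 2
print(ages, over_20, reader.parse_calls)
```

Both queries need only `age`, yet each one pays the full parsing cost for every row.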
Optional solutions
Optional solutions – A separate table
▪ DW users design their own solution to these problems
▪ Maintain a new table that adds columns extracted from the nested columns
▪ Downstream users should query the new table and new columns for better performance
▪ Pros
  ▪ Queries run on simple types, so all of the problems above are solved
▪ Cons
  ▪ All downstream users must be pushed to migrate their queries / pipelines to the new table and new columns
  ▪ Duplicated storage and computation cost
  ▪ Cannot handle frequent subfield changes
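The separate-table approach can be pictured with a small sketch. This is plain Python standing in for an ETL job, and `build_flat_table` and the sample row are hypothetical names used only for illustration.

```python
def build_flat_table(base_rows, subfields):
    """ETL step: copy every row and add one plain column per extracted subfield."""
    flat_rows = []
    for row in base_rows:
        flat = dict(row)  # the original columns are stored a second time
        for name in subfields:
            flat[name] = row["people"].get(name)
        flat_rows.append(flat)
    return flat_rows


base_table = [
    {"item": "apple", "count": 1, "people": {"age": "18", "name": "jack"}},
]

# First version of the flattened table exposes only "age".
flat_table = build_flat_table(base_table, ["age"])

# A newly needed subfield means re-running the job with a changed schema.
flat_table_v2 = build_flat_table(base_table, ["age", "name"])
print(flat_table_v2[0])
```

The sketch shows both costs listed above: every base row is copied into the second table, and each new subfield requires another run with a different schema.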
Optional solutions – Vectorized Read on Nested Column
▪ Refactor the Parquet vectorized reader to support vectorized read for nested types
▪ Support predicate pushdown for struct
▪ Pros
  ▪ Enables vectorized read without any storage overhead
▪ Cons
  ▪ The vectorized readers for Parquet and ORC each need to be refactored
  ▪ Filter pushdown for Array/Map is still not available
  ▪ Vectorized read on nested types does not perform as well as on simple types
▪ Improves performance with struct by about 100%
▪ Improves performance with map by about 163%
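To make filter pushdown concrete: columnar formats such as Parquet keep min/max statistics per row group, so a reader can skip groups that cannot satisfy a predicate. The sketch below is a simplified model in plain Python, not the Parquet reader API; `RowGroup` and `scan_greater_than` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroup:
    """A simplified row group holding one flat integer column."""
    values: List[int]

    @property
    def max_value(self) -> int:
        # Stands in for the max statistic stored alongside the column chunk.
        return max(self.values)


def scan_greater_than(row_groups: List[RowGroup], threshold: int):
    """Return matching values and the number of row groups actually read."""
    matches, groups_read = [], 0
    for group in row_groups:
        if group.max_value <= threshold:
            continue  # statistics prove no row can match, so skip the read
        groups_read += 1
        matches.extend(v for v in group.values if v > threshold)
    return matches, groups_read


age_column = [RowGroup([18, 19, 21]), RowGroup([30, 42, 55]), RowGroup([22, 25, 29])]
matches, groups_read = scan_greater_than(age_column, 29)
print(matches, groups_read)
```

Only one of the three row groups is read here. This kind of skipping depends on per-column statistics for the filtered value, which is why it matters whether the value lives in a flat column or inside a nested one.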
How does Materialized Column solve these problems
Original table
CREATE TABLE base_table (
  item STRING,
  count INT,
  people MAP<STRING, STRING>,
  date STRING
)
USING parquet
PARTITIONED BY (date);
Add materialized column
ALTER TABLE base_table ADD COLUMNS (
  age INT MATERIALIZED CAST(people['age'] AS INTEGER)
);
Write with materialized column
EXPLAIN EXTENDED
INSERT INTO base_table PARTITION (date = '20201010')
SELECT 'apple', 1, map('age', '18', 'name', 'jack', 'gender', 'male');
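The slides do not show how the engine fills the materialized column; a plausible reading is that the writer evaluates the defining expression for every inserted row and stores the result as an ordinary column. The sketch below models that in plain Python. `cast_people_age`, `MATERIALIZED_COLUMNS` and `insert_row` are hypothetical names, and returning `None` for a missing key is an assumption.

```python
def cast_people_age(row):
    """Python stand-in for CAST(people['age'] AS INTEGER)."""
    value = row["people"].get("age")
    # Assumption: a missing key yields None rather than an error.
    return int(value) if value is not None else None


# Hypothetical catalog entry: materialized column name -> defining expression.
MATERIALIZED_COLUMNS = {"age": cast_people_age}


def insert_row(table, row, materialized=MATERIALIZED_COLUMNS):
    """Append a row, filling each materialized column from its expression."""
    stored = dict(row)
    for column, expression in materialized.items():
        stored[column] = expression(row)
    table.append(stored)
    return stored


base_table = []
insert_row(base_table, {
    "item": "apple",
    "count": 1,
    "people": {"age": "18", "name": "jack", "gender": "male"},
    "date": "20201010",
})
print(base_table[0]["age"])
```

The writer supplies only the original columns, as in the INSERT statement above; `age` is derived and stored next to them without any change to the writing job.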
Query without materialized column rewrite
Query with materialized column rewrite
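The rewrite itself can be thought of as replacing any expression that matches a materialized column's definition with a reference to that column. The toy model below uses nested tuples as expression trees; it is not Spark's optimizer API, and `PEOPLE_AGE`, `MATERIALIZED` and `rewrite` are hypothetical names.

```python
# Expressions are modelled as nested tuples: (operator, operand, ...).
PEOPLE_AGE = ("cast", ("map_get", ("col", "people"), "age"), "int")

# Hypothetical catalog: defining expression -> materialized column name.
MATERIALIZED = {PEOPLE_AGE: "age"}


def rewrite(expr, materialized=MATERIALIZED):
    """Replace every subtree that matches a materialized expression by its column."""
    if expr in materialized:
        return ("col", materialized[expr])
    if isinstance(expr, tuple):
        return tuple(rewrite(part, materialized) for part in expr)
    return expr


# WHERE CAST(people['age'] AS INTEGER) > 20
query_filter = (">", PEOPLE_AGE, ("lit", 20))
rewritten = rewrite(query_filter)
print(rewritten)
```

After the rewrite the filter refers to the flat `age` column, so existing queries written against `people['age']` need no changes by their authors.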
Test case | Without Materialized Column rewrite | With Materialized Column rewrite | Performance | Read data size
SQL_adhoc_1 | 6.3 min / 797.6 GB | 3.4 min / 111.8 GB | 85.3% ↑ | 86% ↓
SQL_adhoc_2 | 16.5 min / 3.2 TB | 5.0 min / 111.1 GB | 230% ↑ | 96.6% ↓
SQL_etl_1 | 24 min / 3.7 TB | 9.1 min / 686.1 GB | 130.8% ↑ | 82% ↓
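The Performance and Read data size columns appear to be derived from the two measurement columns as a throughput gain and a relative reduction. The formulas below are an assumption, chosen because they reproduce the figures for the two ad-hoc rows; the function names are hypothetical.

```python
def speedup_percent(minutes_before, minutes_after):
    """Throughput gain: how much more work per minute after the rewrite."""
    return (minutes_before / minutes_after - 1) * 100


def read_reduction_percent(size_before, size_after):
    """Share of input data that no longer has to be read (same unit for both)."""
    return (1 - size_after / size_before) * 100


# SQL_adhoc_1: 6.3 min / 797.6 GB -> 3.4 min / 111.8 GB
adhoc_1_speedup = round(speedup_percent(6.3, 3.4), 1)
adhoc_1_reduction = round(read_reduction_percent(797.6, 111.8))

# SQL_adhoc_2: 16.5 min -> 5.0 min
adhoc_2_speedup = round(speedup_percent(16.5, 5.0), 1)

print(adhoc_1_speedup, adhoc_1_reduction, adhoc_2_speedup)
```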
Chapter 5.1.pptxsertj you can get it done before the election and I will
SotheaPheng
 
apidays New York 2025 - Two tales of API Change Management by Eric Koleda (Coda)
apidays New York 2025 - Two tales of API Change Management by Eric Koleda (Coda)apidays New York 2025 - Two tales of API Change Management by Eric Koleda (Coda)
apidays New York 2025 - Two tales of API Change Management by Eric Koleda (Coda)
apidays
 
PSUG 7 - 2025-06-03 - David Bianco on Splunk SURGe
PSUG 7 - 2025-06-03 - David Bianco on Splunk SURGePSUG 7 - 2025-06-03 - David Bianco on Splunk SURGe
PSUG 7 - 2025-06-03 - David Bianco on Splunk SURGe
Tomas Moser
 
apidays New York 2025 - Spring Modulith Design for Microservices by Renjith R...
apidays New York 2025 - Spring Modulith Design for Microservices by Renjith R...apidays New York 2025 - Spring Modulith Design for Microservices by Renjith R...
apidays New York 2025 - Spring Modulith Design for Microservices by Renjith R...
apidays
 
THE FRIEDMAN TEST ( Biostatics B. Pharm)
THE FRIEDMAN TEST ( Biostatics B. Pharm)THE FRIEDMAN TEST ( Biostatics B. Pharm)
THE FRIEDMAN TEST ( Biostatics B. Pharm)
JishuHaldar
 
Retort Instrumentation laboratory practi
Retort Instrumentation laboratory practiRetort Instrumentation laboratory practi
Retort Instrumentation laboratory practi
ADINDADYAHMUKHLASIN
 
apidays New York 2025 - The FINOS Common Domain Model for Capital Markets by ...
apidays New York 2025 - The FINOS Common Domain Model for Capital Markets by ...apidays New York 2025 - The FINOS Common Domain Model for Capital Markets by ...
apidays New York 2025 - The FINOS Common Domain Model for Capital Markets by ...
apidays
 
BE PROGRAMjwjwjwjsjsjsjsME TEMPLATE.pptx
BE PROGRAMjwjwjwjsjsjsjsME TEMPLATE.pptxBE PROGRAMjwjwjwjsjsjsjsME TEMPLATE.pptx
BE PROGRAMjwjwjwjsjsjsjsME TEMPLATE.pptx
AaronBaluyut
 
AG-FIRMA FINCOME ARTICLE AI AGENT RAG.pdf
AG-FIRMA FINCOME ARTICLE AI AGENT RAG.pdfAG-FIRMA FINCOME ARTICLE AI AGENT RAG.pdf
AG-FIRMA FINCOME ARTICLE AI AGENT RAG.pdf
Anass Nabil
 
apidays New York 2025 - Why an SDK is Needed to Protect APIs from Mobile Apps...
apidays New York 2025 - Why an SDK is Needed to Protect APIs from Mobile Apps...apidays New York 2025 - Why an SDK is Needed to Protect APIs from Mobile Apps...
apidays New York 2025 - Why an SDK is Needed to Protect APIs from Mobile Apps...
apidays
 
Alcoholic liver disease slides presentation new.pptx
Alcoholic liver disease slides presentation new.pptxAlcoholic liver disease slides presentation new.pptx
Alcoholic liver disease slides presentation new.pptx
DrShashank7
 
BODMAS-Rule-&-Unit-Digit-Concept-pdf.pdf
BODMAS-Rule-&-Unit-Digit-Concept-pdf.pdfBODMAS-Rule-&-Unit-Digit-Concept-pdf.pdf
BODMAS-Rule-&-Unit-Digit-Concept-pdf.pdf
SiddharthSean
 
apidays New York 2025 - Lessons From Two Technical Transformations by Leah Hu...
apidays New York 2025 - Lessons From Two Technical Transformations by Leah Hu...apidays New York 2025 - Lessons From Two Technical Transformations by Leah Hu...
apidays New York 2025 - Lessons From Two Technical Transformations by Leah Hu...
apidays
 
delta airlines new york office (Airwayscityoffice)
delta airlines new york office (Airwayscityoffice)delta airlines new york office (Airwayscityoffice)
delta airlines new york office (Airwayscityoffice)
jamespromind
 
原版英国威尔士三一圣大卫大学毕业证(UWTSD毕业证书)如何办理
原版英国威尔士三一圣大卫大学毕业证(UWTSD毕业证书)如何办理原版英国威尔士三一圣大卫大学毕业证(UWTSD毕业证书)如何办理
原版英国威尔士三一圣大卫大学毕业证(UWTSD毕业证书)如何办理
Taqyea
 
Chronic constipation presentaion final.ppt
Chronic constipation presentaion final.pptChronic constipation presentaion final.ppt
Chronic constipation presentaion final.ppt
DrShashank7
 
Math arihant handbook.pdf all formula is here
Math arihant handbook.pdf all formula is hereMath arihant handbook.pdf all formula is here
Math arihant handbook.pdf all formula is here
rdarshankumar84
 
Али махмуд to The teacm of ghsbh to fortune .pptx
Али махмуд to The teacm of ghsbh to fortune .pptxАли махмуд to The teacm of ghsbh to fortune .pptx
Али махмуд to The teacm of ghsbh to fortune .pptx
palr19411
 
apidays New York 2025 - Unifying OpenAPI & AsyncAPI by Naresh Jain & Hari Kri...
apidays New York 2025 - Unifying OpenAPI & AsyncAPI by Naresh Jain & Hari Kri...apidays New York 2025 - Unifying OpenAPI & AsyncAPI by Naresh Jain & Hari Kri...
apidays New York 2025 - Unifying OpenAPI & AsyncAPI by Naresh Jain & Hari Kri...
apidays
 
MICROSOFT POWERPOINT AND USES(BEST)..pdf
MICROSOFT POWERPOINT AND USES(BEST)..pdfMICROSOFT POWERPOINT AND USES(BEST)..pdf
MICROSOFT POWERPOINT AND USES(BEST)..pdf
bathyates
 
Chapter 5.1.pptxsertj you can get it done before the election and I will
Chapter 5.1.pptxsertj you can get it done before the election and I willChapter 5.1.pptxsertj you can get it done before the election and I will
Chapter 5.1.pptxsertj you can get it done before the election and I will
SotheaPheng
 

Materialized Column: An Efficient Way to Optimize Queries on Nested Columns

  • 1. Materialized Column: An Efficient Way to Optimize Queries on Nested Columns. Guo, Jun (jason.guo.vip@gmail.com), Lead of Data Engine Team, @ByteDance
  • 2. Who we are
      o Data Engine team of ByteDance
      o Build a platform of one-stop experience for OLAP, on which users can analyze PB-level data by writing SQL without caring about the underlying execution engine
  • 3. What we do
      o Manage Spark SQL / Presto / Hive workload
      o Offer Open API and self-serve platform
      o Optimize Spark SQL / Presto / Hive engine
      o Design data architecture for most business lines in ByteDance
  • 4. Agenda
      ▪ Spark SQL at ByteDance
      ▪ Why nested types are widely used
      ▪ What are the main issues with nested types
      ▪ Optional solutions
      ▪ How does Materialized Column solve these problems
  • 5. Spark SQL at ByteDance
  • 6. Spark SQL at ByteDance
      Timeline, 2016 to 2020, in order of adoption stage:
      ▪ Small-scale experiments
      ▪ Ad-hoc workload
      ▪ Few ETL pipelines in production
      ▪ Full-production deployment
      ▪ Main engine in DW area
  • 7. Why nested types are widely used
  • 8. Why nested types are widely used
      ▪ Event log
          ▪ A lot of new tracking events are created every day
          ▪ It is not a good idea to create a new column for each new type of event
      ▪ Dimension
          ▪ Dimension tables are dumped from the MySQL databases of service backends
          ▪ A service backend may add new fields on demand. These fields may not be helpful for now, but they may be useful in the future
  • 9. Main issues with nested types
  • 10. Main issues with nested types
      ▪ Unnecessary data is read, which wastes IO
      ▪ Vectorized read cannot be exploited when a nested-type column is read
      ▪ Filter pushdown cannot be utilized when a nested column is read
      ▪ Duplicated computation, e.g. JSON parsing is CPU-intensive
  • 12. Optional solutions – A separate table
      ▪ DW users design a solution to solve these problems
      ▪ Maintain a new table that adds new columns extracted from the nested columns
      ▪ Downstream users should query this new table and its new columns for better performance
  • 13. Optional solutions – A separate table
      ▪ Pros
          ▪ Queries are on simple types, so all the problems are solved
      ▪ Cons
          ▪ Need to push all the downstream users to migrate their queries / pipelines to the new table and new columns
          ▪ Duplicated storage and computation cost
          ▪ Cannot handle frequent subfield changes
  • 14. Optional solutions – Vectorized Read on Nested Column
      ▪ Refactor the Parquet vectorized reader to support vectorized read for nested types
      ▪ Support predicate pushdown for struct
  • 15. Optional solutions – Vectorized Read on Nested Column
      ▪ Pros
          ▪ Enable vectorized read without any storage overhead
      ▪ Cons
          ▪ Need to refactor the vectorized readers for Parquet and ORC separately
          ▪ Filter pushdown for Array/Map is still not available
          ▪ The performance of vectorized read on nested types is not as good as that on simple types
      ▪ Improve performance with struct by about 100%
      ▪ Improve performance with map by about 163%
  • 16. How does Materialized Column solve these problems
  • 17. How does Materialized Column solve these problems
      Original table:
          CREATE TABLE base_table (
            item STRING,
            count INT,
            people MAP<STRING, STRING>,
            date STRING
          ) USING parquet PARTITIONED BY (date);
      Add materialized column:
          ALTER TABLE base_table ADD COLUMNS (
            age INT MATERIALIZED CAST(people['age'] AS INTEGER)
          );
  • 18. How does Materialized Column solve these problems
  • 19. How does Materialized Column solve these problems
      Write with materialized column:
          explain extended
          insert into base_table partition(date='20201010')
          select 'apple', 1, map('age','18','name','jack','gender','male')
  • 20. How does Materialized Column solve these problems
      [Figure panels: Query without materialized column rewrite; Query with materialized column rewrite]
  • 21. How does Materialized Column solve these problems
      Test case     Without Materialized Column rewrite   With Materialized Column rewrite   Performance   Read data size
      SQL_adhoc_1   6.3 min / 797.6 GB                    3.4 min / 111.8 GB                 85.3%↑        86%↓
      SQL_adhoc_2   16.5 min / 3.2 TB                     5.0 min / 111.1 GB                 230%↑         96.6%↓
      SQL_etl_1     24 min / 3.7 TB                       9.1 min / 686.1 GB                 130.8%↑       82%↓
  • 22. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.
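
The mechanism in slides 17 to 20 can be illustrated with a small, self-contained Python model. This is only a sketch under stated assumptions, not ByteDance's Spark implementation: the `MATERIALIZED` registry and the `write_row`, `rewrite`, and `evaluate` helpers are hypothetical names introduced here. It shows the two halves of the idea: the materialized value is computed once at write time, and a query expression that matches the column's definition is rewritten to read the plain column instead of the nested one.

```python
# Hypothetical, simplified model of a materialized column.
# It is NOT Spark code; every name below is an assumption made for illustration.

# Maps a materialized column name to (nested column, key, cast function),
# mirroring: age INT MATERIALIZED CAST(people['age'] AS INTEGER)
MATERIALIZED = {"age": ("people", "age", int)}


def write_row(row):
    """Return a copy of row with every materialized column filled in at write time."""
    out = dict(row)
    for col, (nested, key, cast) in MATERIALIZED.items():
        raw = row.get(nested, {}).get(key)
        out[col] = cast(raw) if raw is not None else None
    return out


def rewrite(expr):
    """Replace a (nested column, key, cast) expression with a materialized column name.

    The expression is returned unchanged when no materialized column matches.
    """
    for col, definition in MATERIALIZED.items():
        if definition == expr:
            return col
    return expr


def evaluate(expr, row):
    """Evaluate a plain column name or a nested-access expression against a row."""
    if isinstance(expr, str):
        return row[expr]
    nested, key, cast = expr
    raw = row[nested].get(key)
    return cast(raw) if raw is not None else None


# The row inserted on slide 19; "age" is derived automatically on write.
stored = write_row({
    "item": "apple",
    "count": 1,
    "people": {"age": "18", "name": "jack", "gender": "male"},
    "date": "20201010",
})

query_expr = ("people", "age", int)   # stands for CAST(people['age'] AS INTEGER)
rewritten = rewrite(query_expr)       # becomes the plain column name "age"
```

After the rewrite, `evaluate(rewritten, stored)` touches only the simple `age` column, which is the property that lets a real engine skip reading the map, use vectorized reads, and push filters down. Expressions with no matching materialized column, such as a lookup of `people['name']`, pass through `rewrite` unchanged, so existing queries keep working without migration.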