Apache Calcite: A Foundational
Framework for Optimized Query
Processing Over Heterogeneous Data
Sources
Edmon Begoli, Jesús Camacho-Rodríguez, Julian Hyde,
Michael J. Mior, Daniel Lemire
SIGMOD 2018, Houston, Texas, USA
Outline
Background and History
Architecture
Adapter Design
Optimizer and Planner
Adoption
Uses in Research and Scholastic Potential
Roadmap and Future Work
What is Calcite?
Apache Calcite is an extensible framework for
building data management systems.
It is an open source project governed by the
Apache Software Foundation, written in Java,
and used by dozens of projects and companies,
as well as several research projects.
Origins and Design Principles
Origins: 2004 – LucidEra and SQLstream were each building SQL systems;
2012 – code base pared down, entered Apache as an incubator project
Problem: building a high-quality database requires ~20 person-years of effort
and 5 years elapsed
Solution: create an open source framework that a community can contribute
to, and use to build their own DBMSs
Design principles: flexible → relational algebra;
extensible/composable → Volcano-style planner;
easy to contribute to → Java, FP style
Alternatives: PostgreSQL, Apache Spark, AsterixDB
Architecture
Core – Operator expressions
(relational algebra) and planner
(based on Volcano/Cascades)
External – Data storage, algorithms
and catalog
Optional – SQL parser, JDBC &
ODBC drivers
Extensible – Planner rewrite rules,
statistics, cost model, algebra, UDFs
Adapter Design
A pattern that defines how
Calcite incorporates diverse
data sources for general
access.
Model – specification of the
physical properties of the data
source.
Schema – definition of the data
(format and layouts) found in
the model.
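As a concrete illustration, an adapter is typically wired in through a JSON model file. The sketch below follows the format of the Calcite CSV adapter tutorial; the factory class and directory name are the tutorial's, not part of this slide:

```json
{
  "version": "1.0",
  "defaultSchema": "SALES",
  "schemas": [
    {
      "name": "SALES",
      "type": "custom",
      "factory": "org.apache.calcite.adapter.csv.CsvSchemaFactory",
      "operand": {
        "directory": "sales"
      }
    }
  ]
}
```

The schema factory named here is what lists the tables; the operand block carries source-specific physical properties.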
Represent query as relational algebra

select [Link], count(*) as c
from [Link] as s
join [Link] as p
on [Link] = [Link]
where [Link] = 'purchase'
group by [Link]
order by c desc

[Plan: scan of the Splunk table "splunk" and scan of the MySQL table "products", joined on key productId; then filter on condition action = 'purchase'; aggregate count(*) grouped by key productName; sort by key c descending.]
Optimize query by applying transformation rules

select [Link], count(*) as c
from [Link] as s
join [Link] as p
on [Link] = [Link]
where [Link] = 'purchase'
group by [Link]
order by c desc

[Optimized plan: the filter on action = 'purchase' is pushed below the join, onto the Splunk scan; the rest is as before: join on productId with the MySQL products scan, aggregate count(*) by productName, sort by c descending.]
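The push-down in the optimized plan can be sketched as a rewrite rule over a toy operator tree. This is illustrative Java, not Calcite's RelOptRule API; the table and column names mirror the example query:

```java
import java.util.List;
import java.util.Map;

// Toy relational operators (illustrative, not Calcite's classes).
interface Rel {}
record Scan(String table) implements Rel {}
record Filter(Rel input, String column) implements Rel {}   // predicate on one column
record Join(Rel left, Rel right) implements Rel {}          // join on a shared key

public class FilterPushDown {
  // Columns each table provides, as in the Splunk/MySQL example.
  static final Map<String, List<String>> COLUMNS = Map.of(
      "splunk", List.of("action", "productId"),
      "products", List.of("productId", "productName"));

  // Does this subtree produce the given column?
  static boolean produces(Rel rel, String column) {
    if (rel instanceof Scan s) return COLUMNS.get(s.table()).contains(column);
    if (rel instanceof Filter f) return produces(f.input(), column);
    Join j = (Join) rel;
    return produces(j.left(), column) || produces(j.right(), column);
  }

  // Rule: Filter(Join(l, r)) -> Join(Filter(l), r) when the predicate
  // references only one side's columns (mirrored for the right side).
  static Rel pushFilter(Rel rel) {
    if (rel instanceof Filter f && f.input() instanceof Join j) {
      boolean left = produces(j.left(), f.column());
      boolean right = produces(j.right(), f.column());
      if (left && !right) return new Join(new Filter(j.left(), f.column()), j.right());
      if (right && !left) return new Join(j.left(), new Filter(j.right(), f.column()));
    }
    return rel;
  }

  public static void main(String[] args) {
    Rel plan = new Filter(new Join(new Scan("splunk"), new Scan("products")), "action");
    // The filter on "action" moves below the join, onto the splunk scan.
    System.out.println(pushFilter(plan));
  }
}
```

Calcite's planner fires rules of this shape repeatedly over a memo of equivalent plans and keeps the cheapest result, rather than applying one rewrite in place.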
Conventions

1. Plans start as logical nodes.
2. Assign each Scan its table's native convention.
3. Fire rules to propagate conventions to other nodes.
4. The best plan may use an engine not tied to any native format.
To implement, generate a program that calls out to query1 and query2.

[Figure: the same Join/Filter/Scan plan tree shown at each step, with conventions spreading upward from the scans.]
Conventions & adapters

Convention provides a uniform representation for hybrid queries.

Like ordering and distribution, convention is a physical property of nodes.

Adapter =
schema factory (lists tables)
+ convention
+ rules to convert nodes to the convention
Streaming SQL

Stream ~= append-only table
Streaming queries return deltas
Stream-table duality: Orders is used as both stream and table

select stream *
from Orders as o
where units > (
  select avg(units)
  from Orders as h
  where [Link] = [Link]
  and [Link] > [Link] - interval '1' year)

"Show me real-time orders whose size is larger than the average for that product over the preceding year"

Our contributions:
➢ Popularize streaming SQL
➢ SQL parser / validator / rules
➢ Reference implementation & TCK
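The semantics of the streaming query above can be sketched in plain Java over an in-memory list, assuming the history sub-query ranges over orders of the same product up to and including the current row's timestamp. The Order record and the data are illustrative, not Calcite code:

```java
import java.util.ArrayList;
import java.util.List;

// An order event; rowtime is epoch milliseconds.
record Order(long rowtime, String productId, int units) {}

public class LargeOrders {
  static final long YEAR_MS = 365L * 24 * 60 * 60 * 1000;

  // Orders whose units exceed the average units for the same product
  // over the preceding year (history up to the current rowtime).
  static List<Order> largeOrders(List<Order> orders) {
    List<Order> out = new ArrayList<>();
    for (Order o : orders) {
      double sum = 0;
      int n = 0;
      for (Order h : orders) {
        if (h.productId().equals(o.productId())
            && h.rowtime() <= o.rowtime()
            && h.rowtime() > o.rowtime() - YEAR_MS) {
          sum += h.units();
          n++;
        }
      }
      if (n > 0 && o.units() > sum / n) {
        out.add(o);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<Order> orders = List.of(
        new Order(0, "p1", 10),
        new Order(1_000, "p1", 30),
        new Order(2_000, "p1", 15));
    System.out.println(largeOrders(orders)); // only the 30-unit order qualifies
  }
}
```

Stream-table duality shows up in the code as well: the same list plays the role of the arriving stream (outer loop) and of the queryable history table (inner loop).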
Uses and Adoption
Uses in Research
● Polystore research – use as lightweight
heterogeneous data processing platform
● Optimization and query profiling –
general performance, and optimizer
research
● Reasoning over Streams, Graphs –
under consideration
● Open-source, production grade learning
and research platform
Future Work and Roadmap
● Support its use as a standalone engine – DDL, materialized views,
indexes and constraints.
● Improvements to the design and extensibility of the planner
(modularity, pluggability)
● Incorporation of new parametric approaches into the design of the
optimizer.
● Support for an extended set of SQL commands, functions, and
utilities, including full compliance with OpenGIS (spatial).
● New adapters for non-relational data sources such as array
databases.
● Improvements to performance profiling and instrumentation.
Thank you! Questions?
@ApacheCalcite
[Link]
[Link]
Extra slides
Calcite framework

Relational algebra
• RelNode (operator): TableScan, Filter, Project, Union, Aggregate, …
• RelDataType (type)
• RexNode (expression)
• RelTrait (physical property): RelConvention (calling convention),
RelCollation (sortedness), RelDistribution (partitioning)
• RelBuilder

SQL parser
• SqlNode, SqlParser, SqlValidator

Metadata
• Schema, Table, Function, TableFunction, TableMacro, Lattice

JDBC driver

Transformation rules
• RelOptRule: FilterMergeRule, AggregateUnionTransposeRule, 100+ more

Global transformations
• Unification (materialized view), column trimming, de-correlation

Cost, statistics
• RelOptCost, RelOptCostFactory
• RelMetadataProvider: RelMdColumnUniqueness,
RelMdDistinctRowCount, RelMdSelectivity
Avatica
● Database connectivity stack
● Self-contained sub-project of Calcite
● Fast, open, stable
● Protobuf or JSON over HTTP
● Powers Phoenix Query Server
Lattice (optimized)

Grouping sets and their row counts:
() 1
(z) 43k, (s) 50, (g) 2, (y) 5, (m) 12
(z, s) 43.4k, (g, y) 10, (g, m) 24, (y, m) 60
(z, s, g) 83.6k, (g, y, m) 120
(z, s, g, y) 392k, (z, s, g, m) 644k, (z, s, y, m) 831k, (z, g, y, m) 909k, (s, g, y, m) 6k
(z, s, g, y, m) 912k
raw 1m

Key: z zipcode (43k), s state (50), g gender (2), y year (5), m month (12)
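One use of these counts: answer a GROUP BY from the cheapest materialized grouping set ("tile") that covers it. The row counts below are taken from the lattice above; the tile-selection logic is a simplified sketch, not Calcite's lattice module:

```java
import java.util.Comparator;
import java.util.Map;
import java.util.Set;

public class LatticeTiles {
  // Approximate row counts for a few grouping sets from the lattice
  // (only tiles listed here may be passed to cheapestCover).
  static final Map<Set<String>, Integer> ROWS = Map.of(
      Set.of("g", "y"), 10,
      Set.of("g", "y", "m"), 120,
      Set.of("z", "s", "g", "y", "m"), 912_000);

  // Pick the smallest materialized tile whose columns cover the query's
  // grouping columns; the query is then answered by rolling that tile up.
  static Set<String> cheapestCover(Set<String> groupBy, Set<Set<String>> materialized) {
    return materialized.stream()
        .filter(tile -> tile.containsAll(groupBy))
        .min(Comparator.comparingInt(ROWS::get))
        .orElseThrow();
  }

  public static void main(String[] args) {
    Set<Set<String>> tiles = Set.of(
        Set.of("g", "y", "m"), Set.of("z", "s", "g", "y", "m"));
    // GROUP BY gender, year: roll up the 120-row (g, y, m) tile,
    // not the 912k-row (z, s, g, y, m) summary.
    System.out.println(cheapestCover(Set.of("g", "y"), tiles));
  }
}
```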
Aggregation and windows on streams

GROUP BY aggregates multiple rows into sub-totals:
➢ In regular GROUP BY, each row contributes to exactly one sub-total.
➢ In multi-GROUP BY (e.g. HOP, GROUPING SETS), a row can contribute to more than one sub-total.

Window functions (OVER) leave the number of rows unchanged, but compute extra expressions for each row (based on neighboring rows).
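The row-count contrast can be sketched with plain Java streams (illustrative only; the data and the windowed expression are made up):

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class GroupVsWindow {
  // GROUP BY: input rows collapse into one sub-total per group.
  static Map<String, Long> groupBy(List<String> rows) {
    return rows.stream()
        .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
  }

  // OVER: every input row survives, annotated with an extra expression
  // computed over its neighbors (here, its group's count).
  static List<String> windowed(List<String> rows, Map<String, Long> counts) {
    return rows.stream()
        .map(r -> r + " count=" + counts.get(r))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<String> products = List.of("a", "a", "b");
    Map<String, Long> grouped = groupBy(products);
    System.out.println(grouped.size());                      // 2 sub-totals
    System.out.println(windowed(products, grouped).size());  // still 3 rows
  }
}
```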
Tumbling, hopping & session windows in SQL

Tumbling window:
select stream … from Orders
group by floor(rowtime to hour)

select stream … from Orders
group by tumble(rowtime, interval '1' hour)

Hopping window:
select stream … from Orders
group by hop(rowtime, interval '1' hour, interval '2' hour)

Session window:
select stream … from Orders
group by session(rowtime, interval '1' hour)
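Assuming hop(rowtime, slide, size) means windows of the given size starting every slide interval, the window assignments can be sketched as follows (illustrative Java, not Calcite's implementation):

```java
import java.util.ArrayList;
import java.util.List;

public class WindowAssign {
  static final long HOUR_MS = 3_600_000L;

  // Tumbling: each row falls in exactly one window, keyed by its start.
  static long tumble(long rowtime) {
    return rowtime - Math.floorMod(rowtime, HOUR_MS);
  }

  // Hopping: each row falls in size/slide overlapping windows,
  // identified here by their start times, newest first.
  static List<Long> hop(long rowtime, long slide, long size) {
    List<Long> starts = new ArrayList<>();
    long latest = rowtime - Math.floorMod(rowtime, slide); // latest start <= rowtime
    for (long s = latest; s > rowtime - size; s -= slide) {
      starts.add(s);
    }
    return starts;
  }

  public static void main(String[] args) {
    long t = 90 * 60_000L;  // a row at 01:30
    System.out.println(tumble(t));                     // one window, starting 01:00
    System.out.println(hop(t, HOUR_MS, 2 * HOUR_MS));  // two windows: 01:00 and 00:00
  }
}
```

This makes the multi-GROUP BY point concrete: the hopping row contributes to two sub-totals, the tumbling row to exactly one.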
Controlling when data is emitted

Early emission is the defining characteristic of a streaming query.

The emit clause is a SQL extension inspired by Apache Beam's "trigger" notion. (Still experimental… and evolving.)

select stream productId,
  count(*) as c
from Orders
group by productId,
  floor(rowtime to hour)
emit at watermark,
  early interval '2' minute,
  late limit 1;

A relational (non-streaming) query is just a query with the most conservative possible emission strategy:

select *
from Orders
emit when complete;