Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources

Edmon Begoli, Jesús Camacho-Rodríguez, Julian Hyde, Michael J. Mior, Daniel Lemire

2018 SIGMOD, Houston, Texas, USA


Outline
Background and History

Architecture

Adapter Design

Optimizer and Planner

Adoption

Uses in Research and Scholastic Potential

Roadmap and Future Work


What is Calcite?
Apache Calcite is an extensible framework for building data management systems.

It is an open source project governed by the Apache Software Foundation, is written in Java, and is used by dozens of projects and companies, and several research projects.
Origins and Design Principles
Origins
2004 – LucidEra and SQLstream were each building SQL systems
2012 – Pare down code base, enter Apache as incubator project

Problem
Building a high-quality database requires ~20 person-years (effort) and 5 years (elapsed)

Solution
Create an open source framework that a community can contribute to, and use to build their own DBMSs

Design principles
Flexible → Relational algebra
Extensible/composable → Volcano-style planner
Easy to contribute to → Java, FP style

Alternatives
PostgreSQL, Apache Spark, AsterixDB



Architecture
Core – Operator expressions (relational algebra) and planner (based on Volcano/Cascades)

External – Data storage, algorithms and catalog

Optional – SQL parser, JDBC & ODBC drivers

Extensible – Planner rewrite rules, statistics, cost model, algebra, UDFs
Adapter Design
A pattern that defines how Calcite incorporates diverse data sources for general access.

Model – specification of the physical properties of the data source.

Schema – definition of the data (format and layouts) found in the model.
Represent query as relational algebra

select [Link], count(*) as c
from [Link] as s
join [Link] as p
on [Link] = [Link]
where [Link] = 'purchase'
group by [Link]
order by c desc

Plan:
scan (Splunk, Table: splunk) and scan (MySQL, Table: products)
→ join (Key: productId)
→ filter (Condition: action = 'purchase')
→ group (Key: productName, Agg: count)
→ sort (Key: c desc)
Optimize query by applying transformation rules

select [Link], count(*) as c
from [Link] as s
join [Link] as p
on [Link] = [Link]
where [Link] = 'purchase'
group by [Link]
order by c desc

Plan:
scan (Splunk, Table: splunk) → filter (Condition: action = 'purchase')
scan (MySQL, Table: products)
→ join (Key: productId) of the filtered Splunk scan and the MySQL scan
→ group (Key: productName, Agg: count)
→ sort (Key: c desc)
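
The rewrite between the two plans above, moving the filter below the join onto the Splunk scan, can be sketched as a single transformation rule. This is a minimal Python illustration and not Calcite's API: the classes `Scan`, `Filter`, `Join`, the function `push_filter_into_join`, and the column sets given to each table are hypothetical names and assumptions made for the example.

```python
from dataclasses import dataclass
from typing import FrozenSet, Union


@dataclass(frozen=True)
class Scan:
    table: str
    columns: FrozenSet[str]  # assumed column set, for illustration only


@dataclass(frozen=True)
class Filter:
    input: "Node"
    column: str     # the column the condition reads
    condition: str  # condition text, kept opaque in this sketch


@dataclass(frozen=True)
class Join:
    left: "Node"
    right: "Node"
    key: str


Node = Union[Scan, Filter, Join]


def columns_of(node: Node) -> FrozenSet[str]:
    """Columns produced by a node."""
    if isinstance(node, Scan):
        return node.columns
    if isinstance(node, Filter):
        return columns_of(node.input)
    return columns_of(node.left) | columns_of(node.right)


def push_filter_into_join(node: Node) -> Node:
    """Rewrite Filter(Join(l, r)) so the filter sits on the one input owning its column."""
    if not (isinstance(node, Filter) and isinstance(node.input, Join)):
        return node  # the rule does not match this node
    join = node.input
    in_left = node.column in columns_of(join.left)
    in_right = node.column in columns_of(join.right)
    if in_left and not in_right:
        return Join(Filter(join.left, node.column, node.condition), join.right, join.key)
    if in_right and not in_left:
        return Join(join.left, Filter(join.right, node.column, node.condition), join.key)
    return node  # column on both sides or on neither: leave the plan unchanged


splunk = Scan("splunk", frozenset({"productId", "action"}))
products = Scan("products", frozenset({"productId", "productName"}))
before = Filter(Join(splunk, products, "productId"), "action", "action = 'purchase'")
after = push_filter_into_join(before)
```

Here `after` is a join whose left input is the filtered Splunk scan. A real planner would hold many such rules and pick among the resulting plans by cost; this sketch applies one rule once.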
Conventions
1. Plans start as logical nodes.
2. Assign each Scan its table’s native convention.
3. Fire rules to propagate conventions to other nodes.
4. The best plan may use an engine not tied to any native format. To implement, generate a program that calls out to query1 and query2.

(Each step is illustrated on the same plan: a Join whose inputs are a Filter and a Scan, where the Filter sits on a Join of two Scans.)
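
The four steps above can be sketched on a small plan tree. This is a simplified Python illustration under stated assumptions, not Calcite's planner: `PlanNode`, `assign_conventions`, `native_subqueries`, and the convention names `"mysql"`, `"splunk"` and `"neutral"` are hypothetical, and conventions are propagated greedily, whereas a cost-based planner would compare alternatives.

```python
from dataclasses import dataclass, field
from typing import List, Optional

LOGICAL = "logical"
NEUTRAL = "neutral"  # hypothetical engine not tied to any native format


@dataclass
class PlanNode:
    op: str
    inputs: List["PlanNode"] = field(default_factory=list)
    table_convention: Optional[str] = None  # only meaningful on scans
    convention: str = LOGICAL               # step 1: every node starts logical


def assign_conventions(node: PlanNode) -> str:
    """Steps 2 and 3: scans take their table's convention; any other node
    adopts its inputs' convention when they all agree, else the neutral one."""
    if node.op == "Scan":
        node.convention = node.table_convention or NEUTRAL
        return node.convention
    seen = {assign_conventions(child) for child in node.inputs}
    node.convention = seen.pop() if len(seen) == 1 else NEUTRAL
    return node.convention


def native_subqueries(node: PlanNode) -> List[PlanNode]:
    """Step 4: the largest subtrees that a single native engine can run."""
    if node.convention != NEUTRAL:
        return [node]
    found: List[PlanNode] = []
    for child in node.inputs:
        found.extend(native_subqueries(child))
    return found


inner = PlanNode("Join", [PlanNode("Scan", table_convention="mysql"),
                          PlanNode("Scan", table_convention="mysql")])
plan = PlanNode("Join", [PlanNode("Filter", [inner]),
                         PlanNode("Scan", table_convention="splunk")])
assign_conventions(plan)
queries = native_subqueries(plan)  # plays the role of query1 and query2
```

In this example the top Join ends up in the neutral convention and calls out to two native subqueries: the filtered MySQL join and the Splunk scan.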
Conventions & adapters
Convention provides a uniform representation for hybrid queries

Like ordering and distribution, convention is a physical property of nodes

Adapter =
schema factory (lists tables)
+ convention
+ rules to convert nodes to convention
Streaming SQL

select stream *
from Orders as o
where units > (
  select avg(units)
  from Orders as h
  where [Link] = [Link]
  and [Link] >
    [Link] - interval '1' year)

“Show me real-time orders whose size is larger than the average for that product over the preceding year”

Stream ~= append-only table

Streaming queries return deltas

Stream-table duality: Orders is used as both stream and table

Our contributions:
➢ Popularize streaming SQL
➢ SQL parser / validator / rules
➢ Reference implementation & TCK
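
The stream-table duality in the query above can be illustrated with a small Python sketch in which the same `Orders` data serves as the incoming stream and as the history table being averaged. The function name `large_orders`, the tuple layout, and the sample data are hypothetical. The sketch also assumes that orders arrive in `rowtime` order, that the average covers only orders seen before the current one, and that a year is 365 days.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Dict, Iterable, Iterator, List, Tuple

Order = Tuple[datetime, str, int]  # (rowtime, product id, units)


def large_orders(orders: Iterable[Order]) -> Iterator[Order]:
    """Emit each order whose units exceed its product's average over the preceding year."""
    history: Dict[str, List[Tuple[datetime, int]]] = defaultdict(list)
    year = timedelta(days=365)
    for rowtime, product, units in orders:
        # Table view: earlier orders for this product within the last year.
        recent = [(t, u) for t, u in history[product] if t > rowtime - year]
        history[product] = recent
        # No earlier orders means no average, so nothing is emitted.
        if recent and units > sum(u for _, u in recent) / len(recent):
            yield (rowtime, product, units)
        history[product].append((rowtime, units))


orders = [
    (datetime(2017, 1, 1), "a", 10),
    (datetime(2017, 6, 1), "a", 20),
    (datetime(2017, 7, 1), "a", 12),
    (datetime(2018, 6, 15), "a", 13),
]
emitted = list(large_orders(orders))
```

Because the function is a generator, each qualifying order is returned as soon as it arrives, which mirrors a streaming query returning deltas rather than a complete result.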
Uses and Adoption
Uses in Research
● Polystore research – use as lightweight
heterogeneous data processing platform
● Optimization and query profiling –
general performance, and optimizer
research
● Reasoning over Streams, Graphs –
under consideration
● Open-source, production-grade learning
and research platform
Future Work and Roadmap
● Support its use as a standalone engine – DDL, materialized views,
indexes and constraints.
● Improvements to the design and extensibility of the planner
(modularity, pluggability)
● Incorporation of new parametric approaches into the design of the
optimizer.
● Support for an extended set of SQL commands, functions, and
utilities, including full compliance with OpenGIS (spatial).
● New adapters for non-relational data sources such as array
databases.
● Improvements to performance profiling and instrumentation.
Thank you! Questions?

@ApacheCalcite

[Link]

[Link]
Extra slides
Calcite framework
Relational algebra
RelNode (operator)
• TableScan
• Filter
• Project
• Union
• Aggregate
• …
RelDataType (type)
RexNode (expression)
RelTrait (physical property)
• RelConvention (calling-convention)
• RelCollation (sortedness)
• RelDistribution (partitioning)
RelBuilder

SQL parser
SqlNode
SqlParser
SqlValidator

Metadata
Schema
Table
Function
• TableFunction
• TableMacro
Lattice

JDBC driver

Transformation rules
RelOptRule
• FilterMergeRule
• AggregateUnionTransposeRule
• 100+ more

Global transformations
• Unification (materialized view)
• Column trimming
• De-correlation

Cost, statistics
RelOptCost
RelOptCostFactory
RelMetadataProvider
• RelMdColumnUniqueness
• RelMdDistinctRowCount
• RelMdSelectivity
Avatica

● Database connectivity stack
● Self-contained sub-project of Calcite
● Fast, open, stable
● Protobuf or JSON over HTTP
● Powers Phoenix Query Server
Lattice (optimized)

Key
z zipcode (43k)
s state (50)
g gender (2)
y year (5)
m month (12)

Row counts per combination of dimensions:
() 1
(z) 43k, (s) 50, (g) 2, (y) 5, (m) 12
(z, s) 43.4k, (g, y) 10, (g, m) 24, (y, m) 60
(z, s, g) 83.6k, (g, y, m) 120
(z, s, g, y) 392k, (z, s, g, m) 644k, (z, s, y, m) 831k, (z, g, y, m) 909k, (s, g, y, m) 6k
(z, s, g, y, m) 912k
raw 1m
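
One way to read the lattice is as a lookup: a query that groups by some dimensions can be answered from any summary whose dimensions include them, and the smallest such summary is the cheapest to scan. The Python sketch below uses the row counts from the slide. The function name `cheapest_tile` is hypothetical, and the sketch assumes every listed combination is materialized, which the slide does not state.

```python
from typing import Dict, FrozenSet, Tuple

# Row counts per dimension combination, taken from the lattice above.
# Dimension letters: z zipcode, s state, g gender, y year, m month.
TILES: Dict[FrozenSet[str], int] = {
    frozenset(): 1,
    frozenset("z"): 43_000,
    frozenset("s"): 50,
    frozenset("g"): 2,
    frozenset("y"): 5,
    frozenset("m"): 12,
    frozenset("zs"): 43_400,
    frozenset("gy"): 10,
    frozenset("gm"): 24,
    frozenset("ym"): 60,
    frozenset("zsg"): 83_600,
    frozenset("gym"): 120,
    frozenset("zsgy"): 392_000,
    frozenset("zsgm"): 644_000,
    frozenset("zsym"): 831_000,
    frozenset("zgym"): 909_000,
    frozenset("sgym"): 6_000,
    frozenset("zsgym"): 912_000,
}


def cheapest_tile(dims: str) -> Tuple[FrozenSet[str], int]:
    """Smallest tile whose dimensions cover every letter in `dims`."""
    wanted = frozenset(dims)
    candidates = [(tile, rows) for tile, rows in TILES.items() if wanted <= tile]
    return min(candidates, key=lambda pair: pair[1])


by_state_month = cheapest_tile("sm")   # group by state, month
by_gender_year = cheapest_tile("gy")   # group by gender, year
```

Grouping by state and month has no exact tile here, so the sketch falls back to (s, g, y, m) with 6k rows instead of the 1m-row raw table.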
Aggregation and windows on streams
GROUP BY aggregates multiple rows into sub-totals
➢ In regular GROUP BY each row contributes to exactly one sub-total
➢ In multi-GROUP BY (e.g. HOP, GROUPING SETS) a row can contribute to more than one sub-total

Window functions (OVER) leave the number of rows unchanged, but compute extra expressions for each row (based on neighboring rows)

(Diagrams: GROUP BY, Multi GROUP BY, Window functions)
Tumbling, hopping & session windows in SQL
Tumbling window
select stream … from Orders
group by floor(rowtime to hour)

select stream … from Orders
group by tumble(rowtime, interval '1' hour)

Hopping window
select stream … from Orders
group by hop(rowtime, interval '1' hour, interval '2' hour)

Session window
select stream … from Orders
group by session(rowtime, interval '1' hour)
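
How a row is assigned to windows under each of the three schemes can be sketched in a few lines of Python. This is an illustration of the grouping logic only, not Calcite code: the function names mirror the SQL functions but are hypothetical. The sketch assumes that the first interval of `hop` is the slide and the second is the window length, and that a gap exactly equal to the session interval still continues the session.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

HOUR = timedelta(hours=1)
EPOCH = datetime(1970, 1, 1)  # windows are aligned to this origin


def tumble(t: datetime, size: timedelta) -> datetime:
    """Start of the single fixed-size window that contains t."""
    return t - (t - EPOCH) % size


def hop(t: datetime, slide: timedelta, size: timedelta) -> List[datetime]:
    """Starts of all windows of length `size`, one every `slide`, that contain t."""
    start = tumble(t, slide)
    starts = []
    while start > t - size:
        starts.append(start)
        start -= slide
    return sorted(starts)


def sessions(times: List[datetime], gap: timedelta) -> List[Tuple[datetime, datetime]]:
    """(first, last) timestamps of each run of rows separated by at most `gap`."""
    result: List[Tuple[datetime, datetime]] = []
    for t in sorted(times):
        if result and t - result[-1][1] <= gap:
            result[-1] = (result[-1][0], t)  # extend the current session
        else:
            result.append((t, t))            # start a new session
    return result


t = datetime(2018, 6, 10, 10, 25)
tumbling_start = tumble(t, HOUR)
hopping_starts = hop(t, HOUR, 2 * HOUR)
day = datetime(2018, 6, 10)
session_bounds = sessions(
    [day + timedelta(hours=10), day + timedelta(hours=10, minutes=30),
     day + timedelta(hours=11, minutes=15), day + timedelta(hours=13)],
    HOUR,
)
```

The row at 10:25 falls into one tumbling window but two hopping windows, which is the multi-GROUP BY behavior: a single row contributing to more than one sub-total.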
Controlling when data is emitted
Early emission is the defining characteristic of a streaming query.

The emit clause is a SQL extension inspired by Apache Beam’s “trigger” notion. (Still experimental… and evolving.)

select stream productId,
  count(*) as c
from Orders
group by productId,
  floor(rowtime to hour)
emit at watermark,
  early interval '2' minute,
  late limit 1;

A relational (non-streaming) query is just a query with the most conservative possible emission strategy.

select *
from Orders
emit when complete;
