Apache Calcite: A Foundational
Framework for Optimized Query
Processing Over Heterogeneous Data
Sources
Edmon Begoli, Jesús Camacho-Rodríguez, Julian Hyde,
Michael J. Mior, Daniel Lemire
2018 SIGMOD, Houston, Texas, USA
Outline
Background and History
Architecture
Adapter Design
Optimizer and Planner
Adoption
Uses in Research and Scholastic Potential
Roadmap and Future Work
What is Calcite?
Apache Calcite is an extensible framework for
building data management systems.
It is an open source project governed by the
Apache Software Foundation, is written in
Java, and is used by dozens of projects and
companies, and several research projects.
Origins and Design Principles
Origins:
2004 – LucidEra and SQLstream were each building SQL systems
2012 – Pare down code base, enter Apache as incubator project
Problem: Building a high-quality database requires ~20 person-years (effort)
and 5 years (elapsed)
Solution: Create an open source framework that a community can contribute
to, and use to build their own DBMSs
Design principles:
Flexible → Relational algebra
Extensible/composable → Volcano-style planner
Easy to contribute to → Java, FP style
Alternatives: PostgreSQL, Apache Spark, AsterixDB
Architecture
Core – Operator expressions
(relational algebra) and planner
(based on Volcano/Cascades)
External – Data storage, algorithms
and catalog
Optional – SQL parser, JDBC &
ODBC drivers
Extensible – Planner rewrite rules,
statistics, cost model, algebra, UDFs
Adapter Design
A pattern that defines how
Calcite incorporates diverse
data sources for general
access.
Model – specification of the
physical properties of the data
source.
Schema – definition of the data
(format and layouts) found in
the model.
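To make the Model/Schema split a little more concrete, here is a minimal sketch that builds a model document as a Python dict and serializes it to JSON. The field layout (version, defaultSchema, schemas with name/type/factory/operand) is assumed to follow the CSV example in Calcite's tutorial; the schema name SALES and the directory value are placeholders, not taken from these slides.

```python
import json

# Hypothetical model for a CSV-backed schema. Field names are assumed to
# follow Calcite's CSV tutorial; the values are placeholders.
model = {
    "version": "1.0",
    "defaultSchema": "SALES",
    "schemas": [
        {
            "name": "SALES",
            "type": "custom",
            # The schema factory is the adapter class that lists the tables.
            "factory": "org.apache.calcite.adapter.csv.CsvSchemaFactory",
            # Operand values are passed to the factory (here: where the files live).
            "operand": {"directory": "sales"},
        }
    ],
}

model_json = json.dumps(model, indent=2)
print(model_json)
```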
Represent query as relational algebra
select p.productName, count(*) as c
from splunk.splunk as s
join mysql.products as p
on s.productId = p.productId
where s.action = 'purchase'
group by p.productName
order by c desc
[Figure: operator tree for this query. scan (Table: splunk, in Splunk) and scan (Table: products, in MySQL) feed join (Key: productId), followed by filter (Condition: action = 'purchase'), group (Key: productName, Agg: count) and sort (Key: c desc).]
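The operator tree for this query can be sketched as a small data structure. This is a toy model, not Calcite's RelNode API: the Node class and explain function are hypothetical names, and the tree shape assumes the usual SQL-to-algebra translation in which the WHERE filter initially sits above the join.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A toy relational operator: a name, a description, and child inputs."""
    op: str
    detail: str = ""
    inputs: List["Node"] = field(default_factory=list)


def explain(node, depth=0):
    """Return the plan as indented lines, root first."""
    suffix = f" ({node.detail})" if node.detail else ""
    lines = ["  " * depth + node.op + suffix]
    for child in node.inputs:
        lines.extend(explain(child, depth + 1))
    return lines


# Unoptimized plan for the Splunk/MySQL query on this slide.
plan = Node("sort", "c desc", [
    Node("group", "key: productName, agg: count", [
        Node("filter", "action = 'purchase'", [
            Node("join", "key: productId", [
                Node("scan", "splunk.splunk"),
                Node("scan", "mysql.products"),
            ]),
        ]),
    ]),
])

print("\n".join(explain(plan)))
```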
Optimize query by applying transformation rules
select p.productName, count(*) as c
from splunk.splunk as s
join mysql.products as p
on s.productId = p.productId
where s.action = 'purchase'
group by p.productName
order by c desc
[Figure: the plan after transformation rules have been applied, drawn with the same operators (scan, join, filter, group, sort) over the Splunk and MySQL tables.]
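As an illustration of what a transformation rule does, the sketch below pushes a filter beneath a join when its condition only refers to one join input. This is a toy rule on a hypothetical Rel class, not Calcite's RelOptRule API, and choosing filter pushdown as the example is an assumption about which rule applies to this query.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Rel:
    """Toy operator node. cond_alias names the table alias a filter refers to."""
    op: str
    inputs: Tuple["Rel", ...] = ()
    alias: Optional[str] = None
    cond: Optional[str] = None
    cond_alias: Optional[str] = None


def aliases(rel):
    """Collect the table aliases produced beneath a node."""
    if rel.op == "scan":
        return {rel.alias}
    found = set()
    for child in rel.inputs:
        found |= aliases(child)
    return found


def push_filter_into_join(rel):
    """Rewrite filter(join(a, b)) to join(filter(a), b) if the filter only uses a."""
    if rel.op != "filter" or rel.inputs[0].op != "join":
        return rel  # the rule does not match this node
    join = rel.inputs[0]
    new_inputs, pushed = [], False
    for child in join.inputs:
        if not pushed and rel.cond_alias in aliases(child):
            child = Rel("filter", (child,), cond=rel.cond, cond_alias=rel.cond_alias)
            pushed = True
        new_inputs.append(child)
    return Rel("join", tuple(new_inputs), cond=join.cond) if pushed else rel


before = Rel(
    "filter",
    (Rel("join", (Rel("scan", alias="s"), Rel("scan", alias="p")),
         cond="s.productId = p.productId"),),
    cond="s.action = 'purchase'",
    cond_alias="s",
)
after = push_filter_into_join(before)
print(after.op, [child.op for child in after.inputs])
```

After the rewrite the filter sits directly on the scan of s, so the source that owns that table has a chance to evaluate it.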
Conventions
1. Plans start as logical nodes.
2. Assign each Scan its table’s native convention.
3. Fire rules to propagate conventions to other nodes.
4. The best plan may use an engine not tied to any native format.
To implement, generate a program that calls out to query1 and query2.
[Figure: plan trees of Join, Filter and Scan nodes illustrating each step.]
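The four steps above can be sketched as a bottom-up pass over a plan tree. This is a deliberately greedy simplification: a real planner would weigh alternatives by cost, and the sketch assumes every source can execute any operator whose inputs all live in that source. The names Node, assign, LOGICAL and ENGINE are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

LOGICAL = "logical"  # step 1: nodes start without a physical convention
ENGINE = "engine"    # hypothetical convention not tied to any data source


@dataclass
class Node:
    op: str
    inputs: List["Node"] = field(default_factory=list)
    table_convention: str = ""  # only meaningful for scans
    convention: str = LOGICAL


def assign(node):
    """Assign conventions bottom-up and return the node."""
    for child in node.inputs:
        assign(child)
    if node.op == "scan":
        # Step 2: a scan takes its table's native convention.
        node.convention = node.table_convention
    else:
        seen = {child.convention for child in node.inputs}
        # Step 3: adopt the inputs' convention when they all agree.
        # Step 4: otherwise fall back to the source-independent engine.
        node.convention = seen.pop() if len(seen) == 1 else ENGINE
    return node


plan = assign(Node("join", [
    Node("filter", [Node("scan", table_convention="splunk")]),
    Node("scan", table_convention="mysql"),
]))
print(plan.convention, [child.convention for child in plan.inputs])
```

Here the filter inherits the Splunk convention, while the join, whose inputs disagree, ends up in the neutral engine that calls out to both sources.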
Conventions & adapters
[Figure: plan tree of Scan, Join and Filter nodes.]
Convention provides a uniform representation for hybrid queries
Like ordering and distribution, convention is a physical property of nodes
Adapter = schema factory (lists tables)
+ convention
+ rules to convert nodes to convention
Streaming SQL
Stream ~= append-only table
Streaming queries return deltas
Stream-table duality: Orders is used as both stream and table
Our contributions:
➢ Popularize streaming SQL
➢ SQL parser / validator / rules
➢ Reference implementation & TCK
select stream *
from Orders as o
where units > (
select avg(units)
from Orders as h
where h.productId = o.productId
and h.rowtime >
o.rowtime - interval '1' year)
“Show me real-time orders whose size is larger than the average for that
product over the preceding year”
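To illustrate the stream-table duality in this query, the sketch below processes orders one at a time (the stream) while keeping the rows seen so far as a history (the table). It rests on assumptions: a year is approximated as 365 days, rows arrive in rowtime order, and the history includes the current row, as the subquery's predicate would. The function name large_orders and the sample data are hypothetical.

```python
from datetime import datetime, timedelta

YEAR = timedelta(days=365)  # assumption: "interval '1' year" taken as 365 days


def large_orders(orders):
    """Yield orders whose units exceed the product's average over the preceding year.

    orders: iterable of (rowtime, product_id, units) tuples in rowtime order.
    """
    history = []  # the "table" view of the stream seen so far
    for rowtime, product_id, units in orders:
        history.append((rowtime, product_id, units))
        window = [u for t, p, u in history
                  if p == product_id and t > rowtime - YEAR]
        if units > sum(window) / len(window):
            yield (rowtime, product_id, units)  # a delta on the output stream


orders = [
    (datetime(2017, 1, 1), 10, 2),
    (datetime(2017, 6, 1), 10, 4),
    (datetime(2018, 3, 1), 10, 3),
    (datetime(2018, 3, 2), 20, 5),
]
print(list(large_orders(orders)))
```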
Uses and Adoption
Uses in Research
● Polystore research – use as lightweight
heterogeneous data processing platform
● Optimization and query profiling –
general performance, and optimizer
research
● Reasoning over Streams, Graphs –
under consideration
● Open-source, production grade learning
and research platform
Future Work and Roadmap
● Support its use as a standalone engine – DDL, materialized views,
indexes and constraints.
● Improvements to the design and extensibility of the planner
(modularity, pluggability)
● Incorporation of new parametric approaches into the design of the
optimizer.
● Support for an extended set of SQL commands, functions, and
utilities, including full compliance with OpenGIS (spatial).
● New adapters for non-relational data sources such as array
databases.
● Improvements to performance profiling and instrumentation.
Thank you! Questions?
@ApacheCalcite
https://calcite.apache.org
https://arxiv.org/abs/1802.10233
Extra slides
Calcite framework
Cost, statistics
RelOptCost
RelOptCostFactory
RelMetadataProvider
• RelMdColumnUniqueness
• RelMdDistinctRowCount
• RelMdSelectivity
SQL parser
SqlNode
SqlParser
SqlValidator
Transformation rules
RelOptRule
• FilterMergeRule
• AggregateUnionTransposeRule
• 100+ more
Global transformations
• Unification (materialized view)
• Column trimming
• De-correlation
Relational algebra
RelNode (operator)
• TableScan
• Filter
• Project
• Union
• Aggregate
• …
RelDataType (type)
RexNode (expression)
RelTrait (physical property)
• RelConvention (calling-convention)
• RelCollation (sortedness)
• RelDistribution (partitioning)
RelBuilder
JDBC driver
Metadata
Schema
Table
Function
• TableFunction
• TableMacro
Lattice
Avatica
● Database connectivity
stack
● Self-contained sub-project
of Calcite
● Fast, open, stable
● Protobuf or JSON over
HTTP
● Powers Phoenix Query
Server
Lattice (optimized)
Column combinations and their sizes:
raw 1m
(z, s, g, y, m) 912k
(z, g, y, m) 909k
(z, s, y, m) 831k
(z, s, g, m) 644k
(z, s, g, y) 392k
(z, s, g) 83.6k
(z, s) 43.4k
(s, g, y, m) 6k
(g, y, m) 120
(y, m) 60
(g, m) 24
(g, y) 10
(z) 43k
(s) 50
(g) 2
(y) 5
(m) 12
() 1
Key
z zipcode (43k)
s state (50)
g gender (2)
y year (5)
m month (12)
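Several of the sizes on this slide can be checked with simple arithmetic. A combination of columns cannot have more rows than the product of its column cardinalities, nor more than the raw table. The sketch below computes that upper bound from the key; the function name max_rows is hypothetical, and this bound is only an illustration, not a description of how Calcite estimates sizes.

```python
from math import prod

# Column cardinalities and raw row count as given in the slide's key.
CARDINALITY = {"z": 43_000, "s": 50, "g": 2, "y": 5, "m": 12}
RAW_ROWS = 1_000_000


def max_rows(columns):
    """Upper bound on distinct combinations of the given columns."""
    return min(prod(CARDINALITY[c] for c in columns), RAW_ROWS)


for combo in ["", "gy", "gm", "ym", "gym", "sgym", "zs"]:
    print(tuple(combo), max_rows(combo))
```

For the small combinations the bound equals the listed size, for example 60 for (y, m) and 6,000 for (s, g, y, m). For (z, s) the bound is far above the listed 43.4k, which is what one would expect when columns are correlated, as zipcode and state are.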
Aggregation and windows on
streams
GROUP BY aggregates multiple rows into
sub-totals
➢ In regular GROUP BY each row contributes to
exactly one sub-total
➢ In multi-GROUP BY (e.g. HOP, GROUPING
SETS) a row can contribute to more than one
sub-total
Window functions (OVER) leave the number of
rows unchanged, but compute extra expressions
for each row (based on neighboring rows)
[Figure: diagrams of GROUP BY, multi-GROUP BY and window functions.]
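The difference between GROUP BY and window functions can be shown on a handful of rows. In this sketch, group_by_sum collapses rows into one sub-total per key, while running_sum_over keeps every input row and adds a running sum per key in arrival order. Both function names and the sample rows are hypothetical.

```python
def group_by_sum(rows):
    """GROUP BY key: each row contributes to exactly one sub-total."""
    totals = {}
    for key, value in rows:
        totals[key] = totals.get(key, 0) + value
    return totals


def running_sum_over(rows):
    """Window function: one output row per input row, plus an extra column."""
    totals = {}
    out = []
    for key, value in rows:
        totals[key] = totals.get(key, 0) + value
        out.append((key, value, totals[key]))
    return out


rows = [("a", 1), ("a", 3), ("b", 5)]
print(group_by_sum(rows))      # two sub-totals from three rows
print(running_sum_over(rows))  # still three rows
```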
Tumbling, hopping & session windows in SQL
Tumbling window
select stream … from Orders
group by floor(rowtime to hour)
select stream … from Orders
group by tumble(rowtime, interval '1' hour)
Hopping window
select stream … from Orders
group by hop(rowtime, interval '1' hour,
interval '2' hour)
Session window
select stream … from Orders
group by session(rowtime, interval '1' hour)
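The three window types differ in how many windows a row lands in. The sketch below assigns timestamps (in seconds) to windows, assuming that hop's two intervals are the slide and the window size, in that order, and that a session closes once the gap between consecutive rows reaches the given interval. The function names mirror the SQL but are plain Python helpers, not Calcite functions.

```python
HOUR = 3600  # seconds


def tumble(t, size=HOUR):
    """Start of the single fixed-size window containing t."""
    return t - t % size


def hop(t, slide=HOUR, size=2 * HOUR):
    """Starts of all windows [start, start + size) that contain t."""
    starts = []
    start = t - t % slide  # latest window start at or before t
    while start > t - size:
        starts.append(start)
        start -= slide
    return sorted(starts)


def sessions(times, gap=HOUR):
    """Group timestamps into sessions separated by at least `gap` of silence."""
    result = []
    for t in sorted(times):
        if result and t - result[-1][-1] < gap:
            result[-1].append(t)
        else:
            result.append([t])
    return result


print(tumble(5400))                           # one window
print(hop(5400))                              # two overlapping windows
print(sessions([0, 600, 1200, 9000, 9300]))   # two sessions
```

A row at 01:30 falls into one tumbling window but two hopping windows, which is why hop counts as a multi-GROUP BY.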
Controlling when data is emitted
Early emission is the defining
characteristic of a streaming query.
The emit clause is a SQL extension
inspired by Apache Beam’s “trigger”
notion. (Still experimental… and
evolving.)
A relational (non-streaming) query is
just a query with the most conservative
possible emission strategy.
select stream productId,
count(*) as c
from Orders
group by productId,
floor(rowtime to hour)
emit at watermark,
early interval '2' minute,
late limit 1;
select *
from Orders
emit when complete;
