Drill / SQL / Optiq
Julian Hyde
Apache Drill User Group
2013-03-13
SQL
SQL: Pros & cons
Fact:

  SQL is older than Macaulay Culkin

Less interesting but more relevant:

  Can be written by (lots of) humans

  Can be written by machines

  Requires query optimization

  Allows query optimization

  Based on “flat” relations and basic relational
  operations
Quick intro to Optiq
Introducing Optiq
Framework
Derived from LucidDB
Minimal query mediator:

    No storage

    No runtime

    No metadata

    Query planning engine

    Core operators & rewrite rules

    Optional SQL parser/validator
Expression tree

  SELECT p."product_name", COUNT(*) AS c
  FROM "splunk"."splunk" AS s
    JOIN "mysql"."products" AS p
    ON s."product_id" = p."product_id"
  WHERE s."action" = 'purchase'
  GROUP BY p."product_name"
  ORDER BY c DESC

[Diagram: scan (Splunk, table: splunk) and scan (MySQL, table: products) -> join (key: product_id) -> filter (condition: action = 'purchase') -> group (key: product_name, agg: count) -> sort (key: c DESC)]
Expression tree (optimized)

  SELECT p."product_name", COUNT(*) AS c
  FROM "splunk"."splunk" AS s
    JOIN "mysql"."products" AS p
    ON s."product_id" = p."product_id"
  WHERE s."action" = 'purchase'
  GROUP BY p."product_name"
  ORDER BY c DESC

[Diagram: scan (Splunk, table: splunk) -> filter (condition: action = 'purchase'), joined with scan (MySQL, table: products) -> join (key: product_id) -> group (key: product_name, agg: count) -> sort (key: c DESC)]
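
To make the difference between the two trees concrete, here is a minimal Python sketch. It is not Optiq code: the sample rows and the helper names (`join`, `is_purchase`, `group_and_sort`) are invented for illustration. It evaluates both plan shapes over in-memory lists and records how many Splunk rows reach the join in each case.

```python
from collections import Counter

# Hypothetical sample data standing in for the Splunk and MySQL tables.
splunk = [
    {"product_id": 1, "action": "purchase"},
    {"product_id": 1, "action": "view"},
    {"product_id": 2, "action": "purchase"},
    {"product_id": 1, "action": "purchase"},
    {"product_id": 2, "action": "view"},
]
products = [
    {"product_id": 1, "product_name": "donut"},
    {"product_id": 2, "product_name": "coffee"},
]


def join(left, right, key):
    """Naive nested-loop equi-join on `key`."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]


def is_purchase(row):
    return row["action"] == "purchase"


def group_and_sort(rows):
    """GROUP BY product_name, COUNT(*) AS c, ORDER BY c DESC."""
    counts = Counter(row["product_name"] for row in rows)
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)


def original_plan():
    # join -> filter -> group -> sort: the join sees every Splunk event.
    joined = join(splunk, products, "product_id")
    return group_and_sort([r for r in joined if is_purchase(r)]), len(splunk)


def optimized_plan():
    # filter -> join -> group -> sort: only purchases reach the join.
    purchases = [r for r in splunk if is_purchase(r)]
    joined = join(purchases, products, "product_id")
    return group_and_sort(joined), len(purchases)


result_a, rows_into_join_a = original_plan()
result_b, rows_into_join_b = optimized_plan()
```

Both plans return the same answer; only the amount of data flowing through the join differs, which is the property a rewrite rule has to preserve.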
Metadata SPI

    interface Table
    −   RelDataType getRowType()

    interface TableFunction
    −   List<Parameter> getParameters()
    −   Table apply(List arguments)
    −   e.g. ViewTableFunction

    interface Schema
     − Map<String, List<TableFunction>>
       getTableFunctions()
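
The shape of this SPI can be sketched in Python. This is a loose analogue of the three interfaces named on the slide, not Optiq's Java API: `ListTable`, `ConstantTableFunction`, `MapSchema` and `resolve_table` are hypothetical names, and modelling an ordinary table as a table function with no parameters is an assumption about how the pieces fit together.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Tuple

# A row type is modelled as (column name, SQL type name) pairs.
RowType = List[Tuple[str, str]]


class Table(ABC):
    @abstractmethod
    def get_row_type(self) -> RowType: ...


class TableFunction(ABC):
    @abstractmethod
    def get_parameters(self) -> List[str]: ...

    @abstractmethod
    def apply(self, arguments: List[Any]) -> Table: ...


class Schema(ABC):
    @abstractmethod
    def get_table_functions(self) -> Dict[str, List[TableFunction]]: ...


class ListTable(Table):
    """Hypothetical in-memory table."""

    def __init__(self, row_type: RowType, rows: List[tuple]):
        self._row_type = row_type
        self.rows = rows

    def get_row_type(self) -> RowType:
        return self._row_type


class ConstantTableFunction(TableFunction):
    """Hypothetical zero-argument function that always yields one table."""

    def __init__(self, table: Table):
        self._table = table

    def get_parameters(self) -> List[str]:
        return []

    def apply(self, arguments: List[Any]) -> Table:
        if arguments:
            raise ValueError("this table function takes no arguments")
        return self._table


class MapSchema(Schema):
    """Hypothetical schema backed by a dict of overloaded table functions."""

    def __init__(self, functions: Dict[str, List[TableFunction]]):
        self._functions = functions

    def get_table_functions(self) -> Dict[str, List[TableFunction]]:
        return self._functions


def resolve_table(schema: Schema, name: str) -> Table:
    """Pick the zero-parameter overload registered under `name`."""
    for function in schema.get_table_functions().get(name, []):
        if not function.get_parameters():
            return function.apply([])
    raise KeyError(name)


donuts = ListTable([("name", "VARCHAR"), ("ppu", "DOUBLE")], [("Cake", 0.55)])
schema = MapSchema({"donuts": [ConstantTableFunction(donuts)]})
table = resolve_table(schema, "donuts")
```

The point of the sketch is that the planner only needs row types and a way to look tables up by name; storage stays entirely on the adapter's side.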
Operators and rules

    Rule: interface RelOptRule

    Operator: interface RelNode

    Core operators: TableAccess, Project,
    Filter, Join, Aggregate, Order, Union,
    Intersect, Minus, Values

    Some rules: MergeFilterRule,
    PushAggregateThroughUnionRule,
    RemoveCorrelationForScalarProjectRule
    + 100 more
Planning algorithm

    Start with a logical plan and a set of
    rewrite rules

    Form a list of rewrite rules that are
    applicable (based on pattern-matching)

    Fire the rule that is likely to do the most
    good

    Rule generates an expression that is
    equivalent (and hopefully cheaper)

    Queue up new rule matches

    Repeat until cheap enough
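
The loop above can be sketched in a few lines of Python. This is a greedy simplification under stated assumptions, not Optiq's planner: plans are nested tuples, the cost model and the 0.5 filter selectivity are invented, rule matches are recomputed on every iteration instead of being queued, and the `status` predicate is a hypothetical second filter added so that both rules have something to do.

```python
# Plan nodes (hypothetical encoding):
#   ("scan", alias, row_count)
#   ("filter", predicates, child)      predicates: tuple of (alias, column, value)
#   ("join", key, left, right)

def aliases(node):
    if node[0] == "scan":
        return {node[1]}
    if node[0] == "filter":
        return aliases(node[2])
    return aliases(node[2]) | aliases(node[3])


def rows(node):
    """Estimated output rows (assumed: each predicate keeps half the rows)."""
    if node[0] == "scan":
        return node[2]
    if node[0] == "filter":
        return rows(node[2]) * 0.5 ** len(node[1])
    return rows(node[2])  # join: assume one match per left row


def cost(node):
    """Rows read by every operator in the tree."""
    if node[0] == "scan":
        return node[2]
    if node[0] == "filter":
        return rows(node[2]) + cost(node[2])
    return rows(node[2]) + rows(node[3]) + cost(node[2]) + cost(node[3])


def push_filter_below_join(node):
    if node[0] == "filter" and node[2][0] == "join":
        _, key, left, right = node[2]
        if all(p[0] in aliases(left) for p in node[1]):
            return ("join", key, ("filter", node[1], left), right)
    return None


def merge_filters(node):
    if node[0] == "filter" and node[2][0] == "filter":
        return ("filter", node[1] + node[2][1], node[2][2])
    return None


RULES = [push_filter_below_join, merge_filters]


def rewrites(node):
    """Yield (rule name, new plan) for every rule match anywhere in the tree."""
    for rule in RULES:
        new = rule(node)
        if new is not None:
            yield rule.__name__, new
    if node[0] == "filter":
        for name, child in rewrites(node[2]):
            yield name, ("filter", node[1], child)
    elif node[0] == "join":
        for name, left in rewrites(node[2]):
            yield name, ("join", node[1], left, node[3])
        for name, right in rewrites(node[3]):
            yield name, ("join", node[1], node[2], right)


def optimize(plan, cheap_enough=0):
    fired = []
    while cost(plan) > cheap_enough:
        candidates = list(rewrites(plan))
        if not candidates:
            break
        name, best = min(candidates, key=lambda c: cost(c[1]))
        if cost(best) >= cost(plan):
            break  # no applicable rule makes the plan cheaper
        fired.append(name)
        plan = best
    return plan, fired


logical = ("filter", (("s", "action", "purchase"),),
           ("filter", (("s", "status", "ok"),),
            ("join", "product_id", ("scan", "s", 1000), ("scan", "p", 10))))
best, fired = optimize(logical)
```

A real planner would keep the equivalent alternatives around rather than committing to one plan per step; the greedy version only illustrates the match, fire, re-cost cycle.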
Concepts

    Cost

    Equivalence sets

    Calling convention

    Logical vs Physical

    Traits

    Implementation
Outside the kernel

    SQL parser/validator

    JDBC driver

    SQL function library (validation + code-generation)

    Lingual (Cascading adapter)

    Splunk adapter

    Drill adapter

[Diagram: JDBC client; JDBC server; optional SQL parser/validator; metadata SPI; core query planner; pluggable rules; 3rd party ops (two boxes); 3rd party data (two boxes)]
Optiq roadmap

    Building blocks for analytic DB:
    −   In-memory tables in a distributed cache
    −   Materialized views
    −   Partitioned tables

    Faster planning

    Easier rule development

    ODBC driver

    Adapters for XXX, YYY
Applying Optiq to Drill
  1. Enhance SQL
  2. Query translation
Drill vs Traditional SQL

    SQL:
    −   Flat data
    −   Schema up front

    Drill:
    −   Nested data (list & map)
    −   No schema

    We'd like to write:
    −   SELECT name, toppings[2] FROM donuts
        WHERE ppu > 0.6

    Solution: ARRAY, MAP, ANY types
ARRAY & MAP SQL types

  ARRAY is like java.util.List

  MAP is like java.util.LinkedHashMap
Examples:

  VALUES ARRAY ['a', 'b', 'c']

  VALUES MAP ['Washington', 1, 'Obama', 44]

  SELECT name, address[1], address[2], state
  FROM Employee

  SELECT * FROM Donuts WHERE
  CAST(donuts['ppu'] AS DOUBLE) > 0.6
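
For readers more at home in a scripting language, the semantics of these examples can be mimicked in Python. This is only an analogy, not how Optiq implements the types: `sql_array_item` and `sql_map` are hypothetical helpers, the sample rows are invented, and returning `None` for an out-of-range index is an assumption made here.

```python
def sql_array_item(array, index):
    """ARRAY element access. SQL arrays are 1-based, Python lists 0-based."""
    if 1 <= index <= len(array):
        return array[index - 1]
    return None  # assumption: treat an out-of-range index as NULL


def sql_map(*items):
    """MAP ['k1', v1, 'k2', v2, ...] as an insertion-ordered dict."""
    if len(items) % 2:
        raise ValueError("MAP needs an even number of items")
    return dict(zip(items[0::2], items[1::2]))


letters = ["a", "b", "c"]                            # VALUES ARRAY ['a', 'b', 'c']
presidents = sql_map("Washington", 1, "Obama", 44)   # VALUES MAP [...]

# SELECT * FROM Donuts WHERE CAST(donuts['ppu'] AS DOUBLE) > 0.6
donuts_table = [
    {"donuts": {"name": "Cake", "ppu": "0.55"}},
    {"donuts": {"name": "Filled", "ppu": "0.75"}},
]
selected = [row for row in donuts_table if float(row["donuts"]["ppu"]) > 0.6]
```

Python dicts keep insertion order, which is the same property that makes `java.util.LinkedHashMap` the comparison on the slide.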
ANY SQL type

    ANY means “type to be determined at runtime”

    Validator narrows down possible type based
    on operators used

    Similar to converting Java's type system into
    JavaScript's. (Not easy.)
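
One way to picture the narrowing step is as set intersection over candidate types. The sketch below is purely illustrative: the slide does not say how the validator works, and the type names, the usage labels in `USAGE_ALLOWS`, and the `narrow` function are all hypothetical.

```python
ALL_TYPES = frozenset({"ARRAY", "MAP", "VARCHAR", "INTEGER", "DOUBLE", "BOOLEAN"})

# Hypothetical constraints: the operand types each kind of usage permits.
USAGE_ALLOWS = {
    "item_by_string_key": {"MAP"},        # c['ppu']
    "item_by_integer_index": {"ARRAY"},   # c[1]
    "compare_with_number": {"INTEGER", "DOUBLE"},
    "concat_with_string": {"VARCHAR"},
}


def narrow(usages):
    """Intersect the candidate types allowed by every observed usage."""
    candidates = set(ALL_TYPES)
    for usage in usages:
        candidates &= USAGE_ALLOWS[usage]
    return candidates


unused = narrow([])                              # still ANY
as_map = narrow(["item_by_string_key"])          # must be a MAP
numeric = narrow(["compare_with_number"])        # INTEGER or DOUBLE
conflict = narrow(["item_by_string_key", "compare_with_number"])
```

An empty result, as in `conflict`, would correspond to an expression the validator can reject before the query runs; anything left ambiguous has to be resolved at runtime.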
Sugaring the donut
Query:

  SELECT c['ppu'], c['toppings'][1] FROM
  Donuts
Additional syntactic sugar:

  c.x means c['x']
So:

  CREATE TABLE Donuts(c ANY)

  SELECT c.ppu, c.toppings[1] FROM
  Donuts
Better:
UNNEST
Employees nested inside departments:

  CREATE TYPE employee (empno INT, name
  VARCHAR(30));

  CREATE TABLE dept (deptno INT, name
  VARCHAR(30),
   employees EMPLOYEE ARRAY);
Unnest:

  SELECT d.deptno, d.name, e.empno, e.name
  FROM dept AS d
   CROSS JOIN UNNEST(d.employees) AS e
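
The effect of the UNNEST query can be mimicked with a nested loop. This is a minimal Python sketch with invented sample rows; `cross_join_unnest` is a hypothetical helper, not part of Optiq or Drill.

```python
# Hypothetical sample data: employees nested inside departments.
departments = [
    {"deptno": 10, "name": "Sales",
     "employees": [{"empno": 100, "name": "Fred"},
                   {"empno": 110, "name": "Eric"}]},
    {"deptno": 20, "name": "Marketing", "employees": []},
]


def cross_join_unnest(rows, array_column):
    """Pair each outer row with every element of its nested array."""
    for outer in rows:
        for element in outer[array_column]:
            yield outer, element


result = [
    (d["deptno"], d["name"], e["empno"], e["name"])
    for d, e in cross_join_unnest(departments, "employees")
]
```

Note that the department with an empty `employees` array contributes no rows, as expected for a cross join against an empty collection.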
SQL standard provides other operations on
  collections:
Applying Optiq to Drill
  1. Enhance SQL
  2. Query translation
Query translation
SQL:

  select d['name'] as name, d['xx'] as xx
  from (
    select _MAP['donuts'] as d from donuts)
  where cast(d['ppu'] as double) > 0.6

Drill:

  { head: { … },
    storage: { … },
    query: [ {
      op: "sequence", do: [
        { op: "scan", … selection: { path:
Planner log
Original rel:
AbstractConverter(subset=[rel#14:Subset#3.ARRAY], convention=[ARRAY])
  ProjectRel(subset=[rel#10:Subset#3.NONE], NAME=[ITEM($0, 'name')], XX=[ITEM($0, 'xx')])
    FilterRel(subset=[rel#8:Subset#2.NONE], condition=[>(CAST(ITEM($0, 'ppu')):DOUBLE NOT NULL, 0.6)])
      ProjectRel(subset=[rel#6:Subset#1.NONE], D=[ITEM($0, 'donuts')])
        DrillScan(subset=[rel#4:Subset#0.DRILL],
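
Read bottom-up, the operator tree in the log is a small pipeline: scan, project, filter, project. The Python sketch below mirrors that shape with generators over invented sample rows. The function names are hypothetical, the `_MAP` column layout is assumed from the SQL on the previous slide, and the top-level AbstractConverter is not modelled.

```python
# Hypothetical rows: each record exposes its content through a _MAP column.
donuts = [
    {"_MAP": {"donuts": {"name": "Cake", "ppu": "0.55", "xx": 1}}},
    {"_MAP": {"donuts": {"name": "Filled", "ppu": "0.75", "xx": 2}}},
]


def item(value, key):
    """ITEM(value, key): look a key up in a map, None if it is absent."""
    return value.get(key) if isinstance(value, dict) else None


def scan(table):
    # DrillScan
    yield from table


def project_d(rows):
    # ProjectRel: D = ITEM($0, 'donuts')
    for row in rows:
        yield (item(row["_MAP"], "donuts"),)


def filter_ppu(rows):
    # FilterRel: CAST(ITEM($0, 'ppu') AS DOUBLE) > 0.6
    for (d,) in rows:
        if float(item(d, "ppu")) > 0.6:
            yield (d,)


def project_name_xx(rows):
    # ProjectRel: NAME = ITEM($0, 'name'), XX = ITEM($0, 'xx')
    for (d,) in rows:
        yield (item(d, "name"), item(d, "xx"))


result = list(project_name_xx(filter_ppu(project_d(scan(donuts)))))
```

Each generator consumes the output of the one below it, which is the same data-flow reading that the translation to Drill's `sequence` of operators relies on.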
Next

    Translate join, aggregate, sort, set ops

    Operator overloading with ANY

    Mondrian on Drill
Thank you!


https://github.com/julianhyde/share/tree/master/slides
https://github.com/julianhyde/incubator-drill
https://github.com/julianhyde/optiq
http://incubator.apache.org/drill

@julianhyde


Editor's Notes

  • #8, #9: It's much more efficient if we push filters and aggregations down to Splunk, but the user writing SQL shouldn't have to worry about that. This is not about processing data; it is about processing expressions, that is, reformulating the question. The question is the parse tree of a query, and the parse tree is a data flow. In Splunk, a data flow looks like a pipeline of Linux commands. SQL systems have pipelines too (sometimes they are dataflow trees), built up from the basic relational operators. Think of the SQL SELECT, WHERE, JOIN, GROUP BY, ORDER BY clauses.