© Copyright 2015 Pivotal. All rights reserved.
GPORCA: Query Optimization as a Service
Venkatesh Raghavan
Apache HAWQ Nest
Outline
• Motivation
• Introduction to GPORCA (a.k.a. Orca)
• How to add an Orca feature?
• How to enable Orca on Apache HAWQ?
Why a New Optimizer?
• Query optimization is key to performance
• Legacy Planner was not initially designed with distributed data processing in mind
• Average time to fix customer issues:
– Legacy Optimizer (Planner) ~ 70 days
– Pivotal Query Optimizer (Orca) ~ 13 days
Legacy Query Planner
• Technology of the 1990s
• Addresses join re-ordering
• Treats everything else as an add-on (grouping, WITH clause, etc.)
• Imposes order on specific optimization steps
• Recursively descends into subqueries
– Cannot unnest complex correlated subqueries
• High code complexity
– Maximum: 102 (Orca: 8.5)
– Minimum: 6.4 (Orca: 1.5)
Join Ordering vs. “Everything Else”
• TPC-H Query 5
– 6 Tables
– “Harmless” query
• Join order problem: size of search space < 100,000
• “Everything else”: size of search space ~230,000,000
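The slide does not say how the join-order figure was counted. As a rough sanity check, the sketch below uses the textbook counts for n tables: n! left-deep join orders and (2(n-1))!/(n-1)! bushy join trees, with no pruning of cross products. The function names are illustrative and not part of Orca. For the six tables of TPC-H Query 5, both counts stay well below 100,000.

```cpp
#include <cstdint>

// Number of left-deep join orders over n tables: n!
std::uint64_t CountLeftDeepJoinOrders(unsigned n) {
    std::uint64_t result = 1;
    for (unsigned k = 2; k <= n; ++k) result *= k;
    return result;
}

// Number of bushy join trees over n tables (child order matters):
// (2(n-1))! / (n-1)!, computed as the product n * (n+1) * ... * 2(n-1).
std::uint64_t CountBushyJoinTrees(unsigned n) {
    if (n == 0) return 0;  // no tables, nothing to join
    std::uint64_t result = 1;
    for (unsigned k = n; k <= 2 * (n - 1); ++k) result *= k;
    return result;
}
```

Calling `CountLeftDeepJoinOrders(6)` gives 720 and `CountBushyJoinTrees(6)` gives 30,240.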
What Is GP-Orca?
• State-of-the-art query optimization framework designed from scratch to
– Improve: performance, ease of use
– Enable: foundation for future research and development
– Connect: applies to multiple host systems (GPDB and HAWQ)
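Modularity
• Orca is not baked into one host system
[Figure: in the host system, the Parser passes SQL to Q2DXL, which sends a DXL Query to Orca. Orca sends metadata requests (MDreq.) to the MD Provider, which reads the Catalog and returns DXL MD. Orca returns a DXL Plan, which DXL2Plan converts for the Executor, which produces Results.]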
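A minimal sketch of the modularity idea, assuming the optimizer sees the host system only through a serialized query and an abstract metadata interface. All names here (`MetadataProvider`, `InMemoryCatalogProvider`, `OptimizeToDxlPlan`) and the XML-like strings are hypothetical stand-ins, not Orca's actual classes or the real DXL format.

```cpp
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical interface: the only way the optimizer reaches the catalog.
class MetadataProvider {
public:
    virtual ~MetadataProvider() = default;
    // Returns a serialized description of a catalog object.
    virtual std::string Lookup(const std::string& objectName) const = 0;
};

// One host system's provider, backed here by an in-memory map.
class InMemoryCatalogProvider : public MetadataProvider {
public:
    explicit InMemoryCatalogProvider(std::map<std::string, std::string> catalog)
        : catalog_(std::move(catalog)) {}

    std::string Lookup(const std::string& objectName) const override {
        auto it = catalog_.find(objectName);
        if (it == catalog_.end())
            throw std::out_of_range("unknown object: " + objectName);
        return it->second;
    }

private:
    std::map<std::string, std::string> catalog_;
};

// Stand-in for the optimizer: it receives a table name (in place of a full
// DXL query) and a provider, and returns a serialized plan. It never touches
// host-system data structures directly.
std::string OptimizeToDxlPlan(const std::string& table, const MetadataProvider& md) {
    return "<Plan><Scan table=\"" + table + "\" md=\"" + md.Lookup(table) + "\"/></Plan>";
}
```

Under this assumption, a second host system would supply its own `MetadataProvider` subclass and leave the optimizer code unchanged.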
Key Features
• Smarter partition elimination
• Subquery unnesting
• Common table expressions (CTE)
• Additional Functionality
– Improved join ordering
– Join-Aggregate reordering
– Sort order optimization
– Skew aware
Not Yet Feature Complete
• Improve Performance for Short Running Queries
• External parameters
• Cubes
• Multiple grouping sets
• Inverse distribution functions
• Ordered aggregates
• Catalog Queries
Currently in Apache HAWQ
[Figure: SQL goes to the Parser; the parsed Query goes to either Orca or the Planner; the resulting Plan goes to the Executor, which returns Results.]
When Orca Is Exercised
[Figure: SQL goes to the Parser; the Query goes to Orca; Orca's Plan goes to the Executor, which returns Results. The Planner is not used.]
When Orca Falls Back
[Figure: SQL goes to the Parser; the Query goes to Orca, which falls back to the Planner; the Planner's Plan goes to the Executor, which returns Results.]
Orca will automatically fall back to the legacy optimizer for unsupported features
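The fallback path can be pictured as a try/catch around the optimizer call. This is only a sketch: the exception type, the function names, and the naive check for the word CUBE (cubes are listed as not yet supported) are assumptions made for illustration, not how HAWQ detects unsupported features.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical signal that the optimizer cannot handle a query.
struct UnsupportedFeature : std::runtime_error {
    using std::runtime_error::runtime_error;
};

struct PlanResult {
    std::string plan;
    std::string producedBy;  // "orca" or "planner"
};

// Stand-in for Orca: rejects queries that mention CUBE.
std::string PlanWithOrca(const std::string& query) {
    if (query.find("CUBE") != std::string::npos)
        throw UnsupportedFeature("cubes are not supported");
    return "orca-plan(" + query + ")";
}

// Stand-in for the legacy Planner: accepts everything.
std::string PlanWithLegacyPlanner(const std::string& query) {
    return "planner-plan(" + query + ")";
}

// Try Orca first; on an unsupported feature, fall back to the Planner.
PlanResult PlanQuery(const std::string& query) {
    try {
        return {PlanWithOrca(query), "orca"};
    } catch (const UnsupportedFeature&) {
        return {PlanWithLegacyPlanner(query), "planner"};
    }
}
```

Either way the caller receives a plan, so the fallback is invisible to the user apart from which optimizer produced it.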
Subquery Unnesting
Subqueries: Definition
• A subquery is a query that is nested inside an outer query block
• A correlated subquery (CSQ) is a subquery that uses values from the outer query
SELECT * FROM part p1
WHERE price >
(SELECT avg(price) FROM part p2 WHERE p2.brand = p1.brand)
Subqueries: Impact
• Heavily used in many workloads
– BI/Reporting tools generate a substantial number of subqueries
– TPC-H workload: 40% of the 22 queries
– TPC-DS workload: 20% of the 111 queries
• Inefficient plans mean a query takes a long time or does not terminate
• Optimizations
– De-correlation
– Conversion of subqueries to joins
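To illustrate the two optimizations listed above, the sketch below evaluates "parts priced above their brand's average" in two ways over an in-memory table. The first re-runs the inner aggregate for every outer row (correlated execution). The second aggregates once per brand and then joins each row against the result. The `Part` struct and function names are made up for the example; this is not Orca code.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

struct Part {
    std::string brand;
    double price;
};

// Correlated execution: for each outer row, rescan the table to compute the
// average price of that row's brand. Cost grows quadratically with row count.
std::vector<Part> AboveBrandAvgCorrelated(const std::vector<Part>& parts) {
    std::vector<Part> out;
    for (const Part& p1 : parts) {
        double sum = 0.0;
        int count = 0;
        for (const Part& p2 : parts) {
            if (p2.brand == p1.brand) { sum += p2.price; ++count; }
        }
        // count >= 1 because p1 always matches its own brand.
        if (p1.price > sum / count) out.push_back(p1);
    }
    return out;
}

// De-correlated form: GROUP BY brand once, then join each row with its
// brand's aggregate and apply the filter.
std::vector<Part> AboveBrandAvgDecorrelated(const std::vector<Part>& parts) {
    std::map<std::string, std::pair<double, int>> perBrand;  // brand -> (sum, count)
    for (const Part& p : parts) {
        auto& entry = perBrand[p.brand];
        entry.first += p.price;
        ++entry.second;
    }
    std::vector<Part> out;
    for (const Part& p : parts) {
        const auto& entry = perBrand.at(p.brand);
        if (p.price > entry.first / entry.second) out.push_back(p);
    }
    return out;
}
```

Both functions return the same rows; the second does one aggregation pass and one lookup per row instead of a full rescan per row.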
Subqueries in Disjunctive Filters
• Find parts with: size > 40 OR price > the average brand price
SELECT *
FROM part p1
WHERE p_size > 40 OR
p_retailprice >(SELECT avg(p_retailprice)
FROM part p2
WHERE p2.p_brand = p1.p_brand)
Subquery Handling: Orca vs. Planner
CSQ Class                            Planner               Orca
CSQ in select list                   Correlated Execution  Join
CSQ in disjunctive filter            Correlated Execution  Join
Multi-Level CSQ                      No Plan               Join
CSQ with group by and inequality     Correlated Execution  Join
CSQ must return one row              Correlated Execution  Join
CSQ with correlation in select list  Correlated Execution  Correlated Execution
TPC-DS Orca vs. Planner
TPC-DS 10TB, 16 nodes, 48 GB/node
How to add an Orca feature?
Optimization Steps in Orca
• Step 1: Pre-Process Input Logical Expression
– Apply heuristics such as pushing selects down
• Step 2: Exploration (via Transforms)
– Generate all equivalent logical plans
• Step 3: Implementation (via Transforms)
– Generate all physical implementations for all logical operators
• Step 4: Optimization
– Enforce distribution and ordering requirements and pick the cheapest plan
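Step 4 can be pictured with a toy cost comparison. This sketch assumes each physical alternative has a cost and either delivers a required sort order or needs an extra sort "enforcer" on top. The struct names, the operator names, and the costs are invented for illustration; the real optimizer also handles distribution requirements and works over a much larger space.

```cpp
#include <limits>
#include <stdexcept>
#include <string>
#include <vector>

struct PhysicalAlternative {
    std::string name;
    double cost;
    bool deliversSortOrder;  // true if the output is already sorted as required
};

struct Choice {
    std::string name;
    double totalCost;
};

// Add the enforcer cost to alternatives that miss the requirement,
// then return the cheapest total.
Choice PickCheapest(const std::vector<PhysicalAlternative>& alternatives,
                    bool sortRequired, double sortEnforcerCost) {
    if (alternatives.empty()) throw std::invalid_argument("no alternatives");
    Choice best{"", std::numeric_limits<double>::infinity()};
    for (const PhysicalAlternative& alt : alternatives) {
        double total = alt.cost;
        if (sortRequired && !alt.deliversSortOrder) total += sortEnforcerCost;
        if (total < best.totalCost) best = Choice{alt.name, total};
    }
    return best;
}
```

With a hash-based alternative at cost 10 (unsorted) and a stream-based one at cost 14 (sorted), a sort requirement with enforcer cost 6 makes the stream-based plan cheaper overall, while without the requirement the hash-based plan wins.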
Let’s Pair
• Split an aggregate into a pair of local and global aggregates.
• Schema: CREATE TABLE foo (a int, b int, c int) distributed by (a);
• Query: SELECT sum(c) FROM foo GROUP BY b
• Do local aggregation on each segment and then send the partial aggregates to the master.
• The final aggregation can then be done on the master.
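The effect of the split can be simulated outside the database. In this sketch each segment holds some rows of foo, computes sum(c) per b locally, and the partial results are merged in a final step. `Row`, `GroupSums`, and the two function names are illustrative only.

```cpp
#include <map>
#include <vector>

struct Row {
    int a;
    int b;
    int c;
};

using GroupSums = std::map<int, long long>;  // b -> sum(c)

// Local aggregate: runs on one segment over that segment's rows only.
GroupSums LocalAggregate(const std::vector<Row>& segmentRows) {
    GroupSums sums;
    for (const Row& row : segmentRows) sums[row.b] += row.c;
    return sums;
}

// Global aggregate: merges the partial sums received from all segments.
GroupSums GlobalAggregate(const std::vector<GroupSums>& partials) {
    GroupSums total;
    for (const GroupSums& partial : partials) {
        for (const auto& entry : partial) total[entry.first] += entry.second;
    }
    return total;
}
```

Only one partial row per group and segment crosses the network, rather than every input row, which is the motivation for splitting the aggregate.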
CXformSplitGbAgg
// HEADER FILES
~/orca/libgpopt/include/gpopt/xforms
// SOURCE FILES
~/orca/libgpopt/src/xforms
Define What Will Trigger This Transformation
• Pattern
• Pre-Condition Check
Pattern
GPOS_NEW(pmp) CExpression
(
    pmp,
    // logical aggregate operator
    GPOS_NEW(pmp) CLogicalGbAgg(pmp),
    // relational child
    GPOS_NEW(pmp) CExpression(pmp, GPOS_NEW(pmp) CPatternLeaf(pmp)),
    // scalar project list
    GPOS_NEW(pmp) CExpression(pmp, GPOS_NEW(pmp) CPatternTree(pmp))
)
Pre-Condition Check
Common-sense rules, such as: do not fire this rule on a logical operator that was itself produced by the same rule (this avoids infinite recursion).
// Compatibility function for splitting aggregates
virtual
BOOL FCompatible(CXform::EXformId exfid)
{
    return (CXform::ExfSplitGbAgg != exfid);
}
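Why the compatibility check matters can be seen in a toy exploration loop. The sketch assumes each expression remembers which rule produced it; the `XformId` enum, the `Expr` struct, and the worklist driver are hypothetical and much simpler than Orca's search engine. Splitting an aggregate yields two new aggregates, so without the check the loop would keep splitting its own output and never terminate.

```cpp
#include <string>
#include <vector>

// Hypothetical rule identifiers; None marks expressions from the input query.
enum class XformId { None, SplitGbAgg };

struct Expr {
    std::string op;      // operator name, e.g. "GbAgg"
    XformId producedBy;  // rule that generated this expression
};

// Same idea as FCompatible: the split rule refuses its own output.
bool SplitGbAggCompatible(XformId producedBy) {
    return producedBy != XformId::SplitGbAgg;
}

// Apply the split rule to every eligible aggregate on the worklist and
// return how many times it fired.
int ExploreAggregates(std::vector<Expr> worklist) {
    int firings = 0;
    while (!worklist.empty()) {
        Expr current = worklist.back();
        worklist.pop_back();
        if (current.op != "GbAgg" || !SplitGbAggCompatible(current.producedBy)) continue;
        ++firings;
        // The split yields a local and a global aggregate, both tagged
        // with the rule that produced them.
        worklist.push_back({"GbAgg", XformId::SplitGbAgg});
        worklist.push_back({"GbAgg", XformId::SplitGbAgg});
    }
    return firings;
}
```

Starting from a single aggregate taken from the input query, the rule fires exactly once and the loop ends.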
The Actual Transformation
void Transform
(
    CXformContext *pxfctxt,
    CXformResult *pxfres,
    CExpression *pexpr
)
const;
Register Transformation
void CXformFactory::Instantiate()
{
    // ...
    Add(GPOS_NEW(m_pmp) CXformSplitGbAgg(m_pmp));
    // ...
}
How to enable Orca on HAWQ?
Get Your Hands On It!
• GPORCA: https://github.com/greenplum-db/gporca
• White Paper: bit.ly/1ntrE8v
• Pivotal Tracker: bit.ly/1m1WGDn
Getting Orca on Apache HAWQ
• Step 1: Get the CentOS 7 Docker image
https://github.com/xinzweb/hawq-devel-env/blob/master/README.md
• Step 2: Update CMake to 3.4.3
• Step 3: Install GPORCA
– Install Xerces 3.1.2
– Install GPOS
• Step 4: Apply my changes to the Apache HAWQ Makefile
https://github.com/vraghavan78/incubator-hawq/tree/fix-enable-orca
• Step 5: Compile HAWQ with Orca enabled
./configure --enable-orca --with-perl --with-python --with-libxml
• Step 6: You may have to copy the libraries to the different segments, for example:
scp -r . gpadmin@centos7-datanode1:/usr/local/lib
Publications
• Optimization of Common Table Expressions in MPP Database Systems, VLDB 2015
– Amr El-Helw, Venkatesh Raghavan, Mohamed A. Soliman, George C. Caragea, Zhongxian Gu, Michalis Petropoulos
• Orca: A Modular Query Optimizer Architecture for Big Data, SIGMOD 2014
– Mohamed A. Soliman, Lyublena Antova, Venkatesh Raghavan, Amr El-Helw, Zhongxian Gu, Entong Shen, George C. Caragea, Carlos Garcia-Alvarado, Foyzur Rahman, Michalis Petropoulos, Florian Waas, Sivaramakrishnan Narayanan, Konstantinos Krikellas, Rhonda Baldwin
• Optimizing Queries over Partitioned Tables in MPP Systems, SIGMOD 2014
– Lyublena Antova, Amr El-Helw, Mohamed Soliman, Zhongxian Gu, Michalis Petropoulos, Florian Waas
• Reversing Statistics for Scalable Test Databases Generation, DBTest 2013
– Entong Shen, Lyublena Antova
• Total Operator State Recall - Cost-Effective Reuse of Results in Greenplum Database, ICDE Workshops 2013
– George C. Caragea, Carlos Garcia-Alvarado, Michalis Petropoulos, Florian M. Waas
• Testing the Accuracy of Query Optimizers, DBTest 2012
– Zhongxian Gu, Mohamed A. Soliman, Florian M. Waas
• Automatic Capture of Minimal, Portable, and Executable Bug Repros using AMPERe, DBTest 2012
– Lyublena Antova, Konstantinos Krikellas, Florian M. Waas
• Automatic Data Placement in MPP Databases, ICDE Workshops 2012
– Carlos Garcia-Alvarado, Venkatesh Raghavan, Sivaramakrishnan Narayanan, Florian M. Waas
