GPORCA: Query Optimization as a Service

© Copyright 2015 Pivotal. All rights reserved.
Venkatesh Raghavan
Apache HAWQ Nest

Outline
• Motivation
• Introduction to GPORCA (a.k.a. Orca)
• How to add an Orca feature?
• How to enable Orca on Apache HAWQ?

Why a New Optimizer?
• Query optimization is key to performance
• Legacy Planner was not initially designed with distributed
data processing in mind
• Average time to fix customer issues:
– Legacy Optimizer (Planner) ~ 70 days
– Pivotal Query Optimizer (Orca) ~ 13 days

Legacy Query Planner
• Technology of the 90’s
• Addresses Join re-ordering
• Treats everything else as add-on (grouping, with clause, etc.)
• Imposes order on specific optimization steps
• Recursively descends into sub-queries
– Cannot Unnest Complex Correlated Sub Queries
• High Code Complexity
– Maximum: 102 for the Planner (8.5 for Orca)
– Minimum: 6.4 for the Planner (1.5 for Orca)

Join Ordering vs. “Everything Else”
• TPC-H Query 5
– 6 Tables
– “Harmless” query
• Join order problem: size of search space < 100,000
• “Everything else”: size of search space ~230,000,000

What Is GPORCA?
• State-of-the-art query optimization framework designed from
scratch to
– Improve – performance, ease-of-use
– Enable – foundation for future research and development
– Connect – applies to multiple host systems (GPDB and HAWQ)

Modularity
• Orca is not baked into one host system
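[Figure: Orca sits outside the host system and exchanges DXL with it.
Query path: SQL → Parser → Query → Q2DXL → DXL Query → Orca.
Metadata path: Orca → MD req. → MD Provider (reads the Catalog) → DXL MD → Orca.
Plan path: Orca → DXL Plan → DXL2Plan → Plan → Executor → Results.]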

Key Features
• Smarter partition elimination
• Subquery unnesting
• Common table expressions (CTE)
• Additional Functionality
– Improved join ordering
– Join-Aggregate reordering
– Sort order optimization
– Skew aware

Not Yet Feature Complete
• Improve Performance for Short Running Queries
• External parameters
• Cubes
• Multiple grouping sets
• Inverse distribution functions
• Ordered aggregates
• Catalog Queries

Currently in Apache HAWQ
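[Figure: query flow from SQL through the Parser, the optimizer (Planner or Orca), and the Executor to Results; the optimizer receives a Query and emits a Plan.]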

When Orca is exercised
[Figure: SQL → Parser → Query → Orca → Plan → Executor → Results; the Planner is not used.]

When Orca falls back
[Figure: SQL → Parser → Query → Orca; Orca falls back to the Planner, which produces the Plan for the Executor, which returns Results.]
Orca will automatically fall back to the legacy optimizer (Planner)
for unsupported features
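
The fallback behavior described above can be sketched roughly as follows. This is only an illustration: `Query`, `Plan`, `OptimizeWithOrca`, `OptimizeWithPlanner`, and `Optimize` are hypothetical names, not HAWQ or Orca APIs, and the unsupported-feature list is an assumption borrowed from the "Not Yet Feature Complete" slide.

```cpp
#include <set>
#include <string>

// Hypothetical stand-ins for illustration; not HAWQ or Orca types.
struct Query {
    std::string sql;
    std::set<std::string> features;  // SQL features the query uses
};

struct Plan {
    std::string producedBy;  // "orca" or "planner"
};

// Assumed list of features that make Orca decline a query.
inline const std::set<std::string>& UnsupportedByOrca() {
    static const std::set<std::string> unsupported = {
        "cube", "grouping sets", "ordered aggregate"};
    return unsupported;
}

// Returns true and fills *plan only if Orca supports every feature used.
inline bool OptimizeWithOrca(const Query& q, Plan* plan) {
    for (const std::string& f : q.features) {
        if (UnsupportedByOrca().count(f) > 0) return false;
    }
    plan->producedBy = "orca";
    return true;
}

inline Plan OptimizeWithPlanner(const Query&) {
    Plan plan;
    plan.producedBy = "planner";
    return plan;
}

// Try Orca first; fall back to the legacy planner when Orca declines.
inline Plan Optimize(const Query& q) {
    Plan plan;
    if (OptimizeWithOrca(q, &plan)) return plan;
    return OptimizeWithPlanner(q);
}
```

The point of the shape is that the caller never needs to know which optimizer produced the plan; the executor receives a plan either way.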

Subquery Unnesting

Subqueries: Definition
• A subquery is a query that is nested inside an outer query block
• A correlated subquery (CSQ) is a subquery that uses values from the outer query
SELECT * FROM part p1
WHERE price >
  (SELECT avg(price) FROM part p2 WHERE p2.brand = p1.brand)
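
To make the cost of correlation concrete, here is a small sketch that evaluates the query above in two ways: correlated execution (the subquery is re-run for every outer row) and a decorrelated form (aggregate once per brand, then join back). The `Part` struct and both function names are made up for this illustration and say nothing about how Orca or the Planner represent plans internally.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical in-memory stand-in for the part table.
struct Part {
    std::string brand;
    double price;
};

// Correlated execution: re-evaluate the subquery for every outer row.
std::vector<Part> CorrelatedExecution(const std::vector<Part>& parts) {
    std::vector<Part> result;
    for (const Part& p1 : parts) {
        double sum = 0.0;
        int count = 0;
        for (const Part& p2 : parts) {  // full inner scan per outer row
            if (p2.brand == p1.brand) {
                sum += p2.price;
                ++count;
            }
        }
        // count >= 1 because p1 itself matches its own brand.
        if (p1.price > sum / count) result.push_back(p1);
    }
    return result;
}

// Decorrelated form: aggregate once per brand, then join back on brand.
std::vector<Part> JoinExecution(const std::vector<Part>& parts) {
    std::map<std::string, std::pair<double, int> > perBrand;  // brand -> (sum, count)
    for (const Part& p : parts) {
        std::pair<double, int>& acc = perBrand[p.brand];
        acc.first += p.price;
        acc.second += 1;
    }
    std::vector<Part> result;
    for (const Part& p1 : parts) {
        const std::pair<double, int>& acc = perBrand.at(p1.brand);
        if (p1.price > acc.first / acc.second) result.push_back(p1);
    }
    return result;
}
```

Both functions return the same rows, but the first scans the table once per outer row while the second scans it twice in total.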

Subqueries: Impact
• Heavily used in many workloads
– BI/Reporting tools generate a substantial number of subqueries
– TPC-H workload: 40% of the 22 queries
– TPC-DS workload: 20% of the 111 queries
• Inefficient plans mean a query takes a long time or does not terminate
• Optimizations
– De-correlation
– Conversion of subqueries to joins

Subqueries in Disjunctive Filters
• Find parts with: size > 40 OR price > the average brand price
SELECT *
FROM part p1
WHERE p_size > 40 OR
      p_retailprice > (SELECT avg(p_retailprice)
                       FROM part p2
                       WHERE p2.p_brand = p1.p_brand)


Subquery Handling: Orca vs. Planner
CSQ Class                              Planner                 Orca
CSQ in select list                     Correlated Execution    Join
CSQ in disjunctive filter              Correlated Execution    Join
Multi-Level CSQ                        No Plan                 Join
CSQ with group by and inequality       Correlated Execution    Join
CSQ must return one row                Correlated Execution    Join
CSQ with correlation in select list    Correlated Execution    Correlated Execution

TPC-DS Orca vs. Planner
TPC-DS 10TB, 16 nodes, 48 GB/node

How to add an Orca feature?

Optimization Steps in Orca
• Step 1: Pre-Process Input Logical Expression
– Apply heuristics such as pushing selects down
• Step 2: Exploration (via Transforms)
– Generate all equivalent logical plans
• Step 3: Implementation (via Transforms)
– Generate all physical implementations for all logical operators
• Step 4: Optimization
– Enforce distribution and ordering requirements and pick the cheapest plan
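
As a rough sketch of the last step, the snippet below adds the cost of any missing enforcers (a redistribute or a sort) to each candidate physical plan and then picks the cheapest total. Everything here is a deliberately simplified assumption: `Candidate`, `Required`, the two enforcer costs, and the function names are invented for illustration and do not reflect Orca's classes or cost model.

```cpp
#include <limits>
#include <string>
#include <vector>

// Hypothetical, heavily simplified model of a physical alternative.
struct Candidate {
    std::string name;
    bool hashDistributedOnKey;  // distribution the plan delivers
    bool sortedOnKey;           // ordering the plan delivers
    double cost;
};

// Properties the parent operator requires.
struct Required {
    bool hashDistributedOnKey;
    bool sortedOnKey;
};

// Assumed enforcer costs, chosen only for illustration.
const double kRedistributeCost = 50.0;
const double kSortCost = 30.0;

// Cost of a candidate after adding the enforcers it still needs.
double EnforcedCost(const Candidate& c, const Required& req) {
    double total = c.cost;
    if (req.hashDistributedOnKey && !c.hashDistributedOnKey) total += kRedistributeCost;
    if (req.sortedOnKey && !c.sortedOnKey) total += kSortCost;
    return total;
}

// Returns the index of the cheapest candidate, or -1 if there are none.
int PickCheapest(const std::vector<Candidate>& candidates, const Required& req) {
    int best = -1;
    double bestCost = std::numeric_limits<double>::infinity();
    for (int i = 0; i < static_cast<int>(candidates.size()); ++i) {
        const double cost = EnforcedCost(candidates[i], req);
        if (cost < bestCost) {
            bestCost = cost;
            best = i;
        }
    }
    return best;
}
```

The takeaway is that the locally cheapest operator is not always the winner: a plan that already delivers the required distribution and order can beat a cheaper plan that needs enforcers on top.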

Let’s Pair
• Split an aggregate into a pair of local and global aggregates.
• Schema: CREATE TABLE foo (a int, b int, c int) distributed by (a);
• Query: SELECT sum(c) FROM foo GROUP BY b
• Do the local aggregation on each segment, then send the partial aggregates to the master.
• The final aggregation can then be done on the master.
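
The effect of the split can be sketched outside the optimizer with plain containers. In this illustration, which is an assumption-laden toy and not Orca code, rows of `foo` are placed on segments by the distribution key `a`, each segment computes `sum(c)` per `b` locally, and the master merges the partial sums. `Row`, `Distribute`, `LocalAgg`, and `GlobalAgg` are hypothetical names, and the modulo placement stands in for real hash distribution.

```cpp
#include <cstddef>
#include <map>
#include <vector>

// One row of foo(a, b, c).
struct Row {
    int a;
    int b;
    int c;
};

typedef std::map<int, long> PartialSums;  // b -> sum(c)

// Place rows on segments by the distribution key a (simplified to a modulo).
std::vector<std::vector<Row> > Distribute(const std::vector<Row>& rows, int numSegments) {
    std::vector<std::vector<Row> > segments(numSegments);
    for (const Row& r : rows) {
        const int seg = ((r.a % numSegments) + numSegments) % numSegments;
        segments[seg].push_back(r);
    }
    return segments;
}

// Local aggregate: each segment sums c per b over its own rows only.
PartialSums LocalAgg(const std::vector<Row>& segmentRows) {
    PartialSums partial;
    for (const Row& r : segmentRows) partial[r.b] += r.c;
    return partial;
}

// Global aggregate: the master merges the partial sums from all segments.
PartialSums GlobalAgg(const std::vector<PartialSums>& partials) {
    PartialSums total;
    for (const PartialSums& p : partials) {
        for (PartialSums::const_iterator it = p.begin(); it != p.end(); ++it) {
            total[it->first] += it->second;
        }
    }
    return total;
}
```

Because the table is distributed by `a` but grouped by `b`, rows of one group live on several segments; the local step shrinks what has to be shipped to one partial sum per group per segment.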

CXformSplitGbAgg
// HEADER FILES
~/orca/libgpopt/include/gpopt/xforms
// SOURCE FILES
~/orca/libgpopt/src/xforms

Define What Will Trigger This Transformation
• Pattern
• Pre-Condition Check

Pattern
GPOS_NEW(pmp) CExpression
    (
    pmp,
    // logical aggregate operator
    GPOS_NEW(pmp) CLogicalGbAgg(pmp),
    // relational child
    GPOS_NEW(pmp) CExpression(pmp, GPOS_NEW(pmp) CPatternLeaf(pmp)),
    // scalar project list
    GPOS_NEW(pmp) CExpression(pmp, GPOS_NEW(pmp) CPatternTree(pmp))
    )
); // this ')' closes an enclosing call that the slide does not show

Pre-Condition Check
Common-sense rules, such as not firing this rule on a logical operator
that was itself produced by the same rule (avoids infinite recursion).
// Compatibility function for splitting aggregates
virtual
BOOL FCompatible(CXform::EXformId exfid)
{
    return (CXform::ExfSplitGbAgg != exfid);
}
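
Why this check prevents infinite recursion can be shown with a toy model. In the sketch below each operator remembers which rule produced it, and the split rule refuses to fire on its own output. `LogicalOp`, `MayFireSplitGbAgg`, and `ApplyUntilFixpoint` are hypothetical; only the `FCompatible` comparison mirrors the slide, and how Orca actually tracks an expression's origin is not shown in the deck.

```cpp
#include <cstddef>
#include <vector>

enum XformId { ExfSplitGbAgg, ExfOther };

// Hypothetical stand-in for a logical operator that remembers its origin rule.
struct LogicalOp {
    bool isGbAgg;
    bool hasOrigin;  // false for operators that came straight from the query
    XformId origin;  // meaningful only when hasOrigin is true
};

// Mirrors the slide's check: compatible with every rule except itself.
bool FCompatible(XformId exfid) { return ExfSplitGbAgg != exfid; }

// The split rule may fire only on aggregates it did not create itself.
bool MayFireSplitGbAgg(const LogicalOp& op) {
    if (!op.isGbAgg) return false;
    if (op.hasOrigin && !FCompatible(op.origin)) return false;
    return true;
}

// Applies the rule until no operator qualifies; returns the number of firings.
int ApplyUntilFixpoint(std::vector<LogicalOp>& ops) {
    int fired = 0;
    for (std::size_t i = 0; i < ops.size(); ++i) {  // ops grows while we iterate
        if (!MayFireSplitGbAgg(ops[i])) continue;
        ++fired;
        // The split yields a local and a global aggregate, both tagged with the rule.
        const LogicalOp local = {true, true, ExfSplitGbAgg};
        const LogicalOp global = {true, true, ExfSplitGbAgg};
        ops.push_back(local);
        ops.push_back(global);
    }
    return fired;
}
```

Without the origin check, the loop would keep splitting the aggregates it just produced and never terminate; with it, one input aggregate yields exactly one firing.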

The Actual Transformation
void Transform
    (
    CXformContext *pxfctxt,
    CXformResult *pxfres,
    CExpression *pexpr
    )
    const;

Register Transformation
void CXformFactory::Instantiate()
{
    ….
    Add(GPOS_NEW(m_pmp) CXformSplitGbAgg(m_pmp));
    ….
}

How to enable Orca on HAWQ?

Get Your Hands On It!
• GPORCA: https://github.com/greenplum-db/gporca
• White Paper: bit.ly/1ntrE8v
• Pivotal Tracker: bit.ly/1m1WGDn

Getting Orca on Apache HAWQ
• Step 1: Get the CentOS 7 Docker image
https://github.com/xinzweb/hawq-devel-env/blob/master/README.md
• Step 2: Update CMake to 3.4.3
• Step 3: Install GPORCA
– Install Xerces 3.1.2
– Install GPOS

Getting Orca on Apache HAWQ (continued)
• Step 4: Apply my changes to the Apache HAWQ Makefile
https://github.com/vraghavan78/incubator-hawq/tree/fix-enable-orca
• Step 5: Compile HAWQ with Orca enabled
./configure --enable-orca --with-perl --with-python --with-libxml
• Step 6: You may have to copy the libraries to the different segments
For example: scp -r . gpadmin@centos7-datanode1:/usr/local/lib

Publications
• Optimization of Common Table Expressions in MPP Database Systems, VLDB 2015
– Amr El-Helw, Venkatesh Raghavan, Mohamed A. Soliman, George C. Caragea, Zhongxian Gu, Michalis Petropoulos
• Orca: A Modular Query Optimizer Architecture for Big Data, SIGMOD 2014
– Mohamed A. Soliman, Lyublena Antova, Venkatesh Raghavan, Amr El-Helw, Zhongxian Gu, Entong Shen, George C. Caragea, Carlos Garcia-Alvarado, Foyzur Rahman, Michalis Petropoulos, Florian Waas, Sivaramakrishnan Narayanan, Konstantinos Krikellas, Rhonda Baldwin
• Optimizing Queries over Partitioned Tables in MPP Systems, SIGMOD 2014
– Lyublena Antova, Amr El-Helw, Mohamed Soliman, Zhongxian Gu, Michalis Petropoulos, Florian Waas
• Reversing Statistics for Scalable Test Databases Generation, DBTest 2013
– Entong Shen, Lyublena Antova
• Total Operator State Recall - Cost-Effective Reuse of Results in Greenplum Database, ICDE Workshops 2013
– George C. Caragea, Carlos Garcia-Alvarado, Michalis Petropoulos, Florian M. Waas
• Testing the Accuracy of Query Optimizers, DBTest 2012
– Zhongxian Gu, Mohamed A. Soliman, Florian M. Waas
• Automatic Capture of Minimal, Portable, and Executable Bug Repros using AMPERe, DBTest 2012
– Lyublena Antova, Konstantinos Krikellas, Florian M. Waas
• Automatic Data Placement in MPP Databases, ICDE Workshops 2012
– Carlos Garcia-Alvarado, Venkatesh Raghavan, Sivaramakrishnan Narayanan, Florian M. Waas
