Technical Deep Dive:
How to Think Vectorized
Alex Behm
Tech Lead, Photon
Agenda
Introduction: Delta Engine, vectorization, micro-benchmarks
Expressions: Compute kernels, adaptivity, lazy filters
Aggregation: Hash tables, mixed row/columnar kernels
End-to-End Performance
Hardware Changes since 2015
           2010            2015             2020
Storage    50 MB/s (HDD)   500 MB/s (SSD)   16 GB/s (NVMe)   10X
Network    1 Gbps          10 Gbps          100 Gbps         10X
CPU        ~3 GHz          ~3 GHz           ~3 GHz           ☹
CPUs continue to be the bottleneck.
How do we achieve next level performance?
Workload Trends
Businesses are moving faster, and as a result organizations spend less
time in data modeling, leading to worse performance:
▪ Most columns don’t have “NOT NULL” defined
▪ Strings are convenient, and many date columns are stored as strings
▪ Raw → Bronze → Silver → Gold: from nothing to pristine schema/quality
Can we get both agility and performance?
Delta Engine
[Diagram: the SQL, Spark DataFrame, and Koalas APIs sit on top of Delta Engine, which consists of the Query Optimizer, the Photon Execution Engine, and Caching.]
Photon
New execution engine for Delta Engine to accelerate Spark SQL
Built from scratch in C++, for performance:
▪ Vectorization: data-level and instruction-level parallelism
▪ Optimize for modern structured and semi-structured workloads
Vectorization
● Decompose query into compute kernels that process vectors of data
● Typically: Columnar in-memory format
● Cache and CPU friendly: simple predictable loops, many data items, SIMD
● Adaptive: Batch-level specialization, e.g., NULLs or no NULLs
● Modular: Can optimize individual kernels as needed
Sounds great! But… what does it really mean? How does it work? Is it worth it?
This talk: I will teach you how to think vectorized!
Microbenchmarks
Do not necessarily reflect speedups on end-to-end queries
Vectorization: Basic Building Blocks
Let’s build a simple engine from scratch.
1. Expression evaluation and adaptivity
2. Filters and laziness
3. Hash tables and mixed column/row operations
Expressions
Running Example
SELECT SUM(c3) FROM t
WHERE c1 + c2 < 10
GROUP BY g1, g2
Scan → Filter (c1 + c2 < 10) → Aggregate (SUM(c3))
We’re not covering this part
Operators pass batches of columnar data
Expression Evaluation
Expression tree: Out = (c1 + c2) < 10
Kernels! Each operator in the tree (+, <) is evaluated by its own compute kernel.
void PlusKernel(const int64_t* left, const int64_t* right,
                int32_t num_rows, int64_t* output) {
  for (int32_t i = 0; i < num_rows; ++i) {
    output[i] = left[i] + right[i];
  }
}
🤔
What about NULLs?
void PlusKernel(const int64_t* left, const bool* left_nulls,
                const int64_t* right, const bool* right_nulls,
                int32_t num_rows,
                int64_t* output, bool* output_nulls) {
  for (int32_t i = 0; i < num_rows; ++i) {
    // The result is NULL if either input is NULL.
    bool is_null = left_nulls[i] || right_nulls[i];
    if (!is_null) output[i] = left[i] + right[i];
    output_nulls[i] = is_null;
  }
}
More than 30% slower with NULL checks
Expression Evaluation: Runtime Adaptivity
void PlusKernelNoNulls(...);
void PlusKernel(...);
void PlusEval(Column left, Column right, Column output) {
  if (!left.has_nulls() && !right.has_nulls()) {
    PlusKernelNoNulls(left.data(), right.data(), output.data());
  } else {
    PlusKernel(left.data(), left.nulls(), …);
  }
}
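The dispatch above can be sketched as a self-contained unit. This is a minimal illustration, not Photon's code: the Column struct, its field names, and the use of uint8_t null flags are assumptions made so the example compiles on its own.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical column type (an assumption for this sketch).
// nulls is assumed to always have the same length as values.
struct Column {
  std::vector<int64_t> values;
  std::vector<uint8_t> nulls;  // 1 = NULL, 0 = not NULL
  bool has_nulls;              // batch-level flag set by whoever produced the column
};

// Fast path: no NULL bookkeeping at all.
void PlusKernelNoNulls(const int64_t* left, const int64_t* right,
                       int32_t num_rows, int64_t* output) {
  for (int32_t i = 0; i < num_rows; ++i) output[i] = left[i] + right[i];
}

// General path: the result is NULL if either input is NULL.
void PlusKernel(const int64_t* left, const uint8_t* left_nulls,
                const int64_t* right, const uint8_t* right_nulls,
                int32_t num_rows, int64_t* output, uint8_t* output_nulls) {
  for (int32_t i = 0; i < num_rows; ++i) {
    uint8_t is_null = left_nulls[i] | right_nulls[i];
    if (!is_null) output[i] = left[i] + right[i];
    output_nulls[i] = is_null;
  }
}

// Chooses the kernel once per batch, so the check is paid per batch, not per row.
void PlusEval(const Column& left, const Column& right, Column* output) {
  const int32_t num_rows = static_cast<int32_t>(left.values.size());
  output->values.assign(num_rows, 0);
  output->nulls.assign(num_rows, 0);
  output->has_nulls = left.has_nulls || right.has_nulls;
  if (!output->has_nulls) {
    PlusKernelNoNulls(left.values.data(), right.values.data(), num_rows,
                      output->values.data());
  } else {
    PlusKernel(left.values.data(), left.nulls.data(),
               right.values.data(), right.nulls.data(), num_rows,
               output->values.data(), output->nulls.data());
  }
}
```

The design choice is that the has_nulls flag is a property of the batch, so two batches of the same column can take different paths.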
But what if my data rarely has NULLs?
Expression Evaluation: the < comparison
● Similar kernel approach
● Can optimize for literals: ~25% faster
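A sketch of what a literal specialization might look like. The kernel names and the uint8_t output are assumptions, since the slides do not show this kernel. The ~25% figure is the talk's measurement, not something this sketch reproduces.

```cpp
#include <cassert>
#include <cstdint>

// General case: both sides of the comparison are columns.
void LessThanKernel(const int64_t* left, const int64_t* right,
                    int32_t num_rows, uint8_t* output) {
  for (int32_t i = 0; i < num_rows; ++i) output[i] = left[i] < right[i];
}

// Literal case (hypothetical name): the right side is a single value, so the
// loop reads one input array instead of two and the literal can stay in a register.
void LessThanLiteralKernel(const int64_t* left, int64_t literal,
                           int32_t num_rows, uint8_t* output) {
  for (int32_t i = 0; i < num_rows; ++i) output[i] = left[i] < literal;
}
```

Without the specialization, the engine would have to materialize the literal 10 as a full column before calling the general kernel.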
Filters
Expression tree: Out = (c1 + c2) < 10 → ???
● What exactly is the output?
● What should we do with our input column batch?
Filters: Lazy Representation as Active Rows
Column Batch {c1, c2, c3, g1, g2}
row  c1  c2  c3  g1  g2
0    1   5   1   7   5
1    2   8   1   7   4
2    3   3   1   7   3
3    4   2   1   7   5
4    5   7   1   7   4
Active Rows after c1 + c2 < 10: 0, 2, 3
Scan → Filter (c1 + c2 < 10) → Aggregate (SUM(c3))
The same batch {c1, c2, c3, g1, g2} flows from Scan to Filter and from Filter to Aggregate; the Filter does not copy or compact it, it only attaches the active rows.
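How a filter might turn a predicate result into an active rows list can be sketched as follows. The function name and the uint8_t predicate representation are assumptions for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Writes the indices of the rows whose predicate is true into active_rows and
// returns how many there are. active_rows must have room for num_rows entries.
// The column data itself is never touched.
int32_t SelectActiveRows(const uint8_t* predicate, int32_t num_rows,
                         int32_t* active_rows) {
  int32_t num_active = 0;
  for (int32_t i = 0; i < num_rows; ++i) {
    active_rows[num_active] = i;      // unconditional store, no branch
    num_active += predicate[i] != 0;  // only advance for rows that pass
  }
  return num_active;
}
```

The store is unconditional and the cursor advances only for passing rows, which keeps the loop free of data-dependent branches. A row that fails the predicate is simply overwritten by the next candidate.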
Filters: Lazy Representation as Active Rows
void PlusNoNullsSomeActiveKernel(
    const int64_t* left, const int64_t* right,
    const int32_t* active_rows, int32_t num_rows,
    int64_t* output) {
  // num_rows counts the entries in active_rows, not the rows in the batch.
  for (int32_t i = 0; i < num_rows; ++i) {
    int32_t active_idx = active_rows[i];
    output[active_idx] = left[active_idx] + right[active_idx];
  }
}
The active rows concept must be supported throughout the engine
● Adds complexity and code
● Will come in handy for advanced operations like aggregation/join
Aggregation
Hash Aggregation
Basic Algorithm
1. Hash and find bucket
2. If bucket empty, initialize entry with keys and aggregation buffers
3. Compare keys and follow probing strategy to resolve collisions
4. Update aggregation buffers according to aggregation function and input
Hash Table
{g1, g2, SUM}
Hash Aggregation
Think vectorized!
● Columnar, batch-oriented
● Type specialized
Microbenchmarks
Do not necessarily reflect speedups on end-to-end queries
SELECT col1, SUM(col2)
FROM t
GROUP BY col1
Hash Aggregation
Column Batch with the hash of each row's keys (g1, g2):
row  c3  g1  g2  hash
0    1   7   5   h1
1    1   7   4   h2
2    1   7   3   h1
3    1   7   5   h1
4    1   7   4   h2
Hash Table {g1, g2, SUM} before the batch: {7, 5, 10}, {7, 4, 3}
● Compute the hashes for all rows, then map each hash to a bucket (one bucket pointer per row)
● Compare keys
● Create an active rows list for non-matches (collisions): row 2 with keys (7, 3) hashes to h1, the same as the entry {7, 5, 10} → Collision
● Advance buckets for all collisions and compare keys
● Repeat until match or empty bucket: row 2 reaches an empty bucket, which is initialized as {7, 3, 0}
● Update the aggregation state for each aggregate
Hash Table after the batch: {7, 5, 12}, {7, 3, 1}, {7, 4, 5}
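The batch-oriented steps above can be sketched end to end. This is an illustration under assumptions, not Photon's hash table: the Entry layout, the HashKeys function, linear probing, and a fixed power-of-two capacity that is never exceeded are all choices made for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Entry {
  bool occupied = false;
  int64_t g1 = 0, g2 = 0;  // grouping keys
  int64_t sum = 0;         // aggregation buffer for SUM
};

struct HashTable {
  std::vector<Entry> entries;  // size must be a power of two; no resizing here
  explicit HashTable(int32_t capacity) : entries(capacity) {}
  uint32_t mask() const { return static_cast<uint32_t>(entries.size()) - 1; }
};

// Illustrative hash function only.
inline uint32_t HashKeys(int64_t g1, int64_t g2) {
  uint64_t h = static_cast<uint64_t>(g1) * 0x9E3779B97F4A7C15ull;
  h ^= static_cast<uint64_t>(g2) + 0x9E3779B97F4A7C15ull + (h << 6) + (h >> 2);
  return static_cast<uint32_t>(h ^ (h >> 32));
}

void AggregateBatch(HashTable* ht, const int64_t* g1, const int64_t* g2,
                    const int64_t* c3, int32_t num_rows) {
  std::vector<uint32_t> buckets(num_rows);
  std::vector<int32_t> active(num_rows), next_active(num_rows);

  // Kernel 1: hash every row's keys and map the hash to a starting bucket.
  for (int32_t i = 0; i < num_rows; ++i) {
    buckets[i] = HashKeys(g1[i], g2[i]) & ht->mask();
    active[i] = i;
  }

  // Kernel 2: compare keys; collisions stay active and advance one bucket.
  int32_t num_active = num_rows;
  while (num_active > 0) {
    int32_t num_collisions = 0;
    for (int32_t k = 0; k < num_active; ++k) {
      const int32_t row = active[k];
      Entry& e = ht->entries[buckets[row]];
      if (!e.occupied) {  // empty bucket: initialize keys and buffer
        e.occupied = true;
        e.g1 = g1[row];
        e.g2 = g2[row];
        e.sum = 0;
      } else if (e.g1 != g1[row] || e.g2 != g2[row]) {   // collision
        buckets[row] = (buckets[row] + 1) & ht->mask();  // linear probing
        next_active[num_collisions++] = row;
      }
    }
    active.swap(next_active);
    num_active = num_collisions;
  }

  // Kernel 3: every row now points at its group; update the aggregation state.
  for (int32_t i = 0; i < num_rows; ++i) ht->entries[buckets[i]].sum += c3[i];
}

// Inspection helper for the example; returns -1 if the group is absent.
int64_t LookupSum(const HashTable& ht, int64_t g1, int64_t g2) {
  uint32_t b = HashKeys(g1, g2) & ht.mask();
  while (ht.entries[b].occupied) {
    if (ht.entries[b].g1 == g1 && ht.entries[b].g2 == g2) return ht.entries[b].sum;
    b = (b + 1) & ht.mask();
  }
  return -1;
}
```

Each pass of the probing loop works only on the rows still unresolved, which is the same active rows idea used for filters.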
Mixed Column/Row Kernel Example
void AggKernel(AggFn* fn,
               int64_t* input,
               int8_t** buckets,
               int64_t buffer_offset,
               int32_t num_rows) {
  for (int32_t i = 0; i < num_rows; ++i) {
    // Memory access into large array. Good to have a tight loop.
    int8_t* bucket = buckets[i];
    // Make sure this gets inlined.
    fn->update(input[i], bucket + buffer_offset);
  }
}
A “column” whose values are sprayed
across rows in the hash table
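One way to make sure the update call gets inlined is to pass the aggregate function as a template parameter instead of a pointer. The sketch below assumes this approach. SumFn, the 24-byte row layout in the usage note, and the memcpy-based buffer access are hypothetical and not taken from the talk.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical SUM aggregate whose buffer is an int64_t stored inside a row.
struct SumFn {
  static void Update(int64_t input, int8_t* buffer) {
    int64_t sum;
    std::memcpy(&sum, buffer, sizeof(sum));  // rows are raw bytes, so copy in and out
    sum += input;
    std::memcpy(buffer, &sum, sizeof(sum));
  }
};

// The input is columnar (input[i]); the output is row-wise (one buffer per
// hash table row, located at buckets[i] + buffer_offset).
template <typename AggFn>
void AggKernel(const int64_t* input, int8_t* const* buckets,
               int64_t buffer_offset, int32_t num_rows) {
  for (int32_t i = 0; i < num_rows; ++i) {
    // AggFn is known at compile time, so the compiler can inline Update.
    AggFn::Update(input[i], buckets[i] + buffer_offset);
  }
}
```

For rows laid out as {g1, g2, SUM} with three int64_t fields, the call would be AggKernel<SumFn>(input, buckets, 16, num_rows), where 16 is the byte offset of the SUM buffer within a row.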
End-to-End Performance
Why go to the trouble?
TPC-DS 30TB, Queries/Hour (higher is better): 110 vs. 32, a 3.3x speedup
[Chart from an untitled slide; surviving labels: 32; 23 columns, mixed types; 1 column]
Real-World Queries
▪ Several preview customers from different industries
▪ Need to have a suitable workload with sufficient Photon feature coverage
▪ Typical experience: 2-3x speedup end-to-end
▪ Mileage varies, best speedup: From 80 → 5 minutes!
Recap
▪ Vectorization: Decompose query into simple loops over vectors of data
▪ Batch-level adaptivity, e.g., NULLs vs no-NULLs
▪ Lazy filter evaluation with active rows → useful concept
▪ Mixed column/row operations for accessing hash tables
