Photon Technical Deep Dive: How to Think Vectorized

Alex Behm
Tech Lead, Photon

Agenda
Introduction
Delta Engine, vectorization, micro-benchmarks
Expressions
Compute kernels, adaptivity, lazy filters
Aggregation
Hash tables, mixed row/columnar kernels
End-to-End Performance

Hardware Changes since 2015
            2010            2015             2020
Storage     50 MB/s (HDD)   500 MB/s (SSD)   16 GB/s (NVMe)   10X
Network     1 Gbps          10 Gbps          100 Gbps         10X
CPU         ~3 GHz          ~3 GHz           ~3 GHz           ☹
CPUs continue to be the bottleneck.
How do we achieve next level performance?

Workload Trends
Businesses are moving faster, and as a result organizations spend less
time in data modeling, leading to worse performance:
▪ Most columns don’t have “NOT NULL” defined
▪ Strings are convenient, and many date columns are stored as strings
▪ Raw → Bronze → Silver → Gold: from nothing to pristine schema/quality
Can we get both agility and performance?

Delta Engine
[Diagram: APIs (SQL, Spark DataFrame, Koalas); Delta Engine components (Query Optimizer, Photon Execution Engine, Caching)]

Photon
New execution engine for Delta Engine to accelerate Spark SQL
Built from scratch in C++, for performance:
▪ Vectorization: data-level and instruction-level parallelism
▪ Optimize for modern structured and semi-structured workloads

Vectorization
● Decompose query into compute kernels that process vectors of data
● Typically: Columnar in-memory format
● Cache and CPU friendly: simple predictable loops, many data items, SIMD
● Adaptive: Batch-level specialization, e.g., NULLs or no NULLs
● Modular: Can optimize individual kernels as needed
Sounds great! But… what does it really mean? How does it work? Is it worth it?
This talk: I will teach you how to think vectorized!

Microbenchmarks
Do not necessarily reflect speedups on end-to-end queries

Vectorization: Basic Building Blocks
Let’s build a simple engine from scratch.
1. Expression evaluation and adaptivity
2. Filters and laziness
3. Hash tables and mixed column/row operations

Running Example
SELECT SUM(c3) FROM t
WHERE c1 + c2 < 10
GROUP BY g1, g2
Plan: Scan → Filter (c1 + c2 < 10) → Aggregate (SUM(c3))
● Scan: we’re not covering this part
● Operators pass batches of columnar data

Expression Evaluation
Expression tree: c1, c2 → + → < (with literal 10) → Out

Kernels!
Each operator in the expression tree (+ and <) maps to a compute kernel.

#include <cstdint>

// Adds two int64 columns element-wise.
void PlusKernel(const int64_t* left, const int64_t* right,
                int32_t num_rows, int64_t* output) {
  for (int32_t i = 0; i < num_rows; ++i) {
    output[i] = left[i] + right[i];
  }
}

🤔
What about NULLs?

void PlusKernel(const int64_t* left, const bool* left_nulls,
                const int64_t* right, const bool* right_nulls,
                int32_t num_rows,
                int64_t* output, bool* output_nulls) {
  for (int32_t i = 0; i < num_rows; ++i) {
    // The result is NULL if either input is NULL.
    bool is_null = left_nulls[i] || right_nulls[i];
    if (!is_null) output[i] = left[i] + right[i];
    output_nulls[i] = is_null;
  }
}

> 30% slower with NULL checks

Expression Evaluation: Runtime Adaptivity
void PlusKernelNoNulls(...);
void PlusKernel(...);
void PlusEval(Column left, Column right, Column output) {
  if (!left.has_nulls() && !right.has_nulls()) {
    PlusKernelNoNulls(left.data(), right.data(), output.data());
  } else {
    PlusKernel(left.data(), left.nulls(), …);
  }
}
But what if my data rarely has NULLs?
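The dispatch above can be made concrete with a small, self-contained sketch. The `Column` struct here is hypothetical (it is not Photon's actual column class), and NULLs are stored as one byte per row purely for simplicity. Because the check happens once per batch, data that rarely has NULLs still takes the fast path for most batches.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical column type for this sketch; not Photon's actual class.
struct Column {
  std::vector<int64_t> values;
  std::vector<uint8_t> nulls;  // 1 = NULL, same length as values

  // Scans the NULL bytes. A real engine would more likely track this
  // per batch instead of recomputing it (assumption).
  bool has_nulls() const {
    for (uint8_t n : nulls) {
      if (n) return true;
    }
    return false;
  }
};

// Fast path: no NULL checks in the loop.
void PlusKernelNoNulls(const int64_t* left, const int64_t* right,
                       int32_t num_rows, int64_t* output) {
  for (int32_t i = 0; i < num_rows; ++i) output[i] = left[i] + right[i];
}

// General path: the result is NULL if either input is NULL.
void PlusKernel(const int64_t* left, const uint8_t* left_nulls,
                const int64_t* right, const uint8_t* right_nulls,
                int32_t num_rows, int64_t* output, uint8_t* output_nulls) {
  for (int32_t i = 0; i < num_rows; ++i) {
    const uint8_t is_null = static_cast<uint8_t>(left_nulls[i] | right_nulls[i]);
    if (!is_null) output[i] = left[i] + right[i];
    output_nulls[i] = is_null;
  }
}

// Picks a kernel once per batch. Returns true if the fast path ran.
bool PlusEval(const Column& left, const Column& right, Column* output) {
  assert(left.values.size() == right.values.size());
  const int32_t n = static_cast<int32_t>(left.values.size());
  output->values.assign(n, 0);
  output->nulls.assign(n, 0);
  if (!left.has_nulls() && !right.has_nulls()) {
    PlusKernelNoNulls(left.values.data(), right.values.data(), n,
                      output->values.data());
    return true;
  }
  PlusKernel(left.values.data(), left.nulls.data(), right.values.data(),
             right.nulls.data(), n, output->values.data(),
             output->nulls.data());
  return false;
}
```

The boolean return value exists only so that a caller or test can observe which path was taken; it is not part of the slide's interface.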

Comparison (<) with the literal 10
● Similar kernel approach
● Can optimize for literals, ~25% faster
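As a rough illustration of what "optimize for literals" could look like, the sketch below contrasts a general column-versus-column comparison with a variant specialized for a literal right-hand side. The kernel names and the one-byte-per-row boolean output are assumptions made for this example, and the code makes no claim about reproducing the ~25% figure.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// General case: compares two columns element-wise.
void LessThanKernel(const int64_t* left, const int64_t* right,
                    int32_t num_rows, uint8_t* output) {
  assert(num_rows >= 0);
  for (int32_t i = 0; i < num_rows; ++i) {
    output[i] = left[i] < right[i] ? 1 : 0;
  }
}

// Literal case: the right-hand side is a single value, so the loop reads
// one input column instead of two and needs no materialized constant column.
void LessThanLiteralKernel(const int64_t* left, int64_t literal,
                           int32_t num_rows, uint8_t* output) {
  assert(num_rows >= 0);
  for (int32_t i = 0; i < num_rows; ++i) {
    output[i] = left[i] < literal ? 1 : 0;
  }
}
```

For `c1 + c2 < 10`, the output of the plus kernel would be passed as `left` and `10` as `literal`.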

Filters
Expression tree: c1, c2 → + → < (with literal 10) → Out ???
● What exactly is the output?
● What should we do with our input column batch?

Filters: Lazy Representation as Active Rows
Column Batch (row indices inferred from the Active Rows list; row 0 is drawn at the bottom):
row  c1  c2  c3  g1  g2
4    5   7   1   7   4
3    4   2   1   7   5
2    3   3   1   7   3
1    2   8   1   7   4
0    1   5   1   7   5
Active Rows after c1 + c2 < 10: 3, 2, 0
Scan → {c1, c2, c3, g1, g2} → Filter (c1 + c2 < 10) → {c1, c2, c3, g1, g2} → Aggregate (SUM(c3))

Filters: Lazy Representation as Active Rows
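void PlusNoNullsSomeActiveKernel(
    const int64_t* left, const int64_t* right,
    const int32_t* active_rows, int32_t num_rows,
    int64_t* output) {
  // num_rows is the number of entries in active_rows.
  for (int32_t i = 0; i < num_rows; ++i) {
    // Only touch rows that passed the filter.
    int32_t active_idx = active_rows[i];
    output[active_idx] = left[active_idx] + right[active_idx];
  }
}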
Active rows concept must be supported throughout the engine
● Adds complexity, code
● Will come in handy for advanced operations like aggregation/join

Hash Aggregation
Hash Table: {g1, g2, SUM}
Basic Algorithm
1. Hash and find bucket
2. If bucket empty, initialize entry with keys and aggregation buffers
3. Compare keys and follow probing strategy to resolve collisions
4. Update aggregation buffers according to aggregation function and input
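For reference, the four steps can be sketched as a plain row-at-a-time implementation. This is a baseline under stated assumptions, not Photon's code: the `HashAggTable` class, its toy hash function, linear probing, and the fixed capacity (no resizing) are all choices made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Entry {
  bool used = false;
  int64_t g1 = 0, g2 = 0;  // grouping keys
  int64_t sum = 0;         // aggregation buffer for SUM
};

// Fixed-capacity open-addressing table with linear probing (assumption).
class HashAggTable {
 public:
  explicit HashAggTable(size_t capacity) : entries_(capacity) {
    assert(capacity > 0);
  }

  void Add(int64_t g1, int64_t g2, int64_t value) {
    size_t bucket = Hash(g1, g2) % entries_.size();  // 1. hash, find bucket
    for (size_t probes = 0; probes < entries_.size(); ++probes) {
      Entry& e = entries_[bucket];
      if (!e.used) {  // 2. empty bucket: initialize keys and buffer
        e.used = true;
        e.g1 = g1;
        e.g2 = g2;
        e.sum = 0;
      }
      if (e.g1 == g1 && e.g2 == g2) {  // 3. compare keys
        e.sum += value;                // 4. update aggregation buffer
        return;
      }
      bucket = (bucket + 1) % entries_.size();  // 3. collision: probe next
    }
    assert(false && "hash table is full");
  }

  // Returns the SUM for a group, or 0 if the group is absent.
  int64_t Sum(int64_t g1, int64_t g2) const {
    size_t bucket = Hash(g1, g2) % entries_.size();
    for (size_t probes = 0; probes < entries_.size(); ++probes) {
      const Entry& e = entries_[bucket];
      if (!e.used) return 0;
      if (e.g1 == g1 && e.g2 == g2) return e.sum;
      bucket = (bucket + 1) % entries_.size();
    }
    return 0;
  }

 private:
  // Toy hash for illustration only.
  static size_t Hash(int64_t g1, int64_t g2) {
    return static_cast<size_t>(g1) * 31 + static_cast<size_t>(g2);
  }
  std::vector<Entry> entries_;
};
```

Every call to `Add` interleaves hashing, key comparison, and the aggregate update for a single row.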

Hash Aggregation
Think vectorized!
● Columnar, batch-oriented
● Type specialized

Microbenchmarks
Do not necessarily reflect speedups on end-to-end queries
SELECT col1, SUM(col2)
FROM t
GROUP BY col1

Hash Aggregation
Hash Table {g1, g2, SUM}: {7, 5, 10}, {7, 4, 3}
Column Batch with hashes:
c3  g1  g2  hash
1   7   4   h2
1   7   5   h1
1   7   3   h1
1   7   4   h2
1   7   5   h1

Hash Aggregation
Column Batch (c3, g1, g2) → hashes → buckets (batch and hashes unchanged)
Hash Table {g1, g2, SUM}: {7, 5, 10}, {7, 4, 3}

Hash Aggregation
Column Batch (c3, g1, g2) → hashes → buckets (batch and hashes unchanged)
Hash Table {g1, g2, SUM}: {7, 5, 10}, {7, 4, 3}
● Compare keys
● Create an active rows list for non-matches (collisions)
Collision: the row with keys (7, 3) hashes to h1, whose bucket holds keys (7, 5)

Hash Aggregation
Column Batch (c3, g1, g2) → hashes → buckets (batch and hashes unchanged)
Hash Table {g1, g2, SUM}: {7, 5, 10}, {7, 3, 0}, {7, 4, 3}
● Advance buckets for all collisions and compare keys
● Repeat until match or empty bucket

Hash Aggregation
Column Batch (c3, g1, g2) → hashes → buckets (batch and hashes unchanged)
Hash Table {g1, g2, SUM}: {7, 5, 12}, {7, 3, 1}, {7, 4, 5}
● Update the aggregation state for each aggregate
Mixed Column/Row Kernel Example
void AggKernel(AggFn* fn,
               int64_t* input,
               int8_t** buckets,
               int64_t buffer_offset,
               int32_t num_rows) {
  for (int32_t i = 0; i < num_rows; ++i) {
    // Memory access into large array. Good to have a tight loop.
    int8_t* bucket = buckets[i];
    // Make sure this gets inlined.
    fn->update(input[i], bucket + buffer_offset);
  }
}
A “column” whose values are sprayed across rows in the hash table
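One way to make the update call inlinable is to pass the aggregate function as a template parameter, so the compiler sees the concrete `Update` body inside the loop. The sketch below takes that approach; it is a swapped-in technique for illustration, not necessarily how Photon does it. `SumFn`, `ReadInt64`, `WriteInt64`, and the 24-byte row layout {g1, g2, SUM} are assumptions for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Reads and writes an int64 at an arbitrary byte address. memcpy avoids
// alignment and aliasing problems with the raw row buffers.
inline int64_t ReadInt64(const int8_t* p) {
  int64_t v;
  std::memcpy(&v, p, sizeof(v));
  return v;
}
inline void WriteInt64(int8_t* p, int64_t v) { std::memcpy(p, &v, sizeof(v)); }

// Hypothetical aggregate function: SUM over int64 inputs.
struct SumFn {
  static void Update(int64_t input, int8_t* buffer) {
    WriteInt64(buffer, ReadInt64(buffer) + input);
  }
};

// Mixed column/row kernel: `input` is a dense column, while the aggregation
// buffers live inside hash table rows reached through `buckets`.
template <typename AggFn>
void AggKernel(const int64_t* input, int8_t* const* buckets,
               int64_t buffer_offset, int32_t num_rows) {
  assert(num_rows >= 0 && buffer_offset >= 0);
  for (int32_t i = 0; i < num_rows; ++i) {
    // AggFn is known at compile time, so Update can be inlined here.
    AggFn::Update(input[i], buckets[i] + buffer_offset);
  }
}
```

A caller would instantiate one kernel per aggregate function, for example `AggKernel<SumFn>(input, buckets, 16, num_rows)` when the SUM buffer sits 16 bytes into each row.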

Why go to the trouble? TPC-DS 30TB Queries/Hour
[Chart: bars at 110 and 32 queries/hour (higher is better); 3.3x speedup]

[Chart labels: 32; 23 columns, mixed types; 1 column]

Real-World Queries
▪ Several preview customers from different industries
▪ Need to have a suitable workload with sufficient Photon feature coverage
▪ Typical experience: 2-3x speedup end-to-end
▪ Mileage varies, best speedup: From 80 → 5 minutes!

Recap
▪ Vectorization: Decompose query into simple loops over vectors of data
▪ Batch-level adaptivity, e.g., NULLs vs no-NULLs
▪ Lazy filter evaluation with active rows → useful concept
▪ Mixed column/row operations for accessing hash tables

