PL/CUDA
~Fusion of HPC Grade Power with In-Database Analytics~
The PG-Strom Project / NEC OSS Promotion Center
KaiGai Kohei <kaigai@ak.jp.nec.com>
The PG-Strom Project
about myself...
PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics
▌KaiGai Kohei
 tw: @kkaigai
 https://github.com/kaigai
▌PostgreSQL
 SELinux, FDW, CustomScan, ...
▌PG-Strom
 GPU acceleration for PostgreSQL
▌Works
 NEC OSS Promotion Center
 Development of the software and its business opportunity
PG-Strom Overview (1/2) – Architecture
[Figure: architecture diagram showing Application, SQL Parser, Query Optimizer, Query Executor, Storage Manager and Storage, with the PG-Strom extension and the GPU attached. Annotations: no storage changes, no query changes.]
 Features
• automatic GPU code generation from the supplied SQL
• asynchronous and massively parallel execution on GPUs
• WHERE-clause, JOIN, GROUP BY, and projection are supported
 Advantages
• transparent acceleration by the power of thousands of cores
• low-cost solution for analytic processing
Characteristics of GPU (Graphics Processing Unit)
                     GPU                      CPU
Model                NVIDIA Tesla P100        Intel Xeon E5-2699v4
Architecture         Pascal                   Broadwell
Launch               Q2-2016                  Q1-2016
# of transistors     15 billion               7.2 billion
# of cores           3584 (simple)            22 (functional)
Core clock           1.126GHz ~ 1.303GHz      2.20GHz ~ 3.60GHz
Peak FLOPS (FP32)    9.3 TFLOPS               1.2 TFLOPS (with AVX2)
DRAM size            16GB (HBM2)              max 1.5TB (DDR4)
Memory bandwidth     732GB/s                  76.8GB/s
Power consumption    250W                     145W
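To put the comparison in proportion, the ratios can be worked out directly from the figures in the table. The sketch below is only arithmetic on those published numbers; the dictionary keys and function names are made up for illustration.

```python
# Figures copied from the GPU/CPU comparison table (Tesla P100 vs. Xeon E5-2699v4).
GPU = {"tflops_fp32": 9.3, "mem_gb_per_s": 732.0, "watts": 250.0}
CPU = {"tflops_fp32": 1.2, "mem_gb_per_s": 76.8, "watts": 145.0}


def ratio(metric):
    """How many times larger the GPU figure is than the CPU figure."""
    return GPU[metric] / CPU[metric]


def gflops_per_watt(spec):
    """Peak FP32 throughput per watt of rated power consumption."""
    return spec["tflops_fp32"] * 1000.0 / spec["watts"]


if __name__ == "__main__":
    print(f"peak FP32 ratio:        {ratio('tflops_fp32'):.2f}x")
    print(f"memory bandwidth ratio: {ratio('mem_gb_per_s'):.2f}x")
    print(f"GPU GFLOPS/W: {gflops_per_watt(GPU):.1f}")
    print(f"CPU GFLOPS/W: {gflops_per_watt(CPU):.1f}")
```

These are peak figures; sustained throughput depends on the workload and on how well memory access is organized.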
PG-Strom Overview (2/2) – GPU binary generation on the fly
QUERY: SELECT cat, count(*), avg(x) FROM t0
        WHERE x between y and y + 20.0 GROUP BY cat;
  :
STATIC_FUNCTION(bool)
gpupreagg_qual_eval(kern_context *kcxt,
                    kern_data_store *kds,
                    size_t kds_index)
{
    /* constant 20.0 of the query, passed as a kernel parameter */
    pg_float8_t KPARAM_1 = pg_float8_param(kcxt,1);
    /* reference to input data: x and y of the current row */
    pg_float8_t KVAR_3 = pg_float8_vref(kds,kcxt,2,kds_index);
    pg_float8_t KVAR_4 = pg_float8_vref(kds,kcxt,3,kds_index);

    /* SQL expression in CUDA source code: x >= y AND x <= y + 20.0 */
    return EVAL((pgfn_float8ge(kcxt, KVAR_3, KVAR_4) &&
                 pgfn_float8le(kcxt, KVAR_3,
                               pgfn_float8pl(kcxt, KVAR_4, KPARAM_1))));
}
  :
E.g.) Transformation of arithmetic operations in the WHERE-clause into a CUDA program
[Figure: the generated source is passed to the run-time compiler (nvrtc) for just-in-time compilation, followed by parallel execution on the GPU.]
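The idea of turning an SQL expression into nested device-function calls can be illustrated with a toy generator. This is a minimal sketch, not PG-Strom's code generator: the `Var`, `Param` and `Op` classes and the `emit` helpers are hypothetical, and only the emitted identifiers (`pgfn_float8ge`, `KVAR_3`, and so on) are taken from the slide.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Var:
    """Reference to an input column (hypothetical node type)."""
    slot: int


@dataclass
class Param:
    """Reference to a query constant (hypothetical node type)."""
    index: int


@dataclass
class Op:
    """Binary operator mapped to a device function name."""
    func: str
    left: "Expr"
    right: "Expr"


Expr = Union[Var, Param, Op]


def emit(expr):
    """Recursively render an expression tree as a CUDA call expression."""
    if isinstance(expr, Var):
        return f"KVAR_{expr.slot}"
    if isinstance(expr, Param):
        return f"KPARAM_{expr.index}"
    return f"{expr.func}(kcxt, {emit(expr.left)}, {emit(expr.right)})"


def emit_between(value, low, high):
    """'value BETWEEN low AND high' becomes (value >= low) && (value <= high)."""
    ge = emit(Op("pgfn_float8ge", value, low))
    le = emit(Op("pgfn_float8le", value, high))
    return f"EVAL(({ge} && {le}))"


if __name__ == "__main__":
    x, y, twenty = Var(3), Var(4), Param(1)
    # WHERE x BETWEEN y AND y + 20.0
    print(emit_between(x, y, Op("pgfn_float8pl", y, twenty)))
```

The printed text matches the `return` expression of the generated function in the slide; a real generator also has to emit the variable declarations and handle types, which this sketch leaves out.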
GPU accelerates SQL performance
▌Test Query:
SELECT cat, count(*), avg(x)
  FROM t0 NATURAL JOIN t1 [NATURAL JOIN t2 ...]
 GROUP BY cat;
 t0 contains 100M rows; t1...t8 contain 100K rows each (like a star schema)
PG-Strom microbenchmark with JOIN/GROUP BY: query response time [sec]
Number of joined tables   PostgreSQL v9.5   PG-Strom v1.0
2                         40.44             9.96
3                         62.79             9.93
4                         79.82             9.96
5                         100.50            9.99
6                         119.51            10.02
7                         144.55            10.03
8                         201.12            9.98
9                         248.95            10.00
Environment: CPU: Xeon E5-2670v3, GPU: GTX1080, RAM: 384GB, OS: CentOS 7.2, DB: PostgreSQL 9.5 + PG-Strom v1.0
Feedback from users during v1.0 development
[Figure: the architecture diagram (Application, SQL Parser, Query Optimizer, Query Executor, Storage Manager, Storage, PG-Strom extension, GPU) annotated with two classes of workloads.]
▌Heavy computing-intensive workloads
• In-database analytics
• Scientific R&D, marketing, ...
 addressed by PL/CUDA + Matrix-Array
▌Heavy I/O-intensive workloads
• Generic large OLAP
• ETL, reporting, ...
 addressed by SSD-to-GPU P2P DMA
Introduction of PL/CUDA
Our Failure (1/3) – Write an algorithm logic in SQL
Apr-2016
Our Failure (2/3) – Performance benefit (?)
Apr-2016
Our Failure (3/3) – Problems
▌Problem 1 – Who writes algorithms in SQL?
 Most mathematical algorithms are developed in the style of procedural programming languages.
 Users do not need to write the algorithm logic in CUDA, but they still have to express it as an SQL puzzle.
▌Problem 2 – Performance benefit
 PG-Strom is much faster than PostgreSQL at executing the core logic of the min-max method. It is an excellent result, but nobody knows how many people run this algorithm on PostgreSQL.
 Performance of GpuProjection is almost equivalent to a CPU implementation designed for the particular problem. Why?
  Inefficient code due to SQL compatibility
  Inefficient data format due to PostgreSQL's row format
Our Answer – PL/CUDA + Matrix-like Array
CREATE FUNCTION my_logic(matrix, matrix)
RETURNS vector
AS $$
  /* user-defined CUDA code block */
$$ LANGUAGE 'plcuda';
▌PL/CUDA: a method for manual optimization
[Figure: the function's arguments are loaded from storage into the GPU kernel, which runs the user-defined CUDA code block; the result set is written back and post-processed by SQL (table JOIN, window function, ORDER BY, GROUP BY, etc.).]
▌Matrix-like Array: a 2D array without NULL, to represent a matrix
Memory layout: ArrayType header | a1 a2 … aN | b1 b2 … bN | c1 c2 … cN | d1 d2 … dN
This represents a matrix of 4 cols x N rows:
  a1 … d1
  ⋮  ⋱  ⋮
  aN … dN
Example of PL/CUDA function definition
CREATE OR REPLACE FUNCTION
knn_gpu_similarity(int, int[], int[])
RETURNS float4[]
AS $$
#plcuda_begin
  /* CUDA code block */
  cl_int k = arg1.value;
  MatrixType *Q = (MatrixType *) arg2.value;
  MatrixType *D = (MatrixType *) arg3.value;
  MatrixType *R = (MatrixType *) results;
    :
  nloops = (ARRAY_MATRIX_HEIGHT(Q) + (part_sz - k - 1)) / (part_sz - k);
  for (loop=0; loop < nloops; loop++) {
    /* 1. calculation of the similarity */
    for (i = get_local_id(); i < part_sz * part_nums; i += get_local_size()) {
      j = i % part_sz;    /* index within partition */
      /* index of database matrix (D) */
      dindex = part_nums * get_global_index() + (i / part_sz);
      /* index of query matrix (Q) */
      qindex = loop * (part_sz - k) + (j - k);
      values[i] = knn_similarity_compute(D, dindex, Q, qindex);
    }
  }
    :
#plcuda_end
$$ LANGUAGE 'plcuda';
How GPU Kernels are built
CREATE OR REPLACE FUNCTION
my_cuda_func(float[])
RETURNS int[]
AS $$
#plcuda_sanity_check func_sc
#plcuda_begin
  /* user-defined CUDA code block */
#plcuda_end
#plcuda_working_bufsz func_wb
$$ LANGUAGE 'plcuda';
 bool func_sc(float[]): helper function for sanity check
 bigint func_wb(float[]): helper function for buffer size estimation
[Figure: the source program of the GPU code is assembled from the user-defined CUDA code block, the common library routines, and the input/output of SQL data. The run-time compiler turns it into a GPU binary; the GPU kernel then loads the input arguments and uses the working buffer.]
Why automatically generated code was not the best for performance
 NULL checks for each variable reference
 Overflow checks for each primitive operator
 Function calls instead of primitive operators
Example: code generated for "select x*y from c_test;"
STATIC_FUNCTION(pg_float4_t)
pgfn_float4mul(kern_context *kcxt, pg_float4_t arg1, pg_float4_t arg2)
{
    pg_float4_t result;

    result.isnull = arg1.isnull | arg2.isnull;
    if (!result.isnull)
    {
        result.value = arg1.value * arg2.value;
        CHECKFLOATVAL(&kcxt->e, result,
                      isinf(arg1.value) || isinf(arg2.value),
                      arg1.value == 0.0 || arg2.value == 0.0);
    }
    return result;
}
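The cost of SQL compatibility is easier to see next to the bare operation. The sketch below mirrors the structure of the generated function in plain Python, under two assumptions: Python floats are 64-bit rather than float4, and the range check follows the usual PostgreSQL rule (an infinite result from finite inputs is an overflow, a zero result from non-zero inputs is an underflow). The class and function names are illustrative only.

```python
import math
from dataclasses import dataclass


@dataclass
class NullableFloat:
    """A value plus a NULL flag, in the spirit of pg_float4_t."""
    value: float = 0.0
    isnull: bool = False


class FloatRangeError(ArithmeticError):
    """Raised where SQL would report 'value out of range'."""


def checked_mul(a, b):
    """SQL-compatible multiply: NULL propagation plus range checks."""
    if a.isnull or b.isnull:
        return NullableFloat(isnull=True)
    value = a.value * b.value
    if math.isinf(value) and not (math.isinf(a.value) or math.isinf(b.value)):
        raise FloatRangeError("value out of range: overflow")
    if value == 0.0 and not (a.value == 0.0 or b.value == 0.0):
        raise FloatRangeError("value out of range: underflow")
    return NullableFloat(value)


def raw_mul(a, b):
    """What a hand-written kernel on a NULL-free matrix can do instead."""
    return a * b


if __name__ == "__main__":
    print(checked_mul(NullableFloat(2.0), NullableFloat(3.0)))
    print(checked_mul(NullableFloat(2.0), NullableFloat(isnull=True)))
    print(raw_mul(2.0, 3.0))
```

A hand-written kernel that knows its input contains no NULL and stays in range can skip every branch of `checked_mul`, which is one motivation for writing the code block manually.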
Disadvantage because of data format (1/2)
▌Row-oriented data
× includes unreferenced values
× many steps for data references
〇 common data format with PostgreSQL
▌Column-oriented data
〇 load only referenced variables
〇 data reference by 1 step
× needs data format exchange
[Figure: with row-oriented data, each GPU core walks through whole rows (a, b, c, d, e, f) to reach the referenced values; with column-oriented data, the GPU cores read only the referenced columns b and e.]
The usual SQL workload cannot justify the cost of data transformation, but advanced algorithm processing is improved by the columnar format.
Disadvantage because of data format (2/2)
▌Case of random memory access
 More memory transactions, and a lower usage rate of the data bus
 Only 32bit x 1 = 32bit is valid data in a 256bit-wide memory transaction (bus usage ratio: 12.5%)
▌Case of coalesced memory access
 Fewest memory transactions, and the maximum usage rate of the data bus
 32bit x 8 = 256bit is valid data in a 256bit-wide memory transaction (bus usage ratio: 100.0%)
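The two bus usage ratios follow from simple arithmetic, sketched below. The 32-bit word and 256-bit transaction width come from the slide; the function names and the address-based counting model are simplifying assumptions for illustration, not a description of any particular GPU's memory controller.

```python
def bus_usage_ratio(valid_words, word_bits=32, transaction_bits=256):
    """Fraction of one memory transaction that carries requested data."""
    return valid_words * word_bits / transaction_bits


def transactions_needed(byte_addresses, transaction_bytes=32):
    """Count the distinct aligned 256-bit (32-byte) segments touched."""
    return len({addr // transaction_bytes for addr in byte_addresses})


if __name__ == "__main__":
    # Eight threads reading adjacent 4-byte values: one transaction, fully used.
    coalesced = [4 * thread for thread in range(8)]
    # Eight threads reading values 128 bytes apart: one transaction per thread.
    scattered = [128 * thread for thread in range(8)]

    print(transactions_needed(coalesced), bus_usage_ratio(8))   # 1 1.0
    print(transactions_needed(scattered), bus_usage_ratio(1))   # 8 0.125
```

A column-oriented matrix places the values read by neighbouring threads next to each other, which is the coalesced case.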
2D-Array as Matrix
▌datatype[] array_matrix(variadic datatype[])
 An aggregate function that constructs a 2D-array from the input stream.
 datatype is one of int2, int4, int8, float4 or float8.
 The 2D-array does not contain NULL.
▌SETOF record matrix_unnest(datatype[])
 Deforms a 2D-array into a stream of records.
▌Downside
 Unable to handle variable-length data types
 1GB limit of the varlena data type in PostgreSQL
▌Matrix-like Array: a 2D array without NULL, to represent a matrix
Memory layout: ArrayType header | a1 a2 … aN | b1 b2 … bN | c1 c2 … cN | d1 d2 … dN (matrix of 4 cols x N rows)
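The layout shown above stores each column contiguously, so element (row i, column j) of an N-row matrix sits at offset j * N + i after the header. The sketch below models that in plain Python; the function names echo `array_matrix` and `matrix_unnest` from the slide but are stand-ins written for illustration, not the PG-Strom implementations.

```python
def array_matrix(rows):
    """Pack equal-length, NULL-free rows into (nrows, ncols, column-major data)."""
    nrows, ncols = len(rows), len(rows[0])
    for row in rows:
        if len(row) != ncols or any(value is None for value in row):
            raise ValueError("rows must have equal length and contain no NULL")
    data = [rows[i][j] for j in range(ncols) for i in range(nrows)]
    return nrows, ncols, data


def matrix_get(matrix, i, j):
    """Element at row i, column j: one multiply-add, no per-row parsing."""
    nrows, _ncols, data = matrix
    return data[j * nrows + i]


def matrix_unnest(matrix):
    """Turn the packed matrix back into a list of row tuples."""
    nrows, ncols, _data = matrix
    return [tuple(matrix_get(matrix, i, j) for j in range(ncols))
            for i in range(nrows)]


if __name__ == "__main__":
    m = array_matrix([(1, 10.0), (2, 20.0), (3, 30.0)])
    print(m[2])                  # [1, 2, 3, 10.0, 20.0, 30.0]
    print(matrix_get(m, 1, 1))   # 20.0
    print(matrix_unnest(m))      # [(1, 10.0), (2, 20.0), (3, 30.0)]
```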
Example to call PL/CUDA function
SELECT row_number() OVER (),
       float4_as_int4(R.key_id) key_id,
       R.score
  FROM matrix_unnest(                              -- deform array-matrix to generic records
        (SELECT my_plcuda_function(A.matrix,       -- invocation of PL/CUDA function
                                   B.matrix)       --   with two matrix arguments
           FROM (SELECT cbind(array_matrix(id),    -- construct an array-matrix,
                              array_matrix(x, y, z)) matrix
                   FROM normal_table
                  WHERE tag LIKE '%abc%') A,
                (SELECT matrix                     -- or load a pre-built array-matrix
                   FROM matrix_table) B
        )
       ) AS R(key_id real, score real)
 ORDER BY score DESC                               -- post-process by SQL (JOIN, window function)
 LIMIT 1000;
Case Study
similarity search on drug discovery
Background – relationship of disease and chemical compounds
[Figure: a target disease, its relevant protein, and chemical compounds (= drug candidates) that are inactive, active, or active but toxic; the source of this knowledge is academic papers.]
Discovery of chemical compounds which are "active" to the target protein
k-NN Similarity Search on Chemical Compounds (1/2)
[Figure: a query chemical compounds set (Q; ~1000 records scale), made of compounds picked up from academic papers as active to the target protein, is searched by similarity against the database chemical compounds set (D; 10M records scale).]
"Similar compounds" will have a higher probability of being active.
k-NN Similarity Search on Chemical Compounds (2/2)
Similarity is the definition of distance.
Data structure of the chemical compounds:
ID   NAME           Fingerprint (1024bit)
1    CHEMBL153534   000000000001000000100000000000000100000000000001000000...
2    CHEMBL405398   000000000000000100100000000000000000000000000000100000...
3    CHEMBL503634   000001000000000000000000001000000100000000000000000000...
:    :              :
Similarity by Jaccard index:
$$\mathrm{Similarity}(A, B) = \frac{|A_{FP} \cap B_{FP}|}{|A_{FP} \cup B_{FP}|}$$
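For bit-string fingerprints, the Jaccard index reduces to two population counts: the bits set in both fingerprints divided by the bits set in either. A minimal sketch, assuming each fingerprint is held as a Python integer (the slide's function stores it as an int[] bitmap instead) and treating two all-zero fingerprints as similarity 0.0, which is a choice made here rather than something the slide specifies:

```python
def jaccard_similarity(fp_a, fp_b):
    """Jaccard index of two fingerprints held as non-negative integers."""
    union_bits = bin(fp_a | fp_b).count("1")
    if union_bits == 0:
        return 0.0  # assumption: two empty fingerprints count as dissimilar
    common_bits = bin(fp_a & fp_b).count("1")
    return common_bits / union_bits


if __name__ == "__main__":
    a = int("1100", 2)
    b = int("1010", 2)
    # One bit in common, three bits set in total.
    print(jaccard_similarity(a, b))
```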
Scale of the computing
[Figure: for each compound d_i of the database chemical compounds set (D; 10M records scale), the distance between the query set Q and d_i is the average of the top-3 similarity scores against the Q-compounds; the same applies to d_j.]
Order of calculation:
$$O(Q \times D) + O(D \times Q \log Q)$$
(make distance) (sorting + average)
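The two terms correspond to scoring every (Q, D) pair and then reducing each D-compound's scores to a top-k average. The sketch below shows only that reduction and the pair count; it uses `heapq.nlargest` in place of the full sort described in the slide, and the function names are illustrative.

```python
import heapq


def pair_count(num_query, num_database):
    """Number of similarity scores to compute: one per (Q, D) pair."""
    return num_query * num_database


def top_k_average(similarities, k=3):
    """Score of one D-compound: mean of its k best similarities to the Q-set."""
    if not similarities:
        raise ValueError("at least one similarity score is required")
    best = heapq.nlargest(k, similarities)
    return sum(best) / len(best)


if __name__ == "__main__":
    # 1000 query compounds against 10M database compounds.
    print(pair_count(1000, 10_000_000))
    # Scores of one database compound against four query compounds.
    print(top_k_average([0.1, 0.9, 0.5, 0.7], k=3))
```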
Implementation of PL/CUDA function (1/3)
Step-1
Split all the logical combinations of Q and D into multiple partitions, then assign them to the SMMs, the execution units of the GPU.
Step-2
Each GPU core calculates a similarity score between a Q-compound and a D-compound, then stores the score in "shared memory", which is fast and close to the SMM.
Implementation of PL/CUDA function (2/3)
Step-3
Bitonic-sort by the similarity score, then reorder the Q-compounds.
Step-4
Repeat from Step-2 if the number of Q-compounds is larger than the shared memory size. The top-k items are kept.
Step-5
Take the average of the top-k values, then store it in the result buffer.
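Step-3 relies on bitonic sort because every compare-and-swap within a stage is independent, so each one can be handled by its own GPU thread. The sketch below is a sequential Python rendering of the standard bitonic sorting network, sorting in descending order so the best scores come first. It is not the `knn_similarity_sorting` routine from the slides, and it assumes the input length is a power of two.

```python
def bitonic_sort_descending(values):
    """Sort a power-of-two-length sequence with a bitonic sorting network."""
    n = len(values)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    data = list(values)
    size = 2
    while size <= n:                 # size of the bitonic sequences being merged
        stride = size // 2
        while stride > 0:            # one stage of the network
            # On a GPU, every iteration of this loop would be one thread.
            for i in range(n):
                partner = i ^ stride
                if partner > i:
                    descending = (i & size) == 0
                    if (data[i] < data[partner]) == descending:
                        data[i], data[partner] = data[partner], data[i]
            stride //= 2
        size *= 2
    return data


if __name__ == "__main__":
    scores = [0.3, 0.9, 0.1, 0.7, 0.5, 0.2, 0.8, 0.4]
    print(bitonic_sort_descending(scores))
```

In a kernel, each stage would be separated by a barrier such as `__syncthreads()`, since a stage reads values written by the previous one.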
Implementation of PL/CUDA function (3/3)
real[]                        -- ID+Similarity of D-compounds (2xN)
knn_gpu_similarity(int k,     -- k-value
                   int[] Q,   -- ID+Fingerprint of Q-compounds (33xM)
                   int[] D);  -- ID+Fingerprint of D-compounds (33xN)

CREATE OR REPLACE FUNCTION
knn_gpu_similarity(int,    -- k-value
                   int[],  -- ID+bitmap of Q
                   int[])  -- ID+bitmap of D
RETURNS float4[]           -- result: ID+similarity
AS $$
#plcuda_decl
  :
#plcuda_begin
#plcuda_kernel_blocksz \
  knn_gpu_similarity_main_block_size
#plcuda_num_threads \
  knn_gpu_similarity_main_num_threads
#plcuda_shmem_blocksz 8192
  cl_int k = arg1.value;
  MatrixType *Q = (MatrixType *) arg2.value;
  MatrixType *D = (MatrixType *) arg3.value;
  MatrixType *R = (MatrixType *) results;
    :
  for (loop=0; loop < nloops; loop++)
  {
    /* 1. calculation of the similarity */
    for (i = get_local_id();
         i < part_sz * part_nums;
         i += get_local_size()) {
      j = i % part_sz;    /* index within partition */
      dindex = part_nums * get_global_index()
             + (i / part_sz);
      qindex = loop * (part_sz - k) + (j - k);
      if (dindex < ARRAY_MATRIX_HEIGHT(D) &&
          qindex < ARRAY_MATRIX_HEIGHT(Q)) {
        values[i] = knn_similarity_compute(D, dindex,
                                           Q, qindex);
      }
    }
    __syncthreads();
    /* 2. sorting by the similarity for each partition */
    knn_similarity_sorting(values, part_sz, part_nums);
    __syncthreads();
      :
  }
#plcuda_end
#plcuda_sanity_check knn_gpu_similarity_sanity_check
#plcuda_working_bufsz 0
#plcuda_results_bufsz knn_gpu_similarity_results_bufsz
$$ LANGUAGE 'plcuda';
Invocation of PL/CUDA function
PREPARE knn_sim_rand_10m_gpu_v2(int)  -- arg1:@k-value
AS
SELECT row_number() OVER (),
       fp.name,
       similarity
  FROM (SELECT float4_as_int4(key_id) key_id, similarity
          FROM matrix_unnest(
                (SELECT rbind(knn_gpu_similarity($1, Q.matrix,
                                                     D.matrix))
                   FROM (SELECT cbind(array_matrix(id),
                                      array_matrix(bitmap)) matrix
                           FROM finger_print_query) Q,
                        (SELECT matrix
                           FROM finger_print_10m_matrix) D
                )
               ) AS sim(key_id real, similarity real)
         ORDER BY similarity DESC) sim,
       finger_print_10m fp
 WHERE fp.id = sim.key_id
 LIMIT 1000;
▌What each part does
 cbind(array_matrix(...)): transforms the records read from tables into the Array-Matrix type. (Pre-building is also possible.)
 knn_gpu_similarity(...): execution of the PL/CUDA function with the Q- and D-matrices as arguments.
 matrix_unnest(...): transforms the Array-Matrix (2xN) returned by the PL/CUDA function into usual record data of N rows.
 Outer query: post-processing by SQL, such as looking up compound names by compound ID (table JOIN) and ranking the similarity with a window function.
Performance results
 For comparison with the CPU case, we implemented an equivalent SQL function in C.
 The number of D-compounds is 10M records; the number of Q-compounds is 10, 50, 100, 500 and 1000.
 Up to 10B combinations are searched, almost equivalent in size to real drug discovery research.
 HW) CPU: Xeon E5-2670v3, GPU: GTX980 / GTX1080, RAM: 384GB
 SW) CentOS7, CUDA8.0, PostgreSQL v9.5 + PG-Strom v1.0
Similarity search of chemical compounds by k-NN method (k=3, D=10M): query response time [sec]
Number of query compounds [Q]   CPU (E5-2670v3)   GTX980   GTX1080
10                              30.25             12.97    13.00
50                              145.29            13.46    13.23
100                             295.95            13.90    13.59
500                             1503.31           18.16    16.01
1000                            3034.94           24.65    19.13
x150 times faster!
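The speedup at each query size can be recomputed from the measured response times. The sketch below assumes the three series are, in legend order, CPU, GTX980 and GTX1080; the variable and function names are illustrative.

```python
# Query response times in seconds, indexed by number of query compounds.
QUERY_SIZES = [10, 50, 100, 500, 1000]
CPU_SEC = [30.25, 145.29, 295.95, 1503.31, 3034.94]
GTX980_SEC = [12.97, 13.46, 13.90, 18.16, 24.65]
GTX1080_SEC = [13.00, 13.23, 13.59, 16.01, 19.13]


def speedups(baseline, accelerated):
    """Element-wise ratio of baseline time to accelerated time."""
    return [b / a for b, a in zip(baseline, accelerated)]


if __name__ == "__main__":
    rows = zip(QUERY_SIZES,
               speedups(CPU_SEC, GTX980_SEC),
               speedups(CPU_SEC, GTX1080_SEC))
    for size, s980, s1080 in rows:
        print(f"Q={size:>4}: GTX980 x{s980:6.1f}, GTX1080 x{s1080:6.1f}")
```

The GPU times barely grow with Q, so the ratio is small for 10 query compounds and reaches the order of the "x150" callout only at Q=1000.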
Another Usage
k-means clustering in database system
Clustering Analysis
k-means clustering algorithm
1. Assign each item to a cluster randomly.
2. Make a centroid for each cluster.
3. For each item, choose the cluster with the nearest centroid.
4. Repeat until convergence.
5. Make a centroid for each cluster again.
6. All the clusters get fully converged.
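The steps above can be sketched on the CPU in a few lines. This is a plain-Python illustration of the algorithm, not the `gpu_kmeans` PL/CUDA function; the handling of an empty cluster (re-seeding it from a random point) is a choice made for this sketch, and the default `max_iter` and `seed` simply echo the defaults in the function signature shown in the next slide.

```python
import random


def kmeans(points, k, max_iter=10, seed=1):
    """Lloyd-style k-means; returns (cluster label per point, centroids)."""
    rng = random.Random(seed)
    dim = len(points[0])
    # 1. assign each point to a random cluster
    labels = [rng.randrange(k) for _ in points]
    centroids = []
    for _ in range(max_iter):
        # 2. make a centroid for each cluster
        sums = [[0.0] * dim for _ in range(k)]
        counts = [0] * k
        for point, cid in zip(points, labels):
            counts[cid] += 1
            for axis in range(dim):
                sums[cid][axis] += point[axis]
        centroids = [
            [total / counts[cid] for total in sums[cid]] if counts[cid]
            else list(rng.choice(points))   # assumption: re-seed empty clusters
            for cid in range(k)
        ]
        # 3. move each point to the cluster with the nearest centroid
        new_labels = [
            min(range(k),
                key=lambda cid: sum((point[a] - centroids[cid][a]) ** 2
                                    for a in range(dim)))
            for point in points
        ]
        # 4. repeat until no assignment changes
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    print(kmeans(data, k=2))
```

The accumulation loop in step 2 is the part the CUDA kernel parallelizes with `atomicAdd`, first into a local buffer and then into the global centroid matrix.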
k-means clustering on PL/CUDA
CREATE OR REPLACE FUNCTION
gpu_kmeans(real[],    -- ID + Data Matrix
           int,       -- k-value (number of clusters)
           int = 10,  -- max number of iteration
           int = 1)   -- seed of initial randomness
RETURNS int[]
AS $$
#plcuda_decl
  :
KERNEL_FUNCTION_MAXTHREADS(void)
update_centroid(MatrixType *D,
                MatrixType *R,
                MatrixType *C)
{
    :
  /* accumulate the local centroid */
  for (did = get_global_id();
       did < nitems;
       did += get_global_size())
  {
    /* pick up the target cluster */
    cid = r_values[nitems + did];
    atomicAdd(&l_cent[cid], 1.0);
    for (index=1; index < width; index++)
      atomicAdd(&l_cent[index * k_value + cid],
                d_values[index * nitems + did]);
  }
  __syncthreads();
  /* write back to the global C-matrix */
  for (index = get_local_id();
       index < width * k_value;
       index += get_local_size())
    atomicAdd(&c_values[index], l_cent[index]);
}
  :
#plcuda_begin
  :
  status = pgstromLaunchDynamicKernel4((void *)
                                       setup_initial_cluster,
                                       (kern_arg_t)(D),
                                       (kern_arg_t)(R),
                                       (kern_arg_t)(C),
                                       (kern_arg_t)(r_seed),
                                       nitems, 0, 0);
  if (status != cudaSuccess)
    PLCUDA_RUNTIME_ERROR_RETURN(status);
  for (loop=0; loop < nloops; loop++)
  {
      :
    status = pgstromLaunchDynamicKernelMaxThreads3(
                 (void *)kmeans_update_cluster,
                 (kern_arg_t)(D),
                 (kern_arg_t)(R),
                 (kern_arg_t)(C),
                 (kern_arg_t)k_value,
                 nitems, 0,
                 sizeof(cl_int) + sizeof(cl_float));
    if (status != cudaSuccess)
      PLCUDA_RUNTIME_ERROR_RETURN(status);
      :
  }
#plcuda_sanity_check gpu_kmeans_sanity_check
#plcuda_working_bufsz gpu_kmeans_working_bufsz
#plcuda_results_bufsz gpu_kmeans_results_bufsz
#plcuda_end
$$ LANGUAGE 'plcuda';
Test data for k-means clustering
▌Dataset overview
 A collection of datasets of vehicle traffic, observed between two points for a set duration of time over a period of 6 months
 449 observation points in Aarhus, Denmark
 13.5M records from Feb to June of 2014
▌Data contains
 average speed
 average measured time
 number of vehicles
 latitude/longitude of the observation points
 etc...
▌What we did
 Categorized the road zones into 5 classes according to the characteristics of the vehicles' running style.
Invocation of GPU k-means function
SELECT report_id, k, c
  FROM (SELECT report_id, k, c,
               row_number() OVER (PARTITION BY report_id
                                  ORDER BY c DESC) rank    -- pick up the most frequent cluster
          FROM (SELECT report_id, k, count(*) c
                  FROM matrix_unnest(
                        (SELECT gpu_kmeans(array_matrix(    -- make a matrix from the raw data,
                                             int4_as_float4(report_id),
                                             avg_measured_time,
                                             avg_speed,
                                             vehicle_count),
                                           5)               -- then run the k-means clustering logic
                           FROM tr_rawdata)
                       ) R(report_id int, k int)
                 GROUP BY report_id, k
               ) __summary_1
       ) __summary_2
 WHERE rank = 1;
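The outer part of the query, which counts cluster labels per `report_id` and keeps the top-ranked one, can be mirrored in a few lines of Python. This is a sketch of that post-processing only; the input format (an iterable of `(report_id, k)` pairs, as produced by `matrix_unnest`) and the function name are assumptions for illustration.

```python
from collections import Counter, defaultdict


def most_frequent_cluster(assignments):
    """Map each report_id to (cluster, count) of its most frequent cluster."""
    counts = defaultdict(Counter)
    for report_id, cluster in assignments:
        counts[report_id][cluster] += 1      # GROUP BY report_id, k
    # rank = 1 within each report_id, ordered by count descending
    return {report_id: counter.most_common(1)[0]
            for report_id, counter in counts.items()}


if __name__ == "__main__":
    rows = [(1, 0), (1, 0), (1, 2), (2, 4), (2, 4), (2, 4), (2, 1)]
    print(most_frequent_cluster(rows))
```

As with `row_number()` in the SQL version, ties between equally frequent clusters are broken arbitrarily.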
GPU k-means (1/3) – clustering by all the data
$ wget -O map.png "`psql traffic -At -f ~/traffic.sql`"
[Figure: map of the clustering result, annotated "bypass highway?", "Road towards downtown?" and "Beltway?"]
GPU k-means (2/3) – Daytime and Nighttime
[Figure: maps of the clustering results for daytime (8-17) and nighttime (18-7)]
GPU k-means (3/3) – Weekdays and Weekend
[Figure: maps of the clustering results for weekdays and weekend]
Invocation of GPU k-means function
SELECT report_id, k, c
  FROM (SELECT report_id, k, c,
               row_number() OVER (PARTITION BY report_id
                                  ORDER BY c DESC) rank
          FROM (SELECT report_id, k, count(*) c
                  FROM matrix_unnest(
                        (SELECT gpu_kmeans(array_matrix(
                                             int4_as_float4(report_id),
                                             avg_measured_time,
                                             avg_speed,
                                             vehicle_count),
                                           5)
                           FROM tr_rawdata
                          WHERE extract('hour' from timestamp)
                                between 7 and 17
                        )
                       ) R(report_id int, k int)
                 GROUP BY report_id, k
               ) __summary_1
       ) __summary_2
 WHERE rank = 1;
Just add a WHERE condition to select a different input data set. Flexibility of SQL!
Summary
▌What is PL/CUDA
 The original concept of PG-Strom is automatic optimization.
 PL/CUDA instead draws out the full capability of the GPU through manual optimization.
 Most likely, nobody writes advanced algorithms in SQL :-)
▌Advantages
 TFLOPS-grade computing engine for in-database analytics
 No need to export the entire dataset for analytics by external applications
 Allows utilizing the flexibility of SQL for pre-/post-processing of the core analytics algorithms
▌Future Challenges
 Data sizes larger than 1GB, because of the varlena restriction in PostgreSQL
 Asynchronous execution and CPU parallelism
 Time to construct a matrix; if the data is mostly static, the matrix can be constructed in advance.
Resources
▌Repository
 https://github.com/pg-strom/devel
▌Today's Slides
 http://www.slideshare.net/kaigai/pgconfsv2016-plcuda
▌Contact
 kaigai@ak.jp.nec.com
 Tw: @kkaigai
Questions?

More Related Content

PDF
PG-Strom - GPGPU meets PostgreSQL, PGcon2015
PDF
pgconfasia2016 plcuda en
PDF
SQL+GPU+SSD=∞ (English)
PDF
20170602_OSSummit_an_intelligent_storage
PDF
GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~
PDF
20160407_GTC2016_PgSQL_In_Place
PDF
PG-Strom
PDF
PG-Strom - GPU Accelerated Asyncr
PG-Strom - GPGPU meets PostgreSQL, PGcon2015
pgconfasia2016 plcuda en
SQL+GPU+SSD=∞ (English)
20170602_OSSummit_an_intelligent_storage
GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~
20160407_GTC2016_PgSQL_In_Place
PG-Strom
PG-Strom - GPU Accelerated Asyncr

What's hot (20)

PDF
GPU/SSD Accelerates PostgreSQL - challenge towards query processing throughpu...
PDF
Let's turn your PostgreSQL into columnar store with cstore_fdw
PDF
20150318-SFPUG-Meetup-PGStrom
PPTX
Applying of the NVIDIA CUDA to the video processing in the task of the roundw...
PDF
GPGPU Accelerates PostgreSQL (English)
PDF
PostgreSQL with OpenCL
PPTX
GPGPU programming with CUDA
PPTX
Debugging CUDA applications
PDF
20171206 PGconf.ASIA LT gstore_fdw
PDF
Parallel Implementation of K Means Clustering on CUDA
PDF
20201006_PGconf_Online_Large_Data_Processing
PDF
20181212 - PGconfASIA - LT - English
PPTX
Parallel K means clustering using CUDA
PDF
PG-Strom v2.0 Technical Brief (17-Apr-2018)
PDF
計算力学シミュレーションに GPU は役立つのか?
PDF
20181025_pgconfeu_lt_gstorefdw
PDF
PG-Strom - A FDW module utilizing GPU device
PDF
20210301_PGconf_Online_GPU_PostGIS_GiST_Index
PDF
How to Burn Multi-GPUs using CUDA stress test memo
PDF
20181016_pgconfeu_ssd2gpu_multi
GPU/SSD Accelerates PostgreSQL - challenge towards query processing throughpu...
Let's turn your PostgreSQL into columnar store with cstore_fdw
20150318-SFPUG-Meetup-PGStrom
Applying of the NVIDIA CUDA to the video processing in the task of the roundw...
GPGPU Accelerates PostgreSQL (English)
PostgreSQL with OpenCL
GPGPU programming with CUDA
Debugging CUDA applications
20171206 PGconf.ASIA LT gstore_fdw
Parallel Implementation of K Means Clustering on CUDA
20201006_PGconf_Online_Large_Data_Processing
20181212 - PGconfASIA - LT - English
Parallel K means clustering using CUDA
PG-Strom v2.0 Technical Brief (17-Apr-2018)
計算力学シミュレーションに GPU は役立つのか?
20181025_pgconfeu_lt_gstorefdw
PG-Strom - A FDW module utilizing GPU device
20210301_PGconf_Online_GPU_PostGIS_GiST_Index
How to Burn Multi-GPUs using CUDA stress test memo
20181016_pgconfeu_ssd2gpu_multi
Ad

Viewers also liked (18)

PPTX
Schulung: Einführung in das GPU-Computing mit NVIDIA CUDA
PDF
GPU, CUDA, OpenCL and OpenACC for Parallel Applications
PDF
PL/CUDA - GPU Accelerated In-Database Analytics
PDF
CUDA-Aware MPI
PPTX
Acceleration of Statistical Detection of Zero-day Malware in the Memory Dump ...
PDF
Implementazione di un vincolo table su un CSP solver GPU-based
PDF
20170127 JAWS HPC-UG#8
PDF
pgconfasia2016 lt ssd2gpu
PDF
20170310_InDatabaseAnalytics_#1
PDF
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PDF
ICISA 2010 Conference Presentation
PDF
SQL+GPU+SSD=∞ (Japanese)
PDF
An Intelligent Storage?
PDF
Visual Design with Data
PDF
3 Things Every Sales Team Needs to Be Thinking About in 2017
PDF
How to Become a Thought Leader in Your Niche
PDF
Actor Model and C++: what, why and how?
PDF
並列クエリを実行するPostgreSQLのアーキテクチャ
Schulung: Einführung in das GPU-Computing mit NVIDIA CUDA
GPU, CUDA, OpenCL and OpenACC for Parallel Applications
PL/CUDA - GPU Accelerated In-Database Analytics
CUDA-Aware MPI
Acceleration of Statistical Detection of Zero-day Malware in the Memory Dump ...
Implementazione di un vincolo table su un CSP solver GPU-based
20170127 JAWS HPC-UG#8
pgconfasia2016 lt ssd2gpu
20170310_InDatabaseAnalytics_#1
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
ICISA 2010 Conference Presentation
SQL+GPU+SSD=∞ (Japanese)
An Intelligent Storage?
Visual Design with Data
3 Things Every Sales Team Needs to Be Thinking About in 2017
How to Become a Thought Leader in Your Niche
Actor Model and C++: what, why and how?
並列クエリを実行するPostgreSQLのアーキテクチャ
Ad

Similar to PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics (20)

PDF
Technology Updates of PG-Strom at Aug-2014 (PGUnconf@Tokyo)
PDF
PGConf.ASIA 2019 Bali - Full-throttle Running on Terabytes Log-data - Kohei K...
PDF
20190909_PGconf.ASIA_KaiGai
PDF
20180920_DBTS_PGStrom_EN
PDF
20181116 Massive Log Processing using I/O optimized PostgreSQL
PDF
PPTX
Revisiting Co-Processing for Hash Joins on the Coupled Cpu-GPU Architecture
PDF
20181210 - PGconf.ASIA Unconference
PDF
SQream DB - Bigger Data On GPUs: Approaches, Challenges, Successes
PDF
Pgopencl
PDF
Problems with PostgreSQL on Multi-core Systems with MultiTerabyte Data
PDF
GPUs in Big Data - StampedeCon 2014
PDF
Lessons PostgreSQL learned from commercial databases, and didn’t
PDF
Accelerating Real Time Applications on Heterogeneous Platforms
PDF
CUDA vs OpenCL
PPTX
Introduction to Accelerators
PDF
PostgreSQL Prologue
PPT
GPU_based Searching
PPT
Harnessing OpenCL in Modern Coprocessors
PDF
CUDA-Python and RAPIDS for blazing fast scientific computing
Technology Updates of PG-Strom at Aug-2014 (PGUnconf@Tokyo)
PGConf.ASIA 2019 Bali - Full-throttle Running on Terabytes Log-data - Kohei K...
20190909_PGconf.ASIA_KaiGai
20180920_DBTS_PGStrom_EN
20181116 Massive Log Processing using I/O optimized PostgreSQL
Revisiting Co-Processing for Hash Joins on the Coupled Cpu-GPU Architecture
20181210 - PGconf.ASIA Unconference
SQream DB - Bigger Data On GPUs: Approaches, Challenges, Successes
Pgopencl
Problems with PostgreSQL on Multi-core Systems with MultiTerabyte Data
GPUs in Big Data - StampedeCon 2014
Lessons PostgreSQL learned from commercial databases, and didn’t
Accelerating Real Time Applications on Heterogeneous Platforms
CUDA vs OpenCL
Introduction to Accelerators
PostgreSQL Prologue
GPU_based Searching
Harnessing OpenCL in Modern Coprocessors
CUDA-Python and RAPIDS for blazing fast scientific computing

More from Kohei KaiGai (20)

PDF
20221116_DBTS_PGStrom_History
PDF
20221111_JPUG_CustomScan_API
PDF
20211112_jpugcon_gpu_and_arrow
PDF
20210928_pgunconf_hll_count
PDF
20210731_OSC_Kyoto_PGStrom3.0
PDF
20210511_PGStrom_GpuCache
PDF
20201128_OSC_Fukuoka_Online_GPUPostGIS
PDF
20201113_PGconf_Japan_GPU_PostGIS
PDF
20200828_OSCKyoto_Online
PDF
20200806_PGStrom_PostGIS_GstoreFdw
PDF
20200424_Writable_Arrow_Fdw
PDF
20191211_Apache_Arrow_Meetup_Tokyo
PDF
20191115-PGconf.Japan
PDF
20190926_Try_RHEL8_NVMEoF_Beta
PDF
20190925_DBTS_PGStrom
PDF
20190516_DLC10_PGStrom
PDF
20190418_PGStrom_on_ArrowFdw
PDF
20190314 PGStrom Arrow_Fdw
PDF
20181212 - PGconf.ASIA - LT
PDF
20181211 - PGconf.ASIA - NVMESSD&GPU for BigData
20221116_DBTS_PGStrom_History
20221111_JPUG_CustomScan_API
20211112_jpugcon_gpu_and_arrow
20210928_pgunconf_hll_count
20210731_OSC_Kyoto_PGStrom3.0
20210511_PGStrom_GpuCache
20201128_OSC_Fukuoka_Online_GPUPostGIS
20201113_PGconf_Japan_GPU_PostGIS
20200828_OSCKyoto_Online
20200806_PGStrom_PostGIS_GstoreFdw
20200424_Writable_Arrow_Fdw
20191211_Apache_Arrow_Meetup_Tokyo
20191115-PGconf.Japan
20190926_Try_RHEL8_NVMEoF_Beta
20190925_DBTS_PGStrom
20190516_DLC10_PGStrom
20190418_PGStrom_on_ArrowFdw
20190314 PGStrom Arrow_Fdw
20181212 - PGconf.ASIA - LT
20181211 - PGconf.ASIA - NVMESSD&GPU for BigData

PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics

  • 1. PL/CUDA ~Fusion of HPC Grade Power with In-Database Analytics~ The PG-Strom Project / NEC OSS Promotion Center KaiGai Kohei <kaigai@ak.jp.nec.com>
  • 2. About myself. KaiGai Kohei (tw: @kkaigai, https://github.com/kaigai). PostgreSQL: SELinux, FDW, CustomScan, ... PG-Strom: GPU acceleration for PostgreSQL. Works: NEC OSS Promotion Center; development of the software and its business opportunity. (The PG-Strom Project; PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics)
  • 3. PG-Strom Overview (1/2) – Architecture
      Components: Application, SQL Parser, Query Optimizer, Query Executor, Storage Manager, Storage, PG-Strom Extension, GPU. No storage changes; no query changes.
      Features: automatic GPU code generation from the supplied SQL; asynchronous and massively parallel execution on GPUs; WHERE clause, JOIN, GROUP BY, and projection are supported.
      Advantages: transparent acceleration by the power of thousands of cores; low-cost solution for analytic processing.
  • 4. Characteristics of GPU (Graphics Processing Unit)
      Item | GPU | CPU
      Model | NVIDIA Tesla P100 | Intel Xeon E5-2699v4
      Architecture | Pascal | Broadwell
      Launch | Q2-2016 | Q1-2016
      # of transistors | 15 billion | 7.2 billion
      # of cores | 3584 (simple) | 22 (functional)
      Core clock | 1.126GHz ~ 1.303GHz | 2.20GHz ~ 3.60GHz
      Peak FLOPS (FP32) | 9.3 TFLOPS | 1.2 TFLOPS (with AVX2)
      DRAM size | 16GB (HBM2) | max 1.5TB (DDR4)
      Memory bandwidth | 732GB/s | 76.8GB/s
      Power consumption | 250W | 145W
  • 5. The PG-Strom Project PG-Strom Overview (2/2) – GPU binary generation on the fly PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics QUERY: SELECT cat, count(*), avg(x) FROM t0 WHERE x between y and y + 20.0 GROUP BY cat; : STATIC_FUNCTION(bool) gpupreagg_qual_eval(kern_context *kcxt, kern_data_store *kds, size_t kds_index) { pg_float8_t KPARAM_1 = pg_float8_param(kcxt,1); pg_float8_t KVAR_3 = pg_float8_vref(kds,kcxt,2,kds_index); pg_float8_t KVAR_4 = pg_float8_vref(kds,kcxt,3,kds_index); return EVAL((pgfn_float8ge(kcxt, KVAR_3, KVAR_4) && pgfn_float8le(kcxt, KVAR_3, pgfn_float8pl(kcxt, KVAR_4, KPARAM_1)))); } : E.g) Transform of arithmetic operations in the WHERE-clause to CUDA programs Reference to input data SQL expression in CUDA source code Run-time Compiler (nvrtc) Just-in-time Compile Parallel Execution 5
  • 6. GPU accelerates SQL performance
      Test query: SELECT cat, count(*), avg(x) FROM t0 NATURAL JOIN t1 [NATURAL JOIN t2 ...] GROUP BY cat;
      t0 contains 100M rows; t1...t8 contain 100K rows each (like a star schema).
      PG-Strom microbenchmark with JOIN/GROUP BY, query response time [sec]:
      Number of joined tables | PostgreSQL v9.5 | PG-Strom v1.0
      2 | 40.44 | 9.96
      3 | 62.79 | 9.93
      4 | 79.82 | 9.96
      5 | 100.50 | 9.99
      6 | 119.51 | 10.02
      7 | 144.55 | 10.03
      8 | 201.12 | 9.98
      9 | 248.95 | 10.00
      Environment: CPU: Xeon E5-2670v3, GPU: GTX1080, RAM: 384GB, OS: CentOS 7.2, DB: PostgreSQL 9.5 + PG-Strom v1.0
  • 7. Feedback from users during v1.0 development
      Heavy compute-intensive workloads (in-database analytics; scientific R&D, marketing, ...): addressed by PL/CUDA + Matrix-Array.
      Heavy I/O-intensive workloads (generic large OLAP; ETL, reporting, ...): addressed by SSD-to-GPU P2P DMA.
  • 8. Introduction of PL/CUDA
  • 9. Our Failure (1/3) – Write an algorithm logic in SQL (Apr-2016)
  • 10. Our Failure (2/3) – Performance benefit (?) (Apr-2016)
  • 11. Our Failure (3/3) – Problems
      Problem 1 – Who writes algorithms in SQL? The majority of mathematical algorithms are developed in the style of procedural programming languages. Although users don't need to write the algorithm logic in CUDA, they still have to express it as an SQL puzzle.
      Problem 2 – Performance benefit. PG-Strom is much faster than PostgreSQL at executing the core logic of the min-max method. That is an excellent result, but nobody knows how many people run this algorithm on PostgreSQL. The performance of GpuProjection is almost equivalent to a CPU implementation designed for the particular problem. Why? Inefficient code due to SQL compatibility, and an inefficient data format due to PostgreSQL's row format.
  • 12. Our Answer – PL/CUDA + Matrix-like Array
      CREATE FUNCTION my_logic(matrix, matrix) RETURNS vector AS $$ <user-defined CUDA code block> $$ LANGUAGE 'plcuda';
      PL/CUDA method for manual optimization: load the function's arguments from storage, run the GPU kernel containing the user-defined CUDA code block, write back the result set, then post-process in SQL (table JOIN, window functions, ORDER BY, GROUP BY, etc.).
      Matrix-like Array: a 2D array without NULL, used to represent a matrix. Layout: ArrayType header, followed by a1 a2 … aN, b1 b2 … bN, c1 c2 … cN, d1 d2 … dN, representing the matrix of 4 cols x N rows whose i-th row is (a_i, b_i, c_i, d_i).
  • 13. The PG-Strom Project Example of PL/CUDA function definition PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics13 CREATE OR REPLACE FUNCTION knn_gpu_similarity(int, int[], int[]) RETURNS float4[] AS $$ #plcuda_begin cl_int k = arg1.value; MatrixType *Q = (MatrixType *) arg2.value; MatrixType *D = (MatrixType *) arg3.value; MatrixType *R = (MatrixType *) results; : nloops = (ARRAY_MATRIX_HEIGHT(Q) + (part_sz - k - 1)) / (part_sz - k); for (loop=0; loop < nloops; loop++) { /* 1. calculation of the similarity */ for (i = get_local_id(); i < part_sz * part_nums; i += get_local_size()) { j = i % part_sz; /* index within partition */ /* index of database matrix (D) */ dindex = part_nums * get_global_index() + (i / part_sz); /* index of query matrix (Q) */ qindex = loop * (part_sz - k) + (j - k); values[i] = knn_similarity_compute(D, dindex, Q, qindex); } } : #plcuda_end $$ LANGUAGE 'plcuda'; CUDA code block
  • 14. How GPU Kernels are built
      CREATE OR REPLACE FUNCTION my_cuda_func(float[])
      RETURNS int[]
      AS $$
      #plcuda_sanity_check func_sc
      #plcuda_begin
        <user-defined CUDA code block>
      #plcuda_end
      #plcuda_working_bufsz func_wb
      $$ LANGUAGE 'plcuda';
      bool func_sc(float[]): helper function for the sanity check. bigint func_wb(float[]): helper function for buffer size estimation.
      The user-defined CUDA code block is combined with the common library routines and the input/output of SQL data into the source program of the GPU code. The run-time compiler turns it into the GPU binary; the GPU kernel then runs with the loaded input arguments and the working buffer.
  • 15. The PG-Strom Project Why automatic generated code was not best for performance PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics15  NULL checks for each variable references  Overflow checks for each primitive operators  Function call instead of primitive operators STATIC_FUNCTION(pg_float4_t) pgfn_float4mul(kern_context *kcxt, pg_float4_t arg1, pg_float4_t arg2) { pg_float4_t result; result.isnull = arg1.isnull | arg2.isnull; if (!result.isnull) { result.value = arg1.value * arg2.value; CHECKFLOATVAL(&kcxt->e, result, isinf(arg1.value) || isinf(arg2.value), arg1.value == 0.0 || arg2.value == 0.0); } return result; } select x*y from c_test;
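The cost of the three items above can be illustrated in plain C. This is only a sketch, not PG-Strom code: `nullable_float`, `is_inf`, `checked_mul`, and `raw_mul` are hypothetical names, and the range checks only mimic the role that `CHECKFLOATVAL` plays in the generated function.

```c
#include <assert.h>
#include <float.h>
#include <stdbool.h>

/* Hypothetical stand-in for a nullable SQL float4 value. */
typedef struct {
    float value;
    bool  isnull;
    bool  error;   /* set when the result overflows or underflows */
} nullable_float;

/* True for +infinity and -infinity, false for finite values and NaN. */
bool is_inf(float x)
{
    return x > FLT_MAX || x < -FLT_MAX;
}

/* SQL-compatible multiply: NULL propagation plus range checks. */
nullable_float checked_mul(nullable_float a, nullable_float b)
{
    nullable_float r = { 0.0f, a.isnull || b.isnull, false };
    if (!r.isnull) {
        r.value = a.value * b.value;
        /* overflow: infinite result although neither input was infinite */
        if (is_inf(r.value) && !is_inf(a.value) && !is_inf(b.value))
            r.error = true;
        /* underflow: zero result although neither input was zero */
        if (r.value == 0.0f && a.value != 0.0f && b.value != 0.0f)
            r.error = true;
    }
    return r;
}

/* What hand-written code can do when the input is known to be NULL-free
 * and range checks are not required. */
float raw_mul(float a, float b)
{
    return a * b;
}
```

Every call to `checked_mul` carries two extra flags and several branches, while `raw_mul` is a single multiplication; a hand-written kernel can choose the second form when it knows its input.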
  • 16. Disadvantage because of data format (1/2)
      Row-oriented data: (×) includes unreferenced values; (×) many steps for each data reference; (〇) common data format with PostgreSQL.
      Column-oriented data: (〇) loads only the referenced variables; (〇) data reference in one step; (×) needs data format conversion.
      Usual SQL workloads cannot justify the cost of data transformation, but advanced algorithm processing benefits from the columnar format.
  • 17. Disadvantage because of data format (2/2)
      Random memory access: more memory transactions and a lower usage rate of the data bus. Only 32bit x 1 = 32bit is valid data in a 256bit-wide memory transaction (bus usage ratio: 12.5%).
      Coalesced memory access: the fewest memory transactions and the maximum usage rate of the data bus. 32bit x 8 = 256bit is valid data in a 256bit-wide memory transaction (bus usage ratio: 100.0%).
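The two bus usage ratios can be reproduced with a small model in plain C. It is a simplification under stated assumptions: a fixed stride stands in for "random" access, each thread reads one 32-bit word, and `count_transactions` and `bus_usage_percent` are hypothetical helper names rather than anything from CUDA or PG-Strom.

```c
#include <assert.h>
#include <stddef.h>

#define TRANSACTION_BYTES 32   /* 256-bit memory transaction, as on the slide */
#define WORD_BYTES         4   /* each thread reads one 32-bit value */

/* Count the distinct 32-byte segments touched when thread i reads the
 * 32-bit word at element index (i * stride). */
size_t count_transactions(size_t nthreads, size_t stride)
{
    size_t count = 0;
    size_t last_segment = (size_t)-1;
    for (size_t i = 0; i < nthreads; i++) {
        size_t segment = (i * stride * WORD_BYTES) / TRANSACTION_BYTES;
        if (segment != last_segment) {   /* addresses are ascending */
            count++;
            last_segment = segment;
        }
    }
    return count;
}

/* Percentage of the transferred bytes that the threads actually use. */
double bus_usage_percent(size_t nthreads, size_t stride)
{
    assert(nthreads > 0 && stride > 0);
    size_t useful = nthreads * WORD_BYTES;
    size_t moved  = count_transactions(nthreads, stride) * TRANSACTION_BYTES;
    return 100.0 * (double)useful / (double)moved;
}
```

With 8 threads, a stride of 1 (neighbouring words) needs one transaction and uses the whole bus width, while a stride of 8 words needs eight transactions and uses one eighth of it, matching the 100.0% and 12.5% figures.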
  • 18. 2D-Array as Matrix
      datatype[] array_matrix(variadic datatype[]): an aggregate function that constructs a 2D array from the input stream. datatype is one of int2, int4, int8, float4, or float8. The 2D array does not contain NULL.
      SETOF record matrix_unnest(datatype[]): deforms a 2D array into a stream of records.
      Downsides: unable to handle variable-length data types; 1GB limit of the varlena data type in PostgreSQL.
      Matrix-like Array: a 2D array without NULL, used to represent a matrix. Layout: ArrayType header, followed by a1 a2 … aN, b1 b2 … bN, c1 c2 … cN, d1 d2 … dN, representing the matrix of 4 cols x N rows whose i-th row is (a_i, b_i, c_i, d_i).
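The layout drawn on this slide stores each column contiguously (all a values, then all b values, and so on). A minimal sketch of the resulting index arithmetic in plain C follows; `matrix_view` and `matrix_get` are hypothetical names, and the ArrayType header is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical view of a NULL-free 2D array stored column by column:
 * a1..aN, b1..bN, c1..cN, ... as drawn on the slide. */
typedef struct {
    size_t nrows;         /* N */
    size_t ncols;         /* e.g. 4 for the columns a, b, c, d */
    const float *values;  /* nrows * ncols elements, header not included */
} matrix_view;

/* Element at (row, col), both 0-based. */
float matrix_get(const matrix_view *m, size_t row, size_t col)
{
    assert(row < m->nrows && col < m->ncols);
    return m->values[col * m->nrows + row];
}
```

Because consecutive rows of one column sit next to each other, threads that each process one row read neighbouring addresses, which is the coalesced access pattern from the previous slide.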
  • 19. Example of calling a PL/CUDA function
      SELECT row_number() OVER (),
             float4_as_int4(R.key_id) key_id,
             R.score
        FROM matrix_unnest(
               (SELECT my_plcuda_function(A.matrix, B.matrix)
                  FROM (SELECT cbind(array_matrix(id),
                                     array_matrix(x, y, z)) matrix
                          FROM normal_table
                         WHERE tag LIKE '%abc%') A,
                       (SELECT matrix FROM matrix_table) B
               )
             ) AS R(key_id real, score real)
       ORDER BY score DESC
       LIMIT 1000;
      Reading the query from the inside out: construct the array-matrix or load a pre-built one; invoke the PL/CUDA function with two matrix arguments; deform the array-matrix into generic records with matrix_unnest; post-process in SQL (JOIN, window function).
  • 20. Case Study: similarity search in drug discovery
  • 21. Background – relationship of disease and chemical compounds
      A target disease has a relevant protein, and chemical compounds (= drug candidates) are screened against it. The goal is to discover chemical compounds that are "active" against the target protein. Compounds are classified as inactive, active, or active but toxic. Known results come from academic papers.
  • 22. k-NN Similarity Search on Chemical Compounds (1/2)
      Database chemical compounds set (D; 10M-record scale). Query chemical compounds set (Q; ~1000-record scale), built by picking compounds known to be active against the target protein from academic papers. D is searched by similarity to Q: "similar compounds" have a higher probability of being active.
  • 23. k-NN Similarity Search on Chemical Compounds (2/2)
      Similarity serves as the definition of distance.
      Data structure of the chemical compounds:
      ID | NAME | Fingerprint (1024bit)
      1 | CHEMBL153534 | 000000000001000000100000000000000100000000000001000000...
      2 | CHEMBL405398 | 000000000000000100100000000000000000000000000000100000...
      3 | CHEMBL503634 | 000001000000000000000000001000000100000000000000000000...
      : | : | :
      Similarity by Jaccard index: $\mathrm{Similarity}(A, B) = \dfrac{|A_{FP} \cap B_{FP}|}{|A_{FP} \cup B_{FP}|}$
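For bit-string fingerprints, the Jaccard index reduces to counting set bits in the bitwise AND and OR. The plain-C sketch below assumes the 1024-bit fingerprint is packed into 32 words of 32 bits; `popcount32` and `jaccard_similarity` are hypothetical names, and returning 0 for two all-zero fingerprints is an arbitrary choice made here, not something the slides specify.

```c
#include <assert.h>
#include <stdint.h>

#define FP_WORDS 32   /* 1024-bit fingerprint = 32 x 32-bit words */

/* Portable population count of a 32-bit word. */
unsigned popcount32(uint32_t x)
{
    unsigned n = 0;
    while (x) {
        x &= x - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* |A AND B| / |A OR B| over the bits of two fingerprints. */
double jaccard_similarity(const uint32_t a[FP_WORDS],
                          const uint32_t b[FP_WORDS])
{
    unsigned inter = 0, uni = 0;
    for (int i = 0; i < FP_WORDS; i++) {
        inter += popcount32(a[i] & b[i]);
        uni   += popcount32(a[i] | b[i]);
    }
    /* Assumption: two empty fingerprints are treated as dissimilar. */
    return uni == 0 ? 0.0 : (double)inter / (double)uni;
}
```

On a GPU the same loop would typically use a hardware population-count instruction; the portable loop is used here only to keep the sketch self-contained.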
  • 24. Scale of the computation
      Database chemical compounds set (D; 10M-record scale) and query chemical compounds set (Q). For each compound $d_i$ of D, compute its distance to every compound of the Q-set and take the average of the top 3; the same is done for every other compound $d_j$ of D.
      Order of calculation: $O(Q \times D) + O(D \times Q \log Q)$, where the first term is making the distances and the second is sorting plus averaging.
  • 25. Implementation of PL/CUDA function (1/3)
      Step-1: Split all the logical combinations of Q and D into multiple partitions, then assign them to SMMs, the execution units of the GPU.
      Step-2: Each GPU core calculates a similarity score between a Q-compound and a D-compound, then stores the score in "shared memory", which is fast and close to the SMM.
  • 26. Implementation of PL/CUDA function (2/3)
      Step-3: Bitonic sort by the similarity score, then reorder the Q-compounds.
      Step-4: Repeat from Step-2 if the number of Q-compounds is larger than the shared memory size. The top-k items are kept.
      Step-5: Average the top-k values, then store the result in the result buffer.
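The sort-then-average part of these steps can be sketched on the CPU in plain C. Note the substitution: the slides use a bitonic sort in GPU shared memory, whereas this sketch uses the C library's `qsort`, and it handles a single D-compound whose scores all fit in one array. `cmp_desc` and `topk_average` are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

/* qsort comparator: descending order of similarity score. */
int cmp_desc(const void *pa, const void *pb)
{
    float a = *(const float *)pa;
    float b = *(const float *)pb;
    return (a < b) - (a > b);
}

/* Sort the scores in place and return the average of the k largest. */
float topk_average(float *scores, size_t n, size_t k)
{
    assert(k > 0 && k <= n);
    qsort(scores, n, sizeof(float), cmp_desc);
    float sum = 0.0f;
    for (size_t i = 0; i < k; i++)
        sum += scores[i];
    return sum / (float)k;
}
```

When the Q-set does not fit in one pass, the same idea applies per partition: keep the current top-k, append the next batch of scores, and sort again, which is what the repeat step describes.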
  • 27. Implementation of PL/CUDA function (3/3)
      Signature:
      real[]                       -- ID+Similarity of D-compounds (2xN)
      knn_gpu_similarity(int k,    -- k-value
                         int[] Q,  -- ID+Fingerprint of Q-compounds (33xM)
                         int[] D); -- ID+Fingerprint of D-compounds (33xN)
      Definition:
      CREATE OR REPLACE FUNCTION
      knn_gpu_similarity(int,    -- k-value
                         int[],  -- ID+bitmap of Q
                         int[])  -- ID+bitmap of D
      RETURNS float4[]           -- result: ID+similarity
      AS $$
      #plcuda_decl
        :
      #plcuda_begin
      #plcuda_kernel_blocksz \
        knn_gpu_similarity_main_block_size
      #plcuda_num_threads \
        knn_gpu_similarity_main_num_threads
      #plcuda_shmem_blocksz 8192
      cl_int k = arg1.value;
      MatrixType *Q = (MatrixType *) arg2.value;
      MatrixType *D = (MatrixType *) arg3.value;
      MatrixType *R = (MatrixType *) results;
        :
      for (loop=0; loop < nloops; loop++)
      {
        /* 1. calculation of the similarity */
        for (i = get_local_id();
             i < part_sz * part_nums;
             i += get_local_size())
        {
          j = i % part_sz; /* index within partition */
          dindex = part_nums * get_global_index() + (i / part_sz);
          qindex = loop * (part_sz - k) + (j - k);
          if (dindex < ARRAY_MATRIX_HEIGHT(D) &&
              qindex < ARRAY_MATRIX_HEIGHT(Q))
          {
            values[i] = knn_similarity_compute(D, dindex, Q, qindex);
          }
        }
        __syncthreads();
        /* 2. sorting by the similarity for each partition */
        knn_similarity_sorting(values, part_sz, part_nums);
        __syncthreads();
          :
      }
      #plcuda_end
      #plcuda_sanity_check knn_gpu_similarity_sanity_check
      #plcuda_working_bufsz 0
      #plcuda_results_bufsz knn_gpu_similarity_results_bufsz
      $$ LANGUAGE 'plcuda';
  • 28. The PG-Strom Project Invocation of PL/CUDA function PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics28 PREPARE knn_sim_rand_10m_gpu_v2(int) -- arg1:@k-value AS SELECT row_number() OVER (), fp.name, similarity FROM (SELECT float4_as_int4(key_id) key_id, similarity FROM matrix_unnest( (SELECT rbind( knn_gpu_similarity($1,Q.matrix, D.matrix)) FROM (SELECT cbind(array_matrix(id), array_matrix(bitmap)) matrix FROM finger_print_query) Q, (SELECT matrix FROM finger_print_10m_matrix) D ) ) AS sim(key_id real, similarity real) ORDER BY similarity DESC) sim, finger_print_10m fp WHERE fp.id = sim.key_id LIMIT 1000; Post-process by SQL; like lookup of compounds name by compounds-id (tables JOIN), making rank of similarity by window function. Execution of PL/CUDA function with Q-/D-matrix as argument Transform the records read from tables to Array-Matrix type. (Pre-build is also possible) Transform the Array-Matrix (3xN), return value of PL/CUDA function, into usual record data x Nrows.
  • 29. Performance results
      For comparison with the CPU case, we implemented an equivalent SQL function in C.
      The number of D-compounds is 10M records; the number of Q-compounds is 10, 50, 100, 500, and 1000.
      Up to 10B combinations are searched, almost equivalent in size to real drug discovery research.
      HW) CPU: Xeon E5-2670v3, GPU: GTX980 / GTX1080, RAM: 384GB
      SW) CentOS7, CUDA8.0, PostgreSQL v9.5 + PG-Strom v1.0
      Similarity search of chemical compounds by k-NN method (k=3, D=10M), query response time [sec]:
      Number of query compounds [Q] | CPU (E5-2670v3) | GTX980 | GTX1080
      10 | 30.25 | 12.97 | 13.00
      50 | 145.29 | 13.46 | 13.23
      100 | 295.95 | 13.90 | 13.59
      500 | 1503.31 | 18.16 | 16.01
      1000 | 3034.94 | 24.65 | 19.13
      About 150 times faster at Q=1000.
  • 30. Another Usage: k-means clustering in a database system
  • 31. Clustering Analysis
  • 32. k-means clustering algorithm
      1. Assign clusters randomly. 2. Make a centroid for each cluster. 3. Choose the nearest cluster for each point, measured from the centroids.
  • 33. k-means clustering algorithm (continued)
      1. Assign clusters randomly. 4. Repeat until convergence. 5. Make a centroid for each cluster again. 6. All the clusters are fully converged.
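The numbered steps on these two slides can be written out as a short CPU reference in plain C. This is a one-dimensional sketch under stated assumptions, not the PL/CUDA implementation: the initial assignment is round-robin instead of random so that the result is reproducible, and `kmeans_1d`, `abs_diff`, and `MAX_K` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_K 16   /* upper bound on the number of clusters in this sketch */

/* Absolute difference, used as the 1-D distance. */
float abs_diff(float a, float b)
{
    return a > b ? a - b : b - a;
}

/* One-dimensional k-means. cluster[i] receives the cluster of data[i] and
 * centroid[c] the mean of cluster c. Returns the number of passes in which
 * at least one point changed its cluster. */
int kmeans_1d(const float *data, size_t n, int k, int max_iter,
              int *cluster, float *centroid)
{
    assert(k > 0 && k <= MAX_K && n >= (size_t)k);

    /* Step 1: initial assignment (round-robin instead of random). */
    for (size_t i = 0; i < n; i++)
        cluster[i] = (int)(i % (size_t)k);

    int iter;
    for (iter = 0; iter < max_iter; iter++) {
        /* Steps 2 and 5: centroid = mean of the members of each cluster.
         * An empty cluster keeps its previous centroid. */
        float  sum[MAX_K] = { 0 };
        size_t cnt[MAX_K] = { 0 };
        for (size_t i = 0; i < n; i++) {
            sum[cluster[i]] += data[i];
            cnt[cluster[i]]++;
        }
        for (int c = 0; c < k; c++)
            if (cnt[c] > 0)
                centroid[c] = sum[c] / (float)cnt[c];

        /* Step 3: move each point to the cluster with the nearest centroid. */
        int changed = 0;
        for (size_t i = 0; i < n; i++) {
            int best = 0;
            for (int c = 1; c < k; c++)
                if (abs_diff(data[i], centroid[c]) <
                    abs_diff(data[i], centroid[best]))
                    best = c;
            if (best != cluster[i]) {
                cluster[i] = best;
                changed = 1;
            }
        }
        /* Steps 4 and 6: repeat until no point changes its cluster. */
        if (!changed)
            break;
    }
    return iter;
}
```

The centroid accumulation is the part the PL/CUDA version parallelizes with atomic additions; the sequential loops here compute the same sums one element at a time.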
  • 34. The PG-Strom Project k-means clustering on PL/CUDA PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics34 CREATE OR REPLACE FUNCTION gpu_kmeans(real[], -- ID + Data Matrix int, -- k-value (number of clusters) int = 10, -- max number of iteration int = 1) -- seed of initial randomness RETURNS int[] AS $$ #plcuda_decl : KERNEL_FUNCTION_MAXTHREADS(void) update_centroid(MatrixType *D, MatrixType *R, MatrixType *C) { : /* accumulate the local centroid */ for (did = get_global_id(); did < nitems; did += get_global_size()) { /* pick up the target cluster */ cid = r_values[nitems + did]; atomicAdd(&l_cent[cid], 1.0); for (index=1; index < width; index++) atomicAdd(&l_cent[index * k_value + cid], d_values[index * nitems + did]); } __syncthreads(); /* write back to the global C-matrix */ for (index = get_local_id(); index < width * k_value; index += get_local_size()) atomicAdd(&c_values[index], l_cent[index]); } : #plcuda_begin : status = pgstromLaunchDynamicKernel4((void *) setup_initial_cluster, (kern_arg_t)(D), (kern_arg_t)(R), (kern_arg_t)(C), (kern_arg_t)(r_seed), nitems, 0, 0); if (status != cudaSuccess) PLCUDA_RUNTIME_ERROR_RETURN(status); for (loop=0; loop < nloops; loop++) { : status = pgstromLaunchDynamicKernelMaxThreads3( (void *)kmeans_update_cluster, (kern_arg_t)(D), (kern_arg_t)(R), (kern_arg_t)(C), (kern_arg_t)k_value, nitems, 0, sizeof(cl_int) + sizeof(cl_float)); if (status != cudaSuccess) PLCUDA_RUNTIME_ERROR_RETURN(status); : } #plcuda_sanity_check gpu_kmeans_sanity_check #plcuda_working_bufsz gpu_kmeans_working_bufsz #plcuda_results_bufsz gpu_kmeans_results_bufsz #plcuda_end $$ LANGUAGE 'plcuda';
  • 35. Test data for k-means clustering
      Dataset overview: a collection of datasets of vehicle traffic, observed between two points for a set duration of time over a period of 6 months; 449 observation points in Aarhus, Denmark; 13.5M records from Feb to June of 2014.
      Data contains: average speed, average measured time, number of vehicles, latitude/longitude of the observation points, etc.
      What we did: categorized the road zones into 5 classes according to the characteristics of the vehicles' running style.
  • 36. The PG-Strom Project Invocation of GPU k-means function PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics36 SELECT report_id, k, c FROM (SELECT report_id, k, c, row_number() OVER (PARTITION BY report_id ORDER BY c DESC) rank FROM (SELECT report_id, k, count(*) c FROM matrix_unnest( (SELECT gpu_kmeans ( array_matrix( int4_as_float4(report_id), avg_measured_time, avg_speed, vehicle_count), 5) FROM tr_rawdata) ) R(report_id int, k int) GROUP BY report_id, k ) __summary_1 ) __summary_2 WHERE rank = 1; Make a matrix from the raw-data Run k-means clustering logic Pick-up most frequent cluster
  • 37. GPU k-means (1/3) – clustering by all the data. $ wget -O map.png "`psql traffic -At -f ~/traffic.sql`" Map annotations: bypass highway? Road towards downtown? Beltway?
  • 38. GPU k-means (2/3) – Daytime and Nighttime. Panels: daytime (8-17), nighttime (18-7).
  • 39. GPU k-means (3/3) – Weekdays and Weekend. Panels: weekdays, weekend.
  • 40. The PG-Strom Project Invocation of GPU k-means function PGconf.SV 2016 - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics40 SELECT report_id, k, c FROM (SELECT report_id, k, c, row_number() OVER (PARTITION BY report_id ORDER BY c DESC) rank FROM (SELECT report_id, k, count(*) c FROM matrix_unnest( (SELECT gpu_kmeans ( array_matrix( int4_as_float4(report_id), avg_measured_time, avg_speed, vehicle_count), 5) FROM tr_rawdata WHERE extract('hour' from timestamp) between 7 and 17 ) ) R(report_id int, k int) GROUP BY report_id, k ) __summary_1 ) __summary_2 WHERE rank = 1; Just add a line to select different input data set. Flexibility of SQL!
  • 41. Summary
  • 42. Summary
      What is PL/CUDA: The original concept of PG-Strom is automatic optimization. PL/CUDA draws out the full capability of the GPU in exchange for manual optimization. Most likely, nobody writes advanced algorithms in SQL anyway :-)
      Advantages: a TFLOPS-grade computing engine for in-database analytics; no need to export the entire dataset for analytics by external applications; SQL flexibility can be used for pre-/post-processing around the core analytics algorithms.
      Future challenges: data sizes larger than 1GB, because of the varlena restriction in PostgreSQL; asynchronous execution and CPU parallelism; time to construct the matrix (if the data is mostly static, the matrix can be built in advance).
  • 43. Resources. Repository: https://github.com/pg-strom/devel Today's slides: http://www.slideshare.net/kaigai/pgconfsv2016-plcuda Contact: kaigai@ak.jp.nec.com, Tw: @kkaigai