Harnessing OpenCL in Modern Coprocessors

Unai Lopez-Novoa
unai.lopez@ehu.es
06 Aug 2014
Intelligent Systems Group
University of the Basque Country UPV/EHU

Outline
• Previous work
• Work @ UniMan: Relational Join
1. Motivation
2. Algorithm
3. Results
4. Conclusions

About Myself
• PhD Student @ Intelligent Systems Group: 2011 – Now
• Research interest: Efficient use of modern coprocessors
• Performance modeling
• Code acceleration
• Development of parallel implementations
• Molecular Dynamics simulation code (MSc thesis)
• Kernel Density Estimation (Under review)
• Relational Join (Work @ UniMan)

Kernel Density Estimation
• Estimate the Probability Density Function of a population
• Our use case: Climate models
• Challenge: large volumes of data
[Figure: two panels, "Histogram" and "KDE"]

Kernel Density Estimation
• 1st: Algorithmic rework
• 2nd: Parallel implementation for multi-/many-core processors
• Compared to R+MKL and CUDA implementations
Naive approach
  for each evaluation_point e
    for each sample s
      d = distance(e, s)
      pdf[e] += density(d)
Our approach
  B = computeBoundingBox()
  for each sample s
    b = fitBoundingBox(B, s)
    for each evaluation_point e in b
      d = distance(e, s)
      pdf[e] += density(d)
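The two loop nests above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code from the talk: it works in 1D on a uniform grid, and it assumes a compact-support kernel (Epanechnikov) with bandwidth `h`, so that each sample's bounding box is simply the interval [s - h, s + h]. The names `kde_naive`, `kde_bounded` and `epanechnikov` are hypothetical.

```python
import math

def epanechnikov(u):
    """Compact-support kernel (an assumption): zero whenever |u| >= 1."""
    return 0.75 * (1.0 - u * u) if abs(u) < 1.0 else 0.0

def kde_naive(grid, samples, h):
    """Naive approach: visit every (evaluation point, sample) pair."""
    pdf = [0.0] * len(grid)
    for i, e in enumerate(grid):
        for s in samples:
            pdf[i] += epanechnikov((e - s) / h)
    return [p / (len(samples) * h) for p in pdf]

def kde_bounded(grid, samples, h):
    """Bounding-box approach: visit only grid points within h of each sample.

    Assumes `grid` is uniformly spaced and sorted in ascending order.
    """
    x0 = grid[0]
    step = grid[1] - grid[0]
    pdf = [0.0] * len(grid)
    for s in samples:
        # Index range of grid points that fall inside [s - h, s + h].
        lo = max(0, math.ceil((s - h - x0) / step))
        hi = min(len(grid) - 1, math.floor((s + h - x0) / step))
        for i in range(lo, hi + 1):
            pdf[i] += epanechnikov((grid[i] - s) / h)
    return [p / (len(samples) * h) for p in pdf]
```

With a small bandwidth relative to the grid extent, the inner loop of `kde_bounded` touches only a few grid points per sample, which is where the saving over the naive version comes from.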

Join
Slide based on: Wu, Lisa, et al. "Navigating big data with high-throughput, energy-efficient data
partitioning." Proc. of the 40th Annual International Symposium on Computer Architecture. ACM,
2013.
Do sunblock sales correlate with weather?
[Figure: tables Sales and Weather combined by Join-Date(Sales, Weather)]

Join
• Join is an everyday operation

Join
Goal: Develop a parallel implementation of relational
join targeting today's heterogeneous systems

Heterogeneous systems
• Performance depends on the nature of the application
• Multi-core: 16 cores, 250 GFLOP/s
• Many-core: 61 cores, 1 TFLOP/s
• GPU: 2880 cores, 1.3 TFLOP/s
[Figure axis: Complex control flow ↔ Number crunching]

• Wide variety of programming environments in HPC
• OpenMP, CUDA, MPI, TBB,…
• Our choice: OpenCL
[Figure: Write once → Compile (NVIDIA SDK, Intel SDK, AMD SDK) → Run many]

• Cross-platform portability != Performance portability
• OpenCL: Abstraction layer
• Solution 1: per-device hand-made tuning
• Not portable at all
• Solution 2: auto tuning
• Rely on performance models

Previous work
• Collection of performance modeling proposals for latest
GPUs and Intel Xeon Phi
• Comprehensive analysis of the literature since ~2007
• Organized as: execution time estimation, bottleneck highlighting, power consumption estimation, simulators
Unai Lopez-Novoa et al. "A Survey of Performance Modeling and Simulation
Techniques for Accelerator-based Computing." IEEE Transactions on
Parallel and Distributed Systems. DOI: 10.1109/TPDS.2014.2308216

Types of Join
Table A keys: 100, 103, 104
Table B keys: 100, 102

Join type     Result rows (A, B)
Inner         (100, 100)
Left Outer    (100, 100), (103, -), (104, -)
Right Outer   (100, 100), (-, 102)
Full Outer    (100, 100), (103, -), (104, -), (-, 102)
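The four join types can be reproduced on the Table A and Table B keys of this slide with a small Python sketch. It is a quadratic, illustration-only version (not the parallel algorithm discussed later); the function name `join_keys` and the use of `None` for the "-" placeholder are assumptions made here.

```python
def join_keys(a, b, how="inner"):
    """Join two key lists; None marks the missing side in outer joins."""
    # Matching pairs are part of every join type.
    rows = [(x, y) for x in a for y in b if x == y]
    if how in ("left", "full"):
        # Keys of A without a partner in B.
        rows += [(x, None) for x in a if x not in b]
    if how in ("right", "full"):
        # Keys of B without a partner in A.
        rows += [(None, y) for y in b if y not in a]
    return rows

A = [100, 103, 104]
B = [100, 102]
for how in ("inner", "left", "right", "full"):
    print(how, join_keys(A, B, how))
```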

Algorithm
• Biggest debate: Sort or Hash?
Hash-join
• Procedure: 1. Hash smaller table 2. Scan larger table
• Complexity: O(n + m)
• Limitation: Extensive use of atomics prevents efficient parallelization
Sort-join
• Procedure: 1. Sort keys 2. Scan interleaved
• Complexity: O(n·log(n))
• Limitation: Sorting increases complexity
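For comparison with the sort-based route, the hash-join procedure (hash the smaller table, then scan the larger one) might look like the following sequential Python sketch. It joins bare key lists, and the function name `hash_join` is hypothetical. The comment marks the update that would have to become an atomic operation in a parallel build phase, which is the limitation noted on this slide.

```python
from collections import defaultdict

def hash_join(a, b):
    """Inner join of two key lists: build on the smaller, probe with the larger."""
    build, probe = (a, b) if len(a) <= len(b) else (b, a)
    counts = defaultdict(int)
    for k in build:
        # In a parallel build, concurrent threads would need an atomic here.
        counts[k] += 1
    out = []
    for k in probe:
        # Emit one result row per matching entry in the build table.
        for _ in range(counts.get(k, 0)):
            out.append((k, k))
    return out
```

Building and probing take O(n + m) expected time, plus the size of the output when keys repeat.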

Algorithm
• Step 1: Sort keys in both tables
• Radix sort: speed/scalability sweet spot
Table A: 100, 104, 103, 103, 102, 100 → Sort → 100, 100, 102, 103, 103, 104
Table B: 100, 102, 101, 102 → Sort → 100, 101, 102, 102
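As a reminder of how radix sort works, here is a minimal sequential sketch of the least-significant-digit variant in Python. It is not the OpenCL kernel used in this work; the function name `radix_sort` and the `bits_per_pass` parameter are assumptions, and keys are assumed to be non-negative integers.

```python
def radix_sort(keys, bits_per_pass=4):
    """LSD radix sort for non-negative integers (stable bucket pass per digit)."""
    if not keys:
        return []
    radix = 1 << bits_per_pass
    mask = radix - 1
    out = list(keys)
    max_key = max(out)
    shift = 0
    # One pass per digit, from least to most significant.
    while (max_key >> shift) > 0:
        buckets = [[] for _ in range(radix)]
        for k in out:
            buckets[(k >> shift) & mask].append(k)
        # Concatenating buckets in order keeps each pass stable.
        out = [k for bucket in buckets for k in bucket]
        shift += bits_per_pass
    return out
```

The number of passes depends only on the key width, not on the number of keys, which is part of why radix sort scales well on wide parallel devices.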

Algorithm
• Step 2: Merge
• Add non matching keys for outer joins
Table A: 100, 100, 102, 103, 103, 104
Table B: 100, 101, 102, 102
Result – Inner Join (A, B): (100, 100), (100, 100), (102, 102), (102, 102)
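The merge step on two already-sorted key lists can be sketched sequentially as below. This is a plain Python illustration, not the parallel implementation from the talk; `merge_join` is a hypothetical name, and only the inner join is covered (an outer join would additionally emit the keys skipped in the two non-matching branches).

```python
def merge_join(a, b):
    """Inner join of two key lists that are already sorted in ascending order."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1          # a[i] has no partner in b
        elif a[i] > b[j]:
            j += 1          # b[j] has no partner in a
        else:
            key = a[i]
            # Find the run of equal keys on each side.
            i_end = i
            while i_end < len(a) and a[i_end] == key:
                i_end += 1
            j_end = j
            while j_end < len(b) and b[j_end] == key:
                j_end += 1
            # Every key in one run pairs with every key in the other.
            out.extend((key, key) for _ in range((i_end - i) * (j_end - j)))
            i, j = i_end, j_end
    return out

print(merge_join([100, 100, 102, 103, 103, 104], [100, 101, 102, 102]))
```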

Implementation
• Steps:
1) Develop a naive OpenCL implementation
2) Optimize per device type
3) Add a cost model for load balancing and partitioning
• Experimental setup:
• M1: 4-core (x2 SMT) Xeon + Xeon Phi + 384-core GPU
• M2: 12-core (x2 SMT) Xeon + Xeon Phi + 2496-core GPU
• Baseline: ModernGPU (CUDA)

Per-device tuning
• Optimizations:
• Thread scheduling
• Memory management
• Overheads:
• Compilation
• Memory allocation

Optimizations
• Per device thread scheduling
[Figure: the threads and thread groups of an OpenCL kernel mapped onto OpenCL devices: a four-core CPU (cores 0 to 3) and a 61-core Xeon Phi (cores 0 to 60)]

Optimizations
• Per device memory management
OpenCL Device Memory Hierarchy
Memory       Private      Local          Global
Scope        Thread       Thread-group   Any thread
Mapped to    Registers    On-chip RAM    RAM
Mapped to    Registers    RAM            RAM
Mapped to    Registers    RAM            RAM

Overheads
• Compilation
• Online compilation: X% of runtime (without I/O)
• Memory allocation
• Intel SDK: Y % of Merge Step in Xeon Phi
OpenCL Program:   Host code       Device code
Compilation:      Offline (gcc)   Online (SDK)

Future work
1) Finish tuning per-device code
2) Test join on FPGA
3) Revisit partitioning strategy
4) Support multi-device execution
• Develop a cost model that characterizes Join
• Split the workload at runtime among the available devices
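One simple way such a runtime split could work is to divide the rows in proportion to a per-device throughput predicted by the cost model. The sketch below is purely illustrative: the function `split_workload`, the device names and the throughput figures are hypothetical, not measurements or design decisions from this work.

```python
def split_workload(n_rows, throughput):
    """Split n_rows among devices in proportion to their modeled throughput."""
    total = sum(throughput.values())
    shares = {dev: int(n_rows * t / total) for dev, t in throughput.items()}
    # Rows lost to integer truncation go to the fastest device.
    fastest = max(throughput, key=throughput.get)
    shares[fastest] += n_rows - sum(shares.values())
    return shares

# Hypothetical relative throughputs, as a cost model might predict them.
model = {"cpu": 1.0, "phi": 2.0, "gpu": 5.0}
print(split_workload(1000, model))
```

A real cost model would also have to account for transfer time to each device, which this proportional split ignores.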

Conclusions
• Performance: device specific code
• Performance portability:
a) Platform specific code
b) Parameterizable code
• High OpenCL SDK dependence
• Only portable debugging tool: printf
• …but still the only portable framework
• Future: OpenACC / OpenMP 4.0 ?
