Harnessing OpenCL in
modern coprocessors
Unai Lopez-Novoa
unai.lopez@ehu.es
06 Aug 2014
Intelligent Systems Group
University of the Basque Country UPV/EHU
Outline
• Previous work
• Work @ UniMan: Relational Join
  1. Motivation
  2. Algorithm
  3. Results
  4. Conclusions
About Myself
• PhD Student @ Intelligent Systems Group: 2011 – Now
• Research interest: Efficient use of Modern coprocessors
• Performance modeling
• Code acceleration
• Development of parallel implementations
• Molecular Dynamics simulation code (MSc thesis)
• Kernel Density Estimation (Under review)
• Relational Join (Work @ UniMan)
Kernel Density Estimation
• Estimate the Probability Density Function of a population
• Our use case: Climate models
• Challenge: large volumes of data
[Figure panels: Histogram; KDE]
Kernel Density Estimation
• 1st: Algorithmic rework
• 2nd: Parallel implementation: multi/many-core processors
• Compared to R+MKL and CUDA implementations
Naive approach:
  for each evaluation_point e
    for each sample s
      d = distance(e, s)
      pdf[e] += density(d)
Our approach:
  B = computeBoundingBox()
  for each sample s
    b = fitBoundingBox(B, s)
    for each evaluation_point e in b
      d = distance(e, s)
      pdf[e] += density(d)
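The two loop nests above can be compared in a minimal one-dimensional sketch. It assumes a kernel with compact support (the Epanechnikov kernel, which the slides do not name), a uniformly spaced evaluation grid, and a fixed bandwidth `h`; the function names are illustrative only. Under those assumptions the bounding box of a sample is the interval [s - h, s + h], and skipping grid points outside it leaves the result unchanged.

```python
import math

def epanechnikov(u):
    """Compact-support kernel: exactly zero whenever |u| >= 1."""
    return 0.75 * (1.0 - u * u) if abs(u) < 1.0 else 0.0

def kde_naive(samples, grid, h):
    """Naive approach: every evaluation point visits every sample."""
    pdf = [0.0] * len(grid)
    for i, e in enumerate(grid):
        for s in samples:
            pdf[i] += epanechnikov((e - s) / h)
    return [p / (len(samples) * h) for p in pdf]

def kde_bounded(samples, grid, h):
    """Bounding-box approach: each sample visits only grid points in [s-h, s+h]."""
    x0 = grid[0]
    step = grid[1] - grid[0]          # assumes a uniformly spaced grid
    pdf = [0.0] * len(grid)
    for s in samples:
        lo = max(0, math.ceil((s - h - x0) / step))
        hi = min(len(grid) - 1, math.floor((s + h - x0) / step))
        for i in range(lo, hi + 1):
            pdf[i] += epanechnikov((grid[i] - s) / h)
    return [p / (len(samples) * h) for p in pdf]
```

The naive version performs len(grid) × len(samples) kernel evaluations, while the bounded version performs roughly len(samples) × (2h / step), which is where the saving comes from when the bandwidth is small relative to the grid extent.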
Work @ UniMan
Join
Do sunblock sales correlate with weather?
[Figure: the Sales and Weather tables are joined on date to produce Join-Date(Sales, Weather)]
Slide based on: Wu, Lisa, et al. "Navigating big data with high-throughput, energy-efficient data partitioning." Proc. of the 40th Annual International Symposium on Computer Architecture. ACM, 2013.
Join
• Join is an everyday operation
Join
Goal: Develop a parallel implementation of relational join targeting today's heterogeneous systems
Heterogeneous systems
• Performance depends on the nature of the application
• Multi-core: 16 cores, 250 GFLOP/s
• Many-core: 61 cores, 1 TFLOP/s
• GPU: 2880 cores, 1.3 TFLOP/s
Axis: complex control flow ↔ number crunching
Heterogeneous systems
• Wide variety of programming environments in HPC
• OpenMP, CUDA, MPI, TBB, …
• Our choice: OpenCL
• Write once, compile with each vendor SDK (NVIDIA, Intel, AMD), run on many devices
Heterogeneous systems
• Cross-platform portability != Performance portability
• OpenCL: Abstraction layer
• Solution 1: per-device hand-made tuning
• Not portable at all
• Solution 2: auto tuning
• Rely on performance models
Previous work
• Collection of performance modeling proposals for the latest GPUs and the Intel Xeon Phi
• Comprehensive analysis of the literature since ~2007
• Organized as:
  • Execution time estimation
  • Bottleneck highlighting
  • Power consumption estimation
  • Simulators
Unai Lopez-Novoa et al. "A Survey of Performance Modeling and Simulation Techniques for Accelerator-based Computing." IEEE Transactions on Parallel and Distributed Systems, DOI: 10.1109/TPDS.2014.2308216
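Types of Join
Table A keys: 100, 103, 104
Table B keys: 100, 102
Result rows (key from A, key from B; "-" marks no match):
• Inner: (100, 100)
• Left Outer: (100, 100), (103, -), (104, -)
• Right Outer: (100, 100), (-, 102)
• Full Outer: (100, 100), (103, -), (104, -), (-, 102)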
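As a small illustration of the four join types, the sketch below reproduces the slide's results for Table A and Table B. The function name `join_keys` and the use of `None` for the "-" placeholder are assumptions made for this example; the quadratic nested loop is chosen for clarity, not speed.

```python
def join_keys(a, b, how="inner"):
    """Return (key_a, key_b) rows; None marks a side with no match."""
    # Matching pairs are part of every join type.
    rows = [(x, y) for x in a for y in b if x == y]
    if how in ("left", "full"):
        # Keep keys of A that have no partner in B.
        rows += [(x, None) for x in a if x not in b]
    if how in ("right", "full"):
        # Keep keys of B that have no partner in A.
        rows += [(None, y) for y in b if y not in a]
    return rows

table_a = [100, 103, 104]
table_b = [100, 102]
for how in ("inner", "left", "right", "full"):
    print(how, join_keys(table_a, table_b, how))
```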
Algorithm
• Biggest debate: Sort or Hash?
Hash-join
• Procedure: 1. Hash smaller table 2. Scan larger table
• Complexity: O(n + m)
• Limitation: extensive use of atomics prevents efficient parallelization
Sort-join
• Procedure: 1. Sort keys 2. Scan interleaved
• Complexity: O(n·log(n))
• Limitation: sorting increases complexity
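The hash-join procedure can be sketched sequentially as follows. This is an illustrative version, not the implementation evaluated in the talk: `hash_join` is a hypothetical name, and the table stores only a count per key. The comment marks the update that would need an atomic operation once many threads build the table concurrently, which is the limitation the slide refers to.

```python
from collections import defaultdict

def hash_join(a, b):
    """Inner join on two key lists; returns the matching (key, key) pairs."""
    small, large = (a, b) if len(a) <= len(b) else (b, a)
    table = defaultdict(int)
    for k in small:            # 1. build: hash the smaller table
        table[k] += 1          #    shared counter: needs an atomic in parallel code
    out = []
    for k in large:            # 2. probe: scan the larger table
        out.extend([(k, k)] * table.get(k, 0))
    return out
```

Building touches each of the n keys of the smaller table once and probing touches each of the m keys of the larger table once, which gives the O(n + m) cost.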
Algorithm
• Step 1: Sort keys in both tables
• Radix sort: speed/scalability sweet spot
Table A: 100 104 103 103 102 100 → Sort → 100 100 102 103 103 104
Table B: 100 102 101 102 → Sort → 100 101 102 102
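For reference, a sequential least-significant-digit radix sort looks like the sketch below. It assumes non-negative integer keys and 8 bits per pass; both are choices made for this example, and the slides do not say which radix sort variant was used.

```python
def radix_sort(keys, bits_per_pass=8):
    """LSD radix sort for non-negative integers; returns a new sorted list."""
    if not keys:
        return []
    mask = (1 << bits_per_pass) - 1
    largest = max(keys)
    out = list(keys)
    shift = 0
    while (largest >> shift) > 0:
        # One bucket per digit value; appending keeps each pass stable.
        buckets = [[] for _ in range(mask + 1)]
        for k in out:
            buckets[(k >> shift) & mask].append(k)
        out = [k for bucket in buckets for k in bucket]
        shift += bits_per_pass
    return out

print(radix_sort([100, 104, 103, 103, 102, 100]))
print(radix_sort([100, 102, 101, 102]))
```

The number of passes depends only on the key width, not on the number of keys, and each pass is a histogram followed by a scatter, which is why radix sort tends to map well onto parallel hardware.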
Algorithm
• Step 2: Merge
• Add non matching keys for outer joins
Table A: 100 100 102 103 103 104
Table B: 100 101 102 102
Result – Inner Join (A key, B key): (100, 100), (100, 100), (102, 102), (102, 102)
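A sequential sketch of the merge step for an inner join is shown below. The name `merge_join` is hypothetical and both inputs are assumed to be sorted already. Runs of duplicate keys are paired exhaustively, and the comments mark where an outer join would emit the non-matching keys.

```python
def merge_join(a, b):
    """Inner join of two sorted key lists; duplicate keys yield every pairing."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1             # a[i] has no partner: a left/full outer join emits it here
        elif a[i] > b[j]:
            j += 1             # b[j] has no partner: a right/full outer join emits it here
        else:
            key = a[i]
            i_end = i
            while i_end < len(a) and a[i_end] == key:
                i_end += 1     # end of the run of equal keys in A
            j_end = j
            while j_end < len(b) and b[j_end] == key:
                j_end += 1     # end of the run of equal keys in B
            out.extend([(key, key)] * ((i_end - i) * (j_end - j)))
            i, j = i_end, j_end
    return out

print(merge_join([100, 100, 102, 103, 103, 104], [100, 101, 102, 102]))
```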
Implementation
• Steps:
1) Develop a naive OpenCL implementation
2) Optimize per device type
3) Add a cost model for load balancing and partitioning
• Experimental setup:
• M1: 4-core Xeon (x2 SMT) + Xeon Phi + 384-core GPU
• M2: 12-core Xeon (x2 SMT) + Xeon Phi + 2496-core GPU
• Baseline: ModernGPU (CUDA)
Results
Per-device tuning
• Optimizations:
• Thread scheduling
• Memory management
• Overheads:
• Compilation
• Memory allocation
Optimizations
• Per device thread scheduling
[Figure: the threads and thread groups of an OpenCL kernel are mapped onto the cores of each OpenCL device: a four-core CPU (cores 0 to 3) and a 61-core Xeon Phi (cores 0 to 60)]
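One way to read the figure is that the same kernel gets a different decomposition on each device. The sketch below assumes one thread group per core, each working on a contiguous chunk of the input. That policy and the name `chunk_bounds` are assumptions for illustration; the slides do not state the actual scheduling rule.

```python
def chunk_bounds(n_items, n_groups):
    """Split range(n_items) into n_groups contiguous [start, end) chunks
    whose sizes differ by at most one."""
    base, extra = divmod(n_items, n_groups)
    bounds, start = [], 0
    for g in range(n_groups):
        size = base + (1 if g < extra else 0)   # first `extra` groups take one more item
        bounds.append((start, start + size))
        start += size
    return bounds

print(chunk_bounds(10, 4))        # four-core CPU
print(len(chunk_bounds(10, 61)))  # 61-core Xeon Phi: most groups get no work on tiny inputs
```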
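Optimizations
• Per device memory management
OpenCL device memory hierarchy:
• Private: scope is a single thread
• Local: scope is a thread group
• Global: scope is any thread
Physical mapping depends on the device: private memory is held in registers; local memory is on-chip RAM on some devices and ordinary RAM on others; global memory is RAM.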
Overheads
• Compilation
• Online compilation: X% of runtime (without I/O)
• Memory allocation
• Intel SDK: Y% of Merge Step on Xeon Phi
An OpenCL program has two parts:
• Host code: compiled offline (gcc)
• Device code: compiled online (SDK)
Results
Future work
1) Finish tuning per-device code
2) Test the join on an FPGA
3) Revisit partitioning strategy
4) Support multi-device execution
• Develop a cost model that characterizes Join
• Split the workload at runtime among the available devices
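The planned multi-device split is not specified further, so the following is only a guess at its simplest form: rows are divided in proportion to an estimated per-device processing rate. The name `split_rows`, the device labels, and the integer rates are all hypothetical; a real cost model for Join would have to supply the rates.

```python
def split_rows(n_rows, rate):
    """Assign rows to devices in proportion to rate[device] (integer rows per unit time)."""
    total = sum(rate.values())
    share = {d: n_rows * r // total for d, r in rate.items()}
    leftover = n_rows - sum(share.values())
    # Rounding down leaves fewer rows than devices; hand them to the fastest devices.
    for d in sorted(rate, key=rate.get, reverse=True)[:leftover]:
        share[d] += 1
    return share

print(split_rows(1000, {"cpu": 1, "phi": 2, "gpu": 2}))
```

A static proportional split like this ignores transfer costs and data skew, which is presumably part of what a cost model that characterizes Join would need to capture.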
Conclusions
• Performance: device specific code
• Performance portability:
a) Platform specific code
b) Parameterizable code
• High OpenCL SDK dependence
• Only portable debugging tool: printf
• …but still the only portable framework
• Future: OpenACC / OpenMP 4.0 ?