Jeff Larkin, August 2017 DOE Performance Portability Workshop
Early Results of OpenMP 4.5 Portability
on NVIDIA GPUs
Background
Since the last performance portability workshop, several OpenMP
implementations for NVIDIA GPUs have emerged or matured
As of August 2017, can these implementations deliver on performance,
portability, and performance portability?
• Will OpenMP Target code be portable between compilers?
• Will OpenMP Target code be portable with the host?
I will compare results using 4 compilers: CLANG, Cray, GCC, and XL
8/23/2017
OpenMP In Clang
Multi-vendor effort to implement OpenMP in Clang (including offloading)
Runtime based on open-sourced runtime from Intel.
Current status: much improved since last year!
Version used: clang/20170629
Compiler Options:
-O2 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda --cuda-path=$CUDA_HOME
OpenMP In Cray
Due to its experience with OpenACC, Cray’s OpenMP 4.x compiler was the first to
market for NVIDIA GPUs.
Observation: Does not adhere to OpenMP as strictly as the others.
Version used: 8.5.5
Compiler Options: None Required
Note: Cray performance results were obtained on an x86 + P100 system, unlike the
other compilers; only GPU performance is compared.
OpenMP In GCC
Open-source GCC compiler with support for OpenMP offloading to NVIDIA GPUs
Runtime also based on open-sourced runtime from Intel
Current status: Mature on CPU, Very immature on GPU
Version used: 7.1.1 20170718 (experimental)
Compiler Options:
-O3 -fopenmp -foffload="-lm"
OpenMP In XL
IBM’s compiler suite, which now includes offloading to NVIDIA GPUs.
Same(ish) runtime as CLANG, but compilation by IBM’s compiler
Version used: xl/20170727-beta
Compiler Options:
-O3 -qsmp -qoffload
Case Study: Jacobi Iteration
Example: Jacobi Iteration
Iteratively converges to the correct value (e.g., temperature) by computing new
values at each point from the average of the neighboring points.
Common, useful algorithm
Example: Solve the Laplace equation in 2D: ∇²f(x, y) = 0

Each point A(i,j) is updated from its four neighbors A(i-1,j), A(i+1,j),
A(i,j-1), and A(i,j+1):

A_{k+1}(i, j) = ( A_k(i-1, j) + A_k(i+1, j) + A_k(i, j-1) + A_k(i, j+1) ) / 4
Teams & Distribute
Teaming Up
#pragma omp target data map(to:Anew) map(A)
while ( error > tol && iter < iter_max )
{
error = 0.0;
#pragma omp target teams distribute parallel for reduction(max:error) map(error)
for( int j = 1; j < n-1; j++)
{
for( int i = 1; i < m-1; i++ )
{
Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
+ A[j-1][i] + A[j+1][i]);
error = fmax( error, fabs(Anew[j][i] - A[j][i]));
}
}
#pragma omp target teams distribute parallel for
for( int j = 1; j < n-1; j++)
{
for( int i = 1; i < m-1; i++ )
{
A[j][i] = Anew[j][i];
}
}
if(iter % 100 == 0) printf("%5d, %0.6f\n", iter, error);
iter++;
}
Explicitly maps the arrays for the entire while loop.
• Spawns thread teams
• Distributes iterations to those teams
• Workshares within those teams
Execution Time (Smaller is Better), in seconds
CLANG, GCC, XL: IBM “Minsky”, NVIDIA Tesla P100; Cray: Cray XC-40, NVIDIA Tesla P100

CLANG       7.786044
Cray        1.851838
GCC        42.543545
GCC simd    8.930509
XL         17.8542
XL simd    11.487634

(In the original chart, each bar was broken into Data, Kernels, and Other time.)
Increasing Parallelism
Currently, both our distributed and our workshared parallelism come from the same
loop.
• We could collapse them together
• We could move the PARALLEL to the inner loop
The COLLAPSE(N) clause
• Turns the next N loops into one, linearized loop.
• This will give us more parallelism to distribute, if we so choose.
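The linearization that COLLAPSE performs can be illustrated with plain host OpenMP (a hedged sketch; the function name and bounds are my own, not from the talk). The two loops below form a single iteration space of (n-2)*(m-2) iterations that is workshared as one loop:

```c
/* collapse(2) fuses the j and i loops into one linearized loop of
   (n-2)*(m-2) iterations, which is then divided among the threads.
   The reduction counts how many iterations ran in total. */
long collapsed_iterations(int n, int m)
{
    long count = 0;
    #pragma omp parallel for collapse(2) reduction(+:count)
    for (int j = 1; j < n - 1; j++)
        for (int i = 1; i < m - 1; i++)
            count += 1;
    return count;
}
```

On the GPU, the same fused iteration space is what teams distribute parallel for divides across both teams and threads, which is why collapsing exposes more parallelism.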
Collapse
#pragma omp target teams distribute parallel for reduction(max:error) map(error) \
        collapse(2)
for( int j = 1; j < n-1; j++)
{
for( int i = 1; i < m-1; i++ )
{
Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
+ A[j-1][i] + A[j+1][i]);
error = fmax( error, fabs(Anew[j][i] - A[j][i]));
}
}
#pragma omp target teams distribute parallel for collapse(2)
for( int j = 1; j < n-1; j++)
{
for( int i = 1; i < m-1; i++ )
{
A[j][i] = Anew[j][i];
}
}
Collapse the two loops
into one and then
parallelize this new
loop across both teams
and threads.
Execution Time (Smaller is Better), in seconds
CLANG, GCC, XL: IBM “Minsky”, NVIDIA Tesla P100; Cray: Cray XC-40, NVIDIA Tesla P100

CLANG       1.490654
Cray        1.820148
GCC        41.812337
XL          3.706288

(In the original chart, each bar was broken into Data, Kernels, and Other time.)
Splitting Teams & Parallel
#pragma omp target teams distribute map(error)
for( int j = 1; j < n-1; j++)
{
#pragma omp parallel for reduction(max:error)
for( int i = 1; i < m-1; i++ )
{
Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
+ A[j-1][i] + A[j+1][i]);
error = fmax( error, fabs(Anew[j][i] - A[j][i]));
}
}
#pragma omp target teams distribute
for( int j = 1; j < n-1; j++)
{
#pragma omp parallel for
for( int i = 1; i < m-1; i++ )
{
A[j][i] = Anew[j][i];
}
}
Distribute the “j” loop
over teams.
Workshare the “i” loop
over threads.
Execution Time (Smaller is Better), in seconds
CLANG, GCC, XL: IBM “Minsky”, NVIDIA Tesla P100; Cray: Cray XC-40, NVIDIA Tesla P100

CLANG       2.30662
Cray        1.94593
GCC        49.474303
GCC simd   12.261814
XL         14.559997

(In the original chart, each bar was broken into Data, Kernels, and Other time.)
Host Fallback
Fallback to the Host Processor
Most OpenMP users would like to write one set of directives for both host and device,
but is this really possible?
Using the “if” clause, offloading can be enabled or disabled at runtime.
#pragma omp target teams distribute parallel for reduction(max:error) map(error) \
        collapse(2) if(target:use_gpu)
for( int j = 1; j < n-1; j++)
{
for( int i = 1; i < m-1; i++ )
{
Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
+ A[j-1][i] + A[j+1][i]);
error = fmax( error, fabs(Anew[j][i] - A[j][i]));
}
}
Compiler must build CPU & GPU
codes and select at runtime.
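One common way to drive the if(target:...) clause is a single runtime switch read at startup (a sketch; the USE_GPU environment variable name and helper function are my own conventions, not from the talk):

```c
#include <stdlib.h>

/* Decide once at startup whether to offload; the result would feed the
   if(target:use_gpu) clause on each target construct.
   USE_GPU is a hypothetical variable name chosen for illustration. */
int use_gpu_from_env(void)
{
    const char *s = getenv("USE_GPU");
    return s != NULL && atoi(s) != 0;   /* default: run on the host */
}
```

With this, the same binary can be timed on the GPU (USE_GPU=1) and in host-fallback mode (USE_GPU=0), which is how the comparison on the next chart is framed.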
Host Fallback vs. Host Native OpenMP (% of Reference CPU Threading)
CLANG, GCC, XL: IBM “Minsky”, NVIDIA Tesla P100; Cray: Cray XC-40, NVIDIA Tesla P100

(Chart: the Teams, Collapse, and Split variants, compiled for host fallback, as a
percentage of native CPU OpenMP performance for each of CLANG, Cray, GCC, and XL;
the axis ranged from 0% to 120%.)
Conclusions
OpenMP offloading compilers for NVIDIA GPUs have improved dramatically over the
past year and are ready for real use.
• Will OpenMP Target code be portable between compilers?
Maybe. The compilers are at varying levels of maturity, and SIMD support and
requirements are inconsistent between them.
• Will OpenMP Target code be portable with the host?
Highly compiler-dependent. XL does this very well, CLANG somewhat well, and GCC
and Cray do poorly.