Jeff Larkin, August 2017 DOE Performance Portability Workshop
Early Results of OpenMP 4.5 Portability
on NVIDIA GPUs
Background
Since the last performance portability workshop, several OpenMP
implementations for NVIDIA GPUs have emerged or matured
As of August 2017, can these implementations deliver on performance,
portability, and performance portability?
• Will OpenMP Target code be portable between compilers?
• Will OpenMP Target code be portable with the host?
I will compare results using 4 compilers: CLANG, Cray, GCC, and XL
8/23/2017
OpenMP In Clang
Multi-vendor effort to implement OpenMP in Clang (including offloading)
Runtime based on open-sourced runtime from Intel.
Current status: much improved since last year!
Version used: clang/20170629
Compiler Options:
-O2 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda --cuda-path=$CUDA_HOME
OpenMP In Cray
Due to its experience with OpenACC, Cray’s OpenMP 4.x compiler was the first to
market for NVIDIA GPUs.
Observation: Does not adhere to OpenMP as strictly as the others.
Version used: 8.5.5
Compiler Options: None Required
Note: Cray performance results were obtained on an x86 + P100 system, unlike the
other compilers; only GPU performance is compared.
OpenMP In GCC
Open-source GCC compiler with support for OpenMP offloading to NVIDIA GPUs
Runtime also based on open-sourced runtime from Intel
Current status: Mature on CPU, Very immature on GPU
Version used: 7.1.1 20170718 (experimental)
Compiler Options:
-O3 -fopenmp -foffload="-lm"
OpenMP In XL
IBM’s compiler suite, which now includes offloading to NVIDIA GPUs.
Same(ish) runtime as CLANG, but compilation by IBM’s compiler
Version used: xl/20170727-beta
Compiler Options:
-O3 -qsmp -qoffload
Case Study: Jacobi Iteration
Example: Jacobi Iteration
Iteratively converges to the correct value (e.g., temperature) by computing new
values at each point from the average of the neighboring points.
Common, useful algorithm
Example: Solve the Laplace equation in 2D: ∇²f(x, y) = 0

Each point A(i,j) is updated from its four neighbors A(i-1,j), A(i+1,j),
A(i,j-1), and A(i,j+1):

A_{k+1}(i, j) = ( A_k(i-1, j) + A_k(i+1, j) + A_k(i, j-1) + A_k(i, j+1) ) / 4
Teams & Distribute
Teaming Up
#pragma omp target data map(to:Anew) map(A)
while ( error > tol && iter < iter_max )
{
error = 0.0;
#pragma omp target teams distribute parallel for reduction(max:error) map(error)
for( int j = 1; j < n-1; j++)
{
for( int i = 1; i < m-1; i++ )
{
Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
+ A[j-1][i] + A[j+1][i]);
error = fmax( error, fabs(Anew[j][i] - A[j][i]));
}
}
#pragma omp target teams distribute parallel for
for( int j = 1; j < n-1; j++)
{
for( int i = 1; i < m-1; i++ )
{
A[j][i] = Anew[j][i];
}
}
if(iter % 100 == 0) printf("%5d, %0.6f\n", iter, error);
iter++;
}
Explicitly maps the arrays for the entire while loop.
• Spawns thread teams
• Distributes iterations to those teams
• Workshares within those teams
Execution Time (Smaller is Better), in seconds
CLANG, GCC, XL: IBM “Minsky”, NVIDIA Tesla P100; Cray: Cray XC-40, NVIDIA Tesla P100

CLANG       7.786044
Cray        1.851838
GCC        42.543545
GCC simd    8.930509
XL         17.8542
XL simd    11.487634

(In the original chart, each bar was broken into Data, Kernels, and Other time.)
Increasing Parallelism
Currently, both our distributed and our workshared parallelism come from the same
loop.
• We could collapse them together
• We could move the PARALLEL to the inner loop
The COLLAPSE(N) clause
• Turns the next N loops into one, linearized loop.
• This will give us more parallelism to distribute, if we so choose.
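The linearization that COLLAPSE performs can be illustrated with plain host OpenMP (a hedged sketch; the function name and bounds are my own, not from the talk). The two loops below form a single iteration space of (n-2)*(m-2) iterations that is workshared as one loop:

```c
/* collapse(2) fuses the j and i loops into one linearized loop of
   (n-2)*(m-2) iterations, which is then divided among the threads.
   The reduction counts how many iterations ran in total. */
long collapsed_iterations(int n, int m)
{
    long count = 0;
    #pragma omp parallel for collapse(2) reduction(+:count)
    for (int j = 1; j < n - 1; j++)
        for (int i = 1; i < m - 1; i++)
            count += 1;
    return count;
}
```

On the GPU, the same fused iteration space is what teams distribute parallel for divides across both teams and threads, which is why collapsing exposes more parallelism.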
Collapse
#pragma omp target teams distribute parallel for reduction(max:error) map(error) \
        collapse(2)
for( int j = 1; j < n-1; j++)
{
for( int i = 1; i < m-1; i++ )
{
Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
+ A[j-1][i] + A[j+1][i]);
error = fmax( error, fabs(Anew[j][i] - A[j][i]));
}
}
#pragma omp target teams distribute parallel for collapse(2)
for( int j = 1; j < n-1; j++)
{
for( int i = 1; i < m-1; i++ )
{
A[j][i] = Anew[j][i];
}
}
Collapse the two loops
into one and then
parallelize this new
loop across both teams
and threads.
Execution Time (Smaller is Better), in seconds
CLANG, GCC, XL: IBM “Minsky”, NVIDIA Tesla P100; Cray: Cray XC-40, NVIDIA Tesla P100

CLANG       1.490654
Cray        1.820148
GCC        41.812337
XL          3.706288

(In the original chart, each bar was broken into Data, Kernels, and Other time.)
Splitting Teams & Parallel
#pragma omp target teams distribute map(error)
for( int j = 1; j < n-1; j++)
{
#pragma omp parallel for reduction(max:error)
for( int i = 1; i < m-1; i++ )
{
Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
+ A[j-1][i] + A[j+1][i]);
error = fmax( error, fabs(Anew[j][i] - A[j][i]));
}
}
#pragma omp target teams distribute
for( int j = 1; j < n-1; j++)
{
#pragma omp parallel for
for( int i = 1; i < m-1; i++ )
{
A[j][i] = Anew[j][i];
}
}
Distribute the “j” loop
over teams.
Workshare the “i” loop
over threads.
Execution Time (Smaller is Better), in seconds
CLANG, GCC, XL: IBM “Minsky”, NVIDIA Tesla P100; Cray: Cray XC-40, NVIDIA Tesla P100

CLANG       2.30662
Cray        1.94593
GCC        49.474303
GCC simd   12.261814
XL         14.559997

(In the original chart, each bar was broken into Data, Kernels, and Other time.)
Host Fallback
Fallback to the Host Processor
Most OpenMP users would like to write one set of directives for both host and device,
but is this really possible?
Using the “if” clause, offloading can be enabled or disabled at runtime.
#pragma omp target teams distribute parallel for reduction(max:error) map(error) \
        collapse(2) if(target:use_gpu)
for( int j = 1; j < n-1; j++)
{
for( int i = 1; i < m-1; i++ )
{
Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
+ A[j-1][i] + A[j+1][i]);
error = fmax( error, fabs(Anew[j][i] - A[j][i]));
}
}
Compiler must build CPU & GPU
codes and select at runtime.
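One common way to drive the if(target:...) clause is a single runtime switch read at startup (a sketch; the USE_GPU environment variable name and helper function are my own conventions, not from the talk):

```c
#include <stdlib.h>

/* Decide once at startup whether to offload; the result would feed the
   if(target:use_gpu) clause on each target construct.
   USE_GPU is a hypothetical variable name chosen for illustration. */
int use_gpu_from_env(void)
{
    const char *s = getenv("USE_GPU");
    return s != NULL && atoi(s) != 0;   /* default: run on the host */
}
```

With this, the same binary can be timed on the GPU (USE_GPU=1) and in host-fallback mode (USE_GPU=0), which is how the comparison on the next chart is framed.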
Host Fallback vs. Host Native OpenMP (% of Reference CPU Threading)
CLANG, GCC, XL: IBM “Minsky”, NVIDIA Tesla P100; Cray: Cray XC-40, NVIDIA Tesla P100

(Chart: the Teams, Collapse, and Split variants, compiled for host fallback, as a
percentage of native CPU OpenMP performance for each of CLANG, Cray, GCC, and XL;
the axis ranged from 0% to 120%.)
Conclusions
OpenMP offloading compilers for NVIDIA GPUs have improved dramatically over the
past year and are ready for real use.
• Will OpenMP Target code be portable between compilers?
Maybe. The compilers are at varying levels of maturity, and SIMD support and
requirements are inconsistent between them.
• Will OpenMP Target code be portable with the host?
Highly compiler-dependent. XL does this very well, CLANG somewhat well, and GCC
and Cray do poorly.