PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by Michael Wolfe

OpenACC on AMD GPUs and APUs
with the PGI Accelerator Compilers
Michael Wolfe

Michael.Wolfe@pgroup.com
http://www.pgroup.com

APU13
San Jose, November, 2013

• C, C++, Fortran compilers
  • Optimizing
  • Vectorizing
  • Parallelizing
• Graphical parallel tools
  • PGDBG debugger
  • PGPROF profiler
• AMD, Intel, NVIDIA processors
• PGI Unified Binary™ technology
• Linux, MacOS, Windows
• Visual Studio & Eclipse integration
• PGI Accelerator support
  • OpenACC
  • CUDA Fortran

www.pgroup.com

SMP Parallel Programming

for( i = 0; i < n; ++i )
    a[i] = sinf(b[i]) + cosf(c[i]);

SMP Parallel Programming

#pragma omp parallel for private(i)
for( i = 0; i < n; ++i )
    a[i] = sinf(b[i]) + cosf(c[i]);

% pgcc -mp x.c …

AMD Radeon Block Diagram*
• Multiple Compute Units
  • Vector Unit
  • Pipelining / Multithreading
• Device Memory
  • Cache Hierarchy
  • SW-managed cache (LDS)

*From “AMD Accelerated Parallel Processing – OpenCL Programming Guide”, © 2012 Advanced Micro Devices, Inc.

Heterogeneous Parallel Programming

for( i = 0; i < n; ++i )
    a[i] = sinf(b[i]) + cosf(c[i]);

Heterogeneous Parallel Programming

#pragma acc parallel loop private(i) \
    pcopyin(b[0:n], c[0:n]) \
    pcopyout(a[0:n])
for( i = 0; i < n; ++i )
    a[i] = sinf(b[i]) + cosf(c[i]);

% pgcc -acc -ta=radeon x.c

Overview
• Parallel programming
• GPU Architectural highlights
• OpenACC 5-minute summary
• PGI Implementation
• Performance

Abstract CPU+Accelerator Target

Accelerator Architecture Features
• Potentially separate memory (relatively small)
• High bandwidth memory interface
• Many degrees of parallelism
  • MIMD parallelism across many cores
  • SIMD parallelism within a core
  • Multithreading for latency tolerance
• Asynchronous with host
• Performance from Parallelism
  • slower clock, less ILP, simpler control unit, smaller caches

OpenACC
Open Programming Standard for Parallel Computing
“PGI OpenACC will enable programmers to easily develop portable applications that
maximize the performance and power efficiency benefits of the hybrid CPU/GPU
architecture of Titan.”
--Buddy Bland, Titan Project Director, Oak Ridge National Lab

“OpenACC is a technically impressive initiative brought together by members of the
OpenMP Working Group on Accelerators, as well as many others. We look forward to
releasing a version of this proposal in the next release of OpenMP.”
--Michael Wong, CEO OpenMP Directives Board

OpenACC Overview

• Directive-based
• Parallel Computation
• Data Management

#pragma acc data copyin( a[0:n] ) \
    copy( b[0:n] ) create( tmp[0:n] )
{
    for( int i = 0; i < iters; ++i ){
        relax( a, b, tmp, n );
        relax( b, a, tmp, n );
    }
}

void relax(float *x,float *y,float *t,int n){
    #pragma acc data \
        present( x[0:n], y[0:n], t[0:n] )
    {
        #pragma acc parallel loop
        for( int j = 0; j < n; ++j )
            t[j] = x[j];
        #pragma acc parallel loop
        for( int j = 1; j < n-1; ++j )
            /* note: y[n-j+1] reads y[n] when j == 1 */
            x[j] = 0.25f*(t[j-1]+t[j+1] +
                          y[n-j+1] + y[n-j-1]);
    }
}

OpenACC compared to OpenMP

OpenACC                      OpenMP
• Data parallelism           • Thread parallelism
• Parallel per region        • Fixed number of threads
• Flexible || mapping        • Fixed || thread mapping
• Structured parallelism     • Tasks and loops
• Performance portability    • ?

PGI OpenACC Implementation
• C, C++, Fortran
  • pgcc, pgc++, pgfortran
• Command line options
  • -acc
  • -ta=radeon
  • -ta=radeon,host
  • -ta=radeon,nvidia
• Planner
  • maps program ||ism to hardware ||ism
• Code Generator
  • OpenCL API
• Runtime
  • initialization
  • data management
  • kernel launches

Planner
• Maps parallel loops
• OpenACC abstractions
  • gang, worker, vector
• OpenCL abstractions
  • work group, work item
• Hardware abstractions
  • wavefront

#pragma acc parallel loop gang
for( int j = 0; j < n; ++j )
    t[j] = x[j];

#pragma acc parallel loop gang vector
for( int j = 0; j < n; ++j )
    t[j] = x[j];

#pragma acc kernels loop independent
for( int j = 0; j < n; ++j )
    t[j] = x[j];

Code Generator
• Low-level OpenCL
  • “assembly code in C”
• SPIR interface to AMD Radeon LLVM back-end
• Uses non-standard features
  • device addresses

Runtime
• Dynamically loads OpenCL library
• Supports multiple devices
  • Multiple command queues
  • Host as a device (*)
• Memory management
  • device addresses
  • bigbuffer(s) suballocation
• Profiling support

Performance
• AMD Piledriver 5800K
  • 4.0 GHz
  • 2 MB cache
  • 8 cores
• Single thread/core
• OpenMP parallel
  • PGI 13.10 -fast -mp

• AMD Radeon 7970
  • Tahiti
  • 925 MHz
  • 3 GB memory
  • 32 compute units
• OpenACC parallel
  • PGI 13.10 -fast -acc -ta=radeon:tahiti

Cloverleaf Mantevo Miniapp
• Lagrangian-Eulerian hydrodynamics
• compressible Euler equation solver in 2D
• 9500 lines of Fortran+C with OpenMP, OpenACC
• Accelerating Hydrocodes with OpenACC, OpenCL and CUDA, Herdman et al., 2012 SC Companion, DOI: 10.1109/SC.Companion.2012.66

Performance Results
[Bar chart: Serial, OpenMP, R7970, and S10000 for problem sizes 960^2x87, 1920^2x87, 3840^2x87, 960^2x2955, and 1920^2x2955; vertical axis 0 to 40]

OpenACC on AMD GPUs and APUs
• OpenACC is designed for performance portability
• PGI Accelerator compilers provide evidence
• Target-specific tuning still underway
• Open Beta compilers available now
• Product version in January 2014
