OpenACC on AMD GPUs and APUs
with the PGI Accelerator Compilers
Michael Wolfe

Michael.Wolfe@pgroup.com
http://www.pgroup.com

APU13
San Jose, November 2013
- C, C++, Fortran compilers
  - Optimizing
  - Vectorizing
  - Parallelizing
- Graphical parallel tools
  - PGDBG debugger
  - PGPROF profiler
- AMD, Intel, NVIDIA processors
- PGI Unified Binary™ technology
- Linux, MacOS, Windows
- Visual Studio & Eclipse integration
- PGI Accelerator support
  - OpenACC
  - CUDA Fortran
www.pgroup.com
SMP Parallel Programming

for( i = 0; i < n; ++i )
    a[i] = sinf(b[i]) + cosf(c[i]);
SMP Parallel Programming

#pragma omp parallel for private(i)
for( i = 0; i < n; ++i )
    a[i] = sinf(b[i]) + cosf(c[i]);
% pgcc -mp x.c …
AMD Radeon Block Diagram*
- Multiple compute units
  - Vector unit
  - Pipelining / multithreading
- Device memory
  - Cache hierarchy
- SW-managed cache (LDS, Local Data Share)

*From “AMD Accelerated Parallel Processing – OpenCL Programming Guide”, © 2012 Advanced Micro Devices, Inc.
Heterogeneous Parallel Programming

for( i = 0; i < n; ++i )
    a[i] = sinf(b[i]) + cosf(c[i]);
Heterogeneous Parallel Programming
#pragma acc parallel loop private(i) \
    pcopyin(b[0:n], c[0:n]) \
    pcopyout(a[0:n])
for( i = 0; i < n; ++i )
    a[i] = sinf(b[i]) + cosf(c[i]);
% pgcc -acc -ta=radeon x.c
Overview
- Parallel programming
- GPU architectural highlights
- OpenACC 5-minute summary
- PGI implementation
- Performance
Abstract CPU+Accelerator Target
Accelerator Architecture Features
- Potentially separate memory (relatively small)
- High-bandwidth memory interface
- Many degrees of parallelism
  - MIMD parallelism across many cores
  - SIMD parallelism within a core
  - Multithreading for latency tolerance
- Asynchronous with host
- Performance from parallelism
  - Slower clock, less instruction-level parallelism (ILP), simpler control unit, smaller caches
OpenACC
Open Programming Standard for Parallel Computing
“PGI OpenACC will enable programmers to easily develop portable applications that maximize the performance and power efficiency benefits of the hybrid CPU/GPU architecture of Titan.”
--Buddy Bland, Titan Project Director, Oak Ridge National Lab

“OpenACC is a technically impressive initiative brought together by members of the OpenMP Working Group on Accelerators, as well as many others. We look forward to releasing a version of this proposal in the next release of OpenMP.”
--Michael Wong, CEO OpenMP Directives Board
OpenACC Overview
- Directive-based
- Parallel computation
- Data management

#pragma acc data copyin( a[0:n] ) \
    copy( b[0:n] ) create( tmp[0:n] )
{
    for( int i = 0; i < iters; ++i ){
        relax( a, b, tmp, n );
        relax( b, a, tmp, n );
    }
}
void relax(float *x,float *y,float *t,int n){
    #pragma acc data \
        present( x[0:n], y[0:n], t[0:n] )
    {
        #pragma acc parallel loop
        for( int j = 0; j < n; ++j )
            t[j] = x[j];
        #pragma acc parallel loop
        for( int j = 1; j < n-1; ++j )
            /* y[n-j] and y[n-j-2] neighbor y[n-1-j] and stay inside y[0:n] */
            x[j] = 0.25f*(t[j-1]+t[j+1] +
                          y[n-j] + y[n-j-2]);
    }
}
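The two fragments above can be assembled into one compilable file, sketched here so the example also runs in a host-only build. The wrapper name `run_relax` and the `malloc`-ed `tmp` array are hypothetical. The mirrored-neighbor reads are written as `y[n-j]` and `y[n-j-2]`, the neighbors of `y[n-1-j]`, which stay inside `y[0:n]` for `1 <= j <= n-2`; an index of `n-j+1` would read `y[n]` at `j = 1`.

```c
#include <stdlib.h>

/* One relaxation sweep. present() asserts that x, y and t are already on
   the device, placed there by the enclosing data region in run_relax. */
void relax(float *x, float *y, float *t, int n)
{
    #pragma acc data present(x[0:n], y[0:n], t[0:n])
    {
        #pragma acc parallel loop
        for (int j = 0; j < n; ++j)
            t[j] = x[j];
        #pragma acc parallel loop
        for (int j = 1; j < n - 1; ++j)
            x[j] = 0.25f * (t[j-1] + t[j+1] + y[n-j] + y[n-j-2]);
    }
}

/* Hypothetical driver. In an OpenACC build the data region copies a and b
   to the device once, allocates tmp there without a transfer, and copies
   only b back at the end; a host build updates both arrays in place. */
int run_relax(float *a, float *b, int n, int iters)
{
    float *tmp = malloc((size_t)n * sizeof *tmp);
    if (tmp == NULL)
        return -1;
    #pragma acc data copyin(a[0:n]) copy(b[0:n]) create(tmp[0:n])
    {
        for (int i = 0; i < iters; ++i) {
            relax(a, b, tmp, n);
            relax(b, a, tmp, n);
        }
    }
    free(tmp);
    return 0;
}
```

Because the data region encloses the whole iteration loop, the transfers happen once rather than once per kernel launch, which is the point of the `present` clause inside `relax`.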
OpenACC compared to OpenMP

OpenACC                           OpenMP
Data parallelism                  Thread parallelism
Degree of parallelism per region  Fixed number of threads
Flexible parallelism mapping      Fixed mapping of parallelism to threads
Structured parallelism            Tasks and loops
Performance portability           ?
PGI OpenACC Implementation
- C, C++, Fortran
  - pgcc, pgc++, pgfortran
- Command line options
  - -acc
  - -ta=radeon
  - -ta=radeon,host
  - -ta=radeon,nvidia
- Planner
  - Maps program parallelism to hardware parallelism
- Code generator
  - OpenCL API
- Runtime
  - Initialization
  - Data management
  - Kernel launches
Planner
- Maps parallel loops
  - OpenACC abstractions: gang, worker, vector
  - OpenCL abstractions: work-group, work-item
  - Hardware abstractions: wavefront

#pragma acc parallel loop gang
for( int j = 0; j < n; ++j )
    t[j] = x[j];

#pragma acc parallel loop gang vector
for( int j = 0; j < n; ++j )
    t[j] = x[j];

#pragma acc kernels loop independent
for( int j = 0; j < n; ++j )
    t[j] = x[j];
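The planner's job can be illustrated with the simplest possible mapping, stated as an assumption rather than a description of how PGI actually decides: for `parallel loop gang vector`, each gang becomes an OpenCL work-group and each vector lane a work-item, so iteration `j` runs as work-item `j % vlen` of work-group `j / vlen`. On GCN GPUs such as Tahiti a wavefront is 64 work-items wide, so a vector length that is a multiple of 64 fills whole wavefronts. All names below, and the vector length of 256 used in the test, are hypothetical.

```c
/* Hypothetical launch shape for a gang-vector loop of n iterations. */
typedef struct {
    int num_groups;  /* gangs, i.e. OpenCL work-groups */
    int group_size;  /* vector length, i.e. work-items per work-group */
    int wavefronts;  /* 64-wide wavefronts per work-group, rounded up */
} launch_shape;

launch_shape plan_gang_vector(int n, int vlen)
{
    launch_shape s;
    s.group_size = vlen;
    s.num_groups = (n + vlen - 1) / vlen;  /* ceil(n / vlen) */
    s.wavefronts = (vlen + 63) / 64;
    return s;
}

/* Loop iteration handled by work-item `item` of work-group `group`. */
int iteration_index(int group, int item, int vlen)
{
    return group * vlen + item;
}

/* Host emulation of the gang-vector copy loop t[j] = x[j]. The last
   work-group may be partial, so indices >= n must be skipped. */
void emulate_copy(float *t, const float *x, int n, int vlen)
{
    launch_shape s = plan_gang_vector(n, vlen);
    for (int g = 0; g < s.num_groups; ++g)        /* gangs */
        for (int w = 0; w < s.group_size; ++w) {  /* vector lanes */
            int j = iteration_index(g, w, vlen);
            if (j < n)
                t[j] = x[j];
        }
}
```

For `n = 1000` and a vector length of 256 this gives 4 work-groups of 4 wavefronts each, with the last 24 work-items of the final group idle.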
Code Generator
- Low-level OpenCL
  - “assembly code in C”
- SPIR interface to AMD Radeon LLVM back-end
- Uses non-standard features
  - Device addresses
Runtime
- Dynamically loads OpenCL library
- Supports multiple devices
  - Multiple command queues
  - Host as a device (*)
- Memory management
  - Device addresses
  - bigbuffer(s) suballocation
- Profiling support
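The "bigbuffer(s) suballocation" item suggests a common runtime technique, sketched here under that assumption rather than as PGI's implementation: create one large device buffer up front and hand out aligned offsets within it, so a data clause costs an offset computation instead of a separate `clCreateBuffer` call. The sketch manages offsets only; `big_buffer`, `bb_alloc` and the other names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical bump suballocator over one buffer of `capacity` bytes.
   Offsets returned here would index into a single device buffer. */
typedef struct {
    size_t capacity;  /* total bytes in the big buffer */
    size_t next;      /* first unused byte */
} big_buffer;

void bb_init(big_buffer *bb, size_t capacity)
{
    bb->capacity = capacity;
    bb->next = 0;
}

/* Returns the offset of `size` bytes aligned to `align` (a power of two),
   or (size_t)-1 if the buffer cannot hold the request. */
size_t bb_alloc(big_buffer *bb, size_t size, size_t align)
{
    size_t off = (bb->next + align - 1) & ~(align - 1);  /* round up */
    if (off < bb->next || off > bb->capacity || size > bb->capacity - off)
        return (size_t)-1;  /* overflow or not enough room */
    bb->next = off + size;
    return off;
}

/* Releases every suballocation at once. */
void bb_reset(big_buffer *bb)
{
    bb->next = 0;
}
```

A real runtime would also need to release individual blocks (for example with a free list), since data regions need not end in allocation order; the bump-and-reset scheme only keeps the sketch short.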
Performance
- AMD Piledriver 5800K: 4.0 GHz, 2 MB cache, 8 cores
  - Single thread/core
  - OpenMP parallel: PGI 13.10 -fast -mp
- AMD Radeon 7970 (Tahiti): 925 MHz, 3 GB memory, 32 compute units
  - OpenACC parallel: PGI 13.10 -fast -acc -ta=radeon:tahiti
CloverLeaf Mantevo Miniapp
- Lagrangian-Eulerian hydrodynamics
  - Compressible Euler equation solver in 2D
- 9500 lines of Fortran+C with OpenMP and OpenACC
- Herdman et al., “Accelerating Hydrocodes with OpenACC, OpenCL and CUDA,” 2012 SC Companion, DOI: 10.1109/SC.Companion.2012.66
Performance Results
[Chart: results for Serial, OpenMP, R7970 and S10000 at problem sizes 960^2x87, 1920^2x87, 3840^2x87, 960^2x2955 and 1920^2x2955; y-axis scale 0 to 40, metric not recoverable]
OpenACC on AMD GPUs and APUs
- OpenACC is designed for performance portability
  - PGI Accelerator compilers provide evidence
- Target-specific tuning still underway
- Open beta compilers available now
- Product version in January 2014
Copyright Notice

© Contents copyright 2013, NVIDIA Corp. This material may not be
reproduced in any manner without the expressed written
permission of NVIDIA Corp.

PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by Michael Wolfe
