HETEROGENEOUS SYSTEM
ARCHITECTURE (HSA): ARCHITECTURE
AND ALGORITHMS
ISCA TUTORIAL - JUNE 15, 2014
TOPICS
 Introduction
 HSAIL Virtual Parallel ISA
 HSA Runtime
 HSA Memory Model
 HSA Queuing Model
 HSA Applications
 HSA Compilation
© Copyright 2014 HSA Foundation. All Rights Reserved
The HSA Specifications are not yet at 1.0 Final, so all content is subject to change
SCHEDULE
© Copyright 2014 HSA Foundation. All Rights Reserved
Time Topic Speaker
8:45am Introduction to HSA Phil Rogers, AMD
9:30am HSAIL Virtual Parallel ISA Ben Sander, AMD
10:30am Break
10:50am HSA Runtime Yeh-Ching Chung, National Tsing Hua University
12 noon Lunch
1pm HSA Memory Model Benedict Gaster, Qualcomm
2pm HSA Queuing Model Hakan Persson, ARM
3pm Break
3:15pm HSA Compilation Technology Wen Mei Hwu, University of Illinois
4pm HSA Application Programming Wen Mei Hwu, University of Illinois
4:45pm Questions All presenters
INTRODUCTION
PHIL ROGERS, AMD CORPORATE FELLOW &
PRESIDENT OF HSA FOUNDATION
HSA FOUNDATION
 Founded in June 2012
 Developing a new platform for heterogeneous
systems
 www.hsafoundation.com
 Specifications under development in working
groups to define the platform
 Membership consists of 43 companies and 16
universities
 Adding 1-2 new members each month
© Copyright 2014 HSA Foundation. All Rights Reserved
DIVERSE PARTNERS DRIVING FUTURE OF
HETEROGENEOUS COMPUTING
© Copyright 2014 HSA Foundation. All Rights Reserved
Founders
Promoters
Supporters
Contributors
Academic
MEMBERSHIP TABLE
Founder (6): AMD, ARM, Imagination Technologies, MediaTek Inc., Qualcomm Inc., Samsung Electronics Co Ltd
Promoter (1): LG Electronics
Contributor (25): Analog Devices Inc., Apical, Broadcom, Canonical Limited, CEVA Inc., Digital Media Professionals, Electronics and Telecommunications Research Institute (ETRI), General Processor, Huawei, Industrial Technology Res. Institute, Marvell International Ltd., Mobica, Oracle, Sonics, Inc., Sony Mobile Communications, Swarm64 GmbH, Synopsys, Tensilica, Inc., Texas Instruments Inc., Toshiba, VIA Technologies, Vivante Corporation
Supporter (13): Allinea Software Ltd, Arteris Inc., Codeplay Software, Fabric Engine, Kishonti, Lawrence Livermore National Laboratory, Linaro, MultiCoreWare, Oak Ridge National Laboratory, Sandia Corporation, StreamComputing, SUSE LLC, UChicago Argonne LLC (Operator of Argonne National Laboratory)
Academic (17): Institute for Computing Systems Architecture, Missouri University of Science & Technology, National Tsing Hua University, NMAM Institute of Technology, Northeastern University, Rice University, Seoul National University, System Software Lab (National Tsing Hua University), Tampere University of Technology, TEI of Crete, The University of Mississippi, University of North Texas, University of Bologna, University of Bristol Microelectronic Research Group, University of Edinburgh, University of Illinois at Urbana-Champaign Department of Computer Science
© Copyright 2014 HSA Foundation. All Rights Reserved
HETEROGENEOUS PROCESSORS HAVE
PROLIFERATED — MAKE THEM BETTER
 Heterogeneous SOCs have arrived and are a
tremendous advance over previous platforms
 SOCs combine CPU cores, GPU cores and
other accelerators, with high bandwidth access
to memory
 How do we make them even better?
 Easier to program
 Easier to optimize
 Higher performance
 Lower power
 HSA unites accelerators architecturally
 Early focus on the GPU compute accelerator,
but HSA will go well beyond the GPU
© Copyright 2014 HSA Foundation. All Rights Reserved
INFLECTIONS IN PROCESSOR DESIGN
© Copyright 2014 HSA Foundation. All Rights Reserved
[Chart: three inflections in processor design, each plotted as performance over time]
 Single-Core Era (single-thread performance over time)
 Enabled by: Moore's Law, voltage scaling
 Constrained by: power, complexity
 Multi-Core Era (throughput performance vs. number of processors)
 Enabled by: Moore's Law, SMP architecture
 Constrained by: power, parallel SW, scalability
 Heterogeneous Systems Era (modern application performance vs. data-parallel exploitation) (we are here)
 Enabled by: abundant data parallelism, power-efficient GPUs
 Temporarily constrained by: programming models, communication overhead
Programming models per era: Assembly → C/C++ → Java …; pthreads → OpenMP / TBB …; Shader → CUDA → OpenCL → C++ and Java
LEGACY GPU COMPUTE
[Diagram: CPU cores sharing coherent system memory; a GPU with compute units (CUs) and non-coherent GPU memory; the two connected over PCIe™]
The limiters:
 Multiple memory pools
 Multiple address spaces
 High overhead dispatch
 Data copies across PCIe
 New languages for programming
 Dual source development
 Proprietary environments
 Expert programmers only
 Need to fix all of this to unleash our programmers
© Copyright 2014 HSA Foundation. All Rights Reserved
EXISTING APUS AND SOCS
[Diagram: CPUs 1..N with coherent system memory and GPU compute units 1..M with non-coherent GPU memory, physically integrated on one die]
 Physical Integration
 Good first step
 Some copies gone
 Two memory pools remain
 Still queue through the OS
 Still requires expert
programmers
 Need to finish the job
AN HSA ENABLED SOC
 Unified Coherent
Memory enables
data sharing across
all processors
 Processors
architected to
operate
cooperatively
 Designed to enable
the application to
run on different
processors at
different times
[Diagram: CPUs 1..N and compute units 1..M all sharing unified coherent memory]
PILLARS OF HSA*
 Unified addressing across all processors
 Operation into pageable system memory
 Full memory coherency
 User mode dispatch
 Architected queuing language
 Scheduling and context switching
 HSA Intermediate Language (HSAIL)
 High level language support for GPU compute processors
© Copyright 2014 HSA Foundation. All Rights Reserved
* All features of HSA are subject to change, pending ratification of 1.0 Final specifications by the HSA Board of Directors
HSA SPECIFICATIONS
 HSA System Architecture Specification
 Version 1.0 Provisional, Released April 2014
 Defines discovery, memory model, queue management, atomics, etc.
 HSA Programmers Reference Specification
 Version 1.0 Provisional, Released June 2014
 Defines the HSAIL language and object format
 HSA Runtime Software Specification
 Version 1.0 Provisional, expected to be released in July 2014
 Defines the APIs through which an HSA application uses the platform
 All released specifications can be found at the HSA Foundation web site:
 www.hsafoundation.com/standards
© Copyright 2014 HSA Foundation. All Rights Reserved
HSA - AN OPEN PLATFORM
 Open Architecture, membership open to all
 HSA Programmers Reference Manual
 HSA System Architecture
 HSA Runtime
 Delivered via royalty free standards
 Royalty Free IP, Specifications and APIs
 ISA agnostic for both CPU and GPU
 Membership from all areas of computing
 Hardware companies
 Operating Systems
 Tools and Middleware
 Applications
 Universities
© Copyright 2014 HSA Foundation. All Rights Reserved
HSA INTERMEDIATE LAYER — HSAIL
 HSAIL is a virtual ISA for parallel programs
 Finalized to ISA by a JIT compiler or “Finalizer”
 ISA independent by design for CPU & GPU
 Explicitly parallel
 Designed for data parallel programming
 Support for exceptions, virtual functions,
and other high level language features
 Lower level than OpenCL SPIR
 Fits naturally in the OpenCL compilation stack
 Suitable to support additional high level languages and programming models:
 Java, C++, OpenMP, Python, etc.
© Copyright 2014 HSA Foundation. All Rights Reserved
HSA MEMORY MODEL
 Defines visibility ordering between all
threads in the HSA System
 Designed to be compatible with
C++11, Java, OpenCL and .NET
Memory Models
 Relaxed consistency memory model
for parallel compute performance
 Visibility controlled by:
 Load.Acquire
 Store.Release
 Fences
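A minimal C11 sketch of the same acquire/release idiom (the HSA model is designed to be compatible with the C++11/C11 memory model; the names below are illustrative, not HSA APIs):

#include <stdatomic.h>

int payload;            // ordinary, non-atomic data
atomic_int flag = 0;    // publication flag

void producer(void) {
    payload = 42;       // plain store
    atomic_store_explicit(&flag, 1, memory_order_release);  // Store.Release publishes payload
}

void consumer(void) {
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;               // Load.Acquire: spin until published
    // payload is guaranteed to read 42 here
}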
© Copyright 2014 HSA Foundation. All Rights Reserved
HSA QUEUING MODEL
 User mode queuing for low latency dispatch
 Application dispatches directly
 No OS or driver required in the dispatch path
 Architected Queuing Layer
 Single compute dispatch path for all hardware
 No driver translation, direct to hardware
 Allows for dispatch to queue from any agent
 CPU or GPU
 GPU self enqueue enables lots of solutions
 Recursion
 Tree traversal
 Wavefront reforming
© Copyright 2014 HSA Foundation. All Rights Reserved
HSA SOFTWARE
EVOLUTION OF THE SOFTWARE STACK
[Diagram: two software stacks, both running on the hardware (APUs, CPUs, GPUs)]
 Legacy driver stack: apps → domain libraries → OpenCL™ and DX runtimes / user mode drivers → graphics kernel mode driver
 HSA software stack: apps → task queuing libraries → HSA domain libraries and OpenCL™ 2.x runtime → HSA Runtime and HSA JIT → HSA kernel mode driver
 Legend: user mode components, kernel mode components, and components contributed by third parties
© Copyright 2014 HSA Foundation. All Rights Reserved
OPENCL™ AND HSA
 HSA is an optimized platform architecture
for OpenCL
 Not an alternative to OpenCL
 OpenCL on HSA will benefit from
 Avoidance of wasteful copies
 Low latency dispatch
 Improved memory model
 Pointers shared between CPU and GPU
 OpenCL 2.0 leverages HSA Features
 Shared Virtual Memory
 Platform Atomics
© Copyright 2014 HSA Foundation. All Rights Reserved
ADDITIONAL LANGUAGES ON HSA
 In development
© Copyright 2014 HSA Foundation. All Rights Reserved
Language | Body | More Information
Java (Sumatra) | OpenJDK | http://openjdk.java.net/projects/sumatra/
LLVM | LLVM code generator for HSAIL |
C++ AMP | MulticoreWare | https://bitbucket.org/multicoreware/cppamp-driver-ng/wiki/Home
OpenMP, GCC | AMD, SUSE | https://gcc.gnu.org/viewcvs/gcc/branches/hsa/gcc/README.hsa?view=markup&pathrev=207425
SUMATRA PROJECT OVERVIEW
 AMD/Oracle sponsored Open Source (OpenJDK) project
 Targeted at Java 9 (2015 release)
 Allows developers to efficiently represent data parallel algorithms in
Java
 Sumatra ‘repurposes’ Java 8’s multi-core Stream/Lambda API’s to
enable both CPU or GPU computing
 At runtime, Sumatra enabled Java Virtual Machine (JVM) will dispatch
‘selected’ constructs to available HSA enabled devices
 Developers of Java libraries are already refactoring their library code to
use these same constructs
 So developers using existing libraries should see GPU acceleration
without any code changes
 http://openjdk.java.net/projects/sumatra/
 https://wikis.oracle.com/display/HotSpotInternals/Sumatra
 http://mail.openjdk.java.net/pipermail/sumatra-dev/
© Copyright 2014 HSA Foundation. All Rights Reserved
[Diagram: Sumatra development and runtime flow. Development: Application.java → Java Compiler → Application.class. Runtime: a Sumatra-enabled JVM runs the application's Lambda/Stream API code on the CPU (CPU ISA) or hands it to the HSA Finalizer to produce GPU ISA]
HSA OPEN SOURCE SOFTWARE
 HSA will feature an open source Linux execution and compilation stack
 Allows a single shared implementation for many components
 Enables university research and collaboration in all areas
 Because it’s the right thing to do
© Copyright 2014 HSA Foundation. All Rights Reserved
Component Name | IHV or Common | Rationale
HSA Bolt Library | Common | Enable understanding and debug
HSAIL Code Generator | Common | Enable research
LLVM Contributions | Common | Industry and academic collaboration
HSAIL Assembler | Common | Enable understanding and debug
HSA Runtime | Common | Standardize on a single runtime
HSA Finalizer | IHV | Enable research and debug
HSA Kernel Driver | IHV | For inclusion in Linux distros
WORKLOAD EXAMPLE
SUFFIX ARRAY CONSTRUCTION
CLOUD SERVER WORKLOAD
SUFFIX ARRAYS
 Suffix Arrays are a fundamental data structure
 Designed for efficient searching of a large text
 Quickly locate every occurrence of a substring S in a text T
 Suffix Arrays are used to accelerate in-memory cloud workloads
 Full text index search
 Lossless data compression
 Bio-informatics
© Copyright 2014 HSA Foundation. All Rights Reserved
ACCELERATED SUFFIX ARRAY
CONSTRUCTION ON HSA
© Copyright 2014 HSA Foundation. All Rights Reserved
M. Deo, “Parallel Suffix Array Construction and Least Common Prefix for the GPU”, Submitted to ”Principles and Practice of Parallel Programming, (PPoPP’13)” February 2013.
AMD A10 4600M APU with Radeon™ HD Graphics; CPU: 4 cores @ 2.3 GHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G, 6 compute units, 685MHz; 4GB RAM
By offloading data parallel computations to
GPU, HSA increases performance and
reduces energy for Suffix Array
Construction.
By efficiently sharing data between CPU and
GPU, HSA lets us move compute to data
without penalty of intermediate copies.
+5.8x increased performance, 5x decreased energy
[Chart: phases of the skew algorithm for Compute SA, mapped to devices: Radix Sort (GPU), Compute SA (CPU), Lexical Rank (CPU), Radix Sort (GPU), Merge Sort (GPU)]
EASE OF PROGRAMMING
CODE COMPLEXITY VS. PERFORMANCE
LINES-OF-CODE AND PERFORMANCE FOR DIFFERENT
PROGRAMMING MODELS
AMD A10-5800K APU with Radeon™ HD Graphics – CPU: 4 cores, 3800MHz (4200MHz Turbo); GPU: AMD Radeon HD 7660D, 6 compute units, 800MHz; 4GB RAM.
Software – Windows 7 Professional SP1 (64-bit OS); AMD OpenCL™ 1.2 AMD-APP (937.2); Microsoft Visual Studio 11 Beta
[Chart: lines of code (0-350) and relative performance (0-35) for an exemplary ISV “Hessian” kernel implemented in Serial CPU, TBB, Intrinsics+TBB, OpenCL™-C, OpenCL™-C++, C++ AMP, and HSA Bolt; LOC is broken down into Init, Compile, Copy, Launch, Algorithm, and Copy-back components]
© Copyright 2014 HSA Foundation. All Rights Reserved
THE HSA FUTURE
 Architected heterogeneous processing on the SOC
 Programming of accelerators becomes much easier
 Accelerated software that runs across multiple hardware vendors
 Scalability from smart phones to super computers on a common architecture
 GPU acceleration of parallel processing is the initial target, with DSPs
and other accelerators coming to the HSA system architecture model
 Heterogeneous software ecosystem evolves at a much faster pace
 Lower power, more capable devices in your hand, on the wall, in the cloud
© Copyright 2014 HSA Foundation. All Rights Reserved
JOIN US!
WWW.HSAFOUNDATION.COM
HETEROGENEOUS SYSTEM
ARCHITECTURE (HSA): HSAIL VIRTUAL
PARALLEL ISA
BEN SANDER, AMD
TOPICS
 Introduction and Motivation
 HSAIL – what makes it special?
 HSAIL Execution Model
 How to program in HSAIL?
 Conclusion
© Copyright 2014 HSA Foundation. All Rights Reserved
STATE OF GPU COMPUTING
Today’s Challenges
 Separate address spaces
 Copies
 Can’t share pointers
 New language required for compute kernel
 EX: OpenCL™ runtime API
 Compute kernel compiled separately from host code
Emerging Solution
 HSA Hardware
 Single address space
 Coherent
 Virtual
 Fast access from all components
 Can share pointers
 Bring GPU computing to existing, popular,
programming models
 Single-source, fully supported by compiler
 HSAIL compiler IR (Cross-platform!)
• GPUs are fast and power efficient: high compute density per mm² and per watt
• But: Can be hard to program
THE PORTABILITY CHALLENGE
 CPU ISAs
 ISA innovations added incrementally (e.g., NEON, AVX)
 ISA retains backwards-compatibility with previous generation
 Two dominant instruction-set architectures: ARM and x86
 GPU ISAs
 Massive diversity of architectures in the market
 Each vendor has its own ISA, and often several in the market at the same time
 No commitment (or attempt!) to provide any backwards compatibility
 Traditionally graphics APIs (OpenGL, DirectX) provide necessary abstraction
© Copyright 2014 HSA Foundation. All Rights Reserved
HSAIL :
WHAT MAKES IT SPECIAL?
WHAT IS HSAIL?
 Intermediate language for parallel compute in HSA
 Generated by a “High Level Compiler” (GCC, LLVM, Java VM, etc)
 Expresses parallel regions of code
 Binary format of HSAIL is called “BRIG”
 Goal: Bring parallel acceleration to mainstream programming languages
© Copyright 2014 HSA Foundation. All Rights Reserved
main() {
…
#pragma omp parallel for
for (int i=0;i<N; i++) {
}
…
}
[Flow: High-Level Compiler → BRIG → Finalizer → Component ISA; host code → Host ISA]
KEY HSAIL FEATURES
 Parallel
 Shared virtual memory
 Portable across vendors in HSA Foundation
 Stable across multiple product generations
 Consistent numerical results (IEEE-754 with defined min accuracy)
 Fast, robust, simple finalization step (no monthly updates)
 Good performance (little need to write in ISA)
 Supports all of OpenCL™
 Supports Java, C++, and other languages as well
© Copyright 2014 HSA Foundation. All Rights Reserved
HSAIL INSTRUCTION SET - OVERVIEW
 Similar to assembly language for a RISC CPU
 Load-store architecture
 Destination register first, then source registers
 140 opcodes (Java™ bytecode has 200)
 Floating point (single, double, half (f16))
 Integer (32-bit, 64-bit)
 Some packed operations
 Branches
 Function calls
 Platform Atomic Operations: and, or, xor, exch, add, sub, inc, dec, max, min, cas
 Synchronize host CPU and HSA Component!
 Text and Binary formats (“BRIG”)
ld_global_u64 $d0, [$d6 + 120] ; $d0 = load($d6 + 120)
add_u64 $d1, $d0, 24 ; $d1 = $d0 + 24
© Copyright 2014 HSA Foundation. All Rights Reserved
SEGMENTS AND MEMORY (1/2)
 7 segments of memory
 global, readonly, group, spill, private, arg, kernarg
 Memory instructions can (optionally) specify a segment
 Control data sharing properties and communicate intent
 Global Segment
 Visible to all HSA agents (including host CPU)
 Group Segment
 Provides high-performance memory shared in the work-group.
 Group memory can be read and written by any work-item in the work-group
 HSAIL provides sync operations to control visibility of group memory
ld_global_u64 $d0,[$d6]
ld_group_u64 $d0,[$d6+24]
st_spill_f32 $s1,[$d6+4]
© Copyright 2014 HSA Foundation. All Rights Reserved
SEGMENTS AND MEMORY (2/2)
 Spill, Private, Arg Segments
 Represent different regions of a per-work-item stack
 Typically generated by compiler, not specified by programmer
 Compiler can use these to convey intent, e.g., spills
 Kernarg Segment
 Programmer writes kernarg segment to pass arguments to a kernel
 Read-Only Segment
 Remains constant during execution of kernel
© Copyright 2014 HSA Foundation. All Rights Reserved
FLAT ADDRESSING
 Each segment mapped into virtual address space
 Flat addresses can map to segments based on virtual address
 Instructions with no explicit segment use flat addressing
 Very useful for high-level language support (e.g., classes, libraries)
 Aligns well with OpenCL 2.0 “generic” addressing feature
ld_global_u64 $d6, [%_arg0] ; global
ld_u64 $d0,[$d6+24] ; flat
© Copyright 2014 HSA Foundation. All Rights Reserved
REGISTERS
 Four classes of registers:
 S: 32-bit, Single-precision FP or Int
 D: 64-bit, Double-precision FP or Long Int
 Q: 128-bit, Packed data.
 C: 1-bit, Control Registers (Compares)
 Fixed number of registers
 S, D, Q share a single pool of resources
 S + 2*D + 4*Q <= 128
 Up to 128 S or 64 D or 32 Q (or a blend)
 Register allocation done in high-level compiler
 Finalizer doesn’t perform expensive register allocation
[Diagram: register pool packing. Control registers c0..c7 are separate; s0..s127, d0..d63, and q0..q31 are drawn from one shared pool, where each D register costs two S slots and each Q register costs four]
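For example, a kernel using 40 S, 20 D, and 8 Q registers consumes 40 + 2·20 + 4·8 = 112 of the 128 slots, so it fits within the pool.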
© Copyright 2014 HSA Foundation. All Rights Reserved
SIMT EXECUTION MODEL
 HSAIL Presents a “SIMT” execution model to the programmer
 “Single Instruction, Multiple Thread”
 Programmer writes program for a single thread of execution
 Each work-item appears to have its own program counter
 Branch instructions look natural
 Hardware Implementation
 Most hardware uses SIMD (Single-Instruction Multiple Data) vectors for efficiency
 Actually one program counter for the entire SIMD instruction
 Branches implemented with predication
 SIMT Advantages
 Easier to program (branch code in particular)
 Natural path for mainstream programming models and existing compilers
 Scales across a wide variety of hardware (programmer doesn’t see vector width)
 Cross-lane operations available for those who want peak performance
© Copyright 2014 HSA Foundation. All Rights Reserved
WAVEFRONTS
 Hardware SIMD vector, composed of 1, 2, 4, 8, 16, 32, 64, 128, or 256 “lanes”
 Lanes in wavefront can be “active” or “inactive”
 Inactive lanes consume hardware resources but don’t do useful work
 Tradeoffs
 “Wavefront-aware” programming can be useful for peak performance
 But results in less portable code (since wavefront width is encoded in algorithm)
if (cond) {
operationA; // cond=True lanes active here
} else {
operationB; // cond=False lanes active here
}
© Copyright 2014 HSA Foundation. All Rights Reserved
CROSS-LANE OPERATIONS
 Example HSAIL cross-lane operation: “activelaneid”
 Dest set to count of earlier work-items that are active for this instruction
 Useful for compaction algorithms
activelaneid_u32 $s0
 Example HSAIL cross-lane operation: “activelaneshuffle”
 Each workitem reads value from another lane in the wavefront
 Supports selection of “identity” element for inactive lanes
 Useful for wavefront-level reductions
activelaneshuffle_b32 $s0, $s1, $s2, 0, 0 // s0 = dest, s1 = source, s2 = lane select, no identity
© Copyright 2014 HSA Foundation. All Rights Reserved
HSAIL MODES
 Working group strived to limit optional modes and features in HSAIL
 Minimize differences between HSA target machines
 Better for compiler vendors and application developers
 Two modes survived
 Machine Models
 Small: 32-bit pointers, 32-bit data
 Large: 64-bit pointers, 32-bit or 64-bit data
 Vendors can support one or both models
 “Base” and “Full” Profiles
 Two sets of requirements for FP accuracy, rounding, exception reporting, and hard preemption
© Copyright 2014 HSA Foundation. All Rights Reserved
HSA PROFILES
Feature | Base | Full
Addressing Modes | Small, Large | Small, Large
All 32-bit HSAIL operations according to the declared profile | Yes | Yes
F16 support (IEEE 754 or better) | Yes | Yes
F64 support | No | Yes
Precision for add/sub/mul | 1/2 ULP | 1/2 ULP
Precision for div | 2.5 ULP | 1/2 ULP
Precision for sqrt | 1 ULP | 1/2 ULP
HSAIL Rounding: Near | Yes | Yes
HSAIL Rounding: Up / Down / Zero | No | Yes
Subnormal floating-point | Flush-to-zero | Supported
Propagate NaN Payloads | No | Yes
FMA | Yes | Yes
Arithmetic Exception reporting | None | DETECT or BREAK
Debug trap | Yes | Yes
Hard Preemption | No | Yes
© Copyright 2014 HSA Foundation. All Rights Reserved
HSA PARALLEL EXECUTION
MODEL
© Copyright 2014 HSA Foundation. All Rights Reserved
HSA PARALLEL EXECUTION MODEL
Basic Idea:
Programmer supplies an HSAIL
“kernel” that is run on each work-item.
Kernel is written as a single thread of
execution.
Programmer specifies grid dimensions
(scope of problem) when launching
the kernel.
Each work-item has a unique
coordinate in the grid.
Programmer optionally specifies work-group dimensions (for optimized communication).
© Copyright 2014 HSA Foundation. All Rights Reserved
CONVOLUTION / SOBEL EDGE FILTER
Gx = [ -1 0 +1 ]
     [ -2 0 +2 ]
     [ -1 0 +1 ]
Gy = [ -1 -2 -1 ]
     [  0  0  0 ]
     [ +1 +2 +1 ]
G = sqrt(Gx² + Gy²)
[Build slides: the kernel runs once per work-item; work-items are arranged in a 2D grid; the grid is partitioned into 2D work-groups]
© Copyright 2014 HSA Foundation. All Rights Reserved
HOW TO PROGRAM HSA?
WHAT DO I TYPE?
© Copyright 2014 HSA Foundation. All Rights Reserved
HSA PROGRAMMING MODELS : CORE PRINCIPLES
 Single source
 Host and device code side-by-side in same source file
 Written in same programming language
 Single unified coherent address space
 Freely share pointers between host and device
 Similar memory model as multi-core CPU
 Parallel regions identified with existing language syntax
 Typically same syntax used for multi-core CPU
 HSAIL is the compiler IR that supports these programming models
© Copyright 2014 HSA Foundation. All Rights Reserved
GCC OPENMP : COMPILATION FLOW
 SUSE GCC Project
 Adding HSAIL code generator to GCC compiler infrastructure
 Supports OpenMP 3.1 syntax
 No data movement directives required!
main() {
…
// Host code.
#pragma omp parallel for
for (int i=0;i<N; i++) {
C[i] = A[i] + B[i];
}
…
}
[Flow: GCC OpenMP Compiler → BRIG → Finalizer → Component ISA; host code → Host ISA]
© Copyright 2014 HSA Foundation. All Rights Reserved
GCC OpenMP flow
 Application: a C/C++/Fortran OpenMP application, e.g.
#pragma omp for
for (j = 0; j < n; j++) { b[j] = a[j]; }
 Compile time (GNU Compiler, GCC): compiles the host code; lowers OpenMP directives and converts GIMPLE to BRIG; embeds the BRIG into the host code; emits runtime calls with kernel name, parameters, and launch attributes
 Run time: pragmas map to calls into the HSA Runtime; kernels are finalized from BRIG to ISA once and cached; kernels are dispatched to the GPU
© Copyright 2014 HSA Foundation. All Rights Reserved
MCW C++AMP : COMPILATION FLOW
 C++AMP : Single-source C++ template parallel programming model
 MCW compiler based on CLANG/LLVM
 Open-source and runs on Linux
 Leverage open-source LLVM->HSAIL code generator
main() {
…
parallel_for_each(grid<1>(extent<256>(…)), …);
…
}
[Flow: C++AMP Compiler → BRIG → Finalizer → Component ISA; host code → Host ISA]
© Copyright 2014 HSA Foundation. All Rights Reserved
JAVA: RUNTIME FLOW
© Copyright 2014 HSA Foundation. All Rights Reserved
JAVA 8 – HSA ENABLED APARAPI
 Java 8 brings Stream + Lambda API.
‒ More natural way of expressing data parallel algorithms
‒ Initially targeted at multi-core.
 APARAPI will :
‒ Support Java 8 Lambdas
‒ Dispatch code to HSA enabled devices at runtime via
HSAIL
[Diagram: Java Application → APARAPI + Lambda API → JVM → HSA Finalizer & Runtime → CPU / GPU]
Future Java – HSA ENABLED JAVA (SUMATRA)
 Adds native GPU acceleration to Java Virtual Machine
(JVM)
 Developer uses JDK Lambda, Stream API
 JVM uses GRAAL compiler to generate HSAIL
[Diagram: Java Application → Java JDK Stream + Lambda API → JVM with GRAAL JIT backend → HSA Finalizer & Runtime → CPU / GPU]
AN EXAMPLE (IN JAVA 8)
© Copyright 2014 HSA Foundation. All Rights Reserved
//Example computes the percentage of total scores achieved by each player on a team.
class Player {
private Team team; // Note: Reference to the parent Team.
private int scores;
private float pctOfTeamScores;
public Team getTeam() {return team;}
public int getScores() {return scores;}
public void setPctOfTeamScores(float pct) { pctOfTeamScores = pct; }
};
// “Team” class not shown
// Assume "allPlayers" is an initialized array of Players.
Arrays.stream(allPlayers). // wrap the array in a stream
parallel(). // developer indication that lambda is thread-safe
forEach(p -> {
int teamScores = p.getTeam().getScores();
float pctOfTeamScores = (float)p.getScores()/(float) teamScores;
p.setPctOfTeamScores(pctOfTeamScores);
});
HSAIL CODE EXAMPLE
© Copyright 2014 HSA Foundation. All Rights Reserved
01: version 0:95: $full : $large;
02: // static method HotSpotMethod<Main.lambda$2(Player)>
03: kernel &run (
04: kernarg_u64 %_arg0 // Kernel signature for lambda method
05: ) {
06: ld_kernarg_u64 $d6, [%_arg0]; // Move arg to an HSAIL register
07: workitemabsid_u32 $s2, 0; // Read the work-item global “X” coord
08:
09: cvt_u64_s32 $d2, $s2; // Convert X gid to long
10: mul_u64 $d2, $d2, 8; // Adjust index for sizeof ref
11: add_u64 $d2, $d2, 24; // Adjust for actual elements start
12: add_u64 $d2, $d2, $d6; // Add to array ref ptr
13: ld_global_u64 $d6, [$d2]; // Load from array element into reg
14: @L0:
15: ld_global_u64 $d0, [$d6 + 120]; // p.getTeam()
16: mov_b64 $d3, $d0;
17: ld_global_s32 $s3, [$d6 + 40]; // p.getScores ()
18: cvt_f32_s32 $s16, $s3;
19: ld_global_s32 $s0, [$d0 + 24]; // Team getScores()
20: cvt_f32_s32 $s17, $s0;
21: div_f32 $s16, $s16, $s17; // p.getScores()/teamScores
22: st_global_f32 $s16, [$d6 + 100]; // p.setPctOfTeamScores()
23: ret;
24: };
HOW TO PROGRAM HSA?
OTHER PROGRAMMING TOOLS
© Copyright 2014 HSA Foundation. All Rights Reserved
HSAIL ASSEMBLER
kernel &run (kernarg_u64 %_arg0)
{
ld_kernarg_u64 $d6, [%_arg0];
workitemabsid_u32 $s2, 0;
cvt_u64_s32 $d2, $s2;
mul_u64 $d2, $d2, 8;
add_u64 $d2, $d2, 24;
add_u64 $d2, $d2, $d6;
ld_global_u64 $d6, [$d2];
. . .
[Flow: HSAIL text → HSAIL Assembler → BRIG → Finalizer → Machine ISA]
• HSAIL has a text format and an assembler
© Copyright 2014 HSA Foundation. All Rights Reserved
OPENCL™ OFFLINE COMPILER (CLOC)
__kernel void vec_add(
__global const float *a,
__global const float *b,
__global float *c,
const unsigned int n)
{
int id = get_global_id(0);
// Bounds check
if (id < n)
c[id] = a[id] + b[id];
}
[Flow: OpenCL kernel → CLOC → BRIG → Finalizer → Machine ISA]
•OpenCL split-source model cleanly isolates kernel
•Can express many HSAIL features in OpenCL Kernel Language
•Higher productivity than writing in HSAIL assembly
•Can dispatch kernel directly with HSAIL Runtime (lower-level access to hardware)
•Or use CLOC+OKRA Runtime for approachable “fits-on-a-slide” GPU programming model
© Copyright 2014 HSA Foundation. All Rights Reserved
KEY TAKEAWAYS
 HSAIL
 Thin, robust, fast finalizer
 Portable (multiple HW vendors and parallel architectures)
 Supports shared virtual memory and platform atomics
 HSA brings GPU computing to mainstream programming models
 Shared and coherent memory bridges “faraway accelerator” gap
 HSAIL provides the common IL for high-level languages to benefit from
parallel computing
 Languages and Compilers
 HSAIL support in GCC, LLVM, Java JVM
 Leverage same language syntax designed for multi-core CPUs
 Can use pointer-containing data structures
© Copyright 2014 HSA Foundation. All Rights Reserved
HSA RUNTIME
YEH-CHING CHUNG, NATIONAL TSING HUA
UNIVERSITY
OUTLINE
 Introduction
 HSA Core Runtime API (Pre-release 1.0 provisional)
 Initialization and Shut Down
 Notifications (Synchronous/Asynchronous)
 Agent Information
 Signals and Synchronization (Memory-Based)
 Queues and Architected Dispatch
 Summary
© Copyright 2014 HSA Foundation. All Rights Reserved
INTRODUCTION (1)
 The HSA core runtime is a thin, user-mode API that provides the interface necessary for
the host to launch compute kernels to the available HSA components.
 The overall goal of the HSA core runtime design is to provide a high-performance dispatch
mechanism that is portable across multiple HSA vendor architectures.
 The dispatch mechanism differentiates the HSA runtime from other language runtimes by
architected argument setting and kernel launching at the hardware and specification level.
 The HSA core runtime API is standard across all HSA vendors, such that languages which use the
HSA runtime can run on different vendor’s platforms that support the API.
 The implementation of the HSA runtime may include kernel-level components (required for
some hardware components, e.g., AMD Kaveri) or may be entirely user-space (for example,
simulators or CPU implementations).
© Copyright 2014 HSA Foundation. All Rights Reserved
INTRODUCTION (2)
 The software architecture stack without HSA runtime:
[Diagram: OpenCL, Java, OpenMP, and DSL apps (programming model) sit on their language runtimes (OpenCL, Java, OpenMP, and DSL runtimes), which each target a separate driver for components 1..N of each vendor 1..m]
 The software architecture stack with HSA runtime:
[Diagram: the same apps and language runtimes layer over a single HSA Runtime and HSA Finalizer per vendor, covering components 1..N of HSA vendors 1..m]
© Copyright 2014 HSA Foundation. All Rights Reserved
INTRODUCTION (3)
[Flow comparison: an agent's program lifecycle on the OpenCL runtime vs. the HSA runtime]
 Agent: Start Program → … → Exit Program
 OpenCL Runtime: Platform, Device, and Context Initialization → Build Kernel → SVM Allocation and Kernel Arguments Setting → Command Queue → Resource Deallocation
 HSA Runtime: HSA Runtime Initialization and Topology Discovery → HSAIL Finalization and Linking → HSA Memory Allocation → Enqueue Dispatch Packet → HSA Runtime Close
© Copyright 2014 HSA Foundation. All Rights Reserved
INTRODUCTION (4)
 HSA Platform System Architecture Specification support
 Runtime initialization and shutdown
 Notifications (synchronous/asynchronous)
 Agent information
 Signals and synchronization (memory-based)
 Queues and Architected dispatch
 Memory management
 HSAIL support
 Finalization, linking, and debugging
 Image and Sampler support
© Copyright 2014 HSA Foundation. All Rights Reserved
RUNTIME INITIALIZATION AND
SHUTDOWN
OUTLINE
 Runtime Initialization API
 hsa_init
 Runtime Shut Down API
 hsa_shut_down
 Examples
© Copyright 2014 HSA Foundation. All Rights Reserved
HSA RUNTIME INITIALIZATION
 When the API is invoked for the first time in a given process, a runtime
instance is created.
 A typical runtime instance may contain information of platform, topology, reference
count, queues, signals, etc.
 The API can be called multiple times by applications
 Only a single runtime instance will exist for a given process.
 Whenever the API is invoked, the reference count is increased by one.
© Copyright 2014 HSA Foundation. All Rights Reserved
HSA RUNTIME SHUT DOWN
 When the API is invoked, the reference count is decreased by 1.
 When the reference count < 1
 All the resources associated with the runtime instance (queues, signals, topology
information, etc.) are considered invalid and any attempt to reference them in
subsequent API calls results in undefined behavior.
 The user might call hsa_init to initialize the HSA runtime again.
 The HSA runtime might release resources associated with it.
© Copyright 2014 HSA Foundation. All Rights Reserved
EXAMPLE – RUNTIME INITIALIZATION (1)
[Code screenshot annotations:]
 Data structure for the runtime instance
 If hsa_init is called more than once, increase the ref_count by 1
© Copyright 2014 HSA Foundation. All Rights Reserved
EXAMPLE – RUNTIME INITIALIZATION (2)
[Code screenshot annotations:]
 When hsa_init is called for the first time, allocate resources and set the reference count
 Get the number of HSA agents
 Create an empty agent list and initialize the agents
 Create the topology table
 If initialization fails, release the resources
© Copyright 2014 HSA Foundation. All Rights Reserved
Agent-0
node_id 0
id 0
type CPU
vendor Generic
name Generic
wavefront_size 0
queue_size 200
group_memory 0
fbarrier_max_count 1
is_pic_supported 0
…
…
EXAMPLE - RUNTIME INSTANCE (1)
Platform Name: Generic Memory
node_id 0
id 0
segment_type 111111
address_base 0x0001
size 2048 MB
peak_bandwidth 6553.6 mbps
Agent-1
node_id 0
id 0
type GPU
vendor Generic
name Generic
wavefront_size 64
queue_size 200
group_memory 64
fbarrier_max_count 1
is_pic_supported 1
Cache
node_id 0
id 0
levels 1
associativity 1
cache size 64KB
cache line size 4
is_inclusive 1
Agent: 2
Memory: 1
Cache: 1
…
…
© Copyright 2014 HSA Foundation. All Rights Reserved
Agent-0
node_id = 0
id = 0
agent_type = 1 (CPU)
vendor[16] = Generic
name[16] = Generic
wavefront_size = 0
queue_size =200
group_memory_size_bytes =0
fbarrier_max_count = 1
is_pic_supported = 0
Platform Header File
*base_address = 0x00001
Size = 248
system_timestamp_frequency_mhz = 200
signal_maximum_wait = 1/200
*node_id
no_nodes = 1
*agent_list
no_agent = 2
*memory_descriptor_list
no_memory_descriptor = 1
*cache_descriptor_list
no_cache_descriptor = 1
EXAMPLE - RUNTIME INSTANCE (2)
…
…
cache
node_id = 0
Id = 0
Levels = 1
* associativity
* cache_size
* cache_line_size
* is_inclusive
Memory
node_id = 0
Id = 0
supported_segment_type_mask = 111111
virtual_address_base = 0x0001
size_in_bytes = 2048MB
peak_bandwidth_mbps = 6553.6
Agent-1
node_id = 0
id = 0
agent_type = 2 (GPU)
vendor[16] = Generic
name[16] = Generic
wavefront_size = 64
queue_size =200
group_memory_size_bytes =64
fbarrier_max_count = 1
is_pic_supported = 1
…
© Copyright 2014 HSA Foundation. All Rights Reserved
EXAMPLE – RUNTIME SHUT DOWN
© Copyright 2014 HSA Foundation. All Rights Reserved
[Code screenshot annotation:] Decrease the ref_count by 1; if it drops below 1, free the agent list and the other resources.
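A minimal C sketch of this reference-counting behavior, assuming the provisional signatures hsa_status_t hsa_init(void) and hsa_status_t hsa_shut_down(void) and an hsa.h header:

#include <hsa.h>
#include <stdio.h>

int main(void) {
    if (hsa_init() != HSA_STATUS_SUCCESS) {  // first call creates the runtime instance
        fprintf(stderr, "hsa_init failed\n");
        return 1;
    }
    hsa_init();        // subsequent calls only increment the reference count

    /* ... create queues, signals, allocate memory ... */

    hsa_shut_down();   // ref_count 2 -> 1: resources remain valid
    hsa_shut_down();   // ref_count 1 -> 0: resources may be released
    return 0;
}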
NOTIFICATIONS
(SYNCHRONOUS/ASYNCHRONOUS)
OUTLINE
 Synchronous Notifications
 hsa_status_t
 hsa_status_string
 Asynchronous Notifications
 Example
© Copyright 2014 HSA Foundation. All Rights Reserved
SYNCHRONOUS NOTIFICATIONS
 Notifications (errors, events, etc.) reported by the runtime can be synchronous or
asynchronous
 The HSA runtime uses the return values of API functions to pass notifications
synchronously.
 A status code is defined as an enumeration, hsa_status_t, to capture the return value
of any API function that has been executed, except accessors/mutators.
 The notification is a status code that indicates success or error.
 Success is represented by HSA_STATUS_SUCCESS, which is equivalent to zero.
 An error status is assigned a positive integer and its identifier starts with the
HSA_STATUS_ERROR prefix.
 The status code can help to determine a cause of the unsuccessful execution.
© Copyright 2014 HSA Foundation. All Rights Reserved
STATUS CODE QUERY
 Query additional information on status code
 Parameters
 status (input): Status code that the user is seeking more information on
 status_string (output): An ISO/IEC 646 encoded English language string that potentially
describes the error status
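A sketch of an error-reporting helper built on this query, assuming the signature hsa_status_t hsa_status_string(hsa_status_t status, const char **status_string):

#include <hsa.h>
#include <stdio.h>

void report_status(hsa_status_t status) {
    if (status == HSA_STATUS_SUCCESS)   // success is equivalent to zero
        return;
    const char *msg = NULL;
    if (hsa_status_string(status, &msg) == HSA_STATUS_SUCCESS && msg != NULL)
        fprintf(stderr, "HSA error %d: %s\n", (int)status, msg);
    else
        fprintf(stderr, "HSA error %d (no description)\n", (int)status);
}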
© Copyright 2014 HSA Foundation. All Rights Reserved
ASYNCHRONOUS NOTIFICATIONS
 The runtime passes asynchronous notifications by calling user-defined
callbacks.
 For instance, queues are a common source of asynchronous events because the
tasks queued by an application are asynchronously consumed by the packet
processor. Callbacks are associated with queues when they are created. When the
runtime detects an error in a queue, it invokes the callback associated with that
queue and passes it an error flag (indicating what happened) and a pointer to the
erroneous queue.
 The HSA runtime does not implement any default callbacks.
 If the callback implementation uses blocking functions and never returns, the runtime
state can become undefined.
© Copyright 2014 HSA Foundation. All Rights Reserved
EXAMPLE - CALLBACK
[Code screenshot annotations:]
 Pass the callback function when creating the queue
 If the queue is empty, set the event and invoke the callback
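A C sketch of the pattern. The hsa_queue_create parameter list and the HSA_QUEUE_TYPE_MULTI value are simplified assumptions; the provisional specification defines the real signature:

#include <hsa.h>
#include <stdio.h>

static void queue_error_cb(hsa_status_t status, hsa_queue_t *queue) {
    // Invoked asynchronously by the runtime when it detects an error in this queue.
    fprintf(stderr, "queue %p reported error %d\n", (void *)queue, (int)status);
    // Keep the callback short and non-blocking: a callback that never returns
    // can leave the runtime state undefined.
}

void create_queue_with_callback(hsa_agent_t agent) {
    hsa_queue_t *queue = NULL;
    // Assumed parameter order and queue-type value; see the provisional spec.
    hsa_queue_create(agent, 256 /* packets */, HSA_QUEUE_TYPE_MULTI,
                     queue_error_cb, NULL /* service queue */, &queue);
}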
© Copyright 2014 HSA Foundation. All Rights Reserved
AGENT INFORMATION
OUTLINE
 Agent information
 hsa_node_t
 hsa_agent_t
 hsa_agent_info_t
 hsa_component_feature_t
 Agent Information manipulation APIs
 hsa_iterate_agents
 hsa_agent_get_info
 Example
© Copyright 2014 HSA Foundation. All Rights Reserved
INTRODUCTION
 The runtime exposes a list of agents that are available in the system.
 An HSA agent is a hardware component that participates in the HSA memory model.
 An HSA agent can submit AQL packets for execution.
 An HSA agent may also but is not required to be an HSA component. It is possible for
a system to include HSA agents that are neither an HSA component nor a host CPU.
 HSA agents are defined as opaque handles of type hsa_agent_t.
 The HSA runtime provides APIs for applications to traverse the list of available
agents and query attributes of a particular agent.
© Copyright 2014 HSA Foundation. All Rights Reserved
AGENT INFORMATION (1)
 Opaque agent handle
 Opaque NUMA node handle
 An HSA memory node is a node that delineates a set of
system components (host CPUs and HSA Components) with
“local” access to a set of memory resources attached to the
node's memory controller and appropriate HSA-compliant
access attributes.
© Copyright 2014 HSA Foundation. All Rights Reserved
AGENT INFORMATION (2)
 Component features
 An HSA component is a hardware or software component that can be a target of the AQL queries
and conforms to the memory model of the HSA.
 Values
 HSA_COMPONENT_FEATURE_NONE = 0
 No component capabilities. The device is an agent, but not a component.
 HSA_COMPONENT_FEATURE_BASIC = 1
 The component supports the HSAIL instruction set and all the AQL packet types except Agent
dispatch.
 HSA_COMPONENT_FEATURE_ALL = 2
 The component supports the HSAIL instruction set and all the AQL packet types.
© Copyright 2014 HSA Foundation. All Rights Reserved
AGENT INFORMATION (3)
 Agent attributes
 Values
 HSA_AGENT_INFO_MAX_GRID_DIM
 HSA_AGENT_INFO_MAX_WORKGROUP_DIM
 HSA_AGENT_INFO_QUEUE_MAX_PACKETS
 HSA_AGENT_INFO_CLOCK
 HSA_AGENT_INFO_CLOCK_FREQUENCY
 HSA_AGENT_INFO_MAX_SIGNAL_WAIT
 HSA_AGENT_INFO_NAME
 HSA_AGENT_INFO_NODE
 HSA_AGENT_INFO_COMPONENT_FEATURES
 HSA_AGENT_INFO_VENDOR_NAME
 HSA_AGENT_INFO_WAVEFRONT_SIZE
 HSA_AGENT_INFO_CACHE_SIZE
© Copyright 2014 HSA Foundation. All Rights Reserved
AGENT INFORMATION MANIPULATION (1)
 Iterate over the available agents, and invoke an application-defined callback on
every iteration
 If callback returns a status other than HSA_STATUS_SUCCESS for a particular
iteration, the traversal stops and the function returns that status value.
 Parameters
 callback (input): Callback to be invoked once per agent
 data (input): Application data that is passed to callback on every iteration. Can be
NULL.
© Copyright 2014 HSA Foundation. All Rights Reserved
AGENT INFORMATION MANIPULATION (2)
 Get the current value of an attribute for a given agent
 Parameters
 agent (input): A valid agent
 attribute (input): Attribute to query
 value (output): Pointer to a user-allocated buffer where to store the value of the
attribute. If the buffer passed by the application is not large enough to hold the value
of attribute, the behavior is undefined.
© Copyright 2014 HSA Foundation. All Rights Reserved
EXAMPLE - AGENT ATTRIBUTE QUERY
[Code screenshot annotations:]
 Get the agent handle of Agent 0
 Copy the agent attribute information
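A sketch that combines hsa_iterate_agents and hsa_agent_get_info to print every agent's name and wavefront size (the 64-byte name buffer is an assumption):

#include <hsa.h>
#include <stdint.h>
#include <stdio.h>

static hsa_status_t print_agent(hsa_agent_t agent, void *data) {
    (void)data;                           // no per-iteration state needed
    char name[64] = {0};                  // assumed large enough for HSA_AGENT_INFO_NAME
    uint32_t wavefront = 0;
    hsa_agent_get_info(agent, HSA_AGENT_INFO_NAME, name);
    hsa_agent_get_info(agent, HSA_AGENT_INFO_WAVEFRONT_SIZE, &wavefront);
    printf("agent %s: wavefront size %u\n", name, wavefront);
    return HSA_STATUS_SUCCESS;            // continue the traversal
}

void list_agents(void) {
    hsa_iterate_agents(print_agent, NULL);
}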
© Copyright 2014 HSA Foundation. All Rights Reserved
SIGNALS AND SYNCHRONIZATION
(MEMORY-BASED)
OUTLINE
 Signal
 Signal manipulation API
 Create/Destroy
 Query
 Send
 Atomic Operations
 Signal wait
 Get time out
 Signal Condition
 Example
© Copyright 2014 HSA Foundation. All Rights Reserved
SIGNAL (1)
 HSA agents can communicate with each other by using coherent global memory,
or by using signals.
 A signal is represented by an opaque signal handle
 A signal carries a value, which can be updated or conditionally waited upon via
an API call or HSAIL instruction.
 The value occupies four or eight bytes depending on the machine model in use.
© Copyright 2014 HSA Foundation. All Rights Reserved
SIGNAL (2)
 Updating the value of a signal is equivalent to sending the signal.
 In addition to the update (store) of signals, the API for sending signal must
support other atomic operations with specific memory order semantics
 Atomic operations: AND, OR, XOR, Add, Subtract, Exchange, and CAS
 Memory order semantics : Release and Relaxed
© Copyright 2014 HSA Foundation. All Rights Reserved
SIGNAL CREATE/DESTROY
 Create a signal
 Parameters
 initial_value (input): Initial value of the
signal.
 signal_handle (output): Signal handle.
 Destroy a signal previous created by
hsa_signal_create
 Parameter
 signal_handle (input): Signal handle.
© Copyright 2014 HSA Foundation. All Rights Reserved
SIGNAL LOAD/STORE
 Atomically read the current signal value with acquire semantics
 Atomically read the current signal value with relaxed semantics
 Send and atomically set the value of a signal with release semantics
 Send and atomically set the value of a signal with relaxed semantics
© Copyright 2014 HSA Foundation. All Rights Reserved
SIGNAL ADD/SUBTRACT
 Send and atomically increment the value of a signal by a given amount with release semantics
 Send and atomically increment the value of a signal by a given amount with relaxed semantics
 Send and atomically decrement the value of a signal by a given amount with release semantics
 Send and atomically decrement the value of a signal by a given amount with relaxed semantics
© Copyright 2014 HSA Foundation. All Rights Reserved
SIGNAL AND (OR, XOR)/EXCHANGE
 Send and atomically perform a logical AND operation on the value of a signal and a given value with release semantics
 Send and atomically perform a logical AND operation on the value of a signal and a given value with relaxed semantics
 Send and atomically set the value of a signal and return its previous value with release semantics
 Send and atomically set the value of a signal and return its previous value with relaxed semantics
© Copyright 2014 HSA Foundation. All Rights Reserved
SIGNAL WAIT (1)
 The application may wait on a signal, with a condition specifying the terms of
wait.
 Signal wait condition operator
 Values
 HSA_EQ: The two operands are equal.
 HSA_NE: The two operands are not equal.
 HSA_LT: The first operand is less than the second operand.
 HSA_GTE: The first operand is greater than or equal to the second operand.
© Copyright 2014 HSA Foundation. All Rights Reserved
SIGNAL WAIT (2)
 The wait can be done either in the HSA component via an HSAIL wait instruction
or via a runtime API defined here.
 Waiting on a signal returns the current value at the opaque signal object;
 The wait may have a runtime defined timeout which indicates the maximum amount of time that an
implementation can spend waiting.
 The signal infrastructure allows for multiple senders/waiters on a single signal.
 Wait reads the value, hence acquire synchronizations may be applied.
© Copyright 2014 HSA Foundation. All Rights Reserved
SIGNAL WAIT (3)
 Signal wait
 Parameters
 signal_handle (input): A signal handle
 condition (input): Condition used to compare the passed and signal values
 compare_value (input): Value to compare with
 return_value (output): A pointer where the current signal value must be read into
© Copyright 2014 HSA Foundation. All Rights Reserved
SIGNAL WAIT (4)
 Signal wait with timeout
 Parameters
 signal_handle (input): A signal handle
 timeout (input): Maximum wait duration (A value of zero indicates no maximum)
 long_wait (input): Hint indicating that the signal value is not expected to meet the given condition in
a short period of time. The HSA runtime may use this hint to optimize the wait implementation.
 condition (input): Condition used to compare the passed and signal values
 compare_value (input): Value to compare with
 return_value (output): A pointer where the current signal value must be read into
© Copyright 2014 HSA Foundation. All Rights Reserved
EXAMPLE – SIGNAL WAIT (1)
[Timeline: two threads sharing a signal whose value starts at 0]
 thread_1 calls hsa_signal_wait_timeout_acquire with condition value == 2 and blocks
 thread_2 calls hsa_signal_add_relaxed (value = value + 3), so value = 3
 thread_2 calls hsa_signal_subtract_relaxed (value = value - 1), so value = 2
 The condition is satisfied: the wait returns the signal value and thread_1 continues
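The same timeline as a C sketch. The handle and value type names (hsa_signal_handle_t, hsa_signal_value_t), the destroy-function name, and the exact parameter order of the wait call are assumptions based on the parameter descriptions above:

#include <hsa.h>

void signal_example(void) {
    hsa_signal_handle_t sig;              // assumed opaque handle type
    hsa_signal_create(0, &sig);           // initial value 0

    // thread_2's updates:
    hsa_signal_add_relaxed(sig, 3);       // value: 0 -> 3
    hsa_signal_subtract_relaxed(sig, 1);  // value: 3 -> 2

    // thread_1: block until value == 2, then read the value back
    hsa_signal_value_t observed;
    hsa_signal_wait_timeout_acquire(sig, 0 /* no timeout */, 0 /* short-wait hint */,
                                    HSA_EQ, 2, &observed);

    hsa_signal_destroy(sig);              // assumed destroy-function name
}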
© Copyright 2014 HSA Foundation. All Rights Reserved
EXAMPLE – SIGNAL WAIT (2)
[Code screenshot annotations:]
 If signal_handle is invalid, return an invalid-signal status
 Signal wait condition function: compare tmp->value with compare_value to check whether the condition is satisfied
 If the condition is satisfied, return the signal value and a success status
 If timeout == 0, return a signal-timeout status
© Copyright 2014 HSA Foundation. All Rights Reserved
QUEUES AND ARCHITECTED
DISPATCH
OUTLINE
 Queues
 Queue Types and Structure
 HSA runtime API for Queue Manipulations
 Architected Queuing Language (AQL) Support
 Packet type
 Packet header
 Examples
 Enqueue Packet
 Packet Processor
© Copyright 2014 HSA Foundation. All Rights Reserved
INTRODUCTION (1)
 An HSA-compliant platform supports multiple user-level command queues allocation.
 A user-level command queue is characterized as runtime-allocated, user-level accessible virtual
memory of a certain size, containing packets defined in the Architected Queuing Language (AQL
packets).
 Queues are allocated by HSA applications through the HSA runtime.
 HSA software receives memory-based structures to configure the hardware queues to
allow for efficient software management of the hardware queues of the HSA agents.
 This queue memory shall be processed by the HSA Packet Processor as a ring buffer.
 Queues are read-only data structures.
 Writing values directly to a queue structure results in undefined behavior.
 But HSA agents can directly modify the contents of the buffer pointed to by base_address, or use
runtime APIs to access the doorbell signal or the service queue.
© Copyright 2014 HSA Foundation. All Rights Reserved
 Two queue types, AQL and Service Queues, are supported
 AQL Queue consumes AQL packets that are used to specify the information of kernel functions
that will be executed on the HSA component
 Service Queue consumes agent dispatch packets that are used to specify runtime-defined or user
registered functions that will be executed on the agent (typically, the host CPU)
INTRODUCTION (2)
© Copyright 2014 HSA Foundation. All Rights Reserved
INTRODUCTION (3)
 AQL queue structure
© Copyright 2014 HSA Foundation. All Rights Reserved
INTRODUCTION (4)
 In addition to the data held in the queue structure, the queue also defines two
properties (readIndex and writeIndex) that define the location of “head” and “tail”
of the queue.
 readIndex: The read index is a 64-bit unsigned integer that specifies the packetID of
the next AQL packet to be consumed by the packet processor.
 writeIndex: The write index is a 64-bit unsigned integer that specifies the packetID of
the next AQL packet slot to be allocated.
 Both indices are not directly exposed to the user, who can only access them by using
dedicated HSA core runtime APIs.
 The available index functions differ on the index of interest (read or write), action to be
performed (addition, compare and swap, etc.), and memory consistency model
(relaxed, release, etc.).
© Copyright 2014 HSA Foundation. All Rights Reserved
INTRODUCTION (5)
 The read index is automatically advanced when a packet is read by the packet
processor.
 When the packet processor observes that
 The read index matches the write index, the queue can be considered empty;
 The write index is greater than or equal to the sum of the read index and the size of
the queue, then the queue is full.
 The doorbell_signal field of a queue contains a signal that is used by the agent
to inform the packet processor to process the packets it writes.
 The value signaled on the doorbell is equal to the ID of the packet that is ready to be
launched.
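The empty/full tests implied by these index rules, as small C helpers (index and size types assumed to be 64-bit unsigned, as described above):

#include <stdbool.h>
#include <stdint.h>

bool queue_empty(uint64_t read_index, uint64_t write_index) {
    return read_index == write_index;               // nothing left to consume
}

bool queue_full(uint64_t read_index, uint64_t write_index, uint64_t queue_size) {
    return write_index >= read_index + queue_size;  // no free packet slots
}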
© Copyright 2014 HSA Foundation. All Rights Reserved
INTRODUCTION (6)
 The new task might be consumed by the packet processor even before the
doorbell signal has been signaled by the agent.
 This is because the packet processor might be already processing some other
packets and observes that there is new work available, so it processes the new
packets.
 In any case, the agent must ring the doorbell for every batch of packets it writes.
© Copyright 2014 HSA Foundation. All Rights Reserved
QUEUE CREATE/DESTROY
 Create a user mode queue
 When a queue is created, the runtime also
allocates the packet buffer and the completion
signal.
 The application should only rely on the status
code returned to determine if the queue is valid
 Destroy a user mode queue
 A destroyed queue must not be accessed after being
destroyed.
 When a queue is destroyed, the state of the AQL packets
that have not been yet fully processed becomes undefined.
© Copyright 2014 HSA Foundation. All Rights Reserved
GET READ/WRITE INDEX
 Atomically retrieve read index of a queue with
acquire semantics
 Atomically retrieve write index of a queue with
acquire semantics
 Atomically retrieve read index of a queue with
relaxed semantics
 Atomically retrieve write index of a queue with
relaxed semantics
© Copyright 2014 HSA Foundation. All Rights Reserved
SET READ/WRITE INDEX
 Atomically set the read index of a queue with
release semantics
 Atomically set the read index of a queue with
relaxed semantics
 Atomically set the write index of a queue with
release semantics
 Atomically set the write index of a queue with
relaxed semantics
© Copyright 2014 HSA Foundation. All Rights Reserved
COMPARE AND SWAP WRITE INDEX
 Atomically compare and set the write index of a
queue with acquire/release/relaxed/acquire-
release semantics
 Parameters
 queue (input): A queue
 expected (input): The expected index value
 val (input): Value to copy to the write index if expected
matches the observed write index
 Return value
 Previous value of the write index
© Copyright 2014 HSA Foundation. All Rights Reserved
ADD WRITE INDEX
 Atomically increment the write index of a
queue by an offset with
release/acquire/relaxed/acquire-release
semantics
 Parameters
 queue (input): A queue
 val (input): The value to add to the write index
 Return value
 Previous value of the write index
© Copyright 2014 HSA Foundation. All Rights Reserved
ARCHITECTED QUEUING LANGUAGE (AQL)
 An HSA-compliant system provides a command interface for the dispatch of
HSA agent commands.
 This command interface is provided by the Architected Queuing Language (AQL).
 AQL allows HSA agents to build and enqueue their own command packets,
enabling fast and low-power dispatch.
 AQL also provides support for HSA component queue submissions
 The HSA component kernel can write commands in AQL format.
© Copyright 2014 HSA Foundation. All Rights Reserved
AQL PACKET (1)
 AQL packet format
 Values
 Always reserved packet (0): Packet format is set to always reserved when the queue is initialized.
 Invalid packet (1): Packet format is set to invalid when the readIndex is incremented, making the
packet slot available to the HSA agents.
 Dispatch packet (2): Dispatch packets contain jobs for the HSA component and are created by
HSA agents.
 Barrier packet (3): Barrier packets can be inserted by HSA agents to delay processing subsequent
packets. All queues support barrier packets.
 Agent dispatch packet (4): Agent dispatch packets contain jobs for the HSA agent and are created by
HSA agents.
© Copyright 2014 HSA Foundation. All Rights Reserved
AQL PACKET (2)
HSA signaling object handle used to indicate completion of the job
© Copyright 2014 HSA Foundation. All Rights Reserved
EXAMPLE - ENQUEUE AQL PACKET (1)
 An HSA agent submits a task to a queue by performing the following steps:
 Allocate a packet slot (by incrementing the writeIndex)
 Initialize the packet and copy packet to a queue associated with the Packet Processor
 Mark packet as valid
 Notify the Packet Processor of the packet (With doorbell signal)
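A C sketch of these four steps, see the copyright line below for the end of the slide. The write-index and doorbell function names, the queue fields beyond base_address and doorbell_signal, and the packet-format constant are assumptions; the provisional spec defines the real layouts:

#include <hsa.h>

void enqueue_dispatch(hsa_queue_t *q, const hsa_dispatch_packet_t *pkt) {
    // 1. Allocate a packet slot by bumping the write index.
    uint64_t id = hsa_queue_add_write_index_release(q, 1);

    // 2. Initialize and copy the packet into the ring-buffer slot.
    hsa_dispatch_packet_t *slot =
        (hsa_dispatch_packet_t *)q->base_address + (id % q->size);  // q->size assumed
    *slot = *pkt;

    // 3. Mark the packet valid last, so the packet processor never sees
    //    a half-written packet.
    slot->header.format = HSA_PACKET_TYPE_DISPATCH;   // assumed field/constant names

    // 4. Ring the doorbell with the ID of the packet that is ready.
    hsa_signal_send_release(q->doorbell_signal, id);  // assumed send-API name
}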
© Copyright 2014 HSA Foundation. All Rights Reserved
EXAMPLE - ENQUEUE AQL PACKET (2)
[Diagram: dispatch queue ring buffer with readIndex and writeIndex]
 Allocate an AQL packet slot (increment the writeIndex)
 Initialize the packet
 Copy the packet into the queue (a lock can be held here to prevent race conditions in a multithreaded environment)
 Send the doorbell signal
© Copyright 2014 HSA Foundation. All Rights Reserved
EXAMPLE - PACKET PROCESSOR
[Diagram: packet processor side of the dispatch queue]
 Receive the doorbell
 If there is any packet in the queue, process it: get the packet contents and check whether it is a barrier packet
 Update the readIndex, change the packet state to invalid, and send the completion signal
© Copyright 2014 HSA Foundation. All Rights Reserved
MEMORY MANAGEMENT
OUTLINE
 Memory registration and deregistration
 Memory region and memory segment
 APIs for memory region manipulation
 APIs for memory registration and deregistration
© Copyright 2014 HSA Foundation. All Rights Reserved
INTRODUCTION
 One of the key features of HSA is its ability to share global pointers between the
host application and code executing on the HSA component.
 This ability means that an application can directly pass a pointer to memory allocated on the host
to a kernel function dispatched to a component without an intermediate copy
 When a buffer created in the host is also accessed by a component,
programmers are encouraged to register the corresponding address range
beforehand.
 Registering memory expresses an intention to access (read or write) the passed buffer from a
component other than the host. This is a performance hint that allows the runtime implementation
to know which buffers will be accessed by some of the components ahead of time.
 When an HSA program no longer needs to access a registered buffer in a device,
the user should deregister that virtual address range.
© Copyright 2014 HSA Foundation. All Rights Reserved
MEMORY REGION/SEGMENT
 A memory region represents a virtual memory interval that is visible to a particular agent,
and contains properties about how memory is accessed or allocated from that agent.
 Memory segments
 Values
 HSA_SEGMENT_GLOBAL = 1
 HSA_SEGMENT_PRIVATE = 2
 HSA_SEGMENT_GROUP = 4
 HSA_SEGMENT_KERNARG = 8
 HSA_SEGMENT_READONLY = 16
 HSA_SEGMENT_IMAGE = 32
© Copyright 2014 HSA Foundation. All Rights Reserved
MEMORY REGION INFORMATION
 Attributes of a memory region
 Values
 HSA_REGION_INFO_BASE_ADDRESS
 HSA_REGION_INFO_SIZE
 HSA_REGION_INFO_NODE
 HSA_REGION_INFO_MAX_ALLOCATION_SIZE
 HSA_REGION_INFO_SEGMENT
 HSA_REGION_INFO_BANDWIDTH
 HSA_REGION_INFO_CACHED
© Copyright 2014 HSA Foundation. All Rights Reserved
MEMORY REGION MANIPULATION (1)
 Get the current value of an attribute of a region
 Iterate over the memory regions that are visible to an agent, and invoke an
application-defined callback on every iteration
 If callback returns a status other than HSA_STATUS_SUCCESS for a particular iteration, the
traversal stops and the function returns that status value.
© Copyright 2014 HSA Foundation. All Rights Reserved
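As a sketch of how these two calls combine (assuming the provisional names hsa_agent_iterate_regions and hsa_region_get_info and the hsa_region_t/hsa_segment_t types; exact signatures may differ in the provisional headers):

#include <hsa.h>
// Application-defined callback, invoked once per region visible to the agent.
static hsa_status_t inspect_region(hsa_region_t region, void* data) {
  hsa_segment_t segment;
  hsa_region_get_info(region, HSA_REGION_INFO_SEGMENT, &segment);
  size_t size = 0;
  hsa_region_get_info(region, HSA_REGION_INFO_SIZE, &size);
  // Returning anything other than HSA_STATUS_SUCCESS stops the traversal,
  // and the iterate call returns that status value.
  return HSA_STATUS_SUCCESS;
}
// Usage: hsa_agent_iterate_regions(agent, inspect_region, NULL);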
MEMORY REGION MANIPULATION (2)
 Allocate a block of memory
 Deallocate a block of memory previously allocated
using hsa_memory_allocate
 Copy block of memory
 Copying a number of bytes larger than the size of the
memory regions pointed to by dst or src results in
undefined behavior.
© Copyright 2014 HSA Foundation. All Rights Reserved
MEMORY REGISTRATION/DEREGISTRATION
 Register memory
 Parameters
 address (input): A pointer to the base of
the memory region to be registered. If a
NULL pointer is passed, no operation is
performed.
 size (input): Requested registration size
in bytes. A size of zero is only allowed if
address is NULL.
 Deregister memory previously registered
using hsa_memory_register
 Parameter
 address (input): A pointer to the base of the
memory region to be deregistered. If a NULL
pointer is passed, no operation is performed.
© Copyright 2014 HSA Foundation. All Rights Reserved
EXAMPLE
Allocate a memory space
Use hsa_region_get_info to get the
size in bytes of this memory space
Register this memory space as a
performance hint
When finished, deregister and
free this memory space
© Copyright 2014 HSA Foundation. All Rights Reserved
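Put together as code, the flow above looks roughly like this (a sketch; hsa_memory_free, the use of HSA_REGION_INFO_MAX_ALLOCATION_SIZE, and the exact argument orders are assumptions in the style of the provisional API):

#include <hsa.h>
void buffer_lifecycle(hsa_region_t region) {
  // Query the region before allocating from it.
  size_t max_alloc = 0;
  hsa_region_get_info(region, HSA_REGION_INFO_MAX_ALLOCATION_SIZE, &max_alloc);
  size_t bytes = 1 << 20; // illustrative 1 MiB request; must not exceed max_alloc
  // Allocate a memory space.
  void* buffer = NULL;
  hsa_memory_allocate(region, bytes, &buffer);
  // Register it as a performance hint: a component other than the
  // host intends to read or write this buffer.
  hsa_memory_register(buffer, bytes);
  // ... dispatch kernels that access the buffer ...
  // Finished: deregister the virtual address range and free the space.
  hsa_memory_deregister(buffer);
  hsa_memory_free(buffer);
}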
SUMMARY
SUMMARY
 Covered
 HSA Core Runtime API (Pre-release 1.0 provisional)
 Runtime Initialization and Shutdown (Open/Close)
 Notifications (Synchronous/Asynchronous)
 Agent Information
 Signals and Synchronization (Memory-Based)
 Queues and Architected Dispatch
 Memory Management
 Not covered
 Extension of Core Runtime
 HSAIL Finalization, Linking, and Debugging
 Images and Samplers
© Copyright 2014 HSA Foundation. All Rights Reserved
QUESTIONS?
© Copyright 2014 HSA Foundation. All Rights Reserved
HSA MEMORY MODEL
BEN GASTER, ENGINEER, QUALCOMM
OUTLINE
 HSA Memory Model
 OpenCL 2.0
 Has a memory model too
 Obstruction-free bounded deques
 An example using the HSA memory model
© Copyright 2014 HSA Foundation. All Rights Reserved
HSA MEMORY MODEL
© Copyright 2014 HSA Foundation. All Rights Reserved
TYPES OF MODELS
 Shared memory computers and programming languages divide complexity into
models:
1. Memory model specifies safety
 e.g. which values can a load return?
 This is what this section of the tutorial will focus on
2. Execution model specifies liveness
 Described in Ben Sander’s tutorial section on HSAIL
 e.g. can a work-item prevent others from progressing?
3. Performance model specifies the big picture
 e.g. caches or branch divergence
 Specific to particular implementations and outside the scope of today’s tutorial
© Copyright 2014 HSA Foundation. All Rights Reserved
THE PROBLEM
 Assume all locations (a, b, …) are initialized to 0
 What are the values of $s2 and $s4 after execution?
© Copyright 2014 HSA Foundation. All Rights Reserved
Work-item 0
mov_u32 $s1, 1 ;
st_global_u32 $s1, [&a] ;
ld_global_u32 $s2, [&b] ;
Work-item 1
mov_u32 $s3, 1 ;
st_global_u32 $s3, [&b] ;
ld_global_u32 $s4, [&a] ;
In C: Work-item 0 runs *a = 1; int x = *b;
Work-item 1 runs *b = 1; int y = *a;
initially *a == 0 && *b == 0
THE SOLUTION
 The memory model:
 Defines the visibility of writes to memory at any given point
 Provides us with the set of possible executions
© Copyright 2014 HSA Foundation. All Rights Reserved
WHAT MAKES A GOOD MEMORY MODEL*
 Programmability: A good model should make it (relatively) easy to write
multi-work-item programs. The model should be intuitive to most users, even to those
who have not read the details
 Performance: A good model should facilitate high-performance implementations
at reasonable power, cost, etc. It should give implementers broad latitude in
options
 Portability: A good model would be adopted widely, or at least provide backward
compatibility or the ability to translate among models
* S. V. Adve. Designing Memory Consistency Models for Shared-Memory Multiprocessors. PhD thesis, Computer Sciences Department,
University of Wisconsin–Madison, Nov. 1993.
© Copyright 2014 HSA Foundation. All Rights Reserved
SEQUENTIAL CONSISTENCY (SC)*
 Axiomatic Definition
 A single processor (core) is sequential if “the result of an execution is the same as if the
operations had been executed in the order specified by the program.”
 A multiprocessor is sequentially consistent if “the result of any execution is the same as if the
operations of all processors (cores) were executed in some sequential order, and the
operations of each individual processor (core) appear in this sequence in the order specified by
its program.”
© Copyright 2014 HSA Foundation. All Rights Reserved
 But HW/Compiler actually implements more relaxed models, e.g. ARMv7
* L. Lamport. How to Make a Multiprocessor Computer that Correctly
Executes Multiprocessor Programs. IEEE Transactions on Computers,
C-28(9):690–691, Sept. 1979.
SEQUENTIAL CONSISTENCY (SC)
© Copyright 2014 HSA Foundation. All Rights Reserved
Work-item 0
mov_u32 $s1, 1 ;
st_global_u32 $s1, [&a] ;
ld_global_u32 $s2, [&b] ;
Work-item 1
mov_u32 $s3, 1 ;
st_global_u32 $s3, [&b] ;
ld_global_u32 $s4, [&a] ;
mov_u32 $s1, 1 ;
mov_u32 $s3, 1;
st_global_u32 $s1, [&a] ;
ld_global_u32 $s2, [&b] ;
st_global_u32 $s3, [&b] ;
ld_global_u32 $s4, [&a] ;
$s2 == 0 && $s4 == 1
BUT WHAT ABOUT ACTUAL HARDWARE
 Sequential consistency is (reasonably) easy to understand, but limits
optimizations that the compiler and hardware can perform
 Many modern processors implement many reordering optimizations
 Store buffers (TSO*): work-items can see their own stores early
 Reorder buffers (XC*): work-items can see other work-items’ stores early
© Copyright 2014 HSA Foundation. All Rights Reserved
*TSO – Total Store Order as implemented by Sparc and x86
*XC – Relaxed Consistency model, e.g. ARMv7, Power7, and Adreno
RELAXED CONSISTENCY (XC)
© Copyright 2014 HSA Foundation. All Rights Reserved
Work-item 0
mov_u32 $s1, 1 ;
st_global_u32 $s1, [&a] ;
ld_global_u32 $s2, [&b] ;
Work-item 1
mov_u32 $s3, 1 ;
st_global_u32 $s3, [&b] ;
ld_global_u32 $s4, [&a] ;
mov_u32 $s1, 1 ;
mov_u32 $s3, 1;
ld_global_u32 $s2, [&b] ;
ld_global_u32 $s4, [&a] ;
st_global_u32 $s1, [&a] ;
st_global_u32 $s3, [&b] ;
$s2 == 0 && $s4 == 0
WHAT ARE OUR 3 Ps?
 Programmability: XC makes it really pretty hard for the programmer to reason about
what will be visible when
 many memory model experts have been known to get it wrong!
 Performance: XC is good for performance; the hardware (and compiler) is free to
reorder many loads and stores, opening the door for performance and power
enhancements
 Portability: XC is very portable as it places very few constraints
© Copyright 2014 HSA Foundation. All Rights Reserved
MY CHILDREN AND COMPUTER
ARCHITECTS ALL WANT
 To have their cake and eat it!
© Copyright 2014 HSA Foundation. All Rights Reserved
HSA provides the ability for programmers to
reason with the (relatively) intuitive model of SC,
while still achieving the benefits of XC!
SEQUENTIAL CONSISTENCY FOR DRF*
 HSA adopts the same approach as Java, C++11, and OpenCL 2.0: SC for Data-Race-Free (DRF)
 plus some new capabilities!
 (Informally) A data race occurs when two (or more) work-items access the same memory
location such that:
 At least one of the accesses is a WRITE
 There are no intervening synchronization operations
 SC for DRF asks:
 Programmers to ensure programs are DRF under SC
 Implementers to ensure that all executions of DRF programs on the relaxed model are also SC
executions
© Copyright 2014 HSA Foundation. All Rights Reserved
*S. V. Adve and M. D. Hill. Weak Ordering—A New Definition. In Proceedings of the
17th Annual International Symposium on Computer Architecture, pp. 2–14, May
1990
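To make the contract concrete, here is the classic DRF idiom in C++11, which adopts the same SC-for-DRF approach; the program is data-race-free, so the implementation must make it appear SC (a sketch, not taken from the deck):

#include <atomic>
#include <thread>
int payload = 0;                 // ordinary (non-atomic) data
std::atomic<bool> ready(false);  // synchronization variable
void producer() {
  payload = 42;                                   // ordinary write
  ready.store(true, std::memory_order_release);   // synchronizes-with the acquire below
}
void consumer() {
  while (!ready.load(std::memory_order_acquire)) {}
  // No data race: the release/acquire pair orders the accesses, so
  // under SC for DRF this read is guaranteed to observe 42.
  int v = payload;
  (void)v;
}
int main() {
  std::thread t1(producer), t2(consumer);
  t1.join(); t2.join();
}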
HSA SUPPORTS RELEASE CONSISTENCY
 HSA’s memory model is based on RCSC:
 All atomic_ld_scacq and atomic_st_screl operations are SC
 This means coherence on all atomic_ld_scacq and atomic_st_screl operations to a single
address
 All atomic_ld_scacq and atomic_st_screl operations are program ordered per work-item
(actually: sequenced in the order imposed by language constraints)
 Similar model adopted by ARMv8
 HSA extends RCSC to SC for HRF*, to access the full capabilities of
modern heterogeneous systems, containing CPUs, GPUs, and DSPs,
for example.
© Copyright 2014 HSA Foundation. All Rights Reserved
*Sequential Consistency for Heterogeneous-Race-Free: Programmer-centric
Memory Models for Heterogeneous Platforms. D. R. Hower, B. M. Beckmann,
B. R. Gaster, B. A. Hechtman, M. D. Hill, S. K. Reinhardt, and D. A. Wood. MSPC’13.
MAKING RELAXED CONSISTENCY WORK
© Copyright 2014 HSA Foundation. All Rights Reserved
Work-item 0
mov_u32 $s1, 1 ;
atomic_st_global_u32_screl $s1, [&a] ;
atomic_ld_global_u32_scacq $s2, [&b] ;
Work-item 1
mov_u32 $s3, 1 ;
atomic_st_global_u32_screl $s3, [&b] ;
atomic_ld_global_u32_scacq $s4, [&a] ;
mov_u32 $s1, 1 ;
mov_u32 $s3, 1;
atomic_st_global_u32_screl $s1, [&a] ;
atomic_ld_global_u32_scacq $s2, [&b] ;
atomic_st_global_u32_screl $s3, [&b] ;
atomic_ld_global_u32_scacq $s4, [&a] ;
$s2 == 0 && $s4 == 1
SEQUENTIAL CONSISTENCY FOR DRF
 Two memory accesses participate in a data race if they
 access the same location
 at least one access is a store
 can occur simultaneously
 i.e. appear as adjacent operations in interleaving.
 A program is data-race-free if no possible execution results in a data race.
 Sequential consistency for data-race-free programs
 Avoid everything else
HSA: Not good enough!
© Copyright 2014 HSA Foundation. All Rights Reserved
ALL ARE NOT EQUAL – OR SOME CAN SEE
BETTER THAN OTHERS
 Remember the HSAIL Execution Model
© Copyright 2014 HSA Foundation. All Rights Reserved
[Figure: nested synchronization scopes - wave scope inside group scope inside device scope, all inside platform scope.]
DATA-RACE-FREE IS NOT ENOUGH
t1 (group #1-2): st_global 1, [&X] ; atomic_st_global_screl 0, [&flag]
t2 (group #1-2): atomic_cas_global_scar 1, 0, [&flag] ; ... ; atomic_st_global_screl 0, [&flag]
t3, t4 (group #3-4): atomic_cas_global_scar 1, 0, [&flag] ; ld_global (??), [&x]
 Two ordinary memory accesses participate in a data race if they
 Access same location
 At least one is a store
 Can occur simultaneously
Not a data race… Is it SC? Well, that depends: is visibility implied by causality?
[Figure: synchronization chain t1 -> t2 -> t3 -> t4, crossing from scope S12 to scope S34 within SGlobal.]
© Copyright 2014 HSA Foundation. All Rights Reserved
SEQUENTIAL CONSISTENCY FOR
HETEROGENEOUS-RACE-FREE
 Two memory accesses participate in a heterogeneous race if they
 access the same location
 at least one access is a store
 can occur simultaneously
 i.e. appear as adjacent operations in interleaving.
 Are not synchronized with “enough” scope
 A program is heterogeneous-race-free if no possible execution results in a
heterogeneous race.
 Sequential consistency for heterogeneous-race-free programs
 Avoid everything else
© Copyright 2014 HSA Foundation. All Rights Reserved
HSA HETEROGENEOUS RACE FREE
 HRF0: Basic Scope Synchronization
 “enough” = both threads synchronize using identical scope
 Recall the example:
 It contains a heterogeneous race in HSA
t1 (workgroup #1-2): st_global 1, [&X] ; atomic_st_global_screl_wg 0, [&flag]
t3, t4 (workgroup #3-4): atomic_cas_global_scar_wg 1, 0, [&flag] ; ld_global (??), [&x]
HSA Conclusion:
This is bad. Don’t do it.
© Copyright 2014 HSA Foundation. All Rights Reserved
HOW TO USE HSA WITH SCOPES
HSA Scope Selection Guideline:
use the smallest scope that includes all
producers/consumers of the shared data
 Want: for performance, use the smallest scope possible
 Implication: producers/consumers must be known at synchronization time
 Is this a valid assumption? What is safe in HSA?
© Copyright 2014 HSA Foundation. All Rights Reserved
REGULAR GPGPU WORKLOADS
 Regular workloads follow a well-defined pattern:
 Define the problem space
 Partition hierarchically
 Communicate locally, N times
 Communicate globally, M times
 Well defined (regular) data partitioning +
well defined (regular) synchronization pattern =
producers/consumers are always known
Generally: HSA works well with
regular data-parallel workloads
© Copyright 2014 HSA Foundation. All Rights Reserved
IRREGULAR WORKLOADS
t1 (workgroup #1-2): st_global 1, [&X] ; atomic_st_global_screl_plat 0, [&flag]
t2 (workgroup #1-2): atomic_cas_global_scar_plat 1, 0, [&flag] ; ... ; atomic_st_global_screl_plat 0, [&flag]
t3, t4 (workgroup #3-4): atomic_cas_global_scar_plat 1, 0, [&flag] ; ld $s1, [&x]
 HSA: the earlier example is a race unless the scopes are upgraded
 Must upgrade wg (workgroup) -> plat (platform)
 The HSA memory model then says:
 ld $s1, [&x] will see the value (1)!
© Copyright 2014 HSA Foundation. All Rights Reserved
OPENCL
HAS MEMORY MODELS TOO
MAPPING ONTO HSA’S MEMORY MODEL
 It is straightforward to provide a mapping from OpenCL 1.x to the
proposed model
 OpenCL 1.x atomics are unordered and so map to atomic_op_X
 Mapping for fences not shown but straightforward
OPENCL 1.X MEMORY MODEL MAPPING
OpenCL Operation -> HSA Memory Model Operation
Atomic load -> ld_global_wg / ld_group_wg
Atomic store -> atomic_st_global_wg / atomic_st_group_wg
atomic_op -> atomic_op_global_comp / atomic_op_group_wg
barrier(…) -> fence ; barrier_wg
© Copyright 2014 HSA Foundation. All Rights Reserved
OPENCL 2.0 BACKGROUND
 Provisional specification released at SIGGRAPH’13, July 2013.
 Huge update to OpenCL to account for the evolving hardware landscape and
emerging use cases (e.g. irregular workloads)
 Key features:
 Shared virtual memory, including platform atomics
 Formally defined memory model based on C11 plus support for scopes
 Includes an extended set of C1X atomic operations
 Generic address space, that subsumes global, local, and private
 Device to device enqueue
 Out-of-order device side queuing model
 Backwards compatible with OpenCL 1.x
© Copyright 2014 HSA Foundation. All Rights Reserved
OPENCL 2.0 MEMORY MODEL MAPPING
OpenCL Operation -> HSA Memory Model Operation
Load, memory_order_relaxed -> atomic_ld_[global | group]_relaxed_scope
Store, memory_order_relaxed -> atomic_st_[global | group]_relaxed_scope
Load, memory_order_acquire -> atomic_ld_[global | group]_scacq_scope
Load, memory_order_seq_cst -> atomic_ld_[global | group]_scacq_scope
Store, memory_order_release -> atomic_st_[global | group]_screl_scope
Store, memory_order_seq_cst -> atomic_st_[global | group]_screl_scope
atomic_op, memory_order_acq_rel -> atomic_op_[global | group]_scar_scope
atomic_op, memory_order_seq_cst -> atomic_op_[global | group]_scar_scope
© Copyright 2014 HSA Foundation. All Rights Reserved
OPENCL 2.0 MEMORY SCOPE MAPPING
OpenCL Scope HSA Scope
memory_scope_sub_group _wave
memory_scope_work_group _wg
memory_scope_device _component
memory_scope_all_svm_devices _platform
© Copyright 2014 HSA Foundation. All Rights Reserved
OBSTRUCTION-FREE
BOUNDED DEQUES
AN EXAMPLE USING THE HSA MEMORY MODEL
CONCURRENT DATA-STRUCTURES
 Why do we need such a memory model in practice?
 One important application of memory consistency is in the development and use
of concurrent data-structures
 In particular, there is a class of data-structure implementations that provide non-
blocking guarantees:
 wait-free: an algorithm is wait-free if every operation has a bound on the number of
steps the algorithm will take before the operation completes
 In practice it is very hard to build efficient data-structures that meet this requirement
 lock-free: an algorithm is lock-free if, given enough time, at least one of the
work-items (or threads) makes progress
 In practice lock-free algorithms are implemented by work-items cooperating with one
another enough to allow progress
 obstruction-free: an algorithm is obstruction-free if a work-item, running in isolation, can
make progress
© Copyright 2014 HSA Foundation. All Rights Reserved
BUT WHY NOT JUST USE MUTUAL
EXCLUSION?
© Copyright 2014 HSA Foundation. All Rights Reserved
[Figure: an emerging compute cluster - four Krait CPUs sharing a 2MB L2 behind MMUs, an Adreno GPU, and a Hexagon DSP with its own MMU, connected by a fabric and memory controller.]
Diversity in a heterogeneous system, such as
different clock speeds and different scheduling
policies, can mean traditional
mutual exclusion is not the right choice
CONCURRENT DATA-STRUCTURES
 Emerging heterogeneous compute clusters mean we need:
 To adapt existing concurrent data-structures
 To develop new concurrent data-structures
 Lock-based programming may still be useful, but often these algorithms will need
to be lock-free
 Of course, this is a key application of the HSA memory model
 To showcase this we highlight the development of a well known (HLM)
obstruction-free deque*
© Copyright 2014 HSA Foundation. All Rights Reserved
*Herlihy, M., Luchangco, V., and Moir, M. 2003. Obstruction-free
synchronization: double-ended queues as an example.
Proc. 23rd ICDCS (2003), 522–529.
HLM - OBSTRUCTION-FREE DEQUE
 Uses a fixed length circular queue
 At any given time, reading from left to right, the array will contain:
 Zero or more left-null (LN) values
 Zero or more dummy-null (DN) values
 Zero or more right-null (RN) values
 At all times there must be:
 At least two different null values
 At least one LN or DN, and at least one DN or RN
 Memory consistency is required to allow multiple producers and multiple
consumers, potentially happening in parallel from the left and right ends, to see
changes from other work-items (HSA Components) and threads (HSA Agents)
© Copyright 2014 HSA Foundation. All Rights Reserved
HLM - OBSTRUCTION-FREE DEQUE
© Copyright 2014 HSA Foundation. All Rights Reserved
[Figure: circular array containing LN LN LN v v v RN RN, with left and right hint indices.]
Key:
LN – left null value
RN – right null value
v – value
left – left hint index
right – right hint index
C REPRESENTATION OF DEQUE
struct node {
  uint64_t type : 2;     // null type (LN, RN, DN)
  uint64_t counter : 8;  // version counter to avoid ABA
  uint64_t value : 54;   // index value stored in queue
};
// Note: the HSAIL below treats type as the top 2 bits of the packed 64-bit word,
// counter as bits 54..61, and value as bits 0..53.
struct queue {
  unsigned int size;     // size of bounded buffer
  node * array;          // backing store for deque itself
};
© Copyright 2014 HSA Foundation. All Rights Reserved
HSAIL REPRESENTATION
 Allocate a deque in global memory using HSAIL
@deque_instance:
align 64 global_u32 &size;
align 8 global_u64 &array;
© Copyright 2014 HSA Foundation. All Rights Reserved
ORACLE
 Assume a function:
function &rcheck_oracle (arg_u32 %k, arg_u64 %left, arg_u64 %right) (arg_u64 %queue);
 Which given a deque
 returns (%k) the position of the left-most RN
 atomic_ld_global_scacq used to read node from array
 Makes one if necessary (i.e. if there are only LN or DN)
 atomic_cas_global_scar, required to make new RN
 returns (%left) the left node (i.e. the value to the left of the left most RN position)
 returns (%right) the right node (i.e. the value at position (%k))
© Copyright 2014 HSA Foundation. All Rights Reserved
RIGHT POP
function &right_pop(arg_u32 %err, arg_u64 %value) (arg_u64 %deque) {
// load queue address
ld_arg_u64 $d0, [%deque];
@loop_forever:
// setup and call right oracle to get next RN
arg_u32 %k; arg_u64 %current; arg_u64 %next;
call &rcheck_oracle (%k, %current, %next) (%deque);
ld_arg_u32 $s0, [%k]; ld_arg_u64 $d1, [%current]; ld_arg_u64 $d2, [%next];
// current.type($d5)
shr_u64 $d5, $d1, 62;
// current.counter($d6)
and_u64 $d6, $d1, 0x3FC0000000000000;
shr_u64 $d6, $d6, 54;
// current.value($d7)
and_u64 $d7, $d1, 0x3FFFFFFFFFFFFF;
// next.counter($d8)
and_u64 $d8, $d2, 0x3FC0000000000000; shr_u64 $d8, $d8, 54;
brn @loop_forever ;
}
© Copyright 2014 HSA Foundation. All Rights Reserved
RIGHT POP – TEST FOR EMPTY
// not empty when current.type($d5) != LN && current.type($d5) != DN
cmp_neq_b1_u64 $c0, $d5, LN; cmp_neq_b1_u64 $c1, $d5, DN;
and_b1 $c0, $c0, $c1;
cbr $c0, @not_empty ;
// current node address (%deque($d0) + (%k($s0) - 1) * 16)
add_u32 $s1, $s0, -1; mul_u32 $s1, $s1, 16;
cvt_u64_u32 $d3, $s1; add_u64 $d3, $d0, $d3;
atomic_ld_global_scacq_u64 $d4, [$d3];
cmp_neq_b1_u64 $c0, $d4, $d1;
cbr $c0, @not_empty;
st_arg_u32 EMPTY, [%err]; // deque empty so return EMPTY
ret;
@not_empty:
© Copyright 2014 HSA Foundation. All Rights Reserved
RIGHT POP – TRY READ/REMOVE NODE
// $d9 = node(RN, next.cnt+1, 0)
add_u64 $d8, $d8, 1;
shl_u64 $d8, $d8, 54;  // counter into bits 54..61
shl_u64 $d9, RN, 62;   // type into bits 62..63
or_u64 $d9, $d8, $d9;
// node k address: $d10 = %deque($d0) + %k($s0) * 16
mul_u32 $s2, $s0, 16; cvt_u64_u32 $d10, $s2; add_u64 $d10, $d0, $d10;
// cas(deq+k, next, node(RN, next.cnt+1, 0))
atomic_cas_global_scar_u64 $d9, [$d10], $d2, $d9;
cmp_neq_u64 $c0, $d9, $d2;
cbr $c0, @cas_failed;
// $d9 = node(RN, current.cnt+1, 0)
add_u64 $d6, $d6, 1;
shl_u64 $d6, $d6, 54;
shl_u64 $d9, RN, 62;
or_u64 $d9, $d6, $d9;
// cas(deq+(k-1), curr, node(RN, curr.cnt+1, 0)); $d3 computed earlier
atomic_cas_global_scar_u64 $d9, [$d3], $d1, $d9;
cmp_neq_u64 $c0, $d9, $d1;
cbr $c0, @cas_failed;
st_arg_u32 SUCCESS, [%err];
st_arg_u64 $d7, [%value];
ret;
@cas_failed:
// loop back around and try again
© Copyright 2014 HSA Foundation. All Rights Reserved
TAKE AWAYS
 HSA provides a powerful and modern memory model
 Based on the well-known SC for DRF
 Defined as Release Consistency
 Extended with scopes as defined by HRF
 OpenCL 2.0 introduces a new memory model
 Also based on SC for DRF
 Also defined in terms of Release Consistency
 Also extended with scopes as defined in HRF
 Has a well defined mapping to HSA
 Concurrent algorithm development for emerging heterogeneous compute
clusters can benefit from the HSA and OpenCL 2.0 memory models
© Copyright 2014 HSA Foundation. All Rights Reserved
HSA QUEUING MODEL
HAKAN PERSSON, SENIOR PRINCIPAL ENGINEER,
ARM
HSA QUEUEING, MOTIVATION
MOTIVATION (TODAY’S PICTURE)
© Copyright 2014 HSA Foundation. All Rights Reserved
[Figure: today's dispatch flow. Application: transfer buffer to GPU -> OS: copy/map memory -> Application: queue job -> OS: schedule job -> GPU: start job, finish job -> OS: schedule application -> Application: get buffer -> OS: copy/map memory.]
HSA QUEUEING: REQUIREMENTS
REQUIREMENTS
 Three key technologies are used to build the user mode queueing
mechanism
 Shared Virtual Memory
 System Coherency
 Signaling
 AQL (Architected Queueing Language) enables any agent
to enqueue tasks
© Copyright 2014 HSA Foundation. All Rights Reserved
SHARED VIRTUAL MEMORY
SHARED VIRTUAL MEMORY (TODAY)
 Multiple virtual memory address spaces
© Copyright 2014 HSA Foundation. All Rights Reserved
[Figure: CPU0 maps VA1->PA1 through VIRTUAL MEMORY1 and the GPU maps VA2->PA1 through VIRTUAL MEMORY2; both address the same PHYSICAL MEMORY.]
SHARED VIRTUAL MEMORY (HSA)
 Common Virtual Memory for all HSA agents
© Copyright 2014 HSA Foundation. All Rights Reserved
[Figure: CPU0 and the GPU share one VIRTUAL MEMORY; both use the same VA->PA mapping into PHYSICAL MEMORY.]
SHARED VIRTUAL MEMORY
 Advantages
 No mapping tricks, no copying back-and-forth between different PA
addresses
 Send pointers (not data) back and forth between HSA agents.
 Implications
 Common Page Tables (and common interpretation of architectural
semantics such as shareability, protection, etc).
 Common mechanisms for address translation (and servicing address
translation faults)
 Concept of a process address space (PASID) to allow multiple, per
process virtual address spaces within the system.
© Copyright 2014 HSA Foundation. All Rights Reserved
SHARED VIRTUAL MEMORY
 Specifics
 Minimum supported VA width is 48b for 64b systems, and 32b for
32b systems.
 HSA agents may reserve VA ranges for internal use via system
software.
 All HSA agents other than the host unit must use the lowest privilege
level
 If present, read/write access flags for page tables must be
maintained by all agents.
 Read/write permissions apply to all HSA agents, equally.
© Copyright 2014 HSA Foundation. All Rights Reserved
GETTING THERE …
© Copyright 2014 HSA Foundation. All Rights Reserved
[The dispatch flow figure repeats; with shared virtual memory the buffer copy/map steps are no longer needed.]
CACHE COHERENCY
CACHE COHERENCY DOMAINS (1/3)
 Data accesses to global memory segment from all HSA Agents shall be
coherent without the need for explicit cache maintenance.
© Copyright 2014 HSA Foundation. All Rights Reserved
CACHE COHERENCY DOMAINS (2/3)
 Advantages
 Composability
 Reduced SW complexity when communicating between agents
 Lower barrier to entry when porting software
 Implications
 Hardware coherency support between all HSA agents
 Can take many forms
 Stand alone Snoop Filters / Directories
 Combined L3/Filters
 Snoop-based systems (no filter)
 Etc …
© Copyright 2014 HSA Foundation. All Rights Reserved
CACHE COHERENCY DOMAINS (3/3)
 Specifics
 No requirement for instruction memory accesses to be
coherent
 Only applies to the Primary memory type.
 No requirement for HSA agents to maintain coherency to any
memory location where the HSA agents do not specify the
same memory attributes
 Read-only image data is required to remain static during the
execution of an HSA kernel.
 No double mapping (via different attributes) in order to
modify it; the data must remain static
© Copyright 2014 HSA Foundation. All Rights Reserved
GETTING CLOSER …
© Copyright 2014 HSA Foundation. All Rights Reserved
[The dispatch flow figure repeats, with fewer steps remaining.]
SIGNALING
SIGNALING (1/3)
 HSA agents support the ability to use signaling objects
 All creation/destruction of signaling objects occurs via HSA
runtime APIs
 From an HSA agent you can directly access signaling objects:
 Signal a signal object (this will wake up HSA agents
waiting upon the object)
 Query the current object value
 Wait on the current object value (various conditions supported)
© Copyright 2014 HSA Foundation. All Rights Reserved
SIGNALING (2/3)
 Advantages
 Enables asynchronous events between HSA agents,
without involving the kernel
 Common idiom for work offload
 Low power waiting
 Implications
 Runtime support required
 Commonly implemented on top of cache coherency flows
© Copyright 2014 HSA Foundation. All Rights Reserved
SIGNALING (3/3)
 Specifics
 Only supported within a PASID
 Supported wait conditions are =, !=, < and >=
 Wait operations may return sporadically (no guarantee against
false positives)
 Programmer must test.
 Wait operations have a maximum duration before returning.
 The HSAIL atomic operations are supported on signal objects.
 Signal objects are opaque
 Must use dedicated HSAIL/HSA runtime operations
© Copyright 2014 HSA Foundation. All Rights Reserved
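A host-side sketch of the signaling flow. Only hsa_signal_send_relaxed is named later in this deck; the create/wait/destroy names and enum values below are assumptions in the same naming style, not confirmed provisional API:

#include <hsa.h>
void signal_example() {
  hsa_signal_t done;
  // Assumed creation call: initial value 1, no restricted consumer list.
  hsa_signal_create(1, 0, NULL, &done);
  // ... hand the signal to another agent, e.g. as a packet's completionSignal ...
  // Low-power wait until the value satisfies the condition (== 0).
  // Waits may return sporadically, so the condition is re-tested in a loop.
  while (hsa_signal_wait_acquire(done, HSA_SIGNAL_CONDITION_EQ, 0,
                                 UINT64_MAX, HSA_WAIT_STATE_BLOCKED) != 0) {}
  hsa_signal_destroy(done);
}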
ALMOST THERE…
© Copyright 2014 HSA Foundation. All Rights Reserved
[The dispatch flow figure repeats, with fewer steps remaining.]
USER MODE QUEUING
ONE BLOCK LEFT
© Copyright 2014 HSA Foundation. All Rights Reserved
[The dispatch flow figure repeats; only job queuing still involves the OS.]
USER MODE QUEUEING (1/3)
 User mode Queueing
 Enables user space applications to directly, without OS
intervention, enqueue jobs (“Dispatch Packets”) for HSA
agents.
 Queues are created/destroyed via calls to the HSA
runtime.
 One (or many) agents enqueue packets, a single agent
dequeues packets.
 Requires coherency and shared virtual memory.
© Copyright 2014 HSA Foundation. All Rights Reserved
USER MODE QUEUEING (2/3)
 Advantages
 Avoid involving the kernel/driver when dispatching work for an Agent.
 Lower latency job dispatch enables finer granularity of offload
 Standard memory protection mechanisms may be used to protect communication with
the consuming agent.
 Implications
 Packet formats/fields are Architected – standard across vendors!
 Guaranteed backward compatibility
 Packets are enqueued/dequeued via an Architected protocol (all via memory
accesses and signaling)
 More on this later……
© Copyright 2014 HSA Foundation. All Rights Reserved
SUCCESS!
© Copyright 2014 HSA Foundation. All Rights Reserved
[The dispatch flow figure repeats one final time.]
SUCCESS!
© Copyright 2014 HSA Foundation. All Rights Reserved
[Figure: the final flow - the Application queues the job and the GPU starts and finishes it; the OS is out of the dispatch path.]
ARCHITECTED QUEUEING
LANGUAGE, QUEUES
ARCHITECTED QUEUEING LANGUAGE
 HSA Queues look just like standard shared
memory queues, supporting multi-producer,
single-consumer
 Single producer variant defined with some
optimizations possible.
 Queues consist of storage, read/write indices, ID,
etc.
 Queues are created/destroyed via calls to the
HSA runtime
 “Packets” are placed in queues directly from user
mode, via an architected protocol
 Packet format is architected
© Copyright 2014 HSA Foundation. All Rights Reserved
[Figure: two producers and one consumer share a queue; packets live in storage in coherent, shared memory, tracked by a read index and a write index.]
ARCHITECTED QUEUING LANGUAGE
 Packets are read and dispatched for execution from the queue in order, but
may complete in any order.
 There is no guarantee that more than one packet will be processed in parallel at a
time
 There may be many queues. A single agent may also consume from several
queues.
 Any HSA agent may enqueue packets
 CPUs
 GPUs
 Other accelerators
© Copyright 2014 HSA Foundation. All Rights Reserved
QUEUE STRUCTURE
© Copyright 2014 HSA Foundation. All Rights Reserved
Offset (bytes) Size (bytes) Field Notes
0 4 queueType Differentiate different queues
4 4 queueFeatures Indicate supported features
8 8 baseAddress Pointer to packet array
16 8 doorbellSignal HSA signaling object handle
24 4 size Packet array cardinality
28 4 queueId Unique per process
32 8 serviceQueue Queue for callback services
intrinsic 8 writeIndex Packet array write index
intrinsic 8 readIndex Packet array read index
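Written out as a struct, the fields and offsets above line up as follows (a sketch; readIndex and writeIndex are deliberately absent because they are intrinsic, reachable only through runtime/HSAIL operations):

#include <stdint.h>
// Field names follow the table; natural alignment reproduces the offsets
// 0, 4, 8, 16, 24, 28, 32.
struct hsa_queue {
  uint32_t queueType;      // differentiates queue variants (MULTI/SINGLE)
  uint32_t queueFeatures;  // bitfield of supported packet types
  uint64_t baseAddress;    // pointer to the packet array
  uint64_t doorbellSignal; // HSA signaling object handle
  uint32_t size;           // packet array cardinality (a power of 2)
  uint32_t queueId;        // unique per process
  uint64_t serviceQueue;   // queue for callback services (may be NULL)
};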
QUEUE VARIANTS
 queueType and queueFeatures together define queue semantics and
capabilities
 Two queueType values defined, other values reserved:
 MULTI – queue supports multiple producers
 SINGLE – queue supports single producer
 queueFeatures is a bitfield indicating capabilities
 DISPATCH (bit 0) if set then queue supports DISPATCH packets
 AGENT_DISPATCH (bit 1) if set then queue supports AGENT_DISPATCH packets
 All other bits are reserved and must be 0
© Copyright 2014 HSA Foundation. All Rights Reserved
QUEUE STRUCTURE DETAILS
 Queue doorbells are HSA signaling objects with restrictions
 Created as part of the queue – lifetime tied to queue object
 Atomic read-modify-write not allowed
 size field value must be a power of 2
 serviceQueue can be used by HSA kernel for callback services
 Provided by application when queue is created
 Can be mapped to HSA runtime provided serviceQueue, an application serviced
queue, or NULL if no serviceQueue required
© Copyright 2014 HSA Foundation. All Rights Reserved
READ/WRITE INDICES
 readIndex and writeIndex properties are part of the queue, but not visible in the queue structure
 Accessed through HSA runtime API and HSAIL operations
 HSA runtime/HSAIL operations defined to
 Read readIndex or writeIndex property
 Write readIndex or writeIndex property
 Add constant to writeIndex property (returns previous writeIndex value)
 CAS on writeIndex property
 readIndex & writeIndex operations treated as atomic in memory model
 relaxed, acquire, release and acquire-release variants defined as applicable
 readIndex and writeIndex never wrap
 PacketID – the index of a particular packet
 Uniquely identifies each packet of a queue
© Copyright 2014 HSA Foundation. All Rights Reserved
PACKET ENQUEUE
 Packet enqueue follows a few simple steps:
 Reserve space
 Multiple packets can be reserved at a time
 Write packet to queue
 Mark packet as valid
 Producer no longer allowed to modify packet
 Consumer is allowed to start processing packet
 Notify consumer of packet through the queue doorbell
 Multiple packets can be notified at a time
 Doorbell signal should be signaled with last packetID notified
 On small machine model the lower 32 bits of the packetID are used
© Copyright 2014 HSA Foundation. All Rights Reserved
PACKET RESERVATION
 Two flows envisaged
 Atomic add writeIndex with number of packets to reserve
 Producer must wait until packetID < readIndex + size before writing to packet
 Queue can be sized so that wait is unlikely (or impossible)
 Suitable when many threads use one queue
 Check queue not full first, then use atomic CAS to update writeIndex
 Can be inefficient if many threads use the same queue
 Allows different failure model if queue is congested
© Copyright 2014 HSA Foundation. All Rights Reserved
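A sketch of the second flow. hsa_queue_load_write_index_relaxed and hsa_queue_cas_write_index_relaxed are assumed names in the style of the index operations used on the next slides, with the CAS returning the previous writeIndex value, and hsa_queue is the struct sketched earlier:

// Returns true if a packet slot was reserved, false if the queue is full
// or another producer won the race (caller may retry or report congestion).
bool try_reserve_packet(hsa_queue* q, uint64_t* packetID) {
  uint64_t wrIdx = hsa_queue_load_write_index_relaxed(q);
  uint64_t rdIdx = hsa_queue_load_read_index_relaxed(q);
  if (wrIdx >= rdIdx + q->size) {
    return false;  // queue full: fail instead of waiting
  }
  // Claim the slot only if no other producer advanced writeIndex first.
  if (hsa_queue_cas_write_index_relaxed(q, wrIdx, wrIdx + 1) != wrIdx) {
    return false;  // lost the race
  }
  *packetID = wrIdx;
  return true;
}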
QUEUE OPTIMIZATIONS
 Queue behavior is loosely defined to allow optimizations
 Some potential producer behavior optimizations:
 Keep local copy of readIndex, update when required
 For single producer queues:
 Keep local copy of writeIndex
 Use store operation rather than add/cas atomic to update writeIndex
 Some potential consumer behavior optimizations:
 Use packet format field to determine whether a packet has been submitted rather than writeIndex
property
 Speculatively read multiple packets from the queue
 Don’t update readIndex for each packet processed
 Rely on value used for doorbellSignal to notify new packets
 Especially useful for single producer queues
© Copyright 2014 HSA Foundation. All Rights Reserved
POTENTIAL MULTI-PRODUCER ALGORITHM
// Allocate packet
uint64_t packetID = hsa_queue_add_write_index_relaxed(q, 1);
// Wait until the queue is no longer full.
uint64_t rdIdx;
do {
rdIdx = hsa_queue_load_read_index_relaxed(q);
} while (packetID >= (rdIdx + q->size));
// calculate index
uint32_t arrayIdx = packetID & (q->size-1);
// copy over the packet; its format field is still INVALID
q->baseAddress[arrayIdx] = pkt;
// Update format field with release semantics
q->baseAddress[arrayIdx].hdr.format.store(DISPATCH, std::memory_order_release);
// ring doorbell, with release semantics (could also amortize over multiple packets)
hsa_signal_send_relaxed(q->doorbellSignal, packetID);
© Copyright 2014 HSA Foundation. All Rights Reserved
POTENTIAL CONSUMER ALGORITHM
// Get location of next packet
uint64_t readIndex = hsa_queue_load_read_index_relaxed(q);
// calculate the index
uint32_t arrayIdx = readIndex & (q->size-1);
// spin while empty (could also perform low-power wait on doorbell)
while (INVALID == q->baseAddress[arrayIdx].hdr.format) { }
// copy over the packet
pkt = q->baseAddress[arrayIdx];
// set the format field to invalid
q->baseAddress[arrayIdx].hdr.format.store(INVALID, std::memory_order_relaxed);
// Update the readIndex using HSA intrinsic
hsa_queue_store_read_index_relaxed(q, readIndex+1);
// Now process <pkt>!
© Copyright 2014 HSA Foundation. All Rights Reserved
ARCHITECTED QUEUEING
LANGUAGE, PACKETS
PACKETS
© Copyright 2014 HSA Foundation. All Rights Reserved
 Packets come in three main types with architected layouts
 Always reserved & Invalid
 Do not contain any valid tasks and are not processed (queue will not progress)
 Dispatch
 Specifies kernel execution over a grid
 Agent Dispatch
 Specifies a single function to perform with a set of parameters
 Barrier
 Used for task dependencies
COMMON PACKET HEADER
Start offset 0, format uint16_t, with the following bitfields:
format:8 - Contains the packet type (Always Reserved, Invalid,
Dispatch, Agent Dispatch, or Barrier). Other values are
reserved and should not be used.
barrier:1 - If set then processing of the packet will only begin when all
preceding packets are complete.
acquireFenceScope:2 - Determines the scope and type of the memory fence
operation applied before the packet enters the active phase.
Must be 0 for Barrier packets.
releaseFenceScope:2 - Determines the scope and type of the memory fence
operation applied after kernel completion but before the
packet is completed.
reserved:3 - Must be 0
© Copyright 2014 HSA Foundation. All Rights Reserved
DISPATCH PACKET
© Copyright 2014 HSA Foundation. All Rights Reserved
Start
Offset
(Bytes)
Format Field Name Description
0 uint16_t header Packet header
2 uint16_t
dimensions:2 Number of dimensions specified in gridSize. Valid values are 1, 2, or 3.
reserved:14 Must be 0.
4 uint16_t workgroupSize.x x dimension of work-group (measured in work-items).
6 uint16_t workgroupSize.y y dimension of work-group (measured in work-items).
8 uint16_t workgroupSize.z z dimension of work-group (measured in work-items).
10 uint16_t reserved2 Must be 0.
12 uint32_t gridSize.x x dimension of grid (measured in work-items).
16 uint32_t gridSize.y y dimension of grid (measured in work-items).
20 uint32_t gridSize.z z dimension of grid (measured in work-items).
24 uint32_t privateSegmentSizeBytes Total size in bytes of private memory allocation request (per work-item).
28 uint32_t groupSegmentSizeBytes Total size in bytes of group memory allocation request (per work-group).
32 uint64_t kernelObjectAddress
Address of an object in memory that includes an implementation-defined
executable ISA image for the kernel.
40 uint64_t kernargAddress Address of memory containing kernel arguments.
48 uint64_t reserved3 Must be 0.
56 uint64_t completionSignal Address of HSA signaling object used to indicate completion of the job.
AGENT DISPATCH PACKET
© Copyright 2014 HSA Foundation. All Rights Reserved
Start Offset
(Bytes)
Format Field Name Description
0 uint16_t header Packet header
2 uint16_t type
The function to be performed by the destination Agent. The type value is
split into the following ranges:
 0x0000:0x3FFF – Vendor specific
 0x4000:0x7FFF – HSA runtime
 0x8000:0xFFFF – User registered function
4 uint32_t reserved2 Must be 0.
8 uint64_t returnLocation Pointer to location to store the function return value in.
16 uint64_t arg[0]
64-bit direct or indirect arguments.
24 uint64_t arg[1]
32 uint64_t arg[2]
40 uint64_t arg[3]
48 uint64_t reserved3 Must be 0.
56 uint64_t completionSignal Address of HSA signaling object used to indicate completion of the job.
BARRIER PACKET
 Used for specifying dependences between packets
 HSA agent will not launch any further packets from this queue until the barrier
packet signal conditions are met
 Used for specifying dependences on packets dispatched from any queue.
 Execution phase completes only when all of the dependent signals (up to five) have
been signaled (with the value of 0).
 Or if an error has occurred in one of the packets upon which we have a dependence.
© Copyright 2014 HSA Foundation. All Rights Reserved
BARRIER PACKET
© Copyright 2014 HSA Foundation. All Rights Reserved
Start Offset
(Bytes)
Format Field Name Description
0 uint16_t header Packet header (see the common packet header above).
2 uint16_t reserved2 Must be 0.
4 uint32_t reserved3 Must be 0.
8 uint64_t depSignal0
Address of dependent signaling objects to be evaluated by the packet processor.
16 uint64_t depSignal1
24 uint64_t depSignal2
32 uint64_t depSignal3
40 uint64_t depSignal4
48 uint64_t reserved4 Must be 0.
56 uint64_t completionSignal Address of HSA signaling object used to indicate completion of the job.
DEPENDENCES
 A user may never assume more than one packet is being executed by an HSA
agent at a time.
 Implications:
 Packets can’t poll on shared memory values which will be set by packets issued from
other queues, unless the user has ensured the proper ordering.
 To ensure all previous packets from a queue have been completed, use the Barrier
bit.
 To ensure specific packets from any queue have completed, use the Barrier packet.
© Copyright 2014 HSA Foundation. All Rights Reserved
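For illustration, a barrier packet laid out per the table above and filled in to wait on two packets from other queues (the struct and helper are sketches, not the provisional headers):

#include <stdint.h>
#include <string.h>
// Offsets match the table: 0, 2, 4, 8..40, 48, 56 (64 bytes total).
struct hsa_barrier_packet {
  uint16_t header;           // format = BARRIER, plus fence scopes / barrier bit
  uint16_t reserved2;
  uint32_t reserved3;
  uint64_t depSignal[5];     // up to five dependent signal handles
  uint64_t reserved4;
  uint64_t completionSignal; // signaled when the barrier completes
};
// Prepare a barrier that waits for two specific packets from other queues.
void init_barrier(hsa_barrier_packet* pkt, uint64_t sigA, uint64_t sigB,
                  uint64_t completion) {
  memset(pkt, 0, sizeof(*pkt));      // reserved fields must be 0
  pkt->depSignal[0] = sigA;          // barrier completes when both dependent
  pkt->depSignal[1] = sigB;          // signals reach 0 (or a dependence errors)
  pkt->completionSignal = completion;
  // The header's format field would be set to BARRIER as the final,
  // "mark valid" step of the enqueue protocol.
}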
HSA QUEUEING, PACKET EXECUTION
PACKET EXECUTION
 Launch phase
 Initiated when launch conditions are met
 All preceding packets in the queue must have exited launch phase
 If the barrier bit in the packet header is set, then all preceding packets in the queue
must have exited completion phase
 Includes memory acquire fence
 Active phase
 Execute the packet
 Barrier packets remain in Active phase until conditions are met.
 Completion phase
 First step is memory release fence – make results visible.
 completionSignal field is then signaled with a decrementing atomic.
© Copyright 2014 HSA Foundation. All Rights Reserved
PACKET EXECUTION – BARRIER BIT
© Copyright 2014 HSA Foundation. All Rights Reserved
[Timeline: Pkt1 and Pkt2 launch, execute, and complete with overlap; Pkt3 has barrier=1, so it launches only when all preceding packets in the queue have completed, then executes.]
© Copyright 2014 HSA Foundation. All Rights Reserved
PUTTING IT ALL TOGETHER (FFT)
© Copyright 2014 HSA Foundation. All Rights Reserved
[Figure: an 8-point FFT over X[0..7] built from six dispatch packets in three stages, with a barrier between each stage: packets 3 & 4 wait on 1 & 2, and packets 5 & 6 wait on 3 & 4.]
PUTTING IT ALL TOGETHER
© Copyright 2014 HSA Foundation. All Rights Reserved
AQL Pseudo Code
// Send the packets to do the first stage.
aql_dispatch(pkt1);
aql_dispatch(pkt2);
// Send the next two packets, setting the barrier bit so we
// know packets 1 & 2 will be complete before 3 and 4 are
// launched.
aql_dispatch_with_barrier_bit(pkt3);
aql_dispatch(pkt4);
// Same as above (make sure 3 & 4 are done before issuing 5 & 6).
aql_dispatch_with_barrier_bit(pkt5);
aql_dispatch(pkt6);
// This packet will notify us when 5 & 6 are complete.
aql_dispatch_with_barrier_bit(finish_pkt);
PACKET EXECUTION – BARRIER PACKET
© Copyright 2014 HSA Foundation. All Rights Reserved
[Timeline: queue Q1 holds task T1; queue Q2 holds a barrier packet followed by T2. Signal X is initialized to 1 and used as the barrier's depSignal0. T1 launches and executes, and on completion decrements signal X via its completionSignal. The barrier stays in its active phase until X is signalled with 0, then completes; T2 launches once the barrier is complete, then executes and completes.]
© Copyright 2014 HSA Foundation. All Rights Reserved
DEPTH FIRST CHILD TASK EXECUTION
 Consider two generations of child tasks
 Task T submits tasks T.1 & T.2
 Task T.1 submits tasks T.1.1 & T.1.2
 Task T.2 submits tasks T.2.1 & T.2.2
 Desired outcome
 Depth first child task execution
 i.e. T -> T.1 -> T.1.1 -> T.1.2 -> T.2 -> T.2.1 -> T.2.2
 T is passed a signal (allComplete) to decrement when all tasks are complete (T and its
children etc.)
© Copyright 2014 HSA Foundation. All Rights Reserved
[Figure: task tree - T at the root, children T.1 and T.2, grandchildren T.1.1, T.1.2, T.2.1, T.2.2.]
HOW TO DO THIS WITH HSA QUEUES?
 Use a separate user mode queue for each recursion level
 Task T submits to queue Q1
 Tasks T.1 & T.2 submit tasks to queue Q2
 Queues could be passed in as parameters to task T
 Depth first requires ordering of T.1, T.2 and their children
 Use additional signal object (childrenComplete) to track completion of the children of
T.1 & T.2
 childrenComplete set to number of children (i.e. 2) by each of T.1 & T.2
© Copyright 2014 HSA Foundation. All Rights Reserved
A PICTURE SAYS MORE THAN 1000 WORDS
© Copyright 2014 HSA Foundation. All Rights Reserved
[Figure: the task tree mapped onto two queues. Q1 holds T.1, a barrier that waits on childrenComplete, T.2, and a barrier that signals allComplete; Q2 holds T.1.1, T.1.2, T.2.1, T.2.2.]
SUMMARY
© Copyright 2014 HSA Foundation. All Rights Reserved
KEY HSA TECHNOLOGIES
 HSA combines several mechanisms to enable low overhead task
dispatch
 Shared Virtual Memory
 System Coherency
 Signaling
 AQL
 User mode queues – from any compatible agent
 Architected packet format
 Rich dependency mechanism
 Flexible and efficient signaling of completion
© Copyright 2014 HSA Foundation. All Rights Reserved
QUESTIONS?
© Copyright 2014 HSA Foundation. All Rights Reserved
HSA APPLICATIONS
WEN-MEI HWU, PROFESSOR, UNIVERSITY OF ILLINOIS
WITH J.P. BORDES AND JUAN GOMEZ
USE CASES SHOWING HSA ADVANTAGE
 Pointer-based Data Structures - Binary tree searches: the GPU performs parallel searches
in a CPU-created binary tree.
 HSA Advantage: CPU and GPU have access to the entire unified coherent memory;
the GPU can access existing data structures containing pointers.
 Platform Atomics - Work-group dynamic task management: the GPU directly operates on a
task pool managed by the CPU, for algorithms with dynamic computation loads.
Binary tree updates: CPU and GPU operate simultaneously on the tree, both making
modifications.
 HSA Advantage: CPU and GPU can synchronize using platform atomics; higher
performance through parallel operations, reducing the need for data copying and
reconciling.
 Large Data Sets - Hierarchical data searches: applications include object recognition,
collision detection, global illumination, BVH.
 HSA Advantage: CPU and GPU have access to the entire unified coherent memory;
the GPU can operate on huge models in place, reducing copy and kernel launch
overhead.
 CPU Callbacks - Middleware user-callbacks: the GPU processes work items, some of which
require a call to a CPU function to fetch new data.
 HSA Advantage: the GPU can invoke CPU functions from within a GPU kernel; simpler
programming does not require “split kernels”; higher performance through parallel
operations.
© Copyright 2014 HSA Foundation. All Rights Reserved
UNIFIED COHERENT MEMORY
FOR POINTER-BASED DATA
STRUCTURES
UNIFIED COHERENT MEMORY
MORE EFFICIENT POINTER DATA STRUCTURES
[Figure sequence, Legacy: the CPU builds a pointer-based TREE (nodes with L/R links) in SYSTEM MEMORY. The GPU KERNEL cannot follow host pointers, so the tree is flattened, the FLAT TREE and a RESULT BUFFER are copied into GPU MEMORY, the kernel searches there, and the RESULT BUFFER is copied back to system memory.]
© Copyright 2014 HSA Foundation. All Rights Reserved
UNIFIED COHERENT MEMORY
MORE EFFICIENT POINTER DATA STRUCTURES
[Figure sequence, HSA and full OpenCL 2.0: the GPU KERNEL traverses the pointer-based TREE and writes the RESULT BUFFER directly in SYSTEM MEMORY; no flattening and no copies are required.]
© Copyright 2014 HSA Foundation. All Rights Reserved
POINTER DATA STRUCTURES
- CODE COMPLEXITY
[Side-by-side code listing: the HSA version is substantially shorter than the Legacy version.]
© Copyright 2014 HSA Foundation. All Rights Reserved
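The heart of the HSA version is that the kernel walks the CPU-built tree in place. A sketch of the per-work-item search over ordinary host pointers (illustrative code, not the measured implementation from the slide):

// Under HSA's unified coherent memory the same node pointers are
// valid on the CPU and the GPU: no flattening, no index rewriting.
struct Node {
  int   key;
  Node* left;
  Node* right;
};
// One search, as each work-item would run it over its assigned key.
bool search(const Node* root, int key) {
  for (const Node* n = root; n != nullptr;
       n = (key < n->key) ? n->left : n->right) {
    if (n->key == key) return true;  // found: record hit in the result buffer
  }
  return false;
}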
POINTER DATA STRUCTURES
- PERFORMANCE
[Chart: Binary Tree Search - search rate (nodes/ms, 0 to 60,000) vs. tree size (1M, 5M, 10M, 25M nodes) for CPU (1 core), CPU (4 core), Legacy APU, and HSA APU.]
Measured in AMD labs Jan 1-3 on system shown in backup slide
© Copyright 2014 HSA Foundation. All Rights Reserved
PLATFORM ATOMICS FOR
DYNAMIC TASK MANAGEMENT
PLATFORM ATOMICS
ENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT
[Figure sequence, Legacy*: the CPU maintains a TASKS POOL in SYSTEM MEMORY while work-groups 1-4 on the GPU consume from QUEUE 1 and QUEUE 2 in GPU MEMORY. Each queue carries NUM. WRITTEN TASKS and NUM. CONSUMED TASKS counters. Tasks and the written-task counts reach GPU memory via asynchronous transfers; work-groups bump the consumed-task counters with atomic adds (counting 0 through 4); the consumed counts are finally read back zero-copy, so the CPU only observes consumption after the fact.]
*Chen et al., Dynamic load balancing on single- and multi-GPU systems, IPDPS 2010
© Copyright 2014 HSA Foundation. All Rights Reserved
PLATFORM ATOMICS
ENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT
[Figure sequence, HSA and full OpenCL 2.0: the TASKS POOL and the NUM. WRITTEN TASKS / NUM. CONSUMED TASKS counters for QUEUE 1 and QUEUE 2 live in HOST COHERENT MEMORY. Work-groups 1-4 consume tasks and update the counters with platform atomic adds (counting 0 through 4) that are immediately visible to the CPU, without staging through GPU MEMORY.]
© Copyright 2014 HSA Foundation. All Rights Reserved
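The consumption step reduces to a handful of platform atomics on counters in host coherent memory. A C++-style sketch of the idea (illustrative only; on the device side this would be expressed with OpenCL 2.0 or HSAIL platform-scope atomics, and the real scheme in Chen et al. is more elaborate):

#include <atomic>
// Per-queue counters in host coherent memory; CPU producer and GPU
// work-groups see each update without staging copies.
struct TaskQueue {
  std::atomic<int> numWritten;   // bumped by the CPU as tasks are inserted
  std::atomic<int> numConsumed;  // bumped by work-groups as tasks are claimed
};
// Work-group side: claim the next task if one is available.
// Returns the task index, or -1 if the queue is currently drained.
int try_claim(TaskQueue* q) {
  int written = q->numWritten.load(std::memory_order_acquire);
  int idx = q->numConsumed.fetch_add(1, std::memory_order_acq_rel);
  if (idx < written) return idx;  // claimed task 'idx'
  q->numConsumed.fetch_sub(1, std::memory_order_relaxed);  // give slot back
  return -1;
}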
PLATFORM ATOMICS – CODE COMPLEXITY
HSA: host enqueue function is 20 lines of code
Legacy: host enqueue function is 102 lines of code
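The 20-line figure is plausible once the task pool sits in unified coherent memory: enqueuing collapses to a store plus an atomic publish, with no buffer objects or explicit transfers. A hedged sketch, reusing the names from the task-pool sketch above (hostEnqueue is hypothetical, not an HSA runtime call):

// Single-producer enqueue into the shared pool; no clCreateBuffer,
// no clEnqueueWriteBuffer, no flush -- just a store and a publish.
int hostEnqueue(const Task& t) {
    int idx = numWritten.load(std::memory_order_relaxed);
    if (idx >= kPoolSize) return -1;                    // pool full
    pool[idx] = t;                                      // write the task in place
    numWritten.fetch_add(1, std::memory_order_release); // make it visible to the GPU
    return idx;
}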
© Copyright 2014 HSA Foundation. All Rights Reserved
PLATFORM ATOMICS - PERFORMANCE
[Chart: execution time (ms) of the legacy implementation vs. the HSA implementation, for 64/128/256/512 tasks per insertion at task pool sizes of 4096 and 16384; the y-axis runs from 0 to 700 ms.]
© Copyright 2014 HSA Foundation. All Rights Reserved
PLATFORM ATOMICS FOR CPU/GPU COLLABORATION
PLATFORM ATOMICS
ENABLING EFFICIENT GPU/CPU COLLABORATION
[Animation, legacy: only the GPU kernel can work on the input buffer and tree; concurrent CPU processing is not possible.]
© Copyright 2014 HSA Foundation. All Rights Reserved
PLATFORM ATOMICS
[Animation, HSA and full OpenCL 2.0: the GPU kernel and CPU cores 0 and 1 operate on the same tree and input buffer concurrently.]
© Copyright 2014 HSA Foundation. All Rights Reserved
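A hedged sketch of the kind of concurrent update this enables: a binary-search-tree insert made lock-free with compare-and-swap. Two CPU threads insert concurrently here; on an HSA system, GPU work-items could run the same loop on the same nodes using platform atomic CAS. Illustrative only (nodes are leaked for brevity).

#include <atomic>
#include <cstdio>
#include <thread>

struct Node {
    int key;
    std::atomic<Node*> left{nullptr}, right{nullptr};
    explicit Node(int k) : key(k) {}
};

void insert(Node* root, Node* n) {
    Node* cur = root;
    for (;;) {
        std::atomic<Node*>& slot = (n->key < cur->key) ? cur->left : cur->right;
        Node* child = slot.load(std::memory_order_acquire);
        if (child == nullptr) {
            Node* expected = nullptr;
            // CAS publishes the new node; on failure another inserter won
            // the slot, so descend into whatever it installed and retry.
            if (slot.compare_exchange_strong(expected, n, std::memory_order_release))
                return;
            child = expected;
        }
        cur = child;
    }
}

int main() {
    Node root(50);
    std::thread a([&] { for (int k = 0;  k < 50;  ++k) insert(&root, new Node(k)); });
    std::thread b([&] { for (int k = 51; k < 100; ++k) insert(&root, new Node(k)); });
    a.join(); b.join();
    std::puts("concurrent inserts done");
}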
UNIFIED COHERENT MEMORY FOR LARGE DATA SETS
PROCESSING LARGE DATA SETS
The CPU creates a large data structure in System Memory. Computations using the data are offloaded to the GPU.
[Diagram: a large 3D spatial data structure, a five-level hierarchy (Level 1 through Level 5), resides in System Memory alongside the GPU. Compare the HSA and legacy methods.]
© Copyright 2014 HSA Foundation. All Rights Reserved
LEGACY ACCESS USING GPU MEMORY
Legacy: GPU Memory is smaller than System Memory, so the structure has to be copied and processed in chunks.
[Animation, legacy: the top 2 levels of the hierarchy are copied from System Memory into GPU Memory and a first kernel processes them; the bottom 3 levels of one branch are then copied in for a second kernel; then the bottom 3 levels of a different branch, and so on, one copy and one kernel launch per chunk through an Nth kernel.]
© Copyright 2014 HSA Foundation. All Rights Reserved
LARGE SPATIAL DATA STRUCTURE: GPU CAN TRAVERSE ENTIRE HIERARCHY
[Animation, HSA and full OpenCL 2.0: the kernel walks the whole five-level hierarchy in place in System Memory, descending branch by branch with no copies into GPU Memory.]
© Copyright 2014 HSA Foundation. All Rights Reserved
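A minimal sketch of why unified coherent memory matters here: the CPU builds a pointer-linked hierarchy, and the kernel traverses it through those same pointers. Under HSA the traversal could run as a GPU kernel with no serialization or chunked copies; it is shown as plain C++ below, with a tiny five-level chain standing in for the large structure. Illustrative only.

#include <cstdio>
#include <vector>

struct SpatialNode {
    int level;
    std::vector<SpatialNode*> children; // raw pointers, valid on CPU and GPU under HSA
};

// Depth-first walk of the whole hierarchy, in place in system memory.
void traverse(const SpatialNode* n) {
    std::printf("visiting level-%d node\n", n->level);
    for (const SpatialNode* c : n->children)
        traverse(c);
}

int main() {
    SpatialNode nodes[5]; // CPU-side construction
    for (int i = 0; i < 5; ++i) nodes[i].level = i + 1;
    for (int i = 0; i < 4; ++i) nodes[i].children.push_back(&nodes[i + 1]);
    traverse(&nodes[0]); // the "kernel" walks it with no copies
}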
CALLBACKS
CALLBACKS
A COMMON SITUATION IN HETEROGENEOUS COMPUTING
 Parallel processing algorithm with branches
 A seldom-taken branch requires new data from the CPU
 On legacy systems, the algorithm must be split:
 Process Kernel 1 on GPU
 Check for CPU callbacks and, if any, process on CPU
 Process Kernel 2 on GPU
 Example algorithm from image processing:
 Perform a filter
 Calculate average LUMA in each tile
 Compare LUMA against a threshold and call a CPU callback if exceeded (rare)
 Perform special processing on tiles with callbacks
[Figure: input image and output image]
© Copyright 2014 HSA Foundation. All Rights Reserved
CALLBACKS
Legacy
[Diagram: GPU threads 0 through N; threads that hit the callback branch must stop and wait for a continuation kernel.]
The continuation kernel finishes up the kernel's work; splitting the kernel results in poor GPU utilization.
© Copyright 2014 HSA Foundation. All Rights Reserved
CALLBACKS
[Figure: input image, 1 tile = 1 OpenCL work-item, and output image]
GPU
• Work items compute the average RGB value of all the pixels in a tile
• Work items also compute average Luma from the average RGB
• If average Luma > threshold, the workgroup invokes a CPU CALLBACK
• In parallel with the callback, compute continues
CPU
• For selected tiles, update the average Luma value (set to RED)
GPU
• Work items apply the Luma value to all pixels in the tile
GPU to CPU callbacks use Shared Virtual Memory (SVM) semaphores, implemented using Platform Atomic Compare-and-Swap; a sketch follows below.
© Copyright 2014 HSA Foundation. All Rights Reserved
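A hedged sketch of such an SVM-semaphore callback: the GPU side raises a request with a compare-and-swap on a flag in shared virtual memory and keeps computing, while a CPU service thread claims the request with another CAS. CPU threads stand in for GPU workgroups; all names (CallbackSlot, gpuWorkgroup, cpuServiceThread) are illustrative.

#include <atomic>
#include <cstdio>
#include <thread>

enum : int { IDLE = 0, REQUESTED = 1, DONE = 2 };

struct CallbackSlot {
    std::atomic<int> state{IDLE};
    float avgLuma = 0.0f;  // request payload, visible to both sides via SVM
};

CallbackSlot slot;         // lives in shared virtual memory under HSA
std::atomic<bool> quit{false};

void gpuWorkgroup() {
    slot.avgLuma = 0.9f;   // pretend this tile exceeded the threshold
    int expected = IDLE;
    while (!slot.state.compare_exchange_weak(expected, REQUESTED,
                                             std::memory_order_release))
        expected = IDLE;   // raise the request with a platform atomic CAS
    // ...the workgroup would continue processing other tiles here...
    while (slot.state.load(std::memory_order_acquire) != DONE)
        std::this_thread::yield();
    std::printf("workgroup sees serviced callback\n");
}

void cpuServiceThread() {
    while (!quit.load(std::memory_order_relaxed)) {
        int expected = REQUESTED;
        if (slot.state.compare_exchange_strong(expected, DONE,
                                               std::memory_order_acq_rel))
            std::printf("CPU serviced callback, luma %.2f\n", slot.avgLuma);
    }
}

int main() {
    std::thread cpu(cpuServiceThread);
    std::thread gpu(gpuWorkgroup);
    gpu.join();
    quit.store(true);
    cpu.join();
}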
CALLBACKS
HSA and full OpenCL 2.0
[Diagram: GPU threads 0 through N; the few kernel threads that need CPU callback services are serviced immediately while the rest keep running.]
© Copyright 2014 HSA Foundation. All Rights Reserved
SUMMARY - HSA ADVANTAGE
Programming Technique: Pointer-based Data Structures
 Use Case: Binary tree searches. GPU performs parallel searches in a CPU-created binary tree.
 HSA Advantage: CPU and GPU have access to the entire unified coherent memory. GPU can access existing data structures containing pointers.
Programming Technique: Platform Atomics
 Use Case: Work-group dynamic task management. GPU directly operates on a task pool managed by the CPU for algorithms with dynamic computation loads. Binary tree updates: CPU and GPU operate simultaneously on the tree, both doing modifications.
 HSA Advantage: CPU and GPU can synchronize using Platform Atomics. Higher performance through parallel operations, reducing the need for data copying and reconciling.
Programming Technique: Large Data Sets
 Use Case: Hierarchical data searches. Applications include object recognition, collision detection, global illumination, BVH.
 HSA Advantage: CPU and GPU have access to the entire unified coherent memory. GPU can operate on huge models in place, reducing copy and kernel launch overhead.
Programming Technique: CPU Callbacks
 Use Case: Middleware user-callbacks. GPU processes work items, some of which require a call to a CPU function to fetch new data.
 HSA Advantage: GPU can invoke CPU functions from within a GPU kernel. Simpler programming that does not require "split kernels". Higher performance through parallel operations.
© Copyright 2014 HSA Foundation. All Rights Reserved
QUESTIONS?
HSA COMPILATION
WEN-MEI HWU, CTO, MULTICOREWARE INC
WITH RAY I-JUI SUNG
KEY HSA FEATURES FOR COMPILATION
ALL-PROCESSORS-EQUAL
 GPU and CPU have equal flexibility to create and dispatch work items
EQUAL ACCESS TO ENTIRE SYSTEM MEMORY
 GPU and CPU have uniform visibility into the entire memory space
[Diagram: CPU and GPU share a single dispatch path and unified coherent memory.]
© Copyright 2014 HSA Foundation. All Rights Reserved
A QUICK REVIEW OF OPENCL
CURRENT STATE OF PORTABLE HETEROGENEOUS PARALLEL PROGRAMMING
DEVICE CODE IN OPENCL
SIMPLE MATRIX MULTIPLICATION
__kernel void
matrixMul(__global float* C, __global float* A, __global float* B, int wA, int wB) {
int tx = get_global_id(0);
int ty = get_global_id(1);
float value = 0;
for (int k = 0; k < wA; ++k)
{
float elementA = A[ty * wA + k];
float elementB = B[k * wB + tx];
value += elementA * elementB;
}
C[ty * wA + tx] = value;
}
Explicit thread index usage.
Reasonably readable.
Portable across CPUs, GPUs, and FPGAs
© Copyright 2014 HSA Foundation. All Rights Reserved
HOST CODE IN OPENCL - CONCEPTUAL
1. Allocate and initialize memory on the host side
2. Initialize OpenCL
3. Allocate device memory and move the data
4. Load and build device code
5. Launch kernel
 a. Append arguments
6. Move the data back from the device
© Copyright 2014 HSA Foundation. All Rights Reserved
int main(int argc, char** argv){
// set seed for rand()
srand(2006);
/****************************************************/
/* Allocate and initialize memory on Host Side */
/****************************************************/
// allocate and initialize host memory for matrices A and B
unsigned int size_A = WA * HA;
unsigned int mem_size_A = sizeof(float) * size_A;
float* h_A = (float*) malloc(mem_size_A);
unsigned int size_B = WB * HB;
unsigned int mem_size_B = sizeof(float) * size_B;
float* h_B = (float*) malloc(mem_size_B);
randomInit(h_A, size_A);
randomInit(h_B, size_B);
// allocate host memory for the result C
unsigned int size_C = WC * HC;
unsigned int mem_size_C = sizeof(float) * size_C;
float* h_C = (float*) malloc(mem_size_C);
/*****************************************/
/* Initialize OpenCL */
/*****************************************/
// OpenCL specific variables
cl_context clGPUContext;
cl_command_queue clCommandQue;
cl_program clProgram;
size_t dataBytes;
size_t kernelLength;
cl_int errcode;
cl_kernel clKernel;
// OpenCL device memory pointers for matrices
cl_mem d_A;
cl_mem d_B;
cl_mem d_C;
clGPUContext = clCreateContextFromType(0,
CL_DEVICE_TYPE_GPU,
NULL, NULL, &errcode);
shrCheckError(errcode, CL_SUCCESS);
// get the list of GPU devices associated with context
errcode = clGetContextInfo(clGPUContext,
CL_CONTEXT_DEVICES, 0, NULL,
&dataBytes);
cl_device_id *clDevices = (cl_device_id *)
malloc(dataBytes);
errcode |= clGetContextInfo(clGPUContext,
CL_CONTEXT_DEVICES, dataBytes,
clDevices, NULL);
shrCheckError(errcode, CL_SUCCESS);
//Create a command-queue
clCommandQue = clCreateCommandQueue(clGPUContext,
clDevices[0], 0, &errcode);
shrCheckError(errcode, CL_SUCCESS);
// 3. Allocate device memory and move data
d_C = clCreateBuffer(clGPUContext,
CL_MEM_READ_WRITE,
mem_size_C, NULL, &errcode);
d_A = clCreateBuffer(clGPUContext,
CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
mem_size_A, h_A, &errcode);
d_B = clCreateBuffer(clGPUContext,
CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
mem_size_B, h_B, &errcode);
// 4. Load and build OpenCL kernel
char *clMatrixMul = oclLoadProgSource("kernel.cl",
"// My comment\n",
&kernelLength);
shrCheckError(clMatrixMul != NULL, shrTRUE);
clProgram = clCreateProgramWithSource(clGPUContext,
1, (const char **)&clMatrixMul,
&kernelLength, &errcode);
shrCheckError(errcode, CL_SUCCESS);
errcode = clBuildProgram(clProgram, 0,
NULL, NULL, NULL, NULL);
shrCheckError(errcode, CL_SUCCESS);
clKernel = clCreateKernel(clProgram,
"matrixMul", &errcode);
shrCheckError(errcode, CL_SUCCESS);
// 5. Launch OpenCL kernel
size_t localWorkSize[2], globalWorkSize[2];
int wA = WA;
int wC = WC;
errcode = clSetKernelArg(clKernel, 0,
sizeof(cl_mem), (void *)&d_C);
errcode |= clSetKernelArg(clKernel, 1,
sizeof(cl_mem), (void *)&d_A);
errcode |= clSetKernelArg(clKernel, 2,
sizeof(cl_mem), (void *)&d_B);
errcode |= clSetKernelArg(clKernel, 3,
sizeof(int), (void *)&wA);
errcode |= clSetKernelArg(clKernel, 4,
sizeof(int), (void *)&wC);
shrCheckError(errcode, CL_SUCCESS);
localWorkSize[0] = 16;
localWorkSize[1] = 16;
globalWorkSize[0] = 1024;
globalWorkSize[1] = 1024;
errcode = clEnqueueNDRangeKernel(clCommandQue,
clKernel, 2, NULL, globalWorkSize,
localWorkSize, 0, NULL, NULL);
shrCheckError(errcode, CL_SUCCESS);
// 6. Retrieve result from device
errcode = clEnqueueReadBuffer(clCommandQue,
d_C, CL_TRUE, 0, mem_size_C,
h_C, 0, NULL, NULL);
shrCheckError(errcode, CL_SUCCESS);
// 7. clean up memory
free(h_A);
free(h_B);
free(h_C);
clReleaseMemObject(d_A);
clReleaseMemObject(d_C);
clReleaseMemObject(d_B);
free(clDevices);
free(clMatrixMul);
clReleaseContext(clGPUContext);
clReleaseKernel(clKernel);
clReleaseProgram(clProgram);
clReleaseCommandQueue(clCommandQue);}
Almost 100 lines of code – tedious and hard to maintain.
It does not take advantage of HSA features.
It will likely need to be changed for OpenCL 2.0.
COMPARING SEVERAL HIGH-LEVEL PROGRAMMING INTERFACES
C++AMP: C++ language extension proposed by Microsoft
Thrust: library proposed by NVIDIA (CUDA)
Bolt: library proposed by AMD
OpenACC: annotations and pragmas proposed by PGI
SYCL: C++ wrapper for OpenCL
All these proposals aim to reduce tedious boilerplate code and provide transparent porting to future systems (future proofing).
© Copyright 2014 HSA Foundation. All Rights Reserved
OPENACC
HSA ENABLES SIMPLER IMPLEMENTATION OR BETTER OPTIMIZATION
© Copyright 2014 HSA Foundation. All Rights Reserved
OPENACC - SIMPLE MATRIX MULTIPLICATION EXAMPLE
void MatrixMulti(float *C, const float *A, const float *B, int hA, int wA, int wB)
{
  #pragma acc parallel loop copyin(A[0:hA*wA]) copyin(B[0:wA*wB]) copyout(C[0:hA*wB])
  for (int i=0; i<hA; i++) {
    #pragma acc loop
    for (int j=0; j<wB; j++) {
      float sum = 0;
      for (int k=0; k<wA; k++) {
        float a = A[i*wA+k];
        float b = B[k*wB+j];
        sum += a*b;
      }
      C[i*wB+j] = sum;
    }
  }
}
Little host code overhead
Programmer annotation of kernel computation
Programmer annotation of data movement
© Copyright 2014 HSA Foundation. All Rights Reserved
ADVANTAGE OF HSA FOR OPENACC
 Flexibility in copyin and copyout implementation
 Flexible code generation for nested acc parallel loops
 E.g., inner loop bounds that depend on outer loop iterations (a sketch follows below)
 Compiler data affinity optimization (especially OpenACC kernel regions)
 The compiler does not have to undo programmer managed data transfers
© Copyright 2014 HSA Foundation. All Rights Reserved
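A hedged sketch of that nested-loop case: a triangular loop whose inner bound depends on the outer iteration, which cannot be flattened into one rectangular grid. Under HSA, unified memory and flexible dispatch give the code generator more freedom in mapping such shapes. Illustrative code, not taken from the tutorial.

#include <cstdio>

constexpr int N = 512;
static float a[N][N];

void triangularUpdate() {
    #pragma acc parallel loop copy(a[0:N][0:N])
    for (int i = 0; i < N; i++) {
        #pragma acc loop
        for (int j = 0; j <= i; j++) {        // inner bound depends on outer iteration
            a[i][j] = 0.5f * (a[i][j] + a[j][i]); // fold upper triangle into lower
        }
    }
}

int main() {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = float(i - j);
    triangularUpdate();
    std::printf("a[10][3] = %f\n", a[10][3]);
}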
C++AMP
HSA ENABLES EFFICIENT COMPILATION OF AN EVEN HIGHER-LEVEL PROGRAMMING INTERFACE
© Copyright 2014 HSA Foundation. All Rights Reserved
C++ AMP
● C++ Accelerated Massive Parallelism
● Designed for data level parallelism
● Extension of C++11 proposed by Microsoft
● An open specification with multiple implementations aiming at standardization
● MS Visual Studio 2013
● MulticoreWare CLAMP
● GPU data modeled as C++14-like containers for multidimensional arrays
● GPU kernels modeled as C++11 lambdas
● Minimal extension to C++ for simplicity and future proofing
© Copyright 2014 HSA Foundation. All Rights Reserved
MATRIX MULTIPLICATION IN C++AMP
void MultiplyWithAMP(int* aMatrix, int* bMatrix, int *productMatrix,
int ha, int hb, int hc) {
array_view<int, 2> a(ha, hb, aMatrix);
array_view<int, 2> b(hb, hc, bMatrix);
array_view<int, 2> product(ha, hc, productMatrix);
parallel_for_each(
product.extent,
[=](index<2> idx) restrict(amp) {
int row = idx[0];
int col = idx[1];
for (int inner = 0; inner < hb; inner++) {
product[idx] += a(row, inner) * b(inner, col);
}
}
);
product.synchronize();}
clGPUContext = clCreateContextFromType(0,
CL_DEVICE_TYPE_GPU,
NULL, NULL, &errcode);
shrCheckError(errcode, CL_SUCCESS);
// get the list of GPU devices associated
// with context
errcode = clGetContextInfo(clGPUContext,
CL_CONTEXT_DEVICES, 0, NULL,
&dataBytes);
cl_device_id *clDevices = (cl_device_id *)
malloc(dataBytes);
errcode |= clGetContextInfo(clGPUContext,
CL_CONTEXT_DEVICES, dataBytes,
clDevices, NULL);
shrCheckError(errcode, CL_SUCCESS);
//Create a command-queue
clCommandQue =
clCreateCommandQueue(clGPUContext,
clDevices[0], 0, &errcode);
shrCheckError(errcode, CL_SUCCESS);
__kernel void
matrixMul(__global float* C, __global float* A,
__global float* B, int wA, int wB) {
int tx = get_global_id(0);
int ty = get_global_id(1);
float value = 0;
for (int k = 0; k < wA; ++k)
{
float elementA = A[ty * wA + k];
float elementB = B[k * wB + tx];
value += elementA * elementB;
}
C[ty * wA + tx] = value;}
© Copyright 2014 HSA Foundation. All Rights Reserved
C++AMP PROGRAMMING MODEL
void MultiplyWithAMP(int* aMatrix, int* bMatrix, int *productMatrix) {
array_view<int, 2> a(3, 2, aMatrix);
array_view<int, 2> b(2, 3, bMatrix);
array_view<int, 2> product(3, 3, productMatrix);
parallel_for_each(
product.extent,
[=](index<2> idx) restrict(amp) {
int row = idx[0];
int col = idx[1];
for (int inner = 0; inner < 2; inner++) {
product[idx] += a(row, inner) * b(inner, col);
}
}
);
product.synchronize();}
GPU data modeled as data containers
© Copyright 2014 HSA Foundation. All Rights Reserved
C++AMP PROGRAMMING MODEL
void MultiplyWithAMP(int* aMatrix, int* bMatrix, int *productMatrix) {
array_view<int, 2> a(3, 2, aMatrix);
array_view<int, 2> b(2, 3, bMatrix);
array_view<int, 2> product(3, 3, productMatrix);
parallel_for_each(
product.extent,
[=](index<2> idx) restrict(amp) {
int row = idx[0];
int col = idx[1];
for (int inner = 0; inner < 2; inner++) {
product[idx] += a(row, inner) * b(inner, col);
}
}
);
product.synchronize();}
Kernels modeled as lambdas; arguments are implicitly modeled as captured variables, so the programmer does not need to specify copyin and copyout
© Copyright 2014 HSA Foundation. All Rights Reserved
C++AMP PROGRAMMING MODEL
void MultiplyWithAMP(int* aMatrix, int* bMatrix, int *productMatrix) {
array_view<int, 2> a(3, 2, aMatrix);
array_view<int, 2> b(2, 3, bMatrix);
array_view<int, 2> product(3, 3, productMatrix);
parallel_for_each(
product.extent,
[=](index<2> idx) restrict(amp) {
int row = idx[0];
int col = idx[1];
for (int inner = 0; inner < 2; inner++) {
product[idx] += a(row, inner) * b(inner, col);
}
}
);
product.synchronize();
}
Execution interface; marking an implicitly parallel region for GPU execution
© Copyright 2014 HSA Foundation. All Rights Reserved
MCW C++AMP (CLAMP)
● Runs on Linux and Mac OS X
● Output code compatible with all major OpenCL stacks: AMD, Apple/Intel (OS X),
NVIDIA and even POCL
● Clang/LLVM-based, open source
o Translate C++AMP code to OpenCL C or OpenCL 1.2 SPIR
o With template helper library
● Runtime: OpenCL 1.1/HSA Runtime and GMAC for non-HSA systems
● One of the two C++ AMP implementations recognized by the HSA Foundation
© Copyright 2014 HSA Foundation. All Rights Reserved
MCW C++ AMP COMPILER
● Device Path
o generate OpenCL C code and SPIR
o emit kernel function
● Host Path
o preparation to launch the code
[Diagram: C++ AMP source code feeds Clang/LLVM 3.3, which emits Device Code and Host Code.]
© Copyright 2014 HSA Foundation. All Rights Reserved
TRANSLATION
parallel_for_each(product.extent,
[=](index<2> idx) restrict(amp) {
int row = idx[0];
int col = idx[1];
for (int inner = 0; inner < 2; inner++) {
product[idx] += a(row, inner) * b(inner, col);
}
});
__kernel void
matrixMul(__global float* C, __global float*
A,
__global float* B, int wA, int wB){
int tx = get_global_id(0);
int ty = get_global_id(1);
float value = 0;
for (int k = 0; k < wA; ++k)
{
float elementA = A[ty * wA + k];
float elementB = B[k * wB + tx];
value += elementA * elementB;
}
C[ty * wA + tx] = value;}
● Append the arguments
● Set the index
● Emit kernel function
● Implicit memory management
© Copyright 2014 HSA Foundation. All Rights Reserved
EXECUTION ON NON-HSA OPENCL PLATFORMS
[Diagram: C++ AMP source code goes through Clang/LLVM 3.3 twice, producing Device Code and Host Code; at runtime the host code runs on GMAC and OpenCL. "Our work" marks the compiler and GMAC layers.]
© Copyright 2014 HSA Foundation. All Rights Reserved
GMAC
● Unified virtual address space in software
● Can sometimes have high overhead
● In HSA (e.g., AMD Kaveri), GMAC is no longer needed
Gelado, et al., ASPLOS 2010
© Copyright 2014 HSA Foundation. All Rights Reserved
CASE STUDY: BINOMIAL OPTION PRICING
[Chart: lines of code, counted by cloc and split into Host and Kernel, for C++AMP vs. OpenCL; the y-axis runs from 0 to 350.]
© Copyright 2014 HSA Foundation. All Rights Reserved
PERFORMANCE ON NON-HSA SYSTEMS
BINOMIAL OPTION PRICING
[Chart: time in seconds on an NV Tesla C2050 for OpenCL vs. C++AMP, showing total GPU time and kernel-only time; the y-axis runs from 0 to 0.12 s.]
© Copyright 2014 HSA Foundation. All Rights Reserved
EXECUTION ON HSA
[Diagram: at compile time, C++ AMP source code goes through Clang/LLVM 3.3 to produce Device SPIR and Host SPIR; at runtime both execute on the HSA Runtime.]
© Copyright 2014 HSA Foundation. All Rights Reserved
WHAT DO WE NEED TO DO?
● Kernel function
o emit the kernel function with the required arguments
● On the host side
o a function that recursively traverses the object and appends the arguments to the OpenCL stack
● On the device side
o reconstruct the object in the device code for future use
© Copyright 2014 HSA Foundation. All Rights Reserved
WHY COMPILING C++AMP TO OPENCL IS NOT TRIVIAL
● C++AMP → LLVM IR → OpenCL C or SPIR
● Argument passing (lambda capture vs. function calls)
● Explicit vs. implicit memory transfer
● Heavy lifting is done by the compiler and runtime
© Copyright 2014 HSA Foundation. All Rights Reserved
EXAMPLE
struct A { int a; };
struct B : A { int b; };
struct C { B b; int c; };

struct C c;
c.c = 100;
auto fn = [=] () { int qq = c.c; };
© Copyright 2014 HSA Foundation. All Rights Reserved
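A hedged sketch of the host-side job this example implies: recursively walk the captured object and append each scalar member as a kernel argument, in layout order. A real implementation (e.g., CLAMP) drives this from compiler-generated metadata; here the traversal is hand-written for the structs above, and setKernelArg is a stand-in for clSetKernelArg. Illustrative only.

#include <cstdio>

struct A { int a; };
struct B : A { int b; };
struct C { B b; int c; };

static int nextArg = 0;

// Stand-in for clSetKernelArg(kernel, index, sizeof(int), &value).
void setKernelArg(int index, int value) {
    std::printf("arg %d = %d\n", index, value);
}

// One overload per type, recursing through bases and members in layout order.
void appendArgs(const A& x) { setKernelArg(nextArg++, x.a); }
void appendArgs(const B& x) { appendArgs(static_cast<const A&>(x)); setKernelArg(nextArg++, x.b); }
void appendArgs(const C& x) { appendArgs(x.b); setKernelArg(nextArg++, x.c); }

int main() {
    C c{};
    c.c = 100;
    // [=] () { int qq = c.c; } captures c by value, so the whole object is
    // flattened into kernel arguments: A::a, B::b, C::c.
    appendArgs(c); // prints arg 0 = 0, arg 1 = 0, arg 2 = 100
}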
TRANSLATION
[Same C++AMP parallel_for_each source and generated matrixMul OpenCL kernel as on the earlier Translation slide.]
● Compiler
● Turn captured variables into OpenCL arguments
● Populate the index<N> in the OCL kernel
● Runtime
● Implicit memory management
© Copyright 2014 HSA Foundation. All Rights Reserved
QUESTIONS?
© Copyright 2014 HSA Foundation. All Rights Reserved

More Related Content

PPTX
GPU Architecture NVIDIA (GTX GeForce 480)
PDF
HSA Design (2015-04-30)
PPTX
GS-4106 The AMD GCN Architecture - A Crash Course, by Layla Mah
PPTX
HSA Kernel Code (KFD v0.6)
PPTX
Heterogeneous computing
PPTX
Intel® hyper threading technology
PPTX
Superscalar Architecture_AIUB
PDF
Introduction to CUDA
GPU Architecture NVIDIA (GTX GeForce 480)
HSA Design (2015-04-30)
GS-4106 The AMD GCN Architecture - A Crash Course, by Layla Mah
HSA Kernel Code (KFD v0.6)
Heterogeneous computing
Intel® hyper threading technology
Superscalar Architecture_AIUB
Introduction to CUDA

What's hot (20)

PPTX
HSA Queuing Hot Chips 2013
PDF
Hsa Platform System Architecture Specification Provisional verl 1.0 ratifed
PDF
HSA System Architecture Overview (2014-10-31)
PDF
Apache Spark in Depth: Core Concepts, Architecture & Internals
PDF
不揮発メモリ(NVDIMM)とLinuxの対応動向について
PDF
OpenFOAMスレッド並列化のための基礎検討
PDF
LCU13: An Introduction to ARM Trusted Firmware
PDF
The future of RISC-V Supervisor Binary Interface(SBI)
PDF
Apache Bigtop3.2 (仮)(Open Source Conference 2022 Online/Hiroshima 発表資料)
PPT
Hive Training -- Motivations and Real World Use Cases
PDF
Secure Boot on ARM systems – Building a complete Chain of Trust upon existing...
PPTX
Apache NiFi Crash Course Intro
PDF
TEE - kernel support is now upstream. What this means for open source security
PDF
Isn't it ironic - managing a bare metal cloud (OSL TES 2015)
PPTX
Memory model
PPTX
Bootloaders (U-Boot)
TXT
OPTEE on QEMU - Build Tutorial
PDF
Vectorized Query Execution in Apache Spark at Facebook
PDF
Embedded Linux Kernel - Build your custom kernel
HSA Queuing Hot Chips 2013
Hsa Platform System Architecture Specification Provisional verl 1.0 ratifed
HSA System Architecture Overview (2014-10-31)
Apache Spark in Depth: Core Concepts, Architecture & Internals
不揮発メモリ(NVDIMM)とLinuxの対応動向について
OpenFOAMスレッド並列化のための基礎検討
LCU13: An Introduction to ARM Trusted Firmware
The future of RISC-V Supervisor Binary Interface(SBI)
Apache Bigtop3.2 (仮)(Open Source Conference 2022 Online/Hiroshima 発表資料)
Hive Training -- Motivations and Real World Use Cases
Secure Boot on ARM systems – Building a complete Chain of Trust upon existing...
Apache NiFi Crash Course Intro
TEE - kernel support is now upstream. What this means for open source security
Isn't it ironic - managing a bare metal cloud (OSL TES 2015)
Memory model
Bootloaders (U-Boot)
OPTEE on QEMU - Build Tutorial
Vectorized Query Execution in Apache Spark at Facebook
Embedded Linux Kernel - Build your custom kernel
Ad

Viewers also liked (12)

PPTX
HSA Introduction
PPTX
Heterogeneous Computing on POWER - IBM and OpenPOWER technologies to accelera...
PDF
LCU13: HSA Architecture Presentation
PDF
Using Xeon + FPGA for Accelerating HPC Workloads
PDF
Enabling Cognitive Workloads on the Cloud: GPUs with Mesos, Docker and Marath...
PPTX
Hands on OpenCL
PPTX
OpenCV 에서 OpenCL 살짝 써보기
PPTX
이기종 멀티코어 프로세서를 위한 프로그래밍 언어 및 영상처리 오픈소스
PDF
KeynoteTHE HETEROGENEOUS SYSTEM ARCHITECTURE ITS (NOT) ALL ABOUT THE GPU
PDF
Heterogeneous Systems Architecture: The Next Area of Computing Innovation
 
PDF
1050: 車載用ADAS/自動運転プラットフォームDRIVE PX及びコックピット・プラットフォームDRIVE CXのご紹介
PPT
Cloud computing ppt
HSA Introduction
Heterogeneous Computing on POWER - IBM and OpenPOWER technologies to accelera...
LCU13: HSA Architecture Presentation
Using Xeon + FPGA for Accelerating HPC Workloads
Enabling Cognitive Workloads on the Cloud: GPUs with Mesos, Docker and Marath...
Hands on OpenCL
OpenCV 에서 OpenCL 살짝 써보기
이기종 멀티코어 프로세서를 위한 프로그래밍 언어 및 영상처리 오픈소스
KeynoteTHE HETEROGENEOUS SYSTEM ARCHITECTURE ITS (NOT) ALL ABOUT THE GPU
Heterogeneous Systems Architecture: The Next Area of Computing Innovation
 
1050: 車載用ADAS/自動運転プラットフォームDRIVE PX及びコックピット・プラットフォームDRIVE CXのご紹介
Cloud computing ppt
Ad

Similar to ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial (20)

PPTX
ISCA Final Presentation - Intro
PPTX
HSA Introduction Hot Chips 2013
PDF
HSA From A Software Perspective
PDF
"Enabling Efficient Heterogeneous Processing Through Coherency," a Presentati...
PDF
Heterogeneous System Architecture Overview
PPTX
HSA Features
PDF
Implement Runtime Environments for HSA using LLVM
PPT
Guide to heterogeneous system architecture (hsa)
PDF
Introduction to HSA
PPTX
Mnk hsa ppt
PDF
HSA-4024, OpenJDK Sumatra Project: Bringing the GPU to Java, by Eric Caspole
PDF
Heterogenous system architecture(HSA)
PPTX
ISCA Final Presentation - HSAIL
PDF
HC-4015, An Overview of the HSA System Architecture Requirements, by Paul Bli...
PDF
HSA-4131, HSAIL Programmers Manual: Uncovered, by Ben Sander
PPTX
Ppt hsa
PDF
Keynote (Phil Rogers) - The Programmers Guide to Reaching for the Cloud - by ...
PDF
Keynote (Tony King-Smith) - Silicon? Check. HSA? Check. All done? Wrong! - by...
PDF
HC-4017, HSA Compilers Technology, by Debyendu Das
PDF
Hsa10 whitepaper
ISCA Final Presentation - Intro
HSA Introduction Hot Chips 2013
HSA From A Software Perspective
"Enabling Efficient Heterogeneous Processing Through Coherency," a Presentati...
Heterogeneous System Architecture Overview
HSA Features
Implement Runtime Environments for HSA using LLVM
Guide to heterogeneous system architecture (hsa)
Introduction to HSA
Mnk hsa ppt
HSA-4024, OpenJDK Sumatra Project: Bringing the GPU to Java, by Eric Caspole
Heterogenous system architecture(HSA)
ISCA Final Presentation - HSAIL
HC-4015, An Overview of the HSA System Architecture Requirements, by Paul Bli...
HSA-4131, HSAIL Programmers Manual: Uncovered, by Ben Sander
Ppt hsa
Keynote (Phil Rogers) - The Programmers Guide to Reaching for the Cloud - by ...
Keynote (Tony King-Smith) - Silicon? Check. HSA? Check. All done? Wrong! - by...
HC-4017, HSA Compilers Technology, by Debyendu Das
Hsa10 whitepaper

More from HSA Foundation (20)

PDF
Hsa Runtime version 1.00 Provisional
PDF
Hsa programmers reference manual (version 1.0 provisional)
PPTX
ISCA final presentation - Runtime
PPTX
ISCA final presentation - Queuing Model
PPTX
ISCA final presentation - Memory Model
PPTX
ISCA Final Presentaiton - Compilations
PPTX
ISCA Final Presentation - Applications
PPT
Apu13 cp lu-keynote-final-slideshare
PDF
HSAemu a Full System Emulator for HSA
PPTX
HSA Memory Model Hot Chips 2013
PPTX
HSA HSAIL Introduction Hot Chips 2013
PDF
HSA Foundation BoF -Siggraph 2013 Flyer
PDF
HSA Programmer’s Reference Manual: HSAIL Virtual ISA and Programming Model, C...
PDF
ARM Techcon Keynote 2012: Sensor Integration and Improved User Experiences at...
PDF
Phil Rogers IFA Keynote 2012
PDF
Deeper Look Into HSAIL And It's Runtime
PDF
Bolt C++ Standard Template Libary for HSA by Ben Sanders, AMD
PDF
Hsa2012 logo guidelines.
PDF
AFDS 2012 Phil Rogers Keynote: THE PROGRAMMER’S GUIDE TO A UNIVERSE OF POSSIB...
PDF
What Fabric Engine Can Do With HSA
Hsa Runtime version 1.00 Provisional
Hsa programmers reference manual (version 1.0 provisional)
ISCA final presentation - Runtime
ISCA final presentation - Queuing Model
ISCA final presentation - Memory Model
ISCA Final Presentaiton - Compilations
ISCA Final Presentation - Applications
Apu13 cp lu-keynote-final-slideshare
HSAemu a Full System Emulator for HSA
HSA Memory Model Hot Chips 2013
HSA HSAIL Introduction Hot Chips 2013
HSA Foundation BoF -Siggraph 2013 Flyer
HSA Programmer’s Reference Manual: HSAIL Virtual ISA and Programming Model, C...
ARM Techcon Keynote 2012: Sensor Integration and Improved User Experiences at...
Phil Rogers IFA Keynote 2012
Deeper Look Into HSAIL And It's Runtime
Bolt C++ Standard Template Libary for HSA by Ben Sanders, AMD
Hsa2012 logo guidelines.
AFDS 2012 Phil Rogers Keynote: THE PROGRAMMER’S GUIDE TO A UNIVERSE OF POSSIB...
What Fabric Engine Can Do With HSA

Recently uploaded (20)

PPTX
20250228 LYD VKU AI Blended-Learning.pptx
PPTX
Cloud computing and distributed systems.
PDF
cuic standard and advanced reporting.pdf
PDF
Building Integrated photovoltaic BIPV_UPV.pdf
PDF
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
PPTX
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
PDF
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
PDF
Electronic commerce courselecture one. Pdf
PDF
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
PDF
Advanced methodologies resolving dimensionality complications for autism neur...
PDF
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
PDF
Encapsulation theory and applications.pdf
DOCX
The AUB Centre for AI in Media Proposal.docx
PPT
Teaching material agriculture food technology
PPTX
Programs and apps: productivity, graphics, security and other tools
PDF
Encapsulation_ Review paper, used for researhc scholars
PPT
“AI and Expert System Decision Support & Business Intelligence Systems”
PPTX
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
PDF
The Rise and Fall of 3GPP – Time for a Sabbatical?
PPTX
Big Data Technologies - Introduction.pptx
20250228 LYD VKU AI Blended-Learning.pptx
Cloud computing and distributed systems.
cuic standard and advanced reporting.pdf
Building Integrated photovoltaic BIPV_UPV.pdf
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
Electronic commerce courselecture one. Pdf
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
Advanced methodologies resolving dimensionality complications for autism neur...
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
Encapsulation theory and applications.pdf
The AUB Centre for AI in Media Proposal.docx
Teaching material agriculture food technology
Programs and apps: productivity, graphics, security and other tools
Encapsulation_ Review paper, used for researhc scholars
“AI and Expert System Decision Support & Business Intelligence Systems”
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
The Rise and Fall of 3GPP – Time for a Sabbatical?
Big Data Technologies - Introduction.pptx

ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

  • 1. HETEROGENEOUS SYSTEM ARCHITECTURE (HSA): ARCHITECTURE AND ALGORITHMS ISCA TUTORIAL - JUNE 15, 2014
  • 2. TOPICS  Introduction  HSAIL Virtual Parallel ISA  HSA Runtime  HSA Memory Model  HSA Queuing Model  HSA Applications  HSA Compilation © Copyright 2014 HSA Foundation. All Rights Reserved The HSA Specifications are not at 1.0 final so all content is subject to change
  • 3. SCHEDULE © Copyright 2014 HSA Foundation. All Rights Reserved Time Topic Speaker 8:45am Introduction to HSA Phil Rogers, AMD 9:30am HSAIL Virtual Parallel ISA Ben Sander, AMD 10:30am Break 10:50am HSA Runtime Yeh-Ching Chung, National Tsing Hua University 12 noon Lunch 1pm HSA Memory Model Benedict Gaster, Qualcomm 2pm HSA Queuing Model Hakan Persson, ARM 3pm Break 3:15pm HSA Compilation Technology Wen Mei Hwu, University of Illinois 4pm HSA Application Programming Wen Mei Hwu, University of Illinois 4:45pm Questions All presenters
  • 4. INTRODUCTION PHIL ROGERS, AMD CORPORATE FELLOW & PRESIDENT OF HSA FOUNDATION
  • 5. HSA FOUNDATION  Founded in June 2012  Developing a new platform for heterogeneous systems  www.hsafoundation.com  Specifications under development in working groups to define the platform  Membership consists of 43 companies and 16 universities  Adding 1-2 new members each month © Copyright 2014 HSA Foundation. All Rights Reserved
  • 6. DIVERSE PARTNERS DRIVING FUTURE OF HETEROGENEOUS COMPUTING © Copyright 2014 HSA Foundation. All Rights Reserved Founders Promoters Supporters Contributors Academic Needs Updating – Add Toshiba Logo
  • 7. MEMBERSHIP TABLE Membership Level Number List Founder 6 AMD, ARM, Imagination Technologies, MediaTek Inc., Qualcomm Inc., Samsung Electronics Co Ltd Promoter 1 LG Electronics Contributor 25 Analog Devices Inc., Apical, Broadcom, Canonical Limited, CEVA Inc., Digital Media Professionals, Electronics and Telecommunications Research, Institute (ETRI), General Processor, Huawei, Industrial Technology Res. Institute, Marvell International Ltd., Mobica, Oracle, Sonics, Inc, Sony Mobile, Communications, Swarm 64 GmbH, Synopsys, Tensilica, Inc., Texas Instruments Inc., Toshiba, VIA Technologies, Vivante Corporation Supporter 13 Allinea Software Ltd, Arteris Inc., Codeplay Software, Fabric Engine, Kishonti, Lawrence Livermore National Laboratory, Linaro, MultiCoreWare, Oak Ridge National Laboratory, Sandia Corporation, StreamComputing, SUSE LLC, UChicago Argonne LLC, Operator of Argonne National Laboratory Academic 17 Institute for Computing Systems Architecture, Missouri University of Science & Technology, National Tsing Hua University, NMAM Institute of Technology, Northeastern University, Rice University, Seoul National University, System Software Lab National, Tsing Hua University, Tampere University of Technology, TEI of Crete, The University of Mississippi, University of North Texas, University of Bologna, University of Bristol Microelectronic Research Group, University of Edinburgh, University of Illinois at Urbana-Champaign Department of Computer Science © Copyright 2014 HSA Foundation. All Rights Reserved
  • 8. HETEROGENEOUS PROCESSORS HAVE PROLIFERATED — MAKE THEM BETTER  Heterogeneous SOCs have arrived and are a tremendous advance over previous platforms  SOCs combine CPU cores, GPU cores and other accelerators, with high bandwidth access to memory  How do we make them even better?  Easier to program  Easier to optimize  Higher performance  Lower power  HSA unites accelerators architecturally  Early focus on the GPU compute accelerator, but HSA will go well beyond the GPU © Copyright 2014 HSA Foundation. All Rights Reserved
  • 9. INFLECTIONS IN PROCESSOR DESIGN © Copyright 2014 HSA Foundation. All Rights Reserved ? Single-thread Performance Time we are here Enabled by:  Moore’s Law  Voltage Scaling Constrained by: Power Complexity Single-Core Era ModernApplication Performance Time (Data-parallel exploitation) we are here Heterogeneous Systems Era Enabled by:  Abundant data parallelism  Power efficient GPUs Temporarily Constrained by: Programming models Comm.overhead Throughput Performance Time (# of processors) we are here Enabled by:  Moore’s Law  SMP architecture Constrained by: Power Parallel SW Scalability Multi-Core Era Assembly  C/C++  Java … pthreads  OpenMP / TBB … Shader  CUDA OpenCL  C++ and Java
  • 10. LEGACY GPU COMPUTE PCIe ™ System Memory (Coherent) CPU CPU CPU . . . CU CU CU CU CU CU CU CU GPU Memory (Non-Coherent) GPU  Multiple memory pools  Multiple address spaces  High overhead dispatch  Data copies across PCIe  New languages for programming  Dual source development  Proprietary environments  Expert programmers only  Need to fix all of this to unleash our programmers The limiters © Copyright 2014 HSA Foundation. All Rights Reserved
  • 11. EXISTING APUS AND SOCS CPU 1 CPU N… CPU 2 Physical Integration CU 1 … CU 2 CU 3 CU M-2 CU M-1 CU M System Memory (Coherent) GPU Memory (Non-Coherent) GPU  Physical Integration  Good first step  Some copies gone  Two memory pools remain  Still queue through the OS  Still requires expert programmers  Need to finish the job
  • 12. AN HSA ENABLED SOC  Unified Coherent Memory enables data sharing across all processors  Processors architected to operate cooperatively  Designed to enable the application to run on different processors at different times Unified Coherent Memory CPU 1 CPU N… CPU 2 CU 1 CU 2 CU 3 CU M-2 CU M-1 CU M…
  • 13. PILLARS OF HSA*  Unified addressing across all processors  Operation into pageable system memory  Full memory coherency  User mode dispatch  Architected queuing language  Scheduling and context switching  HSA Intermediate Language (HSAIL)  High level language support for GPU compute processors © Copyright 2014 HSA Foundation. All Rights Reserved * All features of HSA are subject to change, pending ratification of 1.0 Final specifications by the HSA Board of Directors
  • 14. HSA SPECIFICATIONS  HSA System Architecture Specification  Version 1.0 Provisional, Released April 2014  Defines discovery, memory model, queue management, atomics, etc  HSA Programmers Reference Specification  Version 1.0 Provisional, Released June 2014  Defines the HSAIL language and object format  HSA Runtime Software Specification  Version 1.0 Provisional, expected to be released in July 2014  Defines the APIs through which an HSA application uses the platform  All released specifications can be found at the HSA Foundation web site:  www.hsafoundation.com/standards © Copyright 2014 HSA Foundation. All Rights Reserved
  • 15. HSA - AN OPEN PLATFORM  Open Architecture, membership open to all  HSA Programmers Reference Manual  HSA System Architecture  HSA Runtime  Delivered via royalty free standards  Royalty Free IP, Specifications and APIs  ISA agnostic for both CPU and GPU  Membership from all areas of computing  Hardware companies  Operating Systems  Tools and Middleware  Applications  Universities © Copyright 2014 HSA Foundation. All Rights Reserved
  • 16. HSA INTERMEDIATE LAYER — HSAIL  HSAIL is a virtual ISA for parallel programs  Finalized to ISA by a JIT compiler or “Finalizer”  ISA independent by design for CPU & GPU  Explicitly parallel  Designed for data parallel programming  Support for exceptions, virtual functions, and other high level language features  Lower level than OpenCL SPIR  Fits naturally in the OpenCL compilation stack  Suitable to support additional high level languages and programming models:  Java, C++, OpenMP, C++, Python, etc © Copyright 2014 HSA Foundation. All Rights Reserved
  • 17. HSA MEMORY MODEL  Defines visibility ordering between all threads in the HSA System  Designed to be compatible with C++11, Java, OpenCL and .NET Memory Models  Relaxed consistency memory model for parallel compute performance  Visibility controlled by:  Load.Acquire  Store.Release  Fences © Copyright 2014 HSA Foundation. All Rights Reserved
  • 18. HSA QUEUING MODEL  User mode queuing for low latency dispatch  Application dispatches directly  No OS or driver required in the dispatch path  Architected Queuing Layer  Single compute dispatch path for all hardware  No driver translation, direct to hardware  Allows for dispatch to queue from any agent  CPU or GPU  GPU self enqueue enables lots of solutions  Recursion  Tree traversal  Wavefront reforming © Copyright 2014 HSA Foundation. All Rights Reserved
  • 20. Hardware - APUs, CPUs, GPUs Driver Stack Domain Libraries OpenCL™, DX Runtimes, User Mode Drivers Graphics Kernel Mode Driver Apps Apps Apps Apps Apps Apps HSA Software Stack Task Queuing Libraries HSA Domain Libraries, OpenCL ™ 2.x Runtime HSA Kernel Mode Driver HSA Runtime HSA JIT Apps Apps Apps Apps Apps Apps User mode component Kernel mode component Components contributed by third parties EVOLUTION OF THE SOFTWARE STACK © Copyright 2014 HSA Foundation. All Rights Reserved
  • 21. OPENCL™ AND HSA  HSA is an optimized platform architecture for OpenCL  Not an alternative to OpenCL  OpenCL on HSA will benefit from  Avoidance of wasteful copies  Low latency dispatch  Improved memory model  Pointers shared between CPU and GPU  OpenCL 2.0 leverages HSA Features  Shared Virtual Memory  Platform Atomics © Copyright 2014 HSA Foundation. All Rights Reserved
  • 22. ADDITIONAL LANGUAGES ON HSA  In development © Copyright 2014 HSA Foundation. All Rights Reserved Language Body More Information Java Sumatra OpenJDK http://openjdk.java.net/projects/sumatra/ LLVM LLVM Code generator for HSAIL C++ AMP Multicoreware https://bitbucket.org/multicoreware/cppa mp-driver-ng/wiki/Home OpenMP, GCC AMD, Suse https://gcc.gnu.org/viewcvs/gcc/branches /hsa/gcc/README.hsa?view=markup&p athrev=207425
  • 23. SUMATRA PROJECT OVERVIEW  AMD/Oracle sponsored Open Source (OpenJDK) project  Targeted at Java 9 (2015 release)  Allows developers to efficiently represent data parallel algorithms in Java  Sumatra ‘repurposes’ Java 8’s multi-core Stream/Lambda API’s to enable both CPU or GPU computing  At runtime, Sumatra enabled Java Virtual Machine (JVM) will dispatch ‘selected’ constructs to available HSA enabled devices  Developers of Java libraries are already refactoring their library code to use these same constructs  So developers using existing libraries should see GPU acceleration without any code changes  http://openjdk.java.net/projects/sumatra/  https://wikis.oracle.com/display/HotSpotInternals/Sumatra  http://mail.openjdk.java.net/pipermail/sumatra-dev/ © Copyright 2014 HSA Foundation. All Rights Reserved Application.java Java Compiler GPUCPU Sumatra Enabled JVM Application GPU ISA Lambda/Stream API CPU ISA Application.clas s Development Runtime HSA Finalizer
  • 24. HSA OPEN SOURCE SOFTWARE  HSA will feature an open source linux execution and compilation stack  Allows a single shared implementation for many components  Enables university research and collaboration in all areas  Because it’s the right thing to do © Copyright 2014 HSA Foundation. All Rights Reserved Component Name IHV or Common Rationale HSA Bolt Library Common Enable understanding and debug HSAIL Code Generator Common Enable research LLVM Contributions Common Industry and academic collaboration HSAIL Assembler Common Enable understanding and debug HSA Runtime Common Standardize on a single runtime HSA Finalizer IHV Enable research and debug HSA Kernel Driver IHV For inclusion in linux distros
  • 25. WORKLOAD EXAMPLE SUFFIX ARRAY CONSTRUCTION CLOUD SERVER WORKLOAD
  • 26. SUFFIX ARRAYS  Suffix Arrays are a fundamental data structure  Designed for efficient searching of a large text  Quickly locate every occurrence of a substring S in a text T  Suffix Arrays are used to accelerate in-memory cloud workloads  Full text index search  Lossless data compression  Bio-informatics © Copyright 2014 HSA Foundation. All Rights Reserved
  • 27. ACCELERATED SUFFIX ARRAY CONSTRUCTION ON HSA © Copyright 2014 HSA Foundation. All Rights Reserved M. Deo, “Parallel Suffix Array Construction and Least Common Prefix for the GPU”, Submitted to ”Principles and Practice of Parallel Programming, (PPoPP’13)” February 2013. AMD A10 4600M APU with Radeon™ HD Graphics; CPU: 4 cores @ 2.3 MHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G, 6 compute units, 685MHz; 4GB RAM By offloading data parallel computations to GPU, HSA increases performance and reduces energy for Suffix Array Construction. By efficiently sharing data between CPU and GPU, HSA lets us move compute to data without penalty of intermediate copies. +5.8x -5x INCREASED PERFORMANCE DECREASED ENERGYMerge Sort::GPU Radix Sort::GPU Compute SA::CPU Lexical Rank::CPU Radix Sort::GPU Skew Algorithm for Compute SA
  • 28. EASE OF PROGRAMMING CODE COMPLEXITY VS. PERFORMANCE
  • 29. LINES-OF-CODE AND PERFORMANCE FOR DIFFERENT PROGRAMMING MODELS AMD A10-5800K APU with Radeon™ HD Graphics – CPU: 4 cores, 3800MHz (4200MHz Turbo); GPU: AMD Radeon HD 7660D, 6 compute units, 800MHz; 4GB RAM. Software – Windows 7 Professional SP1 (64-bit OS); AMD OpenCL™ 1.2 AMD-APP (937.2); Microsoft Visual Studio 11 Beta 0 50 100 150 200 250 300 350 LOC Copy-back Algorithm Launch Copy Compile Init Performance Serial CPU TBB Intrinsics+TBB OpenCL™-C OpenCL™ -C++ C++ AMP HSA Bolt Performance 35.00 30.00 25.00 20.00 15.00 10.00 5.00 0Copy- back Algorithm Launch Copy Compile Init. Copy-back Algorithm Launch Copy Compile Copy-back Algorithm Launch Algorithm Launch Algorithm Launch Algorithm Launch Algorithm Launch (Exemplary ISV “Hessian” Kernel) © Copyright 2014 HSA Foundation. All Rights Reserved
  • 30. THE HSA FUTURE  Architected heterogeneous processing on the SOC  Programming of accelerators becomes much easier  Accelerated software that runs across multiple hardware vendors  Scalability from smart phones to super computers on a common architecture  GPU acceleration of parallel processing is the initial target, with DSPs and other accelerators coming to the HSA system architecture model  Heterogeneous software ecosystem evolves at a much faster pace  Lower power, more capable devices in your hand, on the wall, in the cloud © Copyright 2014 HSA Foundation. All Rights Reserved
  • 32. HETEROGENEOUS SYSTEM ARCHITECTURE (HSA): HSAIL VIRTUAL PARALLEL ISA BEN SANDER, AMD
  • 33. TOPICS  Introduction and Motivation  HSAIL – what makes it special?  HSAIL Execution Model  How to program in HSAIL?  Conclusion © Copyright 2014 HSA Foundation. All Rights Reserved
  • 34. STATE OF GPU COMPUTING Today’s Challenges  Separate address spaces  Copies  Can’t share pointers  New language required for compute kernel  EX: OpenCL™ runtime API  Compute kernel compiled separately than host code Emerging Solution  HSA Hardware  Single address space  Coherent  Virtual  Fast access from all components  Can share pointers  Bring GPU computing to existing, popular, programming models  Single-source, fully supported by compiler  HSAIL compiler IR (Cross-platform!) • GPUs are fast and power efficient : high compute density per-mm and per-watt • But: Can be hard to program PCIe
  • 35. THE PORTABILITY CHALLENGE  CPU ISAs  ISA innovations added incrementally (ie NEON, AVX, etc)  ISA retains backwards-compatibility with previous generation  Two dominant instruction-set architectures: ARM and x86  GPU ISAs  Massive diversity of architectures in the market  Each vendor has own ISA - and often several in market at same time  No commitment (or attempt!) to provide any backwards compatibility  Traditionally graphics APIs (OpenGL, DirectX) provide necessary abstraction © Copyright 2014 HSA Foundation. All Rights Reserved
  • 36. HSAIL : WHAT MAKES IT SPECIAL?
  • 37. WHAT IS HSAIL?  Intermediate language for parallel compute in HSA  Generated by a “High Level Compiler” (GCC, LLVM, Java VM, etc)  Expresses parallel regions of code  Binary format of HSAIL is called “BRIG”  Goal: Bring parallel acceleration to mainstream programming languages © Copyright 2014 HSA Foundation. All Rights Reserved main() { … #pragma omp parallel for for (int i=0;i<N; i++) { } … } High-Level Compiler BRIG Finalizer Component ISA Host ISA
  • 38. KEY HSAIL FEATURES  Parallel  Shared virtual memory  Portable across vendors in HSA Foundation  Stable across multiple product generations  Consistent numerical results (IEEE-754 with defined min accuracy)  Fast, robust, simple finalization step (no monthly updates)  Good performance (little need to write in ISA)  Supports all of OpenCL™  Supports Java, C++, and other languages as well © Copyright 2014 HSA Foundation. All Rights Reserved
  • 39. HSAIL INSTRUCTION SET - OVERVIEW  Similar to assembly language for a RISC CPU  Load-store architecture  Destination register first, then source registers  140 opcodes (Java™ bytecode has 200)  Floating point (single, double, half (f16))  Integer (32-bit, 64-bit)  Some packed operations  Branches  Function calls  Platform Atomic Operations: and, or, xor, exch, add, sub, inc, dec, max, min, cas  Synchronize host CPU and HSA Component!  Text and Binary formats (“BRIG”) ld_global_u64 $d0, [$d6 + 120] ; $d0= load($d6+120) add_u64 $d1, $d0, 24 ; $d1= $d2+24 © Copyright 2014 HSA Foundation. All Rights Reserved
  • 40. SEGMENTS AND MEMORY (1/2)  7 segments of memory  global, readonly, group, spill, private, arg, kernarg  Memory instructions can (optionally) specify a segment  Control data sharing properties and communicate intent  Global Segment  Visible to all HSA agents (including host CPU)  Group Segment  Provides high-performance memory shared in the work-group.  Group memory can be read and written by any work-item in the work-group  HSAIL provides sync operations to control visibility of group memory ld_global_u64 $d0,[$d6] ld_group_u64 $d0,[$d6+24] st_spill_f32 $s1,[$d6+4] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 41. SEGMENTS AND MEMORY (2/2)  Spill, Private, Arg Segments  Represent different regions of a per-work-item stack  Typically generated by compiler, not specified by programmer  Compiler can use these to convey intent – ie spills  Kernarg Segment  Programmer writes kernarg segment to pass arguments to a kernel  Read-Only Segment  Remains constant during execution of kernel © Copyright 2014 HSA Foundation. All Rights Reserved
  • 42. FLAT ADDRESSING  Each segment mapped into virtual address space  Flat addresses can map to segments based on virtual address  Instructions with no explicit segment use flat addressing  Very useful for high-level language support (ie classes, libraries)  Aligns well with OpenCL 2.0 “generic” addressing feature ld_global_u64 $d6, [%_arg0] ; global ld_u64 $d0,[$d6+24] ; flat © Copyright 2014 HSA Foundation. All Rights Reserved
  • 43. REGISTERS  Four classes of registers:  S: 32-bit, Single-precision FP or Int  D: 64-bit, Double-precision FP or Long Int  Q: 128-bit, Packed data.  C: 1-bit, Control Registers (Compares)  Fixed number of registers  S, D, Q share a single pool of resources  S + 2*D + 4*Q <= 128  Up to 128 S or 64 D or 32 Q (or a blend)  Register allocation done in high-level compiler  Finalizer doesn’t perform expensive register allocation c0 c1 c2 c3 c4 c5 c6 c7 s0 d0 q0 s1 s2 d1 s3 s4 d2 q1 s5 s6 d3 s7 s8 d4 q2 s9 s10 d5 s11 … s120 d60 q30 s121 s122 d61 s123 s124 d62 q31 s125 s126 d63 s127 © Copyright 2014 HSA Foundation. All Rights Reserved
  • 44. SIMT EXECUTION MODEL  HSAIL Presents a “SIMT” execution model to the programmer  “Single Instruction, Multiple Thread”  Programmer writes program for a single thread of execution  Each work-item appears to have its own program counter  Branch instructions look natural  Hardware Implementation  Most hardware uses SIMD (Single-Instruction Multiple Data) vectors for efficiency  Actually one program counter for the entire SIMD instruction  Branches implemented with predication  SIMT Advantages  Easier to program (branch code in particular)  Natural path for mainstream programming models and existing compilers  Scales across a wide variety of hardware (programmer doesn’t see vector width)  Cross-lane operations available for those who want peak performance © Copyright 2014 HSA Foundation. All Rights Reserved
  • 45. WAVEFRONTS  Hardware SIMD vector, composed of 1, 2, 4, 8, 16, 32, 64, 128, or 256 “lanes”  Lanes in wavefront can be “active” or “inactive”  Inactive lanes consume hardware resources but don’t do useful work  Tradeoffs  “Wavefront-aware” programming can be useful for peak performance  But results in less portable code (since wavefront width is encoded in algorithm) if (cond) { operationA; // cond=True lanes active here } else { operationB; // cond=False lanes active here } © Copyright 2014 HSA Foundation. All Rights Reserved
  • 46. CROSS-LANE OPERATIONS  Example HSAIL cross-lane operation: “activelaneid”  Dest set to count of earlier work-items that are active for this instruction  Useful for compaction algorithms  Example HSAIL cross-lane operation: “activelaneshuffle”  Each workitem reads value from another lane in the wavefront  Supports selection of “identity” element for inactive lanes  Useful for wavefront-level reductionsactivelaneshuffle_b32 $s0, $s1, $s2, 0, 0 // s0 = dest, s1= source, s2=lane select, no identity activelaneid_u32 $s0 © Copyright 2014 HSA Foundation. All Rights Reserved
  • 47. HSAIL MODES  Working group strived to limit optional modes and features in HSAIL  Minimize differences between HSA target machines  Better for compiler vendors and application developers  Two modes survived  Machine Models  Small: 32-bit pointers, 32-bit data  Large: 64-bit pointers, 32-bit or 64-bit data  Vendors can support one or both models  “Base” and “Full” Profiles  Two sets of requirements for FP accuracy, rounding, exception reporting, hard pre-emption © Copyright 2014 HSA Foundation. All Rights Reserved
  • 48. HSA PROFILES Feature Base Full Addressing Modes Small, Large Small, Large All 32-bit HSAIL operations according to the declared profile Yes Yes F16 support (IEEE 754 or better) Yes Yes F64 support No Yes Precision for add/sub/mul 1/2 ULP 1/2 ULP Precision for div 2.5 ULP 1/2 ULP Precision for sqrt 1 ULP 1/2 ULP HSAIL Rounding: Near Yes Yes HSAIL Rounding: Up / Down / Zero No Yes Subnormal floating-point Flush-to-zero Supported Propagate NaN Payloads No Yes FMA Yes Yes Arithmetic Exception reporting None DETECT or BREAK Debug trap Yes Yes Hard Preemption No Yes © Copyright 2014 HSA Foundation. All Rights Reserved
  • 49. HSA PARALLEL EXECUTION MODEL © Copyright 2014 HSA Foundation. All Rights Reserved
  • 50. HSA PARALLEL EXECUTION MODEL Basic Idea: Programmer supplies an HSAIL “kernel” that is run on each work-item. Kernel is written as a single thread of execution. Programmer specifies grid dimensions (scope of problem) when launching the kernel. Each work-item has a unique coordinate in the grid. Programmer optionally specifies work- group dimensions (for optimized communication). © Copyright 2014 HSA Foundation. All Rights Reserved
• 51. CONVOLUTION / SOBEL EDGE FILTER Gx = [ -1 0 +1 ] [ -2 0 +2 ] [ -1 0 +1 ] Gy = [ -1 -2 -1 ] [ 0 0 0 ] [ +1 +2 +1 ] G = sqrt(Gx² + Gy²) © Copyright 2014 HSA Foundation. All Rights Reserved
• 52. CONVOLUTION / SOBEL EDGE FILTER Gx = [ -1 0 +1 ] [ -2 0 +2 ] [ -1 0 +1 ] Gy = [ -1 -2 -1 ] [ 0 0 0 ] [ +1 +2 +1 ] G = sqrt(Gx² + Gy²) 2D grid workitem kernel © Copyright 2014 HSA Foundation. All Rights Reserved
• 53. CONVOLUTION / SOBEL EDGE FILTER Gx = [ -1 0 +1 ] [ -2 0 +2 ] [ -1 0 +1 ] Gy = [ -1 -2 -1 ] [ 0 0 0 ] [ +1 +2 +1 ] G = sqrt(Gx² + Gy²) 2D work-group 2D grid workitem kernel © Copyright 2014 HSA Foundation. All Rights Reserved
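To make the per-work-item view concrete, here is a sketch in plain C of the kernel body a single work-item would execute at grid coordinate (x, y); the function name, parameter list, and border handling are illustrative choices, not taken from the slides.

#include <math.h>
#include <stdint.h>

/* Sobel filter body for one work-item at grid coordinate (x, y).
 * In HSAIL/OpenCL the coordinates would come from work-item builtins;
 * here they are plain parameters. Border pixels are left untouched. */
void sobel_workitem(const uint8_t *in, uint8_t *out,
                    int width, int height, int x, int y)
{
    if (x < 1 || y < 1 || x >= width - 1 || y >= height - 1)
        return; /* skip the one-pixel border */

    #define P(dx, dy) ((float)in[(y + (dy)) * width + (x + (dx))])
    float gx = -P(-1,-1) + P(1,-1) - 2*P(-1,0) + 2*P(1,0) - P(-1,1) + P(1,1);
    float gy = -P(-1,-1) - 2*P(0,-1) - P(1,-1) + P(-1,1) + 2*P(0,1) + P(1,1);
    #undef P

    float g = sqrtf(gx * gx + gy * gy);
    out[y * width + x] = (uint8_t)(g > 255.0f ? 255.0f : g);
}

Dispatching the kernel over a 2D grid corresponds to calling this once per (x, y); the work-group shape only affects which work-items can synchronize and share group memory cheaply.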
  • 54. HOW TO PROGRAM HSA? WHAT DO I TYPE? © Copyright 2014 HSA Foundation. All Rights Reserved
  • 55. HSA PROGRAMMING MODELS : CORE PRINCIPLES  Single source  Host and device code side-by-side in same source file  Written in same programming language  Single unified coherent address space  Freely share pointers between host and device  Similar memory model as multi-core CPU  Parallel regions identified with existing language syntax  Typically same syntax used for multi-core CPU  HSAIL is the compiler IR that supports these programming models © Copyright 2014 HSA Foundation. All Rights Reserved
• 56. GCC OPENMP : COMPILATION FLOW  SUSE GCC Project  Adding HSAIL code generator to GCC compiler infrastructure  Supports OpenMP 3.1 syntax  No data movement directives required!
main() {
  … // Host code.
  #pragma omp parallel for
  for (int i = 0; i < N; i++) {
    C[i] = A[i] + B[i];
  }
  …
}
GCC OpenMP Compiler -> BRIG -> Finalizer -> Component ISA; host code -> Host ISA © Copyright 2014 HSA Foundation. All Rights Reserved
• 57. GCC OpenMP flow
Compile time (Application / Compiler): a C/C++/Fortran OpenMP application, e.g. #pragma omp for for (j = 0; j < n; j++) { b[j] = a[j]; } The GNU Compiler (GCC) compiles the host code and emits runtime calls with kernel name, parameters, and launch attributes; it lowers the OpenMP directives, converts GIMPLE to BRIG, and embeds the BRIG into the host code.
Run time: the pragmas map to calls into the HSA Runtime, which dispatches the kernel to the GPU; kernels are finalized from BRIG->ISA once and cached.
© Copyright 2014 HSA Foundation. All Rights Reserved
• 58. MCW C++AMP : COMPILATION FLOW  C++AMP : Single-source C++ template parallel programming model  MCW compiler based on CLANG/LLVM  Open-source and runs on Linux  Leverages the open-source LLVM->HSAIL code generator
main() {
  …
  parallel_for_each(grid<1>(extent<256>(…)
  …
}
C++AMP Compiler -> BRIG -> Finalizer -> Component ISA; host code -> Host ISA © Copyright 2014 HSA Foundation. All Rights Reserved
• 59. JAVA: RUNTIME FLOW © Copyright 2014 HSA Foundation. All Rights Reserved JAVA 8 – HSA ENABLED APARAPI  Java 8 brings the Stream + Lambda API. ‒ A more natural way of expressing data parallel algorithms ‒ Initially targeted at multi-core.  APARAPI will: ‒ Support Java 8 Lambdas ‒ Dispatch code to HSA-enabled devices at runtime via HSAIL (stack: Java Application -> APARAPI + Lambda API -> JVM -> HSA Finalizer & Runtime -> CPU / GPU) Future Java – HSA ENABLED JAVA (SUMATRA)  Adds native GPU acceleration to the Java Virtual Machine (JVM)  Developer uses the JDK Lambda, Stream API  JVM uses the GRAAL compiler to generate HSAIL (stack: Java Application -> Java JDK Stream + Lambda API -> JVM with GRAAL JIT backend -> HSA Finalizer & Runtime -> CPU / GPU)
• 60. AN EXAMPLE (IN JAVA 8) © Copyright 2014 HSA Foundation. All Rights Reserved
// Example computes the percentage of total scores achieved by each player on a team.
class Player {
  private Team team; // Note: Reference to the parent Team.
  private int scores;
  private float pctOfTeamScores;
  public Team getTeam() { return team; }
  public int getScores() { return scores; }
  public void setPctOfTeamScores(float pct) { pctOfTeamScores = pct; }
}; // “Team” class not shown

// Assume “allPlayers” is an initialized array of Players.
Arrays.stream(allPlayers)  // wrap the array in a stream
  .parallel()              // developer indication that lambda is thread-safe
  .forEach(p -> {
    int teamScores = p.getTeam().getScores();
    float pctOfTeamScores = (float) p.getScores() / (float) teamScores;
    p.setPctOfTeamScores(pctOfTeamScores);
  });
• 61. HSAIL CODE EXAMPLE © Copyright 2014 HSA Foundation. All Rights Reserved
version 0:95: $full : $large;
// static method HotSpotMethod<Main.lambda$2(Player)>
kernel &run (
  kernarg_u64 %_arg0 // Kernel signature for lambda method
) {
  ld_kernarg_u64 $d6, [%_arg0];    // Move arg to an HSAIL register
  workitemabsid_u32 $s2, 0;        // Read the work-item global “X” coord
  cvt_u64_s32 $d2, $s2;            // Convert X gid to long
  mul_u64 $d2, $d2, 8;             // Adjust index for sizeof ref
  add_u64 $d2, $d2, 24;            // Adjust for actual elements start
  add_u64 $d2, $d2, $d6;           // Add to array ref ptr
  ld_global_u64 $d6, [$d2];        // Load from array element into reg
@L0:
  ld_global_u64 $d0, [$d6 + 120];  // p.getTeam()
  mov_b64 $d3, $d0;
  ld_global_s32 $s3, [$d6 + 40];   // p.getScores()
  cvt_f32_s32 $s16, $s3;
  ld_global_s32 $s0, [$d0 + 24];   // Team getScores()
  cvt_f32_s32 $s17, $s0;
  div_f32 $s16, $s16, $s17;        // p.getScores()/teamScores
  st_global_f32 $s16, [$d6 + 100]; // p.setPctOfTeamScores()
  ret;
};
  • 62. HOW TO PROGRAM HSA? OTHER PROGRAMMING TOOLS © Copyright 2014 HSA Foundation. All Rights Reserved
  • 63. HSAIL ASSEMBLER kernel &run (kernarg_u64 %_arg0) { ld_kernarg_u64 $d6, [%_arg0]; workitemabsid_u32 $s2, 0; cvt_u64_s32 $d2, $s2; mul_u64 $d2, $d2, 8; add_u64 $d2, $d2, 24; add_u64 $d2, $d2, $d6; ld_global_u64 $d6, [$d2]; . . . HSAIL Assembler BRIG Finalizer Machine ISA • HSAIL has a text format and an assembler © Copyright 2014 HSA Foundation. All Rights Reserved
  • 64. OPENCL™ OFFLINE COMPILER (CLOC) __kernel void vec_add( __global const float *a, __global const float *b, __global float *c, const unsigned int n) { int id = get_global_id(0); // Bounds check if (id < n) c[id] = a[id] + b[id]; } CLOC BRIG Finalizer Machine ISA •OpenCL split-source model cleanly isolates kernel •Can express many HSAIL features in OpenCL Kernel Language •Higher productivity than writing in HSAIL assembly •Can dispatch kernel directly with HSAIL Runtime (lower-level access to hardware) •Or use CLOC+OKRA Runtime for approachable “fits-on-a-slide” GPU programming model © Copyright 2014 HSA Foundation. All Rights Reserved
  • 65. KEY TAKEAWAYS  HSAIL  Thin, robust, fast finalizer  Portable (multiple HW vendors and parallel architectures)  Supports shared virtual memory and platform atomics  HSA brings GPU computing to mainstream programming models  Shared and coherent memory bridges “faraway accelerator” gap  HSAIL provides the common IL for high-level languages to benefit from parallel computing  Languages and Compilers  HSAIL support in GCC, LLVM, Java JVM  Leverage same language syntax designed for multi-core CPUs  Can use pointer-containing data structures © Copyright 2014 HSA Foundation. All Rights Reserved
• 66. HSA RUNTIME YEH-CHING CHUNG, NATIONAL TSING HUA UNIVERSITY
  • 67. OUTLINE  Introduction  HSA Core Runtime API (Pre-release 1.0 provisional)  Initialization and Shut Down  Notifications (Synchronous/Asynchronous)  Agent Information  Signals and Synchronization (Memory-Based)  Queues and Architected Dispatch  Summary © Copyright 2014 HSA Foundation. All Rights Reserved
• 68. INTRODUCTION (1)  The HSA core runtime is a thin, user-mode API that provides the interface necessary for the host to launch compute kernels to the available HSA components.  The overall goal of the HSA core runtime design is to provide a high-performance dispatch mechanism that is portable across multiple HSA vendor architectures.  The dispatch mechanism differentiates the HSA runtime from other language runtimes by architected argument setting and kernel launching at the hardware and specification level.  The HSA core runtime API is standard across all HSA vendors, so languages which use the HSA runtime can run on any vendor's platform that supports the API.  The implementation of the HSA runtime may include kernel-level components (required for some hardware, e.g. AMD Kaveri) or may be entirely user-space (for example, simulators or CPU implementations). © Copyright 2014 HSA Foundation. All Rights Reserved
• 69. INTRODUCTION (2)  The software architecture stack without the HSA runtime: each language runtime (OpenCL, Java, OpenMP, DSL, …) sits on a separate per-vendor driver for that vendor's components.  The software architecture stack with the HSA runtime: the same language runtimes sit on a common HSA runtime, with each vendor supplying an HSA finalizer for its components. © Copyright 2014 HSA Foundation. All Rights Reserved
• 70. INTRODUCTION (3)  Program flow of an agent, with OpenCL runtime phases mapped to their HSA runtime counterparts: Start Program; Platform, Device, and Context Initialization -> HSA Runtime Initialization and Topology Discovery; Build Kernel -> HSAIL Finalization and Linking; SVM Allocation and Kernel Arguments Setting -> HSA Memory Allocation; Command Queue -> Enqueue Dispatch Packet; Resource Deallocation -> HSA Runtime Close; Exit Program © Copyright 2014 HSA Foundation. All Rights Reserved
• 71. INTRODUCTION (4)  HSA Platform System Architecture Specification support  Runtime initialization and shutdown  Notifications (synchronous/asynchronous)  Agent information  Signals and synchronization (memory-based)  Queues and architected dispatch  Memory management  HSAIL support  Finalization, linking, and debugging  Image and sampler support © Copyright 2014 HSA Foundation. All Rights Reserved
  • 73. OUTLINE  Runtime Initialization API  hsa_init  Runtime Shut Down API  hsa_shut_down  Examples © Copyright 2014 HSA Foundation. All Rights Reserved
  • 74. HSA RUNTIME INITIALIZATION  When the API is invoked for the first time in a given process, a runtime instance is created.  A typical runtime instance may contain information of platform, topology, reference count, queues, signals, etc.  The API can be called multiple times by applications  Only a single runtime instance will exist for a given process.  Whenever the API is invoked, the reference count is increased by one. © Copyright 2014 HSA Foundation. All Rights Reserved
• 75. HSA RUNTIME SHUT DOWN  When the API is invoked, the reference count is decreased by one.  When the reference count reaches zero  All the resources associated with the runtime instance (queues, signals, topology information, etc.) are considered invalid and any attempt to reference them in subsequent API calls results in undefined behavior.  The user may call hsa_init to initialize the HSA runtime again.  The HSA runtime may release the resources associated with it. © Copyright 2014 HSA Foundation. All Rights Reserved
• 76. EXAMPLE – RUNTIME INITIALIZATION (1) (code screenshot: the data structure for the runtime instance; if hsa_init is called more than once, the ref_count is increased by one) © Copyright 2014 HSA Foundation. All Rights Reserved
• 77. EXAMPLE – RUNTIME INITIALIZATION (2) (code screenshot: when hsa_init is called the first time, allocate resources and set the reference count; get the number of HSA agents; create an empty agent list; initialize the agents; create the topology table; if initialization fails, release the resources. A C sketch of this flow follows below.) © Copyright 2014 HSA Foundation. All Rights Reserved
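The code screenshots for this example did not survive extraction; the following is a minimal C sketch of the flow the annotations describe, using hypothetical names (runtime_instance_t, my_hsa_init). It is not the actual reference implementation.

#include <stdlib.h>

/* Hypothetical per-process runtime-instance bookkeeping. */
typedef struct {
    int   ref_count;   /* incremented on every init call              */
    int   num_agents;  /* number of HSA agents discovered             */
    void *agent_list;  /* per-agent information, built on first init  */
    void *topology;    /* topology table                              */
} runtime_instance_t;

static runtime_instance_t *g_rt; /* single instance per process */

int my_hsa_init(void)            /* stand-in for hsa_init() */
{
    if (g_rt) {                  /* already initialized: just count */
        g_rt->ref_count++;
        return 0;
    }
    g_rt = calloc(1, sizeof(*g_rt));
    if (!g_rt)
        return -1;
    g_rt->ref_count = 1;
    /* first call: discover agents, build the agent list and the
     * topology table; on failure, release everything created so far */
    return 0;
}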
• 78. EXAMPLE - RUNTIME INSTANCE (1) Platform Name: Generic (Agent: 2, Memory: 1, Cache: 1)
Agent-0: node_id 0, id 0, type CPU, vendor Generic, name Generic, wavefront_size 0, queue_size 200, group_memory 0, fbarrier_max_count 1, is_pic_supported 0, …
Agent-1: node_id 0, id 0, type GPU, vendor Generic, name Generic, wavefront_size 64, queue_size 200, group_memory 64, fbarrier_max_count 1, is_pic_supported 1, …
Memory: node_id 0, id 0, segment_type 111111, address_base 0x0001, size 2048 MB, peak_bandwidth 6553.6 mbps
Cache: node_id 0, id 0, levels 1, associativity 1, cache size 64KB, cache line size 4, is_inclusive 1
© Copyright 2014 HSA Foundation. All Rights Reserved
• 79. EXAMPLE - RUNTIME INSTANCE (2) Platform header: *base_address = 0x00001, size = 248, system_timestamp_frequency_mhz = 200, signal_maximum_wait = 1/200, *node_id (no_nodes = 1), *agent_list (no_agent = 2), *memory_descriptor_list (no_memory_descriptor = 1), *cache_descriptor_list (no_cache_descriptor = 1)
Agent-0: node_id = 0, id = 0, agent_type = 1 (CPU), vendor[16] = Generic, name[16] = Generic, wavefront_size = 0, queue_size = 200, group_memory_size_bytes = 0, fbarrier_max_count = 1, is_pic_supported = 0, …
Agent-1: node_id = 0, id = 0, agent_type = 2 (GPU), vendor[16] = Generic, name[16] = Generic, wavefront_size = 64, queue_size = 200, group_memory_size_bytes = 64, fbarrier_max_count = 1, is_pic_supported = 1, …
Memory: node_id = 0, id = 0, supported_segment_type_mask = 111111, virtual_address_base = 0x0001, size_in_bytes = 2048MB, peak_bandwidth_mbps = 6553.6
Cache: node_id = 0, id = 0, levels = 1, associativity = 1, cache_size = 64KB, cache_line_size = 4, is_inclusive = 1
© Copyright 2014 HSA Foundation. All Rights Reserved
• 80. EXAMPLE – RUNTIME SHUT DOWN © Copyright 2014 HSA Foundation. All Rights Reserved (code screenshot: if the decremented ref_count reaches zero, free the lists; otherwise just decrease the ref_count by one. A C sketch follows below.)
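Continuing the hypothetical sketch above, a matching shutdown path mirroring the annotation; again illustrative only, not the reference implementation.

int my_hsa_shut_down(void)       /* stand-in for hsa_shut_down() */
{
    if (!g_rt)
        return -1;               /* runtime was never initialized */
    if (--g_rt->ref_count > 0)
        return 0;                /* other users of the instance remain */
    /* last reference gone: queues, signals and topology data become
     * invalid; hsa_init() may be called again to re-create them */
    free(g_rt->agent_list);
    free(g_rt->topology);
    free(g_rt);
    g_rt = NULL;
    return 0;
}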
  • 82. OUTLINE  Synchronous Notifications  hsa_status_t  hsa_status_string  Asynchronous Notifications  Example © Copyright 2014 HSA Foundation. All Rights Reserved
• 83. SYNCHRONOUS NOTIFICATIONS  Notifications (errors, events, etc.) reported by the runtime can be synchronous or asynchronous  The HSA runtime uses the return values of API functions to pass notifications synchronously.  A status code is defined as an enumeration, hsa_status_t, to capture the return value of any API function that has been executed, except accessors/mutators.  The notification is a status code that indicates success or error.  Success is represented by HSA_STATUS_SUCCESS, which is equivalent to zero.  An error status is assigned a positive integer and its identifier starts with the HSA_STATUS_ERROR prefix.  The status code can help to determine the cause of an unsuccessful execution. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 84. STATUS CODE QUERY  Query additional information on status code  Parameters  status (input): Status code that the user is seeking more information on  status_string (output): An ISO/IEC 646 encoded English language string that potentially describes the error status © Copyright 2014 HSA Foundation. All Rights Reserved
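A usage sketch of the status-string query in C; hsa_status_string is named in the outline, but the prototype below is paraphrased from the provisional spec and may not match a shipped header.

#include <stdio.h>

/* Assumed provisional-style declarations (illustrative): */
typedef int hsa_status_t;                 /* HSA_STATUS_SUCCESS == 0 */
hsa_status_t hsa_status_string(hsa_status_t status,
                               const char **status_string);

/* Print a human-readable description of a failed API call. */
void report(hsa_status_t status)
{
    if (status == 0)                      /* HSA_STATUS_SUCCESS */
        return;
    const char *msg = NULL;
    if (hsa_status_string(status, &msg) == 0 && msg)
        fprintf(stderr, "HSA error %d: %s\n", status, msg);
    else
        fprintf(stderr, "HSA error %d (no description)\n", status);
}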
• 85. ASYNCHRONOUS NOTIFICATIONS  The runtime passes asynchronous notifications by calling user-defined callbacks.  For instance, queues are a common source of asynchronous events because the tasks queued by an application are asynchronously consumed by the packet processor. Callbacks are associated with queues when they are created. When the runtime detects an error in a queue, it invokes the callback associated with that queue and passes it an error flag (indicating what happened) and a pointer to the erroneous queue.  The HSA runtime does not implement any default callbacks.  Be careful with blocking functions inside a callback implementation: a callback that does not return can leave the runtime in an undefined state. © Copyright 2014 HSA Foundation. All Rights Reserved
• 86. EXAMPLE - CALLBACK (code screenshot: pass the callback function when creating the queue; if the queue is empty, set the event and invoke the callback. A C sketch follows below.) © Copyright 2014 HSA Foundation. All Rights Reserved
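A C sketch of associating an error callback with a queue at creation time; the hsa_queue_create prototype here is a simplified stand-in (the provisional signature takes more parameters), and on_queue_error is a hypothetical user callback.

#include <stdio.h>

/* Simplified stand-ins; the real hsa_queue_create also takes the
 * component, queue size, queue type, and more. */
typedef struct hsa_queue_s hsa_queue_t;
typedef int hsa_status_t;
typedef void (*hsa_queue_error_cb)(hsa_status_t status, hsa_queue_t *queue);

hsa_status_t hsa_queue_create(/* component, size, type, ..., */
                              hsa_queue_error_cb callback,
                              hsa_queue_t **queue);

/* User-defined callback: invoked by the runtime when it detects an
 * error in the queue. It should not block indefinitely. */
static void on_queue_error(hsa_status_t status, hsa_queue_t *queue)
{
    fprintf(stderr, "queue %p reported error %d\n", (void *)queue, status);
}

/* At creation time: hsa_queue_create(..., on_queue_error, &queue); */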
  • 88. OUTLINE  Agent information  hsa_node_t  hsa_agent_t  hsa_agent_info_t  hsa_component_feature_t  Agent Information manipulation APIs  hsa_iterate_agents  hsa_agent_get_info  Example © Copyright 2014 HSA Foundation. All Rights Reserved
• 89. INTRODUCTION  The runtime exposes a list of agents that are available in the system.  An HSA agent is a hardware component that participates in the HSA memory model.  An HSA agent can submit AQL packets for execution.  An HSA agent may also be an HSA component, but is not required to be one; a system may include HSA agents that are neither an HSA component nor a host CPU.  HSA agents are defined as opaque handles of type hsa_agent_t.  The HSA runtime provides APIs for applications to traverse the list of available agents and query attributes of a particular agent. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 90. AGENT INFORMATION (1)  Opaque agent handle  Opaque NUMA node handle  An HSA memory node is a node that delineates a set of system components (host CPUs and HSA Components) with “local” access to a set of memory resources attached to the node's memory controller and appropriate HSA-compliant access attributes. © Copyright 2014 HSA Foundation. All Rights Reserved
• 91. AGENT INFORMATION (2)  Component features  An HSA component is a hardware or software component that can be a target of AQL packets and conforms to the HSA memory model.  Values  HSA_COMPONENT_FEATURE_NONE = 0  No component capabilities. The device is an agent, but not a component.  HSA_COMPONENT_FEATURE_BASIC = 1  The component supports the HSAIL instruction set and all the AQL packet types except agent dispatch.  HSA_COMPONENT_FEATURE_ALL = 2  The component supports the HSAIL instruction set and all the AQL packet types. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 92. AGENT INFORMATION (3)  Agent attributes  Values  HSA_AGENT_INFO_MAX_GRID_DIM  HSA_AGENT_INFO_MAX_WORKGROUP_DIM  HSA_AGENT_INFO_QUEUE_MAX_PACKETS  HSA_AGENT_INFO_CLOCK  HSA_AGENT_INFO_CLOCK_FREQUENCY  HSA_AGENT_INFO_MAX_SIGNAL_WAIT  HSA_AGENT_INFO_NAME  HSA_AGENT_INFO_NODE  HSA_AGENT_INFO_COMPONENT_FEATURES  HSA_AGENT_INFO_VENDOR_NAME  HSA_AGENT_INFO_WAVEFRONT_SIZE  HSA_AGENT_INFO_CACHE_SIZE © Copyright 2014 HSA Foundation. All Rights Reserved
  • 93. AGENT INFORMATION MANIPULATION (1)  Iterate over the available agents, and invoke an application-defined callback on every iteration  If callback returns a status other than HSA_STATUS_SUCCESS for a particular iteration, the traversal stops and the function returns that status value.  Parameters  callback (input): Callback to be invoked once per agent  data (input): Application data that is passed to callback on every iteration. Can be NULL. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 94. AGENT INFORMATION MANIPULATION (2)  Get the current value of an attribute for a given agent  Parameters  agent (input): A valid agent  attribute (input): Attribute to query  value (output): Pointer to a user-allocated buffer where to store the value of the attribute. If the buffer passed by the application is not large enough to hold the value of attribute, the behavior is undefined. © Copyright 2014 HSA Foundation. All Rights Reserved
• 95. EXAMPLE - AGENT ATTRIBUTE QUERY (code screenshot: get the agent handle of Agent 0, then copy the agent attribute information. A C sketch combining the two APIs follows below.) © Copyright 2014 HSA Foundation. All Rights Reserved
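The example's code screenshot is gone; below is a C sketch combining hsa_iterate_agents and hsa_agent_get_info as described on the previous slides. The typedefs and the attribute constant are illustrative stand-ins, not the provisional header.

#include <stdint.h>
#include <stdio.h>

/* Illustrative stand-ins for provisional-spec declarations: */
typedef uint64_t hsa_agent_t;        /* opaque agent handle        */
typedef int      hsa_status_t;       /* HSA_STATUS_SUCCESS == 0    */
typedef int      hsa_agent_info_t;
#define HSA_AGENT_INFO_WAVEFRONT_SIZE 0  /* placeholder enum value */

hsa_status_t hsa_iterate_agents(
    hsa_status_t (*callback)(hsa_agent_t agent, void *data), void *data);
hsa_status_t hsa_agent_get_info(hsa_agent_t agent,
                                hsa_agent_info_t attribute, void *value);

/* Callback: remember the first agent seen; returning a non-success
 * status stops the traversal, as the slides describe. */
static hsa_status_t pick_first(hsa_agent_t agent, void *data)
{
    *(hsa_agent_t *)data = agent;
    return 1;
}

void query_wavefront_size(void)
{
    hsa_agent_t agent = 0;
    hsa_iterate_agents(pick_first, &agent);

    uint32_t wavefront_size = 0;     /* buffer must be large enough */
    hsa_agent_get_info(agent, HSA_AGENT_INFO_WAVEFRONT_SIZE,
                       &wavefront_size);
    printf("wavefront size: %u\n", wavefront_size);
}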
• 97. OUTLINE  Signal  Signal manipulation API  Create/Destroy  Query  Send  Atomic operations  Signal wait  Wait timeout  Signal condition  Example © Copyright 2014 HSA Foundation. All Rights Reserved
  • 98. SIGNAL (1)  HSA agents can communicate with each other by using coherent global memory, or by using signals.  A signal is represented by an opaque signal handle  A signal carries a value, which can be updated or conditionally waited upon via an API call or HSAIL instruction.  The value occupies four or eight bytes depending on the machine model in use. © Copyright 2014 HSA Foundation. All Rights Reserved
• 99. SIGNAL (2)  Updating the value of a signal is equivalent to sending the signal.  In addition to the update (store) of a signal, the API for sending signals must support other atomic operations with specific memory-order semantics  Atomic operations: AND, OR, XOR, Add, Subtract, Exchange, and CAS  Memory-order semantics: Release and Relaxed © Copyright 2014 HSA Foundation. All Rights Reserved
• 100. SIGNAL CREATE/DESTROY  Create a signal  Parameters  initial_value (input): Initial value of the signal.  signal_handle (output): Signal handle.  Destroy a signal previously created by hsa_signal_create  Parameter  signal_handle (input): Signal handle. © Copyright 2014 HSA Foundation. All Rights Reserved
• 101. SIGNAL LOAD/STORE  Atomically read the current signal value with acquire semantics  Atomically read the current signal value with relaxed semantics  Send and atomically set the value of a signal with release semantics  Send and atomically set the value of a signal with relaxed semantics © Copyright 2014 HSA Foundation. All Rights Reserved
• 102. SIGNAL ADD/SUBTRACT  Send and atomically increment the value of a signal by a given amount with release semantics  Send and atomically increment the value of a signal by a given amount with relaxed semantics  Send and atomically decrement the value of a signal by a given amount with release semantics  Send and atomically decrement the value of a signal by a given amount with relaxed semantics © Copyright 2014 HSA Foundation. All Rights Reserved
• 103. SIGNAL AND (OR, XOR)/EXCHANGE  Send and atomically perform a logical AND operation on the value of a signal and a given value with release semantics  Send and atomically perform a logical AND operation on the value of a signal and a given value with relaxed semantics  Send and atomically set the value of a signal and return its previous value with release semantics  Send and atomically set the value of a signal and return its previous value with relaxed semantics © Copyright 2014 HSA Foundation. All Rights Reserved
  • 104. SIGNAL WAIT (1)  The application may wait on a signal, with a condition specifying the terms of wait.  Signal wait condition operator  Values  HSA_EQ: The two operands are equal.  HSA_NE: The two operands are not equal.  HSA_LT: The first operand is less than the second operand.  HSA_GTE: The first operand is greater than or equal to the second operand. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 105. SIGNAL WAIT (2)  The wait can be done either in the HSA component via an HSAIL wait instruction or via a runtime API defined here.  Waiting on a signal returns the current value at the opaque signal object;  The wait may have a runtime defined timeout which indicates the maximum amount of time that an implementation can spend waiting.  The signal infrastructure allows for multiple senders/waiters on a single signal.  Wait reads the value, hence acquire synchronizations may be applied. © Copyright 2014 HSA Foundation. All Rights Reserved
• 106. SIGNAL WAIT (3)  Signal wait  Parameters  signal_handle (input): A signal handle  condition (input): Condition used to compare the passed and signal values  compare_value (input): Value to compare with  return_value (output): A pointer where the current signal value must be read into © Copyright 2014 HSA Foundation. All Rights Reserved
• 107. SIGNAL WAIT (4)  Signal wait with timeout  Parameters  signal_handle (input): A signal handle  timeout (input): Maximum wait duration (a value of zero indicates no maximum)  long_wait (input): Hint indicating that the signal value is not expected to meet the given condition in a short period of time. The HSA runtime may use this hint to optimize the wait implementation.  condition (input): Condition used to compare the passed and signal values  compare_value (input): Value to compare with  return_value (output): A pointer where the current signal value must be read into © Copyright 2014 HSA Foundation. All Rights Reserved
• 108. EXAMPLE – SIGNAL WAIT (1) Timeline for two threads sharing a signal (initial value = 0): thread_1 calls hsa_signal_wait_timeout_acquire (condition: value == 2) and is blocked. thread_2 calls hsa_signal_add_relaxed (value = value + 3), so value = 3. thread_2 then calls hsa_signal_subtract_relaxed (value = value - 1), so value = 2. The condition is satisfied: the wait returns the signal value and the execution of thread_1 continues. © Copyright 2014 HSA Foundation. All Rights Reserved
• 109. EXAMPLE – SIGNAL WAIT (2) (code screenshot of a wait implementation: if signal_handle is invalid, return the signal-invalid status; a signal wait condition helper compares tmp->value with compare_value to see if the condition is satisfied; if it is, return the signal value and status; if timeout = 0, return the signal-timeout status. A host-side C sketch follows below.) © Copyright 2014 HSA Foundation. All Rights Reserved
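A host-side C sketch of the timeline on the previous slide using POSIX threads; the signal function names are taken from the slides, but the prototypes (and the HSA_EQ constant value) are assumptions, not the provisional header.

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed provisional-style signal API (illustrative prototypes): */
typedef uint64_t hsa_signal_handle_t;
typedef int64_t  hsa_signal_value_t;
int  hsa_signal_create(hsa_signal_value_t initial_value,
                       hsa_signal_handle_t *signal_handle);
void hsa_signal_add_relaxed(hsa_signal_handle_t s, hsa_signal_value_t v);
void hsa_signal_subtract_relaxed(hsa_signal_handle_t s, hsa_signal_value_t v);
int  hsa_signal_wait_timeout_acquire(hsa_signal_handle_t s,
                                     uint64_t timeout,      /* 0 = none */
                                     int long_wait,         /* hint     */
                                     int condition,
                                     hsa_signal_value_t compare_value,
                                     hsa_signal_value_t *return_value);
#define HSA_EQ 0   /* placeholder value for the condition enum */

static hsa_signal_handle_t sig;

static void *thread_2(void *arg)           /* the producer in the slide */
{
    (void)arg;
    hsa_signal_add_relaxed(sig, 3);        /* value: 0 -> 3 */
    hsa_signal_subtract_relaxed(sig, 1);   /* value: 3 -> 2 */
    return NULL;
}

int main(void)
{
    hsa_signal_create(0, &sig);
    pthread_t t;
    pthread_create(&t, NULL, thread_2, NULL);

    hsa_signal_value_t v = 0;              /* thread_1: block until == 2 */
    hsa_signal_wait_timeout_acquire(sig, 0, 1, HSA_EQ, 2, &v);
    printf("observed signal value %lld\n", (long long)v);

    pthread_join(t, NULL);
    return 0;
}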
  • 111. OUTLINE  Queues  Queue Types and Structure  HSA runtime API for Queue Manipulations  Architected Queuing Language (AQL) Support  Packet type  Packet header  Examples  Enqueue Packet  Packet Processor © Copyright 2014 HSA Foundation. All Rights Reserved
• 112. INTRODUCTION (1)  An HSA-compliant platform supports the allocation of multiple user-level command queues.  A user-level command queue is characterized as runtime-allocated, user-accessible virtual memory of a certain size, containing packets defined in the Architected Queuing Language (AQL packets).  Queues are allocated by HSA applications through the HSA runtime.  HSA software receives memory-based structures to configure the hardware queues, allowing efficient software management of the hardware queues of the HSA agents.  This queue memory shall be processed by the HSA packet processor as a ring buffer.  Queues are read-only data structures.  Writing values directly to a queue structure results in undefined behavior.  But HSA agents can directly modify the contents of the buffer pointed to by base_address, or use runtime APIs to access the doorbell signal or the service queue. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 113.  Two queue types, AQL and Service Queues, are supported  AQL Queue consumes AQL packets that are used to specify the information of kernel functions that will be executed on the HSA component  Service Queue consumes agent dispatch packets that are used to specify runtime-defined or user registered functions that will be executed on the agent (typically, the host CPU) INTRODUCTION (2) © Copyright 2014 HSA Foundation. All Rights Reserved
  • 114. INTRODUCTION (3)  AQL queue structure © Copyright 2014 HSA Foundation. All Rights Reserved
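The queue-structure diagram did not survive extraction; as a rough picture, here is a C sketch of the AQL queue structure. Field names and ordering are paraphrased from the provisional spec and may differ from any shipped header.

#include <stdint.h>

typedef uint64_t hsa_signal_handle_t;

/* Read-only queue structure, as described on the surrounding slides. */
typedef struct hsa_queue_s {
    uint32_t queue_type;            /* AQL queue vs. service queue       */
    uint32_t queue_features;        /* which packet types are supported  */
    uint64_t base_address;          /* ring buffer holding AQL packets   */
    hsa_signal_handle_t doorbell_signal; /* rung after writing packets   */
    uint32_t size;                  /* number of packet slots            */
    uint32_t queue_id;              /* unique within the process         */
    uint64_t service_queue;         /* queue for agent-dispatch requests */
} hsa_queue_t;

/* readIndex and writeIndex are not part of this structure; they are
 * reachable only through the dedicated runtime index APIs. */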
  • 115. INTRODUCTION (4)  In addition to the data held in the queue structure, the queue also defines two properties (readIndex and writeIndex) that define the location of “head” and “tail” of the queue.  readIndex: The read index is a 64-bit unsigned integer that specifies the packetID of the next AQL packet to be consumed by the packet processor.  writeIndex: The write index is a 64-bit unsigned integer that specifies the packetID of the next AQL packet slot to be allocated.  Both indices are not directly exposed to the user, who can only access them by using dedicated HSA core runtime APIs.  The available index functions differ on the index of interest (read or write), action to be performed (addition, compare and swap, etc.), and memory consistency model (relaxed, release, etc.). © Copyright 2014 HSA Foundation. All Rights Reserved
• 116. INTRODUCTION (5)  The read index is automatically advanced when a packet is read by the packet processor.  When the packet processor observes that  The read index matches the write index, the queue can be considered empty;  The write index is greater than or equal to the sum of the read index and the size of the queue, then the queue is full.  The doorbell_signal field of a queue contains a signal that is used by the agent to inform the packet processor to process the packets it writes.  The value written to the doorbell signal is the ID of the packet that is ready to be launched. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 117. INTRODUCTION (6)  The new task might be consumed by the packet processor even before the doorbell signal has been signaled by the agent.  This is because the packet processor might be already processing some other packets and observes that there is new work available, so it processes the new packets.  In any case, the agent must ring the doorbell for every batch of packets it writes. © Copyright 2014 HSA Foundation. All Rights Reserved
• 118. QUEUE CREATE/DESTROY  Create a user mode queue  When a queue is created, the runtime also allocates the packet buffer and the completion signal.  The application should only rely on the status code returned to determine if the queue is valid  Destroy a user mode queue  A destroyed queue must not be accessed after being destroyed.  When a queue is destroyed, the state of the AQL packets that have not yet been fully processed becomes undefined. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 119. GET READ/WRITE INDEX  Atomically retrieve read index of a queue with acquire semantics  Atomically retrieve write index of a queue with acquire semantics  Atomically retrieve read index of a queue with relaxed semantics  Atomically retrieve write index of a queue with relaxed semantics © Copyright 2014 HSA Foundation. All Rights Reserved
  • 120. SET READ/WRITE INDEX  Atomically set the read index of a queue with release semantics  Atomically set the read index of a queue with relaxed semantics  Atomically set the write index of a queue with release semantics  Atomically set the write index of a queue with relaxed semantics © Copyright 2014 HSA Foundation. All Rights Reserved
• 121. COMPARE AND SWAP WRITE INDEX  Atomically compare and set the write index of a queue with acquire/release/relaxed/acquire-release semantics  Parameters  queue (input): A queue  expected (input): The expected index value  val (input): Value to copy to the write index if expected matches the observed write index  Return value  Previous value of the write index © Copyright 2014 HSA Foundation. All Rights Reserved
  • 122. ADD WRITE INDEX  Atomically increment the write index of a queue by an offset with release/acquire/relaxed/acquire-release semantics  Parameters  queue (input): A queue  val (input): The value to add to the write index  Return value  Previous value of the write index © Copyright 2014 HSA Foundation. All Rights Reserved
  • 123. ARCHITECTED QUEUING LANGUAGE (AQL)  An HSA-compliant system provides a command interface for the dispatch of HSA agent commands.  This command interface is provided by the Architected Queuing Language (AQL).  AQL allows HSA agents to build and enqueue their own command packets, enabling fast and low-power dispatch.  AQL also provides support for HSA component queue submissions  The HSA component kernel can write commands in AQL format. © Copyright 2014 HSA Foundation. All Rights Reserved
• 124. AQL PACKET (1)  AQL packet format  Values  Always reserved packet (0): Packet format is set to always reserved when the queue is initialized.  Invalid packet (1): Packet format is set to invalid when the readIndex is incremented, making the packet slot available to the HSA agents.  Dispatch packet (2): Dispatch packets contain jobs for the HSA component and are created by HSA agents.  Barrier packet (3): Barrier packets can be inserted by HSA agents to delay the processing of subsequent packets. All queues support barrier packets.  Agent dispatch packet (4): Agent dispatch packets contain jobs for an HSA agent and are created by HSA agents. © Copyright 2014 HSA Foundation. All Rights Reserved
• 125. AQL PACKET (2) (diagram: dispatch packet layout; the completion signal field is an HSA signaling object handle used to indicate completion of the job) © Copyright 2014 HSA Foundation. All Rights Reserved
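As a sketch of what the layout diagram conveyed, a C rendering of a dispatch packet along the lines of the provisional spec; field names and widths are paraphrased and may not match the final layout.

#include <stdint.h>

typedef struct {
    uint16_t header;                 /* format (dispatch = 2), barrier
                                        bit, acquire/release fence scopes */
    uint16_t dimensions;             /* 1, 2 or 3 grid dimensions         */
    uint16_t workgroup_size_x, workgroup_size_y, workgroup_size_z;
    uint32_t grid_size_x, grid_size_y, grid_size_z;
    uint32_t private_segment_size_bytes;
    uint32_t group_segment_size_bytes;
    uint64_t kernel_object_address;  /* finalized kernel code handle      */
    uint64_t kernarg_address;        /* kernel argument buffer            */
    uint64_t completion_signal;      /* signaled when the job finishes    */
} aql_dispatch_packet_t;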
  • 126. EXAMPLE - ENQUEUE AQL PACKET (1)  An HSA agent submits a task to a queue by performing the following steps:  Allocate a packet slot (by incrementing the writeIndex)  Initialize the packet and copy packet to a queue associated with the Packet Processor  Mark packet as valid  Notify the Packet Processor of the packet (With doorbell signal) © Copyright 2014 HSA Foundation. All Rights Reserved
• 127. EXAMPLE - ENQUEUE AQL PACKET (2) (code walkthrough over the dispatch queue, between ReadIndex and WriteIndex: allocate an AQL packet slot; initialize the packet; copy the packet into the queue, where a lock may be needed to prevent race conditions in a multithreaded environment; send the doorbell signal. A C sketch follows below.) © Copyright 2014 HSA Foundation. All Rights Reserved
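A C sketch of the four enqueue steps, reusing the hsa_queue_t and aql_dispatch_packet_t sketches from the earlier slides; the index and doorbell helper names are assumptions in the spirit of the APIs described above, not confirmed spellings.

#include <stdint.h>

/* Assumed helpers (see the index and signal API slides): */
uint64_t hsa_queue_add_write_index_release(hsa_queue_t *q, uint64_t n);
void     hsa_signal_send_relaxed(hsa_signal_handle_t s, int64_t value);

void enqueue_dispatch(hsa_queue_t *q, const aql_dispatch_packet_t *pkt)
{
    /* 1. allocate a packet slot: atomically advance the writeIndex */
    uint64_t packet_id = hsa_queue_add_write_index_release(q, 1);

    /* 2. initialize and copy the packet into its ring-buffer slot,
     *    keeping the slot format invalid while the body is written */
    aql_dispatch_packet_t *slot =
        (aql_dispatch_packet_t *)(uintptr_t)q->base_address
        + (packet_id % q->size);
    aql_dispatch_packet_t tmp = *pkt;
    tmp.header = 0;
    *slot = tmp;

    /* 3. mark the packet valid: publish the real header last */
    __atomic_store_n(&slot->header, pkt->header, __ATOMIC_RELEASE);

    /* 4. ring the doorbell with the ID of the packet just made ready */
    hsa_signal_send_relaxed(q->doorbell_signal, (int64_t)packet_id);
}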
• 128. EXAMPLE - PACKET PROCESSOR (code walkthrough for the packet-processor side of the dispatch queue: receive the doorbell; while there is any packet between ReadIndex and WriteIndex, get the packet content and check whether it is a barrier packet; process it; then update the readIndex, change the packet state to invalid, and send the completion signal.) © Copyright 2014 HSA Foundation. All Rights Reserved
  • 130. OUTLINE  Memory registration and deregistration  Memory region and memory segment  APIs for memory region manipulation  APIs for memory registration and deregistration © Copyright 2014 HSA Foundation. All Rights Reserved
  • 131. INTRODUCTION  One of the key features of HSA is its ability to share global pointers between the host application and code executing on the HSA component.  This ability means that an application can directly pass a pointer to memory allocated on the host to a kernel function dispatched to a component without an intermediate copy  When a buffer created in the host is also accessed by a component, programmers are encouraged to register the corresponding address range beforehand.  Registering memory expresses an intention to access (read or write) the passed buffer from a component other than the host. This is a performance hint that allows the runtime implementation to know which buffers will be accessed by some of the components ahead of time.  When an HSA program no longer needs to access a registered buffer in a device, the user should deregister that virtual address range. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 132. MEMORY REGION/SEGMENT  A memory region represents a virtual memory interval that is visible to a particular agent, and contains properties about how memory is accessed or allocated from that agent.  Memory segments  Values  HSA_SEGMENT_GLOBAL = 1  HSA_SEGMENT_PRIVATE = 2  HSA_SEGMENT_GROUP = 4  HSA_SEGMENT_KERNARG = 8  HSA_SEGMENT_READONLY = 16  HSA_SEGMENT_IMAGE = 32 © Copyright 2014 HSA Foundation. All Rights Reserved
  • 133. MEMORY REGION INFORMATION  Attributes of a memory region  Values  HSA_REGION_INFO_BASE_ADDRESS  HSA_REGION_INFO_SIZE  HSA_REGION_INFO_NODE  HSA_REGION_INFO_MAX_ALLOCATION_SIZE  HSA_REGION_INFO_SEGMENT  HSA_REGION_INFO_BANDWIDTH  HSA_REGION_INFO_CACHED © Copyright 2014 HSA Foundation. All Rights Reserved
  • 134. MEMORY REGION MANIPULATION (1)  Get the current value of an attribute of a region  Iterate over the memory regions that are visible to an agent, and invoke an application-defined callback on every iteration  If callback returns a status other than HSA_STATUS_SUCCESS for a particular iteration, the traversal stops and the function returns that status value. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 135. MEMORY REGION MANIPULATION (2)  Allocate a block of memory  Deallocate a block of memory previously allocated using hsa_memory_allocate  Copy block of memory  Copying a number of bytes larger than the size of the memory regions pointed by dst or src results in undefined behavior. © Copyright 2014 HSA Foundation. All Rights Reserved
• 136. MEMORY REGISTRATION/DEREGISTRATION  Register memory  Parameters  address (input): A pointer to the base of the memory region to be registered. If a NULL pointer is passed, no operation is performed.  size (input): Requested registration size in bytes. A size of zero is only allowed if address is NULL.  Deregister memory previously registered using hsa_memory_register  Parameter  address (input): A pointer to the base of the memory region to be deregistered. If a NULL pointer is passed, no operation is performed. © Copyright 2014 HSA Foundation. All Rights Reserved
• 137. EXAMPLE (code screenshot: allocate a memory space; use hsa_region_get_info to get the size in bytes of this memory space; register the memory space as a performance hint; when the operation finishes, deregister and free it. A C sketch follows below.) © Copyright 2014 HSA Foundation. All Rights Reserved
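A C sketch of the allocate/register/deregister/free round trip the annotations describe; the deallocation and deregistration names below are assumed counterparts of hsa_memory_allocate and hsa_memory_register, not confirmed spellings, and the region argument is elided.

#include <stddef.h>

/* Assumed provisional-style prototypes (illustrative): */
typedef int hsa_status_t;
hsa_status_t hsa_memory_allocate(/* region, */ size_t size, void **ptr);
hsa_status_t hsa_memory_register(void *address, size_t size);
hsa_status_t hsa_memory_deregister(void *address);
hsa_status_t hsa_memory_free(void *ptr);

void buffer_roundtrip(size_t bytes)
{
    void *buf = NULL;
    if (hsa_memory_allocate(bytes, &buf) != 0)
        return;

    /* performance hint: this buffer will be accessed from a component */
    hsa_memory_register(buf, bytes);

    /* ... dispatch kernels that read and write buf ... */

    hsa_memory_deregister(buf);   /* device access finished */
    hsa_memory_free(buf);
}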
  • 139. SUMMARY  Covered  HSA Core Runtime API (Pre-release 1.0 provisional)  Runtime Initialization and Shutdown (Open/Close)  Notifications (Synchronous/Asynchronous)  Agent Information  Signals and Synchronization (Memory-Based)  Queues and Architected Dispatch  Memory Management  Not covered  Extension of Core Runtime  HSAIL Finalization, Linking, and Debugging  Images and Samplers © Copyright 2014 HSA Foundation. All Rights Reserved
  • 140. QUESTIONS? © Copyright 2014 HSA Foundation. All Rights Reserved
  • 141. HSA MEMORY MODEL BEN GASTER, ENGINEER, QUALCOMM
  • 142. OUTLINE  HSA Memory Model  OpenCL 2.0  Has a memory model too  Obstruction-free bounded deques  An example using the HSA memory model © Copyright 2014 HSA Foundation. All Rights Reserved
  • 143. HSA MEMORY MODEL © Copyright 2014 HSA Foundation. All Rights Reserved
• 144. TYPES OF MODELS  Shared memory computers and programming languages divide complexity into models: 1. Memory model specifies safety  e.g. what values can a load return?  This is what this section of the tutorial will focus on 2. Execution model specifies liveness  Described in Ben Sander’s tutorial section on HSAIL  e.g. can a work-item prevent others from progressing? 3. Performance model specifies the big picture  e.g. caches or branch divergence  Specific to particular implementations and outside the scope of today’s tutorial © Copyright 2014 HSA Foundation. All Rights Reserved
  • 145. THE PROBLEM  Assume all locations (a, b, …) are initialized to 0  What are the values of $s2 and $s4 after execution? © Copyright 2014 HSA Foundation. All Rights Reserved Work-item 0 mov_u32 $s1, 1 ; st_global_u32 $s1, [&a] ; ld_global_u32 $s2, [&b] ; Work-item 1 mov_u32 $s3, 1 ; st_global_u32 $s3, [&b] ; ld_global_u32 $s4, [&a] ; *a = 1; int x = *b; *b = 1; int y = *a; initially *a = 0 && *b = 0
• 146. THE SOLUTION  The memory model tells us:  the visibility of writes to memory at any given point  the set of possible executions © Copyright 2014 HSA Foundation. All Rights Reserved
• 147. WHAT MAKES A GOOD MEMORY MODEL*  Programmability: A good model should make it (relatively) easy to write multi-work-item programs. The model should be intuitive to most users, even to those who have not read the details  Performance: A good model should facilitate high-performance implementations at reasonable power, cost, etc. It should give implementers broad latitude in options  Portability: A good model would be adopted widely, or at least provide backward compatibility or the ability to translate among models * S. V. Adve. Designing Memory Consistency Models for Shared-Memory Multiprocessors. PhD thesis, Computer Sciences Department, University of Wisconsin–Madison, Nov. 1993. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 148. SEQUENTIAL CONSISTENCY (SC)*  Axiomatic Definition  A single processor (core) sequential if “the result of an execution is the same as if the operations had been executed in the order specified by the program.”  A multiprocessor sequentially consistent if “the result of any execution is the same as if the operations of all processors (cores) were executed in some sequential order, and the operations of each individual processor (core) appear in this sequence in the order specified by its program.” © Copyright 2014 HSA Foundation. All Rights Reserved  But HW/Compiler actually implements more relaxed models, e.g. ARMv7 * L. Lamport. How to Make a Multiprocessor Computer that Correctly Executes Multiprocessor Programs. IEEE Transactions on Computers, C-28(9):690–91, Sept. 1979.
  • 149. SEQUENTIAL CONSISTENCY (SC) © Copyright 2014 HSA Foundation. All Rights Reserved Work-item 0 mov_u32 $s1, 1 ; st_global_u32 $s1, [&a] ; ld_global_u32 $s2, [&b] ; Work-item 1 mov_u32 $s3, 1 ; st_global_u32 $s3, [&b] ; ld_global_u32 $s4, [&a] ; mov_u32 $s1, 1 ; mov_u32 $s3, 1; st_global_u32 $s1, [&a] ; ld_global_u32 $s2, [&b] ; st_global_u32 $s3, [&b] ; ld_global_u32 $s4, [&a] ; $s2 = 0 && $s4 = 1
• 150. BUT WHAT ABOUT ACTUAL HARDWARE  Sequential consistency is (reasonably) easy to understand, but limits optimizations that the compiler and hardware can perform  Many modern processors implement many reordering optimizations  Store buffers (TSO*): work-items can see their own stores early  Reorder buffers (XC*): work-items can see other work-items' stores early © Copyright 2014 HSA Foundation. All Rights Reserved *TSO – Total Store Order as implemented by Sparc and x86 *XC – Relaxed Consistency model, e.g. ARMv7, Power7, and Adreno
  • 151. RELAXED CONSISTENCY (XC) © Copyright 2014 HSA Foundation. All Rights Reserved Work-item 0 mov_u32 $s1, 1 ; st_global_u32 $s1, [&a] ; ld_global_u32 $s2, [&b] ; Work-item 1 mov_u32 $s3, 1 ; st_global_u32 $s3, [&b] ; ld_global_u32 $s4, [&a] ; mov_u32 $s1, 1 ; mov_u32 $s3, 1; ld_global_u32 $s2, [&b] ; ld_global_u32 $s4, [&a] ; st_global_u32 $s1, [&a] ; st_global_u32 $s3, [&b] ; $s2 = 0 && $s4 = 0
• 152. WHAT ARE OUR 3 Ps?  Programmability: XC is really pretty hard; it is difficult for the programmer to reason about what will be visible when  many memory model experts have been known to get it wrong!  Performance: XC is good for performance; the hardware (compiler) is free to reorder many loads and stores, opening the door for performance and power enhancements  Portability: XC is very portable as it places very few constraints © Copyright 2014 HSA Foundation. All Rights Reserved
• 153. MY CHILDREN AND COMPUTER ARCHITECTS ALL WANT  To have their cake and eat it! © Copyright 2014 HSA Foundation. All Rights Reserved HSA Provides: The ability to enable programmers to reason with the (relatively) intuitive model of SC, while still achieving the benefits of XC!
  • 154. SEQUENTIAL CONSISTENCY FOR DRF*  HSA adopts the same approach as Java, C++11, and OpenCL 2.0 adopting SC for Data Race Free (DRF)  plus some new capabilities !  (Informally) A data race occurs when two (or more) work-items access the same memory location such that:  At least one of the accesses is a WRITE  There are no intervening synchronization operations  SC for DRF asks:  Programmers to ensure programs are DRF under SC  Implementers to ensure that all executions of DRF programs on the relaxed model are also SC executions © Copyright 2014 HSA Foundation. All Rights Reserved *S. V. Adve and M. D. Hill. Weak Ordering—A New Definition. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pp. 2–14, May 1990
• 155. HSA SUPPORTS RELEASE CONSISTENCY  HSA’s memory model is based on RCSC:  All atomic_ld_scacq and atomic_st_screl are SC  Means coherence on all atomic_ld_scacq and atomic_st_screl to a single address  All atomic_ld_scacq and atomic_st_screl are program ordered per work-item (actually: sequence-ordered by language constraints)  Similar model adopted by ARMv8  HSA extends RCSC to SC for HRF*, to access the full capabilities of modern heterogeneous systems, containing CPUs, GPUs, and DSPs, for example. © Copyright 2014 HSA Foundation. All Rights Reserved *Sequential Consistency for Heterogeneous-Race-Free: Programmer-centric Memory Models for Heterogeneous Platforms. D. R. Hower, B. M. Beckmann, B. R. Gaster, B. Hechtman, M. D. Hill, S. K. Reinhardt, and D. Wood. MSPC’13.
  • 156. MAKING RELAXED CONSISTENCY WORK © Copyright 2014 HSA Foundation. All Rights Reserved Work-item 0 mov_u32 $s1, 1 ; atomic_st_global_u32_screl $s1, [&a] ; atomic_ld_global_u32_scacq $s2, [&b] ; Work-item 1 mov_u32 $s3, 1 ; atomic_st_global_u32_screl $s3, [&b] ; atomic_ld_global_u32_scacq $s4, [&a] ; mov_u32 $s1, 1 ; mov_u32 $s3, 1; atomic_st_global_u32_screl $s1, [&a] ; atomic_ld_global_u32_scacq $s2, [&b] ; atomic_st_global_u32_screl $s3, [&b] ; atomic_ld_global_u32_scacq $s4, [&a] ; $s2 = 0 && $s4 = 1
  • 157. SEQUENTIAL CONSISTENCY FOR DRF  Two memory accesses participate in a data race if they  access the same location  at least one access is a store  can occur simultaneously  i.e. appear as adjacent operations in interleaving.  A program is data-race-free if no possible execution results in a data race.  Sequential consistency for data-race-free programs  Avoid everything else HSA: Not good enough! © Copyright 2014 HSA Foundation. All Rights Reserved
• 158. ALL ARE NOT EQUAL – OR SOME CAN SEE BETTER THAN OTHERS  Remember the HSAIL execution model (diagram: nested synchronization scopes, from wave scope through group scope and device scope up to platform scope) © Copyright 2014 HSA Foundation. All Rights Reserved
• 159. DATA-RACE-FREE IS NOT ENOUGH  Two ordinary memory accesses participate in a data race if they  Access same location  At least one is a store  Can occur simultaneously
t1: st_global 1, [&X] ; atomic_st_global_screl 0, [&flag]
t2: atomic_cas_global_scar 1, 0, [&flag] ; ... ; atomic_st_global_screl 0, [&flag]
t3/t4: atomic_cas_global_scar 1, 0, [&flag] ; ld_global (??), [&x]
(t1 and t2 are in group #1-2; t3 and t4 are in group #3-4)
Not a data race… Is it SC? Well that depends (diagram: is visibility across S12, S34, and SGlobal implied by causality?) © Copyright 2014 HSA Foundation. All Rights Reserved
  • 160. SEQUENTIAL CONSISTENCY FOR HETEROGENEOUS-RACE-FREE  Two memory accesses participate in a heterogeneous race if  access the same location  at least one access is a store  can occur simultaneously  i.e. appear as adjacent operations in interleaving.  Are not synchronized with “enough” scope  A program is heterogeneous-race-free if no possible execution results in a heterogeneous race.  Sequential consistency for heterogeneous-race-free programs  Avoid everything else © Copyright 2014 HSA Foundation. All Rights Reserved
• 161. HSA HETEROGENEOUS RACE FREE  HRF0: Basic Scope Synchronization  “enough” = both threads synchronize using identical scope  Recall the example:  Contains a heterogeneous race in HSA
t1: st_global 1, [&X] ; atomic_st_global_rcrel_wg 0, [&flag]
t3/t4: ... ; atomic_cas_global_scar_wg 1, 0, [&flag] ; ld_global (??), [&x]
(t1 and t2 are in workgroup #1-2; t3 and t4 are in workgroup #3-4)
HSA Conclusion: This is bad. Don’t do it. © Copyright 2014 HSA Foundation. All Rights Reserved
• 162. HOW TO USE HSA WITH SCOPES  Want: For performance, use smallest scope possible  What is safe in HSA?  HSA Scope Selection Guideline: Use the smallest scope that includes all producers/consumers of shared data  Implication: Producers/consumers must be known at synchronization time  Is this a valid assumption? © Copyright 2014 HSA Foundation. All Rights Reserved
  • 163. REGULAR GPGPU WORKLOADS N M Define Problem Space Partition Hierarchically Communicate Locally N times Communicate Globally M times Well defined (regular) data partitioning + Well defined (regular) synchronization pattern =  Producer/consumers are always known Generally: HSA works well with regular data-parallel workloads © Copyright 2014 HSA Foundation. All Rights Reserved
• 164. IRREGULAR WORKLOADS  HSA: the example is a race  Must upgrade wg (workgroup) -> plat (platform)
t1: st_global 1, [&X] ; atomic_st_global_screl_plat 0, [&flag]
t2: atomic_cas_global_scar_plat 1, 0, [&flag] ; ... ; atomic_st_global_screl_plat 0, [&flag]
t3/t4: atomic_cas_global_scar_plat 1, 0, [&flag] ; ld $s1, [&x]
(t1 and t2 are in workgroup #1-2; t3 and t4 are in workgroup #3-4)
 HSA memory model says:  ld $s1, [&x] will see value (1)! © Copyright 2014 HSA Foundation. All Rights Reserved
  • 165. OPENCL HAS MEMORY MODELS TOO MAPPING ONTO HSA’S MEMORY MODEL
• 166. OPENCL 1.X MEMORY MODEL MAPPING  It is straightforward to provide a mapping from OpenCL 1.x to the proposed model  OpenCL 1.x atomics are unordered and so map to atomic_op_X  Mapping for fences not shown but straightforward
OpenCL Operation | HSA Memory Model Operation
Atomic load | ld_global_wg / ld_group_wg
Atomic store | atomic_st_global_wg / atomic_st_group_wg
atomic_op | atomic_op_global_comp / atomic_op_group_wg
barrier(…) | fence ; barrier_wg
© Copyright 2014 HSA Foundation. All Rights Reserved
  • 167. OPENCL 2.0 BACKGROUND  Provisional specification released at SIGGRAPH’13, July 2013.  Huge update to OpenCL to account for the evolving hardware landscape and emerging use cases (e.g. irregular work loads)  Key features:  Shared virtual memory, including platform atomics  Formally defined memory model based on C11 plus support for scopes  Includes an extended set of C1X atomic operations  Generic address space, that subsumes global, local, and private  Device to device enqueue  Out-of-order device side queuing model  Backwards compatible with OpenCL 1.x © Copyright 2014 HSA Foundation. All Rights Reserved
• 168. OPENCL 2.0 MEMORY MODEL MAPPING
OpenCL Operation | HSA Memory Model Operation
Load memory_order_relaxed | atomic_ld_[global | group]_relaxed_scope
Store memory_order_relaxed | atomic_st_[global | group]_relaxed_scope
Load memory_order_acquire | atomic_ld_[global | group]_scacq_scope
Load memory_order_seq_cst | atomic_ld_[global | group]_scacq_scope
Store memory_order_release | atomic_st_[global | group]_screl_scope
Store memory_order_seq_cst | atomic_st_[global | group]_screl_scope
memory_order_acq_rel | atomic_op_[global | group]_scar_scope
memory_order_seq_cst | atomic_op_[global | group]_scar_scope
© Copyright 2014 HSA Foundation. All Rights Reserved
• 169. OPENCL 2.0 MEMORY SCOPE MAPPING
OpenCL Scope | HSA Scope
memory_scope_sub_group | _wave
memory_scope_work_group | _wg
memory_scope_device | _component
memory_scope_all_svm_devices | _platform
© Copyright 2014 HSA Foundation. All Rights Reserved
  • 170. OBSTRUCTION-FREE BOUNDED DEQUES AN EXAMPLE USING THE HSA MEMORY MODEL
• 171. CONCURRENT DATA-STRUCTURES  Why do we need such a memory model in practice?  One important application of memory consistency is in the development and use of concurrent data-structures  In particular, there is a class of data-structure implementations that provide non-blocking guarantees:  wait-free: An algorithm is wait-free if every operation has a bound on the number of steps the algorithm will take before the operation completes  In practice it is very hard to build efficient data-structures that meet this requirement  lock-free: An algorithm is lock-free if, given enough time, at least one of the work-items (or threads) makes progress  In practice lock-free algorithms are implemented by work-items cooperating with one another to allow progress  obstruction-free: An algorithm is obstruction-free if a work-item, running in isolation, can make progress © Copyright 2014 HSA Foundation. All Rights Reserved
• 172. BUT WHY NOT JUST USE MUTUAL EXCLUSION? © Copyright 2014 HSA Foundation. All Rights Reserved Diversity in a heterogeneous system, such as different clock speeds, different scheduling policies, and more, can mean traditional mutual exclusion is not the right choice (diagram: emerging compute cluster with an Adreno GPU, four Krait CPUs, a Hexagon DSP, per-device MMUs, 2MB L2, and a fabric & memory controller)
• 173. CONCURRENT DATA-STRUCTURES  Emerging heterogeneous compute clusters mean we need:  To adapt existing concurrent data-structures  To develop new concurrent data-structures  Lock based programming may still be useful, but often these algorithms will need to be lock-free  Of course, this is a key application of the HSA memory model  To showcase this we highlight the development of a well known (HLM) obstruction-free deque* © Copyright 2014 HSA Foundation. All Rights Reserved *Herlihy, M. et al. 2003. Obstruction-free synchronization: double-ended queues as an example. (2003), 522–529.
• 174. HLM - OBSTRUCTION-FREE DEQUE  Uses a fixed-length circular queue  At any given time, reading from left to right, the array will contain:  Zero or more left-null (LN) values  Zero or more dummy-null (DN) values  Zero or more right-null (RN) values  At all times there must be:  At least two different null values  At least one LN or DN, and at least one DN or RN  Memory consistency is required to allow multiple producers and multiple consumers, potentially operating in parallel from the left and right ends, to see changes from other work-items (HSA components) and threads (HSA agents) © Copyright 2014 HSA Foundation. All Rights Reserved
• 175. HLM - OBSTRUCTION-FREE DEQUE (diagram: circular array read left to right as LN … LN, v … v, RN … RN, with left and right hint indices) Key: LN – left null value, RN – right null value, v – value, left – left hint index, right – right hint index © Copyright 2014 HSA Foundation. All Rights Reserved
• 176. C REPRESENTATION OF DEQUE
struct node {
  uint64_t type    : 2;  // null type (LN, RN, DN)
  uint64_t counter : 8;  // version counter to avoid ABA
  uint64_t value   : 54; // index/value stored in queue
};

struct queue {
  unsigned int size; // size of bounded buffer
  node * array;      // backing store for deque itself
};
© Copyright 2014 HSA Foundation. All Rights Reserved
  • 177. HSAIL REPRESENTATION  Allocate a deque in global memory using HSAIL @deque_instance: align 64 global_u32 &size; align 8 global_u64 &array; © Copyright 2014 HSA Foundation. All Rights Reserved
  • 178. ORACLE  Assume a function: function &rcheck_oracle (arg_u32 %k, arg_u64 %left, arg_u64 %right) (arg_u64 %queue);  Which given a deque  returns (%k) the position of the left most of RN  atomic_ld_global_scacq used to read node from array  Makes one if necessary (i.e. if there are only LN or DN)  atomic_cas_global_scar, required to make new RN  returns (%left) the left node (i.e. the value to the left of the left most RN position)  returns (%right) the right node (i.e. the value at position (%k)) © Copyright 2014 HSA Foundation. All Rights Reserved
• 179. RIGHT POP
function &right_pop(arg_u32 %err, arg_u64 %result) (arg_u64 %deque) {
  // load queue address
  ld_arg_u64 $d0, [%deque];
@loop_forever:
  // setup and call right oracle to get next RN
  arg_u32 %k; arg_u64 %current; arg_u64 %next;
  call &rcheck_oracle (%k, %current, %next) (%deque);
  ld_arg_u32 $s0, [%k];
  ld_arg_u64 $d1, [%current];
  ld_arg_u64 $d2, [%next];
  // current.type($d5)
  shr_u64 $d5, $d1, 62;
  // current.counter($d6)
  and_u64 $d6, $d1, 0x3FC0000000000000;
  shr_u64 $d6, $d6, 54;
  // current.value($d7)
  and_u64 $d7, $d1, 0x3FFFFFFFFFFFFF;
  // next.counter($d8)
  and_u64 $d8, $d2, 0x3FC0000000000000;
  shr_u64 $d8, $d8, 54;
  brn @loop_forever;
}
© Copyright 2014 HSA Foundation. All Rights Reserved
• 180. RIGHT POP – TEST FOR EMPTY
// empty if current.type($d5) == LN || current.type($d5) == DN
cmp_neq_b1_u64 $c0, $d5, LN;
cmp_neq_b1_u64 $c1, $d5, DN;
and_b1 $c0, $c0, $c1; // $c0 = (type != LN) && (type != DN)
cbr $c0, @not_empty;
// current node address (%deque($d0) + (%k($s0) - 1) * 16)
add_u32 $s1, $s0, -1;
mul_u32 $s1, $s1, 16;
add_u32 $d3, $d0, $s1;
atomic_ld_global_scacq_u64 $d4, [$d3];
cmp_neq_b1_u64 $c0, $d4, $d1;
cbr $c0, @not_empty;
st_arg_u32 EMPTY, [%err]; // deque empty so return EMPTY
ret;
@not_empty:
© Copyright 2014 HSA Foundation. All Rights Reserved
• 181. RIGHT POP – TRY READ/REMOVE NODE
// $d9 = node(RN, next.cnt+1, 0)
add_u64 $d8, $d8, 1;
shl_u64 $d8, $d8, 54;
shl_u64 $d9, RN, 62;
or_u64 $d9, $d8, $d9;
// cas(deq+k, next, node(RN, next.cnt+1, 0))
atomic_cas_global_scar_u64 $d9, [$s0], $d2, $d9;
cmp_neq_u64 $c0, $d9, $d2;
cbr $c0, @cas_failed;
// $d9 = node(RN, current.cnt+1, 0)
add_u64 $d6, $d6, 1;
shl_u64 $d6, $d6, 54;
shl_u64 $d9, RN, 62;
or_u64 $d9, $d6, $d9;
// cas(deq+(k-1), curr, node(RN, curr.cnt+1, 0))
atomic_cas_global_scar_u64 $d9, [$s1], $d1, $d9;
cmp_neq_u64 $c0, $d9, $d1;
cbr $c0, @cas_failed;
st_arg_u32 SUCCESS, [%err];
st_arg_u64 $d7, [%result];
ret;
@cas_failed:
// loop back around and try again
© Copyright 2014 HSA Foundation. All Rights Reserved
• 182. TAKE AWAYS  HSA provides a powerful and modern memory model  Based on the well-known SC for DRF  Defined as Release Consistency  Extended with scopes as defined by HRF  OpenCL 2.0 introduces a new memory model  Also based on SC for DRF  Also defined in terms of Release Consistency  Also extended with scopes as defined in HRF  Has a well defined mapping to HSA  Concurrent algorithm development for emerging heterogeneous compute clusters can benefit from the HSA and OpenCL 2.0 memory models © Copyright 2014 HSA Foundation. All Rights Reserved
  • 183. HSA QUEUING MODEL HAKAN PERSSON, SENIOR PRINCIPAL ENGINEER, ARM
• 185. MOTIVATION (TODAY’S PICTURE) (flow diagram across Application, OS, and GPU lanes: Transfer buffer to GPU -> Copy/Map Memory -> Queue Job -> Schedule Job -> Start Job -> Finish Job -> Schedule Application -> Get Buffer -> Copy/Map Memory) © Copyright 2014 HSA Foundation. All Rights Reserved
• 187. REQUIREMENTS  Three key technologies are used to build the user mode queueing mechanism  Shared Virtual Memory  System Coherency  Signaling  AQL (Architected Queueing Language) enables any agent to enqueue tasks © Copyright 2014 HSA Foundation. All Rights Reserved
• 189. SHARED VIRTUAL MEMORY (TODAY)  Multiple virtual memory address spaces (diagram: CPU0 uses VIRTUAL MEMORY1 with mapping VA1->PA1; the GPU uses VIRTUAL MEMORY2 with mapping VA2->PA1; both resolve to the same PHYSICAL MEMORY) © Copyright 2014 HSA Foundation. All Rights Reserved
  • 190. SHARED VIRTUAL MEMORY (HSA)  Common virtual memory for all HSA agents [Diagram: CPU0 and the GPU share one VIRTUAL MEMORY with the identical VA->PA mapping into PHYSICAL MEMORY] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 191. SHARED VIRTUAL MEMORY  Advantages  No mapping tricks, no copying back-and-forth between different PA addresses  Send pointers (not data) back and forth between HSA agents.  Implications  Common Page Tables (and common interpretation of architectural semantics such as shareability, protection, etc).  Common mechanisms for address translation (and servicing address translation faults)  Concept of a process address space (PASID) to allow multiple, per process virtual address spaces within the system. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 192. SHARED VIRTUAL MEMORY  Specifics  Minimum supported VA width is 48b for 64b systems, and 32b for 32b systems.  HSA agents may reserve VA ranges for internal use via system software.  All HSA agents other than the host unit must use the lowest privilege level  If present, read/write access flags for page tables must be maintained by all agents.  Read/write permissions apply to all HSA agents, equally. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 193. GETTING THERE … [Diagram: the offload flow from the Motivation slide, with the buffer-transfer and copy/map steps struck out now that shared virtual memory is in place] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 195. CACHE COHERENCY DOMAINS (1/3)  Data accesses to global memory segment from all HSA Agents shall be coherent without the need for explicit cache maintenance. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 196. CACHE COHERENCY DOMAINS (2/3)  Advantages  Composability  Reduced SW complexity when communicating between agents  Lower barrier to entry when porting software  Implications  Hardware coherency support between all HSA agents  Can take many forms  Stand alone Snoop Filters / Directories  Combined L3/Filters  Snoop-based systems (no filter)  Etc … © Copyright 2014 HSA Foundation. All Rights Reserved
  • 197. CACHE COHERENCY DOMAINS (3/3)  Specifics  No requirement for instruction memory accesses to be coherent  Only applies to the primary memory type  No requirement for HSA agents to maintain coherency for any memory location where the HSA agents do not specify the same memory attributes  Read-only image data is required to remain static during the execution of an HSA kernel  No double mapping via different attributes in order to modify it; the data must remain static © Copyright 2014 HSA Foundation. All Rights Reserved
  • 198. GETTING CLOSER … [Diagram: the offload flow again, further reduced now that system coherency removes the need for explicit reconciliation between agents] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 200. SIGNALING (1/3)  HSA agents support the ability to use signaling objects  All creation/destruction of signaling objects occurs via HSA runtime APIs  From an HSA agent you can directly access signaling objects:  Signal a signal object (this will wake up HSA agents waiting on the object)  Query the current object value  Wait on the current object (various conditions supported) © Copyright 2014 HSA Foundation. All Rights Reserved
  • 201. SIGNALING (2/3)  Advantages  Enables asynchronous events between HSA agents, without involving the kernel  Common idiom for work offload  Low power waiting  Implications  Runtime support required  Commonly implemented on top of cache coherency flows © Copyright 2014 HSA Foundation. All Rights Reserved
  • 202. SIGNALING (3/3)  Specifics  Only supported within a PASID  Supported wait conditions are =, !=, < and >=  Wait operations may return sporadically (no guarantee against false positives)  Programmer must test.  Wait operations have a maximum duration before returning.  The HSAIL atomic operations are supported on signal objects.  Signal objects are opaque  Must use dedicated HSAIL/HSA runtime operations © Copyright 2014 HSA Foundation. All Rights Reserved
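Because waits can time out and may return before the condition actually holds, callers re-test in a loop. A minimal C++ sketch of that pattern, assuming a runtime call in the style of hsa_signal_wait_acquire (the name, signature, and enum here are illustrative assumptions; the slide specifies only the supported conditions and the false-positive caveat):

#include <cstdint>

// Assumed runtime call: blocks (with low-power waiting where possible)
// until the signal value satisfies (value cond compare) or an
// implementation-defined timeout expires; returns the observed value.
enum class WaitCondition { Eq, Ne, Lt, Gte };
int64_t hsa_signal_wait_acquire(uint64_t signalHandle,
                                WaitCondition cond, int64_t compare);

// Wait until the signal reaches 0. The loop re-tests because, per the
// slide, waits may return sporadically and have a maximum duration.
void wait_until_zero(uint64_t signalHandle) {
    int64_t observed;
    do {
        observed = hsa_signal_wait_acquire(signalHandle,
                                           WaitCondition::Eq, 0);
    } while (observed != 0);
}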
  • 203. ALMOST THERE… [Diagram: the offload flow again; with signaling in place, job-completion notification no longer needs OS scheduling] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 205. ONE BLOCK LEFT [Diagram: the offload flow with a single remaining block — queueing a job still goes through the OS] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 206. USER MODE QUEUEING (1/3)  User mode Queueing  Enables user space applications to directly, without OS intervention, enqueue jobs (“Dispatch Packets”) for HSA agents.  Queues are created/destroyed via calls to the HSA runtime.  One (or many) agents enqueue packets, a single agent dequeues packets.  Requires coherency and shared virtual memory. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 207. USER MODE QUEUEING (2/3)  Advantages  Avoid involving the kernel/driver when dispatching work for an Agent.  Lower latency job dispatch enables finer granularity of offload  Standard memory protection mechanisms may be used to protect communication with the consuming agent.  Implications  Packet formats/fields are Architected – standard across vendors!  Guaranteed backward compatibility  Packets are enqueued/dequeued via an Architected protocol (all via memory accesses and signaling)  More on this later…… © Copyright 2014 HSA Foundation. All Rights Reserved
  • 208. SUCCESS! [Diagram: the original offload flow with every OS-mediated step eliminated] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 209. SUCCESS! [Diagram: the resulting flow — the application queues the job directly; the GPU starts and finishes it] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 211. ARCHITECTED QUEUEING LANGUAGE  HSA queues look just like standard shared memory queues, supporting multi-producer, single-consumer  A single-producer variant is defined, with some optimizations possible  Queues consist of storage, read/write indices, ID, etc.  Queues are created/destroyed via calls to the HSA runtime  “Packets” are placed in queues directly from user mode, via an architected protocol  Packet format is architected [Diagram: several producers and one consumer share packet storage in coherent shared memory, coordinated by a read index and a write index] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 212. ARCHITECTED QUEUING LANGUAGE  Packets are read and dispatched for execution from the queue in order, but may complete in any order.  There is no guarantee that more than one packet will be processed in parallel at a time  There may be many queues. A single agent may also consume from several queues.  Any HSA agent may enqueue packets  CPUs  GPUs  Other accelerators © Copyright 2014 HSA Foundation. All Rights Reserved
  • 213. QUEUE STRUCTURE
Offset (bytes) | Size (bytes) | Field | Notes
0 | 4 | queueType | Differentiates different queues
4 | 4 | queueFeatures | Indicates supported features
8 | 8 | baseAddress | Pointer to packet array
16 | 8 | doorbellSignal | HSA signaling object handle
24 | 4 | size | Packet array cardinality
28 | 4 | queueId | Unique per process
32 | 8 | serviceQueue | Queue for callback services
intrinsic | 8 | writeIndex | Packet array write index
intrinsic | 8 | readIndex | Packet array read index
© Copyright 2014 HSA Foundation. All Rights Reserved
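As a reading aid, here is a hypothetical C++ mirror of that layout; the field names and offsets follow the table above, while the struct name and the signal-handle typedef are assumptions, not the runtime's own declarations:

#include <cstdint>

// Hypothetical C++ mirror of the queue structure above.
using hsa_signal_handle_t = uint64_t; // assumed 8-byte opaque handle

struct AqlQueue {
    uint32_t queueType;                  // offset 0: MULTI or SINGLE
    uint32_t queueFeatures;              // offset 4: DISPATCH feature bits
    uint64_t baseAddress;                // offset 8: packet array pointer
    hsa_signal_handle_t doorbellSignal;  // offset 16: doorbell handle
    uint32_t size;                       // offset 24: cardinality (power of 2)
    uint32_t queueId;                    // offset 28: unique per process
    uint64_t serviceQueue;               // offset 32: callback-service queue
    // readIndex and writeIndex are intrinsic: reachable only through
    // HSA runtime APIs / HSAIL operations, not via this structure.
};

static_assert(sizeof(AqlQueue) == 40, "matches the architected offsets");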
  • 214. QUEUE VARIANTS  queueType and queueFeatures together define queue semantics and capabilities  Two queueType values defined, other values reserved:  MULTI – queue supports multiple producers  SINGLE – queue supports single producer  queueFeatures is a bitfield indicating capabilities  DISPATCH (bit 0) if set then queue supports DISPATCH packets  AGENT_DISPATCH (bit 1) if set then queue supports AGENT_DISPATCH packets  All other bits are reserved and must be 0 © Copyright 2014 HSA Foundation. All Rights Reserved
  • 215. QUEUE STRUCTURE DETAILS  Queue doorbells are HSA signaling objects with restrictions  Created as part of the queue – lifetime tied to the queue object  Atomic read-modify-write is not allowed  The size field value must be a power of 2  serviceQueue can be used by an HSA kernel for callback services  Provided by the application when the queue is created  Can be mapped to the HSA runtime provided serviceQueue, an application-serviced queue, or NULL if no serviceQueue is required © Copyright 2014 HSA Foundation. All Rights Reserved
  • 216. READ/WRITE INDICES  readIndex and writeIndex properties are part of the queue, but not visible in the queue structure  Accessed through HSA runtime API and HSAIL operations  HSA runtime/HSAIL operations defined to  Read readIndex or writeIndex property  Write readIndex or writeIndex property  Add constant to writeIndex property (returns previous writeIndex value)  CAS on writeIndex property  readIndex & writeIndex operations treated as atomic in memory model  relaxed, acquire, release and acquire-release variants defined as applicable  readIndex and writeIndex never wrap  PacketID – the index of a particular packet  Uniquely identifies each packet of a queue © Copyright 2014 HSA Foundation. All Rights Reserved
  • 217. PACKET ENQUEUE  Packet enqueue follows a few simple steps:  Reserve space  Multiple packets can be reserved at a time  Write packet to queue  Mark packet as valid  Producer no longer allowed to modify packet  Consumer is allowed to start processing packet  Notify consumer of packet through the queue doorbell  Multiple packets can be notified at a time  Doorbell signal should be signaled with last packetID notified  On small machine model the lower 32 bits of the packetID are used © Copyright 2014 HSA Foundation. All Rights Reserved
  • 218. PACKET RESERVATION  Two flows envisaged  Atomic add on writeIndex with the number of packets to reserve  Producer must wait until packetID < readIndex + size before writing the packet  Queue can be sized so that the wait is unlikely (or impossible)  Suitable when many threads use one queue  Check the queue is not full first, then use atomic CAS to update writeIndex (sketched below)  Can be inefficient if many threads use the same queue  Allows a different failure model if the queue is congested © Copyright 2014 HSA Foundation. All Rights Reserved
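A minimal sketch of the second, CAS-based flow; the Queue type here models the two intrinsic indices as ordinary C++ atomics purely for illustration, whereas a real producer would use the architected index operations:

#include <atomic>
#include <cstdint>

// Illustration only: the intrinsic indices modeled as C++ atomics.
struct Queue {
    uint32_t size;                        // power of 2
    std::atomic<uint64_t> readIndex{0};
    std::atomic<uint64_t> writeIndex{0};
};

// Check-then-CAS reservation of a single packet slot.
bool try_reserve_packet(Queue* q, uint64_t* packetID) {
    uint64_t wrIdx = q->writeIndex.load(std::memory_order_relaxed);
    uint64_t rdIdx = q->readIndex.load(std::memory_order_relaxed);
    if (wrIdx >= rdIdx + q->size)
        return false;                     // queue full right now
    // Succeeds only if no other producer advanced writeIndex meanwhile.
    if (!q->writeIndex.compare_exchange_strong(wrIdx, wrIdx + 1,
                                               std::memory_order_relaxed))
        return false;                     // lost the race; caller may retry
    *packetID = wrIdx;                    // the reserved slot's packetID
    return true;
}

The check-then-CAS shape is what allows the different failure model: a producer that loses the race or finds the queue full can fail over instead of blocking.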
  • 219. QUEUE OPTIMIZATIONS  Queue behavior is loosely defined to allow optimizations  Some potential producer behavior optimizations:  Keep local copy of readIndex, update when required  For single producer queues:  Keep local copy of writeIndex  Use store operation rather than add/cas atomic to update writeIndex  Some potential consumer behavior optimizations:  Use packet format field to determine whether a packet has been submitted rather than writeIndex property  Speculatively read multiple packets from the queue  Not update readIndex for each packet processed  Rely on value used for doorbellSignal to notify new packets  Especially useful for single producer queues © Copyright 2014 HSA Foundation. All Rights Reserved
  • 220. POTENTIAL MULTI-PRODUCER ALGORITHM
// Allocate packet
uint64_t packetID = hsa_queue_add_write_index_relaxed(q, 1);
// Wait until the queue is no longer full.
uint64_t rdIdx;
do {
    rdIdx = hsa_queue_load_read_index_relaxed(q);
} while (packetID >= (rdIdx + q->size));
// calculate index
uint32_t arrayIdx = packetID & (q->size - 1);
// copy over the packet; its format field is still INVALID
q->baseAddress[arrayIdx] = pkt;
// Update the format field with release semantics
q->baseAddress[arrayIdx].hdr.format.store(DISPATCH, std::memory_order_release);
// ring the doorbell (the release store above publishes the packet;
// the signal could also be amortized over multiple packets)
hsa_signal_send_relaxed(q->doorbellSignal, packetID);
© Copyright 2014 HSA Foundation. All Rights Reserved
  • 221. POTENTIAL CONSUMER ALGORITHM
// Get location of next packet
uint64_t readIndex = hsa_queue_load_read_index_relaxed(q);
// calculate the index
uint32_t arrayIdx = readIndex & (q->size - 1);
// spin while empty (could also perform a low-power wait on the doorbell)
while (INVALID == q->baseAddress[arrayIdx].hdr.format) { }
// copy over the packet
pkt = q->baseAddress[arrayIdx];
// set the format field to invalid
q->baseAddress[arrayIdx].hdr.format.store(INVALID, std::memory_order_relaxed);
// Update the readIndex using the HSA intrinsic
hsa_queue_store_read_index_relaxed(q, readIndex + 1);
// Now process <pkt>!
© Copyright 2014 HSA Foundation. All Rights Reserved
  • 223. PACKETS  Packets come in three main task types, all with architected layouts  Dispatch  Specifies kernel execution over a grid  Agent Dispatch  Specifies a single function to perform with a set of parameters  Barrier  Used for task dependencies  Two further formats, Always Reserved and Invalid, do not contain valid tasks and are not processed (the queue will not progress past them) © Copyright 2014 HSA Foundation. All Rights Reserved
  • 224. COMMON PACKET HEADER
Start offset (bytes) | Format | Field name | Description
0 | uint16_t | format:8 | Contains the packet type (Always Reserved, Invalid, Dispatch, Agent Dispatch, or Barrier). Other values are reserved and should not be used.
 | | barrier:1 | If set, processing of the packet will only begin when all preceding packets are complete.
 | | acquireFenceScope:2 | Determines the scope and type of the memory fence operation applied before the packet enters the active phase. Must be 0 for Barrier packets.
 | | releaseFenceScope:2 | Determines the scope and type of the memory fence operation applied after kernel completion but before the packet is completed.
 | | reserved:3 | Must be 0.
© Copyright 2014 HSA Foundation. All Rights Reserved
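The five fields pack into a single uint16_t. A hypothetical C++ rendering (the field widths come from the table; the in-word bit order of C++ bitfields is implementation-defined, so treat this as documentation of the layout, not a portable encoder):

#include <cstdint>

struct PacketHeader {
    uint16_t format : 8;             // packet type
    uint16_t barrier : 1;            // wait for all preceding packets
    uint16_t acquireFenceScope : 2;  // fence before the active phase
    uint16_t releaseFenceScope : 2;  // fence before completion
    uint16_t reserved : 3;           // must be 0
};

static_assert(sizeof(PacketHeader) == sizeof(uint16_t),
              "all fields pack into one uint16_t");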
  • 225. DISPATCH PACKET
Start offset (bytes) | Format | Field name | Description
0 | uint16_t | header | Packet header
2 | uint16_t | dimensions:2 | Number of dimensions specified in gridSize. Valid values are 1, 2, or 3.
 | | reserved:14 | Must be 0.
4 | uint16_t | workgroupSize.x | x dimension of work-group (measured in work-items).
6 | uint16_t | workgroupSize.y | y dimension of work-group (measured in work-items).
8 | uint16_t | workgroupSize.z | z dimension of work-group (measured in work-items).
10 | uint16_t | reserved2 | Must be 0.
12 | uint32_t | gridSize.x | x dimension of grid (measured in work-items).
16 | uint32_t | gridSize.y | y dimension of grid (measured in work-items).
20 | uint32_t | gridSize.z | z dimension of grid (measured in work-items).
24 | uint32_t | privateSegmentSizeBytes | Total size in bytes of private memory allocation request (per work-item).
28 | uint32_t | groupSegmentSizeBytes | Total size in bytes of group memory allocation request (per work-group).
32 | uint64_t | kernelObjectAddress | Address of an object in memory that includes an implementation-defined executable ISA image for the kernel.
40 | uint64_t | kernargAddress | Address of memory containing kernel arguments.
48 | uint64_t | reserved3 | Must be 0.
56 | uint64_t | completionSignal | Address of HSA signaling object used to indicate completion of the job.
© Copyright 2014 HSA Foundation. All Rights Reserved
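A hypothetical C++ mirror of this 64-byte layout, plus an example of filling it for a simple 1D launch (the struct and function names are assumptions; header, kernarg contents, and signal setup are omitted):

#include <cstdint>

// dimensions is really a 2-bit field with 14 reserved bits; a full
// uint16_t is used here for simplicity.
struct DispatchPacket {
    uint16_t header;                   // common packet header
    uint16_t dimensions;               // 1, 2, or 3
    uint16_t workgroupSize[3];         // x, y, z in work-items
    uint16_t reserved2;                // must be 0
    uint32_t gridSize[3];              // x, y, z in work-items
    uint32_t privateSegmentSizeBytes;  // per work-item
    uint32_t groupSegmentSizeBytes;    // per work-group
    uint64_t kernelObjectAddress;      // executable ISA image
    uint64_t kernargAddress;           // kernel arguments
    uint64_t reserved3;                // must be 0
    uint64_t completionSignal;         // signaled on completion
};
static_assert(sizeof(DispatchPacket) == 64, "architected packet size");

// Example: a 1D launch of 4096 work-items in work-groups of 256.
DispatchPacket make_1d_dispatch(uint64_t kernelObject, uint64_t kernargs) {
    DispatchPacket p{};
    p.dimensions = 1;
    p.workgroupSize[0] = 256; p.workgroupSize[1] = 1; p.workgroupSize[2] = 1;
    p.gridSize[0] = 4096;     p.gridSize[1] = 1;      p.gridSize[2] = 1;
    p.kernelObjectAddress = kernelObject;
    p.kernargAddress = kernargs;
    return p;
}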
  • 226. AGENT DISPATCH PACKET
Start offset (bytes) | Format | Field name | Description
0 | uint16_t | header | Packet header
2 | uint16_t | type | The function to be performed by the destination agent. The type value is split into the following ranges:  0x0000:0x3FFF – Vendor specific  0x4000:0x7FFF – HSA runtime  0x8000:0xFFFF – User registered function
4 | uint32_t | reserved2 | Must be 0.
8 | uint64_t | returnLocation | Pointer to the location to store the function return value in.
16 | uint64_t | arg[0] | 64-bit direct or indirect arguments.
24 | uint64_t | arg[1] |
32 | uint64_t | arg[2] |
40 | uint64_t | arg[3] |
48 | uint64_t | reserved3 | Must be 0.
56 | uint64_t | completionSignal | Address of HSA signaling object used to indicate completion of the job.
© Copyright 2014 HSA Foundation. All Rights Reserved
  • 227. BARRIER PACKET  Used for specifying dependences between packets  The HSA agent will not launch any further packets from this queue until the barrier packet’s signal conditions are met  Used for specifying dependences on packets dispatched from any queue  The execution phase completes only when all of the dependent signals (up to five) have been signaled with the value 0  Or when an error has occurred in one of the packets upon which we have a dependence © Copyright 2014 HSA Foundation. All Rights Reserved
  • 228. BARRIER PACKET
Start offset (bytes) | Format | Field name | Description
0 | uint16_t | header | Packet header, see 2.8.1 Packet header (p. 16).
2 | uint16_t | reserved2 | Must be 0.
4 | uint32_t | reserved3 | Must be 0.
8 | uint64_t | depSignal0 | Addresses of dependent signaling objects to be evaluated by the packet processor.
16 | uint64_t | depSignal1 |
24 | uint64_t | depSignal2 |
32 | uint64_t | depSignal3 |
40 | uint64_t | depSignal4 |
48 | uint64_t | reserved4 | Must be 0.
56 | uint64_t | completionSignal | Address of HSA signaling object used to indicate completion of the job.
© Copyright 2014 HSA Foundation. All Rights Reserved
  • 229. DEPENDENCES  A user may never assume more than one packet is being executed by an HSA agent at a time.  Implications:  Packets can’t poll on shared memory values which will be set by packets issued from other queues, unless the user has ensured the proper ordering.  To ensure all previous packets from a queue have been completed, use the Barrier bit.  To ensure specific packets from any queue have completed, use the Barrier packet. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 230. HSA QUEUEING, PACKET EXECUTION
  • 231. PACKET EXECUTION  Launch phase  Initiated when launch conditions are met  All preceding packets in the queue must have exited launch phase  If the barrier bit in the packet header is set, then all preceding packets in the queue must have exited completion phase  Includes memory acquire fence  Active phase  Execute the packet  Barrier packets remain in Active phase until conditions are met.  Completion phase  First step is memory release fence – make results visible.  completionSignal field is then signaled with a decrementing atomic. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 232. PACKET EXECUTION – BARRIER BIT [Timeline: Pkt1 and Pkt2 launch, execute, and complete, overlapping freely; Pkt3, which has barrier=1, launches only when all preceding packets in the queue have completed] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 233. PUTTING IT ALL TOGETHER (FFT) [Diagram: FFT dataflow over inputs X[0]..X[7]; packets 1 and 2 perform the first stage, then a barrier, packets 3 and 4 the next stage, another barrier, then packets 5 and 6] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 234. PUTTING IT ALL TOGETHER
AQL Pseudo Code
// Send the packets to do the first stage.
aql_dispatch(pkt1);
aql_dispatch(pkt2);
// Send the next two packets, setting the barrier bit so we
// know packets 1 & 2 will be complete before 3 and 4 are launched.
aql_dispatch_with_barrier_bit(pkt3);
aql_dispatch(pkt4);
// Same as above (make sure 3 & 4 are done before issuing 5 & 6).
aql_dispatch_with_barrier_bit(pkt5);
aql_dispatch(pkt6);
// This packet will notify us when 5 & 6 are complete.
aql_dispatch_with_barrier_bit(finish_pkt);
© Copyright 2014 HSA Foundation. All Rights Reserved
  • 235. PACKET EXECUTION – BARRIER PACKET [Timeline: queue Q1 holds a Barrier packet followed by T2; queue Q2 holds T1. Signal X is initialized to 1 and serves as both the barrier’s depSignal0 and T1’s completionSignal. When T1 completes it decrements signal X; the barrier completes when signal X is signaled with 0, and T2 launches once the barrier completes] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 236. DEPTH FIRST CHILD TASK EXECUTION  Consider two generations of child tasks  Task T submits tasks T.1 & T.2  Task T.1 submits tasks T.1.1 & T.1.2  Task T.2 submits tasks T.2.1 & T.2.2  Desired outcome  Depth-first child task execution  I.e. T → T.1 → T.1.1 → T.1.2 → T.2 → T.2.1 → T.2.2  T is passed a signal (allComplete) to decrement when all tasks are complete (T and its children etc.) [Diagram: task tree with T at the root, children T.1 and T.2, and grandchildren T.1.1, T.1.2, T.2.1, T.2.2] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 237. HOW TO DO THIS WITH HSA QUEUES?  Use a separate user mode queue for each recursion level  Task T submits to queue Q1  Tasks T.1 & T.2 submits tasks to queue Q2  Queues could be passed in as parameters to task T  Depth first requires ordering of T.1, T.2 and their children  Use additional signal object (childrenComplete) to track completion of the children of T.1 & T.2  childrenComplete set to number of children (i.e. 2) by each of T.1 & T.2 © Copyright 2014 HSA Foundation. All Rights Reserved
  • 238. A PICTURE SAYS MORE THAN 1000 WORDS [Diagram: queue Q1 holds T, a barrier after T.1, and a barrier after T.2; each barrier waits on childrenComplete, and the final one signals allComplete. Queue Q2 holds the children T.1.1, T.1.2, T.2.1, T.2.2.] A pseudo-AQL sketch of this arrangement follows below. © Copyright 2014 HSA Foundation. All Rights Reserved
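A compilable pseudo-AQL sketch of the picture above, in the spirit of the earlier FFT pseudo code; every helper here (signal_create, make_task, make_barrier_packet, aql_dispatch_to) is an assumption for illustration, not HSA runtime API:

#include <cstdint>

struct Queue;
struct Signal;
struct Packet { uint8_t bytes[64]; };   // stand-in for an AQL packet

Signal* signal_create(int64_t initialValue);
Packet  make_task(void (*fn)(Queue*, Signal*), Queue* childQueue,
                  Signal* childrenComplete);
Packet  make_barrier_packet(Signal* depSignal0, Signal* completionSignal);
void    aql_dispatch_to(Queue* q, Packet p);

// T.1 and T.2 each set childrenComplete to 2 when they run, then enqueue
// their two children on the child queue with childrenComplete as the
// children's completion signal.
void T1(Queue* q2, Signal* childrenComplete);
void T2(Queue* q2, Signal* childrenComplete);

// Orchestrate depth-first execution of T's subtree across two queues.
void run_T(Queue* Q1, Queue* Q2, Signal* allComplete) {
    Signal* childrenComplete = signal_create(0);
    aql_dispatch_to(Q1, make_task(&T1, Q2, childrenComplete));
    // Barrier: Q1 stalls until childrenComplete returns to 0,
    // i.e. T.1.1 and T.1.2 have finished.
    aql_dispatch_to(Q1, make_barrier_packet(childrenComplete, nullptr));
    aql_dispatch_to(Q1, make_task(&T2, Q2, childrenComplete));
    // The final barrier also signals allComplete once T.2's children finish.
    aql_dispatch_to(Q1, make_barrier_packet(childrenComplete, allComplete));
}

The barrier packets would additionally set the barrier bit in their headers so they are not evaluated before the preceding task has completed and set childrenComplete.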
  • 239. SUMMARY © Copyright 2014 HSA Foundation. All Rights Reserved
  • 240. KEY HSA TECHNOLOGIES  HSA combines several mechanisms to enable low overhead task dispatch  Shared Virtual Memory  System Coherency  Signaling  AQL  User mode queues – from any compatible agent  Architected packet format  Rich dependency mechanism  Flexible and efficient signaling of completion © Copyright 2014 HSA Foundation. All Rights Reserved
  • 241. QUESTIONS? © Copyright 2014 HSA Foundation. All Rights Reserved
  • 242. HSA APPLICATIONS WEN-MEI HWU, PROFESSOR, UNIVERSITY OF ILLINOIS WITH J.P. BORDES AND JUAN GOMEZ
  • 243. USE CASES SHOWING HSA ADVANTAGE
Programming technique | Use case | Description | HSA advantage
Pointer-based Data Structures | Binary tree searches | GPU performs parallel searches in a CPU-created binary tree. | CPU and GPU have access to the entire unified coherent memory. GPU can access existing data structures containing pointers.
Platform Atomics | Work-group dynamic task management | GPU directly operates on a task pool managed by the CPU, for algorithms with dynamic computation loads. | CPU and GPU can synchronize using platform atomics. Higher performance through parallel operations, reducing the need for data copying and reconciling.
 | Binary tree updates | CPU and GPU operate simultaneously on the tree, both making modifications. |
Large Data Sets | Hierarchical data searches | Applications include object recognition, collision detection, global illumination, BVH. | CPU and GPU have access to the entire unified coherent memory. GPU can operate on huge models in place, reducing copy and kernel launch overhead.
CPU Callbacks | Middleware user-callbacks | GPU processes work items, some of which require a call to a CPU function to fetch new data. | GPU can invoke CPU functions from within a GPU kernel. Simpler programming: does not require “split kernels”. Higher performance through parallel operations.
© Copyright 2014 HSA Foundation. All Rights Reserved
  • 244. UNIFIED COHERENT MEMORY FOR POINTER-BASED DATA STRUCTURES
  • 245–251. UNIFIED COHERENT MEMORY – MORE EFFICIENT POINTER DATA STRUCTURES (Legacy) [Animation: the CPU’s pointer-based tree lives in system memory, but the GPU kernel can only search a flattened copy (“flat tree”) in GPU memory; results land in a result buffer in GPU memory and are copied back to a result buffer in system memory] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 252–256. UNIFIED COHERENT MEMORY – MORE EFFICIENT POINTER DATA STRUCTURES (HSA and full OpenCL 2.0) [Animation: the GPU kernel traverses the CPU-created tree in place in system memory, following the L/R pointers directly, and writes to the result buffer in system memory; no flattened copy or staging in GPU memory is needed] © Copyright 2014 HSA Foundation. All Rights Reserved
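To make the contrast concrete, here is a minimal C++ sketch of the search the HSA kernel can perform: the GPU walks the very nodes the CPU allocated, because pointers mean the same thing on both agents (the Node layout and per-work-item framing are illustrative; the deck does not show the benchmark’s actual kernel):

#include <cstdint>

// A CPU-built binary search tree node, consumable by the GPU as-is
// under HSA because both agents share one virtual address space.
struct Node {
    int   key;
    Node* left;    // ordinary pointers stay valid on the GPU
    Node* right;
};

// One search, as each GPU work-item would perform it: no flattening,
// no index translation, just pointer chasing through system memory.
bool contains(const Node* root, int key) {
    for (const Node* n = root; n != nullptr; ) {
        if (key == n->key) return true;
        n = (key < n->key) ? n->left : n->right;
    }
    return false;
}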
  • 257. POINTER DATA STRUCTURES - CODE COMPLEXITY [Side-by-side source listings comparing the HSA and legacy implementations] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 258. POINTER DATA STRUCTURES - PERFORMANCE [Chart: binary tree search rate (nodes/ms) vs. tree size (1M, 5M, 10M, 25M nodes) for CPU (1 core), CPU (4 core), Legacy APU, and HSA APU] Measured in AMD labs Jan 1-3 on the system shown in the backup slide © Copyright 2014 HSA Foundation. All Rights Reserved
  • 259. PLATFORM ATOMICS FOR DYNAMIC TASK MANAGEMENT
  • 260–272. PLATFORM ATOMICS – ENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT (Legacy*) [Animation: the CPU fills a task pool in system memory and asynchronously transfers tasks into per-work-group queues in GPU memory; “num. written tasks” and “num. consumed tasks” counters also live in GPU memory, with work-groups 1–4 consuming tasks via atomic adds on the consumed counters, and counter values synchronized back to the host via zero-copy] *Chen et al., Dynamic load balancing on single- and multi-GPU systems, IPDPS 2010 © Copyright 2014 HSA Foundation. All Rights Reserved
  • 273–283. PLATFORM ATOMICS – ENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT (HSA and full OpenCL 2.0) [Animation: the task pool, the queues, and the “num. written/consumed tasks” counters all live in host coherent memory; the CPU writes tasks with an ordinary memcpy and the GPU work-groups consume them directly, synchronizing with platform atomic adds on the shared counters; GPU memory is not involved] © Copyright 2014 HSA Foundation. All Rights Reserved
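A minimal sketch of the consumer side of this scheme, modeling the shared counters as C++ atomics in host coherent memory (the queue layout and names follow the diagram rather than published code, and the empty-queue handling is simplified):

#include <atomic>
#include <cstdint>

// Task queue shared between the CPU producer and GPU work-group
// consumers in host coherent memory; platform atomics make the
// counters visible to both agents without copies.
struct TaskQueue {
    static constexpr uint32_t kCapacity = 4096;
    std::atomic<uint32_t> numWritten{0};   // advanced by the CPU
    std::atomic<uint32_t> numConsumed{0};  // advanced by work-groups
    uint64_t tasks[kCapacity];             // task descriptors
};

// One representative work-item per work-group claims the next task with
// a single platform atomic add. (Simplified: a real consumer would back
// off or return its claim when it overshoots numWritten.)
bool try_claim_task(TaskQueue* q, uint64_t* task) {
    uint32_t idx = q->numConsumed.fetch_add(1, std::memory_order_acquire);
    if (idx >= q->numWritten.load(std::memory_order_acquire))
        return false;                      // nothing available (yet)
    *task = q->tasks[idx % TaskQueue::kCapacity];
    return true;
}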
  • 284. PLATFORM ATOMICS – CODE COMPLEXITY  HSA: host enqueue function is 20 lines of code  Legacy: host enqueue function is 102 lines of code [Side-by-side source listings of the two implementations] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 285. PLATFORM ATOMICS - PERFORMANCE [Chart: execution time (ms) for varying tasks per insertion and task pool sizes, comparing the legacy implementation against the HSA implementation] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 287–289. PLATFORM ATOMICS – ENABLING EFFICIENT GPU/CPU COLLABORATION (Legacy) [Animation: only the GPU kernel can work on the input buffer and the tree; concurrent CPU/GPU processing is not possible] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 292. UNIFIED COHERENT MEMORY FOR LARGE DATA SETS
  • 293. PROCESSING LARGE DATA SETS  The CPU creates a large data structure in system memory.  Computations using the data are offloaded to the GPU. [Diagram: system memory and the GPU] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 294. PROCESSING LARGE DATA SETS [Diagram: a large 3D spatial data structure organized as a hierarchy, Level 1 through Level 5, in system memory, with the GPU alongside]  The CPU creates a large data structure in system memory.  Computations using the data are offloaded to the GPU.  Compare HSA and legacy methods © Copyright 2014 HSA Foundation. All Rights Reserved
  • 295. LEGACY ACCESS USING GPU MEMORY (Legacy)  GPU memory is smaller than system memory  Have to copy and process the structure in chunks [Diagram: system memory, the GPU, and its smaller GPU memory] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 296–311. LEGACY ACCESS TO LARGE STRUCTURES – COPY AND PROCESS ONE CHUNK AT A TIME (Legacy) [Animation: the top two levels of the hierarchy are copied into GPU memory and processed by a first kernel; then the bottom three levels of one branch are copied in and processed by a second kernel; this copy-then-process cycle repeats branch by branch through an Nth kernel] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 312–317. GPU CAN TRAVERSE ENTIRE HIERARCHY (HSA and full OpenCL 2.0) [Animation: a single kernel traverses all five levels of the large 3D spatial data structure in place in system memory; no chunked copies and no intermediate kernel launches are needed] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 319. CALLBACKS – A COMMON SITUATION IN HETEROGENEOUS COMPUTING  Parallel processing algorithm with branches  A seldom-taken branch requires new data from the CPU  On legacy systems, the algorithm must be split:  Process Kernel 1 on the GPU  Check for CPU callbacks and, if any, process them on the CPU  Process Kernel 2 on the GPU  Example algorithm from image processing  Perform a filter  Calculate the average luma in each tile  Compare the luma against a threshold and call the CPU callback if exceeded (rare)  Perform special processing on tiles with callbacks [Diagram: input image and output image] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 320. CALLBACKS (Legacy) [Diagram: GPU threads 0..N; the kernel must end so the CPU can service the callbacks, and a continuation kernel finishes up the work, which results in poor GPU utilization] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 321. CALLBACKS  Input image: 1 tile = 1 OpenCL work-item; output image  GPU: work-items compute the average RGB value of all the pixels in a tile  Work-items also compute the average luma from the average RGB  If the average luma > threshold, the work-group invokes the CPU CALLBACK  In parallel with the callback, computation continues  CPU: for selected tiles, update the average luma value (set to RED)  GPU: work-items apply the luma value to all pixels in the tile  GPU-to-CPU callbacks use Shared Virtual Memory (SVM) semaphores, implemented using platform atomic compare-and-swap (a sketch follows below). © Copyright 2014 HSA Foundation. All Rights Reserved
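A hedged sketch of such an SVM semaphore in C++ atomics; the four-state mailbox and the CPU service loop are assumptions, with platform atomic compare-and-swap as the claiming primitive, as the slide states:

#include <atomic>
#include <cstdint>

// One callback mailbox in shared virtual memory, visible to both agents.
enum MailboxState : uint32_t { FREE = 0, CLAIMED, REQUESTED, DONE };

struct CallbackSlot {
    std::atomic<uint32_t> state{FREE};
    uint32_t tileIndex = 0;   // request payload (which tile needs service)
    float    newLuma   = 0;   // reply payload written by the CPU
};

// GPU side (one work-group): claim the slot with CAS, publish the
// request, then wait for the CPU's reply. Real code could keep
// computing other tiles instead of spinning.
bool request_callback(CallbackSlot* s, uint32_t tile, float* lumaOut) {
    uint32_t expected = FREE;
    if (!s->state.compare_exchange_strong(expected, CLAIMED,
                                          std::memory_order_acquire))
        return false;                              // slot busy; try another
    s->tileIndex = tile;                           // write payload first
    s->state.store(REQUESTED, std::memory_order_release); // then publish
    while (s->state.load(std::memory_order_acquire) != DONE) { /* spin */ }
    *lumaOut = s->newLuma;
    s->state.store(FREE, std::memory_order_release);
    return true;
}

// CPU service thread: poll for requests and answer them.
void service_loop(CallbackSlot* s, const std::atomic<bool>& stop) {
    while (!stop.load(std::memory_order_relaxed)) {
        if (s->state.load(std::memory_order_acquire) == REQUESTED) {
            s->newLuma = 0.0f;   // e.g. fetch or compute the new data
            s->state.store(DONE, std::memory_order_release);
        }
    }
}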
  • 322. CALLBACKS (HSA and full OpenCL 2.0) [Diagram: GPU threads 0..N; the few kernel threads that need CPU callback services are serviced immediately while the rest of the kernel keeps running] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 323. SUMMARY - HSA ADVANTAGE
Programming technique | Use case | Description | HSA advantage
Pointer-based Data Structures | Binary tree searches | GPU performs parallel searches in a CPU-created binary tree. | CPU and GPU have access to the entire unified coherent memory. GPU can access existing data structures containing pointers.
Platform Atomics | Work-group dynamic task management | GPU directly operates on a task pool managed by the CPU, for algorithms with dynamic computation loads. | CPU and GPU can synchronize using platform atomics. Higher performance through parallel operations, reducing the need for data copying and reconciling.
 | Binary tree updates | CPU and GPU operate simultaneously on the tree, both making modifications. |
Large Data Sets | Hierarchical data searches | Applications include object recognition, collision detection, global illumination, BVH. | CPU and GPU have access to the entire unified coherent memory. GPU can operate on huge models in place, reducing copy and kernel launch overhead.
CPU Callbacks | Middleware user-callbacks | GPU processes work items, some of which require a call to a CPU function to fetch new data. | GPU can invoke CPU functions from within a GPU kernel. Simpler programming: does not require “split kernels”. Higher performance through parallel operations.
© Copyright 2014 HSA Foundation. All Rights Reserved
  • 325. HSA COMPILATION WEN-MEI HWU, CTO, MULTICOREWARE INC WITH RAY I-JUI SUNG
  • 326. KEY HSA FEATURES FOR COMPILATION  ALL-PROCESSORS-EQUAL  GPU and CPU have equal flexibility to create and dispatch work items  EQUAL ACCESS TO ENTIRE SYSTEM MEMORY  GPU and CPU have uniform visibility into the entire memory space [Diagram: CPU and GPU sharing a single dispatch path and unified coherent memory] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 327. A QUICK REVIEW OF OPENCL CURRENT STATE OF PORTABLE HETEROGENEOUS PARALLEL PROGRAMMING
  • 328. DEVICE CODE IN OPENCL – SIMPLE MATRIX MULTIPLICATION
__kernel void matrixMul(__global float* C,
                        __global float* A,
                        __global float* B,
                        int wA, int wB)
{
    int tx = get_global_id(0);
    int ty = get_global_id(1);
    float value = 0;
    for (int k = 0; k < wA; ++k) {
        float elementA = A[ty * wA + k];
        float elementB = B[k * wB + tx];
        value += elementA * elementB;
    }
    C[ty * wB + tx] = value;  // row stride of C is wB
}
Explicit thread index usage. Reasonably readable. Portable across CPUs, GPUs, and FPGAs. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 329. HOST CODE IN OPENCL – CONCEPTUAL
1. Allocate and initialize memory on the host side
2. Initialize OpenCL
3. Allocate device memory and move the data
4. Load and build the device code
5. Launch the kernel
   a. Set the kernel arguments
6. Move the data back from the device
© Copyright 2014 HSA Foundation. All Rights Reserved
  • 330.
int main(int argc, char** argv)
{
    // set seed for rand()
    srand(2006);

    /* 1. Allocate and initialize memory on the host side */
    // allocate and initialize host memory for matrices A and B
    unsigned int size_A = WA * HA;
    unsigned int mem_size_A = sizeof(float) * size_A;
    float* h_A = (float*) malloc(mem_size_A);
    unsigned int size_B = WB * HB;
    unsigned int mem_size_B = sizeof(float) * size_B;
    float* h_B = (float*) malloc(mem_size_B);
    randomInit(h_A, size_A);
    randomInit(h_B, size_B);
    // allocate host memory for the result C
    unsigned int size_C = WC * HC;
    unsigned int mem_size_C = sizeof(float) * size_C;
    float* h_C = (float*) malloc(mem_size_C);

    /* 2. Initialize OpenCL */
    cl_context clGPUContext;
    cl_command_queue clCommandQue;
    cl_program clProgram;
    cl_kernel clKernel;   // (declaration was missing in the original listing)
    size_t dataBytes;
    size_t kernelLength;
    cl_int errcode;
    // OpenCL device memory pointers for matrices
    cl_mem d_A;
    cl_mem d_B;
    cl_mem d_C;
    clGPUContext = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU, NULL, NULL, &errcode);
    shrCheckError(errcode, CL_SUCCESS);
    // get the list of GPU devices associated with the context
    errcode = clGetContextInfo(clGPUContext, CL_CONTEXT_DEVICES, 0, NULL, &dataBytes);
    cl_device_id* clDevices = (cl_device_id*) malloc(dataBytes);
    errcode |= clGetContextInfo(clGPUContext, CL_CONTEXT_DEVICES, dataBytes, clDevices, NULL);
    shrCheckError(errcode, CL_SUCCESS);
    // create a command queue
    clCommandQue = clCreateCommandQueue(clGPUContext, clDevices[0], 0, &errcode);
    shrCheckError(errcode, CL_SUCCESS);

    /* 3. Allocate device memory and move data */
    d_C = clCreateBuffer(clGPUContext, CL_MEM_READ_WRITE, mem_size_C, NULL, &errcode); // (was mem_size_A)
    d_A = clCreateBuffer(clGPUContext, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, mem_size_A, h_A, &errcode);
    d_B = clCreateBuffer(clGPUContext, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, mem_size_B, h_B, &errcode);

    /* 4. Load and build the OpenCL kernel */
    char* clMatrixMul = oclLoadProgSource("kernel.cl", "// My comment\n", &kernelLength);
    shrCheckError(clMatrixMul != NULL, shrTRUE);
    clProgram = clCreateProgramWithSource(clGPUContext, 1, (const char**)&clMatrixMul, &kernelLength, &errcode);
    shrCheckError(errcode, CL_SUCCESS);
    errcode = clBuildProgram(clProgram, 0, NULL, NULL, NULL, NULL);
    shrCheckError(errcode, CL_SUCCESS);
    clKernel = clCreateKernel(clProgram, "matrixMul", &errcode);
    shrCheckError(errcode, CL_SUCCESS);

    /* 5. Launch the OpenCL kernel */
    size_t localWorkSize[2], globalWorkSize[2];
    int wA = WA;
    int wC = WC;
    errcode  = clSetKernelArg(clKernel, 0, sizeof(cl_mem), (void*)&d_C);
    errcode |= clSetKernelArg(clKernel, 1, sizeof(cl_mem), (void*)&d_A);
    errcode |= clSetKernelArg(clKernel, 2, sizeof(cl_mem), (void*)&d_B);
    errcode |= clSetKernelArg(clKernel, 3, sizeof(int), (void*)&wA);
    errcode |= clSetKernelArg(clKernel, 4, sizeof(int), (void*)&wC);
    shrCheckError(errcode, CL_SUCCESS);
    localWorkSize[0] = 16;    localWorkSize[1] = 16;
    globalWorkSize[0] = 1024; globalWorkSize[1] = 1024;
    errcode = clEnqueueNDRangeKernel(clCommandQue, clKernel, 2, NULL, globalWorkSize, localWorkSize, 0, NULL, NULL);
    shrCheckError(errcode, CL_SUCCESS);

    /* 6. Retrieve the result from the device */
    errcode = clEnqueueReadBuffer(clCommandQue, d_C, CL_TRUE, 0, mem_size_C, h_C, 0, NULL, NULL);
    shrCheckError(errcode, CL_SUCCESS);

    /* 7. Clean up memory */
    free(h_A);  free(h_B);  free(h_C);
    clReleaseMemObject(d_A);
    clReleaseMemObject(d_C);
    clReleaseMemObject(d_B);
    free(clDevices);
    free(clMatrixMul);
    clReleaseContext(clGPUContext);
    clReleaseKernel(clKernel);
    clReleaseProgram(clProgram);
    clReleaseCommandQueue(clCommandQue);
}
Almost 100 lines of code – tedious and hard to maintain. It does not take advantage of HSA features, and it will likely need to be changed for OpenCL 2.0.
  • 331. COMPARING SEVERAL HIGH-LEVEL PROGRAMMING INTERFACES
 C++AMP – C++ language extension proposed by Microsoft
 Thrust – CUDA library proposed by NVIDIA
 Bolt – library proposed by AMD
 OpenACC – annotations and pragmas proposed by PGI
 SYCL – C++ wrapper for OpenCL
All these proposals aim to reduce tedious boilerplate code and provide transparent porting to future systems (future proofing). © Copyright 2014 HSA Foundation. All Rights Reserved
  • 332. OPENACC HSA ENABLES SIMPLER IMPLEMENTATION OR BETTER OPTIMIZATION © Copyright 2014 HSA Foundation. All Rights Reserved
  • 333. OPENACC – SIMPLE MATRIX MULTIPLICATION EXAMPLE
void MatrixMulti(float *C, const float *A, const float *B, int hA, int wA, int wB)
{
    #pragma acc parallel loop copyin(A[0:hA*wA]) copyin(B[0:wA*wB]) copyout(C[0:hA*wB])
    for (int i = 0; i < hA; i++) {
        #pragma acc loop
        for (int j = 0; j < wB; j++) {
            float sum = 0;
            for (int k = 0; k < wA; k++) {
                float a = A[i*wA+k];
                float b = B[k*wB+j];
                sum += a*b;
            }
            C[i*wB+j] = sum;   // (was C[i*Nw+j]; Nw is undeclared)
        }
    }
}
 Little host code overhead
 Programmer annotation of the kernel computation (the acc parallel loop / acc loop pragmas)
 Programmer annotation of data movement (the copyin/copyout clauses)
© Copyright 2014 HSA Foundation. All Rights Reserved
  • 334. ADVANTAGE OF HSA FOR OPENACC  Flexibility in copyin and copyout implementation  Flexible code generation for nested acc parallel loops  E.g., inner loop bounds that depend on outer loop iterations  Compiler data affinity optimization (especially OpenACC kernel regions)  The compiler does not have to undo programmer managed data transfers © Copyright 2014 HSA Foundation. All Rights Reserved
  • 335. C++AMP HSA ENABLES EFFICIENT COMPILATION OF AN EVEN HIGHER LEVEL OF PROGRAMMING INTERFACE © Copyright 2014 HSA Foundation. All Rights Reserved
  • 336. C++ AMP ● C++ Accelerated Massive Parallelism ● Designed for data level parallelism ● Extension of C++11 proposed by Microsoft ● An open specification with multiple implementations aiming at standardization ● MS Visual Studio 2013 ● MulticoreWare CLAMP ● GPU data modeled as C++14-like containers for multidimensional arrays ● GPU kernels modeled as C++11 lambda ● Minimal extension to C++ for simplicity and future proofing © Copyright 2014 HSA Foundation. All Rights Reserved
  • 337. MATRIX MULTIPLICATION IN C++AMP
void MultiplyWithAMP(int* aMatrix, int* bMatrix, int* productMatrix,
                     int ha, int hb, int hc) {
    array_view<int, 2> a(ha, hb, aMatrix);
    array_view<int, 2> b(hb, hc, bMatrix);
    array_view<int, 2> product(ha, hc, productMatrix);
    parallel_for_each(
        product.extent,
        [=](index<2> idx) restrict(amp) {
            int row = idx[0];
            int col = idx[1];
            for (int inner = 0; inner < hb; inner++) {  // (was < 2; hb is the inner dimension)
                product[idx] += a(row, inner) * b(inner, col);
            }
        }
    );
    product.synchronize();
}
(Shown side by side with the OpenCL context/queue setup and matrixMul kernel from the earlier OpenCL slides, repeated for comparison.)
© Copyright 2014 HSA Foundation. All Rights Reserved
  • 338. C++AMP PROGRAMMING MODEL
void MultiplyWithAMP(int* aMatrix, int* bMatrix, int* productMatrix) {
    array_view<int, 2> a(3, 2, aMatrix);
    array_view<int, 2> b(2, 3, bMatrix);
    array_view<int, 2> product(3, 3, productMatrix);
    parallel_for_each(
        product.extent,
        [=](index<2> idx) restrict(amp) {
            int row = idx[0];
            int col = idx[1];
            for (int inner = 0; inner < 2; inner++) {
                product[idx] += a(row, inner) * b(inner, col);
            }
        }
    );
    product.synchronize();
}
Callout: GPU data modeled as data containers (array_view).
© Copyright 2014 HSA Foundation. All Rights Reserved
  • 339. C++AMP PROGRAMMING MODEL (same code as the previous slide)  Callout: kernels are modeled as lambdas; arguments are implicitly modeled as captured variables, so the programmer does not need to specify copyin and copyout. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 340. C++AMP PROGRAMMING MODEL (same code as the previous slide)  Callout: parallel_for_each is the execution interface, marking an implicitly parallel region for GPU execution. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 341. MCW C++AMP (CLAMP) ● Runs on Linux and Mac OS X ● Output code compatible with all major OpenCL stacks: AMD, Apple/Intel (OS X), NVIDIA and even POCL ● Clang/LLVM-based, open source o Translate C++AMP code to OpenCL C or OpenCL 1.2 SPIR o With template helper library ● Runtime: OpenCL 1.1/HSA Runtime and GMAC for non-HSA systems ● One of the two C++ AMP implementations recognized by HSA foundation © Copyright 2014 HSA Foundation. All Rights Reserved
  • 342. MCW C++ AMP COMPILER
● Device path
  o generate OpenCL C code and SPIR
  o emit the kernel function
● Host path
  o preparation to launch the code
[Diagram: C++ AMP source code goes through Clang/LLVM 3.3, producing device code and host code]
© Copyright 2014 HSA Foundation. All Rights Reserved
  • 343. TRANSLATION
C++AMP source:
parallel_for_each(product.extent, [=](index<2> idx) restrict(amp) {
    int row = idx[0];
    int col = idx[1];
    for (int inner = 0; inner < 2; inner++) {
        product[idx] += a(row, inner) * b(inner, col);
    }
});
Generated OpenCL kernel:
__kernel void matrixMul(__global float* C, __global float* A, __global float* B,
                        int wA, int wB) {
    int tx = get_global_id(0);
    int ty = get_global_id(1);
    float value = 0;
    for (int k = 0; k < wA; ++k) {
        float elementA = A[ty * wA + k];
        float elementB = B[k * wB + tx];
        value += elementA * elementB;
    }
    C[ty * wB + tx] = value;
}
● Append the arguments
● Set the index
● Emit the kernel function
● Implicit memory management
© Copyright 2014 HSA Foundation. All Rights Reserved
  • 344. EXECUTION ON NON-HSA OPENCL PLATFORMS [Diagram: C++ AMP source code is compiled by Clang/LLVM 3.3 into device code and host code; at runtime the host code runs on GMAC and OpenCL. “Our work” marks the compiler path and the GMAC layer.] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 345. GMAC
● Unified virtual address space in software
● Can have high overhead sometimes
● In HSA (e.g., AMD Kaveri), GMAC is no longer needed
Gelado, et al., ASPLOS 2010
© Copyright 2014 HSA Foundation. All Rights Reserved
  • 346. CASE STUDY: BINOMIAL OPTION PRICING  Lines of code [Chart: lines of code counted by cloc, split into host and kernel portions, for C++AMP vs. OpenCL] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 347. PERFORMANCE ON NON-HSA SYSTEMS – BINOMIAL OPTION PRICING [Chart: time in seconds for total GPU time and kernel-only time on an NV Tesla C2050, comparing OpenCL and C++AMP] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 348. EXECUTION ON HSA [Diagram: at compile time, C++ AMP source code goes through Clang/LLVM 3.3 to produce device SPIR and host SPIR; at runtime both execute on the HSA Runtime] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 349. WHAT DO WE NEED TO DO?
● Kernel function
  o emit the kernel function with the required arguments
● On the host side
  o a function that recursively traverses the captured object and appends the arguments to the OpenCL stack
● On the device side
  o reconstruct the object in the device code for future use
© Copyright 2014 HSA Foundation. All Rights Reserved
  • 350. WHY COMPILING C++AMP TO OPENCL IS NOT TRIVIAL
● C++AMP → LLVM IR → OpenCL C or SPIR
● Argument passing (lambda capture vs. function calls)
● Explicit vs. implicit memory transfer
● The heavy lifting is done by the compiler and the runtime
© Copyright 2014 HSA Foundation. All Rights Reserved
  • 351. EXAMPLE
struct A { int a; };
struct B : A { int b; };
struct C { B b; int c; };

struct C c;
c.c = 100;
auto fn = [=] () { int qq = c.c; };
© Copyright 2014 HSA Foundation. All Rights Reserved
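Capturing c by value drags the whole nested object into the lambda, but OpenCL kernels take flat argument lists. A hypothetical illustration of the flattening the compiler must perform (traversal order and naming are assumptions; CLAMP's actual generated code is not shown in this deck):

#include <cstdint>

struct A { int a; };
struct B : A { int b; };
struct C { B b; int c; };

// Host side: recursively traverse the captured object and hand each
// leaf to the argument-appending callback, one kernel argument per leaf.
// (Illustrative; the real compiler derives this walk from the Clang AST.)
template <typename SetArg>   // SetArg: void(int argIndex, int value)
void append_args(const C& c, SetArg setArg) {
    int i = 0;
    setArg(i++, c.b.a);   // A::a (base subobject of B)
    setArg(i++, c.b.b);   // B::b
    setArg(i++, c.c);     // C::c
}

// The device side would declare matching scalar parameters, e.g.
//   __kernel void fn(int c_b_a, int c_b_b, int c_c)
// and reconstruct a struct C from them before running the lambda body.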
  • 352. TRANSLATION parallel_for_each(product.extent, [=](index<2> idx) restrict(amp) { int row = idx[0]; int col = idx[1]; for (int inner = 0; inner < 2; inner++) { product[idx] += a(row, inner) * b(inner, col); } }); __kernel void matrixMul(__global float* C, __global float* A, __global float* B, int wA, int wB){ int tx = get_global_id(0); int ty = get_global_id(1); float value = 0; for (int k = 0; k < wA; ++k) { float elementA = A[ty * wA + k]; float elementB = B[k * wB + tx]; value += elementA * elementB; } C[ty * wA + tx] = value;} ● Compiler ● Turn captured variables into OpenCL arguments ● Populate the index<N> in OCL kernel ● Runtime ● Implicit memory management © Copyright 2014 HSA Foundation. All Rights Reserved
  • 353. QUESTIONS? © Copyright 2014 HSA Foundation. All Rights Reserved