HETEROGENEOUS SYSTEM
ARCHITECTURE (HSA): ARCHITECTURE
AND ALGORITHMS
ISCA TUTORIAL - JUNE 15, 2014
TOPICS
 Introduction
 HSAIL Virtual Parallel ISA
 HSA Runtime
 HSA Memory Model
 HSA Queuing Model
 HSA Applications
 HSA Compilation
© Copyright 2014 HSA Foundation. All Rights Reserved
The HSA Specifications are not yet at 1.0 Final, so all content is subject to change
SCHEDULE
© Copyright 2014 HSA Foundation. All Rights Reserved
Time Topic Speaker
8:45am Introduction to HSA Phil Rogers, AMD
9:30am HSAIL Virtual Parallel ISA Ben Sander, AMD
10:30am Break
10:50am HSA Runtime Yeh-Ching Chung, National Tsing Hua University
12 noon Lunch
1pm HSA Memory Model Benedict Gaster, Qualcomm
2pm HSA Queuing Model Hakan Persson, ARM
3pm Break
3:15pm HSA Compilation Technology Wen Mei Hwu, University of Illinois
4pm HSA Application Programming Wen Mei Hwu, University of Illinois
4:45pm Questions All presenters
INTRODUCTION
PHIL ROGERS, AMD CORPORATE FELLOW &
PRESIDENT OF HSA FOUNDATION
HSA FOUNDATION
 Founded in June 2012
 Developing a new platform for heterogeneous
systems
 www.hsafoundation.com
 Specifications under development in working
groups to define the platform
 Membership consists of 43 companies and 16
universities
 Adding 1-2 new members each month
© Copyright 2014 HSA Foundation. All Rights Reserved
DIVERSE PARTNERS DRIVING FUTURE OF
HETEROGENEOUS COMPUTING
© Copyright 2014 HSA Foundation. All Rights Reserved
Founders
Promoters
Supporters
Contributors
Academic
MEMBERSHIP TABLE
Founder (6): AMD, ARM, Imagination Technologies, MediaTek Inc., Qualcomm Inc., Samsung Electronics Co Ltd
Promoter (1): LG Electronics
Contributor (25): Analog Devices Inc., Apical, Broadcom, Canonical Limited, CEVA Inc., Digital Media Professionals, Electronics and Telecommunications Research Institute (ETRI), General Processor, Huawei, Industrial Technology Res. Institute, Marvell International Ltd., Mobica, Oracle, Sonics, Inc., Sony Mobile Communications, Swarm64 GmbH, Synopsys, Tensilica, Inc., Texas Instruments Inc., Toshiba, VIA Technologies, Vivante Corporation
Supporter (13): Allinea Software Ltd, Arteris Inc., Codeplay Software, Fabric Engine, Kishonti, Lawrence Livermore National Laboratory, Linaro, MultiCoreWare, Oak Ridge National Laboratory, Sandia Corporation, StreamComputing, SUSE LLC, UChicago Argonne LLC (Operator of Argonne National Laboratory)
Academic (17): Institute for Computing Systems Architecture, Missouri University of Science & Technology, National Tsing Hua University, NMAM Institute of Technology, Northeastern University, Rice University, Seoul National University, System Software Lab (National Tsing Hua University), Tampere University of Technology, TEI of Crete, The University of Mississippi, University of North Texas, University of Bologna, University of Bristol Microelectronic Research Group, University of Edinburgh, University of Illinois at Urbana-Champaign Department of Computer Science
© Copyright 2014 HSA Foundation. All Rights Reserved
HETEROGENEOUS PROCESSORS HAVE
PROLIFERATED — MAKE THEM BETTER
 Heterogeneous SOCs have arrived and are a
tremendous advance over previous platforms
 SOCs combine CPU cores, GPU cores and
other accelerators, with high bandwidth access
to memory
 How do we make them even better?
 Easier to program
 Easier to optimize
 Higher performance
 Lower power
 HSA unites accelerators architecturally
 Early focus on the GPU compute accelerator,
but HSA will go well beyond the GPU
© Copyright 2014 HSA Foundation. All Rights Reserved
INFLECTIONS IN PROCESSOR DESIGN
© Copyright 2014 HSA Foundation. All Rights Reserved
[Chart: three inflections in processor design, each plotted as performance over time]
 Single-Core Era (single-thread performance over time)
 Enabled by: Moore's Law, voltage scaling
 Constrained by: power, complexity
 Multi-Core Era (throughput performance vs. number of processors)
 Enabled by: Moore's Law, SMP architecture
 Constrained by: power, parallel SW, scalability
 Heterogeneous Systems Era (modern application performance vs. data-parallel exploitation) (we are here)
 Enabled by: abundant data parallelism, power-efficient GPUs
 Temporarily constrained by: programming models, communication overhead
Programming models per era: Assembly → C/C++ → Java …; pthreads → OpenMP / TBB …; Shader → CUDA → OpenCL → C++ and Java
LEGACY GPU COMPUTE
[Diagram: CPU cores sharing coherent system memory; a GPU with compute units (CUs) and non-coherent GPU memory; the two connected over PCIe™]
The limiters:
 Multiple memory pools
 Multiple address spaces
 High overhead dispatch
 Data copies across PCIe
 New languages for programming
 Dual source development
 Proprietary environments
 Expert programmers only
 Need to fix all of this to unleash our programmers
© Copyright 2014 HSA Foundation. All Rights Reserved
EXISTING APUS AND SOCS
[Diagram: CPUs 1..N with coherent system memory and GPU compute units 1..M with non-coherent GPU memory, physically integrated on one die]
 Physical Integration
 Good first step
 Some copies gone
 Two memory pools remain
 Still queue through the OS
 Still requires expert
programmers
 Need to finish the job
AN HSA ENABLED SOC
 Unified Coherent
Memory enables
data sharing across
all processors
 Processors
architected to
operate
cooperatively
 Designed to enable
the application to
run on different
processors at
different times
[Diagram: CPUs 1..N and compute units 1..M all sharing unified coherent memory]
PILLARS OF HSA*
 Unified addressing across all processors
 Operation into pageable system memory
 Full memory coherency
 User mode dispatch
 Architected queuing language
 Scheduling and context switching
 HSA Intermediate Language (HSAIL)
 High level language support for GPU compute processors
© Copyright 2014 HSA Foundation. All Rights Reserved
* All features of HSA are subject to change, pending ratification of 1.0 Final specifications by the HSA Board of Directors
HSA SPECIFICATIONS
 HSA System Architecture Specification
 Version 1.0 Provisional, Released April 2014
 Defines discovery, memory model, queue management, atomics, etc.
 HSA Programmers Reference Specification
 Version 1.0 Provisional, Released June 2014
 Defines the HSAIL language and object format
 HSA Runtime Software Specification
 Version 1.0 Provisional, expected to be released in July 2014
 Defines the APIs through which an HSA application uses the platform
 All released specifications can be found at the HSA Foundation web site:
 www.hsafoundation.com/standards
© Copyright 2014 HSA Foundation. All Rights Reserved
HSA - AN OPEN PLATFORM
 Open Architecture, membership open to all
 HSA Programmers Reference Manual
 HSA System Architecture
 HSA Runtime
 Delivered via royalty free standards
 Royalty Free IP, Specifications and APIs
 ISA agnostic for both CPU and GPU
 Membership from all areas of computing
 Hardware companies
 Operating Systems
 Tools and Middleware
 Applications
 Universities
© Copyright 2014 HSA Foundation. All Rights Reserved
HSA INTERMEDIATE LAYER — HSAIL
 HSAIL is a virtual ISA for parallel programs
 Finalized to ISA by a JIT compiler or “Finalizer”
 ISA independent by design for CPU & GPU
 Explicitly parallel
 Designed for data parallel programming
 Support for exceptions, virtual functions,
and other high level language features
 Lower level than OpenCL SPIR
 Fits naturally in the OpenCL compilation stack
 Suitable to support additional high level languages and programming models:
 Java, C++, OpenMP, Python, etc.
© Copyright 2014 HSA Foundation. All Rights Reserved
HSA MEMORY MODEL
 Defines visibility ordering between all
threads in the HSA System
 Designed to be compatible with
C++11, Java, OpenCL and .NET
Memory Models
 Relaxed consistency memory model
for parallel compute performance
 Visibility controlled by:
 Load.Acquire
 Store.Release
 Fences
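A minimal C11 sketch of the same acquire/release idiom (the HSA model is designed to be compatible with the C++11/C11 memory model; the names below are illustrative, not HSA APIs):

#include <stdatomic.h>

int payload;            // ordinary, non-atomic data
atomic_int flag = 0;    // publication flag

void producer(void) {
    payload = 42;       // plain store
    atomic_store_explicit(&flag, 1, memory_order_release);  // Store.Release publishes payload
}

void consumer(void) {
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;               // Load.Acquire: spin until published
    // payload is guaranteed to read 42 here
}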
© Copyright 2014 HSA Foundation. All Rights Reserved
HSA QUEUING MODEL
 User mode queuing for low latency dispatch
 Application dispatches directly
 No OS or driver required in the dispatch path
 Architected Queuing Layer
 Single compute dispatch path for all hardware
 No driver translation, direct to hardware
 Allows for dispatch to queue from any agent
 CPU or GPU
 GPU self enqueue enables lots of solutions
 Recursion
 Tree traversal
 Wavefront reforming
© Copyright 2014 HSA Foundation. All Rights Reserved
HSA SOFTWARE
EVOLUTION OF THE SOFTWARE STACK
[Diagram: two software stacks, both running on the hardware (APUs, CPUs, GPUs)]
 Legacy driver stack: apps → domain libraries → OpenCL™ and DX runtimes / user mode drivers → graphics kernel mode driver
 HSA software stack: apps → task queuing libraries → HSA domain libraries and OpenCL™ 2.x runtime → HSA Runtime and HSA JIT → HSA kernel mode driver
 Legend: user mode components, kernel mode components, and components contributed by third parties
© Copyright 2014 HSA Foundation. All Rights Reserved
OPENCL™ AND HSA
 HSA is an optimized platform architecture
for OpenCL
 Not an alternative to OpenCL
 OpenCL on HSA will benefit from
 Avoidance of wasteful copies
 Low latency dispatch
 Improved memory model
 Pointers shared between CPU and GPU
 OpenCL 2.0 leverages HSA Features
 Shared Virtual Memory
 Platform Atomics
© Copyright 2014 HSA Foundation. All Rights Reserved
ADDITIONAL LANGUAGES ON HSA
 In development
© Copyright 2014 HSA Foundation. All Rights Reserved
Language | Body | More Information
Java (Sumatra) | OpenJDK | http://openjdk.java.net/projects/sumatra/
LLVM | LLVM code generator for HSAIL |
C++ AMP | MulticoreWare | https://bitbucket.org/multicoreware/cppamp-driver-ng/wiki/Home
OpenMP, GCC | AMD, SUSE | https://gcc.gnu.org/viewcvs/gcc/branches/hsa/gcc/README.hsa?view=markup&pathrev=207425
SUMATRA PROJECT OVERVIEW
 AMD/Oracle sponsored Open Source (OpenJDK) project
 Targeted at Java 9 (2015 release)
 Allows developers to efficiently represent data parallel algorithms in
Java
 Sumatra ‘repurposes’ Java 8’s multi-core Stream/Lambda API’s to
enable both CPU or GPU computing
 At runtime, Sumatra enabled Java Virtual Machine (JVM) will dispatch
‘selected’ constructs to available HSA enabled devices
 Developers of Java libraries are already refactoring their library code to
use these same constructs
 So developers using existing libraries should see GPU acceleration
without any code changes
 http://openjdk.java.net/projects/sumatra/
 https://wikis.oracle.com/display/HotSpotInternals/Sumatra
 http://mail.openjdk.java.net/pipermail/sumatra-dev/
© Copyright 2014 HSA Foundation. All Rights Reserved
[Diagram: Sumatra development and runtime flow. Development: Application.java → Java Compiler → Application.class. Runtime: a Sumatra-enabled JVM runs the application's Lambda/Stream API code on the CPU (CPU ISA) or hands it to the HSA Finalizer to produce GPU ISA]
HSA OPEN SOURCE SOFTWARE
 HSA will feature an open source Linux execution and compilation stack
 Allows a single shared implementation for many components
 Enables university research and collaboration in all areas
 Because it’s the right thing to do
© Copyright 2014 HSA Foundation. All Rights Reserved
Component Name | IHV or Common | Rationale
HSA Bolt Library | Common | Enable understanding and debug
HSAIL Code Generator | Common | Enable research
LLVM Contributions | Common | Industry and academic collaboration
HSAIL Assembler | Common | Enable understanding and debug
HSA Runtime | Common | Standardize on a single runtime
HSA Finalizer | IHV | Enable research and debug
HSA Kernel Driver | IHV | For inclusion in Linux distros
WORKLOAD EXAMPLE
SUFFIX ARRAY CONSTRUCTION
CLOUD SERVER WORKLOAD
SUFFIX ARRAYS
 Suffix Arrays are a fundamental data structure
 Designed for efficient searching of a large text
 Quickly locate every occurrence of a substring S in a text T
 Suffix Arrays are used to accelerate in-memory cloud workloads
 Full text index search
 Lossless data compression
 Bio-informatics
© Copyright 2014 HSA Foundation. All Rights Reserved
ACCELERATED SUFFIX ARRAY
CONSTRUCTION ON HSA
© Copyright 2014 HSA Foundation. All Rights Reserved
M. Deo, “Parallel Suffix Array Construction and Least Common Prefix for the GPU”, Submitted to ”Principles and Practice of Parallel Programming, (PPoPP’13)” February 2013.
AMD A10 4600M APU with Radeon™ HD Graphics; CPU: 4 cores @ 2.3 GHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G, 6 compute units, 685MHz; 4GB RAM
By offloading data parallel computations to
GPU, HSA increases performance and
reduces energy for Suffix Array
Construction.
By efficiently sharing data between CPU and
GPU, HSA lets us move compute to data
without penalty of intermediate copies.
+5.8x increased performance, 5x decreased energy
[Chart: phases of the skew algorithm for Compute SA, mapped to devices: Radix Sort (GPU), Compute SA (CPU), Lexical Rank (CPU), Radix Sort (GPU), Merge Sort (GPU)]
EASE OF PROGRAMMING
CODE COMPLEXITY VS. PERFORMANCE
LINES-OF-CODE AND PERFORMANCE FOR DIFFERENT
PROGRAMMING MODELS
AMD A10-5800K APU with Radeon™ HD Graphics – CPU: 4 cores, 3800MHz (4200MHz Turbo); GPU: AMD Radeon HD 7660D, 6 compute units, 800MHz; 4GB RAM.
Software – Windows 7 Professional SP1 (64-bit OS); AMD OpenCL™ 1.2 AMD-APP (937.2); Microsoft Visual Studio 11 Beta
[Chart: lines of code (0-350) and relative performance (0-35) for an exemplary ISV “Hessian” kernel implemented in Serial CPU, TBB, Intrinsics+TBB, OpenCL™-C, OpenCL™-C++, C++ AMP, and HSA Bolt; LOC is broken down into Init, Compile, Copy, Launch, Algorithm, and Copy-back components]
© Copyright 2014 HSA Foundation. All Rights Reserved
THE HSA FUTURE
 Architected heterogeneous processing on the SOC
 Programming of accelerators becomes much easier
 Accelerated software that runs across multiple hardware vendors
 Scalability from smart phones to super computers on a common architecture
 GPU acceleration of parallel processing is the initial target, with DSPs
and other accelerators coming to the HSA system architecture model
 Heterogeneous software ecosystem evolves at a much faster pace
 Lower power, more capable devices in your hand, on the wall, in the cloud
© Copyright 2014 HSA Foundation. All Rights Reserved
JOIN US!
WWW.HSAFOUNDATION.COM
HETEROGENEOUS SYSTEM
ARCHITECTURE (HSA): HSAIL VIRTUAL
PARALLEL ISA
BEN SANDER, AMD
TOPICS
 Introduction and Motivation
 HSAIL – what makes it special?
 HSAIL Execution Model
 How to program in HSAIL?
 Conclusion
© Copyright 2014 HSA Foundation. All Rights Reserved
STATE OF GPU COMPUTING
Today’s Challenges
 Separate address spaces
 Copies
 Can’t share pointers
 New language required for compute kernel
 EX: OpenCL™ runtime API
 Compute kernel compiled separately from host code
Emerging Solution
 HSA Hardware
 Single address space
 Coherent
 Virtual
 Fast access from all components
 Can share pointers
 Bring GPU computing to existing, popular,
programming models
 Single-source, fully supported by compiler
 HSAIL compiler IR (Cross-platform!)
• GPUs are fast and power efficient: high compute density per mm² and per watt
• But: Can be hard to program
THE PORTABILITY CHALLENGE
 CPU ISAs
 ISA innovations added incrementally (e.g., NEON, AVX)
 ISA retains backwards-compatibility with previous generation
 Two dominant instruction-set architectures: ARM and x86
 GPU ISAs
 Massive diversity of architectures in the market
 Each vendor has its own ISA, and often several in the market at the same time
 No commitment (or attempt!) to provide any backwards compatibility
 Traditionally graphics APIs (OpenGL, DirectX) provide necessary abstraction
© Copyright 2014 HSA Foundation. All Rights Reserved
HSAIL :
WHAT MAKES IT SPECIAL?
WHAT IS HSAIL?
 Intermediate language for parallel compute in HSA
 Generated by a “High Level Compiler” (GCC, LLVM, Java VM, etc)
 Expresses parallel regions of code
 Binary format of HSAIL is called “BRIG”
 Goal: Bring parallel acceleration to mainstream programming languages
© Copyright 2014 HSA Foundation. All Rights Reserved
main() {
…
#pragma omp parallel for
for (int i=0;i<N; i++) {
}
…
}
[Flow: High-Level Compiler → BRIG → Finalizer → Component ISA; host code → Host ISA]
KEY HSAIL FEATURES
 Parallel
 Shared virtual memory
 Portable across vendors in HSA Foundation
 Stable across multiple product generations
 Consistent numerical results (IEEE-754 with defined min accuracy)
 Fast, robust, simple finalization step (no monthly updates)
 Good performance (little need to write in ISA)
 Supports all of OpenCL™
 Supports Java, C++, and other languages as well
© Copyright 2014 HSA Foundation. All Rights Reserved
HSAIL INSTRUCTION SET - OVERVIEW
 Similar to assembly language for a RISC CPU
 Load-store architecture
 Destination register first, then source registers
 140 opcodes (Java™ bytecode has 200)
 Floating point (single, double, half (f16))
 Integer (32-bit, 64-bit)
 Some packed operations
 Branches
 Function calls
 Platform Atomic Operations: and, or, xor, exch, add, sub, inc, dec, max, min, cas
 Synchronize host CPU and HSA Component!
 Text and Binary formats (“BRIG”)
ld_global_u64 $d0, [$d6 + 120] ; $d0 = load($d6 + 120)
add_u64 $d1, $d0, 24 ; $d1 = $d0 + 24
© Copyright 2014 HSA Foundation. All Rights Reserved
SEGMENTS AND MEMORY (1/2)
 7 segments of memory
 global, readonly, group, spill, private, arg, kernarg
 Memory instructions can (optionally) specify a segment
 Control data sharing properties and communicate intent
 Global Segment
 Visible to all HSA agents (including host CPU)
 Group Segment
 Provides high-performance memory shared in the work-group.
 Group memory can be read and written by any work-item in the work-group
 HSAIL provides sync operations to control visibility of group memory
ld_global_u64 $d0,[$d6]
ld_group_u64 $d0,[$d6+24]
st_spill_f32 $s1,[$d6+4]
© Copyright 2014 HSA Foundation. All Rights Reserved
SEGMENTS AND MEMORY (2/2)
 Spill, Private, Arg Segments
 Represent different regions of a per-work-item stack
 Typically generated by compiler, not specified by programmer
 Compiler can use these to convey intent, e.g., spills
 Kernarg Segment
 Programmer writes kernarg segment to pass arguments to a kernel
 Read-Only Segment
 Remains constant during execution of kernel
© Copyright 2014 HSA Foundation. All Rights Reserved
FLAT ADDRESSING
 Each segment mapped into virtual address space
 Flat addresses can map to segments based on virtual address
 Instructions with no explicit segment use flat addressing
 Very useful for high-level language support (e.g., classes, libraries)
 Aligns well with OpenCL 2.0 “generic” addressing feature
ld_global_u64 $d6, [%_arg0] ; global
ld_u64 $d0,[$d6+24] ; flat
© Copyright 2014 HSA Foundation. All Rights Reserved
REGISTERS
 Four classes of registers:
 S: 32-bit, Single-precision FP or Int
 D: 64-bit, Double-precision FP or Long Int
 Q: 128-bit, Packed data.
 C: 1-bit, Control Registers (Compares)
 Fixed number of registers
 S, D, Q share a single pool of resources
 S + 2*D + 4*Q <= 128
 Up to 128 S or 64 D or 32 Q (or a blend)
 Register allocation done in high-level compiler
 Finalizer doesn’t perform expensive register allocation
[Diagram: register pool packing. Control registers c0..c7 are separate; s0..s127, d0..d63, and q0..q31 are drawn from one shared pool, where each D register costs two S slots and each Q register costs four]
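For example, a kernel using 40 S, 20 D, and 8 Q registers consumes 40 + 2·20 + 4·8 = 112 of the 128 slots, so it fits within the pool.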
© Copyright 2014 HSA Foundation. All Rights Reserved
SIMT EXECUTION MODEL
 HSAIL Presents a “SIMT” execution model to the programmer
 “Single Instruction, Multiple Thread”
 Programmer writes program for a single thread of execution
 Each work-item appears to have its own program counter
 Branch instructions look natural
 Hardware Implementation
 Most hardware uses SIMD (Single-Instruction Multiple Data) vectors for efficiency
 Actually one program counter for the entire SIMD instruction
 Branches implemented with predication
 SIMT Advantages
 Easier to program (branch code in particular)
 Natural path for mainstream programming models and existing compilers
 Scales across a wide variety of hardware (programmer doesn’t see vector width)
 Cross-lane operations available for those who want peak performance
© Copyright 2014 HSA Foundation. All Rights Reserved
WAVEFRONTS
 Hardware SIMD vector, composed of 1, 2, 4, 8, 16, 32, 64, 128, or 256 “lanes”
 Lanes in wavefront can be “active” or “inactive”
 Inactive lanes consume hardware resources but don’t do useful work
 Tradeoffs
 “Wavefront-aware” programming can be useful for peak performance
 But results in less portable code (since wavefront width is encoded in algorithm)
if (cond) {
operationA; // cond=True lanes active here
} else {
operationB; // cond=False lanes active here
}
© Copyright 2014 HSA Foundation. All Rights Reserved
CROSS-LANE OPERATIONS
 Example HSAIL cross-lane operation: “activelaneid”
 Dest set to count of earlier work-items that are active for this instruction
 Useful for compaction algorithms
activelaneid_u32 $s0
 Example HSAIL cross-lane operation: “activelaneshuffle”
 Each workitem reads value from another lane in the wavefront
 Supports selection of “identity” element for inactive lanes
 Useful for wavefront-level reductions
activelaneshuffle_b32 $s0, $s1, $s2, 0, 0 // s0 = dest, s1 = source, s2 = lane select, no identity
© Copyright 2014 HSA Foundation. All Rights Reserved
HSAIL MODES
 Working group strived to limit optional modes and features in HSAIL
 Minimize differences between HSA target machines
 Better for compiler vendors and application developers
 Two modes survived
 Machine Models
 Small: 32-bit pointers, 32-bit data
 Large: 64-bit pointers, 32-bit or 64-bit data
 Vendors can support one or both models
 “Base” and “Full” Profiles
 Two sets of requirements for FP accuracy, rounding, exception reporting, and hard preemption
© Copyright 2014 HSA Foundation. All Rights Reserved
HSA PROFILES
Feature | Base | Full
Addressing Modes | Small, Large | Small, Large
All 32-bit HSAIL operations according to the declared profile | Yes | Yes
F16 support (IEEE 754 or better) | Yes | Yes
F64 support | No | Yes
Precision for add/sub/mul | 1/2 ULP | 1/2 ULP
Precision for div | 2.5 ULP | 1/2 ULP
Precision for sqrt | 1 ULP | 1/2 ULP
HSAIL Rounding: Near | Yes | Yes
HSAIL Rounding: Up / Down / Zero | No | Yes
Subnormal floating-point | Flush-to-zero | Supported
Propagate NaN Payloads | No | Yes
FMA | Yes | Yes
Arithmetic Exception reporting | None | DETECT or BREAK
Debug trap | Yes | Yes
Hard Preemption | No | Yes
© Copyright 2014 HSA Foundation. All Rights Reserved
HSA PARALLEL EXECUTION
MODEL
© Copyright 2014 HSA Foundation. All Rights Reserved
HSA PARALLEL EXECUTION MODEL
Basic Idea:
Programmer supplies an HSAIL
“kernel” that is run on each work-item.
Kernel is written as a single thread of
execution.
Programmer specifies grid dimensions
(scope of problem) when launching
the kernel.
Each work-item has a unique
coordinate in the grid.
Programmer optionally specifies work-group dimensions (for optimized communication).
© Copyright 2014 HSA Foundation. All Rights Reserved
CONVOLUTION / SOBEL EDGE FILTER
Gx = [ -1 0 +1 ]
     [ -2 0 +2 ]
     [ -1 0 +1 ]
Gy = [ -1 -2 -1 ]
     [  0  0  0 ]
     [ +1 +2 +1 ]
G = sqrt(Gx² + Gy²)
[Build slides: the kernel runs once per work-item; work-items are arranged in a 2D grid; the grid is partitioned into 2D work-groups]
© Copyright 2014 HSA Foundation. All Rights Reserved
HOW TO PROGRAM HSA?
WHAT DO I TYPE?
© Copyright 2014 HSA Foundation. All Rights Reserved
HSA PROGRAMMING MODELS : CORE PRINCIPLES
 Single source
 Host and device code side-by-side in same source file
 Written in same programming language
 Single unified coherent address space
 Freely share pointers between host and device
 Similar memory model as multi-core CPU
 Parallel regions identified with existing language syntax
 Typically same syntax used for multi-core CPU
 HSAIL is the compiler IR that supports these programming models
© Copyright 2014 HSA Foundation. All Rights Reserved
GCC OPENMP : COMPILATION FLOW
 SUSE GCC Project
 Adding HSAIL code generator to GCC compiler infrastructure
 Supports OpenMP 3.1 syntax
 No data movement directives required!
main() {
…
// Host code.
#pragma omp parallel for
for (int i=0;i<N; i++) {
C[i] = A[i] + B[i];
}
…
}
[Flow: GCC OpenMP Compiler → BRIG → Finalizer → Component ISA; host code → Host ISA]
© Copyright 2014 HSA Foundation. All Rights Reserved
GCC OpenMP flow
 Application: a C/C++/Fortran OpenMP application, e.g.
#pragma omp for
for (j = 0; j < n; j++) { b[j] = a[j]; }
 Compile time (GNU Compiler, GCC): compiles the host code; lowers OpenMP directives and converts GIMPLE to BRIG; embeds the BRIG into the host code; emits runtime calls with kernel name, parameters, and launch attributes
 Run time: pragmas map to calls into the HSA Runtime; kernels are finalized from BRIG to ISA once and cached; kernels are dispatched to the GPU
© Copyright 2014 HSA Foundation. All Rights Reserved
MCW C++AMP : COMPILATION FLOW
 C++AMP : Single-source C++ template parallel programming model
 MCW compiler based on CLANG/LLVM
 Open-source and runs on Linux
 Leverage open-source LLVM->HSAIL code generator
main() {
…
parallel_for_each(grid<1>(extent<256>(…)), …);
…
}
[Flow: C++AMP Compiler → BRIG → Finalizer → Component ISA; host code → Host ISA]
© Copyright 2014 HSA Foundation. All Rights Reserved
JAVA: RUNTIME FLOW
© Copyright 2014 HSA Foundation. All Rights Reserved
JAVA 8 – HSA ENABLED APARAPI
 Java 8 brings Stream + Lambda API.
‒ More natural way of expressing data parallel algorithms
‒ Initially targeted at multi-core.
 APARAPI will :
‒ Support Java 8 Lambdas
‒ Dispatch code to HSA enabled devices at runtime via
HSAIL
[Diagram: Java Application → APARAPI + Lambda API → JVM → HSA Finalizer & Runtime → CPU / GPU]
Future Java – HSA ENABLED JAVA (SUMATRA)
 Adds native GPU acceleration to Java Virtual Machine
(JVM)
 Developer uses JDK Lambda, Stream API
 JVM uses GRAAL compiler to generate HSAIL
[Diagram: Java Application → Java JDK Stream + Lambda API → JVM with GRAAL JIT backend → HSA Finalizer & Runtime → CPU / GPU]
AN EXAMPLE (IN JAVA 8)
© Copyright 2014 HSA Foundation. All Rights Reserved
//Example computes the percentage of total scores achieved by each player on a team.
class Player {
private Team team; // Note: Reference to the parent Team.
private int scores;
private float pctOfTeamScores;
public Team getTeam() {return team;}
public int getScores() {return scores;}
public void setPctOfTeamScores(float pct) { pctOfTeamScores = pct; }
};
// “Team” class not shown
// Assume "allPlayers" is an initialized array of Players.
Arrays.stream(allPlayers). // wrap the array in a stream
parallel(). // developer indication that lambda is thread-safe
forEach(p -> {
int teamScores = p.getTeam().getScores();
float pctOfTeamScores = (float)p.getScores()/(float) teamScores;
p.setPctOfTeamScores(pctOfTeamScores);
});
HSAIL CODE EXAMPLE
© Copyright 2014 HSA Foundation. All Rights Reserved
01: version 0:95: $full : $large;
02: // static method HotSpotMethod<Main.lambda$2(Player)>
03: kernel &run (
04: kernarg_u64 %_arg0 // Kernel signature for lambda method
05: ) {
06: ld_kernarg_u64 $d6, [%_arg0]; // Move arg to an HSAIL register
07: workitemabsid_u32 $s2, 0; // Read the work-item global “X” coord
08:
09: cvt_u64_s32 $d2, $s2; // Convert X gid to long
10: mul_u64 $d2, $d2, 8; // Adjust index for sizeof ref
11: add_u64 $d2, $d2, 24; // Adjust for actual elements start
12: add_u64 $d2, $d2, $d6; // Add to array ref ptr
13: ld_global_u64 $d6, [$d2]; // Load from array element into reg
14: @L0:
15: ld_global_u64 $d0, [$d6 + 120]; // p.getTeam()
16: mov_b64 $d3, $d0;
17: ld_global_s32 $s3, [$d6 + 40]; // p.getScores ()
18: cvt_f32_s32 $s16, $s3;
19: ld_global_s32 $s0, [$d0 + 24]; // Team getScores()
20: cvt_f32_s32 $s17, $s0;
21: div_f32 $s16, $s16, $s17; // p.getScores()/teamScores
22: st_global_f32 $s16, [$d6 + 100]; // p.setPctOfTeamScores()
23: ret;
24: };
HOW TO PROGRAM HSA?
OTHER PROGRAMMING TOOLS
© Copyright 2014 HSA Foundation. All Rights Reserved
HSAIL ASSEMBLER
kernel &run (kernarg_u64 %_arg0)
{
ld_kernarg_u64 $d6, [%_arg0];
workitemabsid_u32 $s2, 0;
cvt_u64_s32 $d2, $s2;
mul_u64 $d2, $d2, 8;
add_u64 $d2, $d2, 24;
add_u64 $d2, $d2, $d6;
ld_global_u64 $d6, [$d2];
. . .
[Flow: HSAIL text → HSAIL Assembler → BRIG → Finalizer → Machine ISA]
• HSAIL has a text format and an assembler
© Copyright 2014 HSA Foundation. All Rights Reserved
OPENCL™ OFFLINE COMPILER (CLOC)
__kernel void vec_add(
__global const float *a,
__global const float *b,
__global float *c,
const unsigned int n)
{
int id = get_global_id(0);
// Bounds check
if (id < n)
c[id] = a[id] + b[id];
}
[Flow: OpenCL kernel → CLOC → BRIG → Finalizer → Machine ISA]
•OpenCL split-source model cleanly isolates kernel
•Can express many HSAIL features in OpenCL Kernel Language
•Higher productivity than writing in HSAIL assembly
•Can dispatch kernel directly with HSAIL Runtime (lower-level access to hardware)
•Or use CLOC+OKRA Runtime for approachable “fits-on-a-slide” GPU programming model
© Copyright 2014 HSA Foundation. All Rights Reserved
KEY TAKEAWAYS
 HSAIL
 Thin, robust, fast finalizer
 Portable (multiple HW vendors and parallel architectures)
 Supports shared virtual memory and platform atomics
 HSA brings GPU computing to mainstream programming models
 Shared and coherent memory bridges “faraway accelerator” gap
 HSAIL provides the common IL for high-level languages to benefit from
parallel computing
 Languages and Compilers
 HSAIL support in GCC, LLVM, Java JVM
 Leverage same language syntax designed for multi-core CPUs
 Can use pointer-containing data structures
© Copyright 2014 HSA Foundation. All Rights Reserved
HSA RUNTIME
YEH-CHING CHUNG, NATIONAL TSING HUA
UNIVERSITY
OUTLINE
 Introduction
 HSA Core Runtime API (Pre-release 1.0 provisional)
 Initialization and Shut Down
 Notifications (Synchronous/Asynchronous)
 Agent Information
 Signals and Synchronization (Memory-Based)
 Queues and Architected Dispatch
 Summary
© Copyright 2014 HSA Foundation. All Rights Reserved
INTRODUCTION (1)
 The HSA core runtime is a thin, user-mode API that provides the interface necessary for
the host to launch compute kernels to the available HSA components.
 The overall goal of the HSA core runtime design is to provide a high-performance dispatch
mechanism that is portable across multiple HSA vendor architectures.
 The dispatch mechanism differentiates the HSA runtime from other language runtimes by
architected argument setting and kernel launching at the hardware and specification level.
 The HSA core runtime API is standard across all HSA vendors, such that languages which use the
HSA runtime can run on different vendor’s platforms that support the API.
 The implementation of the HSA runtime may include kernel-level components (required for
some hardware components, e.g., AMD Kaveri) or may be entirely user-space (for example,
simulators or CPU implementations).
© Copyright 2014 HSA Foundation. All Rights Reserved
INTRODUCTION (2)
 The software architecture stack without HSA runtime:
[Diagram: OpenCL, Java, OpenMP, and DSL apps (programming model) sit on their language runtimes (OpenCL, Java, OpenMP, and DSL runtimes), which each target a separate driver for components 1..N of each vendor 1..m]
 The software architecture stack with HSA runtime:
[Diagram: the same apps and language runtimes layer over a single HSA Runtime and HSA Finalizer per vendor, covering components 1..N of HSA vendors 1..m]
© Copyright 2014 HSA Foundation. All Rights Reserved
INTRODUCTION (3)
[Flow comparison: an agent's program lifecycle on the OpenCL runtime vs. the HSA runtime]
 Agent: Start Program → … → Exit Program
 OpenCL Runtime: Platform, Device, and Context Initialization → Build Kernel → SVM Allocation and Kernel Arguments Setting → Command Queue → Resource Deallocation
 HSA Runtime: HSA Runtime Initialization and Topology Discovery → HSAIL Finalization and Linking → HSA Memory Allocation → Enqueue Dispatch Packet → HSA Runtime Close
© Copyright 2014 HSA Foundation. All Rights Reserved
INTRODUCTION (4)
 HSA Platform System Architecture Specification support
 Runtime initialization and shutdown
 Notifications (synchronous/asynchronous)
 Agent information
 Signals and synchronization (memory-based)
 Queues and Architected dispatch
 Memory management
 HSAIL support
 Finalization, linking, and debugging
 Image and Sampler support
© Copyright 2014 HSA Foundation. All Rights Reserved
RUNTIME INITIALIZATION AND
SHUTDOWN
OUTLINE
 Runtime Initialization API
 hsa_init
 Runtime Shut Down API
 hsa_shut_down
 Examples
© Copyright 2014 HSA Foundation. All Rights Reserved
HSA RUNTIME INITIALIZATION
 When the API is invoked for the first time in a given process, a runtime
instance is created.
 A typical runtime instance may contain information of platform, topology, reference
count, queues, signals, etc.
 The API can be called multiple times by applications
 Only a single runtime instance will exist for a given process.
 Whenever the API is invoked, the reference count is increased by one.
© Copyright 2014 HSA Foundation. All Rights Reserved
HSA RUNTIME SHUT DOWN
 When the API is invoked, the reference count is decreased by 1.
 When the reference count < 1
 All the resources associated with the runtime instance (queues, signals, topology
information, etc.) are considered invalid and any attempt to reference them in
subsequent API calls results in undefined behavior.
 The user might call hsa_init to initialize the HSA runtime again.
 The HSA runtime might release resources associated with it.
© Copyright 2014 HSA Foundation. All Rights Reserved
EXAMPLE – RUNTIME INITIALIZATION (1)
[Code screenshot annotations:]
 Data structure for the runtime instance
 If hsa_init is called more than once, increase the ref_count by 1
© Copyright 2014 HSA Foundation. All Rights Reserved
EXAMPLE – RUNTIME INITIALIZATION (2)
[Code screenshot annotations:]
 When hsa_init is called for the first time, allocate resources and set the reference count
 Get the number of HSA agents
 Create an empty agent list and initialize the agents
 Create the topology table
 If initialization fails, release the resources
© Copyright 2014 HSA Foundation. All Rights Reserved
Agent-0
node_id 0
id 0
type CPU
vendor Generic
name Generic
wavefront_size 0
queue_size 200
group_memory 0
fbarrier_max_count 1
is_pic_supported 0
…
…
EXAMPLE - RUNTIME INSTANCE (1)
Platform Name: Generic Memory
node_id 0
id 0
segment_type 111111
address_base 0x0001
size 2048 MB
peak_bandwidth 6553.6 mbps
Agent-1
node_id 0
id 0
type GPU
vendor Generic
name Generic
wavefront_size 64
queue_size 200
group_memory 64
fbarrier_max_count 1
is_pic_supported 1
Cache
node_id 0
id 0
levels 1
associativity 1
cache size 64KB
cache line size 4
is_inclusive 1
Agent: 2
Memory: 1
Cache: 1
…
…
© Copyright 2014 HSA Foundation. All Rights Reserved
Agent-0
node_id = 0
id = 0
agent_type = 1 (CPU)
vendor[16] = Generic
name[16] = Generic
wavefront_size = 0
queue_size =200
group_memory_size_bytes =0
fbarrier_max_count = 1
is_pic_supported = 0
Platform Header File
*base_address = 0x00001
Size = 248
system_timestamp_frequency_mhz = 200
signal_maximum_wait = 1/200
*node_id
no_nodes = 1
*agent_list
no_agent = 2
*memory_descriptor_list
no_memory_descriptor = 1
*cache_descriptor_list
no_cache_descriptor = 1
EXAMPLE - RUNTIME INSTANCE (2)
…
…
cache
node_id = 0
Id = 0
Levels = 1
* associativity
* cache_size
* cache_line_size
* is_inclusive
Memory
node_id = 0
Id = 0
supported_segment_type_mask = 111111
virtual_address_base = 0x0001
size_in_bytes = 2048MB
peak_bandwidth_mbps = 6553.6
Agent-1
node_id = 0
id = 0
agent_type = 2 (GPU)
vendor[16] = Generic
name[16] = Generic
wavefront_size = 64
queue_size =200
group_memory_size_bytes =64
fbarrier_max_count = 1
is_pic_supported = 1
…
© Copyright 2014 HSA Foundation. All Rights Reserved
EXAMPLE – RUNTIME SHUT DOWN
© Copyright 2014 HSA Foundation. All Rights Reserved
[Code screenshot annotation:] Decrease the ref_count by 1; if it drops below 1, free the agent list and the other resources.
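A minimal C sketch of this reference-counting behavior, assuming the provisional signatures hsa_status_t hsa_init(void) and hsa_status_t hsa_shut_down(void) and an hsa.h header:

#include <hsa.h>
#include <stdio.h>

int main(void) {
    if (hsa_init() != HSA_STATUS_SUCCESS) {  // first call creates the runtime instance
        fprintf(stderr, "hsa_init failed\n");
        return 1;
    }
    hsa_init();        // subsequent calls only increment the reference count

    /* ... create queues, signals, allocate memory ... */

    hsa_shut_down();   // ref_count 2 -> 1: resources remain valid
    hsa_shut_down();   // ref_count 1 -> 0: resources may be released
    return 0;
}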
NOTIFICATIONS
(SYNCHRONOUS/ASYNCHRONOUS)
OUTLINE
 Synchronous Notifications
 hsa_status_t
 hsa_status_string
 Asynchronous Notifications
 Example
© Copyright 2014 HSA Foundation. All Rights Reserved
SYNCHRONOUS NOTIFICATIONS
 Notifications (errors, events, etc.) reported by the runtime can be synchronous or
asynchronous
 The HSA runtime uses the return values of API functions to pass notifications
synchronously.
 A status code is defined as an enumeration, hsa_status_t, to capture the return value
of any API function that has been executed, except accessors/mutators.
 The notification is a status code that indicates success or error.
 Success is represented by HSA_STATUS_SUCCESS, which is equivalent to zero.
 An error status is assigned a positive integer and its identifier starts with the
HSA_STATUS_ERROR prefix.
 The status code can help to determine a cause of the unsuccessful execution.
© Copyright 2014 HSA Foundation. All Rights Reserved
STATUS CODE QUERY
 Query additional information on status code
 Parameters
 status (input): Status code that the user is seeking more information on
 status_string (output): An ISO/IEC 646 encoded English language string that potentially
describes the error status
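A sketch of an error-reporting helper built on this query, assuming the signature hsa_status_t hsa_status_string(hsa_status_t status, const char **status_string):

#include <hsa.h>
#include <stdio.h>

void report_status(hsa_status_t status) {
    if (status == HSA_STATUS_SUCCESS)   // success is equivalent to zero
        return;
    const char *msg = NULL;
    if (hsa_status_string(status, &msg) == HSA_STATUS_SUCCESS && msg != NULL)
        fprintf(stderr, "HSA error %d: %s\n", (int)status, msg);
    else
        fprintf(stderr, "HSA error %d (no description)\n", (int)status);
}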
© Copyright 2014 HSA Foundation. All Rights Reserved
ASYNCHRONOUS NOTIFICATIONS
 The runtime passes asynchronous notifications by calling user-defined
callbacks.
 For instance, queues are a common source of asynchronous events because the
tasks queued by an application are asynchronously consumed by the packet
processor. Callbacks are associated with queues when they are created. When the
runtime detects an error in a queue, it invokes the callback associated with that
queue and passes it an error flag (indicating what happened) and a pointer to the
erroneous queue.
 The HSA runtime does not implement any default callbacks.
 If the callback implementation uses blocking functions and never returns, the runtime
state can become undefined.
© Copyright 2014 HSA Foundation. All Rights Reserved
EXAMPLE - CALLBACK
[Code screenshot annotations:]
 Pass the callback function when creating the queue
 If the queue is empty, set the event and invoke the callback
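A C sketch of the pattern. The hsa_queue_create parameter list and the HSA_QUEUE_TYPE_MULTI value are simplified assumptions; the provisional specification defines the real signature:

#include <hsa.h>
#include <stdio.h>

static void queue_error_cb(hsa_status_t status, hsa_queue_t *queue) {
    // Invoked asynchronously by the runtime when it detects an error in this queue.
    fprintf(stderr, "queue %p reported error %d\n", (void *)queue, (int)status);
    // Keep the callback short and non-blocking: a callback that never returns
    // can leave the runtime state undefined.
}

void create_queue_with_callback(hsa_agent_t agent) {
    hsa_queue_t *queue = NULL;
    // Assumed parameter order and queue-type value; see the provisional spec.
    hsa_queue_create(agent, 256 /* packets */, HSA_QUEUE_TYPE_MULTI,
                     queue_error_cb, NULL /* service queue */, &queue);
}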
© Copyright 2014 HSA Foundation. All Rights Reserved
AGENT INFORMATION
OUTLINE
 Agent information
 hsa_node_t
 hsa_agent_t
 hsa_agent_info_t
 hsa_component_feature_t
 Agent Information manipulation APIs
 hsa_iterate_agents
 hsa_agent_get_info
 Example
© Copyright 2014 HSA Foundation. All Rights Reserved
INTRODUCTION
 The runtime exposes a list of agents that are available in the system.
 An HSA agent is a hardware component that participates in the HSA memory model.
 An HSA agent can submit AQL packets for execution.
 An HSA agent may also but is not required to be an HSA component. It is possible for
a system to include HSA agents that are neither an HSA component nor a host CPU.
 HSA agents are defined as opaque handles of type hsa_agent_t.
 The HSA runtime provides APIs for applications to traverse the list of available
agents and query attributes of a particular agent.
© Copyright 2014 HSA Foundation. All Rights Reserved
AGENT INFORMATION (1)
 Opaque agent handle
 Opaque NUMA node handle
 An HSA memory node is a node that delineates a set of
system components (host CPUs and HSA Components) with
“local” access to a set of memory resources attached to the
node's memory controller and appropriate HSA-compliant
access attributes.
© Copyright 2014 HSA Foundation. All Rights Reserved
AGENT INFORMATION (2)
 Component features
 An HSA component is a hardware or software component that can be a target of the AQL queries
and conforms to the memory model of the HSA.
 Values
 HSA_COMPONENT_FEATURE_NONE = 0
 No component capabilities. The device is an agent, but not a component.
 HSA_COMPONENT_FEATURE_BASIC = 1
 The component supports the HSAIL instruction set and all the AQL packet types except Agent
dispatch.
 HSA_COMPONENT_FEATURE_ALL = 2
 The component supports the HSAIL instruction set and all the AQL packet types.
© Copyright 2014 HSA Foundation. All Rights Reserved
AGENT INFORMATION (3)
 Agent attributes
 Values
 HSA_AGENT_INFO_MAX_GRID_DIM
 HSA_AGENT_INFO_MAX_WORKGROUP_DIM
 HSA_AGENT_INFO_QUEUE_MAX_PACKETS
 HSA_AGENT_INFO_CLOCK
 HSA_AGENT_INFO_CLOCK_FREQUENCY
 HSA_AGENT_INFO_MAX_SIGNAL_WAIT
 HSA_AGENT_INFO_NAME
 HSA_AGENT_INFO_NODE
 HSA_AGENT_INFO_COMPONENT_FEATURES
 HSA_AGENT_INFO_VENDOR_NAME
 HSA_AGENT_INFO_WAVEFRONT_SIZE
 HSA_AGENT_INFO_CACHE_SIZE
© Copyright 2014 HSA Foundation. All Rights Reserved
AGENT INFORMATION MANIPULATION (1)
 Iterate over the available agents, and invoke an application-defined callback on
every iteration
 If callback returns a status other than HSA_STATUS_SUCCESS for a particular
iteration, the traversal stops and the function returns that status value.
 Parameters
 callback (input): Callback to be invoked once per agent
 data (input): Application data that is passed to callback on every iteration. Can be
NULL.
© Copyright 2014 HSA Foundation. All Rights Reserved
AGENT INFORMATION MANIPULATION (2)
 Get the current value of an attribute for a given agent
 Parameters
 agent (input): A valid agent
 attribute (input): Attribute to query
 value (output): Pointer to a user-allocated buffer where to store the value of the
attribute. If the buffer passed by the application is not large enough to hold the value
of attribute, the behavior is undefined.
© Copyright 2014 HSA Foundation. All Rights Reserved
EXAMPLE - AGENT ATTRIBUTE QUERY
[Code screenshot annotations:]
 Get the agent handle of Agent 0
 Copy the agent attribute information
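A sketch that combines hsa_iterate_agents and hsa_agent_get_info to print every agent's name and wavefront size (the 64-byte name buffer is an assumption):

#include <hsa.h>
#include <stdint.h>
#include <stdio.h>

static hsa_status_t print_agent(hsa_agent_t agent, void *data) {
    (void)data;                           // no per-iteration state needed
    char name[64] = {0};                  // assumed large enough for HSA_AGENT_INFO_NAME
    uint32_t wavefront = 0;
    hsa_agent_get_info(agent, HSA_AGENT_INFO_NAME, name);
    hsa_agent_get_info(agent, HSA_AGENT_INFO_WAVEFRONT_SIZE, &wavefront);
    printf("agent %s: wavefront size %u\n", name, wavefront);
    return HSA_STATUS_SUCCESS;            // continue the traversal
}

void list_agents(void) {
    hsa_iterate_agents(print_agent, NULL);
}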
© Copyright 2014 HSA Foundation. All Rights Reserved
SIGNALS AND SYNCHRONIZATION
(MEMORY-BASED)
OUTLINE
 Signal
 Signal manipulation API
 Create/Destroy
 Query
 Send
 Atomic Operations
 Signal wait
 Get time out
 Signal Condition
 Example
© Copyright 2014 HSA Foundation. All Rights Reserved
SIGNAL (1)
 HSA agents can communicate with each other by using coherent global memory,
or by using signals.
 A signal is represented by an opaque signal handle
 A signal carries a value, which can be updated or conditionally waited upon via
an API call or HSAIL instruction.
 The value occupies four or eight bytes depending on the machine model in use.
© Copyright 2014 HSA Foundation. All Rights Reserved
SIGNAL (2)
 Updating the value of a signal is equivalent to sending the signal.
 In addition to the update (store) of signals, the API for sending signal must
support other atomic operations with specific memory order semantics
 Atomic operations: AND, OR, XOR, Add, Subtract, Exchange, and CAS
 Memory order semantics : Release and Relaxed
© Copyright 2014 HSA Foundation. All Rights Reserved
SIGNAL CREATE/DESTROY
 Create a signal
 Parameters
 initial_value (input): Initial value of the
signal.
 signal_handle (output): Signal handle.
 Destroy a signal previous created by
hsa_signal_create
 Parameter
 signal_handle (input): Signal handle.
© Copyright 2014 HSA Foundation. All Rights Reserved
SIGNAL LOAD/STORE
 Atomically read the current signal value with acquire semantics
 Atomically read the current signal value with relaxed semantics
 Send and atomically set the value of a signal with release semantics
 Send and atomically set the value of a signal with relaxed semantics
© Copyright 2014 HSA Foundation. All Rights Reserved
SIGNAL ADD/SUBTRACT
 Send and atomically increment the value of a signal by a given amount with release semantics
 Send and atomically increment the value of a signal by a given amount with relaxed semantics
 Send and atomically decrement the value of a signal by a given amount with release semantics
 Send and atomically decrement the value of a signal by a given amount with relaxed semantics
© Copyright 2014 HSA Foundation. All Rights Reserved
SIGNAL AND (OR, XOR)/EXCHANGE
 Send and atomically perform a logical AND operation on the value of a signal and a given value with release semantics
 Send and atomically perform a logical AND operation on the value of a signal and a given value with relaxed semantics
 Send and atomically set the value of a signal and return its previous value with release semantics
 Send and atomically set the value of a signal and return its previous value with relaxed semantics
© Copyright 2014 HSA Foundation. All Rights Reserved
SIGNAL WAIT (1)
 The application may wait on a signal, with a condition specifying the terms of
wait.
 Signal wait condition operator
 Values
 HSA_EQ: The two operands are equal.
 HSA_NE: The two operands are not equal.
 HSA_LT: The first operand is less than the second operand.
 HSA_GTE: The first operand is greater than or equal to the second operand.
© Copyright 2014 HSA Foundation. All Rights Reserved
SIGNAL WAIT (2)
 The wait can be done either in the HSA component via an HSAIL wait instruction
or via a runtime API defined here.
 Waiting on a signal returns the current value at the opaque signal object;
 The wait may have a runtime defined timeout which indicates the maximum amount of time that an
implementation can spend waiting.
 The signal infrastructure allows for multiple senders/waiters on a single signal.
 Wait reads the value, hence acquire synchronizations may be applied.
© Copyright 2014 HSA Foundation. All Rights Reserved
SIGNAL WAIT (3)
 Signal wait
 Parameters
 signal_handle (input): A signal handle
 condition (input): Condition used to compare the passed and signal values
 compare_value (input): Value to compare with
 return_value (output): A pointer where the current signal value must be read into
© Copyright 2014 HSA Foundation. All Rights Reserved
SIGNAL WAIT (4)
 Signal wait with timeout
 Parameters
 signal_handle (input): A signal handle
 timeout (input): Maximum wait duration (A value of zero indicates no maximum)
 long_wait (input): Hint indicating that the signal value is not expected to meet the given condition in
a short period of time. The HSA runtime may use this hint to optimize the wait implementation.
 condition (input): Condition used to compare the passed and signal values
 compare_value (input): Value to compare with
 return_value (output): A pointer where the current signal value must be read into
© Copyright 2014 HSA Foundation. All Rights Reserved
EXAMPLE – SIGNAL WAIT (1)
[Timeline: two threads sharing a signal whose value starts at 0]
 thread_1 calls hsa_signal_wait_timeout_acquire with condition value == 2 and blocks
 thread_2 calls hsa_signal_add_relaxed (value = value + 3), so value = 3
 thread_2 calls hsa_signal_subtract_relaxed (value = value - 1), so value = 2
 The condition is satisfied: the wait returns the signal value and thread_1 continues
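The same timeline as a C sketch. The handle and value type names (hsa_signal_handle_t, hsa_signal_value_t), the destroy-function name, and the exact parameter order of the wait call are assumptions based on the parameter descriptions above:

#include <hsa.h>

void signal_example(void) {
    hsa_signal_handle_t sig;              // assumed opaque handle type
    hsa_signal_create(0, &sig);           // initial value 0

    // thread_2's updates:
    hsa_signal_add_relaxed(sig, 3);       // value: 0 -> 3
    hsa_signal_subtract_relaxed(sig, 1);  // value: 3 -> 2

    // thread_1: block until value == 2, then read the value back
    hsa_signal_value_t observed;
    hsa_signal_wait_timeout_acquire(sig, 0 /* no timeout */, 0 /* short-wait hint */,
                                    HSA_EQ, 2, &observed);

    hsa_signal_destroy(sig);              // assumed destroy-function name
}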
© Copyright 2014 HSA Foundation. All Rights Reserved
EXAMPLE – SIGNAL WAIT (2)
[Code screenshot annotations:]
 If signal_handle is invalid, return an invalid-signal status
 Signal wait condition function: compare tmp->value with compare_value to check whether the condition is satisfied
 If the condition is satisfied, return the signal value and a success status
 If timeout == 0, return a signal-timeout status
© Copyright 2014 HSA Foundation. All Rights Reserved
QUEUES AND ARCHITECTED
DISPATCH
OUTLINE
 Queues
 Queue Types and Structure
 HSA runtime API for Queue Manipulations
 Architected Queuing Language (AQL) Support
 Packet type
 Packet header
 Examples
 Enqueue Packet
 Packet Processor
© Copyright 2014 HSA Foundation. All Rights Reserved
INTRODUCTION (1)
 An HSA-compliant platform supports multiple user-level command queues allocation.
 A user-level command queue is characterized as runtime-allocated, user-level accessible virtual
memory of a certain size, containing packets defined in the Architected Queuing Language (AQL
packets).
 Queues are allocated by HSA applications through the HSA runtime.
 HSA software receives memory-based structures to configure the hardware queues to
allow for efficient software management of the hardware queues of the HSA agents.
 This queue memory shall be processed by the HSA Packet Processor as a ring buffer.
 Queues are read-only data structures.
 Writing values directly to a queue structure results in undefined behavior.
 But HSA agents can directly modify the contents of the buffer pointed to by base_address, or use
runtime APIs to access the doorbell signal or the service queue.
© Copyright 2014 HSA Foundation. All Rights Reserved
 Two queue types, AQL and Service Queues, are supported
 AQL Queue consumes AQL packets that are used to specify the information of kernel functions
that will be executed on the HSA component
 Service Queue consumes agent dispatch packets that are used to specify runtime-defined or user
registered functions that will be executed on the agent (typically, the host CPU)
INTRODUCTION (2)
© Copyright 2014 HSA Foundation. All Rights Reserved
INTRODUCTION (3)
 AQL queue structure
© Copyright 2014 HSA Foundation. All Rights Reserved
INTRODUCTION (4)
 In addition to the data held in the queue structure, the queue also defines two
properties (readIndex and writeIndex) that define the location of “head” and “tail”
of the queue.
 readIndex: The read index is a 64-bit unsigned integer that specifies the packetID of
the next AQL packet to be consumed by the packet processor.
 writeIndex: The write index is a 64-bit unsigned integer that specifies the packetID of
the next AQL packet slot to be allocated.
 Both indices are not directly exposed to the user, who can only access them by using
dedicated HSA core runtime APIs.
 The available index functions differ on the index of interest (read or write), action to be
performed (addition, compare and swap, etc.), and memory consistency model
(relaxed, release, etc.).
© Copyright 2014 HSA Foundation. All Rights Reserved
INTRODUCTION (5)
 The read index is automatically advanced when a packet is read by the packet
processor.
 When the packet processor observes that
 The read index matches the write index, the queue can be considered empty;
 The write index is greater than or equal to the sum of the read index and the size of
the queue, then the queue is full.
 The doorbell_signal field of a queue contains a signal that is used by the agent
to inform the packet processor to process the packets it writes.
 The value signaled on the doorbell is equal to the ID of the packet that is ready to be
launched.
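The empty/full tests implied by these index rules, as small C helpers (index and size types assumed to be 64-bit unsigned, as described above):

#include <stdbool.h>
#include <stdint.h>

bool queue_empty(uint64_t read_index, uint64_t write_index) {
    return read_index == write_index;               // nothing left to consume
}

bool queue_full(uint64_t read_index, uint64_t write_index, uint64_t queue_size) {
    return write_index >= read_index + queue_size;  // no free packet slots
}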
© Copyright 2014 HSA Foundation. All Rights Reserved
INTRODUCTION (6)
 The new task might be consumed by the packet processor even before the
doorbell signal has been signaled by the agent.
 This is because the packet processor might be already processing some other
packets and observes that there is new work available, so it processes the new
packets.
 In any case, the agent must ring the doorbell for every batch of packets it writes.
© Copyright 2014 HSA Foundation. All Rights Reserved
QUEUE CREATE/DESTROY
 Create a user mode queue
 When a queue is created, the runtime also
allocates the packet buffer and the completion
signal.
 The application should only rely on the status
code returned to determine if the queue is valid
 Destroy a user mode queue
 A destroyed queue must not be accessed after being
destroyed.
 When a queue is destroyed, the state of the AQL packets
that have not been yet fully processed becomes undefined.
© Copyright 2014 HSA Foundation. All Rights Reserved
GET READ/WRITE INDEX
 Atomically retrieve read index of a queue with
acquire semantics
 Atomically retrieve write index of a queue with
acquire semantics
 Atomically retrieve read index of a queue with
relaxed semantics
 Atomically retrieve write index of a queue with
relaxed semantics
© Copyright 2014 HSA Foundation. All Rights Reserved
SET READ/WRITE INDEX
 Atomically set the read index of a queue with
release semantics
 Atomically set the read index of a queue with
relaxed semantics
 Atomically set the write index of a queue with
release semantics
 Atomically set the write index of a queue with
relaxed semantics
© Copyright 2014 HSA Foundation. All Rights Reserved
COMPARE AND SWAP WRITE INDEX
 Atomically compare and set the write index of a
queue with acquire/release/relaxed/acquire-
release semantics
 Parameters
 queue (input): A queue
 expected (input): The expected index value
 val (input): Value to copy to the write index if expected
matches the observed write index
 Return value
 Previous value of the write index
© Copyright 2014 HSA Foundation. All Rights Reserved
ADD WRITE INDEX
 Atomically increment the write index of a
queue by an offset with
release/acquire/relaxed/acquire-release
semantics
 Parameters
 queue (input): A queue
 val (input): The value to add to the write index
 Return value
 Previous value of the write index
© Copyright 2014 HSA Foundation. All Rights Reserved
ARCHITECTED QUEUING LANGUAGE (AQL)
 An HSA-compliant system provides a command interface for the dispatch of
HSA agent commands.
 This command interface is provided by the Architected Queuing Language (AQL).
 AQL allows HSA agents to build and enqueue their own command packets,
enabling fast and low-power dispatch.
 AQL also provides support for HSA component queue submissions
 The HSA component kernel can write commands in AQL format.
© Copyright 2014 HSA Foundation. All Rights Reserved
AQL PACKET (1)
 AQL packet format
 Values
 Always reserved packet (0): Packet format is set to always reserved when the queue is initialized.
 Invalid packet (1): Packet format is set to invalid when the readIndex is incremented, making the
packet slot available to the HSA agents.
 Dispatch packet (2): Dispatch packets contain jobs for the HSA component and are created by
HSA agents.
 Barrier packet (3): Barrier packets can be inserted by HSA agents to delay processing subsequent
packets. All queues support barrier packets.
 Agent dispatch packet (4): Agent dispatch packets contain jobs for the HSA agent and are created by
HSA agents.
© Copyright 2014 HSA Foundation. All Rights Reserved
AQL PACKET (2)
HSA signaling object handle used to indicate completion of the job
© Copyright 2014 HSA Foundation. All Rights Reserved
EXAMPLE - ENQUEUE AQL PACKET (1)
 An HSA agent submits a task to a queue by performing the following steps:
 Allocate a packet slot (by incrementing the writeIndex)
 Initialize the packet and copy packet to a queue associated with the Packet Processor
 Mark packet as valid
 Notify the Packet Processor of the packet (With doorbell signal)
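A C sketch of these four steps, see the copyright line below for the end of the slide. The write-index and doorbell function names, the queue fields beyond base_address and doorbell_signal, and the packet-format constant are assumptions; the provisional spec defines the real layouts:

#include <hsa.h>

void enqueue_dispatch(hsa_queue_t *q, const hsa_dispatch_packet_t *pkt) {
    // 1. Allocate a packet slot by bumping the write index.
    uint64_t id = hsa_queue_add_write_index_release(q, 1);

    // 2. Initialize and copy the packet into the ring-buffer slot.
    hsa_dispatch_packet_t *slot =
        (hsa_dispatch_packet_t *)q->base_address + (id % q->size);  // q->size assumed
    *slot = *pkt;

    // 3. Mark the packet valid last, so the packet processor never sees
    //    a half-written packet.
    slot->header.format = HSA_PACKET_TYPE_DISPATCH;   // assumed field/constant names

    // 4. Ring the doorbell with the ID of the packet that is ready.
    hsa_signal_send_release(q->doorbell_signal, id);  // assumed send-API name
}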
© Copyright 2014 HSA Foundation. All Rights Reserved
EXAMPLE - ENQUEUE AQL PACKET (2)
[Diagram: dispatch queue ring buffer with readIndex and writeIndex]
 Allocate an AQL packet slot (increment the writeIndex)
 Initialize the packet
 Copy the packet into the queue (a lock can be held here to prevent race conditions in a multithreaded environment)
 Send the doorbell signal
© Copyright 2014 HSA Foundation. All Rights Reserved
EXAMPLE - PACKET PROCESSOR
[Diagram: packet processor side of the dispatch queue]
 Receive the doorbell
 If there is any packet in the queue, process it: get the packet contents and check whether it is a barrier packet
 Update the readIndex, change the packet state to invalid, and send the completion signal
© Copyright 2014 HSA Foundation. All Rights Reserved
MEMORY MANAGEMENT
OUTLINE
 Memory registration and deregistration
 Memory region and memory segment
 APIs for memory region manipulation
 APIs for memory registration and deregistration
© Copyright 2014 HSA Foundation. All Rights Reserved
INTRODUCTION
 One of the key features of HSA is its ability to share global pointers between the
host application and code executing on the HSA component.
 This ability means that an application can directly pass a pointer to memory allocated on the host
to a kernel function dispatched to a component without an intermediate copy
 When a buffer created in the host is also accessed by a component,
programmers are encouraged to register the corresponding address range
beforehand.
 Registering memory expresses an intention to access (read or write) the passed buffer from a
component other than the host. This is a performance hint that allows the runtime implementation
to know which buffers will be accessed by some of the components ahead of time.
 When an HSA program no longer needs to access a registered buffer in a device,
the user should deregister that virtual address range.
© Copyright 2014 HSA Foundation. All Rights Reserved
MEMORY REGION/SEGMENT
 A memory region represents a virtual memory interval that is visible to a particular agent,
and contains properties about how memory is accessed or allocated from that agent.
 Memory segments
 Values
 HSA_SEGMENT_GLOBAL = 1
 HSA_SEGMENT_PRIVATE = 2
 HSA_SEGMENT_GROUP = 4
 HSA_SEGMENT_KERNARG = 8
 HSA_SEGMENT_READONLY = 16
 HSA_SEGMENT_IMAGE = 32
© Copyright 2014 HSA Foundation. All Rights Reserved
MEMORY REGION INFORMATION
 Attributes of a memory region
 Values
 HSA_REGION_INFO_BASE_ADDRESS
 HSA_REGION_INFO_SIZE
 HSA_REGION_INFO_NODE
 HSA_REGION_INFO_MAX_ALLOCATION_SIZE
 HSA_REGION_INFO_SEGMENT
 HSA_REGION_INFO_BANDWIDTH
 HSA_REGION_INFO_CACHED
© Copyright 2014 HSA Foundation. All Rights Reserved
MEMORY REGION MANIPULATION (1)
 Get the current value of an attribute of a region
 Iterate over the memory regions that are visible to an agent, and invoke an
application-defined callback on every iteration
 If callback returns a status other than HSA_STATUS_SUCCESS for a particular iteration, the
traversal stops and the function returns that status value.
© Copyright 2014 HSA Foundation. All Rights Reserved
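As a sketch of how these two calls combine (assuming the provisional names hsa_agent_iterate_regions and hsa_region_get_info and the hsa_region_t/hsa_segment_t types; exact signatures may differ in the provisional headers):

#include <hsa.h>
// Application-defined callback, invoked once per region visible to the agent.
static hsa_status_t inspect_region(hsa_region_t region, void* data) {
  hsa_segment_t segment;
  hsa_region_get_info(region, HSA_REGION_INFO_SEGMENT, &segment);
  size_t size = 0;
  hsa_region_get_info(region, HSA_REGION_INFO_SIZE, &size);
  // Returning anything other than HSA_STATUS_SUCCESS stops the traversal,
  // and the iterate call returns that status value.
  return HSA_STATUS_SUCCESS;
}
// Usage: hsa_agent_iterate_regions(agent, inspect_region, NULL);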
MEMORY REGION MANIPULATION (2)
 Allocate a block of memory
 Deallocate a block of memory previously allocated
using hsa_memory_allocate
 Copy block of memory
 Copying a number of bytes larger than the size of the
memory regions pointed to by dst or src results in
undefined behavior.
© Copyright 2014 HSA Foundation. All Rights Reserved
MEMORY REGISTRATION/DEREGISTRATION
 Register memory
 Parameters
 address (input): A pointer to the base of
the memory region to be registered. If a
NULL pointer is passed, no operation is
performed.
 size (input): Requested registration size
in bytes. A size of zero is only allowed if
address is NULL.
 Deregister memory previously registered
using hsa_memory_register
 Parameter
 address (input): A pointer to the base of the
memory region to be deregistered. If a NULL
pointer is passed, no operation is performed.
© Copyright 2014 HSA Foundation. All Rights Reserved
EXAMPLE
Allocate a memory space
Use hsa_region_get_info to get the
size in bytes of this memory space
Register this memory space as a
performance hint
When finished, deregister and
free this memory space
© Copyright 2014 HSA Foundation. All Rights Reserved
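Put together as code, the flow above looks roughly like this (a sketch; hsa_memory_free, the use of HSA_REGION_INFO_MAX_ALLOCATION_SIZE, and the exact argument orders are assumptions in the style of the provisional API):

#include <hsa.h>
void buffer_lifecycle(hsa_region_t region) {
  // Query the region before allocating from it.
  size_t max_alloc = 0;
  hsa_region_get_info(region, HSA_REGION_INFO_MAX_ALLOCATION_SIZE, &max_alloc);
  size_t bytes = 1 << 20; // illustrative 1 MiB request; must not exceed max_alloc
  // Allocate a memory space.
  void* buffer = NULL;
  hsa_memory_allocate(region, bytes, &buffer);
  // Register it as a performance hint: a component other than the
  // host intends to read or write this buffer.
  hsa_memory_register(buffer, bytes);
  // ... dispatch kernels that access the buffer ...
  // Finished: deregister the virtual address range and free the space.
  hsa_memory_deregister(buffer);
  hsa_memory_free(buffer);
}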
SUMMARY
SUMMARY
 Covered
 HSA Core Runtime API (Pre-release 1.0 provisional)
 Runtime Initialization and Shutdown (Open/Close)
 Notifications (Synchronous/Asynchronous)
 Agent Information
 Signals and Synchronization (Memory-Based)
 Queues and Architected Dispatch
 Memory Management
 Not covered
 Extension of Core Runtime
 HSAIL Finalization, Linking, and Debugging
 Images and Samplers
© Copyright 2014 HSA Foundation. All Rights Reserved
QUESTIONS?
© Copyright 2014 HSA Foundation. All Rights Reserved
HSA MEMORY MODEL
BEN GASTER, ENGINEER, QUALCOMM
OUTLINE
 HSA Memory Model
 OpenCL 2.0
 Has a memory model too
 Obstruction-free bounded deques
 An example using the HSA memory model
© Copyright 2014 HSA Foundation. All Rights Reserved
HSA MEMORY MODEL
© Copyright 2014 HSA Foundation. All Rights Reserved
TYPES OF MODELS
 Shared memory computers and programming languages divide complexity into
models:
1. Memory model specifies safety
 e.g. which values can a load return?
 This is what this section of the tutorial will focus on
2. Execution model specifies liveness
 Described in Ben Sander’s tutorial section on HSAIL
 e.g. can a work-item prevent others from progressing?
3. Performance model specifies the big picture
 e.g. caches or branch divergence
 Specific to particular implementations and outside the scope of today’s tutorial
© Copyright 2014 HSA Foundation. All Rights Reserved
THE PROBLEM
 Assume all locations (a, b, …) are initialized to 0
 What are the values of $s2 and $s4 after execution?
© Copyright 2014 HSA Foundation. All Rights Reserved
Work-item 0
mov_u32 $s1, 1 ;
st_global_u32 $s1, [&a] ;
ld_global_u32 $s2, [&b] ;
Work-item 1
mov_u32 $s3, 1 ;
st_global_u32 $s3, [&b] ;
ld_global_u32 $s4, [&a] ;
In C: Work-item 0 runs *a = 1; int x = *b;
Work-item 1 runs *b = 1; int y = *a;
initially *a == 0 && *b == 0
THE SOLUTION
 The memory model:
 Defines the visibility of writes to memory at any given point
 Provides us with the set of possible executions
© Copyright 2014 HSA Foundation. All Rights Reserved
WHAT MAKES A GOOD MEMORY MODEL*
 Programmability: A good model should make it (relatively) easy to write
multi-work-item programs. The model should be intuitive to most users, even to those
who have not read the details
 Performance: A good model should facilitate high-performance implementations
at reasonable power, cost, etc. It should give implementers broad latitude in
options
 Portability: A good model would be adopted widely, or at least provide backward
compatibility or the ability to translate among models
* S. V. Adve. Designing Memory Consistency Models for Shared-Memory Multiprocessors. PhD thesis, Computer Sciences Department,
University of Wisconsin–Madison, Nov. 1993.
© Copyright 2014 HSA Foundation. All Rights Reserved
SEQUENTIAL CONSISTENCY (SC)*
 Axiomatic Definition
 A single processor (core) is sequential if “the result of an execution is the same as if the
operations had been executed in the order specified by the program.”
 A multiprocessor is sequentially consistent if “the result of any execution is the same as if the
operations of all processors (cores) were executed in some sequential order, and the
operations of each individual processor (core) appear in this sequence in the order specified by
its program.”
© Copyright 2014 HSA Foundation. All Rights Reserved
 But HW/Compiler actually implements more relaxed models, e.g. ARMv7
* L. Lamport. How to Make a Multiprocessor Computer that Correctly
Executes Multiprocessor Programs. IEEE Transactions on Computers,
C-28(9):690–691, Sept. 1979.
SEQUENTIAL CONSISTENCY (SC)
© Copyright 2014 HSA Foundation. All Rights Reserved
Work-item 0
mov_u32 $s1, 1 ;
st_global_u32 $s1, [&a] ;
ld_global_u32 $s2, [&b] ;
Work-item 1
mov_u32 $s3, 1 ;
st_global_u32 $s3, [&b] ;
ld_global_u32 $s4, [&a] ;
mov_u32 $s1, 1 ;
mov_u32 $s3, 1;
st_global_u32 $s1, [&a] ;
ld_global_u32 $s2, [&b] ;
st_global_u32 $s3, [&b] ;
ld_global_u32 $s4, [&a] ;
$s2 == 0 && $s4 == 1
BUT WHAT ABOUT ACTUAL HARDWARE
 Sequential consistency is (reasonably) easy to understand, but limits
optimizations that the compiler and hardware can perform
 Many modern processors implement many reordering optimizations
 Store buffers (TSO*): work-items can see their own stores early
 Reorder buffers (XC*): work-items can see other work-items’ stores early
© Copyright 2014 HSA Foundation. All Rights Reserved
*TSO – Total Store Order as implemented by Sparc and x86
*XC – Relaxed Consistency model, e.g. ARMv7, Power7, and Adreno
RELAXED CONSISTENCY (XC)
© Copyright 2014 HSA Foundation. All Rights Reserved
Work-item 0
mov_u32 $s1, 1 ;
st_global_u32 $s1, [&a] ;
ld_global_u32 $s2, [&b] ;
Work-item 1
mov_u32 $s3, 1 ;
st_global_u32 $s3, [&b] ;
ld_global_u32 $s4, [&a] ;
mov_u32 $s1, 1 ;
mov_u32 $s3, 1;
ld_global_u32 $s2, [&b] ;
ld_global_u32 $s4, [&a] ;
st_global_u32 $s1, [&a] ;
st_global_u32 $s3, [&b] ;
$s2 == 0 && $s4 == 0
WHAT ARE OUR 3 Ps?
 Programmability: XC makes it really pretty hard for the programmer to reason about
what will be visible when
 many memory model experts have been known to get it wrong!
 Performance: XC is good for performance; the hardware (and compiler) is free to
reorder many loads and stores, opening the door for performance and power
enhancements
 Portability: XC is very portable as it places very few constraints
© Copyright 2014 HSA Foundation. All Rights Reserved
MY CHILDREN AND COMPUTER
ARCHITECTS ALL WANT
 To have their cake and eat it!
© Copyright 2014 HSA Foundation. All Rights Reserved
HSA provides the ability for programmers to
reason with the (relatively) intuitive model of SC,
while still achieving the benefits of XC!
SEQUENTIAL CONSISTENCY FOR DRF*
 HSA adopts the same approach as Java, C++11, and OpenCL 2.0: SC for Data-Race-Free (DRF)
 plus some new capabilities!
 (Informally) A data race occurs when two (or more) work-items access the same memory
location such that:
 At least one of the accesses is a WRITE
 There are no intervening synchronization operations
 SC for DRF asks:
 Programmers to ensure programs are DRF under SC
 Implementers to ensure that all executions of DRF programs on the relaxed model are also SC
executions
© Copyright 2014 HSA Foundation. All Rights Reserved
*S. V. Adve and M. D. Hill. Weak Ordering—A New Definition. In Proceedings of the
17th Annual International Symposium on Computer Architecture, pp. 2–14, May
1990
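To make the contract concrete, here is the classic DRF idiom in C++11, which adopts the same SC-for-DRF approach; the program is data-race-free, so the implementation must make it appear SC (a sketch, not taken from the deck):

#include <atomic>
#include <thread>
int payload = 0;                 // ordinary (non-atomic) data
std::atomic<bool> ready(false);  // synchronization variable
void producer() {
  payload = 42;                                   // ordinary write
  ready.store(true, std::memory_order_release);   // synchronizes-with the acquire below
}
void consumer() {
  while (!ready.load(std::memory_order_acquire)) {}
  // No data race: the release/acquire pair orders the accesses, so
  // under SC for DRF this read is guaranteed to observe 42.
  int v = payload;
  (void)v;
}
int main() {
  std::thread t1(producer), t2(consumer);
  t1.join(); t2.join();
}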
HSA SUPPORTS RELEASE CONSISTENCY
 HSA’s memory model is based on RCSC:
 All atomic_ld_scacq and atomic_st_screl operations are SC
 This means coherence on all atomic_ld_scacq and atomic_st_screl operations to a single
address
 All atomic_ld_scacq and atomic_st_screl operations are program ordered per work-item
(actually: sequenced in the order imposed by language constraints)
 Similar model adopted by ARMv8
 HSA extends RCSC to SC for HRF*, to access the full capabilities of
modern heterogeneous systems, containing CPUs, GPUs, and DSPs,
for example.
© Copyright 2014 HSA Foundation. All Rights Reserved
*Sequential Consistency for Heterogeneous-Race-Free: Programmer-centric
Memory Models for Heterogeneous Platforms. D. R. Hower, B. M. Beckmann,
B. R. Gaster, B. A. Hechtman, M. D. Hill, S. K. Reinhardt, and D. A. Wood. MSPC’13.
MAKING RELAXED CONSISTENCY WORK
© Copyright 2014 HSA Foundation. All Rights Reserved
Work-item 0
mov_u32 $s1, 1 ;
atomic_st_global_u32_screl $s1, [&a] ;
atomic_ld_global_u32_scacq $s2, [&b] ;
Work-item 1
mov_u32 $s3, 1 ;
atomic_st_global_u32_screl $s3, [&b] ;
atomic_ld_global_u32_scacq $s4, [&a] ;
mov_u32 $s1, 1 ;
mov_u32 $s3, 1;
atomic_st_global_u32_screl $s1, [&a] ;
atomic_ld_global_u32_scacq $s2, [&b] ;
atomic_st_global_u32_screl $s3, [&b] ;
atomic_ld_global_u32_scacq $s4, [&a] ;
$s2 == 0 && $s4 == 1
SEQUENTIAL CONSISTENCY FOR DRF
 Two memory accesses participate in a data race if they
 access the same location
 at least one access is a store
 can occur simultaneously
 i.e. appear as adjacent operations in interleaving.
 A program is data-race-free if no possible execution results in a data race.
 Sequential consistency for data-race-free programs
 Avoid everything else
HSA: Not good enough!
© Copyright 2014 HSA Foundation. All Rights Reserved
ALL ARE NOT EQUAL – OR SOME CAN SEE
BETTER THAN OTHERS
 Remember the HSAIL Execution Model
© Copyright 2014 HSA Foundation. All Rights Reserved
[Figure: nested synchronization scopes - wave scope inside group scope inside device scope, all inside platform scope.]
DATA-RACE-FREE IS NOT ENOUGH
t1 (group #1-2): st_global 1, [&X] ; atomic_st_global_screl 0, [&flag]
t2 (group #1-2): atomic_cas_global_scar 1, 0, [&flag] ; ... ; atomic_st_global_screl 0, [&flag]
t3, t4 (group #3-4): atomic_cas_global_scar 1, 0, [&flag] ; ld_global (??), [&x]
 Two ordinary memory accesses participate in a data race if they
 Access same location
 At least one is a store
 Can occur simultaneously
Not a data race… Is it SC? Well, that depends: is visibility implied by causality?
[Figure: synchronization chain t1 -> t2 -> t3 -> t4, crossing from scope S12 to scope S34 within SGlobal.]
© Copyright 2014 HSA Foundation. All Rights Reserved
SEQUENTIAL CONSISTENCY FOR
HETEROGENEOUS-RACE-FREE
 Two memory accesses participate in a heterogeneous race if they
 access the same location
 at least one access is a store
 can occur simultaneously
 i.e. appear as adjacent operations in interleaving.
 Are not synchronized with “enough” scope
 A program is heterogeneous-race-free if no possible execution results in a
heterogeneous race.
 Sequential consistency for heterogeneous-race-free programs
 Avoid everything else
© Copyright 2014 HSA Foundation. All Rights Reserved
HSA HETEROGENEOUS RACE FREE
 HRF0: Basic Scope Synchronization
 “enough” = both threads synchronize using identical scope
 Recall the example:
 It contains a heterogeneous race in HSA
t1 (workgroup #1-2): st_global 1, [&X] ; atomic_st_global_screl_wg 0, [&flag]
t3, t4 (workgroup #3-4): atomic_cas_global_scar_wg 1, 0, [&flag] ; ld_global (??), [&x]
HSA Conclusion:
This is bad. Don’t do it.
© Copyright 2014 HSA Foundation. All Rights Reserved
HOW TO USE HSA WITH SCOPES
HSA Scope Selection Guideline:
use the smallest scope that includes all
producers/consumers of the shared data
 Want: for performance, use the smallest scope possible
 Implication: producers/consumers must be known at synchronization time
 Is this a valid assumption? What is safe in HSA?
© Copyright 2014 HSA Foundation. All Rights Reserved
REGULAR GPGPU WORKLOADS
 Regular workloads follow a well-defined pattern:
 Define the problem space
 Partition hierarchically
 Communicate locally, N times
 Communicate globally, M times
 Well defined (regular) data partitioning +
well defined (regular) synchronization pattern =
producers/consumers are always known
Generally: HSA works well with
regular data-parallel workloads
© Copyright 2014 HSA Foundation. All Rights Reserved
IRREGULAR WORKLOADS
t1 (workgroup #1-2): st_global 1, [&X] ; atomic_st_global_screl_plat 0, [&flag]
t2 (workgroup #1-2): atomic_cas_global_scar_plat 1, 0, [&flag] ; ... ; atomic_st_global_screl_plat 0, [&flag]
t3, t4 (workgroup #3-4): atomic_cas_global_scar_plat 1, 0, [&flag] ; ld $s1, [&x]
 HSA: the earlier example is a race unless the scopes are upgraded
 Must upgrade wg (workgroup) -> plat (platform)
 The HSA memory model then says:
 ld $s1, [&x] will see the value (1)!
© Copyright 2014 HSA Foundation. All Rights Reserved
OPENCL
HAS MEMORY MODELS TOO
MAPPING ONTO HSA’S MEMORY MODEL
 It is straightforward to provide a mapping from OpenCL 1.x to the
proposed model
 OpenCL 1.x atomics are unordered and so map to atomic_op_X
 Mapping for fences not shown but straightforward
OPENCL 1.X MEMORY MODEL MAPPING
OpenCL Operation -> HSA Memory Model Operation
Atomic load -> ld_global_wg / ld_group_wg
Atomic store -> atomic_st_global_wg / atomic_st_group_wg
atomic_op -> atomic_op_global_comp / atomic_op_group_wg
barrier(…) -> fence ; barrier_wg
© Copyright 2014 HSA Foundation. All Rights Reserved
OPENCL 2.0 BACKGROUND
 Provisional specification released at SIGGRAPH’13, July 2013.
 Huge update to OpenCL to account for the evolving hardware landscape and
emerging use cases (e.g. irregular workloads)
 Key features:
 Shared virtual memory, including platform atomics
 Formally defined memory model based on C11 plus support for scopes
 Includes an extended set of C1X atomic operations
 Generic address space, that subsumes global, local, and private
 Device to device enqueue
 Out-of-order device side queuing model
 Backwards compatible with OpenCL 1.x
© Copyright 2014 HSA Foundation. All Rights Reserved
OPENCL 2.0 MEMORY MODEL MAPPING
OpenCL Operation -> HSA Memory Model Operation
Load, memory_order_relaxed -> atomic_ld_[global | group]_relaxed_scope
Store, memory_order_relaxed -> atomic_st_[global | group]_relaxed_scope
Load, memory_order_acquire -> atomic_ld_[global | group]_scacq_scope
Load, memory_order_seq_cst -> atomic_ld_[global | group]_scacq_scope
Store, memory_order_release -> atomic_st_[global | group]_screl_scope
Store, memory_order_seq_cst -> atomic_st_[global | group]_screl_scope
atomic_op, memory_order_acq_rel -> atomic_op_[global | group]_scar_scope
atomic_op, memory_order_seq_cst -> atomic_op_[global | group]_scar_scope
© Copyright 2014 HSA Foundation. All Rights Reserved
OPENCL 2.0 MEMORY SCOPE MAPPING
OpenCL Scope HSA Scope
memory_scope_sub_group _wave
memory_scope_work_group _wg
memory_scope_device _component
memory_scope_all_svm_devices _platform
© Copyright 2014 HSA Foundation. All Rights Reserved
OBSTRUCTION-FREE
BOUNDED DEQUES
AN EXAMPLE USING THE HSA MEMORY MODEL
CONCURRENT DATA-STRUCTURES
 Why do we need such a memory model in practice?
 One important application of memory consistency is in the development and use
of concurrent data-structures
 In particular, there is a class of data-structure implementations that provide non-
blocking guarantees:
 wait-free: an algorithm is wait-free if every operation has a bound on the number of
steps the algorithm will take before the operation completes
 In practice it is very hard to build efficient data-structures that meet this requirement
 lock-free: an algorithm is lock-free if, given enough time, at least one of the
work-items (or threads) makes progress
 In practice lock-free algorithms are implemented by work-items cooperating with one
another enough to allow progress
 obstruction-free: an algorithm is obstruction-free if a work-item, running in isolation, can
make progress
© Copyright 2014 HSA Foundation. All Rights Reserved
BUT WHY NOT JUST USE MUTUAL
EXCLUSION?
© Copyright 2014 HSA Foundation. All Rights Reserved
[Figure: an emerging compute cluster - four Krait CPUs sharing a 2MB L2 behind MMUs, an Adreno GPU, and a Hexagon DSP with its own MMU, connected by a fabric and memory controller.]
Diversity in a heterogeneous system, such as
different clock speeds and different scheduling
policies, can mean traditional
mutual exclusion is not the right choice
CONCURRENT DATA-STRUCTURES
 Emerging heterogeneous compute clusters mean we need:
 To adapt existing concurrent data-structures
 To develop new concurrent data-structures
 Lock-based programming may still be useful, but often these algorithms will need
to be lock-free
 Of course, this is a key application of the HSA memory model
 To showcase this we highlight the development of a well known (HLM)
obstruction-free deque*
© Copyright 2014 HSA Foundation. All Rights Reserved
*Herlihy, M., Luchangco, V., and Moir, M. 2003. Obstruction-free
synchronization: double-ended queues as an example.
Proc. 23rd ICDCS (2003), 522–529.
HLM - OBSTRUCTION-FREE DEQUE
 Uses a fixed length circular queue
 At any given time, reading from left to right, the array will contain:
 Zero or more left-null (LN) values
 Zero or more dummy-null (DN) values
 Zero or more right-null (RN) values
 At all times there must be:
 At least two different null values
 At least one LN or DN, and at least one DN or RN
 Memory consistency is required to allow multiple producers and multiple
consumers, potentially happening in parallel from the left and right ends, to see
changes from other work-items (HSA Components) and threads (HSA Agents)
© Copyright 2014 HSA Foundation. All Rights Reserved
HLM - OBSTRUCTION-FREE DEQUE
© Copyright 2014 HSA Foundation. All Rights Reserved
[Figure: circular array containing LN LN LN v v v RN RN, with left and right hint indices.]
Key:
LN – left null value
RN – right null value
v – value
left – left hint index
right – right hint index
C REPRESENTATION OF DEQUE
struct node {
  uint64_t type : 2;     // null type (LN, RN, DN)
  uint64_t counter : 8;  // version counter to avoid ABA
  uint64_t value : 54;   // index value stored in queue
};
// Note: the HSAIL below treats type as the top 2 bits of the packed 64-bit word,
// counter as bits 54..61, and value as bits 0..53.
struct queue {
  unsigned int size;     // size of bounded buffer
  node * array;          // backing store for deque itself
};
© Copyright 2014 HSA Foundation. All Rights Reserved
HSAIL REPRESENTATION
 Allocate a deque in global memory using HSAIL
@deque_instance:
align 64 global_u32 &size;
align 8 global_u64 &array;
© Copyright 2014 HSA Foundation. All Rights Reserved
ORACLE
 Assume a function:
function &rcheck_oracle (arg_u32 %k, arg_u64 %left, arg_u64 %right) (arg_u64 %queue);
 Which given a deque
 returns (%k) the position of the left-most RN
 atomic_ld_global_scacq used to read node from array
 Makes one if necessary (i.e. if there are only LN or DN)
 atomic_cas_global_scar, required to make new RN
 returns (%left) the left node (i.e. the value to the left of the left most RN position)
 returns (%right) the right node (i.e. the value at position (%k))
© Copyright 2014 HSA Foundation. All Rights Reserved
RIGHT POP
function &right_pop(arg_u32 %err, arg_u64 %value) (arg_u64 %deque) {
// load queue address
ld_arg_u64 $d0, [%deque];
@loop_forever:
// setup and call right oracle to get next RN
arg_u32 %k; arg_u64 %current; arg_u64 %next;
call &rcheck_oracle (%k, %current, %next) (%deque);
ld_arg_u32 $s0, [%k]; ld_arg_u64 $d1, [%current]; ld_arg_u64 $d2, [%next];
// current.type($d5)
shr_u64 $d5, $d1, 62;
// current.counter($d6)
and_u64 $d6, $d1, 0x3FC0000000000000;
shr_u64 $d6, $d6, 54;
// current.value($d7)
and_u64 $d7, $d1, 0x3FFFFFFFFFFFFF;
// next.counter($d8)
and_u64 $d8, $d2, 0x3FC0000000000000; shr_u64 $d8, $d8, 54;
brn @loop_forever ;
}
© Copyright 2014 HSA Foundation. All Rights Reserved
RIGHT POP – TEST FOR EMPTY
// not empty when current.type($d5) != LN && current.type($d5) != DN
cmp_neq_b1_u64 $c0, $d5, LN; cmp_neq_b1_u64 $c1, $d5, DN;
and_b1 $c0, $c0, $c1;
cbr $c0, @not_empty ;
// current node address (%deque($d0) + (%k($s0) - 1) * 16)
add_u32 $s1, $s0, -1; mul_u32 $s1, $s1, 16;
cvt_u64_u32 $d3, $s1; add_u64 $d3, $d0, $d3;
atomic_ld_global_scacq_u64 $d4, [$d3];
cmp_neq_b1_u64 $c0, $d4, $d1;
cbr $c0, @not_empty;
st_arg_u32 EMPTY, [%err]; // deque empty so return EMPTY
ret;
@not_empty:
© Copyright 2014 HSA Foundation. All Rights Reserved
RIGHT POP – TRY READ/REMOVE NODE
// $d9 = node(RN, next.cnt+1, 0)
add_u64 $d8, $d8, 1;
shl_u64 $d8, $d8, 54;  // counter into bits 54..61
shl_u64 $d9, RN, 62;   // type into bits 62..63
or_u64 $d9, $d8, $d9;
// node k address: $d10 = %deque($d0) + %k($s0) * 16
mul_u32 $s2, $s0, 16; cvt_u64_u32 $d10, $s2; add_u64 $d10, $d0, $d10;
// cas(deq+k, next, node(RN, next.cnt+1, 0))
atomic_cas_global_scar_u64 $d9, [$d10], $d2, $d9;
cmp_neq_u64 $c0, $d9, $d2;
cbr $c0, @cas_failed;
// $d9 = node(RN, current.cnt+1, 0)
add_u64 $d6, $d6, 1;
shl_u64 $d6, $d6, 54;
shl_u64 $d9, RN, 62;
or_u64 $d9, $d6, $d9;
// cas(deq+(k-1), curr, node(RN, curr.cnt+1, 0)); $d3 computed earlier
atomic_cas_global_scar_u64 $d9, [$d3], $d1, $d9;
cmp_neq_u64 $c0, $d9, $d1;
cbr $c0, @cas_failed;
st_arg_u32 SUCCESS, [%err];
st_arg_u64 $d7, [%value];
ret;
@cas_failed:
// loop back around and try again
© Copyright 2014 HSA Foundation. All Rights Reserved
TAKE AWAYS
 HSA provides a powerful and modern memory model
 Based on the well-known SC for DRF
 Defined as Release Consistency
 Extended with scopes as defined by HRF
 OpenCL 2.0 introduces a new memory model
 Also based on SC for DRF
 Also defined in terms of Release Consistency
 Also extended with scopes as defined in HRF
 Has a well defined mapping to HSA
 Concurrent algorithm development for emerging heterogeneous compute
clusters can benefit from the HSA and OpenCL 2.0 memory models
© Copyright 2014 HSA Foundation. All Rights Reserved
HSA QUEUING MODEL
HAKAN PERSSON, SENIOR PRINCIPAL ENGINEER,
ARM
HSA QUEUEING, MOTIVATION
MOTIVATION (TODAY’S PICTURE)
© Copyright 2014 HSA Foundation. All Rights Reserved
[Figure: today's dispatch flow. Application: transfer buffer to GPU -> OS: copy/map memory -> Application: queue job -> OS: schedule job -> GPU: start job, finish job -> OS: schedule application -> Application: get buffer -> OS: copy/map memory.]
HSA QUEUEING: REQUIREMENTS
REQUIREMENTS
 Three key technologies are used to build the user mode queueing
mechanism
 Shared Virtual Memory
 System Coherency
 Signaling
 AQL (Architected Queueing Language) enables any agent
to enqueue tasks
© Copyright 2014 HSA Foundation. All Rights Reserved
SHARED VIRTUAL MEMORY
SHARED VIRTUAL MEMORY (TODAY)
 Multiple virtual memory address spaces
© Copyright 2014 HSA Foundation. All Rights Reserved
[Figure: CPU0 maps VA1->PA1 through VIRTUAL MEMORY1 and the GPU maps VA2->PA1 through VIRTUAL MEMORY2; both address the same PHYSICAL MEMORY.]
SHARED VIRTUAL MEMORY (HSA)
 Common Virtual Memory for all HSA agents
© Copyright 2014 HSA Foundation. All Rights Reserved
[Figure: CPU0 and the GPU share one VIRTUAL MEMORY; both use the same VA->PA mapping into PHYSICAL MEMORY.]
SHARED VIRTUAL MEMORY
 Advantages
 No mapping tricks, no copying back-and-forth between different PA
addresses
 Send pointers (not data) back and forth between HSA agents.
 Implications
 Common Page Tables (and common interpretation of architectural
semantics such as shareability, protection, etc).
 Common mechanisms for address translation (and servicing address
translation faults)
 Concept of a process address space (PASID) to allow multiple, per
process virtual address spaces within the system.
© Copyright 2014 HSA Foundation. All Rights Reserved
SHARED VIRTUAL MEMORY
 Specifics
 Minimum supported VA width is 48b for 64b systems, and 32b for
32b systems.
 HSA agents may reserve VA ranges for internal use via system
software.
 All HSA agents other than the host unit must use the lowest privilege
level
 If present, read/write access flags for page tables must be
maintained by all agents.
 Read/write permissions apply to all HSA agents, equally.
© Copyright 2014 HSA Foundation. All Rights Reserved
GETTING THERE …
© Copyright 2014 HSA Foundation. All Rights Reserved
[The dispatch flow figure repeats; with shared virtual memory the buffer copy/map steps are no longer needed.]
CACHE COHERENCY
CACHE COHERENCY DOMAINS (1/3)
 Data accesses to global memory segment from all HSA Agents shall be
coherent without the need for explicit cache maintenance.
© Copyright 2014 HSA Foundation. All Rights Reserved
CACHE COHERENCY DOMAINS (2/3)
 Advantages
 Composability
 Reduced SW complexity when communicating between agents
 Lower barrier to entry when porting software
 Implications
 Hardware coherency support between all HSA agents
 Can take many forms
 Stand alone Snoop Filters / Directories
 Combined L3/Filters
 Snoop-based systems (no filter)
 Etc …
© Copyright 2014 HSA Foundation. All Rights Reserved
CACHE COHERENCY DOMAINS (3/3)
 Specifics
 No requirement for instruction memory accesses to be
coherent
 Only applies to the Primary memory type.
 No requirement for HSA agents to maintain coherency to any
memory location where the HSA agents do not specify the
same memory attributes
 Read-only image data is required to remain static during the
execution of an HSA kernel.
 No double mapping (via different attributes) in order to
modify it; the data must remain static
© Copyright 2014 HSA Foundation. All Rights Reserved
GETTING CLOSER …
© Copyright 2014 HSA Foundation. All Rights Reserved
[The dispatch flow figure repeats, with fewer steps remaining.]
SIGNALING
SIGNALING (1/3)
 HSA agents support the ability to use signaling objects
 All creation/destruction of signaling objects occurs via HSA
runtime APIs
 From an HSA agent you can directly access signaling objects:
 Signal a signal object (this will wake up HSA agents
waiting upon the object)
 Query the current object value
 Wait on the current object value (various conditions supported)
© Copyright 2014 HSA Foundation. All Rights Reserved
SIGNALING (2/3)
 Advantages
 Enables asynchronous events between HSA agents,
without involving the kernel
 Common idiom for work offload
 Low power waiting
 Implications
 Runtime support required
 Commonly implemented on top of cache coherency flows
© Copyright 2014 HSA Foundation. All Rights Reserved
SIGNALING (3/3)
 Specifics
 Only supported within a PASID
 Supported wait conditions are =, !=, < and >=
 Wait operations may return sporadically (no guarantee against
false positives)
 Programmer must test.
 Wait operations have a maximum duration before returning.
 The HSAIL atomic operations are supported on signal objects.
 Signal objects are opaque
 Must use dedicated HSAIL/HSA runtime operations
© Copyright 2014 HSA Foundation. All Rights Reserved
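A host-side sketch of the signaling flow. Only hsa_signal_send_relaxed is named later in this deck; the create/wait/destroy names and enum values below are assumptions in the same naming style, not confirmed provisional API:

#include <hsa.h>
void signal_example() {
  hsa_signal_t done;
  // Assumed creation call: initial value 1, no restricted consumer list.
  hsa_signal_create(1, 0, NULL, &done);
  // ... hand the signal to another agent, e.g. as a packet's completionSignal ...
  // Low-power wait until the value satisfies the condition (== 0).
  // Waits may return sporadically, so the condition is re-tested in a loop.
  while (hsa_signal_wait_acquire(done, HSA_SIGNAL_CONDITION_EQ, 0,
                                 UINT64_MAX, HSA_WAIT_STATE_BLOCKED) != 0) {}
  hsa_signal_destroy(done);
}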
ALMOST THERE…
© Copyright 2014 HSA Foundation. All Rights Reserved
[The dispatch flow figure repeats, with fewer steps remaining.]
USER MODE QUEUING
ONE BLOCK LEFT
© Copyright 2014 HSA Foundation. All Rights Reserved
[The dispatch flow figure repeats; only job queuing still involves the OS.]
USER MODE QUEUEING (1/3)
 User mode Queueing
 Enables user space applications to directly, without OS
intervention, enqueue jobs (“Dispatch Packets”) for HSA
agents.
 Queues are created/destroyed via calls to the HSA
runtime.
 One (or many) agents enqueue packets, a single agent
dequeues packets.
 Requires coherency and shared virtual memory.
© Copyright 2014 HSA Foundation. All Rights Reserved
USER MODE QUEUEING (2/3)
 Advantages
 Avoid involving the kernel/driver when dispatching work for an Agent.
 Lower latency job dispatch enables finer granularity of offload
 Standard memory protection mechanisms may be used to protect communication with
the consuming agent.
 Implications
 Packet formats/fields are Architected – standard across vendors!
 Guaranteed backward compatibility
 Packets are enqueued/dequeued via an Architected protocol (all via memory
accesses and signaling)
 More on this later……
© Copyright 2014 HSA Foundation. All Rights Reserved
SUCCESS!
© Copyright 2014 HSA Foundation. All Rights Reserved
[The dispatch flow figure repeats one final time.]
SUCCESS!
© Copyright 2014 HSA Foundation. All Rights Reserved
[Figure: the final flow - the Application queues the job and the GPU starts and finishes it; the OS is out of the dispatch path.]
ARCHITECTED QUEUEING
LANGUAGE, QUEUES
ARCHITECTED QUEUEING LANGUAGE
 HSA Queues look just like standard shared
memory queues, supporting multi-producer,
single-consumer
 Single producer variant defined with some
optimizations possible.
 Queues consist of storage, read/write indices, ID,
etc.
 Queues are created/destroyed via calls to the
HSA runtime
 “Packets” are placed in queues directly from user
mode, via an architected protocol
 Packet format is architected
© Copyright 2014 HSA Foundation. All Rights Reserved
[Figure: two producers and one consumer share a queue; packets live in storage in coherent, shared memory, tracked by a read index and a write index.]
ARCHITECTED QUEUING LANGUAGE
 Packets are read and dispatched for execution from the queue in order, but
may complete in any order.
 There is no guarantee that more than one packet will be processed in parallel at a
time
 There may be many queues. A single agent may also consume from several
queues.
 Any HSA agent may enqueue packets
 CPUs
 GPUs
 Other accelerators
© Copyright 2014 HSA Foundation. All Rights Reserved
QUEUE STRUCTURE
© Copyright 2014 HSA Foundation. All Rights Reserved
Offset (bytes) Size (bytes) Field Notes
0 4 queueType Differentiate different queues
4 4 queueFeatures Indicate supported features
8 8 baseAddress Pointer to packet array
16 8 doorbellSignal HSA signaling object handle
24 4 size Packet array cardinality
28 4 queueId Unique per process
32 8 serviceQueue Queue for callback services
intrinsic 8 writeIndex Packet array write index
intrinsic 8 readIndex Packet array read index
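Written out as a struct, the fields and offsets above line up as follows (a sketch; readIndex and writeIndex are deliberately absent because they are intrinsic, reachable only through runtime/HSAIL operations):

#include <stdint.h>
// Field names follow the table; natural alignment reproduces the offsets
// 0, 4, 8, 16, 24, 28, 32.
struct hsa_queue {
  uint32_t queueType;      // differentiates queue variants (MULTI/SINGLE)
  uint32_t queueFeatures;  // bitfield of supported packet types
  uint64_t baseAddress;    // pointer to the packet array
  uint64_t doorbellSignal; // HSA signaling object handle
  uint32_t size;           // packet array cardinality (a power of 2)
  uint32_t queueId;        // unique per process
  uint64_t serviceQueue;   // queue for callback services (may be NULL)
};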
QUEUE VARIANTS
 queueType and queueFeatures together define queue semantics and
capabilities
 Two queueType values defined, other values reserved:
 MULTI – queue supports multiple producers
 SINGLE – queue supports single producer
 queueFeatures is a bitfield indicating capabilities
 DISPATCH (bit 0) if set then queue supports DISPATCH packets
 AGENT_DISPATCH (bit 1) if set then queue supports AGENT_DISPATCH packets
 All other bits are reserved and must be 0
© Copyright 2014 HSA Foundation. All Rights Reserved
QUEUE STRUCTURE DETAILS
 Queue doorbells are HSA signaling objects with restrictions
 Created as part of the queue – lifetime tied to queue object
 Atomic read-modify-write not allowed
 size field value must be a power of 2
 serviceQueue can be used by HSA kernel for callback services
 Provided by application when queue is created
 Can be mapped to HSA runtime provided serviceQueue, an application serviced
queue, or NULL if no serviceQueue required
© Copyright 2014 HSA Foundation. All Rights Reserved
READ/WRITE INDICES
 readIndex and writeIndex properties are part of the queue, but not visible in the queue structure
 Accessed through HSA runtime API and HSAIL operations
 HSA runtime/HSAIL operations defined to
 Read readIndex or writeIndex property
 Write readIndex or writeIndex property
 Add constant to writeIndex property (returns previous writeIndex value)
 CAS on writeIndex property
 readIndex & writeIndex operations treated as atomic in memory model
 relaxed, acquire, release and acquire-release variants defined as applicable
 readIndex and writeIndex never wrap
 PacketID – the index of a particular packet
 Uniquely identifies each packet of a queue
© Copyright 2014 HSA Foundation. All Rights Reserved
PACKET ENQUEUE
 Packet enqueue follows a few simple steps:
 Reserve space
 Multiple packets can be reserved at a time
 Write packet to queue
 Mark packet as valid
 Producer no longer allowed to modify packet
 Consumer is allowed to start processing packet
 Notify consumer of packet through the queue doorbell
 Multiple packets can be notified at a time
 Doorbell signal should be signaled with last packetID notified
 On small machine model the lower 32 bits of the packetID are used
© Copyright 2014 HSA Foundation. All Rights Reserved
PACKET RESERVATION
 Two flows envisaged
 Atomic add writeIndex with number of packets to reserve
 Producer must wait until packetID < readIndex + size before writing to packet
 Queue can be sized so that wait is unlikely (or impossible)
 Suitable when many threads use one queue
 Check queue not full first, then use atomic CAS to update writeIndex
 Can be inefficient if many threads use the same queue
 Allows different failure model if queue is congested
© Copyright 2014 HSA Foundation. All Rights Reserved
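A sketch of the second flow. hsa_queue_load_write_index_relaxed and hsa_queue_cas_write_index_relaxed are assumed names in the style of the index operations used on the next slides, with the CAS returning the previous writeIndex value, and hsa_queue is the struct sketched earlier:

// Returns true if a packet slot was reserved, false if the queue is full
// or another producer won the race (caller may retry or report congestion).
bool try_reserve_packet(hsa_queue* q, uint64_t* packetID) {
  uint64_t wrIdx = hsa_queue_load_write_index_relaxed(q);
  uint64_t rdIdx = hsa_queue_load_read_index_relaxed(q);
  if (wrIdx >= rdIdx + q->size) {
    return false;  // queue full: fail instead of waiting
  }
  // Claim the slot only if no other producer advanced writeIndex first.
  if (hsa_queue_cas_write_index_relaxed(q, wrIdx, wrIdx + 1) != wrIdx) {
    return false;  // lost the race
  }
  *packetID = wrIdx;
  return true;
}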
QUEUE OPTIMIZATIONS
 Queue behavior is loosely defined to allow optimizations
 Some potential producer behavior optimizations:
 Keep local copy of readIndex, update when required
 For single producer queues:
 Keep local copy of writeIndex
 Use store operation rather than add/cas atomic to update writeIndex
 Some potential consumer behavior optimizations:
 Use packet format field to determine whether a packet has been submitted rather than writeIndex
property
 Speculatively read multiple packets from the queue
 Don’t update readIndex for each packet processed
 Rely on value used for doorbellSignal to notify new packets
 Especially useful for single producer queues
© Copyright 2014 HSA Foundation. All Rights Reserved
POTENTIAL MULTI-PRODUCER ALGORITHM
// Allocate packet
uint64_t packetID = hsa_queue_add_write_index_relaxed(q, 1);
// Wait until the queue is no longer full.
uint64_t rdIdx;
do {
rdIdx = hsa_queue_load_read_index_relaxed(q);
} while (packetID >= (rdIdx + q->size));
// calculate index
uint32_t arrayIdx = packetID & (q->size-1);
// copy over the packet; its format field is still INVALID
q->baseAddress[arrayIdx] = pkt;
// Update format field with release semantics
q->baseAddress[arrayIdx].hdr.format.store(DISPATCH, std::memory_order_release);
// ring doorbell, with release semantics (could also amortize over multiple packets)
hsa_signal_send_relaxed(q->doorbellSignal, packetID);
© Copyright 2014 HSA Foundation. All Rights Reserved
POTENTIAL CONSUMER ALGORITHM
// Get location of next packet
uint64_t readIndex = hsa_queue_load_read_index_relaxed(q);
// calculate the index
uint32_t arrayIdx = readIndex & (q->size-1);
// spin while empty (could also perform low-power wait on doorbell)
while (INVALID == q->baseAddress[arrayIdx].hdr.format) { }
// copy over the packet
pkt = q->baseAddress[arrayIdx];
// set the format field to invalid
q->baseAddress[arrayIdx].hdr.format.store(INVALID, std::memory_order_relaxed);
// Update the readIndex using HSA intrinsic
hsa_queue_store_read_index_relaxed(q, readIndex+1);
// Now process <pkt>!
© Copyright 2014 HSA Foundation. All Rights Reserved
ARCHITECTED QUEUEING
LANGUAGE, PACKETS
PACKETS
© Copyright 2014 HSA Foundation. All Rights Reserved
 Packets come in three main types with architected layouts
 Always reserved & Invalid
 Do not contain any valid tasks and are not processed (queue will not progress)
 Dispatch
 Specifies kernel execution over a grid
 Agent Dispatch
 Specifies a single function to perform with a set of parameters
 Barrier
 Used for task dependencies
COMMON PACKET HEADER
Start offset 0, format uint16_t, with the following bitfields:
format:8 - Contains the packet type (Always Reserved, Invalid,
Dispatch, Agent Dispatch, or Barrier). Other values are
reserved and should not be used.
barrier:1 - If set then processing of the packet will only begin when all
preceding packets are complete.
acquireFenceScope:2 - Determines the scope and type of the memory fence
operation applied before the packet enters the active phase.
Must be 0 for Barrier packets.
releaseFenceScope:2 - Determines the scope and type of the memory fence
operation applied after kernel completion but before the
packet is completed.
reserved:3 - Must be 0
© Copyright 2014 HSA Foundation. All Rights Reserved
DISPATCH PACKET
© Copyright 2014 HSA Foundation. All Rights Reserved
Start
Offset
(Bytes)
Format Field Name Description
0 uint16_t header Packet header
2 uint16_t
dimensions:2 Number of dimensions specified in gridSize. Valid values are 1, 2, or 3.
reserved:14 Must be 0.
4 uint16_t workgroupSize.x x dimension of work-group (measured in work-items).
6 uint16_t workgroupSize.y y dimension of work-group (measured in work-items).
8 uint16_t workgroupSize.z z dimension of work-group (measured in work-items).
10 uint16_t reserved2 Must be 0.
12 uint32_t gridSize.x x dimension of grid (measured in work-items).
16 uint32_t gridSize.y y dimension of grid (measured in work-items).
20 uint32_t gridSize.z z dimension of grid (measured in work-items).
24 uint32_t privateSegmentSizeBytes Total size in bytes of private memory allocation request (per work-item).
28 uint32_t groupSegmentSizeBytes Total size in bytes of group memory allocation request (per work-group).
32 uint64_t kernelObjectAddress
Address of an object in memory that includes an implementation-defined
executable ISA image for the kernel.
40 uint64_t kernargAddress Address of memory containing kernel arguments.
48 uint64_t reserved3 Must be 0.
56 uint64_t completionSignal Address of HSA signaling object used to indicate completion of the job.
AGENT DISPATCH PACKET
© Copyright 2014 HSA Foundation. All Rights Reserved
Start Offset
(Bytes)
Format Field Name Description
0 uint16_t header Packet header
2 uint16_t type
The function to be performed by the destination Agent. The type value is
split into the following ranges:
 0x0000:0x3FFF – Vendor specific
 0x4000:0x7FFF – HSA runtime
 0x8000:0xFFFF – User registered function
4 uint32_t reserved2 Must be 0.
8 uint64_t returnLocation Pointer to location to store the function return value in.
16 uint64_t arg[0]
64-bit direct or indirect arguments.
24 uint64_t arg[1]
32 uint64_t arg[2]
40 uint64_t arg[3]
48 uint64_t reserved3 Must be 0.
56 uint64_t completionSignal Address of HSA signaling object used to indicate completion of the job.
BARRIER PACKET
 Used for specifying dependences between packets
 HSA agent will not launch any further packets from this queue until the barrier
packet signal conditions are met
 Used for specifying dependences on packets dispatched from any queue.
 Execution phase completes only when all of the dependent signals (up to five) have
been signaled (with the value of 0).
 Or if an error has occurred in one of the packets upon which we have a dependence.
© Copyright 2014 HSA Foundation. All Rights Reserved
BARRIER PACKET
© Copyright 2014 HSA Foundation. All Rights Reserved
Start Offset
(Bytes)
Format Field Name Description
0 uint16_t header Packet header (see the common packet header above).
2 uint16_t reserved2 Must be 0.
4 uint32_t reserved3 Must be 0.
8 uint64_t depSignal0
Address of dependent signaling objects to be evaluated by the packet processor.
16 uint64_t depSignal1
24 uint64_t depSignal2
32 uint64_t depSignal3
40 uint64_t depSignal4
48 uint64_t reserved4 Must be 0.
56 uint64_t completionSignal Address of HSA signaling object used to indicate completion of the job.
DEPENDENCES
 A user may never assume more than one packet is being executed by an HSA
agent at a time.
 Implications:
 Packets can’t poll on shared memory values which will be set by packets issued from
other queues, unless the user has ensured the proper ordering.
 To ensure all previous packets from a queue have been completed, use the Barrier
bit.
 To ensure specific packets from any queue have completed, use the Barrier packet.
© Copyright 2014 HSA Foundation. All Rights Reserved
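For illustration, a barrier packet laid out per the table above and filled in to wait on two packets from other queues (the struct and helper are sketches, not the provisional headers):

#include <stdint.h>
#include <string.h>
// Offsets match the table: 0, 2, 4, 8..40, 48, 56 (64 bytes total).
struct hsa_barrier_packet {
  uint16_t header;           // format = BARRIER, plus fence scopes / barrier bit
  uint16_t reserved2;
  uint32_t reserved3;
  uint64_t depSignal[5];     // up to five dependent signal handles
  uint64_t reserved4;
  uint64_t completionSignal; // signaled when the barrier completes
};
// Prepare a barrier that waits for two specific packets from other queues.
void init_barrier(hsa_barrier_packet* pkt, uint64_t sigA, uint64_t sigB,
                  uint64_t completion) {
  memset(pkt, 0, sizeof(*pkt));      // reserved fields must be 0
  pkt->depSignal[0] = sigA;          // barrier completes when both dependent
  pkt->depSignal[1] = sigB;          // signals reach 0 (or a dependence errors)
  pkt->completionSignal = completion;
  // The header's format field would be set to BARRIER as the final,
  // "mark valid" step of the enqueue protocol.
}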
HSA QUEUEING, PACKET EXECUTION
PACKET EXECUTION
 Launch phase
 Initiated when launch conditions are met
 All preceding packets in the queue must have exited launch phase
 If the barrier bit in the packet header is set, then all preceding packets in the queue
must have exited completion phase
 Includes memory acquire fence
 Active phase
 Execute the packet
 Barrier packets remain in Active phase until conditions are met.
 Completion phase
 First step is memory release fence – make results visible.
 completionSignal field is then signaled with a decrementing atomic.
© Copyright 2014 HSA Foundation. All Rights Reserved
PACKET EXECUTION – BARRIER BIT
© Copyright 2014 HSA Foundation. All Rights Reserved
[Timeline: Pkt1 and Pkt2 launch, execute, and complete with overlap; Pkt3 has barrier=1, so it launches only when all preceding packets in the queue have completed, then executes.]
© Copyright 2014 HSA Foundation. All Rights Reserved
PUTTING IT ALL TOGETHER (FFT)
© Copyright 2014 HSA Foundation. All Rights Reserved
[Figure: an 8-point FFT over X[0..7] built from six dispatch packets in three stages, with a barrier between each stage: packets 3 & 4 wait on 1 & 2, and packets 5 & 6 wait on 3 & 4.]
PUTTING IT ALL TOGETHER
© Copyright 2014 HSA Foundation. All Rights Reserved
AQL Pseudo Code
// Send the packets to do the first stage.
aql_dispatch(pkt1);
aql_dispatch(pkt2);
// Send the next two packets, setting the barrier bit so we
// know packets 1 & 2 will be complete before 3 and 4 are
// launched.
aql_dispatch_with_barrier_bit(pkt3);
aql_dispatch(pkt4);
// Same as above (make sure 3 & 4 are done before issuing 5 & 6).
aql_dispatch_with_barrier_bit(pkt5);
aql_dispatch(pkt6);
// This packet will notify us when 5 & 6 are complete.
aql_dispatch_with_barrier_bit(finish_pkt);
PACKET EXECUTION – BARRIER PACKET
© Copyright 2014 HSA Foundation. All Rights Reserved
[Timeline: queue Q1 holds task T1; queue Q2 holds a barrier packet followed by T2. Signal X is initialized to 1 and used as the barrier's depSignal0. T1 launches and executes, and on completion decrements signal X via its completionSignal. The barrier stays in its active phase until X is signalled with 0, then completes; T2 launches once the barrier is complete, then executes and completes.]
© Copyright 2014 HSA Foundation. All Rights Reserved
DEPTH FIRST CHILD TASK EXECUTION
 Consider two generations of child tasks
 Task T submits tasks T.1 & T.2
 Task T.1 submits tasks T.1.1 & T.1.2
 Task T.2 submits tasks T.2.1 & T.2.2
 Desired outcome
 Depth first child task execution
 i.e. T -> T.1 -> T.1.1 -> T.1.2 -> T.2 -> T.2.1 -> T.2.2
 T is passed a signal (allComplete) to decrement when all tasks are complete (T and its
children etc.)
© Copyright 2014 HSA Foundation. All Rights Reserved
[Figure: task tree - T at the root, children T.1 and T.2, grandchildren T.1.1, T.1.2, T.2.1, T.2.2.]
HOW TO DO THIS WITH HSA QUEUES?
 Use a separate user mode queue for each recursion level
 Task T submits to queue Q1
 Tasks T.1 & T.2 submit tasks to queue Q2
 Queues could be passed in as parameters to task T
 Depth first requires ordering of T.1, T.2 and their children
 Use additional signal object (childrenComplete) to track completion of the children of
T.1 & T.2
 childrenComplete set to number of children (i.e. 2) by each of T.1 & T.2
© Copyright 2014 HSA Foundation. All Rights Reserved
A PICTURE SAYS MORE THAN 1000 WORDS
© Copyright 2014 HSA Foundation. All Rights Reserved
[Figure: the task tree mapped onto two queues. Q1 holds T.1, a barrier that waits on childrenComplete, T.2, and a barrier that signals allComplete; Q2 holds T.1.1, T.1.2, T.2.1, T.2.2.]
SUMMARY
© Copyright 2014 HSA Foundation. All Rights Reserved
KEY HSA TECHNOLOGIES
 HSA combines several mechanisms to enable low overhead task
dispatch
 Shared Virtual Memory
 System Coherency
 Signaling
 AQL
 User mode queues – from any compatible agent
 Architected packet format
 Rich dependency mechanism
 Flexible and efficient signaling of completion
© Copyright 2014 HSA Foundation. All Rights Reserved
QUESTIONS?
© Copyright 2014 HSA Foundation. All Rights Reserved
HSA APPLICATIONS
WEN-MEI HWU, PROFESSOR, UNIVERSITY OF ILLINOIS
WITH J.P. BORDES AND JUAN GOMEZ
USE CASES SHOWING HSA ADVANTAGE
 Pointer-based Data Structures - Binary tree searches: the GPU performs parallel searches
in a CPU-created binary tree.
 HSA Advantage: CPU and GPU have access to the entire unified coherent memory;
the GPU can access existing data structures containing pointers.
 Platform Atomics - Work-group dynamic task management: the GPU directly operates on a
task pool managed by the CPU, for algorithms with dynamic computation loads.
Binary tree updates: CPU and GPU operate simultaneously on the tree, both making
modifications.
 HSA Advantage: CPU and GPU can synchronize using platform atomics; higher
performance through parallel operations, reducing the need for data copying and
reconciling.
 Large Data Sets - Hierarchical data searches: applications include object recognition,
collision detection, global illumination, BVH.
 HSA Advantage: CPU and GPU have access to the entire unified coherent memory;
the GPU can operate on huge models in place, reducing copy and kernel launch
overhead.
 CPU Callbacks - Middleware user-callbacks: the GPU processes work items, some of which
require a call to a CPU function to fetch new data.
 HSA Advantage: the GPU can invoke CPU functions from within a GPU kernel; simpler
programming does not require “split kernels”; higher performance through parallel
operations.
© Copyright 2014 HSA Foundation. All Rights Reserved
UNIFIED COHERENT MEMORY
FOR POINTER-BASED DATA
STRUCTURES
UNIFIED COHERENT MEMORY
MORE EFFICIENT POINTER DATA STRUCTURES
[Figure sequence, Legacy: the CPU builds a pointer-based TREE (nodes with L/R links) in SYSTEM MEMORY. The GPU KERNEL cannot follow host pointers, so the tree is flattened, the FLAT TREE and a RESULT BUFFER are copied into GPU MEMORY, the kernel searches there, and the RESULT BUFFER is copied back to system memory.]
© Copyright 2014 HSA Foundation. All Rights Reserved
UNIFIED COHERENT MEMORY
MORE EFFICIENT POINTER DATA STRUCTURES
[Figure sequence, HSA and full OpenCL 2.0: the GPU KERNEL traverses the pointer-based TREE and writes the RESULT BUFFER directly in SYSTEM MEMORY; no flattening and no copies are required.]
© Copyright 2014 HSA Foundation. All Rights Reserved
POINTER DATA STRUCTURES
- CODE COMPLEXITY
[Side-by-side code listing: the HSA version is substantially shorter than the Legacy version.]
© Copyright 2014 HSA Foundation. All Rights Reserved
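The heart of the HSA version is that the kernel walks the CPU-built tree in place. A sketch of the per-work-item search over ordinary host pointers (illustrative code, not the measured implementation from the slide):

// Under HSA's unified coherent memory the same node pointers are
// valid on the CPU and the GPU: no flattening, no index rewriting.
struct Node {
  int   key;
  Node* left;
  Node* right;
};
// One search, as each work-item would run it over its assigned key.
bool search(const Node* root, int key) {
  for (const Node* n = root; n != nullptr;
       n = (key < n->key) ? n->left : n->right) {
    if (n->key == key) return true;  // found: record hit in the result buffer
  }
  return false;
}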
POINTER DATA STRUCTURES
- PERFORMANCE
[Chart: Binary Tree Search - search rate (nodes/ms, 0 to 60,000) vs. tree size (1M, 5M, 10M, 25M nodes) for CPU (1 core), CPU (4 core), Legacy APU, and HSA APU.]
Measured in AMD labs Jan 1-3 on system shown in backup slide
© Copyright 2014 HSA Foundation. All Rights Reserved
PLATFORM ATOMICS FOR
DYNAMIC TASK MANAGEMENT
PLATFORM ATOMICS
ENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT
[Figure sequence, Legacy*: the CPU maintains a TASKS POOL in SYSTEM MEMORY while work-groups 1-4 on the GPU consume from QUEUE 1 and QUEUE 2 in GPU MEMORY. Each queue carries NUM. WRITTEN TASKS and NUM. CONSUMED TASKS counters. Tasks and the written-task counts reach GPU memory via asynchronous transfers; work-groups bump the consumed-task counters with atomic adds (counting 0 through 4); the consumed counts are finally read back zero-copy, so the CPU only observes consumption after the fact.]
*Chen et al., Dynamic load balancing on single- and multi-GPU systems, IPDPS 2010
© Copyright 2014 HSA Foundation. All Rights Reserved
PLATFORM ATOMICS
ENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT
[Figure sequence, HSA and full OpenCL 2.0: the TASKS POOL and the NUM. WRITTEN TASKS / NUM. CONSUMED TASKS counters for QUEUE 1 and QUEUE 2 live in HOST COHERENT MEMORY. Work-groups 1-4 consume tasks and update the counters with platform atomic adds (counting 0 through 4) that are immediately visible to the CPU, without staging through GPU MEMORY.]
© Copyright 2014 HSA Foundation. All Rights Reserved
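The consumption step reduces to a handful of platform atomics on counters in host coherent memory. A C++-style sketch of the idea (illustrative only; on the device side this would be expressed with OpenCL 2.0 or HSAIL platform-scope atomics, and the real scheme in Chen et al. is more elaborate):

#include <atomic>
// Per-queue counters in host coherent memory; CPU producer and GPU
// work-groups see each update without staging copies.
struct TaskQueue {
  std::atomic<int> numWritten;   // bumped by the CPU as tasks are inserted
  std::atomic<int> numConsumed;  // bumped by work-groups as tasks are claimed
};
// Work-group side: claim the next task if one is available.
// Returns the task index, or -1 if the queue is currently drained.
int try_claim(TaskQueue* q) {
  int written = q->numWritten.load(std::memory_order_acquire);
  int idx = q->numConsumed.fetch_add(1, std::memory_order_acq_rel);
  if (idx < written) return idx;  // claimed task 'idx'
  q->numConsumed.fetch_sub(1, std::memory_order_relaxed);  // give slot back
  return -1;
}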
PLATFORM ATOMICS – CODE COMPLEXITY
HSA: host enqueue function is 20 lines of code
Legacy: host enqueue function is 102 lines of code
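The 20-line figure is plausible once the task pool sits in unified coherent memory: enqueuing collapses to a store plus an atomic publish, with no buffer objects or explicit transfers. A hedged sketch, reusing the names from the task-pool sketch above (hostEnqueue is hypothetical, not an HSA runtime call):

// Single-producer enqueue into the shared pool; no clCreateBuffer,
// no clEnqueueWriteBuffer, no flush -- just a store and a publish.
int hostEnqueue(const Task& t) {
    int idx = numWritten.load(std::memory_order_relaxed);
    if (idx >= kPoolSize) return -1;                    // pool full
    pool[idx] = t;                                      // write the task in place
    numWritten.fetch_add(1, std::memory_order_release); // make it visible to the GPU
    return idx;
}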
© Copyright 2014 HSA Foundation. All Rights Reserved
PLATFORM ATOMICS - PERFORMANCE
[Chart: execution time (ms) of the legacy implementation vs. the HSA implementation, for 64/128/256/512 tasks per insertion at task pool sizes of 4096 and 16384; the y-axis runs from 0 to 700 ms.]
© Copyright 2014 HSA Foundation. All Rights Reserved
PLATFORM ATOMICS FOR CPU/GPU COLLABORATION
PLATFORM ATOMICS
ENABLING EFFICIENT GPU/CPU COLLABORATION
[Animation, legacy: only the GPU kernel can work on the input buffer and tree; concurrent CPU processing is not possible.]
© Copyright 2014 HSA Foundation. All Rights Reserved
PLATFORM ATOMICS
[Animation, HSA and full OpenCL 2.0: the GPU kernel and CPU cores 0 and 1 operate on the same tree and input buffer concurrently.]
© Copyright 2014 HSA Foundation. All Rights Reserved
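A hedged sketch of the kind of concurrent update this enables: a binary-search-tree insert made lock-free with compare-and-swap. Two CPU threads insert concurrently here; on an HSA system, GPU work-items could run the same loop on the same nodes using platform atomic CAS. Illustrative only (nodes are leaked for brevity).

#include <atomic>
#include <cstdio>
#include <thread>

struct Node {
    int key;
    std::atomic<Node*> left{nullptr}, right{nullptr};
    explicit Node(int k) : key(k) {}
};

void insert(Node* root, Node* n) {
    Node* cur = root;
    for (;;) {
        std::atomic<Node*>& slot = (n->key < cur->key) ? cur->left : cur->right;
        Node* child = slot.load(std::memory_order_acquire);
        if (child == nullptr) {
            Node* expected = nullptr;
            // CAS publishes the new node; on failure another inserter won
            // the slot, so descend into whatever it installed and retry.
            if (slot.compare_exchange_strong(expected, n, std::memory_order_release))
                return;
            child = expected;
        }
        cur = child;
    }
}

int main() {
    Node root(50);
    std::thread a([&] { for (int k = 0;  k < 50;  ++k) insert(&root, new Node(k)); });
    std::thread b([&] { for (int k = 51; k < 100; ++k) insert(&root, new Node(k)); });
    a.join(); b.join();
    std::puts("concurrent inserts done");
}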
UNIFIED COHERENT MEMORY FOR LARGE DATA SETS
PROCESSING LARGE DATA SETS
The CPU creates a large data structure in System Memory. Computations using the data are offloaded to the GPU.
[Diagram: a large 3D spatial data structure, a five-level hierarchy (Level 1 through Level 5), resides in System Memory alongside the GPU. Compare the HSA and legacy methods.]
© Copyright 2014 HSA Foundation. All Rights Reserved
LEGACY ACCESS USING GPU MEMORY
Legacy: GPU Memory is smaller than System Memory, so the structure has to be copied and processed in chunks.
[Animation, legacy: the top 2 levels of the hierarchy are copied from System Memory into GPU Memory and a first kernel processes them; the bottom 3 levels of one branch are then copied in for a second kernel; then the bottom 3 levels of a different branch, and so on, one copy and one kernel launch per chunk through an Nth kernel.]
© Copyright 2014 HSA Foundation. All Rights Reserved
LARGE SPATIAL DATA STRUCTURE: GPU CAN TRAVERSE ENTIRE HIERARCHY
[Animation, HSA and full OpenCL 2.0: the kernel walks the whole five-level hierarchy in place in System Memory, descending branch by branch with no copies into GPU Memory.]
© Copyright 2014 HSA Foundation. All Rights Reserved
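A minimal sketch of why unified coherent memory matters here: the CPU builds a pointer-linked hierarchy, and the kernel traverses it through those same pointers. Under HSA the traversal could run as a GPU kernel with no serialization or chunked copies; it is shown as plain C++ below, with a tiny five-level chain standing in for the large structure. Illustrative only.

#include <cstdio>
#include <vector>

struct SpatialNode {
    int level;
    std::vector<SpatialNode*> children; // raw pointers, valid on CPU and GPU under HSA
};

// Depth-first walk of the whole hierarchy, in place in system memory.
void traverse(const SpatialNode* n) {
    std::printf("visiting level-%d node\n", n->level);
    for (const SpatialNode* c : n->children)
        traverse(c);
}

int main() {
    SpatialNode nodes[5]; // CPU-side construction
    for (int i = 0; i < 5; ++i) nodes[i].level = i + 1;
    for (int i = 0; i < 4; ++i) nodes[i].children.push_back(&nodes[i + 1]);
    traverse(&nodes[0]); // the "kernel" walks it with no copies
}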
CALLBACKS
CALLBACKS
A COMMON SITUATION IN HETEROGENEOUS COMPUTING
 Parallel processing algorithm with branches
 A seldom-taken branch requires new data from the CPU
 On legacy systems, the algorithm must be split:
 Process Kernel 1 on GPU
 Check for CPU callbacks and, if any, process on CPU
 Process Kernel 2 on GPU
 Example algorithm from image processing:
 Perform a filter
 Calculate average LUMA in each tile
 Compare LUMA against a threshold and call a CPU callback if exceeded (rare)
 Perform special processing on tiles with callbacks
[Figure: input image and output image]
© Copyright 2014 HSA Foundation. All Rights Reserved
CALLBACKS
Legacy
[Diagram: GPU threads 0 through N; threads that hit the callback branch must stop and wait for a continuation kernel.]
The continuation kernel finishes up the kernel's work; splitting the kernel results in poor GPU utilization.
© Copyright 2014 HSA Foundation. All Rights Reserved
CALLBACKS
[Figure: input image, 1 tile = 1 OpenCL work-item, and output image]
GPU
• Work items compute the average RGB value of all the pixels in a tile
• Work items also compute average Luma from the average RGB
• If average Luma > threshold, the workgroup invokes a CPU CALLBACK
• In parallel with the callback, compute continues
CPU
• For selected tiles, update the average Luma value (set to RED)
GPU
• Work items apply the Luma value to all pixels in the tile
GPU to CPU callbacks use Shared Virtual Memory (SVM) semaphores, implemented using Platform Atomic Compare-and-Swap; a sketch follows below.
© Copyright 2014 HSA Foundation. All Rights Reserved
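A hedged sketch of such an SVM-semaphore callback: the GPU side raises a request with a compare-and-swap on a flag in shared virtual memory and keeps computing, while a CPU service thread claims the request with another CAS. CPU threads stand in for GPU workgroups; all names (CallbackSlot, gpuWorkgroup, cpuServiceThread) are illustrative.

#include <atomic>
#include <cstdio>
#include <thread>

enum : int { IDLE = 0, REQUESTED = 1, DONE = 2 };

struct CallbackSlot {
    std::atomic<int> state{IDLE};
    float avgLuma = 0.0f;  // request payload, visible to both sides via SVM
};

CallbackSlot slot;         // lives in shared virtual memory under HSA
std::atomic<bool> quit{false};

void gpuWorkgroup() {
    slot.avgLuma = 0.9f;   // pretend this tile exceeded the threshold
    int expected = IDLE;
    while (!slot.state.compare_exchange_weak(expected, REQUESTED,
                                             std::memory_order_release))
        expected = IDLE;   // raise the request with a platform atomic CAS
    // ...the workgroup would continue processing other tiles here...
    while (slot.state.load(std::memory_order_acquire) != DONE)
        std::this_thread::yield();
    std::printf("workgroup sees serviced callback\n");
}

void cpuServiceThread() {
    while (!quit.load(std::memory_order_relaxed)) {
        int expected = REQUESTED;
        if (slot.state.compare_exchange_strong(expected, DONE,
                                               std::memory_order_acq_rel))
            std::printf("CPU serviced callback, luma %.2f\n", slot.avgLuma);
    }
}

int main() {
    std::thread cpu(cpuServiceThread);
    std::thread gpu(gpuWorkgroup);
    gpu.join();
    quit.store(true);
    cpu.join();
}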
CALLBACKS
HSA and full OpenCL 2.0
[Diagram: GPU threads 0 through N; the few kernel threads that need CPU callback services are serviced immediately while the rest keep running.]
© Copyright 2014 HSA Foundation. All Rights Reserved
SUMMARY - HSA ADVANTAGE
Programming Technique: Pointer-based Data Structures
 Use Case: Binary tree searches. GPU performs parallel searches in a CPU-created binary tree.
 HSA Advantage: CPU and GPU have access to the entire unified coherent memory. GPU can access existing data structures containing pointers.
Programming Technique: Platform Atomics
 Use Case: Work-group dynamic task management. GPU directly operates on a task pool managed by the CPU for algorithms with dynamic computation loads. Binary tree updates: CPU and GPU operate simultaneously on the tree, both doing modifications.
 HSA Advantage: CPU and GPU can synchronize using Platform Atomics. Higher performance through parallel operations, reducing the need for data copying and reconciling.
Programming Technique: Large Data Sets
 Use Case: Hierarchical data searches. Applications include object recognition, collision detection, global illumination, BVH.
 HSA Advantage: CPU and GPU have access to the entire unified coherent memory. GPU can operate on huge models in place, reducing copy and kernel launch overhead.
Programming Technique: CPU Callbacks
 Use Case: Middleware user-callbacks. GPU processes work items, some of which require a call to a CPU function to fetch new data.
 HSA Advantage: GPU can invoke CPU functions from within a GPU kernel. Simpler programming that does not require "split kernels". Higher performance through parallel operations.
© Copyright 2014 HSA Foundation. All Rights Reserved
QUESTIONS?
HSA COMPILATION
WEN-MEI HWU, CTO, MULTICOREWARE INC
WITH RAY I-JUI SUNG
KEY HSA FEATURES FOR COMPILATION
ALL-PROCESSORS-EQUAL
 GPU and CPU have equal flexibility to create and dispatch work items
EQUAL ACCESS TO ENTIRE SYSTEM MEMORY
 GPU and CPU have uniform visibility into the entire memory space
[Diagram: CPU and GPU share a single dispatch path and unified coherent memory.]
© Copyright 2014 HSA Foundation. All Rights Reserved
A QUICK REVIEW OF OPENCL
CURRENT STATE OF PORTABLE HETEROGENEOUS PARALLEL PROGRAMMING
DEVICE CODE IN OPENCL
SIMPLE MATRIX MULTIPLICATION
__kernel void
matrixMul(__global float* C, __global float* A, __global float* B, int wA, int wB) {
int tx = get_global_id(0);
int ty = get_global_id(1);
float value = 0;
for (int k = 0; k < wA; ++k)
{
float elementA = A[ty * wA + k];
float elementB = B[k * wB + tx];
value += elementA * elementB;
}
C[ty * wA + tx] = value;
}
Explicit thread index usage.
Reasonably readable.
Portable across CPUs, GPUs, and FPGAs
© Copyright 2014 HSA Foundation. All Rights Reserved
HOST CODE IN OPENCL - CONCEPTUAL
1. Allocate and initialize memory on the host side
2. Initialize OpenCL
3. Allocate device memory and move the data
4. Load and build device code
5. Launch kernel
 a. Append arguments
6. Move the data back from the device
© Copyright 2014 HSA Foundation. All Rights Reserved
int main(int argc, char** argv){
// set seed for rand()
srand(2006);
/****************************************************/
/* Allocate and initialize memory on Host Side */
/****************************************************/
// allocate and initialize host memory for matrices A and B
unsigned int size_A = WA * HA;
unsigned int mem_size_A = sizeof(float) * size_A;
float* h_A = (float*) malloc(mem_size_A);
unsigned int size_B = WB * HB;
unsigned int mem_size_B = sizeof(float) * size_B;
float* h_B = (float*) malloc(mem_size_B);
randomInit(h_A, size_A);
randomInit(h_B, size_B);
// allocate host memory for the result C
unsigned int size_C = WC * HC;
unsigned int mem_size_C = sizeof(float) * size_C;
float* h_C = (float*) malloc(mem_size_C);
/*****************************************/
/* Initialize OpenCL */
/*****************************************/
// OpenCL specific variables
cl_context clGPUContext;
cl_command_queue clCommandQue;
cl_program clProgram;
size_t dataBytes;
size_t kernelLength;
cl_int errcode;
cl_kernel clKernel;
// OpenCL device memory pointers for matrices
cl_mem d_A;
cl_mem d_B;
cl_mem d_C;
clGPUContext = clCreateContextFromType(0,
CL_DEVICE_TYPE_GPU,
NULL, NULL, &errcode);
shrCheckError(errcode, CL_SUCCESS);
// get the list of GPU devices associated with context
errcode = clGetContextInfo(clGPUContext,
CL_CONTEXT_DEVICES, 0, NULL,
&dataBytes);
cl_device_id *clDevices = (cl_device_id *)
malloc(dataBytes);
errcode |= clGetContextInfo(clGPUContext,
CL_CONTEXT_DEVICES, dataBytes,
clDevices, NULL);
shrCheckError(errcode, CL_SUCCESS);
//Create a command-queue
clCommandQue = clCreateCommandQueue(clGPUContext,
clDevices[0], 0, &errcode);
shrCheckError(errcode, CL_SUCCESS);
// 3. Allocate device memory and move data
d_C = clCreateBuffer(clGPUContext,
CL_MEM_READ_WRITE,
mem_size_C, NULL, &errcode);
d_A = clCreateBuffer(clGPUContext,
CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
mem_size_A, h_A, &errcode);
d_B = clCreateBuffer(clGPUContext,
CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
mem_size_B, h_B, &errcode);
// 4. Load and build OpenCL kernel
char *clMatrixMul = oclLoadProgSource("kernel.cl",
"// My comment\n",
&kernelLength);
shrCheckError(clMatrixMul != NULL, shrTRUE);
clProgram = clCreateProgramWithSource(clGPUContext,
1, (const char **)&clMatrixMul,
&kernelLength, &errcode);
shrCheckError(errcode, CL_SUCCESS);
errcode = clBuildProgram(clProgram, 0,
NULL, NULL, NULL, NULL);
shrCheckError(errcode, CL_SUCCESS);
clKernel = clCreateKernel(clProgram,
"matrixMul", &errcode);
shrCheckError(errcode, CL_SUCCESS);
// 5. Launch OpenCL kernel
size_t localWorkSize[2], globalWorkSize[2];
int wA = WA;
int wC = WC;
errcode = clSetKernelArg(clKernel, 0,
sizeof(cl_mem), (void *)&d_C);
errcode |= clSetKernelArg(clKernel, 1,
sizeof(cl_mem), (void *)&d_A);
errcode |= clSetKernelArg(clKernel, 2,
sizeof(cl_mem), (void *)&d_B);
errcode |= clSetKernelArg(clKernel, 3,
sizeof(int), (void *)&wA);
errcode |= clSetKernelArg(clKernel, 4,
sizeof(int), (void *)&wC);
shrCheckError(errcode, CL_SUCCESS);
localWorkSize[0] = 16;
localWorkSize[1] = 16;
globalWorkSize[0] = 1024;
globalWorkSize[1] = 1024;
errcode = clEnqueueNDRangeKernel(clCommandQue,
clKernel, 2, NULL, globalWorkSize,
localWorkSize, 0, NULL, NULL);
shrCheckError(errcode, CL_SUCCESS);
// 6. Retrieve result from device
errcode = clEnqueueReadBuffer(clCommandQue,
d_C, CL_TRUE, 0, mem_size_C,
h_C, 0, NULL, NULL);
shrCheckError(errcode, CL_SUCCESS);
// 7. clean up memory
free(h_A);
free(h_B);
free(h_C);
clReleaseMemObject(d_A);
clReleaseMemObject(d_C);
clReleaseMemObject(d_B);
free(clDevices);
free(clMatrixMul);
clReleaseContext(clGPUContext);
clReleaseKernel(clKernel);
clReleaseProgram(clProgram);
clReleaseCommandQueue(clCommandQue);}
Almost 100 lines of code – tedious and hard to maintain.
It does not take advantage of HSA features.
It will likely need to be changed for OpenCL 2.0.
COMPARING SEVERAL HIGH-LEVEL PROGRAMMING INTERFACES
C++AMP: C++ language extension proposed by Microsoft
Thrust: library proposed by NVIDIA (CUDA)
Bolt: library proposed by AMD
OpenACC: annotations and pragmas proposed by PGI
SYCL: C++ wrapper for OpenCL
All these proposals aim to reduce tedious boilerplate code and provide transparent porting to future systems (future proofing).
© Copyright 2014 HSA Foundation. All Rights Reserved
OPENACC
HSA ENABLES SIMPLER IMPLEMENTATION OR BETTER OPTIMIZATION
© Copyright 2014 HSA Foundation. All Rights Reserved
OPENACC - SIMPLE MATRIX MULTIPLICATION EXAMPLE
void MatrixMulti(float *C, const float *A, const float *B, int hA, int wA, int wB)
{
  #pragma acc parallel loop copyin(A[0:hA*wA]) copyin(B[0:wA*wB]) copyout(C[0:hA*wB])
  for (int i=0; i<hA; i++) {
    #pragma acc loop
    for (int j=0; j<wB; j++) {
      float sum = 0;
      for (int k=0; k<wA; k++) {
        float a = A[i*wA+k];
        float b = B[k*wB+j];
        sum += a*b;
      }
      C[i*wB+j] = sum;
    }
  }
}
Little host code overhead
Programmer annotation of kernel computation
Programmer annotation of data movement
© Copyright 2014 HSA Foundation. All Rights Reserved
ADVANTAGE OF HSA FOR OPENACC
 Flexibility in copyin and copyout implementation
 Flexible code generation for nested acc parallel loops
 E.g., inner loop bounds that depend on outer loop iterations (a sketch follows below)
 Compiler data affinity optimization (especially OpenACC kernel regions)
 The compiler does not have to undo programmer managed data transfers
© Copyright 2014 HSA Foundation. All Rights Reserved
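A hedged sketch of that nested-loop case: a triangular loop whose inner bound depends on the outer iteration, which cannot be flattened into one rectangular grid. Under HSA, unified memory and flexible dispatch give the code generator more freedom in mapping such shapes. Illustrative code, not taken from the tutorial.

#include <cstdio>

constexpr int N = 512;
static float a[N][N];

void triangularUpdate() {
    #pragma acc parallel loop copy(a[0:N][0:N])
    for (int i = 0; i < N; i++) {
        #pragma acc loop
        for (int j = 0; j <= i; j++) {        // inner bound depends on outer iteration
            a[i][j] = 0.5f * (a[i][j] + a[j][i]); // fold upper triangle into lower
        }
    }
}

int main() {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = float(i - j);
    triangularUpdate();
    std::printf("a[10][3] = %f\n", a[10][3]);
}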
C++AMP
HSA ENABLES EFFICIENT COMPILATION OF AN EVEN HIGHER-LEVEL PROGRAMMING INTERFACE
© Copyright 2014 HSA Foundation. All Rights Reserved
C++ AMP
● C++ Accelerated Massive Parallelism
● Designed for data level parallelism
● Extension of C++11 proposed by Microsoft
● An open specification with multiple implementations aiming at standardization
● MS Visual Studio 2013
● MulticoreWare CLAMP
● GPU data modeled as C++14-like containers for multidimensional arrays
● GPU kernels modeled as C++11 lambdas
● Minimal extension to C++ for simplicity and future proofing
© Copyright 2014 HSA Foundation. All Rights Reserved
MATRIX MULTIPLICATION IN C++AMP
void MultiplyWithAMP(int* aMatrix, int* bMatrix, int *productMatrix,
int ha, int hb, int hc) {
array_view<int, 2> a(ha, hb, aMatrix);
array_view<int, 2> b(hb, hc, bMatrix);
array_view<int, 2> product(ha, hc, productMatrix);
parallel_for_each(
product.extent,
[=](index<2> idx) restrict(amp) {
int row = idx[0];
int col = idx[1];
for (int inner = 0; inner < hb; inner++) {
product[idx] += a(row, inner) * b(inner, col);
}
}
);
product.synchronize();}
clGPUContext = clCreateContextFromType(0,
CL_DEVICE_TYPE_GPU,
NULL, NULL, &errcode);
shrCheckError(errcode, CL_SUCCESS);
// get the list of GPU devices associated
// with context
errcode = clGetContextInfo(clGPUContext,
CL_CONTEXT_DEVICES, 0, NULL,
&dataBytes);
cl_device_id *clDevices = (cl_device_id *)
malloc(dataBytes);
errcode |= clGetContextInfo(clGPUContext,
CL_CONTEXT_DEVICES, dataBytes,
clDevices, NULL);
shrCheckError(errcode, CL_SUCCESS);
//Create a command-queue
clCommandQue =
clCreateCommandQueue(clGPUContext,
clDevices[0], 0, &errcode);
shrCheckError(errcode, CL_SUCCESS);
__kernel void
matrixMul(__global float* C, __global float* A,
__global float* B, int wA, int wB) {
int tx = get_global_id(0);
int ty = get_global_id(1);
float value = 0;
for (int k = 0; k < wA; ++k)
{
float elementA = A[ty * wA + k];
float elementB = B[k * wB + tx];
value += elementA * elementB;
}
C[ty * wA + tx] = value;}
© Copyright 2014 HSA Foundation. All Rights Reserved
C++AMP PROGRAMMING MODEL
void MultiplyWithAMP(int* aMatrix, int* bMatrix, int *productMatrix) {
array_view<int, 2> a(3, 2, aMatrix);
array_view<int, 2> b(2, 3, bMatrix);
array_view<int, 2> product(3, 3, productMatrix);
parallel_for_each(
product.extent,
[=](index<2> idx) restrict(amp) {
int row = idx[0];
int col = idx[1];
for (int inner = 0; inner < 2; inner++) {
product[idx] += a(row, inner) * b(inner, col);
}
}
);
product.synchronize();}
GPU data modeled as data containers
© Copyright 2014 HSA Foundation. All Rights Reserved
C++AMP PROGRAMMING MODEL
void MultiplyWithAMP(int* aMatrix, int* bMatrix, int *productMatrix) {
array_view<int, 2> a(3, 2, aMatrix);
array_view<int, 2> b(2, 3, bMatrix);
array_view<int, 2> product(3, 3, productMatrix);
parallel_for_each(
product.extent,
[=](index<2> idx) restrict(amp) {
int row = idx[0];
int col = idx[1];
for (int inner = 0; inner < 2; inner++) {
product[idx] += a(row, inner) * b(inner, col);
}
}
);
product.synchronize();}
Kernels modeled as lambdas; arguments are implicitly modeled as captured variables, so the programmer does not need to specify copyin and copyout
© Copyright 2014 HSA Foundation. All Rights Reserved
C++AMP PROGRAMMING MODEL
void MultiplyWithAMP(int* aMatrix, int* bMatrix, int *productMatrix) {
array_view<int, 2> a(3, 2, aMatrix);
array_view<int, 2> b(2, 3, bMatrix);
array_view<int, 2> product(3, 3, productMatrix);
parallel_for_each(
product.extent,
[=](index<2> idx) restrict(amp) {
int row = idx[0];
int col = idx[1];
for (int inner = 0; inner < 2; inner++) {
product[idx] += a(row, inner) * b(inner, col);
}
}
);
product.synchronize();
}
Execution interface; marking an implicitly parallel region for GPU execution
© Copyright 2014 HSA Foundation. All Rights Reserved
MCW C++AMP (CLAMP)
● Runs on Linux and Mac OS X
● Output code compatible with all major OpenCL stacks: AMD, Apple/Intel (OS X),
NVIDIA and even POCL
● Clang/LLVM-based, open source
o Translate C++AMP code to OpenCL C or OpenCL 1.2 SPIR
o With template helper library
● Runtime: OpenCL 1.1/HSA Runtime and GMAC for non-HSA systems
● One of the two C++ AMP implementations recognized by the HSA Foundation
© Copyright 2014 HSA Foundation. All Rights Reserved
MCW C++ AMP COMPILER
● Device Path
o generate OpenCL C code and SPIR
o emit kernel function
● Host Path
o preparation to launch the code
[Diagram: C++ AMP source code feeds Clang/LLVM 3.3, which emits Device Code and Host Code.]
© Copyright 2014 HSA Foundation. All Rights Reserved
TRANSLATION
parallel_for_each(product.extent,
[=](index<2> idx) restrict(amp) {
int row = idx[0];
int col = idx[1];
for (int inner = 0; inner < 2; inner++) {
product[idx] += a(row, inner) * b(inner, col);
}
});
__kernel void
matrixMul(__global float* C, __global float*
A,
__global float* B, int wA, int wB){
int tx = get_global_id(0);
int ty = get_global_id(1);
float value = 0;
for (int k = 0; k < wA; ++k)
{
float elementA = A[ty * wA + k];
float elementB = B[k * wB + tx];
value += elementA * elementB;
}
C[ty * wA + tx] = value;}
● Append the arguments
● Set the index
● Emit kernel function
● Implicit memory management
© Copyright 2014 HSA Foundation. All Rights Reserved
EXECUTION ON NON-HSA OPENCL PLATFORMS
[Diagram: C++ AMP source code goes through Clang/LLVM 3.3 twice, producing Device Code and Host Code; at runtime the host code runs on GMAC and OpenCL. "Our work" marks the compiler and GMAC layers.]
© Copyright 2014 HSA Foundation. All Rights Reserved
GMAC
● Unified virtual address space in software
● Can sometimes have high overhead
● In HSA (e.g., AMD Kaveri), GMAC is no longer needed
Gelado, et al., ASPLOS 2010
© Copyright 2014 HSA Foundation. All Rights Reserved
CASE STUDY: BINOMIAL OPTION PRICING
[Chart: lines of code, counted by cloc and split into Host and Kernel, for C++AMP vs. OpenCL; the y-axis runs from 0 to 350.]
© Copyright 2014 HSA Foundation. All Rights Reserved
PERFORMANCE ON NON-HSA SYSTEMS
BINOMIAL OPTION PRICING
[Chart: time in seconds on an NV Tesla C2050 for OpenCL vs. C++AMP, showing total GPU time and kernel-only time; the y-axis runs from 0 to 0.12 s.]
© Copyright 2014 HSA Foundation. All Rights Reserved
EXECUTION ON HSA
[Diagram: at compile time, C++ AMP source code goes through Clang/LLVM 3.3 to produce Device SPIR and Host SPIR; at runtime both execute on the HSA Runtime.]
© Copyright 2014 HSA Foundation. All Rights Reserved
WHAT DO WE NEED TO DO?
● Kernel function
o emit the kernel function with the required arguments
● On the host side
o a function that recursively traverses the object and appends the arguments to the OpenCL stack
● On the device side
o reconstruct the object in the device code for future use
© Copyright 2014 HSA Foundation. All Rights Reserved
WHY COMPILING C++AMP TO OPENCL IS NOT TRIVIAL
● C++AMP → LLVM IR → OpenCL C or SPIR
● Argument passing (lambda capture vs. function calls)
● Explicit vs. implicit memory transfer
● Heavy lifting is done by the compiler and runtime
© Copyright 2014 HSA Foundation. All Rights Reserved
EXAMPLE
struct A { int a; };
struct B : A { int b; };
struct C { B b; int c; };

struct C c;
c.c = 100;
auto fn = [=] () { int qq = c.c; };
© Copyright 2014 HSA Foundation. All Rights Reserved
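A hedged sketch of the host-side job this example implies: recursively walk the captured object and append each scalar member as a kernel argument, in layout order. A real implementation (e.g., CLAMP) drives this from compiler-generated metadata; here the traversal is hand-written for the structs above, and setKernelArg is a stand-in for clSetKernelArg. Illustrative only.

#include <cstdio>

struct A { int a; };
struct B : A { int b; };
struct C { B b; int c; };

static int nextArg = 0;

// Stand-in for clSetKernelArg(kernel, index, sizeof(int), &value).
void setKernelArg(int index, int value) {
    std::printf("arg %d = %d\n", index, value);
}

// One overload per type, recursing through bases and members in layout order.
void appendArgs(const A& x) { setKernelArg(nextArg++, x.a); }
void appendArgs(const B& x) { appendArgs(static_cast<const A&>(x)); setKernelArg(nextArg++, x.b); }
void appendArgs(const C& x) { appendArgs(x.b); setKernelArg(nextArg++, x.c); }

int main() {
    C c{};
    c.c = 100;
    // [=] () { int qq = c.c; } captures c by value, so the whole object is
    // flattened into kernel arguments: A::a, B::b, C::c.
    appendArgs(c); // prints arg 0 = 0, arg 1 = 0, arg 2 = 100
}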
TRANSLATION
[Same C++AMP parallel_for_each source and generated matrixMul OpenCL kernel as on the earlier Translation slide.]
● Compiler
● Turn captured variables into OpenCL arguments
● Populate the index<N> in the OCL kernel
● Runtime
● Implicit memory management
© Copyright 2014 HSA Foundation. All Rights Reserved
QUESTIONS?
© Copyright 2014 HSA Foundation. All Rights Reserved

More Related Content

PPTX
GPU Architecture NVIDIA (GTX GeForce 480)
PDF
HSA Design (2015-04-30)
PPTX
GS-4106 The AMD GCN Architecture - A Crash Course, by Layla Mah
PPTX
HSA Kernel Code (KFD v0.6)
PPTX
Heterogeneous computing
PPTX
Intel® hyper threading technology
PPTX
Superscalar Architecture_AIUB
PDF
Introduction to CUDA
GPU Architecture NVIDIA (GTX GeForce 480)
HSA Design (2015-04-30)
GS-4106 The AMD GCN Architecture - A Crash Course, by Layla Mah
HSA Kernel Code (KFD v0.6)
Heterogeneous computing
Intel® hyper threading technology
Superscalar Architecture_AIUB
Introduction to CUDA

What's hot (20)

PPTX
HSA Queuing Hot Chips 2013
PDF
Hsa Platform System Architecture Specification Provisional verl 1.0 ratifed
PDF
HSA System Architecture Overview (2014-10-31)
PDF
Apache Spark in Depth: Core Concepts, Architecture & Internals
PDF
不揮発メモリ(NVDIMM)とLinuxの対応動向について
PDF
OpenFOAMスレッド並列化のための基礎検討
PDF
LCU13: An Introduction to ARM Trusted Firmware
PDF
The future of RISC-V Supervisor Binary Interface(SBI)
PDF
Apache Bigtop3.2 (仮)(Open Source Conference 2022 Online/Hiroshima 発表資料)
PPT
Hive Training -- Motivations and Real World Use Cases
PDF
Secure Boot on ARM systems – Building a complete Chain of Trust upon existing...
PPTX
Apache NiFi Crash Course Intro
PDF
TEE - kernel support is now upstream. What this means for open source security
PDF
Isn't it ironic - managing a bare metal cloud (OSL TES 2015)
PPTX
Memory model
PPTX
Bootloaders (U-Boot)
TXT
OPTEE on QEMU - Build Tutorial
PDF
Vectorized Query Execution in Apache Spark at Facebook
PDF
Embedded Linux Kernel - Build your custom kernel
HSA Queuing Hot Chips 2013
Hsa Platform System Architecture Specification Provisional verl 1.0 ratifed
HSA System Architecture Overview (2014-10-31)
Apache Spark in Depth: Core Concepts, Architecture & Internals
不揮発メモリ(NVDIMM)とLinuxの対応動向について
OpenFOAMスレッド並列化のための基礎検討
LCU13: An Introduction to ARM Trusted Firmware
The future of RISC-V Supervisor Binary Interface(SBI)
Apache Bigtop3.2 (仮)(Open Source Conference 2022 Online/Hiroshima 発表資料)
Hive Training -- Motivations and Real World Use Cases
Secure Boot on ARM systems – Building a complete Chain of Trust upon existing...
Apache NiFi Crash Course Intro
TEE - kernel support is now upstream. What this means for open source security
Isn't it ironic - managing a bare metal cloud (OSL TES 2015)
Memory model
Bootloaders (U-Boot)
OPTEE on QEMU - Build Tutorial
Vectorized Query Execution in Apache Spark at Facebook
Embedded Linux Kernel - Build your custom kernel
Ad

Viewers also liked (12)

PPTX
HSA Introduction
PPTX
Heterogeneous Computing on POWER - IBM and OpenPOWER technologies to accelera...
PDF
LCU13: HSA Architecture Presentation
PDF
Using Xeon + FPGA for Accelerating HPC Workloads
PDF
Enabling Cognitive Workloads on the Cloud: GPUs with Mesos, Docker and Marath...
PPTX
Hands on OpenCL
PPTX
OpenCV 에서 OpenCL 살짝 써보기
PPTX
이기종 멀티코어 프로세서를 위한 프로그래밍 언어 및 영상처리 오픈소스
PDF
KeynoteTHE HETEROGENEOUS SYSTEM ARCHITECTURE ITS (NOT) ALL ABOUT THE GPU
PDF
Heterogeneous Systems Architecture: The Next Area of Computing Innovation
 
PDF
1050: 車載用ADAS/自動運転プラットフォームDRIVE PX及びコックピット・プラットフォームDRIVE CXのご紹介
PPT
Cloud computing ppt
HSA Introduction
Heterogeneous Computing on POWER - IBM and OpenPOWER technologies to accelera...
LCU13: HSA Architecture Presentation
Using Xeon + FPGA for Accelerating HPC Workloads
Enabling Cognitive Workloads on the Cloud: GPUs with Mesos, Docker and Marath...
Hands on OpenCL
OpenCV 에서 OpenCL 살짝 써보기
이기종 멀티코어 프로세서를 위한 프로그래밍 언어 및 영상처리 오픈소스
KeynoteTHE HETEROGENEOUS SYSTEM ARCHITECTURE ITS (NOT) ALL ABOUT THE GPU
Heterogeneous Systems Architecture: The Next Area of Computing Innovation
 
1050: 車載用ADAS/自動運転プラットフォームDRIVE PX及びコックピット・プラットフォームDRIVE CXのご紹介
Cloud computing ppt
Ad

Similar to ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial (20)

PPTX
ISCA Final Presentation - Intro
PPTX
HSA Introduction Hot Chips 2013
PDF
HSA From A Software Perspective
PDF
"Enabling Efficient Heterogeneous Processing Through Coherency," a Presentati...
PDF
Heterogeneous System Architecture Overview
PPTX
HSA Features
PDF
Implement Runtime Environments for HSA using LLVM
PPT
Guide to heterogeneous system architecture (hsa)
PDF
Introduction to HSA
PPTX
Mnk hsa ppt
PDF
HSA-4024, OpenJDK Sumatra Project: Bringing the GPU to Java, by Eric Caspole
PDF
Heterogenous system architecture(HSA)
PPTX
ISCA Final Presentation - HSAIL
PDF
HC-4015, An Overview of the HSA System Architecture Requirements, by Paul Bli...
PDF
HSA-4131, HSAIL Programmers Manual: Uncovered, by Ben Sander
PPTX
Ppt hsa
PDF
Keynote (Phil Rogers) - The Programmers Guide to Reaching for the Cloud - by ...
PDF
Keynote (Tony King-Smith) - Silicon? Check. HSA? Check. All done? Wrong! - by...
PDF
HC-4017, HSA Compilers Technology, by Debyendu Das
PDF
Hsa10 whitepaper
ISCA Final Presentation - Intro
HSA Introduction Hot Chips 2013
HSA From A Software Perspective
"Enabling Efficient Heterogeneous Processing Through Coherency," a Presentati...
Heterogeneous System Architecture Overview
HSA Features
Implement Runtime Environments for HSA using LLVM
Guide to heterogeneous system architecture (hsa)
Introduction to HSA
Mnk hsa ppt
HSA-4024, OpenJDK Sumatra Project: Bringing the GPU to Java, by Eric Caspole
Heterogenous system architecture(HSA)
ISCA Final Presentation - HSAIL
HC-4015, An Overview of the HSA System Architecture Requirements, by Paul Bli...
HSA-4131, HSAIL Programmers Manual: Uncovered, by Ben Sander
Ppt hsa
Keynote (Phil Rogers) - The Programmers Guide to Reaching for the Cloud - by ...
Keynote (Tony King-Smith) - Silicon? Check. HSA? Check. All done? Wrong! - by...
HC-4017, HSA Compilers Technology, by Debyendu Das
Hsa10 whitepaper

More from HSA Foundation (20)

PDF
Hsa Runtime version 1.00 Provisional
PDF
Hsa programmers reference manual (version 1.0 provisional)
PPTX
ISCA final presentation - Runtime
PPTX
ISCA final presentation - Queuing Model
PPTX
ISCA final presentation - Memory Model
PPTX
ISCA Final Presentaiton - Compilations
PPTX
ISCA Final Presentation - Applications
PPT
Apu13 cp lu-keynote-final-slideshare
PDF
HSAemu a Full System Emulator for HSA
PPTX
HSA Memory Model Hot Chips 2013
PPTX
HSA HSAIL Introduction Hot Chips 2013
PDF
HSA Foundation BoF -Siggraph 2013 Flyer
PDF
HSA Programmer’s Reference Manual: HSAIL Virtual ISA and Programming Model, C...
PDF
ARM Techcon Keynote 2012: Sensor Integration and Improved User Experiences at...
PDF
Phil Rogers IFA Keynote 2012
PDF
Deeper Look Into HSAIL And It's Runtime
PDF
Bolt C++ Standard Template Libary for HSA by Ben Sanders, AMD
PDF
Hsa2012 logo guidelines.
PDF
AFDS 2012 Phil Rogers Keynote: THE PROGRAMMER’S GUIDE TO A UNIVERSE OF POSSIB...
PDF
What Fabric Engine Can Do With HSA
Hsa Runtime version 1.00 Provisional
Hsa programmers reference manual (version 1.0 provisional)
ISCA final presentation - Runtime
ISCA final presentation - Queuing Model
ISCA final presentation - Memory Model
ISCA Final Presentaiton - Compilations
ISCA Final Presentation - Applications
Apu13 cp lu-keynote-final-slideshare
HSAemu a Full System Emulator for HSA
HSA Memory Model Hot Chips 2013
HSA HSAIL Introduction Hot Chips 2013
HSA Foundation BoF -Siggraph 2013 Flyer
HSA Programmer’s Reference Manual: HSAIL Virtual ISA and Programming Model, C...
ARM Techcon Keynote 2012: Sensor Integration and Improved User Experiences at...
Phil Rogers IFA Keynote 2012
Deeper Look Into HSAIL And It's Runtime
Bolt C++ Standard Template Libary for HSA by Ben Sanders, AMD
Hsa2012 logo guidelines.
AFDS 2012 Phil Rogers Keynote: THE PROGRAMMER’S GUIDE TO A UNIVERSE OF POSSIB...
What Fabric Engine Can Do With HSA

Recently uploaded (20)

PPTX
20250228 LYD VKU AI Blended-Learning.pptx
PPTX
Cloud computing and distributed systems.
PDF
cuic standard and advanced reporting.pdf
PDF
Building Integrated photovoltaic BIPV_UPV.pdf
PDF
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
PPTX
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
PDF
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
PDF
Electronic commerce courselecture one. Pdf
PDF
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
PDF
Advanced methodologies resolving dimensionality complications for autism neur...
PDF
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
PDF
Encapsulation theory and applications.pdf
DOCX
The AUB Centre for AI in Media Proposal.docx
PPT
Teaching material agriculture food technology
PPTX
Programs and apps: productivity, graphics, security and other tools
PDF
Encapsulation_ Review paper, used for researhc scholars
PPT
“AI and Expert System Decision Support & Business Intelligence Systems”
PPTX
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
PDF
The Rise and Fall of 3GPP – Time for a Sabbatical?
PPTX
Big Data Technologies - Introduction.pptx
20250228 LYD VKU AI Blended-Learning.pptx
Cloud computing and distributed systems.
cuic standard and advanced reporting.pdf
Building Integrated photovoltaic BIPV_UPV.pdf
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
Electronic commerce courselecture one. Pdf
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
Advanced methodologies resolving dimensionality complications for autism neur...
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
Encapsulation theory and applications.pdf
The AUB Centre for AI in Media Proposal.docx
Teaching material agriculture food technology
Programs and apps: productivity, graphics, security and other tools
Encapsulation_ Review paper, used for researhc scholars
“AI and Expert System Decision Support & Business Intelligence Systems”
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
The Rise and Fall of 3GPP – Time for a Sabbatical?
Big Data Technologies - Introduction.pptx

ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorithms Tutorial

  • 1. HETEROGENEOUS SYSTEM ARCHITECTURE (HSA): ARCHITECTURE AND ALGORITHMS ISCA TUTORIAL - JUNE 15, 2014
  • 2. TOPICS  Introduction  HSAIL Virtual Parallel ISA  HSA Runtime  HSA Memory Model  HSA Queuing Model  HSA Applications  HSA Compilation © Copyright 2014 HSA Foundation. All Rights Reserved The HSA Specifications are not at 1.0 final so all content is subject to change
  • 3. SCHEDULE © Copyright 2014 HSA Foundation. All Rights Reserved Time Topic Speaker 8:45am Introduction to HSA Phil Rogers, AMD 9:30am HSAIL Virtual Parallel ISA Ben Sander, AMD 10:30am Break 10:50am HSA Runtime Yeh-Ching Chung, National Tsing Hua University 12 noon Lunch 1pm HSA Memory Model Benedict Gaster, Qualcomm 2pm HSA Queuing Model Hakan Persson, ARM 3pm Break 3:15pm HSA Compilation Technology Wen Mei Hwu, University of Illinois 4pm HSA Application Programming Wen Mei Hwu, University of Illinois 4:45pm Questions All presenters
  • 4. INTRODUCTION PHIL ROGERS, AMD CORPORATE FELLOW & PRESIDENT OF HSA FOUNDATION
  • 5. HSA FOUNDATION  Founded in June 2012  Developing a new platform for heterogeneous systems  www.hsafoundation.com  Specifications under development in working groups to define the platform  Membership consists of 43 companies and 16 universities  Adding 1-2 new members each month © Copyright 2014 HSA Foundation. All Rights Reserved
  • 6. DIVERSE PARTNERS DRIVING FUTURE OF HETEROGENEOUS COMPUTING © Copyright 2014 HSA Foundation. All Rights Reserved Founders Promoters Supporters Contributors Academic Needs Updating – Add Toshiba Logo
  • 7. MEMBERSHIP TABLE Membership Level Number List Founder 6 AMD, ARM, Imagination Technologies, MediaTek Inc., Qualcomm Inc., Samsung Electronics Co Ltd Promoter 1 LG Electronics Contributor 25 Analog Devices Inc., Apical, Broadcom, Canonical Limited, CEVA Inc., Digital Media Professionals, Electronics and Telecommunications Research, Institute (ETRI), General Processor, Huawei, Industrial Technology Res. Institute, Marvell International Ltd., Mobica, Oracle, Sonics, Inc, Sony Mobile, Communications, Swarm 64 GmbH, Synopsys, Tensilica, Inc., Texas Instruments Inc., Toshiba, VIA Technologies, Vivante Corporation Supporter 13 Allinea Software Ltd, Arteris Inc., Codeplay Software, Fabric Engine, Kishonti, Lawrence Livermore National Laboratory, Linaro, MultiCoreWare, Oak Ridge National Laboratory, Sandia Corporation, StreamComputing, SUSE LLC, UChicago Argonne LLC, Operator of Argonne National Laboratory Academic 17 Institute for Computing Systems Architecture, Missouri University of Science & Technology, National Tsing Hua University, NMAM Institute of Technology, Northeastern University, Rice University, Seoul National University, System Software Lab National, Tsing Hua University, Tampere University of Technology, TEI of Crete, The University of Mississippi, University of North Texas, University of Bologna, University of Bristol Microelectronic Research Group, University of Edinburgh, University of Illinois at Urbana-Champaign Department of Computer Science © Copyright 2014 HSA Foundation. All Rights Reserved
  • 8. HETEROGENEOUS PROCESSORS HAVE PROLIFERATED — MAKE THEM BETTER  Heterogeneous SOCs have arrived and are a tremendous advance over previous platforms  SOCs combine CPU cores, GPU cores and other accelerators, with high bandwidth access to memory  How do we make them even better?  Easier to program  Easier to optimize  Higher performance  Lower power  HSA unites accelerators architecturally  Early focus on the GPU compute accelerator, but HSA will go well beyond the GPU © Copyright 2014 HSA Foundation. All Rights Reserved
  • 9. INFLECTIONS IN PROCESSOR DESIGN © Copyright 2014 HSA Foundation. All Rights Reserved ? Single-thread Performance Time we are here Enabled by:  Moore’s Law  Voltage Scaling Constrained by: Power Complexity Single-Core Era ModernApplication Performance Time (Data-parallel exploitation) we are here Heterogeneous Systems Era Enabled by:  Abundant data parallelism  Power efficient GPUs Temporarily Constrained by: Programming models Comm.overhead Throughput Performance Time (# of processors) we are here Enabled by:  Moore’s Law  SMP architecture Constrained by: Power Parallel SW Scalability Multi-Core Era Assembly  C/C++  Java … pthreads  OpenMP / TBB … Shader  CUDA OpenCL  C++ and Java
  • 10. LEGACY GPU COMPUTE PCIe ™ System Memory (Coherent) CPU CPU CPU . . . CU CU CU CU CU CU CU CU GPU Memory (Non-Coherent) GPU  Multiple memory pools  Multiple address spaces  High overhead dispatch  Data copies across PCIe  New languages for programming  Dual source development  Proprietary environments  Expert programmers only  Need to fix all of this to unleash our programmers The limiters © Copyright 2014 HSA Foundation. All Rights Reserved
  • 11. EXISTING APUS AND SOCS CPU 1 CPU N… CPU 2 Physical Integration CU 1 … CU 2 CU 3 CU M-2 CU M-1 CU M System Memory (Coherent) GPU Memory (Non-Coherent) GPU  Physical Integration  Good first step  Some copies gone  Two memory pools remain  Still queue through the OS  Still requires expert programmers  Need to finish the job
  • 12. AN HSA ENABLED SOC  Unified Coherent Memory enables data sharing across all processors  Processors architected to operate cooperatively  Designed to enable the application to run on different processors at different times Unified Coherent Memory CPU 1 CPU N… CPU 2 CU 1 CU 2 CU 3 CU M-2 CU M-1 CU M…
  • 13. PILLARS OF HSA*  Unified addressing across all processors  Operation into pageable system memory  Full memory coherency  User mode dispatch  Architected queuing language  Scheduling and context switching  HSA Intermediate Language (HSAIL)  High level language support for GPU compute processors © Copyright 2014 HSA Foundation. All Rights Reserved * All features of HSA are subject to change, pending ratification of 1.0 Final specifications by the HSA Board of Directors
  • 14. HSA SPECIFICATIONS  HSA System Architecture Specification  Version 1.0 Provisional, Released April 2014  Defines discovery, memory model, queue management, atomics, etc  HSA Programmers Reference Specification  Version 1.0 Provisional, Released June 2014  Defines the HSAIL language and object format  HSA Runtime Software Specification  Version 1.0 Provisional, expected to be released in July 2014  Defines the APIs through which an HSA application uses the platform  All released specifications can be found at the HSA Foundation web site:  www.hsafoundation.com/standards © Copyright 2014 HSA Foundation. All Rights Reserved
  • 15. HSA - AN OPEN PLATFORM  Open Architecture, membership open to all  HSA Programmers Reference Manual  HSA System Architecture  HSA Runtime  Delivered via royalty free standards  Royalty Free IP, Specifications and APIs  ISA agnostic for both CPU and GPU  Membership from all areas of computing  Hardware companies  Operating Systems  Tools and Middleware  Applications  Universities © Copyright 2014 HSA Foundation. All Rights Reserved
  • 16. HSA INTERMEDIATE LAYER — HSAIL  HSAIL is a virtual ISA for parallel programs  Finalized to ISA by a JIT compiler or “Finalizer”  ISA independent by design for CPU & GPU  Explicitly parallel  Designed for data parallel programming  Support for exceptions, virtual functions, and other high level language features  Lower level than OpenCL SPIR  Fits naturally in the OpenCL compilation stack  Suitable to support additional high level languages and programming models:  Java, C++, OpenMP, C++, Python, etc © Copyright 2014 HSA Foundation. All Rights Reserved
  • 17. HSA MEMORY MODEL  Defines visibility ordering between all threads in the HSA System  Designed to be compatible with C++11, Java, OpenCL and .NET Memory Models  Relaxed consistency memory model for parallel compute performance  Visibility controlled by:  Load.Acquire  Store.Release  Fences © Copyright 2014 HSA Foundation. All Rights Reserved
  • 18. HSA QUEUING MODEL  User mode queuing for low latency dispatch  Application dispatches directly  No OS or driver required in the dispatch path  Architected Queuing Layer  Single compute dispatch path for all hardware  No driver translation, direct to hardware  Allows for dispatch to queue from any agent  CPU or GPU  GPU self enqueue enables lots of solutions  Recursion  Tree traversal  Wavefront reforming © Copyright 2014 HSA Foundation. All Rights Reserved
  • 20. Hardware - APUs, CPUs, GPUs Driver Stack Domain Libraries OpenCL™, DX Runtimes, User Mode Drivers Graphics Kernel Mode Driver Apps Apps Apps Apps Apps Apps HSA Software Stack Task Queuing Libraries HSA Domain Libraries, OpenCL ™ 2.x Runtime HSA Kernel Mode Driver HSA Runtime HSA JIT Apps Apps Apps Apps Apps Apps User mode component Kernel mode component Components contributed by third parties EVOLUTION OF THE SOFTWARE STACK © Copyright 2014 HSA Foundation. All Rights Reserved
  • 21. OPENCL™ AND HSA  HSA is an optimized platform architecture for OpenCL  Not an alternative to OpenCL  OpenCL on HSA will benefit from  Avoidance of wasteful copies  Low latency dispatch  Improved memory model  Pointers shared between CPU and GPU  OpenCL 2.0 leverages HSA Features  Shared Virtual Memory  Platform Atomics © Copyright 2014 HSA Foundation. All Rights Reserved
  • 22. ADDITIONAL LANGUAGES ON HSA  In development © Copyright 2014 HSA Foundation. All Rights Reserved Language Body More Information Java Sumatra OpenJDK http://openjdk.java.net/projects/sumatra/ LLVM LLVM Code generator for HSAIL C++ AMP Multicoreware https://bitbucket.org/multicoreware/cppa mp-driver-ng/wiki/Home OpenMP, GCC AMD, Suse https://gcc.gnu.org/viewcvs/gcc/branches /hsa/gcc/README.hsa?view=markup&p athrev=207425
  • 23. SUMATRA PROJECT OVERVIEW  AMD/Oracle sponsored Open Source (OpenJDK) project  Targeted at Java 9 (2015 release)  Allows developers to efficiently represent data parallel algorithms in Java  Sumatra ‘repurposes’ Java 8’s multi-core Stream/Lambda API’s to enable both CPU or GPU computing  At runtime, Sumatra enabled Java Virtual Machine (JVM) will dispatch ‘selected’ constructs to available HSA enabled devices  Developers of Java libraries are already refactoring their library code to use these same constructs  So developers using existing libraries should see GPU acceleration without any code changes  http://openjdk.java.net/projects/sumatra/  https://wikis.oracle.com/display/HotSpotInternals/Sumatra  http://mail.openjdk.java.net/pipermail/sumatra-dev/ © Copyright 2014 HSA Foundation. All Rights Reserved Application.java Java Compiler GPUCPU Sumatra Enabled JVM Application GPU ISA Lambda/Stream API CPU ISA Application.clas s Development Runtime HSA Finalizer
  • 24. HSA OPEN SOURCE SOFTWARE  HSA will feature an open source linux execution and compilation stack  Allows a single shared implementation for many components  Enables university research and collaboration in all areas  Because it’s the right thing to do © Copyright 2014 HSA Foundation. All Rights Reserved Component Name IHV or Common Rationale HSA Bolt Library Common Enable understanding and debug HSAIL Code Generator Common Enable research LLVM Contributions Common Industry and academic collaboration HSAIL Assembler Common Enable understanding and debug HSA Runtime Common Standardize on a single runtime HSA Finalizer IHV Enable research and debug HSA Kernel Driver IHV For inclusion in linux distros
  • 25. WORKLOAD EXAMPLE SUFFIX ARRAY CONSTRUCTION CLOUD SERVER WORKLOAD
  • 26. SUFFIX ARRAYS  Suffix Arrays are a fundamental data structure  Designed for efficient searching of a large text  Quickly locate every occurrence of a substring S in a text T  Suffix Arrays are used to accelerate in-memory cloud workloads  Full text index search  Lossless data compression  Bio-informatics © Copyright 2014 HSA Foundation. All Rights Reserved
  • 27. ACCELERATED SUFFIX ARRAY CONSTRUCTION ON HSA © Copyright 2014 HSA Foundation. All Rights Reserved M. Deo, “Parallel Suffix Array Construction and Least Common Prefix for the GPU”, Submitted to ”Principles and Practice of Parallel Programming, (PPoPP’13)” February 2013. AMD A10 4600M APU with Radeon™ HD Graphics; CPU: 4 cores @ 2.3 MHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G, 6 compute units, 685MHz; 4GB RAM By offloading data parallel computations to GPU, HSA increases performance and reduces energy for Suffix Array Construction. By efficiently sharing data between CPU and GPU, HSA lets us move compute to data without penalty of intermediate copies. +5.8x -5x INCREASED PERFORMANCE DECREASED ENERGYMerge Sort::GPU Radix Sort::GPU Compute SA::CPU Lexical Rank::CPU Radix Sort::GPU Skew Algorithm for Compute SA
  • 28. EASE OF PROGRAMMING CODE COMPLEXITY VS. PERFORMANCE
  • 29. LINES-OF-CODE AND PERFORMANCE FOR DIFFERENT PROGRAMMING MODELS AMD A10-5800K APU with Radeon™ HD Graphics – CPU: 4 cores, 3800MHz (4200MHz Turbo); GPU: AMD Radeon HD 7660D, 6 compute units, 800MHz; 4GB RAM. Software – Windows 7 Professional SP1 (64-bit OS); AMD OpenCL™ 1.2 AMD-APP (937.2); Microsoft Visual Studio 11 Beta 0 50 100 150 200 250 300 350 LOC Copy-back Algorithm Launch Copy Compile Init Performance Serial CPU TBB Intrinsics+TBB OpenCL™-C OpenCL™ -C++ C++ AMP HSA Bolt Performance 35.00 30.00 25.00 20.00 15.00 10.00 5.00 0Copy- back Algorithm Launch Copy Compile Init. Copy-back Algorithm Launch Copy Compile Copy-back Algorithm Launch Algorithm Launch Algorithm Launch Algorithm Launch Algorithm Launch (Exemplary ISV “Hessian” Kernel) © Copyright 2014 HSA Foundation. All Rights Reserved
  • 30. THE HSA FUTURE  Architected heterogeneous processing on the SOC  Programming of accelerators becomes much easier  Accelerated software that runs across multiple hardware vendors  Scalability from smart phones to super computers on a common architecture  GPU acceleration of parallel processing is the initial target, with DSPs and other accelerators coming to the HSA system architecture model  Heterogeneous software ecosystem evolves at a much faster pace  Lower power, more capable devices in your hand, on the wall, in the cloud © Copyright 2014 HSA Foundation. All Rights Reserved
  • 32. HETEROGENEOUS SYSTEM ARCHITECTURE (HSA): HSAIL VIRTUAL PARALLEL ISA BEN SANDER, AMD
  • 33. TOPICS  Introduction and Motivation  HSAIL – what makes it special?  HSAIL Execution Model  How to program in HSAIL?  Conclusion © Copyright 2014 HSA Foundation. All Rights Reserved
  • 34. STATE OF GPU COMPUTING Today’s Challenges  Separate address spaces  Copies  Can’t share pointers  New language required for compute kernel  EX: OpenCL™ runtime API  Compute kernel compiled separately than host code Emerging Solution  HSA Hardware  Single address space  Coherent  Virtual  Fast access from all components  Can share pointers  Bring GPU computing to existing, popular, programming models  Single-source, fully supported by compiler  HSAIL compiler IR (Cross-platform!) • GPUs are fast and power efficient : high compute density per-mm and per-watt • But: Can be hard to program PCIe
  • 35. THE PORTABILITY CHALLENGE  CPU ISAs  ISA innovations added incrementally (ie NEON, AVX, etc)  ISA retains backwards-compatibility with previous generation  Two dominant instruction-set architectures: ARM and x86  GPU ISAs  Massive diversity of architectures in the market  Each vendor has own ISA - and often several in market at same time  No commitment (or attempt!) to provide any backwards compatibility  Traditionally graphics APIs (OpenGL, DirectX) provide necessary abstraction © Copyright 2014 HSA Foundation. All Rights Reserved
  • 36. HSAIL : WHAT MAKES IT SPECIAL?
  • 37. WHAT IS HSAIL?  Intermediate language for parallel compute in HSA  Generated by a “High Level Compiler” (GCC, LLVM, Java VM, etc)  Expresses parallel regions of code  Binary format of HSAIL is called “BRIG”  Goal: Bring parallel acceleration to mainstream programming languages © Copyright 2014 HSA Foundation. All Rights Reserved main() { … #pragma omp parallel for for (int i=0;i<N; i++) { } … } High-Level Compiler BRIG Finalizer Component ISA Host ISA
  • 38. KEY HSAIL FEATURES  Parallel  Shared virtual memory  Portable across vendors in HSA Foundation  Stable across multiple product generations  Consistent numerical results (IEEE-754 with defined min accuracy)  Fast, robust, simple finalization step (no monthly updates)  Good performance (little need to write in ISA)  Supports all of OpenCL™  Supports Java, C++, and other languages as well © Copyright 2014 HSA Foundation. All Rights Reserved
  • 39. HSAIL INSTRUCTION SET - OVERVIEW  Similar to assembly language for a RISC CPU  Load-store architecture  Destination register first, then source registers  140 opcodes (Java™ bytecode has 200)  Floating point (single, double, half (f16))  Integer (32-bit, 64-bit)  Some packed operations  Branches  Function calls  Platform Atomic Operations: and, or, xor, exch, add, sub, inc, dec, max, min, cas  Synchronize host CPU and HSA Component!  Text and Binary formats (“BRIG”) ld_global_u64 $d0, [$d6 + 120] ; $d0= load($d6+120) add_u64 $d1, $d0, 24 ; $d1= $d2+24 © Copyright 2014 HSA Foundation. All Rights Reserved
  • 40. SEGMENTS AND MEMORY (1/2)  7 segments of memory  global, readonly, group, spill, private, arg, kernarg  Memory instructions can (optionally) specify a segment  Control data sharing properties and communicate intent  Global Segment  Visible to all HSA agents (including host CPU)  Group Segment  Provides high-performance memory shared in the work-group.  Group memory can be read and written by any work-item in the work-group  HSAIL provides sync operations to control visibility of group memory ld_global_u64 $d0,[$d6] ld_group_u64 $d0,[$d6+24] st_spill_f32 $s1,[$d6+4] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 41. SEGMENTS AND MEMORY (2/2)  Spill, Private, Arg Segments  Represent different regions of a per-work-item stack  Typically generated by compiler, not specified by programmer  Compiler can use these to convey intent – ie spills  Kernarg Segment  Programmer writes kernarg segment to pass arguments to a kernel  Read-Only Segment  Remains constant during execution of kernel © Copyright 2014 HSA Foundation. All Rights Reserved
  • 42. FLAT ADDRESSING  Each segment mapped into virtual address space  Flat addresses can map to segments based on virtual address  Instructions with no explicit segment use flat addressing  Very useful for high-level language support (ie classes, libraries)  Aligns well with OpenCL 2.0 “generic” addressing feature ld_global_u64 $d6, [%_arg0] ; global ld_u64 $d0,[$d6+24] ; flat © Copyright 2014 HSA Foundation. All Rights Reserved
  • 43. REGISTERS  Four classes of registers:  S: 32-bit, Single-precision FP or Int  D: 64-bit, Double-precision FP or Long Int  Q: 128-bit, Packed data.  C: 1-bit, Control Registers (Compares)  Fixed number of registers  S, D, Q share a single pool of resources  S + 2*D + 4*Q <= 128  Up to 128 S or 64 D or 32 Q (or a blend)  Register allocation done in high-level compiler  Finalizer doesn’t perform expensive register allocation c0 c1 c2 c3 c4 c5 c6 c7 s0 d0 q0 s1 s2 d1 s3 s4 d2 q1 s5 s6 d3 s7 s8 d4 q2 s9 s10 d5 s11 … s120 d60 q30 s121 s122 d61 s123 s124 d62 q31 s125 s126 d63 s127 © Copyright 2014 HSA Foundation. All Rights Reserved
  • 44. SIMT EXECUTION MODEL  HSAIL Presents a “SIMT” execution model to the programmer  “Single Instruction, Multiple Thread”  Programmer writes program for a single thread of execution  Each work-item appears to have its own program counter  Branch instructions look natural  Hardware Implementation  Most hardware uses SIMD (Single-Instruction Multiple Data) vectors for efficiency  Actually one program counter for the entire SIMD instruction  Branches implemented with predication  SIMT Advantages  Easier to program (branch code in particular)  Natural path for mainstream programming models and existing compilers  Scales across a wide variety of hardware (programmer doesn’t see vector width)  Cross-lane operations available for those who want peak performance © Copyright 2014 HSA Foundation. All Rights Reserved
  • 45. WAVEFRONTS  Hardware SIMD vector, composed of 1, 2, 4, 8, 16, 32, 64, 128, or 256 “lanes”  Lanes in wavefront can be “active” or “inactive”  Inactive lanes consume hardware resources but don’t do useful work  Tradeoffs  “Wavefront-aware” programming can be useful for peak performance  But results in less portable code (since wavefront width is encoded in algorithm) if (cond) { operationA; // cond=True lanes active here } else { operationB; // cond=False lanes active here } © Copyright 2014 HSA Foundation. All Rights Reserved
  • 46. CROSS-LANE OPERATIONS  Example HSAIL cross-lane operation: “activelaneid”  Dest set to count of earlier work-items that are active for this instruction  Useful for compaction algorithms  Example HSAIL cross-lane operation: “activelaneshuffle”  Each workitem reads value from another lane in the wavefront  Supports selection of “identity” element for inactive lanes  Useful for wavefront-level reductionsactivelaneshuffle_b32 $s0, $s1, $s2, 0, 0 // s0 = dest, s1= source, s2=lane select, no identity activelaneid_u32 $s0 © Copyright 2014 HSA Foundation. All Rights Reserved
  • 47. HSAIL MODES  Working group strived to limit optional modes and features in HSAIL  Minimize differences between HSA target machines  Better for compiler vendors and application developers  Two modes survived  Machine Models  Small: 32-bit pointers, 32-bit data  Large: 64-bit pointers, 32-bit or 64-bit data  Vendors can support one or both models  “Base” and “Full” Profiles  Two sets of requirements for FP accuracy, rounding, exception reporting, hard pre-emption © Copyright 2014 HSA Foundation. All Rights Reserved
  • 48. HSA PROFILES Feature Base Full Addressing Modes Small, Large Small, Large All 32-bit HSAIL operations according to the declared profile Yes Yes F16 support (IEEE 754 or better) Yes Yes F64 support No Yes Precision for add/sub/mul 1/2 ULP 1/2 ULP Precision for div 2.5 ULP 1/2 ULP Precision for sqrt 1 ULP 1/2 ULP HSAIL Rounding: Near Yes Yes HSAIL Rounding: Up / Down / Zero No Yes Subnormal floating-point Flush-to-zero Supported Propagate NaN Payloads No Yes FMA Yes Yes Arithmetic Exception reporting None DETECT or BREAK Debug trap Yes Yes Hard Preemption No Yes © Copyright 2014 HSA Foundation. All Rights Reserved
  • 49. HSA PARALLEL EXECUTION MODEL © Copyright 2014 HSA Foundation. All Rights Reserved
  • 50. HSA PARALLEL EXECUTION MODEL Basic Idea: Programmer supplies an HSAIL “kernel” that is run on each work-item. Kernel is written as a single thread of execution. Programmer specifies grid dimensions (scope of problem) when launching the kernel. Each work-item has a unique coordinate in the grid. Programmer optionally specifies work- group dimensions (for optimized communication). © Copyright 2014 HSA Foundation. All Rights Reserved
• 51. CONVOLUTION / SOBEL EDGE FILTER Gx = [ -1 0 +1 ] [ -2 0 +2 ] [ -1 0 +1 ] Gy = [ -1 -2 -1 ] [ 0 0 0 ] [ +1 +2 +1 ] G = sqrt(Gx² + Gy²) © Copyright 2014 HSA Foundation. All Rights Reserved
• 52. CONVOLUTION / SOBEL EDGE FILTER Gx = [ -1 0 +1 ] [ -2 0 +2 ] [ -1 0 +1 ] Gy = [ -1 -2 -1 ] [ 0 0 0 ] [ +1 +2 +1 ] G = sqrt(Gx² + Gy²) 2D grid workitem kernel © Copyright 2014 HSA Foundation. All Rights Reserved
• 53. CONVOLUTION / SOBEL EDGE FILTER Gx = [ -1 0 +1 ] [ -2 0 +2 ] [ -1 0 +1 ] Gy = [ -1 -2 -1 ] [ 0 0 0 ] [ +1 +2 +1 ] G = sqrt(Gx² + Gy²) 2D work-group 2D grid workitem kernel © Copyright 2014 HSA Foundation. All Rights Reserved
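To make the per-work-item view concrete, here is a sketch in plain C of the kernel body a single work-item would execute at grid coordinate (x, y); the function name, parameter list, and border handling are illustrative choices, not taken from the slides.

#include <math.h>
#include <stdint.h>

/* Sobel filter body for one work-item at grid coordinate (x, y).
 * In HSAIL/OpenCL the coordinates would come from work-item builtins;
 * here they are plain parameters. Border pixels are left untouched. */
void sobel_workitem(const uint8_t *in, uint8_t *out,
                    int width, int height, int x, int y)
{
    if (x < 1 || y < 1 || x >= width - 1 || y >= height - 1)
        return; /* skip the one-pixel border */

    #define P(dx, dy) ((float)in[(y + (dy)) * width + (x + (dx))])
    float gx = -P(-1,-1) + P(1,-1) - 2*P(-1,0) + 2*P(1,0) - P(-1,1) + P(1,1);
    float gy = -P(-1,-1) - 2*P(0,-1) - P(1,-1) + P(-1,1) + 2*P(0,1) + P(1,1);
    #undef P

    float g = sqrtf(gx * gx + gy * gy);
    out[y * width + x] = (uint8_t)(g > 255.0f ? 255.0f : g);
}

Dispatching the kernel over a 2D grid corresponds to calling this once per (x, y); the work-group shape only affects which work-items can synchronize and share group memory cheaply.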
  • 54. HOW TO PROGRAM HSA? WHAT DO I TYPE? © Copyright 2014 HSA Foundation. All Rights Reserved
  • 55. HSA PROGRAMMING MODELS : CORE PRINCIPLES  Single source  Host and device code side-by-side in same source file  Written in same programming language  Single unified coherent address space  Freely share pointers between host and device  Similar memory model as multi-core CPU  Parallel regions identified with existing language syntax  Typically same syntax used for multi-core CPU  HSAIL is the compiler IR that supports these programming models © Copyright 2014 HSA Foundation. All Rights Reserved
• 56. GCC OPENMP : COMPILATION FLOW  SUSE GCC Project  Adding HSAIL code generator to GCC compiler infrastructure  Supports OpenMP 3.1 syntax  No data movement directives required!
main() {
  … // Host code.
  #pragma omp parallel for
  for (int i = 0; i < N; i++) {
    C[i] = A[i] + B[i];
  }
  …
}
GCC OpenMP Compiler -> BRIG -> Finalizer -> Component ISA; host code -> Host ISA © Copyright 2014 HSA Foundation. All Rights Reserved
• 57. GCC OpenMP flow
Compile time (Application / Compiler): a C/C++/Fortran OpenMP application, e.g. #pragma omp for for (j = 0; j < n; j++) { b[j] = a[j]; } The GNU Compiler (GCC) compiles the host code and emits runtime calls with kernel name, parameters, and launch attributes; it lowers the OpenMP directives, converts GIMPLE to BRIG, and embeds the BRIG into the host code.
Run time: the pragmas map to calls into the HSA Runtime, which dispatches the kernel to the GPU; kernels are finalized from BRIG->ISA once and cached.
© Copyright 2014 HSA Foundation. All Rights Reserved
• 58. MCW C++AMP : COMPILATION FLOW  C++AMP : Single-source C++ template parallel programming model  MCW compiler based on CLANG/LLVM  Open-source and runs on Linux  Leverages the open-source LLVM->HSAIL code generator
main() {
  …
  parallel_for_each(grid<1>(extent<256>(…)
  …
}
C++AMP Compiler -> BRIG -> Finalizer -> Component ISA; host code -> Host ISA © Copyright 2014 HSA Foundation. All Rights Reserved
• 59. JAVA: RUNTIME FLOW © Copyright 2014 HSA Foundation. All Rights Reserved JAVA 8 – HSA ENABLED APARAPI  Java 8 brings the Stream + Lambda API. ‒ A more natural way of expressing data parallel algorithms ‒ Initially targeted at multi-core.  APARAPI will: ‒ Support Java 8 Lambdas ‒ Dispatch code to HSA-enabled devices at runtime via HSAIL (stack: Java Application -> APARAPI + Lambda API -> JVM -> HSA Finalizer & Runtime -> CPU / GPU) Future Java – HSA ENABLED JAVA (SUMATRA)  Adds native GPU acceleration to the Java Virtual Machine (JVM)  Developer uses the JDK Lambda, Stream API  JVM uses the GRAAL compiler to generate HSAIL (stack: Java Application -> Java JDK Stream + Lambda API -> JVM with GRAAL JIT backend -> HSA Finalizer & Runtime -> CPU / GPU)
• 60. AN EXAMPLE (IN JAVA 8) © Copyright 2014 HSA Foundation. All Rights Reserved
// Example computes the percentage of total scores achieved by each player on a team.
class Player {
  private Team team; // Note: Reference to the parent Team.
  private int scores;
  private float pctOfTeamScores;
  public Team getTeam() { return team; }
  public int getScores() { return scores; }
  public void setPctOfTeamScores(float pct) { pctOfTeamScores = pct; }
}; // “Team” class not shown

// Assume “allPlayers” is an initialized array of Players.
Arrays.stream(allPlayers)  // wrap the array in a stream
  .parallel()              // developer indication that lambda is thread-safe
  .forEach(p -> {
    int teamScores = p.getTeam().getScores();
    float pctOfTeamScores = (float) p.getScores() / (float) teamScores;
    p.setPctOfTeamScores(pctOfTeamScores);
  });
• 61. HSAIL CODE EXAMPLE © Copyright 2014 HSA Foundation. All Rights Reserved
version 0:95: $full : $large;
// static method HotSpotMethod<Main.lambda$2(Player)>
kernel &run (
  kernarg_u64 %_arg0 // Kernel signature for lambda method
) {
  ld_kernarg_u64 $d6, [%_arg0];    // Move arg to an HSAIL register
  workitemabsid_u32 $s2, 0;        // Read the work-item global “X” coord
  cvt_u64_s32 $d2, $s2;            // Convert X gid to long
  mul_u64 $d2, $d2, 8;             // Adjust index for sizeof ref
  add_u64 $d2, $d2, 24;            // Adjust for actual elements start
  add_u64 $d2, $d2, $d6;           // Add to array ref ptr
  ld_global_u64 $d6, [$d2];        // Load from array element into reg
@L0:
  ld_global_u64 $d0, [$d6 + 120];  // p.getTeam()
  mov_b64 $d3, $d0;
  ld_global_s32 $s3, [$d6 + 40];   // p.getScores()
  cvt_f32_s32 $s16, $s3;
  ld_global_s32 $s0, [$d0 + 24];   // Team getScores()
  cvt_f32_s32 $s17, $s0;
  div_f32 $s16, $s16, $s17;        // p.getScores()/teamScores
  st_global_f32 $s16, [$d6 + 100]; // p.setPctOfTeamScores()
  ret;
};
  • 62. HOW TO PROGRAM HSA? OTHER PROGRAMMING TOOLS © Copyright 2014 HSA Foundation. All Rights Reserved
  • 63. HSAIL ASSEMBLER kernel &run (kernarg_u64 %_arg0) { ld_kernarg_u64 $d6, [%_arg0]; workitemabsid_u32 $s2, 0; cvt_u64_s32 $d2, $s2; mul_u64 $d2, $d2, 8; add_u64 $d2, $d2, 24; add_u64 $d2, $d2, $d6; ld_global_u64 $d6, [$d2]; . . . HSAIL Assembler BRIG Finalizer Machine ISA • HSAIL has a text format and an assembler © Copyright 2014 HSA Foundation. All Rights Reserved
  • 64. OPENCL™ OFFLINE COMPILER (CLOC) __kernel void vec_add( __global const float *a, __global const float *b, __global float *c, const unsigned int n) { int id = get_global_id(0); // Bounds check if (id < n) c[id] = a[id] + b[id]; } CLOC BRIG Finalizer Machine ISA •OpenCL split-source model cleanly isolates kernel •Can express many HSAIL features in OpenCL Kernel Language •Higher productivity than writing in HSAIL assembly •Can dispatch kernel directly with HSAIL Runtime (lower-level access to hardware) •Or use CLOC+OKRA Runtime for approachable “fits-on-a-slide” GPU programming model © Copyright 2014 HSA Foundation. All Rights Reserved
  • 65. KEY TAKEAWAYS  HSAIL  Thin, robust, fast finalizer  Portable (multiple HW vendors and parallel architectures)  Supports shared virtual memory and platform atomics  HSA brings GPU computing to mainstream programming models  Shared and coherent memory bridges “faraway accelerator” gap  HSAIL provides the common IL for high-level languages to benefit from parallel computing  Languages and Compilers  HSAIL support in GCC, LLVM, Java JVM  Leverage same language syntax designed for multi-core CPUs  Can use pointer-containing data structures © Copyright 2014 HSA Foundation. All Rights Reserved
• 66. HSA RUNTIME YEH-CHING CHUNG, NATIONAL TSING HUA UNIVERSITY
  • 67. OUTLINE  Introduction  HSA Core Runtime API (Pre-release 1.0 provisional)  Initialization and Shut Down  Notifications (Synchronous/Asynchronous)  Agent Information  Signals and Synchronization (Memory-Based)  Queues and Architected Dispatch  Summary © Copyright 2014 HSA Foundation. All Rights Reserved
• 68. INTRODUCTION (1)  The HSA core runtime is a thin, user-mode API that provides the interface necessary for the host to launch compute kernels to the available HSA components.  The overall goal of the HSA core runtime design is to provide a high-performance dispatch mechanism that is portable across multiple HSA vendor architectures.  The dispatch mechanism differentiates the HSA runtime from other language runtimes by architected argument setting and kernel launching at the hardware and specification level.  The HSA core runtime API is standard across all HSA vendors, so languages which use the HSA runtime can run on any vendor's platform that supports the API.  The implementation of the HSA runtime may include kernel-level components (required for some hardware, e.g. AMD Kaveri) or may be entirely user-space (for example, simulators or CPU implementations). © Copyright 2014 HSA Foundation. All Rights Reserved
• 69. INTRODUCTION (2)  The software architecture stack without the HSA runtime: each language runtime (OpenCL, Java, OpenMP, DSL, …) sits on a separate per-vendor driver for that vendor's components.  The software architecture stack with the HSA runtime: the same language runtimes sit on a common HSA runtime, with each vendor supplying an HSA finalizer for its components. © Copyright 2014 HSA Foundation. All Rights Reserved
• 70. INTRODUCTION (3)  Program flow of an agent, with OpenCL runtime phases mapped to their HSA runtime counterparts: Start Program; Platform, Device, and Context Initialization -> HSA Runtime Initialization and Topology Discovery; Build Kernel -> HSAIL Finalization and Linking; SVM Allocation and Kernel Arguments Setting -> HSA Memory Allocation; Command Queue -> Enqueue Dispatch Packet; Resource Deallocation -> HSA Runtime Close; Exit Program © Copyright 2014 HSA Foundation. All Rights Reserved
• 71. INTRODUCTION (4)  HSA Platform System Architecture Specification support  Runtime initialization and shutdown  Notifications (synchronous/asynchronous)  Agent information  Signals and synchronization (memory-based)  Queues and architected dispatch  Memory management  HSAIL support  Finalization, linking, and debugging  Image and sampler support © Copyright 2014 HSA Foundation. All Rights Reserved
  • 73. OUTLINE  Runtime Initialization API  hsa_init  Runtime Shut Down API  hsa_shut_down  Examples © Copyright 2014 HSA Foundation. All Rights Reserved
  • 74. HSA RUNTIME INITIALIZATION  When the API is invoked for the first time in a given process, a runtime instance is created.  A typical runtime instance may contain information of platform, topology, reference count, queues, signals, etc.  The API can be called multiple times by applications  Only a single runtime instance will exist for a given process.  Whenever the API is invoked, the reference count is increased by one. © Copyright 2014 HSA Foundation. All Rights Reserved
• 75. HSA RUNTIME SHUT DOWN  When the API is invoked, the reference count is decreased by one.  When the reference count reaches zero  All the resources associated with the runtime instance (queues, signals, topology information, etc.) are considered invalid and any attempt to reference them in subsequent API calls results in undefined behavior.  The user may call hsa_init to initialize the HSA runtime again.  The HSA runtime may release the resources associated with it. © Copyright 2014 HSA Foundation. All Rights Reserved
• 76. EXAMPLE – RUNTIME INITIALIZATION (1) (code screenshot: the data structure for the runtime instance; if hsa_init is called more than once, the ref_count is increased by one) © Copyright 2014 HSA Foundation. All Rights Reserved
• 77. EXAMPLE – RUNTIME INITIALIZATION (2) (code screenshot: when hsa_init is called the first time, allocate resources and set the reference count; get the number of HSA agents; create an empty agent list; initialize the agents; create the topology table; if initialization fails, release the resources. A C sketch of this flow follows below.) © Copyright 2014 HSA Foundation. All Rights Reserved
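The code screenshots for this example did not survive extraction; the following is a minimal C sketch of the flow the annotations describe, using hypothetical names (runtime_instance_t, my_hsa_init). It is not the actual reference implementation.

#include <stdlib.h>

/* Hypothetical per-process runtime-instance bookkeeping. */
typedef struct {
    int   ref_count;   /* incremented on every init call              */
    int   num_agents;  /* number of HSA agents discovered             */
    void *agent_list;  /* per-agent information, built on first init  */
    void *topology;    /* topology table                              */
} runtime_instance_t;

static runtime_instance_t *g_rt; /* single instance per process */

int my_hsa_init(void)            /* stand-in for hsa_init() */
{
    if (g_rt) {                  /* already initialized: just count */
        g_rt->ref_count++;
        return 0;
    }
    g_rt = calloc(1, sizeof(*g_rt));
    if (!g_rt)
        return -1;
    g_rt->ref_count = 1;
    /* first call: discover agents, build the agent list and the
     * topology table; on failure, release everything created so far */
    return 0;
}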
• 78. EXAMPLE - RUNTIME INSTANCE (1) Platform Name: Generic (Agent: 2, Memory: 1, Cache: 1)
Agent-0: node_id 0, id 0, type CPU, vendor Generic, name Generic, wavefront_size 0, queue_size 200, group_memory 0, fbarrier_max_count 1, is_pic_supported 0, …
Agent-1: node_id 0, id 0, type GPU, vendor Generic, name Generic, wavefront_size 64, queue_size 200, group_memory 64, fbarrier_max_count 1, is_pic_supported 1, …
Memory: node_id 0, id 0, segment_type 111111, address_base 0x0001, size 2048 MB, peak_bandwidth 6553.6 mbps
Cache: node_id 0, id 0, levels 1, associativity 1, cache size 64KB, cache line size 4, is_inclusive 1
© Copyright 2014 HSA Foundation. All Rights Reserved
• 79. EXAMPLE - RUNTIME INSTANCE (2) Platform header: *base_address = 0x00001, size = 248, system_timestamp_frequency_mhz = 200, signal_maximum_wait = 1/200, *node_id (no_nodes = 1), *agent_list (no_agent = 2), *memory_descriptor_list (no_memory_descriptor = 1), *cache_descriptor_list (no_cache_descriptor = 1)
Agent-0: node_id = 0, id = 0, agent_type = 1 (CPU), vendor[16] = Generic, name[16] = Generic, wavefront_size = 0, queue_size = 200, group_memory_size_bytes = 0, fbarrier_max_count = 1, is_pic_supported = 0, …
Agent-1: node_id = 0, id = 0, agent_type = 2 (GPU), vendor[16] = Generic, name[16] = Generic, wavefront_size = 64, queue_size = 200, group_memory_size_bytes = 64, fbarrier_max_count = 1, is_pic_supported = 1, …
Memory: node_id = 0, id = 0, supported_segment_type_mask = 111111, virtual_address_base = 0x0001, size_in_bytes = 2048MB, peak_bandwidth_mbps = 6553.6
Cache: node_id = 0, id = 0, levels = 1, associativity = 1, cache_size = 64KB, cache_line_size = 4, is_inclusive = 1
© Copyright 2014 HSA Foundation. All Rights Reserved
• 80. EXAMPLE – RUNTIME SHUT DOWN © Copyright 2014 HSA Foundation. All Rights Reserved (code screenshot: if the decremented ref_count reaches zero, free the lists; otherwise just decrease the ref_count by one. A C sketch follows below.)
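Continuing the hypothetical sketch above, a matching shutdown path mirroring the annotation; again illustrative only, not the reference implementation.

int my_hsa_shut_down(void)       /* stand-in for hsa_shut_down() */
{
    if (!g_rt)
        return -1;               /* runtime was never initialized */
    if (--g_rt->ref_count > 0)
        return 0;                /* other users of the instance remain */
    /* last reference gone: queues, signals and topology data become
     * invalid; hsa_init() may be called again to re-create them */
    free(g_rt->agent_list);
    free(g_rt->topology);
    free(g_rt);
    g_rt = NULL;
    return 0;
}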
  • 82. OUTLINE  Synchronous Notifications  hsa_status_t  hsa_status_string  Asynchronous Notifications  Example © Copyright 2014 HSA Foundation. All Rights Reserved
• 83. SYNCHRONOUS NOTIFICATIONS  Notifications (errors, events, etc.) reported by the runtime can be synchronous or asynchronous  The HSA runtime uses the return values of API functions to pass notifications synchronously.  A status code is defined as an enumeration, hsa_status_t, to capture the return value of any API function that has been executed, except accessors/mutators.  The notification is a status code that indicates success or error.  Success is represented by HSA_STATUS_SUCCESS, which is equivalent to zero.  An error status is assigned a positive integer and its identifier starts with the HSA_STATUS_ERROR prefix.  The status code can help to determine the cause of an unsuccessful execution. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 84. STATUS CODE QUERY  Query additional information on status code  Parameters  status (input): Status code that the user is seeking more information on  status_string (output): An ISO/IEC 646 encoded English language string that potentially describes the error status © Copyright 2014 HSA Foundation. All Rights Reserved
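A usage sketch of the status-string query in C; hsa_status_string is named in the outline, but the prototype below is paraphrased from the provisional spec and may not match a shipped header.

#include <stdio.h>

/* Assumed provisional-style declarations (illustrative): */
typedef int hsa_status_t;                 /* HSA_STATUS_SUCCESS == 0 */
hsa_status_t hsa_status_string(hsa_status_t status,
                               const char **status_string);

/* Print a human-readable description of a failed API call. */
void report(hsa_status_t status)
{
    if (status == 0)                      /* HSA_STATUS_SUCCESS */
        return;
    const char *msg = NULL;
    if (hsa_status_string(status, &msg) == 0 && msg)
        fprintf(stderr, "HSA error %d: %s\n", status, msg);
    else
        fprintf(stderr, "HSA error %d (no description)\n", status);
}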
• 85. ASYNCHRONOUS NOTIFICATIONS  The runtime passes asynchronous notifications by calling user-defined callbacks.  For instance, queues are a common source of asynchronous events because the tasks queued by an application are asynchronously consumed by the packet processor. Callbacks are associated with queues when they are created. When the runtime detects an error in a queue, it invokes the callback associated with that queue and passes it an error flag (indicating what happened) and a pointer to the erroneous queue.  The HSA runtime does not implement any default callbacks.  Be careful with blocking functions inside a callback implementation: a callback that does not return can leave the runtime in an undefined state. © Copyright 2014 HSA Foundation. All Rights Reserved
• 86. EXAMPLE - CALLBACK (code screenshot: pass the callback function when creating the queue; if the queue is empty, set the event and invoke the callback. A C sketch follows below.) © Copyright 2014 HSA Foundation. All Rights Reserved
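A C sketch of associating an error callback with a queue at creation time; the hsa_queue_create prototype here is a simplified stand-in (the provisional signature takes more parameters), and on_queue_error is a hypothetical user callback.

#include <stdio.h>

/* Simplified stand-ins; the real hsa_queue_create also takes the
 * component, queue size, queue type, and more. */
typedef struct hsa_queue_s hsa_queue_t;
typedef int hsa_status_t;
typedef void (*hsa_queue_error_cb)(hsa_status_t status, hsa_queue_t *queue);

hsa_status_t hsa_queue_create(/* component, size, type, ..., */
                              hsa_queue_error_cb callback,
                              hsa_queue_t **queue);

/* User-defined callback: invoked by the runtime when it detects an
 * error in the queue. It should not block indefinitely. */
static void on_queue_error(hsa_status_t status, hsa_queue_t *queue)
{
    fprintf(stderr, "queue %p reported error %d\n", (void *)queue, status);
}

/* At creation time: hsa_queue_create(..., on_queue_error, &queue); */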
  • 88. OUTLINE  Agent information  hsa_node_t  hsa_agent_t  hsa_agent_info_t  hsa_component_feature_t  Agent Information manipulation APIs  hsa_iterate_agents  hsa_agent_get_info  Example © Copyright 2014 HSA Foundation. All Rights Reserved
• 89. INTRODUCTION  The runtime exposes a list of agents that are available in the system.  An HSA agent is a hardware component that participates in the HSA memory model.  An HSA agent can submit AQL packets for execution.  An HSA agent may also be an HSA component, but is not required to be one; a system may include HSA agents that are neither an HSA component nor a host CPU.  HSA agents are defined as opaque handles of type hsa_agent_t.  The HSA runtime provides APIs for applications to traverse the list of available agents and query attributes of a particular agent. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 90. AGENT INFORMATION (1)  Opaque agent handle  Opaque NUMA node handle  An HSA memory node is a node that delineates a set of system components (host CPUs and HSA Components) with “local” access to a set of memory resources attached to the node's memory controller and appropriate HSA-compliant access attributes. © Copyright 2014 HSA Foundation. All Rights Reserved
• 91. AGENT INFORMATION (2)  Component features  An HSA component is a hardware or software component that can be a target of AQL packets and conforms to the HSA memory model.  Values  HSA_COMPONENT_FEATURE_NONE = 0  No component capabilities. The device is an agent, but not a component.  HSA_COMPONENT_FEATURE_BASIC = 1  The component supports the HSAIL instruction set and all the AQL packet types except agent dispatch.  HSA_COMPONENT_FEATURE_ALL = 2  The component supports the HSAIL instruction set and all the AQL packet types. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 92. AGENT INFORMATION (3)  Agent attributes  Values  HSA_AGENT_INFO_MAX_GRID_DIM  HSA_AGENT_INFO_MAX_WORKGROUP_DIM  HSA_AGENT_INFO_QUEUE_MAX_PACKETS  HSA_AGENT_INFO_CLOCK  HSA_AGENT_INFO_CLOCK_FREQUENCY  HSA_AGENT_INFO_MAX_SIGNAL_WAIT  HSA_AGENT_INFO_NAME  HSA_AGENT_INFO_NODE  HSA_AGENT_INFO_COMPONENT_FEATURES  HSA_AGENT_INFO_VENDOR_NAME  HSA_AGENT_INFO_WAVEFRONT_SIZE  HSA_AGENT_INFO_CACHE_SIZE © Copyright 2014 HSA Foundation. All Rights Reserved
  • 93. AGENT INFORMATION MANIPULATION (1)  Iterate over the available agents, and invoke an application-defined callback on every iteration  If callback returns a status other than HSA_STATUS_SUCCESS for a particular iteration, the traversal stops and the function returns that status value.  Parameters  callback (input): Callback to be invoked once per agent  data (input): Application data that is passed to callback on every iteration. Can be NULL. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 94. AGENT INFORMATION MANIPULATION (2)  Get the current value of an attribute for a given agent  Parameters  agent (input): A valid agent  attribute (input): Attribute to query  value (output): Pointer to a user-allocated buffer where to store the value of the attribute. If the buffer passed by the application is not large enough to hold the value of attribute, the behavior is undefined. © Copyright 2014 HSA Foundation. All Rights Reserved
• 95. EXAMPLE - AGENT ATTRIBUTE QUERY (code screenshot: get the agent handle of Agent 0, then copy the agent attribute information. A C sketch combining the two APIs follows below.) © Copyright 2014 HSA Foundation. All Rights Reserved
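The example's code screenshot is gone; below is a C sketch combining hsa_iterate_agents and hsa_agent_get_info as described on the previous slides. The typedefs and the attribute constant are illustrative stand-ins, not the provisional header.

#include <stdint.h>
#include <stdio.h>

/* Illustrative stand-ins for provisional-spec declarations: */
typedef uint64_t hsa_agent_t;        /* opaque agent handle        */
typedef int      hsa_status_t;       /* HSA_STATUS_SUCCESS == 0    */
typedef int      hsa_agent_info_t;
#define HSA_AGENT_INFO_WAVEFRONT_SIZE 0  /* placeholder enum value */

hsa_status_t hsa_iterate_agents(
    hsa_status_t (*callback)(hsa_agent_t agent, void *data), void *data);
hsa_status_t hsa_agent_get_info(hsa_agent_t agent,
                                hsa_agent_info_t attribute, void *value);

/* Callback: remember the first agent seen; returning a non-success
 * status stops the traversal, as the slides describe. */
static hsa_status_t pick_first(hsa_agent_t agent, void *data)
{
    *(hsa_agent_t *)data = agent;
    return 1;
}

void query_wavefront_size(void)
{
    hsa_agent_t agent = 0;
    hsa_iterate_agents(pick_first, &agent);

    uint32_t wavefront_size = 0;     /* buffer must be large enough */
    hsa_agent_get_info(agent, HSA_AGENT_INFO_WAVEFRONT_SIZE,
                       &wavefront_size);
    printf("wavefront size: %u\n", wavefront_size);
}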
• 97. OUTLINE  Signal  Signal manipulation API  Create/Destroy  Query  Send  Atomic operations  Signal wait  Wait timeout  Signal condition  Example © Copyright 2014 HSA Foundation. All Rights Reserved
  • 98. SIGNAL (1)  HSA agents can communicate with each other by using coherent global memory, or by using signals.  A signal is represented by an opaque signal handle  A signal carries a value, which can be updated or conditionally waited upon via an API call or HSAIL instruction.  The value occupies four or eight bytes depending on the machine model in use. © Copyright 2014 HSA Foundation. All Rights Reserved
• 99. SIGNAL (2)  Updating the value of a signal is equivalent to sending the signal.  In addition to the update (store) of a signal, the API for sending signals must support other atomic operations with specific memory-order semantics  Atomic operations: AND, OR, XOR, Add, Subtract, Exchange, and CAS  Memory-order semantics: Release and Relaxed © Copyright 2014 HSA Foundation. All Rights Reserved
• 100. SIGNAL CREATE/DESTROY  Create a signal  Parameters  initial_value (input): Initial value of the signal.  signal_handle (output): Signal handle.  Destroy a signal previously created by hsa_signal_create  Parameter  signal_handle (input): Signal handle. © Copyright 2014 HSA Foundation. All Rights Reserved
• 101. SIGNAL LOAD/STORE  Atomically read the current signal value with acquire semantics  Atomically read the current signal value with relaxed semantics  Send and atomically set the value of a signal with release semantics  Send and atomically set the value of a signal with relaxed semantics © Copyright 2014 HSA Foundation. All Rights Reserved
• 102. SIGNAL ADD/SUBTRACT  Send and atomically increment the value of a signal by a given amount with release semantics  Send and atomically increment the value of a signal by a given amount with relaxed semantics  Send and atomically decrement the value of a signal by a given amount with release semantics  Send and atomically decrement the value of a signal by a given amount with relaxed semantics © Copyright 2014 HSA Foundation. All Rights Reserved
• 103. SIGNAL AND (OR, XOR)/EXCHANGE  Send and atomically perform a logical AND operation on the value of a signal and a given value with release semantics  Send and atomically perform a logical AND operation on the value of a signal and a given value with relaxed semantics  Send and atomically set the value of a signal and return its previous value with release semantics  Send and atomically set the value of a signal and return its previous value with relaxed semantics © Copyright 2014 HSA Foundation. All Rights Reserved
  • 104. SIGNAL WAIT (1)  The application may wait on a signal, with a condition specifying the terms of wait.  Signal wait condition operator  Values  HSA_EQ: The two operands are equal.  HSA_NE: The two operands are not equal.  HSA_LT: The first operand is less than the second operand.  HSA_GTE: The first operand is greater than or equal to the second operand. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 105. SIGNAL WAIT (2)  The wait can be done either in the HSA component via an HSAIL wait instruction or via a runtime API defined here.  Waiting on a signal returns the current value at the opaque signal object;  The wait may have a runtime defined timeout which indicates the maximum amount of time that an implementation can spend waiting.  The signal infrastructure allows for multiple senders/waiters on a single signal.  Wait reads the value, hence acquire synchronizations may be applied. © Copyright 2014 HSA Foundation. All Rights Reserved
• 106. SIGNAL WAIT (3)  Signal wait  Parameters  signal_handle (input): A signal handle  condition (input): Condition used to compare the passed and signal values  compare_value (input): Value to compare with  return_value (output): A pointer where the current signal value must be read into © Copyright 2014 HSA Foundation. All Rights Reserved
• 107. SIGNAL WAIT (4)  Signal wait with timeout  Parameters  signal_handle (input): A signal handle  timeout (input): Maximum wait duration (a value of zero indicates no maximum)  long_wait (input): Hint indicating that the signal value is not expected to meet the given condition in a short period of time. The HSA runtime may use this hint to optimize the wait implementation.  condition (input): Condition used to compare the passed and signal values  compare_value (input): Value to compare with  return_value (output): A pointer where the current signal value must be read into © Copyright 2014 HSA Foundation. All Rights Reserved
• 108. EXAMPLE – SIGNAL WAIT (1) Timeline for two threads sharing a signal (initial value = 0): thread_1 calls hsa_signal_wait_timeout_acquire (condition: value == 2) and is blocked. thread_2 calls hsa_signal_add_relaxed (value = value + 3), so value = 3. thread_2 then calls hsa_signal_subtract_relaxed (value = value - 1), so value = 2. The condition is satisfied: the wait returns the signal value and the execution of thread_1 continues. © Copyright 2014 HSA Foundation. All Rights Reserved
• 109. EXAMPLE – SIGNAL WAIT (2) (code screenshot of a wait implementation: if signal_handle is invalid, return the signal-invalid status; a signal wait condition helper compares tmp->value with compare_value to see if the condition is satisfied; if it is, return the signal value and status; if timeout = 0, return the signal-timeout status. A host-side C sketch follows below.) © Copyright 2014 HSA Foundation. All Rights Reserved
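A host-side C sketch of the timeline on the previous slide using POSIX threads; the signal function names are taken from the slides, but the prototypes (and the HSA_EQ constant value) are assumptions, not the provisional header.

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed provisional-style signal API (illustrative prototypes): */
typedef uint64_t hsa_signal_handle_t;
typedef int64_t  hsa_signal_value_t;
int  hsa_signal_create(hsa_signal_value_t initial_value,
                       hsa_signal_handle_t *signal_handle);
void hsa_signal_add_relaxed(hsa_signal_handle_t s, hsa_signal_value_t v);
void hsa_signal_subtract_relaxed(hsa_signal_handle_t s, hsa_signal_value_t v);
int  hsa_signal_wait_timeout_acquire(hsa_signal_handle_t s,
                                     uint64_t timeout,      /* 0 = none */
                                     int long_wait,         /* hint     */
                                     int condition,
                                     hsa_signal_value_t compare_value,
                                     hsa_signal_value_t *return_value);
#define HSA_EQ 0   /* placeholder value for the condition enum */

static hsa_signal_handle_t sig;

static void *thread_2(void *arg)           /* the producer in the slide */
{
    (void)arg;
    hsa_signal_add_relaxed(sig, 3);        /* value: 0 -> 3 */
    hsa_signal_subtract_relaxed(sig, 1);   /* value: 3 -> 2 */
    return NULL;
}

int main(void)
{
    hsa_signal_create(0, &sig);
    pthread_t t;
    pthread_create(&t, NULL, thread_2, NULL);

    hsa_signal_value_t v = 0;              /* thread_1: block until == 2 */
    hsa_signal_wait_timeout_acquire(sig, 0, 1, HSA_EQ, 2, &v);
    printf("observed signal value %lld\n", (long long)v);

    pthread_join(t, NULL);
    return 0;
}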
  • 111. OUTLINE  Queues  Queue Types and Structure  HSA runtime API for Queue Manipulations  Architected Queuing Language (AQL) Support  Packet type  Packet header  Examples  Enqueue Packet  Packet Processor © Copyright 2014 HSA Foundation. All Rights Reserved
• 112. INTRODUCTION (1)  An HSA-compliant platform supports the allocation of multiple user-level command queues.  A user-level command queue is characterized as runtime-allocated, user-accessible virtual memory of a certain size, containing packets defined in the Architected Queuing Language (AQL packets).  Queues are allocated by HSA applications through the HSA runtime.  HSA software receives memory-based structures to configure the hardware queues, allowing efficient software management of the hardware queues of the HSA agents.  This queue memory shall be processed by the HSA packet processor as a ring buffer.  Queues are read-only data structures.  Writing values directly to a queue structure results in undefined behavior.  But HSA agents can directly modify the contents of the buffer pointed to by base_address, or use runtime APIs to access the doorbell signal or the service queue. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 113.  Two queue types, AQL and Service Queues, are supported  AQL Queue consumes AQL packets that are used to specify the information of kernel functions that will be executed on the HSA component  Service Queue consumes agent dispatch packets that are used to specify runtime-defined or user registered functions that will be executed on the agent (typically, the host CPU) INTRODUCTION (2) © Copyright 2014 HSA Foundation. All Rights Reserved
  • 114. INTRODUCTION (3)  AQL queue structure © Copyright 2014 HSA Foundation. All Rights Reserved
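The queue-structure diagram did not survive extraction; as a rough picture, here is a C sketch of the AQL queue structure. Field names and ordering are paraphrased from the provisional spec and may differ from any shipped header.

#include <stdint.h>

typedef uint64_t hsa_signal_handle_t;

/* Read-only queue structure, as described on the surrounding slides. */
typedef struct hsa_queue_s {
    uint32_t queue_type;            /* AQL queue vs. service queue       */
    uint32_t queue_features;        /* which packet types are supported  */
    uint64_t base_address;          /* ring buffer holding AQL packets   */
    hsa_signal_handle_t doorbell_signal; /* rung after writing packets   */
    uint32_t size;                  /* number of packet slots            */
    uint32_t queue_id;              /* unique within the process         */
    uint64_t service_queue;         /* queue for agent-dispatch requests */
} hsa_queue_t;

/* readIndex and writeIndex are not part of this structure; they are
 * reachable only through the dedicated runtime index APIs. */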
  • 115. INTRODUCTION (4)  In addition to the data held in the queue structure, the queue also defines two properties (readIndex and writeIndex) that define the location of “head” and “tail” of the queue.  readIndex: The read index is a 64-bit unsigned integer that specifies the packetID of the next AQL packet to be consumed by the packet processor.  writeIndex: The write index is a 64-bit unsigned integer that specifies the packetID of the next AQL packet slot to be allocated.  Both indices are not directly exposed to the user, who can only access them by using dedicated HSA core runtime APIs.  The available index functions differ on the index of interest (read or write), action to be performed (addition, compare and swap, etc.), and memory consistency model (relaxed, release, etc.). © Copyright 2014 HSA Foundation. All Rights Reserved
• 116. INTRODUCTION (5)  The read index is automatically advanced when a packet is read by the packet processor.  When the packet processor observes that  The read index matches the write index, the queue can be considered empty;  The write index is greater than or equal to the sum of the read index and the size of the queue, then the queue is full.  The doorbell_signal field of a queue contains a signal that is used by the agent to inform the packet processor to process the packets it writes.  The value written to the doorbell signal is the ID of the packet that is ready to be launched. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 117. INTRODUCTION (6)  The new task might be consumed by the packet processor even before the doorbell signal has been signaled by the agent.  This is because the packet processor might be already processing some other packets and observes that there is new work available, so it processes the new packets.  In any case, the agent must ring the doorbell for every batch of packets it writes. © Copyright 2014 HSA Foundation. All Rights Reserved
• 118. QUEUE CREATE/DESTROY  Create a user mode queue  When a queue is created, the runtime also allocates the packet buffer and the completion signal.  The application should only rely on the status code returned to determine if the queue is valid  Destroy a user mode queue  A destroyed queue must not be accessed after being destroyed.  When a queue is destroyed, the state of the AQL packets that have not yet been fully processed becomes undefined. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 119. GET READ/WRITE INDEX  Atomically retrieve read index of a queue with acquire semantics  Atomically retrieve write index of a queue with acquire semantics  Atomically retrieve read index of a queue with relaxed semantics  Atomically retrieve write index of a queue with relaxed semantics © Copyright 2014 HSA Foundation. All Rights Reserved
  • 120. SET READ/WRITE INDEX  Atomically set the read index of a queue with release semantics  Atomically set the read index of a queue with relaxed semantics  Atomically set the write index of a queue with release semantics  Atomically set the write index of a queue with relaxed semantics © Copyright 2014 HSA Foundation. All Rights Reserved
• 121. COMPARE AND SWAP WRITE INDEX  Atomically compare and set the write index of a queue with acquire/release/relaxed/acquire-release semantics  Parameters  queue (input): A queue  expected (input): The expected index value  val (input): Value to copy to the write index if expected matches the observed write index  Return value  Previous value of the write index © Copyright 2014 HSA Foundation. All Rights Reserved
  • 122. ADD WRITE INDEX  Atomically increment the write index of a queue by an offset with release/acquire/relaxed/acquire-release semantics  Parameters  queue (input): A queue  val (input): The value to add to the write index  Return value  Previous value of the write index © Copyright 2014 HSA Foundation. All Rights Reserved
  • 123. ARCHITECTED QUEUING LANGUAGE (AQL)  An HSA-compliant system provides a command interface for the dispatch of HSA agent commands.  This command interface is provided by the Architected Queuing Language (AQL).  AQL allows HSA agents to build and enqueue their own command packets, enabling fast and low-power dispatch.  AQL also provides support for HSA component queue submissions  The HSA component kernel can write commands in AQL format. © Copyright 2014 HSA Foundation. All Rights Reserved
• 124. AQL PACKET (1)  AQL packet format  Values  Always reserved packet (0): Packet format is set to always reserved when the queue is initialized.  Invalid packet (1): Packet format is set to invalid when the readIndex is incremented, making the packet slot available to the HSA agents.  Dispatch packet (2): Dispatch packets contain jobs for the HSA component and are created by HSA agents.  Barrier packet (3): Barrier packets can be inserted by HSA agents to delay the processing of subsequent packets. All queues support barrier packets.  Agent dispatch packet (4): Agent dispatch packets contain jobs for an HSA agent and are created by HSA agents. © Copyright 2014 HSA Foundation. All Rights Reserved
• 125. AQL PACKET (2) (diagram: dispatch packet layout; the completion signal field is an HSA signaling object handle used to indicate completion of the job) © Copyright 2014 HSA Foundation. All Rights Reserved
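As a sketch of what the layout diagram conveyed, a C rendering of a dispatch packet along the lines of the provisional spec; field names and widths are paraphrased and may not match the final layout.

#include <stdint.h>

typedef struct {
    uint16_t header;                 /* format (dispatch = 2), barrier
                                        bit, acquire/release fence scopes */
    uint16_t dimensions;             /* 1, 2 or 3 grid dimensions         */
    uint16_t workgroup_size_x, workgroup_size_y, workgroup_size_z;
    uint32_t grid_size_x, grid_size_y, grid_size_z;
    uint32_t private_segment_size_bytes;
    uint32_t group_segment_size_bytes;
    uint64_t kernel_object_address;  /* finalized kernel code handle      */
    uint64_t kernarg_address;        /* kernel argument buffer            */
    uint64_t completion_signal;      /* signaled when the job finishes    */
} aql_dispatch_packet_t;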
  • 126. EXAMPLE - ENQUEUE AQL PACKET (1)  An HSA agent submits a task to a queue by performing the following steps:  Allocate a packet slot (by incrementing the writeIndex)  Initialize the packet and copy packet to a queue associated with the Packet Processor  Mark packet as valid  Notify the Packet Processor of the packet (With doorbell signal) © Copyright 2014 HSA Foundation. All Rights Reserved
• 127. EXAMPLE - ENQUEUE AQL PACKET (2) (code walkthrough over the dispatch queue, between ReadIndex and WriteIndex: allocate an AQL packet slot; initialize the packet; copy the packet into the queue, where a lock may be needed to prevent race conditions in a multithreaded environment; send the doorbell signal. A C sketch follows below.) © Copyright 2014 HSA Foundation. All Rights Reserved
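A C sketch of the four enqueue steps, reusing the hsa_queue_t and aql_dispatch_packet_t sketches from the earlier slides; the index and doorbell helper names are assumptions in the spirit of the APIs described above, not confirmed spellings.

#include <stdint.h>

/* Assumed helpers (see the index and signal API slides): */
uint64_t hsa_queue_add_write_index_release(hsa_queue_t *q, uint64_t n);
void     hsa_signal_send_relaxed(hsa_signal_handle_t s, int64_t value);

void enqueue_dispatch(hsa_queue_t *q, const aql_dispatch_packet_t *pkt)
{
    /* 1. allocate a packet slot: atomically advance the writeIndex */
    uint64_t packet_id = hsa_queue_add_write_index_release(q, 1);

    /* 2. initialize and copy the packet into its ring-buffer slot,
     *    keeping the slot format invalid while the body is written */
    aql_dispatch_packet_t *slot =
        (aql_dispatch_packet_t *)(uintptr_t)q->base_address
        + (packet_id % q->size);
    aql_dispatch_packet_t tmp = *pkt;
    tmp.header = 0;
    *slot = tmp;

    /* 3. mark the packet valid: publish the real header last */
    __atomic_store_n(&slot->header, pkt->header, __ATOMIC_RELEASE);

    /* 4. ring the doorbell with the ID of the packet just made ready */
    hsa_signal_send_relaxed(q->doorbell_signal, (int64_t)packet_id);
}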
• 128. EXAMPLE - PACKET PROCESSOR (code walkthrough for the packet-processor side of the dispatch queue: receive the doorbell; while there is any packet between ReadIndex and WriteIndex, get the packet content and check whether it is a barrier packet; process it; then update the readIndex, change the packet state to invalid, and send the completion signal.) © Copyright 2014 HSA Foundation. All Rights Reserved
  • 130. OUTLINE  Memory registration and deregistration  Memory region and memory segment  APIs for memory region manipulation  APIs for memory registration and deregistration © Copyright 2014 HSA Foundation. All Rights Reserved
  • 131. INTRODUCTION  One of the key features of HSA is its ability to share global pointers between the host application and code executing on the HSA component.  This ability means that an application can directly pass a pointer to memory allocated on the host to a kernel function dispatched to a component without an intermediate copy  When a buffer created in the host is also accessed by a component, programmers are encouraged to register the corresponding address range beforehand.  Registering memory expresses an intention to access (read or write) the passed buffer from a component other than the host. This is a performance hint that allows the runtime implementation to know which buffers will be accessed by some of the components ahead of time.  When an HSA program no longer needs to access a registered buffer in a device, the user should deregister that virtual address range. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 132. MEMORY REGION/SEGMENT  A memory region represents a virtual memory interval that is visible to a particular agent, and contains properties about how memory is accessed or allocated from that agent.  Memory segments  Values  HSA_SEGMENT_GLOBAL = 1  HSA_SEGMENT_PRIVATE = 2  HSA_SEGMENT_GROUP = 4  HSA_SEGMENT_KERNARG = 8  HSA_SEGMENT_READONLY = 16  HSA_SEGMENT_IMAGE = 32 © Copyright 2014 HSA Foundation. All Rights Reserved
  • 133. MEMORY REGION INFORMATION  Attributes of a memory region  Values  HSA_REGION_INFO_BASE_ADDRESS  HSA_REGION_INFO_SIZE  HSA_REGION_INFO_NODE  HSA_REGION_INFO_MAX_ALLOCATION_SIZE  HSA_REGION_INFO_SEGMENT  HSA_REGION_INFO_BANDWIDTH  HSA_REGION_INFO_CACHED © Copyright 2014 HSA Foundation. All Rights Reserved
  • 134. MEMORY REGION MANIPULATION (1)  Get the current value of an attribute of a region  Iterate over the memory regions that are visible to an agent, and invoke an application-defined callback on every iteration  If callback returns a status other than HSA_STATUS_SUCCESS for a particular iteration, the traversal stops and the function returns that status value. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 135. MEMORY REGION MANIPULATION (2)  Allocate a block of memory  Deallocate a block of memory previously allocated using hsa_memory_allocate  Copy block of memory  Copying a number of bytes larger than the size of the memory regions pointed by dst or src results in undefined behavior. © Copyright 2014 HSA Foundation. All Rights Reserved
• 136. MEMORY REGISTRATION/DEREGISTRATION  Register memory  Parameters  address (input): A pointer to the base of the memory region to be registered. If a NULL pointer is passed, no operation is performed.  size (input): Requested registration size in bytes. A size of zero is only allowed if address is NULL.  Deregister memory previously registered using hsa_memory_register  Parameter  address (input): A pointer to the base of the memory region to be deregistered. If a NULL pointer is passed, no operation is performed. © Copyright 2014 HSA Foundation. All Rights Reserved
• 137. EXAMPLE (code screenshot: allocate a memory space; use hsa_region_get_info to get the size in bytes of this memory space; register the memory space as a performance hint; when the operation finishes, deregister and free it. A C sketch follows below.) © Copyright 2014 HSA Foundation. All Rights Reserved
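A C sketch of the allocate/register/deregister/free round trip the annotations describe; the deallocation and deregistration names below are assumed counterparts of hsa_memory_allocate and hsa_memory_register, not confirmed spellings, and the region argument is elided.

#include <stddef.h>

/* Assumed provisional-style prototypes (illustrative): */
typedef int hsa_status_t;
hsa_status_t hsa_memory_allocate(/* region, */ size_t size, void **ptr);
hsa_status_t hsa_memory_register(void *address, size_t size);
hsa_status_t hsa_memory_deregister(void *address);
hsa_status_t hsa_memory_free(void *ptr);

void buffer_roundtrip(size_t bytes)
{
    void *buf = NULL;
    if (hsa_memory_allocate(bytes, &buf) != 0)
        return;

    /* performance hint: this buffer will be accessed from a component */
    hsa_memory_register(buf, bytes);

    /* ... dispatch kernels that read and write buf ... */

    hsa_memory_deregister(buf);   /* device access finished */
    hsa_memory_free(buf);
}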
  • 139. SUMMARY  Covered  HSA Core Runtime API (Pre-release 1.0 provisional)  Runtime Initialization and Shutdown (Open/Close)  Notifications (Synchronous/Asynchronous)  Agent Information  Signals and Synchronization (Memory-Based)  Queues and Architected Dispatch  Memory Management  Not covered  Extension of Core Runtime  HSAIL Finalization, Linking, and Debugging  Images and Samplers © Copyright 2014 HSA Foundation. All Rights Reserved
  • 140. QUESTIONS? © Copyright 2014 HSA Foundation. All Rights Reserved
  • 141. HSA MEMORY MODEL BEN GASTER, ENGINEER, QUALCOMM
  • 142. OUTLINE  HSA Memory Model  OpenCL 2.0  Has a memory model too  Obstruction-free bounded deques  An example using the HSA memory model © Copyright 2014 HSA Foundation. All Rights Reserved
  • 143. HSA MEMORY MODEL © Copyright 2014 HSA Foundation. All Rights Reserved
• 144. TYPES OF MODELS  Shared memory computers and programming languages divide complexity into models: 1. Memory model specifies safety  e.g. what values can a load return?  This is what this section of the tutorial will focus on 2. Execution model specifies liveness  Described in Ben Sander’s tutorial section on HSAIL  e.g. can a work-item prevent others from progressing? 3. Performance model specifies the big picture  e.g. caches or branch divergence  Specific to particular implementations and outside the scope of today’s tutorial © Copyright 2014 HSA Foundation. All Rights Reserved
  • 145. THE PROBLEM  Assume all locations (a, b, …) are initialized to 0  What are the values of $s2 and $s4 after execution? © Copyright 2014 HSA Foundation. All Rights Reserved Work-item 0 mov_u32 $s1, 1 ; st_global_u32 $s1, [&a] ; ld_global_u32 $s2, [&b] ; Work-item 1 mov_u32 $s3, 1 ; st_global_u32 $s3, [&b] ; ld_global_u32 $s4, [&a] ; *a = 1; int x = *b; *b = 1; int y = *a; initially *a = 0 && *b = 0
• 146. THE SOLUTION  The memory model tells us:  the visibility of writes to memory at any given point  the set of possible executions © Copyright 2014 HSA Foundation. All Rights Reserved
• 147. WHAT MAKES A GOOD MEMORY MODEL*  Programmability: A good model should make it (relatively) easy to write multi-work-item programs. The model should be intuitive to most users, even to those who have not read the details  Performance: A good model should facilitate high-performance implementations at reasonable power, cost, etc. It should give implementers broad latitude in options  Portability: A good model would be adopted widely, or at least provide backward compatibility or the ability to translate among models * S. V. Adve. Designing Memory Consistency Models for Shared-Memory Multiprocessors. PhD thesis, Computer Sciences Department, University of Wisconsin–Madison, Nov. 1993. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 148. SEQUENTIAL CONSISTENCY (SC)*  Axiomatic Definition  A single processor (core) sequential if “the result of an execution is the same as if the operations had been executed in the order specified by the program.”  A multiprocessor sequentially consistent if “the result of any execution is the same as if the operations of all processors (cores) were executed in some sequential order, and the operations of each individual processor (core) appear in this sequence in the order specified by its program.” © Copyright 2014 HSA Foundation. All Rights Reserved  But HW/Compiler actually implements more relaxed models, e.g. ARMv7 * L. Lamport. How to Make a Multiprocessor Computer that Correctly Executes Multiprocessor Programs. IEEE Transactions on Computers, C-28(9):690–91, Sept. 1979.
  • 149. SEQUENTIAL CONSISTENCY (SC) © Copyright 2014 HSA Foundation. All Rights Reserved Work-item 0 mov_u32 $s1, 1 ; st_global_u32 $s1, [&a] ; ld_global_u32 $s2, [&b] ; Work-item 1 mov_u32 $s3, 1 ; st_global_u32 $s3, [&b] ; ld_global_u32 $s4, [&a] ; mov_u32 $s1, 1 ; mov_u32 $s3, 1; st_global_u32 $s1, [&a] ; ld_global_u32 $s2, [&b] ; st_global_u32 $s3, [&b] ; ld_global_u32 $s4, [&a] ; $s2 = 0 && $s4 = 1
• 150. BUT WHAT ABOUT ACTUAL HARDWARE  Sequential consistency is (reasonably) easy to understand, but limits optimizations that the compiler and hardware can perform  Many modern processors implement many reordering optimizations  Store buffers (TSO*): work-items can see their own stores early  Reorder buffers (XC*): work-items can see other work-items' stores early © Copyright 2014 HSA Foundation. All Rights Reserved *TSO – Total Store Order as implemented by Sparc and x86 *XC – Relaxed Consistency model, e.g. ARMv7, Power7, and Adreno
  • 151. RELAXED CONSISTENCY (XC) © Copyright 2014 HSA Foundation. All Rights Reserved Work-item 0 mov_u32 $s1, 1 ; st_global_u32 $s1, [&a] ; ld_global_u32 $s2, [&b] ; Work-item 1 mov_u32 $s3, 1 ; st_global_u32 $s3, [&b] ; ld_global_u32 $s4, [&a] ; mov_u32 $s1, 1 ; mov_u32 $s3, 1; ld_global_u32 $s2, [&b] ; ld_global_u32 $s4, [&a] ; st_global_u32 $s1, [&a] ; st_global_u32 $s3, [&b] ; $s2 = 0 && $s4 = 0
• 152. WHAT ARE OUR 3 Ps?  Programmability: XC is really pretty hard; it is difficult for the programmer to reason about what will be visible when  many memory model experts have been known to get it wrong!  Performance: XC is good for performance; the hardware (compiler) is free to reorder many loads and stores, opening the door for performance and power enhancements  Portability: XC is very portable as it places very few constraints © Copyright 2014 HSA Foundation. All Rights Reserved
• 153. MY CHILDREN AND COMPUTER ARCHITECTS ALL WANT  To have their cake and eat it! © Copyright 2014 HSA Foundation. All Rights Reserved HSA Provides: The ability to enable programmers to reason with the (relatively) intuitive model of SC, while still achieving the benefits of XC!
  • 154. SEQUENTIAL CONSISTENCY FOR DRF*  HSA adopts the same approach as Java, C++11, and OpenCL 2.0 adopting SC for Data Race Free (DRF)  plus some new capabilities !  (Informally) A data race occurs when two (or more) work-items access the same memory location such that:  At least one of the accesses is a WRITE  There are no intervening synchronization operations  SC for DRF asks:  Programmers to ensure programs are DRF under SC  Implementers to ensure that all executions of DRF programs on the relaxed model are also SC executions © Copyright 2014 HSA Foundation. All Rights Reserved *S. V. Adve and M. D. Hill. Weak Ordering—A New Definition. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pp. 2–14, May 1990
• 155. HSA SUPPORTS RELEASE CONSISTENCY  HSA’s memory model is based on RCSC:  All atomic_ld_scacq and atomic_st_screl are SC  Means coherence on all atomic_ld_scacq and atomic_st_screl to a single address  All atomic_ld_scacq and atomic_st_screl are program ordered per work-item (actually: sequence-ordered by language constraints)  Similar model adopted by ARMv8  HSA extends RCSC to SC for HRF*, to access the full capabilities of modern heterogeneous systems, containing CPUs, GPUs, and DSPs, for example. © Copyright 2014 HSA Foundation. All Rights Reserved *Sequential Consistency for Heterogeneous-Race-Free: Programmer-centric Memory Models for Heterogeneous Platforms. D. R. Hower, B. M. Beckmann, B. R. Gaster, B. Hechtman, M. D. Hill, S. K. Reinhardt, and D. Wood. MSPC’13.
  • 156. MAKING RELAXED CONSISTENCY WORK © Copyright 2014 HSA Foundation. All Rights Reserved Work-item 0 mov_u32 $s1, 1 ; atomic_st_global_u32_screl $s1, [&a] ; atomic_ld_global_u32_scacq $s2, [&b] ; Work-item 1 mov_u32 $s3, 1 ; atomic_st_global_u32_screl $s3, [&b] ; atomic_ld_global_u32_scacq $s4, [&a] ; mov_u32 $s1, 1 ; mov_u32 $s3, 1; atomic_st_global_u32_screl $s1, [&a] ; atomic_ld_global_u32_scacq $s2, [&b] ; atomic_st_global_u32_screl $s3, [&b] ; atomic_ld_global_u32_scacq $s4, [&a] ; $s2 = 0 && $s4 = 1
  • 157. SEQUENTIAL CONSISTENCY FOR DRF  Two memory accesses participate in a data race if they  access the same location  at least one access is a store  can occur simultaneously  i.e. appear as adjacent operations in interleaving.  A program is data-race-free if no possible execution results in a data race.  Sequential consistency for data-race-free programs  Avoid everything else HSA: Not good enough! © Copyright 2014 HSA Foundation. All Rights Reserved
• 158. ALL ARE NOT EQUAL – OR SOME CAN SEE BETTER THAN OTHERS  Remember the HSAIL execution model (diagram: nested synchronization scopes, from wave scope through group scope and device scope up to platform scope) © Copyright 2014 HSA Foundation. All Rights Reserved
• 159. DATA-RACE-FREE IS NOT ENOUGH  Two ordinary memory accesses participate in a data race if they  Access same location  At least one is a store  Can occur simultaneously
t1: st_global 1, [&X] ; atomic_st_global_screl 0, [&flag]
t2: atomic_cas_global_scar 1, 0, [&flag] ; ... ; atomic_st_global_screl 0, [&flag]
t3/t4: atomic_cas_global_scar 1, 0, [&flag] ; ld_global (??), [&x]
(t1 and t2 are in group #1-2; t3 and t4 are in group #3-4)
Not a data race… Is it SC? Well that depends (diagram: is visibility across S12, S34, and SGlobal implied by causality?) © Copyright 2014 HSA Foundation. All Rights Reserved
  • 160. SEQUENTIAL CONSISTENCY FOR HETEROGENEOUS-RACE-FREE  Two memory accesses participate in a heterogeneous race if  access the same location  at least one access is a store  can occur simultaneously  i.e. appear as adjacent operations in interleaving.  Are not synchronized with “enough” scope  A program is heterogeneous-race-free if no possible execution results in a heterogeneous race.  Sequential consistency for heterogeneous-race-free programs  Avoid everything else © Copyright 2014 HSA Foundation. All Rights Reserved
• 161. HSA HETEROGENEOUS RACE FREE  HRF0: Basic Scope Synchronization  “enough” = both threads synchronize using identical scope  Recall the example:  Contains a heterogeneous race in HSA
t1: st_global 1, [&X] ; atomic_st_global_rcrel_wg 0, [&flag]
t3/t4: ... ; atomic_cas_global_scar_wg 1, 0, [&flag] ; ld_global (??), [&x]
(t1 and t2 are in workgroup #1-2; t3 and t4 are in workgroup #3-4)
HSA Conclusion: This is bad. Don’t do it. © Copyright 2014 HSA Foundation. All Rights Reserved
• 162. HOW TO USE HSA WITH SCOPES  Want: For performance, use smallest scope possible  What is safe in HSA?  HSA Scope Selection Guideline: Use the smallest scope that includes all producers/consumers of shared data  Implication: Producers/consumers must be known at synchronization time  Is this a valid assumption? © Copyright 2014 HSA Foundation. All Rights Reserved
  • 163. REGULAR GPGPU WORKLOADS N M Define Problem Space Partition Hierarchically Communicate Locally N times Communicate Globally M times Well defined (regular) data partitioning + Well defined (regular) synchronization pattern =  Producer/consumers are always known Generally: HSA works well with regular data-parallel workloads © Copyright 2014 HSA Foundation. All Rights Reserved
• 164. IRREGULAR WORKLOADS  HSA: the example is a race  Must upgrade wg (workgroup) -> plat (platform)
t1: st_global 1, [&X] ; atomic_st_global_screl_plat 0, [&flag]
t2: atomic_cas_global_scar_plat 1, 0, [&flag] ; ... ; atomic_st_global_screl_plat 0, [&flag]
t3/t4: atomic_cas_global_scar_plat 1, 0, [&flag] ; ld $s1, [&x]
(t1 and t2 are in workgroup #1-2; t3 and t4 are in workgroup #3-4)
 HSA memory model says:  ld $s1, [&x] will see value (1)! © Copyright 2014 HSA Foundation. All Rights Reserved
  • 165. OPENCL HAS MEMORY MODELS TOO MAPPING ONTO HSA’S MEMORY MODEL
• 166. OPENCL 1.X MEMORY MODEL MAPPING  It is straightforward to provide a mapping from OpenCL 1.x to the proposed model  OpenCL 1.x atomics are unordered and so map to atomic_op_X  Mapping for fences not shown but straightforward
OpenCL Operation | HSA Memory Model Operation
Atomic load | ld_global_wg / ld_group_wg
Atomic store | atomic_st_global_wg / atomic_st_group_wg
atomic_op | atomic_op_global_comp / atomic_op_group_wg
barrier(…) | fence ; barrier_wg
© Copyright 2014 HSA Foundation. All Rights Reserved
  • 167. OPENCL 2.0 BACKGROUND  Provisional specification released at SIGGRAPH’13, July 2013.  Huge update to OpenCL to account for the evolving hardware landscape and emerging use cases (e.g. irregular work loads)  Key features:  Shared virtual memory, including platform atomics  Formally defined memory model based on C11 plus support for scopes  Includes an extended set of C1X atomic operations  Generic address space, that subsumes global, local, and private  Device to device enqueue  Out-of-order device side queuing model  Backwards compatible with OpenCL 1.x © Copyright 2014 HSA Foundation. All Rights Reserved
• 168. OPENCL 2.0 MEMORY MODEL MAPPING
OpenCL Operation | HSA Memory Model Operation
Load memory_order_relaxed | atomic_ld_[global | group]_relaxed_scope
Store memory_order_relaxed | atomic_st_[global | group]_relaxed_scope
Load memory_order_acquire | atomic_ld_[global | group]_scacq_scope
Load memory_order_seq_cst | atomic_ld_[global | group]_scacq_scope
Store memory_order_release | atomic_st_[global | group]_screl_scope
Store memory_order_seq_cst | atomic_st_[global | group]_screl_scope
memory_order_acq_rel | atomic_op_[global | group]_scar_scope
memory_order_seq_cst | atomic_op_[global | group]_scar_scope
© Copyright 2014 HSA Foundation. All Rights Reserved
• 169. OPENCL 2.0 MEMORY SCOPE MAPPING
OpenCL Scope | HSA Scope
memory_scope_sub_group | _wave
memory_scope_work_group | _wg
memory_scope_device | _component
memory_scope_all_svm_devices | _platform
© Copyright 2014 HSA Foundation. All Rights Reserved
  • 170. OBSTRUCTION-FREE BOUNDED DEQUES AN EXAMPLE USING THE HSA MEMORY MODEL
• 171. CONCURRENT DATA-STRUCTURES  Why do we need such a memory model in practice?  One important application of memory consistency is in the development and use of concurrent data-structures  In particular, there is a class of data-structure implementations that provide non-blocking guarantees:  wait-free: An algorithm is wait-free if every operation has a bound on the number of steps the algorithm will take before the operation completes  In practice it is very hard to build efficient data-structures that meet this requirement  lock-free: An algorithm is lock-free if, given enough time, at least one of the work-items (or threads) makes progress  In practice lock-free algorithms are implemented by work-items cooperating with one another to allow progress  obstruction-free: An algorithm is obstruction-free if a work-item, running in isolation, can make progress © Copyright 2014 HSA Foundation. All Rights Reserved
• 172. BUT WHY NOT JUST USE MUTUAL EXCLUSION? © Copyright 2014 HSA Foundation. All Rights Reserved Diversity in a heterogeneous system, such as different clock speeds, different scheduling policies, and more, can mean traditional mutual exclusion is not the right choice (diagram: emerging compute cluster with an Adreno GPU, four Krait CPUs, a Hexagon DSP, per-device MMUs, 2MB L2, and a fabric & memory controller)
• 173. CONCURRENT DATA-STRUCTURES  Emerging heterogeneous compute clusters mean we need:  To adapt existing concurrent data-structures  To develop new concurrent data-structures  Lock based programming may still be useful, but often these algorithms will need to be lock-free  Of course, this is a key application of the HSA memory model  To showcase this we highlight the development of a well known (HLM) obstruction-free deque* © Copyright 2014 HSA Foundation. All Rights Reserved *Herlihy, M. et al. 2003. Obstruction-free synchronization: double-ended queues as an example. (2003), 522–529.
• 174. HLM - OBSTRUCTION-FREE DEQUE  Uses a fixed-length circular queue  At any given time, reading from left to right, the array will contain:  Zero or more left-null (LN) values  Zero or more dummy-null (DN) values  Zero or more right-null (RN) values  At all times there must be:  At least two different null values  At least one LN or DN, and at least one DN or RN  Memory consistency is required to allow multiple producers and multiple consumers, potentially operating in parallel from the left and right ends, to see changes from other work-items (HSA components) and threads (HSA agents) © Copyright 2014 HSA Foundation. All Rights Reserved
• 175. HLM - OBSTRUCTION-FREE DEQUE (diagram: circular array read left to right as LN … LN, v … v, RN … RN, with left and right hint indices) Key: LN – left null value, RN – right null value, v – value, left – left hint index, right – right hint index © Copyright 2014 HSA Foundation. All Rights Reserved
• 176. C REPRESENTATION OF DEQUE
struct node {
  uint64_t type    : 2;  // null type (LN, RN, DN)
  uint64_t counter : 8;  // version counter to avoid ABA
  uint64_t value   : 54; // index/value stored in queue
};

struct queue {
  unsigned int size; // size of bounded buffer
  node * array;      // backing store for deque itself
};
© Copyright 2014 HSA Foundation. All Rights Reserved
  • 177. HSAIL REPRESENTATION  Allocate a deque in global memory using HSAIL @deque_instance: align 64 global_u32 &size; align 8 global_u64 &array; © Copyright 2014 HSA Foundation. All Rights Reserved
  • 178. ORACLE  Assume a function: function &rcheck_oracle (arg_u32 %k, arg_u64 %left, arg_u64 %right) (arg_u64 %queue);  Which given a deque  returns (%k) the position of the left most of RN  atomic_ld_global_scacq used to read node from array  Makes one if necessary (i.e. if there are only LN or DN)  atomic_cas_global_scar, required to make new RN  returns (%left) the left node (i.e. the value to the left of the left most RN position)  returns (%right) the right node (i.e. the value at position (%k)) © Copyright 2014 HSA Foundation. All Rights Reserved
• 179. RIGHT POP
function &right_pop(arg_u32 %err, arg_u64 %result) (arg_u64 %deque) {
  // load queue address
  ld_arg_u64 $d0, [%deque];
@loop_forever:
  // setup and call right oracle to get next RN
  arg_u32 %k; arg_u64 %current; arg_u64 %next;
  call &rcheck_oracle (%k, %current, %next) (%deque);
  ld_arg_u32 $s0, [%k];
  ld_arg_u64 $d1, [%current];
  ld_arg_u64 $d2, [%next];
  // current.type($d5)
  shr_u64 $d5, $d1, 62;
  // current.counter($d6)
  and_u64 $d6, $d1, 0x3FC0000000000000;
  shr_u64 $d6, $d6, 54;
  // current.value($d7)
  and_u64 $d7, $d1, 0x3FFFFFFFFFFFFF;
  // next.counter($d8)
  and_u64 $d8, $d2, 0x3FC0000000000000;
  shr_u64 $d8, $d8, 54;
  brn @loop_forever;
}
© Copyright 2014 HSA Foundation. All Rights Reserved
• 180. RIGHT POP – TEST FOR EMPTY
// empty if current.type($d5) == LN || current.type($d5) == DN
cmp_neq_b1_u64 $c0, $d5, LN;
cmp_neq_b1_u64 $c1, $d5, DN;
and_b1 $c0, $c0, $c1; // $c0 = (type != LN) && (type != DN)
cbr $c0, @not_empty;
// current node address (%deque($d0) + (%k($s0) - 1) * 16)
add_u32 $s1, $s0, -1;
mul_u32 $s1, $s1, 16;
add_u32 $d3, $d0, $s1;
atomic_ld_global_scacq_u64 $d4, [$d3];
cmp_neq_b1_u64 $c0, $d4, $d1;
cbr $c0, @not_empty;
st_arg_u32 EMPTY, [%err]; // deque empty so return EMPTY
ret;
@not_empty:
© Copyright 2014 HSA Foundation. All Rights Reserved
• 181. RIGHT POP – TRY READ/REMOVE NODE
// $d9 = node(RN, next.cnt+1, 0)
add_u64 $d8, $d8, 1;
shl_u64 $d8, $d8, 54;
shl_u64 $d9, RN, 62;
or_u64 $d9, $d8, $d9;
// cas(deq+k, next, node(RN, next.cnt+1, 0))
atomic_cas_global_scar_u64 $d9, [$s0], $d2, $d9;
cmp_neq_u64 $c0, $d9, $d2;
cbr $c0, @cas_failed;
// $d9 = node(RN, current.cnt+1, 0)
add_u64 $d6, $d6, 1;
shl_u64 $d6, $d6, 54;
shl_u64 $d9, RN, 62;
or_u64 $d9, $d6, $d9;
// cas(deq+(k-1), curr, node(RN, curr.cnt+1, 0))
atomic_cas_global_scar_u64 $d9, [$s1], $d1, $d9;
cmp_neq_u64 $c0, $d9, $d1;
cbr $c0, @cas_failed;
st_arg_u32 SUCCESS, [%err];
st_arg_u64 $d7, [%result];
ret;
@cas_failed:
// loop back around and try again
© Copyright 2014 HSA Foundation. All Rights Reserved
• 182. TAKE AWAYS  HSA provides a powerful and modern memory model  Based on the well-known SC for DRF  Defined as Release Consistency  Extended with scopes as defined by HRF  OpenCL 2.0 introduces a new memory model  Also based on SC for DRF  Also defined in terms of Release Consistency  Also extended with scopes as defined in HRF  Has a well defined mapping to HSA  Concurrent algorithm development for emerging heterogeneous compute clusters can benefit from the HSA and OpenCL 2.0 memory models © Copyright 2014 HSA Foundation. All Rights Reserved
  • 183. HSA QUEUING MODEL HAKAN PERSSON, SENIOR PRINCIPAL ENGINEER, ARM
• 185. MOTIVATION (TODAY’S PICTURE) (flow diagram across Application, OS, and GPU lanes: Transfer buffer to GPU -> Copy/Map Memory -> Queue Job -> Schedule Job -> Start Job -> Finish Job -> Schedule Application -> Get Buffer -> Copy/Map Memory) © Copyright 2014 HSA Foundation. All Rights Reserved
• 187. REQUIREMENTS  Three key technologies are used to build the user mode queueing mechanism  Shared Virtual Memory  System Coherency  Signaling  AQL (Architected Queueing Language) enables any agent to enqueue tasks © Copyright 2014 HSA Foundation. All Rights Reserved
• 189. SHARED VIRTUAL MEMORY (TODAY)  Multiple virtual memory address spaces (diagram: CPU0 uses VIRTUAL MEMORY1 with mapping VA1->PA1; the GPU uses VIRTUAL MEMORY2 with mapping VA2->PA1; both resolve to the same PHYSICAL MEMORY) © Copyright 2014 HSA Foundation. All Rights Reserved
  • 190. SHARED VIRTUAL MEMORY (HSA)  Common virtual memory for all HSA agents [Diagram: CPU0 and the GPU share one VIRTUAL MEMORY with the identical VA->PA mapping into PHYSICAL MEMORY] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 191. SHARED VIRTUAL MEMORY  Advantages  No mapping tricks, no copying back-and-forth between different PA addresses  Send pointers (not data) back and forth between HSA agents.  Implications  Common Page Tables (and common interpretation of architectural semantics such as shareability, protection, etc).  Common mechanisms for address translation (and servicing address translation faults)  Concept of a process address space (PASID) to allow multiple, per process virtual address spaces within the system. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 192. SHARED VIRTUAL MEMORY  Specifics  Minimum supported VA width is 48b for 64b systems, and 32b for 32b systems.  HSA agents may reserve VA ranges for internal use via system software.  All HSA agents other than the host unit must use the lowest privilege level  If present, read/write access flags for page tables must be maintained by all agents.  Read/write permissions apply to all HSA agents, equally. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 193. GETTING THERE … [Diagram: the offload flow from the Motivation slide, with the buffer-transfer and copy/map steps struck out now that shared virtual memory is in place] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 195. CACHE COHERENCY DOMAINS (1/3)  Data accesses to global memory segment from all HSA Agents shall be coherent without the need for explicit cache maintenance. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 196. CACHE COHERENCY DOMAINS (2/3)  Advantages  Composability  Reduced SW complexity when communicating between agents  Lower barrier to entry when porting software  Implications  Hardware coherency support between all HSA agents  Can take many forms  Stand alone Snoop Filters / Directories  Combined L3/Filters  Snoop-based systems (no filter)  Etc … © Copyright 2014 HSA Foundation. All Rights Reserved
  • 197. CACHE COHERENCY DOMAINS (3/3)  Specifics  No requirement for instruction memory accesses to be coherent  Only applies to the primary memory type  No requirement for HSA agents to maintain coherency for any memory location where the HSA agents do not specify the same memory attributes  Read-only image data is required to remain static during the execution of an HSA kernel  No double mapping via different attributes in order to modify it; the data must remain static © Copyright 2014 HSA Foundation. All Rights Reserved
  • 198. GETTING CLOSER … [Diagram: the offload flow again, further reduced now that system coherency removes the need for explicit reconciliation between agents] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 200. SIGNALING (1/3)  HSA agents support the ability to use signaling objects  All creation/destruction of signaling objects occurs via HSA runtime APIs  From an HSA agent you can directly access signaling objects:  Signal a signal object (this will wake up HSA agents waiting on the object)  Query the current object value  Wait on the current object (various conditions supported) © Copyright 2014 HSA Foundation. All Rights Reserved
  • 201. SIGNALING (2/3)  Advantages  Enables asynchronous events between HSA agents, without involving the kernel  Common idiom for work offload  Low power waiting  Implications  Runtime support required  Commonly implemented on top of cache coherency flows © Copyright 2014 HSA Foundation. All Rights Reserved
  • 202. SIGNALING (3/3)  Specifics  Only supported within a PASID  Supported wait conditions are =, !=, < and >=  Wait operations may return sporadically (no guarantee against false positives)  Programmer must test.  Wait operations have a maximum duration before returning.  The HSAIL atomic operations are supported on signal objects.  Signal objects are opaque  Must use dedicated HSAIL/HSA runtime operations © Copyright 2014 HSA Foundation. All Rights Reserved
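Because waits can time out and may return before the condition actually holds, callers re-test in a loop. A minimal C++ sketch of that pattern, assuming a runtime call in the style of hsa_signal_wait_acquire (the name, signature, and enum here are illustrative assumptions; the slide specifies only the supported conditions and the false-positive caveat):

#include <cstdint>

// Assumed runtime call: blocks (with low-power waiting where possible)
// until the signal value satisfies (value cond compare) or an
// implementation-defined timeout expires; returns the observed value.
enum class WaitCondition { Eq, Ne, Lt, Gte };
int64_t hsa_signal_wait_acquire(uint64_t signalHandle,
                                WaitCondition cond, int64_t compare);

// Wait until the signal reaches 0. The loop re-tests because, per the
// slide, waits may return sporadically and have a maximum duration.
void wait_until_zero(uint64_t signalHandle) {
    int64_t observed;
    do {
        observed = hsa_signal_wait_acquire(signalHandle,
                                           WaitCondition::Eq, 0);
    } while (observed != 0);
}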
  • 203. ALMOST THERE… [Diagram: the offload flow again; with signaling in place, job-completion notification no longer needs OS scheduling] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 205. ONE BLOCK LEFT [Diagram: the offload flow with a single remaining block — queueing a job still goes through the OS] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 206. USER MODE QUEUEING (1/3)  User mode Queueing  Enables user space applications to directly, without OS intervention, enqueue jobs (“Dispatch Packets”) for HSA agents.  Queues are created/destroyed via calls to the HSA runtime.  One (or many) agents enqueue packets, a single agent dequeues packets.  Requires coherency and shared virtual memory. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 207. USER MODE QUEUEING (2/3)  Advantages  Avoid involving the kernel/driver when dispatching work for an Agent.  Lower latency job dispatch enables finer granularity of offload  Standard memory protection mechanisms may be used to protect communication with the consuming agent.  Implications  Packet formats/fields are Architected – standard across vendors!  Guaranteed backward compatibility  Packets are enqueued/dequeued via an Architected protocol (all via memory accesses and signaling)  More on this later…… © Copyright 2014 HSA Foundation. All Rights Reserved
  • 208. SUCCESS! [Diagram: the original offload flow with every OS-mediated step eliminated] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 209. SUCCESS! [Diagram: the resulting flow — the application queues the job directly; the GPU starts and finishes it] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 211. ARCHITECTED QUEUEING LANGUAGE  HSA queues look just like standard shared memory queues, supporting multi-producer, single-consumer  A single-producer variant is defined, with some optimizations possible  Queues consist of storage, read/write indices, ID, etc.  Queues are created/destroyed via calls to the HSA runtime  “Packets” are placed in queues directly from user mode, via an architected protocol  Packet format is architected [Diagram: several producers and one consumer share packet storage in coherent shared memory, coordinated by a read index and a write index] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 212. ARCHITECTED QUEUING LANGUAGE  Packets are read and dispatched for execution from the queue in order, but may complete in any order.  There is no guarantee that more than one packet will be processed in parallel at a time  There may be many queues. A single agent may also consume from several queues.  Any HSA agent may enqueue packets  CPUs  GPUs  Other accelerators © Copyright 2014 HSA Foundation. All Rights Reserved
  • 213. QUEUE STRUCTURE
Offset (bytes) | Size (bytes) | Field | Notes
0 | 4 | queueType | Differentiates different queues
4 | 4 | queueFeatures | Indicates supported features
8 | 8 | baseAddress | Pointer to packet array
16 | 8 | doorbellSignal | HSA signaling object handle
24 | 4 | size | Packet array cardinality
28 | 4 | queueId | Unique per process
32 | 8 | serviceQueue | Queue for callback services
intrinsic | 8 | writeIndex | Packet array write index
intrinsic | 8 | readIndex | Packet array read index
© Copyright 2014 HSA Foundation. All Rights Reserved
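As a reading aid, here is a hypothetical C++ mirror of that layout; the field names and offsets follow the table above, while the struct name and the signal-handle typedef are assumptions, not the runtime's own declarations:

#include <cstdint>

// Hypothetical C++ mirror of the queue structure above.
using hsa_signal_handle_t = uint64_t; // assumed 8-byte opaque handle

struct AqlQueue {
    uint32_t queueType;                  // offset 0: MULTI or SINGLE
    uint32_t queueFeatures;              // offset 4: DISPATCH feature bits
    uint64_t baseAddress;                // offset 8: packet array pointer
    hsa_signal_handle_t doorbellSignal;  // offset 16: doorbell handle
    uint32_t size;                       // offset 24: cardinality (power of 2)
    uint32_t queueId;                    // offset 28: unique per process
    uint64_t serviceQueue;               // offset 32: callback-service queue
    // readIndex and writeIndex are intrinsic: reachable only through
    // HSA runtime APIs / HSAIL operations, not via this structure.
};

static_assert(sizeof(AqlQueue) == 40, "matches the architected offsets");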
  • 214. QUEUE VARIANTS  queueType and queueFeatures together define queue semantics and capabilities  Two queueType values defined, other values reserved:  MULTI – queue supports multiple producers  SINGLE – queue supports single producer  queueFeatures is a bitfield indicating capabilities  DISPATCH (bit 0) if set then queue supports DISPATCH packets  AGENT_DISPATCH (bit 1) if set then queue supports AGENT_DISPATCH packets  All other bits are reserved and must be 0 © Copyright 2014 HSA Foundation. All Rights Reserved
  • 215. QUEUE STRUCTURE DETAILS  Queue doorbells are HSA signaling objects with restrictions  Created as part of the queue – lifetime tied to the queue object  Atomic read-modify-write is not allowed  The size field value must be a power of 2  serviceQueue can be used by an HSA kernel for callback services  Provided by the application when the queue is created  Can be mapped to the HSA runtime provided serviceQueue, an application-serviced queue, or NULL if no serviceQueue is required © Copyright 2014 HSA Foundation. All Rights Reserved
  • 216. READ/WRITE INDICES  readIndex and writeIndex properties are part of the queue, but not visible in the queue structure  Accessed through HSA runtime API and HSAIL operations  HSA runtime/HSAIL operations defined to  Read readIndex or writeIndex property  Write readIndex or writeIndex property  Add constant to writeIndex property (returns previous writeIndex value)  CAS on writeIndex property  readIndex & writeIndex operations treated as atomic in memory model  relaxed, acquire, release and acquire-release variants defined as applicable  readIndex and writeIndex never wrap  PacketID – the index of a particular packet  Uniquely identifies each packet of a queue © Copyright 2014 HSA Foundation. All Rights Reserved
  • 217. PACKET ENQUEUE  Packet enqueue follows a few simple steps:  Reserve space  Multiple packets can be reserved at a time  Write packet to queue  Mark packet as valid  Producer no longer allowed to modify packet  Consumer is allowed to start processing packet  Notify consumer of packet through the queue doorbell  Multiple packets can be notified at a time  Doorbell signal should be signaled with last packetID notified  On small machine model the lower 32 bits of the packetID are used © Copyright 2014 HSA Foundation. All Rights Reserved
  • 218. PACKET RESERVATION  Two flows envisaged  Atomic add on writeIndex with the number of packets to reserve  Producer must wait until packetID < readIndex + size before writing the packet  Queue can be sized so that the wait is unlikely (or impossible)  Suitable when many threads use one queue  Check the queue is not full first, then use atomic CAS to update writeIndex (sketched below)  Can be inefficient if many threads use the same queue  Allows a different failure model if the queue is congested © Copyright 2014 HSA Foundation. All Rights Reserved
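A minimal sketch of the second, CAS-based flow; the Queue type here models the two intrinsic indices as ordinary C++ atomics purely for illustration, whereas a real producer would use the architected index operations:

#include <atomic>
#include <cstdint>

// Illustration only: the intrinsic indices modeled as C++ atomics.
struct Queue {
    uint32_t size;                        // power of 2
    std::atomic<uint64_t> readIndex{0};
    std::atomic<uint64_t> writeIndex{0};
};

// Check-then-CAS reservation of a single packet slot.
bool try_reserve_packet(Queue* q, uint64_t* packetID) {
    uint64_t wrIdx = q->writeIndex.load(std::memory_order_relaxed);
    uint64_t rdIdx = q->readIndex.load(std::memory_order_relaxed);
    if (wrIdx >= rdIdx + q->size)
        return false;                     // queue full right now
    // Succeeds only if no other producer advanced writeIndex meanwhile.
    if (!q->writeIndex.compare_exchange_strong(wrIdx, wrIdx + 1,
                                               std::memory_order_relaxed))
        return false;                     // lost the race; caller may retry
    *packetID = wrIdx;                    // the reserved slot's packetID
    return true;
}

The check-then-CAS shape is what allows the different failure model: a producer that loses the race or finds the queue full can fail over instead of blocking.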
  • 219. QUEUE OPTIMIZATIONS  Queue behavior is loosely defined to allow optimizations  Some potential producer behavior optimizations:  Keep local copy of readIndex, update when required  For single producer queues:  Keep local copy of writeIndex  Use store operation rather than add/cas atomic to update writeIndex  Some potential consumer behavior optimizations:  Use packet format field to determine whether a packet has been submitted rather than writeIndex property  Speculatively read multiple packets from the queue  Not update readIndex for each packet processed  Rely on value used for doorbellSignal to notify new packets  Especially useful for single producer queues © Copyright 2014 HSA Foundation. All Rights Reserved
  • 220. POTENTIAL MULTI-PRODUCER ALGORITHM
// Allocate packet
uint64_t packetID = hsa_queue_add_write_index_relaxed(q, 1);
// Wait until the queue is no longer full.
uint64_t rdIdx;
do {
    rdIdx = hsa_queue_load_read_index_relaxed(q);
} while (packetID >= (rdIdx + q->size));
// calculate index
uint32_t arrayIdx = packetID & (q->size - 1);
// copy over the packet; its format field is still INVALID
q->baseAddress[arrayIdx] = pkt;
// Update the format field with release semantics
q->baseAddress[arrayIdx].hdr.format.store(DISPATCH, std::memory_order_release);
// ring the doorbell (the release store above publishes the packet;
// the signal could also be amortized over multiple packets)
hsa_signal_send_relaxed(q->doorbellSignal, packetID);
© Copyright 2014 HSA Foundation. All Rights Reserved
  • 221. POTENTIAL CONSUMER ALGORITHM
// Get location of next packet
uint64_t readIndex = hsa_queue_load_read_index_relaxed(q);
// calculate the index
uint32_t arrayIdx = readIndex & (q->size - 1);
// spin while empty (could also perform a low-power wait on the doorbell)
while (INVALID == q->baseAddress[arrayIdx].hdr.format) { }
// copy over the packet
pkt = q->baseAddress[arrayIdx];
// set the format field to invalid
q->baseAddress[arrayIdx].hdr.format.store(INVALID, std::memory_order_relaxed);
// Update the readIndex using the HSA intrinsic
hsa_queue_store_read_index_relaxed(q, readIndex + 1);
// Now process <pkt>!
© Copyright 2014 HSA Foundation. All Rights Reserved
  • 223. PACKETS  Packets come in three main task types, all with architected layouts  Dispatch  Specifies kernel execution over a grid  Agent Dispatch  Specifies a single function to perform with a set of parameters  Barrier  Used for task dependencies  Two further formats, Always Reserved and Invalid, do not contain valid tasks and are not processed (the queue will not progress past them) © Copyright 2014 HSA Foundation. All Rights Reserved
  • 224. COMMON PACKET HEADER
Start offset (bytes) | Format | Field name | Description
0 | uint16_t | format:8 | Contains the packet type (Always Reserved, Invalid, Dispatch, Agent Dispatch, or Barrier). Other values are reserved and should not be used.
 | | barrier:1 | If set, processing of the packet will only begin when all preceding packets are complete.
 | | acquireFenceScope:2 | Determines the scope and type of the memory fence operation applied before the packet enters the active phase. Must be 0 for Barrier packets.
 | | releaseFenceScope:2 | Determines the scope and type of the memory fence operation applied after kernel completion but before the packet is completed.
 | | reserved:3 | Must be 0.
© Copyright 2014 HSA Foundation. All Rights Reserved
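The five fields pack into a single uint16_t. A hypothetical C++ rendering (the field widths come from the table; the in-word bit order of C++ bitfields is implementation-defined, so treat this as documentation of the layout, not a portable encoder):

#include <cstdint>

struct PacketHeader {
    uint16_t format : 8;             // packet type
    uint16_t barrier : 1;            // wait for all preceding packets
    uint16_t acquireFenceScope : 2;  // fence before the active phase
    uint16_t releaseFenceScope : 2;  // fence before completion
    uint16_t reserved : 3;           // must be 0
};

static_assert(sizeof(PacketHeader) == sizeof(uint16_t),
              "all fields pack into one uint16_t");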
  • 225. DISPATCH PACKET
Start offset (bytes) | Format | Field name | Description
0 | uint16_t | header | Packet header
2 | uint16_t | dimensions:2 | Number of dimensions specified in gridSize. Valid values are 1, 2, or 3.
 | | reserved:14 | Must be 0.
4 | uint16_t | workgroupSize.x | x dimension of work-group (measured in work-items).
6 | uint16_t | workgroupSize.y | y dimension of work-group (measured in work-items).
8 | uint16_t | workgroupSize.z | z dimension of work-group (measured in work-items).
10 | uint16_t | reserved2 | Must be 0.
12 | uint32_t | gridSize.x | x dimension of grid (measured in work-items).
16 | uint32_t | gridSize.y | y dimension of grid (measured in work-items).
20 | uint32_t | gridSize.z | z dimension of grid (measured in work-items).
24 | uint32_t | privateSegmentSizeBytes | Total size in bytes of private memory allocation request (per work-item).
28 | uint32_t | groupSegmentSizeBytes | Total size in bytes of group memory allocation request (per work-group).
32 | uint64_t | kernelObjectAddress | Address of an object in memory that includes an implementation-defined executable ISA image for the kernel.
40 | uint64_t | kernargAddress | Address of memory containing kernel arguments.
48 | uint64_t | reserved3 | Must be 0.
56 | uint64_t | completionSignal | Address of HSA signaling object used to indicate completion of the job.
© Copyright 2014 HSA Foundation. All Rights Reserved
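A hypothetical C++ mirror of this 64-byte layout, plus an example of filling it for a simple 1D launch (the struct and function names are assumptions; header, kernarg contents, and signal setup are omitted):

#include <cstdint>

// dimensions is really a 2-bit field with 14 reserved bits; a full
// uint16_t is used here for simplicity.
struct DispatchPacket {
    uint16_t header;                   // common packet header
    uint16_t dimensions;               // 1, 2, or 3
    uint16_t workgroupSize[3];         // x, y, z in work-items
    uint16_t reserved2;                // must be 0
    uint32_t gridSize[3];              // x, y, z in work-items
    uint32_t privateSegmentSizeBytes;  // per work-item
    uint32_t groupSegmentSizeBytes;    // per work-group
    uint64_t kernelObjectAddress;      // executable ISA image
    uint64_t kernargAddress;           // kernel arguments
    uint64_t reserved3;                // must be 0
    uint64_t completionSignal;         // signaled on completion
};
static_assert(sizeof(DispatchPacket) == 64, "architected packet size");

// Example: a 1D launch of 4096 work-items in work-groups of 256.
DispatchPacket make_1d_dispatch(uint64_t kernelObject, uint64_t kernargs) {
    DispatchPacket p{};
    p.dimensions = 1;
    p.workgroupSize[0] = 256; p.workgroupSize[1] = 1; p.workgroupSize[2] = 1;
    p.gridSize[0] = 4096;     p.gridSize[1] = 1;      p.gridSize[2] = 1;
    p.kernelObjectAddress = kernelObject;
    p.kernargAddress = kernargs;
    return p;
}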
  • 226. AGENT DISPATCH PACKET
Start offset (bytes) | Format | Field name | Description
0 | uint16_t | header | Packet header
2 | uint16_t | type | The function to be performed by the destination agent. The type value is split into the following ranges:  0x0000:0x3FFF – Vendor specific  0x4000:0x7FFF – HSA runtime  0x8000:0xFFFF – User registered function
4 | uint32_t | reserved2 | Must be 0.
8 | uint64_t | returnLocation | Pointer to the location to store the function return value in.
16 | uint64_t | arg[0] | 64-bit direct or indirect arguments.
24 | uint64_t | arg[1] |
32 | uint64_t | arg[2] |
40 | uint64_t | arg[3] |
48 | uint64_t | reserved3 | Must be 0.
56 | uint64_t | completionSignal | Address of HSA signaling object used to indicate completion of the job.
© Copyright 2014 HSA Foundation. All Rights Reserved
  • 227. BARRIER PACKET  Used for specifying dependences between packets  The HSA agent will not launch any further packets from this queue until the barrier packet’s signal conditions are met  Used for specifying dependences on packets dispatched from any queue  The execution phase completes only when all of the dependent signals (up to five) have been signaled with the value 0  Or when an error has occurred in one of the packets upon which we have a dependence © Copyright 2014 HSA Foundation. All Rights Reserved
  • 228. BARRIER PACKET
Start offset (bytes) | Format | Field name | Description
0 | uint16_t | header | Packet header, see 2.8.1 Packet header (p. 16).
2 | uint16_t | reserved2 | Must be 0.
4 | uint32_t | reserved3 | Must be 0.
8 | uint64_t | depSignal0 | Addresses of dependent signaling objects to be evaluated by the packet processor.
16 | uint64_t | depSignal1 |
24 | uint64_t | depSignal2 |
32 | uint64_t | depSignal3 |
40 | uint64_t | depSignal4 |
48 | uint64_t | reserved4 | Must be 0.
56 | uint64_t | completionSignal | Address of HSA signaling object used to indicate completion of the job.
© Copyright 2014 HSA Foundation. All Rights Reserved
  • 229. DEPENDENCES  A user may never assume more than one packet is being executed by an HSA agent at a time.  Implications:  Packets can’t poll on shared memory values which will be set by packets issued from other queues, unless the user has ensured the proper ordering.  To ensure all previous packets from a queue have been completed, use the Barrier bit.  To ensure specific packets from any queue have completed, use the Barrier packet. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 230. HSA QUEUEING, PACKET EXECUTION
  • 231. PACKET EXECUTION  Launch phase  Initiated when launch conditions are met  All preceding packets in the queue must have exited launch phase  If the barrier bit in the packet header is set, then all preceding packets in the queue must have exited completion phase  Includes memory acquire fence  Active phase  Execute the packet  Barrier packets remain in Active phase until conditions are met.  Completion phase  First step is memory release fence – make results visible.  completionSignal field is then signaled with a decrementing atomic. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 232. PACKET EXECUTION – BARRIER BIT [Timeline: Pkt1 and Pkt2 launch, execute, and complete, overlapping freely; Pkt3, which has barrier=1, launches only when all preceding packets in the queue have completed] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 233. PUTTING IT ALL TOGETHER (FFT) [Diagram: FFT dataflow over inputs X[0]..X[7]; packets 1 and 2 perform the first stage, then a barrier, packets 3 and 4 the next stage, another barrier, then packets 5 and 6] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 234. PUTTING IT ALL TOGETHER
AQL Pseudo Code
// Send the packets to do the first stage.
aql_dispatch(pkt1);
aql_dispatch(pkt2);
// Send the next two packets, setting the barrier bit so we
// know packets 1 & 2 will be complete before 3 and 4 are launched.
aql_dispatch_with_barrier_bit(pkt3);
aql_dispatch(pkt4);
// Same as above (make sure 3 & 4 are done before issuing 5 & 6).
aql_dispatch_with_barrier_bit(pkt5);
aql_dispatch(pkt6);
// This packet will notify us when 5 & 6 are complete.
aql_dispatch_with_barrier_bit(finish_pkt);
© Copyright 2014 HSA Foundation. All Rights Reserved
  • 235. PACKET EXECUTION – BARRIER PACKET [Timeline: queue Q1 holds a Barrier packet followed by T2; queue Q2 holds T1. Signal X is initialized to 1 and serves as both the barrier’s depSignal0 and T1’s completionSignal. When T1 completes it decrements signal X; the barrier completes when signal X is signaled with 0, and T2 launches once the barrier completes] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 236. DEPTH FIRST CHILD TASK EXECUTION  Consider two generations of child tasks  Task T submits tasks T.1 & T.2  Task T.1 submits tasks T.1.1 & T.1.2  Task T.2 submits tasks T.2.1 & T.2.2  Desired outcome  Depth-first child task execution  I.e. T → T.1 → T.1.1 → T.1.2 → T.2 → T.2.1 → T.2.2  T is passed a signal (allComplete) to decrement when all tasks are complete (T and its children etc.) [Diagram: task tree with T at the root, children T.1 and T.2, and grandchildren T.1.1, T.1.2, T.2.1, T.2.2] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 237. HOW TO DO THIS WITH HSA QUEUES?  Use a separate user mode queue for each recursion level  Task T submits to queue Q1  Tasks T.1 & T.2 submits tasks to queue Q2  Queues could be passed in as parameters to task T  Depth first requires ordering of T.1, T.2 and their children  Use additional signal object (childrenComplete) to track completion of the children of T.1 & T.2  childrenComplete set to number of children (i.e. 2) by each of T.1 & T.2 © Copyright 2014 HSA Foundation. All Rights Reserved
  • 238. A PICTURE SAYS MORE THAN 1000 WORDS [Diagram: queue Q1 holds T, a barrier after T.1, and a barrier after T.2; each barrier waits on childrenComplete, and the final one signals allComplete. Queue Q2 holds the children T.1.1, T.1.2, T.2.1, T.2.2.] A pseudo-AQL sketch of this arrangement follows below. © Copyright 2014 HSA Foundation. All Rights Reserved
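A compilable pseudo-AQL sketch of the picture above, in the spirit of the earlier FFT pseudo code; every helper here (signal_create, make_task, make_barrier_packet, aql_dispatch_to) is an assumption for illustration, not HSA runtime API:

#include <cstdint>

struct Queue;
struct Signal;
struct Packet { uint8_t bytes[64]; };   // stand-in for an AQL packet

Signal* signal_create(int64_t initialValue);
Packet  make_task(void (*fn)(Queue*, Signal*), Queue* childQueue,
                  Signal* childrenComplete);
Packet  make_barrier_packet(Signal* depSignal0, Signal* completionSignal);
void    aql_dispatch_to(Queue* q, Packet p);

// T.1 and T.2 each set childrenComplete to 2 when they run, then enqueue
// their two children on the child queue with childrenComplete as the
// children's completion signal.
void T1(Queue* q2, Signal* childrenComplete);
void T2(Queue* q2, Signal* childrenComplete);

// Orchestrate depth-first execution of T's subtree across two queues.
void run_T(Queue* Q1, Queue* Q2, Signal* allComplete) {
    Signal* childrenComplete = signal_create(0);
    aql_dispatch_to(Q1, make_task(&T1, Q2, childrenComplete));
    // Barrier: Q1 stalls until childrenComplete returns to 0,
    // i.e. T.1.1 and T.1.2 have finished.
    aql_dispatch_to(Q1, make_barrier_packet(childrenComplete, nullptr));
    aql_dispatch_to(Q1, make_task(&T2, Q2, childrenComplete));
    // The final barrier also signals allComplete once T.2's children finish.
    aql_dispatch_to(Q1, make_barrier_packet(childrenComplete, allComplete));
}

The barrier packets would additionally set the barrier bit in their headers so they are not evaluated before the preceding task has completed and set childrenComplete.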
  • 239. SUMMARY © Copyright 2014 HSA Foundation. All Rights Reserved
  • 240. KEY HSA TECHNOLOGIES  HSA combines several mechanisms to enable low overhead task dispatch  Shared Virtual Memory  System Coherency  Signaling  AQL  User mode queues – from any compatible agent  Architected packet format  Rich dependency mechanism  Flexible and efficient signaling of completion © Copyright 2014 HSA Foundation. All Rights Reserved
  • 241. QUESTIONS? © Copyright 2014 HSA Foundation. All Rights Reserved
  • 242. HSA APPLICATIONS WEN-MEI HWU, PROFESSOR, UNIVERSITY OF ILLINOIS WITH J.P. BORDES AND JUAN GOMEZ
  • 243. USE CASES SHOWING HSA ADVANTAGE
Programming technique | Use case | Description | HSA advantage
Pointer-based Data Structures | Binary tree searches | GPU performs parallel searches in a CPU-created binary tree. | CPU and GPU have access to the entire unified coherent memory. GPU can access existing data structures containing pointers.
Platform Atomics | Work-group dynamic task management | GPU directly operates on a task pool managed by the CPU, for algorithms with dynamic computation loads. | CPU and GPU can synchronize using platform atomics. Higher performance through parallel operations, reducing the need for data copying and reconciling.
 | Binary tree updates | CPU and GPU operate simultaneously on the tree, both making modifications. |
Large Data Sets | Hierarchical data searches | Applications include object recognition, collision detection, global illumination, BVH. | CPU and GPU have access to the entire unified coherent memory. GPU can operate on huge models in place, reducing copy and kernel launch overhead.
CPU Callbacks | Middleware user-callbacks | GPU processes work items, some of which require a call to a CPU function to fetch new data. | GPU can invoke CPU functions from within a GPU kernel. Simpler programming: does not require “split kernels”. Higher performance through parallel operations.
© Copyright 2014 HSA Foundation. All Rights Reserved
  • 244. UNIFIED COHERENT MEMORY FOR POINTER-BASED DATA STRUCTURES
  • 245–251. UNIFIED COHERENT MEMORY – MORE EFFICIENT POINTER DATA STRUCTURES (Legacy) [Animation: the CPU’s pointer-based tree lives in system memory, but the GPU kernel can only search a flattened copy (“flat tree”) in GPU memory; results land in a result buffer in GPU memory and are copied back to a result buffer in system memory] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 252–256. UNIFIED COHERENT MEMORY – MORE EFFICIENT POINTER DATA STRUCTURES (HSA and full OpenCL 2.0) [Animation: the GPU kernel traverses the CPU-created tree in place in system memory, following the L/R pointers directly, and writes to the result buffer in system memory; no flattened copy or staging in GPU memory is needed] © Copyright 2014 HSA Foundation. All Rights Reserved
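To make the contrast concrete, here is a minimal C++ sketch of the search the HSA kernel can perform: the GPU walks the very nodes the CPU allocated, because pointers mean the same thing on both agents (the Node layout and per-work-item framing are illustrative; the deck does not show the benchmark’s actual kernel):

#include <cstdint>

// A CPU-built binary search tree node, consumable by the GPU as-is
// under HSA because both agents share one virtual address space.
struct Node {
    int   key;
    Node* left;    // ordinary pointers stay valid on the GPU
    Node* right;
};

// One search, as each GPU work-item would perform it: no flattening,
// no index translation, just pointer chasing through system memory.
bool contains(const Node* root, int key) {
    for (const Node* n = root; n != nullptr; ) {
        if (key == n->key) return true;
        n = (key < n->key) ? n->left : n->right;
    }
    return false;
}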
  • 257. POINTER DATA STRUCTURES - CODE COMPLEXITY [Side-by-side source listings comparing the HSA and legacy implementations] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 258. POINTER DATA STRUCTURES - PERFORMANCE [Chart: binary tree search rate (nodes/ms) vs. tree size (1M, 5M, 10M, 25M nodes) for CPU (1 core), CPU (4 core), Legacy APU, and HSA APU] Measured in AMD labs Jan 1-3 on the system shown in the backup slide © Copyright 2014 HSA Foundation. All Rights Reserved
  • 259. PLATFORM ATOMICS FOR DYNAMIC TASK MANAGEMENT
  • 260–272. PLATFORM ATOMICS – ENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT (Legacy*) [Animation: the CPU fills a task pool in system memory and asynchronously transfers tasks into per-work-group queues in GPU memory; “num. written tasks” and “num. consumed tasks” counters also live in GPU memory, with work-groups 1–4 consuming tasks via atomic adds on the consumed counters, and counter values synchronized back to the host via zero-copy] *Chen et al., Dynamic load balancing on single- and multi-GPU systems, IPDPS 2010 © Copyright 2014 HSA Foundation. All Rights Reserved
  • 273–283. PLATFORM ATOMICS – ENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT (HSA and full OpenCL 2.0) [Animation: the task pool, the queues, and the “num. written/consumed tasks” counters all live in host coherent memory; the CPU writes tasks with an ordinary memcpy and the GPU work-groups consume them directly, synchronizing with platform atomic adds on the shared counters; GPU memory is not involved] © Copyright 2014 HSA Foundation. All Rights Reserved
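A minimal sketch of the consumer side of this scheme, modeling the shared counters as C++ atomics in host coherent memory (the queue layout and names follow the diagram rather than published code, and the empty-queue handling is simplified):

#include <atomic>
#include <cstdint>

// Task queue shared between the CPU producer and GPU work-group
// consumers in host coherent memory; platform atomics make the
// counters visible to both agents without copies.
struct TaskQueue {
    static constexpr uint32_t kCapacity = 4096;
    std::atomic<uint32_t> numWritten{0};   // advanced by the CPU
    std::atomic<uint32_t> numConsumed{0};  // advanced by work-groups
    uint64_t tasks[kCapacity];             // task descriptors
};

// One representative work-item per work-group claims the next task with
// a single platform atomic add. (Simplified: a real consumer would back
// off or return its claim when it overshoots numWritten.)
bool try_claim_task(TaskQueue* q, uint64_t* task) {
    uint32_t idx = q->numConsumed.fetch_add(1, std::memory_order_acquire);
    if (idx >= q->numWritten.load(std::memory_order_acquire))
        return false;                      // nothing available (yet)
    *task = q->tasks[idx % TaskQueue::kCapacity];
    return true;
}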
  • 284. PLATFORM ATOMICS – CODE COMPLEXITY  HSA: host enqueue function is 20 lines of code  Legacy: host enqueue function is 102 lines of code [Side-by-side source listings of the two implementations] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 285. PLATFORM ATOMICS - PERFORMANCE [Chart: execution time (ms) for varying tasks per insertion and task pool sizes, comparing the legacy implementation against the HSA implementation] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 287–289. PLATFORM ATOMICS – ENABLING EFFICIENT GPU/CPU COLLABORATION (Legacy) [Animation: only the GPU kernel can work on the input buffer and the tree; concurrent CPU/GPU processing is not possible] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 292. UNIFIED COHERENT MEMORY FOR LARGE DATA SETS
  • 293. PROCESSING LARGE DATA SETS  The CPU creates a large data structure in system memory.  Computations using the data are offloaded to the GPU. [Diagram: system memory and the GPU] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 294. PROCESSING LARGE DATA SETS [Diagram: a large 3D spatial data structure organized as a hierarchy, Level 1 through Level 5, in system memory, with the GPU alongside]  The CPU creates a large data structure in system memory.  Computations using the data are offloaded to the GPU.  Compare HSA and legacy methods © Copyright 2014 HSA Foundation. All Rights Reserved
  • 295. LEGACY ACCESS USING GPU MEMORY (Legacy)  GPU memory is smaller than system memory  Have to copy and process the structure in chunks [Diagram: system memory, the GPU, and its smaller GPU memory] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 296–311. LEGACY ACCESS TO LARGE STRUCTURES – COPY AND PROCESS ONE CHUNK AT A TIME (Legacy) [Animation: the top two levels of the hierarchy are copied into GPU memory and processed by a first kernel; then the bottom three levels of one branch are copied in and processed by a second kernel; this copy-then-process cycle repeats branch by branch through an Nth kernel] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 312–317. GPU CAN TRAVERSE ENTIRE HIERARCHY (HSA and full OpenCL 2.0) [Animation: a single kernel traverses all five levels of the large 3D spatial data structure in place in system memory; no chunked copies and no intermediate kernel launches are needed] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 319. CALLBACKS – A COMMON SITUATION IN HETEROGENEOUS COMPUTING  Parallel processing algorithm with branches  A seldom-taken branch requires new data from the CPU  On legacy systems, the algorithm must be split:  Process Kernel 1 on the GPU  Check for CPU callbacks and, if any, process them on the CPU  Process Kernel 2 on the GPU  Example algorithm from image processing  Perform a filter  Calculate the average luma in each tile  Compare the luma against a threshold and call the CPU callback if exceeded (rare)  Perform special processing on tiles with callbacks [Diagram: input image and output image] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 320. CALLBACKS (Legacy) [Diagram: GPU threads 0..N; the kernel must end so the CPU can service the callbacks, and a continuation kernel finishes up the work, which results in poor GPU utilization] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 321. CALLBACKS  Input image: 1 tile = 1 OpenCL work-item; output image  GPU: work-items compute the average RGB value of all the pixels in a tile  Work-items also compute the average luma from the average RGB  If the average luma > threshold, the work-group invokes the CPU CALLBACK  In parallel with the callback, computation continues  CPU: for selected tiles, update the average luma value (set to RED)  GPU: work-items apply the luma value to all pixels in the tile  GPU-to-CPU callbacks use Shared Virtual Memory (SVM) semaphores, implemented using platform atomic compare-and-swap (a sketch follows below). © Copyright 2014 HSA Foundation. All Rights Reserved
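A hedged sketch of such an SVM semaphore in C++ atomics; the four-state mailbox and the CPU service loop are assumptions, with platform atomic compare-and-swap as the claiming primitive, as the slide states:

#include <atomic>
#include <cstdint>

// One callback mailbox in shared virtual memory, visible to both agents.
enum MailboxState : uint32_t { FREE = 0, CLAIMED, REQUESTED, DONE };

struct CallbackSlot {
    std::atomic<uint32_t> state{FREE};
    uint32_t tileIndex = 0;   // request payload (which tile needs service)
    float    newLuma   = 0;   // reply payload written by the CPU
};

// GPU side (one work-group): claim the slot with CAS, publish the
// request, then wait for the CPU's reply. Real code could keep
// computing other tiles instead of spinning.
bool request_callback(CallbackSlot* s, uint32_t tile, float* lumaOut) {
    uint32_t expected = FREE;
    if (!s->state.compare_exchange_strong(expected, CLAIMED,
                                          std::memory_order_acquire))
        return false;                              // slot busy; try another
    s->tileIndex = tile;                           // write payload first
    s->state.store(REQUESTED, std::memory_order_release); // then publish
    while (s->state.load(std::memory_order_acquire) != DONE) { /* spin */ }
    *lumaOut = s->newLuma;
    s->state.store(FREE, std::memory_order_release);
    return true;
}

// CPU service thread: poll for requests and answer them.
void service_loop(CallbackSlot* s, const std::atomic<bool>& stop) {
    while (!stop.load(std::memory_order_relaxed)) {
        if (s->state.load(std::memory_order_acquire) == REQUESTED) {
            s->newLuma = 0.0f;   // e.g. fetch or compute the new data
            s->state.store(DONE, std::memory_order_release);
        }
    }
}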
  • 322. CALLBACKS (HSA and full OpenCL 2.0) [Diagram: GPU threads 0..N; the few kernel threads that need CPU callback services are serviced immediately while the rest of the kernel keeps running] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 323. SUMMARY - HSA ADVANTAGE
Programming technique | Use case | Description | HSA advantage
Pointer-based Data Structures | Binary tree searches | GPU performs parallel searches in a CPU-created binary tree. | CPU and GPU have access to the entire unified coherent memory. GPU can access existing data structures containing pointers.
Platform Atomics | Work-group dynamic task management | GPU directly operates on a task pool managed by the CPU, for algorithms with dynamic computation loads. | CPU and GPU can synchronize using platform atomics. Higher performance through parallel operations, reducing the need for data copying and reconciling.
 | Binary tree updates | CPU and GPU operate simultaneously on the tree, both making modifications. |
Large Data Sets | Hierarchical data searches | Applications include object recognition, collision detection, global illumination, BVH. | CPU and GPU have access to the entire unified coherent memory. GPU can operate on huge models in place, reducing copy and kernel launch overhead.
CPU Callbacks | Middleware user-callbacks | GPU processes work items, some of which require a call to a CPU function to fetch new data. | GPU can invoke CPU functions from within a GPU kernel. Simpler programming: does not require “split kernels”. Higher performance through parallel operations.
© Copyright 2014 HSA Foundation. All Rights Reserved
  • 325. HSA COMPILATION WEN-MEI HWU, CTO, MULTICOREWARE INC WITH RAY I-JUI SUNG
  • 326. KEY HSA FEATURES FOR COMPILATION  ALL-PROCESSORS-EQUAL  GPU and CPU have equal flexibility to create and dispatch work items  EQUAL ACCESS TO ENTIRE SYSTEM MEMORY  GPU and CPU have uniform visibility into the entire memory space [Diagram: CPU and GPU sharing a single dispatch path and unified coherent memory] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 327. A QUICK REVIEW OF OPENCL CURRENT STATE OF PORTABLE HETEROGENEOUS PARALLEL PROGRAMMING
  • 328. DEVICE CODE IN OPENCL – SIMPLE MATRIX MULTIPLICATION
__kernel void matrixMul(__global float* C,
                        __global float* A,
                        __global float* B,
                        int wA, int wB)
{
    int tx = get_global_id(0);
    int ty = get_global_id(1);
    float value = 0;
    for (int k = 0; k < wA; ++k) {
        float elementA = A[ty * wA + k];
        float elementB = B[k * wB + tx];
        value += elementA * elementB;
    }
    C[ty * wB + tx] = value;  // row stride of C is wB
}
Explicit thread index usage. Reasonably readable. Portable across CPUs, GPUs, and FPGAs. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 329. HOST CODE IN OPENCL – CONCEPTUAL
1. Allocate and initialize memory on the host side
2. Initialize OpenCL
3. Allocate device memory and move the data
4. Load and build the device code
5. Launch the kernel
   a. Set the kernel arguments
6. Move the data back from the device
© Copyright 2014 HSA Foundation. All Rights Reserved
  • 330.
int main(int argc, char** argv)
{
    // set seed for rand()
    srand(2006);

    /* 1. Allocate and initialize memory on the host side */
    // allocate and initialize host memory for matrices A and B
    unsigned int size_A = WA * HA;
    unsigned int mem_size_A = sizeof(float) * size_A;
    float* h_A = (float*) malloc(mem_size_A);
    unsigned int size_B = WB * HB;
    unsigned int mem_size_B = sizeof(float) * size_B;
    float* h_B = (float*) malloc(mem_size_B);
    randomInit(h_A, size_A);
    randomInit(h_B, size_B);
    // allocate host memory for the result C
    unsigned int size_C = WC * HC;
    unsigned int mem_size_C = sizeof(float) * size_C;
    float* h_C = (float*) malloc(mem_size_C);

    /* 2. Initialize OpenCL */
    cl_context clGPUContext;
    cl_command_queue clCommandQue;
    cl_program clProgram;
    cl_kernel clKernel;   // (declaration was missing in the original listing)
    size_t dataBytes;
    size_t kernelLength;
    cl_int errcode;
    // OpenCL device memory pointers for matrices
    cl_mem d_A;
    cl_mem d_B;
    cl_mem d_C;
    clGPUContext = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU, NULL, NULL, &errcode);
    shrCheckError(errcode, CL_SUCCESS);
    // get the list of GPU devices associated with the context
    errcode = clGetContextInfo(clGPUContext, CL_CONTEXT_DEVICES, 0, NULL, &dataBytes);
    cl_device_id* clDevices = (cl_device_id*) malloc(dataBytes);
    errcode |= clGetContextInfo(clGPUContext, CL_CONTEXT_DEVICES, dataBytes, clDevices, NULL);
    shrCheckError(errcode, CL_SUCCESS);
    // create a command queue
    clCommandQue = clCreateCommandQueue(clGPUContext, clDevices[0], 0, &errcode);
    shrCheckError(errcode, CL_SUCCESS);

    /* 3. Allocate device memory and move data */
    d_C = clCreateBuffer(clGPUContext, CL_MEM_READ_WRITE, mem_size_C, NULL, &errcode); // (was mem_size_A)
    d_A = clCreateBuffer(clGPUContext, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, mem_size_A, h_A, &errcode);
    d_B = clCreateBuffer(clGPUContext, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, mem_size_B, h_B, &errcode);

    /* 4. Load and build the OpenCL kernel */
    char* clMatrixMul = oclLoadProgSource("kernel.cl", "// My comment\n", &kernelLength);
    shrCheckError(clMatrixMul != NULL, shrTRUE);
    clProgram = clCreateProgramWithSource(clGPUContext, 1, (const char**)&clMatrixMul, &kernelLength, &errcode);
    shrCheckError(errcode, CL_SUCCESS);
    errcode = clBuildProgram(clProgram, 0, NULL, NULL, NULL, NULL);
    shrCheckError(errcode, CL_SUCCESS);
    clKernel = clCreateKernel(clProgram, "matrixMul", &errcode);
    shrCheckError(errcode, CL_SUCCESS);

    /* 5. Launch the OpenCL kernel */
    size_t localWorkSize[2], globalWorkSize[2];
    int wA = WA;
    int wC = WC;
    errcode  = clSetKernelArg(clKernel, 0, sizeof(cl_mem), (void*)&d_C);
    errcode |= clSetKernelArg(clKernel, 1, sizeof(cl_mem), (void*)&d_A);
    errcode |= clSetKernelArg(clKernel, 2, sizeof(cl_mem), (void*)&d_B);
    errcode |= clSetKernelArg(clKernel, 3, sizeof(int), (void*)&wA);
    errcode |= clSetKernelArg(clKernel, 4, sizeof(int), (void*)&wC);
    shrCheckError(errcode, CL_SUCCESS);
    localWorkSize[0] = 16;    localWorkSize[1] = 16;
    globalWorkSize[0] = 1024; globalWorkSize[1] = 1024;
    errcode = clEnqueueNDRangeKernel(clCommandQue, clKernel, 2, NULL, globalWorkSize, localWorkSize, 0, NULL, NULL);
    shrCheckError(errcode, CL_SUCCESS);

    /* 6. Retrieve the result from the device */
    errcode = clEnqueueReadBuffer(clCommandQue, d_C, CL_TRUE, 0, mem_size_C, h_C, 0, NULL, NULL);
    shrCheckError(errcode, CL_SUCCESS);

    /* 7. Clean up memory */
    free(h_A);  free(h_B);  free(h_C);
    clReleaseMemObject(d_A);
    clReleaseMemObject(d_C);
    clReleaseMemObject(d_B);
    free(clDevices);
    free(clMatrixMul);
    clReleaseContext(clGPUContext);
    clReleaseKernel(clKernel);
    clReleaseProgram(clProgram);
    clReleaseCommandQueue(clCommandQue);
}
Almost 100 lines of code – tedious and hard to maintain. It does not take advantage of HSA features, and it will likely need to be changed for OpenCL 2.0.
  • 331. COMPARING SEVERAL HIGH-LEVEL PROGRAMMING INTERFACES
 C++AMP – C++ language extension proposed by Microsoft
 Thrust – CUDA library proposed by NVIDIA
 Bolt – library proposed by AMD
 OpenACC – annotations and pragmas proposed by PGI
 SYCL – C++ wrapper for OpenCL
All these proposals aim to reduce tedious boilerplate code and provide transparent porting to future systems (future proofing). © Copyright 2014 HSA Foundation. All Rights Reserved
  • 332. OPENACC HSA ENABLES SIMPLER IMPLEMENTATION OR BETTER OPTIMIZATION © Copyright 2014 HSA Foundation. All Rights Reserved
  • 333. OPENACC – SIMPLE MATRIX MULTIPLICATION EXAMPLE
void MatrixMulti(float *C, const float *A, const float *B, int hA, int wA, int wB)
{
    #pragma acc parallel loop copyin(A[0:hA*wA]) copyin(B[0:wA*wB]) copyout(C[0:hA*wB])
    for (int i = 0; i < hA; i++) {
        #pragma acc loop
        for (int j = 0; j < wB; j++) {
            float sum = 0;
            for (int k = 0; k < wA; k++) {
                float a = A[i*wA+k];
                float b = B[k*wB+j];
                sum += a*b;
            }
            C[i*wB+j] = sum;   // (was C[i*Nw+j]; Nw is undeclared)
        }
    }
}
 Little host code overhead
 Programmer annotation of the kernel computation (the acc parallel loop / acc loop pragmas)
 Programmer annotation of data movement (the copyin/copyout clauses)
© Copyright 2014 HSA Foundation. All Rights Reserved
  • 334. ADVANTAGE OF HSA FOR OPENACC  Flexibility in copyin and copyout implementation  Flexible code generation for nested acc parallel loops  E.g., inner loop bounds that depend on outer loop iterations  Compiler data affinity optimization (especially OpenACC kernel regions)  The compiler does not have to undo programmer managed data transfers © Copyright 2014 HSA Foundation. All Rights Reserved
  • 335. C++AMP HSA ENABLES EFFICIENT COMPILATION OF AN EVEN HIGHER LEVEL OF PROGRAMMING INTERFACE © Copyright 2014 HSA Foundation. All Rights Reserved
  • 336. C++ AMP ● C++ Accelerated Massive Parallelism ● Designed for data level parallelism ● Extension of C++11 proposed by Microsoft ● An open specification with multiple implementations aiming at standardization ● MS Visual Studio 2013 ● MulticoreWare CLAMP ● GPU data modeled as C++14-like containers for multidimensional arrays ● GPU kernels modeled as C++11 lambda ● Minimal extension to C++ for simplicity and future proofing © Copyright 2014 HSA Foundation. All Rights Reserved
  • 337. MATRIX MULTIPLICATION IN C++AMP
void MultiplyWithAMP(int* aMatrix, int* bMatrix, int* productMatrix,
                     int ha, int hb, int hc) {
    array_view<int, 2> a(ha, hb, aMatrix);
    array_view<int, 2> b(hb, hc, bMatrix);
    array_view<int, 2> product(ha, hc, productMatrix);
    parallel_for_each(
        product.extent,
        [=](index<2> idx) restrict(amp) {
            int row = idx[0];
            int col = idx[1];
            for (int inner = 0; inner < hb; inner++) {  // (was < 2; hb is the inner dimension)
                product[idx] += a(row, inner) * b(inner, col);
            }
        }
    );
    product.synchronize();
}
(Shown side by side with the OpenCL context/queue setup and matrixMul kernel from the earlier OpenCL slides, repeated for comparison.)
© Copyright 2014 HSA Foundation. All Rights Reserved
  • 338. C++AMP PROGRAMMING MODEL
void MultiplyWithAMP(int* aMatrix, int* bMatrix, int* productMatrix) {
    array_view<int, 2> a(3, 2, aMatrix);
    array_view<int, 2> b(2, 3, bMatrix);
    array_view<int, 2> product(3, 3, productMatrix);
    parallel_for_each(
        product.extent,
        [=](index<2> idx) restrict(amp) {
            int row = idx[0];
            int col = idx[1];
            for (int inner = 0; inner < 2; inner++) {
                product[idx] += a(row, inner) * b(inner, col);
            }
        }
    );
    product.synchronize();
}
Callout: GPU data modeled as data containers (array_view).
© Copyright 2014 HSA Foundation. All Rights Reserved
  • 339. C++AMP PROGRAMMING MODEL (same code as the previous slide)  Callout: kernels are modeled as lambdas; arguments are implicitly modeled as captured variables, so the programmer does not need to specify copyin and copyout. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 340. C++AMP PROGRAMMING MODEL (same code as the previous slide)  Callout: parallel_for_each is the execution interface, marking an implicitly parallel region for GPU execution. © Copyright 2014 HSA Foundation. All Rights Reserved
  • 341. MCW C++AMP (CLAMP) ● Runs on Linux and Mac OS X ● Output code compatible with all major OpenCL stacks: AMD, Apple/Intel (OS X), NVIDIA and even POCL ● Clang/LLVM-based, open source o Translate C++AMP code to OpenCL C or OpenCL 1.2 SPIR o With template helper library ● Runtime: OpenCL 1.1/HSA Runtime and GMAC for non-HSA systems ● One of the two C++ AMP implementations recognized by HSA foundation © Copyright 2014 HSA Foundation. All Rights Reserved
  • 342. MCW C++ AMP COMPILER
● Device path
  o generate OpenCL C code and SPIR
  o emit the kernel function
● Host path
  o preparation to launch the code
[Diagram: C++ AMP source code goes through Clang/LLVM 3.3, producing device code and host code]
© Copyright 2014 HSA Foundation. All Rights Reserved
  • 343. TRANSLATION
C++AMP source:
parallel_for_each(product.extent, [=](index<2> idx) restrict(amp) {
    int row = idx[0];
    int col = idx[1];
    for (int inner = 0; inner < 2; inner++) {
        product[idx] += a(row, inner) * b(inner, col);
    }
});
Generated OpenCL kernel:
__kernel void matrixMul(__global float* C, __global float* A, __global float* B,
                        int wA, int wB) {
    int tx = get_global_id(0);
    int ty = get_global_id(1);
    float value = 0;
    for (int k = 0; k < wA; ++k) {
        float elementA = A[ty * wA + k];
        float elementB = B[k * wB + tx];
        value += elementA * elementB;
    }
    C[ty * wB + tx] = value;
}
● Append the arguments
● Set the index
● Emit the kernel function
● Implicit memory management
© Copyright 2014 HSA Foundation. All Rights Reserved
  • 344. EXECUTION ON NON-HSA OPENCL PLATFORMS [Diagram: C++ AMP source code is compiled by Clang/LLVM 3.3 into device code and host code; at runtime the host code runs on GMAC and OpenCL. “Our work” marks the compiler path and the GMAC layer.] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 345. GMAC
● Unified virtual address space in software
● Can have high overhead sometimes
● In HSA (e.g., AMD Kaveri), GMAC is no longer needed
Gelado, et al., ASPLOS 2010
© Copyright 2014 HSA Foundation. All Rights Reserved
  • 346. CASE STUDY: BINOMIAL OPTION PRICING  Lines of code [Chart: lines of code counted by cloc, split into host and kernel portions, for C++AMP vs. OpenCL] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 347. PERFORMANCE ON NON-HSA SYSTEMS – BINOMIAL OPTION PRICING [Chart: time in seconds for total GPU time and kernel-only time on an NV Tesla C2050, comparing OpenCL and C++AMP] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 348. EXECUTION ON HSA [Diagram: at compile time, C++ AMP source code goes through Clang/LLVM 3.3 to produce device SPIR and host SPIR; at runtime both execute on the HSA Runtime] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 349. WHAT DO WE NEED TO DO?
● Kernel function
  o emit the kernel function with the required arguments
● On the host side
  o a function that recursively traverses the captured object and appends the arguments to the OpenCL stack
● On the device side
  o reconstruct the object in the device code for future use
© Copyright 2014 HSA Foundation. All Rights Reserved
  • 350. WHY COMPILING C++AMP TO OPENCL IS NOT TRIVIAL
● C++AMP → LLVM IR → OpenCL C or SPIR
● Argument passing (lambda capture vs. function calls)
● Explicit vs. implicit memory transfer
● The heavy lifting is done by the compiler and the runtime
© Copyright 2014 HSA Foundation. All Rights Reserved
  • 351. EXAMPLE
struct A { int a; };
struct B : A { int b; };
struct C { B b; int c; };

struct C c;
c.c = 100;
auto fn = [=] () { int qq = c.c; };
© Copyright 2014 HSA Foundation. All Rights Reserved
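Capturing c by value drags the whole nested object into the lambda, but OpenCL kernels take flat argument lists. A hypothetical illustration of the flattening the compiler must perform (traversal order and naming are assumptions; CLAMP's actual generated code is not shown in this deck):

#include <cstdint>

struct A { int a; };
struct B : A { int b; };
struct C { B b; int c; };

// Host side: recursively traverse the captured object and hand each
// leaf to the argument-appending callback, one kernel argument per leaf.
// (Illustrative; the real compiler derives this walk from the Clang AST.)
template <typename SetArg>   // SetArg: void(int argIndex, int value)
void append_args(const C& c, SetArg setArg) {
    int i = 0;
    setArg(i++, c.b.a);   // A::a (base subobject of B)
    setArg(i++, c.b.b);   // B::b
    setArg(i++, c.c);     // C::c
}

// The device side would declare matching scalar parameters, e.g.
//   __kernel void fn(int c_b_a, int c_b_b, int c_c)
// and reconstruct a struct C from them before running the lambda body.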
  • 352. TRANSLATION parallel_for_each(product.extent, [=](index<2> idx) restrict(amp) { int row = idx[0]; int col = idx[1]; for (int inner = 0; inner < 2; inner++) { product[idx] += a(row, inner) * b(inner, col); } }); __kernel void matrixMul(__global float* C, __global float* A, __global float* B, int wA, int wB){ int tx = get_global_id(0); int ty = get_global_id(1); float value = 0; for (int k = 0; k < wA; ++k) { float elementA = A[ty * wA + k]; float elementB = B[k * wB + tx]; value += elementA * elementB; } C[ty * wA + tx] = value;} ● Compiler ● Turn captured variables into OpenCL arguments ● Populate the index<N> in OCL kernel ● Runtime ● Implicit memory management © Copyright 2014 HSA Foundation. All Rights Reserved
  • 353. QUESTIONS? © Copyright 2014 HSA Foundation. All Rights Reserved