HSA From A Software Perspective

HSA (HETEROGENEOUS SYSTEM ARCHITECTURE) FROM A SOFTWARE PERSPECTIVE
OCT 2013
GARY FROST, AMD SOFTWARE FELLOW

HSA FOUNDATION
Founded in June 2012
Developing a new platform for heterogeneous systems
www.hsafoundation.com
Specifications under development in working groups
Our first specification, the HSA Programmer's Reference Manual, is already published and available on our web site
Additional specifications for System Architecture, Runtime Software and Tools are in process
© Copyright 2013 HSA Foundation. All Rights Reserved.

HSA FOUNDATION MEMBERSHIP — AUGUST 2013
Founders
Promoters
Supporters
Contributors
Academic
Associates

SOCS HAVE PROLIFERATED — MAKE THEM BETTER
SOCs have arrived and are a tremendous advance over previous platforms
SOCs combine CPU cores, GPU cores and other accelerators, with high bandwidth access to memory
How can we make them even better?
Easier to program
Easier to optimize
Higher performance
Lower power
HSA unites accelerators architecturally
Early focus on the GPU compute accelerator, but HSA goes well beyond the GPU

INFLECTIONS IN PROCESSOR DESIGN
[Figure: three performance-versus-time curves, one per era, each marked "we are here"]
Single-Core Era
Single-thread Performance over Time
Enabled by: Moore's Law, Voltage Scaling
Constrained by: Power, Complexity
Programming: Assembly, C/C++, Java …
Multi-Core Era
Throughput Performance over Time (# of processors)
Enabled by: Moore's Law, SMP architecture
Constrained by: Power, Parallel SW, Scalability
Programming: pthreads, OpenMP / TBB …
Heterogeneous Systems Era
Modern Application Performance over Time (Data-parallel exploitation)
Enabled by: Abundant data parallelism, Power efficient GPUs
Temporarily Constrained by: Programming models, Comm. overhead
Programming: Shader, CUDA, OpenCL, C++ and Java

HIGH LEVEL FEATURES OF HSA
Features currently being defined in the HSA Working Groups**
Unified addressing across all processors
Operation into pageable system memory
Full memory coherency
User mode dispatch
Architected queuing language
High level language support for GPU compute processors
Preemption and context switching
** All features subject to change, pending completion and ratification of specifications in the HSA Working Groups

HSA: AN OPEN PLATFORM
Open Architecture, membership open to all
HSA Programmer's Reference Manual
HSA System Architecture
HSA Runtime
Delivered via royalty free standards
Royalty Free IP, Specifications and APIs
ISA agnostic for both CPU and GPU
Membership from all areas of computing
Hardware companies
Operating Systems
Tools and Middleware

HSA MEMORY MODEL
Defines visibility ordering between all threads in the HSA System
Designed to be compatible with C++11, Java, OpenCL and .NET Memory Models
Relaxed consistency memory model for parallel compute performance
Visibility controlled by:
Load.Acquire
Store.Release
Barriers
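As a rough illustration of the load-acquire / store-release pairing listed above, here is a minimal Java sketch. It uses the acquire and release access modes that `AtomicBoolean` offers from Java 9 onward, which is only an analogy for the HSA operations and not HSA code. The class and method names (`AcquireReleaseDemo`, `produce`, `consume`, `runOnce`) are made up for this example.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Message passing with release/acquire ordering (Java 9+ access modes).
// This is an analogy for Store.Release / Load.Acquire, not HSA code.
class AcquireReleaseDemo {
    static int payload;                                   // plain, non-volatile data
    static final AtomicBoolean ready = new AtomicBoolean(false);

    // Producer: write the data first, then publish it with a store-release.
    static void produce(int value) {
        payload = value;
        ready.setRelease(true);
    }

    // Consumer: spin on a load-acquire; once it observes true,
    // the earlier write to payload is guaranteed to be visible.
    static int consume() {
        while (!ready.getAcquire()) {
            Thread.onSpinWait();
        }
        return payload;
    }

    // Runs one producer/consumer exchange and returns what the consumer saw.
    static int runOnce(int value) {
        ready.setRelease(false);
        int[] seen = new int[1];
        Thread consumer = new Thread(() -> seen[0] = consume());
        consumer.start();
        produce(value);
        try {
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }
}
```

The point of the pairing is that only the flag needs special ordering; the payload itself is an ordinary write, which is what lets a relaxed model keep most memory traffic cheap.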

HSA QUEUING MODEL
User mode queuing for low latency dispatch
Application dispatches directly
No OS or driver in the dispatch path
Architected Queuing Language (AQL)
Single compute dispatch path for all hardware
No driver translation, direct to hardware
Allows for dispatch to queue from any agent
CPU or GPU
GPU self enqueue enables lots of solutions
Recursion
Tree traversal
Wavefront reforming
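The idea of a user-mode queue can be sketched in plain Java as a ring buffer that the application fills and a consumer drains. This is a toy model under stated assumptions: the class `UserModeQueue`, its packet type (`Runnable`) and its index handling are invented for illustration and do not reflect the real packet layout or doorbell mechanism, which the HSA specifications define.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical user-mode dispatch queue: a power-of-two ring buffer with one
// producer at a time and one consumer standing in for the device.
class UserModeQueue {
    private final Runnable[] packets;
    private final int mask;
    private final AtomicLong writeIndex = new AtomicLong(); // next slot to fill
    private final AtomicLong readIndex = new AtomicLong();  // next slot to run

    UserModeQueue(int capacityPowerOfTwo) {
        packets = new Runnable[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
    }

    // Producer side: write the packet, then publish it by advancing the
    // write index. No system call or driver is involved in this path.
    boolean enqueue(Runnable kernel) {
        long w = writeIndex.get();
        if (w - readIndex.getAcquire() == packets.length) {
            return false;                        // queue is full
        }
        packets[(int) (w & mask)] = kernel;
        writeIndex.setRelease(w + 1);            // stands in for the doorbell
        return true;
    }

    // Consumer side: run every published packet and report how many ran.
    // The write index is re-read each iteration, so packets enqueued by a
    // running packet ("self enqueue") are picked up in the same drain.
    int drain() {
        int executed = 0;
        long r = readIndex.get();
        while (r < writeIndex.getAcquire()) {
            packets[(int) (r & mask)].run();
            readIndex.setRelease(++r);
            executed++;
        }
        return executed;
    }
}
```

A packet that calls `enqueue` on its own queue while it runs models the self-enqueue case, which is how recursion or tree traversal could be expressed without returning to the host.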

HSA INTERMEDIATE LAYER: HSAIL
HSAIL is a virtual ISA for parallel programs
Finalized to ISA by a JIT compiler or “Finalizer”
ISA independent by design for CPU & GPU
Explicitly parallel
Designed for data parallel programming
Support for exceptions, virtual functions, and other high level language features
Lower level than OpenCL SPIR
Fits naturally in the OpenCL compilation stack
Suitable to support additional high level languages and programming models:
Java, C++, OpenMP, etc

HSAIL INSTRUCTION SET: OVERVIEW
Similar to assembly language for a RISC CPU
Load-store architecture
ld_global_u64 $d0, [$d6 + 120]; // $d0 = load($d6 + 120)
add_u64 $d1, $d2, 24;           // $d1 = $d2 + 24
136 opcodes (Java™ bytecode has 200)
Floating point (single, double, half (f16))
Integer (32-bit, 64-bit)
Some packed operations
Branches
Function calls
Platform Atomic Operations: and, or, xor, exch, add, sub, inc, dec, max, min, cas
Synchronize host CPU and HSA Component!
Text and Binary formats (“BRIG”)
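To show how the listed atomics relate to each other, the following Java sketch builds an atomic max out of a compare-and-swap loop. It runs on the CPU with `AtomicInteger` and is only an analogy for the platform atomics; the class name `CasMax` is an assumption of this example.

```java
import java.util.concurrent.atomic.AtomicInteger;

class CasMax {
    // Atomic max built from cas: retry until the candidate is installed or
    // another thread has already stored something at least as large.
    // Returns the value that was in the cell before this operation.
    static int atomicMax(AtomicInteger cell, int candidate) {
        int current = cell.get();
        while (candidate > current && !cell.compareAndSet(current, candidate)) {
            current = cell.get();                // lost the race, look again
        }
        return current;
    }
}
```

When the same operation is coherent between host and device, a CPU thread and a GPU work-item can update one shared cell this way without copying it back and forth.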

SEGMENTS AND MEMORY (1/2)
7 segments of memory
global, readonly, group, spill, private, arg, kernarg
Memory instructions can (optionally) specify a segment
Global Segment
Visible to all HSA agents (including host CPU)
Group Segment
Provides high-performance memory shared in the work-group.
Group memory can be read and written by any work-item in the work-group
HSAIL provides sync operations to control visibility of group memory
Useful for expert programmers
Spill, Private, Arg Segments
Represent different regions of a per-work-item stack
Typically generated by compiler, not specified by programmer
Compiler can use these to convey intent, e.g. spills
ld_global_u64 $d0, [$d6];
ld_group_u64 $d0, [$d6 + 24];
st_spill_f32 $s1, [$d6 + 4];

SEGMENTS AND MEMORY (2/2)
Kernarg Segment
Programmer writes kernarg segment to pass arguments to a kernel
Read-Only Segment
Remains constant during execution of kernel
Flat Addressing
Each segment mapped into virtual address space
Flat addresses can map to segments based on virtual address
Instructions with no explicit segment use flat addressing
Very useful for high-level language support (e.g. classes, libraries)
Aligns well with OpenCL 2.0 "generic" addressing feature
ld_kernarg_u64 $d6, [%_arg0];
ld_u64 $d0, [$d6 + 24]; // flat
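One way to picture "flat addresses map to segments based on virtual address" is a range check on the address. The sketch below is purely illustrative: the window bases and sizes, the `FlatAddressSketch` class and the choice of three segments are all hypothetical, and a real implementation picks its own layout.

```java
class FlatAddressSketch {
    enum Segment { GLOBAL, GROUP, PRIVATE }

    // Hypothetical address windows, chosen only for this illustration.
    static final long GROUP_BASE = 0x1000_0000L;
    static final long GROUP_SIZE = 0x0001_0000L;
    static final long PRIVATE_BASE = 0x2000_0000L;
    static final long PRIVATE_SIZE = 0x0001_0000L;

    // Decide which segment a flat address falls into; anything outside the
    // group and private windows is treated as global memory.
    static Segment resolve(long flatAddress) {
        if (inWindow(flatAddress, GROUP_BASE, GROUP_SIZE)) return Segment.GROUP;
        if (inWindow(flatAddress, PRIVATE_BASE, PRIVATE_SIZE)) return Segment.PRIVATE;
        return Segment.GLOBAL;
    }

    private static boolean inWindow(long address, long base, long size) {
        return address >= base && address < base + size;
    }
}
```

Because the decision depends only on the address, a compiler for a high-level language can emit one kind of load for every pointer, which is what makes flat addressing convenient for classes and libraries.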

REGISTERS
Four classes of registers
C: 1-bit, Control Registers
S: 32-bit, Single-precision FP or Int
D: 64-bit, Double-precision FP or Long Int
Q: 128-bit, Packed data.
Fixed number of registers:
8 C
S, D, Q share a single pool of resources
S + 2*D + 4*Q <= 128
Up to 128 S or 64 D or 32 Q (or a blend)
Register allocation done in high-level compiler
Finalizer doesn’t have to perform expensive register allocation
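The shared-pool rule above is simple arithmetic, so a high-level compiler could check a candidate allocation with something like the following Java sketch. The class and method names are invented for this example; only the formula S + 2*D + 4*Q <= 128 comes from the slide.

```java
class RegisterBudget {
    static final int POOL = 128;   // pool size, counted in 32-bit (S) units

    // Pool usage for a kernel that needs s S-registers, d D-registers
    // and q Q-registers.
    static int used(int s, int d, int q) {
        return s + 2 * d + 4 * q;
    }

    // True when the requested blend of registers fits in the shared pool.
    static boolean fits(int s, int d, int q) {
        if (s < 0 || d < 0 || q < 0) {
            throw new IllegalArgumentException("register counts must be non-negative");
        }
        return used(s, d, q) <= POOL;
    }
}
```

For instance, 64 S plus 16 D plus 8 Q registers uses 64 + 32 + 32 = 128 units and just fits, while one more Q register would not.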

SIMT EXECUTION MODEL
HSAIL Presents a “SIMT” execution model to the programmer
“Single Instruction, Multiple Thread”
Programmer writes program for a single thread of execution
Each work-item appears to have its own program counter
Branch instructions look natural
Hardware Implementation
Most hardware uses SIMD (Single-Instruction Multiple Data) vectors for efficiency
Actually one program counter for the entire SIMD instruction
Branches implemented with predication
SIMT Advantages
Easier to program (branch code in particular)
Natural path for mainstream programming models
Scales across a wide variety of hardware (programmer doesn’t see vector width)
Cross-lane operations available for those who want peak performance

WAVEFRONTS
Hardware SIMD vector, composed of 1, 2, 4, 8, 16, 32, or 64 “lanes”
Lanes in a wavefront can be "active" or "inactive"
Inactive lanes consume hardware resources but don’t do useful work
Tradeoffs
“Wavefront-aware” programming can be useful for peak performance
But results in less portable code (since wavefront width is encoded in algorithm)
if (cond) {
operationA; // cond=True lanes active here
} else {
operationB; // cond=False lanes active here
}
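The if/else above can be simulated in Java to show what predication means for a wavefront: both sides of the branch are issued for the whole vector, and a per-lane mask decides which lanes commit results. This is a simplified model, not a description of any particular GPU. The stand-in operations (doubling for operationA, adding 100 for operationB) and the assumption that a side with no active lanes is skipped are choices made for this sketch.

```java
class PredicatedBranch {
    // Executes "if (cond) operationA else operationB" for all lanes of one
    // wavefront. Each side is issued once, under its own lane mask.
    static int[] run(boolean[] cond, int[] input) {
        int[] out = new int[input.length];
        // Pass 1: "then" side, only lanes with cond == true are active.
        for (int lane = 0; lane < input.length; lane++) {
            if (cond[lane]) out[lane] = input[lane] * 2;      // operationA
        }
        // Pass 2: "else" side, only lanes with cond == false are active.
        for (int lane = 0; lane < input.length; lane++) {
            if (!cond[lane]) out[lane] = input[lane] + 100;   // operationB
        }
        return out;
    }

    // Number of sides that must be issued: 2 when the lanes diverge,
    // 1 when every lane takes the same side (assumed skip of empty sides).
    static int passesIssued(boolean[] cond) {
        boolean anyTrue = false;
        boolean anyFalse = false;
        for (boolean c : cond) {
            if (c) anyTrue = true; else anyFalse = true;
        }
        return (anyTrue ? 1 : 0) + (anyFalse ? 1 : 0);
    }
}
```

With a divergent condition the wavefront spends two passes while each lane does useful work in only one of them, which is the cost that inactive lanes represent.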

HSA ENABLEMENT OF LANGUAGES, FRAMEWORKS, LIBRARIES AND RUNTIMES

HSA AND OPENCL™
HSA is an optimized platform architecture for OpenCL™
Not an alternative to OpenCL™
OpenCL™ on HSA will benefit from
Avoidance of wasteful copies
Low latency dispatch
Improved memory model
Pointers shared between CPU and GPU
OpenCL™ 2.0 shows considerable alignment with HSA
Many HSA member companies are also active with Khronos in the OpenCL™ working group

HSA AND OPENMP®
OpenMP®
Established
Portable
Scalable (desktop to supercomputer)
Simple
Flexible
HSA enablement brings:
GPU performance
Energy efficiency
…to established developer community

BOLT : A C++ PARALLEL PRIMITIVES LIBRARY FOR HSA
Allow C++ developers to leverage the power efficiency of GPU computing
Common routines such as scan, sort, reduce, transform
More advanced routines like heterogeneous pipelines
Bolt library works with OpenCL and C++ AMP
Enjoy the unique advantages of the HSA platform
Move the computation not the data
A single source code base for the CPU and GPU!
Developers can focus on core algorithms
https://github.com/HSA-Libraries/Bolt

HSA ENABLEMENT OF JAVA™
Why Java™?
9 Million Developers
1 Billion Java downloads per year
97% of Enterprise desktops run Java
100% of Blu-ray players ship with Java
http://oracle.com.edgesuite.net/timeline/java/
Java™ 8 language/libraries include concurrency features
primitives (threads, locks, monitors, atomic ops)
libraries (fork/join, thread pools, executors, futures)
support for 'lambda' based Stream APIs
JIT (Just In Time) architecture ideal for generating and executing HSAIL.
Project 'Sumatra' targets GPU JIT generation/execution in the 2015 Java™ 9 timeframe.

APARAPI: INITIAL JAVA ENABLEMENT (2011)
Aparapi API for expressing data parallel workloads
Developer uses common Java™ patterns and idioms
Java source compiled to bytecode using the standard compiler (javac)
Aparapi runtime capable of converting bytecode to OpenCL™
Execution on OpenCL™ 1.1+ capable devices (GPUs/APUs)
OR
Execute via a Java thread pool if OpenCL™ is not available
Open Source project
~20 contributors
>7000 downloads
~150 visits per day
[Diagram: software stack showing Java Application, APARAPI API, JVM, OpenCL Source, OpenCL Runtime, CPU ISA, GPU ISA, CPU, GPU]

SUMATRA PROJECT (JAVA 9 2015)
AMD/Oracle sponsored Open Source (OpenJDK) project
Targeted at Java 9 (2015 release)
Allow developers to efficiently represent data parallel algorithms in Java using Stream API + Lambda expressions
Sumatra is not pushing a new 'programming model'
Instead we 'repurpose' Java 8's new Stream API/Lambda to enable both CPU and GPU computing
A Sumatra enabled Java Virtual Machine will dispatch 'selected' constructs to HSA enabled devices at runtime.
Developers are already refactoring the JDK to use stream + lambda
So anyone using the existing JDK should see GPU acceleration without any code changes.
http://openjdk.java.net/projects/sumatra/
https://wikis.oracle.com/display/HotSpotInternals/Sumatra
http://mail.openjdk.java.net/pipermail/sumatra-dev/
[Diagram: software stack showing Java Application, Java JDK Stream + Lambda API, JVM, Java GRAAL JIT backend, HSAIL, HSA Finalizer & Runtime, CPU ISA, GPU ISA, CPU, GPU]

HSA ENABLEMENT OF JAVA
Java 7: OpenCL enabled Aparapi
• AMD initiated Open Source project
• APIs for data parallel algorithms
  GPU accelerate Java applications
  No need to learn OpenCL
• Active community captured mindshare
  ~20 contributors
  >7000 downloads
  ~150 visits per day
[Diagram: Java Application, APARAPI API, JVM, OpenCL Source, OpenCL Runtime, CPU ISA, GPU ISA, CPU, GPU]
Java 8: HSA enabled Aparapi
• Java 8 brings Stream + Lambda API.
  More natural way of expressing data parallel algorithms
  Initially targeted at multi-core.
• APARAPI will:
  Support Java 8 Lambdas
  Dispatch code to HSA enabled devices at runtime via HSAIL
[Diagram: Java Application, APARAPI + Lambda API, JVM, HSAIL, HSA Finalizer & Runtime, CPU ISA, GPU ISA, CPU, GPU]
Java 9: HSA enabled Java (Sumatra)
• Adds native GPU compute support to Java Virtual Machine (JVM)
• Developer uses JDK provided Lambda + Stream API
• JVM uses GRAAL compiler to generate HSAIL
• JVM decides at runtime to execute on either CPU or GPU depending on workload characteristics.
[Diagram: Java Application, Java JDK Stream + Lambda API, JVM, Java GRAAL JIT backend, HSAIL, HSA Finalizer & Runtime, CPU ISA, GPU ISA, CPU, GPU]
We plan to provide HSA Enabled Aparapi (Java 8) as a bridge technology between OpenCL based Aparapi (Java 7) and HSA Enabled Sumatra (Java 9)

A JAVA EXAMPLE
Player[] allPlayers = … // Code to initialize array of Players omitted

// HSA enabled Sumatra
Arrays.stream(allPlayers).parallel().forEach(p -> {
    int teamScores = p.getTeam().getScores();
    float pctOfTeamScores = (float) p.getScores() / (float) teamScores;
    p.setPctOfTeamScores(pctOfTeamScores);
});

// HSA enabled Aparapi: same lambda body, different entry point
Device.hsa().forEach(allPlayers, p -> {
    int teamScores = p.getTeam().getScores();
    float pctOfTeamScores = (float) p.getScores() / (float) teamScores;
    p.setPctOfTeamScores(pctOfTeamScores);
});

class Team {
    private int scores;

    public int getScores() {
        return scores;
    }
} // Setters omitted for brevity

class Player {
    private Team team;
    private int scores;
    private float pctOfTeamScores;

    public Team getTeam() {
        return team;
    }

    public int getScores() {
        return scores;
    }

    // Takes a float: the value stored is a fraction, not an integer
    public void setPctOfTeamScores(float pct) {
        pctOfTeamScores = pct;
    }
} // Setters omitted for brevity
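For readers who want to run the example on an ordinary JVM, here is a self-contained version of the Stream + Lambda variant. It executes on CPU threads only; nothing in it dispatches to a GPU. The wrapper class `PlayerDemo`, the constructors and the `getPctOfTeamScores` getter are additions made so the sketch compiles and can be checked.

```java
import java.util.Arrays;

class PlayerDemo {
    static class Team {
        private final int scores;
        Team(int scores) { this.scores = scores; }
        int getScores() { return scores; }
    }

    static class Player {
        private final Team team;
        private final int scores;
        private float pctOfTeamScores;

        Player(Team team, int scores) {
            this.team = team;
            this.scores = scores;
        }
        Team getTeam() { return team; }
        int getScores() { return scores; }
        float getPctOfTeamScores() { return pctOfTeamScores; }
        void setPctOfTeamScores(float pct) { pctOfTeamScores = pct; }
    }

    // Same lambda body as the slide: each player is processed independently,
    // which is what makes the loop a candidate for data-parallel execution.
    static void computeShares(Player[] allPlayers) {
        Arrays.stream(allPlayers).parallel().forEach(p -> {
            int teamScores = p.getTeam().getScores();
            float pctOfTeamScores = (float) p.getScores() / (float) teamScores;
            p.setPctOfTeamScores(pctOfTeamScores);
        });
    }
}
```

Note that the lambda chases pointers (player to team) through ordinary heap objects, which is the access pattern the shared virtual memory of HSA is meant to allow on the GPU.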

HSAIL CODE EXAMPLE (SUMATRA)
version 0:95: $full : $large;
// We pass underlying array of Players to the kernel
kernel &run (
    kernarg_u64 %_arg0                 // Array of players passed as arg
) {
    ld_kernarg_u64 $d6, [%_arg0];      // Move arg to an HSAIL register
    workitemabsid_u32 $s2, 0;          // Read the work-item global id (gid)

    cvt_u64_s32 $d2, $s2;              // Convert gid to long
    mul_u64 $d2, $d2, 8;               // Stride of 8 bytes per array element (an object reference)
    add_u64 $d2, $d2, 24;              // Skip array object header (24 bytes)
    add_u64 $d2, $d2, $d6;             // $d2 now points to players[gid]
    ld_global_u64 $d6, [$d2];          // Load Player p from players[gid]
    ld_global_u64 $d0, [$d6 + 120];    // p.getTeam() inlined
    ld_global_s32 $s3, [$d6 + 40];     // p.getScores() inlined
    cvt_f32_s32 $s16, $s3;             // cast to (float)
    ld_global_s32 $s0, [$d0 + 24];     // Team getScores() inlined
    cvt_f32_s32 $s17, $s0;             // cast to (float)
    div_f32 $s16, $s16, $s17;          // p.getScores() / teamScores inlined
    st_global_f32 $s16, [$d6 + 100];   // p.setPctOfTeamScores() inlined
    ret;
}
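The address arithmetic in the listing (convert gid, multiply by 8, add 24, add the array base) can be written out as a one-line formula. The Java sketch below restates it; the constants are the ones that appear in the listing, while the class name and the sample base address used in the note are made up for illustration.

```java
class PlayerArrayAddressing {
    static final long ARRAY_HEADER_BYTES = 24; // header skipped in the listing
    static final long ELEMENT_STRIDE_BYTES = 8; // multiplier used in the listing

    // Address of players[gid], given the address of the array object itself.
    static long elementAddress(long arrayBase, int gid) {
        return arrayBase + ARRAY_HEADER_BYTES + (long) gid * ELEMENT_STRIDE_BYTES;
    }
}
```

With a hypothetical array base of 0x1000, work-item 0 would load from 0x1018 and work-item 3 from 0x1030. The field offsets used afterwards (120, 40, 100) depend on the object layout the JVM chose and are not derivable from the Java source alone.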

A CASE STUDY CENTERED ON NBODY
A Java developer implementing NBody would probably…
// Create a class to represent each body
class Body {
    float x, y, z, m, vx, vy, vz;

    // Include method to update position and display
    void updateAndShow(Screen screen, Body[] bodies) {
        // omitted vars for accumulating forces
        for (Body other : bodies) {
            // accumulate forces between other and this
        }
        // update vx, vy, vz, x, y and z from accumulated data
        screen.paint(x, y, z);
    }
}

// Assuming bodies[] is an initialized array of Body
// We can update and display each one in turn
for (Body b : bodies)
    b.updateAndShow(screen, bodies);

WITHOUT HSA WE CAN'T USE OBJECTS
Java does not guarantee contiguous allocation of objects in arrays
Only arrays of primitives (long, float etc.) are allocated contiguously
Non HSA enabled Java GPU frameworks force developers to either
Abandon Object Oriented solutions and revert to parallel primitive arrays
Or…
Add scatter/gather (costly copies) behind the scenes
// Create and populate parallel arrays of primitives
float x[], y[], z[], m[], vx[], vy[], vz[];
// Treat x[n], y[n], z[n] etc. as the state of Body[n]
Kernel k = new Kernel() {
    @Override public void run() {
        int i = getGlobalId(); // index of the body this work-item updates
        // vars for accumulating state not shown
        for (int j = 0; j < bodies; j++) { // bodies is the number of bodies
            // accumulate forces between (x,y,z)[j] and (x,y,z)[i]
        }
        // update vx[i], vy[i], vz[i], x[i], y[i] and z[i]
    }
};
k.execute(bodies);
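The scatter/gather option named on this slide amounts to copying object fields into contiguous primitive arrays before the kernel runs and copying results back afterwards. A minimal Java sketch of those two copies follows; the `GatherScatter` class and its reduced `Body` (positions only) are assumptions made to keep the example short.

```java
class GatherScatter {
    static class Body {
        float x, y, z;
        Body(float x, float y, float z) { this.x = x; this.y = y; this.z = z; }
    }

    // Gather: copy fields out of heap objects into contiguous arrays
    // that a device without shared virtual memory could consume.
    static float[][] gather(Body[] bodies) {
        int n = bodies.length;
        float[] x = new float[n];
        float[] y = new float[n];
        float[] z = new float[n];
        for (int i = 0; i < n; i++) {
            x[i] = bodies[i].x;
            y[i] = bodies[i].y;
            z[i] = bodies[i].z;
        }
        return new float[][] { x, y, z };
    }

    // Scatter: copy the (possibly updated) arrays back into the objects.
    static void scatter(float[][] xyz, Body[] bodies) {
        for (int i = 0; i < bodies.length; i++) {
            bodies[i].x = xyz[0][i];
            bodies[i].y = xyz[1][i];
            bodies[i].z = xyz[2][i];
        }
    }
}
```

Both loops touch every body on every dispatch, so their cost grows with the data set even when the kernel itself does little work. That overhead is what the slide calls costly copies.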

HSA ENABLEMENT ALLOWS NATURAL JAVA REPRESENTATIONS
HSA versions of Aparapi and Sumatra can deal with Java objects
class Body {
    float x, y, z, m, vx, vy, vz;

    void updateAndShow(Screen screen, Body[] bodies) {
        // hidden vars for accumulating forces
        for (Body other : bodies) {
            // accumulate forces between other and this
        }
        // update vx, vy, vz, x, y and z from accumulated data
        screen.paint(x, y, z);
    }
}
Then loop over the array, updating and displaying the bodies.
// Sumatra solution
Arrays.stream(bodies).parallel().forEach(b -> {
    b.updateAndShow(screen, bodies);
});
// HSA enabled Aparapi solution
Device.hsa().forEach(bodies, b -> {
    b.updateAndShow(screen, bodies);
});
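A runnable approximation of the object-based NBody loop, using only the standard Java 8 Stream API on the CPU, might look like the sketch below. Several things here are assumptions rather than part of the slides: the update is split into a velocity phase and a position phase so that no body is read while another thread moves it, the gravitational constant is taken as 1, a small softening term avoids division by zero, and the display call is left out.

```java
import java.util.Arrays;

class NBodyObjects {
    static final float SOFTENING = 1e-3f;   // assumed, avoids division by zero

    static class Body {
        float x, y, z, m, vx, vy, vz;

        Body(float x, float y, float z, float m) {
            this.x = x; this.y = y; this.z = z; this.m = m;
        }

        // Phase 1: read other bodies' positions and masses, write only
        // this body's velocity.
        void updateVelocity(Body[] bodies, float dt) {
            float ax = 0f, ay = 0f, az = 0f;
            for (Body other : bodies) {
                if (other == this) continue;
                float dx = other.x - x;
                float dy = other.y - y;
                float dz = other.z - z;
                float distSq = dx * dx + dy * dy + dz * dz + SOFTENING;
                float invDist = (float) (1.0 / Math.sqrt(distSq));
                float s = other.m * invDist * invDist * invDist;
                ax += dx * s;
                ay += dy * s;
                az += dz * s;
            }
            vx += ax * dt;
            vy += ay * dt;
            vz += az * dt;
        }

        // Phase 2: move this body using its own velocity only.
        void updatePosition(float dt) {
            x += vx * dt;
            y += vy * dt;
            z += vz * dt;
        }
    }

    // One simulation step over an array of objects, no parallel arrays needed.
    static void step(Body[] bodies, float dt) {
        Arrays.stream(bodies).parallel().forEach(b -> b.updateVelocity(bodies, dt));
        Arrays.stream(bodies).parallel().forEach(b -> b.updatePosition(dt));
    }
}
```

The data stays in `Body` objects throughout; the only change from the sequential loop is the call to `parallel()`, which is the property the Sumatra approach relies on.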

SUMATRA + HSA ENABLED APARAPI PERFORMANCE
NBody implemented as an array of Objects.
On early access HSA enabled hardware and software.
[Chart: Intra-Body Interactions per microsecond (higher is better) against Number of bodies. Data labels: 12.3x perf (1.48x power), 10.6x perf (1.44x power), 7.9x perf (1.35x power)]

HSA ENABLEMENT OF JVM CAN ACCELERATE OTHER JVM BASED LANGUAGES
Java 9: 3Q2015
HSA enabled Java (Sumatra)
• Adds native GPU compute support to Java Virtual Machine (JVM)
• Developer uses JDK provided Lambda + Stream API
• JVM uses GRAAL compiler to generate HSAIL
• JVM decides at runtime to execute on either CPU or GPU depending on workload characteristics.
[Diagram: Java Application, Java JDK Stream + Lambda API, JVM, HSAIL, CPU ISA, GPU ISA, CPU, GPU]
Java 9 + 2016?
HSA enablement of other JVM based languages/frameworks
• Developer uses their preferred Truffle based language (R, JavaScript, Python, Ruby etc.)
• JVM uses Truffle + GRAAL compiler to generate HSAIL
• HSA acceleration beyond Java
[Diagram: R, JavaScript, Ruby and Python apps, Truffle, JVM, HSAIL, CPU ISA, GPU ISA, CPU, GPU]

TAKEAWAYS
HSA brings GPU computing to mainstream programming models
Open standard for emerging parallel compute platforms
Shared and coherent memory bridges “faraway accelerator” gap
HSAIL provides the common IL for high-level languages to benefit from parallel computing
HSAIL Key Points
Thin, robust, fast finalizer
Portable (multiple HW vendors and parallel architectures)
Supports shared virtual memory and platform atomics
Java Enablement
Can access Objects on Java’s heap thanks to ‘Shared Virtual Memory’
Leverages Java 8 Lambda and Stream APIs intended for multicore
Gateway to enabling other JVM based languages.

TOOLS ARE AVAILABLE NOW
HSA Programmer’s Reference Manual: HSAIL Virtual ISA and Programming Model, Compiler Writer’s Guide, and Object Format (BRIG)
http://hsafoundation.com/standards/
https://hsafoundation.box.com/s/m6mrsjv8b7r50kqeyyal
Tools now at GitHub: HSA Foundation
libHSA Assembler and Disassembler
https://github.com/HSAFoundation/HSAIL-Tools
HSAIL Instruction Set Simulator
https://github.com/HSAFoundation/HSAIL-Instruction-Set-Simulator
Soon: LLVM Compilation stack which outputs HSAIL and BRIG
Java enablement via HSAIL (preliminary)
http://openjdk.java.net/projects/sumatra/
http://openjdk.java.net/projects/graal/
http://aparapi.googlecode.com/
