HSA (HETEROGENEOUS SYSTEM ARCHITECTURE) FROM A SOFTWARE PERSPECTIVE 
OCT 2013 
GARY FROST, AMD SOFTWARE FELLOW
HSA FOUNDATION 
Founded in June 2012 
Developing a new platform for heterogeneous systems 
www.hsafoundation.com 
Specifications under development in working groups 
Our first specification, the HSA Programmer’s Reference Manual, is already published and available on our web site
Additional specifications for System Architecture, Runtime Software and Tools are in process 
© Copyright 2013 HSA Foundation. All Rights Reserved. 2
HSA FOUNDATION MEMBERSHIP — AUGUST 2013 
Founders 
Promoters 
Supporters 
Contributors 
Academic 
Associates
SOCS HAVE PROLIFERATED — MAKE THEM BETTER 
SOCs have arrived and are a tremendous advance over previous platforms 
SOCs combine CPU cores, GPU cores and other accelerators, with high bandwidth access to memory 
How can we make them even better? 
Easier to program 
Easier to optimize 
Higher performance 
Lower power 
HSA unites accelerators architecturally 
Early focus on the GPU compute accelerator, but HSA goes well beyond the GPU 
INFLECTIONS IN PROCESSOR DESIGN 
[Figure: three performance-over-time charts, one per era, each marked “we are here”]
Single-Core Era
Axes: Single-thread Performance vs. Time
Enabled by: Moore’s Law, Voltage Scaling
Constrained by: Power, Complexity
Languages/APIs: Assembly, C/C++, Java …
Multi-Core Era
Axes: Throughput Performance vs. Time (# of processors)
Enabled by: Moore’s Law, SMP architecture
Constrained by: Power, Parallel SW, Scalability
Languages/APIs: pthreads, OpenMP / TBB …
Heterogeneous Systems Era
Axes: Modern Application Performance vs. Time (Data-parallel exploitation)
Enabled by: Abundant data parallelism, Power efficient GPUs
Temporarily constrained by: Programming models, Comm. overhead
Languages/APIs: Shader, CUDA, OpenCL, C++ and Java
HIGH LEVEL FEATURES OF HSA 
Features currently being defined in the HSA Working Groups** 
Unified addressing across all processors 
Operation into pageable system memory 
Full memory coherency 
User mode dispatch 
Architected queuing language 
High level language support for GPU compute processors 
Preemption and context switching 
** All features subject to change, pending completion and ratification of specifications in the HSA Working Groups
HSA —AN OPEN PLATFORM 
Open Architecture, membership open to all 
HSA Programmers Reference Manual 
HSA System Architecture 
HSA Runtime 
Delivered via royalty free standards 
Royalty Free IP, Specifications and APIs 
ISA agnostic for both CPU and GPU 
Membership from all areas of computing 
Hardware companies 
Operating Systems 
Tools and Middleware 
HSA MEMORY MODEL 
Defines visibility ordering between all threads in the HSA System 
Designed to be compatible with C++11, Java, OpenCL and .NET Memory Models 
Relaxed consistency memory model for parallel compute performance 
Visibility controlled by: 
Load.Acquire 
Store.Release 
Barriers 
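The load-acquire / store-release pairing listed above can be illustrated on the CPU side with plain Java. This is only a sketch of the general idea, not HSA code: the class name `AcquireReleaseSketch` and its fields are hypothetical, and it assumes Java 9 or later for `AtomicInteger.setRelease` and `getAcquire`.

```java
import java.util.concurrent.atomic.AtomicInteger;

class AcquireReleaseSketch {
    // Ordinary (non-volatile) data that is published through the flag below.
    static int payload;
    static final AtomicInteger ready = new AtomicInteger(0);

    static int runOnce() {
        payload = 0;
        ready.set(0);
        int[] seen = new int[1];
        Thread consumer = new Thread(() -> {
            // Load-acquire: spin until the producer's release store is observed.
            while (ready.getAcquire() == 0) {
                Thread.onSpinWait();
            }
            // Everything written before the release store is now visible.
            seen[0] = payload;
        });
        consumer.start();
        payload = 42;        // ordinary store
        ready.setRelease(1); // store-release publishes the payload
        try {
            consumer.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return seen[0];
    }
}
```

The consumer never reads `payload` until its acquire load has seen the flag, which is the visibility guarantee a relaxed model needs in order to stay usable.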
HSA QUEUING MODEL 
User mode queuing for low latency dispatch 
Application dispatches directly 
No OS or driver in the dispatch path 
Architected Queuing Layer 
Single compute dispatch path for all hardware 
No driver translation, direct to hardware 
Allows for dispatch to queue from any agent 
CPU or GPU 
GPU self enqueue enables lots of solutions 
Recursion 
Tree traversal 
Wavefront reforming 
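A rough way to picture user mode queuing is a ring buffer in shared memory: the producer writes a packet into a slot and then publishes a new write index, with no system call in between. The sketch below is a hypothetical single-producer, single-consumer model in plain Java. The class `UserModeQueueSketch`, its packet type (`long`) and its method names are assumptions for illustration and do not reflect the actual HSA packet format or runtime API.

```java
import java.util.concurrent.atomic.AtomicLong;

class UserModeQueueSketch {
    private final long[] packets;                           // stand-in for dispatch packets
    private final AtomicLong writeIndex = new AtomicLong(); // published by the producer
    private final AtomicLong readIndex = new AtomicLong();  // published by the consumer

    UserModeQueueSketch(int capacity) {
        packets = new long[capacity];
    }

    // Producer side: fill the slot first, then publish the new write index.
    boolean enqueue(long packet) {
        long w = writeIndex.get();
        if (w - readIndex.get() == packets.length) {
            return false; // queue is full
        }
        packets[(int) (w % packets.length)] = packet;
        writeIndex.set(w + 1); // plays the role of ringing a doorbell
        return true;
    }

    // Consumer side: stand-in for the hardware that drains the queue.
    Long dequeue() {
        long r = readIndex.get();
        if (r == writeIndex.get()) {
            return null; // queue is empty
        }
        long packet = packets[(int) (r % packets.length)];
        readIndex.set(r + 1);
        return packet;
    }
}
```

Because both sides touch only shared memory, either a CPU thread or (in the HSA design) a GPU work-item could act as the producer.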
HSA INTERMEDIATE LAYER —HSAIL 
HSAIL is a virtual ISA for parallel programs 
Finalized to ISA by a JIT compiler or “Finalizer” 
ISA independent by design for CPU & GPU 
Explicitly parallel 
Designed for data parallel programming 
Support for exceptions, virtual functions, and other high level language features 
Lower level than OpenCL SPIR 
Fits naturally in the OpenCL compilation stack 
Suitable to support additional high level languages and programming models: 
Java, C++, OpenMP, etc 
HSAIL INSTRUCTION SET - OVERVIEW
Similar to assembly language for a RISC CPU 
Load-store architecture 
ld_global_u64 $d0, [$d6 + 120];   // $d0 = load($d6 + 120)
add_u64 $d1, $d2, 24;             // $d1 = $d2 + 24
136 opcodes (Java™ bytecode has 200)
Floating point (single, double, half (f16)) 
Integer (32-bit, 64-bit) 
Some packed operations 
Branches 
Function calls 
Platform Atomic Operations: and, or, xor, exch, add, sub, inc, dec, max, min, cas
Synchronize host CPU and HSA Component! 
Text and Binary formats (“BRIG”) 
SEGMENTS AND MEMORY (1/2) 
7 segments of memory:
global, readonly, group, spill, private, arg, kernarg
Memory instructions can (optionally) specify a segment 
Global Segment 
Visible to all HSA agents (including host CPU) 
Group Segment 
Provides high-performance memory shared in the work-group. 
Group memory can be read and written by any work-item in the work-group 
HSAIL provides sync operations to control visibility of group memory 
Useful for expert programmers 
Spill, Private, Arg Segments
Represent different regions of a per-work-item stack
Typically generated by compiler, not specified by programmer
Compiler can use these to convey intent (i.e. spills)
ld_global_u64 $d0, [$d6];
ld_group_u64 $d0, [$d6 + 24];
st_spill_f32 $s1, [$d6 + 4];
SEGMENTS AND MEMORY (2/2) 
Kernarg Segment
Programmer writes kernarg segment to pass arguments to a kernel
Read-Only Segment 
Remains constant during execution of kernel 
Flat Addressing 
Each segment mapped into virtual address space 
Flat addresses can map to segments based on virtual address 
Instructions with no explicit segment use flat addressing 
Very useful for high-level language support (i.e. classes, libraries)
Aligns well with OpenCL 2.0 “generic” addressing feature
ld_kernarg_u64 $d6, [%_arg0];
ld_u64 $d0, [$d6 + 24];   // flat
REGISTERS 
Four classes of registers 
C: 1-bit, Control Registers 
S: 32-bit, Single-precision FP or Int 
D: 64-bit, Double-precision FP or Long Int 
Q: 128-bit, Packed data. 
Fixed number of registers: 
8 C 
S, D, Q share a single pool of resources 
S + 2*D + 4*Q <= 128 
Up to 128 S or 64 D or 32 Q (or a blend) 
Register allocation done in high-level compiler 
Finalizer doesn’t have to perform expensive register allocation 
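The shared-pool rule above (S + 2*D + 4*Q <= 128) is easy to check mechanically. The following is a minimal sketch of that arithmetic only. The class and method names are hypothetical, and the 8 separate C registers are not modelled.

```java
class RegisterBudgetSketch {
    static final int POOL_SIZE = 128; // pool size quoted on the slide, in 32-bit slots

    // Each S register takes 1 slot, each D register 2 slots, each Q register 4 slots.
    static int slotsUsed(int s, int d, int q) {
        return s + 2 * d + 4 * q;
    }

    // True when the requested blend of registers fits in the shared pool.
    static boolean fits(int s, int d, int q) {
        return s >= 0 && d >= 0 && q >= 0 && slotsUsed(s, d, q) <= POOL_SIZE;
    }
}
```

For example, a kernel using 32 S, 16 D and 16 Q registers uses exactly 32 + 32 + 64 = 128 slots, so it fits, while adding one more register of any class would not.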
SIMT EXECUTION MODEL 
HSAIL Presents a “SIMT” execution model to the programmer 
“Single Instruction, Multiple Thread” 
Programmer writes program for a single thread of execution 
Each work-item appears to have its own program counter 
Branch instructions look natural 
Hardware Implementation 
Most hardware uses SIMD (Single-Instruction Multiple Data) vectors for efficiency 
Actually one program counter for the entire SIMD instruction 
Branches implemented with predication 
SIMT Advantages 
Easier to program (branch code in particular) 
Natural path for mainstream programming models 
Scales across a wide variety of hardware (programmer doesn’t see vector width) 
Cross-lane operations available for those who want peak performance 
WAVEFRONTS 
Hardware SIMD vector, composed of 1, 2, 4, 8, 16, 32, or 64 “lanes” 
Lanes in a wavefront can be “active” or “inactive”
Inactive lanes consume hardware resources but don’t do useful work 
Tradeoffs 
“Wavefront-aware” programming can be useful for peak performance 
But results in less portable code (since wavefront width is encoded in algorithm) 
if (cond) { 
operationA; // cond=True lanes active here 
} else { 
operationB; // cond=False lanes active here 
}
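To make the active/inactive lane idea concrete, the if/else above can be simulated in plain Java: the whole wavefront walks through both sides of the branch, and a per-lane mask decides which lanes actually write results. This is an illustrative sketch only. The class `PredicationSketch` and the two stand-in operations (doubling, and adding 100) are hypothetical, and real hardware does this in parallel rather than in loops.

```java
class PredicationSketch {
    // Simulates: if (x is even) out = x * 2; else out = x + 100;
    // for every lane of one wavefront sharing a single program counter.
    static int[] run(int[] x) {
        int lanes = x.length;
        boolean[] cond = new boolean[lanes];
        int[] out = new int[lanes];

        // Evaluate the condition in every lane to build the execution mask.
        for (int lane = 0; lane < lanes; lane++) {
            cond[lane] = x[lane] % 2 == 0;
        }
        // Pass 1: operationA, only cond=true lanes are active.
        for (int lane = 0; lane < lanes; lane++) {
            if (cond[lane]) {
                out[lane] = x[lane] * 2;
            }
        }
        // Pass 2: operationB, only cond=false lanes are active.
        for (int lane = 0; lane < lanes; lane++) {
            if (!cond[lane]) {
                out[lane] = x[lane] + 100;
            }
        }
        return out;
    }
}
```

Both passes always run over the full wavefront, which is why a divergent branch costs roughly the sum of both sides even though each lane only uses one result.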
HSA ENABLEMENT OF LANGUAGES, FRAMEWORKS, LIBRARIES AND RUNTIMES
HSA AND OPENCL™ 
HSA is an optimized platform architecture for OpenCL™ 
Not an alternative to OpenCL™ 
OpenCL™ on HSA will benefit from 
Avoidance of wasteful copies 
Low latency dispatch 
Improved memory model 
Pointers shared between CPU and GPU 
OpenCL™ 2.0 shows considerable alignment with HSA 
Many HSA member companies are also active with Khronos in the OpenCL™ working group 
HSA AND OPENMP® 
OpenMP® 
Established 
Portable 
Scalable (desktop to supercomputer) 
Simple 
Flexible 
HSA enablement brings:
GPU performance 
Energy efficiency 
…to established developer community 
BOLT : A C++ PARALLEL PRIMITIVES LIBRARY FOR HSA 
Allow C++ developers to leverage the power efficiency of GPU computing 
Common routines such as scan, sort, reduce, transform 
More advanced routines like heterogeneous pipelines 
Bolt library works with OpenCL and C++ AMP 
Enjoy the unique advantages of the HSA platform 
Move the computation not the data 
A single source code base for the CPU and GPU!
Developers can focus on core algorithms
https://github.com/HSA-Libraries/Bolt
HSA ENABLEMENT OF JAVA™
Why Java™?
9 Million Developers
1 Billion Java downloads per year
97% Enterprise desktops run Java
100% of Blu-ray players ship with Java
http://oracle.com.edgesuite.net/timeline/java/
Java™ 8 language/libraries include concurrency features
primitives (threads, locks, monitors, atomic ops)
libraries (fork/join, thread pools, executors, futures)
support for ‘lambda’ based Stream APIs
JIT (Just In Time) architecture ideal for generating and executing HSAIL.
Project ‘Sumatra’ targets GPU JIT generation/execution in the 2015 Java™ 9 timeframe.
APARAPI: INITIAL JAVA ENABLEMENT (2011)
Aparapi API for expressing data parallel workloads
Developer uses common Java™ patterns and idioms
Java source compiled to bytecode using standard compiler (javac)
Aparapi runtime capable of converting bytecode to OpenCL™
Execution on OpenCL™ 1.1+ capable devices (GPUs/APUs)
OR
Execute via a Java thread pool if OpenCL™ is not available
Open Source project
~20 contributors
>7000 downloads
~150 visits per day
[Diagram: software stack with Java Application, APARAPI API, JVM, OpenCL Source, OpenCL Runtime, CPU ISA / GPU ISA, CPU / GPU]
SUMATRA PROJECT (JAVA 9 2015)
AMD/Oracle sponsored Open Source (OpenJDK) project
Targeted at Java 9 (2015 release)
Allow developers to efficiently represent data parallel algorithms in Java using Stream API + Lambda expressions
Sumatra is not pushing a new ‘programming model’
Instead we ‘repurpose’ Java 8’s new Stream API/Lambda to enable either CPU or GPU computing
A Sumatra enabled Java Virtual Machine will dispatch ‘selected’ constructs to HSA enabled devices at runtime.
Developers are already refactoring the JDK to use stream+lambda
So anyone using the existing JDK should see GPU acceleration without any code changes.
http://openjdk.java.net/projects/sumatra/
https://wikis.oracle.com/display/HotSpotInternals/Sumatra
http://mail.openjdk.java.net/pipermail/sumatra-dev/
[Diagram: software stack with Java Application, Java JDK Stream + Lambda API, JVM, Java GRAAL JIT backend, HSAIL, HSA Finalizer & Runtime, CPU ISA / GPU ISA, CPU / GPU]
HSA ENABLEMENT OF JAVA
Java 7 – OpenCL enabled Aparapi
• AMD initiated Open Source project
• APIs for data parallel algorithms
GPU accelerate Java applications
No need to learn OpenCL
• Active community captured mindshare
~20 contributors
>7000 downloads
~150 visits per day
[Diagram: Java Application, APARAPI API, JVM, OpenCL Source, OpenCL Runtime, CPU ISA / GPU ISA, CPU / GPU]
Java 8 – HSA enabled Aparapi
• Java 8 brings Stream + Lambda API.
More natural way of expressing data parallel algorithms
Initially targeted at multi-core.
• APARAPI will:
Support Java 8 Lambdas
Dispatch code to HSA enabled devices at runtime via HSAIL
[Diagram: Java Application, APARAPI + Lambda API, JVM, HSAIL, HSA Finalizer & Runtime, CPU ISA / GPU ISA, CPU / GPU]
Java 9 – HSA enabled Java (Sumatra)
• Adds native GPU compute support to Java Virtual Machine (JVM)
• Developer uses JDK provided Lambda + Stream API
• JVM uses GRAAL compiler to generate HSAIL
• JVM decides at runtime to execute on either CPU or GPU depending on workload characteristics.
[Diagram: Java Application, Java JDK Stream + Lambda API, JVM, Java GRAAL JIT backend, HSAIL, HSA Finalizer & Runtime, CPU ISA / GPU ISA, CPU / GPU]
We plan to provide HSA Enabled Aparapi (Java 8) as a bridge technology between OpenCL based Aparapi (Java 7) and HSA Enabled Sumatra (Java 9)
A JAVA EXAMPLE 
class Team {
    private int scores;
    public int getScores() {
        return scores;
    }
    // Setters omitted for brevity
}
class Player {
    private Team team;
    private int scores;
    private float pctOfTeamScores;
    public Team getTeam() {
        return team;
    }
    public int getScores() {
        return scores;
    }
    public void setPctOfTeamScores(float pct) {
        pctOfTeamScores = pct;
    }
    // Other setters omitted for brevity
}
Player[] allPlayers = … // Code to initialize array of Players omitted
// HSA enabled Sumatra
Arrays.stream(allPlayers).parallel().forEach(p -> {
    int teamScores = p.getTeam().getScores();
    float pctOfTeamScores = (float) p.getScores() / (float) teamScores;
    p.setPctOfTeamScores(pctOfTeamScores);
});
// HSA enabled Aparapi: same lambda body, different entry point
Device.hsa().forEach(allPlayers, p -> {
    // same body as above
});
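The Player/Team example on this slide can be turned into something that compiles and runs on a stock JVM roughly as follows. This is a CPU-only sketch using the standard Java 8 Stream API. The wrapper class `TeamScoresSketch`, the constructors and the `getPctOfTeamScores` getter are additions made here so the example is self-contained, and no HSA or Aparapi dispatch is involved.

```java
import java.util.Arrays;

class TeamScoresSketch {
    static class Team {
        private final int scores;
        Team(int scores) { this.scores = scores; }
        int getScores() { return scores; }
    }

    static class Player {
        private final Team team;
        private final int scores;
        private float pctOfTeamScores;
        Player(Team team, int scores) { this.team = team; this.scores = scores; }
        Team getTeam() { return team; }
        int getScores() { return scores; }
        float getPctOfTeamScores() { return pctOfTeamScores; }
        void setPctOfTeamScores(float pct) { pctOfTeamScores = pct; }
    }

    // Each player is updated independently, so the loop is data parallel.
    static void compute(Player[] allPlayers) {
        Arrays.stream(allPlayers).parallel().forEach(p -> {
            int teamScores = p.getTeam().getScores();
            float pctOfTeamScores = (float) p.getScores() / (float) teamScores;
            p.setPctOfTeamScores(pctOfTeamScores);
        });
    }

    public static void main(String[] args) {
        Team team = new Team(10);
        Player[] players = { new Player(team, 2), new Player(team, 3), new Player(team, 5) };
        compute(players);
        for (Player p : players) {
            System.out.println(p.getPctOfTeamScores());
        }
    }
}
```

The lambda only reads shared state (the team) and writes to its own player, which is the property that lets a runtime move the same body to another device.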
HSAIL CODE EXAMPLE (SUMATRA) 
version 0:95: $full : $large;
// We pass underlying array of Players to the kernel
kernel &run (
    kernarg_u64 %_arg0                 // Array of players passed as arg
) {
    ld_kernarg_u64 $d6, [%_arg0];      // Move arg to an HSAIL register
    workitemabsid_u32 $s2, 0;          // Read the work-item global id (gid)

    cvt_u64_s32 $d2, $s2;              // Convert gid to long
    mul_u64 $d2, $d2, 8;               // Scale gid by 8, the size of one object reference
    add_u64 $d2, $d2, 24;              // Skip array object header (24 bytes)
    add_u64 $d2, $d2, $d6;             // $d2 now holds the address of players[gid]
    ld_global_u64 $d6, [$d2];          // Load Player p from players[gid]
    ld_global_u64 $d0, [$d6 + 120];    // p.getTeam() inlined
    ld_global_s32 $s3, [$d6 + 40];     // p.getScores() inlined
    cvt_f32_s32 $s16, $s3;             // cast to (float)
    ld_global_s32 $s0, [$d0 + 24];     // Team getScores() inlined
    cvt_f32_s32 $s17, $s0;             // cast to (float)
    div_f32 $s16, $s16, $s17;          // p.getScores()/teamScores inlined
    st_global_f32 $s16, [$d6 + 100];   // p.setPctOfTeamScores() inlined
    ret;
}
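The address arithmetic at the start of the kernel (scale the work-item id by 8, skip a 24-byte header, add the array base) can be written out as a small Java helper. This is a sketch of the calculation only: the class and constant names are hypothetical, and the 24-byte header and 8-byte element size are simply the values that appear in this listing, not guaranteed properties of any JVM.

```java
class ElementAddressSketch {
    static final long ARRAY_HEADER_BYTES = 24; // header size used in the listing
    static final long REFERENCE_BYTES = 8;     // element stride used in the listing

    // Mirrors: cvt_u64_s32, mul_u64 by 8, add_u64 24, add_u64 base.
    static long elementAddress(long arrayBase, int gid) {
        return arrayBase + ARRAY_HEADER_BYTES + (long) gid * REFERENCE_BYTES;
    }
}
```

With a base of 0x1000, work-item 0 reads its reference from 0x1018 and work-item 3 from 0x1030, which is why each work-item touches a different Player without any coordination.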
A CASE STUDY CENTERED ON NBODY
A Java developer implementing NBody would probably…
// Create a class to represent each body
class Body {
    float x, y, z, m, vx, vy, vz;
    // Include method to update position and display
    void updateAndShow(Screen screen, Body[] bodies) {
        // omitted vars for accumulating forces
        for (Body other : bodies) {
            // accumulate forces between other and this
        }
        // update vx, vy, vz, x, y and z from accumulated data
        screen.paint(x, y, z);
    }
}
// Assuming bodies[] is an initialized array of Body
// We can update and display each one in turn
for (Body b : bodies)
    b.updateAndShow(screen, bodies);
WITHOUT HSA WE CAN’T USE OBJECTS
Java does not guarantee contiguous allocation of objects in arrays
Only arrays of primitives (long, float etc) are allocated contiguously
Non HSA enabled Java GPU frameworks force developers to either
Abandon Object Oriented solutions and revert to parallel primitive arrays
Or…
Add scatter/gather (costly copies) behind the scenes
// Create and populate parallel arrays of primitives
float x[], y[], z[], m[], vx[], vy[], vz[];
// Treat x[n], y[n], z[n] etc as the state of Body[n]
Kernel k = new Kernel() {
    @Override public void run() {
        int i = getGlobalId(); // index of the body this work-item updates
        // vars for accumulating state not shown
        for (int j = 0; j < bodies; j++) {
            // accum forces between (x,y,z)[j] and (x,y,z)[i]
        }
        // update vx[i], vy[i], vz[i], x[i], y[i] and z[i]
    }
};
k.execute(bodies);
HSA ENABLEMENT ALLOWS NATURAL JAVA REPRESENTATIONS
HSA version of Aparapi and Sumatra can deal with Java objects
class Body {
    float x, y, z, m, vx, vy, vz;
    void updateAndShow(Screen screen, Body[] bodies) {
        // hidden vars for accumulating forces
        for (Body other : bodies) {
            // accumulate forces between other and this
        }
        // update vx, vy, vz, x, y and z from accumulated data
        screen.paint(x, y, z);
    }
}
Then loop over the array, updating and displaying the bodies.
Arrays.stream(bodies).parallel().forEach(b -> { // Sumatra solution
    b.updateAndShow(screen, bodies);
});
Device.hsa().forEach(bodies, b -> { // HSA enabled Aparapi solution
    b.updateAndShow(screen, bodies);
});
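For readers who want to run the object-based NBody idea on an ordinary JVM, here is a CPU-only sketch using the Java 8 Stream API. Several things in it are assumptions made for illustration and are not from the slides: the `NBodySketch` wrapper, the `ax/ay/az` fields, the softening constant, the unit gravitational constant, and the split into two passes (accumulate, then integrate) so that no body moves while others are still reading its position. Display via `Screen` is left out.

```java
import java.util.Arrays;

class NBodySketch {
    static class Body {
        float x, y, z, m, vx, vy, vz;
        float ax, ay, az; // accumulated acceleration for the current step

        Body(float x, float y, float z, float m) {
            this.x = x; this.y = y; this.z = z; this.m = m;
        }

        // Pass 1: read every other body, write only this body's acceleration.
        void accumulate(Body[] bodies) {
            ax = 0f; ay = 0f; az = 0f;
            for (Body other : bodies) {
                if (other == this) continue;
                float dx = other.x - x, dy = other.y - y, dz = other.z - z;
                float distSq = dx * dx + dy * dy + dz * dz + 1e-3f; // softening term
                float inv = other.m / (distSq * (float) Math.sqrt(distSq));
                ax += dx * inv; ay += dy * inv; az += dz * inv;
            }
        }

        // Pass 2: update velocity and position from the accumulated data.
        void integrate(float dt) {
            vx += ax * dt; vy += ay * dt; vz += az * dt;
            x += vx * dt; y += vy * dt; z += vz * dt;
        }
    }

    static void step(Body[] bodies, float dt) {
        Arrays.stream(bodies).parallel().forEach(b -> b.accumulate(bodies));
        Arrays.stream(bodies).parallel().forEach(b -> b.integrate(dt));
    }
}
```

The state stays inside `Body` objects throughout, which is the representation the slide argues shared virtual memory makes practical on a GPU as well.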
SUMATRA + HSA ENABLED APARAPI PERFORMANCE 
NBody implemented as an array of Objects.
On early access HSA enabled hardware and software.
[Chart: Intra-Body Interactions per microsecond (higher is better) vs. Number of bodies]
Data labels on the chart:
12.3 x perf (1.48 x power)
10.6 x perf (1.44 x power)
7.9 x perf (1.35 x power)
HSA ENABLEMENT OF JVM CAN ACCELERATE OTHER JVM BASED LANGUAGES
Java 9 – 3Q2015
HSA enabled Java (Sumatra)
• Adds native GPU compute support to Java Virtual Machine (JVM)
• Developer uses JDK provided Lambda + Stream API
• JVM uses GRAAL compiler to generate HSAIL
• JVM decides at runtime to execute on either CPU or GPU depending on workload characteristics.
[Diagram: Java Application, Java JDK Stream + Lambda API, JVM, Java GRAAL JIT backend, HSAIL, HSA Finalizer & Runtime, CPU ISA / GPU ISA, CPU / GPU]
Java 9 + 2016?
HSA enablement of other JVM based languages/frameworks
• Developer uses their preferred Truffle based language (R, JavaScript, Python, Ruby etc)
• JVM uses Truffle + GRAAL compiler to generate HSAIL
• HSA acceleration beyond Java
[Diagram: R, JavaScript, Ruby and Python apps on Truffle, JVM, Java GRAAL JIT backend, HSAIL, HSA Finalizer & Runtime, CPU ISA / GPU ISA, CPU / GPU]
TAKEAWAYS 
HSA brings GPU computing to mainstream programming models 
Open standard for emerging parallel compute platforms 
Shared and coherent memory bridges “faraway accelerator” gap 
HSAIL provides the common IL for high-level languages to benefit from parallel computing 
HSAIL Key Points 
Thin, robust, fast finalizer 
Portable (multiple HW vendors and parallel architectures) 
Supports shared virtual memory and platform atomics 
Java Enablement 
Can access Objects on Java’s heap thanks to ‘Shared Virtual Memory’ 
Leverages Java 8 Lambda and Stream APIs intended for multicore 
Gateway to enabling other JVM based languages. 
TOOLS ARE AVAILABLE NOW 
HSA Programmer’s Reference Manual: HSAIL Virtual ISA and Programming Model, Compiler Writer’s Guide, and Object Format (BRIG) 
http://hsafoundation.com/standards/
https://hsafoundation.box.com/s/m6mrsjv8b7r50kqeyyal
Tools now at GitHub – HSA Foundation
libHSA Assembler and Disassembler
https://github.com/HSAFoundation/HSAIL-Tools
HSAIL Instruction Set Simulator
https://github.com/HSAFoundation/HSAIL-Instruction-Set-Simulator
Soon: LLVM Compilation stack which outputs HSAIL and BRIG
Java enablement via HSAIL (preliminary)
http://openjdk.java.net/projects/sumatra/
http://openjdk.java.net/projects/graal/
http://aparapi.googlecode.com/

More Related Content

PPTX
HSA Queuing Hot Chips 2013
PPTX
ISCA final presentation - Runtime
PPTX
ISCA Final Presentation - HSAIL
PPTX
ISCA final presentation - Queuing Model
PPTX
ISCA final presentation - Memory Model
PPTX
ISCA Final Presentation - Applications
PDF
KeynoteTHE HETEROGENEOUS SYSTEM ARCHITECTURE ITS (NOT) ALL ABOUT THE GPU
PPTX
ISCA Final Presentation - Intro
HSA Queuing Hot Chips 2013
ISCA final presentation - Runtime
ISCA Final Presentation - HSAIL
ISCA final presentation - Queuing Model
ISCA final presentation - Memory Model
ISCA Final Presentation - Applications
KeynoteTHE HETEROGENEOUS SYSTEM ARCHITECTURE ITS (NOT) ALL ABOUT THE GPU
ISCA Final Presentation - Intro

What's hot (20)

PDF
HSAemu a Full System Emulator for HSA
PPTX
HSA HSAIL Introduction Hot Chips 2013
PPTX
HSA Introduction
PPTX
HSA Introduction Hot Chips 2013
PDF
AFDS 2012 Phil Rogers Keynote: THE PROGRAMMER’S GUIDE TO A UNIVERSE OF POSSIB...
PDF
Hsa10 whitepaper
PDF
HSA Overview
PDF
AFDS 2011 Phil Rogers Keynote: “The Programmer’s Guide to the APU Galaxy.”
PPTX
HSA Memory Model Hot Chips 2013
PDF
ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorit...
PDF
Deeper Look Into HSAIL And It's Runtime
PDF
HC-4015, An Overview of the HSA System Architecture Requirements, by Paul Bli...
PDF
Heterogeneous Systems Architecture: The Next Area of Computing Innovation
 
PDF
Keynote (Nandini Ramani) - The Role of Java in Heterogeneous Computing & How ...
PDF
CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Ja...
PDF
Bolt C++ Standard Template Libary for HSA by Ben Sanders, AMD
PPTX
Heterogeneous computing
PPT
Guide to heterogeneous system architecture (hsa)
PPTX
Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014
PDF
Gpu Compute
HSAemu a Full System Emulator for HSA
HSA HSAIL Introduction Hot Chips 2013
HSA Introduction
HSA Introduction Hot Chips 2013
AFDS 2012 Phil Rogers Keynote: THE PROGRAMMER’S GUIDE TO A UNIVERSE OF POSSIB...
Hsa10 whitepaper
HSA Overview
AFDS 2011 Phil Rogers Keynote: “The Programmer’s Guide to the APU Galaxy.”
HSA Memory Model Hot Chips 2013
ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorit...
Deeper Look Into HSAIL And It's Runtime
HC-4015, An Overview of the HSA System Architecture Requirements, by Paul Bli...
Heterogeneous Systems Architecture: The Next Area of Computing Innovation
 
Keynote (Nandini Ramani) - The Role of Java in Heterogeneous Computing & How ...
CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Ja...
Bolt C++ Standard Template Libary for HSA by Ben Sanders, AMD
Heterogeneous computing
Guide to heterogeneous system architecture (hsa)
Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014
Gpu Compute
Ad

Similar to HSA From A Software Perspective (20)

PDF
"Enabling Efficient Heterogeneous Processing Through Coherency," a Presentati...
PDF
Heterogeneous System Architecture Overview
PDF
HSA-4131, HSAIL Programmers Manual: Uncovered, by Ben Sander
PDF
HC-4017, HSA Compilers Technology, by Debyendu Das
PDF
LCU13: HSA Architecture Presentation
PDF
HSA-4122, "HSA Queuing Mode," by Ian Bratt
PPT
Petapath HP Cast 12 - Programming for High Performance Accelerated Systems
PDF
Implement Runtime Environments for HSA using LLVM
PPTX
SAS on Your (Apache) Cluster, Serving your Data (Analysts)
PDF
Solution Use Case Demo: The Power of Relationships in Your Big Data
PDF
Spark forplainoldjavageeks svforum_20140724
PDF
Oracle NoSQL Database release 3.0 overview
PDF
HSA-4024, OpenJDK Sumatra Project: Bringing the GPU to Java, by Eric Caspole
PDF
PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...
DOC
Hadoop cluster configuration
PDF
Ceph Performance on OpenStack - Barcelona Summit
PPTX
Apache Spark Introduction @ University College London
PPTX
Cloudera - Amr Awadallah - Hadoop World 2010
PDF
Making Hardware Accelerator Easier to Use
PPTX
Apache spark architecture (Big Data and Analytics)
"Enabling Efficient Heterogeneous Processing Through Coherency," a Presentati...
Heterogeneous System Architecture Overview
HSA-4131, HSAIL Programmers Manual: Uncovered, by Ben Sander
HC-4017, HSA Compilers Technology, by Debyendu Das
LCU13: HSA Architecture Presentation
HSA-4122, "HSA Queuing Mode," by Ian Bratt
Petapath HP Cast 12 - Programming for High Performance Accelerated Systems
Implement Runtime Environments for HSA using LLVM
SAS on Your (Apache) Cluster, Serving your Data (Analysts)
Solution Use Case Demo: The Power of Relationships in Your Big Data
Spark forplainoldjavageeks svforum_20140724
Oracle NoSQL Database release 3.0 overview
HSA-4024, OpenJDK Sumatra Project: Bringing the GPU to Java, by Eric Caspole
PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...
Hadoop cluster configuration
Ceph Performance on OpenStack - Barcelona Summit
Apache Spark Introduction @ University College London
Cloudera - Amr Awadallah - Hadoop World 2010
Making Hardware Accelerator Easier to Use
Apache spark architecture (Big Data and Analytics)
Ad

More from HSA Foundation (12)

PDF
Hsa Runtime version 1.00 Provisional
PDF
Hsa programmers reference manual (version 1.0 provisional)
PPTX
ISCA Final Presentaiton - Compilations
PDF
Hsa Platform System Architecture Specification Provisional verl 1.0 ratifed
PPT
Apu13 cp lu-keynote-final-slideshare
PDF
HSA Foundation BoF -Siggraph 2013 Flyer
PDF
HSA Programmer’s Reference Manual: HSAIL Virtual ISA and Programming Model, C...
PDF
ARM Techcon Keynote 2012: Sensor Integration and Improved User Experiences at...
PDF
Phil Rogers IFA Keynote 2012
PDF
Hsa2012 logo guidelines.
PDF
What Fabric Engine Can Do With HSA
PDF
Fabric Engine: Why HSA is Invaluable
Hsa Runtime version 1.00 Provisional
Hsa programmers reference manual (version 1.0 provisional)
ISCA Final Presentaiton - Compilations
Hsa Platform System Architecture Specification Provisional verl 1.0 ratifed
Apu13 cp lu-keynote-final-slideshare
HSA Foundation BoF -Siggraph 2013 Flyer
HSA Programmer’s Reference Manual: HSAIL Virtual ISA and Programming Model, C...
ARM Techcon Keynote 2012: Sensor Integration and Improved User Experiences at...
Phil Rogers IFA Keynote 2012
Hsa2012 logo guidelines.
What Fabric Engine Can Do With HSA
Fabric Engine: Why HSA is Invaluable

Recently uploaded (20)

PDF
Network Security Unit 5.pdf for BCA BBA.
PDF
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
PPTX
Big Data Technologies - Introduction.pptx
PDF
Dropbox Q2 2025 Financial Results & Investor Presentation
PDF
Advanced methodologies resolving dimensionality complications for autism neur...
PDF
Chapter 3 Spatial Domain Image Processing.pdf
PDF
Empathic Computing: Creating Shared Understanding
PPTX
Understanding_Digital_Forensics_Presentation.pptx
PDF
Review of recent advances in non-invasive hemoglobin estimation
PPTX
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
PDF
NewMind AI Weekly Chronicles - August'25 Week I
PPTX
Digital-Transformation-Roadmap-for-Companies.pptx
PPTX
Spectroscopy.pptx food analysis technology
PPT
Teaching material agriculture food technology
PDF
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
PDF
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
PDF
Approach and Philosophy of On baking technology
PDF
Diabetes mellitus diagnosis method based random forest with bat algorithm
PDF
MIND Revenue Release Quarter 2 2025 Press Release
PDF
Unlocking AI with Model Context Protocol (MCP)
Network Security Unit 5.pdf for BCA BBA.
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
Big Data Technologies - Introduction.pptx
Dropbox Q2 2025 Financial Results & Investor Presentation
Advanced methodologies resolving dimensionality complications for autism neur...
Chapter 3 Spatial Domain Image Processing.pdf
Empathic Computing: Creating Shared Understanding
Understanding_Digital_Forensics_Presentation.pptx
Review of recent advances in non-invasive hemoglobin estimation
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
NewMind AI Weekly Chronicles - August'25 Week I
Digital-Transformation-Roadmap-for-Companies.pptx
Spectroscopy.pptx food analysis technology
Teaching material agriculture food technology
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
Approach and Philosophy of On baking technology
Diabetes mellitus diagnosis method based random forest with bat algorithm
MIND Revenue Release Quarter 2 2025 Press Release
Unlocking AI with Model Context Protocol (MCP)

HSA From A Software Perspective

  • 1. HSA (HETEROGENEOUS SYSTEM ARCHITECTURE) FROM A SOFTWARE PERSPECTIVE OCT 2013 GARY FROSTAMD SOFTWARE FELLOW
  • 2. HSA FOUNDATION Founded in June 2012 Developing a new platform for heterogeneous systems www.hsafoundation.com Specifications under development in working groups Our first specification, HSA Programmers Reference Manual is already published and available on our web site Additional specifications for System Architecture, Runtime Software and Tools are in process © Copyright 2013 HSA Foundation. All Rights Reserved. 2
  • 3. HSA FOUNDATION MEMBERSHIP — AUGUST 2013 © Copyright 2013 HSA Foundation. All Rights Reserved. 3 Founders Promoters Supporters Contributors Academic Associates
  • 4. SOCS HAVE PROLIFERATED — MAKE THEM BETTER SOCs have arrived and are a tremendous advance over previous platforms SOCs combine CPU cores, GPU cores and other accelerators, with high bandwidth access to memory How can we make them even better? Easier to program Easier to optimize Higher performance Lower power HSA unites accelerators architecturally Early focus on the GPU compute accelerator, but HSA goes well beyond the GPU © Copyright 2013 HSA Foundation. All Rights Reserved. 4
  • 5. INFLECTIONS IN PROCESSOR DESIGN © Copyright 2013 HSA Foundation. All Rights Reserved. 5 ? Single-thread Performance Time we are here Enabled by: Moore’s Law Voltage Scaling Constrained by: Power ComplexitySingle-Core Era Modern Application Performance Time (Data-parallel exploitation) we are hereHeterogeneousSystems Era Enabled by: Abundant data parallelism Power efficient GPUs TemporarilyConstrained by: Programming models Comm.overhead Throughput Performance Time (# of processors) we are here Enabled by: Moore’s Law SMP architecture Constrained by: Power Parallel SW ScalabilityMulti-Core Era Assembly C/C++ Java … pthreadsOpenMP/ TBB … Shader CUDAOpenCL C++ and Java
  • 6. HIGH LEVEL FEATURES OF HSA Features currently being defined in the HSA Working Groups** Unified addressing across all processors Operation into pageable system memory Full memory coherency User mode dispatch Architected queuing language High level language support for GPU compute processors Preemption and context switching © Copyright 2013 HSA Foundation. All Rights Reserved. 6 ** All features subject to change, pending completion and ratification of specifications in the HSA Working Groups
  • 7. HSA —AN OPEN PLATFORM Open Architecture, membership open to all HSA Programmers Reference Manual HSA System Architecture HSA Runtime Delivered via royalty free standards Royalty Free IP, Specifications and APIs ISA agnostic for both CPU and GPU Membership from all areas of computing Hardware companies Operating Systems Tools and Middleware © Copyright 2013 HSA Foundation. All Rights Reserved. 7
  • 8. HSA MEMORY MODEL Defines visibility ordering between all threads in the HSA System Designed to be compatible with C++11, Java, OpenCL and .NET Memory Models Relaxed consistency memory model for parallel compute performance Visibility controlled by: Load.Acquire Store.Release Barriers © Copyright 2013 HSA Foundation. All Rights Reserved. 8
  • 9. HSA QUEUING MODEL
      User mode queuing for low latency dispatch
        Application dispatches directly
        No OS or driver in the dispatch path
      Architected Queuing Layer
        Single compute dispatch path for all hardware
        No driver translation, direct to hardware
      Allows for dispatch to queue from any agent
        CPU or GPU
      GPU self enqueue enables lots of solutions
        Recursion
        Tree traversal
        Wavefront reforming
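As a rough analogy for user mode queuing, the sketch below models a dispatch queue as a fixed-size ring in ordinary memory: enqueueing is a couple of memory writes with no system call on the path. Everything here is hypothetical (the `UserModeQueue` class, packets as plain `long` values, a single producer and a single consumer); it is not the architected queue format, which the slides leave to the specifications.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical single-producer/single-consumer ring; packets are assumed non-negative.
class UserModeQueue {
    private final long[] packets;
    private final AtomicLong writeIndex = new AtomicLong();
    private final AtomicLong readIndex = new AtomicLong();

    UserModeQueue(int capacity) {
        packets = new long[capacity];
    }

    // Producer side: write the packet, then publish it by advancing the write index.
    boolean dispatch(long packet) {
        long w = writeIndex.get();
        if (w - readIndex.getAcquire() == packets.length) {
            return false;                                 // queue is full
        }
        packets[(int) (w % packets.length)] = packet;
        writeIndex.setRelease(w + 1);                     // makes the packet visible to the consumer
        return true;
    }

    // Consumer side: returns the next packet, or -1 when the queue is empty.
    long consume() {
        long r = readIndex.get();
        if (r == writeIndex.getAcquire()) {
            return -1;
        }
        long packet = packets[(int) (r % packets.length)];
        readIndex.setRelease(r + 1);                      // frees the slot for the producer
        return packet;
    }
}
```

A queue that accepts packets from several agents at once would need a slot reservation step on top of this; that is deliberately left out to keep the sketch short.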
  • 10. HSA INTERMEDIATE LAYER - HSAIL
      HSAIL is a virtual ISA for parallel programs
        Finalized to ISA by a JIT compiler or “Finalizer”
        ISA independent by design for CPU & GPU
      Explicitly parallel
        Designed for data parallel programming
      Support for exceptions, virtual functions, and other high level language features
      Lower level than OpenCL SPIR
        Fits naturally in the OpenCL compilation stack
      Suitable to support additional high level languages and programming models:
        Java, C++, OpenMP, etc.
  • 11. HSAIL INSTRUCTION SET - OVERVIEW
      Similar to assembly language for a RISC CPU
        Load-store architecture
          ld_global_u64 $d0, [$d6 + 120];   // $d0 = load($d6 + 120)
          add_u64 $d1, $d2, 24;             // $d1 = $d2 + 24
      136 opcodes (Java™ bytecode has 200)
        Floating point (single, double, half (f16))
        Integer (32-bit, 64-bit)
        Some packed operations
        Branches
        Function calls
      Platform Atomic Operations
        and, or, xor, exch, add, sub, inc, dec, max, min, cas
        Synchronize host CPU and HSA Component!
      Text and Binary formats (“BRIG”)
  • 12. SEGMENTS AND MEMORY (1/2)
      7 segments of memory
        global, readonly, group, spill, private, arg, kernarg
        Memory instructions can (optionally) specify a segment
      Global Segment
        Visible to all HSA agents (including host CPU)
      Group Segment
        Provides high-performance memory shared in the work-group
        Group memory can be read and written by any work-item in the work-group
        HSAIL provides sync operations to control visibility of group memory
        Useful for expert programmers
      Spill, Private, Arg Segments
        Represent different regions of a per-work-item stack
        Typically generated by compiler, not specified by programmer
        Compiler can use these to convey intent, i.e. spills
      Examples:
        ld_global_u64 $d0, [$d6];
        ld_group_u64 $d0, [$d6 + 24];
        st_spill_f32 $s1, [$d6 + 4];
  • 13. SEGMENTS AND MEMORY (2/2)
      Kernarg Segment
        Programmer writes kernarg segment to pass arguments to a kernel
      Read-Only Segment
        Remains constant during execution of kernel
      Flat Addressing
        Each segment mapped into virtual address space
        Flat addresses can map to segments based on virtual address
        Instructions with no explicit segment use flat addressing
        Very useful for high-level language support (e.g. classes, libraries)
        Aligns well with OpenCL 2.0 “generic” addressing feature
      Examples:
        ld_kernarg_u64 $d6, [%_arg0];
        ld_u64 $d0, [$d6 + 24];   // flat
  • 14. REGISTERS
      Four classes of registers
        C: 1-bit, Control Registers
        S: 32-bit, Single-precision FP or Int
        D: 64-bit, Double-precision FP or Long Int
        Q: 128-bit, Packed data
      Fixed number of registers:
        8 C
        S, D, Q share a single pool of resources
          S + 2*D + 4*Q <= 128
          Up to 128 S or 64 D or 32 Q (or a blend)
      Register allocation done in high-level compiler
        Finalizer doesn’t have to perform expensive register allocation
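The shared-pool rule S + 2*D + 4*Q <= 128 is easy to check mechanically. The helper below is a small sketch of that arithmetic only; the class and method names are made up for illustration.

```java
// Checks a blend of S, D and Q registers against the shared pool from the slide.
class RegisterBudget {
    static final int POOL = 128;   // pool size in S-register units, as stated on the slide

    // A D register costs two S units and a Q register costs four.
    static int cost(int s, int d, int q) {
        return s + 2 * d + 4 * q;
    }

    static boolean fits(int s, int d, int q) {
        return s >= 0 && d >= 0 && q >= 0 && cost(s, d, q) <= POOL;
    }
}
```

For example, 32 S + 32 D + 8 Q costs 32 + 64 + 32 = 128 units and just fits, while 64 S + 32 D + 1 Q costs 132 and does not.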
  • 15. SIMT EXECUTION MODEL
      HSAIL presents a “SIMT” execution model to the programmer
        “Single Instruction, Multiple Thread”
        Programmer writes program for a single thread of execution
        Each work-item appears to have its own program counter
        Branch instructions look natural
      Hardware Implementation
        Most hardware uses SIMD (Single-Instruction Multiple Data) vectors for efficiency
        Actually one program counter for the entire SIMD instruction
        Branches implemented with predication
      SIMT Advantages
        Easier to program (branch code in particular)
        Natural path for mainstream programming models
        Scales across a wide variety of hardware (programmer doesn’t see vector width)
        Cross-lane operations available for those who want peak performance
  • 16. WAVEFRONTS
      Hardware SIMD vector, composed of 1, 2, 4, 8, 16, 32, or 64 “lanes”
      Lanes in wavefront can be “active” or “inactive”
      Inactive lanes consume hardware resources but don’t do useful work
      Tradeoffs
        “Wavefront-aware” programming can be useful for peak performance
        But results in less portable code (since wavefront width is encoded in algorithm)
      Example:
        if (cond) {
            operationA;   // cond=True lanes active here
        } else {
            operationB;   // cond=False lanes active here
        }
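To make the active/inactive lane idea concrete, here is a small CPU-side simulation of a predicated if/else. It is a sketch under assumptions: `WavefrontSim` is a made-up name, operationA is taken to be "add 1" and operationB "multiply by 2", and a real wavefront would run the lanes in lock step rather than in a loop.

```java
// Simulates one wavefront executing "if (cond) a + 1 else a * 2" with predication.
class WavefrontSim {
    static int[] run(int[] data, boolean[] cond) {
        int lanes = data.length;
        int[] out = new int[lanes];
        // Pass 1: the "then" side is issued; only lanes with cond == true are active.
        for (int lane = 0; lane < lanes; lane++) {
            if (cond[lane]) {
                out[lane] = data[lane] + 1;
            }
        }
        // Pass 2: the "else" side is issued; only lanes with cond == false are active.
        for (int lane = 0; lane < lanes; lane++) {
            if (!cond[lane]) {
                out[lane] = data[lane] * 2;
            }
        }
        return out;
    }

    // Fraction of issued lane slots that did useful work for this branch.
    static double utilization(boolean[] cond) {
        int taken = 0;
        for (boolean c : cond) {
            if (c) {
                taken++;
            }
        }
        // A side of the branch is issued only if at least one lane needs it.
        int passes = (taken > 0 ? 1 : 0) + (taken < cond.length ? 1 : 0);
        return passes == 0 ? 1.0 : 1.0 / passes;
    }
}
```

When the lanes disagree on the condition, both sides are issued and each lane is idle for one of them, which is why divergent branches halve utilization in this model.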
  • 17. HSA ENABLEMENT OF LANGUAGES, FRAMEWORKS, LIBRARIES AND RUNTIMES
  • 18. HSA AND OPENCL™
      HSA is an optimized platform architecture for OpenCL™
        Not an alternative to OpenCL™
      OpenCL™ on HSA will benefit from
        Avoidance of wasteful copies
        Low latency dispatch
        Improved memory model
        Pointers shared between CPU and GPU
      OpenCL™ 2.0 shows considerable alignment with HSA
      Many HSA member companies are also active with Khronos in the OpenCL™ working group
  • 19. HSA AND OPENMP®
      OpenMP®
        Established
        Portable
        Scalable (desktop to supercomputer)
        Simple
        Flexible
      HSA enablement brings:
        GPU performance
        Energy efficiency
      …to established developer community
  • 20. BOLT: A C++ PARALLEL PRIMITIVES LIBRARY FOR HSA
      Allow C++ developers to leverage the power efficiency of GPU computing
        Common routines such as scan, sort, reduce, transform
        More advanced routines like heterogeneous pipelines
        Bolt library works with OpenCL and C++ AMP
      Enjoy the unique advantages of the HSA platform
        Move the computation not the data
      A single source code base for the CPU and GPU!
        Developers can focus on core algorithms
      https://github.com/HSA-Libraries/Bolt
  • 21. HSA ENABLEMENT OF JAVA™
      Why Java™?
        9 Million Developers
        1 Billion Java downloads per year
        97% of enterprise desktops run Java
        100% of Blu-ray players ship with Java
        http://oracle.com.edgesuite.net/timeline/java/
      Java™ 8 language/libraries include concurrency features
        primitives (threads, locks, monitors, atomic ops)
        libraries (fork/join, thread pools, executors, futures)
        support for ‘lambda’ based Stream APIs
      JIT (Just In Time) architecture ideal for generating and executing HSAIL.
      Project ‘Sumatra’ targets GPU JIT generation/execution in the 2015 Java™ 9 timeframe.
  • 22. APARAPI: INITIAL JAVA ENABLEMENT (2011)
      Aparapi API for expressing data parallel workloads
        Developer uses common Java™ patterns and idioms
        Java source compiled to bytecode using standard compiler (javac)
        Aparapi runtime capable of converting bytecode to OpenCL™
        Execution on OpenCL™ 1.1+ capable devices (GPUs/APUs), OR
        execution via a Java thread pool if OpenCL™ is not available
      Open Source project
        ~20 contributors
        >7000 downloads
        ~150 visits per day
      [Diagram: Java Application, APARAPI API, JVM, OpenCL Source, OpenCL Runtime, CPU ISA and GPU ISA, CPU and GPU]
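The thread-pool fallback mentioned on this slide can be pictured with plain Java. The sketch below is not the Aparapi API: `RangeKernel` and `ThreadPoolFallback` are hypothetical names, and the striped distribution of ids over threads is just one reasonable assumption about how such a fallback might divide the work.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for a data-parallel kernel body; not the Aparapi Kernel class.
interface RangeKernel {
    void run(int globalId);
}

class ThreadPoolFallback {
    // Runs kernel.run(id) for every id in [0, range) on a fixed thread pool.
    static void execute(RangeKernel kernel, int range) {
        final int threads = Math.max(1, Runtime.getRuntime().availableProcessors());
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                final int first = t;
                // Each task handles ids first, first + threads, first + 2 * threads, ...
                pending.add(pool.submit(() -> {
                    for (int id = first; id < range; id += threads) {
                        kernel.run(id);
                    }
                }));
            }
            for (Future<?> f : pending) {
                f.get();                                  // wait for completion, surface failures
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("kernel execution failed", e);
        } finally {
            pool.shutdown();
        }
    }

    // Usage example: every work-item squares its own global id.
    static int[] squares(int n) {
        int[] out = new int[n];
        execute(id -> out[id] = id * id, n);
        return out;
    }
}
```

The kernel body is written once per work-item, so the same lambda could in principle be handed to a GPU backend or to this CPU fallback without change.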
  • 23. SUMATRA PROJECT (JAVA 9 2015)
      AMD/Oracle sponsored Open Source (OpenJDK) project
      Targeted at Java 9 (2015 release)
      Allow developers to efficiently represent data parallel algorithms in Java using Stream API + Lambda expressions
      Sumatra is not pushing a new ‘programming model’
        Instead we ‘repurpose’ Java 8’s new Stream API/Lambda to enable both CPU and GPU computing
      A Sumatra enabled Java Virtual Machine will dispatch ‘selected’ constructs to HSA enabled devices at runtime.
      Developers are already refactoring the JDK to use stream + lambda
        – So anyone using existing JDK should see GPU acceleration without any code changes.
      http://openjdk.java.net/projects/sumatra/
      https://wikis.oracle.com/display/HotSpotInternals/Sumatra
      http://mail.openjdk.java.net/pipermail/sumatra-dev/
      [Diagram: Java Application, Java JDK Stream + Lambda API, JVM, Java GRAAL JIT backend, HSAIL, HSA Finalizer & Runtime, CPU ISA and GPU ISA, CPU and GPU]
  • 24. HSA ENABLEMENT OF JAVA
      Java 7 – OpenCL enabled Aparapi
        • AMD initiated Open Source project
        • APIs for data parallel algorithms
            GPU accelerate Java applications
            No need to learn OpenCL
        • Active community captured mindshare
            ~20 contributors
            >7000 downloads
            ~150 visits per day
        [Diagram: Java Application, APARAPI API, JVM, OpenCL Source, OpenCL Runtime, CPU ISA and GPU ISA, CPU and GPU]
      Java 8 – HSA enabled Aparapi
        • Java 8 brings Stream + Lambda API
            More natural way of expressing data parallel algorithms
            Initially targeted at multi-core
        • APARAPI will:
            Support Java 8 Lambdas
            Dispatch code to HSA enabled devices at runtime via HSAIL
        [Diagram: Java Application, APARAPI + Lambda API, JVM, HSAIL, HSA Finalizer & Runtime, CPU ISA and GPU ISA, CPU and GPU]
      Java 9 – HSA enabled Java (Sumatra)
        • Adds native GPU compute support to Java Virtual Machine (JVM)
        • Developer uses JDK provided Lambda + Stream API
        • JVM uses GRAAL compiler to generate HSAIL
        • JVM decides at runtime to execute on either CPU or GPU depending on workload characteristics
        [Diagram: Java Application, Java JDK Stream + Lambda API, JVM, Java GRAAL JIT backend, HSAIL, HSA Finalizer & Runtime, CPU ISA and GPU ISA, CPU and GPU]
      We plan to provide HSA Enabled Aparapi (Java 8) as a bridge technology between OpenCL based Aparapi (Java 7) and HSA Enabled Sumatra (Java 9)
  • 25. A JAVA EXAMPLE
      class Team {
          private int scores;
          public int getScores() { return scores; }
          // Setters omitted for brevity
      }

      class Player {
          private Team team;
          private int scores;
          private float pctOfTeamScores;
          public Team getTeam() { return team; }
          public int getScores() { return scores; }
          public void setPctOfTeamScores(float pct) { pctOfTeamScores = pct; }
          // Setters omitted for brevity
      }

      Player[] allPlayers = … // Code to initialize array of Players omitted

      // HSA enabled Sumatra:
      Arrays.stream(allPlayers).parallel().forEach(p -> {
      // HSA enabled Aparapi uses this opening line instead:
      // Device.hsa().forEach(allPlayers, p -> {
          int teamScores = p.getTeam().getScores();
          float pctOfTeamScores = (float) p.getScores() / (float) teamScores;
          p.setPctOfTeamScores(pctOfTeamScores);
      });
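For readers who want to run the slide's example on a stock JVM, the following is a self-contained variant using only the standard Java 8 Stream API on the CPU. The constructors, the `getPctOfTeamScores` getter, the `PctOfTeamScoresDemo` class and the sample scores are all additions made for illustration; `Device.hsa()` from the HSA enabled Aparapi line is not used here.

```java
import java.util.Arrays;

class Team {
    private final int scores;
    Team(int scores) { this.scores = scores; }
    public int getScores() { return scores; }
}

class Player {
    private final Team team;
    private final int scores;
    private float pctOfTeamScores;

    Player(Team team, int scores) {
        this.team = team;
        this.scores = scores;
    }

    public Team getTeam() { return team; }
    public int getScores() { return scores; }
    public float getPctOfTeamScores() { return pctOfTeamScores; }
    public void setPctOfTeamScores(float pct) { pctOfTeamScores = pct; }
}

class PctOfTeamScoresDemo {
    static Player[] run() {
        Team team = new Team(200);                        // made-up sample data
        Player[] allPlayers = { new Player(team, 50), new Player(team, 150) };
        // Same lambda body as the slide; each player is updated independently.
        Arrays.stream(allPlayers).parallel().forEach(p -> {
            int teamScores = p.getTeam().getScores();
            float pctOfTeamScores = (float) p.getScores() / (float) teamScores;
            p.setPctOfTeamScores(pctOfTeamScores);
        });
        return allPlayers;
    }
}
```

Each lambda invocation touches only its own `Player`, which is what makes the loop safe to run in parallel on either a CPU thread pool or a GPU.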
  • 26. HSAIL CODE EXAMPLE (SUMATRA)
      version 0:95: $full : $large;
      // We pass underlying array of Players to the kernel
      kernel &run (
          kernarg_u64 %_arg0                   // Array of players passed as arg
      ) {
          ld_kernarg_u64 $d6, [%_arg0];        // Move arg to an HSAIL register
          workitemabsid_u32 $s2, 0;            // Read the work-item global id (gid)

          cvt_u64_s32 $d2, $s2;                // Convert gid to long
          mul_u64 $d2, $d2, 8;                 // Stride of 8 bytes per array element (object reference)
          add_u64 $d2, $d2, 24;                // Skip array object header (24 bytes)
          add_u64 $d2, $d2, $d6;               // $d2 now points to players[gid]
          ld_global_u64 $d6, [$d2];            // Load Player p from players[gid]
          ld_global_u64 $d0, [$d6 + 120];      // p.getTeam() inlined
          ld_global_s32 $s3, [$d6 + 40];       // p.getScores() inlined
          cvt_f32_s32 $s16, $s3;               // cast to (float)
          ld_global_s32 $s0, [$d0 + 24];       // Team getScores() inlined
          cvt_f32_s32 $s17, $s0;               // cast to (float)
          div_f32 $s16, $s16, $s17;            // p.getScores()/teamScores inlined
          st_global_f32 $s16, [$d6 + 100];     // p.setPctOfTeamScores() inlined
          ret;
      }
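The first few HSAIL instructions compute the address of players[gid]. The sketch below restates that arithmetic in Java; the 24-byte header and 8-byte stride are taken from the listing, while the class name and the sample base address are made up. Field offsets such as 120, 40 and 100 depend on the JVM's object layout and are not modelled.

```java
// Mirrors the cvt/mul/add sequence that locates players[gid] in the listing.
class ArrayAddressing {
    static final long ARRAY_HEADER_BYTES = 24;   // "Skip array object header" in the listing
    static final long REFERENCE_BYTES = 8;       // operand of mul_u64 in the listing

    static long elementAddress(long arrayBase, int gid) {
        long offset = (long) gid * REFERENCE_BYTES;       // cvt_u64_s32 + mul_u64
        offset += ARRAY_HEADER_BYTES;                     // add_u64 ..., 24
        return arrayBase + offset;                        // add_u64 ..., $d6
    }
}
```

With a hypothetical array base of 1000, work-item 0 loads its reference from address 1024 and work-item 3 from address 1048.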
  • 27. A CASE STUDY CENTERED ON NBODY
      A Java developer implementing NBody would probably…

      // Create a class to represent each body
      class Body {
          float x, y, z, m, vx, vy, vz;
          // Include method to update position and display
          void updateAndShow(Screen screen, Body[] bodies) {
              // omitted vars for accumulating forces
              for (Body other : bodies) {
                  // accumulate forces between other and this
              }
              // update vx, vy, vz, x, y and z from accumulated data
              screen.paint(x, y, z);
          }
      }

      // Assuming bodies[] is an initialized array of Body
      // We can update and display each one in turn
      for (Body b : bodies)
          b.updateAndShow(screen, bodies);
  • 28. WITHOUT HSA WE CAN’T USE OBJECTS
      Java does not guarantee contiguous allocation of objects in arrays
        Only arrays of primitives (long, float etc.) are allocated contiguously
      Non HSA enabled Java GPU frameworks force developers to either
        Abandon Object Oriented solutions and revert to parallel primitive arrays
        Or… add scatter/gather (costly copies) behind the scenes

      // Create and populate parallel arrays of primitives
      float x[], y[], z[], m[], vx[], vy[], vz[];
      // Treat x[n], y[n], z[n] etc. as the state of Body[n]
      Kernel k = new Kernel() {
          @Override public void run() {
              int i = getGlobalId();   // index of the body handled by this work-item
              // vars for accumulating state not shown
              for (int j = 0; j < bodies; j++) {
                  // accum forces between (x,y,z)[j] and (x,y,z)[i]
              }
              // update vx[i], vy[i], vz[i], x[i], y[i] and z[i]
          }
      };
      k.execute(bodies);
  • 29. HSA ENABLEMENT ALLOWS NATURAL JAVA REPRESENTATIONS
      HSA version of Aparapi and Sumatra can deal with Java objects

      class Body {
          float x, y, z, m, vx, vy, vz;
          void updateAndShow(Screen screen, Body[] bodies) {
              // hidden vars for accumulating forces
              for (Body other : bodies) {
                  // accumulate forces between other and this
              }
              // update vx, vy, vz, x, y and z from accumulated data
              screen.paint(x, y, z);
          }
      }

      Then loop over the array, updating and displaying the bodies.

      // Sumatra solution:
      Arrays.stream(bodies).parallel().forEach(b -> {
      // HSA enabled Aparapi solution uses this opening line instead:
      // Device.hsa().forEach(bodies, b -> {
          b.updateAndShow(screen, bodies);
      });
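A runnable CPU-only version of the object-based NBody step might look like the sketch below. Several things are assumptions rather than the presenter's code: the force law (Newtonian gravity with G = 1 and a small softening term), the split into a velocity phase and a position phase so that the parallel loop has no data race, and the omission of the `Screen` painting call. Only the standard Java 8 Stream API is used.

```java
import java.util.Arrays;

class Body {
    static final float SOFTENING = 1e-3f;   // assumed; avoids division by zero for close bodies
    float x, y, z, m, vx, vy, vz;

    Body(float x, float y, float z, float m) {
        this.x = x;
        this.y = y;
        this.z = z;
        this.m = m;
    }

    // Phase 1: reads every body's position and mass, writes only this body's velocity.
    void updateVelocity(Body[] bodies, float dt) {
        float ax = 0f, ay = 0f, az = 0f;
        for (Body other : bodies) {
            if (other == this) {
                continue;
            }
            float dx = other.x - x, dy = other.y - y, dz = other.z - z;
            float distSq = dx * dx + dy * dy + dz * dz + SOFTENING;
            float invDist = 1.0f / (float) Math.sqrt(distSq);
            float s = other.m * invDist * invDist * invDist;
            ax += dx * s;
            ay += dy * s;
            az += dz * s;
        }
        vx += ax * dt;
        vy += ay * dt;
        vz += az * dt;
    }

    // Phase 2: writes only this body's position.
    void updatePosition(float dt) {
        x += vx * dt;
        y += vy * dt;
        z += vz * dt;
    }
}

class NBodyStep {
    static void step(Body[] bodies, float dt) {
        Arrays.stream(bodies).parallel().forEach(b -> b.updateVelocity(bodies, dt));
        Arrays.stream(bodies).parallel().forEach(b -> b.updatePosition(dt));
    }
}
```

The bodies stay ordinary heap objects throughout; no parallel primitive arrays or copies are needed, which is the point the slide makes about shared virtual memory.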
  • 30. SUMATRA + HSA ENABLED APARAPI PERFORMANCE
      NBody implemented as an array of Objects, on early access HSA enabled hardware and software.
      [Chart: intra-body interactions per microsecond (higher is better) versus number of bodies]
      Data labels: 12.3 x perf (1.48 x power); 10.6 x perf (1.44 x power); 7.9 x perf (1.35 x power)
  • 31. HSA ENABLEMENT OF JVM CAN ACCELERATE OTHER JVM BASED LANGUAGES
      Java 9 – 3Q2015: HSA enabled Java (Sumatra)
        • Adds native GPU compute support to Java Virtual Machine (JVM)
        • Developer uses JDK provided Lambda + Stream API
        • JVM uses GRAAL compiler to generate HSAIL
        • JVM decides at runtime to execute on either CPU or GPU depending on workload characteristics
        [Diagram: Java Application, Java JDK Stream + Lambda API, JVM, Java GRAAL JIT backend, HSAIL, HSA Finalizer & Runtime, CPU ISA and GPU ISA, CPU and GPU]
      Java 9 + 2016?: HSA enablement of other JVM based languages/frameworks
        • Developer uses their preferred Truffle based language (R, JavaScript, Python, Ruby, etc.)
        • JVM uses Truffle + GRAAL compiler to generate HSAIL
        • HSA acceleration beyond Java
        [Diagram: R APP, JavaScript APP, Ruby APP, Python APP, Truffle, JVM, Java GRAAL JIT backend, HSAIL, HSA Finalizer & Runtime, CPU ISA and GPU ISA, CPU and GPU]
  • 32. TAKEAWAYS
      HSA brings GPU computing to mainstream programming models
        Open standard for emerging parallel compute platforms
        Shared and coherent memory bridges “faraway accelerator” gap
        HSAIL provides the common IL for high-level languages to benefit from parallel computing
      HSAIL Key Points
        Thin, robust, fast finalizer
        Portable (multiple HW vendors and parallel architectures)
        Supports shared virtual memory and platform atomics
      Java Enablement
        Can access Objects on Java’s heap thanks to ‘Shared Virtual Memory’
        Leverages Java 8 Lambda and Stream APIs intended for multicore
        Gateway to enabling other JVM based languages
  • 33. TOOLS ARE AVAILABLE NOW
      HSA Programmer’s Reference Manual: HSAIL Virtual ISA and Programming Model, Compiler Writer’s Guide, and Object Format (BRIG)
        http://hsafoundation.com/standards/
        https://hsafoundation.box.com/s/m6mrsjv8b7r50kqeyyal
      Tools now at GitHub – HSA Foundation
        libHSA Assembler and Disassembler
          https://github.com/HSAFoundation/HSAIL-Tools
        HSAIL Instruction Set Simulator
          https://github.com/HSAFoundation/HSAIL-Instruction-Set-Simulator
      Soon: LLVM Compilation stack which outputs HSAIL and BRIG
      Java enablement via HSAIL (preliminary)
        http://openjdk.java.net/projects/sumatra/
        http://openjdk.java.net/projects/graal/
        http://aparapi.googlecode.com/