© Copyright 2012-2016 HSA Foundation. All Rights Reserved. 1
ENABLING EFFICIENT HETEROGENEOUS
PROCESSING THROUGH COHERENCY: AN
HSA FOUNDATION UPDATE
EMBEDDED VISION ALLIANCE MEMBER MEETING
DR. JOHN GLOSSNER, PRESIDENT, HSA FOUNDATION / CEO GPT-US
HARMONIZING THE INDUSTRY AROUND
HETEROGENEOUS COMPUTING
AGENDA
Heterogeneous Programming
Problem
About HSA
 Founding
 Member Companies
 Open / Royalty Free Solutions
HSA Solution
 Hardware
 Software
 Infrastructure
 HSAIL
Portable Applications
Programming
 C/C++, Python, OpenCL
Performance Results
 AMD
Products and Announcements
 AMD, GPT, Imagination, MediaTek
What’s Next
THE PROBLEM
HETEROGENEOUS APPLICATION DEVELOPMENT
WHAT IS A HETEROGENEOUS SYSTEM?
A CPU+ System
 +GPU
 +Vision Processors
 +DSP
 +FPGA
 +Accelerators
Typically
 Different development tools
 Different memory spaces
 Communication via I/O only (data copies)
[Figure: CPU 1 … CPU N, GPU 1 … GPU M, DSP, and other accelerators (ACC…) sharing Unified Coherent Memory]
WHAT’S THE PROBLEM?
Heterogeneous processors are widely available
Huge compute capability
 Acceleration Units (GPU, DSP, FPGA)
 CPU Cluster-based computer
Coherency
 Established in high-end
 Migrating to mainstream mobile and consumer
BUT…
Heterogeneous programming models not standardized
Multi-core/device applications difficult to optimize or scale
Non-portable application developer ecosystems
HSAF brings compute app abstraction to heterogeneous platforms
HSA TECHNOLOGY
Developing a new platform for heterogeneous systems
 Reducing Heterogeneous System Complexity
 Provides software ecosystem
Abstracts away complexities of heterogeneous systems
 Cache coherent shared virtual memory hardware
 Removes time consuming operating system calls
 Runs at user level
Exploiting Compute Capabilities
 Single source programming
 Control and compute code reside in the same file or project
ABOUT HSA
HETEROGENEOUS SYSTEM ARCHITECTURE FOUNDATION
HSA FOUNDATION
A Non-Profit Foundation Founded in June 2012
 Programming heterogeneous systems (“CPU +” era)
Industry standards body
 V1.1 Specifications released May 2016
 Backward compatibility with V1.0 hardware
First compatible hardware
 AMD
Measured Performance Improvements
HSA – AN OPEN PLATFORM
Open Architecture, membership open to all
 HSA Programmers Reference Manual
 HSA Platform System Architecture
 HSA Runtime
 HSA Multivendor Specification
Royalty Free
 IP, Specifications, and APIs
Open Source
 Tools, Compilers, etc.
 Runtime implementations
 Tests
MEMBERS DRIVING HSA
Founders
Promoters
Supporters
Contributors
Academic
HSAF HARDWARE CONTRIBUTIONS
JIM MCGREGOR, TIRIAS RESEARCH
…HSAF has had a profound impact on hardware architectures … even Intel’s
Cache Coherency
Unified memory (Shared Virtual Memory)
THE PLATFORM PILLARS OF HSA
Unified memory (SVM)
User mode dispatch
Platform atomics
Architected Signals
Formal Relaxed Memory Model
Cache Coherency
Quality Of Service
Some non-HSA platforms support a few of these platform features
In combination they form a well-rounded base for application programmability
HSAF SOFTWARE
INFRASTRUCTURE
THE VISION
Make Heterogeneous Programming Much Easier
1. Single source programming (Single tool chain)
2. Any programming language (C++, Python, JavaScript, …)
3. Eliminate data copies (Performance!)
4. Common address space (A pointer is a pointer)
5. Standardized command submission to Agents (GPU / DSP) (A common dispatch language)
6. Eliminate software layers between application and hardware (Efficient)
7. ISA agnostic for CPU, GPU, DSP, and more (x86, ARM, MIPS, PowerVR, Mali, Adreno, GPT, …)
8. Open source software stack (Open Access!)
High performance
Low power
Extensible to other accelerators on the SoC
MOTIVATION (TODAY’S PICTURE)
[Figure: job flow across Application, OS, and Agent (GPU/DSP): Transfer buffer to GPU, Copy/Map Memory, Queue Job, Schedule Job, Start Job, Finish Job, Schedule Application, Get Buffer, Copy/Map Memory]
WITH SHARED VIRTUAL MEMORY
[Figure: job flow across Application, OS, and Agent (GPU/DSP): Transfer buffer to GPU, Copy/Map Memory, Queue Job, Schedule Job, Start Job, Finish Job, Schedule Application, Get Buffer, Copy/Map Memory]
WITH COHERENT CACHE MEMORY
[Figure: job flow across Application, OS, and Agent (GPU/DSP): Transfer buffer to GPU, Copy/Map Memory, Queue Job, Schedule Job, Start Job, Finish Job, Schedule Application, Get Buffer, Copy/Map Memory]
SIGNALS
HSA agents support signaling
 Creation/destruction using runtime APIs
Any Agent can access signals
 Wake up agents waiting upon the object
 Query/Wait for current object
 Allows conditions
Hardware-assisted signaling and synchronization primitives
 Memory semantics synchronizes work items processed by HSA agents
 Synchronizes execution between threads on HSA agents and host CPU
One-to-one and one-to-many signaling
 System software, runtime & application SW use infrastructure to build higher-level synchronization primitives like mutexes, semaphores, …
Advantages
 Asynchronous events between agents
 Doesn’t require CPU
 Common idiom for work offload
 Low power waiting
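The signaling behaviour listed above can be sketched in plain Python. This is a toy model only: the `Signal` class, its method names, and `run_offload` are hypothetical and are not the HSA runtime API. A signal is modelled as a shared integer that agents decrement and that a waiter blocks on until a condition holds.

```python
import threading


class Signal:
    """Toy model of an HSA-style signal (hypothetical, not the HSA runtime API)."""

    def __init__(self, initial):
        self._value = initial
        self._cond = threading.Condition()

    def subtract(self, amount=1):
        # Update the value and wake every waiter so each can re-check
        # its own condition (one-to-many signaling).
        with self._cond:
            self._value -= amount
            self._cond.notify_all()

    def wait_until(self, predicate, timeout=None):
        # Block without busy-waiting until predicate(value) holds.
        # Returns the observed value, or None on timeout.
        with self._cond:
            ok = self._cond.wait_for(lambda: predicate(self._value), timeout)
            return self._value if ok else None


def run_offload(num_jobs=4):
    completion = Signal(num_jobs)  # one count per outstanding job
    results = []
    lock = threading.Lock()

    def agent(job_id):
        # Stand-in for a kernel running on a GPU/DSP agent.
        with lock:
            results.append(job_id * job_id)
        completion.subtract(1)  # signal that this job has finished

    workers = [threading.Thread(target=agent, args=(i,)) for i in range(num_jobs)]
    for worker in workers:
        worker.start()
    final = completion.wait_until(lambda value: value == 0, timeout=5.0)
    for worker in workers:
        worker.join()
    return final, sorted(results)


if __name__ == "__main__":
    print(run_offload())
```

The host thread here plays the waiting side, but nothing in the model requires that: any thread could call `wait_until`, which is the property the slide describes as not requiring the CPU.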
WITH SIGNALING
[Figure: job flow across Application, OS, and Agent (GPU/DSP): Transfer buffer to GPU, Copy/Map Memory, Queue Job, Schedule Job, Start Job, Finish Job, Schedule Application, Get Buffer, Copy/Map Memory]
HSA QUEUING MODEL
User mode queuing
 Low latency dispatch
 Application dispatches directly
 No OS or driver required
Architected Queuing Language (AQL)
 Single compute dispatch path for all hardware
 No driver translation, direct to hardware
 Standard across vendors!
 Guaranteed backward compatibility
Allows for dispatch to queue from any agent
 CPU or GPU or DSP or FPGA, etc.
Agent self enqueue enables
 Recursion, Tree traversal, Wavefront reforming
Requires coherency and shared virtual memory
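A minimal sketch of the queuing model above, written as a single-threaded Python toy. The names `DispatchPacket`, `UserModeQueue`, `doorbell`, and `traverse` are hypothetical and chosen for illustration; a real implementation would use atomic index updates and hardware packet processing. The sketch shows a ring buffer with read and write indices, and a kernel that enqueues follow-up work into the same queue (agent self-enqueue) to walk a tree.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DispatchPacket:
    kernel: Callable  # stand-in for a kernel code handle
    args: tuple


class UserModeQueue:
    """Toy ring buffer: producers advance write_index, the consumer read_index."""

    def __init__(self, size):
        if size <= 0 or size & (size - 1):
            raise ValueError("size must be a power of two")
        self.size = size
        self.slots = [None] * size
        self.write_index = 0
        self.read_index = 0
        self.doorbell = -1  # index of the last packet announced to the consumer

    def enqueue(self, packet):
        # Refuse the packet when every slot is still in use.
        if self.write_index - self.read_index >= self.size:
            return False
        index = self.write_index
        self.slots[index % self.size] = packet
        self.write_index = index + 1
        self.doorbell = index  # "ring the doorbell"
        return True

    def process(self):
        # Consumer side: run packets until the doorbell value is reached.
        # The doorbell is re-read each iteration, so packets enqueued by a
        # running kernel are picked up as well.
        while self.read_index <= self.doorbell:
            packet = self.slots[self.read_index % self.size]
            packet.kernel(self, *packet.args)
            self.read_index += 1


def traverse(tree, queue_size=4):
    """Visit a (value, children) tree using self-enqueued packets."""
    visited = []

    def visit(queue, node):
        value, children = node
        visited.append(value)
        for child in children:
            if not queue.enqueue(DispatchPacket(visit, (child,))):
                raise RuntimeError("queue full")

    queue = UserModeQueue(queue_size)
    queue.enqueue(DispatchPacket(visit, (tree,)))
    queue.process()
    return visited


if __name__ == "__main__":
    tree = (1, [(2, [(4, []), (5, [])]), (3, [])])
    print(traverse(tree))
```

Because the queue is first-in first-out, the self-enqueued packets produce a breadth-first walk of the tree. No call in the enqueue path leaves the application, which is the point of user mode queuing.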
WITH USER MODE QUEUING
[Figure: job flow across Application CPU, OS, and Agent (GPU/DSP): Transfer buffer to GPU, Copy/Map Memory, Queue Job, Schedule Job, Start Job, Finish Job, Schedule Application, Get Buffer, Copy/Map Memory]
FINAL PICTURE: SVM + CACHE COHERENCY +
SIGNALS + USER MODE QUEUES
[Figure: job flow across Application, OS, and Agent (GPU/DSP): Queue Job, Start Job, Finish Job]
HSA COMMAND AND DISPATCH FLOW
[Figure: Applications A, B, and C; an optional dispatch buffer; one hardware queue per application (A, B, C) holding that application's packets; the agent hardware]
HW view:
 HW / microcode controlled
 HW scheduling
 Architected Queuing Language (AQL)
 HW-managed protection
SW view:
 User-mode dispatches to HW
 No OS Driver overhead
 Low dispatch times
 Host & Kernel Agent dispatch APIs
HSA INTERMEDIATE LANGUAGE (HSAIL)
BYTECODE FOR HETEROGENEOUS SYSTEMS
THE PORTABILITY CHALLENGE
CPU ISAs – Backwards Compatible
 ISA innovations added incrementally (e.g., NEON, AVX)
 ISA retains backwards-compatibility with previous generation
 HSA instruction-set architectures: ARM, GPT, MIPS, and x86
Kernel Agent ISAs – No Backwards Compatibility
 GPU, DSP, DNN, Image Signal Processor, Custom Accelerators, etc.
 Massive diversity of architectures in the market
 Each vendor has own ISA - and often several in market at same time
 Compatibility via APIs (OpenGL, DirectX, OpenCV)
HSA INTERMEDIATE LAYER — HSAIL
Virtual ISA for parallel programs
 Finalized to native ISA by a compiler
 Dynamic or Offline
 ISA independent by design
Explicitly parallel
 Designed for data parallel programming
Multiple HLL Support
 Exceptions, virtual functions, etc.
 Java, C++, OpenMP, Python, etc.
main() {
    …
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
    }
    …
}
[Figure: the high-level compiler emits BRIG and host ISA; the finalizer translates BRIG into the component ISA]
HSAIL FEATURES
A Virtual Explicitly Parallel ISA
 ~135 Opcodes
 RISC Register-based Load/Store
 Arithmetic
 IEEE 754 Floating Point including 16-bit
 Integer (32/64-bit)
 DSP fixed point
 Packed / SIMD
 f16x2, f16x4, f16x8, f32x2, f32x4, f64x2
 signed/unsigned 8x4, 8x8, 8x16, 16x2, 16x4, 16x8, 32x2, 32x4, 64x2
 Branches & Function Calls
 Atomic Operations
Wavefronts
 1, 2, 4, 8, 16, 32, or 64 SIMD lanes
 Lanes can be active or inactive
Memory
 Shared Virtual Memory
Exceptions
ld_global_u64 $d0, [$d6 + 120] ; $d0= load($d6+120)
add_u64 $d1, $d0, 24 ; $d1= $d0+24
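To make the register-based load/store style concrete, the two instructions shown on the slide can be emulated in a few lines of Python. This is a sketch under stated assumptions: the helper names, the dictionary register file, the flat `bytearray` memory, and the little-endian byte order are all choices made for this illustration, not part of any HSAIL tool.

```python
MASK64 = (1 << 64) - 1  # u64 arithmetic wraps modulo 2**64


def ld_global_u64(regs, memory, dest, base, offset):
    # dest = 64-bit load from address (base register + offset).
    # Little-endian layout is an assumption of this sketch.
    address = (regs[base] + offset) & MASK64
    regs[dest] = int.from_bytes(memory[address:address + 8], "little")


def add_u64(regs, dest, src, imm):
    # dest = src + immediate, truncated to 64 bits.
    regs[dest] = (regs[src] + imm) & MASK64


def run_example():
    memory = bytearray(256)
    memory[120:128] = (1000).to_bytes(8, "little")  # value stored at address 120
    regs = {"$d0": 0, "$d1": 0, "$d6": 0}
    ld_global_u64(regs, memory, "$d0", "$d6", 120)  # ld_global_u64 $d0, [$d6 + 120]
    add_u64(regs, "$d1", "$d0", 24)                 # add_u64 $d1, $d0, 24
    return regs


if __name__ == "__main__":
    print(run_example())
```

With 1000 stored at address 120 and `$d6` set to 0, the load places 1000 in `$d0` and the add leaves 1024 in `$d1`.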
PORTABLE APPLICATIONS PROGRAMMING
FROM OPENCL TO C++17
HSA OPEN SOURCE SOFTWARE
Full open source Linux stack: tools, compilers and OS support
 Allows a single shared implementation for many components
 Enables university research and industry collaboration in all areas
 Because it’s the right thing to do
Many open source applications & frameworks
 Native Languages: HCC (C++17), LLVM, GCC, CLOC/SNACK, Python, Java, …
 Tools, APIs, Frameworks: CodeXL, POCL, Docker, OpenMP, OKRA, HIP, …
 Research: Multi2sim, HSAEmu, gem5, ViennaCL, …
 And many applications using OCL 2.x or HSA stack
 Github & Bitbucket repositories have much, much more…
gccbrig
 Any processor with a gcc machine description can finalize HSAIL
ARCHITECTED PROFILING AND DEBUGGING
Profiling
• Common timeline across HSA accelerators & system
• Common HSA hardware events (+ HW specific)
• Common HSA profiling counter definitions (+ HW specific counters)
• Consistent profiling methodology for all HSA accelerators
Debugging
• Breakpoints
• Exception handling
• Single-step
• Tracing
• HSAIL Disassembly
• Emulation support
• Libraries
• Plugins
Python on GPUs
Numba: NumPy-aware Python compiler
 Open source. Available on GitHub
 Sponsored by Continuum Analytics
Direct HSA Support
Automatic Parallelization
 2x-200x speedup
PERFORMANCE RESULTS
Python Geographic Locality
What is the distance from a set of points to a target point?
 How many points are within a specified range?
Numba can auto-parallelize user universal functions for HSA
 Ufuncs broadcast an operation over the elements of a NumPy array
 ZERO HSA developer knowledge required
1M Points
 >8X speedup
https://github.com/ContinuumIO/Numba-HSA-Webinar
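The locality query described above reduces to one scalar function applied to every point, which is the shape of computation a ufunc broadcasts over an array. The sketch below uses only the standard library and a planar Euclidean distance; both choices, and the function names, are assumptions made for illustration. The benchmark itself may use a different distance formula, and the code in the linked repository is not reproduced here.

```python
import math


def distance(x, y, target_x, target_y):
    # Scalar kernel: distance from one point to the target.
    # Planar Euclidean distance is an assumption of this sketch.
    return math.hypot(x - target_x, y - target_y)


def count_within(points, target, max_range):
    # Apply the scalar kernel to every point and count the hits.
    # A ufunc would perform this element-wise step over a whole array.
    target_x, target_y = target
    return sum(
        1 for (x, y) in points
        if distance(x, y, target_x, target_y) <= max_range
    )


if __name__ == "__main__":
    points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (1.0, 1.0)]
    print(count_within(points, target=(0.0, 0.0), max_range=5.0))
```

Each call to `distance` is independent of the others, which is what makes this workload straightforward to parallelize.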
GEN1: FIR & AES
FIR is a memory-intensive streaming workload
AES is a compute-intensive streaming workload
CL12 – cl_mem buffer
 Copy to/from the device
CL20 – SVM buffer – Coarse Grain Sync
 Copy to/from SVM
 Data copy cannot be avoided, since the space for SVM is limited
HSA – Unified Memory Space – Fine Grained Sync
 Regular pointer
 No explicit copy
Results
 HSA compute abstraction
 NO performance penalty
Note: Not all algorithms run faster
Benchmark: NUCAR HeteroMark
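The difference between the copy-based buffer model and the "regular pointer" model can be illustrated with a toy FIR filter in Python. This is a host-only sketch, not the benchmark code: `run_with_copies` stages data through separate lists that stand in for device buffers, while `run_shared` lets the filter read and write the caller's own objects. All names here are hypothetical.

```python
def fir(samples, taps, out):
    # y[n] = sum over k of taps[k] * x[n - k], treating x[n] as 0 for n < 0.
    for n in range(len(samples)):
        acc = 0.0
        for k, tap in enumerate(taps):
            if n - k >= 0:
                acc += tap * samples[n - k]
        out[n] = acc


def run_with_copies(host_in, taps):
    # Buffer-style flow: stage the input, compute, copy the result back.
    device_in = list(host_in)              # copy host -> "device"
    device_out = [0.0] * len(host_in)
    fir(device_in, taps, device_out)
    host_out = list(device_out)            # copy "device" -> host
    elements_copied = len(device_in) + len(host_out)
    return host_out, elements_copied


def run_shared(host_in, taps):
    # Unified-memory-style flow: the filter works on the caller's data directly.
    host_out = [0.0] * len(host_in)
    fir(host_in, taps, host_out)
    return host_out, 0                     # no staging copies


if __name__ == "__main__":
    samples = [1.0, 2.0, 3.0, 4.0]
    taps = [0.5, 0.5]
    print(run_with_copies(samples, taps))
    print(run_shared(samples, taps))
```

Both paths compute the same output; only the number of staged elements differs, which is the overhead the unified memory space removes for a memory-intensive workload such as FIR.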
HSA PRODUCT UPDATES - AMD
FROM HSA FOUNDATION MEMBER COMPANIES
Heterogeneous System Architecture Is At The Core of ROCm
A rich foundation for HPC and ultrascale computing that supports our APUs and discrete GPUs
HSA drives rich capabilities into ROCm
 Systems Architecture
‒ User Mode Queues
‒ Architected Queuing Language
‒ Flat memory Addressing
‒ Atomic Memory Transactions
‒ Process Concurrency & Preemption
 HSA Runtime enables a programming language neutral systems interface
 Supports standardized loader and linker interface
ROCm: Radeon Open Compute Platform
ROCm Enabled Hardware 2016
S9150, W9100, S9170, RADEON R9 Nano, S9300x2, RADEON RX480 (Oct, ROCm 1.3)
AMD Proprietary and Confidential August 2016
AMD Embedded R-Series SOC
AMD FX 98xx, A12-97xx
HXGPT ANNOUNCEMENT: UNITY PROCESSOR ARCHITECTURE
Working silicon for Unity Architecture
Focus on out-of-order pipeline
 Superscalar
 Control flow
See our demo
 Image Processing Filter
Before After
HXGPT ANNOUNCEMENT
HSAIL-BASED DEEP LEARNING / NEURAL NETWORK OPEN SOURCE INITIATIVE
Open Source Machine Learning HSAIL library
 Deep Neural Network Library
 Delivered in HSAIL
Any HSA-Compatible platform can execute
Optimized for hxGPT
 Using gccbrig
Development now underway
IMAGINATION HSA COMPLIANT IP (COMING SOON)
We will be rolling out:
• HSA across all MIPS I-class and P-class CPUs
• HSA across all PowerVR GPUs
• HSA compliant fabric solutions
[Figure: Imagination Smart Vision IP Platform. Block diagram of a coherent HSA-compliant SoC fabric connecting an HSA-compliant MIPS CPU with L2 cache, an HSA-compliant PowerVR GPU (GX7200 Series6XT, 2 cluster), PowerVR video encode and decode, camera ISP, JPEG encode, display pipeline, Ensigma RPU, DDR3/4, ROM, RAM, peripherals (GPIO, UART, I2C, I2S, SPI, SD; USB3, MIPI, NAND; HDMI Tx & Rx), and customer IP]
Copyright © MediaTek Inc. All rights reserved.
Evolution of Heterogeneity at MediaTek
HMP – 2013: Heterogeneous Multi-Processing
HC – 2015: Heterogeneous Computing
Tri-cluster – 2016: Hybrid Tri-cluster Multi-Processing
HSA Features: Heterogeneous System Architecture
[Figure: block diagrams for each stage showing LITTLE and BIG CPU clusters, Min / Mid / Max CPU clusters, GPU, accelerators, and coherent memory with MMU]
CONCLUSIONS
THE RESULT
Make Heterogeneous Programming Much Easier
1. Single source programming (Single tool chain)
2. Any programming language (C++, Python, JavaScript, …)
3. Eliminate data copies (Performance!)
4. Common address space (A pointer is a pointer)
5. Standardized command submission to Agents (GPU / DSP) (A common dispatch language)
6. Eliminate software layers between application and hardware (Efficient)
7. ISA agnostic for CPU, GPU, DSP, and more (x86, ARM, MIPS, PowerVR, Mali, Adreno, GPT, …)
8. Open source software stack (Open Access!)
High performance
Low power
Extensible to other accelerators on the SoC
SUMMARY
2012 goal of changing chip H/W architecture achieved
 Cache coherent shared virtual memory
2014-2015 S/W architecture to support H/W
 March 2015 V1.0 specs
 Programmed in any language (C++, Python, OpenCL)
2015-2016
 May 2016 v1.1 specs
 Multivendor support
 Wider range of processors
2016 H/W platforms arriving
 AMD’s Carrizo (Dell, Asus, Lenovo)
 Licensable IP available
V1.2 SPECIFICATIONS IN PROGRESS
Architecture
 Improved data interop
 Fixed function accelerators (e.g. FPGA)
 Local device memory
 Coarse grain memory
 Architected debug
 BRIG, new linking formats
Programming Models
 Fully formalized memory model
 HSAIL parallel loops
 Flexible API and access semantics
JOIN US!
WWW.HSAFOUNDATION.COM
THANK YOU

"Enabling Efficient Heterogeneous Processing Through Coherency," a Presentation from the HSA Foundation

  • 1. © Copyright 2012-2016 HSA Foundation. All Rights Reserved. 1 ENABLING EFFICIENT HETEROGENEOUS PROCESSING THROUGH COHERENCY: AN HSA FOUNDATION UPDATE EMBEDDED VISION ALLIANCE MEMBER MEETING DR. JOHN GLOSSNER, PRESIDENT, HSA FOUNDATION / CEO GPT-US HARMONIZING THE INDUSTRY AROUND HETEROGENEOUS COMPUTING
  • 2. © Copyright 2012-2016 HSA Foundation. All Rights Reserved. 2 AGENDA Heterogeneous Programming Problem About HSA  Founding  Member Companies  Open / Royalty Free Solutions HSA Solution  Hardware  Software  Infrastructure  HSAIL Portable Applications Programming  C/C++, Python, OpenCL Performance Results  AMD Products and Announcements  AMD, GPT, Imagination, MediaTek What’s Next
  • 3. © Copyright 2012-2016 HSA Foundation. All Rights Reserved. 3 THE PROBLEM HETEROGENEOUS APPLICATION DEVELOPMENT
  • 4. © Copyright 2012-2016 HSA Foundation. All Rights Reserved. 4 WHAT IS A HETEROGENEOUS SYSTEM? A CPU+ System  +GPU  +Vision Processors  +DSP  +FPGA  +Accelerators Typically  Different development tools  Different memory spaces  Communication via I/O only (data copies) Unified Coherent Memory CPU 1 CPU N… CPU 2 GPU 1 GPU 2 GPU 3 GPU M DSP ACC…
  • 5. © Copyright 2012-2016 HSA Foundation. All Rights Reserved. 5 WHAT’S THE PROBLEM? Heterogeneous processors are widely available Huge compute capability  Acceleration Units (GPU, DSP, FPGA)  CPU Cluster-based computer Coherency  Established in high-end  Migrating to mainstream mobile and consumer BUT… Heterogeneous programming models not standardized Multi-core/device applications difficult to optimize or scale Non-portable application developer ecosystems HSAF brings compute app abstraction to heterogeneous platforms
  • 6. © Copyright 2012-2016 HSA Foundation. All Rights Reserved. 6 HSA TECHNOLOGY Developing a new platform for heterogeneous systems  Reducing Heterogeneous System Complexity  Provides software ecosystem Abstracts away complexities of heterogeneous systems  Cache coherent shared virtual memory hardware  Removes time consuming operating system calls  Runs at user level Exploiting Compute Capabilities  Single source programming  Control and compute code reside in the same file or project
  • 7. © Copyright 2012-2016 HSA Foundation. All Rights Reserved. 7 ABOUT HSA HETEROGENEOUS SYSTEM ARCHITECTURE FOUNDATION
  • 8. © Copyright 2012-2016 HSA Foundation. All Rights Reserved. 8 HSA FOUNDATION A Non-Profit Foundation Founded in June 2012  Programming heterogeneous systems (“CPU +” era) Industry standards body  V1.1 Specifications released May 2016  Backward compatibility with V1.0 hardware First compatible hardware  AMD Measured Performance Improvements
  • 9. © Copyright 2012-2016 HSA Foundation. All Rights Reserved. 9 HSA – AN OPEN PLATFORM Open Architecture, membership open to all  HSA Programmers Reference Manual  HSA Platform System Architecture  HSA Runtime  HSA Multivendor Specification Royalty Free  IP, Specifications, and APIs Open Source  Tools, Compilers, etc.  Runtime implementations  Tests
  • 10. © Copyright 2012-2016 HSA Foundation. All Rights Reserved. 10 MEMBERS DRIVING HSA Founders Promoters Supporters Contributors Academic
  • 11. © Copyright 2012-2016 HSA Foundation. All Rights Reserved. 11 HSAF HARDWARE CONTRIBUTIONS
  • 12. © Copyright 2012-2016 HSA Foundation. All Rights Reserved. 12 JIM MCGREGOR, TIRIAS RESEARCH …HSAF has had a profound impact on hardware architectures … even Intel‘s Cache Coherency Unified memory (Shared Virtual Memory)
  • 13. © Copyright 2012-2016 HSA Foundation. All Rights Reserved. 13 THE PLATFORM PILLARS OF HSA Unified memory (SVM) User mode dispatch Platform atomics Architected Signals Formal Relaxed Memory Model Cache Coherency Quality Of Service Some non-HSA platforms support a few of these platform features In combination they form a well-rounded base for application programmability
  • 14. © Copyright 2012-2016 HSA Foundation. All Rights Reserved. 14 HSAF SOFTWARE INFRASTRUCTURE
  • 15. © Copyright 2012-2016 HSA Foundation. All Rights Reserved. 15 THE VISION Make Heterogeneous Programming Much Easier Single source programming1 Any programming language2 Eliminate data copies3 Common address space4 Standardized command submission to Agents (GPU / DSP)5 Eliminate software layers between application and hardware6 ISA agnostic for CPU, GPU, DSP, and more7 Open source software stack8 Single tool chain C++, Python, JavaScript, … Performance! A pointer is a pointer A common dispatch language Efficient x86, ARM, MIPS, PowerVR, Mali, Adreno, GPT, … Open Access! High performance Low power Extensible to other accelerators on the SoC
  • 16. © Copyright 2012-2016 HSA Foundation. All Rights Reserved. 16 MOTIVATION (TODAY’S PICTURE) Application OS Transfer buffer to GPU Copy/Map Memory Queue Job Schedule Job Start Job Finish Job Schedule Application Get Buffer Copy/Map Memory Agent GPU/DSP
  • 17. © Copyright 2012-2016 HSA Foundation. All Rights Reserved. 17 WITH SHARED VIRTUAL MEMORY Application OS Transfer buffer to GPU Copy/Map Memory Queue Job Schedule Job Start Job Finish Job Schedule Application Get Buffer Copy/Map Memory Agent GPU/DSP
  • 18. © Copyright 2012-2016 HSA Foundation. All Rights Reserved. 18 WITH COHERENT CACHE MEMORY Application OS Transfer buffer to GPU Copy/Map Memory Queue Job Schedule Job Start Job Finish Job Schedule Application Get Buffer Copy/Map Memory Agent GPU/DSP
  • 19. © Copyright 2012-2016 HSA Foundation. All Rights Reserved. 19 SIGNALS HSA agents support signaling  creation/destruction using runtime APIs Any Agent can access signals  Wake up agents waiting upon the object  Query/Wait for current object  Allows conditions Hardware-assisted signaling and synchronization primitives  Memory semantics synchronizes work items processed by HSA agents  Synchronizes execution between threads on HSA agents and host CPU One-to-one and one-to-many signaling  System Software, runtime & application SW use infrastructure to build higher-level synchronization primitives like mutexes, semaphores, … Advantages  Asynchronous events between agents  Doesn’t require CPU  Common idiom for work offload  Low power waiting
  • 20. © Copyright 2012-2016 HSA Foundation. All Rights Reserved. 20 WITH SIGNALING Application OS Transfer buffer to GPU Copy/Map Memory Queue Job Schedule Job Start Job Finish Job Schedule Application Get Buffer Copy/Map Memory Agent GPU/DSP
  • 21. © Copyright 2012-2016 HSA Foundation. All Rights Reserved. 21 HSA QUEUING MODEL User mode queuing  Low latency dispatch  Application dispatches directly  No OS or driver required Architected Queuing Layer (AQL)  Single compute dispatch path for all hardware  No driver translation, direct to hardware  Standard across vendors!  Guaranteed backward compatibility Allows for dispatch to queue from any agent  CPU or GPU or DSP or FPGA, etc. Agent self enqueue enables  Recursion, Tree traversal, Wavefront reforming Requires coherency and shared virtual memory
  • 22. © Copyright 2012-2016 HSA Foundation. All Rights Reserved. 22 WITH USER MODE QUEUING Application CPU OS Agent GPU/DSP Transfer buffer to GPU Copy/Map Memory Queue Job Schedule Job Start Job Finish Job Schedule Application Get Buffer Copy/Map Memory
  • 23. © Copyright 2012-2016 HSA Foundation. All Rights Reserved. 23 FINAL PICTURE: SVM + CACHE COHERENCY + SIGNALS + USER MODE QUEUES Application OS Agent GPU/DSP Queue Job Start Job Finish Job
  • 24. © Copyright 2012-2016 HSA Foundation. All Rights Reserved. 24 HSA COMMAND AND DISPATCH FLOW Application A Application B Application C Optional Dispatch Buffer Agent HARDWARE Hardware Queue A A A Hardware Queue B B B Hardware Queue C C C C C HW view:  HW / microcode controlled  HW scheduling  Architected Queuing Language (AQL)  HW-managed protection SW view:  User-mode dispatches to HW  No OS Driver overhead  Low dispatch times  Host & Kernel Agent dispatch APIs
  • 25. HSA INTERMEDIATE LANGUAGE (HSAIL) BYTECODE FOR HETEROGENEOUS SYSTEMS
  • 26. THE PORTABILITY CHALLENGE CPU ISAs – Backwards Compatible ▪ ISA innovations added incrementally (e.g., NEON, AVX) ▪ ISA retains backwards-compatibility with previous generation ▪ HSA instruction-set architectures: ARM, GPT, MIPS, and x86 Kernel Agent ISAs – No Backwards Compatibility ▪ GPU, DSP, DNN, Image Signal Processor, Custom Accelerators, etc. ▪ Massive diversity of architectures in the market ▪ Each vendor has its own ISA, and often several in the market at the same time ▪ Compatibility via APIs (OpenGL, DirectX, OpenCV)
  • 27. HSA INTERMEDIATE LAYER - HSAIL Virtual ISA for parallel programs ▪ Finalized to native ISA by a compiler ▪ Dynamic or Offline ▪ ISA independent by design Explicitly parallel ▪ Designed for data parallel programming Multiple HLL Support ▪ Exceptions, virtual functions, etc. ▪ Java, C++, OpenMP, Python, etc. main() { … #pragma omp parallel for for (int i=0;i<N; i++) { } … } High-Level Compiler BRIG Finalizer Component ISA Host ISA
  • 28. HSAIL FEATURES A Virtual Explicitly Parallel ISA ▪ ~135 Opcodes ▪ RISC Register-based Load/Store ▪ Arithmetic ▪ IEEE 754 Floating Point including 16-bit ▪ Integer (32/64-bit) ▪ DSP fixed point ▪ Packed / SIMD ▪ f16x2, f16x4, f16x8, f32x2, f32x4, f64x2 ▪ signed/unsigned 8x4, 8x8, 8x16, 16x2, 16x4, 16x8, 32x2, 32x4, 64x2 ▪ Branches & Function Calls ▪ Atomic Operations Wavefronts ▪ 1, 2, 4, 8, 16, 32, or 64 SIMD lanes ▪ Lanes can be active or inactive Memory ▪ Shared Virtual Memory Exceptions ld_global_u64 $d0, [$d6 + 120] ; $d0= load($d6+120) add_u64 $d1, $d0, 24 ; $d1= $d0+24
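The two HSAIL instructions at the end of the features slide, and the note that wavefront lanes can be active or inactive, can be mimicked in a few lines of Python. This is a behavioral sketch rather than an HSAIL interpreter: the function names mirror the opcodes for readability, memory is assumed to be a flat little-endian byte array, and the choice that inactive lanes keep their old destination value is a modeling assumption.

```python
MASK64 = (1 << 64) - 1  # u64 arithmetic wraps modulo 2**64


def ld_global_u64(memory, base, offset):
    """Model of `ld_global_u64 $d, [$base + offset]` on a flat byte array."""
    address = (base + offset) & MASK64
    return int.from_bytes(memory[address:address + 8], "little")


def add_u64(a, b):
    """Model of `add_u64 $d, $a, b` with 64-bit wraparound."""
    return (a + b) & MASK64


def masked_add_u64(dest, src, active):
    """Apply add_u64 per lane; lanes whose mask bit is False are left unchanged."""
    return [add_u64(d, s) if on else d for d, s, on in zip(dest, src, active)]


def run_example():
    """Replay the two instructions from the slide on a small memory image."""
    memory = bytearray(256)
    d6 = 8                                                   # base address register
    memory[d6 + 120:d6 + 128] = (1000).to_bytes(8, "little")  # value to be loaded
    d0 = ld_global_u64(memory, d6, 120)                      # $d0 = load($d6 + 120)
    d1 = add_u64(d0, 24)                                     # $d1 = $d0 + 24
    return d0, d1
```

`run_example()` loads the value stored at `$d6 + 120` and adds 24 to it, matching the comments on the slide. `masked_add_u64` shows the per-lane view: one instruction is applied across all lanes of a wavefront, and the active mask decides which lanes take effect.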
  • 29. PORTABLE APPLICATIONS PROGRAMMING FROM OPENCL TO C++17
  • 30. HSA OPEN SOURCE SOFTWARE Full open source Linux stack: tools, compilers and OS support ▪ Allows a single shared implementation for many components ▪ Enables university research and industry collaboration in all areas ▪ Because it’s the right thing to do Many open source applications & frameworks ▪ Native Languages: HCC (C++17), LLVM, GCC, CLOC/SNACK, Python, Java, … ▪ Tools, APIs, Frameworks: CodeXL, POCL, Docker, OpenMP, OKRA, HIP, … ▪ Research: Multi2sim, HSAEmu, gem5, ViennaCL, … ▪ And many applications using OpenCL 2.x or the HSA stack ▪ GitHub & Bitbucket repositories have much, much more… gccbrig ▪ Any processor with a gcc machine description can finalize HSAIL
  • 31. ARCHITECTED PROFILING AND DEBUGGING Profiling ▪ Common timeline across HSA accelerators & system ▪ Common HSA hardware events (+ HW specific) ▪ Common HSA profiling counter definitions (+ HW specific counters) ▪ Consistent profiling methodology for all HSA accelerators Debugging ▪ Breakpoints ▪ Exception handling ▪ Single-step ▪ Tracing ▪ HSAIL Disassembly ▪ Emulation support ▪ Libraries ▪ Plugins
  • 32. Python on GPUs Numba: NumPy-aware Python compiler ▪ Open source. Available on GitHub ▪ Sponsored by Continuum Analytics Direct HSA Support Automatic Parallelization ▪ 2x-200x speedup
  • 33. PERFORMANCE RESULTS
  • 34. Python Geographic Locality What is the distance from a set of points to a target point? ▪ How many points are within a specified range? Numba can auto-parallelize user universal functions for HSA ▪ Ufuncs broadcast an operation over the elements of a NumPy array ▪ ZERO HSA developer knowledge required 1M Points ▪ >8X speedup https://github.com/ContinuumIO/Numba-HSA-Webinar
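The geographic locality question can be written as one scalar function applied to every point, which is the shape of computation a ufunc broadcasts. The sketch below uses only the Python standard library so that it runs anywhere; the names `distance` and `count_within` are hypothetical, and planar Euclidean distance is an assumption (the slide does not say which metric the webinar used). With Numba installed, a scalar function like `distance` is the kind of code its `vectorize` decorator compiles into a ufunc; the HSA target itself is not shown here.

```python
import math


def distance(x, y, target_x, target_y):
    """Scalar kernel: planar distance from one point to the target."""
    return math.hypot(x - target_x, y - target_y)


def count_within(points, target, radius):
    """Apply the scalar kernel to every point and count those within radius."""
    tx, ty = target
    return sum(1 for (x, y) in points if distance(x, y, tx, ty) <= radius)
```

Because each point is processed independently, the loop in `count_within` has no cross-iteration dependencies, which is what lets a compiler parallelize it across the lanes of an accelerator without the user writing any device code.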
  • 35. GEN1: FIR & AES FIR is a memory-intensive streaming workload AES is a compute-intensive streaming workload CL12 (OpenCL 1.2) – cl_mem buffer ▪ Copy to/from the device CL20 (OpenCL 2.0) – SVM buffer – Coarse Grain Sync ▪ Copy to/from SVM ▪ Data copy cannot be avoided, since the space for SVM is limited HSA – Unified Memory Space – Fine Grained Sync ▪ Regular pointer ▪ No explicit copy Results ▪ HSA compute abstraction ▪ NO performance penalty Note: Not all algorithms run faster Benchmark: NUCAR HeteroMark
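For readers unfamiliar with the FIR workload named on this slide, a direct-form FIR filter computes each output sample as a weighted sum of the most recent inputs, y[n] = Σ h[k]·x[n−k]. The sketch below is a plain reference implementation for illustration, not the benchmark's code; `fir_filter` is a hypothetical name and inputs before the start of the stream are assumed to be zero.

```python
def fir_filter(samples, taps):
    """Direct-form FIR: out[n] = sum over k of taps[k] * samples[n - k]."""
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:  # samples before the start of the stream count as zero
                acc += h * samples[n - k]
        out.append(acc)
    return out
```

Each output touches only a few inputs and does very little arithmetic per byte read, which is why the slide classes FIR as memory-intensive and why avoiding explicit buffer copies matters for it.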
  • 36. HSA PRODUCT UPDATES - AMD FROM HSA FOUNDATION MEMBER COMPANIES
  • 37. Heterogeneous System Architecture Is At The Core of ROCm Rich foundation for HPC and ultrascale computing, supporting our APUs and discrete GPUs HSA drives rich capabilities into ROCm ▪ Systems Architecture ‒ User Mode Queues ‒ Architected Queuing Language ‒ Flat Memory Addressing ‒ Atomic Memory Transactions ‒ Process Concurrency & Preemption ▪ HSA Runtime enables a programming language neutral systems interface ▪ Supports standardized loader and linker interface ROCm: Radeon Open Compute Platform
  • 38. ROCm Enabled Hardware 2016 S9150 W9100 RADEON R9 Nano S9300x2 RADEON RX480 (Oct ROCm 1.3) S9170 AMD Proprietary and Confidential August 2016 AMD Embedded R-Series SOC AMD FX 98xx, A12-97xx
  • 39. HXGPT ANNOUNCEMENT UNITY PROCESSOR ARCHITECTURE Working silicon for Unity Architecture Focus on out-of-order pipeline ▪ Superscalar ▪ Control flow See our demo ▪ Image Processing Filter Before After
  • 40. HXGPT ANNOUNCEMENT HSAIL-BASED DEEP LEARNING NEURAL NETWORK OPEN SOURCE INITIATIVE Open Source Machine Learning HSAIL library ▪ Deep Neural Network Library ▪ Delivered in HSAIL Any HSA-Compatible platform can execute Optimized for hxGPT ▪ Using gccbrig Development now underway
  • 41. IMAGINATION HSA COMPLIANT IP (COMING SOON) We will be rolling out: ▪ HSA across all MIPS I-class and P-class CPUs ▪ HSA across all PowerVR GPUs ▪ HSA compliant fabric solutions Coherent HSA-compliant SoC fabric PowerVR Video Encode PowerVR Camera ISP PowerVR Video Decode ROM Peripheral Bus DDR3/4 Bridge RAM PowerVR GX7200 Series6XT 2 cluster PowerVR GPU HSA-compliant eFuse DMAC Clock & Reset Control JTAG & Test PSU & Power Control TE & Crypto L2 cache PowerVR GX7200 Series6XT 2 cluster MIPS CPU HSA-compliant Display Pipeline PowerVR JPEG Encode OTP Ensigma RPU AFE Customer IP HDMI Tx & Rx USB3 MIPI NAND Peripherals GPIO; UART; I2C; I2S; SPI; SD Customer IP Customer IP & interfaces Imagination Smart Vision IP Platform
  • 42. Copyright © MediaTek Inc. All rights reserved. HMP – 2013 Heterogeneous Multi-Processing HC – 2015 Heterogeneous Computing Tri-cluster 2016 Hybrid Tri-cluster Multi-Processing HSA Features Heterogeneous System Architecture LITTLE CPUs BIG CPUs LITTLE CPUs BIG CPUs GPU GPU Accelerators Coherent Memory MMU Evolution of Heterogeneity at MediaTek Min CPUs Max CPUs GPU Mid CPUs Min CPUs Max CPUs Mid CPUs
  • 43. CONCLUSIONS
  • 44. THE RESULT Make Heterogeneous Programming Much Easier 1. Single source programming: single tool chain 2. Any programming language: C++, Python, JavaScript, … 3. Eliminate data copies: performance! 4. Common address space: a pointer is a pointer 5. Standardized command submission to agents (GPU / DSP): a common dispatch language 6. Eliminate software layers between application and hardware: efficient 7. ISA agnostic for CPU, GPU, DSP, and more: x86, ARM, MIPS, PowerVR, Mali, Adreno, GPT, … 8. Open source software stack: open access! High performance Low power Extensible to other accelerators on the SoC
  • 45. SUMMARY 2012 goal of changing chip H/W architecture achieved ▪ Cache coherent shared virtual memory 2014-2015 S/W architecture to support H/W ▪ March 2015 v1.0 specs ▪ Programmed in any language (C++, Python, OpenCL) 2015-2016 ▪ May 2016 v1.1 specs ▪ Multivendor support ▪ Wider range of processors 2016 H/W platforms arriving ▪ AMD’s Carrizo (Dell, Asus, Lenovo) ▪ Licensable IP available
  • 46. V1.2 SPECIFICATIONS IN PROGRESS Improved Data Interop Fixed function accelerators (e.g. FPGA) Local device memory Coarse grain memory Architected Debug BRIG, new linking formats Architecture Fully formalized memory model HSAIL Parallel loops Flexible API and access semantics Programming Models
  • 47. JOIN US! WWW.HSAFOUNDATION.COM THANK YOU