“Khronos Standard APIs for Accelerating Vision and Inferencing,” a Presentation from the Khronos Group

© 2020 The Khronos Group
Khronos Standard APIs for
Accelerating Vision and Inferencing
Neil Trevett
Khronos President
NVIDIA VP Developer Ecosystems
22nd September 2020

Khronos Connects Software to Silicon
3D graphics, XR, parallel programming, vision acceleration and machine learning
Non-profit, member-driven standards-defining industry consortium
Open to any interested company
All Khronos standards are royalty-free
Well-defined IP Framework protects participants’ intellectual property
Founded in 2000
>150 Members ~ 40% US, 30% Europe, 30% Asia
Open interoperability standards to enable software to effectively
harness the power of 3D and multiprocessor acceleration
2

Khronos Active Initiatives
3D Graphics
Desktop, Mobile, Web
Embedded and Safety Critical
3D Assets
Authoring
and Delivery
Portable XR
Augmented and
Virtual Reality
Parallel Computation
Vision, Inferencing,
Machine Learning
3

Khronos Compute Acceleration Standards
Increasing industry interest in parallel compute acceleration to combat the ‘End of Moore’s Law’
GPU
GPU rendering +
compute
acceleration
Heterogeneous
compute
acceleration
Single source C++ programming
with compute acceleration
Graph-based vision and
inferencing acceleration
Lower-level APIs
Direct Hardware Control
Intermediate
Representation (IR)
supporting parallel
execution and
graphics
Higher-level
Languages and APIs
Streamlined development and
performance portability
Hardware: GPU, CPU, FPGA, DSP, AI/Tensor HW, Custom Hardware
4

Sensor
Data
Training
Data
Trained
Networks
Neural Network
Training
C++ Application
Code
Embedded Vision and Inferencing Acceleration
Compilation Ingestion
FPGA
DSP
Dedicated
Hardware
GPU
Vision / Inferencing
Engine
Compiled
Code
Hardware Acceleration APIs
Diverse Embedded Hardware
(GPUs, DSPs, FPGAs)
Applications link to compiled
inferencing code or call
vision/inferencing API
Networks trained on high-end
desktop and cloud systems
5

NNEF Neural Network Exchange Format
Training Framework 1
Inference Engine 1
Inference Engine 2
Inference Engine 3
Every Inferencing Engine needs a custom importer
from every Framework
Before - Training and Inferencing Fragmentation
After - NN Training and Inferencing Interoperability
Inference Engine 1
Inference Engine 2
Inference Engine 3
Common optimization
and processing tools
6
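
The before/after contrast on this slide can be put into numbers. The sketch below assumes the "after" picture needs one exporter per training framework and one importer per inference engine; the counts of three frameworks and three engines are illustrative, not taken from the slide.

```python
def direct_importers(num_frameworks: int, num_engines: int) -> int:
    """Importers needed when every engine imports from every framework."""
    return num_frameworks * num_engines


def exchange_format_converters(num_frameworks: int, num_engines: int) -> int:
    """Converters needed with a shared exchange format (assumption):
    one exporter per framework plus one importer per engine."""
    return num_frameworks + num_engines


frameworks, engines = 3, 3  # illustrative counts only
print(direct_importers(frameworks, engines))            # 9
print(exchange_format_converters(frameworks, engines))  # 6
```

The gap widens as either side grows: the direct approach scales with the product of the two counts, the exchange-format approach with their sum.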

NNEF and ONNX
NNEF V1.0 released in August 2018
After positive industry feedback on Provisional Specification.
Maintenance update issued in September 2019
Extensions to V1.0 released for expanded functionality
NNEF Working Group Participants
ONNX 1.6 Released in September 2019
Introduced support for Quantization
ONNX Runtime being integrated with GPU inferencing engines
such as NVIDIA TensorRT
ONNX Supporters
NNEF | ONNX
Embedded Inferencing Import | Training Interchange
Defined Specification | Open Source Project
Multi-company Governance at Khronos | Initiated by Facebook & Microsoft
Stability for hardware deployment | Software stack flexibility
ONNX and NNEF
are Complementary
ONNX moves quickly to track authoring
framework updates
NNEF provides a stable bridge from
training into edge inferencing engines
7

NNEF Open Source Tools Ecosystem
Files
Caffe and
Caffe2
Import/Export
TensorFlow and
TensorFlow Lite
Import/Export
NNEF open source projects hosted on Khronos NNEF
GitHub repository under Apache 2.0
https://github.com/KhronosGroup/NNEF-Tools
ONNX
Import/Export
Syntax
Parser and
Validator
OpenVX
Ingestion and
Execution
NNEF Model Zoo
Now available on GitHub. Useful for
checking that ingested NNEF produces
acceptable results on target system
Compound operations
captured by exporting
graph Python script
NNEF adopts a rigorous approach to
design lifecycle
Especially important for safety-critical or
mission-critical applications in automotive,
industrial and infrastructure markets
8

SYCL Single Source C++ Parallel Programming
GPU
FPGA DSP
Custom Hardware
GPU
CPU
CPU
CPU
Standard C++
Application
Code
C++
Libraries
ML
Frameworks
C++ Template
Libraries
C++ Template
Libraries
C++ Template
Libraries
SYCL Compiler
for OpenCL
CPU
Compiler
CPU
SYCL-BLAS, SYCL-DNN,
SYCL-Eigen,
SYCL Parallel STL
C++ templates and lambda
functions separate host &
accelerated device code
Accelerated code
passed into device
OpenCL compilers
Complex ML frameworks
can be directly compiled
and accelerated
SYCL is ideal for accelerating larger
C++-based engines and applications
with performance portability
C++ Kernel Fusion can give
better performance on
complex apps and libs than
hand-coding
AI/Tensor HW
9

SYCL Implementations
Multiple Backends in Development
SYCL beginning to be supported on multiple
low-level APIs in addition to OpenCL
e.g. ROCm and CUDA
For more information: http://sycl.tech
SYCL enables Khronos to influence
ISO C++ to (eventually) support
heterogeneous compute
SYCL
Source Code
DPC++: Uses LLVM/Clang, part of oneAPI
ComputeCpp: SYCL 1.2.1 on multiple hardware
triSYCL: Open source test bed
hipSYCL: SYCL 1.2.1 on CUDA & HIP/ROCm
Any CPU
OpenCL +
SPIR-V
Any CPU
OpenCL +
SPIR(-V)
OpenCL+PTX
Intel CPUs
Intel GPUs
Intel FPGAs
Intel CPUs
Intel GPUs
Intel FPGAs
AMD GPUs
(depends on driver stack)
Arm Mali
IMG PowerVR
Renesas R-Car
NVIDIA GPUs
OpenMP
OpenCL +
SPIR/LLVM
XILINX FPGAs
POCL
(open source OpenCL supporting
CPUs and NVIDIA GPUs and more)
Any CPU
Experimental
OpenMP
ROCm
CUDA
AMD GPUs
NVIDIA GPUs
Any CPU
CUDA+PTX
NVIDIA GPUs
SYCL, OpenCL and SPIR-V, as open industry
standards, enable flexible integration and
deployment of multiple acceleration technologies
10

OpenVX Cross-Vendor Vision and Inferencing
Vision
Node
Vision
Node
Vision
Node
Downstream
Application
Processing
Native
Camera
Control CNN Nodes
NNEF Translator converts NNEF
representation into OpenVX Node Graphs
OpenVX
High-level graph-based abstraction for portable, efficient vision processing
Graph can contain vision processing and NN nodes – enables global optimizations
OpenVX drivers created, optimized and shipped by processor vendors
Implementable on almost any hardware or processor with performance portability
Run-time graph execution needs very little host CPU interaction
Performance comparable to hand-optimized, non-portable code
Real, complex applications on real, complex hardware
Much lower development effort than hand-optimized code
Hardware Implementations
OpenVX Graph
11
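
The graph-based model described on this slide can be illustrated with a toy that is not the OpenVX API. It only mimics the pattern of building a graph, verifying it once, then processing it repeatedly without the host rescheduling nodes. The class `ToyGraph`, its methods and the three node functions are hypothetical names made up for this sketch.

```python
from graphlib import TopologicalSorter


class ToyGraph:
    """Hypothetical stand-in for a vision graph; not the OpenVX API."""

    def __init__(self):
        self._nodes = {}    # node name -> (function, list of input node names)
        self._order = None  # execution order, fixed once by verify()

    def add_node(self, name, func, inputs=()):
        self._nodes[name] = (func, list(inputs))
        self._order = None  # graph changed, so it must be verified again

    def verify(self):
        """Check that all inputs exist and compute a valid execution order."""
        deps = {name: set(inputs) for name, (_, inputs) in self._nodes.items()}
        unknown = {i for ins in deps.values() for i in ins} - set(self._nodes)
        if unknown:
            raise ValueError(f"unknown inputs: {sorted(unknown)}")
        # Raises graphlib.CycleError if the graph contains a cycle.
        self._order = list(TopologicalSorter(deps).static_order())

    def process(self, source, output):
        """Run every node in the verified order and return one node's result."""
        if self._order is None:
            raise RuntimeError("verify() must be called before process()")
        results = {}
        for name in self._order:
            func, inputs = self._nodes[name]
            # Nodes without inputs read the source data directly.
            args = [results[i] for i in inputs] if inputs else [source]
            results[name] = func(*args)
        return results[output]


graph = ToyGraph()
graph.add_node("scale", lambda img: [p // 2 for p in img])
graph.add_node("threshold", lambda img: [1 if p > 10 else 0 for p in img],
               inputs=["scale"])
graph.add_node("count", lambda mask: sum(mask), inputs=["threshold"])
graph.verify()
print(graph.process([10, 30, 50], output="count"))  # 2
```

Because the whole graph is known before execution, an implementation is free to reorder, merge or offload nodes at verification time, which is the kind of global optimization the slide refers to.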

OpenVX 1.3 Released October 2019
Deployment Flexibility through Feature Sets
Conformant Implementations ship one or more complete feature sets
Enables market-focused Implementations
- Baseline Graph Infrastructure (enables other Feature Sets)
- Default Vision Functions
- Enhanced Vision Functions (introduced in OpenVX 1.2)
- Neural Network Inferencing (including tensor objects)
- NNEF Kernel import (including tensor objects)
- Binary Images
- Safety Critical (reduced features for easier safety certification)
https://www.khronos.org/registry/OpenVX/specs/1.3/html/OpenVX_Specification_1_3.html
Functionality Consolidation into Core
Neural Net Extension, NNEF Kernel Import,
Safety Critical etc.
Open Source Conformance Test Suite
https://github.com/KhronosGroup/OpenVX-cts/tree/openvx_1.3
OpenCL Interop
Custom accelerated Nodes
OpenCL Command Queue
Application
cl_mem buffers
Fully asynchronous host-device
operations during data exchange
OpenVX data objects
Runtime
Runtime Map or copy OpenVX data objects
into cl_mem buffers
Copy or export
cl_mem buffers into OpenVX data
objects
OpenVX user-kernels can access command queue
and cl_mem objects to asynchronously schedule
OpenCL kernel execution
OpenVX/OpenCL Interop
12

Open Source OpenVX & Samples
Open Source OpenVX Tutorial and Code Samples
https://github.com/rgiduthuri/openvx_tutorial
https://github.com/KhronosGroup/openvx-samples
Fully Conformant Open Source OpenVX 1.3
for Raspberry Pi
https://github.com/KhronosGroup/OpenVX-sample-impl/tree/openvx_1.3
Raspberry Pi 3 and 4 Model B with Raspbian OS
Memory access optimization via tiling/chaining
Highly optimized kernels on multimedia instruction set
Automatic parallelization for multicore CPUs and GPUs
Automatic merging of common kernel sequences
13

OpenCL is Widely Deployed and Used
Accelerated Implementations
Modo
Desktop Creative Apps
CLBlast
SYCL-BLAS
Linear Algebra
Libraries
Parallel
Languages
Math and Physics
Libraries
Vision, Imaging
and Video Libraries
The industry’s most pervasive, cross-vendor, open standard
for low-level heterogeneous parallel programming
Arm Compute Library
SYCL-DNN
Machine Learning
Libraries and Frameworks
TI DL Library (TIDL)
VeriSilicon
Xiaomi
clDNN
Intel
Intel
Synopsys
MetaWare EV
NNAPI
https://en.wikipedia.org/wiki/List_of_OpenCL_applications
Vegas Pro
ForceBalance
Molecular Modelling Libraries
Machine Learning
Compilers
14

OpenCL – Low-level Parallel Programming
Complements GPU-only APIs
Simpler programming model
Relatively lightweight run-time
More language flexibility, e.g. pointers
Rigorously defined numeric precision
OpenCL C Kernel Code
OpenCL Devices: GPU, DSP, CPU, FPGA, NN HW
Host: CPU
Runtime OpenCL API to
compile, load and execute
kernels across devices
Programming and Runtime Framework
for Application Acceleration
Offload compute-intensive kernels onto parallel
heterogeneous processors
CPUs, GPUs, DSPs, FPGAs, Tensor Processors
OpenCL C or C++ kernel languages
Platform Layer API
Query, select and initialize compute devices
Runtime API
Build and execute kernel programs on multiple devices
Explicit Application Control
Which programs execute on what device
Where data is stored in memories in the system
When programs are run, and what operations are
dependent on earlier operations
15

OpenCL 3.0
C++ for OpenCL combines:
OpenCL C: kernels, address spaces, special types, ...
Most of C++17: inheritance, templates, type deduction, ...
Increased Ecosystem Flexibility
All functionality beyond OpenCL 1.2 queryable plus
macros for optional OpenCL C language features
New extensions that become widely adopted will be
integrated into new OpenCL core specifications
OpenCL C++ for OpenCL
Open source C++ for OpenCL front end compiler
combines OpenCL C and C++17 replacing
OpenCL C++ language specification
Unified Specification
All versions of OpenCL in one specification for easier
maintenance, evolution and accessibility
Source on Khronos GitHub for community feedback,
functionality requests and bug fixes
Moving Applications to OpenCL 3.0
OpenCL 1.2 applications – no change
OpenCL 2.X applications - no code changes if all used
functionality is present
Queries recommended for future portability
C++ for OpenCL
Supported by Clang and uses the LLVM
compiler infrastructure
OpenCL C code is valid and fully compatible
Supports most C++17 features
Generates SPIR-V kernels
16
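
The advice that queries are recommended for future portability can be sketched as a selection step: ask what the device reports, then choose a code path that needs only those features. The sketch below does not call OpenCL. The feature labels, path names and the helper `choose_code_path` are hypothetical, and a real application would fill `reported_features` from OpenCL device queries.

```python
# Hypothetical feature labels for illustration; these are not OpenCL API names.
BASELINE = "baseline_1_2"


def choose_code_path(reported_features, preferred_paths):
    """Return the first code path whose required features are all reported.

    preferred_paths is an ordered list of (path_name, required_features).
    Falls back to a path that needs only OpenCL 1.2-level functionality.
    """
    available = set(reported_features)
    for path_name, required in preferred_paths:
        if set(required) <= available:
            return path_name
    return BASELINE


paths = [
    ("fast_path", {"subgroups", "generic_address_space"}),
    ("medium_path", {"generic_address_space"}),
]

print(choose_code_path({"subgroups", "generic_address_space"}, paths))  # fast_path
print(choose_code_path({"generic_address_space"}, paths))               # medium_path
print(choose_code_path(set(), paths))                                   # baseline_1_2
```

Keeping a baseline path that relies only on OpenCL 1.2-level functionality means the application still runs on an implementation that reports none of the optional features.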

Google Ports TensorFlow Lite to OpenCL
OpenCL providing ~2x inferencing speedup over OpenGL ES acceleration
TensorFlow Lite uses OpenGL ES as a backup if OpenCL is not available …
…but most mobile GPU vendors provide OpenCL drivers - even if not exposed directly to Android developers
OpenCL is increasingly used as an acceleration target for higher-level frameworks and compilers
17

Primary Machine Learning Compilers
Front-end / IR | Import Formats | Output
NNVM / Relay IR | Caffe, Keras, MXNet, ONNX | OpenCL, LLVM, CUDA, Metal
nGraph / Stripe IR | TensorFlow Graph, MXNet, PaddlePaddle, Keras, ONNX | OpenCL, LLVM, CUDA
Glow Core / Glow IR | PyTorch, ONNX | OpenCL, LLVM
XLA HLO | TensorFlow Graph, PyTorch, ONNX | LLVM, TPU IR, XLA IR, TensorFlow Lite / NNAPI (inc. HW accel)
18

ML Compiler Steps
1. Import Trained Network Description
2. Apply graph-level optimizations e.g. node fusion, node lowering and memory tiling
3. Decompose to primitive instructions and emit programs for accelerated run-times
Consistent Steps
Fast progress but still an area of intense research
If compiler optimizations are effective - hardware accelerator APIs can stay ‘simple’ and won’t need complex metacommands (e.g. combined primitive commands like DirectML)
19
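
The three compiler steps on this slide can be walked through on a toy network. This is a minimal sketch under strong assumptions: the network is a simple linear chain, only one fusion rule exists, and the operation and instruction names are invented for illustration. It does not reflect how any of the compilers named in this deck are implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    op: str    # operation name, e.g. "conv" or "relu"
    name: str  # unique label for this node


def import_network(description):
    """Step 1: turn a trained-network description into a list of nodes."""
    return [Node(op=op, name=f"{op}_{i}") for i, op in enumerate(description)]


# Hypothetical fusion rule: a conv followed directly by a relu becomes one node.
FUSABLE = {("conv", "relu"): "conv_relu"}


def fuse_nodes(nodes):
    """Step 2: graph-level optimization that fuses adjacent fusable pairs."""
    fused = []
    i = 0
    while i < len(nodes):
        if i + 1 < len(nodes):
            key = (nodes[i].op, nodes[i + 1].op)
            if key in FUSABLE:
                merged_name = f"{nodes[i].name}+{nodes[i + 1].name}"
                fused.append(Node(op=FUSABLE[key], name=merged_name))
                i += 2
                continue
        fused.append(nodes[i])
        i += 1
    return fused


# Hypothetical primitive instruction sequences for each operation.
PRIMITIVES = {
    "conv": ["load", "multiply_accumulate", "store"],
    "relu": ["load", "max_zero", "store"],
    "conv_relu": ["load", "multiply_accumulate", "max_zero", "store"],
    "pool": ["load", "reduce_max", "store"],
}


def lower(nodes):
    """Step 3: decompose each node into primitive instructions."""
    program = []
    for node in nodes:
        program.extend(PRIMITIVES[node.op])
    return program


network = import_network(["conv", "relu", "pool"])
optimized = fuse_nodes(network)
print([n.op for n in optimized])                   # ['conv_relu', 'pool']
print(len(lower(network)), len(lower(optimized)))  # 9 7
```

In this toy, fusion removes one store and one load of the intermediate result, which is where the saving from node fusion comes from.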

Google MLIR and IREE Compilers
MLIR: Multi-level Intermediate Representation
Format and library of compiler utilities that sits between the trained model representation and low-level compilers/executors that generate hardware-specific code
IREE: Intermediate Representation Execution Environment
Lowers and optimizes ML models for real-time accelerated inferencing on mobile/edge heterogeneous hardware
Contains scheduling logic to communicate data dependencies to low-level parallel pipelined hardware/APIs like Vulkan, and execution logic to encode dense computation in the form of hardware/API-specific binaries like SPIR-V
IREE is a research project today. Google is working with Khronos working groups to explore how SPIR-V code can provide effective inferencing acceleration on APIs such as Vulkan
Trained Models
Generate Hardware
Specific Binaries
Optimizes and Lowers
for Acceleration
20

SPIR-V Language Ecosystem
OpenCL C
C++ for OpenCL
clspv
triSYCL
Intel DPC++
Codeplay
ComputeCpp
LLVM
Clang
SYCL
SPIR-V LLVM
IR Translator
Khronos Open Source
3rd Party Open Source
Language Definitions
Closed Source
Environment Specs
OpenCL Vulkan
OpenCLon12
Inc. Mesa SPIR-V to DXIL
SPIRV-Cross
GLSL
HLSL
Metal
Shading
Language
glslang
GLSL
HLSL DXC
DXIL
SPIR-V Tools
(Dis)Assembler
Validator
Optimize/Remap
Fuzzer
Reducer
OpenCL C
Online
Compilation
SPIR-V enables a rich ecosystem of languages and compilers to
target low-level APIs such as Vulkan and OpenCL, including
deployment flexibility: e.g. running OpenCL C kernels on Vulkan
IREE
21

Khronos for Global Industry Collaboration
Khronos membership is open
to any company
Influence the design and direction
of key open standards that will
drive your business
Accelerate time-to-market with
early access to specification drafts
Provide industry thought
leadership and gain insights into
industry trends and directions
Benefit from Adopter discounts
www.khronos.org/members/
ntrevett@nvidia.com | @neilt3d
22

Resources
• Khronos Website and home page for all Khronos Standards
• https://www.khronos.org/
• OpenCL Resources and C++ for OpenCL documentation
• https://www.khronos.org/opencl/resources
• https://github.com/KhronosGroup/Khronosdotorg/blob/master/api/opencl/assets/CXX_for_OpenCL.pdf
• OpenVX Tutorial, Samples and Sample Implementation
• https://github.com/rgiduthuri/openvx_tutorial
• https://github.com/KhronosGroup/openvx-samples
• https://github.com/KhronosGroup/OpenVX-sample-impl/tree/openvx_1.3
• NNEF Tools
• https://github.com/KhronosGroup/NNEF-Tools
• SYCL Resources
• http://sycl.tech
• SPIR-V User Guide
• https://github.com/KhronosGroup/SPIRV-Guide
• MLIR Blog
• https://blog.tensorflow.org/2019/04/mlir-new-intermediate-representation.html
• IREE GitHub Repository
• https://google.github.io/iree/
23
