"The Vision API Maze: Options and Trade-offs," a Presentation from the Khronos Group

© Copyright Khronos Group 2016
The Vision API Maze
Options and Trade-offs
Embedded Vision Summit, May 2016
Neil Trevett | Khronos President
NVIDIA Vice President Developer Ecosystem

Accelerated Vision API Jungle
Vision Frameworks
Language-based
Acceleration Frameworks
Explicit
Kernels
GPU FPGA
DSP
Dedicated
Hardware
Neural Net Libraries

http://hwstats.unity3d.com/mobile/gpu.html
OpenGL ES 2.0
OpenGL ES 3.x
OpenGL ES
• 2003: OpenGL ES 1.0
- Fixed function pipeline
• 2004: OpenGL ES 1.1 (driver update)
- Fixed function pipeline
• 2007: OpenGL ES 2.0 (silicon update)
- Vertex and fragment shaders
• 2012: OpenGL ES 3.0 (silicon update)
- 32-bit integers and floats
- NPOT, 3D/depth textures
- Texture arrays
- Multiple Render Targets
• 2014: OpenGL ES 3.1 (driver update)
- Compute Shaders
• 2015: OpenGL ES 3.2 (silicon update)
- Tessellation and geometry shaders
- ASTC Texture Compression
- Floating point render targets
- Debug and robustness for security
Epic’s Rivalry demo using full Unreal Engine 4
https://www.youtube.com/watch?v=jRr-G95GdaM
+AEP
OpenGL ES 3.1 and
Android Extension Pack
since Android 5.0 (Lollipop)
Vertex and Fragment Shaders Compute Shaders

New Generation GPU APIs
Only Apple | Only Windows 10
Cross Platform
Vulkan 1.0 launched in February
Shipping now on Windows, Linux, Android platforms from multiple vendors
‘Half Way New Gen’
Retains Traditional Binding Model
Mixes OpenGL ES 3.1/OpenCL 1.2
C++11-based kernel language
Objective-C or Swift

Vulkan Explicit GPU Control
Layered GPU Control (High-level Driver Abstraction)
• Application
- Single thread per context
• High-level Driver Abstraction
- Context management
- Memory allocation
- Full GLSL compiler
- Error detection
• GPU
Explicit GPU Control (Thin Driver)
• Application
- Memory allocation
- Thread management
- Synchronization
- Multi-threaded generation of command buffers
• Language Front-end Compilers
- Initially GLSL
• Loadable debug and validation layers
• Thin Driver
• GPU
Vulkan 1.0 provides access to
OpenGL ES 3.1 / OpenGL 4.X-class GPU functionality
but with increased performance and flexibility
Vulkan Benefits
• Loadable Layers:
- No error handling overhead in production code
• SPIR-V Pre-compiled Shaders:
- No front-end compiler in driver
- Future shading language flexibility
• Simpler drivers:
- Improved efficiency/performance
- Reduced CPU bottlenecks
- Lower latency
- Increased portability
• Graphics, compute and DMA queues:
- Work dispatch flexibility
• Command Buffers:
- Command creation can be multi-threaded
- Multiple CPU cores increase performance
• Resource management in app code:
- Fewer hitches and surprises
SPIR-V pre-compiled
shaders
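The command-buffer point above can be pictured with a small toy model. This is only a sketch in plain Python, not Vulkan code: `record_commands`, `build_frame` and the string "commands" are hypothetical names invented for illustration. It shows the structure the slide describes, where each CPU thread records into its own command buffer with no shared state and a single point then submits the finished buffers in a fixed order. (Python threads will not show the performance gain; only the shape of the pattern is illustrated.)

```python
import threading

def record_commands(thread_id, draw_calls):
    """Record one command buffer. Touches no shared state, so no locking is needed."""
    return [f"thread{thread_id}:draw({obj})" for obj in draw_calls]

def build_frame(objects, num_threads=4):
    # Split the scene so that each thread records an independent slice.
    chunks = [objects[i::num_threads] for i in range(num_threads)]
    buffers = [None] * num_threads

    def worker(i):
        # Each worker writes only to its own slot in the list.
        buffers[i] = record_commands(i, chunks[i])

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Single submission point: the buffers go to the (toy) queue in a fixed order.
    queue = []
    for buf in buffers:
        queue.extend(buf)
    return queue

frame = build_frame([f"mesh{n}" for n in range(8)], num_threads=4)
```

Recording is parallel while submission stays ordered, which is why the submitted stream is deterministic even though the threads finish in any order.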

NVIDIA CUDA
• The industry’s original dedicated GPU Compute language
- C/C++11 language extensions for ‘single source’ programming
• Easy programmability and low level access to GPU
- Unified Memory, Virtual Addressing, Dynamic Parallelism etc.
• Mature and optimized tools and compute / imaging libraries
- Thrust, NPP, cuFFT, cuBLAS, cuda-gdb, nvprof etc.
• CUDA 7.5 released September 2015
- Added 16-bit floating point (FP16) data format support
• NVIDIA only, GPU only
CUDA 8 (Coming Soon)
Support for Pascal Unified Memory
Flexible Mixing of
CUDA (C++ language extensions)
OpenACC (parallelism through standard C++)
Thrust (C++ parallel library)

OpenCL
• Heterogeneous parallel programming of diverse compute resources
- One code tree can be executed on CPUs, GPUs, DSPs and FPGAs
• OpenCL = Two APIs and Two Kernel languages
- C Platform Layer API to query, select and initialize compute devices
- C Runtime API to build and execute kernels across multiple devices
- OpenCL C and OpenCL C++ kernel languages
• The OpenCL C++ kernel language is a static subset of C++14
- Adaptable and elegant sharable code – great for building libraries
- Templates enable meta-programming for highly adaptive software
- Lambdas used to implement nested/dynamic parallelism
[Diagram: OpenCL kernel code is compiled for devices (GPU, DSP, CPU, FPGA). On the host CPU, the Runtime API loads and executes kernels across devices.]

OpenCL Conformant Implementations
[Timeline figure: OpenCL specification releases: OpenCL 1.0 (Dec08), OpenCL 1.1 (Jun10), OpenCL 1.2 (Nov11), OpenCL 2.0 (Nov13), OpenCL 2.1 (Nov15). Conformant implementations are plotted per vendor in Desktop, Mobile, Embedded and FPGA groups, from 1.0 (May09) through 2.0 (Nov15) and 1.2 (Mar16). Vendor timelines are first implementation of each spec generation.]

OpenCL 2.2 - Top to Bottom C++
• OpenCL 1.0 Specification (Dec08)
• OpenCL 1.1 Specification (Jun10, 18 months)
- 3-component vectors
- Additional image formats
- Multiple hosts and devices
- Buffer region operations
- Enhanced event-driven execution
- Additional OpenCL C built-ins
- Improved OpenGL data/event interop
• OpenCL 1.2 Specification (Nov11, 18 months)
- Device partitioning
- Separate compilation and linking
- Enhanced image support
- Built-in kernels / custom devices
- Enhanced DX and OpenGL Interop
• OpenCL 2.0 Specification (Nov13, 24 months)
- Shared Virtual Memory
- On-device dispatch
- Generic Address Space
- Enhanced Image Support
- C11 Atomics
- Pipes
- Android ICD
• OpenCL 2.1 Specification (Nov15, 24 months)
- SPIR-V in Core
- Subgroups into core
- Subgroup query operations
- clCloneKernel
- Low-latency device timer queries
• OpenCL 2.2 PROVISIONAL (May16, 7 months)
- OpenCL C++14 Kernel Language into core
Single Source C++ Programming
Full support for features in C++14 Kernel Language
API and Language Specs
Brings C++14 Kernel Language into core specification
Portable Kernel Intermediate Language
Support for C++14 kernel language e.g. constructors/destructors

SPIR-V Ecosystem
LLVM
Third party kernel and
shader Languages
SPIR-V
• Khronos defined and controlled
cross-API intermediate language
• Native support for graphics
and parallel constructs
• 32-bit Word Stream
• Extensible and easily parsed
• Retains data object and control
flow information for effective
code generation and translation
OpenCL C++, OpenCL C
GLSL
Khronos has open sourced
these tools and translators
IHV Driver
Runtimes
Other
Intermediate
Forms
SPIR-V Validator
SPIR-V Tools
SPIR-V (Dis)Assembler
LLVM to SPIR-V
Bi-directional
Translator
Khronos plans to open
source these tools soon
https://github.com/KhronosGroup/SPIR/tree/spirv-1.1
Open source C++ front-end released
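To make the "32-bit word stream" and "easily parsed" points concrete, here is a minimal sketch of walking a SPIR-V module in Python. It relies on the published binary layout: a five-word header that starts with the magic number 0x07230203, followed by instructions whose first word packs the word count in the high 16 bits and the opcode in the low 16 bits. The function name `parse_spirv` is hypothetical, the sketch assumes a little-endian stream, and the tiny hand-built module is not a valid shader.

```python
import struct

SPIRV_MAGIC = 0x07230203

def parse_spirv(blob):
    """Split a little-endian SPIR-V binary into its header and (opcode, operands) pairs."""
    if len(blob) % 4 != 0:
        raise ValueError("SPIR-V is a stream of 32-bit words")
    words = struct.unpack(f"<{len(blob) // 4}I", blob)
    if len(words) < 5 or words[0] != SPIRV_MAGIC:
        raise ValueError("missing SPIR-V header")

    header = {
        "version": ((words[1] >> 16) & 0xFF, (words[1] >> 8) & 0xFF),  # (major, minor)
        "generator": words[2],
        "bound": words[3],  # all result ids are below this value
    }

    instructions = []
    i = 5  # instructions start after the 5-word header
    while i < len(words):
        word_count = words[i] >> 16
        opcode = words[i] & 0xFFFF
        if word_count == 0 or i + word_count > len(words):
            raise ValueError("malformed instruction")
        instructions.append((opcode, words[i + 1:i + word_count]))
        i += word_count
    return header, instructions

# Hand-built fragment: header, OpCapability Shader (opcode 17),
# OpMemoryModel Logical GLSL450 (opcode 14).
module = struct.pack("<10I",
                     SPIRV_MAGIC, 0x00010000, 0, 1, 0,
                     (2 << 16) | 17, 1,
                     (3 << 16) | 14, 0, 1)
header, instructions = parse_spirv(module)
```

Because every instruction carries its own length, a tool can skip opcodes it does not understand, which is part of what makes the format extensible.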

SYCL for OpenCL
• Single-source heterogeneous programming using STANDARD C++
- Use C++ templates and lambda functions for host & device code
• Aligns the hardware acceleration of OpenCL with direction of the C++ standard
- C++14 with open source C++17 Parallel STL hosted by Khronos
C++ Kernel Language
Low Level Control
‘GPGPU’-style separation of
device-side kernel source
code and host code
Single-source C++
Programmer Familiarity
Approach also taken by
C++ AMP and OpenMP
Developer Choice
The development of the two specifications is aligned so
code can be easily shared between the two approaches

OpenCL Roadmap Discussions…
Embedded
Use cases: Signal and Pixel Processing
Roadmap: arbitrary precision for power
efficiency, hard real-time scheduling,
asynch DMA
FPGAs
Use cases: Network and
Stream Processing
Roadmap: enhanced execution model, self-
synchronized and self-scheduled graphs, fine-
grained synchronization between kernels,
DSL in C++
HPC, SciViz, Datacenter
Use case: Numerical Simulation, Virtualization
Roadmap: enhanced streaming processing,
enhanced library support
Vulkan Compute can leverage OpenCL?
Gaming Compute, Pixel Processing, Inference
Fine grain graphics and compute (no interop needed)
SPIR-V for shading language flexibility – C/C++
Low-latency, fine grain run-time
Google Android adoption
Competes well with Metal (=C++/OpenCL 1.2)
Roadmap: types, precision and accuracy
Pointers and address spaces, execution model
Desktop
Use cases: Video and Image Processing, Gaming Compute
Roadmap: Vulkan interop, arbitrary precision for
increased performance, pre-emption, collective
programming and improved execution model
Mobile
Use case: Photo and Vision Processing
Roadmap: arbitrary precision for
inference engine and pixel processing efficiency, pre-
emption and QoS scheduling for power efficiency
Possible learnings from Vulkan Philosophy
1. Explicit - provide direct access to hardware capabilities with thin driver
2. Feature Sets – enable diverse architectures to ship just relevant features
3. Open source conformance tests for deep community engagement

OpenCV
• Extensive and widely used open source
vision library - written in optimized C/C++
- Free-use BSD license
• C++, C, Python and Java interfaces
- Windows, Linux, Mac OS, iOS and Android
• Increasingly taking advantage of
heterogeneous processing using OpenCL
- OpenCV 3.X Transparent API;
single API entry for each function/algorithm
- Dynamically loads OpenCL runtime if available;
otherwise falls back to CPU code
- Runtime Dispatching;
no recompilation!
CPU
Thread
CPU
Thread
CPU
Thread
…
ocl::Queue
ocl::Device
ocl::Queue ocl::Queue
ocl::Device
…
…
ocl::Context
OpenCV Application
OpenCV Transparent API for OpenCL Kernel Offload
• One OpenCL queue per CPU thread
• CPU threads can share a device
• OpenCL kernels are executed asynchronously
OpenCV is active open source - not an API specification
A strength and a weakness!
Production deployment often needs tightly defined callable API
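The runtime-dispatch idea from the Transparent API bullets (one entry point per function, use OpenCL if a runtime is present, otherwise run CPU code, with no recompilation) can be sketched as a pattern. This is not OpenCV's implementation: `threshold`, `threshold_cpu`, `detect_backend` and `IMPLEMENTATIONS` are hypothetical names, and the accelerated path is deliberately left unregistered so that the sketch always ends up on the CPU fallback. The only real call is `ctypes.util.find_library`, which looks for an OpenCL runtime library on the system.

```python
import ctypes.util

def threshold_cpu(pixels, t):
    """Reference CPU implementation: binarize a flat list of pixel values."""
    return [255 if p > t else 0 for p in pixels]

# Backend table. A real library would register an OpenCL kernel here;
# this sketch has only the CPU path, so "opencl" is deliberately absent.
IMPLEMENTATIONS = {"cpu": threshold_cpu}

def detect_backend():
    # find_library only locates the runtime; it does not load or initialize it.
    return "opencl" if ctypes.util.find_library("OpenCL") else "cpu"

def threshold(pixels, t):
    """Single API entry point: the caller never names a backend."""
    impl = IMPLEMENTATIONS.get(detect_backend(), IMPLEMENTATIONS["cpu"])
    return impl(pixels, t)

result = threshold([10, 200, 128, 129], 128)
```

The caller's code is identical on machines with and without an OpenCL runtime; the choice is made when the function is called, not when the program is built.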

Vision Pipeline Challenges and Opportunities
Sensor Proliferation | Growing Camera Diversity | Diverse Vision Processors
Flexible sensor and camera
control to GENERATE
an image stream
Use efficient acceleration to
PROCESS
the image stream
Combine vision output
with other sensor data
on device

OpenVX – Low Power Vision Acceleration
• Precisely defined API for production deployment of vision acceleration
- Targeted at real-time mobile and embedded platforms
• Higher abstraction than OpenCL for performance portability across diverse architectures
- Multi-core CPUs, GPUs, DSPs and DSP arrays, ISPs, Dedicated hardware…
• Extends portable vision acceleration to very low power domains
- Doesn’t require high-power CPU/GPU Complex or OpenCL precision
- Low-power host can setup and manage frame-rate graph
[Diagram: Application, Middleware and Vision Engine layered over GPU, DSP and Hardware]
[Chart: Vision Processing Efficiency. Axes: Power Efficiency (X1, X10, X100) against Computation Flexibility. Plotted: Multi-core CPU, GPU Compute, Vision DSPs, Dedicated Hardware]

OpenVX Graphs
• OpenVX developers express a graph of image operations (‘Nodes’)
- Nodes can be on any hardware or processor coded in any language
• Graphs can execute almost autonomously
- Possible to minimize host interaction during frame-rate graph execution
• Graphs are the key to run-time optimization opportunities…
[Diagram: Feature Extraction Example Graph. OpenVX nodes: Color Conversion, Channel Extract, Image Pyramid, Harris Track, Optical Flow. Data: Camera Input, RGB Frame, YUV Frame, Gray Frame, Pyr_t, Ftr_t-1, Array of Keypoints, Array of Features, Rendering Output.]
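The graph idea can be illustrated with a toy model in plain Python. This is a sketch under stated assumptions, not the OpenVX API: the `Graph` class, its `add_node`, `verify` and `process` methods, and the two node functions are hypothetical. It mirrors the first two stages of the example graph (color conversion, then channel extract), and shows why declaring the whole graph up front matters: the framework can check and order the nodes once, then run every frame without further host decisions. The U and V values here are unscaled color differences, chosen only to keep the arithmetic short.

```python
class Graph:
    """Toy dataflow graph: nodes are declared first, then verified, then executed."""

    def __init__(self):
        self.nodes = []    # (name, func, input_keys, output_key)
        self.order = None  # execution order, filled in by verify()

    def add_node(self, name, func, inputs, output):
        self.nodes.append((name, func, tuple(inputs), output))
        self.order = None  # any change invalidates the previous verification

    def verify(self, external_inputs):
        """Order the nodes so each runs after its producers; fail on gaps or cycles."""
        available = set(external_inputs)
        pending = list(self.nodes)
        order = []
        while pending:
            ready = [n for n in pending if all(k in available for k in n[2])]
            if not ready:
                raise ValueError("unsatisfied inputs or cycle: "
                                 + ", ".join(n[0] for n in pending))
            for n in ready:
                order.append(n)
                available.add(n[3])
                pending.remove(n)
        self.order = order

    def process(self, **inputs):
        if self.order is None:
            raise RuntimeError("call verify() before process()")
        data = dict(inputs)
        for _name, func, ins, out in self.order:
            data[out] = func(*(data[k] for k in ins))
        return data

def color_convert(rgb):
    # Integer luma approximation; U and V are plain color differences.
    yuv = []
    for r, g, b in rgb:
        y = (77 * r + 150 * g + 29 * b) >> 8
        yuv.append((y, b - y, r - y))
    return yuv

def channel_extract(yuv):
    return [y for y, _u, _v in yuv]

graph = Graph()
# Declared out of order on purpose: verify() works out the dependencies.
graph.add_node("channel_extract", channel_extract, ["yuv"], "gray")
graph.add_node("color_convert", color_convert, ["rgb"], "yuv")
graph.verify(external_inputs=["rgb"])
out = graph.process(rgb=[(255, 255, 255), (0, 0, 0), (255, 0, 0)])
```

After `verify()` succeeds, `process()` can be called once per frame with no further setup.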

OpenVX Efficiency through Graphs..
• Graph Scheduling: Split the graph execution across the whole system: CPU / GPU / dedicated HW
- Faster execution or lower power consumption
• Memory Management: Reuse pre-allocated memory for multiple intermediate data
- Less allocation overhead, more memory for other applications
• Kernel Merge: Replace a sub-graph with a single faster node
- Better memory locality, less kernel launch overhead
• Data Tiling: Execute a sub-graph at tile granularity instead of image granularity
- Better use of data cache and local memory
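Two of these optimizations, kernel merge and data tiling, are easy to show on a toy two-stage pipeline. The sketch below is illustrative only: the function names and the one-dimensional "image" are assumptions, and a real implementation would make these transformations inside the OpenVX runtime rather than in application code. All three variants produce the same result; they differ in how much intermediate data exists at once.

```python
def brighten(img, gain):
    return [min(255, p * gain) for p in img]

def threshold(img, t):
    return [255 if p > t else 0 for p in img]

def unmerged(img, gain, t):
    """Two nodes: a full-size intermediate image is written, then read back."""
    tmp = brighten(img, gain)
    return threshold(tmp, t)

def merged(img, gain, t):
    """Kernel merge: both stages in one pass, with no intermediate image."""
    return [255 if min(255, p * gain) > t else 0 for p in img]

def tiled(img, gain, t, tile=4):
    """Data tiling: run the whole sub-graph on one tile at a time,
    so the intermediate data is never larger than a tile."""
    out = []
    for start in range(0, len(img), tile):
        out.extend(threshold(brighten(img[start:start + tile], gain), t))
    return out

image = [0, 40, 90, 130, 200, 255]
a = unmerged(image, 2, 200)
b = merged(image, 2, 200)
c = tiled(image, 2, 200)
```

Because the graph states which data is only intermediate, an implementation is free to choose any of these strategies without changing what the application observes.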

Example Relative Performance
[Chart: Relative Performance, OpenCV (GPU accelerated) vs. OpenVX (GPU accelerated). Bar values in category order: Arithmetic 1.1, Analysis 2.9, Filter 8.7, Geometric 1.5, Overall 2.5]
NVIDIA implementation experience.
Geometric mean of >2200 primitives, grouped into categories,
running at different image sizes and parameter settings
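For readers unfamiliar with why a geometric mean is used for relative performance, the following sketch computes one. It uses the four category values from the chart, and pairing each value with a category by the order in which they appear is an assumption. The slide's overall figure is taken over the individual primitives, not over these four numbers, so this is only an illustration of the arithmetic.

```python
import math

def geometric_mean(values):
    """n-th root of the product, computed via logs for numerical stability."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Category values as read from the chart (the category pairing is assumed).
category_speedups = {"Arithmetic": 1.1, "Analysis": 2.9, "Filter": 8.7, "Geometric": 1.5}

gm = geometric_mean(category_speedups.values())
am = sum(category_speedups.values()) / len(category_speedups)
```

The geometric mean of these four values is about 2.54, while the arithmetic mean is 3.55. The arithmetic mean is pulled up by the single large Filter value, which is why ratios such as speedups are usually summarized with a geometric mean.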

Layered Vision Processing Ecosystem
Programmable Vision
Processors
Dedicated Vision
Hardware
Application
Processor Hardware
Powerful, flexible
low-level APIs / languages
Application Software
Engines/frameworks
C/C++
Implementers may use OpenCL or Compute Shaders to
implement OpenVX nodes on programmable processors
Developers can then use OpenVX to easily
connect those nodes into a graph
The OpenVX graph enables implementers to optimize execution across
diverse hardware architectures and drive toward lower-power implementations
OpenVX enables the graph to be extended to include hardware
architectures that don’t support programmable APIs

OpenVX 1.0 Shipping, OpenVX 1.1 Released!
• Multiple OpenVX 1.0 Implementations shipping – spec in October 2014
- Open source sample implementation and conformance tests available
• OpenVX 1.1 Specification released 2nd May 2016 at Embedded Vision Summit
- Expands node functionality AND enhances graph framework
- Laplacian pyramids and enhanced filters
- Easier user nodes and control over execution on heterogeneous platforms
- Sample source and conformance tests will be updated to OpenVX 1.1 in 1H16
= shipping implementations

OpenVX Roadmap and Safety Critical APIs
New Generation APIs for
safety certifiable vision,
graphics and compute
e.g. ISO 26262 and DO-178B/C
OpenGL ES 1.0 - 2003
Fixed function graphics
OpenGL ES 2.0 - 2007
Shader programmable pipeline
OpenGL SC 1.0 - 2005
Fixed function graphics subset
OpenGL SC 2.0 - April 2016
Shader programmable pipeline subset
Experience and Guidelines
Small driver size
Advanced functionality
Graphics and compute
OpenVX Roadmap Discussions
Significantly broaden node functionality
In-graph neural nets
Programmable nodes (OpenCL or SPIR-V?)
Market-specific feature sets
OpenVX SC?

Get Involved!
• A diverse set of vision APIs in the industry
- Developer choice is good – but need to choose wisely!
• Many APIs originally created to program GPUs
- But vision processing needs are increasingly driving API roadmaps
• Industry will tend to consolidate around leading APIs
- Working toward a multi-layer API ecosystem
- Powerful foundational hardware APIs enabling rich middleware APIs and libraries
• Any company or organization is welcome to join Khronos
for a voice and a vote in any of its standards
- www.khronos.org
• Neil Trevett
- ntrevett@nvidia.com
- @neilt3d
