Applying Deep Learning Vision Technology to low-cost/power Embedded Systems

© 2016 Synopsys, Inc.
Applying Deep Learning Vision Technology to
Low-cost, Low-power Embedded Systems:
An Industrial Perspective
Pierre Paulin
Director of R&D
16 January 2016

Agenda
• Embedded Vision application trends and challenges
• Synopsys Embedded Vision Processor overview
• Convolutional Neural Networks
– Applications, requirements
– Dedicated CNN engine for EV
– Competitive analysis
• Summary & Final Thoughts

Embedded Vision is Coming Fast
• Embedded Vision is the use of computer vision in embedded systems to interpret meaning from images or video
• In cars to improve safety
• Surveillance for detection and tracking
• In industrial automation to improve quality and control
• Estimated $300B+ market in 2020, 35% CAGR
[Chart: Vision Systems Shipments, in billions of dollars (0 to 350), 2013 to 2020]
Sources: ABI Research, Insight Media, Transparency Market Research, Markets And Markets, Synopsys

Wide Variety of Vision Applications
Cameras, Drones, Home Automation, Retail, Gaming, Infotainment, Augmented Reality, Mobile, Surveillance, ADAS

Autonomous Driving Buzz
1/14/2016 – U.S. Proposes Spending $4 Billion to Encourage Driverless Cars: Obama administration aims to remove hurdles to making autonomous cars more widespread (Wall Street Journal)
8/17/2016 – Ford's self-driving car 'coming in 2021' (BBC News)
8/24/2016 – Self-driving taxis roll out in Singapore - beating Uber to it (The Guardian)
10/20/2016 – Elon Musk: You'll be able to summon your driverless Tesla from cross-country (CNN Money)
10/25/2016 – Uber's Self-Driving Truck Makes Its First Delivery: 50,000 Beers (Wired)

Largest Embedded Vision Application Segment
Advanced Driver Assistance Systems Driven By Safety Concerns
Sources: IC Market Drivers, IC Insights, January 2015; Trends and Opportunities in Driver Assistance and Automated Driving, IHS Automotive, September 2015

Video Surveillance Markets Growing Rapidly
• Global IP video surveillance market expected to grow at a CAGR of 37.3% from 2012 to 2020
• Demand driven by
– Growing installations of IP cameras
– Need for surveillance cameras with better video quality
– Limited ability for real-time human analysis
• 3X growth forecast, 2013-2019
• Segments: security (airports, government, banks, casinos), home surveillance, retail, healthcare
Source: http://www.alliedmarketresearch.com/IP-video-surveillance-VSaaS-market

EV Challenges Require Embedded Vision Processors
Challenges: Performance, Power, Area
Less Efficient EV Options
• CPUs don’t have the math horsepower for fast 2D vision processing
• GPUs have high performance but large area and higher power
• DSPs are designed for low-power audio and speech applications, not 2D video
• FPGAs are good for prototyping but are expensive and performance limited
Dedicated Embedded Vision Processors
• Higher performance
• Lower power
• Smaller area
• Can include a dedicated deep learning (CNN) engine

Embedded Vision Applications and
Power, Performance and Area (PPA) Requirements

Vision Pipeline Example
Object detection pipeline: Grayscale Conversion → Image Pyramid → Detecting Areas of Interest in a Frame → Non-max Suppression → Draw Box
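The non-max suppression stage of this pipeline can be illustrated with a minimal Python sketch. It is a generic greedy formulation, not the implementation used on the EV processor; the function names, the (x1, y1, x2, y2) box layout and the 0.5 overlap threshold are assumptions made for illustration.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, iou_threshold=0.5):
    """Greedy NMS over a list of (score, box) pairs.

    Detections are visited from highest to lowest score; one is kept only
    if it does not overlap an already kept box by more than the threshold.
    """
    kept = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        if all(iou(box, kept_box) <= iou_threshold for _, kept_box in kept):
            kept.append((score, box))
    return kept


# Hypothetical detections: two overlapping boxes on one object, one elsewhere.
candidates = [
    (0.9, (10, 10, 50, 50)),
    (0.8, (12, 12, 52, 52)),
    (0.7, (100, 100, 140, 140)),
]
survivors = non_max_suppression(candidates)
```

With these inputs the second box overlaps the first heavily and is dropped, leaving one box per object for the final "Draw Box" stage.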

Vision Pipeline Example
Video surveillance pipeline: Grayscale & Image Pyramid → Face Detection → Tracking & Detection Cascade → Fusion & Learning
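The first stage of this pipeline, grayscale conversion followed by an image pyramid, might look like the following Python sketch. The function names are hypothetical. The luma weights are the ITU-R BT.601 coefficients, and the pyramid uses a plain 2x2 box average instead of a Gaussian filter to keep the example short; both choices are assumptions, not details taken from the slides.

```python
def to_grayscale(rgb_image):
    """Convert rows of (r, g, b) tuples to luma using BT.601 weights."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in rgb_image]


def downsample_2x(gray):
    """Halve each dimension by averaging 2x2 blocks (odd edges are dropped)."""
    height, width = len(gray) // 2, len(gray[0]) // 2
    return [[(gray[2 * y][2 * x] + gray[2 * y][2 * x + 1]
              + gray[2 * y + 1][2 * x] + gray[2 * y + 1][2 * x + 1]) / 4.0
             for x in range(width)]
            for y in range(height)]


def image_pyramid(gray, levels):
    """Return up to `levels` images, each half the size of the previous one."""
    pyramid = [gray]
    for _ in range(levels - 1):
        last = pyramid[-1]
        if len(last) < 2 or len(last[0]) < 2:
            break  # too small to halve again
        pyramid.append(downsample_2x(last))
    return pyramid


# Hypothetical 4x4 all-white frame.
frame = [[(255, 255, 255)] * 4 for _ in range(4)]
pyramid = image_pyramid(to_grayscale(frame), levels=4)
```

Every output pixel depends only on a small fixed neighbourhood of input pixels, which is the kind of regular, data-parallel work a wide SIMD engine is suited to.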

Vision Algorithm Computation
Pipeline stages and typical algorithms:
• Pre-processing: noise reduction, color space conversion, gamma correction, image scaling, Gaussian pyramid
• Selecting areas of interest: object detection, background subtraction, feature extraction, image segmentation, connected component labeling
• Precise processing of selected areas: object recognition, tracking, feature matching, gesture recognition, motion analysis
• Decision making: match/no match, flag events
Computation classes:
• Simple Data-Level Parallelism (DLP): good spatial locality, good compute intensity, small context
• More complex DLP: complex data structures, irregular compute intensity, larger context
• Scalar processing: general-purpose compute, thread-level parallelism
Processing resources: RISC scalar, multi-core Gen1 EV SIMD processor, multi-core Gen2 EV SIMD processor, multi-core CNN engine

Sample Power, Performance and Area Targets
• Intelligent video surveillance applications
– Face detection & tracking, pose detection, gaze estimation, gender recognition, age estimation
– People detection & counting for video surveillance
– Driver fatigue detection
– Advanced detection and tracking
• Performance requirement: 10-500 GOP/s
• Power and area, based on 28 nm process node
– Implementation on GPP and GP-GPU: 1-10 W, 50-100 mm2
– Typical customer targets for HD @30 fps: <500 mW, 1-2 mm2

Sample Power, Performance and Area Targets
• ADAS
– Pedestrian, vehicle, traffic sign, lane detection
– Scene segmentation
• Performance requirement: 100-2000 GOP/s
• Power and area, based on 28 nm process node
– Implementation on GPP and GP-GPU: >100 W, >100 mm2
– Typical customer targets for HD @30 fps: 1-2 W, 2-5 mm2

DesignWare® ARC EV6 Processor and CNN
- Vision-specific wide SIMD engine
- Optimized CNN engine
- Programming tools

EV6x Processor Objectives
• Low power: over 1000 GMAC/s/W in CNN engine
• Low area
• High-performance CNN: up to 880 MAC/cycle
• Highly scalable vector engine: 100 GOP/s to 620 GOP/s
• Most integrated solution: scalar, vector, CNN accelerator
• High productivity: standard programming model (OpenCL C, C/C++), Embedded Vision libraries
Preliminary – Subject to Change

EV Processor Solution: EV6x with CNN Engine
[Block diagram: EV6x processor with optional CNN engine and the Embedded Vision programming tools]
• Vision CPU (1 to 4 cores): each core has a 32b scalar unit and a 512b vector DSP, with I$, D$ and VCCM; up to 620 GOP/s at 800 MHz
• CNN Engine (option): up to 880 MAC/cycle
– Convolution: ALU, Conv. 2D, AGUs, CC MEMs
– Classification: ALU, Conv. 1D, AGUs, CC MEMs
• Shared infrastructure: cluster comm., shared memory, DMA, coherency, ARConnect (sync, debug, power management), AXI interconnect
• Embedded Vision programming tools: C/C++ compiler, OpenCL C compiler with whole-function vectorization, kernel library (K1 … Kn), user kernels (Ui, Uj, Uk, Um) assembled into a graph, CNN graph mapping tools (CNN graph nodes Cn)
• Prototyping: HAPS® rapid prototyping board, virtual prototype
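The MAC/cycle figure can be turned into a peak throughput with simple arithmetic, sketched below in Python. The slide quotes 800 MHz alongside the vector engine's 620 GOP/s; applying the same clock to the CNN engine is an assumption here, as are the function names and the utilization value.

```python
def peak_gmac_per_second(macs_per_cycle, clock_mhz):
    """Peak multiply-accumulate rate in GMAC/s: MACs per cycle x cycles per second."""
    return macs_per_cycle * clock_mhz / 1000.0


def sustained_gmac_per_second(peak_gmacs, utilization):
    """Scale the peak by the fraction of cycles in which the MAC array is busy."""
    return peak_gmacs * utilization


# 880 MAC/cycle is from the slide; 800 MHz for the CNN engine is assumed.
peak = peak_gmac_per_second(macs_per_cycle=880, clock_mhz=800)

# Hypothetical utilization, for illustration only.
sustained = sustained_gmac_per_second(peak, utilization=0.75)
```

Under these assumptions the peak works out to 704 GMAC/s. Real throughput depends on how well a given CNN graph keeps the MAC array fed, which is why utilization is modelled as a separate factor.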

CNN – Convolutional Neural Networks
Deep Learning Approach to Embedded Vision

CNN for a Wide Range of Vision Applications
• Image classification, search similar images
• Object detection, classification & localization
– Any type of object(s), depending on training phase
• Face recognition
• Visual attention
• Facial expression recognition
• Gesture recognition / hand tracking
• Resolution upscaling
• Scene recognition and labelling, semantic segmentation
– Sky, mountain, road, tree, building, …
• Recent advocates
– Nvidia, Microsoft, Google, Baidu, Adobe, Qualcomm, Yahoo …
– Mobileye for autonomous driving cars
[Image: segmented street scene labelled car, sky, building, road]

Pedestrian Detection: HoG vs. CNN

Computation Requirements for CNN
[Chart: accuracy versus computational complexity, with markers at 1 GOPs/frame and 10 GOPs/frame]
• LeNet (1994): 4 layers
• AlexNet (2012): 8 layers, 100 MByte
• VGG-19 (2014): 19 layers, 270 MByte
• GoogLeNet (2014): 22 layers, 20 MByte
• ResNet (2015): 152 layers!, 10 MByte
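A per-frame cost becomes a sustained compute requirement once a frame rate is fixed. The Python sketch below does this for the two markers on the chart, using the 30 fps rate quoted in the customer targets earlier in the deck; the function name is hypothetical.

```python
def required_gops_per_second(gops_per_frame, frames_per_second):
    """Sustained compute rate needed to process every frame in real time."""
    return gops_per_frame * frames_per_second


FPS = 30  # frame rate taken from the "HD @30 fps" customer targets

requirements = {
    gops_per_frame: required_gops_per_second(gops_per_frame, FPS)
    for gops_per_frame in (1, 10)
}

for gops_per_frame, rate in requirements.items():
    print(f"{gops_per_frame} GOPs/frame at {FPS} fps -> {rate} GOP/s")
```

At 30 fps the two markers correspond to 30 GOP/s and 300 GOP/s, both inside the 10-500 GOP/s range given for video surveillance.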

Scene Segmentation
Source: Press Release by Toshiba and Denso, 17 Oct. 2016

Super resolution using CNN
[Figure: two examples, each comparing Source, Bicubic Interpolation, CNN and Reference images]
C. Dong et al., “Image Super-Resolution Using Deep Convolutional Networks” (2016)

Super-Resolution using Convolutional Neural Networks
• CNNs deliver superior super-resolution for single images and video
• CNNs for super-resolution require a dedicated compute engine with high compute capacity
• Example: C. Dong et al., “Image Super-Resolution Using Deep Convolutional Networks” (2016) requires 600 GMAC for one 4K frame

CNN Graph Training and Porting
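[Diagram: CNN graph training and porting flow]
• Training (GPU farm): image labeling, graph exploration and training; produces the CNN graph and coefficients
• Porting: code vectorization; produces the object detection executable code
• Targets: GPP, GP-GPU, CNN-optimized processor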

CNN Computation
• Convolution of multiple inputs together
– Fixed kernel size
• Optional subsampling
– 1x, 2x, 4x
• Optional max-pooling
• Very regular, repetitive computation
– Dominated by MAC
– Deterministic
• Non-linear activation function
– Rectifier, sigmoid, hyperbolic tangent
[Diagram: M inputs I0 … IM-1 (XI * YI each) are convolved with Z kernels (K * K) and their associated weights, then passed through an activation (tanh, ReLU) to give N outputs O0 … ON-1 (XO * YO each)]
Oj = act(Bj + (Iv x Kw) + …), where x denotes convolution

EV6x Second Generation CNN Engine for
Object Detection and Semantic Segmentation
- High performance, low power and area
- Fully programmable
[Image: segmented street scene labelled car, sky, building]

High-Performance EV6x CNN Engine
• Dedicated EV6x CNN engine with performance equal to or better than GP-GPU
• Programmable to support the full range of fixed-point CNN graphs
• State-of-the-art power efficiency
• Real-time, high-quality image classification, object recognition, semantic segmentation
• Supports resolutions up to 4K
• Operates in parallel with Vision CPUs, increasing efficiency and throughput
[Block diagram: Vision CPU core (32-bit RISC, 512-bit vector DSP) and CNN engine (Convolution: ALU, Conv. 2D, AGUs, CC MEMs; Classification: ALU, Conv. 1D, AGUs, CC MEMs) with cluster shared memory, DMA and ARConnect, on an AXI interconnect]

AlexNet on ImageNet
Quantization opportunities for recognition tasks
[Chart: recognition accuracy versus fixed-point word length, comparing 32-bit floating point with 16-bit fixed point, with 12-bit highlighted] [Moons WACV2016]
• 12 bits is a good compromise between CNN recognition performance and hardware cost
– 8 bits will cause recognition rate loss on existing graphs
– A 12-bit multiplier is almost half the area of a 16-bit multiplier
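The word-length trade-off can be explored with a small fixed-point quantizer, sketched below in Python. The weight values, the choice of one sign bit with the remaining bits fractional, and the rule of thumb that multiplier area grows with the square of the word length are all assumptions for illustration; none of them comes from the cited study.

```python
def quantize(value, word_bits, frac_bits):
    """Round to a signed fixed-point grid of word_bits total bits, saturating at the ends."""
    scale = 1 << frac_bits
    lowest = -(1 << (word_bits - 1))
    highest = (1 << (word_bits - 1)) - 1
    code = max(lowest, min(highest, round(value * scale)))
    return code / scale


def max_abs_error(values, word_bits, frac_bits):
    """Largest quantization error over a set of values."""
    return max(abs(v - quantize(v, word_bits, frac_bits)) for v in values)


def multiplier_area_ratio(bits_a, bits_b):
    """Relative area under the assumption that area scales with word length squared."""
    return (bits_a / bits_b) ** 2


# Hypothetical weights in [-1, 1).
sample_weights = [-0.75, -0.1, 0.3, 0.999]
error_12 = max_abs_error(sample_weights, word_bits=12, frac_bits=11)
error_8 = max_abs_error(sample_weights, word_bits=8, frac_bits=7)
area_12_vs_16 = multiplier_area_ratio(12, 16)
```

Under the quadratic assumption a 12-bit multiplier comes out at about 56% of the area of a 16-bit one, in line with the "almost half" figure on the slide. The 8-bit error on these samples is much larger than the 12-bit error, mainly because the largest weight saturates.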

CNN data precision – Qualcomm data

CNN Competitive Analysis

CNN Performance and Area Efficiency Comparison
[Chart: area efficiency (GMAC/s/mm2) and performance (GMAC/s) on log scales, comparing first-generation vision processors, GP-GPU, and the EV6x Embedded Vision Processor with integrated CNN. Circle area is proportional to logic area. Annotated ratios: 300X, 2X, 20X, 14X]
CNN Performance and Power Efficiency Comparison
[Chart: power efficiency (GMAC/s/W) and performance (GMAC/s) on log scales, comparing first-generation vision processors, GP-GPU, and the EV6x Embedded Vision Processor with integrated CNN. Circle area is proportional to logic area. Annotated ratios: 11X, 30X]

EV Challenges Require Embedded Vision Processors
Challenges: Performance, Power, Area
Less Efficient EV Options
• CPUs don’t have the math horsepower for fast 2D vision processing
• GPUs have high performance but large area and higher power
• DSPs are designed for low-power audio and speech applications, not 2D video
• FPGAs are good for prototyping but are expensive and performance limited
Dedicated Embedded Vision Processors
• High performance
• Lower power
• Smaller area
• Dedicated deep learning (CNN) engine provides PPA numbers compatible with surveillance, ADAS and mobile targets
– 1000 GMACs/W
– 100-1000 GOP/s
– Few mm2
Thank You
Contact me at:
pierre.paulin@synopsys.com
