Vincent Delaitre discusses the challenges of deploying deep learning models onto heterogeneous hardware. Key constraints include reliability, bandwidth, privacy, and the varying capabilities and requirements of target devices. The proposed solution is to use inference runtimes such as Intel OpenVINO and NVIDIA TensorRT, which optimize models for specific devices, with ONNX serving as an interchange format for converting models between frameworks. Techniques such as quantization to INT8 can yield large speedups, especially on embedded devices, while largely preserving accuracy. The overall goal is to deploy models flexibly across devices ranging from CPUs to GPUs to specialized chips, depending on throughput and other requirements.
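To illustrate the quantization idea mentioned above: INT8 quantization maps 32-bit floats onto 8-bit integers via a scale factor, trading a small amount of precision for much cheaper arithmetic and memory traffic. The following is a minimal sketch of symmetric per-tensor quantization in plain Python; the function names are hypothetical and this is not the API of OpenVINO, TensorRT, or any particular runtime:

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: scale floats into the int8 range."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs else 1.0
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Map int8 values back to approximate floats."""
    return [x * scale for x in q]

weights = [0.5, -1.2, 0.03, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# The round-trip error is bounded by about scale/2 per element,
# which is why accuracy typically stays high after quantization.
```

Real toolchains (e.g. TensorRT's INT8 calibration or ONNX Runtime's quantization utilities) additionally calibrate scales from representative data and may use per-channel scales and zero-points, but the core float-to-int8 mapping is the same.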