"The Caffe2 Framework for Mobile and Embedded Deep Learning," a Presentation from Facebook

The Caffe2 Framework for Mobile
and Embedded Deep Learning
Fei Sun
AI Platform, Facebook

Outline
• Caffe2 on mobile
• ONNX
• From research to production
• Vendor’s dilemma
• Caffe2 on embedded. Benchmarking the performance

Caffe2 is...
• A lightweight open source framework for deep learning algorithms
• Primarily designed for production use cases
• Speed is top priority
• C++ / Python based interfaces
• Supports deployment on multiple platforms
• Linux, Mac, iOS, Android and Windows
• IoT devices, Raspberry Pi, Tegra X1, ...

Mobile Fragmentation
• Two major operating systems: Android, iOS
• 20+ chipset vendors
• 25+ CPU microarchitectures
• 15+ GPU architectures
• Three major graphics APIs: OpenGL, Vulkan, Metal
• Two major compute APIs: RenderScript, OpenCL

One Framework, Multiple Backends
• ARM Compute Library
• NNPACK
• Metal™/MPSCNN
• Qualcomm Snapdragon NPE
• CUDA/cuDNN

CPU Acceleration with NNPACK
• Fast convolution algorithms
• NEON micro-kernels
• Multi-core computation
• big.LITTLE optimizations

GPU Acceleration on iPhones
• Custom Metal™ Kernels
• Leverage MPSCNN (Metal Performance Shaders)
• Performs best on iPhone 6s and later

GPU Acceleration on Android
• Leverage Qualcomm's Snapdragon NPE
• Supports new Qualcomm Adreno GPUs
• Runs on top of OpenCL
• Potential to use Hexagon DSPs

Caffe2 mobile integration with Qualcomm® Snapdragon™ mobile platform
• Device: Galaxy S7, Snapdragon 820, Marshmallow
• CPU: 12 FPS
• GPU: 50 FPS

GPU Acceleration on Android
• Leveraging ARM Compute Library (ACL)
• Utilizes OpenGL ES 3.1
• For newer Mali GPUs, e.g., from Samsung LSI, MediaTek
• Person segmentation model:
• CPU: 50 FPS
• ACL: 71 FPS with CPU->GPU data transfer, 133 FPS without
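The FPS figures on this slide can be turned into per-frame times to get a rough idea of what the CPU->GPU step costs. The sketch below is only arithmetic on the quoted numbers; reading "with CPU->GPU" as a data transfer whose cost is the difference between the two ACL frame times is an assumption, and the function and variable names are made up for illustration.

```python
def frame_time_ms(fps: float) -> float:
    """Convert throughput in frames per second to time per frame in milliseconds."""
    return 1000.0 / fps


# FPS figures quoted on the slide for the person segmentation model.
cpu_fps = 50.0
acl_with_transfer_fps = 71.0
acl_without_transfer_fps = 133.0

cpu_ms = frame_time_ms(cpu_fps)
with_transfer_ms = frame_time_ms(acl_with_transfer_fps)
without_transfer_ms = frame_time_ms(acl_without_transfer_fps)

# Assumption: the gap between the two ACL figures is entirely transfer cost.
transfer_overhead_ms = with_transfer_ms - without_transfer_ms

print(f"CPU:                  {cpu_ms:.1f} ms/frame")
print(f"ACL with transfer:    {with_transfer_ms:.1f} ms/frame")
print(f"ACL without transfer: {without_transfer_ms:.1f} ms/frame")
print(f"Implied transfer overhead: {transfer_overhead_ms:.1f} ms/frame")
```

Under that assumption the transfer accounts for nearly half of the ACL frame time, which is why the two ACL figures differ so much.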

Caffe2 on Mobile
• Engage and collaborate with a few vendors:
• Support Caffe2
• Iterate on performance
• Problem:
• Not scalable

Support What?
• Framework backends: TensorFlow, MXNet, CNTK
• Vendor and numeric libraries: Apple CoreML, Nvidia TensorRT, ARM Compute Library, Qualcomm SNPE, …
• O(n^2) pairs

From Research to Production
• Research new models/operators in PyTorch
• Re-implement the models/operators in Caffe2
• Retrain the models
• Deploy Caffe2 models to production

Open Neural Network Exchange (ONNX)
• Enable interoperability
• Across frameworks and hardware vendors
• Starting base compatibility
• Creating community effort
• Across PyTorch and Caffe2 at FB
• Operators and programming modes gap
• Advanced research to production use cases

Support What All
• Framework backends: TensorFlow, MXNet, CNTK
• Vendor and numeric libraries: Apple CoreML, Nvidia TensorRT, ARM Compute Library, Qualcomm SNPE, …
• O(n) pairs
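The step from O(n^2) pairs to O(n) pairs can be illustrated with a simple count. The sketch below assumes that every framework otherwise needs its own integration with every vendor library, and that a shared exchange format needs one exporter per framework plus one importer per library. The lists hold only the names shown on the slide, and the function names are hypothetical.

```python
def direct_integrations(num_frameworks: int, num_libraries: int) -> int:
    """Every framework integrates with every vendor library separately."""
    return num_frameworks * num_libraries


def integrations_via_exchange_format(num_frameworks: int, num_libraries: int) -> int:
    """Each framework exports to, and each library imports from, one shared format."""
    return num_frameworks + num_libraries


# Names taken from the slide; the real lists are longer (the slide ends with "…").
frameworks = ["TensorFlow", "MXNet", "CNTK"]
libraries = ["Apple CoreML", "Nvidia TensorRT", "ARM Compute Library", "Qualcomm SNPE"]

direct = direct_integrations(len(frameworks), len(libraries))
shared = integrations_via_exchange_format(len(frameworks), len(libraries))

print(f"Direct pairs: {direct}")
print(f"Via a shared format: {shared}")
```

The gap widens quickly: with 10 frameworks and 10 libraries the same count gives 100 direct pairs against 20 converters.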

From Research to Production
• Frontend
• Representation
• Backend
• Frontend
• Representation
• Backend

Embedded Sea of Choices
[Diagram: the mobile fragmentation chart (two major operating systems, 20+ chipset vendors, 25+ CPU microarchitectures, 15+ GPU architectures, three major graphics APIs, two major compute APIs) overlaid with "Many" labels, including many DSPs, many proprietary options, and many design flows.]

Existing Challenges
• The approach of working with mobile vendors does not scale
• Which ML models matter?
• How to help embedded vendors enhance ML model performance?
• How to help embedded vendors evaluate themselves against the market?

AI Benchmarking
• Provide a model zoo of important models
• Normalize the benchmarking metrics and conditions
• Automate the benchmarking process
• Honest measurement of performance
• Focus on inference
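One way to read "normalize the benchmarking metrics and conditions" is to run every model through the same timing loop: a fixed number of warm-up runs, a fixed number of measured runs, and a single reported statistic. The sketch below is a generic illustration using only the Python standard library. It is not the Caffe2 benchmarking tool, and `benchmark` and `fake_inference` are hypothetical names.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(run_once: Callable[[], None], warmup: int = 10, iterations: int = 50) -> Dict[str, float]:
    """Time run_once under fixed conditions and report latency statistics in ms."""
    # Warm-up runs are discarded so one-time setup costs do not skew the result.
    for _ in range(warmup):
        run_once()

    samples_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_once()
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    return {
        "median_ms": statistics.median(samples_ms),
        "min_ms": min(samples_ms),
        "max_ms": max(samples_ms),
    }


def fake_inference() -> None:
    """Stand-in workload; a real harness would run one model inference here."""
    sum(i * i for i in range(10_000))


result = benchmark(fake_inference, warmup=3, iterations=20)
print(result)
```

Reporting the median rather than the mean is a design choice made here so that a few slow outlier runs do not dominate the number.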

Benchmarking Starting Point
Model           Nexus 6  Nexus 6P  Galaxy S7  Huawei Mate 10  Galaxy S8
ShuffleNet          108       148         84             125        112
SqueezeNet          149       279        143             161        156
ResNet50           1230      1970       1220            1510       1490
Style Transfer       52        80         56              53         39
CPU inference delay on select Caffe2 models in ms

Benchmarking - Add a New Model
Model           Nexus 6  Nexus 6P  Galaxy S7  Huawei Mate 10  Galaxy S8
ShuffleNet          108       148         84             125        112
SqueezeNet          149       279        143             161        156
ResNet50           1230      1970       1220            1510       1490
Style Transfer       52        80         56              53         39
Inception V1        612       829        575             638        645

Benchmarking - Add a New Device
Model           Nexus 6  Nexus 6P  Galaxy S7  Huawei Mate 10  Galaxy S8  Pixel XL
ShuffleNet          108       148         84             125        112        83
SqueezeNet          149       279        143             161        156       141
ResNet50           1230      1970       1220            1510       1490      1230
Style Transfer       52        80         56              53         39        57
Inception V1        612       829        575             638        645       597
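The benchmark table is a model-by-device matrix, so adding a model adds a row and adding a device adds a column. The sketch below stores the figures from this slide that way and looks up the fastest device for each model. The column order is assumed to match the "Benchmarking Starting Point" slide with Pixel XL appended, and `fastest_device` is a hypothetical helper.

```python
# Device columns; order assumed from the "Benchmarking Starting Point" slide.
devices = ["Nexus 6", "Nexus 6P", "Galaxy S7", "Huawei Mate 10", "Galaxy S8", "Pixel XL"]

# CPU inference delay in ms, one row per model, as listed on the slide.
latency_ms = {
    "ShuffleNet": [108, 148, 84, 125, 112, 83],
    "SqueezeNet": [149, 279, 143, 161, 156, 141],
    "ResNet50": [1230, 1970, 1220, 1510, 1490, 1230],
    "Style Transfer": [52, 80, 56, 53, 39, 57],
    "Inception V1": [612, 829, 575, 638, 645, 597],
}


def fastest_device(model: str):
    """Return (device, delay in ms) for the lowest delay measured for a model."""
    values = latency_ms[model]
    best_index = min(range(len(values)), key=lambda i: values[i])
    return devices[best_index], values[best_index]


for model in latency_ms:
    device, ms = fastest_device(model)
    print(f"{model}: fastest on {device} ({ms} ms)")
```

No single device wins every row in these figures, which is one reason a per-model breakdown is more informative than a single score per device.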

Three Steps of Benchmarking
• Model Zoo
• Benchmarking: GPU, CPU, Phone, Embedded
• Data Consumption

Benchmarking Status
• Supported framework
• Caffe2
• Supported model format
• Caffe2
• ONNX
• Supported backend
• CPU, GPU, Android, Linux-based systems
• Eigen, MKL, NNPACK, OpenGL, CUDA
• Community help needed!

Resources
• Caffe2
• https://github.com/caffe2/caffe2
• ONNX
• https://github.com/onnx/onnx
• Benchmarking
• https://github.com/caffe2/caffe2-benchmarking
• Model zoo
• https://github.com/caffe2/models
• https://github.com/onnx/models
