The Caffe2 Framework for Mobile and Embedded Deep Learning
Fei Sun
AI Platform, Facebook
Outline
• Caffe2 on mobile
• ONNX
• From research to production
• Vendor’s dilemma
• Caffe2 on embedded. Benchmarking the performance
Caffe2 on Mobile
Caffe2 is...
• A lightweight open source framework for deep learning algorithms
• Primarily designed for production use cases
• Speed is top priority
• C++ / Python based interfaces
• Supports deployment on multiple platforms
  • Linux, Mac, iOS, Android and Windows
  • IoT devices, Raspberry Pi, Tegra X1, ...
Mobile Fragmentation
• Two major operating systems: Android, iOS
• 20+ chipset vendors
• 25+ CPU microarchitectures
• 15+ GPU architectures
• Three major graphics APIs: OpenGL, Vulkan, Metal
• Two major compute APIs: RenderScript, OpenCL
One Framework, Multiple Backends
• ARM Compute Library
• NNPACK
• Metal™/MPSCNN
• Qualcomm Snapdragon NPE
• CUDA/cuDNN
CPU Acceleration with NNPACK
• Fast convolution algorithms
• NEON micro-kernels
• Multi-core computation
• big.LITTLE optimizations
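The slide does not say which fast convolution algorithms are meant. As an illustration of the general idea, the sketch below uses Winograd minimal filtering F(2,3), a well-known member of that family; choosing it here is an assumption, and the function names are hypothetical. It computes two outputs of a 3-tap filter with 4 data multiplications instead of 6.

```python
def direct_f23(d, g):
    """Direct 1-D correlation: 2 outputs from 4 inputs and a 3-tap filter (6 multiplies)."""
    return [
        d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
        d[1] * g[0] + d[2] * g[1] + d[3] * g[2],
    ]


def winograd_f23(d, g):
    """Winograd F(2,3): the same 2 outputs using 4 data multiplies."""
    # Filter transform: depends only on the filter, so it can be precomputed.
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]
    # Input transform followed by 4 element-wise multiplies.
    m1 = (d[0] - d[2]) * u0
    m2 = (d[1] + d[2]) * u1
    m3 = (d[2] - d[1]) * u2
    m4 = (d[1] - d[3]) * u3
    # Output transform: only additions and subtractions.
    return [m1 + m2 + m3, m2 - m3 - m4]


d = [1.0, 2.0, 3.0, 4.0]   # illustrative input tile
g = [0.5, -1.0, 2.0]       # illustrative 3-tap filter
print(direct_f23(d, g), winograd_f23(d, g))
```

Both functions return the same values; the saving comes from moving work into the filter transform, which is paid once per filter rather than once per tile.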
GPU Acceleration on iPhones
• Custom Metal™ Kernels
• Leverage MPSCNN (Metal Performance Shaders)
• Performs best on iPhone 6s and later
GPU Acceleration on Android
• Leverage Qualcomm's Snapdragon NPE
• Supports new Qualcomm Adreno GPUs
• Runs on top of OpenCL
• Potential to use Hexagon DSPs
Caffe2 mobile integration with Qualcomm® Snapdragon™ mobile platform
• CPU: 12 FPS
• GPU: 50 FPS
• Device: Galaxy S7, Snapdragon 820, Marshmallow
GPU Acceleration on Android
• Leveraging ARM Compute Library
• Utilizes OpenGL 3.1
• For newer Mali GPUs, e.g. from Samsung LSI, MediaTek
• Person segmentation model:
  • CPU: 50 FPS
  • ACL: 71 FPS with CPU->GPU, 133 FPS without
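The person segmentation figures are easier to compare as per-frame latencies. The sketch below converts the FPS numbers from the slide into milliseconds. It assumes that "with CPU->GPU" means the measurement includes moving data from CPU to GPU memory, which the slide does not state explicitly; the function and variable names are hypothetical.

```python
def frame_time_ms(fps):
    """Convert throughput in frames per second to per-frame latency in ms."""
    return 1000.0 / fps


# FPS figures taken from the slide (person segmentation model).
cpu_ms = frame_time_ms(50)
acl_with_copy_ms = frame_time_ms(71)    # assumed to include the CPU->GPU step
acl_without_copy_ms = frame_time_ms(133)

# Under that assumption, the difference is the per-frame cost of the CPU->GPU step.
copy_overhead_ms = acl_with_copy_ms - acl_without_copy_ms

print(f"CPU:                {cpu_ms:.2f} ms/frame")
print(f"ACL with CPU->GPU:  {acl_with_copy_ms:.2f} ms/frame")
print(f"ACL without:        {acl_without_copy_ms:.2f} ms/frame")
print(f"CPU->GPU overhead:  {copy_overhead_ms:.2f} ms/frame")
```

Read this way, the CPU->GPU step costs roughly 6.6 ms per frame, close to the 7.5 ms the GPU path takes without it.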
Caffe2 on Mobile
• Engage and collaborate with a few vendors:
  • Support Caffe2
  • Iterate on performance
• Problem:
  • Not scalable
ONNX
Support What?
• Frameworks: TensorFlow, MXNet, CNTK
• Vendor and numeric libraries: Apple CoreML, Nvidia TensorRT, ARM Compute Library, Qualcomm SNPE, …
• Framework backends: O(n^2) pairs
From Research to Production
• Research new models/operators in PyTorch
• Re-implement the models/operators in Caffe2
• Retrain the models
• Deploy Caffe2 models to production
Open Neural Network Exchange (ONNX)
• Enable interoperability
• Across frameworks and hardware vendors
• Starting base compatibility
• Creating community effort
• Across PyTorch and Caffe2 at FB
• Operators and programming modes gap
• Advanced research to production use cases
Support What All
• Frameworks: TensorFlow, MXNet, CNTK
• Vendor and numeric libraries: Apple CoreML, Nvidia TensorRT, ARM Compute Library, Qualcomm SNPE, …
• Framework backends: O(n) pairs
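The difference between the "O(n^2) pairs" and "O(n) pairs" slides can be made concrete with a small count. The sketch below is only arithmetic: without a shared format every framework needs its own path to every backend, while with a shared exchange format each side needs one converter. The lists combine names that appear on these slides and elsewhere in the deck, and the function names are hypothetical.

```python
def direct_converters(num_frameworks, num_backends):
    """Every framework talks to every backend directly."""
    return num_frameworks * num_backends


def converters_with_exchange_format(num_frameworks, num_backends):
    """Each framework exports once and each backend imports once."""
    return num_frameworks + num_backends


frameworks = ["Caffe2", "PyTorch", "TensorFlow", "MXNet", "CNTK"]
backends = ["Apple CoreML", "Nvidia TensorRT", "ARM Compute Library", "Qualcomm SNPE"]

pairwise = direct_converters(len(frameworks), len(backends))
shared = converters_with_exchange_format(len(frameworks), len(backends))
print(f"pairwise integrations: {pairwise}")          # 5 * 4 = 20
print(f"with a shared exchange format: {shared}")    # 5 + 4 = 9
```

Adding one more backend costs five new integrations in the pairwise case and one in the shared-format case.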
From Research to Production
• Frontend
• Representation
• Backend
• Frontend
• Representation
• Backend
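The frontend / representation / backend split can be illustrated with a toy pipeline. This is a minimal sketch and not the ONNX format or API: the graph layout, operator names and functions below are all hypothetical, chosen only to show a frontend emitting a framework-neutral description that a separate backend executes.

```python
# Toy illustration only: this is NOT the ONNX format or API.

def frontend_export():
    """'Frontend': describe y = relu(x * w + b) as a framework-neutral graph."""
    return [
        {"op": "Mul", "inputs": ["x", "w"], "output": "xw"},
        {"op": "Add", "inputs": ["xw", "b"], "output": "z"},
        {"op": "Relu", "inputs": ["z"], "output": "y"},
    ]


# 'Backend': one kernel per operator name in the representation.
KERNELS = {
    "Mul": lambda a, b: a * b,
    "Add": lambda a, b: a + b,
    "Relu": lambda a: max(a, 0.0),
}


def backend_run(graph, values):
    """Execute the graph node by node, in listed order, on scalar inputs."""
    env = dict(values)
    for node in graph:
        args = [env[name] for name in node["inputs"]]
        env[node["output"]] = KERNELS[node["op"]](*args)
    return env


graph = frontend_export()   # the 'representation' is plain data
result = backend_run(graph, {"x": 2.0, "w": -3.0, "b": 1.0})
print(result["y"])          # relu(2 * -3 + 1) = relu(-5) = 0.0
```

Because the representation is plain data, a different backend only has to supply its own kernels for the same operator names; the frontend does not change.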
Caffe2 on Embedded
Embedded Sea of Choices
[Figure: the mobile fragmentation chart (two major operating systems, 20+ chipset vendors, 25+ CPU microarchitectures, 15+ GPU architectures, three major graphics APIs, two major compute APIs) with the counts overlaid by "Many", plus many DSPs, many proprietary options and many design flows]
Existing Challenges
• The approach of working with individual mobile vendors does not scale
• Which ML models matter?
• How to help embedded vendors improve ML model performance?
• How to help embedded vendors evaluate their products against the market?
AI Benchmarking
• Provide a model zoo of important models
• Normalize the benchmarking metrics and conditions
• Automate the benchmarking process
• Honest measurement of performance
• Focus on inference
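Normalized metrics and conditions usually mean that every model and device is timed the same way. The sketch below shows one common pattern: a fixed number of warm-up runs that are discarded, followed by a fixed number of timed runs summarized by the median. It is not the deck's benchmarking tool; the `benchmark` helper, its defaults and the `fake_inference` stand-in are hypothetical.

```python
import statistics
import time


def benchmark(fn, warmup=5, iterations=20):
    """Return latency statistics in ms for calling fn() repeatedly."""
    # Warm-up runs are discarded so caches and lazy initialization settle.
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return {
        "iterations": iterations,
        "median_ms": statistics.median(samples),
        "min_ms": min(samples),
        "max_ms": max(samples),
    }


def fake_inference():
    """Hypothetical stand-in for one forward pass of a model."""
    return sum(i * i for i in range(10_000))


stats = benchmark(fake_inference)
print(stats)
```

Reporting the median alongside the minimum and maximum makes it harder for a single unusually fast or slow run to dominate the published number.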
Benchmarking Starting Point
CPU inference delay on select Caffe2 models, in ms
Model           Nexus 6  Nexus 6P  Galaxy S7  Huawei Mate 10  Galaxy S8
ShuffleNet          108       148         84             125        112
SqueezeNet          149       279        143             161        156
ResNet50           1230      1970       1220            1510       1490
Style Transfer       52        80         56              53         39
Benchmarking - Add a New Model
CPU inference delay on select Caffe2 models, in ms
Model           Nexus 6  Nexus 6P  Galaxy S7  Huawei Mate 10  Galaxy S8
ShuffleNet          108       148         84             125        112
SqueezeNet          149       279        143             161        156
ResNet50           1230      1970       1220            1510       1490
Style Transfer       52        80         56              53         39
Inception V1        612       829        575             638        645
Benchmarking - Add a New Device
CPU inference delay on select Caffe2 models, in ms
Model           Nexus 6  Nexus 6P  Galaxy S7  Huawei Mate 10  Galaxy S8  Pixel XL
ShuffleNet          108       148         84             125        112        83
SqueezeNet          149       279        143             161        156       141
ResNet50           1230      1970       1220            1510       1490      1230
Style Transfer       52        80         56              53         39        57
Inception V1        612       829        575             638        645       597
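One way to consume a table like this is to load it into a small structure and query it. The sketch below copies the numbers from the final table (CPU inference delay in ms) and reports the fastest device per model and the ratio between the slowest and fastest device. The helper names are hypothetical.

```python
DEVICES = ["Nexus 6", "Nexus 6P", "Galaxy S7", "Huawei Mate 10", "Galaxy S8", "Pixel XL"]

# CPU inference delay in ms, one list per model, in DEVICES order.
LATENCY_MS = {
    "ShuffleNet": [108, 148, 84, 125, 112, 83],
    "SqueezeNet": [149, 279, 143, 161, 156, 141],
    "ResNet50": [1230, 1970, 1220, 1510, 1490, 1230],
    "Style Transfer": [52, 80, 56, 53, 39, 57],
    "Inception V1": [612, 829, 575, 638, 645, 597],
}


def fastest_device(model):
    """Return (device, latency_ms) with the lowest latency for a model."""
    times = LATENCY_MS[model]
    best = min(range(len(times)), key=times.__getitem__)
    return DEVICES[best], times[best]


def spread(model):
    """Ratio of the slowest to the fastest device for a model."""
    times = LATENCY_MS[model]
    return max(times) / min(times)


for model in LATENCY_MS:
    device, ms = fastest_device(model)
    print(f"{model}: fastest on {device} ({ms} ms), slowest/fastest = {spread(model):.2f}x")
```

No single device is fastest for every model in this table: the Pixel XL leads on ShuffleNet and SqueezeNet, the Galaxy S7 on ResNet50 and Inception V1, and the Galaxy S8 on Style Transfer.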
Three Steps of Benchmarking
• Model Zoo
• Benchmarking: GPU, CPU, Phone, Embedded
• Data Consumption
Benchmarking Status
• Supported framework
  • Caffe2
• Supported model format
  • Caffe2
  • ONNX
• Supported backend
  • CPU, GPU, Android, Linux-based systems
  • Eigen, MKL, NNPACK, OpenGL, CUDA
• Community help needed!
Resources
• Caffe2
  • https://github.com/caffe2/caffe2
• ONNX
  • https://github.com/onnx/onnx
• Benchmarking
  • https://github.com/caffe2/caffe2-benchmarking
• Model zoo
  • https://github.com/caffe2/models
  • https://github.com/onnx/models
Questions?


"The Caffe2 Framework for Mobile and Embedded Deep Learning," a Presentation from Facebook
