© 2021 Chamberlain Group
Practical Guide to Implementing ML on Embedded Devices
Nathan Kopp
Chamberlain Group
About the Chamberlain Group
2
Chamberlain Group (CGI) is a global leader in access
solutions and products.
VISION: Giving the Power of Access and Knowledge
MISSION: People everywhere rely on CGI to move safely through their world, confident that what they value most is secure within reach.
END-MARKETS SERVED: Residential, Commercial, Automotive
Over 8,000 Employees Worldwide
CGI is a global team with solutions and operations
designed to serve customers in a variety of
markets worldwide.
About the Chamberlain Group
3
Goals of this talk
• Survey the landscape of edge inference implementation
• Explore software & hardware choices
• Examine model & optimization choices
4
More is possible than you think!
Choices
Hardware
Inference Engine
Training Framework
Model
Optimizations
Everything is intertwined.
5
Choices
Hardware
Inference Engine
Training Framework
Model
Optimizations
6
TFLite
Limited Layer Choice
TensorFlow
8-bit Quantized Only
Some Models
Choices
Hardware
Inference Engine
Training Framework
Model
Optimizations
7
TensorRT
Complex Pruning
PyTorch
16-bit FP YOLOv5
Hardware
8
Hardware Choices
MCU
Cortex-M, RISC-V
+SIMD +DSP +NPU +FPGA
CPU
Cortex-A, x86
+SIMD +DSP +GPU +NPU +FPGA
FPGA
On-Chip Memory, L1 & L2 Caches
Bus Speed
Register Count
9
Multiple Cores & Clock Speed
Out-of-Order Execution
Hardware: Common Configurations
• Low volume: accelerator attached over USB; software choices drive the hardware choice (Software → Hardware)
• Mass production: memory, IO, and sensors integrated on-board (e.g., over MIPI); hardware constraints drive the software choice (Hardware → Software)
13
Model & Software
14
Model Optimization Workflow
• Reference Model: state-of-the-art, FP32 on GPU; purpose: validate your training data
• Optimization: choose backbone and head; quantization, pruning, NAS
• Convert or Compile
• Deploy and Test: verify, review choices, analyze, optimize, integrate
15
First, you will need data!
• Your problem space is different!
• Smaller datasets are usually OK
• Smaller models need less data
• Fine tuning needs less data
Research Datasets
Your Application
17
Model Mash-Up
Output
Layer 5
Layer 4
Layer 3
Layer 2
Layer 1
Input
Head & Neck:
• Interprets results
• Inexpensive to fine-tune
• Lower data requirements than backbone
Backbone:
• Extracts features
• Costly to train; needs lots of data & time
• Recommendation: pre-trained weights
19
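The split above suggests the usual transfer-learning recipe: load pre-trained backbone weights, freeze them, and train only the head on your smaller dataset. A framework-agnostic sketch of the mechanics (in PyTorch this is `param.requires_grad = False`; the `Param` class and values here are illustrative stand-ins, not a real framework API):

```python
class Param:
    """A trainable value with a freeze flag (stand-in for a framework tensor)."""
    def __init__(self, value, trainable=True):
        self.value = value
        self.trainable = trainable
        self.grad = 0.0

def sgd_step(params, lr=0.1):
    """Update only unfrozen parameters, then clear all gradients."""
    for p in params:
        if p.trainable:
            p.value -= lr * p.grad
        p.grad = 0.0

# Backbone: pre-trained and frozen. Head: freshly initialized and trainable.
backbone = [Param(w, trainable=False) for w in (0.5, -1.2, 0.8)]
head = [Param(0.0), Param(0.0)]

# One fake training step: pretend every parameter received a gradient of 1.0.
for p in backbone + head:
    p.grad = 1.0
sgd_step(backbone + head)

print([p.value for p in backbone])  # unchanged: [0.5, -1.2, 0.8]
print([p.value for p in head])      # updated:   [-0.1, -0.1]
```

Only the head moves, which is why the head's data requirements (and training cost) are so much lower than the backbone's.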
Optimization Options (spectrum: Commercial → Open Source)
• Model Structure: Model Zoo; Community-Supported; Mix & Match Head & Backbone; Code it yourself
• Model Optimizations: Pre-Optimized Model; Quantization, Pruning, Compression; NAS / OFA; Decomposition; Hand-Optimized Code
• Inference Engine: Pre-Optimized Runtime; Community-Supported Runtimes
• Training Data: Pretrained; Fine-tune with your data; Train from scratch
20
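Of the optimizations listed, 8-bit quantization is usually the first one to reach for. Its core mechanic is an affine mapping with a per-tensor scale and zero-point; a minimal pure-Python sketch of the idea (real toolchains such as TFLite or AIMET add calibration datasets, per-channel scales, and quantization-aware training):

```python
def quantize_params(values, num_bits=8):
    """Compute an affine scale and zero-point covering the tensor's range."""
    qmin, qmax = 0, 2 ** num_bits - 1
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)  # range must include 0
    scale = (hi - lo) / (qmax - qmin)
    zero_point = round(qmin - lo / scale)
    return scale, zero_point

def quantize(values, scale, zero_point):
    """Map floats to clamped uint8 codes."""
    return [max(0, min(255, round(v / scale + zero_point))) for v in values]

def dequantize(qvalues, scale, zero_point):
    """Map uint8 codes back to approximate floats."""
    return [(q - zero_point) * scale for q in qvalues]

weights = [-1.5, -0.3, 0.0, 0.7, 2.1]
scale, zp = quantize_params(weights)
q = quantize(weights, scale, zp)
restored = dequantize(q, scale, zp)

# Every weight survives the int8 round trip within half a quantization step.
assert all(abs(a - b) <= scale / 2 for a, b in zip(weights, restored))
```

The payoff on embedded targets is 4x smaller weights and integer arithmetic, which SIMD units and NPUs execute far faster than FP32.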
© 2021 Chamberlain Group
Model Optimization Workflow
Reference
Model
• State-of-the-art
• FP32 GPU
• Purpose: validate
your training data
Optimization
• Choose backbone
• Choose head
• Quantization
• Pruning
• NAS
Convert or
Compile
• Convert
• Compile
Deploy and Test
• Verify
• Review Choices
• Analyze
• Optimize
• Integrate
21
Inference Engine
• Runtime
• Model is interpreted
• Model deployed separately
• Easier OTA updates
• Compiler
• Model is compiled
• Model is part of firmware
• Weights are often constants
22
• Code Optimizations
• Memory Usage
• Cache-aware (e.g., tiling)
• Efficient register usage
• Vectorization
• Use SIMD
• Use DSP, NPU, GPU
• Parallelization
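A concrete picture of the "cache-aware (e.g., tiling)" bullet: process the matrices in small blocks so one tile of each operand stays hot in L1/L2 while it is reused. This pure-Python sketch shows only the loop structure; production engines implement it in vectorized C with NEON/SIMD intrinsics:

```python
def matmul_tiled(A, B, tile=2):
    """C = A @ B computed block-by-block so each tile of A and B is
    reused while it is still resident in cache."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                # Multiply one tile of A by one tile of B, accumulating into C.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = C[i][j]
                        for kk in range(k0, min(k0 + tile, k)):
                            acc += A[i][kk] * B[kk][j]
                        C[i][j] = acc
    return C

A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
print(matmul_tiled(A, B))  # [[58.0, 64.0], [139.0, 154.0]]
```

The tile size is chosen so that `tile * tile` elements of each operand fit in the target's L1 cache; that single parameter is why the same kernel needs retuning per CPU.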
Final Step: Integrate into your app
• Consensus over time
• No model gets it right all the time
• High frame rate:
• More samples for consensus
• Lower per-sample accuracy
• Low frame rate:
• Fewer samples for consensus
• Higher per-sample accuracy
100 ms inference time does NOT mean 10 FPS!
Reserve CPU cycles for:
• Ingesting from the sensor/buffer
• Interpreting the output
• Network
• Other app functions
• Temperature management
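The "consensus over time" idea can be as simple as a majority vote over a sliding window of recent per-frame results. A minimal sketch; the window length and threshold here are illustrative and should be tuned to your frame rate and per-sample accuracy:

```python
from collections import deque

class ConsensusDetector:
    """Debounce noisy per-frame detections with a sliding-window majority vote."""
    def __init__(self, window=5, threshold=3):
        self.votes = deque(maxlen=window)
        self.threshold = threshold

    def update(self, detected):
        """Feed one frame's boolean result; return the smoothed decision."""
        self.votes.append(bool(detected))
        return sum(self.votes) >= self.threshold

detector = ConsensusDetector(window=5, threshold=3)
frames = [True, False, True, True, False, False, False, False]
smoothed = [detector.update(f) for f in frames]
print(smoothed)  # isolated misses and false alarms are filtered out
```

This is exactly the high/low frame-rate trade-off above: more frames per second means more votes per decision, so each individual frame can afford to be less accurate.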
23
Example
24
Example: Object Detection on ARMv7
Task Vehicle Detection
Reference Model YOLOv5-s, FP32, PyTorch
Compute Constraints ARMv7 w/NEON, no accelerators
25
ASUS Tinkerboard (v1)
Cortex A17 @ 1.8 GHz
Raspberry Pi 2 B v1.1
Cortex A7 @ 900 MHz
Seeed NPI i.MX6ULL
Cortex A7 @ 800 MHz
Optimization Options (spectrum: Commercial → Open Source)
• Model Structure: Model Zoo; Community Supported; Mix & Match Head & Backbone; Code it yourself
• Model Optimizations: Pre-Optimized Model; Quantization, Pruning, Compression; NAS / OFA; Decomposition; Hand-Optimized Code
• Inference Engine: Pre-Optimized Runtime; Community Supported Optimizations
• Training Data: Pretrained; Fine-tune with your data; Train from scratch
26
Models & Tools Tested
27
Models & Tools Tested
28
Why so much faster?
• Both 32-bit ARMv7
• NEON: 64-bit vs 32-bit
• FP: 16 registers vs 32 registers
• Out-of-order execution
• Deeper pipeline
• DMIPS/MHz: 4.0 vs 1.9
Models & Tools Tested
29
Models & Tools Tested
30
Speed improvement compared to YOLOv5-s on PyTorch (80-class and 1-class variants): 509x, 54x, 50x, 10x, 6x
Models & Tools Tested
31
Speed improvement over the reference model: 509x, 54x, 50x
Conclusions
• Inference on edge devices has become both possible and practical
• Small hardware features can make a big difference in speed
• Selecting the right model and the right inference engine for your hardware can expand the scope of what is possible
32
Example of Resource Slide
33
Detectors used in the Example:
YOLOv5
https://github.com/ultralytics/yolov5
NanoDet
https://github.com/RangiLyu/nanodet
QuickYOLO
https://github.com/tehtea/QuickYOLO
Xailient
https://www.xailient.com/
Chamberlain Group:
https://chamberlaingroup.com
Note:
Many more links and resources are
available at the end of the slide deck.
Backup Material
34
Half of implementing deep learning
is fighting Python & C++ errors
and resolving library incompatibilities.
Pay close attention
to documented
versions!
Use “virtualenv”
Become a CMake expert!
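The version-pinning advice is mostly mechanical once you adopt virtual environments. A typical recipe using Python's standard `venv` module (the package names and versions in the comments are placeholders, not the versions this project requires):

```shell
# Create an isolated environment so system site-packages cannot interfere.
python3 -m venv .venv
. .venv/bin/activate

# Install the exact versions the project documents (placeholders):
#   pip install "numpy==1.21.2" "torch==1.9.1"

# Snapshot whatever is installed so the environment is reproducible.
pip freeze > requirements.txt

# A collaborator (or your build server) reproduces it with:
#   python3 -m venv .venv && . .venv/bin/activate
#   pip install -r requirements.txt
```

The same discipline applies on the C++ side: pin your toolchain and library versions in CMake, and record the exact commit of every inference engine you build from source.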
35
Papers
36
Paper Title | URL
Larq Compute Engine: Design, Benchmark, and Deploy State-of-the-Art Binarized Neural Networks | https://arxiv.org/abs/2011.09398
Latent Weights Do Not Exist: Rethinking Binarized Neural Network Optimization | https://arxiv.org/abs/1906.02107
FCOS: Fully Convolutional One-Stage Object Detection | https://arxiv.org/abs/1904.01355
Bridging the Gap Between Anchor-based and Anchor-free Detection via Adaptive Training Sample Selection | https://arxiv.org/abs/1912.02424
Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection | https://arxiv.org/abs/2006.04388
ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design | https://arxiv.org/abs/1807.11164
Edge Inference Engines
Inference Engine Type Notes
TFLite Runtime Runtime for TF/Keras https://www.tensorflow.org/lite
TFLite Micro Runtime TFLite for MCUs https://www.tensorflow.org/lite/microcontrollers
Larq Compute Engine Runtime Binarized TFLite https://github.com/larq/compute-engine
NCNN Runtime Tencent runtime https://github.com/Tencent/ncnn
MNN Runtime Alibaba runtime https://github.com/alibaba/MNN
Apache TVM Compiler Compiler and optimizer https://tvm.apache.org/
Apache MicroTVM Compiler TVM for MCUs https://tvm.apache.org/docs/microtvm/index.html
Glow Compiler Compiler for ONNX https://ai.facebook.com/tools/glow/
Microsoft ELL Compiler Compiler https://github.com/Microsoft/ELL
deepC Compiler ONNX -> LLVM https://github.com/ai-techsystems/deepC
NNoM Library Keras -> C https://github.com/majianjia/nnom
37
Generic, Open-Source
Note: This list is not comprehensive.
Edge Inference Engines
Vendor Inference Engine Type Notes
Nvidia TensorRT Runtime Support for Nvidia GPUs, such as Jetson Nano
Arm Arm® NN Runtime Optimized for Arm Cortex-A CPU, Mali GPU, Ethos NPU
Arm CMSIS-NN Library Library used by various runtimes and compilers
NXP NXP eIQ™ Both Optimized for NXP; TFLite and Glow with Arm NN & CMSIS-NN
Qualcomm SNPE Runtime For Qualcomm Snapdragon processors
Intel OpenVINO™ Runtime Runtime for Intel products, including Movidius
Morpho SoftNeuro Runtime Commercial platform; limited details publicly available.
Edge Impulse EON Compiler Commercial platform; Targeted at Microcontrollers
STMicro STM32Cube.AI Compiler Optimized for STM32
Kendryte nncase Compiler Kendryte K210; https://github.com/kendryte/nncase
38
Hardware-Specific and Commercial
Note: This list is not comprehensive.
Model Optimization Tools
Tool Framework(s) URL
TensorFlow MOT TensorFlow https://www.tensorflow.org/model_optimization
Microsoft NNI PyTorch https://github.com/microsoft/nni
IntelLabs Distiller PyTorch https://github.com/IntelLabs/distiller
Riptide TensorFlow + TVM https://github.com/jwfromm/Riptide
Qualcomm AIMET PyTorch, TensorFlow https://github.com/quic/aimet
39
Generic, Open-Source
Note: This list is not comprehensive.
Model Optimization Tools
Tool Framework(s) URL
OpenVINO NNCF PyTorch https://github.com/openvinotoolkit/nncf
NXP eIQ TensorFlow, TFLite, ONNX https://www.nxp.com/design/software/development-software/eiq-ml-development-environment:EIQ
Deeplite PyTorch, TensorFlow, ONNX https://www.deeplite.ai/
Edge Impulse Keras
40
Hardware-Specific and Commercial
Note: This list is not comprehensive.
Peripheral Accelerators
Product Off-the-Shelf SBC USB
Nvidia GPU Jetson Nano, TX1, TX2 -
Movidius Myriad X - Intel Neural Compute Stick 2
Google Edge TPU Coral Dev Board, Dev Board Mini Coral USB Accelerator
Gyrfalcon Lightspeeur® - Orange Pi AI Stick Lite
Rockchip RK1808 - Toybrick RK1808
41
Note: This list is not comprehensive.
SoCs w/Embedded Accelerators
Product Acceleration Single Board Computer
Qualcomm Snapdragon (various) DSP + GPU (+NPU) (by request only)
Ambarella CV2, CV5, CV22S, CV25S, CV28M DSP + NPU (by request only)
NXP i.MX 8 DSP + GPU SolidRun $160+
NXP i.MX 8M Plus DSP + GPU + NPU SolidRun, Wandboard $180+
Rockchip RK3399Pro NPU Rock Pi N10 $99+
Allwinner V831 NPU Sipeed MAIX-II Dock $29
Sophon BM1880 NPU Sophon Edge $129
42
Note: This list is not comprehensive.
MCUs for Inference
Vendor Product Features that support inference
Various Cortex-M4/7/33/35P SIMD instructions, FPU; Future Ethos-U55 microNPU
Raspberry Pi RP2040 Memory, bus fabric
Maxim Integrated MAX78000 Cortex-M4, CNN accelerator
Kendryte K210 DNN accelerator
Espressif ESP32-S3 SIMD instructions, FPU
Note: This list is not comprehensive.
43

More Related Content

PDF
Deep Learning for Video: Action Recognition (UPC 2018)
PDF
nl80211 and libnl
PPTX
Network Function Virtualization : Overview
PPTX
Pushing Packets - How do the ML2 Mechanism Drivers Stack Up
PPT
Case study windows
PDF
eBPF - Rethinking the Linux Kernel
PDF
Understanding Open vSwitch
PPTX
Linux Network Stack
Deep Learning for Video: Action Recognition (UPC 2018)
nl80211 and libnl
Network Function Virtualization : Overview
Pushing Packets - How do the ML2 Mechanism Drivers Stack Up
Case study windows
eBPF - Rethinking the Linux Kernel
Understanding Open vSwitch
Linux Network Stack

What's hot (20)

PDF
Wpa supplicant introduction
PDF
Introduction to Level Zero API for Heterogeneous Programming : NOTES
PPTX
Citrix group policy troubleshooting for xen app and xendesktop
PDF
BPF Internals (eBPF)
PDF
ELC21: VM-to-VM Communication Mechanisms for Embedded
PDF
Introduction to TensorFlow
PPTX
PPT
Linux commands and file structure
PDF
TFLite NNAPI and GPU Delegates
PDF
Using eBPF for High-Performance Networking in Cilium
PPT
Basic command ppt
PPT
comparing windows and linux ppt
PDF
TensorFlow and Keras: An Overview
PDF
An Introduction To Linux
PDF
Introduction to Deep Learning, Keras, and TensorFlow
PPTX
Linux Device Tree
PPTX
The TCP/IP Stack in the Linux Kernel
PDF
Part 01 Linux Kernel Compilation (Ubuntu)
ODP
Tensorflow for Beginners
PDF
nRF51のGPIOTEについて
Wpa supplicant introduction
Introduction to Level Zero API for Heterogeneous Programming : NOTES
Citrix group policy troubleshooting for xen app and xendesktop
BPF Internals (eBPF)
ELC21: VM-to-VM Communication Mechanisms for Embedded
Introduction to TensorFlow
Linux commands and file structure
TFLite NNAPI and GPU Delegates
Using eBPF for High-Performance Networking in Cilium
Basic command ppt
comparing windows and linux ppt
TensorFlow and Keras: An Overview
An Introduction To Linux
Introduction to Deep Learning, Keras, and TensorFlow
Linux Device Tree
The TCP/IP Stack in the Linux Kernel
Part 01 Linux Kernel Compilation (Ubuntu)
Tensorflow for Beginners
nRF51のGPIOTEについて
Ad

Similar to “A Practical Guide to Implementing ML on Embedded Devices,” a Presentation from the Chamberlain Group (20)

PDF
Introduction to Convolutional Neural Networks
PDF
Distributed deep learning optimizations for Finance
PPTX
HPC Advisory Council Stanford Conference 2016
PDF
NeuralProcessingofGeneralPurposeApproximatePrograms
PDF
Towards neuralprocessingofgeneralpurposeapproximateprograms
PDF
Austin,TX Meetup presentation tensorflow final oct 26 2017
PDF
Hardware for Deep Learning AI ML CNN.pdf
PPTX
Ai in 45 minutes
PDF
OpenPOWER Workshop in Silicon Valley
PDF
Pycon tati gabru
PDF
Large Scale Deep Learning with TensorFlow
PDF
Deep learning for medical imaging
PPTX
Dov Nimratz, Roman Chobik "Embedded artificial intelligence"
PPTX
Innovation with ai at scale on the edge vt sept 2019 v0
PDF
Distributed Deep Learning with Hadoop and TensorFlow
PDF
ECML PKDD 2021 ML meets IoT Tutorial Part III: Deep Optimizations of CNNs and...
PDF
customization of a deep learning accelerator, based on NVDLA
PDF
IRJET - Implementation of Neural Network on FPGA
PDF
IBM Cloud Paris Meetup 20180517 - Deep Learning Challenges
PDF
IRJET- Python Libraries and Packages for Deep Learning-A Survey
Introduction to Convolutional Neural Networks
Distributed deep learning optimizations for Finance
HPC Advisory Council Stanford Conference 2016
NeuralProcessingofGeneralPurposeApproximatePrograms
Towards neuralprocessingofgeneralpurposeapproximateprograms
Austin,TX Meetup presentation tensorflow final oct 26 2017
Hardware for Deep Learning AI ML CNN.pdf
Ai in 45 minutes
OpenPOWER Workshop in Silicon Valley
Pycon tati gabru
Large Scale Deep Learning with TensorFlow
Deep learning for medical imaging
Dov Nimratz, Roman Chobik "Embedded artificial intelligence"
Innovation with ai at scale on the edge vt sept 2019 v0
Distributed Deep Learning with Hadoop and TensorFlow
ECML PKDD 2021 ML meets IoT Tutorial Part III: Deep Optimizations of CNNs and...
customization of a deep learning accelerator, based on NVDLA
IRJET - Implementation of Neural Network on FPGA
IBM Cloud Paris Meetup 20180517 - Deep Learning Challenges
IRJET- Python Libraries and Packages for Deep Learning-A Survey
Ad

More from Edge AI and Vision Alliance (20)

PDF
“Visual Search: Fine-grained Recognition with Embedding Models for the Edge,”...
PDF
“Optimizing Real-time SLAM Performance for Autonomous Robots with GPU Acceler...
PDF
“LLMs and VLMs for Regulatory Compliance, Quality Control and Safety Applicat...
PDF
“Simplifying Portable Computer Vision with OpenVX 2.0,” a Presentation from AMD
PDF
“Quantization Techniques for Efficient Deployment of Large Language Models: A...
PDF
“Introduction to Data Types for AI: Trade-Offs and Trends,” a Presentation fr...
PDF
“Introduction to Radar and Its Use for Machine Perception,” a Presentation fr...
PDF
“NPU IP Hardware Shaped Through Software and Use-case Analysis,” a Presentati...
PDF
“Voice Interfaces on a Budget: Building Real-time Speech Recognition on Low-c...
PDF
“Computer Vision at Sea: Automated Fish Tracking for Sustainable Fishing,” a ...
PDF
“Squinting Vision Pipelines: Detecting and Correcting Errors in Vision Models...
PDF
“ONNX and Python to C++: State-of-the-art Graph Compilation,” a Presentation ...
PDF
“Beyond the Demo: Turning Computer Vision Prototypes into Scalable, Cost-effe...
PDF
“Running Accelerated CNNs on Low-power Microcontrollers Using Arm Ethos-U55, ...
PDF
“Scaling i.MX Applications Processors’ Native Edge AI with Discrete AI Accele...
PDF
“A Re-imagination of Embedded Vision System Design,” a Presentation from Imag...
PDF
“MPU+: A Transformative Solution for Next-Gen AI at the Edge,” a Presentation...
PDF
“Evolving Inference Processor Software Stacks to Support LLMs,” a Presentatio...
PDF
“Efficiently Registering Depth and RGB Images,” a Presentation from eInfochips
PDF
“How to Right-size and Future-proof a Container-first Edge AI Infrastructure,...
“Visual Search: Fine-grained Recognition with Embedding Models for the Edge,”...
“Optimizing Real-time SLAM Performance for Autonomous Robots with GPU Acceler...
“LLMs and VLMs for Regulatory Compliance, Quality Control and Safety Applicat...
“Simplifying Portable Computer Vision with OpenVX 2.0,” a Presentation from AMD
“Quantization Techniques for Efficient Deployment of Large Language Models: A...
“Introduction to Data Types for AI: Trade-Offs and Trends,” a Presentation fr...
“Introduction to Radar and Its Use for Machine Perception,” a Presentation fr...
“NPU IP Hardware Shaped Through Software and Use-case Analysis,” a Presentati...
“Voice Interfaces on a Budget: Building Real-time Speech Recognition on Low-c...
“Computer Vision at Sea: Automated Fish Tracking for Sustainable Fishing,” a ...
“Squinting Vision Pipelines: Detecting and Correcting Errors in Vision Models...
“ONNX and Python to C++: State-of-the-art Graph Compilation,” a Presentation ...
“Beyond the Demo: Turning Computer Vision Prototypes into Scalable, Cost-effe...
“Running Accelerated CNNs on Low-power Microcontrollers Using Arm Ethos-U55, ...
“Scaling i.MX Applications Processors’ Native Edge AI with Discrete AI Accele...
“A Re-imagination of Embedded Vision System Design,” a Presentation from Imag...
“MPU+: A Transformative Solution for Next-Gen AI at the Edge,” a Presentation...
“Evolving Inference Processor Software Stacks to Support LLMs,” a Presentatio...
“Efficiently Registering Depth and RGB Images,” a Presentation from eInfochips
“How to Right-size and Future-proof a Container-first Edge AI Infrastructure,...

Recently uploaded (20)

PDF
Review of recent advances in non-invasive hemoglobin estimation
PDF
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
PDF
Encapsulation theory and applications.pdf
PDF
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
PDF
MIND Revenue Release Quarter 2 2025 Press Release
PDF
cuic standard and advanced reporting.pdf
PDF
KodekX | Application Modernization Development
PPTX
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
PDF
Building Integrated photovoltaic BIPV_UPV.pdf
PDF
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
PDF
Diabetes mellitus diagnosis method based random forest with bat algorithm
PDF
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
PDF
Spectral efficient network and resource selection model in 5G networks
PDF
Agricultural_Statistics_at_a_Glance_2022_0.pdf
PPTX
Understanding_Digital_Forensics_Presentation.pptx
PDF
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
PPT
“AI and Expert System Decision Support & Business Intelligence Systems”
PPTX
sap open course for s4hana steps from ECC to s4
PPTX
Cloud computing and distributed systems.
PPT
Teaching material agriculture food technology
Review of recent advances in non-invasive hemoglobin estimation
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
Encapsulation theory and applications.pdf
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
MIND Revenue Release Quarter 2 2025 Press Release
cuic standard and advanced reporting.pdf
KodekX | Application Modernization Development
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
Building Integrated photovoltaic BIPV_UPV.pdf
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
Diabetes mellitus diagnosis method based random forest with bat algorithm
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
Spectral efficient network and resource selection model in 5G networks
Agricultural_Statistics_at_a_Glance_2022_0.pdf
Understanding_Digital_Forensics_Presentation.pptx
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
“AI and Expert System Decision Support & Business Intelligence Systems”
sap open course for s4hana steps from ECC to s4
Cloud computing and distributed systems.
Teaching material agriculture food technology

“A Practical Guide to Implementing ML on Embedded Devices,” a Presentation from the Chamberlain Group

  • 1. © 2021 Chamberlain Group Practical Guide to Implementing ML on Embedded Devices Nathan Kopp Chamberlain Group
  • 2. © 2021 Chamberlain Group About the Chamberlain Group 2 Chamberlain Group (CGI) is a global leader in access solutions and products. Giving the Power of Access and Knowledge People everywhere rely on CGI to move safely through their world, confident that what they value most is secure within reach. Residential Commercial Automotive END-MARKETS SERVED VISION MISSION Over 8,000 Employees Worldwide CGI is a global team with solutions and operations designed to serve customers in a variety of markets worldwide.
  • 3. © 2021 Chamberlain Group About the Chamberlain Group 3
  • 4. © 2021 Chamberlain Group Goals of this talk • Survey the landscape of edge inference implementation • Explore software & hardware choices • Examine model & optimization choices 4 More is possible than you think!
  • 5. © 2021 Chamberlain Group Choices Hardware Inference Engine Training Framework Model Optimizations Everything is intertwined. 5
  • 6. © 2021 Chamberlain Group Choices Hardware Inference Engine Training Framework Model Optimizations 6 TFLite Limited Layer Choice TensorFlow 8-bit Quantized Only Some Models
  • 7. © 2021 Chamberlain Group Choices Hardware Inference Engine Training Framework Model Optimizations 7 TensorRT Complex Pruning PyTorch 16-bit FP YOLOv5
  • 9. © 2021 Chamberlain Group Hardware Choices MCU Cortex-M, RISC-V +SIMD +DSP +NPU +FPGA CPU Cortex-A, x86 +SIMD +DSP +GPU +NPU +FPGA FPGA On-Chip Memory, L1 & L2 Caches Bus Speed Register Count 9 Multiple Cores & Clock Speed Out-of-Order Execution
  • 10. © 2021 Chamberlain Group Hardware Choices MCU Cortex-M, RISC-V +SIMD +DSP +NPU +FPGA CPU Cortex-A, x86 +SIMD +DSP +GPU +NPU +FPGA FPGA On-Chip Memory, L1 & L2 Caches Bus Speed Register Count 10 Multiple Cores & Clock Speed Out-of-Order Execution
  • 11. © 2021 Chamberlain Group Hardware Choices MCU Cortex-M, RISC-V +SIMD +DSP +NPU +FPGA CPU Cortex-A, x86 +SIMD +DSP +GPU +NPU +FPGA FPGA On-Chip Memory, L1 & L2 Caches Bus Speed Register Count 11 Multiple Cores & Clock Speed Out-of-Order Execution
  • 12. © 2021 Chamberlain Group Hardware Choices MCU Cortex-M, RISC-V +SIMD +DSP +NPU +FPGA CPU Cortex-A, x86 +SIMD +DSP +GPU +NPU +FPGA FPGA On-Chip Memory, L1 & L2 Caches Bus Speed Register Count 12 Multiple Cores & Clock Speed Out-of-Order Execution
  • 13. © 2021 Chamberlain Group Memory, IO, etc. Hardware: Common Configurations Low Volume Mass Production USB MIPI Software → Hardware Hardware → Software 13
  • 15. © 2021 Chamberlain Group Model Optimization Workflow Reference Model • State-of-the-art • FP32 GPU • Purpose: validate your training data Optimization • Choose backbone • Choose head • Quantization • Pruning • NAS Convert or Compile • Convert • Compile Deploy and Test • Verify • Review Choices • Analyze • Optimize • Integrate 15
  • 16. © 2021 Chamberlain Group Model Optimization Workflow Reference Model • State-of-the-art • FP32 GPU • Purpose: validate your training data Optimization • Choose backbone • Choose head • Quantization • Pruning • NAS Convert or Compile • Convert • Compile Deploy and Test • Verify • Review Choices • Analyze • Optimize • Integrate 16
  • 17. © 2021 Chamberlain Group First, you will need data! • Your problem space is different! • Smaller datasets are usually OK • Smaller models need less data • Fine tuning needs less data Research Datasets Your Application 17
  • 18. © 2021 Chamberlain Group Model Optimization Workflow Reference Model • State-of-the-art • FP32 GPU • Purpose: validate your training data Optimization • Choose backbone • Choose head • Quantization • Pruning • NAS Convert or Compile • Convert • Compile Deploy and Test • Verify • Review Choices • Analyze • Optimize • Integrate 18
  • 19. © 2021 Chamberlain Group Model Mash-Up Output Layer 5 Layer 4 Layer 3 Layer 2 Layer 1 Input Head & Neck: • Interprets results • Inexpensive to fine-tune • Lower data requirements than backbone Backbone: • Extracts features • Costly to train; needs lots of data & time • Recommendation: pre-trained weights 19
  • 20. © 2021 Chamberlain Group Model Structure Model Optimizations Inference Engine Training Data Optimization Options Model Zoo Pretrained Fine-tune with your data Train from scratch Community-Supported Mix & Match Head & BB Code it yourself Quantization, Pruning, Compression NAS / OFA Decomposition Hand-Optimized Code Pre-Optimized Model 20 Pre-Optimized Runtime Community-Supported Runtimes Commercial  → Open Source
  • 21. © 2021 Chamberlain Group Model Optimization Workflow Reference Model • State-of-the-art • FP32 GPU • Purpose: validate your training data Optimization • Choose backbone • Choose head • Quantization • Pruning • NAS Convert or Compile • Convert • Compile Deploy and Test • Verify • Review Choices • Analyze • Optimize • Integrate 21
  • 22. © 2021 Chamberlain Group Inference Engine • Runtime • Model is interpreted • Model deployed separately • Easier OTA updates • Compiler • Model is compiled • Model is part of firmware • Weights are often constants 22 • Code Optimizations • Memory Usage • Cache-aware (e.g., tiling) • Efficient register usage • Vectorization • Use SIMD • Use DSP, NPU, GPU • Parallelization
  • 23. © 2021 Chamberlain Group Final Step: Integrate into your app • Consensus over time • No model gets it right all the time • High frame rate: • More samples for consensus • Lower per-sample accuracy • Low frame rate: • Fewer samples for consensus • Higher per-sample accuracy 100 ms inference time does NOT mean 10 FPS! Reserve CPU cycles for: • Ingesting from the sensor/buffer • Interpreting the output • Network • Other app functions • Temperature management 23
  • 25. © 2021 Chamberlain Group Example: Object Detection on ArmV7 Task Vehicle Detection Reference Model YOLOv5-s, FP32, PyTorch Compute Constraints ARMv7 w/NEON, no accelerators 25 ASUS Tinkerboard (v1) Cortex A17 @ 1.8 GHz Raspberry Pi 2 B v1.1 Cortex A7 @ 900 MHz Seeed NPI i.MX6ULL Cortex A7 @ 800 MHz
  • 26. © 2021 Chamberlain Group Model Structure Model Optimizations Inference Engine Training Data Optimization Options Model Zoo Pretrained Fine-tune with your data Train from scratch Community Supported Mix & Match Head & BB Code it yourself Quantization, Pruning, Compression NAS / OFA Decomposition Hand-Optimized Code Pre-Optimized Model 26 Pre-Optimized Runtime Community Supported Optimizations Commercial  → Open Source
  • 27. © 2021 Chamberlain Group Models & Tools Tested 27
  • 28. © 2021 Chamberlain Group Models & Tools Tested 28 Why so much faster? • Both 32-bit ARMv7 • NEON: 64-bit vs 32-bit • FP: 16 registers vs 32 registers • Out-of-order execution • Deeper pipeline • DMIPS/MHz: 4.0 vs 1.9
  • 29. © 2021 Chamberlain Group Models & Tools Tested 29
  • 30. © 2021 Chamberlain Group Models & Tools Tested 30 509x 54x 50x 10x 6x Speed improvement compared to YOLOv5-s on PyTorch 80-class 80-class 1-class
  • 31. © 2021 Chamberlain Group Models & Tools Tested 31 Reference Model 509x 54x 50x Speed Improvement
  • 32. © 2021 Chamberlain Group Conclusions • Inference on edge devices has become both possible and practical • Small hardware features can make a big difference in speed • Selecting the right model and the right inference engine for your hardware can expand the scope of what is possible 32
  • 33. © 2021 Chamberlain Group Example of Resource Slide 33 Detectors used in the Example: YOLOv5 https://github.com/ultralytics/yolov5 NanoDet https://github.com/RangiLyu/nanodet QuickYOLO https://github.com/tehtea/QuickYOLO Xailient https://www.xailient.com/ Chamberlain Group: https://chamberlaingroup.com Note: Many more links and resources are available at the end of the slide deck.
  • 35. © 2021 Chamberlain Group Half of implementing deep learning is fighting Python & C++ errors and resolving library incompatibilities. Pay close attention to documented versions! Use “virtualenv” Become a CMAKE expert! 35
  • 36. © 2021 Chamberlain Group Papers 36 Paper Title URL Larq Compute Engine: Design, Benchmark, and Deploy State-of-the- Art Binarized Neural Networks https://arxiv.org/abs/2011.09398 Latent Weights Do Not Exist: Rethinking Binarized Neural Network Optimization https://arxiv.org/abs/1906.02107 FCOS: Fully Convolutional One-Stage Object Detection https://arxiv.org/abs/1904.01355 Bridging the Gap Between Anchor-based and Anchor-free Detection via Adaptive Training Sample Selection https://arxiv.org/abs/1912.02424 Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection https://arxiv.org/abs/2006.04388 ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design https://arxiv.org/abs/1807.11164
  • 37. © 2021 Chamberlain Group Edge Inference Engines — Generic, Open-Source
    Inference Engine | Type | Notes
    TFLite Runtime | Runtime | Runtime for TF/Keras; https://www.tensorflow.org/lite
    TFLite Micro | Runtime | TFLite for MCUs; https://www.tensorflow.org/lite/microcontrollers
    Larq Compute Engine | Runtime | Binarized TFLite; https://github.com/larq/compute-engine
    NCNN | Runtime | Tencent runtime; https://github.com/Tencent/ncnn
    MNN | Runtime | Alibaba runtime; https://github.com/alibaba/MNN
    Apache TVM | Compiler | Compiler and optimizer; https://tvm.apache.org/
    Apache MicroTVM | Compiler | TVM for MCUs; https://tvm.apache.org/docs/microtvm/index.html
    Glow | Compiler | Compiler for ONNX; https://ai.facebook.com/tools/glow/
    Microsoft ELL | Compiler | Compiler; https://github.com/Microsoft/ELL
    deepC | Compiler | ONNX -> LLVM; https://github.com/ai-techsystems/deepC
    NNoM | Library | Keras -> C; https://github.com/majianjia/nnom
    Note: This list is not comprehensive.
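When comparing engines like those above, the speed numbers only mean something if they are measured the same way. A minimal, engine-agnostic approach is to time a warmed-up, zero-argument inference callable; `run_inference` here is a placeholder you would wrap around whatever call your engine exposes (for example a TFLite `interpreter.invoke()`), not an API from any of these projects.

```python
import time

def benchmark(run_inference, warmup=5, iters=50):
    """Return the median wall-clock latency in milliseconds of a callable.

    `run_inference` is a placeholder for the engine under test; it should
    take no arguments and run exactly one inference.
    """
    for _ in range(warmup):              # let caches and clocks settle
        run_inference()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        run_inference()
        samples.append((time.perf_counter() - t0) * 1000.0)
    samples.sort()
    return samples[len(samples) // 2]    # median is robust to OS jitter
```

Using the median rather than the mean keeps one scheduler hiccup from skewing a comparison between two runtimes.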
  • 38. © 2021 Chamberlain Group Edge Inference Engines — Hardware-Specific and Commercial
    Vendor | Inference Engine | Type | Notes
    Nvidia | TensorRT | Runtime | Support for Nvidia GPUs, such as Jetson Nano
    Arm | Arm® NN | Runtime | Optimized for Arm Cortex-A CPU, Mali GPU, Ethos NPU
    Arm | CMSIS-NN | Library | Library used by various runtimes and compilers
    NXP | NXP eIQ™ | Both | Optimized for NXP; TFLite and Glow with Arm NN & CMSIS-NN
    Qualcomm | SNPE | Runtime | For Qualcomm Snapdragon processors
    Intel | OpenVINO™ | Runtime | Runtime for Intel products, including Movidius
    Morpho | SoftNeuro | Runtime | Commercial platform; limited details publicly available
    Edge Impulse | EON | Compiler | Commercial platform; targeted at microcontrollers
    STMicro | STM32Cube.AI | Compiler | Optimized for STM32
    Kendryte | nncase | Compiler | Kendryte K210; https://github.com/kendryte/nncase
    Note: This list is not comprehensive.
  • 39. © 2021 Chamberlain Group Model Optimization Tools — Generic, Open-Source
    Tool | Framework(s) | URL
    TensorFlow MOT | TensorFlow | https://www.tensorflow.org/model_optimization
    Microsoft NNI | PyTorch | https://github.com/microsoft/nni
    IntelLabs Distiller | PyTorch | https://github.com/IntelLabs/distiller
    Riptide | TensorFlow + TVM | https://github.com/jwfromm/Riptide
    Qualcomm AIMET | PyTorch, TensorFlow | https://github.com/quic/aimet
    Note: This list is not comprehensive.
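A core job of the optimization tools above is 8-bit quantization, and the arithmetic underneath is a simple affine mapping between floats and int8 codes. The sketch below shows that mapping in pure Python; the scale and zero-point values in the comments are illustrative, and real tools pick them per-tensor or per-channel from calibration data.

```python
def quantize(x, scale, zero_point):
    """Map a float to int8 using the affine scheme q = round(x / scale) + zp."""
    q = round(x / scale) + zero_point
    return max(-128, min(127, q))        # clamp to the int8 range

def dequantize(q, scale, zero_point):
    """Recover an approximate float from its int8 code."""
    return (q - zero_point) * scale
```

For example, with scale 0.05 and zero-point 0, the float 1.0 maps to the int8 code 20, and any value above 6.35 saturates at 127. That saturation is why calibration (choosing a scale that covers the real activation range) matters so much for accuracy.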
  • 40. © 2021 Chamberlain Group Model Optimization Tools — Hardware-Specific and Commercial
    Tool | Framework(s) | URL
    OpenVINO NNCF | PyTorch | https://github.com/openvinotoolkit/nncf
    NXP eIQ | TensorFlow, TFLite, ONNX | https://www.nxp.com/design/software/development-software/eiq-ml-development-environment:EIQ
    Deeplite | PyTorch, TensorFlow, ONNX | https://www.deeplite.ai/
    Edge Impulse | Keras | —
    Note: This list is not comprehensive.
  • 41. © 2021 Chamberlain Group Peripheral Accelerators
    Product | Off-the-Shelf SBC | USB
    Nvidia GPU | Jetson Nano, TX1, TX2 | —
    Movidius Myriad X | — | Intel Neural Compute Stick 2
    Google Edge TPU | Coral Dev Board, Dev Board Mini | Coral USB Accelerator
    Gyrfalcon Lightspeeur® | — | Orange Pi AI Stick Lite
    Rockchip RK1808 | — | Toybrick RK1808
    Note: This list is not comprehensive.
  • 42. © 2021 Chamberlain Group SoCs w/Embedded Accelerators
    Product | Acceleration | Single Board Computer | Price
    Qualcomm Snapdragon (various) | DSP + GPU (+NPU) | (by request only) | —
    Ambarella CV2, CV5, CV22S, CV25S, CV28M | DSP + NPU | (by request only) | —
    NXP i.MX 8 | DSP + GPU | SolidRun | $160+
    NXP i.MX 8M Plus | DSP + GPU + NPU | SolidRun, Wandboard | $180+
    Rockchip RK3399Pro | NPU | Rock Pi N10 | $99+
    Allwinner V831 | NPU | Sipeed MAIX-II Dock | $29
    Sophon BM1880 | NPU | Sophon Edge | $129
    Note: This list is not comprehensive.
  • 43. © 2021 Chamberlain Group MCUs for Inference
    Vendor | Product | Features that support inference
    Various | Cortex-M4/7/33/35P | SIMD instructions, FPU; future Ethos-U55 microNPU
    Raspberry Pi | RP2040 | Memory, bus fabric
    Maxim Integrated | MAX78000 | Cortex-M4, CNN accelerator
    Kendryte | K210 | DNN accelerator
    Espressif | ESP32-S3 | SIMD instructions, FPU
    Note: This list is not comprehensive.
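Before picking an MCU from a list like the one above, the first sanity check is whether the model's weights and activations fit the part's flash and SRAM at all. The back-of-the-envelope sketch below shows that arithmetic; the layer sizes and the 256 KB budget are invented for illustration and do not correspond to any specific part on the slide.

```python
def model_footprint_bytes(layer_params, bytes_per_weight=1):
    """Rough flash needed for weights (int8 quantization => 1 byte each).

    `layer_params` is a list of per-layer parameter counts; the numbers
    used below are illustrative, not taken from the slides.
    """
    return sum(layer_params) * bytes_per_weight

# Hypothetical 3-layer CNN with ~51k parameters, int8-quantized:
params = [4_320, 36_864, 10_240]
flash_needed = model_footprint_bytes(params)            # bytes of weight storage
fits_256k_flash = flash_needed <= 256 * 1024            # example MCU flash budget
```

The same model stored as float32 would need four times the space, which is one reason 8-bit quantization is nearly mandatory on MCU-class targets.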