Jameson Toole
Creating smaller, faster, production-worthy
mobile machine learning models
O’Reilly AI London, 2019
“We showcase this approach by training an 8.3 billion parameter
transformer language model with 8-way model parallelism and 64-way
data parallelism on 512 GPUs, making it the largest transformer based
language model ever trained at 24x the size of BERT and 5.6x the size of
GPT-2.” - MegatronLM, 2019
Are we going in the right direction?
https://www.technologyreview.com/s/613630/training-a-single-ai-model-can-emit-as-much-carbon-as-five-cars-in-their-lifetimes/
Training MegatronLM from scratch: 0.3 kW x 220 hours x 512 GPUs ≈ 33,800 kWh
Roughly 3X the yearly electricity consumption of the average American home
Does my model enable the largest number of
people to iterate as fast as possible, using the
fewest resources, on the most devices?
How do you teach a microwave its name?
Edge intelligence: small,
efficient neural networks
that run directly
on-device.
How do you teach a _____ to _____?
Edge Intelligence is necessary and inevitable.
Latency: too much data, too fast
Power: radios use too much energy
Connectivity: internet access isn’t guaranteed
Cost: compute and bandwidth aren’t free
Privacy: some data should stay in the hands of users
Most intelligence will be at the edge.
● <100M servers
● 3B phones
● 12B IoT devices
● 150B embedded devices
The Edge Intelligence lifecycle.
Model selection
75 MB: average size of a Top-100 app
348 KB: SRAM on the SparkFun Edge development board
Model selection: macro-architecture
Design Principles
● Keep activation maps large by downsampling later or using atrous (dilated) convolutions (see the sketch below)
● Use more channels, but fewer layers
● Spend more time optimizing expensive input and output blocks; they are usually 15-25% of your computation cost
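Below is a minimal Keras sketch of the first principle; the input size, channel counts, and dilation rates are illustrative choices, not details from the talk.

```python
import tensorflow as tf

# Keep activation maps large: grow the receptive field with dilated
# (atrous) convolutions instead of downsampling early.
inputs = tf.keras.Input(shape=(224, 224, 3))
x = tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
# dilation_rate=2 covers a 5x5 receptive field at 3x3 cost, with no downsampling
x = tf.keras.layers.Conv2D(32, 3, dilation_rate=2, padding="same", activation="relu")(x)
x = tf.keras.layers.Conv2D(32, 3, dilation_rate=4, padding="same", activation="relu")(x)
# downsample late, after features are extracted at full resolution
x = tf.keras.layers.MaxPooling2D(pool_size=4)(x)
model = tf.keras.Model(inputs, x)
```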
Model selection: macro-architecture
Backbones
● MobileNet (20 MB)
● SqueezeNet (5 MB)
Layers
● Depthwise Separable
Convolutions
● Bilinear upsampling
8-9X reduction in
computation cost (see the sketch below)
https://arxiv.org/abs/1704.04861
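A minimal sketch of the depthwise separable block these backbones are built from (channel counts are illustrative). A standard KxK convolution costs K·K·Cin·Cout multiply-adds per output pixel; splitting it into a depthwise KxK plus a pointwise 1x1 costs K·K·Cin + Cin·Cout, which for K=3 and Cin=Cout=256 works out to roughly the 8-9X saving quoted above.

```python
import tensorflow as tf

def separable_block(x, filters):
    # depthwise 3x3: one filter per input channel (cost K*K*Cin)
    x = tf.keras.layers.DepthwiseConv2D(3, padding="same", use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.ReLU()(x)
    # pointwise 1x1: mixes channels (cost Cin*Cout)
    x = tf.keras.layers.Conv2D(filters, 1, use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    return tf.keras.layers.ReLU()(x)

# cost ratio for K=3, Cin=Cout=256:
# (9*256 + 256*256) / (9*256*256) ≈ 0.115, i.e. ~8.7X cheaper
```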
Model selection: micro-architecture
Design Principles
● Add a width multiplier w to control the number of parameters with a single
hyperparameter; per-filter cost scales as kernel x kernel x channels x w (see the sketch below)
● Use 1x1 convolutions instead of 3x3 convolutions where possible
● Arrange layers so they can be fused before inference (e.g. bias + batch
norm)
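A minimal sketch of the width multiplier, assuming the MobileNet-style convention of scaling every channel count by w; the base filter counts and the floor of 8 channels are illustrative.

```python
import tensorflow as tf

def conv_block(x, base_filters, w=0.5):
    # w scales channel counts everywhere, so per-layer cost
    # (kernel x kernel x channels x w) shrinks linearly with w
    filters = max(8, int(base_filters * w))
    x = tf.keras.layers.Conv2D(filters, 3, padding="same", use_bias=False)(x)
    # bias-free conv + batch norm: BN's scale and shift can be folded
    # into the conv weights before inference (layer fusion)
    x = tf.keras.layers.BatchNormalization()(x)
    return tf.keras.layers.ReLU()(x)
```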
Training small, fast models
Most neural networks are massively
over-parameterized.
Training small, fast models: distillation
Knowledge distillation: a smaller “student”
network learns to mimic a larger “teacher” network (see the loss sketch below)
Results:
1. ResNet on CIFAR10:
a. 46X smaller,
b. 10% less accurate
2. ResNet on ImageNet:
a. 2X smaller
b. 2% less accurate
3. TinyBERT on SQuAD:
a. 7.5X smaller,
b. 3% less accurate
https://nervanasystems.github.io/distiller/knowledge_distillation.html
https://arxiv.org/abs/1802.05668v1
https://arxiv.org/abs/1909.10351v2
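A minimal sketch of the distillation loss (Hinton-style soft targets); the temperature T, mixing weight alpha, and the assumption that both models output logits are illustrative choices, not details from the talk.

```python
import tensorflow as tf

T, alpha = 4.0, 0.9  # temperature and soft/hard loss mix (tuned per task)

def distillation_loss(y_true, student_logits, teacher_logits):
    # the student matches the teacher's softened output distribution...
    soft_targets = tf.nn.softmax(teacher_logits / T)
    log_probs = tf.nn.log_softmax(student_logits / T)
    soft_loss = -tf.reduce_mean(
        tf.reduce_sum(soft_targets * log_probs, axis=-1)) * T ** 2
    # ...while still fitting the ground-truth labels
    hard_loss = tf.reduce_mean(tf.keras.losses.sparse_categorical_crossentropy(
        y_true, student_logits, from_logits=True))
    return alpha * soft_loss + (1 - alpha) * hard_loss
```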
Training small, fast models: pruning
Iterative pruning: periodically removing
unimportant weights and/or filters during
training (see the sketch below).
Results:
1. AlexNet and VGG on ImageNet:
a. Weight Level: 9-11X smaller
b. Filter Level: 2-3X smaller
c. No accuracy loss
2. No clear consensus on whether pruning
is required vs training smaller networks
from scratch.
https://arxiv.org/abs/1506.02626
https://arxiv.org/abs/1608.08710
https://arxiv.org/abs/1810.05270v2
https://arxiv.org/abs/1510.00149v5
[Figure: a small weight matrix before and after pruning; the lowest-magnitude weights are zeroed out]
Weight Level - smallest, not always faster
Filter Level - smaller, faster
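A minimal sketch of iterative magnitude pruning using the TensorFlow Model Optimization toolkit; `model`, the training data, and the 80% final-sparsity schedule are placeholders.

```python
import tensorflow_model_optimization as tfmot

# wrap the model so low-magnitude weights are zeroed out on a schedule
schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0, final_sparsity=0.8, begin_step=0, end_step=10_000)
pruned = tfmot.sparsity.keras.prune_low_magnitude(model, pruning_schedule=schedule)

pruned.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
pruned.fit(x_train, y_train, epochs=5,
           callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

# strip the pruning wrappers before export
final_model = tfmot.sparsity.keras.strip_pruning(pruned)
```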
Compressing models via quantization
Quantizing 32-bit floating-point weights to low-precision
integers decreases size and (sometimes) increases
speed.
https://medium.com/@kaustavtamuly/compressing-and-accelerating-high-dimensional-neural-networks-6b501983c0c8
Compressing models via quantization
Post-training quantization: train networks normally, quantize once after training (see the sketch below).
Training-aware quantization: simulate quantization during training so the network learns weights that
tolerate low precision.
Weights and activations: quantize both weights and activations to increase speed.
Results:
1. Post-training 8-bit quantization: 4X smaller with <2% accuracy loss
2. Training aware quantization: 8-16X smaller with minimal accuracy loss
3. Quantizing weights and activations can result in a 2-3X speed increase on CPUs
https://arxiv.org/abs/1806.08342
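A minimal sketch of post-training 8-bit quantization with the TensorFlow Lite converter; `model` is a trained Keras model and `rep_data` is a placeholder generator yielding representative input batches for calibrating activation ranges.

```python
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # 8-bit weights, ~4X smaller
converter.representative_dataset = rep_data  # calibrate activations for full int8
tflite_bytes = converter.convert()

with open("model_int8.tflite", "wb") as f:
    f.write(tflite_bytes)
```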
Deployment: embracing combinatorics
Deployment: embracing combinatorics
Design Principles
● Train multiple models targeting different devices: OS x device (see the sketch below)
● Use native formats and frameworks
● Leverage available DSPs
● Monitor performance across devices
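A minimal sketch of the first two principles; the registry keys, file names, and device tiers are hypothetical. Each OS x device combination gets a model exported in the platform's native format (Core ML on iOS, TensorFlow Lite on Android) so the available DSPs and NPUs can be used.

```python
# Hypothetical registry mapping (OS, device tier) -> exported model file.
MODEL_REGISTRY = {
    ("ios", "high"):     "segmentation_large.mlmodel",  # Core ML, Neural Engine
    ("ios", "low"):      "segmentation_small.mlmodel",
    ("android", "high"): "segmentation_large.tflite",   # TFLite + NNAPI delegate
    ("android", "low"):  "segmentation_small.tflite",
}

def pick_model(os_name: str, device_tier: str) -> str:
    # fall back to the smallest model on unknown hardware
    return MODEL_REGISTRY.get((os_name, device_tier), "segmentation_small.tflite")
```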
Putting it all together
Putting it all together
Edge Intelligence Lifecycle
● Model selection: use efficient layers, parameterize model size
● Training: distill / prune for 2-10X smaller models, little accuracy loss
● Quantization: 8-bit models 4X smaller, 2-3X faster, no accuracy loss
● Deployment: use native formats that leverage available DSPs
● Improvement: put the right model on the right device at the right time
Putting it all together
Before: 1.6 million parameters, 6,327 KB, 7 fps on iPhone X
After: 6,300 parameters, 28 KB, 50+ fps on iPhone X (225X smaller)
Putting it all together
“TinyBERT is empirically effective and achieves comparable results with
BERT in GLUE datasets, while being 7.5x smaller and 9.4x faster on
inference.” - Jiao et al
“Our method reduced the size of VGG-16 by 49x from 552MB to 11.3MB,
again with no loss of accuracy.” - Han et al
“The model itself takes up less than 20KB of Flash storage space … and it
only needs 30KB of RAM to operate.” - Peter Warden at TensorFlow Dev
Summit 2019
Open questions and future work
Need better support for quantized operations.
Need more rigorous study of model optimization vs task complexity.
Will platform-aware architecture search be helpful?
Can MLIR solve the combinatorics problem?
Complete Platform for Edge Intelligence
Benefits of using Fritz
Mobile Developers
● Prepared + Pretrained
● Simple APIs
● Fast, Secure, On-device
Machine Learning Engineers
● Iterate on Mobile
● Benchmark + Optimize
● Analytics
Try it yourself:
Fritz AI Studio
App Store
Google Play
Working at the edge?
Questions?
@jamesonthecrow
jameson@fritz.ai
https://www.fritz.ai
Join the community!
heartbeat.fritz.ai