Jameson Toole
Creating smaller, faster, production-worthy
mobile machine learning models
O’Reilly AI London, 2019
“We showcase this approach by training an 8.3 billion parameter
transformer language model with 8-way model parallelism and 64-way
data parallelism on 512 GPUs, making it the largest transformer based
language model ever trained at 24x the size of BERT and 5.6x the size of
GPT-2.” - MegatronLM, 2019
Are we going in the right direction?
https://www.technologyreview.com/s/613630/training-a-single-ai-model-can-emit-as-much-carbon-as-five-cars-in-their-lifetimes/
Training MegatronLM from scratch: 0.3 kW x 220 hours x 512 GPUs ≈ 33,800 kWh
Roughly 3X the yearly electricity consumption of the average American home
Does my model enable the largest number of
people to iterate as fast as possible, using the
fewest resources, on the most devices?
How do you teach a microwave its name?
Edge intelligence: small,
efficient neural networks
that run directly
on-device.
How do you teach a _____ to _____?
Edge Intelligence is necessary and inevitable.
Latency: too much data, too fast
Power: radios use too much energy
Connectivity: internet access isn’t guaranteed
Cost: compute and bandwidth aren’t free
Privacy: some data should stay in the hands of users
Most intelligence will be at the edge.
● <100M servers
● 3B phones
● 12B IoT devices
● 150B embedded devices
The Edge Intelligence lifecycle.
Model selection
75 MB: average size of a Top-100 app
348 KB: SRAM on the SparkFun Edge development board
Model selection: macro-architecture
Design Principles
● Keep activation maps large by downsampling later or using atrous (dilated) convolutions (see the sketch below)
● Use more channels, but fewer layers
● Spend more time optimizing expensive input and output blocks; they are usually 15-25% of your computation cost
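Below is a minimal Keras sketch of the first principle; the input size, channel counts, and dilation rates are illustrative choices, not details from the talk.

```python
import tensorflow as tf

# Keep activation maps large: grow the receptive field with dilated
# (atrous) convolutions instead of downsampling early.
inputs = tf.keras.Input(shape=(224, 224, 3))
x = tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
# dilation_rate=2 covers a 5x5 receptive field at 3x3 cost, with no downsampling
x = tf.keras.layers.Conv2D(32, 3, dilation_rate=2, padding="same", activation="relu")(x)
x = tf.keras.layers.Conv2D(32, 3, dilation_rate=4, padding="same", activation="relu")(x)
# downsample late, after features are extracted at full resolution
x = tf.keras.layers.MaxPooling2D(pool_size=4)(x)
model = tf.keras.Model(inputs, x)
```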
Model selection: macro-architecture
Backbones
● MobileNet (20 MB)
● SqueezeNet (5 MB)
Layers
● Depthwise Separable
Convolutions
● Bilinear upsampling
8-9X reduction in
computation cost (see the sketch below)
https://arxiv.org/abs/1704.04861
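A minimal sketch of the depthwise separable block these backbones are built from (channel counts are illustrative). A standard KxK convolution costs K·K·Cin·Cout multiply-adds per output pixel; splitting it into a depthwise KxK plus a pointwise 1x1 costs K·K·Cin + Cin·Cout, which for K=3 and Cin=Cout=256 works out to roughly the 8-9X saving quoted above.

```python
import tensorflow as tf

def separable_block(x, filters):
    # depthwise 3x3: one filter per input channel (cost K*K*Cin)
    x = tf.keras.layers.DepthwiseConv2D(3, padding="same", use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.ReLU()(x)
    # pointwise 1x1: mixes channels (cost Cin*Cout)
    x = tf.keras.layers.Conv2D(filters, 1, use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    return tf.keras.layers.ReLU()(x)

# cost ratio for K=3, Cin=Cout=256:
# (9*256 + 256*256) / (9*256*256) ≈ 0.115, i.e. ~8.7X cheaper
```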
Model selection: micro-architecture
Design Principles
● Add a width multiplier w to control the number of parameters with a single
hyperparameter; per-filter cost scales as kernel x kernel x channels x w (see the sketch below)
● Use 1x1 convolutions instead of 3x3 convolutions where possible
● Arrange layers so they can be fused before inference (e.g. bias + batch
norm)
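A minimal sketch of the width multiplier, assuming the MobileNet-style convention of scaling every channel count by w; the base filter counts and the floor of 8 channels are illustrative.

```python
import tensorflow as tf

def conv_block(x, base_filters, w=0.5):
    # w scales channel counts everywhere, so per-layer cost
    # (kernel x kernel x channels x w) shrinks linearly with w
    filters = max(8, int(base_filters * w))
    x = tf.keras.layers.Conv2D(filters, 3, padding="same", use_bias=False)(x)
    # bias-free conv + batch norm: BN's scale and shift can be folded
    # into the conv weights before inference (layer fusion)
    x = tf.keras.layers.BatchNormalization()(x)
    return tf.keras.layers.ReLU()(x)
```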
Training small, fast models
Most neural networks are massively
over-parameterized.
Training small, fast models: distillation
Knowledge distillation: a smaller “student”
network learns to mimic a larger “teacher” network (see the loss sketch below)
Results:
1. ResNet on CIFAR10:
a. 46X smaller,
b. 10% less accurate
2. ResNet on ImageNet:
a. 2X smaller
b. 2% less accurate
3. TinyBERT on SQuAD:
a. 7.5X smaller,
b. 3% less accurate
https://nervanasystems.github.io/distiller/knowledge_distillation.html
https://arxiv.org/abs/1802.05668v1
https://arxiv.org/abs/1909.10351v2
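A minimal sketch of the distillation loss (Hinton-style soft targets); the temperature T, mixing weight alpha, and the assumption that both models output logits are illustrative choices, not details from the talk.

```python
import tensorflow as tf

T, alpha = 4.0, 0.9  # temperature and soft/hard loss mix (tuned per task)

def distillation_loss(y_true, student_logits, teacher_logits):
    # the student matches the teacher's softened output distribution...
    soft_targets = tf.nn.softmax(teacher_logits / T)
    log_probs = tf.nn.log_softmax(student_logits / T)
    soft_loss = -tf.reduce_mean(
        tf.reduce_sum(soft_targets * log_probs, axis=-1)) * T ** 2
    # ...while still fitting the ground-truth labels
    hard_loss = tf.reduce_mean(tf.keras.losses.sparse_categorical_crossentropy(
        y_true, student_logits, from_logits=True))
    return alpha * soft_loss + (1 - alpha) * hard_loss
```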
Training small, fast models: pruning
Iterative pruning: periodically removing
unimportant weights and/or filters during
training (see the sketch below).
Results:
1. AlexNet and VGG on ImageNet:
a. Weight Level: 9-11X smaller
b. Filter Level: 2-3X smaller
c. No accuracy loss
2. No clear consensus on whether pruning
is required vs training smaller networks
from scratch.
https://arxiv.org/abs/1506.02626
https://arxiv.org/abs/1608.08710
https://arxiv.org/abs/1810.05270v2
https://arxiv.org/abs/1510.00149v5
[Figure: a small weight matrix before and after pruning; the lowest-magnitude weights are zeroed out]
Weight Level - smallest, not always faster
Filter Level - smaller, faster
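A minimal sketch of iterative magnitude pruning using the TensorFlow Model Optimization toolkit; `model`, the training data, and the 80% final-sparsity schedule are placeholders.

```python
import tensorflow_model_optimization as tfmot

# wrap the model so low-magnitude weights are zeroed out on a schedule
schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0, final_sparsity=0.8, begin_step=0, end_step=10_000)
pruned = tfmot.sparsity.keras.prune_low_magnitude(model, pruning_schedule=schedule)

pruned.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
pruned.fit(x_train, y_train, epochs=5,
           callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

# strip the pruning wrappers before export
final_model = tfmot.sparsity.keras.strip_pruning(pruned)
```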
Compressing models via quantization
Quantizing 32-bit floating-point weights to low-precision
integers decreases size and (sometimes) increases
speed.
https://medium.com/@kaustavtamuly/compressing-and-accelerating-high-dimensional-neural-networks-6b501983c0c8
Compressing models via quantization
Post-training quantization: train networks normally, quantize once after training (see the sketch below).
Training-aware quantization: simulate quantization during training so the network learns weights that
tolerate low precision.
Weights and activations: quantize both weights and activations to increase speed.
Results:
1. Post-training 8-bit quantization: 4X smaller with <2% accuracy loss
2. Training aware quantization: 8-16X smaller with minimal accuracy loss
3. Quantizing weights and activations can result in a 2-3X speed increase on CPUs
https://arxiv.org/abs/1806.08342
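A minimal sketch of post-training 8-bit quantization with the TensorFlow Lite converter; `model` is a trained Keras model and `rep_data` is a placeholder generator yielding representative input batches for calibrating activation ranges.

```python
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # 8-bit weights, ~4X smaller
converter.representative_dataset = rep_data  # calibrate activations for full int8
tflite_bytes = converter.convert()

with open("model_int8.tflite", "wb") as f:
    f.write(tflite_bytes)
```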
Deployment: embracing combinatorics
Deployment: embracing combinatorics
Design Principles
● Train multiple models targeting different devices: OS x device (see the sketch below)
● Use native formats and frameworks
● Leverage available DSPs
● Monitor performance across devices
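A minimal sketch of the first two principles; the registry keys, file names, and device tiers are hypothetical. Each OS x device combination gets a model exported in the platform's native format (Core ML on iOS, TensorFlow Lite on Android) so the available DSPs and NPUs can be used.

```python
# Hypothetical registry mapping (OS, device tier) -> exported model file.
MODEL_REGISTRY = {
    ("ios", "high"):     "segmentation_large.mlmodel",  # Core ML, Neural Engine
    ("ios", "low"):      "segmentation_small.mlmodel",
    ("android", "high"): "segmentation_large.tflite",   # TFLite + NNAPI delegate
    ("android", "low"):  "segmentation_small.tflite",
}

def pick_model(os_name: str, device_tier: str) -> str:
    # fall back to the smallest model on unknown hardware
    return MODEL_REGISTRY.get((os_name, device_tier), "segmentation_small.tflite")
```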
Putting it all together
Putting it all together
Edge Intelligence Lifecycle
● Model selection: use efficient layers, parameterize model size
● Training: distill / prune for 2-10X smaller models, little accuracy loss
● Quantization: 8-bit models 4X smaller, 2-3X faster, no accuracy loss
● Deployment: use native formats that leverage available DSPs
● Improvement: put the right model on the right device at the right time
Putting it all together
Before: 1.6 million parameters, 6,327 KB, 7 fps on iPhone X
After: 6,300 parameters, 28 KB, 50+ fps on iPhone X (225X smaller)
Putting it all together
“TinyBERT is empirically effective and achieves comparable results with
BERT in GLUE datasets, while being 7.5x smaller and 9.4x faster on
inference.” - Jiao et al
“Our method reduced the size of VGG-16 by 49x from 552MB to 11.3MB,
again with no loss of accuracy.” - Han et al
“The model itself takes up less than 20KB of Flash storage space … and it
only needs 30KB of RAM to operate.” - Peter Warden at TensorFlow Dev
Summit 2019
Open questions and future work
Need better support for quantized operations.
Need more rigorous study of model optimization vs task complexity.
Will platform-aware architecture search be helpful?
Can MLIR solve the combinatorics problem?
Complete Platform for Edge Intelligence
Benefits of using Fritz
Mobile Developers
● Prepared + Pretrained
● Simple APIs
● Fast, Secure, On-device
Machine Learning Engineers
● Iterate on Mobile
● Benchmark + Optimize
● Analytics
Try it yourself:
Fritz AI Studio
App Store
Google Play
Working at the edge?
Questions?
@jamesonthecrow
jameson@fritz.ai
https://www.fritz.ai
Join the community!
heartbeat.fritz.ai