“Once-for-All DNNs: Simplifying Design of Efficient Models for Diverse Hardware,” a Presentation from MIT

Massachusetts Institute of Technology
Song Han
“Once-for-All” DNNs: Simplifying Design of
Efficient Models for Diverse Hardware

The Rise of AIoT
IoT + AI = AIoT

Fewer Computational Resources: TinyML
Fewer Engineering Resources: AutoML
Many engineers, large model, a lot of computation → (simplify) → fewer engineers, small model, less computation

Efficient AI Applications:
- Efficient Video recognition: TSM highlighted by NVIDIA and IBM, adopted by Baidu PaddlePaddle
and HKUST MMLab
- Efficient 3D point cloud recognition, autonomous driving: 1st place on SemanticKITTI leaderboard,
adopted by MIT Driverless
- Machine translation, NLP: Reduce the design cost by 4 orders of magnitude compared with Google.
Automated Tools:
- Pruning, Quantization, Compression: co-founded DeePhi Tech, acquired by Xilinx, industry standard.
- Two generations of automated NN architecture design (ProxylessNAS, OFA): adopted by Facebook
PyTorch and Amazon AutoGluon
- AI designed by AI outperforms human performance:
• 1st place, 3rd/4th Low-Power Computer Vision Challenge @ICCV’19, NeurIPS’19 [paper]
• 1st place, MicroNet Challenge, NLP track (WikiText-103), @NeurIPS’19
• 1st place, Visual Wake Words challenge on MCU @CVPR’19
Efficient Hardware & AI for EDA:
- EIE: the first accelerator to support pruned and sparse weights. Influenced NVIDIA’s Ampere GPU,
NVIDIA’s DLA, ARM’s Project Trillium, Samsung’s NPU, and Intel’s NN Distiller.
Research Topics
TinyML and Efficient Deep Learning
Low Latency, Low Energy, Small Model Size
Full-stack, Automated

Problem: TinyML (inference) comes at the cost of BigML (training/search)
We need Green AI: Solve the Environmental Problem of NAS
Evolved Transformer: ICML’19, ACL’19
Ours (“Hardware-Aware Transformer”, ACL’20): 52, 4 orders of magnitude

Challenge: Efficient Inference on Diverse Hardware Platforms
Once-for-all, ICLR’20
Diverse hardware platforms: Cloud AI (10^12 FLOPS), Mobile AI (10^9 FLOPS), Tiny AI (10^6 FLOPS), …
Design cost (GPU hours): 40K, 160K, 1600K
40K GPU hours → 11.4k lbs CO2 emission
1 GPU hour translates to 0.284 lbs CO2 emission according to
Strubell, Emma, et al. "Energy and policy considerations for deep learning in NLP." ACL. 2019.
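The CO2 figure follows directly from the conversion factor quoted on the slide. A minimal sketch of that arithmetic is shown here; the function name `co2_lbs` is only an illustrative choice, and the three GPU-hour values are the design costs listed on the slide.

```python
# Convert NAS design cost (GPU hours) into estimated CO2 emission,
# using the 0.284 lbs CO2 per GPU hour factor quoted on the slide
# (Strubell et al., ACL 2019).
LBS_CO2_PER_GPU_HOUR = 0.284


def co2_lbs(gpu_hours):
    """Estimated CO2 emission in lbs for a given number of GPU hours."""
    return gpu_hours * LBS_CO2_PER_GPU_HOUR


if __name__ == "__main__":
    for hours in (40_000, 160_000, 1_600_000):
        print(f"{hours:>9,} GPU hours -> {co2_lbs(hours) / 1000:.1f}k lbs CO2")
```

For 40K GPU hours this gives about 11.4k lbs, matching the slide; the same factor applied to the larger design costs scales linearly.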

Our Solution: Once-for-All Network
Once-for-All Network → get many child nets for free

OFA: Decouple Training and Search

Conventional NAS (with meta controller):
    for devices:
        for search episodes:                     // meta controller
            for training iterations:             // expensive
                forward-backward();
            if good_model: break;
        for post-search training iterations:     // expensive
            forward-backward();

=>

Once-for-All (training and search are decoupled):
    for OFA training iterations:                 // training: expensive
        forward-backward();
    for devices:                                 // search: light-weight
        for search episodes:                     // light-weight
            sample from OFA;
            if good_model: break;
        directly deploy without training;
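The difference between the two loop structures can be made concrete with a toy cost count. The sketch below is only an illustration: the function names and all iteration counts are hypothetical placeholders (not figures from the talk), the early `break` is left out for simplicity, and `forward_backward` is a stub that just counts calls instead of training anything.

```python
# Toy cost model for the two loop structures.
# All counts are hypothetical; forward_backward() is a stub that only
# increments a counter instead of running a real training step.

class StepCounter:
    def __init__(self):
        self.steps = 0

    def forward_backward(self):
        self.steps += 1


def conventional_nas(num_devices, search_episodes, train_iters, post_iters):
    """Each search episode trains a candidate; the winner is trained again."""
    counter = StepCounter()
    for _ in range(num_devices):
        for _ in range(search_episodes):   # meta controller proposes a model
            for _ in range(train_iters):
                counter.forward_backward()
        for _ in range(post_iters):        # post-search training
            counter.forward_backward()
    return counter.steps


def once_for_all(num_devices, search_episodes, ofa_train_iters):
    """Train once; the per-device search only samples sub-networks."""
    counter = StepCounter()
    for _ in range(ofa_train_iters):       # paid once, shared by all devices
        counter.forward_backward()
    samples = 0
    for _ in range(num_devices):
        for _ in range(search_episodes):
            samples += 1                   # sample from OFA, no training step
    return counter.steps, samples


if __name__ == "__main__":
    for devices in (1, 10, 100):
        conv = conventional_nas(devices, search_episodes=20,
                                train_iters=50, post_iters=100)
        ofa_steps, ofa_samples = once_for_all(devices, search_episodes=20,
                                              ofa_train_iters=5000)
        print(f"devices={devices}: conventional={conv} training steps, "
              f"OFA={ofa_steps} training steps + {ofa_samples} samples")
```

In this toy setting the conventional training cost grows linearly with the number of devices, while the once-for-all training cost stays constant and only the cheap sampling count grows.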

Once-for-All Network: Decouple Model Training and Architecture Design

Challenge: how to prevent different sub-networks from interfering with each other?

Solution: Progressive Shrinking
• Training a once-for-all network is much more challenging than training a normal
neural network, given so many sub-networks to support.
• Progressive Shrinking can support more than 10^19 different sub-networks in a
single once-for-all network, covering 4 different dimensions: resolution, kernel
size, depth, width.
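The size of the sub-network space can be checked with a short calculation. The sketch below assumes the MobileNetV3-style search space described in the OFA paper (5 units, depth per unit in {2, 3, 4}, per-layer kernel size in {3, 5, 7} and width expansion ratio in {3, 4, 6}); these settings are not stated on the slide, and the function name is only illustrative. Input resolution is left out of the count, so the real number of configurations is larger still.

```python
# Count architecture configurations in an OFA-style search space.
# Assumed settings (from the OFA paper, not from the slide):
KERNEL_SIZES = (3, 5, 7)    # per-layer kernel size choices
EXPAND_RATIOS = (3, 4, 6)   # per-layer width expansion ratio choices
DEPTHS = (2, 3, 4)          # number of layers per unit
NUM_UNITS = 5               # number of units in the network


def count_subnetworks(kernel_sizes=KERNEL_SIZES, expand_ratios=EXPAND_RATIOS,
                      depths=DEPTHS, num_units=NUM_UNITS):
    # Each layer independently picks a kernel size and an expansion ratio.
    per_layer = len(kernel_sizes) * len(expand_ratios)
    # A unit with d layers has per_layer ** d configurations.
    per_unit = sum(per_layer ** d for d in depths)
    # Units are configured independently of each other.
    return per_unit ** num_units


if __name__ == "__main__":
    total = count_subnetworks()
    print(f"{total:.2e} sub-networks")
```

Under these assumptions the count is ((3×3)^2 + (3×3)^3 + (3×3)^4)^5, which exceeds 10^19.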

Progressive Shrinking (Once-for-all, ICLR’20)
Train the full model → Shrink the model (4 dimensions) → Jointly fine-tune both large and small sub-networks → once-for-all network
• Small sub-networks are nested in large sub-networks.
• Cast the training process of the once-for-all network as a progressive shrinking and
joint fine-tuning process.
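The shrinking schedule can be sketched as a sequence of phases in which one more dimension becomes elastic at each step. The code below is a simplified illustration under stated assumptions, not the released training code: the option lists, the phase order (resolution, kernel size, depth, width) and all function names are hypothetical, and the actual forward-backward step is replaced by a comment.

```python
import random

# Hypothetical option lists; the largest value in each list is the "full" setting.
FULL = {"resolution": [224], "kernel": [7], "depth": [4], "width": [6]}
ELASTIC = {
    "resolution": [128, 160, 192, 224],
    "kernel": [3, 5, 7],
    "depth": [2, 3, 4],
    "width": [3, 4, 6],
}
PHASES = ["resolution", "kernel", "depth", "width"]  # order in which dims open up


def search_space_for_phase(phase_index):
    """Dimensions up to phase_index are elastic; later ones stay at full size."""
    return {
        dim: (ELASTIC[dim] if i <= phase_index else FULL[dim])
        for i, dim in enumerate(PHASES)
    }


def sample_subnet(space, rng):
    """Pick one option per dimension from the current sampling space."""
    return {dim: rng.choice(choices) for dim, choices in space.items()}


def progressive_shrinking(steps_per_phase, rng):
    """Return the (phase, sub-network) pairs that would be trained, in order."""
    history = [("full", sample_subnet(FULL, rng))]  # step 1: train the full model
    for phase_index, dim in enumerate(PHASES):
        space = search_space_for_phase(phase_index)
        for _ in range(steps_per_phase):
            subnet = sample_subnet(space, rng)
            # A real implementation would run forward-backward on `subnet` here,
            # fine-tuning the shared weights of large and small sub-networks.
            history.append((dim, subnet))
    return history


if __name__ == "__main__":
    for phase, subnet in progressive_shrinking(steps_per_phase=2, rng=random.Random(0)):
        print(phase, subnet)
```

Because larger settings remain in the sampling space during every phase, large and small sub-networks keep being fine-tuned jointly rather than one after the other.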

Connection to Network Pruning
Network Pruning: Train the full model → Shrink the model (only width) → Fine-tune the small net → single pruned network
Progressive Shrinking: Train the full model → Shrink the model (4 dimensions) → Fine-tune both large and small sub-nets → once-for-all network
• Progressive shrinking can be viewed as a generalized network pruning with much
higher flexibility across 4 dimensions.
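One way to picture the link to pruning is the width dimension: the OFA paper ranks channels by the L1 norm of their weights and lets smaller sub-networks keep the top-ranked ones, so every narrower layer is nested inside the wider one. The sketch below illustrates that idea on a toy weight matrix in pure Python; the function names and the example weights are hypothetical, and a real layer would also need the input channels of the following layer reordered consistently.

```python
def channel_importance(weights):
    """L1 norm of each output channel (one row per output channel)."""
    return [sum(abs(w) for w in row) for row in weights]


def sort_channels_by_importance(weights):
    """Reorder channels so the most important ones come first."""
    importance = channel_importance(weights)
    order = sorted(range(len(weights)), key=lambda i: importance[i], reverse=True)
    return [weights[i] for i in order]


def shrink_width(sorted_weights, num_channels):
    """A narrower sub-network keeps the first `num_channels` channels."""
    return sorted_weights[:num_channels]


if __name__ == "__main__":
    # Toy layer with 4 output channels and 2 input channels.
    toy_weights = [[0.1, -0.2], [1.0, 2.0], [-0.5, 0.5], [0.0, 0.05]]
    ranked = sort_channels_by_importance(toy_weights)
    for width in (4, 3, 2):
        print(width, shrink_width(ranked, width))
```

Classic pruning would stop after producing one narrow network; here the same sorted weights serve every width, which is what makes the smaller sub-networks share parameters with the larger ones.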

[Figure: progressive shrinking animation. Starting from the full network, the elastic dimensions are enabled one at a time, each adding partial (smaller) options alongside the full setting: Elastic Resolution → Elastic Kernel Size → Elastic Depth → Elastic Width.]
Performances of Sub-networks on ImageNet (Once-for-all, ICLR’20)
[Figure: ImageNet top-1 accuracy (%) of sub-networks under various architecture configurations, without (w/o PS) and with (w/ PS) progressive shrinking. D: depth, W: width, K: kernel size.]
Configurations shown, left to right: (D=2, W=3, K=3), (D=2, W=3, K=7), (D=2, W=6, K=3), (D=2, W=6, K=7), (D=4, W=3, K=3), (D=4, W=3, K=7), (D=4, W=6, K=3), (D=4, W=6, K=7)
Accuracy gains annotated on the chart, in the same order: 2.5%, 2.8%, 3.5%, 3.4%, 3.3%, 3.4%, 3.7%, 3.5%
• Progressive shrinking consistently improves accuracy of sub-networks on ImageNet.

Train Once, Get Many

How about search? Zero training cost!
    for OFA training iterations:        // training
        forward-backward();
    for devices:                        // search, decoupled from training
        for search episodes:            // with evolution or even random
            sample from OFA;
            if good_model: break;
        direct deploy without training;

How to evaluate if good_model? By Model Twin
OFA Network → Acc Dataset [Architecture, Accuracy] → Accuracy Prediction Model
OFA Network → Latency Dataset [Architecture, Latency] → Latency Prediction Model
Accuracy/Latency predictor: RMSE ~0.2%
Accuracy Prediction Model + Latency Prediction Model → Predictor-based Architecture Search → Specialized Sub-Network
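A predictor-based search loop of this kind can be sketched in a few lines. The example below is a toy illustration only: `predict_score` and `predict_latency_ms` are hypothetical hand-written formulas standing in for the trained accuracy and latency prediction models, the option lists are assumed, and plain random sampling is used in place of evolutionary search.

```python
import random

# Assumed option lists for the toy search space.
KERNELS = (3, 5, 7)
DEPTHS = (2, 3, 4)
WIDTHS = (3, 4, 6)


def sample_architecture(rng):
    """Sample one sub-network configuration (stands in for 'sample from OFA')."""
    return {
        "kernel": rng.choice(KERNELS),
        "depth": rng.choice(DEPTHS),
        "width": rng.choice(WIDTHS),
    }


def predict_score(arch):
    """Hypothetical stand-in for the accuracy prediction model (arbitrary units)."""
    return 1.0 * arch["kernel"] + 2.0 * arch["depth"] + 1.5 * arch["width"]


def predict_latency_ms(arch):
    """Hypothetical stand-in for the latency prediction model."""
    return 2.0 * arch["depth"] * arch["width"] + 0.5 * arch["kernel"]


def random_search(latency_limit_ms, num_samples, rng):
    """Return the best-scoring sampled architecture that meets the latency limit."""
    best = None
    for _ in range(num_samples):
        arch = sample_architecture(rng)
        if predict_latency_ms(arch) > latency_limit_ms:
            continue  # violates the hardware constraint, no training needed to know
        score = predict_score(arch)
        if best is None or score > best[0]:
            best = (score, arch)
    return best


if __name__ == "__main__":
    result = random_search(latency_limit_ms=30.0, num_samples=200,
                           rng=random.Random(0))
    print(result)
```

Since both checks are predictor calls rather than training runs or on-device measurements, each search episode costs almost nothing, which is why the search can be repeated for every target device.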

Our latency model is super accurate

Accuracy & Latency Improvement (Once-for-All, ICLR’20)
[Figure, left: ImageNet top-1 accuracy (%) vs. Google Pixel1 latency (ms), OFA vs. EfficientNet. Data labels: 76.3, 78.8, 79.8 (EfficientNet); 78.7, 79.8, 80.1 (OFA). OFA is 2.6x faster and reaches 3.8% higher accuracy.]
[Figure, right: ImageNet top-1 accuracy (%) vs. latency (ms), OFA vs. MobileNetV3. Data labels: 67.4, 70.4, 73.3, 75.2 (MobileNetV3); 71.4, 73.3, 74.9, 76.4 (OFA). OFA is 1.5x faster and reaches 4% higher accuracy.]
• Training from scratch cannot achieve the same level of accuracy

More accurate than training from scratch
[Figure: ImageNet top-1 accuracy (%) vs. latency (ms) for OFA vs. EfficientNet (2.6x faster, 3.8% higher accuracy) and OFA vs. MobileNetV3 (1.5x faster, 4% higher accuracy), each with an added "OFA - Train from scratch" series.]
• Training from scratch cannot achieve the same level of accuracy

OFA: 80% Top-1 Accuracy on ImageNet
[Figure: ImageNet top-1 accuracy (%, the higher the better) vs. MACs (billion, the lower the better); marker size indicates model size (2M to 64M); models are grouped as handcrafted or AutoML.]
Models compared: Once-for-All (ours), EfficientNet, ProxylessNAS, MBNetV3, AmoebaNet, MBNetV2, PNASNet, ShuffleNet, DARTS, IGCV3-D, MobileNetV1 (MBNetV1), NASNet-A, InceptionV2, DenseNet-121, DenseNet-169, ResNet-50, ResNeXt-50, InceptionV3, DenseNet-264, DPN-92, ResNet-101, Xception, ResNeXt-101
Once-for-All: 80.0% top-1 at 595M MACs, annotated as 14x less computation
• Once-for-all sets a new state-of-the-art 80% ImageNet top-1 accuracy under
the mobile vision setting (< 600M MACs).

OFA Enables Fast Specialization on Diverse Hardware Platforms
[Figure: six panels of ImageNet top-1 accuracy (%) vs. latency (ms), comparing OFA, MobileNetV3 and MobileNetV2. Panel labels recovered: Samsung S7 Edge; LG G8; NVIDIA 1080Ti (batch size = 64); Intel Xeon CPU (batch size = 1); Xilinx ZU3EG FPGA (batch size = 1, quantized). One panel label did not survive extraction.]

OFA’s Application: Low Power Computer Vision
• First place in the 3rd Low Power Computer Vision Challenge, DSP track at ICCV’19
• First place in the 4th Low Power Computer Vision Challenge, both classification and detection tracks
OFA Network → Specialized Sub-network → Deploy on Qualcomm Snapdragon 855, Hexagon 690 DSP (latency < 7ms)
Our result: latency 5.15ms, top-1 78.8%

OFA for FPGA
Measured results on FPGA
[Figure: arithmetic intensity (OPS/Byte) and ZU3EG FPGA throughput (GOPS/s) for MobileNetV2, MnasNet and OFA (ours); the OFA bars are annotated "40% higher" and "57% higher".]
Specialized NN architecture on specialized hardware architecture

Specialized Architecture for Different Hardware Platforms

Tutorial on ProxylessNAS & OFA
● IPython Notebook tutorial. 
● Architecture search with 1 GPU in 2 minutes. 
● Hands-on lab at 3:45pm PT today, office hour 6:00pm. 

Once-for-All Network (OFA) has broad applications
• Efficient Video Recognition
• Efficient 3D Vision
• Efficient GAN Compression

OFA’s Application: Efficient Video Recognition
(follow-up of TSM, ICCV’19)
• 7x less computation, same performance as TSM+ResNet50
• Same computation, 3% higher accuracy than TSM+MobileNet-v2
[Figure: Kinetics top-1 accuracy (%) vs. computation (GFLOPs) for OFA + TSM (large), OFA + TSM (small), MobileNetV2 + TSM, ResNet50 + TSM and ResNet50 + I3D.]

Latency Comparison
Batch size=1. Measured on NVIDIA Tesla P100.
Each row represents a video.
I3D:
Latency: 164.3 ms/Video Something-V1 Acc.: 41.6%
TSM:
Latency: 17.4 ms/Video Something-V1 Acc.: 43.4%
Speed-up: 9x

Throughput Comparison
Batch size=16. Measured on NVIDIA Tesla P100.
Each square represents a video.
I3D:
Throughput: 6.1 video/s
Something-V1 Acc.: 41.6%
TSM:
Throughput: 77.4 video/s
Something-V1 Acc.: 43.4%
12.7x larger throughput
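The quoted speed-up and throughput gain follow from the measurements on these two slides. A minimal sketch of the arithmetic, using only the numbers given above (the variable and function names are illustrative):

```python
# Measurements quoted on the slides (NVIDIA Tesla P100).
I3D_LATENCY_MS = 164.3       # batch size 1
TSM_LATENCY_MS = 17.4        # batch size 1
I3D_THROUGHPUT = 6.1         # videos/s, batch size 16
TSM_THROUGHPUT = 77.4        # videos/s, batch size 16


def ratio(numerator, denominator):
    """Plain ratio, kept as a function so both comparisons read the same way."""
    return numerator / denominator


if __name__ == "__main__":
    latency_speedup = ratio(I3D_LATENCY_MS, TSM_LATENCY_MS)
    throughput_gain = ratio(TSM_THROUGHPUT, I3D_THROUGHPUT)
    print(f"latency speed-up: {latency_speedup:.1f}x")
    print(f"throughput gain:  {throughput_gain:.1f}x")
```

The latency ratio comes out at about 9.4x, which the slide rounds to 9x, and the throughput ratio at about 12.7x.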

Improving the Robustness of Online Video Detection

Scaling Up: Large-Scale Distributed Training with the Summit Supercomputer
Summit supercomputer:
• CPU: 2 x 16-core IBM POWER9 (connected via dual NVLINK bricks, 25 GB/s each side)
• GPU: 6 x NVIDIA Tesla V100
• RAM: 512 GB DDR4 memory
• Data storage: HDD
• Connection: dual-rail EDR InfiniBand network of 23 GB/s
Acknowledgment: IBM and Oak Ridge National Lab
* Lin et al., Training Kinetics in 15 Minutes: Large-scale Distributed Training on Videos, arXiv 1811.08383

● We are able to speed up the training by 200x, from 2 days to 14 minutes.
● Model setup: 8-frame ResNet-50 for video recognition
● Dataset: Kinetics (240k training videos) x 100 epochs

Setup                          Training Time   Accuracy   Peak GPU Performance   Speed-up
1 SUMMIT node (6 GPUs)         49h 50min       74.1%      46.5 TFLOP/s           -
128 SUMMIT nodes (768 GPUs)    28min           74.1%      5,989 TFLOP/s          Theoretical: 128x, Actual: 106x
256 SUMMIT nodes (1536 GPUs)   14min           74.0%      11,978 TFLOP/s         Theoretical: 256x, Actual: 211x

[Figure: training time (h) on 1 SUMMIT node vs. 128 SUMMIT nodes, a 106x speed-up.]
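The gap between theoretical and actual speed-up can be expressed as a scaling efficiency. The sketch below works only from the numbers reported above; the function names are illustrative. Note that the training times are rounded to whole minutes, so speed-ups recomputed from them differ slightly from the 106x and 211x stated on the slide.

```python
def to_minutes(hours, minutes):
    """Convert an 'Xh Ymin' duration to minutes."""
    return hours * 60 + minutes


def scaling_efficiency(actual_speedup, num_nodes):
    """Fraction of the ideal (linear) speed-up that was achieved."""
    return actual_speedup / num_nodes


def measured_speedup(baseline_minutes, minutes):
    """Speed-up recomputed from the (rounded) training times."""
    return baseline_minutes / minutes


if __name__ == "__main__":
    baseline = to_minutes(49, 50)  # 1 node: 49h 50min
    for nodes, minutes, reported in ((128, 28, 106), (256, 14, 211)):
        eff = scaling_efficiency(reported, nodes)
        recomputed = measured_speedup(baseline, minutes)
        print(f"{nodes} nodes: reported {reported}x "
              f"(efficiency {eff:.1%}), from rounded times {recomputed:.1f}x")
```

Both configurations land at roughly 82 to 83 percent of linear scaling, so doubling from 128 to 256 nodes costs very little additional efficiency.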

GAN Compression, CVPR’20
OFA’s Application: GAN Compression
8-21x FLOPs reduction on CycleGAN, Pix2pix, GauGAN
1.7x-18.5x speedup on CPU/GPU & Mobile CPU/GPU

OFA’s Application: Efficient 3D Recognition
Self-driving: a whole trunk of GPUs
AR/VR: a whole backpack of computers
Accuracy vs. Latency Tradeoff
• 4x FLOPs reduction and 2x speedup over MinkowskiNet
• 3.6% better accuracy under the same computation budget
SPVNAS, ECCV’20

              DarkNet53Seg   SPVNAS (Ours)
Mean IoU      49.9           58.8 (= KPConv)
Throughput    9.7 FPS        11.8 FPS
Params        50.4M          1.1M
FLOPs         376.3G         10.6G
SPVNAS makes fewer errors (in red) than the 2D baseline model.
45x model size reduction and 35x computation reduction
SPVNAS, ECCV’20

Significantly Faster than MinkowskiNets
              MinkowskiNet   SPVNAS (Ours)
Mean IoU      63.1           63.6
Throughput    3.4 FPS        6.5 FPS
Params        21.7M          7.6M
FLOPs         114.0G         30.0G
SPVNAS outperforms the state-of-the-art MinkowskiNet (with 2x measured speedup and 3x model size reduction).
SPVNAS, ECCV’20

Qualitative Results on SemanticKITTI
Error By
MinkowskiNets
Less Error By
SPVNAS
Ground Truth
SPVNAS, ECCV’20

Qualitative Results on KITTI
Detection By
SECOND
More Accurate Detection By
SPVNAS
Ground Truth
SPVNAS, ECCV’20

Hardware-aware AutoML, push-button solution
Make AI Efficient, with Tiny Resources (Computational and Human)
ProxylessNAS, ICLR’19
HAQ, CVPR’19, oral
AMC, ECCV’18 
Once-for-All, ICLR’20 
Neural-Hardware Architecture Search, NeurIPS workshop’19 
SPVNAS, ECCV’20
1st place, Low Power Computer Vision Challenge’19
1st place, Low Power Computer Vision Challenge’20
1st place, Visual Wake Words Challenge@CVPR’19

AutoML: Design Automation for AI [ECCV’18, ICLR’19, CVPR’19, ICLR’20, CVPR’20]
- We developed two generations of AutoML techniques for efficient NN design (ProxylessNAS, OFA)
- Such AI-designed AI consistently outperforms human performance:
- First place in Low Power Computer Vision Challenges (2019, 2020).
- First place in Visual Wake Words Challenge 2019.
AI for Design Automation [DAC’20]
- AI is Revolutionizing EDA: fast, hw, data-driven
- “GCN-RL Circuit Designer: Transferable Transistor Sizing with Graph Neural Networks and
Reinforcement Learning”, DAC’20
- Circuit is a graph; GCN feature extractor.
- Transfer ability between technology nodes & topologies


Summary: Once-for-All Network
• Released 50+ different pre-trained OFA models on diverse hardware platforms (CPU/GPU/FPGA/DSP).
    from ofa.model_zoo import ofa_specialized
    net, image_size = ofa_specialized(net_id, pretrained=True)
• Released the training code & pre-trained OFA network that provides diverse sub-networks without training.
    from ofa.model_zoo import ofa_net
    ofa_network = ofa_net(net_id, pretrained=True)
• We introduce once-for-all network for efficient inference on diverse hardware platforms.
• We present an effective progressive shrinking approach for training once-for-all networks.
Project Page: https://ofa.mit.edu
• Once-for-all network surpasses MobileNetV3 and EfficientNet by a large margin under all scenarios,
setting a new state-of-the-art 80% ImageNet top-1 accuracy under the mobile setting (< 600M MACs).
• First place in the 3rd Low-Power Computer Vision Challenge, DSP track @ICCV’19
• First place in the 4th Low-Power Computer Vision Challenge @NeurIPS’19, both classification & detection.
Progressive shrinking: Train the full model → Shrink the model in 4 dimensions → Fine-tune both large and small sub-nets → once-for-all network


The Future of AI is “Tiny”
Vast Applications
Smart Retail Personalized Healthcare
Smart Manufacturing Precision Agriculture
Smart Home
Autonomous Driving

Hardware, AI and Neural-nets
TinyML and Efficient AI
Media:
songhan.mit.edu
youtube.com/c/MITHANLab
github.com/mit-han-lab
