Massachusetts Institute of Technology
Song Han
“Once-for-All” DNNs: Simplifying Design of
Efficient Models for Diverse Hardware
The Rise of AIoT
IoT + AI = AIoT
Less Computational Resources: TinyML
Less Engineer Resources: AutoML
many engineers, large model, a lot of computation
→ Simplify →
fewer engineers, small model, less computation
Efficient AI Applications:
- Efficient Video recognition: TSM highlighted by NVIDIA and IBM, adopted by Baidu PaddlePaddle
and HKUST MMLab
- Efficient 3D point cloud recognition, autonomous driving: 1st place on SemanticKITTI leaderboard,
adopted by MIT Driverless
- Machine translation, NLP: Reduce the design cost by 4 orders of magnitude compared with Google.
Automated Tools:
- Pruning, Quantization, Compression: co-founded DeePhi Tech, acquired by Xilinx, industry standard.
- Two generations of automated NN architecture design (ProxylessNAS, OFA): adopted by Facebook
PyTorch and Amazon AutoGluon
- AI designed by AI outperforms human performance:
• 1st place, 3rd/4th Low-Power Computer Vision Challenge @ICCV’19, NeurIPS’19
• 1st place, MicroNet Challenge, NLP track (WikiText-103), @NeurIPS’19
• 1st place, Visual Wake Words challenge on MCU @CVPR’19
Efficient Hardware & AI for EDA:
- EIE: first accelerator that supports pruned and sparse weights. Influenced NVIDIA’s Ampere GPU,
NVIDIA’s DLA, ARM’s Project Trillium, Samsung’s NPU, and Intel’s NN Distiller.
Research Topics
TinyML and Efficient Deep Learning
Low Latency, Low Energy, Small Model Size
Fullstack
Automated
Research Topics
Full Stack Research
Algorithm
Hardware
Edge
Cloud
Training Inference
[Edge | Inference | Algorithm] Hardware-Aware Quantization CVPR 19 Oral
[Edge | Inference | Algorithm] Point-Voxel CNN for 3D DL NeurIPS 19 Spotlight
[Cloud | Training | Algorithm] Deep Leakage from Gradients NeurIPS 19
[Edge/Cloud | Inference | Algorithm] Once-for-All Network ICLR 20
[Edge | Inference | Algorithm] GAN Compression CVPR 20
[Edge/Cloud | Inference | Hardware] SpArch for Sparse Matrix Multiplication HPCA 20
[Edge | Inference | Algorithm] Temporal Shift Module for Video ICCV 19
[Edge | Inference | Algorithm] HAT: Hardware-Aware Transformer ACL 20
[Edge | Inference | Algorithm] Lite Transformer ICLR 20
Problem: TinyML (inference) comes at the cost of BigML (training/search)
We need Green AI: Solve the Environmental Problem of NAS
[Figure: Evolved Transformer (ICML’19, ACL’19) vs. ours, “Hardware-Aware Transformer” (ACL’20). Ours: 52, 4 orders of magnitude lower.]
Once-for-all, ICLR’20
Challenge: Efficient Inference on Diverse Hardware Platforms
Diverse Hardware Platforms: Cloud AI (10^12 FLOPS), Mobile AI (10^9 FLOPS), Tiny AI (10^6 FLOPS), …
Design Cost (GPU hours) → CO2 emission:
40K → 11.4k lbs
160K → 45.4k lbs
1600K → 454.4k lbs
1 GPU hour translates to 0.284 lbs CO2 emission according to
Strubell, Emma, et al. "Energy and policy considerations for deep learning in NLP." ACL. 2019.
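The emission figures on this slide follow directly from the quoted conversion factor. A minimal sketch of the arithmetic, where the function name `co2_lbs` is only an illustrative choice:

```python
# Conversion factor quoted on the slide (from Strubell et al., ACL 2019).
LBS_CO2_PER_GPU_HOUR = 0.284

def co2_lbs(gpu_hours):
    """Estimated CO2 emission in lbs for a given number of GPU hours."""
    return gpu_hours * LBS_CO2_PER_GPU_HOUR

# The three design costs shown on the slide.
for gpu_hours in (40_000, 160_000, 1_600_000):
    print(f"{gpu_hours:>9,} GPU hours -> {co2_lbs(gpu_hours) / 1000:.1f}k lbs CO2")
```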
Our Solution: Once-for-All Network
Get many child nets for free
OFA: Decouple Training and Search

Conventional NAS (with meta controller):
  for devices:
    for search episodes:  // meta controller
      for training iterations:  // expensive
        forward-backward();
      if good_model: break;
    for post-search training iterations:  // expensive
      forward-backward();

=>

Once-for-All (training and search decoupled):
  for OFA training iterations:  // training: expensive
    forward-backward();
  for devices:  // search: light-weight
    for search episodes:
      sample from OFA;
      if good_model: break;
    directly deploy without training;  // light-weight
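To see why decoupling matters as the number of target devices grows, here is a rough cost-accounting sketch. The function names and all hour values are hypothetical and chosen only for illustration; they are not figures from the talk.

```python
def conventional_nas_cost(num_devices, search_and_train_hours):
    """Each device repeats the full search-and-train loop."""
    return num_devices * search_and_train_hours

def ofa_cost(num_devices, ofa_training_hours, search_hours_per_device):
    """Training is paid once; each device adds only a light-weight search."""
    return ofa_training_hours + num_devices * search_hours_per_device

# Hypothetical numbers, for illustration only (not figures from the talk).
for n in (1, 4, 40):
    print(n, conventional_nas_cost(n, 1000), ofa_cost(n, 1500, 10))
```

With these made-up numbers the one-time training cost dominates for a single device, but the total grows far more slowly than the conventional loop as devices are added.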
Once-for-All Network:
Decouple Model Training and Architecture Design
once-for-all network
Challenge: how to prevent different subnetworks from interfering with each other?
Solution: Progressive Shrinking
• Training a once-for-all network is much more challenging than training a normal
neural network, given so many sub-networks to support.
• Progressive Shrinking can support more than 10^19 different sub-networks in a
single once-for-all network, covering 4 different dimensions: resolution, kernel
size, depth, width.
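A count of this magnitude can be reproduced with a short calculation. The sketch below assumes a search space of 5 units, each with 2 to 4 layers, where every layer picks one of 3 kernel sizes and one of 3 widths. These specific choices are assumptions made for illustration (the slides only name the dimensions), and input resolution is left out of the count.

```python
KERNEL_SIZES = (3, 5, 7)   # assumed kernel-size choices per layer
WIDTH_CHOICES = (3, 4, 6)  # assumed width choices per layer
DEPTHS = (2, 3, 4)         # assumed number of layers per unit
NUM_UNITS = 5              # assumed number of units in the network

# Each layer independently picks a kernel size and a width.
per_layer = len(KERNEL_SIZES) * len(WIDTH_CHOICES)
# A unit with d layers has per_layer ** d configurations; sum over allowed depths.
per_unit = sum(per_layer ** d for d in DEPTHS)
# Units are configured independently of each other.
total = per_unit ** NUM_UNITS

print(f"{total:.2e} sub-networks")
```

Under these assumptions the total lands on the order of 10^19, consistent with the figure quoted on the slide.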
Progressive Shrinking: train the full model → shrink the model (4 dimensions) → jointly fine-tune both large and small sub-networks → once-for-all network
• Small sub-networks are nested in large sub-networks.
• Cast the training process of the once-for-all network as a progressive shrinking and
joint fine-tuning process.
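The shrink-then-fine-tune idea can be sketched as a schedule that widens the set of allowed sub-networks one dimension at a time and samples a configuration at each fine-tuning step. This is a simplified illustration only: the option values, the stage order, and the names `search_space` and `sample_subnet` are assumptions, and the actual weight updates are omitted.

```python
import random

# Illustrative choices per dimension; the first entry is the "full" setting.
CHOICES = {
    "resolution": [224, 192, 160, 128],
    "kernel_size": [7, 5, 3],
    "depth": [4, 3, 2],
    "width": [6, 4, 3],
}
# Order in which dimensions become elastic (an assumption for this sketch).
SHRINK_ORDER = ["resolution", "kernel_size", "depth", "width"]

def search_space(num_elastic):
    """Search space after the first `num_elastic` dimensions became elastic."""
    elastic = set(SHRINK_ORDER[:num_elastic])
    return {dim: (opts if dim in elastic else opts[:1]) for dim, opts in CHOICES.items()}

def sample_subnet(space, rng):
    """Pick one sub-network configuration to fine-tune in this step."""
    return {dim: rng.choice(opts) for dim, opts in space.items()}

rng = random.Random(0)
for stage in range(len(SHRINK_ORDER) + 1):
    # Stage 0 trains only the full model; later stages also sample smaller ones.
    print(stage, sample_subnet(search_space(stage), rng))
```

Because the full setting stays in every stage's option list, large sub-networks keep being sampled alongside the small ones, which mirrors the joint fine-tuning described above.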
Connection to Network Pruning
Network Pruning: train the full model → shrink the model (only width) → fine-tune the small net → single pruned network
Progressive Shrinking: train the full model → shrink the model (4 dimensions) → fine-tune both large and small sub-nets → once-for-all network
• Progressive shrinking can be viewed as a generalized network pruning with much
higher flexibility across 4 dimensions.
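For readers unfamiliar with width pruning, the "shrink the model (only width)" step can be illustrated with a common magnitude-based heuristic: keep the output channels whose weights have the largest L1 norm. This is a generic sketch, not necessarily the criterion used in the talk, and the function names and toy weights are hypothetical.

```python
def l1_norm(channel_weights):
    """Sum of absolute weights of one output channel."""
    return sum(abs(w) for w in channel_weights)

def shrink_width(layer, keep):
    """Keep the `keep` output channels with the largest L1 norm, in original order."""
    ranked = sorted(range(len(layer)), key=lambda i: l1_norm(layer[i]), reverse=True)
    kept = sorted(ranked[:keep])
    return [layer[i] for i in kept]

# Toy layer with 4 output channels, 2 weights each (made-up values).
layer = [[0.1, -0.2], [1.5, 0.3], [-0.9, 0.8], [0.0, 0.05]]
print(shrink_width(layer, keep=2))
```

Progressive shrinking applies the same kind of reduction along depth, kernel size, and resolution as well, and keeps the large network usable rather than discarding it.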
Progressive Shrinking
[Figure (animation): the four dimensions are made elastic one after another, in the order resolution, kernel size, depth, width. Each dimension starts with only the "Full" setting and then also supports "Partial" settings.]
Performances of Sub-networks on ImageNet
[Figure: ImageNet top-1 accuracy (%) of sub-networks trained without (w/o PS) and with (w/ PS) progressive shrinking, under eight architecture configurations combining D ∈ {2, 4}, W ∈ {3, 6}, K ∈ {3, 7} (D: depth, W: width, K: kernel size). Annotated accuracy gains range from 2.5% to 3.7%.]
• Progressive shrinking consistently improves accuracy of sub-networks on ImageNet.
Train Once, Get Many
How about search? Zero training cost!
Training (decoupled from search):
  for OFA training iterations:
    forward-backward();
Search:
  for devices:
    for search episodes:  // with evolution or even random
      sample from OFA;
      if good_model: break;
    direct deploy without training;
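The search loop above can be sketched as a random search under a latency budget. Everything here is a toy stand-in: the configuration space and the two scoring functions `toy_latency_ms` and `toy_accuracy` are hypothetical placeholders for real predictors, not part of any released code.

```python
import random

KERNELS, DEPTHS, WIDTHS = (3, 5, 7), (2, 3, 4), (3, 4, 6)

def sample_config(rng):
    """Sample one sub-network configuration (the 'sample from OFA' step)."""
    return {"k": rng.choice(KERNELS), "d": rng.choice(DEPTHS), "w": rng.choice(WIDTHS)}

def toy_latency_ms(cfg):
    # Hypothetical stand-in for a latency predictor.
    return 2.0 * cfg["d"] * cfg["w"] + 0.5 * cfg["k"]

def toy_accuracy(cfg):
    # Hypothetical stand-in for an accuracy predictor.
    return 60.0 + 2.0 * cfg["d"] + 1.0 * cfg["w"] + 0.3 * cfg["k"]

def random_search(latency_budget_ms, episodes=200, seed=0):
    """Return the best-scoring sampled config that meets the latency budget."""
    rng = random.Random(seed)
    best = None
    for _ in range(episodes):
        cfg = sample_config(rng)
        if toy_latency_ms(cfg) > latency_budget_ms:
            continue  # violates the device constraint
        if best is None or toy_accuracy(cfg) > toy_accuracy(best):
            best = cfg
    return best

print(random_search(latency_budget_ms=30.0))
```

No training happens inside the loop, which is the point of the slide: the search only queries predictors, so it can be rerun cheaply for every new device or budget.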
How to evaluate if good_model? By Model Twin
[Figure: the OFA network is used to build an accuracy dataset of [architecture, accuracy] pairs and a latency dataset of [architecture, latency] pairs. These train an accuracy prediction model and a latency prediction model (accuracy/latency predictor RMSE ~0.2%), which drive a predictor-based architecture search that outputs a specialized sub-network.]
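One simple way to build a latency predictor from [architecture, latency] data is a per-layer lookup table whose entries are summed, with RMSE used to check the predictor against measurements. The sketch below assumes that approach; the table values, the `(kernel_size, width)` key format, and the function names are all hypothetical.

```python
import math

# Hypothetical per-layer latency table in ms, keyed by (kernel_size, width).
LATENCY_TABLE_MS = {(3, 3): 0.8, (3, 6): 1.5, (7, 3): 1.4, (7, 6): 2.6}

def predict_latency_ms(layers):
    """Predict network latency as the sum of per-layer table entries."""
    return sum(LATENCY_TABLE_MS[layer] for layer in layers)

def rmse(predicted, measured):
    """Root-mean-square error between predictions and measurements."""
    squared = [(p - m) ** 2 for p, m in zip(predicted, measured)]
    return math.sqrt(sum(squared) / len(squared))

# Two made-up architectures and made-up on-device measurements.
archs = [[(3, 3), (7, 6)], [(3, 6), (7, 3), (3, 3)]]
measured = [3.5, 3.6]
predicted = [predict_latency_ms(a) for a in archs]
print(predicted, rmse(predicted, measured))
```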
Our latency model is super accurate
Accuracy & Latency Improvement
[Figure: top-1 ImageNet accuracy (%) vs. Google Pixel1 latency (ms), two panels. OFA vs. EfficientNet: 2.6x faster, 3.8% higher accuracy (data labels: 76.3, 78.8, 79.8, 79.8, 78.7, 80.1). OFA vs. MobileNetV3: 1.5x faster, 4% higher accuracy (data labels: 75.2, 73.3, 70.4, 67.4, 76.4, 74.9, 73.3, 71.4).]
More accurate than training from scratch
[Figure: the same two panels as in "Accuracy & Latency Improvement" (OFA vs. EfficientNet and OFA vs. MobileNetV3 on Google Pixel1), with an added "OFA - Train from scratch" series.]
• Training from scratch cannot achieve the same level of accuracy
OFA: 80% Top-1 Accuracy on ImageNet
[Figure: ImageNet top-1 accuracy (%, the higher the better) vs. MACs (billions, the lower the better); marker size indicates model size (2M to 64M). Handcrafted and AutoML models compared with Once-for-All (ours): EfficientNet, ProxylessNAS, MBNetV3, AmoebaNet, MBNetV2, PNASNet, ShuffleNet, DARTS, IGCV3-D, MobileNetV1 (MBNetV1), NASNet-A, InceptionV2, DenseNet-121, DenseNet-169, ResNet-50, ResNeXt-50, InceptionV3, DenseNet-264, DPN-92, ResNet-101, Xception, ResNeXt-101. Annotations: 595M MACs, 80.0% Top-1, 14x less computation.]
• Once-for-all sets a new state-of-the-art 80% ImageNet top-1 accuracy under
the mobile vision setting (< 600M MACs).
OFA Enables Fast Specialization on Diverse Hardware Platforms
[Figure: top-1 ImageNet accuracy (%) vs. latency (ms) for OFA, MobileNetV3 and MobileNetV2 on six platforms: Samsung S7 Edge, Google Pixel2, LG G8, NVIDIA 1080Ti (batch size = 64), Intel Xeon CPU (batch size = 1), and Xilinx ZU3EG FPGA (batch size = 1, quantized).]
OFA’s Application: Low Power Computer Vision
• First place in the 3rd Low Power Computer Vision Challenge, DSP track at ICCV’19
• First place in the 4th Low Power Computer Vision Challenge, both classification and detection track
[Figure: OFA network → specialized sub-network → deploy on Qualcomm Snapdragon 855 (Hexagon 690 DSP). Constraint: latency < 7 ms. Our result: latency 5.15 ms, top-1 78.8%.]
OFA for FPGA: measured results on FPGA
[Figure: arithmetic intensity (OPS/Byte) and ZU3EG FPGA throughput (GOPS/s) for MobileNetV2, MnasNet and OFA (ours). Annotations: 40% higher, 57% higher.]
Specialized NN architecture on specialized hardware architecture
Specialized Architecture for Different Hardware Platforms
Tutorial on ProxylessNAS & OFA
● IPython Notebook tutorial.
● Architecture search with 1 GPU in 2 minutes.
● Hands-on lab at 3:45pm PT today, office hour 6:00pm.
Once-for-All Network (OFA) has broad applications
• Efficient Video Recognition
• Efficient 3D Vision
• Efficient GAN Compression
OFA’s Application: Efficient Video Recognition (follow-up of TSM, ICCV’19)
[Figure: Kinetics top-1 accuracy (%) vs. computation (GFLOPs) for OFA + TSM (large), OFA + TSM (small), MobileNetV2 + TSM, ResNet50 + TSM and ResNet50 + I3D.]
• 7x less computation, same performance as TSM + ResNet50
• Same computation, 3% higher accuracy than TSM + MobileNetV2
Latency Comparison
Batch size=1. Measured on NVIDIA Tesla P100.
Each row represents a video.
I3D:
Latency: 164.3 ms/Video Something-V1 Acc.: 41.6%
TSM:
Latency: 17.4 ms/Video Something-V1 Acc.: 43.4%
Speed-up: 9x
Throughput Comparison
Batch size=16. Measured on NVIDIA Tesla P100.
Each square represents a video.
I3D:
Throughput: 6.1 video/s
Something-V1 Acc.: 41.6%
TSM:
Throughput: 77.4 video/s
Something-V1 Acc.: 43.4%
12.7x larger throughput
Improving the Robustness of Online Video Detection
Gesture recognition
Scaling Up: Large-Scale Distributed Training with the SUMMIT Supercomputer
SUMMIT Supercomputer:
• CPU: 2 x 16-core IBM POWER9 (connected via dual NVLINK bricks, 25 GB/s each side)
• GPU: 6 x NVIDIA Tesla V100
• RAM: 512 GB DDR4 memory
• Data Storage: HDD
• Connection: Dual-rail EDR InfiniBand network of 23 GB/s
Acknowledgment: IBM and Oak Ridge National Lab
* Lin et al., Training Kinetics in 15 Minutes: Large-scale Distributed Training on Videos, arXiv 1811.08383
● We are able to speed up the training by 200x, from 2 days to 14 minutes.
● Model setup: 8-frame ResNet-50 for video recognition
● Dataset: Kinetics (240k training videos) x 100 epochs

Setup                          Training Time  Accuracy  Peak GPU Performance  Speed-up
1 SUMMIT node (6 GPUs)         49h 50min      74.1%     46.5 TFLOP/s          1x (baseline)
128 SUMMIT nodes (768 GPUs)    28min          74.1%     5,989 TFLOP/s         Theoretical: 128x, actual: 106x
256 SUMMIT nodes (1536 GPUs)   14min          74.0%     11,978 TFLOP/s        Theoretical: 256x, actual: 211x

[Figure: training time (h) for 1 SUMMIT node vs. 128 SUMMIT nodes, a 106x speed-up.]
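The reported speed-ups (106x on 128 nodes, 211x on 256 nodes) can be read as a scaling efficiency, that is, the fraction of ideal linear speed-up actually achieved. A minimal sketch of that arithmetic, with `scaling_efficiency` as an illustrative helper name:

```python
def scaling_efficiency(actual_speedup, num_nodes):
    """Fraction of ideal linear speed-up that was actually achieved."""
    return actual_speedup / num_nodes

# (number of nodes, reported actual speed-up) pairs from the table.
for nodes, actual in ((128, 106), (256, 211)):
    print(f"{nodes} nodes: {scaling_efficiency(actual, nodes):.1%} of linear scaling")
```

Both configurations stay above 80% of linear scaling, so doubling the node count from 128 to 256 roughly halves the training time.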
GAN Compression, CVPR’20
OFA’s Application: GAN Compression
8-21x FLOPs reduction on CycleGAN, Pix2pix, GauGAN
1.7x-18.5x speedup on CPU/GPU & Mobile CPU/GPU
OFA’s Application: Efficient 3D Recognition (SPVNAS, ECCV’20)
Self-driving: a whole trunk of GPUs
AR/VR: a whole backpack of computers
Accuracy vs. Latency Tradeoff
• 4x FLOPs reduction and 2x speedup over MinkowskiNet
• 3.6% better accuracy under the same computation budget

Model          Mean IoU         Throughput  Params  FLOPs
DarkNet53Seg   49.9             9.7 FPS     50.4M   376.3G
SPVNAS (Ours)  58.8 (= KPConv)  11.8 FPS    1.1M    10.6G
SPVNAS makes fewer errors (in red) than the 2D baseline model.
45x model size reduction and 35x computation reduction

Significantly Faster than MinkowskiNets
Model          Mean IoU  Throughput  Params  FLOPs
MinkowskiNet   63.1      3.4 FPS     21.7M   114.0G
SPVNAS (Ours)  63.6      6.5 FPS     7.6M    30.0G
SPVNAS outperforms the state-of-the-art MinkowskiNet (with 2x measured speedup and 3x model size
reduction).

Qualitative Results on SemanticKITTI
[Figure: error by MinkowskiNets, less error by SPVNAS, ground truth.]
Qualitative Results on KITTI
[Figure: detection by SECOND, more accurate detection by SPVNAS, ground truth.]
Make AI Efficient, with Tiny Resources (Computational and Human)
Hardware-aware AutoML, push-button solution
ProxylessNAS, ICLR’19
HAQ, CVPR’19, oral
AMC, ECCV’18
Once-for-All, ICLR’20
Neural-Hardware Architecture Search, NeurIPS workshop’19
SPVNAS, ECCV’20
1st place, Low Power Computer Vision Challenge’19
1st place, Low Power Computer Vision Challenge’20
1st place, Visual Wake Words Challenge@CVPR’19
AutoML: Design Automation for AI [ECCV’18, ICLR’19, CVPR’19, ICLR’20, CVPR’20]
- We developed two generations of AutoML techniques for efficient NN design (ProxylessNAS, OFA)
- Such AI-designed AI consistently outperforms human performance:
- First place in Low Power Computer Vision Challenges (2019, 2020).
- First place in Visual Wake Words Challenge 2019.
AI for Design Automation [DAC’20]
- AI is Revolutionizing EDA: fast, hw, data-driven
- “GCN-RL Circuit Designer: Transferable Transistor Sizing with Graph Neural Networks and
Reinforcement Learning”, DAC’20
- Circuit is a graph; GCN feature extractor.
- Transfer ability between technology nodes & topologies
Summary: Once-for-All Network
• Released 50+ different pre-trained OFA models on diverse hardware platforms (CPU/GPU/FPGA/DSP).
net, image_size = ofa_specialized(net_id, pretrained=True)
• Released the training code & pre-trained OFA network that provides diverse sub-networks without training.
ofa_network = ofa_net(net_id, pretrained=True)
• We introduce once-for-all network for efficient inference on diverse hardware platforms.
• We present an effective progressive shrinking approach for training once-for-all networks.
Project Page: https://ofa.mit.edu
• Once-for-all network surpasses MobileNetV3 and EfficientNet by a large margin under all scenarios,
setting a new state-of-the-art 80% ImageNet Top1-accuracy under the mobile setting (< 600M MACs).
• First place in the 3rd Low-Power Computer Vision Challenge, DSP track @ICCV’19
• First place in the 4th Low-Power Computer Vision Challenge @NeurIPS’19, both classification & detection.
Progressive Shrinking: train the full model → shrink the model in 4 dimensions → fine-tune both large and small sub-nets → once-for-all network
The Future of AI is “Tiny”
Vast Applications
Smart Retail Personalized Healthcare
Smart Manufacturing Precision Agriculture
Smart Home
Autonomous Driving
Hardware, AI and Neural-nets
TinyML and Efficient AI
Media:
songhan.mit.edu
youtube.com/c/MITHANLab
github.com/mit-han-lab

More Related Content

PPTX
Game qa
PPTX
Linea del tiempo de los videojuegos
PPTX
Unity 3D game engine seminar
PDF
PPTX
Introduction to Game Development
PPTX
Making an independend MMO - The Albion Online Story
PPTX
Funny programming
PDF
Testing Kafka containers with Testcontainers: There and back again with Vikto...
Game qa
Linea del tiempo de los videojuegos
Unity 3D game engine seminar
Introduction to Game Development
Making an independend MMO - The Albion Online Story
Funny programming
Testing Kafka containers with Testcontainers: There and back again with Vikto...

Similar to “Once-for-All DNNs: Simplifying Design of Efficient Models for Diverse Hardware,” a Presentation from MIT (20)

PPTX
HiPEAC-CSW 2022_Pedro Trancoso presentation
PPT
Chip Design Trend & Fabrication Prospects In India
PDF
“A Re-imagination of Embedded Vision System Design,” a Presentation from Imag...
PDF
Cray HPC Environments for Leading Edge Simulations
PDF
HKG18-500K1 - Keynote: Dileep Bhandarkar - Emerging Computing Trends in the D...
PDF
Linaro connect 2018 keynote final updated
PDF
invited speech at Ge2013, Udine 2013
PDF
PPTX
GreenDroid
PPTX
GreenDroid
PDF
Efficient video perception through AI
PDF
“A New Golden Age for Computer Architecture: Processor Innovation to Enable U...
PDF
Lecture 1 Advanced Computer Architecture
PPTX
CourboSpark: Decision Tree for Time-series on Spark
PDF
TRACK D: Advanced design regardless of process technology/ Marco Casale-Rossi
PDF
NVIDIA at Breakthrough Discuss for Space Exploration
PDF
Artificial intelligence at the edge
PDF
“Scaling i.MX Applications Processors’ Native Edge AI with Discrete AI Accele...
PPTX
Presentation by Priyanka_Greendroid
PDF
1. CMOS Basic.pdf detail explain provide in This pdf
HiPEAC-CSW 2022_Pedro Trancoso presentation
Chip Design Trend & Fabrication Prospects In India
“A Re-imagination of Embedded Vision System Design,” a Presentation from Imag...
Cray HPC Environments for Leading Edge Simulations
HKG18-500K1 - Keynote: Dileep Bhandarkar - Emerging Computing Trends in the D...
Linaro connect 2018 keynote final updated
invited speech at Ge2013, Udine 2013
GreenDroid
GreenDroid
Efficient video perception through AI
“A New Golden Age for Computer Architecture: Processor Innovation to Enable U...
Lecture 1 Advanced Computer Architecture
CourboSpark: Decision Tree for Time-series on Spark
TRACK D: Advanced design regardless of process technology/ Marco Casale-Rossi
NVIDIA at Breakthrough Discuss for Space Exploration
Artificial intelligence at the edge
“Scaling i.MX Applications Processors’ Native Edge AI with Discrete AI Accele...
Presentation by Priyanka_Greendroid
1. CMOS Basic.pdf detail explain provide in This pdf

More from Edge AI and Vision Alliance (20)

PDF
“Visual Search: Fine-grained Recognition with Embedding Models for the Edge,”...
PDF
“Optimizing Real-time SLAM Performance for Autonomous Robots with GPU Acceler...
PDF
“LLMs and VLMs for Regulatory Compliance, Quality Control and Safety Applicat...
PDF
“Simplifying Portable Computer Vision with OpenVX 2.0,” a Presentation from AMD
PDF
“Quantization Techniques for Efficient Deployment of Large Language Models: A...
PDF
“Introduction to Data Types for AI: Trade-Offs and Trends,” a Presentation fr...
PDF
“Introduction to Radar and Its Use for Machine Perception,” a Presentation fr...
PDF
“NPU IP Hardware Shaped Through Software and Use-case Analysis,” a Presentati...
PDF
“Voice Interfaces on a Budget: Building Real-time Speech Recognition on Low-c...
PDF
“Computer Vision at Sea: Automated Fish Tracking for Sustainable Fishing,” a ...
PDF
“Squinting Vision Pipelines: Detecting and Correcting Errors in Vision Models...
PDF
“ONNX and Python to C++: State-of-the-art Graph Compilation,” a Presentation ...
PDF
“Beyond the Demo: Turning Computer Vision Prototypes into Scalable, Cost-effe...
PDF
“Running Accelerated CNNs on Low-power Microcontrollers Using Arm Ethos-U55, ...
PDF
“MPU+: A Transformative Solution for Next-Gen AI at the Edge,” a Presentation...
PDF
“Evolving Inference Processor Software Stacks to Support LLMs,” a Presentatio...
PDF
“Efficiently Registering Depth and RGB Images,” a Presentation from eInfochips
PDF
“How to Right-size and Future-proof a Container-first Edge AI Infrastructure,...
PDF
“Image Tokenization for Distributed Neural Cascades,” a Presentation from Goo...
PDF
“Key Requirements to Successfully Implement Generative AI in Edge Devices—Opt...
“Visual Search: Fine-grained Recognition with Embedding Models for the Edge,”...
“Optimizing Real-time SLAM Performance for Autonomous Robots with GPU Acceler...
“LLMs and VLMs for Regulatory Compliance, Quality Control and Safety Applicat...
“Simplifying Portable Computer Vision with OpenVX 2.0,” a Presentation from AMD
“Quantization Techniques for Efficient Deployment of Large Language Models: A...
“Introduction to Data Types for AI: Trade-Offs and Trends,” a Presentation fr...
“Introduction to Radar and Its Use for Machine Perception,” a Presentation fr...
“NPU IP Hardware Shaped Through Software and Use-case Analysis,” a Presentati...
“Voice Interfaces on a Budget: Building Real-time Speech Recognition on Low-c...
“Computer Vision at Sea: Automated Fish Tracking for Sustainable Fishing,” a ...
“Squinting Vision Pipelines: Detecting and Correcting Errors in Vision Models...
“ONNX and Python to C++: State-of-the-art Graph Compilation,” a Presentation ...
“Beyond the Demo: Turning Computer Vision Prototypes into Scalable, Cost-effe...
“Running Accelerated CNNs on Low-power Microcontrollers Using Arm Ethos-U55, ...
“MPU+: A Transformative Solution for Next-Gen AI at the Edge,” a Presentation...
“Evolving Inference Processor Software Stacks to Support LLMs,” a Presentatio...
“Efficiently Registering Depth and RGB Images,” a Presentation from eInfochips
“How to Right-size and Future-proof a Container-first Edge AI Infrastructure,...
“Image Tokenization for Distributed Neural Cascades,” a Presentation from Goo...
“Key Requirements to Successfully Implement Generative AI in Edge Devices—Opt...

Recently uploaded (20)

PDF
Reach Out and Touch Someone: Haptics and Empathic Computing
PDF
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
PPTX
Cloud computing and distributed systems.
PPT
Teaching material agriculture food technology
PDF
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
PDF
Dropbox Q2 2025 Financial Results & Investor Presentation
PPTX
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
PPTX
Big Data Technologies - Introduction.pptx
PDF
KodekX | Application Modernization Development
PDF
cuic standard and advanced reporting.pdf
PPTX
MYSQL Presentation for SQL database connectivity
PDF
Optimiser vos workloads AI/ML sur Amazon EC2 et AWS Graviton
PDF
The Rise and Fall of 3GPP – Time for a Sabbatical?
PDF
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
PDF
Mobile App Security Testing_ A Comprehensive Guide.pdf
PDF
MIND Revenue Release Quarter 2 2025 Press Release
PDF
Machine learning based COVID-19 study performance prediction
PDF
Advanced methodologies resolving dimensionality complications for autism neur...
PDF
Encapsulation theory and applications.pdf
PDF
Approach and Philosophy of On baking technology
Reach Out and Touch Someone: Haptics and Empathic Computing
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
Cloud computing and distributed systems.
Teaching material agriculture food technology
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
Dropbox Q2 2025 Financial Results & Investor Presentation
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
Big Data Technologies - Introduction.pptx
KodekX | Application Modernization Development
cuic standard and advanced reporting.pdf
MYSQL Presentation for SQL database connectivity
Optimiser vos workloads AI/ML sur Amazon EC2 et AWS Graviton
The Rise and Fall of 3GPP – Time for a Sabbatical?
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
Mobile App Security Testing_ A Comprehensive Guide.pdf
MIND Revenue Release Quarter 2 2025 Press Release
Machine learning based COVID-19 study performance prediction
Advanced methodologies resolving dimensionality complications for autism neur...
Encapsulation theory and applications.pdf
Approach and Philosophy of On baking technology

“Once-for-All DNNs: Simplifying Design of Efficient Models for Diverse Hardware,” a Presentation from MIT

  • 1. Massachusetts Institute of Technology Song Han “Once-for-All” DNNs: Simplifying Design of Efficient Models for Diverse Hardware
  • 2. The Rise of AIoT IoT + AI = AIoT
  • 3. Less Computational Resources: TinyML Less Engineer Resources: AutoML many engineers large model A lot of computation fewer engineers small model less computation Simplify
  • 4. Efficient AI Applications: - Efficient Video recognition: TSM highlighted by NVIDIA and IBM, adopted by Baidu PaddlePaddle and HKUST MMLab - Efficient 3D point cloud recognition, autonomous driving: 1st place on SemanticKITTI leaderboard, adopted by MIT Driverless - Machine translation, NLP: Reduce the design cost by 4 orders of magnitude compared with Google. Automated Tools: - Pruning, Quantization, Compression: co-founded DeePhi Tech, acquired by Xilinx, industry standard. - Two generations of automated NN architecture design (ProxylessNAS, OFA): adopted by Facebook PyTorch and Amazon AutoGluon - AI designed by AI outperforms human performance: • 1st place, 3rd/4th Low-Power Computer Vision Challenge @ICCV’19, NeurIPS’19 [paper] • 1st place, MicroNet Challenge, NLP track (WikiText-103), @NeurIPS’19 • 1st place, Visual Wake Words challenge on MCU @CVPR’19 Efficient Hardware & AI for EDA: - EIE: first accelerator that support pruned and sparse weight. Influenced NVIDIA’s Ampere GPU, NVIDIA’s DLA, ARM’s Project Trillium, Samsung’s NPU, and Intel’s NN Distiller. Research Topics TinyML and Efficient Deep Learning Low LatencyLow EnergySmall Model Size Fullstack Automated
  • 5. Research Topics Full Stack Research Algorithm Hardware Edge Cloud Training Inference [Edge | Inference | Algorithm] Hardware-Aware Quantization CVPR 19 Oral [Edge | Inference | Algorithm] Point-Voxel CNN for 3D DL NeurIPS 19 Spotlight [Cloud | Training | Algorithm] Deep Leakage from Gradients NeurIPS 19 [Edge/Cloud | Inference | Algorithm] Once-for-All Network ICLR 20 [Edge | Inference | Algorithm] GAN Compression CVPR 20 [Edge/Cloud | Inference | Hardware] SpArch for Sparse Matrix Multiplication HPCA 20 [Edge | Inference | Algorithm] Temporal Shift Module for Video ICCV 19 [Edge | Inference | Algorithm] HAT: Hardware-Aware Transformer ACL 20 [Edge | Inference | Algorithm] Lite Transformer ICLR 20
  • 6. Evolved Transformer ICML’19, ACL’19 We need Green AI: Solve the Environmental Problem of NAS Ours 52 4 orders of magnitude ACL’20 “Hardware-Aware Transformer” TinyML comes at the cost of BigML (inference) (training/search) Problem:
  • 7. Once-for-all, ICLR’20 Challenge: Efficient Inference on Diverse Hardware Platforms 7 Diverse Hardware Platforms … Once-for-All Network Cloud AI ( FLOPS)1012 Mobile AI ( FLOPS)109 Tiny AI ( FLOPS)106 160K 40K 1600K Design Cost (GPU hours) 11.4k lbs CO2 emission→ 454.4k lbs CO2 emission→ 45.4k lbs CO2 emission→ 1 GPU hour translates to 0.284 lbs CO2 emission according to Strubell, Emma, et al. "Energy and policy considerations for deep learning in NLP." ACL. 2019.
  • 8. Our Solution: Once for All Network Once-for-All Network Get many child nets for free
  • 9. Once-for-all, ICLR’20 OFA: Decouple Training and Search 9 Conventional NAS with meta controller For devices: For search episodes: // meta controller For training iterations: forward-backward(); If good_model: break; For post-search training iterations: forward-backward(); Expensive Expensive => Once-for-All: For OFA training iterations: forward-backward(); For devices: For search episodes: sample from OFA; If good_model: break; directly deploy without training; Expensive training search decouple Light-Weight Light-Weight
  • 10. Once-for-all, ICLR’20 Once-for-All Network: Decouple Model Training and Architecture Design 10 once-for-all network
  • 14. Once-for-all, ICLR’20 Challenge: how to prevent different sub-networks from interfering with each other?
  • 15. Once-for-all, ICLR’20 Solution: Progressive Shrinking • Training a once-for-all network is much more challenging than training a normal neural network, given so many sub-networks to support. • Progressive Shrinking can support more than 10^19 different sub-networks in a single once-for-all network, covering 4 different dimensions: resolution, kernel size, depth, width.
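The count of more than 10^19 sub-networks can be checked with a short calculation. The slide does not spell out the search space, so the sketch below assumes the configuration described in the OFA paper: 5 units, each with depth 2, 3, or 4, and each layer choosing among 3 kernel sizes and 3 width expansion ratios. Input resolution is not included in this count.

```python
# Assumed search space (taken from the OFA paper, not stated on this slide).
NUM_UNITS = 5
DEPTHS = [2, 3, 4]          # layers per unit
KERNEL_SIZES = [3, 5, 7]    # per-layer choices
WIDTH_RATIOS = [3, 4, 6]    # per-layer choices

# Each layer independently picks a kernel size and a width ratio.
per_layer = len(KERNEL_SIZES) * len(WIDTH_RATIOS)

# A unit with d layers has per_layer ** d configurations; sum over depths.
per_unit = sum(per_layer ** d for d in DEPTHS)

# Units are configured independently of each other.
total = per_unit ** NUM_UNITS

print(f"configurations per unit: {per_unit}")
print(f"total sub-networks: {total:.2e}")
```

Under these assumptions the total is on the order of 10^19, which matches the claim on the slide.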
  • 16. Once-for-all, ICLR’20 Solution: Progressive Shrinking. Steps: train the full model → shrink the model (4 dimensions) → jointly fine-tune both large and small sub-networks → once-for-all network. • Small sub-networks are nested in large sub-networks. • Cast the training process of the once-for-all network as a progressive shrinking and joint fine-tuning process.
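One way to picture "small sub-networks are nested in large sub-networks" is along the kernel-size dimension: a smaller kernel can reuse the central weights of a larger one. The sketch below assumes plain center cropping as the nesting rule. That is a simplification for illustration, since the slides do not describe the exact sharing scheme, and `center_crop` is a hypothetical helper.

```python
def center_crop(kernel, size):
    """Return the central size x size window of a square 2D kernel."""
    full = len(kernel)
    start = (full - size) // 2
    return [row[start:start + size] for row in kernel[start:start + size]]


# A 7x7 kernel whose entries record their own (row, col) position,
# so it is easy to see which weights each smaller kernel shares.
kernel_7x7 = [[(r, c) for c in range(7)] for r in range(7)]
kernel_5x5 = center_crop(kernel_7x7, 5)
kernel_3x3 = center_crop(kernel_7x7, 3)

print(kernel_5x5[0][0])  # top-left weight of the 5x5 kernel
print(kernel_3x3[0][0])  # top-left weight of the 3x3 kernel
```

With this rule the 3x3 kernel is also the center of the 5x5 kernel, so every smaller option is contained in every larger one and no extra weights are stored.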
  • 17. Once-for-all, ICLR’20 Connection to Network Pruning. Network Pruning: train the full model → shrink the model (only width) → fine-tune the small net → single pruned network. Progressive Shrinking: train the full model → shrink the model (4 dimensions) → fine-tune both large and small sub-nets → once-for-all network. • Progressive shrinking can be viewed as a generalized network pruning with much higher flexibility across 4 dimensions.
  • 18-41. Once-for-all, ICLR’20 Progressive Shrinking. [Animation across slides 18 to 41: all four dimensions start at Full; Elastic Resolution, Elastic Kernel Size, Elastic Depth, and Elastic Width are then enabled one after another, each moving from Full to Partial.]
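The staged schedule shown in this animation can be sketched as a sampling space that grows by one dimension at a time. The option lists below are assumptions for illustration: the depth, width, and kernel endpoints match the D, W, and K values on the results slide, while the intermediate values and all resolutions are hypothetical.

```python
import random

# Full-size setting for every dimension (values are illustrative).
FULL = {"resolution": [224], "kernel": [7], "depth": [4], "width": [6]}

# Options once a dimension becomes elastic (partly assumed, see lead-in).
ELASTIC = {
    "resolution": [128, 160, 192, 224],
    "kernel": [3, 5, 7],
    "depth": [2, 3, 4],
    "width": [3, 4, 6],
}

# Order in which dimensions are made elastic in the animation.
STAGES = ["resolution", "kernel", "depth", "width"]


def search_space(stage_index):
    """Dimensions before stage_index are elastic; the rest stay full."""
    space = dict(FULL)
    for dim in STAGES[:stage_index]:
        space[dim] = ELASTIC[dim]
    return space


def sample_subnet(space, rng):
    """Pick one option per dimension from the current sampling space."""
    return {dim: rng.choice(options) for dim, options in space.items()}


rng = random.Random(0)
for stage in range(len(STAGES) + 1):
    print(stage, sample_subnet(search_space(stage), rng))
```

At stage 0 only the full network can be sampled; by the last stage every dimension is elastic, so large and small sub-networks are fine-tuned together.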
  • 42. Once-for-all, ICLR’20 Performances of Sub-networks on ImageNet. [Bar chart: ImageNet Top-1 accuracy (%) of sub-networks under various architecture configurations, without progressive shrinking (w/o PS) vs. with it (w/ PS). D: depth, W: width, K: kernel size.] Accuracy gain from PS, in plotted order: D=2 W=3 K=3: 2.5%; D=2 W=3 K=7: 2.8%; D=2 W=6 K=3: 3.5%; D=2 W=6 K=7: 3.4%; D=4 W=3 K=3: 3.3%; D=4 W=3 K=7: 3.4%; D=4 W=6 K=3: 3.7%; D=4 W=6 K=7: 3.5%. • Progressive shrinking consistently improves accuracy of sub-networks on ImageNet.
  • 44. Once-for-all, ICLR’20 How about search? Zero training cost! Training and search are decoupled:
 for OFA training iterations: forward-backward();
 for devices:
   for search episodes:
     sample from OFA; // with evolution or even random
     if good_model: break;
   direct deploy without training;
  • 45. Once-for-all, ICLR’20 How to evaluate if good_model? By Model Twin. The OFA network is used to build an accuracy dataset [Architecture, Accuracy] and a latency dataset [Architecture, Latency]. These train an accuracy prediction model (RMSE ~0.2%) and a latency prediction model. The accuracy/latency predictors drive predictor-based architecture search, which outputs a specialized sub-network.
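The predictor-based search on this slide can be sketched as sampling architectures and scoring them with the two predictors instead of training or measuring them. Everything below is a toy under stated assumptions: `predict_accuracy` and `predict_latency` are hypothetical linear stand-ins rather than the trained predictors, the architecture is reduced to one kernel/depth/width triple, and plain random search is used (the slides mention evolution or random sampling).

```python
import random

KERNELS, DEPTHS, WIDTHS = [3, 5, 7], [2, 3, 4], [3, 4, 6]


def predict_accuracy(arch):
    """Hypothetical stand-in for the accuracy predictor (Top-1, %)."""
    return 60.0 + 1.0 * arch["kernel"] + 2.0 * arch["depth"] + 1.0 * arch["width"]


def predict_latency(arch):
    """Hypothetical stand-in for the latency predictor (ms)."""
    return 2.0 * arch["kernel"] + 5.0 * arch["depth"] + 3.0 * arch["width"]


def random_search(latency_budget_ms, episodes, rng):
    """Keep the most accurate sampled architecture that fits the budget."""
    best = None
    for _ in range(episodes):
        arch = {
            "kernel": rng.choice(KERNELS),
            "depth": rng.choice(DEPTHS),
            "width": rng.choice(WIDTHS),
        }
        if predict_latency(arch) > latency_budget_ms:
            continue  # violates the device's latency constraint
        if best is None or predict_accuracy(arch) > predict_accuracy(best):
            best = arch
    return best


best = random_search(latency_budget_ms=40.0, episodes=200, rng=random.Random(0))
print(best, predict_accuracy(best), predict_latency(best))
```

Because each evaluation is a cheap predictor call, the same loop can be rerun for every target device with a different latency budget.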
  • 46. Once-for-all, ICLR’20 Our latency model is super accurate
  • 47. Once-for-All, ICLR’20 Accuracy & Latency Improvement. [Two plots of Top-1 ImageNet accuracy (%) vs. Google Pixel1 latency (ms). Left: OFA (78.7, 79.8, 80.1) vs. EfficientNet (76.3, 78.8, 79.8); annotations: 2.6x faster, 3.8% higher accuracy. Right: OFA (76.4, 74.9, 73.3, 71.4) vs. MobileNetV3 (75.2, 73.3, 70.4, 67.4); annotations: 4% higher accuracy, 1.5x faster.] • Training from scratch cannot achieve the same level of accuracy
  • 48. Once-for-All, ICLR’20 More accurate than training from scratch. [Same two plots as the previous slide, Top-1 ImageNet accuracy (%) vs. Google Pixel1 latency (ms), with an added "OFA - Train from scratch" series in each panel. Left: OFA vs. EfficientNet; annotations: 2.6x faster, 3.8% higher accuracy. Right: OFA vs. MobileNetV3; annotations: 4% higher accuracy, 1.5x faster.] • Training from scratch cannot achieve the same level of accuracy
  • 49. Once-for-All, ICLR’20 OFA: 80% Top-1 Accuracy on ImageNet. [Scatter plot: ImageNet Top-1 accuracy (%, the higher the better) vs. MACs (billion, the lower the better); marker size shows model size (2M to 64M); handcrafted and AutoML models are marked separately. Models shown: Once-for-All (ours), EfficientNet, ProxylessNAS, MBNetV3, AmoebaNet, MBNetV2, PNASNet, ShuffleNet, DARTS, IGCV3-D, MobileNetV1 (MBNetV1), NASNet-A, InceptionV2, DenseNet-121, DenseNet-169, ResNet-50, ResNeXt-50, InceptionV3, DenseNet-264, DPN-92, ResNet-101, Xception, ResNeXt-101. Annotations: 595M MACs, 80.0% Top-1, 14x less computation.] • Once-for-all sets a new state-of-the-art 80% ImageNet top-1 accuracy under the mobile vision setting (< 600M MACs).
  • 50. Once-for-all, ICLR’20 OFA Enables Fast Specialization on Diverse Hardware Platforms. [Six plots of Top-1 ImageNet accuracy (%) vs. latency (ms); the legend lists OFA, MobileNetV3, and MobileNetV2. Plotted values, in the order they appear per panel:
 Samsung S7 Edge: 75.2, 73.3, 70.4, 67.4 and 70.5, 73.1, 74.7, 76.3
 Google Pixel2: 75.2, 73.3, 70.4, 67.4 and 75.8, 74.7, 73.4, 71.5
 LG G8: 75.2, 73.3, 70.4, 67.4 and 76.4, 74.7, 73.0, 71.1
 NVIDIA 1080Ti (batch size = 64): 60.3, 65.4, 69.8, 72.0 and 72.6, 73.8, 75.3, 76.4
 Intel Xeon CPU (batch size = 1): 60.3, 65.4, 69.8, 72.0 and 71.1, 74.6, 75.7, 72.0
 Xilinx ZU3EG FPGA (batch size = 1, quantized): 59.1, 63.3, 69.0, 71.5 and 67.0, 69.6, 72.8, 73.7]
  • 51. OFA’s Application: Low Power Computer Vision • First place in the 3rd Low Power Computer Vision Challenge, DSP track at ICCV’19 • First place in the 4th Low Power Computer Vision Challenge, both classification and detection track. Our result: OFA network → specialized sub-network → deploy on Qualcomm SnapDragon 855 Hexagon 690 DSP with latency < 7ms. Latency: 5.15ms, Top1: 78.8%
  • 52. Once-for-All, ICLR’20 OFA for FPGA: measured results on FPGA. [Bar charts: arithmetic intensity (OPS/Byte) and ZU3EG FPGA throughput (GOPS/s) for MobileNetV2, MnasNet, and OFA (Ours); annotations: 40% higher, 57% higher.] Specialized NN architecture on specialized hardware architecture
  • 53. Once-for-All, ICLR’20 Specialized Architecture for Different Hardware Platforms
  • 54. Tutorial on ProxylessNAS & OFA ● IPython Notebook tutorial.
 ● Architecture search with 1 GPU in 2 minutes.
 ● Hands-on lab at 3:45pm PT today, office hour 6:00pm.
  • 55. Once-for-All Network (OFA) has broad applications • Efficient Video Recognition • Efficient 3D Vision • Efficient GAN Compression
  • 56. OFA’s Application: Efficient Video Recognition (follow-up of TSM, ICCV’19). [Plot: Kinetics Top-1 accuracy (%) vs. computation (GFLOPs) for OFA + TSM (large), OFA + TSM (small), MobileNetV2 + TSM, ResNet50 + TSM, and ResNet50 + I3D.] • 7x less computation, same performance as TSM+ResNet50 • Same computation, 3% higher accuracy than TSM+MobileNet-v2
  • 57. Latency Comparison Batch size=1. Measured on NVIDIA Tesla P100. Each row represents a video. I3D: Latency: 164.3 ms/Video Something-V1 Acc.: 41.6% TSM: Latency: 17.4 ms/Video Something-V1 Acc.: 43.4% Speed-up: 9x
  • 58. Throughput Comparison Batch size=16. Measured on NVIDIA Tesla P100. Each square represents a video. I3D: Throughput: 6.1 video/s Something-V1 Acc.: 41.6% TSM: Throughput: 77.4 video/s Something-V1 Acc.: 43.4% 12.7x larger throughput
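The speed-up figures on the latency and throughput slides follow directly from the measurements quoted there. A small sketch of the arithmetic (the dictionary keys are illustrative names, not from the slides):

```python
# Measurements quoted on the two comparison slides (NVIDIA Tesla P100).
i3d = {"latency_ms": 164.3, "throughput_videos_per_s": 6.1}
tsm = {"latency_ms": 17.4, "throughput_videos_per_s": 77.4}

# Latency speed-up at batch size 1: how many times faster TSM is per video.
latency_speedup = i3d["latency_ms"] / tsm["latency_ms"]

# Throughput gain at batch size 16: how many times more videos per second.
throughput_gain = (
    tsm["throughput_videos_per_s"] / i3d["throughput_videos_per_s"]
)

print(f"latency speed-up: {latency_speedup:.1f}x")
print(f"throughput gain: {throughput_gain:.1f}x")
```

The latency ratio comes out slightly above the "9x" on the slide, which appears to be rounded down; the throughput ratio rounds to the quoted 12.7x.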
  • 59. Improving the Robustness of Online Video Detection
  • 61. Scaling Up: Large-Scale Distributed Training with Summit Super Computer SUMMIT Super Computer: • CPU: 2 x 16 Core IBM POWER9 (connected via dual NVLINK bricks, 25GB/s each side) • GPU: 6 x NVIDIA Tesla V100 • RAM: 512 GB DDR4 memory • Data Storage: HDD • Connection: Dual-rail EDR InfiniBand network of 23 GB/s Acknowledgment: IBM and Oak Ridge National Lab * Lin et al., Training Kinetics in 15 Minutes: Large-scale Distributed Training on Videos, arXiv 1811.08383
  • 62. Scaling Up: Large-Scale Distributed Training with SUMMIT Super Computer
 ● We are able to speed up the training by 200x, from 2 days to 14 minutes.
 ● Model setup: 8-frame ResNet-50 for video recognition
 ● Dataset: Kinetics (240k training videos) x 100 epochs
 Results (training time; accuracy; peak GPU performance; speed-up):
 1 SUMMIT node (6 GPUs): 49h 50min; 74.1%; 46.5 TFLOP/s; baseline
 128 SUMMIT nodes (768 GPUs): 28min; 74.1%; 5,989 TFLOP/s; theoretical 128x, actual 106x
 256 SUMMIT nodes (1536 GPUs): 14min; 74.0%; 11,978 TFLOP/s; theoretical 256x, actual 211x
 [Bar chart: training time (h), 1 SUMMIT node vs. 128 SUMMIT nodes, 106x.]
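The speed-up and scaling efficiency can be recomputed from the training times quoted on this slide. This sketch uses the times as printed, which are rounded to whole minutes, so the results are approximate and differ slightly from the "actual" figures on the slide (106x and 211x).

```python
# Training times from the slide, converted to minutes.
baseline_minutes = 49 * 60 + 50          # 1 SUMMIT node: 49h 50min
runs = {128: 28, 256: 14}                # nodes -> training time in minutes

speedups = {}
efficiencies = {}
for nodes, minutes in runs.items():
    speedups[nodes] = baseline_minutes / minutes      # measured speed-up
    efficiencies[nodes] = speedups[nodes] / nodes     # fraction of ideal

for nodes in runs:
    print(f"{nodes} nodes: {speedups[nodes]:.1f}x speed-up, "
          f"{efficiencies[nodes]:.0%} of linear scaling")
```

Both configurations land at roughly the same fraction of ideal linear scaling, which suggests that doubling from 128 to 256 nodes costs little extra efficiency.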
  • 63. GAN Compression, CVPR’20 OFA’s Application: GAN Compression 8-21x FLOPs reduction on CycleGAN, Pix2pix, GauGAN 1.7x-18.5x speedup on CPU/GPU & Mobile CPU/GPU
  • 64. SPVNAS, ECCV’20 OFA’s Application: Efficient 3D Recognition. Motivation: self-driving cars carry a whole trunk of GPUs; AR/VR needs a whole backpack of computers. Accuracy vs. latency tradeoff: 4x FLOPs reduction and 2x speedup over MinkowskiNet; 3.6% better accuracy under the same computation budget.
  • 65. SPVNAS, ECCV’20 DarkNet53Seg: Mean IoU 49.9; throughput 9.7 FPS; 50.4M params; 376.3G FLOPs. SPVNAS (Ours): Mean IoU 58.8 (= KPConv); throughput 11.8 FPS; 1.1M params; 10.6G FLOPs. SPVNAS makes fewer errors (in red) than the 2D baseline model, with 45x model size reduction and 35x computation reduction.
  • 66. SPVNAS, ECCV’20 Significantly Faster than MinkowskiNets. MinkowskiNet: Mean IoU 63.1; throughput 3.4 FPS (21.7M params, 114.0G FLOPs). SPVNAS (Ours): Mean IoU 63.6; throughput 6.5 FPS (7.6M params, 30.0G FLOPs). SPVNAS outperforms the state-of-the-art MinkowskiNet (with 2x measured speedup and 3x model size reduction).
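The reduction factors quoted for SPVNAS can be reproduced from the per-model numbers on the two comparison slides. In this sketch the assignment of the first set of numbers on each slide to the baseline and the second to SPVNAS is an assumption based on the order of the labels, and the `ratios` helper is illustrative only.

```python
# Numbers from the slides: parameters (M), FLOPs (G), throughput (FPS).
darknet53seg = {"params_m": 50.4, "flops_g": 376.3, "fps": 9.7}
spvnas_small = {"params_m": 1.1, "flops_g": 10.6, "fps": 11.8}

minkowskinet = {"params_m": 21.7, "flops_g": 114.0, "fps": 3.4}
spvnas_large = {"params_m": 7.6, "flops_g": 30.0, "fps": 6.5}


def ratios(baseline, ours):
    """Size and compute reduction factors plus the throughput speed-up."""
    return {
        "size_reduction": baseline["params_m"] / ours["params_m"],
        "flops_reduction": baseline["flops_g"] / ours["flops_g"],
        "speedup": ours["fps"] / baseline["fps"],
    }


vs_darknet = ratios(darknet53seg, spvnas_small)
vs_minkowski = ratios(minkowskinet, spvnas_large)

for name, result in [("DarkNet53Seg", vs_darknet), ("MinkowskiNet", vs_minkowski)]:
    print(name, {key: round(value, 1) for key, value in result.items()})
```

The computed ratios agree with the rounded claims on the slides: about 45x smaller and 35x less computation than DarkNet53Seg, and about 3x smaller with roughly 2x the throughput of MinkowskiNet.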
  • 67. Qualitative Results on SemanticKITTI Error By MinkowskiNets Less Error By SPVNAS Ground Truth SPVNAS, ECCV’20
  • 68. Qualitative Results on KITTI Detection By SECOND More Accurate Detection By SPVNAS Ground Truth SPVNAS, ECCV’20
  • 69. Make AI Efficient, with Tiny Resources (computational and human): hardware-aware AutoML, push-button solution.
 Works: ProxylessNAS, ICLR’19; HAQ, CVPR’19, oral; AMC, ECCV’18; Once-for-All, ICLR’20; Neural-Hardware Architecture Search, NeurIPS workshop’19; SPVNAS, ECCV’20
 Awards: 1st place, Low Power Computer Vision Challenge’19; 1st place, Low Power Computer Vision Challenge’20; 1st place, Visual Wake Words Challenge@CVPR’19
  • 70. AutoML: Design Automation for AI [ECCV’18, ICLR’19, CVPR’19, ICLR’20, CVPR’20] - We developed two generations of AutoML techniques for efficient NN design (ProxylessNAS, OFA) - Such AI-designed AI consistently outperforms human-designed models: - First place in Low Power Computer Vision Challenges (2019, 2020). - First place in Visual Wake Words Challenge 2019. AI for Design Automation [DAC’20] - AI is Revolutionizing EDA: fast, hw, data-driven - “GCN-RL Circuit Designer: Transferable Transistor Sizing with Graph Neural Networks and Reinforcement Learning”, DAC’20 - Circuit is a graph; GCN feature extractor. - Transfer ability between technology nodes & topologies
  • 72. Summary: Once-for-All Network • We introduce once-for-all network for efficient inference on diverse hardware platforms. • We present an effective progressive shrinking approach for training once-for-all networks: train the full model → shrink the model in 4 dimensions → fine-tune both large and small sub-nets → once-for-all network. • Once-for-all network surpasses MobileNetV3 and EfficientNet by a large margin under all scenarios, setting a new state-of-the-art 80% ImageNet Top-1 accuracy under the mobile setting (< 600M MACs). • First place in the 3rd Low-Power Computer Vision Challenge, DSP track @ICCV’19 • First place in the 4th Low-Power Computer Vision Challenge @NeurIPS’19, both classification & detection. • Released the training code & pre-trained OFA network that provides diverse sub-networks without training: ofa_network = ofa_net(net_id, pretrained=True) • Released 50+ different pre-trained OFA models on diverse hardware platforms (CPU/GPU/FPGA/DSP): net, image_size = ofa_specialized(net_id, pretrained=True) Project Page: https://ofa.mit.edu
  • 73. Simplify. Less Engineer Resources: AutoML (many engineers → fewer engineers). Less Computational Resources: TinyML (large model, a lot of computation → small model, less computation).
  • 74. The Future of AI is “Tiny”. Vast Applications: Smart Retail, Personalized Healthcare, Smart Manufacturing, Precision Agriculture, Smart Home, Autonomous Driving
  • 75. Hardware, AI and Neural-nets. TinyML and Efficient AI. Media: songhan.mit.edu youtube.com/c/MITHANLab github.com/mit-han-lab