MIT 6.5940: TinyML and Efficient Deep Learning Computing (https://efficientml.ai)
EfficientML.ai Lecture 07
Neural Architecture Search
Part I
Song Han
Associate Professor, MIT
Distinguished Scientist, NVIDIA
@SongHan_MIT
Lecture Plan
Today we will:
1. Review primitive operations of deep neural networks.
2. Introduce popular building blocks.
3. Introduce Neural Architecture Search (NAS), an automatic technique for designing neural network architectures.
Trade-off Between Efficiency and Accuracy
[Figure: neural network design must balance accuracy against efficiency, i.e., storage, latency, and energy.]
Neural Architecture Search
• Primitive operations
• Classic building blocks
• Introduction to neural architecture search (NAS)
• What is NAS?
• Search space
• Design the search space
• Search strategy
• Efficient and Hardware-aware NAS
• Performance estimation strategy
• Hardware-aware NAS
• Zero-shot NAS
• Neural-hardware architecture co-search
• NAS applications
• NLP, GAN, point cloud, pose
Recap: Primitive Operations
Fully-connected layer / linear layer (multilayer perceptron, MLP)
• Notation: n = batch size, ci = input channels, co = output channels.
• Shape of tensors:
• Input features:  X : (n, ci)
• Output features: Y : (n, co)
• Weights:         W : (co, ci)
• Bias:            b : (co,)
[Figure: an MLP layer computes Y = X · W^T + b, mapping an (n, ci) input to an (n, co) output.]
Recap: Primitive Operations
Fully-connected layer / linear layer

Layer          MACs (batch size n = 1; bias is ignored)
Linear layer   co · ci
Recap: Primitive Operations
Convolution layer
• Notation: n = batch size, ci / co = input / output channels, hi / ho = input / output height, wi / wo = input / output width, kh / kw = kernel height / width.
• Shape of tensors:
• 2D convolution: input X : (n, ci, hi, wi), output Y : (n, co, ho, wo), weights W : (co, ci, kh, kw), bias b : (co,)
• 1D convolution: input X : (n, ci, wi), output Y : (n, co, wo), weights W : (co, ci, kw)
[Figure: a convolution slides a kh x kw kernel over the spatial dimension(s); each output channel combines all ci input channels.]
Recap: Primitive Operations
Convolution layer

Layer          MACs (batch size n = 1; bias is ignored)
Linear layer   co · ci
Convolution    co · ci · kh · kw · ho · wo
Recap: Primitive Operations
Grouped convolution layer
• Notation: g = number of groups, s = stride, p = padding.
• Shape of tensors:
• Input features:  X : (n, ci, hi, wi)
• Output features: Y : (n, co, ho, wo)
• Weights:         W : (g · co/g, ci/g, kh, kw), i.e., each group convolves ci/g input channels into co/g output channels
• Bias:            b : (co,)
• Output spatial size: ho = (hi + 2p − kh) / s + 1 (and analogously for wo).
[Figure: with g = 2, the channels are split into two groups that are convolved independently; g = 1 recovers a standard convolution.]
Recap: Primitive Operations
Grouped convolution layer

Layer                 MACs (batch size n = 1; bias is ignored)
Linear layer          co · ci
Convolution           co · ci · kh · kw · ho · wo
Grouped convolution   co · ci · kh · kw · ho · wo / g
Recap: Primitive Operations
Depthwise convolution layer
• Shape of tensors (s = stride, p = padding):
• Input features:  X : (n, ci, hi, wi)
• Output features: Y : (n, co, ho, wo), with co = ci
• Weights:         W : (c, kh, kw), where c = ci = co, i.e., one kh x kw filter per channel
• Bias:            b : (co,)
• Output spatial size: ho = (hi + 2p − kh) / s + 1 (and analogously for wo).
[Figure: each input channel is convolved with its own kernel, producing exactly one output channel.]
Recap: Primitive Operations
Depthwise convolution layer

Layer                   MACs (batch size n = 1; bias is ignored)
Linear layer            co · ci
Convolution             co · ci · kh · kw · ho · wo
Grouped convolution     co · ci · kh · kw · ho · wo / g
Depthwise convolution   co · kh · kw · ho · wo
Recap: Primitive Operations
1x1 convolution layer

Layer                   MACs (batch size n = 1; bias is ignored)
Linear layer            co · ci
Convolution             co · ci · kh · kw · ho · wo
Grouped convolution     co · ci · kh · kw · ho · wo / g
Depthwise convolution   co · kh · kw · ho · wo
1x1 convolution         co · ci · ho · wo
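To make the table concrete, here is a minimal helper (not from the slides) that reproduces these MAC formulas; the example shapes are arbitrary:

```python
def conv2d_macs(ci, co, kh, kw, ho, wo, groups=1):
    """MACs of a 2D convolution for batch size n = 1, bias ignored."""
    return co * (ci // groups) * kh * kw * ho * wo

def linear_macs(ci, co):
    return co * ci

# Example numbers: ci = co = 64 channels, 56x56 output, 3x3 kernel.
ci = co = 64; ho = wo = 56; kh = kw = 3
print(linear_macs(ci, co))                             # linear layer: co * ci
print(conv2d_macs(ci, co, kh, kw, ho, wo))             # standard convolution
print(conv2d_macs(ci, co, kh, kw, ho, wo, groups=2))   # grouped convolution (g = 2)
print(conv2d_macs(ci, co, kh, kw, ho, wo, groups=ci))  # depthwise convolution (g = ci = co)
print(conv2d_macs(ci, co, 1, 1, ho, wo))               # 1x1 convolution
```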
Classic Building Blocks
ResNet50: bottleneck block
Deep Residual Learning for Image Recognition [He et al., CVPR 2016]
• Reduce the number of channels by 4x via a 1x1 convolution.
• Feed the reduced feature map to a 3x3 convolution.
• Expand the number of channels back via a 1x1 convolution.

#MACs (2048 input/output channels, 512 bottleneck channels, H x W feature map):
• 1x1 reduce:  2048 x 512 x H x W x 1
• 3x3 conv:    512 x 512 x H x W x 9
• 1x1 expand:  2048 x 512 x H x W x 1
• Total:       512 x 512 x H x W x 17
• A plain 3x3 convolution at 2048 channels would cost 2048 x 2048 x H x W x 9 = 512 x 512 x H x W x 144, so the bottleneck gives an 8.5x reduction.

#Params follow the same ratio (drop the H x W factor):
• 512 x 512 x 17 for the bottleneck vs. 2048 x 2048 x 9 = 512 x 512 x 144 for a plain 3x3 convolution: an 8.5x reduction.
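For reference, a minimal PyTorch-style sketch of such a bottleneck block; the 2048/512 channel counts follow the MAC example above, while layer names and normalization details are illustrative rather than the exact ResNet-50 code:

```python
import torch.nn as nn

class Bottleneck(nn.Module):
    """Sketch of a ResNet-style bottleneck: 1x1 reduce -> 3x3 -> 1x1 expand, plus a residual connection."""
    def __init__(self, channels=2048, reduction=4):
        super().__init__()
        mid = channels // reduction                          # 2048 // 4 = 512
        self.reduce = nn.Conv2d(channels, mid, kernel_size=1, bias=False)
        self.conv3x3 = nn.Conv2d(mid, mid, kernel_size=3, padding=1, bias=False)
        self.expand = nn.Conv2d(mid, channels, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(mid)
        self.bn2 = nn.BatchNorm2d(mid)
        self.bn3 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.reduce(x)))
        out = self.relu(self.bn2(self.conv3x3(out)))
        out = self.bn3(self.expand(out))
        return self.relu(out + x)                            # residual connection
```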
Classic Building Blocks
ResNeXt: grouped convolution
Aggregated Residual Transformations for Deep Neural Networks [Xie et al., CVPR 2017]
• Replace the 3x3 convolution with a 3x3 grouped convolution.
• This is equivalent to a multi-path block: the groups form parallel branches whose outputs are aggregated.
Classic Building Blocks
MobileNet: depthwise-separable block
MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications [Howard et al., arXiv 2017]
• Depthwise convolution is an extreme case of grouped convolution where the group number equals the number of input channels.
• Use depthwise convolution to capture spatial information.
• Use 1x1 convolution to fuse/exchange information across different channels.
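A minimal sketch of a depthwise-separable block as described above (assuming PyTorch; names and normalization details are illustrative):

```python
import torch.nn as nn

def depthwise_separable(ci, co, stride=1):
    """Depthwise 3x3 (spatial mixing) followed by pointwise 1x1 (channel mixing)."""
    return nn.Sequential(
        nn.Conv2d(ci, ci, kernel_size=3, stride=stride, padding=1, groups=ci, bias=False),  # depthwise: groups = ci
        nn.BatchNorm2d(ci),
        nn.ReLU(inplace=True),
        nn.Conv2d(ci, co, kernel_size=1, bias=False),                                       # pointwise: fuse channels
        nn.BatchNorm2d(co),
        nn.ReLU(inplace=True),
    )
```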
Classic Building Blocks
MobileNetV2: inverted bottleneck block
MobileNetV2: Inverted Residuals and Linear Bottlenecks [Sandler et al., CVPR 2018]
• Depthwise convolution has a much lower capacity compared to normal convolution.
• Increase the depthwise convolution's input and output channels to improve its capacity.
• Depthwise convolution's cost only grows linearly. Therefore, the cost is still affordable.
[Figure: a 1x1 conv expands N feature channels to 6N, a 3x3 depthwise conv operates at 6N channels, and a 1x1 conv projects back to N.]
Classic Building Blocks
MobileNetV2: inverted bottleneck block
MobileNetV2: Inverted Residuals and Linear Bottlenecks [Sandler et al., CVPR 2018]
#MACs (N = 160, H x W feature map):
• 1x1 expand (160 -> 960):       160 x 960 x H x W x 1
• 3x3 depthwise (960 channels):  960 x H x W x 9
• 1x1 project (960 -> 160):      160 x 960 x H x W x 1
• Total:                         960 x H x W x 329
• A plain 3x3 convolution at N = 160 channels costs 160 x 160 x H x W x 9 = 960 x H x W x 240, i.e., a ratio of 1 : 1.37; the inverted bottleneck is only modestly more expensive while operating at 6x more channels.
#Params follow the same ratio (drop the H x W factor): 960 x 329 vs. 960 x 240, again 1 : 1.37.
Classic Building Blocks
MobileNetV2: inverted bottleneck block
MobileNetV2: Inverted Residuals and Linear Bottlenecks [Sandler et al., CVPR 2018]
• Depthwise convolution has a much lower capacity compared to normal convolution.
• Increase the depthwise convolution's input and output channels to improve its capacity.
• Depthwise convolution's cost only grows linearly. Therefore, the cost is still affordable.
• However, this design is not memory-efficient for both inference and training.
[Figure, inference (INT8 parameters and activations, batch size 1): MobileNetV2-0.75 has 4.6x fewer parameters than ResNet-18 but a 1.8x larger peak activation.]
[Figure, training (FP32 parameters and activations, batch size 16): MobileNetV2-1.4x has 4.3x fewer parameters than ResNet-50 (24 MB vs. 102 MB) but 1.1x larger activation memory (707 MB vs. 626 MB).]
Classic Building Blocks
ShuffleNet: 1x1 group convolution & channel shuffle
ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices [Zhang et al., CVPR 2018]
• Further reduce the cost by replacing the 1x1 convolution with a 1x1 group convolution.
• Exchange information across different groups via channel shuffle.
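The channel shuffle itself is just a reshape and transpose; a minimal sketch (assuming PyTorch):

```python
import torch

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Interleave channels across groups so the next grouped conv sees all groups."""
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w)   # split channels into groups
    x = x.transpose(1, 2).contiguous()         # swap the group and per-group channel axes
    return x.view(n, c, h, w)                  # flatten back

# Example usage (hypothetical shapes):
x = torch.randn(1, 8, 4, 4)
y = channel_shuffle(x, groups=2)
```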
Classic Building Blocks
Transformer: Multi-Head Self-Attention (MHSA)
Attention Is All You Need [Vaswani et al., NeurIPS 2017]
• Project Q, K and V with h different, learned linear projections.
• Perform the scaled dot-product attention function on each of these projected versions of Q, K and V in parallel.
• Concatenate the output values.
• Project the concatenated values again, resulting in the final values.
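A minimal sketch of multi-head self-attention following these four steps (assuming PyTorch; the embedding dimension and head count are illustrative):

```python
import torch, torch.nn as nn

class MHSA(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.heads, self.dk = heads, dim // heads
        self.q, self.k, self.v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x):                                   # x: (batch, seq, dim)
        b, t, d = x.shape
        def split(z):                                       # (b, t, d) -> (b, heads, t, dk)
            return z.view(b, t, self.heads, self.dk).transpose(1, 2)
        q, k, v = split(self.q(x)), split(self.k(x)), split(self.v(x))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.dk ** 0.5, dim=-1)  # scaled dot-product
        y = attn @ v                                        # (b, heads, t, dk)
        y = y.transpose(1, 2).reshape(b, t, d)              # concatenate the heads
        return self.out(y)                                  # final output projection
```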
From Manual Design to Automatic Design
Huge design space, manual design is unscalable
• Design dimensions: # channels, # layers, kernel size, resolution, connectivity.
Manually-Designed Neural Networks
Accuracy-efficiency trade-off on ImageNet
Model Compression and Hardware Acceleration for Neural Networks: A Comprehensive Survey [Deng et al., IEEE 2020]
[Figure: ImageNet top-1 accuracy (69-81%) vs. MACs (0-9 billion) and model size (2M-64M parameters) for handcrafted networks: MobileNetV1/V2, ShuffleNet, IGCV3-D, InceptionV2/V3, DenseNet-121/169/264, ResNet-50/101, ResNeXt-50/101, DPN-92, Xception.]
From Manual Design to Automatic Design
Huge design space, manual design is unscalable
Once-for-All: Train One Network and Specialize it for Efficient Deployment [Cai et al., ICLR 2020]
[Figure: the same ImageNet accuracy-vs-MACs plot, now including AutoML-designed models (NASNet-A, DARTS, PNASNet, AmoebaNet, ProxylessNAS, MBNetV3, EfficientNet, Once-for-All) alongside the handcrafted ones; the automatically designed models achieve better accuracy-efficiency trade-offs.]
Illustration of NAS
Components and goal
Neural Architecture Search: A Survey [Elsken et al., JMLR 2019]
• The goal of NAS is to find the best neural network architecture in the search space, maximizing the objective of interest (e.g., accuracy, efficiency, etc.).

Search Space
Neural Architecture Search: A Survey [Elsken et al., JMLR 2019]
• The search space is a set of candidate neural network architectures.
• Cell-level
• Network-level
Search Space
Cell-level search space
Learning Transferable Architectures for Scalable Image Recognition [Zoph et al., CVPR 2018]
[Figure: the best convolutional cells found by NASNet-A (B = 5 blocks, identified on CIFAR-10). The overall architecture stacks Normal Cells and Reduction Cells, with the number of repeated Normal Cells N varied per experiment. Each cell takes the two previous hidden states (or the input image) as input; each block applies two primitive operations (e.g., separable 3x3/5x5/7x7 convolutions, average/max pooling, identity) and combines them with add; the cell output is the concatenation of all resulting branches.]
Search Space
Cell-level search space
Learning Transferable Architectures for Scalable Image Recognition [Zoph et al., CVPR 2018]
[Figure: the controller RNN recursively constructs one block of a convolutional cell. Each block requires 5 discrete choices, each predicted by a softmax layer, so a cell of B blocks needs 5B softmax layers; in the experiments, B = 5. Example constructed block: add(3x3 conv on hidden layer A, 2x2 maxpool on hidden layer B).]
• Left: an RNN controller generates the candidate cells by:
• finding two inputs
• selecting two input transformation operations (e.g., convolution / pooling / identity)
• selecting the method to combine the results.
• Right: a cell generated after one step.
Search Space
Cell-level search space
Learning Transferable Architectures for Scalable Image Recognition [Zoph et al., CVPR 2018]
• Question: Assuming that we have two candidate inputs, M candidate operations to transform the inputs, and N potential operations to combine hidden states, what is the size of the search space in NASNet if we have B layers?
• Hint: Consider it step by step!
Search Space
Cell-level search space
Learning Transferable Architectures for Scalable Image Recognition [Zoph et al., CVPR 2018]
• Question: Assuming that we have two candidate inputs, M candidate operations to transform the inputs, and N potential operations to combine hidden states, what is the size of the search space in NASNet if we have B layers?
• Answer: each block makes five independent choices: two inputs (2 options each), two transformation operations (M options each), and one combination method (N options). The total is therefore (2 x 2 x M x M x N)^B = 4^B · M^(2B) · N^B.
• Assuming M = 5, N = 2, B = 5, we have 4^5 · 5^10 · 2^5 ≈ 3.2 x 10^11 candidates in the design space.
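A quick numeric check of that count:

```python
# Search-space size for a NASNet-style cell: 5 choices per block, B blocks.
M, N, B = 5, 2, 5                     # transformation ops, combination methods, blocks
per_block = 2 * 2 * M * M * N         # two input choices, two op choices, one combination choice
print(per_block ** B)                 # 320000000000, i.e., about 3.2e11
```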
Search Space
Network-level search space: depth dimension
Once-for-All: Train One Network and Specialize it for Efficient Deployment [Cai et al., ICLR 2020]
[Figure: example backbones split into an input stem, four stages of repeated blocks (bottleneck blocks, MBConv blocks, or attention blocks), and a prediction head. The depth dimension searches how many blocks each stage repeats, with per-stage depth choices such as [1,2,3], [2,3,4], [3,5,7,9], [1,2,3].]
Search Space
Network-level search space: resolution dimension
Once-for-All: Train One Network and Specialize it for Efficient Deployment [Cai et al., ICLR 2020]
[Figure: the same backbone, with the input resolution searched over candidates such as (3,128,128); (3,160,160); (3,192,192); (3,224,224); (3,256,256).]
Search Space
Network-level search space: width dimension
Once-for-All: Train One Network and Specialize it for Efficient Deployment [Cai et al., ICLR 2020]
[Figure: the same backbone, with each stage's channel width searched over candidates such as [48,64,96], [192,256,384], [384,512,768], [640,1024,1600], [1280,2048,3200].]
Search Space
Network-level search space: kernel size dimension
ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware [Cai et al., ICLR 2019]
• For models that use depthwise convolution, we can choose the kernel size (e.g., 3x3, 5x5, 7x7) for each depthwise convolution.
[Figure: (1) the efficient mobile architecture and (2) the efficient CPU architecture found by ProxylessNAS; each is a stack of MBConv blocks (MB1/MB3/MB4/MB6) whose depthwise kernel sizes vary per layer between 3x3, 5x5, and 7x7.]
Search Space
Network-level search space: topology connection
Auto-DeepLab: Hierarchical Neural Architecture Search for Semantic Image Segmentation [Liu et al., CVPR 2019]
• Left: the network-level search space in Auto-DeepLab, a grid of candidate layers over downsampling rates (4, 8, 16, 32) and network depth (L = 12 layers). Each path along the blue nodes corresponds to a network architecture in the search space.
• Right: representative manual designs can be represented using this formulation, e.g., DeepLabv3, U-Net-style Conv-Deconv, Stacked Hourglass.
Design the Search Space
Design the search space for TinyML
MCUNet: Tiny Deep Learning on IoT Devices [Lin et al., NeurIPS 2020]
• Search space design is crucial for NAS performance.
• TinyNAS: (1) automated search space optimization, then (2) resource-constrained model specialization within the optimized network space.
Design the Search Space
Design the search space for TinyML: memory is important for TinyML
MCUNet: Tiny Deep Learning on IoT Devices [Lin et al., NeurIPS 2020]
• Different from mobile AI: in addition to the latency and energy constraints, TinyML adds a tight memory constraint.
[Figure: memory capacity (GB) of an NVIDIA V100 GPU vs. an iPhone 11 vs. an STM32F746 microcontroller; the microcontroller has orders of magnitude less memory.]
Design the Search Space
Design the search space for TinyML
MCUNet: Tiny Deep Learning on IoT Devices [Lin et al., NeurIPS 2020]
• Analyze the FLOPs distribution of the models that satisfy the resource constraints in each candidate search space: larger FLOPs -> larger model capacity -> more likely to give higher accuracy, so prefer the search space whose satisfying models have larger FLOPs.
• A better search space leads to better final accuracy.
• Discussion: what are the advantages / disadvantages of these two methods (RegNet, MCUNet) for search space design?
Illustration of NAS
Search strategy
Neural Architecture Search: A Survey [Elsken et al., JMLR 2019]
• Search space is a set of candidate neural network architectures (cell-level, network-level).
• Search strategy defines how to explore the search space:
• Grid search
• Random search
• Reinforcement learning
• Gradient descent
• Evolutionary search
Search Strategy
Grid search
• Grid search is the traditional way of hyper-parameter optimization. The entire design space is represented as the Cartesian product of single-dimension design spaces (e.g., resolution in [1.0x, 1.1x, 1.2x], width in [1.0x, 1.1x, 1.2x]).
• To obtain the accuracy of each candidate network, we train them from scratch.

Accuracy of the candidate networks over the width x resolution grid (some cells satisfy the latency constraint, others break it):
          1.0x     1.1x     1.2x
1.0x      50.0%    53.0%    54.9%
1.1x      51.0%    53.5%    55.4%
1.2x      52.0%    54.1%    56.2%
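A minimal sketch of such a constrained grid search; train_and_eval and measure_latency are hypothetical stand-ins for the expensive steps:

```python
import itertools

# Hypothetical stand-ins for the expensive steps (training and latency measurement).
def train_and_eval(width, resolution):      # returns validation accuracy
    return 0.50 + 0.2 * (width - 1.0) + 0.3 * (resolution - 1.0)   # toy model of the table above

def measure_latency(width, resolution):     # returns latency in ms
    return 10.0 * width * resolution ** 2

LATENCY_CONSTRAINT_MS = 13.0

best = None
for w, r in itertools.product([1.0, 1.1, 1.2], [1.0, 1.1, 1.2]):   # Cartesian product of the axes
    if measure_latency(w, r) > LATENCY_CONSTRAINT_MS:              # skip candidates that break the constraint
        continue
    acc = train_and_eval(w, r)                                     # in reality: train each candidate from scratch
    if best is None or acc > best[0]:
        best = (acc, w, r)
print(best)
```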
Search Strategy
Grid search
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks [Tan and Le, ICML 2019]
• EfficientNet applies compound scaling of depth, width and resolution to a starting network. It performs a grid search over α, β, γ such that the total FLOPs of the new model will be 2x that of the original network.
• Compound scaling with coefficient φ:
  depth d = α^φ, width w = β^φ, resolution r = γ^φ, subject to α · β² · γ² ≈ 2 and α ≥ 1, β ≥ 1, γ ≥ 1.
• α, β, γ are constants determined by a small grid search; φ is a user-specified coefficient that controls how many more resources are available for model scaling. The FLOPs of a regular convolution are proportional to d, w², r², so scaling with coefficient φ multiplies FLOPs by roughly (α · β² · γ²)^φ ≈ 2^φ.
[Figure: model scaling. (a) baseline network; (b)-(d) conventional scaling of width, depth, or resolution alone; (e) compound scaling of all three dimensions with a fixed ratio.]
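A toy sketch of the small grid search over (α, β, γ) under the FLOPs constraint; evaluate_scaled_model is a hypothetical stand-in and the grid values are illustrative:

```python
import itertools

def evaluate_scaled_model(depth, width, resolution):
    """Hypothetical stand-in: in EfficientNet this means training the scaled baseline and measuring accuracy."""
    return 0.76 + 0.01 * (depth + width + resolution)       # toy score

grid = [1.0 + 0.1 * i for i in range(6)]                    # illustrative grid: 1.0 ... 1.5
best = None
for a, b, g in itertools.product(grid, repeat=3):           # alpha, beta, gamma
    if not (abs(a * b**2 * g**2 - 2.0) < 0.1 and a >= 1 and b >= 1 and g >= 1):
        continue                                            # enforce alpha * beta^2 * gamma^2 ~= 2
    score = evaluate_scaled_model(a, b, g)
    if best is None or score > best[0]:
        best = (score, a, b, g)
print(best)
```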
Search Strategy
Random search
[Figure: illustration of grid search vs. random search sampling over a two-dimensional hyperparameter space.]
Search Strategy
Reinforcement learning
Neural Architecture Search with Reinforcement Learning [Zoph and Le, ICLR 2017]
• Model neural architecture design as a sequential decision-making problem.
• Use reinforcement learning to train the controller, which is implemented with an RNN.
[Figure: overview of RL-based NAS. The RNN controller samples an architecture as a sequence of tokens (e.g., filter height/width, stride, and number of filters per layer, each predicted by a softmax classifier and fed back as input); the sampled network is trained, and its validation accuracy is used as the reward to update the controller.]
Search Strategy
Gradient descent
DARTS: Differentiable Architecture Search [Liu et al., ICLR 2019]
• Represent the output at each node as a weighted sum of outputs from different edges (candidate operations).
• Let O be the set of candidate operations (e.g., convolution, max pooling, zero, where the zero operation indicates a lack of connection between two nodes). The categorical choice of an operation on edge (i, j) is relaxed to a softmax over all possible operations:
  mixed_o^(i,j)(x) = Σ_{o ∈ O} [ exp(α_o^(i,j)) / Σ_{o' ∈ O} exp(α_{o'}^(i,j)) ] · o(x)
• Architecture search then reduces to learning the continuous variables α = {α^(i,j)}. At the end of the search, each mixed operation is replaced by its most likely operation, o^(i,j) = argmax_o α_o^(i,j).
• The architecture α and the weights w inside the mixed operations are learned jointly as a bilevel optimization: find α* minimizing the validation loss L_val(w*, α*), where w* = argmin_w L_train(w, α*), using gradient descent rather than RL or evolution.
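A minimal sketch of a DARTS-style mixed operation (assuming PyTorch; the candidate operation set is illustrative):

```python
import torch, torch.nn as nn

class MixedOp(nn.Module):
    """Weighted sum of candidate operations; the weights are a softmax over architecture parameters alpha."""
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),   # candidate: 3x3 conv
            nn.Conv2d(channels, channels, 5, padding=2),   # candidate: 5x5 conv
            nn.MaxPool2d(3, stride=1, padding=1),          # candidate: 3x3 max pool
            nn.Identity(),                                 # candidate: skip connection
        ])
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))   # architecture parameters for this edge

    def forward(self, x):
        weights = torch.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))
```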
Search Strategy
Gradient descent
ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware [Cai et al., ICLR 2019]
• It is also possible to take latency into account for gradient-based search.
• Here, F is a latency prediction model (typically a regressor or a lookup table). With such a formulation, we can calculate an additional gradient for the architecture parameters from the latency penalty term.
• Direct measurement of latency is expensive and slow; latency modeling is cheap, fast and differentiable.
• The expected latency of a learnable block is the probability-weighted sum of its candidate operations' latencies, and the network's expected latency sums over blocks:
  E[latency_i] = α · F(conv 3x3) + β · F(conv 5x5) + σ · F(identity) + ... + ζ · F(pool 3x3)
  E[latency] = Σ_i E[latency_i]
• The expected latency enters the loss as a regularization term:
  Loss = Loss_CE + λ1 · ||w||² + λ2 · E[latency]
  where the scaling factor λ2 (> 0) controls the trade-off between accuracy and latency.
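A minimal sketch of the differentiable expected-latency term (assuming PyTorch; the latency table values are made up):

```python
import torch

# Hypothetical lookup table: measured latency (ms) of each candidate op in one block.
latency_table = torch.tensor([3.1, 5.4, 0.1, 1.2])      # conv3x3, conv5x5, identity, pool3x3
alpha = torch.zeros(4, requires_grad=True)               # architecture parameters for this block

probs = torch.softmax(alpha, dim=0)                      # probability of picking each op
expected_latency = (probs * latency_table).sum()         # E[latency_i], differentiable w.r.t. alpha

task_loss = torch.tensor(0.0)                            # placeholder for cross-entropy + weight decay
lambda2 = 0.1
loss = task_loss + lambda2 * expected_latency
loss.backward()                                          # gradients flow into alpha through the latency term
print(alpha.grad)
```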
Search Strategy
Evolutionary search
• Mimic the evolution process in biology.

Evolutionary search: fitness function
PVNAS: 3D Neural Architecture Search with Point-Voxel Convolution [Liu et al., TPAMI 2021]
• Fitness function: F(accuracy, efficiency).
[Figure: sub-networks are sampled from an OFA network and evaluated for latency/accuracy feedback against a target (t = 10 ms, a = 80%); for example, a candidate with t = 8 ms, a = 81% is kept, while one with t = 12 ms, a = 78% is re-sampled.]
Search Strategy
Evolutionary search: mutation
Once-for-All: Train One Network and Specialize it for Efficient Deployment [Cai et al., ICLR 2020]
• Mutation on depth: e.g., a network with stage depths (3, 3) can mutate to stage depths (2, 4); the number of blocks per stage changes.
• Mutation on operator: e.g., individual MB6 3x3 blocks change their operator to MB6 5x5, MB4 5x5, or MB6 7x7, while the per-stage depth stays the same.
Search Strategy
Evolutionary search: crossover
Once-for-All: Train One Network and Specialize it for Efficient Deployment [Cai et al., ICLR 2020]
• Idea: randomly choose one operator among the two choices (from the parents) for each layer.
• E.g., where parent 1 has MB6 3x3 and parent 2 has MB6 5x5 at the same layer, the child picks one of the two at random, layer by layer.
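A minimal sketch of this evolutionary loop over per-layer operator choices; fitness is a hypothetical stand-in for F(accuracy, efficiency):

```python
import random

OPS = ["MB3 3x3", "MB3 5x5", "MB4 5x5", "MB6 3x3", "MB6 5x5", "MB6 7x7"]
NUM_LAYERS = 6

def random_arch():
    return [random.choice(OPS) for _ in range(NUM_LAYERS)]

def mutate(arch, prob=0.2):
    """Mutation on operator: each layer is re-sampled with some probability."""
    return [random.choice(OPS) if random.random() < prob else op for op in arch]

def crossover(parent1, parent2):
    """For each layer, randomly pick the operator from one of the two parents."""
    return [random.choice(pair) for pair in zip(parent1, parent2)]

def fitness(arch):
    """Hypothetical stand-in for F(accuracy, efficiency): reward capacity, penalize a latency proxy."""
    accuracy_proxy = sum(int(op[2]) for op in arch)          # 'MB6 ...' -> 6, crude capacity proxy
    latency_proxy = sum(int(op[-3]) ** 2 for op in arch)     # kernel size squared, crude cost proxy
    return accuracy_proxy - 0.1 * latency_proxy

population = [random_arch() for _ in range(20)]
for generation in range(10):
    population.sort(key=fitness, reverse=True)
    parents = population[:10]                                # keep the fittest half
    children = [crossover(*random.sample(parents, 2)) for _ in range(5)]
    mutants = [mutate(random.choice(parents)) for _ in range(5)]
    population = parents + children + mutants
print(max(population, key=fitness))
```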
Summary of Today's Lecture
• Primitive operations
• Classic building blocks
• NAS search space
• Design the search space
• NAS search strategy
• We will cover in later lectures:
• Performance estimation strategy in NAS
• Hardware-aware NAS
• Zero-shot NAS
• Neural-hardware architecture search
• NAS applications
References
1. Deep Residual Learning for Image Recognition [He et al., CVPR 2016]
2. ImageNet Classification with Deep Convolutional Neural Networks [Krizhevsky et al., NeurIPS 2012]
3. Very Deep Convolutional Networks for Large-scale Image Recognition [Simonyan et al., ICLR 2015]
4. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size [Iandola et al., arXiv 2016]
5. Aggregated Residual Transformations for Deep Neural Networks [Xie et al., CVPR 2017]
6. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications [Howard et al., arXiv 2017]
7. MobileNetV2: Inverted Residuals and Linear Bottlenecks [Sandler et al., CVPR 2018]
8. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices [Zhang et al., CVPR 2018]
9. Neural Architecture Search: A Survey [Elsken et al., JMLR 2019]
10. Learning Transferable Architectures for Scalable Image Recognition [Zoph et al., CVPR 2018]
11. DARTS: Differentiable Architecture Search [Liu et al., ICLR 2019]
12. MnasNet: Platform-Aware Neural Architecture Search for Mobile [Tan et al., CVPR 2019]
13. ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware [Cai et al., ICLR 2019]
14. FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search [Wu et al., CVPR 2019]
15. Designing Network Design Spaces [Radosavovic et al., CVPR 2020]
16. Single Path One-Shot Neural Architecture Search with Uniform Sampling [Guo et al., ECCV 2020]
17. Once-for-All: Train One Network and Specialize it for Efficient Deployment [Cai et al., ICLR 2020]
18. Auto-DeepLab: Hierarchical Neural Architecture Search for Semantic Image Segmentation [Liu et al., CVPR 2019]
19. NAS-FPN: Learning Scalable Feature Pyramid Architecture for Object Detection [Ghiasi et al., CVPR 2019]
20. Exploring Randomly Wired Neural Networks for Image Recognition [Xie et al., ICCV 2019]
21. MCUNet: Tiny Deep Learning on IoT Devices [Lin et al., NeurIPS 2020]
22. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks [Tan and Le, ICML 2019]
23. Neural Architecture Search with Reinforcement Learning [Zoph and Le, ICLR 2017]
24. PVNAS: 3D Neural Architecture Search with Point-Voxel Convolution [Liu et al., TPAMI 2021]
25. Regularized Evolution for Image Classifier Architecture Search [Real et al., AAAI 2019]
EfficientML.ai Lecture Neural Architecture Search

  • 1. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai EfficientML.ai Lecture 07 Neural Architecture Search Part I Song Han Associate Professor, MIT Distinguished Scientist, NVIDIA @SongHan_MIT
  • 2. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Lecture Plan Today we will: 1. Review primitive operations of deep neural networks. 2. Introduce popular building blocks. 3. Introduce Neural Architecture Search (NAS), an automatic technique for designing neural network architectures. 2
  • 3. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Trade-off Between Efficiency and Accuracy 3 Storage Latency Energy Image source: 1 Accuracy Image source: 2
  • 4. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Neural Architecture Search • Primitive operations • Classic building blocks • Introduction to neural architecture search (NAS) • What is NAS? • Search space • Design the search space • Search strategy • E ffi cient and Hardware-aware NAS • Performance estimation strategy • Hardware-aware NAS • Zero-shot NAS • Neural-hardware architecture co-search • NAS applications • NLP, GAN, point cloud, pose 4
  • 5. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Neural Architecture Search 5 • Primitive operations • Classic building blocks • Introduction to neural architecture search (NAS) • What is NAS? • Search space • Design the search space • Search strategy • E ffi cient and Hardware-aware NAS • Performance estimation strategy • Hardware-aware NAS • Zero-shot NAS • Neural-hardware architecture co-search • NAS applications • NLP, GAN, point cloud, pose
  • 6. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Recap: Primitive Operations Fully-connected layer / linear layer 6 • Shape of Tensors: • Input Features • Output Features • Weights • Bias X : (n, ci) Y : (n, co) W : (co, ci) b : (co, ) y2 y1 y0 x4 x3 x2 x1 x0 w00 w42 z1 z0 Multilayer Perceptron (MLP) co ci = co WT X Y ci n n n ci co Notations Batch Size Input Channels Output Channels
  • 7. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Recap: Primitive Operations Fully-connected layer / linear layer 7 Layer MACs (batch size n=1) Linear Layer co ⋅ ci co ci = co WT X Y ci n n y2 y1 y0 x4 x3 x2 x1 x0 w00 w42 n ci co hi, ho wi, wo kh, kw g Notations Batch Size Input Channels Output Channels Input/Output Height Input/Output Width Kernel Height/Width Groups * bias is ignored
  • 8. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Recap: Primitive Operations Convolution layer 8 • Shape of Tensors: • Input Features • Output Features • Weights • Bias X : (n, ci) Y : (n, co) W : (co, ci) b : (co, ) n ci co hi, ho wi, wo kh kw Notations Batch Size Input Channels Output Channels Input/Output Height Input/Output Width Kernel Height Kernel Width channel dimension s p a t i a l d i m e n s i o n ( s ) kh kw ci co hi wi wo ho Image source: 1 (n, ci, hi, wi) (n, co, ho, wo) 2D Conv (n, ci, wi) (n, co, wo) 1D Conv (co, ci, kh, kw) (co, ci, kw)
  • 9. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Recap: Primitive Operations Convolution layer 9 n ci co hi, ho wi, wo kh, kw g Notations Batch Size Input Channels Output Channels Input/Output Height Input/Output Width Kernel Height/Width Groups kh ci co ci co wo ho hi wi kw hi wi ho wo Image source: 1 Layer MACs (batch size n=1) Linear Layer Convolution co ⋅ ci co ⋅ ci ⋅ kh ⋅ kw ⋅ ho ⋅ wo * bias is ignored
  • 10. MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai Recap: Primitive Operations Grouped convolution layer 10 • Shape of Tensors: • Input Features X : (n, ci, hi, wi) • Output Features Y : (n, co, ho, wo) • Weights W : (g ⋅ co/g, ci/g, kh, kw) • Bias b : (co, ) • s is stride, p is padding • Output size: ho = (hi + 2p − kh)/s + 1 Notations: n Batch Size; ci Input Channels; co Output Channels; hi, ho Input/Output Height; wi, wo Input/Output Width; kh Kernel Height; kw Kernel Width; g Groups. [Figure: the filters are split along the channel dimension into g groups (g = 1 vs. g = 2 shown). Image source: 1]
  • 11. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Recap: Primitive Operations Grouped convolution layer 11 n ci co hi, ho wi, wo kh, kw g Notations Batch Size Input Channels Output Channels Input/Output Height Input/Output Width Kernel Height/Width Groups co ci g = 2 hi wi ho wo Image source: 1 Layer MACs (batch size n=1) Linear Layer Convolution Grouped Convolution co ⋅ ci co ⋅ ci ⋅ kh ⋅ kw ⋅ ho ⋅ wo co ⋅ ci ⋅ kh ⋅ kw ⋅ ho ⋅ wo/g * bias is ignored
  • 12. MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai Recap: Primitive Operations Depthwise convolution layer 12 • Shape of Tensors: • Input Features X : (n, ci, hi, wi) • Output Features Y : (n, co, ho, wo) • Weights W : (c, kh, kw), where c = ci = co • Bias b : (co, ) • s is stride, p is padding • Output size: ho = (hi + 2p − kh)/s + 1 Notations: n Batch Size; ci Input Channels; co Output Channels; hi, ho Input/Output Height; wi, wo Input/Output Width; kh Kernel Height; kw Kernel Width; g Groups. [Image source: 1, 2]
  • 13. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Recap: Primitive Operations Depthwise convolution layer 13 n ci co hi, ho wi, wo kh, kw g Notations Batch Size Input Channels Output Channels Input/Output Height Input/Output Width Kernel Height/Width Groups co ci hi wi ho wo Image source: 1 Layer MACs (batch size n=1) Linear Layer Convolution Grouped Convolution Depthwise Convolution co ⋅ ci co ⋅ ci ⋅ kh ⋅ kw ⋅ ho ⋅ wo co ⋅ ci ⋅ kh ⋅ kw ⋅ ho ⋅ wo/g co ⋅ kh ⋅ kw ⋅ ho ⋅ wo * bias is ignored
  • 14. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Recap: Primitive Operations 1x1 convolution layer 14 Layer MACs (batch size n=1) Linear Layer Convolution Grouped Convolution Depthwise Convolution 1x1 Convolution co ⋅ ci n ci co hi, ho wi, wo kh, kw g Notations Batch Size Input Channels Output Channels Input/Output Height Input/Output Width Kernel Height/Width Groups co ⋅ ci ⋅ kh ⋅ kw ⋅ ho ⋅ wo co ⋅ ci ⋅ kh ⋅ kw ⋅ ho ⋅ wo/g co ci co ⋅ kh ⋅ kw ⋅ ho ⋅ wo hi wi ho wo Image source: 1 co ⋅ ci ⋅ ho ⋅ wo * bias is ignored
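As a concrete companion to the MACs table built up on slides 7–14, here is a small plain-Python sketch (added for this write-up, not part of the original slides) that evaluates the per-layer MACs formulas for batch size n = 1; the example layer shapes at the bottom are illustrative assumptions.

    # MACs formulas from the table above (batch size n = 1, bias ignored).
    def linear_macs(ci, co):
        return co * ci

    def conv_macs(ci, co, kh, kw, ho, wo):
        return co * ci * kh * kw * ho * wo

    def grouped_conv_macs(ci, co, kh, kw, ho, wo, g):
        return co * ci * kh * kw * ho * wo // g

    def depthwise_conv_macs(co, kh, kw, ho, wo):
        return co * kh * kw * ho * wo

    def conv1x1_macs(ci, co, ho, wo):
        return co * ci * ho * wo

    # Illustrative example: 64 -> 64 channels on a 112x112 output feature map.
    ci = co = 64; ho = wo = 112
    print("3x3 conv      :", conv_macs(ci, co, 3, 3, ho, wo))
    print("3x3 depthwise :", depthwise_conv_macs(co, 3, 3, ho, wo))
    print("1x1 conv      :", conv1x1_macs(ci, co, ho, wo))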
  • 15. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Neural Architecture Search 15 • Primitive operations • Classic building blocks • Introduction to neural architecture search (NAS) • What is NAS? • Search space • Design the search space • Search strategy • E ffi cient and Hardware-aware NAS • Performance estimation strategy • Hardware-aware NAS • Zero-shot NAS • Neural-hardware architecture co-search • NAS applications • NLP, GAN, point cloud, pose
  • 16. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Classic Building Blocks ResNet50: bottleneck block 16 image source: link Deep Residual Learning for Image Recognition [He et al., CVPR 2016] Reduce the number of channels by 4x via 1x1 convolution
  • 17. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Classic Building Blocks ResNet50: bottleneck block 17 image source: link Deep Residual Learning for Image Recognition [He et al., CVPR 2016] Feed reduced feature map to 3x3 convolution
  • 18. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Classic Building Blocks ResNet50: bottleneck block 18 image source: link Deep Residual Learning for Image Recognition [He et al., CVPR 2016] Expand the number of channels via 1x1 convolution
  • 19. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Classic Building Blocks ResNet50: bottleneck block 19 image source: link Deep Residual Learning for Image Recognition [He et al., CVPR 2016] 2048 x 512 x H x W x 1 2048 x 512 x H x W x 1 512 x 512 x H x W x 9 #MACs
  • 20. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Classic Building Blocks ResNet50: bottleneck block 20 image source: link Deep Residual Learning for Image Recognition [He et al., CVPR 2016] 512 x 512 x H x W x 17 2048 x 512 x H x W x 1 2048 x 512 x H x W x 1 512 x 512 x H x W x 9 #MACs
  • 21. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Classic Building Blocks ResNet50: bottleneck block 21 image source: link Deep Residual Learning for Image Recognition [He et al., CVPR 2016] 512 x 512 x H x W x 17 2048 x 2048 x H x W x 9 = 512 x 512 x H x W x 144 8.5x reduction 2048 x 512 x H x W x 1 2048 x 512 x H x W x 1 512 x 512 x H x W x 9 #MACs
  • 22. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Classic Building Blocks ResNet50: bottleneck block 22 image source: link Deep Residual Learning for Image Recognition [He et al., CVPR 2016] 512 x 512 x 17 2048 x 2048 x 9 = 512 x 512 x 144 8.5x reduction 2048 x 512 x 1 2048 x 512 x 1 512 x 512 x 9 #Params
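The 8.5× figure on slides 21–22 can be verified with a few lines of arithmetic (plain Python, added here as a sanity check):

    # ResNet-50 bottleneck: 1x1 reduce (2048->512), 3x3 (512->512), 1x1 expand (512->2048).
    bottleneck_params = 2048 * 512 * 1 + 512 * 512 * 9 + 512 * 2048 * 1   # = 512 * 512 * 17
    plain_3x3_params  = 2048 * 2048 * 9                                   # = 512 * 512 * 144
    print(plain_3x3_params / bottleneck_params)   # ~8.47, i.e. the ~8.5x reduction on the slide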
  • 23. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Classic Building Blocks ResNeXt: grouped convolution • Replace 3x3 convolution with 3x3 grouped convolution. 23 Aggregated Residual Transformations for Deep Neural Networks [Xie et al., CVPR 2017]
  • 24. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Classic Building Blocks ResNeXt: grouped convolution • Replace 3x3 convolution with 3x3 grouped convolution. • Equivalent to a multi-path block. 24 Aggregated Residual Transformations for Deep Neural Networks [Xie et al., CVPR 2017]
  • 25. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Classic Building Blocks ResNeXt: grouped convolution • Replace 3x3 convolution with 3x3 grouped convolution. • Equivalent to a multi-path block. 25 Aggregated Residual Transformations for Deep Neural Networks [Xie et al., CVPR 2017]
  • 26. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Classic Building Blocks MobileNet: depthwise-separable block • Depthwise convolution is an extreme case of group convolution where the group number equals the number of input channels. 26 image source: link MobileNets: E ffi cient Convolutional Neural Networks for Mobile Vision Applications [Howard et al., arXiv 2017]
  • 27. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Classic Building Blocks MobileNet: depthwise-separable block • Depthwise convolution is an extreme case of group convolution where the group number equals the number of input channels. • Use depthwise convolution to capture spatial information. 27 image source: link MobileNets: E ffi cient Convolutional Neural Networks for Mobile Vision Applications [Howard et al., arXiv 2017]
  • 28. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai • Depthwise convolution is an extreme case of group convolution where the group number equals the number of input channels. • Use depthwise convolution to capture spatial information. • Use 1x1 convolution to fuse/exchange information across di ff erent channels. Classic Building Blocks MobileNet: depthwise-separable block 28 image source: link MobileNets: E ffi cient Convolutional Neural Networks for Mobile Vision Applications [Howard et al., arXiv 2017]
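A minimal PyTorch sketch of the depthwise-separable block described on slides 26–28 (assuming the torch package is available; channel counts are illustrative, and the BatchNorm/ReLU placement follows common practice rather than the slide itself):

    import torch
    import torch.nn as nn

    class DepthwiseSeparableConv(nn.Module):
        """3x3 depthwise conv (spatial mixing) followed by 1x1 conv (channel mixing)."""
        def __init__(self, in_ch, out_ch, stride=1):
            super().__init__()
            # groups=in_ch makes each filter see exactly one input channel.
            self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                       padding=1, groups=in_ch, bias=False)
            self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
            self.bn1, self.bn2 = nn.BatchNorm2d(in_ch), nn.BatchNorm2d(out_ch)
            self.act = nn.ReLU(inplace=True)

        def forward(self, x):
            x = self.act(self.bn1(self.depthwise(x)))
            return self.act(self.bn2(self.pointwise(x)))

    x = torch.randn(1, 32, 112, 112)
    print(DepthwiseSeparableConv(32, 64)(x).shape)  # torch.Size([1, 64, 112, 112])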
  • 29. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai • Depthwise convolution has a much lower capacity compared to normal convolution. • Increase the depthwise convolution's input and output channels to improve its capacity. • Depthwise convolution’s cost only grows linearly. Therefore, the cost is still a ff ordable. Classic Building Blocks MobileNetV2: inverted bottleneck block 29 MobileNetV2: Inverted Residuals and Linear Bottlenecks [Sandler et al., CVPR 2018] Image source: 1 Depthwise Convolution N * 6 N * 6 N Feature Channels N 1×1 Conv 3×3 DW-Conv 1×1 Conv Image source: 1
  • 30. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai • Depthwise convolution has a much lower capacity compared to normal convolution. • Increase the depthwise convolution's input and output channels to improve its capacity. • Depthwise convolution’s cost only grows linearly. Therefore, the cost is still a ff ordable. Classic Building Blocks MobileNetV2: inverted bottleneck block 30 MobileNetV2: Inverted Residuals and Linear Bottlenecks [Sandler et al., CVPR 2018] Image source: 1 Depthwise Convolution N * 6 N * 6 N Feature Channels N 1×1 Conv 3×3 DW-Conv 1×1 Conv Image source: 1 160 x 960 x H x W x 1 160 x 960 x H x W x 1 960 x H x W x 9 #MACs (N = 160) 960 x H x W x 329 160 x 160 x H x W x 9 = 960 x H x W x 240 1 : 1.37
  • 31. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai • Depthwise convolution has a much lower capacity compared to normal convolution. • Increase the depthwise convolution's input and output channels to improve its capacity. • Depthwise convolution’s cost only grows linearly. Therefore, the cost is still a ff ordable. Classic Building Blocks MobileNetV2: inverted bottleneck block 31 MobileNetV2: Inverted Residuals and Linear Bottlenecks [Sandler et al., CVPR 2018] Image source: 1 Depthwise Convolution N * 6 N * 6 N Feature Channels N 1×1 Conv 3×3 DW-Conv 1×1 Conv Image source: 1 160 x 960 x 1 160 x 960 x 1 960 x 9 #Params (N = 160) 960 x 329 160 x 160 x 9 = 960 x 240 1 : 1.37
  • 32. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai • Depthwise convolution has a much lower capacity compared to normal convolution. • Increase the depthwise convolution's input and output channels to improve its capacity. • Depthwise convolution’s cost only grows linearly. Therefore, the cost is still a ff ordable. • However, this design is not memory-e ffi cient for both inference and training. Classic Building Blocks MobileNetV2: inverted bottleneck block 32 Depthwise Convolution N * 6 N * 6 N Feature Channels N 1×1 Conv 3×3 DW-Conv 1×1 Conv MobileNetV2: Inverted Residuals and Linear Bottlenecks [Sandler et al., CVPR 2018] Image source: 1 0 200 400 600 800 Param (MB) Activation (MB) 626 24 707 102 ResNet-50 MobileNetV2-1.4× 4.3× 1.1× * All parameters and activations are Floating-Point numbers (32 bits). Batch size is 16. 0 2.4 4.8 7.2 9.6 12 Param (MB) Peak Activation (MB) ResNet-18 MobileNetV2-0.75 4.6× 1.8× * All parameters and activations are Integer numbers (8 bits). Batch size is 1. Inference Training
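A sketch of the inverted bottleneck in PyTorch, following the 1×1 expand → 3×3 depthwise → 1×1 linear projection structure with expansion ratio 6 from slides 29–32 (assuming torch; normalization and activation details follow common practice and are not specified on the slides):

    import torch
    import torch.nn as nn

    class InvertedBottleneck(nn.Module):
        """1x1 expand (N -> 6N), 3x3 depthwise, 1x1 linear projection (6N -> N)."""
        def __init__(self, channels, expand_ratio=6, stride=1):
            super().__init__()
            hidden = channels * expand_ratio
            self.use_residual = (stride == 1)
            self.block = nn.Sequential(
                nn.Conv2d(channels, hidden, 1, bias=False),
                nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
                nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1,
                          groups=hidden, bias=False),        # depthwise
                nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
                nn.Conv2d(hidden, channels, 1, bias=False),   # linear bottleneck
                nn.BatchNorm2d(channels),
            )

        def forward(self, x):
            out = self.block(x)
            return x + out if self.use_residual else out

    x = torch.randn(1, 160, 14, 14)
    print(InvertedBottleneck(160)(x).shape)  # torch.Size([1, 160, 14, 14])

Note how the large 6N-channel activation in the middle of the block is exactly what makes this design memory-hungry, as slide 32 points out.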
  • 33. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Classic Building Blocks Shu ffl eNet: 1x1 group convolution & channel shu ffl e • Further reduce the cost by replacing 1x1 convolution with 1x1 group convolution. • Exchange information across di ff erent groups via channel shu ffl e. 33 Shu ffl eNet: An Extremely E ffi cient Convolutional Neural Network for Mobile Devices [Zhang et al., CVPR 2018]
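The channel shuffle operation on this slide can be written in a few lines of PyTorch (a sketch assuming torch, using a reshape–transpose–flatten trick):

    import torch

    def channel_shuffle(x, groups):
        """Interleave channels across groups so that 1x1 group convs can exchange information."""
        n, c, h, w = x.shape
        assert c % groups == 0
        # (n, g, c/g, h, w) -> swap group and channel axes -> flatten back to (n, c, h, w)
        x = x.view(n, groups, c // groups, h, w)
        x = x.transpose(1, 2).contiguous()
        return x.view(n, c, h, w)

    x = torch.arange(8).float().view(1, 8, 1, 1)
    print(channel_shuffle(x, groups=2).flatten().tolist())
    # [0, 4, 1, 5, 2, 6, 3, 7] -- channels from the two groups are interleaved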
  • 34. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Classic Building Blocks Transformer: Multi-Head Self-Attention (MHSA) 34 Attention Is All You Need [Vaswani et al., NeurIPS 2017] • Project Q, K and V with h di ff erent, learned linear projections. • Perform the scaled dot-product attention function on each of these projected versions of Q, K and V in parallel. • Concatenate the output values. • Project the output values again, resulting in the fi nal values. Scaled Dot- Product Attention
  • 35. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Classic Building Blocks Transformer: Multi-Head Self-Attention (MHSA) 35 Attention Is All You Need [Vaswani et al., NeurIPS 2017] • Project Q, K and V with h di ff erent, learned linear projections. • Perform the scaled dot-product attention function on each of these projected versions of Q, K and V in parallel. • Concatenate the output values. • Project the output values again, resulting in the fi nal values. Image source: 1
  • 36. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Classic Building Blocks Transformer: Multi-Head Self-Attention (MHSA) 36 Attention Is All You Need [Vaswani et al., NeurIPS 2017] • Project Q, K and V with h di ff erent, learned linear projections. • Perform the scaled dot-product attention function on each of these projected versions of Q, K and V in parallel. • Concatenate the output values. • Project the output values again, resulting in the fi nal values.
  • 37. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Classic Building Blocks Transformer: Multi-Head Self-Attention (MHSA) 37 Attention Is All You Need [Vaswani et al., NeurIPS 2017] • Project Q, K and V with h di ff erent, learned linear projections. • Perform the scaled dot-product attention function on each of these projected versions of Q, K and V in parallel. • Concatenate the output values. • Project the output values again, resulting in the fi nal values.
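A compact sketch of multi-head self-attention using basic tensor operations (assuming torch; the dimensions are illustrative and masking/dropout are omitted):

    import torch
    import torch.nn as nn

    class MultiHeadSelfAttention(nn.Module):
        def __init__(self, dim, num_heads):
            super().__init__()
            assert dim % num_heads == 0
            self.h, self.d = num_heads, dim // num_heads
            self.qkv = nn.Linear(dim, 3 * dim)   # learned projections for Q, K, V
            self.proj = nn.Linear(dim, dim)      # final output projection

        def forward(self, x):                    # x: (batch, tokens, dim)
            b, t, _ = x.shape
            q, k, v = self.qkv(x).chunk(3, dim=-1)
            # Split into heads: (b, h, t, d)
            q, k, v = (z.view(b, t, self.h, self.d).transpose(1, 2) for z in (q, k, v))
            attn = (q @ k.transpose(-2, -1)) / self.d ** 0.5   # scaled dot-product
            attn = attn.softmax(dim=-1)
            out = (attn @ v).transpose(1, 2).reshape(b, t, self.h * self.d)
            return self.proj(out)                # concatenate heads, then project

    x = torch.randn(2, 16, 64)
    print(MultiHeadSelfAttention(64, num_heads=8)(x).shape)  # torch.Size([2, 16, 64])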
  • 38. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Neural Architecture Search 38 • Primitive operations • Classic building blocks • Introduction to neural architecture search (NAS) • What is NAS? • Search space • Design the search space • Search strategy • E ffi cient and Hardware-aware NAS • Performance estimation strategy • Hardware-aware NAS • Zero-shot NAS • Neural-hardware architecture co-search • NAS applications • NLP, GAN, point cloud, pose
  • 39. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai From Manual Design to Automatic Design Huge design space, manual design is unscalable 39 # channels # layers # kernel resolution connectivity
  • 40. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai From Manual Design to Automatic Design Huge design space, manual design is unscalable 40 # channels # layers # kernel resolution connectivity
  • 41. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Manually-Designed Neural Networks 41 Model Compression and Hardware Acceleration for Neural Networks: A Comprehensive Survey [Deng et al., IEEE 2020] Accuracy-e ffi ciency trade-o ff on ImageNet 0 1 2 3 4 5 6 7 8 9 MACs (Billion) 69 71 73 75 77 79 81 ImageNet Top-1 accuracy (%) 2M 4M 8M Handcrafted 16M 32M 64M MBNetV2 Shu ffl eNet IGCV3-D MobileNetV1 (MBNetV1) InceptionV2 DenseNet-121 DenseNet-169 ResNet-50 ResNetXt-50 DenseNet-264 InceptionV3 DPN-92 ResNet-101 ResNetXt-101 Xception
  • 42. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai From Manual Design to Automatic Design Huge design space, manual design is unscalable 42 Once-for-All: Train One Network and Specialize it for E ffi cient Deployment [Cai et al., ICLR 2020] 0 1 2 3 4 5 6 7 8 9 MACs (Billion) 69 71 73 75 77 79 81 ImageNet Top-1 accuracy (%) 2M 4M 8M Handcrafted 16M AutoML 32M 64M MBNetV2 Shu ffl eNet IGCV3-D MobileNetV1 (MBNetV1) InceptionV2 DenseNet-121 DenseNet-169 ResNet-50 ResNetXt-50 DenseNet-264 InceptionV3 DPN-92 ResNet-101 ResNetXt-101 Xception DARTS PNASNet AmoebaNet MBNetV3 ProxylessNAS E ffi cientNet NASNet-A Once-for-All
  • 43. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Illustration of NAS Components and goal • The goal of NAS is to fi nd the best neural network architecture in the search space, maximizing the objective of interest (e.g., accuracy, e ffi ciency, etc). 43 Neural Architecture Search: A Survey [Elskan et al., JMLR 2019]
  • 44. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai • Search space is a set of candidate neural network architectures. Search Space 44 Neural Architecture Search: A Survey [Elskan et al., JMLR 2019] • Cell-level • Network-level
  • 45. MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai Search Space Cell-level search space 45 Learning Transferable Architectures for Scalable Image Recognition [Zoph et al., CVPR 2018] [Figures: NASNet architectures for image classification stack a searched Normal Cell and Reduction Cell, and the number of repeated Normal Cells N can vary. Figure 4 of the paper shows the best convolutional cells (NASNet-A) with B = 5 blocks identified on CIFAR-10: each cell takes the hidden states of the two previous layers (or the input image) as input, each block applies two primitive operations (e.g., separable convolutions, average/max pooling, identity) and adds their results, and the cell output is the concatenation of all resulting branches.]
  • 46. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Search Space Cell-level search space 46 Learning Transferable Architectures for Scalable Image Recognition [Zoph et al., CVPR 2018] softmax layer controller hidden layer Select one hidden state Select second hidden state Select operation for first hidden state Select operation for second hidden state Select method to combine hidden state repeat B times new hidden layer add 3 x 3 conv 2 x 2 maxpool hidden layer B hidden layer A Figure 3. Controller model architecture for recursively constructing one block of a convolutional cell. Each block requires selecting 5 discrete parameters, each of which corresponds to the output of a softmax layer. Example constructed block shown on right. A convolu- tional cell contains B blocks, hence the controller contains 5B softmax layers for predicting the architecture of a convolutional cell. In our experiments, the number of blocks B is 5. • Left: An RNN controller generates the candidate cells by: • fi nding two inputs • selecting two input transformation operations (e.g. convolution / pooling / identity) • selecting the method to combine the results. • Right: A cell generated after one step.
  • 47. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Search Space Cell-level search space 47 Learning Transferable Architectures for Scalable Image Recognition [Zoph et al., CVPR 2018] softmax layer controller hidden layer Select one hidden state Select second hidden state Select operation for first hidden state Select operation for second hidden state Select method to combine hidden state repeat B times new hidden layer add 3 x 3 conv 2 x 2 maxpool hidden layer B hidden layer A Figure 3. Controller model architecture for recursively constructing one block of a convolutional cell. Each block requires selecting 5 discrete parameters, each of which corresponds to the output of a softmax layer. Example constructed block shown on right. A convolu- tional cell contains B blocks, hence the controller contains 5B softmax layers for predicting the architecture of a convolutional cell. In our experiments, the number of blocks B is 5. • Question: Assuming that we have two candidate inputs, M candidate operations to transform the inputs and N potential operations to combine hidden states, what is the size of the search space in NASNet if we have B layers? • Hint: Consider it step by step!
  • 48. MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai Search Space Cell-level search space 48 Learning Transferable Architectures for Scalable Image Recognition [Zoph et al., CVPR 2018] [Figure 3: the controller recursively constructs one block of a convolutional cell by making 5 discrete choices (two hidden states, two operations, one combination method), each from a softmax layer; a cell contains B blocks, so the controller has 5B softmax layers. In the experiments, B = 5.] • Question: Assuming that we have two candidate inputs, M candidate operations to transform the inputs and N potential operations to combine hidden states, what is the size of the search space in NASNet if we have B layers? • Hint: Consider it step by step! • Answer: (2 × 2 × M × M × N)^B = 4^B · M^(2B) · N^B. • Assume M = 5, N = 2, B = 5: we have (2 × 2 × 5 × 5 × 2)^5 = 200^5 ≈ 3.2 × 10^11 candidates in the design space.
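The count above can be reproduced with a one-liner (plain Python, added for illustration):

    M, N, B = 5, 2, 5                   # candidate ops, combine ops, blocks per cell
    size = (2 * 2 * M * M * N) ** B     # two input choices, two op choices, one combiner per block
    print(f"{size:.1e}")                # 3.2e+11 candidate cells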
  • 49. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Search Space Network-level search space: depth dimension 49 Once-for-All: Train One Network and Specialize it for E ffi cient Deployment [Cai et al., ICLR 2020] Input 7x7 Conv 3x3 MaxPool 3 x 224 x 224 Input Stem Bottleneck Block Stage 1 C1 x 160 x 160 × L1 Stage 2 C2 x 80 x 80 × L2 Linear Attention FFN with Depthwise Conv × L3 ⏟ Stage 3 C3 x 40 x 40 MBConv MBConv Linear Attention FFN with Depthwise Conv × L4 ⏟ Stage 4 C4 x 20 x 20 MBConv P2 P3 P4 64 x 56 x 56 Bottleneck Block × 2 ⏟ Stage1 256 x 56 x 56 Bottleneck Block Bottleneck Block × 3 ⏟ Stage2 512 x 28 x 28 Bottleneck Block Bottleneck Block × 5 ⏟ Stage3 1024 x 14 x 14 Bottleneck Block Bottleneck Block × 2 ⏟ Stage4 2048 x 7 x 7 Global AvgPool FC 2048 x 1 x 1 Head Prediction × [1,2,3] × [2,3,4] × [3,5,7,9] × [1,2,3]
  • 50. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Search Space 50 Once-for-All: Train One Network and Specialize it for E ffi cient Deployment [Cai et al., ICLR 2020] Input 7x7 Conv 3x3 MaxPool 3 x 224 x 224 Input Stem Bottleneck Block Stage 1 C1 x 160 x 160 × L1 Stage 2 C2 x 80 x 80 × L2 Linear Attention FFN with Depthwise Conv × L3 ⏟ Stage 3 C3 x 40 x 40 MBConv MBConv Linear Attention FFN with Depthwise Conv × L4 ⏟ Stage 4 C4 x 20 x 20 MBConv P2 P3 P4 64 x 56 x 56 Bottleneck Block × 2 ⏟ Stage1 256 x 56 x 56 Bottleneck Block Bottleneck Block × 3 ⏟ Stage2 512 x 28 x 28 Bottleneck Block Bottleneck Block × 5 ⏟ Stage3 1024 x 14 x 14 Bottleneck Block Bottleneck Block × 2 ⏟ Stage4 2048 x 7 x 7 Global AvgPool FC 2048 x 1 x 1 Head Prediction [(3,128,128); (3,160,160); (3,192,192); (3,224,224); (3,256,256)] Network-level search space: resolution dimension
  • 51. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Input 7x7 Conv 3x3 MaxPool 3 x 224 x 224 Input Stem Bottleneck Block Stage 1 C1 x 160 x 160 × L1 Stage 2 C2 x 80 x 80 × L2 Linear Attention FFN with Depthwise Conv × L3 ⏟ Stage 3 C3 x 40 x 40 MBConv MBConv Linear Attention FFN with Depthwise Conv × L4 ⏟ Stage 4 C4 x 20 x 20 MBConv P2 P3 P4 64 x 56 x 56 Bottleneck Block × 2 ⏟ Stage1 256 x 56 x 56 Bottleneck Block Bottleneck Block × 3 ⏟ Stage2 512 x 28 x 28 Bottleneck Block Bottleneck Block × 5 ⏟ Stage3 1024 x 14 x 14 Bottleneck Block Bottleneck Block × 2 ⏟ Stage4 2048 x 7 x 7 Global AvgPool FC 2048 x 1 x 1 Head Prediction [48,64,96] [192,256,384] [384,512,768] [640,1024,1600] [1280,2048,3200] Search Space 51 Once-for-All: Train One Network and Specialize it for E ffi cient Deployment [Cai et al., ICLR 2020] Network-level search space: width dimension
  • 52. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Search Space 52 ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware [Cai et al., ICLR 2019] For models that use depthwise convolution, we can choose the kernel size for each depthwise convolution. Conv 3x3 MB1 3x3 MB3 5x5 MB3 3x3 MB3 7x7 MB3 3x3 MB3 5x5 MB3 5x5 MB6 7x7 32x112x112 16x112x112 3x224x224 32x56x56 32x56x56 40x28x28 40x28x28 40x28x28 40x28x28 MB3 5x5 MB3 5x5 80x14x14 80x14x14 MB6 5x5 MB3 5x5 MB3 5x5 MB3 5x5 MB6 7x7 MB3 7x7 MB6 7x7 Pooling FC 80x14x14 96x14x14 96x14x14 96x14x14 192x7x7 192x7x7 192x7x7 192x7x7 320x7x7 MB3 5x5 80x14x14 MB6 7x7 MB3 7x7 96x14x14 Conv 3x3 MB1 3x3 MB6 3x3 MB3 3x3 MB3 3x3 MB3 3x3 MB6 3x3 MB3 3x3 MB3 3x3 40x112x112 24x112x112 3x224x224 32x56x56 32x56x56 32x56x56 32x56x56 48x28x28 48x28x28 MB6 3x3 MB3 5x5 48x28x28 48x28x28 MB6 5x5 MB3 3x3 MB3 3x3 MB3 3x3 MB6 5x5 MB3 3x3 MB6 5x5 Pooling FC 88x14x14 104x14x14 104x14x14 104x14x14 216x7x7 216x7x7 216x7x7 216x7x7 360x7x7 MB3 3x3 88x14x14 MB3 5x5 MB3 5x5 104x14x14 (1) Efficient mobile architecture found by ProxylessNAS. (2) Efficient CPU architecture found by ProxylessNAS. Network-level search space: kernel size dimension 7x7 3x3 5x5
  • 53. MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai Search Space Network-level search space: topology connection 53 Auto-DeepLab: Hierarchical Neural Architecture Search for Semantic Image Segmentation [Liu et al., CVPR 2019] [Figure 1 of the paper: the network-level search space with L = 12 layers over downsampling rates {1, 2, 4, 8, 16, 32}; gray nodes are fixed stem layers and a path through the blue nodes is one candidate network-level architecture. Figure 2: this search space is general and includes existing designs such as DeepLabv3, Conv-Deconv, and Stacked Hourglass.] • Left: the network-level search space in AutoDeepLab. Each path along the blue nodes corresponds to a network architecture in the search space. • Right: representative manual designs can be represented using this formulation. E.g.: DeepLabV3, UNet, Stacked Hourglass.
  • 54. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Neural Architecture Search 54 • Primitive operations • Classic building blocks • Introduction to neural architecture search (NAS) • What is NAS? • Search space • Design the search space • Search strategy • E ffi cient and Hardware-aware NAS • Performance estimation strategy • Hardware-aware NAS • Zero-shot NAS • Neural-hardware architecture co-search • NAS applications • NLP, GAN, point cloud, pose
  • 55. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Design the Search Space Design the search space for TinyML 55 MCUNet: Tiny Deep Learning on IoT Devices [Lin et al., NeurIPS 2020] MCUNet: Tiny Deep Learning on IoT Devices [Lin et al., NeurIPS 2020] Search space design is crucial for NAS performance TinyNAS: (1) Automated search space optimization (2) Resource-constrained model specialization Search Space Optimization Network Space Model Specialization
  • 56. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Design the Search Space Design the search space for TinyML: memory is important for TinyML 56 MCUNet: Tiny Deep Learning on IoT Devices [Lin et al., NeurIPS 2020] Di ff erent from Mobile AI + + Latency Constraint Latency Constraint Energy Constraint Energy Constraint Memory Constraint
  • 57. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Design the Search Space Design the search space for TinyML: memory is important for TinyML 57 MCUNet: Tiny Deep Learning on IoT Devices [Lin et al., NeurIPS 2020] Memory (GB) NVIDIA V100 iPhone 11 STM32F746 0 4 8 12 16
  • 58. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Design the Search Space Design the search space for TinyML 58 MCUNet: Tiny Deep Learning on IoT Devices [Lin et al., NeurIPS 2020] MCUNet: Tiny Deep Learning on IoT Devices [Lin et al., NeurIPS 2020] Analyzing FLOPs distribution of satisfying models: Larger FLOPs -> Larger model capacity -> More likely to give higher accuracy
  • 59. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Design the Search Space Design the search space for TinyML 59 MCUNet: Tiny Deep Learning on IoT Devices [Lin et al., NeurIPS 2020] MCUNet: Tiny Deep Learning on IoT Devices [Lin et al., NeurIPS 2020] • Discussion: what are the advantages / disadvantages of these two methods (RegNet, MCUNet) for search space design? Better search space, better fi nal accuracy.
  • 60. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Neural Architecture Search 60 • Primitive operations • Classic building blocks • Introduction to neural architecture search (NAS) • What is NAS? • Search space • Design the search space • Search strategy • E ffi cient and Hardware-aware NAS • Performance estimation strategy • Hardware-aware NAS • Zero-shot NAS • Neural-hardware architecture co-search • NAS applications • NLP, GAN, point cloud, pose
  • 61. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Illustration of NAS Search strategy • Search space is a set of candidate neural network architectures. • Search strategy de fi nes how to explore the search space. 61 Neural Architecture Search: A Survey [Elskan et al., JMLR 2019] • Grid search • Random search • Reinforcement learning • Gradient descent • Evolutionary search • Cell-level • Network-level
  • 62. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Search Strategy Grid search 62 • Grid search is the traditional way of hyper parameter optimization. The entire design space is represented as the Cartesian product of single dimension design spaces (e.g. resolution in [1.0x, 1.1x,1.2x], width in [1.0x, 1.1x, 1.2x]). • To obtain the accuracy of each candidate network, we train them from scratch. 1.0x 1.1x 1.2x 1.0x 50.0% 53.0% 54.9% 1.1x 51.0% 53.5% 55.4% 1.2x 52.0% 54.1% 56.2% Width Resolution Satisfies the latency constraint Breaks the latency constraint Numbers in the grids correspond to the accuracy of candidate networks.
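A minimal grid-search sketch over the width and resolution multipliers shown in the table above (plain Python; train_and_eval and measure_latency are hypothetical stand-ins for the real, expensive training and measurement steps):

    import itertools

    def grid_search(widths, resolutions, latency_limit, train_and_eval, measure_latency):
        best = None
        for w, r in itertools.product(widths, resolutions):   # Cartesian product of the grid
            if measure_latency(w, r) > latency_limit:
                continue                                      # breaks the latency constraint
            acc = train_and_eval(w, r)                        # train this candidate from scratch
            if best is None or acc > best[0]:
                best = (acc, w, r)
        return best

    # Usage with toy stand-ins for the real functions:
    best = grid_search([1.0, 1.1, 1.2], [1.0, 1.1, 1.2], latency_limit=2.5,
                       train_and_eval=lambda w, r: 50 + 10 * (w - 1) + 25 * (r - 1),
                       measure_latency=lambda w, r: 2 * w * r)
    print(best)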
  • 63. MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai Search Strategy Grid search 63 • EfficientNet applies compound scaling on depth, width and resolution to a starting network. It performs grid search on values α, β, γ such that the total FLOPs of the new model will be 2× of the original network. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks [Tan and Le, ICML 2019] [Figure 2 of the paper: (a) a baseline network; (b)-(d) conventional scaling that increases only width, depth, or resolution; (e) compound scaling, which uniformly scales all three dimensions with a fixed ratio.] Compound scaling uses a coefficient φ to set depth d = α^φ, width w = β^φ and resolution r = γ^φ, subject to α · β² · γ² ≈ 2 with α ≥ 1, β ≥ 1, γ ≥ 1, where α, β, γ are constants determined by a small grid search, φ controls how many more resources are available, and the FLOPs of a regular convolution scale roughly linearly with d and quadratically with w and r.
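To make the compound-scaling constraint concrete, here is a small sketch (plain Python, not from the paper) that enumerates a coarse grid of α, β, γ and keeps the combinations whose FLOPs ratio α·β²·γ² is close to the 2× target; the grid values are assumptions chosen for illustration:

    import itertools

    # Candidate values for the depth (alpha), width (beta) and resolution (gamma) multipliers.
    grid = [1.0, 1.05, 1.1, 1.15, 1.2, 1.25, 1.3]

    candidates = []
    for a, b, g in itertools.product(grid, repeat=3):
        flops_ratio = a * b**2 * g**2      # FLOPs grow ~linearly in depth, quadratically in width/resolution
        if abs(flops_ratio - 2.0) < 0.05:  # keep combinations that roughly double FLOPs
            candidates.append((a, b, g))

    print(len(candidates), candidates[:3])
    # Each surviving (alpha, beta, gamma) would then be evaluated by training the scaled model.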
  • 64. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Search Strategy Random search 64 image source: 1 image source: 2 Grid Search Random Search
  • 65. MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai Search Strategy Reinforcement learning 65 Neural Architecture Search with Reinforcement Learning [Zoph and Le, ICLR 2017] [Left figure — overview of RL-based NAS (Figure 1 of the paper): a controller proposes an architecture, the child network is trained to obtain an accuracy signal, and that signal is used as the reward to update the controller. Right figure — the RNN controller (Figure 2): it samples a convolutional network by predicting filter height, filter width, stride height, stride width and number of filters for each layer; every prediction is carried out by a softmax classifier and fed into the next time step as input.] • Model neural architecture design as a sequential decision-making problem. • Use reinforcement learning to train the controller that is implemented with an RNN.
  • 66. MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai Search Strategy Gradient descent 66 DARTS: Differentiable Architecture Search [Liu et al., ICLR 2019] [Figure 1: overview of DARTS — operations on the edges are initially unknown; the categorical choice on each edge is relaxed into a continuous mixture, the mixing weights and network weights are optimized jointly, and the final discrete architecture is derived from the learned mixing weights. A special zero operation indicates a lack of connection between two nodes.] Given a set of candidate operations O, the choice of operation on edge (i, j) is relaxed to a softmax over all operations: ō^(i,j)(x) = Σ_{o ∈ O} [ exp(α_o^(i,j)) / Σ_{o' ∈ O} exp(α_{o'}^(i,j)) ] · o(x), where the mixing weights for the edge are parameterized by a vector α^(i,j) of dimension |O|. At the end of search, each mixed operation is replaced by the most likely operation, o^(i,j) = argmax_{o ∈ O} α_o^(i,j). The architecture parameters α and the network weights w are learned jointly with gradient descent: α is optimized on the validation loss while w is optimized on the training loss, which forms a bilevel optimization problem. • Represent the output at each node as a weighted sum of outputs from different edges.
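A sketch of the continuous relaxation above as a differentiable mixed operation in PyTorch (assuming torch; the candidate-operation list is a simplified illustration of the idea rather than the paper's exact operation set):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MixedOp(nn.Module):
        """Weighted sum of candidate ops; weights come from a softmax over architecture params alpha."""
        def __init__(self, channels):
            super().__init__()
            self.ops = nn.ModuleList([
                nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                nn.Conv2d(channels, channels, 5, padding=2, bias=False),
                nn.MaxPool2d(3, stride=1, padding=1),
                nn.Identity(),
            ])
            self.alpha = nn.Parameter(torch.zeros(len(self.ops)))  # architecture parameters

        def forward(self, x):
            weights = F.softmax(self.alpha, dim=0)
            return sum(w * op(x) for w, op in zip(weights, self.ops))

    op = MixedOp(16)
    y = op(torch.randn(1, 16, 8, 8))
    # After search, the edge would keep only the op with the largest alpha:
    print(int(op.alpha.argmax()))

During search, the alpha parameters would be updated on validation data while the operation weights are updated on training data, matching the bilevel formulation described above.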
  • 67. MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai Search Strategy Gradient descent 67 • It is also possible to take latency into account for gradient-based search. • Here, F is a latency prediction model (typically a regressor or a lookup table). With such formulation, we can calculate an additional gradient for the architecture parameters from the latency penalty term. ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware [Cai et al., ICLR 2019] [Figure 3: making latency differentiable by introducing a latency regularization loss — direct measurement is expensive and slow, while latency modeling is cheap, fast and differentiable. Each learnable block chooses among candidate ops (e.g., conv 3x3, conv 5x5, pool 3x3, identity) with probabilities α, β, σ, …, ζ.] The expected latency of the network is the sum over blocks, E[latency] = Σ_i E[latency_i], where each block's expected latency is the probability-weighted sum over its candidate operations, e.g. E[latency_i] = α · F(conv 3x3) + β · F(conv 5x5) + σ · F(identity) + … + ζ · F(pool 3x3). The training objective becomes Loss = Loss_CE + λ1 · ||w||²_2 + λ2 · E[latency], where the scaling factor λ2 > 0 controls the trade-off between accuracy and latency.
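A sketch of the latency penalty above using a lookup-table latency predictor F (assuming torch; the per-operation latency numbers are made-up placeholders, and the weight-decay term λ1·||w||² is omitted for brevity):

    import torch
    import torch.nn.functional as F

    # Hypothetical latency lookup table (milliseconds) for one learnable block.
    latency_table = {"conv3x3": 4.0, "conv5x5": 9.0, "identity": 0.1, "pool3x3": 1.5}
    ops = list(latency_table)

    alpha = torch.zeros(len(ops), requires_grad=True)    # architecture parameters of the block

    def expected_latency(alpha):
        probs = F.softmax(alpha, dim=0)                  # op-selection probabilities
        lat = torch.tensor([latency_table[o] for o in ops])
        return (probs * lat).sum()                       # E[latency_i], differentiable w.r.t. alpha

    task_loss = torch.tensor(1.0)                        # stand-in for the cross-entropy loss
    lambda2 = 0.05
    loss = task_loss + lambda2 * expected_latency(alpha) # Loss = Loss_CE + lambda2 * E[latency]
    loss.backward()
    print(alpha.grad)                                    # gradient contributed by the latency penalty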
  • 68. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Search Strategy Evolutionary search 68 image source: 1 • Mimic the evolution process in biology image source: 2
  • 69. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Search Strategy Evolutionary search: fi tness function 69 t = 10ms a = 80% t = 8ms, a=81% t = 12ms, a=78% Re-sample Keep Arch. Sample Latency/Accuracy Feedback Sub Network OFA Network PVNAS: 3D Neural Architecture Search with Point-Voxel Convolution [Liu et al., TPAMI 2021] • Fitness function: F(accuracy, e ffi ciency)
  • 70. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Search Strategy Evolutionary search: mutation 70 Once-for-All: Train One Network and Specialize it for E ffi cient Deployment [Cai et al., ICLR 2020] Input MB6 3x3 MB6 3x3 MB6 3x3 MB6 3x3 MB6 3x3 MB6 3x3 Output Mutation on depth Input MB6 3x3 MB6 3x3 MB6 3x3 MB6 3x3 MB6 3x3 MB6 3x3 Output Stage 1: depth = 3 Stage 2: depth = 3 Stage 1: depth = 2 Stage 2: depth = 4
  • 71. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Search Strategy Evolutionary search: mutation 71 Once-for-All: Train One Network and Specialize it for E ffi cient Deployment [Cai et al., ICLR 2020] Input MB6 3x3 MB6 3x3 MB6 3x3 MB6 3x3 MB6 3x3 MB6 3x3 Output Mutation on operator MB6 5x5 MB6 5x5 Input MB6 3x3 MB6 3x3 MB6 7x7 Output MB4 5x5 Stage 1: depth = 3 Stage 2: depth = 3 Stage 1: depth = 3 Stage 2: depth = 3
  • 72. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Search Strategy Evolutionary search: crossover 72 Once-for-All: Train One Network and Specialize it for E ffi cient Deployment [Cai et al., ICLR 2020] Input MB6 3x3 MB6 3x3 MB6 3x3 MB6 3x3 MB6 3x3 MB6 3x3 Output MB6 5x5 MB6 5x5 Input MB6 3x3 MB6 3x3 MB6 7x7 Output MB4 5x5 + Crossover Idea: Randomly choose one operator among two choices (from the parents) for each layer. MB6 5x5 Input MB6 3x3 MB6 3x3 MB6 3x3 MB6 7x7 Output MB6 3x3 From parent 1 From parent 2 From parent 1 From parent 1 From parent 2 From parent 2
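A toy sketch of the evolutionary loop with the mutation and crossover operators from slides 69–72 (plain Python; the architecture encoding — per-stage depth plus per-layer kernel size — and the fitness function are simplified assumptions, whereas in practice fitness would come from the measured accuracy and latency of sub-networks):

    import random

    DEPTHS, KERNELS = [2, 3, 4], [3, 5, 7]

    def random_arch():
        return {"depth": [random.choice(DEPTHS) for _ in range(2)],
                "kernel": [random.choice(KERNELS) for _ in range(8)]}

    def mutate(arch, p=0.2):
        # Resample each gene with probability p (mutation on depth and on operator).
        return {"depth": [random.choice(DEPTHS) if random.random() < p else d for d in arch["depth"]],
                "kernel": [random.choice(KERNELS) if random.random() < p else k for k in arch["kernel"]]}

    def crossover(a, b):
        # Randomly take each gene from one of the two parents.
        return {k: [random.choice(pair) for pair in zip(a[k], b[k])] for k in a}

    def fitness(arch):  # toy stand-in for F(accuracy, efficiency)
        return 3 * sum(arch["depth"]) - sum(arch["kernel"]) + random.random()

    population = [random_arch() for _ in range(20)]
    for _ in range(10):                                   # evolve for a few generations
        population.sort(key=fitness, reverse=True)
        parents = population[:10]
        children = [mutate(random.choice(parents)) for _ in range(5)]
        children += [crossover(*random.sample(parents, 2)) for _ in range(5)]
        population = parents + children
    print(fitness(population[0]))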
  • 73. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Summary of Today’s Lecture • Primitive operations • Classic building blocks • NAS search space • Design the search space • NAS search strategy • We will cover in later lectures: • Performance estimation strategy in NAS • Hardware-aware NAS • Zero-shot NAS • Neural-hardware architecture search • NAS applications 73