MIT 6.5940: TinyML and Efficient Deep Learning Computing (https://efficientml.ai)
EfficientML.ai Lecture 07
Neural Architecture Search
Part I
Song Han
Associate Professor, MIT
Distinguished Scientist, NVIDIA
@SongHan_MIT
Lecture Plan
Today we will:
1. Review primitive operations of deep neural networks.
2. Introduce popular building blocks.
3. Introduce Neural Architecture Search (NAS), an automatic technique for designing neural network architectures.
Trade-off Between Efficiency and Accuracy
[Figure: neural network design must balance accuracy against efficiency, i.e., storage, latency, and energy.]
Neural Architecture Search
• Primitive operations
• Classic building blocks
• Introduction to neural architecture search (NAS)
• What is NAS?
• Search space
• Design the search space
• Search strategy
• Efficient and Hardware-aware NAS
• Performance estimation strategy
• Hardware-aware NAS
• Zero-shot NAS
• Neural-hardware architecture co-search
• NAS applications
• NLP, GAN, point cloud, pose
Recap: Primitive Operations
Fully-connected layer / linear layer (multilayer perceptron, MLP)
• Notation: n = batch size, ci = input channels, co = output channels.
• Shape of tensors:
• Input features:  X : (n, ci)
• Output features: Y : (n, co)
• Weights:         W : (co, ci)
• Bias:            b : (co,)
[Figure: an MLP layer computes Y = X · W^T + b, mapping an (n, ci) input to an (n, co) output.]
Recap: Primitive Operations
Fully-connected layer / linear layer

Layer          MACs (batch size n = 1; bias is ignored)
Linear layer   co · ci
Recap: Primitive Operations
Convolution layer
• Notation: n = batch size, ci / co = input / output channels, hi / ho = input / output height, wi / wo = input / output width, kh / kw = kernel height / width.
• Shape of tensors:
• 2D convolution: input X : (n, ci, hi, wi), output Y : (n, co, ho, wo), weights W : (co, ci, kh, kw), bias b : (co,)
• 1D convolution: input X : (n, ci, wi), output Y : (n, co, wo), weights W : (co, ci, kw)
[Figure: a convolution slides a kh x kw kernel over the spatial dimension(s); each output channel combines all ci input channels.]
Recap: Primitive Operations
Convolution layer

Layer          MACs (batch size n = 1; bias is ignored)
Linear layer   co · ci
Convolution    co · ci · kh · kw · ho · wo
Recap: Primitive Operations
Grouped convolution layer
• Notation: g = number of groups, s = stride, p = padding.
• Shape of tensors:
• Input features:  X : (n, ci, hi, wi)
• Output features: Y : (n, co, ho, wo)
• Weights:         W : (g · co/g, ci/g, kh, kw), i.e., each group convolves ci/g input channels into co/g output channels
• Bias:            b : (co,)
• Output spatial size: ho = (hi + 2p − kh) / s + 1 (and analogously for wo).
[Figure: with g = 2, the channels are split into two groups that are convolved independently; g = 1 recovers a standard convolution.]
Recap: Primitive Operations
Grouped convolution layer

Layer                 MACs (batch size n = 1; bias is ignored)
Linear layer          co · ci
Convolution           co · ci · kh · kw · ho · wo
Grouped convolution   co · ci · kh · kw · ho · wo / g
Recap: Primitive Operations
Depthwise convolution layer
• Shape of tensors (s = stride, p = padding):
• Input features:  X : (n, ci, hi, wi)
• Output features: Y : (n, co, ho, wo), with co = ci
• Weights:         W : (c, kh, kw), where c = ci = co, i.e., one kh x kw filter per channel
• Bias:            b : (co,)
• Output spatial size: ho = (hi + 2p − kh) / s + 1 (and analogously for wo).
[Figure: each input channel is convolved with its own kernel, producing exactly one output channel.]
Recap: Primitive Operations
Depthwise convolution layer

Layer                   MACs (batch size n = 1; bias is ignored)
Linear layer            co · ci
Convolution             co · ci · kh · kw · ho · wo
Grouped convolution     co · ci · kh · kw · ho · wo / g
Depthwise convolution   co · kh · kw · ho · wo
Recap: Primitive Operations
1x1 convolution layer

Layer                   MACs (batch size n = 1; bias is ignored)
Linear layer            co · ci
Convolution             co · ci · kh · kw · ho · wo
Grouped convolution     co · ci · kh · kw · ho · wo / g
Depthwise convolution   co · kh · kw · ho · wo
1x1 convolution         co · ci · ho · wo
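To make the table concrete, here is a minimal helper (not from the slides) that reproduces these MAC formulas; the example shapes are arbitrary:

```python
def conv2d_macs(ci, co, kh, kw, ho, wo, groups=1):
    """MACs of a 2D convolution for batch size n = 1, bias ignored."""
    return co * (ci // groups) * kh * kw * ho * wo

def linear_macs(ci, co):
    return co * ci

# Example numbers: ci = co = 64 channels, 56x56 output, 3x3 kernel.
ci = co = 64; ho = wo = 56; kh = kw = 3
print(linear_macs(ci, co))                             # linear layer: co * ci
print(conv2d_macs(ci, co, kh, kw, ho, wo))             # standard convolution
print(conv2d_macs(ci, co, kh, kw, ho, wo, groups=2))   # grouped convolution (g = 2)
print(conv2d_macs(ci, co, kh, kw, ho, wo, groups=ci))  # depthwise convolution (g = ci = co)
print(conv2d_macs(ci, co, 1, 1, ho, wo))               # 1x1 convolution
```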
Classic Building Blocks
ResNet50: bottleneck block
Deep Residual Learning for Image Recognition [He et al., CVPR 2016]
• Reduce the number of channels by 4x via a 1x1 convolution.
• Feed the reduced feature map to a 3x3 convolution.
• Expand the number of channels back via a 1x1 convolution.

#MACs (2048 input/output channels, 512 bottleneck channels, H x W feature map):
• 1x1 reduce:  2048 x 512 x H x W x 1
• 3x3 conv:    512 x 512 x H x W x 9
• 1x1 expand:  2048 x 512 x H x W x 1
• Total:       512 x 512 x H x W x 17
• A plain 3x3 convolution at 2048 channels would cost 2048 x 2048 x H x W x 9 = 512 x 512 x H x W x 144, so the bottleneck gives an 8.5x reduction.

#Params follow the same ratio (drop the H x W factor):
• 512 x 512 x 17 for the bottleneck vs. 2048 x 2048 x 9 = 512 x 512 x 144 for a plain 3x3 convolution: an 8.5x reduction.
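For reference, a minimal PyTorch-style sketch of such a bottleneck block; the 2048/512 channel counts follow the MAC example above, while layer names and normalization details are illustrative rather than the exact ResNet-50 code:

```python
import torch.nn as nn

class Bottleneck(nn.Module):
    """Sketch of a ResNet-style bottleneck: 1x1 reduce -> 3x3 -> 1x1 expand, plus a residual connection."""
    def __init__(self, channels=2048, reduction=4):
        super().__init__()
        mid = channels // reduction                          # 2048 // 4 = 512
        self.reduce = nn.Conv2d(channels, mid, kernel_size=1, bias=False)
        self.conv3x3 = nn.Conv2d(mid, mid, kernel_size=3, padding=1, bias=False)
        self.expand = nn.Conv2d(mid, channels, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(mid)
        self.bn2 = nn.BatchNorm2d(mid)
        self.bn3 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.reduce(x)))
        out = self.relu(self.bn2(self.conv3x3(out)))
        out = self.bn3(self.expand(out))
        return self.relu(out + x)                            # residual connection
```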
Classic Building Blocks
ResNeXt: grouped convolution
Aggregated Residual Transformations for Deep Neural Networks [Xie et al., CVPR 2017]
• Replace the 3x3 convolution with a 3x3 grouped convolution.
• This is equivalent to a multi-path block: the groups form parallel branches whose outputs are aggregated.
Classic Building Blocks
MobileNet: depthwise-separable block
MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications [Howard et al., arXiv 2017]
• Depthwise convolution is an extreme case of grouped convolution where the group number equals the number of input channels.
• Use depthwise convolution to capture spatial information.
• Use 1x1 convolution to fuse/exchange information across different channels.
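A minimal sketch of a depthwise-separable block as described above (assuming PyTorch; names and normalization details are illustrative):

```python
import torch.nn as nn

def depthwise_separable(ci, co, stride=1):
    """Depthwise 3x3 (spatial mixing) followed by pointwise 1x1 (channel mixing)."""
    return nn.Sequential(
        nn.Conv2d(ci, ci, kernel_size=3, stride=stride, padding=1, groups=ci, bias=False),  # depthwise: groups = ci
        nn.BatchNorm2d(ci),
        nn.ReLU(inplace=True),
        nn.Conv2d(ci, co, kernel_size=1, bias=False),                                       # pointwise: fuse channels
        nn.BatchNorm2d(co),
        nn.ReLU(inplace=True),
    )
```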
Classic Building Blocks
MobileNetV2: inverted bottleneck block
MobileNetV2: Inverted Residuals and Linear Bottlenecks [Sandler et al., CVPR 2018]
• Depthwise convolution has a much lower capacity compared to normal convolution.
• Increase the depthwise convolution's input and output channels to improve its capacity.
• Depthwise convolution's cost only grows linearly. Therefore, the cost is still affordable.
[Figure: a 1x1 conv expands N feature channels to 6N, a 3x3 depthwise conv operates at 6N channels, and a 1x1 conv projects back to N.]
Classic Building Blocks
MobileNetV2: inverted bottleneck block
MobileNetV2: Inverted Residuals and Linear Bottlenecks [Sandler et al., CVPR 2018]
#MACs (N = 160, H x W feature map):
• 1x1 expand (160 -> 960):       160 x 960 x H x W x 1
• 3x3 depthwise (960 channels):  960 x H x W x 9
• 1x1 project (960 -> 160):      160 x 960 x H x W x 1
• Total:                         960 x H x W x 329
• A plain 3x3 convolution at N = 160 channels costs 160 x 160 x H x W x 9 = 960 x H x W x 240, i.e., a ratio of 1 : 1.37; the inverted bottleneck is only modestly more expensive while operating at 6x more channels.
#Params follow the same ratio (drop the H x W factor): 960 x 329 vs. 960 x 240, again 1 : 1.37.
Classic Building Blocks
MobileNetV2: inverted bottleneck block
MobileNetV2: Inverted Residuals and Linear Bottlenecks [Sandler et al., CVPR 2018]
• Depthwise convolution has a much lower capacity compared to normal convolution.
• Increase the depthwise convolution's input and output channels to improve its capacity.
• Depthwise convolution's cost only grows linearly. Therefore, the cost is still affordable.
• However, this design is not memory-efficient for both inference and training.
[Figure, inference (INT8 parameters and activations, batch size 1): MobileNetV2-0.75 has 4.6x fewer parameters than ResNet-18 but a 1.8x larger peak activation.]
[Figure, training (FP32 parameters and activations, batch size 16): MobileNetV2-1.4x has 4.3x fewer parameters than ResNet-50 (24 MB vs. 102 MB) but 1.1x larger activation memory (707 MB vs. 626 MB).]
Classic Building Blocks
ShuffleNet: 1x1 group convolution & channel shuffle
ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices [Zhang et al., CVPR 2018]
• Further reduce the cost by replacing the 1x1 convolution with a 1x1 group convolution.
• Exchange information across different groups via channel shuffle.
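The channel shuffle itself is just a reshape and transpose; a minimal sketch (assuming PyTorch):

```python
import torch

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Interleave channels across groups so the next grouped conv sees all groups."""
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w)   # split channels into groups
    x = x.transpose(1, 2).contiguous()         # swap the group and per-group channel axes
    return x.view(n, c, h, w)                  # flatten back

# Example usage (hypothetical shapes):
x = torch.randn(1, 8, 4, 4)
y = channel_shuffle(x, groups=2)
```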
Classic Building Blocks
Transformer: Multi-Head Self-Attention (MHSA)
Attention Is All You Need [Vaswani et al., NeurIPS 2017]
• Project Q, K and V with h different, learned linear projections.
• Perform the scaled dot-product attention function on each of these projected versions of Q, K and V in parallel.
• Concatenate the output values.
• Project the concatenated values again, resulting in the final values.
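A minimal sketch of multi-head self-attention following these four steps (assuming PyTorch; the embedding dimension and head count are illustrative):

```python
import torch, torch.nn as nn

class MHSA(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.heads, self.dk = heads, dim // heads
        self.q, self.k, self.v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x):                                   # x: (batch, seq, dim)
        b, t, d = x.shape
        def split(z):                                       # (b, t, d) -> (b, heads, t, dk)
            return z.view(b, t, self.heads, self.dk).transpose(1, 2)
        q, k, v = split(self.q(x)), split(self.k(x)), split(self.v(x))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.dk ** 0.5, dim=-1)  # scaled dot-product
        y = attn @ v                                        # (b, heads, t, dk)
        y = y.transpose(1, 2).reshape(b, t, d)              # concatenate the heads
        return self.out(y)                                  # final output projection
```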
From Manual Design to Automatic Design
Huge design space, manual design is unscalable
• Design dimensions: # channels, # layers, kernel size, resolution, connectivity.
Manually-Designed Neural Networks
Accuracy-efficiency trade-off on ImageNet
Model Compression and Hardware Acceleration for Neural Networks: A Comprehensive Survey [Deng et al., IEEE 2020]
[Figure: ImageNet top-1 accuracy (69-81%) vs. MACs (0-9 billion) and model size (2M-64M parameters) for handcrafted networks: MobileNetV1/V2, ShuffleNet, IGCV3-D, InceptionV2/V3, DenseNet-121/169/264, ResNet-50/101, ResNeXt-50/101, DPN-92, Xception.]
From Manual Design to Automatic Design
Huge design space, manual design is unscalable
Once-for-All: Train One Network and Specialize it for Efficient Deployment [Cai et al., ICLR 2020]
[Figure: the same ImageNet accuracy-vs-MACs plot, now including AutoML-designed models (NASNet-A, DARTS, PNASNet, AmoebaNet, ProxylessNAS, MBNetV3, EfficientNet, Once-for-All) alongside the handcrafted ones; the automatically designed models achieve better accuracy-efficiency trade-offs.]
Illustration of NAS
Components and goal
Neural Architecture Search: A Survey [Elsken et al., JMLR 2019]
• The goal of NAS is to find the best neural network architecture in the search space, maximizing the objective of interest (e.g., accuracy, efficiency, etc.).

Search Space
Neural Architecture Search: A Survey [Elsken et al., JMLR 2019]
• The search space is a set of candidate neural network architectures.
• Cell-level
• Network-level
Search Space
Cell-level search space
Learning Transferable Architectures for Scalable Image Recognition [Zoph et al., CVPR 2018]
[Figure: the best convolutional cells found by NASNet-A (B = 5 blocks, identified on CIFAR-10). The overall architecture stacks Normal Cells and Reduction Cells, with the number of repeated Normal Cells N varied per experiment. Each cell takes the two previous hidden states (or the input image) as input; each block applies two primitive operations (e.g., separable 3x3/5x5/7x7 convolutions, average/max pooling, identity) and combines them with add; the cell output is the concatenation of all resulting branches.]
Search Space
Cell-level search space
Learning Transferable Architectures for Scalable Image Recognition [Zoph et al., CVPR 2018]
[Figure: the controller RNN recursively constructs one block of a convolutional cell. Each block requires 5 discrete choices, each predicted by a softmax layer, so a cell of B blocks needs 5B softmax layers; in the experiments, B = 5. Example constructed block: add(3x3 conv on hidden layer A, 2x2 maxpool on hidden layer B).]
• Left: an RNN controller generates the candidate cells by:
• finding two inputs
• selecting two input transformation operations (e.g., convolution / pooling / identity)
• selecting the method to combine the results.
• Right: a cell generated after one step.
Search Space
Cell-level search space
Learning Transferable Architectures for Scalable Image Recognition [Zoph et al., CVPR 2018]
• Question: Assuming that we have two candidate inputs, M candidate operations to transform the inputs, and N potential operations to combine hidden states, what is the size of the search space in NASNet if we have B layers?
• Hint: Consider it step by step!
Search Space
Cell-level search space
Learning Transferable Architectures for Scalable Image Recognition [Zoph et al., CVPR 2018]
• Question: Assuming that we have two candidate inputs, M candidate operations to transform the inputs, and N potential operations to combine hidden states, what is the size of the search space in NASNet if we have B layers?
• Answer: each block makes five independent choices: two inputs (2 options each), two transformation operations (M options each), and one combination method (N options). The total is therefore (2 x 2 x M x M x N)^B = 4^B · M^(2B) · N^B.
• Assuming M = 5, N = 2, B = 5, we have 4^5 · 5^10 · 2^5 ≈ 3.2 x 10^11 candidates in the design space.
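A quick numeric check of that count:

```python
# Search-space size for a NASNet-style cell: 5 choices per block, B blocks.
M, N, B = 5, 2, 5                     # transformation ops, combination methods, blocks
per_block = 2 * 2 * M * M * N         # two input choices, two op choices, one combination choice
print(per_block ** B)                 # 320000000000, i.e., about 3.2e11
```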
Search Space
Network-level search space: depth dimension
Once-for-All: Train One Network and Specialize it for Efficient Deployment [Cai et al., ICLR 2020]
[Figure: example backbones split into an input stem, four stages of repeated blocks (bottleneck blocks, MBConv blocks, or attention blocks), and a prediction head. The depth dimension searches how many blocks each stage repeats, with per-stage depth choices such as [1,2,3], [2,3,4], [3,5,7,9], [1,2,3].]
Search Space
Network-level search space: resolution dimension
Once-for-All: Train One Network and Specialize it for Efficient Deployment [Cai et al., ICLR 2020]
[Figure: the same backbone, with the input resolution searched over candidates such as (3,128,128); (3,160,160); (3,192,192); (3,224,224); (3,256,256).]
Search Space
Network-level search space: width dimension
Once-for-All: Train One Network and Specialize it for Efficient Deployment [Cai et al., ICLR 2020]
[Figure: the same backbone, with each stage's channel width searched over candidates such as [48,64,96], [192,256,384], [384,512,768], [640,1024,1600], [1280,2048,3200].]
Search Space
Network-level search space: kernel size dimension
ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware [Cai et al., ICLR 2019]
• For models that use depthwise convolution, we can choose the kernel size (e.g., 3x3, 5x5, 7x7) for each depthwise convolution.
[Figure: (1) the efficient mobile architecture and (2) the efficient CPU architecture found by ProxylessNAS; each is a stack of MBConv blocks (MB1/MB3/MB4/MB6) whose depthwise kernel sizes vary per layer between 3x3, 5x5, and 7x7.]
Search Space
Network-level search space: topology connection
Auto-DeepLab: Hierarchical Neural Architecture Search for Semantic Image Segmentation [Liu et al., CVPR 2019]
• Left: the network-level search space in Auto-DeepLab, a grid of candidate layers over downsampling rates (4, 8, 16, 32) and network depth (L = 12 layers). Each path along the blue nodes corresponds to a network architecture in the search space.
• Right: representative manual designs can be represented using this formulation, e.g., DeepLabv3, U-Net-style Conv-Deconv, Stacked Hourglass.
Design the Search Space
Design the search space for TinyML
MCUNet: Tiny Deep Learning on IoT Devices [Lin et al., NeurIPS 2020]
• Search space design is crucial for NAS performance.
• TinyNAS: (1) automated search space optimization, then (2) resource-constrained model specialization within the optimized network space.
Design the Search Space
Design the search space for TinyML: memory is important for TinyML
MCUNet: Tiny Deep Learning on IoT Devices [Lin et al., NeurIPS 2020]
• Different from mobile AI: in addition to the latency and energy constraints, TinyML adds a tight memory constraint.
[Figure: memory capacity (GB) of an NVIDIA V100 GPU vs. an iPhone 11 vs. an STM32F746 microcontroller; the microcontroller has orders of magnitude less memory.]
Design the Search Space
Design the search space for TinyML
MCUNet: Tiny Deep Learning on IoT Devices [Lin et al., NeurIPS 2020]
• Analyze the FLOPs distribution of the models that satisfy the resource constraints in each candidate search space: larger FLOPs -> larger model capacity -> more likely to give higher accuracy, so prefer the search space whose satisfying models have larger FLOPs.
• A better search space leads to better final accuracy.
• Discussion: what are the advantages / disadvantages of these two methods (RegNet, MCUNet) for search space design?
Illustration of NAS
Search strategy
Neural Architecture Search: A Survey [Elsken et al., JMLR 2019]
• Search space is a set of candidate neural network architectures (cell-level, network-level).
• Search strategy defines how to explore the search space:
• Grid search
• Random search
• Reinforcement learning
• Gradient descent
• Evolutionary search
Search Strategy
Grid search
• Grid search is the traditional way of hyper-parameter optimization. The entire design space is represented as the Cartesian product of single-dimension design spaces (e.g., resolution in [1.0x, 1.1x, 1.2x], width in [1.0x, 1.1x, 1.2x]).
• To obtain the accuracy of each candidate network, we train them from scratch.

Accuracy of the candidate networks over the width x resolution grid (some cells satisfy the latency constraint, others break it):
          1.0x     1.1x     1.2x
1.0x      50.0%    53.0%    54.9%
1.1x      51.0%    53.5%    55.4%
1.2x      52.0%    54.1%    56.2%
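A minimal sketch of such a constrained grid search; train_and_eval and measure_latency are hypothetical stand-ins for the expensive steps:

```python
import itertools

# Hypothetical stand-ins for the expensive steps (training and latency measurement).
def train_and_eval(width, resolution):      # returns validation accuracy
    return 0.50 + 0.2 * (width - 1.0) + 0.3 * (resolution - 1.0)   # toy model of the table above

def measure_latency(width, resolution):     # returns latency in ms
    return 10.0 * width * resolution ** 2

LATENCY_CONSTRAINT_MS = 13.0

best = None
for w, r in itertools.product([1.0, 1.1, 1.2], [1.0, 1.1, 1.2]):   # Cartesian product of the axes
    if measure_latency(w, r) > LATENCY_CONSTRAINT_MS:              # skip candidates that break the constraint
        continue
    acc = train_and_eval(w, r)                                     # in reality: train each candidate from scratch
    if best is None or acc > best[0]:
        best = (acc, w, r)
print(best)
```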
Search Strategy
Grid search
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks [Tan and Le, ICML 2019]
• EfficientNet applies compound scaling of depth, width and resolution to a starting network. It performs a grid search over α, β, γ such that the total FLOPs of the new model will be 2x that of the original network.
• Compound scaling with coefficient φ:
  depth d = α^φ, width w = β^φ, resolution r = γ^φ, subject to α · β² · γ² ≈ 2 and α ≥ 1, β ≥ 1, γ ≥ 1.
• α, β, γ are constants determined by a small grid search; φ is a user-specified coefficient that controls how many more resources are available for model scaling. The FLOPs of a regular convolution are proportional to d, w², r², so scaling with coefficient φ multiplies FLOPs by roughly (α · β² · γ²)^φ ≈ 2^φ.
[Figure: model scaling. (a) baseline network; (b)-(d) conventional scaling of width, depth, or resolution alone; (e) compound scaling of all three dimensions with a fixed ratio.]
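A toy sketch of the small grid search over (α, β, γ) under the FLOPs constraint; evaluate_scaled_model is a hypothetical stand-in and the grid values are illustrative:

```python
import itertools

def evaluate_scaled_model(depth, width, resolution):
    """Hypothetical stand-in: in EfficientNet this means training the scaled baseline and measuring accuracy."""
    return 0.76 + 0.01 * (depth + width + resolution)       # toy score

grid = [1.0 + 0.1 * i for i in range(6)]                    # illustrative grid: 1.0 ... 1.5
best = None
for a, b, g in itertools.product(grid, repeat=3):           # alpha, beta, gamma
    if not (abs(a * b**2 * g**2 - 2.0) < 0.1 and a >= 1 and b >= 1 and g >= 1):
        continue                                            # enforce alpha * beta^2 * gamma^2 ~= 2
    score = evaluate_scaled_model(a, b, g)
    if best is None or score > best[0]:
        best = (score, a, b, g)
print(best)
```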
Search Strategy
Random search
[Figure: illustration of grid search vs. random search sampling over a two-dimensional hyperparameter space.]
Search Strategy
Reinforcement learning
Neural Architecture Search with Reinforcement Learning [Zoph and Le, ICLR 2017]
• Model neural architecture design as a sequential decision-making problem.
• Use reinforcement learning to train the controller, which is implemented with an RNN.
[Figure: overview of RL-based NAS. The RNN controller samples an architecture as a sequence of tokens (e.g., filter height/width, stride, and number of filters per layer, each predicted by a softmax classifier and fed back as input); the sampled network is trained, and its validation accuracy is used as the reward to update the controller.]
Search Strategy
Gradient descent
DARTS: Differentiable Architecture Search [Liu et al., ICLR 2019]
• Represent the output at each node as a weighted sum of outputs from different edges (candidate operations).
• Let O be the set of candidate operations (e.g., convolution, max pooling, zero, where the zero operation indicates a lack of connection between two nodes). The categorical choice of an operation on edge (i, j) is relaxed to a softmax over all possible operations:
  mixed_o^(i,j)(x) = Σ_{o ∈ O} [ exp(α_o^(i,j)) / Σ_{o' ∈ O} exp(α_{o'}^(i,j)) ] · o(x)
• Architecture search then reduces to learning the continuous variables α = {α^(i,j)}. At the end of the search, each mixed operation is replaced by its most likely operation, o^(i,j) = argmax_o α_o^(i,j).
• The architecture α and the weights w inside the mixed operations are learned jointly as a bilevel optimization: find α* minimizing the validation loss L_val(w*, α*), where w* = argmin_w L_train(w, α*), using gradient descent rather than RL or evolution.
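A minimal sketch of a DARTS-style mixed operation (assuming PyTorch; the candidate operation set is illustrative):

```python
import torch, torch.nn as nn

class MixedOp(nn.Module):
    """Weighted sum of candidate operations; the weights are a softmax over architecture parameters alpha."""
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),   # candidate: 3x3 conv
            nn.Conv2d(channels, channels, 5, padding=2),   # candidate: 5x5 conv
            nn.MaxPool2d(3, stride=1, padding=1),          # candidate: 3x3 max pool
            nn.Identity(),                                 # candidate: skip connection
        ])
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))   # architecture parameters for this edge

    def forward(self, x):
        weights = torch.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))
```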
Search Strategy
Gradient descent
ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware [Cai et al., ICLR 2019]
• It is also possible to take latency into account for gradient-based search.
• Here, F is a latency prediction model (typically a regressor or a lookup table). With such a formulation, we can calculate an additional gradient for the architecture parameters from the latency penalty term.
• Direct measurement of latency is expensive and slow; latency modeling is cheap, fast and differentiable.
• The expected latency of a learnable block is the probability-weighted sum of its candidate operations' latencies, and the network's expected latency sums over blocks:
  E[latency_i] = α · F(conv 3x3) + β · F(conv 5x5) + σ · F(identity) + ... + ζ · F(pool 3x3)
  E[latency] = Σ_i E[latency_i]
• The expected latency enters the loss as a regularization term:
  Loss = Loss_CE + λ1 · ||w||² + λ2 · E[latency]
  where the scaling factor λ2 (> 0) controls the trade-off between accuracy and latency.
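A minimal sketch of the differentiable expected-latency term (assuming PyTorch; the latency table values are made up):

```python
import torch

# Hypothetical lookup table: measured latency (ms) of each candidate op in one block.
latency_table = torch.tensor([3.1, 5.4, 0.1, 1.2])      # conv3x3, conv5x5, identity, pool3x3
alpha = torch.zeros(4, requires_grad=True)               # architecture parameters for this block

probs = torch.softmax(alpha, dim=0)                      # probability of picking each op
expected_latency = (probs * latency_table).sum()         # E[latency_i], differentiable w.r.t. alpha

task_loss = torch.tensor(0.0)                            # placeholder for cross-entropy + weight decay
lambda2 = 0.1
loss = task_loss + lambda2 * expected_latency
loss.backward()                                          # gradients flow into alpha through the latency term
print(alpha.grad)
```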
Search Strategy
Evolutionary search
• Mimic the evolution process in biology.

Evolutionary search: fitness function
PVNAS: 3D Neural Architecture Search with Point-Voxel Convolution [Liu et al., TPAMI 2021]
• Fitness function: F(accuracy, efficiency).
[Figure: sub-networks are sampled from an OFA network and evaluated for latency/accuracy feedback against a target (t = 10 ms, a = 80%); for example, a candidate with t = 8 ms, a = 81% is kept, while one with t = 12 ms, a = 78% is re-sampled.]
Search Strategy
Evolutionary search: mutation
Once-for-All: Train One Network and Specialize it for Efficient Deployment [Cai et al., ICLR 2020]
• Mutation on depth: e.g., a network with stage depths (3, 3) can mutate to stage depths (2, 4); the number of blocks per stage changes.
• Mutation on operator: e.g., individual MB6 3x3 blocks change their operator to MB6 5x5, MB4 5x5, or MB6 7x7, while the per-stage depth stays the same.
Search Strategy
Evolutionary search: crossover
Once-for-All: Train One Network and Specialize it for Efficient Deployment [Cai et al., ICLR 2020]
• Idea: randomly choose one operator among the two choices (from the parents) for each layer.
• E.g., where parent 1 has MB6 3x3 and parent 2 has MB6 5x5 at the same layer, the child picks one of the two at random, layer by layer.
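A minimal sketch of this evolutionary loop over per-layer operator choices; fitness is a hypothetical stand-in for F(accuracy, efficiency):

```python
import random

OPS = ["MB3 3x3", "MB3 5x5", "MB4 5x5", "MB6 3x3", "MB6 5x5", "MB6 7x7"]
NUM_LAYERS = 6

def random_arch():
    return [random.choice(OPS) for _ in range(NUM_LAYERS)]

def mutate(arch, prob=0.2):
    """Mutation on operator: each layer is re-sampled with some probability."""
    return [random.choice(OPS) if random.random() < prob else op for op in arch]

def crossover(parent1, parent2):
    """For each layer, randomly pick the operator from one of the two parents."""
    return [random.choice(pair) for pair in zip(parent1, parent2)]

def fitness(arch):
    """Hypothetical stand-in for F(accuracy, efficiency): reward capacity, penalize a latency proxy."""
    accuracy_proxy = sum(int(op[2]) for op in arch)          # 'MB6 ...' -> 6, crude capacity proxy
    latency_proxy = sum(int(op[-3]) ** 2 for op in arch)     # kernel size squared, crude cost proxy
    return accuracy_proxy - 0.1 * latency_proxy

population = [random_arch() for _ in range(20)]
for generation in range(10):
    population.sort(key=fitness, reverse=True)
    parents = population[:10]                                # keep the fittest half
    children = [crossover(*random.sample(parents, 2)) for _ in range(5)]
    mutants = [mutate(random.choice(parents)) for _ in range(5)]
    population = parents + children + mutants
print(max(population, key=fitness))
```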
Summary of Today's Lecture
• Primitive operations
• Classic building blocks
• NAS search space
• Design the search space
• NAS search strategy
• We will cover in later lectures:
• Performance estimation strategy in NAS
• Hardware-aware NAS
• Zero-shot NAS
• Neural-hardware architecture search
• NAS applications
References
1. Deep Residual Learning for Image Recognition [He et al., CVPR 2016]
2. ImageNet Classification with Deep Convolutional Neural Networks [Krizhevsky et al., NeurIPS 2012]
3. Very Deep Convolutional Networks for Large-scale Image Recognition [Simonyan et al., ICLR 2015]
4. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size [Iandola et al., arXiv 2016]
5. Aggregated Residual Transformations for Deep Neural Networks [Xie et al., CVPR 2017]
6. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications [Howard et al., arXiv 2017]
7. MobileNetV2: Inverted Residuals and Linear Bottlenecks [Sandler et al., CVPR 2018]
8. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices [Zhang et al., CVPR 2018]
9. Neural Architecture Search: A Survey [Elsken et al., JMLR 2019]
10. Learning Transferable Architectures for Scalable Image Recognition [Zoph et al., CVPR 2018]
11. DARTS: Differentiable Architecture Search [Liu et al., ICLR 2019]
12. MnasNet: Platform-Aware Neural Architecture Search for Mobile [Tan et al., CVPR 2019]
13. ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware [Cai et al., ICLR 2019]
14. FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search [Wu et al., CVPR 2019]
15. Designing Network Design Spaces [Radosavovic et al., CVPR 2020]
16. Single Path One-Shot Neural Architecture Search with Uniform Sampling [Guo et al., ECCV 2020]
17. Once-for-All: Train One Network and Specialize it for Efficient Deployment [Cai et al., ICLR 2020]
18. Auto-DeepLab: Hierarchical Neural Architecture Search for Semantic Image Segmentation [Liu et al., CVPR 2019]
19. NAS-FPN: Learning Scalable Feature Pyramid Architecture for Object Detection [Ghiasi et al., CVPR 2019]
20. Exploring Randomly Wired Neural Networks for Image Recognition [Xie et al., ICCV 2019]
21. MCUNet: Tiny Deep Learning on IoT Devices [Lin et al., NeurIPS 2020]
22. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks [Tan and Le, ICML 2019]
23. Neural Architecture Search with Reinforcement Learning [Zoph and Le, ICLR 2017]
24. PVNAS: 3D Neural Architecture Search with Point-Voxel Convolution [Liu et al., TPAMI 2021]
25. Regularized Evolution for Image Classifier Architecture Search [Real et al., AAAI 2019]
EfficientML.ai Lecture Neural Architecture Search

  • 1. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai EfficientML.ai Lecture 07 Neural Architecture Search Part I Song Han Associate Professor, MIT Distinguished Scientist, NVIDIA @SongHan_MIT
  • 2. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Lecture Plan Today we will: 1. Review primitive operations of deep neural networks. 2. Introduce popular building blocks. 3. Introduce Neural Architecture Search (NAS), an automatic technique for designing neural network architectures. 2
  • 3. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Trade-off Between Efficiency and Accuracy 3 Storage Latency Energy Image source: 1 Accuracy Image source: 2
  • 4. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Neural Architecture Search • Primitive operations • Classic building blocks • Introduction to neural architecture search (NAS) • What is NAS? • Search space • Design the search space • Search strategy • E ffi cient and Hardware-aware NAS • Performance estimation strategy • Hardware-aware NAS • Zero-shot NAS • Neural-hardware architecture co-search • NAS applications • NLP, GAN, point cloud, pose 4
  • 5. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Neural Architecture Search 5 • Primitive operations • Classic building blocks • Introduction to neural architecture search (NAS) • What is NAS? • Search space • Design the search space • Search strategy • E ffi cient and Hardware-aware NAS • Performance estimation strategy • Hardware-aware NAS • Zero-shot NAS • Neural-hardware architecture co-search • NAS applications • NLP, GAN, point cloud, pose
  • 6. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Recap: Primitive Operations Fully-connected layer / linear layer 6 • Shape of Tensors: • Input Features • Output Features • Weights • Bias X : (n, ci) Y : (n, co) W : (co, ci) b : (co, ) y2 y1 y0 x4 x3 x2 x1 x0 w00 w42 z1 z0 Multilayer Perceptron (MLP) co ci = co WT X Y ci n n n ci co Notations Batch Size Input Channels Output Channels
  • 7. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Recap: Primitive Operations Fully-connected layer / linear layer 7 Layer MACs (batch size n=1) Linear Layer co ⋅ ci co ci = co WT X Y ci n n y2 y1 y0 x4 x3 x2 x1 x0 w00 w42 n ci co hi, ho wi, wo kh, kw g Notations Batch Size Input Channels Output Channels Input/Output Height Input/Output Width Kernel Height/Width Groups * bias is ignored
  • 8. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Recap: Primitive Operations Convolution layer 8 • Shape of Tensors: • Input Features • Output Features • Weights • Bias X : (n, ci) Y : (n, co) W : (co, ci) b : (co, ) n ci co hi, ho wi, wo kh kw Notations Batch Size Input Channels Output Channels Input/Output Height Input/Output Width Kernel Height Kernel Width channel dimension s p a t i a l d i m e n s i o n ( s ) kh kw ci co hi wi wo ho Image source: 1 (n, ci, hi, wi) (n, co, ho, wo) 2D Conv (n, ci, wi) (n, co, wo) 1D Conv (co, ci, kh, kw) (co, ci, kw)
  • 9. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Recap: Primitive Operations Convolution layer 9 n ci co hi, ho wi, wo kh, kw g Notations Batch Size Input Channels Output Channels Input/Output Height Input/Output Width Kernel Height/Width Groups kh ci co ci co wo ho hi wi kw hi wi ho wo Image source: 1 Layer MACs (batch size n=1) Linear Layer Convolution co ⋅ ci co ⋅ ci ⋅ kh ⋅ kw ⋅ ho ⋅ wo * bias is ignored
  • 10. MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai Recap: Primitive Operations Grouped convolution layer 10 • Shape of Tensors: • Input Features X : (n, ci, hi, wi) • Output Features Y : (n, co, ho, wo) • Weights W : (g ⋅ co/g, ci/g, kh, kw) • Bias b : (co, ) • s is stride, p is padding • Output size: ho = (hi + 2p − kh)/s + 1 Notations: n Batch Size; ci Input Channels; co Output Channels; hi, ho Input/Output Height; wi, wo Input/Output Width; kh Kernel Height; kw Kernel Width; g Groups. [Figure: the filters are split along the channel dimension into g groups (g = 1 vs. g = 2 shown). Image source: 1]
  • 11. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Recap: Primitive Operations Grouped convolution layer 11 n ci co hi, ho wi, wo kh, kw g Notations Batch Size Input Channels Output Channels Input/Output Height Input/Output Width Kernel Height/Width Groups co ci g = 2 hi wi ho wo Image source: 1 Layer MACs (batch size n=1) Linear Layer Convolution Grouped Convolution co ⋅ ci co ⋅ ci ⋅ kh ⋅ kw ⋅ ho ⋅ wo co ⋅ ci ⋅ kh ⋅ kw ⋅ ho ⋅ wo/g * bias is ignored
  • 12. MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai Recap: Primitive Operations Depthwise convolution layer 12 • Shape of Tensors: • Input Features X : (n, ci, hi, wi) • Output Features Y : (n, co, ho, wo) • Weights W : (c, kh, kw), where c = ci = co • Bias b : (co, ) • s is stride, p is padding • Output size: ho = (hi + 2p − kh)/s + 1 Notations: n Batch Size; ci Input Channels; co Output Channels; hi, ho Input/Output Height; wi, wo Input/Output Width; kh Kernel Height; kw Kernel Width; g Groups. [Image source: 1, 2]
  • 13. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Recap: Primitive Operations Depthwise convolution layer 13 n ci co hi, ho wi, wo kh, kw g Notations Batch Size Input Channels Output Channels Input/Output Height Input/Output Width Kernel Height/Width Groups co ci hi wi ho wo Image source: 1 Layer MACs (batch size n=1) Linear Layer Convolution Grouped Convolution Depthwise Convolution co ⋅ ci co ⋅ ci ⋅ kh ⋅ kw ⋅ ho ⋅ wo co ⋅ ci ⋅ kh ⋅ kw ⋅ ho ⋅ wo/g co ⋅ kh ⋅ kw ⋅ ho ⋅ wo * bias is ignored
  • 14. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Recap: Primitive Operations 1x1 convolution layer 14 Layer MACs (batch size n=1) Linear Layer Convolution Grouped Convolution Depthwise Convolution 1x1 Convolution co ⋅ ci n ci co hi, ho wi, wo kh, kw g Notations Batch Size Input Channels Output Channels Input/Output Height Input/Output Width Kernel Height/Width Groups co ⋅ ci ⋅ kh ⋅ kw ⋅ ho ⋅ wo co ⋅ ci ⋅ kh ⋅ kw ⋅ ho ⋅ wo/g co ci co ⋅ kh ⋅ kw ⋅ ho ⋅ wo hi wi ho wo Image source: 1 co ⋅ ci ⋅ ho ⋅ wo * bias is ignored
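As a concrete companion to the MACs table built up on slides 7–14, here is a small plain-Python sketch (added for this write-up, not part of the original slides) that evaluates the per-layer MACs formulas for batch size n = 1; the example layer shapes at the bottom are illustrative assumptions.

    # MACs formulas from the table above (batch size n = 1, bias ignored).
    def linear_macs(ci, co):
        return co * ci

    def conv_macs(ci, co, kh, kw, ho, wo):
        return co * ci * kh * kw * ho * wo

    def grouped_conv_macs(ci, co, kh, kw, ho, wo, g):
        return co * ci * kh * kw * ho * wo // g

    def depthwise_conv_macs(co, kh, kw, ho, wo):
        return co * kh * kw * ho * wo

    def conv1x1_macs(ci, co, ho, wo):
        return co * ci * ho * wo

    # Illustrative example: 64 -> 64 channels on a 112x112 output feature map.
    ci = co = 64; ho = wo = 112
    print("3x3 conv      :", conv_macs(ci, co, 3, 3, ho, wo))
    print("3x3 depthwise :", depthwise_conv_macs(co, 3, 3, ho, wo))
    print("1x1 conv      :", conv1x1_macs(ci, co, ho, wo))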
  • 15. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Neural Architecture Search 15 • Primitive operations • Classic building blocks • Introduction to neural architecture search (NAS) • What is NAS? • Search space • Design the search space • Search strategy • E ffi cient and Hardware-aware NAS • Performance estimation strategy • Hardware-aware NAS • Zero-shot NAS • Neural-hardware architecture co-search • NAS applications • NLP, GAN, point cloud, pose
  • 16. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Classic Building Blocks ResNet50: bottleneck block 16 image source: link Deep Residual Learning for Image Recognition [He et al., CVPR 2016] Reduce the number of channels by 4x via 1x1 convolution
  • 17. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Classic Building Blocks ResNet50: bottleneck block 17 image source: link Deep Residual Learning for Image Recognition [He et al., CVPR 2016] Feed reduced feature map to 3x3 convolution
  • 18. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Classic Building Blocks ResNet50: bottleneck block 18 image source: link Deep Residual Learning for Image Recognition [He et al., CVPR 2016] Expand the number of channels via 1x1 convolution
  • 19. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Classic Building Blocks ResNet50: bottleneck block 19 image source: link Deep Residual Learning for Image Recognition [He et al., CVPR 2016] 2048 x 512 x H x W x 1 2048 x 512 x H x W x 1 512 x 512 x H x W x 9 #MACs
  • 20. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Classic Building Blocks ResNet50: bottleneck block 20 image source: link Deep Residual Learning for Image Recognition [He et al., CVPR 2016] 512 x 512 x H x W x 17 2048 x 512 x H x W x 1 2048 x 512 x H x W x 1 512 x 512 x H x W x 9 #MACs
  • 21. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Classic Building Blocks ResNet50: bottleneck block 21 image source: link Deep Residual Learning for Image Recognition [He et al., CVPR 2016] 512 x 512 x H x W x 17 2048 x 2048 x H x W x 9 = 512 x 512 x H x W x 144 8.5x reduction 2048 x 512 x H x W x 1 2048 x 512 x H x W x 1 512 x 512 x H x W x 9 #MACs
  • 22. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Classic Building Blocks ResNet50: bottleneck block 22 image source: link Deep Residual Learning for Image Recognition [He et al., CVPR 2016] 512 x 512 x 17 2048 x 2048 x 9 = 512 x 512 x 144 8.5x reduction 2048 x 512 x 1 2048 x 512 x 1 512 x 512 x 9 #Params
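The 8.5× figure on slides 21–22 can be verified with a few lines of arithmetic (plain Python, added here as a sanity check):

    # ResNet-50 bottleneck: 1x1 reduce (2048->512), 3x3 (512->512), 1x1 expand (512->2048).
    bottleneck_params = 2048 * 512 * 1 + 512 * 512 * 9 + 512 * 2048 * 1   # = 512 * 512 * 17
    plain_3x3_params  = 2048 * 2048 * 9                                   # = 512 * 512 * 144
    print(plain_3x3_params / bottleneck_params)   # ~8.47, i.e. the ~8.5x reduction on the slide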
  • 23. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Classic Building Blocks ResNeXt: grouped convolution • Replace 3x3 convolution with 3x3 grouped convolution. 23 Aggregated Residual Transformations for Deep Neural Networks [Xie et al., CVPR 2017]
  • 24. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Classic Building Blocks ResNeXt: grouped convolution • Replace 3x3 convolution with 3x3 grouped convolution. • Equivalent to a multi-path block. 24 Aggregated Residual Transformations for Deep Neural Networks [Xie et al., CVPR 2017]
  • 25. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Classic Building Blocks ResNeXt: grouped convolution • Replace 3x3 convolution with 3x3 grouped convolution. • Equivalent to a multi-path block. 25 Aggregated Residual Transformations for Deep Neural Networks [Xie et al., CVPR 2017]
  • 26. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Classic Building Blocks MobileNet: depthwise-separable block • Depthwise convolution is an extreme case of group convolution where the group number equals the number of input channels. 26 image source: link MobileNets: E ffi cient Convolutional Neural Networks for Mobile Vision Applications [Howard et al., arXiv 2017]
  • 27. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Classic Building Blocks MobileNet: depthwise-separable block • Depthwise convolution is an extreme case of group convolution where the group number equals the number of input channels. • Use depthwise convolution to capture spatial information. 27 image source: link MobileNets: E ffi cient Convolutional Neural Networks for Mobile Vision Applications [Howard et al., arXiv 2017]
  • 28. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai • Depthwise convolution is an extreme case of group convolution where the group number equals the number of input channels. • Use depthwise convolution to capture spatial information. • Use 1x1 convolution to fuse/exchange information across di ff erent channels. Classic Building Blocks MobileNet: depthwise-separable block 28 image source: link MobileNets: E ffi cient Convolutional Neural Networks for Mobile Vision Applications [Howard et al., arXiv 2017]
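A minimal PyTorch sketch of the depthwise-separable block described on slides 26–28 (assuming the torch package is available; channel counts are illustrative, and the BatchNorm/ReLU placement follows common practice rather than the slide itself):

    import torch
    import torch.nn as nn

    class DepthwiseSeparableConv(nn.Module):
        """3x3 depthwise conv (spatial mixing) followed by 1x1 conv (channel mixing)."""
        def __init__(self, in_ch, out_ch, stride=1):
            super().__init__()
            # groups=in_ch makes each filter see exactly one input channel.
            self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                       padding=1, groups=in_ch, bias=False)
            self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
            self.bn1, self.bn2 = nn.BatchNorm2d(in_ch), nn.BatchNorm2d(out_ch)
            self.act = nn.ReLU(inplace=True)

        def forward(self, x):
            x = self.act(self.bn1(self.depthwise(x)))
            return self.act(self.bn2(self.pointwise(x)))

    x = torch.randn(1, 32, 112, 112)
    print(DepthwiseSeparableConv(32, 64)(x).shape)  # torch.Size([1, 64, 112, 112])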
  • 29. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai • Depthwise convolution has a much lower capacity compared to normal convolution. • Increase the depthwise convolution's input and output channels to improve its capacity. • Depthwise convolution’s cost only grows linearly. Therefore, the cost is still a ff ordable. Classic Building Blocks MobileNetV2: inverted bottleneck block 29 MobileNetV2: Inverted Residuals and Linear Bottlenecks [Sandler et al., CVPR 2018] Image source: 1 Depthwise Convolution N * 6 N * 6 N Feature Channels N 1×1 Conv 3×3 DW-Conv 1×1 Conv Image source: 1
  • 30. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai • Depthwise convolution has a much lower capacity compared to normal convolution. • Increase the depthwise convolution's input and output channels to improve its capacity. • Depthwise convolution’s cost only grows linearly. Therefore, the cost is still a ff ordable. Classic Building Blocks MobileNetV2: inverted bottleneck block 30 MobileNetV2: Inverted Residuals and Linear Bottlenecks [Sandler et al., CVPR 2018] Image source: 1 Depthwise Convolution N * 6 N * 6 N Feature Channels N 1×1 Conv 3×3 DW-Conv 1×1 Conv Image source: 1 160 x 960 x H x W x 1 160 x 960 x H x W x 1 960 x H x W x 9 #MACs (N = 160) 960 x H x W x 329 160 x 160 x H x W x 9 = 960 x H x W x 240 1 : 1.37
  • 31. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai • Depthwise convolution has a much lower capacity compared to normal convolution. • Increase the depthwise convolution's input and output channels to improve its capacity. • Depthwise convolution’s cost only grows linearly. Therefore, the cost is still a ff ordable. Classic Building Blocks MobileNetV2: inverted bottleneck block 31 MobileNetV2: Inverted Residuals and Linear Bottlenecks [Sandler et al., CVPR 2018] Image source: 1 Depthwise Convolution N * 6 N * 6 N Feature Channels N 1×1 Conv 3×3 DW-Conv 1×1 Conv Image source: 1 160 x 960 x 1 160 x 960 x 1 960 x 9 #Params (N = 160) 960 x 329 160 x 160 x 9 = 960 x 240 1 : 1.37
  • 32. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai • Depthwise convolution has a much lower capacity compared to normal convolution. • Increase the depthwise convolution's input and output channels to improve its capacity. • Depthwise convolution’s cost only grows linearly. Therefore, the cost is still a ff ordable. • However, this design is not memory-e ffi cient for both inference and training. Classic Building Blocks MobileNetV2: inverted bottleneck block 32 Depthwise Convolution N * 6 N * 6 N Feature Channels N 1×1 Conv 3×3 DW-Conv 1×1 Conv MobileNetV2: Inverted Residuals and Linear Bottlenecks [Sandler et al., CVPR 2018] Image source: 1 0 200 400 600 800 Param (MB) Activation (MB) 626 24 707 102 ResNet-50 MobileNetV2-1.4× 4.3× 1.1× * All parameters and activations are Floating-Point numbers (32 bits). Batch size is 16. 0 2.4 4.8 7.2 9.6 12 Param (MB) Peak Activation (MB) ResNet-18 MobileNetV2-0.75 4.6× 1.8× * All parameters and activations are Integer numbers (8 bits). Batch size is 1. Inference Training
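A sketch of the inverted bottleneck in PyTorch, following the 1×1 expand → 3×3 depthwise → 1×1 linear projection structure with expansion ratio 6 from slides 29–32 (assuming torch; normalization and activation details follow common practice and are not specified on the slides):

    import torch
    import torch.nn as nn

    class InvertedBottleneck(nn.Module):
        """1x1 expand (N -> 6N), 3x3 depthwise, 1x1 linear projection (6N -> N)."""
        def __init__(self, channels, expand_ratio=6, stride=1):
            super().__init__()
            hidden = channels * expand_ratio
            self.use_residual = (stride == 1)
            self.block = nn.Sequential(
                nn.Conv2d(channels, hidden, 1, bias=False),
                nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
                nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1,
                          groups=hidden, bias=False),        # depthwise
                nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
                nn.Conv2d(hidden, channels, 1, bias=False),   # linear bottleneck
                nn.BatchNorm2d(channels),
            )

        def forward(self, x):
            out = self.block(x)
            return x + out if self.use_residual else out

    x = torch.randn(1, 160, 14, 14)
    print(InvertedBottleneck(160)(x).shape)  # torch.Size([1, 160, 14, 14])

Note how the large 6N-channel activation in the middle of the block is exactly what makes this design memory-hungry, as slide 32 points out.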
  • 33. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Classic Building Blocks Shu ffl eNet: 1x1 group convolution & channel shu ffl e • Further reduce the cost by replacing 1x1 convolution with 1x1 group convolution. • Exchange information across di ff erent groups via channel shu ffl e. 33 Shu ffl eNet: An Extremely E ffi cient Convolutional Neural Network for Mobile Devices [Zhang et al., CVPR 2018]
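The channel shuffle operation on this slide can be written in a few lines of PyTorch (a sketch assuming torch, using a reshape–transpose–flatten trick):

    import torch

    def channel_shuffle(x, groups):
        """Interleave channels across groups so that 1x1 group convs can exchange information."""
        n, c, h, w = x.shape
        assert c % groups == 0
        # (n, g, c/g, h, w) -> swap group and channel axes -> flatten back to (n, c, h, w)
        x = x.view(n, groups, c // groups, h, w)
        x = x.transpose(1, 2).contiguous()
        return x.view(n, c, h, w)

    x = torch.arange(8).float().view(1, 8, 1, 1)
    print(channel_shuffle(x, groups=2).flatten().tolist())
    # [0, 4, 1, 5, 2, 6, 3, 7] -- channels from the two groups are interleaved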
  • 34. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Classic Building Blocks Transformer: Multi-Head Self-Attention (MHSA) 34 Attention Is All You Need [Vaswani et al., NeurIPS 2017] • Project Q, K and V with h di ff erent, learned linear projections. • Perform the scaled dot-product attention function on each of these projected versions of Q, K and V in parallel. • Concatenate the output values. • Project the output values again, resulting in the fi nal values. Scaled Dot- Product Attention
  • 35. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Classic Building Blocks Transformer: Multi-Head Self-Attention (MHSA) 35 Attention Is All You Need [Vaswani et al., NeurIPS 2017] • Project Q, K and V with h di ff erent, learned linear projections. • Perform the scaled dot-product attention function on each of these projected versions of Q, K and V in parallel. • Concatenate the output values. • Project the output values again, resulting in the fi nal values. Image source: 1
  • 36. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Classic Building Blocks Transformer: Multi-Head Self-Attention (MHSA) 36 Attention Is All You Need [Vaswani et al., NeurIPS 2017] • Project Q, K and V with h di ff erent, learned linear projections. • Perform the scaled dot-product attention function on each of these projected versions of Q, K and V in parallel. • Concatenate the output values. • Project the output values again, resulting in the fi nal values.
  • 37. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Classic Building Blocks Transformer: Multi-Head Self-Attention (MHSA) 37 Attention Is All You Need [Vaswani et al., NeurIPS 2017] • Project Q, K and V with h di ff erent, learned linear projections. • Perform the scaled dot-product attention function on each of these projected versions of Q, K and V in parallel. • Concatenate the output values. • Project the output values again, resulting in the fi nal values.
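A compact sketch of multi-head self-attention using basic tensor operations (assuming torch; the dimensions are illustrative and masking/dropout are omitted):

    import torch
    import torch.nn as nn

    class MultiHeadSelfAttention(nn.Module):
        def __init__(self, dim, num_heads):
            super().__init__()
            assert dim % num_heads == 0
            self.h, self.d = num_heads, dim // num_heads
            self.qkv = nn.Linear(dim, 3 * dim)   # learned projections for Q, K, V
            self.proj = nn.Linear(dim, dim)      # final output projection

        def forward(self, x):                    # x: (batch, tokens, dim)
            b, t, _ = x.shape
            q, k, v = self.qkv(x).chunk(3, dim=-1)
            # Split into heads: (b, h, t, d)
            q, k, v = (z.view(b, t, self.h, self.d).transpose(1, 2) for z in (q, k, v))
            attn = (q @ k.transpose(-2, -1)) / self.d ** 0.5   # scaled dot-product
            attn = attn.softmax(dim=-1)
            out = (attn @ v).transpose(1, 2).reshape(b, t, self.h * self.d)
            return self.proj(out)                # concatenate heads, then project

    x = torch.randn(2, 16, 64)
    print(MultiHeadSelfAttention(64, num_heads=8)(x).shape)  # torch.Size([2, 16, 64])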
  • 38. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Neural Architecture Search 38 • Primitive operations • Classic building blocks • Introduction to neural architecture search (NAS) • What is NAS? • Search space • Design the search space • Search strategy • E ffi cient and Hardware-aware NAS • Performance estimation strategy • Hardware-aware NAS • Zero-shot NAS • Neural-hardware architecture co-search • NAS applications • NLP, GAN, point cloud, pose
  • 39. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai From Manual Design to Automatic Design Huge design space, manual design is unscalable 39 # channels # layers # kernel resolution connectivity
  • 40. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai From Manual Design to Automatic Design Huge design space, manual design is unscalable 40 # channels # layers # kernel resolution connectivity
  • 41. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Manually-Designed Neural Networks 41 Model Compression and Hardware Acceleration for Neural Networks: A Comprehensive Survey [Deng et al., IEEE 2020] Accuracy-e ffi ciency trade-o ff on ImageNet 0 1 2 3 4 5 6 7 8 9 MACs (Billion) 69 71 73 75 77 79 81 ImageNet Top-1 accuracy (%) 2M 4M 8M Handcrafted 16M 32M 64M MBNetV2 Shu ffl eNet IGCV3-D MobileNetV1 (MBNetV1) InceptionV2 DenseNet-121 DenseNet-169 ResNet-50 ResNetXt-50 DenseNet-264 InceptionV3 DPN-92 ResNet-101 ResNetXt-101 Xception
  • 42. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai From Manual Design to Automatic Design Huge design space, manual design is unscalable 42 Once-for-All: Train One Network and Specialize it for E ffi cient Deployment [Cai et al., ICLR 2020] 0 1 2 3 4 5 6 7 8 9 MACs (Billion) 69 71 73 75 77 79 81 ImageNet Top-1 accuracy (%) 2M 4M 8M Handcrafted 16M AutoML 32M 64M MBNetV2 Shu ffl eNet IGCV3-D MobileNetV1 (MBNetV1) InceptionV2 DenseNet-121 DenseNet-169 ResNet-50 ResNetXt-50 DenseNet-264 InceptionV3 DPN-92 ResNet-101 ResNetXt-101 Xception DARTS PNASNet AmoebaNet MBNetV3 ProxylessNAS E ffi cientNet NASNet-A Once-for-All
  • 43. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Illustration of NAS Components and goal • The goal of NAS is to fi nd the best neural network architecture in the search space, maximizing the objective of interest (e.g., accuracy, e ffi ciency, etc). 43 Neural Architecture Search: A Survey [Elskan et al., JMLR 2019]
  • 44. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai • Search space is a set of candidate neural network architectures. Search Space 44 Neural Architecture Search: A Survey [Elskan et al., JMLR 2019] • Cell-level • Network-level
  • 45. MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai Search Space Cell-level search space 45 Learning Transferable Architectures for Scalable Image Recognition [Zoph et al., CVPR 2018] [Figures: NASNet architectures for image classification stack a searched Normal Cell and Reduction Cell, and the number of repeated Normal Cells N can vary. Figure 4 of the paper shows the best convolutional cells (NASNet-A) with B = 5 blocks identified on CIFAR-10: each cell takes the hidden states of the two previous layers (or the input image) as input, each block applies two primitive operations (e.g., separable convolutions, average/max pooling, identity) and adds their results, and the cell output is the concatenation of all resulting branches.]
  • 46. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Search Space Cell-level search space 46 Learning Transferable Architectures for Scalable Image Recognition [Zoph et al., CVPR 2018] softmax layer controller hidden layer Select one hidden state Select second hidden state Select operation for first hidden state Select operation for second hidden state Select method to combine hidden state repeat B times new hidden layer add 3 x 3 conv 2 x 2 maxpool hidden layer B hidden layer A Figure 3. Controller model architecture for recursively constructing one block of a convolutional cell. Each block requires selecting 5 discrete parameters, each of which corresponds to the output of a softmax layer. Example constructed block shown on right. A convolu- tional cell contains B blocks, hence the controller contains 5B softmax layers for predicting the architecture of a convolutional cell. In our experiments, the number of blocks B is 5. • Left: An RNN controller generates the candidate cells by: • fi nding two inputs • selecting two input transformation operations (e.g. convolution / pooling / identity) • selecting the method to combine the results. • Right: A cell generated after one step.
  • 47. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Search Space Cell-level search space 47 Learning Transferable Architectures for Scalable Image Recognition [Zoph et al., CVPR 2018] softmax layer controller hidden layer Select one hidden state Select second hidden state Select operation for first hidden state Select operation for second hidden state Select method to combine hidden state repeat B times new hidden layer add 3 x 3 conv 2 x 2 maxpool hidden layer B hidden layer A Figure 3. Controller model architecture for recursively constructing one block of a convolutional cell. Each block requires selecting 5 discrete parameters, each of which corresponds to the output of a softmax layer. Example constructed block shown on right. A convolu- tional cell contains B blocks, hence the controller contains 5B softmax layers for predicting the architecture of a convolutional cell. In our experiments, the number of blocks B is 5. • Question: Assuming that we have two candidate inputs, M candidate operations to transform the inputs and N potential operations to combine hidden states, what is the size of the search space in NASNet if we have B layers? • Hint: Consider it step by step!
  • 48. MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai Search Space Cell-level search space 48 Learning Transferable Architectures for Scalable Image Recognition [Zoph et al., CVPR 2018] [Figure 3: the controller recursively constructs one block of a convolutional cell by making 5 discrete choices (two hidden states, two operations, one combination method), each from a softmax layer; a cell contains B blocks, so the controller has 5B softmax layers. In the experiments, B = 5.] • Question: Assuming that we have two candidate inputs, M candidate operations to transform the inputs and N potential operations to combine hidden states, what is the size of the search space in NASNet if we have B layers? • Hint: Consider it step by step! • Answer: (2 × 2 × M × M × N)^B = 4^B · M^(2B) · N^B. • Assume M = 5, N = 2, B = 5: we have (2 × 2 × 5 × 5 × 2)^5 = 200^5 ≈ 3.2 × 10^11 candidates in the design space.
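The count above can be reproduced with a one-liner (plain Python, added for illustration):

    M, N, B = 5, 2, 5                   # candidate ops, combine ops, blocks per cell
    size = (2 * 2 * M * M * N) ** B     # two input choices, two op choices, one combiner per block
    print(f"{size:.1e}")                # 3.2e+11 candidate cells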
  • 49. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Search Space Network-level search space: depth dimension 49 Once-for-All: Train One Network and Specialize it for E ffi cient Deployment [Cai et al., ICLR 2020] Input 7x7 Conv 3x3 MaxPool 3 x 224 x 224 Input Stem Bottleneck Block Stage 1 C1 x 160 x 160 × L1 Stage 2 C2 x 80 x 80 × L2 Linear Attention FFN with Depthwise Conv × L3 ⏟ Stage 3 C3 x 40 x 40 MBConv MBConv Linear Attention FFN with Depthwise Conv × L4 ⏟ Stage 4 C4 x 20 x 20 MBConv P2 P3 P4 64 x 56 x 56 Bottleneck Block × 2 ⏟ Stage1 256 x 56 x 56 Bottleneck Block Bottleneck Block × 3 ⏟ Stage2 512 x 28 x 28 Bottleneck Block Bottleneck Block × 5 ⏟ Stage3 1024 x 14 x 14 Bottleneck Block Bottleneck Block × 2 ⏟ Stage4 2048 x 7 x 7 Global AvgPool FC 2048 x 1 x 1 Head Prediction × [1,2,3] × [2,3,4] × [3,5,7,9] × [1,2,3]
  • 50. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Search Space 50 Once-for-All: Train One Network and Specialize it for E ffi cient Deployment [Cai et al., ICLR 2020] Input 7x7 Conv 3x3 MaxPool 3 x 224 x 224 Input Stem Bottleneck Block Stage 1 C1 x 160 x 160 × L1 Stage 2 C2 x 80 x 80 × L2 Linear Attention FFN with Depthwise Conv × L3 ⏟ Stage 3 C3 x 40 x 40 MBConv MBConv Linear Attention FFN with Depthwise Conv × L4 ⏟ Stage 4 C4 x 20 x 20 MBConv P2 P3 P4 64 x 56 x 56 Bottleneck Block × 2 ⏟ Stage1 256 x 56 x 56 Bottleneck Block Bottleneck Block × 3 ⏟ Stage2 512 x 28 x 28 Bottleneck Block Bottleneck Block × 5 ⏟ Stage3 1024 x 14 x 14 Bottleneck Block Bottleneck Block × 2 ⏟ Stage4 2048 x 7 x 7 Global AvgPool FC 2048 x 1 x 1 Head Prediction [(3,128,128); (3,160,160); (3,192,192); (3,224,224); (3,256,256)] Network-level search space: resolution dimension
  • 51. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Input 7x7 Conv 3x3 MaxPool 3 x 224 x 224 Input Stem Bottleneck Block Stage 1 C1 x 160 x 160 × L1 Stage 2 C2 x 80 x 80 × L2 Linear Attention FFN with Depthwise Conv × L3 ⏟ Stage 3 C3 x 40 x 40 MBConv MBConv Linear Attention FFN with Depthwise Conv × L4 ⏟ Stage 4 C4 x 20 x 20 MBConv P2 P3 P4 64 x 56 x 56 Bottleneck Block × 2 ⏟ Stage1 256 x 56 x 56 Bottleneck Block Bottleneck Block × 3 ⏟ Stage2 512 x 28 x 28 Bottleneck Block Bottleneck Block × 5 ⏟ Stage3 1024 x 14 x 14 Bottleneck Block Bottleneck Block × 2 ⏟ Stage4 2048 x 7 x 7 Global AvgPool FC 2048 x 1 x 1 Head Prediction [48,64,96] [192,256,384] [384,512,768] [640,1024,1600] [1280,2048,3200] Search Space 51 Once-for-All: Train One Network and Specialize it for E ffi cient Deployment [Cai et al., ICLR 2020] Network-level search space: width dimension
  • 52. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Search Space 52 ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware [Cai et al., ICLR 2019] For models that use depthwise convolution, we can choose the kernel size for each depthwise convolution. Conv 3x3 MB1 3x3 MB3 5x5 MB3 3x3 MB3 7x7 MB3 3x3 MB3 5x5 MB3 5x5 MB6 7x7 32x112x112 16x112x112 3x224x224 32x56x56 32x56x56 40x28x28 40x28x28 40x28x28 40x28x28 MB3 5x5 MB3 5x5 80x14x14 80x14x14 MB6 5x5 MB3 5x5 MB3 5x5 MB3 5x5 MB6 7x7 MB3 7x7 MB6 7x7 Pooling FC 80x14x14 96x14x14 96x14x14 96x14x14 192x7x7 192x7x7 192x7x7 192x7x7 320x7x7 MB3 5x5 80x14x14 MB6 7x7 MB3 7x7 96x14x14 Conv 3x3 MB1 3x3 MB6 3x3 MB3 3x3 MB3 3x3 MB3 3x3 MB6 3x3 MB3 3x3 MB3 3x3 40x112x112 24x112x112 3x224x224 32x56x56 32x56x56 32x56x56 32x56x56 48x28x28 48x28x28 MB6 3x3 MB3 5x5 48x28x28 48x28x28 MB6 5x5 MB3 3x3 MB3 3x3 MB3 3x3 MB6 5x5 MB3 3x3 MB6 5x5 Pooling FC 88x14x14 104x14x14 104x14x14 104x14x14 216x7x7 216x7x7 216x7x7 216x7x7 360x7x7 MB3 3x3 88x14x14 MB3 5x5 MB3 5x5 104x14x14 (1) Efficient mobile architecture found by ProxylessNAS. (2) Efficient CPU architecture found by ProxylessNAS. Network-level search space: kernel size dimension 7x7 3x3 5x5
  • 53. MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai Search Space Network-level search space: topology connection 53 Auto-DeepLab: Hierarchical Neural Architecture Search for Semantic Image Segmentation [Liu et al., CVPR 2019] [Figure 1 of the paper: the network-level search space with L = 12 layers over downsampling rates {1, 2, 4, 8, 16, 32}; gray nodes are fixed stem layers and a path through the blue nodes is one candidate network-level architecture. Figure 2: this search space is general and includes existing designs such as DeepLabv3, Conv-Deconv, and Stacked Hourglass.] • Left: the network-level search space in AutoDeepLab. Each path along the blue nodes corresponds to a network architecture in the search space. • Right: representative manual designs can be represented using this formulation. E.g.: DeepLabV3, UNet, Stacked Hourglass.
  • 54. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Neural Architecture Search 54 • Primitive operations • Classic building blocks • Introduction to neural architecture search (NAS) • What is NAS? • Search space • Design the search space • Search strategy • E ffi cient and Hardware-aware NAS • Performance estimation strategy • Hardware-aware NAS • Zero-shot NAS • Neural-hardware architecture co-search • NAS applications • NLP, GAN, point cloud, pose
  • 55. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Design the Search Space Design the search space for TinyML 55 MCUNet: Tiny Deep Learning on IoT Devices [Lin et al., NeurIPS 2020] MCUNet: Tiny Deep Learning on IoT Devices [Lin et al., NeurIPS 2020] Search space design is crucial for NAS performance TinyNAS: (1) Automated search space optimization (2) Resource-constrained model specialization Search Space Optimization Network Space Model Specialization
  • 56. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Design the Search Space Design the search space for TinyML: memory is important for TinyML 56 MCUNet: Tiny Deep Learning on IoT Devices [Lin et al., NeurIPS 2020] Di ff erent from Mobile AI + + Latency Constraint Latency Constraint Energy Constraint Energy Constraint Memory Constraint
  • 57. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Design the Search Space Design the search space for TinyML: memory is important for TinyML 57 MCUNet: Tiny Deep Learning on IoT Devices [Lin et al., NeurIPS 2020] Memory (GB) NVIDIA V100 iPhone 11 STM32F746 0 4 8 12 16
  • 58. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Design the Search Space Design the search space for TinyML 58 MCUNet: Tiny Deep Learning on IoT Devices [Lin et al., NeurIPS 2020] MCUNet: Tiny Deep Learning on IoT Devices [Lin et al., NeurIPS 2020] Analyzing FLOPs distribution of satisfying models: Larger FLOPs -> Larger model capacity -> More likely to give higher accuracy
  • 59. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Design the Search Space Design the search space for TinyML 59 MCUNet: Tiny Deep Learning on IoT Devices [Lin et al., NeurIPS 2020] MCUNet: Tiny Deep Learning on IoT Devices [Lin et al., NeurIPS 2020] • Discussion: what are the advantages / disadvantages of these two methods (RegNet, MCUNet) for search space design? Better search space, better fi nal accuracy.
  • 60. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Neural Architecture Search 60 • Primitive operations • Classic building blocks • Introduction to neural architecture search (NAS) • What is NAS? • Search space • Design the search space • Search strategy • E ffi cient and Hardware-aware NAS • Performance estimation strategy • Hardware-aware NAS • Zero-shot NAS • Neural-hardware architecture co-search • NAS applications • NLP, GAN, point cloud, pose
  • 61. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Illustration of NAS Search strategy • Search space is a set of candidate neural network architectures. • Search strategy de fi nes how to explore the search space. 61 Neural Architecture Search: A Survey [Elskan et al., JMLR 2019] • Grid search • Random search • Reinforcement learning • Gradient descent • Evolutionary search • Cell-level • Network-level
  • 62. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Search Strategy Grid search 62 • Grid search is the traditional way of hyper parameter optimization. The entire design space is represented as the Cartesian product of single dimension design spaces (e.g. resolution in [1.0x, 1.1x,1.2x], width in [1.0x, 1.1x, 1.2x]). • To obtain the accuracy of each candidate network, we train them from scratch. 1.0x 1.1x 1.2x 1.0x 50.0% 53.0% 54.9% 1.1x 51.0% 53.5% 55.4% 1.2x 52.0% 54.1% 56.2% Width Resolution Satisfies the latency constraint Breaks the latency constraint Numbers in the grids correspond to the accuracy of candidate networks.
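A minimal grid-search sketch over the width and resolution multipliers shown in the table above (plain Python; train_and_eval and measure_latency are hypothetical stand-ins for the real, expensive training and measurement steps):

    import itertools

    def grid_search(widths, resolutions, latency_limit, train_and_eval, measure_latency):
        best = None
        for w, r in itertools.product(widths, resolutions):   # Cartesian product of the grid
            if measure_latency(w, r) > latency_limit:
                continue                                      # breaks the latency constraint
            acc = train_and_eval(w, r)                        # train this candidate from scratch
            if best is None or acc > best[0]:
                best = (acc, w, r)
        return best

    # Usage with toy stand-ins for the real functions:
    best = grid_search([1.0, 1.1, 1.2], [1.0, 1.1, 1.2], latency_limit=2.5,
                       train_and_eval=lambda w, r: 50 + 10 * (w - 1) + 25 * (r - 1),
                       measure_latency=lambda w, r: 2 * w * r)
    print(best)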
  • 63. MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai Search Strategy Grid search 63 • EfficientNet applies compound scaling on depth, width and resolution to a starting network. It performs grid search on values α, β, γ such that the total FLOPs of the new model will be 2× of the original network. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks [Tan and Le, ICML 2019] [Figure 2 of the paper: (a) a baseline network; (b)-(d) conventional scaling that increases only width, depth, or resolution; (e) compound scaling, which uniformly scales all three dimensions with a fixed ratio.] Compound scaling uses a coefficient φ to set depth d = α^φ, width w = β^φ and resolution r = γ^φ, subject to α · β² · γ² ≈ 2 with α ≥ 1, β ≥ 1, γ ≥ 1, where α, β, γ are constants determined by a small grid search, φ controls how many more resources are available, and the FLOPs of a regular convolution scale roughly linearly with d and quadratically with w and r.
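To make the compound-scaling constraint concrete, here is a small sketch (plain Python, not from the paper) that enumerates a coarse grid of α, β, γ and keeps the combinations whose FLOPs ratio α·β²·γ² is close to the 2× target; the grid values are assumptions chosen for illustration:

    import itertools

    # Candidate values for the depth (alpha), width (beta) and resolution (gamma) multipliers.
    grid = [1.0, 1.05, 1.1, 1.15, 1.2, 1.25, 1.3]

    candidates = []
    for a, b, g in itertools.product(grid, repeat=3):
        flops_ratio = a * b**2 * g**2      # FLOPs grow ~linearly in depth, quadratically in width/resolution
        if abs(flops_ratio - 2.0) < 0.05:  # keep combinations that roughly double FLOPs
            candidates.append((a, b, g))

    print(len(candidates), candidates[:3])
    # Each surviving (alpha, beta, gamma) would then be evaluated by training the scaled model.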
  • 64. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Search Strategy Random search 64 image source: 1 image source: 2 Grid Search Random Search
  • 65. MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai Search Strategy Reinforcement learning 65 Neural Architecture Search with Reinforcement Learning [Zoph and Le, ICLR 2017] [Left figure — overview of RL-based NAS (Figure 1 of the paper): a controller proposes an architecture, the child network is trained to obtain an accuracy signal, and that signal is used as the reward to update the controller. Right figure — the RNN controller (Figure 2): it samples a convolutional network by predicting filter height, filter width, stride height, stride width and number of filters for each layer; every prediction is carried out by a softmax classifier and fed into the next time step as input.] • Model neural architecture design as a sequential decision-making problem. • Use reinforcement learning to train the controller that is implemented with an RNN.
  • 66. MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai Search Strategy Gradient descent 66 DARTS: Differentiable Architecture Search [Liu et al., ICLR 2019] [Figure 1: overview of DARTS — operations on the edges are initially unknown; the categorical choice on each edge is relaxed into a continuous mixture, the mixing weights and network weights are optimized jointly, and the final discrete architecture is derived from the learned mixing weights. A special zero operation indicates a lack of connection between two nodes.] Given a set of candidate operations O, the choice of operation on edge (i, j) is relaxed to a softmax over all operations: ō^(i,j)(x) = Σ_{o ∈ O} [ exp(α_o^(i,j)) / Σ_{o' ∈ O} exp(α_{o'}^(i,j)) ] · o(x), where the mixing weights for the edge are parameterized by a vector α^(i,j) of dimension |O|. At the end of search, each mixed operation is replaced by the most likely operation, o^(i,j) = argmax_{o ∈ O} α_o^(i,j). The architecture parameters α and the network weights w are learned jointly with gradient descent: α is optimized on the validation loss while w is optimized on the training loss, which forms a bilevel optimization problem. • Represent the output at each node as a weighted sum of outputs from different edges.
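A sketch of the continuous relaxation above as a differentiable mixed operation in PyTorch (assuming torch; the candidate-operation list is a simplified illustration of the idea rather than the paper's exact operation set):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MixedOp(nn.Module):
        """Weighted sum of candidate ops; weights come from a softmax over architecture params alpha."""
        def __init__(self, channels):
            super().__init__()
            self.ops = nn.ModuleList([
                nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                nn.Conv2d(channels, channels, 5, padding=2, bias=False),
                nn.MaxPool2d(3, stride=1, padding=1),
                nn.Identity(),
            ])
            self.alpha = nn.Parameter(torch.zeros(len(self.ops)))  # architecture parameters

        def forward(self, x):
            weights = F.softmax(self.alpha, dim=0)
            return sum(w * op(x) for w, op in zip(weights, self.ops))

    op = MixedOp(16)
    y = op(torch.randn(1, 16, 8, 8))
    # After search, the edge would keep only the op with the largest alpha:
    print(int(op.alpha.argmax()))

During search, the alpha parameters would be updated on validation data while the operation weights are updated on training data, matching the bilevel formulation described above.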
  • 67. MIT 6.5940: TinyML and Efficient Deep Learning Computing https://efficientml.ai Search Strategy Gradient descent 67 • It is also possible to take latency into account for gradient-based search. • Here, F is a latency prediction model (typically a regressor or a lookup table). With such formulation, we can calculate an additional gradient for the architecture parameters from the latency penalty term. ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware [Cai et al., ICLR 2019] [Figure 3: making latency differentiable by introducing a latency regularization loss — direct measurement is expensive and slow, while latency modeling is cheap, fast and differentiable. Each learnable block chooses among candidate ops (e.g., conv 3x3, conv 5x5, pool 3x3, identity) with probabilities α, β, σ, …, ζ.] The expected latency of the network is the sum over blocks, E[latency] = Σ_i E[latency_i], where each block's expected latency is the probability-weighted sum over its candidate operations, e.g. E[latency_i] = α · F(conv 3x3) + β · F(conv 5x5) + σ · F(identity) + … + ζ · F(pool 3x3). The training objective becomes Loss = Loss_CE + λ1 · ||w||²_2 + λ2 · E[latency], where the scaling factor λ2 > 0 controls the trade-off between accuracy and latency.
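A sketch of the latency penalty above using a lookup-table latency predictor F (assuming torch; the per-operation latency numbers are made-up placeholders, and the weight-decay term λ1·||w||² is omitted for brevity):

    import torch
    import torch.nn.functional as F

    # Hypothetical latency lookup table (milliseconds) for one learnable block.
    latency_table = {"conv3x3": 4.0, "conv5x5": 9.0, "identity": 0.1, "pool3x3": 1.5}
    ops = list(latency_table)

    alpha = torch.zeros(len(ops), requires_grad=True)    # architecture parameters of the block

    def expected_latency(alpha):
        probs = F.softmax(alpha, dim=0)                  # op-selection probabilities
        lat = torch.tensor([latency_table[o] for o in ops])
        return (probs * lat).sum()                       # E[latency_i], differentiable w.r.t. alpha

    task_loss = torch.tensor(1.0)                        # stand-in for the cross-entropy loss
    lambda2 = 0.05
    loss = task_loss + lambda2 * expected_latency(alpha) # Loss = Loss_CE + lambda2 * E[latency]
    loss.backward()
    print(alpha.grad)                                    # gradient contributed by the latency penalty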
  • 68. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Search Strategy Evolutionary search 68 image source: 1 • Mimic the evolution process in biology image source: 2
  • 69. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Search Strategy Evolutionary search: fi tness function 69 t = 10ms a = 80% t = 8ms, a=81% t = 12ms, a=78% Re-sample Keep Arch. Sample Latency/Accuracy Feedback Sub Network OFA Network PVNAS: 3D Neural Architecture Search with Point-Voxel Convolution [Liu et al., TPAMI 2021] • Fitness function: F(accuracy, e ffi ciency)
  • 70. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Search Strategy Evolutionary search: mutation 70 Once-for-All: Train One Network and Specialize it for E ffi cient Deployment [Cai et al., ICLR 2020] Input MB6 3x3 MB6 3x3 MB6 3x3 MB6 3x3 MB6 3x3 MB6 3x3 Output Mutation on depth Input MB6 3x3 MB6 3x3 MB6 3x3 MB6 3x3 MB6 3x3 MB6 3x3 Output Stage 1: depth = 3 Stage 2: depth = 3 Stage 1: depth = 2 Stage 2: depth = 4
  • 71. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Search Strategy Evolutionary search: mutation 71 Once-for-All: Train One Network and Specialize it for E ffi cient Deployment [Cai et al., ICLR 2020] Input MB6 3x3 MB6 3x3 MB6 3x3 MB6 3x3 MB6 3x3 MB6 3x3 Output Mutation on operator MB6 5x5 MB6 5x5 Input MB6 3x3 MB6 3x3 MB6 7x7 Output MB4 5x5 Stage 1: depth = 3 Stage 2: depth = 3 Stage 1: depth = 3 Stage 2: depth = 3
  • 72. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Search Strategy Evolutionary search: crossover 72 Once-for-All: Train One Network and Specialize it for E ffi cient Deployment [Cai et al., ICLR 2020] Input MB6 3x3 MB6 3x3 MB6 3x3 MB6 3x3 MB6 3x3 MB6 3x3 Output MB6 5x5 MB6 5x5 Input MB6 3x3 MB6 3x3 MB6 7x7 Output MB4 5x5 + Crossover Idea: Randomly choose one operator among two choices (from the parents) for each layer. MB6 5x5 Input MB6 3x3 MB6 3x3 MB6 3x3 MB6 7x7 Output MB6 3x3 From parent 1 From parent 2 From parent 1 From parent 1 From parent 2 From parent 2
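A toy sketch of the evolutionary loop with the mutation and crossover operators from slides 69–72 (plain Python; the architecture encoding — per-stage depth plus per-layer kernel size — and the fitness function are simplified assumptions, whereas in practice fitness would come from the measured accuracy and latency of sub-networks):

    import random

    DEPTHS, KERNELS = [2, 3, 4], [3, 5, 7]

    def random_arch():
        return {"depth": [random.choice(DEPTHS) for _ in range(2)],
                "kernel": [random.choice(KERNELS) for _ in range(8)]}

    def mutate(arch, p=0.2):
        # Resample each gene with probability p (mutation on depth and on operator).
        return {"depth": [random.choice(DEPTHS) if random.random() < p else d for d in arch["depth"]],
                "kernel": [random.choice(KERNELS) if random.random() < p else k for k in arch["kernel"]]}

    def crossover(a, b):
        # Randomly take each gene from one of the two parents.
        return {k: [random.choice(pair) for pair in zip(a[k], b[k])] for k in a}

    def fitness(arch):  # toy stand-in for F(accuracy, efficiency)
        return 3 * sum(arch["depth"]) - sum(arch["kernel"]) + random.random()

    population = [random_arch() for _ in range(20)]
    for _ in range(10):                                   # evolve for a few generations
        population.sort(key=fitness, reverse=True)
        parents = population[:10]
        children = [mutate(random.choice(parents)) for _ in range(5)]
        children += [crossover(*random.sample(parents, 2)) for _ in range(5)]
        population = parents + children
    print(fitness(population[0]))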
  • 73. MIT 6.5940: TinyML and E ffi cient Deep Learning Computing https://e ffi cientml.ai Summary of Today’s Lecture • Primitive operations • Classic building blocks • NAS search space • Design the search space • NAS search strategy • We will cover in later lectures: • Performance estimation strategy in NAS • Hardware-aware NAS • Zero-shot NAS • Neural-hardware architecture search • NAS applications 73