MetaPerturb:
Transferable Regularizer for
Heterogeneous Tasks and Architectures
Jeongun Ryu1*, JaeWoong Shin1*, Hae Beom Lee1*, Sung Ju Hwang1,2
1 KAIST, South Korea
2 AITRICS, South Korea
*: Equal contribution
Motivation
• Regularization
• Versatile: task- and architecture-agnostic.
• Does not exploit the data available from other tasks.
• Transfer learning
• Learns to transfer knowledge from a source task.
• May not generalize across heterogeneous tasks and architectures.
[Figure: transfer learning copies knowledge from a source model to a target model.]
Concept
We propose a meta-learned, transferable perturbation function that enhances the generalization of diverse neural architectures on unseen tasks.
[Figure: the perturbation function g_φ is meta-trained on tasks 1..T from a source dataset (e.g., Dog vs. Cat, Car vs. Truck) with a Conv4 backbone, then transferred at meta-test time to unseen tasks (Aircraft, CUB) and a different architecture (VGG).]
Architecture of the Noise Generator
The perturbation function should be applicable to:
1. Neural networks with an arbitrary number of convolutional layers.
→ Share the function across convolutional layers (see the sketch below).
2. Convolutional layers with an arbitrary number of channels.
→ Share the function across channels with permutation-equivariant set encodings.
[Figure: the same noise generator φ (82 parameters) is plugged into different backbones (Conv4, VGG) and applied to feature maps with an arbitrary number of channels C, producing noise z and channel-wise scales s.]
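Below is a minimal sketch of requirement 1 (an illustration, not the authors' released code): one and the same perturbation module is attached after every convolutional layer of an arbitrary backbone through PyTorch forward hooks, so its parameter count is independent of network depth. The function name and the placement after Conv2d outputs are assumptions made for illustration.

```python
import torch
import torch.nn as nn

def attach_shared_perturbation(model: nn.Module, perturb: nn.Module) -> None:
    """Run one shared `perturb` module on the output of every Conv2d in `model`."""
    def hook(_module, _inputs, output):
        return perturb(output)  # a value returned from a forward hook replaces the output
    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            m.register_forward_hook(hook)

# The identical module (a placeholder here) can be attached to backbones of any depth.
perturb = nn.Identity()  # stand-in for the noise generator g_φ
backbone = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
)
attach_shared_perturbation(backbone, perturb)
_ = backbone(torch.randn(2, 3, 32, 32))  # perturbation now runs after each conv layer
```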
Architecture of the Noise Generator
Input-dependent stochastic noise generator: generate the noise with two layers of channel-wise permutation-equivariant operations.
[Figure: given a feature map h with channels 1..C, each layer applies two shared 3x3 kernels λ and γ (one acting per channel, one for the cross-channel interaction) followed by a ReLU; stacking two such layers produces the per-channel noise means μ_1(h), ..., μ_C(h).]
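The sketch below illustrates one channel-wise permutation-equivariant operation and the two-layer noise-mean generator. It assumes, as is standard for such set encodings, that each channel is convolved with a shared single-channel 3x3 kernel λ while a second shared 3x3 kernel γ acts on the mean over channels; this is a hedged reconstruction from the slide, not the released implementation, and the class names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PermEquivConv(nn.Module):
    """Channel-wise permutation-equivariant conv: parameter count does not depend on C."""
    def __init__(self):
        super().__init__()
        self.lam = nn.Conv2d(1, 1, kernel_size=3, padding=1)  # λ: shared per-channel kernel
        self.gam = nn.Conv2d(1, 1, kernel_size=3, padding=1)  # γ: shared cross-channel kernel

    def forward(self, h):                                      # h: (B, C, H, W)
        B, C, H, W = h.shape
        per_channel = self.lam(h.reshape(B * C, 1, H, W)).reshape(B, C, H, W)
        pooled = self.gam(h.mean(dim=1, keepdim=True))         # interaction via the channel mean
        return per_channel + pooled                            # broadcasts over channels

class NoiseMeanGenerator(nn.Module):
    """Two permutation-equivariant layers with ReLUs produce per-channel noise means μ(h)."""
    def __init__(self):
        super().__init__()
        self.layer1, self.layer2 = PermEquivConv(), PermEquivConv()

    def forward(self, h):
        return self.layer2(F.relu(self.layer1(h)))             # μ(h): (B, C, H, W)

# The same module works for feature maps with any number of channels.
mu = NoiseMeanGenerator()(torch.randn(2, 5, 8, 8))
```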
Architecture of the Noise Generator
Batch-dependent scaling function: adaptively scale the noise of each channel to a different value for each dataset.
[Figure: for each channel k, per-instance statistics over the batch ℬ (mean and variance) are pooled together with layer information, then passed through a shared 3x3 conv (4 maps), global average pooling (GAP), and a fully connected layer with a sigmoid to produce the channel-wise scale s_k; an accompanying visualization shows the resulting channel importance differing between object and background regions across datasets.]
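A minimal sketch of the channel-wise scaling head follows, again a reconstruction under assumptions: the per-channel batch mean and variance maps are treated as a two-channel input to a small shared 3x3 conv (the hidden width of 4 mirrors the "4" in the slide figure), followed by GAP, a fully connected layer, and a sigmoid. The slide also pools in layer information, which this sketch omits, and all names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelScaler(nn.Module):
    """Produce one scale s_k in (0, 1) per channel from batch statistics."""
    def __init__(self, hidden=4):
        super().__init__()
        self.conv = nn.Conv2d(2, hidden, kernel_size=3, padding=1)  # shared across channels
        self.fc = nn.Linear(hidden, 1)

    def forward(self, h):                              # h: (B, C, H, W)
        mean = h.mean(dim=0)                           # (C, H, W): batch mean per channel
        var = h.var(dim=0, unbiased=False)             # (C, H, W): batch variance per channel
        stats = torch.stack([mean, var], dim=1)        # (C, 2, H, W): one "instance" per channel
        feat = F.adaptive_avg_pool2d(self.conv(stats), 1).flatten(1)  # GAP -> (C, hidden)
        s = torch.sigmoid(self.fc(feat))               # (C, 1)
        return s.view(1, -1, 1, 1)                     # broadcastable channel-wise scales

s = ChannelScaler()(torch.randn(16, 5, 8, 8))          # one scale per channel, for any C
```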
Architecture of the Noise Generator
MetaPerturb combines the input-dependent noise with the batch-dependent scaling.
[Figure: the perturbation function is inserted into each convolutional block (3x3 conv, BN, ReLU, 3x3 conv, BN); given a feature map h it computes μ(h), samples a ~ N(μ(h), I), obtains the noise z = Softplus(a), and modulates it with the channel-wise scale s; beyond φ, g_φ(h) is not parameterized.]
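The sketch below shows one plausible way the pieces compose at a perturbation site, following the slide's steps a ~ N(μ(h), I) and z = Softplus(a). The exact way the scale s enters the final output is not fully specified on the slide, so the gating formula used here (interpolating between the noisy and the unperturbed feature) is an explicit assumption rather than the authors' definition.

```python
import torch
import torch.nn.functional as F

def meta_perturb(h, mu, s):
    """h: features (B, C, H, W); mu: noise means μ(h); s: channel-wise scales in (0, 1)."""
    a = mu + torch.randn_like(mu)        # a ~ N(μ(h), I)
    z = F.softplus(a)                    # non-negative multiplicative noise
    # ASSUMPTION: s gates how strongly each channel is perturbed (s = 0 leaves h unchanged).
    return h * (s * z + (1.0 - s))

h = torch.randn(4, 5, 8, 8)
out = meta_perturb(h, mu=torch.zeros_like(h), s=torch.full((1, 5, 1, 1), 0.5))
```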
Conventional Meta-learning
Conventional meta-learning for few-shot classification cannot scale to the many-shot standard learning scenario we target.
• Episodic training with task sampling is costly, since many-shot learning requires a large number of gradient steps until convergence.
[Figure: episodic training repeatedly samples a task T ~ {T_1, ..., T_T} and trains the model on it.]
Conventional Meta-learning
Conventional meta-learning for few-shot classification cannot scale to the many-shot standard learning scenario we target.
• Also, due to the large network size, performing the gradient lookahead step for standard many-shot learning is costly.
[1] Finn et al., Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks, ICML 2017
[2] Ren et al., Learning to Reweight Examples for Robust Deep Learning, ICML 2018
[Figure: the lookahead step θ_t → θ_{t+1}, computed exactly as in MAML [1] or with the online approximation of [2].]
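For context, the lookahead (bilevel) update that MAML-style meta-learning [1, 2] would require can be written as below; this is the standard formulation of such methods rather than anything specific to MetaPerturb, with φ denoting the meta-parameters and α, β the inner and outer step sizes.

```latex
% Inner (lookahead) step on the training split, followed by the meta-update of \phi
% on the validation split; the meta-gradient must backpropagate through the inner step.
\theta'_t = \theta_t - \alpha \,\nabla_{\theta}\, \mathcal{L}^{\mathrm{tr}}_t(\theta_t, \phi),
\qquad
\phi \leftarrow \phi - \beta \,\nabla_{\phi}\, \mathcal{L}^{\mathrm{val}}_t\!\big(\theta'_t, \phi\big)
```

Backpropagating through θ'_t for a full many-shot training trajectory with a large network is what makes this step costly.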
Conventional Meta-learning
Conventional meta-learning for few-shot classification cannot scale to the many-shot standard learning scenario we target:
• Episodic training with task sampling is costly, since many-shot learning requires a large number of gradient steps until convergence.
• Also, due to the large network size, performing the gradient lookahead step for standard many-shot learning is costly.
Thus, we propose a novel meta-learning framework in the form of distributed joint training.
Meta-learning framework
Meta-training step (θ: main model, φ: perturbation module)
[Figure: T source tasks are trained in parallel, each with its own model θ_t but a single shared perturbation module φ; in each round, the shared φ is trained with one split of every task's data D_t and each θ_t is trained with the other split, and after joint training we obtain the meta-learned φ*.]
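The following single-process sketch gives my reading of the meta-training step (not the released code; the paper's version runs the tasks distributed across workers): every task keeps its own θ_t while one shared φ receives gradients from all tasks, and each round first updates φ on one split of every task's data and then updates each θ_t on the other split, so no episodic resampling or lookahead step is needed. The models, data, and split handling are dummy placeholders.

```python
import torch
import torch.nn as nn

T = 3
phi = nn.Linear(8, 8)                                   # stand-in for the shared perturbation module φ
thetas = [nn.Linear(8, 2) for _ in range(T)]            # stand-in task-specific models θ_1..θ_T
opt_phi = torch.optim.SGD(phi.parameters(), lr=0.1)
opt_thetas = [torch.optim.SGD(m.parameters(), lr=0.1) for m in thetas]
loss_fn = nn.CrossEntropyLoss()

def batch():                                            # dummy batch; real code would draw from the D_t splits
    return torch.randn(16, 8), torch.randint(0, 2, (16,))

for step in range(100):
    # (1) Update the shared φ: gradients from every task accumulate on φ before one step.
    opt_phi.zero_grad()
    for theta in thetas:
        x, y = batch()
        loss_fn(theta(phi(x)), y).backward()
    opt_phi.step()
    # (2) Update each task-specific θ_t on its own split, with φ held fixed for this phase.
    for theta, opt in zip(thetas, opt_thetas):
        x, y = batch()
        opt.zero_grad()
        loss_fn(theta(phi(x)), y).backward()
        opt.step()

# After joint training, φ holds the meta-learned φ* that is transferred to the target task.
```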
Meta-learning framework
Meta-test step
[Figure: the meta-learned φ* is transferred to the target task; the target model θ is trained on the target training data D with φ* applied, yielding θ*.]
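A matching sketch of the meta-test step, under the same stand-in modules: φ* is transferred and (by my reading of the slide) kept fixed, while only the target model θ is trained on the target data.

```python
import torch
import torch.nn as nn

phi_star = nn.Linear(8, 8)                      # transferred meta-learned φ* (stand-in)
for p in phi_star.parameters():
    p.requires_grad_(False)                     # ASSUMPTION: φ* stays fixed at meta-test time
theta = nn.Linear(8, 2)                         # target model θ trained from scratch
opt = torch.optim.SGD(theta.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    x, y = torch.randn(16, 8), torch.randint(0, 2, (16,))   # dummy target-task batch
    opt.zero_grad()
    loss_fn(theta(phi_star(x)), y).backward()                # φ* regularizes but is not updated
    opt.step()
# θ* is the trained target model obtained with φ*.
```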
Baselines
[Achille and Soatto, 2018] Information Dropout: Learning Optimal Representations Through Noisy Computation, TPAMI 2018
[Ghiasi et al., 2018] DropBlock: A Regularization Method for Convolutional Networks, NeurIPS 2018
[Verma et al., 2019] Manifold Mixup: Better Representations by Interpolating Hidden States, ICML 2019
[Figure: where each baseline acts in the network: Information Dropout injects noise between conv layers, DropBlock drops contiguous regions of a feature map, and Manifold Mixup interpolates hidden states between conv layers.]
Datasets, Architectures
Datasets: STL10, CIFAR-100, Stanford Dogs, Stanford Cars, Aircraft, CUB
Architectures: Conv4, Conv6, VGG9, ResNet20 / ResNet44, WideResNet-28-2
[Figure: schematics of the architectures, all built from stacked 3x3 convolutions, with residual connections x_l → x_{l+1} in the ResNets and channel widths of 64, 128, 256, 512 in VGG9.]
Results on Heterogeneous Datasets
MetaPerturb significantly improves the performance of the base network, with especially large gains on the fine-grained datasets.
MetaPerturb even outperforms finetuning on certain datasets, while transferring only 82 parameters.
Model           | # Transfer params | Source dataset | STL10        | s-CIFAR100   | Dogs         | Cars         | Aircraft     | CUB
Base            | 0                 | None           | 66.78 ± 0.59 | 31.79 ± 0.24 | 34.65 ± 1.05 | 44.35 ± 1.10 | 59.23 ± 0.95 | 30.63 ± 0.66
Info. Dropout   | 0                 | None           | 67.46 ± 0.17 | 32.32 ± 0.33 | 34.63 ± 0.68 | 43.13 ± 2.31 | 58.59 ± 0.90 | 30.83 ± 0.79
DropBlock       | 0                 | None           | 68.51 ± 0.67 | 32.74 ± 0.36 | 34.59 ± 0.87 | 45.11 ± 1.47 | 59.76 ± 1.38 | 30.55 ± 0.26
Manifold Mixup  | 0                 | None           | 72.83 ± 0.69 | 39.06 ± 0.73 | 36.29 ± 0.70 | 48.97 ± 1.69 | 64.35 ± 1.23 | 37.80 ± 0.53
MetaPerturb     | 82                | TIN            | 69.98 ± 0.63 | 34.57 ± 0.38 | 38.41 ± 0.74 | 62.46 ± 0.80 | 65.87 ± 0.77 | 42.01 ± 0.43
Finetuning (FT) | 0.3M              | TIN            | 77.16 ± 0.41 | 43.69 ± 0.22 | 40.09 ± 0.31 | 58.61 ± 1.16 | 66.03 ± 0.85 | 34.89 ± 0.30
Results on Heterogeneous Datasets
Model             | # Transfer params | Source dataset | STL10        | s-CIFAR100   | Dogs         | Cars         | Aircraft     | CUB
Base              | 0                 | None           | 66.78 ± 0.59 | 31.79 ± 0.24 | 34.65 ± 1.05 | 44.35 ± 1.10 | 59.23 ± 0.95 | 30.63 ± 0.66
Info. Dropout     | 0                 | None           | 67.46 ± 0.17 | 32.32 ± 0.33 | 34.63 ± 0.68 | 43.13 ± 2.31 | 58.59 ± 0.90 | 30.83 ± 0.79
DropBlock         | 0                 | None           | 68.51 ± 0.67 | 32.74 ± 0.36 | 34.59 ± 0.87 | 45.11 ± 1.47 | 59.76 ± 1.38 | 30.55 ± 0.26
Manifold Mixup    | 0                 | None           | 72.83 ± 0.69 | 39.06 ± 0.73 | 36.29 ± 0.70 | 48.97 ± 1.69 | 64.35 ± 1.23 | 37.80 ± 0.53
MetaPerturb       | 82                | TIN            | 69.98 ± 0.63 | 34.57 ± 0.38 | 38.41 ± 0.74 | 62.46 ± 0.80 | 65.87 ± 0.77 | 42.01 ± 0.43
Finetuning (FT)   | 0.3M              | TIN            | 77.16 ± 0.41 | 43.69 ± 0.22 | 40.09 ± 0.31 | 58.61 ± 1.16 | 66.03 ± 0.85 | 34.89 ± 0.30
FT + Info.Dropout | 0.3M + 0          | TIN            | 77.41 ± 0.13 | 43.92 ± 0.44 | 40.04 ± 0.46 | 58.07 ± 0.57 | 65.47 ± 0.27 | 35.55 ± 0.81
FT + DropBlock    | 0.3M + 0          | TIN            | 78.32 ± 0.31 | 44.84 ± 0.37 | 40.54 ± 0.56 | 61.08 ± 0.61 | 66.30 ± 0.84 | 34.61 ± 0.54
FT + Manif. Mixup | 0.3M + 0          | TIN            | 79.60 ± 0.27 | 47.92 ± 0.79 | 42.54 ± 0.70 | 64.81 ± 0.97 | 71.53 ± 0.80 | 43.07 ± 0.83
FT + MetaPerturb  | 0.3M + 82         | TIN            | 78.27 ± 0.36 | 47.41 ± 0.40 | 46.06 ± 0.44 | 73.04 ± 0.45 | 72.34 ± 0.41 | 48.60 ± 1.14
MetaPerturb can be further combined with finetuning to achieve even
larger performance gains.
Results on Heterogeneous Neural Architectures
Model         | Source network | Conv4        | Conv6        | VGG9         | ResNet20     | ResNet44     | WRN-28-2
Base          | None           | 83.93 ± 0.20 | 86.14 ± 0.23 | 88.44 ± 0.29 | 87.96 ± 0.30 | 88.94 ± 0.41 | 88.95 ± 0.44
Info. Dropout | None           | 84.91 ± 0.34 | 87.23 ± 0.26 | 88.29 ± 1.18 | 88.46 ± 0.65 | 89.33 ± 0.20 | 89.51 ± 0.29
DropBlock     | None           | 84.29 ± 0.24 | 86.22 ± 0.26 | 88.68 ± 0.35 | 89.43 ± 0.26 | 90.14 ± 0.18 | 90.55 ± 0.25
Finetuning    | Same           | 84.00 ± 0.27 | 86.56 ± 0.23 | 88.17 ± 0.18 | 88.77 ± 0.26 | 89.62 ± 0.05 | 89.85 ± 0.31
MetaPerturb   | TIN            | 86.61 ± 0.42 | 88.59 ± 0.26 | 90.24 ± 0.27 | 90.70 ± 0.25 | 90.97 ± 1.09 | 90.88 ± 0.07
When transferred to diverse heterogeneous neural architectures, MetaPerturb significantly outperforms the baselines on all target networks.
Results: Qualitative Analysis
MetaPerturb applies a different amount of noise at each layer, and the per-layer amounts vary across datasets.
MetaPerturb makes the loss surface flatter.
Results: Adversarial Robustness and Calibration
MetaPerturb is robust to adversarial perturbations, although it is not explicitly trained to be robust against them.
MetaPerturb also significantly improves calibration, as measured by the expected calibration error (ECE).
Conclusion
• We propose a lightweight and versatile perturbation function that can transfer the knowledge of a source task to heterogeneous target tasks and architectures.
• We propose a novel meta-learning framework in the form of distributed joint training, which makes it possible to perform meta-learning efficiently on large-scale datasets with deep networks.
• Our transferable regularizer substantially enhances model generalization on various datasets and architectures, outperforming existing regularizers and finetuning in most cases, while also improving robustness and calibration.
MetaPerturb:
Transferable Regularizer for
Heterogeneous Tasks and Architectures
Jeongun Ryu1*, JaeWoong Shin1*, Hae Beom Lee1*, Sung Ju Hwang12
https://github.com/JWoong148/metaperturb
Paper, Poster ID: 2018, 17113