MetaPerturb:
Transferable Regularizer for
Heterogeneous Tasks and Architectures
Jeongun Ryu1*, JaeWoong Shin1*, Hae Beom Lee1*, Sung Ju Hwang1,2
1 KAIST, South Korea
2 AITRICS, South Korea
*: Equal contribution
Motivation
• Regularization
• Versatile: task- and architecture-agnostic.
• Does not exploit the data available from other tasks.
• Transfer learning
• Learns to transfer knowledge from a source task.
• May not generalize across heterogeneous tasks and architectures.
[Figure: transfer learning copies knowledge from a source model to a target model.]
Concept
We propose a meta-learned, transferable perturbation function that enhances the generalization of diverse neural architectures on unseen tasks.
[Figure: the perturbation function g_φ is meta-trained on tasks 1..T from a source dataset (e.g., Dog vs. Cat, Car vs. Truck) with a Conv4 backbone, then transferred at meta-test time to unseen tasks (Aircraft, CUB) and a different architecture (VGG).]
Architecture of the Noise Generator
The perturbation function should be applicable to:
1. Neural networks with an arbitrary number of convolutional layers.
→ Share the function across convolutional layers (see the sketch below).
2. Convolutional layers with an arbitrary number of channels.
→ Share the function across channels with permutation-equivariant set encodings.
[Figure: the same noise generator φ (82 parameters) is plugged into different backbones (Conv4, VGG) and applied to feature maps with an arbitrary number of channels C, producing noise z and channel-wise scales s.]
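Below is a minimal sketch of requirement 1 (an illustration, not the authors' released code): one and the same perturbation module is attached after every convolutional layer of an arbitrary backbone through PyTorch forward hooks, so its parameter count is independent of network depth. The function name and the placement after Conv2d outputs are assumptions made for illustration.

```python
import torch
import torch.nn as nn

def attach_shared_perturbation(model: nn.Module, perturb: nn.Module) -> None:
    """Run one shared `perturb` module on the output of every Conv2d in `model`."""
    def hook(_module, _inputs, output):
        return perturb(output)  # a value returned from a forward hook replaces the output
    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            m.register_forward_hook(hook)

# The identical module (a placeholder here) can be attached to backbones of any depth.
perturb = nn.Identity()  # stand-in for the noise generator g_φ
backbone = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
)
attach_shared_perturbation(backbone, perturb)
_ = backbone(torch.randn(2, 3, 32, 32))  # perturbation now runs after each conv layer
```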
Architecture of the Noise Generator
Input-dependent stochastic noise generator: generate the noise with two layers of channel-wise permutation-equivariant operations.
[Figure: given a feature map h with channels 1..C, each layer applies two shared 3x3 kernels λ and γ (one acting per channel, one for the cross-channel interaction) followed by a ReLU; stacking two such layers produces the per-channel noise means μ_1(h), ..., μ_C(h).]
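The sketch below illustrates one channel-wise permutation-equivariant operation and the two-layer noise-mean generator. It assumes, as is standard for such set encodings, that each channel is convolved with a shared single-channel 3x3 kernel λ while a second shared 3x3 kernel γ acts on the mean over channels; this is a hedged reconstruction from the slide, not the released implementation, and the class names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PermEquivConv(nn.Module):
    """Channel-wise permutation-equivariant conv: parameter count does not depend on C."""
    def __init__(self):
        super().__init__()
        self.lam = nn.Conv2d(1, 1, kernel_size=3, padding=1)  # λ: shared per-channel kernel
        self.gam = nn.Conv2d(1, 1, kernel_size=3, padding=1)  # γ: shared cross-channel kernel

    def forward(self, h):                                      # h: (B, C, H, W)
        B, C, H, W = h.shape
        per_channel = self.lam(h.reshape(B * C, 1, H, W)).reshape(B, C, H, W)
        pooled = self.gam(h.mean(dim=1, keepdim=True))         # interaction via the channel mean
        return per_channel + pooled                            # broadcasts over channels

class NoiseMeanGenerator(nn.Module):
    """Two permutation-equivariant layers with ReLUs produce per-channel noise means μ(h)."""
    def __init__(self):
        super().__init__()
        self.layer1, self.layer2 = PermEquivConv(), PermEquivConv()

    def forward(self, h):
        return self.layer2(F.relu(self.layer1(h)))             # μ(h): (B, C, H, W)

# The same module works for feature maps with any number of channels.
mu = NoiseMeanGenerator()(torch.randn(2, 5, 8, 8))
```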
Architecture of the Noise Generator
Batch-dependent scaling function: adaptively scale the noise of each channel to a different value for each dataset.
[Figure: for each channel k, per-instance statistics over the batch ℬ (mean and variance) are pooled together with layer information, then passed through a shared 3x3 conv (4 maps), global average pooling (GAP), and a fully connected layer with a sigmoid to produce the channel-wise scale s_k; an accompanying visualization shows the resulting channel importance differing between object and background regions across datasets.]
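A minimal sketch of the channel-wise scaling head follows, again a reconstruction under assumptions: the per-channel batch mean and variance maps are treated as a two-channel input to a small shared 3x3 conv (the hidden width of 4 mirrors the "4" in the slide figure), followed by GAP, a fully connected layer, and a sigmoid. The slide also pools in layer information, which this sketch omits, and all names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelScaler(nn.Module):
    """Produce one scale s_k in (0, 1) per channel from batch statistics."""
    def __init__(self, hidden=4):
        super().__init__()
        self.conv = nn.Conv2d(2, hidden, kernel_size=3, padding=1)  # shared across channels
        self.fc = nn.Linear(hidden, 1)

    def forward(self, h):                              # h: (B, C, H, W)
        mean = h.mean(dim=0)                           # (C, H, W): batch mean per channel
        var = h.var(dim=0, unbiased=False)             # (C, H, W): batch variance per channel
        stats = torch.stack([mean, var], dim=1)        # (C, 2, H, W): one "instance" per channel
        feat = F.adaptive_avg_pool2d(self.conv(stats), 1).flatten(1)  # GAP -> (C, hidden)
        s = torch.sigmoid(self.fc(feat))               # (C, 1)
        return s.view(1, -1, 1, 1)                     # broadcastable channel-wise scales

s = ChannelScaler()(torch.randn(16, 5, 8, 8))          # one scale per channel, for any C
```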
Architecture of the Noise Generator
MetaPerturb combines the input-dependent noise with the batch-dependent scaling.
[Figure: the perturbation function is inserted into each convolutional block (3x3 conv, BN, ReLU, 3x3 conv, BN); given a feature map h it computes μ(h), samples a ~ N(μ(h), I), obtains the noise z = Softplus(a), and modulates it with the channel-wise scale s; beyond φ, g_φ(h) is not parameterized.]
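The sketch below shows one plausible way the pieces compose at a perturbation site, following the slide's steps a ~ N(μ(h), I) and z = Softplus(a). The exact way the scale s enters the final output is not fully specified on the slide, so the gating formula used here (interpolating between the noisy and the unperturbed feature) is an explicit assumption rather than the authors' definition.

```python
import torch
import torch.nn.functional as F

def meta_perturb(h, mu, s):
    """h: features (B, C, H, W); mu: noise means μ(h); s: channel-wise scales in (0, 1)."""
    a = mu + torch.randn_like(mu)        # a ~ N(μ(h), I)
    z = F.softplus(a)                    # non-negative multiplicative noise
    # ASSUMPTION: s gates how strongly each channel is perturbed (s = 0 leaves h unchanged).
    return h * (s * z + (1.0 - s))

h = torch.randn(4, 5, 8, 8)
out = meta_perturb(h, mu=torch.zeros_like(h), s=torch.full((1, 5, 1, 1), 0.5))
```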
Conventional Meta-learning
Conventional meta-learning for few-shot classification cannot scale to the many-shot standard learning scenario we target.
• Episodic training with task sampling is costly, since many-shot learning requires a large number of gradient steps until convergence.
[Figure: episodic training repeatedly samples a task T ~ {T_1, ..., T_T} and trains the model on it.]
Conventional Meta-learning
Conventional meta-learning for few-shot classification cannot scale to the many-shot standard learning scenario we target.
• Also, due to the large network size, performing the gradient lookahead step for standard many-shot learning is costly.
[1] Finn et al., Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks, ICML 2017
[2] Ren et al., Learning to Reweight Examples for Robust Deep Learning, ICML 2018
[Figure: the lookahead step θ_t → θ_{t+1}, computed exactly as in MAML [1] or with the online approximation of [2].]
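For context, the lookahead (bilevel) update that MAML-style meta-learning [1, 2] would require can be written as below; this is the standard formulation of such methods rather than anything specific to MetaPerturb, with φ denoting the meta-parameters and α, β the inner and outer step sizes.

```latex
% Inner (lookahead) step on the training split, followed by the meta-update of \phi
% on the validation split; the meta-gradient must backpropagate through the inner step.
\theta'_t = \theta_t - \alpha \,\nabla_{\theta}\, \mathcal{L}^{\mathrm{tr}}_t(\theta_t, \phi),
\qquad
\phi \leftarrow \phi - \beta \,\nabla_{\phi}\, \mathcal{L}^{\mathrm{val}}_t\!\big(\theta'_t, \phi\big)
```

Backpropagating through θ'_t for a full many-shot training trajectory with a large network is what makes this step costly.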
Conventional Meta-learning
Conventional meta-learning for few-shot classification cannot scale to the many-shot standard learning scenario we target:
• Episodic training with task sampling is costly, since many-shot learning requires a large number of gradient steps until convergence.
• Also, due to the large network size, performing the gradient lookahead step for standard many-shot learning is costly.
Thus, we propose a novel meta-learning framework in the form of distributed joint training.
Meta-learning framework
Meta-training step (θ: main model, φ: perturbation module)
[Figure: T source tasks are trained in parallel, each with its own model θ_t but a single shared perturbation module φ; in each round, the shared φ is trained with one split of every task's data D_t and each θ_t is trained with the other split, and after joint training we obtain the meta-learned φ*.]
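The following single-process sketch gives my reading of the meta-training step (not the released code; the paper's version runs the tasks distributed across workers): every task keeps its own θ_t while one shared φ receives gradients from all tasks, and each round first updates φ on one split of every task's data and then updates each θ_t on the other split, so no episodic resampling or lookahead step is needed. The models, data, and split handling are dummy placeholders.

```python
import torch
import torch.nn as nn

T = 3
phi = nn.Linear(8, 8)                                   # stand-in for the shared perturbation module φ
thetas = [nn.Linear(8, 2) for _ in range(T)]            # stand-in task-specific models θ_1..θ_T
opt_phi = torch.optim.SGD(phi.parameters(), lr=0.1)
opt_thetas = [torch.optim.SGD(m.parameters(), lr=0.1) for m in thetas]
loss_fn = nn.CrossEntropyLoss()

def batch():                                            # dummy batch; real code would draw from the D_t splits
    return torch.randn(16, 8), torch.randint(0, 2, (16,))

for step in range(100):
    # (1) Update the shared φ: gradients from every task accumulate on φ before one step.
    opt_phi.zero_grad()
    for theta in thetas:
        x, y = batch()
        loss_fn(theta(phi(x)), y).backward()
    opt_phi.step()
    # (2) Update each task-specific θ_t on its own split, with φ held fixed for this phase.
    for theta, opt in zip(thetas, opt_thetas):
        x, y = batch()
        opt.zero_grad()
        loss_fn(theta(phi(x)), y).backward()
        opt.step()

# After joint training, φ holds the meta-learned φ* that is transferred to the target task.
```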
Meta-learning framework
Meta-test step
[Figure: the meta-learned φ* is transferred to the target task; the target model θ is trained on the target training data D with φ* applied, yielding θ*.]
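A matching sketch of the meta-test step, under the same stand-in modules: φ* is transferred and (by my reading of the slide) kept fixed, while only the target model θ is trained on the target data.

```python
import torch
import torch.nn as nn

phi_star = nn.Linear(8, 8)                      # transferred meta-learned φ* (stand-in)
for p in phi_star.parameters():
    p.requires_grad_(False)                     # ASSUMPTION: φ* stays fixed at meta-test time
theta = nn.Linear(8, 2)                         # target model θ trained from scratch
opt = torch.optim.SGD(theta.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    x, y = torch.randn(16, 8), torch.randint(0, 2, (16,))   # dummy target-task batch
    opt.zero_grad()
    loss_fn(theta(phi_star(x)), y).backward()                # φ* regularizes but is not updated
    opt.step()
# θ* is the trained target model obtained with φ*.
```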
Baselines
[Achille and Soatto, 2018] Information Dropout: Learning Optimal Representations Through Noisy Computation, TPAMI 2018
[Ghiasi et al., 2018] DropBlock: A Regularization Method for Convolutional Networks, NeurIPS 2018
[Verma et al., 2019] Manifold Mixup: Better Representations by Interpolating Hidden States, ICML 2019
[Figure: where each baseline acts in the network: Information Dropout injects noise between conv layers, DropBlock drops contiguous regions of a feature map, and Manifold Mixup interpolates hidden states between conv layers.]
Datasets, Architectures
Datasets: STL10, CIFAR-100, Stanford Dogs, Stanford Cars, Aircraft, CUB
Architectures: Conv4, Conv6, VGG9, ResNet20 / ResNet44, WideResNet-28-2
[Figure: schematics of the architectures, all built from stacked 3x3 convolutions, with residual connections x_l → x_{l+1} in the ResNets and channel widths of 64, 128, 256, 512 in VGG9.]
Results on Heterogeneous Datasets
MetaPerturb significantly improves the performance of the base network, with especially large gains on the fine-grained datasets.
MetaPerturb even outperforms finetuning on certain datasets, while transferring only 82 parameters.
Model           | # Transfer params | Source dataset | STL10        | s-CIFAR100   | Dogs         | Cars         | Aircraft     | CUB
Base            | 0                 | None           | 66.78 ± 0.59 | 31.79 ± 0.24 | 34.65 ± 1.05 | 44.35 ± 1.10 | 59.23 ± 0.95 | 30.63 ± 0.66
Info. Dropout   | 0                 | None           | 67.46 ± 0.17 | 32.32 ± 0.33 | 34.63 ± 0.68 | 43.13 ± 2.31 | 58.59 ± 0.90 | 30.83 ± 0.79
DropBlock       | 0                 | None           | 68.51 ± 0.67 | 32.74 ± 0.36 | 34.59 ± 0.87 | 45.11 ± 1.47 | 59.76 ± 1.38 | 30.55 ± 0.26
Manifold Mixup  | 0                 | None           | 72.83 ± 0.69 | 39.06 ± 0.73 | 36.29 ± 0.70 | 48.97 ± 1.69 | 64.35 ± 1.23 | 37.80 ± 0.53
MetaPerturb     | 82                | TIN            | 69.98 ± 0.63 | 34.57 ± 0.38 | 38.41 ± 0.74 | 62.46 ± 0.80 | 65.87 ± 0.77 | 42.01 ± 0.43
Finetuning (FT) | 0.3M              | TIN            | 77.16 ± 0.41 | 43.69 ± 0.22 | 40.09 ± 0.31 | 58.61 ± 1.16 | 66.03 ± 0.85 | 34.89 ± 0.30
Results on Heterogeneous Datasets
Model             | # Transfer params | Source dataset | STL10        | s-CIFAR100   | Dogs         | Cars         | Aircraft     | CUB
Base              | 0                 | None           | 66.78 ± 0.59 | 31.79 ± 0.24 | 34.65 ± 1.05 | 44.35 ± 1.10 | 59.23 ± 0.95 | 30.63 ± 0.66
Info. Dropout     | 0                 | None           | 67.46 ± 0.17 | 32.32 ± 0.33 | 34.63 ± 0.68 | 43.13 ± 2.31 | 58.59 ± 0.90 | 30.83 ± 0.79
DropBlock         | 0                 | None           | 68.51 ± 0.67 | 32.74 ± 0.36 | 34.59 ± 0.87 | 45.11 ± 1.47 | 59.76 ± 1.38 | 30.55 ± 0.26
Manifold Mixup    | 0                 | None           | 72.83 ± 0.69 | 39.06 ± 0.73 | 36.29 ± 0.70 | 48.97 ± 1.69 | 64.35 ± 1.23 | 37.80 ± 0.53
MetaPerturb       | 82                | TIN            | 69.98 ± 0.63 | 34.57 ± 0.38 | 38.41 ± 0.74 | 62.46 ± 0.80 | 65.87 ± 0.77 | 42.01 ± 0.43
Finetuning (FT)   | 0.3M              | TIN            | 77.16 ± 0.41 | 43.69 ± 0.22 | 40.09 ± 0.31 | 58.61 ± 1.16 | 66.03 ± 0.85 | 34.89 ± 0.30
FT + Info.Dropout | 0.3M + 0          | TIN            | 77.41 ± 0.13 | 43.92 ± 0.44 | 40.04 ± 0.46 | 58.07 ± 0.57 | 65.47 ± 0.27 | 35.55 ± 0.81
FT + DropBlock    | 0.3M + 0          | TIN            | 78.32 ± 0.31 | 44.84 ± 0.37 | 40.54 ± 0.56 | 61.08 ± 0.61 | 66.30 ± 0.84 | 34.61 ± 0.54
FT + Manif. Mixup | 0.3M + 0          | TIN            | 79.60 ± 0.27 | 47.92 ± 0.79 | 42.54 ± 0.70 | 64.81 ± 0.97 | 71.53 ± 0.80 | 43.07 ± 0.83
FT + MetaPerturb  | 0.3M + 82         | TIN            | 78.27 ± 0.36 | 47.41 ± 0.40 | 46.06 ± 0.44 | 73.04 ± 0.45 | 72.34 ± 0.41 | 48.60 ± 1.14
MetaPerturb can be further combined with finetuning to achieve even
larger performance gains.
Results on Heterogeneous Neural Architectures
Model         | Source network | Conv4        | Conv6        | VGG9         | ResNet20     | ResNet44     | WRN-28-2
Base          | None           | 83.93 ± 0.20 | 86.14 ± 0.23 | 88.44 ± 0.29 | 87.96 ± 0.30 | 88.94 ± 0.41 | 88.95 ± 0.44
Info. Dropout | None           | 84.91 ± 0.34 | 87.23 ± 0.26 | 88.29 ± 1.18 | 88.46 ± 0.65 | 89.33 ± 0.20 | 89.51 ± 0.29
DropBlock     | None           | 84.29 ± 0.24 | 86.22 ± 0.26 | 88.68 ± 0.35 | 89.43 ± 0.26 | 90.14 ± 0.18 | 90.55 ± 0.25
Finetuning    | Same           | 84.00 ± 0.27 | 86.56 ± 0.23 | 88.17 ± 0.18 | 88.77 ± 0.26 | 89.62 ± 0.05 | 89.85 ± 0.31
MetaPerturb   | TIN            | 86.61 ± 0.42 | 88.59 ± 0.26 | 90.24 ± 0.27 | 90.70 ± 0.25 | 90.97 ± 1.09 | 90.88 ± 0.07
When transferred to diverse heterogeneous neural architectures, MetaPerturb significantly outperforms the baselines on all target networks.
Results: Qualitative Analysis
MetaPerturb applies a different amount of noise at each layer, and the per-layer amounts vary across datasets.
MetaPerturb makes the loss surface flatter.
Results: Adversarial Robustness and Calibration
MetaPerturb is robust to adversarial perturbations, although it is not explicitly trained to be robust against them.
MetaPerturb also significantly improves calibration, as measured by the expected calibration error (ECE).
Conclusion
• We propose a lightweight and versatile perturbation function that can transfer the knowledge of a source task to heterogeneous target tasks and architectures.
• We propose a novel meta-learning framework in the form of distributed joint training, which makes it possible to perform meta-learning efficiently on large-scale datasets with deep networks.
• Our transferable regularizer substantially enhances model generalization on various datasets and architectures, outperforming existing regularizers and finetuning in most cases, while also improving robustness and calibration.
MetaPerturb:
Transferable Regularizer for
Heterogeneous Tasks and Architectures
Jeongun Ryu1*, JaeWoong Shin1*, Hae Beom Lee1*, Sung Ju Hwang12
https://github.com/JWoong148/metaperturb
Paper, Poster ID: 2018, 17113