Online Hyperparameter Meta-Learning
with Hypergradient Distillation
Hae Beom Lee1, Hayeon Lee1, Jaewoong Shin3,
Eunho Yang1,2, Timothy Hospedales4,5, Sung Ju Hwang1,2
KAIST1, AITRICS2, Lunit3, University of Edinburgh4, Samsung AI Centre Cambridge5
ICLR 2022 spotlight
Meta-Learning
• Humans generalize well because we never learn from scratch.
• Learn a model that can generalize over a distribution of tasks (see the sketch below).
[Figure: knowledge transfer from meta-training tasks (each split into training and test sets) to unseen meta-test tasks, as in MAML (Finn et al., '17)]
S. Ravi, H. Larochelle, Optimization as a Model for Few-shot Learning, ICLR 2017
C. Finn, P. Abbeel, S. Levine, Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks, ICML 2017
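To make the "adaptable shared initialization" idea concrete, below is a minimal MAML-style sketch in PyTorch on a toy regression problem. It is a hedged illustration under our own assumptions (toy tasks; names such as sample_task and inner_lr are ours), not the original implementation.

```python
import torch

# Minimal MAML-style sketch: the shared initialization w0 is meta-learned by
# backpropagating the post-adaptation test loss through a few inner gradient steps.

def loss_fn(w, X, y):
    return ((X @ w - y) ** 2).mean()

def sample_task(d=5, n=10):
    """One task = one random linear model, with a training and a test split."""
    w_true = torch.randn(d)
    def split():
        X = torch.randn(n, d)
        return X, X @ w_true + 0.1 * torch.randn(n)
    return split(), split()

w0 = torch.zeros(5, requires_grad=True)           # shared initialization (meta-parameter)
meta_opt = torch.optim.SGD([w0], lr=1e-2)
inner_lr, inner_steps = 0.1, 5

for it in range(100):
    (X_tr, y_tr), (X_te, y_te) = sample_task()
    w = w0
    for _ in range(inner_steps):                  # inner loop: adapt to the task
        g = torch.autograd.grad(loss_fn(w, X_tr, y_tr), w, create_graph=True)[0]
        w = w - inner_lr * g                      # keep the graph to backprop through adaptation
    meta_opt.zero_grad()
    loss_fn(w, X_te, y_te).backward()             # outer loop: meta-gradient w.r.t. w0
    meta_opt.step()
```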
Hyperparameters in Meta-Learning
• The parameters that do not participate in the inner-optimization → hyperparameters in meta-learning.
• They are usually high-dimensional (see the sketch below).
[Figure: examples of high-dimensional hyperparameters — the whole feature extractor, element-wise learning rates, and interleaved (e.g. "Warp") layers]
A. Raghu*, M. Raghu*, S. Bengio, O. Vinyals, Rapid Learning or Feature Reuse? Towards Understanding the Effectiveness of MAML, ICLR 2020
S. Flennerhag, A. A. Rusu, R. Pascanu, F. Visin, H. Yin, R. Hadsell, Meta-Learning with Warped Gradient Descent, ICLR 2020
Z. Li, F. Zhou, F. Chen, H. Li, Meta-SGD: Learning to Learn Quickly for Few-Shot Learning, 2017
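For example, element-wise inner learning rates (as in Meta-SGD) are exactly such parameters: they shape the inner loop but are never updated by it. The sketch below is our own toy illustration of this split, not code from any of the cited papers.

```python
import torch

# Element-wise learning rates as meta-learned hyperparameters (Meta-SGD-style sketch).
# alpha never receives an inner update; it only shapes the inner loop, so the gradient
# it receives from the post-adaptation loss is a hypergradient.

d = 5
w0 = torch.zeros(d, requires_grad=True)            # shared initialization
alpha = torch.full((d,), 0.1, requires_grad=True)  # element-wise inner learning rates (hyperparameters)
meta_opt = torch.optim.SGD([w0, alpha], lr=1e-2)

def loss_fn(w, X, y):
    return ((X @ w - y) ** 2).mean()

X_tr, y_tr = torch.randn(10, d), torch.randn(10)   # a task's training split
X_te, y_te = torch.randn(10, d), torch.randn(10)   # its test split

w = w0
for _ in range(5):                                  # inner loop updates w, never alpha
    g = torch.autograd.grad(loss_fn(w, X_tr, y_tr), w, create_graph=True)[0]
    w = w - alpha * g                               # alpha enters the unrolled computation graph

meta_opt.zero_grad()
loss_fn(w, X_te, y_te).backward()                   # hypergradient flows to alpha (and to w0)
meta_opt.step()
```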
Hyperparameter Optimization (HO)
• Hyperparameter optimization (HO): the problem of choosing a set of optimal hyperparameters for a learning algorithm.
• Which method should we use for such high-dimensional hyperparameters?
[Figure: Random Search, Bayesian Optimization, and gradient-based HO, with the annotation "not scalable to hyperparameter dimension"; examples of high-dimensional hyperparameters: whole feature extractor, element-wise learning rates, interleaved (e.g. "Warp") layers]
D. Maclaurin, D. Duvenaud, R. P. Adams, Gradient-based Hyperparameter Optimization through Reversible Learning, ICML 2015
J. Bergstra, Y. Bengio, Random Search for Hyper-parameter Optimization, 2012
https://towardsdatascience.com/shallow-understanding-on-bayesian-optimization-324b6c1f7083
In Case of Few-shot Learning
• In few-shot learning, computing the exact gradient w.r.t. the hyperparameters (i.e. the hypergradient) is not too expensive.
• A few gradient steps are sufficient for each task, so the hypergradient can be obtained by backprop through the whole (short) inner trajectory (sketched below).
[Figure: from the shared initialization, each few-shot task is adapted with 5 inner-gradient steps; backprop runs back through those steps to compute the hypergradient]
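Concretely, with only a few inner steps the exact hypergradient falls out of ordinary backprop through the unrolled inner loop. A minimal sketch under our own toy assumptions (the L2 coefficient lam is an illustrative hyperparameter, not one from the deck):

```python
import torch

# Exact hypergradient by backpropagating through a short unrolled inner loop
# (reverse-mode differentiation).

d = 5
X_tr, y_tr = torch.randn(20, d), torch.randn(20)
X_val, y_val = torch.randn(20, d), torch.randn(20)

lam = torch.tensor(0.1, requires_grad=True)        # hyperparameter (untouched by the inner loop)
w = torch.zeros(d, requires_grad=True)             # inner parameters

def inner_loss(w):
    return ((X_tr @ w - y_tr) ** 2).mean() + lam * (w ** 2).sum()

for _ in range(5):                                 # only a few inner steps -> cheap to unroll
    g = torch.autograd.grad(inner_loss(w), w, create_graph=True)[0]
    w = w - 0.1 * g

val_loss = ((X_val @ w - y_val) ** 2).mean()
hypergrad = torch.autograd.grad(val_loss, lam)[0]  # exact hypergradient d(val loss)/d(lam)
print(hypergrad)
```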
HO Does Matter when Horizon Gets Longer
• Many-shot learning → only a few gradient steps?
→ The meta-learner may suffer from the short-horizon bias (Wu et al. '18).
[Figure: the shared initialization adapted to many-shot tasks (SVHN, Flowers, Cars) with only 5 steps each]
Y. Wu*, M. Ren*, R. Liao, R. Grosse, Understanding Short-Horizon Bias in Stochastic Meta-Optimization, ICLR 2018
J. Shin*, H. B. Lee*, B. Gong, S. J. Hwang, Large-Scale Meta-Learning with Continual Trajectory Shifting, ICML 2021
HO Does Matter when Horizon Gets Longer
• Many-shot learning → requires a longer inner-learning trajectory.
→ Computing a single hypergradient (backprop through the whole trajectory) becomes too expensive!
[Figure: long inner trajectories from the shared initialization on SVHN, Flowers, and Cars, with backprop running all the way back through each one]
HO Does Matter when Horizon Gets Longer
• Many-shot learning → requires a longer inner-learning trajectory.
→ Offline method: the interval between two adjacent meta-updates is too long (see the sketch below)…
→ Meta-convergence is poor.
[Figure: each meta-update requires a 1000-step forward pass and a 1000-step backward pass, after which a new trajectory is started from the shared initialization (SVHN, Flowers, Cars)]
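For reference, the offline schedule above might look roughly like the following toy code: every meta-update needs a full forward unroll plus a full backward pass, and the trajectory is restarted afterwards. Setup and constants are ours.

```python
import torch

# Offline schedule sketch: one hyperparameter update per fully unrolled trajectory.

d, T = 5, 1000
X_tr, y_tr = torch.randn(50, d), torch.randn(50)
X_val, y_val = torch.randn(50, d), torch.randn(50)
lam = torch.tensor(0.1, requires_grad=True)            # hyperparameter
hyper_opt = torch.optim.SGD([lam], lr=1e-2)

def inner_loss(w):
    return ((X_tr @ w - y_tr) ** 2).mean() + lam * (w ** 2).sum()

for meta_step in range(10):                             # each meta-update needs a full unroll
    w = torch.zeros(d, requires_grad=True)              # new trajectory from the initialization
    for _ in range(T):                                   # T-step forward pass, graph kept in memory
        g = torch.autograd.grad(inner_loss(w), w, create_graph=True)[0]
        w = w - 0.1 * g
    val_loss = ((X_val @ w - y_val) ** 2).mean()
    hyper_opt.zero_grad()
    val_loss.backward()                                  # T-step backward pass -> one hypergradient
    hyper_opt.step()
```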
HO Does Matter when Horizon Gets Longer
• Many-shot learning → requires a longer inner-learning trajectory.
→ Online method: update the hyperparameters at every inner-gradient step! (sketched below)
[Figure (animation): along a single long trajectory on Flowers, inner-gradient steps and hyperparameter updates alternate from the shared initialization → much faster meta-convergence!]
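The online schedule looks roughly like the toy sketch below, here instantiated with the greedy 1-step hypergradient (differentiating the validation loss through only the most recent inner step). Per the deck, this crude approximation is what HyperDistill later replaces while keeping the same per-step schedule; the data and constants are illustrative.

```python
import torch

# Online HO schedule sketch: the hyperparameter is updated at every inner-gradient step,
# here with a greedy "1-step" hypergradient approximation.

d = 5
X_tr, y_tr = torch.randn(50, d), torch.randn(50)
X_val, y_val = torch.randn(50, d), torch.randn(50)

w = torch.zeros(d)
lam = torch.tensor(0.1, requires_grad=True)            # hyperparameter
hyper_opt = torch.optim.SGD([lam], lr=1e-2)

def inner_loss(w, lam):
    return ((X_tr @ w - y_tr) ** 2).mean() + lam * (w ** 2).sum()

for t in range(1000):                                   # long inner trajectory
    w = w.detach().requires_grad_(True)
    g = torch.autograd.grad(inner_loss(w, lam), w, create_graph=True)[0]
    w_next = w - 0.1 * g                                # one differentiable inner step
    val_loss = ((X_val @ w_next - y_val) ** 2).mean()
    hyper_opt.zero_grad()
    val_loss.backward()                                 # 1-step hypergradient w.r.t. lam
    hyper_opt.step()
    w = w_next                                          # continue the trajectory
```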
Criteria of Good HO Algorithm for Meta-Learning
1. Scalable to the hyperparameter dimension
2. Less or no short-horizon bias
3. Computing a single hypergradient should not be too expensive
4. Update the hyperparameters at every inner-gradient step, i.e. online optimization
[Figure: the four criteria illustrated with the earlier examples — high-dimensional hyperparameters (whole feature extractor, element-wise learning rates, interleaved "Warp" layers), 5-step short horizons vs. full backprop through long trajectories on SVHN / Flowers / Cars, and online updates giving much faster meta-convergence]
Limitations of Existing Grad-based HO Algs
Unfortunately, the existing gradient-based HO algorithms do not satisfy all the criteria simultaneously.

Criteria                           | FMD | RMD | IFT | 1-step
1. Scalable to hyperparam dim      |  x  |  o  |  o  |   o
2. Less or no short horizon bias   |  o  |  o  |  o  |   x
3. Constant memory cost            |  o  |  x  |  o  |   o
4. Online optimization             |  o  |  x  |  △  |   o

L. Franceschi, M. Donini, P. Frasconi, M. Pontil, Forward and Reverse Gradient-Based Hyperparameter Optimization, ICML 2017
J. Lorraine, P. Vicol, D. Duvenaud, Optimizing Millions of Hyperparameters by Implicit Differentiation, AISTATS 2020
J. Luketina, M. Berglund, K. Greff, T. Raiko, Scalable Gradient-Based Tuning of Continuous Regularization Hyperparameters, ICML 2016
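For a feel of how one of these baselines works, here is a minimal sketch of the IFT hypergradient with a truncated Neumann approximation of the inverse Hessian, in the spirit of Lorraine et al. (2020). The toy quadratic losses and all constants are our own, and it assumes the inner parameters are (approximately) converged.

```python
import torch

# IFT hypergradient sketch: inverse Hessian approximated by a truncated Neumann series.

d = 5
X_tr, y_tr = torch.randn(50, d), torch.randn(50)
X_val, y_val = torch.randn(50, d), torch.randn(50)
lam = torch.tensor(0.1, requires_grad=True)                 # hyperparameter
w = torch.randn(d, requires_grad=True)                      # (approximately) converged inner params

def train_loss(w, lam):
    return ((X_tr @ w - y_tr) ** 2).mean() + lam * (w ** 2).sum()

def val_loss(w):
    return ((X_val @ w - y_val) ** 2).mean()

v = torch.autograd.grad(val_loss(w), w)[0]                  # dL_val / dw
g_w = torch.autograd.grad(train_loss(w, lam), w, create_graph=True)[0]

# Neumann series: H^{-1} v ≈ a * sum_{i=0..K} (I - a H)^i v  (a small enough to converge)
a, K = 0.01, 20
cur, p = v.clone(), v.clone()
for _ in range(K):
    hvp = torch.autograd.grad(g_w, w, grad_outputs=cur, retain_graph=True)[0]  # H @ cur
    cur = cur - a * hvp
    p = p + cur
p = a * p                                                    # ≈ H^{-1} v

# indirect term: p^T d^2 L_tr / (dw dlam); the direct term dL_val/dlam is zero here
indirect = torch.autograd.grad(g_w, lam, grad_outputs=p)[0]
hypergrad = -indirect
print(hypergrad)
```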
Goal of This Paper
This paper aims to overcome all the aforementioned limitations at the same time.

Criteria                           | FMD | RMD | IFT | 1-step | Ours
1. Scalable to hyperparam dim      |  x  |  o  |  o  |   o    |  o
2. Less or no short horizon bias   |  o  |  o  |  o  |   x    |  o
3. Constant memory cost            |  o  |  x  |  o  |   o    |  o
4. Online optimization             |  o  |  x  |  △  |   o    |  o

Ours: hypergradient distillation.
Hypergradient Distillation
• Computing the exact hypergradient at online step t requires 2t – 1 JVP computations (e.g. RMD).
[Figure: the exact hypergradient, expressed through the response Jacobians along the inner trajectory from the shared initialization]
Hypergradient Distillation
• The 1-step approximation requires only 1 JVP computation, but it suffers from short-horizon bias.
[Figure: the 1-step hypergradient and its response Jacobian at the current step of the trajectory, starting from the shared initialization]
Hypergradient Distillation
For each online HO step t, distill the exact hypergradient (2t – 1 JVPs) into a single JVP, using a distilled weight and dataset (w_t^*, D_t^*) → hypergradient direction, and a scaling factor π^* → hypergradient size.
• It does not require computing the actual g_t^SO (the indirect term of the hypergradient).
• We only need to keep updating a moving average of w_t^* and D_t^*.
• The scaling factor π^* is also efficiently estimated with a function approximator.
• Thus we can approximately solve the distillation problem efficiently.
Please read the main paper for the technical details!
[Figure: the trajectory (w_1, D_1), (w_2, D_2), …, (w_t, D_t) from the shared initialization is distilled into (w_t^*, D_t^*)]
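As a rough schematic of the mechanics described above — and explicitly not the paper's algorithm — the online loop could be organized as below: one inner step on the real trajectory, a single-JVP hypergradient taken through one step at the distilled (w*, D*), scaled by π, and a moving-average refresh of (w*, D*). Every update rule and constant here is a placeholder of ours; see the paper for the actual distillation objective.

```python
import torch

# Schematic sketch only (NOT the paper's algorithm): a single-JVP hypergradient
# approximation in the spirit of the slide above, with illustrative update rules.

d = 5
X_tr, y_tr = torch.randn(50, d), torch.randn(50)
X_val, y_val = torch.randn(50, d), torch.randn(50)

lam = torch.tensor(0.1, requires_grad=True)                     # hyperparameter
hyper_opt = torch.optim.SGD([lam], lr=1e-2)
w = torch.zeros(d)
w_star, X_star, y_star = w.clone(), X_tr.clone(), y_tr.clone()  # distilled weight / dataset
pi, inner_lr, ema = 1.0, 0.1, 0.9                               # pi: placeholder for the learned scaling factor

def inner_loss(w, lam, X, y):
    return ((X @ w - y) ** 2).mean() + lam * (w ** 2).sum()

for t in range(1, 1001):
    # (1) one inner-gradient step on the real trajectory
    w = w.detach().requires_grad_(True)
    g = torch.autograd.grad(inner_loss(w, lam.detach(), X_tr, y_tr), w)[0]
    w = (w - inner_lr * g).detach()

    # (2) hypergradient direction from a single differentiable step at the distilled (w*, D*)
    ws = w_star.detach().requires_grad_(True)
    gs = torch.autograd.grad(inner_loss(ws, lam, X_star, y_star), ws, create_graph=True)[0]
    val = ((X_val @ (ws - inner_lr * gs) - y_val) ** 2).mean()
    hyper_opt.zero_grad()
    (pi * val).backward()                                       # scaled single-JVP-sized backward w.r.t. lam
    hyper_opt.step()

    # (3) refresh the distilled weight / dataset with a moving average (illustrative rule)
    with torch.no_grad():
        w_star = ema * w_star + (1 - ema) * w
        X_star = ema * X_star + (1 - ema) * X_tr
        y_star = ema * y_star + (1 - ema) * y_tr
```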
Experimental Setup
• Meta-learning models: ANIL (Raghu et al. '20), WarpGrad (Flennerhag et al. '20), Meta-Weight-Net (Shu et al. '19)
• Task distribution: 10-way 250-shot tasks from CIFAR100 and tinyImageNet
• Other details: 100 inner-gradient steps; Reptile is used for learning the shared initialization
A. Nichol, J. Achiam, J. Schulman, On First-Order Meta-Learning Algorithms, 2018
J. Shu, Q. Xie, L. Yi, Q. Zhao, S. Zhou, Z. Xu, D. Meng, Meta-Weight-Net: Learning an Explicit Mapping For Sample Weighting, NeurIPS 2019
Experimental Results
Q1. Does HyperDistill provide faster convergence?
Meta-training convergence (Test Loss)
Experimental Results
Q2. Does HyperDistill provide better generalization performance?
Meta-validation performance (Test Acc)
Meta-test performance (Test Acc)
Experimental Results
Q3. Is HyperDistill a reasonable approximation to the true hypergradient?
Cosine similarity to the true hypergradient
Q4. Is HyperDistill computationally efficient?
GPU memory consumption and wall-clock runtime
Conclusion
• The existing gradient-based HO algorithms do not satisfy the four criteria that should be met for
their practical use in meta-learning.
• In this paper, we showed that for each online HO step, it is possible to efficiently distill the whole
hypergradient indirect term into a single JVP, satisfying the four criteria simultaneously.
• Thanks to the accurate hypergradient approximation, HyperDistill improves meta-training convergence and meta-test performance in a computationally efficient manner.
github.com/haebeom-lee/hyperdistill