Online Hyperparameter Meta-Learning
with Hypergradient Distillation
Hae Beom Lee1, Hayeon Lee1, Jaewoong Shin3,
Eunho Yang1,2, Timothy Hospedales4,5, Sung Ju Hwang1,2
KAIST1, AITRICS2, Lunit3, University of Edinburgh4, Samsung AI Centre Cambridge5
ICLR 2022 spotlight
Meta-Learning
• Humans generalize well because we never learn from scratch.
• Learn a model that can generalize over a distribution of tasks (see the sketch below).
[Figure: knowledge transfer from meta-training tasks (each split into training and test sets) to unseen meta-test tasks, as in MAML (Finn et al., '17)]
S. Ravi, H. Larochelle, Optimization as a Model for Few-shot Learning, ICLR 2017
C. Finn, P. Abbeel, S. Levine, Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks, ICML 2017
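To make the "adaptable shared initialization" idea concrete, below is a minimal MAML-style sketch in PyTorch on a toy regression problem. It is a hedged illustration under our own assumptions (toy tasks; names such as sample_task and inner_lr are ours), not the original implementation.

```python
import torch

# Minimal MAML-style sketch: the shared initialization w0 is meta-learned by
# backpropagating the post-adaptation test loss through a few inner gradient steps.

def loss_fn(w, X, y):
    return ((X @ w - y) ** 2).mean()

def sample_task(d=5, n=10):
    """One task = one random linear model, with a training and a test split."""
    w_true = torch.randn(d)
    def split():
        X = torch.randn(n, d)
        return X, X @ w_true + 0.1 * torch.randn(n)
    return split(), split()

w0 = torch.zeros(5, requires_grad=True)           # shared initialization (meta-parameter)
meta_opt = torch.optim.SGD([w0], lr=1e-2)
inner_lr, inner_steps = 0.1, 5

for it in range(100):
    (X_tr, y_tr), (X_te, y_te) = sample_task()
    w = w0
    for _ in range(inner_steps):                  # inner loop: adapt to the task
        g = torch.autograd.grad(loss_fn(w, X_tr, y_tr), w, create_graph=True)[0]
        w = w - inner_lr * g                      # keep the graph to backprop through adaptation
    meta_opt.zero_grad()
    loss_fn(w, X_te, y_te).backward()             # outer loop: meta-gradient w.r.t. w0
    meta_opt.step()
```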
Hyperparameters in Meta-Learning
• The parameters that do not participate in the inner-optimization → hyperparameters in meta-learning.
• They are usually high-dimensional (see the sketch below).
[Figure: examples of high-dimensional hyperparameters — the whole feature extractor, element-wise learning rates, and interleaved (e.g. "Warp") layers]
A. Raghu*, M. Raghu*, S. Bengio, O. Vinyals, Rapid Learning or Feature Reuse? Towards Understanding the Effectiveness of MAML, ICLR 2020
S. Flennerhag, A. A. Rusu, R. Pascanu, F. Visin, H. Yin, R. Hadsell, Meta-Learning with Warped Gradient Descent, ICLR 2020
Z. Li, F. Zhou, F. Chen, H. Li, Meta-SGD: Learning to Learn Quickly for Few-Shot Learning, 2017
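For example, element-wise inner learning rates (as in Meta-SGD) are exactly such parameters: they shape the inner loop but are never updated by it. The sketch below is our own toy illustration of this split, not code from any of the cited papers.

```python
import torch

# Element-wise learning rates as meta-learned hyperparameters (Meta-SGD-style sketch).
# alpha never receives an inner update; it only shapes the inner loop, so the gradient
# it receives from the post-adaptation loss is a hypergradient.

d = 5
w0 = torch.zeros(d, requires_grad=True)            # shared initialization
alpha = torch.full((d,), 0.1, requires_grad=True)  # element-wise inner learning rates (hyperparameters)
meta_opt = torch.optim.SGD([w0, alpha], lr=1e-2)

def loss_fn(w, X, y):
    return ((X @ w - y) ** 2).mean()

X_tr, y_tr = torch.randn(10, d), torch.randn(10)   # a task's training split
X_te, y_te = torch.randn(10, d), torch.randn(10)   # its test split

w = w0
for _ in range(5):                                  # inner loop updates w, never alpha
    g = torch.autograd.grad(loss_fn(w, X_tr, y_tr), w, create_graph=True)[0]
    w = w - alpha * g                               # alpha enters the unrolled computation graph

meta_opt.zero_grad()
loss_fn(w, X_te, y_te).backward()                   # hypergradient flows to alpha (and to w0)
meta_opt.step()
```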
Hyperparameter Optimization (HO)
• Hyperparameter optimization (HO): the problem of choosing a set of optimal hyperparameters for a learning algorithm.
• Which method should we use for such high-dimensional hyperparameters?
[Figure: Random Search, Bayesian Optimization, and gradient-based HO, with the annotation "not scalable to hyperparameter dimension"; examples of high-dimensional hyperparameters: whole feature extractor, element-wise learning rates, interleaved (e.g. "Warp") layers]
D. Maclaurin, D. Duvenaud, R. P. Adams, Gradient-based Hyperparameter Optimization through Reversible Learning, ICML 2015
J. Bergstra, Y. Bengio, Random Search for Hyper-parameter Optimization, 2012
https://towardsdatascience.com/shallow-understanding-on-bayesian-optimization-324b6c1f7083
In Case of Few-shot Learning
• In few-shot learning, computing the exact gradient w.r.t. the hyperparameters (i.e. the hypergradient) is not too expensive.
• A few gradient steps are sufficient for each task, so the hypergradient can be obtained by backprop through the whole (short) inner trajectory (sketched below).
[Figure: from the shared initialization, each few-shot task is adapted with 5 inner-gradient steps; backprop runs back through those steps to compute the hypergradient]
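Concretely, with only a few inner steps the exact hypergradient falls out of ordinary backprop through the unrolled inner loop. A minimal sketch under our own toy assumptions (the L2 coefficient lam is an illustrative hyperparameter, not one from the deck):

```python
import torch

# Exact hypergradient by backpropagating through a short unrolled inner loop
# (reverse-mode differentiation).

d = 5
X_tr, y_tr = torch.randn(20, d), torch.randn(20)
X_val, y_val = torch.randn(20, d), torch.randn(20)

lam = torch.tensor(0.1, requires_grad=True)        # hyperparameter (untouched by the inner loop)
w = torch.zeros(d, requires_grad=True)             # inner parameters

def inner_loss(w):
    return ((X_tr @ w - y_tr) ** 2).mean() + lam * (w ** 2).sum()

for _ in range(5):                                 # only a few inner steps -> cheap to unroll
    g = torch.autograd.grad(inner_loss(w), w, create_graph=True)[0]
    w = w - 0.1 * g

val_loss = ((X_val @ w - y_val) ** 2).mean()
hypergrad = torch.autograd.grad(val_loss, lam)[0]  # exact hypergradient d(val loss)/d(lam)
print(hypergrad)
```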
HO Does Matter when Horizon Gets Longer
• Many-shot learning → only a few gradient steps?
→ The meta-learner may suffer from the short-horizon bias (Wu et al. '18).
[Figure: the shared initialization adapted to many-shot tasks (SVHN, Flowers, Cars) with only 5 steps each]
Y. Wu*, M. Ren*, R. Liao, R. Grosse, Understanding Short-Horizon Bias in Stochastic Meta-Optimization, ICLR 2018
J. Shin*, H. B. Lee*, B. Gong, S. J. Hwang, Large-Scale Meta-Learning with Continual Trajectory Shifting, ICML 2021
HO Does Matter when Horizon Gets Longer
• Many-shot learning → requires a longer inner-learning trajectory.
→ Computing a single hypergradient (backprop through the whole trajectory) becomes too expensive!
[Figure: long inner trajectories from the shared initialization on SVHN, Flowers, and Cars, with backprop running all the way back through each one]
HO Does Matter when Horizon Gets Longer
• Many-shot learning → requires a longer inner-learning trajectory.
→ Offline method: the interval between two adjacent meta-updates is too long (see the sketch below)…
→ Meta-convergence is poor.
[Figure: each meta-update requires a 1000-step forward pass and a 1000-step backward pass, after which a new trajectory is started from the shared initialization (SVHN, Flowers, Cars)]
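For reference, the offline schedule above might look roughly like the following toy code: every meta-update needs a full forward unroll plus a full backward pass, and the trajectory is restarted afterwards. Setup and constants are ours.

```python
import torch

# Offline schedule sketch: one hyperparameter update per fully unrolled trajectory.

d, T = 5, 1000
X_tr, y_tr = torch.randn(50, d), torch.randn(50)
X_val, y_val = torch.randn(50, d), torch.randn(50)
lam = torch.tensor(0.1, requires_grad=True)            # hyperparameter
hyper_opt = torch.optim.SGD([lam], lr=1e-2)

def inner_loss(w):
    return ((X_tr @ w - y_tr) ** 2).mean() + lam * (w ** 2).sum()

for meta_step in range(10):                             # each meta-update needs a full unroll
    w = torch.zeros(d, requires_grad=True)              # new trajectory from the initialization
    for _ in range(T):                                   # T-step forward pass, graph kept in memory
        g = torch.autograd.grad(inner_loss(w), w, create_graph=True)[0]
        w = w - 0.1 * g
    val_loss = ((X_val @ w - y_val) ** 2).mean()
    hyper_opt.zero_grad()
    val_loss.backward()                                  # T-step backward pass -> one hypergradient
    hyper_opt.step()
```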
HO Does Matter when Horizon Gets Longer
• Many-shot learning → requires a longer inner-learning trajectory.
→ Online method: update the hyperparameters at every inner-gradient step! (sketched below)
[Figure (animation): along a single long trajectory on Flowers, inner-gradient steps and hyperparameter updates alternate from the shared initialization → much faster meta-convergence!]
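The online schedule looks roughly like the toy sketch below, here instantiated with the greedy 1-step hypergradient (differentiating the validation loss through only the most recent inner step). Per the deck, this crude approximation is what HyperDistill later replaces while keeping the same per-step schedule; the data and constants are illustrative.

```python
import torch

# Online HO schedule sketch: the hyperparameter is updated at every inner-gradient step,
# here with a greedy "1-step" hypergradient approximation.

d = 5
X_tr, y_tr = torch.randn(50, d), torch.randn(50)
X_val, y_val = torch.randn(50, d), torch.randn(50)

w = torch.zeros(d)
lam = torch.tensor(0.1, requires_grad=True)            # hyperparameter
hyper_opt = torch.optim.SGD([lam], lr=1e-2)

def inner_loss(w, lam):
    return ((X_tr @ w - y_tr) ** 2).mean() + lam * (w ** 2).sum()

for t in range(1000):                                   # long inner trajectory
    w = w.detach().requires_grad_(True)
    g = torch.autograd.grad(inner_loss(w, lam), w, create_graph=True)[0]
    w_next = w - 0.1 * g                                # one differentiable inner step
    val_loss = ((X_val @ w_next - y_val) ** 2).mean()
    hyper_opt.zero_grad()
    val_loss.backward()                                 # 1-step hypergradient w.r.t. lam
    hyper_opt.step()
    w = w_next                                          # continue the trajectory
```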
Criteria of Good HO Algorithm for Meta-Learning
1. Scalable to the hyperparameter dimension
2. Less or no short-horizon bias
3. Computing a single hypergradient should not be too expensive
4. Update the hyperparameters at every inner-gradient step, i.e. online optimization
[Figure: the four criteria illustrated with the earlier examples — high-dimensional hyperparameters (whole feature extractor, element-wise learning rates, interleaved "Warp" layers), 5-step short horizons vs. full backprop through long trajectories on SVHN / Flowers / Cars, and online updates giving much faster meta-convergence]
Limitations of Existing Grad-based HO Algs
Unfortunately, the existing gradient-based HO algorithms do not satisfy all the criteria simultaneously.

Criteria                           | FMD | RMD | IFT | 1-step
1. Scalable to hyperparam dim      |  x  |  o  |  o  |   o
2. Less or no short horizon bias   |  o  |  o  |  o  |   x
3. Constant memory cost            |  o  |  x  |  o  |   o
4. Online optimization             |  o  |  x  |  △  |   o

L. Franceschi, M. Donini, P. Frasconi, M. Pontil, Forward and Reverse Gradient-Based Hyperparameter Optimization, ICML 2017
J. Lorraine, P. Vicol, D. Duvenaud, Optimizing Millions of Hyperparameters by Implicit Differentiation, AISTATS 2020
J. Luketina, M. Berglund, K. Greff, T. Raiko, Scalable Gradient-Based Tuning of Continuous Regularization Hyperparameters, ICML 2016
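For a feel of how one of these baselines works, here is a minimal sketch of the IFT hypergradient with a truncated Neumann approximation of the inverse Hessian, in the spirit of Lorraine et al. (2020). The toy quadratic losses and all constants are our own, and it assumes the inner parameters are (approximately) converged.

```python
import torch

# IFT hypergradient sketch: inverse Hessian approximated by a truncated Neumann series.

d = 5
X_tr, y_tr = torch.randn(50, d), torch.randn(50)
X_val, y_val = torch.randn(50, d), torch.randn(50)
lam = torch.tensor(0.1, requires_grad=True)                 # hyperparameter
w = torch.randn(d, requires_grad=True)                      # (approximately) converged inner params

def train_loss(w, lam):
    return ((X_tr @ w - y_tr) ** 2).mean() + lam * (w ** 2).sum()

def val_loss(w):
    return ((X_val @ w - y_val) ** 2).mean()

v = torch.autograd.grad(val_loss(w), w)[0]                  # dL_val / dw
g_w = torch.autograd.grad(train_loss(w, lam), w, create_graph=True)[0]

# Neumann series: H^{-1} v ≈ a * sum_{i=0..K} (I - a H)^i v  (a small enough to converge)
a, K = 0.01, 20
cur, p = v.clone(), v.clone()
for _ in range(K):
    hvp = torch.autograd.grad(g_w, w, grad_outputs=cur, retain_graph=True)[0]  # H @ cur
    cur = cur - a * hvp
    p = p + cur
p = a * p                                                    # ≈ H^{-1} v

# indirect term: p^T d^2 L_tr / (dw dlam); the direct term dL_val/dlam is zero here
indirect = torch.autograd.grad(g_w, lam, grad_outputs=p)[0]
hypergrad = -indirect
print(hypergrad)
```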
Goal of This Paper
This paper aims to overcome all the aforementioned limitations at the same time.

Criteria                           | FMD | RMD | IFT | 1-step | Ours
1. Scalable to hyperparam dim      |  x  |  o  |  o  |   o    |  o
2. Less or no short horizon bias   |  o  |  o  |  o  |   x    |  o
3. Constant memory cost            |  o  |  x  |  o  |   o    |  o
4. Online optimization             |  o  |  x  |  △  |   o    |  o

Ours: hypergradient distillation.
Hypergradient Distillation
• Computing the exact hypergradient at online step t requires 2t – 1 JVP computations (e.g. RMD).
[Figure: the exact hypergradient, expressed through the response Jacobians along the inner trajectory from the shared initialization]
Hypergradient Distillation
• The 1-step approximation requires only 1 JVP computation, but it suffers from short-horizon bias.
[Figure: the 1-step hypergradient and its response Jacobian at the current step of the trajectory, starting from the shared initialization]
Hypergradient Distillation
For each online HO step t, distill the exact hypergradient (2t – 1 JVPs) into a single JVP, using a distilled weight and dataset (w_t^*, D_t^*) → hypergradient direction, and a scaling factor π^* → hypergradient size.
• It does not require computing the actual g_t^SO (the indirect term of the hypergradient).
• We only need to keep updating a moving average of w_t^* and D_t^*.
• The scaling factor π^* is also efficiently estimated with a function approximator.
• Thus we can approximately solve the distillation problem efficiently.
Please read the main paper for the technical details!
[Figure: the trajectory (w_1, D_1), (w_2, D_2), …, (w_t, D_t) from the shared initialization is distilled into (w_t^*, D_t^*)]
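As a rough schematic of the mechanics described above — and explicitly not the paper's algorithm — the online loop could be organized as below: one inner step on the real trajectory, a single-JVP hypergradient taken through one step at the distilled (w*, D*), scaled by π, and a moving-average refresh of (w*, D*). Every update rule and constant here is a placeholder of ours; see the paper for the actual distillation objective.

```python
import torch

# Schematic sketch only (NOT the paper's algorithm): a single-JVP hypergradient
# approximation in the spirit of the slide above, with illustrative update rules.

d = 5
X_tr, y_tr = torch.randn(50, d), torch.randn(50)
X_val, y_val = torch.randn(50, d), torch.randn(50)

lam = torch.tensor(0.1, requires_grad=True)                     # hyperparameter
hyper_opt = torch.optim.SGD([lam], lr=1e-2)
w = torch.zeros(d)
w_star, X_star, y_star = w.clone(), X_tr.clone(), y_tr.clone()  # distilled weight / dataset
pi, inner_lr, ema = 1.0, 0.1, 0.9                               # pi: placeholder for the learned scaling factor

def inner_loss(w, lam, X, y):
    return ((X @ w - y) ** 2).mean() + lam * (w ** 2).sum()

for t in range(1, 1001):
    # (1) one inner-gradient step on the real trajectory
    w = w.detach().requires_grad_(True)
    g = torch.autograd.grad(inner_loss(w, lam.detach(), X_tr, y_tr), w)[0]
    w = (w - inner_lr * g).detach()

    # (2) hypergradient direction from a single differentiable step at the distilled (w*, D*)
    ws = w_star.detach().requires_grad_(True)
    gs = torch.autograd.grad(inner_loss(ws, lam, X_star, y_star), ws, create_graph=True)[0]
    val = ((X_val @ (ws - inner_lr * gs) - y_val) ** 2).mean()
    hyper_opt.zero_grad()
    (pi * val).backward()                                       # scaled single-JVP-sized backward w.r.t. lam
    hyper_opt.step()

    # (3) refresh the distilled weight / dataset with a moving average (illustrative rule)
    with torch.no_grad():
        w_star = ema * w_star + (1 - ema) * w
        X_star = ema * X_star + (1 - ema) * X_tr
        y_star = ema * y_star + (1 - ema) * y_tr
```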
Experimental Setup
• Meta-learning models: ANIL (Raghu et al. '20), WarpGrad (Flennerhag et al. '20), Meta-Weight-Net (Shu et al. '19)
• Task distribution: 10-way 250-shot tasks from CIFAR100 and tinyImageNet
• Other details: 100 inner-gradient steps; Reptile is used for learning the shared initialization
A. Nichol, J. Achiam, J. Schulman, On First-Order Meta-Learning Algorithms, 2018
J. Shu, Q. Xie, L. Yi, Q. Zhao, S. Zhou, Z. Xu, D. Meng, Meta-Weight-Net: Learning an Explicit Mapping For Sample Weighting, NeurIPS 2019
Experimental Results
Q1. Does HyperDistill provide faster convergence?
Meta-training convergence (Test Loss)
Experimental Results
Q2. Does HyperDistill provide better generalization performance?
Meta-validation performance (Test Acc)
Meta-test performance (Test Acc)
Experimental Results
Q3. Is HyperDistill a reasonable approximation to the true hypergradient?
Cosine similarity to the true hypergradient
Q4. Is HyperDistill computationally efficient?
GPU memory consumption and wall-clock runtime
Conclusion
• The existing gradient-based HO algorithms do not satisfy the four criteria that should be met for
their practical use in meta-learning.
• In this paper, we showed that for each online HO step, it is possible to efficiently distill the whole
hypergradient indirect term into a single JVP, satisfying the four criteria simultaneously.
• Thanks to the accurate hypergradient approximation, HyperDistill improves meta-training convergence and meta-test performance in a computationally efficient manner.
github.com/haebeom-lee/hyperdistill