OPTIMIZATION AS A MODEL FOR
FEW-SHOT LEARNING
Hugo Larochelle
Work done at Twitter

Google Brain

Joint work with Sachin Ravi
A RESEARCH AGENDA
• Deep learning successes have required a lot of labeled training data
‣ collecting and labeling such data requires significant human labor
‣ is that really how we’ll solve AI?
• Alternative solution: exploit other sources of data that are imperfect but plentiful
‣ unlabeled data (unsupervised learning)
‣ multimodal data (multimodal learning)
‣ multidomain data (transfer learning, domain adaptation)
A RESEARCH AGENDA
• Let’s directly attack the problem of few-shot learning
‣ we want to design a learning algorithm A that outputs good parameters $\theta$ of a model M, when fed a small dataset $D_{train} = \{(X_t, Y_t)\}_{t=1}^{T}$
• Idea: let’s learn that algorithm A, end-to-end
‣ this is known as meta-learning or learning to learn
META-LEARNING
• Learning algorithm A
‣ input: training set $D_{train} = \{(X_t, Y_t)\}$
‣ output: parameters $\theta$ of model M (the learner)
‣ objective: good performance on test set $D_{test} = (X, Y)$
• Meta-learning algorithm
‣ input: meta-training set $\mathcal{D}_{\text{meta-train}} = \{(D_{train}^{(n)}, D_{test}^{(n)})\}_{n=1}^{N}$
‣ output: parameters $\Theta$ of algorithm A (the meta-learner)
‣ objective: good performance on meta-test set $\mathcal{D}_{\text{meta-test}} = (D_{train}, D_{test})$
captures fundamental knowledge shared among all the tasks.
2 TASK DESCRIPTION
We first begin by detailing the meta-learning formulation we use. In the typical machine learning setting, we are interested in a dataset $D$ and usually split $D$ so that we optimize parameters on a training set $D_{train}$ and evaluate its generalization on the test set $D_{test}$. In meta-learning, we are dealing with meta-sets $\mathcal{D}$ containing multiple regular datasets, where each $D \in \mathcal{D}$ has a split of $D_{train}$ and $D_{test}$.
We consider the k-shot, N-class classification task, where for each dataset $D$, the training set consists of $k$ labelled examples for each of $N$ classes, meaning that $D_{train}$ consists of $k \cdot N$ examples, and $D_{test}$ has a set number of examples for evaluation.
In meta-learning, we thus have different meta-sets for meta-training, meta-validation, and meta-testing ($\mathcal{D}_{\text{meta-train}}$, $\mathcal{D}_{\text{meta-validation}}$, and $\mathcal{D}_{\text{meta-test}}$, respectively). On $\mathcal{D}_{\text{meta-train}}$, we are interested in training a learning procedure (the meta-learning model) that can take as input one of its training sets $D_{train}$ and produce a model that achieves high average classification performance on its corresponding test set $D_{test}$. Using $\mathcal{D}_{\text{meta-validation}}$ we can perform hyper-parameter selection of the meta-learning model and evaluate its generalization performance on $\mathcal{D}_{\text{meta-test}}$.
For this formulation to correspond to the few-shot learning setting, each training set in datasets $D \in \mathcal{D}$ will contain few labeled examples (we consider $k = 1$ or $k = 5$), that must …
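The k-shot, N-class setup described above can be illustrated with a small sampler. This is only a sketch under stated assumptions: the function name `sample_episode`, the dictionary layout `data_by_class`, and the choice of a fixed number of test examples per class (`n_test`) are hypothetical and not taken from the slides.

```python
import random

def sample_episode(data_by_class, n_classes, k_shot, n_test, rng):
    """Build one dataset D = (D_train, D_test) for k-shot, N-class classification.

    data_by_class maps a class name to a list of examples (hypothetical layout).
    n_test is the number of test examples drawn per class (an assumption).
    """
    classes = rng.sample(sorted(data_by_class), n_classes)
    d_train, d_test = [], []
    for label, cls in enumerate(classes):
        # Draw k training and n_test test examples of this class without overlap.
        picked = rng.sample(data_by_class[cls], k_shot + n_test)
        d_train += [(x, label) for x in picked[:k_shot]]
        d_test += [(x, label) for x in picked[k_shot:]]
    return d_train, d_test

# Toy data: 8 classes with 20 placeholder examples each.
rng = random.Random(0)
toy = {f"class_{c}": [f"img_{c}_{i}" for i in range(20)] for c in range(8)}
d_train, d_test = sample_episode(toy, n_classes=5, k_shot=1, n_test=3, rng=rng)
print(len(d_train), len(d_test))  # k * N = 5 training examples, 15 test examples
```

A meta-set is then simply a collection of such (D_train, D_test) pairs, each drawn from the classes reserved for that meta-set.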
Figure 1: Computational graph for the forward pass of the meta-learner. The dashed line divides examples from the training set $D_{train}$ and test set $D_{test}$. Each $(X_i, Y_i)$ is the $i$-th batch from the training set whereas $(X, Y)$ is all the elements from the test set. The dashed arrows indicate that we do not back-propagate through that step when training the meta-learner. We refer to the learner as M, where $M(X; \theta)$ is the output of learner M using parameters $\theta$ for inputs $X$. We also use $\nabla_t$ as a shorthand for $\nabla_{\theta_{t-1}} L_t$.
… to have training conditions match those of test time. During evaluation of the meta-learning, for each dataset $D = (D_{train}, D_{test}) \in \mathcal{D}_{\text{meta-test}}$, a good meta-learner model will, given a series of learner gradients and losses on the training set $D_{train}$, suggest a series of updates for the learner model that trains it towards good performance on the test set $D_{test}$.
META-LEARNING
Figure 1: Example of meta-learning setup. The top represents the meta-training set $\mathcal{D}_{\text{meta-train}}$, where inside each gray box is a separate dataset that consists of the training set $D_{train}$ (left side of dashed line) and the test set $D_{test}$ (right side of dashed line). In this illustration, we are considering …
A META-LEARNING MODEL
• How to parametrize learning algorithms?
‣ we take inspiration from the gradient descent algorithm: $\theta_t = \theta_{t-1} - \alpha_t \nabla_{\theta_{t-1}} L_t$
‣ we parametrize this update similarly to LSTM state updates: $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$
- state $c_t$ is model M’s parameter space
- state update $\tilde{c}_t$ is the negative gradient
- $f_t$ and $i_t$ are LSTM gates
MODEL DESCRIPTION
Consider a single dataset $D \in \mathcal{D}_{\text{meta-train}}$. Suppose we have a learner neural net model with parameters $\theta$ that we want to train on $D_{train}$. The standard optimization algorithms used to train deep neural networks are some variant of gradient descent, which uses updates of the form
$$\theta_t = \theta_{t-1} - \alpha_t \nabla_{\theta_{t-1}} L_t,$$
where $\theta_{t-1}$ are the parameters of the learner after $t-1$ updates, $\alpha_t$ is the learning rate at time $t$, $L_t$ is the loss optimized by the learner for its $t$-th update, $\nabla_{\theta_{t-1}} L_t$ is the gradient of that loss with respect to parameters $\theta_{t-1}$, and $\theta_t$ is the updated parameters of the learner.
Under review as a conference paper at ICLR 2017
Our key observation is that this update resembles the update for the cell state in an LSTM
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \qquad (2)$$
if $f_t = 1$, $c_{t-1} = \theta_{t-1}$, $i_t = \alpha_t$, and $\tilde{c}_t = -\nabla_{\theta_{t-1}} L_t$.
Thus, we propose training a meta-learner LSTM to learn an update rule for training a neural network. We set the cell state of the LSTM to be the parameters of the learner, or $c_t = \theta_t$, and the candidate cell state $\tilde{c}_t = -\nabla_{\theta_{t-1}} L_t$, given how valuable information about the gradient is for optimization. We define parametric forms for $i_t$ and $f_t$ so that the meta-learner can determine optimal values through the course of the updates.
Let us start with $i_t$, which corresponds to the learning rate for the updates. We let
$$i_t = W_I \cdot \left[ \nabla_{\theta_{t-1}} L_t,\; L_t,\; \theta_{t-1},\; i_{t-1} \right] + b_I,$$
meaning that the learning rate is a function of the current parameter value $\theta_{t-1}$, the current gradient $\nabla_{\theta_{t-1}} L_t$, the current loss $L_t$, and the previous learning rate $i_{t-1}$. With this information, the meta-learner should be able to finely control the learning rate so as to train the learner quickly while avoiding divergence.
As for $f_t$, it seems possible that the optimal choice isn’t the constant 1. Intuitively, what would justify shrinking the parameters of the learner and forgetting part of its previous value would be if the learner is currently in a bad local optimum and needs a large change to escape. This would correspond to a situation where the loss is high but the gradient is close to zero. Thus, one proposal for the forget gate is to have it be a function of that information, as well as the previous value of the forget gate:
$$f_t = W_F \cdot \left[ \nabla_{\theta_{t-1}} L_t,\; L_t,\; \theta_{t-1},\; f_{t-1} \right] + b_F.$$
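The gate equations and cell-state update above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the gates use the affine form exactly as written in the excerpt, and the same gate weights are applied to every coordinate of $\theta$.

```python
def affine_gate(weights, bias, grad, loss, theta, prev_gate):
    """Gate value W . [grad, loss, theta, prev_gate] + b for one coordinate."""
    inputs = (grad, loss, theta, prev_gate)
    return sum(w * v for w, v in zip(weights, inputs)) + bias

def meta_learner_step(theta, grads, loss, prev_i, prev_f, W_I, b_I, W_F, b_F):
    """One cell-state update c_t = f_t * c_{t-1} + i_t * c~_t, with c~_t = -grad.

    theta, grads, prev_i and prev_f are lists with one entry per learner parameter.
    """
    new_theta, new_i, new_f = [], [], []
    for th, g, pi, pf in zip(theta, grads, prev_i, prev_f):
        i_t = affine_gate(W_I, b_I, g, loss, th, pi)
        f_t = affine_gate(W_F, b_F, g, loss, th, pf)
        new_theta.append(f_t * th + i_t * (-g))
        new_i.append(i_t)
        new_f.append(f_t)
    return new_theta, new_i, new_f

# Sanity check: zero gate weights with b_I = alpha and b_F = 1 give plain gradient descent.
alpha = 0.1
theta = [1.0, -2.0]
grads = [0.5, -4.0]
zeros = (0.0, 0.0, 0.0, 0.0)
stepped, _, _ = meta_learner_step(theta, grads, loss=3.0,
                                  prev_i=[alpha] * 2, prev_f=[1.0] * 2,
                                  W_I=zeros, b_I=alpha, W_F=zeros, b_F=1.0)
sgd = [th - alpha * g for th, g in zip(theta, grads)]
print(stepped, sgd)
```

The check at the end mirrors the observation in the text: with $f_t = 1$ and $i_t = \alpha_t$, the cell-state update coincides with the gradient descent step.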
META-LEARNING UPDATES
[Figure: computational graph of the learner (M) and the meta-learner (LSTM) over $D_{train}$ and $D_{test}$]
TO SUM UP
• We use our meta-learning LSTM to model parameter dynamics during training
‣ LSTM parameters are shared across M’s parameters (i.e. treated like a large minibatch)
‣ learns c0, which is like learning M’s initialization
• It is trained to produce parameters that have low loss on the corresponding test set
‣ possible thanks to backprop (though we ignore gradients through the inputs of the LSTM)
• Inputs to meta-learning LSTM are the loss, the parameter and its loss gradient
‣ we use the preprocessing proposed by Andrychowicz et al. (2016)
• Model M uses batch normalization
‣ we are careful to avoid “leakage” between meta-train / meta-validation / meta-test sets
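The preprocessing mentioned above can be sketched as follows, assuming it refers to the log/sign encoding described in the appendix of Andrychowicz et al. (2016): a scalar is mapped to two features so that values of very different magnitudes stay in a comparable range. The function name and the default `p = 10` are assumptions made for this illustration.

```python
import math

def preprocess(x, p=10.0):
    """Map a scalar (gradient or loss) to two rescaled features.

    Large magnitudes are encoded as (log|x| / p, sign(x));
    tiny ones as (-1, e^p * x).
    """
    if abs(x) >= math.exp(-p):
        return (math.log(abs(x)) / p, math.copysign(1.0, x))
    return (-1.0, math.exp(p) * x)

for value in (1000.0, -0.5, 1e-7):
    print(value, preprocess(value))
```

Each gradient and loss value would be passed through such a function before being fed to the meta-learner LSTM.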
RELATED WORK: META-LEARNING
• Early work on learning an update rule
‣ Learning a synaptic learning rule (1990)
Yoshua Bengio, Samy Bengio, and Jocelyn Cloutier
‣ On the search for new learning rules for ANNs (1995)
Samy Bengio, Yoshua Bengio, and Jocelyn Cloutier
• Early work on recurrent networks modifying their weights
‣ Learning to control fast-weight memories: An alternative to dynamic recurrent networks (1992)
Jürgen Schmidhuber
‣ A neural network that embeds its own meta-levels (1993)
Jürgen Schmidhuber
[see related work section of Learning to learn by gradient descent by gradient descent (2016)]
RELATED WORK: META-LEARNING
• Training a recurrent neural network to optimize
‣ it outputs the update itself, so it can decide to do something other than gradient descent
• Learning to learn by gradient descent by gradient descent (2016)
Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W. Hoffman, David Pfau, Tom Schaul, and Nando de Freitas
• Learning to learn using gradient descent (2001)
Sepp Hochreiter, A. Steven Younger, and Peter R. Conwell
Figure 2: Computational graph used for computing the gradient of the optimizer.
2.1 Coordinatewise LSTM optimizer
One challenge in applying RNNs in our setting is that we want to be able to optimize at least tens of thousands of parameters. Optimizing at this scale with a fully connected RNN is not feasible as it …
RELATED WORK: FEW-SHOT LEARNING
• Training a “pattern matcher” to optimize each episode’s test set performance
‣ no notion of learning an update rule
• Matching networks for one shot learning (2016)
Oriol Vinyals, Charles Blundell, Timothy P. Lillicrap, Koray Kavukcuoglu, and Daan Wierstra
Figure 1: Matching Networks architecture
… train it by showing only a few examples per class, switching the task from minibatch to minibatch, much like how it will be tested when presented with a few examples of a new task.
Besides our contributions in defining a model and training criterion amenable for one-shot learning, we contribute by the definition of tasks that can be used to benchmark other approaches …
RELATED WORK: FEW-SHOT LEARNING
• Training a “prototype extractor” to optimize each episode’s test set performance
‣ no notion of learning an update rule
• Prototypical Networks for Few-shot Learning (2016)
Jake Snell, Kevin Swersky and Richard Zemel
Figure 1: Prototypical networks in the few-shot and zero-shot scenarios. (a) Few-shot: prototypes $c_k$ are computed as the mean of embedded support examples … Zero-shot: prototypes $c_k$ are produced by embedding class meta-data …
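The few-shot case in the caption, where each prototype is the mean of the embedded support examples of a class, can be sketched as below. The embeddings are toy 2-D vectors standing in for the output of an embedding network, and classifying a query by squared Euclidean distance to the nearest prototype is an assumption of this sketch rather than something stated on the slide.

```python
def class_prototypes(support):
    """Average the embedded support examples of each class."""
    sums, counts = {}, {}
    for vec, label in support:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for j, v in enumerate(vec):
            acc[j] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [s / counts[c] for s in acc] for c, acc in sums.items()}

def classify(query, prototypes):
    """Return the class whose prototype is closest in squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(prototypes, key=lambda c: sq_dist(query, prototypes[c]))

# Toy 2-D "embeddings" (hypothetical values), two support examples per class.
support = [([0.0, 0.0], "c1"), ([0.0, 2.0], "c1"),
           ([4.0, 4.0], "c2"), ([6.0, 4.0], "c2")]
protos = class_prototypes(support)
print(protos)
print(classify([1.0, 1.5], protos))
```

Nothing in this procedure updates the learner's parameters per episode, which is the contrast the slide draws with learning an update rule.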
RELATED WORK: FEW-SHOT LEARNING
• Training an “initialization + fine-tuning” procedure that’s based on a known update (e.g. Adam)
‣ much simpler than a meta-LSTM, yet works quite well!
• Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (2017)
Chelsea Finn, Pieter Abbeel and Sergey Levine
Figure 1. Diagram of our model-agnostic meta-learning algorithm (MAML), which optimizes for a representation $\theta$ that can quickly adapt to new tasks.
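The "initialization + fine-tuning" idea can be illustrated on a toy problem where everything is computable by hand. The sketch below assumes scalar tasks with loss $\frac{1}{2}(\theta - a)^2$, one plain gradient step of fine-tuning (not Adam), and an analytic meta-gradient; none of these choices come from the slide or the MAML paper, they only show the structure of learning an initialization through the adaptation step.

```python
def adapt(theta, target, inner_lr, steps=1):
    """Fine-tune theta on one task with gradient steps on 0.5 * (theta - target)^2."""
    for _ in range(steps):
        theta = theta - inner_lr * (theta - target)
    return theta

def meta_gradient(theta, target, inner_lr):
    """Gradient of the post-adaptation loss w.r.t. the initialization (one inner step)."""
    adapted = adapt(theta, target, inner_lr)
    # d(adapted)/d(theta) = 1 - inner_lr for this quadratic loss.
    return (adapted - target) * (1.0 - inner_lr)

targets = [-1.0, 0.0, 4.0]   # one scalar "task" per target (toy assumption)
theta, inner_lr, outer_lr = 10.0, 0.4, 0.5
for _ in range(200):
    grad = sum(meta_gradient(theta, a, inner_lr) for a in targets) / len(targets)
    theta -= outer_lr * grad
print(round(theta, 4))
```

For these symmetric quadratic tasks the learned initialization settles at the mean of the task targets, the point from which a single fine-tuning step does best on average.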
RELATED WORK: FEW-SHOT LEARNING
• Training a neural Turing machine to learn
‣ no notion of gradient on learner
• One-shot learning with memory-augmented neural networks (2016)
Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy P. Lillicrap
Figure: (a) Task setup, (b) Network strategy. Omniglot images (or x-values for regression), $x_t$, are presented with time-offset labels (or function values), … from simply mapping the class labels to the output. From episode to episode, the classes to be presented …
RELATED WORK: FEW-SHOT LEARNING
• Training a convolutional network to learn
• Meta-Learning with Temporal Convolutions (2017)
Nikhil Mishra, Mostafa Rohaninejad, Xi Chen and Pieter Abbeel
• How does its performance compare to existing approaches that are specialized to a particular
task domain, or have elements of high-level strategy already built-in?
4.1 Few-Shot Image Classification
In the few-shot classification setting, we wish to classify data points into N classes, when we
only have a small number (K) of labeled examples per class. A meta-learner is readily applicable,
because it learns how to compare input points, rather than memorize a specific mapping from points to
classes. Figure 2 illustrates how few-shot image classification fits into the meta-learning formalization
presented in Section 2.1 and our introduction of the TCML in Section 2.2.
Figure 2: An episode of few-shot image classification using a TCML. Given an image $i_t$, the input to the TCML is a feature vector $x_t$ (produced by an embedding function $x_t = \phi(i_t)$), and the label $y_{t-1}$ of the previous image $i_{t-1}$. The embedding function is learned jointly with the TCML, which is trained to classify each image $i_t$ based on the images $i_0, \ldots, i_{t-1}$ seen at previous timesteps within the same episode. Qualitatively, in order to make the correct prediction at time $t = 3$, the TCML …
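The input layout described in the caption, where each step sees the current features together with the previous step's label, can be written as a short helper. The function name, the `no_label` placeholder for the first step, and the label letters are illustrative assumptions.

```python
def build_inputs(features, labels, no_label=None):
    """Pair each feature x_t with the previous label y_{t-1}; step 0 has no label."""
    previous = [no_label] + list(labels[:-1])
    return list(zip(features, previous))

features = ["x0", "x1", "x2", "x3"]
labels = ["A", "D", "C", "A"]
print(build_inputs(features, labels))
```

With this offset the model never receives the label of the image it is currently asked to classify, only the labels of earlier images in the episode.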
EXPERIMENT
• Mini-ImageNet
‣ random subset of 100 classes (64 training, 16 validation, 20 testing)
‣ random sets $D_{train}$ are generated by randomly picking 5 classes from class subset
‣ model M is a small 4-layer CNN, meta-learner LSTM has 2 layers

                                    5-class
Model                               1-shot           5-shot
Baseline-finetune                   28.86 ± 0.54%    49.79 ± 0.79%
Baseline-nearest-neighbor           41.08 ± 0.70%    51.04 ± 0.65%
Matching Network                    43.40 ± 0.78%    51.09 ± 0.71%
Matching Network FCE                43.56 ± 0.84%    55.31 ± 0.73%
Meta-Learner LSTM (OURS)            43.44 ± 0.77%    60.60 ± 0.71%
MAML (Finn et al.)                  48.70 ± 1.84%    63.10 ± 0.92%
Prototypical Nets (Snell et al.)    49.42 ± 0.78%    68.20 ± 0.66%
TCML (Mishra et al.)                56.48 ± 0.99%    61.22 ± 0.98%
(updated)
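The class split described on this slide (100 classes divided into 64 / 16 / 20, with 5 classes picked per dataset) can be sketched as follows. The class names are placeholders and `split_classes` is a hypothetical helper; the point is that the three meta-sets share no classes.

```python
import random

def split_classes(classes, sizes, rng):
    """Shuffle class names and cut them into disjoint consecutive groups."""
    shuffled = list(classes)
    rng.shuffle(shuffled)
    groups, start = [], 0
    for size in sizes:
        groups.append(shuffled[start:start + size])
        start += size
    return groups

rng = random.Random(0)
all_classes = [f"class_{i:03d}" for i in range(100)]   # placeholder names
meta_train, meta_val, meta_test = split_classes(all_classes, (64, 16, 20), rng)
episode_classes = rng.sample(meta_train, 5)            # classes for one D_train
print(len(meta_train), len(meta_val), len(meta_test), episode_classes)
```

Because the groups are disjoint, accuracy on the meta-test classes measures how well the learned procedure handles classes it never saw during meta-training.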
DISCUSSION
• How to scale up to a variable number of classes / examples
‣ we need an “ImageNet transposed”
• How best to characterize / parametrize learning algorithms (i.e. meta-models)
‣ inspiration from other optimization algorithms? other learning algorithms?
• How to apply beyond supervised learning
‣ unsupervised learning, semi-supervised learning, active learning, domain adaptation?
• … meta-meta-learning?
THANK YOU!

OPTIMIZATION AS A MODEL FOR FEW-SHOT LEARNING

  • 1. OPTIMIZATION AS A MODEL FOR FEW-SHOT LEARNING Hugo Larochelle Work done atTwitter
 Google Brain
 Joint work with Sachin Ravi
  • 2. e of meta-learning setup. The top represents the meta-training set Dmet gray box is a separate dataset that consists of the training set D (lef
  • 3. A RESEARCH AGENDA • Deep learning successes have required a lot of labeled training data ‣ collecting and labeling such data requires significant human labor ‣ is that really how we’ll solve AI ? • Alternative solution : exploit other sources of data that are imperfect but plentiful ‣ unlabeled data (unsupervised learning) ‣ multimodal data (multimodal learning) ‣ multidomain data (transfer learning, domain adaptation) 3
  • 4. A RESEARCH AGENDA • Deep learning successes have required a lot of labeled training data ‣ collecting and labeling such data requires significant human labor ‣ is that really how we’ll solve AI ? • Alternative solution : exploit other sources of data that are imperfect but plentiful ‣ unlabeled data (unsupervised learning) ‣ multimodal data (multimodal learning) ‣ multidomain data (transfer learning, domain adaptation) 3
  • 5. A RESEARCH AGENDA • Let’s attack directly the problem of few-shot learning ‣ we want to design a learning algorithm A that outputs a good parameters 𝜽
 of a model M, when fed a small dataset Dtrain={(Xt,Yt)}t=1 • Idea: let’s learn that algorithm A, end-to-end ‣ this is known as meta-learning or learning to learn 4 T
  • 6. META-LEARNING • Learning algorithm A ‣ input: training set Dtrain={(Xt,Yt)} ‣ output: parameters 𝜽 model M (the learner) ‣ objective: good performance on test set Dtest=(X,Y) • Meta-learning algorithm ‣ input: meta-training set ={(Dtrain,Dtest)}n=1 ‣ output: parameters 𝝝 algorithm A (the meta-learner) ‣ objective: good performance on meta-test set =(Dtrain,Dtest) 5 captures fundamental knowledge shared among all the tasks. 2 TASK DESCRIPTION We first begin by detailing the meta-learning formulation we use. In the typical mach setting, we are interested in a dataset D and usually split D so that we optimize param training set Dtrain and evaluate its generalization on the test set Dtest. In meta-learnin we are dealing with meta-sets D containing multiple regular datasets, where each D 2 D of Dtrain and Dtest. We consider the k-shot, N-class classification task, where for each dataset D, the train sists of k labelled examples for each of N classes, meaning that Dtrain consists of k · N and Dtest has a set number of examples for evaluation. In meta-learning, we thus have different meta-sets for meta-training, meta-validation testing (Dmeta train, Dmeta validation, and Dmeta test, respectively). On Dmeta tr interested in training a learning procedure (the meta-learning model) that can take as i its training sets Dtrain and produce a model that achieves high average classification perf its corresponding test set Dtest. Using Dmeta validation we can perform hyper-paramet of the meta-learning model and evaluate its generalization performance on Dmeta test. For this formulation to correspond to the few-shot learning setting, each training set D 2 D will contain few labeled examples (we consider k = 1 or k = 5), that must (n) (n) N Figure 1: Computational graph for the forward pass of the meta-learner. The dashed line div examples from the training set Dtrain and test set Dtest. 
Each (Xi, Yi) is the ith batch from training set whereas (X, Y) is all the elements from the test set. The dashed arrows indicate tha do not back-propagate through that step when training the meta-learner. We refer to the learn M, where M(X; ✓) is the output of learner M using parameters ✓ for inputs X. We also use r a shorthand for r✓t 1 Lt. to have training conditions match those of test time. During evaluation of the meta-learning each dataset D = (Dtrain, Dtest) 2 Dmeta test, a good meta-learner model will, given a seri learner gradients and losses on the training set Dtrain, suggest a series of updates for the lea model that trains it towards good performance on the test set Dtest.
  • 7. META-LEARNING 6 1: Example of meta-learning setup. The top represents the meta-training set Dmeta train, nside each gray box is a separate dataset that consists of the training set Dtrain (left side of line) and the test set Dtest (right side of dashed line). In this illustration, we are considering
  • 9. A META-LEARNING MODEL • How to parametrize learning algorithms? ‣ we take inspiration from the gradient descent algorithm: ‣ we parametrize this update similarly to LSTM state updates:
 
 - state ct is model M’s parameter space - state update ct is the negative gradient - ft and it are LSTM gates: 8 MODEL DESCRIPTION ider a single dataset D 2 Dmeta train. Suppose we have a learner neural net mode meters ✓ that we want to train on Dtrain. The standard optimization algorithms used t neural networks are some variant of gradient descent, which uses updates of the form ✓t = ✓t 1 ↵tr✓t 1 Lt, e ✓t 1 are the parameters of the learner after t 1 updates, ↵t is the learning rate at the loss optimized by the learner for its tth update, r✓t 1 Lt is the gradient of that los ect to parameters ✓t 1, and ✓t is the updated parameters of the learner. 2 der review as a conference paper at ICLR 2017 r key observation that we leverage here is that this update resembles the update for the cell an LSTM ct = ft ct 1 + it ˜ct, ft = 1, ct 1 = ✓t 1, it = ↵t, and ˜ct = r✓t 1 Lt. us, we propose training a meta-learner LSTM to learn an update rule for training a neural rk. We set the cell state of the LSTM to be the parameters of the learner, or ct = ✓t, and ndidate cell state ˜ct = r✓t 1 Lt, given how valuable information about the gradient is for zation. We define parametric forms for it and ft so that the meta-learner can determine opt ues through the course of the updates. ~ Under review as a conference paper at ICLR 2017 Our key observation that we leverage here is that this update resembles the update for the cell state in an LSTM ct = ft ct 1 + it ˜ct, (2) if ft = 1, ct 1 = ✓t 1, it = ↵t, and ˜ct = r✓t 1 Lt. Thus, we propose training a meta-learner LSTM to learn an update rule for training a neural net- work. We set the cell state of the LSTM to be the parameters of the learner, or ct = ✓t, and the candidate cell state ˜ct = r✓t 1 Lt, given how valuable information about the gradient is for opti- mization. We define parametric forms for it and ft so that the meta-learner can determine optimal values through the course of the updates. 
Let us start with it, which corresponds to the learning rate for the updates. We let it = WI · ⇥ r✓t 1 Lt, Lt, ✓t 1, it 1 ⇤ + bI , meaning that the learning rate is a function of the current parameter value ✓t, the current gradient r✓t Lt, the current loss Lt, and the previous learning rate it 1. With this information, the meta- if ft = 1, ct 1 = ✓t 1, it = ↵t, and ˜ct = r✓t 1 Lt. Thus, we propose training a meta-learner LSTM to learn an update rule for training a neural net- work. We set the cell state of the LSTM to be the parameters of the learner, or ct = ✓t, and the candidate cell state ˜ct = r✓t 1 Lt, given how valuable information about the gradient is for opti- mization. We define parametric forms for it and ft so that the meta-learner can determine optimal values through the course of the updates. Let us start with it, which corresponds to the learning rate for the updates. We let it = WI · ⇥ r✓t 1 Lt, Lt, ✓t 1, it 1 ⇤ + bI , meaning that the learning rate is a function of the current parameter value ✓t, the current gradient r✓t Lt, the current loss Lt, and the previous learning rate it 1. With this information, the meta- learner should be able to finely control the learning rate so as to train the learner quickly while avoiding divergence. As for ft, it seems possible that the optimal choice isn’t the constant 1. Intuitively, what would justify shrinking the parameters of the learner and forgetting part of its previous value would be if the learner is currently in a bad local optima and needs a large change to escape. This would correspond to a situation where the loss is high but the gradient is close to zero. Thus, one proposal for the forget gate is to have it be a function of that information, as well as the previous value of the forget gate: ft = WF · ⇥ r✓t 1 Lt, Lt, ✓t 1, ft 1 ⇤ + bF .
  • 10. META-LEARNING UPDATES
    [Figure: learner (M) and meta-learner (LSTM) applied to Dtrain and Dtest]
  • 11. TO SUM UP
    • We use our meta-learning LSTM to model parameter dynamics during training
      ‣ LSTM parameters are shared across M’s parameters (i.e. treated like a large minibatch)
      ‣ learns c0, which is like learning M’s initialization
    • It is trained to produce parameters that have low loss on the corresponding test set
      ‣ possible thanks to backprop (though we ignore gradients through the inputs of the LSTM)
    • Inputs to meta-learning LSTM are the loss, the parameter and its loss gradient
      ‣ we use the preprocessing proposed by Andrychowicz et al. (2016)
    • Model M uses batch normalization
      ‣ we are careful to avoid “leakage” between meta-train / meta-validation / meta-test sets
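The preprocessing referred to above (Andrychowicz et al., 2016) maps a raw scalar such as a gradient coordinate or a loss to a pair of values, so that inputs of very different magnitudes stay in a comparable range. A minimal sketch follows. The function name `preprocess` is hypothetical, and the default `p = 10` is the value suggested in that paper rather than anything stated on this slide.

```python
import math

def preprocess(x, p=10.0):
    """Map a scalar to (magnitude, sign)-like features.

    If |x| >= exp(-p): return (log|x| / p, sign(x)).
    Otherwise:         return (-1, exp(p) * x).
    """
    if abs(x) >= math.exp(-p):
        return (math.log(abs(x)) / p, math.copysign(1.0, x))
    return (-1.0, math.exp(p) * x)
```

Each gradient coordinate and the loss would be passed through this mapping before being fed to the meta-learner, so one scalar input becomes two.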
  • 12. RELATED WORK: META-LEARNING
    • Early work on learning an update rule
      ‣ Learning a synaptic learning rule (1990)
        Yoshua Bengio, Samy Bengio, and Jocelyn Cloutier
      ‣ On the search for new learning rules for ANNs (1995)
        Samy Bengio, Yoshua Bengio, and Jocelyn Cloutier
    • Early work on recurrent networks modifying their weights
      ‣ Learning to control fast-weight memories: An alternative to dynamic recurrent networks (1992)
        Jürgen Schmidhuber
      ‣ A neural network that embeds its own meta-levels (1993)
        Jürgen Schmidhuber
    [see related work section of Learning to learn by gradient descent by gradient descent (2016)]
  • 13. RELATED WORK: META-LEARNING
    • Training a recurrent neural network to optimize
      ‣ outputs the update, so it can decide to do something other than gradient descent
    • Learning to learn by gradient descent by gradient descent (2016)
      Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W. Hoffman, David Pfau, Tom Schaul, and Nando de Freitas
    • Learning to learn using gradient descent (2001)
      Sepp Hochreiter, A. Steven Younger, and Peter R. Conwell
    [Figure 2 (Andrychowicz et al., 2016): Computational graph used for computing the gradient of the optimizer.]
    Excerpt shown on the slide: “2.1 Coordinatewise LSTM optimizer. One challenge in applying RNNs in our setting is that we want to be able to optimize at least tens of thousands of parameters. Optimizing at this scale with a fully connected RNN is not feasible as it …”
  • 14. RELATED WORK: FEW-SHOT LEARNING
    • Training a “pattern matcher” to optimize each episode’s test set performance
      ‣ no notion of learning an update rule
    • Matching networks for one shot learning (2016)
      Oriol Vinyals, Charles Blundell, Timothy P. Lillicrap, Koray Kavukcuoglu, and Daan Wierstra
    [Figure 1 (Vinyals et al., 2016): Matching Networks architecture]
    Excerpt shown on the slide (cropped): “train it by showing only a few examples per class, switching the task from minibatch to min… much like how it will be tested when presented with a few examples of a new task. Besides our contributions in defining a model and training criterion amenable for one-shot le… we contribute by the definition of tasks that can be used to benchmark other approaches o…”
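To make the “pattern matcher” idea concrete, here is a rough sketch of how a matching-network-style prediction could be computed for one query. It assumes the embeddings are already given (the learned embedding networks are omitted), uses a softmax over cosine similarities as the attention, and the names `cosine` and `matching_predict` are hypothetical. It is a simplified illustration, not the architecture of Vinyals et al.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (small epsilon avoids 0/0)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)

def matching_predict(support, query, num_classes):
    """Class probabilities for `query` given `support`.

    support: list of (embedding, label) pairs from the episode's training set.
    The prediction is an attention-weighted sum of the support labels.
    """
    sims = [cosine(query, emb) for emb, _ in support]
    top = max(sims)
    exps = [math.exp(s - top) for s in sims]      # numerically stable softmax
    total = sum(exps)
    probs = [0.0] * num_classes
    for e, (_, label) in zip(exps, support):
        probs[label] += e / total
    return probs
```

No parameters are updated at test time here: the prediction depends on the support set only through the similarity comparison, which is why the slide says there is no notion of a learned update rule.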
  • 15. RELATED WORK: FEW-SHOT LEARNING
    • Training a “prototype extractor” to optimize each episode’s test set performance
      ‣ no notion of learning an update rule
    • Prototypical Networks for Few-shot Learning (2017)
      Jake Snell, Kevin Swersky and Richard Zemel
    [Figure 1 (Snell et al.), cropped: “Prototypical networks in the few-shot and zero-s… ck are computed as the mean of embedded support exa… prototypes ck are produced by embedding class meta-data”; panel (a) Few-shot]
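The figure caption describes each prototype $c_k$ as the mean of the embedded support examples of class $k$. A minimal sketch of that step, plus a nearest-prototype decision, is given below. The embeddings are assumed to be precomputed, squared Euclidean distance is assumed as the metric, and the function names are hypothetical.

```python
def class_prototypes(support, num_classes):
    """Mean embedding per class.

    support: list of (embedding, label) pairs.
    Assumes every class has at least one support example.
    """
    dim = len(support[0][0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for emb, label in support:
        counts[label] += 1
        for d, value in enumerate(emb):
            sums[label][d] += value
    return [[s / counts[c] for s in sums[c]] for c in range(num_classes)]

def nearest_prototype(protos, query):
    """Index of the prototype closest to `query` (squared Euclidean distance)."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(range(len(protos)), key=lambda c: sq_dist(protos[c], query))
```

During training the hard `min` would typically be replaced by a softmax over negative distances so that the embedding can be learned by backprop.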
  • 16. RELATED WORK: FEW-SHOT LEARNING
    • Training an “initialization + fine-tuning” procedure that’s based on a known update (e.g. ADAM)
      ‣ much simpler than a meta-LSTM, yet works quite well!
    • Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (2017)
      Chelsea Finn, Pieter Abbeel and Sergey Levine
    [Figure 1 (Finn et al.), cropped: “Diagram of our model-agnostic meta-learning algorithm (MAML), which optimizes for a representation θ th… quickly adapt to new tasks.” The diagram contrasts meta-learning with learning/adaptation, showing θ, the task gradients ∇L1, ∇L2, ∇L3 and the adapted parameters θ*1, θ*2, θ*3.]
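The “initialization + fine-tuning” idea can be illustrated on a toy problem. The sketch below is an assumption-laden simplification: each task is a scalar quadratic loss $\frac{1}{2}(\theta - a)^2$ with its own target $a$, fine-tuning is a single plain gradient step (not ADAM), the same loss serves as both the task's training and test loss, and all function names are hypothetical. The meta-gradient is differentiated through the fine-tuning step by hand.

```python
def inner_adapt(theta, target, inner_lr):
    """One fine-tuning step on the task loss 0.5 * (theta - target)^2."""
    grad = theta - target
    return theta - inner_lr * grad

def maml_meta_gradient(theta, target, inner_lr):
    """Gradient of the post-adaptation loss with respect to the initialization.

    Chain rule: dL(adapted)/dtheta = (adapted - target) * d(adapted)/dtheta,
    and d(adapted)/dtheta = 1 - inner_lr for this quadratic loss.
    """
    adapted = inner_adapt(theta, target, inner_lr)
    return (adapted - target) * (1.0 - inner_lr)

def meta_train(targets, theta=0.0, inner_lr=0.1, meta_lr=0.5, steps=200):
    """Learn an initialization that is good after one fine-tuning step."""
    for _ in range(steps):
        g = sum(maml_meta_gradient(theta, a, inner_lr) for a in targets) / len(targets)
        theta -= meta_lr * g
    return theta
```

For this toy family the learned initialization converges to the mean of the task targets, the point from which one fine-tuning step does best on average.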
  • 17. RELATED WORK: FEW-SHOT LEARNING
    • Training a neural Turing machine to learn
      ‣ no notion of gradient on learner
    • One-shot learning with memory-augmented neural networks (2016)
      Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy P. Lillicrap
    [Figure (Santoro et al., 2016), cropped: (a) Task setup, (b) Network strategy. “Omniglot images (or x-values for regression), xt, are presented with time-offset labels (or function values), … from simply mapping the class labels to the output. From episode to episode, the classes to be presented …”]
  • 18. RELATED WORK: FEW-SHOT LEARNING
    • Training a convolutional network to learn
    • Meta-Learning with Temporal Convolutions (2017)
      Nikhil Mishra, Mostafa Rohaninejad, Xi Chen and Pieter Abbeel
    Excerpt shown on the slide (Mishra et al., 2017):
    “• How does its performance compare to existing approaches that are specialized to a particular task domain, or have elements of high-level strategy already built-in?
    4.1 Few-Shot Image Classification. In the few-shot classification setting, we wish to classify data points into N classes, when we only have a small number (K) of labeled examples per class. A meta-learner is readily applicable, because it learns how to compare input points, rather than memorize a specific mapping from points to classes. Figure 2 illustrates how few-shot image classification fits into the meta-learning formalization presented in Section 2.1 and our introduction of the TCML in Section 2.2.”
    [Figure 2 (Mishra et al., 2017): An episode of few-shot image classification using a TCML. Given an image $i_t$, the input to the TCML is a feature vector $x_t$ (produced by an embedding function $x_t = \phi(i_t)$), and the label $y_{t-1}$ of the previous image $i_{t-1}$. The embedding function is learned jointly with the TCML, which is trained to classify each image $i_t$ based on the images $i_0, \dots, i_{t-1}$ seen at previous timesteps within the same episode. Qualitatively, in order to make the correct prediction at time t = 3, the TCML …]
  • 19. EXPERIMENT
    • Mini-ImageNet
      ‣ random subset of 100 classes (64 training, 16 validation, 20 testing)
      ‣ random sets Dtrain are generated by randomly picking 5 classes from class subset
      ‣ model M is a small 4-layer CNN, meta-learner LSTM has 2 layers

    Model                        5-class, 1-shot     5-class, 5-shot
    Baseline-finetune            28.86 ± 0.54%       49.79 ± 0.79%
    Baseline-nearest-neighbor    41.08 ± 0.70%       51.04 ± 0.65%
    Matching Network             43.40 ± 0.78%       51.09 ± 0.71%
    Matching Network FCE         43.56 ± 0.84%       55.31 ± 0.73%
    Meta-Learner LSTM (OURS)     43.44 ± 0.77%       60.60 ± 0.71%
  • 20. EXPERIMENT
    • Mini-ImageNet
      ‣ random subset of 100 classes (64 training, 16 validation, 20 testing)
      ‣ random sets Dtrain are generated by randomly picking 5 classes from class subset
      ‣ model M is a small 4-layer CNN, meta-learner LSTM has 2 layers

    Model                               5-class, 1-shot     5-class, 5-shot
    Baseline-finetune                   28.86 ± 0.54%       49.79 ± 0.79%
    Baseline-nearest-neighbor           41.08 ± 0.70%       51.04 ± 0.65%
    Matching Network                    43.40 ± 0.78%       51.09 ± 0.71%
    Matching Network FCE                43.56 ± 0.84%       55.31 ± 0.73%
    Meta-Learner LSTM (OURS)            43.44 ± 0.77%       60.60 ± 0.71%
    TCML (Mishra et al.) (updated)      56.48 ± 0.99%       61.22 ± 0.98%
    MAML (Finn et al.)                  48.70 ± 1.84%       63.10 ± 0.92%
    Prototypical Nets (Snell et al.)    49.42 ± 0.78%       68.20 ± 0.66%
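The episode construction described in the bullets (pick 5 classes from the relevant class subset, then build Dtrain and Dtest from them) might look like the following sketch. The function name `sample_episode`, the dictionary layout, and the default of 15 test examples per class are assumptions made for illustration, not details given on the slide.

```python
import random

def sample_episode(class_to_items, n_way=5, k_shot=1, n_test=15, rng=random):
    """Build one (D_train, D_test) pair.

    class_to_items: dict mapping a class name to a list of examples, restricted
    to ONE split (e.g. only the 64 meta-training classes) to avoid leakage.
    Labels are re-indexed 0..n_way-1 within the episode.
    """
    classes = rng.sample(sorted(class_to_items), n_way)
    d_train, d_test = [], []
    for label, cls in enumerate(classes):
        items = rng.sample(class_to_items[cls], k_shot + n_test)
        d_train += [(x, label) for x in items[:k_shot]]   # K examples per class
        d_test += [(x, label) for x in items[k_shot:]]    # disjoint test examples
    return d_train, d_test
```

Calling it with `k_shot=1` or `k_shot=5` corresponds to the 1-shot and 5-shot columns of the table. Sampling only from the classes of a single split is what keeps the meta-train, meta-validation and meta-test sets separate.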
  • 21. DISCUSSION
    • How to scale up to a variable number of classes / examples
      ‣ we need an “ImageNet transposed”
    • How best to characterize / parametrize learning algorithms (i.e. meta-models)
      ‣ inspiration from other optimization algorithms? other learning algorithms?
    • How to apply beyond supervised learning
      ‣ unsupervised learning, semi-supervised learning, active learning, domain adaptation?
    • … meta-meta-learning?