Learning how to learn

Learning to Learn
Joaquin Vanschoren
Eindhoven University of Technology & JADS
j.vanschoren@tue.nl
@joavanschoren
Berlin Machine Learning Meetup, Jan 2019
Slides: www.automl.org/events - Video: bit.ly/2QsnsUT
Book (open access): www.automl.org/book (or www.amazon.de)

Learning is a never-ending process
Humans don’t learn from scratch

Learning is a never-ending process
[Figure: learning episodes on Task 1, Task 2 and Task 3 each produce models and performance; meta-learning connects successive episodes]
Humans learn across tasks
Why? Less trial-and-error, less data

Learning to learn
Inductive bias: assumptions added to the training data to learn effectively
If prior tasks are similar, we can transfer prior knowledge to new tasks
[Figure: learning on a new task from training data plus an inductive bias (prior beliefs, constraints, model parameters, representations) transferred from prior tasks]

Meta-learning
Meta-learner learns a (base-)learning algorithm, based on meta-data
[Figure: prior learners and models form the meta-data; the meta-learner uses it to build a base-learner for the new task, with additional experiments feeding back performance]

How?
1. Start with what generally works
2. Start from what likely works (based on partially similar tasks)
3. Start from previously trained models (for very similar tasks)
[Figure: a meta-learner draws on learners, models and performance from prior tasks 1..j to solve a new task]

1. Start with what generally works
Store and use meta-data:
- configurations λi: settings that uniquely define the model (hyperparameters, pipeline, neural architecture,…)
- performances Pi,j (e.g. accuracy) on specific tasks
[Figure: the meta-learner uses configurations λi and performances Pi,j from prior tasks to solve a new task]

Rankings
• Build a global (multi-objective) ranking, recommend the top-K
• Can be used as a warm start for optimization techniques
• E.g. Bayesian optimization, evolutionary techniques,…
[Figure: performances Pi,j on prior tasks yield a task-independent global ranking (1. λa, 2. λb, 3. λc, 4. λd, 5. λe, …); the top configurations λa..k warm-start the meta-learner on the new task]
Leite et al. 2012 (discrete)
Abdulrahman et al. 2018 (multi-objective)
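A minimal sketch of the global-ranking idea, assuming the meta-data is kept in a hypothetical nested dictionary `P[config][task]` of accuracies (higher is better). Configurations are ordered by their average rank across tasks and the top-K are returned as a warm start. The configuration names and scores are invented for illustration, ties are not handled, and the cited works use more refined (e.g. multi-objective) criteria.

```python
from statistics import mean


def average_ranks(P):
    """Return {config: mean rank across tasks}; rank 1 is the best on a task."""
    configs = list(P)
    tasks = list(next(iter(P.values())))
    ranks = {c: [] for c in configs}
    for t in tasks:
        # Order configurations by descending performance on task t.
        ordered = sorted(configs, key=lambda c: P[c][t], reverse=True)
        for position, c in enumerate(ordered, start=1):
            ranks[c].append(position)
    return {c: mean(r) for c, r in ranks.items()}


def top_k(P, k):
    """Recommend the k configurations with the lowest (best) average rank."""
    avg = average_ranks(P)
    return sorted(avg, key=avg.get)[:k]


# Hypothetical meta-data: accuracy of three configurations on three prior tasks.
P = {
    "lambda_a": {"task1": 0.90, "task2": 0.85, "task3": 0.80},
    "lambda_b": {"task1": 0.88, "task2": 0.80, "task3": 0.82},
    "lambda_c": {"task1": 0.70, "task2": 0.75, "task3": 0.60},
}
warm_start = top_k(P, k=2)  # candidates to evaluate first on a new task
```

The returned list can seed an optimizer such as Bayesian optimization or an evolutionary search instead of a random initial design.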

To tune or not to tune?
• Functional ANOVA 1
Select hyperparameters that cause variance in the evaluations.
• Tunability 2
Learn good defaults, measure improvement from tuning over defaults
[Figure: performances Pi,j on prior tasks give the hyperparameter importance of λ1..λ4, passed to the meta-learner as constraints and priors]
1 van Rijn & Hutter 2018
2 Probst et al. 2018
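The tunability idea can be sketched in a few lines, assuming the same hypothetical `P[config][task]` layout of accuracies. Here the "default" is simply the configuration with the best mean performance over all tasks, and tunability is the per-task gain of the task-specific best configuration over that default. This is a simplification for illustration, not the exact procedure of the cited paper.

```python
from statistics import mean


def best_default(P):
    """Pick the configuration with the highest mean performance over all tasks."""
    return max(P, key=lambda c: mean(list(P[c].values())))


def tunability(P):
    """Per-task gain of the best configuration on that task over the default."""
    default = best_default(P)
    return {t: max(P[c][t] for c in P) - P[default][t] for t in P[default]}


# Hypothetical accuracies of two configurations on two prior tasks.
P = {
    "lambda_a": {"task1": 0.90, "task2": 0.70},
    "lambda_b": {"task1": 0.80, "task2": 0.85},
}
default = best_default(P)
gains = tunability(P)  # large gains suggest tuning is worth the effort
```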

To tune or not to tune?
• Search space pruning
Exclude regions yielding bad performance on (similar) tasks
[Figure: performance P over hyperparameters λ1 and λ2; low-performing regions are passed to the meta-learner as constraints]
Wistuba et al. 2015

Learning task similarity
• Experiments on the new task can tell us how it is similar to previous tasks
• Tasks are similar if the observed performance of configurations is similar (Pc,new ≅ Pc,j)
• Use this to recommend new experiments
[Figure: the meta-learner evaluates a (discrete) configuration λc on the new task and compares Pc,new with Pc,j on prior tasks]

Active testing
• Tournament-style selection, warm-start with overall best config λbest
• Next candidate λc: the one that beats current λbest on similar tasks
[Figure: from the performances Pi,j on prior tasks, the meta-learner selects a (discrete) λc with λc > λbest on similar tasks and evaluates it on the new task]
Leite et al. 2012
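A rough sketch of the tournament loop described above, under several assumptions: the meta-data is a hypothetical `P[config][task]` dictionary, task similarity is derived from how closely the configurations already tried on the new task matched their scores on each prior task, and `evaluate` is a stand-in for actually training a model. The similarity formula is an illustrative choice, not the one used by Leite et al.

```python
from statistics import mean


def task_similarity(P, task, observed):
    """Weight in (0, 1]: high when configurations tried on the new task
    scored about the same on the prior task."""
    gap = mean(abs(P[c][task] - perf) for c, perf in observed.items())
    return 1.0 / (1.0 + gap)


def next_candidate(P, observed, best):
    """Untested configuration with the largest similarity-weighted gain over
    the current best on prior tasks, or None if nothing beats it."""
    tasks = list(P[best])
    weights = {t: task_similarity(P, t, observed) for t in tasks}
    scores = {
        c: sum(weights[t] * max(P[c][t] - P[best][t], 0.0) for t in tasks)
        for c in P if c not in observed
    }
    if not scores or max(scores.values()) <= 0.0:
        return None
    return max(scores, key=scores.get)


def active_testing(P, evaluate, budget):
    """Warm-start with the overall best configuration, then run the tournament."""
    best = max(P, key=lambda c: mean(list(P[c].values())))
    observed = {best: evaluate(best)}
    for _ in range(budget):
        candidate = next_candidate(P, observed, best)
        if candidate is None:
            break
        observed[candidate] = evaluate(candidate)
        if observed[candidate] > observed[best]:
            best = candidate
    return best, observed


# Hypothetical meta-data; the new task behaves much like task2.
P = {
    "lambda_a": {"task1": 0.90, "task2": 0.60},
    "lambda_b": {"task1": 0.70, "task2": 0.75},
    "lambda_c": {"task1": 0.95, "task2": 0.40},
}
truth = {"lambda_a": 0.62, "lambda_b": 0.78, "lambda_c": 0.45}
best, observed = active_testing(P, truth.get, budget=3)
```

In this toy run the first challenger is the configuration that beat the warm-start on the prior task most similar to the new one.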

Bayesian optimization
• Consider space of all configuration options (e.g. all possible neural nets or pipelines)
• Surrogate model: probabilistic regression model of configuration performance
• Acquisition function: selects next configuration to try (exploration-exploitation)
[Figure: surrogate model of performance P over configurations λi ∈ Λ for one task, with the acquisition function below it]
Rasmussen 2014
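To make the surrogate/acquisition loop concrete, here is a small self-contained sketch for a single numeric hyperparameter. It assumes a Gaussian-process surrogate with an RBF kernel and expected improvement as the acquisition function; the kernel length scale, noise level, candidate grid and toy objective are all illustrative assumptions, and a real system would use a dedicated library.

```python
import math


def rbf(a, b, length=0.2):
    """Squared-exponential kernel between two scalar configurations."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))) / M[i][i]
    return x


def gp_posterior(X, y, x_new, noise=1e-4):
    """Surrogate model: GP posterior mean and std at x_new (constant-mean prior)."""
    mu = sum(y) / len(y)
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(X)]
         for i, a in enumerate(X)]
    k_star = [rbf(a, x_new) for a in X]
    alpha = solve(K, [v - mu for v in y])
    beta = solve(K, k_star)
    mean = mu + sum(k * a for k, a in zip(k_star, alpha))
    var = max(1.0 - sum(k * b for k, b in zip(k_star, beta)), 1e-12)
    return mean, math.sqrt(var)


def expected_improvement(mean, std, best):
    """Acquisition function: expected gain over the best observed value."""
    z = (mean - best) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return (mean - best) * cdf + std * pdf


def bayes_opt(objective, candidates, init, n_iter):
    """Evaluate the candidate with the highest expected improvement, n_iter times."""
    X = list(init)
    y = [objective(x) for x in X]
    for _ in range(n_iter):
        pool = [c for c in candidates if c not in X]
        if not pool:
            break
        scores = [expected_improvement(*gp_posterior(X, y, c), max(y)) for c in pool]
        x_next = pool[scores.index(max(scores))]
        X.append(x_next)
        y.append(objective(x_next))
    i = y.index(max(y))
    return X[i], y[i]


# Toy stand-in for "train a model with hyperparameter lam and return accuracy".
def objective(lam):
    return 1.0 - (lam - 0.3) ** 2


candidates = [i / 20 for i in range(21)]
best_x, best_y = bayes_opt(objective, candidates, init=[0.0, 1.0], n_iter=8)
```

The surrogate is refitted from scratch at every step, which is fine for a handful of observations but is exactly the scalability concern that motivates the transfer and multi-task variants.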

• The surrogate model is like a short-term memory for solving the new task
• Can we store it in long-term memory for solving new tasks?

Surrogate model transfer
• If task j is similar to the new task, its surrogate model Sj will likely transfer well
• Sum up all Sj predictions, weighted by task similarity (as in active testing) 1
• Build combined Gaussian process, weighted by current performance on new task 2
[Figure: one surrogate model Sj is fitted per prior task tj from its Pi,j; the meta-learner combines S1, S2, S3, … into S = ∑ wj Sj for the new task]
1 Wistuba et al. 2018
2 Feurer et al. 2018
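The weighted combination S = ∑ wj Sj can be sketched as follows. The surrogates here are plain Python functions standing in for fitted models, and the weighting rule (inverse of one plus the mean absolute error on the few configurations already observed on the new task) is an assumption chosen for brevity, not the weighting used in either cited paper.

```python
def similarity_weight(surrogate, observed):
    """Higher weight when the prior-task surrogate predicts the new task well."""
    err = sum(abs(surrogate(lam) - perf) for lam, perf in observed.items()) / len(observed)
    return 1.0 / (1.0 + err)


def combine_surrogates(surrogates, weights):
    """Return S(lam) = sum_j w_j * S_j(lam), with weights normalized to sum to 1."""
    total = sum(weights)

    def combined(lam):
        return sum(w / total * s(lam) for w, s in zip(weights, surrogates))

    return combined


# Hypothetical surrogates from two prior tasks (performance as a function of lam).
def s1(lam):
    return 1.0 - (lam - 0.2) ** 2


def s2(lam):
    return 1.0 - (lam - 0.8) ** 2


# Configurations already evaluated on the new task: {lam: observed performance}.
observed = {0.2: 1.0, 0.5: 0.91}
weights = [similarity_weight(s, observed) for s in (s1, s2)]
S = combine_surrogates([s1, s2], weights)
```

The combined surrogate can then replace, or serve as a prior for, the surrogate in a Bayesian optimization loop on the new task.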

Multi-task Bayesian optimization
• Multi-task Gaussian processes: train surrogate model on t tasks simultaneously 1
  - If tasks are similar: transfers useful info
  - Not very scalable
• Bayesian Neural Networks as surrogate model 2
  - Multi-task, more scalable
• Stacking Gaussian Process regressors (Google Vizier) 3
  - Sequential tasks, each similar to the previous one
  - Transfers a prior based on residuals of previous GP
[Figure: independent GP predictions vs. multi-task GP predictions]
1 Swersky et al. 2013
2 Springenberg et al. 2016
3 Golovin et al. 2017

More scalable variants
• Bayesian linear regression (BLR) surrogate model on every task
• Use neural net to learn a suitable basis expansion ϕz(λ) for all tasks
• Scales linearly in # observations, transfers info on configuration space
[Figure: a neural net pre-trained (warm-started) on (λi, Pi,j) from prior tasks gives the basis ϕz(λ); a BLR surrogate on ϕz(λi), with its own BLR hyperparameters, predicts P for the new task]
Perrone et al. 2018

2. Start from what likely works (based on similar tasks)
Meta-features: measurable properties of the tasks
(number of instances and features, class imbalance, feature skewness,…)
[Figure: each prior task 1..N is described by meta-features mj alongside its configurations λi and performances Pi,j; the meta-learner looks for prior tasks whose mj are similar to those of the new task]

Meta-features
• Hand-crafted (interpretable) meta-features 1
  - Number of instances, features, classes, missing values, outliers,…
  - Statistical: skewness, kurtosis, correlation, covariance, sparsity, variance,…
  - Information-theoretic: class entropy, mutual information, noise-signal ratio,…
  - Model-based: properties of simple models trained on the task
  - Landmarkers: performance of fast algorithms trained on the task
  - Domain specific task properties
1 Vanschoren 2018
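A few of the simple, statistical and information-theoretic meta-features listed above can be computed with the standard library alone. The sketch below assumes a dataset given as a list of numeric rows `X` and a list of class labels `y`; the function names and the choice of meta-features are illustrative.

```python
import math
from collections import Counter


def class_entropy(y):
    """Shannon entropy (in bits) of the class distribution."""
    n = len(y)
    return -sum((c / n) * math.log2(c / n) for c in Counter(y).values())


def skewness(values):
    """Population skewness of one feature; 0.0 for a constant feature."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    return 0.0 if m2 == 0 else m3 / m2 ** 1.5


def meta_features(X, y):
    """Return a small dictionary of hand-crafted meta-features for one task."""
    columns = list(zip(*X))
    counts = Counter(y)
    return {
        "n_instances": len(X),
        "n_features": len(columns),
        "n_classes": len(counts),
        "class_entropy": class_entropy(y),
        "minority_fraction": min(counts.values()) / len(y),
        "mean_skewness": sum(skewness(c) for c in columns) / len(columns),
    }


# Tiny made-up dataset: four instances, two features, two balanced classes.
X = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0], [4.0, 50.0]]
y = ["a", "a", "b", "b"]
mf = meta_features(X, y)
```

Landmarkers and model-based meta-features would additionally require training fast learners on the task, which is omitted here.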

Meta-features
• Learning a joint task representation
  - Deep metric learning: learn a representation hmf using a ground truth distance
  - With Siamese Network: similar task, similar representation
  - Can be used to recommend neural architectures given task similarity
Kim et al. 2017

Warm-starting from similar tasks
• Find k most similar tasks, warm-start search with best λi
• Auto-sklearn: Bayesian optimization (SMAC)
  - Meta-learning yields better models, faster
  - Winner of AutoML Challenges
[Figure: meta-features mj identify similar prior tasks; the best λi on those tasks (λ1..k) warm-start the search over P(λ) on the new task]
Feurer et al. 2015
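The first bullet can be sketched as a nearest-neighbour lookup in meta-feature space. This assumes each prior task has a meta-feature vector (already scaled to comparable ranges) and a known best configuration; the Euclidean distance, the dictionaries and their contents are illustrative assumptions rather than the Auto-sklearn implementation.

```python
import math


def distance(m1, m2):
    """Euclidean distance between two meta-feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(m1, m2)))


def warm_start_configs(meta_features, best_config, m_new, k):
    """Best known configurations of the k prior tasks closest to m_new."""
    nearest = sorted(meta_features, key=lambda t: distance(meta_features[t], m_new))[:k]
    configs = []
    for task in nearest:
        # Skip duplicates when several similar tasks share the same best config.
        if best_config[task] not in configs:
            configs.append(best_config[task])
    return configs


# Hypothetical meta-data for three prior tasks.
meta_features = {"task1": (0.10, 0.20), "task2": (0.90, 0.80), "task3": (0.15, 0.25)}
best_config = {"task1": "lambda_1", "task2": "lambda_2", "task3": "lambda_3"}

m_new = (0.12, 0.22)  # meta-features of the new task
initial_design = warm_start_configs(meta_features, best_config, m_new, k=2)
```

The resulting list would be evaluated first, before the optimizer's own acquisition function takes over.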

Meta-models
• Learn direct mapping between meta-features and Pi,j
  - Zero-shot meta-models: predict best λi given meta-features 1
  - Ranking models: return ranking λ1..k 2
  - Predict which algorithms / configurations to consider / tune 3
  - Predict performance / runtime for given θi and task 4
• Can be integrated in larger AutoML systems: warm start, guide search,…
[Figure: four meta-learners: mj → λbest; mj → λ1..k; mj → Λ; (mj, λi) → Pij]
1 Brazdil et al. 2009, Lemke et al. 2015
2 Sun and Pfahringer 2013, Pinto et al. 2017
3 Sanders and Giraud-Carrier 2017
4 Yang et al. 2018

Learning Pipelines / Architectures
• Compositionality: the learning process can be broken down into smaller tasks
  - Easier to learn, more transferable, more robust
• Pipelines are one way of doing this, but how to control the search space?
  - Planning techniques (e.g. Hierarchical Task Planning) 2
  - Impose a fixed structure or grammar on the pipeline 1
• Neural architecture search
  - Usually defines fixed search space 3
  - Very little meta-learning (yet)
  - RL controller transfer 4
1 Feurer et al. 2015
2 Mohr et al. 2018
3 Zoph et al. 2018
4 Wong et al. 2018

Evolving pipelines
• Start from simple pipelines
• Evolve more complex ones if needed
• Reuse pipelines that do specific things
• Mechanisms:
  - Cross-over: reuse partial pipelines
  - Mutation: change structure, tuning
• Approaches:
  - TPOT: Tree-based pipelines 1
  - GAMA: asynchronous evolution 2
  - RECIPE: grammar-based 3
• Meta-learning:
  - Warm-starting, Meta-models 2
1 Olson et al. 2017
2 Gijsbers et al. 2018
3 De Sa et al. 2017
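The cross-over and mutation mechanisms can be illustrated with a toy evolutionary loop over linear pipelines. Everything here is an assumption made for illustration: pipelines are lists of component names, the component vocabulary is hypothetical, hyperparameter tuning is left out, and `toy_fitness` stands in for cross-validated performance. It is not how TPOT, GAMA or RECIPE are implemented.

```python
import random

# Hypothetical component vocabulary; a real system would map these to estimators.
COMPONENTS = ["scale", "impute", "pca", "select", "tree", "knn"]


def mutate(pipeline, rng):
    """Mutation: insert, replace or delete one component (never empties a pipeline)."""
    p = list(pipeline)
    op = rng.choice(["insert", "replace", "delete"])
    if op == "insert":
        p.insert(rng.randrange(len(p) + 1), rng.choice(COMPONENTS))
    elif op == "replace":
        p[rng.randrange(len(p))] = rng.choice(COMPONENTS)
    elif len(p) > 1:
        del p[rng.randrange(len(p))]
    return p


def crossover(a, b, rng):
    """Cross-over: reuse a partial pipeline, a prefix of a followed by a suffix of b."""
    return a[: rng.randrange(1, len(a) + 1)] + b[rng.randrange(len(b)):]


def evolve(fitness, rng, generations=20, size=10):
    """Start from single-component pipelines and evolve more complex ones."""
    population = [[rng.choice(COMPONENTS)] for _ in range(size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: size // 2]  # elitism: the best half survives
        children = []
        while len(parents) + len(children) < size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        population = parents + children
    return max(population, key=fitness)


def toy_fitness(pipeline):
    """Made-up score: reward ending in a classifier and scaling first, penalize length."""
    score = 0.0
    if pipeline[-1] in ("tree", "knn"):
        score += 1.0
    if "scale" in pipeline[:-1]:
        score += 0.5
    return score - 0.05 * len(pipeline)


best = evolve(toy_fitness, random.Random(0))
```

Warm-starting in this setting would amount to seeding the initial population with pipelines that did well on similar prior tasks instead of random single components.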

Learning to learn through self-play
• Build pipelines by inserting, deleting, replacing components (actions)
• Neural network (LSTM) receives task meta-features, pipelines and evaluations
  - Predicts pipeline performance and action probabilities
• Monte Carlo Tree Search builds pipelines, runs simulations against LSTM
[Figure: through self-play, the meta-learner maps meta-features mj to pipelines λi for the new task]
Drori et al. 2017

3. Start with previously trained models
Models trained on similar tasks
(model parameters, features,…)
[Figure: prior tasks 1..N that are intrinsically (very) similar to the new task (e.g. shared representation) pass their model parameters θk, along with configurations λi and performances Pi,j, to the meta-learner]

Transfer learning
• For neural networks, both structure and weights can be transferred
• Features and initializations learned from:
• Large image datasets (e.g. ImageNet) 1
• Large text corpora (e.g. Wikipedia) 2
• Fails if tasks are not similar enough 3
• Feature extraction: remove last layers, use output as features; if task is quite different, remove more layers
• Fine-tuning: unfreeze last layers, tune on new task
• End-to-end tuning: train from initialized weights
[Figure: a convnet pre-trained on source tasks, with frozen, pre-trained and new layers (filters), shown for three target-task regimes: small; large and similar; large and different]
1 Razavian et al. 2014
2 Mikolov et al. 2013
3 Yosinski et al. 2014

Learning to learn by gradient descent
• Our brains probably don’t do backprop, replace it with:
• Simple parametric (bio-inspired) rule to update weights 1
• Single-layer neural network to learn weight updates 2
• Learn parameters across tasks, by gradient descent (meta-gradient)
[Figure: update rule Δθi = η ( … ) with learning rate η, taking presynaptic activity and a reinforcing signal among its inputs; the rule's weights λi are learned across tasks by gradient descent (meta-gradient), starting from λinit]
1 Bengio et al. 1995
2 Runarsson and Jonsson 2000

Learning to learn gradient descent by gradient descent
• Replace backprop with a recurrent neural net (LSTM) 1, not so scalable
• Use a coordinatewise LSTM [m] for scalability/flexibility (cf. ADAM, RMSprop) 2
  - Optimizee: receives weight update gt from optimizer
  - Optimizer: receives gradient estimate ∇t from optimizee
  - Learns how to do gradient descent across tasks
[Figure: the optimizer (meta-model) is a single LSTM network with its own hidden state, whose parameters are shared for all θ; it updates the weights of the optimizee (model) on a new task]
1 Hochreiter 2001
2 Andrychowicz et al. 2016

Learning to learn gradient descent by gradient descent
• Left: optimizer and optimizee trained to do style transfer
• Right: optimizer solves similar tasks (different style, content and resolution) without any more training data
1 Hochreiter 2001
2 Andrychowicz et al. 2016

Few-shot learning
• Learn how to learn from few examples (given similar tasks)
• Meta-learner must learn how to train a base-learner based on prior experience
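• Parameterize base-learner model and learn the parameters θi
[Figure (image: Hugo Larochelle): 1-shot, 5-class example. For each task Tj the meta-model updates model M from θi to θi+1 on Ttrain (X_train, y_train), given λk and Pi,j; M is then evaluated on Ttest (X_test, y_test), which contains new classes, giving Pi+1,test]
$$\mathrm{Cost}(\theta_i) = \frac{1}{|T_{\mathrm{test}}|} \sum_{t \in T_{\mathrm{test}}} \mathrm{loss}(\theta_i, t)$$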

Few-shot learning: approaches
• Existing algorithm as meta-learner:
  - LSTM + gradient descent (Ravi and Larochelle 2017)
  - Learn θinit + gradient descent (Finn et al. 2017)
  - kNN-like: Memory + similarity (Vinyals et al. 2016)
  - Learn embedding + classifier (Snell et al. 2017)
  - …
• Black-box meta-learner
  - Neural Turing machine (with memory) (Santoro et al. 2016)
  - Neural attentive learner (Mishra et al. 2018)
  - …
$$\mathrm{Cost}(\theta_i) = \frac{1}{|T_{\mathrm{test}}|} \sum_{t \in T_{\mathrm{test}}} \mathrm{loss}(\theta_i, t)$$

LSTM meta-learner + gradient descent
• Gradient descent update θt is similar to LSTM cell state update ct
• Hence, training a meta-learner LSTM yields an update rule for training M
• Start from initial θ0, train model on first batch, get gradient and loss update
• Predict θt+1, continue to t=T, get cost, backpropagate to learn LSTM weights, optimal θ0
$$\mathrm{Cost}(\theta_T) = \frac{1}{|T_{\mathrm{test}}|} \sum_{t \in T_{\mathrm{test}}} \mathrm{loss}(\theta_T, t)$$
[Figure: model M unrolled over training batches, with an LSTM meta-learner (forget and input gates) producing each update; the cost is computed on the test set]
Ravi and Larochelle 2017
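The similarity between a gradient descent step and an LSTM cell-state update can be written out side by side. The symbols α_t (learning rate), f_t and i_t (forget and input gates), c̃_t (candidate cell state) and L_t (loss on batch t) are standard notation introduced here for illustration and do not appear in the slide text.

```latex
\begin{align}
\theta_t &= \theta_{t-1} - \alpha_t \, \nabla_{\theta_{t-1}} \mathcal{L}_t
  && \text{(gradient descent step)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
  && \text{(LSTM cell state update)}
\end{align}
```

The two coincide when f_t = 1, c_{t-1} = θ_{t-1}, i_t = α_t and c̃_t = −∇L_t. Letting the LSTM learn f_t and i_t instead of fixing them is what turns the cell update into a learned update rule for M.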

Model-agnostic meta-learning
1 Finn et al. 2017
2 Finn et al. 2018
3 Nichol et al. 2018
• For reinforcement learning:

Learning to reinforcement learn
• Humans often learn to play new games much faster than RL techniques do
• Reinforcement learning is very suited for learning-to-learn:
  - Build a learner, then use the performance of that learner as a reward
• Learning to reinforcement learn 1,2
  - Use RNN-based deep RL to train a recurrent network on many tasks
  - Learns to implement a 'fast' RL agent, encoded in its weights
[Figure: a meta-RL algorithm trains a policy πθ across environments; on similar environments the policy acts as a fast RL agent]
1 Duan et al. 2017
2 Wang et al. 2017
3 Duan et al. 2017

Learning to reinforcement learn
• Also works for few-shot learning 3
  - Condition on observation + upcoming demonstration
  - You don't know what someone is trying to teach you, but you prepare for the lesson
1 Duan et al. 2017
2 Wang et al. 2017
3 Duan et al. 2017

Learning to learn more tasks
• Active learning
  - Deep network (learns representation) + policy network
  - Receives state and reward, says which points to query next
• Density estimation
  - Learn distribution over small set of images, can generate new ones
  - Uses a MAML-based few-shot learner
• Matrix factorization
  - Deep learning architecture that makes recommendations
  - Meta-learner learns how to adjust biases for each user (task)
• Replace hand-crafted algorithms by learned ones.
• Look at problems through a meta-learning lens!
Pang et al. 2018
Reed et al. 2017
Vartak et al. 2017

Meta-data sharing
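import openml as oml
from sklearn import tree

# Uses the openml-python API as shown on the slide; newer releases offer
# oml.runs.run_model_on_task instead. Publishing requires an OpenML API key.
task = oml.tasks.get_task(14951)             # download the task (dataset, target, fixed splits)
clf = tree.ExtraTreeClassifier()             # any scikit-learn estimator
flow = oml.flows.sklearn_to_flow(clf)        # describe the model as an OpenML flow
run = oml.runs.run_flow_on_task(task, flow)  # train and evaluate locally on the task's splits
myrun = run.publish()                        # share the results on OpenML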
run locally, share globally
Vanschoren et al. 2014
[Figure: models built by humans and models built by AutoML bots]
• OK, but how do I get large amounts of meta-data for meta-learning?
• OpenML.org
• Thousands of uniform datasets
• 100+ meta-features
• Millions of evaluated runs
• Same splits, 30+ metrics
• Traces, models (opt)
• APIs in Python, R, Java,…
• Publish your own runs
• Never ending learning
• Benchmarks
building a shared memory
Open positions! Scientific programmer, Teaching PhD

Towards human-like learning to learn
• Learning-to-learn gives humans a significant advantage
  - Learning how to learn any task empowers us far beyond knowing how to learn specific tasks.
  - It is a universal aspect of life, and how it evolves
• Very exciting field with many unexplored possibilities
  - Many aspects not understood (e.g. task similarity), need more experiments.
• Challenge:
  - Build learners that never stop learning, that learn from each other
  - Build a global memory for learning systems to learn from
  - Let them explore / experiment by themselves

Thank you!
special thanks to
Pavel Brazdil, Matthias Feurer, Frank Hutter, Erin Grant,
Hugo Larochelle, Raghu Rajan, Jan van Rijn, Jane Wang
more to learn
http://www.automl.org/book/
Chapter 2: Meta-Learning
Never stop learning
