Learning to Learn
Joaquin Vanschoren
Eindhoven University of Technology & JADS
j.vanschoren@tue.nl
@joavanschoren
Berlin Machine Learning Meetup, Jan 2019
Slides: www.automl.org/events - Video: bit.ly/2QsnsUT
Book (open access): www.automl.org/book (or www.amazon.de)
Learning is a never-ending process
Humans don’t learn from scratch
Learning is a never-ending process
[Figure: Tasks 1 to 3 are learned in separate learning episodes, each producing models and their performance; meta-learning carries knowledge from one episode to the next]
Humans learn across tasks
Why? Less trial-and-error, less data
[Figure: models and performance from learning Tasks 1 to 3 provide the inductive bias]
Inductive bias: assumptions added to the training data to learn effectively
If prior tasks are similar, we can transfer prior knowledge to new tasks
Learning to learn
[Figure: learning the new task with transferred prior knowledge: prior beliefs, constraints, model parameters, representations, training data]
[Figure: the learners, models and performance obtained on Tasks 1 to 3 form the meta-data]
Meta-learning
[Figure: the meta-learner produces a base-learner for the new task; models and performance from additional experiments are fed back to the meta-learner]
Meta-learner learns a (base-)learning algorithm, based on meta-data
How?
1. Start with what generally works
2. Start from what likely works (based on partially similar tasks)
3. Start from previously trained models (for very similar tasks)
1. Start with what generally works
Store and use meta-data:
- configurations: settings that uniquely define the model
- performance (e.g. accuracy) on specific tasks
[Figure: configurations λi (hyperparameters, pipeline, neural architecture,…) and performances Pi,j from Tasks 1 to j are passed to the meta-learner for the new task]
Rankings
• Build a global (multi-objective) ranking, recommend the top-K
• Can be used as a warm start for optimization techniques
• E.g. Bayesian optimization, evolutionary techniques,…
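A minimal sketch of the global ranking idea above, assuming the meta-data is a small table `P[i][j]` holding the performance of configuration i on task j. The function names, the average-rank aggregation and the toy numbers are illustrative assumptions, not taken from the cited papers.

```python
def ranks_on_task(scores):
    """Rank configurations on one task: 1 = best (highest score); ties share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


def global_ranking(P, names, top_k):
    """P[i][j]: performance of configuration i on task j. Returns the top_k names by average rank."""
    n_tasks = len(P[0])
    per_task = [ranks_on_task([row[j] for row in P]) for j in range(n_tasks)]
    avg_rank = [sum(r[i] for r in per_task) / n_tasks for i in range(len(P))]
    order = sorted(range(len(P)), key=lambda i: avg_rank[i])
    return [names[i] for i in order[:top_k]]


# Toy meta-data: 3 configurations evaluated on 3 prior tasks (made-up numbers).
P = [
    [0.80, 0.70, 0.90],  # lambda_a
    [0.85, 0.68, 0.95],  # lambda_b
    [0.60, 0.65, 0.70],  # lambda_c
]
warm_start = global_ranking(P, ["lambda_a", "lambda_b", "lambda_c"], top_k=2)
```

The returned list could then seed an optimizer on the new task, evaluating the top-K configurations before any model-based search starts.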
[Figure: from (discrete) configurations λi and (multi-objective) performances Pi,j on prior tasks, build a global, task-independent ranking (1. λa, 2. λb, 3. λc, 4. λd, 5. λe, …); the top-ranked λa..k warm-start learning on the new task]
Leite et al. 2012
Abdulrahman et al. 2018
• Functional ANOVA 1
Select hyperparameters that cause variance in the evaluations.
• Tunability 2
Learn good defaults, measure improvement from tuning over defaults
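A simplified reading of the tunability bullet, assuming a finite table `P[i][j]` of evaluated configurations: the learned default is the configuration with the best mean performance across tasks, and the tunability on a task is the gain of that task's best configuration over the default. The function names and numbers are hypothetical; the cited work defines these quantities more carefully.

```python
def best_default(P):
    """Index of the configuration with the highest mean performance over all tasks."""
    means = [sum(row) / len(row) for row in P]
    return max(range(len(P)), key=lambda i: means[i])


def tunability(P):
    """Per-task gain of the best evaluated configuration over the learned default."""
    d = best_default(P)
    n_tasks = len(P[0])
    return [max(row[j] for row in P) - P[d][j] for j in range(n_tasks)]


# Toy meta-data: P[i][j] = performance of configuration i on task j (made-up numbers).
P = [
    [0.80, 0.70, 0.90],
    [0.85, 0.68, 0.95],
    [0.60, 0.75, 0.70],
]
default = best_default(P)
gains = tunability(P)
```

Small gains on most tasks would suggest the hyperparameters are hardly worth tuning; large gains suggest the opposite.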
To tune or not to tune?
[Figure: hyperparameter importance for λ1 to λ4, learned from Pi,j on prior tasks, gives constraints and priors for the new task]
1 van Rijn & Hutter 2018
2 Probst et al. 2018
• Search space pruning
Exclude regions yielding bad performance on (similar) tasks
To tune or not to tune?
[Figure: regions of the (λ1, λ2) space with poor performance P on prior tasks become constraints for the new task]
Wistuba et al. 2015
• Experiments on the new task can tell us how it is similar to previous tasks
• Tasks are similar if observed performance of configurations is similar
• Use this to recommend new experiments
Learning task similarity
[Figure: a (discrete) configuration λc is evaluated on the new task; prior task j is similar if Pc,new ≅ Pc,j]
• Tournament-style selection, warm-start with overall best config λbest
• Next candidate λc : the one that beats current λbest on similar tasks
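The two bullets above can be sketched as follows. This is a simplified illustration, not the procedure of Leite et al.: the inverse-error similarity, the gain function and all names (`active_testing`, `evaluate`, `budget`) are assumptions made for the example.

```python
def active_testing(P, evaluate, budget):
    """P[i][j]: performance of configuration i on prior task j.
    evaluate(i): runs configuration i on the new task and returns its performance."""
    n_conf, n_tasks = len(P), len(P[0])
    # Warm start: configuration with the best total performance on prior tasks.
    best = max(range(n_conf), key=lambda i: sum(P[i]))
    observed = {best: evaluate(best)}
    while len(observed) < min(budget, n_conf):
        # Task j is similar if the performances observed so far are close to P[., j].
        sim = []
        for j in range(n_tasks):
            diff = sum(abs(observed[i] - P[i][j]) for i in observed) / len(observed)
            sim.append(1.0 / (1e-9 + diff))

        # Next candidate: largest similarity-weighted gain over the current best.
        def gain(c):
            return sum(sim[j] * max(P[c][j] - P[best][j], 0.0) for j in range(n_tasks))

        candidates = [c for c in range(n_conf) if c not in observed]
        cand = max(candidates, key=gain)
        observed[cand] = evaluate(cand)
        if observed[cand] > observed[best]:
            best = cand
    return best, observed


# Toy example: the new task behaves like prior task 1 (made-up numbers).
P = [[0.90, 0.55], [0.60, 0.80], [0.70, 0.60]]
new_task_perf = [0.50, 0.82, 0.60]
best, observed = active_testing(P, lambda i: new_task_perf[i], budget=2)
```

With a budget of two evaluations, the sketch spends the second one on the configuration that wins on the prior task resembling the new one.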
Active testing
[Figure: among (discrete) configurations, select λc > λbest on similar tasks and evaluate it on the new task]
Leite et al. 2012
• Consider space of all configuration options (e.g. all possible neural nets or pipelines)
• Surrogate model: probabilistic regression model of configuration performance
• Acquisition function: selects next configuration to try (exploration-exploitation)
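A compact sketch of the loop described above, using only the standard library. It assumes a one-dimensional configuration λ on a finite grid, a Gaussian process with a squared-exponential kernel as the surrogate model, and expected improvement as the acquisition function; the kernel length scale, noise level and the toy performance curve are arbitrary choices made for illustration.

```python
import math


def rbf(a, b, length=0.2):
    """Squared-exponential kernel on scalar configurations."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def surrogate(xs, ys, x, noise=1e-4):
    """GP posterior mean and std at x (prior mean = average of the observed ys)."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k = [rbf(a, x) for a in xs]
    y_mean = sum(ys) / len(ys)
    alpha = solve(K, [y - y_mean for y in ys])
    mean = y_mean + sum(ki * ai for ki, ai in zip(k, alpha))
    var = rbf(x, x) - sum(ki * vi for ki, vi in zip(k, solve(K, k)))
    return mean, math.sqrt(max(var, 1e-12))


def expected_improvement(mean, std, best):
    """Acquisition function: expected gain over the best performance seen so far."""
    z = (mean - best) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return (mean - best) * cdf + std * pdf


def bayes_opt(objective, candidates, n_iter=8):
    """Maximise objective over a finite list of scalar configurations."""
    xs = [candidates[0], candidates[-1]]  # two initial evaluations
    ys = [objective(x) for x in xs]
    for _ in range(n_iter):
        untried = [x for x in candidates if x not in xs]
        if not untried:
            break
        nxt = max(untried,
                  key=lambda x: expected_improvement(*surrogate(xs, ys, x), max(ys)))
        xs.append(nxt)
        ys.append(objective(nxt))
    best = max(range(len(xs)), key=lambda i: ys[i])
    return xs[best], ys[best], xs


def toy_performance(lam):
    """Hypothetical performance curve standing in for training and evaluating a model."""
    return 1.0 - (lam - 0.6) ** 2


grid = [i / 20 for i in range(21)]
lam_best, p_best, tried = bayes_opt(toy_performance, grid)
```

Practical systems use more robust surrogates and continuous or structured search spaces; the sketch only shows how the surrogate and the acquisition function interact.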
Bayesian optimization
[Figure: surrogate model of performance P over configurations λ ∈ Λ fitted to the evaluated λi, with the acquisition function plotted below it]
Rasmussen 2014
• Like a short-term memory for solving the new task
• Can we store it in long-term memory for solving new tasks?
• If task j is similar to the new task, its surrogate model Sj will likely transfer well
• Sum up all Sj predictions, weighted by task similarity (as in active testing) 1
• Build combined Gaussian process, weighted by current performance on new task 2
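A small sketch of the weighted combination S = ∑ wj Sj described above. The surrogates are stand-in functions rather than fitted Gaussian processes, and the inverse-error weighting is an assumption chosen for simplicity; the cited papers use their own weighting schemes.

```python
def combine_surrogates(surrogates, weights):
    """Return S(lam) = sum_j w_j * S_j(lam), with weights normalised to sum to 1."""
    total = sum(weights)
    norm = [w / total for w in weights]

    def S(lam):
        return sum(w * s(lam) for w, s in zip(norm, surrogates))

    return S


def similarity_weights(surrogates, observed):
    """Weight each prior-task surrogate by how well it predicts the new task so far.
    observed: list of (lam, performance) pairs measured on the new task."""
    weights = []
    for s in surrogates:
        err = sum(abs(s(lam) - p) for lam, p in observed) / len(observed)
        weights.append(1.0 / (1e-9 + err))
    return weights


# Hypothetical surrogates fitted on two prior tasks (stand-ins for GP predictions).
def S1(lam):
    return 1.0 - (lam - 0.2) ** 2


def S2(lam):
    return 1.0 - (lam - 0.8) ** 2


observed = [(0.8, 0.99), (0.5, 0.90)]  # evaluations on the new task (made-up)
w = similarity_weights([S1, S2], observed)
S = combine_surrogates([S1, S2], w)
```

Because the second prior task predicts the new observations better, its surrogate dominates the combined prediction.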
Surrogate model transfer
[Figure: per prior task tj, a surrogate model Sj is fitted on (λi, Pi,j); the surrogates S1, S2, S3 are combined as S = ∑ wj Sj for the new task]
1 Wistuba et al. 2018
2 Feurer et al. 2018
• Multi-task Gaussian processes: train surrogate model on t tasks simultaneously 1
• If tasks are similar: transfers useful info
• Not very scalable
• Bayesian Neural Networks as surrogate model 2
• Multi-task, more scalable
• Stacking Gaussian Process regressors (Google Vizier) 3
• Sequential tasks, each similar to the previous one
• Transfers a prior based on residuals of previous GP
Multi-task Bayesian optimization
[Figure: independent GP predictions compared with multi-task GP predictions]
1 Swersky et al. 2013
2 Springenberg et al. 2016
3 Golovin et al. 2017
• Bayesian linear regression (BLR) surrogate model on every task
• Use neural net to learn a suitable basis expansion ϕz(λ) for all tasks
• Scales linearly in # observations, transfers info on configuration space
More scalable variants
[Figure: a neural net basis expansion φz(λ) and the BLR hyperparameters are pre-trained on (λi, Pi,j) from prior tasks, then warm-start a BLR surrogate used for Bayesian optimization on the new task]
Perrone et al. 2018
2. Start from what likely works (based on similar tasks)
Meta-features: measurable properties of the tasks
(number of instances and features, class imbalance, feature skewness,…)
[Figure: meta-features mj are computed for Tasks 1 to N and for the new task; configurations λi and performances Pi,j from tasks with similar mj are passed to the meta-learner]
• Hand-crafted (interpretable) meta-features 1
• Number of instances, features, classes, missing values, outliers,…
• Statistical: skewness, kurtosis, correlation, covariance, sparsity, variance,…
• Information-theoretic: class entropy, mutual information, noise-signal ratio,…
• Model-based: properties of simple models trained on the task
• Landmarkers: performance of fast algorithms trained on the task
• Domain specific task properties
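A few of the simpler hand-crafted meta-features listed above can be computed as in the following sketch. The function name, the selection of features and the toy dataset are illustrative choices; the data is assumed to be a list of numeric rows with one class label per row.

```python
import math
from collections import Counter


def meta_features(X, y):
    """X: list of rows of numeric features, y: list of class labels."""
    n, d = len(X), len(X[0])
    counts = Counter(y)
    # Information-theoretic: entropy of the class distribution (in bits).
    class_entropy = -sum(c / n * math.log2(c / n) for c in counts.values())
    # Statistical: skewness of each feature, averaged over features.
    skews = []
    for j in range(d):
        col = [row[j] for row in X]
        mu = sum(col) / n
        var = sum((v - mu) ** 2 for v in col) / n
        if var > 0:
            skews.append(sum((v - mu) ** 3 for v in col) / n / var ** 1.5)
        else:
            skews.append(0.0)  # constant feature: skewness undefined, use 0
    return {
        "n_instances": n,
        "n_features": d,
        "n_classes": len(counts),
        "class_entropy": class_entropy,
        "mean_skewness": sum(skews) / d,
    }


X = [[1.0, 0.0], [2.0, 0.0], [3.0, 1.0], [10.0, 1.0]]
y = ["a", "a", "b", "b"]
mf = meta_features(X, y)
```

The resulting dictionary is one possible form of the meta-feature vector mj used to compare tasks.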
Meta-features
Vanschoren 2018
• Learning a joint task representation
• Deep metric learning: learn a representation hmf using a ground truth distance
• With Siamese Network: similar task, similar representation
• Can be used to recommend neural architectures given task similarity
Meta-features
Kim et al. 2017
• Find k most similar tasks, warm-start search with best λi
• Auto-sklearn: Bayesian optimization (SMAC)
• Meta-learning yields better models, faster
• Winner of AutoML Challenges
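The first bullet can be sketched as a nearest-neighbour lookup in meta-feature space. This is a simplified illustration and not Auto-sklearn's implementation: it assumes meta-features are already normalised, uses Euclidean distance, and all names and values are hypothetical.

```python
import math


def warm_start_configs(new_mf, task_mfs, best_configs, k):
    """new_mf: meta-feature vector of the new task.
    task_mfs[j]: meta-feature vector of prior task j.
    best_configs[j]: best known configuration on prior task j."""
    dist = [math.dist(new_mf, mf) for mf in task_mfs]
    nearest = sorted(range(len(task_mfs)), key=lambda j: dist[j])[:k]
    configs = []
    for j in nearest:
        if best_configs[j] not in configs:  # skip duplicates
            configs.append(best_configs[j])
    return configs


# Toy meta-data: three prior tasks with (normalised) meta-features and their best configs.
task_mfs = [[0.10, 0.90], [0.80, 0.20], [0.15, 0.85]]
best_configs = [{"C": 1.0}, {"C": 100.0}, {"C": 0.5}]
init = warm_start_configs([0.12, 0.88], task_mfs, best_configs, k=2)
```

The returned configurations would be evaluated first on the new task, before the optimizer proposes its own candidates.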
Warm-starting from similar tasks
[Figure: meta-features mj select the most similar prior tasks; the best λi on those tasks (λ1..k) initialise Bayesian optimization of P over λ on the new task]
Feurer et al. 2015
• Learn direct mapping between meta-features and Pi,j
• Zero-shot meta-models: predict best λi given meta-features 1
• Ranking models: return ranking λ1..k 2
• Predict which algorithms / configurations to consider / tune 3
• Predict performance / runtime for given 𝛳i and task 4
• Can be integrated in larger AutoML systems: warm start, guide search,…
Meta-models
[Figure: four meta-learners, mapping mj to λbest (1), mj to a ranking λ1..k (2), mj to a subset Λ of configurations (3), and (mj, λi) to Pi,j (4)]
1 Brazdil et al. 2009, Lemke et al. 2015
2 Sun and Pfahringer 2013, Pinto et al. 2017
3 Sanders and Giraud-Carrier 2017
4 Yang et al. 2018
Learning Pipelines / Architectures
• Compositionality: the learning process can be broken down into smaller tasks
• Easier to learn, more transferable, more robust
• Pipelines are one way of doing this, but how to control the search space?
• Planning techniques (e.g. Hierarchical Task Planning) 2
• Impose a fixed structure or grammar on the pipeline 1
• Neural architecture search
• Usually defines fixed search space 3
• Very little meta-learning (yet)
• RL controller transfer 4
1 Feurer et al. 2015
2 Mohr et al. 2018
3 Zoph et al. 2018
4 Wong et al. 2018
Evolving pipelines
1 Olson et al. 2017
2 Gijsbers et al. 2018
3 De Sa et al. 2017
• Start from simple pipelines
• Evolve more complex ones if needed
• Reuse pipelines that do specific things
• Mechanisms:
• Cross-over: reuse partial pipelines
• Mutation: change structure, tuning
• Approaches:
• TPOT: Tree-based pipelines1
• GAMA: asynchronous evolution2
• RECIPE: grammar-based3
• Meta-learning:
• Warm-starting, Meta-models 2
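The mechanisms above can be illustrated with a toy evolutionary loop over pipelines represented as lists of component names. This is not the API or algorithm of TPOT, GAMA or RECIPE; the component names, the fitness function and the selection scheme are all made up for the example.

```python
import random

PREPROCESSORS = ["scaler", "pca", "select_k_best"]  # hypothetical component names
CLASSIFIERS = ["tree", "knn", "logreg"]


def mutate(pipeline, rng):
    """Mutation: add or remove a preprocessing step, or swap the classifier."""
    steps, clf = list(pipeline[:-1]), pipeline[-1]
    op = rng.choice(["add", "remove", "classifier"])
    if op == "add":
        steps.insert(rng.randrange(len(steps) + 1), rng.choice(PREPROCESSORS))
    elif op == "remove" and steps:
        steps.pop(rng.randrange(len(steps)))
    else:
        clf = rng.choice(CLASSIFIERS)
    return steps + [clf]


def crossover(a, b):
    """Cross-over: reuse a partial pipeline (preprocessing of a, classifier of b)."""
    return list(a[:-1]) + [b[-1]]


def evolve(fitness, generations, pop_size, rng):
    """Start from single-classifier pipelines and evolve more complex ones."""
    population = [[rng.choice(CLASSIFIERS)] for _ in range(pop_size)]
    for _ in range(generations):
        # Keep the better half unchanged, refill with mutated offspring.
        parents = sorted(population, key=fitness, reverse=True)[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.choice(parents), rng.choice(parents)
            children.append(mutate(crossover(a, b), rng))
        population = parents + children
    return max(population, key=fitness)


def toy_fitness(pipeline):
    """Made-up score standing in for cross-validated performance."""
    score = 0.5
    score += 0.2 if "scaler" in pipeline[:-1] else 0.0
    score += 0.2 if pipeline[-1] == "knn" else 0.0
    return score - 0.02 * len(pipeline)  # mild penalty on pipeline length


best = evolve(toy_fitness, generations=10, pop_size=6, rng=random.Random(0))
```

In a real system the fitness call would train and validate the pipeline, which is why warm-starting the initial population from meta-data can save a lot of time.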
Learning to learn through self-play
• Build pipelines by inserting, deleting, replacing components (actions)
• Neural network (LSTM) receives task meta-features, pipelines and evaluations
• Predicts pipeline performance and action probabilities
• Monte Carlo Tree Search builds pipelines, runs simulations against LSTM
Drori et al. 2017
[Figure: the meta-learner receives meta-features mj, builds pipelines λi for the new task through self-play, and observes their performance]
3. Start with previously trained models
[Figure: models trained on intrinsically (very) similar tasks (e.g. tasks with a shared representation) pass their model parameters 𝛳k, features,… to the meta-learner for the new task]
Transfer learning
• For neural networks, both structure and weights can be transferred
• Features and initializations learned from:
• Large image datasets (e.g. ImageNet) 1
• Large text corpora (e.g. Wikipedia) 2
• Fails if tasks are not similar enough 3
• Feature extraction: remove last layers, use output as features; if task is quite different, remove more layers
• End-to-end tuning: train from initialized weights
• Fine-tuning: unfreeze last layers, tune on new task
[Figure: pre-trained convnet (filters) learned on the source tasks, with frozen, pre-trained and new layers, shown for a small target task, a large similar target task and a large different target task]
1 Razavian et al. 2014
2 Mikolov et al. 2013
3 Yosinski et al. 2014
Learning to learn by gradient descent
• Our brains probably don’t do backprop, replace it with:
• Simple parametric (bio-inspired) rule to update weights 1
• Single-layer neural network to learn weight updates 2
• Learn parameters across tasks, by gradient descent (meta-gradient)
1 Bengio et al. 1995
2 Runarsson and Jonsson 2000
[Figure: the learner's weight update Δ𝛳i = 𝜂 (…) depends on the learning rate, presynaptic activity and a reinforcing signal (Bengio et al.) or is computed by a single-layer network (Runarsson and Jonsson); the meta-learner learns the rule's parameters λi across tasks by gradient descent on the meta-gradient, starting from λinit]
Learning to learn gradient descent by gradient descent
• Replace backprop with a recurrent neural net (LSTM) 1, not so scalable
• Use a coordinatewise LSTM [m] for scalability/flexibility (cf. ADAM, RMSprop) 2
• Optimizee: receives weight update gt from optimizer
• Optimizer: receives gradient estimate ∇t from optimizee
• Learns how to do gradient descent across tasks
[Figure: a single LSTM network (the meta-model, with its hidden state) updates the optimizee's weights on a new task; the LSTM parameters are shared for all 𝛳]
1 Hochreiter 2001
2 Andrychowicz et al. 2016
Learning to learn gradient descent by gradient descent
• Left: optimizer and optimizee trained to do style transfer
• Right: optimizer solves similar tasks (different style, content and resolution) without any more training data
1 Hochreiter 2001
2 Andrychowicz et al. 2016
Few-shot learning
• Learn how to learn from few examples (given similar tasks)
• Meta-learner must learn how to train a base-learner based on prior experience
• Parameterize base-learner model and learn the parameters 𝛳i
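To make the few-shot setting concrete, the sketch below samples one N-way, K-shot episode and evaluates the average test loss of a model with parameters θ on it. The function names, the data layout (a dict from class label to examples) and the dummy loss are assumptions made for illustration.

```python
import random


def sample_episode(data_by_class, n_way, k_shot, n_query, rng):
    """data_by_class: dict mapping class label -> list of examples.
    Returns (T_train, T_test) as lists of (example, episode_label) pairs."""
    classes = rng.sample(sorted(data_by_class), n_way)
    t_train, t_test = [], []
    for label, cls in enumerate(classes):
        examples = rng.sample(data_by_class[cls], k_shot + n_query)
        t_train += [(x, label) for x in examples[:k_shot]]
        t_test += [(x, label) for x in examples[k_shot:]]
    return t_train, t_test


def cost(loss, theta, t_test):
    """Average of loss(theta, t) over all test examples t of the episode."""
    return sum(loss(theta, t) for t in t_test) / len(t_test)


# Toy data: 7 classes with 5 examples each (strings stand in for images).
data = {f"class{c}": [f"c{c}_x{i}" for i in range(5)] for c in range(7)}
rng = random.Random(0)
t_train, t_test = sample_episode(data, n_way=5, k_shot=1, n_query=2, rng=rng)
episode_cost = cost(lambda theta, t: 1.0, theta=None, t_test=t_test)  # dummy loss
```

During meta-training, many such episodes are drawn from the training classes; at meta-test time the episodes contain classes never seen before.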
[Figure: for each task Tj, the meta-model trains model M on Ttrain (X_train, y_train), updating its parameters from 𝛳i to 𝛳i+1, and is scored on Ttest (X_test, y_test) to obtain Pi+1,test]
Image: Hugo Larochelle
$\mathrm{Cost}(\theta_i) = \frac{1}{|T_{test}|} \sum_{t \in T_{test}} \mathrm{loss}(\theta_i, t)$
1-shot, 5-class: new classes!
Few-shot learning: approaches
• Existing algorithm as meta-learner:
• LSTM + gradient descent (Ravi and Larochelle 2017)
• Learn 𝛳init + gradient descent (Finn et al. 2017)
• kNN-like: Memory + similarity (Vinyals et al. 2016)
• Learn embedding + classifier (Snell et al. 2017)
• …
• Black-box meta-learner
• Neural Turing machine (with memory) (Santoro et al. 2016)
• Neural attentive learner (Mishra et al. 2018)
• …
$\mathrm{Cost}(\theta_i) = \frac{1}{|T_{test}|} \sum_{t \in T_{test}} \mathrm{loss}(\theta_i, t)$
LSTM meta-learner + gradient descent
Ravi and Larochelle 2017
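[Figure: the meta-learner LSTM is unrolled over the training batches of model M (Train); the final model is evaluated on the test set (Test)]
$\mathrm{Cost}(\theta_T) = \frac{1}{|T_{test}|} \sum_{t \in T_{test}} \mathrm{loss}(\theta_T, t)$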
• Gradient descent update 𝛳t is similar to LSTM cell state update ct
• Hence, training a meta-learner LSTM yields an update rule for training M
• Start from initial 𝛳0, train model on first batch, get gradient and loss update
• Predict 𝛳t+1 , continue to t=T, get cost, backpropagate to learn LSTM weights, optimal 𝛳0
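The analogy in the first bullet can be written out as follows, following Ravi and Larochelle 2017. The symbols α_t (learning rate), L_t (loss on batch t), f_t, i_t (forget and input gates) and c̃_t (candidate cell state) are introduced here for illustration.

```latex
\begin{align}
\theta_t &= \theta_{t-1} - \alpha_t \nabla_{\theta_{t-1}} \mathcal{L}_t
  && \text{(gradient descent update)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
  && \text{(LSTM cell state update)}
\end{align}
```

The two updates coincide for f_t = 1, c_{t-1} = θ_{t-1}, i_t = α_t and c̃_t = −∇L_t. The meta-learner LSTM learns the gates f_t and i_t instead of fixing them, which amounts to learning the update rule.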
[Figure: LSTM cell state update with its forget and input gates]
Model-agnostic meta-learning
1 Finn et al. 2017
2 Finn et al. 2018
3 Nichol et al. 2018
• For reinforcement learning:
Learning to reinforcement learn
1 Duan et al. 2017
2 Wang et al. 2017
3 Duan et al. 2017
[Figure: a meta-RL algorithm trained on many environments produces a policy 𝝅θ that acts as a fast RL agent on similar environments]
• Humans often learn to play new games much faster than RL techniques do
• Reinforcement learning is very suited for learning-to-learn:
• Build a learner, then use performance of that learner as a reward
• Learning to reinforcement learn 1,2
• Use RNN-based deep RL to train a recurrent network on many tasks
• Learns to implement a 'fast' RL agent, encoded in its weights
Learning to reinforcement learn
• Also works for few-shot learning 3
• Condition on observation + upcoming demonstration
• You don’t know what someone is trying to teach you, but you prepare for the lesson
Learning to learn more tasks
• Active learning
• Deep network (learns representation) + policy network
• Receives state and reward, says which points to query next
• Density estimation
• Learn distribution over small set of images, can generate new ones
• Uses a MAML-based few-shot learner
• Matrix factorization
• Deep learning architecture that makes recommendations
• Meta-learner learns how to adjust biases for each user (task)
• Replace hand-crafted algorithms by learned ones.
• Look at problems through a meta-learning lens!
Pang et al. 2018
Reed et al. 2017
Vartak et al. 2017
Meta-data sharing
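import openml as oml
from sklearn import tree

# Calls as shown on the slide (openml-python API at the time of the talk, 2019);
# newer releases may differ, e.g. oml.runs.run_model_on_task(clf, task).
task = oml.tasks.get_task(14951)             # download the task (dataset + splits)
clf = tree.ExtraTreeClassifier()             # any scikit-learn estimator
flow = oml.flows.sklearn_to_flow(clf)        # describe the model as an OpenML flow
run = oml.runs.run_flow_on_task(task, flow)  # run locally on the task's splits
myrun = run.publish()                        # share globally (needs an OpenML API key)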
run locally, share globally
Vanschoren et al. 2014
[Figure: OpenML collects models built by humans and models built by AutoML bots]
• OK, but how do I get large amounts of meta-data for meta-learning?
• OpenML.org
• Thousands of uniform datasets
• 100+ meta-features
• Millions of evaluated runs
• Same splits, 30+ metrics
• Traces, models (opt)
• APIs in Python, R, Java,…
• Publish your own runs
• Never ending learning
• Benchmarks
building a shared memory
Open positions!

Scientific programmer

Teaching PhD
Towards human-like learning to learn
• Learning-to-learn gives humans a significant advantage
• Learning how to learn any task empowers us far beyond knowing how to learn specific tasks.
• It is a universal aspect of life, and how it evolves
• Very exciting field with many unexplored possibilities
• Many aspects not understood (e.g. task similarity), need more experiments.
• Challenge:
• Build learners that never stop learning, that learn from each other
• Build a global memory for learning systems to learn from
• Let them explore / experiment by themselves
Thank you!
special thanks to
Pavel Brazdil, Matthias Feurer, Frank Hutter, Erin Grant,
Hugo Larochelle, Raghu Rajan, Jan van Rijn, Jane Wang
more to learn
http://www.automl.org/book/
Chapter 2: Meta-Learning
Never stop learning

Learning how to learn

  • 1. Learningto Learn Joaquin Vanschoren Eindhoven University of Technology & JADS j.vanschoren@tue.nl @joavanschoren Berlin Machine Learning Meetup, Jan 2019 Learning Learning LearningLearning Learning Slides: www.automl.org/events - Video: bit.ly/2QsnsUT Book (open access): www.automl.org/book (or www.amazon.de)
  • 2. Learning is a never-ending process !2 Humans don’t learn from scratch
  • 3. Learning is a never-ending process Task 1 Task 2 Task 3 Models ModelsModelsModels Model performance performance performance Learning ModelsModelsModelsModels Learning Learning Learning episodes meta- learning meta- learning meta- learning !3 Humans learn across tasks Why? Less trial-and-error, less data
  • 4. Task 1 Task 2 Task 3 ModelsModelsModels performance performance performance LearningLearning Learning ModelsModelsModels ModelsModelsModels inductive bias !4 Inductive bias: assumptions added to the training data to learn effectively If prior tasks are similar, we can transfer prior knowledge to new tasks Learning to learn New Task performance ModelsModelsModels Learning prior beliefs constraints model parameters representations training data
  • 5. Task 1 Task 2 Task 3 ModelsModelsModels performance performance performance LearningLearningLearners LearningLearningLearners LearningLearningLearners ModelsModelsModels ModelsModelsModels }meta-data !5 Meta-learning New Task performance ModelsModelsModels meta-learner base-learner additional experiments Meta-learner learns a (base-)learning algorithm, based on meta-data
  • 6. How? New Task meta-learner ModelsModelsModels Task 1 Task j ModelsModelsModels performance performance LearningLearningLearningLearning ModelsModelsModels … performance 1. Start with what generally works 2. Start from what likely works (based on partially similar tasks) 3. Start from previously trained models (for very similar tasks) !6 Learners Learners
  • 7. 1. Start with what generally works Store and use meta-data: - configurations: settings that uniquely define the model - performance (e.g. accuracy) on specific tasks New Task meta-learner ModelsModelsModels configurations performances Task 1 Task j ModelsModelsModels performance performance LearningLearningLearningLearning ModelsModelsModels … performance λi Pi,j !7 Learners Learners (hyperparameters, pipeline, neural architecture,…)
  • 8. Rankings • Build a global (multi-objective) ranking, recommend the top-K • Can be used as a warm start for optimization techniques • E.g. Bayesian optimization, evolutionary techniques,… Tasks ModelsModelsModels performance LearningLearningLearning 1. λa 2. λb 3. λc 4. λd 5. λe 6. … New Task meta-learner ModelsModelsModels performance Global ranking (task independent) λa..k warm start Leite et al. 2012 Abdulrahman et al. 2018 } (discrete) (multi-objective) Pi,j !8 λi
  • 9. • Functional ANOVA 1 Select hyperparameters that cause variance in the evaluations. • Tunability 2 Learn good defaults, measure improvement from tuning over defaults Tasks ModelsModelsModels performance LearningLearningLearning } New Task meta-learner ModelsModelsModels performance To tune or not to tune? hyperparameter importance 1 van Rijn & Hutter 2018 λ1 λ2 λ3 λ4 constraints priors 2 Probst et al. 2018 Pi,j !9 λi
  • 10. • Search space pruning Exclude regions yielding bad performance on (similar) tasks Tasks ModelsModelsModels performance LearningLearningLearning } New Task meta-learner ModelsModelsModels performance To tune or not to tune? constraints Pi,j Wistuba et al. 2015 P λ1 !10 λi λ2
  • 11. • Experiments on the new task can tell us how it is similar to previous tasks • Task are similar if observed performance of configurations is similar • Use this to recommend new experiments Tasks ModelsModelsModels performance LearningLearningLearning } New Task meta-learner ModelsModelsModels performance Learning task similarity λc (discrete) Pi,j !11 λi Pc,new ≅ Pc,j
  • 12. • Tournament-style selection, warm-start with overall best config λbest • Next candidate λc : the one that beats current λbest on similar tasks Tasks ModelsModelsModels performance LearningLearningLearning } New Task meta-learner ModelsModelsModels performance Active testing Leite et al. 2012 λc Select λc > λbest on similar tasks (discrete) Pi,j !12 λi
  • 13. • Consider space of all configuration options (e.g. all possible neural nets or pipelines) • Surrogate model: probabilistic regression model of configuration performance • Acquisition function: selects next configuration to try (exploration-exploitation) Task ModelsModelsModels performance LearningLearningLearning Bayesian optimization Surrogate model Acquisition function λ ∈ Λ P λi Rasmussen 2014 !13
  • 15. Task ModelsModelsModels performance LearningLearningLearning Bayesian optimization Surrogate model Acquisition function P λi Rasmussen 2014 !15 • Like a short-term memory for solving new task • Can we store it in long-term memory for solving new tasks?
  • 16. • If task j is similar to the new task, its surrogate model Sj will likely transfer well • Sum up all Sj predictions, weighted by task similarity (as in active testing)1 • Build combined Gaussian process, weighted by current performance on new task2 Tasks ModelsModelsModels performance LearningLearningLearning New Task meta-learner ModelsModelsModels performance per task tj: Pi,j } Surrogate model transfer 1 Wistuba et al. 2018 λi P Sj 2 Feurer et al. 2018 S= ∑ wj Sj + + S1 S2 S3 !16 λi
  • 17. • Multi-task Gaussian processes: train surrogate model on t tasks simultaneously1 • If tasks are similar: transfers useful info • Not very scalable • Bayesian Neural Networks as surrogate model2 • Multi-task, more scalable • Stacking Gaussian Process regressors (Google Vizier)3 • Sequential tasks, each similar to the previous one • Transfers a prior based on residuals of previous GP Multi-task Bayesian optimization 1 Swersky et al. 2013 Independent GP predictions Multi-task GP predictions 2 Springenberg et al. 2016 3 Golovin et al. 2017 !17
  • 18. • Bayesian linear regression (BLR) surrogate model on every task • Use neural net to learn a suitable basis expansion ϕz(λ) for all tasks • Scales linearly in # observations, transfers info on configuration space Tasks ModelsModelsModels performance LearningLearningLearning New Task meta-learner ModelsModelsModels performance } More scalable variants Perrone et al. 2018 P BLR surrogate (λi,Pi,j) φz(λ)i warm-start (pre-train) λi Bayesian optimization φz(λ) BLR hyperparameters Pi,j !18 λi
  • 19. 2. Start from what likely works (based on similar tasks) Meta-features: measurable properties of the tasks (number of instances and features, class imbalance, feature skewness,…) configurations performances similar mj ?Task 1 Task N ModelsModelsModels performance performance LearningLearningLearning LearningLearningLearning ModelsModelsModels … meta-features New Task meta-learner ModelsModelsModels performance mj Pi,j !19 λi
  • 20. Meta-features (Vanschoren 2018) • Hand-crafted (interpretable) meta-features • Number of instances, features, classes, missing values, outliers, … • Statistical: skewness, kurtosis, correlation, covariance, sparsity, variance, … • Information-theoretic: class entropy, mutual information, noise-signal ratio, … • Model-based: properties of simple models trained on the task • Landmarkers: performance of fast algorithms trained on the task • Domain-specific task properties
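A few of the hand-crafted meta-features listed above can be computed with the standard library alone. The sketch below is illustrative only: the function names and the choice of five meta-features are assumptions, and it expects a dense numeric table `X` (list of rows) with labels `y`.

```python
import math
from collections import Counter
from statistics import mean, pstdev

def class_entropy(labels):
    """Information-theoretic meta-feature: entropy of the class distribution (bits)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def skewness(values):
    """Statistical meta-feature: population skewness of one feature column."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return 0.0  # constant column: define skewness as 0
    return mean(((v - mu) / sigma) ** 3 for v in values)

def simple_meta_features(X, y):
    """Compute a small meta-feature vector for a dataset (rows X, labels y)."""
    columns = list(zip(*X))
    return {
        "n_instances": len(X),
        "n_features": len(columns),
        "n_classes": len(set(y)),
        "class_entropy": class_entropy(y),
        "mean_skewness": mean(skewness(col) for col in columns),
    }
```

Landmarkers and model-based meta-features would additionally require training fast learners on the task, which is left out here.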
  • 21. Meta-features (Kim et al. 2017) • Learning a joint task representation • Deep metric learning: learn a representation hmf using a ground truth distance • With a Siamese network: similar task, similar representation • Can be used to recommend neural architectures given task similarity
  • 22. Warm-starting from similar tasks (Feurer et al. 2015) • Find the k most similar tasks (by meta-features mj), warm-start the search with their best configurations λ1..k • Auto-sklearn: Bayesian optimization (SMAC) • Meta-learning yields better models, faster • Winner of AutoML Challenges
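The warm-starting step can be sketched as a nearest-neighbour lookup in meta-feature space. This is a simplified illustration and not auto-sklearn's implementation: the data layout (a list of `(meta_features, best_config)` pairs) and the Euclidean distance are assumptions, and meta-features would normally be normalized first.

```python
import math

def warm_start_configs(new_meta, prior_tasks, k=2):
    """Return the best configurations of the k prior tasks closest to the new task.

    prior_tasks: list of (meta_feature_vector, best_config) pairs (hypothetical layout).
    """
    ranked = sorted(prior_tasks, key=lambda task: math.dist(new_meta, task[0]))
    return [best_config for _, best_config in ranked[:k]]
```

The returned configurations would be evaluated first on the new task, and their results used to initialize the surrogate model before Bayesian optimization takes over.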
  • 23. Meta-models • Learn a direct mapping between meta-features and Pi,j • Zero-shot meta-models: predict best λi given meta-features (mj → λbest) [1] • Ranking models: return ranking λ1..k (mj → λ1..k) [2] • Predict which algorithms / configurations to consider / tune (mj → Λ) [3] • Predict performance / runtime for given θi and task ((mj, λi) → Pi,j) [4] • Can be integrated in larger AutoML systems: warm start, guide search, … • [1] Brazdil et al. 2009, Lemke et al. 2015 • [2] Sun and Pfahringer 2013, Pinto et al. 2017 • [3] Sanders and Giraud-Carrier 2017 • [4] Yang et al. 2018
  • 24. Learning Pipelines / Architectures • Compositionality: the learning process can be broken down into smaller tasks • Easier to learn, more transferable, more robust • Pipelines are one way of doing this, but how to control the search space? • Planning techniques (e.g. Hierarchical Task Planning) [2] • Impose a fixed structure or grammar on the pipeline [1] • Neural architecture search • Usually defines fixed search space [3] • Very little meta-learning (yet) • RL controller transfer [4] • [1] Feurer et al. 2015 • [2] Mohr et al. 2018 • [3] Zoph et al. 2018 • [4] Wong et al. 2018
  • 25. Evolving pipelines • Start from simple pipelines • Evolve more complex ones if needed • Reuse pipelines that do specific things • Mechanisms: • Cross-over: reuse partial pipelines • Mutation: change structure, tuning • Approaches: • TPOT: tree-based pipelines [1] • GAMA: asynchronous evolution [2] • RECIPE: grammar-based [3] • Meta-learning: warm-starting, meta-models [2] • [1] Olson et al. 2017 • [2] Gijsbers et al. 2018 • [3] De Sa et al. 2017
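The two mechanisms named above, cross-over and mutation, can be illustrated on a toy pipeline encoding. This is a hedged sketch and not how TPOT, GAMA or RECIPE represent pipelines: a pipeline is assumed to be a list of step names ending in a learner, and the component names are hypothetical.

```python
import random

# Hypothetical component names, for illustration only.
PREPROCESSORS = ["scaler", "pca", "select_k_best"]
LEARNERS = ["random_forest", "svm", "logistic_regression"]

def mutate(pipeline, rng):
    """Mutation: change the structure (add a preprocessing step) or swap the learner."""
    child = list(pipeline)
    if rng.random() < 0.5:
        # Insert a preprocessor somewhere before the final learner.
        position = rng.randint(0, len(child) - 1)
        child.insert(position, rng.choice(PREPROCESSORS))
    else:
        child[-1] = rng.choice(LEARNERS)
    return child

def crossover(parent_a, parent_b, rng):
    """Cross-over: reuse a preprocessing prefix of one parent and the tail of the other."""
    cut_a = rng.randint(0, len(parent_a) - 1)  # prefix never includes a's learner
    cut_b = rng.randint(0, len(parent_b) - 1)  # tail always includes b's learner
    return parent_a[:cut_a] + parent_b[cut_b:]
```

A full evolutionary search would also evaluate each child on the task and keep only the best-performing pipelines for the next generation.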
  • 26. Learning to learn through self-play (Drori et al. 2017) • Build pipelines by inserting, deleting, replacing components (actions) • Neural network (LSTM) receives task meta-features mj, pipelines and evaluations • Predicts pipeline performance and action probabilities • Monte Carlo Tree Search builds pipelines, runs simulations against the LSTM
  • 27. 3. Start with previously trained models • Models trained on similar tasks (model parameters, features, …) • For intrinsically (very) similar tasks (e.g. shared representation) • Meta-data per prior task (Task 1 … Task N): configurations λi, performances Pi,j, model parameters θk
  • 28. Transfer learning • For neural networks, both structure and weights can be transferred • Features and initializations learned from: • Large image datasets (e.g. ImageNet) [1] • Large text corpora (e.g. Wikipedia) [2] • Fails if tasks are not similar enough [3] • Feature extraction: remove last layers, use output as features; if the task is quite different, remove more layers • Fine-tuning: unfreeze last layers, tune on new task • End-to-end tuning: train from initialized weights • Figure: pre-trained convnet with frozen, pre-trained and new layers, for a small target task, a large similar task and a large different task • [1] Razavian et al. 2014 • [2] Mikolov et al. 2013 • [3] Yosinski et al. 2014
  • 29. Learning to learn by gradient descent • Our brains probably don’t do backprop, replace it with: • Simple parametric (bio-inspired) rule to update weights [1] • Single-layer neural network to learn weight updates [2] • Learn parameters across tasks, by gradient descent (meta-gradient) • Figure: the weight update Δθi depends on the learning rate η, presynaptic activity and a reinforcing signal; the meta-learner learns λi by gradient descent, starting from λinit • [1] Bengio et al. 1995 • [2] Runarsson and Jonsson 2000
  • 30. Learning to learn gradient descent • Replace backprop with a recurrent neural net (LSTM) [1], not so scalable • Use a coordinatewise LSTM [m] for scalability/flexibility (cf. ADAM, RMSprop) [2] • Optimizee: receives weight update gt from optimizer • Optimizer: receives gradient estimate ∇t from optimizee • Learns how to do gradient descent across tasks • Single network: LSTM parameters are shared for all θ • [1] Hochreiter 2001 • [2] Andrychowicz et al. 2016
  • 31. Learning to learn gradient descent • Left: optimizer and optimizee trained to do style transfer • Right: optimizer solves similar tasks (different style, content and resolution) without any more training data • [1] Hochreiter 2001 • [2] Andrychowicz et al. 2016
  • 32. Few-shot learning • Learn how to learn from few examples (given similar tasks) • Meta-learner must learn how to train a base-learner based on prior experience • Parameterize base-learner model and learn the parameters θi • $\mathrm{Cost}(\theta_i) = \frac{1}{|T_{test}|} \sum_{t \in T_{test}} \mathrm{loss}(\theta_i, t)$ • Figure (image: Hugo Larochelle): for each task Tj, model M is trained on Ttrain (X_train, y_train) and evaluated on Ttest (X_test, y_test); example: 1-shot, 5-class, with new classes at test time
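The episode structure and the cost on Ttest can be sketched as follows. This is a minimal illustration under stated assumptions: `data_by_class` (a dict from class label to a list of examples), the function names and the `n_query` parameter are hypothetical, and `loss` is any user-supplied function of the parameters and one test example.

```python
import random

def sample_episode(data_by_class, n_way, k_shot, n_query, rng):
    """Sample one N-way, K-shot task: a small Ttrain and a disjoint Ttest."""
    classes = rng.sample(sorted(data_by_class), n_way)
    train, test = [], []
    for label in classes:
        examples = rng.sample(data_by_class[label], k_shot + n_query)
        train += [(x, label) for x in examples[:k_shot]]   # Ttrain
        test += [(x, label) for x in examples[k_shot:]]    # Ttest
    return train, test

def episode_cost(loss, theta, test_set):
    """Cost(theta) = average loss of the model with parameters theta over Ttest."""
    return sum(loss(theta, t) for t in test_set) / len(test_set)
```

For the 1-shot, 5-class setting on the slide one would call `sample_episode(data, n_way=5, k_shot=1, n_query=..., rng=random.Random(0))`, drawing the test-time episodes from classes that were never used during meta-training.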
  • 33. Few-shot learning: approaches • Existing algorithm as meta-learner: • LSTM + gradient descent (Ravi and Larochelle 2017) • Learn θinit + gradient descent (Finn et al. 2017) • kNN-like: memory + similarity (Vinyals et al. 2016) • Learn embedding + classifier (Snell et al. 2017) • … • Black-box meta-learner: • Neural Turing machine (with memory) (Santoro et al. 2016) • Neural attentive learner (Mishra et al. 2018) • … • $\mathrm{Cost}(\theta_i) = \frac{1}{|T_{test}|} \sum_{t \in T_{test}} \mathrm{loss}(\theta_i, t)$
  • 34. LSTM meta-learner + gradient descent (Ravi and Larochelle 2017) • Gradient descent update θt is similar to LSTM cell state update ct • Hence, training a meta-learner LSTM yields an update rule for training M • Start from initial θ0, train model on first batch, get gradient and loss update • Predict θt+1, continue to t=T, get cost, backpropagate to learn LSTM weights, optimal θ0 • $\mathrm{Cost}(\theta_T) = \frac{1}{|T_{test}|} \sum_{t \in T_{test}} \mathrm{loss}(\theta_T, t)$ • Figure: unrolled LSTM meta-learner and model M over the train and test steps, with forget and input gates
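The similarity between the two updates can be written out explicitly. The symbols below (forget gate $f_t$, input gate $i_t$, candidate $\tilde{c}_t$, learning rate $\alpha_t$, loss $\mathcal{L}_t$) follow the standard LSTM formulation and are notation assumed here rather than taken from the slide.

```latex
\begin{align}
c_t      &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
         && \text{(LSTM cell state update)} \\
\theta_t &= \theta_{t-1} - \alpha_t \nabla_{\theta_{t-1}} \mathcal{L}_t
         && \text{(gradient descent update)}
\end{align}
```

The two coincide when $c_t = \theta_t$, $f_t = 1$, $i_t = \alpha_t$ and $\tilde{c}_t = -\nabla_{\theta_{t-1}} \mathcal{L}_t$. Letting the meta-learner produce $f_t$ and $i_t$ at every step therefore gives a learned update rule with an adaptive learning rate and weight decay.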
  • 35. Model-agnostic meta-learning • For reinforcement learning: • [1] Finn et al. 2017 • [2] Finn et al. 2018 • [3] Nichol et al. 2018
  • 36. Learning to reinforcement learn • Humans often learn to play new games much faster than RL techniques do • Reinforcement learning is very well suited to learning-to-learn: • Build a learner, then use the performance of that learner as a reward • Learning to reinforcement learn [1,2] • Use RNN-based deep RL to train a recurrent network on many tasks • Learns to implement a ‘fast’ RL agent, encoded in its weights • Figure: a meta-RL algorithm trained on many environments yields a policy πθ that acts as a fast RL agent in similar environments • [1] Duan et al. 2017 • [2] Wang et al. 2017 • [3] Duan et al. 2017
  • 37. Learning to reinforcement learn • Also works for few-shot learning [3] • Condition on observation + upcoming demonstration • You don’t know what someone is trying to teach you, but you prepare for the lesson • [1] Duan et al. 2017 • [2] Wang et al. 2017 • [3] Duan et al. 2017
  • 38. Learning to learn more tasks • Active learning (Pang et al. 2018) • Deep network (learns representation) + policy network • Receives state and reward, says which points to query next • Density estimation (Reed et al. 2017) • Learn distribution over small set of images, can generate new ones • Uses a MAML-based few-shot learner • Matrix factorization (Vartak et al. 2017) • Deep learning architecture that makes recommendations • Meta-learner learns how to adjust biases for each user (task) • Replace hand-crafted algorithms by learned ones • Look at problems through a meta-learning lens!
  • 39. Meta-data sharing (Vanschoren et al. 2014) • OK, but how do I get large amounts of meta-data for meta-learning? • OpenML.org: building a shared memory • Thousands of uniform datasets • 100+ meta-features • Millions of evaluated runs • Same splits, 30+ metrics • Traces, models (opt) • APIs in Python, R, Java, … • Publish your own runs • Never-ending learning • Benchmarks • Models built by humans and models built by AutoML bots • Run locally, share globally: import openml as oml; from sklearn import tree; task = oml.tasks.get_task(14951); clf = tree.ExtraTreeClassifier(); flow = oml.flows.sklearn_to_flow(clf); run = oml.runs.run_flow_on_task(task, flow); myrun = run.publish() • Open positions: scientific programmer, teaching PhD
  • 40. Towards human-like learning to learn • Learning-to-learn gives humans a significant advantage • Learning how to learn any task empowers us far beyond knowing how to learn specific tasks • It is a universal aspect of life, and how it evolves • Very exciting field with many unexplored possibilities • Many aspects not understood (e.g. task similarity), need more experiments • Challenge: • Build learners that never stop learning, that learn from each other • Build a global memory for learning systems to learn from • Let them explore / experiment by themselves
  • 41. Thank you! • Special thanks to Pavel Brazdil, Matthias Feurer, Frank Hutter, Erin Grant, Hugo Larochelle, Raghu Rajan, Jan van Rijn, Jane Wang • More to learn: http://www.automl.org/book/ (Chapter 2: Meta-Learning) • Never stop learning