[course site]
#DLUPC
Life-long/incremental
Learning
Day 6 Lecture 2
Ramon Morros
ramon.morros@upc.edu
Associate Professor
Universitat Politècnica de Catalunya
Technical University of Catalonia
‘Classical’ approach to ML
● Isolated, single task learning:
○ Well defined tasks.
○ Knowledge is not retained or accumulated. Learning is performed without
considering knowledge learned in the past on other tasks
● Data is given prior to training
○ Model selection & meta-parameter optimization based on the full data set
○ A large amount of training data is needed
● Batch mode
○ All examples are used at the same time, irrespective of their (temporal)
order
● Assumption that the data and its underlying structure are static
○ Restricted environment
[Figure: in the classical setting, each dataset trains its own isolated model (Dataset 1 → Task 1, …, Dataset N → Task N)]
Challenges
● Data is not available beforehand; examples arrive over time
● Memory resources may be limited
○ LML has to rely on a compact/implicit representation of the already observed signals
○ NN models provide a good implicit representation!
● Adaptive model complexity
○ Impossible to determine model complexity in advance
○ Complexity may be bounded by available resources → intelligent reallocation
○ Meta-parameters such as the learning rate or the regularization strength cannot be determined prior to
training → they turn into model parameters!
Challenges
● Concept drift: changes in the data distribution occur over time
○ For instance, model evolution, changes in appearance, aging, etc.
● Stability-plasticity dilemma: when and how to adapt the current model
○ Quick updates enable rapid adaptation, but old information is forgotten
○ Slower adaptation retains old information, but the reactivity of the system decreases
○ Failure to deal with this dilemma may lead to catastrophic forgetting
[Figure: old data vs. new data. Source: https://www.youtube.com/watch?v=HMaWYBlo2Vc]
Lifelong Machine Learning (LML)
[Silver2013, Gepperth2016, Chen2016b]
Learn, retain, use knowledge over an extended period of time
● Data streams, constantly arriving, not static → Incremental learning
● Multiple tasks with multiple learning/mining algorithms
● Retain/accumulate knowledge learned in the past & use it to help future
learning
○ Use past knowledge for inductive transfer when learning new tasks
● Mimics the human way of learning
Lifelong Machine Learning (LML)
[Figure, from [Chen2016a]: in the ‘classical’ approach each task (Task 1–4) learns only from its own data; in the LML approach a shared knowledge base is retained and reused across all tasks together with each task’s data]
Related learning approaches
Transfer learning (finetuning):
● Data from the source domain helps learning in the target domain
● Less data is needed in the target domain
● Tasks must be similar
Multi-task learning:
● Co-learn multiple, related tasks simultaneously
● All tasks have labeled data and are treated equally
● Goal: optimize learning/performance across all tasks
through shared knowledge
Related learning approaches
Transfer learning (finetuning):
● Unidirectional: source → target
● Not continuous
● No retention/accumulation of knowledge
Multi-task learning:
● Simultaneous learning
● Data for all tasks is needed during training
LML Methods
Distillation

The original application was to transfer the knowledge from a large, easy-to-train model
into a smaller/faster model more suitable for deployment.

Bucilua et al. demonstrated that this can be done reliably when transferring from a large
ensemble of models to a single small model.

C. Bucilua, R. Caruana, and A. Niculescu-Mizil. “Model compression”. In ACM SIGKDD ’06, 2006
Distillation

Idea: use the class probabilities produced by the large model as “soft targets” for
training the small model
○ The ratios of probabilities in the soft targets provide information about the learned function
○ These ratios carry information about the structure of the data
○ Train by replacing the hard labels with the softmax activations from the original large model

Hinton, G., Vinyals, O., & Dean, J. “Distilling the Knowledge in a Neural Network”. NIPS 2014 DL Workshop, 1–9.
[Figure: example output vectors. One output (Yn) is compared against hard one-hot labels (e.g. [0, 1, 0, 0]) with a multinomial logistic loss, while the other (Y0) is compared against the soft targets produced by the original model (e.g. [0.09, 0.05, 0.85, 0.01]) with a distillation loss]
Distillation
● To increase the influence of non-target class probabilities in the cross entropy, the
temperature of the final softmax is raised to “soften” the final probability distribution
over classes
● Transfer can be obtained by using the same large model training set or a separate
training set
● If the ground-truth labels of the transfer set are known, standard loss and distillation
loss can be combined
Hinton, G., Vinyals, O., & Dean, J. “Distilling the Knowledge in a Neural Network”. NIPS 2014 DL Workshop, 1–9.
[Figure: the same output distribution at T=1 (e.g. [0.09, 0.05, 0.85, 0.01]) and at a higher temperature T>1 (e.g. [0.15, 0.10, 0.70, 0.05]); raising the temperature softens the distribution]
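To make the temperature idea concrete, here is a minimal NumPy sketch of the distillation objective described above. The temperature T, the weighting alpha and the example logits are illustrative choices, not values taken from the slides (the original paper additionally rescales the soft term by T²):

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax: higher T gives a softer distribution
    z = logits / T
    z = z - z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # Cross-entropy between the teacher's softened targets and the
    # student's softened predictions, both computed at temperature T
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    return -np.sum(p_teacher * np.log(p_student + 1e-12))

def combined_loss(student_logits, teacher_logits, hard_label, T=2.0, alpha=0.5):
    # If the ground-truth labels of the transfer set are known, combine the
    # standard loss (hard label, T=1) with the distillation loss (T>1)
    p_student = softmax(student_logits, T=1.0)
    hard_loss = -np.log(p_student[hard_label] + 1e-12)
    soft_loss = distillation_loss(student_logits, teacher_logits, T)
    return alpha * hard_loss + (1.0 - alpha) * soft_loss

# Toy example with arbitrary logits
teacher_logits = np.array([2.2, 1.6, 4.4, 0.1])
student_logits = np.array([1.8, 1.2, 3.9, 0.5])
print(combined_loss(student_logits, teacher_logits, hard_label=2))
```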
LWF: Learning without Forgetting [Li2016]
Goal:
Add new prediction tasks based on adapting shared parameters without access
to training data for previously learned tasks
Solution:
Using only examples for the new task, optimize for:
● High accuracy on the new task
● Preservation of responses on existing tasks from the original network (distillation, [Hinton2015])
● Storage/complexity does not grow with time. Old samples are not kept

Preserves performance on the old tasks
(even if images in the new task provide a poor sampling of the old tasks)
LWF: Learning without Forgetting [Li2016]
LWF: Learning without Forgetting [Li2016]
[Figure: the LWF training objective combines a multinomial logistic loss on the new task, a distillation loss on the recorded responses for the old tasks, and weight decay of 0.0005]
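As an illustration of how the three terms above fit together, here is a minimal NumPy sketch of an LWF-style objective for a single sample. The temperature T and the weighting lambda_old are hypothetical hyper-parameters; only the weight decay value of 0.0005 comes from the slide, and the actual implementation in [Li2016] works on mini-batches and possibly several old-task heads:

```python
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def lwf_loss(new_logits, new_label, old_logits, old_logits_recorded,
             params, T=2.0, lambda_old=1.0, weight_decay=5e-4):
    """Sketch of the LWF objective for one sample.

    new_logits          : current network outputs for the new task
    new_label           : ground-truth class index for the new task
    old_logits          : current network outputs for the old-task head
    old_logits_recorded : outputs of the *original* network on the same image,
                          recorded before training on the new task starts
    params              : flat vector of shared parameters (for weight decay)
    """
    # Multinomial logistic loss on the new task (hard label)
    p_new = softmax(new_logits)
    loss_new = -np.log(p_new[new_label] + 1e-12)

    # Distillation loss: keep the responses on the old tasks close to the
    # responses of the original network (soft targets at temperature T)
    p_old_target = softmax(old_logits_recorded, T)
    p_old_current = softmax(old_logits, T)
    loss_old = -np.sum(p_old_target * np.log(p_old_current + 1e-12))

    # Weight decay on the shared parameters
    loss_reg = weight_decay * np.sum(params ** 2)

    return loss_new + lambda_old * loss_old + loss_reg
```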
iCaRL
Goal:
Add new classes based on adapting shared parameters with restricted access to
training data for previously learned classes.
Solution:
● A subset of training samples (exemplar set) from previous classes is stored.
● Combination of classification loss for new samples and distillation loss for old samples.
● The size of the exemplar set is kept constant. As new classes arrive, some examples
from old classes are removed.
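A minimal sketch of the exemplar-memory bookkeeping described above: the total budget stays constant, so the per-class quota shrinks as new classes arrive, and each class keeps its most representative samples. The herding-style selection is a simplified approximation of iCaRL's construction, and the function names and feature-space inputs are assumptions for illustration:

```python
import numpy as np

def build_exemplar_set(features, per_class):
    """Greedy herding-style selection (simplified): repeatedly pick the sample
    that keeps the running mean of the chosen exemplars closest to the class mean."""
    class_mean = features.mean(axis=0)
    chosen = []
    running_sum = np.zeros_like(class_mean)
    for k in range(per_class):
        candidate_means = (running_sum + features) / (k + 1)
        dists = np.linalg.norm(class_mean - candidate_means, axis=1)
        dists[chosen] = np.inf            # do not pick the same sample twice
        idx = int(np.argmin(dists))
        chosen.append(idx)
        running_sum += features[idx]
    return features[chosen]

def reduce_exemplar_sets(exemplar_sets, budget):
    """Keep the total memory size constant: as new classes arrive, each class
    keeps only its first (most representative) per_class exemplars."""
    per_class = budget // len(exemplar_sets)
    return [s[:per_class] for s in exemplar_sets]

# Toy usage: 3 classes already stored, a 4th class arrives, total budget of 20
rng = np.random.default_rng(0)
exemplar_sets = [build_exemplar_set(rng.normal(size=(50, 8)), 20) for _ in range(3)]
exemplar_sets.append(build_exemplar_set(rng.normal(size=(50, 8)), 20))
exemplar_sets = reduce_exemplar_sets(exemplar_sets, budget=20)
print([len(s) for s in exemplar_sets])    # -> [5, 5, 5, 5]
```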
iCaRL: Incremental Classifier and Representation learning
[Figure: the model update uses the exemplar set (old classes) together with the new training data (new class); the old-class outputs are constrained with distillation [Hinton2015]]
iCaRL: Incremental Classifier and Representation learning
[Figure: after the update, a new exemplar set is built from the previous exemplar set (old classes) and the new training data (new class)]
Results on face recognition
● Preliminary results from Eric Presas’ final degree project (TFG), co-supervised with Elisa Sayrol
[Figure: preliminary face recognition results for iCaRL and LWF]
Elastic Weight Consolidation (EWC)

● Evidence suggests that the mammalian brain may avoid catastrophic forgetting by protecting
previously acquired knowledge in neocortical circuits
● Knowledge is durably encoded by rendering a proportion of synapses less plastic (stable over long
timescales)
● The EWC algorithm slows down learning on certain weights based on how important they are to
previously seen tasks
● While learning task B, EWC therefore protects the performance in task A by constraining the
parameters to stay in a region of low error for task A centered around θ*
● The constraint is implemented as a quadratic penalty. It can be imagined as a spring anchoring the
parameters to the previous solution (elastic)
● The stiffness of this spring should not be the same for all parameters; rather, it should be greater for
parameters that most affect performance in task A

F: Fisher information matrix
(https://en.wikipedia.org/wiki/Fisher_information#Matrix_form)

Kirkpatrick et al. Overcoming catastrophic forgetting in neural networks. Proc. of the National Academy of Sciences, 114(13), 2017
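For reference, the quadratic penalty mentioned above is written in [Kirkpatrick2017] as follows, where L_B is the loss for the new task B, F_i is the diagonal Fisher information for parameter i, θ*_{A,i} the old-task solution, and λ sets how important the old task is relative to the new one:

$$\mathcal{L}(\theta) \;=\; \mathcal{L}_B(\theta) \;+\; \sum_i \frac{\lambda}{2}\, F_i \,\bigl(\theta_i - \theta^{*}_{A,i}\bigr)^2$$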
Progressive Neural Networks
Goal:
Learn a series of tasks in sequence, using knowledge from
previous tasks to improve convergence speed
Solution:
● Instantiate a new NN for each task being solved, with lateral
connections to features of previously learned columns
● Previous tasks’ training data is not stored; it is implicitly represented in the NN weights
● Complexity of the model grows with each task
● Task labels needed at test time
Rusu et al (2016). Progressive Neural Networks. CoRR. arXiv:1606.04671. Retrieved from http://arxiv.org/abs/1606.04671
Deep adaptation (I)
In Progressive NN, the number of parameters is duplicated for each task
In iCaRL, LWF and EWC, the performance in older tasks can decrease because weights are
shared between tasks
Idea: augment a network learned for one task with controller modules that reuse its already
learned representations for another task
● Parameters of the controller modules are optimized to
minimize a loss on a new task.
● The training data for the original task is not required for
successive tasks.
● The network’s output on the original task data stays
exactly as it was
● Any number of controller modules may be added so that
a single network can simultaneously encode multiple
distinct tasks
Rosenfeld, A., & Tsotsos, J. K. (2017). Incremental Learning Through Deep Adaptation. arXiv. Retrieved from http://arxiv.org/abs/1705.04228
Deep adaptation (II)

● Each controller module uses the existing weights of the corresponding layer of N to
create new convolutional filters adapted to the new task T2
● Throughout training & testing, the weights of the base network are fixed and only used
as basis functions.

Rosenfeld, A., & Tsotsos, J. K. (2017). Incremental Learning Through Deep Adaptation. arXiv. Retrieved from http://arxiv.org/abs/1705.04228
Deep adaptation (III)

● Each controller module uses the existing weights of the corresponding layer of N to
create new convolutional filters adapted to the new task T2
● Throughout training & testing, the weights of the base network are fixed and only used
as basis functions.

[Figure: a controller module attached to a convolution layer; Co is the number of output features, Ci the number of inputs, and k the size of the convolution filters]

Rosenfeld, A., & Tsotsos, J. K. (2017). Incremental Learning Through Deep Adaptation. arXiv. Retrieved from http://arxiv.org/abs/1705.04228
Deep adaptation (IV)

● Fully connected layers are not reused
● The weights of the controller modules are learned via back-propagation given the loss
function
● The number of new parameters added for each task is moderate

[Figure: a controller module attached to a convolution layer; Co is the number of output features, Ci the number of inputs, and k the size of the convolution filters]

Ratio of new parameters to old ones (per layer): for Co = Ci = 256 and k = 5 → r = 0.04
For a complete network, typically: 20~30%
Rosenfeld, A., & Tsotsos, J. K. (2017). Incremental Learning Through Deep Adaptation. arXiv. Retrieved from http://arxiv.org/abs/1705.04228
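For intuition, the sketch below treats a controller module as a task-specific matrix that linearly recombines the frozen base filters of a convolution layer. This parameterization is an assumption made for illustration rather than the exact formulation of [Rosenfeld2017], but it reproduces the per-layer parameter ratio quoted on the slide (r = 0.04 for Co = Ci = 256, k = 5):

```python
import numpy as np

def controller_conv_filters(base_filters, controller):
    """Sketch of a deep-adaptation controller module.

    base_filters : frozen filters of the base network layer, shape (Co, Ci, k, k)
    controller   : task-specific matrix, shape (Co, Co), learned by
                   back-propagation for the new task (assumed parameterization:
                   each new filter is a linear combination of the frozen filters)

    Returns new filters for task T2 with the same shape as base_filters.
    """
    c_o, c_i, k, _ = base_filters.shape
    flat = base_filters.reshape(c_o, -1)      # (Co, Ci*k*k)
    new_flat = controller @ flat              # recombine the frozen filters
    return new_flat.reshape(c_o, c_i, k, k)

# Parameter-count comparison for the example on the slide
c_o = c_i = 256
k = 5
base_params = c_o * c_i * k * k               # original conv layer
controller_params = c_o * c_o                 # added per task
print(controller_params / base_params)        # -> 0.04
```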
Summary

| Method | Task labels needed? | Old training data needed? | Constant data size | Constant model complexity | Type | Mechanism |
|--------|---------------------|---------------------------|--------------------|---------------------------|------|-----------|
| iCaRL  | No  | Yes | Yes | Yes | Class incremental | Distillation |
| LWF    | Yes | No  | Yes | Yes | Task incremental  | Distillation |
| PNN    | Yes | No  | Yes | No (doubling per each new task) | Task incremental | New network with lateral connections to old ones |
| EWC    | No  | No  | Yes | Yes | Task incremental  | Preserve important weights |
| DA     | Yes | No  | Yes | No (20~30% increment per new task) | Task incremental | Add controller modules |
Increasing model capacity (I)
New knowledge acquired (new classes, new domains) over time may saturate
network capacity
We can think of a lifelong learning system as experiencing a continually growing
training set.
The optimal model complexity changes as training set size changes over time.
● Initially, a small model may be preferred, in order to prevent overfitting and to reduce
the computational cost of using the model.
● Later, a large model may be necessary to fully utilize the large dataset.
Increasing model capacity (II)
Some LML methods already add capacity for each task (PNN, DA) but others do
not.
If the capacity of the network has to be increased, we want to avoid retraining
the new network from scratch
It is possible to transfer knowledge from a teacher network to a ‘bigger’ student
network in an efficient way
Chen, T., Goodfellow, I., & Shlens, J. (2016). Net2Net: Accelerating Learning via Knowledge Transfer. In ICLR 2016
● The new, larger network immediately performs as well as the original network,
rather than spending time passing through a period of low performance.
● Any change made to the network after initialization is guaranteed to be an
improvement, so long as each local step is an improvement.
● It is always “safe” to optimize all parameters in the network.
Increasing model capacity: Net2Net (I)
Chen, T., Goodfellow, I., & Shlens, J. (2016). Net2Net: Accelerating Learning via Knowledge Transfer. In ICLR 2016
Increasing model capacity: Net2Net (II)
Net2WiderNet:
● Allows a layer to be replaced with a wider layer (a layer that has more units)
● For convolution architectures, this means more convolution channels
[Figure: teacher network and student network after widening a layer (biases are omitted for simplicity)]

Chen, T., Goodfellow, I., & Shlens, J. (2016). Net2Net: Accelerating Learning via Knowledge Transfer. In ICLR 2016
Increasing model capacity: Net2Net (III)
A random mapping g(·) is used to build U from W:
● The first n columns of W^(i) are copied directly into U^(i)
● Columns n+1 through q of U^(i) are created by choosing at random (with replacement) as defined in g
● For weights in U^(i+1), we must account for the replication by dividing the weight by a replication factor, so all the units have the same value as the unit in the original net
● This can be generalized to making multiple layers wider

Chen, T., Goodfellow, I., & Shlens, J. (2016). Net2Net: Accelerating Learning via Knowledge Transfer. In ICLR 2016
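The random mapping and the replication-factor correction can be written in a few lines. The sketch below widens one fully connected hidden layer from n to q units (NumPy, biases omitted as on the slide; the variable names are mine, not from [ChenT2016]):

```python
import numpy as np

def net2wider(W1, W2, q, rng=np.random.default_rng(0)):
    """Minimal Net2WiderNet sketch for two consecutive fully connected layers.

    W1 : weights of layer i,   shape (inputs, n)   -- n hidden units
    W2 : weights of layer i+1, shape (n, outputs)
    q  : new (larger) number of hidden units, q > n
    Returns (U1, U2) so that the widened network computes the same function.
    """
    n = W1.shape[1]
    # Random mapping g: identity for the first n units, random copies afterwards
    g = np.concatenate([np.arange(n), rng.integers(0, n, size=q - n)])

    # Columns of U1 are copied from W1 according to g
    U1 = W1[:, g]

    # Rows of U2 are copied according to g and divided by the replication
    # factor so every original unit's total contribution is preserved
    counts = np.bincount(g, minlength=n)      # number of copies of each unit
    U2 = W2[g, :] / counts[g][:, None]
    return U1, U2

# Quick check that the function is preserved (ReLU hidden layer assumed)
rng = np.random.default_rng(1)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(3, 2))
U1, U2 = net2wider(W1, W2, q=5)
x = rng.normal(size=(1, 4))
relu = lambda z: np.maximum(z, 0)
print(np.allclose(relu(x @ W1) @ W2, relu(x @ U1) @ U2))   # True
```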
Discovering new classes
Most learning systems follow a closed-world assumption (the number of
categories is predetermined at training time)

New classes may appear over time. Systems need a way to detect them and to
introduce them in the learning process

The method in [Kading2016] is inspired by the way humans (children) learn over time
Käding, C., Rodner, E., Freytag, A., & Denzler, J. (2016). Watch, Ask, Learn, and Improve: a lifelong learning cycle for visual recognition. European Symposium on Artificial NN.
WALI (I)
The system incorporates four phases:
● Watch: the system is fed with continuous streams of YouTube video
● Ask: the system actively selects a few examples for manual annotation
● Learn: the obtained feedback is used to update the current model
● Improve: this never-ending cycle allows the system to adapt to new scenarios
Käding, C., Rodner, E., Freytag, A., & Denzler, J. Watch, Ask, Learn, and Improve: a lifelong learning cycle for visual recognition. European Symposium on Artificial NN. 2016
WALI (II)
Watch
● Continuous stream of unlabeled images
● Obtained by automatically downloading videos from YouTube using the official API
● A given YouTube category is used (animal documentary)
● Images are sampled every 10th frame to reduce redundancy
● Visual descriptors are extracted using pre-trained CNN activations (relu7 of
AlexNet trained on ImageNet)

ASK
● A key feature is to select the images to be labeled by human annotators
● Images that will lead to an information gain must be selected
● Active learning: unlabeled samples are evaluated on whether they are likely to increase
the classifier performance once labeled and added to the training set
Käding, C., Rodner, E., Freytag, A., & Denzler, J. Watch, Ask, Learn, and Improve: a lifelong learning cycle for visual recognition. European Symposium on Artificial NN. 2016
WALI (III)
ASK (cont.)
● Query images are selected according to the best vs. second-best strategy
proposed in [Ajay2009] (see the sketch after this slide)
○ One-vs-all classifier for each class
○ The example with the smallest q(x) score is selected for labeling

q(x) = score_best_class − score_second_best_class

● A rejection category is added (not all frames can be associated with a semantic
category, and some categories may not be important)

q*(x) = (1 − p(rejection | x)) · q(x)

Learn
● Use incremental learning to retrain the classifiers with the new samples
Käding, C., Rodner, E., Freytag, A., & Denzler, J. Watch, Ask, Learn, and Improve: a lifelong learning cycle for visual recognition. European Symposium on Artificial NN. 2016
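A minimal sketch of the best vs. second-best selection with the rejection re-weighting described above; the array shapes, the number of queries and the toy data are illustrative assumptions:

```python
import numpy as np

def select_queries(scores, p_rejection, n_queries=5):
    """Best vs. second-best active learning selection (sketch of [Ajay2009]
    as used in WALI).

    scores      : array (n_samples, n_classes) of one-vs-all classifier scores
    p_rejection : array (n_samples,) with the estimated probability that a
                  sample belongs to the rejection category
    Returns the indices of the samples to send to the human annotator.
    """
    # q(x) = score of the best class - score of the second-best class
    sorted_scores = np.sort(scores, axis=1)
    q = sorted_scores[:, -1] - sorted_scores[:, -2]

    # q*(x) = (1 - p(rejection | x)) * q(x), as on the slide
    q_star = (1.0 - p_rejection) * q

    # The examples with the smallest q*(x) are selected for labeling
    return np.argsort(q_star)[:n_queries]

# Toy usage with random scores (purely illustrative values)
rng = np.random.default_rng(0)
scores = rng.normal(size=(100, 10))
p_rej = rng.uniform(0, 0.5, size=100)
print(select_queries(scores, p_rej))
```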
References
[Ajay2009] Joshi, A. J., Porikli, F., & Papanikolopoulos, N. Multi-class active learning for image classification. In 2009 CVPR
[Chen2016a] Z. Chen, Google, B. Liu, “Lifelong Machine Learning for Natural Language Processing”, EMNLP-2016 Tutorial, 2016
[Chen2016b] Z. Chen and B. Liu, “Lifelong Machine Learning”, Morgan & Claypool Publishers, November 2016.
[Gepperth2016] A. Gepperth, B. Hammer, “Incremental learning algorithms and applications”, ESANN 2016
[ChenT2016] Chen, T., Goodfellow, I., & Shlens, J. (2016). Net2Net: Accelerating Learning via Knowledge Transfer. In ICLR 2016
[Hinton2015] Hinton, G., Vinyals, O., & Dean, J. “Distilling the Knowledge in a Neural Network”. NIPS 2014 DL Workshop, 1–9.
[Kading2016] Käding, C., Rodner, E., Freytag, A., & Denzler, J. Watch, Ask, Learn, and Improve: a lifelong learning cycle for visual
recognition. European Symposium on Artificial Neural Networks. 2016
[Kirkpatrick2017] Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., … Hadsell, R. Overcoming catastrophic
forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13), 2017.
[Li2016] Li, Z., & Hoiem, D. “Learning without forgetting”. In vol. 9908 LNCS, 2016.
[Rebuffi2016] Rebuffi, S.-A., Kolesnikov, A., & Lampert, C. H. “iCaRL: Incremental Classifier and Representation Learning”. 2016
arXiv:1611.07725
[Rosenfeld2017] Rosenfeld, A., & Tsotsos, J. K. (2017). Incremental Learning Through Deep Adaptation. arXiv.
http://arxiv.org/abs/1705.04228
[Rusu2016] Rusu, A. A., Rabinowitz, N. C., Desjardins, G., Soyer, H., Kirkpatrick, J., … Hadsell, R. “Progressive Neural Networks”. 2016
CoRR. arXiv:1606.04671.
[Silver2013] D. L. Silver, et al., “Lifelong machine learning systems: Beyond learning algorithms”, 2013 AAAI Spring Symposium
Questions?