[course site]
#DLUPC
Life-long/incremental
Learning
Day 6 Lecture 2
Ramon Morros
ramon.morros@upc.edu
Associate Professor
Universitat Politècnica de Catalunya
Technical University of Catalonia
‘Classical’ approach to ML
● Isolated, single task learning:
○ Well defined tasks.
○ Knowledge is not retained or accumulated. Learning is performed without
considering knowledge learned in the past on other tasks
● Data is given prior to training
○ Model selection & meta-parameter optimization based on the full data set
○ A large amount of training data is needed
● Batch mode
○ All examples are used at the same time, irrespective of their (temporal)
order
● Assumption that the data and its underlying structure are static
○ Restricted environment
[Figure: in the classical setting, each dataset trains its own isolated model (Dataset 1 → Task 1, …, Dataset N → Task N)]
Challenges
● Data is not available beforehand; examples arrive over time
● Memory resources may be limited
○ LML has to rely on a compact/implicit representation of the already observed signals
○ NN models provide a good implicit representation!
● Adaptive model complexity
○ Impossible to determine model complexity in advance
○ Complexity may be bounded by available resources → intelligent reallocation
○ Meta-parameters such as the learning rate or the regularization strength cannot be determined prior to
training → they turn into model parameters!
Challenges
● Concept drift: changes in the data distribution occur over time
○ For instance, model evolution, changes in appearance, aging, etc.
● Stability-plasticity dilemma: when and how to adapt the current model
○ Quick updates enable rapid adaptation, but old information is forgotten
○ Slower adaptation retains old information, but the reactivity of the system decreases
○ Failure to deal with this dilemma may lead to catastrophic forgetting
[Figure: old data vs. new data. Source: https://www.youtube.com/watch?v=HMaWYBlo2Vc]
Lifelong Machine Learning (LML)
[Silver2013, Gepperth2016, Chen2016b]
Learn, retain, use knowledge over an extended period of time
● Data streams, constantly arriving, not static → Incremental learning
● Multiple tasks with multiple learning/mining algorithms
● Retain/accumulate knowledge learned in the past & use it to help future
learning
○ Use past knowledge for inductive transfer when learning new tasks
● Mimics the human way of learning
Lifelong Machine Learning (LML)
[Figure, from [Chen2016a]: in the ‘classical’ approach each task (Task 1–4) learns only from its own data; in the LML approach a shared knowledge base is retained and reused across all tasks together with each task’s data]
Related learning approaches
Transfer learning (finetuning):
● Data from the source domain helps learning in the target domain
● Less data is needed in the target domain
● Tasks must be similar
Multi-task learning:
● Co-learn multiple, related tasks simultaneously
● All tasks have labeled data and are treated equally
● Goal: optimize learning/performance across all tasks
through shared knowledge
Related learning approaches
Transfer learning (finetuning):
● Unidirectional: source → target
● Not continuous
● No retention/accumulation of knowledge
Multi-task learning:
● Simultaneous learning
● Data for all tasks is needed during training
LML Methods
Distillation

The original application was to transfer the knowledge from a large, easy-to-train model
into a smaller/faster model more suitable for deployment.

Bucilua et al. demonstrated that this can be done reliably when transferring from a large
ensemble of models to a single small model.

C. Bucilua, R. Caruana, and A. Niculescu-Mizil. “Model compression”. In ACM SIGKDD ’06, 2006
Distillation

Idea: use the class probabilities produced by the large model as “soft targets” for
training the small model
○ The ratios of probabilities in the soft targets provide information about the learned function
○ These ratios carry information about the structure of the data
○ Train by replacing the hard labels with the softmax activations from the original large model

Hinton, G., Vinyals, O., & Dean, J. “Distilling the Knowledge in a Neural Network”. NIPS 2014 DL Workshop, 1–9.
[Figure: example output vectors. One output (Yn) is compared against hard one-hot labels (e.g. [0, 1, 0, 0]) with a multinomial logistic loss, while the other (Y0) is compared against the soft targets produced by the original model (e.g. [0.09, 0.05, 0.85, 0.01]) with a distillation loss]
Distillation
● To increase the influence of non-target class probabilities in the cross entropy, the
temperature of the final softmax is raised to “soften” the final probability distribution
over classes
● Transfer can be obtained by using the same large model training set or a separate
training set
● If the ground-truth labels of the transfer set are known, standard loss and distillation
loss can be combined
Hinton, G., Vinyals, O., & Dean, J. “Distilling the Knowledge in a Neural Network”. NIPS 2014 DL Workshop, 1–9.
[Figure: the same output distribution at T=1 (e.g. [0.09, 0.05, 0.85, 0.01]) and at a higher temperature T>1 (e.g. [0.15, 0.10, 0.70, 0.05]); raising the temperature softens the distribution]
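To make the temperature idea concrete, here is a minimal NumPy sketch of the distillation objective described above. The temperature T, the weighting alpha and the example logits are illustrative choices, not values taken from the slides (the original paper additionally rescales the soft term by T²):

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax: higher T gives a softer distribution
    z = logits / T
    z = z - z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # Cross-entropy between the teacher's softened targets and the
    # student's softened predictions, both computed at temperature T
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    return -np.sum(p_teacher * np.log(p_student + 1e-12))

def combined_loss(student_logits, teacher_logits, hard_label, T=2.0, alpha=0.5):
    # If the ground-truth labels of the transfer set are known, combine the
    # standard loss (hard label, T=1) with the distillation loss (T>1)
    p_student = softmax(student_logits, T=1.0)
    hard_loss = -np.log(p_student[hard_label] + 1e-12)
    soft_loss = distillation_loss(student_logits, teacher_logits, T)
    return alpha * hard_loss + (1.0 - alpha) * soft_loss

# Toy example with arbitrary logits
teacher_logits = np.array([2.2, 1.6, 4.4, 0.1])
student_logits = np.array([1.8, 1.2, 3.9, 0.5])
print(combined_loss(student_logits, teacher_logits, hard_label=2))
```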
LWF: Learning without Forgetting [Li2016]
Goal:
Add new prediction tasks based on adapting shared parameters without access
to training data for previously learned tasks
Solution:
Using only examples for the new task, optimize for:
● High accuracy on the new task
● Preservation of responses on existing tasks from the original network (distillation, [Hinton2015])
● Storage/complexity does not grow with time. Old samples are not kept

Preserves performance on the old tasks
(even if images in the new task provide a poor sampling of the old tasks)
LWF: Learning without Forgetting [Li2016]
LWF: Learning without Forgetting [Li2016]
[Figure: the LWF training objective combines a multinomial logistic loss on the new task, a distillation loss on the recorded responses for the old tasks, and weight decay of 0.0005]
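As an illustration of how the three terms above fit together, here is a minimal NumPy sketch of an LWF-style objective for a single sample. The temperature T and the weighting lambda_old are hypothetical hyper-parameters; only the weight decay value of 0.0005 comes from the slide, and the actual implementation in [Li2016] works on mini-batches and possibly several old-task heads:

```python
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def lwf_loss(new_logits, new_label, old_logits, old_logits_recorded,
             params, T=2.0, lambda_old=1.0, weight_decay=5e-4):
    """Sketch of the LWF objective for one sample.

    new_logits          : current network outputs for the new task
    new_label           : ground-truth class index for the new task
    old_logits          : current network outputs for the old-task head
    old_logits_recorded : outputs of the *original* network on the same image,
                          recorded before training on the new task starts
    params              : flat vector of shared parameters (for weight decay)
    """
    # Multinomial logistic loss on the new task (hard label)
    p_new = softmax(new_logits)
    loss_new = -np.log(p_new[new_label] + 1e-12)

    # Distillation loss: keep the responses on the old tasks close to the
    # responses of the original network (soft targets at temperature T)
    p_old_target = softmax(old_logits_recorded, T)
    p_old_current = softmax(old_logits, T)
    loss_old = -np.sum(p_old_target * np.log(p_old_current + 1e-12))

    # Weight decay on the shared parameters
    loss_reg = weight_decay * np.sum(params ** 2)

    return loss_new + lambda_old * loss_old + loss_reg
```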
iCaRL
Goal:
Add new classes based on adapting shared parameters with restricted access to
training data for previously learned classes.
Solution:
● A subset of training samples (exemplar set) from previous classes is stored.
● Combination of classification loss for new samples and distillation loss for old samples.
● The size of the exemplar set is kept constant. As new classes arrive, some examples
from old classes are removed.
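A minimal sketch of the exemplar-memory bookkeeping described above: the total budget stays constant, so the per-class quota shrinks as new classes arrive, and each class keeps its most representative samples. The herding-style selection is a simplified approximation of iCaRL's construction, and the function names and feature-space inputs are assumptions for illustration:

```python
import numpy as np

def build_exemplar_set(features, per_class):
    """Greedy herding-style selection (simplified): repeatedly pick the sample
    that keeps the running mean of the chosen exemplars closest to the class mean."""
    class_mean = features.mean(axis=0)
    chosen = []
    running_sum = np.zeros_like(class_mean)
    for k in range(per_class):
        candidate_means = (running_sum + features) / (k + 1)
        dists = np.linalg.norm(class_mean - candidate_means, axis=1)
        dists[chosen] = np.inf            # do not pick the same sample twice
        idx = int(np.argmin(dists))
        chosen.append(idx)
        running_sum += features[idx]
    return features[chosen]

def reduce_exemplar_sets(exemplar_sets, budget):
    """Keep the total memory size constant: as new classes arrive, each class
    keeps only its first (most representative) per_class exemplars."""
    per_class = budget // len(exemplar_sets)
    return [s[:per_class] for s in exemplar_sets]

# Toy usage: 3 classes already stored, a 4th class arrives, total budget of 20
rng = np.random.default_rng(0)
exemplar_sets = [build_exemplar_set(rng.normal(size=(50, 8)), 20) for _ in range(3)]
exemplar_sets.append(build_exemplar_set(rng.normal(size=(50, 8)), 20))
exemplar_sets = reduce_exemplar_sets(exemplar_sets, budget=20)
print([len(s) for s in exemplar_sets])    # -> [5, 5, 5, 5]
```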
iCaRL: Incremental Classifier and Representation learning
[Figure: the model update uses the exemplar set (old classes) together with the new training data (new class); the old-class outputs are constrained with distillation [Hinton2015]]
iCaRL: Incremental Classifier and Representation learning
[Figure: after the update, a new exemplar set is built from the previous exemplar set (old classes) and the new training data (new class)]
Results on face recognition
● Preliminary results from Eric Presas’ final degree project (TFG), co-supervised with Elisa Sayrol
[Figure: preliminary face recognition results for iCaRL and LWF]
Elastic Weight Consolidation (EWC)

● Evidence suggests that the mammalian brain may avoid catastrophic forgetting by protecting
previously acquired knowledge in neocortical circuits
● Knowledge is durably encoded by rendering a proportion of synapses less plastic (stable over long
timescales)
● The EWC algorithm slows down learning on certain weights based on how important they are to
previously seen tasks
● While learning task B, EWC therefore protects the performance in task A by constraining the
parameters to stay in a region of low error for task A centered around θ*
● The constraint is implemented as a quadratic penalty. It can be imagined as a spring anchoring the
parameters to the previous solution (elastic)
● The stiffness of this spring should not be the same for all parameters; rather, it should be greater for
parameters that most affect performance in task A

F: Fisher information matrix
(https://en.wikipedia.org/wiki/Fisher_information#Matrix_form)

Kirkpatrick et al. Overcoming catastrophic forgetting in neural networks. Proc. of the National Academy of Sciences, 114(13), 2017
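For reference, the quadratic penalty mentioned above is written in [Kirkpatrick2017] as follows, where L_B is the loss for the new task B, F_i is the diagonal Fisher information for parameter i, θ*_{A,i} the old-task solution, and λ sets how important the old task is relative to the new one:

$$\mathcal{L}(\theta) \;=\; \mathcal{L}_B(\theta) \;+\; \sum_i \frac{\lambda}{2}\, F_i \,\bigl(\theta_i - \theta^{*}_{A,i}\bigr)^2$$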
Progressive Neural Networks
Goal:
Learn a series of tasks in sequence, using knowledge from
previous tasks to improve convergence speed
Solution:
● Instantiate a new NN for each task being solved, with lateral
connections to features of previously learned columns
● Previous tasks’ training data is not stored; it is implicitly represented in the NN weights
● Complexity of the model grows with each task
● Task labels needed at test time
Rusu et al (2016). Progressive Neural Networks. CoRR. arXiv:1606.04671. Retrieved from http://arxiv.org/abs/1606.04671
Deep adaptation (I)
In Progressive NN, the number of parameters is duplicated for each task
In iCaRL, LWF and EWC, the performance in older tasks can decrease because weights are
shared between tasks
Idea: augment a network learned for one task with controller modules that reuse its already
learned representations for another task
● Parameters of the controller modules are optimized to
minimize a loss on a new task.
● The training data for the original task is not required for
successive tasks.
● The network’s output on the original task data stays
exactly as it was
● Any number of controller modules may be added so that
a single network can simultaneously encode multiple
distinct tasks
Rosenfeld, A., & Tsotsos, J. K. (2017). Incremental Learning Through Deep Adaptation. arXiv. Retrieved from http://arxiv.org/abs/1705.04228
Deep adaptation (II)

● Each controller module uses the existing weights of the corresponding layer of N to
create new convolutional filters adapted to the new task T2
● Throughout training & testing, the weights of the base network are fixed and only used
as basis functions.

Rosenfeld, A., & Tsotsos, J. K. (2017). Incremental Learning Through Deep Adaptation. arXiv. Retrieved from http://arxiv.org/abs/1705.04228
Deep adaptation (III)

● Each controller module uses the existing weights of the corresponding layer of N to
create new convolutional filters adapted to the new task T2
● Throughout training & testing, the weights of the base network are fixed and only used
as basis functions.

[Figure: a controller module attached to a convolution layer; Co is the number of output features, Ci the number of inputs, and k the size of the convolution filters]

Rosenfeld, A., & Tsotsos, J. K. (2017). Incremental Learning Through Deep Adaptation. arXiv. Retrieved from http://arxiv.org/abs/1705.04228
Deep adaptation (IV)

● Fully connected layers are not reused
● The weights of the controller modules are learned via back-propagation given the loss
function
● The number of new parameters added for each task is moderate

[Figure: a controller module attached to a convolution layer; Co is the number of output features, Ci the number of inputs, and k the size of the convolution filters]

Ratio of new parameters to old ones (per layer): for Co = Ci = 256 and k = 5 → r = 0.04
For a complete network, typically: 20~30%
Rosenfeld, A., & Tsotsos, J. K. (2017). Incremental Learning Through Deep Adaptation. arXiv. Retrieved from http://arxiv.org/abs/1705.04228
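For intuition, the sketch below treats a controller module as a task-specific matrix that linearly recombines the frozen base filters of a convolution layer. This parameterization is an assumption made for illustration rather than the exact formulation of [Rosenfeld2017], but it reproduces the per-layer parameter ratio quoted on the slide (r = 0.04 for Co = Ci = 256, k = 5):

```python
import numpy as np

def controller_conv_filters(base_filters, controller):
    """Sketch of a deep-adaptation controller module.

    base_filters : frozen filters of the base network layer, shape (Co, Ci, k, k)
    controller   : task-specific matrix, shape (Co, Co), learned by
                   back-propagation for the new task (assumed parameterization:
                   each new filter is a linear combination of the frozen filters)

    Returns new filters for task T2 with the same shape as base_filters.
    """
    c_o, c_i, k, _ = base_filters.shape
    flat = base_filters.reshape(c_o, -1)      # (Co, Ci*k*k)
    new_flat = controller @ flat              # recombine the frozen filters
    return new_flat.reshape(c_o, c_i, k, k)

# Parameter-count comparison for the example on the slide
c_o = c_i = 256
k = 5
base_params = c_o * c_i * k * k               # original conv layer
controller_params = c_o * c_o                 # added per task
print(controller_params / base_params)        # -> 0.04
```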
Summary

| Method | Task labels needed? | Old training data needed? | Constant data size | Constant model complexity | Type | Mechanism |
|--------|---------------------|---------------------------|--------------------|---------------------------|------|-----------|
| iCaRL  | No  | Yes | Yes | Yes | Class incremental | Distillation |
| LWF    | Yes | No  | Yes | Yes | Task incremental  | Distillation |
| PNN    | Yes | No  | Yes | No (doubling per each new task) | Task incremental | New network with lateral connections to old ones |
| EWC    | No  | No  | Yes | Yes | Task incremental  | Preserve important weights |
| DA     | Yes | No  | Yes | No (20~30% increment per new task) | Task incremental | Add controller modules |
Increasing model capacity (I)
New knowledge acquired (new classes, new domains) over time may saturate
network capacity
We can think of a lifelong learning system as experiencing a continually growing
training set.
The optimal model complexity changes as training set size changes over time.
● Initially, a small model may be preferred, in order to prevent overfitting and to reduce
the computational cost of using the model.
● Later, a large model may be necessary to fully utilize the large dataset.
Increasing model capacity (II)
Some LML methods already add capacity for each task (PNN, DA) but others do
not.
If the capacity of the network has to be increased, we want to avoid retraining
the new network from scratch
It is possible to transfer knowledge from a teacher network to a ‘bigger’ student
network in an efficient way
Chen, T., Goodfellow, I., & Shlens, J. (2016). Net2Net: Accelerating Learning via Knowledge Transfer. In ICLR 2016
● The new, larger network immediately performs as well as the original network,
rather than spending time passing through a period of low performance.
● Any change made to the network after initialization is guaranteed to be an
improvement, so long as each local step is an improvement.
● It is always “safe” to optimize all parameters in the network.
Increasing model capacity: Net2Net (I)
Chen, T., Goodfellow, I., & Shlens, J. (2016). Net2Net: Accelerating Learning via Knowledge Transfer. In ICLR 2016
Increasing model capacity: Net2Net (II)
Net2WiderNet:
● Allows a layer to be replaced with a wider layer (a layer that has more units)
● For convolution architectures, this means more convolution channels
[Figure: teacher network and student network after widening a layer (biases are omitted for simplicity)]

Chen, T., Goodfellow, I., & Shlens, J. (2016). Net2Net: Accelerating Learning via Knowledge Transfer. In ICLR 2016
Increasing model capacity: Net2Net (III)
A random mapping g(·) is used to build U from W:
● The first n columns of W^(i) are copied directly into U^(i)
● Columns n+1 through q of U^(i) are created by choosing at random (with replacement) as defined in g
● For weights in U^(i+1), we must account for the replication by dividing the weight by a replication factor, so all the units have the same value as the unit in the original net
● This can be generalized to making multiple layers wider

Chen, T., Goodfellow, I., & Shlens, J. (2016). Net2Net: Accelerating Learning via Knowledge Transfer. In ICLR 2016
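The random mapping and the replication-factor correction can be written in a few lines. The sketch below widens one fully connected hidden layer from n to q units (NumPy, biases omitted as on the slide; the variable names are mine, not from [ChenT2016]):

```python
import numpy as np

def net2wider(W1, W2, q, rng=np.random.default_rng(0)):
    """Minimal Net2WiderNet sketch for two consecutive fully connected layers.

    W1 : weights of layer i,   shape (inputs, n)   -- n hidden units
    W2 : weights of layer i+1, shape (n, outputs)
    q  : new (larger) number of hidden units, q > n
    Returns (U1, U2) so that the widened network computes the same function.
    """
    n = W1.shape[1]
    # Random mapping g: identity for the first n units, random copies afterwards
    g = np.concatenate([np.arange(n), rng.integers(0, n, size=q - n)])

    # Columns of U1 are copied from W1 according to g
    U1 = W1[:, g]

    # Rows of U2 are copied according to g and divided by the replication
    # factor so every original unit's total contribution is preserved
    counts = np.bincount(g, minlength=n)      # number of copies of each unit
    U2 = W2[g, :] / counts[g][:, None]
    return U1, U2

# Quick check that the function is preserved (ReLU hidden layer assumed)
rng = np.random.default_rng(1)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(3, 2))
U1, U2 = net2wider(W1, W2, q=5)
x = rng.normal(size=(1, 4))
relu = lambda z: np.maximum(z, 0)
print(np.allclose(relu(x @ W1) @ W2, relu(x @ U1) @ U2))   # True
```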
Discovering new classes
Most learning systems follow a closed-world assumption (the number of
categories is predetermined at training time)

New classes may appear over time. Systems need a way to detect them and to
introduce them in the learning process

The method in [Kading2016] is inspired by the way humans (children) learn over time
Käding, C., Rodner, E., Freytag, A., & Denzler, J. (2016). Watch, Ask, Learn, and Improve: a lifelong learning cycle for visual recognition. European Symposium on Artificial NN.
WALI (I)
The system incorporates four phases:
● Watch: the system is fed with continuous streams of YouTube video
● Ask: the system actively selects a few examples for manual annotation
● Learn: the obtained feedback is used to update the current model
● Improve: this never-ending cycle allows the system to adapt to new scenarios
Käding, C., Rodner, E., Freytag, A., & Denzler, J. Watch, Ask, Learn, and Improve: a lifelong learning cycle for visual recognition. European Symposium on Artificial NN. 2016
WALI (II)
Watch
● Continuous stream of unlabeled images
● Obtained by automatically downloading videos from YouTube using the official API
● A given YouTube category is used (animal documentary)
● Images are sampled every 10th frame to reduce redundancy
● Visual descriptors are extracted using pre-trained CNN activations (relu7 of
AlexNet trained on ImageNet)

ASK
● A key feature is to select the images to be labeled by human annotators
● Images that will lead to an information gain must be selected
● Active learning: unlabeled samples are evaluated on whether they are likely to increase
the classifier performance once labeled and added to the training set
Käding, C., Rodner, E., Freytag, A., & Denzler, J. Watch, Ask, Learn, and Improve: a lifelong learning cycle for visual recognition. European Symposium on Artificial NN. 2016
WALI (III)
ASK (cont.)
● Query images are selected according to the best vs. second-best strategy
proposed in [Ajay2009] (see the sketch after this slide)
○ One-vs-all classifier for each class
○ The example with the smallest q(x) score is selected for labeling

q(x) = score_best_class − score_second_best_class

● A rejection category is added (not all frames can be associated with a semantic
category, and some categories may not be important)

q*(x) = (1 − p(rejection | x)) · q(x)

Learn
● Use incremental learning to retrain the classifiers with the new samples
Käding, C., Rodner, E., Freytag, A., & Denzler, J. Watch, Ask, Learn, and Improve: a lifelong learning cycle for visual recognition. European Symposium on Artificial NN. 2016
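A minimal sketch of the best vs. second-best selection with the rejection re-weighting described above; the array shapes, the number of queries and the toy data are illustrative assumptions:

```python
import numpy as np

def select_queries(scores, p_rejection, n_queries=5):
    """Best vs. second-best active learning selection (sketch of [Ajay2009]
    as used in WALI).

    scores      : array (n_samples, n_classes) of one-vs-all classifier scores
    p_rejection : array (n_samples,) with the estimated probability that a
                  sample belongs to the rejection category
    Returns the indices of the samples to send to the human annotator.
    """
    # q(x) = score of the best class - score of the second-best class
    sorted_scores = np.sort(scores, axis=1)
    q = sorted_scores[:, -1] - sorted_scores[:, -2]

    # q*(x) = (1 - p(rejection | x)) * q(x), as on the slide
    q_star = (1.0 - p_rejection) * q

    # The examples with the smallest q*(x) are selected for labeling
    return np.argsort(q_star)[:n_queries]

# Toy usage with random scores (purely illustrative values)
rng = np.random.default_rng(0)
scores = rng.normal(size=(100, 10))
p_rej = rng.uniform(0, 0.5, size=100)
print(select_queries(scores, p_rej))
```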
References
[Ajay2009] Joshi, A. J., Porikli, F., & Papanikolopoulos, N. Multi-class active learning for image classification. In 2009 CVPR
[Chen2016a] Z. Chen, Google, B. Liu, “Lifelong Machine Learning for Natural Language Processing”, EMNLP-2016 Tutorial, 2016
[Chen2016b] Z. Chen and B. Liu, “Lifelong Machine Learning”, Morgan & Claypool Publishers, November 2016.
[Gepperth2016] A. Gepperth, B. Hammer, “Incremental learning algorithms and applications”, ESANN 2016
[ChenT2016] Chen, T., Goodfellow, I., & Shlens, J. (2016). Net2Net: Accelerating Learning via Knowledge Transfer. In ICLR 2016
[Hinton2015] Hinton, G., Vinyals, O., & Dean, J. “Distilling the Knowledge in a Neural Network”. NIPS 2014 DL Workshop, 1–9.
[Kading2016] Käding, C., Rodner, E., Freytag, A., & Denzler, J. Watch, Ask, Learn, and Improve: a lifelong learning cycle for visual
recognition. European Symposium on Artificial Neural Networks. 2016
[Kirkpatrick2017] Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., … Hadsell, R. Overcoming catastrophic
forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13), 2017.
[Li2016] Li, Z., & Hoiem, D. “Learning without forgetting”. In vol. 9908 LNCS, 2016.
[Rebuffi2016] Rebuffi, S.-A., Kolesnikov, A., & Lampert, C. H. “iCaRL: Incremental Classifier and Representation Learning”. 2016
arXiv:1611.07725
[Rosenfeld2017] Rosenfeld, A., & Tsotsos, J. K. (2017). Incremental Learning Through Deep Adaptation. arXiv.
http://arxiv.org/abs/1705.04228
[Rusu2016] Rusu, A. A., Rabinowitz, N. C., Desjardins, G., Soyer, H., Kirkpatrick, J., … Hadsell, R. “Progressive Neural Networks”. 2016
CoRR. arXiv:1606.04671.
[Silver2013] D. L. Silver, et al., “Lifelong machine learning systems: Beyond learning algorithms”, 2013 AAAI Spring Symposium
Questions?