How to make your model happy again
Alessia Marcolini
Junior Research Assistant @ FBK
@viperale
amarcolini@fbk.eu
Neural Networks
Modern Neural Networks
[Diagram: neurons organised into layers]
The quadratic error surface for a linear neuron
Imagine you have a linear neuron with only two inputs (two weights, w1 and w2): y = ⨍(x1 ⋅ w1 + x2 ⋅ w2)
You want to minimise the loss function.
Real life can be more complicated.
Gradient Descent
https://rasbt.github.io/mlxtend/user_guide/general_concepts/gradient-optimization/
Weight update rule: w ← w − η ⋅ ∇L(w)
In practice we usually implement Mini-batch Gradient Descent, which performs an update for every mini-batch of n training examples.
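As a rough sketch of what this looks like in code, here is mini-batch gradient descent for a linear neuron with a quadratic loss (the toy data and hyperparameter values are made up for the example):

    import numpy as np

    def minibatch_gd(X, y, lr=0.01, batch_size=32, epochs=10):
        """Mini-batch gradient descent for a linear neuron with quadratic loss."""
        n_samples, n_features = X.shape
        w = np.zeros(n_features)
        for epoch in range(epochs):
            idx = np.random.permutation(n_samples)
            for start in range(0, n_samples, batch_size):
                batch = idx[start:start + batch_size]
                Xb, yb = X[batch], y[batch]
                # gradient of 0.5 * mean((Xb @ w - yb)^2) with respect to w
                grad = Xb.T @ (Xb @ w - yb) / len(batch)
                # weight update rule: w <- w - lr * grad
                w -= lr * grad
        return w

    # toy usage: y = 2*x1 - 3*x2 plus noise
    X = np.random.randn(1000, 2)
    y = X @ np.array([2.0, -3.0]) + 0.1 * np.random.randn(1000)
    print(minibatch_gd(X, y))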
Saddle Points
You want to escape from points where the derivatives of the function become zero but the point is not a local extremum on all axes.
https://en.wikipedia.org/wiki/Saddle_point
So difficult? Could be!
Architecture
https://www.quora.com/Artificial-Neural-Networks-How-can-I-estimate-the-number-of-neurons-and-layers
http://www.asimovinstitute.org/neural-network-zoo/
Once you have chosen the type of architecture, what then?
How to initialise hyperparameters
“The biggest mistake in hyperparameter optimisation is not performing hyperparameter optimisation at all.”
Hyperparameters are your model’s magic numbers.
medium.com/@alexandraj777/
Hyperparameters? uh?
Number of hidden units
Learning Rate (LR)
Momentum
Convolution Kernel width
Zero Padding
Weight Decay coefficient
Batch-size
Dropout rate
Number of training iterations
Neuron non-linearity
Weights initialisation
Random seeds
Preprocessing
…
When you set good
hyperparameters on the first try
Manually: YOU are the optimisation method, and you are most likely an inefficient optimisation strategy.
Automatically:
Grid Search: suffers from the curse of dimensionality ‼
Random Search: theoretically more effective
Bergstra and Bengio. 2012. Random Search for Hyper-Parameter Optimization. JMLR, Vol 13, 281-305. http://www.jmlr.org/papers/v13/bergstra12a.html
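A minimal sketch of random search over two of the hyperparameters above, assuming a hypothetical train_and_evaluate(lr, batch_size) function that returns a validation score; sampling the LR log-uniformly is a common choice:

    import math
    import random

    def random_search(train_and_evaluate, n_trials=20):
        """Randomly sample hyperparameters and keep the best configuration."""
        best_score, best_config = -math.inf, None
        for _ in range(n_trials):
            config = {
                # sample the learning rate log-uniformly in [1e-6, 1e-1]
                "lr": 10 ** random.uniform(-6, -1),
                "batch_size": random.choice([16, 32, 64, 128, 256]),
            }
            score = train_and_evaluate(**config)
            if score > best_score:
                best_score, best_config = score, config
        return best_config, best_score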
Learning Rate
The King
Typical values for a neural network (with standardised inputs) are less than 1 and greater than 10⁻⁶.
Keep it constant, or should it be variable? It is often useful to decrease the LR as training progresses.
ADAPTIVE LEARNING RATE METHODS or LEARNING RATE SCHEDULERS?
Adaptive Gradient Descent Algorithms
https://keras.io/optimizers/
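For example, one of the adaptive optimizers listed on that page can be plugged in when compiling the model (the architecture and the 1e-3 starting value below are placeholders):

    from keras.models import Sequential
    from keras.layers import Dense
    from keras.optimizers import Adam

    model = Sequential([
        Dense(64, activation='relu', input_shape=(20,)),
        Dense(1, activation='sigmoid'),
    ])

    # Adam adapts a per-parameter learning rate from estimates of the first
    # and second moments of the gradients; 1e-3 is just a starting point.
    model.compile(optimizer=Adam(lr=1e-3),
                  loss='binary_crossentropy',
                  metrics=['accuracy'])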
Time/Drop-Based Learning Rate Schedule
One could simply decrease the LR when the validation loss stops improving, or:
Drop-based: decrease the learning rate using punctuated large drops at specific epochs.
Time-based: decrease the learning rate gradually based on the epoch.
[Plots: LR vs. epoch for each schedule]
bit.ly/lr-schedules
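Both flavours, plus the validation-loss-based variant, can be sketched with standard Keras callbacks (the decay factors and epoch counts below are arbitrary):

    from keras.callbacks import LearningRateScheduler, ReduceLROnPlateau

    # Drop-based: punctuated large drops at specific epochs.
    def step_decay(epoch):
        initial_lr, drop, epochs_per_drop = 0.1, 0.5, 10
        return initial_lr * (drop ** (epoch // epochs_per_drop))

    drop_schedule = LearningRateScheduler(step_decay)

    # Time-based: decrease gradually with the epoch.
    def time_decay(epoch):
        initial_lr, decay = 0.1, 0.01
        return initial_lr / (1.0 + decay * epoch)

    time_schedule = LearningRateScheduler(time_decay)

    # Or simply reduce the LR when the validation loss stops improving.
    plateau = ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=5)

    # model.fit(X, y, validation_split=0.2, callbacks=[drop_schedule])  # pick one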
Cyclical Learning Rates
Eliminates the need to tune the LR, yet achieves near-optimal classification accuracy, with no additional computation.
The LR varies in a triangular fashion between a minimum bound (base_lr) and a maximum bound (max_lr); the stepsize is half of a cycle.
Leslie N. Smith. 2017. Cyclical Learning Rates for Training Neural Networks. arXiv:1506.01186v6
https://github.com/bckenstler/keras-contrib/blob/master/keras_contrib/callbacks/cyclical_learning_rate.py
PR not approved yet
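While waiting for that PR, the triangular policy itself is simple enough to sketch by hand; base_lr, max_lr and step_size below follow the figure above, but their values are illustrative:

    import math

    def triangular_clr(iteration, base_lr=1e-4, max_lr=6e-3, step_size=2000):
        """Triangular cyclical learning rate (Smith, 2017).

        The LR bounces linearly between base_lr and max_lr;
        step_size is the number of iterations in half a cycle.
        """
        cycle = math.floor(1 + iteration / (2 * step_size))
        x = abs(iteration / step_size - 2 * cycle + 1)
        return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)

    # iteration 0 -> base_lr, iteration step_size -> max_lr, iteration 2*step_size -> base_lr, ...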
Batch Size
Typically chosen between 1 and a few hundred.
GOAL: achieve the highest performance while minimising the computation time needed.
The higher the batch size, the faster the computation, but batch size and LR are strongly related to each other.
Don’t Decay the Learning Rate, Increase the Batch Size
Smith et al. 2018. Don't Decay the Learning Rate, Increase the Batch Size. arXiv:1711.00489v2
You can obtain the same performance by increasing the batch size instead of decaying the learning rate:
fewer parameter updates
greater parallelism
faster training
Key concept: increase the batch size at constant learning rate until B ~ N/10 (B: batch size, N: training set size); to reduce the number of parameter updates further, increase the LR and scale the batch size proportionally (B ∝ LR).
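A hedged sketch of the recipe: wherever a step-decay schedule would shrink the LR by a factor, grow the batch size by that factor instead, at constant LR, capping at B ~ N/10 (the milestones and factors below are arbitrary):

    def batch_size_schedule(epoch, n_train, base_batch=128, factor=5,
                            milestones=(30, 60, 90)):
        """Instead of decaying the LR at each milestone, grow the batch size."""
        n_increases = sum(epoch >= m for m in milestones)
        batch = base_batch * (factor ** n_increases)
        return min(batch, n_train // 10)  # cap at B ~ N/10

    # e.g. with N = 50000: epochs 0-29 -> 128, 30-59 -> 640,
    # 60-89 -> 3200, 90+ -> capped at 5000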
Training Accuracy … yes, but …
Overfitting
[Plot: accuracy over epochs for the training set and the validation set]
Avoid fitting the noise in the data.
https://en.wikipedia.org/wiki/Overfitting
Early Stopping
The most commonly used form of regularisation.
Why? When?
Threshold criterion
Stop as soon as the generalisation loss exceeds a certain threshold.
Generalisation loss: how much the current validation loss is higher than the minimum validation loss so far.
Prechelt L. (2012) Early Stopping — But When? In: Montavon G., Orr G.B., Müller K.-R. (eds) Neural Networks: Tricks of the Trade. https://link.springer.com/chapter/10.1007/978-3-642-35289-8_5
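A minimal sketch of this criterion, following Prechelt's GL(t) definition (the threshold alpha is a choice left to you):

    def generalisation_loss(val_losses):
        """GL(t) = 100 * (current validation loss / minimum validation loss so far - 1)."""
        return 100.0 * (val_losses[-1] / min(val_losses) - 1.0)

    def should_stop(val_losses, alpha=5.0):
        """Threshold criterion: stop as soon as GL(t) exceeds alpha."""
        return generalisation_loss(val_losses) > alpha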
Quotient criterion
Stop when the quotient of generalisation loss and training progress exceeds a certain threshold.
Prechelt L. (2012) Early Stopping — But When? In: Montavon G., Orr G.B., Müller K.-R. (eds) Neural Networks: Tricks of the Trade. https://link.springer.com/chapter/10.1007/978-3-642-35289-8_5
Successive strips criterion
Stop when the generalisation error has increased in s successive strips.
Prechelt L. (2012) Early Stopping — But When? In: Montavon G., Orr G.B., Müller K.-R. (eds) Neural Networks: Tricks of the Trade. https://link.springer.com/chapter/10.1007/978-3-642-35289-8_5
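Keras does not implement Prechelt's criteria out of the box; its built-in EarlyStopping callback uses a simpler patience-based rule (stop once the monitored metric has not improved for patience epochs), which plays a similar role in practice:

    from keras.callbacks import EarlyStopping

    early_stopping = EarlyStopping(monitor='val_loss',  # watch the validation loss
                                   min_delta=0.0,       # any improvement counts
                                   patience=10)         # stop after 10 epochs without improvement

    # model.fit(X_train, y_train, validation_data=(X_val, y_val),
    #           epochs=200, callbacks=[early_stopping])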
Average many different models
Core idea: train multiple independent models and at test time average their predictions.
https://www.oreilly.com/ideas/ideas-on-interpreting-machine-learning
Approaches
1. Same model, different initialisations → variety only due to initialisation
2. Running average of parameters during training → cheap way to get a slight boost of performance
3. Top models discovered during cross-validation → danger of suboptimal models, but easier
4. Different checkpoints of a single model → some lack of variety, but very cheap
https://keras.io/layers/merge/
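With the merge layers linked above, averaging the predictions of several independently trained models is a one-liner; a minimal sketch (model_1, model_2, model_3 and the input shape are placeholders):

    from keras.layers import Input, average
    from keras.models import Model

    def build_ensemble(models, input_shape):
        """Average the predictions of several trained Keras models."""
        inputs = Input(shape=input_shape)
        outputs = [m(inputs) for m in models]  # run every model on the same input
        avg = average(outputs)                 # element-wise mean of the predictions
        return Model(inputs=inputs, outputs=avg)

    # ensemble = build_ensemble([model_1, model_2, model_3], input_shape=(20,))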
4. Different checkpoints of a single model
Huang et al. ICLR 2017. Snapshot Ensembles: Train 1, get M for free. arXiv:1704.00109v1
… without incurring any additional training cost, using a Cyclic Cosine Annealing schedule for the LR.
Cyclic Cosine Annealing
Restart the LR at each snapshot.
How to do it in Keras
https://github.com/keras-team/keras-contrib/blob/master/keras_contrib/callbacks/snapshot.py
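The schedule itself is easy to write down; a per-epoch sketch of the cyclic cosine annealing used in the paper, where the LR restarts at the beginning of each of the M cycles and a snapshot of the weights is taken at the end of each one (initial_lr, total_epochs and n_cycles are placeholders):

    import math
    from keras.callbacks import LearningRateScheduler

    def cyclic_cosine_annealing(epoch, initial_lr=0.1, total_epochs=300, n_cycles=6):
        """The LR follows half a cosine within each cycle, then restarts."""
        epochs_per_cycle = math.ceil(total_epochs / n_cycles)
        position = epoch % epochs_per_cycle  # where we are inside the current cycle
        return initial_lr / 2 * (math.cos(math.pi * position / epochs_per_cycle) + 1)

    # wrap so the schedule only receives the epoch index
    schedule = LearningRateScheduler(lambda epoch: cyclic_cosine_annealing(epoch))
    # save the weights at the end of every cycle, then ensemble the snapshots at test time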
Introducing Dropout
[Toy example: neuron 1 gives the correct prediction on 80% of cases, neuron 2 gives random predictions; the best thing neuron 3 can do is compute 1⋅<output of neuron 1> + 0⋅<output of neuron 2>]
https://twitter.com/Smerity/status/980175898119778304
arXiv:1804.404
Dropout
Srivastava, Hinton et al. 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. http://jmlr.org/papers/v15/srivastava14a.html
A form of ensembling many different models, but with shared weights.
Randomly drop units with probability 1-p from the net during training (p: the probability that a neuron is kept).
At test time, use the single neural net without dropout, with the weights multiplied by p.
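In Keras the Dropout layer takes the drop probability, i.e. 1-p in the notation above; a minimal sketch (the architecture is a placeholder):

    from keras.models import Sequential
    from keras.layers import Dense, Dropout

    p_keep = 0.8  # probability that a neuron is kept

    model = Sequential([
        Dense(256, activation='relu', input_shape=(784,)),
        Dropout(rate=1 - p_keep),  # drop units with probability 1-p during training
        Dense(10, activation='softmax'),
    ])
    # At test time the Dropout layer is a no-op: Keras handles the rescaling
    # during training instead (see the next slide on inverted dropout).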
But Keras implements Inverted Dropout
(via the TensorFlow backend: https://github.com/tensorflow/tensorflow/blob/r1.7/tensorflow/python/ops/nn_ops.py#L2264)
Divide by p at training time and do not modify the weights at test time.
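A minimal NumPy sketch of the difference between the two formulations, with p the probability that a neuron is kept:

    import numpy as np

    def standard_dropout(activations, p, training):
        """Paper version: scale by p at test time."""
        if training:
            mask = np.random.rand(*activations.shape) < p
            return activations * mask
        return activations * p

    def inverted_dropout(activations, p, training):
        """Keras/TensorFlow version: divide by p at training time, do nothing at test time."""
        if training:
            mask = np.random.rand(*activations.shape) < p
            return activations * mask / p
        return activations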
Recall that when training a model, we
aspire to find the minima of a loss
function given a set of parameters (in a
neural network, these are the weights
and biases). We can interpret the loss as
the “unhappiness” of the network with
respect to its parameters. The higher the
loss, the higher the unhappiness: we
don’t want that. We want to make our
models happy.
https://ljvmiranda921.github.io/notebook/2017/08/13/softmax-and-the-negative-log-likelihood/
Thank You
Alessia Marcolini @ PyConIT Nove
@viperale
Any questions?
< Don’t ask why because nobody knows >