How to make your model happy again
Alessia Marcolini
Junior Research Assistant @ FBK
@viperale
amarcolini@fbk.eu
Neural Networks
Modern Neural Networks
[Diagram: neurons organised into layers]
The quadratic error surface for a linear neuron
Imagine you have a linear neuron with only two inputs (two weights, w1 and w2): y = ⨍(x1 ⋅ w1 + x2 ⋅ w2)
You want to minimise the loss function.
Real life can be more complicated.
Gradient Descent
https://rasbt.github.io/mlxtend/user_guide/general_concepts/gradient-optimization/
Weight update rule: w ← w − η ⋅ ∇L(w)
In practice we usually implement Mini-batch Gradient Descent, which performs an update for every mini-batch of n training examples.
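As a rough sketch of what this looks like in code, here is mini-batch gradient descent for a linear neuron with a quadratic loss (the toy data and hyperparameter values are made up for the example):

    import numpy as np

    def minibatch_gd(X, y, lr=0.01, batch_size=32, epochs=10):
        """Mini-batch gradient descent for a linear neuron with quadratic loss."""
        n_samples, n_features = X.shape
        w = np.zeros(n_features)
        for epoch in range(epochs):
            idx = np.random.permutation(n_samples)
            for start in range(0, n_samples, batch_size):
                batch = idx[start:start + batch_size]
                Xb, yb = X[batch], y[batch]
                # gradient of 0.5 * mean((Xb @ w - yb)^2) with respect to w
                grad = Xb.T @ (Xb @ w - yb) / len(batch)
                # weight update rule: w <- w - lr * grad
                w -= lr * grad
        return w

    # toy usage: y = 2*x1 - 3*x2 plus noise
    X = np.random.randn(1000, 2)
    y = X @ np.array([2.0, -3.0]) + 0.1 * np.random.randn(1000)
    print(minibatch_gd(X, y))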
Saddle Points
You want to escape from points where the derivatives of the function become zero but the point is not a local extremum on all axes.
https://en.wikipedia.org/wiki/Saddle_point
So difficult? Could be!
Architecture
https://www.quora.com/Artificial-Neural-Networks-How-can-I-estimate-the-number-of-neurons-and-layers
http://www.asimovinstitute.org/neural-network-zoo/
Once you have chosen the type of architecture, what then?
How to initialise hyperparameters
“The biggest mistake in hyperparameter optimisation is not performing hyperparameter optimisation at all.”
Hyperparameters are your model’s magic numbers.
medium.com/@alexandraj777/
Hyperparameters? uh?
Number of hidden units
Learning Rate (LR)
Momentum
Convolution Kernel width
Zero Padding
Weight Decay coefficient
Batch-size
Dropout rate
Number of training iterations
Neuron non-linearity
Weights initialisation
Random seeds
Preprocessing
…
When you set good
hyperparameters on the first try
Manually: YOU are the optimisation method, and you are most likely an inefficient optimisation strategy.
Automatically:
Grid Search: suffers from the curse of dimensionality ‼
Random Search: theoretically more effective
Bergstra and Bengio. 2012. Random Search for Hyper-Parameter Optimization. JMLR, Vol 13, 281-305. http://www.jmlr.org/papers/v13/bergstra12a.html
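A minimal sketch of random search over two of the hyperparameters above, assuming a hypothetical train_and_evaluate(lr, batch_size) function that returns a validation score; sampling the LR log-uniformly is a common choice:

    import math
    import random

    def random_search(train_and_evaluate, n_trials=20):
        """Randomly sample hyperparameters and keep the best configuration."""
        best_score, best_config = -math.inf, None
        for _ in range(n_trials):
            config = {
                # sample the learning rate log-uniformly in [1e-6, 1e-1]
                "lr": 10 ** random.uniform(-6, -1),
                "batch_size": random.choice([16, 32, 64, 128, 256]),
            }
            score = train_and_evaluate(**config)
            if score > best_score:
                best_score, best_config = score, config
        return best_config, best_score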
Learning Rate
The King
Typical values for a neural network (with standardised inputs) are less than 1 and greater than 10⁻⁶.
Keep it constant, or should it be variable? It is often useful to decrease the LR as training progresses.
ADAPTIVE LEARNING RATE METHODS or LEARNING RATE SCHEDULERS?
Adaptive Gradient Descent Algorithms
https://keras.io/optimizers/
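For example, one of the adaptive optimizers listed on that page can be plugged in when compiling the model (the architecture and the 1e-3 starting value below are placeholders):

    from keras.models import Sequential
    from keras.layers import Dense
    from keras.optimizers import Adam

    model = Sequential([
        Dense(64, activation='relu', input_shape=(20,)),
        Dense(1, activation='sigmoid'),
    ])

    # Adam adapts a per-parameter learning rate from estimates of the first
    # and second moments of the gradients; 1e-3 is just a starting point.
    model.compile(optimizer=Adam(lr=1e-3),
                  loss='binary_crossentropy',
                  metrics=['accuracy'])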
Time/Drop-Based Learning Rate Schedule
One could simply decrease the LR when the validation loss stops improving, or:
Drop-based: decrease the learning rate using punctuated large drops at specific epochs.
Time-based: decrease the learning rate gradually based on the epoch.
[Plots: LR vs. epoch for each schedule]
bit.ly/lr-schedules
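Both flavours, plus the validation-loss-based variant, can be sketched with standard Keras callbacks (the decay factors and epoch counts below are arbitrary):

    from keras.callbacks import LearningRateScheduler, ReduceLROnPlateau

    # Drop-based: punctuated large drops at specific epochs.
    def step_decay(epoch):
        initial_lr, drop, epochs_per_drop = 0.1, 0.5, 10
        return initial_lr * (drop ** (epoch // epochs_per_drop))

    drop_schedule = LearningRateScheduler(step_decay)

    # Time-based: decrease gradually with the epoch.
    def time_decay(epoch):
        initial_lr, decay = 0.1, 0.01
        return initial_lr / (1.0 + decay * epoch)

    time_schedule = LearningRateScheduler(time_decay)

    # Or simply reduce the LR when the validation loss stops improving.
    plateau = ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=5)

    # model.fit(X, y, validation_split=0.2, callbacks=[drop_schedule])  # pick one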
Cyclical Learning Rates
Eliminates the need to tune the LR, yet achieves near-optimal classification accuracy, with no additional computation.
The LR varies in a triangular fashion between a minimum bound (base_lr) and a maximum bound (max_lr); the stepsize is half of a cycle.
Leslie N. Smith. 2017. Cyclical Learning Rates for Training Neural Networks. arXiv:1506.01186v6
https://github.com/bckenstler/keras-contrib/blob/master/keras_contrib/callbacks/cyclical_learning_rate.py
PR not approved yet
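While waiting for that PR, the triangular policy itself is simple enough to sketch by hand; base_lr, max_lr and step_size below follow the figure above, but their values are illustrative:

    import math

    def triangular_clr(iteration, base_lr=1e-4, max_lr=6e-3, step_size=2000):
        """Triangular cyclical learning rate (Smith, 2017).

        The LR bounces linearly between base_lr and max_lr;
        step_size is the number of iterations in half a cycle.
        """
        cycle = math.floor(1 + iteration / (2 * step_size))
        x = abs(iteration / step_size - 2 * cycle + 1)
        return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)

    # iteration 0 -> base_lr, iteration step_size -> max_lr, iteration 2*step_size -> base_lr, ...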
Batch Size
Typically chosen between 1 and a few hundred.
GOAL: achieve the highest performance while minimising the computation time needed.
The higher the batch size, the faster the computation, but batch size and LR are strongly related to each other.
Don’t Decay the Learning Rate, Increase the Batch Size
Smith et al. 2018. Don't Decay the Learning Rate, Increase the Batch Size. arXiv:1711.00489v2
You can obtain the same performance by increasing the batch size instead of decaying the learning rate:
fewer parameter updates
greater parallelism
faster training
Key concept: increase the batch size at constant learning rate until B ~ N/10 (B: batch size, N: training set size); to reduce the number of parameter updates further, increase the LR and scale the batch size proportionally (B ∝ LR).
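A hedged sketch of the recipe: wherever a step-decay schedule would shrink the LR by a factor, grow the batch size by that factor instead, at constant LR, capping at B ~ N/10 (the milestones and factors below are arbitrary):

    def batch_size_schedule(epoch, n_train, base_batch=128, factor=5,
                            milestones=(30, 60, 90)):
        """Instead of decaying the LR at each milestone, grow the batch size."""
        n_increases = sum(epoch >= m for m in milestones)
        batch = base_batch * (factor ** n_increases)
        return min(batch, n_train // 10)  # cap at B ~ N/10

    # e.g. with N = 50000: epochs 0-29 -> 128, 30-59 -> 640,
    # 60-89 -> 3200, 90+ -> capped at 5000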
Training Accuracy … yes, but …
Overfitting
[Plot: accuracy over epochs for the training set and the validation set]
Avoid fitting the noise in the data.
https://en.wikipedia.org/wiki/Overfitting
Early Stopping
The most commonly used form of regularisation.
Why? When?
Threshold criterion
Stop as soon as the generalisation loss exceeds a certain threshold.
Generalisation loss: how much the current validation loss is higher than the minimum validation loss so far.
Prechelt L. (2012) Early Stopping — But When? In: Montavon G., Orr G.B., Müller K.-R. (eds) Neural Networks: Tricks of the Trade. https://link.springer.com/chapter/10.1007/978-3-642-35289-8_5
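A minimal sketch of this criterion, following Prechelt's GL(t) definition (the threshold alpha is a choice left to you):

    def generalisation_loss(val_losses):
        """GL(t) = 100 * (current validation loss / minimum validation loss so far - 1)."""
        return 100.0 * (val_losses[-1] / min(val_losses) - 1.0)

    def should_stop(val_losses, alpha=5.0):
        """Threshold criterion: stop as soon as GL(t) exceeds alpha."""
        return generalisation_loss(val_losses) > alpha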
Quotient criterion
Stop when the quotient of generalisation loss and training progress exceeds a certain threshold.
Prechelt L. (2012) Early Stopping — But When? In: Montavon G., Orr G.B., Müller K.-R. (eds) Neural Networks: Tricks of the Trade. https://link.springer.com/chapter/10.1007/978-3-642-35289-8_5
Successive strips criterion
Stop when the generalisation error has increased in s successive strips.
Prechelt L. (2012) Early Stopping — But When? In: Montavon G., Orr G.B., Müller K.-R. (eds) Neural Networks: Tricks of the Trade. https://link.springer.com/chapter/10.1007/978-3-642-35289-8_5
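Keras does not implement Prechelt's criteria out of the box; its built-in EarlyStopping callback uses a simpler patience-based rule (stop once the monitored metric has not improved for patience epochs), which plays a similar role in practice:

    from keras.callbacks import EarlyStopping

    early_stopping = EarlyStopping(monitor='val_loss',  # watch the validation loss
                                   min_delta=0.0,       # any improvement counts
                                   patience=10)         # stop after 10 epochs without improvement

    # model.fit(X_train, y_train, validation_data=(X_val, y_val),
    #           epochs=200, callbacks=[early_stopping])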
Average many different models
Core idea: train multiple independent models and at test time average their predictions.
https://www.oreilly.com/ideas/ideas-on-interpreting-machine-learning
Approaches
1. Same model, different initialisations → variety only due to initialisation
2. Running average of parameters during training → cheap way to get a slight boost of performance
3. Top models discovered during cross-validation → danger of suboptimal models, but easier
4. Different checkpoints of a single model → some lack of variety, but very cheap
https://keras.io/layers/merge/
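With the merge layers linked above, averaging the predictions of several independently trained models is a one-liner; a minimal sketch (model_1, model_2, model_3 and the input shape are placeholders):

    from keras.layers import Input, average
    from keras.models import Model

    def build_ensemble(models, input_shape):
        """Average the predictions of several trained Keras models."""
        inputs = Input(shape=input_shape)
        outputs = [m(inputs) for m in models]  # run every model on the same input
        avg = average(outputs)                 # element-wise mean of the predictions
        return Model(inputs=inputs, outputs=avg)

    # ensemble = build_ensemble([model_1, model_2, model_3], input_shape=(20,))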
4. Different checkpoints of a single model
Huang et al. ICLR 2017. Snapshot Ensembles: Train 1, get M for free. arXiv:1704.00109v1
… without incurring any additional training cost, using a Cyclic Cosine Annealing schedule for the LR.
Cyclic Cosine Annealing
Restart the LR at each snapshot.
How to do it in Keras
https://github.com/keras-team/keras-contrib/blob/master/keras_contrib/callbacks/snapshot.py
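The schedule itself is easy to write down; a per-epoch sketch of the cyclic cosine annealing used in the paper, where the LR restarts at the beginning of each of the M cycles and a snapshot of the weights is taken at the end of each one (initial_lr, total_epochs and n_cycles are placeholders):

    import math
    from keras.callbacks import LearningRateScheduler

    def cyclic_cosine_annealing(epoch, initial_lr=0.1, total_epochs=300, n_cycles=6):
        """The LR follows half a cosine within each cycle, then restarts."""
        epochs_per_cycle = math.ceil(total_epochs / n_cycles)
        position = epoch % epochs_per_cycle  # where we are inside the current cycle
        return initial_lr / 2 * (math.cos(math.pi * position / epochs_per_cycle) + 1)

    # wrap so the schedule only receives the epoch index
    schedule = LearningRateScheduler(lambda epoch: cyclic_cosine_annealing(epoch))
    # save the weights at the end of every cycle, then ensemble the snapshots at test time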
Introducing Dropout
[Toy example: neuron 1 gives the correct prediction on 80% of cases, neuron 2 gives random predictions; the best thing neuron 3 can do is compute 1⋅<output of neuron 1> + 0⋅<output of neuron 2>]
https://twitter.com/Smerity/status/980175898119778304
arXiv:1804.404
Dropout
Srivastava, Hinton et al. 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. http://jmlr.org/papers/v15/srivastava14a.html
A form of ensembling many different models, but with shared weights.
Randomly drop units with probability 1-p from the net during training (p: the probability that a neuron is kept).
At test time, use the single neural net without dropout, with the weights multiplied by p.
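In Keras the Dropout layer takes the drop probability, i.e. 1-p in the notation above; a minimal sketch (the architecture is a placeholder):

    from keras.models import Sequential
    from keras.layers import Dense, Dropout

    p_keep = 0.8  # probability that a neuron is kept

    model = Sequential([
        Dense(256, activation='relu', input_shape=(784,)),
        Dropout(rate=1 - p_keep),  # drop units with probability 1-p during training
        Dense(10, activation='softmax'),
    ])
    # At test time the Dropout layer is a no-op: Keras handles the rescaling
    # during training instead (see the next slide on inverted dropout).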
But Keras implements Inverted Dropout
(via the TensorFlow backend: https://github.com/tensorflow/tensorflow/blob/r1.7/tensorflow/python/ops/nn_ops.py#L2264)
Divide by p at training time and do not modify the weights at test time.
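A minimal NumPy sketch of the difference between the two formulations, with p the probability that a neuron is kept:

    import numpy as np

    def standard_dropout(activations, p, training):
        """Paper version: scale by p at test time."""
        if training:
            mask = np.random.rand(*activations.shape) < p
            return activations * mask
        return activations * p

    def inverted_dropout(activations, p, training):
        """Keras/TensorFlow version: divide by p at training time, do nothing at test time."""
        if training:
            mask = np.random.rand(*activations.shape) < p
            return activations * mask / p
        return activations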
Recall that when training a model, we
aspire to find the minima of a loss
function given a set of parameters (in a
neural network, these are the weights
and biases). We can interpret the loss as
the “unhappiness” of the network with
respect to its parameters. The higher the
loss, the higher the unhappiness: we
don’t want that. We want to make our
models happy.
https://ljvmiranda921.github.io/notebook/2017/08/13/softmax-and-the-negative-log-likelihood/
Thank You
Alessia Marcolini @ PyConIT Nove
@viperale
Any questions?
< Don’t ask why because nobody knows >