Tuning Learning Rate
Taka Wang
20210727
Hyperparameters vs Model Parameters
● Learning rate
● Momentum, or the hyperparameters of the Adam optimization algorithm
● Number of layers
● Number of hidden units
● Mini-batch size
● Activation function
● Number of epochs
● ...
2
1. How fast the algorithm learns
2. Whether the cost function is minimized or not
Effect of Learning Rate
3
Source: Understanding Learning Rate in Machine Learning
4
Source: Setting the learning rate of your neural network.
5
Source: Understanding Fastai's fit_one_cycle method
Adjust Learning Rate During Training
● Adaptive Learning Rate Methods (AdaGrad, Adam, etc.)
● Learning Rate Annealing
● Cyclical Learning Rate
● LR Finder
6
Source: How do we decide the optimizer used for training?
Learning Rate Schedule
8
9
Why use learning rate schedule?
● Too small a learning rate and your neural network may not learn at all
● Too large a learning rate and you may overshoot areas of low loss (or even
overfit from the start of training)
➔ Find a set of reasonably “good” weights early in the training process with a larger learning rate.
➔ Tune these weights later in the process toward more optimal values using a smaller learning rate.
10
Learning Rate Schedule
● Time-based decay
● Linear decay
● Step decay (Piecewise Constant Decay)
● Polynomial decay
● Exponential decay

Two Methods:
● Built-in Schedules
● Custom Callbacks (every batch)
Keras Example
import tensorflow as tf
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(60000, 784).astype('float32')/255
y_train = tf.keras.utils.to_categorical(y_train, num_classes=10)
model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(10, activation='sigmoid', input_shape=(784,)))
model.add(tf.keras.layers.Dense(10, activation='softmax'))
model.compile(loss="categorical_crossentropy", optimizer="sgd", metrics = ['accuracy'])
model.fit(x_train, y_train, epochs=10, verbose=0, callbacks=[])
11
12
Time-based decay (InverseTimeDecay)
13
lr_fn = tf.keras.optimizers.schedules.InverseTimeDecay(
    initial_learning_rate=0.1,
    decay_steps=1.0,
    decay_rate=0.5
)
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=lr_fn),
    loss='categorical_crossentropy',
    metrics=['accuracy']
)
model.fit(x_train, y_train, epochs=5)
Source: Learning Rate Schedules in Deep Learning
Step Decay
14
import numpy as np
from tensorflow.keras.callbacks import LearningRateScheduler

class StepDecay:
    def __init__(self, initAlpha=0.01, factor=0.25, dropEvery=10):
        self.initAlpha = initAlpha
        self.factor = factor
        self.dropEvery = dropEvery

    def __call__(self, epoch):
        # compute the learning rate for the current epoch
        exp = np.floor((1 + epoch) / self.dropEvery)
        alpha = self.initAlpha * (self.factor ** exp)
        return float(alpha)  # learning rate

schedule = StepDecay(initAlpha=0.01, factor=0.25, dropEvery=10)
cb = [LearningRateScheduler(schedule)]
model.fit(x_train, y_train, epochs=10, callbacks=cb)
Linear Decay & Polynomial Decay
15
Learning rate is decayed to zero over a fixed number of epochs.
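As a minimal sketch (not from the original deck): the built-in tf.keras.optimizers.schedules.PolynomialDecay covers both cases, since power=1.0 gives linear decay to end_learning_rate over decay_steps, while a higher power gives polynomial decay. The values below are illustrative.

lr_fn = tf.keras.optimizers.schedules.PolynomialDecay(
    initial_learning_rate=0.1,
    decay_steps=10000,      # number of steps over which to decay
    end_learning_rate=0.0,  # decay all the way to zero
    power=1.0               # 1.0 -> linear decay; e.g. 2.0 -> quadratic
)
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=lr_fn),
    loss='categorical_crossentropy',
    metrics=['accuracy']
)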
Cyclical Learning Rate
16
17
● Let the LR vary cyclically between boundary values
● Estimate reasonable bounds
Claims & Proposal
● We don’t know what the optimal initial learning rate is.
● Monotonically decreasing our learning rate may lead to our network getting
“stuck” in plateaus of the loss landscape.
18
● Define a minimum learning rate
● Define a maximum learning rate
● Allow the learning rate to oscillate cyclically between the two bounds
Source: Escaping from Saddle Points
Keywords: saddle point, convex function, critical point, update rule

Loss Landscape
20
The loss landscape depends on the model architecture & dataset.
Source: VISUALIZING THE LOSS LANDSCAPE OF NEURAL NETS
21
CLR - Policies
● batch size: number of training examples used in one weight update
● batch or iteration: one weight update; iterations per epoch = total training examples / batch size
● cycle: the number of iterations for the LR to go lower → upper → lower
● step size: the number of iterations in a half cycle (a formula sketch follows the link below)
https://github.com/bckenstler/CLR
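For reference, the triangular policy can be written as a plain function of the global iteration count. This is a sketch following the formula in the CLR paper and the repo above, not the callback implementation itself; the default values are illustrative.

import numpy as np

def triangular_lr(iteration, base_lr=1e-7, max_lr=1e-2, step_size=2000):
    # which cycle we are in (each cycle spans 2 * step_size iterations)
    cycle = np.floor(1 + iteration / (2 * step_size))
    # position within the cycle, reaching 0 at the peak and 1 at the bounds
    x = np.abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)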
Implementations
22
opt = SGD(lr=config.MIN_LR, momentum=0.9)
model = MiniGoogLeNet.build(width=32, height=32, depth=3, classes=10)
model.compile(loss="categorical_crossentropy", optimizer=opt, metrics=["accuracy"])
# initialize the cyclical learning rate callback
clr = CyclicLR(
mode="triangular",
base_lr=config.MIN_LR,
max_lr=config.MAX_LR,
step_size= config.STEP_SIZE * (trainX.shape[0] // config.BATCH_SIZE),
)
model.fit(
...,
steps_per_epoch=trainX.shape[0] // config.BATCH_SIZE,
epochs=config.NUM_EPOCHS,
callbacks=[clr])
# example config values
MIN_LR = 1e-7
MAX_LR = 1e-2
BATCH_SIZE = 64
STEP_SIZE = 8  # half-cycle length in epochs; typically 4 or 8
CLR_METHOD = "triangular"
NUM_EPOCHS = 96
https://github.com/bckenstler/CLR
TensorFlow Addons Optimizers
23
!pip install -q -U tensorflow_addons
import tensorflow as tf
import tensorflow_addons as tfa
...
steps_per_epoch = len(x_train) // BATCH_SIZE
clr = tfa.optimizers.CyclicalLearningRate(
initial_learning_rate=INIT_LR,
maximal_learning_rate=MAX_LR,
scale_fn=lambda x: 1/(2.**(x-1)),
step_size=2 * steps_per_epoch
)
optimizer = tf.keras.optimizers.SGD(clr)
clr_model = tf.keras.models.load_model("initial_model")
clr_history = train_model(clr_model, optimizer=optimizer)
#no_clr_history = train_model(standard_model, optimizer="sgd")
BATCH_SIZE = 64
EPOCHS = 10
INIT_LR = 1e-4
MAX_LR = 1e-2
Experiment Results - Triangular
24
Experiment Results - Triangular2
25
LR Finder (Range Test)
26
Automatic learning rate finder algorithm
28
The learning rate is increased after every mini-batch while the loss is recorded, typically over 3~5 epochs.
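A minimal sketch of such a range-test callback (illustrative only, not the PyImageSearch implementation): the LR grows geometrically after every mini-batch while the loss is logged.

import tensorflow as tf

class LRFinder(tf.keras.callbacks.Callback):
    def __init__(self, start_lr=1e-7, end_lr=1.0, num_steps=1000):
        super().__init__()
        self.start_lr = start_lr
        # growth factor so that start_lr reaches end_lr after num_steps batches
        self.factor = (end_lr / start_lr) ** (1.0 / num_steps)
        self.lrs, self.losses = [], []

    def on_train_begin(self, logs=None):
        tf.keras.backend.set_value(self.model.optimizer.lr, self.start_lr)

    def on_train_batch_end(self, batch, logs=None):
        lr = float(tf.keras.backend.get_value(self.model.optimizer.lr))
        self.lrs.append(lr)
        self.losses.append(logs["loss"])
        tf.keras.backend.set_value(self.model.optimizer.lr, lr * self.factor)

finder = LRFinder()
model.fit(x_train, y_train, epochs=3, callbacks=[finder])  # 3~5 epochs
# then plot finder.losses against finder.lrs on a log-scaled x axis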
29
● Recommended minimum: the learning rate at which the loss decreases the fastest (steepest negative gradient)
● Recommended maximum: 10 times less (one order of magnitude lower) than the learning rate at which the loss is lowest (if the loss bottoms out at 0.1, a good value to start with is 0.01); see the sketch after the source line below
Source: The Learning Rate Finder Technique: How Reliable Is It?
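A hedged sketch of turning the recorded (lr, loss) pairs from the range test into these two bounds; `finder` is the hypothetical callback instance from the previous sketch.

import numpy as np

lrs = np.array(finder.lrs)
losses = np.array(finder.losses)

# minimum: the LR where the loss decreases the fastest (steepest negative slope)
min_lr = lrs[np.argmin(np.gradient(losses))]

# maximum: one order of magnitude below the LR where the loss is lowest
max_lr = lrs[np.argmin(losses)] / 10.0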
30
Reminder
● Use the same initial weights for the LR finder and the subsequent model training.
● Keep a copy of the model weights and restore them afterwards, so they are “as they were” before the learning rate finder ran.
● Never assume that the learning rates found are the best for every model initialization ❌
● Setting a narrower range than what is recommended is safer and reduces the risk of divergence due to very high learning rates.
31
● min: where the loss decreases the fastest
● max: narrower than one order of magnitude below the loss minimum
● Larger batch size → higher learning rate
Source: The Learning Rate Finder Technique: How Reliable Is It?
Summary
● Learning Rate Annealing
● Cyclical Learning Rate
● LR Finder
32
One Cycle Policy
33
34
● Learning rate
● Batch size
● Momentum
● Weight decay
Learning Rate
● The LR ramps from a lower bound up to a maximum and back down (e.g. 0.08 ~ 0.8); fastai modifies the ramp to cosine annealing (cosine descent).
● The maximum should be the value picked with a learning rate finder procedure (a curve sketch follows the source line below).
Source: Finding Good Learning Rate and The One Cycle Policy.
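A numpy sketch of the resulting fastai-style LR curve (cosine ascent for the first pct_start of training, then cosine descent), mirroring the combined_cos schedule shown in the fastai source later in this deck; parameter names follow fit_one_cycle and the default values are illustrative.

import numpy as np

def one_cycle_lr(pct, lr_max=0.8, div=10.0, div_final=1e5, pct_start=0.3):
    # pct is the training progress in [0, 1]
    def cos_anneal(start, end, p):
        return start + (end - start) * (1 - np.cos(np.pi * p)) / 2
    if pct < pct_start:
        # cosine ascent from lr_max/div up to lr_max (e.g. 0.08 -> 0.8)
        return cos_anneal(lr_max / div, lr_max, pct / pct_start)
    # cosine descent from lr_max down to lr_max/div_final
    return cos_anneal(lr_max, lr_max / div_final, (pct - pct_start) / (1 - pct_start))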
Cyclic Momentum
36
Momentum is cycled inversely to the learning rate; fastai modification: cosine ascent.
Source: Finding Good Learning Rate and The One Cycle Policy.
Weight Decay Matters
Comparison of training runs with weight decay 1e-3 vs 1e-5.
Example of super-convergence
38
Source: Understanding Fastai's fit_one_cycle method
@log_args(but_as=Learner.fit)
@delegates(Learner.fit_one_cycle)
def fine_tune(self:Learner, epochs, base_lr=2e-3, freeze_epochs=1, lr_mult=100, pct_start=0.3, div=5.0, **kwargs):
    "Fine tune with `freeze` for `freeze_epochs` then with `unfreeze` from `epochs` using discriminative LR"
    self.freeze()
    self.fit_one_cycle(freeze_epochs, slice(base_lr), pct_start=0.99, **kwargs)
    base_lr /= 2
    self.unfreeze()
    self.fit_one_cycle(epochs, slice(base_lr/lr_mult, base_lr), pct_start=pct_start, div=div, **kwargs)

@log_args(but_as=Learner.fit)
def fit_one_cycle(self:Learner, n_epoch, lr_max=None, div=25., div_final=1e5, pct_start=0.25, wd=None,
                  moms=None, cbs=None, reset_opt=False):
    "Fit `self.model` for `n_epoch` using the 1cycle policy."
    if self.opt is None: self.create_opt()
    self.opt.set_hyper('lr', self.lr if lr_max is None else lr_max)
    lr_max = np.array([h['lr'] for h in self.opt.hypers])
    scheds = {'lr': combined_cos(pct_start, lr_max/div, lr_max, lr_max/div_final),
              'mom': combined_cos(pct_start, *(self.moms if moms is None else moms))}
    self.fit(n_epoch, cbs=ParamScheduler(scheds)+L(cbs), reset_opt=reset_opt, wd=wd)
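For completeness, a hedged usage sketch with fastai v2 (assuming a DataLoaders object named dls has already been built; the calls are the public fastai API, the arguments are illustrative):

from fastai.vision.all import *

learn = cnn_learner(dls, resnet34, metrics=accuracy)
learn.lr_find()                    # range test: pick a maximum LR from the plot
learn.fine_tune(5, base_lr=2e-3)   # freeze -> one cycle, unfreeze -> one cycle
# or call the policy directly:
# learn.fit_one_cycle(5, lr_max=2e-3)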
References
● 1. Keras learning rate schedules and decay (PyImageSearch)
● 2. Cyclical Learning Rates with Keras and Deep Learning (PyImageSearch)
● 3. Keras Learning Rate Finder (PyImageSearch)
● Learning Rate Schedule in Practice: an example with Keras and TensorFlow 2.0 👍
● Understanding Learning Rate in Machine Learning
● Learning Rate Schedules in Deep Learning
● Setting the learning rate of your neural network
● Exploring Super-Convergence 👍
● The Learning Rate Finder Technique: How Reliable Is It?
40
References - One Cycle
● One-cycle learning rate schedulers (Kaggle)
● Finding Good Learning Rate and The One Cycle Policy. 👍
● The 1cycle policy (fastbook author)
● Understanding Fastai's fit_one_cycle method 👍
41
Colab
● Keras learning rate schedules and decay (PyImageSearch)
● Cyclical Learning Rates with Keras and Deep Learning (PyImageSearch)
● Keras Learning Rate Finder (PyImageSearch) 💎
● TensorFlow Addons Optimizers: CyclicalLearningRate 👍
42
Further Reading
● Cyclical Learning Rates for Training Neural Networks (Leslie, 2015)
● Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates (Leslie et al., 2017)
● A disciplined approach to neural network hyper-parameters: Part 1 -- learning rate,
batch size, momentum, and weight decay (Leslie, 2018)
● SGDR: Stochastic Gradient Descent with Warm Restarts (2016)
● Snapshot Ensembles: Train 1, get M for free (2017)
● A brief history of learning rate schedulers and adaptive optimizers 💎
● Faster Deep Learning Training with PyTorch – a 2021 Guide 💎
43
