Tuning Learning Rate
Taka Wang
20210727
Hyperparameters vs Model Parameters
● Learning rate
● Momentum, or the hyperparameters of the Adam optimization algorithm
● Number of layers
● Number of hidden units
● Mini-batch size
● Activation function
● Number of epochs
● ...
2
1. How fast the algorithm learns
2. Whether the cost function is minimized or not
Effect of Learning Rate
3
Source: Understanding Learning Rate in Machine Learning
4
Source: Setting the learning rate of your neural network.
5
Source: Understanding Fastai's fit_one_cycle method
Adjust Learning Rate During Training
● Adaptive Learning Rate Methods (AdaGrad, Adam, etc.)
● Learning Rate Annealing
● Cyclical Learning Rate
● LR Finder
6
Source: How do we decide the optimizer used for training?
Learning Rate Schedule
8
9
Why use learning rate schedule?
● Too small a learning rate and your neural network may not learn at all
● Too large a learning rate and you may overshoot areas of low loss (or even
overfit from the start of training)
➔ Find a set of reasonably “good” weights early in the training process with a larger learning rate.
➔ Tune these weights later in the process toward more optimal values using a smaller learning rate.
10
Learning Rate Schedule
● Time-based decay
● Linear decay
● Step decay (Piecewise Constant Decay)
● Polynomial decay
● Exponential decay

Two Methods:
● Built-in Schedules
● Custom Callbacks (every batch)
Keras Example
import tensorflow as tf
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(60000, 784).astype('float32')/255
y_train = tf.keras.utils.to_categorical(y_train, num_classes=10)
model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(10, activation='sigmoid', input_shape=(784,)))
model.add(tf.keras.layers.Dense(10, activation='softmax'))
model.compile(loss="categorical_crossentropy", optimizer="sgd", metrics = ['accuracy'])
model.fit(x_train, y_train, epochs=10, verbose=0, callbacks=[])
11
12
Time-based decay (InverseTimeDecay)
13
lr_fn = tf.keras.optimizers.schedules.InverseTimeDecay(
    initial_learning_rate=0.1,
    decay_steps=1.0,
    decay_rate=0.5
)
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=lr_fn),
    loss='categorical_crossentropy',
    metrics=['accuracy']
)
model.fit(x_train, y_train, epochs=5)
Source: Learning Rate Schedules in Deep Learning
Step Decay
14
import numpy as np
from tensorflow.keras.callbacks import LearningRateScheduler

class StepDecay:
    def __init__(self, initAlpha=0.01, factor=0.25, dropEvery=10):
        self.initAlpha = initAlpha
        self.factor = factor
        self.dropEvery = dropEvery

    def __call__(self, epoch):
        # compute the learning rate for the current epoch
        exp = np.floor((1 + epoch) / self.dropEvery)
        alpha = self.initAlpha * (self.factor ** exp)
        return float(alpha)  # learning rate

schedule = StepDecay(initAlpha=0.01, factor=0.25, dropEvery=10)
cb = [LearningRateScheduler(schedule)]
model.fit(x_train, y_train, epochs=10, callbacks=cb)
Linear Decay & Polynomial Decay
15
Learning rate is decayed to zero over a fixed number of epochs.
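As a minimal sketch (not from the original deck): the built-in tf.keras.optimizers.schedules.PolynomialDecay covers both cases, since power=1.0 gives linear decay to end_learning_rate over decay_steps, while a higher power gives polynomial decay. The values below are illustrative.

lr_fn = tf.keras.optimizers.schedules.PolynomialDecay(
    initial_learning_rate=0.1,
    decay_steps=10000,      # number of steps over which to decay
    end_learning_rate=0.0,  # decay all the way to zero
    power=1.0               # 1.0 -> linear decay; e.g. 2.0 -> quadratic
)
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=lr_fn),
    loss='categorical_crossentropy',
    metrics=['accuracy']
)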
Cyclical Learning Rate
16
17
● Let the LR vary cyclically between boundary values
● Estimate reasonable bounds
Claims & Proposal
● We don’t know what the optimal initial learning rate is.
● Monotonically decreasing our learning rate may lead to our network getting
“stuck” in plateaus of the loss landscape.
18
● Define a minimum learning rate
● Define a maximum learning rate
● Allow the learning rate to oscillate cyclically between the two bounds
Source: Escaping from Saddle Points
Keywords: saddle point, convex function, critical point, update rule

Loss Landscape
20
The loss landscape depends on the model architecture & dataset.
Source: VISUALIZING THE LOSS LANDSCAPE OF NEURAL NETS
21
CLR - Policies
● batch size: number of training examples used in one weight update
● batch or iteration: one weight update; iterations per epoch = total training examples / batch size
● cycle: the number of iterations for the LR to go lower → upper → lower
● step size: the number of iterations in a half cycle (a formula sketch follows the link below)
https://github.com/bckenstler/CLR
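For reference, the triangular policy can be written as a plain function of the global iteration count. This is a sketch following the formula in the CLR paper and the repo above, not the callback implementation itself; the default values are illustrative.

import numpy as np

def triangular_lr(iteration, base_lr=1e-7, max_lr=1e-2, step_size=2000):
    # which cycle we are in (each cycle spans 2 * step_size iterations)
    cycle = np.floor(1 + iteration / (2 * step_size))
    # position within the cycle, reaching 0 at the peak and 1 at the bounds
    x = np.abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)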
Implementations
22
opt = SGD(lr=config.MIN_LR, momentum=0.9)
model = MiniGoogLeNet.build(width=32, height=32, depth=3, classes=10)
model.compile(loss="categorical_crossentropy", optimizer=opt, metrics=["accuracy"])
# initialize the cyclical learning rate callback
clr = CyclicLR(
mode="triangular",
base_lr=config.MIN_LR,
max_lr=config.MAX_LR,
step_size= config.STEP_SIZE * (trainX.shape[0] // config.BATCH_SIZE),
)
model.fit(
...,
steps_per_epoch=trainX.shape[0] // config.BATCH_SIZE,
epochs=config.NUM_EPOCHS,
callbacks=[clr])
# example config values
MIN_LR = 1e-7
MAX_LR = 1e-2
BATCH_SIZE = 64
STEP_SIZE = 8  # half-cycle length in epochs; typically 4 or 8
CLR_METHOD = "triangular"
NUM_EPOCHS = 96
https://github.com/bckenstler/CLR
TensorFlow Addons Optimizers
23
!pip install -q -U tensorflow_addons
import tensorflow as tf
import tensorflow_addons as tfa
...
steps_per_epoch = len(x_train) // BATCH_SIZE
clr = tfa.optimizers.CyclicalLearningRate(
initial_learning_rate=INIT_LR,
maximal_learning_rate=MAX_LR,
scale_fn=lambda x: 1/(2.**(x-1)),
step_size=2 * steps_per_epoch
)
optimizer = tf.keras.optimizers.SGD(clr)
clr_model = tf.keras.models.load_model("initial_model")
clr_history = train_model(clr_model, optimizer=optimizer)
#no_clr_history = train_model(standard_model, optimizer="sgd")
BATCH_SIZE = 64
EPOCHS = 10
INIT_LR = 1e-4
MAX_LR = 1e-2
Experiment Results - Triangular
24
Experiment Results - Triangular2
25
LR Finder (Range Test)
26
Automatic learning rate finder algorithm
28
The learning rate is increased after every mini-batch while the loss is recorded, typically over 3~5 epochs.
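A minimal sketch of such a range-test callback (illustrative only, not the PyImageSearch implementation): the LR grows geometrically after every mini-batch while the loss is logged.

import tensorflow as tf

class LRFinder(tf.keras.callbacks.Callback):
    def __init__(self, start_lr=1e-7, end_lr=1.0, num_steps=1000):
        super().__init__()
        self.start_lr = start_lr
        # growth factor so that start_lr reaches end_lr after num_steps batches
        self.factor = (end_lr / start_lr) ** (1.0 / num_steps)
        self.lrs, self.losses = [], []

    def on_train_begin(self, logs=None):
        tf.keras.backend.set_value(self.model.optimizer.lr, self.start_lr)

    def on_train_batch_end(self, batch, logs=None):
        lr = float(tf.keras.backend.get_value(self.model.optimizer.lr))
        self.lrs.append(lr)
        self.losses.append(logs["loss"])
        tf.keras.backend.set_value(self.model.optimizer.lr, lr * self.factor)

finder = LRFinder()
model.fit(x_train, y_train, epochs=3, callbacks=[finder])  # 3~5 epochs
# then plot finder.losses against finder.lrs on a log-scaled x axis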
29
● Recommended minimum: the learning rate at which the loss decreases the fastest (steepest negative gradient)
● Recommended maximum: 10 times less (one order of magnitude lower) than the learning rate at which the loss is lowest (if the loss bottoms out at 0.1, a good value to start with is 0.01); see the sketch after the source line below
Source: The Learning Rate Finder Technique: How Reliable Is It?
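A hedged sketch of turning the recorded (lr, loss) pairs from the range test into these two bounds; `finder` is the hypothetical callback instance from the previous sketch.

import numpy as np

lrs = np.array(finder.lrs)
losses = np.array(finder.losses)

# minimum: the LR where the loss decreases the fastest (steepest negative slope)
min_lr = lrs[np.argmin(np.gradient(losses))]

# maximum: one order of magnitude below the LR where the loss is lowest
max_lr = lrs[np.argmin(losses)] / 10.0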
30
Reminder
● Use the same initial weights for the LR finder and the subsequent model training.
● Keep a copy of the model weights and restore them afterwards, so they are “as they were” before the learning rate finder ran.
● Never assume that the learning rates found are the best for every model initialization ❌
● Setting a narrower range than what is recommended is safer and reduces the risk of divergence due to very high learning rates.
31
● min: where the loss decreases the fastest
● max: narrower than one order of magnitude below the loss minimum
● Larger batch size → higher learning rate
Source: The Learning Rate Finder Technique: How Reliable Is It?
Summary
● Learning Rate Annealing
● Cyclical Learning Rate
● LR Finder
32
One Cycle Policy
33
34
● Learning rate
● Batch size
● Momentum
● Weight decay
Learning Rate
● The LR ramps from a lower bound up to a maximum and back down (e.g. 0.08 ~ 0.8); fastai modifies the ramp to cosine annealing (cosine descent).
● The maximum should be the value picked with a learning rate finder procedure (a curve sketch follows the source line below).
Source: Finding Good Learning Rate and The One Cycle Policy.
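A numpy sketch of the resulting fastai-style LR curve (cosine ascent for the first pct_start of training, then cosine descent), mirroring the combined_cos schedule shown in the fastai source later in this deck; parameter names follow fit_one_cycle and the default values are illustrative.

import numpy as np

def one_cycle_lr(pct, lr_max=0.8, div=10.0, div_final=1e5, pct_start=0.3):
    # pct is the training progress in [0, 1]
    def cos_anneal(start, end, p):
        return start + (end - start) * (1 - np.cos(np.pi * p)) / 2
    if pct < pct_start:
        # cosine ascent from lr_max/div up to lr_max (e.g. 0.08 -> 0.8)
        return cos_anneal(lr_max / div, lr_max, pct / pct_start)
    # cosine descent from lr_max down to lr_max/div_final
    return cos_anneal(lr_max, lr_max / div_final, (pct - pct_start) / (1 - pct_start))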
Cyclic Momentum
36
Momentum is cycled inversely to the learning rate; fastai modification: cosine ascent.
Source: Finding Good Learning Rate and The One Cycle Policy.
Weight Decay Matters
Comparison of training runs with weight decay 1e-3 vs 1e-5.
Example of super-convergence
38
Source: Understanding Fastai's fit_one_cycle method
@log_args(but_as=Learner.fit)
@delegates(Learner.fit_one_cycle)
def fine_tune(self:Learner, epochs, base_lr=2e-3, freeze_epochs=1, lr_mult=100, pct_start=0.3, div=5.0, **kwargs):
    "Fine tune with `freeze` for `freeze_epochs` then with `unfreeze` from `epochs` using discriminative LR"
    self.freeze()
    self.fit_one_cycle(freeze_epochs, slice(base_lr), pct_start=0.99, **kwargs)
    base_lr /= 2
    self.unfreeze()
    self.fit_one_cycle(epochs, slice(base_lr/lr_mult, base_lr), pct_start=pct_start, div=div, **kwargs)

@log_args(but_as=Learner.fit)
def fit_one_cycle(self:Learner, n_epoch, lr_max=None, div=25., div_final=1e5, pct_start=0.25, wd=None,
                  moms=None, cbs=None, reset_opt=False):
    "Fit `self.model` for `n_epoch` using the 1cycle policy."
    if self.opt is None: self.create_opt()
    self.opt.set_hyper('lr', self.lr if lr_max is None else lr_max)
    lr_max = np.array([h['lr'] for h in self.opt.hypers])
    scheds = {'lr': combined_cos(pct_start, lr_max/div, lr_max, lr_max/div_final),
              'mom': combined_cos(pct_start, *(self.moms if moms is None else moms))}
    self.fit(n_epoch, cbs=ParamScheduler(scheds)+L(cbs), reset_opt=reset_opt, wd=wd)
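For completeness, a hedged usage sketch with fastai v2 (assuming a DataLoaders object named dls has already been built; the calls are the public fastai API, the arguments are illustrative):

from fastai.vision.all import *

learn = cnn_learner(dls, resnet34, metrics=accuracy)
learn.lr_find()                    # range test: pick a maximum LR from the plot
learn.fine_tune(5, base_lr=2e-3)   # freeze -> one cycle, unfreeze -> one cycle
# or call the policy directly:
# learn.fit_one_cycle(5, lr_max=2e-3)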
References
● 1. Keras learning rate schedules and decay (PyImageSearch)
● 2. Cyclical Learning Rates with Keras and Deep Learning (PyImageSearch)
● 3. Keras Learning Rate Finder (PyImageSearch)
● Learning Rate Schedule in Practice: an example with Keras and TensorFlow 2.0 👍
● Understanding Learning Rate in Machine Learning
● Learning Rate Schedules in Deep Learning
● Setting the learning rate of your neural network
● Exploring Super-Convergence 👍
● The Learning Rate Finder Technique: How Reliable Is It?
40
References - One Cycle
● One-cycle learning rate schedulers (Kaggle)
● Finding Good Learning Rate and The One Cycle Policy. 👍
● The 1cycle policy (fastbook author)
● Understanding Fastai's fit_one_cycle method 👍
41
Colab
● Keras learning rate schedules and decay (PyImageSearch)
● Cyclical Learning Rates with Keras and Deep Learning (PyImageSearch)
● Keras Learning Rate Finder (PyImageSearch) 💎
● TensorFlow Addons Optimizers: CyclicalLearningRate 👍
42
Further Reading
● Cyclical Learning Rates for Training Neural Networks (Leslie, 2015)
● Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates (Leslie et al., 2017)
● A disciplined approach to neural network hyper-parameters: Part 1 -- learning rate,
batch size, momentum, and weight decay (Leslie, 2018)
● SGDR: Stochastic Gradient Descent with Warm Restarts (2016)
● Snapshot Ensembles: Train 1, get M for free (2017)
● A brief history of learning rate schedulers and adaptive optimizers 💎
● Faster Deep Learning Training with PyTorch – a 2021 Guide 💎
43
