Optimization in Deep Learning
Jeremy Nixon
Overview
1. Challenges in Neural Network Optimization
2. Gradient Descent
3. Stochastic Gradient Descent
4. Momentum
a. Nesterov Momentum
5. RMSProp
6. Adam
Challenges in Neural Network Optimization
1. Training Time
a. Model complexity (depth, width) is important to accuracy
b. Training time for state of the art can take weeks on a GPU
2. Hyperparameter Tuning
a. Learning rate tuning is important to accuracy
3. Local Minima
Neural Net Refresh + Gradient Descent
[Diagram: x_train → hidden layer (raw / ReLU, weights w1) → softmax output (weights w2)]
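Below is a minimal NumPy sketch of one gradient-descent loop for this two-layer network. The synthetic data standing in for x_train / y_train, the 128-unit hidden layer, and the batch size are assumptions for illustration, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data standing in for x_train / y_train (e.g. flattened MNIST digits).
x_train = rng.normal(size=(64, 784))           # 64 examples, 784 features
y_train = rng.integers(0, 10, size=64)         # integer class labels 0-9

# Weight matrices matching the diagram: input -> hidden (w1), hidden -> output (w2).
w1 = rng.normal(scale=0.01, size=(784, 128))
w2 = rng.normal(scale=0.01, size=(128, 10))
lr = 0.01

for step in range(100):
    # Forward pass: hidden raw / relu, then softmax output.
    hidden_raw = x_train @ w1
    hidden = np.maximum(hidden_raw, 0.0)                         # ReLU
    logits = hidden @ w2
    logits -= logits.max(axis=1, keepdims=True)                  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

    # Cross-entropy loss and its gradient with respect to the logits.
    loss = -np.log(probs[np.arange(len(y_train)), y_train]).mean()
    d_logits = probs.copy()
    d_logits[np.arange(len(y_train)), y_train] -= 1.0
    d_logits /= len(y_train)

    # Backward pass through both layers.
    d_w2 = hidden.T @ d_logits
    d_hidden = (d_logits @ w2.T) * (hidden_raw > 0)              # ReLU gradient
    d_w1 = x_train.T @ d_hidden

    # (Full-batch) gradient descent update.
    w1 -= lr * d_w1
    w2 -= lr * d_w2
```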
Stochastic Gradient Descent
Dramatic Speedup
Sub-linear returns to adding more data to each batch
Crucial learning rate hyperparameter
Schedule the learning rate to decay during training (a mini-batch loop with such a schedule is sketched below)
SGD introduces noise into the gradient estimate
The gradient estimate will almost never fully converge to 0
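A sketch of the mini-batch loop with a decaying learning rate. The params dict, the compute_gradient callback, and the 1/t decay form are hypothetical choices for illustration, not taken from the slides.

```python
import numpy as np

def sgd(params, compute_gradient, x, y, lr0=0.01, decay=1e-3,
        batch_size=32, epochs=10, rng=None):
    """Plain mini-batch SGD with a 1/t learning-rate schedule.

    params is a dict of weight arrays; compute_gradient is assumed to
    return a dict of gradients with the same keys (both hypothetical).
    """
    if rng is None:
        rng = np.random.default_rng(0)
    step = 0
    for _ in range(epochs):
        order = rng.permutation(len(x))            # reshuffle the data each epoch
        for start in range(0, len(x), batch_size):
            idx = order[start:start + batch_size]  # mini-batch: noisy gradient estimate
            grads = compute_gradient(params, x[idx], y[idx])
            lr = lr0 / (1.0 + decay * step)        # schedule: learning rate decays over training
            for key in params:
                params[key] -= lr * grads[key]
            step += 1
    return params
```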
Stochastic Gradient Descent
[Figure: training on MNIST, 1 hidden layer, lr = 1.0 (a typical lr is 0.01)]
Momentum
Dramatically Accelerates Learning
1. Initialize the learning rate and a momentum matrix the same size as the weights
2. At each SGD iteration, collect the gradient
3. Update the momentum matrix to be the old momentum matrix times the momentum hyperparameter, plus the learning rate times the collected gradient (see the sketch below)
s = 0.9 (momentum hyperparameter); t.layers[i].moment1 = layer i's momentum matrix; lr = 0.01; gradient = the gradient collected by SGD
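A sketch of the update in step 3, reusing the slide's names (s, lr) and treating moments[i] as the slide's t.layers[i].moment1; the list-of-arrays layout is an assumption.

```python
def momentum_step(weights, moments, gradients, lr=0.01, s=0.9):
    """One SGD-with-momentum update.

    weights, moments, gradients: lists of arrays, one per layer, where
    moments[i] plays the role of t.layers[i].moment1 on the slide and
    is initialized to zeros before training.
    """
    for i in range(len(weights)):
        # Momentum = momentum hyperparameter * old momentum + lr * new gradient.
        moments[i] = s * moments[i] + lr * gradients[i]
        # Step in the direction of the accumulated momentum.
        weights[i] -= moments[i]
    return weights, moments
```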
[Figure: training on MNIST, 2 hidden layers]
Intuition for Momentum
Automatically cancels out noise in the gradient
Amplifies small but consistent gradients
“Momentum” derives from the physical analogy [momentum = mass * velocity]
Assumes unit mass
The velocity vector is the particle's momentum
Deals well with heavy curvature
Momentum Accelerates the Gradient
A gradient that keeps pointing in the same direction can drive the velocity up to lr * ||gradient|| / (1 - s). With s = 0.9, the step maxes out at 10 * lr in the direction of the accumulated gradient (checked numerically below).
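A quick numerical check of that bound, assuming a constant unit gradient: the velocity from the momentum update converges to lr / (1 - s), i.e. 10 * lr when s = 0.9.

```python
lr, s = 0.01, 0.9
g = 1.0                      # constant gradient in one direction
v = 0.0
for _ in range(200):
    v = s * v + lr * g       # same update as the momentum sketch above
print(v, lr / (1 - s))       # both are approximately 0.1 = 10 * lr
```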
Asynchronous SGD similar to Momentum
In distributed SGD, asynchronous training lets each worker update the parameters as soon as it finishes, instead of waiting for all workers to finish
The resulting stale updates act like a weighted average of previous gradients applied to the current weights
Nesterov Momentum
Evaluate the gradient at the look-ahead point reached by the momentum step, rather than at the current weights (sketched below)
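A sketch of that look-ahead evaluation; grad_fn is a hypothetical function returning per-layer gradients, and the only change from the plain momentum sketch above is where the gradient is computed.

```python
def nesterov_step(weights, moments, grad_fn, lr=0.01, s=0.9):
    """One Nesterov-momentum update for lists of per-layer weight arrays."""
    # Look ahead: evaluate the gradient at the point the momentum step would reach.
    lookahead = [w - s * m for w, m in zip(weights, moments)]
    gradients = grad_fn(lookahead)
    for i in range(len(weights)):
        moments[i] = s * moments[i] + lr * gradients[i]
        weights[i] -= moments[i]
    return weights, moments
```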
[Figure: training on MNIST, 2 hidden layers]
Adaptive Learning Rate Algorithms
Adagrad
Duchi et al., 2011
RMSProp
Hinton, 2012
Adam
Kingma and Ba, 2014
The idea is to auto-tune the learning rate, making the network less sensitive to this hyperparameter.
Adagrad
Shrinks the learning rate adaptively, per weight
The effective learning rate is inversely proportional to the square root of the accumulated squared gradient history (sketched below)
r = squared gradient history; g = gradient; theta = weights; epsilon = global learning rate; delta = small constant for numerical stability
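A sketch of the Adagrad update using the slide's symbols (r, g, theta, epsilon, delta); how the gradient g is computed is left outside the sketch.

```python
import numpy as np

def adagrad_step(theta, r, g, epsilon=0.01, delta=1e-7):
    """One Adagrad update on a single weight array.

    theta: weights, g: gradient, r: running sum of squared gradients
    (initialized to zeros), epsilon: global learning rate,
    delta: small constant for numerical stability.
    """
    r = r + g * g                                    # accumulate squared gradient history
    theta = theta - (epsilon / (delta + np.sqrt(r))) * g  # per-weight scaled step
    return theta, r
```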
Intuition for Adagrad
Instead of setting a single global learning rate, have a different learning rate for
every weight in the network
Parameters with the largest derivative have a rapid decrease in learning rate
Parameters with small derivatives have a small decrease in learning rate
We get much more progress in more gently sloped directions of parameter
space.
Downside - accumulating gradients from the beginning leads to extremely small
learning rates later in training
Downside - doesn’t deal well with differences in global and local structure
RMSProp
Keep an exponentially weighted moving average of the squared gradient to adapt the learning rate (sketched below)
Performs well in the non-convex setting, where global and local structure differ
Can be combined with momentum / Nesterov momentum
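A sketch of the RMSProp update; rho is the decay rate of the exponentially weighted average, and the 0.9 default here is a common choice rather than a value from the slides.

```python
import numpy as np

def rmsprop_step(theta, r, g, epsilon=0.001, rho=0.9, delta=1e-6):
    """One RMSProp update on a single weight array.

    Identical in spirit to Adagrad, except r is an exponentially weighted
    average of squared gradients rather than a running sum, so old history
    decays away instead of shrinking the learning rate forever.
    """
    r = rho * r + (1.0 - rho) * g * g
    theta = theta - (epsilon / np.sqrt(delta + r)) * g
    return theta, r
```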
[Figures: training on MNIST, 1 hidden layer]
Adam
Short for “Adaptive Moments”
Exponentially weighted average of the gradient for momentum (first moment)
Exponentially weighted average of the squared gradient for adapting the learning rate (second moment)
Bias correction on both moments to compensate for their zero initialization early in training (see the sketch below)
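A sketch of the Adam update with the defaults from the paper (lr = 0.001, beta1 = 0.9, beta2 = 0.999); s and r are the first and second moment estimates, initialized to zeros, and t is the 1-based step count.

```python
import numpy as np

def adam_step(theta, s, r, g, t, lr=0.001, beta1=0.9, beta2=0.999, delta=1e-8):
    """One Adam update on a single weight array."""
    s = beta1 * s + (1 - beta1) * g          # first moment: EW average of the gradient
    r = beta2 * r + (1 - beta2) * g * g      # second moment: EW average of the squared gradient
    s_hat = s / (1 - beta1 ** t)             # bias correction: both moments start at zero
    r_hat = r / (1 - beta2 ** t)
    theta = theta - lr * s_hat / (np.sqrt(r_hat) + delta)
    return theta, s, r
```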
Adam
[Figure: training on MNIST, 5 hidden layers]
Thank you!
Questions?
Bibliography
Adam paper - https://arxiv.org/abs/1412.6980
Adagrad - http://jmlr.org/papers/v12/duchi11a.html
RMSProp - http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
Deep Learning Textbook - http://www.deeplearningbook.org/
