Deep Neural Network
Module 3
Optimization
What we Learn…
3.1 Challenges in Neural Network Optimization – saddle points and
plateau
3.2 Non-convex optimization intuition
3.3 Overview of optimization algorithms
3.4 Momentum based algorithms
3.5 Algorithms with Adaptive Learning Rates
Optimization Algorithms
Optimization Algorithm
● Optimization algorithms train deep learning models.
● Optimization algorithms are the tools that allow us to
○ keep updating the model parameters
○ so as to minimize the value of the loss function, as evaluated on the training set.
● In optimization, a loss function is often referred to as the objective
function of the optimization problem.
● By tradition and convention most optimization algorithms are concerned
with minimization.
Optimization
Why Optimization Algorithm?
● The performance of the optimization algorithm directly affects the model's training efficiency.
● Understanding the principles of different optimization algorithms and the
role of their hyperparameters will enable us to tune the
hyperparameters in a targeted manner to improve the performance of
deep learning models.
● The goal of optimization is to reduce the training error. The goal of deep learning is to reduce the generalization error, which also requires reducing overfitting.
Optimization Challenges in Deep Learning
● Local minima
● Saddle points
● Vanishing gradients
Local Minima
● For any objective function f(x), if the value of f(x) at x is smaller than the values of f(x) at any other points in the vicinity of x, then f(x) could be a local minimum.
● If the value of f(x) at x is the minimum of the objective function over the entire domain, then f(x) is the global minimum.
● In minibatch stochastic gradient descent, the natural variation of gradients over minibatches is able to dislodge the parameters from local minima.
Finding Minimum of a Function
Derivatives of a Function
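(The figures for these slides are not reproduced; as a minimal worked example of finding a minimum via derivatives:)
f(x) = x² − 4x + 5
f′(x) = 2x − 4 = 0 ⇒ x = 2
f″(x) = 2 > 0, so x = 2 is a local minimum, with f(2) = 1.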
Saddle points
● A saddle point is any location where
all gradients of a function vanish but
which is neither a global nor a local
minimum.
● E.g., f(x, y) = x² − y²
○ saddle point at (0, 0)
○ maximum w.r.t. y
○ minimum w.r.t. x
Derivatives at Saddle (Inflection) Point
Functions of Multiple Variables
Gradient of a Scalar Function
Gradient of a Scalar Function with multiple variables
Hessian
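(These definitions were presented as figures; in standard notation, for f: ℝⁿ → ℝ:)
∇f(x) = [∂f/∂x₁, …, ∂f/∂xₙ]ᵀ
The Hessian ∇²f(x) is the n × n matrix with entries [∇²f(x)]ᵢⱼ = ∂²f / (∂xᵢ ∂xⱼ).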
Solution of Unconstrained Minimization
 Solve for x where the gradient equals zero:
∇f(x) = 0
 Compute the Hessian matrix ∇²f(x) at the candidate solution and verify that it indicates a
○ Local Minimum
■ Eigenvalues of the Hessian matrix are all positive
○ Local Maximum
■ Eigenvalues of the Hessian matrix are all negative
○ Saddle Point
■ Eigenvalues of the Hessian matrix at the zero-gradient position are both negative and positive
Example
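(The worked examples were given as figures; a small example of the Hessian eigenvalue test, using the saddle function from earlier:)
f(x, y) = x² − y²
∇f = (2x, −2y) = 0 ⇒ (x, y) = (0, 0)
∇²f = diag(2, −2); eigenvalues 2 > 0 and −2 < 0 ⇒ (0, 0) is a saddle point.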
Vanishing gradients
● Function f(x) = tanh(x)
● f′(x) = 1 − tanh²(x)
○ f′(4) ≈ 0.0013
● The gradient of f is close to zero.
● Vanishing gradients can cause
optimization to stall.
○ Reparameterization of the problem can help.
○ So can good initialization of the parameters.
Gradient Descent
How to find Global Minima?
Find Global Minima Iteratively
Approach of Gradient Descent
Update rule: new = old − eta * grad
Approach of Gradient Descent
● First-order gradient descent algorithms use the first-order derivative to get the magnitude and direction of the update.
def gd(eta, f_grad):
    x = 10.0                     # initial guess
    results = [x]
    for _ in range(10):          # 10 gradient-descent steps
        x -= eta * f_grad(x)     # x <- x - eta * f'(x)
        results.append(float(x))
    return results

f_grad = lambda x: 2 * x         # e.g., gradient of f(x) = x**2
gd(0.2, f_grad)
Effect of Learning Rate on Gradient Descent
Learning Rate
● The role of the learning rate is to moderate the degree to which
weights are changed at each step.
● Learning rate η is set by the algorithm designer.
● If the learning rate is too small, x will update very slowly, requiring more iterations to get a better solution.
● If the learning rate is too large, the solution oscillates and in the worst case it might even diverge.
Learning rate and Gradient Descent
gd(0.05, f_grad) vs. gd(1.1, f_grad)
Slow learning: if we pick eta too small, we make little progress.
Oscillations: if we pick eta too large, the solution oscillates and in the worst case it might diverge.
Stochastic Gradient Descent
Stochastic Gradient Descent
● In deep learning, the objective function is the average of the loss functions for
each example in the training dataset.
● Given a training dataset of n examples, let fᵢ(x) be the loss function with respect to the training example of index i, where x is the parameter vector.
● The objective function is f(x) = (1/n) Σᵢ₌₁ⁿ fᵢ(x).
● The gradient of the objective function at x is ∇f(x) = (1/n) Σᵢ₌₁ⁿ ∇fᵢ(x).
● SGD samples an index i uniformly at random and updates x as x ← x − η ∇fᵢ(x).
● Computational cost of each iteration is O(1).
SGD with Constant Learning Rate
● The trajectory of the variables in stochastic gradient descent is much more noisy. This is due to the stochastic nature of the gradient. Even near the minimum, uncertainty is injected by the instantaneous gradient via η∇fᵢ(x).
import numpy as np

def sgd(x1, x2, s1, s2, f_grad):
    g1, g2 = f_grad(x1, x2)
    # Simulate a noisy gradient with unit-variance Gaussian noise
    g1 += np.random.normal(0.0, 1, (1,))
    g2 += np.random.normal(0.0, 1, (1,))
    eta_t = eta * lr()  # eta: base rate; lr(): schedule multiplier, defined elsewhere
    return (x1 - eta_t * g1, x2 - eta_t * g2, 0, 0)
Dynamic Learning Rate
Dynamic Learning Rate
● Replace η with a time-dependent learning rate η(t).
○ This adds to the complexity of controlling convergence of an optimization algorithm.
● A few basic strategies for adjusting η over time:
1. Piecewise constant
a. Decrease the learning rate, e.g., whenever progress in optimization stalls.
b. This is a common strategy for training deep networks.
2. Exponential decay: η(t) = η₀ e^(−λt)
a. Can lead to premature stopping before the algorithm has converged.
3. Polynomial decay: η(t) = η₀ (βt + 1)^(−α), here with α = 0.5.
Exponential Decay
● The variance in the parameters is significantly reduced.
● However, the algorithm fails to converge at all, because the learning rate decays too quickly.
import math

t = 0  # global step counter

def exponential_lr():
    global t
    t += 1
    return math.exp(-0.1 * t)
Polynomial decay
● Use a polynomial decay
○ Learning rate decays with the inverse square root of the number of steps
○ Convergence gets better after only 50 steps.
t = 0  # global step counter

def polynomial_lr():
    global t
    t += 1
    return (1 + 0.1 * t) ** (-0.5)
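As a quick check of the schedule, the first few multipliers it returns decay gently with the inverse square root of t:

t = 0  # reset the step counter
print(polynomial_lr())  # 1.1 ** -0.5  ≈ 0.953
print(polynomial_lr())  # 1.2 ** -0.5  ≈ 0.913
print(polynomial_lr())  # 1.3 ** -0.5  ≈ 0.877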
Review
● Gradient descent
○ Uses the full dataset to compute gradients and to update parameters, one pass at a time.
○ Gradient Descent is not particularly data efficient whenever data is very similar.
● Stochastic Gradient descent
○ Processes one observation at a time to make progress.
○ Stochastic Gradient Descent is not particularly computationally efficient since CPUs and GPUs
cannot exploit the full power of vectorization.
○ For noisy gradients, choice of the learning rate is critical.
■ If we decrease it too rapidly, convergence stalls.
■ If we are too lenient, we fail to converge to a good enough solution since noise keeps on
driving us away from optimality.
● Minibatch SGD
○ Accelerates computation through better computational efficiency.
○ Averaging gradients reduces the amount of variance.
Minibatch Stochastic Gradient Descent
Minibatch Stochastic Gradient Descent
● In each iteration, we first randomly sample a minibatch B consisting of a fixed number of
training examples.
● We then compute the derivative (gradient) of the average loss on the minibatch with regard to
the model parameters.
● Finally, we multiply the gradient by a predetermined positive value η and subtract the resulting
term from the current parameter values.
● |B| represents the number of examples in each minibatch (the batch size) and η denotes the
learning rate.
Minibatch Stochastic Gradient Descent Algorithm
● The gradient at time t is calculated as
g_t = (1/|B_t|) Σ_{i∈B_t} ∇_w f(xᵢ, w_{t−1})
● |B| represents the number of examples in each minibatch (the batch size) and η denotes the learning rate.
● The parameters are then updated as w_t = w_{t−1} − η g_t.
● ∇_w f(xᵢ, w_{t−1}) is the stochastic gradient for sample i, using the weights updated at time t − 1.
SGD Algorithm
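The algorithm itself was shown as a figure; a minimal minibatch-SGD sketch under assumed names (w, X, y, grad_fn are hypothetical placeholders, not from the slides):

import numpy as np

def minibatch_sgd(w, X, y, grad_fn, eta=0.1, batch_size=32, epochs=10):
    n = len(X)
    for _ in range(epochs):
        idx = np.random.permutation(n)            # reshuffle each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            g = grad_fn(w, X[batch], y[batch])    # average gradient over the minibatch
            w = w - eta * g                       # update with learning rate eta
    return w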
Momentum
Leaky Average in Minibatch SGD
● Replace the gradient computation by a “leaky average” for better variance reduction, with β ∈ (0, 1):
v_t = β v_{t−1} + g_{t,t−1}
● This effectively replaces the instantaneous gradient by one that's been averaged over multiple past gradients.
● v is called momentum. It accumulates past gradients.
● Large β amounts to a long-range average and small β amounts to only a slight correction relative to a gradient method.
● The new gradient replacement no longer points in the direction of steepest descent on a particular instance, but rather in the direction of a weighted average of past gradients.
Momentum Algorithm
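(The algorithm was presented as a figure; in the standard form from the reference textbook:)
v_t = β v_{t−1} + g_{t,t−1}
x_t = x_{t−1} − η v_t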
Momentum Method Example
● Consider a moderately distorted ellipsoid objective, f(x) = 0.1 x₁² + 2 x₂² (its gradients reappear in the code examples below).
● f has its minimum at (0, 0). This function is very flat in the x₁ direction.
● For eta = 0.4, without momentum: the gradient in the x₂ direction oscillates much more than in the horizontal x₁ direction. A small learning rate keeps the solution from diverging in x₂, but saddles us with slow convergence in x₁.
● For eta = 0.6, without momentum: convergence in the x₁ direction improves, but the overall solution quality diverges in x₂.
Momentum Method Example
● Consider the same moderately distorted ellipsoid objective and apply momentum with eta = 0.6; a sketch of the update follows.
● For beta = 0.5: converges well, with fewer oscillations and larger steps in the x₁ direction.
● For beta = 0.25: convergence is reduced, with more oscillations of larger magnitude.
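A sketch of the per-step momentum update for this 2-D example, assuming eta and beta are defined globally:

def momentum_2d(x1, x2, v1, v2):
    # gradients of f(x) = 0.1*x1**2 + 2*x2**2 are (0.2*x1, 4*x2)
    v1 = beta * v1 + 0.2 * x1
    v2 = beta * v2 + 4 * x2
    return x1 - eta * v1, x2 - eta * v2, v1, v2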
Momentum method: Summary
● Momentum replaces gradients with a leaky average over past
gradients. This accelerates convergence significantly.
● Momentum prevents stalling of the optimization process that is much
more likely to occur for stochastic gradient descent.
● The effective number of gradients is given by 1/(1 − β) due to exponentiated downweighting of past data.
● Implementation is quite straightforward but it requires us to store an
additional state vector (momentum v).
Adagrad
Adagrad
● Well suited to problems with features that occur infrequently (sparse features).
● Adagrad uses an aggregate of the squares of previously observed gradients.
Adagrad Algorithm
● A variable s_t accumulates past squared gradients:
s_t = s_{t−1} + g_t², x_t = x_{t−1} − (η / √(s_t + ε)) · g_t
● Operations are applied coordinate-wise: √(1/v) has entries √(1/vᵢ) and u · v has entries uᵢvᵢ. η is the learning rate and ε is an additive constant that ensures that we do not divide by 0.
● Initialize s₀ = 0.
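On the same 2-D objective as before, a sketch of Adagrad's per-step update (eta assumed to be defined globally):

import math

def adagrad_2d(x1, x2, s1, s2):
    # accumulate squared gradients of f(x) = 0.1*x1**2 + 2*x2**2
    eps = 1e-6
    g1, g2 = 0.2 * x1, 4 * x2
    s1 += g1 ** 2
    s2 += g2 ** 2
    x1 -= eta / math.sqrt(s1 + eps) * g1
    x2 -= eta / math.sqrt(s2 + eps) * g2
    return x1, x2, s1, s2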
Adagrad: Summary
● Adagrad decreases the learning rate dynamically on a per-coordinate basis.
● It uses the magnitude of the gradient as a means of adjusting how quickly progress is achieved: coordinates with large gradients are compensated with a smaller learning rate.
● If the optimization problem has a rather uneven structure Adagrad can help
mitigate the distortion.
● Adagrad is particularly effective for sparse features where the learning rate needs
to decrease more slowly for infrequently occurring terms.
● On deep learning problems Adagrad can sometimes be too aggressive in reducing
learning rates.
RMSProp
RMSProp
● Adagrad uses a learning rate that decreases at a predefined schedule of effectively O(t^(−1/2)).
● The RMSProp algorithm decouples rate scheduling from coordinate-adaptive learning rates. This is essential for non-convex optimization.
RMSProp Algorithm
● Use a leaky average to accumulate past gradient variance:
s_t = γ s_{t−1} + (1 − γ) g_t², x_t = x_{t−1} − (η / √(s_t + ε)) ⊙ g_t
● Parameter γ > 0.
● The constant ε > 0 is set to 10⁻⁶ to ensure that we do not suffer from division by zero or overly large step sizes.
RMSProp Example
import math

def rmsprop_2d(x1, x2, s1, s2):
    # gradients of f(x) = 0.1*x1**2 + 2*x2**2
    g1, g2, eps = 0.2 * x1, 4 * x2, 1e-6
    s1 = gamma * s1 + (1 - gamma) * g1 ** 2
    s2 = gamma * s2 + (1 - gamma) * g2 ** 2
    x1 -= eta / math.sqrt(s1 + eps) * g1
    x2 -= eta / math.sqrt(s2 + eps) * g2
    return x1, x2, s1, s2  # eta and gamma assumed defined globally
RMSProp: Summary
● RMSProp is very similar to Adagrad as both use the square of the
gradient to scale coefficients.
● RMSProp shares with momentum the leaky averaging. However,
RMSProp uses the technique to adjust the coefficient-wise
preconditioner.
● The learning rate needs to be scheduled by the experimenter in practice.
● The coefficient γ (gamma) determines how long the history is when
adjusting the per-coordinate scale.
Adam
Review of techniques learned so far
1. Stochastic gradient descent
○ more effective than Gradient Descent when solving optimization problems, e.g., due to its
inherent resilience to redundant data.
2. Minibatch Stochastic gradient descent
○ affords significant additional efficiency arising from vectorization, using larger sets of
observations in one minibatch. This is the key to efficient multi-machine, multi-GPU and
overall parallel processing.
3. Momentum
○ added a mechanism for aggregating a history of past gradients to accelerate convergence.
4. Adagrad
○ used per-coordinate scaling to allow for a computationally efficient preconditioner.
5. RMSProp
○ decoupled per-coordinate scaling from a learning rate adjustment.
Adam
● Adam combines all of these techniques into one efficient learning algorithm.
● It is one of the more robust and effective optimization algorithms to use in deep learning.
● Adam can diverge due to poor variance control (a disadvantage).
● Adam uses exponentially weighted moving averages (also known as leaky averaging) to obtain an estimate of both the momentum and the second moment of the gradient.
Adam Algorithm
● State variables (leaky averages of the gradient and its coordinate-wise square):
v_t ← β₁ v_{t−1} + (1 − β₁) g_t, s_t ← β₂ s_{t−1} + (1 − β₂) g_t²
● β₁ and β₂ are nonnegative weighting parameters. Common choices for them are β₁ = 0.9 and β₂ = 0.999. That is, the variance estimate moves much more slowly than the momentum term.
● Initialize v₀ = s₀ = 0.
● Normalize the state variables to correct the bias from the zero initialization:
v̂_t = v_t / (1 − β₁ᵗ), ŝ_t = s_t / (1 − β₂ᵗ)
Adam Algorithm
● Rescale the gradient: g′_t = η v̂_t / (√ŝ_t + ε)
● Compute the update: x_t ← x_{t−1} − g′_t
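Putting the pieces together, a minimal single-parameter sketch of one Adam step (default hyperparameters as above; a sketch, not the textbook's exact vectorized code):

import math

def adam_step(x, v, s, g, t, eta=0.01, beta1=0.9, beta2=0.999, eps=1e-6):
    # leaky averages of the gradient and its square
    v = beta1 * v + (1 - beta1) * g
    s = beta2 * s + (1 - beta2) * g ** 2
    # bias correction for the zero initialization (t counts steps from 1)
    v_hat = v / (1 - beta1 ** t)
    s_hat = s / (1 - beta2 ** t)
    # rescale and update
    x -= eta * v_hat / (math.sqrt(s_hat) + eps)
    return x, v, s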
Adam: Summary
● Adam combines features of many optimization algorithms into a fairly
robust update rule.
● Adam uses bias correction to adjust for a slow startup when
estimating momentum and a second moment.
● For gradients with significant variance we may encounter issues with convergence. These can be addressed by using larger minibatches or by switching to an improved estimate for the state variables; the Yogi algorithm offers such an alternative.
Learning Rate: Summary
1. Adjusting the learning rate is often just as important as the actual algorithm.
2. The magnitude of the learning rate matters. If it is too large, optimization diverges; if it is too small, it takes too long to train or we end up with a suboptimal result. The momentum algorithm helps here.
3. The rate of decay is just as important. If the learning rate remains large, we may simply end up bouncing around the minimum and thus not reach optimality. We want the rate to decay, but probably more slowly than O(t^(−1/2)).
4. Initialization pertains both to how the parameters are set initially and also how they
evolve initially. This is known as warmup, i.e., how rapidly we start moving towards
the solution initially.
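Warmup is typically implemented as part of the schedule; a hypothetical sketch combining linear warmup with inverse-square-root decay (names and constants are illustrative, not from the slides):

def warmup_lr(t, warmup_steps=5, base=0.5):
    # ramp up linearly for the first warmup_steps, then decay as O(t**-0.5)
    if t < warmup_steps:
        return base * (t + 1) / warmup_steps
    return base * (warmup_steps / (t + 1)) ** 0.5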
Numerical Problems
Question with Solution
Question
1. Compute the value of (w₁, w₂) that minimizes the error. Compute the minimum possible value of the error.
2. What will be the value of (w₁, w₂) at time (t + 1) if standard gradient descent is used?
3. What will be the value of (w₁, w₂) at time (t + 1) if momentum is used?
4. What will be the value of (w₁, w₂) at time (t + 1) if RMSProp is used?
5. What will be the value of (w₁, w₂) at time (t + 1) if Adam is used?
Solution
Reference Textbook: Dive into Deep Learning
● Chapter 12 (online version)
Next Session:
Regularization