Optimization in Deep Learning
Jeremy Nixon
Overview
1. Challenges in Neural Network Optimization
2. Gradient Descent
3. Stochastic Gradient Descent
4. Momentum
a. Nesterov Momentum
5. RMSProp
6. Adam
Challenges in Neural Network Optimization
1. Training Time
a. Model complexity (depth, width) is important to accuracy
b. Training time for state of the art can take weeks on a GPU
2. Hyperparameter Tuning
a. Learning rate tuning is important to accuracy
3. Local Minima
Neural Net Refresh + Gradient Descent
[Diagram: x_train → hidden layer (raw / ReLU, weights w1) → softmax output (weights w2)]
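Below is a minimal NumPy sketch of one gradient-descent loop for this two-layer network. The synthetic data standing in for x_train / y_train, the 128-unit hidden layer, and the batch size are assumptions for illustration, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data standing in for x_train / y_train (e.g. flattened MNIST digits).
x_train = rng.normal(size=(64, 784))           # 64 examples, 784 features
y_train = rng.integers(0, 10, size=64)         # integer class labels 0-9

# Weight matrices matching the diagram: input -> hidden (w1), hidden -> output (w2).
w1 = rng.normal(scale=0.01, size=(784, 128))
w2 = rng.normal(scale=0.01, size=(128, 10))
lr = 0.01

for step in range(100):
    # Forward pass: hidden raw / relu, then softmax output.
    hidden_raw = x_train @ w1
    hidden = np.maximum(hidden_raw, 0.0)                         # ReLU
    logits = hidden @ w2
    logits -= logits.max(axis=1, keepdims=True)                  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

    # Cross-entropy loss and its gradient with respect to the logits.
    loss = -np.log(probs[np.arange(len(y_train)), y_train]).mean()
    d_logits = probs.copy()
    d_logits[np.arange(len(y_train)), y_train] -= 1.0
    d_logits /= len(y_train)

    # Backward pass through both layers.
    d_w2 = hidden.T @ d_logits
    d_hidden = (d_logits @ w2.T) * (hidden_raw > 0)              # ReLU gradient
    d_w1 = x_train.T @ d_hidden

    # (Full-batch) gradient descent update.
    w1 -= lr * d_w1
    w2 -= lr * d_w2
```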
Stochastic Gradient Descent
Dramatic Speedup
Sub-linear returns to adding more data to each batch
Crucial learning rate hyperparameter
Schedule the learning rate to decay during training (a mini-batch loop with such a schedule is sketched below)
SGD introduces noise into the gradient estimate
The gradient estimate will almost never fully converge to 0
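A sketch of the mini-batch loop with a decaying learning rate. The params dict, the compute_gradient callback, and the 1/t decay form are hypothetical choices for illustration, not taken from the slides.

```python
import numpy as np

def sgd(params, compute_gradient, x, y, lr0=0.01, decay=1e-3,
        batch_size=32, epochs=10, rng=None):
    """Plain mini-batch SGD with a 1/t learning-rate schedule.

    params is a dict of weight arrays; compute_gradient is assumed to
    return a dict of gradients with the same keys (both hypothetical).
    """
    if rng is None:
        rng = np.random.default_rng(0)
    step = 0
    for _ in range(epochs):
        order = rng.permutation(len(x))            # reshuffle the data each epoch
        for start in range(0, len(x), batch_size):
            idx = order[start:start + batch_size]  # mini-batch: noisy gradient estimate
            grads = compute_gradient(params, x[idx], y[idx])
            lr = lr0 / (1.0 + decay * step)        # schedule: learning rate decays over training
            for key in params:
                params[key] -= lr * grads[key]
            step += 1
    return params
```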
Stochastic Gradient Descent
[Figure: training on MNIST, 1 hidden layer, lr = 1.0 (a typical lr is 0.01)]
Momentum
Dramatically Accelerates Learning
1. Initialize the learning rate and a momentum matrix the same size as the weights
2. At each SGD iteration, collect the gradient
3. Update the momentum matrix to be the old momentum matrix times the momentum hyperparameter, plus the learning rate times the collected gradient (see the sketch below)
s = 0.9 (momentum hyperparameter); t.layers[i].moment1 = layer i's momentum matrix; lr = 0.01; gradient = the gradient collected by SGD
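A sketch of the update in step 3, reusing the slide's names (s, lr) and treating moments[i] as the slide's t.layers[i].moment1; the list-of-arrays layout is an assumption.

```python
def momentum_step(weights, moments, gradients, lr=0.01, s=0.9):
    """One SGD-with-momentum update.

    weights, moments, gradients: lists of arrays, one per layer, where
    moments[i] plays the role of t.layers[i].moment1 on the slide and
    is initialized to zeros before training.
    """
    for i in range(len(weights)):
        # Momentum = momentum hyperparameter * old momentum + lr * new gradient.
        moments[i] = s * moments[i] + lr * gradients[i]
        # Step in the direction of the accumulated momentum.
        weights[i] -= moments[i]
    return weights, moments
```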
[Figure: training on MNIST, 2 hidden layers]
Intuition for Momentum
Automatically cancels out noise in the gradient
Amplifies small but consistent gradients
“Momentum” derives from the physical analogy [momentum = mass * velocity]
Assumes unit mass
The velocity vector is the particle's momentum
Deals well with heavy curvature
Momentum Accelerates the Gradient
A gradient that keeps pointing in the same direction can drive the velocity up to lr * ||gradient|| / (1 - s). With s = 0.9, the step maxes out at 10 * lr in the direction of the accumulated gradient (checked numerically below).
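A quick numerical check of that bound, assuming a constant unit gradient: the velocity from the momentum update converges to lr / (1 - s), i.e. 10 * lr when s = 0.9.

```python
lr, s = 0.01, 0.9
g = 1.0                      # constant gradient in one direction
v = 0.0
for _ in range(200):
    v = s * v + lr * g       # same update as the momentum sketch above
print(v, lr / (1 - s))       # both are approximately 0.1 = 10 * lr
```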
Asynchronous SGD similar to Momentum
In distributed SGD, asynchronous training lets each worker update the parameters as soon as it finishes, instead of waiting for all workers to finish
The resulting stale updates act like a weighted average of previous gradients applied to the current weights
Nesterov Momentum
Evaluate the gradient at the look-ahead point reached by the momentum step, rather than at the current weights (sketched below)
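A sketch of that look-ahead evaluation; grad_fn is a hypothetical function returning per-layer gradients, and the only change from the plain momentum sketch above is where the gradient is computed.

```python
def nesterov_step(weights, moments, grad_fn, lr=0.01, s=0.9):
    """One Nesterov-momentum update for lists of per-layer weight arrays."""
    # Look ahead: evaluate the gradient at the point the momentum step would reach.
    lookahead = [w - s * m for w, m in zip(weights, moments)]
    gradients = grad_fn(lookahead)
    for i in range(len(weights)):
        moments[i] = s * moments[i] + lr * gradients[i]
        weights[i] -= moments[i]
    return weights, moments
```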
[Figure: training on MNIST, 2 hidden layers]
Adaptive Learning Rate Algorithms
Adagrad
Duchi et al., 2011
RMSProp
Hinton, 2012
Adam
Kingma and Ba, 2014
The idea is to auto-tune the learning rate, making the network less sensitive to this hyperparameter.
Adagrad
Shrinks the learning rate adaptively, per weight
The effective learning rate is inversely proportional to the square root of the accumulated squared gradient history (sketched below)
r = squared gradient history; g = gradient; theta = weights; epsilon = global learning rate; delta = small constant for numerical stability
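A sketch of the Adagrad update using the slide's symbols (r, g, theta, epsilon, delta); how the gradient g is computed is left outside the sketch.

```python
import numpy as np

def adagrad_step(theta, r, g, epsilon=0.01, delta=1e-7):
    """One Adagrad update on a single weight array.

    theta: weights, g: gradient, r: running sum of squared gradients
    (initialized to zeros), epsilon: global learning rate,
    delta: small constant for numerical stability.
    """
    r = r + g * g                                    # accumulate squared gradient history
    theta = theta - (epsilon / (delta + np.sqrt(r))) * g  # per-weight scaled step
    return theta, r
```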
Intuition for Adagrad
Instead of setting a single global learning rate, have a different learning rate for
every weight in the network
Parameters with the largest derivative have a rapid decrease in learning rate
Parameters with small derivatives have a small decrease in learning rate
We get much more progress in more gently sloped directions of parameter
space.
Downside - accumulating gradients from the beginning leads to extremely small
learning rates later in training
Downside - doesn’t deal well with differences in global and local structure
RMSProp
Keep an exponentially weighted moving average of the squared gradient to adapt the learning rate (sketched below)
Performs well in the non-convex setting, where global and local structure differ
Can be combined with momentum / Nesterov momentum
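A sketch of the RMSProp update; rho is the decay rate of the exponentially weighted average, and the 0.9 default here is a common choice rather than a value from the slides.

```python
import numpy as np

def rmsprop_step(theta, r, g, epsilon=0.001, rho=0.9, delta=1e-6):
    """One RMSProp update on a single weight array.

    Identical in spirit to Adagrad, except r is an exponentially weighted
    average of squared gradients rather than a running sum, so old history
    decays away instead of shrinking the learning rate forever.
    """
    r = rho * r + (1.0 - rho) * g * g
    theta = theta - (epsilon / np.sqrt(delta + r)) * g
    return theta, r
```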
[Figures: training on MNIST, 1 hidden layer]
Adam
Short for “Adaptive Moments”
Exponentially weighted average of the gradient for momentum (first moment)
Exponentially weighted average of the squared gradient for adapting the learning rate (second moment)
Bias correction on both moments to compensate for their zero initialization early in training (see the sketch below)
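A sketch of the Adam update with the defaults from the paper (lr = 0.001, beta1 = 0.9, beta2 = 0.999); s and r are the first and second moment estimates, initialized to zeros, and t is the 1-based step count.

```python
import numpy as np

def adam_step(theta, s, r, g, t, lr=0.001, beta1=0.9, beta2=0.999, delta=1e-8):
    """One Adam update on a single weight array."""
    s = beta1 * s + (1 - beta1) * g          # first moment: EW average of the gradient
    r = beta2 * r + (1 - beta2) * g * g      # second moment: EW average of the squared gradient
    s_hat = s / (1 - beta1 ** t)             # bias correction: both moments start at zero
    r_hat = r / (1 - beta2 ** t)
    theta = theta - lr * s_hat / (np.sqrt(r_hat) + delta)
    return theta, s, r
```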
Adam
[Figure: training on MNIST, 5 hidden layers]
Thank you!
Questions?
Bibliography
Adam paper - https://arxiv.org/abs/1412.6980
Adagrad - http://jmlr.org/papers/v12/duchi11a.html
RMSProp - http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
Deep Learning Textbook - http://www.deeplearningbook.org/
