Basic Concepts of Large Scale
Optimization for Machine Learning
Devdatt Dubhashi
AI and Data Science
Computer Science and Engineering
Chalmers
Machine Intelligence Sweden AB
Behind the Cat Pictures …
• Amazing successes of ML in
computer vision, natural
language processing …
• Under the hood is
optimization
• Large scale machine learning:
– large n (data points)
– large d (dimension)
Minimization of Finite Sums
$\min_{w \in \mathbb{R}^d} f(w) = \frac{1}{n} \sum_{i=1}^n f_i(w)$
• Assumptions on the component functions $f_i$: convex, smooth, …
Empirical Risk Minimization (ERM)
• Labelled training data: $(x_1, y_1), \ldots, (x_n, y_n)$
• Parametrized class of prediction functions: $h(\cdot\,; w)$, $w \in \mathbb{R}^d$
• Empirical loss: $\min_w \frac{1}{n} \sum_{i=1}^n \ell(h(x_i; w), y_i)$
Data Driven Clustering
• Given data points, cluster them
• Classic K-means algorithm
• Needs to know k, the number of clusters, in advance
• Data-driven clustering: find the right number of clusters driven by the data. (Panahi, Dubhashi: ICML 2017)
Minimization of Finite Sums
$\min_{w \in \mathbb{R}^d} f(w) = \frac{1}{n} \sum_{i=1}^n f_i(w)$
• Assumptions on the component functions $f_i$: convex, smooth, …
Mother of All First-Order Methods: Gradient Descent
$w_{k+1} = w_k - \gamma_k \nabla f(w_k)$
Gradient Descent Convergence
GD converges at rate $O(1/k)$ on smooth convex functions, and linearly on smooth, strongly convex ones. However, GD is not viable for large-scale ML because each iteration costs $O(nd)$.
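As a concrete illustration (not from the slides), here is a minimal NumPy sketch of full gradient descent on a synthetic least-squares problem; the data, step size, and iteration count are illustrative assumptions:

```python
import numpy as np

def gradient_descent(grad_f, w0, step, iters):
    """Plain gradient descent: w_{k+1} = w_k - step * grad_f(w_k)."""
    w = w0.copy()
    for _ in range(iters):
        w = w - step * grad_f(w)
    return w

# Illustrative least-squares objective: f(w) = (1/2n) ||Xw - y||^2
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5)
grad = lambda w: X.T @ (X @ w - y) / X.shape[0]   # one call costs O(nd)
w_hat = gradient_descent(grad, np.zeros(5), step=0.1, iters=500)
```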
Stochastic Gradient Descent (SGD)
Robbins and Monro 1951
$w_{k+1} = w_k - \gamma_k \nabla f_{i_k}(w_k)$
• Index $i_k$ sampled uniformly at random with
replacement from $[n]$
• Cost per iteration is $O(d)$
• Hugely successful in machine learning!
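A matching sketch of SGD on the same kind of synthetic problem (data and step size again assumed for illustration); each update touches a single data point, so it costs O(d) rather than O(nd):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                     # illustrative data
y = X @ rng.normal(size=5)
n, d = X.shape

def sgd(w0, step, iters, seed=1):
    """SGD: w_{k+1} = w_k - step * grad f_{i_k}(w_k), i_k uniform on [n]."""
    g = np.random.default_rng(seed)
    w = w0.copy()
    for _ in range(iters):
        i = g.integers(n)                         # with replacement from [n]
        w = w - step * (X[i] @ w - y[i]) * X[i]   # O(d) per iteration
    return w

w_hat = sgd(np.zeros(d), step=0.01, iters=10000)
```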
Stochastic, Batch and Full Gradient Descent
• Full GD: $w_{k+1} = w_k - \gamma_k \frac{1}{n} \sum_{i=1}^n \nabla f_i(w_k)$
• Minibatch GD: $w_{k+1} = w_k - \gamma_k \frac{1}{|B_k|} \sum_{i \in B_k} \nabla f_i(w_k)$, with $B_k \subseteq [n]$ a random batch
• Stochastic GD: $w_{k+1} = w_k - \gamma_k \nabla f_{i_k}(w_k)$, with $i_k$ uniform on $[n]$
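The three estimators side by side on the same synthetic data (batch size assumed); all three are unbiased estimates of the full gradient and trade per-iteration cost against variance:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5)
n = X.shape[0]
w = np.zeros(5)

g_full = X.T @ (X @ w - y) / n                    # full gradient: O(nd)

batch = rng.choice(n, size=10, replace=False)     # minibatch: O(|B| d)
g_mini = X[batch].T @ (X[batch] @ w - y[batch]) / len(batch)

i = rng.integers(n)                               # stochastic: O(d)
g_sto = (X[i] @ w - y[i]) * X[i]
```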
The Unreasonable Effectiveness of SGD
• Very fast initial convergence
• Cheap O(d) per iteration as
opposed to O(nd) for full GD
• Very slow at the end …
Convergence is only $O(1/\sqrt{k})$ for
smooth convex and $O(1/k)$ for smooth,
strongly convex functions.
• … but we do not need to run the
iterations to the optimum; better to
stop early (Bottou and Bousquet)
SGD: Have the Cake and Eat it Too!
(Bottou and Bousquet 2008)
Variance/Noise Reduction
Can we improve convergence of SGD?
Variance Reduction: Three Takes
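One standard take on variance reduction is SVRG (Johnson and Zhang, 2013): recenter each stochastic gradient at a full gradient computed once per epoch at a snapshot point. A minimal sketch on the synthetic least-squares problem, with assumed step size and epoch length:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5)
n, d = X.shape
grad_i = lambda w, i: (X[i] @ w - y[i]) * X[i]    # one component gradient

def svrg(w0, step, epochs, m, seed=1):
    """SVRG: variance-reduced stochastic gradients around a snapshot."""
    g = np.random.default_rng(seed)
    w_snap = w0.copy()
    for _ in range(epochs):
        mu = X.T @ (X @ w_snap - y) / n           # full gradient, once per epoch
        w = w_snap.copy()
        for _ in range(m):
            i = g.integers(n)
            # unbiased estimate whose variance vanishes near the optimum
            v = grad_i(w, i) - grad_i(w_snap, i) + mu
            w = w - step * v
        w_snap = w
    return w_snap

w_hat = svrg(np.zeros(d), step=0.05, epochs=20, m=2 * n)
```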
Nesterov Momentum for GD
Y. Nesterov, Doklady 1983
$y_k = w_k + \beta_k (w_k - w_{k-1})$, $\quad w_{k+1} = y_k - \gamma \nabla f(y_k)$
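A minimal sketch of the update above on an ill-conditioned quadratic (step and momentum constants are assumed here, rather than tuned by Nesterov's schedule):

```python
import numpy as np

def nesterov_gd(grad_f, w0, step, momentum, iters):
    """Nesterov's accelerated gradient: extrapolate, then take a gradient step."""
    w, w_prev = w0.copy(), w0.copy()
    for _ in range(iters):
        v = w + momentum * (w - w_prev)           # look-ahead point
        w_prev, w = w, v - step * grad_f(v)
    return w

# Illustrative ill-conditioned quadratic: f(w) = 0.5 * w^T A w
A = np.diag([1.0, 100.0])
w_hat = nesterov_gd(lambda w: A @ w, np.array([1.0, 1.0]),
                    step=0.009, momentum=0.9, iters=300)
```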
Katyusha Momentum for SGD
Z. Allen-Zhu, STOC 2017, JMLR 2018
Non-smooth Objectives
• What if the objective is non-smooth?
• Need a proxy for gradients.
• LASSO: $\min_w \frac{1}{2n} \|Xw - y\|_2^2 + \lambda \|w\|_1$
• SON clustering: $\min_{u_1, \ldots, u_n} \frac{1}{2} \sum_{i=1}^n \|x_i - u_i\|_2^2 + \lambda \sum_{i<j} \|u_i - u_j\|_2$
Proximal Operator
• Proximal operator: $\mathrm{prox}_{\gamma g}(x) = \arg\min_y \left\{ g(y) + \frac{1}{2\gamma} \|y - x\|^2 \right\}$
• Special case, projection: for $g$ the indicator of a convex set $C$, $\mathrm{prox}_{\gamma g}(x) = \Pi_C(x)$
• Like a gradient step: for smooth $g$, $\mathrm{prox}_{\gamma g}(x) \approx x - \gamma \nabla g(x)$
• Fixed points are minimizers: $x = \mathrm{prox}_{\gamma g}(x) \iff x \in \arg\min g$
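Two proximal operators that come in closed form, the $\ell_1$ norm (soft-thresholding, as used for LASSO) and the indicator of a box (projection); a small sketch with illustrative names:

```python
import numpy as np

def prox_l1(x, t):
    """prox of t * ||.||_1 in closed form: soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def project_box(x, lo, hi):
    """prox of the indicator of the box [lo, hi]^d is projection onto it."""
    return np.clip(x, lo, hi)

print(prox_l1(np.array([3.0, -0.5, 1.2]), t=1.0))  # -> [ 2. -0.  0.2]
```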
Proximal Gradient Algorithm
• Objective split: $\min_w f(w) + g(w)$, with $f$ smooth and $g$ non-smooth
• Iteration: $w_{k+1} = \mathrm{prox}_{\gamma g}(w_k - \gamma \nabla f(w_k))$
• Special cases:
– $g = 0$: usual gradient descent
– $f = 0$: proximal point algorithm
– $g =$ indicator of a convex set: projected gradient descent (constrained optimization)
• Only works if the proximal operator can be evaluated efficiently!
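Putting the pieces together: a minimal proximal gradient (ISTA-style) sketch for the LASSO objective, on synthetic sparse data; the step size and regularization strength are assumed for illustration:

```python
import numpy as np

def prox_l1(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista(X, y, lam, step, iters):
    """Proximal gradient for LASSO:
    w <- prox_{step*lam*||.||_1}(w - step * gradient of the smooth part)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        grad = X.T @ (X @ w - y) / n              # gradient of the smooth part
        w = prox_l1(w - step * grad, step * lam)  # prox step on the l1 part
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
w_true = np.zeros(20); w_true[:3] = [2.0, -1.0, 0.5]   # sparse ground truth
y = X @ w_true + 0.01 * rng.normal(size=100)
w_hat = ista(X, y, lam=0.1, step=0.1, iters=1000)      # recovers the support
```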
PointSAGA: Stochastic Prox with Variance Reduction
Defazio 2016
MP-SAGA: Stochastic Prox with Variance Reduction
Panahi, Dubhashi, ICML 2017 (2019, under review): proximal operator in closed form!
SGD for Deep Learning
• SGD variants (Adagrad, RMSprop, Adam, …) are used to train neural networks.
• They use aggressive adaptation, with different learning rates for different parameters.
• Classical convex theory says they shouldn't work for such highly nonconvex problems!
• But Adagrad greatly improved the robustness of SGD, and Google used it for training large-scale neural nets to recognize cats.
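For concreteness, a minimal sketch of the Adagrad update (not Google's implementation; the data and base step size are illustrative); each coordinate's effective step shrinks with its accumulated squared gradients:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5)
n, d = X.shape

def adagrad(step, iters, eps=1e-8, seed=1):
    """Adagrad: per-coordinate step sizes from accumulated squared gradients."""
    g = np.random.default_rng(seed)
    w = np.zeros(d)
    G = np.zeros(d)                               # running sum of squared grads
    for _ in range(iters):
        i = g.integers(n)
        grad = (X[i] @ w - y[i]) * X[i]
        G += grad * grad
        w -= step * grad / (np.sqrt(G) + eps)     # adaptive per-coordinate rate
    return w

w_hat = adagrad(step=0.5, iters=5000)
```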
Variance Reduction for Deep Learning
Defazio, Bottou, 2019
References
Francis Bach, tutorials/short courses: https://www.di.ens.fr/~fbach/