Basic Concepts of Large Scale
Optimization for Machine Learning
Devdatt Dubhashi
AI and Data Science
Computer Science and Engineering
Chalmers
Machine Intelligence Sweden AB
Behind the Cat Pictures …
• Amazing successes of ML in
computer vision, natural
language processing …
• Under the hood is
optimization
• Large scale machine learning:
– large n (data points)
– large d (dimension)
Minimization of Finite Sums
$\min_{w \in \mathbb{R}^d} f(w) = \frac{1}{n} \sum_{i=1}^n f_i(w)$
• Assumptions on the component functions $f_i$: convex, smooth, …
Empirical Risk Minimization (ERM)
• Labelled training data: $(x_1, y_1), \ldots, (x_n, y_n)$
• Parametrized class of prediction functions: $h(\cdot\,; w)$, $w \in \mathbb{R}^d$
• Empirical loss: $\min_w \frac{1}{n} \sum_{i=1}^n \ell(h(x_i; w), y_i)$
Data Driven Clustering
• Given data points, cluster them
• Classic K-means algorithm
• Needs to know k, the number of clusters, in advance
• Data-driven clustering: find the right number of clusters driven by the data. (Panahi, Dubhashi: ICML 2017)
Minimization of Finite Sums
$\min_{w \in \mathbb{R}^d} f(w) = \frac{1}{n} \sum_{i=1}^n f_i(w)$
• Assumptions on the component functions $f_i$: convex, smooth, …
Mother of All First-Order Methods: Gradient Descent
$w_{k+1} = w_k - \gamma_k \nabla f(w_k)$
Gradient Descent Convergence
GD converges at rate $O(1/k)$ on smooth convex functions, and linearly on smooth, strongly convex ones. However, GD is not viable for large-scale ML because each iteration costs $O(nd)$.
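As a concrete illustration (not from the slides), here is a minimal NumPy sketch of full gradient descent on a synthetic least-squares problem; the data, step size, and iteration count are illustrative assumptions:

```python
import numpy as np

def gradient_descent(grad_f, w0, step, iters):
    """Plain gradient descent: w_{k+1} = w_k - step * grad_f(w_k)."""
    w = w0.copy()
    for _ in range(iters):
        w = w - step * grad_f(w)
    return w

# Illustrative least-squares objective: f(w) = (1/2n) ||Xw - y||^2
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5)
grad = lambda w: X.T @ (X @ w - y) / X.shape[0]   # one call costs O(nd)
w_hat = gradient_descent(grad, np.zeros(5), step=0.1, iters=500)
```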
Stochastic Gradient Descent (SGD)
Robbins and Monro 1951
$w_{k+1} = w_k - \gamma_k \nabla f_{i_k}(w_k)$
• Index $i_k$ sampled uniformly at random with
replacement from $[n]$
• Cost per iteration is $O(d)$
• Hugely successful in machine learning!
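A matching sketch of SGD on the same kind of synthetic problem (data and step size again assumed for illustration); each update touches a single data point, so it costs O(d) rather than O(nd):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                     # illustrative data
y = X @ rng.normal(size=5)
n, d = X.shape

def sgd(w0, step, iters, seed=1):
    """SGD: w_{k+1} = w_k - step * grad f_{i_k}(w_k), i_k uniform on [n]."""
    g = np.random.default_rng(seed)
    w = w0.copy()
    for _ in range(iters):
        i = g.integers(n)                         # with replacement from [n]
        w = w - step * (X[i] @ w - y[i]) * X[i]   # O(d) per iteration
    return w

w_hat = sgd(np.zeros(d), step=0.01, iters=10000)
```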
Stochastic, Batch and Full Gradient Descent
• Full GD: $w_{k+1} = w_k - \gamma_k \frac{1}{n} \sum_{i=1}^n \nabla f_i(w_k)$
• Minibatch GD: $w_{k+1} = w_k - \gamma_k \frac{1}{|B_k|} \sum_{i \in B_k} \nabla f_i(w_k)$, with $B_k \subseteq [n]$ a random batch
• Stochastic GD: $w_{k+1} = w_k - \gamma_k \nabla f_{i_k}(w_k)$, with $i_k$ uniform on $[n]$
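The three estimators side by side on the same synthetic data (batch size assumed); all three are unbiased estimates of the full gradient and trade per-iteration cost against variance:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5)
n = X.shape[0]
w = np.zeros(5)

g_full = X.T @ (X @ w - y) / n                    # full gradient: O(nd)

batch = rng.choice(n, size=10, replace=False)     # minibatch: O(|B| d)
g_mini = X[batch].T @ (X[batch] @ w - y[batch]) / len(batch)

i = rng.integers(n)                               # stochastic: O(d)
g_sto = (X[i] @ w - y[i]) * X[i]
```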
The Unreasonable Effectiveness of SGD
• Very fast initial convergence
• Cheap O(d) per iteration as
opposed to O(nd) for full GD
• Very slow at the end …
Convergence is only $O(1/\sqrt{k})$ for
smooth convex and $O(1/k)$ for smooth,
strongly convex functions.
• … but we do not need to run the
iterations to the optimum; better to
stop early (Bottou and Bousquet)
SGD: Have the Cake and Eat it Too!
(Bottou and Bousquet 2008)
Variance/Noise Reduction
Can we improve convergence of SGD?
Variance Reduction: Three Takes
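One standard take on variance reduction is SVRG (Johnson and Zhang, 2013): recenter each stochastic gradient at a full gradient computed once per epoch at a snapshot point. A minimal sketch on the synthetic least-squares problem, with assumed step size and epoch length:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5)
n, d = X.shape
grad_i = lambda w, i: (X[i] @ w - y[i]) * X[i]    # one component gradient

def svrg(w0, step, epochs, m, seed=1):
    """SVRG: variance-reduced stochastic gradients around a snapshot."""
    g = np.random.default_rng(seed)
    w_snap = w0.copy()
    for _ in range(epochs):
        mu = X.T @ (X @ w_snap - y) / n           # full gradient, once per epoch
        w = w_snap.copy()
        for _ in range(m):
            i = g.integers(n)
            # unbiased estimate whose variance vanishes near the optimum
            v = grad_i(w, i) - grad_i(w_snap, i) + mu
            w = w - step * v
        w_snap = w
    return w_snap

w_hat = svrg(np.zeros(d), step=0.05, epochs=20, m=2 * n)
```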
Nesterov Momentum for GD
Y. Nesterov, Doklady 1983
$y_k = w_k + \beta_k (w_k - w_{k-1})$, $\quad w_{k+1} = y_k - \gamma \nabla f(y_k)$
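A minimal sketch of the update above on an ill-conditioned quadratic (step and momentum constants are assumed here, rather than tuned by Nesterov's schedule):

```python
import numpy as np

def nesterov_gd(grad_f, w0, step, momentum, iters):
    """Nesterov's accelerated gradient: extrapolate, then take a gradient step."""
    w, w_prev = w0.copy(), w0.copy()
    for _ in range(iters):
        v = w + momentum * (w - w_prev)           # look-ahead point
        w_prev, w = w, v - step * grad_f(v)
    return w

# Illustrative ill-conditioned quadratic: f(w) = 0.5 * w^T A w
A = np.diag([1.0, 100.0])
w_hat = nesterov_gd(lambda w: A @ w, np.array([1.0, 1.0]),
                    step=0.009, momentum=0.9, iters=300)
```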
Katyusha Momentum for SGD
Z. Allen-Zhu, STOC 2017, JMLR 2018
Non-smooth Objectives
• What if the objective is non-smooth?
• Need a proxy for gradients.
• LASSO: $\min_w \frac{1}{2n} \|Xw - y\|_2^2 + \lambda \|w\|_1$
• SON clustering: $\min_{u_1, \ldots, u_n} \frac{1}{2} \sum_{i=1}^n \|x_i - u_i\|_2^2 + \lambda \sum_{i<j} \|u_i - u_j\|_2$
Proximal Operator
• Proximal operator: $\mathrm{prox}_{\gamma g}(x) = \arg\min_y \left\{ g(y) + \frac{1}{2\gamma} \|y - x\|^2 \right\}$
• Special case, projection: for $g$ the indicator of a convex set $C$, $\mathrm{prox}_{\gamma g}(x) = \Pi_C(x)$
• Like a gradient step: for smooth $g$, $\mathrm{prox}_{\gamma g}(x) \approx x - \gamma \nabla g(x)$
• Fixed points are minimizers: $x = \mathrm{prox}_{\gamma g}(x) \iff x \in \arg\min g$
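Two proximal operators that come in closed form, the $\ell_1$ norm (soft-thresholding, as used for LASSO) and the indicator of a box (projection); a small sketch with illustrative names:

```python
import numpy as np

def prox_l1(x, t):
    """prox of t * ||.||_1 in closed form: soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def project_box(x, lo, hi):
    """prox of the indicator of the box [lo, hi]^d is projection onto it."""
    return np.clip(x, lo, hi)

print(prox_l1(np.array([3.0, -0.5, 1.2]), t=1.0))  # -> [ 2. -0.  0.2]
```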
Proximal Gradient Algorithm
• Objective split: $\min_w f(w) + g(w)$, with $f$ smooth and $g$ non-smooth
• Iteration: $w_{k+1} = \mathrm{prox}_{\gamma g}(w_k - \gamma \nabla f(w_k))$
• Special cases:
– $g = 0$: usual gradient descent
– $f = 0$: proximal point algorithm
– $g =$ indicator of a convex set: projected gradient descent (constrained optimization)
• Only works if the proximal operator can be evaluated efficiently!
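Putting the pieces together: a minimal proximal gradient (ISTA-style) sketch for the LASSO objective, on synthetic sparse data; the step size and regularization strength are assumed for illustration:

```python
import numpy as np

def prox_l1(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista(X, y, lam, step, iters):
    """Proximal gradient for LASSO:
    w <- prox_{step*lam*||.||_1}(w - step * gradient of the smooth part)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        grad = X.T @ (X @ w - y) / n              # gradient of the smooth part
        w = prox_l1(w - step * grad, step * lam)  # prox step on the l1 part
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
w_true = np.zeros(20); w_true[:3] = [2.0, -1.0, 0.5]   # sparse ground truth
y = X @ w_true + 0.01 * rng.normal(size=100)
w_hat = ista(X, y, lam=0.1, step=0.1, iters=1000)      # recovers the support
```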
PointSAGA: Stochastic Prox with Variance Reduction
Defazio 2016
MP-SAGA: Stochastic Prox with Variance Reduction
Panahi, Dubhashi, ICML 2017 (2019, under review): proximal operator in closed form!
SGD for Deep Learning
• SGD variants (Adagrad, RMSprop, Adam, …) are used to train neural networks.
• They use aggressive adaptation, with different learning rates for different parameters.
• Classical convex theory says they shouldn't work for such highly nonconvex problems!
• But Adagrad greatly improved the robustness of SGD, and Google used it for training large-scale neural nets to recognize cats.
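For concreteness, a minimal sketch of the Adagrad update (not Google's implementation; the data and base step size are illustrative); each coordinate's effective step shrinks with its accumulated squared gradients:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5)
n, d = X.shape

def adagrad(step, iters, eps=1e-8, seed=1):
    """Adagrad: per-coordinate step sizes from accumulated squared gradients."""
    g = np.random.default_rng(seed)
    w = np.zeros(d)
    G = np.zeros(d)                               # running sum of squared grads
    for _ in range(iters):
        i = g.integers(n)
        grad = (X[i] @ w - y[i]) * X[i]
        G += grad * grad
        w -= step * grad / (np.sqrt(G) + eps)     # adaptive per-coordinate rate
    return w

w_hat = adagrad(step=0.5, iters=5000)
```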
Variance Reduction for Deep Learning
Defazio, Bottou, 2019
References
Francis Bach, tutorials/short courses: https://www.di.ens.fr/~fbach/