Doubly Accelerated
Stochastic Variance Reduced Gradient Methods
for Regularized Empirical Risk Minimization
Tomoya Murata†, Taiji Suzuki‡§¶
†NTT DATA Mathematical Systems Inc., ‡The University of Tokyo, §RIKEN, ¶PRESTO
Jan. 13, 2018
This Presentation
Murata and Suzuki:
Doubly Accelerated Stochastic Variance Reduced Dual Averaging
Method for Regularized Empirical Risk Minimization, NIPS 2017
+ some extensions
Overview
What:
New methods for solving convex composite optimization in
mini-batch settings
Main result:
Improvement of the mini-batch efficiency of previous methods
− Mini-batch efficiency: we say that A is more mini-batch efficient than B if A's necessary mini-batch size for achieving a given iteration complexity is smaller than B's.
− Iteration complexity: the number of parameter updates necessary to achieve a desired optimization error.
Outline
1 Problem Setup
2 Previous Work
3 Proposed methods
4 Numerical Experiments
5 Summary
Smoothness
Definition:
We say that f : R^d → R is (L, ℓ)-smooth (L > 0) if
  −(ℓ/2)∥x − y∥² ≤ f(x) − f(y) − ⟨∇f(y), x − y⟩ ≤ (L/2)∥x − y∥².
− Lower smoothness ℓ ≤ 0 implies (strong) convexity of f
− Lower smoothness ℓ > 0 allows f to be non-convex
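A small worked example (ours, not from the original slides) makes the two constants concrete: for a quadratic, they are exactly the extreme eigenvalues of the Hessian.

```latex
\begin{aligned}
f(x) &= \tfrac{1}{2}\, x^\top A x \quad (A \text{ symmetric}) \\
\Rightarrow\ f(x) - f(y) - \langle \nabla f(y),\, x - y \rangle
  &= \tfrac{1}{2}\,(x - y)^\top A\,(x - y) \\
\Rightarrow\ \tfrac{\lambda_{\min}(A)}{2}\,\|x - y\|^2
  \;\le\; f(x) - f(y) - \langle \nabla f(y),\, x - y \rangle
  \;&\le\; \tfrac{\lambda_{\max}(A)}{2}\,\|x - y\|^2 .
\end{aligned}
```

So f is (λmax(A), −λmin(A))-smooth, and it is strongly convex exactly when λmin(A) > 0; quadratics of this form reappear in the eigenvector subproblem later in the deck.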
Convex Composite Optimization
Focus of this presentation:
min_{x∈R^d} { P(x) := F(x) + R(x) := (1/n) ∑_{i=1}^n fi(x) + R(x) }
F: (L, −µ)-smooth (L > 0, µ > 0) (i.e., µ-strongly convex)
fi: (L, ℓ)-smooth (L > 0, ℓ ≥ 0) (i.e., generally nonconvex)
R: simple and (possibly) non-differentiable convex
Examples (ℓ = 0)
(a1, b1), . . . , (an, bn) ∈ R^d × R: training set.
Lasso:
  fi(x) := (1/2)(ai⊤x − bi)², R(x) := λ∥x∥₁
Elastic Net logistic regression:
  fi(x) := log(1 + exp(−bi ai⊤x)) + (λ2/2)∥x∥₂², R(x) := λ1∥x∥₁
Support vector machines:
  fi(x) := h̄^ν_i(ai⊤x) + (λ/2)∥x∥₂², R(x) := 0
− h̄^ν_i: smoothed variant of the hinge loss hi(u) := max{0, 1 − bi u}
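To make the setup concrete, here is a minimal NumPy sketch (ours, not from the slides) of the Elastic Net logistic regression components and of the proximal operator of R(x) = λ1∥x∥₁, i.e. soft-thresholding; all function names and signatures are illustrative.

```python
import numpy as np

def f_i(x, a_i, b_i, lam2):
    """Elastic Net logistic loss of one example: log(1 + exp(-b_i <a_i, x>)) + (lam2/2)||x||^2."""
    return np.log1p(np.exp(-b_i * (a_i @ x))) + 0.5 * lam2 * (x @ x)

def grad_f_i(x, a_i, b_i, lam2):
    """Gradient of f_i: -b_i * sigmoid(-b_i <a_i, x>) * a_i + lam2 * x."""
    sigma = 1.0 / (1.0 + np.exp(b_i * (a_i @ x)))   # = sigmoid(-b_i <a_i, x>)
    return -b_i * sigma * a_i + lam2 * x

def prox_R(v, eta, lam1):
    """prox of eta * lam1 * ||.||_1: coordinate-wise soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - eta * lam1, 0.0)
```

Each fi here is (L, 0)-smooth with L ≤ ∥ai∥²/4 + λ2, and R is "simple" in the sense that its prox is available in closed form.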
Examples (ℓ > 0)
Recently, Carmon et al. (2016), Allen-Zhu and Li (2017) and Yu
et al. (2017) have proposed algorithms for finding second-order
stationary points of smooth non-convex objectives.
− x is an (ε, δ)-second-order stationary point of f :⇔ ∥∇f(x)∥₂ ≤ ε and ∇²f(x) ⪰ −δ
These algorithms are essentially based on two building blocks:
finding a first-order stationary point
finding a direction of the objective that has negative curvature
To exploit negative curvature, these algorithms compute an eigenvector of the Hessian corresponding to its minimum eigenvalue.
http://bair.berkeley.edu/blog/2017/08/31/saddle-efficiency/
Fast eigenvector computation:
Recently, Garber et al. (2016) proposed a novel method for finding approximate eigenvectors via convex optimization.
Essential subproblem:
  min_{z∈R^d} { g(z) := (1/n) ∑_{i=1}^n gi(z) := (1/n) ∑_{i=1}^n [ (1/2) z⊤(λ + ∇²fi(x0))z − ⟨y, z⟩ ] }
− λ > λmin(∇²F(x0)) is assumed
− z∗ = (λ + ∇²F(x0))⁻¹ y
g is (λ + λmax(∇²F(x0)), −(λ − λmin(∇²F(x0))))-smooth
gi is (λ + λmax(∇²fi(x0)), −(λ − λmin(∇²fi(x0))))-smooth
Note that generally −(λ − λmin(∇²fi(x0))) > 0, even though −(λ − λmin(∇²F(x0))) < 0.
Outline
1 Problem Setup
2 Previous Work
3 Proposed methods
4 Numerical Experiments
5 Summary
Relationships between Previous Work
[Diagram: relationships between previous work — GD, SGD, AGD, SVRG, AccProxSVRG, Katyusha, UC + SVRG, Inexact PPA/APPA, and This Work, connected by randomization, variance reduction, inner/outer acceleration, Universal Catalyst, and Katyusha momentum; today's focus is highlighted.]
Relationships between Previous Work
[Diagram repeated: relationships between previous work (see above).]
SVRG [Johnson and Zhang (2013); Xiao and Zhang (2014)]
(Proximal) Stochastic Variance Reduced Gradient
= SGD + Variance Reduction
SVRG(x0, η, m, b, S)
Iterating the following for s = 1, 2, . . . , S:
  xs = One Stage SVRG(xs−1, η, m, b)
Output: xS.

One Stage SVRG(x0, η, m, b)
Iterating the following for k = 1, 2, . . . , m:
  Pick Ik ⊂ {1, 2, . . . , n} with size b uniformly.
  vk = (1/b) ∑_{i∈Ik} (∇fi(xk−1) − ∇fi(x0)) + ∇F(x0).
  xk = prox_{ηR}(xk−1 − η vk).
Output: (1/m) ∑_{k=1}^m xk.
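A minimal NumPy sketch of One Stage SVRG as written above (ours, not the authors' code); grad_fi, prox_R and the other names are illustrative assumptions.

```python
import numpy as np

def one_stage_svrg(x0, grad_fi, prox_R, n, eta, m, b, rng):
    """One stage of mini-batch proximal SVRG.

    grad_fi(x, i): gradient of f_i at x; prox_R(v, eta): prox of eta * R."""
    full_grad = np.mean([grad_fi(x0, i) for i in range(n)], axis=0)   # ∇F(x0), computed once
    x = x0.copy()
    iterates = []
    for _ in range(m):
        I = rng.choice(n, size=b, replace=False)                      # mini-batch I_k
        v = np.mean([grad_fi(x, i) - grad_fi(x0, i) for i in I], axis=0) + full_grad
        x = prox_R(x - eta * v, eta)                                  # proximal gradient step
        iterates.append(x)
    return np.mean(iterates, axis=0)                                  # averaged iterate
```

The outer routine simply calls this S times, feeding the output back in as the next reference point x0.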
vk = ∇fIk(xk−1) − ∇fIk(x0) + ∇F(x0), where ∇fIk := (1/b) ∑_{i∈Ik} ∇fi
Main Idea: Usage of vk as an unbiased estimator of ∇F(xk−1)
− V[vk] → 0 as xk−1, x0 → x∗
− Computational cost per inner iteration is the same as SGD's
[Figure: the variance-reduced direction vk is built from ∇fIk(xk−1), ∇fIk(x0) and the full gradient ∇F(x0), and approximates ∇F(xk−1); x0 (initial), xk−1 (current), xk (next).]
Comparisons of Iteration Complexities:
        ℓ = 0                          ℓ ≥ 0
SGD     O(L/ε + 1/(bµε))               O(L/ε + 1/(bµε))
SVRG    O((n/b + L/µ) log(1/ε))        O((n/b + L/µ + Lℓ/(bµ²)) log(1/ε))
n: training set size, L: upper smoothness of fi, ℓ: lower smoothness of fi,
b: mini-batch size, ε: optimization error
− Linear convergence
− Limit in mini-batch settings: SVRG requires at least O((L/µ) log(1/ε)) iterations for any mini-batch size b
Questions:
Is the mini-batch efficiency of SVRG improvable?
Can SVRG be accelerated by Nesterov's method?
Relationships between Previous Work
[Diagram repeated: relationships between previous work (see above).]
AccProxSVRG [Nitanda (2014)]
Accelerated Proximal SVRG = SVRG + Inner Acceleration
AccProxSVRG(x0, η, β, m, b, S)
Iterating the following for s = 1, 2, . . . , S:
xs = One Stage AccProxSVRG(xs−1, η, β, m, b).
Output: xS.
One Stage AccProxSVRG(x0, η, β, m, b)
Iterating the following for k = 1, 2, . . . , m:
  Pick Ik ⊂ {1, 2, . . . , n} with size b uniformly.
  yk = xk−1 + β(xk−1 − xk−2).
  vk = (1/b) ∑_{i∈Ik} (∇fi(yk) − ∇fi(x0)) + ∇F(x0).
  xk = prox_{ηR}(yk − η vk).
Output: xm.
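The corresponding sketch for One Stage AccProxSVRG (ours, with illustrative names) differs from the SVRG sketch above only in the momentum point yk, where the variance-reduced gradient is evaluated, and in returning the last iterate:

```python
import numpy as np

def one_stage_accprox_svrg(x0, grad_fi, prox_R, n, eta, beta, m, b, rng):
    """One stage of AccProxSVRG: SVRG inner loop + Nesterov momentum."""
    full_grad = np.mean([grad_fi(x0, i) for i in range(n)], axis=0)   # ∇F(x0)
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(m):
        y = x + beta * (x - x_prev)                                   # inner momentum step
        I = rng.choice(n, size=b, replace=False)
        v = np.mean([grad_fi(y, i) - grad_fi(x0, i) for i in I], axis=0) + full_grad
        x_prev, x = x, prox_R(y - eta * v, eta)                       # proximal step from y
    return x                                                          # last iterate, not an average
```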
yk = xk−1 + β(xk−1 − xk−2)
Main Idea: Usage of Nesterov’s momentum in each inner iteration
[Figure: the momentum step — xk−2 (previous), xk−1 (current), yk = xk−1 + β(xk−1 − xk−2), and the next iterate xk obtained from yk.]
Comparisons of Iteration Complexities:
              ℓ = 0                                  ℓ ≥ 0
SVRG          O((n/b + L/µ) log(1/ε))                O((n/b + L/µ + Lℓ/(bµ²)) log(1/ε))
AccProxSVRG   O((n/b + L/(bµ) + √(L/µ)) log(1/ε))    No analysis
n: training set size, L: upper smoothness of fi, ℓ: lower smoothness of fi,
b: mini-batch size, ε: optimization error
− Linear speed-up w.r.t. mini-batch size b: L/µ (SVRG) → L/(bµ) (AccProxSVRG)
− No acceleration in non-mini-batch settings: the rate of AccProxSVRG is the same as that of SVRG when b = 1
Question:
Can the identical rate of AccProxSVRG and SVRG in non-mini-batch settings be improved?
Relationships between Previous Work
[Diagram repeated: relationships between previous work (see above).]
Universal Catalyst [Lin et al. (2015)]
Universal Catalyst: a generic acceleration framework
Given a non-accelerated algorithm M (for example, SVRG),
UC(x̌0, κ, {βt}, {εt}, T)
Iterating the following for t = 1, 2, . . . , T:
  y̌t = x̌t−1 + βt(x̌t−1 − x̌t−2).
  Define Gt(x) = P(x) + (κ/2)∥x − y̌t∥₂².
  Find x̌t ≈ argmin_{x∈R^d} Gt(x) such that Gt(x̌t) − Gt∗ ≤ εt, using M.
Output: x̌T.
Main Idea: Running IAPPA and solving each subproblem by M
− UC can be regarded as an application of Inexact Accelerated PPA (PPA: Proximal Point Algorithm).
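A sketch of the Catalyst outer loop (ours; solver_M and the parameter handling are illustrative assumptions, not the authors' interface):

```python
import numpy as np

def universal_catalyst(x0, solver_M, kappa, betas, epsilons, T):
    """Universal Catalyst outer loop.

    solver_M(y, kappa, eps) approximately minimizes G_t(x) = P(x) + (kappa/2)*||x - y||^2
    up to accuracy eps with any linearly convergent method M (e.g. SVRG)."""
    x_prev, x = x0.copy(), x0.copy()
    for t in range(T):
        y = x + betas[t] * (x - x_prev)          # outer extrapolation (momentum)
        x_prev = x
        x = solver_M(y, kappa, epsilons[t])      # warm-started subproblem solve
    return x
```

The practical difficulty mentioned below is precisely choosing kappa, betas and the stopping tolerances epsilons[t].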
Comparisons of Iteration Complexities:
          ℓ = 0                            ℓ ≥ 0
SVRG      O((n/b + L/µ) log(1/ε))          O((n/b + L/µ + Lℓ/(bµ²)) log(1/ε))
UC+SVRG   O((n/b + √(nL/(bµ))) log(1/ε))   O((n/b + √(nL/(bµ)) + (n^{3/4}/b)√((Lℓ)^{1/2}/µ)) log(1/ε))
n: training set size, L: upper smoothness of fi, ℓ: lower smoothness of fi,
b: mini-batch size, ε: optimization error, O hides extra log-factors
− Accelerated rate: L/µ (SVRG) → √(nL/(bµ)) (UC + SVRG)
− Sublinear speed up w.r.t mini-batch size b: not sufficient
− Katyusha also achieves the same rate
Practicality:
Hardness of tuning the stopping criteria of the subproblems
Many tuning parameters
Question:
Can the dependency on mini-batch size b be improved?
Outline
1 Problem Setup
2 Previous Work
3 Proposed methods
4 Numerical Experiments
5 Summary
Core Ideas
Double acceleration:
Combining Inner Acceleration and Outer Acceleration
Two approaches:
Applying UC to AccProxSVRG
Directly applying Nesterov’s acceleration to the outer iterations
of AccProxSVRG
The latter algorithm is more direct and practical.
Proposed Algorithm
Doubly Accelerated Stochastic Variance Reduced Dual Averaging
= (SVRDA + Inner Acceleration) + Outer Acceleration
DASVRDA^sc(x̌0, η, m, b, S, T)
Iterating the following for t = 1, 2, . . . , T:
  x̌t = DASVRDA^ns(x̌t−1, η, m, b, S).
Output: x̌T.

DASVRDA^ns(x0, η, m, b, S)
Iterating the following for s = 1, 2, . . . , S:
  ys = xs−1 + ((s−1)/(s+2))(xs−1 − xs−2) + ((s+1)/(s+2))(zs−1 − xs−1).
  (xs, zs) = One Stage AccSVRDA(ys, xs−1, η, β, m, b).
Output: xS.
One Stage AccSVRDA(x0, x, η, β, m, b)
Iterating the following for k = 1, 2, . . . , m:
  Pick Ik ⊂ {1, 2, . . . , n} with size b uniformly.
  yk = xk−1 + ((k−1)/(k+1))(xk−1 − xk−2).
  vk = (1/b) ∑_{i∈Ik} (∇fi(yk) − ∇fi(x)) + ∇F(x).
  v̄k = (1 − 2/(k+1)) v̄k−1 + (2/(k+1)) vk.
  zk = prox_{(ηk(k+1)/4)R}(x0 − (ηk(k+1)/4) v̄k).
  xk = (1 − 2/(k+1)) xk−1 + (2/(k+1)) zk.
Output: (xm, zm).
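A sketch of One Stage AccSVRDA as written above (ours, with illustrative names; the lazy updates used in the actual method are omitted here):

```python
import numpy as np

def one_stage_acc_svrda(x0, x_ref, grad_fi, prox_R, n, eta, m, b, rng):
    """One stage of accelerated SVR dual averaging with mini-batches.

    x_ref is the reference point for variance reduction (the 'x' in the pseudocode)."""
    full_grad = np.mean([grad_fi(x_ref, i) for i in range(n)], axis=0)  # ∇F(x)
    x_prev, x = x0.copy(), x0.copy()
    z = x0.copy()
    v_bar = np.zeros_like(x0)
    for k in range(1, m + 1):
        y = x + (k - 1) / (k + 1) * (x - x_prev)                        # inner momentum
        I = rng.choice(n, size=b, replace=False)
        v = np.mean([grad_fi(y, i) - grad_fi(x_ref, i) for i in I], axis=0) + full_grad
        v_bar = (1 - 2 / (k + 1)) * v_bar + 2 / (k + 1) * v             # dual averaging of gradients
        step = eta * k * (k + 1) / 4
        z = prox_R(x0 - step * v_bar, step)                             # prox step anchored at x0
        x_prev, x = x, (1 - 2 / (k + 1)) * x + 2 / (k + 1) * z
    return x, z
```

DASVRDA^ns then wraps this with the outer momentum ys, including the new term ((s+1)/(s+2))(zs−1 − xs−1) discussed below.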
Main Idea: Combining Inner Acceleration and Outer Acceleration
− For outer acceleration, adding the new momentum term ((s+1)/(s+2))(zs−1 − xs−1)
− AccSVRDA = AccSDA [Xiao (2009)] + Variance Reduction
Why SVRDA rather than SVRG?
− Only because lazy updates for AccSVRDA can be constructed.
ys = xs−1 + ((s−1)/(s+2))(xs−1 − xs−2) + ((s+1)/(s+2))(zs−1 − xs−1)
[Figure: the outer update — xs−1 = xm (a weighted average of z1, z2, . . . , zm), zs−1 = zm, the usual momentum (xs−1 − xs−2), and the new momentum (zs−1 − xs−1) combine into ys, from which xs is produced.]
Convergence Analysis (ℓ = 0)
Theorem (ℓ = 0)
Assume that F is (L, −µ)-smooth and each fi is (L, 0)-smooth. If we appropriately choose η = O(1/((1 + n/b²)L)), S = O(1 + (b/n)√(L/µ) + √(L/(nµ))) and T = O(1), then DASVRDA^sc achieves an iteration complexity of
  O((n/b + (1/b)√(nL/µ) + √(L/µ)) log(1/ε))
for E[P(x̌T) − P(x∗)] ≤ ε.
− In contrast, AccProxSVRG: O((n/b + L/(bµ) + √(L/µ)) log(1/ε)),
  UC + SVRG: O((n/b + √(nL/(bµ))) log(1/ε)).
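A back-of-the-envelope comparison derived from these bounds (our computation, not from the slides), assuming the ill-conditioned regime L/µ ≥ n, shows why DASVRDA is more mini-batch efficient: a mini-batch of size √n already yields the accelerated rate, whereas UC + SVRG needs b on the order of n.

```latex
\begin{aligned}
\text{DASVRDA},\ b=\sqrt{n}:&\quad
\frac{n}{b} + \frac{1}{b}\sqrt{\frac{nL}{\mu}} + \sqrt{\frac{L}{\mu}}
 = \sqrt{n} + 2\sqrt{\frac{L}{\mu}}
 = O\!\left(\sqrt{\frac{L}{\mu}}\right) \quad (L/\mu \ge n),\\
\text{UC + SVRG}:&\quad
\sqrt{\frac{nL}{b\mu}} \le \sqrt{\frac{L}{\mu}}
 \ \Longleftrightarrow\ b \ge n .
\end{aligned}
```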
Extension to ℓ ≥ 0
For generalizing our results to the case ℓ ≥ 0, we adopt the UC + AccProxSVRG approach.
− For a theoretical guarantee, non-trivial modifications to the algorithm of AccProxSVRG are needed.
UC + AccProxSVRG achieves
  O((n/b + (1/b)√(nL/µ) + (n^{3/4}/b)√((Lℓ)^{1/2}/µ)) log(1/ε))
− In contrast, UC + SVRG only achieves
  O((n/b + √(nL/(bµ)) + (n^{3/4}/b)√((Lℓ)^{1/2}/µ)) log(1/ε)).
Outline
1 Problem Setup
2 Previous Work
3 Proposed methods
4 Numerical Experiments
5 Summary
Experimental Settings
Model: Elastic Net logistic regression
− Regularization parameters: (λ1, λ2) = (10^{−4}, 10^{−6}), (0, 10^{−6})
− µ = 10^{−6}, ℓ = 0
Data sets and mini-batch sizes:
Data sets   n        d        b
a9a         32,561   123      180
rcv1        20,242   47,236   140
sido0       12,678   4,932    100
Implemented algorithms: SVRG, UC+SVRG, AccProxSVRG,
UC+AccProxSVRG, APCG (dual), Katyusha, DASVRDA and
DASVRDA with heuristic adaptive restart
Numerical Results
Comparisons on a9a data set:
Figure: (λ1, λ2) = (10^{−4}, 10^{−6})    Figure: (λ1, λ2) = (0, 10^{−6})
Comparisons on rcv1 data set:
Figure: (λ1, λ2) = (10^{−4}, 10^{−6})    Figure: (λ1, λ2) = (0, 10^{−6})
Comparisons on sido0 data set:
Figure: (λ1, λ2) = (10^{−4}, 10^{−6})    Figure: (λ1, λ2) = (0, 10^{−6})
Outline
1 Problem Setup
2 Previous Work
3 Proposed methods
4 Numerical Experiments
5 Summary
Summary
Conclusion:
New methods for solving convex composite optimization in
mini-batch settings
− Improvement of the mini-batch efficiency of previous methods
− Extension to sum-of-nonconvex objectives
− Numerically outperforms state-of-the-art methods
Reference I
Allen-Zhu, Z. (2017). Katyusha: The First Direct Acceleration of
Stochastic Gradient Methods. In 48th Annual ACM Symposium on
the Theory of Computing, pages 19–23.
Allen-Zhu, Z. and Li, Y. (2017). Neon2: Finding local minima via
first-order oracles. arXiv preprint arXiv:1711.06673.
Carmon, Y., Duchi, J. C., Hinder, O., and Sidford, A. (2016).
Accelerated methods for non-convex optimization. arXiv preprint
arXiv:1611.00756.
Garber, D., Hazan, E., Jin, C., Kakade, S. M., Musco, C., Netrapalli, P., and
Sidford, A. (2016). Faster eigenvector computation via
shift-and-invert preconditioning. In Proceedings of The 33rd
International Conference on Machine Learning, volume 48 of
Proceedings of Machine Learning Research, pages 2626–2634.
Reference II
Johnson, R. and Zhang, T. (2013). Accelerating stochastic gradient
descent using predictive variance reduction. In Advances in Neural
Information Processing Systems 26, pages 315–323.
Lin, H., Mairal, J., and Harchaoui, Z. (2015). A universal catalyst for
first-order optimization. In Advances in Neural Information
Processing Systems 28, pages 3384–3392.
Nitanda, A. (2014). Stochastic proximal gradient descent with
acceleration techniques. In Advances in Neural Information
Processing Systems 27, pages 1574–1582.
Xiao, L. (2009). Dual averaging method for regularized stochastic
learning and online optimization. In Advances in Neural
Information Processing Systems 22, pages 2116–2124.
Reference III
Xiao, L. and Zhang, T. (2014). A proximal stochastic gradient
method with progressive variance reduction. SIAM Journal on
Optimization, 24(4), 2057–2075.
Yu, Y., Zou, D., and Gu, Q. (2017). Saving gradient and negative
curvature computations: Finding local minima more efficiently.
arXiv preprint arXiv:1712.03950.