Breaking the Nonsmooth Barrier: A Scalable
Parallel Method for Composite Optimization
Fabian Pedregosa, Rémi Leblond, Simon Lacoste-Julien
Motivation
• Since 2005, the speed of processors has stagnated.
• The number of cores has increased.
• This has driven the development of parallel asynchronous variants of stochastic gradient algorithms:
  SGD → Hogwild (Niu et al. 2011).
  SVRG → Kromagnon (Reddi et al. 2015; Mania et al. 2017).
  SAGA → ASAGA (Leblond, Pedregosa, and Lacoste-Julien 2017).
Composite objective
• These methods assume that the objective function is smooth.
• They therefore cannot be applied to the Lasso, Group Lasso, box constraints, etc.

Objective: minimize the composite objective function

    min_x  f(x) + h(x),   with  f(x) = (1/n) ∑_{i=1}^n f_i(x),

where each f_i is smooth and h is a block-separable (i.e., h(x) = ∑_B h([x]_B)) convex function for which we have access to the proximal operator.
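For concreteness (this example is ours, not from the slides): the ℓ1-regularized least-squares problem, i.e. the Lasso, fits this template with f_i(x) = ½(a_iᵀx − b_i)² and h(x) = λ‖x‖₁, whose proximal operator is soft thresholding. A minimal NumPy sketch on synthetic data, with all names illustrative:

```python
import numpy as np

def f(x, A, b):
    """Smooth part: f(x) = 1/(2n) * ||Ax - b||^2, the average of n squared residuals."""
    n = A.shape[0]
    return 0.5 * np.sum((A @ x - b) ** 2) / n

def h(x, lam):
    """Nonsmooth, block-separable part: l1 penalty (here each block is a single coordinate)."""
    return lam * np.abs(x).sum()

def prox_h(z, lam, gamma):
    """Proximal operator of gamma * h: coordinate-wise soft thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma * lam, 0.0)

# Toy usage with synthetic data.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 20))
b = rng.standard_normal(100)
x = np.zeros(20)
print(f(x, A, b) + h(x, 0.1))                      # composite objective at x = 0
print(prox_h(rng.standard_normal(20), 0.1, 0.05))  # one proximal map evaluation
```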
Sparse Proximal SAGA
Contribution 1: Sparse Proximal SAGA. A variant of SAGA (Defazio, Bach, and Lacoste-Julien 2014) that is particularly efficient when the ∇f_i are sparse.

Like SAGA, it relies on an unbiased gradient estimate and a proximal step:

    v_i = ∇f_i(x) − α_i + D_i ᾱ ;   x⁺ = prox_{γφ_i}(x − γ v_i) ;   α_i⁺ = ∇f_i(x)

Unlike SAGA, D_i and φ_i are designed to give sparse updates while verifying unbiasedness conditions.

Convergence: same linear convergence rate as SAGA, with cheaper updates in the presence of sparsity.
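To make the update concrete, here is a hedged, purely sequential sketch of the dense proximal SAGA iteration applied to the Lasso example above (the problem, step size γ and helper names are our own choices, not the slides'). It takes the dense case D_i = I and φ_i = h; the sparse variant of the slide instead uses a D_i and a reweighted φ_i supported on the nonzero coordinates of ∇f_i, so that each update only touches those coordinates while remaining unbiased.

```python
import numpy as np

def prox_l1(z, t):
    """Soft thresholding: proximal operator of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def proximal_saga(A, b, lam, gamma, n_iter=10000, seed=0):
    """Dense proximal SAGA for min_x 1/(2n) ||Ax - b||^2 + lam * ||x||_1,
    with f_i(x) = 0.5 * (a_i^T x - b_i)^2 and grad f_i(x) = a_i * (a_i^T x - b_i)."""
    rng = np.random.default_rng(seed)
    n, p = A.shape
    x = np.zeros(p)
    alpha = np.zeros((n, p))        # gradient memory alpha_i
    alpha_bar = alpha.mean(axis=0)  # average of the memory terms
    for _ in range(n_iter):
        i = rng.integers(n)
        grad_i = A[i] * (A[i] @ x - b[i])
        v_i = grad_i - alpha[i] + alpha_bar        # unbiased gradient estimate
        x = prox_l1(x - gamma * v_i, gamma * lam)  # proximal step
        alpha_bar += (grad_i - alpha[i]) / n       # keep the average up to date
        alpha[i] = grad_i                          # memory update
    return x
```

A sufficiently small step size γ is assumed; the point of the sketch is only the three-line update (estimate, proximal step, memory update) mirroring the display above.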
Proximal Asynchronous SAGA (ProxASAGA)
Contribution 2: Proximal Asynchronous SAGA (ProxASAGA). Each core runs Sparse Proximal SAGA asynchronously, without locks, and updates x, α and ᾱ in shared memory.
• All read/write operations to shared memory are inconsistent, i.e., there are no performance-destroying vector-level locks while reading or writing.
Convergence: under sparsity assumptions, ProxASAGA converges at the same rate as the sequential algorithm ⇒ theoretical linear speedup with respect to the number of cores.
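Purely to illustrate the lock-free shared-memory structure (this is not the authors' implementation, which is compiled; a pure-Python thread pool gains nothing because of the GIL), here is a hedged sketch in which each thread runs the dense update of the previous block on shared NumPy arrays without any locks. All names are illustrative.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def prox_l1(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def worker(A, b, lam, gamma, x, alpha, alpha_bar, n_steps, seed):
    """One core: run (dense) proximal SAGA steps on the shared arrays, no locks."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    for _ in range(n_steps):
        i = rng.integers(n)
        grad_i = A[i] * (A[i] @ x - b[i])        # read of shared x may be inconsistent
        v_i = grad_i - alpha[i] + alpha_bar
        x[:] = prox_l1(x - gamma * v_i, gamma * lam)   # lock-free write to shared x
        alpha_bar += (grad_i - alpha[i]) / n           # lock-free write to shared average
        alpha[i] = grad_i                              # lock-free write to shared memory

def prox_asaga_like(A, b, lam, gamma, n_cores=4, n_steps=2000):
    n, p = A.shape
    x = np.zeros(p)           # shared iterate
    alpha = np.zeros((n, p))  # shared gradient memory
    alpha_bar = np.zeros(p)   # shared memory average
    with ThreadPoolExecutor(max_workers=n_cores) as pool:
        for c in range(n_cores):
            pool.submit(worker, A, b, lam, gamma, x, alpha, alpha_bar, n_steps, c)
    return x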
Empirical results
ProxASAGA vs. competing methods on 3 large-scale datasets, ℓ1-regularized logistic regression:

Dataset     n            p           density    L      Δ
KDD 2010    19,264,097   1,163,024   10⁻⁶       28.12  0.15
KDD 2012    149,639,105  54,686,452  2 × 10⁻⁷   1.25   0.85
Criteo      45,840,617   1,000,000   4 × 10⁻⁵   1.25   0.89

[Figure: objective minus optimum (log scale, 10⁻¹² to 10⁰) vs. time in minutes on the KDD10, KDD12 and Criteo datasets, for ProxASAGA, AsySPCD and FISTA, each run on 1 and 10 cores.]
Empirical results - Speedup
Speedup = (time to reach 10⁻¹⁰ suboptimality on one core) / (time to reach the same suboptimality on k cores)

[Figure: time speedup vs. number of cores (1-20) on the KDD10, KDD12 and Criteo datasets for ProxASAGA, AsySPCD and FISTA, with the ideal linear speedup for reference.]

• ProxASAGA achieves speedups between 6x and 12x on a 20-core architecture.
• As predicted by theory, there is a high correlation between the degree of sparsity and the speedup.

Thanks for your attention, see you at poster #159.
References
Defazio, Aaron, Francis Bach, and Simon Lacoste-Julien (2014). “SAGA: A fast incremental gradient
method with support for non-strongly convex composite objectives”. In: Advances in Neural
Information Processing Systems.
Leblond, Rémi, Fabian Pedregosa, and Simon Lacoste-Julien (2017). “ASAGA: asynchronous parallel
SAGA”. In: Proceedings of the 20th International Conference on Artificial Intelligence and
Statistics (AISTATS 2017).
Mania, Horia et al. (2017). “Perturbed iterate analysis for asynchronous stochastic optimization”. In:
SIAM Journal on Optimization.
Niu, Feng et al. (2011). “Hogwild: A lock-free approach to parallelizing stochastic gradient descent”.
In: Advances in Neural Information Processing Systems.
Pedregosa, Fabian, Rémi Leblond, and Simon Lacoste-Julien (2017). “Breaking the Nonsmooth
Barrier: A Scalable Parallel Method for Composite Optimization”. In: Advances in Neural
Information Processing Systems 30.
Reddi, Sashank J et al. (2015). “On variance reduction in stochastic gradient descent and its
asynchronous variants”. In: Advances in Neural Information Processing Systems.