Distributed Coordinate Descent for Logistic Regression with Regularization
Ilya Trofimov (Yandex Data Factory)
Alexander Genkin (AVG Consulting)
presented by Ilya Trofimov
Machine Learning: Prospects and Applications
5–8 October 2015, Berlin, Germany
Large Scale Machine Learning
Large Scale Machine Learning = Big Data + ML
Many applications in web search, online advertising, e-commerce, text processing, etc.
Key features of Large Scale Machine Learning problems:
1 Large number of examples n
2 High dimensionality p
Datasets are often:
1 Sparse
2 Don't fit in the memory of a single machine
Linear methods for classification and regression are often used for large-scale problems:
1 Training & testing for linear models are fast
2 High-dimensional datasets are rich and non-linearities are not required
Binary Classification
Supervised machine learning problem:
given a feature vector x_i ∈ R^p, predict y_i ∈ {−1, +1}.
The function
F : x → y
should be built using the training dataset {x_i, y_i}_{i=1}^n and minimize the expected risk:
E_{x,y} Ψ(y, F(x))
where Ψ(·, ·) is some loss function.
Logistic Regression
Logistic regression is a special case of the Generalized Linear Model with the logit link function:
y_i ∈ {−1, +1}
P(y = +1 | x) = 1 / (1 + exp(−β^T x))
Negated log-likelihood (empirical risk) L(β):
L(β) = Σ_{i=1}^n log(1 + exp(−y_i β^T x_i))
β* = argmin_β { L(β) + R(β) }
where R(β) is a regularizer.
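For concreteness, here is a minimal NumPy sketch (not part of the talk) of the negated log-likelihood above and its gradient; the function name and the optional L2 term are illustrative assumptions.

```python
import numpy as np

def logistic_loss_and_grad(beta, X, y, lambda2=0.0):
    """Negated log-likelihood L(beta) = sum_i log(1 + exp(-y_i beta^T x_i)),
    plus an optional L2 term (lambda2 / 2) * ||beta||^2, and its gradient.

    X: (n, p) feature matrix, y: labels in {-1, +1}, beta: (p,) weights.
    """
    margins = y * (X @ beta)                      # y_i * beta^T x_i
    # log(1 + exp(-m)) computed stably via logaddexp(0, -m)
    loss = np.logaddexp(0.0, -margins).sum() + 0.5 * lambda2 * (beta @ beta)
    sigma = 1.0 / (1.0 + np.exp(margins))         # sigma(-margin_i)
    grad = -(X.T @ (y * sigma)) + lambda2 * beta
    return loss, grad

# Tiny usage example with random data
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = np.sign(X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.standard_normal(100))
loss, grad = logistic_loss_and_grad(np.zeros(5), X, y)
```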
Logistic Regression, regularization
L2-regularization:
argmin_β { L(β) + (λ2/2) ||β||² }
Minimization of a smooth convex function.
Optimization techniques for large datasets (distributed setting):
SGD: poor parallelization
Conjugate gradients: good parallelization
L-BFGS: good parallelization
Coordinate descent (GLMNET, BBR): ?
Logistic Regression, regularization
L1-regularization, which provides feature selection:
argmin_β { L(β) + λ1||β||1 }
Minimization of a non-smooth convex function.
Optimization techniques for large datasets (distributed setting):
Subgradient method: slow
Online learning via truncated gradient: poor parallelization
Coordinate descent (GLMNET, BBR): ?
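As a reference point for the coordinate-descent solvers listed above (GLMNET, BBR), here is a sketch of the per-coordinate Newton step with soft-thresholding that GLMNET-style methods apply to the L1-penalized quadratic model; the directional gradient g_j and curvature h_j of the smooth part are assumed to be supplied by the caller, and the function names are illustrative.

```python
import numpy as np

def soft_threshold(z, gamma):
    """Soft-thresholding operator S(z, gamma) = sign(z) * max(|z| - gamma, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def coordinate_step_l1(beta_j, g_j, h_j, lambda1):
    """One GLMNET-style coordinate update for an L1-penalized problem.

    Returns the increment d minimizing the one-dimensional model
    g_j * d + 0.5 * h_j * d^2 + lambda1 * |beta_j + d|  (h_j > 0).
    """
    new_beta_j = soft_threshold(beta_j - g_j / h_j, lambda1 / h_j)
    return new_beta_j - beta_j
```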
How to run coordinate descent in parallel?
Suppose we have several machines (a cluster).
[Figure: the examples × features data matrix split column-wise into feature blocks S_1, S_2, …, S_M, one block per machine]
The dataset is split by features among machines:
S_1 ∪ . . . ∪ S_M = {1, ..., p},   S_m ∩ S_k = ∅ for k ≠ m
β^T = ((β^1)^T, (β^2)^T, . . . , (β^M)^T)
Each machine makes steps on its own subset of input features, producing ∆β^m.
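A small sketch of the column-wise split described above; the random balanced partition and the helper name are illustrative assumptions, since the slide only requires the blocks S_1, ..., S_M to be disjoint and to cover all p features.

```python
import numpy as np

def split_features(p, M, seed=0):
    """Partition feature indices {0, ..., p-1} into M disjoint blocks S_1..S_M
    that together cover all features (random balanced split for illustration)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(p)
    return [np.sort(block) for block in np.array_split(perm, M)]

blocks = split_features(10, 3)   # e.g. ten features over three machines
assert sum(len(b) for b in blocks) == 10
```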
Problems
Two main questions:
1 How to compute ∆β^m
2 How to organize communication between machines
Answers:
1 Each machine makes its step using the GLMNET algorithm.
2 ∆β = Σ_{m=1}^M ∆β^m
Steps from different machines can come into conflict, so that the target function increases:
L(β + ∆β) + R(β + ∆β) > L(β) + R(β)
Problems
β ← β + α∆β,   0 < α ≤ 1
where α is found by the Armijo rule:
L(β + α∆β) + R(β + α∆β) ≤ L(β) + R(β) + ασD_k
D_k = ∇L(β)^T ∆β + R(β + ∆β) − R(β)
L(β + α∆β) = Σ_{i=1}^n log(1 + exp(−y_i (β + α∆β)^T x_i))
R(β + α∆β) = Σ_{m=1}^M R(β^m + α∆β^m)
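A minimal sketch of the backtracking line search with the Armijo rule above, working only on the precomputed margins β^T x_i and ∆β^T x_i so no pass over the raw data is needed; the constant σ = 0.01, the shrink factor 0.5, and the function name are illustrative assumptions, not values from the talk.

```python
import numpy as np

def armijo_line_search(margins, delta_margins, y, reg_old, reg_new_fn,
                       grad_dot_delta, sigma=0.01, shrink=0.5, max_iter=30):
    """Backtracking line search for the combined step (a sketch).

    margins:        beta^T x_i for the current beta (length n)
    delta_margins:  Delta beta^T x_i for the summed step (length n)
    reg_old:        R(beta)
    reg_new_fn:     alpha -> R(beta + alpha * Delta beta)
    grad_dot_delta: grad L(beta)^T Delta beta

    Returns the largest alpha = shrink^t satisfying the Armijo condition
    L(beta + a*d) + R(beta + a*d) <= L(beta) + R(beta) + a*sigma*D_k, with
    D_k = grad L(beta)^T Delta beta + R(beta + Delta beta) - R(beta).
    """
    def L(m):
        return np.logaddexp(0.0, -y * m).sum()

    f_old = L(margins) + reg_old
    D_k = grad_dot_delta + reg_new_fn(1.0) - reg_old
    alpha = 1.0
    for _ in range(max_iter):
        f_new = L(margins + alpha * delta_margins) + reg_new_fn(alpha)
        if f_new <= f_old + alpha * sigma * D_k:
            return alpha
        alpha *= shrink
    return alpha
```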
Effective communication between machines
L(β + α∆β) = Σ_{i=1}^n log(1 + exp(−y_i (β + α∆β)^T x_i))
R(β + α∆β) = Σ_{m=1}^M R(β^m + α∆β^m)
Data transfer:
(β^T x_i) are kept synchronized
(∆β^T x_i) are summed up via MPI_AllReduce (M vectors of size n)
R(β^m + α∆β^m) and ∇L(β)^T ∆β^m are calculated separately on each machine and then summed up (M scalars)
Total communication cost: M(n + 1)
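A sketch of this synchronization step using mpi4py (an assumption made for illustration: the actual dlr implementation is C++ over raw MPI, and the helper name is hypothetical). Each machine contributes its ∆(β^m)^T x_i vector of size n plus two scalars, and AllReduce gives every machine the global sums so they can all run the same line search.

```python
# Run under mpirun; requires mpi4py. Only a sketch of the data exchanged.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

def sync_step(delta_margins_local, reg_term_local, grad_dot_local):
    """Sum the per-machine margin increments Delta(beta^m)^T x_i (an n-vector)
    and two per-machine scalars across all M machines."""
    delta_margins = np.empty_like(delta_margins_local)
    comm.Allreduce(delta_margins_local, delta_margins, op=MPI.SUM)
    reg_term = comm.allreduce(reg_term_local, op=MPI.SUM)   # sum of R(beta^m + Delta beta^m)
    grad_dot = comm.allreduce(grad_dot_local, op=MPI.SUM)   # sum of grad L(beta)^T Delta beta^m
    return delta_margins, reg_term, grad_dot
```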
Distributed GLMNET (d-GLMNET)
d-GLMNET Algorithm
Input: training dataset {x_i, y_i}_{i=1}^n, split into M parts over features.
β^m ← 0, ∆β^m ← 0, where m is the index of a machine
Repeat until converged:
1 Do in parallel over the M machines:
2   Find ∆β^m and calculate (∆(β^m)^T x_i)
3 Sum up ∆β^m and (∆(β^m)^T x_i) using MPI_AllReduce
4 ∆β ← Σ_{m=1}^M ∆β^m
5 (∆β^T x_i) ← Σ_{m=1}^M (∆(β^m)^T x_i)
6 Find α using line search with the Armijo rule
7 β ← β + α∆β
8 (exp(β^T x_i)) ← (exp(β^T x_i + α∆β^T x_i))
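To make step 2 ("Find ∆β^m") concrete, here is a single-process sketch of one pass of GLMNET-style coordinate descent over one machine's feature block, driven only by the globally synchronized margins β^T x_i; the elastic-net handling, the tiny ridge 1e-12, and the function name are illustrative assumptions rather than the authors' exact implementation.

```python
import numpy as np

def local_block_update(X_m, y, margins, beta_m, lambda1, lambda2):
    """One coordinate-descent pass over machine m's feature block.

    X_m holds only the columns in S_m; margins = beta^T x_i for the full,
    globally synchronized beta. Returns the block increment delta_m and the
    induced margin change Delta(beta^m)^T x_i.
    """
    n, p_m = X_m.shape
    delta_m = np.zeros(p_m)
    delta_margins = np.zeros(n)
    for j in range(p_m):
        z = margins + delta_margins                  # local margins so far
        prob = 1.0 / (1.0 + np.exp(y * z))           # sigma(-y_i z_i)
        w = prob * (1.0 - prob)                      # logistic curvature
        b = beta_m[j] + delta_m[j]
        g = -(X_m[:, j] * y * prob).sum() + lambda2 * b      # directional gradient
        h = (X_m[:, j] ** 2 * w).sum() + lambda2 + 1e-12     # directional curvature
        u = b - g / h
        new_b = np.sign(u) * max(abs(u) - lambda1 / h, 0.0)  # soft-thresholding
        step = new_b - b
        delta_m[j] += step
        delta_margins += step * X_m[:, j]
    return delta_m, delta_margins
```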
Solving the «slow node» problem
Distributed Machine Learning Algorithm
Do until converged:
1 Do some computations in parallel over M machines
2 Synchronize: PROBLEM! The M − 1 fast machines will wait for the 1 slow one.
Our solution: machine m at iteration k updates only a subset P^m_k ⊆ S^m of its input features.
The synchronization is done asynchronously in a separate thread; we call this Asynchronous Load Balancing (ALB).
Theoretical Results
Theorem 1. Each iteration of d-GLMNET is equivalent to
β ← β + α∆β*
∆β* = argmin_{∆β} { L(β) + ∇L(β)^T ∆β + (1/2) ∆β^T H(β) ∆β + λ1||β + ∆β||1 }
where H(β) is a block-diagonal, iteration-dependent approximation to the Hessian ∇²L(β).
Theorem 2. The d-GLMNET algorithm converges at least linearly.
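For readability, the quadratic model of Theorem 1 rendered in LaTeX, with the block-diagonal structure of H(β) written out explicitly (one block per machine's feature subset, as stated above):

```latex
\Delta\beta^{*} = \operatorname*{arg\,min}_{\Delta\beta}
  \Bigl\{ L(\beta) + \nabla L(\beta)^{T}\Delta\beta
  + \tfrac{1}{2}\,\Delta\beta^{T} H(\beta)\,\Delta\beta
  + \lambda_{1}\lVert \beta + \Delta\beta \rVert_{1} \Bigr\},
\qquad
H(\beta) = \operatorname{diag}\bigl(H_{1}(\beta),\dots,H_{M}(\beta)\bigr)
  \approx \nabla^{2} L(\beta).
```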
Numerical Experiments
dataset     size    #examples, train/test/validation    #features     nnz
epsilon     12 Gb   0.4 / 0.05 / 0.05 × 10^6            2000          8.0 × 10^8
webspam     21 Gb   0.315 / 0.0175 / 0.0175 × 10^6      16.6 × 10^6   1.2 × 10^9
yandex_ad   56 Gb   57 / 2.35 / 2.35 × 10^6             35 × 10^6     5.57 × 10^9
16 machines: Intel(R) Xeon(R) CPU E5-2660 2.20GHz, 32 GB RAM, gigabit Ethernet.
Numerical Experiments
We compared:
d-GLMNET
Online learning via truncated gradient (Vowpal Wabbit)
L-BFGS (Vowpal Wabbit)
ADMM with sharing (feature splitting)
1 We selected the best L1 and L2 regularization on the test set from the range {2^−6, . . . , 2^6}
2 We found the parameters of online learning and ADMM yielding the best performance
3 For evaluating timing performance we repeated training 9 times and selected the run with the median time
«yandex_ad» dataset, testing quality vs. time
[Figure: two panels comparing the methods over time, L2 regularization (left) and L1 regularization (right)]
Conclusions & Future Work
d-GLMNET is faster than state-of-the-art algorithms (online learning, L-BFGS, ADMM) on sparse high-dimensional datasets
d-GLMNET can be easily extended to:
other [block-]separable regularizers: bridge, SCAD, group Lasso, etc.
other generalized linear models
Extending the software architecture to boosting:
F*(x) = Σ_{i=1}^M f_i(x), where f_i(x) is a weak learner
Let machine m fit a weak learner f_i^m(x^m) on its subset of input features S^m. Then
f_i(x) = α Σ_{m=1}^M f_i^m(x^m)
where α is calculated via line search, in a similar way as in the d-GLMNET algorithm (see the sketch below).
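A toy sketch of the proposed boosting step: combine the per-machine weak-learner outputs f_i^m(x^m) with a common scale α. For brevity α is chosen here by a simple grid search on the logistic loss, whereas the slide proposes an Armijo-style line search as in d-GLMNET; the function name and the grid are illustrative assumptions.

```python
import numpy as np

def combine_weak_learners(preds_per_machine, y, current_score,
                          alphas=np.linspace(0.0, 1.0, 21)):
    """Form f_i(x) = alpha * sum_m f_i^m(x^m) from per-machine predictions and
    pick alpha minimizing the logistic loss of the updated additive score."""
    combined = np.sum(preds_per_machine, axis=0)          # sum over the M machines

    def loss(score):
        return np.logaddexp(0.0, -y * score).sum()

    best_alpha = min(alphas, key=lambda a: loss(current_score + a * combined))
    return best_alpha, current_score + best_alpha * combined
```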
Conclusions & Future Work
Software implementation:
https://github.com/IlyaTrofimov/dlr
The paper is available by request:
Ilya Trofimov - trofim@yandex-team.ru
Thank you :)
Questions?