ADMM
Jay Chang
Dec. 17, 2018
Distributed Stochastic ADMM for
Matrix Factorization
Z.-Q. Yu, X.-J. Shi, L. Yan, and W.-J. Li. "Distributed stochastic ADMM for matrix factorization." In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management (CIKM), pages 1259-1268, 2014.
Matrix Factorization
ALS-based Parallel MF Models
• ALS (Alternating Least Squares) allows the columns of both U and V to be updated independently using closed-form equations (see the code sketch after the algorithm below):
Parallel ALS
: , , ,
:
randinit( , )
ra
Initial factors
1
1 Executed in parallel
ndinit , )(
m n
iter T
k T
m k
m
k
i
n
λ×
←
∈
←
←
←
Algorithm :
Input R
for to do
par fo
Initializa
r to
tio
par
n
U
V
⊳
⊳
ℝ
1 Executed in parallelj n←for to
end for
⊳
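The per-column updates are easiest to see in code. Below is a minimal NumPy sketch of one ALS sweep, assuming R is a dense m×n array with zeros marking missing ratings and U ∈ ℝ^{k×m}, V ∈ ℝ^{k×n}; function and variable names are illustrative, not from the paper.

```python
import numpy as np

def als_sweep(R, U, V, lam):
    """One ALS sweep over R ~= U^T V: ridge solve per user column and per item column."""
    k = U.shape[0]
    m, n = R.shape
    # Update each user factor U[:, i] independently (parallelizable over i).
    for i in range(m):
        cols = np.nonzero(R[i])[0]                 # items rated by user i
        Vi = V[:, cols]                            # (k, |Omega_i|)
        A = Vi @ Vi.T + lam * np.eye(k)
        U[:, i] = np.linalg.solve(A, Vi @ R[i, cols])
    # Update each item factor V[:, j] independently (parallelizable over j).
    for j in range(n):
        rows = np.nonzero(R[:, j])[0]              # users who rated item j
        Uj = U[:, rows]                            # (k, |Omega_j|)
        A = Uj @ Uj.T + lam * np.eye(k)
        V[:, j] = np.linalg.solve(A, Uj @ R[rows, j])
    return U, V
```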
ALS-based Improvement
CCD (Cyclic Coordinate Descent): instead of optimizing the whole vector U_{*i} or V_{*j} at one time, CCD adopts coordinate descent to optimize each element of U_{*i} or V_{*j} separately, in order to avoid the matrix inverse (a code sketch of the elementwise update follows the references below).

CCD++: further improves CCD's performance by changing the updating sequence in CCD. Writing the factorization as a sum of rank-one terms,

U^T V = \sum_{d=1}^{k} U_{d*}^T V_{d*},

CCD++ updates one element in U_{d*} or V_{d*} each time by using coordinate descent.
CCD: I. Pilászy, D. Zibriczky, and D. Tikk. "Fast ALS-based matrix factorization for explicit and implicit feedback datasets." In RecSys, pages 71-78, 2010.
CCD++: H.-F. Yu, C.-J. Hsieh, S. Si, and I. S. Dhillon. "Scalable coordinate descent approaches to parallel matrix factorization for recommender systems." In ICDM, pages 765-774, 2012.
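For concreteness, here is a minimal sketch of the elementwise CCD update for one user column, again assuming R ~= U^T V with zeros marking missing ratings; the residual-maintenance bookkeeping shown here is illustrative.

```python
import numpy as np

def ccd_update_user(R, U, V, i, lam):
    """Cyclically update the k elements of U[:, i] by 1-D closed-form minimizations."""
    cols = np.nonzero(R[i])[0]                    # Omega_i: items rated by user i
    if cols.size == 0:
        return
    e = R[i, cols] - U[:, i] @ V[:, cols]         # current residuals e_ij
    for d in range(U.shape[0]):
        vd = V[d, cols]
        # 1-D minimizer of the regularized squared loss w.r.t. U[d, i], others fixed.
        new = (e + U[d, i] * vd) @ vd / (lam + vd @ vd)
        e += (U[d, i] - new) * vd                 # keep residuals consistent
        U[d, i] = new
```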
Some Derivations for ALS
Approximate the rating matrix by R ≈ X Y^T, with R ∈ ℝ^{m×n}, X ∈ ℝ^{m×k}, Y ∈ ℝ^{n×k}. The regularized loss is

Loss(X, Y) = \sum_{(u,i)} \big( r_{ui} - x_u^T y_i \big)^2 + \lambda \Big( \sum_u \|x_u\|^2 + \sum_i \|y_i\|^2 \Big)

Setting the gradients to zero gives the ALS updates:

\frac{\partial Loss(X, Y)}{\partial x_u} = 0 \;\Rightarrow\; x_u = \Big( \sum_i y_i y_i^T + \lambda I \Big)^{-1} \sum_i r_{ui}\, y_i

\frac{\partial Loss(X, Y)}{\partial y_i} = 0 \;\Rightarrow\; y_i = \Big( \sum_u x_u x_u^T + \lambda I \Big)^{-1} \sum_u r_{ui}\, x_u

(the sums run over the observed ratings involving u and i, respectively).
SGD-based Parallel MF Models

SGD (Stochastic Gradient Descent) randomly selects one rating index (i, j) from Ω each time, then updates the corresponding variables U_{*i} and V_{*j}, where

\epsilon_{ij} = R_{ij} - U_{*i}^T V_{*j}

and η is the learning rate.

Conflicts exist between two nodes when their randomly selected ratings share either the same user index or the same item index. Representative SGD-based parallel MF models:
• Hogwild!
• DSGD (Distributed SGD)
• FPSGD (Fast Parallel SGD)
Hogwild!: F. Niu, B. Recht, C. Re, and S. J. Wright. "Hogwild!: A lock-free approach to parallelizing stochastic gradient descent." In NIPS, pages
693-701, 2011.
DSGD: R. Gemulla, E. Nijkamp, P. J. Haas, and Y. Sismanis. "Large-scale matrix factorization with distributed stochastic gradient descent." In KDD,
pages 69-77, 2011.
FPSGD: Y. Zhuang, W.-S. Chin, Y.-C. Juan, and C.-J. Lin. "A fast parallel sgd for matrix factorization in shared memory systems." In RecSys, pages
249-256, 2013.
Pseudo Code

Input: the number of latent factors D, the learning rate η, regularization parameters λ_U, λ_V, the max iteration step, and the rating matrix R
Initialization: initialize random matrices for the user matrix U and the item matrix V
for t = 1, 2, ..., step do
    for each observed rating R_{ij} in R do
        make prediction:  pr = U_i V_j^T
        error:            e_{ij} = R_{ij} - pr
        update U_i and V_j:
            U_i ← (1 - ηλ_U) U_i + η e_{ij} V_j
            V_j ← (1 - ηλ_V) V_j + η e_{ij} U_i
    end for
end for
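A minimal, runnable transcription of this pseudocode in NumPy; the hyperparameter defaults and the (i, j, r) triple format are illustrative.

```python
import random
import numpy as np

def sgd_mf(ratings, m, n, D=10, eta=0.01, lam_u=0.05, lam_v=0.05, steps=20, seed=0):
    """SGD for matrix factorization; ratings is a list of (i, j, r) triples.

    Returns U (m, D) and V (n, D) with R_ij ~= U[i] @ V[j].
    """
    rng = np.random.default_rng(seed)
    random.seed(seed)
    U = 0.1 * rng.standard_normal((m, D))
    V = 0.1 * rng.standard_normal((n, D))
    entries = list(ratings)
    for _ in range(steps):
        random.shuffle(entries)
        for i, j, r in entries:
            e = r - U[i] @ V[j]                    # prediction error e_ij
            Ui_old = U[i].copy()                   # use the old U_i in V_j's update
            U[i] = (1 - eta * lam_u) * U[i] + eta * e * V[j]
            V[j] = (1 - eta * lam_v) * V[j] + eta * e * Ui_old
    return U, V
```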
Some Derivations for SGD
ADMM (Alternating Direction Method of Multipliers)
• ADMM is used to solve constrained problems of the form shown below.
• First, form the augmented Lagrangian.
• The ADMM solution is then obtained by repeating the three update steps shown below.
• If f(x) or g(z) is separable, the corresponding step of ADMM can be done in parallel.
• So ADMM can be used to design distributed learning algorithms for large-scale problems.
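The equations on this slide are embedded as images; for reference, the standard ADMM formulation (as in Boyd et al.) that the bullets describe is:

```latex
% Constrained problem
\min_{x, z}\; f(x) + g(z) \quad \text{s.t.} \quad Ax + Bz = c

% Augmented Lagrangian
L_\rho(x, z, y) = f(x) + g(z) + y^T (Ax + Bz - c) + \tfrac{\rho}{2}\,\|Ax + Bz - c\|_2^2

% Three update steps
x^{k+1} = \arg\min_x L_\rho(x, z^k, y^k), \qquad
z^{k+1} = \arg\min_z L_\rho(x^{k+1}, z, y^k), \qquad
y^{k+1} = y^k + \rho\,(A x^{k+1} + B z^{k+1} - c)
```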
Data Split Strategy

P nodes; p = 1, ..., P denotes the computer node id.

[Figure: the rating matrix R (m × n) and the user matrix U (m × k) are split row-wise into per-node blocks of size (m/p) × n and (m/p) × k; each node also holds a local item latent matrix of size k × n, tied to the global item latent matrix of size k × n.]
LocalMFSplit

[Figure: the LocalMFSplit splitting procedure, executed in parallel.]
Distributed ADMM

• Based on the split strategy LocalMFSplit, the MF problem can be reformulated as follows:
• Define the objective function by using the augmented Lagrangian method:

where V = {V_p}_{p=1}^{P} and Ω_p denotes the (i, j) indices of the ratings located in node p.

This is analogous to the convex consensus formulation, but here the objective function L_p is non-convex.
Distributed ADMM

• From the augmented Lagrangian we can get the update rules.
• ADMM gets the solutions by repeating the following three steps.
• The solution for V^{t+1} has a closed form, which can be calculated efficiently.
• The problem lies in getting U_p^{t+1} and V_p^{t+1} efficiently; we hope to decouple U, V and Θ so that they can be updated in parallel, i.e. locally on each node.

Per-iteration communication pattern:
1. Scatter U_p^t, V_p^t, Θ_p^t, V^t; update U_p^{t+1}, V_p^{t+1} (in each node, in parallel).
2. Gather V_p^{t+1}, Θ_p^t; update V^{t+1} (global).
3. Scatter V^{t+1}; update Θ_p^{t+1} (in each node, in parallel).

The problem becomes how to construct a surrogate objective function that is convex (the original L_p is non-convex), so that the three-step ADMM update can be used.
P.S. Consensus optimization via ADMM
S. P. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. “Distributed optimization and statistical learning via the alternating direction method of multipliers.”
Foundations and Trends in Machine Learning, pages 1-122, 2011.
With regularization g, the averaging in the z-update is followed by a prox_ρ step.
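The slide's figure shows the global-consensus ADMM updates from Boyd et al.; written out (this is the standard form from that reference, with P local variables x_p and scaled duals u_p):

```latex
% Global variable consensus with regularization g(z):
%   minimize  \sum_{p=1}^{P} f_p(x_p) + g(z)   subject to   x_p = z,\; p = 1,\dots,P
x_p^{k+1} = \arg\min_{x_p} \Big( f_p(x_p) + \tfrac{\rho}{2}\,\|x_p - z^k + u_p^k\|_2^2 \Big)
z^{k+1}   = \arg\min_{z} \Big( g(z) + \tfrac{P\rho}{2}\,\|z - \bar{x}^{k+1} - \bar{u}^{k}\|_2^2 \Big)
u_p^{k+1} = u_p^k + x_p^{k+1} - z^{k+1}
```

So the z-update averages the local variables and duals and then applies the proximal operator of g.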
Stochastic Learning for Distributed ADMM

• Batch Learning: construct a surrogate objective function G_p, which is convex and makes U and V decouple from each other. It is still not very efficient, because updating each U_{*i} requires accessing all ratings related to U_{*i}, which is slow.
• To improve efficiency, a stochastic learning variant is proposed, which leads to DS-ADMM.
DS-ADMM Algorithm

Scheduler Comparison

(Note on the experimental plots: why does the x-axis use time rather than the number of iterations?)
A Flexible and Efficient Algorithmic Framework for Constrained Matrix and Tensor Factorization

Kejun Huang, Nicholas D. Sidiropoulos, and Athanasios P. Liavas. "A flexible and efficient algorithmic framework for constrained matrix and tensor factorization." IEEE Transactions on Signal Processing, 64(19): 5052-5065, 2016.
Constrained Matrix Factorization

• Optimization problem:

\min_{W, H} \; \tfrac{1}{2}\,\|Y - W H^T\|_F^2 + r_W(W) + r_H(H)

where r_W(·) and r_H(·) are the regularizations on the latent factors, Y ∈ ℝ^{m×n}, W ∈ ℝ^{m×k}, H ∈ ℝ^{n×k}, and H̃ ∈ ℝ^{k×n} is the auxiliary (split) variable.
• The problem is not convex in W and H jointly, but it is convex in one factor when the other is fixed, so alternating optimization (AO) is usually employed: update H while fixing W using ADMM, and vice versa.
• ADMM for the regularized least-squares subproblem yields the following iterates: cache W^T Y, take the Cholesky decomposition W^T W + ρI = L L^T, and update H̃ by forward and backward substitution; the per-iteration complexity is O(k²n) (see the code sketch below).
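A minimal NumPy/SciPy sketch of the cached-Cholesky ADMM inner loop for the H-subproblem described above; prox_r, rho, and the iteration count are placeholders supplied by the caller.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def admm_update_H(Y, W, H, prox_r, rho=1.0, n_iters=10):
    """ADMM inner loop for the H-subproblem of  min 0.5*||Y - W H^T||_F^2 + r(H).

    prox_r(X, rho): proximal operator of r(.) with parameter rho (user supplied).
    Returns the updated H (n, k). Iteration count and rho are illustrative.
    """
    k = W.shape[1]
    # Cached quantities: one Cholesky factorization per outer (AO) step.
    G = cho_factor(W.T @ W + rho * np.eye(k))        # W^T W + rho*I = L L^T
    F = W.T @ Y                                      # cached W^T Y, shape (k, n)
    U = np.zeros_like(H)                             # scaled dual variable
    for _ in range(n_iters):
        # H_tilde-update: solve (W^T W + rho*I) H_tilde = W^T Y + rho*(H + U)^T
        H_tilde = cho_solve(G, F + rho * (H + U).T)  # (k, n)
        # H-update: proximal step on r(.)
        H = prox_r(H_tilde.T - U, rho)
        # Dual update
        U = U + H - H_tilde.T
    return H
```

The Cholesky factor and W^T Y are computed once per outer AO step, so each inner iteration costs only the O(k²n) triangular solves plus the prox.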
Proximity Operator

• The H-update in the derived iterates is the proximity operator of the function (1/ρ) r(·), evaluated at the point H̃^T − U.
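Using the standard definition of the proximity operator, this update reads:

```latex
H \;\leftarrow\; \operatorname{prox}_{(1/\rho)\, r}\!\big( \tilde{H}^T - U \big)
\;=\; \arg\min_{X} \; r(X) + \tfrac{\rho}{2}\,\big\| X - (\tilde{H}^T - U) \big\|_F^2
```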
Most Commonly Used Constraints/Regularizations

• Non-negativity. r(·) is the indicator function of ℝ₊ (the non-negative orthant); the update is an elementwise projection.
• Lasso regularization. For r(H) = λ‖H‖₁ the update is the well-known soft-thresholding operator.
• Simplex constraint. In some probabilistic models we need to constrain the columns or rows of H to be elementwise non-negative and sum up to one.
• Smoothness regularization. We can encourage the columns of H to be smooth by adding the regularization r(H) = (λ/2)‖TH‖_F².
• Projections onto non-convex constraints, e.g. cardinality constraints, can be handled by hard-thresholding, but then ADMM is not guaranteed to converge to the optimal solution.

(Code sketches of some of these proximal maps are given below.)
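Minimal sketches of three of these proximal maps, written to match the prox_r(X, rho) slot used in the earlier ADMM sketch; lam and s are illustrative parameters.

```python
import numpy as np

def prox_nonneg(V, rho):
    """Indicator of the non-negative orthant: prox is elementwise projection."""
    return np.maximum(V, 0.0)

def prox_l1(V, rho, lam=0.1):
    """Lasso r(H) = lam*||H||_1: prox is elementwise soft-thresholding at lam/rho."""
    t = lam / rho
    return np.sign(V) * np.maximum(np.abs(V) - t, 0.0)

def prox_cardinality(V, rho, s=5):
    """Keep the s largest-magnitude entries per column (hard-thresholding).
    Non-convex: ADMM is no longer guaranteed to reach the optimum."""
    H = np.zeros_like(V)
    idx = np.argsort(-np.abs(V), axis=0)[:s]       # top-s rows per column
    cols = np.arange(V.shape[1])
    H[idx, cols] = V[idx, cols]
    return H
```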
ADMM Algorithm

Define the relative primal residual and the relative dual residual, where H₀ is H from the previous ADMM iteration; these are used in the stopping criterion.
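One plausible concretization of these residuals under the splitting H = H̃^T with scaled dual U; the exact normalization used on the slide is an assumption here.

```latex
r_{\text{primal}} = \frac{\|H - \tilde{H}^T\|_F^2}{\|H\|_F^2},
\qquad
r_{\text{dual}} = \frac{\|H - H_0\|_F^2}{\|U\|_F^2}
```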
AO-ADMM Algorithm

• We can initialize the current ADMM update using the previous results for W and H.
• ADMM is thus a subroutine inside alternating optimization (AO), hence the name AO-ADMM.
General Loss Solution for ADMM

where U is the scaled dual variable corresponding to the constraint H = H̃^T, and Ṽ is the scaled dual variable corresponding to the equality constraint Ỹ = W H̃.
Most Commonly Used Non-Least-Squares Loss Functions
• Missing values. In the case that only a subset of the entries in Y are available.
• Robust fitting. In the case that data entries are not uniformly corrupted by noise but only sparingly
corrupted by outliers, or when the noise is dense but heavy-tailed (e.g., Laplacian-distributed), we
can use the L1 norm as the loss function for robust fitting.
• Huber fitting. Another way to deal with possible outliers in Y is to use the Huber function to
measure the loss.
• Kullback-Leibler divergence. A commonly adopted loss function for non-negative integer data is
the Kullback-Leibler (K-L) divergence defined as
General Loss ADMM Algorithm

Define the relative primal residual and the relative dual residual as before, where H₀ is H from the previous ADMM iteration.
Extension to Tensor Factorization

For a third-order tensor Y with CP factors A, B, C, the mode-1 factor update is the least-squares problem

A \leftarrow \arg\min_{A} \; \big\| Y_{(1)}^T - (C \odot B)\, A^T \big\|_F^2

where ⊙ is the Khatri-Rao product, W = C ⊙ B, Y_{(1)} is the mode-1 matricization of the tensor Y, and ∗ is the element-wise (Hadamard) product. The normal equations can be formed without constructing W explicitly, since

W^T W = (C^T C) \ast (B^T B), \qquad W^T Y_{(1)}^T = (C \odot B)^T\, Y_{(1)}^T

(a code sketch follows).
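A small NumPy sketch of this mode-1 update, assuming a Kolda-style mode-1 unfolding Y1 of shape (I, J·K) and factors B (J, F), C (K, F); the ridge term rho only mirrors the ADMM-regularized version and is illustrative.

```python
import numpy as np

def khatri_rao(C, B):
    """Column-wise Khatri-Rao product of C (K, F) and B (J, F) -> (K*J, F)."""
    K, F = C.shape
    J = B.shape[0]
    return (C[:, None, :] * B[None, :, :]).reshape(K * J, F)

def update_mode1_factor(Y1, B, C, rho=0.0):
    """Least-squares update of A from the mode-1 unfolding Y1 (I, J*K).

    Uses W^T W = (C^T C) * (B^T B) so the F x F Gram matrix is formed cheaply.
    """
    F = B.shape[1]
    W = khatri_rao(C, B)                          # (K*J, F); fine for small examples
    G = (C.T @ C) * (B.T @ B) + rho * np.eye(F)   # W^T W + rho*I via the Hadamard identity
    M = Y1 @ W                                    # MTTKRP: (I, F)
    return np.linalg.solve(G, M.T).T              # A = M G^{-1}  (G is symmetric)
```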
Tensor AO-ADMM Algorithm

• The outer AO framework naturally provides a good initial point to the inner ADMM iteration; this is called the warm-start strategy.
• The algorithm is given both for the LS loss function and for general loss functions.
Low-Rank Regularized Heterogeneous Tensor Decomposition (LRRHTD) for Subspace Clustering

J. Zhang, X. Li, P. Jing, J. Liu, and Y. Su. "Low-rank regularized heterogeneous tensor decomposition for subspace clustering." IEEE Signal Processing Letters, vol. 25, no. 3, pp. 333-337, Mar. 2018.
Tucker Decomposition

X ≈ G ×_1 A ×_2 B ×_3 C, with data tensor X ∈ ℝ^{I×J×K}, core tensor G ∈ ℝ^{P×Q×R}, and factor matrices A ∈ ℝ^{I×P}, B ∈ ℝ^{J×Q}, C ∈ ℝ^{K×R} (⊗ denotes the Kronecker product).

The factor matrices (which are usually orthogonal) A, B, and C are often referred to as the principal components in the respective tensor modes. When P, Q, R are smaller than I, J, K, this results in a compression of X, with G being the compressed version of X.
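The slide's equation is an image; for reference, the standard matricized (mode-1 unfolded) form of the three-way Tucker model, which is where the Kronecker product enters, is:

```latex
\mathcal{X} \approx \mathcal{G} \times_1 A \times_2 B \times_3 C
\quad\Longleftrightarrow\quad
X_{(1)} \approx A\, G_{(1)}\, (C \otimes B)^T
```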
Tucker Decomposition

• The Tucker model can be generalized to N-way tensors.
• The n-rank of a tensor, denoted rank_n(X), is the column rank of the mode-n unfolding X_(n): rank_n(X) = rank(X_(n)).
• According to the type of constraints, Tucker decomposition approaches can be roughly grouped into three categories:
  • orthogonal tensor decomposition
  • non-negative tensor decomposition
  • sparse tensor decomposition
• Almost all of the above algorithms decompose tensors based on the isotropy hypothesis (i.e. orthogonal, non-negative, ...), meaning that the factor matrices are learned in an equivalent way for all modes.
• This is not suitable for heterogeneous tensor data.
Low-Rank Regularized Heterogeneous Tensor Decomposition (LRRHTD)

• For all but the last mode, LRRHTD seeks a set of orthogonal projection matrices to map the original tensor data into a low-dimensional common subspace.
• For the last mode, a low-rank projection matrix is learned by imposing a nuclear-norm penalty, so that a lowest-rank representation revealing the global structure of the samples is obtained for clustering.
• The data are M-th order tensors, and N is the total number of samples.
• The N sample tensors are concatenated to yield an (M+1)-th order tensor.
• The goal of LRRHTD is to find M orthogonal factor matrices for an intrinsic low-dimensional representation, together with the lowest-rank representation that uses the mapped low-dimensional tensor as a dictionary (with D < N).
Low-Rank Regularized Heterogeneous Tensor Decomposition (LRRHTD)

• The Tucker decomposition of the concatenated tensor X can be estimated in a general form: a core tensor multiplied along each mode by the factor matrices, plus an approximation error tensor.
• The cost function is minimized (arg min) over the orthogonal factor matrices and the low-rank representation.
Thank you for your attention
