ADMM
Jay Chang
Dec. 17, 2018
Distributed Stochastic ADMM for
Matrix Factorization
Z.-Q. Yu, X.-J. Shi, L. Yan, and W.-J. Li. "Distributed stochastic ADMM for matrix factorization." In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management (CIKM), pages 1259-1268, 2014.
Matrix Factorization
ALS-based Parallel MF Models
• ALS (Alternating Least Squares) allows the columns of both U and V to be updated independently using closed-form equations (see the code sketch after the algorithm below):
Parallel ALS
: , , ,
:
randinit( , )
ra
Initial factors
1
1 Executed in parallel
ndinit , )(
m n
iter T
k T
m k
m
k
i
n
λ×
←
∈
←
←
←
Algorithm :
Input R
for to do
par fo
Initializa
r to
tio
par
n
U
V
⊳
⊳
ℝ
1 Executed in parallelj n←for to
end for
⊳
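The per-column updates are easiest to see in code. Below is a minimal NumPy sketch of one ALS sweep, assuming R is a dense m×n array with zeros marking missing ratings and U ∈ ℝ^{k×m}, V ∈ ℝ^{k×n}; function and variable names are illustrative, not from the paper.

```python
import numpy as np

def als_sweep(R, U, V, lam):
    """One ALS sweep over R ~= U^T V: ridge solve per user column and per item column."""
    k = U.shape[0]
    m, n = R.shape
    # Update each user factor U[:, i] independently (parallelizable over i).
    for i in range(m):
        cols = np.nonzero(R[i])[0]                 # items rated by user i
        Vi = V[:, cols]                            # (k, |Omega_i|)
        A = Vi @ Vi.T + lam * np.eye(k)
        U[:, i] = np.linalg.solve(A, Vi @ R[i, cols])
    # Update each item factor V[:, j] independently (parallelizable over j).
    for j in range(n):
        rows = np.nonzero(R[:, j])[0]              # users who rated item j
        Uj = U[:, rows]                            # (k, |Omega_j|)
        A = Uj @ Uj.T + lam * np.eye(k)
        V[:, j] = np.linalg.solve(A, Uj @ R[rows, j])
    return U, V
```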
ALS-based Improvement
CCD (Cyclic Coordinate Descent): instead of optimizing the whole vector U_{*i} or V_{*j} at one time, CCD adopts coordinate descent to optimize each element of U_{*i} or V_{*j} separately, in order to avoid the matrix inverse (a code sketch of the elementwise update follows the references below).

CCD++: further improves CCD's performance by changing the updating sequence in CCD. Writing the factorization as a sum of rank-one terms,

U^T V = \sum_{d=1}^{k} U_{d*}^T V_{d*},

CCD++ updates one element in U_{d*} or V_{d*} each time by using coordinate descent.
CCD: I. Pilászy, D. Zibriczky, and D. Tikk. "Fast ALS-based matrix factorization for explicit and implicit feedback datasets." In RecSys, pages 71-78, 2010.
CCD++: H.-F. Yu, C.-J. Hsieh, S. Si, and I. S. Dhillon. "Scalable coordinate descent approaches to parallel matrix factorization for recommender systems." In ICDM, pages 765-774, 2012.
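For concreteness, here is a minimal sketch of the elementwise CCD update for one user column, again assuming R ~= U^T V with zeros marking missing ratings; the residual-maintenance bookkeeping shown here is illustrative.

```python
import numpy as np

def ccd_update_user(R, U, V, i, lam):
    """Cyclically update the k elements of U[:, i] by 1-D closed-form minimizations."""
    cols = np.nonzero(R[i])[0]                    # Omega_i: items rated by user i
    if cols.size == 0:
        return
    e = R[i, cols] - U[:, i] @ V[:, cols]         # current residuals e_ij
    for d in range(U.shape[0]):
        vd = V[d, cols]
        # 1-D minimizer of the regularized squared loss w.r.t. U[d, i], others fixed.
        new = (e + U[d, i] * vd) @ vd / (lam + vd @ vd)
        e += (U[d, i] - new) * vd                 # keep residuals consistent
        U[d, i] = new
```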
Some Derivations for ALS
Approximate the rating matrix by R ≈ X Y^T, with R ∈ ℝ^{m×n}, X ∈ ℝ^{m×k}, Y ∈ ℝ^{n×k}. The regularized loss is

Loss(X, Y) = \sum_{(u,i)} \big( r_{ui} - x_u^T y_i \big)^2 + \lambda \Big( \sum_u \|x_u\|^2 + \sum_i \|y_i\|^2 \Big)

Setting the gradients to zero gives the ALS updates:

\frac{\partial Loss(X, Y)}{\partial x_u} = 0 \;\Rightarrow\; x_u = \Big( \sum_i y_i y_i^T + \lambda I \Big)^{-1} \sum_i r_{ui}\, y_i

\frac{\partial Loss(X, Y)}{\partial y_i} = 0 \;\Rightarrow\; y_i = \Big( \sum_u x_u x_u^T + \lambda I \Big)^{-1} \sum_u r_{ui}\, x_u

(the sums run over the observed ratings involving u and i, respectively).
SGD-based Parallel MF Models

SGD (Stochastic Gradient Descent) randomly selects one rating index (i, j) from Ω each time, then updates the corresponding variables U_{*i} and V_{*j}, where

\epsilon_{ij} = R_{ij} - U_{*i}^T V_{*j}

and η is the learning rate.

Conflicts exist between two nodes when their randomly selected ratings share either the same user index or the same item index. Representative SGD-based parallel MF models:
• Hogwild!
• DSGD (Distributed SGD)
• FPSGD (Fast Parallel SGD)
Hogwild!: F. Niu, B. Recht, C. Re, and S. J. Wright. "Hogwild!: A lock-free approach to parallelizing stochastic gradient descent." In NIPS, pages
693-701, 2011.
DSGD: R. Gemulla, E. Nijkamp, P. J. Haas, and Y. Sismanis. "Large-scale matrix factorization with distributed stochastic gradient descent." In KDD,
pages 69-77, 2011.
FPSGD: Y. Zhuang, W.-S. Chin, Y.-C. Juan, and C.-J. Lin. "A fast parallel sgd for matrix factorization in shared memory systems." In RecSys, pages
249-256, 2013.
Pseudo Code

Input: the number of latent factors D, the learning rate η, regularization parameters λ_U, λ_V, the max iteration step, and the rating matrix R
Initialization: initialize random matrices for the user matrix U and the item matrix V
for t = 1, 2, ..., step do
    for each observed rating R_{ij} in R do
        make prediction:  pr = U_i V_j^T
        error:            e_{ij} = R_{ij} - pr
        update U_i and V_j:
            U_i ← (1 - ηλ_U) U_i + η e_{ij} V_j
            V_j ← (1 - ηλ_V) V_j + η e_{ij} U_i
    end for
end for
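A minimal, runnable transcription of this pseudocode in NumPy; the hyperparameter defaults and the (i, j, r) triple format are illustrative.

```python
import random
import numpy as np

def sgd_mf(ratings, m, n, D=10, eta=0.01, lam_u=0.05, lam_v=0.05, steps=20, seed=0):
    """SGD for matrix factorization; ratings is a list of (i, j, r) triples.

    Returns U (m, D) and V (n, D) with R_ij ~= U[i] @ V[j].
    """
    rng = np.random.default_rng(seed)
    random.seed(seed)
    U = 0.1 * rng.standard_normal((m, D))
    V = 0.1 * rng.standard_normal((n, D))
    entries = list(ratings)
    for _ in range(steps):
        random.shuffle(entries)
        for i, j, r in entries:
            e = r - U[i] @ V[j]                    # prediction error e_ij
            Ui_old = U[i].copy()                   # use the old U_i in V_j's update
            U[i] = (1 - eta * lam_u) * U[i] + eta * e * V[j]
            V[j] = (1 - eta * lam_v) * V[j] + eta * e * Ui_old
    return U, V
```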
Some Derivations for SGD
ADMM (Alternating Direction Method of Multipliers)
• ADMM is used to solve constrained problems of the form shown below.
• First, form the augmented Lagrangian.
• The ADMM solution is then obtained by repeating the three update steps shown below.
• If f(x) or g(z) is separable, the corresponding step of ADMM can be done in parallel.
• So ADMM can be used to design distributed learning algorithms for large-scale problems.
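The equations on this slide are embedded as images; for reference, the standard ADMM formulation (as in Boyd et al.) that the bullets describe is:

```latex
% Constrained problem
\min_{x, z}\; f(x) + g(z) \quad \text{s.t.} \quad Ax + Bz = c

% Augmented Lagrangian
L_\rho(x, z, y) = f(x) + g(z) + y^T (Ax + Bz - c) + \tfrac{\rho}{2}\,\|Ax + Bz - c\|_2^2

% Three update steps
x^{k+1} = \arg\min_x L_\rho(x, z^k, y^k), \qquad
z^{k+1} = \arg\min_z L_\rho(x^{k+1}, z, y^k), \qquad
y^{k+1} = y^k + \rho\,(A x^{k+1} + B z^{k+1} - c)
```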
Data Split Strategy

P nodes; p = 1, ..., P denotes the computer node id.

[Figure: the rating matrix R (m × n) and the user matrix U (m × k) are split row-wise into per-node blocks of size (m/p) × n and (m/p) × k; each node also holds a local item latent matrix of size k × n, tied to the global item latent matrix of size k × n.]
LocalMFSplit

[Figure: the LocalMFSplit splitting procedure, executed in parallel.]
Distributed ADMM

• Based on the split strategy LocalMFSplit, the MF problem can be reformulated as follows:
• Define the objective function by using the augmented Lagrangian method:

where V = {V_p}_{p=1}^{P} and Ω_p denotes the (i, j) indices of the ratings located in node p.

This is analogous to the convex consensus formulation, but here the objective function L_p is non-convex.
Distributed ADMM

• From the augmented Lagrangian we can get the update rules.
• ADMM gets the solutions by repeating the following three steps.
• The solution for V^{t+1} has a closed form, which can be calculated efficiently.
• The problem lies in getting U_p^{t+1} and V_p^{t+1} efficiently; we hope to decouple U, V and Θ so that they can be updated in parallel, i.e. locally on each node.

Per-iteration communication pattern:
1. Scatter U_p^t, V_p^t, Θ_p^t, V^t; update U_p^{t+1}, V_p^{t+1} (in each node, in parallel).
2. Gather V_p^{t+1}, Θ_p^t; update V^{t+1} (global).
3. Scatter V^{t+1}; update Θ_p^{t+1} (in each node, in parallel).

The problem becomes how to construct a surrogate objective function that is convex (the original L_p is non-convex), so that the three-step ADMM update can be used.
P.S. Consensus optimization via ADMM
S. P. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. “Distributed optimization and statistical learning via the alternating direction method of multipliers.”
Foundations and Trends in Machine Learning, pages 1-122, 2011.
With regularization g, the averaging in the z-update is followed by a prox_ρ step.
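The slide's figure shows the global-consensus ADMM updates from Boyd et al.; written out (this is the standard form from that reference, with P local variables x_p and scaled duals u_p):

```latex
% Global variable consensus with regularization g(z):
%   minimize  \sum_{p=1}^{P} f_p(x_p) + g(z)   subject to   x_p = z,\; p = 1,\dots,P
x_p^{k+1} = \arg\min_{x_p} \Big( f_p(x_p) + \tfrac{\rho}{2}\,\|x_p - z^k + u_p^k\|_2^2 \Big)
z^{k+1}   = \arg\min_{z} \Big( g(z) + \tfrac{P\rho}{2}\,\|z - \bar{x}^{k+1} - \bar{u}^{k}\|_2^2 \Big)
u_p^{k+1} = u_p^k + x_p^{k+1} - z^{k+1}
```

So the z-update averages the local variables and duals and then applies the proximal operator of g.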
Stochastic Learning for Distributed ADMM

• Batch Learning: construct a surrogate objective function G_p, which is convex and makes U and V decouple from each other. It is still not very efficient, because updating each U_{*i} requires accessing all ratings related to U_{*i}, which is slow.
• To improve efficiency, a stochastic learning variant is proposed, which leads to DS-ADMM.
DS-ADMM Algorithm

Scheduler Comparison

(Note on the experimental plots: why does the x-axis use time rather than the number of iterations?)
A Flexible and Efficient Algorithmic Framework for Constrained Matrix and Tensor Factorization

Kejun Huang, Nicholas D. Sidiropoulos, and Athanasios P. Liavas. "A flexible and efficient algorithmic framework for constrained matrix and tensor factorization." IEEE Transactions on Signal Processing, 64(19): 5052-5065, 2016.
Constrained Matrix Factorization

• Optimization problem:

\min_{W, H} \; \tfrac{1}{2}\,\|Y - W H^T\|_F^2 + r_W(W) + r_H(H)

where r_W(·) and r_H(·) are the regularizations on the latent factors, Y ∈ ℝ^{m×n}, W ∈ ℝ^{m×k}, H ∈ ℝ^{n×k}, and H̃ ∈ ℝ^{k×n} is the auxiliary (split) variable.
• The problem is not convex in W and H jointly, but it is convex in one factor when the other is fixed, so alternating optimization (AO) is usually employed: update H while fixing W using ADMM, and vice versa.
• ADMM for the regularized least-squares subproblem yields the following iterates: cache W^T Y, take the Cholesky decomposition W^T W + ρI = L L^T, and update H̃ by forward and backward substitution; the per-iteration complexity is O(k²n) (see the code sketch below).
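A minimal NumPy/SciPy sketch of the cached-Cholesky ADMM inner loop for the H-subproblem described above; prox_r, rho, and the iteration count are placeholders supplied by the caller.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def admm_update_H(Y, W, H, prox_r, rho=1.0, n_iters=10):
    """ADMM inner loop for the H-subproblem of  min 0.5*||Y - W H^T||_F^2 + r(H).

    prox_r(X, rho): proximal operator of r(.) with parameter rho (user supplied).
    Returns the updated H (n, k). Iteration count and rho are illustrative.
    """
    k = W.shape[1]
    # Cached quantities: one Cholesky factorization per outer (AO) step.
    G = cho_factor(W.T @ W + rho * np.eye(k))        # W^T W + rho*I = L L^T
    F = W.T @ Y                                      # cached W^T Y, shape (k, n)
    U = np.zeros_like(H)                             # scaled dual variable
    for _ in range(n_iters):
        # H_tilde-update: solve (W^T W + rho*I) H_tilde = W^T Y + rho*(H + U)^T
        H_tilde = cho_solve(G, F + rho * (H + U).T)  # (k, n)
        # H-update: proximal step on r(.)
        H = prox_r(H_tilde.T - U, rho)
        # Dual update
        U = U + H - H_tilde.T
    return H
```

The Cholesky factor and W^T Y are computed once per outer AO step, so each inner iteration costs only the O(k²n) triangular solves plus the prox.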
Proximity Operator

• The H-update in the derived iterates is the proximity operator of the function (1/ρ) r(·), evaluated at the point H̃^T − U.
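Using the standard definition of the proximity operator, this update reads:

```latex
H \;\leftarrow\; \operatorname{prox}_{(1/\rho)\, r}\!\big( \tilde{H}^T - U \big)
\;=\; \arg\min_{X} \; r(X) + \tfrac{\rho}{2}\,\big\| X - (\tilde{H}^T - U) \big\|_F^2
```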
Most Commonly Used Constraints/Regularizations

• Non-negativity. r(·) is the indicator function of ℝ₊ (the non-negative orthant); the update is an elementwise projection.
• Lasso regularization. For r(H) = λ‖H‖₁ the update is the well-known soft-thresholding operator.
• Simplex constraint. In some probabilistic models we need to constrain the columns or rows of H to be elementwise non-negative and sum up to one.
• Smoothness regularization. We can encourage the columns of H to be smooth by adding the regularization r(H) = (λ/2)‖TH‖_F².
• Projections onto non-convex constraints, e.g. cardinality constraints, can be handled by hard-thresholding, but then ADMM is not guaranteed to converge to the optimal solution.

(Code sketches of some of these proximal maps are given below.)
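Minimal sketches of three of these proximal maps, written to match the prox_r(X, rho) slot used in the earlier ADMM sketch; lam and s are illustrative parameters.

```python
import numpy as np

def prox_nonneg(V, rho):
    """Indicator of the non-negative orthant: prox is elementwise projection."""
    return np.maximum(V, 0.0)

def prox_l1(V, rho, lam=0.1):
    """Lasso r(H) = lam*||H||_1: prox is elementwise soft-thresholding at lam/rho."""
    t = lam / rho
    return np.sign(V) * np.maximum(np.abs(V) - t, 0.0)

def prox_cardinality(V, rho, s=5):
    """Keep the s largest-magnitude entries per column (hard-thresholding).
    Non-convex: ADMM is no longer guaranteed to reach the optimum."""
    H = np.zeros_like(V)
    idx = np.argsort(-np.abs(V), axis=0)[:s]       # top-s rows per column
    cols = np.arange(V.shape[1])
    H[idx, cols] = V[idx, cols]
    return H
```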
ADMM Algorithm

Define the relative primal residual and the relative dual residual, where H₀ is H from the previous ADMM iteration; these are used in the stopping criterion.
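One plausible concretization of these residuals under the splitting H = H̃^T with scaled dual U; the exact normalization used on the slide is an assumption here.

```latex
r_{\text{primal}} = \frac{\|H - \tilde{H}^T\|_F^2}{\|H\|_F^2},
\qquad
r_{\text{dual}} = \frac{\|H - H_0\|_F^2}{\|U\|_F^2}
```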
AO-ADMM Algorithm

• We can initialize the current ADMM update using the previous results for W and H.
• ADMM is thus a subroutine inside alternating optimization (AO), hence the name AO-ADMM.
General Loss Solution for ADMM

where U is the scaled dual variable corresponding to the constraint H = H̃^T, and Ṽ is the scaled dual variable corresponding to the equality constraint Ỹ = W H̃.
Most Commonly Used Non-Least-Squares Loss Functions
• Missing values. In the case that only a subset of the entries in Y are available.
• Robust fitting. In the case that data entries are not uniformly corrupted by noise but only sparingly
corrupted by outliers, or when the noise is dense but heavy-tailed (e.g., Laplacian-distributed), we
can use the L1 norm as the loss function for robust fitting.
• Huber fitting. Another way to deal with possible outliers in Y is to use the Huber function to
measure the loss.
• Kullback-Leibler divergence. A commonly adopted loss function for non-negative integer data is
the Kullback-Leibler (K-L) divergence defined as
General Loss ADMM Algorithm

Define the relative primal residual and the relative dual residual as before, where H₀ is H from the previous ADMM iteration.
Extension to Tensor Factorization

For a third-order tensor Y with CP factors A, B, C, the mode-1 factor update is the least-squares problem

A \leftarrow \arg\min_{A} \; \big\| Y_{(1)}^T - (C \odot B)\, A^T \big\|_F^2

where ⊙ is the Khatri-Rao product, W = C ⊙ B, Y_{(1)} is the mode-1 matricization of the tensor Y, and ∗ is the element-wise (Hadamard) product. The normal equations can be formed without constructing W explicitly, since

W^T W = (C^T C) \ast (B^T B), \qquad W^T Y_{(1)}^T = (C \odot B)^T\, Y_{(1)}^T

(a code sketch follows).
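A small NumPy sketch of this mode-1 update, assuming a Kolda-style mode-1 unfolding Y1 of shape (I, J·K) and factors B (J, F), C (K, F); the ridge term rho only mirrors the ADMM-regularized version and is illustrative.

```python
import numpy as np

def khatri_rao(C, B):
    """Column-wise Khatri-Rao product of C (K, F) and B (J, F) -> (K*J, F)."""
    K, F = C.shape
    J = B.shape[0]
    return (C[:, None, :] * B[None, :, :]).reshape(K * J, F)

def update_mode1_factor(Y1, B, C, rho=0.0):
    """Least-squares update of A from the mode-1 unfolding Y1 (I, J*K).

    Uses W^T W = (C^T C) * (B^T B) so the F x F Gram matrix is formed cheaply.
    """
    F = B.shape[1]
    W = khatri_rao(C, B)                          # (K*J, F); fine for small examples
    G = (C.T @ C) * (B.T @ B) + rho * np.eye(F)   # W^T W + rho*I via the Hadamard identity
    M = Y1 @ W                                    # MTTKRP: (I, F)
    return np.linalg.solve(G, M.T).T              # A = M G^{-1}  (G is symmetric)
```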
Tensor AO-ADMM Algorithm

• The outer AO framework naturally provides a good initial point to the inner ADMM iteration; this is called the warm-start strategy.
• The algorithm is given both for the LS loss function and for general loss functions.
Low-Rank Regularized Heterogeneous Tensor Decomposition (LRRHTD) for Subspace Clustering

J. Zhang, X. Li, P. Jing, J. Liu, and Y. Su. "Low-rank regularized heterogeneous tensor decomposition for subspace clustering." IEEE Signal Processing Letters, vol. 25, no. 3, pp. 333-337, Mar. 2018.
Tucker Decomposition

X ≈ G ×_1 A ×_2 B ×_3 C, with data tensor X ∈ ℝ^{I×J×K}, core tensor G ∈ ℝ^{P×Q×R}, and factor matrices A ∈ ℝ^{I×P}, B ∈ ℝ^{J×Q}, C ∈ ℝ^{K×R} (⊗ denotes the Kronecker product).

The factor matrices (which are usually orthogonal) A, B, and C are often referred to as the principal components in the respective tensor modes. When P, Q, R are smaller than I, J, K, this results in a compression of X, with G being the compressed version of X.
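The slide's equation is an image; for reference, the standard matricized (mode-1 unfolded) form of the three-way Tucker model, which is where the Kronecker product enters, is:

```latex
\mathcal{X} \approx \mathcal{G} \times_1 A \times_2 B \times_3 C
\quad\Longleftrightarrow\quad
X_{(1)} \approx A\, G_{(1)}\, (C \otimes B)^T
```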
Tucker Decomposition

• The Tucker model can be generalized to N-way tensors.
• The n-rank of a tensor, denoted rank_n(X), is the column rank of the mode-n unfolding X_(n): rank_n(X) = rank(X_(n)).
• According to the type of constraints, Tucker decomposition approaches can be roughly grouped into three categories:
  • orthogonal tensor decomposition
  • non-negative tensor decomposition
  • sparse tensor decomposition
• Almost all of the above algorithms decompose tensors based on the isotropy hypothesis (i.e. orthogonal, non-negative, ...), meaning that the factor matrices are learned in an equivalent way for all modes.
• This is not suitable for heterogeneous tensor data.
Low-Rank Regularized Heterogeneous Tensor Decomposition (LRRHTD)

• For all but the last mode, LRRHTD seeks a set of orthogonal projection matrices to map the original tensor data into a low-dimensional common subspace.
• For the last mode, a low-rank projection matrix is learned by imposing a nuclear-norm penalty, so that a lowest-rank representation revealing the global structure of the samples is obtained for clustering.
• The data are M-th order tensors, and N is the total number of samples.
• The N sample tensors are concatenated to yield an (M+1)-th order tensor.
• The goal of LRRHTD is to find M orthogonal factor matrices for an intrinsic low-dimensional representation, together with the lowest-rank representation that uses the mapped low-dimensional tensor as a dictionary (with D < N).
Low-Rank Regularized Heterogeneous Tensor Decomposition (LRRHTD)

• The Tucker decomposition of the concatenated tensor X can be estimated in a general form: a core tensor multiplied along each mode by the factor matrices, plus an approximation error tensor.
• The cost function is minimized (arg min) over the orthogonal factor matrices and the low-rank representation.
Thank you for your attention
