Subgradient methods for huge-scale optimization problems
Yurii Nesterov, CORE/INMA (UCL)
November 10, 2014 (Yandex, Moscow)
Outline
1 Problem sizes
2 Sparse Optimization problems
3 Sparse updates for linear operators
4 Fast updates in computational trees
5 Simple subgradient methods
6 Application examples
7 Computational experiments: Google problem

Nonlinear Optimization: problem sizes

Class         Operations   Dimension        Iter. cost   Memory
Small-size    All          10^0 – 10^2      n^4 → n^3    Kilobyte: 10^3
Medium-size   A^{-1}       10^3 – 10^4      n^3 → n^2    Megabyte: 10^6
Large-scale   Ax           10^5 – 10^7      n^2 → n      Gigabyte: 10^9
Huge-scale    x + y        10^8 – 10^12     n → log n    Terabyte: 10^12

Sources of Huge-Scale problems
Internet (New)
Telecommunications (New)
Finite-element schemes (Old)
PDE, Weather prediction (Old)

Main hope: Sparsity.

Sparse problems

Problem:  min_{x ∈ Q} f(x),  where Q is closed and convex in R^N, and
f(x) = Ψ(Ax), where Ψ is a simple convex function:

    Ψ(y_1) ≥ Ψ(y_2) + ⟨Ψ′(y_2), y_1 − y_2⟩,   y_1, y_2 ∈ R^M,

and A : R^N → R^M is a sparse matrix.

Let p(x) denote the number of nonzero entries of x.
Sparsity coefficient:  γ(A) := p(A) / (MN).

Example 1: Matrix-vector multiplication
Computation of the vector Ax needs p(A) operations.
The initial complexity MN is reduced by the factor γ(A).
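
Illustration (not from the slides): a minimal Python sketch of the sparsity coefficient and of evaluating f(x) = Ψ(Ax) with a column-wise sparse A. The storage format and the toy choice Ψ(y) = max_i y^(i) are assumptions made only for this example.

def sparsity_coefficient(cols, M, N):
    # gamma(A) = p(A) / (M * N), where p(A) is the number of nonzeros of A
    p_A = sum(len(col) for col in cols)
    return p_A / (M * N)

def mat_vec(cols, x, M):
    # y = A x in p(A) operations: only the nonzero entries of A are touched
    y = [0.0] * M
    for j, col in enumerate(cols):
        if x[j] != 0.0:
            for i, a_ij in col:
                y[i] += a_ij * x[j]
    return y

def f(cols, x, M):
    # f(x) = Psi(A x) with the simple convex function Psi(y) = max_i y^(i)
    return max(mat_vec(cols, x, M))

# Column j of A is stored as a list of (row, value) pairs.
cols = [[(0, 2.0), (1, -1.0)], [(1, 2.0)], [(2, 2.0), (3, -1.0)], [(3, 2.0)]]
x = [1.0, 0.0, 1.0, 0.0]
print(sparsity_coefficient(cols, 4, 4), f(cols, x, 4))   # 0.375 2.0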

Example: Gradient Method

    x_0 ∈ Q,   x_{k+1} = π_Q( x_k − h f′(x_k) ),   k ≥ 0.

Main computational expenses
Projection onto the simple set Q needs O(N) operations.
Displacement x_k → x_k − h f′(x_k) needs O(N) operations.
f′(x) = A^T Ψ′(Ax). If Ψ is simple, then the main effort is
spent on two matrix-vector multiplications: 2 p(A).

Conclusion: As compared with full matrices, the cost is reduced by the factor γ(A).

Note: For Large- and Huge-scale problems we often have
γ(A) ≈ 10^−4 … 10^−6. Can we get more?
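
Illustration (not from the slides): one step of the scheme above under the assumptions Q = R^N_+ (so π_Q is a coordinate-wise truncation at zero) and Ψ(y) = ½ ‖y‖², hence Ψ′(y) = y. The two matrix-vector products cost 2 p(A); note that the gradient A^T Ψ′(Ax) is in general dense even when A is sparse, which motivates the sparse-update idea of the next slide.

def mat_vec(cols, x, M):
    # y = A x (p(A) operations); cols[j] = column j of A as (row, value) pairs
    y = [0.0] * M
    for j, col in enumerate(cols):
        for i, a_ij in col:
            y[i] += a_ij * x[j]
    return y

def mat_t_vec(cols, u):
    # g = A^T u (p(A) operations)
    return [sum(a_ij * u[i] for i, a_ij in col) for col in cols]

def gradient_step(cols, x, h, M):
    # x_{k+1} = pi_Q( x_k - h * A^T Psi'(A x_k) ) with Psi'(y) = y and Q = R^N_+
    y = mat_vec(cols, x, M)           # first multiplication
    g = mat_t_vec(cols, y)            # second multiplication: the (dense) gradient
    return [max(0.0, xj - h * gj) for xj, gj in zip(x, g)]

cols = [[(0, 2.0), (1, -1.0)], [(1, 2.0)], [(2, 2.0), (3, -1.0)], [(3, 2.0)]]
print(gradient_step(cols, [1.0, 1.0, 1.0, 1.0], h=0.1, M=4))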

Sparse updating strategy

Main idea
After the update x^+ = x + d we have y^+ := A x^+ = Ax + Ad, where y = Ax is already known.
What happens if d is sparse?
Denote σ(d) = { j : d^(j) ≠ 0 }. Then

    y^+ = y + Σ_{j ∈ σ(d)} d^(j) · A e_j.

Its complexity,  κ_A(d) := Σ_{j ∈ σ(d)} p(A e_j),  can be VERY small!

    κ_A(d) = M Σ_{j ∈ σ(d)} γ(A e_j)
           = γ(d) · [ (1/p(d)) Σ_{j ∈ σ(d)} γ(A e_j) ] · MN
           ≤ γ(d) · max_{1 ≤ j ≤ N} γ(A e_j) · MN.

If γ(d) ≤ c γ(A) and γ(A e_j) ≤ c γ(A), then κ_A(d) ≤ c² · γ²(A) · MN.

Expected acceleration: (10^−6)² = 10^−12  ⇒  1 sec instead of ≈ 32 000 years!
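
Illustration (not from the slides): the sparse update of y = Ax after a sparse change d, using the same column storage as in the earlier sketches. Only the columns in σ(d) are touched, so the work equals κ_A(d).

def sparse_update(cols, y, d):
    # y := y + sum_{j in sigma(d)} d^(j) * A e_j, where d is a dict {j: d_j}
    touched = 0
    for j, d_j in d.items():
        if d_j == 0.0:
            continue
        for i, a_ij in cols[j]:       # column A e_j has p(A e_j) nonzeros
            y[i] += a_ij * d_j
            touched += 1
    return touched                    # equals kappa_A(d)

cols = [[(0, 2.0), (1, -1.0)], [(1, 2.0)], [(2, 2.0), (3, -1.0)], [(3, 2.0)]]
y = [2.0, 1.0, 2.0, 1.0]              # y = A x for x = (1, 1, 1, 1)
cost = sparse_update(cols, y, {2: -0.5})   # change a single coordinate of x
print(y, cost)                        # only p(A e_2) = 2 entries of y were updated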

When can it work?

Simple methods: No full-vector operations! (Is it possible?)
Simple problems: Functions with sparse gradients.

Let us try:
1 Quadratic function f(x) = ½ ⟨Ax, x⟩ − ⟨b, x⟩. The gradient
  f′(x) = Ax − b,  x ∈ R^N,
  is not sparse even if A is sparse.
2 Piecewise-linear function g(x) = max_{1 ≤ i ≤ m} [ ⟨a_i, x⟩ − b^(i) ]. Its
  subgradient g′(x) = a_{i(x)}, where i(x):  g(x) = ⟨a_{i(x)}, x⟩ − b^(i(x)),
  can be sparse if a_i is sparse!

But: We need a fast procedure for updating max-type operations.
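
Illustration (not from the slides): for the max-type function the subgradient is the active row a_{i(x)}, which is sparse whenever the rows are sparse. Here the maximum is recomputed by a full scan; the short computational trees described next replace this scan by a logarithmic-time update.

def subgradient(rows, b, x):
    # g(x) = max_i [ <a_i, x> - b^(i) ];  rows[i] = sparse row a_i as (column, value) pairs
    values = [sum(a_ij * x[j] for j, a_ij in row) - b_i
              for row, b_i in zip(rows, b)]
    i_star = max(range(len(values)), key=values.__getitem__)
    return values[i_star], i_star, rows[i_star]   # g(x), active index, sparse subgradient

rows = [[(0, 1.0), (2, -1.0)], [(1, 2.0)], [(0, -1.0), (3, 1.0)]]
b = [0.0, 1.0, 0.5]
print(subgradient(rows, b, [1.0, 1.0, 0.0, 2.0]))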

Fast updates in short computational trees

Def: A function f(x), x ∈ R^n, is short-tree representable if it can
be computed by a short binary tree of height ≈ ln n.

Let n = 2^k and let the tree have k + 1 levels: v_{0,i} = x^(i), i = 1, . . . , n.
Each level is half the size of the previous one:

    v_{i+1,j} = ψ_{i+1,j}( v_{i,2j−1}, v_{i,2j} ),   j = 1, . . . , 2^{k−i−1},   i = 0, . . . , k − 1,

where the ψ_{i,j} are some bivariate functions.

(Diagram: a binary tree with leaves v_{0,1}, . . . , v_{0,n} at the bottom, combined
pairwise level by level through v_{1,1}, . . . , v_{1,n/2}, and so on up to the root v_{k,1}.)

Main advantages

Important examples (symmetric functions)

    f(x) = ‖x‖_p,  p ≥ 1,               ψ_{i,j}(t_1, t_2) ≡ [ |t_1|^p + |t_2|^p ]^{1/p},
    f(x) = ln( Σ_{i=1}^n e^{x^(i)} ),   ψ_{i,j}(t_1, t_2) ≡ ln( e^{t_1} + e^{t_2} ),
    f(x) = max_{1 ≤ i ≤ n} x^(i),       ψ_{i,j}(t_1, t_2) ≡ max{ t_1, t_2 }.

The binary tree requires only n − 1 auxiliary cells.
Computing its value needs n − 1 applications of ψ_{i,j}(·, ·) (≡ operations).
If x^+ differs from x in one entry only, then re-computing f(x^+)
needs only k ≡ log2 n operations.

Thus, we can have pure subgradient minimization schemes with
Sublinear Iteration Cost.
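
Illustration (not from the slides): a minimal sketch of the short computational tree for the symmetric function f(x) = max_i x^(i), with n assumed to be a power of two and a standard heap layout chosen for the implementation. Building it takes n − 1 applications of ψ; changing one entry re-computes f along a single root path in k = log2 n operations.

import math

def build_tree(x):
    # leaves v_{0,i} occupy positions n..2n-1; node t stores the max of its two children
    n = len(x)                         # assumed to be a power of two
    v = [0.0] * (2 * n)
    v[n:] = x
    for t in range(n - 1, 0, -1):      # n - 1 applications of psi = max
        v[t] = max(v[2 * t], v[2 * t + 1])
    return v

def update(v, i, value):
    # change x^(i) and re-compute f(x) in log2(n) operations
    n = len(v) // 2
    t = n + i
    v[t] = value
    while t > 1:
        t //= 2
        v[t] = max(v[2 * t], v[2 * t + 1])
    return v[1]                        # the root v_{k,1} holds f(x)

x = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]              # n = 8, k = 3
v = build_tree(x)
print(v[1], update(v, 5, 0.0), int(math.log2(len(x))))    # 9.0 6.0 3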

Simple subgradient methods

I. Problem:  f* := min_{x ∈ Q} f(x),  where
  Q is closed and convex and ‖f′(x)‖ ≤ L(f) for x ∈ Q,
  the optimal value f* is known.

Consider the following optimization scheme (B. Polyak, 1967):

    x_0 ∈ Q,   x_{k+1} = π_Q( x_k − (f(x_k) − f*) / ‖f′(x_k)‖² · f′(x_k) ),   k ≥ 0.

Denote f*_k = min_{0 ≤ i ≤ k} f(x_i). Then for any k ≥ 0 we have:

    f*_k − f* ≤ L(f) ‖x_0 − π_{X*}(x_0)‖ / (k + 1)^{1/2},
    ‖x_k − x*‖ ≤ ‖x_0 − x*‖,   ∀ x* ∈ X*.
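
Illustration (not from the slides): Polyak's step size on a toy problem with known f* = 0 and Q = R^N, so the projection is the identity; the objective f(x) = max_i |x^(i)| is an assumption made only for this example.

def f_and_subgradient(x):
    # f(x) = max_i |x^(i)|; a subgradient is sign(x_i) * e_i at an active index i
    i = max(range(len(x)), key=lambda j: abs(x[j]))
    g = [0.0] * len(x)
    g[i] = 1.0 if x[i] >= 0 else -1.0
    return abs(x[i]), g

def polyak_method(x, f_star=0.0, iterations=200):
    # x_{k+1} = x_k - (f(x_k) - f*) / ||f'(x_k)||^2 * f'(x_k)
    for _ in range(iterations):
        f_x, g = f_and_subgradient(x)
        if f_x <= f_star:
            break
        step = (f_x - f_star) / sum(gj * gj for gj in g)
        x = [xj - step * gj for xj, gj in zip(x, g)]
    return x

print(polyak_method([3.0, -2.0, 1.0]))   # reaches the minimizer (the origin) of this toy f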

Proof:

Let us fix x* ∈ X*. Denote r_k(x*) = ‖x_k − x*‖. Then

    r²_{k+1}(x*) ≤ ‖ x_k − (f(x_k) − f*)/‖f′(x_k)‖² · f′(x_k) − x* ‖²
                 = r²_k(x*) − 2 (f(x_k) − f*)/‖f′(x_k)‖² · ⟨f′(x_k), x_k − x*⟩ + (f(x_k) − f*)²/‖f′(x_k)‖²
                 ≤ r²_k(x*) − (f(x_k) − f*)²/‖f′(x_k)‖²        (by convexity, ⟨f′(x_k), x_k − x*⟩ ≥ f(x_k) − f*)
                 ≤ r²_k(x*) − (f*_k − f*)² / L²(f).

From this reasoning, ‖x_{k+1} − x*‖² ≤ ‖x_k − x*‖², ∀ x* ∈ X*.

Corollary: Assume X* has a recession direction d*. Then

    ‖x_k − π_{X*}(x_0)‖ ≤ ‖x_0 − π_{X*}(x_0)‖,       ⟨d*, x_k⟩ ≥ ⟨d*, x_0⟩.

(Proof: consider x* = π_{X*}(x_0) + α d*, α ≥ 0.)

Constrained minimization (N. Shor (1964) & B. Polyak)

II. Problem:  min_{x ∈ Q} { f(x) : g(x) ≤ 0 },  where
  Q is closed and convex,
  f, g have uniformly bounded subgradients.

Consider the following method. It has a step-size parameter h > 0.

    If g(x_k) > h ‖g′(x_k)‖, then (A):  x_{k+1} = π_Q( x_k − g(x_k)/‖g′(x_k)‖² · g′(x_k) ),
    else (B):  x_{k+1} = π_Q( x_k − h/‖f′(x_k)‖ · f′(x_k) ).

Let F_k ⊆ {0, . . . , k} be the set of (B)-iterations, and f*_k = min_{i ∈ F_k} f(x_i).

Theorem: If k > ‖x_0 − x*‖² / h², then F_k ≠ ∅ and

    f*_k − f* ≤ h L(f),       max_{i ∈ F_k} g(x_i) ≤ h L(g).
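
Illustration (not from the slides): the switching scheme (A)/(B) on a toy instance, min |x_1| + |x_2| subject to 1 − x_1 − x_2 ≤ 0 with Q = R², chosen so that both subgradients are uniformly bounded and f* = 1. Consistently with the theorem, the recorded value over the (B)-iterations is accurate to O(h) and their constraint violation stays below h L(g).

import math

def f(x):  return abs(x[0]) + abs(x[1])
def df(x): return [math.copysign(1.0, x[0]), math.copysign(1.0, x[1])]
def g(x):  return 1.0 - x[0] - x[1]
def dg(x): return [-1.0, -1.0]

def switching_method(x, h, iterations):
    best = float("inf")                     # f*_k, tracked over the (B)-iterations
    for _ in range(iterations):
        gx, dgx = g(x), dg(x)
        norm_dg = math.hypot(*dgx)
        if gx > h * norm_dg:                # (A): step on the violated constraint
            step = gx / norm_dg ** 2
            x = [xj - step * dj for xj, dj in zip(x, dgx)]
        else:                               # (B): normalized step on the objective
            best = min(best, f(x))
            dfx = df(x)
            step = h / math.hypot(*dfx)
            x = [xj - step * dj for xj, dj in zip(x, dfx)]
    return x, best

print(switching_method([0.0, 0.0], h=0.01, iterations=2000))   # best value close to f* = 1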

Computational strategies

1. Constants L(f), L(g) are known (e.g. Linear Programming)
We can take h = ε / max{ L(f), L(g) }. Then we need to decide on the
number of steps N (easy!).
Note: The standard advice is h = R / √(N + 1)  (much more difficult!)

2. Constants L(f), L(g) are not known
Start from a guess.
Restart from scratch each time we see the guess is wrong.
The guess is doubled after restart.

3. Tracking the record value f*_k
Double run. Other ideas are welcome!
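
Illustration (not from the slides): a sketch of strategy 2, restarting with a doubled guess of the unknown constant whenever an observed subgradient norm contradicts it. The inner run (a normalized constant-step method on the toy function f(x) = 3|x|) is only a hypothetical stand-in for any of the schemes above.

def run_with_guess(h, L_guess, x0=5.0, iterations=5000):
    # Returns the best value found, or None if the guess L_guess is contradicted.
    x, best = x0, float("inf")
    for _ in range(iterations):
        g = 3.0 if x >= 0 else -3.0         # subgradient of f(x) = 3|x|
        if abs(g) > L_guess:
            return None                     # observed ||f'(x)|| exceeds the guess
        best = min(best, 3.0 * abs(x))
        x -= h * g / abs(g)                 # step of length h
    return best

def solve_with_doubling(eps=1e-2, L_guess=1.0, max_restarts=30):
    for _ in range(max_restarts):
        best = run_with_guess(h=eps / L_guess, L_guess=L_guess)
        if best is not None:                # the guess was never contradicted
            return best, L_guess
        L_guess *= 2.0                      # the guess was wrong: double it and restart
    raise RuntimeError("the guess is still too small")

print(solve_with_doubling())                # succeeds once L_guess >= 3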

Application examples

Observations:
1 Very often, Large- and Huge-scale problems have repetitive
  sparsity patterns and/or limited connectivity.
  Social networks.
  Mobile phone networks.
  Truss topology design (local bars).
  Finite-element models (2D: four neighbors, 3D: six neighbors).
2 For p-diagonal matrices, κ(A) ≤ p².

Google problem

Goal: Rank the agents in the society by their social weights.
Unknown: x_i ≥ 0 - social influence of agent i = 1, . . . , N.
Known: σ_i - the set of friends of agent i.

Hypothesis
Agent i shares his support among all friends in equal parts.
The influence of agent i is equal to the total support obtained from his friends.

Mathematical formulation: quadratic problem

Let E ∈ R^{N×N} be an incidence matrix of the connections graph.
Denote e = (1, . . . , 1)^T ∈ R^N and Ē = E · diag(E^T e)^{−1}.
Since Ē^T e = e, this matrix is stochastic.

Problem:  Find x* ≥ 0 :  Ē x* = x*,  x* ≠ 0.

The size is very big!

Known technique:
Regularization + Fixed Point (Google founders, B. Polyak & coauthors, etc.)
N09: Solve it by a random coordinate-descent (CD) method as applied to

    ½ ‖Ē x − x‖² + (γ/2) [ ⟨e, x⟩ − 1 ]²,   γ > 0.

Main drawback: No interpretation for the objective function!
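
Illustration (not from the slides): building the sparse columns of Ē = E · diag(E^T e)^{−1} directly from friendship lists; column j spreads the unit support of agent j equally over his friends, so every column sums to one (Ē^T e = e).

def build_columns(friends):
    # friends[j] = set of friends of agent j; returns the sparse columns of E_bar
    cols = []
    for sigma_j in friends:
        weight = 1.0 / len(sigma_j)         # support shared in equal parts
        cols.append([(i, weight) for i in sorted(sigma_j)])
    return cols

friends = [{1, 2}, {0, 2, 3}, {0, 1}, {1}]  # a tiny 4-agent society
cols = build_columns(friends)
print(cols)
print([sum(w for _, w in col) for col in cols])   # each column sums (numerically) to 1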

Nonsmooth formulation of Google Problem

Main property of the spectral radius (A ≥ 0)
If A ∈ R^{n×n}_+, then  ρ(A) = min_{x ≥ 0} max_{1 ≤ i ≤ n} (1/x^(i)) ⟨e_i, Ax⟩.
The minimum is attained at the corresponding eigenvector.

Since ρ(Ē) = 1, our problem is as follows:

    f(x) := max_{1 ≤ i ≤ N} [ ⟨e_i, Ē x⟩ − x^(i) ]  →  min_{x ≥ 0}.

Interpretation: Increase self-confidence!

Since f* = 0, we can apply Polyak's method with sparse updates.
Additional feature: the optimal set X* is a convex cone.
If x_0 = e, then the whole sequence is separated from zero:

    ⟨x*, e⟩ ≤ ⟨x*, x_k⟩ ≤ ‖x*‖_1 · ‖x_k‖_∞ = ⟨x*, e⟩ · ‖x_k‖_∞.

Goal: Find x̄ ≥ 0 such that ‖x̄‖_∞ ≥ 1 and f(x̄) ≤ ε.
(The first condition is satisfied automatically.)
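
Illustration (not from the slides): a simplified sketch of Polyak's method with sparse updates for f(x) = max_i [ (Ēx)^(i) − x^(i) ], f* = 0, started from x_0 = e. The vector y = Ēx is maintained incrementally through the touched columns; for brevity the maximum is found here by a full scan, where the slides use the log2(N) tree instead. The tiny 4-agent instance is an assumption made only for this example.

def google_polyak(rows, cols, iterations=500, tol=1e-9):
    n = len(rows)
    x = [1.0] * n                                    # x_0 = e
    y = [sum(w for _, w in rows[i]) for i in range(n)]        # y = E_bar x
    best = float("inf")                              # record value f*_k
    for _ in range(iterations):
        i_star = max(range(n), key=lambda i: y[i] - x[i])     # full scan (tree in the slides)
        f_x = y[i_star] - x[i_star]
        best = min(best, f_x)
        if f_x <= tol:
            break
        # Sparse subgradient: f'(x) = E_bar^T e_{i*} - e_{i*}  (row i* minus a unit vector)
        grad = {j: w for j, w in rows[i_star]}
        grad[i_star] = grad.get(i_star, 0.0) - 1.0
        step = f_x / sum(w * w for w in grad.values())        # Polyak step with f* = 0
        for j, w in grad.items():                             # sparse projected update of x
            x_new = max(0.0, x[j] - step * w)
            d_j, x[j] = x_new - x[j], x_new
            if d_j != 0.0:
                for i, a_ij in cols[j]:                       # sparse update of y = E_bar x
                    y[i] += a_ij * d_j
    return x, best

friends = [{1, 2}, {0, 2, 3}, {0, 1}, {1}]
cols = [[(i, 1.0 / len(s)) for i in sorted(s)] for s in friends]
rows = [[] for _ in friends]
for j, col in enumerate(cols):
    for i, w in col:
        rows[i].append((j, w))

x, best = google_polyak(rows, cols)
print([round(v, 3) for v in x], best)                # record value f*_k approaches f* = 0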
Computational experiments: Iteration Cost
We compare Polyak's GM with sparse updates (GMs) against the standard one (GM).
Setup: Each agent has exactly p random friends.
Thus, κ(A) := max_{1 ≤ i ≤ M} κ_A(Aᵀ e_i) ≈ p².
Iteration Cost: GMs ≤ κ(A) log₂ N ≈ p² log₂ N, GM ≈ pN.
(log₂ 10³ ≈ 10, log₂ 10⁶ ≈ 20, log₂ 10⁹ ≈ 30.)
Time for 10⁴ iterations (p = 32), in seconds:

    N        κ(A)    GMs     GM
    1024     1632    3.00      2.98
    2048     1792    3.36      6.41
    4096     1888    3.75     15.11
    8192     1920    4.20    139.92
    16384    1824    4.69    408.38
Time for 10³ iterations (p = 16), in seconds:

    N          κ(A)   GMs     GM
    131072     576    0.19    213.9
    262144     592    0.25    477.8
    524288     592    0.32   1095.5
    1048576    608    0.40   2590.8
1 sec of GMs ≈ 100 min of GM!
Yu. Nesterov Subgradient methods for huge-scale problems 18/22
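A rough sanity check of these estimates (my arithmetic, not from the slides): for the last row of the second table, N = 1048576 and p = 16, so the predicted cost per iteration is at most κ(A) log₂ N ≈ 608 · 20 ≈ 1.2 · 10⁴ operations for GMs versus pN ≈ 1.7 · 10⁷ for GM, a ratio of at least about 1.4 · 10³ (the GMs figure is only an upper bound). The measured ratio, 2590.8 s against 0.40 s, is about 6.5 · 10³ — the same order of magnitude.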
Convergence of GMs: Medium Size
Let N = 131072, p = 16, κ(A) = 576, and L(f) = 0.21.
    Iterations   f − f∗    Time (sec)
    1.0 · 10⁵    0.1100      16.44
    3.0 · 10⁵    0.0429      49.32
    6.0 · 10⁵    0.0221      98.65
    1.1 · 10⁶    0.0119     180.85
    2.2 · 10⁶    0.0057     361.71
    4.1 · 10⁶    0.0028     674.09
    7.6 · 10⁶    0.0014    1249.54
    1.0 · 10⁷    0.0010    1644.13
Dimension and accuracy are sufficiently high, but the time is still
reasonable.
Yu. Nesterov Subgradient methods for huge-scale problems 19/22
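A side observation (mine, not from the slides): in this table the gap f − f∗ shrinks roughly in proportion to 1/k — a factor of 41 more iterations (from 1.0 · 10⁵ to 4.1 · 10⁶) reduces the gap by a factor of about 39 (from 0.1100 to 0.0028) — noticeably faster than the worst-case O(1/√k) guarantee of the subgradient scheme.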
Convergence of GMs: Large Scale
Let N = 1048576, p = 8, κ(A) = 192, and L(f) = 0.21.
    Iterations   f − f∗      Time (sec)
    0            2.000000       0.00
    1.0 · 10⁵    0.546662       7.69
    4.0 · 10⁵    0.276866      30.74
    1.0 · 10⁶    0.137822      76.86
    2.5 · 10⁶    0.063099     192.14
    5.1 · 10⁶    0.032092     391.97
    9.9 · 10⁶    0.016162     760.88
    1.5 · 10⁷    0.010009    1183.59
Final point x̄∗: ‖x̄∗‖∞ = 2.941497, R₀² := ‖x̄∗ − e‖₂² = 1.2 · 10⁵.
Theoretical bound: L²(f) R₀² / ε² = 5.3 · 10⁷ iterations. Time for the standard GM: ≈ 1 year!
Yu. Nesterov Subgradient methods for huge-scale problems 20/22
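A quick check of this bound (my arithmetic, taking ε = 0.01, the accuracy reached in the last row of the table): L²(f) R₀² / ε² = (0.21)² · 1.2 · 10⁵ / (0.01)² ≈ 5.3 · 10⁷, so the 1.5 · 10⁷ iterations actually performed stay well inside the worst-case estimate.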
Conclusion
1 Sparse GM is an efficient and reliable method for solving
Large- and Huge-Scale problems with uniform sparsity.
2 We can also treat dense rows. Assume that the inequality
⟨a, x⟩ ≤ b is dense. It is equivalent to the following system:
y⁽¹⁾ = a⁽¹⁾ x⁽¹⁾,  y⁽ʲ⁾ = y⁽ʲ⁻¹⁾ + a⁽ʲ⁾ x⁽ʲ⁾, j = 2, . . . , n,
y⁽ⁿ⁾ ≤ b.
We need new variables y⁽ʲ⁾ only for the nonzero coefficients of a.
This introduces p(a) additional variables and p(a) additional
equality constraints per dense row (no problem!); a minimal sketch
of this construction is given after this slide.
Hidden drawback: the above equalities are satisfied with errors.
Maybe it is not too bad?
3 A similar technique can be applied to dense columns.
Yu. Nesterov Subgradient methods for huge-scale problems 21/22
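To illustrate item 2, here is a minimal sketch (my own naming and code, not material from the talk) that rewrites one dense inequality ⟨a, x⟩ ≤ b as the sparse chain of equalities above, introducing one partial-sum variable per nonzero coefficient of a:

    import numpy as np
    from scipy.sparse import lil_matrix

    def chain_dense_row(a, b):
        # Returns sparse E, F and vectors g, c such that
        #   E x + F y = g  and  y[-1] <= c
        # encode y_1 = a_{j1} x_{j1}, y_k = y_{k-1} + a_{jk} x_{jk}, y_p <= b,
        # where j1..jp are the nonzero positions of a (p = p(a) new variables).
        nz = np.flatnonzero(a)
        p, n = len(nz), len(a)
        E = lil_matrix((p, n))
        F = lil_matrix((p, p))
        for k, j in enumerate(nz):
            E[k, j] = a[j]          # a_{jk} * x_{jk}
            F[k, k] = -1.0          # ... - y_k = 0
            if k > 0:
                F[k, k - 1] = 1.0   # ... + y_{k-1}
        return E.tocsr(), F.tocsr(), np.zeros(p), float(b)

Each row of the resulting system has at most three nonzeros, so the reformulated constraint fits the sparse-update machinery, at the price of the p(a) extra variables and of the equalities being satisfied only approximately, as noted above.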
Thank you for your attention!
Yu. Nesterov Subgradient methods for huge-scale problems 22/22

Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium

  • 1. Subgradient methods for huge-scale optimization problems Yurii Nesterov, CORE/INMA (UCL) November 10, 2014 (Yandex, Moscow) Yu. Nesterov Subgradient methods for huge-scale problems 1/22
  • 2. Outline 1 Problems sizes 2 Sparse Optimization problems 3 Sparse updates for linear operators 4 Fast updates in computational trees 5 Simple subgradient methods 6 Application examples 7 Computational experiments: Google problem Yu. Nesterov Subgradient methods for huge-scale problems 2/22
  • 3. Nonlinear Optimization: problems sizes Class Operations Dimension Iter.Cost Memory Small-size All 100 − 102 n4 → n3 Kilobyte: 103 Medium-size A−1 103 − 104 n3 → n2 Megabyte: 106 Yu. Nesterov Subgradient methods for huge-scale problems 3/22
  • 4. Nonlinear Optimization: problems sizes Class Operations Dimension Iter.Cost Memory Small-size All 100 − 102 n4 → n3 Kilobyte: 103 Medium-size A−1 103 − 104 n3 → n2 Megabyte: 106 Large-scale Ax 105 − 107 n2 → n Gigabyte: 109 Yu. Nesterov Subgradient methods for huge-scale problems 3/22
  • 5. Nonlinear Optimization: problems sizes Class Operations Dimension Iter.Cost Memory Small-size All 100 − 102 n4 → n3 Kilobyte: 103 Medium-size A−1 103 − 104 n3 → n2 Megabyte: 106 Large-scale Ax 105 − 107 n2 → n Gigabyte: 109 Huge-scale x + y 108 − 1012 n → log n Terabyte: 1012 Yu. Nesterov Subgradient methods for huge-scale problems 3/22
  • 6. Nonlinear Optimization: problems sizes Class Operations Dimension Iter.Cost Memory Small-size All 100 − 102 n4 → n3 Kilobyte: 103 Medium-size A−1 103 − 104 n3 → n2 Megabyte: 106 Large-scale Ax 105 − 107 n2 → n Gigabyte: 109 Huge-scale x + y 108 − 1012 n → log n Terabyte: 1012 Sources of Huge-Scale problems Yu. Nesterov Subgradient methods for huge-scale problems 3/22
  • 7. Nonlinear Optimization: problems sizes Class Operations Dimension Iter.Cost Memory Small-size All 100 − 102 n4 → n3 Kilobyte: 103 Medium-size A−1 103 − 104 n3 → n2 Megabyte: 106 Large-scale Ax 105 − 107 n2 → n Gigabyte: 109 Huge-scale x + y 108 − 1012 n → log n Terabyte: 1012 Sources of Huge-Scale problems Internet (New) Telecommunications (New) Yu. Nesterov Subgradient methods for huge-scale problems 3/22
  • 8. Nonlinear Optimization: problems sizes Class Operations Dimension Iter.Cost Memory Small-size All 100 − 102 n4 → n3 Kilobyte: 103 Medium-size A−1 103 − 104 n3 → n2 Megabyte: 106 Large-scale Ax 105 − 107 n2 → n Gigabyte: 109 Huge-scale x + y 108 − 1012 n → log n Terabyte: 1012 Sources of Huge-Scale problems Internet (New) Telecommunications (New) Finite-element schemes (Old) PDE, Weather prediction (Old) Yu. Nesterov Subgradient methods for huge-scale problems 3/22
  • 9. Nonlinear Optimization: problems sizes Class Operations Dimension Iter.Cost Memory Small-size All 100 − 102 n4 → n3 Kilobyte: 103 Medium-size A−1 103 − 104 n3 → n2 Megabyte: 106 Large-scale Ax 105 − 107 n2 → n Gigabyte: 109 Huge-scale x + y 108 − 1012 n → log n Terabyte: 1012 Sources of Huge-Scale problems Internet (New) Telecommunications (New) Finite-element schemes (Old) PDE, Weather prediction (Old) Main hope: Sparsity. Yu. Nesterov Subgradient methods for huge-scale problems 3/22
  • 10. Sparse problems Yu. Nesterov Subgradient methods for huge-scale problems 4/22
  • 11. Sparse problems Problem: min x∈Q f (x), where Q is closed and convex in RN, Yu. Nesterov Subgradient methods for huge-scale problems 4/22
  • 12. Sparse problems Problem: min x∈Q f (x), where Q is closed and convex in RN, and f (x) = Ψ(Ax), where Ψ is a simple convex function: Ψ(y1) ≥ Ψ(y2) + Ψ (y2), y1 − y2 , y1, y2 ∈ RM , Yu. Nesterov Subgradient methods for huge-scale problems 4/22
  • 13. Sparse problems Problem: min x∈Q f (x), where Q is closed and convex in RN, and f (x) = Ψ(Ax), where Ψ is a simple convex function: Ψ(y1) ≥ Ψ(y2) + Ψ (y2), y1 − y2 , y1, y2 ∈ RM , A : RN → RM is a sparse matrix. Yu. Nesterov Subgradient methods for huge-scale problems 4/22
  • 14. Sparse problems Problem: min x∈Q f (x), where Q is closed and convex in RN, and f (x) = Ψ(Ax), where Ψ is a simple convex function: Ψ(y1) ≥ Ψ(y2) + Ψ (y2), y1 − y2 , y1, y2 ∈ RM , A : RN → RM is a sparse matrix. Let p(x) def = # of nonzeros in x. Yu. Nesterov Subgradient methods for huge-scale problems 4/22
  • 15. Sparse problems Problem: min x∈Q f (x), where Q is closed and convex in RN, and f (x) = Ψ(Ax), where Ψ is a simple convex function: Ψ(y1) ≥ Ψ(y2) + Ψ (y2), y1 − y2 , y1, y2 ∈ RM , A : RN → RM is a sparse matrix. Let p(x) def = # of nonzeros in x. Sparsity coefficient: γ(A) def = p(A) MN . Yu. Nesterov Subgradient methods for huge-scale problems 4/22
  • 16. Sparse problems Problem: min x∈Q f (x), where Q is closed and convex in RN, and f (x) = Ψ(Ax), where Ψ is a simple convex function: Ψ(y1) ≥ Ψ(y2) + Ψ (y2), y1 − y2 , y1, y2 ∈ RM , A : RN → RM is a sparse matrix. Let p(x) def = # of nonzeros in x. Sparsity coefficient: γ(A) def = p(A) MN . Example 1: Matrix-vector multiplication Yu. Nesterov Subgradient methods for huge-scale problems 4/22
  • 17. Sparse problems Problem: min x∈Q f (x), where Q is closed and convex in RN, and f (x) = Ψ(Ax), where Ψ is a simple convex function: Ψ(y1) ≥ Ψ(y2) + Ψ (y2), y1 − y2 , y1, y2 ∈ RM , A : RN → RM is a sparse matrix. Let p(x) def = # of nonzeros in x. Sparsity coefficient: γ(A) def = p(A) MN . Example 1: Matrix-vector multiplication Computation of vector Ax needs p(A) operations. Yu. Nesterov Subgradient methods for huge-scale problems 4/22
  • 18. Sparse problems Problem: min x∈Q f (x), where Q is closed and convex in RN, and f (x) = Ψ(Ax), where Ψ is a simple convex function: Ψ(y1) ≥ Ψ(y2) + Ψ (y2), y1 − y2 , y1, y2 ∈ RM , A : RN → RM is a sparse matrix. Let p(x) def = # of nonzeros in x. Sparsity coefficient: γ(A) def = p(A) MN . Example 1: Matrix-vector multiplication Computation of vector Ax needs p(A) operations. Initial complexity MN is reduced in γ(A) times. Yu. Nesterov Subgradient methods for huge-scale problems 4/22
  • 19. Example: Gradient Method Yu. Nesterov Subgradient methods for huge-scale problems 5/22
  • 20. Example: Gradient Method x0 ∈ Q, xk+1 = πQ(xk − hf (xk)), k ≥ 0. Yu. Nesterov Subgradient methods for huge-scale problems 5/22
  • 21. Example: Gradient Method x0 ∈ Q, xk+1 = πQ(xk − hf (xk)), k ≥ 0. Main computational expenses Yu. Nesterov Subgradient methods for huge-scale problems 5/22
  • 22. Example: Gradient Method x0 ∈ Q, xk+1 = πQ(xk − hf (xk)), k ≥ 0. Main computational expenses Projection of simple set Q needs O(N) operations. Yu. Nesterov Subgradient methods for huge-scale problems 5/22
  • 23. Example: Gradient Method x0 ∈ Q, xk+1 = πQ(xk − hf (xk)), k ≥ 0. Main computational expenses Projection of simple set Q needs O(N) operations. Displacement xk → xk − hf (xk) needs O(N) operations. Yu. Nesterov Subgradient methods for huge-scale problems 5/22
  • 24. Example: Gradient Method x0 ∈ Q, xk+1 = πQ(xk − hf (xk)), k ≥ 0. Main computational expenses Projection of simple set Q needs O(N) operations. Displacement xk → xk − hf (xk) needs O(N) operations. f (x) = AT Ψ (Ax). Yu. Nesterov Subgradient methods for huge-scale problems 5/22
  • 25. Example: Gradient Method x0 ∈ Q, xk+1 = πQ(xk − hf (xk)), k ≥ 0. Main computational expenses Projection of simple set Q needs O(N) operations. Displacement xk → xk − hf (xk) needs O(N) operations. f (x) = AT Ψ (Ax). If Ψ is simple, then the main efforts are spent for two matrix-vector multiplications: 2p(A). Yu. Nesterov Subgradient methods for huge-scale problems 5/22
  • 26. Example: Gradient Method x0 ∈ Q, xk+1 = πQ(xk − hf (xk)), k ≥ 0. Main computational expenses Projection of simple set Q needs O(N) operations. Displacement xk → xk − hf (xk) needs O(N) operations. f (x) = AT Ψ (Ax). If Ψ is simple, then the main efforts are spent for two matrix-vector multiplications: 2p(A). Conclusion: Yu. Nesterov Subgradient methods for huge-scale problems 5/22
  • 27. Example: Gradient Method x0 ∈ Q, xk+1 = πQ(xk − hf (xk)), k ≥ 0. Main computational expenses Projection of simple set Q needs O(N) operations. Displacement xk → xk − hf (xk) needs O(N) operations. f (x) = AT Ψ (Ax). If Ψ is simple, then the main efforts are spent for two matrix-vector multiplications: 2p(A). Conclusion: As compared with full matrices, we accelerate in γ(A) times. Yu. Nesterov Subgradient methods for huge-scale problems 5/22
  • 28. Example: Gradient Method x0 ∈ Q, xk+1 = πQ(xk − hf (xk)), k ≥ 0. Main computational expenses Projection of simple set Q needs O(N) operations. Displacement xk → xk − hf (xk) needs O(N) operations. f (x) = AT Ψ (Ax). If Ψ is simple, then the main efforts are spent for two matrix-vector multiplications: 2p(A). Conclusion: As compared with full matrices, we accelerate in γ(A) times. Note: For Large- and Huge-scale problems, we often have γ(A) ≈ 10−4 . . . 10−6. Yu. Nesterov Subgradient methods for huge-scale problems 5/22
  • 29. Example: Gradient Method x0 ∈ Q, xk+1 = πQ(xk − hf (xk)), k ≥ 0. Main computational expenses Projection of simple set Q needs O(N) operations. Displacement xk → xk − hf (xk) needs O(N) operations. f (x) = AT Ψ (Ax). If Ψ is simple, then the main efforts are spent for two matrix-vector multiplications: 2p(A). Conclusion: As compared with full matrices, we accelerate in γ(A) times. Note: For Large- and Huge-scale problems, we often have γ(A) ≈ 10−4 . . . 10−6. Can we get more? Yu. Nesterov Subgradient methods for huge-scale problems 5/22
  • 30. Sparse updating strategy Yu. Nesterov Subgradient methods for huge-scale problems 6/22
  • 31. Sparse updating strategy Main idea Yu. Nesterov Subgradient methods for huge-scale problems 6/22
  • 32. Sparse updating strategy Main idea After update x+ = x + d Yu. Nesterov Subgradient methods for huge-scale problems 6/22
  • 33. Sparse updating strategy Main idea After update x+ = x + d we have y+ def = Ax+ Yu. Nesterov Subgradient methods for huge-scale problems 6/22
  • 34. Sparse updating strategy Main idea After update x+ = x + d we have y+ def = Ax+ = Ax y +Ad. Yu. Nesterov Subgradient methods for huge-scale problems 6/22
  • 35. Sparse updating strategy Main idea After update x+ = x + d we have y+ def = Ax+ = Ax y +Ad. What happens if d is sparse? Yu. Nesterov Subgradient methods for huge-scale problems 6/22
  • 36. Sparse updating strategy Main idea After update x+ = x + d we have y+ def = Ax+ = Ax y +Ad. What happens if d is sparse? Denote σ(d) = {j : d(j) = 0}. Yu. Nesterov Subgradient methods for huge-scale problems 6/22
  • 37. Sparse updating strategy Main idea After update x+ = x + d we have y+ def = Ax+ = Ax y +Ad. What happens if d is sparse? Denote σ(d) = {j : d(j) = 0}. Then y+ = y + j∈σ(d) d(j) · Aej . Yu. Nesterov Subgradient methods for huge-scale problems 6/22
  • 38. Sparse updating strategy Main idea After update x+ = x + d we have y+ def = Ax+ = Ax y +Ad. What happens if d is sparse? Denote σ(d) = {j : d(j) = 0}. Then y+ = y + j∈σ(d) d(j) · Aej . Its complexity, κA(d) def = j∈σ(d) p(Aej ), Yu. Nesterov Subgradient methods for huge-scale problems 6/22
  • 39. Sparse updating strategy Main idea After update x+ = x + d we have y+ def = Ax+ = Ax y +Ad. What happens if d is sparse? Denote σ(d) = {j : d(j) = 0}. Then y+ = y + j∈σ(d) d(j) · Aej . Its complexity, κA(d) def = j∈σ(d) p(Aej ), can be VERY small! Yu. Nesterov Subgradient methods for huge-scale problems 6/22
  • 40. Sparse updating strategy Main idea After update x+ = x + d we have y+ def = Ax+ = Ax y +Ad. What happens if d is sparse? Denote σ(d) = {j : d(j) = 0}. Then y+ = y + j∈σ(d) d(j) · Aej . Its complexity, κA(d) def = j∈σ(d) p(Aej ), can be VERY small! κA(d) = M j∈σ(d) γ(Aej ) Yu. Nesterov Subgradient methods for huge-scale problems 6/22
  • 41. Sparse updating strategy Main idea After update x+ = x + d we have y+ def = Ax+ = Ax y +Ad. What happens if d is sparse? Denote σ(d) = {j : d(j) = 0}. Then y+ = y + j∈σ(d) d(j) · Aej . Its complexity, κA(d) def = j∈σ(d) p(Aej ), can be VERY small! κA(d) = M j∈σ(d) γ(Aej ) = γ(d) · 1 p(d) j∈σ(d) γ(Aej ) · MN Yu. Nesterov Subgradient methods for huge-scale problems 6/22
  • 42. Sparse updating strategy Main idea After update x+ = x + d we have y+ def = Ax+ = Ax y +Ad. What happens if d is sparse? Denote σ(d) = {j : d(j) = 0}. Then y+ = y + j∈σ(d) d(j) · Aej . Its complexity, κA(d) def = j∈σ(d) p(Aej ), can be VERY small! κA(d) = M j∈σ(d) γ(Aej ) = γ(d) · 1 p(d) j∈σ(d) γ(Aej ) · MN ≤ γ(d) max 1≤j≤m γ(Aej ) · MN. Yu. Nesterov Subgradient methods for huge-scale problems 6/22
  • 43. Sparse updating strategy Main idea After update x+ = x + d we have y+ def = Ax+ = Ax y +Ad. What happens if d is sparse? Denote σ(d) = {j : d(j) = 0}. Then y+ = y + j∈σ(d) d(j) · Aej . Its complexity, κA(d) def = j∈σ(d) p(Aej ), can be VERY small! κA(d) = M j∈σ(d) γ(Aej ) = γ(d) · 1 p(d) j∈σ(d) γ(Aej ) · MN ≤ γ(d) max 1≤j≤m γ(Aej ) · MN. If γ(d) ≤ cγ(A), γ(Aj ) ≤ cγ(A), Yu. Nesterov Subgradient methods for huge-scale problems 6/22
  • 44. Sparse updating strategy Main idea After update x+ = x + d we have y+ def = Ax+ = Ax y +Ad. What happens if d is sparse? Denote σ(d) = {j : d(j) = 0}. Then y+ = y + j∈σ(d) d(j) · Aej . Its complexity, κA(d) def = j∈σ(d) p(Aej ), can be VERY small! κA(d) = M j∈σ(d) γ(Aej ) = γ(d) · 1 p(d) j∈σ(d) γ(Aej ) · MN ≤ γ(d) max 1≤j≤m γ(Aej ) · MN. If γ(d) ≤ cγ(A), γ(Aj ) ≤ cγ(A), then κA(d) ≤ c2 · γ2(A) · MN . Yu. Nesterov Subgradient methods for huge-scale problems 6/22
  • 45. Sparse updating strategy Main idea After update x+ = x + d we have y+ def = Ax+ = Ax y +Ad. What happens if d is sparse? Denote σ(d) = {j : d(j) = 0}. Then y+ = y + j∈σ(d) d(j) · Aej . Its complexity, κA(d) def = j∈σ(d) p(Aej ), can be VERY small! κA(d) = M j∈σ(d) γ(Aej ) = γ(d) · 1 p(d) j∈σ(d) γ(Aej ) · MN ≤ γ(d) max 1≤j≤m γ(Aej ) · MN. If γ(d) ≤ cγ(A), γ(Aj ) ≤ cγ(A), then κA(d) ≤ c2 · γ2(A) · MN . Expected acceleration: Yu. Nesterov Subgradient methods for huge-scale problems 6/22
  • 46. Sparse updating strategy Main idea After update x+ = x + d we have y+ def = Ax+ = Ax y +Ad. What happens if d is sparse? Denote σ(d) = {j : d(j) = 0}. Then y+ = y + j∈σ(d) d(j) · Aej . Its complexity, κA(d) def = j∈σ(d) p(Aej ), can be VERY small! κA(d) = M j∈σ(d) γ(Aej ) = γ(d) · 1 p(d) j∈σ(d) γ(Aej ) · MN ≤ γ(d) max 1≤j≤m γ(Aej ) · MN. If γ(d) ≤ cγ(A), γ(Aj ) ≤ cγ(A), then κA(d) ≤ c2 · γ2(A) · MN . Expected acceleration: (10−6)2 = 10−12 Yu. Nesterov Subgradient methods for huge-scale problems 6/22
  • 47. Sparse updating strategy Main idea After update x+ = x + d we have y+ def = Ax+ = Ax y +Ad. What happens if d is sparse? Denote σ(d) = {j : d(j) = 0}. Then y+ = y + j∈σ(d) d(j) · Aej . Its complexity, κA(d) def = j∈σ(d) p(Aej ), can be VERY small! κA(d) = M j∈σ(d) γ(Aej ) = γ(d) · 1 p(d) j∈σ(d) γ(Aej ) · MN ≤ γ(d) max 1≤j≤m γ(Aej ) · MN. If γ(d) ≤ cγ(A), γ(Aj ) ≤ cγ(A), then κA(d) ≤ c2 · γ2(A) · MN . Expected acceleration: (10−6)2 = 10−12 ⇒ 1 sec Yu. Nesterov Subgradient methods for huge-scale problems 6/22
  • 48. Sparse updating strategy Main idea After update x+ = x + d we have y+ def = Ax+ = Ax y +Ad. What happens if d is sparse? Denote σ(d) = {j : d(j) = 0}. Then y+ = y + j∈σ(d) d(j) · Aej . Its complexity, κA(d) def = j∈σ(d) p(Aej ), can be VERY small! κA(d) = M j∈σ(d) γ(Aej ) = γ(d) · 1 p(d) j∈σ(d) γ(Aej ) · MN ≤ γ(d) max 1≤j≤m γ(Aej ) · MN. If γ(d) ≤ cγ(A), γ(Aj ) ≤ cγ(A), then κA(d) ≤ c2 · γ2(A) · MN . Expected acceleration: (10−6)2 = 10−12 ⇒ 1 sec ≈ 32 000 years! Yu. Nesterov Subgradient methods for huge-scale problems 6/22
  • 49. When it can work? Yu. Nesterov Subgradient methods for huge-scale problems 7/22
  • 50. When it can work? Simple methods: Yu. Nesterov Subgradient methods for huge-scale problems 7/22
  • 51. When it can work? Simple methods: No full-vector operations! Yu. Nesterov Subgradient methods for huge-scale problems 7/22
  • 52. When it can work? Simple methods: No full-vector operations! (Is it possible?) Yu. Nesterov Subgradient methods for huge-scale problems 7/22
  • 53. When it can work? Simple methods: No full-vector operations! (Is it possible?) Simple problems: Yu. Nesterov Subgradient methods for huge-scale problems 7/22
  • 54. When it can work? Simple methods: No full-vector operations! (Is it possible?) Simple problems: Functions with sparse gradients. Yu. Nesterov Subgradient methods for huge-scale problems 7/22
  • 55. When it can work? Simple methods: No full-vector operations! (Is it possible?) Simple problems: Functions with sparse gradients. Let us try: Yu. Nesterov Subgradient methods for huge-scale problems 7/22
  • 56. When it can work? Simple methods: No full-vector operations! (Is it possible?) Simple problems: Functions with sparse gradients. Let us try: 1 Quadratic function f (x) = 1 2 Ax, x − b, x . Yu. Nesterov Subgradient methods for huge-scale problems 7/22
  • 57. When it can work? Simple methods: No full-vector operations! (Is it possible?) Simple problems: Functions with sparse gradients. Let us try: 1 Quadratic function f (x) = 1 2 Ax, x − b, x . The gradient f (x) = Ax − b, x ∈ RN , Yu. Nesterov Subgradient methods for huge-scale problems 7/22
  • 58. When it can work? Simple methods: No full-vector operations! (Is it possible?) Simple problems: Functions with sparse gradients. Let us try: 1 Quadratic function f (x) = 1 2 Ax, x − b, x . The gradient f (x) = Ax − b, x ∈ RN , is not sparse even if A is sparse. Yu. Nesterov Subgradient methods for huge-scale problems 7/22
  • 59. When it can work? Simple methods: No full-vector operations! (Is it possible?) Simple problems: Functions with sparse gradients. Let us try: 1 Quadratic function f (x) = 1 2 Ax, x − b, x . The gradient f (x) = Ax − b, x ∈ RN , is not sparse even if A is sparse. 2 Piece-wise linear function g(x) = max 1≤i≤m [ ai , x − b(i)]. Yu. Nesterov Subgradient methods for huge-scale problems 7/22
  • 60. When it can work? Simple methods: No full-vector operations! (Is it possible?) Simple problems: Functions with sparse gradients. Let us try: 1 Quadratic function f (x) = 1 2 Ax, x − b, x . The gradient f (x) = Ax − b, x ∈ RN , is not sparse even if A is sparse. 2 Piece-wise linear function g(x) = max 1≤i≤m [ ai , x − b(i)]. Its subgradient f (x) = ai(x), i(x) : f (x) = ai(x), x − b(i(x)), Yu. Nesterov Subgradient methods for huge-scale problems 7/22
  • 61. When it can work? Simple methods: No full-vector operations! (Is it possible?) Simple problems: Functions with sparse gradients. Let us try: 1 Quadratic function f (x) = 1 2 Ax, x − b, x . The gradient f (x) = Ax − b, x ∈ RN , is not sparse even if A is sparse. 2 Piece-wise linear function g(x) = max 1≤i≤m [ ai , x − b(i)]. Its subgradient f (x) = ai(x), i(x) : f (x) = ai(x), x − b(i(x)), can be sparse is ai is sparse! Yu. Nesterov Subgradient methods for huge-scale problems 7/22
  • 62. When it can work? Simple methods: No full-vector operations! (Is it possible?) Simple problems: Functions with sparse gradients. Let us try: 1 Quadratic function f (x) = 1 2 Ax, x − b, x . The gradient f (x) = Ax − b, x ∈ RN , is not sparse even if A is sparse. 2 Piece-wise linear function g(x) = max 1≤i≤m [ ai , x − b(i)]. Its subgradient f (x) = ai(x), i(x) : f (x) = ai(x), x − b(i(x)), can be sparse is ai is sparse! But: Yu. Nesterov Subgradient methods for huge-scale problems 7/22
  • 63. When it can work? Simple methods: No full-vector operations! (Is it possible?) Simple problems: Functions with sparse gradients. Let us try: 1 Quadratic function f (x) = 1 2 Ax, x − b, x . The gradient f (x) = Ax − b, x ∈ RN , is not sparse even if A is sparse. 2 Piece-wise linear function g(x) = max 1≤i≤m [ ai , x − b(i)]. Its subgradient f (x) = ai(x), i(x) : f (x) = ai(x), x − b(i(x)), can be sparse is ai is sparse! But: We need a fast procedure for updating max-type operations. Yu. Nesterov Subgradient methods for huge-scale problems 7/22
  • 64. Fast updates in short computational trees Yu. Nesterov Subgradient methods for huge-scale problems 8/22
  • 65. Fast updates in short computational trees Def: Function f (x), x ∈ Rn, is short-tree representable, if it can be computed by a short binary tree with the height ≈ ln n. Yu. Nesterov Subgradient methods for huge-scale problems 8/22
  • 66. Fast updates in short computational trees Def: Function f (x), x ∈ Rn, is short-tree representable, if it can be computed by a short binary tree with the height ≈ ln n. Let n = 2k and the tree has k + 1 levels: v0,i = x(i), i = 1, . . . , n. Yu. Nesterov Subgradient methods for huge-scale problems 8/22
  • 67. Fast updates in short computational trees Def: Function f (x), x ∈ Rn, is short-tree representable, if it can be computed by a short binary tree with the height ≈ ln n. Let n = 2k and the tree has k + 1 levels: v0,i = x(i), i = 1, . . . , n. Size of the next level halves the size of the previous one: vi+1,j = ψi+1,j (vi,2j−1, vi,2j ), j = 1, . . . , 2k−i−1, i = 0, . . . , k − 1, where ψi,j are some bivariate functions. Yu. Nesterov Subgradient methods for huge-scale problems 8/22
  • 68. Fast updates in short computational trees Def: Function f (x), x ∈ Rn, is short-tree representable, if it can be computed by a short binary tree with the height ≈ ln n. Let n = 2k and the tree has k + 1 levels: v0,i = x(i), i = 1, . . . , n. Size of the next level halves the size of the previous one: vi+1,j = ψi+1,j (vi,2j−1, vi,2j ), j = 1, . . . , 2k−i−1, i = 0, . . . , k − 1, where ψi,j are some bivariate functions. v2,1 v1,1 v1,2 v0,1 v0,2 v0,3 v0,4 v2,n/4 v1,n/2−1 v1,n/2 v0,n−3v0,n−2v0,n−1 v0,n . . . . . . . . . . . . vk−1,1 vk−1,2 vk,1 Yu. Nesterov Subgradient methods for huge-scale problems 8/22
  • 69. Main advantages Yu. Nesterov Subgradient methods for huge-scale problems 9/22
  • 70. Main advantages Important examples (symmetric functions) Yu. Nesterov Subgradient methods for huge-scale problems 9/22
  • 71. Main advantages Important examples (symmetric functions) f (x) = x p, p ≥ 1, ψi,j (t1, t2) ≡ [ |t1|p + |t2|p ]1/p , Yu. Nesterov Subgradient methods for huge-scale problems 9/22
  • 72. Main advantages Important examples (symmetric functions) f (x) = x p, p ≥ 1, ψi,j (t1, t2) ≡ [ |t1|p + |t2|p ]1/p , f (x) = ln n i=1 ex(i) , ψi,j (t1, t2) ≡ ln (et1 + et2 ) , Yu. Nesterov Subgradient methods for huge-scale problems 9/22
  • 73. Main advantages Important examples (symmetric functions) f (x) = x p, p ≥ 1, ψi,j (t1, t2) ≡ [ |t1|p + |t2|p ]1/p , f (x) = ln n i=1 ex(i) , ψi,j (t1, t2) ≡ ln (et1 + et2 ) , f (x) = max 1≤i≤n x(i), ψi,j (t1, t2) ≡ max {t1, t2} . Yu. Nesterov Subgradient methods for huge-scale problems 9/22
  • 74. Main advantages Important examples (symmetric functions) f (x) = x p, p ≥ 1, ψi,j (t1, t2) ≡ [ |t1|p + |t2|p ]1/p , f (x) = ln n i=1 ex(i) , ψi,j (t1, t2) ≡ ln (et1 + et2 ) , f (x) = max 1≤i≤n x(i), ψi,j (t1, t2) ≡ max {t1, t2} . The binary tree requires only n − 1 auxiliary cells. Yu. Nesterov Subgradient methods for huge-scale problems 9/22
  • 75. Main advantages Important examples (symmetric functions) f (x) = x p, p ≥ 1, ψi,j (t1, t2) ≡ [ |t1|p + |t2|p ]1/p , f (x) = ln n i=1 ex(i) , ψi,j (t1, t2) ≡ ln (et1 + et2 ) , f (x) = max 1≤i≤n x(i), ψi,j (t1, t2) ≡ max {t1, t2} . The binary tree requires only n − 1 auxiliary cells. Its value needs n − 1 applications of ψi,j (·, ·) ( ≡ operations). Yu. Nesterov Subgradient methods for huge-scale problems 9/22
  • 76. Main advantages Important examples (symmetric functions) f (x) = x p, p ≥ 1, ψi,j (t1, t2) ≡ [ |t1|p + |t2|p ]1/p , f (x) = ln n i=1 ex(i) , ψi,j (t1, t2) ≡ ln (et1 + et2 ) , f (x) = max 1≤i≤n x(i), ψi,j (t1, t2) ≡ max {t1, t2} . The binary tree requires only n − 1 auxiliary cells. Its value needs n − 1 applications of ψi,j (·, ·) ( ≡ operations). If x+ differs from x in one entry only, then for re-computing f (x+) we need only k ≡ log2 n operations. Yu. Nesterov Subgradient methods for huge-scale problems 9/22
  • 77. Main advantages Important examples (symmetric functions) f (x) = x p, p ≥ 1, ψi,j (t1, t2) ≡ [ |t1|p + |t2|p ]1/p , f (x) = ln n i=1 ex(i) , ψi,j (t1, t2) ≡ ln (et1 + et2 ) , f (x) = max 1≤i≤n x(i), ψi,j (t1, t2) ≡ max {t1, t2} . The binary tree requires only n − 1 auxiliary cells. Its value needs n − 1 applications of ψi,j (·, ·) ( ≡ operations). If x+ differs from x in one entry only, then for re-computing f (x+) we need only k ≡ log2 n operations. Thus, we can have pure subgradient minimization schemes with Yu. Nesterov Subgradient methods for huge-scale problems 9/22
  • 78. Main advantages Important examples (symmetric functions) f (x) = x p, p ≥ 1, ψi,j (t1, t2) ≡ [ |t1|p + |t2|p ]1/p , f (x) = ln n i=1 ex(i) , ψi,j (t1, t2) ≡ ln (et1 + et2 ) , f (x) = max 1≤i≤n x(i), ψi,j (t1, t2) ≡ max {t1, t2} . The binary tree requires only n − 1 auxiliary cells. Its value needs n − 1 applications of ψi,j (·, ·) ( ≡ operations). If x+ differs from x in one entry only, then for re-computing f (x+) we need only k ≡ log2 n operations. Thus, we can have pure subgradient minimization schemes with Sublinear Iteration Cost . Yu. Nesterov Subgradient methods for huge-scale problems 9/22
  • 79. Simple subgradient methods Yu. Nesterov Subgradient methods for huge-scale problems 10/22
  • 80. Simple subgradient methods I. Problem: f ∗ def = min x∈Q f (x), where Yu. Nesterov Subgradient methods for huge-scale problems 10/22
  • 81. Simple subgradient methods I. Problem: f ∗ def = min x∈Q f (x), where Q is a closed and convex and f (x) ≤ L(f ), x ∈ Q, Yu. Nesterov Subgradient methods for huge-scale problems 10/22
  • 82. Simple subgradient methods I. Problem: f ∗ def = min x∈Q f (x), where Q is a closed and convex and f (x) ≤ L(f ), x ∈ Q, the optimal value f ∗ is known. Yu. Nesterov Subgradient methods for huge-scale problems 10/22
  • 83. Simple subgradient methods I. Problem: f ∗ def = min x∈Q f (x), where Q is a closed and convex and f (x) ≤ L(f ), x ∈ Q, the optimal value f ∗ is known. Consider the following optimization scheme (B.Polyak, 1967): Yu. Nesterov Subgradient methods for huge-scale problems 10/22
  • 84. Simple subgradient methods I. Problem: f ∗ def = min x∈Q f (x), where Q is a closed and convex and f (x) ≤ L(f ), x ∈ Q, the optimal value f ∗ is known. Consider the following optimization scheme (B.Polyak, 1967): x0 ∈ Q, xk+1 = πQ xk − f (xk) − f ∗ f (xk) 2 f (xk) , k ≥ 0. Yu. Nesterov Subgradient methods for huge-scale problems 10/22
  • 85. Simple subgradient methods I. Problem: f ∗ def = min x∈Q f (x), where Q is a closed and convex and f (x) ≤ L(f ), x ∈ Q, the optimal value f ∗ is known. Consider the following optimization scheme (B.Polyak, 1967): x0 ∈ Q, xk+1 = πQ xk − f (xk) − f ∗ f (xk) 2 f (xk) , k ≥ 0. Denote f ∗ k = min 0≤i≤k f (xi ). Yu. Nesterov Subgradient methods for huge-scale problems 10/22
  • 86. Simple subgradient methods I. Problem: f ∗ def = min x∈Q f (x), where Q is a closed and convex and f (x) ≤ L(f ), x ∈ Q, the optimal value f ∗ is known. Consider the following optimization scheme (B.Polyak, 1967): x0 ∈ Q, xk+1 = πQ xk − f (xk) − f ∗ f (xk) 2 f (xk) , k ≥ 0. Denote f ∗ k = min 0≤i≤k f (xi ). Then for any k ≥ 0 we have: f ∗ k − f ∗ ≤ L(f ) x0−πX∗ (x0) (k+1)1/2 , Yu. Nesterov Subgradient methods for huge-scale problems 10/22
  • 87. Simple subgradient methods I. Problem: f ∗ def = min x∈Q f (x), where Q is a closed and convex and f (x) ≤ L(f ), x ∈ Q, the optimal value f ∗ is known. Consider the following optimization scheme (B.Polyak, 1967): x0 ∈ Q, xk+1 = πQ xk − f (xk) − f ∗ f (xk) 2 f (xk) , k ≥ 0. Denote f ∗ k = min 0≤i≤k f (xi ). Then for any k ≥ 0 we have: f ∗ k − f ∗ ≤ L(f ) x0−πX∗ (x0) (k+1)1/2 , xk − x∗ ≤ x0 − x∗ , ∀x∗ ∈ X∗. Yu. Nesterov Subgradient methods for huge-scale problems 10/22
  • 88. Proof: Yu. Nesterov Subgradient methods for huge-scale problems 11/22
  • 89. Proof: Let us fix x∗ ∈ X∗. Denote rk(x∗) = xk − x∗ . Yu. Nesterov Subgradient methods for huge-scale problems 11/22
  • 90. Proof: Let us fix x∗ ∈ X∗. Denote rk(x∗) = xk − x∗ . Then r2 k+1(x∗) ≤ xk − f (xk )−f ∗ f (xk ) 2 f (xk) − x∗ 2 Yu. Nesterov Subgradient methods for huge-scale problems 11/22
  • 91. Proof: Let us fix x∗ ∈ X∗. Denote rk(x∗) = xk − x∗ . Then r2 k+1(x∗) ≤ xk − f (xk )−f ∗ f (xk ) 2 f (xk) − x∗ 2 = r2 k (x∗) − 2f (xk )−f ∗ f (xk ) 2 f (xk), xk − x∗ + (f (xk )−f ∗)2 f (xk ) 2 Yu. Nesterov Subgradient methods for huge-scale problems 11/22
  • 92. Proof: Let us fix x∗ ∈ X∗. Denote rk(x∗) = xk − x∗ . Then r2 k+1(x∗) ≤ xk − f (xk )−f ∗ f (xk ) 2 f (xk) − x∗ 2 = r2 k (x∗) − 2f (xk )−f ∗ f (xk ) 2 f (xk), xk − x∗ + (f (xk )−f ∗)2 f (xk ) 2 ≤ r2 k (x∗) − (f (xk )−f ∗)2 f (xk ) 2 Yu. Nesterov Subgradient methods for huge-scale problems 11/22
  • 93. Proof: Let us fix x∗ ∈ X∗. Denote rk(x∗) = xk − x∗ . Then r2 k+1(x∗) ≤ xk − f (xk )−f ∗ f (xk ) 2 f (xk) − x∗ 2 = r2 k (x∗) − 2f (xk )−f ∗ f (xk ) 2 f (xk), xk − x∗ + (f (xk )−f ∗)2 f (xk ) 2 ≤ r2 k (x∗) − (f (xk )−f ∗)2 f (xk ) 2 ≤ r2 k (x∗) − (f ∗ k −f ∗)2 L2(f ) . Yu. Nesterov Subgradient methods for huge-scale problems 11/22
  • 94. Proof: Let us fix x∗ ∈ X∗. Denote rk(x∗) = xk − x∗ . Then r2 k+1(x∗) ≤ xk − f (xk )−f ∗ f (xk ) 2 f (xk) − x∗ 2 = r2 k (x∗) − 2f (xk )−f ∗ f (xk ) 2 f (xk), xk − x∗ + (f (xk )−f ∗)2 f (xk ) 2 ≤ r2 k (x∗) − (f (xk )−f ∗)2 f (xk ) 2 ≤ r2 k (x∗) − (f ∗ k −f ∗)2 L2(f ) . From this reasoning, xk+1 − x∗ 2 ≤ xk − x∗ 2, ∀x∗ ∈ X∗. Yu. Nesterov Subgradient methods for huge-scale problems 11/22
  • 95. Proof: Let us fix x∗ ∈ X∗. Denote rk(x∗) = xk − x∗ . Then r2 k+1(x∗) ≤ xk − f (xk )−f ∗ f (xk ) 2 f (xk) − x∗ 2 = r2 k (x∗) − 2f (xk )−f ∗ f (xk ) 2 f (xk), xk − x∗ + (f (xk )−f ∗)2 f (xk ) 2 ≤ r2 k (x∗) − (f (xk )−f ∗)2 f (xk ) 2 ≤ r2 k (x∗) − (f ∗ k −f ∗)2 L2(f ) . From this reasoning, xk+1 − x∗ 2 ≤ xk − x∗ 2, ∀x∗ ∈ X∗. Corollary: Assume X∗ has recession direction d∗. Yu. Nesterov Subgradient methods for huge-scale problems 11/22
  • 96. Proof: Let us fix x∗ ∈ X∗. Denote rk(x∗) = xk − x∗ . Then r2 k+1(x∗) ≤ xk − f (xk )−f ∗ f (xk ) 2 f (xk) − x∗ 2 = r2 k (x∗) − 2f (xk )−f ∗ f (xk ) 2 f (xk), xk − x∗ + (f (xk )−f ∗)2 f (xk ) 2 ≤ r2 k (x∗) − (f (xk )−f ∗)2 f (xk ) 2 ≤ r2 k (x∗) − (f ∗ k −f ∗)2 L2(f ) . From this reasoning, xk+1 − x∗ 2 ≤ xk − x∗ 2, ∀x∗ ∈ X∗. Corollary: Assume X∗ has recession direction d∗. Then xk − πX∗ (x0) ≤ x0 − πX∗ (x0) , Yu. Nesterov Subgradient methods for huge-scale problems 11/22
  • 97. Proof: Let us fix x∗ ∈ X∗. Denote rk(x∗) = xk − x∗ . Then r2 k+1(x∗) ≤ xk − f (xk )−f ∗ f (xk ) 2 f (xk) − x∗ 2 = r2 k (x∗) − 2f (xk )−f ∗ f (xk ) 2 f (xk), xk − x∗ + (f (xk )−f ∗)2 f (xk ) 2 ≤ r2 k (x∗) − (f (xk )−f ∗)2 f (xk ) 2 ≤ r2 k (x∗) − (f ∗ k −f ∗)2 L2(f ) . From this reasoning, xk+1 − x∗ 2 ≤ xk − x∗ 2, ∀x∗ ∈ X∗. Corollary: Assume X∗ has recession direction d∗. Then xk − πX∗ (x0) ≤ x0 − πX∗ (x0) , d∗, xk ≥ d∗, x0 . Yu. Nesterov Subgradient methods for huge-scale problems 11/22
  • 98. Proof: Let us fix x∗ ∈ X∗. Denote rk(x∗) = xk − x∗ . Then r2 k+1(x∗) ≤ xk − f (xk )−f ∗ f (xk ) 2 f (xk) − x∗ 2 = r2 k (x∗) − 2f (xk )−f ∗ f (xk ) 2 f (xk), xk − x∗ + (f (xk )−f ∗)2 f (xk ) 2 ≤ r2 k (x∗) − (f (xk )−f ∗)2 f (xk ) 2 ≤ r2 k (x∗) − (f ∗ k −f ∗)2 L2(f ) . From this reasoning, xk+1 − x∗ 2 ≤ xk − x∗ 2, ∀x∗ ∈ X∗. Corollary: Assume X∗ has recession direction d∗. Then xk − πX∗ (x0) ≤ x0 − πX∗ (x0) , d∗, xk ≥ d∗, x0 . (Proof: consider x∗ = πX∗ (x0) + αd∗, α ≥ 0.) Yu. Nesterov Subgradient methods for huge-scale problems 11/22
Constrained minimization (N. Shor (1964) & B. Polyak)

II. Problem: $\min_{x\in Q}\,\{f(x) : g(x) \le 0\}$, where Q is closed and convex, and f, g have uniformly bounded subgradients.

Consider the following method. It has a step-size parameter $h > 0$.

If $g(x_k) > h\,\|g'(x_k)\|$, then (A): $x_{k+1} = \pi_Q\Big(x_k - \tfrac{g(x_k)}{\|g'(x_k)\|^2}\, g'(x_k)\Big)$,
else (B): $x_{k+1} = \pi_Q\Big(x_k - \tfrac{h}{\|f'(x_k)\|}\, f'(x_k)\Big)$.

Let $F_k \subseteq \{0,\dots,k\}$ be the set of (B)-iterations, and $f_k^* = \min_{i\in F_k} f(x_i)$.

Theorem: If $k > \|x_0 - x^*\|^2 / h^2$, then $F_k \ne \emptyset$ and

$$f_k^* - f^* \le h\,L(f), \qquad \max_{i\in F_k} g(x_i) \le h\,L(g).$$
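A compact sketch of the (A)/(B) switching rule above, assuming subgradient oracles for f and g and a projector onto Q are available (the names and interface are illustrative, not from the talk):

```python
import numpy as np

def shor_polyak_switching(f, fgrad, g, ggrad, project, x0, h, max_iter=1000):
    """Sketch of the (A)/(B) switching subgradient method for
    min_{x in Q} { f(x) : g(x) <= 0 } with step-size parameter h > 0."""
    x = np.asarray(x0, dtype=float)
    f_rec, x_rec = np.inf, x.copy()     # record over the (B)-iterations F_k
    for _ in range(max_iter):
        gx, gg = g(x), ggrad(x)
        if gx > h * np.linalg.norm(gg):
            # (A): reduce infeasibility with a full Polyak step on g
            x = project(x - gx / np.dot(gg, gg) * gg)
        else:
            # (B): normalized step of length h on f
            fx, fg = f(x), fgrad(x)
            if fx < f_rec:
                f_rec, x_rec = fx, x.copy()
            x = project(x - h / np.linalg.norm(fg) * fg)
    return x_rec, f_rec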
Computational strategies

1. Constants L(f), L(g) are known (e.g. Linear Programming).
   We can take $h = \epsilon / \max\{L(f), L(g)\}$ for the target accuracy $\epsilon$. Then we need to decide on the number of steps N (easy!).
   Note: the standard advice is $h = \tfrac{R}{\sqrt{N+1}}$ (much more difficult!).

2. Constants L(f), L(g) are not known.
   Start from a guess. Restart from scratch each time we see the guess is wrong. The guess is doubled after each restart.

3. Tracking the record value $f_k^*$.
   Double run. Other ideas are welcome!
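For strategy 2, one possible reading of "we see the guess is wrong" is that an observed subgradient norm exceeds the guessed constant; this detection rule is an assumption of the sketch below, not something specified in the talk.

```python
def run_with_doubling(make_run, L_guess0=1.0, max_restarts=20):
    """Sketch of the guess-and-restart strategy: run the method with a
    guessed Lipschitz constant and restart from scratch with a doubled
    guess whenever the guess is detected to be wrong.

    make_run(L_guess) is assumed to execute the subgradient method with
    step sizes based on L_guess and to return (result, ok), where ok=False
    signals that the guess was violated (e.g. a subgradient with norm
    larger than L_guess was encountered). This check is my assumption.
    """
    L = L_guess0
    for _ in range(max_restarts):
        result, ok = make_run(L)
        if ok:
            return result, L
        L *= 2.0                        # double the guess and restart
    return result, L
```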
Application examples

Observations:
1. Very often, Large- and Huge-scale problems have repetitive sparsity patterns and/or limited connectivity.
   Social networks.
   Mobile phone networks.
   Truss topology design (local bars).
   Finite-element models (2D: four neighbors, 3D: six neighbors).
2. For p-diagonal matrices, $\kappa(A) \le p^2$.
Google problem

Goal: Rank the agents in the society by their social weights.
Unknown: $x_i \ge 0$ - social influence of agent $i = 1, \dots, N$.
Known: $\sigma_i$ - set of friends of agent $i$.

Hypothesis
Agent i shares his support among all friends by equal parts.
The influence of agent i is equal to the total support obtained from his friends.
Mathematical formulation: quadratic problem

Let $E \in \mathbb{R}^{N\times N}$ be an incidence matrix of the connections graph. Denote $e = (1,\dots,1)^T \in \mathbb{R}^N$ and $\bar E = E \cdot \mathrm{diag}\,(E^T e)^{-1}$. Since $\bar E^T e = e$, this matrix is stochastic.

Problem: Find $x^* \ge 0$: $\bar E x^* = x^*$, $x^* \ne 0$. The size is very big!

Known technique: Regularization + Fixed Point (Google founders, B. Polyak & coauthors, etc.)

N09: Solve it by a random CD-method as applied to

$$\tfrac{1}{2}\,\|\bar E x - x\|^2 + \tfrac{\gamma}{2}\,[\langle e, x\rangle - 1]^2 \;\to\; \min, \qquad \gamma > 0.$$

Main drawback: no interpretation for the objective function!
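As a concrete illustration of the construction above, here is my own sketch (assuming a sparse 0/1 incidence matrix with no isolated agents) that builds $\bar E$ and evaluates the regularized quadratic used by the CD approach:

```python
import numpy as np
import scipy.sparse as sp

def google_matrix(E):
    """Column-stochastic matrix E_bar = E * diag(E^T e)^{-1} built from a
    sparse 0/1 incidence matrix E of the friendship graph."""
    E = sp.csc_matrix(E, dtype=float)
    col_sums = np.asarray(E.sum(axis=0)).ravel()      # (E^T e)_j
    return E @ sp.diags(1.0 / col_sums)

def regularized_objective(E_bar, x, gamma):
    """The smooth model 1/2 ||E_bar x - x||^2 + gamma/2 (<e, x> - 1)^2."""
    r = E_bar @ x - x
    return 0.5 * r @ r + 0.5 * gamma * (x.sum() - 1.0) ** 2
```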
Nonsmooth formulation of Google Problem

Main property of spectral radius (A ≥ 0)
If $A \in \mathbb{R}^{n\times n}_+$, then $\rho(A) = \min_{x\ge 0}\ \max_{1\le i\le n} \frac{1}{x^{(i)}}\,\langle e_i, Ax\rangle$. The minimum is attained at the corresponding eigenvector.

Since $\rho(\bar E) = 1$, our problem is as follows:

$$f(x) \stackrel{\mathrm{def}}{=} \max_{1\le i\le N}\,\big[\,\langle e_i, \bar E x\rangle - x^{(i)}\,\big] \;\to\; \min_{x\ge 0}.$$

Interpretation: Increase self-confidence!

Since $f^* = 0$, we can apply Polyak's method with sparse updates.

Additional features: the optimal set $X^*$ is a convex cone. If $x_0 = e$, then the whole sequence is separated from zero:

$$\langle x^*, e\rangle \le \langle x^*, x_k\rangle \le \|x^*\|_1 \cdot \|x_k\|_\infty = \langle x^*, e\rangle \cdot \|x_k\|_\infty,$$

hence $\|x_k\|_\infty \ge 1$ for all k.

Goal: Find $\bar x \ge 0$ such that $\|\bar x\|_\infty \ge 1$ and $f(\bar x) \le \epsilon$. (The first condition is satisfied automatically.)
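A sketch of this nonsmooth objective and of one Polyak step for it (dense evaluation only; the sparse-update machinery of GMs is not reproduced here, and the helper names are my own):

```python
import numpy as np

def google_residual_objective(E_bar, x):
    """f(x) = max_i [ <e_i, E_bar x> - x_i ] and one subgradient of it."""
    r = E_bar @ x - x                      # residual E_bar x - x
    i = int(np.argmax(r))                  # active row
    # subgradient of x -> <e_i, E_bar x> - x_i  is  (row i of E_bar) - e_i
    g = np.asarray(E_bar.getrow(i).todense()).ravel()
    g[i] -= 1.0
    return r[i], g

def polyak_step_google(E_bar, x):
    """One Polyak step for min_{x >= 0} f(x), using f* = 0."""
    fx, g = google_residual_objective(E_bar, x)
    if fx <= 0.0:
        return x
    x_new = x - fx / np.dot(g, g) * g
    return np.maximum(x_new, 0.0)          # projection onto x >= 0
```

Starting from $x_0 = e$ and iterating this step would correspond to the plain GM baseline used for comparison in the experiments below.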
Computational experiments: Iteration Cost

We compare Polyak's GM with sparse updates (GMs) with the standard one (GM).

Setup: Each agent has exactly p random friends. Thus, $\kappa(A) \stackrel{\mathrm{def}}{=} \max_{1\le i\le M} \kappa_A(A^T e_i) \approx p^2$.

Iteration cost: GMs $\le \kappa(A)\,\log_2 N \approx p^2 \log_2 N$, GM $\approx pN$.
($\log_2 10^3 \approx 10$, $\log_2 10^6 \approx 20$, $\log_2 10^9 \approx 30$)

Time for $10^4$ iterations (p = 32), in seconds:

       N    κ(A)    GMs       GM
    1024    1632   3.00     2.98
    2048    1792   3.36     6.41
    4096    1888   3.75    15.11
    8192    1920   4.20   139.92
   16384    1824   4.69   408.38

Time for $10^3$ iterations (p = 16), in seconds:

         N    κ(A)    GMs       GM
    131072     576   0.19    213.9
    262144     592   0.25    477.8
    524288     592   0.32   1095.5
   1048576     608   0.40   2590.8

One second of GMs corresponds to roughly 100 minutes of GM!
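The gap between the two columns comes from the cost per iteration: in GMs, changing a single coordinate of x touches only the rows that contain it. Below is my own sketch of that residual update, assuming $\bar E$ is stored in CSC format; the additional $\log_2 N$ factor in the GMs cost comes from keeping the maximum of the residual in a binary max-tree, which is not shown here.

```python
import numpy as np
import scipy.sparse as sp

def update_residual_sparse(E_bar_csc, r, x, j, delta):
    """Cheap part of a sparse-update iteration: when only coordinate j of x
    changes by delta, the residual r = E_bar x - x changes in at most
    (column degree of j) + 1 positions.  Returns the affected rows."""
    x[j] += delta
    start, end = E_bar_csc.indptr[j], E_bar_csc.indptr[j + 1]
    rows = E_bar_csc.indices[start:end]        # rows i with E_bar[i, j] != 0
    r[rows] += delta * E_bar_csc.data[start:end]
    r[j] -= delta                              # the "- x" part of the residual
    return rows
```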
Convergence of GMs: Medium Size

Let N = 131072, p = 16, κ(A) = 576, and L(f) = 0.21.

   Iterations     f − f*    Time (sec)
   1.0 · 10^5     0.1100        16.44
   3.0 · 10^5     0.0429        49.32
   6.0 · 10^5     0.0221        98.65
   1.1 · 10^6     0.0119       180.85
   2.2 · 10^6     0.0057       361.71
   4.1 · 10^6     0.0028       674.09
   7.6 · 10^6     0.0014      1249.54
   1.0 · 10^7     0.0010      1644.13

Dimension and accuracy are sufficiently high, but the time is still reasonable.
Convergence of GMs: Large Scale

Let N = 1048576, p = 8, κ(A) = 192, and L(f) = 0.21.

   Iterations      f − f*    Time (sec)
            0    2.000000         0.00
   1.0 · 10^5    0.546662         7.69
   4.0 · 10^5    0.276866        30.74
   1.0 · 10^6    0.137822        76.86
   2.5 · 10^6    0.063099       192.14
   5.1 · 10^6    0.032092       391.97
   9.9 · 10^6    0.016162       760.88
   1.5 · 10^7    0.010009      1183.59

Final point $\bar x^*$: $\|\bar x^*\|_\infty = 2.941497$, $R_0^2 \stackrel{\mathrm{def}}{=} \|\bar x^* - e\|_2^2 = 1.2 \cdot 10^5$.

Theoretical bound on the number of iterations: $\frac{L^2(f)\,R_0^2}{\epsilon^2} = 5.3 \cdot 10^7$.

Time for GM: ≈ 1 year!
Conclusion

1. Sparse GM is an efficient and reliable method for solving Large- and Huge-Scale problems with uniform sparsity.

2. We can also treat dense rows. Assume that the inequality $\langle a, x\rangle \le b$ is dense. It is equivalent to the following system:

$$y^{(1)} = a^{(1)} x^{(1)}, \qquad y^{(j)} = y^{(j-1)} + a^{(j)} x^{(j)}, \quad j = 2, \dots, n, \qquad y^{(n)} \le b.$$

We need new variables $y^{(j)}$ for all nonzero coefficients of a. Introduce p(a) additional variables and p(A) additional equality constraints. (No problem!) A sparse reformulation of a single dense row is sketched after this list.

Hidden drawback: the above equalities are satisfied with errors. Maybe it is not too bad?

3. A similar technique can be applied to dense columns.
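Here is a sketch of this reformulation for one dense row, returning the sparse equality constraints in the augmented variables (x, y); the variable ordering and sign conventions are my own choices, not from the talk.

```python
import numpy as np
import scipy.sparse as sp

def split_dense_row(a, b):
    """Replace the dense inequality <a, x> <= b by the chain of sparse
    equalities y_1 = a_{j1} x_{j1}, y_t = y_{t-1} + a_{jt} x_{jt},
    where j1 < ... < jp are the nonzero positions of a; the dense row
    itself becomes the sparse inequality y_p <= b.  Returns the p x (n+p)
    coefficient matrix of the equalities; each row has at most 3 nonzeros."""
    a = np.asarray(a, dtype=float)
    nz = np.flatnonzero(a)
    n, p = len(a), len(nz)
    rows, cols, vals = [], [], []
    for t, j in enumerate(nz):
        rows += [t];     cols += [j];         vals += [a[j]]   # + a_j x_j
        rows += [t];     cols += [n + t];     vals += [-1.0]   # - y_t
        if t > 0:
            rows += [t]; cols += [n + t - 1]; vals += [1.0]    # + y_{t-1}
    # Each equality reads: a_j x_j + y_{t-1} - y_t = 0.
    return sp.csr_matrix((vals, (rows, cols)), shape=(p, n + p)), b
```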
Thank you for your attention!