A Splitting Method for Nonsmooth Nonconvex Problems

Peng Zheng and Aleksandr Aravkin
Applied Math & eScience, UW
Problem Class

We are interested in optimization problems of the form

    \min_x \; h(Ax) + g(x)

• h is a nonsmooth, nonconvex separable function
• A is a linear map (for now; in general we need a nonlinear map F(x))
• g(x) is a convex regularizer

We see these problems at the bleeding edge of very interesting applications:
• Optics (e.g. phase retrieval)
• Radiation therapy (nonconvex constraints)
• Chemistry (predicting structures of chromosomes/peptides/proteins)
Applications in this talk

• Exact Phase Retrieval:
    \min_x \; \| |Ax| - b \|_1, \quad x \in \mathbb{C}^n.

• Semi-Supervised SVMs:
    \min_{\xi,\beta} \; \frac{\lambda}{2}\|\xi\|_H^2 + \sum_{i=1}^{s} [1 - b_i l_i(\xi,\beta)]_+ + \tau \sum_{i=s+1}^{m} [1 - |l_i(\xi,\beta)|]_+.

• Stochastic Shortest Path:
    \min_x \; \sum_{i=1}^{d} \bigl| \min\{ \langle u_i^1, x \rangle + v_i^1 - x_i,\; \langle u_i^2, x \rangle + v_i^2 - x_i \} \bigr|.

• Exact Robust PCA:
    \min_{L,R} \; \|D - LR\|_1.
A Possible Approach

Define
    f(x) := h(Ax) + g(x)
and consider the relaxed objective
    \min_{x,w} \; f_\nu(x, w) := h(w) + \frac{1}{2\nu}\|Ax - w\|^2 + g(x).
If we partially minimize in w, we effectively replace h with its Moreau envelope:
    \min_x \; h_\nu(Ax) + g(x), \qquad h_\nu(Ax) := \min_w \; \frac{1}{2\nu}\|w - Ax\|^2 + h(w).
Simple Nonconvex Example

Consider a 1D phase retrieval objective function and its relaxed version,

    p(x) = ||x| - 1|, \qquad p_\nu(x) = \min_y \; \frac{1}{2\nu}(y - x)^2 + ||y| - 1|.

Figure: p(x) (original) and its relaxations p_\nu(x) for \nu = 0.5, 1, 10.
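The relaxation can be checked numerically. A minimal sketch (not from the slides; the grid range and resolution are arbitrary choices) that evaluates p_ν on a grid by brute-force minimization over y:

```python
import numpy as np

def p(x):
    # 1D exact phase retrieval objective p(x) = ||x| - 1|
    return np.abs(np.abs(x) - 1.0)

def moreau_env(xs, nu, ys=np.linspace(-6.0, 6.0, 6001)):
    # p_nu(x) = min_y (1/(2 nu)) (y - x)^2 + p(y), minimized over a grid of y values
    X = np.asarray(xs, dtype=float)[:, None]
    vals = (ys[None, :] - X) ** 2 / (2.0 * nu) + p(ys)[None, :]
    return vals.min(axis=1)
```

As in the plot, the envelope lower-bounds p everywhere, agrees with it at the minimizers x = ±1, and flattens out as ν grows.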
A Better Approach

Define
    f(x) := h(Ax) + g(x)
and consider the relaxed objective
    \min_{x,w} \; f_\nu(x, w) := h(w) + \frac{1}{2\nu}\|Ax - w\|^2 + g(x).
Partially minimize in x instead:
    \min_w \; h(w) + g_\nu(w), \qquad g_\nu(w) := \min_x \; \frac{1}{2\nu}\|Ax - w\|^2 + g(x).
Algorithm

A simple algorithm: Zheng and Aravkin (2018).

Algorithm 1 Proximal Gradient Descent for h(w) + g_\nu(w)
  1: Input: w_0
  2: Initialize: k = 0
  3: while not converged do
  4:     w_{k+1} \gets \arg\min_w \; h(w) + \frac{1}{2\nu}\|w - Ax_k\|^2
  5:     x_{k+1} \gets \arg\min_x \; g(x) + \frac{1}{2\nu}\|Ax - w_{k+1}\|^2
  6:     k \gets k + 1
  7: Output: w_k

The algorithm requires only two ingredients:
1. An oracle for the x update
2. A prox operator for h
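As a sketch (not the authors' code), the two ingredients can be instantiated for real-valued exact phase retrieval: h(w) = ‖|w| − b‖₁ has an elementwise prox, and taking g = 0 makes the x update a plain least-squares solve. Since each step exactly minimizes f_ν in one block, the objective cannot increase:

```python
import numpy as np

def prox_h(z, b, nu):
    # prox of nu * || |w| - b ||_1 at z (real case, elementwise):
    # keep the sign of z; the magnitude gets the prox of t -> |t - b_i|.
    s = np.where(z >= 0, 1.0, -1.0)
    t = np.abs(z) - b
    mag = b + np.sign(t) * np.maximum(np.abs(t) - nu, 0.0)  # soft threshold
    return s * mag

def f_nu(x, w, A, b, nu):
    # relaxed objective || |w| - b ||_1 + (1/(2 nu)) ||Ax - w||^2
    return np.sum(np.abs(np.abs(w) - b)) + np.sum((A @ x - w) ** 2) / (2.0 * nu)

def split_solve(A, b, nu, iters, x0):
    # Algorithm 1 with g = 0: alternate the w prox and a least-squares x update
    x = x0.copy()
    history = []
    for _ in range(iters):
        w = prox_h(A @ x, b, nu)                   # w-update: prox of h
        x = np.linalg.lstsq(A, w, rcond=None)[0]   # x-update: oracle (g = 0)
        history.append(f_nu(x, w, A, b, nu))
    return x, w, np.array(history)
```

Because both block updates are exact minimizations, the recorded objective values are monotonically nonincreasing.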
Critical Points and Optimality Condition

Definition (Critical Points and Optimality Condition)
A point (\bar{x}, \bar{w}) \in \mathbb{R}^d \times \mathbb{R}^m is a critical point for f_\nu if it satisfies the inclusions

    0 \in \partial h(\bar{w}) + \frac{1}{\nu}(\bar{w} - A\bar{x}),
    0 \in \partial g(\bar{x}) + \frac{1}{\nu} A^T (A\bar{x} - \bar{w}),

where \partial h, \partial g are the limiting subdifferentials of h, g (Rockafellar and Wets (1998)).

Define also

    T_\nu(x, w) = \min\bigl\{ \|v\|^2 + \|u\|^2 : v \in \partial h(w) + \tfrac{1}{\nu}(w - Ax),\; u \in \partial g(x) + \tfrac{1}{\nu} A^T (Ax - w) \bigr\}.
Summary of Convergence Results

Results for h(w) + g_\nu(w):

A1: T_\nu^k \le \frac{2}{\nu k}\,[f_\nu(x_0, w_0) - f^*]
A2: f_\nu(x_k, w_k) - f_\nu^* \le \frac{\|w_0 - w^*\|^2}{2\nu(k+1)}
A3: \|w_{k+1} - w^*\|^2 \le \frac{1}{1 + \alpha\nu}\|w_k - w^*\|^2
A4: \|w_{k+1} - w^*\| \le \frac{1}{\alpha\nu}\|w_k - w^*\|^2

Assumptions:
A1: h is prox-bounded, g is closed and convex.
A2: h and g are both proper closed convex functions.
A3: h is \alpha-strongly convex and g = 0.
A4: h is proper closed convex, g = 0, and there exists a sharp minimum of f_\nu.
Sharpness

Definition
The minimizer w^* of p is sharp if there exist \delta, \alpha > 0 so that for any w \in \{w : \|w - w^*\| \le \delta\},

    p(w) - p(w^*) \ge \alpha \|w - w^*\|.
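A concrete instance (our example, not from the slides): p(w) = |w| has a sharp minimizer at w^* = 0, since

```latex
p(w) - p(w^\ast) \;=\; |w| \;=\; 1 \cdot \|w - w^\ast\| \quad \text{for all } w,
```

so the definition holds with \alpha = 1 and any \delta > 0; by contrast, the smooth function w^2 has no such linear lower bound near 0 and is not sharp there.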
Results for Phase Retrieval

    \min_{x, w \in \mathbb{C}^n} \; f_\nu(x, w) := \| |w| - b \|_1 + \frac{1}{2\nu}\|Ax - w\|^2

Figure: Convergence history for large-scale phase retrieval (objective value f(x, w) - f^* and optimality condition \|w_k - w_{k-1}\| over iterations).
Results for Phase Retrieval

Figure: Large example (d = 3 × 2^22, n = 2^22, m = 3n). Original picture (left), initial point (middle), and final result (right).
Comparison with Other Methods
we compare Algorithm 1 with several other popular methods studied by Duchi
and Ruan (2017); Davis et al. (2017).
objective n d m # FHT
Alg 1 |Ax| − b 1 2048 × 2048 3 × 222
m = 3d 518
Alg 22
(Ax)2
− b 1 2048 × 2048 3 × 222
m = 3d 1530
Alg 33
(Ax)2
− b 1 1024 × 1024 3 × 220
m = 3d 15100
Table: Comparison summary. n represents the size of the pictures, d is the dimension of the
vectorized picture, and m is the number of measurements. FHT is a fast Hadamard transform.
The counts of FHT include initialization.
• Get a simple fast method (in terms of FHT) for phase retrieval
• Requires prox of piecewise linear function, and matrix-vector multiplication.
2
Davis et al. (2017)
3
Duchi and Ruan (2017)
12 / 21
Semi-Supervised Learning

Finite-dimensional kernel SVM formulation:

    \min_{x,\beta} \; \frac{\lambda}{2}\|x\|_K^2 + \underbrace{\sum_{i=1}^{s} [1 - b_i l_i(x,\beta)]_+ + \tau \sum_{i=s+1}^{m} [1 - |l_i(x,\beta)|]_+}_{h_\tau(Kx + \beta \mathbf{1})}

• l_i(x, \beta) = \langle \phi(A)x, \phi(a_i) \rangle_H + \beta
• K = \{\langle \phi(a_i), \phi(a_j) \rangle_H\} is the kernel matrix.
• The |l_i(x, \beta)| terms are used for the unlabeled examples i \in \{s+1, \ldots, m\}.

Relaxed objective:

    \min_{w,x,\beta} \; h_\tau(w) + \frac{1}{2\nu}\|Kx + \beta\mathbf{1} - w\|^2 + \underbrace{\frac{\lambda}{2}\|x\|_K^2}_{g(x)}
Results

Figure: Convergence history (objective value f(x, w) - f^* and optimality condition \|w_k - w_{k-1}\| over iterations), and training/testing error (percentage of misfits) for different values of \tau.

• First method (that we have seen) for semi-supervised kernel machines.
• Requires least-squares problems over the kernel matrix, and a prox operator.
Stochastic Shortest Path

Figure: Given two action graphs, we want to move from A to B. At each node we can switch between the black and red graphs depending on the expected cost, and take available edges uniformly at random.
Stochastic Shortest Path

Using the Bellman equation, the problem is formulated as

    \min_{x \in \mathbb{R}^d} \; \sum_{i=1}^{d} \bigl| \min\{ \langle U^1_{i\cdot}, x \rangle + v^1_i - x_i,\; \langle U^2_{i\cdot}, x \rangle + v^2_i - x_i \} \bigr|.

• U^k is the connectivity matrix for graph k
• v^k_i is the expected cost of leaving node i in graph k

Relaxed objective:

    \min_{x, w^1, w^2} \; h(w^1, w^2) + \frac{1}{2\nu}\bigl( \|A^1 x - w^1\|^2 + \|A^2 x - w^2\|^2 \bigr),

where A^k = U^k - I and

    h(w^1, w^2) = \sum_{i=1}^{d} \bigl| \min\{ w^1_i + v^1_i,\; w^2_i + v^2_i \} \bigr|.
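To make the Bellman-residual objective concrete, here is a small sketch (our toy example, not from the slides): a two-node chain where node 2 is the absorbing target and graph k moves from node 1 to node 2 at cost c_k. The value function x = (min(c_1, c_2), 0) makes every residual, and hence the objective, zero:

```python
import numpy as np

def ssp_objective(x, Us, vs):
    # sum_i | min_k ( <U^k_{i.}, x> + v^k_i - x_i ) |  -- the Bellman residual
    residuals = np.stack([U @ x + v - x for U, v in zip(Us, vs)])
    return np.abs(residuals.min(axis=0)).sum()

# two graphs on two nodes; both send every node to node 2 (the target)
U = np.array([[0.0, 1.0],
              [0.0, 1.0]])
c1, c2 = 3.0, 5.0
vs = [np.array([c1, 0.0]), np.array([c2, 0.0])]
x_star = np.array([min(c1, c2), 0.0])
```

Evaluating ssp_objective(x_star, [U, U], vs) gives 0, while perturbing x_star gives a positive value, consistent with the slide's observation that the optimal value is 0.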
Result

Figure: Convergence history (objective value f(x, w) - f^* and optimality condition over iterations).

• The optimal value is 0, so we know we solve the original problem.
• Previous methods use subgradients; we just need least squares and a prox.
Open Problems

• What happens with nonlinear nonconvex composite models?
    \min_x \; h(F(x)) + g(x)
• How to do \nu continuation?
    \min_{x,w} \; h(w) + \frac{1}{2\nu}\|Ax - w\|^2 + g(x)
• What's the best way to proceed for large-scale problems, where we have to solve the x-subproblem inexactly?
Empirical Result for Problem 1

Exact robust PCA:

    \min_{L,R} \; \|D - LR\|_1, \quad L \in \mathbb{R}^{m \times k}, \; R \in \mathbb{R}^{k \times n}.

Relaxed objective:

    \min_{L,R,W} \; \|D - W\|_1 + \frac{1}{2\nu}\|W - LR\|_F^2.

Even though the problem is technically in the F(x) class, for x = (L, R), it is special: the SVD gives a closed-form solution to the (L, R) subproblem.
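The SVD-based subproblem solve can be sketched as follows (our sketch, not the authors' code): the W update is an elementwise soft threshold around D, the (L, R) update is the best rank-k approximation of W (Eckart-Young), and alternating these exact minimizations cannot increase the relaxed objective:

```python
import numpy as np

def robust_pca_split(D, k, nu=0.1, iters=50):
    LR = D.copy()
    history = []
    for _ in range(iters):
        # W-update: prox of ||D - .||_1, i.e. elementwise soft threshold around D
        Z = LR - D
        W = D + np.sign(Z) * np.maximum(np.abs(Z) - nu, 0.0)
        # (L, R)-update: closed form via the rank-k truncated SVD of W
        U, s, Vt = np.linalg.svd(W, full_matrices=False)
        LR = (U[:, :k] * s[:k]) @ Vt[:k]
        # relaxed objective ||D - W||_1 + (1/(2 nu)) ||W - LR||_F^2
        history.append(np.abs(D - W).sum() + ((W - LR) ** 2).sum() / (2.0 * nu))
    return LR, W, np.array(history)
```

On a synthetic low-rank-plus-sparse matrix D, the recorded objective values decrease monotonically and LR stays rank-k by construction.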
Result

Figure: Convergence history for exact robust PCA (objective value f(x, w) - f^* and optimality condition over iterations).
Reference I

Davis, D., Drusvyatskiy, D., and Paquette, C. (2017). The nonsmooth landscape of phase retrieval. arXiv preprint arXiv:1711.03247.

Duchi, J. C. and Ruan, F. (2017). Solving (most) of a set of quadratic equalities: Composite optimization for robust phase retrieval. arXiv preprint arXiv:1705.02356.

Rockafellar, R. T. and Wets, R. J.-B. (1998). Variational Analysis. Grundlehren der mathematischen Wissenschaften, Vol. 317. Springer, Berlin.

Zheng, P. and Aravkin, A. Y. (2018). Fast methods for nonsmooth nonconvex minimization. arXiv preprint arXiv:1802.02654.
Presented at the QMC: Operator Splitting Workshop, March 21, 2018.