Proximal Splitting
and Optimal Transport
       Gabriel Peyré
    www.numerical-tours.com
Overview
• Optimal Transport and Imaging

• Convex Analysis and Proximal Calculus

• Forward-Backward

• Douglas-Rachford and ADMM

• Generalized Forward-Backward

• Primal-Dual Schemes
Measure Preserving Maps

Distributions µ0, µ1 on R^k.

Mass preserving map T : R^k → R^k:
    µ1 = T♯µ0   where   (T♯µ0)(A) = µ0(T^{−1}(A))

[Figure: a point x of µ0 is mapped to T(x); T pushes µ0 forward onto µ1]

Distributions with densities µi = ρi(x) dx:
    T♯µ0 = µ1   ⟺   ρ1(T(x)) |det(∂T(x))| = ρ0(x)

Optimal Transport

Lp optimal transport:
    Wp(µ0, µ1)^p = min_{T♯µ0 = µ1} ∫ ||T(x) − x||^p µ0(dx)

Regularity condition:
    µ0 or µ1 does not give mass to "small sets".

Theorem (p > 1): there exists a unique optimal T.

Theorem (p = 2): T = ∇φ with φ convex.
    T is monotone:  ⟨T(x) − T(x′), x − x′⟩ ≥ 0

[Figure: the optimal map T transporting µ0 onto µ1]

Wasserstein Distance

Couplings Π(µ, ν): measures π on R^d × R^d with marginals µ and ν,
    ∀ A ⊂ R^d,  π(A × R^d) = µ(A)
    ∀ B ⊂ R^d,  π(R^d × B) = ν(B)

Transportation cost:
    Wp(µ, ν)^p = min_{π ∈ Π(µ, ν)} ∫_{R^d × R^d} c(x, y) dπ(x, y)

[Figure: a coupling π between µ and ν, putting mass dπ(x, y) between points x and y]

Optimal Transport

Let p > 1 and assume µ does not give mass to small sets.

    Unique π ∈ Π(µ, ν)  s.t.  Wp(µ, ν)^p = ∫_{R^d × R^d} c(x, y) dπ(x, y)

Optimal transport map T : R^d → R^d:
    the optimal π is supported on the graph {(x, T(x))}.

p = 2:  T = ∇φ is the unique solution of
    φ convex l.s.c.
    (∇φ)♯µ = ν

1-D Continuous Wasserstein

Distributions µ, ν on R.

Cumulative functions:  Cµ(t) = ∫_{−∞}^{t} dµ(x)

For all p > 1:  T = Cν^{−1} ∘ Cµ
    T is non-decreasing ("change of contrast")

Explicit formulas:
    Wp(µ, ν)^p = ∫_0^1 |Cµ^{−1} − Cν^{−1}|^p
    W1(µ, ν) = ∫_R |Cµ − Cν|

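An illustrative numpy sketch of these 1-D formulas on empirical, equal-size, uniformly weighted samples (function names are mine, not from the slides); sorting plays the role of the cumulative functions:

import numpy as np

def wasserstein_1d(a, b, p=2):
    """Empirical W_p between two equal-size samples a, b on R (uniform weights):
    the optimal map pairs sorted values, i.e. T = C_b^{-1} o C_a."""
    a_s, b_s = np.sort(a), np.sort(b)
    return np.mean(np.abs(a_s - b_s) ** p) ** (1.0 / p)

def monotone_map(a, b):
    """Evaluate T = C_b^{-1} o C_a at the sample points of a."""
    a_s, b_s = np.sort(a), np.sort(b)
    ranks = np.searchsorted(a_s, a)          # rank of each a_i, i.e. C_a(a_i) up to 1/n
    return b_s[np.clip(ranks, 0, b_s.size - 1)]

# usage: W_2 between samples of N(0,1) and N(3,0.5) is close to sqrt(3**2 + 0.5**2)
rng = np.random.default_rng(0)
a, b = rng.normal(0, 1, 1000), rng.normal(3, 0.5, 1000)
print(wasserstein_1d(a, b, p=2))
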
Grayscale Histogram Transfer

Input images: fi : [0,1]² → [0,1],  i = 0, 1.

Gray-value distributions: µi defined on [0,1],
    µi([a, b]) = ∫_{[0,1]²} 1_{a ≤ fi(x) ≤ b} dx

Optimal transport:  T = Cµ1^{−1} ∘ Cµ0.
    The transferred image is T(f0), computed pixel-wise as Cµ1^{−1}(Cµ0(f0)).

[Figure: images f0, f1, their gray-value histograms µ0, µ1, and the pipeline
 f0 → Cµ0(f0) → T(f0)]

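A possible sort-based implementation of the pixel-wise map T = Cµ1^{−1} ∘ Cµ0 (illustrative sketch, my own function names, assuming images are numpy arrays with values in [0, 1]):

import numpy as np

def histogram_transfer(f0, f1):
    """Apply T = C_{mu1}^{-1} o C_{mu0} pixel-wise: the k-th darkest pixel of f0
    receives the k-th darkest gray value of f1 (monotone remapping)."""
    src = f0.ravel()
    ref = np.sort(f1.ravel())
    order = np.argsort(src)
    out = np.empty_like(src)
    out[order] = ref[np.linspace(0, ref.size - 1, src.size).astype(int)]
    return out.reshape(f0.shape)

# usage on two synthetic "images"
f0 = np.random.rand(64, 64) ** 2          # dark-ish image
f1 = np.random.rand(48, 48)               # flat histogram
g = histogram_transfer(f0, f1)            # g keeps f0's geometry, f1's histogram
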
Application to Color Transfer

Input color images: fi ∈ R^{N×3}, with pixel values fi(x) ∈ R^3, x = 1, ..., N.
Color distributions: µi = (1/N) Σ_x δ_{fi(x)}.

Optimal assignment:
    min_{σ ∈ Σ_N} ||f0 − f1 ∘ σ||      (Σ_N: permutations of {1, ..., N})

Transport:  T : f0(x) ∈ R^3  ↦  f1(σ(x)) ∈ R^3

Equalization:  f0~ = T(f0), so that the color distribution of f0~ matches that of f1.

[Figures, from J. Rabin: source image f0 with color distribution µ0, style image f1
 with color distribution µ1, and the source image after color transfer T(f0)]

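For small point clouds the optimal assignment σ can be computed exactly with the Hungarian algorithm; the sketch below (my own, assuming SciPy is available) works on a subsample of pixel colors, since the exact assignment scales as O(N³) and sliced or sorted approximations are used in practice:

import numpy as np
from scipy.optimize import linear_sum_assignment

def color_transfer_assignment(f0, f1):
    """Exact optimal assignment between two small clouds of RGB values
    (Hungarian algorithm, O(N^3)); f0, f1 have shape (N, 3)."""
    cost = ((f0[:, None, :] - f1[None, :, :]) ** 2).sum(axis=2)   # ||f0(x) - f1(y)||^2
    _, sigma = linear_sum_assignment(cost)
    return f1[sigma]                                              # T(f0(x)) = f1(sigma(x))

# usage on a random subsample of pixel colors
rng = np.random.default_rng(0)
f0, f1 = rng.random((300, 3)), rng.random((300, 3))
f0_new = color_transfer_assignment(f0, f1)
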
Image Registration

Optimal transport maps are also used to register medical images: the map T gives the
deformation between the two acquisitions.

[Figure from [ur Rehman et al., 2009]: OMT results viewed on an axial slice; the top row
 shows corresponding slices from pre-op (left) and post-op (right) MRI data; the
 deformation is clearly visible in the anterior part of the brain.]

Convex Formulation (Benamou-Brenier)

Find  ρ : R^d × [0,1] → R+  and  m : R^d × [0,1] → R^d  solving

    W(µ0, µ1)² = min_{x = (m, ρ)} J(x) + ι_C(x)

    J(x) = ∫_{s ∈ R^d} ∫_{t=0}^{1} j(x(s, t)) dt ds

    j(m~, ρ~) = ||m~||² / ρ~    if ρ~ > 0,
                0               if ρ~ = 0 and m~ = 0,        (m~ ∈ R^d, ρ~ ∈ R)
                +∞              otherwise.

    C = {x = (m, ρ) : div(x) = 0, B(ρ) = (ρ0, ρ1)},    B(ρ) = (ρ(0, ·), ρ(1, ·))
        (div is the space-time divergence: the continuity equation ∂t ρ + div_s(m) = 0)

Numerical Examples

[Figure: densities ρ0 and ρ1 and the interpolation ρ(·, t) as t goes from 0 to 1
 (synthetic 2D examples on a Euclidean domain)]

Discrete Formulation

Centered grid formulation (d = 1):
    min_{x ∈ R^{Gc×2}} J(x) + ι_C(x),      J(x) = Σ_{i ∈ Gc} j(xi)

Staggered grid formulation:
    min_{x ∈ R^{Gst¹} × R^{Gst²}} J(I(x)) + ι_C(x)

Interpolation operator:
    I = (I¹, I²) : R^{Gst¹} × R^{Gst²} → R^{Gc}
    2 I¹(m)_{i,j} = m_{i+1/2, j} + m_{i−1/2, j}

→ Projection on div(x) = 0 using FFTs.

[Figures: the centered grid Gc and the staggered grids Gst¹, Gst² in the (t, s) plane]

SOCP Formulation

    min_{x ∈ R^{Gc×d}} J(x) + ι_C(x),      J(x) = Σ_{i ∈ Gc} j(xi)

⟺  min_{x ∈ R^{Gc×d}, r ∈ R^{Gc}} Σ_i ri    s.t.   ∀ i ∈ Gc, (mi, ρi, ri) ∈ K

(Rotated) Lorentz cone:  K = {(m~, ρ~, r~) ∈ R^{d+2} : ||m~||² ≤ ρ~ r~}

Second-order cone program:
    → Use interior point methods (e.g. MOSEK software).
        Linear convergence with iteration #.
        Poor scaling with dimension |Gc|.
    → Efficient for medium-scale problems (N ∼ 10^4).

Example: ℓ¹ Regularization

Inverse problem: measurements  y = Φ x0 + w

Regularized inversion:
    x* ∈ argmin_{x ∈ R^N} (1/2)||y − Φ x||² + λ R(x)
                            Data fidelity      Regularity

Total Variation:  R(x) = Σ_i ||(∇x)_i||

ℓ¹ sparsity:  R(x) = Σ_i |xi|
    Images are sparse in wavelet bases:  image f = Ψ x,  coefficients x = Ψ* f.

[Figure: original x0, measurements y, and the regularized recovery x*]

Overview
• Optimal Transport and Imaging

• Convex Analysis and Proximal Calculus

• Forward-Backward

• Douglas-Rachford and ADMM

• Generalized Forward-Backward

• Primal-Dual Schemes
Convex Optimization

Setting:  G : H → R ∪ {+∞}
    H: Hilbert space. Here: H = R^N.

        Problem:   min_{x ∈ H} G(x)

Class of functions:

    Convex:  G(tx + (1−t)y) ≤ t G(x) + (1−t) G(y),   ∀ t ∈ [0, 1]

    Lower semi-continuous:  lim inf_{x → x0} G(x) ≥ G(x0)

    Proper:  {x ∈ H : G(x) ≠ +∞} ≠ ∅

Indicator:  ι_C(x) = 0 if x ∈ C,  +∞ otherwise.     (C closed and convex)

   (C closed and convex)
Sub-differential

Sub-di erential:
      G(x) = {u ⇥ H  ⇤ z, G(z)   G(x) + ⌅u, z   x⇧}
                                             G(x) = |x|



                                             G(0) = [ 1, 1]
Sub-differential

Sub-di erential:
      G(x) = {u ⇥ H  ⇤ z, G(z)       G(x) + ⌅u, z   x⇧}
                                                 G(x) = |x|
Smooth functions:
     If F is C 1 , F (x) = { F (x)}

                                                 G(0) = [ 1, 1]
Sub-differential

Sub-di erential:
        G(x) = {u ⇥ H  ⇤ z, G(z)     G(x) + ⌅u, z   x⇧}
                                                 G(x) = |x|
Smooth functions:
     If F is C 1 , F (x) = { F (x)}

                                                 G(0) = [ 1, 1]
First-order conditions:
    x      argmin G(x)        0     G(x )
            x H
Sub-differential

Sub-di erential:
        G(x) = {u ⇥ H  ⇤ z, G(z)          G(x) + ⌅u, z   x⇧}
                                                      G(x) = |x|
Smooth functions:
     If F is C 1 , F (x) = { F (x)}

                                                      G(0) = [ 1, 1]
First-order conditions:
    x      argmin G(x)            0       G(x )
             x H                                                U (x)

                                                                    x
Monotone operator:        U (x) = G(x)
        (u, v)   U (x)   U (y),       y   x, v    u   0
Prox and Subdifferential

    Prox_{γG}(x) = argmin_z (1/2)||x − z||² + γ G(z)

Resolvent of ∂G:
    z = Prox_{γG}(x)   ⟺   0 ∈ z − x + γ ∂G(z)
                       ⟺   x ∈ (Id + γ ∂G)(z)   ⟺   z = (Id + γ ∂G)^{−1}(x)

Inverse of a set-valued mapping:   x ∈ U(y)  ⟺  y ∈ U^{−1}(x)
    Prox_{γG} = (Id + γ ∂G)^{−1} is a single-valued mapping.

Fixed point:  x* ∈ argmin_x G(x)
    ⟺  0 ∈ ∂G(x*)   ⟺   x* ∈ (Id + γ ∂G)(x*)
    ⟺  x* = (Id + γ ∂G)^{−1}(x*) = Prox_{γG}(x*)

Proximal Calculus

Separability:  G(x) = G1(x1) + ... + Gn(xn)
    Prox_G(x) = (Prox_{G1}(x1), ..., Prox_{Gn}(xn))

Quadratic functionals:  G(x) = (1/2)||Φx − y||²
    Prox_{γG}(x) = (Id + γ Φ*Φ)^{−1}(x + γ Φ*y)

Composition by a tight frame (A A* = Id):
    Prox_{G∘A} = A* ∘ Prox_G ∘ A + Id − A*A

Indicators:  G(x) = ι_C(x)    (C closed and convex)
    Prox_{γG}(x) = Proj_C(x) = argmin_{z ∈ C} ||x − z||

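A quick numpy check of the quadratic-prox formula above (illustrative sketch; names are mine):

import numpy as np

def prox_quadratic(x, Phi, y, gamma):
    """Prox of gamma*G with G(x) = 0.5*||Phi x - y||^2:
    (Id + gamma Phi^T Phi)^{-1} (x + gamma Phi^T y)."""
    n = Phi.shape[1]
    return np.linalg.solve(np.eye(n) + gamma * Phi.T @ Phi, x + gamma * Phi.T @ y)

# sanity check: the optimality condition (z - x) + gamma*Phi^T(Phi z - y) = 0 holds
rng = np.random.default_rng(0)
Phi, y, x = rng.standard_normal((20, 10)), rng.standard_normal(20), rng.standard_normal(10)
z = prox_quadratic(x, Phi, y, 0.7)
print(np.abs((z - x) + 0.7 * Phi.T @ (Phi @ z - y)).max())   # ~1e-15
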
Prox of Sparse Regularizers

    Prox_{γG}(x) = argmin_z (1/2)||x − z||² + γ G(z)

G(x) = ||x||₁ = Σ_i |xi|         (soft thresholding)
    Prox_{γG}(x)_i = max(0, 1 − γ/|xi|) xi

G(x) = ||x||₀ = |{i : xi ≠ 0}|   (hard thresholding)
    Prox_{γG}(x)_i = xi if |xi| ≥ √(2γ),  0 otherwise.

G(x) = Σ_i log(1 + |xi|²)
    → root of a 3rd-order polynomial.

[Figure: the regularizers |x|, ||x||₀ and log(1 + x²) (top), and the corresponding
 maps x ↦ Prox_{γG}(x) (bottom)]

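The two thresholding rules above in numpy (illustrative sketch, names are mine):

import numpy as np

def prox_l1(x, gamma):
    """Soft thresholding: prox of gamma*||.||_1, applied entry-wise."""
    return np.maximum(0.0, 1.0 - gamma / np.maximum(np.abs(x), 1e-12)) * x

def prox_l0(x, gamma):
    """Hard thresholding: prox of gamma*||.||_0, keeps entries with |x_i| >= sqrt(2*gamma)."""
    return np.where(np.abs(x) >= np.sqrt(2.0 * gamma), x, 0.0)

x = np.array([-3.0, -0.5, 0.0, 0.2, 2.0])
print(prox_l1(x, 1.0))   # [-2. -0.  0.  0.  1.]
print(prox_l0(x, 1.0))   # [-3.  0.  0.  0.  2.]
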
Legendre-Fenchel Duality

Legendre-Fenchel transform:
    G*(u) = sup_{x ∈ dom(G)} ⟨u, x⟩ − G(x)

[Figure: G(x) and a supporting line of slope u; −G*(u) is its intercept]

Example: quadratic functional
    G(x) = (1/2)⟨Ax, x⟩ + ⟨x, b⟩
    G*(u) = (1/2)⟨u − b, A^{−1}(u − b)⟩

Moreau's identity:
    Prox_{γG*}(x) = x − γ Prox_{G/γ}(x/γ)
    G simple  ⟺  G* simple

Indicator and Homogeneous Functionals

Positively 1-homogeneous functional:  G(λx) = |λ| G(x)
    Example: norm  G(x) = ||x||

Duality:   G*(x) = ι_{G°(·) ≤ 1}(x)   where   G°(y) = max_{G(x) ≤ 1} ⟨x, y⟩

ℓp norms:  G(x) = ||x||_p,  G°(x) = ||x||_q,    1/p + 1/q = 1,   1 ≤ p, q ≤ +∞

Example: proximal operator of the ℓ∞ norm
    Prox_{γ||·||∞} = Id − Proj_{||·||₁ ≤ γ}
    Proj_{||·||₁ ≤ γ}(x)_i = max(0, 1 − τ/|xi|) xi
        for a well-chosen τ = τ(x, γ)
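One standard way to obtain the "well-chosen" τ is the sort-based projection onto the ℓ1 ball; a numpy sketch of that projection and of the resulting Prox of the ℓ∞ norm via Moreau's identity (helper names are mine):

import numpy as np

def proj_l1_ball(x, radius):
    """Euclidean projection onto {z : ||z||_1 <= radius} (sort-based)."""
    if np.abs(x).sum() <= radius:
        return x.copy()
    u = np.sort(np.abs(x))[::-1]
    cssv = np.cumsum(u)
    k = np.nonzero(u * np.arange(1, x.size + 1) > cssv - radius)[0][-1]
    tau = (cssv[k] - radius) / (k + 1.0)      # the "well-chosen" threshold tau(x, radius)
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def prox_linf(x, gamma):
    """Moreau identity: prox of gamma*||.||_inf = Id - projection onto the l1-ball of radius gamma."""
    return x - proj_l1_ball(x, gamma)

print(prox_linf(np.array([3.0, -1.0, 0.5]), 1.0))   # [ 2.  -1.   0.5]
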
Prox of the J Functional

J(m, ρ) = Σ_i j(mi, ρi),      j(m~, ρ~) = ||m~||² / ρ~   for ρ~ > 0

Prox_{γJ}(m, ρ) = (Prox_{γj}(mi, ρi))_i

j* = ι_C   where   C = {(a, b) ∈ R^d × R : ||a||²/4 + b ≤ 0}

Prox_{γj}(x~) = x~ − γ Proj_C(x~/γ)     where  x~ = (m~, ρ~)

Proposition:
    Prox_{γj}(m~, ρ~) = (m*, ρ*)  if ρ* > 0,   and  (0, 0)  otherwise,
    where  m* = ρ* m~ / (ρ* + 2γ)  and  ρ* is the largest real root of
    X³ + (4γ − ρ~) X² + 4γ(γ − ρ~) X − γ||m~||² − 4γ²ρ~ = 0
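A direct numerical transcription of this proposition (illustrative sketch, my own names; the cubic is solved with np.roots and checked against the optimality conditions of the prox problem):

import numpy as np

def prox_j(m, rho, gamma):
    """Prox of gamma*j at one grid point, j(m, rho) = ||m||^2 / rho:
    rho* is the largest real root of the cubic, m* = rho* m / (rho* + 2 gamma)."""
    coeffs = [1.0,
              4.0 * gamma - rho,
              4.0 * gamma * (gamma - rho),
              -gamma * np.dot(m, m) - 4.0 * gamma**2 * rho]
    roots = np.roots(coeffs)
    rho_star = roots[np.abs(roots.imag) < 1e-8].real.max()
    if rho_star <= 0:
        return np.zeros_like(m), 0.0
    return rho_star * m / (rho_star + 2.0 * gamma), rho_star

# sanity check: the optimality conditions of the prox objective vanish
m_s, r_s = prox_j(np.array([0.3, -0.2]), 0.5, 0.1)
print((m_s - np.array([0.3, -0.2])) + 0.2 * m_s / r_s)        # ~0
print((r_s - 0.5) - 0.1 * np.dot(m_s, m_s) / r_s**2)          # ~0
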
Overview
• Optimal Transport and Imaging

• Convex Analysis and Proximal Calculus

• Forward-Backward

• Douglas-Rachford and ADMM

• Generalized Forward-Backward

• Primal-Dual Schemes
Gradient and Proximal Descents

Gradient descent:  x(ℓ+1) = x(ℓ) − τ ∇G(x(ℓ))          [explicit]
    G is C¹ and ∇G is L-Lipschitz.
    Theorem:  if 0 < τ < 2/L,  x(ℓ) → x*  a solution.

Sub-gradient descent:  x(ℓ+1) = x(ℓ) − τℓ v(ℓ),   v(ℓ) ∈ ∂G(x(ℓ))
    Theorem:  if τℓ ∼ 1/ℓ,  x(ℓ) → x*  a solution.
    Problem: slow.

Proximal-point algorithm:  x(ℓ+1) = Prox_{τℓ G}(x(ℓ))   [implicit]
    Theorem:  if τℓ ≥ c > 0,  x(ℓ) → x*  a solution.     [Rockafellar, 70]
    Problem: Prox_{τG} hard to compute.

Proximal Splitting Methods

Solve  min_{x ∈ H} E(x).
Problem: Prox_{γE} is not available.

Splitting:  E(x) = F(x) + Σ_i Gi(x)
                   Smooth    Simple

Iterative algorithms using only  ∇F(x)  and  Prox_{γGi}(x):

                          solves
    Forward-Backward:     F + G
    Douglas-Rachford:     Σ_i Gi
    Primal-Dual:          Σ_i Gi ∘ Ai
    Generalized FB:       F + Σ_i Gi

Smooth + Simple Splitting

Inverse problem: measurements  y = K f0 + w
    K : R^N → R^P,   P ≤ N

Model: f0 = Ψ x0 is sparse in a dictionary Ψ.
Sparse recovery: f = Ψ x where x solves
    min_{x ∈ R^N} F(x) + G(x)
                  Smooth  Simple

Data fidelity:   F(x) = (1/2)||y − Φ x||²,    Φ = K Ψ
Regularization:  G(x) = λ||x||₁ = λ Σ_i |xi|

Forward-Backward

Fixed-point equation:
    x* ∈ argmin_x F(x) + G(x)   ⟺   0 ∈ ∇F(x*) + ∂G(x*)
    ⟺  (x* − γ ∇F(x*)) ∈ x* + γ ∂G(x*)
    ⟺  x* = Prox_{γG}(x* − γ ∇F(x*))

Forward-backward:  x(ℓ+1) = Prox_{γG}( x(ℓ) − γ ∇F(x(ℓ)) )

Projected gradient descent:  G = ι_C.

Theorem:  let ∇F be L-Lipschitz.
    If γ < 2/L,  x(ℓ) → x*  a solution of (*).
                                     [Passty 79, Gabay 83]

Example: ℓ¹ Regularization

    min_x (1/2)||Φx − y||² + λ||x||₁    ⟺    min_x F(x) + G(x)

    F(x) = (1/2)||Φx − y||²
        ∇F(x) = Φ*(Φx − y),      L = ||Φ*Φ||

    G(x) = λ||x||₁
        Prox_{γG}(x)_i = max(0, 1 − λγ/|xi|) xi

Forward-backward  =  iterative soft thresholding
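A minimal numpy sketch of this iterative soft thresholding (names are mine, not from the slides):

import numpy as np

def ista(Phi, y, lam, n_iter=300):
    """Forward-backward / iterative soft thresholding for
    min_x 0.5*||Phi x - y||^2 + lam*||x||_1."""
    L = np.linalg.norm(Phi, 2) ** 2                            # Lipschitz constant of grad F
    x = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        z = x - (1.0 / L) * (Phi.T @ (Phi @ x - y))            # forward (gradient) step
        x = np.maximum(0.0, 1.0 - lam / (L * np.maximum(np.abs(z), 1e-12))) * z   # backward (prox) step
    return x

# usage: sparse recovery from random measurements
rng = np.random.default_rng(0)
Phi = rng.standard_normal((50, 200))
x0 = np.zeros(200); x0[rng.choice(200, 5, replace=False)] = 1.0
x = ista(Phi, y=Phi @ x0, lam=0.05)
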
Convergence Speed

    min_x E(x) = F(x) + G(x)
        ∇F is L-Lipschitz,  G is simple.

Theorem:  if L > 0, the FB iterates x(ℓ) satisfy
    E(x(ℓ)) − E(x*) ≤ C/ℓ

    The constant C degrades with L.

Multi-step Accelerations

Beck-Teboulle accelerated FB (FISTA):  t(0) = 1

    x(ℓ+1) = Prox_{G/L}( y(ℓ) − (1/L) ∇F(y(ℓ)) )
    t(ℓ+1) = ( 1 + √(1 + 4 t(ℓ)²) ) / 2
    y(ℓ+1) = x(ℓ+1) + ((t(ℓ) − 1)/t(ℓ+1)) (x(ℓ+1) − x(ℓ))

    (see also Nesterov's method)

Theorem:  if L > 0,   E(x(ℓ)) − E(x*) ≤ C/ℓ²

Complexity theory: optimal in a worst-case sense.
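A sketch of this acceleration on the same ℓ¹ problem as before (illustrative only; names are mine):

import numpy as np

def fista(Phi, y, lam, n_iter=300):
    """Beck-Teboulle accelerated forward-backward for min 0.5*||Phi x - y||^2 + lam*||x||_1."""
    L = np.linalg.norm(Phi, 2) ** 2
    soft = lambda z, t: np.maximum(0.0, 1.0 - t / np.maximum(np.abs(z), 1e-12)) * z
    x = np.zeros(Phi.shape[1]); v = x.copy(); t = 1.0
    for _ in range(n_iter):
        x_new = soft(v - (1.0 / L) * (Phi.T @ (Phi @ v - y)), lam / L)   # prox-gradient at y^(l)
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0                 # momentum update
        v = x_new + ((t - 1.0) / t_new) * (x_new - x)                    # extrapolated point y^(l+1)
        x, t = x_new, t_new
    return x
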
Overview
• Optimal Transport and Imaging

• Convex Analysis and Proximal Calculus

• Forward-Backward

• Douglas-Rachford and ADMM

• Generalized Forward-Backward

• Primal-Dual Schemes
Douglas-Rachford Scheme

    min_x G1(x) + G2(x)      (*)
          Simple    Simple

Douglas-Rachford iterations:
    z(ℓ+1) = (1 − α/2) z(ℓ) + (α/2) RProx_{γG2}( RProx_{γG1}(z(ℓ)) )
    x(ℓ+1) = Prox_{γG1}(z(ℓ+1))

Reflected prox:  RProx_{γG}(x) = 2 Prox_{γG}(x) − x

Theorem:  if 0 < α < 2 and γ > 0,
    x(ℓ) → x*  a solution of (*).
                                    [Lions, Mercier, 79]

DR Fixed-Point Equation

    min_x G1(x) + G2(x)    ⟺    0 ∈ ∂(G1 + G2)(x)

⟺  ∃ z:   z − x ∈ γ ∂G1(x)   and   x − z ∈ γ ∂G2(x)

⟺  x = Prox_{γG1}(z)   and   (2x − z) − x ∈ γ ∂G2(x)

    x = Prox_{γG2}(2x − z) = Prox_{γG2} ∘ RProx_{γG1}(z)

    z = 2x − (2x − z)
      = 2 Prox_{γG2} ∘ RProx_{γG1}(z) − RProx_{γG1}(z)
      = RProx_{γG2} ∘ RProx_{γG1}(z)

⟺  z = (1 − α/2) z + (α/2) RProx_{γG2} ∘ RProx_{γG1}(z)

Example: Optimal Transport on Centered Grid

    min_{x ∈ R^{Gc×2}} J(x) + ι_C(x)

    C = {x = (m, ρ) : Ax = b},    b = (0, ρ0, ρ1)
    A(x) = (div(x), ρ|_{I0}, ρ|_{I1})
        (I0, I1: boundary grid lines at t = 0 and t = 1 of the centered grid Gc)

Prox_{γJ}: cubic root (closed form).

Prox_{γιC} = Proj_C,   Proj_C(x) = (Id − A* Δ^{−1} A) x + A* Δ^{−1} b,   Δ^{−1} = (AA*)^{−1}:
    solving a Poisson equation with boundary conditions.

Proposition:  DR(α = 1) is ALG2 of [Benamou, Brenier 2000]

→ Advantage: relaxation parameter α ∈ ]0, 2[ (ALG2 corresponds to α = 1).

Example: Constrained ℓ¹

    min_{Φx = y} ||x||₁     ⟺     min_x G1(x) + G2(x)

G1(x) = ι_C(x),   C = {x : Φx = y}
    Prox_{γG1}(x) = Proj_C(x) = x + Φ*(ΦΦ*)^{−1}(y − Φx)

G2(x) = ||x||₁
    Prox_{γG2}(x) = ( max(0, 1 − γ/|xi|) xi )_i

→ efficient if ΦΦ* is easy to invert.

Example: compressed sensing
    Φ ∈ R^{100×400} Gaussian matrix,  y = Φ x0,  ||x0||₀ = 17
    [Figure: log10(||x(ℓ)||₁ − ||x*||₁) over 250 DR iterations, for γ = 0.01, 1, 10]
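A numpy sketch of this Douglas-Rachford scheme on the same kind of compressed sensing instance (illustrative only; names and the 500-iteration budget are mine):

import numpy as np

def dr_basis_pursuit(Phi, y, gamma=1.0, alpha=1.0, n_iter=500):
    """Douglas-Rachford for min ||x||_1 s.t. Phi x = y,
    with G1 = indicator of {Phi x = y} (prox = affine projection) and G2 = ||.||_1."""
    gram_inv = np.linalg.inv(Phi @ Phi.T)                # cheap when Phi Phi^T is easy to invert
    proj_C = lambda x: x + Phi.T @ (gram_inv @ (y - Phi @ x))
    soft = lambda x, t: np.sign(x) * np.maximum(np.abs(x) - t, 0.0)
    z = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        x = proj_C(z)                                    # Prox_{gamma G1}(z)
        r1 = 2.0 * x - z                                 # RProx_{gamma G1}(z)
        r2 = 2.0 * soft(r1, gamma) - r1                  # RProx_{gamma G2}(r1)
        z = (1.0 - alpha / 2.0) * z + (alpha / 2.0) * r2
    return proj_C(z)

rng = np.random.default_rng(0)
Phi = rng.standard_normal((100, 400))
x0 = np.zeros(400); x0[rng.choice(400, 17, replace=False)] = rng.standard_normal(17)
x = dr_basis_pursuit(Phi, Phi @ x0)
print(np.linalg.norm(x - x0))    # should be small in this compressed sensing regime
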
Auxiliary Variables with DR

    min_x G1(x) + G2(A(x)),     linear map A : E → H,     G1, G2 simple.

⟺  min_{z ∈ H × E} G(z) + ι_C(z)
        G(x, y) = G1(x) + G2(y)
        C = {(x, y) ∈ H × E : Ax = y}

Prox_{γG}(x, y) = (Prox_{γG1}(x), Prox_{γG2}(y))

Prox_{γιC}(x, y) = (x~, A x~)
    where   x~ = (Id + A*A)^{−1}(A* y + x)
            (equivalently x~ = x − A* y~ with y~ = (Id + AA*)^{−1}(Ax − y))

→ efficient if Id + AA* or Id + A*A is easy to invert.

Example: TV Regularization

    min_f (1/2)||Kf − y||² + λ||∇f||₁,        ||u||₁ = Σ_i ||ui||

⟺  min_{f, u} G2(f) + G1(u) + ι_C(f, u)     (auxiliary variable u = ∇f)

G1(u) = λ||u||₁
    Prox_{γG1}(u)_i = max(0, 1 − λγ/||ui||) ui

G2(f) = (1/2)||Kf − y||²
    Prox_{γG2}(f) = (Id + γ K*K)^{−1}(f + γ K*y)

C = {(f, u) ∈ R^N × R^{N×2} : u = ∇f}
    Prox_{γιC}(f, u) = (f~, ∇f~)   where f~ solves   (Id + ∇*∇) f~ = f − div(u)
    → O(N log(N)) operations using FFT.
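A sketch of this FFT solve for the projection onto {u = ∇f} (illustrative only; the periodic boundary conditions and forward-difference discretization are my own choices):

import numpy as np

def grad(f):
    """Periodic forward-difference gradient, shape (2, n, m)."""
    return np.stack([np.roll(f, -1, axis=0) - f, np.roll(f, -1, axis=1) - f])

def div(u):
    """div = -grad^*: periodic backward differences."""
    return (u[0] - np.roll(u[0], 1, axis=0)) + (u[1] - np.roll(u[1], 1, axis=1))

def proj_gradient_constraint(f, u):
    """Projection on C = {(f, u) : u = grad f}: solve (Id + grad^* grad) f~ = f - div(u) by FFT."""
    n, m = f.shape
    kx = 2.0 * np.pi * np.fft.fftfreq(n)[:, None]
    ky = 2.0 * np.pi * np.fft.fftfreq(m)[None, :]
    symbol = 1.0 + (2 - 2 * np.cos(kx)) + (2 - 2 * np.cos(ky))   # eigenvalues of Id + grad^* grad
    f_t = np.real(np.fft.ifft2(np.fft.fft2(f - div(u)) / symbol))
    return f_t, grad(f_t)
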
Example: TV Regularization

[Figure: original f0, observations y, TV-regularized recovery f*, and the decay of the
 objective with the iterations]

Alternating Direction Method of Multipliers

    min_x F(x) + G(A(x))    (*)     ⟺     min_{x, y = Ax} F(x) + G(y)
        A : R^N → R^P injective.

Lagrangian:
    min_{x,y} max_u  L(x, y, u) = F(x) + G(y) + ⟨u, y − Ax⟩

Augmented Lagrangian:
    min_{x,y} max_u  L_γ(x, y, u) = L(x, y, u) + (γ/2)||y − Ax||²

ADMM:
    x(ℓ+1) = argmin_x L_γ(x, y(ℓ), u(ℓ))
    y(ℓ+1) = argmin_y L_γ(x(ℓ+1), y, u(ℓ))
    u(ℓ+1) = u(ℓ) + (y(ℓ+1) − A x(ℓ+1))

Theorem:  if γ > 0,  x(ℓ) → x*  a solution of (*).
                 [Gabay, Mercier, Glowinski, Marrocco, 76]
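A small concrete instance of this loop (illustrative sketch, my own discretization and names): 1-D TV denoising, F(x) = ½||x − b||², G = λ||·||₁, A = D the forward-difference matrix; the x-update is a linear solve, the y-update a soft threshold:

import numpy as np

def admm_tv1d(b, lam, gamma=1.0, n_iter=300):
    """ADMM sketch for min_x 0.5*||x - b||^2 + lam*||D x||_1, split as F(x) + G(y), y = D x."""
    n = b.size
    D = np.diff(np.eye(n), axis=0)                       # (n-1) x n difference matrix
    M = np.eye(n) + gamma * D.T @ D                      # x-update normal equations
    soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)
    y = np.zeros(n - 1); u = np.zeros(n - 1)
    for _ in range(n_iter):
        x = np.linalg.solve(M, b + gamma * D.T @ (y + u))   # x-update (quadratic subproblem)
        y = soft(D @ x - u, lam / gamma)                     # y-update (prox of G/gamma)
        u = u + y - D @ x                                    # scaled dual update
    return x

# usage: denoise a noisy piecewise-constant signal
rng = np.random.default_rng(0)
x0 = np.repeat([0.0, 1.0, 0.3], 50)
x = admm_tv1d(x0 + 0.1 * rng.standard_normal(x0.size), lam=0.2)
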
ADMM with Proximal Operators

Proximal mapping for the metric of A (A injective):
    Prox^A_F(z) = argmin_x (1/2)||Ax − z||² + F(x)

Proposition:   Prox^A_F(z) = A⁺( z − Prox_{F*∘A*}(z) )
    (A⁺: pseudo-inverse of A; apply it to F/γ to get Prox^A_{F/γ})

ADMM in proximal form:
    x(ℓ+1) = Prox^A_{F/γ}( y(ℓ) + u(ℓ) )
    y(ℓ+1) = Prox_{G/γ}( A x(ℓ+1) − u(ℓ) )
    u(ℓ+1) = u(ℓ) + y(ℓ+1) − A x(ℓ+1)

→ If G ∘ A is simple: use DR.
→ If F* ∘ A* is simple: use ADMM.
ADMM vs. DR

Fenchel-Rockafellar duality:
    min_x F(x) + G(A(x))     ⟷     min_u F*(−A*u) + G*(u)
Important: no bijection between u and x.

Proposition:  DR applied to F*∘(−A*) + G* is ADMM.
                                   [Eckstein, Bertsekas, 92]

DR iterations (when α = 1):
    z(ℓ+1) = (1/2) z(ℓ) + (1/2) RProx_{γ F*∘(−A*)}( RProx_{γG*}(z(ℓ)) )

The iterates of ADMM are recovered using:
    u(ℓ) = Prox_{γG*}(z(ℓ))
    y(ℓ) = (1/γ)(z(ℓ) − u(ℓ))
    x(ℓ+1) = Prox^A_{F/γ}( y(ℓ) + u(ℓ) )
More than 2 Functionals

    min_x  G1(x) + ... + Gk(x)          each Gi is simple

 ⟺  min_{x1, ..., xk}  G(x1, ..., xk) + ι_C(x1, ..., xk)

    G(x1, ..., xk) = G1(x1) + ... + Gk(xk)
    C = { (x1, ..., xk) ∈ H^k ;  x1 = ... = xk }

G and ι_C are simple:
    Prox_{γG}(x1, ..., xk) = ( Prox_{γGi}(xi) )_i
    Prox_{γι_C}(x1, ..., xk) = (x̃, ..., x̃)    where   x̃ = (1/k) Σ_i xi
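A small sketch of this product-space trick, with toy functions chosen for illustration: Douglas-Rachford is applied to G + ι_C, using the separable prox of G and the consensus projection described above.

import numpy as np

gamma = 1.0

def prox_G(X, gamma):
    # separable prox; toy choice: G1 = |.-1|, G2 = |.+1|, G3 = 0.5*(.-0.5)^2
    x1, x2, x3 = X
    p1 = 1 + np.sign(x1 - 1) * max(abs(x1 - 1) - gamma, 0.0)    # prox of gamma*|.-1|
    p2 = -1 + np.sign(x2 + 1) * max(abs(x2 + 1) - gamma, 0.0)   # prox of gamma*|.+1|
    p3 = (x3 + 0.5 * gamma) / (1 + gamma)                       # prox of gamma*0.5*(.-0.5)^2
    return np.array([p1, p2, p3])

def prox_C(X):
    # projection on the consensus set {x1 = ... = xk}: replicate the average
    return np.full_like(X, X.mean())

z = np.zeros(3)
for _ in range(200):
    x = prox_G(z, gamma)                 # prox of the separable part
    z = z + prox_C(2 * x - z) - x        # Douglas-Rachford update
print(prox_G(z, gamma))                  # all three copies converge to the minimizer, approx. 0.5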
Overview
• Optimal Transport and Imaging

• Convex Analysis and Proximal Calculus

• Forward Backward

• Douglas Rachford and ADMM

• Generalized Forward-Backward

• Primal-Dual Schemes
GFB Splitting

    min_{x ∈ R^N}  F(x) + Σ_{i=1}^n Gi(x)      (⋆)
                  Smooth     Simple (i = 1, ..., n)

    ∀ i = 1, ..., n:
    zi^(ℓ+1) = zi^(ℓ) + Prox_{nγ Gi}( 2x^(ℓ) − zi^(ℓ) − γ ∇F(x^(ℓ)) ) − x^(ℓ)
    x^(ℓ+1) = (1/n) Σ_{i=1}^n zi^(ℓ+1)

Theorem:   Let ∇F be L-Lipschitz.
           If γ < 2/L,   x^(ℓ) → x⋆ a solution of (⋆).

                                 [Raguet, Fadili, Peyré 2012]

    n = 1   →  Forward-Backward.
    F = 0   →  Douglas-Rachford.
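A minimal sketch of the GFB iteration on a toy problem (assumed here: F(x) = ½||x − y||², G1 = λ||·||₁ and G2 the indicator of the positive orthant), following the update rule of the slide.

import numpy as np

def gfb(y, lam=0.5, gamma=1.0, n_iter=200):
    # min_x 0.5*||x - y||^2 + lam*||x||_1 + iota_{x >= 0}(x)
    # F(x) = 0.5*||x - y||^2 is smooth (grad F = x - y, L = 1); G1, G2 are simple.
    n = 2
    prox = [
        lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t * lam, 0.0),  # prox of t*G1
        lambda v, t: np.maximum(v, 0.0),                                  # prox of t*G2
    ]
    z = [np.zeros_like(y) for _ in range(n)]
    x = np.zeros_like(y)
    for _ in range(n_iter):
        grad = x - y                                   # grad F(x)
        for i in range(n):
            z[i] = z[i] + prox[i](2 * x - z[i] - gamma * grad, n * gamma) - x
        x = sum(z) / n
    return x

y = np.array([-1.0, 0.2, 2.0])
print(gfb(y))      # expected approximately [0, 0, 1.5]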
GFB Fix Point

x⋆ ∈ argmin_{x ∈ R^N} F(x) + Σ_i Gi(x)    ⟺    0 ∈ ∇F(x⋆) + Σ_i ∂Gi(x⋆)
                                           ⟺    ∃ yi ∈ ∂Gi(x⋆),   ∇F(x⋆) + Σ_i yi = 0

   ⟺  ∃ (zi)_{i=1}^n,  ∀ i,   (1/n)( x⋆ − zi − γ ∇F(x⋆) ) ∈ γ ∂Gi(x⋆)
       and  x⋆ = (1/n) Σ_i zi        (use zi = x⋆ − γ ∇F(x⋆) − nγ yi)

   ⟺  ( 2x⋆ − zi − γ ∇F(x⋆) ) − x⋆ ∈ nγ ∂Gi(x⋆)
   ⟺  x⋆ = Prox_{nγ Gi}( 2x⋆ − zi − γ ∇F(x⋆) )
   ⟺  zi = zi + Prox_{nγ Gi}( 2x⋆ − zi − γ ∇F(x⋆) ) − x⋆

   ⟹  Fix point equation on (x⋆, z1, ..., zn).
Block Regularization

ℓ1 − ℓ2 block sparsity:   G(x) = Σ_{b ∈ B} ||x[b]||,     ||x[b]||² = Σ_{m ∈ b} x_m²

Non-overlapping decomposition:  B = B1 ∪ ... ∪ Bn

    G(x) = Σ_{i=1}^n Gi(x),       Gi(x) = Σ_{b ∈ Bi} ||x[b]||

Each Gi is simple:
    ∀ m ∈ b ∈ Bi,    Prox_{γ Gi}(x)_m = max( 0, 1 − γ/||x[b]|| ) x_m

[Figure: image f = Ψx and its coefficients x (N = 256); overlapping blocks b ∈ B split into non-overlapping families B1 and B2.]
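The prox of each Gi above is a block soft-thresholding. A short sketch, assuming the blocks of one family Bi are given as a list of index arrays:

import numpy as np

def prox_block_l1l2(x, blocks, gamma):
    # prox of gamma * sum_b ||x[b]|| for a non-overlapping family of blocks
    p = x.copy()
    for b in blocks:
        norm = np.linalg.norm(x[b])
        scale = max(0.0, 1.0 - gamma / norm) if norm > 0 else 0.0
        p[b] = scale * x[b]
    return p

x = np.array([3.0, 4.0, 0.1, -0.2, 1.0])
blocks = [np.array([0, 1]), np.array([2, 3]), np.array([4])]
print(prox_block_l1l2(x, blocks, gamma=1.0))
# block [3, 4] has norm 5 -> scaled by 0.8; the other two blocks have norm <= 1 -> set to 0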
Numerical Experiments

Deconvolution:               min_x  (1/2) ||Y − K x||² + λ Σ_{k=1}^{4}  ||x||_{Bk,1,2}
Deconvolution + Inpainting:  min_x  (1/2) ||Y − P K x||² + λ Σ_{k=1}^{16} ||x||_{Bk,1,2}

[Figure: decay of log10( E(x^(ℓ)) − E(x⋆) ) against iteration # for EFB (generalized forward-backward), PR and CP, on a 256×256 image with y = Φ x0 + w (noise 0.025, convolution width 2).
 Deconvolution: λ_{ℓ1/ℓ2} = 1.30e−03, SNR 22.49dB at it. #50; timings tEFB: 161s, tPR: 173s, tCP: 190s.
 Deconvolution + inpainting (degrad. 0.4): λ_{ℓ1/ℓ2} = 1.00e−03, SNR 21.80dB at it. #50; timings tEFB: 283s, tPR: 298s, tCP: 368s.]
Overview
• Optimal Transport and Imaging

• Convex Analysis and Proximal Calculus

• Forward Backward

• Douglas Rachford and ADMM

• Generalized Forward-Backward

• Primal-Dual Schemes
Primal-dual Formulation

Fenchel-Rockafellar duality:      A : H → L linear

    min_{x∈H} G1(x) + G2(A(x)) = min_x G1(x) + sup_{u∈L} ⟨Ax, u⟩ − G2⋆(u)

Strong duality:   0 ∈ ri(dom(G2)) − A ri(dom(G1))
(min ↔ max)       = max_u  −G2⋆(u) + min_x G1(x) + ⟨x, A⋆u⟩
                  = max_u  −G2⋆(u) − G1⋆(−A⋆u)

Recovering x⋆ from some u⋆:
    x⋆ = argmin_x  G1(x) + ⟨x, A⋆u⋆⟩
    ⟺   −A⋆u⋆ ∈ ∂G1(x⋆)
    ⟺   x⋆ ∈ (∂G1)^{−1}(−A⋆u⋆) = ∂G1⋆(−A⋆u⋆)
Forward-Backward on the Dual

If G1 is strongly convex:    ∇²G1 ≥ c Id
    G1(t x + (1−t) y) ≤ t G1(x) + (1−t) G1(y) − (c/2) t(1−t) ||x − y||²

    → x⋆ uniquely defined:    x⋆ = ∇G1⋆(−A⋆u⋆)
    → G1⋆ is of class C¹.

FB on the dual:    min_{x∈H} G1(x) + G2(A(x))
               = − min_{u∈L}  G1⋆(−A⋆u) + G2⋆(u)
                              Smooth       Simple

    u^(ℓ+1) = Prox_{τ G2⋆}( u^(ℓ) + τ A ∇G1⋆(−A⋆u^(ℓ)) )
Example: TV Denoising

    min_{f ∈ R^N}  (1/2) ||f − y||² + λ ||∇f||₁      ⟷      min_{||u||_∞ ≤ λ}  ||y + div(u)||²

    ||u||₁ = Σ_i ||u_i||           ||u||_∞ = max_i ||u_i||

Dual solution u⋆   →   Primal solution f⋆ = y + div(u⋆)
                                                     [Chambolle 2004]

FB (aka projected gradient descent):

    u^(ℓ+1) = Proj_{||·||_∞ ≤ λ}( u^(ℓ) + τ ∇( y + div(u^(ℓ)) ) )

    v = Proj_{||·||_∞ ≤ λ}(u)   ⟺   v_i = u_i / max( ||u_i||/λ, 1 )

    Convergence if  τ < 2/||div ∘ ∇|| = 1/4
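A compact sketch of this projected gradient descent on the dual. The discrete gradient/divergence pair below uses one common finite-difference convention, which is an assumption and not necessarily the one used in the slides.

import numpy as np

def grad(f):
    # forward differences with Neumann boundary conditions
    gx = np.vstack([f[1:, :] - f[:-1, :], np.zeros((1, f.shape[1]))])
    gy = np.hstack([f[:, 1:] - f[:, :-1], np.zeros((f.shape[0], 1))])
    return np.stack([gx, gy])

def div(u):
    # minus the adjoint of grad, so that <grad f, u> = -<f, div u>
    ux, uy = u
    dx = np.vstack([ux[:1, :], ux[1:-1, :] - ux[:-2, :], -ux[-2:-1, :]])
    dy = np.hstack([uy[:, :1], uy[:, 1:-1] - uy[:, :-2], -uy[:, -2:-1]])
    return dx + dy

def tv_denoise_dual(y, lam=0.1, tau=0.24, n_iter=300):
    # projected gradient descent on  min_{||u||_inf <= lam} 0.5*||y + div(u)||^2
    u = np.zeros((2,) + y.shape)
    for _ in range(n_iter):
        u = u + tau * grad(y + div(u))                 # gradient step
        amp = np.sqrt((u ** 2).sum(axis=0))
        u = u / np.maximum(amp / lam, 1.0)             # projection on {||u||_inf <= lam}
    return y + div(u)                                  # primal solution f = y + div(u)

rng = np.random.default_rng(0)
y = np.zeros((64, 64)); y[16:48, 16:48] = 1.0
y = y + 0.1 * rng.standard_normal(y.shape)
f = tv_denoise_dual(y, lam=0.2)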
Primal-Dual Algorithm

    min_{x ∈ H}  G1(x) + G2(A(x))
 ⟺  min_x max_z  G1(x) − G2⋆(z) + ⟨A(x), z⟩

    z^(ℓ+1) = Prox_{σ G2⋆}( z^(ℓ) + σ A(x̃^(ℓ)) )
    x^(ℓ+1) = Prox_{τ G1}( x^(ℓ) − τ A⋆(z^(ℓ+1)) )
    x̃^(ℓ+1) = x^(ℓ+1) + θ ( x^(ℓ+1) − x^(ℓ) )

    θ = 0:  Arrow-Hurwicz algorithm.
    θ = 1:  convergence speed on the duality gap.

Theorem: [Chambolle-Pock 2011]
    If 0 ≤ θ ≤ 1 and σ τ ||A||² < 1,  then  x^(ℓ) → x⋆ a minimizer of G1 + G2∘A.
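A minimal sketch of these iterations on a toy instance (assumed: G1 = ½||· − y||², G2 = λ||·||₁ and A a 1-D finite-difference matrix, i.e. 1-D TV denoising); the steps σ, τ are chosen so that στ||A||² < 1.

import numpy as np

def chambolle_pock(A, y, lam=0.5, n_iter=500):
    # min_x 0.5*||x - y||^2 + lam*||A x||_1
    # Prox_{tau G1}(v) = (v + tau*y)/(1 + tau); Prox_{sigma G2*} = projection on the l_inf ball of radius lam
    L = np.linalg.norm(A, 2)               # operator norm ||A||
    sigma = tau = 0.9 / L                  # sigma*tau*||A||^2 = 0.81 < 1
    theta = 1.0
    x = np.zeros(A.shape[1]); x_bar = x.copy()
    z = np.zeros(A.shape[0])
    for _ in range(n_iter):
        z = np.clip(z + sigma * (A @ x_bar), -lam, lam)        # dual prox step
        x_new = (x - tau * (A.T @ z) + tau * y) / (1.0 + tau)  # primal prox step
        x_bar = x_new + theta * (x_new - x)                    # extrapolation
        x = x_new
    return x

n = 50
y = np.concatenate([np.zeros(25), np.ones(25)]) + 0.1 * np.random.default_rng(0).standard_normal(n)
A = np.diff(np.eye(n), axis=0)             # (n-1) x n forward-difference matrix
print(chambolle_pock(A, y, lam=0.3)[:5])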
Example: Optimal Transport

Staggered grid formulation:

    min_{x ∈ R^{G¹st} × R^{G²st}}  J(I(x)) + ι_C(x)

Interpolation operator:   I = (I¹, I²) : R^{G¹st} × R^{G²st} → R^{Gc}

[Figure: staggered grid G¹st × G²st vs. centered grid Gc in the (s, t) plane, with I interpolating between them.]
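A rough sketch of the staggered-to-centered interpolation I, as midpoint averaging along each axis; the exact grid sizes and boundary conventions here are assumptions for illustration.

import numpy as np

def interp_staggered_to_centered(m1, m2):
    # average each staggered component onto the centered grid (midpoint averaging)
    Ic1 = 0.5 * (m1[1:, :] + m1[:-1, :])   # average along axis 0
    Ic2 = 0.5 * (m2[:, 1:] + m2[:, :-1])   # average along axis 1
    return Ic1, Ic2

m1 = np.random.rand(33, 32)   # component stored on faces along axis 0
m2 = np.random.rand(32, 33)   # component stored on faces along axis 1
print(interp_staggered_to_centered(m1, m2)[0].shape)   # (32, 32), i.e. the centered grid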
Conclusion

Inverse problems in imaging:
    Large scale, N ∼ 10^6.
    Non-smooth (sparsity, TV, ...).
    (Sometimes) convex.
    Highly structured (separability, ℓp norms, ...).

Proximal splitting:
    Unravel the structure of the problems.
    Parallelizable.
    Decomposition G = Σ_k Gk.

Open problems:
    Less structured problems without smoothness.
    Non-convex optimization.
Proximal Splitting and Optimal Transport

  • 1. Proximal Splitting and Optimal Transport Gabriel Peyré www.numerical-tours.com
  • 2. Overview • Optimal Transport and Imaging • Convex Analysis and Proximal Calculus • Forward Backward • Douglas Rachford and ADMM • Generalized Forward-Backward • Primal-Dual Schemes
  • 3. ork, Measure Preserving Maps ica- d ofDistributions µ0 , µ1 on Rk . ase. eeds ans- that eme rate ance eval t al. µ0 µ1
  • 4. ork, Measure Preserving Maps ica- d ofDistributions µ0 , µ1 on Rk . ase. eeds Mass preserving map T : Rk Rk . ans- that µ1 = T µ0 where (T µ0 )(A) = µ0 (T (A)) 1 eme rate ance x T (x) eval t al. µ0 µ1
  • 5. ork, Measure Preserving Maps ica- d ofDistributions µ0 , µ1 on Rk . ase. eeds Mass preserving map T : Rk Rk . ans- that µ1 = T µ0 where (T µ0 )(A) = µ0 (T (A)) 1 eme rate ance x T (x) eval t al. µ0 µ1 Distributions with densities: µi = i (x)dx T µ0 = µ1 1 (T (x))|det ⇥T (x)| = 0 (x)
  • 6. Optimal Transport Lp optimal transport: W2 (µ0 , µ1 )p = min ||T (x) x||p µ0 (dx) T µ0 =µ1
  • 7. Optimal Transport Lp optimal transport: W2 (µ0 , µ1 )p = min ||T (x) x||p µ0 (dx) T µ0 =µ1 Regularity condition: µ0 or µ1 does not give mass to “small sets”. Theorem (p > 1): there exists a unique optimal T . T T µ1 µ0
  • 8. Optimal Transport Lp optimal transport: W2 (µ0 , µ1 )p = min ||T (x) x||p µ0 (dx) T µ0 =µ1 Regularity condition: µ0 or µ1 does not give mass to “small sets”. Theorem (p > 1): there exists a unique optimal T . Theorem (p = 2): T is defined as T = with convex. T T T (x) T (x ) T is monotone: µ1 x T (x) T (x ), x x 0 µ0 x
  • 9. Wasserstein Distance µ Couplings: µ, x A Rd , ⇥(A Rd ) = µ(A) y B Rd , ⇥(Rd B) = (B)
  • 10. Wasserstein Distance µ Couplings: µ, x A Rd , ⇥(A Rd ) = µ(A) y B Rd , ⇥(Rd B) = (B) Transportation cost: Wp (µ, )p = min c(x, y)d⇥(x, y) µ, Rd Rd
  • 11. Wasserstein Distance µ Couplings: µ, x A Rd , ⇥(A Rd ) = µ(A) y B Rd , ⇥(Rd B) = (B) Transportation cost: Wp (µ, )p = min c(x, y)d⇥(x, y) µ, Rd Rd
  • 12. Optimal Transport Let p > 1 and µ does not vanish on small sets. Unique µ, s.t. Wp (µ, )p = c(x, y)d⇥(x, y) Rd Rd Optimal transport T : Rd Rd : µ x y (x, T (x))
  • 13. Optimal Transport Let p > 1 and µ does not vanish on small sets. Unique µ, s.t. Wp (µ, )p = c(x, y)d⇥(x, y) Rd Rd Optimal transport T : Rd Rd : µ x p = 2: T = unique solution of y ⇥ is convex l.s.c. (x, T (x)) ( ⇥)⇤µ =
  • 14. 1-D Continuous Wasserstein Distributions µ, on R. t Cumulative functions: Cµ (t) = dµ(x) For all p > 1: T =C 1 Cµ T is non-decreasing (“change of contrast”)
  • 15. 1-D Continuous Wasserstein Distributions µ, on R. t Cumulative functions: Cµ (t) = dµ(x) For all p > 1: T =C 1 Cµ T is non-decreasing (“change of contrast”) Explicit formulas: 1 H Wp (µ, )p = |Cµ 1 C 1 p | 0 W1 (µ, ) = |Cµ C | = ||(Cµ C ) ⇥ H||1 R
  • 16. Grayscale Histogram Transfer f1 Input images: fi : [0, 1] 2 [0, 1], i = 0, 1. f0
  • 17. Grayscale Histogram Transfer f1 Input images: fi : [0, 1] 2 [0, 1], i = 0, 1. Gray-value distributions: µi defined on [0, 1]. µi ([a, b]) = 1{a f b} (x)dx [0,1]2 µ1 f0 µ0
  • 18. Grayscale Histogram Transfer f1 Input images: fi : [0, 1] 2 [0, 1], i = 0, 1. Gray-value distributions: µi defined on [0, 1]. µi ([a, b]) = 1{a f b} (x)dx [0,1]2 Optimal transport: T = Cµ11 Cµ0 . µ1 f0 Cµ0 (f0 ) T (f0 ) Cµ0 Cµ11 µ0 µ1
  • 19. pplication to Color Transfer Color Histogram Equalization 1 Input color images: fi RN 3 . projection iof= to style Sliced Wasserstein ⇥ X N x fi (x) image color statistics Y Optimal transport framework Sliced Wasserstein projection Applications Application to Color Transfer Source image (X ) f1 f0 Sliced Wasserstein project image color statistics Y f0 Source image after color transfer µ1 image (Y ) Style Source image (X ) µ0 J. Rabin Wasserstein Regularization
  • 20. pplication to Color Transfer Color Histogram Equalization 1 Input color images: fi RN 3 . projection iof= to style Sliced Wasserstein ⇥ X N x fi (x) image color statistics Y Optimal assignement: min ||f0 f1 ⇥ || N Optimal transport framework Sliced Wasserstein projection Applications Application to Color Transfer Source image (X ) f1 f0 Sliced Wasserstein project image color statistics Y f0 Source image after color transfer µ1 image (Y ) Style Source image (X ) µ0 J. Rabin Wasserstein Regularization
  • 21. pplication to Color Transfer Color Histogram Equalization 1 Input color images: fi RN 3 . projection iof= to style Sliced Wasserstein ⇥ X N x fi (x) image color statistics Y Optimal assignement: min ||f0 f1 ⇥ || N Transport: T : f0 (x) R3 f1 ( (i)) R3 Optimal transport framework Sliced Wasserstein projection Applications Application to Color Transfer Source image (X ) f1 f0 Sliced Wasserstein project image color statistics Y f0 Source image after color transfer µ1 image (Y ) Style Source image (X ) µ0 T J. Rabin Wasserstein Regularization
  • 22. pplication to Color Transfer Color Histogram Equalization 1 Input color images: fi RN 3 . projection iof= to style Sliced Wasserstein ⇥ X N x fi (x) image color statistics Y Optimal assignement: min ||f0 f1 ⇥ || N Optimal transport framework Sliced Wasserstein projection Applications Transport: T : f0 (x) R3Application to Color Transfer R3 f1 ( (i)) Optimal transport framework Sliced Wasserstein projection Applications ˜ Application to ColorfTransfer Equalization:) f0 = T (f0 ) ˜ = f1 0 Sliced Wasserstein projection of X to sty Source image (X image color statistics Y f1 f0 T (f0 ) Sliced Wasserstein project image color statistics Y Source image (X ) T f0 Source image after color transfer µ1 image (Y ) Style Source image (X ) µ0 Source image after color transfer µ1 Style image (Y ) T J. Rabin Wasserstein Regularization J. Rabin Wasserstein Regularization
  • 23. cðdvÞ ¼ l0> þ dvÞ detðrðv þ dvÞÞ À l1 ¼ 0: ðv can be thought as an elliptic system thought as anThe sys-system of equations. The trilinearRelaxation was performed for transferring v cc > tem cv cc can be of equations. elliptic a sys- the GPU. We used cubic grid. interpolation used a trilineara parallelizable four- the GPU. We operator using interpolation operator for transferring Image Registration Ittem isto verify that a correction for dv can be obtained by solving with an is easy solved using preconditioned conjugate gradient color Gauss-Seidel relaxation scheme. Thisrestriction s solved using preconditioned conjugate À1 gradient with an the coarse grid residual increases robustness the coarse grid correction to fine grids. Thecorrection to fine grids. The residual restriction the system dv % c> ðcv c> Þ cðvÞ (Nocedal and Wright, 1999) The sys- and efficiency and is especially suited for the implementation on incomplete Cholesky preconditioner. mplete Cholesky preconditioner. v v operator for projecting residual from for projecting residual from the fine to coarse grids is operator the fine to coarse grids is tem c c> can be thought as an elliptic system of equations. The sys- v c the GPU. We used a trilinear interpolation operator for transferring tem is solved using preconditioned conjugate gradient with an the coarse grid correction to fine grids. The residual restriction incomplete Cholesky preconditioner. operator for projecting residual from the fine to coarse grids is T [ur Rehman et al, 2009] Fig. 6. OMT Results viewed on an axial slice. The top row shows corresponding slices from Pre-op(Left) and Post-op(Right) MRI data. The deformation is clearly visible in the anterior part of the brain.
  • 24. Convex Formulation (Benamou-Brenier) ⇢ ⇢ : Rd ⇥ [0, 1] ! R+ solving: Find m : Rd ⇥ [0, 1] ! Rd W (µ0 , µ1 )2 = min J(x) + ◆C (x) x=(m,⇢)
  • 25. Convex Formulation (Benamou-Brenier) ⇢ ⇢ : Rd ⇥ [0, 1] ! R+ solving: Find m : Rd ⇥ [0, 1] ! Rd W (µ0 , µ1 )2 = min J(x) + ◆C (x) x=(m,⇢) Z Z 1 J(x) = j(x(s, t))dtds s2Rd t=0 8 ||m||2 < ˜˜ ⇢ if ⇢ > 0, ˜ j(m, ⇢) = ˜ ˜ : 0 if ⇢ = 0 and m = 0, ˜ ˜ +1 otherwise. 2 R 2 R2
  • 26. Convex Formulation (Benamou-Brenier) ⇢ ⇢ : Rd ⇥ [0, 1] ! R+ solving: Find m : Rd ⇥ [0, 1] ! Rd W (µ0 , µ1 )2 = min J(x) + ◆C (x) x=(m,⇢) Z Z 1 J(x) = j(x(s, t))dtds s2Rd t=0 8 ||m||2 < ˜˜ ⇢ if ⇢ > 0, ˜ j(m, ⇢) = ˜ ˜ : 0 if ⇢ = 0 and m = 0, ˜ ˜ +1 otherwise. 2 R 2 R2 C = {x = (m, ⇢) div(x) = 0, B(⇢) = (⇢0 , ⇢1 )} B(⇢) = (⇢(0, ·), ⇢(1, ·))
  • 28. Numerical Examples ⇢0 ⇢1 con- work, plica- ad of ease. peeds rans- t that heme erate mance ieval et al. t Figure 7: Synthetic 2D examples on a Euclidean domain. The
  • 29. Discrete Formulation s Centered grid formulation (d = 1): min J(x) + ◆C (x) x2RGc ⇥2 P J(x) = i2Gc j(xi ) t Centered grid Gc
  • 30. Discrete Formulation s Centered grid formulation (d = 1): min J(x) + ◆C (x) x2RGc ⇥2 P J(x) = i2Gc j(xi ) t Staggered grid formulation : Centered grid Gc min 2 J(I(x)) + ◆C (x) s 1 x2RGst ⇥RGst t Staggered grid 1 2 Gst Gst
  • 31. Discrete Formulation s Centered grid formulation (d = 1): min J(x) + ◆C (x) x2RGc ⇥2 P J(x) = i2Gc j(xi ) t Staggered grid formulation : Centered grid Gc min 2 J(I(x)) + ◆C (x) s 1 x2RGst ⇥RGst Interpolation operator: 1 2 Gst Gst 1 2 I = (I , I ) : R ⇥R ! RG c t 2I1 (m)i,j = mi+ 1 ,j + mi 2 1 2 ,j Staggered grid ! Projection on div(x) = 0 using FFTs. 1 2 Gst Gst
  • 32. SOCP Formulation P min J(x) + ◆C (x) J(x) = i2Gc j(xi ) x2RGc ⇥d X () min ri s.t. 8 i 2 Gc , (mi , ⇢i , ri ) 2 K x2RGc ⇥d ,r2RGc i (Rotated) Lorentz cone: K = (m, ⇢, r) 2 Rd+2 ||m||2 6 ⇢r ˜ ˜ ˜ ˜ ˜˜
  • 33. SOCP Formulation P min J(x) + ◆C (x) J(x) = i2Gc j(xi ) x2RGc ⇥d X () min ri s.t. 8 i 2 Gc , (mi , ⇢i , ri ) 2 K x2RGc ⇥d ,r2RGc i (Rotated) Lorentz cone: K = (m, ⇢, r) 2 Rd+2 ||m||2 6 ⇢r ˜ ˜ ˜ ˜ ˜˜ Second order cone program: ! Use interior point methods (e.g. MOSEK software). Linear convergence with iteration #. Poor scaling with dimension |Gc |. E cient for medium scale problems (N ⇠ 104 ).
  • 34. 1 Example: Regularization Inverse problem: measurements y = x0 + w x0 y
  • 35. 1 Example: Regularization Inverse problem: measurements y = x0 + w x0 y x? argmin Regularized inversion: x? 2 argmin 1 ||y 2 x||2 + R(x) x2R N Data fidelity Regularity
  • 36. 1 Example: Regularization Inverse problem: measurements y = x0 + w x0 y x? argmin Regularized inversion: x? 2 argmin 1 ||y 2 x||2 + R(x) x2R N Data fidelity Regularity P Total Variation: R(x) = i ||(rx)i ||
  • 37. 1 Example: Regularization Inverse problem: measurements y = x0 + w x0 y x? argmin Regularized inversion: x? 2 argmin 1 ||y 2 x||2 + R(x) x2R N Data fidelity Regularity P Total Variation: R(x) = i ||(rx)i || 1 P ⇤ ` sparsity: R(x) = i |xi | Images are sparse in wavelet bases. ⇤ Image f = x Coe↵. x = f
  • 38. Overview • Optimal Transport and Imaging • Convex Analysis and Proximal Calculus • Forward Backward • Douglas Rachford and ADMM • Generalized Forward-Backward • Primal-Dual Schemes
  • 39. Convex Optimization Setting: G : H R ⇤ {+⇥} H: Hilbert space. Here: H = RN . Problem: min G(x) x H
  • 40. Convex Optimization Setting: G : H R ⇤ {+⇥} H: Hilbert space. Here: H = RN . Problem: min G(x) x H Class of functions: x y Convex: G(tx + (1 t)y) tG(x) + (1 t)G(y) t [0, 1]
  • 41. Convex Optimization Setting: G : H R ⇤ {+⇥} H: Hilbert space. Here: H = RN . Problem: min G(x) x H Class of functions: x y Convex: G(tx + (1 t)y) tG(x) + (1 t)G(y) t [0, 1] Lower semi-continuous: lim inf G(x) G(x0 ) x x0 Proper: {x ⇥ H G(x) ⇤= + } = ⌅ ⇤
  • 42. Convex Optimization Setting: G : H R ⇤ {+⇥} H: Hilbert space. Here: H = RN . Problem: min G(x) x H Class of functions: x y Convex: G(tx + (1 t)y) tG(x) + (1 t)G(y) t [0, 1] Lower semi-continuous: lim inf G(x) G(x0 ) x x0 Proper: {x ⇥ H G(x) ⇤= + } = ⌅ ⇤ 0 if x ⇥ C, Indicator: C (x) = + otherwise. (C closed and convex)
  • 43. Sub-differential Sub-di erential: G(x) = {u ⇥ H ⇤ z, G(z) G(x) + ⌅u, z x⇧} G(x) = |x| G(0) = [ 1, 1]
  • 44. Sub-differential Sub-di erential: G(x) = {u ⇥ H ⇤ z, G(z) G(x) + ⌅u, z x⇧} G(x) = |x| Smooth functions: If F is C 1 , F (x) = { F (x)} G(0) = [ 1, 1]
  • 45. Sub-differential Sub-di erential: G(x) = {u ⇥ H ⇤ z, G(z) G(x) + ⌅u, z x⇧} G(x) = |x| Smooth functions: If F is C 1 , F (x) = { F (x)} G(0) = [ 1, 1] First-order conditions: x argmin G(x) 0 G(x ) x H
  • 46. Sub-differential Sub-di erential: G(x) = {u ⇥ H ⇤ z, G(z) G(x) + ⌅u, z x⇧} G(x) = |x| Smooth functions: If F is C 1 , F (x) = { F (x)} G(0) = [ 1, 1] First-order conditions: x argmin G(x) 0 G(x ) x H U (x) x Monotone operator: U (x) = G(x) (u, v) U (x) U (y), y x, v u 0
  • 47. Prox and Subdifferential 1 Prox G (x) = argmin ||x z||2 + G(z) z 2
  • 48. Prox and Subdifferential 1 Prox G (x) = argmin ||x z||2 + G(z) z 2 Resolvant of G: z = Prox G (x) 0 z x + ⇥G(z) x (Id + ⇥G)(z)
  • 49. Prox and Subdifferential 1 Prox G (x) = argmin ||x z||2 + G(z) z 2 Resolvant of G: z = Prox G (x) 0 z x + ⇥G(z) x (Id + ⇥G)(z) z = (Id + ⇥G) 1 (x) Inverse of a set-valued mapping: where x U (y) y U 1 (x) Prox G = (Id + ⇥G) 1 is a single-valued mapping
  • 50. Prox and Subdifferential 1 Prox G (x) = argmin ||x z||2 + G(z) z 2 Resolvant of G: z = Prox G (x) 0 z x + ⇥G(z) x (Id + ⇥G)(z) z = (Id + ⇥G) 1 (x) Inverse of a set-valued mapping: where x U (y) y U 1 (x) Prox G = (Id + ⇥G) 1 is a single-valued mapping Fix point: x argmin G(x) x 0 G(x ) x (Id + ⇥G)(x ) x⇥ = (Id + ⇥G) 1 (x⇥ ) = Prox G (x⇥ )
  • 51. Proximal Calculus Separability: G(x) = G1 (x1 ) + . . . + Gn (xn ) ProxG (x) = (ProxG1 (x1 ), . . . , ProxGn (xn ))
  • 52. Proximal Calculus Separability: G(x) = G1 (x1 ) + . . . + Gn (xn ) ProxG (x) = (ProxG1 (x1 ), . . . , ProxGn (xn )) 1 Quadratic functionals: G(x) = || x y||2 2 Prox G = (Id + ) 1 = (Id + ) 1
  • 53. Proximal Calculus Separability: G(x) = G1 (x1 ) + . . . + Gn (xn ) ProxG (x) = (ProxG1 (x1 ), . . . , ProxGn (xn )) 1 Quadratic functionals: G(x) = || x y||2 2 Prox G = (Id + ) 1 = (Id + ) 1 Composition by tight frame: A A = Id ProxG A (x) =A ProxG A + Id A A
  • 54. Proximal Calculus Separability: G(x) = G1 (x1 ) + . . . + Gn (xn ) ProxG (x) = (ProxG1 (x1 ), . . . , ProxGn (xn )) 1 Quadratic functionals: G(x) = || x y||2 2 Prox G = (Id + ) 1 = (Id + ) 1 Composition by tight frame: A A = Id ProxG A (x) =A ProxG A + Id A A x Indicators: G(x) = C (x) C Prox G (x) = ProjC (x) ProjC (x) = argmin ||x z|| z C
  • 55. Prox of Sparse Regularizers 1 Prox G (x) = argmin ||x z||2 + G(z) z 2
  • 56. Prox of Sparse Regularizers 1 Prox G (x) = argmin ||x z||2 + G(z) z 2 G(x) = ||x||1 = |xi | 12 log(1 + x2 ) i 10 |x| ||x||0 8 6 4 2 G(x) = ||x||0 = | {i xi = 0} | 0 −2 G(x) −10 −8 −6 −4 −2 0 2 4 6 8 10 G(x) = log(1 + |xi |2 ) i
  • 57. Prox of Sparse Regularizers 1 Prox G (x) = argmin ||x z||2 + G(z) z 2 G(x) = ||x||1 = |xi | 12 log(1 + x2 ) i 10 |x| ||x||0 Prox G (x)i = max 0, 1 xi 8 |xi | 6 4 2 G(x) = ||x||0 = | {i xi = 0} | 0 −2 G(x) xi if |xi | 2 , −10 −8 −6 −4 −2 0 2 4 6 8 10 Prox G (x)i = 10 0 otherwise. 8 6 4 2 G(x) = log(1 + |xi |2 ) −2 0 i −4 3rd order polynomial root. −6 −8 ProxG (x) −10 −10 −8 −6 −4 −2 0 2 4 6 8 10
  • 58. Legendre-Fenchel Duality Legendre-Fenchel transform: G (u) = sup u, x G(x) eu x dom(G) G(x) S lop G (u) x
  • 59. Legendre-Fenchel Duality Legendre-Fenchel transform: G (u) = sup u, x G(x) eu x dom(G) G(x) S lop G (u) Example: quadratic functional 1 x G(x) = Ax, x + x, b 2 1 G (u) = u b, A 1 (u b) 2
  • 60. Legendre-Fenchel Duality Legendre-Fenchel transform: G (u) = sup u, x G(x) eu x dom(G) G(x) S lop G (u) Example: quadratic functional 1 x G(x) = Ax, x + x, b 2 1 G (u) = u b, A 1 (u b) 2 Moreau’s identity: Prox G (x) = x ProxG/ (x/ ) G simple G simple
  • 61. Indicator and Homogeneous Functionals Positively 1-homogeneous functional: G( x) = | |G(x) Example: norm G(x) = ||x|| Duality: G (x) = G (·) 1 (x) G (y) = min x, y G(x) 1
  • 62. Indicator and Homogeneous Functionals Positively 1-homogeneous functional: G( x) = | |G(x) Example: norm G(x) = ||x|| Duality: G (x) = G (·) 1 (x) G (y) = min x, y G(x) 1 p norms: G(x) = ||x||p 1 1 + =1 1 p, q + G (x) = ||x||q p q
  • 63. Indicator and Homogeneous Functionals Positively 1-homogeneous functional: G( x) = | |G(x) Example: norm G(x) = ||x|| Duality: G (x) = G (·) 1 (x) G (y) = min x, y G(x) 1 p norms: G(x) = ||x||p 1 1 + =1 1 p, q + G (x) = ||x||q p q Example: Proximal operator of norm Prox ||·|| = Id Proj||·||1 Proj||·||1 (x)i = max 0, 1 xi |xi | for a well-chosen ⇥ = ⇥ (x, )
  • 64. Prox of the J Functional X ||m||2 ˜ J(m, ⇢) = j(mi , ⇢i ) j(m, ⇢) = ˜ ˜ for ⇢ > 0 ˜ i ⇢˜
  • 65. Prox of the J Functional X ||m||2 ˜ J(m, ⇢) = j(mi , ⇢i ) j(m, ⇢) = ˜ ˜ for ⇢ > 0 ˜ i ⇢˜ Prox J (m, ⇢) = (Prox j (mi , ⇢i ))i
  • 66. Prox of the J Functional X ||m||2 ˜ J(m, ⇢) = j(mi , ⇢i ) j(m, ⇢) = ˜ ˜ for ⇢ > 0 ˜ i ⇢˜ Prox J (m, ⇢) = (Prox j (mi , ⇢i ))i j ⇤ = ◆C where C = (a, b) 2 R2 ⇥ R 2||a||2 + b 6 0 Prox j (˜) = x x ˜ ProjC (˜/ ) x where x = (m, ⇢) ˜ ˜ ˜
  • 67. Prox of the J Functional X ||m||2 ˜ J(m, ⇢) = j(mi , ⇢i ) j(m, ⇢) = ˜ ˜ for ⇢ > 0 ˜ i ⇢˜ Prox J (m, ⇢) = (Prox j (mi , ⇢i ))i j ⇤ = ◆C where C = (a, b) 2 R2 ⇥ R 2||a||2 + b 6 0 Prox j (˜) = x x ˜ ProjC (˜/ ) x where x = (m, ⇢) ˜ ˜ ˜ ⇢ (m? , ⇢? ) if ⇢? > 0 Proposition: Prox (m, ⇢) = ˜ ˜ (0, 0) otherwise. ⇢? m ˜ ? where m = ? and ⇢? is the largest root of ⇢ +2 X 3 + (4 ⇢)X 2 + 4 ( ˜ ⇢)X ˜ ||m||2 ˜ 4 2 ⇢=0 ˜
  • 68. Overview • Optimal Transport and Imaging • Convex Analysis and Proximal Calculus • Forward Backward • Douglas Rachford and ADMM • Generalized Forward-Backward • Primal-Dual Schemes
  • 69. Gradient and Proximal Descents Gradient descent: x( +1) = x( ) G(x( ) ) [explicit] G is C 1 and G is L-Lipschitz Theorem: If 0 < < 2/L, x( ) x a solution.
  • 70. Gradient and Proximal Descents Gradient descent: x( +1) = x( ) G(x( ) ) [explicit] G is C 1 and G is L-Lipschitz Theorem: If 0 < < 2/L, x( ) x a solution. Sub-gradient descent: x( +1) = x( ) v( ) , v( ) G(x( ) ) Theorem: If 1/⇥, x( ) x a solution. Problem: slow.
  • 71. Gradient and Proximal Descents Gradient descent: x( +1) = x( ) G(x( ) ) [explicit] G is C 1 and G is L-Lipschitz Theorem: If 0 < < 2/L, x( ) x a solution. Sub-gradient descent: x( +1) = x( ) v( ) , v( ) G(x( ) ) Theorem: If 1/⇥, x( ) x a solution. Problem: slow. Proximal-point algorithm: x(⇥+1) = Prox G (x(⇥) ) [implicit] Theorem: If c > 0, x( ) x a solution. Prox G hard to compute. [Rockafellar, 70]
  • 72. Proximal Splitting Methods Solve min E(x) x H Problem: Prox E is not available.
  • 73. Proximal Splitting Methods Solve min E(x) x H Problem: Prox E is not available. Splitting: E(x) = F (x) + Gi (x) i Smooth Simple
  • 74. Proximal Splitting Methods Solve min E(x) x H Problem: Prox E is not available. Splitting: E(x) = F (x) + Gi (x) i Smooth Simple F (x) Iterative algorithms using: Prox Gi (x) solves Forward-Backward: F + G Douglas-Rachford: Gi Primal-Dual: Gi A Generalized FB: F+ Gi
  • 75. Smooth + Simple Splitting Inverse problem: measurements y = Kf0 + w f0 Kf0 K K : RN RP , P N Model: f0 = x0 sparse in dictionary . Sparse recovery: f = x where x solves min F (x) + G(x) x RN Smooth Simple 1 Data fidelity: F (x) = ||y x||2 =K ⇥ 2 Regularization: G(x) = ||x||1 = |xi | i
  • 76. Forward-Backward Fix point equation: x argmin F (x) + G(x) 0 F (x ) + G(x ) x (x F (x )) x + ⇥G(x ) x⇥ = Prox G (x⇥ F (x⇥ ))
  • 77. Forward-Backward Fix point equation: x argmin F (x) + G(x) 0 F (x ) + G(x ) x (x F (x )) x + ⇥G(x ) x⇥ = Prox G (x⇥ F (x⇥ )) Forward-backward: x(⇥+1) = Prox G x(⇥) F (x(⇥) )
  • 78. Forward-Backward Fix point equation: x argmin F (x) + G(x) 0 F (x ) + G(x ) x (x F (x )) x + ⇥G(x ) x⇥ = Prox G (x⇥ F (x⇥ )) Forward-backward: x(⇥+1) = Prox G x(⇥) F (x(⇥) ) Projected gradient descent: G= C
  • 79. Forward-Backward Fix point equation: x argmin F (x) + G(x) 0 F (x ) + G(x ) x (x F (x )) x + ⇥G(x ) x⇥ = Prox G (x⇥ F (x⇥ )) Forward-backward: x(⇥+1) = Prox G x(⇥) F (x(⇥) ) Projected gradient descent: G= C Theorem: Let F be L-Lipschitz. If < 2/L, x( ) x a solution of ( ) [Passty 79, Gabay, 83]
• 80. Example: L1 Regularization — min_x ½||Φx − y||² + λ||x||₁, i.e. min_x F(x) + G(x) with F(x) = ½||Φx − y||², ∇F(x) = Φ*(Φx − y), L = ||Φ*Φ||, and G(x) = λ||x||₁, Prox_{τG}(x)_i = max(0, 1 − τλ/|x_i|) x_i. Forward-backward ⟺ iterative soft thresholding.
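As an illustration (not part of the slides), a minimal ISTA implementation of this forward-backward scheme for ℓ1-regularized least squares; Phi, y, lam are generic placeholders.

```python
import numpy as np

def ista(Phi, y, lam, n_iter=200):
    """Forward-backward (ISTA) for min_x 1/2 ||Phi x - y||^2 + lam * ||x||_1."""
    L = np.linalg.norm(Phi, 2) ** 2              # Lipschitz constant of the gradient
    tau = 1.0 / L                                # step size, tau < 2/L
    x = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        x = x - tau * (Phi.T @ (Phi @ x - y))    # forward (explicit) step on F
        x = np.sign(x) * np.maximum(np.abs(x) - tau * lam, 0)  # backward step: prox of tau*lam*||.||_1
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Phi = rng.standard_normal((50, 200))
    x0 = np.zeros(200); x0[rng.choice(200, 5, replace=False)] = 1.0
    y = Phi @ x0
    print(np.round(ista(Phi, y, lam=0.05), 2).nonzero()[0])
```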
• 81. Convergence Speed — min_x E(x) = F(x) + G(x), with ∇F L-Lipschitz and G simple. Theorem: if L > 0, the FB iterates x^(ℓ) satisfy E(x^(ℓ)) − E(x*) ≤ C/ℓ, where the constant C degrades with L.
• 82. Multi-steps Accelerations — Beck-Teboulle accelerated FB (FISTA): t^(0) = 1; x^(ℓ+1) = Prox_{G/L}(y^(ℓ) − (1/L)∇F(y^(ℓ))); t^(ℓ+1) = (1 + √(1 + 4(t^(ℓ))²))/2; y^(ℓ+1) = x^(ℓ+1) + ((t^(ℓ) − 1)/t^(ℓ+1))(x^(ℓ+1) − x^(ℓ)) (see also Nesterov's method). Theorem: if L > 0, E(x^(ℓ)) − E(x*) ≤ C/ℓ². Complexity theory: optimal in a worst-case sense.
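A sketch of the Beck-Teboulle recursion above, applied to the same ℓ1 problem as before (my own fista helper; the step 1/L and the extrapolation weights follow the formulas on the slide).

```python
import numpy as np

def fista(Phi, y, lam, n_iter=200):
    """Accelerated forward-backward (FISTA) for 1/2 ||Phi x - y||^2 + lam * ||x||_1."""
    L = np.linalg.norm(Phi, 2) ** 2
    x = z = np.zeros(Phi.shape[1])
    t = 1.0
    for _ in range(n_iter):
        x_prev = x
        x = z - (Phi.T @ (Phi @ z - y)) / L                      # gradient step at the extrapolated point
        x = np.sign(x) * np.maximum(np.abs(x) - lam / L, 0)      # prox_{G/L}
        t_next = (1 + np.sqrt(1 + 4 * t ** 2)) / 2
        z = x + (t - 1) / t_next * (x - x_prev)                  # extrapolation step
        t = t_next
    return x
```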
  • 83. Overview • Optimal Transport and Imaging • Convex Analysis and Proximal Calculus • Forward Backward • Douglas Rachford and ADMM • Generalized Forward-Backward • Primal-Dual Schemes
• 84. Douglas Rachford Scheme — min_x G1(x) + G2(x) (★), with G1 and G2 both simple. Douglas-Rachford iterations: z^(ℓ+1) = (1 − α/2) z^(ℓ) + (α/2) RProx_{γG2}(RProx_{γG1}(z^(ℓ))), x^(ℓ+1) = Prox_{γG1}(z^(ℓ+1)), where the reflexive prox is RProx_{γG}(x) = 2 Prox_{γG}(x) − x.
• 85. Douglas Rachford Scheme — Same iterations: z^(ℓ+1) = (1 − α/2) z^(ℓ) + (α/2) RProx_{γG2}(RProx_{γG1}(z^(ℓ))), x^(ℓ+1) = Prox_{γG1}(z^(ℓ+1)). Theorem: if 0 < α < 2 and γ > 0, x^(ℓ) → x*, a solution of (★). [Lions, Mercier, 79]
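A generic sketch of these DR iterations with user-supplied proximal maps; following the fixed-point derivation on the next slides, the primal iterate is read off as x^(ℓ) = Prox_{γG1}(z^(ℓ)). The callable interface and names are my own.

```python
import numpy as np

def douglas_rachford(prox_g1, prox_g2, z0, alpha=1.0, n_iter=500):
    """Douglas-Rachford for min_x G1(x) + G2(x), both simple.

    prox_g1, prox_g2: callables computing Prox_{gamma*Gi} (gamma absorbed in the callables).
    Returns x = Prox_{gamma*G1}(z) after n_iter iterations.
    """
    rprox = lambda prox, v: 2 * prox(v) - v          # reflexive prox
    z = z0.copy()
    for _ in range(n_iter):
        z = (1 - alpha / 2) * z + (alpha / 2) * rprox(prox_g2, rprox(prox_g1, z))
    return prox_g1(z)
```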
• 86. DR Fix Point Equation — min_x G1(x) + G2(x) ⟺ 0 ∈ ∂(G1 + G2)(x) ⟺ ∃z, z − x ∈ γ∂G1(x) and x − z ∈ γ∂G2(x) ⟺ x = Prox_{γG1}(z) and (2x − z) − x ∈ γ∂G2(x).
• 87. DR Fix Point Equation — Continuing: x = Prox_{γG1}(z) and (2x − z) − x ∈ γ∂G2(x) ⟺ x = Prox_{γG2}(2x − z) = Prox_{γG2}(RProx_{γG1}(z)) ⟺ z = 2 Prox_{γG2}(RProx_{γG1}(z)) − (2x − z) ⟺ z = 2 Prox_{γG2}(RProx_{γG1}(z)) − RProx_{γG1}(z) ⟺ z = RProx_{γG2}(RProx_{γG1}(z)) ⟺ z = ½ z + ½ RProx_{γG2}(RProx_{γG1}(z)).
• 88. Example: Optimal Transport on Centered Grid — min_{x∈R^(Gc×2)} J(x) + ι_C(x), with C = {x = (m, ρ) : Ax = b}, b = (0, ρ0, ρ1), A(x) = (div(x), ρ|_{t=0}, ρ|_{t=1}). [figures: input densities I0, I1; centered space-time grid Gc]
• 89. Example: Optimal Transport on Centered Grid — Same problem. Prox_{γJ}: closed form via the root of a cubic (slide 67).
• 90. Example: Optimal Transport on Centered Grid — Prox_{γJ}: closed form (cubic root). Prox_{ιC} = Proj_C: Proj_C(x) = (Id − A*Δ⁻¹A)x + A*Δ⁻¹b with Δ = AA*, i.e. solving a Poisson equation with boundary conditions.
• 91. Example: Optimal Transport on Centered Grid — Proposition: DR with α = 1 is the ALG2 algorithm of [Benamou, Brenier 2000].
• 92. Example: Optimal Transport on Centered Grid — → Advantage of DR: free relaxation parameter α ∈ ]0, 2[.
• 93. Example: Constrained L1 — min_{Φx=y} ||x||₁ ⟺ min_x G1(x) + G2(x) with G1(x) = ι_C(x), C = {x : Φx = y}, Prox_{γG1}(x) = Proj_C(x) = x + Φ*(ΦΦ*)⁻¹(y − Φx), and G2(x) = ||x||₁, Prox_{γG2}(x)_i = max(0, 1 − γ/|x_i|) x_i. Efficient if ΦΦ* is easy to invert.
• 94. Example: Constrained L1 — Same splitting. Example: compressed sensing, Φ ∈ R^(100×400) Gaussian matrix, y = Φx₀ with ||x₀||₀ = 17. [plot: log10(||x^(ℓ)||₁ − ||x*||₁) vs. iteration ℓ, for γ = 0.01, 1, 10]
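A sketch of this compressed-sensing example with the DR iterations above (α = 1); the 100×400 Gaussian matrix and the 17-sparse x₀ follow the slide, while the remaining choices (seed, γ, iteration count) are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
P, N = 100, 400
Phi = rng.standard_normal((P, N))
x0 = np.zeros(N); x0[rng.choice(N, 17, replace=False)] = rng.standard_normal(17)
y = Phi @ x0

pinv = Phi.T @ np.linalg.inv(Phi @ Phi.T)            # Phi^* (Phi Phi^*)^{-1}, factored once
proj_C = lambda x: x + pinv @ (y - Phi @ x)          # Prox of i_C, C = {x : Phi x = y}
soft = lambda x, g: np.sign(x) * np.maximum(np.abs(x) - g, 0)  # Prox of g*||.||_1

gamma, z = 1.0, np.zeros(N)
for _ in range(300):
    r1 = 2 * proj_C(z) - z                           # RProx_{gamma G1}(z), G1 = i_C
    r2 = 2 * soft(r1, gamma) - r1                    # RProx_{gamma G2}(r1), G2 = ||.||_1
    z = 0.5 * z + 0.5 * r2                           # DR update with alpha = 1
x = proj_C(z)
print(np.linalg.norm(x - x0))                        # should be small: exact recovery regime
```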
• 95. Auxiliary Variables with DR — min_x G1(x) + G2(A(x)), with A : H → E linear and G1, G2 simple. Rewrite as min_{z∈H×E} G(z) + ι_C(z), with G(x, y) = G1(x) + G2(y) and C = {(x, y) ∈ H × E : Ax = y}.
• 96. Auxiliary Variables with DR — Prox_{γG}(x, y) = (Prox_{γG1}(x), Prox_{γG2}(y)); Prox_{γιC}(x, y) = Proj_C(x, y) = (x̃, Ax̃) = (x + A*ỹ, y − ỹ), where x̃ = (Id + A*A)⁻¹(A*y + x) and ỹ = (Id + AA*)⁻¹(y − Ax). Efficient if Id + AA* or Id + A*A is easy to invert.
• 97. Example: TV Regularization — min_f ½||Kf − y||² + λ||∇f||₁, where ||u||₁ = Σ_i ||u_i||. Auxiliary variable u = ∇f: G1(u) = λ||u||₁ with Prox_{γG1}(u)_i = max(0, 1 − γλ/||u_i||) u_i; G2(f) = ½||Kf − y||² with Prox_{γG2}(f) = (Id + γK*K)⁻¹(f + γK*y); C = {(f, u) ∈ R^N × R^(N×2) : u = ∇f} with Proj_C(f, u) = (f̃, ∇f̃).
• 98. Example: TV Regularization — Same splitting; f̃ is obtained by solving (Id − Δ)f̃ = f − div(u): O(N log N) operations using the FFT.
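A sketch of this projection step only, assuming periodic boundary conditions (an assumption; the slide does not specify the discretization): the linear system (Id − Δ)f̃ = f − div(u) is diagonalized by the 2-D FFT.

```python
import numpy as np

def grad(f):
    # Forward-difference gradient, periodic boundary conditions.
    return np.stack([np.roll(f, -1, axis=0) - f, np.roll(f, -1, axis=1) - f])

def div(u):
    # Divergence = -grad^* (periodic boundary conditions).
    return (u[0] - np.roll(u[0], 1, axis=0)) + (u[1] - np.roll(u[1], 1, axis=1))

def proj_C(f, u):
    """Projection onto C = {(f, u) : u = grad(f)}: solve (Id - Delta) ftilde = f - div(u) by FFT."""
    n = f.shape[0]                                    # square image assumed
    k1, k2 = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    lap_eig = -4 + 2 * np.cos(2 * np.pi * k1 / n) + 2 * np.cos(2 * np.pi * k2 / n)
    rhs = f - div(u)
    ftilde = np.real(np.fft.ifft2(np.fft.fft2(rhs) / (1 - lap_eig)))
    return ftilde, grad(ftilde)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    f, u = rng.standard_normal((32, 32)), rng.standard_normal((2, 32, 32))
    ft, ut = proj_C(f, u)
    print(np.allclose(ft - div(grad(ft)), f - div(u)))   # checks (Id - Delta) ftilde = f - div(u)
```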
• 99. Example: TV Regularization — [figures: original f₀; observations y = K f₀ + w; TV recovery f*; convergence vs. iteration]
• 100. Alternating Direction Method of Multipliers — min_x F(x) + G(A(x)) (★) ⟺ min_{x, y=Ax} F(x) + G(y), with A : R^N → R^P injective.
• 101. Alternating Direction Method of Multipliers — Lagrangian: min_{x,y} max_u L(x, y, u) = F(x) + G(y) + ⟨u, y − Ax⟩.
• 102. Alternating Direction Method of Multipliers — Augmented Lagrangian: min_{x,y} max_u L_γ(x, y, u) = L(x, y, u) + (γ/2)||y − Ax||².
• 103. Alternating Direction Method of Multipliers — ADMM: x^(ℓ+1) = argmin_x L_γ(x, y^(ℓ), u^(ℓ)); y^(ℓ+1) = argmin_y L_γ(x^(ℓ+1), y, u^(ℓ)); u^(ℓ+1) = u^(ℓ) + γ(y^(ℓ+1) − Ax^(ℓ+1)).
• 104. Alternating Direction Method of Multipliers — Same ADMM iterations. Theorem: if γ > 0, x^(ℓ) → x*, a solution of (★). [Gabay, Mercier, Glowinski, Marrocco, 76]
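An illustrative ADMM instance (not from the slides): F(x) = ½||x − c||² and G = λ||·||₁ composed with a 1-D finite-difference operator A, so both argmin steps of the augmented Lagrangian are explicit. The multiplier u is kept in scaled form, absorbing the factor γ.

```python
import numpy as np

def admm_tv1d(c, lam, gamma=1.0, n_iter=500):
    """ADMM for min_x 1/2 ||x - c||^2 + lam * ||A x||_1, A = 1-D finite differences.

    Split as min_{x, y = A x} F(x) + G(y), F(x) = 1/2 ||x - c||^2, G = lam * ||.||_1;
    u is the scaled multiplier of the constraint y = A x.
    """
    n = len(c)
    A = np.eye(n, k=1)[:-1] - np.eye(n)[:-1]            # (n-1) x n difference matrix
    x, y, u = c.copy(), np.zeros(n - 1), np.zeros(n - 1)
    M = np.linalg.inv(np.eye(n) + gamma * A.T @ A)      # factor the x-update once
    soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0)
    for _ in range(n_iter):
        x = M @ (c + gamma * A.T @ (y + u))             # argmin_x of the augmented Lagrangian
        y = soft(A @ x - u, lam / gamma)                # argmin_y: prox of G / gamma
        u = u + (y - A @ x)                             # ascent on the scaled multiplier
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    c = np.concatenate([np.zeros(20), np.ones(20)]) + 0.1 * rng.standard_normal(40)
    print(np.round(admm_tv1d(c, lam=0.5), 2))
```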
• 105. ADMM with Proximal Operators — Proximal mapping for the metric A (A injective): Prox^A_{γF}(z) = argmin_x ½||Ax − z||² + γF(x).
• 106. ADMM with Proximal Operators — Proposition: Prox^A_{γF} = A⁺ ∘ (Id − γ Prox_{(F*∘A*)/γ}(·/γ)).
• 107. ADMM with Proximal Operators — ADMM in proximal form (scaled multiplier u): x^(ℓ+1) = Prox^A_{F/γ}(y^(ℓ) + u^(ℓ)); y^(ℓ+1) = Prox_{G/γ}(Ax^(ℓ+1) − u^(ℓ)); u^(ℓ+1) = u^(ℓ) + y^(ℓ+1) − Ax^(ℓ+1).
• 108. ADMM with Proximal Operators — → If G∘A is simple: use DR. → If F*∘A* is simple: use ADMM.
• 109. ADMM vs. DR — Fenchel-Rockafellar duality: min_x F(x) + G(A(x)) ↔ min_u F*(−A*u) + G*(u). Important: no bijection between u and x.
• 110. ADMM vs. DR — Proposition: DR applied to the dual problem F*∘(−A*) + G* is ADMM. [Eckstein, Bertsekas, 92]
• 111. ADMM vs. DR — DR iterations (with α = 1): z^(ℓ+1) = ½ z^(ℓ) + ½ RProx_{γ(F*∘(−A*))}(RProx_{γG*}(z^(ℓ))).
• 112. ADMM vs. DR — The ADMM iterates are recovered from the DR ones via: y^(ℓ) = (1/γ)(z^(ℓ) − u^(ℓ)), x^(ℓ+1) = Prox^A_{F/γ}(y^(ℓ) + u^(ℓ)), u^(ℓ) = Prox_{γG*}(z^(ℓ)).
• 113. More than 2 Functionals — min_x G1(x) + ... + Gk(x), each Gi simple. Rewrite as min G(x1, ..., xk) + ι_C(x1, ..., xk), with G(x1, ..., xk) = G1(x1) + ... + Gk(xk) and C = {(x1, ..., xk) ∈ H^k : x1 = ... = xk}.
• 114. More than 2 Functionals — G and ι_C are simple: Prox_{γG}(x1, ..., xk) = (Prox_{γGi}(xi))_i and Prox_{γιC}(x1, ..., xk) = (x̃, ..., x̃), where x̃ = (1/k) Σ_i xi.
  • 115. Overview • Optimal Transport and Imaging • Convex Analysis and Proximal Calculus • Forward Backward • Douglas Rachford and ADMM • Generalized Forward-Backward • Primal-Dual Schemes
• 116. GFB Splitting — min_{x∈R^N} F(x) + Σ_{i=1}^n Gi(x) (★), F smooth, each Gi simple. Iterations: for i = 1, ..., n, z_i^(ℓ+1) = z_i^(ℓ) + Prox_{nγGi}(2x^(ℓ) − z_i^(ℓ) − γ∇F(x^(ℓ))) − x^(ℓ); then x^(ℓ+1) = (1/n) Σ_{i=1}^n z_i^(ℓ+1). [Raguet, Fadili, Peyré 2012]
• 117. GFB Splitting — Same iterations. Theorem: let ∇F be L-Lipschitz. If γ < 2/L, x^(ℓ) → x*, a solution of (★). [Raguet, Fadili, Peyré 2012]
• 118. GFB Splitting — Special cases: n = 1 → Forward-Backward; F = 0 → Douglas-Rachford.
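A generic sketch of the GFB iterations above with user-supplied ∇F and proximal maps; the toy usage splits an ℓ1 penalty and a positivity constraint. The interface and names are my own.

```python
import numpy as np

def gfb(grad_f, proxes, x0, gamma, n_iter=500):
    """Generalized Forward-Backward for min_x F(x) + sum_i G_i(x).

    grad_f: gradient of the smooth part F; proxes: list of callables prox_i(x, t) = Prox_{t*G_i}(x);
    gamma < 2/L with L the Lipschitz constant of grad_f.
    """
    n = len(proxes)
    z = [x0.copy() for _ in range(n)]
    x = x0.copy()
    for _ in range(n_iter):
        g = gamma * grad_f(x)
        for i, prox in enumerate(proxes):
            z[i] = z[i] + prox(2 * x - z[i] - g, n * gamma) - x
        x = sum(z) / n
    return x

if __name__ == "__main__":
    # Toy example: min_x 1/2 ||x - c||^2 + ||x||_1 + i_{x >= 0}(x); expected solution [0.5, 0, 0].
    c = np.array([1.5, -2.0, 0.3])
    soft = lambda x, t: np.sign(x) * np.maximum(np.abs(x) - t, 0)
    pos = lambda x, t: np.maximum(x, 0)
    print(gfb(lambda x: x - c, [soft, pos], np.zeros(3), gamma=1.0))
```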
• 119. GFB Fix Point — x* ∈ argmin_x F(x) + Σ_i Gi(x) ⟺ 0 ∈ ∇F(x*) + Σ_i ∂Gi(x*) ⟺ ∃(y_i), y_i ∈ ∂Gi(x*), ∇F(x*) + Σ_i y_i = 0.
• 120. GFB Fix Point — ⟺ ∃(z_i)_{i=1}^n such that, for all i, (1/n)(x* − z_i − γ∇F(x*)) ∈ γ∂Gi(x*) and x* = (1/n) Σ_i z_i (use z_i = x* − γ∇F(x*) − nγ y_i).
• 121. GFB Fix Point — ⟺ (2x* − z_i − γ∇F(x*)) − x* ∈ nγ∂Gi(x*) ⟺ x* = Prox_{nγGi}(2x* − z_i − γ∇F(x*)) ⟺ z_i = z_i + Prox_{nγGi}(2x* − z_i − γ∇F(x*)) − x*.
• 122. GFB Fix Point — → Fixed-point equation on (x*, z_1, ..., z_n).
• 123. Block Regularization — ℓ1−ℓ2 block sparsity: G(x) = Σ_{b∈B} ||x_[b]||, with ||x_[b]||² = Σ_{m∈b} x_m². [figure: image f = Ψx, coefficients x, blocks b ∈ B]
• 124. Block Regularization — Non-overlapping decomposition B = B1 ∪ ... ∪ Bn: G(x) = Σ_{i=1}^n Gi(x), with Gi(x) = Σ_{b∈Bi} ||x_[b]||. [figure: the block sub-families B1, B2]
• 125. Block Regularization — Each Gi is simple: for m ∈ b ∈ Bi, Prox_{γGi}(x)_m = max(0, 1 − γ/||x_[b]||) x_m.
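A sketch of this block soft-thresholding prox for a non-overlapping family of blocks (blocks given as index arrays; names are mine).

```python
import numpy as np

def prox_block_l1l2(x, blocks, gamma):
    """Prox of gamma * sum_b ||x_[b]|| for non-overlapping blocks (block soft thresholding)."""
    out = x.copy()
    for b in blocks:
        nrm = np.linalg.norm(x[b])
        out[b] = max(0.0, 1.0 - gamma / nrm) * x[b] if nrm > 0 else 0.0
    return out

if __name__ == "__main__":
    x = np.array([0.1, 0.2, 3.0, 4.0])
    blocks = [np.array([0, 1]), np.array([2, 3])]     # a non-overlapping decomposition
    print(prox_block_l1l2(x, blocks, gamma=1.0))      # first block vanishes, second shrinks
```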
• 126. Deconv. + Inpaint. — Numerical illustration on 256×256 images: deconvolution, and deconvolution + inpainting, with a data-fidelity term plus the ℓ1−ℓ2 block penalty λ Σ_{k=1}^4 ||x||_{1,2}^(Bk) on wavelet coefficients. [plots: log10(E(x^(ℓ)) − E(x*)) vs. iteration for the EFB, PR and CP schemes, with running times and SNR; images: x₀, observations y, recovered x*]
  • 127. Overview • Optimal Transport and Imaging • Convex Analysis and Proximal Calculus • Forward Backward • Douglas Rachford and ADMM • Generalized Forward-Backward • Primal-Dual Schemes
• 128. Primal-dual Formulation — Fenchel-Rockafellar duality, with A : H → L linear: min_{x∈H} G1(x) + G2(Ax) = min_x G1(x) + sup_{u∈L} ⟨Ax, u⟩ − G2*(u).
• 129. Primal-dual Formulation — Strong duality: 0 ∈ ri(dom(G2)) − A ri(dom(G1)) ⟹ (min ↔ max): = max_u −G2*(u) + min_x G1(x) + ⟨x, A*u⟩ = max_u −G2*(u) − G1*(−A*u).
• 130. Primal-dual Formulation — Recovering x* from a dual solution u*: x* = argmin_x G1(x) + ⟨x, A*u*⟩.
• 131. Primal-dual Formulation — ⟺ −A*u* ∈ ∂G1(x*) ⟺ x* ∈ (∂G1)⁻¹(−A*u*) = ∂G1*(−A*u*).
• 132. Forward-Backward on the Dual — If G1 is strongly convex, ∇²G1 ≥ c Id, i.e. G1(tx + (1−t)y) ≤ t G1(x) + (1−t) G1(y) − (c/2) t(1−t) ||x − y||².
• 133. Forward-Backward on the Dual — Then x* is uniquely defined, x* = ∇G1*(−A*u*), and G1* is of class C¹.
• 134. Forward-Backward on the Dual — FB on the dual: min_{x∈H} G1(x) + G2(Ax) ⟺ min_{u∈L} G1*(−A*u) + G2*(u) (smooth + simple): u^(ℓ+1) = Prox_{τG2*}(u^(ℓ) + τ A ∇G1*(−A*u^(ℓ))).
• 135. Example: TV Denoising — min_{f∈R^N} ½||f − y||² + λ||∇f||₁ ⟺ min_{||u||∞ ≤ λ} ½||y + div(u)||², where ||u||₁ = Σ_i ||u_i|| and ||u||∞ = max_i ||u_i||. Dual solution u*, primal solution f* = y + div(u*). [Chambolle 2004]
• 136. Example: TV Denoising — FB on the dual (aka projected gradient descent): u^(ℓ+1) = Proj_{||·||∞≤λ}(u^(ℓ) + τ ∇(y + div(u^(ℓ)))), with the projection v = Proj_{||·||∞≤λ}(u) given by v_i = u_i / max(||u_i||/λ, 1). Convergence if τ < 2/||div ∘ ∇|| = 1/4.
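A sketch of this dual projected-gradient TV denoiser, assuming periodic boundary conditions for ∇ and div (an assumption; then ||div ∘ ∇|| = 8, so τ = 0.24 < 1/4 matches the bound above).

```python
import numpy as np

def grad(f):
    return np.stack([np.roll(f, -1, axis=0) - f, np.roll(f, -1, axis=1) - f])

def div(u):
    return (u[0] - np.roll(u[0], 1, axis=0)) + (u[1] - np.roll(u[1], 1, axis=1))

def tv_denoise_dual(y, lam, tau=0.24, n_iter=300):
    """Projected gradient on the dual of TV denoising [Chambolle 2004]:
    min_{||u||_inf <= lam} 1/2 ||y + div(u)||^2, then f = y + div(u)."""
    u = np.zeros((2,) + y.shape)
    for _ in range(n_iter):
        u = u + tau * grad(y + div(u))                        # forward step on the smooth dual energy
        nrm = np.maximum(np.sqrt(u[0] ** 2 + u[1] ** 2) / lam, 1.0)
        u = u / nrm                                           # project each u_i onto the ball of radius lam
    return y + div(u)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    f0 = np.zeros((64, 64)); f0[16:48, 16:48] = 1.0
    y = f0 + 0.1 * rng.standard_normal(f0.shape)
    f = tv_denoise_dual(y, lam=0.2)
    print(float(np.abs(f - f0).mean()))
```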
• 137. Primal-Dual Algorithm — min_{x∈H} G1(x) + G2(Ax) ⟺ min_x max_z G1(x) − G2*(z) + ⟨A(x), z⟩.
• 138. Primal-Dual Algorithm — Iterations: z^(ℓ+1) = Prox_{σG2*}(z^(ℓ) + σ A(x̃^(ℓ))); x^(ℓ+1) = Prox_{τG1}(x^(ℓ) − τ A*(z^(ℓ+1))); x̃^(ℓ+1) = x^(ℓ+1) + θ(x^(ℓ+1) − x^(ℓ)). θ = 0: Arrow-Hurwicz algorithm; θ = 1: convergence speed on the duality gap.
• 139. Primal-Dual Algorithm — Same iterations. Theorem [Chambolle-Pock 2011]: if 0 ≤ θ ≤ 1 and στ||A||² < 1, then x^(ℓ) → x*, a minimizer of G1 + G2 ∘ A.
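A generic sketch of the Chambolle-Pock iterations above, parameterized by the two proximal maps and the operator A; the toy usage solves min_x ½||x − c||² + ||Ax||₁, for which Prox_{σG2*} is a simple clipping. Names and the callable interface are my own.

```python
import numpy as np

def chambolle_pock(prox_g1, prox_g2_conj, A, At, x0, sigma, tau, theta=1.0, n_iter=500):
    """Primal-dual (Chambolle-Pock) iterations for min_x G1(x) + G2(A x).

    prox_g1(x, tau) = Prox_{tau*G1}(x); prox_g2_conj(z, sigma) = Prox_{sigma*G2^*}(z);
    A, At: the linear operator and its adjoint. Requires sigma * tau * ||A||^2 < 1.
    """
    x, x_bar, z = x0.copy(), x0.copy(), A(x0)
    for _ in range(n_iter):
        z = prox_g2_conj(z + sigma * A(x_bar), sigma)     # dual ascent step
        x_new = prox_g1(x - tau * At(z), tau)             # primal descent step
        x_bar = x_new + theta * (x_new - x)               # extrapolation
        x = x_new
    return x

if __name__ == "__main__":
    # Toy usage: min_x 1/2 ||x - c||^2 + ||A x||_1 with a small random A.
    rng = np.random.default_rng(0)
    A_mat, c = rng.standard_normal((5, 4)), rng.standard_normal(4)
    prox_g1 = lambda x, t: (x + t * c) / (1 + t)          # prox of t/2 * ||. - c||^2
    prox_g2_conj = lambda z, s: np.clip(z, -1, 1)         # prox of i_{||.||_inf <= 1} = G2^* for G2 = ||.||_1
    L = np.linalg.norm(A_mat, 2)
    x = chambolle_pock(prox_g1, prox_g2_conj, lambda v: A_mat @ v, lambda v: A_mat.T @ v,
                       np.zeros(4), sigma=0.9 / L, tau=0.9 / L)
    print(x)
```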
• 140. Example: Optimal Transport — Staggered grid formulation: min_{x ∈ R^(G¹st) × R^(G²st)} J(I(x)) + ι_C(x), where I = (I₁, I₂) : R^(G¹st) × R^(G²st) → R^(Gc×2) maps the staggered-grid variables onto the centered grid Gc. [figures: staggered grid Gst and centered grid Gc, with space s and time t axes]
• 141. Conclusion — Inverse problems in imaging: large scale (N ≈ 10⁶); non-smooth (sparsity, TV, ...); (sometimes) convex; highly structured (separability, ℓp norms, ...).
• 142. Conclusion — Proximal splitting: unravels the structure of the problems; parallelizable; decomposition G = Σ_k Gk. [figure: block-sparsity penalization]
• 143. Conclusion — Open problems: less structured problems without smoothness; non-convex optimization.