On the Convergence of Single-Call Stochastic Extra-Gradient Methods
Yu-Guan Hsieh, Franck Iutzeler, Jérôme Malick, Panayotis Mertikopoulos
NeurIPS, December 2019
Outline:
1 Variational Inequality
2 Extra-Gradient
3 Single-call Extra-Gradient [Main Focus]
4 Conclusion
Variational Inequality
Introduction: Variational Inequalities in Machine Learning
Generative adversarial network (GAN)
minθ maxφ Ex∼pdata[log Dφ(x)] + Ez∼pZ[log(1 − Dφ(Gθ(z)))].
More min-max (saddle point) problems: distributionally robust learning, primal-dual formulations in optimization, . . .
Search for equilibria: games, multi-agent reinforcement learning, . . .
Definition
Stampacchia variational inequality
Find x⋆ ∈ X such that ⟨V(x⋆), x − x⋆⟩ ≥ 0 for all x ∈ X. (SVI)

Minty variational inequality
Find x⋆ ∈ X such that ⟨V(x), x − x⋆⟩ ≥ 0 for all x ∈ X. (MVI)

With closed convex set X ⊆ Rd and vector field V : Rd → Rd.
Illustration
SVI: V(x⋆) belongs to the dual cone DC(x⋆) of X at x⋆ [ local ]
MVI: V(x) forms an acute angle with the tangent vector x − x⋆ ∈ TC(x⋆) [ global ]
Example: Function Minimization
minx f(x) subject to x ∈ X

f : X → R is the differentiable function to minimize. Let V = ∇f.

(SVI) ∀x ∈ X, ⟨∇f(x⋆), x − x⋆⟩ ≥ 0 [ first-order optimality ]
(MVI) ∀x ∈ X, ⟨∇f(x), x − x⋆⟩ ≥ 0 [ x⋆ is a minimizer of f ]
If f is convex, (SVI) and (MVI) are equivalent.
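As a quick numerical illustration of (SVI) as first-order optimality (a hypothetical toy example, not one from the slides): for f(x) = ‖x − c‖² over the unit box, the constrained minimizer is the projection of c, and the Stampacchia condition can be checked on sampled feasible points.

import numpy as np

# Toy problem (illustrative only): minimize f(x) = ||x - c||^2 over the box X = [0, 1]^2.
c = np.array([1.5, -0.3])
grad_f = lambda x: 2.0 * (x - c)            # V = ∇f
x_star = np.clip(c, 0.0, 1.0)               # constrained minimizer = projection of c onto X

# Check the Stampacchia condition <∇f(x*), x - x*> >= 0 on random feasible points.
rng = np.random.default_rng(0)
xs = rng.uniform(0.0, 1.0, size=(1000, 2))
print(((xs - x_star) @ grad_f(x_star)).min() >= -1e-12)   # True: x* solves (SVI)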
Example: Saddle Point Problem
Find x⋆ = (θ⋆, φ⋆) such that
L(θ⋆, φ) ≤ L(θ⋆, φ⋆) ≤ L(θ, φ⋆) for all θ ∈ Θ and all φ ∈ Φ.

X ≡ Θ × Φ and L : X → R a differentiable function. Let V = (∇θL, −∇φL).

(SVI) ∀(θ,φ) ∈ X, ⟨∇θL(x⋆), θ − θ⋆⟩ − ⟨∇φL(x⋆), φ − φ⋆⟩ ≥ 0 [ stationary ]
(MVI) ∀(θ,φ) ∈ X, ⟨∇θL(x), θ − θ⋆⟩ − ⟨∇φL(x), φ − φ⋆⟩ ≥ 0 [ saddle point ]
If L is convex-concave, (SVI) and (MVI) are equivalent.
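A minimal sketch of assembling the operator V = (∇θL, −∇φL); the convex-concave objective below is a hypothetical illustration, not the one used later in the experiments.

import numpy as np

# Illustrative convex-concave objective L(θ, φ) = 0.5||θ||^2 + θᵀφ − 0.5||φ||^2,
# so that ∇θL = θ + φ and ∇φL = θ − φ.
def V(x):
    theta, phi = x[:2], x[2:]
    return np.concatenate([theta + phi, phi - theta])   # V = (∇θL, −∇φL)

x_star = np.zeros(4)        # (θ*, φ*) = (0, 0) is the saddle point of this L
print(V(x_star))            # the field vanishes at the solution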
Monotonicity
The solutions of (SVI) and (MVI) coincide when V is continuous and monotone, i.e.,
⟨V(x′) − V(x), x′ − x⟩ ≥ 0 for all x, x′ ∈ Rd.

In the two examples above, this corresponds to f being convex or L being convex-concave.

The operator analogue of strong convexity is strong monotonicity:
⟨V(x′) − V(x), x′ − x⟩ ≥ α‖x′ − x‖² for some α > 0 and all x, x′ ∈ Rd.
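Monotonicity can be sanity-checked numerically (a sketch reusing the toy operator from the previous snippet): sample pairs x, x′ and look at the ratio ⟨V(x′) − V(x), x′ − x⟩ / ‖x′ − x‖².

import numpy as np

def V(x):                                   # toy operator from the previous sketch
    theta, phi = x[:2], x[2:]
    return np.concatenate([theta + phi, phi - theta])

rng = np.random.default_rng(1)
ratios = []
for _ in range(1000):
    x, xp = rng.normal(size=4), rng.normal(size=4)
    d = xp - x
    ratios.append((V(xp) - V(x)) @ d / (d @ d))
print(min(ratios))     # ≈ 1.0: this V is strongly monotone with α = 1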
Extra-Gradient
From Forward-backward to Extra-Gradient
Forward-backward
Xt+1 = ΠX(Xt − γt V(Xt)) (FB)

Extra-Gradient [Korpelevich 1976]
Xt+1/2 = ΠX(Xt − γt V(Xt))
Xt+1 = ΠX(Xt − γt V(Xt+1/2)) (EG)

The Extra-Gradient method anticipates the landscape of V by taking an extrapolation step to reach the leading state Xt+1/2.
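A minimal sketch of one (FB) and one (EG) update, assuming a Euclidean projection proj onto X is available (identity in the unconstrained case):

def fb_step(x, V, gamma, proj=lambda z: z):
    # Forward-backward: X_{t+1} = Π_X(X_t − γ_t V(X_t))
    return proj(x - gamma * V(x))

def eg_step(x, V, gamma, proj=lambda z: z):
    # Extra-gradient: extrapolate to the leading state X_{t+1/2}, then update from X_t
    x_half = proj(x - gamma * V(x))
    return proj(x - gamma * V(x_half))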
Forward-backward does not converge in bilinear games, while Extra-Gradient does.
minθ∈R maxφ∈R θφ

[Figure: trajectories on this bilinear game. Left: Forward-backward (does not converge, drifts away from the solution). Right: Extra-Gradient (converges to the solution at the origin).]
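The picture above can be reproduced in a few lines (a sketch; the step size γ = 0.1 and the starting point are arbitrary choices):

import numpy as np

# Bilinear game min_θ max_φ θφ: V(θ, φ) = (φ, −θ), solution at the origin.
V = lambda x: np.array([x[1], -x[0]])
gamma = 0.1
x_fb = x_eg = np.array([1.0, 1.0])

for _ in range(200):
    x_fb = x_fb - gamma * V(x_fb)             # (FB)
    x_half = x_eg - gamma * V(x_eg)           # (EG) extrapolation
    x_eg = x_eg - gamma * V(x_half)           # (EG) update
print(np.linalg.norm(x_fb), np.linalg.norm(x_eg))   # FB drifts away, EG approaches 0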
Stochastic Oracle
If a stochastic oracle is involved:
Xt+1/2 = ΠX(Xt − γt V̂t)
Xt+1 = ΠX(Xt − γt V̂t+1/2)

with V̂t = V(Xt) + Zt (and similarly for V̂t+1/2) satisfying
a) Zero mean: E[Zt | Ft] = 0.
b) Bounded variance: E[‖Zt‖² | Ft] ≤ σ².

(Ft)t∈N/2 is the natural filtration associated to the stochastic process (Xt)t∈N/2.
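In code, such an oracle can be modeled by adding zero-mean noise with bounded variance to the exact field (a sketch; the Gaussian noise model is just one admissible choice):

import numpy as np

def make_noisy_oracle(V, sigma, rng=None):
    """Return V̂ with V̂(x) = V(x) + Z, E[Z] = 0 and E[||Z||^2] = σ² (Gaussian noise)."""
    rng = rng or np.random.default_rng(0)
    def V_hat(x):
        return V(x) + rng.normal(scale=sigma / np.sqrt(x.size), size=x.size)
    return V_hat

V = lambda x: np.array([x[1], -x[0]])     # exact field of the bilinear toy game
V_hat = make_noisy_oracle(V, sigma=0.1)
print(V_hat(np.array([1.0, 1.0])))        # one noisy evaluation of V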
Convergence Metrics
Ergodic convergence: restricted error function
ErrR(x̂) = maxx∈XR ⟨V(x), x̂ − x⟩,
where XR ≡ X ∩ BR(0) = {x ∈ X : ‖x‖ ≤ R}.

Last-iterate convergence: squared distance dist(x̂, X⋆)².

Lemma [Nesterov 2007]
Assume V is monotone. If x⋆ is a solution of (SVI), we have ErrR(x⋆) = 0 for all sufficiently large R. Conversely, if ErrR(x̂) = 0 for large enough R > 0 and some x̂ ∈ XR, then x̂ is a solution of (SVI).
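For intuition, ErrR can be lower-bounded by sampling the ball XR (a crude Monte Carlo sketch on the bilinear toy problem; in practice the maximum would be computed exactly):

import numpy as np

V = lambda x: np.array([x[1], -x[0]])     # bilinear toy field, solution x* = 0
R, rng = 1.0, np.random.default_rng(0)

def err_R(x_hat, n=100_000):
    # Monte Carlo lower bound on max_{||x|| <= R} <V(x), x_hat - x>
    xs = rng.normal(size=(n, 2))
    xs *= R * rng.uniform(size=(n, 1)) ** 0.5 / np.linalg.norm(xs, axis=1, keepdims=True)
    Vxs = np.stack([xs[:, 1], -xs[:, 0]], axis=1)
    return np.einsum("ij,ij->i", Vxs, x_hat - xs).max()

print(err_R(np.zeros(2)))            # ≈ 0: the origin solves (SVI)
print(err_R(np.array([0.5, 0.5])))   # > 0 away from the solution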
Literature Review
We further suppose that V is β-Lipschitz.
                       Convergence type           Hypothesis
Korpelevich 1976       Last iterate, asymptotic   Pseudo-monotone
Tseng 1995             Last iterate, geometric    Monotone + error bound (e.g., strongly monotone, affine)
Nemirovski 2004        Ergodic, O(1/t)            Monotone
Juditsky et al. 2011   Ergodic, O(1/√t)           Stochastic monotone
In Deep Learning
Extra-Gradient (EG) needs two oracle calls per iteration, and gradient computations can be very costly for deep models.

What if we drop one oracle call per iteration?
Single-call Extra-Gradient [Main Focus]
Algorithms
1 Past Extra-Gradient [Popov 1980]
Xt+1/2 = ΠX(Xt − γt V̂t−1/2)
Xt+1 = ΠX(Xt − γt V̂t+1/2) (PEG)

2 Reflected Gradient [Malitsky 2015]
Xt+1/2 = Xt − (Xt−1 − Xt)
Xt+1 = ΠX(Xt − γt V̂t+1/2) (RG)

3 Optimistic Gradient [Daskalakis et al. 2018]
Xt+1/2 = ΠX(Xt − γt V̂t−1/2)
Xt+1 = Xt+1/2 + γt V̂t−1/2 − γt V̂t+1/2 (OG)
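A sketch of one iteration of each variant in the unconstrained case (proj = identity); each function makes a single oracle call and returns the extra state the method carries to the next iteration. Names and the state-passing convention are chosen here for illustration, not taken from the paper.

def peg_step(x, v_prev_half, V_hat, gamma, proj=lambda z: z):
    # (PEG): extrapolate with the stored half-step gradient, update with the new one
    x_half = proj(x - gamma * v_prev_half)
    v_half = V_hat(x_half)
    return proj(x - gamma * v_half), v_half

def rg_step(x, x_prev, V_hat, gamma, proj=lambda z: z):
    # (RG): reflect the last step instead of re-evaluating the oracle at X_t
    x_half = x - (x_prev - x)                 # no projection on the leading state
    return proj(x - gamma * V_hat(x_half)), x

def og_step(x, v_prev_half, V_hat, gamma, proj=lambda z: z):
    # (OG): same extrapolation as (PEG), but the update step skips the projection
    x_half = proj(x - gamma * v_prev_half)
    v_half = V_hat(x_half)
    return x_half + gamma * v_prev_half - gamma * v_half, v_half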
A First Result
Proxy (Step 1 / Step 2 refer to the generic template recalled below)
PEG: [Step 1] V̂t ← V̂t−1/2
RG: [Step 1] V̂t ← (Xt−1 − Xt)/γt; no projection
OG: [Step 1] V̂t ← V̂t−1/2 [Step 2] Xt ← Xt+1/2 + γt V̂t−1/2; no projection

Proposition
Suppose that the single-call Extra-Gradient (1-EG) methods presented above share the same initialization X0 = X1 ∈ X, V̂1/2 = 0, and the same constant step-size (γt)t∈N ≡ γ. If X = Rd, the generated iterates Xt coincide for all t ≥ 1.
Generic single-call template (Step 1, then Step 2):
Xt+1/2 = ΠX(Xt − γt V̂t)
Xt+1 = ΠX(Xt − γt V̂t+1/2)
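A quick numerical check of the proposition (a sketch with a deterministic oracle on the unconstrained bilinear toy problem): started from X0 = X1 with V̂1/2 = 0 and a constant step, the three trajectories coincide up to rounding.

import numpy as np

V = lambda x: np.array([x[1], -x[0]])    # deterministic bilinear field, X = R^2
gamma, x0 = 0.1, np.array([1.0, 1.0])

xp, vp = x0.copy(), np.zeros(2)          # PEG state: (X_t, V̂_{t-1/2})
xo, vo = x0.copy(), np.zeros(2)          # OG state:  (X_t, V̂_{t-1/2})
xr, xr_prev = x0.copy(), x0.copy()       # RG state:  (X_t, X_{t-1}), with X_0 = X_1

for _ in range(50):
    h = xp - gamma * vp                          # PEG
    vp = V(h)
    xp = xp - gamma * vp
    h = xo - gamma * vo                          # OG
    v = V(h)
    xo, vo = h + gamma * vo - gamma * v, v
    h = xr - (xr_prev - xr)                      # RG
    xr_prev, xr = xr, xr - gamma * V(h)

print(np.allclose(xp, xo), np.allclose(xp, xr))  # True True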
Global Convergence Rate
All results assume V is Lipschitz continuous; in the stochastic strongly monotone case, the step-size is taken in O(1/t).

New results!

                Monotone                      Strongly Monotone
                Ergodic      Last Iterate     Ergodic      Last Iterate
Deterministic   1/t          Unknown          1/t          e^(−ρt)
Stochastic      1/√t         Unknown          1/t          1/t
Proof Ingredients
Descent Lemma [Deterministic + Monotone]
There exists (µt)t∈N ∈ R^N_+ such that for all p ∈ X,
‖Xt+1 − p‖² + µt+1 ≤ ‖Xt − p‖² − 2γ⟨V(Xt+1/2), Xt+1/2 − p⟩ + µt.

Descent Lemma [Stochastic + Strongly Monotone]
Let x⋆ be the unique solution of (SVI). There exist (µt)t∈N ∈ R^N_+ and M ∈ R+ such that
E[‖Xt+1 − x⋆‖²] + µt+1 ≤ (1 − αγt)(E[‖Xt − x⋆‖²] + µt) + M γt² σ².
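To see how the second lemma yields the O(1/t) last-iterate rate, one can unroll the recursion; the following sketch uses the illustrative step-size choice γt = 2/(α(t + b)) with b ≥ 1, not a constant taken from the paper. Let at = E[‖Xt − x⋆‖²] + µt. Then
at+1 ≤ (1 − 2/(t + b)) at + 4Mσ²/(α²(t + b)²),
and an induction on t gives at ≤ C/(t + b) with C = max{(1 + b) a1, 4Mσ²/α²}, hence E[‖Xt − x⋆‖²] = O(1/t).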
Regular Solution
Definition [Regular Solution]
We say that x⋆ is a regular solution of (SVI) if V is C¹-smooth in a neighborhood of x⋆ and the Jacobian Jac V(x⋆) is positive-definite along rays emanating from x⋆, i.e.,
z⊺ Jac V(x⋆) z ≡ ∑i,j zi (∂Vi/∂xj)(x⋆) zj > 0 for all z ∈ Rd ∖ {0} that are tangent to X at x⋆.

To be compared with
positive definiteness of the Hessian along qualified constraints in minimization;
differential equilibrium in games.

Localization of strong monotonicity.
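Regularity can be probed numerically (a sketch on a hypothetical non-monotone field whose solution sits at the origin): estimate Jac V(x⋆) by finite differences and test z⊺ Jac V(x⋆) z > 0 on sampled directions.

import numpy as np

# Hypothetical field, non-monotone globally, with V(0) = 0 and Jacobian I + rotation at 0.
def V(x):
    t, p = x
    return np.array([t - t**3 + p, p - p**3 - t])

def jacobian(F, x, eps=1e-6):
    # Central finite-difference Jacobian of F at x
    return np.column_stack([(F(x + eps * e) - F(x - eps * e)) / (2 * eps) for e in np.eye(x.size)])

J = jacobian(V, np.zeros(2))
zs = np.random.default_rng(0).normal(size=(1000, 2))
print(np.all(np.einsum("ij,jk,ik->i", zs, J, zs) > 0))   # True: x* = 0 is a regular solution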
Local Convergence
Theorem [Local convergence for stochastic non-monotone operators]
Let x⋆ be a regular solution of (SVI) and fix a tolerance level δ > 0. Suppose (PEG) is run with step-sizes of the form γt = γ/(t + b) for large enough γ and b. Then:
(a) There are neighborhoods U and U1 of x⋆ in X such that, if X1/2 ∈ U and X1 ∈ U1, the event
E∞ = {Xt+1/2 ∈ U for all t = 1, 2, . . .}
occurs with probability at least 1 − δ.
(b) Conditioning on this event, we have E[‖Xt − x⋆‖² | E∞] = O(1/t).
Experiments
Test problem (ε1, ε2 select the regime):
L(θ,φ) = (ε1/2) θ⊺A1θ + ε2 (θ⊺A2θ)² − (ε1/2) φ⊺B1φ − ε2 (φ⊺B2φ)² + 4 θ⊺Cφ

[Figure: ‖x − x⋆‖² versus number of oracle calls for EG and 1-EG at several step sizes γ.]
(a) Strongly monotone (ε1 = 1, ε2 = 0), deterministic, last iterate; γ ∈ {0.1, 0.2, 0.3, 0.4}.
(b) Monotone (ε1 = 0, ε2 = 1), deterministic, ergodic; γ ∈ {0.2, 0.4, 0.6, 0.8}.
(c) Non-monotone (ε1 = 1, ε2 = −1), stochastic with Zt iid ∼ N(0, σ² = 0.01), last iterate (b = 15); γ ∈ {0.3, 0.6, 0.9, 1.2}.
Conclusion and Perspectives
Single-call rates ∼ two-call rates.
Localization of the stochastic guarantees.
Last-iterate convergence: a first step into the non-monotone world.
Some research directions: Bregman, universal, . . .
Bibliography
Daskalakis, Constantinos et al. (2018). “Training GANs with optimism”. In: ICLR ’18: Proceedings of the
2018 International Conference on Learning Representations.
Juditsky, Anatoli, Arkadi Semen Nemirovski, and Claire Tauvel (2011). “Solving variational inequalities with
stochastic mirror-prox algorithm”. In: Stochastic Systems 1.1, pp. 17–58.
Korpelevich, G. M. (1976). “The extragradient method for finding saddle points and other problems”. In:
Èkonom. i Mat. Metody 12, pp. 747–756.
Malitsky, Yura (2015). “Projected reflected gradient methods for monotone variational inequalities”. In:
SIAM Journal on Optimization 25.1, pp. 502–520.
Nemirovski, Arkadi Semen (2004). “Prox-method with rate of convergence O(1/t) for variational inequalities
with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems”. In:
SIAM Journal on Optimization 15.1, pp. 229–251.
Nesterov, Yurii (2007). “Dual extrapolation and its applications to solving variational inequalities and related
problems”. In: Mathematical Programming 109.2, pp. 319–344.
Popov, Leonid Denisovich (1980). “A modification of the Arrow–Hurwicz method for search of saddle
points”. In: Mathematical Notes of the Academy of Sciences of the USSR 28.5, pp. 845–848.
Tseng, Paul (June 1995). “On linear convergence of iterative methods for the variational inequality problem”.
In: Journal of Computational and Applied Mathematics 60.1-2, pp. 237–252.