On the Convergence of Single-Call Stochastic Extra-Gradient Methods
Yu-Guan Hsieh, Franck Iutzeler, Jérôme Malick, Panayotis Mertikopoulos
NeurIPS, December 2019
Outline:
1 Variational Inequality
2 Extra-Gradient
3 Single-call Extra-Gradient [Main Focus]
4 Conclusion
Variational Inequality
Introduction: Variational Inequalities in Machine Learning
Generative adversarial network (GAN)
minθ maxφ Ex∼pdata[log Dφ(x)] + Ez∼pZ[log(1 − Dφ(Gθ(z)))].
More min-max (saddle point) problems: distributionally robust learning, primal-dual formulations in optimization, . . .
Search for equilibria: games, multi-agent reinforcement learning, . . .
Definition
Stampacchia variational inequality
Find x⋆ ∈ X such that ⟨V(x⋆), x − x⋆⟩ ≥ 0 for all x ∈ X. (SVI)

Minty variational inequality
Find x⋆ ∈ X such that ⟨V(x), x − x⋆⟩ ≥ 0 for all x ∈ X. (MVI)

With closed convex set X ⊆ Rd and vector field V : Rd → Rd.
Illustration
SVI: V(x⋆) belongs to the dual cone DC(x⋆) of X at x⋆ [ local ]
MVI: V(x) forms an acute angle with the tangent vector x − x⋆ ∈ TC(x⋆) [ global ]
Example: Function Minimization
minx f(x) subject to x ∈ X

f : X → R is the differentiable function to minimize. Let V = ∇f.

(SVI) ∀x ∈ X, ⟨∇f(x⋆), x − x⋆⟩ ≥ 0 [ first-order optimality ]
(MVI) ∀x ∈ X, ⟨∇f(x), x − x⋆⟩ ≥ 0 [ x⋆ is a minimizer of f ]
If f is convex, (SVI) and (MVI) are equivalent.
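As a quick numerical illustration of (SVI) as first-order optimality (a hypothetical toy example, not one from the slides): for f(x) = ‖x − c‖² over the unit box, the constrained minimizer is the projection of c, and the Stampacchia condition can be checked on sampled feasible points.

import numpy as np

# Toy problem (illustrative only): minimize f(x) = ||x - c||^2 over the box X = [0, 1]^2.
c = np.array([1.5, -0.3])
grad_f = lambda x: 2.0 * (x - c)            # V = ∇f
x_star = np.clip(c, 0.0, 1.0)               # constrained minimizer = projection of c onto X

# Check the Stampacchia condition <∇f(x*), x - x*> >= 0 on random feasible points.
rng = np.random.default_rng(0)
xs = rng.uniform(0.0, 1.0, size=(1000, 2))
print(((xs - x_star) @ grad_f(x_star)).min() >= -1e-12)   # True: x* solves (SVI)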
Example: Saddle Point Problem
Find x⋆ = (θ⋆, φ⋆) such that
L(θ⋆, φ) ≤ L(θ⋆, φ⋆) ≤ L(θ, φ⋆) for all θ ∈ Θ and all φ ∈ Φ.

X ≡ Θ × Φ and L : X → R a differentiable function. Let V = (∇θL, −∇φL).

(SVI) ∀(θ,φ) ∈ X, ⟨∇θL(x⋆), θ − θ⋆⟩ − ⟨∇φL(x⋆), φ − φ⋆⟩ ≥ 0 [ stationary ]
(MVI) ∀(θ,φ) ∈ X, ⟨∇θL(x), θ − θ⋆⟩ − ⟨∇φL(x), φ − φ⋆⟩ ≥ 0 [ saddle point ]
If L is convex-concave, (SVI) and (MVI) are equivalent.
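A minimal sketch of assembling the operator V = (∇θL, −∇φL); the convex-concave objective below is a hypothetical illustration, not the one used later in the experiments.

import numpy as np

# Illustrative convex-concave objective L(θ, φ) = 0.5||θ||^2 + θᵀφ − 0.5||φ||^2,
# so that ∇θL = θ + φ and ∇φL = θ − φ.
def V(x):
    theta, phi = x[:2], x[2:]
    return np.concatenate([theta + phi, phi - theta])   # V = (∇θL, −∇φL)

x_star = np.zeros(4)        # (θ*, φ*) = (0, 0) is the saddle point of this L
print(V(x_star))            # the field vanishes at the solution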
Monotonicity
The solutions of (SVI) and (MVI) coincide when V is continuous and monotone, i.e.,
⟨V(x′) − V(x), x′ − x⟩ ≥ 0 for all x, x′ ∈ Rd.

In the two examples above, this corresponds to f being convex or L being convex-concave.

The operator analogue of strong convexity is strong monotonicity:
⟨V(x′) − V(x), x′ − x⟩ ≥ α‖x′ − x‖² for some α > 0 and all x, x′ ∈ Rd.
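Monotonicity can be sanity-checked numerically (a sketch reusing the toy operator from the previous snippet): sample pairs x, x′ and look at the ratio ⟨V(x′) − V(x), x′ − x⟩ / ‖x′ − x‖².

import numpy as np

def V(x):                                   # toy operator from the previous sketch
    theta, phi = x[:2], x[2:]
    return np.concatenate([theta + phi, phi - theta])

rng = np.random.default_rng(1)
ratios = []
for _ in range(1000):
    x, xp = rng.normal(size=4), rng.normal(size=4)
    d = xp - x
    ratios.append((V(xp) - V(x)) @ d / (d @ d))
print(min(ratios))     # ≈ 1.0: this V is strongly monotone with α = 1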
Extra-Gradient
From Forward-backward to Extra-Gradient
Forward-backward
Xt+1 = ΠX(Xt − γt V(Xt)) (FB)

Extra-Gradient [Korpelevich 1976]
Xt+1/2 = ΠX(Xt − γt V(Xt))
Xt+1 = ΠX(Xt − γt V(Xt+1/2)) (EG)

The Extra-Gradient method anticipates the landscape of V by taking an extrapolation step to reach the leading state Xt+1/2.
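A minimal sketch of one (FB) and one (EG) update, assuming a Euclidean projection proj onto X is available (identity in the unconstrained case):

def fb_step(x, V, gamma, proj=lambda z: z):
    # Forward-backward: X_{t+1} = Π_X(X_t − γ_t V(X_t))
    return proj(x - gamma * V(x))

def eg_step(x, V, gamma, proj=lambda z: z):
    # Extra-gradient: extrapolate to the leading state X_{t+1/2}, then update from X_t
    x_half = proj(x - gamma * V(x))
    return proj(x - gamma * V(x_half))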
Forward-backward does not converge in bilinear games, while Extra-Gradient does.
minθ∈R maxφ∈R θφ

[Figure: trajectories on this bilinear game. Left: Forward-backward (does not converge, drifts away from the solution). Right: Extra-Gradient (converges to the solution at the origin).]
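The picture above can be reproduced in a few lines (a sketch; the step size γ = 0.1 and the starting point are arbitrary choices):

import numpy as np

# Bilinear game min_θ max_φ θφ: V(θ, φ) = (φ, −θ), solution at the origin.
V = lambda x: np.array([x[1], -x[0]])
gamma = 0.1
x_fb = x_eg = np.array([1.0, 1.0])

for _ in range(200):
    x_fb = x_fb - gamma * V(x_fb)             # (FB)
    x_half = x_eg - gamma * V(x_eg)           # (EG) extrapolation
    x_eg = x_eg - gamma * V(x_half)           # (EG) update
print(np.linalg.norm(x_fb), np.linalg.norm(x_eg))   # FB drifts away, EG approaches 0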
Stochastic Oracle
If a stochastic oracle is involved:
Xt+1/2 = ΠX(Xt − γt V̂t)
Xt+1 = ΠX(Xt − γt V̂t+1/2)

with V̂t = V(Xt) + Zt (and similarly for V̂t+1/2) satisfying
a) Zero mean: E[Zt | Ft] = 0.
b) Bounded variance: E[‖Zt‖² | Ft] ≤ σ².

(Ft)t∈N/2 is the natural filtration associated to the stochastic process (Xt)t∈N/2.
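In code, such an oracle can be modeled by adding zero-mean noise with bounded variance to the exact field (a sketch; the Gaussian noise model is just one admissible choice):

import numpy as np

def make_noisy_oracle(V, sigma, rng=None):
    """Return V̂ with V̂(x) = V(x) + Z, E[Z] = 0 and E[||Z||^2] = σ² (Gaussian noise)."""
    rng = rng or np.random.default_rng(0)
    def V_hat(x):
        return V(x) + rng.normal(scale=sigma / np.sqrt(x.size), size=x.size)
    return V_hat

V = lambda x: np.array([x[1], -x[0]])     # exact field of the bilinear toy game
V_hat = make_noisy_oracle(V, sigma=0.1)
print(V_hat(np.array([1.0, 1.0])))        # one noisy evaluation of V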
Convergence Metrics
Ergodic convergence: restricted error function
ErrR(x̂) = maxx∈XR ⟨V(x), x̂ − x⟩,
where XR ≡ X ∩ BR(0) = {x ∈ X : ‖x‖ ≤ R}.

Last-iterate convergence: squared distance dist(x̂, X⋆)².

Lemma [Nesterov 2007]
Assume V is monotone. If x⋆ is a solution of (SVI), we have ErrR(x⋆) = 0 for all sufficiently large R. Conversely, if ErrR(x̂) = 0 for large enough R > 0 and some x̂ ∈ XR, then x̂ is a solution of (SVI).
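For intuition, ErrR can be lower-bounded by sampling the ball XR (a crude Monte Carlo sketch on the bilinear toy problem; in practice the maximum would be computed exactly):

import numpy as np

V = lambda x: np.array([x[1], -x[0]])     # bilinear toy field, solution x* = 0
R, rng = 1.0, np.random.default_rng(0)

def err_R(x_hat, n=100_000):
    # Monte Carlo lower bound on max_{||x|| <= R} <V(x), x_hat - x>
    xs = rng.normal(size=(n, 2))
    xs *= R * rng.uniform(size=(n, 1)) ** 0.5 / np.linalg.norm(xs, axis=1, keepdims=True)
    Vxs = np.stack([xs[:, 1], -xs[:, 0]], axis=1)
    return np.einsum("ij,ij->i", Vxs, x_hat - xs).max()

print(err_R(np.zeros(2)))            # ≈ 0: the origin solves (SVI)
print(err_R(np.array([0.5, 0.5])))   # > 0 away from the solution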
Literature Review
We further suppose that V is β-Lipschitz.
                       Convergence type           Hypothesis
Korpelevich 1976       Last iterate, asymptotic   Pseudo-monotone
Tseng 1995             Last iterate, geometric    Monotone + error bound (e.g., strongly monotone, affine)
Nemirovski 2004        Ergodic, O(1/t)            Monotone
Juditsky et al. 2011   Ergodic, O(1/√t)           Stochastic monotone
In Deep Learning
Extra-Gradient (EG) needs two oracle calls per iteration, and gradient computations can be very costly for deep models.

What if we drop one oracle call per iteration?
Single-call Extra-Gradient [Main Focus]
Algorithms
1 Past Extra-Gradient [Popov 1980]
Xt+1/2 = ΠX(Xt − γt V̂t−1/2)
Xt+1 = ΠX(Xt − γt V̂t+1/2) (PEG)

2 Reflected Gradient [Malitsky 2015]
Xt+1/2 = Xt − (Xt−1 − Xt)
Xt+1 = ΠX(Xt − γt V̂t+1/2) (RG)

3 Optimistic Gradient [Daskalakis et al. 2018]
Xt+1/2 = ΠX(Xt − γt V̂t−1/2)
Xt+1 = Xt+1/2 + γt V̂t−1/2 − γt V̂t+1/2 (OG)
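A sketch of one iteration of each variant in the unconstrained case (proj = identity); each function makes a single oracle call and returns the extra state the method carries to the next iteration. Names and the state-passing convention are chosen here for illustration, not taken from the paper.

def peg_step(x, v_prev_half, V_hat, gamma, proj=lambda z: z):
    # (PEG): extrapolate with the stored half-step gradient, update with the new one
    x_half = proj(x - gamma * v_prev_half)
    v_half = V_hat(x_half)
    return proj(x - gamma * v_half), v_half

def rg_step(x, x_prev, V_hat, gamma, proj=lambda z: z):
    # (RG): reflect the last step instead of re-evaluating the oracle at X_t
    x_half = x - (x_prev - x)                 # no projection on the leading state
    return proj(x - gamma * V_hat(x_half)), x

def og_step(x, v_prev_half, V_hat, gamma, proj=lambda z: z):
    # (OG): same extrapolation as (PEG), but the update step skips the projection
    x_half = proj(x - gamma * v_prev_half)
    v_half = V_hat(x_half)
    return x_half + gamma * v_prev_half - gamma * v_half, v_half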
A First Result
Proxy (Step 1 / Step 2 refer to the generic template recalled below)
PEG: [Step 1] V̂t ← V̂t−1/2
RG: [Step 1] V̂t ← (Xt−1 − Xt)/γt; no projection
OG: [Step 1] V̂t ← V̂t−1/2 [Step 2] Xt ← Xt+1/2 + γt V̂t−1/2; no projection

Proposition
Suppose that the single-call Extra-Gradient (1-EG) methods presented above share the same initialization X0 = X1 ∈ X, V̂1/2 = 0, and the same constant step-size (γt)t∈N ≡ γ. If X = Rd, the generated iterates Xt coincide for all t ≥ 1.
Generic single-call template (Step 1, then Step 2):
Xt+1/2 = ΠX(Xt − γt V̂t)
Xt+1 = ΠX(Xt − γt V̂t+1/2)
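A quick numerical check of the proposition (a sketch with a deterministic oracle on the unconstrained bilinear toy problem): started from X0 = X1 with V̂1/2 = 0 and a constant step, the three trajectories coincide up to rounding.

import numpy as np

V = lambda x: np.array([x[1], -x[0]])    # deterministic bilinear field, X = R^2
gamma, x0 = 0.1, np.array([1.0, 1.0])

xp, vp = x0.copy(), np.zeros(2)          # PEG state: (X_t, V̂_{t-1/2})
xo, vo = x0.copy(), np.zeros(2)          # OG state:  (X_t, V̂_{t-1/2})
xr, xr_prev = x0.copy(), x0.copy()       # RG state:  (X_t, X_{t-1}), with X_0 = X_1

for _ in range(50):
    h = xp - gamma * vp                          # PEG
    vp = V(h)
    xp = xp - gamma * vp
    h = xo - gamma * vo                          # OG
    v = V(h)
    xo, vo = h + gamma * vo - gamma * v, v
    h = xr - (xr_prev - xr)                      # RG
    xr_prev, xr = xr, xr - gamma * V(h)

print(np.allclose(xp, xo), np.allclose(xp, xr))  # True True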
Global Convergence Rate
All results assume V is Lipschitz continuous; in the stochastic strongly monotone case, the step-size is taken in O(1/t).

New results!

                Monotone                      Strongly Monotone
                Ergodic      Last Iterate     Ergodic      Last Iterate
Deterministic   1/t          Unknown          1/t          e^(−ρt)
Stochastic      1/√t         Unknown          1/t          1/t
Proof Ingredients
Descent Lemma [Deterministic + Monotone]
There exists (µt)t∈N ∈ R^N_+ such that for all p ∈ X,
‖Xt+1 − p‖² + µt+1 ≤ ‖Xt − p‖² − 2γ⟨V(Xt+1/2), Xt+1/2 − p⟩ + µt.

Descent Lemma [Stochastic + Strongly Monotone]
Let x⋆ be the unique solution of (SVI). There exist (µt)t∈N ∈ R^N_+ and M ∈ R+ such that
E[‖Xt+1 − x⋆‖²] + µt+1 ≤ (1 − αγt)(E[‖Xt − x⋆‖²] + µt) + M γt² σ².
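To see how the second lemma yields the O(1/t) last-iterate rate, one can unroll the recursion; the following sketch uses the illustrative step-size choice γt = 2/(α(t + b)) with b ≥ 1, not a constant taken from the paper. Let at = E[‖Xt − x⋆‖²] + µt. Then
at+1 ≤ (1 − 2/(t + b)) at + 4Mσ²/(α²(t + b)²),
and an induction on t gives at ≤ C/(t + b) with C = max{(1 + b) a1, 4Mσ²/α²}, hence E[‖Xt − x⋆‖²] = O(1/t).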
Regular Solution
Definition [Regular Solution]
We say that x⋆ is a regular solution of (SVI) if V is C¹-smooth in a neighborhood of x⋆ and the Jacobian Jac V(x⋆) is positive-definite along rays emanating from x⋆, i.e.,
z⊺ Jac V(x⋆) z ≡ ∑i,j zi (∂Vi/∂xj)(x⋆) zj > 0 for all z ∈ Rd ∖ {0} that are tangent to X at x⋆.

To be compared with
positive definiteness of the Hessian along qualified constraints in minimization;
differential equilibrium in games.

Localization of strong monotonicity.
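Regularity can be probed numerically (a sketch on a hypothetical non-monotone field whose solution sits at the origin): estimate Jac V(x⋆) by finite differences and test z⊺ Jac V(x⋆) z > 0 on sampled directions.

import numpy as np

# Hypothetical field, non-monotone globally, with V(0) = 0 and Jacobian I + rotation at 0.
def V(x):
    t, p = x
    return np.array([t - t**3 + p, p - p**3 - t])

def jacobian(F, x, eps=1e-6):
    # Central finite-difference Jacobian of F at x
    return np.column_stack([(F(x + eps * e) - F(x - eps * e)) / (2 * eps) for e in np.eye(x.size)])

J = jacobian(V, np.zeros(2))
zs = np.random.default_rng(0).normal(size=(1000, 2))
print(np.all(np.einsum("ij,jk,ik->i", zs, J, zs) > 0))   # True: x* = 0 is a regular solution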
Local Convergence
Theorem [Local convergence for stochastic non-monotone operators]
Let x⋆ be a regular solution of (SVI) and fix a tolerance level δ > 0. Suppose (PEG) is run with step-sizes of the form γt = γ/(t + b) for large enough γ and b. Then:
(a) There are neighborhoods U and U1 of x⋆ in X such that, if X1/2 ∈ U and X1 ∈ U1, the event
E∞ = {Xt+1/2 ∈ U for all t = 1, 2, . . .}
occurs with probability at least 1 − δ.
(b) Conditioning on this event, we have E[‖Xt − x⋆‖² | E∞] = O(1/t).
Experiments
Test problem (ε1, ε2 select the regime):
L(θ,φ) = (ε1/2) θ⊺A1θ + ε2 (θ⊺A2θ)² − (ε1/2) φ⊺B1φ − ε2 (φ⊺B2φ)² + 4 θ⊺Cφ

[Figure: ‖x − x⋆‖² versus number of oracle calls for EG and 1-EG at several step sizes γ.]
(a) Strongly monotone (ε1 = 1, ε2 = 0), deterministic, last iterate; γ ∈ {0.1, 0.2, 0.3, 0.4}.
(b) Monotone (ε1 = 0, ε2 = 1), deterministic, ergodic; γ ∈ {0.2, 0.4, 0.6, 0.8}.
(c) Non-monotone (ε1 = 1, ε2 = −1), stochastic with Zt iid ∼ N(0, σ² = 0.01), last iterate (b = 15); γ ∈ {0.3, 0.6, 0.9, 1.2}.
Conclusion and Perspectives
Single-call rates ∼ two-call rates.
Localization of the stochastic guarantees.
Last-iterate convergence: a first step into the non-monotone world.
Some research directions: Bregman, universal, . . .
Bibliography
Daskalakis, Constantinos et al. (2018). “Training GANs with optimism”. In: ICLR ’18: Proceedings of the
2018 International Conference on Learning Representations.
Juditsky, Anatoli, Arkadi Semen Nemirovski, and Claire Tauvel (2011). “Solving variational inequalities with
stochastic mirror-prox algorithm”. In: Stochastic Systems 1.1, pp. 17–58.
Korpelevich, G. M. (1976). “The extragradient method for finding saddle points and other problems”. In:
Èkonom. i Mat. Metody 12, pp. 747–756.
Malitsky, Yura (2015). “Projected reflected gradient methods for monotone variational inequalities”. In:
SIAM Journal on Optimization 25.1, pp. 502–520.
Nemirovski, Arkadi Semen (2004). “Prox-method with rate of convergence O(1/t) for variational inequalities
with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems”. In:
SIAM Journal on Optimization 15.1, pp. 229–251.
Nesterov, Yurii (2007). “Dual extrapolation and its applications to solving variational inequalities and related
problems”. In: Mathematical Programming 109.2, pp. 319–344.
Popov, Leonid Denisovich (1980). “A modification of the Arrow–Hurwicz method for search of saddle
points”. In: Mathematical Notes of the Academy of Sciences of the USSR 28.5, pp. 845–848.
Tseng, Paul (June 1995). “On linear convergence of iterative methods for the variational inequality problem”.
In: Journal of Computational and Applied Mathematics 60.1-2, pp. 237–252.