An Augmented Lagrangian Value Function Method for Lower-level Constrained Stochastic Bilevel Optimization

Hantao Nie (School of Mathematical Science, Peking University, Beijing, China; nht@pku.edu.cn)    Jiaxiang Li (Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN, USA; li003755@umn.edu)    Zaiwen Wen (School of Mathematical Science, Peking University, Beijing, China; wenzw@pku.edu.cn)
Abstract

Recently, lower-level constrained bilevel optimization has attracted increasing attention. However, existing methods mostly focus on either deterministic cases or problems with linear constraints. The main challenge in stochastic cases with general constraints is the bias and variance of the hyper-gradient, arising from the inexact solution of the lower-level problem. In this paper, we propose a novel stochastic augmented Lagrangian value function method for solving stochastic bilevel optimization problems with nonlinear lower-level constraints. Our approach reformulates the original bilevel problem using an augmented Lagrangian-based value function and then applies a penalized stochastic gradient method that carefully manages the noise from stochastic oracles. We establish an equivalence between the stochastic single-level reformulation and the original constrained bilevel problem and provide a non-asymptotic rate of convergence for the proposed method. The rate is further enhanced by employing variance reduction techniques. Extensive experiments on synthetic problems and real-world applications demonstrate the effectiveness of our approach.

Keywords: Bilevel optimization, Lower-level constraint, Stochastic optimization, Augmented Lagrangian method

1 Introduction

We consider the stochastic lower-level constrained bilevel optimization (stochastic LC-BLO) problem. The lower-level problem is defined as

\min_{y\in Y}\quad G(x,y)=\mathbb{E}_{\xi\sim\mathcal{D}_{\xi}}[g(x,y;\xi)] \qquad (1.1)
\text{s.t.}\quad H_{i}(x,y)\leq 0,\quad i=1,\ldots,p,

where Y\subseteq\mathbb{R}^{n} is a convex compact set, \xi is a random variable in the space \Omega_{\xi}, and \mathcal{D}_{\xi} is the distribution of \xi. G(x,y) and H(x,y)=[H_{1}(x,y),\ldots,H_{p}(x,y)]^{T} are the lower-level objective function and constraint function, respectively. We also denote the feasible set of the lower-level problem by \mathcal{Y}(x):=\{y\in Y\mid H(x,y)\leq 0\}. The bilevel optimization (BLO) problem is

\min_{x\in X}\quad F(x,y^{*}(x))=\mathbb{E}_{\zeta\sim\mathcal{D}_{\zeta}}[f(x,y^{*}(x);\zeta)] \qquad (1.2)
\text{s.t.}\quad y^{*}(x)\in\arg\min_{y\in\mathcal{Y}(x)}G(x,y),

where X\subseteq\mathbb{R}^{m} is a convex compact set, \zeta\in\Omega_{\zeta} is a random variable, and \mathcal{D}_{\zeta} is the distribution of \zeta. We assume the gradient oracles \nabla g(x,y;\xi) and \nabla f(x,y;\zeta) have unavoidable noise. This framework includes deterministic LC-BLO as a special case.

As depicted in (1.2), this hierarchical structure captures a learning‑to‑learn philosophy that underpins numerous modern machine‑learning pipelines. Hence, BLO plays a critical role in various machine learning tasks, including hyperparameter optimization (2, 7, 16, 22), model-agnostic meta-learning (7, 11, 18, 30), and reinforcement learning (23, 21, 27, 14). Recently, the lower-level constrained BLO (LC-BLO) has attracted increasing attention due to its wide applications such as transportation (12), kernelized SVM (29), meta-learning (26), and data hyper-cleaning  (26, 28).

Several methods have been proposed to solve the deterministic LC-BLO problem. The methodologies for solving LC-BLO can be broadly categorized into implicit gradient-based (IG) approaches and lower-level value function-based (LLVF) techniques. Implicit gradient methods primarily focus on computing the hyper-gradient with some implicit gradient approximation of \frac{d}{dx}y^{*}(x). The key issue here is that y^{*}(x) may not be differentiable when the lower-level problem has constraints. Some research has discussed the conditions under which the hyper-gradient exists. For the linearly constrained case, IG-AL (24) discussed the smoothness of y^{*}(x) by introducing the Lagrangian multiplier of the lower-level problem. SIGD (15) develops a smoothing approximation implicit gradient method to handle non-differentiable points. Recent works (25, 13) also consider a barrier reformulation of LC-BLO that moves the lower-level constraints into the objective function. Despite these efforts, the main challenge in designing implicit gradient methods is that the second-order derivative of the lower-level objective is required to compute the hyper-gradient, which is computationally expensive.

To overcome the high computational cost of implicit gradient computation, value function-based methods have been proposed as a Hessian-free alternative. Value function-based methods introduce the value function of the lower-level problem and then replace the optimality condition y𝒴(x)y\in\mathcal{Y}(x) in (1.2) with an inequality condition on the value function. The original BLO (1.2) is then equivalently reformulated as a single-level problem. Different value-function formulations have been widely studied in the literature, including the value function (12), a Moreau envelope-based value function (8), the proximal Lagrangian function (29), and the regularized gap function (28).

To the best of our knowledge, no existing work analyzes the BLO with nonlinear lower-level constraints in the stochastic setting, which has broad applications in real-world machine learning tasks. Solving stochastic LC-BLO presents two fundamental challenges. The first is the nonsmoothness of the hyper-objective function, which arises from the coupling between the upper-level and lower-level problems. The second is the bias of the hyper-gradient due to the inexact solution of the lower-level problem. To address these challenges, we introduce an augmented Lagrangian function and its Moreau envelope to reformulate the bilevel problem as a single-level problem, while ensuring that the solution remains close to the optimal solution of the original problem. We then propose a novel stochastic value function-based method for stochastic LC-BLO, which carefully controls the bias of the gradient oracle to achieve convergence.

1.1 Main contribution

Our main contributions are as follows:

1. We introduce a novel reformulation of the stochastic LC-BLO by leveraging the stochastic augmented Lagrangian function and its Moreau envelope (see (2.2)). This reformulation transforms the bilevel problem into a single-level problem, effectively addressing the noise arising from the inexact solution of the lower-level problem (see (2.3), (2.7)). Notably, we also ensure that the solution of the reformulated problem remains close to the optimal solution of the original bilevel problem (see Theorems 3.1 and 3.2), providing a practical yet theoretically grounded approach to stochastic LC-BLO.

2. We propose a novel Hessian-free method based on the stochastic reformulation for solving the stochastic LC-BLO (see Algorithm 2). Our work provides the first convergence analysis of value function-based algorithms for nonlinear LC-BLO in the stochastic setting. The issue of biased gradients is mitigated by controlling the bias through the accuracy of lower-level solutions. We derive a non-asymptotic convergence rate, proving that our method achieves (\widetilde{O}(c\epsilon^{-2}),\widetilde{O}(cc_{1}^{2}\epsilon^{-2})) sample complexity on (\zeta,\xi), where c_{1},c_{2} denote the penalty parameters in the reformulation and c=\max(c_{1},c_{2}) (see Theorem 3.4, Remark 3.2). The sample complexity on \zeta is further improved to \widetilde{O}(c^{1.5}\epsilon^{-1.5}) by employing variance reduction techniques (see Theorem 3.5, Remark 3.4). In Table 1, we briefly summarize the existing approaches compared to the proposed methods.

Table 1: Comparison of methods for solving BLO with nonlinear convex lower-level constraints. “H.-free”, “Sto”, “Iteration”, “Sample” stand for Hessian-free, stochastic, iteration complexity, and sample complexity, respectively. Iteration complexity is the number of outer-loop iterations needed to achieve the target accuracy \epsilon, measured by the squared gradient norm. For stochastic algorithms, sample complexities denote the number of samples for the upper and lower stochastic variables (\zeta,\xi) needed to achieve the target accuracy \epsilon. The proposed methods are SALVF (Algorithm 2) and SALVF-VR (Algorithm 3). We compare with four existing value function-based deterministic methods: BLOCC (12), LV-HBA (29), BiC-GAFFA (28), and GAM (26).

Method    | H.-free | Sto | Iteration                                   | Sample (upper, lower)
BLOCC     | yes     | no  | \tilde{\mathcal{O}}(c\epsilon^{-1})         | --
LV-HBA    | yes     | no  | \mathcal{O}(\epsilon^{-2p}),\ p>0.5         | --
BiC-GAFFA | yes     | no  | \mathcal{O}(\epsilon^{-2p}),\ p>0.5         | --
GAM       | no      | no  | --                                          | --
SALVF     | yes     | yes | \tilde{\mathcal{O}}(c\epsilon^{-1})         | \tilde{\mathcal{O}}(c\epsilon^{-2}),\ \tilde{\mathcal{O}}(cc_{1}^{2}\epsilon^{-2})
SALVF-VR  | yes     | yes | \tilde{\mathcal{O}}(c^{1.5}\epsilon^{-1.5}) | \tilde{\mathcal{O}}(c^{1.5}\epsilon^{-1.5}),\ \tilde{\mathcal{O}}(c^{1.5}c_{1}^{2}\epsilon^{-2.5})

1.2 Notation

For a multivariate function f(x_{1},\ldots,x_{k}), its partial derivative with respect to the i-th variable is denoted by \nabla_{i}f(x_{1},\ldots,x_{k}). Given a random variable x(\xi)\in\mathbb{R}^{n} with \xi\sim\mathcal{D}_{\xi}, we represent its expectation and covariance matrix by \mathbb{E}_{\xi\sim\mathcal{D}_{\xi}}[x(\xi)] and \mathrm{Var}_{\xi\sim\mathcal{D}_{\xi}}[x(\xi)], respectively. The trace of its covariance matrix is denoted by \mathbb{V}_{\xi\sim\mathcal{D}_{\xi}}[x(\xi)]=\mathrm{Tr}(\mathrm{Var}_{\xi\sim\mathcal{D}_{\xi}}[x(\xi)])=\mathbb{E}_{\xi\sim\mathcal{D}_{\xi}}[\|x-\mathbb{E}_{\xi\sim\mathcal{D}_{\xi}}[x]\|^{2}]. If the distribution \mathcal{D}_{\xi} is clear from the context, we abbreviate these notations as \mathbb{E}_{\xi}[x(\xi)], \mathrm{Var}_{\xi}[x(\xi)] and \mathbb{V}_{\xi}[x(\xi)], respectively. Denote the normal cone of a convex set C at y as \mathcal{N}_{C}(y)=\{v\mid\langle v,w-y\rangle\leq 0,\forall w\in C\}. For a scalar a, we define [a]_{+}=\max\{0,a\}. For a vector v, we write [v]_{+}=([v_{1}]_{+},\ldots,[v_{n}]_{+})^{T} and [v]_{+}^{2}=([v_{1}]_{+}^{2},\ldots,[v_{n}]_{+}^{2})^{T}. For s independent identically distributed random variables \mathbf{\xi}=(\xi_{1},\ldots,\xi_{s}), we denote the joint distribution by \mathcal{D}_{\xi}^{s}. Further, given some function g(x,\xi), we define the empirical average as g(x,\mathbf{\xi})=\frac{1}{s}\sum_{i=1}^{s}g(x,\xi_{i}).

2 Stochastic augmented value function-based method

In this section, we propose a novel value function-based reformulation for (1.2) and a stochastic value function method.

2.1 Stochastic augmented Lagrangian reformulation

We propose a novel reformulation using a stochastic augmented Lagrangian value function that transforms (1.2) into a single-level problem. First, the constraints in (1.1) are addressed by the augmented Lagrangian function and its corresponding stochastic version. The bilevel optimization problem (1.2) is then transformed into a single-level problem via this augmented Lagrangian function-based formulation.

For the lower-level problem (1.1), an augmented Lagrangian penalty term is introduced by penalizing the constraints using

\mathcal{A}_{\gamma_{1}}(x,y,z)=\frac{1}{2\gamma_{1}}\sum_{i=1}^{p}[\gamma_{1}z_{i}+H_{i}(x,y)]_{+}^{2},

where z_{1},\ldots,z_{p}\geq 0 and \gamma_{1}>0 is a penalty parameter. The augmented Lagrangian function and its stochastic oracle are defined by adding the penalty term to the objective function and its stochastic oracle, respectively, that is,

\mathcal{L}_{\gamma_{1}}(x,y,z)=G(x,y)+\mathcal{A}_{\gamma_{1}}(x,y,z), \qquad (2.1)
\mathcal{L}_{\gamma_{1}}(x,y,z;\xi)=g(x,y;\xi)+\mathcal{A}_{\gamma_{1}}(x,y,z).
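For concreteness, here is a minimal NumPy sketch of evaluating (2.1); `H_vals` stands for the constraint vector H(x,y) and `g_val` for a single stochastic objective value g(x,y;\xi), both placeholders of our own rather than code from the paper.

```python
import numpy as np

def al_penalty(H_vals, z, gamma1):
    """Penalty term A_{gamma1}(x, y, z) = (1 / (2 gamma1)) * sum_i [gamma1 z_i + H_i(x, y)]_+^2."""
    return np.sum(np.maximum(gamma1 * np.asarray(z) + np.asarray(H_vals), 0.0) ** 2) / (2.0 * gamma1)

def stoch_aug_lagrangian(g_val, H_vals, z, gamma1):
    """Stochastic augmented Lagrangian L_{gamma1}(x, y, z; xi) = g(x, y; xi) + A_{gamma1}(x, y, z)."""
    return g_val + al_penalty(H_vals, z, gamma1)
```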

The augmented dual function and its Moreau envelope are then defined as

D_{\gamma_{1}}(x,z)=\min_{y\in Y}\mathcal{L}_{\gamma_{1}}(x,y,z), \qquad (2.2)
E_{\gamma_{1}}^{\gamma_{2}}(x,z)=\max_{\lambda\in\mathbb{R}_{+}^{p}}\Big\{D_{\gamma_{1}}(x,\lambda)-\frac{\gamma_{2}}{2}\|\lambda-z\|^{2}\Big\},

where \gamma_{2}\geq 0 is a regularization parameter. Then (1.2) is reformulated as an equivalent single-level problem

\min_{(x,y,z)\in X\times Y\times\mathbb{R}_{+}^{p}}\quad F(x,y) \qquad (2.3)
\text{s.t.}\quad\mathcal{G}(x,y,z):=G(x,y)-E_{\gamma_{1}}^{\gamma_{2}}(x,z)\leq 0,
\qquad\quad H(x,y)\leq 0.

In this paper, \gamma_{1},\gamma_{2} are fixed parameters. We omit the subscript \gamma_{1} and the superscript \gamma_{2} in D_{\gamma_{1}} and E_{\gamma_{1}}^{\gamma_{2}} to simplify notation. The envelope-based value function reformulation (2.3) contains the value function-based reformulation

\min_{(x,y)\in X\times Y}\quad F(x,y) \qquad (2.4)
\text{s.t.}\quad G(x,y)-\min_{y\in\mathcal{Y}(x)}G(x,y)\leq 0,
\qquad\quad H(x,y)\leq 0,

as a special case when \gamma_{2}=0. The relationship between them is discussed in Appendix A.5. By introducing the auxiliary variable z, the advantage of the augmented Lagrangian function-based reformulation is that the subproblem (2.5a) becomes strongly-convex–strongly-concave, which ensures faster convergence of the inner loop.

To evaluate E(x,z) in (2.2), we need to estimate

(w^{*},\lambda^{*})=\arg\max_{\lambda\in\mathbb{R}_{+}^{p}}\min_{w\in Y}\left\{\ell_{\gamma}(x,z,w,\lambda)\right\}, \qquad (2.5a)
\text{where}\quad\ell_{\gamma}(x,z,w,\lambda):=\mathcal{L}_{\gamma_{1}}(x,w,\lambda)-\frac{\gamma_{2}}{2}\|\lambda-z\|^{2}. \qquad (2.5b)

However, in the stochastic setting, the exact solution is inaccessible due to unavoidable noise in the gradient oracles. To address this, we consider approximating the solution of (2.5a) using stochastic algorithms. Specifically, let \mathbf{\xi}=(\xi_{1},\ldots,\xi_{s})\in\Omega_{\xi}^{s} be samples from \mathcal{D}_{\xi}^{s}. Denote \mathcal{P}_{w},\mathcal{P}_{\lambda} as the spaces of random variables mapping \mathbf{\xi} to Y and \mathbb{R}_{+}^{p}, respectively, that is,

\mathcal{P}_{w}=\{\hat{w}:\Omega_{\xi}^{s}\to Y\mid\hat{w}\text{ is measurable}\},
\mathcal{P}_{\lambda}=\{\hat{\lambda}:\Omega_{\xi}^{s}\to\mathbb{R}_{+}^{p}\mid\hat{\lambda}\text{ is measurable}\}.

Assume (\hat{w},\hat{\lambda})\in\mathcal{P}_{w}\times\mathcal{P}_{\lambda} is a stochastic algorithm solving the subproblem (2.5a) using samples \mathbf{\xi}, so that (\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi})) is a pair of approximate solutions of (2.5a). We are interested in the subset of “good enough” algorithms that provide a sufficiently accurate solution of the subproblem (2.5a):

\mathcal{P}(\delta)=\left\{(\hat{w},\hat{\lambda})\in\mathcal{P}_{w}\times\mathcal{P}_{\lambda}\ \middle|\ \left|\mathbb{E}_{\mathbf{\xi}}\left[\ell_{\gamma}(x,z,\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi}))\right]-E(x,z)\right|\leq\delta\right\}. \qquad (2.6)

With this estimator (\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi})), we approximate the envelope function by \ell_{\gamma}(x,z,\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi})). This gives the following stochastic value function-based reformulation:

\min_{(x,y,z)\in X\times Y\times\mathbb{R}_{+}^{p}}\quad F(x,y) \qquad (2.7)
\text{s.t.}\quad\hat{\mathcal{G}}(x,y,z;\mathbf{\xi})\leq\epsilon_{1},\quad\frac{1}{2}\sum_{i=1}^{p}[H_{i}(x,y)]_{+}^{2}\leq\epsilon_{2}^{2},

where \hat{\mathcal{G}}(x,y,z;\mathbf{\xi})=G(x,y)-\ell_{\gamma}(x,z,\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi})), and \epsilon_{1},\epsilon_{2} are the target accuracies of the lower-level objective function and constraint violation, respectively. The equivalence between (1.2) and (2.7) is established in Theorem 3.2. Compared to (2.3), (2.7) incorporates the inexactness of the lower-level solution into the formulation, making it more practical in the stochastic setting.

Algorithm 1 (w^{k},\lambda^{k})=\text{SALM}(x^{k-1},z^{k-1},s,\gamma_{1},\gamma_{2},\eta,\rho;\mathbf{\xi}^{k})
1: Input: x^{k-1},z^{k-1}, iteration count s, primal step sizes \eta_{j}, dual step sizes \rho_{j} for 0\leq j\leq s-1.
2: Initialize w^{k,0} and \lambda^{k,0}=0.
3: for j=0 to s-1 do
4:   Update (w^{k,j+1},\lambda^{k,j+1}) by (2.8a) and (2.8b).
5: end for
6: Output (w^{k},\lambda^{k})=(w^{k,s},\lambda^{k,s}).
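For reference, a minimal NumPy sketch of Algorithm 1 is given below; it implements the projected stochastic gradient descent–ascent updates (2.8a)–(2.8b) referenced in line 4. The oracles `grad_w_ell` and `grad_lam_ell` (stochastic gradients of \ell_{\gamma} in w and \lambda), the projection `proj_Y`, and the sampler `sample_xi` are placeholder callables of our own, not part of the paper's implementation.

```python
import numpy as np

def salm(x, z, s, eta, rho, grad_w_ell, grad_lam_ell, proj_Y, sample_xi, w0, p):
    """Sketch of Algorithm 1 (SALM): s projected stochastic gradient
    descent-ascent steps on ell_gamma(x, z, w, lambda) with (x, z) fixed."""
    w, lam = np.array(w0, dtype=float), np.zeros(p)
    for j in range(s):
        xi = sample_xi()                                  # draw xi_j^k ~ D_xi
        eta_j, rho_j = eta / (j + 1), rho / (j + 1)       # diminishing steps, cf. (3.5)
        w_next = proj_Y(w - eta_j * grad_w_ell(x, z, w, lam, xi))              # (2.8a)
        lam = np.maximum(lam + rho_j * grad_lam_ell(x, z, w, lam, xi), 0.0)    # (2.8b), Proj onto R_+^p
        w = w_next
    return w, lam
```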

2.2 Value function-based penalized method

In this subsection, we develop a stochastic value function-based penalized method for the reformulation (2.7). At the k-th iteration, a stochastic gradient descent–ascent method is applied to compute an approximate solution of the subproblem (2.5a) with fixed (x^{k-1},z^{k-1}). More specifically, the primal and dual variables are updated by

w^{k,j+1}=\mathrm{Proj}_{Y}\left(w^{k,j}-\eta_{j}\nabla_{w}\ell_{\gamma}(x^{k-1},z^{k-1},w^{k,j},\lambda^{k,j};\xi^{k}_{j})\right), \qquad (2.8a)
\lambda^{k,j+1}=\mathrm{Proj}_{\mathbb{R}_{+}^{p}}\left(\lambda^{k,j}+\rho_{j}\nabla_{\lambda}\ell_{\gamma}(x^{k-1},z^{k-1},w^{k,j},\lambda^{k,j};\xi^{k}_{j})\right), \qquad (2.8b)

where \eta_{j},\rho_{j} denote the primal and dual step sizes, respectively. The complete procedure is shown in Algorithm 1. We then consider an augmented Lagrangian-based penalty reformulation:

\min_{(x,y,z)\in X\times Y\times Z}\quad\mathbb{E}_{\mathbf{\xi}}[\Psi(x,y,z;\mathbf{\xi})], \qquad (2.9)
\text{where}\quad\Psi(x,y,z;\mathbf{\xi}):=F(x,y)+c_{1}\hat{\mathcal{G}}(x,y,z;\mathbf{\xi})+\frac{c_{2}}{2}\sum_{i=1}^{p}[H_{i}(x,y)]_{+}^{2},

where Z=[0,p^{-0.5}B]^{p} is the domain of z, B is a constant, and c_{1},c_{2}>0 are the penalty parameters (it can be shown that the optimal z for (2.9) is contained in the domain Z under appropriate regularity assumptions; see Assumption 3.6 and Lemma A.7). Different from the standard duality-based approach, the multiplier-type variable z is treated as an optimization variable in this reformulation. The advantage of this penalized reformulation is that convexity of the objective function in (2.7) is not required. The equivalence between (2.7) and (2.9) is established in Theorem 3.1. We consider a stochastic gradient descent method for solving (2.9). Denote the variables in (2.9) and their feasible region as

\mathbf{u}=(x,y,z),\quad\mathcal{U}=X\times Y\times Z,

for simplicity. The gradient oracle of the objective function of (2.9) is given by

\nabla\Psi(\mathbf{u};\mathbf{\zeta},\mathbf{\xi},\mathbf{\tilde{\xi}})=\nabla f(x,y;\mathbf{\zeta})+c_{1}\nabla\hat{\mathcal{G}}(\mathbf{u};\mathbf{\xi},\mathbf{\tilde{\xi}})+c_{2}\sum_{i=1}^{p}[H_{i}(x,y)]_{+}\nabla H_{i}(x,y), \qquad (2.10)

with the mini-batched stochastic oracles given by

\nabla f(x,y;\mathbf{\zeta})=\frac{1}{r}\sum_{j=1}^{r}\nabla f(x,y;\zeta_{j}), \qquad (2.11)
\nabla\hat{\mathcal{G}}(\mathbf{u};\mathbf{\xi},\mathbf{\tilde{\xi}})=\frac{1}{q}\sum_{j=1}^{q}\nabla g(x,y;\tilde{\xi}_{j})-\frac{1}{q}\sum_{j=1}^{q}\nabla\tilde{\mathcal{L}}(x,z,\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi});\tilde{\xi}_{j}).

By substituting r=r_{k},q=q_{k} in (2.10) and (2.11), and conditioning on \mathbf{\xi}^{k}, we write

\hat{\mathcal{G}}^{k}(\mathbf{u})=\hat{\mathcal{G}}(x,y,z;\mathbf{\xi}^{k}),\quad\Psi^{k}(\mathbf{u})=\Psi(x,y,z;\mathbf{\xi}^{k}), \qquad (2.12)
\nabla\Psi^{k}(\mathbf{u};\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k})=\nabla\Psi(x,y,z;\mathbf{\zeta}^{k},\mathbf{\xi}^{k},\mathbf{\tilde{\xi}}^{k}).

The complete procedure is summarized as follows. At the k-th iteration, an estimator (w^{k},\lambda^{k})=(\hat{w}(x^{k-1},z^{k-1};\mathbf{\xi}^{k}),\hat{\lambda}(x^{k-1},z^{k-1};\mathbf{\xi}^{k})) is computed using Algorithm 1 with samples \mathbf{\xi}^{k}=(\xi_{1}^{k},\ldots,\xi_{s_{k}}^{k})\sim\mathcal{D}_{\xi}^{s_{k}}. Then a stochastic gradient descent step is applied to (2.9) with samples \mathbf{\zeta}^{k}=(\zeta_{1}^{k},\ldots,\zeta_{r_{k}}^{k})\sim\mathcal{D}_{\zeta}^{r_{k}} and \mathbf{\tilde{\xi}}^{k}=(\tilde{\xi}_{1}^{k},\ldots,\tilde{\xi}_{q_{k}}^{k})\sim\mathcal{D}_{\xi}^{q_{k}}. After K iterations, we output (x^{R},y^{R}), where the index R is randomly chosen according to the probability mass function

\mathrm{Prob}(R=k)=\frac{\alpha_{k}}{\sum_{l=0}^{K-1}\alpha_{l}},\quad k=0,\ldots,K-1. \qquad (2.13)

Besides, an extra SALM loop

(y^{\prime},z^{\prime})=\textbf{SALM}(x^{R},0,s^{K},\gamma_{1},0,\eta,\rho;\mathbf{\xi}^{K}), \qquad (2.14)

can be applied to guarantee the feasibility of the final output. The complete procedure is shown in Algorithm 2.

Algorithm 2 SALVF
1: Input: penalty parameters c_{1},c_{2}, iteration number K, sample sizes s_{k},r_{k},q_{k}, and step sizes \eta_{k},\rho_{k},\alpha_{k}.
2: Initialize x^{0},y^{0},z^{0}.
3: for k=0 to K-1 do
4:   Run Algorithm 1 with samples \mathbf{\xi}^{k} to compute
       (\hat{w}^{k},\hat{\lambda}^{k})=\text{SALM}(x^{k-1},z^{k-1},s_{k},\gamma_{1},\gamma_{2},\eta_{k},\rho_{k};\mathbf{\xi}^{k}).
5:   Sample \mathbf{\zeta}^{k}\sim\mathcal{D}_{\zeta}^{r_{k}} and \mathbf{\tilde{\xi}}^{k}\sim\mathcal{D}_{\xi}^{q_{k}}.
6:   Compute the direction d^{k}=\nabla\Psi^{k}(\mathbf{u}^{k};\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k}) by (2.10).
7:   Update \mathbf{u}^{k+1}=\mathrm{Proj}_{\mathcal{U}}(\mathbf{u}^{k}-\alpha_{k}d^{k}).
8: end for
9: Choose index R with probability mass function (2.13). Output (x^{R},y^{R}).
10: (Optional) Compute (2.14) and output (x^{R},y^{\prime}).
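To make the outer loop concrete, here is a minimal NumPy sketch of Algorithm 2 under simplifying assumptions of our own: \mathbf{u}=(x,y,z) is stored as one flat vector, `salm_step` wraps Algorithm 1, `grad_psi_hat` evaluates the penalized stochastic gradient (2.10) (with c_{1},c_{2} folded inside), and `proj_U` projects onto \mathcal{U}; all of these callables are placeholders rather than the authors' code.

```python
import numpy as np

def salvf(u0, K, alpha, s, salm_step, grad_psi_hat, proj_U, seed=0):
    """Sketch of Algorithm 2 (SALVF) with constant step size alpha_k = alpha."""
    rng = np.random.default_rng(seed)
    u = np.array(u0, dtype=float)
    iterates = []
    for k in range(K):
        w_hat, lam_hat = salm_step(u, s)        # inner loop: Algorithm 1 with s_k = s steps
        d = grad_psi_hat(u, w_hat, lam_hat)     # stochastic gradient of Psi^k, eq. (2.10)
        u = proj_U(u - alpha * d)               # projected stochastic gradient step
        iterates.append(u.copy())
    R = rng.integers(K)                         # (2.13) reduces to a uniform draw for constant alpha_k
    return iterates[R]
```

As in line 10 of the algorithm, an optional final SALM call as in (2.14) can be appended to return a feasible lower-level solution at x^{R}.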

2.3 Variance reduced SALVF method

When sampling \zeta is significantly more expensive than sampling \xi, a natural question arises: is it possible to reduce the sample size of \zeta, thereby allowing an increase in the sample size of \xi? In this subsection, we apply a variance reduction technique with an update rule similar to STORM (5) to reduce the sample complexity on \zeta. At each iteration, the direction d^{k} is updated by

d^{k}=\nabla\Psi^{k}(\mathbf{u}^{k};\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k})+(1-\beta_{k})\big(d^{k-1}-\nabla\Psi^{k}(\mathbf{u}^{k-1};\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k})\big). \qquad (2.15)

Unlike the STORM method, our approach deals with a different scenario, where the main challenge arises from the biased gradient oracle \nabla\Psi^{k} due to the inexactness of the estimator (\hat{w}^{k},\hat{\lambda}^{k}). This challenge is addressed by carefully handling the extra bias term and designing proper coefficients \beta_{k}. The complete procedure is summarized in Algorithm 3 and the convergence guarantee is provided in Theorem 3.5.
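A minimal sketch of the recursive direction (2.15) follows (our own illustration; `grad_new` and `grad_old` stand for \nabla\Psi^{k} evaluated at \mathbf{u}^{k} and \mathbf{u}^{k-1} on the same fresh samples (\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k})).

```python
def vr_direction(grad_new, grad_old, d_prev, beta):
    """STORM-style recursive momentum, eq. (2.15):
    d^k = grad_new + (1 - beta_k) * (d^{k-1} - grad_old).
    Evaluating grad_new and grad_old on the same mini-batch is what
    cancels most of the sampling noise in the accumulated direction."""
    return grad_new + (1.0 - beta) * (d_prev - grad_old)
```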

Algorithm 3 SALVF-VR
1: Input: penalty parameters c_{1},c_{2}, iteration number K, sample sizes s_{k},r_{k},q_{k}, step sizes \eta_{k},\rho_{k},\alpha_{k}, and coefficients \beta_{k}.
2: Initialize x^{0},y^{0},z^{0}.
3: for k=0 to K-1 do
4:   Run Algorithm 1 with samples \mathbf{\xi}^{k} to compute
       (\hat{w}^{k},\hat{\lambda}^{k})=\text{SALM}(x^{k-1},z^{k-1},s_{k},\gamma_{1},\gamma_{2},\eta_{k},\rho_{k};\mathbf{\xi}^{k}).
5:   Sample \mathbf{\zeta}^{k} from \mathcal{D}_{\zeta}^{r_{k}} and \mathbf{\tilde{\xi}}^{k} from \mathcal{D}_{\xi}^{q_{k}}.
6:   if k=0 then
7:     Compute d^{0}=\nabla\Psi(\mathbf{u}^{0};\mathbf{\zeta}^{0},\mathbf{\tilde{\xi}}^{0}).
8:   else
9:     Update the direction d^{k} by (2.15).
10:  end if
11:  Update \mathbf{u}^{k+1}=\mathrm{Proj}_{\mathcal{U}}(\mathbf{u}^{k}-\alpha_{k}d^{k}).
12: end for
13: Choose index R with probability mass function (2.13). Output (x^{R},y^{R}).
14: (Optional) Compute (2.14) and output (x^{R},y^{\prime}).

3 Theoretical analysis

3.1 Basic assumptions

In this section, we examine the properties of the penalized reformulation (2.9) and provide non-asymptotic convergence guarantees for the proposed algorithms (Algorithms 2 and 3). First, some basic assumptions from the stochastic bilevel optimization literature (10, 4) are introduced, covering the smoothness, convexity, boundedness, and stochastic oracles associated with the objective and constraints.

Assumption 3.1.

(Lipschitz continuity) Assume that \nabla F(x,y), \nabla G(x,y), \nabla H(x,y) are L_{F},L_{G},L_{H}-Lipschitz continuous, respectively.

Assumption 3.2.

(Convexity) Assume G(x,y) is \mu_{G}-strongly convex in y for any x\in X, and H(x,y) is convex in y for any x\in X.

Remark 3.1.

Assumption 3.2 implies that y^{*}(x) defined in (1.2) is unique for any x\in X.

Assumption 3.3.

(Boundedness) Assume \nabla G(x,y), H(x,y) and \nabla H(x,y) are bounded, that is,

\|\nabla G(x,y)\|\leq M_{G,1},\qquad|H_{i}(x,y)|\leq M_{H,0},\qquad\|\nabla H_{i}(x,y)\|\leq M_{H,1},\quad 1\leq i\leq p.
Assumption 3.4.

(Stochastic derivative) The stochastic oracles \nabla f(x,y;\zeta), \nabla g(x,y;\xi) are unbiased estimators of \nabla F(x,y), \nabla G(x,y), respectively, and their variances are bounded by \sigma_{f}^{2},\sigma_{g}^{2}, respectively.

To ensure strong duality and regularity of the optimal points, we assume that Slater’s condition and the linear independence constraint qualification (LICQ) hold for the lower-level constraints, which are common assumptions in nonlinear optimization analysis.

Assumption 3.5.

(LL Slater’s condition) For any fixed x\in X, Slater’s condition holds for (1.1), that is, there exist \epsilon_{0}(x)>0 and y_{0}(x) such that

H_{i}(x,y_{0}(x))<-\epsilon_{0}(x),\quad i=1,\ldots,p.
Assumption 3.6.

(LICQ) For any x\in X and y=y^{*}(x), the set \{\nabla_{y}H_{i}(x,y)\mid H_{i}(x,y)=0\} is linearly independent. Denote the matrix \mathcal{C}(x,y)=[\nabla_{y}H_{i}(x,y)]_{i\in\{i\mid H_{i}(x,y)=0\}}. Since X,Y are compact sets, we further assume the smallest singular value satisfies

\sigma_{\min}(\mathcal{C}(x,y)\mathcal{C}(x,y)^{\top})\geq\sigma_{0}^{2}>0,\quad\forall(x,y)\in X\times Y.

3.2 Equivalence of reformulations

In this subsection, the equivalence between the reformulation (2.9) and the original BLO formulation (1.2) is established. We take B=p^{2}\sigma_{0}^{2}M_{H,1}(M_{G,1}+pM_{H,1}) as provided in (2.9) throughout the analysis. The deterministic case is first analyzed as a special case. The equivalence is then extended to the stochastic case, emphasizing the key improvements introduced by the stochastic reformulation.

3.2.1 Deterministic case

The following theorem establishes the equivalence between the penalized form of (2.3) and the original BLO.

Theorem 3.1.

Suppose that Assumptions 3.1, 3.2 and 3.6 hold and \gamma_{1},\gamma_{2}>0 are fixed parameters.

1. Assume (x^{*},y^{*}) is a global solution to (1.2) and c_{1}\geq\frac{L}{2\mu_{G}}\epsilon^{-1},\ c_{2}\geq(c_{1})^{2}B^{2}\epsilon^{-1}. Then there exists z^{*}\in\mathbb{R}_{+}^{p} such that (x^{*},y^{*},z^{*}) is an \epsilon-global minimum of the following penalized form

\min_{(x,y,z)\in X\times Y\times Z}\quad\Psi(x,y,z)=F(x,y)+c_{1}\mathcal{G}(x,y,z)+\frac{c_{2}}{2}\sum_{i=1}^{p}[H_{i}(x,y)]_{+}^{2}. \qquad (3.1)

2. By taking c_{1}=c_{1}^{*}+2:=\frac{L}{2\mu_{G}}\epsilon^{-1}+2,\ c_{2}=c_{2}^{*}+2:=(c_{1}^{*})^{2}B^{2}\epsilon^{-1}+2, any \epsilon-global minimum of (3.1) is an \epsilon-global minimum of the following approximation of BLO

\min_{(x,y,z)\in X\times Y\times Z}\quad F(x,y)\quad\text{s.t.}\quad G(x,y)-E(x,z)\leq\epsilon_{1},\quad\frac{1}{2}\sum_{i=1}^{p}[H_{i}(x,y)]_{+}^{2}\leq\epsilon_{2}^{2}, \qquad (3.2)

with some \epsilon_{1},\epsilon_{2}\leq\epsilon.

This theorem indicates that the penalized reformulation can approximate the original bilevel optimization problem within a controlled error bound.

3.2.2 Stochastic case

The major difference between the deterministic and stochastic reformulations is the inexact solution of the subproblem (2.5a). By controlling the inexactness, we design an approximate stochastic reformulation for the stochastic bilevel optimization problem. Denote the penalized form as

\Psi(x,y,z,\hat{w},\hat{\lambda};\mathbf{\xi})=F(x,y)+c_{1}\big(G(x,y)-\ell_{\gamma}(x,z,\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi}))\big)+\frac{c_{2}}{2}\sum_{i=1}^{p}[H_{i}(x,y)]_{+}^{2}.

The following theorem shows the equivalence between this penalized form and (1.2) (see Theorem A.4 for proofs).

Theorem 3.2.

Suppose that Assumptions 3.1, 3.2 and 3.5 hold and \gamma_{1},\gamma_{2}>0 are fixed parameters.

1. Assume (x^{*},y^{*}) is a global solution to (1.2). If \mathcal{P}(\delta) defined in (2.6) is nonempty for any (x,z)\in X\times\mathbb{R}_{+}^{p}, then for any (\hat{w},\hat{\lambda})\in\mathcal{P}(\delta) there exists z^{*} such that (x^{*},y^{*},z^{*}) is an \epsilon-global minimum of the following penalized form

\min_{(x,y,z)\in X\times Y\times Z}\quad\mathbb{E}[\Psi(x,y,z,\hat{w},\hat{\lambda};\mathbf{\xi})] \qquad (3.3)

with any c_{1}\geq\frac{2L}{3\mu_{G}}\epsilon^{-1}, c_{2}\geq\frac{3}{2}(c_{1})^{2}B^{2}\epsilon^{-1} and \delta\leq\frac{\epsilon}{6c_{1}}.

2. By taking c_{1}=c_{1}^{*}+2:=\frac{2L}{3\mu_{G}}\epsilon^{-1}+2,\ c_{2}=c_{2}^{*}+2:=\frac{3}{2}(c_{1}^{*})^{2}B^{2}\epsilon^{-1}+2 and \delta\leq\frac{\epsilon}{6c_{1}}, for any (\hat{w},\hat{\lambda})\in\mathcal{P}(\delta), any \epsilon-global minimum of (3.3) is an \epsilon-global minimum of the following approximation of BLO:

\min_{(x,y,z)\in X\times Y\times Z}\quad F(x,y) \qquad (3.4)
\text{s.t.}\quad G(x,y)-\mathbb{E}[\ell_{\gamma}(x,z,\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi}))]\leq\epsilon_{1},
\qquad\quad\frac{1}{2}\sum_{i=1}^{p}[H_{i}(x,y)]_{+}^{2}\leq\epsilon_{2}^{2},

with some \epsilon_{1},\epsilon_{2}\leq\frac{13}{12}\epsilon.

3.3 Convergence analysis

Denote \mathcal{F}_{k} and \tilde{\mathcal{F}}_{k} as the \sigma-algebras generated by \{\mathbf{\xi}^{l}\}_{l=0}^{k}\cup\{\mathbf{\zeta}^{l},\mathbf{\tilde{\xi}}^{l}\}_{l=0}^{k-1} and \{\mathbf{\xi}^{l}\}_{l=0}^{k}\cup\{\mathbf{\zeta}^{l},\mathbf{\tilde{\xi}}^{l}\}_{l=0}^{k}, respectively. Then

\emptyset=\tilde{\mathcal{F}}_{0}\subset\mathcal{F}_{1}\subset\tilde{\mathcal{F}}_{1}\subset\cdots\subset\mathcal{F}_{k}\subset\tilde{\mathcal{F}}_{k}\subset\cdots

is the filtration generated by the random variables in Algorithms 2 and 3.

Theorem 3.3.

Suppose Assumptions 3.1-3.5 hold. By taking the step sizes in Algorithm 1 as

\eta_{j}=\frac{\eta}{j+1},\quad\rho_{j}=\frac{\rho}{j+1}, \qquad (3.5)

there exist constants \bar{\phi}_{1},\bar{\phi}_{2}>0 such that the output pair (w^{k,s},\lambda^{k,s}) satisfies

\mathbb{E}_{\mathbf{\xi}}\|w^{k,s}-w^{*}(x^{k-1},z^{k-1})\|^{2}\leq\bar{\phi}_{1}\frac{1+\log(s)}{s}, \qquad (3.6)
\mathbb{E}_{\mathbf{\xi}}\|\lambda^{k,s}-\lambda^{*}(x^{k-1},z^{k-1})\|^{2}\leq\bar{\phi}_{2}\frac{1+\log(s)}{s}.

Define the bias of the gradient oracle of \Psi as

b^{k}=\nabla\Psi^{k}(\mathbf{u}^{k};\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k})-\nabla\Psi(\mathbf{u}^{k}). \qquad (3.7)

We establish the following lemma to control the bias of the gradient oracle in terms of conditional expectation.

Lemma 3.1.

The bias b^{k} admits the bound

\mathbb{E}[\|b^{k}\|^{2}\mid\tilde{\mathcal{F}}_{k-1}]\leq 2\left(\frac{\sigma_{f}^{2}}{r_{k}}+c_{1}^{2}\frac{\sigma_{\mathcal{G}}^{2}}{q_{k}}+c_{1}^{2}(\epsilon_{\mathcal{G}}^{k})^{2}\right). \qquad (3.8)

Here \mathbb{E}[\cdot] abbreviates \mathbb{E}_{\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k},\mathbf{\xi}^{k}}[\cdot], and \epsilon^{k}_{\mathcal{G}},(\sigma_{\mathcal{G}}^{k})^{2} denote the upper bounds of the bias and variance of \nabla\hat{\mathcal{G}}(\mathbf{u};\mathbf{\xi}^{k},\mathbf{\tilde{\xi}}^{k}), respectively. \epsilon^{k}_{\mathcal{G}} and (\sigma_{\mathcal{G}}^{k})^{2} are constants conditioned on \tilde{\mathcal{F}}_{k-1} and can be further bounded by polynomials of \mathbb{E}_{\mathbf{\xi}^{k}}\|w^{k,s}-w^{*}(x^{k-1},z^{k-1})\|^{2} and \mathbb{E}_{\mathbf{\xi}^{k}}\|\lambda^{k,s}-\lambda^{*}(x^{k-1},z^{k-1})\|^{2}. Therefore we can control the bias b^{k} by enhancing the accuracy of Algorithm 1.
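To indicate where (3.8) comes from, we record the standard bias–variance split (a proof sketch under the conventions above; the full argument is deferred to the appendix). Conditioned on \mathbf{\xi}^{k}, the sampling noise in (\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k}) is zero-mean, so by Young's inequality \|a+b\|^{2}\leq 2\|a\|^{2}+2\|b\|^{2},

\mathbb{E}[\|b^{k}\|^{2}\mid\tilde{\mathcal{F}}_{k-1}]\leq 2\,\mathbb{E}\big[\|\nabla\Psi^{k}(\mathbf{u}^{k};\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k})-\nabla\Psi^{k}(\mathbf{u}^{k})\|^{2}\big]+2\,\mathbb{E}\big[\|\nabla\Psi^{k}(\mathbf{u}^{k})-\nabla\Psi(\mathbf{u}^{k})\|^{2}\big]\leq 2\Big(\frac{\sigma_{f}^{2}}{r_{k}}+c_{1}^{2}\frac{\sigma_{\mathcal{G}}^{2}}{q_{k}}\Big)+2c_{1}^{2}(\epsilon_{\mathcal{G}}^{k})^{2},

where the first term is the mini-batch variance and the second is the squared bias inherited from the inexact estimator (\hat{w}^{k},\hat{\lambda}^{k}).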

Denote c=\max(c_{1},c_{2}). With the convergence results of Algorithm 1 and the boundedness of the gradient oracle, we obtain the convergence of Algorithm 2.

Theorem 3.4.

Suppose Assumptions 3.1-3.6 hold. Take a constant step size \alpha_{k}=\alpha<\frac{1}{2L_{\Psi}} and constant sample sizes r_{k}=r, q_{k}=q, and s_{k}=s. Then the sequence \{\mathbf{u}^{k}\}_{k=0}^{K} generated by Algorithm 2 satisfies

\mathbb{E}\left[\frac{1}{K}\sum_{k=0}^{K-1}\frac{1}{\alpha_{k}}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}\right]\leq\mathcal{O}\left(\frac{1}{\alpha K}+\frac{1}{r}+\frac{c_{1}^{2}}{q}+\frac{c_{1}^{2}}{s}\right).

The detailed proof is available in Theorem C.1 and Corollary C.2. We summarize the proof idea of Theorem 3.4 as follows. By combining Theorem 3.3 with Lemma 3.1, the bias and variance are bounded by \widetilde{O}(1/s_{k}). Analyzing the resulting biased stochastic projected gradient method then provides the desired result.

Corollary 3.1.

If the rate is measured by \mathrm{dist}(0,\nabla\Psi(\mathbf{u})+\mathcal{N}_{\mathcal{U}}(\mathbf{u}))^{2}, we have the following equivalent conclusion (see Theorem C.2 and Corollary C.2 for the detailed proof):

\mathbb{E}[\mathrm{dist}(0,\nabla\Psi(\mathbf{u}^{R})+\mathcal{N}_{\mathcal{U}}(\mathbf{u}^{R}))^{2}]\leq\mathcal{O}\left(\frac{1}{\alpha K}+\frac{1}{r}+\frac{c_{1}^{2}}{q}+\frac{c_{1}^{2}}{s}\right).
Remark 3.2.

The step size condition \alpha_{k}<\frac{1}{2L_{\Psi}} and (C.3) imply that \alpha_{k} is at most \widetilde{\mathcal{O}}(c^{-1}). With \alpha\sim\mathcal{O}(c^{-1}), r\sim\mathcal{O}(\epsilon^{-1}), q\sim\mathcal{O}(c_{1}^{2}\epsilon^{-1}), s\sim\mathcal{O}(c_{1}^{2}\epsilon^{-1}), and K\sim\mathcal{O}(c\epsilon^{-1}), the right-hand side of the above inequality is \widetilde{\mathcal{O}}(\epsilon). Then the sample complexity on (\zeta,\xi) is (\widetilde{\mathcal{O}}(c\epsilon^{-2}),\widetilde{\mathcal{O}}(cc_{1}^{2}\epsilon^{-2})).
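As a quick sanity check of this count (a back-of-the-envelope computation with the parameter choices above; the total number of samples is Kr for \zeta and K(q+s) for \xi):

Kr\sim\mathcal{O}(c\epsilon^{-1})\cdot\mathcal{O}(\epsilon^{-1})=\mathcal{O}(c\epsilon^{-2}),\qquad K(q+s)\sim\mathcal{O}(c\epsilon^{-1})\cdot\mathcal{O}(c_{1}^{2}\epsilon^{-1})=\mathcal{O}(cc_{1}^{2}\epsilon^{-2}).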

Remark 3.3.

Theorem 3.2 shows that (2.9) is equivalent to the original problem (1.2) in the sense of \epsilon-accuracy by taking c_{1}\sim\mathcal{O}(\epsilon^{-1}), c_{2}\sim\mathcal{O}(\epsilon^{-3}) and \delta\sim\mathcal{O}(\epsilon^{2}). Under this condition, the sample complexity on (\zeta,\xi) is (\widetilde{\mathcal{O}}(\epsilon^{-5}),\widetilde{\mathcal{O}}(\epsilon^{-7})).

By introducing the following averaged Lipschitz assumption, we can further improve the convergence rate utilizing variance reduction techniques.

Assumption 3.7.

Assume \nabla f(x,y;\zeta),\nabla g(x,y;\xi) are averaged Lipschitz continuous, that is,

\mathbb{E}_{\zeta}[\|\nabla f(x_{1},y_{1};\zeta)-\nabla f(x_{2},y_{2};\zeta)\|^{2}]\leq L_{f}^{2}\|(x_{1},y_{1})-(x_{2},y_{2})\|^{2}, \qquad (3.9)
\mathbb{E}_{\xi}[\|\nabla g(x_{1},y_{1};\xi)-\nabla g(x_{2},y_{2};\xi)\|^{2}]\leq L_{g}^{2}\|(x_{1},y_{1})-(x_{2},y_{2})\|^{2}.
Theorem 3.5.

Suppose Assumptions 3.2-3.6 and 3.7 hold. Take \alpha_{k}=\alpha(k+1)^{-\frac{1}{3}} and \beta_{k+1}=\beta\alpha_{k}^{2} in the outer loop, and take constant sample sizes r_{k}=r, q_{k}=q, and s_{k}=s. Then the sequence \{\mathbf{u}^{k}\}_{k=0}^{K} generated by Algorithm 3 satisfies

\frac{1}{\sum_{k=0}^{K}\alpha_{k}}\sum_{k=0}^{K-1}\mathbb{E}\left[\frac{1}{\alpha_{k}}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}\right]\leq\widetilde{\mathcal{O}}\left(\frac{1}{\alpha K^{\frac{2}{3}}}+\Big(c_{1}^{2}+\frac{K^{\frac{2}{3}}}{r}\Big)\Big(\frac{1}{q}+\frac{1}{s}\Big)+\frac{1}{K^{\frac{2}{3}}r}\right). \qquad (3.10)

The detailed proof is available in Theorem C.3 and Corollary C.3. The proof sketch is summarized as follows: first we show that the error of the direction, e^{k}=d^{k}-\mathbb{E}[\nabla\Psi^{k}(\mathbf{u}^{k};\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k})], is bounded by a linear function of the previous error e^{k-1}. By proving the decrease of the merit function \Psi(\mathbf{u}^{k})+\theta_{k}\|e^{k}\| with suitable coefficients \theta_{k}, we establish the convergence of Algorithm 3.

Remark 3.4.

Further taking K\sim\mathcal{O}(c^{1.5}\epsilon^{-1.5}), r\sim\mathcal{O}(1), q\sim\mathcal{O}(c_{1}^{2}\epsilon^{-1}), s\sim\mathcal{O}(c_{1}^{2}\epsilon^{-1}), the right-hand side is \widetilde{\mathcal{O}}(\epsilon). The sample complexity on (\zeta,\xi) is then (\widetilde{\mathcal{O}}(c^{1.5}\epsilon^{-1.5}),\widetilde{\mathcal{O}}(c^{1.5}c_{1}^{2}\epsilon^{-2.5})). From Lemma 3.1 we know the upper bound of \|b^{k}\|^{2} is \widetilde{\mathcal{O}}(\frac{c_{1}^{2}}{s}); hence, to achieve an \epsilon-optimal solution by a biased gradient-based approach, the condition \widetilde{\mathcal{O}}(\frac{c_{1}^{2}}{s})\leq\epsilon cannot be further relaxed. That is, the sample complexity on \xi in each iteration is at least \mathcal{O}(c_{1}^{2}\epsilon^{-1}). To this extent, the current analysis requires a larger sample complexity on the lower level to reduce the upper-level complexity, and it remains an open question whether variance reduction could reduce the sample complexities of the upper and lower levels at the same time.
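Again counting total samples as Kr for \zeta and K(q+s) for \xi (an informal check of the stated rates):

Kr\sim\mathcal{O}(c^{1.5}\epsilon^{-1.5})\cdot\mathcal{O}(1)=\mathcal{O}(c^{1.5}\epsilon^{-1.5}),\qquad K(q+s)\sim\mathcal{O}(c^{1.5}\epsilon^{-1.5})\cdot\mathcal{O}(c_{1}^{2}\epsilon^{-1})=\mathcal{O}(c^{1.5}c_{1}^{2}\epsilon^{-2.5}).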

4 Numerical experiments

This section presents numerical experiments to demonstrate the effectiveness of the proposed algorithms, compared with baselines including LV-HBA (29), GAM (26), and BLOCC (12).

4.1 Toy example

Consider the following example from Jiang et al. (12):

\min_{x\in[0,3]}\quad F(x,y^{*})=\frac{e^{-y^{*}(x)+2}}{2+\cos(6x)}+\frac{1}{2}\log\left((4x-2)^{2}+1\right)
\text{s.t.}\quad y^{*}(x)\in\arg\min_{y\in\mathcal{Y}(x)}G(x,y)=(y-2x)^{2},

where \mathcal{Y}(x)=\{y\in[0,3]\mid H(x,y)\leq 0\} and H(x,y)=y-x. The lower-level problem has the closed-form solution y^{*}(x)=x. Now assume the gradient oracles of F and G have Gaussian noise with standard deviation \sigma=0.1, that is,

\nabla f(x,y;\zeta)=\nabla F(x,y)+\zeta,\qquad\nabla g(x,y;\xi)=\nabla G(x,y)+\xi,

with \zeta\sim N(0,\sigma^{2}I),\ \xi\sim N(0,\sigma^{2}I). We pick 200 random points (x^{0},y^{0})\in[0,3]\times[0,3] as initial points and allow at most 2500 samples of \zeta. The final iterates of Algorithms 2 and 3 are collected in Figure 1. Figure 1(a) plots the points (x,y) projected onto the curve y=y^{*}(x) together with the distribution of the output x, and Figure 1(b) shows a 3D plot of the output x and y. As shown in the figures, the converged points of both algorithms are close to the global optimal solution x^{*} and form an approximately Gaussian distribution. The distribution of SALVF-VR is more concentrated than that of SALVF, which demonstrates the acceleration effect of the variance reduction technique.
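For reference, here is a minimal NumPy sketch of the noisy oracles used in this example (our own illustration of the setup described above, not the authors' experiment code):

```python
import numpy as np

def toy_oracles(sigma=0.1, seed=0):
    """Noisy gradient oracles for Section 4.1:
    F(x, y) = exp(-y + 2) / (2 + cos(6x)) + 0.5 * log((4x - 2)**2 + 1),
    G(x, y) = (y - 2x)**2, with constraint H(x, y) = y - x <= 0, so y*(x) = x."""
    rng = np.random.default_rng(seed)

    def grad_f(x, y):
        dFdx = 6.0 * np.sin(6 * x) * np.exp(-y + 2) / (2 + np.cos(6 * x)) ** 2 \
             + 4.0 * (4 * x - 2) / ((4 * x - 2) ** 2 + 1)
        dFdy = -np.exp(-y + 2) / (2 + np.cos(6 * x))
        return np.array([dFdx, dFdy]) + sigma * rng.standard_normal(2)   # zeta ~ N(0, sigma^2 I)

    def grad_g(x, y):
        grad = np.array([-4.0 * (y - 2 * x), 2.0 * (y - 2 * x)])
        return grad + sigma * rng.standard_normal(2)                     # xi ~ N(0, sigma^2 I)

    return grad_f, grad_g
```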

Figure 1: The converged points of Algorithm 2. (a) x and its distribution; (b) 3D plot of convergent points.

4.2 Hyperparameter tuning for SVM

Consider the following support vector machine (SVM) problem as the lower-level problem:

\min_{w,b,\xi}\quad\frac{1}{2}\|w\|^{2}+\frac{1}{2}\frac{1}{\sum_{i=1}^{N}\exp(c_{i})}\sum_{i=1}^{N}\exp(c_{i})\xi_{i}
\text{s.t.}\quad l_{i}(z_{i}^{\top}w+b)\geq 1-\xi_{i},\quad\forall(z_{i},l_{i})\in\mathcal{D}_{tr},

where c=(c_{1},\ldots,c_{N}) are the hyperparameters, \mathcal{D}_{tr} is the training set, N=|\mathcal{D}_{tr}| and l_{i} is the label of the i-th sample. The upper-level problem is to minimize the validation error with respect to c:

\min_{c}\ \mathbb{E}_{(z,l)\sim\mathcal{D}_{val}}\left[\exp\left(1-l(z^{\top}w^{*}+b^{*})\right)\right].

In Figure 2, we compare the performance of SALVF with the baselines LV-HBA (29), GAM (26) and BLOCC (12). Since the lower-level problem is deterministic, we allow all algorithms to access the exact optimal solution of the lower-level problem by calling the ECOS (6) solver, and we set \gamma_{2}=0. We extend BLOCC, LV-HBA and GAM to their stochastic versions by replacing (projected) gradient descent with (projected) stochastic gradient descent in the upper-level update. Figure 2 shows the test accuracy of SALVF versus time and iterations on the Diabetes and Fourclass datasets. SALVF achieves higher accuracy than the baselines. Although BLOCC attains a similar peak accuracy, the iterations of SALVF are more time-efficient. This is because SALVF requires a double loop while BLOCC requires a triple loop, which is more computationally expensive.

Figure 2: The performance of SALVF compared with baselines on SVM hyperparameter optimization. (a) Diabetes: test acc. v.s. time; (b) Diabetes: test acc. v.s. iter.; (c) Fourclass: test acc. v.s. time; (d) Fourclass: test acc. v.s. iter. The abbreviations “test acc.” and “iter.” stand for test accuracy and iterations, respectively. The curves are averaged over 10 random seeds. The curves in Figures 2(a) and 2(c) are clipped at the maximum iterations 120 and 60, respectively.

4.3 Weight decay tuning

Given a neural network f(w,x), where w denotes the weight and bias parameters of each layer, the goal is to optimize the weight decay parameter for the model. To improve the generalization performance, a weight decay parameter C is introduced, which imposes the constraint \|w\|\leq C. This can be formulated as the following stochastic BLO:

\min_{C>0}\quad\mathbb{E}_{(x,y)\sim\mathcal{D}_{val}}[\ell(y,f(w^{*}(C),x))]
\text{s.t.}\quad w^{*}(C)\in\arg\min_{\|w\|\leq C}\mathbb{E}_{(x,y)\sim\mathcal{D}_{tr}}\,\ell\left(y,f\left(w,x\right)\right).

The upper level focuses on performance on a validation set, while the lower level involves constrained classifier training. We compare the performance of SALVF, SALVF-VR and BLOCC on the digit dataset (1) with a two-layer MLP as the base model. The results are shown in Figure 3. The “no weight decay” curve represents the model’s performance without weight decay. By incorporating weight decay, all bilevel methods exhibit improved performance and reduce overfitting. From Figure 3(a) we see that SALVF is the most time-efficient, thanks to the simplicity of each step in its double-loop iteration process.

Figure 3: The performance of SALVF compared with baselines on the digit dataset. (a) Test acc. v.s. time; (b) Test acc. v.s. iter. The curves are averaged over 10 random seeds. The curves in Figure 3(a) are clipped at the maximum iteration 50.

5 Conclusion

In this paper, we propose a Hessian-free method for solving stochastic LC-BLO problems. We present the first non-asymptotic rate analysis of value function-based algorithms for nonlinear LC-BLO in the stochastic setting. The sample complexity of our algorithm is (\widetilde{\mathcal{O}}(c_{1}\epsilon^{-2}),\widetilde{\mathcal{O}}(c_{1}^{3}\epsilon^{-2})) on the upper- and lower-level random variables, respectively. The sample complexity on the upper-level variable is further improved to \widetilde{\mathcal{O}}(c_{1}^{1.5}\epsilon^{-1.5}) using variance reduction techniques. Numerical experiments on synthetic and real-world data demonstrate the effectiveness of the proposed approach.

Appendix A Convergence analysis

A.1 Properties of Moreau envelope

Given a function \phi(z), the Moreau envelope is defined as e^{\gamma}(z)=\min_{\lambda}\phi(\lambda)+\frac{\gamma}{2}\|\lambda-z\|^{2}. To give a comprehensive explanation of the enveloped value function in (2.2), we introduce some well-known properties of the Moreau envelope. For detailed proofs we refer to the literature, such as (19, 9).
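As a standard one-dimensional illustration (our own example, not used elsewhere in the paper), take \phi(\lambda)=|\lambda|; with the definition above, the envelope and proximal point have the closed forms

e^{\gamma}(z)=\min_{\lambda}\ |\lambda|+\frac{\gamma}{2}(\lambda-z)^{2}=\begin{cases}\frac{\gamma}{2}z^{2}, & |z|\leq 1/\gamma,\\ |z|-\frac{1}{2\gamma}, & |z|>1/\gamma,\end{cases}\qquad\mathrm{prox}_{\frac{1}{\gamma}|\cdot|}(z)=\mathrm{sign}(z)\max\{|z|-1/\gamma,0\},

i.e., e^{\gamma} is the Huber function: it is differentiable everywhere even though \phi is not, which is exactly the smoothing effect exploited in (2.2).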

Proposition A.1.

1. The Moreau envelope e^{\gamma}(z) is a continuous lower approximation of \phi(z), i.e., e^{\gamma}(z)\leq\phi(z) for all z, and \lim_{\gamma\to 0}e^{\gamma}(z)=\phi(z). If \phi(z) is l-Lipschitz continuous and \rho-weakly convex in z, then this difference is at most O(\gamma), i.e.,

|e^{\gamma}(z)-\phi(z)|\leq O(\gamma).

2. If \phi is \rho-strongly convex in z with \rho\geq 0, then e^{\gamma}(z) is (\frac{1}{\gamma}+\frac{1}{\rho})^{-1}-strongly convex.

3. The Moreau envelope e^{\gamma}(z) has the same minimizers as \phi(z), i.e.,

e^{\gamma}(z)=\phi(z),\quad\forall z\in\arg\min_{z}\phi(z)=\arg\min_{z}e^{\gamma}(z).

4. Its gradient at z is given by

\nabla e^{\gamma}(z)=\nabla\phi(\mathrm{prox}_{\frac{1}{\gamma}\phi}(z))=\gamma\big(z-\mathrm{prox}_{\frac{1}{\gamma}\phi}(z)\big),

where \mathrm{prox}_{\frac{1}{\gamma}\phi}(z)=\arg\min_{\lambda}\phi(\lambda)+\frac{\gamma}{2}\|\lambda-z\|^{2} is the proximal operator. Therefore e^{\gamma}(z) is Lipschitz smooth given that \phi(z) is Lipschitz smooth.

In our reformulation, we introduce E(x,z) in (2.2). Then -E(x,z) is the Moreau envelope of the negative augmented dual function -D(x,z). Utilizing the properties of the Moreau envelope, we transform the original strongly-convex–concave saddle problem \max_{z\in\mathbb{R}_{+}^{p}}\min_{y\in Y}\mathcal{L}_{\gamma_{1}}(x,y,z) into the strongly-convex–strongly-concave saddle-point problem \max_{\lambda\in\mathbb{R}_{+}^{p}}\min_{w\in Y}\ell_{\gamma}(x,z,w,\lambda) without changing the optimal solution.

A.2 Properties of the augmented Lagrangian function

In this section we review some important properties of the augmented Lagrangian function defined in (2.1).

A.3 Augmented Lagrangian duality

In this subsection we introduce the augmented Lagrangian duality for general constrained optimization problems. The augmented Lagrangian duality is a powerful tool for solving constrained optimization problems, especially when the constraints are non-convex. The notation introduced in this subsection is only used within this subsection.

For a general constrained optimization problem

\min_{y\in Y}\quad G(y)\quad\text{s.t.}\quad H(y)\leq 0, \qquad (A.1)

The augmented Lagrangian penalty term is defined as

\mathcal{A}_{\gamma}(y,z)=\frac{1}{2\gamma}\sum_{i=1}^{p}\left([\gamma z_{i}+H_{i}(y)]_{+}^{2}-\gamma^{2}z_{i}^{2}\right), \qquad (A.2)

and the augmented Lagrangian function is defined as

\mathcal{L}_{\gamma}(y,z)=G(y)+\mathcal{A}_{\gamma}(y,z), \qquad (A.3)

where z_{i} is the dual variable associated with the i-th constraint H_{i}(y)\leq 0 and \gamma is the penalty parameter. The augmented Lagrangian function contains the Lagrangian function as a special case in the limit \gamma\to+\infty.

Under the convexity assumption and Slater’s condition, we have the following proposition about the augmented Lagrangian duality.

Proposition A.2.

(Strong duality) Suppose G,H are convex and Slater’s condition holds, i.e., there exists y^{*}\in Y such that H(y^{*})<0. Then the following statements hold:

1. The dual variables z exist and strong duality of the augmented Lagrangian holds, i.e.,

\min_{y}\max_{z\in\mathbb{R}_{+}^{p}}\mathcal{L}_{\gamma}(y,z)=\max_{z\in\mathbb{R}_{+}^{p}}\min_{y}\mathcal{L}_{\gamma}(y,z).

2. The strong duality of the regularized augmented Lagrangian holds, i.e.,

\min_{y}\max_{z\in\mathbb{R}_{+}^{p}}\Big\{\mathcal{L}_{\gamma}(y,z)-\frac{\sigma}{2}\|z-z^{\prime}\|^{2}\Big\}=\max_{z\in\mathbb{R}_{+}^{p}}\min_{y}\Big\{\mathcal{L}_{\gamma}(y,z)-\frac{\sigma}{2}\|z-z^{\prime}\|^{2}\Big\}

holds for any given \sigma>0 and z^{\prime}\in\mathbb{R}_{+}^{p}.

The proof of Proposition A.2 is provided in Chapter 17 of Nocedal and Wright (17). A direct consequence of Proposition A.2 is that the min and max in (2.5a) are interchangeable.

A.4 Gradient oracles

The gradient of \mathcal{L}_{\gamma_{1}}(x,y,z) is given by

\nabla_{x}\mathcal{L}_{\gamma_{1}}(x,y,z)=\nabla_{x}G(x,y)+\frac{1}{\gamma_{1}}\sum_{i=1}^{p}[\gamma_{1}z_{i}+H_{i}(x,y)]_{+}\nabla_{x}H_{i}(x,y), \qquad (A.4)
\nabla_{y}\mathcal{L}_{\gamma_{1}}(x,y,z)=\nabla_{y}G(x,y)+\frac{1}{\gamma_{1}}\sum_{i=1}^{p}[\gamma_{1}z_{i}+H_{i}(x,y)]_{+}\nabla_{y}H_{i}(x,y),
\nabla_{z}\mathcal{L}_{\gamma_{1}}(x,y,z)=[\gamma_{1}z+H(x,y)]_{+}-\gamma_{1}z=\max(-\gamma_{1}z,H(x,y)).

With some simple computation, it can be shown that these gradients are bounded by linear functions of \|z\|.

Lemma A.1.

Under Assumption 3.3, the gradients \nabla_{x}\mathcal{L}_{\gamma_{1}}(x,y,z),\nabla_{y}\mathcal{L}_{\gamma_{1}}(x,y,z) are bounded by

\|\nabla_{x}\mathcal{L}_{\gamma_{1}}(x,y,z)\|\leq M_{G,1}+p\left(\frac{M_{H,0}M_{H,1}}{\gamma_{1}}+\|z\|\right), \qquad (A.5a)
\|\nabla_{y}\mathcal{L}_{\gamma_{1}}(x,y,z)\|\leq M_{G,1}+p\left(\frac{M_{H,0}M_{H,1}}{\gamma_{1}}+\|z\|\right). \qquad (A.5b)

Proof. It follows from (A.4) that

\|\nabla_{x}\mathcal{L}_{\gamma_{1}}(x,y,z)\|\leq\|\nabla_{x}G(x,y)\|+\frac{1}{\gamma_{1}}\sum_{i=1}^{p}(\gamma_{1}|z_{i}|+|H_{i}(x,y)|)\|\nabla_{x}H_{i}(x,y)\|
\leq\|\nabla_{x}G(x,y)\|+\frac{1}{\gamma_{1}}\sum_{i=1}^{p}(\gamma_{1}|z_{i}|+M_{H,0})M_{H,1}
\leq M_{G,1}+p\left(\frac{M_{H,0}M_{H,1}}{\gamma_{1}}+\|z\|\right).

The proof of (A.5b) is the same. ∎

By substituting \nabla G(x,y) with \nabla g(x,y;\xi) in (A.4), we derive the gradient oracle of \mathcal{L}_{\gamma_{1}}(x,y,z;\xi) as

\nabla_{x}\mathcal{L}_{\gamma_{1}}(x,y,z;\xi)=\nabla_{x}g(x,y;\xi)+\frac{1}{\gamma_{1}}\sum_{i=1}^{p}[\gamma_{1}z_{i}+H_{i}(x,y)]_{+}\nabla_{x}H_{i}(x,y), \qquad (A.6)
\nabla_{y}\mathcal{L}_{\gamma_{1}}(x,y,z;\xi)=\nabla_{y}g(x,y;\xi)+\frac{1}{\gamma_{1}}\sum_{i=1}^{p}[\gamma_{1}z_{i}+H_{i}(x,y)]_{+}\nabla_{y}H_{i}(x,y),
\nabla_{z}\mathcal{L}_{\gamma_{1}}(x,y,z;\xi)=[\gamma_{1}z+H(x,y)]_{+}-\gamma_{1}z=\max(-\gamma_{1}z,H(x,y)).

We introduce several properties of the augmented Lagrangian function’s gradient oracle that are essential for the analysis of our approach. The following lemma guarantees that the stochastic gradient remains within bounded norms, which is a crucial condition for stability.

Lemma A.2.

Under Assumptions 3.3 and 3.4, the gradient oracles \nabla_{x}\mathcal{L}_{\gamma_{1}}(x,y,z;\xi),\nabla_{y}\mathcal{L}_{\gamma_{1}}(x,y,z;\xi) are bounded by

\mathbb{E}_{\xi\sim\mathcal{D}_{\xi}}[\|\nabla_{x}\mathcal{L}_{\gamma_{1}}(x,y,z;\xi)\|^{2}]\leq M_{\mathcal{L},1}+M_{\mathcal{L},2}\|z\|^{2}, \qquad (A.7a)
\mathbb{E}_{\xi\sim\mathcal{D}_{\xi}}[\|\nabla_{y}\mathcal{L}_{\gamma_{1}}(x,y,z;\xi)\|^{2}]\leq M_{\mathcal{L},1}+M_{\mathcal{L},2}\|z\|^{2}, \qquad (A.7b)

respectively, where M_{\mathcal{L},1}=(2+p)(M_{G,1}^{2}+\sigma_{g}^{2}+\frac{1}{\gamma_{1}^{2}}M_{H,0}) and M_{\mathcal{L},2}=(2+p)M_{H,0}^{2}M_{H,1}^{2}.

Proof It follows from (A.6) that

xγ1(x,y,z;ξ)\displaystyle\|\nabla_{x}\mathcal{L}_{{\gamma_{1}}}(x,y,z;\xi)\| xg(x,y;ξ)+1γ1i=1p(γ1|zi|+|Hi(x,y)|)yHi(x,y)\displaystyle\leq\|\nabla_{x}g(x,y;\xi)\|+\frac{1}{{\gamma_{1}}}\sum_{i=1}^{p}({\gamma_{1}}|z_{i}|+|H_{i}(x,y)|)\|\nabla_{y}H_{i}(x,y)\|
xg(x,y;ξ)+1γ1i=1p(γ1|zi|+MH,0)MH,1.\displaystyle\leq\|\nabla_{x}g(x,y;\xi)\|+\frac{1}{{\gamma_{1}}}\sum_{i=1}^{p}({\gamma_{1}}|z_{i}|+M_{H,0})M_{H,1}.

By Cauchy-Schwarz inequality, it holds that

(xg(x,y;ξ)+1γ1i=1p(γ1|zi|+MH,0)MH,1)2\displaystyle\quad\left(\|\nabla_{x}g(x,y;\xi)\|+\frac{1}{{\gamma_{1}}}\sum_{i=1}^{p}({\gamma_{1}}|z_{i}|+M_{H,0})M_{H,1}\right)^{2} (A.8)
(1+1+p)(xg(x,y;ξ)2+1γ12MH,02+MH,02MH,12i=1pzi2).\displaystyle\leq(1+1+p)\left(\|\nabla_{x}g(x,y;\xi)\|^{2}+\frac{1}{\gamma_{1}^{2}}M_{H,0}^{2}+M_{H,0}^{2}M_{H,1}^{2}\sum_{i=1}^{p}z_{i}^{2}\right).

Assumptions 3.3 and 3.4 imply

𝔼ξ𝒟ξ[xg(x,y;ξ)2]=xG(x,y)2+𝕍ξ𝒟ξ[xg(x,y;ξ)]MG,12+σg2.\mathbb{E}_{\xi\sim\mathcal{D}_{\xi}}[\|\nabla_{x}g(x,y;\xi)\|^{2}]=\|\nabla_{x}G(x,y)\|^{2}+\mathbb{V}_{\xi\sim\mathcal{D}_{\xi}}[\nabla_{x}g(x,y;\xi)]\leq M_{G,1}^{2}+\sigma_{g}^{2}.

Taking expectation on (A.8) gives (A.7a). The proof of (A.7b) is similar. ∎

The strong convexity of G(x,y)G(x,y) in yy and the convexity of H(x,y)H(x,y) in yy imply the convex-concavity of γ1(x,y,z)\mathcal{L}_{{\gamma_{1}}}(x,y,z), as shown in the following lemma.

Lemma A.3.

(convex-concavity) The function γ1(x,y,z)\mathcal{L}_{{\gamma_{1}}}(x,y,z) is μG\mu_{G}-strongly convex in yy and concave in zz.

Proof From (A.4) we can compute the second order derivative with respect to yy as

y2γ1(x,y,z)=\displaystyle\nabla_{y}^{2}\mathcal{L}_{{\gamma_{1}}}(x,y,z)= y2G(x,y)+1γ1i=1p{[γ1zi+Hi(x,y)]+y2Hi(x,y)\displaystyle\nabla_{y}^{2}G(x,y)+\frac{1}{{\gamma_{1}}}\sum_{i=1}^{p}\{[{\gamma_{1}}z_{i}+H_{i}(x,y)]_{+}\nabla_{y}^{2}H_{i}(x,y) (A.9)
+𝕀{γ1zi+Hi(x,y)>0}yHi(x,y)yHi(x,y)T}μGI.\displaystyle+\mathbb{I}_{\{{\gamma_{1}}z_{i}+H_{i}(x,y)>0\}}\nabla_{y}H_{i}(x,y)\nabla_{y}H_{i}(x,y)^{T}\}\succeq\mu_{G}I.

This implies that γ1(x,y,z)\mathcal{L}_{{\gamma_{1}}}(x,y,z) is μG\mu_{G}-strongly convex in yy. Additionally, (A.4) shows that zγ1(x,y,z)\nabla_{z}\mathcal{L}_{\gamma_{1}}(x,y,z) is monotonically non-increasing with respect to zz. Therefore γ1(x,y,z)\mathcal{L}_{\gamma_{1}}(x,y,z) is concave in zz. ∎

Lemma A.4.

The function γ(x,z,w,λ)\ell_{\gamma}(x,z,w,\lambda) is μG\mu_{G}-strongly convex in ww and γ2\gamma_{2}-strongly concave in λ\lambda.

Proof Combining Lemma A.3 and (2.5b), the conclusion follows. ∎

Moreover, the gradients of γ(x,z,w,λ)\ell_{\gamma}(x,z,w,\lambda) can be computed as

wγ(x,z,w,λ)\displaystyle\nabla_{w}\ell_{\gamma}(x,z,w,\lambda) =yg(x,w)+1γ1i=1p[γ1λi+Hi(x,w)]+yHi(x,w),\displaystyle=\nabla_{y}g(x,w)+\frac{1}{{\gamma_{1}}}\sum_{i=1}^{p}[{\gamma_{1}}\lambda_{i}+H_{i}(x,w)]_{+}\nabla_{y}H_{i}(x,w), (A.10)
λγ(x,z,w,λ)\displaystyle\nabla_{\lambda}\ell_{\gamma}(x,z,w,\lambda) =[γ1λ+H(x,w)]+γ1λγ2(λz)=max(γ1λ,H(x,w))γ2(λz).\displaystyle=[{\gamma_{1}}\lambda+H(x,w)]_{+}-{\gamma_{1}}\lambda-\gamma_{2}(\lambda-z)=\max(-{\gamma_{1}}\lambda,H(x,w))-\gamma_{2}(\lambda-z).

A.4.1 Comparison between Lagrangian function and augmented Lagrangian function

In this section, we compare the Lagrangian function with the augmented Lagrangian function. The key points are as follows: first, the Lagrangian function is the limit of the augmented Lagrangian function as γ1+\gamma_{1}\to+\infty; second, the augmented Lagrangian-based reformulation yields a tighter upper bound on the variance of the upper-level gradient estimate. The Lagrangian function (x,y,z)\mathcal{L}(x,y,z) of (1.1) is defined as

(x,y,z)=G(x,y)+i=1pziHi(x,y),(x,y,z)X×Y×+p.\mathcal{L}(x,y,z)=G(x,y)+\sum_{i=1}^{p}z_{i}H_{i}(x,y),\quad(x,y,z)\in X\times Y\times\mathbb{R}_{+}^{p}. (A.11)

The inequality relation between objective function, Lagrangian function and augmented Lagrangian function is given in the following proposition.

Proposition A.3.

If H(x,y)0,z0H(x,y)\leq 0,z\geq 0, then it holds that

(x,y,z)γ1(x,y,z)G(x,y).\mathcal{L}(x,y,z)\leq\mathcal{L}_{{\gamma_{1}}}(x,y,z)\leq G(x,y).

Proof For any i{1,,p}i\in\{1,...,p\}, there are two cases:

1. If γ1zi+Hi(x,y)0{\gamma_{1}}z_{i}+H_{i}(x,y)\leq 0, then 12γ1([γ1zi+Hi(x,y)]+2γ12zi2)=12γ1zi2\frac{1}{2{\gamma_{1}}}([{\gamma_{1}}z_{i}+H_{i}(x,y)]_{+}^{2}-{\gamma_{1}}^{2}z_{i}^{2})=-\frac{1}{2}{\gamma_{1}}z_{i}^{2}. It holds that ziHi(x,y)12γ1zi20z_{i}H_{i}(x,y)\leq-\frac{1}{2}{\gamma_{1}}z_{i}^{2}\leq 0.

2. If γ1zi+Hi(x,y)>0{\gamma_{1}}z_{i}+H_{i}(x,y)>0, then 12γ1([γ1zi+Hi(x,y)]+2γ12zi2)=ziHi(x,y)+12γ1Hi2(x,y)\frac{1}{2{\gamma_{1}}}([{\gamma_{1}}z_{i}+H_{i}(x,y)]_{+}^{2}-{\gamma_{1}}^{2}z_{i}^{2})=z_{i}H_{i}(x,y)+\frac{1}{2{\gamma_{1}}}H_{i}^{2}(x,y). Hence ziHi(x,y)ziHi(x,y)+12γ1Hi2(x,y)0z_{i}H_{i}(x,y)\leq z_{i}H_{i}(x,y)+\frac{1}{2{\gamma_{1}}}H_{i}^{2}(x,y)\leq 0.

Combining the above two cases, we have (x,y,z)γ1(x,y,z)G(x,y)\mathcal{L}(x,y,z)\leq\mathcal{L}_{{\gamma_{1}}}(x,y,z)\leq G(x,y). This completes the proof. ∎

From equations (2.1) and (A.11), it is evident that the Lagrangian function is the limit of the augmented Lagrangian function as γ1+\gamma_{1}\to+\infty. During the optimization process, it is desirable for the variance of the gradient oracle to be small in order to ensure the algorithm’s stability. The second-order moment serves as an upper bound for the variance. The following lemma demonstrates that the gradient of the augmented Lagrangian term for each constraint has a smaller second-order moment compared to that of the Lagrangian function. This property is a key reason for using the augmented Lagrangian function in our algorithm.

Lemma A.5.

Assume xx is fixed and the random variable (w^(ξ),λ^(ξ))𝒴(x)×+p(\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi}))\in\mathcal{Y}(x)\times\mathbb{R}_{+}^{p}, ξΩξs\mathbf{\xi}\in\Omega_{\xi}^{s}. Then it holds that

𝔼ξ[1γ1[γ1λ^i(ξ)+Hi(x,w^(ξ))]+Hi(x,w^(ξ))2]𝔼ξ[λ^i(ξ)Hi(x,w^(ξ))2].\displaystyle\mathbb{E}_{\mathbf{\xi}}\left[\left\|\frac{1}{{\gamma_{1}}}[{\gamma_{1}}\hat{\lambda}_{i}(\mathbf{\xi})+H_{i}(x,\hat{w}(\mathbf{\xi}))]_{+}\nabla H_{i}(x,\hat{w}(\mathbf{\xi}))\right\|^{2}\right]\leq\mathbb{E}_{\mathbf{\xi}}\left[\left\|\hat{\lambda}_{i}(\mathbf{\xi})\nabla H_{i}(x,\hat{w}(\mathbf{\xi}))\right\|^{2}\right]. (A.12)

Proof The condition (w^(ξ),λ^(ξ))𝒴(x)×+p(\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi}))\in\mathcal{Y}(x)\times\mathbb{R}_{+}^{p} indicates that H(x,w^(ξ))0H(x,\hat{w}(\mathbf{\xi}))\leq 0 and λ^(ξ)0\hat{\lambda}(\mathbf{\xi})\geq 0. Then we have

01γ1[γ1λ^i(ξ)+Hi(x,w^(ξ))]+λ^i(ξ),i=1,,p.\displaystyle 0\leq\frac{1}{{\gamma_{1}}}[{\gamma_{1}}\hat{\lambda}_{i}(\mathbf{\xi})+H_{i}(x,\hat{w}(\mathbf{\xi}))]_{+}\leq\hat{\lambda}_{i}(\mathbf{\xi}),\quad i=1,.,p.

Taking square and then taking expectation give the desired result. ∎
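
The clipping property in the proof above can also be checked numerically. The following sketch samples a hypothetical feasible scenario (nonpositive constraint values, nonnegative multipliers) and verifies that the augmented Lagrangian term has a smaller per-constraint second moment than the Lagrangian term, as in (A.12); all sampled quantities are synthetic stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma1, p, n_samples = 0.5, 3, 10_000

# Hypothetical feasible scenario: H(x, w_hat(xi)) <= 0 and lambda_hat(xi) >= 0.
H_vals = -rng.exponential(1.0, size=(n_samples, p))       # nonpositive constraint values
lam = rng.exponential(1.0, size=(n_samples, p))            # nonnegative multipliers
grad_H_sq = rng.uniform(0.5, 2.0, size=(n_samples, p))     # stand-in for ||grad H_i||^2

al_coef = np.maximum(gamma1 * lam + H_vals, 0.0) / gamma1  # clipped multiplier (AL term)
moment_al = np.mean(al_coef**2 * grad_H_sq, axis=0)        # left-hand side of (A.12)
moment_lag = np.mean(lam**2 * grad_H_sq, axis=0)           # right-hand side of (A.12)
assert np.all(moment_al <= moment_lag + 1e-12)
```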


A.5 Analysis on the reformulation

In this section, we show the equivalence between the reformulations and the original BLO (1.2). We first treat the deterministic case, which is a special case of the stochastic setting, and then extend the equivalence to the stochastic case, highlighting the improvements brought by the stochastic reformulation.

A.6 Deterministic case

To further analyze the optimization problem, we establish a critical property of the lower-level objective function G(x,y)G(x,y). Define v(x)=miny𝒴(x)G(x,y)v(x)=\min_{y\in\mathcal{Y}(x)}G(x,y). The following proposition shows that G(x,y)G(x,y) satisfies a quadratic growth condition, ensuring the uniqueness of solutions in the lower-level problem.

Proposition A.4.

(Quadratic growth) 1. Suppose Assumption 3.2 holds. For any xXx\in X, G(x,y)G(x,y) has μG\mu_{G}-quadratic growth with respect to y𝒴(x)y\in\mathcal{Y}(x), namely,

G(x,y)v(x)\displaystyle G(x,y)-v(x) μG2yy(x)2,y𝒴(x).\displaystyle\geq\frac{\mu_{G}}{2}\|y-y^{*}(x)\|^{2},\quad y\in\mathcal{Y}(x).

2. Suppose Assumptions 3.2 and 3.3 hold. For any xXx\in X and yYy\in Y satisfying dist(y,𝒴(x))ϵ\mathrm{dist}(y,\mathcal{Y}(x))\leq\epsilon, we have

G(x,y)v(x)μG4yy(x)2MG,1ϵμG2ϵ2.G(x,y)-v(x)\geq\frac{\mu_{G}}{4}\|y-y^{*}(x)\|^{2}-M_{G,1}\epsilon-\frac{\mu_{G}}{2}\epsilon^{2}.

3. Suppose Assumptions 3.2 and 3.3 hold. For any xXx\in X and yYy\in Y, it holds that

G(x,y)v(x)μG2yy(x)2p12BMH,1yy(x),G(x,y)-v(x)\geq\frac{\mu_{G}}{2}\|y-y^{*}(x)\|^{2}-p^{\frac{1}{2}}BM_{H,1}\|y-y^{*}(x)\|,

where BB is defined in Lemma A.7.3. If, in addition, yy satisfies 12i=1p[Hi(x,y)]+2ϵ2\frac{1}{2}\sum_{i=1}^{p}[H_{i}(x,y)]_{+}^{2}\leq\epsilon^{2}, then

G(x,y)v(x)μG2yy(x)22Bϵ.G(x,y)-v(x)\geq\frac{\mu_{G}}{2}\|y-y^{*}(x)\|^{2}-\sqrt{2}B\epsilon.

Proof 1. For any xXx\in X, denote

(y(x),λ(x))=argminyYmaxλ+pG(x,y)+λH(x,y).(y^{*}(x),\lambda^{*}(x))=\arg\min_{y\in Y}\max_{\lambda\in\mathbb{R}_{+}^{p}}G(x,y)+\lambda^{\top}H(x,y).

By Assumption 3.2, the function G(x,)+(λ(x))H(x,)G(x,\cdot)+(\lambda^{*}(x))^{\top}H(x,\cdot) is μG\mu_{G}-strongly convex in yy, which implies quadratic growth at its minimizer y(x)y^{*}(x), i.e.,

G(x,y)+(λ(x))H(x,y)G(x,y(x))+λ(x)H(x,y(x))+μG2yy(x)2.G(x,y)+(\lambda^{*}(x))^{\top}H(x,y)\geq G(x,y^{*}(x))+\lambda^{*}(x)^{\top}H(x,y^{*}(x))+\frac{\mu_{G}}{2}\|y-y^{*}(x)\|^{2}.

Furthermore, the complementary slackness condition implies λiHi(x,y(x))=0\lambda^{*}_{i}H_{i}(x,y^{*}(x))=0 for i=1,,pi=1,...,p. Therefore, we have

G(x,y)v(x)\displaystyle\quad G(x,y)-v(x)
G(x,y)+(λ(x))H(x,y)G(x,y(x))(since λ(x)0 and H(x,y)0)\displaystyle\geq G(x,y)+(\lambda^{*}(x))^{\top}H(x,y)-G(x,y^{*}(x))\quad\text{(since $\lambda^{*}(x)\geq 0$ and $H(x,y)\leq 0$)}
=G(x,y)+(λ(x))H(x,y)G(x,y(x))λ(x)H(x,y(x))\displaystyle=G(x,y)+(\lambda^{*}(x))^{\top}H(x,y)-G(x,y^{*}(x))-\lambda^{*}(x)^{\top}H(x,y^{*}(x))
μG2yy(x)2,y𝒴(x).\displaystyle\geq\frac{\mu_{G}}{2}\|y-y^{*}(x)\|^{2},\quad y\in\mathcal{Y}(x).

2. For any yYy\in Y satisfying dist(y,𝒴(x))ϵ\mathrm{dist}(y,\mathcal{Y}(x))\leq\epsilon, denote y=Proj𝒴(x)(y)y^{\prime}=\mathrm{Proj}_{\mathcal{Y}(x)}(y) as the projection of yy onto 𝒴(x)\mathcal{Y}(x). Then yyϵ\|y-y^{\prime}\|\leq\epsilon. From the results of 1, we have

G(x,y)v(x)μG2yy(x)2.G(x,y^{\prime})-v(x)\geq\frac{\mu_{G}}{2}\|y^{\prime}-y^{*}(x)\|^{2}.

Assumption 3.3 implies

G(x,y)G(x,y)MG,1yy.G(x,y)-G(x,y^{\prime})\geq-M_{G,1}\|y-y^{\prime}\|.

Combining the above two inequalities and the fact 12yy(x)2yy(x)2+yy2\frac{1}{2}\|y-y^{*}(x)\|^{2}\leq\|y^{\prime}-y^{*}(x)\|^{2}+\|y-y^{\prime}\|^{2}, we have

G(x,y)v(x)\displaystyle G(x,y)-v(x) =G(x,y)G(x,y)+G(x,y)v(x)\displaystyle=G(x,y)-G(x,y^{\prime})+G(x,y^{\prime})-v(x)
MG,1yy+μG2yy(x)2\displaystyle\geq-M_{G,1}\|y-y^{\prime}\|+\frac{\mu_{G}}{2}\|y^{\prime}-y^{*}(x)\|^{2}
MG,1ϵ+μG2(12yy(x)2yy2)\displaystyle\geq-M_{G,1}\epsilon+\frac{\mu_{G}}{2}\left(\frac{1}{2}\|y-y^{*}(x)\|^{2}-\|y-y^{\prime}\|^{2}\right)
μG4yy(x)2MG,1ϵμG2ϵ2.\displaystyle\geq\frac{\mu_{G}}{4}\|y-y^{*}(x)\|^{2}-M_{G,1}\epsilon-\frac{\mu_{G}}{2}\epsilon^{2}.

3. By Assumption 3.3 and Lemma A.7, for any xXx\in X and yYy\in Y, it holds that

λ(x)H(x,y)λ(x)[H(x,y)]+p12BMH,1yy(x).\lambda^{*}(x)H(x,y)\leq\|\lambda^{*}(x)\|\|[H(x,y)]_{+}\|\leq p^{\frac{1}{2}}BM_{H,1}\|y-y^{*}(x)\|.

Then, following the argument in part 1, we have

G(x,y)v(x)\displaystyle G(x,y)-v(x) μG2yy(x)2λ(x)H(x,y)\displaystyle\geq\frac{\mu_{G}}{2}\|y-y^{*}(x)\|^{2}-\lambda^{*}(x)H(x,y)
μG2yy(x)2p12BMH,1yy(x).\displaystyle\geq\frac{\mu_{G}}{2}\|y-y^{*}(x)\|^{2}-p^{\frac{1}{2}}BM_{H,1}\|y-y^{*}(x)\|.

Further assume that yYy\in Y satisfies 12i=1p[Hi(x,y)]+2ϵ2\frac{1}{2}\sum_{i=1}^{p}[H_{i}(x,y)]_{+}^{2}\leq\epsilon^{2}; then

G(x,y)v(x)\displaystyle G(x,y)-v(x) μG2yy(x)2λ(x)H(x,y)\displaystyle\geq\frac{\mu_{G}}{2}\|y-y^{*}(x)\|^{2}-\lambda^{*}(x)H(x,y)
μG2yy(x)2λ(x)[H(x,y)]+\displaystyle\geq\frac{\mu_{G}}{2}\|y-y^{*}(x)\|^{2}-\|\lambda^{*}(x)\|\|[H(x,y)]_{+}\|
μG2yy(x)22Bϵ.\displaystyle\geq\frac{\mu_{G}}{2}\|y-y^{*}(x)\|^{2}-\sqrt{2}B\epsilon.

This completes the proof. ∎
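
The quadratic growth bound in part 1 can be sanity-checked on a toy instance. The sketch below uses a hypothetical lower-level problem with a strongly convex quadratic objective and a single linear constraint, so that y*(x) is an explicit projection, and verifies the inequality at random feasible points; all data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)
mu_G, n = 2.0, 4
a, b = rng.normal(size=n), 0.3                  # hypothetical constraint a^T y <= b
c = rng.normal(size=n)                          # unconstrained minimizer of G(x, .)

def G(y):                                       # toy lower-level objective, mu_G-strongly convex in y
    return 0.5 * mu_G * np.sum((y - c) ** 2)

# y*(x) is the projection of c onto the half-space {y : a^T y <= b}.
y_star = c if a @ c <= b else c - a * (a @ c - b) / (a @ a)
v = G(y_star)                                   # v(x) = minimum of G over the feasible set

for _ in range(1000):                           # check G(x,y) - v(x) >= (mu_G/2) ||y - y*(x)||^2
    y = rng.normal(size=n)
    if a @ y <= b:                              # only feasible points are covered by part 1
        assert G(y) - v >= 0.5 * mu_G * np.sum((y - y_star) ** 2) - 1e-10
```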

The following lemma shows the relationship between v(x)v(x) and the enveloped value E(x,z)E(x,z) defined in (2.2).

Lemma A.7.

Suppose Assumptions 3.2, 3.5 and 3.6 hold.

1. It holds that E(x,z)v(x),z+pE(x,z)\leq v(x),\forall z\in\mathbb{R}_{+}^{p}. The equality holds if and only if there exists y𝒴(x)y\in\mathcal{Y}(x) such that (y,z)argminwYmaxλ+pγ1(x,w,λ)(y,z)\in\arg\min_{w\in Y}\max_{\lambda\in\mathbb{R}_{+}^{p}}\mathcal{L}_{\gamma_{1}}(x,w,\lambda).

2. Assume y𝒴(x)y\in\mathcal{Y}(x). Then yargmin𝒴(x)G(x,y)y\in\arg\min_{\mathcal{Y}(x)}G(x,y) is equivalent to the existence of z+pz\in\mathbb{R}_{+}^{p} such that G(x,y)E(x,z)0G(x,y)-E(x,z)\leq 0.

3. For any xXx\in X, there exist yYy\in Y and zz such that (y,z)argminwYmaxλ+pγ1(x,w,λ)(y,z)\in\arg\min_{w\in Y}\max_{\lambda\in\mathbb{R}_{+}^{p}}\mathcal{L}_{\gamma_{1}}(x,w,\lambda) and |zi|p0.5B|z_{i}|\leq p^{-0.5}B, where B=p2σ02MH,1(MG,1+pMH,1)B=p^{2}\sigma_{0}^{-2}M_{H,1}(M_{G,1}+pM_{H,1}). This statement also holds for γ1=+\gamma_{1}=+\infty.

Proof 1. For any fixed (x,y)(x,y), we have

12γ1([γ1zi+Hi(x,y)]+2γ12zi2)={ziHi(x,y)+12γ1Hi(x,y)2,if zi1γ1Hi(x,y),12γ1zi2,if zi<1γ1Hi(x,y).\frac{1}{2{{\gamma_{1}}}}([{\gamma_{1}}z_{i}+H_{i}(x,y)]_{+}^{2}-{\gamma_{1}}^{2}z_{i}^{2})=\begin{cases}z_{i}H_{i}(x,y)+\frac{1}{2{{\gamma_{1}}}}H_{i}(x,y)^{2},&\text{if }z_{i}\geq-\frac{1}{\gamma_{1}}H_{i}(x,y),\\ -\frac{1}{2}\gamma_{1}z_{i}^{2},&\text{if }z_{i}<-\frac{1}{\gamma_{1}}H_{i}(x,y).\end{cases}

We consider maximizing with respect to zi+z_{i}\in\mathbb{R}_{+}. If Hi(x,y)>0H_{i}(x,y)>0, then maxzi+12γ1([γ1zi+Hi(x,y)]+2γ12zi2)=maxzi+ziHi(x,y)+12γ1Hi(x,y)2=+\max_{z_{i}\in\mathbb{R}_{+}}\frac{1}{2{{\gamma_{1}}}}([{\gamma_{1}}z_{i}+H_{i}(x,y)]_{+}^{2}-{\gamma_{1}}^{2}z_{i}^{2})=\max_{z_{i}\in\mathbb{R}_{+}}z_{i}H_{i}(x,y)+\frac{1}{2{{\gamma_{1}}}}H_{i}(x,y)^{2}=+\infty. If Hi(x,y)0H_{i}(x,y)\leq 0, then

maxzi+12γ1([γ1zi+Hi(x,y)]+2γ12zi2)\displaystyle\quad\max_{z_{i}\in\mathbb{R}_{+}}\frac{1}{2{{\gamma_{1}}}}([{\gamma_{1}}z_{i}+H_{i}(x,y)]_{+}^{2}-{\gamma_{1}}^{2}z_{i}^{2})
=max{max0zi1γ1Hi(x,y)12γ1γ12zi2,maxzi1γ1Hi(x,y)ziHi(x,y)+12γ1Hi(x,y)2}\displaystyle=\max\left\{\max_{0\leq z_{i}\leq-\frac{1}{\gamma_{1}}H_{i}(x,y)}-\frac{1}{2{{\gamma_{1}}}}\gamma_{1}^{2}z_{i}^{2},\max_{z_{i}\geq-\frac{1}{\gamma_{1}}H_{i}(x,y)}z_{i}H_{i}(x,y)+\frac{1}{2{{\gamma_{1}}}}H_{i}(x,y)^{2}\right\}
=max{0,12γ1Hi(x,y)2}=0.\displaystyle=\max\left\{0,-\frac{1}{2{{\gamma_{1}}}}H_{i}(x,y)^{2}\right\}=0.

Combining the above two cases, we have

maxzi+12γ1([γ1zi+Hi(x,y)]+2γ12zi2)\displaystyle\max_{z_{i}\in\mathbb{R}_{+}}\frac{1}{2{{\gamma_{1}}}}([{\gamma_{1}}z_{i}+H_{i}(x,y)]_{+}^{2}-{\gamma_{1}}^{2}z_{i}^{2}) ={+,if Hi(x,y)>0,0,if Hi(x,y)0.\displaystyle=\begin{cases}+\infty,&\text{if }H_{i}(x,y)>0,\\ 0,&\text{if }H_{i}(x,y)\leq 0.\end{cases}

This implies that

maxz+pγ1(x,y,z)={+,if Hi(x,y)>0,G(x,y),if Hi(x,y)0,i=1,,p.\max_{z\in\mathbb{R}_{+}^{p}}\mathcal{L}_{\gamma_{1}}(x,y,z)=\begin{cases}+\infty,&\text{if }\exists H_{i}(x,y)>0,\\ G(x,y),&\text{if }H_{i}(x,y)\leq 0,i=1,...,p.\end{cases} (A.14)

From the definition of E(x,z)E(x,z), for any z+pz\in\mathbb{R}_{+}^{p}, we have

E(x,z)\displaystyle E(x,z) =maxλ+pminyY{γ1(x,y,λ)γ22λz2}\displaystyle=\max_{\lambda\in\mathbb{R}_{+}^{p}}\min_{y\in Y}\left\{\mathcal{L}_{\gamma_{1}}(x,y,\lambda)-\frac{\gamma_{2}}{2}\|\lambda-z\|^{2}\right\} (A.15)
=minyYmaxλ+p{γ1(x,y,λ)γ22λz2}(by Proposition A.2)\displaystyle=\min_{y\in Y}\max_{\lambda\in\mathbb{R}_{+}^{p}}\left\{\mathcal{L}_{\gamma_{1}}(x,y,\lambda)-\frac{\gamma_{2}}{2}\|\lambda-z\|^{2}\right\}\quad\text{(by Proposition~\ref{prop: strong duality})}
minyYmaxλ+pγ1(x,y,λ)\displaystyle\leq\min_{y\in Y}\max_{\lambda\in\mathbb{R}_{+}^{p}}\mathcal{L}_{\gamma_{1}}(x,y,\lambda)
=minyY{G(x,y)|H(x,y)0}(by (A.14))\displaystyle=\min_{y\in Y}\left\{G(x,y)|H(x,y)\leq 0\right\}\quad\text{(by~\eqref{eq: max of augmented Lagrangian with respect to z})}
=v(x).\displaystyle=v(x).

Equality holds if and only if there exists y𝒴(x)y\in\mathcal{Y}(x) such that (y,z)argminwYmaxλ+pγ1(x,w,λ)(y,z)\in\arg\min_{w\in Y}\max_{\lambda\in\mathbb{R}_{+}^{p}}\mathcal{L}_{\gamma_{1}}(x,w,\lambda).

2. Since E(x,z)v(x)G(x,y)E(x,z)\leq v(x)\leq G(x,y) for all z+pz\in\mathbb{R}_{+}^{p} and y𝒴(x)y\in\mathcal{Y}(x), the condition G(x,y)E(x,z)0G(x,y)-E(x,z)\leq 0 holds if and only if G(x,y)=v(x)=E(x,z)G(x,y)=v(x)=E(x,z), which implies yargminy𝒴(x)G(x,y)y\in\arg\min_{y\in\mathcal{Y}(x)}G(x,y).
Conversely, suppose yargmin𝒴(x)G(x,y)y\in\arg\min_{\mathcal{Y}(x)}G(x,y) and take zargmaxλ+pγ1(x,y,λ)z\in\arg\max_{\lambda\in\mathbb{R}_{+}^{p}}\mathcal{L}_{\gamma_{1}}(x,y,\lambda); then

E(x,z)\displaystyle E(x,z) =minwYmaxλ+p{γ1(x,w,λ)γ22λz2}\displaystyle=\min_{w\in Y}\max_{\lambda\in\mathbb{R}_{+}^{p}}\left\{\mathcal{L}_{\gamma_{1}}(x,w,\lambda)-\frac{\gamma_{2}}{2}\|\lambda-z\|^{2}\right\} (A.16)
minwYγ1(x,w,z)(taking λ=z)\displaystyle\geq\min_{w\in Y}\mathcal{L}_{\gamma_{1}}(x,w,z)\quad\text{(taking $\lambda=z$)}
=minw𝒴(x)G(x,w)=G(x,y).\displaystyle=\min_{w\in\mathcal{Y}(x)}G(x,w)=G(x,y).

3. The optimality conditions of (y,z)argminwYmaxλ+pγ1(x,w,λ)(y,z)\in\arg\min_{w\in Y}\max_{\lambda\in\mathbb{R}_{+}^{p}}\mathcal{L}_{\gamma_{1}}(x,w,\lambda) imply that H(x,y)0H(x,y)\leq 0 and yγ1(x,y,z)=0\nabla_{y}\mathcal{L}_{\gamma_{1}}(x,y,z)=0, that is,

yg(x,y)+1γ1i=1p[γ1zi+Hi(x,y)]+yHi(x,y)=0.\nabla_{y}g(x,y)+\frac{1}{{\gamma_{1}}}\sum_{i=1}^{p}[{\gamma_{1}}z_{i}+H_{i}(x,y)]_{+}\nabla_{y}H_{i}(x,y)=0. (A.17)

From the proof of 1, we know that if Hi(x,y)<0H_{i}(x,y)<0, then zi=0z_{i}=0. If Hi(x,y)=0H_{i}(x,y)=0, then yHi(x,y)0\nabla_{y}H_{i}(x,y)\neq 0. Otherwise suppose Hi(x,y)=0H_{i}(x,y)=0 and yHi(x,y)=0\nabla_{y}H_{i}(x,y)=0 hold simultaneously. The convexity of Hi(x,y)H_{i}(x,y) implies Hi(x,y)0H_{i}(x,y)\geq 0 for all yYy\in Y, which contradicts Assumption 3.5. Substituting these cases in (A.17) yields

yg(x,y)+i:Hi(x,y)=0ziyHi(x,y)+i:Hi(x,y)<0[Hi(x,y)]+yHi(x,y)=0.\nabla_{y}g(x,y)+\sum_{i:H_{i}(x,y)=0}z_{i}\nabla_{y}H_{i}(x,y)+\sum_{i:H_{i}(x,y)<0}[H_{i}(x,y)]_{+}\nabla_{y}H_{i}(x,y)=0. (A.18)

Note that the above conditions are equivalent to the KKT conditions. Since the lower-level problem is convex and Slater’s condition holds, they are sufficient for optimality. Let 𝒞\mathcal{C} be the matrix whose columns are yHi(x,y)\nabla_{y}H_{i}(x,y) for i{i:Hi(x,y)=0}i\in\{i:H_{i}(x,y)=0\}; then the minimal-norm zz satisfying (A.18) is bounded as

|z_{i}|\leq\left\|(\mathcal{C}^{\top}\mathcal{C})^{-1}\mathcal{C}^{\top}\left(\nabla_{y}g(x,y)+\sum_{i:H_{i}(x,y)<0}[H_{i}(x,y)]_{+}\nabla_{y}H_{i}(x,y)\right)\right\|\leq p^{1.5}\sigma_{0}^{-2}M_{H,1}(M_{G,1}+pM_{H,0}M_{H,1}).

The last inequality uses Assumption 3.6. In the special case γ1=+\gamma_{1}=+\infty, the bound becomes

|zi|(𝒞𝒞)1𝒞yG(x,y)p1.5σ02MH,1MG,1.|z_{i}|\leq\left\|(\mathcal{C}^{\top}\mathcal{C})^{-1}\mathcal{C}^{\top}\nabla_{y}G(x,y)\right\|\leq p^{1.5}\sigma_{0}^{-2}M_{H,1}M_{G,1}.

This completes the proof. ∎

Remark A.1.

Under the strong duality condition, v(x)=maxz+pD(x,z)v(x)=\max_{z\in\mathbb{R}_{+}^{p}}D(x,z). Proposition A.1.1 and A.1.3 imply that E(x,z)maxz+pE(x,z)=maxz+pD(x,z)=v(x)E(x,z)\leq\max_{z\in\mathbb{R}_{+}^{p}}E(x,z)=\max_{z\in\mathbb{R}_{+}^{p}}D(x,z)=v(x). This shows that the enveloped value function E(x,z)E(x,z) is a lower approximation of v(x)v(x) and does not change the optimal solution.

The following theorem shows the equivalence between (1.2) and the penalty reformulation in the deterministic setting. This theorem generalizes Theorem 1 of (20); indeed, choosing ϵ2=0\epsilon_{2}=0 reduces our statement to exactly that result.

Theorem A.1.

Suppose that Assumptions 3.1, 3.2 and 3.6 hold and γ1,γ2>0\gamma_{1},\gamma_{2}>0 are fixed parameters.

1. Assume (x,y)(x^{*},y^{*}) is a global solution to (1.2) and c1c1=LμGϵ1c_{1}\geq c_{1}^{*}=\frac{L}{\mu_{G}}\epsilon^{-1}. Then (x,y)(x^{*},y^{*}) is a ϵ\epsilon-global-minima of the following penalized form

min(x,y)X×Y\displaystyle\min_{(x,y)\in X\times Y} F(x,y)+c1(G(x,y)v(x))s.t.12i=1p[Hi(x,y)]+2ϵ22.\displaystyle\quad F(x,y)+c_{1}(G(x,y)-v(x))\quad\text{s.t.}\quad\frac{1}{2}\sum_{i=1}^{p}[H_{i}(x,y)]_{+}^{2}\leq\epsilon_{2}^{2}. (A.19)

Furthermore, there exists z+pz^{*}\in\mathbb{R}_{+}^{p} such that (x,y,z)(x^{*},y^{*},z^{*}) is an ϵ\epsilon-global-minima of the following penalized form

min(x,y,z)X×Y×Z\displaystyle\min_{(x,y,z)\in X\times Y\times Z} F(x,y)+c1𝒢(x,y,z)s.t.12i=1p[Hi(x,y)]+2ϵ22,\displaystyle\quad F(x,y)+c_{1}\mathcal{G}(x,y,z)\quad\text{s.t.}\quad\frac{1}{2}\sum_{i=1}^{p}[H_{i}(x,y)]_{+}^{2}\leq\epsilon_{2}^{2}, (A.20)

with ϵ2ϵ22c1B\epsilon_{2}\leq\frac{\epsilon}{2\sqrt{2}c_{1}B}.

2. By taking c1=c1+2c_{1}=c_{1}^{*}+2, any ϵ\epsilon-global-minima of (A.19) and (A.20) is an ϵ\epsilon-global-minima of the following two approximations of BLO:

min(x,y)X×Y\displaystyle\min_{(x,y)\in X\times Y} F(x,y)s.t.G(x,y)v(x)ϵ1,12i=1p[Hi(x,y)]+2ϵ22.\displaystyle\quad F(x,y)\quad\text{s.t.}\quad G(x,y)-v(x)\leq\epsilon_{1},\quad\frac{1}{2}\sum_{i=1}^{p}[H_{i}(x,y)]_{+}^{2}\leq\epsilon_{2}^{2}. (A.21)

and

min(x,y,z)X×Y×Z\displaystyle\min_{(x,y,z)\in X\times Y\times Z} F(x,y)s.t.G(x,y)E(x,z)ϵ1,12i=1p[Hi(x,y)]+2ϵ22,\displaystyle\quad F(x,y)\quad\text{s.t.}\quad G(x,y)-E(x,z)\leq\epsilon_{1},\quad\frac{1}{2}\sum_{i=1}^{p}[H_{i}(x,y)]_{+}^{2}\leq\epsilon_{2}^{2}, (A.22)

with some ϵ1,ϵ2ϵ\epsilon_{1},\epsilon_{2}\leq\epsilon.

Proof 1. Lemma A.7 shows that (A.20) is equivalent to (A.19) by minimizing zz first. Hence it suffices to show the equivalence between (1.2) and (A.19). Let p(x,y)=G(x,y)v(x)\mathrm{p}(x,y)=G(x,y)-v(x). Proposition A.4.3 implies that p(x,y)μG2yy(x)22Bϵ2\mathrm{p}(x,y)\geq\frac{\mu_{G}}{2}\|y-y^{*}(x)\|^{2}-\sqrt{2}B\epsilon_{2} for any yy satisfying 12i=1p[Hi(x,y)]+2ϵ22\frac{1}{2}\sum_{i=1}^{p}[H_{i}(x,y)]_{+}^{2}\leq\epsilon_{2}^{2}. By the Lipschitz property of F(x,)F(x,\cdot), we have

F(x,y)+c1p(x,y)F(x,y(x))\displaystyle F(x,y)+c_{1}\mathrm{p}(x,y)-F(x,y^{*}(x)) LFyy(x)+c1μG2yy(x)2c12Bϵ2\displaystyle\geq-L_{F}\|y-y^{*}(x)\|+\frac{c_{1}\mu_{G}}{2}\|y-y^{*}(x)\|^{2}-c_{1}\sqrt{2}B\epsilon_{2} (A.23)
LF2c1μGc12Bϵ2.\displaystyle\geq-\frac{L_{F}}{2c_{1}\mu_{G}}-c_{1}\sqrt{2}B\epsilon_{2}.

By taking c1c1=LμGϵ1c_{1}\geq c_{1}^{*}=\frac{L}{\mu_{G}}\epsilon^{-1} and ϵ2ϵ22c1B\epsilon_{2}\leq\frac{\epsilon}{2\sqrt{2}c_{1}B}, the right-hand side of the above inequality is at least ϵ-\epsilon. For the optimal solution (x,y)(x^{*},y^{*}) of (1.2) and any (x,y)X×Y(x,y)\in X\times Y, it holds that

F(x,y)+c1p(x,y)=F(x,y)F(x,y(x))F(x,y)+c1p(x,y)+ϵ.F(x^{*},y^{*})+c_{1}\mathrm{p}(x^{*},y^{*})=F(x^{*},y^{*})\leq F(x,y^{*}(x))\leq F(x,y)+c_{1}\mathrm{p}(x,y)+\epsilon. (A.24)

The first inequality follows from the optimality of xx^{*} and y=y(x)y^{*}=y^{*}(x^{*}). The second inequality follows from (A.23). The inequality (A.24) implies that (x,y)(x^{*},y^{*}) is an ϵ\epsilon-optimal solution of (A.19). Further, taking zargmaxλ+pγ1(x,y,λ)z^{*}\in\arg\max_{\lambda\in\mathbb{R}_{+}^{p}}\mathcal{L}_{\gamma_{1}}(x^{*},y^{*},\lambda), the point (x,y,z)(x^{*},y^{*},z^{*}) is an ϵ\epsilon-optimal solution of (A.20). By Lemma A.7, such a zz^{*} can be chosen with zZz^{*}\in Z. This completes the proof.

2. Lemma A.7 shows that (A.22) is equivalent to (A.21) by minimizing zz first. For any ϵ\epsilon-optimal solution (x^,y^)(\hat{x},\hat{y}) of (A.19), it holds that

F(x^,y^)+c1p(x^,y^)F(x,y)+c1p(x,y)+ϵF(x^,y^)+c1p(x^,y^)+2ϵ.F(\hat{x},\hat{y})+c_{1}\mathrm{p}(\hat{x},\hat{y})\leq F(x^{*},y^{*})+c_{1}\mathrm{p}(x^{*},y^{*})+\epsilon\leq F(\hat{x},\hat{y})+c_{1}^{*}\mathrm{p}(\hat{x},\hat{y})+2\epsilon.

The second inequality follows from (A.23) and p(x,y)=0\mathrm{p}(x^{*},y^{*})=0. Then we have

p(x^,y^)2ϵc1c1=ϵ.\mathrm{p}(\hat{x},\hat{y})\leq\frac{2\epsilon}{c_{1}-c_{1}^{*}}=\epsilon. (A.25)

Take ϵ1=p(x^,y^),ϵ2=12i=1p[Hi(x^,y^)]+2\epsilon_{1}=p(\hat{x},\hat{y}),\epsilon_{2}=\sqrt{\frac{1}{2}\sum_{i=1}^{p}[H_{i}(\hat{x},\hat{y})]_{+}^{2}}. Then (x^,y^)(\hat{x},\hat{y}) is feasible for (A.21). For any feasible solution (x,y)(x,y) of (A.21), the ϵ\epsilon-optimality of (x^,y^)(\hat{x},\hat{y}) implies

F(x^,y^)F(x,y)\displaystyle F(\hat{x},\hat{y})-F(x,y) c1(p(x,y)p(x^,y^))+ϵ\displaystyle\leq c_{1}\left(\mathrm{p}(x,y)-\mathrm{p}(\hat{x},\hat{y})\right)+\epsilon (A.26)
=c1(p(x,y)ϵ1)+ϵϵ.\displaystyle=c_{1}\left(\mathrm{p}(x,y)-\epsilon_{1}\right)+\epsilon\leq\epsilon.

The last inequality follows from the feasibility of (x,y)(x,y). Therefore (x^,y^)(\hat{x},\hat{y}) is a ϵ\epsilon-optimal solution of (A.21). This completes the proof. ∎

Theorem A.2.

Suppose that Assumptions 3.1, 3.2 and 3.6 hold and γ1,γ2>0\gamma_{1},\gamma_{2}>0 are fixed parameters.

1. Assume (x,y)(x^{*},y^{*}) is a global solution to (1.2) and c1c1=L2μGϵ1,c2c2=(c1)2B2ϵ1c_{1}\geq c_{1}^{*}=\frac{L}{2\mu_{G}}\epsilon^{-1},c_{2}\geq c_{2}^{*}=(c_{1}^{*})^{2}B^{2}\epsilon^{-1}. Then (x,y)(x^{*},y^{*}) is a ϵ\epsilon-global-minima of the following penalized form

min(x,y)X×Y\displaystyle\min_{(x,y)\in X\times Y} Ψ(x,y)=F(x,y)+c1(G(x,y)v(x))+c22i=1p[Hi(x,y)]+2.\displaystyle\quad\Psi(x,y)=F(x,y)+c_{1}(G(x,y)-v(x))+\frac{c_{2}}{2}\sum_{i=1}^{p}[H_{i}(x,y)]_{+}^{2}. (A.27)

Furthermore, there exists z+pz^{*}\in\mathbb{R}_{+}^{p} such that (x,y,z)(x^{*},y^{*},z^{*}) is an ϵ\epsilon-global-minima of the following penalized form

min(x,y,z)X×Y×Z\displaystyle\min_{(x,y,z)\in X\times Y\times Z} Ψ(x,y,z)=F(x,y)+c1𝒢(x,y,z)+c22i=1p[Hi(x,y)]+2.\displaystyle\quad\Psi(x,y,z)=F(x,y)+c_{1}\mathcal{G}(x,y,z)+\frac{c_{2}}{2}\sum_{i=1}^{p}[H_{i}(x,y)]_{+}^{2}. (A.28)

2. By taking c1=c1+2,c2=c2+2c_{1}=c_{1}^{*}+2,c_{2}=c_{2}^{*}+2, any ϵ\epsilon-global-minima of (A.27) and (A.28) is an ϵ\epsilon-global-minima of  (A.21) and (A.22) with some ϵ1,ϵ2ϵ\epsilon_{1},\epsilon_{2}\leq\epsilon, respectively.

Proof 1. Lemma A.7 shows that (A.28) is equivalent to (A.27) by minimizing zz first. Hence it suffices to show the equivalence between (1.2) and (A.27). Let p(x,y)=G(x,y)v(x)\mathrm{p}(x,y)=G(x,y)-v(x). Proposition A.4.3 implies that p(x,y)μG2yy(x)2λ(x)H(x,y)\mathrm{p}(x,y)\geq\frac{\mu_{G}}{2}\|y-y^{*}(x)\|^{2}-\lambda^{*}(x)H(x,y). By the Lipschitz property of F(x,)F(x,\cdot), we have

Ψ(x,y)F(x,y(x))\displaystyle\quad\Psi(x,y)-F(x,y^{*}(x)) (A.29)
=F(x,y)F(x,y(x))+c1p(x,y)+c22i=1p[Hi(x,y)]+2\displaystyle=F(x,y)-F(x,y^{*}(x))+c_{1}\mathrm{p}(x,y)+\frac{c_{2}}{2}\sum_{i=1}^{p}[H_{i}(x,y)]_{+}^{2}
LFyy(x)+c1μG2yy(x)2c1λ(x)H(x,y)+c22i=1p[Hi(x,y)]+2\displaystyle\geq-L_{F}\|y-y^{*}(x)\|+{c_{1}}\frac{\mu_{G}}{2}\|y-y^{*}(x)\|^{2}-c_{1}\lambda^{*}(x)H(x,y)+\frac{c_{2}}{2}\sum_{i=1}^{p}[H_{i}(x,y)]_{+}^{2}
LF2c1μGc122c2λ(x)2\displaystyle\geq-\frac{L_{F}}{2c_{1}\mu_{G}}-\frac{c_{1}^{2}}{2c_{2}}\|\lambda^{*}(x)\|^{2}
LF2c1μGc122c2B2,(x,y)(X,Y).\displaystyle\geq-\frac{L_{F}}{2c_{1}\mu_{G}}-\frac{c_{1}^{2}}{2c_{2}}B^{2},\quad\forall(x,y)\in(X,Y).

By taking c1c1=L2μGϵ1c_{1}\geq c_{1}^{*}=\frac{L}{2\mu_{G}}\epsilon^{-1} and c2c2=(c1)2B2ϵ1c_{2}\geq c_{2}^{*}=(c_{1}^{*})^{2}B^{2}\epsilon^{-1}, the right-hand side of the above inequality is at least ϵ-\epsilon. For the optimal solution (x,y)(x^{*},y^{*}) of (1.2), it holds that

Ψ(x,y)=F(x,y)F(x,y(x))Ψ(x,y)+ϵ,(x,y)(X,Y).\Psi(x^{*},y^{*})=F(x^{*},y^{*})\leq F(x,y^{*}(x))\leq\Psi(x,y)+\epsilon,\quad\forall(x,y)\in(X,Y). (A.30)

The first inequality follows from the optimality of xx^{*} and the definition of v(x)v(x). The second inequality follows from (A.29). This implies that (x,y)(x^{*},y^{*}) is an ϵ\epsilon-optimal solution of (A.27). Further, taking zargmaxλ+pγ1(x,y,λ)z^{*}\in\arg\max_{\lambda\in\mathbb{R}_{+}^{p}}\mathcal{L}_{\gamma_{1}}(x^{*},y^{*},\lambda), the point (x,y,z)(x^{*},y^{*},z^{*}) is an ϵ\epsilon-optimal solution of (A.28). By Lemma A.7, such a zz^{*} can be chosen with zZz^{*}\in Z. This completes the proof.

2. Lemma A.7 shows that (A.22) is equivalent to (A.21) by minimizing zz first. For any ϵ\epsilon-optimal solution (x^,y^)(\hat{x},\hat{y}) of (A.27), it holds that

F(x^,y^)+c1p(x^,y^)+c22i=1p[Hi(x^,y^)]+2Ψ(x,y)+ϵF(x^,y^)+c1p(x^,y^)+c22i=1p[Hi(x^,y^)]+2+2ϵ.F(\hat{x},\hat{y})+c_{1}\mathrm{p}(\hat{x},\hat{y})+\frac{c_{2}}{2}\sum_{i=1}^{p}[H_{i}(\hat{x},\hat{y})]_{+}^{2}\leq\Psi(x^{*},y^{*})+\epsilon\leq F(\hat{x},\hat{y})+c_{1}^{*}\mathrm{p}(\hat{x},\hat{y})+\frac{c_{2}^{*}}{2}\sum_{i=1}^{p}[H_{i}(\hat{x},\hat{y})]_{+}^{2}+2\epsilon.

The second inequality follows from (A.29). From the selection of c1,c2c_{1},c_{2}, we have

p(x^,y^)+12i=1p[Hi(x^,y^)]+22ϵ2=ϵ.\mathrm{p}(\hat{x},\hat{y})+\frac{1}{2}\sum_{i=1}^{p}[H_{i}(\hat{x},\hat{y})]_{+}^{2}\leq\frac{2\epsilon}{2}=\epsilon. (A.31)

Take ϵ1=p(x^,y^)ϵ,ϵ2=12i=1p[Hi(x^,y^)]+2ϵ\epsilon_{1}=\mathrm{p}(\hat{x},\hat{y})\leq\epsilon,\epsilon_{2}=\frac{1}{2}\sum_{i=1}^{p}[H_{i}(\hat{x},\hat{y})]_{+}^{2}\leq\epsilon. This implies (x^,y^)(\hat{x},\hat{y}) is feasible for (A.21). For any feasible solution (x,y)(x,y) of (A.21), the ϵ\epsilon-optimality of (x^,y^)(\hat{x},\hat{y}) implies

F(x^,y^)F(x,y)\displaystyle F(\hat{x},\hat{y})-F(x,y) c1(p(x,y)p(x^,y^))+c2(12i=1p[Hi(x,y)]+212i=1p[Hi(x^,y^)]+2)+ϵ\displaystyle\leq c_{1}\left(\mathrm{p}(x,y)-\mathrm{p}(\hat{x},\hat{y})\right)+c_{2}\left(\frac{1}{2}\sum_{i=1}^{p}[H_{i}(x,y)]_{+}^{2}-\frac{1}{2}\sum_{i=1}^{p}[H_{i}(\hat{x},\hat{y})]_{+}^{2}\right)+\epsilon (A.32)
=c1(p(x,y)ϵ1)+c2(12i=1p[Hi(x,y)]+2ϵ2)+ϵϵ.\displaystyle=c_{1}\left(\mathrm{p}(x,y)-\epsilon_{1}\right)+c_{2}\left(\frac{1}{2}\sum_{i=1}^{p}[H_{i}(x,y)]_{+}^{2}-\epsilon_{2}\right)+\epsilon\leq\epsilon.

The last inequality follows from the feasibility of (x,y)(x,y). Therefore (x^,y^)(\hat{x},\hat{y}) is a ϵ\epsilon-optimal solution of (A.21). This completes the proof. ∎
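
To make the fully penalized reformulation (A.27) concrete, the following sketch assembles Ψ(x, y) from user-supplied callables and sets the penalty parameters as suggested by Theorem A.2. The callables F, G, H and solve_lower (an approximate solver returning v(x)) are hypothetical placeholders, not part of the algorithm's implementation.

```python
import numpy as np

def penalty_parameters(L, mu_G, B, eps):
    """Penalty parameters suggested by Theorem A.2 (part 2): c1 = c1* + 2, c2 = c2* + 2."""
    c1_star = L / (2.0 * mu_G * eps)
    c2_star = c1_star**2 * B**2 / eps
    return c1_star + 2.0, c2_star + 2.0

def penalized_objective(x, y, c1, c2, F, G, H, solve_lower):
    """Deterministic penalized objective Psi(x, y) in (A.27) -- a sketch.

    F, G : upper- and lower-level objectives (hypothetical callables).
    H    : constraint map returning a vector in R^p.
    solve_lower : returns (an approximation of) v(x) = min over the feasible set of G(x, .).
    """
    v_x = solve_lower(x)
    infeas = np.maximum(H(x, y), 0.0)            # [H_i(x, y)]_+
    return F(x, y) + c1 * (G(x, y) - v_x) + 0.5 * c2 * np.sum(infeas**2)
```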

A.7 Stochastic case

Let ξ=(ξ1,,ξs)𝒟ξs\mathbf{\xi}=(\xi_{1},...,\xi_{s})\sim\mathcal{D}_{\xi}^{s} be ss samples. Denote by 𝒫w,𝒫λ\mathcal{P}_{w},\mathcal{P}_{\lambda} the spaces of random variables mapping ξ\mathbf{\xi} to YY and +p\mathbb{R}_{+}^{p}, respectively, that is,

𝒫w\displaystyle\mathcal{P}_{w} ={w:ΩξsYw is measurable},𝒫λ={λ:Ωξs+pλ is measurable}.\displaystyle=\{w:\Omega_{\xi}^{s}\to Y\mid w\text{ is measurable}\},\quad\mathcal{P}_{\lambda}=\{\lambda:\Omega_{\xi}^{s}\to\mathbb{R}_{+}^{p}\mid\lambda\text{ is measurable}\}.

Assume (w^,λ^)𝒫w×𝒫λ(\hat{w},\hat{\lambda})\in\mathcal{P}_{w}\times\mathcal{P}_{\lambda} are two random variables depending on ξ\mathbf{\xi}. In this section, 𝔼[]\mathbb{E}[\cdot] is the abbreviation of 𝔼ξ𝒟ξs[]\mathbb{E}_{\mathbf{\xi}\sim\mathcal{D}_{\xi}^{s}}[\cdot]. The following theorem shows the equivalence between (1.2) and the penalty reformulation in the stochastic setting.

Theorem A.3.

Suppose that Assumptions 3.1, 3.2, 3.4 and 3.5 hold and γ1,γ2>0\gamma_{1},\gamma_{2}>0 are fixed parameters.

1. Assume (x,y)(x^{*},y^{*}) is a global solution to (1.2). If 𝒫(δ)\mathcal{P}(\delta) defined in (2.6) is nonempty for any (x,z)X×Z(x,z)\in X\times Z, then for any (w^,λ^)𝒫(δ)(\hat{w},\hat{\lambda})\in\mathcal{P}(\delta), there exists zz^{*} such that (x,y,z)(x^{*},y^{*},z^{*}) is a ϵ\epsilon-global-minima of the following penalized form

min(x,y,z)X×Y×Z\displaystyle\min_{(x,y,z)\in X\times Y\times Z} 𝔼[F(x,y)+c1(G(x,y)γ(x,z,w^(ξ),λ^(ξ)))],\displaystyle\quad\mathbb{E}\left[F(x,y)+c_{1}\left(G(x,y)-\ell_{\gamma}(x,z,\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi}))\right)\right], (A.33)
s.t. G(x,y)𝔼[γ(x,z,w^(ξ),λ^(ξ))]ϵ1,12i=1p[Hi(x,y)]+2ϵ22,\displaystyle\quad G(x,y)-\mathbb{E}[\ell_{\gamma}(x,z,\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi}))]\leq\epsilon_{1},\quad\frac{1}{2}\sum_{i=1}^{p}[H_{i}(x,y)]_{+}^{2}\leq\epsilon_{2}^{2},

with any c1LμGϵ1c_{1}\geq\frac{L}{\mu_{G}}\epsilon^{-1}, ϵ2ϵ42c1B\epsilon_{2}\leq\frac{\epsilon}{4\sqrt{2}c_{1}B} and δϵ8c1\delta\leq\frac{\epsilon}{8c_{1}}.

2. By taking c1=c1+2:=LμGϵ1+2c_{1}=c_{1}^{*}+2:=\frac{L}{\mu_{G}}\epsilon^{-1}+2, ϵ2ϵ42c1B\epsilon_{2}\leq\frac{\epsilon}{4\sqrt{2}c_{1}B} and δϵ8c1\delta\leq\frac{\epsilon}{8c_{1}}, for any (w^,λ^)𝒫(δ)(\hat{w},\hat{\lambda})\in\mathcal{P}(\delta), any ϵ\epsilon-global-minima of (A.33) is an ϵ\epsilon-global-minima of the following approximation of BLO:

min(x,y,z)X×Y×Z\displaystyle\min_{(x,y,z)\in X\times Y\times Z} F(x,y)\displaystyle\quad F(x,y) (A.34)
s.t. G(x,y)𝔼[γ(x,z,w^(ξ),λ^(ξ))]ϵ1,12i=1p[Hi(x,y)]+2ϵ22,\displaystyle\quad G(x,y)-\mathbb{E}[\ell_{\gamma}(x,z,\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi}))]\leq\epsilon_{1},\quad\frac{1}{2}\sum_{i=1}^{p}[H_{i}(x,y)]_{+}^{2}\leq\epsilon_{2}^{2},

with some ϵ11716ϵ\epsilon_{1}\leq\frac{17}{16}\epsilon.

Proof 1. By Lemma A.7, we have E(x,z)v(x),zE(x,z)\leq v(x),\forall z. Then for any (x,y,z)X×Y×Z(x,y,z)\in X\times Y\times Z and (w^,λ^)𝒫(δ)(\hat{w},\hat{\lambda})\in\mathcal{P}(\delta), it holds that

𝔼[F(x,y)+c1(G(x,y)γ(x,z,w^(ξ),λ^(ξ)))]F(x,y)\displaystyle\quad\mathbb{E}\left[F(x,y)+c_{1}\left(G(x,y)-\ell_{\gamma}(x,z,\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi}))\right)\right]-F(x^{*},y^{*}) (A.35)
𝔼[F(x,y)+c1(G(x,y)γ(x,z,w^(ξ),λ^(ξ)))]F(x,y(x))\displaystyle\geq\mathbb{E}\left[F(x,y)+c_{1}\left(G(x,y)-\ell_{\gamma}(x,z,\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi}))\right)\right]-F(x,y^{*}(x))
=F(x,y)F(x,y(x))+c1(G(x,y)E(x,z))c1(𝔼[γ(x,z,w^(ξ),λ^(ξ))]E(x,z))\displaystyle=F(x,y)-F(x,y^{*}(x))+c_{1}(G(x,y)-E(x,z))-c_{1}\left(\mathbb{E}[\ell_{\gamma}(x,z,\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi}))]-E(x,z)\right)
F(x,y)F(x,y(x))+c1(G(x,y)v(x))c1δ(by E(x,z)v(x) and (2.6))\displaystyle\geq F(x,y)-F(x,y^{*}(x))+c_{1}(G(x,y)-v(x))-c_{1}\delta\quad\text{(by $E(x,z)\leq v(x)$ and~\eqref{eq: P delta})}
LF2c1μGc12Bϵ2c1δ(by (A.23)).\displaystyle\geq-\frac{L_{F}}{2c_{1}\mu_{G}}-c_{1}\sqrt{2}B\epsilon_{2}-c_{1}\delta\quad\text{(by~\eqref{eq: proof of equivalence of single level 1 1})}.

By (2.6), it holds that

𝔼[G(x,y)γ(x,z,w^(ξ),λ^(ξ))]δ.\displaystyle\mathbb{E}[G(x^{*},y^{*})-\ell_{\gamma}(x^{*},z^{*},\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi}))]\geq-\delta. (A.36)

Combining this with (A.35) gives

\displaystyle\quad\mathbb{E}\left[F(x,y)+c_{1}\left(G(x,y)-\ell_{\gamma}(x,z,\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi}))\right)\right]-\mathbb{E}\left[F(x^{*},y^{*})+c_{1}\left(G(x^{*},y^{*})-\ell_{\gamma}(x^{*},z^{*},\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi}))\right)\right]
LF2c1μGc12Bϵ22c1δ.\displaystyle\geq-\frac{L_{F}}{2c_{1}\mu_{G}}-c_{1}\sqrt{2}B\epsilon_{2}-2c_{1}\delta.

By taking any c1LμGϵ1c_{1}\geq\frac{L}{\mu_{G}}\epsilon^{-1}, ϵ2ϵ42c1B\epsilon_{2}\leq\frac{\epsilon}{4\sqrt{2}c_{1}B} and δϵ8c1\delta\leq\frac{\epsilon}{8c_{1}}, the right-hand side of the above inequality is at least ϵ-\epsilon. For the optimal solution (x,y)(x^{*},y^{*}) of (1.2) and any (w^,λ^)𝒫(δ)(\hat{w},\hat{\lambda})\in\mathcal{P}(\delta), by taking zargmaxz+pγ1(x,y,z)z^{*}\in\arg\max_{z\in\mathbb{R}_{+}^{p}}\mathcal{L}_{\gamma_{1}}(x^{*},y^{*},z), we have E(x,z)=v(x)=G(x,y)E(x^{*},z^{*})=v(x^{*})=G(x^{*},y^{*}). Lemma A.7.3 implies that such a zZz^{*}\in Z exists. Hence (x,y,z)(x^{*},y^{*},z^{*}) is an ϵ\epsilon-optimal solution of (A.33).

2. For any ϵ\epsilon-optimal solution (x¯,y¯,z¯)(\bar{x},\bar{y},\bar{z}) of (A.33), it holds that

𝔼[F(x¯,y¯)+c1(G(x¯,y¯)γ(x¯,z¯,w^(ξ),λ^(ξ)))]\displaystyle\quad\mathbb{E}\left[F(\bar{x},\bar{y})+c_{1}\left(G(\bar{x},\bar{y})-\ell_{\gamma}(\bar{x},\bar{z},\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi}))\right)\right]
𝔼[F(x,y)+c1(G(x,y)γ(x,z,w^(ξ),λ^(ξ)))]+ϵ(by ϵ-optimality)\displaystyle\leq\mathbb{E}\left[F(x^{*},y^{*})+c_{1}\left(G(x^{*},y^{*})-\ell_{\gamma}(x^{*},z^{*},\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi}))\right)\right]+\epsilon\quad\text{(by $\epsilon$-optimality)}
𝔼[F(x,y)]+c1δ+ϵ(by (A.36))\displaystyle\leq\mathbb{E}\left[F(x^{*},y^{*})\right]+c_{1}\delta+\epsilon\quad\text{(by \eqref{eq: proof of equivalence of partial penalized stochastic single level 4})}
𝔼[F(x¯,y¯)+c1(G(x¯,y¯)γ(x¯,z¯,w^(ξ),λ^(ξ)))]+c1δ+2ϵ,(by (A.35) with c1 taken).\displaystyle\leq\mathbb{E}\left[F(\bar{x},\bar{y})+c_{1}^{*}\left(G(\bar{x},\bar{y})-\ell_{\gamma}(\bar{x},\bar{z},\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi}))\right)\right]+c_{1}\delta+2\epsilon,\quad\text{{(by~\eqref{eq: proof of equivalence of partial penalized stochastic single level 3} with $c_{1}^{*}$ taken).}}

Then we have

G(x¯,y¯)𝔼[γ(x¯,z¯,w^(ξ),λ^(ξ))]2ϵ+c1δc1c12ϵ+ϵ82=1716ϵ.G(\bar{x},\bar{y})-\mathbb{E}[\ell_{\gamma}(\bar{x},\bar{z},\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi}))]\leq\frac{2\epsilon+c_{1}\delta}{c_{1}-c_{1}^{*}}\leq\frac{2\epsilon+\frac{\epsilon}{8}}{2}=\frac{17}{16}\epsilon. (A.37)

Take ϵ1=G(x¯,y¯)𝔼[γ(x¯,z¯,w^(ξ),λ^(ξ))]\epsilon_{1}=G(\bar{x},\bar{y})-\mathbb{E}[\ell_{\gamma}(\bar{x},\bar{z},\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi}))]. This implies (x¯,y¯,z¯)(\bar{x},\bar{y},\bar{z}) is feasible for (A.34). For any feasible solution (x,y,z)(x,y,z) of (A.34), the ϵ\epsilon-optimality of (x¯,y¯,z¯)(\bar{x},\bar{y},\bar{z}) implies

F(x¯,y¯)F(x,y)\displaystyle F(\bar{x},\bar{y})-F(x,y) c1(G(x,y)𝔼[γ(x,z,w^(ξ),λ^(ξ))])\displaystyle\leq c_{1}\left(G(x,y)-\mathbb{E}[\ell_{\gamma}(x,z,\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi}))]\right) (A.38)
c1(G(x¯,y¯)𝔼[γ(x¯,z¯,w^(ξ),λ^(ξ))])+ϵ\displaystyle\quad-c_{1}\left(G(\bar{x},\bar{y})-\mathbb{E}[\ell_{\gamma}(\bar{x},\bar{z},\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi}))]\right)+\epsilon
=c1(G(x,y)𝔼[γ(x,z,w^(ξ),λ^(ξ))]ϵ1)+ϵϵ.\displaystyle=c_{1}\left(G(x,y)-\mathbb{E}[\ell_{\gamma}(x,z,\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi}))]-\epsilon_{1}\right)+\epsilon\leq\epsilon.

The last inequality follows from the feasibility of (x,y)(x,y). Therefore (x¯,y¯,z¯)(\bar{x},\bar{y},\bar{z}) is a ϵ\epsilon-optimal solution of (A.34). This completes the proof. ∎

Theorem A.4.

Suppose that Assumptions 3.1, 3.2, 3.4 and 3.5 hold and γ1,γ2>0\gamma_{1},\gamma_{2}>0 are fixed parameters. 1. Assume (x,y)(x^{*},y^{*}) is a global solution to (1.2). If 𝒫(δ)\mathcal{P}(\delta) defined in (2.6) is nonempty for any (x,z)X×Z(x,z)\in X\times Z, then for any (w^,λ^)𝒫(δ)(\hat{w},\hat{\lambda})\in\mathcal{P}(\delta), there exists zz^{*} such that (x,y,z)(x^{*},y^{*},z^{*}) is an ϵ\epsilon-global-minima of the following penalized form

min(x,y,z)X×Y×Z\displaystyle\min_{(x,y,z)\in X\times Y\times Z} 𝔼[Ψ(x,y,z,w^,λ^;ξ)]\displaystyle\quad\mathbb{E}[\Psi(x,y,z,\hat{w},\hat{\lambda};\mathbf{\xi})] (A.39)
:=𝔼[F(x,y)+c1(G(x,y)γ(x,z,w^(ξ),λ^(ξ)))+c22i=1p[Hi(x,y)]+2],\displaystyle\quad:=\mathbb{E}\left[F(x,y)+c_{1}(G(x,y)-\ell_{\gamma}(x,z,\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi})))+\frac{c_{2}}{2}\sum_{i=1}^{p}[H_{i}(x,y)]_{+}^{2}\right],

with any c12L3μGϵ1c_{1}\geq\frac{2L}{3\mu_{G}}\epsilon^{-1}, c232(c1)2B2ϵ1c_{2}\geq\frac{3}{2}(c_{1})^{2}B^{2}\epsilon^{-1} and δϵ6c1\delta\leq\frac{\epsilon}{6c_{1}}.

2. By taking c1=c1+2:=2L3μGϵ1+2,c2=c2+2:=32(c1)2B2ϵ1+2c_{1}=c_{1}^{*}+2:=\frac{2L}{3\mu_{G}}\epsilon^{-1}+2,c_{2}=c_{2}^{*}+2:=\frac{3}{2}(c_{1}^{*})^{2}B^{2}\epsilon^{-1}+2 and δϵ6c1\delta\leq\frac{\epsilon}{6c_{1}}, for any (w^,λ^)𝒫(δ)(\hat{w},\hat{\lambda})\in\mathcal{P}(\delta), any ϵ\epsilon-global-minima of (A.39) is an ϵ\epsilon-global-minima of (A.34) with some ϵ1,ϵ21312ϵ\epsilon_{1},\epsilon_{2}\leq\frac{13}{12}\epsilon.

Proof 1. By Lemma A.7, we have E(x,z)v(x),zE(x,z)\leq v(x),\forall z. Then for any (x,y,z)X×Y×Z(x,y,z)\in X\times Y\times Z and (w^,λ^)𝒫(δ)(\hat{w},\hat{\lambda})\in\mathcal{P}(\delta), it holds that

𝔼[Ψ(x,y,z,w^,λ^;ξ)]F(x,y)\displaystyle\quad\mathbb{E}\left[\Psi(x,y,z,\hat{w},\hat{\lambda};\mathbf{\xi})\right]-F(x^{*},y^{*}) (A.40)
𝔼[Ψ(x,y,z,w^,λ^;ξ)]F(x,y(x))\displaystyle\geq\mathbb{E}\left[\Psi(x,y,z,\hat{w},\hat{\lambda};\mathbf{\xi})\right]-F(x,y^{*}(x))
=F(x,y)F(x,y(x))+c1(G(x,y)E(x,z))c1(𝔼[γ(x,z,w^(ξ),λ^(ξ))]E(x,z))\displaystyle=F(x,y)-F(x,y^{*}(x))+c_{1}(G(x,y)-E(x,z))-c_{1}\left(\mathbb{E}[\ell_{\gamma}(x,z,\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi}))]-E(x,z)\right)
+c22i=1p[Hi(x,y)]+2\displaystyle\quad+\frac{c_{2}}{2}\sum_{i=1}^{p}[H_{i}(x,y)]_{+}^{2}
F(x,y)F(x,y(x))+c1(G(x,y)E(x,z))c1δ+c22i=1p[Hi(x,y)]+2(by (2.6))\displaystyle\geq F(x,y)-F(x,y^{*}(x))+c_{1}(G(x,y)-E(x,z))-c_{1}\delta+\frac{c_{2}}{2}\sum_{i=1}^{p}[H_{i}(x,y)]_{+}^{2}\quad\text{(by~\eqref{eq: P delta})}
LF2c1μGc122c2B2c1δ(by (A.29)).\displaystyle\geq-\frac{L_{F}}{2c_{1}\mu_{G}}-\frac{c_{1}^{2}}{2c_{2}}B^{2}-c_{1}\delta\quad\text{(by~\eqref{eq: proof of equivalence of single level 1})}.

By (2.6), it holds that

𝔼[G(x,y)γ(x,z,w^(ξ),λ^(ξ))]δ.\displaystyle\mathbb{E}[G(x^{*},y^{*})-\ell_{\gamma}(x^{*},z^{*},\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi}))]\geq-\delta. (A.41)

Combining this with (A.40) gives

𝔼[Ψ(x,y,z,w^,λ^;ξ)]𝔼[Ψ(x,y,z,w^,λ^;ξ)]LF2c1μGc122c2B22c1δ.\displaystyle\quad\mathbb{E}\left[\Psi(x,y,z,\hat{w},\hat{\lambda};\mathbf{\xi})\right]-\mathbb{E}\left[\Psi(x^{*},y^{*},z^{*},\hat{w},\hat{\lambda};\mathbf{\xi})\right]\geq-\frac{L_{F}}{2c_{1}\mu_{G}}-\frac{c_{1}^{2}}{2c_{2}}B^{2}-2c_{1}\delta.

By taking c12L3μGϵ1c_{1}\geq\frac{2L}{3\mu_{G}}\epsilon^{-1}, c232(c1)2B2ϵ1c_{2}\geq\frac{3}{2}(c_{1})^{2}B^{2}\epsilon^{-1} and δϵ6c1\delta\leq\frac{\epsilon}{6c_{1}}, the right-hand side of the above inequality is at least ϵ-\epsilon. For the optimal solution (x,y)(x^{*},y^{*}) of (1.2) and any (w^,λ^)𝒫(δ)(\hat{w},\hat{\lambda})\in\mathcal{P}(\delta), by taking zargmaxz+pγ1(x,y,z)z^{*}\in\arg\max_{z\in\mathbb{R}_{+}^{p}}\mathcal{L}_{\gamma_{1}}(x^{*},y^{*},z), we have E(x,z)=v(x)=G(x,y)E(x^{*},z^{*})=v(x^{*})=G(x^{*},y^{*}). Lemma A.7.3 implies that such a zZz^{*}\in Z exists. Hence (x,y,z)(x^{*},y^{*},z^{*}) is an ϵ\epsilon-optimal solution of (A.39).

2. For any ϵ\epsilon-optimal solution (x¯,y¯,z¯)(\bar{x},\bar{y},\bar{z}) of (A.39), it holds that

𝔼[Ψ(x¯,y¯,z¯,w^,λ^;ξ)]\displaystyle\quad\mathbb{E}\left[\Psi(\bar{x},\bar{y},\bar{z},\hat{w},\hat{\lambda};\mathbf{\xi})\right]
𝔼[Ψ(x,y,z,w^,λ^;ξ)]+ϵ(by ϵ-optimality)\displaystyle\leq\mathbb{E}\left[\Psi(x^{*},y^{*},z^{*},\hat{w},\hat{\lambda};\mathbf{\xi})\right]+\epsilon\quad\text{(by $\epsilon$-optimality)}
𝔼[F(x,y)]+c1δ+ϵ(by (A.41))\displaystyle\leq\mathbb{E}\left[F(x^{*},y^{*})\right]+c_{1}\delta+\epsilon\quad\text{(by \eqref{eq: proof of equivalence of stochastic single level 4})}
𝔼[F(x¯,y¯)+c1(G(x¯,y¯)γ(x¯,z¯,w^(ξ),λ^(ξ)))+c22i=1p[H(x¯,y¯)]+2]+c1δ+2ϵ,(by (A.40) with c1 taken).\displaystyle\leq\mathbb{E}\left[F(\bar{x},\bar{y})+c_{1}^{*}\left(G(\bar{x},\bar{y})-\ell_{\gamma}(\bar{x},\bar{z},\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi}))\right)+\frac{c_{2}^{*}}{2}\sum_{i=1}^{p}[H(\bar{x},\bar{y})]_{+}^{2}\right]+c_{1}\delta+2\epsilon,\quad\text{{(by~\eqref{eq: proof of equivalence of stochastic single level 3} with $c_{1}^{*}$ taken).}}

Then we have

G(x¯,y¯)𝔼[γ(x¯,z¯,w^(ξ),λ^(ξ))]+12i=1p[H(x¯,y¯)]+22ϵ+c1δ22ϵ+ϵ62=1312ϵ.G(\bar{x},\bar{y})-\mathbb{E}[\ell_{\gamma}(\bar{x},\bar{z},\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi}))]+\frac{1}{2}\sum_{i=1}^{p}[H(\bar{x},\bar{y})]_{+}^{2}\leq\frac{2\epsilon+c_{1}\delta}{2}\leq\frac{2\epsilon+\frac{\epsilon}{6}}{2}=\frac{13}{12}\epsilon. (A.42)

Take ϵ1=G(x¯,y¯)𝔼[γ(x¯,z¯,w^(ξ),λ^(ξ))]\epsilon_{1}=G(\bar{x},\bar{y})-\mathbb{E}[\ell_{\gamma}(\bar{x},\bar{z},\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi}))], ϵ2=12i=1p[H(x¯,y¯)]+2\epsilon_{2}=\frac{1}{2}\sum_{i=1}^{p}[H(\bar{x},\bar{y})]_{+}^{2}. This implies (x¯,y¯,z¯)(\bar{x},\bar{y},\bar{z}) is feasible for (A.34). For any feasible solution (x,y,z)(x,y,z) of (A.34), the ϵ\epsilon-optimality of (x¯,y¯,z¯)(\bar{x},\bar{y},\bar{z}) implies

F(x¯,y¯)F(x,y)\displaystyle\quad F(\bar{x},\bar{y})-F(x,y) (A.43)
c1(G(x,y)𝔼[γ(x,z,w^(ξ),λ^(ξ))])c1(G(x¯,y¯)𝔼[γ(x¯,z¯,w^(ξ),λ^(ξ))])\displaystyle\leq c_{1}\left(G(x,y)-\mathbb{E}[\ell_{\gamma}(x,z,\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi}))]\right)-c_{1}\left(G(\bar{x},\bar{y})-\mathbb{E}[\ell_{\gamma}(\bar{x},\bar{z},\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi}))]\right)
+c22i=1p[Hi(x,y)]+2c22i=1p[Hi(x¯,y¯)]+2+ϵ\displaystyle\quad+\frac{c_{2}}{2}\sum_{i=1}^{p}[H_{i}(x,y)]_{+}^{2}-\frac{c_{2}}{2}\sum_{i=1}^{p}[H_{i}(\bar{x},\bar{y})]_{+}^{2}+\epsilon
=c1(G(x,y)𝔼[γ(x,z,w^(ξ),λ^(ξ))]ϵ1)+c2(12i=1p[Hi(x,y)]+212i=1p[Hi(x¯,y¯)]+2)+ϵ\displaystyle=c_{1}\left(G(x,y)-\mathbb{E}[\ell_{\gamma}(x,z,\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi}))]-\epsilon_{1}\right)+c_{2}\left(\frac{1}{2}\sum_{i=1}^{p}[H_{i}(x,y)]_{+}^{2}-\frac{1}{2}\sum_{i=1}^{p}[H_{i}(\bar{x},\bar{y})]_{+}^{2}\right)+\epsilon
ϵ.\displaystyle\leq\epsilon.

The last inequality follows from the feasibility of (x,y,z)(x,y,z). Therefore (x¯,y¯,z¯)(\bar{x},\bar{y},\bar{z}) is a ϵ\epsilon-optimal solution of (A.34). This completes the proof. ∎

Appendix B Analysis on the inner loop

In this section, we analyze the convergence of Algorithm 1. Assume (x,y,z)=(xk1,yk1,zk1)(x,y,z)=(x^{k-1},y^{k-1},z^{k-1}) and γ1,γ2{\gamma_{1}},\gamma_{2} are fixed. The expectation in this section denotes the expectation conditioned on ~k1\tilde{\mathcal{F}}_{k-1}, that is, 𝔼[]𝔼[|~k1]\mathbb{E}[\cdot]\triangleq\mathbb{E}[\cdot|\tilde{\mathcal{F}}_{k-1}]. Since kk is fixed, we abbreviate wk,jw^{k,j} as wjw^{j} and λk,j\lambda^{k,j} as λj\lambda^{j} for j=0,1,j=0,1,... in this section.

We write (w,λ)=(w(x,z),λ(x,z))(w^{*},\lambda^{*})=(w^{*}(x,z),\lambda^{*}(x,z)) as defined in (2.5a) and γ(x,z,w,λ)\ell_{\gamma}(x,z,w,\lambda) as defined in (2.5b). By Proposition A.2, the optimal solution (w,λ)(w^{*},\lambda^{*}) is a saddle point of γ(x,z,w,λ)\ell_{\gamma}(x,z,w,\lambda), i.e.,

γ(x,z,w,λ)γ(x,z,w,λ)γ(x,z,w,λ),wY,λ+p.\ell_{\gamma}(x,z,w^{*},\lambda)\leq\ell_{\gamma}(x,z,w^{*},\lambda^{*})\leq\ell_{\gamma}(x,z,w,\lambda^{*}),\quad\forall w\in Y,\lambda\in\mathbb{R}_{+}^{p}. (B.1)

To analyze the convergence of the inner loop, we first establish a one-step decrease property of the inner algorithm. The following lemma gives this property for the gradient descent and ascent steps.

Lemma B.1.

The following relationships hold by taking primal step (2.8a) and dual step (2.8b), respectively:

𝔼ξ1,,ξj[γ(x,z,wj,λj)]𝔼[γ(x,z,w,λj)]+12ηj𝔼ξ1,,ξj[wj+1w2]\displaystyle\quad\mathbb{E}_{\xi_{1},...,\xi_{j}}[\ell_{\gamma}(x,z,w^{j},\lambda^{j})]-\mathbb{E}[\ell_{\gamma}(x,z,w,\lambda^{j})]+\frac{1}{2\eta_{j}}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|w^{j+1}-w\|^{2}] (B.2)
12(1ηjμG)𝔼ξ1,,ξj[wjw2]+ηj2𝔼ξ1,,ξj[wγ(x,z,wj,λj;ξj)2],wY,\displaystyle\leq\frac{1}{2}(\frac{1}{\eta_{j}}-\mu_{G})\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|w^{j}-w\|^{2}]+\frac{\eta_{j}}{2}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|\nabla_{w}\ell_{\gamma}(x,z,w^{j},\lambda^{j};\xi_{j})\|^{2}],\quad\forall w\in Y,

and

γ(x,z,wj,λ)γ(x,z,wj,λj)+12ρjλj+1λ2\displaystyle\quad\ell_{\gamma}(x,z,w^{j},\lambda)-\ell_{\gamma}(x,z,w^{j},\lambda^{j})+\frac{1}{2\rho_{j}}\|\lambda^{j+1}-\lambda\|^{2} (B.3)
12(1ρjγ2)λjλ2+ρj2λγ(x,z,wj,λj)2,λ+p.\displaystyle\leq\frac{1}{2}(\frac{1}{\rho_{j}}-\gamma_{2})\|\lambda^{j}-\lambda\|^{2}+\frac{\rho_{j}}{2}\|\nabla_{\lambda}\ell_{\gamma}(x,z,w^{j},\lambda^{j})\|^{2},\quad\forall\lambda\in\mathbb{R}_{+}^{p}.

Here the notation 𝔼ξ1,,ξj[]\mathbb{E}_{\xi_{1},...,\xi_{j}}[\cdot] is the abbreviation of 𝔼ξ1𝒟ξ,,ξj𝒟ξ[]\mathbb{E}_{\xi_{1}\sim\mathcal{D}_{\xi},...,\xi_{j}\sim\mathcal{D}_{\xi}}[\cdot].

Proof The projected gradient step (2.8a) gives

wj+1w,wj+1(wjηjyγ1(x,wj,λj;ξj))0.\langle w^{j+1}-w,w^{j+1}-(w^{j}-\eta_{j}\nabla_{y}\mathcal{L}_{{\gamma_{1}}}(x,w^{j},\lambda^{j};\xi_{j}))\rangle\leq 0. (B.4)

This implies

wj+1w,yγ1(x,wj,λj;ξj)1ηjwj+1wj,wj+1w.\langle w^{j+1}-w,\nabla_{y}\mathcal{L}_{\gamma_{1}}(x,w^{j},\lambda^{j};\xi_{j})\rangle\leq-\frac{1}{\eta_{j}}\langle w^{j+1}-w^{j},w^{j+1}-w\rangle. (B.5)

By Young’s inequality, we have

wj+1wj,yγ1(x,wj,λj;ξj)\displaystyle\langle w^{j+1}-w^{j},\nabla_{y}\mathcal{L}_{\gamma_{1}}(x,w^{j},\lambda^{j};\xi_{j})\rangle 12ηjwj+1wj2ηj2yγ1(x,wj,λj;ξj)2.\displaystyle\geq-\frac{1}{2\eta_{j}}\|w^{j+1}-w^{j}\|^{2}-\frac{\eta_{j}}{2}\|\nabla_{y}\mathcal{L}_{\gamma_{1}}(x,w^{j},\lambda^{j};\xi_{j})\|^{2}. (B.6)

By the strong convexity of γ1(x,w,λ)\mathcal{L}_{\gamma_{1}}(x,w,\lambda) with respect to ww, we have

wjw,yγ1(x,wj,λj)γ1(x,wj,λj)γ1(x,w,λj)+μG2wjw2.\langle w^{j}-w,\nabla_{y}\mathcal{L}_{\gamma_{1}}(x,w^{j},\lambda^{j})\rangle\geq\mathcal{L}_{\gamma_{1}}(x,w^{j},\lambda^{j})-\mathcal{L}_{\gamma_{1}}(x,w,\lambda^{j})+\frac{\mu_{G}}{2}\|w^{j}-w\|^{2}. (B.7)

Note that wjw^{j} is independent of ξj\xi_{j}, and hence

𝔼ξ1,,ξj[wjw,yγ1(x,wj,λj;ξj)]\displaystyle\quad\mathbb{E}_{\xi_{1},...,\xi_{j}}[\langle w^{j}-w,\nabla_{y}\mathcal{L}_{\gamma_{1}}(x,w^{j},\lambda^{j};\xi_{j})\rangle] (B.8)
=𝔼ξ1,,ξj1[𝔼ξj[wjw,yγ1(x,wj,λj;ξj)|ξ1,,ξj1]]\displaystyle=\mathbb{E}_{\xi_{1},...,\xi_{j-1}}[\mathbb{E}_{\xi_{j}}[\langle w^{j}-w,\nabla_{y}\mathcal{L}_{\gamma_{1}}(x,w^{j},\lambda^{j};\xi_{j})\rangle|\xi_{1},.,\xi_{j-1}]]
=𝔼ξ1,,ξj[wjw,yγ1(x,wj,λj)].\displaystyle=\mathbb{E}_{\xi_{1},...,\xi_{j}}[\langle w^{j}-w,\nabla_{y}\mathcal{L}_{\gamma_{1}}(x,w^{j},\lambda^{j})\rangle].

Summing up (B.6) and (B.7), combining with (B.8), and then taking expectations gives

𝔼ξ1,,ξj[wj+1w,yγ1(x,wj,λj;ξj)]\displaystyle\quad\mathbb{E}_{\xi_{1},...,\xi_{j}}[\langle w^{j+1}-w,\nabla_{y}\mathcal{L}_{\gamma_{1}}(x,w^{j},\lambda^{j};\xi_{j})\rangle]
12ηj𝔼ξ1,,ξj[wj+1wj2]ηj2𝔼ξ1,,ξj[yγ1(x,wj,λj;ξj)2]\displaystyle\geq-\frac{1}{2\eta_{j}}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|w^{j+1}-w^{j}\|^{2}]-\frac{\eta_{j}}{2}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|\nabla_{y}\mathcal{L}_{\gamma_{1}}(x,w^{j},\lambda^{j};\xi_{j})\|^{2}]
+𝔼ξ1,,ξj[γ1(x,wj,λj)]𝔼ξ1,,ξj[γ1(x,w,λj)]+μG2𝔼ξ1,,ξj[wjw2]\displaystyle+\mathbb{E}_{\xi_{1},...,\xi_{j}}[\mathcal{L}_{\gamma_{1}}(x,w^{j},\lambda^{j})]-\mathbb{E}_{\xi_{1},...,\xi_{j}}[\mathcal{L}_{\gamma_{1}}(x,w,\lambda^{j})]+\frac{\mu_{G}}{2}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|w^{j}-w\|^{2}]
=12ηj𝔼ξ1,,ξj[wj+1wj2]ηj2𝔼ξ1,,ξj[wγ(x,z,wj,λj;ξj)2]\displaystyle=-\frac{1}{2\eta_{j}}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|w^{j+1}-w^{j}\|^{2}]-\frac{\eta_{j}}{2}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|\nabla_{w}\ell_{\gamma}(x,z,w^{j},\lambda^{j};\xi_{j})\|^{2}]
+𝔼ξ1,,ξj[γ(x,z,wj,λj)]𝔼ξ1,,ξj[γ(x,z,w,λj)]+μG2𝔼ξ1,,ξj[wjw2].\displaystyle+\mathbb{E}_{\xi_{1},...,\xi_{j}}[\ell_{\gamma}(x,z,w^{j},\lambda^{j})]-\mathbb{E}_{\xi_{1},...,\xi_{j}}[\ell_{\gamma}(x,z,w,\lambda^{j})]+\frac{\mu_{G}}{2}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|w^{j}-w\|^{2}].

Utilizing the identity wj+1wj2+wj+1w2wjw22wj+1w,wj+1wj=0\|w^{j+1}-w^{j}\|^{2}+\|w^{j+1}-w\|^{2}-\|w^{j}-w\|^{2}-2\langle w^{j+1}-w,w^{j+1}-w^{j}\rangle=0 and (B.5), we have

𝔼ξ1,,ξj[γ(x,z,wj,λj)]𝔼ξ1,,ξj[γ(x,z,w,λj)]\displaystyle\quad\mathbb{E}_{\xi_{1},...,\xi_{j}}[\ell_{\gamma}(x,z,w^{j},\lambda^{j})]-\mathbb{E}_{\xi_{1},...,\xi_{j}}[\ell_{\gamma}(x,z,w,\lambda^{j})]
𝔼ξ1,,ξj[wj+1w,yγ1(x,wj,λj;ξj)]+12ηj𝔼ξ1,,ξj[wj+1wj2]\displaystyle\leq\mathbb{E}_{\xi_{1},...,\xi_{j}}[\langle w^{j+1}-w,\nabla_{y}\mathcal{L}_{\gamma_{1}}(x,w^{j},\lambda^{j};\xi_{j})\rangle]+\frac{1}{2\eta_{j}}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|w^{j+1}-w^{j}\|^{2}]
+ηj2𝔼ξ1,,ξj[wγ(x,z,wj,λj;ξj)2]μG2𝔼ξ1,,ξj[wjw2]\displaystyle\quad+\frac{\eta_{j}}{2}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|\nabla_{w}\ell_{\gamma}(x,z,w^{j},\lambda^{j};\xi_{j})\|^{2}]-\frac{\mu_{G}}{2}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|w^{j}-w\|^{2}]
1ηj𝔼ξ1,,ξj[wj+1wj,wj+1w]+12ηj𝔼ξ1,,ξj[wj+1wj2]\displaystyle\leq-\frac{1}{\eta_{j}}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\langle w^{j+1}-w^{j},w^{j+1}-w\rangle]+\frac{1}{2\eta_{j}}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|w^{j+1}-w^{j}\|^{2}]
μG2𝔼ξ1,,ξj[wjw2]+ηj2𝔼ξ1,,ξj[wγ(x,z,wj,λj;ξj)2]\displaystyle\quad-\frac{\mu_{G}}{2}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|w^{j}-w\|^{2}]+\frac{\eta_{j}}{2}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|\nabla_{w}\ell_{\gamma}(x,z,w^{j},\lambda^{j};\xi_{j})\|^{2}]
=12(1ηjμG)𝔼ξ1,,ξj[wjw2]12ηj𝔼ξ1,,ξj[wj+1w2]\displaystyle=\frac{1}{2}(\frac{1}{\eta_{j}}-\mu_{G})\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|w^{j}-w\|^{2}]-\frac{1}{2\eta_{j}}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|w^{j+1}-w\|^{2}]
+ηj2𝔼ξ1,,ξj[wγ(x,z,wj,λj;ξj)2].\displaystyle\quad+\frac{\eta_{j}}{2}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|\nabla_{w}\ell_{\gamma}(x,z,w^{j},\lambda^{j};\xi_{j})\|^{2}].

This gives (B.2). Since γ1(x,w,λ)\mathcal{L}_{\gamma_{1}}(x,w,\lambda) is concave in λ\lambda, the function γ(x,z,w,λ)\ell_{\gamma}(x,z,w,\lambda) is γ2\gamma_{2}-strongly concave in λ\lambda. Similarly, we can obtain (B.3) by taking the dual step (2.8b). This completes the proof. ∎

Combining the primal and dual steps, the following corollary shows the decrease property in terms of the duality gap.

Corollary B.1.

For any (w,λ)Y×+p(w,\lambda)\in Y\times\mathbb{R}_{+}^{p} it holds that

𝔼ξ1,,ξj[γ(x,z,wj,λ)γ(x,z,w,λj)]+12ηj𝔼ξ1,,ξj[wj+1w2]\displaystyle\quad\mathbb{E}_{\xi_{1},...,\xi_{j}}[\ell_{\gamma}(x,z,w^{j},\lambda)-\ell_{\gamma}(x,z,w,\lambda^{j})]+\frac{1}{2\eta_{j}}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|w^{j+1}-w\|^{2}] (B.9)
+12ρj𝔼ξ1,,ξj[λj+1λ2]\displaystyle\quad+\frac{1}{2\rho_{j}}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|\lambda^{j+1}-\lambda\|^{2}]
12(1ηjμG)𝔼ξ1,,ξj[wjw2]+12(1ρjγ2)𝔼ξ1,,ξj[λjλ2]\displaystyle\leq\frac{1}{2}(\frac{1}{\eta_{j}}-\mu_{G})\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|w^{j}-w\|^{2}]+\frac{1}{2}(\frac{1}{\rho_{j}}-\gamma_{2})\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|\lambda^{j}-\lambda\|^{2}]
+ηj2(M,1+M,2𝔼ξ1,,ξj[λj2])+ρj2(4γ2z2+4MH,02+2(γ1+γ2)𝔼ξ1,,ξj[λj2]).\displaystyle\quad+\frac{\eta_{j}}{2}(M_{\mathcal{L},1}+M_{\mathcal{L},2}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|\lambda^{j}\|^{2}])+\frac{\rho_{j}}{2}\left(4\gamma_{2}\|z\|^{2}+4M_{H,0}^{2}+2(\gamma_{1}+\gamma_{2})\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|\lambda^{j}\|^{2}]\right).

Proof From (A.10), we have

λγ(x,z,wj,λj)2\displaystyle\|\nabla_{\lambda}\ell_{\gamma}(x,z,w^{j},\lambda^{j})\|^{2} max(γ2z(γ1+γ2)λj,γ2zγ2λj+H(x,wj))2\displaystyle\leq\|\max(\gamma_{2}z-(\gamma_{1}+\gamma_{2})\lambda^{j},\gamma_{2}z-\gamma_{2}\lambda^{j}+H(x,w^{j}))\|^{2} (B.10)
4γ2z2+4MH,02+2(γ1+γ2)λj2.\displaystyle\leq 4\gamma_{2}\|z\|^{2}+4M_{H,0}^{2}+2(\gamma_{1}+\gamma_{2})\|\lambda^{j}\|^{2}.

Then combining (A.7b), (B.2), (B.3) and (B.10) gives (B.9). ∎

By taking decreasing step sizes, we can prove that the dual variable remains bounded throughout the iterations of the inner loop.

Lemma B.2.

Choose the dual step size as ρj=ρj+1,\rho_{j}=\frac{\rho}{j+1}, where the constant ρ\rho satisfies ρ>1γ2\rho>\frac{1}{\gamma_{2}}; then the following bound holds:

λj2Mλ:=2ρ2γ22z2+2pρ2MH,02,j=1,,s.\|\lambda^{j}\|^{2}\leq M_{\lambda}:=2\rho^{2}\gamma_{2}^{2}\|z\|^{2}+2p\rho^{2}M_{H,0}^{2},\quad j=1,...,s. (B.11)

Proof From (2.8b) and (2.5b) we have

λj+1\displaystyle\lambda^{j+1} =Proj+p(λj+ρj+1(zγ1(x,wj,λj)γ2(λjz)))\displaystyle=\mathrm{Proj}_{\mathbb{R}_{+}^{p}}\left(\lambda^{j}+\frac{\rho}{j+1}(\nabla_{z}\mathcal{L}_{\gamma_{1}}(x,w^{j},\lambda^{j})-\gamma_{2}(\lambda^{j}-z))\right) (B.12)
=Proj+p(λj+ρj+1(max(γ1λj,H(x,wj))γ2λj+γ2z))\displaystyle=\mathrm{Proj}_{\mathbb{R}_{+}^{p}}\left(\lambda^{j}+\frac{\rho}{j+1}(\max(-\gamma_{1}\lambda^{j},H(x,w^{j}))-\gamma_{2}\lambda^{j}+\gamma_{2}z)\right)

Considering the ii-th component of λj+1\lambda^{j+1}, we have

\lambda_{i}^{j+1} =\max\left(0,\lambda_{i}^{j}+\frac{\rho}{j+1}\left(\max(-\gamma_{1}\lambda_{i}^{j},H_{i}(x,w^{j}))-\gamma_{2}\lambda_{i}^{j}+\gamma_{2}z_{i}\right)\right)
\leq\max\left(0,\left(1-\frac{\rho\gamma_{2}}{j+1}\right)\lambda_{i}^{j}\right)+\frac{\rho}{j+1}\left(\gamma_{2}|z_{i}|+\max(0,H_{i}(x,w^{j}))\right).

The inequality uses max(a+b,0)max(a,0)+max(b,0)\max(a+b,0)\leq\max(a,0)+\max(b,0) together with max(γ1λij,Hi(x,wj))max(0,Hi(x,wj))\max(-\gamma_{1}\lambda_{i}^{j},H_{i}(x,w^{j}))\leq\max(0,H_{i}(x,w^{j})), which holds since λij0\lambda_{i}^{j}\geq 0. Combining this with max(0,Hi(x,wj))MH,0\max(0,H_{i}(x,w^{j}))\leq M_{H,0}, we obtain the recursive relation

\lambda^{j+1}_{i}\leq\max\left(0,\left(1-\frac{\rho\gamma_{2}}{j+1}\right)\lambda^{j}_{i}\right)+\frac{\rho}{j+1}\gamma_{2}|z_{i}|+\frac{\rho}{j+1}M_{H,0}.

Multiplying both sides by j+1j+1 and using ρ>1γ2\rho>\frac{1}{\gamma_{2}}, so that (j+1)max(0,1ργ2j+1)j(j+1)\max(0,1-\frac{\rho\gamma_{2}}{j+1})\leq j, we have

(j+1)λij+1jλij+ργ2|zi|+ρMH,0.(j+1)\lambda^{j+1}_{i}\leq j\lambda^{j}_{i}+\rho\gamma_{2}|z_{i}|+\rho M_{H,0}.

Note that the initial point satisfies \lambda^{0}_{i}=0. Hence \lambda^{j}_{i}\leq\rho\gamma_{2}|z_{i}|+\rho M_{H,0} for all j and i, which gives the bound on \lambda^{j} in (B.11). ∎

Now we are ready to establish the convergence of the inner loop. In the following theorem, we show that the output pair (w^{s},\lambda^{s}) is an \widetilde{O}(\frac{1}{s})-optimal point of the lower-level problem, measured by the squared distance to (w^{*},\lambda^{*}).

Theorem B.1.

By taking the step sizes as

ηj=ηj+1,ρj=ρj+1,\eta_{j}=\frac{\eta}{j+1},\quad\rho_{j}=\frac{\rho}{j+1}, (B.13)

with constants \eta\geq\frac{1}{\mu_{G}} and \rho\geq\frac{1}{\gamma_{2}}, there exist constants \phi_{1},\phi_{2}>0 such that

𝔼ξ[wsw2]\displaystyle\mathbb{E}_{\mathbf{\xi}}[\|w^{s}-w^{*}\|^{2}] ϕ11+log(s)s,\displaystyle\leq{\phi}_{1}\frac{1+\log(s)}{s}, (B.14)
𝔼ξ[λsλ2]\displaystyle\mathbb{E}_{\mathbf{\xi}}[\|\lambda^{s}-\lambda^{*}\|^{2}] ϕ21+log(s)s,\displaystyle\leq{\phi}_{2}\frac{1+\log(s)}{s},

where ϕ1=η(ηM,1+ρ(4γ2z2+4MH,02)+(ηM,2+2ρ(γ1+γ2))Mλ)\phi_{1}=\eta\left(\eta M_{\mathcal{L},1}+\rho(4\gamma_{2}\|z\|^{2}+4M_{H,0}^{2})+(\eta M_{\mathcal{L},2}+2\rho(\gamma_{1}+\gamma_{2}))M_{\lambda}\right), ϕ2=ρηϕ1{\phi}_{2}=\frac{\rho}{\eta}\phi_{1}.

Proof Taking ηj=ηj+1\eta_{j}=\frac{\eta}{j+1}, ρj=ρj+1\rho_{j}=\frac{\rho}{j+1} in (B.9) gives

\displaystyle\quad\mathbb{E}_{\xi_{1},...,\xi_{j}}[\ell_{\gamma}(x,z,w^{j},\lambda)-\ell_{\gamma}(x,z,w,\lambda^{j})]+\frac{j+1}{2\eta}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|w^{j+1}-w\|^{2}] (B.15)
\displaystyle\quad+\frac{j+1}{2\rho}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|\lambda^{j+1}-\lambda\|^{2}]
\displaystyle\leq\frac{1}{2}(\frac{j+1}{\eta}-\mu_{G})\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|w^{j}-w\|^{2}]+\frac{1}{2}(\frac{j+1}{\rho}-\gamma_{2})\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|\lambda^{j}-\lambda\|^{2}]
\displaystyle\quad+\frac{\eta}{2(j+1)}(M_{\mathcal{L},1}+M_{\mathcal{L},2}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|\lambda^{j}\|^{2}])
\displaystyle\quad+\frac{\rho}{2(j+1)}\left(4\gamma_{2}\|z\|^{2}+4M_{H,0}^{2}+2(\gamma_{1}+\gamma_{2})\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|\lambda^{j}\|^{2}]\right)
\displaystyle\leq\frac{j}{2\eta}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|w^{j}-w\|^{2}]+\frac{j}{2\rho}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|\lambda^{j}-\lambda\|^{2}]
\displaystyle\quad+\frac{\eta}{2(j+1)}(M_{\mathcal{L},1}+M_{\mathcal{L},2}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|\lambda^{j}\|^{2}])
\displaystyle\quad+\frac{\rho}{2(j+1)}\left(4\gamma_{2}\|z\|^{2}+4M_{H,0}^{2}+2(\gamma_{1}+\gamma_{2})\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|\lambda^{j}\|^{2}]\right).

Then taking w=w,λ=λw=w^{*},\lambda=\lambda^{*} and utilizing γ(x,z,wj,λ)γ(x,z,w,λ)γ(x,z,w,λj)\ell_{\gamma}(x,z,w^{j},\lambda^{*})\geq\ell_{\gamma}(x,z,w^{*},\lambda^{*})\geq\ell_{\gamma}(x,z,w^{*},\lambda^{j}) gives

j+12η𝔼ξ1,,ξj[wj+1w2]+j+12ρ𝔼ξ1,,ξj[λj+1λ2]\displaystyle\quad\frac{j+1}{2\eta}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|w^{j+1}-w^{*}\|^{2}]+\frac{j+1}{2\rho}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|\lambda^{j+1}-\lambda^{*}\|^{2}] (B.16)
j2η𝔼ξ1,,ξj[wjw2]+j2ρ𝔼ξ1,,ξj[λjλ2]\displaystyle\leq\frac{j}{2\eta}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|w^{j}-w^{*}\|^{2}]+\frac{j}{2\rho}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|\lambda^{j}-\lambda^{*}\|^{2}]
+η2(j+1)(M,1+M,2𝔼ξ1,,ξj[λj2])\displaystyle\quad+\frac{\eta}{2(j+1)}(M_{\mathcal{L},1}+M_{\mathcal{L},2}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|\lambda^{j}\|^{2}])
+ρ2(j+1)(4γ2z2+4MH,02+2(γ1+γ2)𝔼ξ1,,ξj[λj2]).\displaystyle\quad+\frac{\rho}{2(j+1)}\left(4\gamma_{2}\|z\|^{2}+4M_{H,0}^{2}+2(\gamma_{1}+\gamma_{2})\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|\lambda^{j}\|^{2}]\right).

This recursive relationship gives

j2η𝔼ξ1,,ξj[wjw2]+j2ρ𝔼ξ1,,ξj[λjλ2]\displaystyle\quad\frac{j}{2\eta}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|w^{j}-w^{*}\|^{2}]+\frac{j}{2\rho}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|\lambda^{j}-\lambda^{*}\|^{2}] (B.17)
\displaystyle\leq\sum_{l=0}^{j-1}\frac{\eta}{2(l+1)}(M_{\mathcal{L},1}+M_{\mathcal{L},2}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|\lambda^{l}\|^{2}])+\sum_{l=0}^{j-1}\frac{\rho}{2(l+1)}\left(4\gamma_{2}\|z\|^{2}+4M_{H,0}^{2}+2(\gamma_{1}+\gamma_{2})\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|\lambda^{l}\|^{2}]\right)
\displaystyle\leq\frac{1}{2}(1+\log(j))\left(\eta M_{\mathcal{L},1}+\rho(4\gamma_{2}\|z\|^{2}+4M_{H,0}^{2})\right)
\displaystyle\quad+\frac{1}{2}(\eta M_{\mathcal{L},2}+2\rho(\gamma_{1}+\gamma_{2}))\sum_{l=0}^{j-1}\frac{1}{l+1}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|\lambda^{l}\|^{2}].

Substituting (B.11) into  (B.17) and taking j=sj=s we have

\displaystyle\quad\frac{s}{2\eta}\mathbb{E}_{\mathbf{\xi}}[\|w^{s}-w^{*}\|^{2}]+\frac{s}{2\rho}\mathbb{E}_{\mathbf{\xi}}[\|\lambda^{s}-\lambda^{*}\|^{2}] (B.18)
12(1+log(s))(ηM,1+ρ(4γ2z2+4MH,02)+(ηM,2+2ρ(γ1+γ2))Mλ).\displaystyle\leq\frac{1}{2}(1+\log(s))\left(\eta M_{\mathcal{L},1}+\rho(4\gamma_{2}\|z\|^{2}+4M_{H,0}^{2})+(\eta M_{\mathcal{L},2}+2\rho(\gamma_{1}+\gamma_{2}))M_{\lambda}\right).

This gives the bounds on \mathbb{E}_{\mathbf{\xi}}[\|w^{s}-w^{*}\|^{2}] and \mathbb{E}_{\mathbf{\xi}}[\|\lambda^{s}-\lambda^{*}\|^{2}] stated in (B.14). ∎
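To make the iteration analyzed above concrete, the following minimal Python sketch runs the inner primal-dual loop with the decreasing step sizes (B.13). It is illustrative only: the gradient callables, the projection onto Y, and the sampling of \xi are assumed placeholders supplied by the user, not the implementation used in the paper.

```python
import numpy as np

def inner_loop(grad_w_ell, grad_lam_ell, proj_Y, w0, lam0, s, eta, rho, rng):
    """Illustrative inner primal-dual loop with step sizes (B.13).

    Assumed placeholders: grad_w_ell(w, lam, xi) and grad_lam_ell(w, lam)
    return (stochastic) gradients of ell_gamma(x, z, ., .) with (x, z) fixed;
    proj_Y projects onto the compact set Y.
    """
    w, lam = np.array(w0, dtype=float), np.array(lam0, dtype=float)
    for j in range(s):
        eta_j = eta / (j + 1)        # primal step size eta_j = eta / (j + 1)
        rho_j = rho / (j + 1)        # dual step size rho_j = rho / (j + 1)
        xi = rng.standard_normal()   # stand-in for a sample from D_xi
        w_next = proj_Y(w - eta_j * grad_w_ell(w, lam, xi))
        # dual ascent evaluated at the old w^j, then projection onto R_+^p
        lam = np.maximum(lam + rho_j * grad_lam_ell(w, lam), 0.0)
        w = w_next
    return w, lam
```

Note that the dual step uses the old primal iterate w^{j}, and the projection onto \mathbb{R}_{+}^{p} combined with the O(1/(j+1)) dual step size is exactly what yields the boundedness of \lambda^{j} in (B.11).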

Measuring the suboptimality by the objective gap instead, a similar convergence rate can be obtained as follows.

Corollary B.2.

Under the same conditions as in Theorem B.1, it holds that

|𝔼ξ[γ(x,z,ws,λs)]E(x,z)|O~(1s).\left|\mathbb{E}_{\mathbf{\xi}}[\ell_{\gamma}(x,z,w^{s},\lambda^{s})]-E(x,z)\right|\leq\widetilde{O}\left(\frac{1}{s}\right). (B.19)

Proof From (A.9) and (A.10), we see that \nabla_{w}\ell_{\gamma}(x,z,w,\lambda) is Lipschitz continuous in w with modulus L_{\ell,w}=L_{G}+p(\sqrt{M_{\lambda}}+\gamma_{1}^{-1}M_{H,0})L_{H}+M_{H,1}^{2}, and \nabla_{\lambda}\ell_{\gamma}(x,z,w,\lambda) is Lipschitz continuous in \lambda with modulus L_{\ell,\lambda}=\gamma_{1}+\gamma_{2}, given the bound \|\lambda\|^{2}\leq M_{\lambda}. Then, under the strongly-convex-strongly-concave condition established in Lemma A.4, the objective gap

\mathbb{E}_{\mathbf{\xi}}[\ell_{\gamma}(x,z,w,\lambda^{*})]-\mathbb{E}_{\mathbf{\xi}}[\ell_{\gamma}(x,z,w^{*},\lambda)]=\mathbb{E}_{\mathbf{\xi}}[\ell_{\gamma}(x,z,w,\lambda^{*})-\ell_{\gamma}(x,z,w^{*},\lambda^{*})]+\mathbb{E}_{\mathbf{\xi}}[\ell_{\gamma}(x,z,w^{*},\lambda^{*})-\ell_{\gamma}(x,z,w^{*},\lambda)]

has a relationship to 𝔼ξ[ww2]+𝔼ξ[λλ2]\mathbb{E}_{\mathbf{\xi}}[\|w-w^{*}\|^{2}]+\mathbb{E}_{\mathbf{\xi}}[\|\lambda-\lambda^{*}\|^{2}] as follows:

μG2𝔼ξ[ww2]+γ22𝔼ξ[λλ2]\displaystyle\frac{\mu_{G}}{2}\mathbb{E}_{\mathbf{\xi}}[\|w-w^{*}\|^{2}]+\frac{\gamma_{2}}{2}\mathbb{E}_{\mathbf{\xi}}[\|\lambda-\lambda^{*}\|^{2}] |𝔼ξ[γ(x,z,w,λ)]𝔼ξ[γ(x,z,w,λ)]|\displaystyle\leq\left|\mathbb{E}_{\mathbf{\xi}}[\ell_{\gamma}(x,z,w,\lambda^{*})]-\mathbb{E}_{\mathbf{\xi}}[\ell_{\gamma}(x,z,w^{*},\lambda)]\right|
\displaystyle\leq\frac{L_{\ell,w}}{2}\mathbb{E}_{\mathbf{\xi}}[\|w-w^{*}\|^{2}]+\frac{L_{\ell,\lambda}}{2}\mathbb{E}_{\mathbf{\xi}}[\|\lambda-\lambda^{*}\|^{2}].

Therefore, the convergence rate of the objective function is also O~(1s)\widetilde{O}(\frac{1}{s}), that is

|𝔼ξ[γ(x,z,ws,λ)]𝔼ξ[γ(x,z,w,λs)]|O~(1s).\left|\mathbb{E}_{\mathbf{\xi}}\left[\ell_{\gamma}(x,z,w^{s},\lambda^{*})\right]-\mathbb{E}_{\mathbf{\xi}}\left[\ell_{\gamma}(x,z,w^{*},\lambda^{s})\right]\right|\leq\widetilde{O}\left(\frac{1}{s}\right).

Since \ell_{\gamma}(x,z,w^{s},\lambda^{*})\geq\ell_{\gamma}(x,z,w^{s},\lambda^{s}) and \ell_{\gamma}(x,z,w^{*},\lambda^{s})\leq\ell_{\gamma}(x,z,w^{*},\lambda^{*})=E(x,z), we have

𝔼ξ[γ(x,z,ws,λs)]E(x,z)𝔼ξ[γ(x,z,ws,λ)]𝔼ξ[γ(x,z,w,λs)]O~(1s).\mathbb{E}_{\mathbf{\xi}}[\ell_{\gamma}(x,z,w^{s},\lambda^{s})]-E(x,z)\leq\mathbb{E}_{\mathbf{\xi}}[\ell_{\gamma}(x,z,w^{s},\lambda^{*})]-\mathbb{E}_{\mathbf{\xi}}[\ell_{\gamma}(x,z,w^{*},\lambda^{s})]\leq\widetilde{O}\left(\frac{1}{s}\right).

Similarly, it holds that

E(x,z)𝔼ξ[γ(x,z,ws,λs)]O~(1s).E(x,z)-\mathbb{E}_{\mathbf{\xi}}[\ell_{\gamma}(x,z,w^{s},\lambda^{s})]\leq\widetilde{O}\left(\frac{1}{s}\right).

This completes the proof. ∎

Appendix C Analysis on the outer loop

In the outer loop, we apply the stochastic gradient descent (SGD) method to solve the saddle point problem (2.9). Unlike the standard analysis of SGD, this setting presents two main challenges: first, constructing a suitable stochastic gradient oracle for (2.10); second, addressing the bias introduced by the inexact solution of the lower-level problem.

In the following lemma, we show that \mathcal{G} is continuously differentiable and that its gradient is computable given the optimal solution of the subproblem (2.5a).

Lemma C.1.

Let (w^{*},\lambda^{*})=(w^{*}(x,z),\lambda^{*}(x,z)) be defined as in (2.5a). Then \mathcal{G}(x,y,z) defined in (2.3) is continuously differentiable and its gradient is given by

x𝒢(x,y,z)\displaystyle\nabla_{x}\mathcal{G}(x,y,z) =xg(x,y)xγ(x,z,w,λ),\displaystyle=\nabla_{x}g(x,y)-\nabla_{x}\ell_{\gamma}(x,z,w^{*},\lambda^{*}), (C.1a)
y𝒢(x,y,z)\displaystyle\nabla_{y}\mathcal{G}(x,y,z) =yg(x,y),\displaystyle=\nabla_{y}g(x,y), (C.1b)
z𝒢(x,y,z)\displaystyle\nabla_{z}\mathcal{G}(x,y,z) =γ2(λz).\displaystyle=\gamma_{2}(\lambda^{*}-z). (C.1c)

Furthermore, using the fact that \|z\|\leq B, we obtain that \mathcal{G}(x,y,z) is Lipschitz continuous with modulus L_{\mathcal{G}}=3M_{G,1}+\sqrt{p}M_{H,1}(B+\frac{\sqrt{p}}{\gamma_{2}}M_{H,0})+pM_{H,0}M_{H,1}+\sqrt{p}M_{H,0}.

Proof By Theorem 4.24 in (3), E(x,z) is continuously differentiable with respect to x and its gradient is given by

\displaystyle\nabla_{x}E(x,z) =\nabla_{x}\left(\max_{\lambda\in\mathbb{R}_{+}^{p}}\left\{D(x,\lambda)-\frac{\gamma_{2}}{2}\|\lambda-z\|^{2}\right\}\right)
\displaystyle=\nabla_{x}\left(D(x,\lambda)-\frac{\gamma_{2}}{2}\|\lambda-z\|^{2}\right)\Big|_{\lambda=\lambda^{*}(x,z)}\quad\text{(due to the uniqueness of $\lambda^{*}$)}
\displaystyle=\nabla_{x}\ell_{\gamma}(x,z,w^{*},\lambda^{*}).

Hence \mathcal{G}(x,y,z) is continuously differentiable and (C.1a) holds. Since E(x,z) is independent of y, (C.1b) is straightforward. Note that -E(x,z) is the Moreau envelope of -D(x,\cdot) for any x, so Proposition A.1.4 gives -\nabla_{z}E(x,z)=\gamma_{2}(z-\mathrm{prox}_{-\frac{1}{\gamma_{2}}D(x,\cdot)}(z)). On the other hand, the optimality condition of the subproblem (2.5a) gives

\mathrm{prox}_{-\frac{1}{\gamma_{2}}D(x,\cdot)}(z)=\arg\max_{\lambda\in\mathbb{R}_{+}^{p}}\left\{D(x,\lambda)-\frac{\gamma_{2}}{2}\|\lambda-z\|^{2}\right\}=\arg\max_{\lambda\in\mathbb{R}_{+}^{p}}\left\{\min_{w\in Y}\ell_{\gamma}(x,z,w,\lambda)\right\}=\lambda^{*}.

This gives (C.1c).

Now we show the Lipschitz continuity of \mathcal{G}(x,y,z), that is, \|\nabla\mathcal{G}(x,y,z)\|\leq L_{\mathcal{G}}. Similar to the proof of Lemma A.2, it holds that \|\nabla_{x}\ell_{\gamma}(x,z,w^{*},\lambda^{*})\|\leq M_{G,1}+\frac{1}{{\gamma_{1}}}\sum_{i=1}^{p}({\gamma_{1}}|\lambda^{*}_{i}|+M_{H,0})M_{H,1}. Again by Theorem 4.24 in (3), D(x,\lambda)=\min_{y\in Y}\{\mathcal{L}(x,y,\lambda)\} is continuously differentiable with respect to \lambda and its gradient is given by \nabla_{\lambda}D(x,\lambda)=\nabla_{z}\mathcal{L}(x,y^{*}(x,\lambda),\lambda), where y^{*}(x,\lambda) is the optimal solution of the subproblem \min_{y\in Y}\mathcal{L}(x,y,\lambda). From (A.4) we know \|\nabla_{z}\mathcal{L}(x,y,\lambda)\|\leq\sqrt{p}M_{H,0}. Hence D(x,\lambda) is \sqrt{p}M_{H,0}-Lipschitz continuous with respect to \lambda. Proposition A.1.4 then implies that E(x,z), being (up to sign) a Moreau envelope of D(x,\cdot), is Lipschitz continuous in z with the same Lipschitz constant. This gives

z𝒢(x,y,z)=γ2λzpMH,0.\|\nabla_{z}\mathcal{G}(x,y,z)\|=\gamma_{2}\|\lambda^{*}-z\|\leq\sqrt{p}M_{H,0}. (C.2)

Therefore,

𝒢(x,y,z)\displaystyle\|\nabla\mathcal{G}(x,y,z)\| xG(x,y)+xγ(x,z,w,λ)+yG(x,y)+z𝒢(x,y,z)\displaystyle\leq\|\nabla_{x}G(x,y)\|+\|\nabla_{x}\ell_{\gamma}(x,z,w^{*},\lambda^{*})\|+\|\nabla_{y}G(x,y)\|+\|\nabla_{z}\mathcal{G}(x,y,z)\|
MG,1+MG,1+1γ1i=1p(γ1|λi|+MH,0)MH,1+MG,1+pMH,0\displaystyle\leq M_{G,1}+M_{G,1}+\frac{1}{{\gamma_{1}}}\sum_{i=1}^{p}({\gamma_{1}}|\lambda^{*}_{i}|+M_{H,0})M_{H,1}+M_{G,1}+\sqrt{p}M_{H,0}
3MG,1+pMH,1λ+pMH,0MH,1+pMH,0\displaystyle\leq 3M_{G,1}+\sqrt{p}M_{H,1}\|\lambda^{*}\|+pM_{H,0}M_{H,1}+\sqrt{p}M_{H,0}
3MG,1+pMH,1(z+pγ2MH,0)+pMH,0MH,1+pMH,0(by (C.2))\displaystyle\leq 3M_{G,1}+\sqrt{p}M_{H,1}(\|z\|+\frac{\sqrt{p}}{\gamma_{2}}M_{H,0})+pM_{H,0}M_{H,1}+\sqrt{p}M_{H,0}\quad\text{(by~\eqref{eq: proof of gradient of mathcal G 1})}
3MG,1+pMH,1(B+pγ2MH,0)+pMH,0MH,1+pMH,0=L𝒢.\displaystyle\leq 3M_{G,1}+\sqrt{p}M_{H,1}(B+\frac{\sqrt{p}}{\gamma_{2}}M_{H,0})+pM_{H,0}M_{H,1}+\sqrt{p}M_{H,0}=L_{\mathcal{G}}.

The last inequality follows from \|z\|\leq B. This completes the proof. ∎
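Given the pair (w^{*},\lambda^{*}), the gradient formula (C.1) is directly implementable. The short sketch below assembles it from user-supplied callables, which are assumed placeholders for the deterministic gradients of g and \ell_{\gamma}, not part of the paper's code.

```python
def grad_G(grad_x_g, grad_y_g, grad_x_ell, x, y, z, w_star, lam_star, gamma2):
    """Assemble the gradient of G(x, y, z) via (C.1a)-(C.1c); sketch only.

    grad_x_g(x, y), grad_y_g(x, y) and grad_x_ell(x, z, w, lam) are assumed
    callables for nabla_x g, nabla_y g and nabla_x ell_gamma, respectively.
    """
    gx = grad_x_g(x, y) - grad_x_ell(x, z, w_star, lam_star)  # (C.1a)
    gy = grad_y_g(x, y)                                       # (C.1b)
    gz = gamma2 * (lam_star - z)                              # (C.1c)
    return gx, gy, gz
```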

Corollary C.1.

Ψ(𝐮)\nabla\Psi(\mathbf{u}) is Lipschitz continuous with modulus

LΨ=LF+c1L𝒢+c22LHMH,0.L_{\Psi}=L_{F}+c_{1}L_{\mathcal{G}}+\frac{c_{2}}{2}L_{H}M_{H,0}. (C.3)

Proof Combining Assumption 3.1 and Lemma C.1 yields the desired result. ∎

Assume we have already obtained the approximate optimal solution (w^{k},\lambda^{k}) of the subproblem (2.5a) at the k-th iteration. Note that (w^{k},\lambda^{k}) are random variables depending on the inner-loop sample \mathbf{\xi}^{k}. In the subsequent analysis, we will take expectations conditioned on \mathcal{F}^{k}, so that (w^{k},\lambda^{k}) are treated as constants. Given a sample \mathbf{\tilde{\xi}}^{k}=(\tilde{\xi}^{k}_{1},...,\tilde{\xi}^{k}_{q_{k}})\sim\mathcal{D}_{\xi}^{q_{k}}, a natural first-order oracle for \hat{\mathcal{G}} in (2.9) is

\displaystyle\nabla_{x}\mathcal{G}^{k}(x,y,z;\mathbf{\tilde{\xi}}^{k}) =\frac{1}{q_{k}}\sum_{l=1}^{q_{k}}\nabla_{x}g(x,y;\tilde{\xi}^{k}_{l})-\frac{1}{q_{k}}\sum_{l=1}^{q_{k}}\nabla_{x}\ell_{\gamma}(x,z,w^{k},\lambda^{k};\tilde{\xi}^{k}_{l}), (C.4)
y𝒢k(x,y,z;ξ~k)\displaystyle\nabla_{y}\mathcal{G}^{k}(x,y,z;\mathbf{\tilde{\xi}}^{k}) =1qkl=1qkyg(x,y;ξ~lk),\displaystyle=\frac{1}{q_{k}}\sum_{l=1}^{q_{k}}\nabla_{y}g(x,y;\tilde{\xi}^{k}_{l}),
z𝒢k(x,y,z;ξ~k)\displaystyle\nabla_{z}\mathcal{G}^{k}(x,y,z;\mathbf{\tilde{\xi}}^{k}) =γ2(λkz).\displaystyle={\gamma_{2}}(\lambda^{k}-z).

Conditioned on \mathbf{\xi}^{k}, the bias of \nabla\mathcal{G}^{k}(x,y,z;\mathbf{\tilde{\xi}}^{k}) arises from the term \frac{1}{q_{k}}\sum_{l=1}^{q_{k}}\nabla_{x}\ell_{\gamma}(x,z,w^{k},\lambda^{k};\tilde{\xi}^{k}_{l}). However, when \mathbf{\xi}^{k} is also treated as a random variable, estimating the bias becomes considerably more involved. We establish two lemmas to control this bias: Lemma C.2 conditions on \mathcal{F}^{k}, while Lemma C.3 conditions on \tilde{\mathcal{F}}_{k-1}. In the following analysis, \mathcal{G}(\mathbf{u}) abbreviates \mathcal{G}(x,y,z) for simplicity.
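For concreteness, a minimal sketch of the mini-batch oracle (C.4) is given below. The sampled gradients of g and \ell_{\gamma} are passed in as callables; these, and the batch handling, are assumptions for illustration rather than the implementation used in the experiments.

```python
import numpy as np

def oracle_G(grad_x_g, grad_y_g, grad_x_ell, x, y, z, w_k, lam_k, xi_batch, gamma2):
    """Sketch of the first-order oracle for G^k in (C.4).

    Assumed placeholders: grad_x_g(x, y, xi), grad_y_g(x, y, xi) and
    grad_x_ell(x, z, w, lam, xi) return the sampled gradients
    nabla_x g, nabla_y g and nabla_x ell_gamma, respectively.
    """
    q = len(xi_batch)
    gx = sum(grad_x_g(x, y, xi) - grad_x_ell(x, z, w_k, lam_k, xi) for xi in xi_batch) / q
    gy = sum(grad_y_g(x, y, xi) for xi in xi_batch) / q
    gz = gamma2 * (lam_k - z)  # deterministic z-component; randomness enters only through lam_k
    return gx, gy, gz
```

The same fresh batch \mathbf{\tilde{\xi}}^{k} is reused for the g term and the \ell_{\gamma} term, which is why only the inexactness of (w^{k},\lambda^{k}) contributes to the bias analyzed next.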

Lemma C.2.

𝒢k(𝐮;ξ~k)\nabla\mathcal{G}^{k}(\mathbf{u};\mathbf{\tilde{\xi}}^{k}) has a controllable bias conditioned on k\mathcal{F}^{k} as

𝔼ξ~k[𝒢k(𝐮;ξ~k)|k]𝒢(𝐮)\displaystyle\quad\|\mathbb{E}_{\mathbf{\tilde{\xi}}^{k}}[\nabla\mathcal{G}^{k}(\mathbf{u};\mathbf{\tilde{\xi}}^{k})|\mathcal{F}^{k}]-\nabla\mathcal{G}(\mathbf{u})\| (C.5)
γ22λkλ(xk1,zk1)2+(MH,1+p)λkλ(xk1,zk1)\displaystyle\leq\frac{\gamma_{2}}{2}\|\lambda^{k}-\lambda^{*}(x^{k-1},z^{k-1})\|^{2}+(M_{H,1}+\sqrt{p})\|\lambda^{k}-\lambda^{*}(x^{k-1},z^{k-1})\|
+\displaystyle+ (LG+1γ1(MH,22+γ1LHz+γ1LHγ2MH,0+MH,0LH))wkw(xk1,zk1),\displaystyle(L_{G}+\frac{1}{\gamma_{1}}(M_{H,2}^{2}+\gamma_{1}L_{H}\|z\|+\frac{\gamma_{1}L_{H}}{\gamma_{2}}M_{H,0}+M_{H,0}L_{H}))\|w^{k}-w^{*}(x^{k-1},z^{k-1})\|,

and a controllable conditional variance as

𝕍ξ~k[𝒢k(𝐮;ξ~k)|k]4σg2qk.\displaystyle\mathbb{V}_{\mathbf{\tilde{\xi}}^{k}}[\nabla\mathcal{G}^{k}(\mathbf{u};\mathbf{\tilde{\xi}}^{k})|\mathcal{F}_{k}]\leq\frac{4\sigma_{g}^{2}}{q_{k}}. (C.6)

Here 𝔼ξ~k[],𝕍ξ~k[]\mathbb{E}_{\mathbf{\tilde{\xi}}^{k}}[\cdot],\mathbb{V}_{\mathbf{\tilde{\xi}}^{k}}[\cdot] are the abbreviation of 𝔼ξ~k𝒟ξqk[],𝕍ξ~k𝒟ξqk[]\mathbb{E}_{\mathbf{\tilde{\xi}}^{k}\sim\mathcal{D}_{\xi}^{q_{k}}}[\cdot],\mathbb{V}_{\mathbf{\tilde{\xi}}^{k}\sim\mathcal{D}_{\xi}^{q_{k}}}[\cdot], respectively.

Proof When conditioned on \mathcal{F}^{k}, (w^{k},\lambda^{k}) and (w^{*}(x^{k-1},z^{k-1}),\lambda^{*}(x^{k-1},z^{k-1})) are constants. Utilizing (C.1a) and taking the expectation over \mathbf{\tilde{\xi}}^{k} gives

𝔼ξ~k[x𝒢k(𝐮;ξ~k)|k]x𝒢(𝐮)\displaystyle\quad\mathbb{E}_{\mathbf{\tilde{\xi}}^{k}}[\nabla_{x}\mathcal{G}^{k}(\mathbf{u};\mathbf{\tilde{\xi}}^{k})|{\mathcal{F}}_{k}]-\nabla_{x}\mathcal{G}(\mathbf{u}) (C.7)
=𝔼ξ~k[1qkl=1qkxg(x,y;ξ~l)1qkl=1qkxγ(x,z,wk,λk;ξ~l)|k]\displaystyle=\mathbb{E}_{\mathbf{\tilde{\xi}}^{k}}\left[\frac{1}{q_{k}}\sum_{l=1}^{q_{k}}\nabla_{x}g(x,y;\tilde{\xi}_{l})-\frac{1}{q_{k}}\sum_{l=1}^{q_{k}}\nabla_{x}\ell_{\gamma}(x,z,w^{k},\lambda^{k};\tilde{\xi}_{l})|{\mathcal{F}}_{k}\right]
xg(x,y)+xγ(x,z,w,λ)\displaystyle\quad-\nabla_{x}g(x,y)+\nabla_{x}\ell_{\gamma}(x,z,w^{*},\lambda^{*})
=𝔼ξ~1k[xγ(x,z,wk,λk;ξ~1)|k]+xγ(x,z,w,λ).\displaystyle=-\mathbb{E}_{\tilde{\xi}^{k}_{1}}\left[\nabla_{x}\ell_{\gamma}(x,z,w^{k},\lambda^{k};\tilde{\xi}_{1})|{\mathcal{F}}_{k}\right]+\nabla_{x}\ell_{\gamma}(x,z,w^{*},\lambda^{*}).

From the expression of ~\tilde{\mathcal{L}} and (A.4), (A.6), we can further compute

𝔼ξ~k[x𝒢k(𝐮;ξ~k)|k]x𝒢(𝐮)\displaystyle\quad\mathbb{E}_{\mathbf{\tilde{\xi}}^{k}}[\nabla_{x}\mathcal{G}^{k}(\mathbf{u};\mathbf{\tilde{\xi}}^{k})|{\mathcal{F}}_{k}]-\nabla_{x}\mathcal{G}(\mathbf{u}) (C.8)
=𝔼ξ~1k[xg(x,wk;ξ1)+1γ1i=1p[γ1λik+Hi(x,wk)]+xHi(x,wk)γ22λkz2|k]\displaystyle=-\mathbb{E}_{\tilde{\xi}^{k}_{1}}\left[\nabla_{x}g(x,w^{k};\xi_{1})+\frac{1}{\gamma_{1}}\sum_{i=1}^{p}[\gamma_{1}\lambda_{i}^{k}+H_{i}(x,w^{k})]_{+}\nabla_{x}H_{i}(x,w^{k})-\frac{\gamma_{2}}{2}\|\lambda^{k}-z\|^{2}|{\mathcal{F}}_{k}\right]
+xg(x,w)+1γ1i=1p[γ1λi+Hi(x,w)]+xHi(x,w)γ22λz2\displaystyle\quad+\nabla_{x}g(x,w^{*})+\frac{1}{\gamma_{1}}\sum_{i=1}^{p}[\gamma_{1}\lambda_{i}^{*}+H_{i}(x,w^{*})]_{+}\nabla_{x}H_{i}(x,w^{*})-\frac{\gamma_{2}}{2}\|\lambda^{*}-z\|^{2}
=xg(x,wk)1γ1i=1p[γ1λik+Hi(x,wk)]+xHi(x,wk)+γ22λkz2\displaystyle=-\nabla_{x}g(x,w^{k})-\frac{1}{\gamma_{1}}\sum_{i=1}^{p}[\gamma_{1}\lambda_{i}^{k}+H_{i}(x,w^{k})]_{+}\nabla_{x}H_{i}(x,w^{k})+\frac{\gamma_{2}}{2}\|\lambda^{k}-z\|^{2}
+xg(x,w)+1γ1i=1p[γ1λi+Hi(x,w)]+xHi(x,w)γ22λz2.\displaystyle\quad+\nabla_{x}g(x,w^{*})+\frac{1}{\gamma_{1}}\sum_{i=1}^{p}[\gamma_{1}\lambda_{i}^{*}+H_{i}(x,w^{*})]_{+}\nabla_{x}H_{i}(x,w^{*})-\frac{\gamma_{2}}{2}\|\lambda^{*}-z\|^{2}.

By Assumption 3.1, it holds that

1γ1i=1p[γ1λik+Hi(x,wk)]+xHi(x,wk)1γ1i=1p[γ1λi+Hi(x,w)]+xHi(x,w)\displaystyle\quad\left\|\frac{1}{\gamma_{1}}\sum_{i=1}^{p}[\gamma_{1}\lambda_{i}^{k}+H_{i}(x,w^{k})]_{+}\nabla_{x}H_{i}(x,w^{k})-\frac{1}{\gamma_{1}}\sum_{i=1}^{p}[\gamma_{1}\lambda_{i}^{*}+H_{i}(x,w^{*})]_{+}\nabla_{x}H_{i}(x,w^{*})\right\| (C.9)
1γ1i=1p([γ1λik+Hi(x,wk)]+xHi(x,wk)[γ1λi+Hi(x,w)]+xHi(x,wk))\displaystyle\leq\frac{1}{\gamma_{1}}\left\|\sum_{i=1}^{p}\left([\gamma_{1}\lambda_{i}^{k}+H_{i}(x,w^{k})]_{+}\nabla_{x}H_{i}(x,w^{k})-[\gamma_{1}\lambda_{i}^{*}+H_{i}(x,w^{*})]_{+}\nabla_{x}H_{i}(x,w^{k})\right)\right\|
+1γ1i=1p([γ1λi+Hi(x,w)]+xHi(x,wk)[γ1λi+Hi(x,w)]+xHi(x,w))\displaystyle+\frac{1}{\gamma_{1}}\left\|\sum_{i=1}^{p}\left([\gamma_{1}\lambda_{i}^{*}+H_{i}(x,w^{*})]_{+}\nabla_{x}H_{i}(x,w^{k})-[\gamma_{1}\lambda_{i}^{*}+H_{i}(x,w^{*})]_{+}\nabla_{x}H_{i}(x,w^{*})\right)\right\|
1γ1i=1p(|[γ1λik+Hi(x,wk)]+[γ1λi+Hi(x,w)]+|xHi(x,wk))\displaystyle\leq\frac{1}{\gamma_{1}}\sum_{i=1}^{p}\left(\left|[\gamma_{1}\lambda_{i}^{k}+H_{i}(x,w^{k})]_{+}-[\gamma_{1}\lambda_{i}^{*}+H_{i}(x,w^{*})]_{+}\right|\cdot\left\|\nabla_{x}H_{i}(x,w^{k})\right\|\right)
+1γ1i=1p([γ1λi+Hi(x,w)]+xHi(x,wk)xHi(x,w))\displaystyle+\frac{1}{\gamma_{1}}\sum_{i=1}^{p}\left([\gamma_{1}\lambda_{i}^{*}+H_{i}(x,w^{*})]_{+}\cdot\left\|\nabla_{x}H_{i}(x,w^{k})-\nabla_{x}H_{i}(x,w^{*})\right\|\right)
1γ1MH,1(γ1λkλ+MH,1wkw)+1γ1(γ1λ+MH,0)LHwkw\displaystyle\leq\frac{1}{\gamma_{1}}M_{H,1}(\gamma_{1}\|\lambda^{k}-\lambda^{*}\|+M_{H,1}\|w^{k}-w^{*}\|)+\frac{1}{\gamma_{1}}(\gamma_{1}\|\lambda^{*}\|+M_{H,0})L_{H}\|w^{k}-w^{*}\|
=MH,1λkλ+1γ1(MH,22+γ1λLH+MH,0LH)wkw.\displaystyle=M_{H,1}\|\lambda^{k}-\lambda^{*}\|+\frac{1}{\gamma_{1}}(M_{H,2}^{2}+\gamma_{1}\|\lambda^{*}\|L_{H}+M_{H,0}L_{H})\|w^{k}-w^{*}\|.

The last inequality uses the fact that |[a]_{+}-[b]_{+}|\leq|a-b| together with the assumptions |H_{i}(x,w^{k})-H_{i}(x,w^{*})|\leq M_{H,1}\|w^{k}-w^{*}\|, |H_{i}(x,w^{*})|\leq M_{H,0} and \|\nabla H_{i}(x,w^{k})-\nabla H_{i}(x,w^{*})\|\leq L_{H}\|w^{k}-w^{*}\|. By (C.2), it holds that \|\lambda^{k}-z\|^{2}-\|\lambda^{*}-z\|^{2}=\|\lambda^{k}-\lambda^{*}\|^{2}+2(\lambda^{k}-\lambda^{*})^{T}(\lambda^{*}-z)\leq\|\lambda^{k}-\lambda^{*}\|^{2}+\frac{2\sqrt{p}}{\gamma_{2}}M_{H,0}\|\lambda^{k}-\lambda^{*}\|. Then, combining (C.8) and (C.9), we bound the bias of \nabla_{x}\mathcal{G}^{k}(\mathbf{u};\mathbf{\tilde{\xi}}^{k}) as

𝔼ξ~k[x𝒢k(𝐮;ξ~)|k]x𝒢(𝐮)\displaystyle\quad\|\mathbb{E}_{\mathbf{\tilde{\xi}}^{k}}[\nabla_{x}\mathcal{G}^{k}(\mathbf{u};\tilde{\xi})|\mathcal{F}^{k}]-\nabla_{x}\mathcal{G}(\mathbf{u})\|
LGwkw+MH,1λkλ+1γ1(MH,22+γ1λLH+MH,0LH)wkw\displaystyle\leq L_{G}\|w^{k}-w^{*}\|+M_{H,1}\|\lambda^{k}-\lambda^{*}\|+\frac{1}{\gamma_{1}}(M_{H,2}^{2}+\gamma_{1}\|\lambda^{*}\|L_{H}+M_{H,0}L_{H})\|w^{k}-w^{*}\|
+γ22λkλ2+pMH,0λkλ\displaystyle\quad+\frac{\gamma_{2}}{2}\|\lambda^{k}-\lambda^{*}\|^{2}+\sqrt{p}M_{H,0}\|\lambda^{k}-\lambda^{*}\|
=γ22λkλ2+(MH,1+p)λkλ\displaystyle=\frac{\gamma_{2}}{2}\|\lambda^{k}-\lambda^{*}\|^{2}+(M_{H,1}+\sqrt{p})\|\lambda^{k}-\lambda^{*}\|
+(LG+1γ1(MH,22+γ1λLH+MH,0LH))wkw\displaystyle\quad+(L_{G}+\frac{1}{\gamma_{1}}(M_{H,2}^{2}+\gamma_{1}\|\lambda^{*}\|L_{H}+M_{H,0}L_{H}))\|w^{k}-w^{*}\|
γ22λkλ2+(MH,1+p)λkλ\displaystyle\leq\frac{\gamma_{2}}{2}\|\lambda^{k}-\lambda^{*}\|^{2}+(M_{H,1}+\sqrt{p})\|\lambda^{k}-\lambda^{*}\|
+(LG+1γ1(MH,22+γ1LHz+γ1LHγ2MH,0+MH,0LH))wkw.\displaystyle\quad+(L_{G}+\frac{1}{\gamma_{1}}(M_{H,2}^{2}+\gamma_{1}L_{H}\|z\|+\frac{\gamma_{1}L_{H}}{\gamma_{2}}M_{H,0}+M_{H,0}L_{H}))\|w^{k}-w^{*}\|.

The last inequality uses \|\lambda^{*}\|\leq\|z\|+\frac{\sqrt{p}}{\gamma_{2}}M_{H,0}, which follows from (C.2). Besides, \mathbb{E}[\nabla_{y}\mathcal{G}^{k}(x,y,z;\mathbf{\tilde{\xi}}^{k})|\mathcal{F}^{k}]-\nabla_{y}\mathcal{G}(x,y,z)=0 and \|\mathbb{E}[\nabla_{z}\mathcal{G}^{k}(x,y,z;\mathbf{\tilde{\xi}}^{k})|\mathcal{F}^{k}]-\nabla_{z}\mathcal{G}(x,y,z)\|=\gamma_{2}\|\lambda^{k}-\lambda^{*}\| are straightforward. Therefore (C.5) holds.

From the expression of 𝒢k(𝐮;ξ~k)\nabla\mathcal{G}^{k}(\mathbf{u};\mathbf{\tilde{\xi}}^{k}), we can compute the conditional variance of 𝒢k(x,y,z;ξ~k)\nabla\mathcal{G}^{k}(x,y,z;\mathbf{\tilde{\xi}}^{k}) as

𝕍ξ~k[x𝒢k(𝐮;ξ~k)|k]\displaystyle\quad\mathbb{V}_{\mathbf{\tilde{\xi}}^{k}}\left[\nabla_{x}\mathcal{G}^{k}(\mathbf{u};\mathbf{\tilde{\xi}}^{k})|{\mathcal{F}}_{k}\right]
=𝕍ξ~k[1qkl=1qkxg(x,y;ξlk)1qkl=1qkxγ(x,z,wk,λk;ξ~lk)|k]\displaystyle=\mathbb{V}_{\mathbf{\tilde{\xi}}^{k}}\left[\frac{1}{q_{k}}\sum_{l=1}^{q_{k}}\nabla_{x}g(x,y;\xi^{k}_{l})-\frac{1}{q_{k}}\sum_{l=1}^{q_{k}}\nabla_{x}\ell{\gamma}(x,z,w^{k},\lambda^{k};\tilde{\xi}^{k}_{l})|{\mathcal{F}}_{k}\right]
=1qk𝕍ξ~1k[xg(x,y;ξ~1k)xg(x,wk;ξ~1k)1γ1i=1p[γ1λik+Hi(x,wk)]+xHi(x,wk)\displaystyle=\frac{1}{q_{k}}\mathbb{V}_{\tilde{\xi}^{k}_{1}}\left[\nabla_{x}g(x,y;\tilde{\xi}^{k}_{1})-\nabla_{x}g(x,w^{k};\tilde{\xi}^{k}_{1})-\frac{1}{\gamma_{1}}\sum_{i=1}^{p}[\gamma_{1}\lambda_{i}^{k}+H_{i}(x,w^{k})]_{+}\nabla_{x}H_{i}(x,w^{k})\right.
+γ22λkz2|k]\displaystyle\quad\left.+\frac{\gamma_{2}}{2}\|\lambda^{k}-z\|^{2}|\mathcal{F}_{k}\right]
=1qk𝕍ξ~1k[xg(x,y;ξ~1k)xg(x,wk;ξ~1k)|k] (λk,wk are constants conditioned on k)\displaystyle=\frac{1}{q_{k}}\mathbb{V}_{\tilde{\xi}^{k}_{1}}\left[\nabla_{x}g(x,y;\tilde{\xi}^{k}_{1})-\nabla_{x}g(x,w^{k};\tilde{\xi}^{k}_{1})|\mathcal{F}_{k}\right]\quad\text{ ($\lambda^{k},w^{k}$ are constants conditioned on $\mathcal{F}^{k}$)}
2qk𝕍ξ~1k[xg(x,y;ξ~1k)|k]+2qk𝕍ξ~1k[xg(x,wk;ξ~1k)|k].\displaystyle\leq\frac{2}{q_{k}}\mathbb{V}_{\tilde{\xi}^{k}_{1}}\left[\nabla_{x}g(x,y;\tilde{\xi}^{k}_{1})|\mathcal{F}_{k}\right]+\frac{2}{q_{k}}\mathbb{V}_{\tilde{\xi}^{k}_{1}}\left[\nabla_{x}g(x,w^{k};\tilde{\xi}^{k}_{1})|\mathcal{F}_{k}\right].

and

𝕍ξ~k[y𝒢k(𝐮;ξ~k)|k]=𝕍ξ~k[1qkl=1qkyg(x,y;ξlk)]=1qk𝕍ξ~1k[yg(x,y;ξ1k)|k]\displaystyle\quad\mathbb{V}_{\mathbf{\tilde{\xi}}^{k}}\left[\nabla_{y}\mathcal{G}^{k}(\mathbf{u};\tilde{\xi}^{k})|{\mathcal{F}}_{k}\right]=\mathbb{V}_{\mathbf{\tilde{\xi}}^{k}}\left[\frac{1}{q_{k}}\sum_{l=1}^{q_{k}}\nabla_{y}g(x,y;\xi^{k}_{l})\right]=\frac{1}{q_{k}}\mathbb{V}_{{\tilde{\xi}}^{k}_{1}}\left[\nabla_{y}g(x,y;\xi^{k}_{1})|\mathcal{F}_{k}\right]
𝕍ξ~k[z𝒢k(𝐮;ξ~k)|k]=0.\displaystyle\quad\mathbb{V}_{\mathbf{\tilde{\xi}}^{k}}\left[\nabla_{z}\mathcal{G}^{k}(\mathbf{u};\mathbf{\tilde{\xi}}^{k})|{\mathcal{F}}_{k}\right]=0.

Hence

\displaystyle\quad\mathbb{V}_{\mathbf{\tilde{\xi}}^{k}}[\nabla\mathcal{G}^{k}(\mathbf{u};\mathbf{\tilde{\xi}}^{k})|\mathcal{F}_{k}]
2qk𝕍ξ~1k[xg(x,y;ξ~1k)|k]+2qk𝕍ξ~1k[xg(x,wk;ξ~1k)|k]+1qk𝕍ξ~1k[yg(x,y;ξ1k)|k]\displaystyle\leq\frac{2}{q_{k}}\mathbb{V}_{\tilde{\xi}^{k}_{1}}\left[\nabla_{x}g(x,y;\tilde{\xi}^{k}_{1})|\mathcal{F}_{k}\right]+\frac{2}{q_{k}}\mathbb{V}_{\tilde{\xi}^{k}_{1}}\left[\nabla_{x}g(x,w^{k};\tilde{\xi}^{k}_{1})|\mathcal{F}_{k}\right]+\frac{1}{q_{k}}\mathbb{V}_{{\tilde{\xi}}^{k}_{1}}\left[\nabla_{y}g(x,y;\xi^{k}_{1})|\mathcal{F}_{k}\right]
2qk𝕍ξ~1k[g(x,y;ξ1k)]+2qk𝕍ξ~1k[g(x,wk;ξ1k)]4σg2qk.\displaystyle\leq\frac{2}{q_{k}}\mathbb{V}_{\tilde{\xi}^{k}_{1}}\left[\nabla g(x,y;\xi^{k}_{1})\right]+\frac{2}{q_{k}}\mathbb{V}_{\tilde{\xi}^{k}_{1}}\left[\nabla g(x,w^{k};\xi^{k}_{1})\right]\leq\frac{4\sigma_{g}^{2}}{q_{k}}.

This completes the proof. ∎

In the following lemma, w^{k},\lambda^{k} are treated as random variables, and we control the bias and variance of the first-order oracle of \mathcal{G}^{k} conditioned on \tilde{\mathcal{F}}_{k-1}.

Lemma C.3.

The first order oracle of 𝒢k\mathcal{G}^{k} has a bounded conditional bias and variance as

𝔼ξ~k,ξk[𝒢k(𝐮;ξ~k)|~k1]𝒢(𝐮)\displaystyle\left\|\mathbb{E}_{\mathbf{\tilde{\xi}}^{k},\mathbf{\xi}^{k}}[\nabla\mathcal{G}^{k}(\mathbf{u};\mathbf{\tilde{\xi}}^{k})|\tilde{\mathcal{F}}_{k-1}]-\nabla\mathcal{G}(\mathbf{u})\right\| ϵ𝒢k,\displaystyle\leq\epsilon_{\mathcal{G}}^{k}, (C.10a)
𝕍ξ~k,ξk[𝒢k(𝐮;ξ~k)|~k1](σ𝒢k)2,\displaystyle\mathbb{V}_{\mathbf{\tilde{\xi}}^{k},\mathbf{\xi}^{k}}[\nabla\mathcal{G}^{k}(\mathbf{u};\mathbf{\tilde{\xi}}^{k})|\tilde{\mathcal{F}}_{k-1}]\leq(\sigma_{\mathcal{G}}^{k})^{2}, (C.10b)

where

\displaystyle\epsilon_{\mathcal{G}}^{k} =\frac{\gamma_{2}}{2}\mathbb{E}[\|\lambda^{k}-\lambda^{*}(x^{k-1},z^{k-1})\|^{2}|\tilde{\mathcal{F}}_{k-1}]+(M_{H,1}+\sqrt{p})\mathbb{E}[\|\lambda^{k}-\lambda^{*}(x^{k-1},z^{k-1})\||\tilde{\mathcal{F}}_{k-1}] (C.11)
+\displaystyle+ (LG+1γ1(MH,22+γ1LHz+γ1LHγ2MH,0+MH,0LH))𝔼[wkw(xk1,zk1)|~k1]\displaystyle(L_{G}+\frac{1}{\gamma_{1}}(M_{H,2}^{2}+\gamma_{1}L_{H}\|z\|+\frac{\gamma_{1}L_{H}}{\gamma_{2}}M_{H,0}+M_{H,0}L_{H}))\mathbb{E}[\|w^{k}-w^{*}(x^{k-1},z^{k-1})\||\tilde{\mathcal{F}}_{k-1}]

and

(σ𝒢k)2\displaystyle(\sigma_{\mathcal{G}}^{k})^{2} =4σg2qk+2(LG2+(γ1Mλ+MH,0)2γ12LH2)𝔼ξk[wkw(xk1,zk1)2|~k1]\displaystyle=\frac{4\sigma_{g}^{2}}{q_{k}}+2(L_{G}^{2}+\frac{(\gamma_{1}M_{\lambda}+M_{H,0})^{2}}{\gamma_{1}^{2}}L_{H}^{2})\mathbb{E}_{\mathbf{\xi}^{k}}\left[\|w^{k}-w^{*}(x^{k-1},z^{k-1})\|^{2}|\tilde{\mathcal{F}}_{k-1}\right] (C.12)
+γ22𝔼ξk[λkλ(xk1,zk1)2|~k1].\displaystyle\quad+\gamma_{2}^{2}\mathbb{E}_{\mathbf{\xi}^{k}}\left[\|\lambda^{k}-\lambda^{*}(x^{k-1},z^{k-1})\|^{2}|\tilde{\mathcal{F}}_{k-1}\right].

Proof Taking the expectation over \mathbf{\xi}^{k} in (C.5) gives (C.10a). Utilizing the law of total variance and Lemma C.2, we have

𝕍ξ~k,ξk[𝒢k(𝐮;ξ~k)|~k1]\displaystyle\quad\mathbb{V}_{\mathbf{\tilde{\xi}}^{k},\mathbf{\xi}^{k}}[\nabla\mathcal{G}^{k}(\mathbf{u};\mathbf{\tilde{\xi}}^{k})|\tilde{\mathcal{F}}_{k-1}] (C.13)
\displaystyle=\mathbb{E}_{\mathbf{\xi}^{k}}[\mathbb{V}_{\mathbf{\tilde{\xi}}^{k}}[\nabla\mathcal{G}^{k}(\mathbf{u};\mathbf{\tilde{\xi}}^{k})|{\mathcal{F}}_{k}]|\tilde{\mathcal{F}}_{k-1}]+\mathbb{V}_{\mathbf{\xi}^{k}}[\mathbb{E}_{\mathbf{\tilde{\xi}}^{k}}[\nabla\mathcal{G}^{k}(\mathbf{u};\mathbf{\tilde{\xi}}^{k})|{\mathcal{F}}_{k}]|\tilde{\mathcal{F}}_{k-1}]
4σg2qk+𝕍ξk[xg(x,y)xγ(x,z,wk,λk)|~k1]+𝕍ξk[yg(x,y)|~k1]\displaystyle\leq\frac{4\sigma_{g}^{2}}{q_{k}}+\mathbb{V}_{\mathbf{\xi}^{k}}[\nabla_{x}g(x,y)-\nabla_{x}\ell_{\gamma}(x,z,w^{k},\lambda^{k})|\tilde{\mathcal{F}}_{k-1}]+\mathbb{V}_{\mathbf{\xi}^{k}}[\nabla_{y}g(x,y)|\tilde{\mathcal{F}}_{k-1}]
+𝕍ξk[γ2(λkz)|~k1]\displaystyle\quad+\mathbb{V}_{\mathbf{\xi}^{k}}[\gamma_{2}(\lambda^{k}-z)|\tilde{\mathcal{F}}_{k-1}]
=4σg2qk+𝕍ξk[xγ(x,z,wk,λk)|~k1]+0+γ22𝕍ξk[λk|~k1].\displaystyle=\frac{4\sigma_{g}^{2}}{q_{k}}+\mathbb{V}_{\mathbf{\xi}^{k}}[\nabla_{x}\ell_{\gamma}(x,z,w^{k},\lambda^{k})|\tilde{\mathcal{F}}_{k-1}]+0+\gamma_{2}^{2}\mathbb{V}_{\mathbf{\xi}^{k}}[\lambda^{k}|\tilde{\mathcal{F}}_{k-1}].

The inequality is due to (C.4) and Lemma C.2. The second term in the right-hand side of (C.13) is bounded by

𝕍ξk[xγ(x,z,wk,λk)|~k1]\displaystyle\quad\mathbb{V}_{\mathbf{\xi}^{k}}[\nabla_{x}\ell_{\gamma}(x,z,w^{k},\lambda^{k})|\tilde{\mathcal{F}}_{k-1}] (C.14)
=𝕍ξk[xg(x,wk)+1γ1i=1p[γ1λik+Hi(x,wk)]+xHi(x,wk)|~k1]\displaystyle=\mathbb{V}_{\mathbf{\xi}^{k}}[\nabla_{x}g(x,w^{k})+\frac{1}{\gamma_{1}}\sum_{i=1}^{p}[\gamma_{1}\lambda_{i}^{k}+H_{i}(x,w^{k})]_{+}\nabla_{x}H_{i}(x,w^{k})|\tilde{\mathcal{F}}_{k-1}]
2𝕍ξk[xg(x,wk)|~k1]+2(γ1Mλ+MH,0)2γ12𝕍ξk[xH(x,wk)|~k1]\displaystyle\leq 2\mathbb{V}_{\mathbf{\xi}^{k}}[\nabla_{x}g(x,w^{k})|\tilde{\mathcal{F}}_{k-1}]+\frac{2(\gamma_{1}M_{\lambda}+M_{H,0})^{2}}{\gamma_{1}^{2}}\mathbb{V}_{\mathbf{\xi}^{k}}[\nabla_{x}H(x,w^{k})|\tilde{\mathcal{F}}_{k-1}]

Since w(xk1,zk1)w^{*}(x^{k-1},z^{k-1}) is a constant conditioned on ~k1\tilde{\mathcal{F}}_{k-1}, we have

𝕍ξk[xg(x,wk)|~k1]\displaystyle\mathbb{V}_{\mathbf{\xi}^{k}}[\nabla_{x}g(x,w^{k})|\tilde{\mathcal{F}}_{k-1}] =𝕍ξk[xg(x,wk)xg(x,w(xk1,zk1))|~k1]\displaystyle=\mathbb{V}_{\mathbf{\xi}^{k}}[\nabla_{x}g(x,w^{k})-\nabla_{x}g(x,w^{*}(x^{k-1},z^{k-1}))|\tilde{\mathcal{F}}_{k-1}]
𝔼ξk[xg(x,wk)xg(x,w(xk1,zk1))2|~k1]\displaystyle\leq\mathbb{E}_{\mathbf{\xi}^{k}}[\|\nabla_{x}g(x,w^{k})-\nabla_{x}g(x,w^{*}(x^{k-1},z^{k-1}))\|^{2}|\tilde{\mathcal{F}}_{k-1}]
\displaystyle\leq L_{G}^{2}\mathbb{E}_{\mathbf{\xi}^{k}}\left[\|w^{k}-w^{*}(x^{k-1},z^{k-1})\|^{2}|\tilde{\mathcal{F}}_{k-1}\right].

The first inequality uses the fact that the variance is bounded by the second moment, and the second inequality is due to the Lipschitz continuity of \nabla g. Similarly, we have

\mathbb{V}_{\mathbf{\xi}^{k}}[\nabla_{x}H(x,w^{k})|\tilde{\mathcal{F}}_{k-1}]\leq L_{H}^{2}\mathbb{E}_{\mathbf{\xi}^{k}}\left[\|w^{k}-w^{*}(x^{k-1},z^{k-1})\|^{2}|\tilde{\mathcal{F}}_{k-1}\right].

Substituting these two inequalities into (C.14) gives

𝕍ξk[xγ(x,z,wk,λk)|~k1]\displaystyle\quad\mathbb{V}_{\mathbf{\xi}^{k}}[\nabla_{x}\ell_{\gamma}(x,z,w^{k},\lambda^{k})|\tilde{\mathcal{F}}_{k-1}] (C.15)
2(LG2+(γ1Mλ+MH,0)2γ12LH2)𝔼[wkw(xk1,zk1)2|~k1].\displaystyle\leq 2\left(L_{G}^{2}+\frac{(\gamma_{1}M_{\lambda}+M_{H,0})^{2}}{\gamma_{1}^{2}}L_{H}^{2}\right)\mathbb{E}[\|w^{k}-w^{*}(x^{k-1},z^{k-1})\|^{2}|\tilde{\mathcal{F}}_{k-1}].

The last term in the right-hand side of (C.13) is bounded by

γ22𝕍ξk[λk|~k1]\displaystyle\gamma_{2}^{2}\mathbb{V}_{\mathbf{\xi}^{k}}[\lambda^{k}|\tilde{\mathcal{F}}_{k-1}] =γ22𝕍[λkλ(xk1,zk1)|~k1]γ22𝔼[λkλ(xk1,zk1)2|~k1].\displaystyle=\gamma_{2}^{2}\mathbb{V}[\lambda^{k}-\lambda^{*}(x^{k-1},z^{k-1})|\tilde{\mathcal{F}}_{k-1}]\leq\gamma_{2}^{2}\mathbb{E}[\|\lambda^{k}-\lambda^{*}(x^{k-1},z^{k-1})\|^{2}|\tilde{\mathcal{F}}_{k-1}]. (C.16)

Combining (C.13), (C.15) and (C.16) gives the desired result. ∎

Lemma C.4.

Under the conditions of Theorem B.1, ϵ𝒢k\epsilon_{\mathcal{G}}^{k} and σ𝒢k\sigma_{\mathcal{G}}^{k} are bounded as

\displaystyle\epsilon_{\mathcal{G}}^{k} \leq\frac{\gamma_{2}}{2}\phi_{2}\frac{1+\log(s_{k})}{s_{k}}+(M_{H,1}+\sqrt{p})\left(\phi_{2}\frac{1+\log(s_{k})}{s_{k}}\right)^{\frac{1}{2}} (C.17a)
\displaystyle\quad+\left(L_{G}+\frac{1}{\gamma_{1}}(M_{H,2}^{2}+\gamma_{1}L_{H}B+\frac{\gamma_{1}L_{H}}{\gamma_{2}}M_{H,0}+M_{H,0}L_{H})\right)\left(\phi_{1}\frac{1+\log(s_{k})}{s_{k}}\right)^{\frac{1}{2}}, (C.17b)
\displaystyle(\sigma_{\mathcal{G}}^{k})^{2} \leq\frac{4\sigma_{g}^{2}}{q_{k}}+\left(2\Big(L_{G}^{2}+\frac{(\gamma_{1}\bar{M}_{\lambda}+M_{H,0})^{2}}{\gamma_{1}^{2}}L_{H}^{2}\Big)\bar{\phi}_{1}+\gamma_{2}^{2}\bar{\phi}_{2}\right)\frac{1+\log(s_{k})}{s_{k}}. (C.17c)

where \bar{\phi}_{1}=\eta\left(\eta M_{\mathcal{L},1}+\rho(4\gamma_{2}B^{2}+4M_{H,0}^{2})+(\eta M_{\mathcal{L},2}+2\rho(\gamma_{1}+\gamma_{2}))\bar{M}_{\lambda}\right), \bar{\phi}_{2}=\frac{\rho}{\eta}\bar{\phi}_{1}, and \bar{M}_{\lambda}=2\rho^{2}\gamma_{2}^{2}B^{2}+2p\rho^{2}M_{H,0}^{2} are the constants obtained by replacing \|z\| with B.

Proof It follows from the Cauchy-Schwarz inequality that \mathbb{E}[\|\lambda^{k}-\lambda^{*}(x^{k-1},z^{k-1})\||\tilde{\mathcal{F}}_{k-1}]\leq(\mathbb{E}[\|\lambda^{k}-\lambda^{*}(x^{k-1},z^{k-1})\|^{2}|\tilde{\mathcal{F}}_{k-1}])^{\frac{1}{2}} and \mathbb{E}[\|w^{k}-w^{*}(x^{k-1},z^{k-1})\||\tilde{\mathcal{F}}_{k-1}]\leq(\mathbb{E}[\|w^{k}-w^{*}(x^{k-1},z^{k-1})\|^{2}|\tilde{\mathcal{F}}_{k-1}])^{\frac{1}{2}}. Substituting the results of Theorem B.1 into (C.11) and (C.12) gives the desired results. Note that w^{s}-w^{*},\lambda^{s}-\lambda^{*} in (B.14) correspond to w^{k}-w^{*}(x^{k-1},z^{k-1}),\lambda^{k}-\lambda^{*}(x^{k-1},z^{k-1}) in the current context. ∎

Denote f(x,y;ζk)=1rkl=1rkf(x,y;ζlk)\nabla f(x,y;\mathbf{\zeta}^{k})=\frac{1}{r_{k}}\sum_{l=1}^{r_{k}}\nabla f(x,y;\zeta^{k}_{l}). From the definition of Ψk(𝐮;ζk,ξ~k)\Psi^{k}(\mathbf{u};\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k}) in (2.10) and (2.12), we derive the relationship between the bias of Ψk(𝐮;ζk,ξ~k)\nabla\Psi^{k}(\mathbf{u};\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k}) and 𝒢k(𝐮;ξ~k)\nabla\mathcal{G}^{k}(\mathbf{u};\mathbf{\tilde{\xi}}^{k}) as follows.

\nabla\Psi^{k}(\mathbf{u};\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k})-\nabla\Psi(\mathbf{u})=\nabla f(x,y;\mathbf{\zeta}^{k})-\nabla F(x,y)+c_{1}(\nabla\mathcal{G}^{k}(\mathbf{u};\mathbf{\tilde{\xi}}^{k})-\nabla\mathcal{G}(\mathbf{u})). (C.18)

Since f(x,y;ζk)\nabla f(x,y;\mathbf{\zeta}^{k}) is unbiased, the bias of Ψk(𝐮;ζk,ξ~k)\nabla\Psi^{k}(\mathbf{u};\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k}) is fully determined by the bias of 𝒢k(𝐮;ξ~k)\nabla\mathcal{G}^{k}(\mathbf{u};\mathbf{\tilde{\xi}}^{k}).

From the relationship between variance and second moment, the bias b^{k} defined in (3.7) can be bounded as follows.

Lemma C.5.

The bias b^{k} has a controllable second moment:

𝔼[bk2|~k1]\displaystyle\mathbb{E}[\|b^{k}\|^{2}|\tilde{\mathcal{F}}_{k-1}] 2(σf2rk+c12(σ𝒢k)2qk+c12(ϵ𝒢k)2).\displaystyle\leq 2\left(\frac{\sigma_{f}^{2}}{r_{k}}+c_{1}^{2}\frac{(\sigma_{\mathcal{G}}^{k})^{2}}{q_{k}}+c_{1}^{2}(\epsilon_{\mathcal{G}}^{k})^{2}\right). (C.19)

Here 𝔼[]\mathbb{E}[\cdot] is the abbreviation of 𝔼ζk,ξ~k,ξk[]\mathbb{E}_{\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k},\mathbf{\xi}^{k}}[\cdot].

Proof By Lemma C.3, it holds that

𝔼[bk2|~k1]\displaystyle\quad\mathbb{E}[\|b^{k}\|^{2}|\widetilde{\mathcal{F}}_{k-1}]
\displaystyle\leq 2\mathbb{V}[\nabla\Psi^{k}(\mathbf{u}^{k};\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k})|\widetilde{\mathcal{F}}_{k-1}]+2\mathbb{E}[\|\mathbb{E}[\nabla\Psi^{k}(\mathbf{u}^{k};\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k})|\widetilde{\mathcal{F}}_{k-1}]-\nabla\Psi(\mathbf{u}^{k})\|^{2}|\widetilde{\mathcal{F}}_{k-1}]
\displaystyle=2\left(\mathbb{V}\left[\nabla f(\mathbf{u}^{k};\mathbf{\zeta}^{k})|\widetilde{\mathcal{F}}_{k-1}\right]+\mathbb{V}\left[c_{1}\nabla\mathcal{G}^{k}(\mathbf{u}^{k};\mathbf{\tilde{\xi}}^{k})|\widetilde{\mathcal{F}}_{k-1}\right]\right.
\displaystyle\quad\left.+\mathbb{E}\left[c_{1}^{2}\|\mathbb{E}[\nabla\mathcal{G}^{k}(\mathbf{u}^{k};\mathbf{\tilde{\xi}}^{k})|\widetilde{\mathcal{F}}_{k-1}]-\nabla\mathcal{G}(\mathbf{u}^{k})\|^{2}|\widetilde{\mathcal{F}}_{k-1}\right]\right)
\displaystyle\leq 2\left(\frac{\sigma_{f}^{2}}{r_{k}}+c_{1}^{2}\frac{(\sigma_{\mathcal{G}}^{k})^{2}}{q_{k}}+c_{1}^{2}(\epsilon_{\mathcal{G}}^{k})^{2}\right).

The last inequality follows from Lemmas C.2 and C.3. This completes the proof. ∎

The convergence of Algorithm 2 is established in the following theorem.

Theorem C.1.

Assume the step sizes satisfy \alpha_{k}<\frac{1}{2L_{\Psi}}. Then the sequence \{\mathbf{u}^{k}\} satisfies

𝔼[1k=0K1αkk=0K11αk𝐮k+1𝐮k2]\displaystyle\quad\mathbb{E}\left[\frac{1}{\sum_{k=0}^{K-1}\alpha_{k}}\sum_{k=0}^{K-1}\frac{1}{\alpha_{k}}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}\right] (C.20)
4k=0K1αk𝔼[Ψ(𝐮0)Ψ(𝐮K)]+2k=0K1αkk=0K1αk𝔼[bk2].\displaystyle\leq\frac{4}{\sum_{k=0}^{K-1}\alpha_{k}}\mathbb{E}[\Psi(\mathbf{u}^{0})-\Psi(\mathbf{u}^{K})]+\frac{2}{\sum_{k=0}^{K-1}\alpha_{k}}\sum_{k=0}^{K-1}\alpha_{k}\mathbb{E}[\|b^{k}\|^{2}].

Here the expectation is taken over all ξ𝐤𝒟ξsk,ξ~k𝒟ξqk,ζk𝒟ζrk\mathbf{\xi^{k}}\sim\mathcal{D}_{\xi}^{s_{k}},\mathbf{\tilde{\xi}}^{k}\sim\mathcal{D}_{{\xi}}^{q_{k}},\mathbf{\zeta}^{k}\sim\mathcal{D}_{\zeta}^{r_{k}}, k=0,,K1k=0,...,K-1.

Proof The projected gradient step \mathbf{u}^{k+1}=\mathrm{Proj}_{\mathcal{U}}(\mathbf{u}^{k}-\alpha_{k}\nabla\Psi^{k}(\mathbf{u}^{k};\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k})) gives

\langle\mathbf{u}^{k+1}-\mathbf{u}^{k},\mathbf{u}^{k+1}-(\mathbf{u}^{k}-\alpha_{k}\nabla\Psi^{k}(\mathbf{u}^{k};\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k}))\rangle\leq 0.

This is equivalent to

\langle\nabla\Psi^{k}(\mathbf{u}^{k};\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k}),\mathbf{u}^{k+1}-\mathbf{u}^{k}\rangle\leq-\frac{1}{\alpha_{k}}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}.

The Lipschitz property of Ψ(𝐮)\nabla\Psi(\mathbf{u}) gives

\displaystyle\quad\Psi(\mathbf{u}^{k+1})-\Psi(\mathbf{u}^{k}) (C.21)
\displaystyle\leq\langle\nabla\Psi(\mathbf{u}^{k}),\mathbf{u}^{k+1}-\mathbf{u}^{k}\rangle+\frac{L_{\Psi}}{2}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}
\displaystyle=\langle\nabla\Psi^{k}(\mathbf{u}^{k};\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k}),\mathbf{u}^{k+1}-\mathbf{u}^{k}\rangle+\langle b^{k},\mathbf{u}^{k+1}-\mathbf{u}^{k}\rangle+\frac{L_{\Psi}}{2}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}
\displaystyle\leq-\frac{1}{\alpha_{k}}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}+\frac{\alpha_{k}}{2}\|b^{k}\|^{2}+\frac{1}{2\alpha_{k}}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}+\frac{L_{\Psi}}{2}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}
\displaystyle=-\frac{1}{2}(\frac{1}{\alpha_{k}}-L_{\Psi})\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}+\frac{\alpha_{k}}{2}\|b^{k}\|^{2}
\displaystyle\leq-\frac{1}{4\alpha_{k}}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}+\frac{\alpha_{k}}{2}\|b^{k}\|^{2}.

The last inequality uses the assumption αk12LΨ\alpha_{k}\leq\frac{1}{2L_{\Psi}}. Then summing up (C.21) over k=0,,K1k=0,...,K-1 gives

k=0K11αk𝐮k+1𝐮k2\displaystyle\sum_{k=0}^{K-1}\frac{1}{\alpha_{k}}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}\leq 4(Ψ(𝐮0)Ψ(𝐮K))+2k=0K1αkbk2.\displaystyle 4(\Psi(\mathbf{u}^{0})-\Psi(\mathbf{u}^{K}))+2\sum_{k=0}^{K-1}\alpha_{k}\|b^{k}\|^{2}. (C.22)

Taking expectations on both sides and multiplying by \frac{1}{\sum_{k=0}^{K-1}\alpha_{k}} yields (C.20). This completes the proof. ∎

We measure convergence by the deviation from the first-order optimality condition,

dist(0,Ψ(𝐮)+𝒩𝒰(𝐮)).\mathrm{dist}(0,\nabla\Psi(\mathbf{u})+\mathcal{N}_{\mathcal{U}}(\mathbf{u})). (C.23)

Let \delta^{k}=\mathbf{u}^{k+1}-\mathbf{u}^{k}+\alpha_{k}\nabla\Psi^{k}(\mathbf{u}^{k};\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k}). The projection step gives -\delta^{k}\in\mathcal{N}_{\mathcal{U}}(\mathbf{u}^{k+1}). We derive the following bound on this measure in Theorem C.2.
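In practice, the right-hand side of (C.25) is the quantity one monitors. The following small sketch computes this projected-gradient surrogate for (C.23), assuming the stochastic gradient and the projection are available as callables; it is illustrative only.

```python
import numpy as np

def stationarity_surrogate(u, grad_psi, proj_U, alpha):
    """Computable surrogate for dist(0, grad Psi(u) + N_U(u)); see (C.23)-(C.25).

    grad_psi(u) and proj_U(u) are assumed callables for a (stochastic) gradient
    of Psi and the projection onto U; this is a sketch, not the paper's code.
    """
    u_next = proj_U(u - alpha * grad_psi(u))   # one projected gradient step
    return np.linalg.norm(u_next - u) / alpha  # upper-bounds dist(0, grad Psi^k(u) + N_U(u_next)), cf. (C.25)
```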

Theorem C.2.

Assume αk12LΨ\alpha_{k}\leq\frac{1}{2L_{\Psi}}. Then it holds that

1k=0K1αkk=0K1αk𝔼[dist(0,Ψ(𝐮k+1)+𝒩𝒰(𝐮k+1))2]\displaystyle\quad\frac{1}{\sum_{k=0}^{K-1}\alpha_{k}}\sum_{k=0}^{K-1}\alpha_{k}\mathbb{E}[\mathrm{dist}(0,\nabla\Psi(\mathbf{u}^{k+1})+\mathcal{N}_{\mathcal{U}}(\mathbf{u}^{k+1}))^{2}] (C.24)
18k=0K1αk(Ψ(𝐮0)Ψ(𝐮K))+11k=0K1αkk=0K1αk𝔼[bk2].\displaystyle\leq\frac{18}{\sum_{k=0}^{K-1}\alpha_{k}}(\Psi(\mathbf{u}^{0})-\Psi(\mathbf{u}^{K}))+\frac{11}{\sum_{k=0}^{K-1}\alpha_{k}}\sum_{k=0}^{K-1}\alpha_{k}\mathbb{E}[\|b^{k}\|^{2}].

Proof Since δk𝒩𝒰(𝐮k+1)-\delta^{k}\in\mathcal{N}_{\mathcal{U}}(\mathbf{u}^{k+1}), we have

dist(0,Ψk(𝐮k)+𝒩𝒰(𝐮k+1))1αk𝐮k+1𝐮k.\mathrm{dist}(0,\nabla\Psi^{k}(\mathbf{u}^{k})+\mathcal{N}_{\mathcal{U}}(\mathbf{u}^{k+1}))\leq\frac{1}{\alpha_{k}}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|. (C.25)

The Lipschitz property of Ψ(𝐮)\nabla\Psi(\mathbf{u}) gives

dist(0,Ψ(𝐮k+1)+𝒩𝒰(𝐮k+1))\displaystyle\mathrm{dist}(0,\nabla\Psi(\mathbf{u}^{k+1})+\mathcal{N}_{\mathcal{U}}(\mathbf{u}^{k+1})) dist(0,Ψ(𝐮k)+𝒩𝒰(𝐮k+1))+LΨ𝐮k+1𝐮k\displaystyle\leq\mathrm{dist}(0,\nabla\Psi(\mathbf{u}^{k})+\mathcal{N}_{\mathcal{U}}(\mathbf{u}^{k+1}))+L_{\Psi}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\| (C.26)
dist(0,Ψk(𝐮k)+𝒩𝒰(𝐮k+1))+bk+LΨ𝐮k+1𝐮k\displaystyle\leq\mathrm{dist}(0,\nabla\Psi^{k}(\mathbf{u}^{k})+\mathcal{N}_{\mathcal{U}}(\mathbf{u}^{k+1}))+\|b^{k}\|+L_{\Psi}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|
bk+(1αk+LΨ)𝐮k+1𝐮k( by (C.25))\displaystyle\leq\|b^{k}\|+(\frac{1}{\alpha_{k}}+L_{\Psi})\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|\quad\text{( by~\eqref{eq: convergence of outer loop 1})}
bk+32αk𝐮k+1𝐮k.\displaystyle\leq\|b^{k}\|+\frac{3}{2\alpha_{k}}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|.

Taking square we obtain

dist(0,Ψ(𝐮k+1)+𝒩𝒰(𝐮k+1))2\displaystyle\mathrm{dist}(0,\nabla\Psi(\mathbf{u}^{k+1})+\mathcal{N}_{\mathcal{U}}(\mathbf{u}^{k+1}))^{2} (bk+32αk𝐮k+1𝐮k)2\displaystyle\leq(\|b^{k}\|+\frac{3}{2\alpha_{k}}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|)^{2} (C.27)
2bk2+92αk2𝐮k+1𝐮k2.\displaystyle\leq 2\|b^{k}\|^{2}+\frac{9}{2\alpha_{k}^{2}}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}.

Substituting the above inequality into (C.20) gives (C.24). This completes the proof. ∎

Corollary C.2.

Take the step sizes in the inner loop as in (B.13). Take a constant step size \alpha_{k}={\alpha}<\frac{1}{2L_{\Psi}} in the outer loop and constant sample sizes r_{k}=r, q_{k}=q and s_{k}=s in Algorithm 2. Randomly choose an index R from \{1,...,K\} with probability \mathrm{Prob}(R=k)=\frac{\alpha_{k-1}}{\sum_{k=1}^{K}\alpha_{k-1}}. Then we have

𝔼[1k=0K1αkk=0K11αk𝐮k+1𝐮k2]𝒪~(1αK+1r+c12q+c12s),\displaystyle\quad\mathbb{E}\left[\frac{1}{\sum_{k=0}^{K-1}\alpha_{k}}\sum_{k=0}^{K-1}\frac{1}{\alpha_{k}}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}\right]\leq\widetilde{\mathcal{O}}\left(\frac{1}{\alpha K}+\frac{1}{r}+\frac{c_{1}^{2}}{q}+\frac{c_{1}^{2}}{s}\right),
𝔼[dist(0,Ψ(𝐮R)+𝒩𝒰(𝐮R))2]𝒪~(1αK+1r+c12q+c12s).\displaystyle\quad\mathbb{E}[\mathrm{dist}(0,\nabla\Psi(\mathbf{u}^{R})+\mathcal{N}_{\mathcal{U}}(\mathbf{u}^{R}))^{2}]\leq\widetilde{\mathcal{O}}\left(\frac{1}{\alpha K}+\frac{1}{r}+\frac{c_{1}^{2}}{q}+\frac{c_{1}^{2}}{s}\right).

Proof From Lemmas C.4 and C.5 we know that \mathbb{E}[\|b^{k}\|^{2}]\leq\widetilde{\mathcal{O}}(\frac{1}{r_{k}}+\frac{c_{1}^{2}}{q_{k}}+\frac{c_{1}^{2}}{s_{k}}). Substituting this into Theorems C.1 and C.2 gives

\displaystyle\mathbb{E}\left[\frac{1}{\sum_{k=0}^{K-1}\alpha_{k}}\sum_{k=0}^{K-1}\frac{1}{\alpha_{k}}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}\right] \leq\frac{4}{\alpha K}\mathbb{E}[\Psi(\mathbf{u}^{0})-\Psi(\mathbf{u}^{K})]+\frac{2}{K}\sum_{k=0}^{K-1}\mathbb{E}[\|b^{k}\|^{2}]
𝒪~(1αK+1r+c12q+c12s),\displaystyle\leq\widetilde{\mathcal{O}}\left(\frac{1}{\alpha K}+\frac{1}{r}+\frac{c_{1}^{2}}{q}+\frac{c_{1}^{2}}{s}\right),
𝔼[dist(0,Ψ(𝐮R)+𝒩𝒰(𝐮R))2]\displaystyle\mathbb{E}[\mathrm{dist}(0,\nabla\Psi(\mathbf{u}^{R})+\mathcal{N}_{\mathcal{U}}(\mathbf{u}^{R}))^{2}] 18αK𝔼[Ψ(𝐮0)Ψ(𝐮K)]+11Kk=0K1𝔼[bk2]\displaystyle\leq\frac{18}{\alpha K}\mathbb{E}[\Psi(\mathbf{u}^{0})-\Psi(\mathbf{u}^{K})]+\frac{11}{K}\sum_{k=0}^{K-1}\mathbb{E}[\|b^{k}\|^{2}]
𝒪~(1αK+1r+c12q+c12s).\displaystyle\leq\widetilde{\mathcal{O}}\left(\frac{1}{\alpha K}+\frac{1}{r}+\frac{c_{1}^{2}}{q}+\frac{c_{1}^{2}}{s}\right).

This completes the proof. ∎
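The constant-step, constant-batch outer loop of Corollary C.2, including the randomized output index R, can be sketched as follows. The stochastic gradient of \Psi^{k} (which internally runs the inner loop to produce (w^{k},\lambda^{k})) is assumed to be available as a callable; this is an illustration, not a reproduction of Algorithm 2.

```python
import numpy as np

def outer_loop(grad_psi_hat, proj_U, u0, K, alpha, rng):
    """Sketch of the outer projected SGD loop with a randomized output index.

    grad_psi_hat(u, k) is an assumed placeholder for the stochastic gradient of
    Psi built from batches of sizes (r, q, s) at iteration k.
    """
    u = np.array(u0, dtype=float)
    iterates = []
    for k in range(K):
        u = proj_U(u - alpha * grad_psi_hat(u, k))  # projected stochastic gradient step
        iterates.append(u.copy())
    # Prob(R = k) proportional to alpha_{k-1}; with a constant step size this is uniform.
    R = int(rng.integers(1, K + 1))
    return iterates[R - 1]  # u^R
```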

Remark C.1.

Let c=\max(c_{1},c_{2}). The step size condition \alpha_{k}<\frac{1}{2L_{\Psi}} and (C.3) imply that \alpha_{k} is at most \widetilde{\mathcal{O}}(c^{-1}). With \alpha\sim\mathcal{O}(c^{-1}), r\sim\mathcal{O}(\epsilon^{-1}), q\sim\mathcal{O}(c_{1}^{2}\epsilon^{-1}), s\sim{\mathcal{O}}(c_{1}^{2}\epsilon^{-1}), K\sim{\mathcal{O}}(c\epsilon^{-1}), the right-hand side of the above inequality is \widetilde{\mathcal{O}}(\epsilon). Then the sample complexity on \xi is \sum_{k=0}^{K-1}{s_{k}}+\sum_{k=0}^{K-1}q_{k}=sK+qK=\widetilde{\mathcal{O}}(cc_{1}^{2}\epsilon^{-2}) and the sample complexity on \zeta is \sum_{k=0}^{K-1}{r_{k}}=rK=\widetilde{\mathcal{O}}(c\epsilon^{-2}). Theorem C.1 shows that the algorithm converges to an \epsilon-stationary point of problem (2.9) for any fixed c_{1}>0 with this sample complexity.

Remark C.2.

By Theorem A.4, if we take c1𝒪(ϵ1)c_{1}\sim\mathcal{O}(\epsilon^{-1}), c2𝒪(ϵ3),δ𝒪(ϵ2)c_{2}\sim\mathcal{O}(\epsilon^{-3}),\delta\sim\mathcal{O}(\epsilon^{-2}), then (2.9) is equivalent to the original BLO (1.2) in the sense of ϵ\epsilon-accuracy. Under this condition, the sample complexity on ξ\xi is 𝒪~(ϵ7)\widetilde{\mathcal{O}}(\epsilon^{-7}) and the sample complexity on ζ\zeta is 𝒪~(ϵ5)\widetilde{\mathcal{O}}(\epsilon^{-5}).

C.1 Analysis on variance reduction

In this section, we introduce a stronger assumption on the Lipschitz continuity of the gradients of f(x,y) and g(x,y), stated in Assumption 3.7. From (C.3) and (C.18), we know that \nabla\Psi^{k}(\mathbf{u};\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k}) is L_{\Psi}^{\prime}-averaged Lipschitz continuous conditioned on \mathcal{F}_{k-1} with modulus

LΨ=Lf+c1ϵ𝒢k𝒪(Lf+c1sk).L_{\Psi}^{\prime}=L_{f}+c_{1}\epsilon_{\mathcal{G}}^{k}\leq\mathcal{O}(L_{f}+\frac{c_{1}}{s_{k}}). (C.28)

Define the error of the direction as

ek=dk𝔼[Ψk(𝐮k;ζk,ξ~k)|~k1].e^{k}=d^{k}-\mathbb{E}[\nabla\Psi^{k}(\mathbf{u}^{k};\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k})|\tilde{\mathcal{F}}_{k-1}].
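The direction d^{k} follows the momentum-style recursion of (2.15); as can be read off from the proof of Lemma C.7 below, it takes the form d^{k+1}=\nabla\Psi^{k+1}(\mathbf{u}^{k+1};\mathbf{\zeta}^{k+1},\mathbf{\tilde{\xi}}^{k+1})+(1-\beta_{k+1})(d^{k}-\nabla\Psi^{k+1}(\mathbf{u}^{k};\mathbf{\zeta}^{k+1},\mathbf{\tilde{\xi}}^{k+1})). A minimal sketch of this update follows, with the fresh-sample gradient supplied as an assumed callable.

```python
def update_direction(d_prev, u_new, u_prev, beta, grad_psi_new):
    """Momentum-based variance-reduced direction update; sketch only.

    grad_psi_new(u) is an assumed callable evaluating the iteration-(k+1)
    stochastic gradient (built from the fresh samples) at the point u;
    evaluating it at both u_new and u_prev with the same samples is what
    drives the error recursion in Lemma C.7.
    """
    g_new = grad_psi_new(u_new)    # gradient at the new iterate u^{k+1}
    g_prev = grad_psi_new(u_prev)  # same fresh samples, previous iterate u^k
    return g_new + (1.0 - beta) * (d_prev - g_prev)  # d^{k+1}
```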

First, we derive the one-iteration decrease under the variance-reduced updates.

Lemma C.6.

The sequence {𝐮k}\{\mathbf{u}^{k}\} satisfies

Ψ(𝐮k+1)Ψ(𝐮k)14αk𝐮k+1𝐮k2+αkek2+αkc12(ϵ𝒢k)2,\quad\Psi(\mathbf{u}^{k+1})-\Psi(\mathbf{u}^{k})\leq-\frac{1}{4\alpha_{k}}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}+\alpha_{k}\|e^{k}\|^{2}+\alpha_{k}c_{1}^{2}(\epsilon_{\mathcal{G}}^{k})^{2}, (C.29)

where ϵ𝒢k\epsilon_{\mathcal{G}}^{k} is defined in Lemma C.3.

Proof Similar to (C.21), it holds that

Ψ(𝐮k+1)Ψ(𝐮k)\displaystyle\quad\Psi(\mathbf{u}^{k+1})-\Psi(\mathbf{u}^{k}) (C.30)
Ψ(𝐮k),𝐮k+1𝐮k+LΨ2𝐮k+1𝐮k2\displaystyle\leq\langle\nabla\Psi(\mathbf{u}^{k}),\mathbf{u}^{k+1}-\mathbf{u}^{k}\rangle+\frac{L_{\Psi}}{2}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}
=dk,𝐮k+1𝐮kek+𝔼[Ψk(𝐮k;ζk,ξ~k)|~k1]Ψ(𝐮k),𝐮k+1𝐮k\displaystyle=\langle d^{k},\mathbf{u}^{k+1}-\mathbf{u}^{k}\rangle-\langle e^{k}+\mathbb{E}[\nabla\Psi^{k}(\mathbf{u}^{k};\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k})|\widetilde{\mathcal{F}}_{k-1}]-\nabla\Psi(\mathbf{u}^{k}),\mathbf{u}^{k+1}-\mathbf{u}^{k}\rangle
+LΨ2𝐮k+1𝐮k2\displaystyle\quad+\frac{L_{\Psi}}{2}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}
1αk𝐮k+1𝐮k2ek+𝔼[Ψk(𝐮k;ζk,ξ~k)|~k1]Ψ(𝐮k),𝐮k+1𝐮k\displaystyle\leq-\frac{1}{\alpha_{k}}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}-\langle e^{k}+\mathbb{E}[\nabla\Psi^{k}(\mathbf{u}^{k};\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k})|\widetilde{\mathcal{F}}_{k-1}]-\nabla\Psi(\mathbf{u}^{k}),\mathbf{u}^{k+1}-\mathbf{u}^{k}\rangle
+LΨ2𝐮k+1𝐮k2\displaystyle\quad+\frac{L_{\Psi}}{2}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}
1αk𝐮k+1𝐮k2+αk2ek+𝔼[Ψk(𝐮k;ζk,ξ~k)|~k1]Ψ(𝐮k)2\displaystyle\leq-\frac{1}{\alpha_{k}}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}+\frac{\alpha_{k}}{2}\|e^{k}+\mathbb{E}[\nabla\Psi^{k}(\mathbf{u}^{k};\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k})|\widetilde{\mathcal{F}}_{k-1}]-\nabla\Psi(\mathbf{u}^{k})\|^{2}
\displaystyle\quad+(\frac{1}{2\alpha_{k}}+\frac{L_{\Psi}}{2})\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}
12(1αkLΨ)𝐮k+1𝐮k2+αkek2+αkΨ(𝐮k)𝔼[Ψk(𝐮k;ζk,ξ~k)|~k1]2\displaystyle\leq-\frac{1}{2}(\frac{1}{\alpha_{k}}-L_{\Psi})\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}+\alpha_{k}\|e^{k}\|^{2}+\alpha_{k}\|\nabla\Psi(\mathbf{u}^{k})-\mathbb{E}[\nabla\Psi^{k}(\mathbf{u}^{k};\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k})|\widetilde{\mathcal{F}}_{k-1}]\|^{2}
12(1αkLΨ)𝐮k+1𝐮k2+αkek2+αkc12(ϵ𝒢k)2.\displaystyle\leq-\frac{1}{2}(\frac{1}{\alpha_{k}}-L_{\Psi})\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}+\alpha_{k}\|e^{k}\|^{2}+\alpha_{k}c_{1}^{2}(\epsilon_{\mathcal{G}}^{k})^{2}.

The first inequality follows from the Lipschitz property, the second inequality is due to the projected gradient step, and the third inequality uses Young's inequality. Finally, (C.29) follows since the step-size condition \alpha_{k}\leq\frac{1}{2L_{\Psi}} gives -\frac{1}{2}(\frac{1}{\alpha_{k}}-L_{\Psi})\leq-\frac{1}{4\alpha_{k}}. This completes the proof. ∎

Next we show that the error sequence \{e^{k}\} satisfies the following recursive relationship, which controls its accumulation.

Lemma C.7.

The conditional expectation of the error ek+1e^{k+1} is bounded as

𝔼[ek+12|~k]\displaystyle\mathbb{E}[\|e^{k+1}\|^{2}|\widetilde{\mathcal{F}}_{k}] 2βk+12(σf2rk+1+c12(σ𝒢k)2)+8(1βk+1)2ek2\displaystyle\leq{2}\beta_{k+1}^{2}\left(\frac{\sigma_{f}^{2}}{r_{k+1}}+c_{1}^{2}(\sigma_{\mathcal{G}}^{k})^{2}\right)+8(1-\beta_{k+1})^{2}\|e^{k}\|^{2} (C.31)
\displaystyle\quad+8\frac{(L_{\Psi}^{\prime})^{2}+L_{\Psi}^{2}}{r_{k+1}}\mathbb{E}[\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}|\tilde{\mathcal{F}}_{k}]+\frac{8c^{2}}{r_{k+1}}\big((\epsilon_{\mathcal{G}}^{k})^{2}+(\epsilon_{\mathcal{G}}^{k+1})^{2}\big).

Here the expectation is taken over ζk+1,ξ~k+1,ξk+1\mathbf{\zeta}^{k+1},\mathbf{\tilde{\xi}}^{k+1},\mathbf{\xi}^{k+1}.

Proof Let \Delta^{k+1} denote the zero-mean fluctuation of the sampled gradient at \mathbf{u}^{k+1}, and let \tilde{\Delta}^{k} denote the deviation of the new-sample gradient at \mathbf{u}^{k} from the conditional expectation of the previous gradient estimate, that is,

Δk+1\displaystyle\Delta^{k+1} =Ψk+1(𝐮k+1;ζk+1,ξ~k+1)𝔼[Ψk+1(𝐮k+1;ζk+1,ξ~k+1)|~k],\displaystyle=\nabla\Psi^{k+1}(\mathbf{u}^{k+1};\mathbf{\zeta}^{k+1},\mathbf{\tilde{\xi}}^{k+1})-\mathbb{E}[\nabla\Psi^{k+1}(\mathbf{u}^{k+1};\mathbf{\zeta}^{k+1},\mathbf{\tilde{\xi}}^{k+1})|\tilde{\mathcal{F}}_{k}],
Δ~k\displaystyle\tilde{\Delta}^{k} =Ψk+1(𝐮k;ζk+1,ξ~k+1)𝔼[Ψk(𝐮k;ζk,ξ~k)|~k1].\displaystyle=\nabla\Psi^{k+1}(\mathbf{u}^{k};\mathbf{\zeta}^{k+1},\mathbf{\tilde{\xi}}^{k+1})-\mathbb{E}[\nabla\Psi^{k}(\mathbf{u}^{k};\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k})|\tilde{\mathcal{F}}_{k-1}].

From the definition of eke^{k} and (2.15) we have

ek+1\displaystyle e^{k+1} =dk+1𝔼[Ψk+1(𝐮k+1;ζk+1,ξ~k+1)|~k]\displaystyle=d^{k+1}-\mathbb{E}[\nabla\Psi^{k+1}(\mathbf{u}^{k+1};\mathbf{\zeta}^{k+1},\mathbf{\tilde{\xi}}^{k+1})|\tilde{\mathcal{F}}_{k}] (C.32)
=Ψk+1(𝐮k+1;ζk+1,ξ~k+1)𝔼[Ψk+1(𝐮k+1;ζk+1,ξ~k+1)|~k]\displaystyle=\nabla\Psi^{k+1}(\mathbf{u}^{k+1};\mathbf{\zeta}^{k+1},\mathbf{\tilde{\xi}}^{k+1})-\mathbb{E}[\nabla\Psi^{k+1}(\mathbf{u}^{k+1};\mathbf{\zeta}^{k+1},\mathbf{\tilde{\xi}}^{k+1})|\tilde{\mathcal{F}}_{k}]
+(1βk+1)(dkΨk+1(𝐮k;ζk+1,ξ~k+1))\displaystyle\quad+(1-\beta_{k+1})(d^{k}-\nabla\Psi^{k+1}(\mathbf{u}^{k};\mathbf{\zeta}^{k+1},\mathbf{\tilde{\xi}}^{k+1}))
=Ψk+1(𝐮k+1;ζk+1,ξ~k+1)𝔼[Ψk+1(𝐮k+1;ζk+1,ξ~k+1)|~k]\displaystyle=\nabla\Psi^{k+1}(\mathbf{u}^{k+1};\mathbf{\zeta}^{k+1},\mathbf{\tilde{\xi}}^{k+1})-\mathbb{E}[\nabla\Psi^{k+1}(\mathbf{u}^{k+1};\mathbf{\zeta}^{k+1},\mathbf{\tilde{\xi}}^{k+1})|\tilde{\mathcal{F}}_{k}]
+(1βk+1)(ek+𝔼[Ψk(𝐮k;ζk,ξ~k)|k1]Ψk+1(𝐮k;ζk+1,ξ~k+1))\displaystyle\quad+(1-\beta_{k+1})(e^{k}+\mathbb{E}[\nabla\Psi^{k}(\mathbf{u}^{k};\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k})|\mathcal{F}_{k-1}]-\nabla\Psi^{k+1}(\mathbf{u}^{k};\mathbf{\zeta}^{k+1},\mathbf{\tilde{\xi}}^{k+1}))
=βk+1(Ψk+1(𝐮k+1;ζk+1,ξ~k+1𝔼[Ψk+1(𝐮k+1;ζk+1,ξ~k+1)|~k])\displaystyle=\beta_{k+1}\left(\nabla\Psi^{k+1}(\mathbf{u}^{k+1};\mathbf{\zeta}^{k+1},\mathbf{\tilde{\xi}}^{k+1}-\mathbb{E}[\nabla\Psi^{k+1}(\mathbf{u}^{k+1};\mathbf{\zeta}^{k+1},\mathbf{\tilde{\xi}}^{k+1})|\tilde{\mathcal{F}}_{k}]\right)
+(1βk+1)(Ψk+1(𝐮k+1;ζk+1,ξ~k+1)𝔼[Ψk+1(𝐮k+1;ζk+1,ξ~k+1)|~k]\displaystyle\quad+(1-\beta_{k+1})\left(\nabla\Psi^{k+1}(\mathbf{u}^{k+1};\mathbf{\zeta}^{k+1},\mathbf{\tilde{\xi}}^{k+1})-\mathbb{E}[\nabla\Psi^{k+1}(\mathbf{u}^{k+1};\mathbf{\zeta}^{k+1},\mathbf{\tilde{\xi}}^{k+1})|\tilde{\mathcal{F}}_{k}]\right.
+ek+𝔼[Ψk(𝐮k;ζk,ξ~k)|k1]Ψk+1(𝐮k;ζk+1,ξ~k+1))\displaystyle\quad\left.+e^{k}+\mathbb{E}[\nabla\Psi^{k}(\mathbf{u}^{k};\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k})|\mathcal{F}_{k-1}]-\nabla\Psi^{k+1}(\mathbf{u}^{k};\mathbf{\zeta}^{k+1},\mathbf{\tilde{\xi}}^{k+1})\right)
=βk+1Δk+1+(1βk+1)ek+(1βk+1)(Δk+1Δ~k).\displaystyle=\beta_{k+1}\Delta^{k+1}+(1-\beta_{k+1})e^{k}+(1-\beta_{k+1})(\Delta^{k+1}-\tilde{\Delta}^{k}).

It follows from (C.6) and Lemma C.3 that \mathbb{E}[\|\Delta^{k+1}\|^{2}|\widetilde{\mathcal{F}}_{k}]=\mathbb{V}[\nabla\Psi^{k+1}(\mathbf{u}^{k+1};\mathbf{\zeta}^{k+1},\mathbf{\tilde{\xi}}^{k+1})|\widetilde{\mathcal{F}}_{k}]\leq\frac{\sigma_{f}^{2}}{r_{k+1}}+c_{1}^{2}(\sigma_{\mathcal{G}}^{k})^{2}, where \sigma_{\mathcal{G}}^{k} is defined in (C.12). Since e^{k} is a constant conditioned on \widetilde{\mathcal{F}}_{k} and \mathbb{E}[\Delta^{k+1}|\widetilde{\mathcal{F}}_{k}]=0, (C.32) implies that

\displaystyle\quad\mathbb{E}[\|e^{k+1}\|^{2}|\widetilde{\mathcal{F}}_{k}] (C.33)
\displaystyle=\mathbb{E}[\|\beta_{k+1}\Delta^{k+1}+(1-\beta_{k+1})(\Delta^{k+1}-\tilde{\Delta}^{k})\|^{2}|\widetilde{\mathcal{F}}_{k}]+(1-\beta_{k+1})^{2}\|e^{k}\|^{2}
\displaystyle\leq 2\beta_{k+1}^{2}\mathbb{E}[\|\Delta^{k+1}\|^{2}|\widetilde{\mathcal{F}}_{k}]+2(1-\beta_{k+1})^{2}\mathbb{E}[\|\Delta^{k+1}-\tilde{\Delta}^{k}\|^{2}|\widetilde{\mathcal{F}}_{k}]+(1-\beta_{k+1})^{2}\|e^{k}\|^{2}
\displaystyle\leq{2}\beta_{k+1}^{2}\left(\frac{\sigma_{f}^{2}}{r_{k+1}}+c_{1}^{2}(\sigma_{\mathcal{G}}^{k})^{2}\right)+2(1-\beta_{k+1})^{2}\mathbb{E}[\|\Delta^{k+1}-\tilde{\Delta}^{k}\|^{2}|\widetilde{\mathcal{F}}_{k}]+(1-\beta_{k+1})^{2}\|e^{k}\|^{2}.

We now bound the term \mathbb{E}[\|\Delta^{k+1}-\tilde{\Delta}^{k}\|^{2}|\widetilde{\mathcal{F}}_{k}]. Let

\displaystyle\mathrm{R}^{k+1}(\mathbf{u}^{k+1},\mathbf{u}^{k};\mathbf{\zeta}^{k+1},\mathbf{\tilde{\xi}}^{k+1}) =\nabla\Psi^{k+1}(\mathbf{u}^{k+1};\mathbf{\zeta}^{k+1},\mathbf{\tilde{\xi}}^{k+1})-\nabla\Psi^{k+1}(\mathbf{u}^{k};\mathbf{\zeta}^{k+1},\mathbf{\tilde{\xi}}^{k+1}),
\displaystyle\mathrm{R}^{k+1}(\mathbf{u}^{k+1},\mathbf{u}^{k}) =\nabla\Psi(\mathbf{u}^{k+1})-\nabla\Psi(\mathbf{u}^{k})

denote the stochastic and deterministic gradient differences between the points \mathbf{u}^{k+1} and \mathbf{u}^{k}, respectively. The averaged Lipschitz property of \nabla\Psi^{k+1}(\mathbf{u};\mathbf{\zeta},\mathbf{\tilde{\xi}}) and the Lipschitz property of \nabla\Psi(\mathbf{u}) give \mathbb{E}[\|\mathrm{R}^{k+1}(\mathbf{u}^{k+1},\mathbf{u}^{k};\mathbf{\zeta}^{k+1},\mathbf{\tilde{\xi}}^{k+1})\|^{2}|\tilde{\mathcal{F}}_{k}]\leq(L_{\Psi}^{\prime})^{2}\mathbb{E}[\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}|\tilde{\mathcal{F}}_{k}] and \|\mathrm{R}^{k+1}(\mathbf{u}^{k+1},\mathbf{u}^{k})\|\leq L_{\Psi}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|. Then it holds that

\displaystyle\Delta^{k+1}-\tilde{\Delta}^{k} =\mathrm{R}^{k+1}(\mathbf{u}^{k+1},\mathbf{u}^{k};\mathbf{\zeta}^{k+1},\mathbf{\tilde{\xi}}^{k+1})-\mathrm{R}^{k+1}(\mathbf{u}^{k+1},\mathbf{u}^{k})
+𝔼[Ψk+1(𝐮k+1;ζk+1,ξ~k+1)|~k]Ψk+1(𝐮k+1)\displaystyle\quad+\mathbb{E}[\nabla\Psi^{k+1}(\mathbf{u}^{k+1};\mathbf{\zeta}^{k+1},\mathbf{\tilde{\xi}}^{k+1})|\widetilde{\mathcal{F}}_{k}]-\nabla\Psi^{k+1}(\mathbf{u}^{k+1})
\displaystyle\quad-\mathbb{E}[\nabla\Psi^{k}(\mathbf{u}^{k};\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k})|\widetilde{\mathcal{F}}_{k-1}]+\nabla\Psi^{k}(\mathbf{u}^{k}),

and

\displaystyle\mathbb{E}[\|\Delta^{k+1}-\tilde{\Delta}^{k}\|^{2}|\widetilde{\mathcal{F}}_{k}] \leq 2\mathbb{E}[\|\mathrm{R}^{k+1}(\mathbf{u}^{k+1},\mathbf{u}^{k};\mathbf{\zeta}^{k+1},\mathbf{\tilde{\xi}}^{k+1})-\mathrm{R}^{k+1}(\mathbf{u}^{k+1},\mathbf{u}^{k})\|^{2}|\widetilde{\mathcal{F}}_{k}] (C.34)
\displaystyle\quad+4\|\mathbb{E}[\nabla\Psi^{k+1}(\mathbf{u}^{k+1};\mathbf{\zeta}^{k+1},\mathbf{\tilde{\xi}}^{k+1})|\widetilde{\mathcal{F}}_{k}]-\nabla\Psi(\mathbf{u}^{k+1})\|^{2}
\displaystyle\quad+4\|\mathbb{E}[\nabla\Psi^{k}(\mathbf{u}^{k};\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k})|\widetilde{\mathcal{F}}_{k}]-\nabla\Psi(\mathbf{u}^{k})\|^{2}
\displaystyle\leq\frac{4\left((L_{\Psi}^{\prime})^{2}+L_{\Psi}^{2}\right)}{r_{k+1}}\mathbb{E}[\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}|\widetilde{\mathcal{F}}_{k}]+\frac{4c_{1}^{2}}{r_{k+1}}\big((\epsilon_{\mathcal{G}}^{k})^{2}+(\epsilon_{\mathcal{G}}^{k+1})^{2}\big).

The last inequality follows from Lemma C.3 and (C.18). Combining (C.33) and (C.34) gives (C.31). This completes the proof. ∎

Using the above two lemmas, we can now derive the convergence of the variance-reduced method.
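For intuition, the error sequence e^{k} analyzed in the two lemmas is generated by a recursive momentum (STORM-type) estimator of the form d^{k+1}=\nabla\Psi^{k+1}(\mathbf{u}^{k+1};\mathbf{\zeta}^{k+1},\mathbf{\tilde{\xi}}^{k+1})+(1-\beta_{k+1})(d^{k}-\nabla\Psi^{k+1}(\mathbf{u}^{k};\mathbf{\zeta}^{k+1},\mathbf{\tilde{\xi}}^{k+1})). The sketch below only illustrates this update on a toy quadratic with a synthetic noisy oracle; it is not the paper's Algorithm 3, and the oracle, noise level, step size, and momentum schedule are assumptions made for the example.

```python
import numpy as np

# Illustrative sketch (not Algorithm 3) of a STORM-type recursive estimator on a
# toy quadratic Psi(u) = 0.5*||u||^2 with additive Gaussian oracle noise.
rng = np.random.default_rng(0)
dim, sigma, alpha = 10, 0.5, 0.1
u = rng.normal(size=dim)

def grad_oracle(point, noise):
    # stochastic gradient of Psi(u) = 0.5*||u||^2 evaluated with a given noise sample
    return point + noise

d = grad_oracle(u, sigma * rng.normal(size=dim))   # initial estimator d^0
for k in range(500):
    beta = min(1.0, 10.0 / (k + 2))                # momentum weight beta_{k+1} (placeholder schedule)
    u_new = u - alpha * d                          # gradient step (projection omitted in this toy)
    noise = sigma * rng.normal(size=dim)           # fresh sample, shared by both evaluations below
    g_new = grad_oracle(u_new, noise)              # gradient at u^{k+1} with the fresh sample
    g_old = grad_oracle(u, noise)                  # gradient at u^{k} with the *same* sample
    d = g_new + (1.0 - beta) * (d - g_old)         # recursive update, cf. the decomposition of e^{k+1}
    u = u_new
print("estimator error ||d - grad Psi(u)|| =", np.linalg.norm(d - u))
```

The point reflected in the code is that both gradient evaluations in the correction term share the same fresh sample, which is exactly what makes the error recursion above contract rather than accumulate noise.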

Theorem C.3.

The sequence {𝐮k+1}\{\mathbf{u}^{k+1}\} generated by Algorithm 3 satisfies

1k=0K1αkk=0K1𝔼[1αk𝐮k+1𝐮k2]1k=0K1αk𝔼[Ψ(𝐮0)+θ0e02Ψ(𝐮K)θKeK2]\displaystyle\quad\frac{1}{\sum_{k=0}^{K-1}\alpha_{k}}\sum_{k=0}^{K-1}\mathbb{E}\left[\frac{1}{\alpha_{k}}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}\right]\leq\frac{1}{\sum_{k=0}^{K-1}\alpha_{k}}\mathbb{E}[\Psi(\mathbf{u}^{0})+\theta_{0}\|e^{0}\|^{2}-\Psi(\mathbf{u}^{K})-\theta_{K}\|e^{K}\|^{2}] (C.35)
\displaystyle\quad\quad+\frac{1}{\sum_{k=0}^{K-1}\alpha_{k}}\sum_{k=0}^{K-1}\left\{\alpha_{k}c_{1}^{2}\mathbb{E}[(\epsilon_{\mathcal{G}}^{k})^{2}]+2\frac{\theta}{\alpha_{k}}\beta_{k+1}^{2}\big(\frac{\sigma_{f}^{2}}{r_{k+1}}+c_{1}^{2}\mathbb{E}[(\sigma_{\mathcal{G}}^{k})^{2}]\big)\right.
+8c12θαkrk+1(𝔼[(ϵ𝒢k)2]+𝔼[(ϵ𝒢k+1)2])}.\displaystyle\quad\quad\left.+\frac{8c_{1}^{2}\theta}{\alpha_{k}r_{k+1}}\big(\mathbb{E}[(\epsilon_{\mathcal{G}}^{k})^{2}]+\mathbb{E}[(\epsilon_{\mathcal{G}}^{k+1})^{2}]\big)\right\}.

Here \theta=\frac{1}{64((L_{\Psi}^{\prime})^{2}+L_{\Psi}^{2})}, and the bound holds provided that \alpha_{k}\leq\frac{1}{8L_{\Psi}} and \beta_{k}\geq 1-\sqrt{\frac{\frac{\theta}{\alpha_{k}}-\alpha_{k}}{8}}.

Proof Consider the merit function \Psi(\mathbf{u}^{k})+\theta_{k}\|e^{k}\|^{2}, where \theta_{k} satisfies

αk+8(1βk+1)2θk+1θk0,\displaystyle\alpha_{k}+8(1-\beta_{k+1})^{2}\theta_{k+1}-\theta_{k}\leq 0, (C.36a)
12αk+LΨ2+8θk+1(LΨ)2+LΨ2rk+1\displaystyle-\frac{1}{2\alpha_{k}}+\frac{L_{\Psi}}{2}+8\theta_{k+1}\frac{(L_{\Psi}^{\prime})^{2}+L_{\Psi}^{2}}{r_{k+1}} 14αk.\displaystyle\leq-\frac{1}{4\alpha_{k}}. (C.36b)

Considering the reduction of the merit function, we have

𝔼[Ψ(𝐮k+1)+θk+1ek+12Ψ(𝐮k)θkek2|~k]\displaystyle\quad\mathbb{E}[\Psi(\mathbf{u}^{k+1})+\theta_{k+1}\|e^{k+1}\|^{2}-\Psi(\mathbf{u}^{k})-\theta_{k}\|e^{k}\|^{2}|\widetilde{\mathcal{F}}_{k}] (C.37)
𝔼[12(1αkLΨ)𝐮k+1𝐮k2+αkek2+αkc12(ϵ𝒢k)2+θk+1ek+12\displaystyle\leq\mathbb{E}\left[-\frac{1}{2}(\frac{1}{\alpha_{k}}-L_{\Psi})\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}+\alpha_{k}\|e^{k}\|^{2}+\alpha_{k}c_{1}^{2}(\epsilon_{\mathcal{G}}^{k})^{2}+\theta_{k+1}\|e^{k+1}\|^{2}\right.
θkek2|~k](by (C.29))\displaystyle\quad\left.-\theta_{k}\|e^{k}\|^{2}|\widetilde{\mathcal{F}}_{k}\right]\text{(by~\eqref{eq: decrease in variance reduction})}
𝔼[12(1αkLΨ)𝐮k+1𝐮k2+αkek2+αkc12(ϵ𝒢k)2\displaystyle\leq\mathbb{E}\left[-\frac{1}{2}(\frac{1}{\alpha_{k}}-L_{\Psi})\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}+\alpha_{k}\|e^{k}\|^{2}+\alpha_{k}c_{1}^{2}(\epsilon_{\mathcal{G}}^{k})^{2}\right.
+θk+1(2βk+12(σf2rk+1+c12(σ𝒢k)2)+8(1βk+1)2ek2\displaystyle\quad+\theta_{k+1}\left(2\beta_{k+1}^{2}\big(\frac{\sigma_{f}^{2}}{r_{k+1}}+c_{1}^{2}(\sigma_{\mathcal{G}}^{k})^{2}\big)+8(1-\beta_{k+1})^{2}\|e^{k}\|^{2}\right.
\displaystyle\quad+\left.\left.8\frac{(L_{\Psi}^{\prime})^{2}+L_{\Psi}^{2}}{r_{k+1}}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}+\frac{8c_{1}^{2}}{r_{k+1}}\big((\epsilon_{\mathcal{G}}^{k})^{2}+(\epsilon_{\mathcal{G}}^{k+1})^{2}\big)\right)-\theta_{k}\|e^{k}\|^{2}|\tilde{\mathcal{F}}_{k}\right]\quad\text{(by~\eqref{eq: bound of error in variance reduction})}
𝔼[(12αk+LΨ2+8θk+1(LΨ)2+LΨ2rk+1)𝐮k+1𝐮k2+αkc12(ϵ𝒢k)2\displaystyle\leq\mathbb{E}\left[\left(-\frac{1}{2\alpha_{k}}+\frac{L_{\Psi}}{2}+8\theta_{k+1}\frac{(L_{\Psi}^{\prime})^{2}+L_{\Psi}^{2}}{r_{k+1}}\right)\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}+\alpha_{k}c_{1}^{2}(\epsilon_{\mathcal{G}}^{k})^{2}\right.
\displaystyle\quad+\left.\theta_{k+1}\left(2\beta_{k+1}^{2}\big(\frac{\sigma_{f}^{2}}{r_{k+1}}+c_{1}^{2}(\sigma_{\mathcal{G}}^{k})^{2}\big)+\frac{8c_{1}^{2}}{r_{k+1}}\big((\epsilon_{\mathcal{G}}^{k})^{2}+(\epsilon_{\mathcal{G}}^{k+1})^{2}\big)\right)|\widetilde{\mathcal{F}}_{k}\right]\quad\text{(by~\eqref{eq: condition on theta 1})}
𝔼[14αk𝐮k+1𝐮k2+αkc12(ϵ𝒢k)2\displaystyle\leq\mathbb{E}\left[-\frac{1}{4\alpha_{k}}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}+\alpha_{k}c_{1}^{2}(\epsilon_{\mathcal{G}}^{k})^{2}\right.
\displaystyle\quad\left.+\theta_{k+1}\left(2\beta_{k+1}^{2}\big(\frac{\sigma_{f}^{2}}{r_{k+1}}+c_{1}^{2}(\sigma_{\mathcal{G}}^{k})^{2}\big)+\frac{8c_{1}^{2}}{r_{k+1}}\big((\epsilon_{\mathcal{G}}^{k})^{2}+(\epsilon_{\mathcal{G}}^{k+1})^{2}\big)\right)|\widetilde{\mathcal{F}}_{k}\right].\quad\text{(by~\eqref{eq: condition on theta 2})}

To ensure the conditions (C.36a) and (C.36b), we take

\theta_{k+1}=\frac{\theta}{\alpha_{k}}\quad\text{with}~\theta=\frac{1}{64((L_{\Psi}^{\prime})^{2}+L_{\Psi}^{2})}, (C.38)

and let αk18LΨ\alpha_{k}\leq\frac{1}{8L_{\Psi}}, βk1θkαk8\beta_{k}\geq 1-\sqrt{\frac{\theta_{k}-\alpha_{k}}{8}}. Then (C.37) is simplified to

𝔼[Ψ(𝐮k+1)+θk+1ek+12Ψ(𝐮k)θkek2|~k]\displaystyle\quad\mathbb{E}[\Psi(\mathbf{u}^{k+1})+\theta_{k+1}\|e^{k+1}\|^{2}-\Psi(\mathbf{u}^{k})-\theta_{k}\|e^{k}\|^{2}|\widetilde{\mathcal{F}}_{k}] (C.39)
𝔼[14αk𝐮k+1𝐮k2+αkc12(ϵ𝒢k)2\displaystyle\leq\mathbb{E}\left[-\frac{1}{4\alpha_{k}}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}+\alpha_{k}c_{1}^{2}(\epsilon_{\mathcal{G}}^{k})^{2}\right.
\displaystyle\quad\left.+\frac{\theta}{\alpha_{k}}\big(2\beta_{k+1}^{2}\big(\frac{\sigma_{f}^{2}}{r_{k+1}}+c_{1}^{2}(\sigma_{\mathcal{G}}^{k})^{2}\big)+\frac{8c_{1}^{2}}{r_{k+1}}\big((\epsilon_{\mathcal{G}}^{k})^{2}+(\epsilon_{\mathcal{G}}^{k+1})^{2}\big)\big)|\widetilde{\mathcal{F}}_{k}\right].

Summing the above inequality from k=0 to K-1, taking the total expectation, and multiplying both sides by \frac{1}{\sum_{k=0}^{K-1}\alpha_{k}}, we obtain (C.35). This completes the proof. ∎
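As a sanity check on the choice (C.38), one can verify numerically that \theta_{k+1}=\theta/\alpha_{k} together with \alpha_{k}\leq\frac{1}{8L_{\Psi}} and r_{k+1}\geq 1 enforces condition (C.36b). The values of L_{\Psi}, L_{\Psi}^{\prime} and r_{k+1} in the sketch below are arbitrary placeholders, not constants prescribed by the analysis.

```python
import numpy as np

# Hedged numerical check of condition (C.36b) under the choice (C.38):
# theta_{k+1} = theta / alpha_k, theta = 1 / (64 ((L'_Psi)^2 + L_Psi^2)),
# alpha_k <= 1 / (8 L_Psi).  The constants below are placeholders.
for L_psi, L_psi_prime, r in [(2.0, 3.0, 1), (10.0, 4.0, 5), (50.0, 60.0, 20)]:
    theta = 1.0 / (64.0 * (L_psi_prime**2 + L_psi**2))
    for alpha in np.linspace(1e-4, 1.0 / (8.0 * L_psi), 50):
        theta_next = theta / alpha
        lhs = (-1.0 / (2.0 * alpha) + L_psi / 2.0
               + 8.0 * theta_next * (L_psi_prime**2 + L_psi**2) / r)
        assert lhs <= -1.0 / (4.0 * alpha)   # condition (C.36b)
print("condition (C.36b) held for all sampled placeholder values")
```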

Corollary C.3.

Take constant sample sizes r_{k}=r, q_{k}=q, s_{k}=s, momentum weights \beta_{k}=\beta\alpha_{k}^{2} with \beta=\mathcal{O}(\alpha^{-2}), and step sizes \alpha_{k}=\alpha(k+1)^{-\frac{1}{3}}, where \alpha=\mathcal{O}(L_{\Psi}^{-1})=\mathcal{O}(c^{-1}) is chosen so that \alpha_{k}\leq\frac{1}{8L_{\Psi}}, as required in Theorem C.3. Then the sequence \{\mathbf{u}^{k}\} generated by Algorithm 3 satisfies

1k=0K1αkk=0K1𝔼[1αk𝐮k+1𝐮k2]\displaystyle\frac{1}{\sum_{k=0}^{K-1}\alpha_{k}}\sum_{k=0}^{K-1}\mathbb{E}\left[\frac{1}{\alpha_{k}}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}\right] 𝒪~(cK23+c12q+c12s+1K23r+K23r(1q+1s)).\displaystyle\leq\widetilde{\mathcal{O}}\left(\frac{c}{K^{\frac{2}{3}}}+\frac{c_{1}^{2}}{q}+\frac{c_{1}^{2}}{s}+\frac{1}{K^{\frac{2}{3}}r}+\frac{K^{\frac{2}{3}}}{r}(\frac{1}{q}+\frac{1}{s})\right). (C.40)

Proof From Corollary C.1 and Lemmas C.4 and C.5 we know L_{\Psi}=\mathcal{O}(c), L_{\Psi}^{\prime}=\mathcal{O}(c+\frac{c_{1}}{s}), \mathbb{E}[(\epsilon_{\mathcal{G}}^{k})^{2}]\leq\widetilde{\mathcal{O}}(\frac{1}{q_{k}}+\frac{1}{s_{k}}), \mathbb{E}[(\sigma_{\mathcal{G}}^{k})^{2}]\leq\widetilde{\mathcal{O}}(\frac{1}{q_{k}}+\frac{1}{s_{k}}), and \mathbb{E}[\|b^{k}\|^{2}]\leq\widetilde{\mathcal{O}}(\frac{1}{r_{k}}+\frac{c_{1}^{2}}{q_{k}}+\frac{c_{1}^{2}}{s_{k}}). Moreover, it follows from (C.38) that \theta=\mathcal{O}(L_{\Psi}^{-2})=\mathcal{O}(c^{-2}). Substituting these estimates into (C.35), we have

1k=0K1αkk=0K1𝔼[1αk𝐮k+1𝐮k2]\displaystyle\quad\frac{1}{\sum_{k=0}^{K-1}\alpha_{k}}\sum_{k=0}^{K-1}\mathbb{E}\left[\frac{1}{\alpha_{k}}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}\right] (C.41)
𝒪~(cK23(1+αc12K23𝒪~(ϵ𝒢2)+θβ2α3𝒪~(r1+c12(σ𝒢k)2)+θc12αrK43𝒪~(ϵ𝒢2)))\displaystyle\leq\widetilde{\mathcal{O}}\left(cK^{-\frac{2}{3}}\left(1+\alpha c_{1}^{2}K^{\frac{2}{3}}\widetilde{\mathcal{O}}(\epsilon_{\mathcal{G}}^{2})+\theta\beta^{2}\alpha^{3}\widetilde{\mathcal{O}}(r^{-1}+c_{1}^{2}(\sigma_{\mathcal{G}}^{k})^{2})+\frac{\theta c_{1}^{2}}{\alpha r}K^{\frac{4}{3}}\widetilde{\mathcal{O}}(\epsilon_{\mathcal{G}}^{2})\right)\right)
=𝒪~(cK23(1+c12cK23𝒪~(ϵ𝒢2)+c1𝒪~(r1+c12(σ𝒢k)2)+c1r1K43𝒪~(ϵ𝒢2)))\displaystyle=\widetilde{\mathcal{O}}\left(cK^{-\frac{2}{3}}\left(1+\frac{c_{1}^{2}}{c}K^{\frac{2}{3}}\widetilde{\mathcal{O}}(\epsilon_{\mathcal{G}}^{2})+c^{-1}\widetilde{\mathcal{O}}(r^{-1}+c_{1}^{2}(\sigma_{\mathcal{G}}^{k})^{2})+c^{-1}r^{-1}K^{\frac{4}{3}}\widetilde{\mathcal{O}}(\epsilon_{\mathcal{G}}^{2})\right)\right)
𝒪~(cK23(1+c12cK23(1q+1s)+c1(1r+c12q+c12s)+c1r1K43(1q+1s)))\displaystyle\leq\widetilde{\mathcal{O}}\left(cK^{-\frac{2}{3}}\left(1+\frac{c_{1}^{2}}{c}K^{\frac{2}{3}}(\frac{1}{q}+\frac{1}{s})+c^{-1}(\frac{1}{r}+\frac{c_{1}^{2}}{q}+\frac{c_{1}^{2}}{s})+c^{-1}r^{-1}K^{\frac{4}{3}}(\frac{1}{q}+\frac{1}{s})\right)\right)
=𝒪~(cK23+c12q+c12s+1K23(1r+c12q+c12s)+K23r(1q+1s))\displaystyle=\widetilde{\mathcal{O}}\left(\frac{c}{K^{\frac{2}{3}}}+\frac{c_{1}^{2}}{q}+\frac{c_{1}^{2}}{s}+\frac{1}{K^{\frac{2}{3}}}(\frac{1}{r}+\frac{c_{1}^{2}}{q}+\frac{c_{1}^{2}}{s})+\frac{K^{\frac{2}{3}}}{r}(\frac{1}{q}+\frac{1}{s})\right)
=𝒪~(cK23+c12q+c12s+1K23r+K23r(1q+1s)).\displaystyle=\widetilde{\mathcal{O}}\left(\frac{c}{K^{\frac{2}{3}}}+\frac{c_{1}^{2}}{q}+\frac{c_{1}^{2}}{s}+\frac{1}{K^{\frac{2}{3}}r}+\frac{K^{\frac{2}{3}}}{r}(\frac{1}{q}+\frac{1}{s})\right).

This completes the proof. ∎
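To make the parameter choices of Corollary C.3 concrete, the sketch below builds the schedules \alpha_{k}=\alpha(k+1)^{-1/3} and \beta_{k}=\beta\alpha_{k}^{2} with constant sample sizes; the numerical values of \alpha, \beta, r, q, s are illustrative placeholders only.

```python
import numpy as np

# Illustrative construction of the schedules in Corollary C.3.
# alpha0, beta0, r, q, s are placeholder values, not prescribed constants.
K = 1000
alpha0 = 0.05                                    # alpha = O(L_Psi^{-1}) = O(c^{-1})
beta0 = 1.0 / alpha0**2                          # beta = O(alpha^{-2})
alphas = alpha0 * np.arange(1, K + 1, dtype=float) ** (-1.0 / 3.0)  # alpha_k = alpha (k+1)^{-1/3}
betas = np.minimum(1.0, beta0 * alphas**2)       # beta_k = beta * alpha_k^2 (capped at 1)
r, q, s = 1, 64, 64                              # constant sample sizes r_k, q_k, s_k
print("alpha_0 =", alphas[0], " alpha_{K-1} =", alphas[-1])
print("beta_0  =", betas[0], " beta_{K-1}  =", betas[-1])
print("per-iteration samples: zeta =", r, ", xi =", q + s)
```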

Remark C.3.

Further take K\sim\mathcal{O}(c^{1.5}\epsilon^{-1.5}), r\sim\mathcal{O}(1), q\sim\mathcal{O}(c_{1}^{2}\epsilon^{-1}), and s\sim\mathcal{O}(c_{1}^{2}\epsilon^{-1}); then the right-hand side of (C.40) is \widetilde{\mathcal{O}}(\epsilon). The sample complexity on \xi is \sum_{k=0}^{K-1}{s_{k}}+\sum_{k=0}^{K-1}{q_{k}}=(s+q)K=\widetilde{\mathcal{O}}(c^{1.5}c_{1}^{2}\epsilon^{-2.5}), and the sample complexity on \zeta is \sum_{k=0}^{K-1}{r_{k}}=rK=\widetilde{\mathcal{O}}(c^{1.5}\epsilon^{-1.5}).
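The sample counts in Remark C.3 follow from elementary arithmetic; the snippet below simply evaluates K, q, s, r and the resulting totals for illustrative values of c, c_{1} and \epsilon, ignoring the constants and logarithmic factors hidden by the \widetilde{\mathcal{O}} notation.

```python
# Illustrative evaluation of the sample complexities in Remark C.3.
# c, c1 and eps are placeholders; hidden constants and log factors are ignored.
c, c1, eps = 10.0, 2.0, 1e-2
K = int(c**1.5 * eps**-1.5)        # K ~ O(c^{1.5} eps^{-1.5})
r = 1                              # r ~ O(1)
q = int(c1**2 * eps**-1)           # q ~ O(c1^2 eps^{-1})
s = int(c1**2 * eps**-1)           # s ~ O(c1^2 eps^{-1})
samples_xi = (q + s) * K           # ~ O(c^{1.5} c1^2 eps^{-2.5})
samples_zeta = r * K               # ~ O(c^{1.5} eps^{-1.5})
print(f"K = {K}, samples on xi = {samples_xi:.2e}, samples on zeta = {samples_zeta:.2e}")
```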

References

  • Alpaydin and Alimoglu (1996) Ethem Alpaydin and Fevzi Alimoglu. Pen-based recognition of handwritten digits, 1996.
  • Bennett et al. (2006) Kristin P Bennett, Jing Hu, Xiaoyun Ji, Gautam Kunapuli, and Jong-Shi Pang. Model selection via bilevel optimization. In The 2006 IEEE International Joint Conference on Neural Network Proceedings, pages 1922–1929. IEEE, 2006.
  • Bonnans and Shapiro (2013) J Frédéric Bonnans and Alexander Shapiro. Perturbation analysis of optimization problems. Springer Science & Business Media, 2013.
  • Chen et al. (2021) Tianyi Chen, Yuejiao Sun, and Wotao Yin. Tighter analysis of alternating stochastic gradient method for stochastic nested problems. arXiv preprint arXiv:2106.13781, 2021.
  • Cutkosky and Orabona (2019) Ashok Cutkosky and Francesco Orabona. Momentum-based variance reduction in non-convex sgd. Advances in neural information processing systems, 32, 2019.
  • Domahidi et al. (2013) Alexander Domahidi, Eric Chu, and Stephen Boyd. Ecos: An socp solver for embedded systems. In 2013 European control conference (ECC), pages 3071–3076. IEEE, 2013.
  • Franceschi et al. (2018) Luca Franceschi, Paolo Frasconi, Saverio Salzo, Riccardo Grazzi, and Massimiliano Pontil. Bilevel programming for hyperparameter optimization and meta-learning. In International conference on machine learning, pages 1568–1577. PMLR, 2018.
  • Gao et al. (2023) Lucy L Gao, Jane J Ye, Haian Yin, Shangzhi Zeng, and Jin Zhang. Moreau envelope based difference-of-weakly-convex reformulation and algorithm for bilevel programs. arXiv preprint arXiv:2306.16761, 2023.
  • Grimmer et al. (2023) Benjamin Grimmer, Haihao Lu, Pratik Worah, and Vahab Mirrokni. The landscape of the proximal point method for nonconvex–nonconcave minimax optimization. Mathematical Programming, 201(1):373–407, 2023.
  • Hong et al. (2023) Mingyi Hong, Hoi-To Wai, Zhaoran Wang, and Zhuoran Yang. A two-timescale stochastic algorithm framework for bilevel optimization: Complexity analysis and application to actor-critic. SIAM Journal on Optimization, 33(1):147–180, 2023.
  • Ji et al. (2020) Kaiyi Ji, Junjie Yang, and Yingbin Liang. Provably faster algorithms for bilevel optimization and applications to meta-learning. Neural Information Processing Systems, 2020.
  • Jiang et al. (2024a) Liuyuan Jiang, Quan Xiao, Victor M Tenorio, Fernando Real-Rojas, Antonio Marques, and Tianyi Chen. A primal-dual-assisted penalty approach to bilevel optimization with coupled constraints. arXiv preprint arXiv:2406.10148, 2024a.
  • Jiang et al. (2024b) Xiaotian Jiang, Jiaxiang Li, Mingyi Hong, and Shuzhong Zhang. A barrier function approach for bilevel optimization with coupled lower-level constraints: Formulation, approximation and algorithms. arXiv preprint arXiv:2410.10670, 2024b.
  • Kang et al. (2023) Hyuna Kang, Seunghoon Jung, Jaewon Jeoung, Juwon Hong, and Taehoon Hong. A bi-level reinforcement learning model for optimal scheduling and planning of battery energy storage considering uncertainty in the energy-sharing community. Sustainable Cities and Society, 94:104538, 2023.
  • Khanduri et al. (2023) Prashant Khanduri, Ioannis Tsaknakis, Yihua Zhang, Jia Liu, Sijia Liu, Jiawei Zhang, and Mingyi Hong. Linearly constrained bilevel optimization: A smoothed implicit gradient approach. In International Conference on Machine Learning, pages 16291–16325. PMLR, 2023.
  • MacKay et al. (2019) Matthew MacKay, Paul Vicol, Jon Lorraine, David Duvenaud, and Roger Grosse. Self-tuning networks: Bilevel optimization of hyperparameters using structured best-response functions. arXiv preprint arXiv:1903.03088, 2019.
  • Nocedal and Wright (1999) Jorge Nocedal and Stephen J Wright. Numerical optimization. Springer, 1999.
  • Qin et al. (2023) Xiaorong Qin, Xinhang Song, and Shuqiang Jiang. Bi-level meta-learning for few-shot domain generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15900–15910, 2023.
  • Rockafellar and Wets (2009) R Tyrrell Rockafellar and Roger J-B Wets. Variational analysis, volume 317. Springer Science & Business Media, 2009.
  • Shen and Chen (2023) Han Shen and Tianyi Chen. On penalty-based bilevel gradient descent method. In International Conference on Machine Learning, pages 30992–31015. PMLR, 2023.
  • Shen et al. (2024) Han Shen, Zhuoran Yang, and Tianyi Chen. Principled penalty-based methods for bilevel reinforcement learning and rlhf. arXiv preprint arXiv:2402.06886, 2024.
  • Sinha et al. (2020) Ankur Sinha, Tanmay Khandait, and Raja Mohanty. A gradient-based bilevel optimization approach for tuning hyperparameters in machine learning. arXiv preprint arXiv:2007.11022, 2020.
  • Stadie et al. (2020) Bradly Stadie, Lunjun Zhang, and Jimmy Ba. Learning intrinsic rewards as a bi-level optimization problem. In Conference on Uncertainty in Artificial Intelligence, pages 111–120. PMLR, 2020.
  • Tsaknakis et al. (2022) Ioannis Tsaknakis, Prashant Khanduri, and Mingyi Hong. An implicit gradient-type method for linearly constrained bilevel problems. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5438–5442. IEEE, 2022.
  • Tsaknakis et al. (2023) Ioannis Tsaknakis, Prashant Khanduri, and Mingyi Hong. An implicit gradient method for constrained bilevel problems using barrier approximation. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023.
  • Xu and Zhu (2023) Siyuan Xu and Minghui Zhu. Efficient gradient approximation method for constrained bilevel optimization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 12509–12517, 2023.
  • Yang et al. (2024) Yan Yang, Bin Gao, and Ya-xiang Yuan. Bilevel reinforcement learning via the development of hyper-gradient without lower-level convexity. arXiv preprint arXiv:2405.19697, 2024.
  • Yao et al. (2024a) Wei Yao, Haian Yin, Shangzhi Zeng, and Jin Zhang. Overcoming lower-level constraints in bilevel optimization: A novel approach with regularized gap functions. arXiv preprint arXiv:2406.01992, 2024a.
  • Yao et al. (2024b) Wei Yao, Chengming Yu, Shangzhi Zeng, and Jin Zhang. Constrained bi-level optimization: Proximal Lagrangian value function approach and Hessian-free algorithm. arXiv preprint arXiv:2401.16164, 2024b.
  • Zhu et al. (2020) Hancheng Zhu, Leida Li, Jinjian Wu, Sicheng Zhao, Guiguang Ding, and Guangming Shi. Personalized image aesthetics assessment via meta-learning with bilevel gradient optimization. IEEE Transactions on Cybernetics, 52(3):1798–1811, 2020.