An Augmented Lagrangian Value Function Method for Lower-level Constrained Stochastic Bilevel Optimization

Hantao Nie (School of Mathematical Science, Peking University, Beijing, China; nht@pku.edu.cn)    Jiaxiang Li (Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN, USA; li003755@umn.edu)    Zaiwen Wen (School of Mathematical Science, Peking University, Beijing, China; wenzw@pku.edu.cn)
Abstract

Recently, lower-level constrained bilevel optimization has attracted increasing attention. However, existing methods mostly focus on either deterministic cases or problems with linear constraints. The main challenge in stochastic cases with general constraints is the bias and variance of the hyper-gradient, arising from the inexact solution of the lower-level problem. In this paper, we propose a novel stochastic augmented Lagrangian value function method for solving stochastic bilevel optimization problems with nonlinear lower-level constraints. Our approach reformulates the original bilevel problem using an augmented Lagrangian-based value function and then applies a penalized stochastic gradient method that carefully manages the noise from stochastic oracles. We establish an equivalence between the stochastic single-level reformulation and the original constrained bilevel problem and provide a non-asymptotic rate of convergence for the proposed method. The rate is further enhanced by employing variance reduction techniques. Extensive experiments on synthetic problems and real-world applications demonstrate the effectiveness of our approach.

Keywords: Bilevel optimization, Lower-level constraint, Stochastic optimization, Augmented Lagrangian method

1 Introduction

We consider the stochastic lower-level constrained bilevel optimization (stochastic LC-BLO) problem. The lower-level problem is defined as

\min_{y\in Y}\quad G(x,y)=\mathbb{E}_{\xi\sim\mathcal{D}_{\xi}}[g(x,y;\xi)] \qquad (1.1)
\text{s.t.}\quad H_{i}(x,y)\leq 0,\quad i=1,\ldots,p,

where Y\subseteq\mathbb{R}^{n} is a convex compact set, \xi is a random variable in the space \Omega_{\xi}, and \mathcal{D}_{\xi} is the distribution of \xi. G(x,y) and H(x,y)=[H_{1}(x,y),\ldots,H_{p}(x,y)]^{T} are the lower-level objective function and constraint function, respectively. We also denote the feasible set of the lower-level problem by \mathcal{Y}(x):=\{y\in Y\mid H(x,y)\leq 0\}. The bilevel optimization (BLO) problem is

\min_{x\in X}\quad F(x,y^{*}(x))=\mathbb{E}_{\zeta\sim\mathcal{D}_{\zeta}}[f(x,y^{*}(x);\zeta)] \qquad (1.2)
\text{s.t.}\quad y^{*}(x)\in\arg\min_{y\in\mathcal{Y}(x)}G(x,y),

where X\subseteq\mathbb{R}^{m} is a convex compact set, \zeta\in\Omega_{\zeta} is a random variable, and \mathcal{D}_{\zeta} is the distribution of \zeta. We assume the gradient oracles \nabla g(x,y;\xi) and \nabla f(x,y;\zeta) have unavoidable noise. This framework includes deterministic LC-BLO as a special case.

As depicted in (1.2), this hierarchical structure captures a learning‑to‑learn philosophy that underpins numerous modern machine‑learning pipelines. Hence, BLO plays a critical role in various machine learning tasks, including hyperparameter optimization (2, 7, 16, 22), model-agnostic meta-learning (7, 11, 18, 30), and reinforcement learning (23, 21, 27, 14). Recently, the lower-level constrained BLO (LC-BLO) has attracted increasing attention due to its wide applications such as transportation (12), kernelized SVM (29), meta-learning (26), and data hyper-cleaning  (26, 28).

Several methods have been proposed to solve the deterministic LC-BLO problem. The methodologies for solving LC-BLO can be broadly categorized into implicit gradient-based (IG) approaches and lower-level value function-based (LLVF) techniques. Implicit gradient methods primarily focus on computing the hyper-gradient with some implicit gradient approximation of \frac{d}{dx}y^{*}(x). The key issue here is that y^{*}(x) may not be differentiable when the lower-level problem has constraints. Some research has discussed the conditions under which the hyper-gradient exists. For the linearly constrained case, IG-AL (24) discussed the smoothness of y^{*}(x) by introducing the Lagrangian multiplier of the lower-level problem. SIGD (15) develops a smoothing approximation implicit gradient method to handle non-differentiable points. Recent works (25, 13) also consider a barrier reformulation of LC-BLO that moves the lower-level constraints into the objective function. Despite these efforts, the main challenge in designing implicit gradient methods is that the second-order derivative of the lower-level objective is required to compute the hyper-gradient, which is computationally expensive.

To overcome the high computational cost of implicit gradient computation, value function-based methods have been proposed as a Hessian-free alternative. Value function-based methods introduce the value function of the lower-level problem and then replace the optimality condition y𝒴(x)y\in\mathcal{Y}(x) in (1.2) with an inequality condition on the value function. The original BLO (1.2) is then equivalently reformulated as a single-level problem. Different value-function formulations have been widely studied in the literature, including the value function (12), a Moreau envelope-based value function (8), the proximal Lagrangian function (29), and the regularized gap function (28).

To the best of our knowledge, no existing work analyzes the BLO with nonlinear lower-level constraints in the stochastic setting, which has broad applications in real-world machine learning tasks. Solving stochastic LC-BLO presents two fundamental challenges. The first is the nonsmoothness of the hyper-objective function, which arises from the coupling between the upper-level and lower-level problems. The second is the bias of the hyper-gradient due to the inexact solution of the lower-level problem. To address these challenges, we introduce an augmented Lagrangian function and its Moreau envelope to reformulate the bilevel problem as a single-level problem, while ensuring that the solution remains close to the optimal solution of the original problem. We then propose a novel stochastic value function-based method for stochastic LC-BLO, which carefully controls the bias of the gradient oracle to achieve convergence.

1.1 Main contribution

Our main contributions are as follows:

1. We introduce a novel reformulation of the stochastic LC-BLO by leveraging the stochastic augmented Lagrangian function and its Moreau envelope (see (2.2)). This reformulation transforms the bilevel problem into a single-level problem, effectively addressing the noise arising from the inexact solution of the lower-level problem (see (2.3), (2.7)). Notably, we also ensure that the solution of the reformulated problem remains close to the optimal solution of the original bilevel problem (see Theorems 3.1 and 3.2), providing a practical yet theoretically grounded approach to stochastic LC-BLO.

2. We propose a novel Hessian-free method based on the stochastic reformulation for solving the stochastic LC-BLO (see Algorithm 2). Our work provides the first convergence analysis of value function-based algorithms for nonlinear LC-BLO in the stochastic setting. The issue of biased gradients is mitigated by controlling the bias through the accuracy of lower-level solutions. We derive a non-asymptotic convergence rate, proving that our method achieves (\widetilde{O}(c\epsilon^{-2}),\widetilde{O}(cc_{1}^{2}\epsilon^{-2})) sample complexity on (\zeta,\xi), where c_{1},c_{2} denote the penalty parameters in the reformulation and c=\max(c_{1},c_{2}) (see Theorem 3.4, Remark 3.2). The sample complexity on \zeta is further improved to \widetilde{O}(c^{1.5}\epsilon^{-1.5}) by employing variance reduction techniques (see Theorem 3.5, Remark 3.4). In Table 1, we briefly summarize the existing approaches compared to the proposed methods.

Table 1: Comparison of methods for solving BLO with nonlinear convex lower-level constraints. “H.-free”, “Sto”, “Iteration”, “Sample” stand for Hessian-free, stochastic, iteration complexity, and sample complexity, respectively. Iteration complexity is the number of outer-loop iterations needed to achieve the target accuracy \epsilon, measured by the squared gradient norm. For stochastic algorithms, sample complexities denote the number of samples for the upper and lower stochastic variables (\zeta,\xi) needed to achieve the target accuracy \epsilon. The proposed methods are SALVF (Algorithm 2) and SALVF-VR (Algorithm 3). We compare with four existing value function-based deterministic methods: BLOCC (12), LV-HBA (29), BiC-GAFFA (28), and GAM (26).

Method    | H.-free | Sto | Iteration                                   | Sample (upper, lower)
BLOCC     | yes     | no  | \tilde{\mathcal{O}}(c\epsilon^{-1})         | --
LV-HBA    | yes     | no  | \mathcal{O}(\epsilon^{-2p}),\ p>0.5         | --
BiC-GAFFA | yes     | no  | \mathcal{O}(\epsilon^{-2p}),\ p>0.5         | --
GAM       | no      | no  | --                                          | --
SALVF     | yes     | yes | \tilde{\mathcal{O}}(c\epsilon^{-1})         | \tilde{\mathcal{O}}(c\epsilon^{-2}),\ \tilde{\mathcal{O}}(cc_{1}^{2}\epsilon^{-2})
SALVF-VR  | yes     | yes | \tilde{\mathcal{O}}(c^{1.5}\epsilon^{-1.5}) | \tilde{\mathcal{O}}(c^{1.5}\epsilon^{-1.5}),\ \tilde{\mathcal{O}}(c^{1.5}c_{1}^{2}\epsilon^{-2.5})

1.2 Notation

For a multivariate function f(x_{1},\ldots,x_{k}), its partial derivative with respect to the i-th variable is denoted by \nabla_{i}f(x_{1},\ldots,x_{k}). Given a random variable x(\xi)\in\mathbb{R}^{n} with \xi\sim\mathcal{D}_{\xi}, we represent its expectation and covariance matrix by \mathbb{E}_{\xi\sim\mathcal{D}_{\xi}}[x(\xi)] and \mathrm{Var}_{\xi\sim\mathcal{D}_{\xi}}[x(\xi)], respectively. The trace of its covariance matrix is denoted by \mathbb{V}_{\xi\sim\mathcal{D}_{\xi}}[x(\xi)]=\mathrm{Tr}(\mathrm{Var}_{\xi\sim\mathcal{D}_{\xi}}[x(\xi)])=\mathbb{E}_{\xi\sim\mathcal{D}_{\xi}}[\|x-\mathbb{E}_{\xi\sim\mathcal{D}_{\xi}}[x]\|^{2}]. If the distribution \mathcal{D}_{\xi} is clear from the context, we abbreviate these notations as \mathbb{E}_{\xi}[x(\xi)], \mathrm{Var}_{\xi}[x(\xi)] and \mathbb{V}_{\xi}[x(\xi)], respectively. Denote the normal cone of a convex set C at y as \mathcal{N}_{C}(y)=\{v\mid\langle v,w-y\rangle\leq 0,\forall w\in C\}. For a scalar a, we define [a]_{+}=\max\{0,a\}. For a vector v, we write [v]_{+}=([v_{1}]_{+},\ldots,[v_{n}]_{+})^{T} and [v]_{+}^{2}=([v_{1}]_{+}^{2},\ldots,[v_{n}]_{+}^{2})^{T}. For s independent identically distributed random variables \mathbf{\xi}=(\xi_{1},\ldots,\xi_{s}), we denote the joint distribution by \mathcal{D}_{\xi}^{s}. Further, given some function g(x,\xi), we define the empirical average as g(x,\mathbf{\xi})=\frac{1}{s}\sum_{i=1}^{s}g(x,\xi_{i}).

2 Stochastic augmented value function-based method

In this section, we propose a novel value function-based reformulation for (1.2) and a stochastic value function method.

2.1 Stochastic augmented Lagrangian reformulation

We propose a novel reformulation using a stochastic augmented Lagrangian value function that transforms (1.2) into a single-level problem. First, the constraints in (1.1) are addressed by the augmented Lagrangian function and its corresponding stochastic version. The bilevel optimization problem (1.2) is then transformed into a single-level problem via this augmented Lagrangian function-based formulation.

For the lower-level problem (1.1), an augmented Lagrangian penalty term is introduced by penalizing the constraints using

\mathcal{A}_{\gamma_{1}}(x,y,z)=\frac{1}{2\gamma_{1}}\sum_{i=1}^{p}[\gamma_{1}z_{i}+H_{i}(x,y)]_{+}^{2},

where z_{1},\ldots,z_{p}\geq 0 and \gamma_{1}>0 is a penalty parameter. The augmented Lagrangian function and its stochastic oracle are defined by adding the penalty term to the objective function and its stochastic oracle, respectively, that is,

\mathcal{L}_{\gamma_{1}}(x,y,z)=G(x,y)+\mathcal{A}_{\gamma_{1}}(x,y,z), \qquad (2.1)
\mathcal{L}_{\gamma_{1}}(x,y,z;\xi)=g(x,y;\xi)+\mathcal{A}_{\gamma_{1}}(x,y,z).
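For concreteness, here is a minimal NumPy sketch of evaluating (2.1); `H_vals` stands for the constraint vector H(x,y) and `g_val` for a single stochastic objective value g(x,y;\xi), both placeholders of our own rather than code from the paper.

```python
import numpy as np

def al_penalty(H_vals, z, gamma1):
    """Penalty term A_{gamma1}(x, y, z) = (1 / (2 gamma1)) * sum_i [gamma1 z_i + H_i(x, y)]_+^2."""
    return np.sum(np.maximum(gamma1 * np.asarray(z) + np.asarray(H_vals), 0.0) ** 2) / (2.0 * gamma1)

def stoch_aug_lagrangian(g_val, H_vals, z, gamma1):
    """Stochastic augmented Lagrangian L_{gamma1}(x, y, z; xi) = g(x, y; xi) + A_{gamma1}(x, y, z)."""
    return g_val + al_penalty(H_vals, z, gamma1)
```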

The augmented dual function and its Moreau envelope are then defined as

D_{\gamma_{1}}(x,z)=\min_{y\in Y}\mathcal{L}_{\gamma_{1}}(x,y,z), \qquad (2.2)
E_{\gamma_{1}}^{\gamma_{2}}(x,z)=\max_{\lambda\in\mathbb{R}_{+}^{p}}\Big\{D_{\gamma_{1}}(x,\lambda)-\frac{\gamma_{2}}{2}\|\lambda-z\|^{2}\Big\},

where \gamma_{2}\geq 0 is a regularization parameter. Then (1.2) is reformulated as an equivalent single-level problem

\min_{(x,y,z)\in X\times Y\times\mathbb{R}_{+}^{p}}\quad F(x,y) \qquad (2.3)
\text{s.t.}\quad\mathcal{G}(x,y,z):=G(x,y)-E_{\gamma_{1}}^{\gamma_{2}}(x,z)\leq 0,
\qquad\quad H(x,y)\leq 0.

In this paper, \gamma_{1},\gamma_{2} are fixed parameters. We omit the subscript \gamma_{1} and the superscript \gamma_{2} in D_{\gamma_{1}} and E_{\gamma_{1}}^{\gamma_{2}} to simplify notation. The envelope-based value function reformulation (2.3) contains the value function-based reformulation

\min_{(x,y)\in X\times Y}\quad F(x,y) \qquad (2.4)
\text{s.t.}\quad G(x,y)-\min_{y\in\mathcal{Y}(x)}G(x,y)\leq 0,
\qquad\quad H(x,y)\leq 0,

as a special case when \gamma_{2}=0. The relationship between them is discussed in Appendix A.5. By introducing the auxiliary variable z, the advantage of the augmented Lagrangian function-based reformulation is that the subproblem (2.5a) becomes strongly-convex–strongly-concave, which ensures faster convergence of the inner loop.

To evaluate E(x,z) in (2.2), we need to estimate

(w^{*},\lambda^{*})=\arg\max_{\lambda\in\mathbb{R}_{+}^{p}}\min_{w\in Y}\left\{\ell_{\gamma}(x,z,w,\lambda)\right\}, \qquad (2.5a)
\text{where}\quad\ell_{\gamma}(x,z,w,\lambda):=\mathcal{L}_{\gamma_{1}}(x,w,\lambda)-\frac{\gamma_{2}}{2}\|\lambda-z\|^{2}. \qquad (2.5b)

However, in the stochastic setting, the exact solution is inaccessible due to unavoidable noise in the gradient oracles. To address this, we consider approximating the solution of (2.5a) using stochastic algorithms. Specifically, let \mathbf{\xi}=(\xi_{1},\ldots,\xi_{s})\in\Omega_{\xi}^{s} be samples from \mathcal{D}_{\xi}^{s}. Denote \mathcal{P}_{w},\mathcal{P}_{\lambda} as the spaces of random variables mapping \mathbf{\xi} to Y and \mathbb{R}_{+}^{p}, respectively, that is,

\mathcal{P}_{w}=\{\hat{w}:\Omega_{\xi}^{s}\to Y\mid\hat{w}\text{ is measurable}\},
\mathcal{P}_{\lambda}=\{\hat{\lambda}:\Omega_{\xi}^{s}\to\mathbb{R}_{+}^{p}\mid\hat{\lambda}\text{ is measurable}\}.

Assume (\hat{w},\hat{\lambda})\in\mathcal{P}_{w}\times\mathcal{P}_{\lambda} is a stochastic algorithm solving the subproblem (2.5a) using samples \mathbf{\xi}, so that (\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi})) is a pair of approximate solutions of (2.5a). We are interested in the subset of “good enough” algorithms that provide a sufficiently accurate solution of the subproblem (2.5a):

\mathcal{P}(\delta)=\left\{(\hat{w},\hat{\lambda})\in\mathcal{P}_{w}\times\mathcal{P}_{\lambda}\ \middle|\ \left|\mathbb{E}_{\mathbf{\xi}}\left[\ell_{\gamma}(x,z,\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi}))\right]-E(x,z)\right|\leq\delta\right\}. \qquad (2.6)

With this estimator (\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi})), we approximate the envelope function by \ell_{\gamma}(x,z,\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi})). This gives the following stochastic value function-based reformulation:

\min_{(x,y,z)\in X\times Y\times\mathbb{R}_{+}^{p}}\quad F(x,y) \qquad (2.7)
\text{s.t.}\quad\hat{\mathcal{G}}(x,y,z;\mathbf{\xi})\leq\epsilon_{1},\quad\frac{1}{2}\sum_{i=1}^{p}[H_{i}(x,y)]_{+}^{2}\leq\epsilon_{2}^{2},

where \hat{\mathcal{G}}(x,y,z;\mathbf{\xi})=G(x,y)-\ell_{\gamma}(x,z,\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi})), and \epsilon_{1},\epsilon_{2} are the target accuracies of the lower-level objective function and constraint violation, respectively. The equivalence between (1.2) and (2.7) is established in Theorem 3.2. Compared to (2.3), (2.7) incorporates the inexactness of the lower-level solution into the formulation, making it more practical in the stochastic setting.

Algorithm 1 (w^{k},\lambda^{k})=\text{SALM}(x^{k-1},z^{k-1},s,\gamma_{1},\gamma_{2},\eta,\rho;\mathbf{\xi}^{k})
1: Input: x^{k-1},z^{k-1}, iteration count s, primal step sizes \eta_{j}, dual step sizes \rho_{j} for 0\leq j\leq s-1.
2: Initialize w^{k,0} and \lambda^{k,0}=0.
3: for j=0 to s-1 do
4:   Update (w^{k,j+1},\lambda^{k,j+1}) by (2.8a) and (2.8b).
5: end for
6: Output (w^{k},\lambda^{k})=(w^{k,s},\lambda^{k,s}).
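For reference, a minimal NumPy sketch of Algorithm 1 is given below; it implements the projected stochastic gradient descent–ascent updates (2.8a)–(2.8b) referenced in line 4. The oracles `grad_w_ell` and `grad_lam_ell` (stochastic gradients of \ell_{\gamma} in w and \lambda), the projection `proj_Y`, and the sampler `sample_xi` are placeholder callables of our own, not part of the paper's implementation.

```python
import numpy as np

def salm(x, z, s, eta, rho, grad_w_ell, grad_lam_ell, proj_Y, sample_xi, w0, p):
    """Sketch of Algorithm 1 (SALM): s projected stochastic gradient
    descent-ascent steps on ell_gamma(x, z, w, lambda) with (x, z) fixed."""
    w, lam = np.array(w0, dtype=float), np.zeros(p)
    for j in range(s):
        xi = sample_xi()                                  # draw xi_j^k ~ D_xi
        eta_j, rho_j = eta / (j + 1), rho / (j + 1)       # diminishing steps, cf. (3.5)
        w_next = proj_Y(w - eta_j * grad_w_ell(x, z, w, lam, xi))              # (2.8a)
        lam = np.maximum(lam + rho_j * grad_lam_ell(x, z, w, lam, xi), 0.0)    # (2.8b), Proj onto R_+^p
        w = w_next
    return w, lam
```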

2.2 Value function-based penalized method

In this subsection, we develop a stochastic value function-based penalized method for the reformulation (2.7). At the k-th iteration, a stochastic gradient descent–ascent method is applied to compute an approximate solution of the subproblem (2.5a) with fixed (x^{k-1},z^{k-1}). More specifically, the primal and dual variables are updated by

w^{k,j+1}=\mathrm{Proj}_{Y}\left(w^{k,j}-\eta_{j}\nabla_{w}\ell_{\gamma}(x^{k-1},z^{k-1},w^{k,j},\lambda^{k,j};\xi^{k}_{j})\right), \qquad (2.8a)
\lambda^{k,j+1}=\mathrm{Proj}_{\mathbb{R}_{+}^{p}}\left(\lambda^{k,j}+\rho_{j}\nabla_{\lambda}\ell_{\gamma}(x^{k-1},z^{k-1},w^{k,j},\lambda^{k,j};\xi^{k}_{j})\right), \qquad (2.8b)

where \eta_{j},\rho_{j} denote the primal and dual step sizes, respectively. The complete procedure is shown in Algorithm 1. We then consider an augmented Lagrangian-based penalty reformulation:

\min_{(x,y,z)\in X\times Y\times Z}\quad\mathbb{E}_{\mathbf{\xi}}[\Psi(x,y,z;\mathbf{\xi})], \qquad (2.9)
\text{where}\quad\Psi(x,y,z;\mathbf{\xi}):=F(x,y)+c_{1}\hat{\mathcal{G}}(x,y,z;\mathbf{\xi})+\frac{c_{2}}{2}\sum_{i=1}^{p}[H_{i}(x,y)]_{+}^{2},

where Z=[0,p^{-0.5}B]^{p} is the domain of z, B is a constant, and c_{1},c_{2}>0 are the penalty parameters (it can be shown that the optimal z for (2.9) is contained in the domain Z under appropriate regularity assumptions; see Assumption 3.6 and Lemma A.7). Different from the standard duality-based approach, the multiplier-type variable z is treated as an optimization variable in this reformulation. The advantage of this penalized reformulation is that convexity of the objective function in (2.7) is not required. The equivalence between (2.7) and (2.9) is established in Theorem 3.1. We consider a stochastic gradient descent method for solving (2.9). Denote the variables in (2.9) and their feasible region as

\mathbf{u}=(x,y,z),\quad\mathcal{U}=X\times Y\times Z,

for simplicity. The gradient oracle of the objective function of (2.9) is given by

\nabla\Psi(\mathbf{u};\mathbf{\zeta},\mathbf{\xi},\mathbf{\tilde{\xi}})=\nabla f(x,y;\mathbf{\zeta})+c_{1}\nabla\hat{\mathcal{G}}(\mathbf{u};\mathbf{\xi},\mathbf{\tilde{\xi}})+c_{2}\sum_{i=1}^{p}[H_{i}(x,y)]_{+}\nabla H_{i}(x,y), \qquad (2.10)

with the mini-batched stochastic oracles given by

\nabla f(x,y;\mathbf{\zeta})=\frac{1}{r}\sum_{j=1}^{r}\nabla f(x,y;\zeta_{j}), \qquad (2.11)
\nabla\hat{\mathcal{G}}(\mathbf{u};\mathbf{\xi},\mathbf{\tilde{\xi}})=\frac{1}{q}\sum_{j=1}^{q}\nabla g(x,y;\tilde{\xi}_{j})-\frac{1}{q}\sum_{j=1}^{q}\nabla\tilde{\mathcal{L}}(x,z,\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi});\tilde{\xi}_{j}).

By substituting r=r_{k},q=q_{k} in (2.10) and (2.11), and conditioning on \mathbf{\xi}^{k}, we write

\hat{\mathcal{G}}^{k}(\mathbf{u})=\hat{\mathcal{G}}(x,y,z;\mathbf{\xi}^{k}),\quad\Psi^{k}(\mathbf{u})=\Psi(x,y,z;\mathbf{\xi}^{k}), \qquad (2.12)
\nabla\Psi^{k}(\mathbf{u};\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k})=\nabla\Psi(x,y,z;\mathbf{\zeta}^{k},\mathbf{\xi}^{k},\mathbf{\tilde{\xi}}^{k}).

The complete procedure is summarized as follows. At the k-th iteration, an estimator (w^{k},\lambda^{k})=(\hat{w}(x^{k-1},z^{k-1};\mathbf{\xi}^{k}),\hat{\lambda}(x^{k-1},z^{k-1};\mathbf{\xi}^{k})) is computed using Algorithm 1 with samples \mathbf{\xi}^{k}=(\xi_{1}^{k},\ldots,\xi_{s_{k}}^{k})\sim\mathcal{D}_{\xi}^{s_{k}}. Then a stochastic gradient descent step is applied to (2.9) with samples \mathbf{\zeta}^{k}=(\zeta_{1}^{k},\ldots,\zeta_{r_{k}}^{k})\sim\mathcal{D}_{\zeta}^{r_{k}} and \mathbf{\tilde{\xi}}^{k}=(\tilde{\xi}_{1}^{k},\ldots,\tilde{\xi}_{q_{k}}^{k})\sim\mathcal{D}_{\xi}^{q_{k}}. After K iterations, we output (x^{R},y^{R}), where the index R is randomly chosen according to the probability mass function

\mathrm{Prob}(R=k)=\frac{\alpha_{k}}{\sum_{l=0}^{K-1}\alpha_{l}},\quad k=0,\ldots,K-1. \qquad (2.13)

Besides, an extra SALM loop

(y^{\prime},z^{\prime})=\textbf{SALM}(x^{R},0,s^{K},\gamma_{1},0,\eta,\rho;\mathbf{\xi}^{K}), \qquad (2.14)

can be applied to guarantee the feasibility of the final output. The complete procedure is shown in Algorithm 2.

Algorithm 2 SALVF
1: Input: penalty parameters c_{1},c_{2}, iteration number K, sample sizes s_{k},r_{k},q_{k}, and step sizes \eta_{k},\rho_{k},\alpha_{k}.
2: Initialize x^{0},y^{0},z^{0}.
3: for k=0 to K-1 do
4:   Run Algorithm 1 with samples \mathbf{\xi}^{k} to compute
       (\hat{w}^{k},\hat{\lambda}^{k})=\text{SALM}(x^{k-1},z^{k-1},s_{k},\gamma_{1},\gamma_{2},\eta_{k},\rho_{k};\mathbf{\xi}^{k}).
5:   Sample \mathbf{\zeta}^{k}\sim\mathcal{D}_{\zeta}^{r_{k}} and \mathbf{\tilde{\xi}}^{k}\sim\mathcal{D}_{\xi}^{q_{k}}.
6:   Compute the direction d^{k}=\nabla\Psi^{k}(\mathbf{u}^{k};\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k}) by (2.10).
7:   Update \mathbf{u}^{k+1}=\mathrm{Proj}_{\mathcal{U}}(\mathbf{u}^{k}-\alpha_{k}d^{k}).
8: end for
9: Choose index R with probability mass function (2.13). Output (x^{R},y^{R}).
10: (Optional) Compute (2.14) and output (x^{R},y^{\prime}).
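To make the outer loop concrete, here is a minimal NumPy sketch of Algorithm 2 under simplifying assumptions of our own: \mathbf{u}=(x,y,z) is stored as one flat vector, `salm_step` wraps Algorithm 1, `grad_psi_hat` evaluates the penalized stochastic gradient (2.10) (with c_{1},c_{2} folded inside), and `proj_U` projects onto \mathcal{U}; all of these callables are placeholders rather than the authors' code.

```python
import numpy as np

def salvf(u0, K, alpha, s, salm_step, grad_psi_hat, proj_U, seed=0):
    """Sketch of Algorithm 2 (SALVF) with constant step size alpha_k = alpha."""
    rng = np.random.default_rng(seed)
    u = np.array(u0, dtype=float)
    iterates = []
    for k in range(K):
        w_hat, lam_hat = salm_step(u, s)        # inner loop: Algorithm 1 with s_k = s steps
        d = grad_psi_hat(u, w_hat, lam_hat)     # stochastic gradient of Psi^k, eq. (2.10)
        u = proj_U(u - alpha * d)               # projected stochastic gradient step
        iterates.append(u.copy())
    R = rng.integers(K)                         # (2.13) reduces to a uniform draw for constant alpha_k
    return iterates[R]
```

As in line 10 of the algorithm, an optional final SALM call as in (2.14) can be appended to return a feasible lower-level solution at x^{R}.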

2.3 Variance reduced SALVF method

When sampling \zeta is significantly more expensive than sampling \xi, a natural question arises: is it possible to reduce the sample size of \zeta, thereby allowing an increase in the sample size of \xi? In this subsection, we apply a variance reduction technique with an update rule similar to STORM (5) to reduce the sample complexity on \zeta. At each iteration, the direction d^{k} is updated by

d^{k}=\nabla\Psi^{k}(\mathbf{u}^{k};\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k})+(1-\beta_{k})\big(d^{k-1}-\nabla\Psi^{k}(\mathbf{u}^{k-1};\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k})\big). \qquad (2.15)

Unlike the STORM method, our approach deals with a different scenario, where the main challenge arises from the biased gradient oracle \nabla\Psi^{k} due to the inexactness of the estimator (\hat{w}^{k},\hat{\lambda}^{k}). This challenge is addressed by carefully handling the extra bias term and designing proper coefficients \beta_{k}. The complete procedure is summarized in Algorithm 3 and the convergence guarantee is provided in Theorem 3.5.
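A minimal sketch of the recursive direction (2.15) follows (our own illustration; `grad_new` and `grad_old` stand for \nabla\Psi^{k} evaluated at \mathbf{u}^{k} and \mathbf{u}^{k-1} on the same fresh samples (\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k})).

```python
def vr_direction(grad_new, grad_old, d_prev, beta):
    """STORM-style recursive momentum, eq. (2.15):
    d^k = grad_new + (1 - beta_k) * (d^{k-1} - grad_old).
    Evaluating grad_new and grad_old on the same mini-batch is what
    cancels most of the sampling noise in the accumulated direction."""
    return grad_new + (1.0 - beta) * (d_prev - grad_old)
```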

Algorithm 3 SALVF-VR
1: Input: penalty parameters c_{1},c_{2}, iteration number K, sample sizes s_{k},r_{k},q_{k}, step sizes \eta_{k},\rho_{k},\alpha_{k}, and coefficients \beta_{k}.
2: Initialize x^{0},y^{0},z^{0}.
3: for k=0 to K-1 do
4:   Run Algorithm 1 with samples \mathbf{\xi}^{k} to compute
       (\hat{w}^{k},\hat{\lambda}^{k})=\text{SALM}(x^{k-1},z^{k-1},s_{k},\gamma_{1},\gamma_{2},\eta_{k},\rho_{k};\mathbf{\xi}^{k}).
5:   Sample \mathbf{\zeta}^{k} from \mathcal{D}_{\zeta}^{r_{k}} and \mathbf{\tilde{\xi}}^{k} from \mathcal{D}_{\xi}^{q_{k}}.
6:   if k=0 then
7:     Compute d^{0}=\nabla\Psi(\mathbf{u}^{0};\mathbf{\zeta}^{0},\mathbf{\tilde{\xi}}^{0}).
8:   else
9:     Update the direction d^{k} by (2.15).
10:  end if
11:  Update \mathbf{u}^{k+1}=\mathrm{Proj}_{\mathcal{U}}(\mathbf{u}^{k}-\alpha_{k}d^{k}).
12: end for
13: Choose index R with probability mass function (2.13). Output (x^{R},y^{R}).
14: (Optional) Compute (2.14) and output (x^{R},y^{\prime}).

3 Theoretical analysis

3.1 Basic assumptions

In this section, we examine the properties of the penalized reformulation (2.9) and provide non-asymptotic convergence guarantees for the proposed algorithms (Algorithms 2 and 3). First, some basic assumptions from the stochastic bilevel optimization literature (10, 4) are introduced, covering the smoothness, convexity, boundedness, and stochastic oracles associated with the objective and constraints.

Assumption 3.1.

(Lipschitz continuity) Assume that \nabla F(x,y), \nabla G(x,y), \nabla H(x,y) are L_{F},L_{G},L_{H}-Lipschitz continuous, respectively.

Assumption 3.2.

(Convexity) Assume G(x,y) is \mu_{G}-strongly convex in y for any x\in X, and H(x,y) is convex in y for any x\in X.

Remark 3.1.

Assumption 3.2 implies that y^{*}(x) defined in (1.2) is unique for any x\in X.

Assumption 3.3.

(Boundedness) Assume \nabla G(x,y), H(x,y) and \nabla H(x,y) are bounded, that is,

\|\nabla G(x,y)\|\leq M_{G,1},\qquad|H_{i}(x,y)|\leq M_{H,0},\qquad\|\nabla H_{i}(x,y)\|\leq M_{H,1},\quad 1\leq i\leq p.
Assumption 3.4.

(Stochastic derivative) The stochastic oracles \nabla f(x,y;\zeta), \nabla g(x,y;\xi) are unbiased estimators of \nabla F(x,y), \nabla G(x,y), respectively, and their variances are bounded by \sigma_{f}^{2},\sigma_{g}^{2}, respectively.

To ensure strong duality and regularity of the optimal points, we assume that Slater’s condition and the linear independence constraint qualification (LICQ) hold for the lower-level constraints, which are common assumptions in nonlinear optimization analysis.

Assumption 3.5.

(LL Slater’s condition) For any fixed x\in X, Slater’s condition holds for (1.1), that is, there exist \epsilon_{0}(x)>0 and y_{0}(x) such that

H_{i}(x,y_{0}(x))<-\epsilon_{0}(x),\quad i=1,\ldots,p.
Assumption 3.6.

(LICQ) For any x\in X and y=y^{*}(x), the set \{\nabla_{y}H_{i}(x,y)\mid H_{i}(x,y)=0\} is linearly independent. Denote the matrix \mathcal{C}(x,y)=[\nabla_{y}H_{i}(x,y)]_{i\in\{i\mid H_{i}(x,y)=0\}}. Since X,Y are compact sets, we further assume the smallest singular value satisfies

\sigma_{\min}(\mathcal{C}(x,y)\mathcal{C}(x,y)^{\top})\geq\sigma_{0}^{2}>0,\quad\forall(x,y)\in X\times Y.

3.2 Equivalence of reformulations

In this subsection, the equivalence between the reformulation (2.9) and the original BLO formulation (1.2) is established. We take B=p^{2}\sigma_{0}^{2}M_{H,1}(M_{G,1}+pM_{H,1}) as provided in (2.9) throughout the analysis. The deterministic case is first analyzed as a special case. The equivalence is then extended to the stochastic case, emphasizing the key improvements introduced by the stochastic reformulation.

3.2.1 Deterministic case

The following theorem establishes the equivalence between the penalized form of (2.3) and the original BLO.

Theorem 3.1.

Suppose that Assumptions 3.1, 3.2 and 3.6 hold and \gamma_{1},\gamma_{2}>0 are fixed parameters.

1. Assume (x^{*},y^{*}) is a global solution to (1.2) and c_{1}\geq\frac{L}{2\mu_{G}}\epsilon^{-1},\ c_{2}\geq(c_{1})^{2}B^{2}\epsilon^{-1}. Then there exists z^{*}\in\mathbb{R}_{+}^{p} such that (x^{*},y^{*},z^{*}) is an \epsilon-global minimum of the following penalized form

\min_{(x,y,z)\in X\times Y\times Z}\quad\Psi(x,y,z)=F(x,y)+c_{1}\mathcal{G}(x,y,z)+\frac{c_{2}}{2}\sum_{i=1}^{p}[H_{i}(x,y)]_{+}^{2}. \qquad (3.1)

2. By taking c_{1}=c_{1}^{*}+2:=\frac{L}{2\mu_{G}}\epsilon^{-1}+2,\ c_{2}=c_{2}^{*}+2:=(c_{1}^{*})^{2}B^{2}\epsilon^{-1}+2, any \epsilon-global minimum of (3.1) is an \epsilon-global minimum of the following approximation of BLO

\min_{(x,y,z)\in X\times Y\times Z}\quad F(x,y)\quad\text{s.t.}\quad G(x,y)-E(x,z)\leq\epsilon_{1},\quad\frac{1}{2}\sum_{i=1}^{p}[H_{i}(x,y)]_{+}^{2}\leq\epsilon_{2}^{2}, \qquad (3.2)

with some \epsilon_{1},\epsilon_{2}\leq\epsilon.

This theorem indicates that the penalized reformulation can approximate the original bilevel optimization problem within a controlled error bound.

3.2.2 Stochastic case

The major difference between the deterministic and stochastic reformulations is the inexact solution of the subproblem (2.5a). By controlling the inexactness, we design an approximate stochastic reformulation for the stochastic bilevel optimization problem. Denote the penalized form as

\Psi(x,y,z,\hat{w},\hat{\lambda};\mathbf{\xi})=F(x,y)+c_{1}\big(G(x,y)-\ell_{\gamma}(x,z,\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi}))\big)+\frac{c_{2}}{2}\sum_{i=1}^{p}[H_{i}(x,y)]_{+}^{2}.

The following theorem shows the equivalence between this penalized form and (1.2) (see Theorem A.4 for proofs).

Theorem 3.2.

Suppose that Assumptions 3.1, 3.2 and 3.5 hold and \gamma_{1},\gamma_{2}>0 are fixed parameters.

1. Assume (x^{*},y^{*}) is a global solution to (1.2). If \mathcal{P}(\delta) defined in (2.6) is nonempty for any (x,z)\in X\times\mathbb{R}_{+}^{p}, then for any (\hat{w},\hat{\lambda})\in\mathcal{P}(\delta) there exists z^{*} such that (x^{*},y^{*},z^{*}) is an \epsilon-global minimum of the following penalized form

\min_{(x,y,z)\in X\times Y\times Z}\quad\mathbb{E}[\Psi(x,y,z,\hat{w},\hat{\lambda};\mathbf{\xi})] \qquad (3.3)

with any c_{1}\geq\frac{2L}{3\mu_{G}}\epsilon^{-1}, c_{2}\geq\frac{3}{2}(c_{1})^{2}B^{2}\epsilon^{-1} and \delta\leq\frac{\epsilon}{6c_{1}}.

2. By taking c_{1}=c_{1}^{*}+2:=\frac{2L}{3\mu_{G}}\epsilon^{-1}+2,\ c_{2}=c_{2}^{*}+2:=\frac{3}{2}(c_{1}^{*})^{2}B^{2}\epsilon^{-1}+2 and \delta\leq\frac{\epsilon}{6c_{1}}, for any (\hat{w},\hat{\lambda})\in\mathcal{P}(\delta), any \epsilon-global minimum of (3.3) is an \epsilon-global minimum of the following approximation of BLO:

\min_{(x,y,z)\in X\times Y\times Z}\quad F(x,y) \qquad (3.4)
\text{s.t.}\quad G(x,y)-\mathbb{E}[\ell_{\gamma}(x,z,\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi}))]\leq\epsilon_{1},
\qquad\quad\frac{1}{2}\sum_{i=1}^{p}[H_{i}(x,y)]_{+}^{2}\leq\epsilon_{2}^{2},

with some \epsilon_{1},\epsilon_{2}\leq\frac{13}{12}\epsilon.

3.3 Convergence analysis

Denote \mathcal{F}_{k} and \tilde{\mathcal{F}}_{k} as the \sigma-algebras generated by \{\mathbf{\xi}^{l}\}_{l=0}^{k}\cup\{\mathbf{\zeta}^{l},\mathbf{\tilde{\xi}}^{l}\}_{l=0}^{k-1} and \{\mathbf{\xi}^{l}\}_{l=0}^{k}\cup\{\mathbf{\zeta}^{l},\mathbf{\tilde{\xi}}^{l}\}_{l=0}^{k}, respectively. Then

\emptyset=\tilde{\mathcal{F}}_{0}\subset\mathcal{F}_{1}\subset\tilde{\mathcal{F}}_{1}\subset\cdots\subset\mathcal{F}_{k}\subset\tilde{\mathcal{F}}_{k}\subset\cdots

is the filtration generated by the random variables in Algorithms 2 and 3.

Theorem 3.3.

Suppose Assumptions 3.1-3.5 hold. By taking the step sizes in Algorithm 1 as

\eta_{j}=\frac{\eta}{j+1},\quad\rho_{j}=\frac{\rho}{j+1}, \qquad (3.5)

there exist constants \bar{\phi}_{1},\bar{\phi}_{2}>0 such that the output pair (w^{k,s},\lambda^{k,s}) satisfies

\mathbb{E}_{\mathbf{\xi}}\|w^{k,s}-w^{*}(x^{k-1},z^{k-1})\|^{2}\leq\bar{\phi}_{1}\frac{1+\log(s)}{s}, \qquad (3.6)
\mathbb{E}_{\mathbf{\xi}}\|\lambda^{k,s}-\lambda^{*}(x^{k-1},z^{k-1})\|^{2}\leq\bar{\phi}_{2}\frac{1+\log(s)}{s}.

Define the bias of the gradient oracle of \Psi as

b^{k}=\nabla\Psi^{k}(\mathbf{u}^{k};\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k})-\nabla\Psi(\mathbf{u}^{k}). \qquad (3.7)

We establish the following lemma to control the bias of the gradient oracle in terms of conditional expectation.

Lemma 3.1.

The bias b^{k} admits the bound

\mathbb{E}[\|b^{k}\|^{2}\mid\tilde{\mathcal{F}}_{k-1}]\leq 2\left(\frac{\sigma_{f}^{2}}{r_{k}}+c_{1}^{2}\frac{\sigma_{\mathcal{G}}^{2}}{q_{k}}+c_{1}^{2}(\epsilon_{\mathcal{G}}^{k})^{2}\right). \qquad (3.8)

Here \mathbb{E}[\cdot] abbreviates \mathbb{E}_{\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k},\mathbf{\xi}^{k}}[\cdot], and \epsilon^{k}_{\mathcal{G}},(\sigma_{\mathcal{G}}^{k})^{2} denote the upper bounds of the bias and variance of \nabla\hat{\mathcal{G}}(\mathbf{u};\mathbf{\xi}^{k},\mathbf{\tilde{\xi}}^{k}), respectively. \epsilon^{k}_{\mathcal{G}} and (\sigma_{\mathcal{G}}^{k})^{2} are constants conditioned on \tilde{\mathcal{F}}_{k-1} and can be further bounded by polynomials of \mathbb{E}_{\mathbf{\xi}^{k}}\|w^{k,s}-w^{*}(x^{k-1},z^{k-1})\|^{2} and \mathbb{E}_{\mathbf{\xi}^{k}}\|\lambda^{k,s}-\lambda^{*}(x^{k-1},z^{k-1})\|^{2}. Therefore we can control the bias b^{k} by enhancing the accuracy of Algorithm 1.
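To indicate where (3.8) comes from, we record the standard bias–variance split (a proof sketch under the conventions above; the full argument is deferred to the appendix). Conditioned on \mathbf{\xi}^{k}, the sampling noise in (\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k}) is zero-mean, so by Young's inequality \|a+b\|^{2}\leq 2\|a\|^{2}+2\|b\|^{2},

\mathbb{E}[\|b^{k}\|^{2}\mid\tilde{\mathcal{F}}_{k-1}]\leq 2\,\mathbb{E}\big[\|\nabla\Psi^{k}(\mathbf{u}^{k};\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k})-\nabla\Psi^{k}(\mathbf{u}^{k})\|^{2}\big]+2\,\mathbb{E}\big[\|\nabla\Psi^{k}(\mathbf{u}^{k})-\nabla\Psi(\mathbf{u}^{k})\|^{2}\big]\leq 2\Big(\frac{\sigma_{f}^{2}}{r_{k}}+c_{1}^{2}\frac{\sigma_{\mathcal{G}}^{2}}{q_{k}}\Big)+2c_{1}^{2}(\epsilon_{\mathcal{G}}^{k})^{2},

where the first term is the mini-batch variance and the second is the squared bias inherited from the inexact estimator (\hat{w}^{k},\hat{\lambda}^{k}).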

Denote c=\max(c_{1},c_{2}). With the convergence results of Algorithm 1 and the boundedness of the gradient oracle, we obtain the convergence of Algorithm 2.

Theorem 3.4.

Suppose Assumptions 3.1-3.6 hold. Take a constant step size \alpha_{k}=\alpha<\frac{1}{2L_{\Psi}} and constant sample sizes r_{k}=r, q_{k}=q, and s_{k}=s. Then the sequence \{\mathbf{u}^{k}\}_{k=0}^{K} generated by Algorithm 2 satisfies

\mathbb{E}\left[\frac{1}{K}\sum_{k=0}^{K-1}\frac{1}{\alpha_{k}}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}\right]\leq\mathcal{O}\left(\frac{1}{\alpha K}+\frac{1}{r}+\frac{c_{1}^{2}}{q}+\frac{c_{1}^{2}}{s}\right).

The detailed proof is available in Theorem C.1 and Corollary C.2. We summarize the proof idea of Theorem 3.4 as follows. By combining Theorem 3.3 with Lemma 3.1, the bias and variance are bounded by \widetilde{O}(1/s_{k}). Analyzing the resulting biased stochastic projected gradient method then provides the desired result.

Corollary 3.1.

If the rate is measured by \mathrm{dist}(0,\nabla\Psi(\mathbf{u})+\mathcal{N}_{\mathcal{U}}(\mathbf{u}))^{2}, we have the following equivalent conclusion (see Theorem C.2 and Corollary C.2 for the detailed proof):

\mathbb{E}[\mathrm{dist}(0,\nabla\Psi(\mathbf{u}^{R})+\mathcal{N}_{\mathcal{U}}(\mathbf{u}^{R}))^{2}]\leq\mathcal{O}\left(\frac{1}{\alpha K}+\frac{1}{r}+\frac{c_{1}^{2}}{q}+\frac{c_{1}^{2}}{s}\right).
Remark 3.2.

The step size condition \alpha_{k}<\frac{1}{2L_{\Psi}} and (C.3) imply that \alpha_{k} is at most \widetilde{\mathcal{O}}(c^{-1}). With \alpha\sim\mathcal{O}(c^{-1}), r\sim\mathcal{O}(\epsilon^{-1}), q\sim\mathcal{O}(c_{1}^{2}\epsilon^{-1}), s\sim\mathcal{O}(c_{1}^{2}\epsilon^{-1}), and K\sim\mathcal{O}(c\epsilon^{-1}), the right-hand side of the above inequality is \widetilde{\mathcal{O}}(\epsilon). Then the sample complexity on (\zeta,\xi) is (\widetilde{\mathcal{O}}(c\epsilon^{-2}),\widetilde{\mathcal{O}}(cc_{1}^{2}\epsilon^{-2})).
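As a quick sanity check of this count (a back-of-the-envelope computation with the parameter choices above; the total number of samples is Kr for \zeta and K(q+s) for \xi):

Kr\sim\mathcal{O}(c\epsilon^{-1})\cdot\mathcal{O}(\epsilon^{-1})=\mathcal{O}(c\epsilon^{-2}),\qquad K(q+s)\sim\mathcal{O}(c\epsilon^{-1})\cdot\mathcal{O}(c_{1}^{2}\epsilon^{-1})=\mathcal{O}(cc_{1}^{2}\epsilon^{-2}).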

Remark 3.3.

Theorem 3.2 shows that (2.9) is equivalent to the original problem (1.2) in the sense of \epsilon-accuracy by taking c_{1}\sim\mathcal{O}(\epsilon^{-1}), c_{2}\sim\mathcal{O}(\epsilon^{-3}) and \delta\sim\mathcal{O}(\epsilon^{2}). Under this condition, the sample complexity on (\zeta,\xi) is (\widetilde{\mathcal{O}}(\epsilon^{-5}),\widetilde{\mathcal{O}}(\epsilon^{-7})).

By introducing the following averaged Lipschitz assumption, we can further improve the convergence rate utilizing variance reduction techniques.

Assumption 3.7.

Assume \nabla f(x,y;\zeta),\nabla g(x,y;\xi) are averaged Lipschitz continuous, that is,

\mathbb{E}_{\zeta}[\|\nabla f(x_{1},y_{1};\zeta)-\nabla f(x_{2},y_{2};\zeta)\|^{2}]\leq L_{f}^{2}\|(x_{1},y_{1})-(x_{2},y_{2})\|^{2}, \qquad (3.9)
\mathbb{E}_{\xi}[\|\nabla g(x_{1},y_{1};\xi)-\nabla g(x_{2},y_{2};\xi)\|^{2}]\leq L_{g}^{2}\|(x_{1},y_{1})-(x_{2},y_{2})\|^{2}.
Theorem 3.5.

Suppose Assumptions 3.2-3.6 and 3.7 hold. Take \alpha_{k}=\alpha(k+1)^{-\frac{1}{3}} and \beta_{k+1}=\beta\alpha_{k}^{2} in the outer loop, and take constant sample sizes r_{k}=r, q_{k}=q, and s_{k}=s. Then the sequence \{\mathbf{u}^{k}\}_{k=0}^{K} generated by Algorithm 3 satisfies

\frac{1}{\sum_{k=0}^{K}\alpha_{k}}\sum_{k=0}^{K-1}\mathbb{E}\left[\frac{1}{\alpha_{k}}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}\right]\leq\widetilde{\mathcal{O}}\left(\frac{1}{\alpha K^{\frac{2}{3}}}+\Big(c_{1}^{2}+\frac{K^{\frac{2}{3}}}{r}\Big)\Big(\frac{1}{q}+\frac{1}{s}\Big)+\frac{1}{K^{\frac{2}{3}}r}\right). \qquad (3.10)

The detailed proof is available in Theorem C.3 and Corollary C.3. The proof sketch is summarized as follows: first we show that the error of the direction, e^{k}=d^{k}-\mathbb{E}[\nabla\Psi^{k}(\mathbf{u}^{k};\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k})], is bounded by a linear function of the previous error e^{k-1}. By proving the decrease of the merit function \Psi(\mathbf{u}^{k})+\theta_{k}\|e^{k}\| with suitable coefficients \theta_{k}, we establish the convergence of Algorithm 3.

Remark 3.4.

Further taking K\sim\mathcal{O}(c^{1.5}\epsilon^{-1.5}), r\sim\mathcal{O}(1), q\sim\mathcal{O}(c_{1}^{2}\epsilon^{-1}), s\sim\mathcal{O}(c_{1}^{2}\epsilon^{-1}), the right-hand side is \widetilde{\mathcal{O}}(\epsilon). The sample complexity on (\zeta,\xi) is then (\widetilde{\mathcal{O}}(c^{1.5}\epsilon^{-1.5}),\widetilde{\mathcal{O}}(c^{1.5}c_{1}^{2}\epsilon^{-2.5})). From Lemma 3.1 we know the upper bound of \|b^{k}\|^{2} is \widetilde{\mathcal{O}}(\frac{c_{1}^{2}}{s}); hence, to achieve an \epsilon-optimal solution by a biased gradient-based approach, the condition \widetilde{\mathcal{O}}(\frac{c_{1}^{2}}{s})\leq\epsilon cannot be further relaxed. That is, the sample complexity on \xi in each iteration is at least \mathcal{O}(c_{1}^{2}\epsilon^{-1}). To this extent, the current analysis requires a larger sample complexity on the lower level to reduce the upper-level complexity, and it remains an open question whether variance reduction could reduce the sample complexities of the upper and lower levels at the same time.
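Again counting total samples as Kr for \zeta and K(q+s) for \xi (an informal check of the stated rates):

Kr\sim\mathcal{O}(c^{1.5}\epsilon^{-1.5})\cdot\mathcal{O}(1)=\mathcal{O}(c^{1.5}\epsilon^{-1.5}),\qquad K(q+s)\sim\mathcal{O}(c^{1.5}\epsilon^{-1.5})\cdot\mathcal{O}(c_{1}^{2}\epsilon^{-1})=\mathcal{O}(c^{1.5}c_{1}^{2}\epsilon^{-2.5}).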

4 Numerical experiments

This section presents numerical experiments to demonstrate the effectiveness of the proposed algorithms, compared with baselines including LV-HBA (29), GAM (26), and BLOCC (12).

4.1 Toy example

Consider the following example from Jiang et al. (12):

\min_{x\in[0,3]}\quad F(x,y^{*})=\frac{e^{-y^{*}(x)+2}}{2+\cos(6x)}+\frac{1}{2}\log\left((4x-2)^{2}+1\right)
\text{s.t.}\quad y^{*}(x)\in\arg\min_{y\in\mathcal{Y}(x)}G(x,y)=(y-2x)^{2},

where \mathcal{Y}(x)=\{y\in[0,3]\mid H(x,y)\leq 0\} and H(x,y)=y-x. The lower-level problem has the closed-form solution y^{*}(x)=x. Now assume the gradient oracles of F and G have Gaussian noise with standard deviation \sigma=0.1, that is,

\nabla f(x,y;\zeta)=\nabla F(x,y)+\zeta,\qquad\nabla g(x,y;\xi)=\nabla G(x,y)+\xi,

with \zeta\sim N(0,\sigma^{2}I),\ \xi\sim N(0,\sigma^{2}I). We pick 200 random points (x^{0},y^{0})\in[0,3]\times[0,3] as initial points and allow at most 2500 samples of \zeta. The final iterates of Algorithms 2 and 3 are collected in Figure 1. Figure 1(a) plots the points (x,y) projected onto the curve y=y^{*}(x) together with the distribution of the output x, and Figure 1(b) shows a 3D plot of the output x and y. As shown in the figures, the converged points of both algorithms are close to the global optimal solution x^{*} and form an approximately Gaussian distribution. The distribution of SALVF-VR is more concentrated than that of SALVF, which demonstrates the acceleration effect of the variance reduction technique.
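For reference, here is a minimal NumPy sketch of the noisy oracles used in this example (our own illustration of the setup described above, not the authors' experiment code):

```python
import numpy as np

def toy_oracles(sigma=0.1, seed=0):
    """Noisy gradient oracles for Section 4.1:
    F(x, y) = exp(-y + 2) / (2 + cos(6x)) + 0.5 * log((4x - 2)**2 + 1),
    G(x, y) = (y - 2x)**2, with constraint H(x, y) = y - x <= 0, so y*(x) = x."""
    rng = np.random.default_rng(seed)

    def grad_f(x, y):
        dFdx = 6.0 * np.sin(6 * x) * np.exp(-y + 2) / (2 + np.cos(6 * x)) ** 2 \
             + 4.0 * (4 * x - 2) / ((4 * x - 2) ** 2 + 1)
        dFdy = -np.exp(-y + 2) / (2 + np.cos(6 * x))
        return np.array([dFdx, dFdy]) + sigma * rng.standard_normal(2)   # zeta ~ N(0, sigma^2 I)

    def grad_g(x, y):
        grad = np.array([-4.0 * (y - 2 * x), 2.0 * (y - 2 * x)])
        return grad + sigma * rng.standard_normal(2)                     # xi ~ N(0, sigma^2 I)

    return grad_f, grad_g
```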

Figure 1: The converged points of Algorithm 2. (a) x and its distribution; (b) 3D plot of convergent points.

4.2 Hyperparameter tuning for SVM

Consider the following support vector machine (SVM) problem as the lower-level problem:

\min_{w,b,\xi}\quad\frac{1}{2}\|w\|^{2}+\frac{1}{2}\frac{1}{\sum_{i=1}^{N}\exp(c_{i})}\sum_{i=1}^{N}\exp(c_{i})\xi_{i}
\text{s.t.}\quad l_{i}(z_{i}^{\top}w+b)\geq 1-\xi_{i},\quad\forall(z_{i},l_{i})\in\mathcal{D}_{tr},

where c=(c_{1},\ldots,c_{N}) are the hyperparameters, \mathcal{D}_{tr} is the training set, N=|\mathcal{D}_{tr}| and l_{i} is the label of the i-th sample. The upper-level problem is to minimize the validation error with respect to c:

\min_{c}\ \mathbb{E}_{(z,l)\sim\mathcal{D}_{val}}\left[\exp\left(1-l(z^{\top}w^{*}+b^{*})\right)\right].

In Figure 2, we compare the performance of SALVF with the baselines LV-HBA (29), GAM (26) and BLOCC (12). Since the lower-level problem is deterministic, we allow all algorithms to access the exact optimal solution of the lower-level problem by calling the ECOS (6) solver, and we set \gamma_{2}=0. We extend BLOCC, LV-HBA and GAM to their stochastic versions by replacing (projected) gradient descent with (projected) stochastic gradient descent in the upper-level update. Figure 2 shows the test accuracy of SALVF versus time and iterations on the Diabetes and Fourclass datasets. SALVF achieves higher accuracy than the baselines. Although BLOCC attains a similar peak accuracy, the iterations of SALVF are more time-efficient. This is because SALVF requires a double loop while BLOCC requires a triple loop, which is more computationally expensive.

Figure 2: The performance of SALVF compared with baselines on SVM hyperparameter optimization. (a) Diabetes: test acc. v.s. time; (b) Diabetes: test acc. v.s. iter.; (c) Fourclass: test acc. v.s. time; (d) Fourclass: test acc. v.s. iter. The abbreviations “test acc.” and “iter.” stand for test accuracy and iterations, respectively. The curves are averaged over 10 random seeds. The curves in Figures 2(a) and 2(c) are clipped at the maximum iterations 120 and 60, respectively.

4.3 Weight decay tuning

Given a neural network f(w,x), where w denotes the weight and bias parameters of each layer, the goal is to optimize the weight decay parameter for the model. To improve the generalization performance, a weight decay parameter C is introduced, which imposes the constraint \|w\|\leq C. This can be formulated as the following stochastic BLO:

\min_{C>0}\quad\mathbb{E}_{(x,y)\sim\mathcal{D}_{val}}[\ell(y,f(w^{*}(C),x))]
\text{s.t.}\quad w^{*}(C)\in\arg\min_{\|w\|\leq C}\mathbb{E}_{(x,y)\sim\mathcal{D}_{tr}}\,\ell\left(y,f\left(w,x\right)\right).

The upper level focuses on performance on a validation set, while the lower level involves constrained classifier training. We compare the performance of SALVF, SALVF-VR and BLOCC on the digit dataset (1) with a two-layer MLP as the base model. The results are shown in Figure 3. The “no weight decay” curve represents the model’s performance without weight decay. By incorporating weight decay, all bilevel methods exhibit improved performance and reduce overfitting. From Figure 3(a) we see that SALVF is the most time-efficient, thanks to the simplicity of each step in its double-loop iteration process.

Figure 3: The performance of SALVF compared with baselines on the digit dataset. (a) Test acc. v.s. time; (b) Test acc. v.s. iter. The curves are averaged over 10 random seeds. The curves in Figure 3(a) are clipped at the maximum iteration 50.

5 Conclusion

In this paper, we propose a Hessian-free method for solving stochastic LC-BLO problems. We present the first non-asymptotic rate analysis of value function-based algorithms for nonlinear LC-BLO in the stochastic setting. The sample complexity of our algorithm is (\widetilde{\mathcal{O}}(c_{1}\epsilon^{-2}),\widetilde{\mathcal{O}}(c_{1}^{3}\epsilon^{-2})) on the upper- and lower-level random variables, respectively. The sample complexity on the upper-level variable is further improved to \widetilde{\mathcal{O}}(c_{1}^{1.5}\epsilon^{-1.5}) using variance reduction techniques. Numerical experiments on synthetic and real-world data demonstrate the effectiveness of the proposed approach.

Appendix A Convergence analysis

A.1 Properties of Moreau envelope

Given a function \phi(z), the Moreau envelope is defined as e^{\gamma}(z)=\min_{\lambda}\phi(\lambda)+\frac{\gamma}{2}\|\lambda-z\|^{2}. To give a comprehensive explanation of the enveloped value function in (2.2), we introduce some well-known properties of the Moreau envelope. For detailed proofs we refer to the literature, such as (19, 9).
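As a standard one-dimensional illustration (our own example, not used elsewhere in the paper), take \phi(\lambda)=|\lambda|; with the definition above, the envelope and proximal point have the closed forms

e^{\gamma}(z)=\min_{\lambda}\ |\lambda|+\frac{\gamma}{2}(\lambda-z)^{2}=\begin{cases}\frac{\gamma}{2}z^{2}, & |z|\leq 1/\gamma,\\ |z|-\frac{1}{2\gamma}, & |z|>1/\gamma,\end{cases}\qquad\mathrm{prox}_{\frac{1}{\gamma}|\cdot|}(z)=\mathrm{sign}(z)\max\{|z|-1/\gamma,0\},

i.e., e^{\gamma} is the Huber function: it is differentiable everywhere even though \phi is not, which is exactly the smoothing effect exploited in (2.2).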

Proposition A.1.

1. The Moreau envelope e^{\gamma}(z) is a continuous lower approximation of \phi(z), i.e., e^{\gamma}(z)\leq\phi(z) for all z, and \lim_{\gamma\to 0}e^{\gamma}(z)=\phi(z). If \phi(z) is l-Lipschitz continuous and \rho-weakly convex in z, then this difference is at most O(\gamma), i.e.,

|e^{\gamma}(z)-\phi(z)|\leq O(\gamma).

2. If \phi is \rho-strongly convex in z with \rho\geq 0, then e^{\gamma}(z) is (\frac{1}{\gamma}+\frac{1}{\rho})^{-1}-strongly convex.

3. The Moreau envelope e^{\gamma}(z) has the same minimizers as \phi(z), i.e.,

e^{\gamma}(z)=\phi(z),\quad\forall z\in\arg\min_{z}\phi(z)=\arg\min_{z}e^{\gamma}(z).

4. Its gradient at z is given by

\nabla e^{\gamma}(z)=\nabla\phi(\mathrm{prox}_{\frac{1}{\gamma}\phi}(z))=\gamma\big(z-\mathrm{prox}_{\frac{1}{\gamma}\phi}(z)\big),

where \mathrm{prox}_{\frac{1}{\gamma}\phi}(z)=\arg\min_{\lambda}\phi(\lambda)+\frac{\gamma}{2}\|\lambda-z\|^{2} is the proximal operator. Therefore e^{\gamma}(z) is Lipschitz smooth given that \phi(z) is Lipschitz smooth.

In our reformulation, we introduce E(x,z) in (2.2). Then -E(x,z) is the Moreau envelope of the negative augmented dual function -D(x,z). Utilizing the properties of the Moreau envelope, we transform the original strongly-convex–concave saddle problem \max_{z\in\mathbb{R}_{+}^{p}}\min_{y\in Y}\mathcal{L}_{\gamma_{1}}(x,y,z) into the strongly-convex–strongly-concave saddle-point problem \max_{\lambda\in\mathbb{R}_{+}^{p}}\min_{w\in Y}\ell_{\gamma}(x,z,w,\lambda) without changing the optimal solution.

A.2 Properties of the augmented Lagrangian function

In this section we review some important properties of the augmented Lagrangian function defined in (2.1).

A.3 Augmented Lagrangian duality

In this subsection we introduce the augmented Lagrangian duality for general constrained optimization problems. The augmented Lagrangian duality is a powerful tool for solving constrained optimization problems, especially when the constraints are non-convex. The notation introduced in this subsection is only used within this subsection.

For a general constrained optimization problem

\min_{y\in Y}\quad G(y)\quad\text{s.t.}\quad H(y)\leq 0, \qquad (A.1)

The augmented Lagrangian penalty term is defined as

\mathcal{A}_{\gamma}(y,z)=\frac{1}{2\gamma}\sum_{i=1}^{p}\left([\gamma z_{i}+H_{i}(y)]_{+}^{2}-\gamma^{2}z_{i}^{2}\right), \qquad (A.2)

and the augmented Lagrangian function is defined as

\mathcal{L}_{\gamma}(y,z)=G(y)+\mathcal{A}_{\gamma}(y,z), \qquad (A.3)

where z_{i} is the dual variable associated with the i-th constraint H_{i}(y)\leq 0 and \gamma is the penalty parameter. The augmented Lagrangian function contains the Lagrangian function as a special case in the limit \gamma\to+\infty.

Under the convexity assumption and Slater’s condition, we have the following proposition about the augmented Lagrangian duality.

Proposition A.2.

(Strong duality) Suppose G,H are convex and Slater’s condition holds, i.e., there exists y^{*}\in Y such that H(y^{*})<0. Then the following statements hold:

1. The dual variables z exist and strong duality of the augmented Lagrangian holds, i.e.,

\min_{y}\max_{z\in\mathbb{R}_{+}^{p}}\mathcal{L}_{\gamma}(y,z)=\max_{z\in\mathbb{R}_{+}^{p}}\min_{y}\mathcal{L}_{\gamma}(y,z).

2. The strong duality of the regularized augmented Lagrangian holds, i.e.,

\min_{y}\max_{z\in\mathbb{R}_{+}^{p}}\Big\{\mathcal{L}_{\gamma}(y,z)-\frac{\sigma}{2}\|z-z^{\prime}\|^{2}\Big\}=\max_{z\in\mathbb{R}_{+}^{p}}\min_{y}\Big\{\mathcal{L}_{\gamma}(y,z)-\frac{\sigma}{2}\|z-z^{\prime}\|^{2}\Big\}

holds for any given \sigma>0 and z^{\prime}\in\mathbb{R}_{+}^{p}.

The proof of Proposition A.2 is provided in Chapter 17 of Nocedal and Wright (17). A direct consequence of Proposition A.2 is that the min and max in (2.5a) are interchangeable.

A.4 Gradient oracles

The gradient of \mathcal{L}_{\gamma_{1}}(x,y,z) is given by

\nabla_{x}\mathcal{L}_{\gamma_{1}}(x,y,z)=\nabla_{x}G(x,y)+\frac{1}{\gamma_{1}}\sum_{i=1}^{p}[\gamma_{1}z_{i}+H_{i}(x,y)]_{+}\nabla_{x}H_{i}(x,y), \qquad (A.4)
\nabla_{y}\mathcal{L}_{\gamma_{1}}(x,y,z)=\nabla_{y}G(x,y)+\frac{1}{\gamma_{1}}\sum_{i=1}^{p}[\gamma_{1}z_{i}+H_{i}(x,y)]_{+}\nabla_{y}H_{i}(x,y),
\nabla_{z}\mathcal{L}_{\gamma_{1}}(x,y,z)=[\gamma_{1}z+H(x,y)]_{+}-\gamma_{1}z=\max(-\gamma_{1}z,H(x,y)).

With some simple computation, it can be shown that these gradients are bounded by linear functions of \|z\|.

Lemma A.1.

Under Assumption 3.3, the gradients \nabla_{x}\mathcal{L}_{\gamma_{1}}(x,y,z),\nabla_{y}\mathcal{L}_{\gamma_{1}}(x,y,z) are bounded by

\|\nabla_{x}\mathcal{L}_{\gamma_{1}}(x,y,z)\|\leq M_{G,1}+p\left(\frac{M_{H,0}M_{H,1}}{\gamma_{1}}+\|z\|\right), \qquad (A.5a)
\|\nabla_{y}\mathcal{L}_{\gamma_{1}}(x,y,z)\|\leq M_{G,1}+p\left(\frac{M_{H,0}M_{H,1}}{\gamma_{1}}+\|z\|\right). \qquad (A.5b)

Proof. It follows from (A.4) that

\|\nabla_{x}\mathcal{L}_{\gamma_{1}}(x,y,z)\|\leq\|\nabla_{x}G(x,y)\|+\frac{1}{\gamma_{1}}\sum_{i=1}^{p}(\gamma_{1}|z_{i}|+|H_{i}(x,y)|)\|\nabla_{x}H_{i}(x,y)\|
\leq\|\nabla_{x}G(x,y)\|+\frac{1}{\gamma_{1}}\sum_{i=1}^{p}(\gamma_{1}|z_{i}|+M_{H,0})M_{H,1}
\leq M_{G,1}+p\left(\frac{M_{H,0}M_{H,1}}{\gamma_{1}}+\|z\|\right).

The proof of (A.5b) is the same. ∎

By substituting \nabla G(x,y) with \nabla g(x,y;\xi) in (A.4), we derive the gradient oracle of \mathcal{L}_{\gamma_{1}}(x,y,z;\xi) as

\nabla_{x}\mathcal{L}_{\gamma_{1}}(x,y,z;\xi)=\nabla_{x}g(x,y;\xi)+\frac{1}{\gamma_{1}}\sum_{i=1}^{p}[\gamma_{1}z_{i}+H_{i}(x,y)]_{+}\nabla_{x}H_{i}(x,y), \qquad (A.6)
\nabla_{y}\mathcal{L}_{\gamma_{1}}(x,y,z;\xi)=\nabla_{y}g(x,y;\xi)+\frac{1}{\gamma_{1}}\sum_{i=1}^{p}[\gamma_{1}z_{i}+H_{i}(x,y)]_{+}\nabla_{y}H_{i}(x,y),
\nabla_{z}\mathcal{L}_{\gamma_{1}}(x,y,z;\xi)=[\gamma_{1}z+H(x,y)]_{+}-\gamma_{1}z=\max(-\gamma_{1}z,H(x,y)).

We introduce several properties of the augmented Lagrangian function’s gradient oracle that are essential for the analysis of our approach. The following lemma guarantees that the stochastic gradient remains within bounded norms, which is a crucial condition for stability.

Lemma A.2.

Under Assumptions 3.3 and 3.4, the gradient oracles \nabla_{x}\mathcal{L}_{\gamma_{1}}(x,y,z;\xi),\nabla_{y}\mathcal{L}_{\gamma_{1}}(x,y,z;\xi) are bounded by

\mathbb{E}_{\xi\sim\mathcal{D}_{\xi}}[\|\nabla_{x}\mathcal{L}_{\gamma_{1}}(x,y,z;\xi)\|^{2}]\leq M_{\mathcal{L},1}+M_{\mathcal{L},2}\|z\|^{2}, \qquad (A.7a)
\mathbb{E}_{\xi\sim\mathcal{D}_{\xi}}[\|\nabla_{y}\mathcal{L}_{\gamma_{1}}(x,y,z;\xi)\|^{2}]\leq M_{\mathcal{L},1}+M_{\mathcal{L},2}\|z\|^{2}, \qquad (A.7b)

respectively, where M_{\mathcal{L},1}=(2+p)(M_{G,1}^{2}+\sigma_{g}^{2}+\frac{1}{\gamma_{1}^{2}}M_{H,0}) and M_{\mathcal{L},2}=(2+p)M_{H,0}^{2}M_{H,1}^{2}.

Proof It follows from (A.6) that

xγ1(x,y,z;ξ)\displaystyle\|\nabla_{x}\mathcal{L}_{{\gamma_{1}}}(x,y,z;\xi)\| xg(x,y;ξ)+1γ1i=1p(γ1|zi|+|Hi(x,y)|)yHi(x,y)\displaystyle\leq\|\nabla_{x}g(x,y;\xi)\|+\frac{1}{{\gamma_{1}}}\sum_{i=1}^{p}({\gamma_{1}}|z_{i}|+|H_{i}(x,y)|)\|\nabla_{y}H_{i}(x,y)\|
xg(x,y;ξ)+1γ1i=1p(γ1|zi|+MH,0)MH,1.\displaystyle\leq\|\nabla_{x}g(x,y;\xi)\|+\frac{1}{{\gamma_{1}}}\sum_{i=1}^{p}({\gamma_{1}}|z_{i}|+M_{H,0})M_{H,1}.

By Cauchy-Schwarz inequality, it holds that

(xg(x,y;ξ)+1γ1i=1p(γ1|zi|+MH,0)MH,1)2\displaystyle\quad\left(\|\nabla_{x}g(x,y;\xi)\|+\frac{1}{{\gamma_{1}}}\sum_{i=1}^{p}({\gamma_{1}}|z_{i}|+M_{H,0})M_{H,1}\right)^{2} (A.8)
(1+1+p)(xg(x,y;ξ)2+1γ12MH,02+MH,02MH,12i=1pzi2).\displaystyle\leq(1+1+p)\left(\|\nabla_{x}g(x,y;\xi)\|^{2}+\frac{1}{\gamma_{1}^{2}}M_{H,0}^{2}+M_{H,0}^{2}M_{H,1}^{2}\sum_{i=1}^{p}z_{i}^{2}\right).

Assumptions 3.3 and 3.4 imply

𝔼ξ𝒟ξ[xg(x,y;ξ)2]=xG(x,y)2+𝕍ξ𝒟ξ[xg(x,y;ξ)]MG,12+σg2.\mathbb{E}_{\xi\sim\mathcal{D}_{\xi}}[\|\nabla_{x}g(x,y;\xi)\|^{2}]=\|\nabla_{x}G(x,y)\|^{2}+\mathbb{V}_{\xi\sim\mathcal{D}_{\xi}}[\nabla_{x}g(x,y;\xi)]\leq M_{G,1}^{2}+\sigma_{g}^{2}.

Taking expectation on (A.8) gives (A.7a). The proof of (A.7b) is similar. ∎

The strong convexity of G(x,y)G(x,y) in yy and the convexity of H(x,y)H(x,y) in yy imply the convex-concavity of γ1(x,y,z)\mathcal{L}_{{\gamma_{1}}}(x,y,z), as shown in the following lemma.

Lemma A.3.

(convex-concavity) The function γ1(x,y,z)\mathcal{L}_{{\gamma_{1}}}(x,y,z) is μG\mu_{G}-strongly convex in yy and concave in zz.

Proof From (A.4) we can compute the second order derivative with respect to yy as

y2γ1(x,y,z)=\displaystyle\nabla_{y}^{2}\mathcal{L}_{{\gamma_{1}}}(x,y,z)= y2G(x,y)+1γ1i=1p{[γ1zi+Hi(x,y)]+y2Hi(x,y)\displaystyle\nabla_{y}^{2}G(x,y)+\frac{1}{{\gamma_{1}}}\sum_{i=1}^{p}\{[{\gamma_{1}}z_{i}+H_{i}(x,y)]_{+}\nabla_{y}^{2}H_{i}(x,y) (A.9)
+𝕀{γ1zi+Hi(x,y)>0}yHi(x,y)yHi(x,y)T}μGI.\displaystyle+\mathbb{I}_{\{{\gamma_{1}}z_{i}+H_{i}(x,y)>0\}}\nabla_{y}H_{i}(x,y)\nabla_{y}H_{i}(x,y)^{T}\}\succeq\mu_{G}I.

This implies that γ1(x,y,z)\mathcal{L}_{{\gamma_{1}}}(x,y,z) is μG\mu_{G}-strongly convex in yy. Additionally, (A.4) shows that zγ1(x,y,z)\nabla_{z}\mathcal{L}_{\gamma_{1}}(x,y,z) is monotonically non-increasing with respect to zz. Therefore γ1(x,y,z)\mathcal{L}_{\gamma_{1}}(x,y,z) is concave in zz. ∎

Lemma A.4.

The function γ(x,z,w,λ)\ell_{\gamma}(x,z,w,\lambda) is μG\mu_{G}-strongly convex in ww and γ2\gamma_{2}-strongly concave in λ\lambda.

Proof Combining Lemma A.3 and (2.5b), the conclusion follows. ∎

Moreover, the gradients of γ(x,z,w,λ)\ell_{\gamma}(x,z,w,\lambda) can be computed as

wγ(x,z,w,λ)\displaystyle\nabla_{w}\ell_{\gamma}(x,z,w,\lambda) =yg(x,w)+1γ1i=1p[γ1λi+Hi(x,w)]+yHi(x,w),\displaystyle=\nabla_{y}g(x,w)+\frac{1}{{\gamma_{1}}}\sum_{i=1}^{p}[{\gamma_{1}}\lambda_{i}+H_{i}(x,w)]_{+}\nabla_{y}H_{i}(x,w), (A.10)
λγ(x,z,w,λ)\displaystyle\nabla_{\lambda}\ell_{\gamma}(x,z,w,\lambda) =[γ1λ+H(x,w)]+γ1λγ2(λz)=max(γ1λ,H(x,w))γ2(λz).\displaystyle=[{\gamma_{1}}\lambda+H(x,w)]_{+}-{\gamma_{1}}\lambda-\gamma_{2}(\lambda-z)=\max(-{\gamma_{1}}\lambda,H(x,w))-\gamma_{2}(\lambda-z).

A.4.1 Comparison between Lagrangian function and augmented Lagrangian function

In this section, we compare the Lagrangian function with the augmented Lagrangian function. The key points are as follows: first, the Lagrangian function is the limit of the augmented Lagrangian function as γ1+\gamma_{1}\to+\infty; second, the augmented Lagrangian-based reformulation yields a tighter upper bound on the variance of the upper-level gradient estimate. The Lagrangian function (x,y,z)\mathcal{L}(x,y,z) of (1.1) is defined as

(x,y,z)=G(x,y)+i=1pziHi(x,y),(x,y,z)X×Y×+p.\mathcal{L}(x,y,z)=G(x,y)+\sum_{i=1}^{p}z_{i}H_{i}(x,y),\quad(x,y,z)\in X\times Y\times\mathbb{R}_{+}^{p}. (A.11)

The inequality relation between objective function, Lagrangian function and augmented Lagrangian function is given in the following proposition.

Proposition A.3.

If H(x,y)0,z0H(x,y)\leq 0,z\geq 0, then it holds that

(x,y,z)γ1(x,y,z)G(x,y).\mathcal{L}(x,y,z)\leq\mathcal{L}_{{\gamma_{1}}}(x,y,z)\leq G(x,y).

Proof For any i{1,,p}i\in\{1,...,p\}, there are two cases:

1. If γ1zi+Hi(x,y)0{\gamma_{1}}z_{i}+H_{i}(x,y)\leq 0, then 12γ1([γ1zi+Hi(x,y)]+2γ12zi2)=12γ1zi2\frac{1}{2{\gamma_{1}}}([{\gamma_{1}}z_{i}+H_{i}(x,y)]_{+}^{2}-{\gamma_{1}}^{2}z_{i}^{2})=-\frac{1}{2}{\gamma_{1}}z_{i}^{2}. It holds that ziHi(x,y)12γ1zi20z_{i}H_{i}(x,y)\leq-\frac{1}{2}{\gamma_{1}}z_{i}^{2}\leq 0.

2. If γ1zi+Hi(x,y)>0{\gamma_{1}}z_{i}+H_{i}(x,y)>0, then 12γ1([γ1zi+Hi(x,y)]+2γ12zi2)=ziHi(x,y)+12γ1Hi2(x,y)\frac{1}{2{\gamma_{1}}}([{\gamma_{1}}z_{i}+H_{i}(x,y)]_{+}^{2}-{\gamma_{1}}^{2}z_{i}^{2})=z_{i}H_{i}(x,y)+\frac{1}{2{\gamma_{1}}}H_{i}^{2}(x,y). Hence ziHi(x,y)ziHi(x,y)+12γ1Hi2(x,y)0z_{i}H_{i}(x,y)\leq z_{i}H_{i}(x,y)+\frac{1}{2{\gamma_{1}}}H_{i}^{2}(x,y)\leq 0.

Combining the above two cases, we have (x,y,z)γ1(x,y,z)G(x,y)\mathcal{L}(x,y,z)\leq\mathcal{L}_{{\gamma_{1}}}(x,y,z)\leq G(x,y). This completes the proof. ∎

From equations (2.1) and (A.11), it is evident that the Lagrangian function is the limit of the augmented Lagrangian function as γ1+\gamma_{1}\to+\infty. During the optimization process, it is desirable for the variance of the gradient oracle to be small in order to ensure the algorithm’s stability. The second-order moment serves as an upper bound for the variance. The following lemma demonstrates that the gradient of the augmented Lagrangian term for each constraint has a smaller second-order moment compared to that of the Lagrangian function. This property is a key reason for using the augmented Lagrangian function in our algorithm.

Lemma A.5.

Assume xx is fixed and the random variable (w^(ξ),λ^(ξ))𝒴(x)×+p(\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi}))\in\mathcal{Y}(x)\times\mathbb{R}_{+}^{p}, ξΩξs\mathbf{\xi}\in\Omega_{\xi}^{s}. Then it holds that

𝔼ξ[1γ1[γ1λ^i(ξ)+Hi(x,w^(ξ))]+Hi(x,w^(ξ))2]𝔼ξ[λ^i(ξ)Hi(x,w^(ξ))2].\displaystyle\mathbb{E}_{\mathbf{\xi}}\left[\left\|\frac{1}{{\gamma_{1}}}[{\gamma_{1}}\hat{\lambda}_{i}(\mathbf{\xi})+H_{i}(x,\hat{w}(\mathbf{\xi}))]_{+}\nabla H_{i}(x,\hat{w}(\mathbf{\xi}))\right\|^{2}\right]\leq\mathbb{E}_{\mathbf{\xi}}\left[\left\|\hat{\lambda}_{i}(\mathbf{\xi})\nabla H_{i}(x,\hat{w}(\mathbf{\xi}))\right\|^{2}\right]. (A.12)

Proof The condition (w^(ξ),λ^(ξ))𝒴(x)×+p(\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi}))\in\mathcal{Y}(x)\times\mathbb{R}_{+}^{p} indicates that H(x,w^(ξ))0H(x,\hat{w}(\mathbf{\xi}))\leq 0 and λ^(ξ)0\hat{\lambda}(\mathbf{\xi})\geq 0. Then we have

01γ1[γ1λ^i(ξ)+Hi(x,w^(ξ))]+λ^i(ξ),i=1,,p.\displaystyle 0\leq\frac{1}{{\gamma_{1}}}[{\gamma_{1}}\hat{\lambda}_{i}(\mathbf{\xi})+H_{i}(x,\hat{w}(\mathbf{\xi}))]_{+}\leq\hat{\lambda}_{i}(\mathbf{\xi}),\quad i=1,.,p.

Taking square and then taking expectation give the desired result. ∎
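
The clipping property in the proof above can also be checked numerically. The following sketch samples a hypothetical feasible scenario (nonpositive constraint values, nonnegative multipliers) and verifies that the augmented Lagrangian term has a smaller per-constraint second moment than the Lagrangian term, as in (A.12); all sampled quantities are synthetic stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma1, p, n_samples = 0.5, 3, 10_000

# Hypothetical feasible scenario: H(x, w_hat(xi)) <= 0 and lambda_hat(xi) >= 0.
H_vals = -rng.exponential(1.0, size=(n_samples, p))       # nonpositive constraint values
lam = rng.exponential(1.0, size=(n_samples, p))            # nonnegative multipliers
grad_H_sq = rng.uniform(0.5, 2.0, size=(n_samples, p))     # stand-in for ||grad H_i||^2

al_coef = np.maximum(gamma1 * lam + H_vals, 0.0) / gamma1  # clipped multiplier (AL term)
moment_al = np.mean(al_coef**2 * grad_H_sq, axis=0)        # left-hand side of (A.12)
moment_lag = np.mean(lam**2 * grad_H_sq, axis=0)           # right-hand side of (A.12)
assert np.all(moment_al <= moment_lag + 1e-12)
```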


A.5 Analysis on the reformulation

In this section, we show the equivalence between the reformulations and the original BLO (1.2). We first treat the deterministic case, which is a special case of the stochastic setting, and then extend the equivalence to the stochastic case, highlighting the improvements brought by the stochastic reformulation.

A.6 Deterministic case

To further analyze the optimization problem, we establish a critical property of the lower-level objective function G(x,y)G(x,y). Define v(x)=miny𝒴(x)G(x,y)v(x)=\min_{y\in\mathcal{Y}(x)}G(x,y). The following proposition shows that G(x,y)G(x,y) satisfies a quadratic growth condition, ensuring the uniqueness of solutions in the lower-level problem.

Proposition A.4.

(Quadratic growth) 1. Suppose Assumption 3.2 holds. For any xXx\in X, G(x,y)G(x,y) has μG\mu_{G}-quadratic growth with respect to y𝒴(x)y\in\mathcal{Y}(x), namely,

G(x,y)v(x)\displaystyle G(x,y)-v(x) μG2yy(x)2,y𝒴(x).\displaystyle\geq\frac{\mu_{G}}{2}\|y-y^{*}(x)\|^{2},\quad y\in\mathcal{Y}(x).

2. Suppose Assumptions 3.2 and 3.3 hold. For any xXx\in X and yYy\in Y satisfying dist(y,𝒴(x))ϵ\mathrm{dist}(y,\mathcal{Y}(x))\leq\epsilon, we have

G(x,y)v(x)μG4yy(x)2MG,1ϵμG2ϵ2.G(x,y)-v(x)\geq\frac{\mu_{G}}{4}\|y-y^{*}(x)\|^{2}-M_{G,1}\epsilon-\frac{\mu_{G}}{2}\epsilon^{2}.

3. Suppose Assumptions 3.2 and 3.3 hold. For any xXx\in X and yYy\in Y, it holds that

G(x,y)v(x)μG2yy(x)2p12BMH,1yy(x),G(x,y)-v(x)\geq\frac{\mu_{G}}{2}\|y-y^{*}(x)\|^{2}-p^{\frac{1}{2}}BM_{H,1}\|y-y^{*}(x)\|,

where BB is defined in Lemma A.7.3. If, in addition, yy satisfies 12i=1p[Hi(x,y)]+2ϵ2\frac{1}{2}\sum_{i=1}^{p}[H_{i}(x,y)]_{+}^{2}\leq\epsilon^{2}, then

G(x,y)v(x)μG2yy(x)22Bϵ.G(x,y)-v(x)\geq\frac{\mu_{G}}{2}\|y-y^{*}(x)\|^{2}-\sqrt{2}B\epsilon.

Proof 1. For any xXx\in X, denote

(y(x),λ(x))=argminyYmaxλ+pG(x,y)+λH(x,y).(y^{*}(x),\lambda^{*}(x))=\arg\min_{y\in Y}\max_{\lambda\in\mathbb{R}_{+}^{p}}G(x,y)+\lambda^{\top}H(x,y).

By Assumption 3.2, the function G(x,)+(λ(x))H(x,)G(x,\cdot)+(\lambda^{*}(x))^{\top}H(x,\cdot) is μG\mu_{G}-strongly convex in yy, which implies quadratic growth at its minimizer y(x)y^{*}(x), i.e.,

G(x,y)+(λ(x))H(x,y)G(x,y(x))+λ(x)H(x,y(x))+μG2yy(x)2.G(x,y)+(\lambda^{*}(x))^{\top}H(x,y)\geq G(x,y^{*}(x))+\lambda^{*}(x)^{\top}H(x,y^{*}(x))+\frac{\mu_{G}}{2}\|y-y^{*}(x)\|^{2}.

Furthermore, the complementary slackness condition implies λiHi(x,y(x))=0\lambda^{*}_{i}H_{i}(x,y^{*}(x))=0 for i=1,,pi=1,...,p. Therefore, we have

G(x,y)v(x)\displaystyle\quad G(x,y)-v(x)
G(x,y)+(λ(x))H(x,y)G(x,y(x))(since λ(x)0 and H(x,y)0)\displaystyle\geq G(x,y)+(\lambda^{*}(x))^{\top}H(x,y)-G(x,y^{*}(x))\quad\text{(since $\lambda^{*}(x)\geq 0$ and $H(x,y)\leq 0$)}
=G(x,y)+(λ(x))H(x,y)G(x,y(x))λ(x)H(x,y(x))\displaystyle=G(x,y)+(\lambda^{*}(x))^{\top}H(x,y)-G(x,y^{*}(x))-\lambda^{*}(x)^{\top}H(x,y^{*}(x))
μG2yy(x)2,y𝒴(x).\displaystyle\geq\frac{\mu_{G}}{2}\|y-y^{*}(x)\|^{2},\quad y\in\mathcal{Y}(x).

2. For any yYy\in Y satisfying dist(y,𝒴(x))ϵ\mathrm{dist}(y,\mathcal{Y}(x))\leq\epsilon, denote y=Proj𝒴(x)(y)y^{\prime}=\mathrm{Proj}_{\mathcal{Y}(x)}(y) as the projection of yy onto 𝒴(x)\mathcal{Y}(x). Then yyϵ\|y-y^{\prime}\|\leq\epsilon. From the results of 1, we have

G(x,y)v(x)μG2yy(x)2.G(x,y^{\prime})-v(x)\geq\frac{\mu_{G}}{2}\|y^{\prime}-y^{*}(x)\|^{2}.

Assumption 3.3 implies

G(x,y)G(x,y)MG,1yy.G(x,y)-G(x,y^{\prime})\geq-M_{G,1}\|y-y^{\prime}\|.

Combining the above two inequalities and the fact 12yy(x)2yy(x)2+yy2\frac{1}{2}\|y-y^{*}(x)\|^{2}\leq\|y^{\prime}-y^{*}(x)\|^{2}+\|y-y^{\prime}\|^{2}, we have

G(x,y)v(x)\displaystyle G(x,y)-v(x) =G(x,y)G(x,y)+G(x,y)v(x)\displaystyle=G(x,y)-G(x,y^{\prime})+G(x,y^{\prime})-v(x)
MG,1yy+μG2yy(x)2\displaystyle\geq-M_{G,1}\|y-y^{\prime}\|+\frac{\mu_{G}}{2}\|y^{\prime}-y^{*}(x)\|^{2}
MG,1ϵ+μG2(12yy(x)2yy2)\displaystyle\geq-M_{G,1}\epsilon+\frac{\mu_{G}}{2}\left(\frac{1}{2}\|y-y^{*}(x)\|^{2}-\|y-y^{\prime}\|^{2}\right)
μG4yy(x)2MG,1ϵμG2ϵ2.\displaystyle\geq\frac{\mu_{G}}{4}\|y-y^{*}(x)\|^{2}-M_{G,1}\epsilon-\frac{\mu_{G}}{2}\epsilon^{2}.

3. By Assumption 3.3 and Lemma A.7, for any xXx\in X and yYy\in Y, it holds that

λ(x)H(x,y)λ(x)[H(x,y)]+p12BMH,1yy(x).\lambda^{*}(x)H(x,y)\leq\|\lambda^{*}(x)\|\|[H(x,y)]_{+}\|\leq p^{\frac{1}{2}}BM_{H,1}\|y-y^{*}(x)\|.

Then, following the argument in part 1, we have

G(x,y)v(x)\displaystyle G(x,y)-v(x) μG2yy(x)2λ(x)H(x,y)\displaystyle\geq\frac{\mu_{G}}{2}\|y-y^{*}(x)\|^{2}-\lambda^{*}(x)H(x,y)
μG2yy(x)2p12BMH,1yy(x).\displaystyle\geq\frac{\mu_{G}}{2}\|y-y^{*}(x)\|^{2}-p^{\frac{1}{2}}BM_{H,1}\|y-y^{*}(x)\|.

Further assume that yYy\in Y satisfies 12i=1p[Hi(x,y)]+2ϵ2\frac{1}{2}\sum_{i=1}^{p}[H_{i}(x,y)]_{+}^{2}\leq\epsilon^{2}; then

G(x,y)v(x)\displaystyle G(x,y)-v(x) μG2yy(x)2λ(x)H(x,y)\displaystyle\geq\frac{\mu_{G}}{2}\|y-y^{*}(x)\|^{2}-\lambda^{*}(x)H(x,y)
μG2yy(x)2λ(x)[H(x,y)]+\displaystyle\geq\frac{\mu_{G}}{2}\|y-y^{*}(x)\|^{2}-\|\lambda^{*}(x)\|\|[H(x,y)]_{+}\|
μG2yy(x)22Bϵ.\displaystyle\geq\frac{\mu_{G}}{2}\|y-y^{*}(x)\|^{2}-\sqrt{2}B\epsilon.

This completes the proof. ∎
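
The quadratic growth bound in part 1 can be sanity-checked on a toy instance. The sketch below uses a hypothetical lower-level problem with a strongly convex quadratic objective and a single linear constraint, so that y*(x) is an explicit projection, and verifies the inequality at random feasible points; all data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)
mu_G, n = 2.0, 4
a, b = rng.normal(size=n), 0.3                  # hypothetical constraint a^T y <= b
c = rng.normal(size=n)                          # unconstrained minimizer of G(x, .)

def G(y):                                       # toy lower-level objective, mu_G-strongly convex in y
    return 0.5 * mu_G * np.sum((y - c) ** 2)

# y*(x) is the projection of c onto the half-space {y : a^T y <= b}.
y_star = c if a @ c <= b else c - a * (a @ c - b) / (a @ a)
v = G(y_star)                                   # v(x) = minimum of G over the feasible set

for _ in range(1000):                           # check G(x,y) - v(x) >= (mu_G/2) ||y - y*(x)||^2
    y = rng.normal(size=n)
    if a @ y <= b:                              # only feasible points are covered by part 1
        assert G(y) - v >= 0.5 * mu_G * np.sum((y - y_star) ** 2) - 1e-10
```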

The following lemma shows the relationship between v(x)v(x) and the enveloped value E(x,z)E(x,z) defined in (2.2).

Lemma A.7.

Suppose Assumptions 3.2, 3.5 and 3.6 hold.

1. It holds that E(x,z)v(x),z+pE(x,z)\leq v(x),\forall z\in\mathbb{R}_{+}^{p}. The equality holds if and only if there exists y𝒴(x)y\in\mathcal{Y}(x) such that (y,z)argminwYmaxλ+pγ1(x,w,λ)(y,z)\in\arg\min_{w\in Y}\max_{\lambda\in\mathbb{R}_{+}^{p}}\mathcal{L}_{\gamma_{1}}(x,w,\lambda).

2. Assume y𝒴(x)y\in\mathcal{Y}(x). Then yargmin𝒴(x)G(x,y)y\in\arg\min_{\mathcal{Y}(x)}G(x,y) is equivalent to the existence of z+pz\in\mathbb{R}_{+}^{p} such that G(x,y)E(x,z)0G(x,y)-E(x,z)\leq 0.

3. For any xXx\in X, there exist yYy\in Y and zz such that (y,z)argminwYmaxλ+pγ1(x,w,λ)(y,z)\in\arg\min_{w\in Y}\max_{\lambda\in\mathbb{R}_{+}^{p}}\mathcal{L}_{\gamma_{1}}(x,w,\lambda) and |zi|p0.5B|z_{i}|\leq p^{-0.5}B, where B=p2σ02MH,1(MG,1+pMH,1)B=p^{2}\sigma_{0}^{-2}M_{H,1}(M_{G,1}+pM_{H,1}). This statement also holds for γ1=+\gamma_{1}=+\infty.

Proof 1. For any fixed (x,y)(x,y), we have

12γ1([γ1zi+Hi(x,y)]+2γ12zi2)={ziHi(x,y)+12γ1Hi(x,y)2,if zi1γ1Hi(x,y),12γ1zi2,if zi<1γ1Hi(x,y).\frac{1}{2{{\gamma_{1}}}}([{\gamma_{1}}z_{i}+H_{i}(x,y)]_{+}^{2}-{\gamma_{1}}^{2}z_{i}^{2})=\begin{cases}z_{i}H_{i}(x,y)+\frac{1}{2{{\gamma_{1}}}}H_{i}(x,y)^{2},&\text{if }z_{i}\geq-\frac{1}{\gamma_{1}}H_{i}(x,y),\\ -\frac{1}{2}\gamma_{1}z_{i}^{2},&\text{if }z_{i}<-\frac{1}{\gamma_{1}}H_{i}(x,y).\end{cases}

We consider maximizing with respect to zi+z_{i}\in\mathbb{R}_{+}. If Hi(x,y)>0H_{i}(x,y)>0, then maxzi+12γ1([γ1zi+Hi(x,y)]+2γ12zi2)=maxzi+ziHi(x,y)+12γ1Hi(x,y)2=+\max_{z_{i}\in\mathbb{R}_{+}}\frac{1}{2{{\gamma_{1}}}}([{\gamma_{1}}z_{i}+H_{i}(x,y)]_{+}^{2}-{\gamma_{1}}^{2}z_{i}^{2})=\max_{z_{i}\in\mathbb{R}_{+}}z_{i}H_{i}(x,y)+\frac{1}{2{{\gamma_{1}}}}H_{i}(x,y)^{2}=+\infty. If Hi(x,y)0H_{i}(x,y)\leq 0, then

maxzi+12γ1([γ1zi+Hi(x,y)]+2γ12zi2)\displaystyle\quad\max_{z_{i}\in\mathbb{R}_{+}}\frac{1}{2{{\gamma_{1}}}}([{\gamma_{1}}z_{i}+H_{i}(x,y)]_{+}^{2}-{\gamma_{1}}^{2}z_{i}^{2})
=max{max0zi1γ1Hi(x,y)12γ1γ12zi2,maxzi1γ1Hi(x,y)ziHi(x,y)+12γ1Hi(x,y)2}\displaystyle=\max\left\{\max_{0\leq z_{i}\leq-\frac{1}{\gamma_{1}}H_{i}(x,y)}-\frac{1}{2{{\gamma_{1}}}}\gamma_{1}^{2}z_{i}^{2},\max_{z_{i}\geq-\frac{1}{\gamma_{1}}H_{i}(x,y)}z_{i}H_{i}(x,y)+\frac{1}{2{{\gamma_{1}}}}H_{i}(x,y)^{2}\right\}
=max{0,12γ1Hi(x,y)2}=0.\displaystyle=\max\left\{0,-\frac{1}{2{{\gamma_{1}}}}H_{i}(x,y)^{2}\right\}=0.

Combining the above two cases, we have

maxzi+12γ1([γ1zi+Hi(x,y)]+2γ12zi2)\displaystyle\max_{z_{i}\in\mathbb{R}_{+}}\frac{1}{2{{\gamma_{1}}}}([{\gamma_{1}}z_{i}+H_{i}(x,y)]_{+}^{2}-{\gamma_{1}}^{2}z_{i}^{2}) ={+,if Hi(x,y)>0,0,if Hi(x,y)0.\displaystyle=\begin{cases}+\infty,&\text{if }H_{i}(x,y)>0,\\ 0,&\text{if }H_{i}(x,y)\leq 0.\end{cases}

This implies that

maxz+pγ1(x,y,z)={+,if Hi(x,y)>0,G(x,y),if Hi(x,y)0,i=1,,p.\max_{z\in\mathbb{R}_{+}^{p}}\mathcal{L}_{\gamma_{1}}(x,y,z)=\begin{cases}+\infty,&\text{if }\exists H_{i}(x,y)>0,\\ G(x,y),&\text{if }H_{i}(x,y)\leq 0,i=1,...,p.\end{cases} (A.14)

From the definition of E(x,z)E(x,z), for any z+pz\in\mathbb{R}_{+}^{p}, we have

E(x,z)\displaystyle E(x,z) =maxλ+pminyY{γ1(x,y,λ)γ22λz2}\displaystyle=\max_{\lambda\in\mathbb{R}_{+}^{p}}\min_{y\in Y}\left\{\mathcal{L}_{\gamma_{1}}(x,y,\lambda)-\frac{\gamma_{2}}{2}\|\lambda-z\|^{2}\right\} (A.15)
=minyYmaxλ+p{γ1(x,y,λ)γ22λz2}(by Proposition A.2)\displaystyle=\min_{y\in Y}\max_{\lambda\in\mathbb{R}_{+}^{p}}\left\{\mathcal{L}_{\gamma_{1}}(x,y,\lambda)-\frac{\gamma_{2}}{2}\|\lambda-z\|^{2}\right\}\quad\text{(by Proposition~\ref{prop: strong duality})}
minyYmaxλ+pγ1(x,y,λ)\displaystyle\leq\min_{y\in Y}\max_{\lambda\in\mathbb{R}_{+}^{p}}\mathcal{L}_{\gamma_{1}}(x,y,\lambda)
=minyY{G(x,y)|H(x,y)0}(by (A.14))\displaystyle=\min_{y\in Y}\left\{G(x,y)|H(x,y)\leq 0\right\}\quad\text{(by~\eqref{eq: max of augmented Lagrangian with respect to z})}
=v(x).\displaystyle=v(x).

Equality holds if and only if there exists y𝒴(x)y\in\mathcal{Y}(x) such that (y,z)argminwYmaxλ+pγ1(x,w,λ)(y,z)\in\arg\min_{w\in Y}\max_{\lambda\in\mathbb{R}_{+}^{p}}\mathcal{L}_{\gamma_{1}}(x,w,\lambda).

2. Since E(x,z)v(x)G(x,y)E(x,z)\leq v(x)\leq G(x,y) for all z+pz\in\mathbb{R}_{+}^{p} and y𝒴(x)y\in\mathcal{Y}(x), the condition G(x,y)E(x,z)0G(x,y)-E(x,z)\leq 0 holds if and only if G(x,y)=v(x)=E(x,z)G(x,y)=v(x)=E(x,z), which implies yargminy𝒴(x)G(x,y)y\in\arg\min_{y\in\mathcal{Y}(x)}G(x,y).
Conversely, suppose yargmin𝒴(x)G(x,y)y\in\arg\min_{\mathcal{Y}(x)}G(x,y) and take zargmaxλ+pγ1(x,y,λ)z\in\arg\max_{\lambda\in\mathbb{R}_{+}^{p}}\mathcal{L}_{\gamma_{1}}(x,y,\lambda); then

E(x,z)\displaystyle E(x,z) =minwYmaxλ+p{γ1(x,w,λ)γ22λz2}\displaystyle=\min_{w\in Y}\max_{\lambda\in\mathbb{R}_{+}^{p}}\left\{\mathcal{L}_{\gamma_{1}}(x,w,\lambda)-\frac{\gamma_{2}}{2}\|\lambda-z\|^{2}\right\} (A.16)
minwYγ1(x,w,z)(taking λ=z)\displaystyle\geq\min_{w\in Y}\mathcal{L}_{\gamma_{1}}(x,w,z)\quad\text{(taking $\lambda=z$)}
=minw𝒴(x)G(x,w)=G(x,y).\displaystyle=\min_{w\in\mathcal{Y}(x)}G(x,w)=G(x,y).

3. The optimality conditions of (y,z)argminwYmaxλ+pγ1(x,w,λ)(y,z)\in\arg\min_{w\in Y}\max_{\lambda\in\mathbb{R}_{+}^{p}}\mathcal{L}_{\gamma_{1}}(x,w,\lambda) imply that H(x,y)0H(x,y)\leq 0 and yγ1(x,y,z)=0\nabla_{y}\mathcal{L}_{\gamma_{1}}(x,y,z)=0, that is,

yg(x,y)+1γ1i=1p[γ1zi+Hi(x,y)]+yHi(x,y)=0.\nabla_{y}g(x,y)+\frac{1}{{\gamma_{1}}}\sum_{i=1}^{p}[{\gamma_{1}}z_{i}+H_{i}(x,y)]_{+}\nabla_{y}H_{i}(x,y)=0. (A.17)

From the proof of 1, we know that if Hi(x,y)<0H_{i}(x,y)<0, then zi=0z_{i}=0. If Hi(x,y)=0H_{i}(x,y)=0, then yHi(x,y)0\nabla_{y}H_{i}(x,y)\neq 0. Otherwise suppose Hi(x,y)=0H_{i}(x,y)=0 and yHi(x,y)=0\nabla_{y}H_{i}(x,y)=0 hold simultaneously. The convexity of Hi(x,y)H_{i}(x,y) implies Hi(x,y)0H_{i}(x,y)\geq 0 for all yYy\in Y, which contradicts Assumption 3.5. Substituting these cases in (A.17) yields

yg(x,y)+i:Hi(x,y)=0ziyHi(x,y)+i:Hi(x,y)<0[Hi(x,y)]+yHi(x,y)=0.\nabla_{y}g(x,y)+\sum_{i:H_{i}(x,y)=0}z_{i}\nabla_{y}H_{i}(x,y)+\sum_{i:H_{i}(x,y)<0}[H_{i}(x,y)]_{+}\nabla_{y}H_{i}(x,y)=0. (A.18)

Note that the above conditions are equivalent to the KKT conditions. Since the lower-level problem is convex and Slater’s condition holds, they are sufficient for optimality. Let 𝒞\mathcal{C} be the matrix whose columns are yHi(x,y)\nabla_{y}H_{i}(x,y) for i{i:Hi(x,y)=0}i\in\{i:H_{i}(x,y)=0\}; then the minimal-norm zz satisfying (A.18) is bounded as

|z_{i}|\leq\left\|(\mathcal{C}^{\top}\mathcal{C})^{-1}\mathcal{C}^{\top}\left(\nabla_{y}g(x,y)+\sum_{i:H_{i}(x,y)<0}[H_{i}(x,y)]_{+}\nabla_{y}H_{i}(x,y)\right)\right\|\leq p^{1.5}\sigma_{0}^{-2}M_{H,1}(M_{G,1}+pM_{H,0}M_{H,1}).

The last inequality uses Assumption 3.6. In the special case γ1=+\gamma_{1}=+\infty, the bound becomes

|zi|(𝒞𝒞)1𝒞yG(x,y)p1.5σ02MH,1MG,1.|z_{i}|\leq\left\|(\mathcal{C}^{\top}\mathcal{C})^{-1}\mathcal{C}^{\top}\nabla_{y}G(x,y)\right\|\leq p^{1.5}\sigma_{0}^{-2}M_{H,1}M_{G,1}.

This completes the proof. ∎

Remark A.1.

Under the strong duality condition, v(x)=maxz+pD(x,z)v(x)=\max_{z\in\mathbb{R}_{+}^{p}}D(x,z). Proposition A.1.1 and A.1.3 imply that E(x,z)maxz+pE(x,z)=maxz+pD(x,z)=v(x)E(x,z)\leq\max_{z\in\mathbb{R}_{+}^{p}}E(x,z)=\max_{z\in\mathbb{R}_{+}^{p}}D(x,z)=v(x). This shows that the enveloped value function E(x,z)E(x,z) is a lower approximation of v(x)v(x) and does not change the optimal solution.

The following theorem shows the equivalence between (1.2) and the penalty reformulation in the deterministic setting. This theorem generalizes Theorem 1 of (20); indeed, choosing ϵ2=0\epsilon_{2}=0 reduces our statement to exactly that result.

Theorem A.1.

Suppose that Assumptions 3.1, 3.2 and 3.6 hold and γ1,γ2>0\gamma_{1},\gamma_{2}>0 are fixed parameters.

1. Assume (x,y)(x^{*},y^{*}) is a global solution to (1.2) and c1c1=LμGϵ1c_{1}\geq c_{1}^{*}=\frac{L}{\mu_{G}}\epsilon^{-1}. Then (x,y)(x^{*},y^{*}) is a ϵ\epsilon-global-minima of the following penalized form

min(x,y)X×Y\displaystyle\min_{(x,y)\in X\times Y} F(x,y)+c1(G(x,y)v(x))s.t.12i=1p[Hi(x,y)]+2ϵ22.\displaystyle\quad F(x,y)+c_{1}(G(x,y)-v(x))\quad\text{s.t.}\quad\frac{1}{2}\sum_{i=1}^{p}[H_{i}(x,y)]_{+}^{2}\leq\epsilon_{2}^{2}. (A.19)

Furthermore, there exists z+pz^{*}\in\mathbb{R}_{+}^{p} such that (x,y,z)(x^{*},y^{*},z^{*}) is an ϵ\epsilon-global-minima of the following penalized form

min(x,y,z)X×Y×Z\displaystyle\min_{(x,y,z)\in X\times Y\times Z} F(x,y)+c1𝒢(x,y,z)s.t.12i=1p[Hi(x,y)]+2ϵ22,\displaystyle\quad F(x,y)+c_{1}\mathcal{G}(x,y,z)\quad\text{s.t.}\quad\frac{1}{2}\sum_{i=1}^{p}[H_{i}(x,y)]_{+}^{2}\leq\epsilon_{2}^{2}, (A.20)

with ϵ2ϵ22c1B\epsilon_{2}\leq\frac{\epsilon}{2\sqrt{2}c_{1}B}.

2. By taking c1=c1+2c_{1}=c_{1}^{*}+2, any ϵ\epsilon-global-minima of (A.19) and (A.20) is an ϵ\epsilon-global-minima of the following two approximations of BLO:

min(x,y)X×Y\displaystyle\min_{(x,y)\in X\times Y} F(x,y)s.t.G(x,y)v(x)ϵ1,12i=1p[Hi(x,y)]+2ϵ22.\displaystyle\quad F(x,y)\quad\text{s.t.}\quad G(x,y)-v(x)\leq\epsilon_{1},\quad\frac{1}{2}\sum_{i=1}^{p}[H_{i}(x,y)]_{+}^{2}\leq\epsilon_{2}^{2}. (A.21)

and

min(x,y,z)X×Y×Z\displaystyle\min_{(x,y,z)\in X\times Y\times Z} F(x,y)s.t.G(x,y)E(x,z)ϵ1,12i=1p[Hi(x,y)]+2ϵ22,\displaystyle\quad F(x,y)\quad\text{s.t.}\quad G(x,y)-E(x,z)\leq\epsilon_{1},\quad\frac{1}{2}\sum_{i=1}^{p}[H_{i}(x,y)]_{+}^{2}\leq\epsilon_{2}^{2}, (A.22)

with some ϵ1,ϵ2ϵ\epsilon_{1},\epsilon_{2}\leq\epsilon.

Proof 1. Lemma A.7 shows that (A.20) is equivalent to (A.19) by minimizing zz first. Hence it suffices to show the equivalence between (1.2) and (A.19). Let p(x,y)=G(x,y)v(x)\mathrm{p}(x,y)=G(x,y)-v(x). Proposition A.4.3 implies that p(x,y)μG2yy(x)22Bϵ2\mathrm{p}(x,y)\geq\frac{\mu_{G}}{2}\|y-y^{*}(x)\|^{2}-\sqrt{2}B\epsilon_{2} for any yy satisfying 12i=1p[Hi(x,y)]+2ϵ22\frac{1}{2}\sum_{i=1}^{p}[H_{i}(x,y)]_{+}^{2}\leq\epsilon_{2}^{2}. By the Lipschitz property of F(x,)F(x,\cdot), we have

F(x,y)+c1p(x,y)F(x,y(x))\displaystyle F(x,y)+c_{1}\mathrm{p}(x,y)-F(x,y^{*}(x)) LFyy(x)+c1μG2yy(x)2c12Bϵ2\displaystyle\geq-L_{F}\|y-y^{*}(x)\|+\frac{c_{1}\mu_{G}}{2}\|y-y^{*}(x)\|^{2}-c_{1}\sqrt{2}B\epsilon_{2} (A.23)
LF2c1μGc12Bϵ2.\displaystyle\geq-\frac{L_{F}}{2c_{1}\mu_{G}}-c_{1}\sqrt{2}B\epsilon_{2}.

By taking c1c1=LμGϵ1c_{1}\geq c_{1}^{*}=\frac{L}{\mu_{G}}\epsilon^{-1} and ϵ2ϵ22c1B\epsilon_{2}\leq\frac{\epsilon}{2\sqrt{2}c_{1}B}, the right-hand side of the above inequality is at least ϵ-\epsilon. For the optimal solution (x,y)(x^{*},y^{*}) of (1.2) and any (x,y)X×Y(x,y)\in X\times Y, it holds that

F(x,y)+c1p(x,y)=F(x,y)F(x,y(x))F(x,y)+c1p(x,y)+ϵ.F(x^{*},y^{*})+c_{1}\mathrm{p}(x^{*},y^{*})=F(x^{*},y^{*})\leq F(x,y^{*}(x))\leq F(x,y)+c_{1}\mathrm{p}(x,y)+\epsilon. (A.24)

The first inequality follows from the optimality of xx^{*} and y=y(x)y^{*}=y^{*}(x^{*}). The second inequality follows from (A.23). The inequality (A.24) implies that (x,y)(x^{*},y^{*}) is an ϵ\epsilon-optimal solution of (A.19). Further, taking zargmaxλ+pγ1(x,y,λ)z^{*}\in\arg\max_{\lambda\in\mathbb{R}_{+}^{p}}\mathcal{L}_{\gamma_{1}}(x^{*},y^{*},\lambda), the point (x,y,z)(x^{*},y^{*},z^{*}) is an ϵ\epsilon-optimal solution of (A.20). By Lemma A.7, such a zz^{*} can be chosen with zZz^{*}\in Z. This completes the proof.

2. Lemma A.7 shows that (A.22) is equivalent to (A.21) by minimizing zz first. For any ϵ\epsilon-optimal solution (x^,y^)(\hat{x},\hat{y}) of (A.19), it holds that

F(x^,y^)+c1p(x^,y^)F(x,y)+c1p(x,y)+ϵF(x^,y^)+c1p(x^,y^)+2ϵ.F(\hat{x},\hat{y})+c_{1}\mathrm{p}(\hat{x},\hat{y})\leq F(x^{*},y^{*})+c_{1}\mathrm{p}(x^{*},y^{*})+\epsilon\leq F(\hat{x},\hat{y})+c_{1}^{*}\mathrm{p}(\hat{x},\hat{y})+2\epsilon.

The second inequality follows from (A.23) and p(x,y)=0\mathrm{p}(x^{*},y^{*})=0. Then we have

p(x^,y^)2ϵc1c1=ϵ.\mathrm{p}(\hat{x},\hat{y})\leq\frac{2\epsilon}{c_{1}-c_{1}^{*}}=\epsilon. (A.25)

Take ϵ1=p(x^,y^),ϵ2=12i=1p[Hi(x^,y^)]+2\epsilon_{1}=p(\hat{x},\hat{y}),\epsilon_{2}=\sqrt{\frac{1}{2}\sum_{i=1}^{p}[H_{i}(\hat{x},\hat{y})]_{+}^{2}}. Then (x^,y^)(\hat{x},\hat{y}) is feasible for (A.21). For any feasible solution (x,y)(x,y) of (A.21), the ϵ\epsilon-optimality of (x^,y^)(\hat{x},\hat{y}) implies

F(x^,y^)F(x,y)\displaystyle F(\hat{x},\hat{y})-F(x,y) c1(p(x,y)p(x^,y^))+ϵ\displaystyle\leq c_{1}\left(\mathrm{p}(x,y)-\mathrm{p}(\hat{x},\hat{y})\right)+\epsilon (A.26)
=c1(p(x,y)ϵ1)+ϵϵ.\displaystyle=c_{1}\left(\mathrm{p}(x,y)-\epsilon_{1}\right)+\epsilon\leq\epsilon.

The last inequality follows from the feasibility of (x,y)(x,y). Therefore (x^,y^)(\hat{x},\hat{y}) is a ϵ\epsilon-optimal solution of (A.21). This completes the proof. ∎

Theorem A.2.

Suppose that Assumptions 3.1, 3.2 and 3.6 hold and γ1,γ2>0\gamma_{1},\gamma_{2}>0 are fixed parameters.

1. Assume (x,y)(x^{*},y^{*}) is a global solution to (1.2) and c1c1=L2μGϵ1,c2c2=(c1)2B2ϵ1c_{1}\geq c_{1}^{*}=\frac{L}{2\mu_{G}}\epsilon^{-1},c_{2}\geq c_{2}^{*}=(c_{1}^{*})^{2}B^{2}\epsilon^{-1}. Then (x,y)(x^{*},y^{*}) is a ϵ\epsilon-global-minima of the following penalized form

min(x,y)X×Y\displaystyle\min_{(x,y)\in X\times Y} Ψ(x,y)=F(x,y)+c1(G(x,y)v(x))+c22i=1p[Hi(x,y)]+2.\displaystyle\quad\Psi(x,y)=F(x,y)+c_{1}(G(x,y)-v(x))+\frac{c_{2}}{2}\sum_{i=1}^{p}[H_{i}(x,y)]_{+}^{2}. (A.27)

Furthermore, there exists z+pz^{*}\in\mathbb{R}_{+}^{p} such that (x,y,z)(x^{*},y^{*},z^{*}) is an ϵ\epsilon-global-minima of the following penalized form

min(x,y,z)X×Y×Z\displaystyle\min_{(x,y,z)\in X\times Y\times Z} Ψ(x,y,z)=F(x,y)+c1𝒢(x,y,z)+c22i=1p[Hi(x,y)]+2.\displaystyle\quad\Psi(x,y,z)=F(x,y)+c_{1}\mathcal{G}(x,y,z)+\frac{c_{2}}{2}\sum_{i=1}^{p}[H_{i}(x,y)]_{+}^{2}. (A.28)

2. By taking c1=c1+2,c2=c2+2c_{1}=c_{1}^{*}+2,c_{2}=c_{2}^{*}+2, any ϵ\epsilon-global-minima of (A.27) and (A.28) is an ϵ\epsilon-global-minima of  (A.21) and (A.22) with some ϵ1,ϵ2ϵ\epsilon_{1},\epsilon_{2}\leq\epsilon, respectively.

Proof 1. Lemma A.7 shows that (A.28) is equivalent to (A.27) by minimizing zz first. Hence it suffices to show the equivalence between (1.2) and (A.27). Let p(x,y)=G(x,y)v(x)\mathrm{p}(x,y)=G(x,y)-v(x). Proposition A.4.3 implies that p(x,y)μG2yy(x)2λ(x)H(x,y)\mathrm{p}(x,y)\geq\frac{\mu_{G}}{2}\|y-y^{*}(x)\|^{2}-\lambda^{*}(x)H(x,y). By the Lipschitz property of F(x,)F(x,\cdot), we have

Ψ(x,y)F(x,y(x))\displaystyle\quad\Psi(x,y)-F(x,y^{*}(x)) (A.29)
=F(x,y)F(x,y(x))+c1p(x,y)+c22i=1p[Hi(x,y)]+2\displaystyle=F(x,y)-F(x,y^{*}(x))+c_{1}\mathrm{p}(x,y)+\frac{c_{2}}{2}\sum_{i=1}^{p}[H_{i}(x,y)]_{+}^{2}
LFyy(x)+c1μG2yy(x)2c1λ(x)H(x,y)+c22i=1p[Hi(x,y)]+2\displaystyle\geq-L_{F}\|y-y^{*}(x)\|+{c_{1}}\frac{\mu_{G}}{2}\|y-y^{*}(x)\|^{2}-c_{1}\lambda^{*}(x)H(x,y)+\frac{c_{2}}{2}\sum_{i=1}^{p}[H_{i}(x,y)]_{+}^{2}
LF2c1μGc122c2λ(x)2\displaystyle\geq-\frac{L_{F}}{2c_{1}\mu_{G}}-\frac{c_{1}^{2}}{2c_{2}}\|\lambda^{*}(x)\|^{2}
LF2c1μGc122c2B2,(x,y)(X,Y).\displaystyle\geq-\frac{L_{F}}{2c_{1}\mu_{G}}-\frac{c_{1}^{2}}{2c_{2}}B^{2},\quad\forall(x,y)\in(X,Y).

By taking c1c1=L2μGϵ1c_{1}\geq c_{1}^{*}=\frac{L}{2\mu_{G}}\epsilon^{-1} and c2c2=(c1)2B2ϵ1c_{2}\geq c_{2}^{*}=(c_{1}^{*})^{2}B^{2}\epsilon^{-1}, the right-hand side of the above inequality is at least ϵ-\epsilon. For the optimal solution (x,y)(x^{*},y^{*}) of (1.2), it holds that

Ψ(x,y)=F(x,y)F(x,y(x))Ψ(x,y)+ϵ,(x,y)(X,Y).\Psi(x^{*},y^{*})=F(x^{*},y^{*})\leq F(x,y^{*}(x))\leq\Psi(x,y)+\epsilon,\quad\forall(x,y)\in(X,Y). (A.30)

The first inequality follows from the optimality of xx^{*} and the definition of v(x)v(x). The second inequality follows from (A.29). This implies that (x,y)(x^{*},y^{*}) is an ϵ\epsilon-optimal solution of (A.27). Further, taking zargmaxλ+pγ1(x,y,λ)z^{*}\in\arg\max_{\lambda\in\mathbb{R}_{+}^{p}}\mathcal{L}_{\gamma_{1}}(x^{*},y^{*},\lambda), the point (x,y,z)(x^{*},y^{*},z^{*}) is an ϵ\epsilon-optimal solution of (A.28). By Lemma A.7, such a zz^{*} can be chosen with zZz^{*}\in Z. This completes the proof.

2. Lemma A.7 shows that (A.22) is equivalent to (A.21) by minimizing zz first. For any ϵ\epsilon-optimal solution (x^,y^)(\hat{x},\hat{y}) of (A.27), it holds that

F(x^,y^)+c1p(x^,y^)+c22i=1p[Hi(x^,y^)]+2Ψ(x,y)+ϵF(x^,y^)+c1p(x^,y^)+c22i=1p[Hi(x^,y^)]+2+2ϵ.F(\hat{x},\hat{y})+c_{1}\mathrm{p}(\hat{x},\hat{y})+\frac{c_{2}}{2}\sum_{i=1}^{p}[H_{i}(\hat{x},\hat{y})]_{+}^{2}\leq\Psi(x^{*},y^{*})+\epsilon\leq F(\hat{x},\hat{y})+c_{1}^{*}\mathrm{p}(\hat{x},\hat{y})+\frac{c_{2}^{*}}{2}\sum_{i=1}^{p}[H_{i}(\hat{x},\hat{y})]_{+}^{2}+2\epsilon.

The second inequality follows from (A.29). From the selection of c1,c2c_{1},c_{2}, we have

p(x^,y^)+12i=1p[Hi(x^,y^)]+22ϵ2=ϵ.\mathrm{p}(\hat{x},\hat{y})+\frac{1}{2}\sum_{i=1}^{p}[H_{i}(\hat{x},\hat{y})]_{+}^{2}\leq\frac{2\epsilon}{2}=\epsilon. (A.31)

Take ϵ1=p(x^,y^)ϵ,ϵ2=12i=1p[Hi(x^,y^)]+2ϵ\epsilon_{1}=\mathrm{p}(\hat{x},\hat{y})\leq\epsilon,\epsilon_{2}=\frac{1}{2}\sum_{i=1}^{p}[H_{i}(\hat{x},\hat{y})]_{+}^{2}\leq\epsilon. This implies (x^,y^)(\hat{x},\hat{y}) is feasible for (A.21). For any feasible solution (x,y)(x,y) of (A.21), the ϵ\epsilon-optimality of (x^,y^)(\hat{x},\hat{y}) implies

F(x^,y^)F(x,y)\displaystyle F(\hat{x},\hat{y})-F(x,y) c1(p(x,y)p(x^,y^))+c2(12i=1p[Hi(x,y)]+212i=1p[Hi(x^,y^)]+2)+ϵ\displaystyle\leq c_{1}\left(\mathrm{p}(x,y)-\mathrm{p}(\hat{x},\hat{y})\right)+c_{2}\left(\frac{1}{2}\sum_{i=1}^{p}[H_{i}(x,y)]_{+}^{2}-\frac{1}{2}\sum_{i=1}^{p}[H_{i}(\hat{x},\hat{y})]_{+}^{2}\right)+\epsilon (A.32)
=c1(p(x,y)ϵ1)+c2(12i=1p[Hi(x,y)]+2ϵ2)+ϵϵ.\displaystyle=c_{1}\left(\mathrm{p}(x,y)-\epsilon_{1}\right)+c_{2}\left(\frac{1}{2}\sum_{i=1}^{p}[H_{i}(x,y)]_{+}^{2}-\epsilon_{2}\right)+\epsilon\leq\epsilon.

The last inequality follows from the feasibility of (x,y)(x,y). Therefore (x^,y^)(\hat{x},\hat{y}) is a ϵ\epsilon-optimal solution of (A.21). This completes the proof. ∎
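
To make the fully penalized reformulation (A.27) concrete, the following sketch assembles Ψ(x, y) from user-supplied callables and sets the penalty parameters as suggested by Theorem A.2. The callables F, G, H and solve_lower (an approximate solver returning v(x)) are hypothetical placeholders, not part of the algorithm's implementation.

```python
import numpy as np

def penalty_parameters(L, mu_G, B, eps):
    """Penalty parameters suggested by Theorem A.2 (part 2): c1 = c1* + 2, c2 = c2* + 2."""
    c1_star = L / (2.0 * mu_G * eps)
    c2_star = c1_star**2 * B**2 / eps
    return c1_star + 2.0, c2_star + 2.0

def penalized_objective(x, y, c1, c2, F, G, H, solve_lower):
    """Deterministic penalized objective Psi(x, y) in (A.27) -- a sketch.

    F, G : upper- and lower-level objectives (hypothetical callables).
    H    : constraint map returning a vector in R^p.
    solve_lower : returns (an approximation of) v(x) = min over the feasible set of G(x, .).
    """
    v_x = solve_lower(x)
    infeas = np.maximum(H(x, y), 0.0)            # [H_i(x, y)]_+
    return F(x, y) + c1 * (G(x, y) - v_x) + 0.5 * c2 * np.sum(infeas**2)
```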

A.7 Stochastic case

Let ξ=(ξ1,,ξs)𝒟ξs\mathbf{\xi}=(\xi_{1},...,\xi_{s})\sim\mathcal{D}_{\xi}^{s} be ss samples. Denote by 𝒫w,𝒫λ\mathcal{P}_{w},\mathcal{P}_{\lambda} the spaces of random variables mapping ξ\mathbf{\xi} to YY and +p\mathbb{R}_{+}^{p}, respectively, that is,

𝒫w\displaystyle\mathcal{P}_{w} ={w:ΩξsYw is measurable},𝒫λ={λ:Ωξs+pλ is measurable}.\displaystyle=\{w:\Omega_{\xi}^{s}\to Y\mid w\text{ is measurable}\},\quad\mathcal{P}_{\lambda}=\{\lambda:\Omega_{\xi}^{s}\to\mathbb{R}_{+}^{p}\mid\lambda\text{ is measurable}\}.

Assume (w^,λ^)𝒫w×𝒫λ(\hat{w},\hat{\lambda})\in\mathcal{P}_{w}\times\mathcal{P}_{\lambda} are two random variables depending on ξ\mathbf{\xi}. In this section, 𝔼[]\mathbb{E}[\cdot] is the abbreviation of 𝔼ξ𝒟ξs[]\mathbb{E}_{\mathbf{\xi}\sim\mathcal{D}_{\xi}^{s}}[\cdot]. The following theorem shows the equivalence between (1.2) and the penalty reformulation in the stochastic setting.

Theorem A.3.

Suppose that Assumptions 3.1, 3.2, 3.4 and 3.5 hold and γ1,γ2>0\gamma_{1},\gamma_{2}>0 are fixed parameters.

1. Assume (x,y)(x^{*},y^{*}) is a global solution to (1.2). If 𝒫(δ)\mathcal{P}(\delta) defined in (2.6) is nonempty for any (x,z)X×Z(x,z)\in X\times Z, then for any (w^,λ^)𝒫(δ)(\hat{w},\hat{\lambda})\in\mathcal{P}(\delta), there exists zz^{*} such that (x,y,z)(x^{*},y^{*},z^{*}) is a ϵ\epsilon-global-minima of the following penalized form

min(x,y,z)X×Y×Z\displaystyle\min_{(x,y,z)\in X\times Y\times Z} 𝔼[F(x,y)+c1(G(x,y)γ(x,z,w^(ξ),λ^(ξ)))],\displaystyle\quad\mathbb{E}\left[F(x,y)+c_{1}\left(G(x,y)-\ell_{\gamma}(x,z,\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi}))\right)\right], (A.33)
s.t. G(x,y)𝔼[γ(x,z,w^(ξ),λ^(ξ))]ϵ1,12i=1p[Hi(x,y)]+2ϵ22,\displaystyle\quad G(x,y)-\mathbb{E}[\ell_{\gamma}(x,z,\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi}))]\leq\epsilon_{1},\quad\frac{1}{2}\sum_{i=1}^{p}[H_{i}(x,y)]_{+}^{2}\leq\epsilon_{2}^{2},

with any c1LμGϵ1c_{1}\geq\frac{L}{\mu_{G}}\epsilon^{-1}, ϵ2ϵ42c1B\epsilon_{2}\leq\frac{\epsilon}{4\sqrt{2}c_{1}B} and δϵ8c1\delta\leq\frac{\epsilon}{8c_{1}}.

2. By taking c1=c1+2:=LμGϵ1+2c_{1}=c_{1}^{*}+2:=\frac{L}{\mu_{G}}\epsilon^{-1}+2, ϵ2ϵ42c1B\epsilon_{2}\leq\frac{\epsilon}{4\sqrt{2}c_{1}B} and δϵ8c1\delta\leq\frac{\epsilon}{8c_{1}}, for any (w^,λ^)𝒫(δ)(\hat{w},\hat{\lambda})\in\mathcal{P}(\delta), any ϵ\epsilon-global-minima of (A.33) is an ϵ\epsilon-global-minima of the following approximation of BLO:

min(x,y,z)X×Y×Z\displaystyle\min_{(x,y,z)\in X\times Y\times Z} F(x,y)\displaystyle\quad F(x,y) (A.34)
s.t. G(x,y)𝔼[γ(x,z,w^(ξ),λ^(ξ))]ϵ1,12i=1p[Hi(x,y)]+2ϵ22,\displaystyle\quad G(x,y)-\mathbb{E}[\ell_{\gamma}(x,z,\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi}))]\leq\epsilon_{1},\quad\frac{1}{2}\sum_{i=1}^{p}[H_{i}(x,y)]_{+}^{2}\leq\epsilon_{2}^{2},

with some ϵ11716ϵ\epsilon_{1}\leq\frac{17}{16}\epsilon.

Proof 1. By Lemma A.7, we have E(x,z)v(x),zE(x,z)\leq v(x),\forall z. Then for any (x,y,z)X×Y×Z(x,y,z)\in X\times Y\times Z and (w^,λ^)𝒫(δ)(\hat{w},\hat{\lambda})\in\mathcal{P}(\delta), it holds that

𝔼[F(x,y)+c1(G(x,y)γ(x,z,w^(ξ),λ^(ξ)))]F(x,y)\displaystyle\quad\mathbb{E}\left[F(x,y)+c_{1}\left(G(x,y)-\ell_{\gamma}(x,z,\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi}))\right)\right]-F(x^{*},y^{*}) (A.35)
𝔼[F(x,y)+c1(G(x,y)γ(x,z,w^(ξ),λ^(ξ)))]F(x,y(x))\displaystyle\geq\mathbb{E}\left[F(x,y)+c_{1}\left(G(x,y)-\ell_{\gamma}(x,z,\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi}))\right)\right]-F(x,y^{*}(x))
=F(x,y)F(x,y(x))+c1(G(x,y)E(x,z))c1(𝔼[γ(x,z,w^(ξ),λ^(ξ))]E(x,z))\displaystyle=F(x,y)-F(x,y^{*}(x))+c_{1}(G(x,y)-E(x,z))-c_{1}\left(\mathbb{E}[\ell_{\gamma}(x,z,\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi}))]-E(x,z)\right)
F(x,y)F(x,y(x))+c1(G(x,y)v(x))c1δ(by E(x,z)v(x) and (2.6))\displaystyle\geq F(x,y)-F(x,y^{*}(x))+c_{1}(G(x,y)-v(x))-c_{1}\delta\quad\text{(by $E(x,z)\leq v(x)$ and~\eqref{eq: P delta})}
LF2c1μGc12Bϵ2c1δ(by (A.23)).\displaystyle\geq-\frac{L_{F}}{2c_{1}\mu_{G}}-c_{1}\sqrt{2}B\epsilon_{2}-c_{1}\delta\quad\text{(by~\eqref{eq: proof of equivalence of single level 1 1})}.

By (2.6), it holds that

𝔼[G(x,y)γ(x,z,w^(ξ),λ^(ξ))]δ.\displaystyle\mathbb{E}[G(x^{*},y^{*})-\ell_{\gamma}(x^{*},z^{*},\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi}))]\geq-\delta. (A.36)

Combining this with (A.35) gives

\displaystyle\quad\mathbb{E}\left[F(x,y)+c_{1}\left(G(x,y)-\ell_{\gamma}(x,z,\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi}))\right)\right]-\mathbb{E}\left[F(x^{*},y^{*})+c_{1}\left(G(x^{*},y^{*})-\ell_{\gamma}(x^{*},z^{*},\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi}))\right)\right]
LF2c1μGc12Bϵ22c1δ.\displaystyle\geq-\frac{L_{F}}{2c_{1}\mu_{G}}-c_{1}\sqrt{2}B\epsilon_{2}-2c_{1}\delta.

By taking any c1LμGϵ1c_{1}\geq\frac{L}{\mu_{G}}\epsilon^{-1}, ϵ2ϵ42c1B\epsilon_{2}\leq\frac{\epsilon}{4\sqrt{2}c_{1}B} and δϵ8c1\delta\leq\frac{\epsilon}{8c_{1}}, the right-hand side of the above inequality is at least ϵ-\epsilon. For the optimal solution (x,y)(x^{*},y^{*}) of (1.2) and any (w^,λ^)𝒫(δ)(\hat{w},\hat{\lambda})\in\mathcal{P}(\delta), by taking zargmaxz+pγ1(x,y,z)z^{*}\in\arg\max_{z\in\mathbb{R}_{+}^{p}}\mathcal{L}_{\gamma_{1}}(x^{*},y^{*},z), we have E(x,z)=v(x)=G(x,y)E(x^{*},z^{*})=v(x^{*})=G(x^{*},y^{*}). Lemma A.7.3 implies that such a zZz^{*}\in Z exists. Hence (x,y,z)(x^{*},y^{*},z^{*}) is an ϵ\epsilon-optimal solution of (A.33).

2. For any ϵ\epsilon-optimal solution (x¯,y¯,z¯)(\bar{x},\bar{y},\bar{z}) of (A.33), it holds that

𝔼[F(x¯,y¯)+c1(G(x¯,y¯)γ(x¯,z¯,w^(ξ),λ^(ξ)))]\displaystyle\quad\mathbb{E}\left[F(\bar{x},\bar{y})+c_{1}\left(G(\bar{x},\bar{y})-\ell_{\gamma}(\bar{x},\bar{z},\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi}))\right)\right]
𝔼[F(x,y)+c1(G(x,y)γ(x,z,w^(ξ),λ^(ξ)))]+ϵ(by ϵ-optimality)\displaystyle\leq\mathbb{E}\left[F(x^{*},y^{*})+c_{1}\left(G(x^{*},y^{*})-\ell_{\gamma}(x^{*},z^{*},\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi}))\right)\right]+\epsilon\quad\text{(by $\epsilon$-optimality)}
𝔼[F(x,y)]+c1δ+ϵ(by (A.36))\displaystyle\leq\mathbb{E}\left[F(x^{*},y^{*})\right]+c_{1}\delta+\epsilon\quad\text{(by \eqref{eq: proof of equivalence of partial penalized stochastic single level 4})}
𝔼[F(x¯,y¯)+c1(G(x¯,y¯)γ(x¯,z¯,w^(ξ),λ^(ξ)))]+c1δ+2ϵ,(by (A.35) with c1 taken).\displaystyle\leq\mathbb{E}\left[F(\bar{x},\bar{y})+c_{1}^{*}\left(G(\bar{x},\bar{y})-\ell_{\gamma}(\bar{x},\bar{z},\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi}))\right)\right]+c_{1}\delta+2\epsilon,\quad\text{{(by~\eqref{eq: proof of equivalence of partial penalized stochastic single level 3} with $c_{1}^{*}$ taken).}}

Then we have

G(x¯,y¯)𝔼[γ(x¯,z¯,w^(ξ),λ^(ξ))]2ϵ+c1δc1c12ϵ+ϵ82=1716ϵ.G(\bar{x},\bar{y})-\mathbb{E}[\ell_{\gamma}(\bar{x},\bar{z},\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi}))]\leq\frac{2\epsilon+c_{1}\delta}{c_{1}-c_{1}^{*}}\leq\frac{2\epsilon+\frac{\epsilon}{8}}{2}=\frac{17}{16}\epsilon. (A.37)

Take ϵ1=G(x¯,y¯)𝔼[γ(x¯,z¯,w^(ξ),λ^(ξ))]\epsilon_{1}=G(\bar{x},\bar{y})-\mathbb{E}[\ell_{\gamma}(\bar{x},\bar{z},\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi}))]. This implies (x¯,y¯,z¯)(\bar{x},\bar{y},\bar{z}) is feasible for (A.34). For any feasible solution (x,y,z)(x,y,z) of (A.34), the ϵ\epsilon-optimality of (x¯,y¯,z¯)(\bar{x},\bar{y},\bar{z}) implies

F(x¯,y¯)F(x,y)\displaystyle F(\bar{x},\bar{y})-F(x,y) c1(G(x,y)𝔼[γ(x,z,w^(ξ),λ^(ξ))])\displaystyle\leq c_{1}\left(G(x,y)-\mathbb{E}[\ell_{\gamma}(x,z,\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi}))]\right) (A.38)
c1(G(x¯,y¯)𝔼[γ(x¯,z¯,w^(ξ),λ^(ξ))])+ϵ\displaystyle\quad-c_{1}\left(G(\bar{x},\bar{y})-\mathbb{E}[\ell_{\gamma}(\bar{x},\bar{z},\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi}))]\right)+\epsilon
=c1(G(x,y)𝔼[γ(x,z,w^(ξ),λ^(ξ))]ϵ1)+ϵϵ.\displaystyle=c_{1}\left(G(x,y)-\mathbb{E}[\ell_{\gamma}(x,z,\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi}))]-\epsilon_{1}\right)+\epsilon\leq\epsilon.

The last inequality follows from the feasibility of (x,y)(x,y). Therefore (x¯,y¯,z¯)(\bar{x},\bar{y},\bar{z}) is a ϵ\epsilon-optimal solution of (A.34). This completes the proof. ∎

Theorem A.4.

Suppose that Assumptions 3.1, 3.2, 3.4 and 3.5 hold and γ1,γ2>0\gamma_{1},\gamma_{2}>0 are fixed parameters. 1. Assume (x,y)(x^{*},y^{*}) is a global solution to (1.2). If 𝒫(δ)\mathcal{P}(\delta) defined in (2.6) is nonempty for any (x,z)X×Z(x,z)\in X\times Z, then for any (w^,λ^)𝒫(δ)(\hat{w},\hat{\lambda})\in\mathcal{P}(\delta), there exists zz^{*} such that (x,y,z)(x^{*},y^{*},z^{*}) is an ϵ\epsilon-global-minima of the following penalized form

min(x,y,z)X×Y×Z\displaystyle\min_{(x,y,z)\in X\times Y\times Z} 𝔼[Ψ(x,y,z,w^,λ^;ξ)]\displaystyle\quad\mathbb{E}[\Psi(x,y,z,\hat{w},\hat{\lambda};\mathbf{\xi})] (A.39)
:=𝔼[F(x,y)+c1(G(x,y)γ(x,z,w^(ξ),λ^(ξ)))+c22i=1p[Hi(x,y)]+2],\displaystyle\quad:=\mathbb{E}\left[F(x,y)+c_{1}(G(x,y)-\ell_{\gamma}(x,z,\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi})))+\frac{c_{2}}{2}\sum_{i=1}^{p}[H_{i}(x,y)]_{+}^{2}\right],

with any c12L3μGϵ1c_{1}\geq\frac{2L}{3\mu_{G}}\epsilon^{-1}, c232(c1)2B2ϵ1c_{2}\geq\frac{3}{2}(c_{1})^{2}B^{2}\epsilon^{-1} and δϵ6c1\delta\leq\frac{\epsilon}{6c_{1}}.

2. By taking c1=c1+2:=2L3μGϵ1+2,c2=c2+2:=32(c1)2B2ϵ1+2c_{1}=c_{1}^{*}+2:=\frac{2L}{3\mu_{G}}\epsilon^{-1}+2,c_{2}=c_{2}^{*}+2:=\frac{3}{2}(c_{1}^{*})^{2}B^{2}\epsilon^{-1}+2 and δϵ6c1\delta\leq\frac{\epsilon}{6c_{1}}, for any (w^,λ^)𝒫(δ)(\hat{w},\hat{\lambda})\in\mathcal{P}(\delta), any ϵ\epsilon-global-minima of (A.39) is an ϵ\epsilon-global-minima of (A.34) with some ϵ1,ϵ21312ϵ\epsilon_{1},\epsilon_{2}\leq\frac{13}{12}\epsilon.

Proof 1. By Lemma A.7, we have E(x,z)v(x),zE(x,z)\leq v(x),\forall z. Then for any (x,y,z)X×Y×Z(x,y,z)\in X\times Y\times Z and (w^,λ^)𝒫(δ)(\hat{w},\hat{\lambda})\in\mathcal{P}(\delta), it holds that

𝔼[Ψ(x,y,z,w^,λ^;ξ)]F(x,y)\displaystyle\quad\mathbb{E}\left[\Psi(x,y,z,\hat{w},\hat{\lambda};\mathbf{\xi})\right]-F(x^{*},y^{*}) (A.40)
𝔼[Ψ(x,y,z,w^,λ^;ξ)]F(x,y(x))\displaystyle\geq\mathbb{E}\left[\Psi(x,y,z,\hat{w},\hat{\lambda};\mathbf{\xi})\right]-F(x,y^{*}(x))
=F(x,y)F(x,y(x))+c1(G(x,y)E(x,z))c1(𝔼[γ(x,z,w^(ξ),λ^(ξ))]E(x,z))\displaystyle=F(x,y)-F(x,y^{*}(x))+c_{1}(G(x,y)-E(x,z))-c_{1}\left(\mathbb{E}[\ell_{\gamma}(x,z,\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi}))]-E(x,z)\right)
+c22i=1p[Hi(x,y)]+2\displaystyle\quad+\frac{c_{2}}{2}\sum_{i=1}^{p}[H_{i}(x,y)]_{+}^{2}
F(x,y)F(x,y(x))+c1(G(x,y)E(x,z))c1δ+c22i=1p[Hi(x,y)]+2(by (2.6))\displaystyle\geq F(x,y)-F(x,y^{*}(x))+c_{1}(G(x,y)-E(x,z))-c_{1}\delta+\frac{c_{2}}{2}\sum_{i=1}^{p}[H_{i}(x,y)]_{+}^{2}\quad\text{(by~\eqref{eq: P delta})}
LF2c1μGc122c2B2c1δ(by (A.29)).\displaystyle\geq-\frac{L_{F}}{2c_{1}\mu_{G}}-\frac{c_{1}^{2}}{2c_{2}}B^{2}-c_{1}\delta\quad\text{(by~\eqref{eq: proof of equivalence of single level 1})}.

By (2.6), it holds that

𝔼[G(x,y)γ(x,z,w^(ξ),λ^(ξ))]δ.\displaystyle\mathbb{E}[G(x^{*},y^{*})-\ell_{\gamma}(x^{*},z^{*},\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi}))]\geq-\delta. (A.41)

Combining this with (A.40) gives

𝔼[Ψ(x,y,z,w^,λ^;ξ)]𝔼[Ψ(x,y,z,w^,λ^;ξ)]LF2c1μGc122c2B22c1δ.\displaystyle\quad\mathbb{E}\left[\Psi(x,y,z,\hat{w},\hat{\lambda};\mathbf{\xi})\right]-\mathbb{E}\left[\Psi(x^{*},y^{*},z^{*},\hat{w},\hat{\lambda};\mathbf{\xi})\right]\geq-\frac{L_{F}}{2c_{1}\mu_{G}}-\frac{c_{1}^{2}}{2c_{2}}B^{2}-2c_{1}\delta.

By taking c12L3μGϵ1c_{1}\geq\frac{2L}{3\mu_{G}}\epsilon^{-1}, c232(c1)2B2ϵ1c_{2}\geq\frac{3}{2}(c_{1})^{2}B^{2}\epsilon^{-1} and δϵ6c1\delta\leq\frac{\epsilon}{6c_{1}}, the right-hand side of the above inequality is at least ϵ-\epsilon. For the optimal solution (x,y)(x^{*},y^{*}) of (1.2) and any (w^,λ^)𝒫(δ)(\hat{w},\hat{\lambda})\in\mathcal{P}(\delta), by taking zargmaxz+pγ1(x,y,z)z^{*}\in\arg\max_{z\in\mathbb{R}_{+}^{p}}\mathcal{L}_{\gamma_{1}}(x^{*},y^{*},z), we have E(x,z)=v(x)=G(x,y)E(x^{*},z^{*})=v(x^{*})=G(x^{*},y^{*}). Lemma A.7.3 implies that such a zZz^{*}\in Z exists. Hence (x,y,z)(x^{*},y^{*},z^{*}) is an ϵ\epsilon-optimal solution of (A.39).

2. For any ϵ\epsilon-optimal solution (x¯,y¯,z¯)(\bar{x},\bar{y},\bar{z}) of (A.39), it holds that

𝔼[Ψ(x¯,y¯,z¯,w^,λ^;ξ)]\displaystyle\quad\mathbb{E}\left[\Psi(\bar{x},\bar{y},\bar{z},\hat{w},\hat{\lambda};\mathbf{\xi})\right]
𝔼[Ψ(x,y,z,w^,λ^;ξ)]+ϵ(by ϵ-optimality)\displaystyle\leq\mathbb{E}\left[\Psi(x^{*},y^{*},z^{*},\hat{w},\hat{\lambda};\mathbf{\xi})\right]+\epsilon\quad\text{(by $\epsilon$-optimality)}
𝔼[F(x,y)]+c1δ+ϵ(by (A.41))\displaystyle\leq\mathbb{E}\left[F(x^{*},y^{*})\right]+c_{1}\delta+\epsilon\quad\text{(by \eqref{eq: proof of equivalence of stochastic single level 4})}
𝔼[F(x¯,y¯)+c1(G(x¯,y¯)γ(x¯,z¯,w^(ξ),λ^(ξ)))+c22i=1p[H(x¯,y¯)]+2]+c1δ+2ϵ,(by (A.40) with c1 taken).\displaystyle\leq\mathbb{E}\left[F(\bar{x},\bar{y})+c_{1}^{*}\left(G(\bar{x},\bar{y})-\ell_{\gamma}(\bar{x},\bar{z},\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi}))\right)+\frac{c_{2}^{*}}{2}\sum_{i=1}^{p}[H(\bar{x},\bar{y})]_{+}^{2}\right]+c_{1}\delta+2\epsilon,\quad\text{{(by~\eqref{eq: proof of equivalence of stochastic single level 3} with $c_{1}^{*}$ taken).}}

Then we have

G(x¯,y¯)𝔼[γ(x¯,z¯,w^(ξ),λ^(ξ))]+12i=1p[H(x¯,y¯)]+22ϵ+c1δ22ϵ+ϵ62=1312ϵ.G(\bar{x},\bar{y})-\mathbb{E}[\ell_{\gamma}(\bar{x},\bar{z},\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi}))]+\frac{1}{2}\sum_{i=1}^{p}[H(\bar{x},\bar{y})]_{+}^{2}\leq\frac{2\epsilon+c_{1}\delta}{2}\leq\frac{2\epsilon+\frac{\epsilon}{6}}{2}=\frac{13}{12}\epsilon. (A.42)

Take ϵ1=G(x¯,y¯)𝔼[γ(x¯,z¯,w^(ξ),λ^(ξ))]\epsilon_{1}=G(\bar{x},\bar{y})-\mathbb{E}[\ell_{\gamma}(\bar{x},\bar{z},\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi}))], ϵ2=12i=1p[H(x¯,y¯)]+2\epsilon_{2}=\frac{1}{2}\sum_{i=1}^{p}[H(\bar{x},\bar{y})]_{+}^{2}. This implies (x¯,y¯,z¯)(\bar{x},\bar{y},\bar{z}) is feasible for (A.34). For any feasible solution (x,y,z)(x,y,z) of (A.34), the ϵ\epsilon-optimality of (x¯,y¯,z¯)(\bar{x},\bar{y},\bar{z}) implies

F(x¯,y¯)F(x,y)\displaystyle\quad F(\bar{x},\bar{y})-F(x,y) (A.43)
c1(G(x,y)𝔼[γ(x,z,w^(ξ),λ^(ξ))])c1(G(x¯,y¯)𝔼[γ(x¯,z¯,w^(ξ),λ^(ξ))])\displaystyle\leq c_{1}\left(G(x,y)-\mathbb{E}[\ell_{\gamma}(x,z,\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi}))]\right)-c_{1}\left(G(\bar{x},\bar{y})-\mathbb{E}[\ell_{\gamma}(\bar{x},\bar{z},\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi}))]\right)
+c22i=1p[Hi(x,y)]+2c22i=1p[Hi(x¯,y¯)]+2+ϵ\displaystyle\quad+\frac{c_{2}}{2}\sum_{i=1}^{p}[H_{i}(x,y)]_{+}^{2}-\frac{c_{2}}{2}\sum_{i=1}^{p}[H_{i}(\bar{x},\bar{y})]_{+}^{2}+\epsilon
=c1(G(x,y)𝔼[γ(x,z,w^(ξ),λ^(ξ))]ϵ1)+c2(12i=1p[Hi(x,y)]+212i=1p[Hi(x¯,y¯)]+2)+ϵ\displaystyle=c_{1}\left(G(x,y)-\mathbb{E}[\ell_{\gamma}(x,z,\hat{w}(\mathbf{\xi}),\hat{\lambda}(\mathbf{\xi}))]-\epsilon_{1}\right)+c_{2}\left(\frac{1}{2}\sum_{i=1}^{p}[H_{i}(x,y)]_{+}^{2}-\frac{1}{2}\sum_{i=1}^{p}[H_{i}(\bar{x},\bar{y})]_{+}^{2}\right)+\epsilon
ϵ.\displaystyle\leq\epsilon.

The last inequality follows from the feasibility of (x,y,z)(x,y,z). Therefore (x¯,y¯,z¯)(\bar{x},\bar{y},\bar{z}) is a ϵ\epsilon-optimal solution of (A.34). This completes the proof. ∎

Appendix B Analysis on the inner loop

In this section, we analyze the convergence of Algorithm 1. Assume (x,y,z)=(xk1,yk1,zk1)(x,y,z)=(x^{k-1},y^{k-1},z^{k-1}) and γ1,γ2{\gamma_{1}},\gamma_{2} are fixed. The expectation in this section denotes the expectation conditioned on ~k1\tilde{\mathcal{F}}_{k-1}, that is, 𝔼[]𝔼[|~k1]\mathbb{E}[\cdot]\triangleq\mathbb{E}[\cdot|\tilde{\mathcal{F}}_{k-1}]. Since kk is fixed, we abbreviate wk,jw^{k,j} as wjw^{j} and λk,j\lambda^{k,j} as λj\lambda^{j} for j=0,1,j=0,1,... in this section.

We write (w,λ)=(w(x,z),λ(x,z))(w^{*},\lambda^{*})=(w^{*}(x,z),\lambda^{*}(x,z)) as defined in (2.5a) and γ(x,z,w,λ)\ell_{\gamma}(x,z,w,\lambda) as defined in (2.5b). By Proposition A.2, the optimal solution (w,λ)(w^{*},\lambda^{*}) is a saddle point of γ(x,z,w,λ)\ell_{\gamma}(x,z,w,\lambda), i.e.,

γ(x,z,w,λ)γ(x,z,w,λ)γ(x,z,w,λ),wY,λ+p.\ell_{\gamma}(x,z,w^{*},\lambda)\leq\ell_{\gamma}(x,z,w^{*},\lambda^{*})\leq\ell_{\gamma}(x,z,w,\lambda^{*}),\quad\forall w\in Y,\lambda\in\mathbb{R}_{+}^{p}. (B.1)

To analyze the convergence of the inner loop, we first establish a one-step decrease property of the inner algorithm. The following lemma gives this property for the gradient descent and ascent steps.

Lemma B.1.

The following relationships hold by taking primal step (2.8a) and dual step (2.8b), respectively:

𝔼ξ1,,ξj[γ(x,z,wj,λj)]𝔼[γ(x,z,w,λj)]+12ηj𝔼ξ1,,ξj[wj+1w2]\displaystyle\quad\mathbb{E}_{\xi_{1},...,\xi_{j}}[\ell_{\gamma}(x,z,w^{j},\lambda^{j})]-\mathbb{E}[\ell_{\gamma}(x,z,w,\lambda^{j})]+\frac{1}{2\eta_{j}}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|w^{j+1}-w\|^{2}] (B.2)
12(1ηjμG)𝔼ξ1,,ξj[wjw2]+ηj2𝔼ξ1,,ξj[wγ(x,z,wj,λj;ξj)2],wY,\displaystyle\leq\frac{1}{2}(\frac{1}{\eta_{j}}-\mu_{G})\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|w^{j}-w\|^{2}]+\frac{\eta_{j}}{2}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|\nabla_{w}\ell_{\gamma}(x,z,w^{j},\lambda^{j};\xi_{j})\|^{2}],\quad\forall w\in Y,

and

γ(x,z,wj,λ)γ(x,z,wj,λj)+12ρjλj+1λ2\displaystyle\quad\ell_{\gamma}(x,z,w^{j},\lambda)-\ell_{\gamma}(x,z,w^{j},\lambda^{j})+\frac{1}{2\rho_{j}}\|\lambda^{j+1}-\lambda\|^{2} (B.3)
12(1ρjγ2)λjλ2+ρj2λγ(x,z,wj,λj)2,λ+p.\displaystyle\leq\frac{1}{2}(\frac{1}{\rho_{j}}-\gamma_{2})\|\lambda^{j}-\lambda\|^{2}+\frac{\rho_{j}}{2}\|\nabla_{\lambda}\ell_{\gamma}(x,z,w^{j},\lambda^{j})\|^{2},\quad\forall\lambda\in\mathbb{R}_{+}^{p}.

Here the notation 𝔼ξ1,,ξj[]\mathbb{E}_{\xi_{1},...,\xi_{j}}[\cdot] is the abbreviation of 𝔼ξ1𝒟ξ,,ξj𝒟ξ[]\mathbb{E}_{\xi_{1}\sim\mathcal{D}_{\xi},...,\xi_{j}\sim\mathcal{D}_{\xi}}[\cdot].

Proof The projected gradient step (2.8a) gives

wj+1w,wj+1(wjηjyγ1(x,wj,λj;ξj))0.\langle w^{j+1}-w,w^{j+1}-(w^{j}-\eta_{j}\nabla_{y}\mathcal{L}_{{\gamma_{1}}}(x,w^{j},\lambda^{j};\xi_{j}))\rangle\leq 0. (B.4)

This implies

wj+1w,yγ1(x,wj,λj;ξj)1ηjwj+1wj,wj+1w.\langle w^{j+1}-w,\nabla_{y}\mathcal{L}_{\gamma_{1}}(x,w^{j},\lambda^{j};\xi_{j})\rangle\leq-\frac{1}{\eta_{j}}\langle w^{j+1}-w^{j},w^{j+1}-w\rangle. (B.5)

By Young’s inequality, we have

wj+1wj,yγ1(x,wj,λj;ξj)\displaystyle\langle w^{j+1}-w^{j},\nabla_{y}\mathcal{L}_{\gamma_{1}}(x,w^{j},\lambda^{j};\xi_{j})\rangle 12ηjwj+1wj2ηj2yγ1(x,wj,λj;ξj)2.\displaystyle\geq-\frac{1}{2\eta_{j}}\|w^{j+1}-w^{j}\|^{2}-\frac{\eta_{j}}{2}\|\nabla_{y}\mathcal{L}_{\gamma_{1}}(x,w^{j},\lambda^{j};\xi_{j})\|^{2}. (B.6)

By the strong convexity of γ1(x,w,λ)\mathcal{L}_{\gamma_{1}}(x,w,\lambda) with respect to ww, we have

wjw,yγ1(x,wj,λj)γ1(x,wj,λj)γ1(x,w,λj)+μG2wjw2.\langle w^{j}-w,\nabla_{y}\mathcal{L}_{\gamma_{1}}(x,w^{j},\lambda^{j})\rangle\geq\mathcal{L}_{\gamma_{1}}(x,w^{j},\lambda^{j})-\mathcal{L}_{\gamma_{1}}(x,w,\lambda^{j})+\frac{\mu_{G}}{2}\|w^{j}-w\|^{2}. (B.7)

Note that wjw^{j} is independent of ξj\xi_{j}, and hence

𝔼ξ1,,ξj[wjw,yγ1(x,wj,λj;ξj)]\displaystyle\quad\mathbb{E}_{\xi_{1},...,\xi_{j}}[\langle w^{j}-w,\nabla_{y}\mathcal{L}_{\gamma_{1}}(x,w^{j},\lambda^{j};\xi_{j})\rangle] (B.8)
=𝔼ξ1,,ξj1[𝔼ξj[wjw,yγ1(x,wj,λj;ξj)|ξ1,,ξj1]]\displaystyle=\mathbb{E}_{\xi_{1},...,\xi_{j-1}}[\mathbb{E}_{\xi_{j}}[\langle w^{j}-w,\nabla_{y}\mathcal{L}_{\gamma_{1}}(x,w^{j},\lambda^{j};\xi_{j})\rangle|\xi_{1},.,\xi_{j-1}]]
=𝔼ξ1,,ξj[wjw,yγ1(x,wj,λj)].\displaystyle=\mathbb{E}_{\xi_{1},...,\xi_{j}}[\langle w^{j}-w,\nabla_{y}\mathcal{L}_{\gamma_{1}}(x,w^{j},\lambda^{j})\rangle].

Summing up (B.6) and (B.7), combining with (B.8), and then taking expectations gives

𝔼ξ1,,ξj[wj+1w,yγ1(x,wj,λj;ξj)]\displaystyle\quad\mathbb{E}_{\xi_{1},...,\xi_{j}}[\langle w^{j+1}-w,\nabla_{y}\mathcal{L}_{\gamma_{1}}(x,w^{j},\lambda^{j};\xi_{j})\rangle]
12ηj𝔼ξ1,,ξj[wj+1wj2]ηj2𝔼ξ1,,ξj[yγ1(x,wj,λj;ξj)2]\displaystyle\geq-\frac{1}{2\eta_{j}}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|w^{j+1}-w^{j}\|^{2}]-\frac{\eta_{j}}{2}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|\nabla_{y}\mathcal{L}_{\gamma_{1}}(x,w^{j},\lambda^{j};\xi_{j})\|^{2}]
+𝔼ξ1,,ξj[γ1(x,wj,λj)]𝔼ξ1,,ξj[γ1(x,w,λj)]+μG2𝔼ξ1,,ξj[wjw2]\displaystyle+\mathbb{E}_{\xi_{1},...,\xi_{j}}[\mathcal{L}_{\gamma_{1}}(x,w^{j},\lambda^{j})]-\mathbb{E}_{\xi_{1},...,\xi_{j}}[\mathcal{L}_{\gamma_{1}}(x,w,\lambda^{j})]+\frac{\mu_{G}}{2}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|w^{j}-w\|^{2}]
=12ηj𝔼ξ1,,ξj[wj+1wj2]ηj2𝔼ξ1,,ξj[wγ(x,z,wj,λj;ξj)2]\displaystyle=-\frac{1}{2\eta_{j}}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|w^{j+1}-w^{j}\|^{2}]-\frac{\eta_{j}}{2}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|\nabla_{w}\ell_{\gamma}(x,z,w^{j},\lambda^{j};\xi_{j})\|^{2}]
+𝔼ξ1,,ξj[γ(x,z,wj,λj)]𝔼ξ1,,ξj[γ(x,z,w,λj)]+μG2𝔼ξ1,,ξj[wjw2].\displaystyle+\mathbb{E}_{\xi_{1},...,\xi_{j}}[\ell_{\gamma}(x,z,w^{j},\lambda^{j})]-\mathbb{E}_{\xi_{1},...,\xi_{j}}[\ell_{\gamma}(x,z,w,\lambda^{j})]+\frac{\mu_{G}}{2}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|w^{j}-w\|^{2}].

Utilizing the identity wj+1wj2+wj+1w2wjw22wj+1w,wj+1wj=0\|w^{j+1}-w^{j}\|^{2}+\|w^{j+1}-w\|^{2}-\|w^{j}-w\|^{2}-2\langle w^{j+1}-w,w^{j+1}-w^{j}\rangle=0 and (B.5), we have

𝔼ξ1,,ξj[γ(x,z,wj,λj)]𝔼ξ1,,ξj[γ(x,z,w,λj)]\displaystyle\quad\mathbb{E}_{\xi_{1},...,\xi_{j}}[\ell_{\gamma}(x,z,w^{j},\lambda^{j})]-\mathbb{E}_{\xi_{1},...,\xi_{j}}[\ell_{\gamma}(x,z,w,\lambda^{j})]
𝔼ξ1,,ξj[wj+1w,yγ1(x,wj,λj;ξj)]+12ηj𝔼ξ1,,ξj[wj+1wj2]\displaystyle\leq\mathbb{E}_{\xi_{1},...,\xi_{j}}[\langle w^{j+1}-w,\nabla_{y}\mathcal{L}_{\gamma_{1}}(x,w^{j},\lambda^{j};\xi_{j})\rangle]+\frac{1}{2\eta_{j}}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|w^{j+1}-w^{j}\|^{2}]
+ηj2𝔼ξ1,,ξj[wγ(x,z,wj,λj;ξj)2]μG2𝔼ξ1,,ξj[wjw2]\displaystyle\quad+\frac{\eta_{j}}{2}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|\nabla_{w}\ell_{\gamma}(x,z,w^{j},\lambda^{j};\xi_{j})\|^{2}]-\frac{\mu_{G}}{2}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|w^{j}-w\|^{2}]
1ηj𝔼ξ1,,ξj[wj+1wj,wj+1w]+12ηj𝔼ξ1,,ξj[wj+1wj2]\displaystyle\leq-\frac{1}{\eta_{j}}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\langle w^{j+1}-w^{j},w^{j+1}-w\rangle]+\frac{1}{2\eta_{j}}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|w^{j+1}-w^{j}\|^{2}]
μG2𝔼ξ1,,ξj[wjw2]+ηj2𝔼ξ1,,ξj[wγ(x,z,wj,λj;ξj)2]\displaystyle\quad-\frac{\mu_{G}}{2}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|w^{j}-w\|^{2}]+\frac{\eta_{j}}{2}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|\nabla_{w}\ell_{\gamma}(x,z,w^{j},\lambda^{j};\xi_{j})\|^{2}]
=12(1ηjμG)𝔼ξ1,,ξj[wjw2]12ηj𝔼ξ1,,ξj[wj+1w2]\displaystyle=\frac{1}{2}(\frac{1}{\eta_{j}}-\mu_{G})\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|w^{j}-w\|^{2}]-\frac{1}{2\eta_{j}}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|w^{j+1}-w\|^{2}]
+ηj2𝔼ξ1,,ξj[wγ(x,z,wj,λj;ξj)2].\displaystyle\quad+\frac{\eta_{j}}{2}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|\nabla_{w}\ell_{\gamma}(x,z,w^{j},\lambda^{j};\xi_{j})\|^{2}].

This gives (B.2). Since γ1(x,w,λ)\mathcal{L}_{\gamma_{1}}(x,w,\lambda) is concave in λ\lambda, the function γ(x,z,w,λ)\ell_{\gamma}(x,z,w,\lambda) is γ2\gamma_{2}-strongly concave in λ\lambda. Similarly, we can obtain (B.3) by taking the dual step (2.8b). This completes the proof. ∎

Combining the primal and dual steps, the following corollary shows the decrease property in terms of the duality gap.

Corollary B.1.

For any (w,λ)Y×+p(w,\lambda)\in Y\times\mathbb{R}_{+}^{p} it holds that

𝔼ξ1,,ξj[γ(x,z,wj,λ)γ(x,z,w,λj)]+12ηj𝔼ξ1,,ξj[wj+1w2]\displaystyle\quad\mathbb{E}_{\xi_{1},...,\xi_{j}}[\ell_{\gamma}(x,z,w^{j},\lambda)-\ell_{\gamma}(x,z,w,\lambda^{j})]+\frac{1}{2\eta_{j}}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|w^{j+1}-w\|^{2}] (B.9)
+12ρj𝔼ξ1,,ξj[λj+1λ2]\displaystyle\quad+\frac{1}{2\rho_{j}}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|\lambda^{j+1}-\lambda\|^{2}]
12(1ηjμG)𝔼ξ1,,ξj[wjw2]+12(1ρjγ2)𝔼ξ1,,ξj[λjλ2]\displaystyle\leq\frac{1}{2}(\frac{1}{\eta_{j}}-\mu_{G})\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|w^{j}-w\|^{2}]+\frac{1}{2}(\frac{1}{\rho_{j}}-\gamma_{2})\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|\lambda^{j}-\lambda\|^{2}]
+ηj2(M,1+M,2𝔼ξ1,,ξj[λj2])+ρj2(4γ2z2+4MH,02+2(γ1+γ2)𝔼ξ1,,ξj[λj2]).\displaystyle\quad+\frac{\eta_{j}}{2}(M_{\mathcal{L},1}+M_{\mathcal{L},2}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|\lambda^{j}\|^{2}])+\frac{\rho_{j}}{2}\left(4\gamma_{2}\|z\|^{2}+4M_{H,0}^{2}+2(\gamma_{1}+\gamma_{2})\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|\lambda^{j}\|^{2}]\right).

Proof From (A.10), we have

λγ(x,z,wj,λj)2\displaystyle\|\nabla_{\lambda}\ell_{\gamma}(x,z,w^{j},\lambda^{j})\|^{2} max(γ2z(γ1+γ2)λj,γ2zγ2λj+H(x,wj))2\displaystyle\leq\|\max(\gamma_{2}z-(\gamma_{1}+\gamma_{2})\lambda^{j},\gamma_{2}z-\gamma_{2}\lambda^{j}+H(x,w^{j}))\|^{2} (B.10)
4γ2z2+4MH,02+2(γ1+γ2)λj2.\displaystyle\leq 4\gamma_{2}\|z\|^{2}+4M_{H,0}^{2}+2(\gamma_{1}+\gamma_{2})\|\lambda^{j}\|^{2}.

Then combining (A.7b), (B.2), (B.3) and (B.10) gives (B.9). ∎

By taking decreasing step sizes, we can prove that the dual variable remains bounded throughout the iterations of the inner loop.

Lemma B.2.

Choose the dual step size as ρj=ρj+1,\rho_{j}=\frac{\rho}{j+1}, where the constant ρ\rho satisfies ρ>1γ2\rho>\frac{1}{\gamma_{2}}; then the following bound holds:

λj2Mλ:=2ρ2γ22z2+2pρ2MH,02,j=1,,s.\|\lambda^{j}\|^{2}\leq M_{\lambda}:=2\rho^{2}\gamma_{2}^{2}\|z\|^{2}+2p\rho^{2}M_{H,0}^{2},\quad j=1,...,s. (B.11)

Proof From (2.8b) and (2.5b) we have

λj+1\displaystyle\lambda^{j+1} =Proj+p(λj+ρj+1(zγ1(x,wj,λj)γ2(λjz)))\displaystyle=\mathrm{Proj}_{\mathbb{R}_{+}^{p}}\left(\lambda^{j}+\frac{\rho}{j+1}(\nabla_{z}\mathcal{L}_{\gamma_{1}}(x,w^{j},\lambda^{j})-\gamma_{2}(\lambda^{j}-z))\right) (B.12)
=Proj+p(λj+ρj+1(max(γ1λj,H(x,wj))γ2λj+γ2z))\displaystyle=\mathrm{Proj}_{\mathbb{R}_{+}^{p}}\left(\lambda^{j}+\frac{\rho}{j+1}(\max(-\gamma_{1}\lambda^{j},H(x,w^{j}))-\gamma_{2}\lambda^{j}+\gamma_{2}z)\right)

Considering the ii-th component of λj+1\lambda^{j+1}, we have

\lambda_{i}^{j+1} =\max\left(0,\lambda_{i}^{j}+\frac{\rho}{j+1}\left(\max(-\gamma_{1}\lambda_{i}^{j},H_{i}(x,w^{j}))-\gamma_{2}\lambda_{i}^{j}+\gamma_{2}z_{i}\right)\right)
\leq\max\left(0,\left(1-\frac{\rho\gamma_{2}}{j+1}\right)\lambda_{i}^{j}\right)+\frac{\rho}{j+1}\left(\gamma_{2}|z_{i}|+\max(0,H_{i}(x,w^{j}))\right).

The inequality uses max(a+b,0)max(a,0)+max(b,0)\max(a+b,0)\leq\max(a,0)+\max(b,0) together with max(γ1λij,Hi(x,wj))max(0,Hi(x,wj))\max(-\gamma_{1}\lambda_{i}^{j},H_{i}(x,w^{j}))\leq\max(0,H_{i}(x,w^{j})), which holds since λij0\lambda_{i}^{j}\geq 0. Combining this with max(0,Hi(x,wj))MH,0\max(0,H_{i}(x,w^{j}))\leq M_{H,0}, we obtain the recursive relation

\lambda^{j+1}_{i}\leq\max\left(0,\left(1-\frac{\rho\gamma_{2}}{j+1}\right)\lambda^{j}_{i}\right)+\frac{\rho}{j+1}\gamma_{2}|z_{i}|+\frac{\rho}{j+1}M_{H,0}.

Multiplying both sides by j+1j+1 and using ρ>1γ2\rho>\frac{1}{\gamma_{2}}, so that (j+1)max(0,1ργ2j+1)j(j+1)\max(0,1-\frac{\rho\gamma_{2}}{j+1})\leq j, we have

(j+1)λij+1jλij+ργ2|zi|+ρMH,0.(j+1)\lambda^{j+1}_{i}\leq j\lambda^{j}_{i}+\rho\gamma_{2}|z_{i}|+\rho M_{H,0}.

Note that the initial point satisfies \lambda^{0}_{i}=0. Hence \lambda^{j}_{i}\leq\rho\gamma_{2}|z_{i}|+\rho M_{H,0} for all j and i, which gives the bound on \lambda^{j} in (B.11). ∎

Now we are ready to establish the convergence of the inner loop. In the following theorem, we show that the output pair (w^{s},\lambda^{s}) is an \widetilde{O}(\frac{1}{s})-optimal point of the lower-level problem, measured by the squared distance to (w^{*},\lambda^{*}).

Theorem B.1.

By taking the step sizes as

ηj=ηj+1,ρj=ρj+1,\eta_{j}=\frac{\eta}{j+1},\quad\rho_{j}=\frac{\rho}{j+1}, (B.13)

with constants \eta\geq\frac{1}{\mu_{G}} and \rho\geq\frac{1}{\gamma_{2}}, there exist constants \phi_{1},\phi_{2}>0 such that

𝔼ξ[wsw2]\displaystyle\mathbb{E}_{\mathbf{\xi}}[\|w^{s}-w^{*}\|^{2}] ϕ11+log(s)s,\displaystyle\leq{\phi}_{1}\frac{1+\log(s)}{s}, (B.14)
𝔼ξ[λsλ2]\displaystyle\mathbb{E}_{\mathbf{\xi}}[\|\lambda^{s}-\lambda^{*}\|^{2}] ϕ21+log(s)s,\displaystyle\leq{\phi}_{2}\frac{1+\log(s)}{s},

where ϕ1=η(ηM,1+ρ(4γ2z2+4MH,02)+(ηM,2+2ρ(γ1+γ2))Mλ)\phi_{1}=\eta\left(\eta M_{\mathcal{L},1}+\rho(4\gamma_{2}\|z\|^{2}+4M_{H,0}^{2})+(\eta M_{\mathcal{L},2}+2\rho(\gamma_{1}+\gamma_{2}))M_{\lambda}\right), ϕ2=ρηϕ1{\phi}_{2}=\frac{\rho}{\eta}\phi_{1}.

Proof Taking ηj=ηj+1\eta_{j}=\frac{\eta}{j+1}, ρj=ρj+1\rho_{j}=\frac{\rho}{j+1} in (B.9) gives

\displaystyle\quad\mathbb{E}_{\xi_{1},...,\xi_{j}}[\ell_{\gamma}(x,z,w^{j},\lambda)-\ell_{\gamma}(x,z,w,\lambda^{j})]+\frac{j+1}{2\eta}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|w^{j+1}-w\|^{2}] (B.15)
\displaystyle\quad+\frac{j+1}{2\rho}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|\lambda^{j+1}-\lambda\|^{2}]
\displaystyle\leq\frac{1}{2}(\frac{j+1}{\eta}-\mu_{G})\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|w^{j}-w\|^{2}]+\frac{1}{2}(\frac{j+1}{\rho}-\gamma_{2})\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|\lambda^{j}-\lambda\|^{2}]
\displaystyle\quad+\frac{\eta}{2(j+1)}(M_{\mathcal{L},1}+M_{\mathcal{L},2}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|\lambda^{j}\|^{2}])
\displaystyle\quad+\frac{\rho}{2(j+1)}\left(4\gamma_{2}\|z\|^{2}+4M_{H,0}^{2}+2(\gamma_{1}+\gamma_{2})\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|\lambda^{j}\|^{2}]\right)
\displaystyle\leq\frac{j}{2\eta}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|w^{j}-w\|^{2}]+\frac{j}{2\rho}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|\lambda^{j}-\lambda\|^{2}]
\displaystyle\quad+\frac{\eta}{2(j+1)}(M_{\mathcal{L},1}+M_{\mathcal{L},2}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|\lambda^{j}\|^{2}])
\displaystyle\quad+\frac{\rho}{2(j+1)}\left(4\gamma_{2}\|z\|^{2}+4M_{H,0}^{2}+2(\gamma_{1}+\gamma_{2})\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|\lambda^{j}\|^{2}]\right).

Then taking w=w,λ=λw=w^{*},\lambda=\lambda^{*} and utilizing γ(x,z,wj,λ)γ(x,z,w,λ)γ(x,z,w,λj)\ell_{\gamma}(x,z,w^{j},\lambda^{*})\geq\ell_{\gamma}(x,z,w^{*},\lambda^{*})\geq\ell_{\gamma}(x,z,w^{*},\lambda^{j}) gives

j+12η𝔼ξ1,,ξj[wj+1w2]+j+12ρ𝔼ξ1,,ξj[λj+1λ2]\displaystyle\quad\frac{j+1}{2\eta}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|w^{j+1}-w^{*}\|^{2}]+\frac{j+1}{2\rho}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|\lambda^{j+1}-\lambda^{*}\|^{2}] (B.16)
j2η𝔼ξ1,,ξj[wjw2]+j2ρ𝔼ξ1,,ξj[λjλ2]\displaystyle\leq\frac{j}{2\eta}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|w^{j}-w^{*}\|^{2}]+\frac{j}{2\rho}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|\lambda^{j}-\lambda^{*}\|^{2}]
+η2(j+1)(M,1+M,2𝔼ξ1,,ξj[λj2])\displaystyle\quad+\frac{\eta}{2(j+1)}(M_{\mathcal{L},1}+M_{\mathcal{L},2}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|\lambda^{j}\|^{2}])
+ρ2(j+1)(4γ2z2+4MH,02+2(γ1+γ2)𝔼ξ1,,ξj[λj2]).\displaystyle\quad+\frac{\rho}{2(j+1)}\left(4\gamma_{2}\|z\|^{2}+4M_{H,0}^{2}+2(\gamma_{1}+\gamma_{2})\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|\lambda^{j}\|^{2}]\right).

This recursive relationship gives

j2η𝔼ξ1,,ξj[wjw2]+j2ρ𝔼ξ1,,ξj[λjλ2]\displaystyle\quad\frac{j}{2\eta}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|w^{j}-w^{*}\|^{2}]+\frac{j}{2\rho}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|\lambda^{j}-\lambda^{*}\|^{2}] (B.17)
\displaystyle\leq\sum_{l=0}^{j-1}\frac{\eta}{2(l+1)}(M_{\mathcal{L},1}+M_{\mathcal{L},2}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|\lambda^{l}\|^{2}])+\sum_{l=0}^{j-1}\frac{\rho}{2(l+1)}\left(4\gamma_{2}\|z\|^{2}+4M_{H,0}^{2}+2(\gamma_{1}+\gamma_{2})\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|\lambda^{l}\|^{2}]\right)
\displaystyle\leq\frac{1}{2}(1+\log(j))\left(\eta M_{\mathcal{L},1}+\rho(4\gamma_{2}\|z\|^{2}+4M_{H,0}^{2})\right)
\displaystyle\quad+\frac{1}{2}(\eta M_{\mathcal{L},2}+2\rho(\gamma_{1}+\gamma_{2}))\sum_{l=0}^{j-1}\frac{1}{l+1}\mathbb{E}_{\xi_{1},...,\xi_{j}}[\|\lambda^{l}\|^{2}].

Substituting (B.11) into  (B.17) and taking j=sj=s we have

\displaystyle\quad\frac{s}{2\eta}\mathbb{E}_{\mathbf{\xi}}[\|w^{s}-w^{*}\|^{2}]+\frac{s}{2\rho}\mathbb{E}_{\mathbf{\xi}}[\|\lambda^{s}-\lambda^{*}\|^{2}] (B.18)
12(1+log(s))(ηM,1+ρ(4γ2z2+4MH,02)+(ηM,2+2ρ(γ1+γ2))Mλ).\displaystyle\leq\frac{1}{2}(1+\log(s))\left(\eta M_{\mathcal{L},1}+\rho(4\gamma_{2}\|z\|^{2}+4M_{H,0}^{2})+(\eta M_{\mathcal{L},2}+2\rho(\gamma_{1}+\gamma_{2}))M_{\lambda}\right).

This gives the bounds on \mathbb{E}_{\mathbf{\xi}}[\|w^{s}-w^{*}\|^{2}] and \mathbb{E}_{\mathbf{\xi}}[\|\lambda^{s}-\lambda^{*}\|^{2}] stated in (B.14). ∎
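To make the iteration analyzed above concrete, the following minimal Python sketch runs the inner primal-dual loop with the decreasing step sizes (B.13). It is illustrative only: the gradient callables, the projection onto Y, and the sampling of \xi are assumed placeholders supplied by the user, not the implementation used in the paper.

```python
import numpy as np

def inner_loop(grad_w_ell, grad_lam_ell, proj_Y, w0, lam0, s, eta, rho, rng):
    """Illustrative inner primal-dual loop with step sizes (B.13).

    Assumed placeholders: grad_w_ell(w, lam, xi) and grad_lam_ell(w, lam)
    return (stochastic) gradients of ell_gamma(x, z, ., .) with (x, z) fixed;
    proj_Y projects onto the compact set Y.
    """
    w, lam = np.array(w0, dtype=float), np.array(lam0, dtype=float)
    for j in range(s):
        eta_j = eta / (j + 1)        # primal step size eta_j = eta / (j + 1)
        rho_j = rho / (j + 1)        # dual step size rho_j = rho / (j + 1)
        xi = rng.standard_normal()   # stand-in for a sample from D_xi
        w_next = proj_Y(w - eta_j * grad_w_ell(w, lam, xi))
        # dual ascent evaluated at the old w^j, then projection onto R_+^p
        lam = np.maximum(lam + rho_j * grad_lam_ell(w, lam), 0.0)
        w = w_next
    return w, lam
```

Note that the dual step uses the old primal iterate w^{j}, and the projection onto \mathbb{R}_{+}^{p} combined with the O(1/(j+1)) dual step size is exactly what yields the boundedness of \lambda^{j} in (B.11).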

Measuring the suboptimality by the objective gap instead, a similar convergence rate can be obtained as follows.

Corollary B.2.

Under the same conditions as in Theorem B.1, it holds that

|𝔼ξ[γ(x,z,ws,λs)]E(x,z)|O~(1s).\left|\mathbb{E}_{\mathbf{\xi}}[\ell_{\gamma}(x,z,w^{s},\lambda^{s})]-E(x,z)\right|\leq\widetilde{O}\left(\frac{1}{s}\right). (B.19)

Proof From (A.9) and (A.10), we see that \nabla_{w}\ell_{\gamma}(x,z,w,\lambda) is Lipschitz continuous in w with modulus L_{\ell,w}=L_{G}+p(\sqrt{M_{\lambda}}+\gamma_{1}^{-1}M_{H,0})L_{H}+M_{H,1}^{2}, and \nabla_{\lambda}\ell_{\gamma}(x,z,w,\lambda) is Lipschitz continuous in \lambda with modulus L_{\ell,\lambda}=\gamma_{1}+\gamma_{2}, given the bound \|\lambda\|^{2}\leq M_{\lambda}. Then, under the strongly-convex-strongly-concave condition established in Lemma A.4, the objective gap

\mathbb{E}_{\mathbf{\xi}}[\ell_{\gamma}(x,z,w,\lambda^{*})]-\mathbb{E}_{\mathbf{\xi}}[\ell_{\gamma}(x,z,w^{*},\lambda)]=\mathbb{E}_{\mathbf{\xi}}[\ell_{\gamma}(x,z,w,\lambda^{*})-\ell_{\gamma}(x,z,w^{*},\lambda^{*})]+\mathbb{E}_{\mathbf{\xi}}[\ell_{\gamma}(x,z,w^{*},\lambda^{*})-\ell_{\gamma}(x,z,w^{*},\lambda)]

has a relationship to 𝔼ξ[ww2]+𝔼ξ[λλ2]\mathbb{E}_{\mathbf{\xi}}[\|w-w^{*}\|^{2}]+\mathbb{E}_{\mathbf{\xi}}[\|\lambda-\lambda^{*}\|^{2}] as follows:

μG2𝔼ξ[ww2]+γ22𝔼ξ[λλ2]\displaystyle\frac{\mu_{G}}{2}\mathbb{E}_{\mathbf{\xi}}[\|w-w^{*}\|^{2}]+\frac{\gamma_{2}}{2}\mathbb{E}_{\mathbf{\xi}}[\|\lambda-\lambda^{*}\|^{2}] |𝔼ξ[γ(x,z,w,λ)]𝔼ξ[γ(x,z,w,λ)]|\displaystyle\leq\left|\mathbb{E}_{\mathbf{\xi}}[\ell_{\gamma}(x,z,w,\lambda^{*})]-\mathbb{E}_{\mathbf{\xi}}[\ell_{\gamma}(x,z,w^{*},\lambda)]\right|
\displaystyle\leq\frac{L_{\ell,w}}{2}\mathbb{E}_{\mathbf{\xi}}[\|w-w^{*}\|^{2}]+\frac{L_{\ell,\lambda}}{2}\mathbb{E}_{\mathbf{\xi}}[\|\lambda-\lambda^{*}\|^{2}].

Therefore, the convergence rate of the objective function is also O~(1s)\widetilde{O}(\frac{1}{s}), that is

|𝔼ξ[γ(x,z,ws,λ)]𝔼ξ[γ(x,z,w,λs)]|O~(1s).\left|\mathbb{E}_{\mathbf{\xi}}\left[\ell_{\gamma}(x,z,w^{s},\lambda^{*})\right]-\mathbb{E}_{\mathbf{\xi}}\left[\ell_{\gamma}(x,z,w^{*},\lambda^{s})\right]\right|\leq\widetilde{O}\left(\frac{1}{s}\right).

Since \ell_{\gamma}(x,z,w^{s},\lambda^{*})\geq\ell_{\gamma}(x,z,w^{s},\lambda^{s}) and \ell_{\gamma}(x,z,w^{*},\lambda^{s})\leq\ell_{\gamma}(x,z,w^{*},\lambda^{*})=E(x,z), we have

𝔼ξ[γ(x,z,ws,λs)]E(x,z)𝔼ξ[γ(x,z,ws,λ)]𝔼ξ[γ(x,z,w,λs)]O~(1s).\mathbb{E}_{\mathbf{\xi}}[\ell_{\gamma}(x,z,w^{s},\lambda^{s})]-E(x,z)\leq\mathbb{E}_{\mathbf{\xi}}[\ell_{\gamma}(x,z,w^{s},\lambda^{*})]-\mathbb{E}_{\mathbf{\xi}}[\ell_{\gamma}(x,z,w^{*},\lambda^{s})]\leq\widetilde{O}\left(\frac{1}{s}\right).

Similarly, it holds that

E(x,z)𝔼ξ[γ(x,z,ws,λs)]O~(1s).E(x,z)-\mathbb{E}_{\mathbf{\xi}}[\ell_{\gamma}(x,z,w^{s},\lambda^{s})]\leq\widetilde{O}\left(\frac{1}{s}\right).

This completes the proof. ∎

Appendix C Analysis on the outer loop

In the outer loop, we apply the stochastic gradient descent (SGD) method to solve the saddle point problem (2.9). Unlike the standard analysis of SGD, this setting presents two main challenges: first, constructing a suitable stochastic gradient oracle for (2.10); second, addressing the bias introduced by the inexact solution of the lower-level problem.

In the following lemma, we show that \mathcal{G} is continuously differentiable and that its gradient is computable given the optimal solution of the subproblem (2.5a).

Lemma C.1.

Let (w^{*},\lambda^{*})=(w^{*}(x,z),\lambda^{*}(x,z)) be defined as in (2.5a). Then \mathcal{G}(x,y,z) defined in (2.3) is continuously differentiable and its gradient is given by

x𝒢(x,y,z)\displaystyle\nabla_{x}\mathcal{G}(x,y,z) =xg(x,y)xγ(x,z,w,λ),\displaystyle=\nabla_{x}g(x,y)-\nabla_{x}\ell_{\gamma}(x,z,w^{*},\lambda^{*}), (C.1a)
y𝒢(x,y,z)\displaystyle\nabla_{y}\mathcal{G}(x,y,z) =yg(x,y),\displaystyle=\nabla_{y}g(x,y), (C.1b)
z𝒢(x,y,z)\displaystyle\nabla_{z}\mathcal{G}(x,y,z) =γ2(λz).\displaystyle=\gamma_{2}(\lambda^{*}-z). (C.1c)

Furthermore, using the fact that \|z\|\leq B, we obtain that \mathcal{G}(x,y,z) is Lipschitz continuous with modulus L_{\mathcal{G}}=3M_{G,1}+\sqrt{p}M_{H,1}(B+\frac{\sqrt{p}}{\gamma_{2}}M_{H,0})+pM_{H,0}M_{H,1}+\sqrt{p}M_{H,0}.

Proof By Theorem 4.24 in (3), E(x,z) is continuously differentiable with respect to x and its gradient is given by

\displaystyle\nabla_{x}E(x,z) =\nabla_{x}\left(\max_{\lambda\in\mathbb{R}_{+}^{p}}\left\{D(x,\lambda)-\frac{\gamma_{2}}{2}\|\lambda-z\|^{2}\right\}\right)
\displaystyle=\nabla_{x}\left(D(x,\lambda)-\frac{\gamma_{2}}{2}\|\lambda-z\|^{2}\right)\Big|_{\lambda=\lambda^{*}(x,z)}\quad\text{(due to the uniqueness of $\lambda^{*}$)}
\displaystyle=\nabla_{x}\ell_{\gamma}(x,z,w^{*},\lambda^{*}).

Hence \mathcal{G}(x,y,z) is continuously differentiable and (C.1a) holds. Since E(x,z) is independent of y, (C.1b) is straightforward. Note that -E(x,z) is the Moreau envelope of -D(x,\cdot) for any x, so Proposition A.1.4 gives -\nabla_{z}E(x,z)=\gamma_{2}(z-\mathrm{prox}_{-\frac{1}{\gamma_{2}}D(x,\cdot)}(z)). On the other hand, the optimality condition of the subproblem (2.5a) gives

\mathrm{prox}_{-\frac{1}{\gamma_{2}}D(x,\cdot)}(z)=\arg\max_{\lambda\in\mathbb{R}_{+}^{p}}\left\{D(x,\lambda)-\frac{\gamma_{2}}{2}\|\lambda-z\|^{2}\right\}=\arg\max_{\lambda\in\mathbb{R}_{+}^{p}}\left\{\min_{w\in Y}\ell_{\gamma}(x,z,w,\lambda)\right\}=\lambda^{*}.

This gives (C.1c).

Now we show the Lipschitz continuity of \mathcal{G}(x,y,z), that is, \|\nabla\mathcal{G}(x,y,z)\|\leq L_{\mathcal{G}}. Similar to the proof of Lemma A.2, it holds that \|\nabla_{x}\ell_{\gamma}(x,z,w^{*},\lambda^{*})\|\leq M_{G,1}+\frac{1}{{\gamma_{1}}}\sum_{i=1}^{p}({\gamma_{1}}|\lambda^{*}_{i}|+M_{H,0})M_{H,1}. Again by Theorem 4.24 in (3), D(x,\lambda)=\min_{y\in Y}\{\mathcal{L}(x,y,\lambda)\} is continuously differentiable with respect to \lambda and its gradient is given by \nabla_{\lambda}D(x,\lambda)=\nabla_{z}\mathcal{L}(x,y^{*}(x,\lambda),\lambda), where y^{*}(x,\lambda) is the optimal solution of the subproblem \min_{y\in Y}\mathcal{L}(x,y,\lambda). From (A.4) we know \|\nabla_{z}\mathcal{L}(x,y,\lambda)\|\leq\sqrt{p}M_{H,0}. Hence D(x,\lambda) is \sqrt{p}M_{H,0}-Lipschitz continuous with respect to \lambda. Proposition A.1.4 then implies that E(x,z), being (up to sign) a Moreau envelope of D(x,\cdot), is Lipschitz continuous in z with the same Lipschitz constant. This gives

z𝒢(x,y,z)=γ2λzpMH,0.\|\nabla_{z}\mathcal{G}(x,y,z)\|=\gamma_{2}\|\lambda^{*}-z\|\leq\sqrt{p}M_{H,0}. (C.2)

Therefore,

𝒢(x,y,z)\displaystyle\|\nabla\mathcal{G}(x,y,z)\| xG(x,y)+xγ(x,z,w,λ)+yG(x,y)+z𝒢(x,y,z)\displaystyle\leq\|\nabla_{x}G(x,y)\|+\|\nabla_{x}\ell_{\gamma}(x,z,w^{*},\lambda^{*})\|+\|\nabla_{y}G(x,y)\|+\|\nabla_{z}\mathcal{G}(x,y,z)\|
MG,1+MG,1+1γ1i=1p(γ1|λi|+MH,0)MH,1+MG,1+pMH,0\displaystyle\leq M_{G,1}+M_{G,1}+\frac{1}{{\gamma_{1}}}\sum_{i=1}^{p}({\gamma_{1}}|\lambda^{*}_{i}|+M_{H,0})M_{H,1}+M_{G,1}+\sqrt{p}M_{H,0}
3MG,1+pMH,1λ+pMH,0MH,1+pMH,0\displaystyle\leq 3M_{G,1}+\sqrt{p}M_{H,1}\|\lambda^{*}\|+pM_{H,0}M_{H,1}+\sqrt{p}M_{H,0}
3MG,1+pMH,1(z+pγ2MH,0)+pMH,0MH,1+pMH,0(by (C.2))\displaystyle\leq 3M_{G,1}+\sqrt{p}M_{H,1}(\|z\|+\frac{\sqrt{p}}{\gamma_{2}}M_{H,0})+pM_{H,0}M_{H,1}+\sqrt{p}M_{H,0}\quad\text{(by~\eqref{eq: proof of gradient of mathcal G 1})}
3MG,1+pMH,1(B+pγ2MH,0)+pMH,0MH,1+pMH,0=L𝒢.\displaystyle\leq 3M_{G,1}+\sqrt{p}M_{H,1}(B+\frac{\sqrt{p}}{\gamma_{2}}M_{H,0})+pM_{H,0}M_{H,1}+\sqrt{p}M_{H,0}=L_{\mathcal{G}}.

The last inequality follows from \|z\|\leq B. This completes the proof. ∎
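Given the pair (w^{*},\lambda^{*}), the gradient formula (C.1) is directly implementable. The short sketch below assembles it from user-supplied callables, which are assumed placeholders for the deterministic gradients of g and \ell_{\gamma}, not part of the paper's code.

```python
def grad_G(grad_x_g, grad_y_g, grad_x_ell, x, y, z, w_star, lam_star, gamma2):
    """Assemble the gradient of G(x, y, z) via (C.1a)-(C.1c); sketch only.

    grad_x_g(x, y), grad_y_g(x, y) and grad_x_ell(x, z, w, lam) are assumed
    callables for nabla_x g, nabla_y g and nabla_x ell_gamma, respectively.
    """
    gx = grad_x_g(x, y) - grad_x_ell(x, z, w_star, lam_star)  # (C.1a)
    gy = grad_y_g(x, y)                                       # (C.1b)
    gz = gamma2 * (lam_star - z)                              # (C.1c)
    return gx, gy, gz
```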

Corollary C.1.

Ψ(𝐮)\nabla\Psi(\mathbf{u}) is Lipschitz continuous with modulus

LΨ=LF+c1L𝒢+c22LHMH,0.L_{\Psi}=L_{F}+c_{1}L_{\mathcal{G}}+\frac{c_{2}}{2}L_{H}M_{H,0}. (C.3)

Proof Combining Assumption 3.1 and Lemma C.1 yields the desired result. ∎

Assume we have already obtained the approximate optimal solution (w^{k},\lambda^{k}) of the subproblem (2.5a) at the k-th iteration. Note that (w^{k},\lambda^{k}) are random variables depending on the inner-loop sample \mathbf{\xi}^{k}. In the subsequent analysis, we will take expectations conditioned on \mathcal{F}^{k}, so that (w^{k},\lambda^{k}) are treated as constants. Given a sample \mathbf{\tilde{\xi}}^{k}=(\tilde{\xi}^{k}_{1},...,\tilde{\xi}^{k}_{q_{k}})\sim\mathcal{D}_{\xi}^{q_{k}}, a natural first-order oracle for \hat{\mathcal{G}} in (2.9) is

\displaystyle\nabla_{x}\mathcal{G}^{k}(x,y,z;\mathbf{\tilde{\xi}}^{k}) =\frac{1}{q_{k}}\sum_{l=1}^{q_{k}}\nabla_{x}g(x,y;\tilde{\xi}^{k}_{l})-\frac{1}{q_{k}}\sum_{l=1}^{q_{k}}\nabla_{x}\ell_{\gamma}(x,z,w^{k},\lambda^{k};\tilde{\xi}^{k}_{l}), (C.4)
y𝒢k(x,y,z;ξ~k)\displaystyle\nabla_{y}\mathcal{G}^{k}(x,y,z;\mathbf{\tilde{\xi}}^{k}) =1qkl=1qkyg(x,y;ξ~lk),\displaystyle=\frac{1}{q_{k}}\sum_{l=1}^{q_{k}}\nabla_{y}g(x,y;\tilde{\xi}^{k}_{l}),
z𝒢k(x,y,z;ξ~k)\displaystyle\nabla_{z}\mathcal{G}^{k}(x,y,z;\mathbf{\tilde{\xi}}^{k}) =γ2(λkz).\displaystyle={\gamma_{2}}(\lambda^{k}-z).

Conditioned on \mathbf{\xi}^{k}, the bias of \nabla\mathcal{G}^{k}(x,y,z;\mathbf{\tilde{\xi}}^{k}) arises from the term \frac{1}{q_{k}}\sum_{l=1}^{q_{k}}\nabla_{x}\ell_{\gamma}(x,z,w^{k},\lambda^{k};\tilde{\xi}^{k}_{l}). However, when \mathbf{\xi}^{k} is also treated as a random variable, estimating the bias becomes considerably more involved. We establish two lemmas to control this bias: Lemma C.2 conditions on \mathcal{F}^{k}, while Lemma C.3 conditions on \tilde{\mathcal{F}}_{k-1}. In the following analysis, \mathcal{G}(\mathbf{u}) abbreviates \mathcal{G}(x,y,z) for simplicity.
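For concreteness, a minimal sketch of the mini-batch oracle (C.4) is given below. The sampled gradients of g and \ell_{\gamma} are passed in as callables; these, and the batch handling, are assumptions for illustration rather than the implementation used in the experiments.

```python
import numpy as np

def oracle_G(grad_x_g, grad_y_g, grad_x_ell, x, y, z, w_k, lam_k, xi_batch, gamma2):
    """Sketch of the first-order oracle for G^k in (C.4).

    Assumed placeholders: grad_x_g(x, y, xi), grad_y_g(x, y, xi) and
    grad_x_ell(x, z, w, lam, xi) return the sampled gradients
    nabla_x g, nabla_y g and nabla_x ell_gamma, respectively.
    """
    q = len(xi_batch)
    gx = sum(grad_x_g(x, y, xi) - grad_x_ell(x, z, w_k, lam_k, xi) for xi in xi_batch) / q
    gy = sum(grad_y_g(x, y, xi) for xi in xi_batch) / q
    gz = gamma2 * (lam_k - z)  # deterministic z-component; randomness enters only through lam_k
    return gx, gy, gz
```

The same fresh batch \mathbf{\tilde{\xi}}^{k} is reused for the g term and the \ell_{\gamma} term, which is why only the inexactness of (w^{k},\lambda^{k}) contributes to the bias analyzed next.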

Lemma C.2.

𝒢k(𝐮;ξ~k)\nabla\mathcal{G}^{k}(\mathbf{u};\mathbf{\tilde{\xi}}^{k}) has a controllable bias conditioned on k\mathcal{F}^{k} as

𝔼ξ~k[𝒢k(𝐮;ξ~k)|k]𝒢(𝐮)\displaystyle\quad\|\mathbb{E}_{\mathbf{\tilde{\xi}}^{k}}[\nabla\mathcal{G}^{k}(\mathbf{u};\mathbf{\tilde{\xi}}^{k})|\mathcal{F}^{k}]-\nabla\mathcal{G}(\mathbf{u})\| (C.5)
γ22λkλ(xk1,zk1)2+(MH,1+p)λkλ(xk1,zk1)\displaystyle\leq\frac{\gamma_{2}}{2}\|\lambda^{k}-\lambda^{*}(x^{k-1},z^{k-1})\|^{2}+(M_{H,1}+\sqrt{p})\|\lambda^{k}-\lambda^{*}(x^{k-1},z^{k-1})\|
+\displaystyle+ (LG+1γ1(MH,22+γ1LHz+γ1LHγ2MH,0+MH,0LH))wkw(xk1,zk1),\displaystyle(L_{G}+\frac{1}{\gamma_{1}}(M_{H,2}^{2}+\gamma_{1}L_{H}\|z\|+\frac{\gamma_{1}L_{H}}{\gamma_{2}}M_{H,0}+M_{H,0}L_{H}))\|w^{k}-w^{*}(x^{k-1},z^{k-1})\|,

and a controllable conditional variance as

𝕍ξ~k[𝒢k(𝐮;ξ~k)|k]4σg2qk.\displaystyle\mathbb{V}_{\mathbf{\tilde{\xi}}^{k}}[\nabla\mathcal{G}^{k}(\mathbf{u};\mathbf{\tilde{\xi}}^{k})|\mathcal{F}_{k}]\leq\frac{4\sigma_{g}^{2}}{q_{k}}. (C.6)

Here 𝔼ξ~k[],𝕍ξ~k[]\mathbb{E}_{\mathbf{\tilde{\xi}}^{k}}[\cdot],\mathbb{V}_{\mathbf{\tilde{\xi}}^{k}}[\cdot] are the abbreviation of 𝔼ξ~k𝒟ξqk[],𝕍ξ~k𝒟ξqk[]\mathbb{E}_{\mathbf{\tilde{\xi}}^{k}\sim\mathcal{D}_{\xi}^{q_{k}}}[\cdot],\mathbb{V}_{\mathbf{\tilde{\xi}}^{k}\sim\mathcal{D}_{\xi}^{q_{k}}}[\cdot], respectively.

Proof When conditioned on \mathcal{F}^{k}, (w^{k},\lambda^{k}) and (w^{*}(x^{k-1},z^{k-1}),\lambda^{*}(x^{k-1},z^{k-1})) are constants. Utilizing (C.1a) and taking the expectation over \mathbf{\tilde{\xi}}^{k} gives

𝔼ξ~k[x𝒢k(𝐮;ξ~k)|k]x𝒢(𝐮)\displaystyle\quad\mathbb{E}_{\mathbf{\tilde{\xi}}^{k}}[\nabla_{x}\mathcal{G}^{k}(\mathbf{u};\mathbf{\tilde{\xi}}^{k})|{\mathcal{F}}_{k}]-\nabla_{x}\mathcal{G}(\mathbf{u}) (C.7)
=𝔼ξ~k[1qkl=1qkxg(x,y;ξ~l)1qkl=1qkxγ(x,z,wk,λk;ξ~l)|k]\displaystyle=\mathbb{E}_{\mathbf{\tilde{\xi}}^{k}}\left[\frac{1}{q_{k}}\sum_{l=1}^{q_{k}}\nabla_{x}g(x,y;\tilde{\xi}_{l})-\frac{1}{q_{k}}\sum_{l=1}^{q_{k}}\nabla_{x}\ell_{\gamma}(x,z,w^{k},\lambda^{k};\tilde{\xi}_{l})|{\mathcal{F}}_{k}\right]
xg(x,y)+xγ(x,z,w,λ)\displaystyle\quad-\nabla_{x}g(x,y)+\nabla_{x}\ell_{\gamma}(x,z,w^{*},\lambda^{*})
=𝔼ξ~1k[xγ(x,z,wk,λk;ξ~1)|k]+xγ(x,z,w,λ).\displaystyle=-\mathbb{E}_{\tilde{\xi}^{k}_{1}}\left[\nabla_{x}\ell_{\gamma}(x,z,w^{k},\lambda^{k};\tilde{\xi}_{1})|{\mathcal{F}}_{k}\right]+\nabla_{x}\ell_{\gamma}(x,z,w^{*},\lambda^{*}).

From the expression of ~\tilde{\mathcal{L}} and (A.4), (A.6), we can further compute

𝔼ξ~k[x𝒢k(𝐮;ξ~k)|k]x𝒢(𝐮)\displaystyle\quad\mathbb{E}_{\mathbf{\tilde{\xi}}^{k}}[\nabla_{x}\mathcal{G}^{k}(\mathbf{u};\mathbf{\tilde{\xi}}^{k})|{\mathcal{F}}_{k}]-\nabla_{x}\mathcal{G}(\mathbf{u}) (C.8)
=𝔼ξ~1k[xg(x,wk;ξ1)+1γ1i=1p[γ1λik+Hi(x,wk)]+xHi(x,wk)γ22λkz2|k]\displaystyle=-\mathbb{E}_{\tilde{\xi}^{k}_{1}}\left[\nabla_{x}g(x,w^{k};\xi_{1})+\frac{1}{\gamma_{1}}\sum_{i=1}^{p}[\gamma_{1}\lambda_{i}^{k}+H_{i}(x,w^{k})]_{+}\nabla_{x}H_{i}(x,w^{k})-\frac{\gamma_{2}}{2}\|\lambda^{k}-z\|^{2}|{\mathcal{F}}_{k}\right]
+xg(x,w)+1γ1i=1p[γ1λi+Hi(x,w)]+xHi(x,w)γ22λz2\displaystyle\quad+\nabla_{x}g(x,w^{*})+\frac{1}{\gamma_{1}}\sum_{i=1}^{p}[\gamma_{1}\lambda_{i}^{*}+H_{i}(x,w^{*})]_{+}\nabla_{x}H_{i}(x,w^{*})-\frac{\gamma_{2}}{2}\|\lambda^{*}-z\|^{2}
=xg(x,wk)1γ1i=1p[γ1λik+Hi(x,wk)]+xHi(x,wk)+γ22λkz2\displaystyle=-\nabla_{x}g(x,w^{k})-\frac{1}{\gamma_{1}}\sum_{i=1}^{p}[\gamma_{1}\lambda_{i}^{k}+H_{i}(x,w^{k})]_{+}\nabla_{x}H_{i}(x,w^{k})+\frac{\gamma_{2}}{2}\|\lambda^{k}-z\|^{2}
+xg(x,w)+1γ1i=1p[γ1λi+Hi(x,w)]+xHi(x,w)γ22λz2.\displaystyle\quad+\nabla_{x}g(x,w^{*})+\frac{1}{\gamma_{1}}\sum_{i=1}^{p}[\gamma_{1}\lambda_{i}^{*}+H_{i}(x,w^{*})]_{+}\nabla_{x}H_{i}(x,w^{*})-\frac{\gamma_{2}}{2}\|\lambda^{*}-z\|^{2}.

By Assumption 3.1, it holds that

1γ1i=1p[γ1λik+Hi(x,wk)]+xHi(x,wk)1γ1i=1p[γ1λi+Hi(x,w)]+xHi(x,w)\displaystyle\quad\left\|\frac{1}{\gamma_{1}}\sum_{i=1}^{p}[\gamma_{1}\lambda_{i}^{k}+H_{i}(x,w^{k})]_{+}\nabla_{x}H_{i}(x,w^{k})-\frac{1}{\gamma_{1}}\sum_{i=1}^{p}[\gamma_{1}\lambda_{i}^{*}+H_{i}(x,w^{*})]_{+}\nabla_{x}H_{i}(x,w^{*})\right\| (C.9)
1γ1i=1p([γ1λik+Hi(x,wk)]+xHi(x,wk)[γ1λi+Hi(x,w)]+xHi(x,wk))\displaystyle\leq\frac{1}{\gamma_{1}}\left\|\sum_{i=1}^{p}\left([\gamma_{1}\lambda_{i}^{k}+H_{i}(x,w^{k})]_{+}\nabla_{x}H_{i}(x,w^{k})-[\gamma_{1}\lambda_{i}^{*}+H_{i}(x,w^{*})]_{+}\nabla_{x}H_{i}(x,w^{k})\right)\right\|
+1γ1i=1p([γ1λi+Hi(x,w)]+xHi(x,wk)[γ1λi+Hi(x,w)]+xHi(x,w))\displaystyle+\frac{1}{\gamma_{1}}\left\|\sum_{i=1}^{p}\left([\gamma_{1}\lambda_{i}^{*}+H_{i}(x,w^{*})]_{+}\nabla_{x}H_{i}(x,w^{k})-[\gamma_{1}\lambda_{i}^{*}+H_{i}(x,w^{*})]_{+}\nabla_{x}H_{i}(x,w^{*})\right)\right\|
1γ1i=1p(|[γ1λik+Hi(x,wk)]+[γ1λi+Hi(x,w)]+|xHi(x,wk))\displaystyle\leq\frac{1}{\gamma_{1}}\sum_{i=1}^{p}\left(\left|[\gamma_{1}\lambda_{i}^{k}+H_{i}(x,w^{k})]_{+}-[\gamma_{1}\lambda_{i}^{*}+H_{i}(x,w^{*})]_{+}\right|\cdot\left\|\nabla_{x}H_{i}(x,w^{k})\right\|\right)
+1γ1i=1p([γ1λi+Hi(x,w)]+xHi(x,wk)xHi(x,w))\displaystyle+\frac{1}{\gamma_{1}}\sum_{i=1}^{p}\left([\gamma_{1}\lambda_{i}^{*}+H_{i}(x,w^{*})]_{+}\cdot\left\|\nabla_{x}H_{i}(x,w^{k})-\nabla_{x}H_{i}(x,w^{*})\right\|\right)
1γ1MH,1(γ1λkλ+MH,1wkw)+1γ1(γ1λ+MH,0)LHwkw\displaystyle\leq\frac{1}{\gamma_{1}}M_{H,1}(\gamma_{1}\|\lambda^{k}-\lambda^{*}\|+M_{H,1}\|w^{k}-w^{*}\|)+\frac{1}{\gamma_{1}}(\gamma_{1}\|\lambda^{*}\|+M_{H,0})L_{H}\|w^{k}-w^{*}\|
=MH,1λkλ+1γ1(MH,22+γ1λLH+MH,0LH)wkw.\displaystyle=M_{H,1}\|\lambda^{k}-\lambda^{*}\|+\frac{1}{\gamma_{1}}(M_{H,2}^{2}+\gamma_{1}\|\lambda^{*}\|L_{H}+M_{H,0}L_{H})\|w^{k}-w^{*}\|.

The last inequality uses the fact that |[a]_{+}-[b]_{+}|\leq|a-b| together with the assumptions |H_{i}(x,w^{k})-H_{i}(x,w^{*})|\leq M_{H,1}\|w^{k}-w^{*}\|, |H_{i}(x,w^{*})|\leq M_{H,0} and \|\nabla H_{i}(x,w^{k})-\nabla H_{i}(x,w^{*})\|\leq L_{H}\|w^{k}-w^{*}\|. By (C.2), it holds that \|\lambda^{k}-z\|^{2}-\|\lambda^{*}-z\|^{2}=\|\lambda^{k}-\lambda^{*}\|^{2}+2(\lambda^{k}-\lambda^{*})^{T}(\lambda^{*}-z)\leq\|\lambda^{k}-\lambda^{*}\|^{2}+\frac{2\sqrt{p}}{\gamma_{2}}M_{H,0}\|\lambda^{k}-\lambda^{*}\|. Then, combining (C.8) and (C.9), we bound the bias of \nabla_{x}\mathcal{G}^{k}(\mathbf{u};\mathbf{\tilde{\xi}}^{k}) as

𝔼ξ~k[x𝒢k(𝐮;ξ~)|k]x𝒢(𝐮)\displaystyle\quad\|\mathbb{E}_{\mathbf{\tilde{\xi}}^{k}}[\nabla_{x}\mathcal{G}^{k}(\mathbf{u};\tilde{\xi})|\mathcal{F}^{k}]-\nabla_{x}\mathcal{G}(\mathbf{u})\|
LGwkw+MH,1λkλ+1γ1(MH,22+γ1λLH+MH,0LH)wkw\displaystyle\leq L_{G}\|w^{k}-w^{*}\|+M_{H,1}\|\lambda^{k}-\lambda^{*}\|+\frac{1}{\gamma_{1}}(M_{H,2}^{2}+\gamma_{1}\|\lambda^{*}\|L_{H}+M_{H,0}L_{H})\|w^{k}-w^{*}\|
+γ22λkλ2+pMH,0λkλ\displaystyle\quad+\frac{\gamma_{2}}{2}\|\lambda^{k}-\lambda^{*}\|^{2}+\sqrt{p}M_{H,0}\|\lambda^{k}-\lambda^{*}\|
=γ22λkλ2+(MH,1+p)λkλ\displaystyle=\frac{\gamma_{2}}{2}\|\lambda^{k}-\lambda^{*}\|^{2}+(M_{H,1}+\sqrt{p})\|\lambda^{k}-\lambda^{*}\|
+(LG+1γ1(MH,22+γ1λLH+MH,0LH))wkw\displaystyle\quad+(L_{G}+\frac{1}{\gamma_{1}}(M_{H,2}^{2}+\gamma_{1}\|\lambda^{*}\|L_{H}+M_{H,0}L_{H}))\|w^{k}-w^{*}\|
γ22λkλ2+(MH,1+p)λkλ\displaystyle\leq\frac{\gamma_{2}}{2}\|\lambda^{k}-\lambda^{*}\|^{2}+(M_{H,1}+\sqrt{p})\|\lambda^{k}-\lambda^{*}\|
+(LG+1γ1(MH,22+γ1LHz+γ1LHγ2MH,0+MH,0LH))wkw.\displaystyle\quad+(L_{G}+\frac{1}{\gamma_{1}}(M_{H,2}^{2}+\gamma_{1}L_{H}\|z\|+\frac{\gamma_{1}L_{H}}{\gamma_{2}}M_{H,0}+M_{H,0}L_{H}))\|w^{k}-w^{*}\|.

The last inequality uses \|\lambda^{*}\|\leq\|z\|+\frac{\sqrt{p}}{\gamma_{2}}M_{H,0}, which follows from (C.2). Besides, \mathbb{E}[\nabla_{y}\mathcal{G}^{k}(x,y,z;\mathbf{\tilde{\xi}}^{k})|\mathcal{F}^{k}]-\nabla_{y}\mathcal{G}(x,y,z)=0 and \|\mathbb{E}[\nabla_{z}\mathcal{G}^{k}(x,y,z;\mathbf{\tilde{\xi}}^{k})|\mathcal{F}^{k}]-\nabla_{z}\mathcal{G}(x,y,z)\|=\gamma_{2}\|\lambda^{k}-\lambda^{*}\| are straightforward. Therefore (C.5) holds.

From the expression of 𝒢k(𝐮;ξ~k)\nabla\mathcal{G}^{k}(\mathbf{u};\mathbf{\tilde{\xi}}^{k}), we can compute the conditional variance of 𝒢k(x,y,z;ξ~k)\nabla\mathcal{G}^{k}(x,y,z;\mathbf{\tilde{\xi}}^{k}) as

𝕍ξ~k[x𝒢k(𝐮;ξ~k)|k]\displaystyle\quad\mathbb{V}_{\mathbf{\tilde{\xi}}^{k}}\left[\nabla_{x}\mathcal{G}^{k}(\mathbf{u};\mathbf{\tilde{\xi}}^{k})|{\mathcal{F}}_{k}\right]
=𝕍ξ~k[1qkl=1qkxg(x,y;ξlk)1qkl=1qkxγ(x,z,wk,λk;ξ~lk)|k]\displaystyle=\mathbb{V}_{\mathbf{\tilde{\xi}}^{k}}\left[\frac{1}{q_{k}}\sum_{l=1}^{q_{k}}\nabla_{x}g(x,y;\xi^{k}_{l})-\frac{1}{q_{k}}\sum_{l=1}^{q_{k}}\nabla_{x}\ell{\gamma}(x,z,w^{k},\lambda^{k};\tilde{\xi}^{k}_{l})|{\mathcal{F}}_{k}\right]
=1qk𝕍ξ~1k[xg(x,y;ξ~1k)xg(x,wk;ξ~1k)1γ1i=1p[γ1λik+Hi(x,wk)]+xHi(x,wk)\displaystyle=\frac{1}{q_{k}}\mathbb{V}_{\tilde{\xi}^{k}_{1}}\left[\nabla_{x}g(x,y;\tilde{\xi}^{k}_{1})-\nabla_{x}g(x,w^{k};\tilde{\xi}^{k}_{1})-\frac{1}{\gamma_{1}}\sum_{i=1}^{p}[\gamma_{1}\lambda_{i}^{k}+H_{i}(x,w^{k})]_{+}\nabla_{x}H_{i}(x,w^{k})\right.
+γ22λkz2|k]\displaystyle\quad\left.+\frac{\gamma_{2}}{2}\|\lambda^{k}-z\|^{2}|\mathcal{F}_{k}\right]
=1qk𝕍ξ~1k[xg(x,y;ξ~1k)xg(x,wk;ξ~1k)|k] (λk,wk are constants conditioned on k)\displaystyle=\frac{1}{q_{k}}\mathbb{V}_{\tilde{\xi}^{k}_{1}}\left[\nabla_{x}g(x,y;\tilde{\xi}^{k}_{1})-\nabla_{x}g(x,w^{k};\tilde{\xi}^{k}_{1})|\mathcal{F}_{k}\right]\quad\text{ ($\lambda^{k},w^{k}$ are constants conditioned on $\mathcal{F}^{k}$)}
2qk𝕍ξ~1k[xg(x,y;ξ~1k)|k]+2qk𝕍ξ~1k[xg(x,wk;ξ~1k)|k].\displaystyle\leq\frac{2}{q_{k}}\mathbb{V}_{\tilde{\xi}^{k}_{1}}\left[\nabla_{x}g(x,y;\tilde{\xi}^{k}_{1})|\mathcal{F}_{k}\right]+\frac{2}{q_{k}}\mathbb{V}_{\tilde{\xi}^{k}_{1}}\left[\nabla_{x}g(x,w^{k};\tilde{\xi}^{k}_{1})|\mathcal{F}_{k}\right].

and

𝕍ξ~k[y𝒢k(𝐮;ξ~k)|k]=𝕍ξ~k[1qkl=1qkyg(x,y;ξlk)]=1qk𝕍ξ~1k[yg(x,y;ξ1k)|k]\displaystyle\quad\mathbb{V}_{\mathbf{\tilde{\xi}}^{k}}\left[\nabla_{y}\mathcal{G}^{k}(\mathbf{u};\tilde{\xi}^{k})|{\mathcal{F}}_{k}\right]=\mathbb{V}_{\mathbf{\tilde{\xi}}^{k}}\left[\frac{1}{q_{k}}\sum_{l=1}^{q_{k}}\nabla_{y}g(x,y;\xi^{k}_{l})\right]=\frac{1}{q_{k}}\mathbb{V}_{{\tilde{\xi}}^{k}_{1}}\left[\nabla_{y}g(x,y;\xi^{k}_{1})|\mathcal{F}_{k}\right]
𝕍ξ~k[z𝒢k(𝐮;ξ~k)|k]=0.\displaystyle\quad\mathbb{V}_{\mathbf{\tilde{\xi}}^{k}}\left[\nabla_{z}\mathcal{G}^{k}(\mathbf{u};\mathbf{\tilde{\xi}}^{k})|{\mathcal{F}}_{k}\right]=0.

Hence

\displaystyle\quad\mathbb{V}_{\mathbf{\tilde{\xi}}^{k}}[\nabla\mathcal{G}^{k}(\mathbf{u};\mathbf{\tilde{\xi}}^{k})|\mathcal{F}_{k}]
2qk𝕍ξ~1k[xg(x,y;ξ~1k)|k]+2qk𝕍ξ~1k[xg(x,wk;ξ~1k)|k]+1qk𝕍ξ~1k[yg(x,y;ξ1k)|k]\displaystyle\leq\frac{2}{q_{k}}\mathbb{V}_{\tilde{\xi}^{k}_{1}}\left[\nabla_{x}g(x,y;\tilde{\xi}^{k}_{1})|\mathcal{F}_{k}\right]+\frac{2}{q_{k}}\mathbb{V}_{\tilde{\xi}^{k}_{1}}\left[\nabla_{x}g(x,w^{k};\tilde{\xi}^{k}_{1})|\mathcal{F}_{k}\right]+\frac{1}{q_{k}}\mathbb{V}_{{\tilde{\xi}}^{k}_{1}}\left[\nabla_{y}g(x,y;\xi^{k}_{1})|\mathcal{F}_{k}\right]
2qk𝕍ξ~1k[g(x,y;ξ1k)]+2qk𝕍ξ~1k[g(x,wk;ξ1k)]4σg2qk.\displaystyle\leq\frac{2}{q_{k}}\mathbb{V}_{\tilde{\xi}^{k}_{1}}\left[\nabla g(x,y;\xi^{k}_{1})\right]+\frac{2}{q_{k}}\mathbb{V}_{\tilde{\xi}^{k}_{1}}\left[\nabla g(x,w^{k};\xi^{k}_{1})\right]\leq\frac{4\sigma_{g}^{2}}{q_{k}}.

This completes the proof. ∎

In the following lemma, w^{k},\lambda^{k} are treated as random variables, and we control the bias and variance of the first-order oracle of \mathcal{G}^{k} conditioned on \tilde{\mathcal{F}}_{k-1}.

Lemma C.3.

The first order oracle of 𝒢k\mathcal{G}^{k} has a bounded conditional bias and variance as

𝔼ξ~k,ξk[𝒢k(𝐮;ξ~k)|~k1]𝒢(𝐮)\displaystyle\left\|\mathbb{E}_{\mathbf{\tilde{\xi}}^{k},\mathbf{\xi}^{k}}[\nabla\mathcal{G}^{k}(\mathbf{u};\mathbf{\tilde{\xi}}^{k})|\tilde{\mathcal{F}}_{k-1}]-\nabla\mathcal{G}(\mathbf{u})\right\| ϵ𝒢k,\displaystyle\leq\epsilon_{\mathcal{G}}^{k}, (C.10a)
𝕍ξ~k,ξk[𝒢k(𝐮;ξ~k)|~k1](σ𝒢k)2,\displaystyle\mathbb{V}_{\mathbf{\tilde{\xi}}^{k},\mathbf{\xi}^{k}}[\nabla\mathcal{G}^{k}(\mathbf{u};\mathbf{\tilde{\xi}}^{k})|\tilde{\mathcal{F}}_{k-1}]\leq(\sigma_{\mathcal{G}}^{k})^{2}, (C.10b)

where

\displaystyle\epsilon_{\mathcal{G}}^{k} =\frac{\gamma_{2}}{2}\mathbb{E}[\|\lambda^{k}-\lambda^{*}(x^{k-1},z^{k-1})\|^{2}|\tilde{\mathcal{F}}_{k-1}]+(M_{H,1}+\sqrt{p})\mathbb{E}[\|\lambda^{k}-\lambda^{*}(x^{k-1},z^{k-1})\||\tilde{\mathcal{F}}_{k-1}] (C.11)
+\displaystyle+ (LG+1γ1(MH,22+γ1LHz+γ1LHγ2MH,0+MH,0LH))𝔼[wkw(xk1,zk1)|~k1]\displaystyle(L_{G}+\frac{1}{\gamma_{1}}(M_{H,2}^{2}+\gamma_{1}L_{H}\|z\|+\frac{\gamma_{1}L_{H}}{\gamma_{2}}M_{H,0}+M_{H,0}L_{H}))\mathbb{E}[\|w^{k}-w^{*}(x^{k-1},z^{k-1})\||\tilde{\mathcal{F}}_{k-1}]

and

(σ𝒢k)2\displaystyle(\sigma_{\mathcal{G}}^{k})^{2} =4σg2qk+2(LG2+(γ1Mλ+MH,0)2γ12LH2)𝔼ξk[wkw(xk1,zk1)2|~k1]\displaystyle=\frac{4\sigma_{g}^{2}}{q_{k}}+2(L_{G}^{2}+\frac{(\gamma_{1}M_{\lambda}+M_{H,0})^{2}}{\gamma_{1}^{2}}L_{H}^{2})\mathbb{E}_{\mathbf{\xi}^{k}}\left[\|w^{k}-w^{*}(x^{k-1},z^{k-1})\|^{2}|\tilde{\mathcal{F}}_{k-1}\right] (C.12)
+γ22𝔼ξk[λkλ(xk1,zk1)2|~k1].\displaystyle\quad+\gamma_{2}^{2}\mathbb{E}_{\mathbf{\xi}^{k}}\left[\|\lambda^{k}-\lambda^{*}(x^{k-1},z^{k-1})\|^{2}|\tilde{\mathcal{F}}_{k-1}\right].

Proof Taking the expectation over \mathbf{\xi}^{k} in (C.5) gives (C.10a). Utilizing the law of total variance and Lemma C.2, we have

𝕍ξ~k,ξk[𝒢k(𝐮;ξ~k)|~k1]\displaystyle\quad\mathbb{V}_{\mathbf{\tilde{\xi}}^{k},\mathbf{\xi}^{k}}[\nabla\mathcal{G}^{k}(\mathbf{u};\mathbf{\tilde{\xi}}^{k})|\tilde{\mathcal{F}}_{k-1}] (C.13)
\displaystyle=\mathbb{E}_{\mathbf{\xi}^{k}}[\mathbb{V}_{\mathbf{\tilde{\xi}}^{k}}[\nabla\mathcal{G}^{k}(\mathbf{u};\mathbf{\tilde{\xi}}^{k})|{\mathcal{F}}_{k}]|\tilde{\mathcal{F}}_{k-1}]+\mathbb{V}_{\mathbf{\xi}^{k}}[\mathbb{E}_{\mathbf{\tilde{\xi}}^{k}}[\nabla\mathcal{G}^{k}(\mathbf{u};\mathbf{\tilde{\xi}}^{k})|{\mathcal{F}}_{k}]|\tilde{\mathcal{F}}_{k-1}]
4σg2qk+𝕍ξk[xg(x,y)xγ(x,z,wk,λk)|~k1]+𝕍ξk[yg(x,y)|~k1]\displaystyle\leq\frac{4\sigma_{g}^{2}}{q_{k}}+\mathbb{V}_{\mathbf{\xi}^{k}}[\nabla_{x}g(x,y)-\nabla_{x}\ell_{\gamma}(x,z,w^{k},\lambda^{k})|\tilde{\mathcal{F}}_{k-1}]+\mathbb{V}_{\mathbf{\xi}^{k}}[\nabla_{y}g(x,y)|\tilde{\mathcal{F}}_{k-1}]
+𝕍ξk[γ2(λkz)|~k1]\displaystyle\quad+\mathbb{V}_{\mathbf{\xi}^{k}}[\gamma_{2}(\lambda^{k}-z)|\tilde{\mathcal{F}}_{k-1}]
=4σg2qk+𝕍ξk[xγ(x,z,wk,λk)|~k1]+0+γ22𝕍ξk[λk|~k1].\displaystyle=\frac{4\sigma_{g}^{2}}{q_{k}}+\mathbb{V}_{\mathbf{\xi}^{k}}[\nabla_{x}\ell_{\gamma}(x,z,w^{k},\lambda^{k})|\tilde{\mathcal{F}}_{k-1}]+0+\gamma_{2}^{2}\mathbb{V}_{\mathbf{\xi}^{k}}[\lambda^{k}|\tilde{\mathcal{F}}_{k-1}].

The inequality is due to (C.4) and Lemma C.2. The second term in the right-hand side of (C.13) is bounded by

𝕍ξk[xγ(x,z,wk,λk)|~k1]\displaystyle\quad\mathbb{V}_{\mathbf{\xi}^{k}}[\nabla_{x}\ell_{\gamma}(x,z,w^{k},\lambda^{k})|\tilde{\mathcal{F}}_{k-1}] (C.14)
=𝕍ξk[xg(x,wk)+1γ1i=1p[γ1λik+Hi(x,wk)]+xHi(x,wk)|~k1]\displaystyle=\mathbb{V}_{\mathbf{\xi}^{k}}[\nabla_{x}g(x,w^{k})+\frac{1}{\gamma_{1}}\sum_{i=1}^{p}[\gamma_{1}\lambda_{i}^{k}+H_{i}(x,w^{k})]_{+}\nabla_{x}H_{i}(x,w^{k})|\tilde{\mathcal{F}}_{k-1}]
2𝕍ξk[xg(x,wk)|~k1]+2(γ1Mλ+MH,0)2γ12𝕍ξk[xH(x,wk)|~k1]\displaystyle\leq 2\mathbb{V}_{\mathbf{\xi}^{k}}[\nabla_{x}g(x,w^{k})|\tilde{\mathcal{F}}_{k-1}]+\frac{2(\gamma_{1}M_{\lambda}+M_{H,0})^{2}}{\gamma_{1}^{2}}\mathbb{V}_{\mathbf{\xi}^{k}}[\nabla_{x}H(x,w^{k})|\tilde{\mathcal{F}}_{k-1}]

Since w(xk1,zk1)w^{*}(x^{k-1},z^{k-1}) is a constant conditioned on ~k1\tilde{\mathcal{F}}_{k-1}, we have

𝕍ξk[xg(x,wk)|~k1]\displaystyle\mathbb{V}_{\mathbf{\xi}^{k}}[\nabla_{x}g(x,w^{k})|\tilde{\mathcal{F}}_{k-1}] =𝕍ξk[xg(x,wk)xg(x,w(xk1,zk1))|~k1]\displaystyle=\mathbb{V}_{\mathbf{\xi}^{k}}[\nabla_{x}g(x,w^{k})-\nabla_{x}g(x,w^{*}(x^{k-1},z^{k-1}))|\tilde{\mathcal{F}}_{k-1}]
𝔼ξk[xg(x,wk)xg(x,w(xk1,zk1))2|~k1]\displaystyle\leq\mathbb{E}_{\mathbf{\xi}^{k}}[\|\nabla_{x}g(x,w^{k})-\nabla_{x}g(x,w^{*}(x^{k-1},z^{k-1}))\|^{2}|\tilde{\mathcal{F}}_{k-1}]
\displaystyle\leq L_{G}^{2}\mathbb{E}_{\mathbf{\xi}^{k}}\left[\|w^{k}-w^{*}(x^{k-1},z^{k-1})\|^{2}|\tilde{\mathcal{F}}_{k-1}\right].

The first inequality uses the fact that the variance is bounded by the second moment, and the second inequality is due to the Lipschitz continuity of \nabla g. Similarly, we have

\mathbb{V}_{\mathbf{\xi}^{k}}[\nabla_{x}H(x,w^{k})|\tilde{\mathcal{F}}_{k-1}]\leq L_{H}^{2}\mathbb{E}_{\mathbf{\xi}^{k}}\left[\|w^{k}-w^{*}(x^{k-1},z^{k-1})\|^{2}|\tilde{\mathcal{F}}_{k-1}\right].

Substituting these two inequalities into (C.14) gives

𝕍ξk[xγ(x,z,wk,λk)|~k1]\displaystyle\quad\mathbb{V}_{\mathbf{\xi}^{k}}[\nabla_{x}\ell_{\gamma}(x,z,w^{k},\lambda^{k})|\tilde{\mathcal{F}}_{k-1}] (C.15)
2(LG2+(γ1Mλ+MH,0)2γ12LH2)𝔼[wkw(xk1,zk1)2|~k1].\displaystyle\leq 2\left(L_{G}^{2}+\frac{(\gamma_{1}M_{\lambda}+M_{H,0})^{2}}{\gamma_{1}^{2}}L_{H}^{2}\right)\mathbb{E}[\|w^{k}-w^{*}(x^{k-1},z^{k-1})\|^{2}|\tilde{\mathcal{F}}_{k-1}].

The last term in the right-hand side of (C.13) is bounded by

γ22𝕍ξk[λk|~k1]\displaystyle\gamma_{2}^{2}\mathbb{V}_{\mathbf{\xi}^{k}}[\lambda^{k}|\tilde{\mathcal{F}}_{k-1}] =γ22𝕍[λkλ(xk1,zk1)|~k1]γ22𝔼[λkλ(xk1,zk1)2|~k1].\displaystyle=\gamma_{2}^{2}\mathbb{V}[\lambda^{k}-\lambda^{*}(x^{k-1},z^{k-1})|\tilde{\mathcal{F}}_{k-1}]\leq\gamma_{2}^{2}\mathbb{E}[\|\lambda^{k}-\lambda^{*}(x^{k-1},z^{k-1})\|^{2}|\tilde{\mathcal{F}}_{k-1}]. (C.16)

Combining (C.13), (C.15) and (C.16) gives the desired result. ∎

Lemma C.4.

Under the conditions of Theorem B.1, ϵ𝒢k\epsilon_{\mathcal{G}}^{k} and σ𝒢k\sigma_{\mathcal{G}}^{k} are bounded as

\displaystyle\epsilon_{\mathcal{G}}^{k} \leq\frac{\gamma_{2}}{2}\phi_{2}\frac{1+\log(s_{k})}{s_{k}}+(M_{H,1}+\sqrt{p})\left(\phi_{2}\frac{1+\log(s_{k})}{s_{k}}\right)^{\frac{1}{2}} (C.17a)
\displaystyle\quad+\left(L_{G}+\frac{1}{\gamma_{1}}(M_{H,2}^{2}+\gamma_{1}L_{H}B+\frac{\gamma_{1}L_{H}}{\gamma_{2}}M_{H,0}+M_{H,0}L_{H})\right)\left(\phi_{1}\frac{1+\log(s_{k})}{s_{k}}\right)^{\frac{1}{2}}, (C.17b)
\displaystyle(\sigma_{\mathcal{G}}^{k})^{2} \leq\frac{4\sigma_{g}^{2}}{q_{k}}+\left(2\Big(L_{G}^{2}+\frac{(\gamma_{1}\bar{M}_{\lambda}+M_{H,0})^{2}}{\gamma_{1}^{2}}L_{H}^{2}\Big)\bar{\phi}_{1}+\gamma_{2}^{2}\bar{\phi}_{2}\right)\frac{1+\log(s_{k})}{s_{k}}. (C.17c)

where \bar{\phi}_{1}=\eta\left(\eta M_{\mathcal{L},1}+\rho(4\gamma_{2}B^{2}+4M_{H,0}^{2})+(\eta M_{\mathcal{L},2}+2\rho(\gamma_{1}+\gamma_{2}))\bar{M}_{\lambda}\right), \bar{\phi}_{2}=\frac{\rho}{\eta}\bar{\phi}_{1}, and \bar{M}_{\lambda}=2\rho^{2}\gamma_{2}^{2}B^{2}+2p\rho^{2}M_{H,0}^{2} are the constants obtained by replacing \|z\| with B.

Proof It follows from the Cauchy-Schwarz inequality that \mathbb{E}[\|\lambda^{k}-\lambda^{*}(x^{k-1},z^{k-1})\||\tilde{\mathcal{F}}_{k-1}]\leq(\mathbb{E}[\|\lambda^{k}-\lambda^{*}(x^{k-1},z^{k-1})\|^{2}|\tilde{\mathcal{F}}_{k-1}])^{\frac{1}{2}} and \mathbb{E}[\|w^{k}-w^{*}(x^{k-1},z^{k-1})\||\tilde{\mathcal{F}}_{k-1}]\leq(\mathbb{E}[\|w^{k}-w^{*}(x^{k-1},z^{k-1})\|^{2}|\tilde{\mathcal{F}}_{k-1}])^{\frac{1}{2}}. Substituting the results of Theorem B.1 into (C.11) and (C.12) gives the desired results. Note that w^{s}-w^{*},\lambda^{s}-\lambda^{*} in (B.14) correspond to w^{k}-w^{*}(x^{k-1},z^{k-1}),\lambda^{k}-\lambda^{*}(x^{k-1},z^{k-1}) in the current context. ∎

Denote f(x,y;ζk)=1rkl=1rkf(x,y;ζlk)\nabla f(x,y;\mathbf{\zeta}^{k})=\frac{1}{r_{k}}\sum_{l=1}^{r_{k}}\nabla f(x,y;\zeta^{k}_{l}). From the definition of Ψk(𝐮;ζk,ξ~k)\Psi^{k}(\mathbf{u};\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k}) in (2.10) and (2.12), we derive the relationship between the bias of Ψk(𝐮;ζk,ξ~k)\nabla\Psi^{k}(\mathbf{u};\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k}) and 𝒢k(𝐮;ξ~k)\nabla\mathcal{G}^{k}(\mathbf{u};\mathbf{\tilde{\xi}}^{k}) as follows.

\nabla\Psi^{k}(\mathbf{u};\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k})-\nabla\Psi(\mathbf{u})=\nabla f(x,y;\mathbf{\zeta}^{k})-\nabla F(x,y)+c_{1}(\nabla\mathcal{G}^{k}(\mathbf{u};\mathbf{\tilde{\xi}}^{k})-\nabla\mathcal{G}(\mathbf{u})). (C.18)

Since f(x,y;ζk)\nabla f(x,y;\mathbf{\zeta}^{k}) is unbiased, the bias of Ψk(𝐮;ζk,ξ~k)\nabla\Psi^{k}(\mathbf{u};\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k}) is fully determined by the bias of 𝒢k(𝐮;ξ~k)\nabla\mathcal{G}^{k}(\mathbf{u};\mathbf{\tilde{\xi}}^{k}).

From the relationship between variance and second moment, the bias b^{k} defined in (3.7) can be bounded as follows.

Lemma C.5.

The bias b^{k} has a controllable second moment:

𝔼[bk2|~k1]\displaystyle\mathbb{E}[\|b^{k}\|^{2}|\tilde{\mathcal{F}}_{k-1}] 2(σf2rk+c12(σ𝒢k)2qk+c12(ϵ𝒢k)2).\displaystyle\leq 2\left(\frac{\sigma_{f}^{2}}{r_{k}}+c_{1}^{2}\frac{(\sigma_{\mathcal{G}}^{k})^{2}}{q_{k}}+c_{1}^{2}(\epsilon_{\mathcal{G}}^{k})^{2}\right). (C.19)

Here 𝔼[]\mathbb{E}[\cdot] is the abbreviation of 𝔼ζk,ξ~k,ξk[]\mathbb{E}_{\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k},\mathbf{\xi}^{k}}[\cdot].

Proof By Lemma C.3, it holds that

𝔼[bk2|~k1]\displaystyle\quad\mathbb{E}[\|b^{k}\|^{2}|\widetilde{\mathcal{F}}_{k-1}]
\displaystyle\leq 2\mathbb{V}[\nabla\Psi^{k}(\mathbf{u}^{k};\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k})|\widetilde{\mathcal{F}}_{k-1}]+2\mathbb{E}[\|\mathbb{E}[\nabla\Psi^{k}(\mathbf{u}^{k};\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k})|\widetilde{\mathcal{F}}_{k-1}]-\nabla\Psi(\mathbf{u}^{k})\|^{2}|\widetilde{\mathcal{F}}_{k-1}]
\displaystyle=2\left(\mathbb{V}\left[\nabla f(\mathbf{u}^{k};\mathbf{\zeta}^{k})|\widetilde{\mathcal{F}}_{k-1}\right]+\mathbb{V}\left[c_{1}\nabla\mathcal{G}^{k}(\mathbf{u}^{k};\mathbf{\tilde{\xi}}^{k})|\widetilde{\mathcal{F}}_{k-1}\right]\right.
\displaystyle\quad\left.+\mathbb{E}\left[c_{1}^{2}\|\mathbb{E}[\nabla\mathcal{G}^{k}(\mathbf{u}^{k};\mathbf{\tilde{\xi}}^{k})|\widetilde{\mathcal{F}}_{k-1}]-\nabla\mathcal{G}(\mathbf{u}^{k})\|^{2}|\widetilde{\mathcal{F}}_{k-1}\right]\right)
\displaystyle\leq 2\left(\frac{\sigma_{f}^{2}}{r_{k}}+c_{1}^{2}\frac{(\sigma_{\mathcal{G}}^{k})^{2}}{q_{k}}+c_{1}^{2}(\epsilon_{\mathcal{G}}^{k})^{2}\right).

The last inequality follows from Lemmas C.2 and C.3. This completes the proof. ∎

The convergence of Algorithm 2 is established in the following theorem.

Theorem C.1.

Assume the step sizes satisfy \alpha_{k}<\frac{1}{2L_{\Psi}}. Then the sequence \{\mathbf{u}^{k}\} satisfies

𝔼[1k=0K1αkk=0K11αk𝐮k+1𝐮k2]\displaystyle\quad\mathbb{E}\left[\frac{1}{\sum_{k=0}^{K-1}\alpha_{k}}\sum_{k=0}^{K-1}\frac{1}{\alpha_{k}}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}\right] (C.20)
4k=0K1αk𝔼[Ψ(𝐮0)Ψ(𝐮K)]+2k=0K1αkk=0K1αk𝔼[bk2].\displaystyle\leq\frac{4}{\sum_{k=0}^{K-1}\alpha_{k}}\mathbb{E}[\Psi(\mathbf{u}^{0})-\Psi(\mathbf{u}^{K})]+\frac{2}{\sum_{k=0}^{K-1}\alpha_{k}}\sum_{k=0}^{K-1}\alpha_{k}\mathbb{E}[\|b^{k}\|^{2}].

Here the expectation is taken over all ξ𝐤𝒟ξsk,ξ~k𝒟ξqk,ζk𝒟ζrk\mathbf{\xi^{k}}\sim\mathcal{D}_{\xi}^{s_{k}},\mathbf{\tilde{\xi}}^{k}\sim\mathcal{D}_{{\xi}}^{q_{k}},\mathbf{\zeta}^{k}\sim\mathcal{D}_{\zeta}^{r_{k}}, k=0,,K1k=0,...,K-1.

Proof The projected gradient step \mathbf{u}^{k+1}=\mathrm{Proj}_{\mathcal{U}}(\mathbf{u}^{k}-\alpha_{k}\nabla\Psi^{k}(\mathbf{u}^{k};\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k})) gives

\langle\mathbf{u}^{k+1}-\mathbf{u}^{k},\mathbf{u}^{k+1}-(\mathbf{u}^{k}-\alpha_{k}\nabla\Psi^{k}(\mathbf{u}^{k};\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k}))\rangle\leq 0.

This is equivalent to

\langle\nabla\Psi^{k}(\mathbf{u}^{k};\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k}),\mathbf{u}^{k+1}-\mathbf{u}^{k}\rangle\leq-\frac{1}{\alpha_{k}}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}.

The Lipschitz property of Ψ(𝐮)\nabla\Psi(\mathbf{u}) gives

\displaystyle\quad\Psi(\mathbf{u}^{k+1})-\Psi(\mathbf{u}^{k}) (C.21)
\displaystyle\leq\langle\nabla\Psi(\mathbf{u}^{k}),\mathbf{u}^{k+1}-\mathbf{u}^{k}\rangle+\frac{L_{\Psi}}{2}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}
\displaystyle=\langle\nabla\Psi^{k}(\mathbf{u}^{k};\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k}),\mathbf{u}^{k+1}-\mathbf{u}^{k}\rangle+\langle b^{k},\mathbf{u}^{k+1}-\mathbf{u}^{k}\rangle+\frac{L_{\Psi}}{2}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}
\displaystyle\leq-\frac{1}{\alpha_{k}}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}+\frac{\alpha_{k}}{2}\|b^{k}\|^{2}+\frac{1}{2\alpha_{k}}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}+\frac{L_{\Psi}}{2}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}
\displaystyle=-\frac{1}{2}(\frac{1}{\alpha_{k}}-L_{\Psi})\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}+\frac{\alpha_{k}}{2}\|b^{k}\|^{2}
\displaystyle\leq-\frac{1}{4\alpha_{k}}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}+\frac{\alpha_{k}}{2}\|b^{k}\|^{2}.

The last inequality uses the assumption αk12LΨ\alpha_{k}\leq\frac{1}{2L_{\Psi}}. Then summing up (C.21) over k=0,,K1k=0,...,K-1 gives

k=0K11αk𝐮k+1𝐮k2\displaystyle\sum_{k=0}^{K-1}\frac{1}{\alpha_{k}}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}\leq 4(Ψ(𝐮0)Ψ(𝐮K))+2k=0K1αkbk2.\displaystyle 4(\Psi(\mathbf{u}^{0})-\Psi(\mathbf{u}^{K}))+2\sum_{k=0}^{K-1}\alpha_{k}\|b^{k}\|^{2}. (C.22)

Taking expectations on both sides and multiplying by \frac{1}{\sum_{k=0}^{K-1}\alpha_{k}} yields (C.20). This completes the proof. ∎

We measure convergence by the deviation from the first-order optimality condition,

dist(0,Ψ(𝐮)+𝒩𝒰(𝐮)).\mathrm{dist}(0,\nabla\Psi(\mathbf{u})+\mathcal{N}_{\mathcal{U}}(\mathbf{u})). (C.23)

Let \delta^{k}=\mathbf{u}^{k+1}-\mathbf{u}^{k}+\alpha_{k}\nabla\Psi^{k}(\mathbf{u}^{k};\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k}). The projection step gives -\delta^{k}\in\mathcal{N}_{\mathcal{U}}(\mathbf{u}^{k+1}). We derive the following bound on this measure in Theorem C.2.
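In practice, the right-hand side of (C.25) is the quantity one monitors. The following small sketch computes this projected-gradient surrogate for (C.23), assuming the stochastic gradient and the projection are available as callables; it is illustrative only.

```python
import numpy as np

def stationarity_surrogate(u, grad_psi, proj_U, alpha):
    """Computable surrogate for dist(0, grad Psi(u) + N_U(u)); see (C.23)-(C.25).

    grad_psi(u) and proj_U(u) are assumed callables for a (stochastic) gradient
    of Psi and the projection onto U; this is a sketch, not the paper's code.
    """
    u_next = proj_U(u - alpha * grad_psi(u))   # one projected gradient step
    return np.linalg.norm(u_next - u) / alpha  # upper-bounds dist(0, grad Psi^k(u) + N_U(u_next)), cf. (C.25)
```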

Theorem C.2.

Assume αk12LΨ\alpha_{k}\leq\frac{1}{2L_{\Psi}}. Then it holds that

1k=0K1αkk=0K1αk𝔼[dist(0,Ψ(𝐮k+1)+𝒩𝒰(𝐮k+1))2]\displaystyle\quad\frac{1}{\sum_{k=0}^{K-1}\alpha_{k}}\sum_{k=0}^{K-1}\alpha_{k}\mathbb{E}[\mathrm{dist}(0,\nabla\Psi(\mathbf{u}^{k+1})+\mathcal{N}_{\mathcal{U}}(\mathbf{u}^{k+1}))^{2}] (C.24)
18k=0K1αk(Ψ(𝐮0)Ψ(𝐮K))+11k=0K1αkk=0K1αk𝔼[bk2].\displaystyle\leq\frac{18}{\sum_{k=0}^{K-1}\alpha_{k}}(\Psi(\mathbf{u}^{0})-\Psi(\mathbf{u}^{K}))+\frac{11}{\sum_{k=0}^{K-1}\alpha_{k}}\sum_{k=0}^{K-1}\alpha_{k}\mathbb{E}[\|b^{k}\|^{2}].

Proof Since δk𝒩𝒰(𝐮k+1)-\delta^{k}\in\mathcal{N}_{\mathcal{U}}(\mathbf{u}^{k+1}), we have

dist(0,Ψk(𝐮k)+𝒩𝒰(𝐮k+1))1αk𝐮k+1𝐮k.\mathrm{dist}(0,\nabla\Psi^{k}(\mathbf{u}^{k})+\mathcal{N}_{\mathcal{U}}(\mathbf{u}^{k+1}))\leq\frac{1}{\alpha_{k}}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|. (C.25)

The Lipschitz property of Ψ(𝐮)\nabla\Psi(\mathbf{u}) gives

dist(0,Ψ(𝐮k+1)+𝒩𝒰(𝐮k+1))\displaystyle\mathrm{dist}(0,\nabla\Psi(\mathbf{u}^{k+1})+\mathcal{N}_{\mathcal{U}}(\mathbf{u}^{k+1})) dist(0,Ψ(𝐮k)+𝒩𝒰(𝐮k+1))+LΨ𝐮k+1𝐮k\displaystyle\leq\mathrm{dist}(0,\nabla\Psi(\mathbf{u}^{k})+\mathcal{N}_{\mathcal{U}}(\mathbf{u}^{k+1}))+L_{\Psi}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\| (C.26)
dist(0,Ψk(𝐮k)+𝒩𝒰(𝐮k+1))+bk+LΨ𝐮k+1𝐮k\displaystyle\leq\mathrm{dist}(0,\nabla\Psi^{k}(\mathbf{u}^{k})+\mathcal{N}_{\mathcal{U}}(\mathbf{u}^{k+1}))+\|b^{k}\|+L_{\Psi}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|
bk+(1αk+LΨ)𝐮k+1𝐮k( by (C.25))\displaystyle\leq\|b^{k}\|+(\frac{1}{\alpha_{k}}+L_{\Psi})\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|\quad\text{( by~\eqref{eq: convergence of outer loop 1})}
bk+32αk𝐮k+1𝐮k.\displaystyle\leq\|b^{k}\|+\frac{3}{2\alpha_{k}}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|.

Taking square we obtain

dist(0,Ψ(𝐮k+1)+𝒩𝒰(𝐮k+1))2\displaystyle\mathrm{dist}(0,\nabla\Psi(\mathbf{u}^{k+1})+\mathcal{N}_{\mathcal{U}}(\mathbf{u}^{k+1}))^{2} (bk+32αk𝐮k+1𝐮k)2\displaystyle\leq(\|b^{k}\|+\frac{3}{2\alpha_{k}}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|)^{2} (C.27)
2bk2+92αk2𝐮k+1𝐮k2.\displaystyle\leq 2\|b^{k}\|^{2}+\frac{9}{2\alpha_{k}^{2}}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}.

Substituting the above inequality into (C.20) gives (C.24). This completes the proof. ∎

Corollary C.2.

Take the step sizes in the inner loop as in (B.13). Take a constant step size \alpha_{k}={\alpha}<\frac{1}{2L_{\Psi}} in the outer loop and constant sample sizes r_{k}=r, q_{k}=q and s_{k}=s in Algorithm 2. Randomly choose an index R from \{1,...,K\} with probability \mathrm{Prob}(R=k)=\frac{\alpha_{k-1}}{\sum_{k=1}^{K}\alpha_{k-1}}. Then we have

𝔼[1k=0K1αkk=0K11αk𝐮k+1𝐮k2]𝒪~(1αK+1r+c12q+c12s),\displaystyle\quad\mathbb{E}\left[\frac{1}{\sum_{k=0}^{K-1}\alpha_{k}}\sum_{k=0}^{K-1}\frac{1}{\alpha_{k}}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}\right]\leq\widetilde{\mathcal{O}}\left(\frac{1}{\alpha K}+\frac{1}{r}+\frac{c_{1}^{2}}{q}+\frac{c_{1}^{2}}{s}\right),
𝔼[dist(0,Ψ(𝐮R)+𝒩𝒰(𝐮R))2]𝒪~(1αK+1r+c12q+c12s).\displaystyle\quad\mathbb{E}[\mathrm{dist}(0,\nabla\Psi(\mathbf{u}^{R})+\mathcal{N}_{\mathcal{U}}(\mathbf{u}^{R}))^{2}]\leq\widetilde{\mathcal{O}}\left(\frac{1}{\alpha K}+\frac{1}{r}+\frac{c_{1}^{2}}{q}+\frac{c_{1}^{2}}{s}\right).

Proof From Lemmas C.4 and C.5 we know that \mathbb{E}[\|b^{k}\|^{2}]\leq\widetilde{\mathcal{O}}(\frac{1}{r_{k}}+\frac{c_{1}^{2}}{q_{k}}+\frac{c_{1}^{2}}{s_{k}}). Substituting this into Theorems C.1 and C.2 gives

\displaystyle\mathbb{E}\left[\frac{1}{\sum_{k=0}^{K-1}\alpha_{k}}\sum_{k=0}^{K-1}\frac{1}{\alpha_{k}}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}\right] \leq\frac{4}{\alpha K}\mathbb{E}[\Psi(\mathbf{u}^{0})-\Psi(\mathbf{u}^{K})]+\frac{2}{K}\sum_{k=0}^{K-1}\mathbb{E}[\|b^{k}\|^{2}]
𝒪~(1αK+1r+c12q+c12s),\displaystyle\leq\widetilde{\mathcal{O}}\left(\frac{1}{\alpha K}+\frac{1}{r}+\frac{c_{1}^{2}}{q}+\frac{c_{1}^{2}}{s}\right),
𝔼[dist(0,Ψ(𝐮R)+𝒩𝒰(𝐮R))2]\displaystyle\mathbb{E}[\mathrm{dist}(0,\nabla\Psi(\mathbf{u}^{R})+\mathcal{N}_{\mathcal{U}}(\mathbf{u}^{R}))^{2}] 18αK𝔼[Ψ(𝐮0)Ψ(𝐮K)]+11Kk=0K1𝔼[bk2]\displaystyle\leq\frac{18}{\alpha K}\mathbb{E}[\Psi(\mathbf{u}^{0})-\Psi(\mathbf{u}^{K})]+\frac{11}{K}\sum_{k=0}^{K-1}\mathbb{E}[\|b^{k}\|^{2}]
𝒪~(1αK+1r+c12q+c12s).\displaystyle\leq\widetilde{\mathcal{O}}\left(\frac{1}{\alpha K}+\frac{1}{r}+\frac{c_{1}^{2}}{q}+\frac{c_{1}^{2}}{s}\right).

This completes the proof. ∎
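The constant-step, constant-batch outer loop of Corollary C.2, including the randomized output index R, can be sketched as follows. The stochastic gradient of \Psi^{k} (which internally runs the inner loop to produce (w^{k},\lambda^{k})) is assumed to be available as a callable; this is an illustration, not a reproduction of Algorithm 2.

```python
import numpy as np

def outer_loop(grad_psi_hat, proj_U, u0, K, alpha, rng):
    """Sketch of the outer projected SGD loop with a randomized output index.

    grad_psi_hat(u, k) is an assumed placeholder for the stochastic gradient of
    Psi built from batches of sizes (r, q, s) at iteration k.
    """
    u = np.array(u0, dtype=float)
    iterates = []
    for k in range(K):
        u = proj_U(u - alpha * grad_psi_hat(u, k))  # projected stochastic gradient step
        iterates.append(u.copy())
    # Prob(R = k) proportional to alpha_{k-1}; with a constant step size this is uniform.
    R = int(rng.integers(1, K + 1))
    return iterates[R - 1]  # u^R
```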

Remark C.1.

Let c=\max(c_{1},c_{2}). The step size condition \alpha_{k}<\frac{1}{2L_{\Psi}} and (C.3) imply that \alpha_{k} is at most \widetilde{\mathcal{O}}(c^{-1}). With \alpha\sim\mathcal{O}(c^{-1}), r\sim\mathcal{O}(\epsilon^{-1}), q\sim\mathcal{O}(c_{1}^{2}\epsilon^{-1}), s\sim{\mathcal{O}}(c_{1}^{2}\epsilon^{-1}), K\sim{\mathcal{O}}(c\epsilon^{-1}), the right-hand side of the above inequality is \widetilde{\mathcal{O}}(\epsilon). Then the sample complexity on \xi is \sum_{k=0}^{K-1}{s_{k}}+\sum_{k=0}^{K-1}q_{k}=sK+qK=\widetilde{\mathcal{O}}(cc_{1}^{2}\epsilon^{-2}) and the sample complexity on \zeta is \sum_{k=0}^{K-1}{r_{k}}=rK=\widetilde{\mathcal{O}}(c\epsilon^{-2}). Theorem C.1 shows that the algorithm converges to an \epsilon-stationary point of problem (2.9) for any fixed c_{1}>0 with this sample complexity.

Remark C.2.

By Theorem A.4, if we take c1𝒪(ϵ1)c_{1}\sim\mathcal{O}(\epsilon^{-1}), c2𝒪(ϵ3),δ𝒪(ϵ2)c_{2}\sim\mathcal{O}(\epsilon^{-3}),\delta\sim\mathcal{O}(\epsilon^{-2}), then (2.9) is equivalent to the original BLO (1.2) in the sense of ϵ\epsilon-accuracy. Under this condition, the sample complexity on ξ\xi is 𝒪~(ϵ7)\widetilde{\mathcal{O}}(\epsilon^{-7}) and the sample complexity on ζ\zeta is 𝒪~(ϵ5)\widetilde{\mathcal{O}}(\epsilon^{-5}).

C.1 Analysis on variance reduction

In this section, we introduce a stronger assumption on the Lipschitz continuity of the gradients of f(x,y) and g(x,y), stated in Assumption 3.7. From (C.3) and (C.18), we know that \nabla\Psi^{k}(\mathbf{u};\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k}) is L_{\Psi}^{\prime}-averaged Lipschitz continuous conditioned on \mathcal{F}_{k-1} with modulus

LΨ=Lf+c1ϵ𝒢k𝒪(Lf+c1sk).L_{\Psi}^{\prime}=L_{f}+c_{1}\epsilon_{\mathcal{G}}^{k}\leq\mathcal{O}(L_{f}+\frac{c_{1}}{s_{k}}). (C.28)

Define the error of the direction as

ek=dk𝔼[Ψk(𝐮k;ζk,ξ~k)|~k1].e^{k}=d^{k}-\mathbb{E}[\nabla\Psi^{k}(\mathbf{u}^{k};\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k})|\tilde{\mathcal{F}}_{k-1}].
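The direction d^{k} follows the momentum-style recursion of (2.15); as can be read off from the proof of Lemma C.7 below, it takes the form d^{k+1}=\nabla\Psi^{k+1}(\mathbf{u}^{k+1};\mathbf{\zeta}^{k+1},\mathbf{\tilde{\xi}}^{k+1})+(1-\beta_{k+1})(d^{k}-\nabla\Psi^{k+1}(\mathbf{u}^{k};\mathbf{\zeta}^{k+1},\mathbf{\tilde{\xi}}^{k+1})). A minimal sketch of this update follows, with the fresh-sample gradient supplied as an assumed callable.

```python
def update_direction(d_prev, u_new, u_prev, beta, grad_psi_new):
    """Momentum-based variance-reduced direction update; sketch only.

    grad_psi_new(u) is an assumed callable evaluating the iteration-(k+1)
    stochastic gradient (built from the fresh samples) at the point u;
    evaluating it at both u_new and u_prev with the same samples is what
    drives the error recursion in Lemma C.7.
    """
    g_new = grad_psi_new(u_new)    # gradient at the new iterate u^{k+1}
    g_prev = grad_psi_new(u_prev)  # same fresh samples, previous iterate u^k
    return g_new + (1.0 - beta) * (d_prev - g_prev)  # d^{k+1}
```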

First, we derive the one-iteration decrease under the variance-reduced updates.

Lemma C.6.

The sequence {𝐮k}\{\mathbf{u}^{k}\} satisfies

Ψ(𝐮k+1)Ψ(𝐮k)14αk𝐮k+1𝐮k2+αkek2+αkc12(ϵ𝒢k)2,\quad\Psi(\mathbf{u}^{k+1})-\Psi(\mathbf{u}^{k})\leq-\frac{1}{4\alpha_{k}}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}+\alpha_{k}\|e^{k}\|^{2}+\alpha_{k}c_{1}^{2}(\epsilon_{\mathcal{G}}^{k})^{2}, (C.29)

where ϵ𝒢k\epsilon_{\mathcal{G}}^{k} is defined in Lemma C.3.

Proof Similar to (C.21), it holds that

Ψ(𝐮k+1)Ψ(𝐮k)\displaystyle\quad\Psi(\mathbf{u}^{k+1})-\Psi(\mathbf{u}^{k}) (C.30)
Ψ(𝐮k),𝐮k+1𝐮k+LΨ2𝐮k+1𝐮k2\displaystyle\leq\langle\nabla\Psi(\mathbf{u}^{k}),\mathbf{u}^{k+1}-\mathbf{u}^{k}\rangle+\frac{L_{\Psi}}{2}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}
=dk,𝐮k+1𝐮kek+𝔼[Ψk(𝐮k;ζk,ξ~k)|~k1]Ψ(𝐮k),𝐮k+1𝐮k\displaystyle=\langle d^{k},\mathbf{u}^{k+1}-\mathbf{u}^{k}\rangle-\langle e^{k}+\mathbb{E}[\nabla\Psi^{k}(\mathbf{u}^{k};\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k})|\widetilde{\mathcal{F}}_{k-1}]-\nabla\Psi(\mathbf{u}^{k}),\mathbf{u}^{k+1}-\mathbf{u}^{k}\rangle
+LΨ2𝐮k+1𝐮k2\displaystyle\quad+\frac{L_{\Psi}}{2}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}
1αk𝐮k+1𝐮k2ek+𝔼[Ψk(𝐮k;ζk,ξ~k)|~k1]Ψ(𝐮k),𝐮k+1𝐮k\displaystyle\leq-\frac{1}{\alpha_{k}}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}-\langle e^{k}+\mathbb{E}[\nabla\Psi^{k}(\mathbf{u}^{k};\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k})|\widetilde{\mathcal{F}}_{k-1}]-\nabla\Psi(\mathbf{u}^{k}),\mathbf{u}^{k+1}-\mathbf{u}^{k}\rangle
+LΨ2𝐮k+1𝐮k2\displaystyle\quad+\frac{L_{\Psi}}{2}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}
1αk𝐮k+1𝐮k2+αk2ek+𝔼[Ψk(𝐮k;ζk,ξ~k)|~k1]Ψ(𝐮k)2\displaystyle\leq-\frac{1}{\alpha_{k}}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}+\frac{\alpha_{k}}{2}\|e^{k}+\mathbb{E}[\nabla\Psi^{k}(\mathbf{u}^{k};\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k})|\widetilde{\mathcal{F}}_{k-1}]-\nabla\Psi(\mathbf{u}^{k})\|^{2}
\displaystyle\quad+(\frac{1}{2\alpha_{k}}+\frac{L_{\Psi}}{2})\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}
12(1αkLΨ)𝐮k+1𝐮k2+αkek2+αkΨ(𝐮k)𝔼[Ψk(𝐮k;ζk,ξ~k)|~k1]2\displaystyle\leq-\frac{1}{2}(\frac{1}{\alpha_{k}}-L_{\Psi})\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}+\alpha_{k}\|e^{k}\|^{2}+\alpha_{k}\|\nabla\Psi(\mathbf{u}^{k})-\mathbb{E}[\nabla\Psi^{k}(\mathbf{u}^{k};\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k})|\widetilde{\mathcal{F}}_{k-1}]\|^{2}
12(1αkLΨ)𝐮k+1𝐮k2+αkek2+αkc12(ϵ𝒢k)2.\displaystyle\leq-\frac{1}{2}(\frac{1}{\alpha_{k}}-L_{\Psi})\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}+\alpha_{k}\|e^{k}\|^{2}+\alpha_{k}c_{1}^{2}(\epsilon_{\mathcal{G}}^{k})^{2}.

The first inequality follows from the Lipschitz property, the second inequality is due to the projected gradient step, and the third inequality uses Young's inequality. Finally, (C.29) follows since the step-size condition \alpha_{k}\leq\frac{1}{2L_{\Psi}} gives -\frac{1}{2}(\frac{1}{\alpha_{k}}-L_{\Psi})\leq-\frac{1}{4\alpha_{k}}. This completes the proof. ∎

Next we show that the error sequence \{e^{k}\} satisfies the following recursive relationship, which controls its accumulation.

Lemma C.7.

The conditional expectation of the error ek+1e^{k+1} is bounded as

𝔼[ek+12|~k]\displaystyle\mathbb{E}[\|e^{k+1}\|^{2}|\widetilde{\mathcal{F}}_{k}] 2βk+12(σf2rk+1+c12(σ𝒢k)2)+8(1βk+1)2ek2\displaystyle\leq{2}\beta_{k+1}^{2}\left(\frac{\sigma_{f}^{2}}{r_{k+1}}+c_{1}^{2}(\sigma_{\mathcal{G}}^{k})^{2}\right)+8(1-\beta_{k+1})^{2}\|e^{k}\|^{2} (C.31)
\displaystyle\quad+8\frac{(L_{\Psi}^{\prime})^{2}+L_{\Psi}^{2}}{r_{k+1}}\mathbb{E}[\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}|\tilde{\mathcal{F}}_{k}]+\frac{8c^{2}}{r_{k+1}}\big((\epsilon_{\mathcal{G}}^{k})^{2}+(\epsilon_{\mathcal{G}}^{k+1})^{2}\big).

Here the expectation is taken over ζk+1,ξ~k+1,ξk+1\mathbf{\zeta}^{k+1},\mathbf{\tilde{\xi}}^{k+1},\mathbf{\xi}^{k+1}.

Proof Let \Delta^{k+1} denote the zero-mean fluctuation of the sampled gradient at \mathbf{u}^{k+1}, and let \tilde{\Delta}^{k} denote the deviation of the new-sample gradient at \mathbf{u}^{k} from the conditional expectation of the previous gradient estimate, that is,

Δk+1\displaystyle\Delta^{k+1} =Ψk+1(𝐮k+1;ζk+1,ξ~k+1)𝔼[Ψk+1(𝐮k+1;ζk+1,ξ~k+1)|~k],\displaystyle=\nabla\Psi^{k+1}(\mathbf{u}^{k+1};\mathbf{\zeta}^{k+1},\mathbf{\tilde{\xi}}^{k+1})-\mathbb{E}[\nabla\Psi^{k+1}(\mathbf{u}^{k+1};\mathbf{\zeta}^{k+1},\mathbf{\tilde{\xi}}^{k+1})|\tilde{\mathcal{F}}_{k}],
Δ~k\displaystyle\tilde{\Delta}^{k} =Ψk+1(𝐮k;ζk+1,ξ~k+1)𝔼[Ψk(𝐮k;ζk,ξ~k)|~k1].\displaystyle=\nabla\Psi^{k+1}(\mathbf{u}^{k};\mathbf{\zeta}^{k+1},\mathbf{\tilde{\xi}}^{k+1})-\mathbb{E}[\nabla\Psi^{k}(\mathbf{u}^{k};\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k})|\tilde{\mathcal{F}}_{k-1}].

From the definition of eke^{k} and (2.15) we have

ek+1\displaystyle e^{k+1} =dk+1𝔼[Ψk+1(𝐮k+1;ζk+1,ξ~k+1)|~k]\displaystyle=d^{k+1}-\mathbb{E}[\nabla\Psi^{k+1}(\mathbf{u}^{k+1};\mathbf{\zeta}^{k+1},\mathbf{\tilde{\xi}}^{k+1})|\tilde{\mathcal{F}}_{k}] (C.32)
=Ψk+1(𝐮k+1;ζk+1,ξ~k+1)𝔼[Ψk+1(𝐮k+1;ζk+1,ξ~k+1)|~k]\displaystyle=\nabla\Psi^{k+1}(\mathbf{u}^{k+1};\mathbf{\zeta}^{k+1},\mathbf{\tilde{\xi}}^{k+1})-\mathbb{E}[\nabla\Psi^{k+1}(\mathbf{u}^{k+1};\mathbf{\zeta}^{k+1},\mathbf{\tilde{\xi}}^{k+1})|\tilde{\mathcal{F}}_{k}]
+(1βk+1)(dkΨk+1(𝐮k;ζk+1,ξ~k+1))\displaystyle\quad+(1-\beta_{k+1})(d^{k}-\nabla\Psi^{k+1}(\mathbf{u}^{k};\mathbf{\zeta}^{k+1},\mathbf{\tilde{\xi}}^{k+1}))
=Ψk+1(𝐮k+1;ζk+1,ξ~k+1)𝔼[Ψk+1(𝐮k+1;ζk+1,ξ~k+1)|~k]\displaystyle=\nabla\Psi^{k+1}(\mathbf{u}^{k+1};\mathbf{\zeta}^{k+1},\mathbf{\tilde{\xi}}^{k+1})-\mathbb{E}[\nabla\Psi^{k+1}(\mathbf{u}^{k+1};\mathbf{\zeta}^{k+1},\mathbf{\tilde{\xi}}^{k+1})|\tilde{\mathcal{F}}_{k}]
+(1βk+1)(ek+𝔼[Ψk(𝐮k;ζk,ξ~k)|k1]Ψk+1(𝐮k;ζk+1,ξ~k+1))\displaystyle\quad+(1-\beta_{k+1})(e^{k}+\mathbb{E}[\nabla\Psi^{k}(\mathbf{u}^{k};\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k})|\mathcal{F}_{k-1}]-\nabla\Psi^{k+1}(\mathbf{u}^{k};\mathbf{\zeta}^{k+1},\mathbf{\tilde{\xi}}^{k+1}))
=βk+1(Ψk+1(𝐮k+1;ζk+1,ξ~k+1𝔼[Ψk+1(𝐮k+1;ζk+1,ξ~k+1)|~k])\displaystyle=\beta_{k+1}\left(\nabla\Psi^{k+1}(\mathbf{u}^{k+1};\mathbf{\zeta}^{k+1},\mathbf{\tilde{\xi}}^{k+1}-\mathbb{E}[\nabla\Psi^{k+1}(\mathbf{u}^{k+1};\mathbf{\zeta}^{k+1},\mathbf{\tilde{\xi}}^{k+1})|\tilde{\mathcal{F}}_{k}]\right)
+(1βk+1)(Ψk+1(𝐮k+1;ζk+1,ξ~k+1)𝔼[Ψk+1(𝐮k+1;ζk+1,ξ~k+1)|~k]\displaystyle\quad+(1-\beta_{k+1})\left(\nabla\Psi^{k+1}(\mathbf{u}^{k+1};\mathbf{\zeta}^{k+1},\mathbf{\tilde{\xi}}^{k+1})-\mathbb{E}[\nabla\Psi^{k+1}(\mathbf{u}^{k+1};\mathbf{\zeta}^{k+1},\mathbf{\tilde{\xi}}^{k+1})|\tilde{\mathcal{F}}_{k}]\right.
+ek+𝔼[Ψk(𝐮k;ζk,ξ~k)|k1]Ψk+1(𝐮k;ζk+1,ξ~k+1))\displaystyle\quad\left.+e^{k}+\mathbb{E}[\nabla\Psi^{k}(\mathbf{u}^{k};\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k})|\mathcal{F}_{k-1}]-\nabla\Psi^{k+1}(\mathbf{u}^{k};\mathbf{\zeta}^{k+1},\mathbf{\tilde{\xi}}^{k+1})\right)
=βk+1Δk+1+(1βk+1)ek+(1βk+1)(Δk+1Δ~k).\displaystyle=\beta_{k+1}\Delta^{k+1}+(1-\beta_{k+1})e^{k}+(1-\beta_{k+1})(\Delta^{k+1}-\tilde{\Delta}^{k}).

It follows from (C.6) and Lemma C.3 that \mathbb{E}[\|\Delta^{k+1}\|^{2}|\widetilde{\mathcal{F}}_{k}]=\mathbb{V}[\nabla\Psi^{k+1}(\mathbf{u}^{k+1};\mathbf{\zeta}^{k+1},\mathbf{\tilde{\xi}}^{k+1})|\widetilde{\mathcal{F}}_{k}]\leq\frac{\sigma_{f}^{2}}{r_{k+1}}+c_{1}^{2}(\sigma_{\mathcal{G}}^{k})^{2}, where \sigma_{\mathcal{G}}^{k} is defined in (C.12). Since e^{k} is a constant conditioned on \widetilde{\mathcal{F}}_{k} and \mathbb{E}[\Delta^{k+1}|\widetilde{\mathcal{F}}_{k}]=0, (C.32) implies that

\displaystyle\quad\mathbb{E}[\|e^{k+1}\|^{2}|\widetilde{\mathcal{F}}_{k}] (C.33)
\displaystyle=\mathbb{E}[\|\beta_{k+1}\Delta^{k+1}+(1-\beta_{k+1})(\Delta^{k+1}-\tilde{\Delta}^{k})\|^{2}|\widetilde{\mathcal{F}}_{k}]+(1-\beta_{k+1})^{2}\|e^{k}\|^{2}
\displaystyle\leq 2\beta_{k+1}^{2}\mathbb{E}[\|\Delta^{k+1}\|^{2}|\widetilde{\mathcal{F}}_{k}]+2(1-\beta_{k+1})^{2}\mathbb{E}[\|\Delta^{k+1}-\tilde{\Delta}^{k}\|^{2}|\widetilde{\mathcal{F}}_{k}]+(1-\beta_{k+1})^{2}\|e^{k}\|^{2}
\displaystyle\leq{2}\beta_{k+1}^{2}\left(\frac{\sigma_{f}^{2}}{r_{k+1}}+c_{1}^{2}(\sigma_{\mathcal{G}}^{k})^{2}\right)+2(1-\beta_{k+1})^{2}\mathbb{E}[\|\Delta^{k+1}-\tilde{\Delta}^{k}\|^{2}|\widetilde{\mathcal{F}}_{k}]+(1-\beta_{k+1})^{2}\|e^{k}\|^{2}.

We now bound the term \mathbb{E}[\|\Delta^{k+1}-\tilde{\Delta}^{k}\|^{2}|\widetilde{\mathcal{F}}_{k}]. Let

\displaystyle\mathrm{R}^{k+1}(\mathbf{u}^{k+1},\mathbf{u}^{k};\mathbf{\zeta}^{k+1},\mathbf{\tilde{\xi}}^{k+1}) =\nabla\Psi^{k+1}(\mathbf{u}^{k+1};\mathbf{\zeta}^{k+1},\mathbf{\tilde{\xi}}^{k+1})-\nabla\Psi^{k+1}(\mathbf{u}^{k};\mathbf{\zeta}^{k+1},\mathbf{\tilde{\xi}}^{k+1}),
\displaystyle\mathrm{R}^{k+1}(\mathbf{u}^{k+1},\mathbf{u}^{k}) =\nabla\Psi(\mathbf{u}^{k+1})-\nabla\Psi(\mathbf{u}^{k})

denote the stochastic and deterministic gradient differences between the points \mathbf{u}^{k+1} and \mathbf{u}^{k}, respectively. The averaged Lipschitz property of \nabla\Psi^{k+1}(\mathbf{u};\mathbf{\zeta},\mathbf{\tilde{\xi}}) and the Lipschitz property of \nabla\Psi(\mathbf{u}) give \mathbb{E}[\|\mathrm{R}^{k+1}(\mathbf{u}^{k+1},\mathbf{u}^{k};\mathbf{\zeta}^{k+1},\mathbf{\tilde{\xi}}^{k+1})\|^{2}|\tilde{\mathcal{F}}_{k}]\leq(L_{\Psi}^{\prime})^{2}\mathbb{E}[\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}|\tilde{\mathcal{F}}_{k}] and \|\mathrm{R}^{k+1}(\mathbf{u}^{k+1},\mathbf{u}^{k})\|\leq L_{\Psi}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|. Then it holds that

\displaystyle\Delta^{k+1}-\tilde{\Delta}^{k} =\mathrm{R}^{k+1}(\mathbf{u}^{k+1},\mathbf{u}^{k};\mathbf{\zeta}^{k+1},\mathbf{\tilde{\xi}}^{k+1})-\mathrm{R}^{k+1}(\mathbf{u}^{k+1},\mathbf{u}^{k})
+𝔼[Ψk+1(𝐮k+1;ζk+1,ξ~k+1)|~k]Ψk+1(𝐮k+1)\displaystyle\quad+\mathbb{E}[\nabla\Psi^{k+1}(\mathbf{u}^{k+1};\mathbf{\zeta}^{k+1},\mathbf{\tilde{\xi}}^{k+1})|\widetilde{\mathcal{F}}_{k}]-\nabla\Psi^{k+1}(\mathbf{u}^{k+1})
\displaystyle\quad-\mathbb{E}[\nabla\Psi^{k}(\mathbf{u}^{k};\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k})|\widetilde{\mathcal{F}}_{k-1}]+\nabla\Psi^{k}(\mathbf{u}^{k}),

and

\displaystyle\mathbb{E}[\|\Delta^{k+1}-\tilde{\Delta}^{k}\|^{2}|\widetilde{\mathcal{F}}_{k}] \leq 2\mathbb{E}[\|\mathrm{R}^{k+1}(\mathbf{u}^{k+1},\mathbf{u}^{k};\mathbf{\zeta}^{k+1},\mathbf{\tilde{\xi}}^{k+1})-\mathrm{R}^{k+1}(\mathbf{u}^{k+1},\mathbf{u}^{k})\|^{2}|\widetilde{\mathcal{F}}_{k}] (C.34)
\displaystyle\quad+4\|\mathbb{E}[\nabla\Psi^{k+1}(\mathbf{u}^{k+1};\mathbf{\zeta}^{k+1},\mathbf{\tilde{\xi}}^{k+1})|\widetilde{\mathcal{F}}_{k}]-\nabla\Psi(\mathbf{u}^{k+1})\|^{2}
\displaystyle\quad+4\|\mathbb{E}[\nabla\Psi^{k}(\mathbf{u}^{k};\mathbf{\zeta}^{k},\mathbf{\tilde{\xi}}^{k})|\widetilde{\mathcal{F}}_{k}]-\nabla\Psi(\mathbf{u}^{k})\|^{2}
\displaystyle\leq\frac{4\left((L_{\Psi}^{\prime})^{2}+L_{\Psi}^{2}\right)}{r_{k+1}}\mathbb{E}[\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}|\widetilde{\mathcal{F}}_{k}]+\frac{4c_{1}^{2}}{r_{k+1}}\big((\epsilon_{\mathcal{G}}^{k})^{2}+(\epsilon_{\mathcal{G}}^{k+1})^{2}\big).

The last inequality follows from Lemma C.3 and (C.18). Combining (C.33) and (C.34) gives (C.31). This completes the proof. ∎

Using the above two lemmas, we can now derive the convergence of the variance-reduced method.
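For intuition, the error sequence e^{k} analyzed in the two lemmas is generated by a recursive momentum (STORM-type) estimator of the form d^{k+1}=\nabla\Psi^{k+1}(\mathbf{u}^{k+1};\mathbf{\zeta}^{k+1},\mathbf{\tilde{\xi}}^{k+1})+(1-\beta_{k+1})(d^{k}-\nabla\Psi^{k+1}(\mathbf{u}^{k};\mathbf{\zeta}^{k+1},\mathbf{\tilde{\xi}}^{k+1})). The sketch below only illustrates this update on a toy quadratic with a synthetic noisy oracle; it is not the paper's Algorithm 3, and the oracle, noise level, step size, and momentum schedule are assumptions made for the example.

```python
import numpy as np

# Illustrative sketch (not Algorithm 3) of a STORM-type recursive estimator on a
# toy quadratic Psi(u) = 0.5*||u||^2 with additive Gaussian oracle noise.
rng = np.random.default_rng(0)
dim, sigma, alpha = 10, 0.5, 0.1
u = rng.normal(size=dim)

def grad_oracle(point, noise):
    # stochastic gradient of Psi(u) = 0.5*||u||^2 evaluated with a given noise sample
    return point + noise

d = grad_oracle(u, sigma * rng.normal(size=dim))   # initial estimator d^0
for k in range(500):
    beta = min(1.0, 10.0 / (k + 2))                # momentum weight beta_{k+1} (placeholder schedule)
    u_new = u - alpha * d                          # gradient step (projection omitted in this toy)
    noise = sigma * rng.normal(size=dim)           # fresh sample, shared by both evaluations below
    g_new = grad_oracle(u_new, noise)              # gradient at u^{k+1} with the fresh sample
    g_old = grad_oracle(u, noise)                  # gradient at u^{k} with the *same* sample
    d = g_new + (1.0 - beta) * (d - g_old)         # recursive update, cf. the decomposition of e^{k+1}
    u = u_new
print("estimator error ||d - grad Psi(u)|| =", np.linalg.norm(d - u))
```

The point reflected in the code is that both gradient evaluations in the correction term share the same fresh sample, which is exactly what makes the error recursion above contract rather than accumulate noise.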

Theorem C.3.

The sequence {𝐮k+1}\{\mathbf{u}^{k+1}\} generated by Algorithm 3 satisfies

1k=0K1αkk=0K1𝔼[1αk𝐮k+1𝐮k2]1k=0K1αk𝔼[Ψ(𝐮0)+θ0e02Ψ(𝐮K)θKeK2]\displaystyle\quad\frac{1}{\sum_{k=0}^{K-1}\alpha_{k}}\sum_{k=0}^{K-1}\mathbb{E}\left[\frac{1}{\alpha_{k}}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}\right]\leq\frac{1}{\sum_{k=0}^{K-1}\alpha_{k}}\mathbb{E}[\Psi(\mathbf{u}^{0})+\theta_{0}\|e^{0}\|^{2}-\Psi(\mathbf{u}^{K})-\theta_{K}\|e^{K}\|^{2}] (C.35)
\displaystyle\quad\quad+\frac{1}{\sum_{k=0}^{K-1}\alpha_{k}}\sum_{k=0}^{K-1}\left\{\alpha_{k}c_{1}^{2}\mathbb{E}[(\epsilon_{\mathcal{G}}^{k})^{2}]+2\frac{\theta}{\alpha_{k}}\beta_{k+1}^{2}\big(\frac{\sigma_{f}^{2}}{r_{k+1}}+c_{1}^{2}\mathbb{E}[(\sigma_{\mathcal{G}}^{k})^{2}]\big)\right.
+8c12θαkrk+1(𝔼[(ϵ𝒢k)2]+𝔼[(ϵ𝒢k+1)2])}.\displaystyle\quad\quad\left.+\frac{8c_{1}^{2}\theta}{\alpha_{k}r_{k+1}}\big(\mathbb{E}[(\epsilon_{\mathcal{G}}^{k})^{2}]+\mathbb{E}[(\epsilon_{\mathcal{G}}^{k+1})^{2}]\big)\right\}.

Here \theta=\frac{1}{64((L_{\Psi}^{\prime})^{2}+L_{\Psi}^{2})}, and the bound holds provided that \alpha_{k}\leq\frac{1}{8L_{\Psi}} and \beta_{k}\geq 1-\sqrt{\frac{\frac{\theta}{\alpha_{k}}-\alpha_{k}}{8}}.

Proof Consider the merit function \Psi(\mathbf{u}^{k})+\theta_{k}\|e^{k}\|^{2}, where \theta_{k} satisfies

αk+8(1βk+1)2θk+1θk0,\displaystyle\alpha_{k}+8(1-\beta_{k+1})^{2}\theta_{k+1}-\theta_{k}\leq 0, (C.36a)
12αk+LΨ2+8θk+1(LΨ)2+LΨ2rk+1\displaystyle-\frac{1}{2\alpha_{k}}+\frac{L_{\Psi}}{2}+8\theta_{k+1}\frac{(L_{\Psi}^{\prime})^{2}+L_{\Psi}^{2}}{r_{k+1}} 14αk.\displaystyle\leq-\frac{1}{4\alpha_{k}}. (C.36b)

Considering the reduction of the merit function, we have

𝔼[Ψ(𝐮k+1)+θk+1ek+12Ψ(𝐮k)θkek2|~k]\displaystyle\quad\mathbb{E}[\Psi(\mathbf{u}^{k+1})+\theta_{k+1}\|e^{k+1}\|^{2}-\Psi(\mathbf{u}^{k})-\theta_{k}\|e^{k}\|^{2}|\widetilde{\mathcal{F}}_{k}] (C.37)
𝔼[12(1αkLΨ)𝐮k+1𝐮k2+αkek2+αkc12(ϵ𝒢k)2+θk+1ek+12\displaystyle\leq\mathbb{E}\left[-\frac{1}{2}(\frac{1}{\alpha_{k}}-L_{\Psi})\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}+\alpha_{k}\|e^{k}\|^{2}+\alpha_{k}c_{1}^{2}(\epsilon_{\mathcal{G}}^{k})^{2}+\theta_{k+1}\|e^{k+1}\|^{2}\right.
θkek2|~k](by (C.29))\displaystyle\quad\left.-\theta_{k}\|e^{k}\|^{2}|\widetilde{\mathcal{F}}_{k}\right]\text{(by~\eqref{eq: decrease in variance reduction})}
𝔼[12(1αkLΨ)𝐮k+1𝐮k2+αkek2+αkc12(ϵ𝒢k)2\displaystyle\leq\mathbb{E}\left[-\frac{1}{2}(\frac{1}{\alpha_{k}}-L_{\Psi})\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}+\alpha_{k}\|e^{k}\|^{2}+\alpha_{k}c_{1}^{2}(\epsilon_{\mathcal{G}}^{k})^{2}\right.
+θk+1(2βk+12(σf2rk+1+c12(σ𝒢k)2)+8(1βk+1)2ek2\displaystyle\quad+\theta_{k+1}\left(2\beta_{k+1}^{2}\big(\frac{\sigma_{f}^{2}}{r_{k+1}}+c_{1}^{2}(\sigma_{\mathcal{G}}^{k})^{2}\big)+8(1-\beta_{k+1})^{2}\|e^{k}\|^{2}\right.
\displaystyle\quad+\left.\left.8\frac{(L_{\Psi}^{\prime})^{2}+L_{\Psi}^{2}}{r_{k+1}}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}+\frac{8c_{1}^{2}}{r_{k+1}}\big((\epsilon_{\mathcal{G}}^{k})^{2}+(\epsilon_{\mathcal{G}}^{k+1})^{2}\big)\right)-\theta_{k}\|e^{k}\|^{2}|\tilde{\mathcal{F}}_{k}\right]\quad\text{(by~\eqref{eq: bound of error in variance reduction})}
𝔼[(12αk+LΨ2+8θk+1(LΨ)2+LΨ2rk+1)𝐮k+1𝐮k2+αkc12(ϵ𝒢k)2\displaystyle\leq\mathbb{E}\left[\left(-\frac{1}{2\alpha_{k}}+\frac{L_{\Psi}}{2}+8\theta_{k+1}\frac{(L_{\Psi}^{\prime})^{2}+L_{\Psi}^{2}}{r_{k+1}}\right)\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}+\alpha_{k}c_{1}^{2}(\epsilon_{\mathcal{G}}^{k})^{2}\right.
\displaystyle\quad+\left.\theta_{k+1}\left(2\beta_{k+1}^{2}\big(\frac{\sigma_{f}^{2}}{r_{k+1}}+c_{1}^{2}(\sigma_{\mathcal{G}}^{k})^{2}\big)+\frac{8c_{1}^{2}}{r_{k+1}}\big((\epsilon_{\mathcal{G}}^{k})^{2}+(\epsilon_{\mathcal{G}}^{k+1})^{2}\big)\right)|\widetilde{\mathcal{F}}_{k}\right]\quad\text{(by~\eqref{eq: condition on theta 1})}
𝔼[14αk𝐮k+1𝐮k2+αkc12(ϵ𝒢k)2\displaystyle\leq\mathbb{E}\left[-\frac{1}{4\alpha_{k}}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}+\alpha_{k}c_{1}^{2}(\epsilon_{\mathcal{G}}^{k})^{2}\right.
\displaystyle\quad\left.+\theta_{k+1}\left(2\beta_{k+1}^{2}\big(\frac{\sigma_{f}^{2}}{r_{k+1}}+c_{1}^{2}(\sigma_{\mathcal{G}}^{k})^{2}\big)+\frac{8c_{1}^{2}}{r_{k+1}}\big((\epsilon_{\mathcal{G}}^{k})^{2}+(\epsilon_{\mathcal{G}}^{k+1})^{2}\big)\right)|\widetilde{\mathcal{F}}_{k}\right].\quad\text{(by~\eqref{eq: condition on theta 2})}

To ensure the conditions (C.36a) and (C.36b), we take

\theta_{k+1}=\frac{\theta}{\alpha_{k}}\quad\text{with}~\theta=\frac{1}{64((L_{\Psi}^{\prime})^{2}+L_{\Psi}^{2})}, (C.38)

and let αk18LΨ\alpha_{k}\leq\frac{1}{8L_{\Psi}}, βk1θkαk8\beta_{k}\geq 1-\sqrt{\frac{\theta_{k}-\alpha_{k}}{8}}. Then (C.37) is simplified to

𝔼[Ψ(𝐮k+1)+θk+1ek+12Ψ(𝐮k)θkek2|~k]\displaystyle\quad\mathbb{E}[\Psi(\mathbf{u}^{k+1})+\theta_{k+1}\|e^{k+1}\|^{2}-\Psi(\mathbf{u}^{k})-\theta_{k}\|e^{k}\|^{2}|\widetilde{\mathcal{F}}_{k}] (C.39)
𝔼[14αk𝐮k+1𝐮k2+αkc12(ϵ𝒢k)2\displaystyle\leq\mathbb{E}\left[-\frac{1}{4\alpha_{k}}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}+\alpha_{k}c_{1}^{2}(\epsilon_{\mathcal{G}}^{k})^{2}\right.
\displaystyle\quad\left.+\frac{\theta}{\alpha_{k}}\big(2\beta_{k+1}^{2}\big(\frac{\sigma_{f}^{2}}{r_{k+1}}+c_{1}^{2}(\sigma_{\mathcal{G}}^{k})^{2}\big)+\frac{8c_{1}^{2}}{r_{k+1}}\big((\epsilon_{\mathcal{G}}^{k})^{2}+(\epsilon_{\mathcal{G}}^{k+1})^{2}\big)\big)|\widetilde{\mathcal{F}}_{k}\right].

Summing the above inequality from k=0 to K-1, taking the total expectation, and multiplying both sides by \frac{1}{\sum_{k=0}^{K-1}\alpha_{k}}, we obtain (C.35). This completes the proof. ∎
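As a sanity check on the choice (C.38), one can verify numerically that \theta_{k+1}=\theta/\alpha_{k} together with \alpha_{k}\leq\frac{1}{8L_{\Psi}} and r_{k+1}\geq 1 enforces condition (C.36b). The values of L_{\Psi}, L_{\Psi}^{\prime} and r_{k+1} in the sketch below are arbitrary placeholders, not constants prescribed by the analysis.

```python
import numpy as np

# Hedged numerical check of condition (C.36b) under the choice (C.38):
# theta_{k+1} = theta / alpha_k, theta = 1 / (64 ((L'_Psi)^2 + L_Psi^2)),
# alpha_k <= 1 / (8 L_Psi).  The constants below are placeholders.
for L_psi, L_psi_prime, r in [(2.0, 3.0, 1), (10.0, 4.0, 5), (50.0, 60.0, 20)]:
    theta = 1.0 / (64.0 * (L_psi_prime**2 + L_psi**2))
    for alpha in np.linspace(1e-4, 1.0 / (8.0 * L_psi), 50):
        theta_next = theta / alpha
        lhs = (-1.0 / (2.0 * alpha) + L_psi / 2.0
               + 8.0 * theta_next * (L_psi_prime**2 + L_psi**2) / r)
        assert lhs <= -1.0 / (4.0 * alpha)   # condition (C.36b)
print("condition (C.36b) held for all sampled placeholder values")
```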

Corollary C.3.

Take constant sample sizes r_{k}=r, q_{k}=q, s_{k}=s, momentum weights \beta_{k}=\beta\alpha_{k}^{2} with \beta=\mathcal{O}(\alpha^{-2}), and step sizes \alpha_{k}=\alpha(k+1)^{-\frac{1}{3}}, where \alpha=\mathcal{O}(L_{\Psi}^{-1})=\mathcal{O}(c^{-1}) is chosen so that \alpha_{k}\leq\frac{1}{8L_{\Psi}}, as required in Theorem C.3. Then the sequence \{\mathbf{u}^{k}\} generated by Algorithm 3 satisfies

1k=0K1αkk=0K1𝔼[1αk𝐮k+1𝐮k2]\displaystyle\frac{1}{\sum_{k=0}^{K-1}\alpha_{k}}\sum_{k=0}^{K-1}\mathbb{E}\left[\frac{1}{\alpha_{k}}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}\right] 𝒪~(cK23+c12q+c12s+1K23r+K23r(1q+1s)).\displaystyle\leq\widetilde{\mathcal{O}}\left(\frac{c}{K^{\frac{2}{3}}}+\frac{c_{1}^{2}}{q}+\frac{c_{1}^{2}}{s}+\frac{1}{K^{\frac{2}{3}}r}+\frac{K^{\frac{2}{3}}}{r}(\frac{1}{q}+\frac{1}{s})\right). (C.40)

Proof From Corollary C.1 and Lemmas C.4 and C.5 we know L_{\Psi}=\mathcal{O}(c), L_{\Psi}^{\prime}=\mathcal{O}(c+\frac{c_{1}}{s}), \mathbb{E}[(\epsilon_{\mathcal{G}}^{k})^{2}]\leq\widetilde{\mathcal{O}}(\frac{1}{q_{k}}+\frac{1}{s_{k}}), \mathbb{E}[(\sigma_{\mathcal{G}}^{k})^{2}]\leq\widetilde{\mathcal{O}}(\frac{1}{q_{k}}+\frac{1}{s_{k}}), and \mathbb{E}[\|b^{k}\|^{2}]\leq\widetilde{\mathcal{O}}(\frac{1}{r_{k}}+\frac{c_{1}^{2}}{q_{k}}+\frac{c_{1}^{2}}{s_{k}}). Moreover, it follows from (C.38) that \theta=\mathcal{O}(L_{\Psi}^{-2})=\mathcal{O}(c^{-2}). Substituting these estimates into (C.35), we have

1k=0K1αkk=0K1𝔼[1αk𝐮k+1𝐮k2]\displaystyle\quad\frac{1}{\sum_{k=0}^{K-1}\alpha_{k}}\sum_{k=0}^{K-1}\mathbb{E}\left[\frac{1}{\alpha_{k}}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|^{2}\right] (C.41)
𝒪~(cK23(1+αc12K23𝒪~(ϵ𝒢2)+θβ2α3𝒪~(r1+c12(σ𝒢k)2)+θc12αrK43𝒪~(ϵ𝒢2)))\displaystyle\leq\widetilde{\mathcal{O}}\left(cK^{-\frac{2}{3}}\left(1+\alpha c_{1}^{2}K^{\frac{2}{3}}\widetilde{\mathcal{O}}(\epsilon_{\mathcal{G}}^{2})+\theta\beta^{2}\alpha^{3}\widetilde{\mathcal{O}}(r^{-1}+c_{1}^{2}(\sigma_{\mathcal{G}}^{k})^{2})+\frac{\theta c_{1}^{2}}{\alpha r}K^{\frac{4}{3}}\widetilde{\mathcal{O}}(\epsilon_{\mathcal{G}}^{2})\right)\right)
=𝒪~(cK23(1+c12cK23𝒪~(ϵ𝒢2)+c1𝒪~(r1+c12(σ𝒢k)2)+c1r1K43𝒪~(ϵ𝒢2)))\displaystyle=\widetilde{\mathcal{O}}\left(cK^{-\frac{2}{3}}\left(1+\frac{c_{1}^{2}}{c}K^{\frac{2}{3}}\widetilde{\mathcal{O}}(\epsilon_{\mathcal{G}}^{2})+c^{-1}\widetilde{\mathcal{O}}(r^{-1}+c_{1}^{2}(\sigma_{\mathcal{G}}^{k})^{2})+c^{-1}r^{-1}K^{\frac{4}{3}}\widetilde{\mathcal{O}}(\epsilon_{\mathcal{G}}^{2})\right)\right)
𝒪~(cK23(1+c12cK23(1q+1s)+c1(1r+c12q+c12s)+c1r1K43(1q+1s)))\displaystyle\leq\widetilde{\mathcal{O}}\left(cK^{-\frac{2}{3}}\left(1+\frac{c_{1}^{2}}{c}K^{\frac{2}{3}}(\frac{1}{q}+\frac{1}{s})+c^{-1}(\frac{1}{r}+\frac{c_{1}^{2}}{q}+\frac{c_{1}^{2}}{s})+c^{-1}r^{-1}K^{\frac{4}{3}}(\frac{1}{q}+\frac{1}{s})\right)\right)
=𝒪~(cK23+c12q+c12s+1K23(1r+c12q+c12s)+K23r(1q+1s))\displaystyle=\widetilde{\mathcal{O}}\left(\frac{c}{K^{\frac{2}{3}}}+\frac{c_{1}^{2}}{q}+\frac{c_{1}^{2}}{s}+\frac{1}{K^{\frac{2}{3}}}(\frac{1}{r}+\frac{c_{1}^{2}}{q}+\frac{c_{1}^{2}}{s})+\frac{K^{\frac{2}{3}}}{r}(\frac{1}{q}+\frac{1}{s})\right)
=𝒪~(cK23+c12q+c12s+1K23r+K23r(1q+1s)).\displaystyle=\widetilde{\mathcal{O}}\left(\frac{c}{K^{\frac{2}{3}}}+\frac{c_{1}^{2}}{q}+\frac{c_{1}^{2}}{s}+\frac{1}{K^{\frac{2}{3}}r}+\frac{K^{\frac{2}{3}}}{r}(\frac{1}{q}+\frac{1}{s})\right).

This completes the proof. ∎
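To make the parameter choices of Corollary C.3 concrete, the sketch below builds the schedules \alpha_{k}=\alpha(k+1)^{-1/3} and \beta_{k}=\beta\alpha_{k}^{2} with constant sample sizes; the numerical values of \alpha, \beta, r, q, s are illustrative placeholders only.

```python
import numpy as np

# Illustrative construction of the schedules in Corollary C.3.
# alpha0, beta0, r, q, s are placeholder values, not prescribed constants.
K = 1000
alpha0 = 0.05                                    # alpha = O(L_Psi^{-1}) = O(c^{-1})
beta0 = 1.0 / alpha0**2                          # beta = O(alpha^{-2})
alphas = alpha0 * np.arange(1, K + 1, dtype=float) ** (-1.0 / 3.0)  # alpha_k = alpha (k+1)^{-1/3}
betas = np.minimum(1.0, beta0 * alphas**2)       # beta_k = beta * alpha_k^2 (capped at 1)
r, q, s = 1, 64, 64                              # constant sample sizes r_k, q_k, s_k
print("alpha_0 =", alphas[0], " alpha_{K-1} =", alphas[-1])
print("beta_0  =", betas[0], " beta_{K-1}  =", betas[-1])
print("per-iteration samples: zeta =", r, ", xi =", q + s)
```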

Remark C.3.

Further take K\sim\mathcal{O}(c^{1.5}\epsilon^{-1.5}), r\sim\mathcal{O}(1), q\sim\mathcal{O}(c_{1}^{2}\epsilon^{-1}), and s\sim\mathcal{O}(c_{1}^{2}\epsilon^{-1}); then the right-hand side of (C.40) is \widetilde{\mathcal{O}}(\epsilon). The sample complexity on \xi is \sum_{k=0}^{K-1}{s_{k}}+\sum_{k=0}^{K-1}{q_{k}}=(s+q)K=\widetilde{\mathcal{O}}(c^{1.5}c_{1}^{2}\epsilon^{-2.5}), and the sample complexity on \zeta is \sum_{k=0}^{K-1}{r_{k}}=rK=\widetilde{\mathcal{O}}(c^{1.5}\epsilon^{-1.5}).
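The sample counts in Remark C.3 follow from elementary arithmetic; the snippet below simply evaluates K, q, s, r and the resulting totals for illustrative values of c, c_{1} and \epsilon, ignoring the constants and logarithmic factors hidden by the \widetilde{\mathcal{O}} notation.

```python
# Illustrative evaluation of the sample complexities in Remark C.3.
# c, c1 and eps are placeholders; hidden constants and log factors are ignored.
c, c1, eps = 10.0, 2.0, 1e-2
K = int(c**1.5 * eps**-1.5)        # K ~ O(c^{1.5} eps^{-1.5})
r = 1                              # r ~ O(1)
q = int(c1**2 * eps**-1)           # q ~ O(c1^2 eps^{-1})
s = int(c1**2 * eps**-1)           # s ~ O(c1^2 eps^{-1})
samples_xi = (q + s) * K           # ~ O(c^{1.5} c1^2 eps^{-2.5})
samples_zeta = r * K               # ~ O(c^{1.5} eps^{-1.5})
print(f"K = {K}, samples on xi = {samples_xi:.2e}, samples on zeta = {samples_zeta:.2e}")
```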

References

  • Alpaydin and Alimoglu (1996) Ethem Alpaydin and Fevzi Alimoglu. Pen-based recognition of handwritten digits, 1996.
  • Bennett et al. (2006) Kristin P Bennett, Jing Hu, Xiaoyun Ji, Gautam Kunapuli, and Jong-Shi Pang. Model selection via bilevel optimization. In The 2006 IEEE International Joint Conference on Neural Network Proceedings, pages 1922–1929. IEEE, 2006.
  • Bonnans and Shapiro (2013) J Frédéric Bonnans and Alexander Shapiro. Perturbation analysis of optimization problems. Springer Science & Business Media, 2013.
  • Chen et al. (2021) Tianyi Chen, Yuejiao Sun, and Wotao Yin. Tighter analysis of alternating stochastic gradient method for stochastic nested problems. arXiv preprint arXiv:2106.13781, 2021.
  • Cutkosky and Orabona (2019) Ashok Cutkosky and Francesco Orabona. Momentum-based variance reduction in non-convex sgd. Advances in neural information processing systems, 32, 2019.
  • Domahidi et al. (2013) Alexander Domahidi, Eric Chu, and Stephen Boyd. Ecos: An socp solver for embedded systems. In 2013 European control conference (ECC), pages 3071–3076. IEEE, 2013.
  • Franceschi et al. (2018) Luca Franceschi, Paolo Frasconi, Saverio Salzo, Riccardo Grazzi, and Massimiliano Pontil. Bilevel programming for hyperparameter optimization and meta-learning. In International conference on machine learning, pages 1568–1577. PMLR, 2018.
  • Gao et al. (2023) Lucy L Gao, Jane J Ye, Haian Yin, Shangzhi Zeng, and Jin Zhang. Moreau envelope based difference-of-weakly-convex reformulation and algorithm for bilevel programs. arXiv preprint arXiv:2306.16761, 2023.
  • Grimmer et al. (2023) Benjamin Grimmer, Haihao Lu, Pratik Worah, and Vahab Mirrokni. The landscape of the proximal point method for nonconvex–nonconcave minimax optimization. Mathematical Programming, 201(1):373–407, 2023.
  • Hong et al. (2023) Mingyi Hong, Hoi-To Wai, Zhaoran Wang, and Zhuoran Yang. A two-timescale stochastic algorithm framework for bilevel optimization: Complexity analysis and application to actor-critic. SIAM Journal on Optimization, 33(1):147–180, 2023.
  • Ji et al. (2020) Kaiyi Ji, Junjie Yang, and Yingbin Liang. Provably faster algorithms for bilevel optimization and applications to meta-learning. Neural Information Processing Systems, 2020.
  • Jiang et al. (2024a) Liuyuan Jiang, Quan Xiao, Victor M Tenorio, Fernando Real-Rojas, Antonio Marques, and Tianyi Chen. A primal-dual-assisted penalty approach to bilevel optimization with coupled constraints. arXiv preprint arXiv:2406.10148, 2024a.
  • Jiang et al. (2024b) Xiaotian Jiang, Jiaxiang Li, Mingyi Hong, and Shuzhong Zhang. A barrier function approach for bilevel optimization with coupled lower-level constraints: Formulation, approximation and algorithms. arXiv preprint arXiv:2410.10670, 2024b.
  • Kang et al. (2023) Hyuna Kang, Seunghoon Jung, Jaewon Jeoung, Juwon Hong, and Taehoon Hong. A bi-level reinforcement learning model for optimal scheduling and planning of battery energy storage considering uncertainty in the energy-sharing community. Sustainable Cities and Society, 94:104538, 2023.
  • Khanduri et al. (2023) Prashant Khanduri, Ioannis Tsaknakis, Yihua Zhang, Jia Liu, Sijia Liu, Jiawei Zhang, and Mingyi Hong. Linearly constrained bilevel optimization: A smoothed implicit gradient approach. In International Conference on Machine Learning, pages 16291–16325. PMLR, 2023.
  • MacKay et al. (2019) Matthew MacKay, Paul Vicol, Jon Lorraine, David Duvenaud, and Roger Grosse. Self-tuning networks: Bilevel optimization of hyperparameters using structured best-response functions. arXiv preprint arXiv:1903.03088, 2019.
  • Nocedal and Wright (1999) Jorge Nocedal and Stephen J Wright. Numerical optimization. Springer, 1999.
  • Qin et al. (2023) Xiaorong Qin, Xinhang Song, and Shuqiang Jiang. Bi-level meta-learning for few-shot domain generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15900–15910, 2023.
  • Rockafellar and Wets (2009) R Tyrrell Rockafellar and Roger J-B Wets. Variational analysis, volume 317. Springer Science & Business Media, 2009.
  • Shen and Chen (2023) Han Shen and Tianyi Chen. On penalty-based bilevel gradient descent method. In International Conference on Machine Learning, pages 30992–31015. PMLR, 2023.
  • Shen et al. (2024) Han Shen, Zhuoran Yang, and Tianyi Chen. Principled penalty-based methods for bilevel reinforcement learning and rlhf. arXiv preprint arXiv:2402.06886, 2024.
  • Sinha et al. (2020) Ankur Sinha, Tanmay Khandait, and Raja Mohanty. A gradient-based bilevel optimization approach for tuning hyperparameters in machine learning. arXiv preprint arXiv:2007.11022, 2020.
  • Stadie et al. (2020) Bradly Stadie, Lunjun Zhang, and Jimmy Ba. Learning intrinsic rewards as a bi-level optimization problem. In Conference on Uncertainty in Artificial Intelligence, pages 111–120. PMLR, 2020.
  • Tsaknakis et al. (2022) Ioannis Tsaknakis, Prashant Khanduri, and Mingyi Hong. An implicit gradient-type method for linearly constrained bilevel problems. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5438–5442. IEEE, 2022.
  • Tsaknakis et al. (2023) Ioannis Tsaknakis, Prashant Khanduri, and Mingyi Hong. An implicit gradient method for constrained bilevel problems using barrier approximation. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023.
  • Xu and Zhu (2023) Siyuan Xu and Minghui Zhu. Efficient gradient approximation method for constrained bilevel optimization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 12509–12517, 2023.
  • Yang et al. (2024) Yan Yang, Bin Gao, and Ya-xiang Yuan. Bilevel reinforcement learning via the development of hyper-gradient without lower-level convexity. arXiv preprint arXiv:2405.19697, 2024.
  • Yao et al. (2024a) Wei Yao, Haian Yin, Shangzhi Zeng, and Jin Zhang. Overcoming lower-level constraints in bilevel optimization: A novel approach with regularized gap functions. arXiv preprint arXiv:2406.01992, 2024a.
  • Yao et al. (2024b) Wei Yao, Chengming Yu, Shangzhi Zeng, and Jin Zhang. Constrained bi-level optimization: Proximal Lagrangian value function approach and Hessian-free algorithm. arXiv preprint arXiv:2401.16164, 2024b.
  • Zhu et al. (2020) Hancheng Zhu, Leida Li, Jinjian Wu, Sicheng Zhao, Guiguang Ding, and Guangming Shi. Personalized image aesthetics assessment via meta-learning with bilevel gradient optimization. IEEE Transactions on Cybernetics, 52(3):1798–1811, 2020.