An Augmented Lagrangian Value Function Method for Lower-level Constrained Stochastic Bilevel Optimization
Abstract
Recently, lower-level constrained bilevel optimization has attracted increasing attention. However, existing methods mostly focus on either deterministic cases or problems with linear constraints. The main challenge in stochastic cases with general constraints is the bias and variance of the hyper-gradient, arising from the inexact solution of the lower-level problem. In this paper, we propose a novel stochastic augmented Lagrangian value function method for solving stochastic bilevel optimization problems with nonlinear lower-level constraints. Our approach reformulates the original bilevel problem using an augmented Lagrangian-based value function and then applies a penalized stochastic gradient method that carefully manages the noise from stochastic oracles. We establish an equivalence between the stochastic single-level reformulation and the original constrained bilevel problem and provide a non-asymptotic rate of convergence for the proposed method. The rate is further enhanced by employing variance reduction techniques. Extensive experiments on synthetic problems and real-world applications demonstrate the effectiveness of our approach.
Keywords: Bilevel optimization, Lower-level constraint, Stochastic optimization, Augmented Lagrangian method
1 Introduction
We consider the stochastic lower-level constrained bilevel optimization (stochastic LC-BLO) problem. The lower-level problem is defined as

$$S(x) := \operatorname*{arg\,min}_{y \in Y} \; g(x,y) := \mathbb{E}_{\zeta \sim P_\zeta}\big[g(x,y;\zeta)\big] \quad \text{s.t.} \quad h(x,y) \le 0, \tag{1.1}$$

where $Y$ is a convex compact set, $\zeta$ is a random variable in the space $\Omega_\zeta$, and $P_\zeta$ is the distribution of $\zeta$. Here $g$ and $h$ are the lower-level objective function and constraint function, respectively. We also denote the feasible set of the lower-level problem by $\mathcal{Y}(x) := \{y \in Y : h(x,y) \le 0\}$. The bilevel optimization (BLO) problem is
$$\min_{x \in X} \; \mathbb{E}_{\xi \sim P_\xi}\big[f(x, y^*(x); \xi)\big] \quad \text{s.t.} \quad y^*(x) \in S(x), \tag{1.2}$$

where $X$ is a convex compact set, $\xi$ is a random variable, and $P_\xi$ is the distribution of $\xi$. We assume the gradient oracles $\nabla f(x,y;\xi)$ and $\nabla g(x,y;\zeta)$ have unavoidable noise. This framework includes deterministic LC-BLO as a special case.
As depicted in (1.2), this hierarchical structure captures a learning‑to‑learn philosophy that underpins numerous modern machine‑learning pipelines. Hence, BLO plays a critical role in various machine learning tasks, including hyperparameter optimization (2, 7, 16, 22), model-agnostic meta-learning (7, 11, 18, 30), and reinforcement learning (23, 21, 27, 14). Recently, the lower-level constrained BLO (LC-BLO) has attracted increasing attention due to its wide applications such as transportation (12), kernelized SVM (29), meta-learning (26), and data hyper-cleaning (26, 28).
Several methods have been proposed to solve the deterministic LC-BLO problem. The methodologies for solving LC-BLO can be broadly categorized into implicit gradient-based (IG) approaches and lower-level value function-based (LLVF) techniques. Implicit gradient methods primarily focus on computing the hyper-gradient via an implicit gradient approximation of the lower-level solution map. The key issue here is that the hyper-objective may not be differentiable when the lower-level problem has constraints. Some research has discussed the conditions under which the hyper-gradient exists. For the linearly constrained case, IG-AL (24) discussed the smoothness of the hyper-objective by introducing the Lagrangian multiplier of the lower-level problem. SIGD (15) develops a smoothed implicit gradient method to handle the points of non-differentiability. Recent works (25, 13) also consider a barrier reformulation of LC-BLO that folds the lower-level constraints into the objective function. Even with these existing discussions, the main challenge in designing implicit gradient methods is that the second-order derivative of the lower-level objective is required to compute the hyper-gradient, which is computationally expensive.
To overcome the high computational cost of implicit gradient computation, value function-based methods have been proposed as a Hessian-free alternative. Value function-based methods introduce the value function of the lower-level problem and then replace the optimality condition in (1.2) with an inequality condition on the value function. The original BLO (1.2) is then equivalently reformulated as a single-level problem. Different value-function formulations have been widely studied in the literature, including the value function (12), a Moreau envelope-based value function (8), the proximal Lagrangian function (29), and the regularized gap function (28).
To the best of our knowledge, no existing work analyzes the BLO with nonlinear lower-level constraints in the stochastic setting, which has broad applications in real-world machine learning tasks. Solving stochastic LC-BLO presents two fundamental challenges. The first is the nonsmoothness of the hyper-objective function, which arises from the coupling between the upper-level and lower-level problems. The second is the bias of the hyper-gradient due to the inexact solution of the lower-level problem. To address these challenges, we introduce an augmented Lagrangian function and its Moreau envelope to reformulate the bilevel problem as a single-level problem, while ensuring that the solution remains close to the optimal solution of the original problem. We then propose a novel stochastic value function-based method for stochastic LC-BLO, which carefully controls the bias of the gradient oracle to achieve convergence.
1.1 Main contribution
Our main contributions are as follows:
1. We introduce a novel reformulation of the stochastic LC-BLO by leveraging the stochastic augmented Lagrangian function and its Moreau envelope (see (2.2)). This reformulation transforms the bilevel problem into a single-level problem, effectively addressing noise arising from the inexact solution of the lower-level problem (see (2.3), (2.7)). Notably, we also ensure that the solution of the reformulated problem remains close to the optimal solution of the original bilevel problem (see Theorem 3.1, 3.2), providing a practical yet theoretically grounded approach to stochastic LC-BLO.
2. We propose a novel Hessian-free method based on the stochastic reformulation for solving the stochastic LC-BLO (see Algorithm 2). Our work provides the first convergence analysis of value function-based algorithms for nonlinear LC-BLO in the stochastic setting. The issue of biased gradients is mitigated by controlling the bias through the accuracy of lower-level solutions. We derive a non-asymptotic convergence rate, proving that our method achieves sample complexity on , where denotes the penalty parameter in the reformulations and (see Theorem 3.4, Remark 3.1). The sample complexity on is further improved to by employing variance reduction techniques (see Theorem 3.5, Remark 3.4). In Table 1, we briefly summarize the existing approaches compared to the proposed methods.
Table 1: Comparison with existing approaches (H-free: Hessian-free; Sto: stochastic setting).

| Method | H-free | Sto | Iteration | Sample (upper, lower) |
| --- | --- | --- | --- | --- |
| BLOCC | yes | no | | – |
| LV-HBA | yes | no | | – |
| BiC-GAFFA | yes | no | | – |
| GAM | no | no | – | – |
| SALVF | yes | yes | | |
| SALVF-VR | yes | yes | | |
1.2 Notation
For a multivariate function, its partial derivative with respect to the $i$-th variable is denoted by $\nabla_i$. Given a random variable $X$, we represent its expectation and covariance matrix by $\mathbb{E}[X]$ and $\mathrm{Cov}[X]$, respectively. The trace of its covariance matrix is denoted by $\mathrm{tr}(\mathrm{Cov}[X])$. If the distribution is clear, we abbreviate the above notations accordingly. We denote the normal cone of a convex set $C$ at $x$ by $\mathcal{N}_C(x)$. For a scalar $a$, we define $[a]_+ := \max\{a, 0\}$; for a vector, $[\cdot]_+$ is applied componentwise. For independent identically distributed random variables $\zeta_1, \dots, \zeta_n \sim P$, we denote the joint distribution by $P^n$. Further, given some function $\phi$, we define the empirical average as $\frac{1}{n}\sum_{i=1}^{n}\phi(\zeta_i)$.
2 Stochastic augmented value function-based method
In this section, we propose a novel value function-based reformulation for (1.2) and a stochastic value function method.
2.1 Stochastic augmented Lagrangian reformulation
We propose a novel reformulation using a stochastic augmented Lagrangian value function that transforms (1.2) into a single-level problem. First, the constraints in (1.1) are addressed by the augmented Lagrangian function and its corresponding stochastic version. The bilevel optimization problem (1.2) is then transformed into a single-level problem via this augmented Lagrangian function-based formulation.
For the lower-level problem (1.1), an augmented Lagrangian penalty term is introduced by penalizing the constraints using

$$P_\gamma(x,y,z) := \frac{1}{2\gamma}\sum_{i=1}^{m}\Big(\big[z_i + \gamma h_i(x,y)\big]_+^2 - z_i^2\Big),$$

where $z \ge 0$ and $\gamma > 0$ is a penalty parameter. The augmented Lagrangian function and its stochastic oracle are defined by adding the penalty term to the objective function and its stochastic oracle, respectively, that is,

$$\mathcal{L}_\gamma(x,y,z) := g(x,y) + P_\gamma(x,y,z), \qquad \mathcal{L}_\gamma(x,y,z;\zeta) := g(x,y;\zeta) + P_\gamma(x,y,z). \tag{2.1}$$

The augmented dual function and its Moreau envelope are then defined as

$$d_\gamma(x,z) := \min_{y \in Y}\, \mathcal{L}_\gamma(x,y,z), \qquad v_{\gamma,\rho}(x,z) := \max_{z' \ge 0}\Big\{ d_\gamma(x,z') - \frac{1}{2\rho}\|z' - z\|^2 \Big\}, \tag{2.2}$$

where $\rho > 0$ is a regularization parameter.
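To make the penalty concrete, here is a minimal Python sketch of the augmented Lagrangian value, assuming the classical Rockafellar form for inequality constraints written above; the function and argument names are illustrative, not the paper's implementation.

```python
import numpy as np

def aug_lagrangian(g_val, h_vals, z, gamma):
    """Augmented Lagrangian L_gamma(x, y, z) = g(x, y) + P_gamma(x, y, z)
    for constraints h_i(x, y) <= 0, with multipliers z >= 0 and penalty
    parameter gamma > 0 (classical Rockafellar form, assumed here).

    g_val : scalar objective value g(x, y), or a stochastic sample of it
    h_vals: array of constraint values h_i(x, y)
    """
    penalty = (np.maximum(z + gamma * h_vals, 0.0) ** 2 - z ** 2) / (2.0 * gamma)
    return g_val + penalty.sum()
```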
where is a regularization parameter. Then (1.2) is reformulated as an equivalent single-level problem
(2.3) | ||||
s.t. | ||||
In this paper, are fixed parameters. We omit the subscript and the superscript in and to simplify notation. The envelope-based value function reformulation (2.3) contains the value function-based reformulation
(2.4) | ||||
s.t. | ||||
as a special case when . The relationship between them is discussed in Appendix A.5. By introducing the auxiliary variable , the advantage of the augmented function-based reformulation is that the subproblem (2.5a) becomes strongly-convex-strongly-concave, which ensures faster convergence of the inner loop.
To evaluate the envelope $v(x,z)$ in (2.2), we need to estimate

$$\big(y^*(x,z),\, z^*(x,z)\big) := \operatorname*{arg\,min}_{y \in Y}\, \operatorname*{arg\,max}_{z' \ge 0}\; \widetilde{\mathcal{L}}(x,y,z';z), \tag{2.5a}$$

$$\widetilde{\mathcal{L}}(x,y,z';z) := \mathcal{L}(x,y,z') - \frac{1}{2\rho}\|z' - z\|^2. \tag{2.5b}$$
However, in the stochastic setting, the exact solution is inaccessible due to unavoidable noise in the gradient oracles. To address this, we consider approximating the solution of (2.5a) using stochastic algorithms. Specifically, let $\zeta_1, \dots, \zeta_n$ be samples from $P_\zeta$, and let the corresponding spaces of random variables mapping these samples to the primal and dual domains be defined accordingly. Assume $\mathcal{A}$ is a stochastic algorithm solving the subproblem (2.5a) using the samples, and that $(\hat{y}, \hat{z})$ is a pair of approximate solutions of (2.5a). We are interested in the subset of “good enough” algorithms that provide a sufficiently accurate solution of the subproblem (2.5a), in the sense that

(2.6)
With this estimator $(\hat{y}, \hat{z})$, we approximate the envelope function $v(x,z)$ with $\hat{v}(x,z) := \widetilde{\mathcal{L}}(x, \hat{y}, \hat{z}; z)$. This gives the following stochastic value function-based reformulation:

$$\min_{x \in X,\, y \in Y,\, z \ge 0} \; F(x,y) \quad \text{s.t.} \quad \mathcal{L}(x,y,z) - \hat{v}(x,z) \le \epsilon_1, \quad h(x,y) \le \epsilon_2, \tag{2.7}$$

where $\epsilon_1, \epsilon_2$ are the target accuracies for the lower-level objective function and the constraint violation, respectively. The equivalence between (1.2) and (2.7) is established in Theorem 3.2. Compared to (2.3), (2.7) incorporates the inexactness of the lower-level solution into the formulation, making it more practical in the stochastic setting.
2.2 Value function-based penalized method
In this subsection, we develop a stochastic value function-based penalized method for the reformulation (2.7). At the $t$-th inner iteration, the stochastic gradient descent ascent method is applied to compute an approximate solution of the subproblem (2.5a) with $(x, z)$ fixed. More specifically, the primal and dual variables are updated by

$$y_{t+1} = \operatorname{proj}_Y\big(y_t - \eta^y_t \nabla_y \widetilde{\mathcal{L}}(x, y_t, z'_t; z; \zeta_t)\big), \tag{2.8a}$$

$$z'_{t+1} = \big[z'_t + \eta^z_t \nabla_{z'} \widetilde{\mathcal{L}}(x, y_t, z'_t; z; \zeta_t)\big]_+, \tag{2.8b}$$

where $\eta^y_t$ and $\eta^z_t$ denote the primal and dual step sizes, respectively.
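The inner updates (2.8) can be sketched in Python as follows, assuming hypothetical stochastic gradient oracles for the regularized augmented Lagrangian and a projection onto the lower-level domain; step sizes are callables so that the decreasing schedules used in the analysis (see Theorem B.1) can be plugged in.

```python
import numpy as np

def inner_sgda(grad_y, grad_z, proj_Y, y0, z0, T, eta_y, eta_z, draw_sample):
    """Projected stochastic gradient descent ascent for the saddle
    subproblem (2.5a); oracle names here are illustrative assumptions."""
    y, z = np.asarray(y0, float).copy(), np.asarray(z0, float).copy()
    for t in range(T):
        zeta = draw_sample()                                    # fresh lower-level sample
        y = proj_Y(y - eta_y(t) * grad_y(y, z, zeta))           # primal step (2.8a)
        z = np.maximum(z + eta_z(t) * grad_z(y, z, zeta), 0.0)  # dual step (2.8b), z >= 0
    return y, z
```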
The complete procedure is shown in Algorithm 1. After that, we consider an augmented Lagrangian-based penalty reformulation:

(2.9)

where $\Lambda$ is the domain of the penalty variables, $\bar{c}$ is a constant, and $c_1, c_2$ are the penalty parameters. (It can be shown that the optimal solution for (2.9) is contained in this domain under appropriate regularity assumptions; see Assumption 3.6 and Lemma A.7.) Different from standard duality, the penalty parameter is also treated as a variable in the above saddle reformulation. The advantage of this saddle-point reformulation is that convexity of the objective function in (2.7) is not required. The equivalence between (2.7) and (2.9) is established in Theorem 3.1. We consider a stochastic gradient descent ascent method for solving (2.9). Denote all variables in (2.9) and their region by $w$ and $W$, respectively,
for simplicity. The gradient oracle of the objective function of (2.9) is given by

(2.10)

with the stochastic oracles in mini-batch form as

(2.11)

By substituting the approximate solution $(\hat{y}, \hat{z})$ into (2.10) and (2.11) and conditioning on the inner-loop samples, we rewrite

(2.12)
The complete procedure is summarized as follows. At the $k$-th iteration, an estimator $(\hat{y}_k, \hat{z}_k)$ is computed using Algorithm 1 with the lower-level samples. Then the stochastic gradient descent method is applied to solve (2.9) with fresh upper-level samples. After $K$ iterations, we output $w_R$, where the index $R$ is randomly chosen according to the probability mass function

$$\Pr(R = k) = \frac{\alpha_k}{\sum_{j=1}^{K} \alpha_j}, \qquad k = 1, \dots, K, \tag{2.13}$$

where $\alpha_k$ denotes the $k$-th step size.
Besides, an extra SALM loop

(2.14)
can be applied to guarantee the feasibility of the final output. The complete procedure is shown in Algorithm 2.
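For illustration, the outer double loop of Algorithm 2 can be sketched as below; the oracle and projection names (`inner_solver`, `penalized_grad`, `proj_W`) are hypothetical placeholders, and the randomized output follows the step-size-weighted rule (2.13).

```python
import numpy as np

def salvf(w0, inner_solver, penalized_grad, proj_W, alphas, seed=0):
    """Double-loop sketch of Algorithm 2 (names are illustrative assumptions).

    inner_solver(w)        -> approximate lower-level pair from Algorithm 1
    penalized_grad(w, sol) -> stochastic gradient of the penalized objective (2.9)
    alphas                 -> outer step sizes, one per iteration
    """
    rng = np.random.default_rng(seed)
    w, iterates = np.asarray(w0, float), []
    for alpha in alphas:
        sol = inner_solver(w)                           # approximately solve (2.5a)
        w = proj_W(w - alpha * penalized_grad(w, sol))  # projected stochastic step
        iterates.append(w.copy())
    # draw the output index with probability proportional to the step size, cf. (2.13)
    probs = np.asarray(alphas) / np.sum(alphas)
    return iterates[rng.choice(len(iterates), p=probs)]
```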
2.3 Variance reduced SALVF method
When sampling the upper-level variable $\xi$ is significantly more expensive than sampling the lower-level variable $\zeta$, a natural question arises: is it possible to reduce the sample size of $\xi$, thereby allowing an increase in the sample size of $\zeta$? In this subsection, we apply a variance reduction technique with an update rule similar to STORM (5) to reduce the sample complexity on $\xi$. At each iteration, the direction is updated by

$$d_k = G(w_k; \xi_k) + (1 - a_k)\big(d_{k-1} - G(w_{k-1}; \xi_k)\big), \tag{2.15}$$

where $G(\cdot\,; \xi_k)$ denotes the stochastic gradient oracle evaluated on the fresh sample $\xi_k$ and $a_k \in [0,1]$ is a momentum coefficient. Unlike the STORM method, our approach deals with a different scenario, where the main challenge arises from the biased gradient oracle due to the inexactness of the estimator $(\hat{y}_k, \hat{z}_k)$. This challenge is addressed by carefully handling the extra bias term and designing proper coefficients $a_k$. The complete procedure is summarized in Algorithm 3, and the convergence guarantee is provided in Theorem 3.5.
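As a reference point, the generic STORM-style recursion behind (2.15) is shown below; both gradient evaluations must use the same fresh sample, and `a_k` is the momentum coefficient chosen in the analysis.

```python
def storm_update(d_prev, grad_new, grad_old, a_k):
    """STORM-style recursive estimator (Cutkosky and Orabona, 2019):
        d_k = g(w_k; xi_k) + (1 - a_k) * (d_{k-1} - g(w_{k-1}; xi_k)),
    where grad_new = g(w_k; xi_k) and grad_old = g(w_{k-1}; xi_k) are both
    evaluated on the SAME sample xi_k drawn at iteration k."""
    return grad_new + (1.0 - a_k) * (d_prev - grad_old)
```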
3 Theoretical analysis
3.1 Basic assumptions
In this section, we inspect the properties of the penalized reformulated problem (2.9) and provide non-asymptotic convergence guarantees for the proposed algorithms (Algorithms 2 and 3). First, some basic assumptions from the stochastic bilevel optimization literature (10, 4) are introduced, including assumptions on the smoothness, convexity, boundedness, and stochastic oracles associated with the objective and constraints.
Assumption 3.1.
(Lipschitz continuity) Assume the , , are -Lipschitz continuous, respectively.
Assumption 3.2.
(Convexity) Assume is -strongly convex in for any , is convex in for any .
Assumption 3.3.
(Boundedness) Assume , and are bounded, that is,
Assumption 3.4.
(Stochastic derivative) The stochastic oracles are unbiased estimators of the corresponding gradients, and their variances are bounded.
To ensure strong duality and regularity of the optimal points, we assume that Slater's condition and the linear independence constraint qualification (LICQ) hold for the lower-level constraints, which are common assumptions in nonlinear optimization analysis.
Assumption 3.5.
(LL Slater’s condition) For any fixed , Slater’s condition holds for (1.1), that is, there exist and such that
Assumption 3.6.
(LICQ) For any and , is linearly independent. Denote the matrix . Since are compact sets, we further assume the smallest singular value satisfies
3.2 Equivalence of reformulations
In this subsection, the equivalence between the reformulations (2.9) and the original BLO formulation in (1.2) is established. We take as provided in (2.9) throughout the analysis. The deterministic case is first analyzed as a special case. The equivalence is then extended to the stochastic case, emphasizing the key improvements introduced by the stochastic reformulation.
3.2.1 Deterministic case
The following theorem establishes the equivalence between the penalized form of (2.3) and the original BLO.
Theorem 3.1.
This theorem indicates that the penalized reformulation can approximate the original bilevel optimization problem within a controlled error bound.
3.2.2 Stochastic case
The major difference between the deterministic and stochastic reformulation is the inexact solution of the subproblem (2.5a). By controlling the inexactness, we design an approximated stochastic reformulation for the stochastic bilevel optimization problem. Denote the penalized form as
The following theorem shows the equivalence between this penalized form and (1.2) (see Theorem A.4 for proofs).
Theorem 3.2.
1. Assume is a global solution to (1.2). If defined in (2.6) is nonempty for any , then for any , there exists such that is a -global-minima of the following penalized form
(3.3) |
with any , and .
2. By taking and , for any , the -global-minima of (3.3) is a -global-minima of the following approximation of BLO:
(3.4) | ||||
s.t. | ||||
with some .
3.3 Convergence analysis
Denote and as the -algebra generated by and respectively. Then
is the filtration generated by the random variables in Algorithms 2 and 3.
Theorem 3.3.
Define the bias of the gradient oracle of as
(3.7) |
We establish the following lemma to control the bias of the gradient oracle in terms of conditional expectation.
Lemma 3.1.
The bias has a controllable bound as
(3.8) |
Here is the abbreviation of and denote the upper bounds of the bias and variance of , respectively. are constants conditioned on and can be further bounded by polynomials of , . Therefore we can control the bias by enhancing the accuracy of Algorithm 1.
Denote . With the convergence results of Algorithm 1 and the boundedness of the gradient oracle, we establish the convergence of Algorithm 2.
Theorem 3.4.
The detailed proof is available in Theorem C.1 and Corollary C.2. We summarize the proof idea of Theorem 3.4 as follows. By combining Theorem 3.3 with Lemma 3.1, the bias and variance are bounded accordingly. Further analysis of the biased stochastic gradient descent method provides the desired result.
Corollary 3.1.
Remark 3.2.
The step size condition and (C.3) imply that is at most . With , , , , , the right side of the above inequality is . Then the sample complexity on is .
Remark 3.3.
By introducing the following averaged Lipschitz assumption, we can further improve the convergence rate utilizing variance reduction techniques.
Assumption 3.7.
Assume are averaged Lipschitz continuous, that is,
(3.9) | ||||
Theorem 3.5.
The detailed proof is available in Theorem C.3 and Corollary C.3. The proof sketch is summarized as follows: first, we show that the error of the direction is bounded by a linear function of the previous error. By proving the decrease of a merit function with suitable coefficients, we establish the convergence of Algorithm 3.
Remark 3.4.
Further take , , , , then the right side is . The sample complexity on is . From Lemmas 3.3 and 3.1, we know the upper bound of the is ; hence, to achieve an -optimal solution by a biased gradient-based approach, the condition cannot be further improved. That is, the sample complexity on in each iteration is at least . To this extent, the current analysis requires a larger sample complexity on the lower level to reduce the upper-level complexity, and it remains an open question whether variance reduction could reduce the sample complexity of the upper and lower levels at the same time.
4 Numerical experiments
This section presents numerical experiments to demonstrate the effectiveness of the proposed algorithms, compared with baselines including LV-HBA (29), GAM (26), and BLOCC (12).
4.1 Toy example
Consider the following example from Jiang et al. (12). The lower-level problem has a closed-form solution. Now assume the gradient oracles of the upper- and lower-level objectives have Gaussian noise with a fixed variance. We pick 200 random points as initial points and cap the maximum number of upper-level samples. The final iterates of Algorithms 2 and 3 are collected in Figure 1. Figure 1(a) plots the projected points and the distribution of the outputs, and Figure 1(b) shows a 3D plot of the outputs. As shown in the figures, the converged points of both algorithms are close to the global optimal solution and form an approximately Gaussian distribution. The distribution of SALVF-VR is more concentrated than that of SALVF, which demonstrates the acceleration effect of the variance reduction technique.
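A Gaussian-corrupted oracle of the kind used in this experiment can be sketched as follows; the noise level and function names are placeholders, since the exact values from the paper are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_grad(exact_grad, x, y, sigma):
    """Return exact_grad(x, y) corrupted by N(0, sigma^2 I) noise, mimicking
    the stochastic gradient oracles assumed in the toy example."""
    g = np.asarray(exact_grad(x, y), dtype=float)
    return g + sigma * rng.standard_normal(g.shape)
```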
4.2 Hyperparameter tuning for SVM
Consider the following support vector machine (SVM) problem as the lower-level problem:

$$\min_{w,\, b,\, s}\; \frac{1}{2}\|w\|^2 + \sum_{i=1}^{n} c_i s_i \quad \text{s.t.} \quad y_i\big(w^\top u_i + b\big) \ge 1 - s_i,\quad s_i \ge 0,\quad i = 1, \dots, n,$$

where $c$ is the vector of hyperparameters, $\{(u_i, y_i)\}_{i=1}^{n}$ is the training set, and $y_i$ is the label of the $i$-th sample. The upper-level problem is to minimize the validation error with respect to $c$.
In Figure 2, we compare the performance of SALVF with the baselines LV-HBA (29), GAM (26), and BLOCC (12). Since the lower-level problem is deterministic, we allow all algorithms to access the exact optimal solution of the lower-level problem by calling the ECOS (6) solver. We extend BLOCC, LV-HBA, and GAM to stochastic versions by replacing (projected) gradient descent with (projected) stochastic gradient descent in the upper-level update. Figure 2 shows the test accuracy of SALVF versus time and iterations on the Diabetes and Fourclass datasets. SALVF achieves better accuracy than the baselines. Although BLOCC attains a similar peak accuracy, the iterations of SALVF are more time-efficient. This is because SALVF requires a double loop while BLOCC requires a triple loop, which is more computationally expensive.
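For reference, one standard soft-margin SVM lower level of this kind can be solved exactly with ECOS through cvxpy; the sketch below uses a common per-sample-penalty formulation and is only an assumption about the setup, not the paper's exact code.

```python
import cvxpy as cp

def solve_svm_lower_level(U, labels, c):
    """Soft-margin SVM with per-sample penalties c_i (set by the upper level),
    solved exactly with the ECOS solver; a sketch of one standard formulation.

    U: (n, d) training features; labels: (n,) entries in {-1, +1}; c: (n,) penalties.
    """
    n, d = U.shape
    w, b = cp.Variable(d), cp.Variable()
    slack = cp.Variable(n, nonneg=True)
    objective = cp.Minimize(0.5 * cp.sum_squares(w) + c @ slack)
    margin = cp.multiply(labels, U @ w + b) >= 1 - slack
    cp.Problem(objective, [margin]).solve(solver=cp.ECOS)
    return w.value, b.value
```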
4.3 Weight decay tuning
Given a neural network whose weight and bias parameters appear in each layer, the goal is to optimize the weight decay parameters of the model. To improve generalization performance, the weight decay parameters are introduced, imposing norm constraints on the layer weights. This can be formulated as a stochastic BLO.
The upper level measures performance on a validation set, while the lower level performs constrained classifier training. We compare the performance of SALVF, SALVF-VR, and BLOCC on the digits dataset (1) with a two-layer MLP as the base model. The results are shown in Figure 3. The “no weight decay” curve represents the model's performance without weight decay. By incorporating weight decay, all bilevel methods exhibit improved performance and reduced overfitting. From Figure 3(a), we see that SALVF is the most time-efficient, thanks to the simplicity of each step in its double-loop iteration.
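One simple way to enforce a weight-norm constraint of this type in the lower-level training loop is projection after each gradient step; the sketch below assumes a Frobenius-norm ball of radius r, an illustrative choice since the paper's exact constraint form is not reproduced here.

```python
import numpy as np

def project_to_norm_ball(W, r):
    """Project a layer's weights onto {W : ||W||_F <= r}; here r plays the
    role of the weight decay hyperparameter tuned by the upper level."""
    norm = np.linalg.norm(W)
    return W if norm <= r else W * (r / norm)
```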
5 Conclusion
In this paper, we propose a Hessian-free method for solving stochastic LC-BLO problems. We present the first non-asymptotic rate analysis on value function-based algorithms for nonlinear LC-BLO in the stochastic setting. The sample complexity of our algorithm is on upper- and lower-level random variables, respectively. The sample complexity of upper-level variables is further improved to using variance reduction techniques. Numerical experiments on synthetic and real-world data demonstrate the effectiveness of the proposed approach.
Appendix A Convergence analysis
A.1 Properties of Moreau envelope
Given a function $f$ and a parameter $\rho > 0$, the Moreau envelope is defined as $\mathrm{env}_{\rho f}(y) := \min_{t}\big\{ f(t) + \tfrac{1}{2\rho}\|t - y\|^2 \big\}$. To give a comprehensive explanation of the enveloped value function in (2.2), we introduce some well-known properties of the Moreau envelope. For detailed proofs, we refer to the literature, e.g., (19, 9).
Proposition A.1.
1. The Moreau envelope is a continuous lower approximation of , i.e., and . If is -Lipschitz continuous, -weakly convex in , then this difference is at most , i.e.,
2. If is -strongly convex in , , then is -strongly convex.
3. The Moreau envelope has the same minimizer as , i.e.,
4. Its gradient at is given by
where is the proximal operator. Therefore is Lipschitz smooth given that is Lipschitz smooth.
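As a concrete check of these properties on a simple instance: for $f(t) = |t|$ the proximal operator is soft-thresholding and the Moreau envelope is the Huber function, so the gradient identity in Property 4 can be verified numerically; the snippet below is a standalone illustration, not tied to the paper's code.

```python
import numpy as np

rho = 1.0

def prox_abs(t, rho):
    """prox of rho * |.| at t: soft-thresholding."""
    return np.sign(t) * np.maximum(np.abs(t) - rho, 0.0)

def env_abs(t, rho):
    """Moreau envelope of |.|: min_s |s| + (s - t)^2 / (2 rho)."""
    p = prox_abs(t, rho)
    return np.abs(p) + (t - p) ** 2 / (2.0 * rho)

t = 0.3
num_grad = (env_abs(t + 1e-6, rho) - env_abs(t - 1e-6, rho)) / 2e-6
print(env_abs(t, rho))               # 0.045 = t^2 / (2 rho) since |t| <= rho
print((t - prox_abs(t, rho)) / rho)  # 0.3: gradient via Property 4
print(num_grad)                      # ~0.3: matches the identity numerically
```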
In our reformulation, we introduce in (2.2). Then is the Moreau envelope of the augmented dual function . Utilizing the properties of the Moreau envelope, we transform the original strongly-convex–concave saddle problem into a strongly-convex–strongly-concave saddle-point problem without changing the optimal solution.
A.2 Properties of the augmented Lagrangian function
In this section we review some important properties of the augmented Lagrangian function defined in (2.1).
A.3 Augmented Lagrangian duality
In this subsection we introduce the augmented Lagrangian duality for general constrained optimization problems. The augmented Lagrangian duality is a powerful tool for solving constrained optimization problems, especially when the constraints are non-convex. The notation introduced in this subsection is used only within this subsection.

For a general constrained optimization problem
(A.1) |
The augmented Lagrangian penalty term is defined as
(A.2) |
and the augmented Lagrangian dual function is defined as
(A.3) |
where $\lambda_i$ is the dual variable associated with the $i$-th constraint and $\gamma > 0$ is the penalty parameter. The augmented Lagrangian dual function contains the Lagrangian function as a limiting case as the penalty parameter tends to zero.
Under the convexity assumption and Slater’s condition, we have the following proposition about the augmented Lagrangian duality.
Proposition A.2.
(strong duality) Suppose are convex and Slater’s condition holds, i.e., there exists such that . Then the following statements hold:
1. The dual variables exists and the strong duality of augmented Lagrangian holds, i.e.,
2. The strong duality of the regularized augmented Lagrangian holds, i.e.,
holds for any given and .
A.4 Gradient oracles
The gradient of is given by
(A.4) | ||||
With some simple computation, it can be shown that these gradients are bounded by linear functions of .
Lemma A.1.
Under Assumption 3.3, the gradients oracle are bounded by
(A.5a) | ||||
(A.5b) |
Proof It follows from (A.4) that
The proof of (A.5b) is the same. ∎

By substituting the stochastic oracle in (A.4), we derive the gradient oracle of as
(A.6) | ||||
We introduce several properties of the augmented Lagrangian function’s gradient oracle that are essential for the analysis of our approach. The following lemma guarantees that the stochastic gradient remains within bounded norms, which is a crucial condition for stability.
Lemma A.2.
Proof It follows from (A.6) that
By Cauchy-Schwarz inequality, it holds that
(A.8) | ||||
Taking expectation on (A.8) gives (A.7a). The proof of (A.7b) is similar. ∎
The convexity of and implies the convex-concavity of as shown in the following lemma.
Lemma A.3.
(convex-concavity) The function is -strongly convex in and concave in .
Proof From (A.4) we can compute the second order derivative with respect to as
(A.9) | ||||
This implies that is -strongly convex in . Additionally (A.4) shows that is monotonically decreasing with respect to . Therefore is concave in . ∎
Lemma A.4.
The function is -strongly convex in and -strongly concave in .
Proof Combining Lemma A.3 and (2.5b), the conclusion follows. ∎

Besides, its gradients can be computed as
(A.10) | ||||
A.4.1 Comparison between Lagrangian function and augmented Lagrangian function
In this subsection, we provide a comprehensive comparison between the Lagrangian function and the augmented Lagrangian function. The key points are summarized as follows: first, the Lagrangian function is the limiting case of the augmented Lagrangian function as the penalty parameter tends to zero; second, the augmented Lagrangian function-based reformulation offers a tighter upper bound for the variance estimation when computing the upper-level gradient. The Lagrangian function of (1.1) is defined as
(A.11) |
The inequality relation between objective function, Lagrangian function and augmented Lagrangian function is given in the following proposition.
Proposition A.3.
If , then it holds that
Proof For any , there are two cases:
1. If , then . It holds that .
2. If , then . Hence .
Combining the above two cases, we have . This completes the proof. ∎
From equations (2.1) and (A.11), it is evident that the Lagrangian function is the limit of the augmented Lagrangian function as the penalty parameter tends to zero. During the optimization process, it is desirable for the variance of the gradient oracle to be small in order to ensure the algorithm's stability. The second-order moment serves as an upper bound for the variance. The following lemma demonstrates that the gradient of the augmented Lagrangian term for each constraint has a smaller second-order moment compared to that of the Lagrangian function. This property is a key reason for using the augmented Lagrangian function in our algorithm.
Lemma A.5.
Assume is fixed and the random variable , . Then it holds that
(A.12) |
Proof The condition indicates that and . Then we have
Taking square and then taking expectation give the desired result. ∎
Lemma A.6.
Assume is fixed and the random variable , . Then it holds that
(A.13) |
Proof The condition indicates that and . Then we have
Taking square and then taking expectation give the desired result. ∎
A.5 Analysis on the reformulation
In this section, we show the equivalence between the reformulations and the original BLO (1.2). First we consider the deterministic case as a special case of the stochastic case. Then we extend the equivalence to the stochastic case and show the major improvements of the stochastic reformulation.
A.6 Deterministic case
To further analyze the optimization problem, we establish a critical property of the lower-level objective function . Define . The following proposition shows that satisfies a quadratic growth condition, ensuring the uniqueness of solutions in the lower-level problem.
Proposition A.4.
Proof For any , denote
By Assumption 3.2, the strong convexity in implies its quadratic growth, i.e.,
Furthermore, the complementary slackness condition implies for . Therefore, we have
2. For any satisfying , denote as the projection of onto . Then . From the results of 1, we have
Assumption 3.3 implies
Combining the above two inequalities and the fact , we have
3. By Assumption 3.3 and Lemma A.7, for any and , it holds that
Then following the proof in Lemma A.4.1, we have
Further assume that satisfies ; then
This completes the proof. ∎
The following lemma shows the relationship between and the enveloped value defined in (2.2).
Lemma A.7.
1. It holds that . The equality holds if and only if there exist such that .
2. Assume , then is equivalent to existing such that .
3. For any , there exists such that satisfying , where . This statement also holds for .
Proof 1. For any fixed , we have
We consider maximizing with respect to . If , then . If , then
Combining the above two cases, we have
This implies that
(A.14) |
From the definition of , for , we have
(A.15) | ||||
The equality in the inequality holds if and only if .
2. Since ,
hence the condition holds if and only if
, which implies .
Conversely, suppose , Take , then
(A.16) | ||||
3. The optimality condition of implies that and , that is,
(A.17) |
From the proof of 1, we know that if , then . If , then . Otherwise suppose and hold simultaneously. The convexity of implies for all , which contradicts Assumption 3.5. Substituting these cases in (A.17) yields
(A.18) |
Note that the above conditions are equivalent to the KKT conditions. Since the lower-level problem is convex and Slater's condition holds, this is a sufficient condition for optimality. Assume the matrix is the concatenation of ; then the multiplier with minimal norm satisfying (A.18) is bounded as
The inequality uses Assumption 3.6. As a special case of ,
This completes the proof. ∎
Remark A.1.
The following theorem shows the equivalence between (1.2) and the penalty reformulation in the deterministic setting. This theorem generalizes Theorem 1 of (20); indeed, choosing reduces our statement to exactly that result.
Theorem A.1.
1. Assume is a global solution to (1.2) and . Then is a -global-minima of the following penalized form
(A.19) |
Furthermore, there exist such that is a -global-minima of the following penalized form
(A.20) |
with .
Proof 1. Lemma A.7 shows that (A.20) is equivalent to (A.19) by minimizing first. Hence it suffices to show the equivalence between (1.2) and (A.19). Let . Proposition A.4.3 implies that for any satisfying . By the Lipschitz property of , we have
(A.23) | ||||
By taking and , the right side of the above inequality . For the optimal solution of (1.2) and , it holds that
(A.24) |
The first inequality follows from the optimality of and . The second inequality follows from (A.23). The inequality (A.24) implies that is a -optimal solution of (A.19). Further take , then is a -optimal solution of (A.20). By Lemma A.7, we can find such satisfying . This completes the proof.
2. Lemma A.7 shows that (A.22) is equivalent to (A.21) by minimizing first. For any -optimal solution of (A.19), it holds that
The second inequality follows from (A.23) and . Then we have
(A.25) |
Take . Then is feasible for (A.21). For any feasible solution of (A.21), the -optimality of implies
(A.26) | ||||
The last inequality follows from the feasibility of . Therefore is a -optimal solution of (A.21). This completes the proof. ∎
Theorem A.2.
1. Assume is a global solution to (1.2) and . Then is a -global-minima of the following penalized form
(A.27) |
Furthermore, there exist such that is a -global-minima of the following penalized form
(A.28) |
2. By taking , any -global-minima of (A.27) and (A.28) is an -global-minima of (A.21) and (A.22) with some , respectively.
Proof 1. Lemma A.7 shows that (A.28) is equivalent to (A.27) by minimizing first. Hence it suffices to show the equivalence between (1.2) and (A.27). Let . Proposition A.4.3 implies that . By the Lipschitz property of , we have
(A.29) | ||||
By taking and , the right side of the above inequality . For the optimal solution of (1.2), it holds that
(A.30) |
The first inequality follows from the optimality of and the definition of . The second inequality follows from (A.29). This implies that is a -optimal solution of (A.27). Further take , then is a -optimal solution of (A.28). By Lemma A.7, we can find such satisfying . This completes the proof.
2. Lemma A.7 shows that (A.22) is equivalent to (A.21) by minimizing first. For any -optimal solution of (A.27), it holds that
The second inequality follows from (A.29). From the selection of , we have
(A.31) |
Take . This implies is feasible for (A.21). For any feasible solution of (A.21), the -optimality of implies
(A.32) | ||||
The last inequality follows from the feasibility of . Therefore is a -optimal solution of (A.21). This completes the proof. ∎
A.7 Stochastic case
Given samples . Denote as the space of random variables mapping to and , respectively, that is,
Assume are two random variables depending on . In this section, is the abbreviation of . The following theorem shows the equivalence between (1.2) and the penalty reformulation in the stochastic setting.
Theorem A.3.
1. Assume is a global solution to (1.2). If defined in (2.6) is nonempty for any , then for any , there exists such that is a -global-minima of the following penalized form
(A.33) | ||||
s.t. |
with any , and .
2. By taking , and , for any , the -global-minima of (A.33) is a -global-minima of the following approximation of BLO:
(A.34) | ||||
s.t. |
with some .
Proof 1. By Lemma A.7, we have . Then for , it holds that
(A.35) | ||||
By (2.6), it holds that
(A.36) |
Combining this with (A.35) gives
By taking any , and , the right side of the above inequality . For the optimal solution of (1.2) and any , by taking , we have . Lemma (A.7).3 implies that such a exists. Hence is a -optimal solution of (A.33).
2. For any -optimal solution of (A.33), it holds that
Then we have
(A.37) |
Take . This implies is feasible for (A.34). For any feasible solution of (A.34), the -optimality of implies
(A.38) | ||||
The last inequality follows from the feasibility of . Therefore is a -optimal solution of (A.34). This completes the proof. ∎
Theorem A.4.
Proof 1. By Lemma A.7, we have . Then for , it holds that
(A.40) | ||||
By (2.6), it holds that
(A.41) |
Combining this with (A.40) gives
By taking , and , the right side of the above inequality . For the optimal solution of (1.2) and any , by taking , we have . Lemma (A.7).3 implies that such a exists. Hence is a -optimal solution of (A.33).
2. For any -optimal solution of (A.33), it holds that
Then we have
(A.42) |
Take , . This implies is feasible for (A.34). For any feasible solution of (A.34), the -optimality of implies
(A.43) | ||||
The last inequality follows from the feasibility of . Therefore is a -optimal solution of (A.34). This completes the proof. ∎
Appendix B Analysis on the inner loop
In this section, we analyze the convergence of Algorithm 1. Assume and are fixed. The expectation in this section denotes the expectation conditioned on , that is, . Since is fixed, we abbreviate as and as for in this section.
We write as defined in (2.5a) and as defined in (2.5b). By Proposition A.2, the optimal solution is a saddle point of , i.e.,
(B.1) |
To analyze the convergence of the inner loop, we establish the decreasing property for running a single step of the inner algorithm. The following lemma gives the decreasing property of by taking gradient descent ascent steps.
Lemma B.1.
Proof The projection gradient step (2.8a) gives
(B.4) |
This implies
(B.5) |
By Young’s inequality, we have
(B.6) |
By the strong convexity of with respect to , we have
(B.7) |
Note that is independent of , and hence
(B.8) | ||||
Summing up (B.6),(B.7) and then taking expectation gives
Utilizing the identity and (B.5), we have
This gives (B.2). The concavity of in implies that is -strongly concave in . Similarly, we can obtain (B.3) by taking the dual step (2.8b). This completes the proof. ∎
Combining the primal and dual steps, the following lemma shows the decreasing property with respect to the duality gap.
Corollary B.1.
For any it holds that
(B.9) | ||||
By taking decreasing step sizes, we can prove that the dual variable is bounded during the iterations of the inner loop in expectation.
Lemma B.2.
Choose the dual step size as where the constants , then the following bounds hold:
(B.11) |
Proof From (2.8b) and (2.5b) we have
(B.12) | ||||
Consider the -th component of , we have
Since , it holds that . Then we obtain a recursive relation as
Multiplying both sides by and using , we have
Note that the initial point . Hence . This gives the bound of as (B.11). ∎
Now we are ready to establish the convergence of the inner loop. In the following theorem, we show that the output pair is an -optimal point of the lower-level problem as measured by the squared norm.
Theorem B.1.
By taking the step size as
(B.13) |
where the constants , there exist constants such that
(B.14) | ||||
where , .
Proof Taking , in (B.9) gives
(B.15) | ||||
Then taking and utilizing gives
(B.16) | ||||
This recursive relationship gives
(B.17) | ||||
Substituting (B.11) into (B.17) and taking we have
(B.18) | ||||
This gives the boundedness of and in (B.14). ∎
By using the objective gap as the convergence measure, a similar rate can be obtained as follows.
Corollary B.2.
Under the same conditions as in Theorem B.1, it holds that
(B.19) |
Proof From (A.9) and (A.10), we see that is Lipschitz continuous in with modulus , and is Lipschitz continuous in with modulus , given the bound . Then, under the strongly-convex-strongly-concave condition shown in Lemma A.4, the objective gap
has a relationship to as follows:
Therefore, the convergence rate of the objective function is also , that is
Note that and , we have
Similarly, it holds that
This completes the proof. ∎
Appendix C Analysis on the outer loop
In the outer loop, we apply the stochastic gradient descent (SGD) method to solve the saddle point problem (2.9). Unlike the standard analysis of SGD, this setting presents two main challenges: first, constructing a suitable stochastic gradient oracle for (2.10); second, addressing the bias introduced by the inexact solution of the lower-level problem.
In the following lemma, we show that is continuously differentiable and that its gradient is computable given the optimal solution of the subproblem (2.5a).
Lemma C.1.
Proof By Theorem 4.24 in (3), is continuously differentiable with respect to and its gradient is given by
Hence is continuously differentiable and (C.1a) holds. Since is independent of , (C.1b) is straightforward. Note that is the Moreau envelope of for any , Proposition A.1.4 gives . On the other hand, the optimality condition of the subproblem (2.5a) gives
This gives (C.1c).
Now we show the Lipschitz continuity of , that is . Similar to the proof of Lemma A.2, it holds that . Again by Theorem 4.24 in (3), is continuously differentiable with respect to and its gradient is given by , where is the optimal solution of the subproblem . From (A.4) we know . Hence is Lipschitz continuous with respect to . Proposition A.1.4 implies the Moreau envelope of is also Lipschitz continuous with the same Lipschitz constant. This gives
(C.2) |
Therefore,
The last inequality follows from the bound established above. This completes the proof. ∎
Corollary C.1.
is Lipschitz continuous with modulus
(C.3) |
Assume we have already obtained the approximate optimal solution of the subproblem (2.5a) at the $k$-th iteration. Note that it is a random variable depending on the inner-loop samples. In the subsequent analysis, we take expectations conditioned on these samples, and hence it is treated as constant. Given a sample, a natural first-order oracle for the objective in (2.9) is considered as
(C.4) | ||||
Conditioned on the inner-loop samples, the bias of the oracle arises from the inexact lower-level term. However, when the approximate solution is also treated as a random variable, the estimation of the bias is much more complicated. We establish two lemmas to control this bias: Lemma C.2 is conditioned on the inner-loop samples, while Lemma C.3 is conditioned on the outer filtration only. In the following analysis, we use the abbreviated conditional expectation notation for simplicity.
Lemma C.2.
has a controllable bias conditioned on as
(C.5) | ||||
and a controllable conditional variance as
(C.6) |
Here are the abbreviation of , respectively.
Proof When conditioned on , are constants. Utilizing (C.1a) and taking expectation over gives
(C.7) | ||||
From the expression of and (A.4), (A.6), we can further compute
(C.8) | ||||
By Assumption 3.1, it holds that
(C.9) | ||||
The last inequality uses the fact that and the assumptions that , and . By (C.2), it holds that . Then by combining (C.8) and (C.9), we bound the bias of as
The last inequality follows from (C.8) again. Besides and are straightforward. Therefore (C.5) holds.
From the expression of , we can compute the conditional variance of as
and
Hence
This completes the proof. ∎
In the following lemma, are treated as random variables and we try to control the bias and variance of the first order oracle of conditioned on .
Lemma C.3.
The first order oracle of has a bounded conditional bias and variance as
(C.10a) | ||||
(C.10b) |
where
(C.11) | ||||
and
(C.12) | ||||
Proof Taking expectation over in (C.5) gives (C.10a). Utilizing the property of conditional variance and Lemma C.3, we have
(C.13) | ||||
The inequality is due to (C.4) and Lemma C.2. The second term in the right-hand side of (C.13) is bounded by
(C.14) | ||||
Since is a constant conditioned on , we have
The first inequality uses the fact that variance is smaller than second order moment, and the second inequality is due to the Lipschitz continuity of . Similarly, we have
Substituting these two inequalities into (C.14) gives
(C.15) | ||||
The last term in the right-hand side of (C.13) is bounded by
(C.16) |
Combining (C.13), (C.15) and (C.16) gives the desired result. ∎
Lemma C.4.
Under the conditions of Theorem B.1, and are bounded as
(C.17a) | ||||
(C.17b) | ||||
(C.17c) |
where , , are constants obtained by replacing with .
Proof It follows from the Cauchy-Schwarz inequality that and . Substituting the results of Theorem B.1 into (C.11) and (C.12) gives the desired results. Note that the notation in (B.14) is replaced by in the current context. ∎
Denote . From the definition of in (2.10) and (2.12), we derive the relationship between the bias of and as follows.
(C.18) |
Since is unbiased, the bias of is fully determined by the bias of .
From the relationship between variance and moment, the bias defined in (3.7) can be bounded as follows.
Lemma C.5.
The bias has a controllable moment as
(C.19) |
Here is the abbreviation of .
Proof By Lemma C.3, it holds that
The last inequality follows from Lemmas C.2 and C.3. This completes the proof. ∎
The convergence of Algorithm 2 is established in the following theorem.
Theorem C.1.
Assume the step sizes satisfy , then the sequence satisfies
(C.20) | ||||
Here the expectation is taken over all , .
Proof The projection gradient step gives
This is equivalent to
The Lipschitz property of gives
(C.21) | ||||
The last inequality uses the assumption . Then summing up (C.21) over gives
(C.22) |
Taking expectation on both sides and multiplying yields (C.20). This completes the proof. ∎
We consider measuring the convergence by the deviation from the optimality condition as
(C.23) |
Let . The projection step gives . We can derive the following bound on this measure in Theorem C.2.
Theorem C.2.
Assume . Then it holds that
(C.24) | ||||
Proof Since , we have
(C.25) |
The Lipschitz property of gives
(C.26) | ||||
Taking square we obtain
(C.27) | ||||
Substituting the above inequality into (C.20) gives (C.24). This completes the proof. ∎
Corollary C.2.
Proof From Lemmas C.4 and C.5, we know that . Substituting this into Theorems C.1 and C.2 gives
This completes the proof. ∎
Remark C.1.
Let . The step size condition and (C.3) imply that is at most . With , , , , , the right side of the above inequality is . Then the sample complexity on is and the sample complexity on is . Theorem C.1 shows the algorithm converges to an -stationary point of the problem (2.9) for any fixed with this sample complexity.
Remark C.2.
C.1 Analysis on variance reduction
In this section, we introduce a stronger assumption on the Lipschitz continuity of the gradients of and , namely Assumption 3.7. From (C.3) and (C.18), we know is -averaged Lipschitz continuous conditioned on with modulus
(C.28) |
Define the error of the direction as
First, we derive the descent achieved by a single iteration of the variance-reduced method.
Lemma C.6.
Proof Similar to (C.21), it holds that
(C.30) | ||||
The first inequality follows from the Lipschitz property, the second inequality is due to the projection gradient step and the third inequality uses Young’s inequality. This completes the proof. ∎
Next, we show that the sequence of errors satisfies the following recursion, and hence the error decreases.
Lemma C.7.
The conditional expectation of the error is bounded as
(C.31) | ||||
Here the expectation is taken over .
Proof Let be the bias of the gradient at and be the deviation of the gradient at from the expected gradient at the previous iterate, that is,
From the definition of and (2.15) we have
(C.32) | ||||
It follows from (C.6) and Lemma C.3 that , where is defined in (C.12). Since is a constant term conditioned on and , (C.32) implies that
(C.33) | ||||
We now handle the term . Let
be the difference of the gradients between the points and , respectively. The averaged Lipschitz property of and the Lipschitz property of give and . Then it holds that
and
(C.34) | ||||
The last inequality follows from Lemma C.3 and (C.18). Combining (C.33) and (C.34) gives (C.31). This completes the proof. ∎

Using the above two lemmas, we can derive the convergence of the variance-reduced method.
Theorem C.3.
Proof Consider a merit function as , where satisfies that
(C.36a) | ||||
(C.36b) |
Considering the reduction of the merit function, we have
(C.37) | ||||
To ensure the conditions (C.36a) and (C.36b), we take
(C.38) |
and let , . Then (C.37) is simplified to
(C.39) | ||||
Summing up the above inequality from to , taking expectation over all the random variables and multiplying on both sides, we have (C.35). This completes the proof. ∎
Corollary C.3.
Proof From Corollary C.1 and Lemmas C.4 and C.5, we know , , , and . Besides, it follows from (C.38) that . By substituting these settings into (C.35), we have
(C.41) | ||||
This completes the proof. ∎
Remark C.3.
Further take , , , , then the right side is . The sample complexity on is and the sample complexity on is .
References
- Alpaydin and Alimoglu (1996) Ethem Alpaydin and Fevzi Alimoglu. Pen-based recognition of handwritten digits. UCI Machine Learning Repository, 1996.
- Bennett et al. (2006) Kristin P Bennett, Jing Hu, Xiaoyun Ji, Gautam Kunapuli, and Jong-Shi Pang. Model selection via bilevel optimization. In The 2006 IEEE International Joint Conference on Neural Network Proceedings, pages 1922–1929. IEEE, 2006.
- Bonnans and Shapiro (2013) J Frédéric Bonnans and Alexander Shapiro. Perturbation analysis of optimization problems. Springer Science & Business Media, 2013.
- Chen et al. (2021) Tianyi Chen, Yuejiao Sun, and Wotao Yin. Tighter analysis of alternating stochastic gradient method for stochastic nested problems. arXiv preprint arXiv:2106.13781, 2021.
- Cutkosky and Orabona (2019) Ashok Cutkosky and Francesco Orabona. Momentum-based variance reduction in non-convex sgd. Advances in neural information processing systems, 32, 2019.
- Domahidi et al. (2013) Alexander Domahidi, Eric Chu, and Stephen Boyd. ECOS: An SOCP solver for embedded systems. In 2013 European Control Conference (ECC), pages 3071–3076. IEEE, 2013.
- Franceschi et al. (2018) Luca Franceschi, Paolo Frasconi, Saverio Salzo, Riccardo Grazzi, and Massimiliano Pontil. Bilevel programming for hyperparameter optimization and meta-learning. In International conference on machine learning, pages 1568–1577. PMLR, 2018.
- Gao et al. (2023) Lucy L Gao, Jane J Ye, Haian Yin, Shangzhi Zeng, and Jin Zhang. Moreau envelope based difference-of-weakly-convex reformulation and algorithm for bilevel programs. arXiv preprint arXiv:2306.16761, 2023.
- Grimmer et al. (2023) Benjamin Grimmer, Haihao Lu, Pratik Worah, and Vahab Mirrokni. The landscape of the proximal point method for nonconvex–nonconcave minimax optimization. Mathematical Programming, 201(1):373–407, 2023.
- Hong et al. (2023) Mingyi Hong, Hoi-To Wai, Zhaoran Wang, and Zhuoran Yang. A two-timescale stochastic algorithm framework for bilevel optimization: Complexity analysis and application to actor-critic. SIAM Journal on Optimization, 33(1):147–180, 2023.
- Ji et al. (2020) Kaiyi Ji, Junjie Yang, and Yingbin Liang. Provably faster algorithms for bilevel optimization and applications to meta-learning. Neural Information Processing Systems, 2020.
- Jiang et al. (2024a) Liuyuan Jiang, Quan Xiao, Victor M Tenorio, Fernando Real-Rojas, Antonio Marques, and Tianyi Chen. A primal-dual-assisted penalty approach to bilevel optimization with coupled constraints. arXiv preprint arXiv:2406.10148, 2024a.
- Jiang et al. (2024b) Xiaotian Jiang, Jiaxiang Li, Mingyi Hong, and Shuzhong Zhang. A barrier function approach for bilevel optimization with coupled lower-level constraints: Formulation, approximation and algorithms. arXiv preprint arXiv:2410.10670, 2024b.
- Kang et al. (2023) Hyuna Kang, Seunghoon Jung, Jaewon Jeoung, Juwon Hong, and Taehoon Hong. A bi-level reinforcement learning model for optimal scheduling and planning of battery energy storage considering uncertainty in the energy-sharing community. Sustainable Cities and Society, 94:104538, 2023.
- Khanduri et al. (2023) Prashant Khanduri, Ioannis Tsaknakis, Yihua Zhang, Jia Liu, Sijia Liu, Jiawei Zhang, and Mingyi Hong. Linearly constrained bilevel optimization: A smoothed implicit gradient approach. In International Conference on Machine Learning, pages 16291–16325. PMLR, 2023.
- MacKay et al. (2019) Matthew MacKay, Paul Vicol, Jon Lorraine, David Duvenaud, and Roger Grosse. Self-tuning networks: Bilevel optimization of hyperparameters using structured best-response functions. arXiv preprint arXiv:1903.03088, 2019.
- Nocedal and Wright (1999) Jorge Nocedal and Stephen J Wright. Numerical optimization. Springer, 1999.
- Qin et al. (2023) Xiaorong Qin, Xinhang Song, and Shuqiang Jiang. Bi-level meta-learning for few-shot domain generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15900–15910, 2023.
- Rockafellar and Wets (2009) R Tyrrell Rockafellar and Roger J-B Wets. Variational analysis, volume 317. Springer Science & Business Media, 2009.
- Shen and Chen (2023) Han Shen and Tianyi Chen. On penalty-based bilevel gradient descent method. In International Conference on Machine Learning, pages 30992–31015. PMLR, 2023.
- Shen et al. (2024) Han Shen, Zhuoran Yang, and Tianyi Chen. Principled penalty-based methods for bilevel reinforcement learning and rlhf. arXiv preprint arXiv:2402.06886, 2024.
- Sinha et al. (2020) Ankur Sinha, Tanmay Khandait, and Raja Mohanty. A gradient-based bilevel optimization approach for tuning hyperparameters in machine learning. arXiv preprint arXiv:2007.11022, 2020.
- Stadie et al. (2020) Bradly Stadie, Lunjun Zhang, and Jimmy Ba. Learning intrinsic rewards as a bi-level optimization problem. In Conference on Uncertainty in Artificial Intelligence, pages 111–120. PMLR, 2020.
- Tsaknakis et al. (2022) Ioannis Tsaknakis, Prashant Khanduri, and Mingyi Hong. An implicit gradient-type method for linearly constrained bilevel problems. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5438–5442. IEEE, 2022.
- Tsaknakis et al. (2023) Ioannis Tsaknakis, Prashant Khanduri, and Mingyi Hong. An implicit gradient method for constrained bilevel problems using barrier approximation. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023.
- Xu and Zhu (2023) Siyuan Xu and Minghui Zhu. Efficient gradient approximation method for constrained bilevel optimization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 12509–12517, 2023.
- Yang et al. (2024) Yan Yang, Bin Gao, and Ya-xiang Yuan. Bilevel reinforcement learning via the development of hyper-gradient without lower-level convexity. arXiv preprint arXiv:2405.19697, 2024.
- Yao et al. (2024a) Wei Yao, Haian Yin, Shangzhi Zeng, and Jin Zhang. Overcoming lower-level constraints in bilevel optimization: A novel approach with regularized gap functions. arXiv preprint arXiv:2406.01992, 2024a.
- Yao et al. (2024b) Wei Yao, Chengming Yu, Shangzhi Zeng, and Jin Zhang. Constrained bi-level optimization: Proximal Lagrangian value function approach and Hessian-free algorithm. arXiv preprint arXiv:2401.16164, 2024b.
- Zhu et al. (2020) Hancheng Zhu, Leida Li, Jinjian Wu, Sicheng Zhao, Guiguang Ding, and Guangming Shi. Personalized image aesthetics assessment via meta-learning with bilevel gradient optimization. IEEE Transactions on Cybernetics, 52(3):1798–1811, 2020.