[1] Department of Computer Science, Università di Pisa, 3 Largo B. Pontecorvo, 56127 Pisa, Italy
[2] Laboratoire d’Informatique de Paris Nord, Université Sorbonne Paris Nord, 99 avenue Jean-Baptiste Clément, 93430 Villetaneuse, France
[3] Department of Information Systems, Data Analytics and Operations, ESSEC Business School, 3 avenue Bernard Hirsch, 95000 Cergy, France

Bundle Network: a Machine Learning-Based Bundle Method

Francesca Demelas demelas@lipn.univ-paris13.fr    Antonio Frangioni antonio.frangioni@unipi.it    Mathieu Lacroix lacroix@lipn.univ-paris13.fr    Joseph Le Roux leroux@lipn.univ-paris13.fr    Emiliano Traversi emiliano.traversi@essec.edu    Roberto Wolfler Calvo wolfler@lipn.univ-paris13.fr
Abstract

This paper presents Bundle Network, a learning-based algorithm inspired by the Bundle Method for convex non-smooth minimization problems. Unlike classical approaches that rely on heuristic tuning of a regularization parameter, our method automatically learns to adjust it from data. Furthermore, we replace the iterative resolution of the optimization problem that provides the search direction—traditionally computed as a convex combination of gradients at visited points—with a recurrent neural model equipped with an attention mechanism. By leveraging the unrolled graph of computation, our Bundle Network can be trained end-to-end via automatic differentiation. Experiments on Lagrangian dual relaxations of the Multi-Commodity Network Design and Generalized Assignment problems demonstrate that our approach consistently outperforms traditional methods relying on grid search for parameter tuning, while generalizing effectively across datasets.


Keywords: Attention Network, Bundle Method, Long Short-Term Memory, Unrolling.

1 Introduction

This work presents a machine learning-based approach to tackle convex non-smooth minimization problems, inspired by bundle-type methods [42, 41], i.e., iterative algorithms that generalize the subgradient method in that the search direction is a convex combination of the subgradients at the visited points, obtained by solving an optimization problem called the Master problem. At each iteration, a crucial scalar parameter related to the regularization in the Master problem must be properly chosen to obtain optimal performance. Although there are many variants of the Bundle method [22], in this work we only consider the best-known proximal version, where the regularization parameter, denoted by $\eta$, is the weight of the quadratic regularization term in the objective of the Master problem, penalizing the distance of the new point from the best point found so far. This parameter significantly influences the next iterate produced at each iteration in two different ways:

  1. by changing the convex multipliers that are used to form the search direction out of the currently available bundle of (approximate) subgradient information;

  2. by acting as the stepsize along said direction.

Traditionally, selecting this parameter relies on heuristic approaches, referred to as $\eta$-strategies, often requiring an extensive grid search to achieve high performance. By contrast, relatively simple diminishing stepsize rules are known to work well in subgradient-based approaches, both in theory and in practice. Thus, one may suspect that the “double role” of the $\eta$ parameter is what makes its online selection particularly challenging. Indeed, roughly speaking, the smaller $\eta$, the more “local” the direction is, taking into account only information computed close to the current iterate. However, towards termination, the algorithm should have accrued enough information that a “global” direction pointing towards the minimum of the model is likely preferable, which makes standard diminishing rules impractical.

In order to improve both the performance and the robustness of existing approaches, we propose to delegate the selection of the convex multipliers and the choice of the stepsize to a novel objective-based learning approach. Since our approach takes decisions dynamically during algorithm execution, Reinforcement Learning [55] techniques could be, and in fact have been, successfully used [47, 57]. However, Reinforcement Learning is mainly useful when differentiating the execution is not possible, and it has many drawbacks, such as being often sample-inefficient and suffering from high variance [30]. We rather propose to differentiate our algorithm via an unrolling approach [48], yielding a more direct optimization process with lower-variance gradient estimators by exploiting the internal solver dynamics [53, 38]. A key challenge in this approach is to keep the structure of the Bundle method while differentiating through its execution. To address this problem, we use a heuristic approach whereby we replace the non-smooth components of the method with smoother approximations based on a neural network model. Specifically, we propose a neural network that approximates the search direction and the step size using information from the current bundle, replacing both the $\eta$-strategy heuristic and the (dual) Master problem (4). This neural network integrates an approach inspired by the Bundle method, as outlined in Algorithm 1, incorporating soft updates for the stabilization point, thereby enabling end-to-end differentiation using Automatic Differentiation [4] techniques.

1.1 Literature Review

Many recent studies have leveraged machine learning to tackle Combinatorial Optimization problems [7]. In this work, we focus on amortization techniques [1], a research domain sometimes referred to as Learning-to-Optimize [9, 44, 10], and the associated methods are called semi-amortized optimization methods. In amortization techniques, the goal is to reduce the computational cost of iterative optimization methods by learning a direct mapping from inputs, defining problem instances, to approximate solutions. This often involves training an inference network to quickly produce a high-quality approximate solution that would otherwise require costly iterative optimization. Many amortized frameworks that use an iterative structure for prediction require some unrolling technique [48] for back-propagation, i.e., reformulating the algorithm execution to enable gradient computation in the backward pass.

Previous works designed to find the best parameters for an optimization algorithm have focused on finding the best step size for stochastic gradient descent [54] in the context of Machine Learning, and are thus seen as instances of meta-learning. While early works considered simple evolutionary rules [6, 5], more recent ones are based on neural networks [50]. Some works [60, 32] jointly train with gradient descent both the network for the learning problem and the one for the meta-learning task. However, these approaches still struggle to scale to modern architectures with tens of thousands of parameters. To address this limitation, Andrychowicz et al. [2] introduce an LSTM-based optimizer that operates coordinate-wise on the network parameters. This design enables different behavior on each coordinate by using separate activation functions for each parameter, while maintaining the optimizer’s invariance to parameter order, as the same update rule is independently applied to each coordinate.

All of the above approaches fundamentally adopt the structure of gradient descent as the base for their learning framework. There is growing interest in applying the Bundle method to optimization problems arising from the machine learning community [39]. Variants of Proximal Bundle methods [37] are used to optimize the complex, non-smooth loss functions often encountered in machine learning [49]. A specialized asynchronous Bundle method is presented for solving distributed learning problems [8]. A further connection with the Bundle method can be found in [33], where a meta-learning approach is proposed to coordinate learning in master-slave distributed systems. This method aggregates the gradients for gradient descent, improving the scalability of distributed learning.

Recent research on Bundle methods for non-convex problems [18, 17] opens the possibility of using Bundle variants for machine learning problems. Bundle methods have been applied to improve the training of SVMs [15], especially for large-scale problems [34, 58]. Extensions of Bundle methods have also been proposed for multiclass SVMs, where Bundle methods can handle a larger number of constraints efficiently [20].

The benefits of using a proximal operator when optimizing a convex non-smooth function have already been demonstrated [45]. Moreover, [61] shows that proximal operators are also advantageous for meta-learning tasks; the authors do not consider the Bundle method, but rather unroll proximal gradient descent.

In [43], a method is presented that approximates the resolution of a large-scale linear programming problem based on the associated Lagrangian function, providing iterative primal and dual updates using a neural network trained with unrolling techniques.

1.2 Bundle Method

The Bundle method (BM) [42, 41] is an iterative method that solves problems of type

\min\{\,\phi(\bm{\pi})\,:\,\bm{\pi}\in\Pi\,\} \qquad (1)

where $\phi:\mathbb{R}^{m}\rightarrow\mathbb{R}$ is a finite-valued convex non-smooth function and $\Pi$ is a “simple” convex set. The Lagrangian Dual is a relevant application of (1), discussed in more detail in Section 3.1. In the rest of the paper we restrict ourselves to the case in which $\Pi=\mathbb{R}^{m}$ or $\Pi=\mathbb{R}^{m}_{+}=\{\bm{\pi}\in\mathbb{R}^{m}\,:\,\bm{\pi}\geq 0\}$, which correspond to the Lagrangian Dual [21]—that we use in our computational testing—when one relaxes, respectively, equality or inequality constraints. For the sake of presentation we only consider the case $\Pi=\mathbb{R}^{m}$ here, but the approach can be simply extended, as discussed at the end of Section 3.1, to the case $\Pi=\mathbb{R}^{m}_{+}$.

Bundle methods iteratively optimize an approximation $\phi_{\beta}$ of the function $\phi$, constructed using the bundle $\beta$ of first-order information collected so far, plus some regularization. A common choice for $\phi_{\beta}$ is the Cutting Plane model, i.e., the natural polyhedral lower approximation built using the very definition of subgradients. For reasons to become apparent shortly, it is expedient to fix some reference point $\bar{\bm{\pi}}$; then, at a given iteration $t$ one has the set of previous iterates $\{\bm{\pi}_{i}\}_{i=1}^{t}$ and can define the bundle $\beta_{t}$ as the set of pairs $\{(\bm{g}_{i},\alpha_{i})\}_{i=1}^{t}$, where $\bm{g}_{i}\in\partial\phi(\bm{\pi}_{i})$ is a subgradient obtained at the given iterate and $\alpha_{i}$ is the corresponding linearization error $\alpha_{i}=\phi(\bar{\bm{\pi}})-\phi(\bm{\pi}_{i})-\bm{g}_{i}^{\top}(\bar{\bm{\pi}}-\bm{\pi}_{i})\geq 0$ (non-negativity being implied by convexity of $\phi$). Note that the $\alpha_{i}$ depend on $\bar{\bm{\pi}}$, but we do not stress this notationally since $\bar{\bm{\pi}}$ will always be clear from the context. In fact, $\beta_{t}$ may contain a strict subset of the thusly computed pairs (in case of bundle selection) and even pairs obtained differently (in case of bundle aggregation), but this is immaterial to our discussion, and we leave these details out for simplicity of presentation. With this information, the Cutting Plane model [51] is

\phi_{\bar{\bm{\pi}},\beta_{t}}(\bm{d})=\max\{\,\bm{g}_{i}^{\top}\bm{d}-\alpha_{i}\,:\,(\bm{g}_{i},\alpha_{i})\in\beta_{t}\,\}\;. \qquad (2)

It is easy to see that $\phi(\bar{\bm{\pi}}+\bm{d})-\phi(\bar{\bm{\pi}})\geq\phi_{\bar{\bm{\pi}},\beta_{t}}(\bm{d})$, i.e., the Cutting Plane model is a polyhedral lower approximation of $\phi$ when translated w.r.t. $(\bar{\bm{\pi}},\phi(\bar{\bm{\pi}}))$. Thus, however $\bar{\bm{\pi}}$ is chosen, the “unstabilized” Master problem

\min\{\,\phi_{\bar{\bm{\pi}},\beta_{t}}(\bm{d})\,:\,\bar{\bm{\pi}}+\bm{d}\in\Pi\,\}

would provide a lower bound on the optimal value of Problem (1); however, such a problem can easily be unbounded below. Even if it is not, its optimal solution may be a very poor approximation of the true one, since $\phi_{\bar{\bm{\pi}},\beta_{t}}$ may be a poor model of $\phi$ “far” from the points in which the function has been computed. Because of this, one has to introduce some stabilization term that avoids wandering away from the region of space where $\phi_{\bar{\bm{\pi}},\beta_{t}}$ is an appropriate model. The most common choice, corresponding to the proximal Bundle method, is the squared Euclidean distance from the reference point $\bar{\bm{\pi}}$, which is therefore called the (current) stabilization centre; this leads to the stabilized (primal) Master problem

(MP_{t})\qquad\min\{\,\phi_{\bar{\bm{\pi}},\beta_{t}}(\bm{d})+\|\bm{d}\|_{2}^{2}\,/\,(2\eta_{t})\,:\,\bar{\bm{\pi}}+\bm{d}\in\Pi\,\} \qquad (3)

whose optimal solution $\bm{d}^{*}$ is used to choose the next iterate as $\bm{\pi}=\bar{\bm{\pi}}+\bm{d}^{*}$. Solving (3) is equivalent to solving its dual, which for the case $\Pi=\mathbb{R}^{m}$ reads

\min\quad \frac{\eta_{t}}{2}\Big\|\sum_{(\bm{g}_{i},\alpha_{i})\in\beta_{t}}\bm{g}_{i}\theta_{i}\Big\|_{2}^{2}+\sum_{(\bm{g}_{i},\alpha_{i})\in\beta_{t}}\alpha_{i}\theta_{i} \qquad (4a)
\text{s.t.}\quad \sum_{(\bm{g}_{i},\alpha_{i})\in\beta_{t}}\theta_{i}=1\;,\;\;\bm{\theta}\geq 0 \qquad (4b)

while the case $\Pi=\mathbb{R}^{m}_{+}$ adds a new set of variables and simple constraints. The optimal solution $\bm{d}^{(t)}$ of (3) and the optimal solution $\bm{\theta}^{(t)}$ of (4) are linked by the relationship $\bm{d}^{(t)}=-\eta_{t}\bm{w}_{t}$, where $\bm{w}_{t}=\sum_{(\bm{g}_{i},\alpha_{i})\in\beta_{t}}\bm{g}_{i}\theta_{i}^{(t)}$ is the aggregated subgradient computed out of the current bundle. This is relevant in that $\bm{w}_{t}$ is a $\sigma_{t}$-subgradient of $\phi$ at $\bar{\bm{\pi}}$, where $\sigma_{t}=\sum_{(\bm{g}_{i},\alpha_{i})\in\beta_{t}}\alpha_{i}\theta_{i}^{(t)}$ is the aggregated linearization error. For the case $\Pi=\mathbb{R}^{m}_{+}$, $\bm{d}^{(t)}$ is basically $-\bm{w}_{t}$ projected over the set of the active non-negativity constraints $\bm{\pi}\geq 0$; we refer to [40, Chapter 4] for further details. Often, solving the Dual Master problem (4) is more convenient than solving the primal, due to the existence of ad-hoc approaches [23]. The complete algorithm is described in Algorithm 1.
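As an illustration, the following sketch (ours, not the authors' implementation) solves a small instance of the Dual Master problem (4) with an off-the-shelf solver and recovers the aggregated subgradient, the aggregated linearization error, and the direction; the names G, alpha, and eta are our own.

```python
import numpy as np
from scipy.optimize import minimize

def solve_dual_mp(G, alpha, eta):
    """G: (n, m) matrix whose rows are the subgradients g_i;
    alpha: (n,) linearization errors; eta: regularization parameter."""
    n = G.shape[0]
    obj = lambda th: 0.5 * eta * np.sum((th @ G) ** 2) + alpha @ th
    cons = ({'type': 'eq', 'fun': lambda th: th.sum() - 1.0},)
    res = minimize(obj, np.full(n, 1.0 / n), method='SLSQP',
                   bounds=[(0.0, 1.0)] * n, constraints=cons)
    theta = res.x
    w = theta @ G            # aggregated subgradient w_t
    sigma = alpha @ theta    # aggregated linearization error sigma_t
    return theta, w, sigma

rng = np.random.default_rng(0)
G, alpha, eta = rng.normal(size=(4, 6)), rng.random(4), 0.5
theta, w, sigma = solve_dual_mp(G, alpha, eta)
d = -eta * w                 # optimal direction of the primal MP (3)
```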

Algorithm 1 Bundle method pseudo-code
1: Choose $\bm{\pi}_{0},\eta_{0},m$
2: $\bar{\bm{\pi}}_{0}\leftarrow\bm{\pi}_{0}$ \triangleright The initial stabilization point coincides with the initial point
3: $(\bm{g}_{0},\alpha_{0})\leftarrow(\partial\phi(\bm{\pi}_{0}),0)$; $\beta_{0}\leftarrow\{(\bm{g}_{0},\alpha_{0})\}$; $t\leftarrow 0$
4: while stopping criterion not met do
5:   $(\bm{d}^{(t)},\bm{\theta}^{(t)})\leftarrow$ Solve $MP(\eta_{t},\beta_{t})$ \triangleright Solve the Master problem (3)–(4)
6:   $\bm{\pi}_{t+1}\leftarrow\bar{\bm{\pi}}_{t}+\bm{d}^{(t)}$ \triangleright Compute the new trial point
7:   $(\bm{g}_{t+1},\alpha_{t+1})\leftarrow(\partial\phi(\bm{\pi}_{t+1}),\,\phi(\bar{\bm{\pi}}_{t})-\phi(\bm{\pi}_{t+1})-\bm{g}_{t+1}^{\top}(\bar{\bm{\pi}}_{t}-\bm{\pi}_{t+1}))$ \triangleright Compute the new subgradient and linearization error
8:   $\beta_{t+1}\leftarrow\beta_{t}\cup\{(\bm{g}_{t+1},\alpha_{t+1})\}$ \triangleright Update the Bundle
9:   if $\phi(\bm{\pi}_{t+1})-\phi(\bar{\bm{\pi}}_{t})\leq m\,(\phi_{\beta_{t}}(\bm{\pi}_{t+1})-\phi(\bar{\bm{\pi}}_{t}))$ then
10:    $\bar{\bm{\pi}}_{t+1}\leftarrow\bm{\pi}_{t+1}$; Update $\alpha_{i}\;\forall(\bm{g}_{i},\alpha_{i})\in\beta_{t+1}$; Update $\eta_{t}$ \triangleright Serious Step
11:   else
12:    $\bar{\bm{\pi}}_{t+1}\leftarrow\bar{\bm{\pi}}_{t}$; Update $\eta_{t}$ \triangleright Null Step
13:   end if
14:   $\beta_{t+1}\leftarrow$ Remove outdated components from $\beta_{t+1}$; $t\leftarrow t+1$
15: end while

The next trial point $\bm{\pi}_{t+1}$ is computed by making a step of $\eta_{t}$ along the direction, which is either the opposite of the aggregated subgradient $\bm{w}_{t}$ or easily obtained out of it. Computing $\phi(\bm{\pi}_{t+1})$ and its subgradient $\bm{g}_{t+1}\in\partial\phi(\bm{\pi}_{t+1})$ enriches the bundle. The stabilization center is changed to $\bm{\pi}_{t+1}$, called a Serious Step, if a “substantial decrease” is obtained, measured with respect to the descent $\phi_{\beta_{t}}(\bm{\pi}_{t+1})$ predicted by the model, with $m<1$ a fixed parameter (usually “small”, say $0.001$). Otherwise, i.e., if no (significant) progress is made, the stabilization center is not changed, which is referred to as a Null Step (NS); yet, new information has been added to the bundle, which will lead to a different direction $\bm{d}^{(t+1)}$ in the next iteration. $\eta_{t}$ can (although it does not necessarily need to) also be adjusted, typically in different ways depending on the type of step taken. The stopping criterion in line 4 can be $\eta^{*}\|\bm{w}_{t}\|_{2}^{2}+\sigma_{t}\leq\epsilon\max\{0,\phi(\bar{\bm{\pi}}_{t})\}$, where $\eta^{*}$ is “large w.r.t. $\eta_{t}$”; this corresponds to both $\bm{w}_{t}\approx 0$ and $\sigma_{t}\approx 0$, proving that $\bar{\bm{\pi}}_{t}$ is an approximate minimum. At the end of each iteration, efficient implementations of the Bundle method may delete some old pairs from $\beta$ to lessen the Master problem cost. A common strategy is removing components that have not been used in the (dual) optimal solution over the last $K$ iterations (a quite conservative strategy takes $K\sim 20$ [25]).
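For concreteness, here is a minimal sketch of the serious/null step test and of the stopping criterion just described; m_par, eta_star, and eps are illustrative values, not the paper's settings.

```python
import numpy as np

def step_type(phi_new, phi_center, model_new, m_par=0.001):
    """Serious step if the actual decrease is at least a fraction m_par
    of the decrease predicted by the cutting-plane model."""
    if phi_new - phi_center <= m_par * (model_new - phi_center):
        return "serious"
    return "null"

def should_stop(w, sigma, phi_center, eta_star=1e4, eps=1e-6):
    # both w ~ 0 and sigma ~ 0 certify an approximate minimum at the center
    return eta_star * np.dot(w, w) + sigma <= eps * max(0.0, phi_center)
```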

2 Machine-Learning Based Bundle Method

The approach presented in this work is heavily inspired by Algorithm 1, but with two major differences:

  1. we replace the solution of the Master problem, which is a convex quadratic program that may become rather costly to solve as the size of the bundle grows, with the prediction of a neural network; and

  2. we decouple the role of $\eta$ in the formation of $\bm{d}$, via the weights $\bm{\theta}$, from its role as a stepsize along it, by again predicting its value with a neural network.

To this effect, we need to parametrize neural networks so that our two predictors, mapping input instances and partial bundles to the convex multipliers $\bm{\theta}$ and the stepsize $\eta$ respectively, are accurate. This parametrization is data-driven: we find the parameters that minimize a loss function representing the value computed by the Bundle Network over a set of examples.

2.1 Learning Framework

Let $\mathcal{D}$ be a dataset of instances $\iota$, and denote by $\bm{\pi}_{0}\equiv\bm{\pi}_{0,\iota}$ the given initial point for the associated instance, possibly equal to zero. To mathematically define the learning problem, the number of iterations $T$ is fixed during the training process. Once the model is trained, the number of iterations is no longer a consideration at inference time [44]. In Section 2.3 we describe in detail the model $\bm{\pi}_{T}(\bm{W})\equiv\bm{\pi}_{T}(\bm{W};\bm{\pi}_{0},\iota)$, where $\bm{W}$ are the parameters. We may consider finding the parameters $\bm{W}$ that, on average over instances drawn from $\mathcal{D}$, minimize the loss defined by just the objective value $\mathcal{L}(\bm{W})=\phi(\bm{\pi}_{T}(\bm{W}))$ after $T$ iterations:

\min_{\bm{W}}\;\mathbb{E}_{\iota\sim\mathcal{D}}[\mathcal{L}(\bm{W})]\;.

However, Learning-to-Optimize approaches generally consider the whole trajectory, since this encourages optimizing the contribution of each step. Following this approach, we rather aim at minimizing the function

\mathcal{L}_{\bm{\gamma}}(\bm{W})=\sum_{t=1}^{T}\gamma_{t}\,\phi(\bm{\pi}_{t}(\bm{W})) \qquad (5)

where $\bm{\pi}_{t}(\bm{W})$ denotes the trial point at iteration $t$, which is computed by a neural network. We use $\gamma_{t}=\gamma^{T-t}$, where $0<\gamma\leq 1$ is a hyperparameter. Hence, the learning problem can be formulated as

\min_{\bm{W}}\;\mathbb{E}_{\iota\sim\mathcal{D}}[\mathcal{L}_{\bm{\gamma}}(\bm{W})]\;. \qquad (6)

In the next sections, we specify how to differentiate the loss function using the chain rule, and the structure of the model $\bm{\pi}_{t}(\bm{W})$ used as a machine learning surrogate of the Master problem in the Bundle method.

2.2 Loss Function

We want to compute a subgradient $\partial_{\bm{W}}\mathcal{L}_{\bm{\gamma}}$ of the loss with respect to the network parameters. $\mathcal{L}_{\bm{\gamma}}(\bm{W})$ is a linear combination of the function values $\phi$ evaluated at different points along the trajectory: consequently, its subgradient with respect to $\bm{W}$ can be written as the weighted sum of subgradients of the function $\phi$ computed at different points. Thus we only need to compute subgradients $\partial_{\bm{W}}\phi(\bm{\pi}_{t}(\bm{W}))$ for each $t\leq T$.

We apply the chain rule to compute $\partial_{\bm{W}}\phi(\bm{\pi}_{t}(\bm{W}))$ as $\partial_{\bm{\pi}_{t}}\phi(\bm{\pi}_{t}(\bm{W}))\cdot\partial_{\bm{W}}\bm{\pi}_{t}(\bm{W})$. Note that $\partial_{\bm{\pi}_{t}}\phi(\bm{\pi}_{t}(\bm{W}))$ is the subgradient of $\phi$ at $\bm{\pi}_{t}(\bm{W})$, which exists since $\phi$ is convex, and in fact corresponds to the $\bm{g}_{t}$ used in the standard version of the Bundle method. As we construct $\bm{\pi}_{t}(\bm{W})$ to be differentiable, $\partial_{\bm{W}}\bm{\pi}_{t}(\bm{W})$ corresponds to the gradient $\nabla_{\bm{W}}\bm{\pi}_{t}(\bm{W})$, which can be obtained using automatic differentiation techniques. The algorithm structure for this computation is discussed in detail in the next section.

Putting it all together, we obtain the following gradient approximation:

\partial_{\bm{W}}\mathcal{L}_{\bm{\gamma}}(\bm{W})=\sum^{T}_{t=1}\gamma^{T-t}\,\bm{g}_{t}^{\top}\nabla_{\bm{W}}\bm{\pi}_{t}(\bm{W})\;.
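In an automatic-differentiation framework this gradient can be obtained through a surrogate loss in which each $\bm{g}_{t}$ is treated as a constant. The sketch below is our PyTorch illustration, not the paper's code: phi stands in as a differentiable scalar function (in the paper, $\bm{g}_{t}$ comes from the Lagrangian subproblem oracle), and the surrogate back-propagates exactly the weighted sum above.

```python
import torch

def unrolled_surrogate_loss(pis, phi, gamma=0.9):
    """pis: list of trial points pi_t(W) produced by the unrolled model,
    each part of the computation graph of the parameters W."""
    T = len(pis)
    loss = 0.0
    for t, pi in enumerate(pis, start=1):
        # local subgradient g_t = d phi / d pi_t, detached from W
        g = torch.autograd.grad(phi(pi), pi, retain_graph=True)[0].detach()
        # (g * pi).sum() back-propagates gamma^{T-t} g_t^T (d pi_t / d W)
        loss = loss + gamma ** (T - t) * (g * pi).sum()
    return loss  # loss.backward() accumulates the gradient approximation
```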

2.3 Unrolled Model

In this section, we provide further details on the structure of the machine learning model used to compute the trial point $\bm{\pi}_{t}(\bm{W})$ at iteration $t$. We keep the structure of the Bundle method mostly unchanged, modifying only some operations in order to make them differentiable: the choice of the step size, the solution of Problem (4), and the update of the stabilization center. The result is presented in Algorithm 3: since it only employs differentiable operations, we can use standard automatic differentiation tools.

At each iteration, we need to provide a feature vector to the neural network. We do not back-propagate through the feature extraction, which is made using hand-crafted features described in detail in Appendix D. These features are fed as input to a neural network, which outputs a vector $\bm{\theta}^{(t)}\in\Delta^{t}=\{\bm{\theta}\in[0,1]^{t+1}\,|\,\sum_{i=1}^{t+1}\theta_{i}=1\}$ in the $t$-dimensional simplex and a step size $\eta_{t}$. More details on the network architecture can be found in Section 3.

In the standard Bundle method, the stabilization center is updated if the condition $\phi(\bm{\pi}_{t+1})<\phi(\bar{\bm{\pi}}_{t})$ is satisfied. However, this is a piecewise-constant operation, and therefore has zero derivative almost everywhere. Denoting by $\bm{r}=\arg\min(\phi(\bm{\pi}_{t+1}),\phi(\bar{\bm{\pi}}_{t}))$ the one-hot encoding equal to $(1,0)$ if $\phi(\bm{\pi}_{t+1})\leq\phi(\bar{\bm{\pi}}_{t})$ and $(0,1)$ otherwise, the update of the stabilization point can be rewritten as

\bm{r}^{\top}(\bm{\pi}_{t+1},\bar{\bm{\pi}}_{t})=r_{1}\bm{\pi}_{t+1}+r_{2}\bar{\bm{\pi}}_{t}\;.

This corresponds to choosing a vertex of the two-dimensional simplex. A common strategy in machine learning to obtain smoother approximations of the $\arg\min$ operator is to replace it with a soft version, the softmin [29, pp. 180-184]. This leads to considering

\bm{r}=\mathbf{softmin}(\phi(\bm{\pi}_{t+1}),\phi(\bar{\bm{\pi}}_{t}))

and the updates can be performed as previously, i.e., choosing a possibly different convex combination of the two vectors $\bm{\pi}_{t+1}$ and $\bar{\bm{\pi}}_{t}$, which will not necessarily be a vertex of the two-dimensional simplex, but a point inside it.
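A minimal sketch of this soft update follows (our illustration; the temperature tau is our assumption, the paper uses the plain softmin):

```python
import torch

def soft_center_update(pi_new, pi_bar, phi_new, phi_bar, tau=1.0):
    """phi_new, phi_bar: 0-dim tensors; pi_new, pi_bar: candidate points."""
    # softmin over the two values: the smaller phi gets the larger weight,
    # and the resulting combination stays differentiable
    r = torch.softmax(-torch.stack([phi_new, phi_bar]) / tau, dim=0)
    return r[0] * pi_new + r[1] * pi_bar   # a point inside the 2-simplex
```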

It is important to notice that we do not differentiate through the operation computing the subgradient in line 11, as obtaining exact derivatives would require the function $\phi$ to be twice differentiable. It is possible to approximate this contribution, but this is not explicitly considered in this work; this strategy has been commonly pursued in other works, for example [2]. In other words, the bundle information fed to the neural network is considered as a constant, even though it is produced by a gradient computation. This is similar to the way recurrent language models do not backpropagate through the previously generated words, but rather through the previous latent state [3].

Algorithm 3 Bundle Network pseudo-code
1: Choose $\bm{\pi}_{0},T$
2: $\bar{\bm{\pi}}_{0}\leftarrow\bm{\pi}_{0}$ \triangleright Initialize stabilization point
3: $(\bm{g}_{0},\alpha_{0},v_{0})\leftarrow(\partial\phi(\bm{\pi}_{0}),0,\phi(\bm{\pi}_{0}))$
4: $\beta_{0}\leftarrow\{(\bm{g}_{0},\alpha_{0})\}$
5: for $t=1,\cdots,T$ do
6:   $\varphi_{t}\leftarrow$ features_extraction$(\beta_{t})$ \triangleright We do not back-propagate through this operation
7:   $\eta_{t},\bm{\delta}^{(t)}\leftarrow nn(\varphi_{t})$ \triangleright Output of a neural network
8:   $\bm{\theta}^{(t)}\leftarrow\psi(\bm{\delta}^{(t)})$ \triangleright Approximate solution of the Dual Master problem (4)
9:   $\bm{w}^{(t)}\leftarrow\sum_{i=0}^{|\beta_{t}|}\bm{g}_{i}\theta_{i}^{(t)}$ \triangleright New trial direction
10:   $\bm{\pi}_{t+1}\leftarrow\sigma(\bar{\bm{\pi}}_{t}+\eta_{t}\bm{w}^{(t)})$ \triangleright Compute new trial point
11:   $(\bm{g}_{t+1},\alpha_{t+1},v_{t+1})\leftarrow(\partial\phi(\bm{\pi}_{t+1}),\,\phi(\bar{\bm{\pi}}_{t})-\phi(\bm{\pi}_{t+1})-\bm{g}_{t+1}^{\top}(\bar{\bm{\pi}}_{t}-\bm{\pi}_{t+1}),\,\phi(\bm{\pi}_{t+1}))$ \triangleright Subgradient and linearization error
12:   $\beta_{t+1}\leftarrow\beta_{t}\cup\{(\bm{g}_{t+1},\alpha_{t+1},v_{t+1})\}$ \triangleright Update Bundle
13:   $\bar{\bm{\pi}}_{t+1}\leftarrow\mathbf{softmin}(\phi(\bm{\pi}_{t+1}),\phi(\bar{\bm{\pi}}_{t}))\odot(\bm{\pi}_{t+1},\bar{\bm{\pi}}_{t})$ \triangleright Smooth update of the stabilization point
14:   Update $\alpha_{i}\;\forall i=0,\cdots,|\beta_{t}|$ \triangleright Update the linearization errors
15: end for
Algorithm 3: Machine learning-based version of the Bundle method (Algorithm 1) that substitutes the resolution of the Master problem and the non-continuous operations with smoother versions based on neural networks.

2.3.1 Step and Direction Architecture

This section provides further details on the architecture of the machine learning model, schematically presented in Figure 1, which is used to predict $\eta_{t}$ and $\bm{\theta}^{(t)}$.

Given the features $\varphi_{t}\in\mathbb{R}^{d}$ at the current bundle $\beta_{t}$, a Recurrent Neural Network [13, 11, 12], specifically a Long Short-Term Memory (LSTM [31]) layer, maps this vector to a hidden space by predicting six vectors of size $h$. We denote these vectors as $\bm{\mu}_{q_{t}},\bm{\sigma}_{q_{t}},\bm{\mu}_{k_{t}},\bm{\sigma}_{k_{t}},\bm{\mu}_{\eta_{t}},\bm{\sigma}_{\eta_{t}}\in\mathbb{R}^{h}$; they represent the means and variances used to extract, by sampling from a Gaussian distribution, the hidden representations of the keys and queries that represent the current iteration, as in attention mechanisms [59], as well as the hidden representation for $\eta_{t}$. More precisely, to obtain the hidden representations $\bm{h}_{q_{t}},\bm{h}_{k_{t}},\bm{h}_{\eta_{t}}\in\mathbb{R}^{h}$, we consider a Gaussian distribution, so we can approximate the expectation by sampling and use the reparameterization trick [36, 52] to perform standard backpropagation. The size of the hidden representations and other hyperparameters of the network are discussed in Appendix D.

The sampled representations are then fed into three independent Multi-Layer Perceptrons that predict a scalar $\eta_{t}$ and two vectors $\bm{q}_{t}$ and $\bm{k}_{t}$. The vectors $\bm{q}_{t}$ and $\bm{k}_{t}$ are, respectively, the query and the key of the current iteration. The vector $\bm{k}_{t}$ is stored in memory for use in subsequent iterations. Using $\bm{q}_{t}$ and all the vectors $\{\bm{k}_{i}\}_{i=0}^{|\beta_{t}|}$, we can compute a scalar for each component in the bundle using an Attention Layer defined as follows:

\bm{\delta}^{(t)}=(\bm{k}_{i}^{\top}\bm{q}_{t})_{i=0}^{|\beta_{t}|}\in\mathbb{R}^{t}\;.

To obtain the approximate solution of the Dual Master problem (4), we pass this vector to a normalization function $\psi$ that takes as input a vector and returns one of the same dimension, with all components between zero and one and summing to one:

\bm{\theta}^{(t)}=\psi(\bm{\delta}^{(t)})\in\Delta^{t}\;.

This yields a vector $\bm{\theta}^{(t)}$ whose components lie in $[0,1]$ and sum to one, like the solution of Problem (4) considered in the classic Bundle method. Examples of $\psi$ are the softmax and sparsemax [46] functions.
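The following sketch summarizes the decoder step under our own simplifications (the LSTM and the three MLPs are folded away, and a log-variance parameterization of the Gaussian is our assumption): reparameterized sampling, attention scores against the stored keys, and the normalization $\psi$.

```python
import torch

def sample(mu, log_sigma):
    # reparameterization trick: h = mu + sigma * eps, with eps ~ N(0, I)
    return mu + torch.exp(log_sigma) * torch.randn_like(mu)

def decode_step(mu_q, ls_q, mu_k, ls_k, keys):
    """keys: (n, h) tensor of keys k_i stored at previous iterations."""
    q_t = sample(mu_q, ls_q)                      # query of the current iteration
    k_t = sample(mu_k, ls_k)                      # key, kept for later iterations
    K = torch.cat([keys, k_t.unsqueeze(0)], dim=0)
    delta = K @ q_t                               # one score per bundle component
    theta = torch.softmax(delta, dim=0)           # psi: lands on the simplex
    return theta, k_t
```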

Figure 1: Schematic representation of the architecture used to predict the step-size ηt\eta_{t} and the weights 𝜽(t){\bm{\theta}}^{(t)} for the convex combination of the gradients in the bundle at the current iteration.

3 Evaluation

In this section, we present the main numerical results of our work. We test our method on the example case of Lagrangian relaxation, which we now describe in further detail.

3.1 Lagrangian Relaxation

Let $\iota$ be a Mixed Integer Linear Program [14, Chap. 8] of the form:

(P)\qquad\min\quad \bm{w}^{\top}\bm{x} \qquad (7a)
\text{s.t.}\quad \bm{A}\bm{x}=\bm{b} \qquad (7b)
\phantom{\text{s.t.}\quad} \bm{C}\bm{x}\geq\bm{d}\;,\;\;\bm{x}\in\mathbb{R}_{+}^{n_{1}}\times\mathbb{N}^{n_{2}} \qquad (7c)

The Lagrangian subproblem is obtained by dualizing a family of constraints. Here we relax the constraints defined by the equalities (7b), and penalize their violation with Lagrangian multipliers (LMs) $\bm{\pi}\in\mathbb{R}^{m}$:

(LR(\bm{\pi}))\qquad\min\quad \bm{w}^{\top}\bm{x}+\bm{\pi}^{\top}(\bm{b}-\bm{A}\bm{x})
\text{s.t.}\quad \bm{C}\bm{x}\geq\bm{d}\;,\;\;\bm{x}\in\mathbb{R}_{+}^{n_{1}}\times\mathbb{N}^{n_{2}}

Weak Lagrangian duality ensures that $LR(\bm{\pi})$ provides a lower bound for $(P)$. The goal of the Lagrangian Dual problem is to determine the best such bound:

(LD)\qquad\max_{\bm{\pi}}\;LR(\bm{\pi})\;. \qquad (8)

Lagrangian relaxation is particularly valuable when removing certain constraints results in a Lagrangian subproblem $LR(\bm{\pi})$ that can be solved efficiently; it is therefore crucial to carefully select the set of constraints to dualize. Indeed, this choice involves a trade-off between speed and the quality of the resulting bound, as will be discussed later, due to the No Free Lunch Theorem [28], which proves that, in order to achieve better bounds than the continuous relaxation, the Lagrangian subproblem should be “harder” than a Linear problem. The Lagrangian Dual problem (8) is a concave maximization problem, but it is trivial to reformulate it as a convex minimization problem [56]. Relaxing inequality constraints $\bm{A}\bm{x}\geq\bm{b}$ instead of equality ones leads to nonnegative multipliers $\bm{\pi}\in\mathbb{R}_{+}^{m}$. To ensure that we predict only nonnegative multipliers, we substitute line 10 of Algorithm 3 with

\bm{\pi}_{t+1}=\sigma(\bar{\bm{\pi}}_{t}+\eta_{t}\bm{w}^{(t)})

where $\sigma$ is a component-wise non-negative activation function such as ReLU.
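A one-line sketch of this modified update (our illustration):

```python
import torch

def trial_point_nonneg(pi_bar, eta_t, w_t):
    # sigma = ReLU keeps the predicted multipliers componentwise nonnegative
    return torch.relu(pi_bar + eta_t * w_t)
```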

3.2 Experiments

In this section, we compare our approach to classical iterative methods for solving the Lagrangian Dual, specifically those based on subgradient information. In particular, we evaluate the performance of the proposed approach against both simple subgradient-based methods and the classical Bundle method, using different heuristic $\eta$-strategies.

We consider ADAM [35] and Gradient Ascent with a simple strategy for regularizing the step size: if more than two iterations pass without improving the objective value, we halve the step size. These two baselines are well-established in the context of convex optimization and serve as relevant points of comparison. They are of particular interest because they are typically faster than the Bundle method: Gradient Ascent relies only on the subgradient at the last visited point, while ADAM relies on the subgradient at the last visited point and the previous search direction.
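A sketch of the baselines' step-size rule as we read it (variable names are ours):

```python
def regularized_step(step, iters_without_improvement):
    # halve the step size after more than two non-improving iterations
    return step / 2.0 if iters_without_improvement > 2 else step
```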

We further compare our approach with classic variants of the Bundle method, considering the four long-term $\eta$-strategies presented in Appendix A, inspired by the ones used in the state-of-the-art bundle-based BundleSolver component of SMS++ [26]. With the non-constant ones, we also consider the middle-term and short-term $\eta$-strategies, increasing the parameter as $\eta_{t+1}=1.1\,\eta_{t}$ and decreasing it as $\eta_{t+1}=0.9\,\eta_{t}$. For all six classic approaches, we perform a grid search to choose the best initial parameter $\eta_{0}$, evaluating the values $10^{k}$ for $k\in\{4,3,2,1,0,-1\}$. For each fixed maximum number of iterations ($10$, $25$, $50$, and $100$), we select the value of $\eta_{0}$ that achieves the lowest average GAP over the test set. The results of this grid search are summarized in Table 14.

The objective values of $\phi$ are too large for a learning method, so we rescale the objective $\phi$ by dividing the function value by the norm $\sqrt{\bm{g}^{\top}_{0}\bm{g}_{0}}$ of the subgradient $\bm{g}_{0}$ at the starting point $\bm{\pi}_{0}$, always taken equal to the zero vector. Even though the objective values change, the optima of the original function $\phi$ and of its scaled version coincide.
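A sketch of this rescaling (our illustration):

```python
import torch

def scaled_objective(phi, g0):
    nrm = torch.linalg.norm(g0)       # sqrt(g0^T g0), with g0 taken at pi_0 = 0
    return lambda pi: phi(pi) / nrm   # same optima as the original phi
```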

We consider the Lagrangian Dual of the Multi-Commodity Network Design problem (MC), presented in Appendix B, and the Lagrangian Dual of the Generalized Assignment problem (GA), presented in Appendix C. For MC, we consider a Lagrangian relaxation in which we dualize equality constraints, while for GA we dualize inequality constraints, applying the modifications previously discussed. For both MC and GA, we use the datasets considered in [19].

For each instance $\iota$ on which we evaluate our model, the Lagrangian relaxation is solved using SMS++. The obtained Lagrangian multipliers, denoted by $\bm{\pi}^{*}_{\iota}$, are used for computing the GAP. We emphasize that we do not need these multipliers to perform the training, but only to provide a metric for our approach. We measure the quality of the bounds computed by the different systems, averaged over a dataset of instances $\mathcal{I}$, by means of the percentage GAP

100\times\frac{1}{|\mathcal{I}|}\sum_{\iota\in\mathcal{I}}\frac{LR(\bm{\pi}^{*}_{\iota})-B_{\iota}}{LR(\bm{\pi}^{*}_{\iota})}\;,

where $B_{\iota}$ is the bound returned for instance $\iota$.
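A sketch of this metric (our illustration, with arrays aligned by instance):

```python
import numpy as np

def percentage_gap(lr_star, bounds):
    # lr_star[i] = LR(pi*_iota), bounds[i] = B_iota for instance i
    return 100.0 * np.mean((lr_star - bounds) / lr_star)
```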

3.3 Bound Accuracy

Table 1 reports the performance of the different systems on our datasets. For the heuristic strategies that do not use machine learning, we perform a grid search for the initialization of the step-size/regularization parameter, with values between $10^{4}$ and $10^{-1}$. For each fixed number of iterations ($10$, $25$, $50$ and $100$), we choose the best $\eta_{0}$ among the values in the grid. From Table 1 we can see that Bundle Network outperforms the other techniques in terms of GAP for a fixed number of iterations on the different datasets.

Table 1: Comparison of our approach and different baselines. The baselines' starting regularization parameter/step size is chosen using a grid search, while our approach does not need hyperparameter tuning. Ours is trained with only $10$ unrolled iterations.
Methods 10 iter. 25 iter. 50 iter. 100 iter.
GAP time GAP time GAP time GAP time

MC-Sml-40

Bundle h. 20.18 0.044 6.68 0.117 2.84 0.269 1.17 0.767
Bundle b. 17.68 0.054 6.09 0.145 1.84 0.327 0.41 0.898
Bundle s. 17.58 0.044 6.10 0.116 1.82 0.271 0.36 0.778
Bundle c. 11.96 0.048 4.16 0.128 1.30 0.293 0.31 0.816
Descent 44.66 0.008 14.11 0.021 4.70 0.044 2.24 0.093
Adam 48.87 0.011 12.84 0.027 3.04 0.055 1.89 0.109
Bundle n. 9.20 0.105 1.75 0.185 0.53 0.321 0.17 0.592

MC-Sml-Var

Bundle h. 18.27 0.094 9.55 0.254 4.08 0.564 1.96 1.433
Bundle b. 15.83 0.124 7.39 0.334 2.88 0.736 0.74 1.749
Bundle s. 15.82 0.094 6.92 0.256 2.58 0.576 0.67 1.421
Bundle c. 12.79 0.108 4.33 0.29 1.66 0.635 0.58 1.503
Descent 62.43 0.025 28.64 0.066 11.14 0.142 4.00 0.303
Adam 51.48 0.034 15.42 0.086 4.74 0.17 2.75 0.341
Bundle n. 12.60 0.138 3.28 0.269 1.24 0.496 0.50 0.952

MC-Big-40

Bundle h. 20.66 0.07 8.54 0.185 5.32 0.410 2.88 0.986
Bundle b. 20.67 0.084 7.37 0.222 3.17 0.485 1.24 1.209
Bundle s. 20.67 0.07 7.37 0.184 3.17 0.407 1.19 1.042
Bundle c. 19.34 0.076 5.64 0.203 2.08 0.447 0.78 1.110
Descent 28.53 0.023 14.12 0.058 9.91 0.117 8.21 0.241
Adam 24.89 0.025 9.38 0.064 7.30 0.129 6.90 0.256
Bundle n. 12.70 0.130 3.41 0.243 1.26 0.436 0.47 0.832

MC-Big-Var

Bundle h. 26.23 0.095 9.99 0.248 4.41 0.550 2.23 1.386
Bundle b. 21.82 0.123 10.01 0.324 3.46 0.695 1.08 1.652
Bundle s. 21.83 0.094 9.53 0.253 3.43 0.550 1.02 1.353
Bundle c. 17.98 0.112 6.34 0.289 2.66 0.627 0.93 1.477
Descent 54.11 0.029 24.69 0.074 10.58 0.154 4.85 0.323
Adam 44.21 0.034 13.75 0.086 5.42 0.170 3.84 0.345
Bundle n. 20.10 0.139 5.11 0.271 2.01 0.495 0.70 0.953

GA-10-100

Bundle h. 0.2333 0.070 0.028 0.241 0.0016 0.657 0.0006 1.596
Bundle b. 0.2333 0.073 0.028 0.249 0.0016 0.684 0.0006 1.655
Bundle s. 0.3055 0.065 0.0156 0.211 0.0026 0.553 0.0024 1.283
Bundle c. 0.1893 0.071 0.0157 0.235 0.0014 0.601 0.0011 1.383
Descent 0.8799 0.013 0.2091 0.032 0.0644 0.063 0.0364 0.127
Adam 0.7234 0.012 0.1843 0.029 0.0183 0.059 0.0048 0.119
Bundle n. 0.1484 0.104 0.0228 0.177 0.0047 0.304 0.0009 0.551

GA-20-400

Bundle h. 0.157 0.445 0.0343 1.548 0.0088 4.528 0.0021 15.099
Bundle b. 0.157 0.463 0.0343 1.568 0.0088 4.539 0.0021 15.158
Bundle s. 0.3542 0.431 0.0284 1.417 0.0048 4.256 0.0028 14.145
Bundle c. 0.1129 0.433 0.0193 1.459 0.0052 4.264 0.0020 14.459
Descent 0.5815 0.205 0.2971 0.5 0.0753 0.98 0.0255 1.937
Adam 0.6598 0.177 0.0831 0.444 0.016 0.887 0.0090 1.775
Bundle n. 0.1018 0.228 0.0190 0.498 0.0052 0.943 0.0014 1.837

However, Figures 2-5 show that Bundle Network is slower than the other approaches in the first iterations; these figures suggest that the initialization time can be improved. These times may also depend on transfers of vectors between CPU and GPU.

From Table 1 we can also see that the constant $\eta$-strategy almost always provides the best performance among the deterministic $\eta$-strategies. This is a well-known characteristic of the Bundle method: $100$ iterations is a relatively small budget for an aggregated Bundle method, making dynamic $\eta$ tuning less effective than a constant strategy whose $\eta_{0}$ is obtained by grid search. This is because the Bundle method operates in distinct phases: initially, its goal is to identify promising directions for improvement, and later it focuses on collecting high-quality subgradients near the optimum to certify optimality. This raises an interesting question about using multiple models to capture this behavior, which is beyond the scope of this work.

Nevertheless, our model provides high-quality predictions that generalize well to a larger number of iterations than the $10$ used during training. In many cases, Bundle Network attains, within a given number of iterations, lower gaps than all the baselines; even when its predictions are not the best option, they lead to gaps similar to those of the best-performing algorithm.

Figure 2: GAP (in logarithmic scale) during the execution of our trained neural network and the different baselines, in terms of iterations (left) and time (right), on the MC-Sml-40 dataset.

Figure 3: GAP (in logarithmic scale) during the execution of our trained neural network and the different baselines, in terms of iterations (left) and time (right), on the MC-Sml-Var dataset.

Figure 4: GAP (in logarithmic scale) during the execution of our trained neural network and the different baselines, in terms of iterations (left) and time (right), on the MC-Big-40 dataset.

Figure 5: GAP (in logarithmic scale) during the execution of our trained neural network and the different baselines, in terms of iterations (left) and time (right), on the MC-Big-Var dataset.

Figure 6: GAP (in logarithmic scale) during the execution of our trained neural network and the different baselines, in terms of iterations (left) and time (right), on the GA-10-100 dataset.

Figure 7: GAP (in logarithmic scale) during the execution of our trained neural network and the different baselines, in terms of iterations (left) and time (right), on the GA-20-400 dataset.

3.4 Sampling or Not Sampling

Table 2: Comparison of softmax and sparsemax, with or without using the sample strategy, or using the sample strategy only for t, on MC-Sml-40
Activation Sample 10 iter. 25 iter. 50 iter. 100 iter.
t q k GAP time GAP time GAP time GAP time
softmax 9.20 0.105 1.75 0.185 0.53 0.321 0.17 0.592
softmax x 9.78 0.104 1.85 0.187 0.56 0.321 0.17 0.592
softmax x x 10.41 0.105 1.99 0.185 0.60 0.318 0.18 0.588
sparsemax 58.3 0.125 57.93 0.214 57.47 0.361 57.29 0.661
sparsemax x 13.16 0.124 2.61 0.212 0.80 0.359 0.24 0.655
sparsemax x x 10.51 0.126 2.07 0.214 0.60 0.363 0.18 0.662
Table 3: Comparison of softmax and sparsemax, with or without using the sample strategy, or using the sample strategy only for t on MC-Big-40
Activation Sample 10 iter. 25 iter. 50 iter. 100 iter.
t q k GAP time GAP time GAP time GAP time
softmax 12.70 0.130 3.41 0.243 1.26 0.436 0.47 0.832
softmax x 12.85 0.131 3.34 0.244 1.22 0.436 0.45 0.831
softmax x x 12.88 0.131 3.36 0.244 1.26 0.436 0.48 0.834
sparsemax 38.65 0.150 24.00 0.270 10.34 0.475 3.39 0.898
sparsemax x 39.40 0.150 32.37 0.270 17.76 0.474 1.58 0.896
sparsemax x x 13.27 0.150 3.50 0.272 1.40 0.479 0.51 0.905
Table 4: Comparison of softmax and sparsemax, with or without using the sample strategy or using the sample strategy only for t on MC-Sml-Var
Activation Sample 10 iter. 25 iter. 50 iter. 100 iter.
t q k GAP time GAP time GAP time GAP time
softmax 12.60 0.138 3.28 0.269 1.24 0.496 0.50 0.952
softmax x 13.03 0.137 3.27 0.268 1.26 0.492 0.51 0.948
softmax x x 14.68 0.139 3.94 0.269 1.55 0.494 0.62 0.957
sparsemax 30.23 0.157 15.96 0.293 12.45 0.527 11.05 1.006
sparsemax x 56.83 0.160 56.03 0.300 56.00 0.538 56.00 1.033
sparsemax x x 14.02 0.158 3.80 0.298 1.56 0.531 0.64 1.007
Table 5: Comparison of softmax and sparsemax, with or without using the sample strategy or using the sample strategy only for t on MC-Big-Var
Activation Sample 10 iter. 25 iter. 50 iter. 100 iter.
t q k GAP time GAP time GAP time GAP time
softmax 20.10 0.139 5.11 0.271 2.01 0.495 0.70 0.953
softmax x 16.79 0.141 4.78 0.273 1.68 0.499 0.63 0.956
softmax x x 16.66 0.139 4.63 0.270 1.80 0.495 0.67 0.955
sparsemax 42.84 0.160 32.39 0.301 29.68 0.541 23.25 1.031
sparsemax x 30.20 0.162 12.33 0.305 6.25 0.542 3.78 1.033
sparsemax x x 18.94 0.162 4.24 0.305 1.47 0.548 0.61 1.045
Table 6: Comparison of softmax and sparsemax, with or without using the sample strategy, or using the sample strategy only for t on GA-10-100
Activation Sample 10 iter. 25 iter. 50 iter. 100 iter.
t q k GAP time GAP time GAP time GAP time
softmax 0.1484 0.104 0.0228 0.177 0.0047 0.304 0.0009 0.551
softmax x 0.1544 0.102 0.0229 0.176 0.0049 0.303 0.0009 0.548
softmax x x 0.1954 0.099 0.0370 0.176 0.0081 0.302 0.0013 0.551
sparsemax 0.2335 0.118 0.0600 0.201 0.0336 0.339 0.0288 0.615
sparsemax x 1.0482 0.120 0.9783 0.202 0.9782 0.342 0.9782 0.611
sparsemax x x 0.9172 0.122 0.7001 0.201 0.6827 0.340 0.6818 0.609
Table 7: Comparison of softmax and sparsemax, with or without using the sample strategy, or using the sample strategy only for t on GA-20-400
Activation Sample 10 iter. 25 iter. 50 iter. 100 iter.
t q k GAP time GAP time GAP time GAP time
softmax 0.1018 0.228 0.0190 0.498 0.0052 0.943 0.0014 1.837
softmax x 0.1065 0.218 0.0216 0.480 0.0060 0.904 0.0014 1.756
softmax x x 0.1273 0.204 0.0295 0.445 0.0087 0.836 0.0018 1.621
sparsemax 0.1463 0.235 0.0626 0.500 0.0601 0.935 0.0599 1.804
sparsemax x 0.7860 0.219 0.7696 0.463 0.767 0.860 0.7669 1.656
sparsemax x x 0.4002 0.218 0.1272 0.461 0.0681 0.860 0.0627 1.651

In this section, we compare the model trained with and without the sampling strategy for the hidden representations discussed earlier, as well as two different choices for $\psi$, namely the softmax [29, pp. 180-184] and the sparsemax [46]. In all cases the models are trained by unrolling $10$ iterations of the resolution approach, for $25$ training epochs, using Adam as optimizer with learning rate $10^{-5}$ and gradient-norm clipping (to $5$). When the sampling mechanism is not used, the hidden representations for the queries, the keys, and the step size are not obtained by sampling from a Gaussian distribution with learned mean and variance; instead, they are predicted directly by the encoder. In the test phase, the hidden representation is not sampled in either approach: when the model predicts both the mean and the variance, we directly take the mean.
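A sketch of these training settings (our illustration; model and unrolled_loss are placeholders for the Bundle Network and the loss of Section 2.2):

```python
import torch

def train_epoch(model, instances, unrolled_loss, opt):
    for inst in instances:
        opt.zero_grad()
        loss = unrolled_loss(model, inst)   # unrolls 10 iterations (Section 2)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
        opt.step()

# opt = torch.optim.Adam(model.parameters(), lr=1e-5), run for 25 epochs
```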

From Tables 2-7, we can see that sampling leads to better performance when using the sparsemax, but is ineffective for the softmax. This may be because the sparsemax function has zero gradient in certain regions of the space, which sampling strategies help to escape.

We choose the softmax with no sampling strategy, as it shows better and more stable performance across both problems. Meanwhile, the sparsemax seems to be an interesting choice for harder datasets, as shown in Table 5.

3.5 Testing on Another Dataset

Tables 8-13 report the performance of models trained on one dataset and tested on another. Training the model on a different dataset rarely provides better gaps than training it on the same dataset; this happens only in Table 8 and, for $10$ iterations, in Table 12. When training and test datasets come from the same problem, we obtain similar behavior in several cases, in particular when using more iterations. In some cases, the models trained on the GA datasets achieve interesting performance on the small MC datasets, as shown in Tables 8 and 9; conversely, no update rule learned on the MC datasets works sufficiently well on the GA datasets. Tables 8-13 also show that, in several cases, a model trained on another dataset of the same problem attains lower gaps per iteration than the best baseline obtained with grid search.

Table 8: Gap of the model tested on MC-Sml-40 trained on different datasets. The associated training dataset is displayed as the first element of each row of the table. The last row shows the performances of the best baseline obtained with grid search.
Training Dataset 10 iter. 25 iter. 50 iter. 100 iter.
GAP time GAP time GAP time GAP time
MC-Sml-40 9.20 0.105 1.75 0.185 0.53 0.321 0.17 0.592
MC-Sml-Var 8.02 0.104 1.69 0.184 0.51 0.319 0.17 0.593
MC-Big-40 18.02 0.116 4.14 0.206 1.49 0.358 0.47 0.665
MC-Big-Var 10.77 0.106 2.26 0.187 0.70 0.327 0.20 0.603
GA-10-100 61.98 0.104 20.79 0.183 4.07 0.319 0.95 0.589
GA-20-400 86.33 0.107 68.16 0.190 14.20 0.327 3.52 0.607
Best Baseline 11.96 0.048 4.16 0.128 1.30 0.293 0.31 0.816
Table 9: Gap of the model tested on MC-Sml-Var trained on different datasets. The associated training dataset is displayed as the first element of each row of the table. The last row shows the performances of the best baseline obtained with grid search.
Training Dataset 10 iter. 25 iter. 50 iter. 100 iter.
GAP time GAP time GAP time GAP time
MC-Sml-40 19.87 0.142 5.72 0.276 2.32 0.501 1.00 0.965
MC-Sml-Var 12.60 0.138 3.28 0.269 1.24 0.496 0.50 0.952
MC-Big-40 33.83 0.149 14.03 0.291 7.54 0.536 3.90 1.033
MC-Big-Var 16.21 0.141 3.83 0.276 1.39 0.505 0.52 0.976
GA-10-100 25.76 0.142 7.60 0.275 2.08 0.502 0.67 0.969
GA-20-400 66.15 0.140 62.06 0.275 38.19 0.507 12.86 0.974
Best Baseline 12.79 0.108 4.33 0.29 1.66 0.635 0.58 1.503
Table 10: Gap of the model tested on MC-Big-40 trained on different datasets. The associated training dataset is displayed as the first element of each row of the table. The last row shows the performances of the best baseline obtained with grid search.
Training Dataset 10 iter. 25 iter. 50 iter. 100 iter.
GAP time GAP time GAP time GAP time
MC-Sml-40 35.09 0.133 5.78 0.248 1.63 0.445 0.62 0.85
MC-Sml-Var 56.68 0.132 25.62 0.246 8.88 0.441 3.06 0.842
MC-Big-40 12.70 0.130 3.41 0.243 1.26 0.436 0.47 0.832
MC-Big-Var 37.57 0.134 10.74 0.251 4.89 0.449 1.49 0.856
GA-10-100 95.70 0.123 95.70 0.230 78.63 0.408 27.53 0.779
GA-20-400 100.0 0.134 100.0 0.252 99.53 0.447 74.54 0.854
Best Baseline 19.34 0.076 5.64 0.203 2.08 0.447 0.78 1.110
Table 11: Gap of the model tested on MC-Big-Var trained on different datasets. The associated training dataset is displayed as the first element of each row of the table. The last row shows the performances of the best baseline obtained with grid search.
Training Dataset 10 iter. 25 iter. 50 iter. 100 iter.
GAP time GAP time GAP time GAP time
MC-Sml-40 21.50 0.142 5.48 0.280 2.07 0.511 0.85 0.984
MC-Sml-Var 21.03 0.221 7.64 0.419 2.67 0.755 0.95 1.430
MC-Big-40 28.39 0.150 11.15 0.294 5.87 0.540 2.94 1.037
MC-Big-Var 20.10 0.139 5.11 0.271 2.01 0.495 0.70 0.953
GA-10-100 41.66 0.144 28.58 0.280 20.72 0.511 8.30 0.979
GA-20-400 72.89 0.144 69.99 0.283 53.10 0.519 27.75 0.996
Best Baseline 17.98 0.112 6.34 0.289 2.66 0.627 0.93 1.477
Table 12: Gap of the model tested on GA-10-100 trained on different datasets. The associated training dataset is displayed as the first element of each row of the table. The last row shows the performances of the best baseline obtained with grid search.
Training Dataset 10 iter. 25 iter. 50 iter. 100 iter.
GAP time GAP time GAP time GAP time
MC-Sml-40 2.5903 0.101 1.8558 0.179 0.9291 0.304 0.4860 0.555
MC-Sml-Var 2.5165 0.102 1.4663 0.178 1.0522 0.307 1.0512 0.562
MC-Big-40 2.7488 0.103 2.1948 0.177 1.3515 0.302 0.7463 0.550
MC-Big-Var 2.4958 0.102 1.3267 0.180 0.9705 0.309 0.9683 0.564
GA-10-100 0.1484 0.104 0.0228 0.177 0.0048 0.304 0.0009 0.551
GA-20-400 0.1332 0.105 0.0233 0.182 0.0056 0.309 0.0016 0.561
Best Baseline 0.1893 0.071 0.0156 0.211 0.0014 0.601 0.0006 1.596
Table 13: Gap of the model tested on GA-20-400 trained on different datasets. The associated training dataset is displayed as the first element of each row of the table. The last row shows the performances of the best baseline obtained with grid search.
Training Dataset 10 iter. 25 iter. 50 iter. 100 iter.
GAP time GAP time GAP time GAP time
MC-Sml-40 2.3695 0.221 1.9535 0.474 1.1906 0.894 0.5052 1.732
MC-Sml-Var 2.3357 0.231 1.7168 0.507 0.8963 0.959 0.8654 1.868
MC-Big-40 2.4715 0.218 2.1550 0.481 1.5618 0.909 0.8056 1.759
MC-Big-Var 2.3197 0.218 1.6304 0.482 0.8092 0.910 0.7856 1.765
GA-10-100 0.1811 0.234 0.0296 0.511 0.0077 0.962 0.0015 1.866
GA-20-400 0.1018 0.228 0.0190 0.498 0.0052 0.943 0.0014 1.837
Best Baseline 0.1129 0.433 0.0193 1.459 0.0048 4.256 0.0020 14.459

4 Conclusion

In this work, we present a machine learning model based on unrolling and amortization, strongly inspired by the Bundle method. The model bypasses the need to solve the Dual Master problem at each iteration, leading to a smooth optimization process that can be differentiated using Automatic Differentiation techniques. We test it for the resolution of the Lagrangian dual of two different problems: Multi-Commodity Network Design and Generalized Assignment. It achieves a lower gap compared to other classical approaches for a fixed number of iterations. Moreover, it exhibits strong generalization capabilities, maintaining robust performance even beyond the training horizon and sometimes across different datasets.

This work opens several promising directions for future research. First, the framework can be directly applied, as it is, to other problems arising from Lagrangian relaxation. A central challenge in standard bundle methods lies in automatically determining which components to retain or discard in order to control bundle size and improve computational efficiency. While straightforward modifications of the framework are possible, this also raises questions concerning feature extraction and the design of an automated feature selection phase.

Given the growing use of Bundle methods in machine learning, it would also be interesting to assess the performance of the proposed approach in meta-learning tasks. This requires applying the method to non-convex and non-smooth optimization problems, and this extension presents non-trivial challenges. In this setting, removing components from the bundle is essential not only for reducing the computational cost, but also for avoiding convergence to poor local minima.

References

  • [1] Brandon Amos. Tutorial on amortized optimization for learning to optimize over continuous domains. CoRR, 2022.
  • [2] Marcin Andrychowicz, Misha Denil, Sergio Gómez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.
  • [3] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
  • [4] Atilim Gunes Baydin, Barak A. Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research, 18(153):1–43, 2018.
  • [5] Samy Bengio, Y. Bengio, and Jocelyn Cloutier. On the search for new learning rules for anns. Neural Processing Letters, 2:26–30, 07 1995.
  • [6] Y. Bengio, S. Bengio, and J. Cloutier. Learning a synaptic learning rule. In IJCNN-91-Seattle International Joint Conference on Neural Networks, volume ii, pages 969 vol.2–, 1991.
  • [7] Yoshua Bengio, Andrea Lodi, and Antoine Prouvost. Machine learning for combinatorial optimization: a methodological tour d’horizon. CoRR, abs/1811.06128, 2018.
  • [8] Daniel Cederberg, Xuyang Wu, Stephen P. Boyd, and Mikael Johansson. An asynchronous bundle method for distributed learning problems. In The Thirteenth International Conference on Learning Representations, 2025.
  • [9] Tianlong Chen, Xiaohan Chen, Wuyang Chen, Howard Heaton, Jialin Liu, Zhangyang Wang, and Wotao Yin. Learning to optimize: A primer and a benchmark. Journal of Machine Learning Research, 23(189):1–59, 2022.
  • [11] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
  • [12] Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR, abs/1406.1078, 2014.
  • [13] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling, 2014.
  • [14] Michele Conforti, Gérard Cornuéjols, and Giacomo Zambelli. Integer Programming. Springer, New York, 2014.
  • [15] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learning, 20:273–297, 1995.
  • [16] Teodor Gabriel Crainic, Antonio Frangioni, and Bernard Gendron. Bundle-based relaxation methods for multicommodity capacitated fixed charge network design. Discrete Applied Mathematics, 112(1):73–99, 2001.
  • [17] Welington de Oliveira and Mikhail Solodov. A doubly stabilized bundle method for nonsmooth convex optimization. Mathematical programming, 156(1):125–159, 2016.
  • [18] Welington Luis de Oliveira and Claudia A. Sagastizábal. Level bundle methods for oracles with on-demand accuracy. Optimization Methods and Software, 29:1180 – 1209, 2014.
  • [19] Francesco Demelas, Joseph Le Roux, Mathieu Lacroix, and Axel Parmentier. Predicting lagrangian multipliers for mixed integer linear programs. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 10368–10384. PMLR, 21–27 Jul 2024.
  • [20] Thomas Finley and Thorsten Joachims. Training structural svms when exact inference is intractable. In Proceedings of the 25th International Conference on Machine Learning, ICML ’08, page 304–311, New York, NY, USA, 2008. Association for Computing Machinery.
  • [21] Marshall L. Fisher. The lagrangian relaxation method for solving integer programming problems. Management Science, 27:1–18, 1981.
  • [22] A. Frangioni. Standard Bundle Methods: Untrusted Models and Duality, chapter 2, pages 61–116. Springer, 2020.
  • [23] Antonio Frangioni. Solving semidefinite quadratic problems within nonsmooth optimization algorithms. Computers & Operations Research, 23(11):1099–1118, 1996.
  • [24] Antonio Frangioni. Dual-ascent methods and multicommodity flow problems. Università degli studi di Pisa. Dipartimento di informatica, 1997.
  • [25] Antonio Frangioni. Generalized bundle methods. SIAM Journal on Optimization, 13(1):117–156, 2002.
  • [26] Antonio Frangioni, Niccolò Iardella, and Rafael Durbano Lobato. SMS++, 2023.
  • [27] Bernard Gendron, Teodor Gabriel Crainic, and Antonio Frangioni. Multicommodity Capacitated Network Design, pages 1–19. Springer US, Boston, MA, 1999.
  • [28] A. M. Geoffrion. Lagrangean relaxation for integer programming. In M. L. Balinski, editor, Approaches to Integer Programming, pages 82–114. Springer Berlin Heidelberg, Berlin, Heidelberg, 1974.
  • [29] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT Press, 2016.
  • [30] Hossein Hassani, Roozbeh Razavi-Far, Mehrdad Saif, and Liang Lin. Towards sample-efficiency and generalization of transfer and inverse reinforcement learning: A comprehensive literature review, 2024.
  • [31] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9:1735–80, 12 1997.
  • [32] Sepp Hochreiter, A. Steven Younger, and Peter R. Conwell. Learning to learn using gradient descent. In Georg Dorffner, Horst Bischof, and Kurt Hornik, editors, Artificial Neural Networks — ICANN 2001, pages 87–94, Berlin, Heidelberg, 2001. Springer Berlin Heidelberg.
  • [33] Jinlong Ji, Xuhui Chen, Qianlong Wang, Lixing Yu, and Pan Li. Learning to learn gradient aggregation by gradient descent. In IJCAI, pages 2614–2620, 2019.
  • [34] Thorsten Joachims, Thomas Finley, and Chun-Nam Yu. Cutting-plane training of structural svms. Machine Learning, 77:27–59, 10 2009.
  • [35] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [36] Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014.
  • [37] Krzysztof C. Kiwiel. A proximal bundle method with approximate subgradient linearizations. SIAM Journal on Optimization, 16(4):1007–1023, 2006.
  • [38] James Kotary, My H Dinh, and Ferdinando Fioretto. Backpropagation of unrolled solvers with folded optimization. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI-2023, page 1963–1970. International Joint Conferences on Artificial Intelligence Organization, August 2023.
  • [39] Quoc Le, Alex Smola, and Svn Vishwanathan. Bundle methods for machine learning. Advances in neural information processing systems, 20, 2007.
  • [40] Claude Lemaréchal. Lagrangian Relaxation, pages 112–156. Springer Berlin Heidelberg, Berlin, Heidelberg, 2001.
  • [41] Claude Lemaréchal, Arkadii Nemirovskii, and Yurii Nesterov. New variants of bundle methods. Mathematical programming, 69:111–147, 1995.
  • [42] Claude Lemarechal, Jean-Jacques Strodiot, and André Bihain. On a bundle algorithm for nonsmooth optimization. In Nonlinear programming 4, pages 245–282. Elsevier, 1981.
  • [43] Bingheng Li, Linxin Yang, Yupeng Chen, Senmiao Wang, Qian Chen, Haitao Mao, Yao Ma, Akang Wang, Tian Ding, Jiliang Tang, and Ruoyu Sun. Pdhg-unrolled learning-to-optimize method for large-scale linear programming, 2024.
  • [44] Ke Li and Jitendra Malik. Learning to optimize. ArXiv, abs/1606.01885, 2016.
  • [45] Jialin Liu, Xiaohan Chen, Zhangyang Wang, Wotao Yin, and Hanqin Cai. Towards constituting mathematical structures for learning to optimize. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 21426–21449. PMLR, 23–29 Jul 2023.
  • [46] André F. T. Martins and Ramón Fernandez Astudillo. From softmax to sparsemax: A sparse model of attention and multi-label classification. CoRR, abs/1602.02068, 2016.
  • [47] Nina Mazyavkina, Sergey Sviridov, Sergei Ivanov, and Evgeny Burnaev. Reinforcement learning for combinatorial optimization: A survey. Computers & Operations Research, 134:105400, 2021.
  • [48] Vishal Monga, Yuelong Li, and Yonina C. Eldar. Algorithm unrolling: Interpretable, efficient deep learning for signal and image processing, 2020.
  • [49] Alasdair Paren, Leonard Berrada, Rudra P. K. Poudel, and M. Pawan Kumar. A stochastic bundle method for interpolating networks. J. Mach. Learn. Res., 23(1), January 2022.
  • [50] T.P. Runarsson and M.T. Jonsson. Evolution and design of distributed learning rules. In 2000 IEEE Symposium on Combinations of Evolutionary Computation and Neural Networks. Proceedings of the First IEEE Symposium on Combinations of Evolutionary Computation and Neural Networks (Cat. No.00, pages 59–63, 2000.
  • [51] Alexander Schrijver et al. On cutting planes. Combinatorics, 79:291–296, 1980.
  • [52] John Schulman, Nicolas Heess, Theophane Weber, and Pieter Abbeel. Gradient estimation using stochastic computation graphs. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015.
  • [53] Damien Scieur, Quentin Bertrand, Gauthier Gidel, and Fabian Pedregosa. The curse of unrolling: Rate of differentiating through optimization, 2023.
  • [54] Naresh K. Sinha and Michael P. Griscik. A stochastic approximation method. IEEE Trans. Syst. Man Cybern., 1(4):338–344, 1971.
  • [55] R.S. Sutton and A.G. Barto. Reinforcement Learning, second edition: An Introduction. Adaptive Computation and Machine Learning series. MIT Press, 2018.
  • [56] R. Tyrrell Rockafellar. Convex Analysis. Princeton University Press, 2015.
  • [57] Yunhao Tang, Shipra Agrawal, and Yuri Faenza. Reinforcement learning for integer programming: Learning to cut. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 9367–9376. PMLR, 13–18 Jul 2020.
  • [58] Choon Hui Teo, S.V.N. Vishwanthan, Alex J. Smola, and Quoc V. Le. Bundle methods for regularized risk minimization. Journal of Machine Learning Research, 11(10):311–365, 2010.
  • [59] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • [60] Arthur Steven Younger, Sepp Hochreiter, and Peter R. Conwell. Meta-learning with backpropagation. IJCNN’01. International Joint Conference on Neural Networks. Proceedings (Cat. No.01CH37222), 3:2001–2006 vol.3, 2001.
  • [61] Yilang Zhang and Georgios B. Giannakis. Meta-learning priors using unrolled proximal networks. In The Twelfth International Conference on Learning Representations, 2024.

Appendix A Proximality Term η\eta-Strategy

The proper selection of the proximal term η of the bundle method is both non-trivial and essential. This choice depends on the current stabilization point and on the information accumulated in the bundle up to the current iteration. In particular, when all the gradients at the optimum are available, solving a Cutting Plane model is preferable, as it yields the optimal solution to the problem; hence, in this case, a higher value of η may be preferred. At the beginning of the resolution, by contrast, only very local information is typically available, so acting cautiously, i.e., choosing a smaller value of η, is preferable.

In [25], theoretical conditions on the η-strategy are established to ensure convergence. Constant strategies can be effective if appropriately fine-tuned, and many state-of-the-art strategies are self-adjusting rules for choosing \eta_{t}.

The main state-of-the-art η-strategies are composed of three levels, such as those developed in [24] and also presented in [16]. These levels are the Long-term, Middle-term, and Short-term η-strategies.

The Long-term η-strategies aim to capture the stage of the resolution process. In the Bundle method this is important because it governs the balance between the subgradient method and the Cutting Plane method: we want to prioritize subgradient-like iterations when the solution is still far from optimal, and shift towards Cutting Plane iterations towards the end. This helps collect the subgradient information necessary to establish the optimality of the solution.

We define the maximum improvement estimation of the Dual Master problem as

v^{*}_{t}=\eta_{t}\,\Big\|\sum_{i\in\beta_{t}}\theta_{i}^{(t)}{\bm{g}}_{i}\Big\|_{2}^{2}+\sum_{i\in\beta_{t}}\alpha_{i}\theta_{i}^{(t)}

where {\bm{\theta}}^{(t)} is the solution of the Dual Master problem at iteration t.

We also define the expected maximum improvement obtained using a variant of the Dual Master problem in which η^{*} acts as regularization parameter:

\epsilon^{*}_{t}=\sum_{i\in\beta_{t}}\alpha_{i}\theta_{i}^{(t)}+\eta^{*}\,\Big\|\sum_{i\in\beta_{t}}\theta_{i}^{(t)}{\bm{g}}_{i}\Big\|_{2}^{2}

where η^{*} is a hyperparameter, generally chosen to be greater than all the possible values of \eta_{t}, so as to emulate the behavior obtained by following the suggestion of the model more closely. This is different from explicitly using η^{*} in the Dual Master problem, which would alter its solution and consequently the new trial direction, as previously explained. Additionally, we define a hyperparameter \tilde{m}\in[0,1), which is employed in the η-strategies and is usually fixed to 0.01.
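For concreteness, both estimates can be computed directly from the bundle data. Below is a minimal NumPy sketch in which the array names theta, G, and alpha (the Dual Master solution, the stacked subgradients, and the linearization errors) are our own assumptions.

```python
import numpy as np

def improvement_estimates(theta, G, alpha, eta_t, eta_star):
    """Return (v*_t, eps*_t) as defined above.

    theta: (n,) convex multipliers theta^(t); G: (n, dim) subgradients g_i
    stacked row-wise; alpha: (n,) linearization errors.
    """
    w = G.T @ theta                  # aggregated direction sum_i theta_i g_i
    sq_norm = float(w @ w)           # ||sum_i theta_i g_i||^2
    linear = float(alpha @ theta)    # sum_i alpha_i theta_i
    v_t = eta_t * sq_norm + linear       # maximum improvement estimate v*_t
    eps_t = eta_star * sq_norm + linear  # expected maximum improvement eps*_t
    return v_t, eps_t
```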

Some examples of Long-term η\eta-strategies are:

  • no Long-term η\eta-strategy;

  • the soft Long-term η\eta-strategy. After a null step, decreases in ηt\eta_{t} are inhibited whenever vt<m~ϵtv^{*}_{t}<\tilde{m}\epsilon^{*}_{t}. In other words, decrements are prevented if ηt\eta_{t} is already sufficiently small, such that the maximum improvement is estimated to be less than m~\tilde{m} times the improvement suggested by a Cutting-Planes model.

  • the hard Long-term η\eta-strategy. Increases of ηt\eta_{t} are allowed, even after null steps, whenever vt<m~ϵtv^{*}_{t}<\tilde{m}\epsilon^{*}_{t}.

  • the balancing Long-term η-strategy. This strategy aims to keep the two terms \frac{\eta^{*}}{2}||\sum_{i\in\beta_{t}}\theta_{i}^{(t)}{\bm{g}}_{i}||^{2}_{2} and \sum_{i\in\beta_{t}}\alpha_{i}\theta_{i}^{(t)} of roughly the same size.

    Increases of ηt\eta_{t}, after a serious step, are inhibited if

    η2iβtθi(t)𝒈i22m~i=1|βt|αiθi(t).\frac{\eta^{*}}{2}||\sum_{i\in\beta_{t}}\theta_{i}^{(t)}{\bm{g}}_{i}||^{2}_{2}\leq\tilde{m}\sum_{i=1}^{|\beta_{t}|}\alpha_{i}\theta_{i}^{(t)}.

    In this case, increasing \eta_{t} would further reduce the term \frac{\eta^{*}}{2}||\sum_{i\in\beta_{t}}\theta_{i}^{(t)}{\bm{g}}_{i}||^{2}_{2}, which is already small.

    On the other hand, decreases in ηt\eta_{t} are inhibited if

    m~η2iβtθi(t)𝒈i22i=1|βt|αiθi(t).\tilde{m}\frac{\eta^{*}}{2}||\sum_{i\in\beta_{t}}\theta_{i}^{(t)}{\bm{g}}_{i}||^{2}_{2}\geq\sum_{i=1}^{|\beta_{t}|}\alpha_{i}\theta_{i}^{(t)}.

    The rationale is that decreasing \eta_{t} would further increase ||\sum_{i\in\beta_{t}}\theta_{i}^{(t)}{\bm{g}}_{i}||^{2}_{2}, which is already large.

The Middle-term η\eta-strategies are multi-step strategies that consider the outcomes from the last few iterations. These strategies include the following conditions:

  • A minimum number of consecutive serious steps (SS) with the same \eta_{t} must be performed before \eta_{t} is allowed to increase.

  • A minimum number of consecutive null steps (NS) with the same ηt\eta_{t} must be performed before ηt\eta_{t} is allowed to decrease.

At the bottom level, heuristic Short-term η-strategies rely only on information from the current iteration to propose an update for \eta_{t}. Specifically, these heuristics assist in selecting the new value only when both the top and middle levels agree on the update. However, it can sometimes be beneficial to let the Short-term heuristics make small adjustments to \eta_{t} regardless of the decisions made by the other levels.

Our implementation combines the following Short-term η-strategies (a sketch of how the three levels can interact is given after the list):

  • Increase ηt+1\eta_{t+1} as ηtηincr\eta_{t}\eta_{incr}, with ηincr>1\eta_{incr}>1.

  • Decrease \eta_{t+1} as \eta_{t}\eta_{decr}, with 0<\eta_{decr}<1.

  • When increased, ηt+1\eta_{t+1} should not exceed ηM\eta_{M}, with ηM>0\eta_{M}>0.

  • When decreased, ηt+1\eta_{t+1} should not go below ηm\eta_{m}, with 0<ηm<ηM0<\eta_{m}<\eta_{M}.
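To illustrate how the three levels can interact, here is a hedged Python sketch; the thresholds, counters, and update factors are illustrative placeholders rather than the exact rules used in our implementation.

```python
def update_eta(eta_t, serious_step, v_t, eps_t,
               n_consec_ss, n_consec_ns,
               m_tilde=0.01, min_ss=2, min_ns=2,
               eta_incr=2.0, eta_decr=0.5, eta_m=1e-4, eta_M=1e4):
    """One illustrative eta update combining the three strategy levels."""
    eta_small = v_t < m_tilde * eps_t        # Long-term signal: eta already small
    if serious_step:
        # Middle-term: require enough consecutive serious steps before an
        # increase; Short-term: multiplicative increase capped at eta_M.
        if n_consec_ss >= min_ss:
            eta_t = min(eta_t * eta_incr, eta_M)
    else:
        if eta_small:
            # Hard Long-term rule: increases allowed even after a null step.
            eta_t = min(eta_t * eta_incr, eta_M)
        elif n_consec_ns >= min_ns:
            # Soft Long-term rule inhibits this decrease when eta_small holds;
            # Short-term: multiplicative decrease floored at eta_m.
            eta_t = max(eta_t * eta_decr, eta_m)
    return eta_t
```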

Appendix B Multi Commodity Capacitated Network Design Problem

A MC instance is given by a directed simple graph D=(N,A)D=(N,A), a set of commodities KK, an arc-capacity vector cc, and two cost vectors rr and ff. Each commodity kKk\in K corresponds to a triplet (ok,dk,qk)(o^{k},d^{k},q^{k}) where okNo^{k}\in N and dkNd^{k}\in N are the nodes corresponding to the origin and the destination of commodity kk, and qkq^{k}\in\mathbb{N}^{*} is its volume. For each arc, (i,j)A(i,j)\in A, cij>0c_{ij}>0 corresponds to the maximum amount of flow that can be routed through (i,j)(i,j) and fij>0f_{ij}>0 corresponds to the fixed cost of using arc (i,j)(i,j) to route commodities. For each arc (i,j)A(i,j)\in A and each commodity kKk\in K, rijk>0r_{ij}^{k}>0 corresponds to the cost of routing one unit of commodity kk through arc (i,j)(i,j).

A MC solution consists of an arc subset A^{\prime}\subseteq A and, for each commodity k\in K, of a flow of value q^{k} from its origin o^{k} to its destination d^{k}, with the following requirements: all commodities are only routed through arcs of A^{\prime}, and the total amount of flow routed through each arc (i,j)\in A^{\prime} does not exceed its capacity c_{ij}. The solution cost is the sum of the fixed costs over the arcs of A^{\prime} plus the routing cost, the latter being the sum, over all arcs (i,j)\in A and all commodities k\in K, of the unitary routing cost r_{ij}^{k} multiplied by the amount of flow of k routed through (i,j).

B.1 MILP formulation

A standard model for the MC problem [27] introduces two sets of variables: the continuous flow variables xijkx_{ij}^{k} representing the amount of commodity kk that is routed through arc (i,j)(i,j) and the binary design variables yijy_{ij} representing whether or not arc (i,j)(i,j) is used to route commodities. Denoting respectively by Ni+={jN(i,j)A}N^{+}_{i}=\{j\in N\mid(i,j)\in A\} and Ni={jN(j,i)A}N_{i}^{-}=\{j\in N\mid(j,i)\in A\} the sets of forward and backward neighbors of a vertex iNi\in N, the MC problem can be modeled as follows:

min𝒙,𝒚(i,j)A(fijyij+kKrijkxijk)\displaystyle\min_{\bm{x},\bm{y}}\sum_{(i,j)\in A}\left(f_{ij}y_{ij}+\sum_{k\in K}r^{k}_{ij}x_{ij}^{k}\right) (9a)
jNi+xijkjNixjik=bik\displaystyle\sum_{j\in N^{+}_{i}}x_{ij}^{k}-\sum_{j\in N^{-}_{i}}x_{ji}^{k}=b^{k}_{i} iN,kK\displaystyle\!\!\forall i\in N,\forall k\in K (9b)
kKxijkcijyij,\displaystyle\sum_{k\in K}x_{ij}^{k}\leq c_{ij}y_{ij}, (i,j)A\displaystyle\!\!\forall(i,j)\in A (9c)
xijk=0\displaystyle x_{ij}^{k}=0 kK,(i,j)A s.t. i=dk or j=ok\displaystyle\!\!\!\!\!\begin{array}[]{l}\forall k\in K,\forall(i,j)\in A\\ \text{ s.t. }i=d^{k}\text{ or }j=o^{k}\end{array} (9f)
0xijkqk\displaystyle 0\leq x^{k}_{ij}\leq q^{k} (i,j)A,kK\displaystyle\!\!\forall(i,j)\in A,\forall k\in K (9g)
yij{0,1},\displaystyle y_{ij}\in\{0,1\}, (i,j)A\displaystyle\!\!\forall(i,j)\in A (9h)

where

bik={qk if i=ok,qk if i=dk,0 otherwise.b^{k}_{i}=\left\{\begin{array}[]{cc}q^{k}&\mbox{ if }i=o^{k},\\ -q^{k}&\mbox{ if }i=d^{k},\\ 0&\mbox{ otherwise.}\end{array}\right.

The objective function (9a) minimizes the sum of the routing and fixed costs. Equations (9b) are the flow conservation constraints that properly define the flow of each commodity through the graph. Constraints (9c) are the capacity constraints ensuring that the total amount of flow routed through each arc does not exceed its capacity, and is zero if the arc is not used to route commodities. Equations (9f) ensure that a commodity is not routed on an arc entering its origin or leaving its destination. Finally, inequalities (9g) are the bounds for the x variables and constraints (9h) are the integrality constraints for the design variables.

B.2 Lagrangian Knapsack Relaxation

A standard way to obtain good bounds for the MC problem is to solve the Lagrangian relaxation obtained by dualizing the flow conservation constraints (9b) in formulation (9a)-(9h). Let \pi_{i}^{k} be the Lagrangian multiplier associated with node i\in N and commodity k\in K; since the dualized constraints are equations, \bm{\pi} has no sign constraints. Dualizing the flow conservation constraints gives the following relaxed Lagrangian problem LR(\bm{\pi}):

\min_{(\bm{x},\bm{y})\text{ satisfying }(9c)\text{-}(9h)}\ \sum_{(i,j)\in A}\Big(f_{ij}y_{ij}+\sum_{k\in K}r^{k}_{ij}x_{ij}^{k}\Big)+\sum_{k\in K}\sum_{i\in N}\pi^{k}_{i}\Big(b^{k}_{i}-\sum_{j\in N^{+}_{i}}x_{ij}^{k}+\sum_{j\in N^{-}_{i}}x_{ji}^{k}\Big)

Rearranging the terms in the objective function and observing that the relaxed Lagrangian problem decomposes by arcs, we obtain a subproblem for each arc (i,j)\in A of the form:

(LRij(𝝅))\displaystyle\displaystyle(LR_{ij}(\bm{\pi}))\quad min𝒙,𝒚fijyij+kKijwijkxijk\displaystyle\displaystyle\min_{\bm{x},\bm{y}}f_{ij}y_{ij}+\sum_{k\in K_{ij}}w_{ij}^{k}x_{ij}^{k} (10a)
kKijxijkcijyij\displaystyle\displaystyle\sum_{k\in K_{ij}}x_{ij}^{k}\leq c_{ij}y_{ij} (10b)
0xijkqk\displaystyle\displaystyle 0\leq x^{k}_{ij}\leq q^{k} kKij\displaystyle\displaystyle\forall k\in K_{ij} (10c)
yij{0,1}\displaystyle\displaystyle y_{ij}\in\{0,1\} (10d)

where wijk=rijkπik+πjk\displaystyle w_{ij}^{k}=r_{ij}^{k}-\pi_{i}^{k}+\pi_{j}^{k} and Kij={kKjok and idk}\displaystyle K_{ij}=\{k\in K\mid j\neq o^{k}\mbox{ and }i\neq d^{k}\} is the set of commodities that may be routed through arc (i,j)\displaystyle(i,j).

For each (i,j)\in A, LR_{ij}(\bm{\pi}) is a MILP with only one binary variable. If y_{ij}=0, then, by (10b) and (10c), x_{ij}^{k}=0 for all k\in K_{ij}. If y_{ij}=1, the problem reduces to a continuous knapsack problem. An optimal solution is obtained by ordering the commodities of K_{ij} by increasing value of w_{ij}^{k} and setting each variable x_{ij}^{k} with w_{ij}^{k}<0 to \max\{\min\{q^{k},c_{ij}-\sum_{k'\in K(k)}q^{k'}\},0\}, where K(k) denotes the set of commodities that precede k in the order; commodities with w_{ij}^{k}\geq 0 are left at zero. This step can be done in O(|K_{ij}|) if one computes the x_{ij}^{k} following the computed order, so the overall complexity of the continuous knapsack problem is O(|K_{ij}|\log(|K_{ij}|)), dominated by the sorting. The value of LR_{ij}(\bm{\pi}) is the minimum between the optimal value of the continuous knapsack problem (including the fixed cost f_{ij}) and 0.
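Below is a minimal Python sketch of this greedy resolution of LR_{ij}(\bm{\pi}); the argument names are illustrative, and commodities with nonnegative reduced cost are left at zero as described above.

```python
def solve_arc_subproblem(f_ij, c_ij, w, q):
    """Value of LR_ij(pi); w[k], q[k] are the reduced cost and volume of
    commodity k in K_ij."""
    cost, residual = f_ij, c_ij                     # case y_ij = 1
    for k in sorted(range(len(w)), key=lambda k: w[k]):
        if w[k] >= 0 or residual <= 0:
            break                                   # only negative costs help
        x_k = min(q[k], residual)                   # continuous knapsack fill
        cost += w[k] * x_k
        residual -= x_k
    return min(cost, 0.0)                           # compare with y_ij = 0
```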

Lagrangian duality implies that

LR(𝝅)=(i,j)ALRij(𝝅)+iNkKπikbikLR(\bm{\pi})=\sum_{(i,j)\in A}LR_{ij}(\bm{\pi})+\sum_{i\in N}\sum_{k\in K}\pi_{i}^{k}b^{k}_{i}

is a lower bound for the MC problem and the best one is obtained by solving the following Lagrangian dual problem:

(LD)max𝝅N×KLR(𝝅)(LD)\quad\max_{\bm{\pi}\in\mathbb{R}^{N\times K}}LR(\bm{\pi})

Appendix C Generalized Assignment Problem

A GA instance is defined by a set I\displaystyle I of items and a set J\displaystyle J of bins. Each bin j\displaystyle j is associated with a certain capacity cj\displaystyle c_{j}. For each item iI\displaystyle i\in I and each bin jJ\displaystyle j\in J, pij\displaystyle p_{ij} is the profit of assigning item i\displaystyle i to bin j\displaystyle j, and wij\displaystyle w_{ij} is the weight of item i\displaystyle i inside bin j\displaystyle j.

Considering a binary variable xij\displaystyle x_{ij} for each item and each bin that is equal to one if and only if item i\displaystyle i is assigned to bin j\displaystyle j, the GA problem can be formulated as:

max𝒙iIjJpijxij\displaystyle\displaystyle\max_{\bm{x}}\sum_{i\in I}\sum_{j\in J}p_{ij}x_{ij} (11a)
jJxij1\displaystyle\displaystyle\sum_{j\in J}x_{ij}\leq 1 iI\displaystyle\displaystyle\forall i\in I (11b)
iIwijxijcj\displaystyle\displaystyle\sum_{i\in I}w_{ij}x_{ij}\leq c_{j} jJ\displaystyle\displaystyle\forall j\in J (11c)
xij{0,1}\displaystyle\displaystyle x_{ij}\in\{0,1\} iI,jJ.\displaystyle\displaystyle\forall i\in I,\;\forall j\in J. (11d)

The objective function (11a) maximizes the total profit. Inequalities (11b) ensure that each item is assigned to at most one bin. Inequalities (11c) ensure that the total weight of the items assigned to a bin does not exceed its capacity. Finally, constraints (11d) ensure the integrality of the variables.

C.1 Lagrangian Relaxation

A Lagrangian relaxation of the GA problem is obtained by dualizing (11b). For iI\displaystyle i\in I, let πi0\displaystyle\pi_{i}\geq 0 be the Lagrangian multiplier of inequality (11b) associated with item i\displaystyle i. For each bin j\displaystyle j the subproblem becomes:

(LR_{j}(\bm{\pi}))\quad \max_{\bm{x}}\sum_{i\in I}(p_{ij}-\pi_{i})x_{ij}
iIwijxijcj\displaystyle\displaystyle\sum_{i\in I}w_{ij}x_{ij}\leq c_{j}
xij{0,1}\displaystyle\displaystyle x_{ij}\in\{0,1\} iI\displaystyle\displaystyle\forall i\in I

Each subproblem corresponds to a 0-1 knapsack problem with |I| binary variables. For \bm{\pi}\geq\bm{0}, the Lagrangian bound LR(\bm{\pi}) is:

LR(𝝅)=jJLRj(𝝅)+iIπi.LR(\bm{\pi})=\sum_{j\in J}LR_{j}(\bm{\pi})+\sum_{i\in I}\pi_{i}.

The Lagrangian dual can then be written as:

min𝝅0|I|LR(𝝅)\min_{\bm{\pi}\in{\mathbb{R}}^{|I|}_{\geq 0}}LR(\bm{\pi})
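As an illustration, LR(\bm{\pi}) can be evaluated by solving one 0-1 knapsack per bin. The Python sketch below uses the textbook dynamic program and assumes integer, positive weights and capacities; the actual experiments may rely on a different knapsack solver.

```python
def knapsack_01(values, weights, capacity):
    """Classic O(n * capacity) dynamic program for the 0-1 knapsack."""
    best = [0.0] * (capacity + 1)
    for v, w in zip(values, weights):
        for c in range(capacity, w - 1, -1):  # reverse scan: each item once
            best[c] = max(best[c], best[c - w] + v)
    return best[capacity]

def lagrangian_bound(p, w, cap, pi):
    """p[i][j], w[i][j]: profit/weight of item i in bin j; cap[j]: capacity."""
    bound = sum(pi)                           # constant term sum_i pi_i
    for j in range(len(cap)):
        vals, wts = [], []
        for i in range(len(pi)):
            rp = p[i][j] - pi[i]              # reduced profit
            if rp > 0:                        # items with rp <= 0 stay at 0
                vals.append(rp)
                wts.append(w[i][j])
        bound += knapsack_01(vals, wts, cap[j])
    return bound
```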

Appendix D Hyperparameters and Implementation Details

In this section, we provide technical details about the implementation: the hyperparameters of the NN architecture (layers, layer sizes, and activation functions), the hand-crafted features, the hyperparameters of the optimizer used for training, and the GPU and CPU specifics of the machines used for the numerical experiments.

Loss functions

To weigh the contributions of the different iterations, we use γ = 0.999. In other amortized optimization works focusing on learning optimizers, such as [2], this parameter is set equal to 1.
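For concreteness, a minimal sketch of such a discounted loss over the unrolled trajectory follows; the direction of the discount (here down-weighting early iterations) and the name phi_values are illustrative assumptions, not the exact loss of the earlier sections.

```python
def unrolled_loss(phi_values, gamma=0.999):
    """Discounted sum of per-iteration objective values phi_t, t = 0..T-1."""
    T = len(phi_values)
    return sum(gamma ** (T - 1 - t) * phi for t, phi in enumerate(phi_values))
```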

Table 14: Best average initialization values for the η0\displaystyle\eta_{0} parameter over the test set for all methods that require a grid search. The values {104, 103, 102, 101, 100, 101}\displaystyle\{10^{4},\,10^{3},\,10^{2},\,10^{1},\,10^{0},\,10^{-1}\} are evaluated, and for each fixed maximum number of iterations (10\displaystyle 10, 25\displaystyle 25, 50\displaystyle 50, and 100\displaystyle 100), the value yielding the lowest average GAP is selected.
Dataset Methods 10 iter. 25 iter. 50 iter. 100 iter.
MC-Sml-40 Bundle h. 10 10 10 10
Bundle b. 100 10 10 10
Bundle s. 100 10 10 100
Bundle c. 100 100 100 100
Descent 10 10 10 10
Adam 1 1 1 1
MC-Sml-Var Bundle h. 10 100 100 10
Bundle b. 100 100 100 100
Bundle s. 100 100 100 100
Bundle c. 100 100 100 100
Descent 100 10 10 10
Adam 1 1 1 1
MC-Big-40 Bundle h. 10 10 10 1
Bundle b. 10 10 10 10
Bundle s. 10 10 10 10
Bundle c. 10 10 10 10
Descent 10 10 10 10
Adam 1 1 1 1
MC-Big-Var Bundle h. 100 10 10 10
Bundle b. 100 100 10 10
Bundle s. 100 100 10 10
Bundle c. 100 100 100 100
Descent 10 10 10 10
Adam 1 1 1 1
GA-10-100 Bundle h. 1000 1000 1000 1000
Bundle b. 1000 1000 1000 1000
Bundle s. 1000 1000 1000 1000
Bundle c. 1000 1000 1000 1000
Descent 100 100 100 100
Adam 100 10 10 10
GA-20-400 Bundle h. 1000 1000 1000 1000
Bundle b. 1000 1000 1000 1000
Bundle s. 1000 1000 1000 1000
Bundle c. 1000 1000 1000 1000
Descent 1000 100 100 100
Adam 100 10 10 10
Model Architecture - Bundle Network

For all datasets, we use the same architecture. A first LSTM maps the feature space to 6 vectors of size 128: the means and variances used to sample three vectors, namely the hidden representations of the η parameter, of the key for the last found component, and of the query for the current iteration. These vectors are fed to three different decoders, each composed of a hidden layer of size 8·128. The outputs of these decoders are, respectively, a scalar \eta_{t} used as step size and two vectors of size 128 representing the key and the query of the current iteration. The main framework is presented with a sampling mechanism, but we found it detrimental when using softmax for the function ψ (see Section 3.4 for further details). Still, sampling improves performance when used in conjunction with sparsemax for ψ.
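A hedged PyTorch sketch of this architecture is given below; the module names, the softplus used to keep \eta_{t} positive, and the exact wiring are illustrative assumptions rather than the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

H = 128  # hidden size used throughout

class BundleNetworkCell(nn.Module):
    def __init__(self, n_features):
        super().__init__()
        # The LSTM maps the hand-crafted features to 6 vectors of size 128:
        # a (mean, log-variance) pair for each of the three latent vectors.
        self.lstm = nn.LSTM(n_features, 6 * H, batch_first=True)
        def decoder(out_dim):
            return nn.Sequential(nn.Linear(H, 8 * H), nn.ReLU(),
                                 nn.Linear(8 * H, out_dim))
        self.eta_decoder = decoder(1)    # scalar step size eta_t
        self.key_decoder = decoder(H)    # key for the last bundle component
        self.query_decoder = decoder(H)  # query for the current iteration

    def forward(self, features, state=None):
        out, state = self.lstm(features, state)
        mu, log_var = out[..., :3 * H], out[..., 3 * H:]
        # Reparameterized sampling keeps the unrolled graph differentiable.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)
        z_eta, z_key, z_query = z.split(H, dim=-1)
        eta = F.softplus(self.eta_decoder(z_eta))  # keep eta_t > 0
        return eta, self.key_decoder(z_key), self.query_decoder(z_query), state
```

With the reparameterization trick in the sampling step, gradients flow through the whole unrolled trajectory, consistent with end-to-end training via automatic differentiation.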

Features

At iteration t, we perform a human-designed feature extraction, chosen to provide information similar to that used by the heuristic η-strategies. We add further features to represent the component added to the bundle at the associated iteration, and others to characterize this entry and compare it with previously added components. The features related to iteration t, summarizing the optimization dynamics at the current iteration, are the following (an illustrative sketch of this first group is given after the lists):

  • the last step-size ηt1\displaystyle\eta_{t-1},

  • The square norm of the search direction 𝒘(t1)22\displaystyle||{\bm{w}}^{(t-1)}||^{2}_{2},

  • The square norm of the search direction weighted by \eta_{t-1}, that is \eta_{t-1}||{\bm{w}}^{(t-1)}||_{2}^{2}. This corresponds to the quadratic part of the DMP objective function,

  • The linear part in the DMP: j=1t1αjθj(t1)\displaystyle\sum_{j=1}^{t-1}\alpha_{j}\theta_{j}^{(t-1)},

  • a boolean indicating whether the quadratic part is larger than the linear part, i.e., ||{\bm{w}}^{(t-1)}||^{2}_{2}>\sum_{j=1}^{t-1}\alpha_{j}\theta_{j}^{(t-1)},

  • a boolean indicating whether the quadratic part, rescaled by a large \eta^{*}=10000, exceeds the linear part: \eta^{*}||{\bm{w}}^{(t-1)}||^{2}_{2}>\sum_{j=1}^{t-1}\alpha_{j}\theta_{j}^{(t-1)},

  • The iteration counter t\displaystyle t.

Features describing the last trial and stabilization points. These features capture information about the most recent trial point 𝝅t\displaystyle{\bm{\pi}}_{t} and stabilization point 𝝅¯t\displaystyle\bar{{\bm{\pi}}}_{t}:

  • The objective value in the last trial point ϕ(𝝅t)\displaystyle\phi({\bm{\pi}}_{t}) and in the last stabilization point ϕ(𝝅¯t)\displaystyle\phi(\bar{{\bm{\pi}}}_{t}),

  • The linearization error in the last stabilization point α¯t\displaystyle\bar{\alpha}_{t} and in the last trial point αt\displaystyle\alpha_{t},

  • The norm of the last trial point ||{\bm{\pi}}_{t}||_{2}, of the last stabilization point ||\bar{{\bm{\pi}}}_{t}||_{2}, and of the gradient at the stabilization point ||\bar{{\bm{g}}}_{t}||_{2},

  • the square norm, the mean, the variance, the minimum, and the maximum of the vector 𝒈t\displaystyle{\bm{g}}_{t},

  • the square norm, the mean, the variance, the minimum, and the maximum of the vector 𝝅t\displaystyle{\bm{\pi}}_{t}.

Features comparing the last inserted component with the ones already contained in the bundle:

  • The minimum and the maximum of the scalar products between the last gradient and the gradients in the bundle, \min_{j}({\bm{g}}_{t}^{\top}{\bm{g}}_{j}) and \max_{j}({\bm{g}}_{t}^{\top}{\bm{g}}_{j}),

  • The minimum and the maximum of the scalar products between the last point and the points in the bundle, \min_{j}({\bm{\pi}}_{t}^{\top}{\bm{\pi}}_{j}) and \max_{j}({\bm{\pi}}_{t}^{\top}{\bm{\pi}}_{j}),

  • The scalar product of the last inserted gradient and the last search direction 𝒈t𝒘(t1)\displaystyle{\bm{g}}_{t}^{\top}{\bm{w}}^{(t-1)}.
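To make the lists above concrete, here is an illustrative NumPy sketch of the iteration-level features from the first group; the array names (w_prev for {\bm{w}}^{(t-1)}, alpha and theta for the linearization errors and the Dual Master solution) are assumptions.

```python
import numpy as np

def iteration_features(eta_prev, w_prev, alpha, theta, t, eta_star=10_000.0):
    """Iteration-level features summarizing the current optimization dynamics."""
    quad = float(w_prev @ w_prev)          # ||w^(t-1)||^2
    linear = float(alpha @ theta)          # sum_j alpha_j theta_j^(t-1)
    return np.array([
        eta_prev,                          # last step size
        quad,                              # squared norm of the direction
        eta_prev * quad,                   # quadratic part of the DMP objective
        linear,                            # linear part of the DMP objective
        float(quad > linear),              # boolean comparison
        float(eta_star * quad > linear),   # comparison rescaled by eta*
        float(t),                          # iteration counter
    ])
```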

Optimizer Specifications

We use Adam as the optimizer, with a learning rate of 0.00001, gradient-norm clipping (to 5), and an exponential decay of 0.9.

GPU specifics

For training on the datasets, we use Quadro RTX 5000 accelerators with 8 GB of RAM, and we test performance on the same machine. All the variants, except for Bundle Network, are CPU-only; these experiments are run on the same machine with a QEMU Virtual CPU version 2.5+.

Appendix E Dataset generalization sparsemax and sampling

In this appendix, we provide further details on the cross-dataset generalization of the model that uses sparsemax together with the sampling mechanism, analogously to the analysis carried out in Section 3.5 for softmax without sampling.

Table 15: Gap of the model tested on MC-Sml-40 and trained on different datasets. The training dataset is shown as the first element of each row. The last row shows the performance of the best baseline obtained with grid search.
Training Dataset 10 iter. 25 iter. 50 iter. 100 iter.
GAP time GAP time GAP time GAP time
MC-Sml-40 10.51 0.126 2.07 0.214 0.60 0.363 0.18 0.662
MC-Sml-Var 9.87 0.128 1.75 0.218 0.54 0.371 0.18 0.683
MC-Big-40 21.36 0.140 5.23 0.239 1.78 0.41 0.52 0.755
MC-Big-Var 9.82 0.128 1.99 0.221 0.60 0.374 0.19 0.683
GA-10-100 69.18 0.128 69.18 0.216 69.18 0.366 69.18 0.667
GA-20-400 96.08 0.128 96.08 0.218 96.08 0.368 96.08 0.673
Best Baseline 11.96 0.048 4.16 0.128 1.30 0.293 0.31 0.816
Table 16: Gap of the model tested on MC-Sml-Var and trained on different datasets. The training dataset is shown as the first element of each row. The last row shows the performance of the best baseline obtained with grid search.
Training Dataset 10 iter. 25 iter. 50 iter. 100 iter.
GAP time GAP time GAP time GAP time
MC-Sml-40 22.9 0.161 6.62 0.301 2.6 0.54 1.02 1.027
MC-Sml-Var 14.02 0.158 3.80 0.298 1.56 0.531 0.64 1.007
MC-Big-40 39.34 0.171 15.12 0.327 6.7 0.59 2.97 1.119
MC-Big-Var 14.27 0.160 3.64 0.302 1.27 0.545 0.49 1.042
GA-10-100 59.75 0.160 59.75 0.299 59.75 0.540 59.75 1.031
GA-20-400 66.35 0.163 66.35 0.307 66.35 0.552 66.35 1.052
Best Baseline 12.79 0.108 4.33 0.29 1.66 0.635 0.58 1.503
Table 17: Gap of the model tested on MC-Big-40 and trained on different datasets. The training dataset is shown as the first element of each row. The last row shows the performance of the best baseline obtained with grid search.
Training Dataset 10 iter. 25 iter. 50 iter. 100 iter.
GAP time GAP time GAP time GAP time
MC-Sml-40 41.77 0.154 7.62 0.276 1.89 0.484 0.67 0.915
MC-Sml-Var 91.76 0.152 63.78 0.279 37.85 0.487 8.22 0.919
MC-Big-40 13.27 0.150 3.50 0.272 1.40 0.479 0.51 0.905
MC-Big-Var 44.89 0.154 7.19 0.280 2.17 0.492 1.09 0.931
GA-10-100 99.8 0.140 99.8 0.252 99.8 0.445 99.8 0.84
GA-20-400 100.0 0.156 100.0 0.281 100.0 0.493 100.0 0.931
Best Baseline 19.34 0.076 5.64 0.203 2.08 0.447 0.78 1.110
Table 18: Gap of the model tested on MC-Big-Var and trained on different datasets. The training dataset is shown as the first element of each row. The last row shows the performance of the best baseline obtained with grid search.
Training Dataset 10 iter. 25 iter. 50 iter. 100 iter.
GAP time GAP time GAP time GAP time
MC-Sml-40 24.9 0.162 6.58 0.307 2.35 0.555 0.89 1.057
MC-Sml-Var 29.88 0.252 14.63 0.46 7.67 0.822 1.85 1.546
MC-Big-40 32.67 0.174 12.04 0.33 5.31 0.598 2.30 1.141
MC-Big-Var 18.94 0.162 4.24 0.305 1.47 0.548 0.61 1.045
GA-10-100 67.42 0.165 67.42 0.31 67.42 0.561 67.42 1.069
GA-20-400 74.69 0.163 74.69 0.306 74.69 0.551 74.69 1.051
Best Baseline 17.98 0.112 6.34 0.289 2.66 0.627 0.93 1.477
Table 19: Gap of the model tested on GA-10-100 and trained on different datasets. The training dataset is shown as the first element of each row. The last row shows the performance of the best baseline obtained with grid search.
Training Dataset 10 iter. 25 iter. 50 iter. 100 iter.
GAP time GAP time GAP time GAP time
MC-Sml-40 2.55 0.123 1.79 0.208 0.85 0.35 0.33 0.636
MC-Sml-Var 2.42 0.121 1.4 0.202 0.65 0.341 0.56 0.613
MC-Big-40 2.67 0.119 2.12 0.197 1.4 0.334 0.69 0.603
MC-Big-Var 2.5 0.120 1.61 0.203 0.9 0.339 0.59 0.609
GA-10-100 0.92 0.122 0.70 0.201 0.68 0.34 0.68 0.609
GA-20-400 0.37 0.122 0.11 0.207 0.06 0.352 0.05 0.636
Best Baseline 0.1893 0.071 0.0156 0.211 0.0014 0.601 0.0006 1.596
Table 20: Gap of the model tested on GA-20-400 and trained on different datasets. The training dataset is shown as the first element of each row. The last row shows the performance of the best baseline obtained with grid search.
Training Dataset 10 iter. 25 iter. 50 iter. 100 iter.
GAP time GAP time GAP time GAP time
MC-Sml-40 2.35 0.253 1.9 0.536 1.14 1.004 0.39 1.943
MC-Sml-Var 2.27 0.238 1.67 0.504 0.76 0.937 0.5 1.805
MC-Big-40 2.42 0.237 2.08 0.504 1.58 0.949 0.88 1.836
MC-Big-Var 2.32 0.237 1.79 0.503 1.01 0.949 0.6 1.84
GA-10-100 1.17 0.252 1.16 0.537 1.16 1.005 1.16 1.943
GA-20-400 0.4 0.218 0.13 0.461 0.07 0.86 0.06 1.651
Best Baseline 0.1129 0.433 0.0193 1.459 0.0048 4.256 0.0020 14.459