[1] Department of Computer Science, Università di Pisa, 3 Largo B. Pontecorvo, 56127 Pisa, Italy
[2] Laboratoire d’Informatique de Paris Nord, Université Sorbonne Paris Nord, 99 avenue Jean-Baptiste Clément, 93430 Villetaneuse, France
[3] Department of Information Systems, Data Analytics and Operations, ESSEC Business School, 3 avenue Bernard Hirsch, 95000 Cergy, France

Bundle Network: a Machine Learning-Based Bundle Method

Francesca Demelas demelas@lipn.univ-paris13.fr    Antonio Frangioni antonio.frangioni@unipi.it    Mathieu Lacroix lacroix@lipn.univ-paris13.fr    Joseph Le Roux leroux@lipn.univ-paris13.fr    Emiliano Traversi emiliano.traversi@essec.edu    Roberto Wolfler Calvo wolfler@lipn.univ-paris13.fr
Abstract

This paper presents Bundle Network, a learning-based algorithm inspired by the Bundle Method for convex non-smooth minimization problems. Unlike classical approaches that rely on heuristic tuning of a regularization parameter, our method automatically learns to adjust it from data. Furthermore, we replace the iterative resolution of the optimization problem that provides the search direction—traditionally computed as a convex combination of gradients at visited points—with a recurrent neural model equipped with an attention mechanism. By leveraging the unrolled graph of computation, our Bundle Network can be trained end-to-end via automatic differentiation. Experiments on Lagrangian dual relaxations of the Multi-Commodity Network Design and Generalized Assignment problems demonstrate that our approach consistently outperforms traditional methods relying on grid search for parameter tuning, while generalizing effectively across datasets.


Keywords: Attention Network, Bundle Method, Long Short-Term Memory, Unrolling.

1 Introduction

This work presents a machine learning-based approach to tackle convex non-smooth minimization problems, inspired by bundle-type methods [42, 41], i.e., iterative algorithms that generalize the subgradient method in that the search direction is a convex combination of the subgradients at the visited points, obtained by solving an optimization problem called the Master problem. At each iteration, a crucial scalar parameter related to the regularization in the Master problem must be properly chosen to obtain optimal performance. Although there are many variants of the Bundle method [22], in this work we only consider the best-known proximal version, where the regularization parameter, denoted by $\eta$, is the weight of the quadratic regularization term in the objective of the Master problem, penalizing the distance of the new point from the best point found so far. This parameter significantly influences the next iterate produced at each iteration in two different ways:

  1. by changing the convex multipliers that are used to form the search direction out of the currently available bundle of (approximate) subgradient information;

  2. by acting as the stepsize along said direction.

Traditionally, selecting this parameter relies on heuristic approaches, referred to as $\eta$-strategies, often requiring an extensive grid search to achieve high performance. By contrast, relatively simple diminishing stepsize rules are known to work well in subgradient-based approaches, both in theory and in practice. Thus, one may suspect that the “double role” of the $\eta$ parameter is what makes its online selection particularly challenging. Indeed, roughly speaking, the smaller $\eta$, the more “local” the direction is, taking into account only information computed close to the current iterate. However, towards termination, the algorithm should have accrued enough information that a “global” direction pointing towards the minimum of the model is likely preferable, which makes standard diminishing rules impractical.

In order to improve both the performance and the robustness of existing approaches, we propose to delegate the selection of the convex multipliers and the choice of the stepsize to a novel objective-based learning approach. Since our approach takes decisions dynamically during algorithm execution, Reinforcement Learning [55] techniques could be, and in fact have been, successfully used [47, 57]. However, Reinforcement Learning is mainly useful when differentiating the execution is not possible, and it has many drawbacks, such as being often sample-inefficient and suffering from high variance [30]. We rather propose to differentiate our algorithm via an unrolling approach [48], yielding a more direct optimization process with lower-variance gradient estimators by exploiting the internal solver dynamics [53, 38]. A key challenge in this approach is to keep the structure of the Bundle method while differentiating through its execution. To address this problem, we use a heuristic approach whereby we replace the non-smooth components of the method with smoother approximations based on a neural network model. Specifically, we propose a neural network that approximates the search direction and the step size using information from the current bundle, replacing both the $\eta$-strategy heuristic and the (dual) Master problem (4). This neural network integrates an approach inspired by the Bundle method, as outlined in Algorithm 1, incorporating soft updates for the stabilization point, thereby enabling end-to-end differentiation using Automatic Differentiation [4] techniques.

1.1 Literature Review

Many recent studies have leveraged machine learning to tackle Combinatorial Optimization problems [7]. In this work, we focus on amortization techniques [1], a research domain sometimes referred to as Learning-to-Optimize [9, 44, 10], and the associated methods are called semi-amortized optimization methods. In amortization techniques, the goal is to reduce the computational cost of iterative optimization methods by learning a direct mapping from inputs, defining problem instances, to approximate solutions. This often involves training an inference network to quickly produce a high-quality approximate solution that would otherwise require costly iterative optimization. Many amortized frameworks that use an iterative structure for prediction require some unrolling technique [48] for back-propagation, i.e., reformulating the algorithm execution to enable gradient computation in the backward pass.

Previous works designed to find the best parameters for an optimization algorithm have focused on finding the best step size for stochastic gradient descent [54] in the context of Machine Learning, and are thus seen as instances of meta-learning. While early works considered simple evolutionary rules [6, 5], more recent ones are based on neural networks [50]. Some works [60, 32] jointly train with gradient descent both the network for the learning problem and the one for the meta-learning task. However, these approaches still struggle to scale to modern architectures with tens of thousands of parameters. To address this limitation, Andrychowicz et al. [2] introduce an LSTM-based optimizer that operates coordinate-wise on the network parameters. This design enables different behavior on each coordinate by using separate activation functions for each parameter, while maintaining the optimizer’s invariance to parameter order, as the same update rule is independently applied to each coordinate.

All of the above approaches fundamentally adopt the structure of gradient descent as the base for their learning framework. There is growing interest in applying the Bundle method to optimization problems arising from the machine learning community [39]. Variants of Proximal Bundle methods [37] are used to optimize the complex, non-smooth loss functions often encountered in machine learning [49]. A specialized asynchronous Bundle method is presented for solving distributed learning problems [8]. A further connection with the Bundle method can be found in [33], where a meta-learning approach is proposed to coordinate learning in master-slave distributed systems. This method aggregates the gradients for gradient descent, improving the scalability of distributed learning.

Recent research on Bundle methods for non-convex problems [18, 17] opens the possibility of using Bundle variants for machine learning problems. Bundle methods have been applied to improve the training of SVMs [15], especially for large-scale problems [34, 58]. Extensions of Bundle methods have also been proposed for multiclass SVMs, where Bundle methods can handle a larger number of constraints efficiently [20].

The benefits of using a proximal operator when optimizing a convex non-smooth function have already been demonstrated [45]. Moreover, [61] shows that proximal operators are also advantageous for meta-learning tasks; the authors do not consider the Bundle method, but rather unroll proximal gradient descent.

In [43], a method is presented that approximates the resolution of a large-scale linear programming problem based on the associated Lagrangian function, providing iterative primal and dual updates using a neural network trained with unrolling techniques.

1.2 Bundle Method

The Bundle method (BM) [42, 41] is an iterative method that solves problems of type

\min\{\,\phi(\bm{\pi})\,:\,\bm{\pi}\in\Pi\,\} \qquad (1)

where $\phi:\mathbb{R}^{m}\rightarrow\mathbb{R}$ is a finite-valued convex non-smooth function and $\Pi$ is a “simple” convex set. The Lagrangian Dual is a relevant application of (1), discussed in more detail in Section 3.1. In the rest of the paper we restrict ourselves to the case in which $\Pi=\mathbb{R}^{m}$ or $\Pi=\mathbb{R}^{m}_{+}=\{\bm{\pi}\in\mathbb{R}^{m}\,:\,\bm{\pi}\geq 0\}$, which correspond to the Lagrangian Dual [21]—that we use in our computational testing—when one relaxes, respectively, equality or inequality constraints. For the sake of presentation we only consider the case $\Pi=\mathbb{R}^{m}$ here, but the approach can be simply extended, as discussed at the end of Section 3.1, to the case $\Pi=\mathbb{R}^{m}_{+}$.

Bundle methods iteratively optimize an approximation $\phi_{\beta}$ of the function $\phi$, constructed using the bundle $\beta$ of first-order information collected so far, plus some regularization. A common choice for $\phi_{\beta}$ is the Cutting Plane model, i.e., the natural polyhedral lower approximation built using the very definition of subgradients. For reasons to become apparent shortly, it is expedient to fix some reference point $\bar{\bm{\pi}}$; then, at a given iteration $t$ one has the set of previous iterates $\{\bm{\pi}_{i}\}_{i=1}^{t}$ and can define the bundle $\beta_{t}$ as the set of pairs $\{(\bm{g}_{i},\alpha_{i})\}_{i=1}^{t}$, where $\bm{g}_{i}\in\partial\phi(\bm{\pi}_{i})$ is a subgradient obtained at the given iterate and $\alpha_{i}$ is the corresponding linearization error $\alpha_{i}=\phi(\bar{\bm{\pi}})-\phi(\bm{\pi}_{i})-\bm{g}_{i}^{\top}(\bar{\bm{\pi}}-\bm{\pi}_{i})\geq 0$ (non-negativity being implied by convexity of $\phi$). Note that the $\alpha_{i}$ depend on $\bar{\bm{\pi}}$, but we do not stress this notationally since $\bar{\bm{\pi}}$ will always be clear from the context. In fact, $\beta_{t}$ may contain a strict subset of the thusly computed pairs (in case of bundle selection) and even pairs obtained differently (in case of bundle aggregation), but this is immaterial to our discussion, and we leave these details out for simplicity of presentation. With this information, the Cutting Plane model [51] is

\phi_{\bar{\bm{\pi}},\beta_{t}}(\bm{d})=\max\{\,\bm{g}_{i}^{\top}\bm{d}-\alpha_{i}\,:\,(\bm{g}_{i},\alpha_{i})\in\beta_{t}\,\}\;. \qquad (2)

It is easy to see that $\phi(\bar{\bm{\pi}}+\bm{d})-\phi(\bar{\bm{\pi}})\geq\phi_{\bar{\bm{\pi}},\beta_{t}}(\bm{d})$, i.e., the Cutting Plane model is a polyhedral lower approximation of $\phi$ when translated w.r.t. $(\bar{\bm{\pi}},\phi(\bar{\bm{\pi}}))$. Thus, however $\bar{\bm{\pi}}$ is chosen, the “unstabilized” Master problem

\min\{\,\phi_{\bar{\bm{\pi}},\beta_{t}}(\bm{d})\,:\,\bar{\bm{\pi}}+\bm{d}\in\Pi\,\}

would provide a lower bound on the optimal value of Problem (1); however, such a problem can easily be unbounded below. Even if it is not, its optimal solution may be a very poor approximation of the true one, since $\phi_{\bar{\bm{\pi}},\beta_{t}}$ may be a poor model of $\phi$ “far” from the points in which the function has been computed. Because of this, one has to introduce some stabilization term that avoids wandering away from the region of space where $\phi_{\bar{\bm{\pi}},\beta_{t}}$ is an appropriate model. The most common choice, corresponding to the proximal Bundle method, is the squared Euclidean distance from the reference point $\bar{\bm{\pi}}$, which is therefore called the (current) stabilization centre; this leads to the stabilized (primal) Master problem

(MP_{t})\qquad\min\{\,\phi_{\bar{\bm{\pi}},\beta_{t}}(\bm{d})+\|\bm{d}\|_{2}^{2}\,/\,(2\eta_{t})\,:\,\bar{\bm{\pi}}+\bm{d}\in\Pi\,\} \qquad (3)

whose optimal solution $\bm{d}^{*}$ is used to choose the next iterate as $\bm{\pi}=\bar{\bm{\pi}}+\bm{d}^{*}$. Solving (3) is equivalent to solving its dual, which for the case $\Pi=\mathbb{R}^{m}$ reads

\min\quad \frac{\eta_{t}}{2}\Big\|\sum_{(\bm{g}_{i},\alpha_{i})\in\beta_{t}}\bm{g}_{i}\theta_{i}\Big\|_{2}^{2}+\sum_{(\bm{g}_{i},\alpha_{i})\in\beta_{t}}\alpha_{i}\theta_{i} \qquad (4a)
\text{s.t.}\quad \sum_{(\bm{g}_{i},\alpha_{i})\in\beta_{t}}\theta_{i}=1\;,\;\;\bm{\theta}\geq 0 \qquad (4b)

while the case $\Pi=\mathbb{R}^{m}_{+}$ adds a new set of variables and simple constraints. The optimal solution $\bm{d}^{(t)}$ of (3) and the optimal solution $\bm{\theta}^{(t)}$ of (4) are linked by the relationship $\bm{d}^{(t)}=-\eta_{t}\bm{w}_{t}$, where $\bm{w}_{t}=\sum_{(\bm{g}_{i},\alpha_{i})\in\beta_{t}}\bm{g}_{i}\theta_{i}^{(t)}$ is the aggregated subgradient computed out of the current bundle. This is relevant in that $\bm{w}_{t}$ is a $\sigma_{t}$-subgradient of $\phi$ at $\bar{\bm{\pi}}$, where $\sigma_{t}=\sum_{(\bm{g}_{i},\alpha_{i})\in\beta_{t}}\alpha_{i}\theta_{i}^{(t)}$ is the aggregated linearization error. For the case $\Pi=\mathbb{R}^{m}_{+}$, $\bm{d}^{(t)}$ is basically $-\bm{w}_{t}$ projected over the set of the active non-negativity constraints $\bm{\pi}\geq 0$; we refer to [40, Chapter 4] for further details. Often, solving the Dual Master problem (4) is more convenient than solving the primal, due to the existence of ad-hoc approaches [23]. The complete algorithm is described in Algorithm 1.
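As an illustration, the following sketch (ours, not the authors' implementation) solves a small instance of the Dual Master problem (4) with an off-the-shelf solver and recovers the aggregated subgradient, the aggregated linearization error, and the direction; the names G, alpha, and eta are our own.

```python
import numpy as np
from scipy.optimize import minimize

def solve_dual_mp(G, alpha, eta):
    """G: (n, m) matrix whose rows are the subgradients g_i;
    alpha: (n,) linearization errors; eta: regularization parameter."""
    n = G.shape[0]
    obj = lambda th: 0.5 * eta * np.sum((th @ G) ** 2) + alpha @ th
    cons = ({'type': 'eq', 'fun': lambda th: th.sum() - 1.0},)
    res = minimize(obj, np.full(n, 1.0 / n), method='SLSQP',
                   bounds=[(0.0, 1.0)] * n, constraints=cons)
    theta = res.x
    w = theta @ G            # aggregated subgradient w_t
    sigma = alpha @ theta    # aggregated linearization error sigma_t
    return theta, w, sigma

rng = np.random.default_rng(0)
G, alpha, eta = rng.normal(size=(4, 6)), rng.random(4), 0.5
theta, w, sigma = solve_dual_mp(G, alpha, eta)
d = -eta * w                 # optimal direction of the primal MP (3)
```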

Algorithm 1 Bundle method pseudo-code
1: Choose $\bm{\pi}_{0},\eta_{0},m$
2: $\bar{\bm{\pi}}_{0}\leftarrow\bm{\pi}_{0}$ \triangleright The initial stabilization point coincides with the initial point
3: $(\bm{g}_{0},\alpha_{0})\leftarrow(\partial\phi(\bm{\pi}_{0}),0)$; $\beta_{0}\leftarrow\{(\bm{g}_{0},\alpha_{0})\}$; $t\leftarrow 0$
4: while stopping criterion not met do
5:   $(\bm{d}^{(t)},\bm{\theta}^{(t)})\leftarrow$ Solve $MP(\eta_{t},\beta_{t})$ \triangleright Solve the Master problem (3)–(4)
6:   $\bm{\pi}_{t+1}\leftarrow\bar{\bm{\pi}}_{t}+\bm{d}^{(t)}$ \triangleright Compute the new trial point
7:   $(\bm{g}_{t+1},\alpha_{t+1})\leftarrow(\partial\phi(\bm{\pi}_{t+1}),\,\phi(\bar{\bm{\pi}}_{t})-\phi(\bm{\pi}_{t+1})-\bm{g}_{t+1}^{\top}(\bar{\bm{\pi}}_{t}-\bm{\pi}_{t+1}))$ \triangleright Compute the new subgradient and linearization error
8:   $\beta_{t+1}\leftarrow\beta_{t}\cup\{(\bm{g}_{t+1},\alpha_{t+1})\}$ \triangleright Update the Bundle
9:   if $\phi(\bm{\pi}_{t+1})-\phi(\bar{\bm{\pi}}_{t})\leq m\,(\phi_{\beta_{t}}(\bm{\pi}_{t+1})-\phi(\bar{\bm{\pi}}_{t}))$ then
10:    $\bar{\bm{\pi}}_{t+1}\leftarrow\bm{\pi}_{t+1}$; Update $\alpha_{i}\;\forall(\bm{g}_{i},\alpha_{i})\in\beta_{t+1}$; Update $\eta_{t}$ \triangleright Serious Step
11:   else
12:    $\bar{\bm{\pi}}_{t+1}\leftarrow\bar{\bm{\pi}}_{t}$; Update $\eta_{t}$ \triangleright Null Step
13:   end if
14:   $\beta_{t+1}\leftarrow$ Remove outdated components from $\beta_{t+1}$; $t\leftarrow t+1$
15: end while

The next trial point $\bm{\pi}_{t+1}$ is computed by making a step of $\eta_{t}$ along the direction, which is either the opposite of the aggregated subgradient $\bm{w}_{t}$ or easily obtained out of it. Computing $\phi(\bm{\pi}_{t+1})$ and its subgradient $\bm{g}_{t+1}\in\partial\phi(\bm{\pi}_{t+1})$ enriches the bundle. The stabilization center is changed to $\bm{\pi}_{t+1}$, called a Serious Step, if a “substantial decrease” is obtained, measured with respect to the descent $\phi_{\beta_{t}}(\bm{\pi}_{t+1})$ predicted by the model, with $m<1$ a fixed parameter (usually “small”, say $0.001$). Otherwise, i.e., if no (significant) progress is made, the stabilization center is not changed, which is referred to as a Null Step (NS); yet, new information has been added to the bundle, which will lead to a different direction $\bm{d}^{(t+1)}$ in the next iteration. $\eta_{t}$ can (although it does not necessarily need to) also be adjusted, typically in different ways depending on the type of step taken. The stopping criterion in line 4 can be $\eta^{*}\|\bm{w}_{t}\|_{2}^{2}+\sigma_{t}\leq\epsilon\max\{0,\phi(\bar{\bm{\pi}}_{t})\}$, where $\eta^{*}$ is “large w.r.t. $\eta_{t}$”; this corresponds to both $\bm{w}_{t}\approx 0$ and $\sigma_{t}\approx 0$, proving that $\bar{\bm{\pi}}_{t}$ is an approximate minimum. At the end of each iteration, efficient implementations of the Bundle method may delete some old pairs from $\beta$ to lessen the Master problem cost. A common strategy is removing components that have not been used in the (dual) optimal solution over the last $K$ iterations (a quite conservative strategy takes $K\sim 20$ [25]).
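For concreteness, here is a minimal sketch of the serious/null step test and of the stopping criterion just described; m_par, eta_star, and eps are illustrative values, not the paper's settings.

```python
import numpy as np

def step_type(phi_new, phi_center, model_new, m_par=0.001):
    """Serious step if the actual decrease is at least a fraction m_par
    of the decrease predicted by the cutting-plane model."""
    if phi_new - phi_center <= m_par * (model_new - phi_center):
        return "serious"
    return "null"

def should_stop(w, sigma, phi_center, eta_star=1e4, eps=1e-6):
    # both w ~ 0 and sigma ~ 0 certify an approximate minimum at the center
    return eta_star * np.dot(w, w) + sigma <= eps * max(0.0, phi_center)
```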

2 Machine-Learning Based Bundle Method

The approach presented in this work is heavily inspired by Algorithm 1, but with two major differences:

  1. we replace the solution of the Master problem, which is a convex quadratic program that may become rather costly to solve as the size of the bundle grows, with the prediction of a neural network; and

  2. we decouple the role of $\eta$ in the formation of $\bm{d}$, via the weights $\bm{\theta}$, from its role as a stepsize along it, by again predicting its value with a neural network.

To this effect, we need to parametrize neural networks so that our two predictors, mapping input instances and partial bundles to the convex multipliers $\bm{\theta}$ and the stepsize $\eta$ respectively, are accurate. This parametrization is data-driven: we find the parameters that minimize a loss function representing the value computed by the Bundle Network over a set of examples.

2.1 Learning Framework

Let $\mathcal{D}$ be a dataset of instances $\iota$, and denote by $\bm{\pi}_{0}\equiv\bm{\pi}_{0,\iota}$ the given initial point for the associated instance, possibly equal to zero. To mathematically define the learning problem, the number of iterations $T$ is fixed during the training process. Once the model is trained, the number of iterations is no longer a consideration at inference time [44]. In Section 2.3 we describe in detail the model $\bm{\pi}_{T}(\bm{W})\equiv\bm{\pi}_{T}(\bm{W};\bm{\pi}_{0},\iota)$, where $\bm{W}$ are the parameters. We may consider finding the parameters $\bm{W}$ that, on average over instances drawn from $\mathcal{D}$, minimize the loss defined by just the objective value $\mathcal{L}(\bm{W})=\phi(\bm{\pi}_{T}(\bm{W}))$ after $T$ iterations:

\min_{\bm{W}}\;\mathbb{E}_{\iota\sim\mathcal{D}}[\mathcal{L}(\bm{W})]\;.

However, Learning-to-Optimize approaches generally consider the whole trajectory, since this encourages optimizing the contribution of each step. Following this approach, we rather aim at minimizing the function

\mathcal{L}_{\bm{\gamma}}(\bm{W})=\sum_{t=1}^{T}\gamma_{t}\,\phi(\bm{\pi}_{t}(\bm{W})) \qquad (5)

where $\bm{\pi}_{t}(\bm{W})$ denotes the trial point at iteration $t$, which is computed by a neural network. We use $\gamma_{t}=\gamma^{T-t}$, where $0<\gamma\leq 1$ is a hyperparameter. Hence, the learning problem can be formulated as

\min_{\bm{W}}\;\mathbb{E}_{\iota\sim\mathcal{D}}[\mathcal{L}_{\bm{\gamma}}(\bm{W})]\;. \qquad (6)

In the next sections, we specify how to differentiate the loss function using the chain rule, and the structure of the model $\bm{\pi}_{t}(\bm{W})$ used as a machine learning surrogate of the Master problem in the Bundle method.

2.2 Loss Function

We want to compute a subgradient $\partial_{\bm{W}}\mathcal{L}_{\bm{\gamma}}$ of the loss with respect to the network parameters. $\mathcal{L}_{\bm{\gamma}}(\bm{W})$ is a linear combination of the function values $\phi$ evaluated at different points along the trajectory: consequently, its subgradient with respect to $\bm{W}$ can be written as the weighted sum of subgradients of the function $\phi$ computed at different points. Thus we only need to compute subgradients $\partial_{\bm{W}}\phi(\bm{\pi}_{t}(\bm{W}))$ for each $t\leq T$.

We apply the chain rule to compute $\partial_{\bm{W}}\phi(\bm{\pi}_{t}(\bm{W}))$ as $\partial_{\bm{\pi}_{t}}\phi(\bm{\pi}_{t}(\bm{W}))\cdot\partial_{\bm{W}}\bm{\pi}_{t}(\bm{W})$. Note that $\partial_{\bm{\pi}_{t}}\phi(\bm{\pi}_{t}(\bm{W}))$ is the subgradient of $\phi$ at $\bm{\pi}_{t}(\bm{W})$, which exists since $\phi$ is convex, and in fact corresponds to the $\bm{g}_{t}$ used in the standard version of the Bundle method. As we construct $\bm{\pi}_{t}(\bm{W})$ to be differentiable, $\partial_{\bm{W}}\bm{\pi}_{t}(\bm{W})$ corresponds to the gradient $\nabla_{\bm{W}}\bm{\pi}_{t}(\bm{W})$, which can be obtained using automatic differentiation techniques. The algorithm structure for this computation is discussed in detail in the next section.

Putting it all together, we obtain the following gradient approximation:

\partial_{\bm{W}}\mathcal{L}_{\bm{\gamma}}(\bm{W})=\sum^{T}_{t=1}\gamma^{T-t}\,\bm{g}_{t}^{\top}\nabla_{\bm{W}}\bm{\pi}_{t}(\bm{W})\;.
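In an automatic-differentiation framework this gradient can be obtained through a surrogate loss in which each $\bm{g}_{t}$ is treated as a constant. The sketch below is our PyTorch illustration, not the paper's code: phi stands in as a differentiable scalar function (in the paper, $\bm{g}_{t}$ comes from the Lagrangian subproblem oracle), and the surrogate back-propagates exactly the weighted sum above.

```python
import torch

def unrolled_surrogate_loss(pis, phi, gamma=0.9):
    """pis: list of trial points pi_t(W) produced by the unrolled model,
    each part of the computation graph of the parameters W."""
    T = len(pis)
    loss = 0.0
    for t, pi in enumerate(pis, start=1):
        # local subgradient g_t = d phi / d pi_t, detached from W
        g = torch.autograd.grad(phi(pi), pi, retain_graph=True)[0].detach()
        # (g * pi).sum() back-propagates gamma^{T-t} g_t^T (d pi_t / d W)
        loss = loss + gamma ** (T - t) * (g * pi).sum()
    return loss  # loss.backward() accumulates the gradient approximation
```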

2.3 Unrolled Model

In this section, we provide further details on the structure of the machine learning model used to compute the trial point $\bm{\pi}_{t}(\bm{W})$ at iteration $t$. We keep the structure of the Bundle method mostly unchanged, modifying only some operations in order to make them differentiable: the choice of the step size, the solution of Problem (4), and the update of the stabilization center. The result is presented in Algorithm 3: since it only employs differentiable operations, we can use standard automatic differentiation tools.

At each iteration, we need to provide a feature vector to the neural network. We do not back-propagate through the feature extraction, which is made using hand-crafted features described in detail in Appendix D. These features are fed as input to a neural network, which outputs a vector $\bm{\theta}^{(t)}\in\Delta^{t}=\{\bm{\theta}\in[0,1]^{t+1}\,|\,\sum_{i=1}^{t+1}\theta_{i}=1\}$ in the $t$-dimensional simplex and a step size $\eta_{t}$. More details on the network architecture can be found in Section 3.

In the standard Bundle method, the stabilization center is updated if the condition $\phi(\bm{\pi}_{t+1})<\phi(\bar{\bm{\pi}}_{t})$ is satisfied. However, this is a piecewise-constant operation, and therefore has zero derivative almost everywhere. Denoting by $\bm{r}=\arg\min(\phi(\bm{\pi}_{t+1}),\phi(\bar{\bm{\pi}}_{t}))$ the one-hot encoding equal to $(1,0)$ if $\phi(\bm{\pi}_{t+1})\leq\phi(\bar{\bm{\pi}}_{t})$ and $(0,1)$ otherwise, the update of the stabilization point can be rewritten as

\bm{r}^{\top}(\bm{\pi}_{t+1},\bar{\bm{\pi}}_{t})=r_{1}\bm{\pi}_{t+1}+r_{2}\bar{\bm{\pi}}_{t}\;.

This corresponds to choosing a vertex of the two-dimensional simplex. A common strategy in machine learning to obtain smoother approximations of the $\arg\min$ operator is to replace it with a soft version, the softmin [29, pp. 180-184]. This leads to considering

\bm{r}=\mathbf{softmin}(\phi(\bm{\pi}_{t+1}),\phi(\bar{\bm{\pi}}_{t}))

and the updates can be performed as previously, i.e., choosing a possibly different convex combination of the two vectors $\bm{\pi}_{t+1}$ and $\bar{\bm{\pi}}_{t}$, which will not necessarily be a vertex of the two-dimensional simplex, but a point inside it.
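A minimal sketch of this soft update follows (our illustration; the temperature tau is our assumption, the paper uses the plain softmin):

```python
import torch

def soft_center_update(pi_new, pi_bar, phi_new, phi_bar, tau=1.0):
    """phi_new, phi_bar: 0-dim tensors; pi_new, pi_bar: candidate points."""
    # softmin over the two values: the smaller phi gets the larger weight,
    # and the resulting combination stays differentiable
    r = torch.softmax(-torch.stack([phi_new, phi_bar]) / tau, dim=0)
    return r[0] * pi_new + r[1] * pi_bar   # a point inside the 2-simplex
```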

It is important to notice that we do not differentiate through the operation computing the subgradient in line 11, as obtaining exact derivatives would require the function $\phi$ to be twice differentiable. It is possible to approximate this contribution, but this is not explicitly considered in this work; this strategy has been commonly pursued in other works, for example [2]. In other words, the bundle information fed to the neural network is considered as a constant, even though it is produced by a gradient computation. This is similar to the way recurrent language models do not backpropagate through the previously generated words, but rather through the previous latent state [3].

Algorithm 3 Bundle Network pseudo-code
1: Choose $\bm{\pi}_{0},T$
2: $\bar{\bm{\pi}}_{0}\leftarrow\bm{\pi}_{0}$ \triangleright Initialize stabilization point
3: $(\bm{g}_{0},\alpha_{0},v_{0})\leftarrow(\partial\phi(\bm{\pi}_{0}),0,\phi(\bm{\pi}_{0}))$
4: $\beta_{0}\leftarrow\{(\bm{g}_{0},\alpha_{0})\}$
5: for $t=1,\cdots,T$ do
6:   $\varphi_{t}\leftarrow$ features_extraction$(\beta_{t})$ \triangleright We do not back-propagate through this operation
7:   $\eta_{t},\bm{\delta}^{(t)}\leftarrow nn(\varphi_{t})$ \triangleright Output of a neural network
8:   $\bm{\theta}^{(t)}\leftarrow\psi(\bm{\delta}^{(t)})$ \triangleright Approximate solution of the Dual Master problem (4)
9:   $\bm{w}^{(t)}\leftarrow\sum_{i=0}^{|\beta_{t}|}\bm{g}_{i}\theta_{i}^{(t)}$ \triangleright New trial direction
10:   $\bm{\pi}_{t+1}\leftarrow\sigma(\bar{\bm{\pi}}_{t}+\eta_{t}\bm{w}^{(t)})$ \triangleright Compute new trial point
11:   $(\bm{g}_{t+1},\alpha_{t+1},v_{t+1})\leftarrow(\partial\phi(\bm{\pi}_{t+1}),\,\phi(\bar{\bm{\pi}}_{t})-\phi(\bm{\pi}_{t+1})-\bm{g}_{t+1}^{\top}(\bar{\bm{\pi}}_{t}-\bm{\pi}_{t+1}),\,\phi(\bm{\pi}_{t+1}))$ \triangleright Subgradient and linearization error
12:   $\beta_{t+1}\leftarrow\beta_{t}\cup\{(\bm{g}_{t+1},\alpha_{t+1},v_{t+1})\}$ \triangleright Update Bundle
13:   $\bar{\bm{\pi}}_{t+1}\leftarrow\mathbf{softmin}(\phi(\bm{\pi}_{t+1}),\phi(\bar{\bm{\pi}}_{t}))\odot(\bm{\pi}_{t+1},\bar{\bm{\pi}}_{t})$ \triangleright Smooth update of the stabilization point
14:   Update $\alpha_{i}\;\forall i=0,\cdots,|\beta_{t}|$ \triangleright Update the linearization errors
15: end for
Algorithm 3: Machine learning-based version of the Bundle method (Algorithm 1) that substitutes the resolution of the Master problem and the non-continuous operations with smoother versions based on neural networks.

2.3.1 Step and Direction Architecture

This section provides further details on the architecture of the machine learning model, schematically presented in Figure 1, which is used to predict $\eta_{t}$ and $\bm{\theta}^{(t)}$.

Given the features $\varphi_{t}\in\mathbb{R}^{d}$ at the current bundle $\beta_{t}$, a Recurrent Neural Network [13, 11, 12], specifically a Long Short-Term Memory (LSTM [31]) layer, maps this vector to a hidden space by predicting six vectors of size $h$. We denote these vectors as $\bm{\mu}_{q_{t}},\bm{\sigma}_{q_{t}},\bm{\mu}_{k_{t}},\bm{\sigma}_{k_{t}},\bm{\mu}_{\eta_{t}},\bm{\sigma}_{\eta_{t}}\in\mathbb{R}^{h}$; they represent the means and variances used to extract, by sampling from a Gaussian distribution, the hidden representations of the keys and queries that represent the current iteration, as in attention mechanisms [59], as well as the hidden representation for $\eta_{t}$. More precisely, to obtain the hidden representations $\bm{h}_{q_{t}},\bm{h}_{k_{t}},\bm{h}_{\eta_{t}}\in\mathbb{R}^{h}$, we consider a Gaussian distribution, so we can approximate the expectation by sampling and use the reparameterization trick [36, 52] to perform standard backpropagation. The size of the hidden representations and other hyperparameters of the network are discussed in Appendix D.

The sampled representations are then fed into three independent Multi-Layer Perceptrons that predict a scalar $\eta_{t}$ and two vectors $\bm{q}_{t}$ and $\bm{k}_{t}$. The vectors $\bm{q}_{t}$ and $\bm{k}_{t}$ are, respectively, the query and the key of the current iteration. The vector $\bm{k}_{t}$ is stored in memory for use in subsequent iterations. Using $\bm{q}_{t}$ and all the vectors $\{\bm{k}_{i}\}_{i=0}^{|\beta_{t}|}$, we can compute a scalar for each component in the bundle using an Attention Layer defined as follows:

\bm{\delta}^{(t)}=(\bm{k}_{i}^{\top}\bm{q}_{t})_{i=0}^{|\beta_{t}|}\in\mathbb{R}^{t}\;.

To obtain the approximate solution of the Dual Master problem (4), we pass this vector to a normalization function $\psi$ that takes as input a vector and returns one of the same dimension, with all components between zero and one and summing to one:

\bm{\theta}^{(t)}=\psi(\bm{\delta}^{(t)})\in\Delta^{t}\;.

This yields a vector $\bm{\theta}^{(t)}$ whose components lie in $[0,1]$ and sum to one, like the solution of Problem (4) considered in the classic Bundle method. Examples of $\psi$ are the softmax and sparsemax [46] functions.
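The following sketch summarizes the decoder step under our own simplifications (the LSTM and the three MLPs are folded away, and a log-variance parameterization of the Gaussian is our assumption): reparameterized sampling, attention scores against the stored keys, and the normalization $\psi$.

```python
import torch

def sample(mu, log_sigma):
    # reparameterization trick: h = mu + sigma * eps, with eps ~ N(0, I)
    return mu + torch.exp(log_sigma) * torch.randn_like(mu)

def decode_step(mu_q, ls_q, mu_k, ls_k, keys):
    """keys: (n, h) tensor of keys k_i stored at previous iterations."""
    q_t = sample(mu_q, ls_q)                      # query of the current iteration
    k_t = sample(mu_k, ls_k)                      # key, kept for later iterations
    K = torch.cat([keys, k_t.unsqueeze(0)], dim=0)
    delta = K @ q_t                               # one score per bundle component
    theta = torch.softmax(delta, dim=0)           # psi: lands on the simplex
    return theta, k_t
```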

Figure 1: Schematic representation of the architecture used to predict the step-size ηt\eta_{t} and the weights 𝜽(t){\bm{\theta}}^{(t)} for the convex combination of the gradients in the bundle at the current iteration.

3 Evaluation

In this section, we present the main numerical results of our work. We test our method on the example case of Lagrangian relaxation, which we now describe in further detail.

3.1 Lagrangian Relaxation

Let $\iota$ be a Mixed Integer Linear Program [14, Chap. 8] of the form:

(P)\qquad\min\quad \bm{w}^{\top}\bm{x} \qquad (7a)
\text{s.t.}\quad \bm{A}\bm{x}=\bm{b} \qquad (7b)
\phantom{\text{s.t.}\quad} \bm{C}\bm{x}\geq\bm{d}\;,\;\;\bm{x}\in\mathbb{R}_{+}^{n_{1}}\times\mathbb{N}^{n_{2}} \qquad (7c)

The Lagrangian subproblem is obtained by dualizing a family of constraints. Here we relax the constraints defined by the equalities (7b), and penalize their violation with Lagrangian multipliers (LMs) $\bm{\pi}\in\mathbb{R}^{m}$:

(LR(\bm{\pi}))\qquad\min\quad \bm{w}^{\top}\bm{x}+\bm{\pi}^{\top}(\bm{b}-\bm{A}\bm{x})
\text{s.t.}\quad \bm{C}\bm{x}\geq\bm{d}\;,\;\;\bm{x}\in\mathbb{R}_{+}^{n_{1}}\times\mathbb{N}^{n_{2}}

Weak Lagrangian duality ensures that $LR(\bm{\pi})$ provides a lower bound for $(P)$. The goal of the Lagrangian Dual problem is to determine the best such bound:

(LD)\qquad\max_{\bm{\pi}}\;LR(\bm{\pi})\;. \qquad (8)

Lagrangian relaxation is particularly valuable when removing certain constraints results in a Lagrangian subproblem $LR(\bm{\pi})$ that can be solved efficiently; it is therefore crucial to carefully select the set of constraints to dualize. Indeed, this choice involves a trade-off between speed and the quality of the resulting bound, as will be discussed later, due to the No Free Lunch Theorem [28], which proves that, in order to achieve better bounds than the continuous relaxation, the Lagrangian subproblem should be “harder” than a Linear problem. The Lagrangian Dual problem (8) is a concave maximization problem, but it is trivial to reformulate it as a convex minimization problem [56]. Relaxing inequality constraints $\bm{A}\bm{x}\geq\bm{b}$ instead of equality ones leads to nonnegative multipliers $\bm{\pi}\in\mathbb{R}_{+}^{m}$. To ensure that we predict only nonnegative multipliers, we substitute line 10 of Algorithm 3 with

\bm{\pi}_{t+1}=\sigma(\bar{\bm{\pi}}_{t}+\eta_{t}\bm{w}^{(t)})

where $\sigma$ is a component-wise non-negative activation function such as ReLU.
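A one-line sketch of this modified update (our illustration):

```python
import torch

def trial_point_nonneg(pi_bar, eta_t, w_t):
    # sigma = ReLU keeps the predicted multipliers componentwise nonnegative
    return torch.relu(pi_bar + eta_t * w_t)
```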

3.2 Experiments

In this section, we compare our approach to classical iterative methods for solving the Lagrangian Dual, specifically those based on subgradient information. In particular, we evaluate the performance of the proposed approach against both simple subgradient-based methods and the classical Bundle method, using different heuristic $\eta$-strategies.

We consider ADAM [35] and Gradient Ascent with a simple strategy for regularizing the step size: if more than two iterations pass without improving the objective value, we halve the step size. These two baselines are well-established in the context of convex optimization and serve as relevant points of comparison. They are of particular interest because they are typically faster than the Bundle method: Gradient Ascent relies only on the subgradient at the last visited point, while ADAM relies on the subgradient at the last visited point and the previous search direction.
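A sketch of the baselines' step-size rule as we read it (variable names are ours):

```python
def regularized_step(step, iters_without_improvement):
    # halve the step size after more than two non-improving iterations
    return step / 2.0 if iters_without_improvement > 2 else step
```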

We further compare our approach with classic variants of the Bundle method, considering the four long-term $\eta$-strategies presented in Appendix A, inspired by the ones used in the state-of-the-art bundle-based BundleSolver component of SMS++ [26]. With the non-constant ones, we also consider the middle-term and short-term $\eta$-strategies, increasing the parameter as $\eta_{t+1}=1.1\,\eta_{t}$ and decreasing it as $\eta_{t+1}=0.9\,\eta_{t}$. For all six classic approaches, we perform a grid search to choose the best initial parameter $\eta_{0}$, evaluating the values $10^{k}$ for $k\in\{4,3,2,1,0,-1\}$. For each fixed maximum number of iterations ($10$, $25$, $50$, and $100$), we select the value of $\eta_{0}$ that achieves the lowest average GAP over the test set. The results of this grid search are summarized in Table 14.

The objective values of $\phi$ are too large for a learning method, so we rescale the objective $\phi$ by dividing the function value by the norm $\sqrt{\bm{g}^{\top}_{0}\bm{g}_{0}}$ of the subgradient $\bm{g}_{0}$ at the starting point $\bm{\pi}_{0}$, always taken equal to the zero vector. Even though the objective values change, the optima of the original function $\phi$ and of its scaled version coincide.
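A sketch of this rescaling (our illustration):

```python
import torch

def scaled_objective(phi, g0):
    nrm = torch.linalg.norm(g0)       # sqrt(g0^T g0), with g0 taken at pi_0 = 0
    return lambda pi: phi(pi) / nrm   # same optima as the original phi
```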

We consider the Lagrangian Dual of the Multi-Commodity Network Design problem (MC), presented in Appendix B, and the Lagrangian Dual of the Generalized Assignment problem (GA), presented in Appendix C. For MC, we consider a Lagrangian relaxation in which we dualize equality constraints, while for GA we dualize inequality constraints, applying the modifications previously discussed. For both MC and GA, we use the datasets considered in [19].

For each instance $\iota$ on which we evaluate our model, the Lagrangian relaxation is solved using SMS++. The obtained Lagrangian multipliers, denoted by $\bm{\pi}^{*}_{\iota}$, are used for computing the GAP. We emphasize that we do not need these multipliers to perform the training, but only to provide a metric for our approach. We measure the quality of the bounds computed by the different systems, averaged over a dataset of instances $\mathcal{I}$, by means of the percentage GAP

100\times\frac{1}{|\mathcal{I}|}\sum_{\iota\in\mathcal{I}}\frac{LR(\bm{\pi}^{*}_{\iota})-B_{\iota}}{LR(\bm{\pi}^{*}_{\iota})}\;,

where $B_{\iota}$ is the bound returned for instance $\iota$.
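A sketch of this metric (our illustration, with arrays aligned by instance):

```python
import numpy as np

def percentage_gap(lr_star, bounds):
    # lr_star[i] = LR(pi*_iota), bounds[i] = B_iota for instance i
    return 100.0 * np.mean((lr_star - bounds) / lr_star)
```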

3.3 Bound Accuracy

Table 1 reports the performance of the different systems on our datasets. For the heuristic strategies that do not use machine learning, we perform a grid search for the initialization of the step-size/regularization parameter, with values between $10^{4}$ and $10^{-1}$. For each fixed number of iterations ($10$, $25$, $50$ and $100$), we choose the best $\eta_{0}$ among the values in the grid. From Table 1 we can see that Bundle Network outperforms the other techniques in terms of GAP for a fixed number of iterations on the different datasets.

Table 1: Comparison of our approach and different baselines. The baselines' starting regularization parameter/step size is chosen using a grid search, while our approach does not need hyperparameter tuning. Ours is trained with only $10$ unrolled iterations.
Methods 10 iter. 25 iter. 50 iter. 100 iter.
GAP time GAP time GAP time GAP time

MC-Sml-40

Bundle h. 20.18 0.044 6.68 0.117 2.84 0.269 1.17 0.767
Bundle b. 17.68 0.054 6.09 0.145 1.84 0.327 0.41 0.898
Bundle s. 17.58 0.044 6.10 0.116 1.82 0.271 0.36 0.778
Bundle c. 11.96 0.048 4.16 0.128 1.30 0.293 0.31 0.816
Descent 44.66 0.008 14.11 0.021 4.70 0.044 2.24 0.093
Adam 48.87 0.011 12.84 0.027 3.04 0.055 1.89 0.109
Bundle n. 9.20 0.105 1.75 0.185 0.53 0.321 0.17 0.592

MC-Sml-Var

Bundle h. 18.27 0.094 9.55 0.254 4.08 0.564 1.96 1.433
Bundle b. 15.83 0.124 7.39 0.334 2.88 0.736 0.74 1.749
Bundle s. 15.82 0.094 6.92 0.256 2.58 0.576 0.67 1.421
Bundle c. 12.79 0.108 4.33 0.29 1.66 0.635 0.58 1.503
Descent 62.43 0.025 28.64 0.066 11.14 0.142 4.00 0.303
Adam 51.48 0.034 15.42 0.086 4.74 0.17 2.75 0.341
Bundle n. 12.60 0.138 3.28 0.269 1.24 0.496 0.50 0.952

MC-Big-40

Bundle h. 20.66 0.07 8.54 0.185 5.32 0.410 2.88 0.986
Bundle b. 20.67 0.084 7.37 0.222 3.17 0.485 1.24 1.209
Bundle s. 20.67 0.07 7.37 0.184 3.17 0.407 1.19 1.042
Bundle c. 19.34 0.076 5.64 0.203 2.08 0.447 0.78 1.110
Descent 28.53 0.023 14.12 0.058 9.91 0.117 8.21 0.241
Adam 24.89 0.025 9.38 0.064 7.30 0.129 6.90 0.256
Bundle n. 12.70 0.130 3.41 0.243 1.26 0.436 0.47 0.832

MC-Big-Var

Bundle h. 26.23 0.095 9.99 0.248 4.41 0.550 2.23 1.386
Bundle b. 21.82 0.123 10.01 0.324 3.46 0.695 1.08 1.652
Bundle s. 21.83 0.094 9.53 0.253 3.43 0.550 1.02 1.353
Bundle c. 17.98 0.112 6.34 0.289 2.66 0.627 0.93 1.477
Descent 54.11 0.029 24.69 0.074 10.58 0.154 4.85 0.323
Adam 44.21 0.034 13.75 0.086 5.42 0.170 3.84 0.345
Bundle n. 20.10 0.139 5.11 0.271 2.01 0.495 0.70 0.953

GA-10-100

Bundle h. 0.2333 0.070 0.028 0.241 0.0016 0.657 0.0006 1.596
Bundle b. 0.2333 0.073 0.028 0.249 0.0016 0.684 0.0006 1.655
Bundle s. 0.3055 0.065 0.0156 0.211 0.0026 0.553 0.0024 1.283
Bundle c. 0.1893 0.071 0.0157 0.235 0.0014 0.601 0.0011 1.383
Descent 0.8799 0.013 0.2091 0.032 0.0644 0.063 0.0364 0.127
Adam 0.7234 0.012 0.1843 0.029 0.0183 0.059 0.0048 0.119
Bundle n. 0.1484 0.104 0.0228 0.177 0.0047 0.304 0.0009 0.551

GA-20-400

Bundle h. 0.157 0.445 0.0343 1.548 0.0088 4.528 0.0021 15.099
Bundle b. 0.157 0.463 0.0343 1.568 0.0088 4.539 0.0021 15.158
Bundle s. 0.3542 0.431 0.0284 1.417 0.0048 4.256 0.0028 14.145
Bundle c. 0.1129 0.433 0.0193 1.459 0.0052 4.264 0.0020 14.459
Descent 0.5815 0.205 0.2971 0.5 0.0753 0.98 0.0255 1.937
Adam 0.6598 0.177 0.0831 0.444 0.016 0.887 0.0090 1.775
Bundle n. 0.1018 0.228 0.0190 0.498 0.0052 0.943 0.0014 1.837

However, Figures 2-5 show that Bundle Network is slower than the other approaches in the first iterations; these figures suggest that the initialization time can be improved. These times may also depend on transfers of vectors between CPU and GPU.

From Table 1 we can also see that the constant $\eta$-strategy almost always provides the best performance among the deterministic $\eta$-strategies. This is a well-known characteristic of the Bundle method: $100$ iterations is a relatively small budget for an aggregated Bundle method, making dynamic $\eta$ tuning less effective than a constant strategy whose $\eta_{0}$ is obtained by grid search. This is because the Bundle method operates in distinct phases: initially, its goal is to identify promising directions for improvement, and later it focuses on collecting high-quality subgradients near the optimum to certify optimality. This raises an interesting question about using multiple models to capture this behavior, which is beyond the scope of this work.

Nevertheless, our model provides high-quality predictions that generalize well to a larger number of iterations than the $10$ used during training. In many cases, Bundle Network attains, within a given number of iterations, lower gaps than all the baselines; even when its predictions are not the best option, they lead to gaps similar to those of the best-performing algorithm.

Figure 2: GAP (in logarithmic scale) during the execution of our trained neural network and the different baselines, in terms of iterations (left) and time (right), on the MC-Sml-40 dataset.

Figure 3: GAP (in logarithmic scale) during the execution of our trained neural network and the different baselines, in terms of iterations (left) and time (right), on the MC-Sml-Var dataset.

Figure 4: GAP (in logarithmic scale) during the execution of our trained neural network and the different baselines, in terms of iterations (left) and time (right), on the MC-Big-40 dataset.

Figure 5: GAP (in logarithmic scale) during the execution of our trained neural network and the different baselines, in terms of iterations (left) and time (right), on the MC-Big-Var dataset.

Figure 6: GAP (in logarithmic scale) during the execution of our trained neural network and the different baselines, in terms of iterations (left) and time (right), on the GA-10-100 dataset.

Figure 7: GAP (in logarithmic scale) during the execution of our trained neural network and the different baselines, in terms of iterations (left) and time (right), on the GA-20-400 dataset.

3.4 Sampling or Not Sampling

Table 2: Comparison of softmax and sparsemax, with or without using the sample strategy, or using the sample strategy only for t, on MC-Sml-40
Activation Sample 10 iter. 25 iter. 50 iter. 100 iter.
t q k GAP time GAP time GAP time GAP time
softmax 9.20 0.105 1.75 0.185 0.53 0.321 0.17 0.592
softmax x 9.78 0.104 1.85 0.187 0.56 0.321 0.17 0.592
softmax x x 10.41 0.105 1.99 0.185 0.60 0.318 0.18 0.588
sparsemax 58.3 0.125 57.93 0.214 57.47 0.361 57.29 0.661
sparsemax x 13.16 0.124 2.61 0.212 0.80 0.359 0.24 0.655
sparsemax x x 10.51 0.126 2.07 0.214 0.60 0.363 0.18 0.662
Table 3: Comparison of softmax and sparsemax, with or without using the sample strategy, or using the sample strategy only for t on MC-Big-40
Activation Sample 10 iter. 25 iter. 50 iter. 100 iter.
t q k GAP time GAP time GAP time GAP time
softmax 12.70 0.130 3.41 0.243 1.26 0.436 0.47 0.832
softmax x 12.85 0.131 3.34 0.244 1.22 0.436 0.45 0.831
softmax x x 12.88 0.131 3.36 0.244 1.26 0.436 0.48 0.834
sparsemax 38.65 0.150 24.00 0.270 10.34 0.475 3.39 0.898
sparsemax x 39.40 0.150 32.37 0.270 17.76 0.474 1.58 0.896
sparsemax x x 13.27 0.150 3.50 0.272 1.40 0.479 0.51 0.905
Table 4: Comparison of softmax and sparsemax, with or without using the sample strategy or using the sample strategy only for t on MC-Sml-Var
Activation Sample 10 iter. 25 iter. 50 iter. 100 iter.
t q k GAP time GAP time GAP time GAP time
softmax 12.60 0.138 3.28 0.269 1.24 0.496 0.50 0.952
softmax x 13.03 0.137 3.27 0.268 1.26 0.492 0.51 0.948
softmax x x 14.68 0.139 3.94 0.269 1.55 0.494 0.62 0.957
sparsemax 30.23 0.157 15.96 0.293 12.45 0.527 11.05 1.006
sparsemax x 56.83 0.160 56.03 0.300 56.00 0.538 56.00 1.033
sparsemax x x 14.02 0.158 3.80 0.298 1.56 0.531 0.64 1.007
Table 5: Comparison of softmax and sparsemax, with or without using the sample strategy or using the sample strategy only for t on MC-Big-Var
Activation Sample 10 iter. 25 iter. 50 iter. 100 iter.
t q k GAP time GAP time GAP time GAP time
softmax 20.10 0.139 5.11 0.271 2.01 0.495 0.70 0.953
softmax x 16.79 0.141 4.78 0.273 1.68 0.499 0.63 0.956
softmax x x 16.66 0.139 4.63 0.270 1.80 0.495 0.67 0.955
sparsemax 42.84 0.160 32.39 0.301 29.68 0.541 23.25 1.031
sparsemax x 30.20 0.162 12.33 0.305 6.25 0.542 3.78 1.033
sparsemax x x 18.94 0.162 4.24 0.305 1.47 0.548 0.61 1.045
Table 6: Comparison of softmax and sparsemax, with or without using the sample strategy, or using the sample strategy only for t on GA-10-100
Activation Sample 10 iter. 25 iter. 50 iter. 100 iter.
t q k GAP time GAP time GAP time GAP time
softmax 0.1484 0.104 0.0228 0.177 0.0047 0.304 0.0009 0.551
softmax x 0.1544 0.102 0.0229 0.176 0.0049 0.303 0.0009 0.548
softmax x x 0.1954 0.099 0.0370 0.176 0.0081 0.302 0.0013 0.551
sparsemax 0.2335 0.118 0.0600 0.201 0.0336 0.339 0.0288 0.615
sparsemax x 1.0482 0.120 0.9783 0.202 0.9782 0.342 0.9782 0.611
sparsemax x x 0.9172 0.122 0.7001 0.201 0.6827 0.340 0.6818 0.609
Table 7: Comparison of softmax and sparsemax, with or without using the sample strategy, or using the sample strategy only for t on GA-20-400
Activation Sample 10 iter. 25 iter. 50 iter. 100 iter.
t q k GAP time GAP time GAP time GAP time
softmax 0.1018 0.228 0.0190 0.498 0.0052 0.943 0.0014 1.837
softmax x 0.1065 0.218 0.0216 0.480 0.0060 0.904 0.0014 1.756
softmax x x 0.1273 0.204 0.0295 0.445 0.0087 0.836 0.0018 1.621
sparsemax 0.1463 0.235 0.0626 0.500 0.0601 0.935 0.0599 1.804
sparsemax x 0.7860 0.219 0.7696 0.463 0.767 0.860 0.7669 1.656
sparsemax x x 0.4002 0.218 0.1272 0.461 0.0681 0.860 0.0627 1.651

In this section, we compare the model trained with and without the sampling strategy for the hidden representations discussed earlier, as well as two different choices for $\psi$, namely the softmax [29, pp. 180-184] and the sparsemax [46]. In all cases the models are trained by unrolling $10$ iterations of the resolution approach, for $25$ training epochs, using Adam as optimizer with learning rate $10^{-5}$ and gradient-norm clipping (to $5$). When the sampling mechanism is not used, the hidden representations for the queries, the keys, and the step size are not obtained by sampling from a Gaussian distribution with learned mean and variance; instead, they are predicted directly by the encoder. In the test phase, the hidden representation is not sampled in either approach: when the model predicts both the mean and the variance, we directly take the mean.
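A sketch of these training settings (our illustration; model and unrolled_loss are placeholders for the Bundle Network and the loss of Section 2.2):

```python
import torch

def train_epoch(model, instances, unrolled_loss, opt):
    for inst in instances:
        opt.zero_grad()
        loss = unrolled_loss(model, inst)   # unrolls 10 iterations (Section 2)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
        opt.step()

# opt = torch.optim.Adam(model.parameters(), lr=1e-5), run for 25 epochs
```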

From Tables 2-7, we can see that sampling leads to better performance when using the sparsemax, but is ineffective for the softmax. This may be because the sparsemax function has zero gradient in certain regions of the space, which sampling strategies help to escape.

We choose the softmax with no sampling strategy, as it shows better and more stable performance across both problems. Meanwhile, the sparsemax seems to be an interesting choice for harder datasets, as shown in Table 5.

3.5 Testing on Another Dataset

Tables 8-13 report the performance of models trained on one dataset and tested on another. Training the model on a different dataset rarely provides better gaps than training it on the same dataset; this happens only in Table 8 and, for $10$ iterations, in Table 12. When training and test datasets come from the same problem, we obtain similar behavior in several cases, in particular when using more iterations. In some cases, the models trained on the GA datasets achieve interesting performance on the small MC datasets, as shown in Tables 8 and 9; conversely, no update rule learned on the MC datasets works sufficiently well on the GA datasets. Tables 8-13 also show that, in several cases, a model trained on another dataset of the same problem attains lower gaps per iteration than the best baseline obtained with grid search.

Table 8: Gap of the model tested on MC-Sml-40 trained on different datasets. The associated training dataset is displayed as the first element of each row of the table. The last row shows the performances of the best baseline obtained with grid search.
Training Dataset 10 iter. 25 iter. 50 iter. 100 iter.
GAP time GAP time GAP time GAP time
MC-Sml-40 9.20 0.105 1.75 0.185 0.53 0.321 0.17 0.592
MC-Sml-Var 8.02 0.104 1.69 0.184 0.51 0.319 0.17 0.593
MC-Big-40 18.02 0.116 4.14 0.206 1.49 0.358 0.47 0.665
MC-Big-Var 10.77 0.106 2.26 0.187 0.70 0.327 0.20 0.603
GA-10-100 61.98 0.104 20.79 0.183 4.07 0.319 0.95 0.589
GA-20-400 86.33 0.107 68.16 0.190 14.20 0.327 3.52 0.607
Best Baseline 11.96 0.048 4.16 0.128 1.30 0.293 0.31 0.816
Table 9: Gap of the model tested on MC-Sml-Var trained on different datasets. The associated training dataset is displayed as the first element of each row of the table. The last row shows the performances of the best baseline obtained with grid search.
Training Dataset 10 iter. 25 iter. 50 iter. 100 iter.
GAP time GAP time GAP time GAP time
MC-Sml-40 19.87 0.142 5.72 0.276 2.32 0.501 1.00 0.965
MC-Sml-Var 12.60 0.138 3.28 0.269 1.24 0.496 0.50 0.952
MC-Big-40 33.83 0.149 14.03 0.291 7.54 0.536 3.90 1.033
MC-Big-Var 16.21 0.141 3.83 0.276 1.39 0.505 0.52 0.976
GA-10-100 25.76 0.142 7.60 0.275 2.08 0.502 0.67 0.969
GA-20-400 66.15 0.140 62.06 0.275 38.19 0.507 12.86 0.974
Best Baseline 12.79 0.108 4.33 0.29 1.66 0.635 0.58 1.503
Table 10: Gap of the model tested on MC-Big-40 trained on different datasets. The associated training dataset is displayed as the first element of each row of the table. The last row shows the performances of the best baseline obtained with grid search.
Training Dataset 10 iter. 25 iter. 50 iter. 100 iter.
GAP time GAP time GAP time GAP time
MC-Sml-40 35.09 0.133 5.78 0.248 1.63 0.445 0.62 0.85
MC-Sml-Var 56.68 0.132 25.62 0.246 8.88 0.441 3.06 0.842
MC-Big-40 12.70 0.130 3.41 0.243 1.26 0.436 0.47 0.832
MC-Big-Var 37.57 0.134 10.74 0.251 4.89 0.449 1.49 0.856
GA-10-100 95.70 0.123 95.70 0.230 78.63 0.408 27.53 0.779
GA-20-400 100.0 0.134 100.0 0.252 99.53 0.447 74.54 0.854
Best Baseline 19.34 0.076 5.64 0.203 2.08 0.447 0.78 1.110
Table 11: Gap of the model tested on MC-Big-Var trained on different datasets. The associated training dataset is displayed as the first element of each row of the table. The last row shows the performances of the best baseline obtained with grid search.
Training Dataset 10 iter. 25 iter. 50 iter. 100 iter.
GAP time GAP time GAP time GAP time
MC-Sml-40 21.50 0.142 5.48 0.280 2.07 0.511 0.85 0.984
MC-Sml-Var 21.03 0.221 7.64 0.419 2.67 0.755 0.95 1.430
MC-Big-40 28.39 0.150 11.15 0.294 5.87 0.540 2.94 1.037
MC-Big-Var 20.10 0.139 5.11 0.271 2.01 0.495 0.70 0.953
GA-10-100 41.66 0.144 28.58 0.280 20.72 0.511 8.30 0.979
GA-20-400 72.89 0.144 69.99 0.283 53.10 0.519 27.75 0.996
Best Baseline 17.98 0.112 6.34 0.289 2.66 0.627 0.93 1.477
Table 12: Gap of the model tested on GA-10-100 trained on different datasets. The associated training dataset is displayed as the first element of each row of the table. The last row shows the performances of the best baseline obtained with grid search.
Training Dataset 10 iter. 25 iter. 50 iter. 100 iter.
GAP time GAP time GAP time GAP time
MC-Sml-40 2.5903 0.101 1.8558 0.179 0.9291 0.304 0.4860 0.555
MC-Sml-Var 2.5165 0.102 1.4663 0.178 1.0522 0.307 1.0512 0.562
MC-Big-40 2.7488 0.103 2.1948 0.177 1.3515 0.302 0.7463 0.550
MC-Big-Var 2.4958 0.102 1.3267 0.180 0.9705 0.309 0.9683 0.564
GA-10-100 0.1484 0.104 0.0228 0.177 0.0048 0.304 0.0009 0.551
GA-20-400 0.1332 0.105 0.0233 0.182 0.0056 0.309 0.0016 0.561
Best Baseline 0.1893 0.071 0.0156 0.211 0.0014 0.601 0.0006 1.596
Table 13: Gap of the model tested on GA-20-400 trained on different datasets. The associated training dataset is displayed as the first element of each row of the table. The last row shows the performances of the best baseline obtained with grid search.
Training Dataset 10 iter. 25 iter. 50 iter. 100 iter.
GAP time GAP time GAP time GAP time
MC-Sml-40 2.3695 0.221 1.9535 0.474 1.1906 0.894 0.5052 1.732
MC-Sml-Var 2.3357 0.231 1.7168 0.507 0.8963 0.959 0.8654 1.868
MC-Big-40 2.4715 0.218 2.1550 0.481 1.5618 0.909 0.8056 1.759
MC-Big-Var 2.3197 0.218 1.6304 0.482 0.8092 0.910 0.7856 1.765
GA-10-100 0.1811 0.234 0.0296 0.511 0.0077 0.962 0.0015 1.866
GA-20-400 0.1018 0.228 0.0190 0.498 0.0052 0.943 0.0014 1.837
Best Baseline 0.1129 0.433 0.0193 1.459 0.0048 4.256 0.0020 14.459

4 Conclusion

In this work, we present a machine learning model based on unrolling and amortization, strongly inspired by the Bundle method. The model bypasses the need to solve the Dual Master problem at each iteration, leading to a smooth optimization process that can be differentiated using Automatic Differentiation techniques. We test it for the resolution of the Lagrangian dual of two different problems: Multi-Commodity Network Design and Generalized Assignment. It achieves a lower gap compared to other classical approaches for a fixed number of iterations. Moreover, it exhibits strong generalization capabilities, maintaining robust performance even beyond the training horizon and sometimes across different datasets.

This work opens several promising directions for future research. First, the framework can be directly applied, as it is, to other problems arising from Lagrangian relaxation. A central challenge in standard bundle methods lies in automatically determining which components to retain or discard in order to control bundle size and improve computational efficiency. While straightforward modifications of the framework are possible, this also raises questions concerning feature extraction and the design of an automated feature selection phase.

Given the growing use of Bundle methods in machine learning, it would also be interesting to assess the performance of the proposed approach in meta-learning tasks. This requires applying the method to non-convex and non-smooth optimization problems, and this extension presents non-trivial challenges. In this setting, removing components from the bundle is essential not only for reducing the computational cost, but also for avoiding convergence to poor local minima.

References

  • [1] Brandon Amos. Tutorial on amortized optimization for learning to optimize over continuous domains. CoRR, 2022.
  • [2] Marcin Andrychowicz, Misha Denil, Sergio Gómez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.
  • [3] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
  • [4] Atilim Gunes Baydin, Barak A. Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research, 18(153):1–43, 2018.
  • [5] Samy Bengio, Y. Bengio, and Jocelyn Cloutier. On the search for new learning rules for anns. Neural Processing Letters, 2:26–30, 07 1995.
  • [6] Y. Bengio, S. Bengio, and J. Cloutier. Learning a synaptic learning rule. In IJCNN-91-Seattle International Joint Conference on Neural Networks, volume ii, pages 969 vol.2–, 1991.
  • [7] Yoshua Bengio, Andrea Lodi, and Antoine Prouvost. Machine learning for combinatorial optimization: a methodological tour d’horizon. CoRR, abs/1811.06128, 2018.
  • [8] Daniel Cederberg, Xuyang Wu, Stephen P. Boyd, and Mikael Johansson. An asynchronous bundle method for distributed learning problems. In The Thirteenth International Conference on Learning Representations, 2025.
  • [9] Tianlong Chen, Xiaohan Chen, Wuyang Chen, Howard Heaton, Jialin Liu, Zhangyang Wang, and Wotao Yin. Learning to optimize: A primer and a benchmark. Journal of Machine Learning Research, 23(189):1–59, 2022.
  • [11] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
  • [12] Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR, abs/1406.1078, 2014.
  • [13] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling, 2014.
  • [14] Michele Conforti, Gérard Cornuéjols, and Giacomo Zambelli. Integer Programming. Springer, New York, 2014.
  • [15] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learning, 20:273–297, 1995.
  • [16] Teodor Gabriel Crainic, Antonio Frangioni, and Bernard Gendron. Bundle-based relaxation methods for multicommodity capacitated fixed charge network design. Discrete Applied Mathematics, 112(1):73–99, 2001.
  • [17] Welington de Oliveira and Mikhail Solodov. A doubly stabilized bundle method for nonsmooth convex optimization. Mathematical programming, 156(1):125–159, 2016.
  • [18] Welington Luis de Oliveira and Claudia A. Sagastizábal. Level bundle methods for oracles with on-demand accuracy. Optimization Methods and Software, 29:1180 – 1209, 2014.
  • [19] Francesco Demelas, Joseph Le Roux, Mathieu Lacroix, and Axel Parmentier. Predicting lagrangian multipliers for mixed integer linear programs. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 10368–10384. PMLR, 21–27 Jul 2024.
  • [20] Thomas Finley and Thorsten Joachims. Training structural svms when exact inference is intractable. In Proceedings of the 25th International Conference on Machine Learning, ICML ’08, page 304–311, New York, NY, USA, 2008. Association for Computing Machinery.
  • [21] Marshall L. Fisher. The lagrangian relaxation method for solving integer programming problems. Management Science, 27:1–18, 1981.
  • [22] A. Frangioni. Standard Bundle Methods: Untrusted Models and Duality, chapter 2, pages 61–116. Springer, 2020.
  • [23] Antonio Frangioni. Solving semidefinite quadratic problems within nonsmooth optimization algorithms. Computers & Operations Research, 23(11):1099–1118, 1996.
  • [24] Antonio Frangioni. Dual-ascent methods and multicommodity flow problems. Università degli studi di Pisa. Dipartimento di informatica, 1997.
  • [25] Antonio Frangioni. Generalized bundle methods. SIAM Journal on Optimization, 13(1):117–156, 2002.
  • [26] Antonio Frangioni, Niccolò Iardella, and Rafael Durbano Lobato. SMS++, 2023.
  • [27] Bernard Gendron, Teodor Gabriel Crainic, and Antonio Frangioni. Multicommodity Capacitated Network Design, pages 1–19. Springer US, Boston, MA, 1999.
  • [28] A. M. Geoffrion. Lagrangean relaxation for integer programming. In M. L. Balinski, editor, Approaches to Integer Programming, pages 82–114. Springer Berlin Heidelberg, Berlin, Heidelberg, 1974.
  • [29] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT Press, 2016.
  • [30] Hossein Hassani, Roozbeh Razavi-Far, Mehrdad Saif, and Liang Lin. Towards sample-efficiency and generalization of transfer and inverse reinforcement learning: A comprehensive literature review, 2024.
  • [31] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9:1735–80, 12 1997.
  • [32] Sepp Hochreiter, A. Steven Younger, and Peter R. Conwell. Learning to learn using gradient descent. In Georg Dorffner, Horst Bischof, and Kurt Hornik, editors, Artificial Neural Networks — ICANN 2001, pages 87–94, Berlin, Heidelberg, 2001. Springer Berlin Heidelberg.
  • [33] Jinlong Ji, Xuhui Chen, Qianlong Wang, Lixing Yu, and Pan Li. Learning to learn gradient aggregation by gradient descent. In IJCAI, pages 2614–2620, 2019.
  • [34] Thorsten Joachims, Thomas Finley, and Chun-Nam Yu. Cutting-plane training of structural svms. Machine Learning, 77:27–59, 10 2009.
  • [35] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [36] Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014.
  • [37] Krzysztof C. Kiwiel. A proximal bundle method with approximate subgradient linearizations. SIAM Journal on Optimization, 16(4):1007–1023, 2006.
  • [38] James Kotary, My H Dinh, and Ferdinando Fioretto. Backpropagation of unrolled solvers with folded optimization. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI-2023, page 1963–1970. International Joint Conferences on Artificial Intelligence Organization, August 2023.
  • [39] Quoc Le, Alex Smola, and Svn Vishwanathan. Bundle methods for machine learning. Advances in neural information processing systems, 20, 2007.
  • [40] Claude Lemaréchal. Lagrangian Relaxation, pages 112–156. Springer Berlin Heidelberg, Berlin, Heidelberg, 2001.
  • [41] Claude Lemaréchal, Arkadii Nemirovskii, and Yurii Nesterov. New variants of bundle methods. Mathematical programming, 69:111–147, 1995.
  • [42] Claude Lemarechal, Jean-Jacques Strodiot, and André Bihain. On a bundle algorithm for nonsmooth optimization. In Nonlinear programming 4, pages 245–282. Elsevier, 1981.
  • [43] Bingheng Li, Linxin Yang, Yupeng Chen, Senmiao Wang, Qian Chen, Haitao Mao, Yao Ma, Akang Wang, Tian Ding, Jiliang Tang, and Ruoyu Sun. Pdhg-unrolled learning-to-optimize method for large-scale linear programming, 2024.
  • [44] Ke Li and Jitendra Malik. Learning to optimize. ArXiv, abs/1606.01885, 2016.
  • [45] Jialin Liu, Xiaohan Chen, Zhangyang Wang, Wotao Yin, and Hanqin Cai. Towards constituting mathematical structures for learning to optimize. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 21426–21449. PMLR, 23–29 Jul 2023.
  • [46] André F. T. Martins and Ramón Fernandez Astudillo. From softmax to sparsemax: A sparse model of attention and multi-label classification. CoRR, abs/1602.02068, 2016.
  • [47] Nina Mazyavkina, Sergey Sviridov, Sergei Ivanov, and Evgeny Burnaev. Reinforcement learning for combinatorial optimization: A survey. Computers & Operations Research, 134:105400, 2021.
  • [48] Vishal Monga, Yuelong Li, and Yonina C. Eldar. Algorithm unrolling: Interpretable, efficient deep learning for signal and image processing, 2020.
  • [49] Alasdair Paren, Leonard Berrada, Rudra P. K. Poudel, and M. Pawan Kumar. A stochastic bundle method for interpolating networks. J. Mach. Learn. Res., 23(1), January 2022.
  • [50] T.P. Runarsson and M.T. Jonsson. Evolution and design of distributed learning rules. In 2000 IEEE Symposium on Combinations of Evolutionary Computation and Neural Networks. Proceedings of the First IEEE Symposium on Combinations of Evolutionary Computation and Neural Networks (Cat. No.00, pages 59–63, 2000.
  • [51] Alexander Schrijver et al. On cutting planes. Combinatorics, 79:291–296, 1980.
  • [52] John Schulman, Nicolas Heess, Theophane Weber, and Pieter Abbeel. Gradient estimation using stochastic computation graphs. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015.
  • [53] Damien Scieur, Quentin Bertrand, Gauthier Gidel, and Fabian Pedregosa. The curse of unrolling: Rate of differentiating through optimization, 2023.
  • [54] Naresh K. Sinha and Michael P. Griscik. A stochastic approximation method. IEEE Trans. Syst. Man Cybern., 1(4):338–344, 1971.
  • [55] R.S. Sutton and A.G. Barto. Reinforcement Learning, second edition: An Introduction. Adaptive Computation and Machine Learning series. MIT Press, 2018.
  • [56] R. Tyrrell Rockafellar. Convex Analysis. Princeton University Press, 2015.
  • [57] Yunhao Tang, Shipra Agrawal, and Yuri Faenza. Reinforcement learning for integer programming: Learning to cut. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 9367–9376. PMLR, 13–18 Jul 2020.
  • [58] Choon Hui Teo, S.V.N. Vishwanthan, Alex J. Smola, and Quoc V. Le. Bundle methods for regularized risk minimization. Journal of Machine Learning Research, 11(10):311–365, 2010.
  • [59] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • [60] Arthur Steven Younger, Sepp Hochreiter, and Peter R. Conwell. Meta-learning with backpropagation. IJCNN’01. International Joint Conference on Neural Networks. Proceedings (Cat. No.01CH37222), 3:2001–2006 vol.3, 2001.
  • [61] Yilang Zhang and Georgios B. Giannakis. Meta-learning priors using unrolled proximal networks. In The Twelfth International Conference on Learning Representations, 2024.

Appendix A Proximality Term η\eta-Strategy

The proper selection of the proximal term η of the bundle method is both non-trivial and essential. This choice depends on the current stabilization point and on the information accumulated in the bundle up to the current iteration. In particular, when all the gradients at the optimum are available, solving a Cutting Plane model is preferable, as it yields the optimal solution to the problem; hence, in this case, a higher value of η may be preferred. At the beginning of the resolution, by contrast, only very local information is typically available, so acting cautiously, i.e., choosing a smaller value of η, is preferable.

In [25], theoretical conditions on the η-strategy are established to ensure convergence. Constant strategies can be effective if appropriately fine-tuned, and many state-of-the-art strategies are self-adjusting rules for choosing \eta_{t}.

The main state-of-the-art η-strategies are composed of three levels, such as those developed in [24] and also presented in [16]. These levels are the Long-term, Middle-term, and Short-term η-strategies.

The Long-term η-strategies aim to capture the stage of the resolution process. In the Bundle method this is important because it governs the balance between the subgradient method and the Cutting Plane method: we want to prioritize subgradient-like iterations when the solution is still far from optimal, and shift towards Cutting Plane iterations towards the end. This helps collect the subgradient information necessary to establish the optimality of the solution.

We define the maximum improvement estimation of the Dual Master problem as

v^{*}_{t}=\eta_{t}\,\Big\|\sum_{i\in\beta_{t}}\theta_{i}^{(t)}{\bm{g}}_{i}\Big\|_{2}^{2}+\sum_{i\in\beta_{t}}\alpha_{i}\theta_{i}^{(t)}

where {\bm{\theta}}^{(t)} is the solution of the Dual Master problem at iteration t.

We also define the expected maximum improvement obtained using a variant of the Dual Master problem in which η^{*} acts as regularization parameter:

\epsilon^{*}_{t}=\sum_{i\in\beta_{t}}\alpha_{i}\theta_{i}^{(t)}+\eta^{*}\,\Big\|\sum_{i\in\beta_{t}}\theta_{i}^{(t)}{\bm{g}}_{i}\Big\|_{2}^{2}

where η^{*} is a hyperparameter, generally chosen to be greater than all the possible values of \eta_{t}, so as to emulate the behavior obtained by following the suggestion of the model more closely. This is different from explicitly using η^{*} in the Dual Master problem, which would alter its solution and consequently the new trial direction, as previously explained. Additionally, we define a hyperparameter \tilde{m}\in[0,1), which is employed in the η-strategies and is usually fixed to 0.01.
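For concreteness, both estimates can be computed directly from the bundle data. Below is a minimal NumPy sketch in which the array names theta, G, and alpha (the Dual Master solution, the stacked subgradients, and the linearization errors) are our own assumptions.

```python
import numpy as np

def improvement_estimates(theta, G, alpha, eta_t, eta_star):
    """Return (v*_t, eps*_t) as defined above.

    theta: (n,) convex multipliers theta^(t); G: (n, dim) subgradients g_i
    stacked row-wise; alpha: (n,) linearization errors.
    """
    w = G.T @ theta                  # aggregated direction sum_i theta_i g_i
    sq_norm = float(w @ w)           # ||sum_i theta_i g_i||^2
    linear = float(alpha @ theta)    # sum_i alpha_i theta_i
    v_t = eta_t * sq_norm + linear       # maximum improvement estimate v*_t
    eps_t = eta_star * sq_norm + linear  # expected maximum improvement eps*_t
    return v_t, eps_t
```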

Some examples of Long-term η\eta-strategies are:

  • no Long-term η\eta-strategy;

  • the soft Long-term η\eta-strategy. After a null step, decreases in ηt\eta_{t} are inhibited whenever vt<m~ϵtv^{*}_{t}<\tilde{m}\epsilon^{*}_{t}. In other words, decrements are prevented if ηt\eta_{t} is already sufficiently small, such that the maximum improvement is estimated to be less than m~\tilde{m} times the improvement suggested by a Cutting-Planes model.

  • the hard Long-term η\eta-strategy. Increases of ηt\eta_{t} are allowed, even after null steps, whenever vt<m~ϵtv^{*}_{t}<\tilde{m}\epsilon^{*}_{t}.

  • the balancing Long-term η-strategy. This strategy aims to keep the two terms \frac{\eta^{*}}{2}||\sum_{i\in\beta_{t}}\theta_{i}^{(t)}{\bm{g}}_{i}||^{2}_{2} and \sum_{i\in\beta_{t}}\alpha_{i}\theta_{i}^{(t)} of roughly the same size.

    Increases of ηt\eta_{t}, after a serious step, are inhibited if

    η2iβtθi(t)𝒈i22m~i=1|βt|αiθi(t).\frac{\eta^{*}}{2}||\sum_{i\in\beta_{t}}\theta_{i}^{(t)}{\bm{g}}_{i}||^{2}_{2}\leq\tilde{m}\sum_{i=1}^{|\beta_{t}|}\alpha_{i}\theta_{i}^{(t)}.

    In this case, increasing \eta_{t} would further reduce the term \frac{\eta^{*}}{2}||\sum_{i\in\beta_{t}}\theta_{i}^{(t)}{\bm{g}}_{i}||^{2}_{2}, which is already small.

    On the other hand, decreases in ηt\eta_{t} are inhibited if

    m~η2iβtθi(t)𝒈i22i=1|βt|αiθi(t).\tilde{m}\frac{\eta^{*}}{2}||\sum_{i\in\beta_{t}}\theta_{i}^{(t)}{\bm{g}}_{i}||^{2}_{2}\geq\sum_{i=1}^{|\beta_{t}|}\alpha_{i}\theta_{i}^{(t)}.

    The rationale is that decreasing \eta_{t} would further increase ||\sum_{i\in\beta_{t}}\theta_{i}^{(t)}{\bm{g}}_{i}||^{2}_{2}, which is already large.

The Middle-term η\eta-strategies are multi-step strategies that consider the outcomes from the last few iterations. These strategies include the following conditions:

  • A minimum number of consecutive serious steps (SS) with the same \eta_{t} must be performed before \eta_{t} is allowed to increase.

  • A minimum number of consecutive null steps (NS) with the same ηt\eta_{t} must be performed before ηt\eta_{t} is allowed to decrease.

At the bottom level, heuristic Short-term η-strategies rely only on information from the current iteration to propose an update for \eta_{t}. Specifically, these heuristics assist in selecting the new value only when both the top and middle levels agree on the update. However, it can sometimes be beneficial to let the Short-term heuristics make small adjustments to \eta_{t} regardless of the decisions made by the other levels.

Our implementation combines the following Short-term η-strategies (a sketch of how the three levels can interact is given after the list):

  • Increase ηt+1\eta_{t+1} as ηtηincr\eta_{t}\eta_{incr}, with ηincr>1\eta_{incr}>1.

  • Decrease \eta_{t+1} as \eta_{t}\eta_{decr}, with 0<\eta_{decr}<1.

  • When increased, ηt+1\eta_{t+1} should not exceed ηM\eta_{M}, with ηM>0\eta_{M}>0.

  • When decreased, ηt+1\eta_{t+1} should not go below ηm\eta_{m}, with 0<ηm<ηM0<\eta_{m}<\eta_{M}.
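To illustrate how the three levels can interact, here is a hedged Python sketch; the thresholds, counters, and update factors are illustrative placeholders rather than the exact rules used in our implementation.

```python
def update_eta(eta_t, serious_step, v_t, eps_t,
               n_consec_ss, n_consec_ns,
               m_tilde=0.01, min_ss=2, min_ns=2,
               eta_incr=2.0, eta_decr=0.5, eta_m=1e-4, eta_M=1e4):
    """One illustrative eta update combining the three strategy levels."""
    eta_small = v_t < m_tilde * eps_t        # Long-term signal: eta already small
    if serious_step:
        # Middle-term: require enough consecutive serious steps before an
        # increase; Short-term: multiplicative increase capped at eta_M.
        if n_consec_ss >= min_ss:
            eta_t = min(eta_t * eta_incr, eta_M)
    else:
        if eta_small:
            # Hard Long-term rule: increases allowed even after a null step.
            eta_t = min(eta_t * eta_incr, eta_M)
        elif n_consec_ns >= min_ns:
            # Soft Long-term rule inhibits this decrease when eta_small holds;
            # Short-term: multiplicative decrease floored at eta_m.
            eta_t = max(eta_t * eta_decr, eta_m)
    return eta_t
```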

Appendix B Multi Commodity Capacitated Network Design Problem

A MC instance is given by a directed simple graph D=(N,A)D=(N,A), a set of commodities KK, an arc-capacity vector cc, and two cost vectors rr and ff. Each commodity kKk\in K corresponds to a triplet (ok,dk,qk)(o^{k},d^{k},q^{k}) where okNo^{k}\in N and dkNd^{k}\in N are the nodes corresponding to the origin and the destination of commodity kk, and qkq^{k}\in\mathbb{N}^{*} is its volume. For each arc, (i,j)A(i,j)\in A, cij>0c_{ij}>0 corresponds to the maximum amount of flow that can be routed through (i,j)(i,j) and fij>0f_{ij}>0 corresponds to the fixed cost of using arc (i,j)(i,j) to route commodities. For each arc (i,j)A(i,j)\in A and each commodity kKk\in K, rijk>0r_{ij}^{k}>0 corresponds to the cost of routing one unit of commodity kk through arc (i,j)(i,j).

A MC solution consists of an arc subset A^{\prime}\subseteq A and, for each commodity k\in K, of a flow of value q^{k} from its origin o^{k} to its destination d^{k}, with the following requirements: all commodities are only routed through arcs of A^{\prime}, and the total amount of flow routed through each arc (i,j)\in A^{\prime} does not exceed its capacity c_{ij}. The solution cost is the sum of the fixed costs over the arcs of A^{\prime} plus the routing cost, the latter being the sum, over all arcs (i,j)\in A and all commodities k\in K, of the unitary routing cost r_{ij}^{k} multiplied by the amount of flow of k routed through (i,j).

B.1 MILP formulation

A standard model for the MC problem [27] introduces two sets of variables: the continuous flow variables xijkx_{ij}^{k} representing the amount of commodity kk that is routed through arc (i,j)(i,j) and the binary design variables yijy_{ij} representing whether or not arc (i,j)(i,j) is used to route commodities. Denoting respectively by Ni+={jN(i,j)A}N^{+}_{i}=\{j\in N\mid(i,j)\in A\} and Ni={jN(j,i)A}N_{i}^{-}=\{j\in N\mid(j,i)\in A\} the sets of forward and backward neighbors of a vertex iNi\in N, the MC problem can be modeled as follows:

min𝒙,𝒚(i,j)A(fijyij+kKrijkxijk)\displaystyle\min_{\bm{x},\bm{y}}\sum_{(i,j)\in A}\left(f_{ij}y_{ij}+\sum_{k\in K}r^{k}_{ij}x_{ij}^{k}\right) (9a)
jNi+xijkjNixjik=bik\displaystyle\sum_{j\in N^{+}_{i}}x_{ij}^{k}-\sum_{j\in N^{-}_{i}}x_{ji}^{k}=b^{k}_{i} iN,kK\displaystyle\!\!\forall i\in N,\forall k\in K (9b)
kKxijkcijyij,\displaystyle\sum_{k\in K}x_{ij}^{k}\leq c_{ij}y_{ij}, (i,j)A\displaystyle\!\!\forall(i,j)\in A (9c)
xijk=0\displaystyle x_{ij}^{k}=0 kK,(i,j)A s.t. i=dk or j=ok\displaystyle\!\!\!\!\!\begin{array}[]{l}\forall k\in K,\forall(i,j)\in A\\ \text{ s.t. }i=d^{k}\text{ or }j=o^{k}\end{array} (9f)
0xijkqk\displaystyle 0\leq x^{k}_{ij}\leq q^{k} (i,j)A,kK\displaystyle\!\!\forall(i,j)\in A,\forall k\in K (9g)
yij{0,1},\displaystyle y_{ij}\in\{0,1\}, (i,j)A\displaystyle\!\!\forall(i,j)\in A (9h)

where

bik={qk if i=ok,qk if i=dk,0 otherwise.b^{k}_{i}=\left\{\begin{array}[]{cc}q^{k}&\mbox{ if }i=o^{k},\\ -q^{k}&\mbox{ if }i=d^{k},\\ 0&\mbox{ otherwise.}\end{array}\right.

The objective function (9a) minimizes the sum of the routing and fixed costs. Equations (9b) are the flow conservation constraints that properly define the flow of each commodity through the graph. Constraints (9c) are the capacity constraints ensuring that the total amount of flow routed through each arc does not exceed its capacity, and is zero if the arc is not used to route commodities. Equations (9f) ensure that a commodity is not routed on an arc entering its origin or leaving its destination. Finally, inequalities (9g) are the bounds for the x variables and constraints (9h) are the integrality constraints for the design variables.

B.2 Lagrangian Knapsack Relaxation

A standard way to obtain good bounds for the MC problem is to solve the Lagrangian relaxation obtained by dualizing the flow conservation constraints (9b) in formulation (9a)-(9h). Let \pi_{i}^{k} be the Lagrangian multiplier associated with node i\in N and commodity k\in K; since the dualized constraints are equations, \bm{\pi} has no sign constraints. Dualizing the flow conservation constraints gives the following relaxed Lagrangian problem LR(\bm{\pi}):

\min_{(\bm{x},\bm{y})\text{ satisfying }(9c)\text{-}(9h)}\ \sum_{(i,j)\in A}\Big(f_{ij}y_{ij}+\sum_{k\in K}r^{k}_{ij}x_{ij}^{k}\Big)+\sum_{k\in K}\sum_{i\in N}\pi^{k}_{i}\Big(b^{k}_{i}-\sum_{j\in N^{+}_{i}}x_{ij}^{k}+\sum_{j\in N^{-}_{i}}x_{ji}^{k}\Big)

Rearranging the terms in the objective function and observing that the relaxed Lagrangian problem decomposes by arcs, we obtain a subproblem for each arc (i,j)\in A of the form:

(LRij(𝝅))\displaystyle\displaystyle(LR_{ij}(\bm{\pi}))\quad min𝒙,𝒚fijyij+kKijwijkxijk\displaystyle\displaystyle\min_{\bm{x},\bm{y}}f_{ij}y_{ij}+\sum_{k\in K_{ij}}w_{ij}^{k}x_{ij}^{k} (10a)
kKijxijkcijyij\displaystyle\displaystyle\sum_{k\in K_{ij}}x_{ij}^{k}\leq c_{ij}y_{ij} (10b)
0xijkqk\displaystyle\displaystyle 0\leq x^{k}_{ij}\leq q^{k} kKij\displaystyle\displaystyle\forall k\in K_{ij} (10c)
yij{0,1}\displaystyle\displaystyle y_{ij}\in\{0,1\} (10d)

where wijk=rijkπik+πjk\displaystyle w_{ij}^{k}=r_{ij}^{k}-\pi_{i}^{k}+\pi_{j}^{k} and Kij={kKjok and idk}\displaystyle K_{ij}=\{k\in K\mid j\neq o^{k}\mbox{ and }i\neq d^{k}\} is the set of commodities that may be routed through arc (i,j)\displaystyle(i,j).

For each (i,j)\in A, LR_{ij}(\bm{\pi}) is a MILP with only one binary variable. If y_{ij}=0, then, by (10b) and (10c), x_{ij}^{k}=0 for all k\in K_{ij}. If y_{ij}=1, the problem reduces to a continuous knapsack problem. An optimal solution is obtained by ordering the commodities of K_{ij} by increasing value of w_{ij}^{k} and setting each variable x_{ij}^{k} with w_{ij}^{k}<0 to \max\{\min\{q^{k},c_{ij}-\sum_{k'\in K(k)}q^{k'}\},0\}, where K(k) denotes the set of commodities that precede k in the order; commodities with w_{ij}^{k}\geq 0 are left at zero. This step can be done in O(|K_{ij}|) if one computes the x_{ij}^{k} following the computed order, so the overall complexity of the continuous knapsack problem is O(|K_{ij}|\log(|K_{ij}|)), dominated by the sorting. The value of LR_{ij}(\bm{\pi}) is the minimum between the optimal value of the continuous knapsack problem (including the fixed cost f_{ij}) and 0.
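Below is a minimal Python sketch of this greedy resolution of LR_{ij}(\bm{\pi}); the argument names are illustrative, and commodities with nonnegative reduced cost are left at zero as described above.

```python
def solve_arc_subproblem(f_ij, c_ij, w, q):
    """Value of LR_ij(pi); w[k], q[k] are the reduced cost and volume of
    commodity k in K_ij."""
    cost, residual = f_ij, c_ij                     # case y_ij = 1
    for k in sorted(range(len(w)), key=lambda k: w[k]):
        if w[k] >= 0 or residual <= 0:
            break                                   # only negative costs help
        x_k = min(q[k], residual)                   # continuous knapsack fill
        cost += w[k] * x_k
        residual -= x_k
    return min(cost, 0.0)                           # compare with y_ij = 0
```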

Lagrangian duality implies that

LR(𝝅)=(i,j)ALRij(𝝅)+iNkKπikbikLR(\bm{\pi})=\sum_{(i,j)\in A}LR_{ij}(\bm{\pi})+\sum_{i\in N}\sum_{k\in K}\pi_{i}^{k}b^{k}_{i}

is a lower bound for the MC problem and the best one is obtained by solving the following Lagrangian dual problem:

(LD)max𝝅N×KLR(𝝅)(LD)\quad\max_{\bm{\pi}\in\mathbb{R}^{N\times K}}LR(\bm{\pi})

Appendix C Generalized Assignment Problem

A GA instance is defined by a set I\displaystyle I of items and a set J\displaystyle J of bins. Each bin j\displaystyle j is associated with a certain capacity cj\displaystyle c_{j}. For each item iI\displaystyle i\in I and each bin jJ\displaystyle j\in J, pij\displaystyle p_{ij} is the profit of assigning item i\displaystyle i to bin j\displaystyle j, and wij\displaystyle w_{ij} is the weight of item i\displaystyle i inside bin j\displaystyle j.

Considering a binary variable xij\displaystyle x_{ij} for each item and each bin that is equal to one if and only if item i\displaystyle i is assigned to bin j\displaystyle j, the GA problem can be formulated as:

max𝒙iIjJpijxij\displaystyle\displaystyle\max_{\bm{x}}\sum_{i\in I}\sum_{j\in J}p_{ij}x_{ij} (11a)
jJxij1\displaystyle\displaystyle\sum_{j\in J}x_{ij}\leq 1 iI\displaystyle\displaystyle\forall i\in I (11b)
iIwijxijcj\displaystyle\displaystyle\sum_{i\in I}w_{ij}x_{ij}\leq c_{j} jJ\displaystyle\displaystyle\forall j\in J (11c)
xij{0,1}\displaystyle\displaystyle x_{ij}\in\{0,1\} iI,jJ.\displaystyle\displaystyle\forall i\in I,\;\forall j\in J. (11d)

The objective function (11a) maximizes the total profit. Inequalities (11b) ensure that each item is assigned to at most one bin. Inequalities (11c) ensure that the total weight of the items assigned to a bin does not exceed its capacity. Finally, constraints (11d) ensure the integrality of the variables.

C.1 Lagrangian Relaxation

A Lagrangian relaxation of the GA problem is obtained by dualizing (11b). For iI\displaystyle i\in I, let πi0\displaystyle\pi_{i}\geq 0 be the Lagrangian multiplier of inequality (11b) associated with item i\displaystyle i. For each bin j\displaystyle j the subproblem becomes:

(LR_{j}(\bm{\pi}))\quad \max_{\bm{x}}\sum_{i\in I}(p_{ij}-\pi_{i})x_{ij}
iIwijxijcj\displaystyle\displaystyle\sum_{i\in I}w_{ij}x_{ij}\leq c_{j}
xij{0,1}\displaystyle\displaystyle x_{ij}\in\{0,1\} iI\displaystyle\displaystyle\forall i\in I

Each subproblem corresponds to a 0-1 knapsack problem with |I| binary variables. For \bm{\pi}\geq\bm{0}, the Lagrangian bound LR(\bm{\pi}) is:

LR(𝝅)=jJLRj(𝝅)+iIπi.LR(\bm{\pi})=\sum_{j\in J}LR_{j}(\bm{\pi})+\sum_{i\in I}\pi_{i}.

The Lagrangian dual can then be written as:

min𝝅0|I|LR(𝝅)\min_{\bm{\pi}\in{\mathbb{R}}^{|I|}_{\geq 0}}LR(\bm{\pi})
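As an illustration, LR(\bm{\pi}) can be evaluated by solving one 0-1 knapsack per bin. The Python sketch below uses the textbook dynamic program and assumes integer, positive weights and capacities; the actual experiments may rely on a different knapsack solver.

```python
def knapsack_01(values, weights, capacity):
    """Classic O(n * capacity) dynamic program for the 0-1 knapsack."""
    best = [0.0] * (capacity + 1)
    for v, w in zip(values, weights):
        for c in range(capacity, w - 1, -1):  # reverse scan: each item once
            best[c] = max(best[c], best[c - w] + v)
    return best[capacity]

def lagrangian_bound(p, w, cap, pi):
    """p[i][j], w[i][j]: profit/weight of item i in bin j; cap[j]: capacity."""
    bound = sum(pi)                           # constant term sum_i pi_i
    for j in range(len(cap)):
        vals, wts = [], []
        for i in range(len(pi)):
            rp = p[i][j] - pi[i]              # reduced profit
            if rp > 0:                        # items with rp <= 0 stay at 0
                vals.append(rp)
                wts.append(w[i][j])
        bound += knapsack_01(vals, wts, cap[j])
    return bound
```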

Appendix D Hyperparameters and Implementation Details

In this section, we provide technical details about the implementation: the hyperparameters of the NN architecture (layers, layer sizes, and activation functions), the hand-crafted features, the hyperparameters of the optimizer used for training, and the GPU and CPU specifics of the machines used for the numerical experiments.

Loss functions

To weigh the contributions of the different iterations, we use γ = 0.999. In other amortized optimization works focusing on learning optimizers, such as [2], this parameter is set equal to 1.
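For concreteness, a minimal sketch of such a discounted loss over the unrolled trajectory follows; the direction of the discount (here down-weighting early iterations) and the name phi_values are illustrative assumptions, not the exact loss of the earlier sections.

```python
def unrolled_loss(phi_values, gamma=0.999):
    """Discounted sum of per-iteration objective values phi_t, t = 0..T-1."""
    T = len(phi_values)
    return sum(gamma ** (T - 1 - t) * phi for t, phi in enumerate(phi_values))
```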

Table 14: Best average initialization values for the η0\displaystyle\eta_{0} parameter over the test set for all methods that require a grid search. The values {104, 103, 102, 101, 100, 101}\displaystyle\{10^{4},\,10^{3},\,10^{2},\,10^{1},\,10^{0},\,10^{-1}\} are evaluated, and for each fixed maximum number of iterations (10\displaystyle 10, 25\displaystyle 25, 50\displaystyle 50, and 100\displaystyle 100), the value yielding the lowest average GAP is selected.
Dataset Methods 10 iter. 25 iter. 50 iter. 100 iter.
MC-Sml-40 Bundle h. 10 10 10 10
Bundle b. 100 10 10 10
Bundle s. 100 10 10 100
Bundle c. 100 100 100 100
Descent 10 10 10 10
Adam 1 1 1 1
MC-Sml-Var Bundle h. 10 100 100 10
Bundle b. 100 100 100 100
Bundle s. 100 100 100 100
Bundle c. 100 100 100 100
Descent 100 10 10 10
Adam 1 1 1 1
MC-Big-40 Bundle h. 10 10 10 1
Bundle b. 10 10 10 10
Bundle s. 10 10 10 10
Bundle c. 10 10 10 10
Descent 10 10 10 10
Adam 1 1 1 1
MC-Big-Var Bundle h. 100 10 10 10
Bundle b. 100 100 10 10
Bundle s. 100 100 10 10
Bundle c. 100 100 100 100
Descent 10 10 10 10
Adam 1 1 1 1
GA-10-100 Bundle h. 1000 1000 1000 1000
Bundle b. 1000 1000 1000 1000
Bundle s. 1000 1000 1000 1000
Bundle c. 1000 1000 1000 1000
Descent 100 100 100 100
Adam 100 10 10 10
GA-20-400 Bundle h. 1000 1000 1000 1000
Bundle b. 1000 1000 1000 1000
Bundle s. 1000 1000 1000 1000
Bundle c. 1000 1000 1000 1000
Descent 1000 100 100 100
Adam 100 10 10 10
Model Architecture - Bundle Network

For all datasets, we use the same architecture. A first LSTM maps the feature space to 6 vectors of size 128: the means and variances used to sample three vectors, namely the hidden representations of the η parameter, of the key for the last found component, and of the query for the current iteration. These vectors are fed to three different decoders, each composed of a hidden layer of size 8·128. The outputs of these decoders are, respectively, a scalar \eta_{t} used as step size and two vectors of size 128 representing the key and the query of the current iteration. The main framework is presented with a sampling mechanism, but we found it detrimental when using softmax for the function ψ (see Section 3.4 for further details). Still, sampling improves performance when used in conjunction with sparsemax for ψ.
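A hedged PyTorch sketch of this architecture is given below; the module names, the softplus used to keep \eta_{t} positive, and the exact wiring are illustrative assumptions rather than the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

H = 128  # hidden size used throughout

class BundleNetworkCell(nn.Module):
    def __init__(self, n_features):
        super().__init__()
        # The LSTM maps the hand-crafted features to 6 vectors of size 128:
        # a (mean, log-variance) pair for each of the three latent vectors.
        self.lstm = nn.LSTM(n_features, 6 * H, batch_first=True)
        def decoder(out_dim):
            return nn.Sequential(nn.Linear(H, 8 * H), nn.ReLU(),
                                 nn.Linear(8 * H, out_dim))
        self.eta_decoder = decoder(1)    # scalar step size eta_t
        self.key_decoder = decoder(H)    # key for the last bundle component
        self.query_decoder = decoder(H)  # query for the current iteration

    def forward(self, features, state=None):
        out, state = self.lstm(features, state)
        mu, log_var = out[..., :3 * H], out[..., 3 * H:]
        # Reparameterized sampling keeps the unrolled graph differentiable.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)
        z_eta, z_key, z_query = z.split(H, dim=-1)
        eta = F.softplus(self.eta_decoder(z_eta))  # keep eta_t > 0
        return eta, self.key_decoder(z_key), self.query_decoder(z_query), state
```

With the reparameterization trick in the sampling step, gradients flow through the whole unrolled trajectory, consistent with end-to-end training via automatic differentiation.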

Features

At iteration t, we perform a human-designed feature extraction, chosen to provide information similar to that used by the heuristic η-strategies. We add further features to represent the component added to the bundle at the associated iteration, and others to characterize this entry and compare it with previously added components. The features related to iteration t, summarizing the optimization dynamics at the current iteration, are the following (an illustrative sketch of this first group is given after the lists):

  • the last step-size ηt1\displaystyle\eta_{t-1},

  • The square norm of the search direction 𝒘(t1)22\displaystyle||{\bm{w}}^{(t-1)}||^{2}_{2},

  • The square norm of the search direction weighted by \eta_{t-1}, that is \eta_{t-1}||{\bm{w}}^{(t-1)}||_{2}^{2}. This corresponds to the quadratic part of the DMP objective function,

  • The linear part in the DMP: j=1t1αjθj(t1)\displaystyle\sum_{j=1}^{t-1}\alpha_{j}\theta_{j}^{(t-1)},

  • a boolean indicating whether the quadratic part is larger than the linear part, i.e., ||{\bm{w}}^{(t-1)}||^{2}_{2}>\sum_{j=1}^{t-1}\alpha_{j}\theta_{j}^{(t-1)},

  • a boolean indicating whether the quadratic part, rescaled by a large \eta^{*}=10000, exceeds the linear part: \eta^{*}||{\bm{w}}^{(t-1)}||^{2}_{2}>\sum_{j=1}^{t-1}\alpha_{j}\theta_{j}^{(t-1)},

  • The iteration counter t\displaystyle t.

Features describing the last trial and stabilization points. These features capture information about the most recent trial point 𝝅t\displaystyle{\bm{\pi}}_{t} and stabilization point 𝝅¯t\displaystyle\bar{{\bm{\pi}}}_{t}:

  • The objective value in the last trial point ϕ(𝝅t)\displaystyle\phi({\bm{\pi}}_{t}) and in the last stabilization point ϕ(𝝅¯t)\displaystyle\phi(\bar{{\bm{\pi}}}_{t}),

  • The linearization error in the last stabilization point α¯t\displaystyle\bar{\alpha}_{t} and in the last trial point αt\displaystyle\alpha_{t},

  • The norm of the last trial point ||{\bm{\pi}}_{t}||_{2}, of the last stabilization point ||\bar{{\bm{\pi}}}_{t}||_{2}, and of the gradient at the stabilization point ||\bar{{\bm{g}}}_{t}||_{2},

  • the square norm, the mean, the variance, the minimum, and the maximum of the vector 𝒈t\displaystyle{\bm{g}}_{t},

  • the square norm, the mean, the variance, the minimum, and the maximum of the vector 𝝅t\displaystyle{\bm{\pi}}_{t}.

Features comparing the last inserted component with the ones already contained in the bundle:

  • The minimum and the maximum of the scalar products between the last gradient and the gradients in the bundle, \min_{j}({\bm{g}}_{t}^{\top}{\bm{g}}_{j}) and \max_{j}({\bm{g}}_{t}^{\top}{\bm{g}}_{j}),

  • The minimum and the maximum of the scalar products between the last point and the points in the bundle, \min_{j}({\bm{\pi}}_{t}^{\top}{\bm{\pi}}_{j}) and \max_{j}({\bm{\pi}}_{t}^{\top}{\bm{\pi}}_{j}),

  • The scalar product of the last inserted gradient and the last search direction 𝒈t𝒘(t1)\displaystyle{\bm{g}}_{t}^{\top}{\bm{w}}^{(t-1)}.
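To make the lists above concrete, here is an illustrative NumPy sketch of the iteration-level features from the first group; the array names (w_prev for {\bm{w}}^{(t-1)}, alpha and theta for the linearization errors and the Dual Master solution) are assumptions.

```python
import numpy as np

def iteration_features(eta_prev, w_prev, alpha, theta, t, eta_star=10_000.0):
    """Iteration-level features summarizing the current optimization dynamics."""
    quad = float(w_prev @ w_prev)          # ||w^(t-1)||^2
    linear = float(alpha @ theta)          # sum_j alpha_j theta_j^(t-1)
    return np.array([
        eta_prev,                          # last step size
        quad,                              # squared norm of the direction
        eta_prev * quad,                   # quadratic part of the DMP objective
        linear,                            # linear part of the DMP objective
        float(quad > linear),              # boolean comparison
        float(eta_star * quad > linear),   # comparison rescaled by eta*
        float(t),                          # iteration counter
    ])
```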

Optimizer Specifications

We use Adam as the optimizer, with a learning rate of 0.00001, gradient-norm clipping (to 5), and an exponential decay of 0.9.

GPU specifics

For training on the datasets, we use Quadro RTX 5000 accelerators with 8 GB of RAM, and we test performance on the same machine. All the variants, except for Bundle Network, are CPU-only; these experiments are run on the same machine with a QEMU Virtual CPU version 2.5+.

Appendix E Dataset generalization sparsemax and sampling

In this appendix, we provide further details on the cross-dataset generalization of the model that uses sparsemax together with the sampling mechanism, analogously to the analysis carried out in Section 3.5 for softmax without sampling.

Table 15: Gap of the model tested on MC-Sml-40 and trained on different datasets. The training dataset is shown as the first element of each row. The last row shows the performance of the best baseline obtained with grid search.
Training Dataset 10 iter. 25 iter. 50 iter. 100 iter.
GAP time GAP time GAP time GAP time
MC-Sml-40 10.51 0.126 2.07 0.214 0.60 0.363 0.18 0.662
MC-Sml-Var 9.87 0.128 1.75 0.218 0.54 0.371 0.18 0.683
MC-Big-40 21.36 0.140 5.23 0.239 1.78 0.41 0.52 0.755
MC-Big-Var 9.82 0.128 1.99 0.221 0.60 0.374 0.19 0.683
GA-10-100 69.18 0.128 69.18 0.216 69.18 0.366 69.18 0.667
GA-20-400 96.08 0.128 96.08 0.218 96.08 0.368 96.08 0.673
Best Baseline 11.96 0.048 4.16 0.128 1.30 0.293 0.31 0.816
Table 16: Gap of the model tested on MC-Sml-Var and trained on different datasets. The training dataset is shown as the first element of each row. The last row shows the performance of the best baseline obtained with grid search.
Training Dataset 10 iter. 25 iter. 50 iter. 100 iter.
GAP time GAP time GAP time GAP time
MC-Sml-40 22.9 0.161 6.62 0.301 2.6 0.54 1.02 1.027
MC-Sml-Var 14.02 0.158 3.80 0.298 1.56 0.531 0.64 1.007
MC-Big-40 39.34 0.171 15.12 0.327 6.7 0.59 2.97 1.119
MC-Big-Var 14.27 0.160 3.64 0.302 1.27 0.545 0.49 1.042
GA-10-100 59.75 0.160 59.75 0.299 59.75 0.540 59.75 1.031
GA-20-400 66.35 0.163 66.35 0.307 66.35 0.552 66.35 1.052
Best Baseline 12.79 0.108 4.33 0.29 1.66 0.635 0.58 1.503
Table 17: Gap of the model tested on MC-Big-40 and trained on different datasets. The training dataset is shown as the first element of each row. The last row shows the performance of the best baseline obtained with grid search.
Training Dataset 10 iter. 25 iter. 50 iter. 100 iter.
GAP time GAP time GAP time GAP time
MC-Sml-40 41.77 0.154 7.62 0.276 1.89 0.484 0.67 0.915
MC-Sml-Var 91.76 0.152 63.78 0.279 37.85 0.487 8.22 0.919
MC-Big-40 13.27 0.150 3.50 0.272 1.40 0.479 0.51 0.905
MC-Big-Var 44.89 0.154 7.19 0.280 2.17 0.492 1.09 0.931
GA-10-100 99.8 0.140 99.8 0.252 99.8 0.445 99.8 0.84
GA-20-400 100.0 0.156 100.0 0.281 100.0 0.493 100.0 0.931
Best Baseline 19.34 0.076 5.64 0.203 2.08 0.447 0.78 1.110
Table 18: Gap of the model tested on MC-Big-Var and trained on different datasets. The training dataset is shown as the first element of each row. The last row shows the performance of the best baseline obtained with grid search.
Training Dataset 10 iter. 25 iter. 50 iter. 100 iter.
GAP time GAP time GAP time GAP time
MC-Sml-40 24.9 0.162 6.58 0.307 2.35 0.555 0.89 1.057
MC-Sml-Var 29.88 0.252 14.63 0.46 7.67 0.822 1.85 1.546
MC-Big-40 32.67 0.174 12.04 0.33 5.31 0.598 2.30 1.141
MC-Big-Var 18.94 0.162 4.24 0.305 1.47 0.548 0.61 1.045
GA-10-100 67.42 0.165 67.42 0.31 67.42 0.561 67.42 1.069
GA-20-400 74.69 0.163 74.69 0.306 74.69 0.551 74.69 1.051
Best Baseline 17.98 0.112 6.34 0.289 2.66 0.627 0.93 1.477
Table 19: Gap of the model tested on GA-10-100 and trained on different datasets. The training dataset is shown as the first element of each row. The last row shows the performance of the best baseline obtained with grid search.
Training Dataset 10 iter. 25 iter. 50 iter. 100 iter.
GAP time GAP time GAP time GAP time
MC-Sml-40 2.55 0.123 1.79 0.208 0.85 0.35 0.33 0.636
MC-Sml-Var 2.42 0.121 1.4 0.202 0.65 0.341 0.56 0.613
MC-Big-40 2.67 0.119 2.12 0.197 1.4 0.334 0.69 0.603
MC-Big-Var 2.5 0.120 1.61 0.203 0.9 0.339 0.59 0.609
GA-10-100 0.92 0.122 0.70 0.201 0.68 0.34 0.68 0.609
GA-20-400 0.37 0.122 0.11 0.207 0.06 0.352 0.05 0.636
Best Baseline 0.1893 0.071 0.0156 0.211 0.0014 0.601 0.0006 1.596
Table 20: Gap of the model tested on GA-20-400 and trained on different datasets. The training dataset is shown as the first element of each row. The last row shows the performance of the best baseline obtained with grid search.
Training Dataset 10 iter. 25 iter. 50 iter. 100 iter.
GAP time GAP time GAP time GAP time
MC-Sml-40 2.35 0.253 1.9 0.536 1.14 1.004 0.39 1.943
MC-Sml-Var 2.27 0.238 1.67 0.504 0.76 0.937 0.5 1.805
MC-Big-40 2.42 0.237 2.08 0.504 1.58 0.949 0.88 1.836
MC-Big-Var 2.32 0.237 1.79 0.503 1.01 0.949 0.6 1.84
GA-10-100 1.17 0.252 1.16 0.537 1.16 1.005 1.16 1.943
GA-20-400 0.4 0.218 0.13 0.461 0.07 0.86 0.06 1.651
Best Baseline 0.1129 0.433 0.0193 1.459 0.0048 4.256 0.0020 14.459