[1] Department of Computer Science, Università di Pisa, 3 Largo B. Pontecorvo, 56127 Pisa, Italy
[2] Laboratoire d'Informatique de Paris Nord, Université Sorbonne Paris Nord, 99 avenue Jean-Baptiste Clément, 93430 Villetaneuse, France
[3] Department of Information Systems, Data Analytics and Operations, ESSEC Business School, 3 avenue Bernard Hirsch, 95000 Cergy, France
Bundle Network: a Machine Learning-Based Bundle Method
Abstract
This paper presents Bundle Network, a learning-based algorithm inspired by the Bundle Method for convex non-smooth minimization problems.
Unlike classical approaches that rely on heuristic tuning of a regularization parameter, our method automatically learns to adjust it from data.
Furthermore, we replace the iterative resolution of the optimization problem that provides the search direction —traditionally computed as a convex combination of gradients at visited points— with a recurrent neural model equipped with an attention mechanism.
By leveraging the unrolled graph of computation, our Bundle Network can be trained end-to-end via automatic differentiation.
Experiments on Lagrangian dual relaxations of the Multi-Commodity Network Design and Generalized Assignment problems demonstrate that our approach consistently outperforms traditional methods relying on grid search for parameter tuning, while generalizing effectively across datasets.
Keywords: Attention Network, Bundle Method, Long Short Term Memory, Unrolling.
1 Introduction
This work presents a machine learning-based approach to tackle convex non-smooth minimization problems, inspired by bundle-type methods [42, 41], i.e., iterative algorithms that generalize the subgradient method in that the search direction is a convex combination of the gradients at the visited points, obtained by solving an optimization problem called the Master problem. At each iteration, a crucial scalar parameter related to the regularization in the Master problem must be properly chosen to obtain optimal performance. Although there are many variants of the Bundle method [22], in this work we only consider the best-known proximal version, where the regularization parameter, denoted by $\eta$, is the weight of the quadratic regularization term in the objective of the Master problem, penalizing the distance of the new point from the best point found so far. This parameter significantly influences the next iterate produced at each iteration in two different ways:
1. by changing the convex multipliers that are used to form the search direction out of the currently available bundle of (approximate) subgradient information;
2. by acting as the stepsize along said direction.
Traditionally, selecting this parameter relies on heuristic approaches, referred to as $\eta$-strategies, often requiring an extensive grid search to achieve high performance. By contrast, relatively simple diminishing stepsize rules are known to work well in subgradient-based approaches, both in theory and in practice. Thus, one may suspect that the "double role" of the parameter is what makes its online selection particularly challenging. Indeed, roughly speaking, the smaller $\eta$ is, the more "local" the direction is, taking into account only information computed close to the current iterate. However, towards termination, the algorithm should have accrued enough information that a "global" direction pointing towards the minimum of the model is likely preferable, which makes standard diminishing rules impractical.
In order to improve both the performance and the robustness of existing approaches, we propose to delegate the selection of the convex multipliers and the choice of the stepsize to a novel objective-based learning approach. Since our approach takes decisions dynamically during algorithm execution, Reinforcement Learning [55] techniques could be, and in fact have been, successfully used [47, 57]. However, Reinforcement Learning is mainly useful when differentiating the execution is not possible, and it has several drawbacks, such as often being sample-inefficient and suffering from high variance [30]. We rather propose to differentiate our algorithm via an unrolling approach [48], yielding a more direct optimization process with lower-variance gradient estimators that exploits the internal solver dynamics [53, 38]. A key challenge in this approach is to keep the structure of the Bundle method while differentiating through its execution. To address this problem, we use a heuristic approach whereby we replace its non-smooth components with smoother approximations based on a neural network model. Specifically, we propose a neural network that, using information from the current bundle, approximates the search direction and the step size, replacing both the $\eta$-strategy heuristic and the (dual) Master problem (4). This neural network is integrated in an approach inspired by the Bundle method, as outlined in Algorithm 1, incorporating soft updates for the stabilization point, thereby enabling end-to-end differentiation using Automatic Differentiation [4] techniques.
1.1 Literature Review
Many recent studies have leveraged machine learning to tackle Combinatorial Optimization problems [7]. In this work, we focus on amortization techniques [1], a research domain sometimes referred to as Learning-to-Optimize [9, 44, 10], and the associated methods are called semi-amortized optimization methods. In amortization techniques, the goal is to reduce the computational cost of iterative optimization methods by learning a direct mapping from inputs, defining problem instances, to approximate solutions. This often involves training an inference network to quickly produce a high-quality approximate solution that would otherwise require costly iterative optimization. Many amortized frameworks using an iterative structure for predictions often require some unrolling technique [48] for back-propagation, involving reformulating the algorithm execution to enable the gradient computation in the backward pass.
Previous works designed to find the best parameters for an optimization algorithm have focused on finding the best step size for stochastic gradient descent [54] in the context of Machine Learning, and are thus seen as instances of meta-learning. While early works considered simple evolutionary rules [6, 5], more recent ones are based on neural networks [50]. Some works [60, 32] jointly train with gradient descent both the network for the learning problem and the one for the meta-learning task. However, these approaches still struggle to scale to modern architectures with tens of thousands of parameters. To address this limitation, Andrychowicz et al. [2] introduce an LSTM-based optimizer that operates coordinate-wise on the network parameters. This design enables different behavior on each coordinate by using separate activations for each parameter, while maintaining the optimizer's invariance to parameter order, as the same update rule is independently applied to each coordinate.
All of the above approaches fundamentally adopt the structure of gradient descent as the base of their learning framework. There is growing interest in applying the Bundle method to optimization problems arising in the machine learning community [39]. Variants of Proximal Bundle methods [37] are used to optimize the complex, non-smooth loss functions often encountered in machine learning [49]. A specialized asynchronous Bundle method for solving distributed learning problems is presented in [8]. A further connection with the Bundle method can be found in [33], where a meta-learning approach is proposed to coordinate learning in master-slave distributed systems; this method aggregates the gradients for gradient descent, improving the scalability of distributed learning.
Recent research on Bundle methods for non-convex problems [18, 17] opens the possibility of using Bundle variants for machine learning problems. Bundle methods have been applied to improve the training of SVMs [15], especially for large-scale problems [34, 58]. Extensions of Bundle methods have also been proposed for multiclass SVMs, where Bundle methods can handle a larger number of constraints efficiently [20].
The benefits of using a proximal operator when optimizing a convex non-smooth function have already been demonstrated [45]. Moreover, [61] shows that proximal operators can be advantageous in meta-learning tasks; it does not consider the Bundle method, but rather unrolls proximal gradient descent.
In [43], a method is presented that approximates the resolution of a large-scale linear programming problem based on the associated Lagrangian function, providing iterative primal and dual updates using a neural network trained with unrolling techniques.
1.2 Bundle Method
The Bundle method (BM) [42, 41] is an iterative method that solves problems of type
$$f^* \;=\; \min\big\{\, f(\lambda) \;:\; \lambda \in \Lambda \,\big\} \tag{1}$$
where $f$ is a finite-valued convex non-smooth function and $\Lambda$ is a "simple" convex set. The Lagrangian Dual is a relevant application of (1), discussed in more detail in Section 3.1. In the rest of the paper we restrict ourselves to the case in which $\Lambda = \mathbb{R}^n$ or $\Lambda = \mathbb{R}^n_+$, which correspond to the Lagrangian Dual [21], used in our computational testing, when one is relaxing, respectively, equality or inequality constraints. For the sake of presentation we only consider the case $\Lambda = \mathbb{R}^n$ here, but the approach can be simply extended, as discussed at the end of Section 3.1, to the case $\Lambda = \mathbb{R}^n_+$.
Bundle methods iteratively optimize an approximation $f_\beta$ of the function $f$, constructed using the bundle of first-order information collected so far, plus some regularization. A common choice for $f_\beta$ is the Cutting Plane Model, i.e., the natural polyhedral lower approximation built using the very definition of subgradients. For reasons to become apparent shortly, it is expedient to fix some reference point $\bar\lambda$; then, at a given iteration one has the set of previous iterates $\{\lambda_i\}_{i\in\beta}$ and can define the bundle as the set of pairs $\beta = \{(g_i, \alpha_i)\}$, where $g_i$ is a subgradient obtained at the iterate $\lambda_i$ and $\alpha_i = f(\bar\lambda) - f(\lambda_i) - g_i^\top(\bar\lambda - \lambda_i)$ is the corresponding linearization error (non-negativity being implicit in the convexity of $f$). Remarkably, the $\alpha_i$ depend on $\bar\lambda$, but we do not stress this notationally since $\bar\lambda$ will always be clear from the context. In fact, $\beta$ may contain a strict subset of the thusly computed pairs (in case of bundle selection) and even pairs obtained differently (in case of bundle aggregation), but this is immaterial to our discussion, and we leave these details out for simplicity of presentation. With this information, the Cutting Plane model [51] is
$$f_\beta(\lambda) \;=\; f(\bar\lambda) + \max\big\{\, g_i^\top(\lambda - \bar\lambda) - \alpha_i \;:\; i \in \beta \,\big\} \tag{2}$$
It is easy to see that $f_\beta \le f$, i.e., the Cutting Plane model is a polyhedral lower approximation of $f$ when translated w.r.t. $\bar\lambda$. Thus, however $\beta$ is chosen, the "unstabilized" Master problem $\min_{\lambda\in\Lambda} f_\beta(\lambda)$
would provide a lower bound on the optimal value of Problem (1); however, such a problem can easily be unbounded below. Even when it is not, its optimal solution may easily be a very poor approximation of the true one, since $f_\beta$ may be a poor model of $f$ "far" from the points at which the function has been computed. Because of this, one has to introduce some stabilization term that avoids wandering away from the region of space where $f_\beta$ is an appropriate model. The most common choice, corresponding to the proximal Bundle method, is the squared Euclidean distance from the reference point $\bar\lambda$, which is therefore called the (current) stabilization centre; this leads to the stabilized (primal) Master problem
$$\lambda^* \;=\; \operatorname*{arg\,min}_{\lambda\in\Lambda}\ \Big\{\, f_\beta(\lambda) + \frac{1}{2\eta}\,\|\lambda - \bar\lambda\|^2 \,\Big\} \tag{3}$$
whose optimal solution $\lambda^*$ is used to choose the next iterate as $\lambda_{t+1} = \lambda^*$. Solving (3) is equivalent to solving its dual, which for the case $\Lambda = \mathbb{R}^n$ reads
$$\min_{\theta}\ \Big\{\ \frac{\eta}{2}\,\Big\|\sum_{i\in\beta}\theta_i\,g_i\Big\|^2 + \sum_{i\in\beta}\alpha_i\,\theta_i \ :\ \sum_{i\in\beta}\theta_i = 1,\ \ \theta_i \ge 0\ \ \forall i\in\beta\ \Big\} \tag{4}$$
while the case $\Lambda = \mathbb{R}^n_+$ adds a new set of variables and simple constraints. The optimal solution $\lambda^*$ of (3) and the optimal solution $\theta^*$ of (4) are linked by the relationship $\lambda^* = \bar\lambda - \eta\,\hat g$, where $\hat g = \sum_{i\in\beta}\theta^*_i g_i$ is the aggregated subgradient computed out of the current bundle. This is relevant in that $\hat g$ is a $\hat\alpha$-subgradient of $f$ at $\bar\lambda$, where $\hat\alpha = \sum_{i\in\beta}\alpha_i\theta^*_i$ is the aggregated linearization error. For the case $\Lambda = \mathbb{R}^n_+$, $\hat g$ is basically projected over the set of active non-negativity constraints; we refer to [40, Chapter 4] for further details. Often, solving the Dual Master problem (4) is more convenient than solving the primal, due to the existence of ad-hoc approaches [23]. The complete algorithm is described in Algorithm 1.
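To make the computation concrete, the following minimal NumPy sketch approximately solves the Dual Master problem (4) by projected gradient descent over the simplex; `solve_dual_master` and `project_simplex` are illustrative names, and this generic scheme merely stands in for the specialized approaches of [23].

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    return np.maximum(v - css[rho] / idx[rho], 0.0)

def solve_dual_master(G, alpha, eta, iters=500):
    """Approximately solve min_theta (eta/2)*||G.T @ theta||^2 + alpha @ theta
    over the simplex (Problem (4), case Lambda = R^n).
    G: (m, n) array whose rows are the bundle subgradients g_i;
    alpha: (m,) array of linearization errors."""
    theta = np.full(G.shape[0], 1.0 / G.shape[0])
    step = 1.0 / (eta * np.linalg.norm(G @ G.T, 2) + 1e-12)  # ~1/Lipschitz
    for _ in range(iters):
        grad = eta * G @ (G.T @ theta) + alpha  # gradient of the dual objective
        theta = project_simplex(theta - step * grad)
    return theta

# Aggregation and next trial point (Lambda = R^n):
#   g_hat = G.T @ theta;  a_hat = alpha @ theta;  lam_next = lam_bar - eta * g_hat
```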
The next trial point is computed by making a step of $\eta$ along the direction $d$, which is either the opposite of the aggregated subgradient or easily obtained out of it. Computing $f$ and its subgradient at the trial point enriches the bundle. The stabilization center is changed to the trial point, which is called a Serious Step (SS), if a "substantial decrease" is obtained, measured with respect to the descent predicted by the model via a fixed parameter $m \in (0,1)$ (usually "small"). Otherwise, i.e., if no (significant) progress is made, the stabilization center is not changed, which is referred to as a Null Step (NS); yet, new information has been added to the bundle, which will lead to a different direction at the next iteration. $\eta$ can (although it does not necessarily need to) also be adjusted, typically in different ways depending on the type of step taken. The stopping criterion in line 4 can be $\eta^*\|\hat g\|^2 + \hat\alpha \le \varepsilon$, where $\eta^*$ is "large w.r.t. $\eta$"; this corresponds to both $\|\hat g\| \approx 0$ and $\hat\alpha \approx 0$, proving that $\bar\lambda$ is an approximate minimum. At the end of each iteration, efficient implementations of the Bundle method may delete some old pairs from $\beta$ to lessen the Master problem cost. A common strategy is removing components whose multipliers are not used in the (dual) optimal solution over the last iterations (a quite conservative choice of this window is discussed in [25]).
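Putting the pieces together, the loop can be sketched as follows; this compact version reuses `solve_dual_master` from the previous sketch, uses the current $\eta$ in the stopping test instead of a separate $\eta^*$, and omits bundle deletion and the $\eta$ updates for brevity.

```python
import numpy as np

def proximal_bundle(oracle, lam0, eta=1.0, m=0.1, eps=1e-6, max_iter=100):
    """Minimal proximal Bundle sketch for min_{lam in R^n} f(lam);
    oracle(lam) returns (f(lam), g) with g a subgradient of f at lam."""
    lam_bar = np.asarray(lam0, dtype=float)
    f_bar, g = oracle(lam_bar)
    G, A = [g], [0.0]  # subgradients and linearization errors w.r.t. lam_bar
    for _ in range(max_iter):
        theta = solve_dual_master(np.array(G), np.array(A), eta)
        g_hat = np.array(G).T @ theta         # aggregated subgradient
        a_hat = np.array(A) @ theta           # aggregated linearization error
        v_star = eta * g_hat @ g_hat + a_hat  # predicted decrease
        if v_star <= eps:                     # lam_bar is approximately optimal
            break
        lam_new = lam_bar - eta * g_hat       # trial point
        f_new, g_new = oracle(lam_new)
        if f_new <= f_bar - m * v_star:       # Serious Step: move the center
            shift = lam_new - lam_bar         # re-center the stored errors
            A = [A[i] + f_new - f_bar - G[i] @ shift for i in range(len(A))]
            lam_bar, f_bar = lam_new, f_new
        # else: Null Step, the center stays but the new cut changes the direction
        G.append(g_new)
        A.append(f_bar - f_new - g_new @ (lam_bar - lam_new))
    return lam_bar
```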
2 Machine Learning-Based Bundle Method
The approach presented in this work is heavily inspired by Algorithm 1, but with two major differences:
1. we replace the solution of the Master problem, which is a convex quadratic program that may become rather costly to solve as the size of the bundle grows, with the prediction of a neural network; and
2. we decouple the role of $\eta$ in the formation of the search direction, via the weights $\theta$, from its role as a stepsize along it, by again predicting its value with a neural network.
To this effect, we need to parametrize neural networks so that our two predictors, mapping input instances and partial bundles to convex multipliers and stepsize respectively, are accurate. This parametrization is data-driven: we find the parameters that minimize a loss function representing the value computed by the Bundle Network over a set of examples.
2.1 Learning Framework
Let $\mathcal{D}$ be a dataset of instances, and denote by $\lambda_0$ the given initial point for the associated instance, possibly equal to zero. To mathematically define the learning problem, the number of iterations $T$ is fixed during the training process; once the model is trained, the number of iterations is no longer a constraint at inference time [44]. In Section 2.3 we describe in detail the model $M_\phi$, where $\phi$ are the parameters. We may consider finding the parameters that, on average over instances $\xi$ drawn from $\mathcal{D}$, minimize the loss defined by just the objective value after $T$ iterations:
$$\min_\phi\ \mathbb{E}_{\xi\sim\mathcal{D}}\big[\, f_\xi(\lambda_T)\,\big].$$
However, Learning-to-Optimize approaches generally consider the contribution of each step of the whole trajectory, since this encourages optimizing the contribution of every iterate. Following this approach, we rather aim at minimizing the function
$$\mathcal{L}(\phi) \;=\; \mathbb{E}_{\xi\sim\mathcal{D}}\Big[\,\sum_{t=1}^{T} w_t\, f_\xi(\lambda_t)\,\Big], \tag{5}$$
where $\lambda_t$ denotes the trial point at iteration $t$, which is computed by the neural network. We use $w_t = \gamma^{T-t}$, where $\gamma \in (0,1]$ is a hyperparameter. Hence, the learning problem can be formulated as
$$\phi^* \;\in\; \operatorname*{arg\,min}_\phi\ \mathcal{L}(\phi). \tag{6}$$
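As a concrete illustration, a minimal PyTorch sketch of this unrolled loss might read as follows; `step_fn` and `oracle` are assumed interfaces standing in for one differentiable iteration of Algorithm 3 and for the function oracle, and the weights $w_t = \gamma^{T-t}$ are the instantiation given above.

```python
import torch

def trajectory_loss(step_fn, oracle, lam0, T=10, gamma=0.9):
    """Unrolled training loss (Eq. (5)): a discounted sum of objective values
    along the trajectory. step_fn(lam, bundle) performs one differentiable
    bundle iteration; oracle(lam) returns (f(lam), subgradient)."""
    lam, bundle, loss = lam0, [], 0.0
    for t in range(1, T + 1):
        lam = step_fn(lam, bundle)              # differentiable in the network parameters
        f_val, g = oracle(lam)
        bundle.append(g.detach())               # bundle info treated as constant (Sec. 2.3)
        loss = loss + gamma ** (T - t) * f_val  # later iterates weigh more
    return loss
```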
In the next sections, we specify how to differentiate the loss function using the chain rule, and detail the structure of the model used as a machine learning surrogate of the Master problem in the Bundle method.
2.2 Loss Function
We want to compute a subgradient of the loss $\mathcal{L}$ with respect to the network parameters $\phi$. $\mathcal{L}$ is a linear combination of the function values evaluated at different points along the trajectory: consequently, its subgradient with respect to $\phi$ can be written as a weighted sum of subgradients of the function computed at those points. Thus we only need to compute a subgradient of $f(\lambda_t)$ with respect to $\phi$ for each $t$.
We apply the chain rule to compute it as $\partial_\phi f(\lambda_t) = \partial_\lambda f(\lambda_t)\,\partial_\phi\lambda_t$. Note that $\partial_\lambda f(\lambda_t)$ is a subgradient of $f$ at $\lambda_t$, which exists since $f$ is convex, and in fact corresponds to the $g_t$ used in the standard version of the Bundle method. As we construct $\lambda_t$ to be differentiable, $\partial_\phi\lambda_t$ corresponds to the gradient that can be obtained using automatic differentiation techniques. The algorithm structure for this computation is discussed in detail in the next section.
Putting it all together, we obtain the gradient approximation
$$\nabla_\phi \mathcal{L}(\phi) \;\approx\; \mathbb{E}_{\xi\sim\mathcal{D}}\Big[\,\sum_{t=1}^{T} w_t\, g_t^\top\,\partial_\phi\lambda_t\,\Big].$$
2.3 Unrolled Model
In this section, we provide further details on the structure of the machine learning model used to compute the trial point $\lambda_t$ at iteration $t$. We keep the structure of the Bundle method mostly unchanged, modifying only some operations in order to make them differentiable: the choice of the step size, the solution of Problem (4), and the update of the stabilization center. The result is presented in Algorithm 3: since it only employs differentiable operations, we can use standard automatic differentiation tools.
At each iteration, we provide a feature vector to the neural network. We do not back-propagate through the feature extraction, which is made using hand-crafted features described in detail in Appendix D. These features are fed as input to a neural network, which outputs a vector $\theta$ in the $|\beta_t|$-dimensional simplex and a step size $\eta_t$. More details on the network architecture can be found in Section 3.
In the standard Bundle method, the stabilization center is updated only when the descent test of the Serious Step is satisfied. However, this is a piecewise-constant operation, and therefore has zero derivative almost everywhere. Denoting by $s = (s_1, s_2)$ the one-hot vector with $s_1 = 1$ if the descent test holds and $s_1 = 0$ otherwise (and $s_2 = 1 - s_1$), the update of the stabilization point can be rewritten as
$$\bar\lambda \;\leftarrow\; s_1\,\lambda_t + s_2\,\bar\lambda.$$
This corresponds to choosing a vertex of the two-dimensional simplex. A common strategy in machine learning to obtain smoother approximations of the $\operatorname{argmin}$ operator is to replace it with a soft version, obtained with the softmin [29, pp. 180-184]. This leads to considering
$$s \;=\; \operatorname{softmin}\big(f(\lambda_t),\, f(\bar\lambda)\big),$$
and the update can be performed as before, i.e., choosing a (possibly different) convex combination of the two vectors $\lambda_t$ and $\bar\lambda$, which will not necessarily be a vertex of the two-dimensional simplex, but a point inside it.
It is important to note that we do not differentiate through the operation computing the subgradient in line 11, as obtaining exact derivatives would require the function to be twice differentiable. It is possible to approximate this contribution, but we do not explicitly consider it in this work; the same strategy has been commonly pursued in other works, for example [2]. In other words, the bundle information fed to the neural network is treated as a constant, even though it is produced by a (sub)gradient computation. This is similar to the way recurrent language models do not backpropagate through the previously generated words, but rather through the previous latent state [3].
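A minimal sketch of the resulting soft update follows; the temperature `tau` is an assumption of this sketch, not a parameter specified above.

```python
import torch

def soft_center_update(lam_bar, lam_new, f_bar, f_new, tau=1.0):
    """Differentiable surrogate of the Serious/Null Step decision: a
    softmin-weighted convex combination of the trial point and the center,
    instead of the hard (piecewise-constant) argmin between the two."""
    s = torch.softmax(-torch.stack([f_new, f_bar]) / tau, dim=0)  # softmin
    return s[0] * lam_new + s[1] * lam_bar

# The subgradients entering the bundle are detached, since line 11 is not
# differentiated through:  bundle.append(g.detach())
```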
2.3.1 Step and Direction Architecture
This section provides further details on the architecture of the machine learning model, schematically presented in Figure 1, which is used to predict $\theta$ and $\eta_t$.
Given the features of the current bundle $\beta_t$, a Recurrent Neural Network [13, 11, 12], specifically a Long Short-Term Memory (LSTM [31]) layer, maps this vector to a hidden space by predicting six vectors of size $h$: the means and variances $(\mu_q, \sigma_q)$, $(\mu_k, \sigma_k)$ and $(\mu_\eta, \sigma_\eta)$ used to extract, by sampling from a Gaussian distribution, the hidden representations of the query and key that represent the current iteration, as in attention mechanisms [59], as well as the hidden representation for the step size. More precisely, since the hidden representations follow a Gaussian distribution, we can approximate the expectation by sampling and use the reparameterization trick [36, 52] to perform standard backpropagation. The size of the hidden representations and the other hyperparameters of the network are discussed in Appendix D.
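For instance, the sampling step can be sketched as below, where parameterizing the variance through `log_var` is an implementation choice of this sketch:

```python
import torch

def sample_hidden(mu, log_var):
    """Reparameterization trick [36, 52]: h = mu + sigma * eps, which keeps
    the sampling differentiable w.r.t. the predicted mean and variance."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps

# At test time the mean is taken directly (see Section 3.4): h = mu.
```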
The sampled representations are then fed into three independent Multi-Layer Perceptrons that predict a scalar $\eta_t$ and two vectors $q_t$ and $k_t$. The vectors $q_t$ and $k_t$ are, respectively, the query and the key of the current iteration; the key $k_t$ is stored in memory for use in subsequent iterations. Using $q_t$ and all the stored vectors $k_i$, we compute a scalar $u_i$ for each component in the bundle using an Attention Layer defined as follows:
$$u_i \;=\; \frac{\langle q_t,\, k_i \rangle}{\sqrt{h}}, \qquad i \in \beta_t.$$
To obtain an approximate solution of the Dual Master problem (4), we pass the vector $u$ to a normalization function $\mathcal{N}$, taking as input a vector and returning one of the same dimension, with all components between zero and one and summing to one:
$$\theta \;=\; \mathcal{N}(u).$$
This yields a vector $\theta$ whose components lie in $[0,1]$ and sum to one, i.e., a feasible solution of Problem (4) as in the classic Bundle method. Some examples of $\mathcal{N}$ are the softmax and sparsemax [46] functions.
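A minimal sketch of this attention-based surrogate of the Dual Master problem follows; `bundle_multipliers` is an illustrative name, and the square-root scaling of the scores follows [59].

```python
import torch

def bundle_multipliers(q_t, keys, normalize=torch.softmax):
    """Score each bundle component with the current query q_t against its
    stored key, then normalize the scores onto the simplex; sparsemax [46]
    could be substituted for the softmax."""
    u = keys @ q_t / q_t.shape[-1] ** 0.5  # one scalar per bundle component
    return normalize(u, dim=-1)            # convex multipliers theta

# theta = bundle_multipliers(q_t, torch.stack(stored_keys))
# g_hat = theta @ torch.stack(bundle_subgradients)  # aggregated direction
```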
3 Evaluation
In this section, we present the main numerical results of our work. We test our approach on the example case of Lagrangian relaxation, of which we now provide further details.
3.1 Lagrangian Relaxation
Consider a Mixed Integer Linear Program [14, Chap. 8] of the form:
$$\begin{aligned}
z^* \;=\; \min_x\ & c^\top x && \text{(7a)}\\
\text{s.t.}\ & Ax \ge b && \text{(7b)}\\
& x \in X && \text{(7c)}
\end{aligned}$$
The Lagrangian subproblem is obtained by dualizing a family of constraints. Here we relax the constraints defined by the inequalities (7b) and penalize their violation with Lagrangian multipliers (LMs) $\lambda \ge 0$:
$$\mathcal{L}(\lambda) \;=\; \min_x\ \big\{\, c^\top x + \lambda^\top (b - Ax) \;:\; x \in X \,\big\}.$$
Weak Lagrangian duality ensures that $\mathcal{L}(\lambda)$ provides a lower bound for $z^*$. The goal of the Lagrangian Dual problem is to determine the best such bound:
$$\max_{\lambda \ge 0}\ \mathcal{L}(\lambda). \tag{8}$$
Lagrangian relaxation is particularly valuable when removing certain constraints results in a Lagrangian subproblem that can be solved efficiently; it is therefore crucial to carefully select the set of constraints to dualize. Indeed, this choice involves a trade-off between speed and the quality of the resulting bound, as will be discussed later, due to the No Free Lunch Theorem [28] proving that, in order to achieve better bounds than the continuous relaxation, the Lagrangian subproblem should be "harder" than a Linear Program. Actually, the Lagrangian Dual problem (8) is a concave maximization problem, but it is trivial to reformulate it as a convex minimization problem [56]. Relaxing inequality constraints instead of equality ones leads to nonnegative multipliers $\lambda \ge 0$; to ensure that we predict only nonnegative multipliers, we substitute line 10 of Algorithm 3 with
$$\lambda_{t+1} \;=\; \rho\Big(\bar\lambda_t - \eta_t \sum_{i\in\beta_t}\theta_i\, g_i\Big),$$
where $\rho$ is a component-wise non-negative activation function such as ReLU.
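In code this amounts to a one-line change of the trial-point update; the sketch below assumes the bundle subgradients are stacked as rows of `G`.

```python
import torch

def nonnegative_step(lam_bar, eta_t, theta, G, rho=torch.relu):
    """Trial-point update for nonnegative multipliers (Lambda = R^n_+): the
    plain bundle step is passed through a componentwise nonnegative
    activation rho, ReLU by default."""
    return rho(lam_bar - eta_t * (theta @ G))
```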
3.2 Experiments
In this section, we compare our approach to classical iterative methods for solving the Lagrangian Dual, specifically those based on subgradient information. In particular, we evaluate the performance of the proposed approach against both simple subgradient-based methods and the classical Bundle method with different heuristic $\eta$-strategies.
We consider ADAM [35] and gradient ascent with a simple rule for regularizing the step size: if more than two iterations pass without improving the objective value, we divide the step size by two. These two baselines are well-established in the context of convex non-smooth optimization and serve as relevant points of comparison. They are of particular interest because they are typically faster than the Bundle method per iteration: gradient ascent relies only on the subgradient at the last visited point, while ADAM combines the subgradient at the last visited point with the previous search direction.
We further compare our approach with classic variants of the Bundle method, considering the four long-term $\eta$-strategies presented in Appendix A, inspired by the ones used in the state-of-the-art bundle-based BundleSolver component of SMS++ [26]. With the non-constant ones, we also combine the middle-term $\eta$-strategy and the short-term strategy of Appendix A, which increase and decrease $\eta$ by fixed multiplicative factors. For all six classic approaches, we perform a grid search to choose the best initial parameter, evaluating several initialization values for $\eta$. For each fixed maximum number of iterations (10, 25, 50, and 100), we select the value of $\eta$ that achieves the lowest average GAP over the test set. The results of this grid search are summarized in Table 14.
The objective values of $f$ are too large for a learning method, so we rescale the objective by dividing the function value by the norm of the gradient at the starting point $\lambda_0$, always taken equal to the zero vector. Even if the objective values change, the minimizers of the original function and of its scaled version are the same.
We consider the Lagrangian Dual of the Multi-Commodity Network Design (MC) presented in Appendix B and the Lagrangian Dual of the Generalized Assignment (GA) presented in Appendix C. For MC, we consider a Lagrangian relaxation in which we dualize equality constraints, while for GA we consider a Lagrangian relaxation in which we dualize inequality constraints, applying the modifications previously discussed. For the MC and the GA, we use the datasets considered in [19].
For each instance $\xi$ on which we evaluate our model, the Lagrangian relaxation is solved using SMS++. The obtained Lagrangian multipliers, denoted by $\lambda^*_\xi$, are used for computing the GAP. We emphasize that we do not need these multipliers to perform the training, but we use them to provide a metric for our approach. We use the percentage gap as a metric to evaluate the quality of the bounds computed by the different systems, averaged over a dataset of instances. We measure the quality of each approach by means of the percentage GAP
$$\mathrm{GAP}_\xi \;=\; 100 \cdot \frac{\big|\, z^*_\xi - \hat z_\xi \,\big|}{\big|\, z^*_\xi \,\big|},$$
where $\hat z_\xi$ is the bound returned for instance $\xi$ and $z^*_\xi$ is the reference bound obtained from the multipliers $\lambda^*_\xi$.
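Under the assumption that the metric is the standard relative gap between the returned bound and the reference bound, it can be sketched as:

```python
def percentage_gap(z_hat, z_star):
    """Percentage GAP of a returned bound z_hat w.r.t. the reference bound
    z_star computed offline from the SMS++ multipliers; the tables report
    its average over the test set."""
    return 100.0 * abs(z_star - z_hat) / abs(z_star)
```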
3.3 Bound Accuracy
Table 1 reports the performance of the different systems on our datasets. For the heuristic strategies that do not use machine learning, we performed a grid search for the initialization of the step-size/regularization parameter, choosing, for each fixed number of iterations (10, 25, 50, and 100), the best value in the grid. From Table 1 we can see that Bundle Network outperforms the other techniques in terms of GAP for a fixed number of iterations on the different datasets.
Methods | 10 iter. | 25 iter. | 50 iter. | 100 iter. | |||||
GAP | time | GAP | time | GAP | time | GAP | time | ||
MC-SML-40 |
Bundle h. | 20.18 | 0.044 | 6.68 | 0.117 | 2.84 | 0.269 | 1.17 | 0.767 |
Bundle b. | 17.68 | 0.054 | 6.09 | 0.145 | 1.84 | 0.327 | 0.41 | 0.898 | |
Bundle s. | 17.58 | 0.044 | 6.10 | 0.116 | 1.82 | 0.271 | 0.36 | 0.778 | |
Bundle c. | 11.96 | 0.048 | 4.16 | 0.128 | 1.30 | 0.293 | 0.31 | 0.816 | |
Descent | 44.66 | 0.008 | 14.11 | 0.021 | 4.70 | 0.044 | 2.24 | 0.093 | |
Adam | 48.87 | 0.011 | 12.84 | 0.027 | 3.04 | 0.055 | 1.89 | 0.109 | |
Bundle n. | 9.20 | 0.105 | 1.75 | 0.185 | 0.53 | 0.321 | 0.17 | 0.592 | |
MC-SML-Var |
Bundle h. | 18.27 | 0.094 | 9.55 | 0.254 | 4.08 | 0.564 | 1.96 | 1.433 |
Bundle b. | 15.83 | 0.124 | 7.39 | 0.334 | 2.88 | 0.736 | 0.74 | 1.749 | |
Bundle s. | 15.82 | 0.094 | 6.92 | 0.256 | 2.58 | 0.576 | 0.67 | 1.421 | |
Bundle c. | 12.79 | 0.108 | 4.33 | 0.29 | 1.66 | 0.635 | 0.58 | 1.503 | |
Descent | 62.43 | 0.025 | 28.64 | 0.066 | 11.14 | 0.142 | 4.00 | 0.303 | |
Adam | 51.48 | 0.034 | 15.42 | 0.086 | 4.74 | 0.17 | 2.75 | 0.341 | |
Bundle n. | 12.60 | 0.138 | 3.28 | 0.269 | 1.24 | 0.496 | 0.50 | 0.952 | |
MCND-Big-40 |
Bundle h. | 20.66 | 0.07 | 8.54 | 0.185 | 5.32 | 0.410 | 2.88 | 0.986 |
Bundle b. | 20.67 | 0.084 | 7.37 | 0.222 | 3.17 | 0.485 | 1.24 | 1.209 | |
Bundle s. | 20.67 | 0.07 | 7.37 | 0.184 | 3.17 | 0.407 | 1.19 | 1.042 | |
Bundle c. | 19.34 | 0.076 | 5.64 | 0.203 | 2.08 | 0.447 | 0.78 | 1.110 | |
Descent | 28.53 | 0.023 | 14.12 | 0.058 | 9.91 | 0.117 | 8.21 | 0.241 | |
Adam | 24.89 | 0.025 | 9.38 | 0.064 | 7.30 | 0.129 | 6.90 | 0.256 | |
Bundle n. | 12.70 | 0.130 | 3.41 | 0.243 | 1.26 | 0.436 | 0.47 | 0.832 | |
MC-Big-Var |
Bundle h. | 26.23 | 0.095 | 9.99 | 0.248 | 4.41 | 0.550 | 2.23 | 1.386 |
Bundle b. | 21.82 | 0.123 | 10.01 | 0.324 | 3.46 | 0.695 | 1.08 | 1.652 | |
Bundle s. | 21.83 | 0.094 | 9.53 | 0.253 | 3.43 | 0.550 | 1.02 | 1.353 | |
Bundle c. | 17.98 | 0.112 | 6.34 | 0.289 | 2.66 | 0.627 | 0.93 | 1.477 | |
Descent | 54.11 | 0.029 | 24.69 | 0.074 | 10.58 | 0.154 | 4.85 | 0.323 | |
Adam | 44.21 | 0.034 | 13.75 | 0.086 | 5.42 | 0.170 | 3.84 | 0.345 | |
Bundle n. | 20.10 | 0.139 | 5.11 | 0.271 | 2.01 | 0.495 | 0.70 | 0.953 | |
GA-10-100 |
Bundle h. | 0.2333 | 0.070 | 0.028 | 0.241 | 0.0016 | 0.657 | 0.0006 | 1.596 |
Bundle b. | 0.2333 | 0.073 | 0.028 | 0.249 | 0.0016 | 0.684 | 0.0006 | 1.655 | |
Bundle s. | 0.3055 | 0.065 | 0.0156 | 0.211 | 0.0026 | 0.553 | 0.0024 | 1.283 | |
Bundle c. | 0.1893 | 0.071 | 0.0157 | 0.235 | 0.0014 | 0.601 | 0.0011 | 1.383 | |
Descent | 0.8799 | 0.013 | 0.2091 | 0.032 | 0.0644 | 0.063 | 0.0364 | 0.127 | |
Adam | 0.7234 | 0.012 | 0.1843 | 0.029 | 0.0183 | 0.059 | 0.0048 | 0.119 | |
Bundle n. | 0.1484 | 0.104 | 0.0228 | 0.177 | 0.0047 | 0.304 | 0.0009 | 0.551 | |
GA-20-400 |
Bundle h. | 0.157 | 0.445 | 0.0343 | 1.548 | 0.0088 | 4.528 | 0.0021 | 15.099
Bundle b. | 0.157 | 0.463 | 0.0343 | 1.568 | 0.0088 | 4.539 | 0.0021 | 15.158 |
Bundle s. | 0.3542 | 0.431 | 0.0284 | 1.417 | 0.0048 | 4.256 | 0.0028 | 14.145 |
Bundle c. | 0.1129 | 0.433 | 0.0193 | 1.459 | 0.0052 | 4.264 | 0.0020 | 14.459 |
Descent | 0.5815 | 0.205 | 0.2971 | 0.5 | 0.0753 | 0.98 | 0.0255 | 1.937 | |
Adam | 0.6598 | 0.177 | 0.0831 | 0.444 | 0.016 | 0.887 | 0.0090 | 1.775 | |
Bundle n. | 0.1018 | 0.228 | 0.0190 | 0.498 | 0.0052 | 0.943 | 0.0014 | 1.837 |
However, Figures 2-5 show that Bundle Network is slower than the other approaches over the first iterations, and that its initialization time could be improved. These times may also depend on transfers of vectors between CPU and GPU and vice versa.
From Table 1 we can also see that the constant strategy almost always provides the best performance among the deterministic strategies. This is a well-known characteristic of the Bundle method: the iteration budgets considered here are relatively small for an aggregated Bundle method, making adaptive $\eta$-tuning less effective than a constant strategy chosen by grid search. This is because the Bundle method operates in distinct phases: initially, its goal is to identify promising directions for improvement, and later it focuses on collecting high-quality subgradients near the optimum to certify optimality. This raises an interesting question about using multiple models to capture this behavior, which is beyond the scope of this work.
Nevertheless, our model provides high-quality predictions that generalize well to a larger number of iterations than the one used during training (10). In many cases, Bundle Network provides, within a given number of iterations, lower gaps than all the baselines; even in the cases where its predictions are not the best, they lead to gaps similar to those of the best-performing baseline.












3.4 Sampling or Not Sampling
Activation | Sample | 10 iter. | 25 iter. | 50 iter. | 100 iter. | ||||||||
t | q k | GAP | time | GAP | time | GAP | time | GAP | time | ||||
softmax | 9.20 | 0.105 | 1.75 | 0.185 | 0.53 | 0.321 | 0.17 | 0.592 | |||||
softmax | x | 9.78 | 0.104 | 1.85 | 0.187 | 0.56 | 0.321 | 0.17 | 0.592 | ||||
softmax | x | x | 10.41 | 0.105 | 1.99 | 0.185 | 0.60 | 0.318 | 0.18 | 0.588 | |||
sparsemax | 58.3 | 0.125 | 57.93 | 0.214 | 57.47 | 0.361 | 57.29 | 0.661 | |||||
sparsemax | x | 13.16 | 0.124 | 2.61 | 0.212 | 0.80 | 0.359 | 0.24 | 0.655 | ||||
sparsemax | x | x | 10.51 | 0.126 | 2.07 | 0.214 | 0.60 | 0.363 | 0.18 | 0.662 |
Activation | Sample | 10 iter. | 25 iter. | 50 iter. | 100 iter. | ||||||||
t | q k | GAP | time | GAP | time | GAP | time | GAP | time | ||||
softmax | 12.70 | 0.130 | 3.41 | 0.243 | 1.26 | 0.436 | 0.47 | 0.832 | |||||
softmax | x | 12.85 | 0.131 | 3.34 | 0.244 | 1.22 | 0.436 | 0.45 | 0.831 | ||||
softmax | x | x | 12.88 | 0.131 | 3.36 | 0.244 | 1.26 | 0.436 | 0.48 | 0.834 | |||
sparsemax | 38.65 | 0.150 | 24.00 | 0.270 | 10.34 | 0.475 | 3.39 | 0.898 | |||||
sparsemax | x | 39.40 | 0.150 | 32.37 | 0.270 | 17.76 | 0.474 | 1.58 | 0.896 | ||||
sparsemax | x | x | 13.27 | 0.150 | 3.50 | 0.272 | 1.40 | 0.479 | 0.51 | 0.905 |
Activation | Sample | 10 iter. | 25 iter. | 50 iter. | 100 iter. | ||||||||
t | q k | GAP | time | GAP | time | GAP | time | GAP | time | ||||
softmax | 12.60 | 0.138 | 3.28 | 0.269 | 1.24 | 0.496 | 0.50 | 0.952 | |||||
softmax | x | 13.03 | 0.137 | 3.27 | 0.268 | 1.26 | 0.492 | 0.51 | 0.948 | ||||
softmax | x | x | 14.68 | 0.139 | 3.94 | 0.269 | 1.55 | 0.494 | 0.62 | 0.957 | |||
sparsemax | 30.23 | 0.157 | 15.96 | 0.293 | 12.45 | 0.527 | 11.05 | 1.006 | |||||
sparsemax | x | 56.83 | 0.160 | 56.03 | 0.300 | 56.00 | 0.538 | 56.00 | 1.033 | ||||
sparsemax | x | x | 14.02 | 0.158 | 3.80 | 0.298 | 1.56 | 0.531 | 0.64 | 1.007 |
Activation | Sample | 10 iter. | 25 iter. | 50 iter. | 100 iter. | ||||||||
t | q k | GAP | time | GAP | time | GAP | time | GAP | time | ||||
softmax | 20.10 | 0.139 | 5.11 | 0.271 | 2.01 | 0.495 | 0.70 | 0.953 | |||||
softmax | x | 16.79 | 0.141 | 4.78 | 0.273 | 1.68 | 0.499 | 0.63 | 0.956 | ||||
softmax | x | x | 16.66 | 0.139 | 4.63 | 0.270 | 1.80 | 0.495 | 0.67 | 0.955 | |||
sparsemax | 42.84 | 0.160 | 32.39 | 0.301 | 29.68 | 0.541 | 23.25 | 1.031 | |||||
sparsemax | x | 30.20 | 0.162 | 12.33 | 0.305 | 6.25 | 0.542 | 3.78 | 1.033 | ||||
sparsemax | x | x | 18.94 | 0.162 | 4.24 | 0.305 | 1.47 | 0.548 | 0.61 | 1.045 |
Activation | Sample | 10 iter. | 25 iter. | 50 iter. | 100 iter. | ||||||||
t | q k | GAP | time | GAP | time | GAP | time | GAP | time | ||||
softmax | 0.1484 | 0.104 | 0.0228 | 0.177 | 0.0047 | 0.304 | 0.0009 | 0.551 | |||||
softmax | x | 0.1544 | 0.102 | 0.0229 | 0.176 | 0.0049 | 0.303 | 0.0009 | 0.548 | ||||
softmax | x | x | 0.1954 | 0.099 | 0.0370 | 0.176 | 0.0081 | 0.302 | 0.0013 | 0.551 | |||
sparsemax | 0.2335 | 0.118 | 0.0600 | 0.201 | 0.0336 | 0.339 | 0.0288 | 0.615 | |||||
sparsemax | x | 1.0482 | 0.120 | 0.9783 | 0.202 | 0.9782 | 0.342 | 0.9782 | 0.611 | ||||
sparsemax | x | x | 0.9172 | 0.122 | 0.7001 | 0.201 | 0.6827 | 0.340 | 0.6818 | 0.609 |
Activation | Sample | 10 iter. | 25 iter. | 50 iter. | 100 iter. | ||||||||
t | q k | GAP | time | GAP | time | GAP | time | GAP | time | ||||
softmax | 0.1018 | 0.228 | 0.0190 | 0.498 | 0.0052 | 0.943 | 0.0014 | 1.837 | |||||
softmax | x | 0.1065 | 0.218 | 0.0216 | 0.480 | 0.0060 | 0.904 | 0.0014 | 1.756 | ||||
softmax | x | x | 0.1273 | 0.204 | 0.0295 | 0.445 | 0.0087 | 0.836 | 0.0018 | 1.621 | |||
sparsemax | 0.1463 | 0.235 | 0.0626 | 0.500 | 0.0601 | 0.935 | 0.0599 | 1.804 | |||||
sparsemax | x | 0.7860 | 0.219 | 0.7696 | 0.463 | 0.767 | 0.860 | 0.7669 | 1.656 | ||||
sparsemax | x | x | 0.4002 | 0.218 | 0.1272 | 0.461 | 0.0681 | 0.860 | 0.0627 | 1.651 |
In this section, we compare the model trained with and without the sampling strategy for the hidden representations discussed earlier, as well as two different choices for $\mathcal{N}$, namely the softmax [29, pp. 180-184] and the sparsemax [46]. In all cases the models are trained by unrolling 10 iterations of the resolution approach, using Adam as optimizer with gradient-norm clipping (to 5). When we do not use the sampling mechanism, the hidden representations for the queries, the keys, and the step size are not obtained by sampling from a Gaussian distribution with learned mean and variance; instead, they are predicted directly by the encoder. In the test phase, the hidden representations are not sampled in either approach: when the model predicts both the mean and the variance, we directly take the mean.
From Tables 2-7, we can see that sampling leads to better performance when using the sparsemax, but is inefficient for the softmax. This may be because the sparsemax function has zero gradient in certain regions of the space, which can be escaped using sampling strategies.
We choose the softmax with no sampling strategy, as it shows better and more stable performance across both problems. Meanwhile, the sparsemax seems to be an interesting choice for harder datasets, as shown in Table 5.
3.5 Testing on Another Dataset
Tables 8-13 report the performance of models trained on one dataset and tested on another. Training the model on a different dataset rarely provides better gaps than training it on the same dataset; this happens only in Table 8 and, for 10 iterations, in Table 12. Within the same problem, similar behavior can be obtained in several cases, in particular when using more iterations. In some cases, the model trained on the GA datasets achieves interesting performance on the small MC datasets, as shown in Tables 8 and 9; conversely, no update rule learned on the MC datasets works sufficiently well on the GA datasets. Tables 8-13 also show that, in several cases, a model trained on another dataset of the same problem attains, for a given iteration budget, lower gaps than the best baseline tuned by grid search.
Training Dataset | 10 iter. | 25 iter. | 50 iter. | 100 iter. | ||||||||
GAP | time | GAP | time | GAP | time | GAP | time | |||||
MC-Sml-40 | 9.20 | 0.105 | 1.75 | 0.185 | 0.53 | 0.321 | 0.17 | 0.592 | ||||
MC-Sml-Var | 8.02 | 0.104 | 1.69 | 0.184 | 0.51 | 0.319 | 0.17 | 0.593 | ||||
MC-Big-40 | 18.02 | 0.116 | 4.14 | 0.206 | 1.49 | 0.358 | 0.47 | 0.665 | ||||
MC-Big-Var | 10.77 | 0.106 | 2.26 | 0.187 | 0.70 | 0.327 | 0.20 | 0.603 | ||||
GA-10-100 | 61.98 | 0.104 | 20.79 | 0.183 | 4.07 | 0.319 | 0.95 | 0.589 | ||||
GA-20-400 | 86.33 | 0.107 | 68.16 | 0.190 | 14.20 | 0.327 | 3.52 | 0.607 | ||||
Best Baseline | 11.96 | 0.048 | 4.16 | 0.128 | 1.30 | 0.293 | 0.31 | 0.816 |
Training Dataset | 10 iter. | 25 iter. | 50 iter. | 100 iter. | ||||||||
GAP | time | GAP | time | GAP | time | GAP | time | |||||
MC-Sml-40 | 19.87 | 0.142 | 5.72 | 0.276 | 2.32 | 0.501 | 1.00 | 0.965 | ||||
MC-Sml-Var | 12.60 | 0.138 | 3.28 | 0.269 | 1.24 | 0.496 | 0.50 | 0.952 | ||||
MC-Big-40 | 33.83 | 0.149 | 14.03 | 0.291 | 7.54 | 0.536 | 3.90 | 1.033 | ||||
MC-Big-Var | 16.21 | 0.141 | 3.83 | 0.276 | 1.39 | 0.505 | 0.52 | 0.976 | ||||
GA-10-100 | 25.76 | 0.142 | 7.60 | 0.275 | 2.08 | 0.502 | 0.67 | 0.969 | ||||
GA-20-400 | 66.15 | 0.140 | 62.06 | 0.275 | 38.19 | 0.507 | 12.86 | 0.974 | ||||
Best Baseline | 12.79 | 0.108 | 4.33 | 0.29 | 1.66 | 0.635 | 0.58 | 1.503 |
Training Dataset | 10 iter. | 25 iter. | 50 iter. | 100 iter. | ||||||||
GAP | time | GAP | time | GAP | time | GAP | time | |||||
MC-Sml-40 | 35.09 | 0.133 | 5.78 | 0.248 | 1.63 | 0.445 | 0.62 | 0.85 | ||||
MC-Sml-Var | 56.68 | 0.132 | 25.62 | 0.246 | 8.88 | 0.441 | 3.06 | 0.842 | ||||
MC-Big-40 | 12.70 | 0.130 | 3.41 | 0.243 | 1.26 | 0.436 | 0.47 | 0.832 | ||||
MC-Big-Var | 37.57 | 0.134 | 10.74 | 0.251 | 4.89 | 0.449 | 1.49 | 0.856 | ||||
GA-10-100 | 95.70 | 0.123 | 95.70 | 0.230 | 78.63 | 0.408 | 27.53 | 0.779 | ||||
GA-20-400 | 100.0 | 0.134 | 100.0 | 0.252 | 99.53 | 0.447 | 74.54 | 0.854 | ||||
Best Baseline | 19.34 | 0.076 | 5.64 | 0.203 | 2.08 | 0.447 | 0.78 | 1.110 |
Training Dataset | 10 iter. | 25 iter. | 50 iter. | 100 iter. | ||||||||
GAP | time | GAP | time | GAP | time | GAP | time | |||||
MC-Sml-40 | 21.50 | 0.142 | 5.48 | 0.280 | 2.07 | 0.511 | 0.85 | 0.984 | ||||
MC-Sml-Var | 21.03 | 0.221 | 7.64 | 0.419 | 2.67 | 0.755 | 0.95 | 1.430 | ||||
MC-Big-40 | 28.39 | 0.150 | 11.15 | 0.294 | 5.87 | 0.540 | 2.94 | 1.037 | ||||
MC-Big-Var | 20.10 | 0.139 | 5.11 | 0.271 | 2.01 | 0.495 | 0.70 | 0.953 | ||||
GA-10-100 | 41.66 | 0.144 | 28.58 | 0.280 | 20.72 | 0.511 | 8.30 | 0.979 | ||||
GA-20-400 | 72.89 | 0.144 | 69.99 | 0.283 | 53.10 | 0.519 | 27.75 | 0.996 | ||||
Best Baseline | 17.98 | 0.112 | 6.34 | 0.289 | 2.66 | 0.627 | 0.93 | 1.477 |
Training Dataset | 10 iter. | 25 iter. | 50 iter. | 100 iter. | ||||||||
GAP | time | GAP | time | GAP | time | GAP | time | |||||
MC-Sml-40 | 2.5903 | 0.101 | 1.8558 | 0.179 | 0.9291 | 0.304 | 0.4860 | 0.555 | ||||
MC-Sml-Var | 2.5165 | 0.102 | 1.4663 | 0.178 | 1.0522 | 0.307 | 1.0512 | 0.562 | ||||
MC-Big-40 | 2.7488 | 0.103 | 2.1948 | 0.177 | 1.3515 | 0.302 | 0.7463 | 0.550 | ||||
MC-Big-Var | 2.4958 | 0.102 | 1.3267 | 0.180 | 0.9705 | 0.309 | 0.9683 | 0.564 | ||||
GA-10-100 | 0.1484 | 0.104 | 0.0228 | 0.177 | 0.0048 | 0.304 | 0.0009 | 0.551 | ||||
GA-20-400 | 0.1332 | 0.105 | 0.0233 | 0.182 | 0.0056 | 0.309 | 0.0016 | 0.561 | ||||
Best Baseline | 0.1893 | 0.071 | 0.0156 | 0.211 | 0.0014 | 0.601 | 0.0006 | 1.596 |
Training Dataset | 10 iter. | 25 iter. | 50 iter. | 100 iter. | ||||||||
GAP | time | GAP | time | GAP | time | GAP | time | |||||
MC-Sml-40 | 2.3695 | 0.221 | 1.9535 | 0.474 | 1.1906 | 0.894 | 0.5052 | 1.732 | ||||
MC-Sml-Var | 2.3357 | 0.231 | 1.7168 | 0.507 | 0.8963 | 0.959 | 0.8654 | 1.868 | ||||
MC-Big-40 | 2.4715 | 0.218 | 2.1550 | 0.481 | 1.5618 | 0.909 | 0.8056 | 1.759 | ||||
MC-Big-Var | 2.3197 | 0.218 | 1.6304 | 0.482 | 0.8092 | 0.910 | 0.7856 | 1.765 | ||||
GA-10-100 | 0.1811 | 0.234 | 0.0296 | 0.511 | 0.0077 | 0.962 | 0.0015 | 1.866 | ||||
GA-20-400 | 0.1018 | 0.228 | 0.0190 | 0.498 | 0.0052 | 0.943 | 0.0014 | 1.837 | ||||
Best Baseline | 0.1129 | 0.433 | 0.0193 | 1.459 | 0.0048 | 4.256 | 0.0020 | 14.459 |
4 Conclusion
In this work, we present a machine learning model based on unrolling and amortization, strongly inspired by the Bundle method. The model bypasses the need to solve the Dual Master problem at each iteration, leading to a smooth optimization process that can be differentiated using Automatic Differentiation techniques. We test it on the resolution of the Lagrangian dual of two different problems: Multi-Commodity Network Design and Generalized Assignment. It achieves a lower gap compared to classical approaches for a fixed number of iterations. Moreover, it exhibits strong generalization capabilities, maintaining robust performance beyond the training horizon and sometimes across different datasets.
This work opens several promising directions for future research. First, the framework can be directly applied, as it is, to other problems arising from Lagrangian relaxation. A central challenge in standard bundle methods lies in automatically determining which components to retain or discard in order to control bundle size and improve computational efficiency. While straightforward modifications of the framework are possible, this also raises questions concerning feature extraction and the design of an automated feature selection phase.
Given the growing use of Bundle methods in machine learning, it would also be interesting to assess the performance of the proposed approach on meta-learning tasks. This requires applying the method to non-convex and non-smooth optimization problems, an extension that presents non-trivial challenges. In this setting, removing components from the bundle is essential not only for reducing computational cost, but also for avoiding convergence to poor local minima.
References
- [1] Brandon Amos. Tutorial on amortized optimization for learning to optimize over continuous domains. CoRR, 2022.
- [2] Marcin Andrychowicz, Misha Denil, Sergio Gómez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.
- [3] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
- [4] Atilim Gunes Baydin, Barak A. Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research, 18(153):1–43, 2018.
- [5] Samy Bengio, Y. Bengio, and Jocelyn Cloutier. On the search for new learning rules for anns. Neural Processing Letters, 2:26–30, 07 1995.
- [6] Y. Bengio, S. Bengio, and J. Cloutier. Learning a synaptic learning rule. In IJCNN-91-Seattle International Joint Conference on Neural Networks, volume ii, pages 969 vol.2–, 1991.
- [7] Yoshua Bengio, Andrea Lodi, and Antoine Prouvost. Machine learning for combinatorial optimization: a methodological tour d’horizon. CoRR, abs/1811.06128, 2018.
- [8] Daniel Cederberg, Xuyang Wu, Stephen P. Boyd, and Mikael Johansson. An asynchronous bundle method for distributed learning problems. In The Thirteenth International Conference on Learning Representations, 2025.
- [9] Tianlong Chen, Xiaohan Chen, Wuyang Chen, Howard Heaton, Jialin Liu, Zhangyang Wang, and Wotao Yin. Learning to optimize: A primer and a benchmark. Journal of Machine Learning Research, 23(189):1–59, 2022.
- [10] Tianlong Chen, Xiaohan Chen, Wuyang Chen, Howard Heaton, Jialin Liu, Zhangyang Wang, and Wotao Yin. Learning to optimize: A primer and a benchmark. Journal of Machine Learning Research, 23(189):1–59, 2022.
- [11] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
- [12] Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR, abs/1406.1078, 2014.
- [13] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling, 2014.
- [14] Michele Conforti, Gérard Cornuéjols, and Giacomo Zambelli. Integer Programming. Springer, New York, 2014.
- [15] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learning, 20:273–297, 1995.
- [16] Teodor Gabriel Crainic, Antonio Frangioni, and Bernard Gendron. Bundle-based relaxation methods for multicommodity capacitated fixed charge network design. Discrete Applied Mathematics, 112(1):73–99, 2001.
- [17] Welington de Oliveira and Mikhail Solodov. A doubly stabilized bundle method for nonsmooth convex optimization. Mathematical programming, 156(1):125–159, 2016.
- [18] Welington Luis de Oliveira and Claudia A. Sagastizábal. Level bundle methods for oracles with on-demand accuracy. Optimization Methods and Software, 29:1180 – 1209, 2014.
- [19] Francesco Demelas, Joseph Le Roux, Mathieu Lacroix, and Axel Parmentier. Predicting lagrangian multipliers for mixed integer linear programs. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 10368–10384. PMLR, 21–27 Jul 2024.
- [20] Thomas Finley and Thorsten Joachims. Training structural svms when exact inference is intractable. In Proceedings of the 25th International Conference on Machine Learning, ICML ’08, page 304–311, New York, NY, USA, 2008. Association for Computing Machinery.
- [21] Marshall L. Fisher. The lagrangian relaxation method for solving integer programming problems. Management Science, 27:1–18, 1981.
- [22] A. Frangioni. Standard Bundle Methods: Untrusted Models and Duality, chapter 2, pages 61–116. Springer, 2020.
- [23] Antonio Frangioni. Solving semidefinite quadratic problems within nonsmooth optimization algorithms. Computers & Operations Research, 23(11):1099–1118, 1996.
- [24] Antonio Frangioni. Dual-ascent methods and multicommodity flow problems. Università degli studi di Pisa. Dipartimento di informatica, 1997.
- [25] Antonio Frangioni. Generalized bundle methods. SIAM Journal on Optimization, 13(1):117–156, 2002.
- [26] Antonio Frangioni, Niccolò Iardella, and Rafael Durbano Lobato. SMS++, 2023.
- [27] Bernard Gendron, Teodor Gabriel Crainic, and Antonio Frangioni. Multicommodity Capacitated Network Design, pages 1–19. Springer US, Boston, MA, 1999.
- [28] A. M. Geoffrion. Lagrangean relaxation for integer programming. In M. L. Balinski, editor, Approaches to Integer Programming, pages 82–114. Springer Berlin Heidelberg, Berlin, Heidelberg, 1974.
- [29] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT Press, 2016.
- [30] Hossein Hassani, Roozbeh Razavi-Far, Mehrdad Saif, and Liang Lin. Towards sample-efficiency and generalization of transfer and inverse reinforcement learning: A comprehensive literature review, 2024.
- [31] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9:1735–80, 12 1997.
- [32] Sepp Hochreiter, A. Steven Younger, and Peter R. Conwell. Learning to learn using gradient descent. In Georg Dorffner, Horst Bischof, and Kurt Hornik, editors, Artificial Neural Networks — ICANN 2001, pages 87–94, Berlin, Heidelberg, 2001. Springer Berlin Heidelberg.
- [33] Jinlong Ji, Xuhui Chen, Qianlong Wang, Lixing Yu, and Pan Li. Learning to learn gradient aggregation by gradient descent. In IJCAI, pages 2614–2620, 2019.
- [34] Thorsten Joachims, Thomas Finley, and Chun-Nam Yu. Cutting-plane training of structural svms. Machine Learning, 77:27–59, 10 2009.
- [35] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [36] Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014.
- [37] Krzysztof C. Kiwiel. A proximal bundle method with approximate subgradient linearizations. SIAM Journal on Optimization, 16(4):1007–1023, 2006.
- [38] James Kotary, My H Dinh, and Ferdinando Fioretto. Backpropagation of unrolled solvers with folded optimization. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI-2023, page 1963–1970. International Joint Conferences on Artificial Intelligence Organization, August 2023.
- [39] Quoc Le, Alex Smola, and Svn Vishwanathan. Bundle methods for machine learning. Advances in neural information processing systems, 20, 2007.
- [40] Claude Lemaréchal. Lagrangian Relaxation, pages 112–156. Springer Berlin Heidelberg, Berlin, Heidelberg, 2001.
- [41] Claude Lemaréchal, Arkadii Nemirovskii, and Yurii Nesterov. New variants of bundle methods. Mathematical programming, 69:111–147, 1995.
- [42] Claude Lemarechal, Jean-Jacques Strodiot, and André Bihain. On a bundle algorithm for nonsmooth optimization. In Nonlinear programming 4, pages 245–282. Elsevier, 1981.
- [43] Bingheng Li, Linxin Yang, Yupeng Chen, Senmiao Wang, Qian Chen, Haitao Mao, Yao Ma, Akang Wang, Tian Ding, Jiliang Tang, and Ruoyu Sun. Pdhg-unrolled learning-to-optimize method for large-scale linear programming, 2024.
- [44] Ke Li and Jitendra Malik. Learning to optimize. ArXiv, abs/1606.01885, 2016.
- [45] Jialin Liu, Xiaohan Chen, Zhangyang Wang, Wotao Yin, and Hanqin Cai. Towards constituting mathematical structures for learning to optimize. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 21426–21449. PMLR, 23–29 Jul 2023.
- [46] André F. T. Martins and Ramón Fernandez Astudillo. From softmax to sparsemax: A sparse model of attention and multi-label classification. CoRR, abs/1602.02068, 2016.
- [47] Nina Mazyavkina, Sergey Sviridov, Sergei Ivanov, and Evgeny Burnaev. Reinforcement learning for combinatorial optimization: A survey. Computers & Operations Research, 134:105400, 2021.
- [48] Vishal Monga, Yuelong Li, and Yonina C. Eldar. Algorithm unrolling: Interpretable, efficient deep learning for signal and image processing, 2020.
- [49] Alasdair Paren, Leonard Berrada, Rudra P. K. Poudel, and M. Pawan Kumar. A stochastic bundle method for interpolating networks. J. Mach. Learn. Res., 23(1), January 2022.
- [50] T.P. Runarsson and M.T. Jonsson. Evolution and design of distributed learning rules. In 2000 IEEE Symposium on Combinations of Evolutionary Computation and Neural Networks. Proceedings of the First IEEE Symposium on Combinations of Evolutionary Computation and Neural Networks (Cat. No.00, pages 59–63, 2000.
- [51] Alexander Schrijver et al. On cutting planes. Combinatorics, 79:291–296, 1980.
- [52] John Schulman, Nicolas Heess, Theophane Weber, and Pieter Abbeel. Gradient estimation using stochastic computation graphs. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015.
- [53] Damien Scieur, Quentin Bertrand, Gauthier Gidel, and Fabian Pedregosa. The curse of unrolling: Rate of differentiating through optimization, 2023.
- [54] Naresh K. Sinha and Michael P. Griscik. A stochastic approximation method. IEEE Trans. Syst. Man Cybern., 1(4):338–344, 1971.
- [55] R.S. Sutton and A.G. Barto. Reinforcement Learning, second edition: An Introduction. Adaptive Computation and Machine Learning series. MIT Press, 2018.
- [56] R. Tyrrell Rockafellar. Convex Analysis. Princeton University Press, 2015.
- [57] Yunhao Tang, Shipra Agrawal, and Yuri Faenza. Reinforcement learning for integer programming: Learning to cut. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 9367–9376. PMLR, 13–18 Jul 2020.
- [58] Choon Hui Teo, S.V.N. Vishwanthan, Alex J. Smola, and Quoc V. Le. Bundle methods for regularized risk minimization. Journal of Machine Learning Research, 11(10):311–365, 2010.
- [59] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
- [60] Arthur Steven Younger, Sepp Hochreiter, and Peter R. Conwell. Meta-learning with backpropagation. IJCNN’01. International Joint Conference on Neural Networks. Proceedings (Cat. No.01CH37222), 3:2001–2006 vol.3, 2001.
- [61] Yilang Zhang and Georgios B. Giannakis. Meta-learning priors using unrolled proximal networks. In The Twelfth International Conference on Learning Representations, 2024.
Appendix A Proximal Term $\eta$-Strategies
The proper selection of the proximal term $\eta$ of the Bundle method is both non-trivial and essential. This choice depends on the current stabilization point and on the information accumulated in the bundle up to the current iteration. In particular, when all the subgradients active at the optimum are available, solving the Cutting Plane model is preferable, as it yields the optimal solution of the problem; hence, in this case, a higher value of $\eta$ may be preferred. At the beginning of the resolution, instead, only very local information may be available, so acting carefully is preferable and a smaller value of $\eta$ may be better.
In [25], theoretical conditions are established for the $\eta$-strategy to ensure convergence. Constant strategies can be effective if appropriately fine-tuned, and many state-of-the-art strategies are self-adjusting rules for choosing $\eta$.
The main state-of-the-art $\eta$-strategies are composed of three levels, as the ones developed in [24] and also presented in [16]: Long-term, Middle-term, and Short-term $\eta$-strategies.
The Long-term $\eta$-strategies aim to capture the stage of the resolution process. In the Bundle method this is important because $\eta$ balances between the subgradient method and the Cutting Plane method: we want to prioritize subgradient-like iterations when the solution is still far from optimal, and shift towards Cutting Plane behavior in the final iterations. This helps collect the subgradient information necessary to establish the optimality of the solution.
We define the maximum improvement estimation of the Dual Master problem as
$$v^*_t \;=\; \eta_t\,\Big\|\sum_{i\in\beta_t}\theta^{(t)}_i g_i\Big\|^2 + \sum_{i\in\beta_t}\alpha_i\,\theta^{(t)}_i,$$
where $\theta^{(t)}$ is the solution of the Dual Master problem at iteration $t$.
We also define the expected maximum improvement obtained using a variant of the Dual Master problem in which $\eta^*$ acts as the regularization parameter:
$$\epsilon^*_t \;=\; \sum_{i\in\beta_t}\alpha_i\,\theta^{(t)}_i + \eta^*\,\Big\|\sum_{i\in\beta_t}\theta^{(t)}_i g_i\Big\|^2,$$
where $\eta^*$ is a hyperparameter, generally chosen to be greater than all the possible $\eta_t$, thereby emulating the behavior expected when following the decisions of the model more closely. This is different from explicitly using $\eta^*$ in the Dual Master problem, as that would alter its solution and consequently the new trial direction, as previously explained. Additionally, we define a hyperparameter $\kappa$, which is employed in the $\eta$-strategies below and is fixed a priori.
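Both estimates can be computed directly from the dual solution, as in the following sketch:

```python
import numpy as np

def improvement_estimates(theta, G, alpha, eta_t, eta_star):
    """v_t^* is the improvement predicted with the current eta_t; eps_t^* the
    one predicted with the large reference value eta^*, emulating a pure
    Cutting-Plane step. G holds the bundle subgradients g_i as rows."""
    g_hat = theta @ G          # aggregated subgradient
    a_hat = alpha @ theta      # aggregated linearization error
    sq_norm = g_hat @ g_hat
    return eta_t * sq_norm + a_hat, a_hat + eta_star * sq_norm
```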
Some examples of Long-term $\eta$-strategies are:
- no Long-term $\eta$-strategy;
- the soft Long-term $\eta$-strategy: after a null step, decreases of $\eta$ are inhibited whenever $v^*_t \leq m\,\epsilon^*_t$. In other words, decrements are prevented if $\eta$ is already sufficiently small, so that the maximum improvement is estimated to be less than $m$ times the improvement suggested by a Cutting Plane model;
- the hard Long-term $\eta$-strategy: increases of $\eta$ are allowed, even after null steps, whenever $v^*_t \leq m\,\epsilon^*_t$;
- the balancing Long-term $\eta$-strategy: this strategy aims to keep the two terms of $v^*_t$ roughly the same size. Increases of $\eta$, after a serious step, are inhibited if $\eta_t \big\|\sum_{i\in\beta_t}\theta^{(t)}_i g_i\big\|^2 \leq m \sum_{i\in\beta_t}\alpha_i\theta^{(t)}_i$; in this case, increasing $\eta$ causes a further reduction of the quadratic term, which is already small. On the other hand, decreases of $\eta$ are inhibited if $m\,\eta_t \big\|\sum_{i\in\beta_t}\theta^{(t)}_i g_i\big\|^2 \geq \sum_{i\in\beta_t}\alpha_i\theta^{(t)}_i$, the rationale being that decreasing $\eta$ causes an increase of the quadratic term, which is already big.
The Middle-term $\eta$-strategies are multi-step strategies that consider the outcomes of the last few iterations. These strategies include the following conditions:
- A minimum number of consecutive serious steps (SS) with the same $\eta$ must be performed before $\eta$ is allowed to increase.
- A minimum number of consecutive null steps (NS) with the same $\eta$ must be performed before $\eta$ is allowed to decrease.
At the bottom level, heuristic Short-term $\eta$-strategies rely only on information from the current iteration to propose an update for $\eta$. Specifically, these heuristics are only used to assist in selecting the new value when both the top and the middle layers agree on the update. However, it can sometimes be beneficial to allow the Short-term heuristics to make small adjustments to $\eta$ regardless of the decisions made by the other levels.
In our implementation, we combine the following Short-term $\eta$-strategies (see the sketch after this list):
- Increase $\eta$ as $\eta_{t+1} = \rho\,\eta_t$, with $\rho > 1$.
- Decrease $\eta$ as $\eta_{t+1} = \eta_t/\rho$, with $\rho > 1$.
- When increased, $\eta$ should not exceed a fixed upper safeguard $\eta_{\max}$.
- When decreased, $\eta$ should not go below a fixed lower safeguard $\eta_{\min}$.
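To make the interplay of the three levels concrete, the following is a minimal sketch of one possible combined update; the gating logic and all parameter values (m, rho, eta_min, eta_max, min_consec) are illustrative assumptions, not the tuned rules of [24]:

```python
def update_eta(eta, v_t, eps_t, step_was_serious, n_consec, *,
               m=0.1, rho=2.0, eta_min=1e-6, eta_max=1e6, min_consec=3):
    """One illustrative eta update combining the three strategy levels.

    eta              : current proximal parameter
    v_t, eps_t       : improvement estimates (see above)
    step_was_serious : True after a serious step, False after a null step
    n_consec         : consecutive steps of the same type at this eta
    """
    if step_was_serious:
        # Middle-term: require enough consecutive serious steps before increasing.
        if n_consec < min_consec:
            return eta
        return min(rho * eta, eta_max)      # Short-term increase + safeguard
    # Long-term (soft): inhibit decreases when eta is already small enough that
    # v_t is below m times the Cutting-Plane estimate eps_t.
    if v_t <= m * eps_t or n_consec < min_consec:
        return eta
    return max(eta / rho, eta_min)          # Short-term decrease + safeguard
```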
Appendix B Multi-Commodity Capacitated Network Design Problem
A MC instance is given by a directed simple graph $G = (N, A)$, a set of commodities $K$, an arc-capacity vector $c$, and two cost vectors $f$ and $r$. Each commodity $k \in K$ corresponds to a triplet $(o^k, d^k, q^k)$, where $o^k$ and $d^k$ are the nodes corresponding to the origin and the destination of commodity $k$, and $q^k$ is its volume. For each arc $(i,j) \in A$, $c_{ij}$ corresponds to the maximum amount of flow that can be routed through $(i,j)$ and $f_{ij}$ corresponds to the fixed cost of using arc $(i,j)$ to route commodities. For each arc $(i,j)$ and each commodity $k$, $r^k_{ij}$ corresponds to the cost of routing one unit of commodity $k$ through arc $(i,j)$.
A MC solution consists of an arc subset $\bar{A} \subseteq A$ and, for each commodity $k$, of a flow of value $q^k$ from its origin to its destination with the following requirements: all commodities are only routed through arcs of $\bar{A}$, and the total amount of flow routed through each arc $(i,j)$ does not exceed its capacity $c_{ij}$. The solution cost is the sum of the fixed costs over the arcs of $\bar{A}$ plus the routing cost, the latter being the sum over all arcs $(i,j)$ and all commodities $k$ of the unitary routing cost $r^k_{ij}$ multiplied by the amount of flow of $k$ routed through $(i,j)$.
B.1 MILP formulation
A standard model for the MC problem [27] introduces two sets of variables: the continuous flow variables $x^k_{ij}$, representing the amount of commodity $k$ that is routed through arc $(i,j)$, and the binary design variables $y_{ij}$, representing whether or not arc $(i,j)$ is used to route commodities. Denoting respectively by $N^+_i$ and $N^-_i$ the sets of forward and backward neighbors of a vertex $i$, the MC problem can be modeled as follows:
$$
\begin{aligned}
\min \quad & \textstyle\sum_{(i,j)\in A} \big( f_{ij}\, y_{ij} + \sum_{k\in K} r^k_{ij}\, x^k_{ij} \big) && && \text{(9a)}\\
\text{s.t.} \quad & \textstyle\sum_{j\in N^+_i} x^k_{ij} - \sum_{j\in N^-_i} x^k_{ji} = b^k_i && \forall i\in N,\ \forall k\in K && \text{(9b)}\\
& \textstyle\sum_{k\in K} x^k_{ij} \le c_{ij}\, y_{ij} && \forall (i,j)\in A && \text{(9c)}\\
& x^k_{j o^k} = 0,\quad x^k_{d^k j} = 0 && \forall j,\ \forall k\in K && \text{(9f)}\\
& x^k_{ij} \ge 0,\quad 0 \le y_{ij} \le 1 && \forall (i,j)\in A,\ \forall k\in K && \text{(9g)}\\
& y_{ij} \in \{0,1\} && \forall (i,j)\in A && \text{(9h)}
\end{aligned}
$$
where $b^k_i = q^k$ if $i = o^k$, $b^k_i = -q^k$ if $i = d^k$, and $b^k_i = 0$ otherwise.
The objective function (9a) minimizes the sum of the routing and fixed costs. Equations (9b) are the flow conservation constraints that properly define the flow of each commodity through the graph. Constraints (9c) are the capacity constraints, ensuring that the total amount of flow routed through each arc does not exceed its capacity, and that it is zero if the arc is not used to route commodities. Equations (9f) ensure that a commodity is not routed on an arc entering its origin or leaving its destination. Finally, inequalities (9g) are the bounds for the variables and constraints (9h) are the integrality constraints for the design variables.
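As a concrete illustration, formulation (9) can be written with the open-source PuLP modeler as in the sketch below; PuLP itself and the data containers are our assumptions, not the paper's implementation:

```python
import pulp

def build_mc_model(N, A, K, c, f, r, origin, dest, volume):
    """Sketch of formulation (9). N: nodes, A: arc pairs, K: commodities;
    c, f: dicts arc -> capacity / fixed cost; r: dict (arc, k) -> unit cost;
    origin, dest, volume: dicts k -> o^k, d^k, q^k."""
    prob = pulp.LpProblem("MC", pulp.LpMinimize)
    x = {(a, k): pulp.LpVariable(f"x_{a[0]}_{a[1]}_{k}", lowBound=0)
         for a in A for k in K}
    y = {a: pulp.LpVariable(f"y_{a[0]}_{a[1]}", cat="Binary") for a in A}

    # (9a) fixed plus routing costs
    prob += pulp.lpSum(f[a] * y[a] for a in A) + \
            pulp.lpSum(r[a, k] * x[a, k] for a in A for k in K)

    # (9b) flow conservation: b^k_i = q^k at the origin, -q^k at the destination
    for k in K:
        for i in N:
            b = volume[k] if i == origin[k] else (-volume[k] if i == dest[k] else 0)
            prob += (pulp.lpSum(x[a, k] for a in A if a[0] == i)
                     - pulp.lpSum(x[a, k] for a in A if a[1] == i)) == b

    # (9c) capacity and linking constraints
    for a in A:
        prob += pulp.lpSum(x[a, k] for k in K) <= c[a] * y[a]

    # (9f) no flow on arcs entering the origin or leaving the destination
    for k in K:
        for a in A:
            if a[1] == origin[k] or a[0] == dest[k]:
                prob += x[a, k] == 0
    return prob, x, y
```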
B.2 Lagrangian Knapsack Relaxation
A standard way to obtain good bounds for the MC problem is to solve the Lagrangian relaxation obtained by dualizing the flow conservation constraints (9b) in formulation (9a)-(9h). Let $\pi^k_i$ be the Lagrangian multiplier associated with node $i$ and commodity $k$ (since the dualized constraints are equations, the multipliers $\pi^k_i$ have no sign constraints). Dualizing the flow conservation constraints gives the following relaxed Lagrangian problem:
$$\min_{(x,y)\ \text{satisfying (9c)-(9h)}} \; \sum_{(i,j)\in A} \Big( f_{ij}\, y_{ij} + \sum_{k\in K} r^k_{ij}\, x^k_{ij} \Big) + \sum_{k\in K}\sum_{i\in N} \pi^k_i \Big( b^k_i - \sum_{j\in N^+_i} x^k_{ij} + \sum_{j\in N^-_i} x^k_{ji} \Big).$$
Rearranging the terms in the objective function and observing that the relaxed Lagrangian problem decomposes by arcs, we obtain a subproblem for each arc $(i,j) \in A$ of the form:
$$
\begin{aligned}
L_{ij}(\pi) \;=\; \min \quad & f_{ij}\, y_{ij} + \textstyle\sum_{k\in K_{ij}} \tilde{r}^k_{ij}\, x^k_{ij} && \text{(10a)}\\
\text{s.t.} \quad & \textstyle\sum_{k\in K_{ij}} x^k_{ij} \le c_{ij}\, y_{ij} && \text{(10b)}\\
& 0 \le x^k_{ij} \le q^k \quad \forall k\in K_{ij} && \text{(10c)}\\
& y_{ij} \in \{0,1\} && \text{(10d)}
\end{aligned}
$$
where $\tilde{r}^k_{ij} = r^k_{ij} - \pi^k_i + \pi^k_j$ and $K_{ij} \subseteq K$ is the set of commodities that may be routed through arc $(i,j)$.
For each $(i,j) \in A$, $L_{ij}(\pi)$ is a MILP with only one binary variable. If $y_{ij} = 0$, then, by (10b) and (10c), $x^k_{ij} = 0$ for all $k \in K_{ij}$, and the subproblem value is $0$. If $y_{ij} = 1$, the problem reduces to a continuous knapsack problem. An optimal solution is obtained by ordering the commodities of $K_{ij}$ with respect to increasing reduced costs $\tilde{r}^k_{ij}$ and setting each variable $x^k_{ij}$ with $\tilde{r}^k_{ij} < 0$ to the value $\min\{q^k,\, c_{ij} - \sum_{h \in P_k} x^h_{ij}\}$, where $P_k$ denotes the set of commodities that precede $k$ in the order. This step can be done in $O(|K_{ij}|)$ if one computes the residual capacities following the computed order. Hence, the complexity of the continuous knapsack problem is dominated by the sorting, i.e., $O(|K_{ij}| \log |K_{ij}|)$. The value of $L_{ij}(\pi)$ is the minimum between the cost of the continuous knapsack solution plus the fixed cost $f_{ij}$, and $0$.
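The greedy procedure just described can be sketched as follows; the data layout (a list of (reduced cost, volume) pairs per arc) is an assumption of ours:

```python
def solve_arc_subproblem(f_ij, c_ij, commodities):
    """Solve L_ij(pi) for one arc; `commodities` is a list of (r_tilde, q) pairs,
    with r_tilde = r^k_ij - pi^k_i + pi^k_j and q = q^k."""
    # y_ij = 1 case: continuous knapsack, fill most negative reduced costs first.
    cost, residual = f_ij, c_ij
    for r_tilde, q in sorted(commodities):   # increasing r_tilde
        if r_tilde >= 0 or residual <= 0:
            break                            # no further improvement possible
        amount = min(q, residual)
        cost += r_tilde * amount
        residual -= amount
    # y_ij = 0 case has value 0; keep the better of the two.
    return min(cost, 0.0)
```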
Lagrangian duality implies that
$$L(\pi) \;=\; \sum_{(i,j)\in A} L_{ij}(\pi) + \sum_{k\in K}\sum_{i\in N} \pi^k_i\, b^k_i$$
is a lower bound for the MC problem, and the best one is obtained by solving the following Lagrangian dual problem:
$$\max_{\pi} \; L(\pi).$$
Appendix C Generalized Assignment Problem
A GA instance is defined by a set $I$ of items and a set $J$ of bins. Each bin $j \in J$ is associated with a certain capacity $c_j$. For each item $i$ and each bin $j$, $p_{ij}$ is the profit of assigning item $i$ to bin $j$, and $w_{ij}$ is the weight of item $i$ inside bin $j$.
Considering a binary variable $x_{ij}$ for each item $i$ and each bin $j$, equal to one if and only if item $i$ is assigned to bin $j$, the GA problem can be formulated as:
$$
\begin{aligned}
\max \quad & \textstyle\sum_{i\in I}\sum_{j\in J} p_{ij}\, x_{ij} && && \text{(11a)}\\
\text{s.t.} \quad & \textstyle\sum_{j\in J} x_{ij} \le 1 && \forall i\in I && \text{(11b)}\\
& \textstyle\sum_{i\in I} w_{ij}\, x_{ij} \le c_j && \forall j\in J && \text{(11c)}\\
& x_{ij} \in \{0,1\} && \forall i\in I,\ \forall j\in J && \text{(11d)}
\end{aligned}
$$
The objective function (11a) maximizes the total profit. Inequalities (11b) assert that each item is contained in no more than one bin. Inequalities (11c) ensure that the sum of the weights of the items assigned to a bin does not exceed its capacity. Finally, constraints (11d) ensure the integrality of the variables.
C.1 Lagrangian Relaxation
A Lagrangian relaxation of the GA problem is obtained by dualizing (11b). For $i \in I$, let $\lambda_i \geq 0$ be the Lagrangian multiplier of inequality (11b) associated with item $i$. For each bin $j \in J$ the subproblem becomes:
$$L_j(\lambda) \;=\; \max\Big\{ \textstyle\sum_{i\in I} (p_{ij} - \lambda_i)\, x_{ij} \ :\ \textstyle\sum_{i\in I} w_{ij}\, x_{ij} \le c_j,\ x_{ij}\in\{0,1\}\ \forall i\in I \Big\}.$$
It corresponds to an integer knapsack problem with binary variables. For $\lambda \geq 0$, the Lagrangian bound is:
$$L(\lambda) \;=\; \sum_{j\in J} L_j(\lambda) + \sum_{i\in I} \lambda_i.$$
The Lagrangian dual can then be written as:
$$\min_{\lambda \geq 0} \; L(\lambda).$$
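Since each bin subproblem is a 0-1 knapsack, it can be solved, for instance, by the textbook dynamic program below; integer weights and the interface are our assumptions, and the paper does not prescribe a specific knapsack solver:

```python
def solve_bin_subproblem(profits, weights, capacity):
    """0-1 knapsack DP for L_j(lambda); profits are p_ij - lambda_i.

    Assumes non-negative integer weights and capacity."""
    best = [0.0] * (capacity + 1)              # best[c] = max profit using capacity c
    for p, w in zip(profits, weights):
        if p <= 0:
            continue                           # items with non-positive profit stay out
        for c in range(capacity, w - 1, -1):   # iterate backwards: each item used once
            best[c] = max(best[c], best[c - w] + p)
    return best[capacity]
```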
Appendix D Hyperparameters and Implementation Details
In this section, we provide technical details about the implementation. We specify the hyperparameters of the NN architecture (layers, layer sizes, and activation functions), define the hand-crafted features, report the hyperparameters of the optimizer used for training, and give the GPU and CPU specifications of the machines used for the numerical experiments.
Loss functions
To weigh the contributions of the different iterations in the training loss, we discount the per-iteration objective values along the unrolled trajectory. In other amortized optimization works focusing on learning optimizers, such as [2], this weighting parameter is taken equal to one, so that all iterations contribute equally. The table below reports, for each dataset and baseline method, the parameter value selected by grid search for each iteration budget.
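As an illustration, a discounted unrolled loss can be written as follows; the discount factor value is a placeholder assumption:

```python
def unrolled_loss(objective_values, gamma=0.9):
    """Discounted sum of the per-iteration objective values f(x_t).

    `objective_values` is the list of (scalar tensor) values collected along
    the unrolled trajectory; `gamma` is an assumed discount factor. With
    gamma = 1 all iterations contribute equally, as in [2]."""
    T = len(objective_values)
    # Later iterations receive larger weights, emphasizing final performance.
    return sum(gamma ** (T - 1 - t) * f_t for t, f_t in enumerate(objective_values))
```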
Dataset | Methods | 10 iter. | 25 iter. | 50 iter. | 100 iter.
MC-Sml-40 | Bundle h. | 10 | 10 | 10 | 10
| Bundle b. | 100 | 10 | 10 | 10
| Bundle s. | 100 | 10 | 10 | 100
| Bundle c. | 100 | 100 | 100 | 100
| Descent | 10 | 10 | 10 | 10
| Adam | 1 | 1 | 1 | 1
MC-Sml-Var | Bundle h. | 10 | 100 | 100 | 10
| Bundle b. | 100 | 100 | 100 | 100
| Bundle s. | 100 | 100 | 100 | 100
| Bundle c. | 100 | 100 | 100 | 100
| Descent | 100 | 10 | 10 | 10
| Adam | 1 | 1 | 1 | 1
MC-Big-40 | Bundle h. | 10 | 10 | 10 | 1
| Bundle b. | 10 | 10 | 10 | 10
| Bundle s. | 10 | 10 | 10 | 10
| Bundle c. | 10 | 10 | 10 | 10
| Descent | 10 | 10 | 10 | 10
| Adam | 1 | 1 | 1 | 1
MC-Big-Var | Bundle h. | 100 | 10 | 10 | 10
| Bundle b. | 100 | 100 | 10 | 10
| Bundle s. | 100 | 100 | 10 | 10
| Bundle c. | 100 | 100 | 100 | 100
| Descent | 10 | 10 | 10 | 10
| Adam | 1 | 1 | 1 | 1
GA-10-100 | Bundle h. | 1000 | 1000 | 1000 | 1000
| Bundle b. | 1000 | 1000 | 1000 | 1000
| Bundle s. | 1000 | 1000 | 1000 | 1000
| Bundle c. | 1000 | 1000 | 1000 | 1000
| Descent | 100 | 100 | 100 | 100
| Adam | 100 | 10 | 10 | 10
GA-20-400 | Bundle h. | 1000 | 1000 | 1000 | 1000
| Bundle b. | 1000 | 1000 | 1000 | 1000
| Bundle s. | 1000 | 1000 | 1000 | 1000
| Bundle c. | 1000 | 1000 | 1000 | 1000
| Descent | 1000 | 100 | 100 | 100
| Adam | 100 | 10 | 10 | 10
Model Architecture - Bundle Network
For all datasets, we consider the same architecture. A first LSTM maps the feature space to predicted vectors containing the means and variances used to sample three latent vectors: the hidden representation of the $\eta$ parameter, the key for the last found component, and the query for the current iteration. These vectors are given as inputs to three different decoders, each composed of one hidden layer. The outputs of these decoders are, respectively, a scalar $\eta_t$, used as step size, and two vectors representing the key and the query of the current iteration; a sketch is given below. The main framework is presented with a sampling mechanism, but we found sampling to be detrimental when using softmax as the normalization function, and we refer to Section 3.4 for further details. Still, sampling improves performance when used in conjunction with sparsemax.
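A minimal PyTorch sketch of this architecture follows; all layer sizes, the softplus used to keep the step size positive, and the exact decoder shapes are our assumptions:

```python
import torch
import torch.nn as nn

class BundleNetworkCell(nn.Module):
    """Illustrative sketch: an LSTM encodes per-iteration features into means
    and log-variances, from which three latent vectors are sampled
    (reparameterization trick) and decoded into the step size eta_t,
    the key of the last bundle component, and the query of the iteration."""

    def __init__(self, n_features, hidden=64, latent=32, key_dim=32):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.to_stats = nn.Linear(hidden, 2 * 3 * latent)   # means and log-vars
        def mlp(out_dim):
            return nn.Sequential(nn.Linear(latent, hidden),
                                 nn.ReLU(), nn.Linear(hidden, out_dim))
        self.eta_decoder = mlp(1)        # scalar step size
        self.key_decoder = mlp(key_dim)
        self.query_decoder = mlp(key_dim)

    def forward(self, features, state=None, sample=True):
        out, state = self.lstm(features, state)          # (B, T, hidden)
        stats = self.to_stats(out[:, -1])                # (B, 6 * latent)
        mu, logvar = stats.chunk(2, dim=-1)              # (B, 3 * latent) each
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp() if sample else mu
        z_eta, z_key, z_query = z.chunk(3, dim=-1)
        eta = torch.nn.functional.softplus(self.eta_decoder(z_eta))  # positive step
        return eta, self.key_decoder(z_key), self.query_decoder(z_query), state
```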
Features
At iteration $t$, we perform a human-designed feature extraction, chosen to provide information similar to that used by the heuristic strategies. We add further features to represent the component added to the bundle at the associated iteration, and others to characterize this entry and to compare it with previously added components; a sketch of the extraction is given after the lists below. The first group of features, related to iteration $t$, summarizes the optimization dynamics at the current iteration:
- the last step size $\eta_{t-1}$;
- the square norm of the search direction, $\|d_t\|^2$;
- the square norm of the search direction weighted by $\eta_t$, that is, $\eta_t\,\|d_t\|^2$, which corresponds to the quadratic part of the DMP objective function;
- the linear part of the DMP, $\sum_{i\in\beta_t}\alpha_i\,\theta^{(t)}_i$;
- a boolean indicating whether the quadratic part is bigger than the linear part;
- a boolean indicating whether the quadratic part, rescaled by a big $\eta^*$, is greater than the linear part;
- the iteration counter $t$.
The second group of features describes the last trial and stabilization points, capturing information about the most recent trial point $x_t$ and stabilization point $\bar{x}_t$:
- the objective value at the last trial point, $f(x_t)$, and at the last stabilization point, $f(\bar{x}_t)$;
- the linearization error at the last stabilization point and at the last trial point;
- the square norm of the last trial point, of the last stabilization point, and of the gradient at the stabilization point;
- the square norm, the mean, the variance, the minimum, and the maximum of the last trial point $x_t$;
- the square norm, the mean, the variance, the minimum, and the maximum of the gradient at the last trial point.
Features comparing the last inserted component with the ones already contained in the bundle:
- the minimum and the maximum of the scalar products between the gradients in the bundle and the last inserted gradient;
- the minimum and the maximum of the scalar products between the points in the bundle and the last trial point;
- the scalar product between the last inserted gradient and the last search direction.
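A sketch of part of this feature extraction, covering the first and third groups, is given below; the array names and shapes are our assumptions:

```python
import numpy as np

def extract_features(eta_prev, d, theta, alpha, G, X, g_last, x_last, t,
                     eta_star=1e4):
    """Illustrative subset of the hand-crafted features.

    d            : search direction sum_i theta_i g_i
    theta, alpha : Dual Master solution and linearization errors
    G, X         : bundle subgradients and points, stacked row-wise
    g_last, x_last : last inserted subgradient and trial point
    """
    quad = eta_prev * float(d @ d)          # quadratic part of the DMP objective
    lin = float(alpha @ theta)              # linear part of the DMP objective
    grad_dots = G @ g_last                  # scalar products with bundle gradients
    point_dots = X @ x_last                 # scalar products with bundle points
    return np.array([
        eta_prev,                           # last step size
        float(d @ d),                       # ||d_t||^2
        quad, lin,
        float(quad > lin),                  # quadratic part bigger than linear part
        float(eta_star * float(d @ d) > lin),
        float(t),                           # iteration counter
        grad_dots.min(), grad_dots.max(),
        point_dots.min(), point_dots.max(),
        float(g_last @ d),                  # alignment with the search direction
    ])
```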
Optimiser Specifications
We use Adam as the optimizer, together with gradient-norm clipping (to 5) and an exponential decay of the learning rate.
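The corresponding training setup can be sketched as follows; the clip norm of 5 is from the text, while the learning rate and the decay factor are placeholder assumptions:

```python
import torch

# model refers to the BundleNetworkCell sketch given above.
model = BundleNetworkCell(n_features=12)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)       # assumed lr
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.99)

def training_step(loss):
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
    optimizer.step()
    scheduler.step()
```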
GPU specifics
For training on the datasets, we use Quadro RTX 5000 accelerators with 8 GB of RAM, and we test performance on the same machine. All the variants except Bundle Network run on CPU only; these experiments are carried out on the same machine, with a QEMU Virtual CPU (version 2.5+).
Appendix E Dataset Generalization with Sparsemax and Sampling
In this appendix, we provide further details on the model that uses sparsemax and the sampling mechanism, concerning its cross-dataset generalization properties, similarly to what was done in Section 3.5 for softmax without sampling. Each table below reports, for one test dataset, the GAP and time obtained by the models trained on each training dataset (rows), together with the best baseline on that dataset.
Training Dataset | 10 iter. GAP | time | 25 iter. GAP | time | 50 iter. GAP | time | 100 iter. GAP | time
MC-Sml-40 | 10.51 | 0.126 | 2.07 | 0.214 | 0.60 | 0.363 | 0.18 | 0.662
MC-Sml-Var | 9.87 | 0.128 | 1.75 | 0.218 | 0.54 | 0.371 | 0.18 | 0.683
MC-Big-40 | 21.36 | 0.140 | 5.23 | 0.239 | 1.78 | 0.41 | 0.52 | 0.755
MC-Big-Var | 9.82 | 0.128 | 1.99 | 0.221 | 0.60 | 0.374 | 0.19 | 0.683
GA-10-100 | 69.18 | 0.128 | 69.18 | 0.216 | 69.18 | 0.366 | 69.18 | 0.667
GA-20-400 | 96.08 | 0.128 | 96.08 | 0.218 | 96.08 | 0.368 | 96.08 | 0.673
Best Baseline | 11.96 | 0.048 | 4.16 | 0.128 | 1.30 | 0.293 | 0.31 | 0.816
Training Dataset | 10 iter. GAP | time | 25 iter. GAP | time | 50 iter. GAP | time | 100 iter. GAP | time
MC-Sml-40 | 22.9 | 0.161 | 6.62 | 0.301 | 2.6 | 0.54 | 1.02 | 1.027
MC-Sml-Var | 14.02 | 0.158 | 3.80 | 0.298 | 1.56 | 0.531 | 0.64 | 1.007
MC-Big-40 | 39.34 | 0.171 | 15.12 | 0.327 | 6.7 | 0.59 | 2.97 | 1.119
MC-Big-Var | 14.27 | 0.160 | 3.64 | 0.302 | 1.27 | 0.545 | 0.49 | 1.042
GA-10-100 | 59.75 | 0.160 | 59.75 | 0.299 | 59.75 | 0.540 | 59.75 | 1.031
GA-20-400 | 66.35 | 0.163 | 66.35 | 0.307 | 66.35 | 0.552 | 66.35 | 1.052
Best Baseline | 12.79 | 0.108 | 4.33 | 0.29 | 1.66 | 0.635 | 0.58 | 1.503
Training Dataset | 10 iter. GAP | time | 25 iter. GAP | time | 50 iter. GAP | time | 100 iter. GAP | time
MC-Sml-40 | 41.77 | 0.154 | 7.62 | 0.276 | 1.89 | 0.484 | 0.67 | 0.915
MC-Sml-Var | 91.76 | 0.152 | 63.78 | 0.279 | 37.85 | 0.487 | 8.22 | 0.919
MC-Big-40 | 13.27 | 0.150 | 3.50 | 0.272 | 1.40 | 0.479 | 0.51 | 0.905
MC-Big-Var | 44.89 | 0.154 | 7.19 | 0.280 | 2.17 | 0.492 | 1.09 | 0.931
GA-10-100 | 99.8 | 0.140 | 99.8 | 0.252 | 99.8 | 0.445 | 99.8 | 0.84
GA-20-400 | 100.0 | 0.156 | 100.0 | 0.281 | 100.0 | 0.493 | 100.0 | 0.931
Best Baseline | 19.34 | 0.076 | 5.64 | 0.203 | 2.08 | 0.447 | 0.78 | 1.110
Training Dataset | 10 iter. GAP | time | 25 iter. GAP | time | 50 iter. GAP | time | 100 iter. GAP | time
MC-Sml-40 | 24.9 | 0.162 | 6.58 | 0.307 | 2.35 | 0.555 | 0.89 | 1.057
MC-Sml-Var | 29.88 | 0.252 | 14.63 | 0.46 | 7.67 | 0.822 | 1.85 | 1.546
MC-Big-40 | 32.67 | 0.174 | 12.04 | 0.33 | 5.31 | 0.598 | 2.30 | 1.141
MC-Big-Var | 18.94 | 0.162 | 4.24 | 0.305 | 1.47 | 0.548 | 0.61 | 1.045
GA-10-100 | 67.42 | 0.165 | 67.42 | 0.31 | 67.42 | 0.561 | 67.42 | 1.069
GA-20-400 | 74.69 | 0.163 | 74.69 | 0.306 | 74.69 | 0.551 | 74.69 | 1.051
Best Baseline | 17.98 | 0.112 | 6.34 | 0.289 | 2.66 | 0.627 | 0.93 | 1.477
Training Dataset | 10 iter. GAP | time | 25 iter. GAP | time | 50 iter. GAP | time | 100 iter. GAP | time
MC-Sml-40 | 2.55 | 0.123 | 1.79 | 0.208 | 0.85 | 0.35 | 0.33 | 0.636
MC-Sml-Var | 2.42 | 0.121 | 1.4 | 0.202 | 0.65 | 0.341 | 0.56 | 0.613
MC-Big-40 | 2.67 | 0.119 | 2.12 | 0.197 | 1.4 | 0.334 | 0.69 | 0.603
MC-Big-Var | 2.5 | 0.120 | 1.61 | 0.203 | 0.9 | 0.339 | 0.59 | 0.609
GA-10-100 | 0.92 | 0.122 | 0.70 | 0.201 | 0.68 | 0.34 | 0.68 | 0.609
GA-20-400 | 0.37 | 0.122 | 0.11 | 0.207 | 0.06 | 0.352 | 0.05 | 0.636
Best Baseline | 0.1893 | 0.071 | 0.0156 | 0.211 | 0.0014 | 0.601 | 0.0006 | 1.596
Training Dataset | 10 iter. GAP | time | 25 iter. GAP | time | 50 iter. GAP | time | 100 iter. GAP | time
MC-Sml-40 | 2.35 | 0.253 | 1.9 | 0.536 | 1.14 | 1.004 | 0.39 | 1.943
MC-Sml-Var | 2.27 | 0.238 | 1.67 | 0.504 | 0.76 | 0.937 | 0.5 | 1.805
MC-Big-40 | 2.42 | 0.237 | 2.08 | 0.504 | 1.58 | 0.949 | 0.88 | 1.836
MC-Big-Var | 2.32 | 0.237 | 1.79 | 0.503 | 1.01 | 0.949 | 0.6 | 1.84
GA-10-100 | 1.17 | 0.252 | 1.16 | 0.537 | 1.16 | 1.005 | 1.16 | 1.943
GA-20-400 | 0.4 | 0.218 | 0.13 | 0.461 | 0.07 | 0.86 | 0.06 | 1.651
Best Baseline | 0.1129 | 0.433 | 0.0193 | 1.459 | 0.0048 | 4.256 | 0.0020 | 14.459