Simplex Frank-Wolfe: Linear Convergence and Its Numerical Efficiency for Convex Optimization over Polytopes
Abstract
We investigate variants of the Frank-Wolfe (FW) algorithm for smooth and strongly convex optimization over polyhedral sets, with the goal of designing algorithms that achieve linear convergence while keeping the per-iteration complexity as low as possible. Starting from the simple yet fundamental unit simplex, and based on geometrically intuitive motivations, we introduce a novel oracle called the Simplex Linear Minimization Oracle (SLMO), which can be implemented with the same complexity as the standard FW oracle. We then present two FW variants based on SLMO: Simplex Frank-Wolfe (SFW) and the refined Simplex Frank-Wolfe (rSFW). Both variants achieve a linear convergence rate for all three common step-size rules. Finally, we generalize the entire framework from the unit simplex to arbitrary polytopes. Furthermore, the refinement step in rSFW can accommodate any existing FW strategies such as the well-known away-step and pairwise-step, leading to outstanding numerical performance. We emphasize that the oracle used in our rSFW method requires only one more vector addition compared to the standard LMO, resulting in the lowest per-iteration computational overhead among all known Frank-Wolfe variants with linear convergence.
keywords: Frank-Wolfe algorithm, conditional gradient methods, linear convergence, convex optimization, first-order methods, linear programming
1 Introduction
Over the past decades, Frank-Wolfe (FW) algorithms [10] (a.k.a. conditional gradients [25]) have been extensively investigated due to their lower per-iteration complexity compared to projected or proximal gradient-based methods, in particular for large-scale machine learning applications and sparse optimization. This topic has been comprehensively covered in several recent publications including [3, 4, 27] and [24, Chapter 7], [2, Chapter 10], to name just a few. The key step in FW algorithms is the Linear Minimization Oracle (LMO). We refer to [23] for (worst-case) complexity analysis of general LMOs. One of the most often cited examples is the LMO over the unit simplex; projection onto the simplex is much more expensive than the LMO over it. Research effort has focused on developing LMOs that may lead to linear convergence while keeping the computation of each LMO as low as possible. Therefore, the total computational complexity of a typical FW-type algorithm can be calculated as follows.
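As a schematic accounting (the symbols $N(\varepsilon)$, $C_{\mathrm{LMO}}$ and $C_{\mathrm{other}}$ below are introduced only for illustration and are not used elsewhere in the paper), one may write
$$\text{total cost} \;\approx\; N(\varepsilon)\times\bigl(C_{\mathrm{LMO}}+C_{\mathrm{other}}\bigr),$$
where $N(\varepsilon)$ is the number of iterations required to reach accuracy $\varepsilon$, $C_{\mathrm{LMO}}$ is the cost of one LMO call, and $C_{\mathrm{other}}$ accounts for the remaining per-iteration work such as gradient evaluation and vector updates.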
Note that some existing algorithms may require more than one LMO per iteration. The purpose of this paper is to propose a new LMO whose computational complexity is probably the cheapest among all existing algorithms. Furthermore, it also guarantees a linear convergence rate comparable to the known ones for convex optimization over a polytope:
(1.1)   $\min_{x \in P} f(x), \qquad P = \operatorname{conv}(V),$
where $V$ is a finite set of vectors that we call atoms. For the moment, we only assume that $f$ is convex and differentiable for the convenience of the discussion below; here, $f$ is defined on an open set containing $P$. Later, we will enforce strong convexity as well as other properties for our analysis.
1.1 Related Work
There are a large number of publications that directly or remotely motivated this work. We are only able to list a few of them below with some critical analysis. Given an LMO, the original FW algorithm [10] reads:
$$s_k \in \operatorname*{argmin}_{s \in P}\ \langle \nabla f(x_k),\, s\rangle, \qquad x_{k+1} = x_k + \gamma_k\,(s_k - x_k),$$
where $\gamma_k \in [0,1]$ is a steplength often satisfying certain conditions [6, 18, 19]. One of the key advantages of the FW method over the well-known projected gradient method is its lower cost per iteration in many common scenarios, such as the simplex [6], flow polytope [7, 21], spectrahedron [18, 12], and nuclear norm ball [20]. This efficiency makes the FW method particularly advantageous for large-scale problems. Numerous studies [19, 23, 11] have demonstrated that the convergence rate of the FW method is $O(1/k)$ and that this rate is generally not improvable, except for some special cases, e.g., when the optimal solution lies in the interior of the constraint set [16]. In fact, there exist examples for which the convergence rate of the FW method does not improve even when the objective function is strongly convex, see [19, 23].
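For concreteness, the following is a minimal sketch of the classical FW iteration over the unit simplex, where the LMO amounts to returning the vertex with the smallest partial derivative; the function names, the quadratic test objective, and the simple step size 2/(k+2) are illustrative choices, not the implementation used in our experiments.

```python
import numpy as np

def lmo_simplex(grad):
    """LMO over the unit simplex: the vertex e_i minimizing <grad, .>."""
    s = np.zeros_like(grad)
    s[np.argmin(grad)] = 1.0
    return s

def frank_wolfe_simplex(grad_f, x0, num_iters=1000):
    """Classical Frank-Wolfe with the simple step size gamma_k = 2/(k+2)."""
    x = x0.copy()
    for k in range(num_iters):
        g = grad_f(x)
        s = lmo_simplex(g)          # O(n) linear minimization oracle
        gamma = 2.0 / (k + 2.0)     # simple step size
        x = (1 - gamma) * x + gamma * s
    return x

# Example: minimize ||Ax - b||^2 over the unit simplex.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
b = rng.standard_normal(50)
grad_f = lambda x: 2.0 * A.T @ (A @ x - b)
x_hat = frank_wolfe_simplex(grad_f, np.ones(20) / 20)
```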
Therefore, modifications of the original FW method must be made in order to achieve a linear convergence rate. Significant advances have been made along this line of research, and there exist a large number of FW variants that enjoy a linear convergence rate; see [4, Chapter 3]. The well-known ones include the FW method with away-steps (AFW) and the pairwise FW (PFW) [22, 9, 13]. Most of those modified methods can be cast in the following framework:
(1.2)   $s_k \in \operatorname*{argmin}_{s \in P_k}\ \langle \nabla f(x_k),\, s\rangle,$
where $P_k$ is a well-constructed convex subset of $P$ at the current iterate $x_k$. This framework offers flexibility for more technical strategies to be added. For example, one may mix $s_k$ and $x_k$ through certain combinations with some line-search strategies. Both AFW and PFW make use of such flexibility. One major concern is that the computation of the LMO over $P_k$ may be significantly higher than that over $P$. This is the case when $P$ is a simplex-like polytope, including the unit simplex itself.
In a significant development aiming to address this issue, Garber and Hazan [14] proposed the methodology of the LLOO (Local Linear Optimization Oracle), where the subset is the intersection of the feasible set and an $\ell_1$-ball; in the unit-simplex case,
$$P_k = B_1(x_k, r_k) \cap \Delta_n,$$
where $B_1(x_k, r_k)$ denotes the $\ell_1$-ball of radius $r_k$ centered at $x_k$ and $\Delta_n$ is the unit simplex.
A key result is that the LMO over this intersection is an LLOO, which is referred to as the $\ell_1$-LMO. Hence, linear convergence follows when the radius is exponentially reduced at each iteration under the strongly convex setting. We note that the framework of LLOO can be cast as a special case of the Shrinking Conditional Gradient method (sCndG) by Lan [23, Eq. 3.34] and [24, Alg. 7.2], where an arbitrary norm is used. The LLOO framework does not require the direction-correction step in (1.2). To understand its actual performance, Fig. 2(a) in Sect. 5 illustrates its computational time in comparison to the LMO over the unit simplex as well as a projection algorithm.
It can be clearly observed that the time taken by the $\ell_1$-LMO is roughly the same as that of the projection method, but is significantly slower (increasingly so as the dimension grows) than the standard LMO over the simplex. There is a deep reason behind this behaviour, and it can be best appreciated from the perspective of geometric intuition by considering the case $n = 3$. Fig. 1(a) illustrates the intersection of the $\ell_1$-ball with the unit simplex. Note that for a center lying in the simplex, the intersection of the $\ell_1$-ball with the hyperplane containing the simplex forms a regular hexagon. As the center and radius vary, the shape corresponding to the intersection of this regular hexagon and the unit simplex becomes more complex, as shown by the blue region in Fig. 1(a). This increased complexity of the constraint set makes solving the $\ell_1$-LMO more challenging. From a computational point of view, the $\ell_1$-LMO requires a sorting procedure [14] to handle this complexity and hence takes up too much time.
We also observe that the LLOO/sCndG framework was largely omitted from the recent surveys [3, 4, 27], probably due to the following two reasons. One is the concern of computational cost per iteration discussed above. The other is that it lacks the flexibility to incorporate existing accelerating strategies such as away-steps. In this paper, we propose a new framework for constructing the subset that is not based on any norm. In the meantime, the computational cost per iteration is reduced to what is probably the minimum, and there is flexibility to include various acceleration strategies. We explain our framework below.
[Figure 1: (a) the intersection of the $\ell_1$-ball with the unit simplex; (b) the intersection of two simplex balls.]
1.2 Simplex LMO and Simplex FW: a New Proposal
Ideally, we would like our subset to be like the simplex so that linear optimization over it can be executed fast. We here introduce the simplex ball with a given centroid and radius (detailed definition later). It coincides with the ball of the atomic norm generated by the unit simplex, as introduced in [5]. Moreover, the unit simplex is itself a simplex ball. A very useful property is that the intersection of two simplex balls is again a simplex ball:
where the center and radius of the intersection can be cheaply computed from those of the two balls. This property is illustrated in Fig. 1(b). Consequently, given a point, a radius and a linear objective, we define the Simplex Linear Minimization Oracle by
Since the constraint set is again a simplex ball, SLMO has a closed-form formula (see Alg. 1) and its complexity is roughly the same as that of the standard LMO over the simplex. Furthermore, we prove that SLMO is an LLOO in Lemma 3.2. The consequence is that a linearly convergent algorithm can be developed by following the template in [14]. This part forms the first contribution of the paper.
Casting SLMO as an instance of LLOO does not by itself bring much computational benefit because, as a common practice in FW methods, the direction-correction step in (1.2) is essential for improving numerical performance. To accommodate this need, we make two additions. The first one is a flexible rule to update the radius of the simplex ball. Any lower bound for the optimal objective value is permitted, and the lower bound provided by SLMO is one choice. We opt for the use of the best lower bound available at the current iterate. This leads to the Simplex Frank-Wolfe (SFW) method in Alg. 2, which is proved to be linearly convergent in Thm. 3.4.
The second addition makes use of an important observation that SLMO can be split into two parts. The first part is the construction of the simplex ball and the second part is linear optimization over the simplex ball. The linear optimization is much cheaper than the construction of the simplex ball. It is therefore more economical to perform linear optimization a few more times over each simplex ball:
This functions like the direction-correction used in the general framework (1.2) and allows us to take advantage of existing FW algorithms. For example, AFW and PFW can be used for this part. This leads to our refined SFW method (rSFW) (see Alg. 3 and Alg. 6). We emphasize that the oracle used in our rSFW method requires only one additional vector addition compared to the standard LMO, so its per-iteration computational complexity is probably the cheapest among all existing FW-type algorithms. Our numerical experiments show that rSFW with away-steps and pairwise steps improves performance significantly. The resulting algorithmic scheme is hence different from the LLOO scheme, and we provide a complete convergence analysis. This part may be treated as our second contribution.
Our third contribution is the extension of the simplex case to the general polytope case. We will make use of some fundamental connections between them established in [14]. Since the simplex ball is not defined from any norm, some parts of the extension are highly non-trivial. In particular, the iteration complexity of the extended SFW depends only on the problem dimension instead of the number of extreme points of the polytope, see Thm. 4.4. Computationally, this can all be achieved for three popular polytopes: the hypercube, the flow polytope and the $\ell_1$-ball.
Our final part addresses implementation issues, including adaptive backtracking techniques for choosing the problem parameters, and incorporating away-step FW and pairwise FW steps into the SFW methods. Numerical experiments demonstrate that our SFW methods are highly competitive.
1.3 Organization
In the preceding discussion, we focused only on the framework (1.2) that may lead to linearly convergent FW methods. We avoided specifying the actual conditions on the subset because various conditions can ensure such a linear rate. In Section 2, we describe such conditions as well as some background on polytopes. We will explain the key concept of LLOO proposed in [14]. Section 3 contains the detailed development of SLMO and the resulting Simplex FW methods (SFW and rSFW) for the unit-simplex case. The extension to the general polytope case is conducted in Section 4. Lengthy proofs are moved to the Supplement for the benefit of readability. Section 5 reports some illustrative examples to demonstrate the advantage of SFW methods over some existing algorithms. We conclude the paper in Section 6.
2 Notation and Background
2.1 Notation
We employ lower-case letters, bold lower-case letters, and capital letters to denote scalars, vectors, and matrices, respectively. For two column vectors, their component-wise maximum is again a column vector; the component-wise minimum is defined similarly. For vectors, we use the standard Euclidean norm and the standard inner product. For a vector, an index subset, and a scalar, we define
where “$:=$” means “define”.
We let $B(x, r)$ denote the Euclidean ball of radius $r$ centered at $x$. For matrices, we let $\|\cdot\|$ denote the spectral norm. For a vector $x$, we use $x_i$ or $[x]_i$ to denote its $i$-th component. For a matrix $A$, we use $A_i$ to denote the $i$-th row of $A$. The vector $\mathbf{1}$ represents a vector with all entries equal to 1, and $e_i$ is the standard $i$-th unit vector, which takes value 1 at its $i$-th position and 0 elsewhere. Given a set, we denote its convex hull by $\operatorname{conv}(\cdot)$. For any positive integer $m$, we use $[m]$ to represent the set $\{1, \ldots, m\}$. We use $\Delta_n$ to denote the unit simplex.
2.2 Smoothness, Strong Convexity and Stepsizes
Throughout the paper, we will assume $\beta$-smoothness and $\alpha$-strong convexity of the objective $f$.
Definition 1 (Smooth function).
We say that a function $f$ is $\beta$-smooth over a convex set $\mathcal{C}$ if for every $x, y \in \mathcal{C}$ there holds
$$f(y) \ \le\ f(x) + \langle \nabla f(x),\, y - x\rangle + \frac{\beta}{2}\,\|y - x\|^2.$$
Definition 2 (Strongly convex function).
We say that a function $f$ is $\alpha$-strongly convex over a convex set $\mathcal{C}$ if for every $x, y \in \mathcal{C}$ there holds
$$f(y) \ \ge\ f(x) + \langle \nabla f(x),\, y - x\rangle + \frac{\alpha}{2}\,\|y - x\|^2.$$
The above definition combined with the first-order optimality condition implies that for an $\alpha$-strongly convex function $f$, if $x^* = \operatorname*{argmin}_{x \in \mathcal{C}} f(x)$, then for any $x \in \mathcal{C}$ we have
(2.1)   $f(x) - f(x^*) \ \ge\ \frac{\alpha}{2}\,\|x - x^*\|^2.$
This property, while weaker than strong convexity, is the one actually needed to establish linear convergence, so one need not rely on strong convexity itself. A natural generalization of this property is known as the quadratic growth property.
There are three popular step size strategies:
- 1. Simple step size:
  (2.2)   $\gamma_k = \dfrac{2}{k+2}.$
- 2. Line-search step size:
  (2.3)   $\gamma_k \in \operatorname*{argmin}_{\gamma \in [0,1]} f\bigl(x_k + \gamma\,(s_k - x_k)\bigr).$
- 3. Short step size:
  (2.4)   $\gamma_k = \min\Bigl\{1,\ \dfrac{\langle \nabla f(x_k),\, x_k - s_k\rangle}{\beta\,\|s_k - x_k\|^2}\Bigr\}.$

Here $s_k$ denotes the oracle output at iteration $k$.
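A small sketch of how the three rules can be evaluated for a single FW direction is given below; the grid search standing in for exact line search and the name of the smoothness constant are illustrative assumptions.

```python
import numpy as np

def step_sizes(k, grad, x, s, beta, f=None):
    """Evaluate the simple, line-search, and short step sizes for the direction d = s - x."""
    d = s - x
    simple = 2.0 / (k + 2.0)                                  # (2.2)
    gap = float(grad @ (x - s))                               # <grad f(x_k), x_k - s_k>
    short = min(1.0, gap / (beta * float(d @ d) + 1e-16))     # (2.4), clipped to [0, 1]
    line = None
    if f is not None:                                         # (2.3): crude grid stand-in for exact line search
        ts = np.linspace(0.0, 1.0, 1001)
        line = ts[int(np.argmin([f(x + t * d) for t in ts]))]
    return simple, line, short
```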
2.3 Quantities of a Polytope
The quantities reviewed in this part are well defined and investigated in [14] and they are mainly used in the extension of SFW to polytopes.
Let $P$ be a polytope described by linear equations and inequalities. Without loss of generality, we assume that all rows of the inequality constraint matrix have been scaled to have unit norm. We denote the set of vertices of $P$ by $V(P)$ and let $N$ represent the number of vertices.
Next, we introduce several geometric parameters related to that will naturally arise in our analysis. The Euclidean diameter of is defined as . We define
This means that for any inequality constraint defining the polytope and for a given vertex, that vertex either satisfies the constraint with equality or is at least units away from satisfying it with equality. Let denote the row rank of , and let represent the set of all matrices with linearly independent rows selected from the rows of . We then define . Finally, we introduce the condition number of the polytope as
(2.6) |
It is important to note that this condition number is invariant under translation, rotation and scaling of the polytope. For convenience, we omit the explicit dependence on the polytope when it is clear from context. It is worth noting that in many relevant scenarios, particularly in cases where efficient algorithms exist for linear optimization over the given polytope, estimating these parameters is often straightforward. This is particularly true for convex domains encountered in combinatorial optimization, such as flow polytopes, matching polytopes, and matroid polytopes, among others. Furthermore, our algorithm relies primarily on only two of these parameters.
2.4 LLOO
A major concept proposed by Garber and Hazan [14, Def. 2.5] is the LLOO. Consider the problem (1.1). We say a procedure $\mathcal{A}(x, r, c)$, where $x \in P$, $r > 0$ and $c \in \mathbb{R}^n$, is an LLOO with parameter $\rho \ge 1$ for the polytope $P$ if it returns a feasible point $p = \mathcal{A}(x, r, c)$ such that
- (i) $\langle c, p\rangle \le \langle c, y\rangle$ for all $y \in B(x, r) \cap P$, and
- (ii) $\|x - p\| \le \rho\, r$.
Suppose the optimal solution $x^*$ of (1.1) is contained in $B(x, r)$ and the LLOO returns a feasible point $p$. The convexity of $f$ implies the following:
$$f(x^*) \ \ge\ f(x) + \langle \nabla f(x),\, x^* - x\rangle \ \ge\ f(x) + \langle \nabla f(x),\, p - x\rangle.$$
That is, the LLOO naturally provides a lower bound for the optimal objective value. Such lower bounds will be used in our updating scheme for the radius. We also note that an LLOO often returns an optimal solution over a subset, which should be constructed in (1.2) with its solution efficiency in mind.
Given an LLOO procedure, a general FW framework can be developed that enjoys a linear convergence rate over a general polytope, provided the objective is smooth and strongly convex; see [14, Thm. 4.1]. It is proved there that the $\ell_1$-LMO is an LLOO over the simplex. The framework is then extended to general polytopes. As we discussed in the Introduction, the $\ell_1$-LMO is much less efficient than the original LMO over the simplex. This is one of the motivations for us to develop the simplex LMO below.
3 Simplex FW Method
This section is solely devoted to the simplex case: Problem (1.1) with the unit simplex as the feasible set. We will then extend the obtained results to general polytopes in the next section. We start with the introduction of the simplex ball.
3.1 Simplex Ball and Simplex-based Linear Minimization Oracle
In this subsection, we will formally define the concept of the simplex ball and present some of its useful properties. Following this, we will introduce the Simplex-based Linear Minimization Oracle (SLMO) and provide an efficient algorithm for solving it.
Definition 3 (Simplex ball).
Let For any and , we define as the simplex ball of radius centered at by
(3.1) |
The following properties of the simplex ball are crucial to our development. The proof is moved to the Supplement A.
Lemma 3.1.
Given and , we have
- (1) The unit simplex is a simplex ball. Moreover, we have the characterization (3.2).
- (2) The intersection of two simplex balls, if nonempty, is again a simplex ball; in particular, (3.3) holds. Moreover, for centers and radii such that the two balls intersect, the center and radius of the intersection are given by (3.4) and can be cheaply computed. Consequently, the intersection is contained in both balls.
- (3) Linear optimization over a simplex ball has a closed-form solution.
- (4) The diameter of the simplex ball is proportional to its radius.
- (5) Any point sufficiently close to the center belongs to the simplex ball; conversely, every point of the simplex ball lies within a distance proportional to the radius from the center.
We now give a formal definition of our LMO based on the simplex ball.
Definition 4 (SLMO).
Given a linear objective, a radius and a point, a solution is called a simplex-based linear minimization oracle if it solves the following optimization problem
(3.5) |
We have proved in Lemma 3.1(5) that . Therefore, for any , we must have . This is the first property of LLOO. Moreover, since both , we must have with . This leads to the following key result.
Lemma 3.2.
Given , and such that , then is an LLOO with .
The implication of this result is far-reaching because the framework developed in [14] can be followed to obtain a linearly convergent algorithm with SLMO. An even more important result is that the SLMO problem (3.5) can be solved by the following simple algorithm.
The algorithm follows two basic steps. First, it represents the constraint set as a single simplex ball. Second, it uses existing theoretical results on linear programming over a simplex ball to find the optimal solution.
Proof.
Remark 1.
(Comparison with the LMO and the $\ell_1$-LMO) If we treat the element-wise minimum between two vectors as a basic operation, then SLMO requires only one extra basic operation, one more vector summation, and one more vector addition compared to the original LMO over the simplex. Therefore, its total complexity remains linear in the dimension, making it nearly as efficient as the original LMO. However, the $\ell_1$-LMO involves a sorting operation, whose overall complexity is usually $O(n \log n)$, and it also involves a few more vector additions. As will be illustrated in Fig. 2(c), it is far less efficient than the original LMO and SLMO. Therefore, we expect that a linearly convergent algorithm with SLMO should be efficient as well. We develop it below.
3.2 SFW: Simplex Frank-Wolfe Method
In this subsection, we present a new variant of the Frank-Wolfe method called Simplex Frank-Wolfe (abbreviated as SFW), obtained by replacing the LMO with SLMO. The algorithm is formally described as follows.
Before stating its convergence rate result, we make the following remarks regarding Alg. 2.
Remark 2.
(Choice of update strategy) We could follow the linear shrinking rule of the radius in [14, 24], with a properly chosen shrinking factor. The linear convergence rate would then be guaranteed by invoking the LLOO property of SLMO (Lemma 3.2). We do not take this route for the convergence analysis for the following two reasons. One is that the key property ensuring LLOO would have to be modified when it comes to the general polytope, as our simplex ball is not based on any norm; the corresponding relationship becomes the one in Lemma 4.3, where the relevant constant is defined. At least at a technical level, the original LLOO would have to be generalized to suit this extension and the corresponding proofs would also have to be reproduced. The proof we provide below is more direct. The other reason is that we use the best lower bound provided by the algorithm to define the radius. This choice is important because we are going to incorporate other accelerating strategies into SFW, resulting in rSFW. The stopping criterion used there will also be based on the best lower bounds obtained. Therefore, the convergence analysis for SFW will naturally be adapted to rSFW.
At the $k$-th iteration, Alg. 2 first invokes SLMO to find the minimizer of the first-order approximation of the objective function within the current region. Subsequently, using a suitable step size, a convex combination of the current iterate and the SLMO output is computed to update the iterate. Finally, the algorithm updates the radius, ensuring that the optimal solution progressively falls within a smaller neighborhood. For Alg. 2, we propose the following simple step size as an alternative to the simple step size selection in the original Frank-Wolfe algorithm:
(3.6) |
We have the following linear convergence rate result. The induction technique in the proof below was taken from [14, Lemma 4.3].
Theorem 3.4.
Proof.
We first claim that and that . We prove this by induction. First, we have
where comes from (2.1). This implies that , and by Lemma 3.1(5), we have . Therefore, the claim holds for .
Now suppose that and hold for all . Let . For the step size policy (2.3) (exact line-search step size) or (3.6), we have
(3.8)
Similarly, for the step size policy (2.4) (short stepsize), we have
(3.9)
Combining (3.8) and (3.9), we have
holds for the step size policies (2.3), (2.4), and (3.6), where comes from the definition of , and is due to . Subtracting from both sides of the above inequality, we obtain
where is due to our inductive hypothesis and Lemma 3.1(5). By plugging in the value of , and using , we have that
(3.10) |
With the definition of , we have . The bound in (3.10) implies
(3.11) |
By the inductive hypothesis, we know that holds for all . Thus is a valid lower bound of , and consequently, is also a lower bound of . Now by (2.1), we have
This implies that by Lemma 3.1(5). Therefore, we have completed the proof of the claim.
Remark 3.
(Iteration complexity of SFW) If we skip Lines 4–5 and replace Line 7 in Alg. 2 with , the algorithm remains correct and preserves its convergence guarantee. This modification avoids computing the objective value, making the algorithm simpler and more practical when the objective is expensive or difficult to evaluate. In this case, assuming that there is an oracle to obtain the gradient at each iteration, we are able to give the exact number of flops needed to compute the next iterate. The SLMO part is flops, and computing the direction requires an additional flops. For the short step size strategy, evaluating the step size incurs flops, and updating the iterate takes another flops. Thus, SFW needs flops with a simple step size, or flops with the short step size, yet it still achieves linear convergence over the simplex. To our knowledge, this is the lowest per-iteration cost among FW variants with linear convergence.
3.3 Refining SFW
In this subsection, we aim to further reduce the computational overhead of the proposed oracle as much as possible, while retaining the linear convergence rate. The motivation stems from an important observation: SLMO can be split into two parts. The first part is to construct a new simplex ball; for easy reference, we call it SLMO-1, which corresponds to the first lines of Alg. 1. The second part, which finds an optimal solution over the constructed ball, is referred to as SLMO-2 and corresponds to Lines 4-5 in Alg. 1. It is easy to see that the computation of SLMO-2 requires only one more vector addition compared to the standard LMO over the simplex, and hence its computation is already kept to a minimum. The extra computation of SLMO comes from SLMO-1. Since the new simplex ball is already constructed, we would like to carry out the SLMO-2 part a few more times. This is roughly to find an approximate solution to the problem
with starting point . We control this refinement step so that the overall computation remains low and the overall convergence rate remains linear. The overall algorithm is given in Alg. 3 and is called the refined SFW.
Alg. 3 keeps three sequences , , and , each associated with a simplex ball, namely , , and . Starting with , we define such that . At the th iteration, we first compute
We then define by its simplex ball, which satisfies . We further define the iterate by its simplex ball satisfying . They can all be efficiently computed via the formula (3.4).
A great advantage of Alg. 3 lies in its computation of the inner iterates. The oracle we call is SLMO-2, which ensures that the per-iteration complexity remains the same as that of the original FW algorithm. It is also important to highlight that the inner loop of our algorithm (Lines 5-13) follows the standard FW algorithm. Its primary goal is to find a solution that satisfies a prescribed accuracy condition. Therefore, various speedup techniques for the classical FW algorithm can be directly applied to this inner loop without interfering with the outer loop of the algorithm. These include approaches such as the ‘away-step’ and ‘pairwise’ variants of FW proposed by [22], the ‘fully-corrective’ variant of FW proposed by [19], as well as the warm start technique suggested by [11].
The following theorem summarizes the convergence result for this algorithm.
Theorem 3.5.
Proof.
We first claim that for any and we prove this by induction. For , we have . By the definition of , we have
Hence, . Now suppose that for some . Note that the inner loop of Alg. 3 corresponds to the standard Frank-Wolfe algorithm. By Theorem 2.1 and Lemma 3.1(4), which implies that the diameter of is , we have
hold for all . In the case where the inner loop terminates at , we obtain
Similarly, if the inner loop is interrupted due to lines 9-11 of the algorithm, we still have
Using the fact that , we have , which implies via Lemma 3.1(5) that . This implies
We now start to prove the conclusion in Theorem 3.5. Since and , we have and thus
We complete the proof. ∎
Remark 4.
(Warm-start Strategy) For rSFW and rSFWP in the upcoming Section 4, when using the simple step size, the initial steps of each inner loop may perform poorly, causing the iterate to drift far away from the optimal solution. We found that the following heuristic warm-start strategy is effective in practice. Let denote the actual number of inner loop iterations during the -th outer loop iteration. When initiating the next outer loop, instead of starting the inner loop from , begin from either or . Here is a hyperparameter, with serving as a reasonable default value.
Remark 5.
(Extension to Quadratic Growth Condition) Although the two main Thms. 3.4 and 3.5 are established under the strong convexity of $f$, we would like to point out that the assumption can be weakened to the quadratic growth condition:
$$f(x) - \min_{y \in P} f(y) \ \ge\ \frac{\mu}{2}\,\operatorname{dist}^2(x, X^*) \quad \text{for all } x \in P,$$
where $\mu > 0$ and $X^*$ is the solution set of Problem (1.1). This includes, for example, the well-known case of a strongly convex function composed with a linear mapping; for a detailed investigation of this class of functions with FW methods, see [1]. In this paper, we did not pursue such an extension, as our main purpose is to introduce SLMO under the strong convexity setting for the sake of simplicity.
4 Generalization to Arbitrary Polytopes
This section extends the previous results for the unit simplex to arbitrary polytopes. This generalization allows for a broader application of our findings, facilitating their relevance to a wider range of optimization problems. Important properties relating the standard simplex and general polytopes have been established in [14]; our extension relies heavily on some of those results. This section is patterned after Section 3, with some details omitted to avoid repetition. We first define the simplex ball for a general polytope and the corresponding SLMO. We then describe the Simplex Frank-Wolfe method for a general polytope, followed by its refined version.
4.1 Simplex Ball and SLMO for Arbitrary Polytopes
Consider Problem (1.1) with and . Therefore, any given can be represented as a convex combination of those atoms . However, the convex combination may not be unique. We define a set-valued mapping from to the following set:
Recall that for and , is the simplex ball defined in (3.2). The idea of defining a similar Simplex ball over the polytope can be summarized as follows:
(4.1) |
where consists of the columns , . It seems that the Simplex ball depends on a particular choice of . The following result dismisses this dependence.
Lemma 4.1.
Given and , let . Then for any , there exist such that
Proof.
By definition of , we know
(4.2) |
Lemma 4.1 ensures that the definition is independent of the choice of representation; hence, it is well defined. Given a linear objective, we extend it to the weight space in such a way that the objective values agree for all feasible points. Consequently, the following equivalence holds:
Leveraging this equivalence, we define the generalized SLMO for as follows.
Definition 5 (: SLMO over ).
Given a linear objective , radius , a point , and its corresponding , a solution is referred to as a generalized simplex-based linear minimization oracle if
where is an optimal solution to the following optimization problem
(4.4)
We note that (4.4) can be efficiently solved by Alg. 1. Consequently, can also be efficiently solved provided an element in can be cheaply obtained. The detailed steps are outlined in Alg. 4.
One can observe that the second phase of the generalized SLMO (corresponding to Lines 4-5 in Alg. 4 and serving as the oracle in our subsequent Alg. 6) requires only one more vector addition and one extra scalar-vector multiplication compared to the standard LMO.
We summarize the optimality of Alg. 4 in the following result, which is a direct consequence of Lemma 3.3.
Lemma 4.2.
Algorithm returns an optimal solution for the problem:
Remark 6.
(Carathéodory’s Representation Assumption)
By Carathéodory’s
Representation Theorem [28, Thm. 17.1], for any point , there exists such that where .
As demonstrated in the illustrative examples in Supplement C.1, this representation can be easily implemented for common types of .
In this representation, the running time of does not explicitly depend on the number of vertices , but rather on the natural dimension of , that is, .
For the analysis in the subsequent sections, we assume without loss of generality that the selected always satisfies .
The following lemma demonstrates some useful properties of and , which can be regarded as a generalization of Lemma 3.1(5) and is crucial for proving the convergence of our algorithm in the next subsection. The detailed proof can be found in Supplement B.
Lemma 4.3.
Given , and , for any point satisfying , it follows that and . Furthermore, we have .
We would also like to note that, though similar to LLOO, it is not exactly an LLOO because Lemma 4.3 only proves that , not , which would be sufficient for the first property of LLOO. We note that . Therefore, the condition would be enough for it to be an LLOO.
4.2 SFWP: Simplex Frank-Wolfe for Arbitrary Polytopes
In this subsection, we extend the SFW to the polytope case. The generalized SFW algorithm is presented as follows.
4.3 rSFWP: Refining SFWP
As with the motivation for rSFW in the simplex case, once we have constructed the simplex ball for the polytope, we may compute an approximate solution:
with the initial point . The motivation is based on a similar observation that the generalized SLMO can be split into two independent parts, the first part of constructing the simplex ball being the major computation. Hence, once such a ball is constructed, we run a few more cheap SLMO steps over this ball. Once again, other methods such as away-step FW and pairwise FW can be used for this computation. The generalized rSFW algorithm is presented as follows.
Similar to Theorem 3.5, we can provide the following convergence analysis for Alg. 6, which is proven in Supplement B.
Theorem 4.5.
Remark 7.
(Adaptive Lower Bound Update) We estimate the lower bound of the optimal value by for the Simplex Frank-Wolfe method and its refined version. In fact, when the objective function exhibits specific structural properties, we can derive an additional lower bound and update the best bound accordingly. For instance, when the objective function has a min-max structure, we can construct a min-max lower bound; see [11] for a detailed analysis. Moreover, in certain application scenarios, there may be exact information about the optimal value, such as in linear regression or machine learning tasks, where it is known a priori that the optimal value of the loss function is zero. In such cases, it is straightforward to set the lower bound accordingly.
Remark 8.
(Robustness to Parameter Estimation) Our algorithms rely on a few problem parameters. In practice, using overestimates of some and an underestimate of another only increases the bounds by a constant factor. Moreover, as shown in Supplement C.2, these quantities can be efficiently estimated for common polytopes. For the smoothness and strong convexity constants, one can use the backtracking strategy from [26] to estimate their local values and compute adaptive short step sizes; see Subsection 5.4 for details.
5 Numerical Experiments
In this section, we present numerical experiments to evaluate the efficiency, convergence, and adaptability of the proposed methods. All tests were performed using MATLAB R2022b on a Windows laptop equipped with a 14-core Intel(R) Core(TM) 2.30GHz CPU and 16GB of RAM.
We address four tasks. (T1) We first assess the computational efficiency of SLMO and SLMO-2 across four representative polytopes, consolidating their role as the workhorse of our SFW methods. (T2) We illustrate the linear convergence behavior of SFW and rSFW using two numerical experiments. (T3) We show that our methods can be enhanced with a backtracking strategy to eliminate the need for predefined values of the smoothness and strong convexity parameters. (T4) We demonstrate how integrating the away-step variants of the Frank-Wolfe method (AFW and PFW) into the rSFW framework significantly enhances its performance, outperforming the original AFW and PFW methods. These four tasks are addressed in the following four subsections.
5.1 Efficiency of SLMO and SLMO-2
In this subsection, we evaluate the performance of the proposed SLMO and SLMO-2 through comparative experiments on four common polytopes: (a) the unit simplex; (b) the hypercube; (c) the $\ell_1$-ball; and (d) the flow polytope, derived from the video co-localization problem in [21].
Algorithm | Formulation | Description
---|---|---
Projection | $\operatorname*{argmin}_{y\in P}\|y-x\|^2$ | The projection onto the polytope $P$, where $x$ is a randomly generated point. We implement projections onto the simplex and $\ell_1$-ball using the method from [8, Fig. 2], while the projection onto the hypercube is straightforward. Although a closed-form solution exists for projection onto the flow polytope [29, Thm. 20], its computational complexity makes it significantly more expensive than the other LMO variants.
LMO | $\operatorname*{argmin}_{y\in P}\langle y,c\rangle$ | The standard linear minimization oracle.
$\ell_1$-LMO | $\operatorname*{argmin}_{y\in\{V\lambda\,\mid\,\lambda\in B_1(\lambda_x,d)\cap S_N\}}\langle y,c\rangle$ | The $\ell_1$-norm constrained LMO, Alg. 3 and Alg. 4 in [14]. Here, $V$ consists of the vertices of $P$ as columns, and the computation of $\lambda_x$ is included in the timing.
NEP | $\operatorname*{argmin}_{y\in V(P)}\langle y,c\rangle+\lambda\|y-x\|^2$ | Nearest extreme point oracle in [15]. Here, $\lambda$ is a randomly generated positive number.
SLMOP | $\operatorname*{argmin}_{y\in\{V\lambda\,\mid\,\lambda\in S(\lambda_x,d)\cap S_N\}}\langle y,c\rangle$ | Our proposed Simplex Linear Minimization Oracle (Alg. 1 and Alg. 4). Here, the computation of $\lambda_x$ is included in the timing.
SLMOP-2 | $\operatorname*{argmin}_{y\in\{V\lambda\,\mid\,\lambda\in S(\hat\lambda_x,\hat d)\}}\langle y,c\rangle$ | The latter phase of SLMO, consisting of Lines 4-5 of Alg. 1 and Alg. 4. Here, $S(\hat\lambda_x,\hat d)$ is precomputed and not included in the timing.
[Figure 2: running time versus dimension of the six oracles on the four polytopes.]
We consider six different methods, including projection (a key subproblem in projection/proximal-based methods) and five variants of the LMO, as detailed in Table 1. These six methods share the same randomly generated data. Figure 2 illustrates the relationship between running time and dimensionality for the six methods. Additionally, Table 2 reports the proportion of time spent on LMO calls within the SLMO-2 algorithm. We draw the following observations from these results:
- $\ell_1$-LMO: Across all four polytopes, the $\ell_1$-LMO method incurs the highest computational overhead, surpassing even that of projection-based methods.
- SLMOP-2 vs. SLMOP: The overhead of SLMO-2 is significantly lower than that of SLMO. In most cases, as shown in Table 2, its overhead closely matches that of the LMO itself. This indicates that our proposed rSFW and rSFWP achieve per-iteration complexity comparable to that of the standard Frank-Wolfe algorithm.
- NEP: While NEP demonstrates very low runtime overhead, it is important to note that the corresponding Frank-Wolfe variant, NEP-FW, converges only sublinearly, as shown in Subsection 5.2.
- $\ell_1$-LMO and SLMOP on the flow polytope: Both methods exhibit rising overhead with increasing dimension, mainly due to the cost of computing the Carathéodory representation, which dominates the runtime.
[Table 2: proportion of time spent on LMO calls within SLMO-2, for the simplex, hypercube, $\ell_1$-ball, and flow polytope.]
5.2 Linear Convergence of SFW and rSFW
Algorithm | Description |
---|---|
FW (simple/Ada) | Frank-Wolfe with simple step size . The ‘Ada’ variant employs a backtracking step (5.1) prior to updating the iterate, in order to estimate the local parameters and , thereby enabling an adaptive short step size. We set and in Alg. 7, and apply the same configuration to the subsequent ‘Ada’ variants. |
SFW/SFWP (line-search) | Simplex Frank-Wolfe with exact line-search (Alg. 2 and Alg. 5). For the $\ell_1$-constrained least squares problem, we set , and . Although Supplement C.2 estimates for the $\ell_1$-ball, this setting does not hinder the algorithm's linear convergence and demonstrates strong practical performance. For the video co-localization task, we set and ; see Supplement C.2 for details. For the simplex-constrained least squares problem, we set .
NEP-FW (simple) | Frank-Wolfe with Nearest Extreme Point Oracle, with theoretical step size [15, Alg. 1]. We omitted Line 5, as it showed no noticeable effect on performance. |
rSFW/rSFWP (simple) | Refined Simplex Frank-Wolfe with simple step size (Alg. 3 and Alg. 6). We utilize the warm-start strategy with as mentioned in Remark 4. The parameters and are set the same as in SFW/SFWP. Additionally, we set for the $\ell_1$/simplex-constrained least squares problems, and for the video co-localization problem.
PFW (line-search) | Pairwise Frank-Wolfe with exact line-search [22, Alg. 2]. |
AFW (line-search) | Away-steps Frank-Wolfe with exact line-search [22, Alg. 1]. |
rSFW-P (line-search) | The Refined Simplex Frank-Wolfe framework enhanced with Pairwise technique. Specifically, we incorporate (5.3) after Line 6 in Alg. 3 and replace Line 12 with (5.4). |
rSFW-A (line-search) | The Refined Simplex Frank-Wolfe framework enhanced with Away-steps technique. Specifically, we incorporate (5.2) after Line 6 in Alg. 3 and replace Line 12 with (5.4). |
We demonstrate the linear convergence of our proposed methods, SFW and rSFW, through two numerical experiments. These methods are compared against the standard Frank-Wolfe (FW) algorithm, its variant NEP-FW (as the code for NEP-FW is not publicly available, we implemented it ourselves), and two well-known variants, Away-step FW (AFW) and Pairwise FW (PFW) (the implementations of AFW and PFW are available at https://github.com/Simon-Lacoste-Julien/linearFW), all summarized in Table 3.
To evaluate algorithmic performance, we adopt the Frank-Wolfe gap defined by $g_k = \langle \nabla f(x_k),\, x_k - s_k\rangle$, where $x_k$ is the $k$-th iterate and $s_k$ denotes the solution returned by the respective LMO variant at that iteration. This FW gap provides a valid upper bound on the primal gap, i.e., $f(x_k) - f(x^*) \le g_k$, and can thus be used as a practical stopping criterion. (For NEP-FW, however, this inequality does not hold in general. Therefore, to ensure a consistent and fair comparison, we use the standard FW gap to evaluate NEP-FW as well.)
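As a small illustration (the helper names below are ours and not part of the algorithms above), the FW gap can be computed from the oracle output and used as a stopping test:

```python
import numpy as np

def fw_gap(grad, x, s):
    """Frank-Wolfe gap <grad f(x_k), x_k - s_k>, an upper bound on the primal gap."""
    return float(grad @ (x - s))

def run_until_tol(grad_f, lmo, x0, tol=1e-8, max_iters=10_000):
    x = x0.copy()
    for k in range(max_iters):
        g = grad_f(x)
        s = lmo(g)
        if fw_gap(g, x, s) <= tol:      # practical stopping criterion
            break
        gamma = 2.0 / (k + 2.0)
        x = (1 - gamma) * x + gamma * s
    return x
```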
[Figure 3: convergence results for the two experiments.]
The first experiment involves a least squares regression constrained to the $\ell_1$-ball, where the entries of the design matrix are drawn from a standard Gaussian distribution. The right-hand side is set so that the planted solution is a random sparse vector, normalized to lie on the boundary of the $\ell_1$-ball. Thus, the optimal value of this problem is 0. We use the same initial point for all methods.
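A sketch of how such a test instance can be generated is given below; the dimensions, the sparsity level, and the helper name are placeholders rather than the exact values used in our runs.

```python
import numpy as np

def make_l1_least_squares(m=500, n=2000, sparsity=0.05, seed=0):
    """Least squares over the unit l1-ball with a sparse planted solution on its boundary."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((m, n))
    x_true = np.zeros(n)
    support = rng.choice(n, size=int(sparsity * n), replace=False)
    x_true[support] = rng.standard_normal(support.size)
    x_true /= np.abs(x_true).sum()        # normalize so ||x_true||_1 = 1 (boundary of the l1-ball)
    b = A @ x_true                        # consistent data, so the optimal value is 0
    return A, b, x_true
```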
The second experiment involves a convex quadratic problem over the flow polytope, derived from the video co-localization task introduced by [21]. The problem is formulated as , where is a positive definite matrix, , and represents the s-t flow polytope. We used the same dataset and initial point as in [22, 15]. The problem has a dimension of .
The results are presented in Figure 3. We make some comments below.
- Linear convergence of SFWP and rSFWP: Both SFW and rSFW show linear convergence, confirming our theoretical guarantees.
- Superior efficiency of rSFWP: In both experiments, our proposed rSFW method significantly outperforms all other algorithms, including the well-established AFW and PFW, in terms of running time.
- Limitations of NEP-FW: Although NEP performs well in iteration complexity, its NEP-FW variant converges sublinearly and is slightly slower than standard FW.
- Time inefficiency in video co-localization: In the video co-localization task, while SFW, AFW, and PFW converge quickly by iteration count, their runtime is slower due to overhead: Carathéodory computation for SFW, and growing active sets for AFW and PFW.
5.3 SFW/rSFW with Backtracking
We further show that our methods can be enhanced with the backtracking technique proposed by [26], thereby eliminating the need to manually specify the smoothness and strong convexity parameters. While [26, Alg. 2] does not specify how to estimate the strong convexity constant, we outline our approach to estimating both constants as well as determining an adaptive step size; see Appendix D for the detailed algorithm. As an example, consider the $k$-th iteration of SFWP. Before updating the iterate, we perform the following backtracking step:
(5.1) |
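The following is a minimal sketch, in the spirit of [26], of how a local smoothness estimate can be updated by backtracking before the iterate is updated; the shrink/grow factors and the function name are illustrative and do not reproduce the exact rule (5.1).

```python
def backtrack_L(f, grad_x, x, d, gamma, L_prev, shrink=0.9, grow=2.0, max_tries=50):
    """Return a local smoothness estimate L such that the quadratic upper bound
    holds along the step x + gamma * d, starting from a slightly shrunk previous estimate."""
    L = shrink * L_prev
    fx = f(x)
    lin = float(grad_x @ d)
    dd = float(d @ d)
    for _ in range(max_tries):
        if f(x + gamma * d) <= fx + gamma * lin + 0.5 * L * gamma**2 * dd:
            return L
        L *= grow   # estimate too small: enlarge and test again
    return L
```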
We focus on the $\ell_1$-constrained logistic regression problem of the form:
where and . We use the dataset Madelon [17], which has , and is fully dense. We set and . We compare our methods against AFW, PFW, and the standard FW, all of which are equipped with the backtracking technique. The results are presented in Figure 4. It can be observed that our two methods achieve the best performance in terms of running time. Although AFW and PFW perform well in terms of iteration count, their overall efficiency is hindered by the increasing cost of maintaining a growing active set and computing the away direction.
[Figure 4: performance of the backtracking variants on the $\ell_1$-constrained logistic regression problem.]
5.4 rSFW Framework Combined with the Away Step Technique
In this subsection, we demonstrate that the well-known linearly convergent variants of the standard Frank-Wolfe method, AFW and PFW [22], can be seamlessly integrated into the inner loop of the rSFW framework. This straightforward combination leads to a significant performance improvement over the original AFW and PFW methods.
We focus on the simplex-constrained least squares problem, where the data matrix has standard Gaussian entries. The right-hand side is set so that the planted solution is constructed by first generating a random nonnegative vector with a prescribed sparsity parameter and then normalizing it so that its components sum to 1. Thus 0 is the optimal value of this problem. We use the same initial point for all methods.
In comparison to the experiments in the previous subsection, we introduce four additional algorithms: AFW and PFW, along with their respective versions integrated into the rSFW framework, denoted as rSFW-A and rSFW-P. Details of these methods are provided in Table 3.
We briefly explain here how the direction-correction is computed within the general framework (1.2) for both the rSFW-A and rSFW-P algorithms. Based on Alg. 3, during the -th outer loop and the -th inner loop, let denote the active set corresponding to the point . Thus, can be represented as where . Let . For the rSFW-A method, the direction-correction is computed as:
(5.2) |
where
For the rSFW-P method, the direction-correction is computed as:
(5.3) |
Finally, we update the point using the iteration:
(5.4) |
which replaces the original iteration.
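For reference, the standard away-step and pairwise ingredients from [22], on which the corrections (5.2)-(5.4) are built, can be sketched as follows; the variable names and the maximal step sizes shown are those of the standard AFW/PFW variants, not the exact quantities used in (5.2)-(5.4).

```python
import numpy as np

def away_and_pairwise_directions(grad, x, fw_vertex, active_vertices, active_weights):
    """Given the FW vertex and the active set {(v_i, alpha_i)} representing x,
    return the away direction x - v_away and the pairwise direction s - v_away,
    where v_away maximizes <grad, v> over the active set (as in [22])."""
    scores = [float(grad @ v) for v in active_vertices]
    i_away = int(np.argmax(scores))
    v_away = active_vertices[i_away]
    d_away = x - v_away                      # away-step direction
    d_pairwise = fw_vertex - v_away          # pairwise direction
    max_step_away = active_weights[i_away] / (1.0 - active_weights[i_away] + 1e-16)
    max_step_pairwise = active_weights[i_away]
    return d_away, max_step_away, d_pairwise, max_step_pairwise
```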
[Figure 5: comparison of rSFW-A and rSFW-P with AFW, PFW, and rSFW on the simplex-constrained problem.]
The results are given in Figure 5. rSFW-P demonstrates superior performance compared to all other algorithms, excelling in both the number of iterations and running time, achieving nearly twice the efficiency of PFW. Additionally, both framework-based acceleration algorithms exhibit substantial performance improvements over the standalone rSFW framework.
6 Conclusion
In this paper, we introduced a novel oracle, SLMO, which leverages the advantageous geometric properties of the unit simplex. This design enables SLMO to be implemented with the same computational complexity as the standard linear optimization oracle, preserving the efficiency of the Frank-Wolfe framework. Building on this oracle, we proposed two new variants of the classical Frank-Wolfe algorithm: the Simplex Frank-Wolfe (SFW) and refined Simplex Frank-Wolfe (rSFW) algorithms. Both methods achieve linear convergence for smooth and strongly convex optimization problems over polytopes. The linear convergence rates of these methods depend only on the condition number of the objective function, the polytope's geometric quantities, and the problem's dimension, demonstrating their scalability and robustness in various settings.
The purpose of this paper is to develop the basic framework for the new SFW methods and to demonstrate that they are highly competitive. We do so in the simplest setting of a strongly convex and smooth objective. We made no attempt to weaken these assumptions except to point out that the obtained results should also hold under the quadratic growth condition. An immediate question would be to extend the methods to the convex or even nonconvex setting. Furthermore, LLOO proposed in [14] is an elegant framework, yet it was largely omitted from recent surveys on FW methods. To the best of our knowledge, SFW is the first LLOO instance that has been extensively tested and compared with other popular FW methods. An intriguing question is whether there exist alternative LLOO approaches that simultaneously satisfy the following criteria: (i) adhering to the LLOO framework, (ii) enabling fast and accurate computation, and (iii) incurring significantly lower overhead compared to projection-based methods. We leave these topics to our future research.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Grant No. 12171271) and by Hong Kong RGC General Research Fund PolyU/15303124.
Appendix
Appendix A Proof of Lemma 3.1
Proof.
(1) By the definition of the simplex ball , we have
Furthermore, we have
Consequently, the characterization (3.2) holds.
(2) On one hand, for every , there exists such that . Define for each ,
Since and , we have and , which shows that for each . Moreover, let be an index set. Then we have
As above, we have verified that . We now show that , where and are defined in (3.3). This follows from the fact that, for each ,
Thus, we have , leading to .
On the other hand, for every , there exists such that . Due to the fact that for , and
we have . We now turn to show that .
Let . It is not difficult to verify that and . Moreover, we have
implying . Thus . This finishes the proof for (3.3).
We now proceed to prove (3.4). Following the definition of , we have
Translating to and , we have
Therefore,
Using (3.3), we have
where
Here, follows from the fact . We then have
where
Thus, we have proven (3.4).
(3) Since the optimal solution lies on the extreme point of , we have
where .
(4) By (3.1), for any points , there exist such that for . Thus, we have
Appendix B Proofs for Section 4
B.1 A Useful Bound
The proof of Lemma 4.3 relies on the following lemma, whose proof uses some key technical results established in [14]. In particular, for given and , there must exist and such that
Let . We then have To put another way, the point can always be represented by
(B.1) |
for some , . Since the relevant set is compact, there must exist a representation of (B.1) with the smallest such value among all such representations. An important fact established in [14, Lemma 5.3] is that the minimal value can be bounded. We refine this bound below for the largest term in the representation.
Lemma B.1.
Let with Let be represented as in (B.1) with having been minimized. Then it holds that
Proof.
The claim is trivial for the case . Now we suppose that (i.e., at least one ). The following index sets and are defined in [14]. We simply describe them and use some established results relating to them. Denote the index set . By [14, Lemma 5.3] we have since one . Let be such that the set forms a basis for the set . Denote by consisting of the set . By definition we have . Then we obtain
where the first inequality is established in the proof of [14, Lemma 5.5], follows from the fact that for any , and any we have . Combining [14, Lemma 5.3] and [14, Lemma 5.4], we obtain that for all such that there exists such that . Hence,
Thus we conclude that . ∎
B.2 Proof of Lemma 4.3
Proof.
We begin by proving the first part. Write for and express , where and . Here, the sum is minimized (as in Lemma B.1). We then have
where the first inequality used Lemma B.1, the second inequality used the assumption , and the last equation is by the definition of in (2.6). Express as , where . We can then rewrite as follows:
Since , we have for all . Moreover, the sum , which implies that . By the definition in (3.1), we have , thus . Moreover, we have
B.3 Proof of Theorem 4.4
Proof.
The proof follows the framework of the proof of Theorem 3.4 and Lemma 4.3. We first claim that and that . We prove this by induction. First, we have
where comes from (2.1). This implies that , and by Lemma 4.3, we have . Therefore, the claim holds for .
Now suppose that and for all . Let . In the same manner as the proof in Theorem 3.4, for step size policy (2.3), (2.4) or (4.5), we all have
where is due to our inductive hypothesis and Lemma 4.3. By plugging in the value of , and using , we have that
Combining the above inequality with the fact that , and by the definition of , we conclude that . By the inductive hypothesis, we know that holds for all . Thus is a valid lower bound of , and consequently, is also a lower bound of . Now by (2.1), we have
(B.2) |
This implies that by Lemma 4.3. Therefore, we have completed the proof of the claim.
We now start to prove the conclusion in Theorem 4.4. From the earlier proof, we know that , thus confirming the first part of the inequality. By the definition of and the established claim, we have
The proof is thus completed. ∎
B.4 Proof of Theorem 4.5
The proof of Theorem 4.5 relies on the following lemma.
Lemma B.2.
For any , define the two corresponding points:
We must have
Proof.
We note that is compact. Let denote the set of its vertices. Obviously, we have
For a vertex of this set, the separation theorem [28] guarantees that there exists a vector such that
In other words, is the unique solution of the following problem:
It follows from Lemma 4.2 that it can be represented as
where and independent of . Similarly, has a representation:
Therefore, we have
where and we have used . ∎
Proof of Theorem 4.5.
We first claim that for any and prove this by induction. From the proof of Theorem 4.4, this is true for . Now suppose that for some . Note that the inner loop of Alg. 6 corresponds to the standard Frank-Wolfe algorithm. By Theorem 2.1 and Lemma B.2, we have
hold for all . In the case where the inner loop terminates at , we obtain
Similarly, if the inner loop is interrupted due to lines 9-11 of the algorithm, we still have . Using the fact that , we have , which implies via Lemma 4.3 that .
Appendix C Properties of Some Common Polytopes
C.1 Carathéodory Representation Examples
Hypercube: When is a hypercube , any point can be naturally represented as
where is a permutation over such that and is a vector with components from to equal to and the rest equal to .
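A sketch of this decomposition, assuming the hypercube in question is the unit cube [0,1]^n: sorting the coordinates in decreasing order and peeling off 0/1 vertices yields at most n+1 atoms.

```python
import numpy as np

def caratheodory_hypercube(x):
    """Decompose x in [0,1]^n as a convex combination of at most n+1 {0,1}-vertices."""
    n = x.size
    order = np.argsort(-x)                 # coordinates sorted in decreasing order
    xs = np.concatenate(([1.0], x[order], [0.0]))
    vertices, weights = [], []
    for k in range(n + 1):                 # v_k has ones on the k largest coordinates
        w = xs[k] - xs[k + 1]
        if w > 0:
            v = np.zeros(n)
            v[order[:k]] = 1.0
            vertices.append(v)
            weights.append(w)
    return vertices, np.array(weights)

# sanity check: the weighted sum of vertices recovers x
x = np.random.default_rng(1).uniform(size=6)
V, w = caratheodory_hypercube(x)
assert np.allclose(sum(wi * vi for wi, vi in zip(w, V)), x) and np.isclose(w.sum(), 1.0)
```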
-ball: When is a -ball , any point can be naturally represented as
where and if and otherwise.
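A corresponding sketch for the unit $\ell_1$-ball, whose vertices are the signed unit vectors; how the slack $1 - \|x\|_1$ is split between a pair of opposite vertices is an illustrative choice.

```python
import numpy as np

def caratheodory_l1_ball(x):
    """Decompose x with ||x||_1 <= 1 as a convex combination of vertices +/- e_i."""
    n = x.size
    vertices, weights = [], []
    for i in np.flatnonzero(x):
        v = np.zeros(n)
        v[i] = np.sign(x[i])
        vertices.append(v)
        weights.append(abs(x[i]))
    slack = 1.0 - sum(weights)
    if slack > 1e-15:                      # put the remaining mass on e_1 and -e_1 equally
        for sign in (1.0, -1.0):
            v = np.zeros(n)
            v[0] = sign
            vertices.append(v)
            weights.append(slack / 2.0)
    return vertices, np.array(weights)
```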
Flow polytope: Let be a directed acyclic graph (DAG) with a set of vertices and edges such that . Let be two vertices in , referred to as the source and target, respectively. The s-t flow polytope, here denoted by , is the set of all unit s-t flows in . For any point and , the entry represents the amount of flow through edge , where the flow vector satisfies the flow conservation constraints at each vertex, ensuring that the flow entering any vertex (except the source and target) equals the flow leaving it. The extreme points of are the extreme unit flows. To find the Carathéodory representation of a given flow , we can proceed recursively as follows.
Starting with the flow , we repeatedly perform the following steps until :
1. Remove all edges with zero flow from the graph.
2. Identify the edge carrying the smallest non-zero flow in the current flow.
3. Find the extreme unit flow in the reduced graph that includes this edge.
4. Subtract the corresponding multiple of this extreme unit flow from the current flow.
Since each operation eliminates at least one non-zero entry in the current flow, the loop will terminate within at most steps. As a result, we obtain a Carathéodory representation of . This algorithm can be implemented in time when the graph is represented using sparsely structured adjacency matrices.
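A sketch of this procedure on a DAG is given below; the tolerance and the arbitrary choice of predecessor/successor when extending the path are ours, and no attempt is made at the sparse-matrix implementation mentioned above.

```python
from collections import defaultdict

def flow_decomposition(edges, flow_values, s, t, tol=1e-12):
    """Decompose an s-t flow on a DAG into a convex combination of unit s-t path flows.
    `edges` is a list of (u, v) pairs and `flow_values[i]` is the flow on edges[i];
    the flow is assumed to satisfy conservation and to have total value 1."""
    flow = dict(zip(edges, flow_values))
    paths, weights = [], []
    while True:
        pos = {e: f for e, f in flow.items() if f > tol}    # step 1: keep only positive edges
        if not pos:
            break
        e_min = min(pos, key=pos.get)                        # step 2: smallest nonzero flow
        theta = pos[e_min]
        out, inc = defaultdict(list), defaultdict(list)
        for (u, v) in pos:
            out[u].append(v)
            inc[v].append(u)
        u, v = e_min                                         # step 3: extend e_min to an s-t path
        back = [u]
        while back[-1] != s:
            back.append(inc[back[-1]][0])
        fwd = [v]
        while fwd[-1] != t:
            fwd.append(out[fwd[-1]][0])
        path = list(reversed(back)) + fwd
        path_edges = list(zip(path[:-1], path[1:]))
        for e in path_edges:                                 # step 4: subtract theta * (unit path flow)
            flow[e] -= theta
        paths.append(path_edges)
        weights.append(theta)
    return paths, weights
```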
C.2 Quantities of Some Common Polytopes
Hypercube: The diameter of is given by
Since can be represented as
it follows from the definition in Subsection 2.3 that and . Thus, the quantity of is
-ball: The diameter of is given by
Note that can be described by the linear inequalities system , where is a matrix whose entries are either and whose rows all have unit norm. Following the definition in Subsection 2.3, we have and
where denotes the Frobenius norm. Thus, the quantity of can be estimated as
Flow Polytope: For every two extreme unit flows , since , we have . Thus, the diameter of can be estimated as
When representing using a system of linear equations and inequalities, the inequality constraints are given by . Thus, by definition, we have , leading to
It is worth noting that in the numerical experiment in Subsection 5.2, we set while the dimension is . This choice stems from the specific characteristics of the dataset used in [22, 15]. Specifically, we observe that each extreme unit flow contains exactly entries of , with the remaining entries being . Consequently, we can estimate .
Appendix D Backtracking Details
The routine for estimating the local parameters and the step size is shown as follows. For rSFW, if we use the simple step size, we can employ only the parameter estimates without applying the corresponding step size.
References
- [1] Beck, A., Teboulle, M.: A conditional gradient method with linear rate of convergence for solving convex linear systems. Mathematical Methods of Operations Research 59, 235–247 (2004)
- [2] Bernd, G., Jaggi, M.: Optimization for machine learning. Lecture Notes CS-439, ETH, Spring 2023 (2023)
- [3] Bomze, I.M., Rinaldi, F., Zeffiro, D.: Frank–wolfe and friends: a journey into projection-free first-order optimization methods. 4OR 19(3), 313–345 (2021)
- [4] Braun, G., Carderera, A., Combettes, C.W., Hassani, H., Karbasi, A., Mokhtari, A., Pokutta, S.: Conditional gradient methods. arXiv preprint arXiv:2211.14103 (2022)
- [5] Chandrasekaran, V., Recht, B., Parrilo, P.A., Willsky, A.S.: The convex geometry of linear inverse problems. Foundations of Computational mathematics 12(6), 805–849 (2012)
- [6] Clarkson, K.L.: Coresets, sparse greedy approximation, and the frank-wolfe algorithm. ACM Transactions on Algorithms (TALG) 6(4), 1–30 (2010)
- [7] Combettes, C.W., Pokutta, S.: Complexity of linear minimization and projection on some sets. Operations Research Letters 49(4), 565–571 (2021)
- [8] Condat, L.: Fast projection onto the simplex and the l1 ball. Mathematical Programming 158(1), 575–585 (2016)
- [9] Damla Ahipasaoglu, S., Sun, P., Todd, M.J.: Linear convergence of a modified frank–wolfe algorithm for computing minimum-volume enclosing ellipsoids. Optimisation Methods and Software 23(1), 5–19 (2008)
- [10] Frank, M., Wolfe, P., et al.: An algorithm for quadratic programming. Naval research logistics quarterly 3(1-2), 95–110 (1956)
- [11] Freund, R.M., Grigas, P.: New analysis and results for the frank–wolfe method. Mathematical Programming 155(1), 199–230 (2016)
- [12] Garber, D.: Faster projection-free convex optimization over the spectrahedron. Advances in Neural Information Processing Systems 29 (2016)
- [13] Garber, D., Hazan, E.: Playing non-linear games with linear oracles. In: 2013 IEEE 54th annual symposium on foundations of computer science, pp. 420–428. IEEE (2013)
- [14] Garber, D., Hazan, E.: A linearly convergent variant of the conditional gradient algorithm under strong convexity, with applications to online and stochastic optimization. SIAM Journal on Optimization 26(3), 1493–1528 (2016)
- [15] Garber, D., Wolf, N.: Frank-wolfe with a nearest extreme point oracle. In: Conference on Learning Theory, pp. 2103–2132. PMLR (2021)
- [16] Guélat, J., Marcotte, P.: Some comments on wolfe’s ‘away step’. Mathematical Programming 35(1), 110–119 (1986)
- [17] Guyon, I., Gunn, S., Nikravesh, M., Zadeh, L.A.: Feature extraction: foundations and applications, vol. 207. Springer (2008)
- [18] Hazan, E.: Sparse approximate solutions to semidefinite programs. In: Latin American symposium on theoretical informatics, pp. 306–316. Springer (2008)
- [19] Jaggi, M.: Revisiting frank-wolfe: Projection-free sparse convex optimization. In: International conference on machine learning, pp. 427–435. PMLR (2013)
- [20] Jaggi, M., Sulovský, M., et al.: A simple algorithm for nuclear norm regularized problems. In: Proceedings of the 27th international conference on machine learning (ICML-10), pp. 471–478 (2010)
- [21] Joulin, A., Tang, K., Fei-Fei, L.: Efficient image and video co-localization with frank-wolfe algorithm. In: D. Fleet, T. Pajdla, B. Schiele, T. Tuytelaars (eds.) Computer Vision – ECCV 2014, pp. 253–268. Springer International Publishing, Cham (2014)
- [22] Lacoste-Julien, S., Jaggi, M.: On the global linear convergence of frank-wolfe optimization variants. Advances in neural information processing systems 28 (2015)
- [23] Lan, G.: The complexity of large-scale convex programming under a linear optimization oracle. arXiv preprint arXiv:1309.5550 (2013)
- [24] Lan, G.: First-order and stochastic optimization methods for machine learning, vol. 1. Springer (2020)
- [25] Levitin, E.S., Polyak, B.T.: Constrained minimization methods. USSR Computational mathematics and mathematical physics 6(5), 1–50 (1966)
- [26] Pedregosa, F., Negiar, G., Askari, A., Jaggi, M.: Linearly convergent frank-wolfe with backtracking line-search. In: International conference on artificial intelligence and statistics, pp. 1–10. PMLR (2020)
- [27] Pokutta, S.: The frank-wolfe algorithm: a short introduction. Jahresbericht der Deutschen Mathematiker-Vereinigung 126(1), 3–35 (2024)
- [28] Rockafellar, R.T.: Convex analysis, vol. 11. Princeton university press (1997)
- [29] Végh, L.A.: Strongly polynomial algorithm for a class of minimum-cost flow problems with separable convex objectives. In: Proceedings of the forty-fourth annual ACM symposium on Theory of computing, pp. 27–40 (2012)